A “media file truth” check

How to find duplicates, missing originals, and silent re-encodes.

Zoran Grbic

06 Feb 2026 • 5 min read

Your media library can look clean while it rots underneath. Same filenames, different guts. “Edited” copies that quietly replaced the only original. A folder full of videos that play fine until minute 43, then fall apart like cheap alibis.

A media file integrity check is how you stop guessing. You collect facts, not vibes. You prove what’s real, what’s duplicated, what’s missing, and what got re-encoded when you weren’t looking.

You don’t need a lab. You need a routine. And the nerve to trust evidence over memory.

Start with a media file truth check

Before you delete anything.

Important rule: don’t clean up while you’re still blind. Make a snapshot copy (or at least a read-only mount) so your checking doesn’t become the damage.

Start by separating three different problems that get mixed up on bad days:

Corruption: the file can’t be decoded all the way through.
Duplicates: two files that are the same, or close enough to trick you.
Silent changes: same content, different encoding, stripped metadata, new timestamps.

For video and audio integrity, ffmpeg is the gatekeeper. It doesn’t care about your feelings.

Fast decode scan: ffmpeg -v error -i "file.mp4" -f null -
Stream-focused scan: ffprobe -v error -show_entries format=duration:stream=codec_name,codec_type,bit_rate,width,height,r_frame_rate -of default=nw=1 "file.mp4"
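
To run the fast decode scan over a whole folder instead of one file, a small wrapper is enough. This is a minimal sketch, not a replacement for a dedicated tool: it assumes ffmpeg is on your PATH, and the extension list and function names (MEDIA_EXTS, scan_tree) are illustrative choices, not part of ffmpeg.

```python
import subprocess
from pathlib import Path

# Assumed extension list; adjust it to what actually lives in your library.
MEDIA_EXTS = {".mp4", ".mov", ".mkv", ".m4a", ".mp3", ".wav"}


def decode_command(path):
    """Build the same full-decode check shown above for one file."""
    return ["ffmpeg", "-v", "error", "-i", str(path), "-f", "null", "-"]


def scan_file(path, run=subprocess.run):
    """Return ffmpeg's error output for one file; an empty string means a clean decode."""
    result = run(decode_command(path), capture_output=True, text=True)
    errors = result.stderr.strip()
    if not errors and result.returncode != 0:
        errors = f"ffmpeg exited with code {result.returncode}"
    return errors


def scan_tree(root, run=subprocess.run):
    """Map every media file under root that failed to decode cleanly to its error text."""
    problems = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in MEDIA_EXTS:
            errors = scan_file(path, run=run)
            if errors:
                problems[str(path)] = errors
    return problems
```

Calling scan_tree("/path/to/media") returns a dictionary you can print or save. A full decode reads every frame, so expect this to take a while on large libraries.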

If you want a purpose-built sweep across folders (photos, videos, audio), consider the open-source CLI tool check-media-integrity. It’s a blunt instrument, which is what you want at the start.

Now the trap most people step into: timestamps. Filesystems lie all the time.

Copy tools can rewrite mtime.
Timezone fixes can shift EXIF dates without changing the pixels.
Cloud sync can “touch” files during conflict resolution.

So treat timestamps as hints, not proof. If you need metadata you can interrogate, use ExifTool:

Quick camera dates: exiftool -time:all -a -G1 -s "IMG_1234.JPG"
Batch export dates: exiftool -r -csv -CreateDate -DateTimeOriginal -ModifyDate /path/to/library > dates.csv
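
Once dates.csv exists, you can let a script point at the rows worth a second look. The sketch below assumes the CSV has the columns requested above (SourceFile, CreateDate, DateTimeOriginal, ModifyDate) and that dates use ExifTool's usual YYYY:MM:DD HH:MM:SS layout. The two-second tolerance and the function names are assumptions for illustration.

```python
import csv
import io
from datetime import datetime

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def parse_exif_date(value):
    """Parse an ExifTool-style date, ignoring sub-seconds and zone suffixes; None if unusable."""
    try:
        return datetime.strptime((value or "")[:19], EXIF_FORMAT)
    except ValueError:
        return None


def read_dates_csv(text):
    """Turn the CSV text written by `exiftool -csv` into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def flag_suspect_dates(rows, max_gap_seconds=2):
    """Yield (source_file, reason) for rows whose capture dates are missing or disagree."""
    for row in rows:
        original = parse_exif_date(row.get("DateTimeOriginal"))
        created = parse_exif_date(row.get("CreateDate"))
        if original is None:
            yield row["SourceFile"], "no DateTimeOriginal"
        elif created is not None and abs((created - original).total_seconds()) > max_gap_seconds:
            yield row["SourceFile"], "CreateDate differs from DateTimeOriginal"
```

A flagged row is still only a hint. It tells you where to look, not what happened.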

If you’re comparing analysis tools, this FFprobe vs MediaInfo comparison lays out the strengths and limits in plain terms. In practice, you’ll use both: ffprobe for scriptable truth, MediaInfo for quick human scanning and “what did this come from?” clues.

Duplicates: byte-for-byte twins, and the ones wearing a mask

Some duplicates are harmless. A second copy on another drive. A mirror for backup. Fine.

The dangerous ones are the near-duplicates. Same scene, same duration, different encoding. One of them is the original; the other is a second-generation story. Softer edges, crushed shadows, smeared noise. Still watchable. Still wrong.

Start with the cleanest win: cryptographic hashes. If the hashes match, the files are identical. No debate.

Create a SHA-256 manifest: hashdeep -r -c sha256 /path/to/media > SHA256SUMS.txt
Verify later: hashdeep -a -k SHA256SUMS.txt -r /path/to/media
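
If hashdeep isn’t installed, the same idea fits in a few lines of standard-library Python. This is a sketch of the manifest-then-audit pattern, not a reimplementation of hashdeep: the function names and the in-memory dictionary format are assumptions, and you would still need to save the manifest somewhere safe (for example as JSON).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large videos never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, POSIX-style) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest):
    """Compare the tree against a saved manifest and report what moved underneath you."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "new": sorted(set(current) - set(manifest)),
        "changed": sorted(p for p in manifest.keys() & current.keys() if manifest[p] != current[p]),
    }
```

An audit that returns three empty lists means every file is still present and byte-identical to the day you built the manifest.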

Then find true duplicates in place:

fdupes -r /path/to/media
rdfind -makesymlinks true /path/to/media (use linking only when you’re confident and backed up)
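
To see why exact-duplicate detection can be fast, here is a minimal sketch of one common approach: group files by size first, then hash only the files that share a size. It is an illustration under that assumption, not the exact algorithm fdupes or rdfind use, and it only reports groups. It never deletes or links anything.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(1 << 20):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return sorted lists of paths whose contents are byte-identical."""
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a twin, so skip the expensive hash
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_sha256(path)].append(path)
        groups.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return sorted(groups)
```

Names never enter the comparison, which is the point: two files called the same thing can differ, and two files called different things can be twins.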

If you prefer a GUI with serious teeth, Czkawka scales well without getting sloppy. Their write-up on how Czkawka detects duplicates is worth reading because it explains why “same name” is a junk signal. Available for Mac, Linux and Windows.

dupeGuru is another solid GUI when you want something calmer: dupeGuru duplicate finder. It also runs on Mac, Linux and Windows.

When you hit near-duplicates, switch methods. Hashes won’t help because the bytes differ. That’s where perceptual matching comes in, the idea of “looks the same” even if it’s not bit-identical. For photos, Czkawka’s similar-image mode can help. For mixed image-and-video sets, this project is a starting point: Duplicate-Media-Finder.
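
To make “looks the same” concrete, here is a minimal sketch of one well-known perceptual technique, the average hash. It assumes each image has already been decoded, converted to grayscale, and shrunk to the same small grid (8×8 is typical) by an imaging library. That step is left out to keep the example dependency-free. The distance threshold of 5 is an assumption, and none of this claims to be how Czkawka or Duplicate-Media-Finder work internally.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 where the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


def looks_alike(pixels_a, pixels_b, max_distance=5):
    """Treat two same-sized grayscale grids as near-duplicates if few hash bits differ."""
    return hamming_distance(average_hash(pixels_a), average_hash(pixels_b)) <= max_distance
```

A re-encode nudges pixel values but rarely flips many of them across the mean, so the hashes stay close even though every byte of the file has changed.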

Here’s a quick decision table you can live by:

Evidence you see	What it usually means	What you do next
SHA-256 matches	Safe duplicate	Keep one, archive the other, or hardlink
Hash differs, same duration and frame size	Likely re-encode or remux	Compare codec, bitrate, encoder tags
Hash differs, same date/name pattern	Likely edited metadata or moved container	Check EXIF/QuickTime tags, then stream info
Perceptual match, different resolution/bitrate	Likely export/proxy	Tag it as derivative, don’t replace original
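
If you script your audit, the table translates into a small lookup. This is a sketch: the function name, the order in which overlapping evidence is checked, and the final fallback row are assumptions added for illustration, not part of the table.

```python
def next_step(hash_match, same_duration_and_frame_size=False,
              perceptual_match=False, same_name_or_date=False):
    """Return (likely meaning, suggested action) for one pair of files."""
    if hash_match:
        return ("safe duplicate", "keep one, archive the other, or hardlink")
    if same_duration_and_frame_size:
        return ("likely re-encode or remux", "compare codec, bitrate, encoder tags")
    if perceptual_match:
        return ("likely export or proxy", "tag it as derivative, don't replace the original")
    if same_name_or_date:
        return ("likely edited metadata or moved container",
                "check EXIF/QuickTime tags, then stream info")
    # Fallback not in the table: nothing ties the two files together.
    return ("no evidence of a relationship", "treat as separate files")
```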

Missing originals and silent re-encodes

The quiet damage.

Missing originals don’t announce themselves. They sit behind your “organized” exports and pretend everything’s fine.

You notice later. Years later. When you want the RAW. When you want the full-quality video. When you want the version before the app “optimized” it.

A practical way to catch missing originals is to define what an original looks like in your world, then scan for gaps.

For photos: originals are often .CR2, .NEF, .ARW, .DNG, .RAF plus the camera JPEG.
For video: originals might be .MOV from a phone, .MP4 from an action cam, or high-bitrate camera files.

Use file lists to compare basenames. Example pattern (the -printf option needs GNU find; the find that ships with macOS lacks it):

List likely originals:

find /media -type f \( -iname "*.cr2" -o -iname "*.nef" -o -iname "*.arw" -o -iname "*.dng" -o -iname "*.raf" \) -printf "%f\n" | sed 's/\.[^.]*$//' | sort -u > originals.txt

List exports:

find /media -type f -iname "*.jpg" -printf "%f\n" | sed 's/\.[^.]*$//' | sort -u > exports.txt

Then compare with comm (available on Linux and macOS):

comm -13 originals.txt exports.txt > exports_without_raw.txt
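
If you are on Windows, or just want one script that runs everywhere, the same basename comparison can be sketched in Python. The extension sets mirror the ones named above, with .jpeg added as an assumption; the function names are illustrative.

```python
from pathlib import Path

RAW_EXTS = {".cr2", ".nef", ".arw", ".dng", ".raf"}
EXPORT_EXTS = {".jpg", ".jpeg"}  # .jpeg is an added assumption


def stems_with_extension(root, extensions):
    """Collect basenames (final extension stripped) of files with the given extensions."""
    return {
        path.stem
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    }


def exports_without_raw(root):
    """Return sorted basenames that exist as an export but have no RAW sibling anywhere under root."""
    exports = stems_with_extension(root, EXPORT_EXTS)
    originals = stems_with_extension(root, RAW_EXTS)
    return sorted(exports - originals)
```

Like the shell version, this matches on exact basenames, so an export renamed to something like IMG_1234-edit.jpg will show up as missing its RAW. Treat the list as leads to check, not a verdict.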

Now the other threat, the one that ruins archives while you sleep: the silent re-encode.

A remux is just a container swap. The streams stay the same. A re-encode changes the stream. Quality shifts, encoder changes, GOP structure changes, bitrate behavior changes. The duration can remain identical, which is why people get fooled.

Interrogate the file like it owes you money:

Container and stream facts:

ffprobe -hide_banner -show_format -show_streams -of default=nw=1 "file.mp4"

Look for encoder tags with MediaInfo (or JSON output if you script it):

mediainfo "file.mp4"

Pitfalls you should expect:

Variable bitrate makes size comparisons unreliable.
GOP changes can happen with “smart render” tools, even when the video feels unchanged.
Metadata stripping happens when apps rewrite MP4 atoms, so dates vanish or shift.
Timezone edits can make two copies look “different” to naïve tools while the content is identical.

If your “original” file has an encoder tag like Lavf or HandBrake and you don’t remember doing that, take it as a confession. Compare it to other copies. Find the cleanest lineage.
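
Comparing two copies by eye gets old fast. The sketch below pulls ffprobe’s JSON report for each file and lists the stream parameters that differ. It assumes ffprobe is on your PATH and that both files keep their streams in the same order; the key list and function names are illustrative choices. Bitrate is deliberately left out because containers report it inconsistently.

```python
import json
import subprocess

# Parameters a pure remux should leave untouched (an assumed, non-exhaustive list).
STREAM_KEYS = ("codec_type", "codec_name", "profile", "width", "height",
               "pix_fmt", "r_frame_rate", "sample_rate", "channels")


def probe(path):
    """Run ffprobe and return its JSON report as a dict."""
    command = ["ffprobe", "-v", "error", "-show_format", "-show_streams", "-of", "json", str(path)]
    return json.loads(subprocess.run(command, capture_output=True, text=True, check=True).stdout)


def encoder_tag(report):
    """Return the container-level encoder tag, if any (tag names vary in case)."""
    tags = report.get("format", {}).get("tags", {})
    return {key.lower(): value for key, value in tags.items()}.get("encoder")


def stream_differences(original, candidate):
    """List (stream_index, key, original_value, candidate_value) for every mismatch."""
    ours, theirs = original.get("streams", []), candidate.get("streams", [])
    if len(ours) != len(theirs):
        return [(None, "stream_count", len(ours), len(theirs))]
    return [
        (index, key, a.get(key), b.get(key))
        for index, (a, b) in enumerate(zip(ours, theirs))
        for key in STREAM_KEYS
        if a.get(key) != b.get(key)
    ]
```

Any difference in codec, profile, or pixel format points to a re-encode. An empty list is consistent with a remux or an identical copy, but it is not proof: a re-encode with matching settings would pass too, so follow up with hashes or a visual check.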

A repeatable audit routine

And a golden record that doesn’t lie

You don’t need heroics. You need a routine you can run when tired.

Freeze a working copy (snapshot, read-only mount, or backup clone).
Run corruption checks on video and audio (ffmpeg -v error ... -f null -).
Generate checksums for the golden set (hashdeep ... > SHA256SUMS.txt).
Find true duplicates (fdupes/rdfind/Czkawka), then decide what gets removed or linked.
Hunt near-duplicates with perceptual tools, then label derivatives (exports, proxies, social versions).
Compare expected originals vs exports (by extension and basename), then fix gaps while you still can.

Keep your “golden record” simple, boring, and consistent:

Folder structure: Media/YYYY/YYYY-MM-DD_Event/Device/
Filenames: YYYYMMDD_HHMMSS_Device_Sequence.ext (no spaces, no mystery)
Checksums: one SHA256SUMS.txt per event folder, plus an optional SHA256SUMS.txt.sig if you sign files
Sidecars: keep edits as .xmp for photos, and keep exported videos in an Exports/ subfolder so they never impersonate originals
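
A naming rule only helps if something enforces it. Here is a minimal sketch of a validator for the filename pattern above. The exact character rules are assumptions (device: letters, digits, and hyphens; sequence: digits only), so loosen or tighten the regular expression to match your own convention.

```python
import re
from datetime import datetime

# YYYYMMDD_HHMMSS_Device_Sequence.ext, with assumed character classes per field.
NAME_PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_(?P<device>[A-Za-z0-9-]+)_(?P<sequence>\d+)\.(?P<ext>[A-Za-z0-9]+)"
)


def parse_golden_name(filename):
    """Return the parsed fields of a conforming filename, or None if it breaks the pattern."""
    match = NAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    try:
        taken = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return None  # digits in the right places, but not a real date or time
    return {
        "taken": taken,
        "device": match["device"],
        "sequence": int(match["sequence"]),
        "ext": match["ext"].lower(),
    }
```

Run it over a folder listing and anything that comes back as None is a file that slipped in without following the rules.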

When you finish, you’re not just cleaning. You’re putting your name on a record. A media file integrity check is how you tell your future self, “I didn’t guess, I verified.”