Quick Script to Find Duplicate Files

Here’s a quick script to show duplicate files on Linux. It should cope with arbitrary spaces in file names, and to save time and CPU resources, it will checksum only files of the same size.

Usage: Save the script to dups.sh or whatever, then run it with no arguments. A list of duplicated files is output.
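
For example, assuming the script was saved as dups.sh in the directory you want to scan:

chmod +x dups.sh
./dups.sh | more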

#!/bin/bash
#
# Quick script to list duplicate files under the current directory
# v 1.2
#

echo Running find... >&2
find . -type f -size +0 -printf "%-25s %p\n" |
   sort -n | uniq -D -w 25  > /tmp/dupsizes.$$


echo Calculating $(wc -l < /tmp/dupsizes.$$) check sums... >&2
cat /tmp/dupsizes.$$ |
   sed 's/^\w* *\(.*\)/md5sum "\1"/' |
      sh |
         sort |
            uniq -w32 --all-repeated=separate  > /tmp/dups.$$


echo Found $(grep -c . /tmp/dups.$$) duplicated files
while read -r md5 filename
do
   if [[ ! -z "$filename" ]]; then
      ls -l "$filename"
   else
      echo
   fi
done < /tmp/dups.$$

Any duplicate files that the script finds are printed in "ls -l" format, and grouped into duplicate sets.

bash-4.2$ ./dups.sh | more
Running find...
Calculating 386 check sums...
Found 352 duplicated files
-rw-r--r-- 1 james james 1369542 Sep  2 18:26 ./2012 09 Lakes/20120902_182629_2.jpg
-rw-rw-r-- 1 james james 1369542 Sep  4 18:13 ./Lakes Aug Sep 2012/20120902_182629_2.jpg

-rw-rw-r-- 1 james james 2894670 Sep  4 18:11 ./London August 2012 Olympic Marathon/20120812_134804_HDR.jpg
-rw-r--r-- 1 james james 2894670 Aug 12  2012 ./new1/2012 08 London Olympics/20120812_134804_HDR.jpg

-rw-r--r-- 1 james james 5386606 Sep  3 11:34 ./2012 09 Lakes/20120903_113432_HDR.jpg
-rw-rw-r-- 1 james james 5386606 Sep  4 18:14 ./Lakes Aug Sep 2012/20120903_113432_HDR.jpg

Looks like many of my photos are saved in 2 different folders.

Explanation

The script works by (a) collecting a list of files that have the same sizes (the first block of the script), (b) checksumming those files (the second block) and (c) running "ls -l" on each one, for clarity (the third block).

Calculating a checksum is CPU intensive, so much time is saved by not checksumming every file. Two files cannot be duplicates of each other unless they are the same size, so the script checksums only the files whose size matches at least one other file.
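
As a rough illustration of the difference in cost (the file name here is only a placeholder): checksumming reads every byte of the file, whereas finding its size is a single inode lookup.

time md5sum ./bigfile.mp4       # reads the whole file - takes seconds for a large file
time stat -c %s ./bigfile.mp4   # reads only the inode - effectively instant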

The script could be joined into a single very big pipeline, rather than using temporary files. However, that would not aid debugging and isn't how the script was written. The indentation is there for readability, as is the leading cat on the second block, and just because I like it, okay?
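
For the curious, the single-pipeline version would look roughly like this (an untested sketch: it drops the progress messages, the temporary files and the final ls -l pass):

find . -type f -size +0 -printf "%-25s %p\n" |
   sort -n | uniq -D -w 25 |
      sed 's/^\w* *\(.*\)/md5sum "\1"/' |
         sh |
            sort |
               uniq -w32 --all-repeated=separate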

Footnotes

The -size +0 stops the script from reporting that all the zero-length files are duplicates of each other.
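
If you are curious what would otherwise be swept in, this lists the zero-length files that -size +0 filters out:

find . -type f -empty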

The sort -n | uniq -D -w 25 keeps only the lines whose first column is repeated; in effect it compares the list by the first column only. The input to sort is arranged so that the first column (the file size) always sits inside the first 25 characters.
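
A small demonstration of that idea, with made-up sizes and names:

$ printf '%-25s %s\n' 1369542 a.jpg 2894670 b.jpg 1369542 c.jpg | sort -n | uniq -D -w 25
1369542                   a.jpg
1369542                   c.jpg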

The --all-repeated=separate flag to uniq tells it to print only repeated lines, and to separate groups of repeated lines with a blank line.
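
For example, on an already-sorted stream:

$ printf '%s\n' aaa aaa bbb ccc ccc | uniq --all-repeated=separate
aaa
aaa

ccc
ccc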

Limitations

Multiple hard links to the same file will fool the script. It will tell you they are duplicate files.
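
One way to check a suspect pair by hand is to compare inode numbers (the paths below are taken from the sample output above); if the first column matches, the two names point at the same underlying file:

ls -li "./2012 09 Lakes/20120902_182629_2.jpg" "./Lakes Aug Sep 2012/20120902_182629_2.jpg"

If you want to ignore hard links entirely, the find could be restricted to files with a single link by adding -links 1.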

It is possible, but rare, for two files to be identical but have slightly different sizes on disk. The script won't find duplicates of different sizes.

