Here’s a quick script to show duplicate files on Linux. It should cope with arbitrary spaces in file names, and to save time and CPU resources, it will checksum only files of the same size.
Usage: Save the script to dups.sh or whatever, then run it with no arguments. A list of duplicated files is output.
#!/bin/bash
#
# Quick script to list duplicate files under the current directory
# v 1.2
#
echo Running find... >&2
find . -type f -size +0 -printf "%-25s %p\n" |
    sort -n |
    uniq -D -w 25 > /tmp/dupsizes.$$

echo Calculating $(wc -l < /tmp/dupsizes.$$) check sums... >&2
cat /tmp/dupsizes.$$ |
    sed 's/^\w* *\(.*\)/md5sum "\1"/' |
    sh |
    sort |
    uniq -w32 --all-repeated=separate > /tmp/dups.$$

echo Found $(grep -c . /tmp/dups.$$) duplicated files
while read md5 filename
do
    if [[ ! -z "$filename" ]]; then
        ls -l "$filename"
    else
        echo
    fi
done < /tmp/dups.$$
Any duplicate files that the script finds are printed in “ls -l” format, grouped into sets of duplicates with a blank line between sets.
bash-4.2$ ./dups.sh | more
Running find...
Calculating 386 check sums...
Found 352 duplicated files
-rw-r--r-- 1 james james 1369542 Sep 2 18:26 ./2012 09 Lakes/20120902_182629_2.jpg
-rw-rw-r-- 1 james james 1369542 Sep 4 18:13 ./Lakes Aug Sep 2012/20120902_182629_2.jpg

-rw-rw-r-- 1 james james 2894670 Sep 4 18:11 ./London August 2012 Olympic Marathon/20120812_134804_HDR.jpg
-rw-r--r-- 1 james james 2894670 Aug 12 2012 ./new1/2012 08 London Olympics/20120812_134804_HDR.jpg

-rw-r--r-- 1 james james 5386606 Sep 3 11:34 ./2012 09 Lakes/20120903_113432_HDR.jpg
-rw-rw-r-- 1 james james 5386606 Sep 4 18:14 ./Lakes Aug Sep 2012/20120903_113432_HDR.jpg
Looks like many of my photos are saved in 2 different folders.
The script works by (a) collecting a list of files that share a size with at least one other file (first paragraph), (b) checksumming those files and keeping only the repeated checksums (second paragraph), and (c) doing an “ls -l” on each remaining file, for clarity (third paragraph).
Calculating a checksum is CPU intensive. Much time is saved by not checksumming every file. Unless two files are of the same size, they cannot be duplicates of each other. Therefore the script checksums only a subset of the files under consideration.
The script could be joined into a single very big pipeline, rather than using temporary files. However that would not aid debugging and isn’t how the script was written. The indentations are there for readability, as is the leading cat on the second block.
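One weakness of the second block is worth noting: it builds md5sum commands with sed and hands them to sh, so a file name containing a double quote, a $ or a backtick will be re-interpreted by that shell. Below is a minimal sketch of an alternative that passes each path straight to md5sum instead. The function name checksum_listed_files is my own, it relies on GNU xargs (-d and -r), and it still assumes no newlines in file names.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Reads "size path" lines as produced by find's "%-25s %p\n"
# (25-column padded size, one space, then the path) and checksums each path.
checksum_listed_files() {
    cut -c27- |                    # drop the padded size column and its separator
        xargs -r -d '\n' md5sum    # one argument per line, no shell re-parsing
}
```

In the script it would stand in for the sed and sh stages, for example: checksum_listed_files < /tmp/dupsizes.$$ | sort | uniq -w32 --all-repeated=separate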
The -size +0 makes find skip empty files, which stops the script reporting every zero-size file as a duplicate of all the others.
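The sort -n | uniq -D -w 25 is a way of unique-ing a list by the first column only. The %-25s in find's -printf pads the size to 25 characters, so the first column always sits inside the first 25 characters of each line, which is exactly what uniq -w 25 compares. The -D prints every line that belongs to a repeated group, so files with a unique size are dropped.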
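To see that size filter in isolation, here is a small sketch that feeds three made-up records through the same sort and uniq stages. The sizes, paths and function names are invented for illustration; only the pipeline itself comes from the script. It needs GNU uniq for -D and -w.

```shell
#!/bin/bash
# Three invented "size path" records, padded like find's "%-25s %p\n".
sample_sizes() {
    printf '%-25s %s\n' \
        2048 './b dir/two.jpg' \
        1024 './a/one.jpg' \
        2048 './c/also two.jpg'
}

# Keep only lines whose first 25 characters (the padded size) occur more than once.
same_size_only() {
    sort -n | uniq -D -w 25
}

# Prints the two 2048-byte records; the lone 1024-byte record is dropped.
sample_sizes | same_size_only
```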
The --all-repeated=separate flag to uniq tells it to print only repeated lines, and to separate groups of repeated lines with a blank line. The -w32 limits the comparison to the 32-character MD5 checksum at the start of each line.
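The grouping behaviour is easiest to see on a tiny input. The sketch below uses placeholder 32-digit strings in place of real MD5 checksums, and invented paths and function names; the uniq options are the ones the script uses (GNU uniq).

```shell
#!/bin/bash
# Five invented "checksum  path" lines, already sorted.
# The 32-digit numbers are placeholders, not real MD5 sums.
sample_sums() {
    printf '%032d  %s\n' 1 ./a.jpg 1 ./b.jpg 2 ./c.jpg 3 ./d.jpg 3 ./e.jpg
}

# Compare only the first 32 characters, print repeated lines,
# and put a blank line between groups.
group_repeats() {
    uniq -w32 --all-repeated=separate
}

# Prints a.jpg and b.jpg, a blank line, then d.jpg and e.jpg; c.jpg is unique and dropped.
sample_sums | group_repeats
```

The blank separator lines are also why the script counts duplicates with grep -c . rather than wc -l: the single dot matches only non-empty lines.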
Multiple hard links to the same file will fool the script: it will report them as duplicate files. Such files are rare, being mostly confined to a few OS binaries, and are unlikely to crop up in user data. Nonetheless, commenter Beanux has proposed a fix. I haven’t tested it, but it can be viewed at his PrivateBin page.
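For illustration only, here is one possible way to screen out hard links before the size comparison. This is not Beanux's fix, and it is not part of the script above; it is a sketch that assumes GNU find, whose -printf offers %D (device number) and %i (inode number). The function name list_unique_inodes is made up.

```shell
#!/bin/bash
# Hypothetical replacement for the script's find stage.
# Emits the same "size path" lines, but only one path per device:inode pair,
# so several hard links to one file are listed once.
list_unique_inodes() {
    find "${1:-.}" -type f -size +0 -printf '%D:%i %-25s %p\n' |
        LC_ALL=C sort -u -k1,1 |   # keep a single line per device:inode key
        cut -d' ' -f2-             # drop the key, leaving "size path" as before
}
```

Its output could then be piped into sort -n | uniq -D -w 25 in place of the original find command. Which of the linked names survives is simply whichever find happened to list first.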
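It is possible, but rare, for two identical files to occupy slightly different amounts of space on disk. That does not affect the script, because the %s in find's -printf reports the file's length in bytes rather than its disk usage, and identical files always have the same length. Files whose lengths differ are never compared.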