Quick Script to Find Duplicate Files

Here’s a quick script to list duplicate files on Linux. It should cope with arbitrary spaces in file names, and to save time and CPU it checksums only those files that share their size with at least one other file.

Usage: Save the script to dups.sh or whatever, then run it with no arguments. A list of duplicated files is output.

#!/bin/bash

#
# Quick script to list duplicate files under the current directory
# v 1.2
#

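# Stage 1: list every non-empty file with its size, and keep only
# the lines whose size column (first 25 characters) occurs more than once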
echo Running find... >&2
find . -type f -size +0 -printf "%-25s %p\n" |
   sort -n |
      uniq -D -w 25  > /tmp/dupsizes.$$


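# Stage 2: md5sum each candidate file, and keep only the lines whose
# checksum (first 32 characters) occurs more than once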
echo Calculating $(wc -l < /tmp/dupsizes.$$) check sums... >&2
cat /tmp/dupsizes.$$ |
   sed 's/^\w* *\(.*\)/md5sum "\1"/' | 
      sh | 
         sort | 
            uniq -w32 --all-repeated=separate  > /tmp/dups.$$


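# Stage 3: report the count, then show each duplicate with ls -l,
# separating the duplicate sets with blank lines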
echo Found $(grep -c . /tmp/dups.$$) duplicated files
while read md5 filename
do
   if [[ ! -z "$filename" ]]; then
      ls -l "$filename"
   else
      echo
   fi
done < /tmp/dups.$$
   

Any duplicate files that the script finds are printed in “ls -l” format, grouped into duplicate sets.

bash-4.2$ ./dups.sh | more
Running find...
Calculating 386 check sums...
Found 352 duplicated files
-rw-r--r-- 1 james james 1369542 Sep  2 18:26 ./2012 09 Lakes/20120902_182629_2.jpg
-rw-rw-r-- 1 james james 1369542 Sep  4 18:13 ./Lakes Aug Sep 2012/20120902_182629_2.jpg

-rw-rw-r-- 1 james james 2894670 Sep  4 18:11 ./London August 2012 Olympic Marathon/20120812_134804_HDR.jpg
-rw-r--r-- 1 james james 2894670 Aug 12  2012 ./new1/2012 08 London Olympics/20120812_134804_HDR.jpg

-rw-r--r-- 1 james james 5386606 Sep  3 11:34 ./2012 09 Lakes/20120903_113432_HDR.jpg
-rw-rw-r-- 1 james james 5386606 Sep  4 18:14 ./Lakes Aug Sep 2012/20120903_113432_HDR.jpg

Looks like many of my photos are saved in 2 different folders.

Explanation

The script works by (a) collecting a list of files that have the same sizes (first block), (b) checksumming those (second block) and (c) doing an “ls -l” on each file, for clarity (third block).
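
For instance, taking the first pair of photos from the sample run above, the intermediate file /tmp/dupsizes.$$ would contain lines something like this (the size padded to 25 characters, then the path):

1369542                   ./2012 09 Lakes/20120902_182629_2.jpg
1369542                   ./Lakes Aug Sep 2012/20120902_182629_2.jpg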

Calculating a checksum is CPU intensive. Much time is saved by not checksumming every file. Unless two files are of the same size, they cannot be duplicates of each other. Therefore the script checksums only a subset of the files under consideration.

The script could be joined into a single very big pipeline, rather than using temporary files. However, that would not aid debugging, and it isn’t how the script was written. The indentations are there for readability, as is the leading cat on the second block.

Footnotes

The -size +0 stops the script from telling you that zero-size files are duplicates of each other.

The sort -n | uniq -D -w 25 is a way of unique-ing a list by the first column only. The input to sort is arranged so that the first column always falls within the first 25 characters.
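
A quick illustration with made-up sizes and names: uniq -D -w 25 prints every line whose first 25 characters (here, the padded size column) occur more than once.

$ printf "%-25s %s\n" 1000 a.jpg 2000 b.jpg 1000 c.jpg | sort -n | uniq -D -w 25
1000                      a.jpg
1000                      c.jpg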

The --all-repeated=separate flag to uniq tells it to print only repeated lines, and to separate groups of repeated lines with a blank line.
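
Again a quick illustration on made-up input: only the repeated lines appear, with a blank line between each group of repeats.

$ printf "%s\n" aaa bbb bbb ccc ccc | uniq --all-repeated=separate
bbb
bbb

ccc
ccc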

Limitations

Multiple hard links to the same file will fool the script: it will tell you they are duplicate files. Such files are rare, being mostly confined to a few OS binaries, and unlikely to crop up in user data. Nonetheless, commenter Beanux has proposed a fix. I haven’t tested it, but it can be viewed at his PrivateBin page.
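
An untested sketch of that sort of fix, along the same lines as Beanux’s suggestion: have find print the inode number as well, keep just one name per inode, then drop the inode column and carry on exactly as before.

find . -type f -size +0 -printf "%-25i %-25s %p\n" |
   sort -n |
      uniq -w 25 |                   # keep one line per inode, dropping extra hard links
         sed 's/^\w* *\(.*\)/\1/' |  # remove the inode column again
            sort -n |
               uniq -D -w 25  > /tmp/dupsizes.$$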

It is possible, but rare, for two files to be identical but have slightly different sizes on disk. The script won’t find duplicates of different sizes.

9 thoughts on “Quick Script to Find Duplicate Files”

    • I’m glad you liked the post Amit. However it seems you have copied another post from this site onto your own blog and passed it off as your own work. Please don’t do this. If you want your readers to see my article, provide a link instead.

    • Good Stuff Marcel. If the script was searching the whole NAS, it must have taken quite a long time to run. Photos seem to be the files most often duplicated.

  1. Hello, great script, a big thanks for it 🙂

    Just a note: some backslashes are not shown in the script: sed 's/^\w* *\(.*\)/md5sum "\1"/'
    And I upgraded it a bit, keeping your way of doing things. Now it ignores hard links and takes only one name per set of hard links.

    find . -type f -print0 |
    xargs -0 stat -c '%-25i %-25s %n' |
    sort -n | uniq -w 25 | uniq -D -w 50 > /tmp/dupsizes.$$

    cat /tmp/dupsizes.$$ |
    sed 's/^\w* *\w* *\(.*\)/md5sum "\1"/' |
    sh |
    sort |
    uniq -w32 --all-repeated=separate > /tmp/dups.$$

    With that I was able to erase the duplicates and replace them with hard links (for music and video I needed that kind of thing), while still ignoring the hard links created.

    • My bad, it’s a partial (and wrong) script I just posted:

      #!/bin/bash
      #
      # Quick script to list duplicate files under the current directory
      # v 1.2
      #

      checkpath="${1:-.}"

      echo Running find... >&2
      find "$checkpath" -type f -size +0 -print0 |
      xargs -0 stat -c '%-25i %-25s %n' |
      sort -n | uniq -w 25 | sed 's/^\w* *\(.*\)/\1/' | sort -n | uniq -D -w 25 > /tmp/dupsizes.$$

      echo Calculating $(wc -l < /tmp/dupsizes.$$) check sums... >&2
      cat /tmp/dupsizes.$$ |
      sed 's/^\w* *\(.*\)/md5sum "\1"/' |
      sh |
      sort |
      uniq -w32 --all-repeated=separate > /tmp/dups.$$

      echo Found $(grep -c . /tmp/dups.$$) duplicated files
      while read md5 filename
      do
         if [[ ! -z "$filename" ]]; then
            ls -l "$filename"
         else
            echo
         fi
      done < /tmp/dups.$$

      And to describe how it checks hard links: it's the same process, except that find just lists the names and then stat prints the inode and size.
      Then the first uniq compares only the inodes, to filter out hard links.
      We remove the inode column with sed, sort the lines again (by size this time), and then it's the same process as the original script.

      • Hi Beanux. It seems that this post had become badly corrupted at some point in the 7 years since it was published. The code section was missing many escape characters and didn’t work at all. I have therefore re-written the article and posted the original script again, and tested it working.

        Probably the errors happened when WordPress switched to the “Gutenberg” editor a couple of years ago. Certainly it was in working order in 2016, as reflected by Marcel’s comment above.

        I don’t quite understand your comments, but it looks like you are trying to remove the hard link limitation by filtering out hard links. Perhaps you could post your modified script online somewhere (e.g. GitHub) and I can then link to it from the article.

        Cheers,
        Jim.
