Write a bash/shell script to find and delete duplicate files or directories

At one point, we talked about checking whether strings inside a file were unique and whether we could sort them, but we haven't yet performed a similar operation on whole files. Before diving in, let's make an assumption about what constitutes a duplicate file for the purpose of this recipe: a duplicate file is one that may have a different name, but the same contents as another.

One way to investigate the contents of a file would be to remove all white space and compare purely the strings contained within, or we could simply use tools such as sha512sum and md5sum to generate a unique hash (think of a unique string full of gibberish) from the contents of each file. The general flow would be as follows:

Using this hash, we can compare the hash against a list of hashes already computed.

If the hash matches, we have seen the contents of this file before, so we can delete the file.

If the hash is new, we can record the entry and move on to calculating the hash of the next file, until all files have been hashed.
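As a quick illustration of the core operation, a single file's hash can be computed and isolated on the command line; sha512sum prints the hash followed by the file name, and awk keeps only the first field (the file path here is just a placeholder):

$ HASH=$(sha512sum /path/to/somefile | awk '{print $1}')
$ echo "$HASH"

This sha512sum-plus-awk combination is exactly what the finished script in this recipe uses.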

Note:

Using a hash does not require you to understand the underlying mathematics, only the guarantees it is supposed to provide when the implementation is secure and the output space is large enough to make finding a collision (two inputs producing the same hash) computationally infeasible. Hashes are meant to be one-way, which makes them different from encryption/decryption: once a hash has been created, it should be impossible to recover the original input from the hash itself. MD5 is considered completely insecure (although useful where security matters less), and SHA-1/SHA-2 may be on their way out of popularity with the use of the sponge construction in SHA-3 (use SHA-3 where possible). For more information, please see the NIST guidelines.
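To observe this behaviour for yourself, hash two inputs that differ by a single character; the two 128-character sums that are printed will bear no resemblance to one another (outputs omitted here for brevity):

$ echo "duplicate" | sha512sum
$ echo "duplicatf" | sha512sum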

Getting ready

Open a terminal and create a data set consisting of several files with the dsetmkr.sh script:

dsetmkr.sh

#!/bin/bash
# Build a small test data set: seven files holding four distinct contents.
BDIR="files_galore"
rm -rf "${BDIR}"
mkdir -p "${BDIR}"
echo "1111111111111111111111111111111" > "$BDIR"/file1
echo "2222222222222222222222222222222" > "$BDIR"/file2
echo "3333333333333333333333333333333" > "$BDIR"/file3
echo "4444444444444444444444444444444" > "$BDIR"/file4
echo "4444444444444444444444444444444" > "$BDIR"/file5    # same contents as file4
echo "4444444444444444444444444444444" > "$BDIR"/sameas5  # same contents as file4/file5
echo "1111111111111111111111111111111" > "$BDIR"/sameas1  # same contents as file1

Before jumping into scripting, a core concept needs to be discussed: arrays, and whether they are static or dynamic. Knowing how an array implementation works at its core is a key principle whenever performance is an objective.

Arrays can be really helpful, but the performance of a Bash script is often sub-par compared to a compiled program, or even to a language with more appropriate data structures. In Bash, arrays are dynamic and implemented internally as linked lists, which means that growing an array does not carry a massive performance penalty.

For our purposes, we are going to create a dynamic array, and once the array becomes quite large, searching it becomes the performance bottleneck: every lookup walks the array linearly, so deduplicating N files costs on the order of N² comparisons. This naive iterative approach usually works well up to some size, beyond which the benefits of another mechanism outweigh the simplicity of the current approach; one such mechanism is sketched below. For those who want to know more about data structures and their performance, check out Big O notation and complexity theory.
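If the linear search ever does become the bottleneck, Bash 4 and newer provide associative arrays, which give roughly constant-time lookups by key. The following is only a sketch of that alternative, not part of this recipe's script (SEEN_HASHES and record_or_delete are illustrative names):

#!/bin/bash
# Requires Bash 4+: declare -A creates an associative array (a hash table).
declare -A SEEN_HASHES=()

function record_or_delete() {
  local file="$1" hash
  hash=$(sha512sum "$file" | awk '{print $1}')
  if [[ -n "${SEEN_HASHES[$hash]:-}" ]]; then
    rm "$file"                    # contents already seen: delete the duplicate
  else
    SEEN_HASHES[$hash]="$file"    # first sighting: remember hash -> file
  fi
}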

How to do it…

Open a terminal, and create the file-deduplicator.sh script.

The following is a code snippet of the script:

file-deduplicator.sh

#!/bin/bash
# Global, dynamically sized array holding every hash seen so far.
declare -a FILE_ARRAY=()
function add_file() {
  # Append a newly seen hash to the end of the array.
  local NUM_OF_ELEMENTS=${#FILE_ARRAY[@]}
  FILE_ARRAY[$NUM_OF_ELEMENTS]=$1
}
function del_file() {
  rm "$1" 2>/dev/null  # suppress the error message if removal fails
}
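The snippet above covers only the helper functions. The recipe does not reproduce the body of the script, so the following is a sketch of what the remainder plausibly looks like, reconstructed from the flow described earlier and the output shown below (hash_file and seen_hash are illustrative names, not necessarily the author's):

function hash_file() {
  # return can only carry a numeric status, so the hash is echoed
  # and the caller captures it with command substitution.
  sha512sum "$1" | awk '{print $1}'
}

function seen_hash() {
  # Linear scan of the global array; status 0 means "already seen".
  local known
  for known in "${FILE_ARRAY[@]}"; do
    [[ "$known" == "$1" ]] && return 0
  done
  return 1
}

echo "Enter directory name to begin searching and deduplicating:"
echo "Press [ENTER] when ready"
read DIR

COUNT=0
for FILE in "$DIR"/*; do
  [[ -f "$FILE" ]] || continue          # not recursive: skip directories
  HASH=$(hash_file "$FILE")
  if seen_hash "$HASH"; then
    del_file "$FILE"                    # duplicate contents: delete the file
  else
    COUNT=$((COUNT + 1))
    echo "#${COUNT} ${HASH}"            # report each newly seen hash
    add_file "$HASH"                    # record it for future comparisons
  fi
done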

If you have not already created the data set, run $ bash dsetmkr.sh, and then run $ bash file-deduplicator.sh. Enter files_galore/ at the prompt and press Enter:

$ bash dsetmkr.sh 
$ bash file-deduplicator.sh
Enter directory name to begin searching and deduplicating:
Press [ENTER] when ready
files_galore/
#1 f559f33eee087ea5ac75b2639332e97512f305fc646cf422675927d4147500d4c4aa573bd3585bb866799d08c373c0427ece87b60a5c42dbee9c011640e04d75
#2 f7559990a03f2479bf49c85cb215daf60417cb59875b875a8a517c069716eb9417dfdb907e50c0fd5bd47127105b7df9e68a0c45a907dc5254ce6bc64d7ec82a
#3 2811ce292f38147613a84fdb406ef921929f864a627f78ef0ef16271d4996ed598d0f5c5f410f7ae75f9902ff0f63126b567e5f24882db3686be81f2a79f1bb3
#4 89f5df2b9f4908adca6a36f92b344d4a8ff96d04184e99d8dd31a86e96d45a1aa16a8b574d5815f17d649d521c9472670441a56f54dc1c2640e20567581d9b4e

Review the results and verify the contents of files_galore.

$ ls files_galore/
file1  file2  file3  file4

How it works…

Before getting started, a word of caution: the file-deduplicator.sh script deletes duplicate files in the directory it is pointed at.

Getting started with the dsetmkr.sh script, we produce a directory called files_galore that contains seven files: four with distinct contents and three that duplicate the contents of others:

$ bash dsetmkr.sh

Note:

Cryptography, security, and mathematics are all very interesting and broad domains! Hashes have a multitude of other uses, such as integrity checking of files, lookup values to find data quickly, unique identifiers, and much more.

When you run file-deduplicator.sh, it begins by asking the user for input using read, and then it prints four values made of seemingly random strings of characters. Random-looking is absolutely correct: they are SHA-512 hash sums. Each string is the hash of one file's contents, and if the contents differ even slightly (for example, a single bit flipped from 0 to 1), a totally different hash sum is produced.

The script combines two concepts: arrays (here a global array variable, meaning it is accessible everywhere in the script) and hash sums generated with the sha512sum tool, with awk used to retrieve the correct field. The script is not recursive; it only looks at the files directly inside files_galore, generating a list of files, computing a hash for each one, and searching an array containing all known hashes. If a hash is unknown, the file is new, and the hash is inserted into the array for storage. If a hash is seen a second time, the file is deleted, because it contains duplicate content (even if the file name is different).

There is one more aspect worth noting: the use of return values as strings. As you may remember, return can only pass back numeric exit codes; a sketch of the usual workaround follows the output below:

$ bash file-deduplicator.sh 
Enter directory name to begin searching and deduplicating:
Press [ENTER] when ready
files_galore/
#1 f559f33eee087ea5ac75b2639332e97512f305fc646cf422675927d4147500d4c4aa573bd3585bb866799d08c373c0427ece87b60a5c42dbee9c011640e04d75
#2 f7559990a03f2479bf49c85cb215daf60417cb59875b875a8a517c069716eb9417dfdb907e50c0fd5bd47127105b7df9e68a0c45a907dc5254ce6bc64d7ec82a
#3 2811ce292f38147613a84fdb406ef921929f864a627f78ef0ef16271d4996ed598d0f5c5f410f7ae75f9902ff0f63126b567e5f24882db3686be81f2a79f1bb3
#4 89f5df2b9f4908adca6a36f92b344d4a8ff96d04184e99d8dd31a86e96d45a1aa16a8b574d5815f17d649d521c9472670441a56f54dc1c2640e20567581d9b4e
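Since return is limited to numeric exit statuses (0 through 255), the standard Bash idiom for handing a string back from a function, and presumably what this script does with each hash, is to print the value and capture it with command substitution. A minimal, generic sketch (make_value is an illustrative name):

function make_value() {
  echo "some string result"  # written to stdout, not "returned"
  return 0                   # return itself can only carry a numeric status
}

RESULT=$(make_value)         # command substitution captures the printed string
echo "$RESULT"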

After the operation completes, we can see that the files_galore directory contains only four of the original seven files. The duplicate data is now removed!

$ ls files_galore/
file1  file2  file3  file4
