At one point, we talked about checking whether the lines inside of a file were unique and whether we could sort them, but we haven't yet performed a similar operation on whole files. Before diving in, let's make an assumption about what constitutes a duplicate file for the purpose of this recipe: a duplicate file is one that may have a different name, but has the same contents as another.
One way to investigate the contents of a file would be to remove all white space and compare purely the strings contained within; alternatively, we could use a hashing tool such as md5sum or sha512sum to generate a unique hash (think of a unique string full of gibberish) of the contents of each file. The flow would be as follows:
1. Using this hash, we can compare it against a list of hashes already computed.
2. If the hash matches, we have seen the contents of this file before, so we can delete it.
3. If the hash is new, we can record the entry and move on to calculating the hash of the next file, until all files have been processed.
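The steps above could be sketched in Bash roughly as follows; the function and variable names here are illustrative assumptions, not the recipe's actual script:

```shell
#!/bin/bash
# Sketch of the hash-and-compare flow: hash each file, scan the array of
# known hashes, delete duplicates, and record new hashes.
dedupe_dir() {
    local dir="$1"
    local -a seen_hashes=()      # dynamic array of hashes observed so far
    local f file_hash h duplicate
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # sha512sum prints "<hash>  <filename>"; keep only the hash field
        file_hash=$(sha512sum "$f" | cut -d' ' -f1)
        duplicate=false
        for h in "${seen_hashes[@]}"; do
            [ "$h" = "$file_hash" ] && { duplicate=true; break; }
        done
        if [ "$duplicate" = true ]; then
            rm -- "$f"                       # contents seen before: delete
        else
            seen_hashes+=("$file_hash")      # new contents: remember the hash
        fi
    done
}
```

Calling `dedupe_dir some_directory` then walks that directory once and leaves only the first file seen for each distinct set of contents.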
Using a hash does not require you to know how the mathematics works, but rather to be aware of how it is supposed to work, provided the implementation is secure and has enough possible outputs to make finding a duplicate computationally infeasible. Hashes are supposed to be one-way, which makes them different from encryption/decryption: once a hash has been created, it should be impossible to determine the original input from the hash itself. MD5 sums are considered completely insecure (although still useful where security matters less), and SHA-1/SHA-2 are considered to be potentially on their way out of popularity with the use of sponge constructions in SHA-3 (use SHA-3 where possible). For more information, please see the NIST guidelines.
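As a quick illustration, two inputs that differ by a single character produce completely unrelated digests (the avalanche effect); here is a small demonstration with sha512sum:

```shell
# Hash two nearly identical strings; the digests share no obvious relationship.
h1=$(printf 'hello' | sha512sum | cut -d' ' -f1)
h2=$(printf 'hellp' | sha512sum | cut -d' ' -f1)
echo "$h1"
echo "$h2"
# A SHA-512 digest is always 512 bits, i.e. 128 hex characters, regardless of
# the input's length.
```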
Then, before jumping into scripting, a core concept needs to be discussed regarding arrays and whether they are static or dynamic; knowing how an array implementation works at its core is a key principle if performance is an objective.
Arrays can be really helpful, but the performance of a Bash script is often sub-par compared to that of a compiled program, or even to a language with more appropriate data structures. In Bash, arrays are dynamic and implemented as linked lists, which means that resizing an array does not incur a massive performance penalty.
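As a minimal demonstration of Bash's dynamic arrays (the element values here are just placeholders):

```shell
# A Bash array needs no declared size; it grows as elements are appended.
declare -a seen_hashes=()           # start empty
seen_hashes+=("hash-of-file1")      # append; the array resizes itself
seen_hashes+=("hash-of-file2")
echo "${#seen_hashes[@]}"           # number of elements stored so far
for h in "${seen_hashes[@]}"; do    # iterate over every element
    echo "$h"
done
```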
For our purposes, we are going to create a dynamic array, and once the array becomes quite large, searching it will become the performance bottleneck. This naive iterative approach usually works well up to an arbitrary size (let's say, N), at which point the benefits of using another mechanism may outweigh the simplicity of the current approach. For those who want to know more about data structures and their performance, check out Big O notation and complexity theory.
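One such alternative mechanism, assuming Bash 4+ is available, is an associative array, which turns the linear scan into a constant-time key lookup; a hypothetical sketch (the `is_new` helper is an illustration, not part of the recipe):

```shell
# An associative array (Bash 4+) keyed by hash: membership is a single lookup
# instead of a loop over every stored element.
declare -A seen=()
is_new() {
    local file_hash="$1"
    if [[ -n "${seen[$file_hash]+x}" ]]; then
        return 1                 # key already present: duplicate
    fi
    seen["$file_hash"]=1         # record the hash as a key
    return 0                     # first time we have seen this hash
}
```

A caller would use it as `is_new "$file_hash" || rm -- "$f"`.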
How to do it…
The following is a code snippet of the script:
Review the results and verify the contents of the files_galore directory.
To get started (specifically by using the dsetmkr.sh script), we will produce a directory called files_galore that contains several files: four are unique and three contain duplicate content:
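dsetmkr.sh is the book's helper script; a hypothetical stand-in that produces an equivalent files_galore layout might look like this:

```shell
#!/bin/bash
# Stand-in data-set maker: seven files, four with unique contents and three
# duplicating earlier contents (file names differ, contents do not).
mkdir -p files_galore
echo "alpha"   > files_galore/file1
echo "bravo"   > files_galore/file2
echo "charlie" > files_galore/file3
echo "delta"   > files_galore/file4
echo "alpha"   > files_galore/file5   # duplicate of file1
echo "bravo"   > files_galore/file6   # duplicate of file2
echo "alpha"   > files_galore/file7   # duplicate of file1
```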
The study of cryptography, security, and mathematics covers some very interesting and broad information domains! Hashes have a multitude of other uses, such as integrity checking of files, lookup values to find data quickly, unique identifiers, and much more.
When you run
file-deduplicator.sh, it begins by asking the user for input using
read, and it prints out four different values with seemingly random strings of characters. If they look like hashes to you, that is absolutely correct: they are SHA512 hash sums! Each string is the hash sum of the contents of a file. Even if the contents are only slightly different (for example, one bit has been flipped to a 1 instead of a 0), a totally different hash sum will be produced. Again, this Bash script leverages the concept of arrays (using a global array variable, meaning it is accessible everywhere in the script) and hash sums, using the sha512sum tool combined with a field-extraction tool such as cut to retrieve the correct values. This script is not recursive, though, and only looks at the files inside of files_galore to generate a list of files, compute a hash for each one, and search an array containing all hashes. If a hash is unknown, then it is a new file and is inserted into the array for storage. Otherwise, if a hash is seen twice, the file is deleted because it contains duplicate content (even if the file name is different). There is another aspect here, and that is the use of return values as strings. As you may remember, return can only return numeric values:
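Since return only carries a numeric exit status (0-255), a Bash function typically "returns" a string by writing it to stdout and letting the caller capture it with command substitution; a minimal sketch, with a hypothetical hash_of helper:

```shell
# 'return' can only carry a small integer, so the string result (the hash)
# is echoed to stdout and captured by the caller instead.
hash_of() {
    local file="$1"
    sha512sum "$file" | cut -d' ' -f1   # stdout acts as the "return value"
}

# Caller captures the function's output with command substitution:
printf 'sample contents' > sample.txt
file_hash=$(hash_of sample.txt)
echo "$file_hash"
```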
After executing the operation, we can see that the
files_galore directory only contains four files out of the original seven. The duplicate data is now removed!