At one point, we talked about checking whether the lines inside of a file were unique and whether we could sort them, but we haven't yet performed a similar operation on whole files. Before diving in, let's make an assumption about what constitutes a duplicate file for the purpose of this recipe: a duplicate file is one that may have a different name, but has the same contents as another.
One way to investigate the contents of a file would be to remove all white space and compare purely the strings contained within; alternatively, we could use a hashing tool such as md5sum or sha512sum to generate a unique hash (think of a unique string full of gibberish) of the contents of each file. The flow would be as follows:
1. Using this hash, we can compare it against a list of hashes already computed.
2. If the hash matches, we have seen the contents of this file before, so we can delete it.
3. If the hash is new, we can record the entry and move on to calculating the hash of the next file, until all files have been processed.
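The steps above could be sketched in Bash roughly as follows; the function and variable names here are illustrative assumptions, not the recipe's actual script:

```shell
#!/bin/bash
# Sketch of the hash-and-compare flow: hash each file, scan the array of
# known hashes, delete duplicates, and record new hashes.
dedupe_dir() {
    local dir="$1"
    local -a seen_hashes=()      # dynamic array of hashes observed so far
    local f file_hash h duplicate
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # sha512sum prints "<hash>  <filename>"; keep only the hash field
        file_hash=$(sha512sum "$f" | cut -d' ' -f1)
        duplicate=false
        for h in "${seen_hashes[@]}"; do
            [ "$h" = "$file_hash" ] && { duplicate=true; break; }
        done
        if [ "$duplicate" = true ]; then
            rm -- "$f"                       # contents seen before: delete
        else
            seen_hashes+=("$file_hash")      # new contents: remember the hash
        fi
    done
}
```

Calling `dedupe_dir some_directory` then walks that directory once and leaves only the first file seen for each distinct set of contents.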
Using a hash does not require you to know how the mathematics works, but rather to be aware of how it is supposed to work, provided the implementation is secure and has enough possible outputs to make finding a duplicate computationally infeasible. Hashes are supposed to be one-way, which makes them different from encryption/decryption: once a hash has been created, it should be impossible to determine the original input from the hash itself. MD5 sums are considered completely insecure (although still useful where security matters less), and SHA-1/SHA-2 are considered to be potentially on their way out of popularity with the use of sponge constructions in SHA-3 (use SHA-3 where possible). For more information, please see the NIST guidelines.
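As a quick illustration, two inputs that differ by a single character produce completely unrelated digests (the avalanche effect); here is a small demonstration with sha512sum:

```shell
# Hash two nearly identical strings; the digests share no obvious relationship.
h1=$(printf 'hello' | sha512sum | cut -d' ' -f1)
h2=$(printf 'hellp' | sha512sum | cut -d' ' -f1)
echo "$h1"
echo "$h2"
# A SHA-512 digest is always 512 bits, i.e. 128 hex characters, regardless of
# the input's length.
```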
Then, before jumping into scripting, a core concept needs to be discussed regarding arrays and whether they are static or dynamic; knowing how an array implementation works at its core is a key principle if performance is an objective.
Arrays can be really helpful, but the performance of a Bash script is often sub-par compared to that of a compiled program, or even to a language with more appropriate data structures. In Bash, arrays are dynamic and implemented as linked lists, which means that resizing an array does not incur a massive performance penalty.
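As a minimal demonstration of Bash's dynamic arrays (the element values here are just placeholders):

```shell
# A Bash array needs no declared size; it grows as elements are appended.
declare -a seen_hashes=()           # start empty
seen_hashes+=("hash-of-file1")      # append; the array resizes itself
seen_hashes+=("hash-of-file2")
echo "${#seen_hashes[@]}"           # number of elements stored so far
for h in "${seen_hashes[@]}"; do    # iterate over every element
    echo "$h"
done
```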
For our purposes, we are going to create a dynamic array, and once the array becomes quite large, searching it will become the performance bottleneck. This naive iterative approach usually works well up to an arbitrary size (let's say, N), at which point the benefits of using another mechanism may outweigh the simplicity of the current approach. For those who want to know more about data structures and their performance, check out Big O notation and complexity theory.
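One such alternative mechanism, assuming Bash 4+ is available, is an associative array, which turns the linear scan into a constant-time key lookup; a hypothetical sketch (the `is_new` helper is an illustration, not part of the recipe):

```shell
# An associative array (Bash 4+) keyed by hash: membership is a single lookup
# instead of a loop over every stored element.
declare -A seen=()
is_new() {
    local file_hash="$1"
    if [[ -n "${seen[$file_hash]+x}" ]]; then
        return 1                 # key already present: duplicate
    fi
    seen["$file_hash"]=1         # record the hash as a key
    return 0                     # first time we have seen this hash
}
```

A caller would use it as `is_new "$file_hash" || rm -- "$f"`.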
How to do it…
The following is a code snippet of the script:
Review the results and verify the contents of the files_galore directory.
To get started (specifically by using the dsetmkr.sh script), we will produce a directory called files_galore that contains several files: four are unique and three contain duplicate content:
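dsetmkr.sh is the book's helper script; a hypothetical stand-in that produces an equivalent files_galore layout might look like this:

```shell
#!/bin/bash
# Stand-in data-set maker: seven files, four with unique contents and three
# duplicating earlier contents (file names differ, contents do not).
mkdir -p files_galore
echo "alpha"   > files_galore/file1
echo "bravo"   > files_galore/file2
echo "charlie" > files_galore/file3
echo "delta"   > files_galore/file4
echo "alpha"   > files_galore/file5   # duplicate of file1
echo "bravo"   > files_galore/file6   # duplicate of file2
echo "alpha"   > files_galore/file7   # duplicate of file1
```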
The study of cryptography, security, and mathematics covers some very interesting and broad information domains! Hashes have a multitude of other uses, such as integrity checking of files, lookup values to find data quickly, unique identifiers, and much more.
When you run
file-deduplicator.sh, it begins by asking the user for input using
read, and it prints out four different values with seemingly random strings of characters. If they look like hashes to you, that is absolutely correct: they are SHA512 hash sums! Each string is the hash sum of the contents of a file. Even if the contents are only slightly different (for example, one bit has been flipped to a 1 instead of a 0), a totally different hash sum will be produced. Again, this Bash script leverages the concept of arrays (using a global array variable, meaning it is accessible everywhere in the script) and hash sums, using the sha512sum tool combined with a field-extraction tool such as cut to retrieve the correct values. This script is not recursive, though, and only looks at the files inside of files_galore to generate a list of files, compute a hash for each one, and search an array containing all hashes. If a hash is unknown, then it is a new file and is inserted into the array for storage. Otherwise, if a hash is seen twice, the file is deleted because it contains duplicate content (even if the file name is different). There is another aspect here, and that is the use of return values as strings. As you may remember, return can only return numeric values:
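Since return only carries a numeric exit status (0-255), a Bash function typically "returns" a string by writing it to stdout and letting the caller capture it with command substitution; a minimal sketch, with a hypothetical hash_of helper:

```shell
# 'return' can only carry a small integer, so the string result (the hash)
# is echoed to stdout and captured by the caller instead.
hash_of() {
    local file="$1"
    sha512sum "$file" | cut -d' ' -f1   # stdout acts as the "return value"
}

# Caller captures the function's output with command substitution:
printf 'sample contents' > sample.txt
file_hash=$(hash_of sample.txt)
echo "$file_hash"
```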
After executing the operation, we can see that the
files_galore directory only contains four files out of the original seven. The duplicate data is now removed!