Write a bash script to generate datasets and random files of various sizes

Data that mimics real-world data is usually best, but sometimes we need an assortment of files of varying content and size for validation testing without delay. Imagine that you have a web server running an application that accepts files for storage, but enforces a size limit on uploads. Wouldn’t it be great to whip up a batch of test files in an instant?

To do this, we can use a few file system features such as /dev/random and a useful program called dd. The dd command is a utility for converting and copying files (including devices, thanks to Linux’s “everything is a file” philosophy, more or less). It will appear in a later recipe to back up data on an SD card (remember your favorite Raspberry Pi project?) or to “chomp” through files byte by byte without losses. A typical minimal dd invocation looks like $ dd if="inputFile" of="outputFile" bs=1M count=10. From this command, we can see:

  • if=: Stands for input file
  • of=: Stands for output file
  • bs=: Stands for block size
  • count=: Stands for the number of blocks to be copied

The bs= and count= options can be omitted if you want a 1:1 (exact duplicate) copy of a file, because dd will choose reasonably efficient parameters on its own. The dd command also has a number of other options, such as seek=, which will be explored later when performing low-level backups in another recipe. The count option is rarely needed in practice, since it is far more common to copy an entire file than a section of one (as when performing backups).
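As a quick sanity check of that behavior, the following sketch (file names under /tmp are illustrative) creates a small file with explicit bs= and count=, then duplicates it 1:1 with no size options at all:

```shell
# Create a 1 MB file of zeros with explicit block size and count.
dd if=/dev/zero of=/tmp/source.img bs=1M count=1 2>/dev/null

# Omit bs= and count= for a plain 1:1 copy; dd picks its own parameters.
dd if=/tmp/source.img of=/tmp/copy.img 2>/dev/null

# The two files should be byte-identical.
cmp /tmp/source.img /tmp/copy.img && echo "identical"
```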

Note:

/dev/random is a device in Linux (hence the /dev path) which can be used to produce random numbers for use in your scripts or applications. There are also other /dev paths, such as the console and various adapters (for example, USB sticks or mice), all of which may be accessible, and getting to know them is recommended.
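You can try this directly with dd. The example below reads from /dev/urandom, the non-blocking counterpart of /dev/random (which can stall while waiting for entropy), and renders the bytes in hex with od from coreutils:

```shell
# Pull 16 random bytes from the kernel's non-blocking random pool
# and print them one hex byte at a time.
dd if=/dev/urandom bs=16 count=1 2>/dev/null | od -An -tx1
```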

Prerequisites

To get ready for this recipe, make sure dd and hexdump are available (dd ships with coreutils and is preinstalled on virtually every Linux system; on Debian/Ubuntu, hexdump comes from the bsdmainutils package), then make a new directory called qa-data/:

$ sudo apt-get install coreutils bsdmainutils 
$ mkdir qa-data

This script uses the dmesg command, which returns system information such as interface status or messages from the system boot process. It is nearly always present on a system, which makes it a good stand-in for system-level “lorem ipsum” filler text. If you wish to use another source of random text, or a dictionary of words, dmesg can easily be replaced! Two other commands used are seq and hexdump. The seq command generates a sequence of numbers from a starting point to an end point using a specified increment, and hexdump produces a human-readable representation of a binary (or executable) file in hexadecimal format.
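A quick feel for both commands before we use them in the script:

```shell
# seq START INCREMENT END prints an arithmetic progression,
# one number per line.
seq 1 2 9        # prints 1 3 5 7 9

# hexdump renders arbitrary bytes in hexadecimal
# (guarded here in case bsdmainutils is not installed yet).
command -v hexdump >/dev/null && printf 'AB' | hexdump
```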

Write Script:

Open a terminal and create a new script called data-maker.sh.

The following is the code snippet of the script:

data-maker.sh

#!/bin/bash 
N_FILES=3 
TYPE=binary 
DIRECTORY="qa-data" 
NAME="garbage" 
EXT=".bin" 
UNIT="M" 
RANDOM=$$ 
TMP_FILE="/tmp/tmp.datamaker.sh" 
function get_random_number() {
	# Re-seed $RANDOM from the current time in nanoseconds.
	SEED=$(($(date +%s%N)/100000))
	RANDOM=$SEED
	# Sleep is needed to make sure that the next time random is run,
	# the clock-derived seed has changed.
	sleep 3
	local STEP=$1
	local VARIANCE=$2
	local UPPER=$3
	local LOWER=$VARIANCE
	local ARR

	INC=0
	# Build an array of candidate sizes: LOWER, LOWER+STEP, ..., UPPER.
	for N in $( seq ${LOWER} ${STEP} ${UPPER} );
	do
		ARR[$INC]=$N
		INC=$(($INC+1))
	done
	# Pick a random index, then return the array value at that index.
	RAND=$(($RANDOM % ${#ARR[@]}))
	echo ${ARR[$RAND]}
}
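The listing above shows only the helper function; the main loop that consumes it is omitted. A minimal, self-contained sketch of what such a loop might look like (variable names follow the script's header, but the loop body and the inline size pick standing in for get_random_number are assumptions, not the recipe's exact code):

```shell
#!/bin/bash
# Hypothetical main loop: create N_FILES binary files of zeros whose
# sizes (in UNIT) fall between LOWER and UPPER, as the recipe describes.
N_FILES=3 LOWER=1 UPPER=10 DIRECTORY=/tmp/qa-data NAME=garbage EXT=.bin UNIT=M
mkdir -p "$DIRECTORY"

for I in $(seq 0 $((N_FILES - 1))); do
    # Stand-in for get_random_number: pick a size in [LOWER, UPPER].
    SIZE=$(( LOWER + RANDOM % (UPPER - LOWER + 1) ))
    dd if=/dev/zero of="${DIRECTORY}/${NAME}${I}${EXT}" \
       bs="1${UNIT}" count="$SIZE" 2>/dev/null
done

ls "$DIRECTORY"
```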

Let’s begin by executing the script with the following command. The -t flag sets the type (here text), -n sets the number of files (5), -l sets the lower bound (1 character), and -u sets the upper bound (1000 characters):

$ bash data-maker.sh -t text -n 5 -l 1 -u 1000

To check the output, use the following commands:

$ ls -la qa-data/*.txt 
$ tail qa-data/garbage4.txt

Again, let’s run the data-maker.sh script, but for binary files. Instead of the size limits being 1 char (1 byte) and 1000 chars (1000 bytes, just under one kilobyte), the sizes are in MB, ranging from 1 to 10 MB per file:

$ bash data-maker.sh -t binary -n 5 -l 1 -u 10

To check out the output, use the following commands. We use a new command called hexdump because we cannot “dump” or “cat” a binary file the same way as a “regular” ASCII text file:

$ ls -la qa-data/*.bin 
$ hexdump qa-data/garbage0.bin 
0000000 0000 0000 0000 0000 0000 0000 0000 0000 
*

How the Script Works:

Let’s walk through what is happening:

First, we create the data-maker.sh script. This script introduces several new concepts, including the ever-fascinating concept of randomization. Computers cannot produce truly random numbers on their own; pseudorandom generators rely on mathematical principles and sources of entropy. While this is beyond the scope of this cookbook, know that whenever you use a pseudorandom generator, initially or on reuse, you should give it a unique initialization vector or seed. Using a for loop, we build an array of numbers with the seq command. Once the array is built, we choose a “random” value from it. For each type of file output (binary or text), we set approximate minimum (-l or lower) and maximum (-u or upper) sizes to control the output data.
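The seeding point is easy to demonstrate with bash's built-in $RANDOM: the same seed replays the same sequence, which is exactly why the script derives a fresh seed from the clock each time:

```shell
# bash-specific: assigning to RANDOM seeds its generator.
RANDOM=42; A="$RANDOM $RANDOM $RANDOM"
RANDOM=42; B="$RANDOM $RANDOM $RANDOM"
# Identical seeds yield identical sequences.
[ "$A" = "$B" ] && echo "same seed, same sequence"
```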

In step 2, we build 5 text files using the output of dmesg and our pseudo randomization process. We can see that we iterate until we have five text files created using different sizes and starting points with the dd command.
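The piping pattern used in step 2 can be tried with any text source. Here `yes` stands in for dmesg (whose output may be empty or restricted inside containers), and dd caps the file at an exact byte count:

```shell
mkdir -p /tmp/qa-data
# Pipe a text stream into dd and cap the output at exactly 500 bytes,
# just as the script does with dmesg output.
yes "lorem ipsum" | dd of=/tmp/qa-data/sample.txt bs=1 count=500 2>/dev/null
wc -c < /tmp/qa-data/sample.txt   # prints 500
```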

In step 3, we verify that we indeed created five files, and we viewed the tail of the fifth one, garbage4.txt.

In step 4, we create five binary files (full of zeros) using the dd command. Instead of a number of characters, the sizes are specified in megabytes (MB).
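A single-file version of step 4 looks like this (the /tmp path and 2 MB size are illustrative):

```shell
# One binary file of zeros, sized in megabytes, as in step 4.
dd if=/dev/zero of=/tmp/garbage0.bin bs=1M count=2 2>/dev/null
# Report the resulting size in bytes: 2 * 1048576 = 2097152.
stat -c %s /tmp/garbage0.bin
```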

In step 5, we verify that we indeed created five binary files, and we viewed the contents of one of them, garbage0.bin, using the hexdump command. The hexdump command produced a simplified “dump” of all of the bytes inside the file; the * indicates that the repeated all-zero lines have been collapsed.
