Write a bash script to join and split files at arbitrary positions

Let’s not be shy! Who has tried to open a large file with an application, by accident or even intentionally, and found that it didn’t quite go as planned? I certainly have, and I have certainly run into limitations such as the maximum number of rows that Excel or OpenOffice Calc will load. In these cases, we can use a handy tool that splits files at arbitrary points, such as the following:

  • Before X number of lines
  • Before Z number of bytes/chars

In this article, you will create two single-purpose scripts: one that takes an input file and splits it into multiple files, and a second that joins files by merging one into another at a given position. There are a few caveats when passing data around in string variables, as they:

  • Can sometimes lose special characters such as newlines
  • Are a poor fit for binary data, which should be handled by different tools than the usual commands on the command line
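The first caveat can be reproduced in a few lines. The following is a minimal sketch, not part of the recipe's scripts; the variable names captured, unquoted, and quoted are made up for illustration:

```shell
#!/bin/bash
# Sketch: two common ways newlines get lost when text passes through variables.

# 1. Command substitution strips ALL trailing newlines.
captured=$(printf 'line1\nline2\n\n\n')
echo "length of captured text: ${#captured}"   # 11 characters: line1 + newline + line2

# 2. Unquoted expansion is word-split, so embedded newlines collapse to spaces.
unquoted=$(echo $captured)
quoted=$(echo "$captured")

echo "unquoted: ${unquoted}"    # printed on one line
echo "quoted:"
echo "${quoted}"                # printed on two lines
```

Quoting the expansion ("$captured") keeps embedded newlines, but nothing brings back the trailing ones that command substitution removed.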

This recipe also reuses getopts parameter parsing, and it introduces the mktemp command and the getconf command with the PAGESIZE parameter. mktemp is useful because it creates unique temporary files, by default in the /tmp directory, and it can also create unique files that follow a template (notice the XXXX: it will be replaced with random characters, but uniquefile. will remain):

$ mktemp uniquefile.XXXX
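To make the two forms concrete, here is a small sketch. It assumes a mktemp that accepts being called without arguments (as GNU coreutils does); the names tmp_default, tmp_template, and workdir are illustrative only:

```shell
#!/bin/bash
# Sketch: calling mktemp with and without a template.

# Scratch directory so the templated file does not clutter the current one.
workdir=$(mktemp -d)

# No arguments: a unique file is created in the system temporary
# directory (usually /tmp, or $TMPDIR if set) and its path is printed.
tmp_default=$(mktemp)

# Remove everything this sketch created when the script exits.
trap 'rm -rf "$workdir" "$tmp_default"' EXIT

# With a template: the trailing X characters are replaced with random
# characters, and the file is created relative to the current directory.
tmp_template=$(cd "$workdir" && mktemp uniquefile.XXXX)

echo "default:  ${tmp_default}"
echo "template: ${tmp_template}"
```

The printed names differ on every run, which is the point: two scripts running at the same time will not overwrite each other's temporary files.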

Another useful command is getconf, a standards-compliant utility designed to fetch system configuration values. One in particular, PAGESIZE, reports the size of a single memory page in bytes. This is a simplification, but writing data in chunks of an appropriate size can be very beneficial performance-wise.
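As an illustration of how the page size might be put to use, the sketch below passes it to dd as the block size. This is one possible use, not something the recipe's scripts are confirmed to do, and the names PAGE, src, and dst are assumptions:

```shell
#!/bin/bash
# Sketch: use the system page size as the block size for dd.

PAGE=$(getconf PAGESIZE)    # often 4096, but this depends on the system
echo "Page size: ${PAGE} bytes"

src=$(mktemp)
dst=$(mktemp)
trap 'rm -f "$src" "$dst"' EXIT

# Create a sample file of exactly three pages, then copy it one page at a time.
dd if=/dev/zero of="$src" bs="$PAGE" count=3 2>/dev/null
dd if="$src" of="$dst" bs="$PAGE" 2>/dev/null

cmp -s "$src" "$dst" && echo "Copy is identical to the source"
```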

Prerequisites

Besides having a terminal open, a single text file called input-lines needs to be created with the following content (one character on each line):

input-lines

1 
2 
3 
4 
5 
6 
7 
8 
9 
0 
a 
b 
c 
d 
e 
f 
g 
h 
i 
j 
k

Next, create a second file called merge-lines with the following content:

merge-lines

It's -17 outside

Write the script:

Open a terminal and create a script named file-splitter.sh.

The following is the code snippet:

file-splitter.sh

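#!/bin/bash
# Defaults, based on the flags used when the script is run below.
FNAME=""        # input file (-i)
LEN=10          # lines or bytes per output file (-l)
TYPE="line"     # split mode: "line" or "size" (-t)
OPT_ERROR=0     # error flag for option handling
set -f          # disable filename globbing

# Report whether file(1) identifies the input as ASCII text.
function determine_type_of_file() {
	local FILE="$1"
	file -b "${FILE}" | grep -q "ASCII text"
	RES=$?
	if [ $RES -eq 0 ]; then
		echo "ASCII file - continuing"
	else
		echo "Not an ASCII file, perhaps it is Binary?"
	fi
}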
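The snippet above stops before the option handling. As a rough idea of what getopts parsing for the -i, -t, and -l flags could look like, here is a hedged sketch; the parse_args function and its validation rules are assumptions, not the original script's code:

```shell
#!/bin/bash
# Hypothetical sketch of getopts parsing for -i, -t and -l.
FNAME=""
LEN=10
TYPE="line"
OPT_ERROR=0

parse_args() {
	local OPTIND=1 opt
	while getopts "i:t:l:" opt; do
		case "$opt" in
			i) FNAME="$OPTARG" ;;
			t) TYPE="$OPTARG" ;;
			l) LEN="$OPTARG" ;;
			*) OPT_ERROR=1 ;;      # unknown flag or missing argument
		esac
	done

	# Basic validation (assumed rules): a file name is required, the type
	# must be "line" or "size", and the length must be a positive integer.
	[ -n "$FNAME" ] || OPT_ERROR=1
	case "$TYPE" in
		line|size) ;;
		*) OPT_ERROR=1 ;;
	esac
	case "$LEN" in
		''|*[!0-9]*|0) OPT_ERROR=1 ;;
	esac
}

parse_args -i input-lines -t size -l 5
echo "file=${FNAME} type=${TYPE} len=${LEN} error=${OPT_ERROR}"
```

Keeping the parsing in a function with a local OPTIND makes it easy to call more than once, for example from a test.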

Next, run file-splitter.sh with this command and flags (-i, -t, and -l):

$ bash file-splitter.sh -i input-lines -t line -l 10

Review the output and see how it differs between -t line and -t size. What about when -l 1 or -l 100 is used? Remember to remove the split files between runs using $ rm input-lines.*:

$ rm input-lines.* 
$ bash file-splitter.sh -i input-lines -t line -l 10 
$ rm input-lines.* 
$ bash file-splitter.sh -i input-lines -t line -l 1 
$ rm input-lines.* 
$ bash file-splitter.sh -i input-lines -t line -l 100 
$ rm input-lines.* 
$ bash file-splitter.sh -i input-lines -t size -l 10

In the next step, create another script called file-joiner.sh.

The following is the code snippet:

file-joiner.sh

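#!/bin/bash
# Defaults, based on the flags used when the script is run below.
INAME=""        # original file (-i)
ONAME=""        # other file to merge in (-o)
FNAME=""        # final merged file (-f)
WHERE=""        # where to inject the other file (-w)
OPT_ERROR=0     # error flag for option handling
TMPFILE1=$(mktemp)   # scratch file, so the originals are not modified

# Report whether file(1) identifies the input as ASCII text.
function determine_type_of_file() {
	local FILE="$1"
	file -b "${FILE}" | grep -q "ASCII text"
	RES=$?
	if [ $RES -eq 0 ]; then
		echo "ASCII file - continuing"
	else
		echo "Not an ASCII file, perhaps it is Binary?"
	fi
}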

Next, run the script using this command:

$ bash file-joiner.sh -i input-lines -o merge-lines -f final-join.txt -w 2

How the script works:

Before proceeding, notice that splitting by size (-t size) loses \n newlines, because the script reads in characters one at a time. read suffices for the purpose of this recipe, but be aware that read and cat are not the best tools for this type of work.

Creating the script was trivial and for the most part shouldn’t look like it came from the planet Mars.

Running the $ bash file-splitter.sh -i input-lines -t line -l 10 command should produce three files, named input-lines.{1,…,3}. There are three files because the input listed in the prerequisites is 21 lines long, so splitting every 10 lines gives 10+10+1. Using read and echo with a concatenated buffer (${BUFFER}), we write to a file each time a specific criterion (provided by -l) is met. When the end of file (EOF) is reached and the loop is done, we still need to write the remaining buffer to a file, because it may be under the write threshold; skipping this step would leave bytes missing from the last file created by the splitter script:

$ bash file-splitter.sh -i input-lines -t line -l 10 
ASCII file - continuing 
Wrote buffer to file: input-lines.1 
Wrote buffer to file: input-lines.2 
Wrote buffer to file: input-lines.3
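The buffer-and-flush logic described above can be sketched as follows. This is an illustrative reconstruction rather than the original file-splitter.sh; the split_by_lines function and its local variable names are assumptions:

```shell
#!/bin/bash
# Sketch: split a text file every LEN lines, flushing the remainder at EOF.
split_by_lines() {
	local fname="$1" len="$2"
	local buffer="" count=0 part=1 line

	# IFS= and -r keep whitespace and backslashes intact; the extra test
	# also picks up a final line that has no trailing newline.
	while IFS= read -r line || [ -n "$line" ]; do
		buffer+="${line}"$'\n'
		count=$((count + 1))
		if [ "$count" -ge "$len" ]; then
			printf '%s' "$buffer" > "${fname}.${part}"
			echo "Wrote buffer to file: ${fname}.${part}"
			buffer=""
			count=0
			part=$((part + 1))
		fi
	done < "$fname"

	# Flush whatever is left under the threshold, or it would be lost.
	if [ -n "$buffer" ]; then
		printf '%s' "$buffer" > "${fname}.${part}"
		echo "Wrote buffer to file: ${fname}.${part}"
	fi
}

# Demo in a scratch directory, using the input-lines content from the prerequisites.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
printf '%s\n' 1 2 3 4 5 6 7 8 9 0 a b c d e f g h i j k > input-lines
split_by_lines input-lines 10
```

The sketch uses printf instead of echo to write the buffer, so a line that happens to look like an echo option (for example -n) is written unchanged.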

The output depends on the value passed to the -l flag: a value of 1 produces a file for every line, and a value of 100 produces a single file because the whole input fits under the threshold. The side feature -t size splits on bytes instead, but read has an unfortunate side effect here: the buffer we pass around is altered and the newlines go missing. This kind of work is better done with a tool such as dd, which is designed for copying and writing raw data to files or devices.
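For comparison, a byte-based split built on dd could look like the sketch below. It is an alternative illustration, not what file-splitter.sh does; the split_by_bytes function is hypothetical and assumes the length is a positive integer:

```shell
#!/bin/bash
# Sketch: split a file into LEN-byte pieces with dd, so newlines survive.
split_by_bytes() {
	local fname="$1" len="$2"
	local size part=1 offset=0
	size=$(( $(wc -c < "$fname") ))

	while [ "$offset" -lt "$size" ]; do
		# With bs=LEN, skip=N jumps over the first N blocks, so each call
		# copies bytes N*LEN up to (N+1)*LEN (or to the end of the file).
		dd if="$fname" of="${fname}.${part}" bs="$len" \
			skip=$((part - 1)) count=1 2>/dev/null
		echo "Wrote bytes to file: ${fname}.${part}"
		offset=$((offset + len))
		part=$((part + 1))
	done
}

# Demo in a scratch directory, using the input-lines content from the prerequisites.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
printf '%s\n' 1 2 3 4 5 6 7 8 9 0 a b c d e f g h i j k > input-lines
split_by_bytes input-lines 10

# Joining the pieces in order should reproduce the original byte for byte.
cat input-lines.? | cmp -s - input-lines && echo "Pieces reassemble to the original"
```

Because dd copies raw bytes and never stores them in a shell variable, the newlines are preserved and the same approach would also work on binary input.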

Next, we created the script called file-joiner.sh. Again, it uses getopts and requires four input parameters: -i originalFile, -o otherFileToMerge, -f finalMergedFile, and -w whereInjectTheOtherFile. The script is simpler overall, but uses the mktemp command to create a temporary file that serves as a storage buffer, so the originals are not modified. When we are finished, we use the mv command to move the file from /tmp to the terminal’s current directory (.). The mv command can also be used to rename files and is usually faster than cp when source and destination are on the same file system, because no copy occurs, just a renaming operation at the file system level. That advantage is smaller in this case, since the file is being moved out of /tmp.

Running cat on final-join.txt should produce the following output:

Output:

$ cat final-join.txt 
1 
2 
It's -17 outside 
3 
4 
5 
6 
7 
8 
9 
0 
a 
b 
c 
d 
e 
f 
g 
h 
i 
j 
k
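The joining behavior shown in this output can be sketched with head and tail writing into a temporary file. This is a hedged illustration, not the original file-joiner.sh; the join_files function is hypothetical and assumes the position is at least 1:

```shell
#!/bin/bash
# Sketch: insert one file into another after line WHERE, via a temporary file.
join_files() {
	local iname="$1" oname="$2" fname="$3" where="$4"
	local tmpfile
	tmpfile=$(mktemp) || return 1

	{
		head -n "$where" "$iname"              # first WHERE lines of the original
		cat "$oname"                           # the file being merged in
		tail -n "+$((where + 1))" "$iname"     # the rest of the original
	} > "$tmpfile"

	# Move the finished buffer into place; the input files stay untouched.
	mv "$tmpfile" "$fname"
}

# Demo in a scratch directory, using the two files from the prerequisites.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
printf '%s\n' 1 2 3 4 5 6 7 8 9 0 a b c d e f g h i j k > input-lines
printf '%s\n' "It's -17 outside" > merge-lines

join_files input-lines merge-lines final-join.txt 2
head -n 4 final-join.txt
```

With a position of 2, the merged line lands between lines 2 and 3 of the original, matching the output above.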
