Let’s not be shy! Who has tried to open a large file, by accident or even intentionally, with an application, only to have it not quite go as planned? I certainly have, and I have certainly seen the limitations, such as the maximum number of rows loaded in Excel or OpenOffice Calc. In these cases, we use a handy tool that can split files at arbitrary points, such as the following:
- Before X number of lines
- Before Z number of bytes/chars
In this article, you will create two complementary scripts: one that takes an input file and splits it into multiple files, and a second that joins files back together using a combining method. There are a few caveats when passing around string variables, as they:
- Can sometimes lose special characters such as newlines (see the short demonstration below)
- Should, in the case of binary data, be handled by different tools than the usual command-line utilities
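For instance, command substitution strips all trailing newlines when output is captured into a variable; the following can be run directly in a terminal:
$ VAR=$(printf 'one\ntwo\n\n\n')
$ printf '%s' "${VAR}" | wc -c
7
Only seven characters survive (one, a newline, and two); the trailing newlines are gone.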
This recipe also reuses getopts parameter parsing, but it introduces the mktemp command and the getconf command with the PAGESIZE variable. Mktemp is a useful command because it can produce unique temporary files that reside in the /tmp directory, and it can even produce unique files that follow a template (notice the XXXX; this will be replaced with random values, but the uniquefile. prefix will remain):
$ mktemp uniquefile.XXXX
Another useful command is the getconf programming utility, which is a standards-compliant tool designed to fetch useful system variables. One in particular, called PAGESIZE, reports the size of a single page (block) of memory. Obviously, this is putting it in very simplistic terms, but choosing an appropriate size when writing data can be very beneficial performance-wise.
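As a quick illustration (4096 bytes is a typical value on an x86-64 Linux system; yours may report something different):
$ getconf PAGESIZE
4096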
Prerequisites
Besides having a terminal open, a single text file called input-lines needs to be created with the following content (one character on each line):
input-lines
1
2
3
4
5
6
7
8
9
0
a
b
c
d
e
f
g
h
i
j
k
Next, create a second file called merge-lines with the following content:
merge-lines
It's -17 outside
Write the scripts:
Open a terminal and create a script named file-splitter.sh. The following is the code snippet:
file-splitter.sh
#!/bin/bash
FNAME=""
LEN=10
TYPE="line"
OPT_ERROR=0
set -f # disable globbing so that nothing in the input is expanded

# Check whether the input is ASCII text and warn if it looks binary
function determine_type_of_file() {
    local FILE="$1"
    file -b "${FILE}" | grep "ASCII text" > /dev/null
    RES=$?
    if [ $RES -eq 0 ]; then
        echo "ASCII file - continuing"
    else
        echo "Not an ASCII file, perhaps it is Binary?"
    fi
}
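The snippet above shows only the setup and the file-type check; the full listing is not reproduced here. Based on the behavior described later (getopts parsing of -i, -t, and -l, a concatenated ${BUFFER}, and a final flush at EOF), a minimal sketch of the remainder might look like the following. Treat it as an approximation rather than the exact original; the COUNT and NFILE helper variables in particular are illustrative names:

while getopts "i:t:l:" OPT; do
    case ${OPT} in
        i) FNAME="${OPTARG}";;
        t) TYPE="${OPTARG}";;
        l) LEN="${OPTARG}";;
        *) OPT_ERROR=1;;
    esac
done

if [ -z "${FNAME}" ] || [ ${OPT_ERROR} -ne 0 ]; then
    echo "Usage: $0 -i inputFile -t line|size -l length"
    exit 1
fi

determine_type_of_file "${FNAME}"

BUFFER=""
COUNT=0
NFILE=1
if [ "${TYPE}" = "line" ]; then
    # Accumulate whole lines until the -l threshold is reached
    while IFS= read -r LINE; do
        BUFFER="${BUFFER}${LINE}"$'\n'
        COUNT=$((COUNT + 1))
        if [ "${COUNT}" -ge "${LEN}" ]; then
            printf '%s' "${BUFFER}" > "${FNAME}.${NFILE}"
            echo "Wrote buffer to file: ${FNAME}.${NFILE}"
            BUFFER=""; COUNT=0; NFILE=$((NFILE + 1))
        fi
    done < "${FNAME}"
else
    # Accumulate single characters; read -n 1 silently drops newlines
    while IFS= read -r -n 1 CHAR; do
        BUFFER="${BUFFER}${CHAR}"
        COUNT=$((COUNT + 1))
        if [ "${COUNT}" -ge "${LEN}" ]; then
            printf '%s' "${BUFFER}" > "${FNAME}.${NFILE}"
            echo "Wrote buffer to file: ${FNAME}.${NFILE}"
            BUFFER=""; COUNT=0; NFILE=$((NFILE + 1))
        fi
    done < "${FNAME}"
fi

# Flush whatever remains after EOF so the last, short chunk is not lost
if [ -n "${BUFFER}" ]; then
    printf '%s' "${BUFFER}" > "${FNAME}.${NFILE}"
    echo "Wrote buffer to file: ${FNAME}.${NFILE}"
fi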
Next, run file-splitter.sh with this command and flags (-i, -t, -l):
$ bash file-splitter.sh -i input-lines -t line -l 10
Review the output and see what the difference is when -t size is used instead of -t line. What about when -l 1 or -l 100 is used? Remember to remove the split files between runs using $ rm input-lines.*:
$ rm input-lines.*
$ bash file-splitter.sh -i input-lines -t line -l 10
$ rm input-lines.*
$ bash file-splitter.sh -i input-lines -t line -l 1
$ rm input-lines.*
$ bash file-splitter.sh -i input-lines -t line -l 100
$ rm input-lines.*
$ bash file-splitter.sh -i input-lines -t size -l 10
In the next step, create another script called file-joiner.sh. The following is the code snippet:
file-joiner.sh
#!/bin/bash
INAME=""
ONAME=""
FNAME=""
WHERE=""
OPT_ERROR=0
TMPFILE1=$(mktemp) # unique temporary file used as a storage buffer

# Check whether the input is ASCII text and warn if it looks binary
function determine_type_of_file() {
    local FILE="$1"
    file -b "${FILE}" | grep "ASCII text" > /dev/null
    RES=$?
    if [ $RES -eq 0 ]; then
        echo "ASCII file - continuing"
    else
        echo "Not an ASCII file, perhaps it is Binary?"
    fi
}
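As before, only the setup is shown above. Assuming the behavior described below (getopts parsing of -i, -o, -f, and -w, assembling the result in the mktemp buffer, then moving it into place with mv), a minimal sketch of the remainder follows. The head/tail assembly is one plausible implementation, not necessarily the original’s:

while getopts "i:o:f:w:" OPT; do
    case ${OPT} in
        i) INAME="${OPTARG}";;
        o) ONAME="${OPTARG}";;
        f) FNAME="${OPTARG}";;
        w) WHERE="${OPTARG}";;
        *) OPT_ERROR=1;;
    esac
done

if [ -z "${INAME}" ] || [ -z "${ONAME}" ] || [ -z "${FNAME}" ] \
   || [ -z "${WHERE}" ] || [ ${OPT_ERROR} -ne 0 ]; then
    echo "Usage: $0 -i original -o toMerge -f finalFile -w lineNumber"
    exit 1
fi

determine_type_of_file "${INAME}"
determine_type_of_file "${ONAME}"

# Assemble the merged result in the temporary buffer so that neither
# original file is modified
head -n "${WHERE}" "${INAME}" > "${TMPFILE1}"
cat "${ONAME}" >> "${TMPFILE1}"
tail -n +"$((WHERE + 1))" "${INAME}" >> "${TMPFILE1}"

# Move the buffer out of /tmp into the current directory (.) under its
# final name; mv renames rather than copies when both sit on one filesystem
mv "${TMPFILE1}" "./${FNAME}"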
Next, run the script using this command:
$ bash file-joiner.sh -i input-lines -o merge-lines -f final-join.txt -w 2
How the scripts work:
Before proceeding, notice that the size type option (-t size) in file-splitter.sh ignores \n newlines when reading in characters one at a time. Read suffices for the purposes of this recipe, but the reader should be aware that read/cat are not the best tools for this type of work.
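A quick one-liner demonstrates the loss; every newline that read -n 1 encounters comes back as an empty string, so only the visible characters survive:
$ printf 'a\nb\n' | while IFS= read -r -n 1 C; do printf '%s' "$C"; done
ab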
Creating the script was trivial and for the most part shouldn’t look like it came from the planet Mars.
Running the $ bash file-splitter.sh -i input-lines -t line -l 10 command should produce three files: input-lines.1 through input-lines.3. The reason there are three files is that the same input is 21 lines long, so splitting at every 10 lines yields three files (10+10+1). Using read and echo with a concatenated buffer (${BUFFER}), we can write to a file based on a specific criterion (provided by -l). When EOF (end of file) is reached and the loop is done, we need to write the buffer to the file one last time, because it may still be under the threshold of the write criterion; skipping this step would result in lost/missing bytes in the last file created by the splitter script:
$ bash file-splitter.sh -i input-lines -t line -l 10
ASCII file - continuing
Wrote buffer to file: input-lines.1
Wrote buffer to file: input-lines.2
Wrote buffer to file: input-lines.3
Depending on the usage of the -l flag, a value of 1 will produce a file for every line, and a value of 100 will produce a single file because everything fits under the threshold. Using the side feature -t size, which splits based on bytes, read has an unfortunate side effect: when we pass the buffer, it is altered and the newlines are missing. This sort of activity would be better served by a tool such as dd, which is better suited to copying, writing, and creating raw data in files or on devices.
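For comparison, this is one way dd could carve fixed-size byte chunks out of the same input while preserving the newline bytes (a hypothetical usage, not part of the recipe’s scripts; note that skip counts blocks of bs, not bytes):
$ dd if=input-lines of=input-lines.1 bs=10 count=1 skip=0
$ dd if=input-lines of=input-lines.2 bs=10 count=1 skip=1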
Next, we created the script called file-joiner.sh. Again, it used getopts and requires four input parameters: -i originalFile, -o otherFileToMerge, -f finalMergedFile, and -w whereToInjectTheOtherFile. The script is simpler overall, but it uses the mktemp command to create a temporary file, which we can use as a storage buffer without modifying the originals. When we are finished, we can use the mv command to move the file from /tmp to the terminal’s current directory (.). The mv command can also be used to rename files and is usually faster than cp (not so much in this case) because a copy does not occur; rather, just a renaming operation happens at the filesystem level.
Catting final-join.txt should produce the following output:
Output:
$ cat final-join.txt
1
2
It's -17 outside
3
4
5
6
7
8
9
0
a
b
c
d
e
f
g
h
i
j
k