| Joining Text Files and Data ColumnsKrazyWorks

Networking

Unix and Linux network configuration. Multiple network interfaces. Bridged NICs. High-availability network configurations.

Applications

Reviews of latest Unix and Linux software. Helpful tips for application support admins. Automating application support.

Data

Disk partitioning, filesystems, directories, and files. Volume management, logical volumes, HA filesystems. Backups and disaster recovery.

Monitoring

Distributed server monitoring. Server performance and capacity planning. Monitoring applications, network status and user activity.

Commands & Shells

Cool Unix shell commands and options. Command-line tools and application. Things every Unix sysadmin needs to know.

Home » Commands & Shells, Featured, Files

Joining Text Files and Data Columns

Submitted by Igor on December 18, 2021 – 12:17 pm

When working with application logs and other text files, it is often useful to view the contents of different source files side-by-side. Here we will take a quick look at various command-line methods for joining data columns from different files.

Let’s start with a simple example: we have these three files:

al.txt

peggy.txt

marcy.txt

Date       Volume
2021-01-01 9827
2021-01-02 8372
2021-01-03 6399
2021-01-04 8726
2021-01-05 6882

Date       Volume
2021-01-01 8722
2021-01-02 2653
2021-01-03 9876
2021-01-04 2763
2021-01-05 8733

Date       Volume
2021-01-01 2762
2021-01-02 7622
2021-01-03 6523
2021-01-04 8725
2021-01-05 4265

We need to merge these three files into a single file that looks like this:

Date        Al      Peggy   Marcy
2021-01-01  9827    8722    2762
2021-01-02  8372    2653    7622
2021-01-03  6399    9876    6523
2021-01-04  8726    2763    8725
2021-01-05  6882    8733    4265

Because the key column is the Date and all the dates are in sequence across the three files, we can use something simple like paste or pr:

paste -d' ' al.txt <(awk '{print $2}' peggy.txt) <(awk '{print $2}' marcy.txt) | column -t

Date        Volume  Volume  Volume
2021-01-01  9827    8722    2762
2021-01-02  8372    2653    7622
2021-01-03  6399    9876    6523
2021-01-04  8726    2763    8725
2021-01-05  6882    8733    4265

pr -mts' ' al.txt <(awk '{print $2}' peggy.txt) <(awk '{print $2}' marcy.txt) | column -t
Date        Volume  Volume  Volume

2021-01-01  9827    8722    2762
2021-01-02  8372    2653    7622
2021-01-03  6399    9876    6523
2021-01-04  8726    2763    8725
2021-01-05  6882    8733    4265

The minor issue to fix is the “Volume” column name to reflect the name of the source data file:

paste -d' ' <(sed 's/Volume/Al/g' al.txt) <(awk '{print $2}' peggy.txt | sed 's/Volume/Peggy/g') <(awk '{print $2}' marcy.txt | sed 's/Volume/Marcy/g') | column -t

Date        Al    Peggy  Marcy
2021-01-01  9827  8722   2762
2021-01-02  8372  2653   7622
2021-01-03  6399  9876   6523
2021-01-04  8726  2763   8725
2021-01-05  6882  8733   4265

While not very complicated, this is a bit tedious and you might be tempted to do this manually in a text editor or a spreadsheet application. But, what if you have a bunch of folders with potentially hundreds of such files that need to be merged?

Here’s the same method as described above, but now incorporating the find command. I added an extra file just to illustrate the point. Imagine if you had not four but four hundred files – try doing that by hand in a reasonable amount of time and without making any mistakes!

tmpscript=$(mktemp)
chmod u+x ${tmpscript}
echo -n "paste -d ' ' <(sed 's/Volume/$(basename $(find ~/datafiles -name "*\.txt" | head -1) | awk -F'.' '{$NF=""; print $0}' | sed 's/[[:alpha:]]/\u&/')/g' $(find ~/datafiles -name "*\.txt" | head -1)) " > ${tmpscript}
find ~/datafiles -name "*\.txt" | tail -n +2 | while read i; do echo -n "<(awk '{print \}' ${i} | sed \"s/Volume/$(basename $i | awk -F'.' '{$NF=""; print $0}' | sed 's/[[:alpha:]]/\u&/')/g\") "; done >> ${tmpscript}
eval ${tmpscript} | column -t 
/bin/rm -f ${tmpscript}

Date        Al    Marcy  Peggy  Steve
2021-01-01  9827  2762   8722   6522
2021-01-02  8372  7622   2653   7725
2021-01-03  6399  6523   9876   1254
2021-01-04  8726  8725   2763   9272
2021-01-05  6882  4265   8733   3428

This all works just fine as long as the key columns in all of your data files align perfectly. But this rarely happens in real life. Let’s say we have three data files that look like this:

al.txt

peggy.txt

marcy.txt

Date       Volume
2021-01-01 9827
2021-01-02 8372
2021-01-04 8726
2021-01-05 6882
2021-01-06 2435

Date       Volume
2021-01-01 8722
2021-01-02 2653
2021-01-04 2763
2021-01-05 8733

Date       Volume
2021-01-02 7622
2021-01-03 6523
2021-01-04 8725
2021-01-05 4265

If you apply the previous method, you will get garbage:

paste -d' ' <(sed 's/Volume/Al/g' al.txt) <(awk '{print $2}' peggy.txt | sed 's/Volume/Peggy/g') <(awk '{print $2}' marcy.txt | sed 's/Volume/Marcy/g') | column -t

Date        Al    Peggy  Marcy
2021-01-01  9827  8722   7622
2021-01-02  8372  2653   6523
2021-01-04  8726  2763   8725
2021-01-05  6882  8733   4265
2021-01-06  2435

What we really need is to align these files on the “Date” field and merge those lines where dates match. This is well beyond what paste or pr can do. But there’s join. Let’s try using it on the first two files:

join -j 1 -o 1.1,1.2,2.2 al.txt peggy.txt | column -t

Date        Volume  Volume
2021-01-01  9827    8722
2021-01-02  8372    2653
2021-01-04  8726    2763
2021-01-05  6882    8733

As you can see, January 6 line from al.txt has been dropped because there is no corresponding data point in peggy.txt. Right now I am not going to worry about the “Volume” column names – it just takes sed to fix that and we have bigger fish to fry (but I’ll come back to this later).

The 1.1,1.2,2.2 bit needs to be understood. The 1.1 refers to the first field of the first file. It is also the key field that needs to be matched in both files. The 1.2 is just the second field of the first file. The 2.2 is the second field of the second file. Sounds simple, but there’s gonna be more later.

Unfortunately for us, join can only join two files and we have three. So, the approach is to join the first two and use that as the first “file” for the second join:

join -j 1 -o 1.1,1.2,1.3,2.2 <(join -j 1 -o 1.1,1.2,2.2 al.txt peggy.txt) marcy.txt | column -t

Date        Volume  Volume  Volume
2021-01-02  8372    2653    7622
2021-01-04  8726    2763    8725
2021-01-05  6882    8733    4265

Now only three rows remain that have matching dates across all three files. In this case, 1.1,1.2,1.3 are the first three fields from the output of the inner join command, while 2.2 is the second field of the third file (marcy.txt).

Just as easily we could’ve redirected the output of the first join to a temporary file and then joined that file with marcy.txt, instead of concatenating the two commands. However, the less you use the filesystem, the faster your scripts will run.

Just for the hell of it, let’s add the fourth file:

join -j 1 -o 1.1,1.2,1.3,1.4,2.2 <(join -j 1 -o 1.1,1.2,1.3,2.2 <(join -j 1 -o 1.1,1.2,2.2 al.txt peggy.txt) marcy.txt) steve.txt | column -t

Date        Volume  Volume  Volume  Volume
2021-01-02  8372    2653    7622    7725
2021-01-04  8726    2763    8725    9272
2021-01-05  6882    8733    4265    3428

We can also use this with the find command if we have a large number of files in various folders that need to be merged in this way.

tmpfile=$(mktemp)
chmod u+x ${tmpfile}

k=2
j="1.1,1.2"

echo -n "join -j 1 -o ${j},2.2 $(find ~/datafiles2 -name "*\.txt" | head -${k} | xargs)" > "${tmpfile}"

(( k = k + 1 ))
j="${j},1.${k}"

find ~/datafiles2 -name "*\.txt" | tail -n +${k} | while read i; do 
  echo "join -j 1 -o ${j},2.2 <($(cat ${tmpfile})) ${i}" > ${tmpfile}
  (( k = k + 1 ))
  j="${j},1.${k}"
done 

eval "${tmpfile}" | column -t 
/bin/rm -f "${tmpfile}"

Date        Volume    Volume  Volume  Volume
2021-01-02  8372      7622    2653    7725
2021-01-04  8726      8725    2763    9272
2021-01-05  6882      4265    8733    3428

This leaves just the pesky issue of the column headers. At this point, it is easier to simply modify the source data files to replace the “Volume” header with the one derived from the filename:

find ~/datafiles2 -name "*\.txt" | while read i; do 
  sed -i "s/Volume/$(basename "${i}" | awk -F'.' '{$NF=""; print $0}' | sed 's/[[:alpha:]]/\u&/')/g" "${i}"
done

tmpfile=$(mktemp)
chmod u+x ${tmpfile}

k=2
j="1.1,1.2"

echo -n "join -j 1 -o ${j},2.2 $(find ~/datafiles2 -name "*\.txt" | head -${k} | xargs)" > "${tmpfile}"

(( k = k + 1 ))
j="${j},1.${k}"

find ~/datafiles2 -name "*\.txt" | tail -n +${k} | while read i; do 
  echo "join -j 1 -o ${j},2.2 <($(cat ${tmpfile})) ${i}" > ${tmpfile}
  (( k = k + 1 ))
  j="${j},1.${k}"
done 

eval "${tmpfile}" | column -t 
/bin/rm -f "${tmpfile}"

Date        Al    Marcy  Peggy  Steve
2021-01-02  8372  7622   2653   7725
2021-01-04  8726  8725   2763   9272
2021-01-05  6882  4265   8733   3428

Now if you re-run the last join example you will have the “Volume” headers replaced by the correct names.

You may be thinking, this is all nice and very convoluted, but how is it useful? Well, for one, having all your data in one file in a consistent format allows for easier charting. Here’s a gnuplot timeline chart of the output from the last example above:

gnuplot -e "set title 'Sales'; 
set offset 1,1,1,1; 
set autoscale xy; 
set mxtics; set mytics; 
set style line 12 lc rgb '#ddccdd' lt 1 lw 1.5; set style line 13 lc rgb '#ddccdd' lt 1 lw 0.5; 
set grid xtics mxtics ytics mytics back ls 12, ls 13; 
set terminal png size 3840,2160 enhanced font '/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf,22';
set output '/mnt/c/zip/sales.png'; \
set style data labels; 
set xdata time; 
set xlabel 'Date' font '/usr/share/fonts/liberation/LiberationSans-Regular.ttf,20'; 
set ylabel 'Amount' font '/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf,20'; 
set timefmt '%Y-%m-%d';
plot \
'${logfile}' using 1:2 title '$(head -1 ${logfile} | awk '{print $2}')' with lines lw 3, \
'${logfile}' using 1:3 title '$(head -1 ${logfile} | awk '{print $3}')' with lines lw 3, \
'${logfile}' using 1:4 title '$(head -1 ${logfile} | awk '{print $4}')' with lines lw 3, \
'${logfile}' using 1:5 title '$(head -1 ${logfile} | awk '{print $5}')' with lines lw 3"

View the chart

More specific to sysadmin work, if you have an active-active application cluster, you may need to see if certain errors that appear on both cluster nodes ever align in time.

Another scenario: certain performance monitoring tools can show you this or that system parameter – but not this and that at the same time. So you end up with multiple timestamped log files that need to be mashed together.

If you want to see a practical example of the join command being used to process atop log, head over here.