I’d like to join two files in bash on a common column. I want to retain all lines from both files, pairable and unpairable. Unfortunately, using join I could keep the unpairable lines from only one file, e.g. join -1 1 -2 2 -a1 -t" ".
I’d also like to retain all pairings for repeated entries (in the join column) from both files.
For example, if file1 is
x id1 a b
x id1 c d
x id1 d f
x id2 c x
x id3 f v
and the second file is
id1 df cf
id1 ds dg
id2 cv df
id2 as ds
id3 cf cg
the resulting file should be:
x id1 a b df cf
x id1 a b ds dg
x id1 c d df cf
x id1 c d ds dg
x id1 d f df cf
x id1 d f ds dg
x id2 c x cv df
x id2 c x as ds
x id3 f v cf cg
That’s why I’ve always used SAS to do such joins, after sorting on the appropriate columns.
data x;
merge file1 file2;
by common_column;
run;
It works fine but
1. since I use Ubuntu most of the time, I have to switch to Windows to merge data in SAS.
2. most importantly, SAS can truncate data entries that are too long.
That’s why I’d prefer to join my files in bash, but I don’t know the appropriate command.
Could someone help me, or direct me to an appropriate resource?
According to join’s man page, -a <filenum> retains all unpairable lines from file <filenum> (1 or 2). So just add -a1 -a2 to your command line and you should be done.
Is this what you were looking for?
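A minimal sketch of what -a1 -a2 does, using two small hypothetical files (left.txt and right.txt, with made-up names and data, not the files from the question), each already sorted on the join field:

```shell
# Hypothetical sample data, sorted on the first (join) field.
workdir=$(mktemp -d)
printf '%s\n' 'id1 a' 'id2 b' > "$workdir/left.txt"
printf '%s\n' 'id2 x' 'id3 y' > "$workdir/right.txt"

# -a1 and -a2 additionally print the unpairable lines of file 1 and file 2.
full_join=$(LC_ALL=C join -a1 -a2 "$workdir/left.txt" "$workdir/right.txt")
printf '%s\n' "$full_join"
```

Here id1 occurs only in the first file and id3 only in the second, and both lines survive next to the paired id2 line.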
Edit:
Since you provided more detail, here is how to produce your desired output. Note that my file a is your first file and my file b is your second file, so I had to change -1 1 -2 2 to -1 2 -2 1 to join on the id. I also added a field list to format the output; note that ‘0’ is the join field in it. This produces what you’ve given. Add -a1 -a2 to retain unpairable lines from both files; you then get two more lines, from which you can guess my test data.
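The commands described here can be sketched as follows, under some assumptions: the files are named a and b as in this answer and hold the question's data, the field list passed to -o is a guess that reproduces the column order of the desired output, and the two extra lines (id4 and id5) are hypothetical unpairable entries added only for illustration.

```shell
workdir=$(mktemp -d)
# The question's two files, already sorted on the id column.
cat > "$workdir/a" <<'EOF'
x id1 a b
x id1 c d
x id1 d f
x id2 c x
x id3 f v
EOF
cat > "$workdir/b" <<'EOF'
id1 df cf
id1 ds dg
id2 cv df
id2 as ds
id3 cf cg
EOF

# Join field 2 of a with field 1 of b; in the -o list, 0 is the join field.
fields='1.1,0,1.3,1.4,2.2,2.3'
paired=$(LC_ALL=C join -1 2 -2 1 -o "$fields" "$workdir/a" "$workdir/b")
printf '%s\n' "$paired"

# Hypothetical unpairable lines, one per file, appended in sorted position.
echo 'y id4 g h' >> "$workdir/a"
echo 'id5 zz zy' >> "$workdir/b"
with_unpaired=$(LC_ALL=C join -1 2 -2 1 -a1 -a2 -o "$fields" "$workdir/a" "$workdir/b")
printf '%s\n' "$with_unpaired"
```

The first command prints the nine lines requested in the question. The second prints two more, in which every field that has no counterpart in the other file is empty, so the columns no longer line up.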
That is rather unreadable, since any left-out field is just a space. So let’s replace them with a '-', using join’s -e option.
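A sketch of that replacement, assuming small hypothetical files a and b (the data is made up for illustration) and the same guessed field list as the question's column order suggests. The -e option sets the string printed for missing fields and works together with the -o field list:

```shell
workdir=$(mktemp -d)
# Hypothetical data: id2 and id4 exist only in a, id3 only in b.
printf '%s\n' 'x id1 a b' 'x id2 c x' 'y id4 g h' > "$workdir/a"
printf '%s\n' 'id1 df cf' 'id3 cf cg' > "$workdir/b"

# -e '-' fills every field missing from an unpairable line with a dash.
filled=$(LC_ALL=C join -1 2 -2 1 -a1 -a2 -e '-' \
  -o 1.1,0,1.3,1.4,2.2,2.3 "$workdir/a" "$workdir/b")
printf '%s\n' "$filled"
# x id1 a b df cf
# x id2 c x - -
# - id3 - - cf cg
# y id4 g h - -
```

Every output line now has six fields, so the result can be fed to column -t or awk without the columns shifting.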