Did you know that there’s a relational database hiding in your Unix shell? There really is, it turns out.

A friend of mine was recently telling me about his discovery of the join command, which allows you to combine data from multiple files that contain tabular data. Let’s take a closer look.

Examples

For these examples, I’ll be using two files.

courses.csv :

CIS 162,Computer Science I CIS 163,Computer Science II CIS 263,Data Structures and Algorithms MTH 225,Discrete Structures I MTH 325,Discrete Structures II CIS 351,Computer Organization STA 215,Introductory Applied Statistics

and enrollments.csv :

Alice,CIS 351 Alice,CIS 263 Alice,MTH 325 Bob,CIS 163 Bob,MTH 225 Ryan,CIS 351 Ryan,MTH 325 Ryan,WRT 350 James,WRT 350

Inner Join

By default, the join command behaves as an INNER JOIN does in SQL. That is, each pair of matching lines from both files will be printed, but no additional lines from either file that may have matched.

To illustrate this, let’s find the names of all of the courses where each student is enrolled:

# join -t , -1 2 -2 1

You'll notice that it doesn't include anything about WRT 350 , which has no description in courses.csv . Similarly, it doesn't list anything about CIS 162 , which nobody is taking.

Since you're likely wondering, I think now's a good time to describe what those flags mean.

Flag Description -t , Use commas as the field separator -1 2 Join using the second column of the first input file -2 1 And the first column of the second file The `join` command wants its inputs to be sorted by the join key. This command sorts the enrollments file based on course number. Also sort the courses file

I'll also point out the <(sort ...) syntax, in case you aren't familiar with it. It's similar to other shell piping operators, except it doesn't pipe the output of the sort command to the standard input of join . Instead, it creates a new file handle, and a UNIX path is substituted into its place for the arguments to join . It looks something like /dev/fd/11 on my machine.

Outer join

Instead, imagine that we want to output all of the lines from enrollments.csv , regardless of whether we have a course description in courses.csv . In SQL, we'd use an outer join for this. Sure enough, join supports this too, via another flag.

The flag to use is -a 1 , which indicates that we should show all lines from File 1. (We could alternatively specify -a 2 if we wanted to include all lines from the second file.)

# join -t , -1 2 -2 1 -a 1

Excluding matches

What if we actually wanted to find all the courses where nobody has enrolled? In SQL, this would be akin to doing an outer join and constraining one table's ID to null.

To achieve this, join provides us with the -v flag. You specify a file number along with it, and then join will only print out the lines from that file that could not be paired with any lines in the other file.

Here it goes:

# join -t , -1 2 -2 1 -v 2 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv) CIS 162,Computer Science I STA 215,Introductory Applied Statistics #

Side note: It's actually possible to specify both files with multiple -v flags. The output is probably less useful to a machine.

Other SQL-like tools in the Unix toolbox

Unix offers other tools that have analogs to what SQL gives us. I've thought of a few:

Command SQL analog awk SELECT grep WHERE uniq SELECT DISTINCT sum SUM()

Can you think of others? Leave some suggestions in the comments!

Conclusion

I'm not really sure when or why I'll end up using this, but it's pretty cool to know it's there. Even after over a decade of using the Unix command line, I'm still surprised by it.