Representing relations

In mathematics, a relation is a set of tuples.

[A tuple is an ordered collection of values, $(v_1,\ldots,v_n)$.]

For example, $\{({\tt Bob}, 31), ({\tt Judy}, 32)\}$ is a relation.

In database theory, a relation is a set of tuples with an assigned name for each column; that is, a relation is a table.

For the earlier relation, we could define a header tuple $({\tt name}, {\tt age})$ that names each column.

We could then represent the relation explicitly as a table:

name age Bob 31 Judy 32

Relational algebra is a theory for manipulating relations whose power is equivalent to SQL and relational calculus.

Remarkably, relational algebra has only six primitive operations.

I define the six primitives below, but if you're looking for a comprehensive work on relational theory, particularly as it relates to modern databases, I recommend Date's SQL and Relational Theory :

Relations in Unix

Many Unix commands interpret files like relations: each line is a tuple.

CSV or "comma-separated value" files are a crude, generic way to encode a relation. For example, see the MaxMind GeoIP database:

"1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia" "1.0.1.0","1.0.3.255","16777472","16778239","CN","China"

The password and group files for a Unix system ( /etc/passwd and /etc/group ), which store account and login information, are relations with elements of individual tuples separated by colons ( : ):

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false root:*:0:0:System Administrator:/var/root:/bin/sh daemon:*:1:1:System Services:/var/root:/usr/bin/false

The /etc/hosts file, which can fix a hostname to an IP address, is a space-separated relation:

127.0.0.1 localhost 255.255.255.255 broadcasthost ::1 localhost fe80::1%lo0 localhost

The /etc/resolv.conf file, which specifies domain name information, is a space-separated relation:

domain hsd1.ut.comcast.net. nameserver 208.67.220.220 nameserver 208.67.222.222

The command netstat -nr produces tables of routing information.

Destination Gateway Flags Refs Use Netif 127 127.0.0.1 UCS 0 0 lo0 127.0.0.1 127.0.0.1 UH 5 151051 lo0 224.0.0/4 lo0 UmCS 1 0 lo0 224.0.0.251 lo0 UHmW3I 0 0 lo0

The command traceroute produces relation-like data:

5 69.139.247.14 16.560 ms 26.933 ms 11.521 ms 6 68.86.90.125 74.433 ms 40.330 ms 57.565 ms 7 68.86.87.30 54.590 ms 52.155 ms 55.322 ms

Many log files are, in effect, giant relations as well.

Union

Mathematically, union $(\cup)$ combines the contents of two sets:

\[ A \cup B = \{ x \mathrel{|} x \in A \text{ or } x \in B \} \]

In the shell, union is straightforward--it's cat :

$ cat relation-1 relation-2 > unioned-relation

After any operation in which duplicate entries are a possibility, compose with sort and uniq :

$ cat relation-1 relation-2 | sort | uniq > unioned-relation

Since a relation is a set, duplicate entries are not allowed.

Many other programs accept multiple files as arguments and print out their contents: sed , grep , cut , awk .

Thus, these commands can combine selection (and projection) with union.

To "union" the output of one command with another, use a sequencing semicolon with an append:

$ command-1 > relation; command-2 >> relation

Selection

Mathematically, selection $(\sigma_\phi)$ is a filtering operation that applies a predicate $\phi$ to each member of a set, retaining only those for which the predicate is true:

\[ \sigma_\phi(A) = \{ x \mathrel{|} x \in A \text{ and } \phi(x) \text{ is true } \} \]

Many Unix commands in serve as selectors (and simultaneously as union); that is, many programs can filter a file, line by line.

The most common selectors are sed , grep and awk .

For example, awk can select entries from /etc/passwd where the user id and the primary group id are not equal:

awk -F ":" '{ if ( $3 != $4 ) print }' /etc/passwd

Projection

Mathematically, projection $(\pi_{c_1,\ldots,c_n})$ is a reduction operator that retains the specified columns $(c_1,\ldots,c_n)$ in a relation and discards the rest.

If a relation is table, then $c_1,\ldots,c_n$ can be the names of the columns (and the order in which to preserve them).

Otherwise, $c_1,\ldots,c_n$ can be column numbers so that:

\[ \pi_{i_1,\ldots,i_n}(R) = \{ (x_{i_1}, \ldots, x_{i_n}) \mathrel{|} (x_1,\ldots,x_n) \in R \} \]

The Unix command cut is a projection-like command.

For example, cut can project just the user names and shells out of /etc/passwd :

$ cut -d ":" -f 1,7 /etc/passwd

The Unix commands sed and awk also behave like projectors.

For example, cut cannot reorder colums, but awk can:

$ awk -F":" '{ print $7 ":" $1 }' /etc/passwd

If the syntactic structure of each tuple is more complex, then sed can pull it apart with a regular expression.

Rename

For tables, the rename operation $(\rho_{c/c'})$ changes the name of column $c'$ to $c$.

Since most Unix tables don't specify headers, or the structure is implicit, there is no equivalent of or need for rename.

Columns are implicitly named positionally.

awk can easily reorder columns.

Cartesian product

Cartesian product is the source of much power for relational algebra, since it is the key ingredient in joins--the primitive operation that combines two tables together.

Relational Cartesian product $(\times)$ fully combines two relations:

\[ A \times B = \{ (a_1, \ldots, a_n, b_1, \ldots, b_m) \mathrel{|} (a_1,\ldots,a_n) \in A \text{ and } (b_1,\ldots,b_m) \in B \} \]

Surprisingly, there is no shell equivalent of Cartesian product.

Fortunately, it's not hard to write a cartesian script:

#!/bin/bash delim="," # CSV by default # Parse flagged arguments: while getopts "td:" flag do case $flag in d) delim=$OPTARG;; t) delim="\t";; ?) exit;; esac done # Delete the flagged arguments: shift $(($OPTIND -1)) # Remaining args now in $* # Now join $1 and $2 while read a; do while read b; do echo "$a${delim}$b"; done < $2; done < $1

Update: Travis Johnson wrote in to say that the Unix command paste seems to provide functionality of Cartesian product, but it doesn't quite do that. The paste joins corresponding lines, but does not compute a product:

% cat f1 a b c % cat f2 d e f % paste -d "," f1 f2 a,d b,e c,f % cartesian f1 f2 a,d a,e a,f b,d b,e b,f c,d c,e c,f

Difference

Mathematically, set difference $(-)$ removes the elements of a set that are present in another set:

\[ A - B = \{ a \mathrel{|} a \in A \text{ and } a

ot\in B \} \]

The command diff reports the difference between two files, but it does not perform relational set difference.

We need to implement relational set difference.

It is simpler to implement difference if we assume the existence of a membership tester, memberp .

The command memberp file line exits successfully if the file file contains the line line, and it fails otherwise.

With access to memberp , difference becomes:

#!/bin/bash while read a; do if ! memberp $2 "$a"; then echo "$a" fi done < $1

Membership

A membership script, memberp , searches line by line:

#!/bin/bash # Usage: memberp file line file=$1 shift while read a; do if [ "$a" = "$*" ]; then exit 0 fi done < $file exit -1

An alternative: comm

The command comm can also compute the difference between two files A and B:

comm -23 <(sort A) <(sort B)

Join

A join is a synthetic operation composed of selection, projection and product.

We could construct it from

A common join is equijoin, where the product of two relations is filtered based on the equality of two columns.

To peform an equijoin, we need two relations and columns to compare.

The command equijoin [ -d <delim> | -t ] rel1 rel2 col1 col2 uses awk to combine and compare two tables:

#!/bin/bash delim="," # CSV by default # Parse flagged arguments: while getopts "td:" flag do case $flag in d) delim=$OPTARG;; t) delim="\t";; ?) exit;; esac done # Delete the flagged arguments: shift $(($OPTIND -1)) f1=$1 f2=$2 col1=$3 col2=$4 cols=`awk -F "${delim}" '{ print NF ; exit }' $f1` while read a do while read b do echo "$a${delim}$b" done < $f2 done < $f1 | awk -F "${delim}" "{ if ( \$${col1} == \$$(( col2 + cols )) ) print }"

Example: Deleting a list of bad users

We're given a list of bad users to remove from /etc/passwd .

They're in the file bad.db :

matt bob

Let's use a simplified /etc/passwd for this example:

root:*:0:0:The Admin:/root:/bin/sh matt:*:500:500:Matt:/home/matt:/bin/bash john:*:501:501:John:/home/john:/bin/bash bob:*:502:502:Bob:/home/bob:/bin/bash

How can we delete them relationally?

Start with a product, cartesian -d ":" bad.db /etc/passwd > badpasswd.db :

matt:root:*:0:0:The Admin:/root:/bin/sh matt:matt:*:500:500:Matt:/home/matt:/bin/bash matt:john:*:501:501:John:/home/john:/bin/bash matt:bob:*:502:502:Bob:/home/bob:/bin/bash bob:root:*:0:0:The Admin:/root:/bin/sh bob:matt:*:500:500:Matt:/home/matt:/bin/bash bob:john:*:501:501:John:/home/john:/bin/bash bob:bob:*:502:502:Bob:/home/bob:/bin/bash

Then, select rows where the bad user matches the account name, with awk -F: '{ if ( $1 == $2 ) print }' badpasswd.db > offenders.db :

matt:matt:*:500:500:Matt:/home/matt:/bin/bash bob:bob:*:502:502:Bob:/home/bob:/bin/bash

Now, project out the columns that make the result compatible with the format of /etc/passwd with cut -d ":" -f2- offenders.db > kill.db :

matt:*:500:500:Matt:/home/matt:/bin/bash bob:*:502:502:Bob:/home/bob:/bin/bash

Finally, use difference /etc/passwd kill.db to generate a new password file sans malefactors:

root:*:0:0:The Admin:/root:/bin/sh john:*:501:501:John:/home/john:/bin/bash

Exercise

Some of these scripts have quadratic time complexity.

Add a -s flag to equijoin and difference that assumes the inputs have been sorted by sort and produces a sorted output.

Show that -s can achieve better time complexity.

Then, rewrite the account-deletion example to eliminate Cartesian product and use fast equijoin instead.

Good resources

Sed and awk have become a lost art. They're the only languages I know that frequently beat perl in semantic density. If you haven't learned them yet, you'll be impressed with what they can do for you.

I recommend the O'Reilly classic, sed & awk :

I also recommend spending an hour or two going through the advanced bash scripting guide and bash pitfalls. It's worth the time invested.

If you want a solid theoretical understanding of databases, I recommend Date's SQL and Relational Theory :

Related pages