Generate a ERM from a PostgreSQL database schema

Posted on July 20, 2017

Creating a ERM is one of the first tasks, when a database is designed. During implementation, you have to sync the model with the schema. This manual task can be very annoying. With some database knowledge and some Linux standard tools, this task can be automated.

version="1.0" encoding="UTF-8" standalone="no" ? %3 cluster_0 Database cluster_1 bash schema database schema awk awk processing schema->awk dot graphviz processing awk->dot

get the schema

The information_schema exists in all databases.

$ psql -U postgres -c "SELECT table_name, column_name, data_type, udt_name \ > FROM information_schema.columns WHERE table_schema = 'test'" | head table_name | column_name | data_type | udt_name ------------------------+--------------------------+-----------------------------+----------------------- person_to_email | id_person | uuid | uuid person_to_email | id_email | uuid | uuid person_to_email | communication_type | USER-DEFINED | communication_type person_to_email | is_primary_email_address | boolean | bool person_to_email | created_at | timestamp without time zone | timestamp person_to_email | updated_at | timestamp without time zone | timestamp person_view | first_name | character varying | varchar person_view | last_name | character varying | varchar

These are all the columns from our test schema. We still need some information about references between the relations. For the next processing step, all of the necessary column data should be in one result record.

With the key column constraints

$ psql -U postgres -c "SELECT constraint_name, table_name, column_name \ > FROM information_schema.key_column_usage WHERE table_schema = 'test'" | head constraint_name | table_name | column_name --------------------------------------------+------------------------+----------------------- person_to_email_id_email_fkey | person_to_email | id_email person_to_email_id_person_fkey | person_to_email | id_person person_to_email_pkey | person_to_email | id_person person_to_email_pkey | person_to_email | id_email person_pkey | person | id address_pkey | address | id employee_id_person_fkey | employee | id_person employee_pkey | employee | id

and a list of table_constraints,

$ psql -U postgres -c "SELECT constraint_name, table_name, constraint_type \ > FROM information_schema.table_constraints WHERE table_schema = 'test' \ > AND constraint_type IN ('FOREIGN KEY','PRIMARY KEY')" | head constraint_name | table_name | constraint_type --------------------------------------------+------------------------+----------------- person_pkey | person | PRIMARY KEY address_pkey | address | PRIMARY KEY person_to_address_pkey | person_to_address | PRIMARY KEY person_to_address_id_person_fkey | person_to_address | FOREIGN KEY person_to_address_id_address_fkey | person_to_address | FOREIGN KEY email_pkey | email | PRIMARY KEY person_to_email_pkey | person_to_email | PRIMARY KEY person_to_email_id_person_fkey | person_to_email | FOREIGN KEY

we can build our first query.

$ psql -U postgres -c "SELECT c.table_name, > c.column_name, > c.data_type, > c.udt_name, > is_nullable, > c.character_maximum_length, > (SELECT tc.constraint_type FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'PRIMARY KEY' LIMIT 1 > ) primary_key > FROM information_schema.columns c > JOIN information_schema.tables t on c.table_name = t.table_name > WHERE c.table_schema = 'test' AND t.table_type = 'BASE TABLE'" | head table_name | column_name | data_type | udt_name | is_nullable | character_maximum_length | primary_key ------------------------+--------------------------+-----------------------------+-----------------------+-------------+--------------------------+------------- person_to_email | id_person | uuid | uuid | NO | | PRIMARY KEY person_to_email | id_email | uuid | uuid | NO | | PRIMARY KEY person_to_email | communication_type | USER-DEFINED | communication_type | NO | | person_to_email | is_primary_email_address | boolean | bool | NO | | person_to_email | created_at | timestamp without time zone | timestamp | NO | | person_to_email | updated_at | timestamp without time zone | timestamp | NO | | person | id | uuid | uuid | NO | | PRIMARY KEY person | first_name | character varying | varchar | YES | 512 |

Now we need the foreign keys and the target of the relation. These information can be fetched from the constraint_column_usage view.

$ psql -U postgres -c "SELECT table_name, column_name, constraint_name FROM information_schema.constraint_column_usage \ > WHERE table_schema = 'test'" | head table_name | column_name | constraint_name ------------------------+-----------------------+-------------------------------------------- person | id | person_pkey address | id | address_pkey person_to_address | id_person | person_to_address_pkey person_to_address | id_address | person_to_address_pkey person | id | person_to_address_id_person_fkey address | id | person_to_address_id_address_fkey email | id | email_pkey person_to_email | id_person | person_to_email_pkey

With this we are coming to our next query.

$ psql -U postgres -c "SELECT c.table_name, > c.column_name, > c.data_type, > c.udt_name, > c.is_nullable, > c.character_maximum_length, > (SELECT tc.constraint_type FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'PRIMARY KEY' LIMIT 1 > ) primary_key, > (SELECT tc.constraint_type FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 > ) foreign_key, > (SELECT ccu.table_name FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > JOIN information_schema.constraint_column_usage ccu > ON tc.constraint_name = ccu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 > ) reference_table, > (SELECT ccu.column_name FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > JOIN information_schema.constraint_column_usage ccu > ON tc.constraint_name = ccu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 > ) reference_column > > FROM information_schema.columns c > JOIN information_schema.tables t on c.table_name = t.table_name > WHERE c.table_schema = 'test' AND t.table_type = 'BASE TABLE'" | head table_name | column_name | data_type | udt_name | is_nullable | character_maximum_length | primary_key | foreign_key | reference_table | reference_column ------------------------+--------------------------+-----------------------------+-----------------------+-------------+--------------------------+-------------+-------------+-----------------+------------------ person_to_email | id_person | uuid | uuid | NO | | PRIMARY KEY | FOREIGN KEY | person | id person_to_email | id_email | uuid | uuid | NO | | PRIMARY KEY | FOREIGN KEY | email | id person_to_email | communication_type | USER-DEFINED | communication_type | NO | | | | | person_to_email | is_primary_email_address | boolean | bool | NO | | | | | person_to_email | created_at | timestamp without time zone | timestamp | NO | | | | | person_to_email | updated_at | timestamp without time zone | timestamp | NO | | | | | person | id | uuid | uuid | NO | | PRIMARY KEY | | | person | first_name | character varying | varchar | YES | 512 | | | |

There is one thing left. It would be nice, if you can see the enum values within the ERM. Let’s look, what we can do about it.

$ psql -U postgres -c "SELECT e.enumlabel, t.typname FROM pg_type t \ > JOIN pg_enum e ON t.oid = e.enumtypid \ > JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace" | head enumlabel | typname ------------------------+----------------------- work | address_type invoice | address_type delivery | address_type private | address_type organization | communication_type private | communication_type work | communication_type cellular_network | communication_network

This can be matched on the column udt_name .

Now we have our final SQL statement for now.

$ psql -U postgres -c "SELECT c.table_name, > c.column_name, > c.data_type, > c.udt_name, > (SELECT string_agg(e.enumlabel::TEXT, ', ') > FROM pg_type t > JOIN pg_enum e on t.oid = e.enumtypid > JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace WHERE t.typname = c.udt_name) enum_values, > c.is_nullable, > c.character_maximum_length, > (SELECT tc.constraint_type FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'PRIMARY KEY' LIMIT 1 > ) primary_key, > (SELECT tc.constraint_type FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 > ) foreign_key, > (SELECT ccu.table_name FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > JOIN information_schema.constraint_column_usage ccu > ON tc.constraint_name = ccu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 > ) reference_table, > (SELECT ccu.column_name FROM information_schema.key_column_usage kcu > JOIN information_schema.table_constraints tc > ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name > JOIN information_schema.constraint_column_usage ccu > ON tc.constraint_name = ccu.constraint_name > WHERE c.column_name = kcu.column_name > AND c.table_name = kcu.table_name > AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 > ) reference_column > > FROM information_schema.columns c > JOIN information_schema.tables t on c.table_name = t.table_name > WHERE c.table_schema = 'test' AND t.table_type = 'BASE TABLE'" | head table_name | column_name | data_type | udt_name | enum_values | is_nullable | character_maximum_length | primary_key | foreign_key | reference_table | reference_column ------------------------+--------------------------+-----------------------------+-----------------------+-------------------------------------------------------------------------------------+-------------+--------------------------+-------------+-------------+-----------------+------------------ person_to_email | id_person | uuid | uuid | | NO | | PRIMARY KEY | FOREIGN KEY | person | id person_to_email | id_email | uuid | uuid | | NO | | PRIMARY KEY | FOREIGN KEY | email | id person_to_email | communication_type | USER-DEFINED | communication_type | work, private, organization | NO | | | | | person_to_email | is_primary_email_address | boolean | bool | | NO | | | | | person_to_email | created_at | timestamp without time zone | timestamp | | NO | | | | | person_to_email | updated_at | timestamp without time zone | timestamp | | NO | | | | | person | id | uuid | uuid | | NO | | PRIMARY KEY | | | person | first_name | character varying | varchar | | YES | 512 | | | |

The string_agg function is used to concentrate the enum values.

a look ahead

Before starting to work with the raw schema data, we take a look at our goal. We use graphviz for drawing the ERM. My goal is to get close to a ERM visualization.

digraph { node [shape=Mrecord; fontname="Courier New" style="filled, bold" fillcolor="white", fontcolor="black"]; customer [shape=plaintext; label=< <TABLE BORDER="1" CELLBORDER="0" CELLSPACING="0" CELLPADDING="3"> <TR> <TD COLSPAN="5" BGCOLOR="black"><FONT color="white"><B>customer</B></FONT></TD> </TR> <TR> <TD>column</TD> <TD>type</TD> <TD>nullable</TD> <TD>PK</TD> <TD>FK</TD> </TR> <TR> <TD port="f1">id</TD> <TD>uuid</TD> <TD>NO</TD> <TD>PRIMARY KEY</TD> <TD></TD> </TR> <TR> <TD port="f2">id_person</TD> <TD>uuid</TD> <TD>NO</TD> <TD></TD> <TD>FOREIGN KEY</TD> </TR> <TR> <TD port="f3">customer_number</TD> <TD>varchar</TD> <TD>NO</TD> <TD></TD> <TD></TD> </TR> <TR> <TD port="f4">json_view</TD> <TD>jsonb</TD> <TD>YES</TD> <TD></TD> <TD></TD> </TR> <TR> <TD port="f5">created_at</TD> <TD>timestamp</TD> <TD>NO</TD> <TD></TD> <TD></TD> </TR> <TR> <TD port="f6">updated_at</TD> <TD>timestamp</TD> <TD>NO</TD> <TD></TD> <TD></TD> </TR> </TABLE>>] }

version="1.0" encoding="UTF-8" standalone="no" ? %3 customer customer column type nullable PK FK id uuid NO PRIMARY KEY id_person uuid NO FOREIGN KEY customer_number varchar NO json_view jsonb YES created_at timestamp NO updated_at timestamp NO

The table column layout fits our needs for our relation. The port attribute is important for the edges.

If we have a person and a customer , adding

customer -> person;

will create an edge for these relations.

version="1.0" encoding="UTF-8" standalone="no" ? %3 person person column type nullable PK FK id uuid NO PRIMARY KEY first_name varchar YES last_name varchar YES birth_date date YES notes varchar YES website varchar YES json_view jsonb YES created_at timestamp NO updated_at timestamp NO customer customer column type nullable PK FK id uuid NO PRIMARY KEY id_person uuid NO FOREIGN KEY customer_number varchar NO json_view jsonb YES created_at timestamp NO updated_at timestamp NO customer->person

preparations

First we export the schema to a file (e.g. schema.txt ). This file will be used for the awk processing.

The first two lines of the head

$ head -n 5 schema.txt table_name | column_name | data_type | udt_name | enum_values | is_nullable | character_maximum_length | primary_key | foreign_key | reference_table | reference_column ------------------------+--------------------------+-----------------------------+-----------------------+-------------------------------------------------------------------------------------+-------------+--------------------------+-------------+-------------+-----------------+------------------ person | id | uuid | uuid | | NO | | PRIMARY KEY | | | person | first_name | character varying | varchar | | YES | 512 | | | | person | last_name | character varying | varchar | | YES | 512 | | | |

must be removed. This can be done by

$ head -n5 schema.txt | tail -n+3 person | id | uuid | uuid | | NO | | PRIMARY KEY | | | person | first_name | character varying | varchar | | YES | 512 | | | | person | last_name | character varying | varchar | | YES | 512 | | | |

The last two lines (one blank line) of the tail

article | status | USER-DEFINED | article_status | active, inactive | NO | | | | | article | created_at | timestamp without time zone | timestamp | | NO | | | | | article | updated_at | timestamp without time zone | timestamp | | NO | | | | | (109 Zeilen)

can be removed with

$ tail -n 5 schema.txt | head -n -2 article | status | USER-DEFINED | article_status | active, inactive | NO | | | | | article | created_at | timestamp without time zone | timestamp | | NO | | | | | article | updated_at | timestamp without time zone | timestamp | | NO | | | | |

Now we have a record in every line.

get started with awk

An awk program has the following structure.

version="1.0" encoding="UTF-8" standalone="no" ? %3 begin BEGIN middle middle part begin->middle end END middle->end

The BEGIN and the END part is executed once. The middle part is executed for every data record.

The BEGIN part introduces the graph.

BEGIN { print("digraph {") print("graph [overlap=false;splines=true;regular=true];") print("node [shape=Mrecord; fontname=\"Courier New\" style=\"filled, bold\" fillcolor=\"white\", fontcolor=\"black\"];") }

The middle part must print every graphviz table for every relation in the schema.

{ if (length(currentTableName) > 0 && $1 != currentTableName) { print("</TABLE>>]") } if ($1 != currentTableName) { print("") print(trim($1) " [shape=plaintext; label=<") print("<TABLE BORDER=\"1\" CELLBORDER=\"0\" CELLSPACING=\"0\" CELLPADDING=\"3\">") print("<TR>") print("<TD COLSPAN=\"5\" BGCOLOR=\"black\"><FONT color=\"white\"><B>" trim($1) "</B></FONT></TD>") print("</TR>") print("<TR>") print("<TD>column</TD>") print("<TD>type</TD>") print("<TD>nullable</TD>") print("<TD>PK</TD>") print("<TD>FK</TD>") print("</TR>") port = 0 } print("<TR>") print("<TD port=\"f" ++port "\">"trim($2)"</TD>") print("<TD>"trim($4)"</TD>") print("<TD>"trim($6)"</TD>") print("<TD>"trim($8)"</TD>") print("<TD>"trim($9)"</TD>") print("</TR>") currentTableName = $1 }

The END part closes the last TABLE and closes the graph.

END { print("</TABLE>>]") print("}") }

This script will generate graphviz tables for all relations in the database schema.

version="1.0" encoding="UTF-8" standalone="no" ? %3 purchase_order purchase_order column type nullable PK FK id uuid NO PRIMARY KEY id_customer uuid NO FOREIGN KEY purchase_order_number varchar NO status purchase_order_status NO frozen_purchase_order jsonb YES changes jsonb YES created_at timestamp NO updated_at timestamp NO purchase_order_item purchase_order_item column type nullable PK FK id uuid NO PRIMARY KEY id_article uuid NO FOREIGN KEY id_price uuid NO FOREIGN KEY amount int4 NO changes jsonb YES created_at timestamp NO updated_at timestamp NO

For the edges, the middle part must be extended with

if (trim($9) == "FOREIGN KEY") { edges[++edgeCounter] = trim($1) " -> " trim($10) ";" }

This takes the current table name and point it to the target relation.

Some relations use a enum as a datatype. It would be nice, if this could be visible in the ERM.

if (length(trim($5)) > 0) { nodes[++nodeCounter] = trim($4) "[shape=\"box\", style=\"rounded\", label=<<B>" trim($4) " (enum)</B><BR/>" trim($5) ">];" edges[++edgeCounter] = trim($1) ":f" port " -> " trim($4) ";" }

This adds new enum nodes to the graph and points it directly to the column used by the enum.

In the END part, the new nodes and edges must be added.

for (node in nodes) { print(nodes[++i]) } i = 0 for (edge in edges){ print(edges[++i]) }

Currently the enum values are comma separated. For the graph drawing it is easier, to have short lines.

A sed 's/, /<BR\/>/g' before script start will replace the commas with <BR/> .

final

The complete script glued together

#!/bin/bash psql -U postgres -c " SELECT c.table_name, c.column_name, c.data_type, c.udt_name, (SELECT string_agg(e.enumlabel::TEXT, ', ') FROM pg_type t JOIN pg_enum e on t.oid = e.enumtypid JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace WHERE t.typname = c.udt_name) enum_values, c.is_nullable, c.character_maximum_length, (SELECT tc.constraint_type FROM information_schema.key_column_usage kcu JOIN information_schema.table_constraints tc ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name WHERE c.column_name = kcu.column_name AND c.table_name = kcu.table_name AND tc.constraint_type = 'PRIMARY KEY' LIMIT 1 ) primary_key, (SELECT tc.constraint_type FROM information_schema.key_column_usage kcu JOIN information_schema.table_constraints tc ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name WHERE c.column_name = kcu.column_name AND c.table_name = kcu.table_name AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 ) foreign_key, (SELECT ccu.table_name FROM information_schema.key_column_usage kcu JOIN information_schema.table_constraints tc ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name JOIN information_schema.constraint_column_usage ccu ON tc.constraint_name = ccu.constraint_name WHERE c.column_name = kcu.column_name AND c.table_name = kcu.table_name AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 ) reference_table, (SELECT ccu.column_name FROM information_schema.key_column_usage kcu JOIN information_schema.table_constraints tc ON tc.table_name = c.table_name AND tc.constraint_name = kcu.constraint_name JOIN information_schema.constraint_column_usage ccu ON tc.constraint_name = ccu.constraint_name WHERE c.column_name = kcu.column_name AND c.table_name = kcu.table_name AND tc.constraint_type = 'FOREIGN KEY' LIMIT 1 ) reference_column FROM information_schema.columns c JOIN information_schema.tables t on c.table_name = t.table_name WHERE c.table_schema = 'test' AND t.table_type = 'BASE TABLE'" | sed 's/, /<BR\/>/g' | head -n -2 | tail -n+3 | awk -F"|" ' function ltrim(s) { sub(/^[ \t\r

]+/, "", s); return s } function rtrim(s) { sub(/[ \t\r

]+$/, "", s); return s } function trim(s) { return rtrim(ltrim(s)); } BEGIN { print("digraph {") print("graph [overlap=false;splines=true;regular=true];") print("node [shape=Mrecord; fontname=\"Courier New\" style=\"filled, bold\" fillcolor=\"white\", fontcolor=\"black\"];") } { if (length(currentTableName) > 0 && $1 != currentTableName) { print("</TABLE>>]") } if ($1 != currentTableName) { print("") print(trim($1) " [shape=plaintext; label=<") print("<TABLE BORDER=\"1\" CELLBORDER=\"0\" CELLSPACING=\"0\" CELLPADDING=\"3\">") print("<TR>") print("<TD COLSPAN=\"5\" BGCOLOR=\"black\"><FONT color=\"white\"><B>" trim($1) "</B></FONT></TD>") print("</TR>") print("<TR>") print("<TD>column</TD>") print("<TD>type</TD>") print("<TD>nullable</TD>") print("<TD>PK</TD>") print("<TD>FK</TD>") print("</TR>") port = 0 } print("<TR>") print("<TD port=\"f" ++port "\">"trim($2)"</TD>") print("<TD>"trim($4)"</TD>") print("<TD>"trim($6)"</TD>") print("<TD>"trim($8)"</TD>") print("<TD>"trim($9)"</TD>") print("</TR>") if (trim($9) == "FOREIGN KEY") { edges[++edgeCounter] = trim($1) " -> " trim($10) ";" } if (length(trim($5)) > 0) { nodes[++nodeCounter] = trim($4) "[shape=\"box\", style=\"rounded\", label=<<B>" trim($4) " (enum)</B><BR/>" trim($5) ">];" edges[++edgeCounter] = trim($1) ":f" port " -> " trim($4) ";" } currentTableName = $1 } END { print("</TABLE>>]") for (node in nodes) { print(nodes[++i]) } i = 0 for (edge in edges){ print(edges[++i]) } print("}") }'

I have put the result in a external file because the graph has become to big in size. Not so bad, I think.

Update 2017-10-12:

I added a schema.sh script to my script collection.