A Quick Look at PostgreSQL Source Code A Feel for the Overall Structure and Style of the Project

To get and compile the code, run the normal commands:

$ git clone git://git.postgresql.org/git/postgresql.git $ ./configure $ make

(Interestingly, the Makefile is actually committed into the repository, so you might not even need to run the configure script.)

Before looking at the code, try to imagine what the overall structure might be, based on what it needs to accomplish. There needs to be a network component, which hands data to a parser, then maybe an optimizer, and some code to actually run the queries. There is of course a doc directory and README files, but let's jump right in to the src directory.

Structure is the key to understanding when it comes to code, and a diagram might help, but here we're trying to understand the structure looking at the code itself.

The broad division in PostgreSQL is between the frontend (libraries/CLI making requests over the network), and the backend. The backend is probably more interesting. Run $ cd src/backend; ls and we find this directory listing (among others):

Makefile common.mk main po replication access executor nls.mk port rewrite bootstrap foreign nodes postgres snowball catalog lib optimizer postmaster storage commands libpq parser regex tcop tsearch utils

cd snowball; cat README

Woah, what is snowball?. Looks like a grammatical library for natural language processing, designed to find the stem of words: for example, from the word "eating" it will find "eat." Apparently PostgreSQL lets you search text based on the stem of the word. That's cool, learning already.

Optimizer and parser are easy enough to figure out, what about postmaster? Does that sound like a network layer? less postmaster/postmaster.c and we see this beautiful comment:

/*------------------------------------------------------------------------- * * postmaster.c * This program acts as a clearing house for requests to the * POSTGRES system. Frontend programs send a startup message * to the Postmaster and the postmaster uses the info in the * message to setup a backend process. * * The postmaster also manages system-wide operations such as * startup and shutdown. The postmaster itself doesn't do those * operations, mind you --- it just forks off a subprocess to do them * at the right times. It also takes care of resetting the system * if a backend crashes. *

grep socket * -r

pqcomm.c

/*------------------------ *INTERFACE ROUTINES * *setup/teardown: * StreamServerPort - Open postmaster's server port * StreamConnection - Create new connection with client * StreamClose - Close a client/backend connection * TouchSocketFiles - Protect socket files against /tmp cleaners * pq_init - initialize libpq at backend startup * pq_comm_reset - reset libpq during error recovery * pq_close - shutdown libpq at backend exit * *low-level I/O: * pq_getbytes - get a known number of bytes from connection * pq_getstring - get a null terminated string from connection * pq_getmessage - get a message with length word from connection * pq_getbyte - get next byte from connection * pq_peekbyte - peek at next byte from connection * pq_putbytes - send bytes to connection (flushed by pq_flush) * pq_flush - flush pending output * pq_flush_if_writable - flush pending output if non-blocking writeable * pq_getbyte_if_available - get a byte if available without blocking * *message-level I/O (and old-style-COPY-OUT cruft): * pq_putmessage - send a normal message (suppressed in COPY OUT mode) * pq_putmessage_noblock - buffer a normal message (suppressed in COPY OUT) * pq_startcopyout - inform libpq that a COPY OUT transfer is beginning * pq_endcopyout - end a COPY OUT transfer * *------------------------ */

lib

$ ls lib Makefile bipartite_match.c objfiles.txt stringinfo.c README hyperloglog.c pairingheap.c binaryheap.c ilist.c rbtree.c

rbtree.c

/* * rb_leftmost: fetch the leftmost (smallest-valued) tree node. * Returns NULL if tree is empty. * * Note: in the original implementation this included an unlink step, but * that's a bit awkward. Just call rb_delete on the result if that's what * you want. */ RBNode * rb_leftmost(RBTree *rb) { RBNode *node = rb->root; RBNode *leftmost = rb->root; while (node != RBNIL) { leftmost = node; node = node->left; } if (leftmost != RBNIL) return leftmost; return NULL; }

So PostgreSQL has a high-level message concept, excellent.leads to, which contains the low-level network routines. It holds another nice comment:Moving back up, thedirectory looks interesting, I wonder what's in it?Fascinating, PostgreSQL uses a red-black tree. I haven't used those much since college. Here is some code from. It looks a lot like a college textbook:

What about that postgres directory? Why is it named so generically?

$ cd postgres -bash: cd: postgres: Not a directory $ file postgres postgres: 64-bit executable x86_64

OK, it's not a directory, it's the executable. That explains the generic name.

What about the port directory? From the name, it seems it could either be related to networking, or short for 'portability.'

$ ls port Makefile dynloader pg_latch.c sysv_sema.c unix_latch.c aix dynloader.c pg_sema.c sysv_shmem.c win32 atomics.c hpux pg_shmem.c tas win32_latch.c darwin objfiles.txt posix_sema.c tas.s win32_sema.c win32_shmem.c

It's portability, it even has code for porting to HPUX and AIX. Despite being very portable, there isn't a lot of platform specific code scattered throughout PostgreSQL, which is a sign of good design. Portability is a difficult architectural problem, and they've isolated the platform specific code to just a few sections.

Let's try to find the code for select. Where would it be? Running $ grep -r select * actually returns a lot of results. so instead, let's try looking at execMain.c, that seems like the place were a query begins to be executed. The first function is ExecutorRun(), and it calls standard_ExecutorRun() (apparently you can dynamically set a custom executor), and that calls ExecutePlan(), which calls ExecProcNode(), which is not in that file.

If I were using Emacs, I could use ETAGS to jump directly to that function, but I'm not, I'm using the command line and less, like a savage. Fortunately this time $ grep ExecProcNode * quickly shows me the function I'm looking for, and so I run the command less execProcnode.c, taking a look.

execProcnode.c has some great comments explaining the overall structure, complete with an ascii diagram.

* Suppose we want the age of the manager of the shoe department and * the number of employees in that department. So we have the query: * * select DEPT.no_emps, EMP.age * from DEPT, EMP * where EMP.name = DEPT.mgr and * DEPT.name = "shoe" * * Suppose the planner gives us the following plan: * * +---------------------------------------+ * | Nest Loop (DEPT.mgr = EMP.name) | * | / \ | * | / \ | * | Seq Scan Seq Scan | * | DEPT EMP | * | (name = "shoe") | * +---------------------------------------+ * * ExecutorStart() is called first. * It calls InitPlan() which calls ExecInitNode() on * the root of the plan -- the nest loop node.

Moving down; this is it, looks like we've found the core algorithm for making a query:

/* ---------------------------------------------------------------- * ExecProcNode * * Execute the given node to return a(nother) tuple. * ---------------------------------------------------------------- */ TupleTableSlot * ExecProcNode(PlanState *node) { TupleTableSlot *result; CHECK_FOR_INTERRUPTS(); if (node->chgParam != NULL) /* something changed */ ExecReScan(node); /* let ReScan handle this */ if (node->instrument) InstrStartNode(node->instrument); switch (nodeTag(node)) { /* * control nodes */ case T_ResultState: result = ExecResult((ResultState *) node); break; case T_ModifyTableState: result = ExecModifyTable((ModifyTableState *) node); break; case T_AppendState: result = ExecAppend((AppendState *) node); break; case T_MergeAppendState: result = ExecMergeAppend((MergeAppendState *) node); break; case T_RecursiveUnionState: result = ExecRecursiveUnion((RecursiveUnionState *) node); break; /* BitmapAndState does not yield tuples */ /* BitmapOrState does not yield tuples */ /* * scan nodes */ case T_SeqScanState: result = ExecSeqScan((SeqScanState *) node); break; case T_SampleScanState: result = ExecSampleScan((SampleScanState *) node); break; case T_IndexScanState: result = ExecIndexScan((IndexScanState *) node); break; case T_IndexOnlyScanState: result = ExecIndexOnlyScan((IndexOnlyScanState *) node); break; /* BitmapIndexScanState does not yield tuples */ case T_BitmapHeapScanState: result = ExecBitmapHeapScan((BitmapHeapScanState *) node); break; case T_TidScanState: result = ExecTidScan((TidScanState *) node); break; case T_SubqueryScanState: result = ExecSubqueryScan((SubqueryScanState *) node); break; case T_FunctionScanState: result = ExecFunctionScan((FunctionScanState *) node); break; case T_ValuesScanState: result = ExecValuesScan((ValuesScanState *) node); break; case T_CteScanState: result = ExecCteScan((CteScanState *) node); break; case T_WorkTableScanState: result = ExecWorkTableScan((WorkTableScanState *) node); break; case T_ForeignScanState: result = ExecForeignScan((ForeignScanState *) node); break; case T_CustomScanState: result = ExecCustomScan((CustomScanState *) node); break; /* * join nodes */ case T_NestLoopState: result = ExecNestLoop((NestLoopState *) node); break; case T_MergeJoinState: result = ExecMergeJoin((MergeJoinState *) node); break; case T_HashJoinState: result = ExecHashJoin((HashJoinState *) node); break; /* * materialization nodes */ case T_MaterialState: result = ExecMaterial((MaterialState *) node); break; case T_SortState: result = ExecSort((SortState *) node); break; case T_GroupState: result = ExecGroup((GroupState *) node); break; case T_AggState: result = ExecAgg((AggState *) node); break; case T_WindowAggState: result = ExecWindowAgg((WindowAggState *) node); break; case T_UniqueState: result = ExecUnique((UniqueState *) node); break; case T_GatherState: result = ExecGather((GatherState *) node); break; case T_HashState: result = ExecHash((HashState *) node); break; case T_SetOpState: result = ExecSetOp((SetOpState *) node); break; case T_LockRowsState: result = ExecLockRows((LockRowsState *) node); break; case T_LimitState: result = ExecLimit((LimitState *) node); break; default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); result = NULL; break; } if (node->instrument) InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0); return result; }

This is good code. Internally, it seems PostgreSQL refers to select as 'scan,' not 'select,' and there are several different ways to select.

That's all for now; I hope you enjoyed my brief tour through some code!