Accessing Disparate Data Through PostgreSQL Foreign Data Wrappers 22 October 2012 By Ben Bromhead

Relational IO is a data marketplace built upon PostgreSQL Foreign Data Wrappers, allowing users to query and analyse data in different formats from different endpoints using SQL.

PostgreSQL is a very cool database, in fact it is probably one of the most feature rich open source offerings in the RDBMS space. One feature that has been getting a bit of attention recently is PostgreSQL’s Foreign Data Wrappers (FDW).

Both Craig Kerstiens’ blog post on the Redis FDW and Ozgun Erdogans post on the MongoDB wrapper, provide an excellent explanation and awesome example of what a FDW is and what they can be used for. However I will also briefly outline what they are and how they work.

What is a Foreign Data Wrapper?

Foreign Data Wrappers are PostgreSQL’s implementation of the SQL Management of External Data extension, providing a way for PostgreSQL to talk to external data sources. In practice this is implemented by writing a C module that implements the PostgreSQL FDW routines for a specific datasource e.g Redis, MySQL and even the Google Search API.

Once a FDW has been written (as an example, the Multicorn IMAP FDW), it must first be imported into PostgreSQL as an extension:

CREATE EXTENSION multicorn;

You then specify a foreign server. Depending on the extension you use it will contain FDW specific information for connecting to the foreign server.

CREATE SERVER multicorn_imap FOREIGN DATA WRAPPER multicorn OPTIONS ( wrapper 'multicorn.imapfdw.ImapFdw' );

From that point on you can create a foreign table that represents the data from the external source:

CREATE FOREIGN TABLE gmail ( "Message-ID" character varying, "From" character varying, "Subject" character varying, "payload" character varying, "flags" character varying[], "To" character varying) server multicorn_imap options ( host 'imap.gmail.com', port '465', payload_column 'payload', flags_column 'flags', ssl 'True', login 'mylogin', password 'mypassword' );

After creating the table you can query the foreign table as you would any other table in the database. The only caveat being that it is read-only; you can only perform SELECT queries and not UPDATE, DELETE, etc.

If you are interested in the mechanics of FDWs check out postgresql.org’s guide on writing FDW callback routines. Also I highly recommend checking out the easy to use Python wrapper for the FDW C interface called Multicorn, which the above example was taken from.

Where it Gets Interesting

The SQL Management of External Data extension provides a great way of creating federated database systems that solve a lot of the pain in wrangling data from multiple sources.

Traditionally using third party data requires developers to connect to these sources using different protocols, each with their own set of idiosyncrasies and gotchas. Now, no technology will ever be free from its own particular set of quirks (PostgreSQL included), however when you start dealing with multiple sources that use different ways of consuming data it gets tiring pretty quickly. For example a basic app that talks to a SOAP, JSON and MySQL endpoint for various queries requires a developer to work with three separate libraries and services.

Throw on top the effort of having to load, extract and transform the data into something usable within the application and it can become quite painful, especially doing it efficiently and in a manner that scales.

Relational IO allows developers to connect to many common data sources through PostgreSQL Foreign Data Wrappers and presents them as tables within a hosted workspace, allowing developers to not only use a single PostgreSQL driver/library but also allows JOINs across these data sources. Data sources can be your own, or those published on the Relational IO Marketplace by third-parties, which are available either for free or a monthly subscription fee.

Performance

There are a number of limitations in the way FDWs currently work in PostgreSQL 9.2. Currently, the FDW implementation determines what predicates are sent to the foreign server. For example the FDW involved in the query

SELECT * FROM gmail WHERE To = 'Ben';

may choose to either use the filtering functionality provided by IMAP, or simply request the entire mailbox contents and let PostgreSQL perform the filtering.

Currently PostgreSQL will always apply predicates to FDW results, irrespective of how the FDW handles it.

From our anecdotal/quasi-formal tests we have found that if the FDW does not pass predicates to the foreign data server, it can be quite slow for results to be returned, especially if the foreign data source contains a large number of records, as it will fetch the entire data set.

If the FDW does pass predicates, the performance will be roughly similar to querying the datasource directly and you will notice the limitations of disk I/O and/or network throughput before seeing the impact of the FDW.

FDW performance is primarily dependent on the FDW implementation. We are in the process of profiling our FDW performance properly and we will post the results when we are done!

Comments

Please enable JavaScript to view the comments powered by Disqus.