The Tor project is working in concert with another non-profit organization in a push to bring private and anonymous browsing to public libraries. In particular, the two projects are collaborating on setting up Tor exit relays in libraries, beginning with a pilot program at a library in New Hampshire.

Spearheading the effort is the Library Freedom Project (LFP), a partnership that focuses on educating librarians about privacy and surveillance issues and equipping them with tools to address the challenges. In February, LFP won the "Knight News Challenge on Libraries" grant competition, and has been using the resulting funds to provide privacy education at libraries throughout the US and Canada, the UK and Ireland, and Australia, with more destinations still to come. A look at the resources (with more here) that LFP hosts for libraries provides some insight into its work—issues from password strength to PGP encryption to privacy-themed browser extensions are covered.

In July, though, LFP Director Alison Macrina announced the start of a more proactive project: to set up Tor exit relays on library networks. While all Tor nodes contribute to the health (and bandwidth) of the Tor network, exit relays are of particular importance because they are the gateway nodes through which traffic flows between Tor and the public Internet. Consequently, exit relays are at the highest risk for interference from ISPs or authorities that find some part of the traffic objectionable or that want to track users. See this recent story at BoingBoing for one account of how running a Tor exit relay can attract unwanted attention.

In the announcement, Macrina called libraries well equipped to handle the potential headaches that accompany running an exit relay, noting that:

As public internet service providers, libraries are shielded from some of the legal concerns that an individual exit relay operator might face, such as trying to explain to law enforcement that the traffic leaving her exit is not her own. Furthermore, libraries are protected from DMCA takedowns by safe harbor provisions. Importantly, librarians know their rights and are ready to fight back when those rights are challenged.

In addition, she cited the American Library Association's (ALA's) Core Values of Librarianship, Freedom to Read Statement, and Code of Ethics as supporting a commitment to intellectual freedom and privacy. "Libraries serve a diverse audience; many of our community members are people who need Tor but don't know that it exists, and require instruction to understand and use it."

In the first stage of the program, LFP worked with Nima Fatemi from Tor to set up a middle (i.e., non-exit) relay at the Kilton Library in Lebanon, NH. The IT librarian at Kilton, Chuck McAndrew, already administers the library's public computers, which run Linux and are configured with Tor access plus a variety of other privacy measures.

The initial relay setup consists of a desktop PC running Debian and configured to use 10MB/s of the library's bandwidth. The relay is named LebLibraries and, as is the case with most Tor relays, interested parties can check its current status and recent performance on the Tor Globe site. According to the Globe statistics, LebLibraries first appeared online on July 20, then went offline before coming back online in more permanent fashion on the 22nd. Although it is hard to generalize from such a small sample, the relay has more or less remained online, with just one incident of downtime between July 22 and publication time.
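
The Tor side of such a setup is compact. The Kilton relay's exact configuration has not been published, so the torrc below is only an illustrative sketch of a bandwidth-limited middle relay; every value in it is an assumption:

# Illustrative torrc for a bandwidth-limited middle (non-exit) relay;
# these values are assumptions, not the Kilton relay's actual settings.
Nickname LebLibraries
ContactInfo tor-admin@example.org
ORPort 9001
RelayBandwidthRate 10 MBytes
RelayBandwidthBurst 10 MBytes
# Refuse all exit traffic, which makes this a middle relay; loosening
# this policy is what later turns it into an exit relay.
ExitPolicy reject *:*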

At any given moment, over 6,000 Tor relays are running, on average, of which about 1,000 are exit relays. The ALA estimates that there are over 119,000 libraries of all types in the US. Naturally, if even a small percentage of that total ran relays, it would mark a substantial increase in the Tor network's capacity.

The relay running in the Kilton Library is just the first step. LFP said it intends to run the relay as a middle relay for about one month before reconfiguring it to serve as an exit relay, just to ensure that it does not interfere with the running of the rest of the library's network. Once the exit-relay transition happens, the project is likely to get its first test of how the non-technical side of managing a Tor relay impacts the library. If there are complaints from law enforcement, DMCA take-down requests, or other forms of criticism, other interested libraries will get to see how the Kilton staff and the LFP handle them and what the fallout is.

Moving forward, LFP plans to continue its campaign of privacy training for librarians, part of which includes providing help with setup for Tor Browser and related tools. Macrina commented that this training can be the first step toward convincing a library to run a relay, saying: "For libraries that have already installed Tor Browser on library PCs, running a relay is the obvious next step toward supporting free expression in their communities and all over the world."

It is also worth noting, though, that running a Tor relay may not be a viable option for every library at a technical level. The Kilton Library still required the expertise provided by Fatemi to set up and fully secure its test relay, and its IT librarian already had experience with Linux systems administration and Tor itself.

In an email, Macrina said that the availability of staffers to maintain relays would be a roadblock for some libraries, while others might be limited by their available bandwidth. Still, she added, the popularity of the LFP training program indicates that there is widespread interest in the subject. "Librarians are ready to take action against surveillance, and we see this as the necessary next step." She noted that a "not insignificant" portion of the librarian community was already quite familiar with Tor, although it is still a minority. But many more librarians without prior exposure to Tor have proven capable of understanding the technical details of running a relay—some just need a bit of assistance setting it up.

An assessment of the Kilton library's experience with its Tor relay should come in the next month or two, Macrina said, depending on her and Fatemi's schedules. In the meantime, LFP is inviting interested librarians to get in contact by filling out a questionnaire that covers various aspects of the library's network, hardware, and software organization—including whether or not it is already providing access to services like Tor. The concerns of public libraries may have a natural alignment to those of the Tor project, but it is good to see the interested parties making an active effort to reach out and work with one another.

Comments (5 posted)

Several years ago, when Yahoo was regularly doing battle with Google on the web-services front, one of its less-well-advertised properties that nevertheless proved popular with software developers was Yahoo Pipes. Yahoo Pipes was a web application that provided a graphical interface for manipulating RSS feeds—think "flow-based programming" applied to filtering, combining, and logically mixing RSS feed content. In June of 2015, Yahoo announced that it was shutting down Yahoo Pipes. Users who are searching for a replacement would do well to take a look at streamtools, which provides much the same functionality and can be self-hosted as well.

Streamtools is the product of The New York Times's research-and-development group, NYTLabs. Releases are available through GitHub; the latest is version 0.2.8 from August 2014. If that seems long enough ago to raise concerns about the project's health, it is worth noting that there are multiple active forks—some from NYTLabs staff and some from outsiders. The last few releases were made in preparation for the public unveiling of streamtools at a 2014 conference; it would seem that the team now adds functionality only as it needs it.

The tool itself is written in Go; there is a single binary named st that launches a local web server on port 7070 and presents the user with a blank canvas. Individual data sources like site feeds are added as "blocks" that the user can place on the canvas and hook together: the output of one block can be fed into the inputs of an operator block, and the output of the operator block can be further manipulated or rendered as output.

The biggest difference between streamtools and Yahoo Pipes is that streamtools is built primarily to handle JSON streams rather than RSS. There are quite a few web services that have JSON-based public APIs, of course, but an RSS-only source would need to be transformed before streamtools could process it. Perhaps this is just another sign of the decreasing interest in RSS among web publishers; perhaps it is coincidental.

There are a few additional input block types defined in streamtools, though, that offer intriguing possibilities. They include IMAP, HTTP GET and POST requests, WebSockets, and Amazon's Simple Queue Service (SQS). Data from these block types is converted to JSON internally, however.

The basic stream-manipulation blocks begin with primitive operations like join, mask, and filter. The join operation simply merges two JSON elements together, while mask strips an element down to just the attributes that you specify. The filter operation only passes incoming elements that match a pattern you supply—for example:

{ "foo": bar }

will only pass through elements whose "foo" attribute is exactly equal to "bar."

The expression:

{ "foo": {} }

in a mask block, however, will just strip all incoming stream elements down to their "foo" attributes.

The values of an attribute can be tested for more than just exact matching, of course—for example:

{ .foo < 42 }

matches those elements where "foo" is less than 42. As this example shows, the syntax used for numeric comparisons is a little different from string matching. It comes from another NYTLabs project called gojee, which is designed for JSON manipulation; the syntax is fairly straightforward, but a quick read-through of its documentation may be advisable.

But streamtools can do more complex manipulation of input streams as well. The pack block will concatenate all of the incoming elements that match a parameter into a single output object. The sync block allows you to introduce a timer to a stream, throttling the output to a particular speed or simply introducing some regularity to an unpredictable input.

The output options include writing JSON to a file, saving it in a MongoDB database, sending it as an email message, or initiating an Elasticsearch query. There are also functions available to do simple monitoring, such as displaying a live count of any tracked value, drawing histograms, or calculating moving averages.

On the whole, the streamtools basic function set is light on serious mathematical manipulation, but then again it is probably better to offload real number-crunching to something like R or a full-fledged database. What streamtools does excel at is grabbing live data off the web and massaging it into a form where other software can take over.

The NYTLabs site has several demo applications that one can experiment with (or build on one's own streamtools server), such as a tool that monitors the US Geological Survey feed for realtime earthquake data, and one that turns the New York City transit system's irregular updates about lost-and-found items into a well-formatted JSON stream.

Streamtools has the potential to do more than Yahoo Pipes did, and not just because it works with JSON (and other formats) out of the box. It is also an active, open-source software project, which means that no matter what The New York Times decides about its viability next year, it won't be unceremoniously shut down, leaving users in the lurch. As popular as Yahoo Pipes was in its heyday, it was still a proprietary service. By making streamtools open source, NYTLabs has the potential to give it a longer life.

Comments (1 posted)

PostgreSQL 9.5 Alpha 2 is due to be released on August 6. Not only does the new version support UPSERT, more JSON functionality, and other new features we looked at back in July, it also has some major enhancements for "big data" workloads. Among these are faster sorts, TABLESAMPLE, GROUPING SETS and CUBE, BRIN indexes, and Foreign Data Wrapper improvements. Taken together, these features strengthen arguments for using PostgreSQL for data warehouses, and enable users to continue using it with bigger databases.

This release is still called an "alpha" because the PostgreSQL developers don't feel that it's entirely stable yet, and want to reserve the right to remove features that can't be fixed before the beta release. Among others, this includes the TABLESAMPLE feature described below; as of this writing, there are still problems with that patch and it could be dropped from 9.5.

PostgreSQL as a big data database

Some readers may not think of PostgreSQL for big data, but the database system has a long track record of use in data warehouses. For a decade before Hadoop launched, PostgreSQL was the only pure open-source option for large data volumes and complex analytics. Today, it is still heavily used in mid-sized data warehouses and "data marts", meaning databases in the one to ten terabyte range. Big data and analytics users are a major subset of the PostgreSQL community, and thus its contributor base, resulting in the addition of new features for large databases with each release.

More prominently, the European Union has chosen PostgreSQL as a key part of a publicly funded analytics technology project in the EU's Seventh Framework Programme, or "FP7". This project aims to enhance the technology base of the EU. Funding from FP7, channeled through the consulting firm 2nd Quadrant, paid for two of the features covered below, as well as performance testing and review work on others.

In addition to core features, the PostgreSQL project has been a source of code for many big data startups over the years. These include Netezza, Greenplum, ParAccel, Truviso, Aster Data Systems, and CitusDB. Both Yahoo! Everest and Amazon Redshift are PostgreSQL forks as well. In July, startup PipelineDB released an open-source streaming database that was forked from PostgreSQL 9.4. Streaming databases are used to process huge amounts of incoming data. These startups are generally attracted to PostgreSQL because of its sophisticated query planner and executor, liberal license, and well-documented code.

As hardware gets bigger and faster, the definition of "mid-sized data warehouse" keeps getting bigger and user expectations continue to grow. The PostgreSQL project continuously adds new features and better performance for big data use cases in order to keep up. Version 9.5 includes a lot of "goodies" for users with big databases.

Faster sorts

One thing relational databases do a lot of is sorting data. The bigger the database, the more the performance of those sorts becomes the dominating factor in overall response time. Sort time is critical not just for query results, but also for building indexes, grouping data, and certain kinds of joins. To help with this, PostgreSQL 9.5 comes with an across-the-board speedup in sorting text and long numeric values, yielding 3X to 12X speed improvements for many users.

Peter Geoghegan of Heroku wrote a patch in January that made text sorts faster by using a technique called "abbreviated keys". Previously, PostgreSQL sorted all text purely by using glibc's strcoll() function, which allows the database to sort in whatever locale the user has chosen for their data. The problem with this is that strcoll() is expensive, taking hundreds of times more CPU time than comparing integers.

Converting the text values to binary keys using strxfrm() and sorting those is much faster, except that the binary keys can be very large. So Peter hit on the technique of taking only the first eight bytes of the keys produced by strxfrm(), comparing those, and then breaking any ties with a full strcoll() comparison. This speeds up sorts of all kinds, both in-memory quicksorts and on-disk tape sorts. One user who tested 9.5 with their data warehouse found that the new version reduced the time to index eighteen million text values from almost eleven minutes to less than one minute.
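
As a rough sketch of the technique (illustrative C, not PostgreSQL's actual implementation), a comparator might look like the following: each value carries an eight-byte prefix of its strxfrm() key, and strcoll() is consulted only to break ties. It assumes the collation locale has already been set with setlocale().

#include <stdlib.h>
#include <string.h>

#define ABBREV_LEN 8

/* Illustrative sketch of abbreviated keys; not PostgreSQL's code. */
struct sort_item {
    const char *text;                  /* original string */
    unsigned char abbrev[ABBREV_LEN];  /* prefix of strxfrm(text) */
};

static void make_abbrev(struct sort_item *item)
{
    /* Ask strxfrm() how large the full binary key is, then build it. */
    size_t len = strxfrm(NULL, item->text, 0);
    char *key = malloc(len + 1);

    strxfrm(key, item->text, len + 1);
    memset(item->abbrev, 0, ABBREV_LEN);   /* zero-pad short keys */
    memcpy(item->abbrev, key, len < ABBREV_LEN ? len : ABBREV_LEN);
    free(key);
}

/* qsort()-compatible comparator. */
static int compare_items(const void *pa, const void *pb)
{
    const struct sort_item *a = pa, *b = pb;
    int cmp = memcmp(a->abbrev, b->abbrev, ABBREV_LEN);

    /* Only a tie on the cheap keys forces a full locale comparison. */
    return cmp != 0 ? cmp : strcoll(a->text, b->text);
}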

After Geoghegan's patch was committed, contributor Andrew Gierth wrote a patch that uses a similar technique to speed up sorts on PostgreSQL's NUMERIC data type. In Postgres, NUMERIC is an arbitrary-precision number value, supporting over 100,000 digits. This means that it can't be sorted as an integer or float, so Gierth's improvement was welcome.

However, there still may be some issues with the sort improvements. Some platforms, especially older versions of Solaris, have buggy versions of the strxfrm() function that prevent use of this optimization. The developers are discussing whether these platforms are old enough, or obscure enough, to disregard so that most users can get enhanced performance, or whether platform-specific code might be required.

TABLESAMPLE

One feature funded by FP7 was written by Petr Jelinek; it adds the SQL-standard TABLESAMPLE query. This clause returns a pseudo-random sample of the rows in the result set. When used with large tables, TABLESAMPLE lets users get a quick "glimpse" of the data so that they can do further analysis.

For example, imagine that I have a table of 10 million user profiles in JSON. I want to take a look at a handful of rows from this table so that I can find out what kinds of keys the JSON has, but I don't want just the rows at the beginning of the table. In 9.5, I can do this:

SELECT * FROM user_profiles TABLESAMPLE SYSTEM ( 0.001 );

That query will return roughly 0.001% of the rows in the table, or about 100 rows. In the query above, SYSTEM is the name of the chosen sampling algorithm. The SYSTEM algorithm chooses a set of pseudo-random data pages, then returns all rows on those pages; it has the advantage of running in constant time regardless of the size of the table. PostgreSQL 9.5 will also ship with the BERNOULLI sampling method, which is more rigorously random, but takes longer the larger the table is.
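
For comparison, a BERNOULLI sample of the same hypothetical user_profiles table differs only in the method name:

-- Row-level sampling: more rigorously random, but the cost grows
-- with the size of the table.
SELECT * FROM user_profiles TABLESAMPLE BERNOULLI ( 0.001 );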

Currently, however, the TABLESAMPLE patch is under some intense review and rewriting. Tom Lane criticized the quality of the patch, and it remains to be seen if it will stay in 9.5.

CUBE, ROLLUP, and GROUPING SETS

In the SQL99 update of the SQL standard, the ANSI committee added a collection of grouping and reporting clauses to SQL to support the technology known as OLAP, for "OnLine Analytical Processing". OLAP technologies, still used today, are ways to summarize and explore large data sets by grouping them in a multi-dimensional space. The three features added to SQL, ROLLUP, CUBE, and GROUPING SETS, provide different ways of summarizing data sets on several axes at once.

ROLLUP is the easiest to understand, since it simply provides subtotals and master totals for the whole set provided. For example, this query would provide counts for each city, then subtotals for each country, and finally a grand total of subscribers.

SELECT country, city, count(*)
    FROM lwn_subscribers
    GROUP BY ROLLUP ( country, city );

  country  |     city      | count
-----------+---------------+-------
 Australia | Brisbane      |   561
 Australia | Melbourne     |   302
 Australia | Perth         |   219
 Australia |               |  1082
 USA       | New York      |   591
 USA       | Portland      |   617
 USA       | San Francisco |   610
 USA       |               |  1818
           |               |  2900

CUBE is less immediately understandable, because it's mainly intended to feed multi-dimensional data into an external OLAP tool like Mondrian. It expresses all possible totals for all combinations of specified columns, otherwise known as a "combinatorial explosion". For example, this query would provide totals for each country and each subscription level, and each combination of country and level.

SELECT country, level, count(*)
    FROM lwn_subscribers
    GROUP BY CUBE ( country, level );

  country  |    level     | count
-----------+--------------+-------
 Australia | leader       |   140
 Australia | professional |   490
 Australia | starving     |   394
 Australia | supporter    |    58
 Australia |              |  1082
 USA       | leader       |   301
 USA       | professional |   765
 USA       | starving     |   602
 USA       | supporter    |   150
 USA       |              |  1818
           |              |  2900
           | leader       |   441
           | professional |  1255
           | starving     |   996
           | supporter    |   208

Finally, GROUPING SETS is sort of the "parent class" of both ROLLUP and CUBE. It allows you to manually select the subtotal, total, and combination groups to display, in case you need more specific output than ROLLUP or CUBE provide.

For example, if I wanted totals for each combination of city and level, then for each level, and finally a grand total, I would write:

SELECT city, level, count(*)
    FROM lwn_subscribers
    GROUP BY GROUPING SETS ((city, level), (level), ());

     city      |    level     | count
---------------+--------------+-------
 Brisbane      | leader       |    94
 Melbourne     | leader       |    44
 New York      | leader       |   105
 Perth         | leader       |     2
 Portland      | leader       |    94
 San Francisco | leader       |   102
               | leader       |   441
 Brisbane      | professional |   236
 Melbourne     | professional |   121
 New York      | professional |   250
 ...

The benefit of these query constructs for users is the ability to get totals, subtotals, and combinations in one query, with one scan over the table; in the past, that would have required multiple queries against the database. The other advantage is that PostgreSQL can now be supported by sophisticated "business intelligence" tools that previously only worked with proprietary SQL databases.

This feature was developed by Gierth, and has caused some discussion because of issues with the PostgreSQL review process that almost caused it to not get committed. It's now in and ready for users to try out.

Note: the above subscription data is fictitious and not related to LWN's actual subscription data.

BRIN indexes

Chilean committer Álvaro Herrera developed another feature funded by FP7. He wrote it to solve one of the major problems with really big tables: indexing them. PostgreSQL supports large tables easily, especially append-only tables, where the user may not even notice that the table has grown to several terabytes. However, standard B+ tree indexes grow increasingly inefficient at large numbers of leaf nodes, primarily because they no longer fit in memory.

Herrera figured out that "close" counts in indexes as much as it does with horseshoes and grenades. He devised a new index type that would index values down to an arbitrarily large "block" of data rather than to an individual data page. Instead of having a leaf node pointing to each individual row, the new index has a range of values to be found in each block. When the table is queried, the index is searched for ranges that correspond to the supplied filter conditions, and then the individual blocks are sequentially scanned. This approach is similar to how data is organized in column-store tables.

This new index type is called "BRIN" for "Block-Range INdex". These indexes are much smaller than conventional B+ trees — as little as 1% of the size of the standard indexes. This makes them much faster, since they usually fit in memory, and sometimes even in the CPU cache.

There are some drawbacks to BRIN indexes, though. They can't be used to enforce uniqueness or support keys. BRIN works much better on values that are incrementally increasing with each row inserted, like timestamps and incrementing integers. Most of all, BRIN indexes handle updates and deletes on the underlying table inefficiently, so they're best used with append-only tables. These restrictions would seem to make BRIN indexes a niche tool, except that large, continuously increasing append-only tables are a common use case. For example, any table used to hold a website log or an audit history can benefit.
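
Creating a BRIN index uses the ordinary CREATE INDEX syntax with the brin access method. The table below is hypothetical, but it shows the shape of workload the index targets: an append-only log keyed by an ever-increasing timestamp.

-- Hypothetical append-only log table; log_time grows with each
-- insert, which is the case BRIN handles best.
CREATE TABLE weblog (
    log_time  timestamptz NOT NULL,
    url       text,
    status    integer
);

CREATE INDEX weblog_time_brin ON weblog USING brin (log_time);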

Foreign Data Wrappers

In the last couple of years, PostgreSQL has gained adoption as part of hybrid solutions involving several databases because of its Foreign Data Wrapper (FDW) feature. These wrappers allow users to link to external data in other databases, whether PostgreSQL or other platforms, and query them as if they were local tables. This has been called "data federation". For version 9.5, several developers have been hard at work on enhancements to the FDW capability.

One such enhancement is primarily administrative: IMPORT FOREIGN SCHEMA . In prior versions, if you wanted to link every table in a target remote database, you had to write individual CREATE FOREIGN TABLE statements for each one, which was tedious to do and annoying to maintain. Ronan Dunklau and Michael Paquier fixed this by implementing a command that creates all tables in the remote database as foreign tables so that you can link "the whole database" in one command.
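
For example, assuming a foreign server named remote_pg has already been defined with CREATE SERVER (the names here are hypothetical), one statement maps every table in the remote public schema:

-- Create foreign tables locally for every table in the remote
-- "public" schema, placing them in the local schema "remote_tables".
IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_pg
    INTO remote_tables;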

The other major FDW improvement, JOIN pushdown, was introduced by KaiGai Kohei of NEC. Together with the new "custom scan" API, this allows FDWs to perform parts of queries, like index scans and joins between two tables, on the remote database before returning the result. Pushdown makes data federation much more scalable because it no longer requires sending entire tables across the wire for many common operations.
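
To illustrate the kind of query that benefits, consider two foreign tables (the names are hypothetical) that live on the same remote server. With JOIN pushdown, the join below can be performed remotely instead of fetching both tables in full across the wire:

-- ft_orders and ft_customers are foreign tables defined against the
-- same remote server, so the join itself can be pushed down.
SELECT c.name, sum(o.total) AS revenue
    FROM ft_orders o
    JOIN ft_customers c ON o.customer_id = c.id
    GROUP BY c.name;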

To take advantage of either of these features, though, the FDW driver for the external data source needs to support them. The postgres_fdw driver, which supports PostgreSQL-to-PostgreSQL connections, will support both, but many of the other FDW drivers won't get them for a while, if at all. Regardless, the improvements in FDWs make it more practical for PostgreSQL to be the interface to large data stores even where it is not the backend for them.

Other features

Several other features in PostgreSQL 9.5 will be of interest to users with large databases. PostgreSQL now makes more effective use of larger allocations of memory in "shared_buffers" by shrinking the array of page references kept by each process. GiST indexes, used for geographic searches, now support "index-only scans" for faster queries over large append-only tables. The vacuumdb utility now has a parallel, multi-process mode for doing maintenance on multiple large tables at once. Finally, UNLOGGED tables (non-durable tables that are not written to the transaction log, often used for data loading) can now be converted into regular tables, allowing users to build a data set quickly and make it durable as the last step; this last feature was a Google Summer of Code project.
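
The UNLOGGED conversion, for instance, is a single ALTER TABLE statement (the table and columns here are hypothetical):

-- Build the data set without transaction-log overhead...
CREATE UNLOGGED TABLE staging_load (id integer, payload jsonb);
-- ...bulk-load it, then make it durable as the final step:
ALTER TABLE staging_load SET LOGGED;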

Outside of the core project, two tools have been updated that increase the size of databases PostgreSQL can handle. CitusData released version 1.3 of its cstore_fdw project, which adds a column-store option for tables. Column stores both compress data and make large aggregation queries, like counts and totals, much faster. The company also released version 1.2 of pg_shard, a tool that "shards" a table by distributing it across a set of database nodes using a hash key, allowing table scans to run on multiple machines in parallel.

Native parallel query, the ability to use multiple cores on the same machine to execute a large query, has been an active project for two years. Unfortunately, the technical challenges proved more difficult than the team, led by EnterpriseDB programmer Robert Haas, expected, and no parallel query features have yet been committed. Parallel table scan is under active development for 9.6.

Overall, PostgreSQL continues to hold its own as the "good enough" big data database for many users, by both improving performance and adding new analytical features. We can expect to see larger PostgreSQL databases under 9.5 than we have seen before. The project also shows some potential to expand its ability to handle even bigger databases through federation and sharding, and via its numerous forks. We'll have to wait to find out what 9.6 offers in this area.

PostgreSQL 9.5 Alpha 2 will be available on August 6th from PostgreSQL.org.

Comments (25 posted)