The Python Language Summit is an annual gathering for around 50 developers of various implementations of the Python language. The one-day event is made up of short presentations and discussion of topics that cut across the entire language ecosystem, which includes the "standard" CPython implementation, as well as variants like Jython, IronPython, PyPy, Pyjion, Pyston, and more. As with last year's summit, LWN sat in on the discussions.

This year's edition was once again co-chaired by Larry Hastings and Barry Warsaw and was held in conjunction with the North American PyCon in Portland, Oregon. The fez tradition continues, but this year Hastings brought more of the hats, so many presenters and attendees were wearing them throughout the day. Here are our reports from the summit:

[ I would like to thank LWN subscribers for travel support to attend the summit in Portland. ]


The opening session at the 2016 Python Language Summit concerned the ssl module in the standard library. Cory Benfield and Christian Heimes described some of the problems that the module suffers from and discussed some plans for making things better.

Benfield works at Hewlett Packard Enterprise on HTTP and HTTP/2 for Python and has been involved in the pyOpenSSL project, which is an alternative to the standard library's ssl module. Heimes is a core Python developer and the co-maintainer of the ssl module. He works on identity management and public-key infrastructure for Red Hat.

The ssl module provides SSL/TLS support, Benfield explained. It uses OpenSSL internally and is important, for one thing, so that pip can safely download packages from the Python Package Index (PyPI). There are alternatives to ssl, including pyOpenSSL, various wrappers of non-OpenSSL TLS implementations, as well as platform-specific variants for Python running on other frameworks (e.g. Java for Jython, .NET for IronPython).

But the ssl module is "in a dire state", Benfield said, and it has "lots of problems". Much of that comes from the way it has grown and accreted new features over the years. It is now quite complicated. For example, an _ssl._SSLSocket is wrapped by ssl.SSLObject, which is wrapped by ssl.SSLSocket, which is a subclass of socket.socket.
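That layering can be seen from the interpreter itself; a quick sketch using only the standard library (checked against a recent CPython):

```python
import socket
import ssl
import _ssl

# ssl.SSLSocket is a subclass of socket.socket...
print(issubclass(ssl.SSLSocket, socket.socket))   # True

# ...while ssl.SSLObject is a separate wrapper that is not a socket
print(issubclass(ssl.SSLObject, socket.socket))   # False

# Both ultimately delegate to the C-level _ssl._SSLSocket object
print(hasattr(_ssl, "_SSLSocket"))                # True
```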

In addition, the legacy API is "insane", Heimes said. The SSLSocket.getpeercert() call returns an empty dict, rather than None, when verification was disabled with CERT_NONE and there is no validated peer certificate. The SSLContext object defaults to not verifying certificates; users must configure verification themselves. So, using the SSLContext() constructor directly is a bad way to get a context object, Benfield said. Those are just a few of the problems, he said; there are lots more.

But the ssl module also lacks features. Part of the problem is that it cannot decide if it is a high-level or low-level library, so it does half a job at both. It does not provide access to the verification chain for a certificate (i.e. the chain of signatures and certificates for the signers). Private keys can only be loaded from a file; there is no support for passing in a memory buffer or working with PKCS #11 for using hardware cryptographic tokens. The error messages when certificate validation fails are incomprehensible. And so on.

There is no "general-purpose API", Benfield said, it is really just a wrapper around OpenSSL and is tightly bound to that library. The API uses terms and concepts from OpenSSL to the point that OpenSSL cannot be switched out for a competing TLS implementation. The ssl module is also written in C, so PyPy, Jython, IronPython, and others that do not support C extensions must use something else. Beyond that, ssl in Python is stuck with only using features from OpenSSL 0.9.8, which is "ancient", because that is the latest version of OpenSSL that Apple ships in OS X.

Recently, though, the ssl module has gotten a bit better. Certificate validation by default (from PEP 476) was added and PEP 466 brought several security enhancements to Python 2.7, which has been in bug-fix-only mode for some time. Also, ssl.create_default_context() has been added so that users can avoid the SSLContext() constructor.
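The difference between the two ways of getting a context is easy to demonstrate; this sketch is checked against a recent CPython, and the exact defaults have varied across versions:

```python
import ssl

# create_default_context() turns on certificate verification and
# hostname checking by default, unlike a bare SSLContext()
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True

# Typical safe client usage would then be something like:
#   tls_sock = ctx.wrap_socket(raw_sock, server_hostname="pypi.org")
```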

Benfield then went into some suggestions for the future. To start with, the standard library and third-party libraries should switch to using SSLContext objects, rather than taking separate arguments for certificates, keys, and other verification options. The Requests module, which Benfield also works on, will start using SSLContext everywhere in an upcoming release, for example.

There is also a plan to create an abstract base class that will allow users to wrap sockets using an SSLContext to produce an SSLSocket for programs that need access to the socket itself. It would be something like a regular socket with two or three extra methods. In addition, a small set of "sane exception objects" would be defined to be used for verification and other failures, Heimes said.

The basic idea is to standardize and clarify the ssl module's API. There are some other things that could be done, but that leads to "a bikeshed we don't have time for right now", Benfield said. But there are a few new features that should help clean things up.

Adding an SSLContext.set_verify_call() method would allow advanced users to provide their own verification function; most users "shouldn't touch it", but some do need it, he said. Heimes suggested that a few new types be added to the module, including an X509 class to replace the getpeercert() dictionary, a VerifyContext class to store the certificate-chain information, and a PrivateKey type that would allow additional mechanisms (e.g. cryptographic tokens) for providing private keys.
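None of this API exists in the ssl module; the sketch below is a toy guess at the shape of the proposed hook, with set_verify_call() and the callback signature both hypothetical:

```python
# Toy sketch only: SSLContext.set_verify_call() was a proposal at the
# summit, not a real ssl API; certificates here are plain dicts.
class ToyContext:
    def __init__(self):
        self._verify_cb = None

    def set_verify_call(self, callback):
        """Install a user-supplied verification function."""
        self._verify_cb = callback

    def _verify(self, cert):
        # A user-installed hook overrides the default policy
        if self._verify_cb is not None:
            return self._verify_cb(cert)
        return cert.get("trusted", False)

ctx = ToyContext()
ctx.set_verify_call(lambda cert: cert.get("subject") == "pypi.org")
print(ctx._verify({"subject": "pypi.org"}))  # True
```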

After that, Benfield suggested that there be a feature moratorium for ssl. The module should simply provide the basic features needed for pip and for the standard library. Advanced users should use pyOpenSSL or other alternatives.

The ssl module maintainers would like to move to support OpenSSL 1.1.0, which has a cleaner API, and drop support for versions earlier than 1.0.2. There is a compatibility layer to provide the 1.1.0 API on 1.0.2, but there is still the problem of OS X. It ships an Apple version of 0.9.8 and does not support other versions of OpenSSL. Ned Deily noted that it has gotten "worse than that", as Apple is no longer even shipping the OpenSSL header files in the most recent OS X version.

It is not entirely clear what should be done for OS X. There are some alternative TLS libraries, such as LibreSSL or BoringSSL, that could be used as the basis for the ssl module, perhaps, but have their own sets of problems. For example, BoringSSL is tailored to support only what Google needs, Heimes said.

There is also a question about the root certificate store. Benfield was adamant that Python should never ship its own bundle of root certificates. But trusting the operating system to provide the root store leads to problems on Windows, which will try to fetch unknown root certificates from Microsoft, and other (unspecified) problems on OS X.

There is some level of confusion about what the ssl module is supposed to be, which leads to problems determining what direction it should go. Is it meant to provide building blocks for applications to use for their TLS needs or is it meant to have an API that more or less encapsulates the best practices for TLS use, someone from the audience asked.

Nick Coghlan suggested that it should simply provide enough to bootstrap pip, so that it can be used to install something more advanced. For that purpose, though, a connection provided by the underlying operating system could be used, Barry Warsaw said. But, without a root certificate bundle, that could lead to pip using unencrypted connections, which is insecure.

In the end, it was agreed that more discussions were needed and that the python-dev mailing list would be the right place for them.


Amber Brown led a session at the 2016 Python Language Summit on the progress in porting the Twisted event-driven networking framework to Python 3. The lack of a Python 3 version of Twisted has been considered one of the larger barriers to adopting the new version of the language, so progress on that front is of great interest in the Python community.

According to Brown, roughly 50% of the lines of code in Twisted have been ported at this point. But, of things that people are likely to want to use, that number is more like 78%, she said. Users who want to write their own Network News Transfer Protocol (NNTP) server will be out of luck, but for most of the commonly used protocols, Twisted will run on Python 3.

That amounts to some 40,000 lines of code ported, which opens up 100,000 lines of code in third-party libraries to be able to use Python 3. The 40 or so patches merged for the port had roughly 6500 lines inserted and 4400 lines deleted.

In the seven years since Python 3 came out, there has been little progress in porting Twisted until recently; she asked, "why has it taken so long?" The bytes versus Unicode divide was one of the major barriers and early releases in the 3.x series did not have support for byte-handling features that Twisted really needs. The change to how strings are handled was good, and cleaned up a lot of ambiguity, she said. But Twisted deals with protocols on the wire, so it needs to use byte strings.

Early Python 3 releases dropped the explicit u'' notation for Unicode strings (it was restored in Python 3.3), while the b'' notation for byte strings only arrived in Python 2.6. In addition, until relatively recent Python 3 versions, there was no way to use the "%" formatting operator for byte strings. PEP 461 was adopted (for Python 3.5) to add formatting for byte strings at the behest of the Twisted and Mercurial projects. But there is no .format() method for byte strings, so the Python 2.7 Twisted code using that all has to change.
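The asymmetry is simple to show on a Python 3.5 or later interpreter:

```python
# %-formatting on bytes returned in Python 3.5 (PEP 461)
status = b"HTTP/1.1 %d %s" % (200, b"OK")
print(status)  # b'HTTP/1.1 200 OK'

# ...but bytes still has no .format() method, so Python 2.7 code
# built on str.format() has to be rewritten when it moves to bytes
print(hasattr(bytes, "format"))  # False
```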

That leads to more time spent porting and more code to review, she said. There are effectively three string types in the Python world: bytes, Unicode strings, and native strings. And there are inconsistencies among them. For example, she said, sys.path holds bytes on Unix, but strings on other operating systems. In addition, cgi.parse_multipart() returns strings on Python 3, which is just wrong.

There is an "avalanche of changes" that comes from porting to Python 3, she said. New style classes by default broke a lot of things, as did the differences between bound and unbound methods. But, she said, those in the room are all aware of these problems.

Porting to Python 3 was "the most expensive thing we have ever done for Twisted", she said. On average, she spent two hours a day for a year and a half working on it. That cost upwards of $60,000 just for her time, most of it unpaid. That doesn't include lots of time spent by others, including reviewers, and there are "thousands of hours left to go". She is now down deep into porting the protocols in Twisted, which is the harder part.

The "unfortunate reality" is that if she didn't do that work, it would not have happened. Other Twisted developers had written off porting to Python 3. Earlier versions in the 3.x series would have made the job too large, she said; it is only since Python 3.3 was released that porting has become tractable.

But porting to Python 3 has been a "massive drain" on the development of features in Twisted. Half of the patches in the review queue are for the port. As with most projects, reviewers are a scarce resource, and the port patches require a lot of care and knowledge of the problem domain.

That leads to the question of who is using Python 3. The reality is that Python is falling by the wayside for performance-sensitive applications, she said. People are turning to Go or other options. And Twisted on Python 3 is a less attractive target for developers than Twisted on PyPy 5.1—because of the performance.

So, Twisted has spent an enormous amount of time changing its codebase to end up with slower code. PyPy and Pyston make Python competitive with Go in terms of performance, but only really support Python 2.7 at this point. There are some 3,500 C API functions in Python, which is a huge barrier for projects like PyPy and Pyston. She asked: "How do we stop this from happening again?" Long term, it may well be that asyncio (along with async/await) will provide much of the functionality of the Twisted core.

Guido van Rossum asked about interoperability between Twisted and asyncio. Brown said that it is possible to await a Twisted Deferred so mixing the two is possible. Twisted will be able to share its event loop with that used by asyncio, she said. "The golden age of Twisted and asyncio is 2016", she said to a round of applause. There are still some patches to be merged and some edge cases to be worked out, but there is enough of Twisted working for Python 3 that it can be done.

Brown said she wondered what users with large Python 2.7 codebases would do in 2020 when 2.7 is deprecated and no longer gets updates. She thinks they will simply keep running it. Van Rossum said "that's fine", but that they won't get updates. For Twisted, though, Brown thinks the project will probably end up supporting Twisted on 2.7 for five years after users can realistically port to Python 3, which probably means 2022 or beyond.


Python's (in)famous global interpreter lock (GIL), which effectively serializes multi-threaded access to the interpreter (thus hampering concurrency using threads), has long been seen as something that Python could do without. But there are both technical and political hurdles to clear before the GIL can be removed. Larry Hastings presented his thoughts and progress on doing a "gilectomy" to the CPython interpreter at the 2016 Python Language Summit.

Hastings said that he has a proof-of-concept solution that gets around the technical and political problems. There are two questions that often get asked: "Could we remove the GIL?" and "Should we remove the GIL?" It is clear that it can be removed, he said, because IronPython and Jython already have. The answer to the second is "maybe"; it will depend on what it buys versus the technical debt it incurs. But, he said, he is going to keep trying to remove the GIL until either it gets removed or everyone tells him to stop.

The GIL was added in 1992 by Guido van Rossum; since then, the world has changed, but Python hasn't. Now, everything, including eyeglasses, is multi-core. Python, however, cannot really take advantage of these cores using threads.

There are four technical considerations that need to be addressed, he said. Reference counting for the garbage collector is one. There is also a need to look at the globals and statics in the interpreter and make them per-thread variables. The C extension parallelism and reentrancy issues need to be handled as do places in the code where atomicity is required.

There are also three political considerations. Van Rossum has said that he will only consider removing the GIL if it does not negatively impact the performance of single-threaded programs. Breaking all of the C extensions, which is the outcome of some other GIL-removing projects, is not reasonable. Removing the GIL must also not over-complicate the code.

There are some potential solutions to the reference counting issue that should not be considered, Hastings said. Both tracing garbage collection and software transactional memory might perform reasonably, but both are likely to be quite complicated and to break all of the C extensions.

So reference counting remains in his proof of concept. That means using atomic increment and decrement operations, which leads to a 30% performance hit right off the bat. As more threads are added it gets worse. He has an idea about "buffered reference counting", but did not have time to describe that at the summit. For global data, PyThreadState can be used to make it per-thread data. He has added fine-grained locking to the small-block allocator so that it can be used by multiple threads as well.
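The per-thread-globals idea has a rough pure-Python analog in threading.local, which gives each thread its own private copy of what looks like a global; this sketch is an illustration of the concept, not of Hastings's C-level implementation:

```python
import threading

# Each thread sees its own state.counter, much as the gilectomy moves
# interpreter globals and statics into per-thread storage
state = threading.local()
results = {}

def worker(name):
    state.counter = 0          # private to this thread
    for _ in range(1000):
        state.counter += 1
    results[name] = state.counter

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items()))  # [(0, 1000), (1, 1000), (2, 1000), (3, 1000)]
```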

Parallelism is simply something that C extensions will have to live with. It makes the lives of extension developers more difficult, but there really is no way to soften that blow, he said. In order to enforce atomicity, he has added a lock API to CPython (with "macros to hide it behind") so that all mutable objects get locked before accessing them. He noted that "mutable" refers to the C objects, not Python objects, so even immutable objects in Python, like strings, are still mutable from the perspective of the interpreter.

Hastings laid out a set of five rules for locking in CPython to ensure that locking functions smoothly. Locks must be recursive and objects must be self-locking wherever possible. The reference count cannot be touched except through the defined interface and the object type is immutable. The latter drew a question about the desirability of changing object types, but Hastings said that there will be some things that have to be given up to facilitate the removal of the GIL.

When code needs to take multiple locks, it should do it in address order. Finally, the kernel should not be involved in taking the lock unless there is contention. That maps to a futex on Linux, but Windows and Mac OS X have equivalent functionality.
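The address-ordering rule has a familiar pure-Python sketch: if every thread acquires multiple locks sorted by id() (which is the object's address in CPython), no two threads can take the same pair of locks in opposite orders, so they cannot deadlock. The helper name here is made up for illustration:

```python
import threading
from contextlib import ExitStack

def acquire_in_address_order(*locks):
    """Acquire locks sorted by id() (the CPython analog of address order)."""
    stack = ExitStack()
    for lock in sorted(locks, key=id):
        stack.enter_context(lock)   # acquires each lock in turn
    return stack                    # exiting the stack releases them all

a, b = threading.Lock(), threading.Lock()

def transfer(first, second):
    # Both threads end up taking the locks in the same global order,
    # regardless of the argument order, so this cannot deadlock
    with acquire_in_address_order(first, second):
        pass

t1 = threading.Thread(target=transfer, args=(a, b))
t2 = threading.Thread(target=transfer, args=(b, a))
t1.start(); t2.start()
t1.join(); t2.join()
print("no deadlock")
```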

His proof of concept lives in the same source tree as the regular CPython interpreter, which can be configured to run with or without the GIL. One thing that might be possible if the GIL-removal work pans out is to enforce best practices on C extensions, since there will be a new API. The GIL removal is somewhat complicated, so it may fail that particular political consideration, he said.

Hastings briefly described his eight-point plan to remove the GIL (after noting Van Rossum's 2007 "It isn't Easy to Remove the GIL" blog post). It is presumably based on the process he took with his "toy" proof of concept. It starts by adding the atomic increment/decrement, adds locks to various types (dict, list) and free lists, on through murdering the GIL and fixing up the tests.

He showed the results of a "dumb test" he ran using the proof of concept. It calculated the Fibonacci sequence in seven threads. It was roughly 3.5x slower than the standard CPython interpreter in terms of wall time and 25x slower in terms of CPU time (because seven threads were running). That is not as good as he had hoped for in this early stage (he was shooting for only 2x slower), but there are still a lot of low-hanging optimization possibilities.
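The shape of that benchmark is easy to reproduce on the stock interpreter, where the GIL keeps CPU-bound threads from running in parallel; the recursive fib() and the harness below are stand-ins for his test, not his actual code:

```python
import threading
import time

def fib(n):
    # Deliberately naive recursion: pure CPU-bound Python work
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def run_threads(count, n=20):
    results = [None] * count
    def work(i):
        results[i] = fib(n)
    threads = [threading.Thread(target=work, args=(i,)) for i in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results

elapsed, results = run_threads(7)
print(results[0])  # 6765
# Under the GIL, wall time grows roughly linearly with the thread count
# for this kind of work; the gilectomy build aims to let it overlap.
```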

The open questions ("apart from 'should we do it at all?'" he said with a grin) are about things like separating read and write locks or allowing user-settable locks in the language itself. It might also make sense to look at running multiple interpreters in the same process—GIL-removal time might be the right point to add that feature.

He concluded the talk by noting that he had "Gilectomy" stickers available and a GitHub repository set up for those interested. He said he was planning to "sprint" on the project right after the main PyCon conference; "I have T-shirts if you sprint with me."

There wasn't a lot of time for questions, but there were a few. One person asked about how Gilectomy impacts PyPy. Hastings said he didn't know, but thought that project was more prepared for these kinds of changes than CPython is. Nick Coghlan commented that there is a fair amount of code out there that should be doing locking but isn't; the programs are getting away with it mostly because the GIL—or, as another person suggested, CPU scheduling—protects it. Eliminating the GIL will expose those programs. Hastings also noted that it was unfortunate but that one of the costs of Gilectomy will be to break some C extensions, though he is unsure of how many.

[As evidence of the interest in the Python community about removing the GIL, Hastings took a photo of the (overly) full room where he gave a Gilectomy talk at PyCon later in the week.]


At OSCON 2016 in Austin, a panel of invited experts debated the always-thorny subject of how open-source software projects deal with patents. The panel was packed, featuring representatives from the free-software world, commerce, and the legal community, so there was scarcely enough time to move through the prepared topics in the time allotted, much less to take questions from the audience. But the discussion was able to highlight a number of current issues, including patent abolition, implicit patent licenses, and where the open-source community should focus its efforts to improve matters.

Defining the problem

Jim Jagielski from the Apache Software Foundation (ASF) served as moderator. The panelists were Bradley Kuhn of the Software Freedom Conservancy (SFC), Heather Meeker from the law firm of O'Melveny & Myers, Rabin Bhattacharya from Capital One, Keith Bergelt of the Open Invention Network (OIN), Mishi Choudhary of the Software Freedom Law Center (SFLC) and SFLC India, and Eben Moglen of the SFLC. After introducing everyone, Jagielski posed the first question: are software patents inherently evil, or are they just implemented wrongly in the legal system?

Kuhn replied first, saying that he has a long history of saying "eliminate them all." In fact, he said, "I'm probably the only patent abolitionist here." Since only Congressional legislation could abolish software patents in the US, he said, the free-software community uses other means to mitigate them—such as copyright. Choudhary also called herself a patent abolitionist, though, noting that SFLC has worked through the courts (such as filing amicus briefs) to combat software patents. In India, those efforts have been a success: India's Patent Act does not cover software. Yet there is still resistance, with companies in India acting as though software patents are allowed (such as by offering to file software patent applications in India on a client's behalf). Consequently, she said, the patent office has had to release guidelines on several occasions clarifying that software patents are not permitted.

Bergelt replied next, saying that the patent system has means for self-correction, which the open-source community must use—like OIN's Defensive Publications program. But the changes the community wants must be implemented on top of everything that came before, he said. "We're writing on a dirty slate," he said, and there is a long way to go. But he said that the community does not want to get in the way of actual innovation, which is what justifies OIN's "nonaggression" approach. Moglen then identified himself as not just a software-patent abolitionist, but as someone opposed to "government-granted monopolies" of any kind. "I think we could do without all patents," he said, "so I'm more abolitionist than anyone else." As a lawyer, he said, his job is to do what he can for his clients and, where software patents are concerned, that means he is preoccupied with defensive measures.

Bhattacharya said he was "somewhere in the middle." On one hand, there are clearly software patents that are bad. But, on the other, he said, recent changes have made it much more difficult to get a new software patent issued. Research indicates that there is a correlation between a start-up company getting software patents granted and that company getting venture capital. So the industry believes that there is a financial upside. And, he said, he finds many philosophical objections to software patents unconvincing. "Think about how much hardware is emulated in software these days. The inventions that people make as hardware tweaks are patentable, but when the same thing is implemented in software it isn't. Why?"

In contrast, Meeker reported that all of her clients, which include "the biggest technology companies in the world," tell her that they hate software patents. "Who's being well-served by software patents? I have one idea: the hotels and restaurants in Marshall, Texas. They're getting a good deal." To everyone else, she said, software patents are a menace they are afraid of but are afraid to let go of, too. "I'm disappointed that we don't hold the government more accountable to fix the problem," she said.

Strategy and tactics

The next subject Jagielski asked about was how open-source software licenses ought to address patents. Other than a select few (such as the Apache license), most open-source licenses ignore patents, he said. But there is a theory circulating that all software licenses include an implicit grant to use any patents on the licensed software. Jagielski asked the panelists what they made of the theory and whether licenses should be updated to reflect it.

Moglen replied first. Under US patent law, he said, the theory that Jagielski referred to is known as "exhaustion." It holds that a seller cannot sue its own customers over patents embodied in a product it has sold: the seller's exclusive rights in the patent are exhausted once the article in question is sold. But in many other jurisdictions, such as the UK, no such doctrine exists. And that is why, he said, the GPLv3 includes its "patent non-aggression" clause. "It's a single-license version of what the OIN pool does. It says that sharing is the rule and sharing is compulsory." Under permissive licenses like the X, MIT, and BSD licenses, the user is at the mercy of whatever the local patent system says. "That varies and it might change. The Kazakh or Chinese patent system may come after you some day."

The Apache license, he added, relies on "defensive suspension" of patent grants whenever someone initiates patent-infringement litigation. That clause is a "poison pill." Bergelt then commented that he believes there are some open-source project communities (such as the Bitcoin and Hadoop communities) that have grown large enough that they may be interested in a "complementary" solution to the Apache license's defensive-suspension approach, although he did not speculate on what such a solution would look like.

Meeker added that if one reads the case law about implied licensing, there is not much to stand on. First of all, she said, the terms "exhaustion," "estoppel," and "implicit license" are all used interchangeably. While the three terms supposedly refer to separate doctrines that would prevent a patent holder from suing a customer for infringement when there is no explicit license, case law does not appear to clearly define or differentiate between them. Consequently, she said, "no lawyer will write you an opinion taking a stance" on implicit patent licenses.

Jagielski then asked the panel what else the open-source community should be doing to take action about software patents. Choudhary said she tells clients to join OIN, because "you get so much for free." In other jurisdictions, she added, the software community has started to see the benefit of OIN-style patent pools and defensive publications. "You have to do those things in parallel with working to abolish software patents," she said.

Moglen advised separating the past from the future when discussing tactics. For the future, he said, "we should prevent people from getting patents." But, in the past, the problem is old patents "rising up and smiting" projects, harming innovation. The work that Choudhary does trying to abolish patents, he said, covered how to protect people in the future—and he fully expects SFLC to continue that work, arguing in front of the Supreme Court about the "design patent" case between Apple and Samsung. Furthermore, he said, "open-source software is an immense repository of prior art. Free software can help by educating people on all that we've invented and that you therefore cannot reinvent and patent."

Kuhn urged free-software developers who work at for-profit companies to refuse to file for patents on their work, even if that means losing bonuses and, possibly, promotions. "It might be good for your career to get patents, but taking a stand is good for all of software freedom." Bergelt suggested that developers look at Twitter's Innovator's Patent Agreement (IPA), which he called a "middle ground" approach. The IPA is a pledge that Twitter will only use patents on employee inventions defensively, unless they get the employee's consent. Meeker recommended the License On Transfer (LOT) Network, which seeks to prevent patent trolls from acquiring patents from companies in desperate financial situations. "Cooperative approaches are great," she said, "but they do not affect patent trolls. My clients are way more afraid of trolls than they are of their competitors."

Choudhary then added that interested developers could join the work going on in India. "It's the only jurisdiction where we're the incumbent, but we still face lots of pressure." She also advocated taking a stand on international trade agreements, which she noted are usually written in secret, a fact that companies try to take advantage of. Bhattacharya added that recent changes at the US Patent and Trademark Office (USPTO) allow the public to submit prior-art examples on patent applications. "If you see one and know that it shouldn't be granted, you can anonymously submit up to three pieces of prior art." Kuhn replied that he sees one downside to Bhattacharya's suggestion: by fighting some patent applications, any that get granted will be perceived as stronger. Kuhn remembers the RSA patent, he said, which many developers might be too young to recall. "It set the progress of encryption back twenty years, and something like that can happen again."

Fear and participation

Next, Jagielski asked panelists about patent-owning corporations' fears of open source. "People are using open-source software all over the place; some companies worry that using it means giving away the 'keys to the kingdom.' Is that fear real?" While those in the open-source community may find such fears hard to imagine, Meeker reported that they are still quite common. "I should have a standard line item for 'talking clients off the ledge over concern about using open-source software,'" she said. Nevertheless, she said, once she sits down with clients and looks at the licenses involved, they get over their fears. "I have never had a client throw up their hands and say 'no, we can't do this, because we have patents.'" Bhattacharya added that there is "a lot of daylight" between the patent policies most companies have and the norms of the open-source community, but that understanding the rights granted in the software licenses and knowing what code you are contributing makes all the difference.

Kuhn said that, in the long term, "it is very hard to be an authentic open-source software participant and not contribute." Free-software communities are rather self-organizing, he said, so if a company wants to lead a project, it will have to contribute code. Moglen responded that corporate fears about open source and patents depend on the culture. In the US, companies rely on lawyers to "talk them off the ledge." But China is completely different; the Communist Party's view is that all patents are government property, he said, so companies there operate without the fear of open source that one sees in the US. Indian companies, he added, "haven't yet grasped how good their situation is" regarding software patents.

As time ran out in the session, Jagielski posed one final question to the panelists in reference to the then-in-progress Oracle-v-Google lawsuit—in which Oracle asserted copyright protection over Java's API. "We think we understand patent law," he said, "but if other things we think we understand, like copyright law, are as uncertain and malleable as they sound today, what does that change for open source?" Moglen was the only panelist to respond, saying "I think we don't know patent law." The "design patent" lawsuit between Apple and Samsung is an example, he said. "I'm very glad that our friends at Google are going to find out what 'fair use' is. It'd be good to know that for design patents, because that clarity will improve enormously our ability to innovate in our world."

With that, the session broke up, although the discussion about how free and open-source software projects work with the patent system will, no doubt, continue for some time to come.

Comments (10 posted)

PostgreSQL's annual developer conference, PGCon, took place in May, which made it a good place to get a look at the new PostgreSQL features coming in version 9.6. The first 9.6 beta was released just the week before and several contributors demonstrated key changes at the conference in Ottawa. For many users, this was the first chance to see the finished versions of features that had been under development for months or years.

There were multiple sessions on 9.6 features at PGCon that were presented by their lead contributors. The main highlights included parallel query and phrase search. There were also sessions on performance, transaction management, and other improvements. First, though, the PostgreSQL developers had to do some "housekeeping."

Changes to releases

Before the main conference began, PostgreSQL's annual developer meeting and unconference was held. Among the items discussed were two changes to how PostgreSQL does releases: the Release Management Team and version numbering.

The project experimented with adding a Release Management Team (RMT) this year: a group of three committers empowered to make rapid and final decisions about which patches would be included before the feature freeze. Project members did this because the period between the end of development and the beta release for version 9.5 took five months, resulting in a final release that was four months late. Álvaro Herrera, Robert Haas, and Noah Misch were appointed to the RMT during the PostgreSQL developer meeting at FOSDEM; with their efforts, the PostgreSQL 9.6 beta was released on time. As such, the developers plan to have an RMT next year as well.

Another topic the project discussed was version numbering. Historically, the PostgreSQL project has used three-part version numbers, in which the first two parts are "major version numbers." This has resulted in lengthy annual arguments about whether to increase the first part of the number. As such, the project is considering moving to a two-part version number, the first part of which would increase every year. Regardless of how this comes out, the version after 9.6 will be 10.0.

Parallel query

The first 9.6 feature presented at PGCon is also the new version's most prominent: parallel query. Haas, who leads a team of contributors that has been working on this feature for the last three years, presented on what's in 9.6 and what is still being worked on. Over those three years, the team has been adding multiple backend features to support parallelism. Haas and most of these contributors work for the PostgreSQL support company EnterpriseDB.

Parallel query is the ability to make use of multiple cores to execute the same query in order to speed up operations that can be parallelized. This feature is desirable for big data workloads and has been available in Oracle and DB2 for some time. The goal is to allow all of the resources of the system to be used to answer a single query if that query is the only one running. Parallel query has been on the PostgreSQL to-do list for over a decade.

With version 9.6, some queries are now parallelizable. Specifically, PostgreSQL can now execute sequential scans, some aggregates, and some joins in parallel. Haas said that his team plans to gradually increase the number of query operations that can be parallelized over the next few years.

The first operation made parallel was the sequential scan, otherwise known as a full table scan. In a sequential scan, the PostgreSQL engine performs an exhaustive read of all the 8KB disk pages of the base table; one is normally needed either when the user has requested a large percentage of the table or when there are no useful indexes available. The team picked this operation first because it is relatively simple to parallelize. Also, as sequential scans over large tables can be quite slow, speeding them up appeals to users.

For example, as part of an audit you might want to total up all account balances per department in a financial system. Without parallel scan that can be quite slow:

    bench=# explain (costs off, analyze on, timing off)
            select bid, sum(abalance) from accounts group by bid;

     HashAggregate (actual rows=500 loops=1)
       Group Key: bid
       ->  Seq Scan on accounts (actual rows=50000000 loops=1)
     Planning time: 0.050 ms
     Execution time: 10488.074 ms

With parallelism, you can speed up the operation a great deal, cutting execution time by as much as 75% using four "workers":

    bench=# explain (costs off, analyze on, timing off)
            select bid, sum(abalance) from accounts group by bid;

     Finalize GroupAggregate (actual rows=500 loops=1)
       Group Key: bid
       ->  Sort (actual rows=2500 loops=1)
             Sort Key: bid
             Sort Method: quicksort  Memory: 233kB
             ->  Gather (actual rows=2500 loops=1)
                   Workers Planned: 4
                   Workers Launched: 4
                   ->  Partial HashAggregate (actual rows=500 loops=5)
                         Group Key: bid
                         ->  Parallel Seq Scan on accounts (actual rows=10000000 loops=5)
     Planning time: 0.118 ms
     Execution time: 2739.373 ms

As shown in the example, PostgreSQL 9.6 also includes support for parallel aggregation. Simple aggregates, like SUM() and AVG() , can now be executed in parallel in order to double or quadruple throughput. This doesn't work with windowing aggregates, ordered aggregates, or grouping sets yet; each individual aggregate function needs additional code before it can be supported in a parallel context.
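As a hedged illustration (reusing the accounts table from the earlier example; the second query is invented for contrast), a plain sum() can take the parallel path, while an ordered aggregate still forces a serial plan in 9.6:

    -- A simple aggregate such as sum() can be computed in parallel:
    select bid, sum(abalance) from accounts group by bid;

    -- An ordered aggregate cannot yet be parallelized in 9.6, so the
    -- planner falls back to a serial plan for a query like this one:
    select bid, array_agg(abalance order by abalance)
    from accounts group by bid;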

The third parallel operation included in the beta is joins, specifically hash joins and nested loop joins. These can now be parallelized in some cases in order to join two large tables more quickly. Merge joins, the third major type of join, are not supported because the industry standard algorithms are not parallelizable. According to Haas, the PostgreSQL project will need to invent a new algorithm in order to add the feature.

Parallel query works by creating several "dynamic background workers" for each query. As PostgreSQL operates on a multi-process model, each of these workers is a process. Workers take on part of the query and then the results are fed to the parent process via a "gather node" that collates the results. For example, in a parallel sequential scan, each worker reads one page of the relation at a time in order. The team tested more sophisticated algorithms for partitioning the scan, but they did not improve throughput.

This feature depends on many underlying changes, not just in version 9.6, but in the last several PostgreSQL releases. For example, PostgreSQL 9.4 included dynamic background workers and dynamic shared-memory allocation. PostgreSQL 9.5 introduced the parallel context and group locking.

Version 9.6 includes "parallel-aware executor nodes." PostgreSQL divides up each query, during the planning phase and again during the execution phase, into "nodes" that each represent one task to be performed, such as sorting data. Some of the code for these nodes is now different depending on whether the node is being executed in a parallel context or a non-parallel context. Parallelizing more query operations in future versions will be largely a matter of adding parallelization code to more types of executor nodes, such as index scan nodes.

Because parallel query is multi-process, the team also had to add a facility for passing messages between the master query process and the workers. This is achieved using a shared-memory message queue, in which the calling process allocates a chunk of shared memory for all of the workers to read and update. This permits forwarding errors and notices as well as data rows and addresses.

During the talk, an audience member asked Haas about resource management. In the beta, there are two settings available: max_parallel_degree , which sets the number of parallel workers for queries in the current session, and max_background_workers , which limits the total number of workers in the system as a whole. The latter is intended as an administrative control to prevent overwhelming the processor. Note that these configuration parameters are likely to change during the beta in order to provide better resource management.
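A minimal sketch of how these knobs might be used in a session, with the beta-era parameter name from the talk (which, as noted, may change before release):

    -- Allow up to four parallel workers for queries in this session:
    SET max_parallel_degree = 4;

    -- Re-running the query should then show a Gather node with the
    -- planned and launched workers in the plan output:
    explain (analyze on)
    select bid, sum(abalance) from accounts group by bid;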

While there is still a lot of work to do on parallel query, the current features already show great improvements to PostgreSQL's performance in the TPC-H data warehousing benchmark. Contributors ran tests on a donated IBM Power 7 system housed at the OSU Open Source Lab, using max_parallel_degree=4 . Of the 22 long-running queries that make up part of the benchmark, three were four times faster, eleven were twice as fast, and the remaining eight were either slightly faster or unchanged.

There are a few other limitations on the parallel query features. First, only read queries are currently run in parallel. Second, parallel queries don't work in serializable mode. They also don't work together with cursors. Contributors expect to fix some of these limitations in the next version of PostgreSQL.

Haas expects that various contributors will be working on adding more parallel operations to PostgreSQL for several years along with improving performance and flexibility. What's in 9.6 provides a framework that will allow more contributors to help with that effort.

Phrase search

"Full-text search has been the same for eleven years, so it is time for some improvements," explained Oleg Bartunov. Along with Teodor Sigaev and Alexander Korotkov, he presented on the new features they're adding in 9.6 and 10.0. These Russian contributors have recently formed a new company in Moscow, PostgreSQL Pro, that has added a number of new contributors to the database project.

At the center of the full-text search (FTS) improvements is "phrase search," which is the ability to find phrases and sentences. Current versions of PostgreSQL permit boolean searching on collections of words, but the relationship of these words to each other is not factored in. A word search on "Linux Weekly News", for example, is just as likely to turn up documents where those three words are separated by pages of text as it is to find them together.

Phrase search also considers the proximity of the words to each other. So, for example, you could search using either of the following:

    article @@ phraseto_tsquery('linux weekly news')
    article @@ to_tsquery('linux <-> weekly <-> news')

This would turn up only documents where the three words were together. The reason for the "tie-fighter" operator ( <-> ) is that it's actually a measure of proximity. For example, if we wanted to allow for "Linux Every Weekly Great News", we might use the search below, allowing the words to have a word in between:

    article @@ to_tsquery('linux <2> weekly <2> news')

That syntax works in 9.6. The team is implementing more advanced indexing for 10.0, though, to support faster phrase search and ranking. This involves adding a new index type to PostgreSQL called "RUM Indexes," continuing the PostgreSQL theme of index types named after alcohol (though the acronym is not defined). These new indexes will also support adding timestamp data to the index, allowing for text searches that then rank documents by age, as is possible with Elasticsearch. RUM indexes are available on GitHub for those who want to try them against the 9.6 beta.

In addition to phrase search, for 9.6 the team added the ability to do fine-grained editing of the data inside specific full-text search fields, or "tsvectors". This includes adding, deleting, and setting ranking weights for specific words.
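For example, 9.6 adds ts_delete() and a three-argument form of setweight() that operates on selected lexemes only (the sample text here is invented):

    -- Remove the stemmed lexeme 'weekli' from a tsvector:
    select ts_delete(to_tsvector('english', 'linux weekly news'),
                     'weekli');

    -- Give weight 'A' to just the 'linux' lexeme:
    select setweight(to_tsvector('english', 'linux weekly news'),
                     'A', '{linux}');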

Other features and schedule

In addition to the above, sessions at the conference included Kevin Grittner covering improvements in transaction management that make queue tables work better in PostgreSQL. In another talk, Amit Kapila explained how write-performance improvements of up to 110% were achieved by lowering locking overhead. Much of the conference was concerned with future development, including talks about transaction-log reduction, multi-master replication, and more.

There are other PostgreSQL 9.6 features that were not covered at PGCon. The Foreign Data Wrappers API now supports pushing some updates, deletes, joins, and sorts down to remote servers. The new version reduces the maintenance required on very large tables. An extension for Bloom-filter indexes was added. There's even a command in the terminal client, psql , for creating Excel-like "crosstabs."
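That command is \crosstabview; as a rough sketch (the account_history table and yr column are invented), the first output column becomes the row headers and the second becomes the column headers:

    -- In psql, replace the trailing semicolon with \crosstabview
    -- to pivot the result set into a crosstab:
    select bid, yr, sum(abalance)
    from account_history group by 1, 2
    \crosstabview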

One additional change that is liable to be more far-reaching is the set of additions to synchronous replication, which make the database a more feasible option for scale-out workloads. 9.6 includes support for groups of synchronous standbys and a remote_apply setting that reduces the wait time for commits. Combined, these may allow users to create "consistent" replication clusters using the built-in tools.
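A sketch of what such a configuration might look like in postgresql.conf (the standby names are invented):

    # Require two of the three listed standbys to confirm each commit;
    # the "N (list)" syntax for standby groups is new in 9.6:
    synchronous_standby_names = '2 (s1, s2, s3)'

    # Wait until standbys have applied each commit, not merely flushed
    # it to disk, so reads on the standbys stay consistent:
    synchronous_commit = remote_apply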

The pgAdmin team, which makes the primary GUI client for PostgreSQL, has announced a new browser-based version. This makes the C++ desktop client known as pgAdmin3 obsolete.

Given that the first beta was released in mid-May, we can expect a second beta in June or July, with successive betas and release candidates until the final release. Since the beta was on schedule, this will hopefully mean that 9.6 is also on schedule for a September release. A lot of users will be waiting for it.

Comments (8 posted)