When we came across this misbehaviour a little while ago we didn’t think too much of it. We unearthed the cause and worked around it, but ultimately dismissed it as an odd one-off.

Since then it’s cropped up a few more times, and it doesn’t look like it’s going away, so we thought we’d tell you a little more about it and what we’re doing to fix it.

The segfault

We use Postfix as our MTA on almost every Linux server in the company. As Anchor’s resident Postfix guru I often get asked for assistance when it comes to troubleshooting Postfix problems. I’ve seen plenty of them, so I can spot the most common issues and teach people how to fix them.

Sometimes you run into something you haven’t seen before, and this was one of those times. When processing the mailqueue, the server would regularly segfault in the SMTP client, but only for mail to a particular domain.

Jun 1 10:18:46 scottishfold postfix/master[1707]: warning: process /usr/libexec/postfix/smtp pid 7515 killed by signal 11
Jun 1 10:18:46 scottishfold postfix/qmgr[1717]: warning: private/smtp socket: malformed response
Jun 1 10:18:46 scottishfold postfix/qmgr[1717]: warning: transport smtp failure -- see a previous warning/fatal/panic logfile record for the problem description
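If you want to check your own mail logs for the same symptom, something like the following might help. It is only a rough sketch: the regular expression is modelled on the single master warning quoted above, and the function name find_crashes is made up for this example.

```python
import re

# Pattern modelled on the postfix/master warning shown above (an assumption:
# other Postfix versions or syslog formats may word this differently).
KILLED_RE = re.compile(
    r"postfix/master\[\d+\]: warning: process (?P<path>\S+) "
    r"pid (?P<pid>\d+) killed by signal (?P<signal>\d+)"
)


def find_crashes(lines, signal=11):
    """Return (path, pid) for each process reported as killed by `signal`."""
    crashes = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match and int(match.group("signal")) == signal:
            crashes.append((match.group("path"), int(match.group("pid"))))
    return crashes


sample = [
    "Jun 1 10:18:46 scottishfold postfix/master[1707]: warning: process "
    "/usr/libexec/postfix/smtp pid 7515 killed by signal 11",
    "Jun 1 10:18:46 scottishfold postfix/qmgr[1717]: warning: private/smtp "
    "socket: malformed response",
]
print(find_crashes(sample))  # [('/usr/libexec/postfix/smtp', 7515)]
```

Signal 11 is SIGSEGV on Linux, which is why the default is 11. Feeding it an open log file instead of the sample list works the same way, since it only iterates over lines.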

The surrounding warnings in the mail logs suggested that encryption was involved. Technically unrelated, but a hint nonetheless. We could’ve cracked out the debugger at this point, but it’s a pretty big hammer that’d take a lot of time and effort to get results.

One of the great things about Postfix is that it’s well-supported by the creator (Wietse Venema) and community alike, so most issues are picked up, reported, and more importantly, fixed. So we went googling and found what we were after, after several false starts.

The cause

Someone on the postfix-users mailing list had run into the same problem, the only report we could find. Crucially, they’d tracked down the immediate cause and instituted a workaround. We looked at our own system, then back to their description; why yes, Percona Server is installed, why do you ask?

Based on the MySQL codebase, Percona is a popular drop-in replacement for the standard MySQL database server, touting vastly improved scalability and performance. It turns out that Percona is distributing a copy of the libmysqlclient.so.16 library that is incompatible with Postfix.

Specifically, when Postfix’s SMTP client went to perform crypto operations, it was hitting functions in libmysqlclient.so.16, which would in turn segfault. How does this make any sense? It doesn’t.

Dodging the issue

We have a small number of Percona installations because it’s installed only when requested by the customer, so it made more sense to document the problem and move on. In this case we just stopped using opportunistic encryption in Postfix’s SMTP client.

Prepending a working libmysqlclient.so.16 to the LD_LIBRARY_PATH, as suggested, was an option, but it’s very “dirty” – not the kind of thing we’d roll out to a lot of servers that we manage.

This was well and good until we came across the problem on two other servers, one of them manifesting when using Curl with PHP. That dirty workaround was looking more and more tempting, so we decided to take a closer look.

Digging up the root cause

You might remember Michael from previous bughunts, like that megaraid_sas driver that was causing memory corruption. He agreed to have a look and help explain the problem for us.

At some point the MySQL devs decided to embed the YaSSL library inside libmysqlclient. The embedded copy provides four dummy “do nothing” functions: CRYPTO_add_lock, CRYPTO_lock, CRYPTO_mem_ctrl and EVP_CIPHER_CTX_init. We’re not sure why, because these functions are not used internally by libmysqlclient, nor are they “hidden” in any way – they’re just there.

Those symbols (functions) are also exported by libcrypto, which is part of the OpenSSL package. libcrypto provides real implementations of these functions, and software that links to libcrypto expects to be able to use them.
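One way to see the overlap for yourself is to compare the dynamic symbols that each library defines. The sketch below shells out to GNU nm with the -D and --defined-only options and intersects the results. It assumes binutils is installed, and the helper names (parse_nm, exported_symbols, clashing_symbols) are invented for this example.

```python
import subprocess


def parse_nm(output):
    """Extract symbol names from `nm -D --defined-only` style output."""
    names = set()
    for line in output.splitlines():
        parts = line.split()
        # Expected shape is "<address> <type> <name>"; skip anything else.
        if len(parts) != 3:
            continue
        # Some binutils versions append "@VERSION" or "@@VERSION" to the name.
        names.add(parts[2].split("@", 1)[0])
    return names


def exported_symbols(library_path):
    """Run nm on a shared library and return its defined dynamic symbols."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", library_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nm(result.stdout)


def clashing_symbols(first_library, second_library):
    """Return the sorted symbol names that both libraries define."""
    first = exported_symbols(first_library)
    second = exported_symbols(second_library)
    return sorted(first & second)
```

Calling clashing_symbols() with the paths to your libmysqlclient.so.16 and your libcrypto (the locations vary by distribution, so none are hard-coded here) should list any names that both libraries define. On an affected system we would expect the four functions named above to show up.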

If both libcrypto and libmysqlclient are referenced by some binary (like /usr/libexec/postfix/smtp for example), the order in which these libraries are searched for symbols is crucial.

What is generally happening is that the symbols from libmysqlclient are “winning” over those from libcrypto. Symbol versioning could have avoided this, but neither libmysqlclient nor libcrypto does that. So the non-working functions are used preferentially, and the software using libcrypto and libmysqlclient segfaults.
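To make the “winning” part concrete, here is a toy model of the lookup. It is a deliberate simplification: it assumes the first library in the search order that defines a name wins, and ignores symbol versioning, weak symbols and the other details of the real dynamic linker. The export lists are illustrative only (the four clashing names from above plus one ordinary function per library), and resolve and shadowed are names made up for this sketch.

```python
# Illustrative export lists only; the real libraries export far more symbols.
EXPORTS = {
    "libmysqlclient": {
        "CRYPTO_add_lock", "CRYPTO_lock", "CRYPTO_mem_ctrl",
        "EVP_CIPHER_CTX_init", "mysql_real_connect",
    },
    "libcrypto": {
        "CRYPTO_add_lock", "CRYPTO_lock", "CRYPTO_mem_ctrl",
        "EVP_CIPHER_CTX_init", "EVP_EncryptInit_ex",
    },
}


def resolve(symbol, search_order, exports=EXPORTS):
    """Return the first library in search_order that defines symbol."""
    for library in search_order:
        if symbol in exports[library]:
            return library
    return None


def shadowed(search_order, exports=EXPORTS):
    """Map each multiply-defined symbol to the library whose copy wins."""
    winners = {}
    duplicates = {}
    for library in search_order:
        for symbol in exports[library]:
            if symbol in winners:
                duplicates[symbol] = winners[symbol]
            else:
                winners[symbol] = library
    return duplicates


bad_order = ["libmysqlclient", "libcrypto"]
good_order = ["libcrypto", "libmysqlclient"]

print(resolve("CRYPTO_lock", bad_order))   # libmysqlclient
print(resolve("CRYPTO_lock", good_order))  # libcrypto
print(sorted(shadowed(bad_order)))         # the four clashing names
```

With libmysqlclient ahead of libcrypto, a caller asking for CRYPTO_lock gets the do-nothing version. Swap the order and the real implementation is found, which is essentially what the LD_LIBRARY_PATH workaround is relying on.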

And why do we see this only on systems running Percona, and not ones with vanilla MySQL? Most of our MySQL installations are too old to be affected. 🙂

Whose fault is it really?

In the previous section we said that the MySQL devs embedded YaSSL into libmysqlclient, which is what causes the problem. So why does the problem only arise when Percona is in use?

The library in question is part of the Percona-Server-shared-compat package, which is actually a repackaged copy of the MySQL-shared-compat package. The bug belongs to MySQL, and their MySQL-shared-compat package includes the affected libmysqlclient.so.16 file.

It looks like the bug has since been fixed by MySQL, but when Percona built their packages they based them on a version of MySQL-shared-compat that contained a buggy libmysqlclient.so.16 library. So really, Percona needs to pull down a newer version of MySQL-shared-compat when they build Percona-Server-shared-compat.

What can we do about it?

We thought it’d be a matter of building some new packages, but it’s not quite so simple.

The SRPM (source package) for Percona-Server-shared-compat is a “nosrc” RPM, which means the packager needs to find and provide the source themselves. In this case, it’s the MySQL-shared-compat RPM. The SRPM for MySQL-shared-compat is also a nosrc RPM. We haven’t been able to find that SRPM. Even if we did, we’d need to find its source as well, and Oracle doesn’t seem to be making that easy.

At the moment we’re considering whether we might be able to fix the Percona-Server-shared-compat package and build it without needing MySQL-shared-compat. Or, if it’s not possible to remove MySQL-shared-compat from the equation, we might be able to make a fixed version. If all else fails we can roll out bugfixed library files, but it’s very messy and will break as upstream packages get updated.

The bug at least seems to be known, both in MySQL and Percona (and a second report). Fingers crossed we can get a proper fix in the near future.