WebSocket, safety, stability and performance

Written on 11 July 2016.

For several weeks, we have been working hard on the Hoa\Websocket library test suites, both unit and integration. Stability and performance were goals too. This article presents what we have accomplished, the bugs that were found, and how complementary the tests are.

Bugs found

Long story short, 2 important bugs were found and fixed thanks to these new test suites, even though we were already using another integration test suite.

Before writing our own test suites, we were using the famous Autobahn test suite. It works great, and it is a good tool used by major vendors like Mozilla, Google, Apple, Microsoft, Facebook and more. It validates several aspects of the standard (RFC6455) in addition to “industrial aspects” (real use cases that are important in practice). But tests never demonstrate the absence of bugs. Moreover, Autobahn was hard to set up for newcomers: it is written in Python (we work in PHP) and it is not part of our development toolbox, 2 factors that form an obstacle for new contributors. So we decided some months ago to “migrate” this test suite into our own test suite. The work has been split into 2 chapters:

Write unit test suites from scratch, and

Migrate the Autobahn integration test suites.

Let's quickly describe the bugs that were found and fixed.

The most important bug concerns the RFC6455 protocol. As a reminder, Hoa\Websocket supports two WebSocket protocols: the standard one (RFC6455) and the old one (Hybi00), kept for compatibility reasons. When reading a frame, the “headers” of the message must be read before the actual message. Bytes were read only as they were needed. This was an error in a specific edge case: when the message was empty, the algorithm returned early, leaving part of the frame in the network buffer. So the next call to the algorithm read invalid data. It is important to always consume the whole frame each time.

This bug was revealed by the integration test suites when reading consecutive empty messages. It was not revealed by the Autobahn test suite, even though Autobahn has this test case! See the commit fixing the bug.
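To make the “always consume the whole frame” rule concrete, here is a minimal sketch of an RFC6455 frame reader. It is written in Python for brevity and is not the Hoa\Websocket code; the names `read_exact`, `read_frame` and `masked_text_frame` are hypothetical. The article does not say which bytes were left in the buffer, so the sketch assumes, for illustration only, that it was the 4-byte masking key of a client frame.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes, or raise if the buffer ends early."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("incomplete frame")
    return data


def read_frame(stream):
    """Consume one whole RFC 6455 frame and return (fin, opcode, payload)."""
    first, second = read_exact(stream, 2)
    fin = bool(first & 0x80)
    opcode = first & 0x0F
    masked = bool(second & 0x80)
    length = second & 0x7F
    if length == 126:
        (length,) = struct.unpack("!H", read_exact(stream, 2))
    elif length == 127:
        (length,) = struct.unpack("!Q", read_exact(stream, 8))
    # No early return when length == 0: the masking key is still part of
    # the frame and must leave the buffer (assumed scenario, see lead-in).
    key = read_exact(stream, 4) if masked else b""
    payload = read_exact(stream, length)
    if masked:
        payload = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
    return fin, opcode, payload


def masked_text_frame(text, key=b"\x01\x02\x03\x04"):
    """Build a short (< 126 bytes), final, masked text frame for the demo."""
    raw = text.encode("utf-8")
    assert len(raw) < 126
    body = bytes(b ^ key[i % 4] for i, b in enumerate(raw))
    return bytes([0x81, 0x80 | len(raw)]) + key + body


# Two consecutive empty messages followed by a non-empty one.
buffer = io.BytesIO(
    masked_text_frame("") + masked_text_frame("") + masked_text_frame("Hi")
)
frames = [read_frame(buffer) for _ in range(3)]
leftover = buffer.read()
```

With a reader that returned as soon as it saw a zero length, the second call would start parsing in the middle of the first frame. Here every call leaves the buffer at a frame boundary, so `leftover` is empty once the three frames have been read.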

The second bug was not as critical as the first one. It concerns a UTF-8 validity check that was too strict. You may know that all WebSocket text messages must be UTF-8 encoded, so Hoa\Websocket checks this constraint when sending and when receiving a message. Checking it when sending avoids putting invalid data on the network, so it sounds like a good idea, doesn't it? Actually, no. A message can be split into arbitrary fragments. Thus each fragment, taken alone, may contain invalid UTF-8 data (a multi-byte character can be cut between two fragments); but in the end, when all fragments have been received and the message reconstructed, it is still a valid UTF-8 message. So the constraint was too strict. It was easy to fix.

This bug was revealed by the unit and integration test suites. It was not revealed by the Autobahn test suite because Autobahn uses its own client, so it was not possible to detect it. See the commit fixing the bug.
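The fragmentation issue can be reproduced in a few lines. The following is a minimal sketch in Python, not the Hoa\Websocket fix; `fragment_is_valid_alone` and `decode_fragments` are hypothetical names. It splits a message in the middle of a 2-byte character and compares a per-fragment check with a check that carries state across fragments, using the standard incremental decoder.

```python
import codecs


def fragment_is_valid_alone(fragment):
    """The too-strict check: validate one fragment in isolation."""
    try:
        fragment.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_fragments(fragments):
    """Validate UTF-8 across fragments.

    Raises UnicodeDecodeError only if the reconstructed message is invalid.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for index, fragment in enumerate(fragments):
        final = index == len(fragments) - 1
        # The decoder buffers an incomplete trailing sequence until the
        # next fragment arrives; `final` makes it reject a dangling one.
        parts.append(decoder.decode(fragment, final))
    return "".join(parts)


message = "h\u00e9llo".encode("utf-8")   # "é" is encoded as 0xC3 0xA9
fragments = [message[:2], message[2:]]   # cut between 0xC3 and 0xA9

alone = [fragment_is_valid_alone(f) for f in fragments]
text = decode_fragments(fragments)
```

Both fragments fail the isolated check, yet the reconstructed message decodes correctly, which is why the validity constraint only makes sense on the whole message (or with a decoder that keeps state between fragments).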

Unit and integration test suites

Some numbers to illustrate the work on the unit test suites:

12 test suites,

151 test cases,

3514 assertions.

It runs in 2.26 seconds on my computer with PHP7.

And some numbers about the integration test suites:

1 test suite,

31 test cases,

163 assertions.

It runs in 0.65 seconds on my computer.

The Autobahn integration test suite has not been migrated in its entirety because our API does not allow some invalid operations. Some test cases are new too.

This particular situation illustrates how complementary these test suites are. Clearly, some bugs were not detectable with the integration test suites only, and others not with the unit test suites only. If you have read the Nature of tests Section from Hoa\Test's hack book, you may remember this diagram:

Dimensions of the test universe are represented by 3 axes.

“Unit” and “integration” being on the same axis does not mean that one is inferior to the other. They are different and interact with the code at different levels. While the unit test suites focus on isolated methods (for instance, how the frame is parsed), the integration test suites focus on how components interact (for instance, how the client receives an answer from the server).

Also, this story illustrates that we must not blindly trust a test suite only because it is used by major vendors. It does not detect all the potential bugs because it focuses on only one level of the code. So be careful when authors use this argument. Ask yourself if it is relevant. Safety is a science, not a marketing word.

Stability and performance

There are a lot of edge cases when using networks. In the case of the WebSocket protocol, this is even more true. It starts with an HTTP request, then a handshake, and then messages are exchanged. When the HTTP request is received by the server, the client is connected, but it cannot yet receive messages from other clients. This holds until the handshake has succeeded.
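As a reminder of what the handshake step involves in RFC6455: the server answers the client's HTTP upgrade request with a `Sec-WebSocket-Accept` header derived from the client's `Sec-WebSocket-Key`. The sketch below, in Python and independent of Hoa\Websocket, computes that value; `accept_value` is a hypothetical name, while the GUID and the sample key come from the RFC itself.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_value(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value for a given client key."""
    digest = hashlib.sha1(
        (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    ).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from RFC 6455, section 1.3.
accept = accept_value("dGhlIHNhbXBsZSBub25jZQ==")
# accept == "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```

Until this value has been sent back and accepted, the connection exists at the TCP level but no WebSocket message can flow, which is the intermediate state described above.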

These cases are the favorite places for bugs in a concurrent environment. Concurrent connections are funny, but concurrent disconnections (expected or… unexpected ones) are even funnier. We were able to reach 100 concurrent connections per second, and we are now reaching 500 concurrent connections. We still have some unexpected behaviors in particular edge cases, but there are no more errors. We could possibly go higher, just for fun, but we are slowly reaching the limits of the language and would have to start tuning the TCP stack, the file system, the OS, etc., which is clearly out of the scope of the library.

Also, the socket library is injected. So far, most people use the Hoa\Socket library, but one might use another socket library, based on ZeroMQ for instance. We don't have an official library for ZeroMQ; this would be a nice contribution.

If you use Hoa\Websocket in production, we would be glad to learn about it. Knowing how it is used, we could probably optimize some parts of the library. We know there are companies using this library for light or heavy operations (with, for instance, hundreds of nodes in the network, transferring megabytes of data) but we would like to know more 😄.

Thank you for everything!