SBE Overview

Message Structure

Figure 1

SbeTool and the Compiler

java [-Doption=value] -jar sbe.jar <message-declarations-file.xml>

Programming with Stubs

// Write the message header first
MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion)
              .blockLength(CAR.sbeBlockLength())
              .templateId(CAR.sbeTemplateId())
              .schemaId(CAR.sbeSchemaId())
              .version(CAR.sbeSchemaVersion());

// Then write the body of the message
car.wrapForEncode(directBuffer, bufferOffset)
   .serialNumber(1234)
   .modelYear(2013)
   .available(BooleanType.TRUE)
   .code(Model.A)
   .putVehicleCode(VEHICLE_CODE, srcOffset);

// Read the header and lookup the appropriate template to decode
MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion);

final int templateId = MESSAGE_HEADER.templateId();
final int actingBlockLength = MESSAGE_HEADER.blockLength();
final int schemaId = MESSAGE_HEADER.schemaId();
final int actingVersion = MESSAGE_HEADER.version();

// Once the template is located then the fields can be decoded.
car.wrapForDecode(directBuffer, bufferOffset, actingBlockLength, actingVersion);

final StringBuilder sb = new StringBuilder();
sb.append("\ncar.templateId=").append(car.sbeTemplateId());
sb.append("\ncar.schemaId=").append(schemaId);
sb.append("\ncar.schemaVersion=").append(car.sbeSchemaVersion());
sb.append("\ncar.serialNumber=").append(car.serialNumber());
sb.append("\ncar.modelYear=").append(car.modelYear());
sb.append("\ncar.available=").append(car.available());
sb.append("\ncar.code=").append(car.code());
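The decode snippet above reads the header but leaves the "lookup the appropriate template" step implicit. A minimal sketch of that step follows, written against a plain java.nio.ByteBuffer rather than the generated stubs. The class HeaderDispatch, the template id value, the 8-byte header of four unsigned 16-bit fields, and the sample message are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not the generated SBE stubs: reads a header of four
// unsigned 16-bit fields and picks a decoder based on templateId.
final class HeaderDispatch {
    static final int HEADER_LENGTH = 8;
    static final int CAR_TEMPLATE_ID = 1; // assumed id, for illustration only

    // Reads an unsigned 16-bit value at an absolute index.
    static int uint16(final ByteBuffer buffer, final int index) {
        return buffer.getShort(index) & 0xFFFF;
    }

    // Decodes the message starting at offset and returns a printable summary.
    static String decode(final ByteBuffer buffer, final int offset) {
        final int actingBlockLength = uint16(buffer, offset);
        final int templateId = uint16(buffer, offset + 2);
        final int schemaId = uint16(buffer, offset + 4);
        final int actingVersion = uint16(buffer, offset + 6);
        final int bodyOffset = offset + HEADER_LENGTH; // body follows the header

        switch (templateId) {
            case CAR_TEMPLATE_ID:
                return "car.serialNumber=" + buffer.getLong(bodyOffset)
                    + " schemaId=" + schemaId
                    + " blockLength=" + actingBlockLength
                    + " version=" + actingVersion;
            default:
                return "unknown templateId=" + templateId;
        }
    }

    // Builds a little-endian sample message: header followed by one 8-byte field.
    static ByteBuffer sample(final int templateId, final long serialNumber) {
        final ByteBuffer buffer = ByteBuffer.allocate(HEADER_LENGTH + 8).order(ByteOrder.LITTLE_ENDIAN);
        buffer.putShort(0, (short) 8);          // blockLength
        buffer.putShort(2, (short) templateId); // templateId
        buffer.putShort(4, (short) 1);          // schemaId
        buffer.putShort(6, (short) 0);          // version
        buffer.putLong(HEADER_LENGTH, serialNumber);
        return buffer;
    }
}
```

Calling HeaderDispatch.decode(HeaderDispatch.sample(1, 1234L), 0) takes the car branch, while any other template id falls through to the default case. A real application would hand the body offset, acting block length, and acting version to the matching generated stub instead of building a string.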

On-The-Fly Decoding

Direct Buffers

Message Extension and Versioning

Byte Ordering and Alignment

Message Protocols

Stub Performance

Feedback

Update: 08-May-2014

Benchmark                                              Mode  Thr  Cnt  Sec       Mean  Mean error   Units
[exec] u.c.r.protobuf.CarBenchmark.testDecode         thrpt    1   30    1    462.817       6.474  ops/ms
[exec] u.c.r.protobuf.CarBenchmark.testEncode         thrpt    1   30    1    326.018       2.972  ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testDecode  thrpt    1   30    1   1148.050      17.194  ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testEncode  thrpt    1   30    1   1242.252      12.248  ops/ms
[exec] u.c.r.sbe.CarBenchmark.testDecode              thrpt    1   30    1  10436.476     102.114  ops/ms
[exec] u.c.r.sbe.CarBenchmark.testEncode              thrpt    1   30    1  11657.190      65.168  ops/ms
[exec] u.c.r.sbe.MarketDataBenchmark.testDecode       thrpt    1   30    1  34078.646     261.775  ops/ms
[exec] u.c.r.sbe.MarketDataBenchmark.testEncode       thrpt    1   30    1  29193.600     443.638  ops/ms

Benchmark                                              Mode  Thr  Cnt  Sec       Mean  Mean error   Units
[exec] u.c.r.protobuf.CarBenchmark.testDecode         thrpt    1   30    1    619.467       4.429  ops/ms
[exec] u.c.r.protobuf.CarBenchmark.testEncode         thrpt    1   30    1    433.711      10.364  ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testDecode  thrpt    1   30    1   2088.998      60.619  ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testEncode  thrpt    1   30    1   1316.123      19.816  ops/ms

Throughput msg/ms - Before GPB Optimisation

Test                  Protocol Buffers          SBE   Ratio
Car Decode                     462.817    10436.476   22.55
Car Encode                     326.018    11657.190   35.76
Market Data Decode            1148.050    34078.646   29.68
Market Data Encode            1242.252    29193.600   23.50

Throughput msg/ms - After GPB Optimisation

Test                  Protocol Buffers          SBE   Ratio
Car Decode                     619.467    10436.476   16.85
Car Encode                     433.711    11657.190   26.88
Market Data Decode            2088.998    34078.646   16.31
Market Data Encode            1316.123    29193.600   22.18

Financial systems communicate by sending and receiving vast numbers of messages in many different formats. When people use terms like "vast" I normally think, "really... how many?" So let's quantify "vast" for the finance industry. Market data feeds from financial exchanges can typically emit tens or hundreds of thousands of messages per second, and aggregate feeds like OPRA can peak at over 10 million messages per second with volumes growing year-on-year. This presentation gives a good overview.

In this crazy world we still see significant use of ASCII encoded presentations, such as FIX tag value, and some slightly more sane binary encoded presentations like FAST. Some markets even commit the sin of sending out market data as XML! Well, I cannot complain too much as they have at times provided me a good income writing ultra fast XML parsers.

Last year the CME, who are a member of the FIX community, commissioned Todd Montgomery, of 29West LBM fame, and myself to build the reference implementation of the new FIX Simple Binary Encoding (SBE) standard. SBE is a codec aimed at addressing the efficiency issues in low-latency trading, with a specific focus on market data. The CME, working within the FIX community, have done a great job of coming up with an encoding presentation that can be so efficient. Maybe a suitable atonement for the sins of past FIX tag value implementations. Todd and I worked on the Java and C++ implementations, and later we were helped on the .NET side by the amazing Olivier Deheurles at Adaptive. Working on a cool technical problem with such a team is a dream job.

SBE is an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I've seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic.
SBE is designed to make this part of a system the most efficient it can be. SBE follows a number of design principles to achieve this goal. Adhering to these design principles sometimes means features available in other codecs will not be offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies to both managed runtime and native languages. SBE is totally allocation free in all three language implementations.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays.
While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.

A message must be capable of being read or written sequentially to preserve the streaming access design principle, i.e. with no need to backtrack. Some codecs insert location pointers for variable length fields, such as string types, that have to be indirected for access. This indirection comes at a cost of extra instructions plus losing the support of the hardware prefetchers. SBE's design allows for pure sequential access and copy-free native access semantics.

SBE messages have a common header that identifies the type and version of the message body to follow. The header is followed by the root fields of the message, which are all fixed length with static offsets. The root fields are very similar to a struct in C. If the message is more complex then one or more repeating groups similar to the root block can follow. Repeating groups can nest other repeating group structures. Finally, variable length strings and blobs come at the end of the message. Fields may also be optional. The XML schema describing the SBE presentation can be found here.

To use SBE it is first necessary to define a schema for your messages. SBE provides a language independent type system supporting integers, floating point numbers, characters, arrays, constants, enums, bitsets, composites, grouped structures that repeat, and variable length strings and blobs.

A message schema can be input into the SbeTool and compiled to produce stubs in a range of languages, or to generate binary metadata suitable for decoding messages on-the-fly.

SbeTool and the compiler are written in Java. The tool can currently output stubs in Java, C++, and C#.

A full example of messages defined in a schema with supporting code can be found here. The generated stubs follow a flyweight pattern with instances reused to avoid allocation.
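The flyweight pattern and static offsets described above can be illustrated with a hand-written sketch. This is not the code SbeTool generates: the class CarFlyweight, its two fields, and their offsets are hypothetical, and a plain java.nio.ByteBuffer stands in for the buffer abstraction the real stubs use.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical hand-written flyweight; real SBE stubs are generated by SbeTool.
final class CarFlyweight {
    // Static offsets of the fixed-length root fields (illustrative layout).
    static final int SERIAL_NUMBER_OFFSET = 0; // 8-byte value
    static final int MODEL_YEAR_OFFSET = 8;    // 2-byte unsigned value
    static final int BLOCK_LENGTH = 10;

    private ByteBuffer buffer;
    private int offset;

    // Re-points this instance at a message; nothing is allocated or copied.
    CarFlyweight wrap(final ByteBuffer buffer, final int offset) {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    CarFlyweight serialNumber(final long value) {
        buffer.putLong(offset + SERIAL_NUMBER_OFFSET, value);
        return this;
    }

    long serialNumber() {
        return buffer.getLong(offset + SERIAL_NUMBER_OFFSET);
    }

    CarFlyweight modelYear(final int value) {
        buffer.putShort(offset + MODEL_YEAR_OFFSET, (short) value);
        return this;
    }

    int modelYear() {
        return buffer.getShort(offset + MODEL_YEAR_OFFSET) & 0xFFFF;
    }

    // Encodes two cars back to back with a single reused flyweight instance.
    static ByteBuffer encodeTwo() {
        final ByteBuffer buffer = ByteBuffer.allocateDirect(2 * BLOCK_LENGTH).order(ByteOrder.LITTLE_ENDIAN);
        final CarFlyweight car = new CarFlyweight();
        car.wrap(buffer, 0).serialNumber(1234).modelYear(2013);
        car.wrap(buffer, BLOCK_LENGTH).serialNumber(5678).modelYear(2014);
        return buffer;
    }
}
```

The point of the design is that the only per-message work is moving an offset: one flyweight instance can walk any number of messages in a buffer, and each accessor is a single read or write at a fixed displacement from that offset.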
The stubs wrap a buffer at an offset and then read it sequentially and natively.

Messages can be written via the generated stubs in a fluent manner. Each field appears as a generated pair of methods to encode and decode.

The generated code in all languages gives performance similar to casting a C struct over the memory.

The compiler produces an intermediate representation (IR) for the input XML message schema. This IR can be serialised in the SBE binary format to be used for later on-the-fly decoding of messages that have been stored. It is also useful for tools, such as a network sniffer, that will not have been compiled with the stubs. A full example of the IR being used can be found here.

SBE, via Agrona, provides a buffer abstraction for Java that works with byte[] arrays, heap or direct ByteBuffers, and off-heap memory addresses obtained from native allocation or JNI. In low-latency applications, messages are often encoded/decoded in memory mapped files and thus can be transferred to a network channel by the kernel, avoiding user space copies.

C++ and C# have built-in support for direct memory access and do not require such an abstraction as the Java version does. A DirectBuffer abstraction was added for C# to support endianness and encapsulate the unsafe pointer access.

SBE schemas carry a version number that allows for message extension. A message can be extended by adding fields at the end of a block. Fields cannot be removed or reordered, for backwards compatibility.

Extension fields must be optional, otherwise a newer template reading an older message would not work. Templates carry metadata for min, max, null, timeunit, character encoding, etc.; these are accessible via static (class level) methods on the stubs.

The message schema allows for precise alignment of fields by specifying offsets. Fields are by default encoded in Little Endian form unless otherwise specified in a schema.
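The extension rule described above (new fields are appended to the end of a block and must be optional) can be sketched by hand, using the default little-endian byte order. Everything named here is an assumption for illustration: the class VersionedCarDecoder, the "discount" field added in version 1, its offset, and the use of Integer.MIN_VALUE as the null sentinel. Generated stubs expose their own equivalents.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical decoder, not generated by SbeTool: version 1 of the message
// appended an optional 32-bit "discount" field to the end of the root block.
final class VersionedCarDecoder {
    static final int SERIAL_NUMBER_OFFSET = 0;                // since version 0
    static final int DISCOUNT_OFFSET = 8;                     // since version 1
    static final int DISCOUNT_SINCE_VERSION = 1;
    static final int DISCOUNT_NULL_VALUE = Integer.MIN_VALUE; // assumed null sentinel

    private ByteBuffer buffer;
    private int offset;
    private int actingBlockLength;
    private int actingVersion;

    // actingBlockLength and actingVersion come from the header written by the sender.
    VersionedCarDecoder wrap(
        final ByteBuffer buffer, final int offset, final int actingBlockLength, final int actingVersion) {
        this.buffer = buffer;
        this.offset = offset;
        this.actingBlockLength = actingBlockLength;
        this.actingVersion = actingVersion;
        return this;
    }

    long serialNumber() {
        return buffer.getLong(offset + SERIAL_NUMBER_OFFSET);
    }

    // An older message has no discount bytes, so report null instead of reading past the block.
    int discount() {
        if (actingVersion < DISCOUNT_SINCE_VERSION) {
            return DISCOUNT_NULL_VALUE;
        }
        return buffer.getInt(offset + DISCOUNT_OFFSET);
    }

    // Where the data after the root block starts: based on the sender's block length.
    int limit() {
        return offset + actingBlockLength;
    }

    // Encodes a root block as a writer of the given schema version would, in little-endian order.
    static ByteBuffer encode(final int version, final long serialNumber, final int discount) {
        final int blockLength = version >= DISCOUNT_SINCE_VERSION ? 12 : 8;
        final ByteBuffer buffer = ByteBuffer.allocate(blockLength).order(ByteOrder.LITTLE_ENDIAN);
        buffer.putLong(SERIAL_NUMBER_OFFSET, serialNumber);
        if (version >= DISCOUNT_SINCE_VERSION) {
            buffer.putInt(DISCOUNT_OFFSET, discount);
        }
        return buffer;
    }
}
```

Using the sender's block length rather than the reader's own is what lets an older reader skip fields it does not know about, while the version check is what lets a newer reader cope with a message that predates the field.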
For maximum performance, native encoding with fields on word aligned boundaries should be used. The penalty for accessing non-aligned fields on some processors can be very significant. For alignment one must consider the framing protocol and buffer locations in memory.

I often see people complain that a codec cannot support a particular presentation in a single message. However, this is often possible to address with a protocol of messages. Protocols are a great way to split an interaction into its component parts; these parts are then often composable for many interactions between systems. For example, the IR implementation of schema metadata is more complex than can be supported by the structure of a single message. We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interpreter with much less branching than the typical switch based state machines.

Protocol design is an area that most developers don't seem to get an opportunity to learn. I feel this is a great loss. The fact that so many developers will call an "encoding" such as ASCII a "protocol" is very telling. The value of protocols is so obvious when one gets to work with a programmer like Todd who has spent his life successfully designing protocols.

The stubs provide a significant performance advantage over the dynamic OTF decoding. For accessing primitive fields we believe the performance is reaching the limits of what is possible from a general purpose tool. The generated assembly code is very similar to what a compiler will generate for accessing a C struct, even from Java!

Regarding the general performance of the stubs, we have observed that C++ has a very marginal advantage over Java, which we believe is due to runtime inserted Safepoint checks.
The C# version lags a little further behind due to its runtime not being as aggressive with inlining methods as the Java runtime. Stubs for all three languages are capable of encoding or decoding typical financial messages in tens of nanoseconds. This effectively makes the encoding and decoding of messages almost free for most applications relative to the rest of the application logic.

This is the first version of SBE and we would welcome feedback. The reference implementation is constrained by the FIX community specification. It is possible to influence the specification but please don't expect pull requests to be accepted that significantly go against the specification. Support for JavaScript, Python, Erlang, and other languages has been discussed and would be very welcome.

Thanks to feedback from Kenton Varda, the creator of GPB, we were able to improve the benchmarks to get the best performance out of GPB. The results for the changes to the Java benchmarks are shown in the benchmark output and throughput tables under the 08-May-2014 update.

The C++ GPB examples on optimisation show approximately a doubling of throughput compared to initial results. It should be noted that you often have to do the opposite in Java with GPB compared to C++ to get performance improvements, such as allocating objects rather than reusing them.