Modern IOT number formats

A flood of IOT numerical formats

IOT developers have a never-ending fountain of innovation, creating new concepts and pushing the boundary of how computers interact with their surroundings. This innovation spills over into the constant creation of new numerical formats. I’ve created a system for dealing with a bunch of different Bluetooth values.

History: big and little endian

One of the earliest hints of the many varieties of number formats is the famous Internet Experiment Note IEN 137. That paper listed some of the numerical formats in use on internet-facing computers and proposed a partial solution to the common problem of sending integer data. This is the paper that started the big-endian (network and many then-popular computers) and little-endian (some computers, including those with Intel chips) nomenclature.

The Internet as a whole, since the 1980s, has decided that networks should generally send data in big-endian format. Most PCs are little-endian. Mac computers have changed their default endian-ness. Bluetooth is little-endian by default. But some devices (I’m looking at you, Witti designs!) have some big-endian values.

Examples of problematical numerical formats

For the unbelievers – the people who think that IOT devices are easily specified, and that there are just a few simple data types – here’s a brief list of a few of the formats I’ve needed to deal with:

BBC Micro:Bit temperature data: A single signed byte that represents the temperature in degrees Celsius. For example, when the value is 26, it means that the temperature is 26 degrees Celsius. This is actually a pretty simple format, but the next ones are much weirder.

InkBird IBS-TH1 temperature data: A two-byte, signed integer which is in units of a hundredth of a degree C. The value must be divided by 100 to get standardized units of Celsius.

Nordic Semiconductor “Thingy:52” pressure data: Two bytes in “12Q4” format: there are 16 bits total (12+4), and there’s an implied decimal point with 4 bits of precision. After that, the data must be divided by 1.25 to get a pressure in millibars.

Texas Instruments 1350 SensorTag pressure sensor: Three bytes that are a signed integer that must be divided by 100 to get a pressure in hPa (hectopascals, or millibars; standard atmospheric pressure is 1013.25 hPa).
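To make the simpler formats above concrete, here is a minimal Python sketch that decodes three of them by hand. The function names are made up for illustration, and the little-endian byte order is an assumption based on Bluetooth's default rather than something checked against each device.

```python
def microbit_temperature(data: bytes) -> int:
    """One signed byte, already in degrees Celsius."""
    return int.from_bytes(data[:1], "little", signed=True)


def inkbird_temperature(data: bytes) -> float:
    """Two-byte signed integer in hundredths of a degree Celsius.
    Little-endian order is assumed here."""
    return int.from_bytes(data[:2], "little", signed=True) / 100


def sensortag_pressure(data: bytes) -> float:
    """Three-byte signed integer in hundredths of an hPa.
    Little-endian order is assumed here."""
    return int.from_bytes(data[:3], "little", signed=True) / 100


print(microbit_temperature(bytes([26])))              # 26
print(inkbird_temperature(bytes([0x8F, 0x0A])))       # 0x0A8F = 2703 -> 27.03
print(sensortag_pressure(bytes([0xCD, 0x8B, 0x01])))  # 0x018BCD = 101325 -> 1013.25
```

Writing one such function per device is exactly the kind of repetition the little language is meant to replace.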

In addition, many Bluetooth devices put multiple readings into a single characteristic. For example, the ProtoCentral Sensything will put 4 different voltage readings into a single analog value, and it’s very common for environmental sensors to include humidity and temperature in a single characteristic.

Little Language for IOT Number Parsing

Requirements for the language

Because this little language will potentially be implemented by people in a hurry, the language must be trivial to parse. Most computer languages have a perfectly capable string-split capability; the little language lexical analysis should simply be some variant of strsplit. The language is a little more awkward as a result, but that’s an acceptable tradeoff.

Similarly, it’s simpler if each character (like the vertical bar, |) has only one meaning.

The language includes (in one part) reverse polish notation (RPN) style expressions to modify the incoming data. Most people strongly prefer to use the normal math operators (* / + -); those characters are reserved for the expression language and not used elsewhere. For example, “/” should not be used as a separator because that will interfere with it being used by the mathematical expressions.

The language will be embedded in JSON and may well be embedded in the future in a URL. Common URL characters should be avoided (# & @ ? :) as should anything that can’t be in a JSON string (like a quote “ or the JSON string escape \ or CR or LF).

The final form of the language is:

There’s a set of fields separated by a single space. Extra spaces – either before, after or having multiple spaces – are an error.

Each field is further split into four sections by vertical bars. The first section is mandatory, and the rest are optional. Example: U8 is a field with just the first section and U8|DEC|Temperature|c is a field with all four sections.

Each section is potentially separated into sub-sections with a caret (^). For example, U16^100_/ is a field with one section; that section is divided into two sub-sections. The first sub-section (U16) tells us that the number should be read as an unsigned integer; the second says that the value should be divided by 100.

Fields are separated with a single space

Bluetooth devices, luckily, always present data in a consistent way. I’ve never seen a device where the field sizes change from one sample to the next. It’s always the case that if some value is two bytes in one reading, it’s always two bytes. Devices often include multiple data fields in one characteristic.

Each field description is separated with spaces. Example: U8 U8 U8 would be used by a characteristic that includes three unsigned bytes, each of which needs to be handled separately.

In addition to real fields which describe specific bytes there are also pseudo-fields which provide per-characteristic information. For example, you might add the OEB (endian/big) pseudo-field if the data should be interpreted in a big-endian format. (There’s a corresponding OEL for endian/little, and it’s the default for Bluetooth). All pseudo-fields start with the letter capital-O.

Field sections + subsections

Each field will need a richer description than just U8, of course. There are four things that a program needs to know about each field:

The numerical format

The preferred display mechanism (e.g., decimal or hexadecimal)

The preferred computer-friendly name (no spaces)

The units. So far this has not been developed very far. A future goal is that when UI is automatically generated for the field, the user will be able to select their preferred units. For example, many temperature sensors produce data in degrees C. Some users prefer to see temperatures in degrees Fahrenheit. Similarly, barometric pressure is often produced in hPa (hectopascals, or millibars) but users want to see the results in PSI (pounds per square inch).

These sections are separated with vertical bars (|). Example: U8|DEC|Temp|c is a field where all four sections are present. This value is an unsigned byte, the preferred output is decimal, the field should be called Temp, and the units are degrees c.

Each section is then split into sub-sections with a caret (^) symbol. Example: I24^100_/ is a field with a single section (the whole of I24^100_/) which has two sub-sections I24 (read in a signed 24-bit integer) and 100_/ (divide the value by 100).

The numerical format has two sub-sections. The first is the format proper (e.g., “I24”). The second is used by the calculation language (example, “100_/”). A complete example is “I24^100_/”.
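Because every separator has exactly one meaning, the whole lexical analysis can be done with plain string splitting. The following Python sketch shows one way to do it; the `Field` class and `parse_descriptor` function are hypothetical names for illustration, not the author's implementation.

```python
from dataclasses import dataclass


@dataclass
class Field:
    number_format: str       # first sub-section of section 1, e.g. "I24"
    calculation: str = ""    # second sub-section of section 1, e.g. "100_/"
    display: str = ""        # section 2, e.g. "DEC"
    name: str = ""           # section 3, e.g. "Temp"
    units: str = ""          # section 4, e.g. "c"


def parse_descriptor(text: str) -> list:
    """Split a descriptor into fields, sections and sub-sections."""
    fields = []
    for raw in text.split(" "):
        if raw == "":
            # Leading, trailing or doubled spaces produce empty pieces.
            raise ValueError("extra space in descriptor")
        sections = raw.split("|")
        if len(sections) > 4:
            raise ValueError("too many sections in field: " + raw)
        sections += [""] * (4 - len(sections))   # optional sections default to ""
        number_format, _, calculation = sections[0].partition("^")
        if number_format == "":
            raise ValueError("missing numerical format in field: " + raw)
        fields.append(Field(number_format, calculation, *sections[1:]))
    return fields


for field in parse_descriptor("OEL I24^100_/ U8|DEC|Temp|c"):
    print(field)
```

The display section is kept as a raw string here; it can carry its own sub-sections, which a fuller parser would split the same way.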

Formats: Simple integer U<bitsize> and I<bitsize>

The I simple format describes signed integers and U describes unsigned integers. The number of bits must be evenly divisible into bytes; the only allowed values are 8, 16, 24 and 32. These are commonly also called bytes, words, an unnamed type and doublewords.

The endianness defaults to little endian but can be changed with the OEL (endian-little) and OEB (endian-big) pseudo-fields.

Example: U8 is an unsigned byte (8 bits) with values 0..255.
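As a rough illustration, the simple integer formats map directly onto Python's `int.from_bytes`. The `read_simple_int` helper below is a hypothetical name; it returns the decoded value together with the offset of the next unread byte.

```python
def read_simple_int(fmt: str, data: bytes, offset: int = 0,
                    big_endian: bool = False):
    """Decode a U<bits> or I<bits> field; returns (value, next_offset)."""
    if fmt[:1] not in ("U", "I"):
        raise ValueError("not a simple integer format: " + fmt)
    bits = int(fmt[1:])
    if bits not in (8, 16, 24, 32):
        raise ValueError("unsupported bit size: " + fmt)
    size = bits // 8
    chunk = data[offset:offset + size]
    if len(chunk) != size:
        raise ValueError("not enough data for " + fmt)
    value = int.from_bytes(chunk, "big" if big_endian else "little",
                           signed=(fmt[0] == "I"))
    return value, offset + size


print(read_simple_int("U8", b"\xff"))                        # (255, 1)
print(read_simple_int("I8", b"\xff"))                        # (-1, 1)
print(read_simple_int("U16", b"\x01\x02"))                   # (513, 2)
print(read_simple_int("U16", b"\x01\x02", big_endian=True))  # (258, 2)
```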

Formats: IEEE floats F32 and F64

The F format describes IEEE floats. Both 32-bit floats and 64-bit floats (the latter commonly called a “double” for historical reasons) can be read. The current endian setting applies.
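A minimal sketch of reading these two formats with Python's standard `struct` module follows. The `read_float` name and its return shape are assumptions made for illustration.

```python
import struct


def read_float(fmt: str, data: bytes, offset: int = 0,
               big_endian: bool = False):
    """Decode an F32 or F64 field; returns (value, next_offset)."""
    codes = {"F32": "f", "F64": "d"}   # struct codes for IEEE single and double
    if fmt not in codes:
        raise ValueError("not a float format: " + fmt)
    layout = (">" if big_endian else "<") + codes[fmt]
    (value,) = struct.unpack_from(layout, data, offset)
    return value, offset + struct.calcsize(layout)


# 1.5 as a 32-bit float is 0x3FC00000.
print(read_float("F32", bytes([0x00, 0x00, 0xC0, 0x3F])))                   # (1.5, 4)
print(read_float("F32", bytes([0x3F, 0xC0, 0x00, 0x00]), big_endian=True))  # (1.5, 4)
```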

Formats: Q style fixed (floats) (example: Q12Q4)

The Q fixed-format system specifies two numbers: the number of (signed) integer bits and the number of fractional bits. The total number of bits must be 8, 16 or 32. The value is signed.

The difference between a “fixed” and “float” has to do with their range and flexibility. A fixed number can only be used when the range is known ahead of time. An accelerometer, for example, is often set to only provide numbers in a pre-set range like ±8 G. A float is auto-ranging; it can handle values in an enormous range with a certain level of precision. At the lowest level, floats are much more complex to deal with.

Example: Q6Q10|HEX|AccelX|G from the Nordic Thingy Raw Motion values. The Q6Q10 is a 16-bit value where the top 6 bits are the integer part of the acceleration (so it can be a value from 31 to -32) and the bottom 10 bits are the fractional bits.
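One common way to decode a Q-style value is to read the whole thing as a signed integer and divide by two to the power of the fractional bit count. The Python sketch below takes that approach; `read_q` is a hypothetical helper, and the author's implementation may differ.

```python
import re


def read_q(fmt: str, data: bytes, offset: int = 0,
           big_endian: bool = False):
    """Decode a Q<int bits>Q<fraction bits> field; returns (value, next_offset)."""
    match = re.fullmatch(r"Q(\d+)Q(\d+)", fmt)
    if match is None:
        raise ValueError("not a Q format: " + fmt)
    int_bits, frac_bits = int(match[1]), int(match[2])
    total = int_bits + frac_bits
    if total not in (8, 16, 32):
        raise ValueError("total bits must be 8, 16 or 32: " + fmt)
    size = total // 8
    chunk = data[offset:offset + size]
    if len(chunk) != size:
        raise ValueError("not enough data for " + fmt)
    raw = int.from_bytes(chunk, "big" if big_endian else "little", signed=True)
    # Moving the implied binary point is a division by 2**frac_bits.
    return raw / (1 << frac_bits), offset + size


print(read_q("Q6Q10", bytes([0x00, 0x06])))  # raw 0x0600 = 1536 -> (1.5, 2)
print(read_q("Q6Q10", bytes([0x00, 0xFF])))  # raw -256 -> (-0.25, 2)
print(read_q("Q6Q10", bytes([0x00, 0x80])))  # raw -32768 -> (-32.0, 2)
```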

Formats: / (slash) style fixed (example: /U8/P8)

The slash format is a subtle variation on the Q style fixed numbers. A slash-style fixed number has two parts: the integer part and the fractional part, each of which must be a simple integer value.

In the slash format, the integer part can be unsigned. The fractional part can be either a binary fraction (any of the unsigned simple formats) or can be a percentage (decimal) which is shown by using a P for the number (e.g., P8 for an 8-bit number which in reality will always be 0..99)

Example: /I8/P8|FIXED|Temperature|C is from the Nordic Thingy temperature reading. The value is a two-byte value where the first integer part is a signed byte and the fractional part is a byte where only the values 0..99 can be present and represent a decimal fraction. If the incoming hex is 0x1B23, the 0x1B is converted to 27 degrees Celsius and the 0x23 is converted to .35; the final value is 27.35 degrees.

Note that the string representation of the number will correctly match the actual value (it will be displayed as 27.35 and not 27.350001).
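The exact-display property is easy to get if the decimal fraction is never pushed through a binary float. Here is a small Python sketch for the /I8/P8 case using the standard `decimal` module. The function name is hypothetical, and how a negative integer part combines with the fraction is an assumption (the fraction is simply added).

```python
from decimal import Decimal


def read_i8_p8(data: bytes) -> Decimal:
    """Decode /I8/P8: a signed integer byte followed by a 0..99 hundredths byte."""
    whole = int.from_bytes(data[0:1], "little", signed=True)
    hundredths = data[1]
    if hundredths > 99:
        raise ValueError("P8 fraction must be in 0..99")
    # Assumption: the fraction is added to the integer part, even when
    # the integer part is negative.
    return Decimal(whole) + Decimal(hundredths) / Decimal(100)


value = read_i8_p8(bytes([0x1B, 0x23]))
print(value)  # 27.35
```

Keeping the value as a `Decimal` until display time avoids the 27.350001 style of rounding noise.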

As a weird quasi-historical note: the slash (/) format was implemented first and then the Q format had to be created. If this mini-language was recreated, I’d be tempted to jam everything into the Q format.

Formats: BYTES (example: BYTES)

The Bytes format is just what it appears to be: a hunk of bytes, generally represented as a series of hex-encoded bytes. It can be of any length (indeed, there’s no current way to force a particular number of bytes, which will have to change).

Formats: STRING (example: STRING|ASCII)

The string format is more challenging than bytes thanks to multiple common representations of strings in IOT devices. The length of the string is “all the rest of the bytes”.

The string will be decoded as if it’s a UTF8 string. If the incoming bytes cannot be decoded as UTF8, they will instead be read as HEX and displayed in a HEX format.

If the display format is default or ASCII, the display string will have nulls converted to backslash-zero (“\0”), CR and LF converted to \r and \n, vertical tab to \v, and plain backslashes to double-backslash \\. Yes, using the word ASCII in this context is a little confusing since the original string might have been UTF8, and the resulting string will be displayed as UTF8.

If the display format is Eddystone, the string will be converted as an Eddystone-encoded string. For example, if the first byte is a NUL, it will be replaced with the characters “http://www.”
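The default/ASCII path can be sketched in a few lines of Python: try UTF-8 first, fall back to hex, then escape the control characters. The `decode_string` name and the exact hex layout (space-separated, upper case) are assumptions for illustration; the Eddystone path is not covered here.

```python
def decode_string(data: bytes) -> str:
    """Decode a STRING field for the default/ASCII display format."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: show the raw bytes as hex instead.
        return data.hex(" ").upper()
    escapes = {
        "\\": "\\\\",  # plain backslash -> double backslash
        "\0": "\\0",   # NUL
        "\r": "\\r",   # CR
        "\n": "\\n",   # LF
        "\v": "\\v",   # vertical tab
    }
    return "".join(escapes.get(ch, ch) for ch in text)


print(decode_string(b"hello\r\n\x00"))  # hello\r\n\0
print(decode_string(b"\xff\xfe"))       # FF FE
```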

Pseudo-format: O option

Any field starting with an uppercase O is an option pseudo-field. A pseudo-field doesn’t represent data in the incoming bytes; instead it says how the data should be parsed in a broader context. For example, some Bluetooth characteristics need to be parsed in a big-endian format; rather than have a special tag on each field, the endianness is set by a pseudo-field (OEB for option-endian-big in this case).

Endian OEB (endian=big) and OEL (endian=little)

Sets the parsing from that point on to be either big-endian or little-endian. Networking is overwhelmingly big-endian (for historical reasons); Bluetooth LE uses little-endian. Because this library is commonly used with Bluetooth, it defaults to little-endian.
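Since the endian setting applies "from that point on", a parser can treat it as a piece of state that changes while walking the fields. The Python sketch below handles only the simple integer formats plus OEB and OEL; `parse_values` is a hypothetical name and the sketch ignores the display, name and units sections.

```python
def parse_values(descriptor: str, data: bytes) -> list:
    """Decode U/I fields from data, honoring OEB/OEL as they appear."""
    big_endian = False   # Bluetooth default is little-endian
    offset = 0
    values = []
    for field in descriptor.split(" "):
        number_format = field.split("|")[0].split("^")[0]
        if number_format == "OEB":
            big_endian = True    # affects every later field
            continue
        if number_format == "OEL":
            big_endian = False
            continue
        if number_format[:1] not in ("U", "I"):
            raise ValueError("unsupported in this sketch: " + number_format)
        size = int(number_format[1:]) // 8
        chunk = data[offset:offset + size]
        if len(chunk) != size:
            raise ValueError("not enough data for " + number_format)
        values.append(int.from_bytes(chunk, "big" if big_endian else "little",
                                     signed=(number_format[0] == "I")))
        offset += size
    return values


same_bytes_twice = bytes([0x12, 0x34, 0x12, 0x34])
print(parse_values("OEB U16 OEL U16", same_bytes_twice))  # [4660, 13330]
```

The same two bytes decode to 0x1234 (4660) under OEB and 0x3412 (13330) under OEL.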

Optional OOPT

Some fields are optional. Use this flag to indicate that the rest of the fields in the incoming data are optional and should be filled in with default values.

TI 1350 SensorTag 2.0: Barometer is “OEL I24^100_/” meaning: read in a 3-byte signed integer in little-endian format. Then divide the result by 100. The data beyond the ^ is the set of math commands to be performed on the data; there’s an RPN stack-based Turing-complete language to perform calculations.

The display portion can be either HEX or DEC to indicate whether the value is commonly displayed in decimal or hex.

BBC micro:bit: The thermometer data is I8|DEC

The value is a single signed byte and typically displayed as decimal. For example, a coarse-grained temperature reading might use this format.

Calculation Language

Each numeric field can be modified via a reverse polish notation (RPN) expression language. The calculation expression is the second sub-section in the numerical format section.

The example I24^100_/ will be used as the sample numerical format section. The I24 is the first sub-section and says that three bytes (24 bits) of the incoming data will be read as a signed integer. The 100_/ is the calculation language.

The calculation language is a stack-based language. The starting value (from the I24) is the first and only entry in the stack. The stack only contains double-precision floating-point numbers.

The language is split by underscores into two items: 100 and /. The 100, being a number, is pushed onto the stack. The / means a floating-point divide. It uses up two items from the stack, divides them (the divisor is the one at the top of the stack – 100 in this case), and then pushes the result onto the stack.
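The walk-through above translates almost line for line into code. This is a minimal Python sketch of the four arithmetic opcodes only; `calculate` is a hypothetical name, and it accepts only decimal number tokens.

```python
import operator

OPCODES = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,   # floating-point divide
}


def calculate(start: float, expression: str) -> float:
    """Run an RPN expression such as "100_/" against a starting value."""
    stack = [float(start)]           # the field's value is the only entry at first
    for token in expression.split("_"):
        if token in OPCODES:
            top = stack.pop()        # e.g. the divisor
            below = stack.pop()      # e.g. the dividend
            stack.append(OPCODES[token](below, top))
        else:
            stack.append(float(token))   # numbers are pushed onto the stack
    return stack[-1]


print(calculate(101325, "100_/"))  # 1013.25
print(calculate(0, "10_2_-"))      # 8.0
```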

There are four math opcodes: + - * /. These have their normal math meanings.

Every other command is exactly two upper-case letters.

Command: Meaning

number: Value to push onto the stack

+ - * /: Opcodes for normal math. 10_2_- is the same as 10-2

AN: AND the 2 top elements

DU: DUPLICATE the top element

GO: GO TO a command

IV: INVERT inverts the value (e.g., 3 → -3 and -3 → 3)

JN JZ JU: Jump if non-zero; jump if zero; jump unconditional

LS RS: LEFT and RIGHT SHIFT

NO: NO-OP; does nothing

PO: POP

SW: SWAP top two elements

XY YX: YX calculates y**x where x is the top of the stack. XY calculates x**y where x is still the top of the stack. Example: 10_2_YX is 10^2 = 100 and 10_2_XY is 2^10 = 1024

ZE: Like AND but the top value is inverted. 0xFFFF_3_ZE is 0xFFFC (bottom 2 bits are zero’d)
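Here is a hedged Python sketch of a subset of the commands listed above: the arithmetic opcodes plus AN, DU, IV, NO, PO, SW, XY, YX and ZE. The jump and shift commands are left out because their exact operand conventions aren't spelled out here. The `run` and `to_number` names are hypothetical, and accepting hex literals such as 0xFFFF is inferred from the ZE example.

```python
def to_number(token: str) -> float:
    """Parse a decimal or 0x-prefixed number token into a double."""
    try:
        return float(token)
    except ValueError:
        return float(int(token, 0))   # handles "0xFFFF"; raises on unknown tokens


def run(start: float, expression: str) -> float:
    stack = [float(start)]
    for token in expression.split("_"):
        if token in ("+", "-", "*", "/"):
            top, below = stack.pop(), stack.pop()
            stack.append({"+": below + top, "-": below - top,
                          "*": below * top}.get(token) if token != "/"
                         else below / top)
        elif token == "DU":
            stack.append(stack[-1])
        elif token == "PO":
            stack.pop()
        elif token == "SW":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif token == "IV":
            stack.append(-stack.pop())
        elif token == "NO":
            pass
        elif token in ("AN", "ZE"):
            top, below = int(stack.pop()), int(stack.pop())
            if token == "ZE":
                top = ~top            # ZE clears the bits that are set in the top value
            stack.append(float(below & top))
        elif token in ("XY", "YX"):
            x, y = stack.pop(), stack.pop()   # x is the top of the stack
            stack.append(y ** x if token == "YX" else x ** y)
        else:
            stack.append(to_number(token))
    return stack[-1]


print(run(0, "10_2_YX"))           # 100.0
print(run(0, "10_2_XY"))           # 1024.0
print(hex(int(run(0, "0xFFFF_3_ZE"))))  # 0xfffc
```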

Display format section (examples: DEC HEX FIXED)

The display section says what the ideal display type is. For example, humans often want to see a temperature value displayed in a base-10 number, not in hex.

For numeric values, you can specify DEC or HEX. For fixed and floating point numbers (like those produced by the Q, slash, or F numerical formats), use FIXED. For strings, display format can be ASCII or Eddystone.

The Specialty display format is used when a value might be a specialized value like an enum. The particular specialization must be added as a sub-section. The only supported value is Specialty^Appearance, which will convert a number into a Bluetooth appearance value (e.g., 64 is Phone).

Name section (example: Temperature)

The name section specifies a name for the field.

Example: In the field U8|DEC|Temp|c the field name will be Temp.

Units section (example: mbar)

The units section is a hint about the actual units for the field. This isn’t currently standardized. Some common values are:

Amount: ppb (parts per billion)

Angle: d (degrees) dps (degrees per second)

Battery level: %

Gravity: g mpss (meters per second per second)

Light: Lux

Magnetization: microTesla

Pressure: hPa

Transmit Power: db db_at_1_m

Temperature: C

Time: s ms (milliseconds) 10ms

Voltage: volts

Why me, and why now?

I’m doing this because I need it, and I don’t see any other mini-language that can express the reality of modern Bluetooth devices.

There are existing libraries for serializing and deserializing data. Many existing serialization packages carry a strong assumption that the serialization code is the owner of the permanent data format. The packages assume that there is only a small number of numeric types, and that the programmer isn’t concerned with their exact representation so long as the representation is robust and is multi-platform. IOT is the opposite: there are many numerical formats, some of which are different from all previous ones.

I’m creating a hobby program that lets me investigate and use a variety of Bluetooth sensor devices. This is not my first foray into the morass: I have an existing set of apps available on the Microsoft app store to deal with a variety of sensors and lights, and I have an existing IOT-capable computer language built into a moderately popular calculator program, also on the Microsoft app store.

One of the top goals of the latest project is to stop rewriting the Bluetooth code. At this point, I have a large enough corpus of actual Bluetooth devices that I can create a JSON description of each device’s services and characteristics and use each description to generate both foundational protocol code and “good enough” UI code that a moderately capable programmer can customize. That JSON description includes a little language to describe exactly how each data field should be parsed.