Diving into RLP

Actually, the Ethereum wiki already explains RLP in a way that is very easy to understand, so here I only recap it in my own writing style. What I expect from this article is to dive into RLP and get a deeper understanding. Hmmm, again 😑. One more thing: all of the ideas here are just my personal point of view and may be mistaken. So you should recheck them, convince yourself, and believe them only if they are correct.

From what I read on the Ethereum wiki, RLP only deals with bytes, strings and lists. Extra data types such as big numbers, booleans, pointers, slices… depend on the programming language we use to implement RLP.

According to the documentation, the first byte of the encoded data determines the type of the data.

[0x00, 0x7f] : byte

[0x80, 0xbf] : string

[0xc0, 0xff] : list
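The three ranges above can be sketched as a tiny lookup. This is only my illustration, not code from the wiki, and the function name classify_first_byte is a hypothetical one I made up for this article.

```python
def classify_first_byte(first: int) -> str:
    """Tell which kind of item an RLP encoding starts with, from its first byte."""
    if not 0x00 <= first <= 0xff:
        raise ValueError("not a byte value")
    if first <= 0x7f:
        return "byte"    # [0x00, 0x7f]: the byte is its own encoding
    if first <= 0xbf:
        return "string"  # [0x80, 0xbf]: a prefix, then the string data
    return "list"        # [0xc0, 0xff]: a prefix, then the encoded items


print(classify_first_byte(0x61))  # byte
print(classify_first_byte(0x83))  # string
print(classify_first_byte(0xc8))  # list
```

Note that a decoder only has to look at one byte to know what kind of item comes next.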

The first question: why don’t we use a fixed prefix instead of a dynamic prefix?

First of all, you can see that in RLP the data sometimes needs extra bytes to describe its type and length, but sometimes the data conveys its type and length by itself.

The main reason is to save memory space.

If we used a fixed prefix, we would have to add it to every single input we want to encode, and in some situations the data itself would be even shorter than the prefix.

You could argue that a fixed prefix is simpler to read, but that only applies to humans. A computer cannot tell which one is more complex. It just runs the code, and what matters there is computational complexity. In this case, I’m pretty sure that the two implementations have the same computational complexity.

Furthermore, if the prefix were fixed, how many bytes should it use? We can’t be sure. So a fixed prefix is unnecessary.
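To make the memory argument concrete, here is a rough comparison in Python. The 5-byte fixed header (1 byte for the type plus 4 bytes for the length) is purely my assumption for illustration, and rlp_overhead is a hypothetical helper that only covers strings of at most 55 bytes, where RLP needs at most one prefix byte.

```python
FIXED_HEADER_SIZE = 5  # hypothetical fixed prefix: 1 type byte + 4 length bytes


def rlp_overhead(data: bytes) -> int:
    """Number of prefix bytes RLP adds to a string of at most 55 bytes."""
    if len(data) > 55:
        raise ValueError("longer strings are out of scope for this sketch")
    if len(data) == 1 and data[0] <= 0x7f:
        return 0  # a single byte in [0x00, 0x7f] is its own encoding
    return 1      # one prefix byte carries both the type and the length


for sample in (b"a", b"dog", b"hello world"):
    print(len(sample), rlp_overhead(sample), FIXED_HEADER_SIZE)
```

For the one-byte and three-byte samples, the assumed fixed header alone is longer than the data it describes.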

The second question: why did they choose 0x7f, 0x80, 0xbf and 0xc0 as the checkpoints?

Just think about it step by step. We don’t want to use any prefix when encoding a single byte, because a fixed prefix would double (or triple, or more) the memory needed to store the encoded data, as I explained in the first question. So we need to determine a range in which a byte is encoded as itself.

ASCII Table.

It may not be an accident that 0x7f was chosen. ASCII uses 7 bits to encode 128 characters, and the largest 7-bit value is 0x7f. I believe that is the reason why [0x00, 0x7f] was chosen. However, what is the RLP encoding of a byte with the value 0x80?

The answer is that we add a prefix: RLP_encode(0x80) = [0x81, 0x80].
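A minimal sketch of this single-byte rule, treating the input as a one-byte string. The name encode_single_byte is hypothetical. The prefix 0x81 is simply 0x80 plus the length 1.

```python
def encode_single_byte(value: int) -> bytes:
    """RLP-encode a one-byte string given as an integer in [0x00, 0xff]."""
    if not 0x00 <= value <= 0xff:
        raise ValueError("not a byte value")
    if value <= 0x7f:
        return bytes([value])    # encoded as itself, no prefix
    return bytes([0x81, value])  # 0x80 + length 1, then the byte


print(encode_single_byte(0x7f).hex())  # 7f
print(encode_single_byte(0x80).hex())  # 8180
```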

After that, for strings and lists, we have no choice and must use a prefix. It’s clear that they divided the rest of the range into halves: [0x80, 0xbf] for string encoding and [0xc0, 0xff] for list encoding.

The third question: why must we use a range to describe a type instead of a single byte value? Surely one byte value is enough?

Yup, one byte value is enough to represent a type of data, but we also need to know how long the data is to get the offset. To do that, we would have to add more prefix bytes if we used just one byte value to represent the type of data.

So now you can see that we would get two problems. Firstly, assuming that we use the prefix 0x80 for strings and 0x81 for lists, we waste this byte on distinguishing just 2 values while it could carry more. Secondly, it seems that we are fixing the prefix again (one byte for the type, some bytes for the length of the data), and as I discussed in the first question, that may waste a lot of memory.

So we choose a range of byte values to encode not only the type of the data but also its length.
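Here is how I picture one prefix byte carrying both pieces of information. This is a sketch with a hypothetical name, encode_short, and it only handles the case where the length fits into the prefix itself, which in RLP means payloads of at most 55 bytes. Longer payloads are deliberately rejected here.

```python
def encode_short(item):
    """RLP-encode bytes or a (nested) list of bytes whose payload is <= 55 bytes."""
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] <= 0x7f:
            return item  # a single small byte is its own encoding
        if len(item) > 55:
            raise ValueError("string too long for a one-byte prefix")
        return bytes([0x80 + len(item)]) + item  # type (string) + length in one byte
    payload = b"".join(encode_short(child) for child in item)
    if len(payload) > 55:
        raise ValueError("list payload too long for a one-byte prefix")
    return bytes([0xc0 + len(payload)]) + payload  # type (list) + length in one byte


print(encode_short(b"dog").hex())             # 83646f67
print(encode_short([b"cat", b"dog"]).hex())   # c88363617483646f67
```

In b"dog", the prefix 0x83 says "string" (it is in [0x80, 0xbf]) and "3 bytes long" (0x83 - 0x80) at the same time.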

The fourth question: what do we do when the length of the data is out of the range of the prefix?