by Hendrik Saly

Apache Kafka comes with a lot of security features out of the box (at least since version 0.9). But one feature is missing if you deal with sensitive, mission-critical data: encryption of the data itself.

Sure, this could simply be accomplished by encrypting the disks on which the Kafka brokers store their data. But the unencrypted form may still reside in memory or other caches and, even worse, anyone who can access the appropriate message can just read it. So disk encryption alone is not enough for sensitive data. In addition to “just encrypt the data” we need a mechanism to make sure no one altered the data after the producer sent the message. Keep in mind that Kafka, being a distributed system, has to deal with insecure networks, and SSL/TLS may not protect your data in every case under all circumstances (weak cipher suites, man-in-the-middle attacks, OpenSSL bugs, etc.). Furthermore, with SSL/TLS enabled Kafka cannot leverage the sendfile syscall anymore (which copies data from the page cache directly to a socket).

To achieve all these requirements the producer has to encrypt the messages before pushing them over the wire into Kafka, and the consumer needs to decrypt them upon retrieval. So both endpoints of the communication link handle the security aspects, which is called end-to-end security (or sometimes also end-to-end encryption). The advantage of this approach is that it is totally transparent to Kafka and to all other components involved between producer and consumer.

Encryption algorithm for Kafka

Because Kafka is a high-volume and low-latency message broker we need a fast (but still secure) encryption algorithm which can encrypt an arbitrary amount of data. The obvious choice here is AES (Advanced Encryption Standard), mainly because of its widespread hardware support: modern Intel and AMD processors support AES en-/decryption natively within the CPU, which is a lot faster than AES software implementations. The problem with AES in our context is that it is a symmetric cipher, which means there is only one key, used for encryption as well as decryption. This leads to the question of how a secure key exchange between producer and consumer can be accomplished. The short answer is: it is not possible if you do not already have a secure channel (and having one would arguably make establishing another unnecessary).

So we need another mechanism to conduct a secure key exchange. Luckily this challenge is already solved by asymmetric cryptography, where encryption and decryption are performed with different keys (they belong together and are often referred to as a key pair). One key is the so-called public key, which is used for encryption; the other one is called the private key and can decrypt messages encrypted with the corresponding public key. The private key cannot be derived from the public key, so it is safe to transfer the public key over an insecure channel.

One could now ask why we do not just use an asymmetric cipher (instead of AES) and be done. The reason is that asymmetric ciphers are much slower than symmetric ones and they cannot encrypt data bigger than the cipher’s key size. In our case (let’s choose RSA for now as our asymmetric cryptosystem) the RSA key length considered secure nowadays is typically 4096 bit (=512 byte), but for practical reasons we go with 2048 bit (=256 byte) here. To use RSA as the only algorithm we would need to chunk our data into pieces of at most 256 bytes (actually even less, because the padding also consumes space). That is not very practical. But RSA is well suited to encrypt our AES key (which is either 16 or 32 bytes in length). So let’s do this.
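To make this concrete, here is a minimal sketch of the hybrid approach using only the standard JCE: a small AES key is wrapped (encrypted) with an RSA public key and later unwrapped with the matching private key. Class and method names here are illustrative, not taken from the library:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.security.PrivateKey;
import java.security.PublicKey;

public class HybridKeyWrap {

    static final String RSA_TRANSFORM = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    // Encrypt the (16 or 32 byte) AES key with the RSA public key.
    // The output is exactly one RSA block, i.e. 256 bytes for a 2048-bit key.
    static byte[] wrapKey(SecretKey aesKey, PublicKey rsaPublic) throws Exception {
        Cipher rsa = Cipher.getInstance(RSA_TRANSFORM);
        rsa.init(Cipher.ENCRYPT_MODE, rsaPublic);
        return rsa.doFinal(aesKey.getEncoded());
    }

    // Decrypt the wrapped key with the RSA private key to recover the AES key.
    static SecretKey unwrapKey(byte[] wrapped, PrivateKey rsaPrivate) throws Exception {
        Cipher rsa = Cipher.getInstance(RSA_TRANSFORM);
        rsa.init(Cipher.DECRYPT_MODE, rsaPrivate);
        return new SecretKeySpec(rsa.doFinal(wrapped), "AES");
    }
}
```

Only the small key takes the slow RSA path; the bulk payload stays on the fast AES path.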

Now we need to make the combined usage of AES and RSA encryption semantically secure. That means that encrypting slightly different messages (which have some identical content) with the same key must not leak information that allows conclusions about the plaintext or the key. For AES-encrypted messages you either have to generate a new random key for every message or keep the same key but choose an AES mode which supports initialization vectors (IV) and use a different IV for every message. For RSA we need to apply a randomized encryption padding scheme such as Optimal Asymmetric Encryption Padding (OAEP).
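A small experiment (standard JCE only; the class name is ours) shows why the IV matters: encrypting the same plaintext under the same AES key twice yields different ciphertexts as long as the IVs differ:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class IvDemo {

    // AES-CBC encryption of a byte array with an explicit IV.
    static byte[] encryptCbc(byte[] plain, SecretKey key, byte[] iv) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return c.doFinal(plain);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        SecureRandom rnd = new SecureRandom();
        byte[] iv1 = new byte[16];
        byte[] iv2 = new byte[16];
        rnd.nextBytes(iv1);
        rnd.nextBytes(iv2);

        byte[] msg = "same message, same key".getBytes();
        // Different IVs make the ciphertexts differ even though key and plaintext are identical.
        System.out.println(Arrays.equals(encryptCbc(msg, key, iv1), encryptCbc(msg, key, iv2)));
    }
}
```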

As mentioned above, RSA en-/decryption is pretty slow, so we need to avoid it as much as possible. That is why it makes sense to reuse the same AES key for more than one message and rely on the initialization vector for semantic security. On the decryption side we need to cache the decrypted AES key somehow. To know which AES key was used for a particular message without transmitting it in plaintext (or decrypting it constantly), we add a hash of the key which serves as a cache id. To prevent hash collision attacks the hash needs to be cryptographically secure. For now let us use SHA-256 for that.

As a last piece it would be nice if the consumer could detect whether a particular message is encrypted or not; if it is not, the decryption is simply skipped. This can be achieved by introducing magic bytes that label encrypted messages.

With that background wiring all this together leads to the following high level message processing chain:

O: Original plain message (arbitrary bytes)

K: Plain AES key

M: Magic bytes (0xDF 0xBB)

hash(K): SHA-256 hash of plain AES key

rsa(K): RSA encrypted plain AES key

aes(O): AES encrypted message

IV: Initialization Vector

L: Length information about hash(K), rsa(K) and IV

In this case we use “AES/CBC/PKCS5Padding” for AES en-/decryption and “RSA/ECB/OAEPWithSHA-256AndMGF1Padding” for RSA en-/decryption.

Producer:

1) If no AES key exists, create a random one → (K)

1a) Encrypt the AES key with the RSA public key → rsa(K)

1b) Calculate the SHA-256 hash of the AES key → hash(K)

2) Generate a random initialization vector → IV

3) Encrypt the message with the AES key and IV → aes(O)

4) Replace the original message O with M-L-hash(K)-rsa(K)-IV-aes(O)
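The producer steps above can be sketched in a few lines of plain JCE code. Note that the concrete byte layout of the length field L used here (one byte for the hash length, two bytes for the RSA block length, one byte for the IV length) is an assumption for illustration and not necessarily the library's exact wire format:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.PublicKey;
import java.security.SecureRandom;

public class EncryptingSketch {

    static final byte[] MAGIC = {(byte) 0xDF, (byte) 0xBB};

    // Build M-L-hash(K)-rsa(K)-IV-aes(O) from a plaintext message.
    static byte[] encrypt(byte[] plain, SecretKey aesKey, PublicKey rsaPublic) throws Exception {
        // hash(K): SHA-256 of the plain AES key, used later as cache id
        byte[] hashK = MessageDigest.getInstance("SHA-256").digest(aesKey.getEncoded());

        // rsa(K): the AES key wrapped with the RSA public key
        Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
        rsa.init(Cipher.ENCRYPT_MODE, rsaPublic);
        byte[] rsaK = rsa.doFinal(aesKey.getEncoded());

        // IV: fresh random initialization vector per message
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        // aes(O): the actual payload encryption
        Cipher aes = Cipher.getInstance("AES/CBC/PKCS5Padding");
        aes.init(Cipher.ENCRYPT_MODE, aesKey, new IvParameterSpec(iv));
        byte[] aesO = aes.doFinal(plain);

        // Assemble M, L (assumed layout: 1 + 2 + 1 length bytes) and the three parts.
        return ByteBuffer.allocate(2 + 4 + hashK.length + rsaK.length + iv.length + aesO.length)
                .put(MAGIC)
                .put((byte) hashK.length)
                .putShort((short) rsaK.length)
                .put((byte) iv.length)
                .put(hashK).put(rsaK).put(iv).put(aesO)
                .array();
    }
}
```

In a real deployment the AES key and the expensive rsa(K)/hash(K) values would of course be computed once and reused across messages.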

Consumer:

1) Check magic bytes (M). Bypass unencrypted messages

2) Extract hash(K) by looking at L

3) Extract IV by looking at L

4) If hash(K) is in the cache, take the plain AES key (K) from there

5) If hash(K) is not in the cache, decrypt rsa(K) to get the plain AES key (and put it into the cache)

6) Decrypt aes(O) with K and IV

7) Replace M-L-hash(K)-rsa(K)-IV-aes(O) with O
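The consumer side mirrors those steps: check the magic bytes, slice the message apart, and only fall back to the slow RSA decryption on a cache miss. As before, the layout of L (1 + 2 + 1 length bytes) and the class name are illustrative assumptions, not the library's exact format:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.security.PrivateKey;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DecryptingSketch {

    // Cache of already unwrapped AES keys, keyed by hash(K).
    static final Map<ByteBuffer, SecretKey> KEY_CACHE = new ConcurrentHashMap<>();

    static byte[] decrypt(byte[] msg, PrivateKey rsaPrivate) throws Exception {
        // 1) Check magic bytes (M); bypass unencrypted messages.
        if (msg.length < 2 || msg[0] != (byte) 0xDF || msg[1] != (byte) 0xBB) {
            return msg;
        }

        // 2)+3) Read the length field L and extract hash(K), rsa(K) and IV.
        ByteBuffer buf = ByteBuffer.wrap(msg, 2, msg.length - 2);
        byte[] hashK = new byte[buf.get() & 0xFF];
        byte[] rsaK = new byte[buf.getShort() & 0xFFFF];
        byte[] iv = new byte[buf.get() & 0xFF];
        buf.get(hashK);
        buf.get(rsaK);
        buf.get(iv);
        byte[] aesO = new byte[buf.remaining()];
        buf.get(aesO);

        // 4)+5) Look up the AES key by hash(K); on a miss, unwrap rsa(K) once and cache it.
        SecretKey key = KEY_CACHE.get(ByteBuffer.wrap(hashK));
        if (key == null) {
            Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
            rsa.init(Cipher.DECRYPT_MODE, rsaPrivate);
            key = new SecretKeySpec(rsa.doFinal(rsaK), "AES");
            KEY_CACHE.put(ByteBuffer.wrap(hashK), key);
        }

        // 6)+7) Decrypt aes(O) with K and IV and return the plain message O.
        Cipher aes = Cipher.getInstance("AES/CBC/PKCS5Padding");
        aes.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return aes.doFinal(aesO);
    }
}
```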

So rsa(K) only needs to be processed when a new AES key is used (by the producer) and/or the cache needs to be populated (by the consumer). The drawback of caching the AES key is that it resides permanently in producer and consumer JVM memory. But without caching, the decryption process in particular would be too slow to be useful (see benchmarks).

This adds a constant overhead of 309 bytes to each message: 5 bytes for the header, 32 bytes for the SHA-256 hash, 256 bytes for the RSA-encrypted AES key and 16 bytes for the IV. The encrypted message may additionally be up to 16 bytes bigger than the original message due to PKCS5Padding (block size 16; a full padding block is added when the plaintext length is already a multiple of 16). So the maximum overhead in total is 325 bytes. That may be a lot; especially if you only handle small messages you can easily double your data size. There is nothing we can do here except use a weaker RSA key with only 1024 bit key size, which would reduce the maximum overhead by 128 bytes to 197 bytes. But 1024-bit keys are considered vulnerable. A weaker hash algorithm producing a shorter hash is also not an option. And finally we can neither omit the IV nor put it into the RSA-encrypted part (because that would make caching impossible). So increased message size as well as the additional CPU cycles for AES en-/decryption are the costs for security.
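The overhead arithmetic can be spelled out in a few lines. Note that PKCS5Padding always adds at least one byte and adds a full extra block when the plaintext length is already a multiple of 16, so the growth due to padding is between 1 and 16 bytes:

```java
public class OverheadCalc {
    public static void main(String[] args) {
        int magicAndLength = 5;   // M (2 bytes) + L (3 bytes)
        int hash = 32;            // SHA-256
        int rsaBlock = 2048 / 8;  // RSA-encrypted AES key, one RSA block
        int iv = 16;              // one AES block
        System.out.println(magicAndLength + hash + rsaBlock + iv); // 309

        // PKCS5Padding rounds the plaintext up to the next multiple of 16,
        // always adding at least one padding byte.
        for (int len : new int[]{5, 15, 16}) {
            int padded = (len / 16 + 1) * 16;
            System.out.println(len + " -> " + padded);
        }
    }
}
```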

But that’s enough theory, let’s look at the implementation:

The basic idea for transparent end-to-end encryption in Kafka is to write a Serializer and a Deserializer which wrap the original Serializer and Deserializer and add the en-/decryption processing transparently.

Implementing a delegating Serializer and Deserializer is easy. We just need to implement two classes for that:

org.apache.kafka.common.serialization.Deserializer<T>:
public T deserialize(String topic, byte[] data) {
    return originalDeserializer.deserialize(topic, decrypt(data));
}

org.apache.kafka.common.serialization.Serializer<T>:
public byte[] serialize(String topic, T data) {
    return encrypt(originalSerializer.serialize(topic, data));
}
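Here is a runnable sketch of that delegation pattern. To keep it self-contained it uses local stand-ins for the Kafka Serializer/Deserializer interfaces (so it compiles without kafka-clients on the classpath) and a fixed all-zero AES key/IV purely for the demo; the real implementation derives key and IV per message as described above:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Stand-ins for org.apache.kafka.common.serialization.Serializer/Deserializer<T>.
interface SimpleSerializer<T> { byte[] serialize(String topic, T data); }
interface SimpleDeserializer<T> { T deserialize(String topic, byte[] data); }

public class DelegationSketch {

    // Fixed key and IV for demonstration only -- never do this in production.
    static final SecretKeySpec KEY = new SecretKeySpec(new byte[16], "AES");
    static final IvParameterSpec IV = new IvParameterSpec(new byte[16]);

    static byte[] crypt(int mode, byte[] in) {
        try {
            Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
            c.init(mode, KEY, IV);
            return c.doFinal(in);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SimpleSerializer<String> innerSer = (t, s) -> s.getBytes(StandardCharsets.UTF_8);
        SimpleDeserializer<String> innerDe = (t, b) -> new String(b, StandardCharsets.UTF_8);

        // Encrypting serializer: serialize first, then encrypt the raw bytes.
        SimpleSerializer<String> enc = (t, s) -> crypt(Cipher.ENCRYPT_MODE, innerSer.serialize(t, s));
        // Decrypting deserializer: decrypt the raw bytes, then deserialize.
        SimpleDeserializer<String> dec = (t, b) -> innerDe.deserialize(t, crypt(Cipher.DECRYPT_MODE, b));

        System.out.println(dec.deserialize("topic", enc.serialize("topic", "hello"))); // hello
    }
}
```

The important detail is the ordering: encryption wraps around the serialized bytes, and decryption happens before the wrapped deserializer runs.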

The en-/decryption itself is done with the plain Java (JCE) implementation, which utilizes AES-NI instructions (if the CPU supports them) and has the additional advantage of requiring no external dependencies.

En/Decryption of a byte array is pretty simple:

1) Cipher cipher = Cipher.getInstance(algo)

2) cipher.init(mode, key, [IV])

3) byte[] result = cipher.doFinal(input)

algo: Algorithm (here “RSA” or “AES”) and padding scheme

mode: encrypt or decrypt

key: The key

IV: Initialization vector (AES only)

input: The input bytes (plaintext or encrypted text), depends on mode

output: The output (encrypted text or plain text), depends on mode

Note: Instances of Cipher are not thread-safe, so they are best encapsulated in a ThreadLocal.
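A common way to do that (class and method names are ours) is to hand each thread its own Cipher instance:

```java
import javax.crypto.Cipher;

public class CipherHolder {

    // One Cipher instance per thread; Cipher objects are stateful and not thread-safe.
    private static final ThreadLocal<Cipher> AES_CIPHER = ThreadLocal.withInitial(() -> {
        try {
            return Cipher.getInstance("AES/CBC/PKCS5Padding");
        } catch (Exception e) {
            // getInstance throws checked exceptions; rewrap since the algorithm is fixed
            throw new IllegalStateException(e);
        }
    });

    // Returns the calling thread's private Cipher instance.
    static Cipher aesCipher() {
        return AES_CIPHER.get();
    }
}
```

Each thread then calls init() and doFinal() only on its own instance, avoiding both locking and repeated Cipher.getInstance() calls.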

Use it

1) Include the library via maven

<dependency>
    <groupId>de.saly</groupId>
    <artifactId>kafka-end-2-end-encryption</artifactId>
    <version>1.0.1</version>
</dependency>

or download it from https://github.com/salyh/kafka-end-2-end-encryption/releases/tag/v1.0.1

2) Generate an RSA keypair:

java -cp kafka-end-2-end-encryption-1.0.1.jar de.saly.kafka.crypto.RsaKeyGen 2048

3) Configure your producer and consumer:

Producer:

value.serializer: de.saly.kafka.crypto.EncryptingSerializer
crypto.wrapped_serializer: org.apache.kafka.common.serialization.StringSerializer
crypto.rsa.publickey.filepath: /opt/rsa_publickey.key

Consumer:

value.deserializer: de.saly.kafka.crypto.DecryptingDeserializer
crypto.wrapped_deserializer: org.apache.kafka.common.serialization.StringDeserializer
crypto.rsa.privatekey.filepath: /opt/rsa_privatekey.key

Benchmarks

A recent MacBook Pro with Java 8 can encrypt approx. 300 MB/s on average and decrypt approx. 1350 MB/s on average (per thread).

Limitations

The design of end-to-end security discussed in this article does have some limitations. It currently provides only encryption and no kind of accountability or non-repudiation (because messages are not signed yet). Authentication and authorization are also not covered but can be handled by Kafka’s own mechanisms. It also does not protect against a man in the middle dropping, replaying or reordering messages. There is also no forward secrecy. We will discuss and add some of these features in part 2 of this article.

Conclusion

This article and the provided implementation demonstrate how transparent end-to-end security can be applied to Kafka, adding an enterprise-grade security feature. We discussed the nature of symmetric and asymmetric encryption systems, how they can be combined and how much overhead they add. In part 2 we will discuss optimizations (like batching and compression of messages) and adding cryptographic signatures to establish a trusted relationship between various producers and consumers. In part 3 we will look at how non-Java producers and consumers can be made ready for end-to-end security.

Download

https://github.com/salyh/kafka-end-2-end-encryption

https://github.com/salyh/kafka-end-2-end-encryption-bench-it

This article as well as the implementation was inspired by http://www.symantec.com/connect/blogs/end-end-encryption-though-kafka-our-proof-concept (credits to Jim Hoagland).

Apache, Apache Kafka, Kafka, and associated open source project names are trademarks of the Apache Software Foundation