Apache Kafka has become hugely popular as the key junction of the data pipeline. In most cases, however, we treat Kafka only as a stream source or a message queue. This means that if you want to run ad hoc queries, you first need to sync the data to HDFS or some other storage system.

People may have forgotten that Kafka is also very good at high-throughput reads, because it makes full use of both parallel consumption and sequential reading.

To support ad hoc analytics, data exploration, and trend discovery directly on Kafka, a new project called spark-adhoc-kafka has been open sourced.

With this project, you can:

- Treat Kafka topics/streams as tables, with support for SQL
- Run complex joins (against other Kafka topics or against tables stored anywhere else)
- Use it with MLSQL and Spark
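To make the idea of "a topic as a table" concrete, here is a minimal sketch in Python. It deliberately uses Spark's stock Kafka batch source (`format("kafka")`), not the spark-adhoc-kafka API, whose exact options are not shown here. The topic names `orders` and `payments`, the helper names, and the assumption that both topics share a message key are all hypothetical and chosen only for illustration.

```python
def kafka_batch_options(topic, bootstrap_servers):
    """Options for Spark's built-in Kafka batch source (not spark-adhoc-kafka)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Batch read over everything currently in the topic.
        "startingOffsets": "earliest",
        "endingOffsets": "latest",
    }


def register_topic_as_table(spark, topic, table_name, bootstrap_servers):
    """Expose a topic's current contents as a temporary SQL view."""
    df = (
        spark.read.format("kafka")
        .options(**kafka_batch_options(topic, bootstrap_servers))
        .load()
        # Kafka keys and values arrive as binary; cast them for querying.
        .selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "timestamp",
        )
    )
    df.createOrReplaceTempView(table_name)
    return df


def join_orders_with_payments(spark, bootstrap_servers):
    """Join two hypothetical topics on their message key using plain SQL."""
    register_topic_as_table(spark, "orders", "orders", bootstrap_servers)
    register_topic_as_table(spark, "payments", "payments", bootstrap_servers)
    return spark.sql(
        """
        SELECT o.key, o.value AS order_value, p.value AS payment_value
        FROM orders o
        JOIN payments p ON o.key = p.key
        """
    )
```

Given a `SparkSession` started with the `spark-sql-kafka-0-10` package on the classpath, `join_orders_with_payments(spark, "localhost:9092").show()` would run the join directly against the two topics, with no intermediate copy to HDFS. The broker address is a placeholder.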

Note that you can speed up ad hoc queries in spark-adhoc-kafka by: