KSQL, the SQL streaming engine for Apache Kafka®, puts the power of stream processing into the hands of anyone who knows SQL. It’s fun to use for exploring data in Kafka topics, but its real power comes in building stream processing applications. By continually streaming messages from one Kafka topic to another, applying transformations expressed in SQL, it is possible to build powerful applications doing common data wrangling tasks such as:

Applying schema to data

Filtering and masking data

Changing data structures (for example, flattening nested data)

Changing the serialization format

Enriching streams of data

Unifying multiple streams of data Data wrangling means changing your raw data into a form that you can more easily use. This aspect of data science essentially involves taking data you’ve collected, cleaning it up and changing it into a different format, which can increase its value. It’s not enough to simply have data in your possession—you need to make something of it, properly utilize the data and transform it into something you can analyze. As a result, you’ll be able to derive greater value from it.

Imagine you’ve got data on which you want to build some analytics. It’s coming from multiple sources—maybe different instances of the same system. You need to unify those sources into a single one, reformat the data along the way and apply a schema to the data so that it can be easily analyzed downstream. Apache Kafka and KSQL are a great choice for this.

Why Kafka? Because we can use it for integrating the source systems with those downstream, in near real time and with the ability to add and swap out sources and targets without disturbing the rest of the components. In addition, we can reprocess data as required and share it with other systems—enabling them to take advantage of the data cleansing and data wrangling that we’ve performed.

In this article, we’ll see how to pull in data from REST sources, cleanse it and perform data wrangling with KSQL, then stream it out to both Google Cloud Storage (GCS) as well as Google BigQuery for analysis and visualization in Google Data Studio. We’re using Confluent Cloud™ to host our Kafka brokers, but it will work on a local cluster instead if you want to.