Introduction

For a long time, MapReduce has been the workhorse for processing big data. More recently, Google introduced Cloud Dataflow, which enables a unified model for batch and stream processing. In this article, I'll show a minimal project using Apache Beam, Cloud Pub/Sub and Cloud Datastore.

Why Kotlin?

Although Google's documentation still targets Java and Maven, the Apache Beam project recently gained Kotlin examples. Writing Beam pipelines in Kotlin makes them much more readable, intuitive and familiar, I think :)

What do we create?

Let's make a simple pipeline that subscribes to a Pub/Sub topic and creates a Datastore entity for each message. It's probably the most primitive use case of Dataflow.

Create a project

First, create a Gradle project. You can specify it as a Kotlin project by passing the --type argument.

mkdir <your-proj-dir>
cd <your-proj-dir>
gradle init --type kotlin-application

Setup Gradle build

Because we use Apache Beam in this project, put these lines in the dependencies block of the build.gradle file.

implementation 'org.slf4j:slf4j-simple:1.7.26'
implementation 'org.apache.beam:beam-sdks-java-core:2.13.0'
implementation 'org.apache.beam:beam-runners-direct-java:2.13.0'
implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.13.0'
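
For reference, a rough sketch of how the dependencies block might look after the change (the entries generated by gradle init, such as the Kotlin standard library and test libraries, stay as they are):

dependencies {
    // ... entries generated by `gradle init` remain unchanged ...

    // Added for this project: logging, the Beam core SDK, and the two runners we use.
    implementation 'org.slf4j:slf4j-simple:1.7.26'
    implementation 'org.apache.beam:beam-sdks-java-core:2.13.0'
    implementation 'org.apache.beam:beam-runners-direct-java:2.13.0'
    implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.13.0'
}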

Also, add this block at the top level of build.gradle. It is necessary for passing command-line arguments when we run the program.

run {
    if (project.hasProperty('args')) {
        args project.args.split('\\s+')
    }
}
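
With this in place, whatever you pass via the -Pargs project property is split on whitespace and forwarded to the program as command-line arguments, for example (the flags here are just illustrative; the real ones appear in the "Run it locally" section):

gradle run -Pargs="--some-flag=value --another-flag=value"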

Thanks to this article!

Configure your environment

Prepare a service account which has Pub/Sub, Datastore and Dataflow scopes.

Download a credential JSON file for the account from the Google Cloud console and place it somewhere on your machine.

On Linux or Mac,

export GOOGLE_APPLICATION_CREDENTIALS=<full-path-to-your-json>
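
You can confirm the variable took effect with a quick check:

echo $GOOGLE_APPLICATION_CREDENTIALS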

On Windows, use set or the equivalent for your shell.

NOTE: Apache Beam has GcpOptions#setGcpCredential but it didn’t work for me.

Make a pipeline code

Open src/main/kotlin/<your-proj-name>/App.kt and replace its contents with the code below. Change the package name and other constants as you like.

package org.yourproj

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO
import org.apache.beam.sdk.io.gcp.pubsub.PubsubOptions
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.*
import org.apache.beam.sdk.values.PCollection
import com.google.datastore.v1.client.DatastoreHelper.makeKey
import com.google.datastore.v1.client.DatastoreHelper.makeValue
import com.google.datastore.v1.Entity
import org.slf4j.LoggerFactory

interface MyOptions : PubsubOptions {
    var topic: String
}

fun main(args: Array<String>) {
    val LOGGER = LoggerFactory.getLogger("org.yourproj.AppKt")

    val options = PipelineOptionsFactory
        .fromArgs(*args)
        .withValidation()
        .`as`(MyOptions::class.java)

    val p = Pipeline.create(options)

    p.apply<PCollection<String>>(PubsubIO.readStrings().fromTopic(options.topic))
        .apply(ParDo.of(object : DoFn<String, Entity>() {
            @DoFn.ProcessElement
            fun processElement(@DoFn.Element input: String, output: DoFn.OutputReceiver<Entity>) {
                LOGGER.info("input " + input)

                // Build a Datastore key from the message text.
                val key = String.format("%s-%s", "data", input.replace(' ', '-').toLowerCase())

                val entityBuilder = Entity.newBuilder()
                entityBuilder.setKey(makeKey("beam-test", key).build())
                entityBuilder.putProperties("message", makeValue(input).build())

                output.output(entityBuilder.build())
            }
        }))
        .apply(DatastoreIO.v1().write().withProjectId(options.project))

    p.run()
}

Notice that GCP parameters such as the project ID and topic name are passed in as command-line arguments via the options object: project comes from the inherited PubsubOptions/GcpOptions, while topic is defined in our own MyOptions.
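
If you want the custom option to be self-documenting and validated, Beam's option annotations can be added to the property. Here is a small sketch (the description text is just an example); since Beam reads annotations from the getter, the get: use-site target is used:

package org.yourproj

import org.apache.beam.sdk.io.gcp.pubsub.PubsubOptions
import org.apache.beam.sdk.options.Description
import org.apache.beam.sdk.options.Validation

interface MyOptions : PubsubOptions {
    @get:Description("Pub/Sub topic to read from, e.g. projects/<project>/topics/<topic>")
    @get:Validation.Required
    var topic: String
}

Combined with .withValidation(), a missing --topic argument will then fail fast with a readable error.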

Now you can build your project by running this command.

gradle build

Remove the unit test in AppTest.kt if the build complains.
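
If you'd rather keep a test than delete it, a minimal replacement along these lines in src/test/kotlin/<your-proj-name>/AppTest.kt should compile (assuming the kotlin.test dependencies that gradle init generated are still in place); it simply re-checks the key format used in the DoFn:

package org.yourproj

import kotlin.test.Test
import kotlin.test.assertEquals

class AppTest {
    @Test fun messageBecomesHyphenatedLowercaseKey() {
        val input = "Hello Beam"
        // Same expression as in the pipeline's DoFn.
        val key = String.format("%s-%s", "data", input.replace(' ', '-').toLowerCase())
        assertEquals("data-hello-beam", key)
    }
}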

Run it locally

Now you can run your pipeline locally with this command.

gradle run -Pargs="--runner=DirectRunner --project=<your-project-name> --topic=<your-pubsub-topic-name>"

You will see a log like this on the console if the Pub/Sub connection is successfully established.

[main] WARN org.apache.beam.sdk.io.gcp.pubsub.PubsubUnboundedSource - Created subscription projects/<your-project>/subscriptions/<random-subscription> to topic projects/<your-project>/topics/<your-topic>. Note this subscription WILL NOT be deleted when the pipeline terminates

Testing

We don't use the Pub/Sub emulator here, but publisher.py from this document is useful for testing. Open a new shell, set the environment variables and run this command to publish some messages to the topic.

python publisher.py PUBSUB_PROJECT_ID publish TOPIC_ID

If it succeeds, you will see a log line for each message. Also, check the created entities in Cloud Datastore.
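
Alternatively, if you have the gcloud CLI installed, you can publish a test message without the Python script:

gcloud pubsub topics publish <your-pubsub-topic-name> --message "hello dataflow"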

Deploy it on Cloud Dataflow

You can deploy the pipeline on Cloud Dataflow just by changing the command-line arguments: use --runner=DataflowRunner and add --tempLocation=gs://<your-bucket>.
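
For example, the full command would look something like this (the bucket name is up to you):

gradle run -Pargs="--runner=DataflowRunner --project=<your-project-name> --topic=<your-pubsub-topic-name> --tempLocation=gs://<your-bucket>"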

Once your pipeline is successfully deployed, it appears on the Dataflow console, where you can see it running.

NOTE: To deploy on Dataflow, you may have to enable some APIs and assign additional scopes to the service account. Read the error messages carefully.

Conclusion

Processing big data used to be a big job, but Cloud Pub/Sub and Dataflow let us take a step-by-step approach by combining multiple storage services and pipelines. Because pipelines are easy to develop, we can focus on data models and architecture design.