Original post at https://www.erlang-solutions.com/blog.html

With the ElixirConf US last week, we’re celebrating all things Elixir. We have also launched our new Elixir Architecture Sessions — get in touch for more details.

Building an Elixir Stream is the third blog of our #ElixirOverload Takeover. You can also read the other Elixir-based content in this series including Receiving Messages in Elixir by Oleg Tarasenko, and To Pipe or Not To Pipe by Joe Yiasemides.

Building an Elixir Stream

MongooseIM is one of the products we develop at ESL. It is a real-time messaging platform with a focus on performance and scalability. Therefore it is essential to perform load tests on such software. Here we look at building and developing an Elixir Stream including building a library and testing data.

Not so long ago the load testing process was quite complicated and consumed too much developer time. We took the effort to develop an automation tool, called Tide, to test MongooseIM both regularly and easily.

It starts load tests with almost no effort from the developer. Tests are started in a sandbox environment. During such a test, our load generator Amoc and MongooseIM itself report metrics describing the state of the test from both the client and the server perspective. Simulated clients exchange messages through the server, so the most important client metric is end-to-end message delivery time. We use InfluxDB for metrics storage. It is a time series database, which stores metrics efficiently and provides an API for analysing the data.

The Problem

During a test, we set up a temporary instance of InfluxDB, where metrics are reported. It is a fresh instance of the database for each test, started on load infrastructure. After each test, we want to:

- Retrieve all metrics.
- Perform some transformations.
- Store it in another instance of InfluxDB, let’s call it persistent.

One technical problem with this approach is that InfluxDB yields data in a different format than it accepts as input. Query results returned via the HTTP API are JSON, whereas writes must use the Influx Line Protocol.

So in our product we also had to transform the format of data.

So summing up, we query for all data with the following:

SELECT * FROM /.*/

which gives us JSON like:

{
  "results": [
    {
      "statement_id": 0,
      "series": [
        {
          "name": "metric_name",
          "columns": ["time", "value"],
          "values": [
            ["2018-02-02T15:46:52Z", 1],
            ["2018-02-02T15:47:02Z", 2]
          ]
        }
      ]
    }
  ]
}

We need to transform it into the line protocol, where the timestamp is an integer Unix time (nanoseconds by default):

metric_name value=1 1517586412000000000
metric_name value=2 1517586422000000000

and do a POST request to another instance of the database.
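The conversion step can be sketched as follows. This is a minimal illustration, not the code used in Tide: the module and function names are hypothetical, it takes an already decoded "series" map as input, and it assumes every column other than "time" is a field with no null values. Since the line protocol takes the timestamp as an integer Unix time (nanoseconds by default), the sketch converts the RFC 3339 string returned by the query. It also ignores escaping rules and the "i" suffix that InfluxDB requires for integer fields.

```elixir
defmodule LineProtocolSketch do
  # Hypothetical helper: turns one decoded "series" map into line protocol lines.
  # Assumption: every column except "time" is a field, and no value is null.
  def series_to_lines(%{"name" => name, "columns" => columns, "values" => rows}) do
    Enum.map(rows, fn row ->
      # Pair each column name with its value, e.g. %{"time" => ..., "value" => 1}
      point = columns |> Enum.zip(row) |> Map.new()
      {time, fields} = Map.pop(point, "time")

      field_set =
        fields
        |> Enum.sort()
        |> Enum.map_join(",", fn {key, value} -> "#{key}=#{format_value(value)}" end)

      "#{name} #{field_set} #{to_nanoseconds(time)}"
    end)
  end

  # String field values are double-quoted; numbers are written as-is.
  defp format_value(value) when is_binary(value), do: ~s("#{value}")
  defp format_value(value), do: to_string(value)

  # RFC 3339 timestamp -> Unix time in nanoseconds.
  defp to_nanoseconds(rfc3339) do
    {:ok, datetime, _offset} = DateTime.from_iso8601(rfc3339)
    DateTime.to_unix(datetime, :nanosecond)
  end
end
```

Calling LineProtocolSketch.series_to_lines/1 on the series from the JSON above yields one line per data point, which can then be joined with newlines and sent in the body of the write request.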

Naive Solution

It looks pretty simple: just a few Enum.map() calls. Indeed, it worked when we created the first prototype. It looked more or less like this:

def dump_metrics() do
  query = "SELECT * FROM /.*/"
  params = URI.encode_query(%{q: query, db: "test_db"})
  url = "http://localhost:8086/query?#{params}"
  %{body: body} = HTTPoison.get!(url)

  data_to_write =
    Poison.decode!(body)
    |> map_metrics()
    |> convert_to_influx_line_protocol()

  HTTPoison.post("http://another_influx:8086/write", data_to_write)
end

This naive implementation has a few problems. First of all, it queries for all metrics in one request. This is okay when the expected amount of data is less than a few hundred data points. In our case, we were querying for tens of thousands of data points. Waiting for a response takes more than a minute and the request is likely to time out.

Using Chunked Transfer Encoding

Our first attempt to fix this problem took advantage of InfluxDB’s ability to chunk responses using HTTP chunked transfer encoding. InfluxDB then streams partial results instead of sending all results at once. To use it, we simply need to add chunked=true to the GET parameters. Let’s look at our dumping function after introducing these changes:

def dump_metrics() do
  query = "SELECT * FROM /.*/"
  params = URI.encode_query(%{q: query, db: "test_db", chunked: true})
  url = "http://localhost:8086/query?#{params}"
  %{body: body} = HTTPoison.get!(url)

  body
  # every batch is separated with a newline; trim: true drops the empty trailing element
  |> String.split("\n", trim: true)
  |> Enum.map(&Poison.decode!/1)
  |> Enum.map(&map_metrics/1)
  |> Enum.map(&convert_to_influx_line_protocol/1)
  |> Enum.each(&HTTPoison.post("http://another_influx:8086/write", &1))
end

It worked pretty well! We got responses in reasonable time and the requests did not time out. However, this solution still does not leverage the fact that data is being streamed from InfluxDB: we are blocked until we have received all the data from the database. So the only advantage of this approach is that the query no longer times out at the database level. Dumping metrics still takes some time. Also, when we started using it, it turned out that we were not able to dump metrics for a 1-hour-long test: the system was running out of memory, and the dump took quite a long time. Parameters for the tests are at the bottom of the text.

Memory consumption peaked as data was retrieved from the database. This happens because the whole raw, JSON-encoded response has to be held in memory as a single string before any further processing can start.

Building a Stream

One of the solutions we came up with was to wrap asynchronous InfluxDB responses into an Elixir Stream. We hoped it would speed up the dumping process, as we wouldn’t have to wait for all the data before sending it. We also hoped for lower memory consumption, as the data would be passed between the databases continuously. In other words, the system was expected to receive data from the inbound InfluxDB and push it to the persistent one at the same time, with the two processes being more independent of each other.
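The difference between the two approaches can be illustrated with a small, self-contained sketch. The module name and the `:decode`/`:write` labels are made up for the illustration and stand in for the decoding and POST steps. With Enum, every element passes through one stage before the next stage starts; with Stream, each element travels through the whole pipeline before the next one is touched.

```elixir
defmodule EagerVsLazy do
  # Runs `fun` with a `note` callback and returns the noted events in order.
  defp trace(fun) do
    {:ok, log} = Agent.start_link(fn -> [] end)
    fun.(fn event -> Agent.update(log, &[event | &1]) end)
    events = log |> Agent.get(& &1) |> Enum.reverse()
    Agent.stop(log)
    events
  end

  # Eager: the whole list is "decoded" before anything is "written".
  def eager do
    trace(fn note ->
      [1, 2]
      |> Enum.map(fn x ->
        note.({:decode, x})
        x
      end)
      |> Enum.each(fn x -> note.({:write, x}) end)
    end)
  end

  # Lazy: each element is "decoded" and "written" before the next is requested.
  def lazy do
    trace(fn note ->
      [1, 2]
      |> Stream.map(fn x ->
        note.({:decode, x})
        x
      end)
      |> Stream.each(fn x -> note.({:write, x}) end)
      |> Stream.run()
    end)
  end
end
```

EagerVsLazy.eager/0 records both decode events before any write event, while EagerVsLazy.lazy/0 interleaves them, which is why a lazy pipeline never needs to hold all intermediate results at once.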

Stream.resource/3 may be used to build a stream. It requires passing three functions: Stream.resource(start_fun, next_fun, after_fun) .

The first function is called when the Stream is initialized. As an example, the Elixir docs use opening a file. In our case it would be initializing a connection to the database.

The second function is called each time a new element is requested. We would like it to emit a batch of data points from InfluxDB when it is called. The last function is called when the stream is finished and should perform any cleanup.
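As a minimal sketch of how the three functions fit together, the following example builds a stream that emits a list in batches. It has nothing to do with InfluxDB yet; the module name and the batching logic are assumptions made purely for illustration.

```elixir
defmodule BatchStream do
  # Builds a lazy stream that emits `items` in batches of `batch_size`.
  def from_list(items, batch_size) do
    Stream.resource(
      # start_fun: runs once, when the stream is first consumed
      fn -> items end,
      # next_fun: runs on every demand; returns {elements, new_state} or {:halt, state}
      fn
        [] ->
          {:halt, []}

        remaining ->
          {batch, rest} = Enum.split(remaining, batch_size)
          {[batch], rest}
      end,
      # after_fun: runs when the stream finishes or is halted early
      fn _state -> :ok end
    )
  end
end
```

Nothing is computed until the stream is consumed, for example with `BatchStream.from_list([1, 2, 3, 4, 5], 2) |> Enum.to_list()`, and a consumer such as `Enum.take/2` can stop early without the remaining batches ever being built.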

We decided to use a GenServer as the “backend” for our resource. Strictly speaking, a GenServer is not required here, but we used it to provide better separation, in particular to avoid putting any messages into the library user’s message queue. Avoiding the GenServer should provide better performance, but we ultimately decided that this approach would be easier to understand and provide better isolation while still performing well. In this case the latency introduced by message passing is really low compared to the HTTP request latency. The messages sent by the library are big binaries containing JSON strings from InfluxDB, which the Erlang VM passes between processes efficiently (large binaries are reference-counted rather than copied).

So, the functions provided to Stream.resource/3 are thin wrappers around GenServer calls:

# Function called to initialize the Stream
defp init_fun(url) do
  fn ->
    {:ok, pid} = Supervisor.new_worker(url)
    pid
  end
end

# Function called when there is demand for a new element
defp next_fun do
  fn pid ->
    case Worker.get_chunk(pid) do
      {:chunk, chunk} -> {[chunk], pid}
      :halt -> {:halt, pid}
    end
  end
end

# Function called when the stream is finished, used for cleanup
defp after_fun do
  fn pid ->
    Worker.stop(pid)
  end
end

We used HTTPoison for HTTP requests. It supports delivering an HTTP response as messages to a process, which fits our architecture perfectly. When it initializes, the GenServer just issues a non-blocking query with HTTPoison and afterwards performs specific actions according to the messages it receives. The whole design is event-based. The following diagram represents how the GenServer actually behaves:

In the diagram above we have a Finite State Machine, which accepts three events:

- new chunk: a new portion of data from InfluxDB
- end of data: a response from InfluxDB indicating there are no more chunks
- get chunk: an event from the consumer of the Stream

In short, we accumulate chunks as they arrive. When the consumer asks for one, we simply return it. The consumer call is blocking, so if the consumer is waiting for more chunks, the GenServer will eventually reply with either a chunk or a signal to end the Stream.
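The state machine described above can be sketched as a small GenServer. This is an illustration under stated assumptions, not the code from our library: the module name is hypothetical, and chunks arrive as plain `{:chunk, data}` and `:done` messages so that the example runs without any HTTP client. With HTTPoison's asynchronous mode, the corresponding messages would be structs such as `%HTTPoison.AsyncChunk{}` and `%HTTPoison.AsyncEnd{}`.

```elixir
defmodule ChunkWorker do
  use GenServer

  # Client API

  def start_link, do: GenServer.start_link(__MODULE__, :ok)

  # "get chunk" event: blocks until a chunk is available or the data has ended.
  def get_chunk(pid), do: GenServer.call(pid, :get_chunk, :infinity)

  def stop(pid), do: GenServer.stop(pid)

  # Server callbacks. State: buffered chunks, a waiting consumer (if any), done flag.

  @impl true
  def init(:ok), do: {:ok, %{chunks: :queue.new(), waiting: nil, done: false}}

  @impl true
  def handle_call(:get_chunk, from, state) do
    case {:queue.out(state.chunks), state.done} do
      {{{:value, chunk}, rest}, _done} -> {:reply, {:chunk, chunk}, %{state | chunks: rest}}
      {{:empty, _queue}, true} -> {:reply, :halt, state}
      # Nothing buffered yet: postpone the reply until the next event arrives.
      {{:empty, _queue}, false} -> {:noreply, %{state | waiting: from}}
    end
  end

  # "new chunk" event: buffer it, or hand it straight to a waiting consumer.
  @impl true
  def handle_info({:chunk, chunk}, %{waiting: nil} = state) do
    {:noreply, %{state | chunks: :queue.in(chunk, state.chunks)}}
  end

  def handle_info({:chunk, chunk}, %{waiting: from} = state) do
    GenServer.reply(from, {:chunk, chunk})
    {:noreply, %{state | waiting: nil}}
  end

  # "end of data" event: remember it, and release a waiting consumer if there is one.
  def handle_info(:done, %{waiting: nil} = state), do: {:noreply, %{state | done: true}}

  def handle_info(:done, %{waiting: from} = state) do
    GenServer.reply(from, :halt)
    {:noreply, %{state | waiting: nil, done: true}}
  end
end
```

Because the consumer's request is a GenServer call whose reply may be postponed, the caller simply blocks until the next chunk or the end-of-data event arrives, and none of the chunk messages ever land in the caller's own mailbox.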

Getting back to our dumping function, this is what it looks like now:

def dump_metrics() do
  query = "SELECT * FROM /.*/"
  params = URI.encode_query(%{q: query, db: "test_db", chunked: true})
  url = "http://localhost:8086/query?#{params}"

  # init_fun/1, next_fun/0 and after_fun/0 come from the snippet above
  Stream.resource(init_fun(url), next_fun(), after_fun())
  |> Stream.map(&Poison.decode!/1)
  |> Stream.map(&map_metrics/1)
  |> Stream.map(&convert_to_influx_line_protocol/1)
  |> Stream.each(&HTTPoison.post("http://another_influx:8086/write", &1))
  |> Stream.run()
end

Below there is a graph with memory usage while dumping metrics with streams. It has the same scale as the previous graph. Parameters for tests are at the bottom of the text.

As we can see, performance is noticeably better.

Crafting a Library

Last but not least, we extracted a library based on the InfluxDB related code. Streaming the results of a query is just one of the features we need to interact with the database. We called our library Flex and it also comes with extra utilities:

- API for basic interaction with InfluxDB
- module for manipulating datapoints
- basic Query construction module
- CaseTemplate for conveniently testing your application's interaction with InfluxDB

It is available on GitHub.

Summary

Above we presented all the steps we took while constructing our solution.

Starting with a naive solution usually does not solve the problem, but it gives us a general view of what a good solution might look like. As you can see, our dump_metrics/0 function did not change much over time. Next, it is essential to find out whether a resource can be batched or chunked. This may be trivial in some cases, like reading a file from disk. In our case the database allowed us to get chunked responses, although finding that out required some research.

Finally, we need to design the data flow for our Stream resource. It needs to be aligned with the Stream API. We used a GenServer as the backend for our Stream, so we rely on message passing: while the Stream is running, the GenServer receives both consumer requests and new chunks as messages. This allowed us to build a reactive, event-based solution.

Note on Testing Data

Here are some facts about the environment that was used for measuring performance.

- InfluxDB 1.3.0 from a Docker image was used.
- The database was populated with 1 metric containing 1 000 000 entries.

Command used to populate the database:

(for x <- 1..1_000_000, do: "measurement,tag1=my_tag value1=\"asdf#{x}\",value2=\"asdf#{x}\",value3=#{x} #{x}")
|> Enum.join("\n")
|> (fn x ->
  HTTPoison.post("http://localhost:8086/write?db=test_db&epoch=ms", x, [], timeout: 60_000, recv_timeout: 60_000)
end).()

Time measured with the :timer.tc/1 function:

- Enum solution: 969435317 µs ≈ 16 minutes

- Stream solution: 120394400 µs ≈ 2 minutes
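For reference, a measurement of this kind can be sketched as below. The module and function names are hypothetical; `:timer.tc/1` itself takes a zero-arity function and returns the elapsed time in microseconds together with the function's result.

```elixir
defmodule Timing do
  # Runs `fun` and returns {elapsed_microseconds, elapsed_minutes, result}.
  def measure(fun) when is_function(fun, 0) do
    {micros, result} = :timer.tc(fun)
    {micros, micros / 60_000_000, result}
  end
end
```

A call such as `Timing.measure(&dump_metrics/0)`, made from inside the module that defines dump_metrics/0, would produce numbers comparable to the ones above: 969435317 µs divided by 60 000 000 is about 16.2 minutes, and 120394400 µs is about 2.0 minutes.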

The destination and source InfluxDB instances were the same; only the databases differed. Therefore, the Stream solution is expected to perform even better when the destination and source are different instances.

Conclusion

If you want to find out more about building an Elixir stream, or have any queries regarding this blog post or our Elixir Development, you can contact us at general@erlang-solutions.com.

We also have a whole alchemy themed selection of blog posts to enjoy including more information about Elixir module attributes, Elixir mix configuration and fault tolerance.