It's been almost exactly a year since my original blog post about DataStreams.jl, so I figure it's time for an update!

There's actually quite a bit to announce, and I'm really excited about how things have developed over the last year.

What is DataStreams.jl again?

DataStreams.jl is about designing interfaces for easy and efficient transfer of "table-like" data (i.e. data that can, at least in some sense, be described by rows and columns) between sources and sinks. The key is to provide an interface (i.e. a set of required methods to implement) such that, as long as a source implements it correctly, it can "stream" data to any existing, valid sink. Similarly, as long as a sink implements the required "sink interface" methods, it can "receive" data from any source. I gave an intro to the framework in my 2016 JuliaCon talk.

Traditionally, sources and sinks have been developed in isolation, with each source or sink having to provide its own options for interacting with other sources/sinks, usually only a single option (e.g. reading a csv file into a dataframe, or loading a csv file into a database). Supporting each such combination independently in each package, however, is painstaking work. Imagine I'm a database middleware developer: do I really want to go through the effort of allowing users to load data into a DB via csv file, sqlite table, dataframe, etc.? A natural first step is to stop and say, "well, I can't possibly support all existing or future types of sources to load into a DB, but I could define a common interface, and as long as a source implements the interface, I can load it into the db". All of a sudden, this dev relieves herself of having to be an expert in the most efficient handling of each specific datasource type, opting instead to let source developers code to an interface.

This line of reasoning has been taken several steps further in the maturation of the DataStreams.jl framework. Independent interfaces exist for both Data.Source and Data.Sink types. DataStreams.jl itself provides a set of common Data.stream! methods that handle the actual initialization, streaming, and closedown of the data-ingestion process, yet have, and need, no outside knowledge of the specific Source/Sink implementations. For a real-life example, the CSV.jl package now provides a CSV.Source type which implements the Data.Source interface and concerns itself solely with producing table-like data as efficiently as possible. CSV.Sink, on the other hand, implements the Data.Sink interface and only has to worry about storing tabular data to disk in the csv format. One of the most powerful results of this Source-Sink decoupling is the fact that adding a new Source or Sink to the ecosystem is O(1): I don't have to worry about whether my new source type will be compatible with outputting data to 10 different sink types. I define how my source provides data, ensure the Data.Source interface is properly implemented, and boom, my new source can now interact with any existing and future sinks, automatically and efficiently.
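To make the decoupling concrete, here's a hedged, minimal sketch of what implementing the Data.Source interface for a custom type might look like. The `MatrixSource` type is hypothetical, and the exact method names and signatures may differ from the current interface; see the docs linked below for the authoritative version.

```julia
using DataStreams, DataFrames

# A hypothetical source wrapping an in-memory matrix
struct MatrixSource <: Data.Source
    schema::Data.Schema
    data::Matrix{Float64}
end

# Describe the table-like shape of the data (column names/types, row count)
Data.schema(s::MatrixSource) = s.schema

# Declare that this source can stream one field (cell) at a time
Data.streamtype(::Type{MatrixSource}, ::Type{Data.Field}) = true

# Produce the value at a given row/column; the central Data.stream!
# methods in DataStreams.jl drive these calls for any compatible sink
Data.streamfrom(s::MatrixSource, ::Type{Data.Field}, ::Type{Float64}, row, col) =
    s.data[row, col]
```

With something like this in place, `MatrixSource` never needs to know whether it's feeding a DataFrame, a csv file, or a database table.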

For a more technical and in-depth reading of the Data.Source and Data.Sink interfaces, see the docs here.

New DataStreams Features

In addition to the power of Source-Sink decoupling, having the actual streaming framework defined centrally in DataStreams.jl allows for several other useful features:

Appending: oftentimes, a user needs to aggregate several sources into a single sink; as part of the Data.Sink interface, sinks must account for the streaming scenario where new source data will be appended to existing sink data (with a matching schema as an obvious requirement).

Transforms: often, users need to apply simple data-cleansing or manipulation tasks to source data before it's in a usable form to analyze or output to a sink. With the data transfer defined centrally, it was extremely simple to allow the user to supply a set of transform functions that should be applied to the data "in transit", i.e. as a single field value or column of values is provided by the Data.Source, the transform function is applied before sending to the Data.Sink for storage.

Package use consistency: due to the consistency of implementing the Data.Source and Data.Sink interfaces separately, a natural "high-level" convenience method falls out of each package. Such methods for Sources are:

CSV.read(file, sink=DataFrame; append=false, transforms=Dict())

SQLite.query(db, query, sink=DataFrame; append=false, transforms=Dict())

Feather.read(file, sink=DataFrame; append=false, transforms=Dict())

ODBC.query(dsn, query, sink=DataFrame; append=false, transforms=Dict())
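As a small illustration of the transforms feature, here's a hedged sketch using the CSV.read signature listed above. The file name is made up, and the convention of keying the transforms Dict by column index is an assumption about the API:

```julia
using CSV, DataFrames

# Read a (hypothetical) csv file into a DataFrame, uppercasing every
# value in column 1 as it streams from source to sink
df = CSV.read("names.csv", DataFrame;
              transforms=Dict(1 => uppercase))
```

The transform runs per value in transit, so no second pass over the sink is needed afterward.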

The definitions for these convenience methods are almost identical across packages, utilizing the same DataStreams.jl infrastructure, and all allow the common DataStreams features: specifying any sink argument (typically a DataFrame by default), whether to append to the sink or not, and any transform functions that should be applied along the way. These convenience methods work by constructing Data.Source (and possibly Data.Sink) types "on-the-fly" before calling Data.stream!(source, sink) and Data.close!(sink) to perform the actual data transfer.
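Spelled out, that pattern looks roughly like the following. This is a sketch of the general flow, not the exact internals of any one package, and the precise return values of Data.stream! and Data.close! are assumptions:

```julia
using CSV, DataStreams, DataFrames

# Approximately what a convenience method like CSV.read does:
source = CSV.Source("data.csv")         # construct the Data.Source on-the-fly
sink = Data.stream!(source, DataFrame)  # central DataStreams transfer loop
df = Data.close!(sink)                  # finalize the sink and return it
```

Because every package funnels through the same two calls, adding a keyword like `append` or `transforms` in DataStreams.jl makes it available everywhere at once.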

Similar convenience methods for Data.Sink are:

CSV.write(file, source)

SQLite.load(db, tablename, source)

Feather.write(file, source)

ODBC.load(dsn, tablename, source)

with all supporting the same append and transforms keyword arguments as well.
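Putting the sink methods and the append keyword together, aggregating several csv files into a single SQLite table might look like this (file names are hypothetical, and the exact SQLite.load signature is hedged per the list above):

```julia
using CSV, SQLite

db = SQLite.DB("mydata.sqlite")

# The first file creates the table; subsequent files append to it,
# assuming each part shares the same schema
SQLite.load(db, "mytable", CSV.Source("part1.csv"))
SQLite.load(db, "mytable", CSV.Source("part2.csv"); append=true)
```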

What's next:

Things on my mind, and that will be taking up my time going forward, include:

DataFrames modernization: there's some major work already underway to improve the core DataFrame type for handling tabular data in Julia

Multi-source/sink streaming: though the framework already supports appending when streaming, there are more specific use cases that DataStreams could support centrally

More formal and powerful transforms: I feel like we've only scratched the surface for transforms so far; I don't think it'll be too difficult to support things like new-column transforms, multi-column transforms, and possibly aggregation transforms.

More Sources & Sinks!: I hope to help enable as many as possible, such as SAS, RData, certain geo-data formats, etc. Stream all the things!!

Finally, I definitely want to give shoutouts to those who have helped, with code, design, or otherwise: David Gold, John Myles White, Milan Bouchet-Valat, David Anthoff, and Yeesian Ng for their direct collaboration and help, as well as many of the core contributors to julia itself. And special shoutouts to Tom Breloff, Jiahao Chen, Cameron McBride, and Jeff Bezanson for specific discussions and suggestions at JuliaCon 2016 and since.

I'm also including some quick "new release news" for individual packages below, since there have been quite a few new features added: