Getting Started

So, let’s see how easy Arctic is to use, and whether I can get you, the Reader, a little more interested in the idea of adopting yet another database. This will be a very simple walkthrough, just to illustrate some of Arctic’s core features.

Setting up

First of all, you need to have MongoDB installed and running. You can find the instructions for your operating system on the official MongoDB docs page.

Having done that, we can install Arctic using pip:

pip install git+https://github.com/manahl/arctic.git

We also need to install the Pandas library, as we will be dealing with DataFrames:

pip install pandas

Coding

First things first: let’s import Arctic into our empty Python script

from arctic import Arctic

Now, we need to connect Arctic to its underlying MongoDB instance. You can connect Arctic to any MongoDB instance, whether hosted in the cloud or on your local network. Since I’m running MongoDB on my laptop, I will be using localhost as the address of my instance.

db = Arctic('localhost')
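If your MongoDB instance lives somewhere else, Arctic can also take a pymongo MongoClient instead of a plain hostname. Here is a minimal sketch, assuming a hypothetical remote host:

from pymongo import MongoClient

# Hypothetical connection string; replace with your own host and port.
client = MongoClient('mongodb://db.example.com:27017')
db = Arctic(client)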

Great! Now we need to create a library.

Arctic separates different data using the concept of libraries. They can be markets, regions, users, etc.

In this example, I will be using a ~160 MB CSV file with some financial data. Let’s create a Finance library to store it. If we don’t pass a lib_type value to the initialize_library method, Arctic defaults the library to the VersionStore storage engine. That’s fine for our example.

db.initialize_library('Finance')
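As a quick sanity check, we can ask Arctic to list its libraries and confirm the one we just created is there:

# Should print a list containing 'Finance'.
print(db.list_libraries())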

We need to access the library that we just created in order to write some data into it. Let’s do that by indexing our db object with the library name.

finance_library = db['Finance']

Before writing the data, let’s open our time series data file. In this example, I will be using a file called finance.csv (a demo file with the CSV structure used in this example).

Let’s use the Pandas library to open the CSV file. First, we need to import the library, and then use the read_csv method to read the contents into a Pandas DataFrame.

We will also set the unix column as the index of the DataFrame, after converting it from string to datetime.

import pandas as pd

df = pd.read_csv('finance.csv')

df['unix'] = pd.to_datetime(df['unix'])

df.set_index('unix', inplace=True)
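As a side note, the same three steps can be collapsed into a single call by letting read_csv parse the dates and set the index for us:

# Equivalent one-liner: parse 'unix' as datetime and use it as the index.
df = pd.read_csv('finance.csv', parse_dates=['unix'], index_col='unix')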

Ok! We are ready to write our data into Arctic. To do that, we need to define a symbol.

A symbol is a string that we will use to read or write our data within a library.

Let’s use the symbol Stocks to store the contents of our loaded df in the library finance_library.

finance_library.write('Stocks', df)
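Since a library keeps track of its symbols, we can confirm the write by listing them:

# Should print a list containing 'Stocks'.
print(finance_library.list_symbols())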

We can verify that the data was inserted correctly by using the read method and accessing the data property of the returned object to get the resulting DataFrame.

new_df = finance_library.read('Stocks').data
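If you want to be extra careful, Pandas can check that the round trip preserved the frame; this should print True if the stored data matches the original exactly:

# Compare the DataFrame we read back against the one we wrote.
print(new_df.equals(df))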

To perform a query that looks for a specific time interval, we can use the date_range parameter in the read method.

First, we need to import DateRange from arctic.date :

from arctic.date import DateRange

We can pass an instance of DateRange to the date_range parameter. We can create one by calling its constructor with a start date and an end date.

custom_range = DateRange('2020-01-01', '2020-01-02')

Now we can run the query:

range_df = finance_library.read('Stocks', date_range=custom_range).data
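As far as I can tell, DateRange also accepts open-ended intervals, with None standing in for a missing bound. A small sketch, following the same pattern as above:

# Everything from 2020-01-01 onwards (no end bound).
open_range = DateRange('2020-01-01', None)
recent_df = finance_library.read('Stocks', date_range=open_range).data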

And that’s it! In this very simple example, we saw the main features of Arctic by implementing a Python script that handles some financial data.

The official docs are really good and have pretty much all the information you need, from the basics to the advanced features. There’s no need to read the package’s source code, but if you are interested in seeing how Arctic works and how it achieves its high performance, I encourage you to do so.

But I’m afraid I won’t convince you, the Reader, until I show you some performance numbers, am I right?

Benchmarks

These numbers were all obtained by running the scripts on my 2.3 GHz dual-core 13-inch 2017 MacBook Pro.

Bear in mind that this is in no way supposed to be a full benchmark of the Arctic database. It’s a simple comparison, made with Python’s time library, between Arctic, MongoDB, SQLite, and plain CSV files.

Performance Comparison

Here are the results of running a naive Pandas read_csv, a PyMongo query, a SQLAlchemy query (using an SQLite database), and an Arctic read, using the ~160 MB financial data set as the source.

The PyMongo and SQLAlchemy query results were parsed into a DataFrame after the engines returned them, and that parsing time was included in the benchmark. Both databases were indexed by the unix column.
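For reference, here is a minimal sketch of the kind of wall-clock timing used (Python’s time library around a single call), with the Arctic read as an example:

import time

start = time.time()
_ = finance_library.read('Stocks').data
print(f'Arctic read took {time.time() - start:.2f} s')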

Pandas read_csv: ~4.6 seconds

PyMongo query: ~28 seconds

SQLite query: ~30 seconds

Arctic read: ~1.45 seconds

Those results came from making a “get all” kind of query. Let’s try to add a range parameter to see if the results hold up.

new_df = finance_library.read('Stocks', date_range=DateRange('2020-01-01', '2020-01-02')).data

Pandas read_csv: ~4.9 seconds

PyMongo query: ~1.66 seconds

SQLite query: ~0.7 seconds

Arctic read with DateRange: ~0.12 seconds

The results are still in favor of Arctic when using a DateRange, which is the main type of query we use when dealing with time series data.

Disk Compression

The same CSV file was used to seed each one of the databases.

This was the amount of disk space used by each one of the alternatives:

Plain CSV file: 160.8 MB

MongoDB collection: 347.31 MB

SQLite: 297.9 MB

Arctic: 160.59 MB

Conclusion

Using Arctic when dealing with large time series data sets allows us to achieve remarkable speed and compression improvements. With its easy setup and usage, it can increase productivity and save some precious time.

Even without using its more advanced features, like snapshots or other storage engines, we can make a strong case for the use of Arctic to deal with time series data.
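For a taste of one of those features: a snapshot freezes the current version of every symbol in a library, and a read can later be pinned to it through the as_of parameter. A minimal sketch, with an arbitrary snapshot name:

# Freeze the current state of the library under a named snapshot.
finance_library.snapshot('before_update')

# Read the data back exactly as it was when the snapshot was taken.
old_df = finance_library.read('Stocks', as_of='before_update').data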

The code for the walkthrough and the benchmark can be found here. It does not contain the full CSV file for license reasons, but I encourage you to run it with some of your own data to see if your results are similar to mine.

I hope you enjoyed reading this post. Thank you for your time.

Take care and keep coding!