TimeSeries DataStores 'cuz lost time is never found again by AbhishekKr / @abionic

What will we discuss today?

What can you do with it? Why TSDB?

What makes a (good) TSDB?

Existing Solutions.

elevator pitch



A system focussed on data storage, optimized for time-based queries.

Some of the largest datasets have strong time components...

like stock market data, server logs, weather data, or even just the temperature in the server room.

TimeSeries Databases Not a unique problem, any DB can be made to work.

VividCortex reached 332k/sec metrics over 3 MySQL nodes.

It is writing a new TSDB, Catena (800k/sec in beta).

Focussed solutions exist to handle scale and queries optimally.

It's like a BigData problem with "pre-structured" data.

Analytics

Some analyses are simple

(image-courtesy: stackoverflow::analytics)

But some need correlation of time-series data (image-courtesy: Spurious Correlations)

Common Time-Series Data for Analysis (image-courtesy: /var/log)

Interesting Time-Series Data (image-courtesy: histography.io)

Forecasting

Some series just seem random,

but are actually predictable. (image-courtesy: dilbert)

Not all predictions are accurate. (image-courtesy: xkcd)

But with enough data, they can be near perfect. (image-courtesy: xkcd)

Popular Time-Series Data Forecasting. (image-courtesy: Yahoo! Finance)


Critical Time-Series Data Forecasting. (image-courtesy: Google Weather)

Critical Time-Series Data Forecasting. (image-courtesy: Environment Canada)

Why? Many kinds of analysis require keeping track of

multiple factors over a period of time.

Like...

* some MongoDB usages in industry

* Forecasting (Average, Relevant Value Average, Seasonality Trend, Weighted Average, Smoothed Average)
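
A minimal sketch of three of the averaging-based forecasts named above, each predicting the next value of a series. It assumes "Smoothed Average" means simple exponential smoothing; the function names, the `alpha` parameter and the sample `load` series are made up for illustration.

```python
from typing import Sequence


def simple_average(values: Sequence[float]) -> float:
    """Forecast the next value as the plain mean of the whole history."""
    return sum(values) / len(values)


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Forecast with explicit weights (usually larger for recent points)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def smoothed_average(values: Sequence[float], alpha: float = 0.5) -> float:
    """Simple exponential smoothing (assumed meaning of 'Smoothed Average').

    Each new point pulls the running estimate towards itself by `alpha`.
    """
    estimate = values[0]
    for v in values[1:]:
        estimate = alpha * v + (1 - alpha) * estimate
    return estimate


# Hypothetical series: resource load sampled at four consecutive intervals.
load = [10.0, 12.0, 14.0, 16.0]
print(simple_average(load))               # 13.0
print(weighted_average(load, [1, 2, 3, 4]))  # 14.0
print(smoothed_average(load, alpha=0.5))  # 14.25
```

On a rising series the weighted and smoothed variants track the recent trend more closely than the plain mean, which is why they tend to be preferred when the latest points matter most.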



Why? Device Performance Analytics

Example: Finding patterns of specific time periods when resource load is higher or lower.

Manage infrastructure costs by letting those patterns drive an elastic cloud.

* Skyline

* Appboy

- To achieve such specific targeting, we built a powerful analytics engine using MongoDB to store our data. The Appboy platform collects billions of data points each month from our varied customers including photo sharing apps, games, text messaging apps, digital magazines and more. MongoDB is used as our primary data store and houses almost all of our pre-aggregated analytic data. MongoDB's flexible data store easily keeps track of time series data across dimensions, and ObjectRocket has proven to be a great database provider as we've grown to track billions of data points each month.

Why? Decision's impact via Survey Trends

Example: What marketing decisions were taken at what time?

State of the target customer class's economy.

Impact of any influencing data on sales.

Why? Predicting herd mentality in Stock Exchange

Example: Which public-company-related event had what impact?

Just the general trend in competitors' stock health, correlated with that of your own.

* eXtremeDB

* FAME Database: Forecasting Analysis and Modeling Environment

* SunGard's Market Map: http://www.sungard.com/solutions/market-data/market-map/

Why? Map Medical IoT monitoring with regular health checks

Example: A user's average heartbeat correlated with exercise done.

Warnings based on old health issues combined with the current blood pressure trend.

* TempoDB description at GigaOM

Why? Intrusion Detection Systems

Example: Seasonality of user requests and the trend of traffic increase.

A significant anomaly in these can be used by an IDS to predict attacks.

* Census: http://www.census.gov/retail/marts/www/timeseries.html
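
One way to picture the "significant anomaly against seasonality" idea is a per-hour baseline check. This is only a sketch under assumptions: request counts are bucketed by hour of day, the baseline is the mean of past days for that hour, and "significant" means more than `threshold` standard deviations away. The function name and sample numbers are hypothetical, and a real IDS would use far richer models.

```python
from statistics import mean, pstdev
from typing import Dict, List


def seasonal_anomalies(
    history: Dict[int, List[float]],
    current: Dict[int, float],
    threshold: float = 3.0,
) -> List[int]:
    """Return the hours whose current value strays from the seasonal baseline.

    `history` maps hour-of-day to request counts seen at that hour on past
    days; `current` maps hour-of-day to today's count.
    """
    flagged = []
    for hour, value in current.items():
        past = history.get(hour, [])
        if len(past) < 2:
            continue  # not enough history to judge this hour
        baseline = mean(past)
        spread = pstdev(past)
        if spread == 0:
            # Perfectly flat history: any change at all is a deviation.
            if value != baseline:
                flagged.append(hour)
        elif abs(value - baseline) / spread > threshold:
            flagged.append(hour)
    return flagged


# Hypothetical request counts: busy at 09:00, quiet at 03:00.
history = {9: [100, 110, 90, 105], 3: [10, 12, 11, 9]}
today = {9: 108, 3: 400}
print(seasonal_anomalies(history, today))  # [3]
```

The 09:00 value is high in absolute terms but normal for that hour, while the 03:00 burst is flagged; comparing against the same season rather than a global average is the whole point.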

Daily examples could be

stock tick information from global stock exchanges

precious metals prices captured periodically

weather details at a specific long/lat at periodic intervals

continuous sensor feed from manufacturing machines or oil rigs, solar panels, etc.



TimeSeries information is not necessarily different, but it is

the volume and speed aspect of data

the sparseness of the information

that makes it challenging to be stored in traditional stores.

To analyze the data based on the time dimension,



keep arrival time of each feed and



optimize queries by it.

What makes a TimeSeries DataStore?

What?

Storing and Retrieval of Primary Data Points indexed by their TimeStamps.
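
To make "indexed by their TimeStamps" concrete, here is a toy in-memory sketch: points are kept sorted by timestamp so a time-range query becomes two binary searches. The class and method names are hypothetical, and a real TSDB would persist and shard this.

```python
import bisect
from typing import List, Tuple


class TimeIndexedStore:
    """Keeps points sorted by timestamp so range queries are binary searches."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []
        self._values: List[float] = []

    def insert(self, timestamp: int, value: float) -> None:
        # Out-of-order arrivals are placed at their sorted position.
        pos = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(pos, timestamp)
        self._values.insert(pos, value)

    def query(self, start: int, end: int) -> List[Tuple[int, float]]:
        """Return all points with start <= timestamp < end, oldest first."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


store = TimeIndexedStore()
for ts, temp in [(30, 22.5), (10, 21.0), (20, 21.7)]:
    store.insert(ts, temp)
print(store.query(10, 30))  # [(10, 21.0), (20, 21.7)]
```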

What makes it better?

What more?

Consolidated Data Points sum, avg, min, max, endpoints, a function specific to type of data
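
A rough sketch of consolidation: raw points are grouped into fixed-width time buckets and each bucket is reduced to sum, avg, min, max and its endpoints (first and last value). The function name, the bucket width and the sample points are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def consolidate(
    points: Iterable[Tuple[int, float]], bucket_seconds: int
) -> Dict[int, Dict[str, float]]:
    """Roll (timestamp, value) points up into fixed-width time buckets."""
    buckets: Dict[int, List[float]] = {}
    # Sort by timestamp only, so first/last really are the bucket endpoints.
    for ts, value in sorted(points, key=lambda p: p[0]):
        start = ts - ts % bucket_seconds  # bucket start this point falls in
        buckets.setdefault(start, []).append(value)
    return {
        start: {
            "sum": sum(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
            "first": vals[0],
            "last": vals[-1],
        }
        for start, vals in buckets.items()
    }


raw = [(0, 1.0), (30, 3.0), (59, 2.0), (60, 10.0)]
rollup = consolidate(raw, bucket_seconds=60)
print(rollup[0])
```

A function specific to the type of data (a percentile, a counter rate) would slot in as one more key next to these.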

What more?

Consistency and Durability suited to the target domain. Not all domains need it; it matters for life-impacting surveys, monetary transactions or any important prediction.

What more?

Scalable and Performant to fit the required scenarios. circular-buffer OR big-data || 100s to Millions Records/Sec

What more?

Compressed Contiguous (old) Data Blobs in wide-row formats; when contiguous blobs of data are persisted, they compress better
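
A small sketch of why contiguous blobs help: a block of old points is packed into one byte string and compressed as a whole. The 16-byte record layout (`<qd`: 64-bit timestamp plus 64-bit float) and the use of zlib are assumptions for illustration, not what any particular TSDB does.

```python
import struct
import zlib
from typing import List, Tuple

RECORD = struct.Struct("<qd")  # assumed layout: int64 timestamp, float64 value


def pack_block(points: List[Tuple[int, float]]) -> bytes:
    """Pack a run of points into one contiguous blob and compress it."""
    raw = b"".join(RECORD.pack(ts, value) for ts, value in points)
    return zlib.compress(raw)


def unpack_block(blob: bytes) -> List[Tuple[int, float]]:
    """Inverse of pack_block: decompress and split back into points."""
    raw = zlib.decompress(blob)
    return [RECORD.unpack_from(raw, off) for off in range(0, len(raw), RECORD.size)]


# Hypothetical sensor: one reading every 10 seconds, steady temperature.
points = [(1444990000 + i * 10, 20.5) for i in range(1000)]
blob = pack_block(points)
print(len(points) * RECORD.size, "raw bytes ->", len(blob), "compressed bytes")
```

Regularly spaced timestamps and slowly changing values leave a lot of repetition in the block, which a general-purpose compressor can only exploit when the points sit next to each other.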

What more?

Reusable (BigData) Analytics Toolset; if it utilizes HBase/Cassandra backends, existing data-crunching mammoths can be plugged in

What more?

Non-Blocking Backups timeseries keep coming at 'continuous (ir)regular intervals of time'

What more?

Auto (or default) managed load-balancing. Scaling up and down needs to be seamless; remember the data stream keeps coming



Relational Database (special schemas)

NoSQL Databases (epoch indexes)

NoSQL Databases (wide tables)

Column Oriented Databases

It's not a unique problem and can be done via any database. The requirement scenario just doesn't fit everywhere equally.

The different solutions exist not 'cuz of any incapability of existing databases but 'cuz of the scale at which data might be written, read and analyzed. It is a BigData problem where the data is pre-structured and hence can be dealt with in a more intelligent, algorithmic way.

* Normally large volumes of data are pushed in at a steady pace. Buffering writes does not always help.

* Spark + Cassandra by DataStax for TSDB : https://academy.datastax.com/demos/getting-started-time-series-data-modeling

* VividCortex has reached 332k/sec metrics over 3 servers (each 8CPU,26GB RAM) over their MySQL implementation. Can't Ad-Hoc query TimeSeries data.

* Cacti has been using MySQL to store such data forever.

Popular Types

Existing Solutions

majority of these are opensource

and I'm biased

:)

RRDTool One of the earliest and most popular TimeSeries DataStore. Has persistence, in-memory caching & concurrent tasks. A circular-buffer based store. Bad at Sparse Metrics. No partition, replication or atomic integrity. Links: RRD Tutorial, Using RRD, Beginners
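
The circular-buffer idea can be sketched in a few lines: a fixed number of slots, with the newest point overwriting the oldest once the buffer is full. This is a toy illustration only, not RRDTool's file format or API; the class and method names are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, float]


class RoundRobinArchive:
    """Fixed-size store: storage never grows, the oldest points fall off."""

    def __init__(self, slots: int) -> None:
        if slots <= 0:
            raise ValueError("slots must be positive")
        self._slots: List[Optional[Point]] = [None] * slots
        self._next = 0   # index the next write goes to
        self._count = 0  # how many slots hold real data

    def append(self, timestamp: int, value: float) -> None:
        self._slots[self._next] = (timestamp, value)
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def points(self) -> List[Optional[Point]]:
        """Return stored points from oldest to newest."""
        if self._count < len(self._slots):
            return self._slots[: self._count]
        return self._slots[self._next:] + self._slots[: self._next]


rra = RoundRobinArchive(slots=3)
for i in range(5):
    rra.append(i, float(i))
print(rra.points())  # [(2, 2.0), (3, 3.0), (4, 4.0)]
```

The fixed slot count is what makes disk usage predictable, and also what makes sparse metrics wasteful: empty intervals still occupy slots.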

Graphite Carbon: Twisted powered metrics processing daemon. Whisper: time-series db library based on RRD principles. Timestamp value is verified for its position during retrieval. Multi-Archive Storage and Retrieval Behavior.

File per time-series.

Doesn't scale well, as every additional series needs more file descriptors.

OpenTSDB Runs on Hadoop and HBase. Highly Scalable. Since v2.0 provides a good Plug-in architecture. Involves a lot of moving parts (Hadoop, HBase, Zookeeper).

All need to be managed. DownSampling is for graphs; not to feed into calculations.

* Overview: http://opentsdb.net/overview.html

* Scales to millions of writes per second

* Add capacity by adding nodes

* Docker Image: 'petergrace/opentsdb-docker'

* Go Package for interaction: https://github.com/bzub/go-opentsdb

* Plug-ins support for Logging, Serializers, Search, Real-Time Publishing, RPC

* There is an issue open to fix DownSampling @[Github-Issues](https://github.com/OpenTSDB/opentsdb/pull/325).



KairosDB Kind of a re-write of OpenTSDB (not a fork) that runs on Cassandra. Highly Scalable. Keeps data and presentation of data separate.

InfluxDB Series of Measurements + Unique Tagset.

Datapoints have fields and timestamp in nano epoch. No external dependencies. Ordered k/v.

Started with LevelDB, then RocksDB.

Defaults to BoltDB currently (v0.9.1 I think). WAL to enable BoltDB to manage its memory swiftly. Over HTTP. Useful SQL-ish language for data query. Had Protobufs, now Raw Bytes.

* LevelDB: Too many file handles, no online backups, too hard to transfer a shard from one server to another.
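
The "measurement + unique tagset" model can be pictured with a toy in-memory store: a series is identified by its measurement name plus its sorted tags, and each datapoint carries fields and a nanosecond epoch timestamp. This is only a sketch of the data model, not InfluxDB's API or storage engine; every name here is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]
Point = Tuple[int, Dict[str, float]]


class SeriesStore:
    """Toy model: one list of points per (measurement, tagset) series."""

    def __init__(self) -> None:
        self._series: Dict[SeriesKey, List[Point]] = defaultdict(list)

    @staticmethod
    def series_key(measurement: str, tags: Dict[str, str]) -> SeriesKey:
        # Sorting the tags makes the key independent of the order they arrive in.
        return (measurement, tuple(sorted(tags.items())))

    def write(self, measurement: str, tags: Dict[str, str],
              fields: Dict[str, float], timestamp_ns: int) -> None:
        key = self.series_key(measurement, tags)
        self._series[key].append((timestamp_ns, dict(fields)))

    def read(self, measurement: str, tags: Dict[str, str]) -> List[Point]:
        """Return the series' points ordered by timestamp."""
        points = self._series.get(self.series_key(measurement, tags), [])
        return sorted(points, key=lambda p: p[0])

    def series_count(self) -> int:
        return len(self._series)


db = SeriesStore()
db.write("cpu", {"host": "a", "region": "us"}, {"value": 0.64}, 2_000_000_000)
db.write("cpu", {"region": "us", "host": "a"}, {"value": 0.51}, 1_000_000_000)
db.write("cpu", {"host": "b", "region": "us"}, {"value": 0.90}, 1_000_000_000)
print(db.series_count())  # 2
```

Two writes with the same tags in a different order land in the same series, while a different tag value starts a new one.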



Druid Lambda Architecture

* Fangjin and Gian are core folks

* Tranquility: Realtime ingestion from Kinesis to Druid

* Single Node DRUID cluster

Query Layer

* RDBMS (MySQL, PgSQL): Scan speed of data was quite slow.

* NoSQL Key/Val (HBase, Cassandra): You often end up pre-computing many results.

* Commercial (Vertica, Redshift): but customization on FOSS is powerful



Netflix's Atlas Near real-time graphing for operational insights at scale. Predictable Alerting (like a lot less traffic than predicted). Netflix handles more than 1TB of analytics data/day with it. In-memory (complete for 6hrs, roll-ups for 2 weeks). Pain. Persists raw data in S3. Uses Hive to process old data.

* People tried asking for a KairosDB backend for Atlas, [here](https://github.com/Netflix/atlas/issues/20)



Prometheus A Service Monitoring System with built-in TSDB

by SoundCloud. Has a query language, alerting and visualization. Data model like OpenTSDB's: metric names,

labelled with key-values. Can tune how much data is handled in RAM and on Disk (LevelDB).

Blueflood By RackSpace Cloud Monitoring Team

for RealTime Analytics. Auto-purges, not ideal for Batch Tasks on old data. Uses Cassandra for data storage.

Optional support of Zookeeper and ElasticSearch.

Kdb+ (commercial) Columnar High Performance DB.

Built-in array language 'q' to work directly on data.

Can be used for streaming, real-time and historical data. OLTP from 100 thousand to 1 million records/second/cpu.

OLAP from 1 million to 100 million records/second/cpu. Popular in Financial Sectors.

Customers: Goldman Sachs, JP Morgan, Deutsche Bank, etc.

Also in Utilities, Telecom, Pharmaceuticals, Oil-n-Gas sectors. SaaS model over Kdb+ at TimeSeries.guru. 32-bit version is free for Dev/PoC tasks, not commercial use (1GB RAM).

* Getting Started, Insight

SiteWhere (community edition) It runs on MongoDB or Hadoop/HBase. Provides 'Complex Event Processing' via Siddhi. Provides search and analytics via Apache Solr. Connect devices with MQTT, AMQP, Stomp, other protocols. SaaS; IoT focussed; REST registration; Arduino and Android

tempoiq

legacy: tempodb, @gigaom; (commercial) Focussed on IoT sensor data

for analysis, dashboarding and reporting. Connect anything with a flexible event data model, HTTPS, MQTT

something I was working on MomentDB/GoShare TimeSeries arranged as NameSpaced Keys

httpd:ERROR:2015:10:16:54:45:34 = yada | 2015:10:16:54:57:34:httpd:ERROR = nada

HTTP and ZeroMQ support for now
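
A hedged sketch of what namespaced keys buy you: with colon-separated keys like the ones above, a query for "all httpd errors in 2015:10" is just a prefix scan. This is not GoShare's actual API; the helper names and the plain-dict store are stand-ins for illustration.

```python
from typing import Dict, Sequence


def namespaced_key(namespace: Sequence[str], time_parts: Sequence[str]) -> str:
    """Join namespace and time components into one colon-separated key."""
    return ":".join(list(namespace) + list(time_parts))


def scan_prefix(store: Dict[str, str], prefix: str) -> Dict[str, str]:
    """Return entries whose key equals the prefix or sits underneath it."""
    return {
        key: value
        for key, value in sorted(store.items())
        if key == prefix or key.startswith(prefix + ":")
    }


# A plain dict stands in for the key-value backend.
store = {
    namespaced_key(["httpd", "ERROR"], ["2015", "10", "16", "54", "45", "34"]): "yada",
    namespaced_key(["httpd", "WARN"], ["2015", "10", "16", "54", "45", "34"]): "hmm",
}
print(scan_prefix(store, "httpd:ERROR:2015:10"))
```

Putting the time components first, as in the second key form above, flips the trade-off: a prefix then selects everything in a time window across all namespaces.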

