IoT applications generate torrents of data. PostgreSQL is powerful but scales poorly. With TimescaleDB you can scale Postgres by 20x and harness its power for IoT.

It’s no secret that the so-called “Internet of Things” isn’t about the Things, but the data. And it is a lot of data. Every day, more of the physical world — manufacturing operations, food production systems, trains we commute on — is connected to the Internet and automated, creating more and more streams of sensor data.

Multiply the millions of things by the amount of data per device, and you get an exponentially growing torrent of information being used to make better business decisions, provide better end user experiences, and produce more while wasting less.

Most engineering teams (including ours, early on in our company history) working on these initiatives end up storing all of this data in multiple databases: metadata in a relational database, time-series data in a NoSQL store. Yet each of these databases operate differently; running a polyglot database architecture adds operational and application complexity that’s unnecessary.

You don’t need to do this.

In this post, we’ll show you how to keep all of your IoT relational and time-series data together in PostgreSQL (yes, even at scale), and how that leads to simpler operations, more useful contextualized data, and greater ease of use. We’ll also highlight other awesome PostgreSQL features relevant to IoT, like query power, flexible data types, geospatial support, and a rich ecosystem.

By the end, if you’re convinced that PostgreSQL + TimescaleDB is the ideal database for your IoT project, then please get in touch and tell us if/how we can help.

Yes, PostgreSQL can scale

We’ll start with the most common question we hear:

“Will PostgreSQL scale to support my IoT data rates?”

Scale is the number one concern that developers have about PostgreSQL. And that’s completely valid. IoT backends need to support high data ingest rates, while writes on PostgreSQL slow to a crawl as your dataset grows:

Insert benchmarks reveal vanilla PostgreSQL falls off a cliff as your dataset grows.

But it turns out that for time-series data, if your database is architected the right way, you can scale PostgreSQL to hundreds of thousands of inserts per second, at billions of rows, even on a single node with a modest amount of RAM.

This insight is what led us to develop TimescaleDB, a new open-source time-series database launched earlier this year. TimescaleDB achieves significant performance improvements by automatically partitioning data across time and space behind the scenes, while presenting the illusion of a single, continuous table (called a “hypertable”) across all partitions to the user. Most importantly, TimescaleDB does this while appearing like a normal PostgreSQL instance (in fact, packaged as a PostgreSQL extension, not a fork) to the developer and the outside world.

We benchmarked TimescaleDB against PostgreSQL by loading 1 billion rows, with 10 metrics per row, into a single instance of each.

Here are the results:

TimescaleDB shows 20x faster insert performance and 15x faster data loading time than vanilla PostgreSQL.

In short, TimescaleDB is more than 20x faster than vanilla PostgreSQL when inserting data at scale. The average insert rate for TimescaleDB in this benchmark is 111k rows (1.11M metrics) / second, within a relatively narrow band (as displayed above), while PostgreSQL insert performance drops to 5k rows (50k metrics) / second. But most importantly, TimescaleDB enables you to scale to 1 billion rows with no real impact to insert performance.

(To learn more about TimescaleDB’s time/space partitioning and hypertables, please read our CTO’s detailed technical post.)

But what about data retention? Because time-series data volumes grow quickly, at some point you may want to delete old time-series data that you no longer need. And PostgreSQL does not handle massive deletions well. This is another problem that our time/space partitioning solves: when you need to eliminate old data, instead of deleting rows (and undergoing expensive vacuuming operations), TimescaleDB just drops the partition. And of course, this is all tucked away behind a data retention API.

Why run two databases, when you can run one?

We just saw how TimescaleDB scales PostgreSQL to large workloads, like that within IoT. Now let’s see what it lets us do.

Being able to store time-series data alongside relational data is incredibly powerful. For one, it leads to a simpler stack. Instead of two databases (NoSQL for sensor data, relational for sensor metadata), with all kinds of glue code in between, not to mention the operational headaches of having two databases… you only need one database:

Running one database for time-series and metadata (right) leads to a simpler stack, less application complexity, and less management/ops overhead than when running separate databases (left).

For example, let’s look at backups. Running two different databases means two different backup processes (and longer downtimes). Also, if the systems are backed up independently, then you run the risk of data integrity issues. Imagine this scenario: your time-series data has a backup from the last hour, but your relational data was backed up yesterday; so you now have time-series data referring to new devices whose metadata is lost.

Prioritizing simplicity leads to other advantages: A simpler stack is even useful at the edge, where running one database is hard enough (due to resource constraints), let alone two. Only needing a single database makes IoT prototyping easier. It also helps reduce the complexity of large sprawling IoT projects. As we recently heard from the CTO of a Unicorn IoT startup: “Aggressively consolidating our tech stack lets us sharpen our engineering team’s focus.”

Context, context, context

A single database for time-series data and metadata does more than just simplify your stack: it also lets you add context to your sensor data. Because what is point of collecting data if you don’t know what you’re measuring?

Here’s an example: One company we are working with is developing an IoT application to monitor manufacturing processes. This application involves close examination of assembly line data to catch defects and ensure a high quality product. Most of the assembly line data is time-series in nature: {machine_id, timestamp, measurement}. Yet that data is meaningless by itself without all the other operational metadata: the machine settings at that time, the line information, the shift information, etc. They have experimented with denormalizing their data, but that added an unreasonable amount of data bloat on each measurement.

With TimescaleDB and PostgreSQL, they can keep their metadata normalized, and add the necessary context to their time-series data at query time (via a SQL JOIN). For example, via a query like this:

-- Show me the average temperature by machine, for machines of a -- given type on a particular line, by 5 minute intervals SELECT time_bucket('5 minutes', measurements.time) as five_min, machine_id, avg(temperature) FROM measurements JOIN machines ON measurements.machine_id = machines.id JOIN lines ON machines.line_id = lines.id WHERE machines.type = 'extruder' AND lines.name = 'production' AND measurements.time > now() - interval '36 hours' GROUP BY five_min, machine_id ORDER BY five_min, machine_id;

[Show the average temperature by machine, for machines of a given type on a particular line, by 5 minute intervals.]

The power of SQL

Let’s take a step back and keep in mind the value of an IoT project: to collect, analyze, and act on data to improve utility/efficiencies, reduce downtime/waste, and provide better products and services. In other words, you need more than a data store: you also need an easy way to wring actionable insights out of your data.

This is where SQL comes in. While it’s been quite fashionable in the past several years to denounce SQL and praise NoSQL, the truth is that SQL is quite powerful and is starting to make a comeback (which is one reason why “NoSQL” is now getting “backronymed” to “Not only SQL”).

SQL includes quite a few useful features: e.g., multiple complex WHERE predicates (backed by secondary indexes); multiple aggregations and orderings; window functions, libraries of mathematical and statistical functions; and more.

-- Plot the change in temperature for ceiling sensors -- on linear and logarithmic scales, by 10 second intervals SELECT ten_second, temperature / lead(temperature) OVER data AS temperature_rise_linear, log(temperature / lead(temperature) OVER data) AS temperature_rise_log FROM ( SELECT time_bucket('10 seconds', time) as ten_second, last(temperature, time) as temperature FROM measurements JOIN sensors ON measurements.sensor_id = sensors.id WHERE sensors.type = 'ceiling' AND measurements.time >= '2017-06-01' GROUP BY 1 ORDER BY 1 ) sub window data AS (ORDER BY ten_second asc);

[Plot the change in temperature for ceiling sensors on linear and logarithmic scales, by 10 second intervals.]

TimescaleDB augments SQL by adding new functions necessary for time-series analysis, e.g., as shown with time_bucket and last in the query above.

SQL has another advantage: people across your organization already know it. You won’t need to train your engineers to learn a new specialized query language (or hire new ones), and non-technical users won’t need to rely as heavily on engineering (and engineering release cycles) to answer questions of the data. In other words, by leveraging SQL you can democratize your time-series data, letting more people across your organization access it.

Mark hits the nail on the head.

Flexible datatypes (including JSON)

JSON in a relational database? Eat your heart out, MongoDB.

When you start building your IoT product, you may not know what data you will care about, nor have a specific data schema in mind. Or later on, you may need very specific data structures (e.g., arrays).

Fortunately, PostgreSQL supports a broad spectrum of datatypes. It allows for semi-structured data (via JSON / JSONB support), but also a variety of other data types, including many numeric types, geometric types, arrays, range types, and date/time types.

-- Storing and querying JSON data in PostgreSQL SELECT time, sensor_id, type, readings FROM measurements ORDER BY time DESC, sensor_id LIMIT 50; time | sensor_id | type | readings ---------------+-----------+------+---------------------------------- 1499789565000 | 330 | 1 | {"occupancy": 0, "lights": 1} 1499789565000 | 440 | 2 | {"temperature": 74.0, "humidity": 0.81} 1499789565000 | 441 | 2 | {"temperature": 72.0, "humidity": 0.78} 1499789560000 | 330 | 1 | {"occupancy": 1, "lights": 1} 1499789560000 | 440 | 2 | {"temperature": 73.9, "humidity": 0.81} 1499789560000 | 441 | 2 | {"temperature": 72.1, "humidity": 0.79} 1499789555000 | 330 | 1 | {"occupancy": 1, "lights": 1} 1499789555000 | 440 | 2 | {"temperature": 73.9, "humidity": 0.80} 1499789555000 | 441 | 2 | {"temperature": 72.1, "humidity": 0.78}

[Some of our IoT customers store sensor data in JSONB, which allows indexing.]

Geospatial awesomeness

Often there is a geospatial component in the data from connected devices — physical things exist in a particular space, after all. In particular, geospatial information is important when the things are moving. One IoT use case we often see is asset tracking: e.g., tracking vehicles for fleet management, optimizing routes, reducing spoilage, etc.

TimescaleDB is packaged as a PostgreSQL extension, which means you can run many other PostgreSQL extensions in conjunction with it. One powerful extension is PostGIS, which adds a wealth of geospatial support (including new data types, functions, etc.) to PostgreSQL. By combining PostGIS with TimescaleDB, you can combine your geospatial and time-series data, in effect creating a scalable spatiotemporal database.

Data collected continuously from GPS locations of New York City buses over time.

A real ecosystem

Time-series databases don’t operate in a vacuum. They need connectors, for example, to data buses like Kafka, stream processing engines like Spark, or BI tools like Tableau. Most time-series databases are relatively new, and there hasn’t been enough time for an ecosystem to develop around them. On the other hand, PostgreSQL has been around for over 20+ years, and the community has built an expansive ecosystem around it.

Fortunately, TimescaleDB looks just like PostgreSQL. In fact, customers often ask, “Does Timescale work with X?” And as a quick Google search typically shows, there usually already exists a PostgreSQL connector for X, which will work with TimescaleDB out of the box.

Just like PostgreSQL, TimescaleDB works with the most popular data buses, stream processors, data visualization and BI tools (like the ones shown above), among others.

Last but (clearly) not least: Reliability

To be annoyingly obvious: the database you choose for IoT needs to be reliable (and not wake you up at 3AM). Unlike websites or mobile apps, IoT applications from the beginning are typically deployed in high-value scenarios. If you are using a database to monitor your factory line, then that database cannot go down.

This is where we rest on the shoulders of giants. In the past 20+ years, PostgreSQL has been battle tested in a variety of mission critical applications across a variety of industries. There is also another ecosystem of administrative tools that make reliability easier to achieve: streaming replication, hot standby, and more. TimescaleDB inherits this same level of reliability and ecosystem.

Choose PostgreSQL

There is a lot of noise in the IoT world right now, and it can be hard to decide what database to use for your IoT project. But sometimes the best option is the boring option: the database that just works. That’s PostgreSQL, which now (thanks to TimescaleDB) finally scales to handle IoT workloads and optimizes SQL for time-series data.

If you’re building a new IoT project or currently wrestling with a complex IoT stack, choose PostgreSQL and download TimescaleDB.

If you like what you see and need help with anything, please email us or join our Slack group.