One way of managing the risk of a data breach is to not keep all your data in one place. By spreading it out to different physical and logical locations, and keeping tight controls on access between those locations, you limit the exposure of the overall collection.

In database design, the natural inclination for developers and database professionals is to collect all the data into one database, where it can be analyzed and processed. In the Internet of Things (IoT) era, where data is being collected everywhere, the single-database approach may not always be the best choice. There are reasons why you would spread raw data among several physical locations, but it's not a slam-dunk case. For every reason to spread data around, there is at least one other reason to gather it together.

Spreading the data around is called data distribution. You don't have a master, consolidated database, as such. Instead, you leave some data—probably the raw, unprocessed data—spread around the network. This is not the same as a distributed database, a term often used to refer to a system where different copies of a database are kept synchronized through replication. With data distribution, there are reasons why the data is deliberately not copied.

The classic real-world example of this dilemma, if such cutting-edge techniques could be called “classic,” is surveillance data, especially in the context of smart city/community initiatives. Consider the networked LED lighting fixtures from Sensity (owned by Verizon), which can contain a variety of other devices, such as video cameras. If you have a lot of these in your city, do you send all that video data through the backhaul to your central data store for your fancy computers to analyze?

Is it a privacy or public policy issue?

One of Sensity's arguments, as outlined in the paper "Privacy in the Age of Ubiquitous Video," is to leave the raw video data at the edge—that is, in the nodes:

Edge storage enables video recording directly to embedded non-volatile memory (e.g., flash storage) in the video device. Edge storage can also be combined with a trigger-based approach to eliminate the need for bulk video archives. Rather than recording and backhauling video to an external storage drive on a 24x7 basis, a trigger-based approach isolates recording to specific events of interest and stores the clip of the event directly on the edge. This prevents the potential misuse that can stem from storing and transferring large volumes of raw video and allows third parties to only access footage of a particular segment where an event of interest, such as a gunshot, is detected. [Emphasis mine.]

The emphasized portion of the excerpt, on preventing the misuse that can stem from storing and transferring large volumes of raw video, describes an interesting motivation for data distribution: security, specifically making it harder for an attacker (or malicious insider) to access large amounts of potentially sensitive data. Obviously, the data at the edge needs to be protected just as data in a consolidated database does. However, the theory is that an attacker who gains access to some of these distributed data stores wouldn't necessarily have access to the others.
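The trigger-based approach described in the excerpt might look something like the following Python sketch. It is an illustration under stated assumptions, not Sensity's implementation: frames are opaque byte strings, and `record_events`, `is_trigger`, `pre`, and `post` are hypothetical names. The node holds only a short rolling buffer while idle and keeps a clip only when the trigger fires.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def record_events(frames: Iterable[bytes],
                  is_trigger: Callable[[bytes], bool],
                  pre: int, post: int) -> List[List[bytes]]:
    """Return one clip per event: up to `pre` frames of pre-roll,
    the trigger frame, and `post` frames after the last trigger.

    Hypothetical helper; `is_trigger` stands in for whatever detector
    (a gunshot sensor, for example) the node actually uses.
    """
    recent: Deque[bytes] = deque(maxlen=pre)  # rolling pre-event buffer
    clips: List[List[bytes]] = []
    current: List[bytes] = []
    remaining = 0  # post-event frames still to capture

    for frame in frames:
        if remaining > 0:            # inside an event: keep recording
            current.append(frame)
            remaining = post if is_trigger(frame) else remaining - 1
        elif is_trigger(frame):      # new event: start a clip with the pre-roll
            current = list(recent) + [frame]
            recent.clear()
            remaining = post
        else:                        # idle: only the short rolling buffer is kept
            recent.append(frame)
        if current and remaining == 0:
            clips.append(current)    # event finished: keep the clip at the edge
            current = []

    if current:                      # stream ended in the middle of an event
        clips.append(current)
    return clips


# Ten frames, with a made-up trigger on the sixth one.
frames = [f"f{i}".encode() for i in range(10)]
clips = record_events(frames, lambda f: f == b"f5", pre=2, post=1)
print(clips)
```

Everything outside the clip is discarded as the rolling buffer turns over, so there is no bulk archive for an attacker to find.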

One can argue whether video from a public surveillance camera should be considered private and sensitive, and this is a significant point of contention in arguments about personal privacy. Either way, the security theory holds only if the controls over the data throughout the network are fairly robust. If an admin account, especially one regularly used, has ready access to all the data on all the devices, then you gain very little by distributing that data.

The first argument against data distribution for privacy purposes is that there's no privacy interest to protect. These cameras are all out in public, taking pictures of events in the public space, of scenes where a police officer might well legitimately look and listen. Does one have a right to privacy out in public? This is one of those issues where technology raises policy issues that our philosophy isn't necessarily equipped to handle.

Minimize your data migration

As the Sensity excerpt says, there is another, perhaps better, reason to leave most of the data at the edge: You don't have to backhaul it over the network. Instead, you send back only lower resolution frames or only some of the frames, and if it's necessary to look at the raw data, there is a procedure to access it.
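As a rough illustration of sending back only some frames at lower resolution, here is a minimal Python sketch. The `Frame` type, `downscale`, `preview_stream`, and the `keep_every` and `factor` parameters are all assumptions made for the example; a real node would use a proper video codec rather than naive pixel subsampling.

```python
from typing import Iterable, Iterator, List

Frame = List[List[int]]  # hypothetical grayscale frame: rows of pixel values


def downscale(frame: Frame, factor: int) -> Frame:
    """Keep every `factor`-th pixel in each direction (naive subsampling)."""
    return [row[::factor] for row in frame[::factor]]


def preview_stream(frames: Iterable[Frame],
                   keep_every: int, factor: int) -> Iterator[Frame]:
    """Yield the reduced stream for backhaul: every `keep_every`-th frame,
    downscaled. The full-resolution frames stay on the node."""
    for index, frame in enumerate(frames):
        if index % keep_every == 0:
            yield downscale(frame, factor)


# Ten 4x4 frames; frame i is filled with the value i.
raw = [[[i] * 4 for _ in range(4)] for i in range(10)]
preview = list(preview_stream(raw, keep_every=5, factor=2))
print(len(preview), "frames of", len(preview[0]), "x", len(preview[0][0]))
```

With these made-up numbers, the backhaul carries 2 small frames instead of 10 full ones, which is the kind of saving the bandwidth argument relies on.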

I asked Wayne Arvidson, vice president of intelligence, surveillance, and security solutions at Quantum, about the problem. Arvidson has written about the role of data distribution in smart city initiatives, an application in which companies like Quantum, Verizon, and Hewlett Packard Enterprise are all involved.

Arvidson noted trade-offs to the privacy motivation for surveillance and most other applications: Storage at the edge is presumably limited, so the data is either backhauled somewhere or lost. Customers are hesitant to throw the data away, at least for some time, out of concern for potential litigation.
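The storage trade-off can be made concrete with a small Python sketch of a fixed-capacity store on the node. `EdgeClipStore` and its methods are hypothetical names, and evicting the oldest clip first is an assumed policy, not one the article or Arvidson prescribes. The point is only that once the store is full, every new clip forces a decision about an old one.

```python
from collections import OrderedDict
from typing import List, Optional


class EdgeClipStore:
    """Hypothetical fixed-capacity clip store; evicts the oldest clip when full."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._clips: "OrderedDict[str, bytes]" = OrderedDict()

    def add(self, clip_id: str, data: bytes) -> List[str]:
        """Store a clip and return the IDs of clips evicted to make room."""
        if len(data) > self.capacity_bytes:
            raise ValueError("clip is larger than the whole store")
        if clip_id in self._clips:  # replacing a clip frees its old space first
            self.used_bytes -= len(self._clips.pop(clip_id))
        evicted: List[str] = []
        while self.used_bytes + len(data) > self.capacity_bytes:
            old_id, old_data = self._clips.popitem(last=False)  # oldest first
            self.used_bytes -= len(old_data)
            evicted.append(old_id)
        self._clips[clip_id] = data
        self.used_bytes += len(data)
        return evicted

    def get(self, clip_id: str) -> Optional[bytes]:
        """Return the clip if it is still on the node, else None."""
        return self._clips.get(clip_id)


store = EdgeClipStore(capacity_bytes=10)
first = store.add("a", b"12345")
second = store.add("b", b"12345")
third = store.add("c", b"1234")  # no room left, so the oldest clip must go
print(first, second, third)
```

The list returned by `add` is where a design would hook in the choice Arvidson describes: backhaul the evicted clips to an archive, or accept that they are lost.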

It's worth noting that these IoT devices are not just cameras with flash storage. By putting computing power on them, it is possible to process data right there—for instance, to identify persons, and perhaps even specific people, and then send those results back to a database or operator, all depending on policy exercised at the edge unit. Don't underestimate what's possible at the edge. The quality of the cameras can be so good that edge computing can find specific license plates and raise an alert.
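A minimal Python sketch of policy exercised at the edge unit might look like the following. The `Detection` record, the `alerts_for` function, and the watchlist are assumptions invented for illustration; the recognizer that would produce the detections is out of scope here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Detection:
    """Hypothetical output of an on-device recognizer."""
    kind: str         # for example "person" or "license_plate"
    value: str        # the plate text; empty for anonymous detections
    timestamp: float  # seconds since some agreed epoch


def alerts_for(detections: Iterable[Detection],
               watchlist: Set[str]) -> List[Detection]:
    """Return only the detections that local policy allows to leave the node.

    Assumed policy: forward license plates that are on the watchlist,
    and keep everything else at the edge.
    """
    return [d for d in detections
            if d.kind == "license_plate" and d.value in watchlist]


seen = [
    Detection("person", "", 10.0),
    Detection("license_plate", "ABC123", 11.5),
    Detection("license_plate", "XYZ789", 12.0),
]
alerts = alerts_for(seen, watchlist={"XYZ789"})
print(alerts)
```

Only the matching result crosses the backhaul; the video it came from, and the detections that did not match, never leave the device.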

Improve your analytics

This raises another, better reason for not disposing of surveillance video. Historical data is used for analytics to improve software, such as video processing software, says Arvidson. In that sense, it is a shame to dispose of the data. Such retrospective analysis usually involves collecting all the data together, although, in theory, at least some of the analytics could be distributed as well.

The camera/lighting nodes are really specialized network servers. Such devices need to be hardened rather strenuously against both physical and network attack in any case. Arguably, this enhances the value of distribution itself as a privacy objective, but it requires a fairly high standard for management and security controls, perhaps more than one can really expect. Security experts can argue that inconvenience isn't a bug but a feature. If so, it's a feature for which operations people can be quick to lose appreciation.

So, in the context of surveillance data, privacy protection is a benefit of data distribution that comes with many disadvantages. But you would still try to distribute the data, to some extent, for the bandwidth savings. You also might be able to retain data by using distributed archives, if you could do so while still keeping the data off the main backhaul.

There are other theoretical downsides to distribution. In the event of a disaster, possibly even a small-scale one like a car accident at just the right place or some road crew cutting a cable, you could lose access to some of the data. Of course, you could design in redundant connections, but nothing's perfect and it all costs money.

With the knowledge that risks and costs are like death and taxes, it may well be that the risks of data distribution are worthwhile. The real lesson here is that you should think about the specific needs of your application and how they can be best fulfilled, and not about how you can squeeze it into a canned application paradigm.

The case for and against decentralizing data for security: Lessons for leaders

Just because you have data at the edge, it doesn't mean it needs to be aggregated elsewhere.

Backhaul costs are real and need to be considered as a design issue.

Determining what is private or public data can result in cost savings and improved security for data that needs to be private.
