What you’ll (probably) see is that your guess on the left alternates from heads to tails often, with few streaks, but the result on the right has streaks. Your guess likely has close to 10 heads and 10 tails, but rarely does the actual result come up an even 10 and 10. These differences are attributable to our assumptions about probability (that a coin flip has a 50/50 outcome) being applied to a larger problem (20 actual coin flips).

There’s even a simple statistical property (called Benford’s Law) that can be used to identify faked data sets. In short, in any large enough data set, patterns emerge in the first digits of each data point: more numbers will start with the digit “1” than any other number. This pattern is so pervasive and counterintuitive that it is often used to prove — in court — that data has been falsified.

Real data is unpredictable

Part of what makes data so tricky to work with is that our assumptions about the shape, domain, resolution, and volatility of a set of data are only just that: assumptions.

Even with the most reliable sets of data — say, an atomic clock — there’s always the possibility of error states or the occasional act of god. These one-in-a-million events are when great data visualization really shines: instead of breaking down in the face of anomaly, they highlight the rarity of the situation and respond appropriately.

But when we’re designing visualizations, how can we predict the unpredictable? It’s almost like something out of a Borges story; outliers aren’t really outliers if we’re expecting them.

Why not design with real data?

Real data sets of all shapes and sizes are easy to come by if you know where to look. There’s a slew of large, well-formatted data sets on GitHub, collected and maintained by the (excellently-named) Open Knowledge group. From financial data to global temperature indexes, these data sets can easily provide a reality check for your designs.

Unfortunately, my experience working with historical data sets fails to overcome some of the fundamental challenges of designing robust visualizations. Here’s why:

It’s too hard to fine-tune. The time series of S&P 500 financials over the past 100 years probably isn’t going to fit your visualization domain and fidelity right out of the box. The benefit of having lots of data to work with becomes a curse when you’re trying to normalize lots of values. It’s too easy to pick and choose. I find myself discarding data when it doesn’t come out quite like I expect it to. Instead of taking the harder route of improving the visualization to handle the data at hand, I’ll just select “better” data.

Let’s use placeholder data instead.

Placeholder data is a lot like placeholder text. It’s a design tool meant to ensure the final product — in our case, a real-time data visualization — matches our mockups. It’s useful in all the same ways placeholder text is useful:

Placeholder data allows a designer to test her assumptions about the product. Placeholder data makes decision-making much easier by providing realistic constraints. Placeholder data sets realistic expectations for clients and stakeholders.

What does placeholder data look like?

Just as with placeholder text, we can achieve varying degrees of placeholder data realism by choosing our parameters very carefully. When mocking up a page layout, I can choose the classic placeholder text “lorem ipsum dolor sit amet …” to help pick the right typeface, leading, measure, etc. A placeholder data version of “lorem ipsum” might look just like random numbers, a sort of meaningless pattern that looks like real data if you aren’t looking too hard.