How to build a histogram

Gather your data A histogram is based on a collection of data about a numeric variable. Our first step is to gather some values for that variable. The initial dataset we will consider consists of fuel consumption (in miles per gallon) from a sample of car models available in 1974 (yes, rather out of date). We can visualize the dataset as a pool of items, with each item identified by its value—which in theory lets us "see" all the items, but makes it hard to get the gestalt of the variable. What are some common values? Is there a lot of variation?

Sort into an ordered list A useful first step towards describing the variable's distribution is to sort the items into a list. Now we can see the maximum value and the minimum value. Beyond that, it is hard to say much about the center, shape, and spread of the distribution. Part of the problem is that the list is completely filled; the space between any two items is the same, no matter how dissimilar their values may be. We need a way to see how the items relate to each other. Are they clustered around a few specific values? Is there one lonely item, with a value far removed from all the others?

Draw the number line A common convention is to use a number line, on which higher values are displayed to the right and smaller (or negative) values to the left. We can draw a line representing all possible numbers between the minimum and maximum data values.

Add data to the number line Now, we map each item to a dot at the appropriate point along the number line. In our visualization we draw the path followed by each item on its way from the list to the line, helping to reveal how adjacent list items end up close or far apart on the number line depending on their values. In the first sample dataset ("MPG"), there are—by chance—no two items with exactly the same value. To broaden your understanding, we have included two other datasets in this animation. As you scroll, next you will see a a dataset of an opposite extreme type, in which there are many data points at each value—"NBA", a dataset that lists the ages of some National Basketball Association athletes. The ages are rounded to whole years, so there are multiple items (i.e., multiple athletes) with the same age value. We show repeated values by stacking them on top of one another. Scrolling further, you will see a a third sample dataset, "Geyser", a collection of timings, in seconds, between eruptions of the Old Faithful geyser in Yellowstone National Park. This dataset, too, has values for which there are multiple readings—and again, this multiplicity occurs because the values are rounded to integers. If the timings were measured to multiple decimal places, there almost certainly would be no duplicates. In these small datasets, the stacks of dots (often called a dotplot) may give you a good sense of the distribution of the data. However, many real-world datasets have values which are repeated many hundreds or thousands of times. From a practical standpoint, it becomes very hard to plot all the dots. Conversely, a dataset could have no duplicates at all, so even a dense region of items would appear as an unremarkable smear along the number line. Histograms provide a way to visualize data by aggregating it into bins, and can be used with data of any size. Notice the buttons at the bottom of the visualization? These allow you to switch between the sample datasets at any time.

Portioning items into bins—the essence of a histogram Once items are placed along a number line, drawing a histogram involves sectioning the number line into bins and counting the items that fall into each bin. Notice how the distribution shown in the histogram echoes the distribution from the dot plot. Gathering the items into bins helps us to answer the question "what is the distribution of this data like?" Imagine trying to describe some dataset over the phone: rather than mechanically reading out the entire list of values, it would be more useful to provide a summary, such as by saying whether the variable's distribution is symmetric, where it is centered, and whether it has extreme values. A histogram is another kind of summary, in which you communicate the overall properties in terms of portions (i.e., bins) of the data. For example, the "Geyser" data can be described as being bimodal (because its histogram has two 'peaks'), while "NBA" is more unimodal, and perhaps right-skewed (because the bin heights decrease towards the right). Maybe because histograms are visually similar to bar charts, it's easy to think that they are also similarly objective. But, unlike bar charts, histograms are governed by many parameters. Before describing a dataset to someone based on what you see in its histogram, you need to know whether different parameter values might have led you to different descriptions.

Bin-breaks: Why these bins? For a start, you probably noticed that the histograms shown for our sample datasets have different numbers of bins. This is because we used Sturges' formula, a common method for estimating the number of bins for a histogram, given the size of a dataset. Given a suggested number of bins, how did we then decide the precise values for the bin boundaries (the so-called "breaks")? Again we used a common method: look for nearby round numbers. This is why the breaks for "MPG" are all multiples of 5, and those for "NBA" are multiples of 2. For those two datasets, the bins turn out to cover the range of the item values rather tidily. But look at the first and last bins for "Geyser". Their placement relative to the value range looks a little arbitrary, right? That's because it is. The fact is that there are few hard-and-fast rules for drawing a histogram. Instead of Sturges' formula, we could have chosen the number of bins using Scott's choice or the Freedman-Diaconis choice, among many other methods. And there's certainly no rule saying that bin-break values have to be rounded to the nearest multiple of 2 or 5. What's important is whether a given histogram is a representative summary of its underlying dataset. One way to judge this is to try varying the positions of the breaks, and see what impact that has on the summary that the histogram conveys.

Varying the bin offset First, let's try shifting the bins left and right along the number line, based on an "offset" setting. Watch what happens to the relative bin heights as the offset changes. Items are being moved from one bin to another, changing the aggregation, and therefore, the height of the bins. Typically the various histograms' shapes look quite alike, but occasionally a strangely different picture will pop out. Mouse over the scenario switcher to take control of the animation. Also, try switching between the available datasets.

Varying the bin width We now fix the left-most bin break to the minimum data value, and instead vary bin width. The width values we try here are defined in terms of the default width that was used in the earlier stages. Some widths can be argued as being more valid than others. For example, setting a width of 1.4 years for the "NBA" dataset, where item values are always integers, is asking for trouble: some bins will inevitably span two values, while others only include one. More generally, bin width should be an integer multiple of the precision that was used in measuring the data. Our "MPG" values were all measured to the nearest 0.1, so for that dataset a bin width of 1.4 (mpg) would be fine.

Unmasking the magic Maybe our talk of formulas and rules and variations has made you feel that histograms are complicated things, calling for complicated tools. In the next few sections we aim to show that this is not the case. Summarizing a dataset into bins, and even experimenting with the binning parameters, can be achieved with rather simple code. Our code is a series of formula-like derivations. For adjustable parameters we offer candidate values for you to select; try hovering the mouse pointer over the values in the width row to see histograms with different bin widths. You can fix on a value by clicking it. Clicking the green square at the left side of the row will turn on a "sweep" of width values, creating a cloud of semi-transparent histograms with different bin widths. Creating a sweep adds annotation to each row of the table that the sweep affects. Try creating a sweep, then mousing over the columns on the right to see how the various widths affect the derived bins. Because broader bins tend to capture more items, and narrower bins fewer, it can be hard to compare histograms that have different bin widths; the narrow bins appear unfairly short. A solution is to plot the histograms as densities—in effect, counts per unit width. Use the density checkbox to toggle this option. Switching to density preserves the overall shape of each histogram but boosts the height of narrow bins (and conversely tames the broad ones), making comparisons across bin widths easier. Play with the sample datasets and see if you can find histograms where changing the bin width changes the visual "center" of the distribution.

More adjustments Now we've added a row that lets you vary the offset (note how this is taken into account in the modified derivation of breaks). Try setting up a sweep on width, then mousing over different values for offset. Again, you may want to switch on density plotting for easier comparison. As you explore these parameters, you may come across a parameter combination that leads to a vastly different shape for the distribution. This is often the result of a mismatch between the scale of the data and the scale of the bins. Several "stacks" of dots get caught into a single bin in a way that feels unexpected.

One more thing: Bin openness We've been focusing on the effects of the bin offset and bin width, but there is at least one more parameter that is important: the "openness" of the bins. Openness is about how to deal with items that fall exactly on a bin boundary. Do they get counted as belonging to the lower bin, or to the higher bin? Or to both? It's not both—because that would mean that the boundary items were being counted twice. In a histogram, each data item must be assigned to exactly one bin. So, then: lower or higher? Every histogram-drawing tool has a default policy on this (but might not reveal what that policy is, let alone enable you to try the alternative). In this essay our default has been "left-open," which means that any items with exactly the value of a bin's left (i.e., lower) limit will not be counted as belonging to that bin. How much of a difference does the choice between left-open and right-open make? That largely depends on the dataset—in particular, on the occurrence of data values that are highly "stacked" (i.e., have multiple items), and of course whether a bin boundary happens to fall exactly on such a stack. Among our sample datasets, predictably, "NBA" has the greatest propensity for dramatic variation. Try switching open between L and R. Notice how any data points exactly on a bin boundary "jump" as they are moved between neighboring bins.