A Photographic View of Cumulative Distribution Functions

Robert W. Jernigan

American University, Washington, DC

Journal of Statistics Education Volume 16, Number 1 (2008), jse.amstat.org/v16n1/jernigan.html

Copyright © 2008 by Robert W. Jernigan all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.

Key Words: Alphabet; Kolmogorov; Probability distribution function(pdf); Scrabble; Smirnov; Student projects.

Abstract

This article shows a concrete and easy recognizable view of a cumulative distribution function(cdf). Photograph views of the search tabs on dictionaries are used to increase students’ understanding and facility with the concept of a cumulative distribution function. Projects for student investigations are also given. This motivation and view helps the cdf become a bit more tangible and understandable.

1. Introduction

Images of bar charts, histograms, stem-and-leaf diagrams, frequency distributions, and probability density functions can all be introduced and explained via an intuitive, constructive motivation. We imagine blocks, beads, or digits stacked or piled up in defined bins. For categorical data the bins are named categories in the data. The stacked up objects display the data as a bar chart. Higher stacks indicate the categories that appear more frequently. For continuous or discrete quantitative data, the stacks are built on bins that fall along a number line. For equal sized bins the height of a stack is proportional to f(x), the density or relative frequency of occurrence of numbers in a bin around the real number x.

Between these two extremes of named categories and the real number line, lie ordinal data. These are categorical data ordered in a logical or well accepted way. Ordinal data have long been illustrated with course grades (A, A-, B+, B, etc.) or levels of satisfaction or agreement (very strongly agree, strongly agree, etc.). The alphabet also provides a well understood ordering. A frequency distribution of the usage of letters can be represented in an alphabetically ordered bar chart. For example, a bar chart of the tiles in the crossword game of Scrabble shown in Figure 1, roughly mirrors the occurrence of the letters used in the English language. The letter "E" is most frequent, with twelve occurrences out of 100 tiles, yielding a relative frequency of 0.12. This closely matches the occurrence of the letter "E" in English. Other letters such as "S" were deliberately underrepresented in the design of Scrabble. The relative frequency of "S" in English is about 0.063, but too many "S" letters in Scrabble would make it too easy to form the plurals of nouns used in the game. To make this less likely, the Scrabble "S" is represented with a relative frequency of only 0.04. Other letters are also underrepresented as shown in Richardson et al. (2004).

Figure 1. A photograph of Scrabble letter tiles stacked to see their frequency distribution (not shown are two blank tiles)

These images of frequency distributions seem natural and easily understandable. In contrast, the cumulative distribution function has been more difficult to visualize in a natural way. Given the frequency distribution f(x) for a discrete quantitative or ordinal variable, we define F(x) to be the cumulative relative frequency count of all measurements less than or equal to x. For the Scrabble example, x Î{A,B,C,…,Z}, and F(x) represents the cumulative relative frequency of all tiles with letters occurring at or before x. For example, if x = "E", then F("E") is the cumulative count of the frequencies of all the letters "A", "B", "C", "D", "E" out of the 100 Scrabble tiles. That is, F("E") = 0.27, or, 27% of all Scrabble letters occur at or before the letter "E" in alphabetical order. The entire cumulative frequency distribution is shown in Figure 2.

Figure 2. The cumulative distribution function for the letters in the Scrabble crossword board game.

To many students this construction of cumulative distribution functions (cdfs) seems artificial and more contrived than the simpler view of the frequency distribution. We are not as familiar with cumulative distribution functions because they do not occur routinely in much of our experience. Further it seems that work has to be done to convert a density function into a cdf. But for a continuous random variable, we know from probability that the cumulative distribution function is the more fundamental quantity. The frequency distribution is derived from the cdf through differentiation. But as it is presented in many elementary textbooks, we have to process the frequency distribution to produce the cdf. Thus even though mathematically we know it works in reverse, to some it seems that a cdf is a derivative of the density, in the literal meaning that the cdf must be derived from the frequency distribution. A cumulative distribution function does not appear to be the more fundamental idea from which a density function is derived.

It is the purpose of this article to provide instructors and students with a concrete and easy recognizable, visual example of cumulative distribution function (cdf).

2. A Dictionary Cumulative Distribution Function

Figure 3 shows a cumulative distribution function that can be seen and understood with relative ease. This is a side view of the pages of the paperback version of the Oxford Advanced Learners’ Dictionary. Small colored squares for each letter are shown on the edge of each page. These colored squares act as tabs running from the top of the page for letters early in the alphabet to the bottom of the page for those letters that come later. These printed alphabetical marks on the edges of the pages help speed the look up of words and their definitions. This is really one of the first search engines.

When the dictionary is placed on its side, these colored tabs produce a cumulative distribution function for words from the English language. We have a visual and understandable image (Figure 3) of a cumulative distribution function. We can quantify this by noting, for example, that 93 pages of this dictionary are devoted to the letter "A", so 93 pages have tabs colored to act as guide tabs to words starting with "A". The last page number of each letter’s tab indicates the number of pages devoted to words that begin with letters occurring, in alphabetical order, before that tab. Let G(x) denote the page number of this last page for each letter x Î{A,B,C,…,Z}. These are the cumulative counts of pages for words beginning with each letter in the English alphabet, shown in Table 1. The maximum of G(x), call it M, is just the number of the last page of the dictionary for the letter "Z". Define F(x) to be G(x)/M. Then F(x) represents the cumulative relative frequency of letters of the alphabet. This is the alphabet’s cdf.

Table 1. Oxford Advanced Learners’ Dictionary

x=Letter F(x) = Cumulative Frequency Cumulative Relative Frequency Frequency Relative Frequency A 93 0.052 93 0.052 B 207 0.116 114 0.064 C 381 0.214 174 0.098 D 479 0.269 98 0.055 E 543 0.305 64 0.036 F 634 0.356 91 0.051 G 694 0.390 60 0.034 H 766 0.430 72 0.040 I 825 0.463 59 0.033 J 840 0.472 15 0.008 K 855 0.480 15 0.008 L 921 0.517 66 0.037 M 1009 0.567 88 0.049 N 1043 0.586 34 0.019 O 1087 0.611 44 0.025 P 1231 0.692 144 0.081 Q 1239 0.696 8 0.004 R 1336 0.751 97 0.054 S 1558 0.875 222 0.125 T 1656 0.930 98 0.055 U 1690 0.949 34 0.019 V 1710 0.961 20 0.011 W 1770 0.994 60 0.034 X 1771 0.995 1 0.001 Y 1777 0.998 6 0.003 Z 1780 1.000 3 0.002

Figure 3. A photographic side view of the Oxford Advanced Learners’ Dictionary. Inserted below is a photo manipulation showing the frequency distribution of the dictionary’s letters.

So what can we learn from Figure 3 and this table? The alphabet begins with a few large tab regions, rising vertically. This indicates that the first few letters of the alphabet begin many words. The middle of the alphabet is not so well represented. The letters of "I", "J", and "K" begin many fewer words with their collective contribution of colored tabs not extending vertically to any great extent. Big jumps at "P" and "S" indicate that many words start with those letters. They bracket a thin tab of very few words beginning with "Q". The tail of the alphabet is so thin that the final three letters are combined into a single colored tab representing "X,Y, and Z" together.

The "S" tab spans the most pages, indicating more words in English start with "S" than with any other letter. From our consideration of Scrabble, recall that the letter "E" is most frequently occurring letter in English, but keep in mind that dictionary tabs consider only the first letters of words. As first letters, the letter "S" is most frequently occurring in English. From the table we can find the number of pages devoted to the letter "S". This comes from subtracting G("R"), which is the value that represents the cumulative count just prior to "S", from G("S"). This is G("S") – G("R") = 1558 – 1326 = 222 pages devoted to words that start with the letter "S". We can divide these frequency counts by the maximum frequency count, M, resulting in relative frequency counts. Both frequency and relative frequency counts for each letter are also shown in Table 1. Photo manipulation software also allows us to visually collapse the cdf and place the tabs side-by-side to see the histogram of alphabetic frequencies, shown in the insert of Figure 3. Here we can easily see that after the letter "S" is most frequently used and that the two next most frequent starting letters in English are "C" and "P".

Of course what this shows is that more pages are devoted to the letter "S" than to the letter "K", for example. This is most certainly because more English words start with "S" than "K". But what about other pairs of letters, say "G" and "H"? Do more words start with "G" or more with "H"? Perhaps the definitions of the words starting with G are longer than those for "H". This would give "G" more pages than "H" with perhaps fewer words. If we make the simplifying assumption that more pages mean more words, rather than more pages mean longer definitions, then the size of the letter tabs will be proportional to the frequency of words beginning with each letter of the alphabet.

This image represents a snapshot of the words in the English language. We can quickly see which letters begin more words than others. This is, of course, just one dictionary. How reliable is this representation of the English language? I examined five other large dictionaries: Microsoft Encarta College (2001), New Oxford American (2001), Merriam-Webster Collegiate (1998), Random House Webster’s College (1995), and Newbury House Dictionary of American English (1999). Table 2 shows the cumulative counts for the smallest and largest of these English dictionaries: a small paperback edition (Newbury House) and a more comprehensive reference (Merriam-Webster Collegiate). Although minor editorial discretions are evident, the cdfs in Figure 4 are nearly identical. The cdfs for the others (not shown) are also nearly identical. A consensus image of the English language emerges from these cumulative functions since there is a great deal of common overlap for each dictionary.

Table 2. Cumulative frequencies (page numbers at the end of lettered tabs) of two English language dictionaries are shown along with cumulative relative frequencies.

Merriam-Webster Merriam-Webster Newbury Newbury Letter cumulative cum. relative cumulative cum. relative frequency frequency frequency frequency A 127 0.059 48 0.048 B 246 0.114 101 0.101 C 453 0.210 192 0.191 D 568 0.263 257 0.256 E 652 0.302 293 0.292 F 748 0.347 351 0.350 G 821 0.381 388 0.386 H 903 0.419 429 0.427 I 982 0.455 462 0.460 J 1002 0.465 471 0.469 K 1022 0.474 480 0.478 L 1096 0.508 517 0.515 M 1211 0.562 564 0.562 N 1257 0.583 587 0.585 O 1308 0.607 614 0.612 P 1495 0.693 699 0.696 Q 1507 0.699 704 0.701 R 1611 0.747 766 0.763 S 1876 0.870 877 0.874 T 2001 0.928 932 0.928 U 2037 0.945 953 0.949 V 2074 0.962 964 0.960 W 2139 0.992 997 0.993 X 2141 0.993 998 0.994 Y 2149 0.997 1002 0.998 Z 2156 1.000 1004 1.000

Figure 4. Cumulative Relative Frequencies for the two English Language dictionaries shown in Table 2. Notice the near identical cdfs.





A formal test of the equality using the discrete Kolmogorov-Smirnov statistic can compare two dictionary cdfs. Let F(x) represent the cdf of the most comprehensive Merriam-Webster Collegiate dictionary. This will be taken as a standard. For another dictionary of n pages, let H n (x) represent this other dictionary’s cdf. The Kolmogorov-Smirnov statistic tests the null hypothesis that a sample cdf H n (x) is equal to F(x). A statistic D = max n1/2 | H n (x) - F(x) | measures the discrepancy between two cdfs, where the maximum is taken over the 26 letters of the alphabet. The p-values for testing each of the cdfs of the English dictionaries discussed above were all greater than 0.57, indicating no significance difference.

3. Other Languages

Figure 5. A photographic side view of the Oxford Spanish Language Dictionary

How does the English cumulative distribution function compare with other languages? Figure 5 is a side view of the Oxford Spanish Dictionary. The upper portion of the dictionary is for translating English into Spanish. It is another representation of the English cumulative distribution function that we saw earlier. Notice the second tab, for the letter "B". The "B" tab extends halfway into "A" and also into "C". The editors of this dictionary have not confined the tabs to each letter as before, but they have a larger extent of overlap spanning the letter positions on either side. Although the jumps from one letter to another remain correct, the inked contribution of each letter’s tab is spread out over a range of neighboring letters resulting in a more connected and smoother cumulative distribution. The bottom portion of this dictionary is for translating Spanish into English. In this bottom portion, we have a similarly smoothed cumulative distribution function for Spanish. The most frequent first letter in Spanish is "C". Notice how the Spanish cumulative distribution function builds up more rapidly than English early in the alphabet, only to have English include more frequent words starting with later letters. The cumulative counts are given in Table 3 and a more precise plot in Figure 6. The p-value for the discrete Kolmogorov-Smirnov statistic comparing the English and Spanish cdfs results in a p-value < 10-4, indicating a significant difference.

Table 3. Page numbers at the end of each lettered tab from the Oxford Spanish Dictionary

Letter Spanish to English Cum. Rel. Freq. English to Spanish Cum. Rel. Freq. A 87 0.100 56 0.054 B 118 0.135 120 0.115 C 239 0.274 211 0.202 D 303 0.347 269 0.257 E 366 0.420 304 0.291 F 394 0.452 356 0.340 G 417 0.478 395 0.378 H 438 0.502 439 0.420 I 473 0.542 473 0.452 J 481 0.552 482 0.461 K 482 0.553 491 0.469 L 508 0.583 528 0.505 M 562 0.644 574 0.549 N 578 0.663 597 0.571 O 593 0.680 625 0.598 P 684 0.784 703 0.672 Q 691 0.792 708 0.677 R 737 0.845 762 0.728 S 786 0.901 897 0.858 T 833 0.955 961 0.919 U 839 0.962 986 0.943 V 864 0.991 997 0.953 W 865 0.992 1038 0.992 X 866 0.993 1039 0.993 Y 868 0.995 1044 0.998 Z 872 1.000 1046 1.000