By on •

I would like to write a bit on the meaning and history of the phrase “tidy data.”

Hadley Wickham has been promoting the term “tidy data.” For example in an eponymous paper, he wrote:

In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. Wickham, Hadley “Tidy Data”, Journal of Statistical Software, Vol 59, 2014.

Let’s try to apply this definition to following data set from the Wikipedia:

Tournament Winners

Tournament Year Winner Winner Date of Birth Indiana Invitational 1998 Al Fredrickson 21 July 1975 Cleveland Open 1999 Bob Albertson 28 September 1968 Des Moines Masters 1999 Al Fredrickson 21 July 1975 Indiana Invitational 1999 Chip Masterson 14 March 1977

This would seem to be a nice “ready to analyze” data set. Rows are keyed by tournament and year, and rows carry additional key-derived observations of winner’s name and winner’s date of birth. From such a data set we could look for repeated winners, and look at the age of winners.

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no the above table is not “tidy” under that paper’s definition. The the winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year) as required by the rules of Codd’s 3rd normal form. The critique being: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.

Around January of 2017 Hadley Wickham apparently retconned the “tidy data” definition to be:

Tidy data is data where: Each variable is in a column. Each observation is a row. Each value is a cell. tidyr commit 58cf5d1ebad6b26bd33ad1c94cc5e5e7ef1acf7e

Notice point-3 is now something possibly more related to Codd’s guaranteed access rule, and now the example table is plausibly “tidy.”

The above concept was already well known in statistics and called a “data matrix.” For example:

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual. Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.

One must understand that in statistics, “individual” often refers to observations, not people.

The above reference clearly considers “data matrix” to be a noun phrase already in common use in statistics. It is in the book’s index, and often used in phrases such as:

Suppose X is an n × p data matrix … Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 75.

So statistics not only already has the data organization concepts, statistics already has standard terminology around it. Data engineering often called this data organization a “de-normalized form.”

As a further example, the statistical system R, itself uses variations the above standard terminology. Take for instance the help() text from R’s data.matrix() method:

data.matrix {base} R Documentation Convert a Data Frame to a Numeric Matrix Description Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.

What is the extra “Factors and ordered factors are replaced by their internal codes” part going on about? That is also fairly standard, let’s expand the earlier data matrix quote a bit to see this.

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual. When presenting data in this form, it is customary to assign a numerical code to the categories of a qualitative variable … Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.

Note: for many R analyses the model.matrix() command is implicitly called in preference to the data.matrix() command, as this conversion expands factors into “dummy variables”- which is a representation often more useful for modeling. The model.matrix() documentation starts as follows:

model.matrix {stats} R Documentation Construct Design Matrices Description model.matrix creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly.

So to summarize: the whole time we have been talking about well understood concepts of organizing data for analysis that have a long history.

Frankly it appears “tidy data” is something akin to a trademark or marketing term, especially in its “tidyverse” variation.

Share this: Twitter

LinkedIn

Facebook

Reddit

Email

Like this: Like Loading... Related

Categories: data science Opinion Tutorials