One of the most common driving forces behind big data research is the convenient availability of data sets. In other words, it’s common for researchers or developers to encounter a new data set and then structure their workflow — be it designing mathematical models or developing computational tools — around figuring out some way to use it. A related but different approach is to create a new model or tool — a hammer — in order to solve some abstract problem and then search for data sets — nails — with which to showcase it. While conducive to fast-paced work, these “data-first” and “method-first” approaches can amplify issues relating to bias, fairness, and inclusion. First, it’s easy to immediately zero in on the coarse-grained patterns typically evidenced by the majority and then concentrate on analyzing those patterns instead of the fine-grained, harder-to-see patterns associated with minority groups. Second, it’s extremely common to focus on only those data sets that are readily available, such as Twitter data. The problem with these “convenience” data sets is that they typically reflect only a particular, privileged segment of society — e.g., white Americans or young people with smartphones. Consequently, many of the methods developed to analyze these data sets prioritize accurately modeling that majority over other segments of the population.

As an alternative, I would advocate prioritizing vital social questions over data availability — an approach more common in the social sciences. Moreover, if we’re prioritizing social questions, perhaps we should take this as an opportunity to prioritize those questions explicitly related to minorities and bias, fairness, and inclusion. Of course, putting questions first — especially questions about minorities, for whom there may not be much available data — means that we’ll need to go beyond standard convenience data sets and general-purpose “hammer” methods. Instead, we’ll need to think hard about how best to design data aggregation and curation mechanisms that, when combined with precise, targeted models and tools, are capable of elucidating fine-grained, hard-to-see patterns.

As a concrete example, my political science collaborator Bruce Desmarais and I are currently studying the role of gender in local government organizations. This project is exciting for two reasons: First, there is very little data-driven work on government at the local level. Second, although there has been work in organizational science suggesting that women tend to occupy disadvantaged positions in organizational communication networks, this work mostly consists of individual- and firm-level case studies, rather than large-scale analysis of real-world data.

At first glance, this might seem like a hard area to study using a data-driven approach — these are not the types of questions readily answered using Twitter or Facebook data. However, over the past few years, many “open government” data sets have been made available to the public with the stated goal of transparency. These data sets are instances of what I call “push” transparency — that is, government organizations proactively facilitated their distribution. Unfortunately, for our research questions, even these data sets (or at least the ones that we could find) are insufficient. But, as it turns out, there are other transparency mechanisms for addressing social questions, especially those relating to government — what I call “pull” transparency mechanisms. These mechanisms can be used as an opportunity to move beyond convenience data sets and even to request data that explicitly relates to bias and fairness. For example, most US states have sunshine laws that mimic the federal Freedom of Information Act. These laws require local government organizations to archive textual records — including, in many states, email — and disclose them to the public upon request. As a result, it’s possible to obtain all kinds of local government data via public records requests, including data on bias, fairness, and inclusion. Of course, in order to do this, you have to know about these laws, how to issue a public records request, and so on — all of which is arguably more difficult than pulling in data from the Twitter firehose, but may ultimately help address bigger societal issues.

In our case, Bruce and I (and our students) ended up issuing public records requests to county governments in North Carolina, asking for county managers’ email spanning a period of a few months. We’re now using this data to investigate whether women occupy disadvantaged positions in local government communication networks and, if so, the extent to which this varies with the topic of communication.
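To make the kind of analysis described above concrete, here is a minimal sketch of how one might build a directed communication network from email metadata and compare network positions across gender groups. The email tuples, the names, and the gender annotations are entirely made up for illustration; the sketch uses the `networkx` library and betweenness centrality as just one of many possible measures of network position.

```python
# Illustrative sketch (hypothetical data): construct a directed
# communication network from (sender, recipient) email metadata and
# compare average betweenness centrality across gender groups.
import networkx as nx

# Hypothetical email log and gender annotations.
emails = [
    ("alice", "bob"), ("bob", "alice"), ("carol", "bob"),
    ("bob", "dave"), ("dave", "alice"), ("alice", "carol"),
]
gender = {"alice": "F", "bob": "M", "carol": "F", "dave": "M"}

# Build the directed network, accumulating a weight for repeated exchanges.
G = nx.DiGraph()
for sender, recipient in emails:
    if G.has_edge(sender, recipient):
        G[sender][recipient]["weight"] += 1
    else:
        G.add_edge(sender, recipient, weight=1)

# Betweenness centrality: how often a person lies on shortest
# communication paths between others (unweighted here, since edge
# weights encode tie strength rather than distance).
centrality = nx.betweenness_centrality(G)

# Average centrality within each gender group.
group_means = {}
for g in ("F", "M"):
    members = [n for n in G if gender[n] == g]
    group_means[g] = sum(centrality[n] for n in members) / len(members)

print(group_means)
```

In a real study one would of course use many more measures (in-degree, brokerage, topic-conditioned subnetworks) and formal statistical models rather than a raw comparison of group means, but the basic pipeline — metadata to graph to position measures to group comparison — is the same.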