Exploring Natality data with BigQuery

Big Data for the Big Day

My wife and I found out we were pregnant in July. Since then I think I’ve had about 20,000 questions that I honestly never cared at all about before. Most of them can boil down to, “what’s a normal pregnancy?”

If I could just ask every person who has ever given birth how theirs went, I think it would make me feel better, and give me an idea of what to expect. I guess I could start cold-calling people at random, that sounds like fun…

Enter BigQuery

Fortunately, BigQuery has this information readily available in its sample datasets. The data comes from from the CDC and covers U.S. birth records from 1969 to 2008. The dataset includes information about the baby’s weight, sex, race, gestational age, Apgar scores, and information about the mother like her age, previous births, cigarette and alcohol use, and whether she was married at the time of the birth.

It’s really easy to use BigQuery to sift through this pile of data. The query language is basically just SQL, and it only takes a few seconds to run through 40+ years of birth records. The first terabyte of data processed per month is free, so it doesn’t even cost anything to play around and run some queries.

Of course, you don’t need to actually crunch the numbers to get a lot of useful information; Google has plenty of search results for [pregnancy week histogram] for instance, if you’re willing to cheat. But I wanted to check their work, and besides, it’s not like I was sleeping anyway.

Some caveats about this dataset, before we get started:

The CDC provides data beyond 2008, they’re just not in the sample table. I briefly looked into loading it in, but the format required just enough work that I decided against it. There’s a ton of other datasets available from the CDC too, if you’re interested.

Due to what was probably an attempt to anonymize data, there’s a column for “record_weight” which means that the record describes two very similar births, and should be counted twice. If you find yourself doing a COUNT(*), what you actually want is a SUM(record_weight).

The information the CDC provides changes over time, so some years’ data might include the child’s race or the mother’s reported number of cigarettes per day, and some years they might not. This makes it hard to compare data across years for anything but the more basic facts.

Like any data source generated by millions of hard-working nurses and public servants, there will be some “anomalous” data. It looks like it was standard practice in the 60s to mark the birth year as “1998” when the birth year was unknown, because who would ever use that number for real data at any time in the future? It’s important to filter out obviously insane data points, like the 40-year term.

I am not a professional data scientist or medical expert or epidemiologist or person who knows statistics good. Consult your doctor before planning your life around this Medium post.

With that out of the way, let’s run some queries.

Births by day of week

One interesting birth fact I kept hearing was that births are much more common on weekdays. This makes some sense when you consider that many births are induced, either due to concerns about post-term pregnancy, or out of sheer convenience for the mother or doctor.

Looking at the data, we can see that there is indeed a noticeable decrease in births on weekends, about a third fewer on the least popular day (Sunday) as the most popular (Tuesday):

SELECT wday, SUM(record_weight) c

FROM [publicdata:samples.natality]

GROUP BY wday

HAVING wday > 0

ORDER BY wday

Loverboy was right, Everybody’s is Working for the Weekend

Distribution of pregnancy term length

My wife is 36 weeks today. At least for me, now’s the time when it’s all becoming eerily real. We’re meeting with the midwife weekly now. The nursery looks a lot like a real room where a human could live. This baby could happen at any moment.

But how common is it to give birth at this point? The dataset has a column gestation_weeks, which gives us this information:

SELECT gestation_weeks, SUM(record_weight) c

FROM [publicdata:samples.natality]

GROUP BY gestation_weeks

HAVING gestation_weeks > 30 AND gestation_weeks < 45

ORDER BY gestation_weeks

That’s kind of useful, but at this point I’m not so much interested in the likelihood it’ll happen this week as I am about it happening today. Ohgoditcouldhappentoday.

Fortunately, the dataset also includes a column “lmp” for last menstrual period. Unfortunately, it’s given in the form “MMDDYYYY” while the birth date is given as three separate columns. To poorly paraphrase Mark Watney: We’re going to have to formula the shit out of this.

Here’s what I came up with to give us a day-by-day breakdown:

SELECT

DATEDIFF(

USEC_TO_TIMESTAMP(PARSE_UTC_USEC(

CONCAT(STRING(year),'-',STRING(month),'-',STRING(day),'00:00:00'))),

USEC_TO_TIMESTAMP(PARSE_UTC_USEC(

CONCAT(SUBSTR(lmp,5,4),'-',SUBSTR(lmp, 0, 2),'-',SUBSTR(lmp, 3, 2),

' 00:00:00')))) diff,

SUM(record_weight) c,

FROM [publicdata:samples.natality]

GROUP BY diff

HAVING diff < 7*45 AND diff > 7*30

ORDER BY diff

Phew! Okay, so basically what this does is string together the birth date into a timestamp, and the lmp date into a timestamp, and get the difference between them using DATEDIFF. That gives us the gestational period in days.

That’s smooooth.

With this chart it’s pretty easy to see another sign of induced labors: the pronounced peak right at 40 weeks. To the best of my understanding, births used to be induced much more often, and used to be induced at pretty much exactly 40 weeks. These days it’s not uncommon to let the kid bake a little longer if there are no signs of distress. Unfortunately the “lmp” column isn’t populated after 1988.