Weather of the Century: Part 1

Ever wonder what the weather was like in your birthplace the day you were born? Building an app to answer this question for anyone (from this planet at least) was the subject of talks on visualization and performance at MongoDB World 2014.

In those talks, MongoDB Engineers André Spiegel and Jesse Davis presented the Weather of the Century App, which takes as input any location on Earth and a time since the beginning of 1901. It marks the embedded Google Earth with all the available temperature measurements across the globe from that hour, and orients the globe on the specified location. After it loads the data for that hour, it commences marching forward in time, by one hour every few seconds, updating the display with that hour's temperature measurements, until the "stop" button is clicked.

Here it is in action, displaying the weather near the Sheraton Hotel Times Square, on October 1, 2013. (Full disclosure: we do not know anyone who was born in the Sheraton Hotel Times Square at that time. This app can be used to examine the weather anywhere in the world at any time in the past century regardless of birth events.)

The components of this app are:

- MongoDB to hold the weather data
- PyMongo and Python to handle the data querying and application logic
- The Google Earth plugin and JavaScript to present the user interface and parse the input

NOAA's Integrated Surface Data

Hang on. Weather observations for points all over the globe, for every hour, for the past century? Where does all that data come from?

As it happens, it comes from a remarkable organization called the National Oceanic and Atmospheric Administration, or NOAA. They describe their mission as:

    Science, Service, and Stewardship.
    To understand and predict changes in climate, weather, oceans, and coasts,
    To share that knowledge and information with others, and
    To conserve and manage coastal and marine ecosystems and resources.
They consider their domain of observation and experimentation to range "from the surface of the sun to the depths of the ocean floor." [1]

NOAA gathers data from land-based weather monitoring stations on every continent and accumulates it into one enormous data set. They supplement these observations with oceanic observations from naval vessels. By working with other agencies around the world, they have been able to unify legacy weather data observations along with ongoing measurements into the Integrated Surface Data set, also known as the ISD. This data set contains surface weather observations around the world, stretching back to 1901, and NOAA is hard at work on integrating more stations and earlier measurements. They have made this data publicly and freely available.

The ETL Phase

Although the data collected before the 1930s was quite sparse, the 1950s saw a steep rise, and in 1970 it jumped radically (see below). All told, the data set holds, at the time of this writing, 2.6 billion data points.

NOAA makes the ISD available as a compressed hierarchy of packed ASCII files, one directory per year, one file per observation station, one record per observation. Each observation record in those files looks something like this:

    0303725053947282013060322517+40779-073969FM-15+0048KNYCV0309999C00005030485MN
    0080475N5+02115+02005100975ADDAA101000095AU100001015AW1105GA1025+016765999GA2
    045+024385999GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...

Taken together, the compressed data occupies about 80GB of disk space, unzipping up to 800GB or so. While this format is compact, storage-agnostic, and convenient for FTP access, it is rather unfriendly to querying. To do anything useful with the data, we first have to import it into MongoDB, using an ETL ("Extract, Transform, and Load") system written by Consulting Engineer André Spiegel. The loader code is a subsection of the overall Weather of the Century App.
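
To give a feel for the "transform" step, here is a minimal sketch that parses the fixed-width prefix of the sample record shown above. It is not the actual loader code. The field offsets are an assumption based on our reading of NOAA's ISD format documentation, the function name `parse_control_section` is hypothetical, and the `"u"` prefix on the station ID is a guess inferred from the sample document in this post. The sketch also ignores the sentinel values the format uses for missing data.

```python
from datetime import datetime, timezone

# First 51 characters of the sample observation record shown above.
SAMPLE = "0303725053947282013060322517+40779-073969FM-15+0048"


def parse_control_section(record: str) -> dict:
    """Parse the leading fixed-width fields of one ISD record.

    Offsets are assumptions taken from the ISD format documentation;
    verify them against the current spec before relying on them.
    """
    if len(record) < 51:
        raise ValueError("record too short to hold the leading fields")

    usaf_id = record[4:10]                    # station identifier
    timestamp = datetime.strptime(record[15:27], "%Y%m%d%H%M")
    timestamp = timestamp.replace(tzinfo=timezone.utc)
    latitude = int(record[28:34]) / 1000      # stored in thousandths of a degree
    longitude = int(record[34:41]) / 1000     # stored in thousandths of a degree
    elevation = int(record[46:51])            # meters

    return {
        "st": "u" + usaf_id,                  # "u" prefix is an assumption
        "ts": timestamp,
        # GeoJSON points list longitude first, then latitude.
        "position": {"type": "Point", "coordinates": [longitude, latitude]},
        "elevation": elevation,
    }


doc = parse_control_section(SAMPLE)
```

The result is a plain Python dictionary, which is the shape PyMongo expects when inserting a document. The optional sections that follow this prefix in a real record would need their own parsing.
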
For some ETL needs, a simple, serial loader would do, but with so much data to import, André parallelized the code. How many threads can run simultaneously? MongoDB can ingest at different rates depending on the deployment, so we will cover that, and other high performance topics, in a subsequent post.

The MongoDB Schema

Each observation record contains many mandatory sections (such as the station ID number and global position), but there are also hundreds of optional sections, which appear in some records and not in others. Easily handling this variability in record content is one of MongoDB's strengths.

Once the data has been transformed, its JSON representation looks like this:

    {
        "st" : "u725053",
        "ts" : ISODate("2013-06-03T22:51:00Z"),
        "position" : {
            "type" : "Point",
            "coordinates" : [ -96.4, 39.117 ]
        },
        "elevation" : 231,
        "airTemperature" : {
            "value" : 21.1,
            "quality" : "1"
        },
        "sky condition" : {
            "cavok" : "N",
            "ceilingHeight" : {
                "determination" : "9",
                "quality" : "4",
                "value" : 1433
            }
        },
        "atmosphericPressure" : {
            "value" : 1009.7,
            "quality" : "5"
        }
        [etc]
    }

Were this data to be loaded into a relational database, its footprint on disk could be made smaller (a strength of relational systems), but the hundreds of optional elements per record would mean hundreds of tables to normalize the data across, or wasted space in every record. Retrieving a single observation fully would then require a join across hundreds of tables! Beyond the performance implications of these joins, the code required to work with the data in this format would be troublesome.

Alternatively, these records could be stored fully denormalized in a relational database. In that case, the on-disk space savings would be minimal, but worse still, consider the impact of adding a new optional section to the record format: you would have to run an ALTER TABLE on that 4.5TB table!
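
To make the schema-change point concrete, here is a minimal sketch of how optional sections can be handled in the document model. The helper `build_observation` is hypothetical and is not taken from the app. The field names follow the sample document above. A section that is absent from a record is simply left out of the document, so a new optional section needs no schema change.

```python
from datetime import datetime, timezone


def build_observation(station, timestamp, lon, lat, optional_sections=None):
    """Assemble one observation document (hypothetical helper).

    Optional sections become keys only when they are present in the record.
    """
    doc = {
        "st": station,
        "ts": timestamp,
        "position": {"type": "Point", "coordinates": [lon, lat]},
    }
    for name, section in (optional_sections or {}).items():
        if section is not None:  # absent sections are omitted entirely
            doc[name] = section
    return doc


when = datetime(2013, 6, 3, 22, 51, tzinfo=timezone.utc)

# A record that carries both optional sections.
full = build_observation(
    "u725053", when, -96.4, 39.117,
    {
        "airTemperature": {"value": 21.1, "quality": "1"},
        "atmosphericPressure": {"value": 1009.7, "quality": "5"},
    },
)

# A record from the same station with no pressure reading.
sparse = build_observation(
    "u725053", when, -96.4, 39.117,
    {
        "airTemperature": {"value": 21.1, "quality": "1"},
        "atmosphericPressure": None,
    },
)
```

Both dictionaries could be stored in the same collection, for example with PyMongo's `insert_many`, even though their sets of keys differ.
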
MongoDB, by contrast, can return an entire weather observation record with a simple query, returning a well-organized and self-documenting data structure. Queries for a subset of fields in observation records (e.g. only the air temperature) can use projection to return only those fields, and they should, so as not to waste network resources transferring superfluous data.

As it turns out, NOAA used a relational database to store the data, and did not use either of the above degenerate cases in schema design. Rather, they stored the measurements themselves as a flat collection of name/value pairs, and used a small grouping of tables to encode metadata regarding the field types. You can read the details in their ISH Tech Report. While this structure addresses the massive join and schema change issues, it is a paragon of circumventing the idioms of a relational database to achieve needed behavior, and could be used in a case study explaining why MongoDB was built in the first place.

Next: A Look Inside the Weather of the Century App

In our next installment, we'll analyze the MongoDB queries the app uses to do its work. In the interim, if you’re looking for a more in-depth look at MongoDB’s architecture, download our Architecture Guide.

[1] NOAA

About the Author - Avery

Avery is an infrastructure engineer, designer, and strategist with 20 years of experience in every facet of internet technology and software development. As principal of Bringing Fire Consulting, he offers clients his expertise at the intersection of technology, business strategy, and product formulation. He earned a B.A. in Computer Science from Brown University, where he specialized in systems and network programming, while also studying anthropology, fiction, cog sci, and semiotics. Avery got his start in internet technology in 1993, configuring Apache and automating systems at Panix, the third-oldest ISP in the world.
He has an obsession with getting to the heart of a problem, a flair for communication, and a devotion to providing delight to end users.