Geospatial data is the building block of the modern web map, but its production has evolved rapidly in recent years with a vast quantity of user-created content. What once was the realm of experts has now been democratized for the common user, but crowdsourced and volunteered data faces criticisms over its quality. Machine learning is offering an intelligent and objective way to enhance collaborative work and bring added precision and accuracy. But which source will be the one providing the maps of the future?

“Art is a lie that makes us realize truth, at least the truth that is given us to understand.”

— Pablo Picasso

I. The Geographic Enlightenment

I recently visited a small bookstore in Riga, Latvia, which boasts some underground fame among mapping enthusiasts for its collection of vintage Soviet maps of the world. I found Cyrillic writing detailing the location of airfields and rivers in my home state of Montana, inspiring me to ponder the way such information was gathered and stored in the Cold War era. These maps are not only a relic of a fading geopolitical climate but also of what is quickly becoming an outdated way of thinking about geospatial data. When they were produced, they were perhaps the most official and detailed maps available in some regions; by the 1990s they were a commodity for building maps of the future.

A Soviet map of Chicago

Pablo Picasso, speaking about art and its message, claimed that “art is a lie” and that it “makes us realize truth”. There is a similar saying in the world of cartography, about how maps themselves are useful lies. The world is not flat, nor neatly color-coded and labeled. Maps are a convention where we portray the world in a false sense and yet hope to convey some higher truth about what’s there and how things relate to one another.

At a certain point in human history, maps came to be curated not by ordinary people but by expert geographers and cartographers. Before that, even in prehistory, maps were a compilation of knowledge across generations and throughout society. Where to find water, where to hunt, and how to navigate the landscape formed a collaborative knowledge base.

This communal and crude representation of our world increasingly became mathematical and scientific, and local knowledge began to aggregate into global compendiums. Maps changed based on purpose, taking forms such as nautical charts for sailors. The more official and crucial the use case, the more expertise was behind the crafting of these maps.

On Christmas Day in 1900, the Audubon Society launched the first ever Christmas Bird Count, which resulted in a small collection of bird sightings across North America. The significance of this event in the realm of maps and geospatial data is subtle, but this is perhaps the first organized instance of user-contributed map data.

Previously, maps authored with the help of naturalists or scientists may have shown species distribution, something we would call authoritative data, built from the top down by experts. In contrast, the Christmas Bird Count demonstrated a case of assertive data, contributed by many amateurs and built from the bottom up.

Number of user-submitted blue jay sightings, Christmas Bird Count, 2017 (data from Audubon Society)

Modern maps are increasingly, if not almost completely, a digital product. Paper maps such as the Soviet collections in Riga have long been authoritative sources, and by the 1990s their contents were being digitized into databases as part of a growing world of Geographic Information Systems (GIS). Companies such as Navteq began to scour the world for governmental and commercial maps that could be imported into these systems, while car navigation systems, mass-distributed paper maps, and the nascent web map all depended on these centralized sources of map data. GIS analysts and developers began building maps based on varying layers of data, classified as points, lines, or polygons. Editing, adding, or deleting map features involved keyboard strokes and mouse clicks rather than hand-drawn plots.
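As a rough illustration of the point, line, and polygon layers described above, the sketch below encodes one feature of each kind as a Python dictionary. It assumes the GeoJSON convention (coordinates as longitude, latitude), which this article does not name; the features, the coordinates, and the `count_by_geometry` helper are all hypothetical.

```python
# Hypothetical features in GeoJSON-style dictionaries (longitude, latitude).
airfield = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-112.0, 46.6]},
    "properties": {"kind": "airfield"},
}
river = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [[-112.1, 46.5], [-112.0, 46.55], [-111.9, 46.6]],
    },
    "properties": {"kind": "river"},
}
lake = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # One outer ring; the first and last positions are identical.
        "coordinates": [[
            [-112.0, 46.7], [-111.9, 46.7], [-111.9, 46.8],
            [-112.0, 46.8], [-112.0, 46.7],
        ]],
    },
    "properties": {"kind": "lake"},
}

# A "layer" here is simply a collection of features.
layer = {"type": "FeatureCollection", "features": [airfield, river, lake]}


def count_by_geometry(collection):
    """Count how many features of each geometry type a collection holds."""
    counts = {}
    for feature in collection["features"]:
        geometry_type = feature["geometry"]["type"]
        counts[geometry_type] = counts.get(geometry_type, 0) + 1
    return counts


print(count_by_geometry(layer))
```

Editing a feature in such a structure amounts to changing a coordinate list or a property value, which is the keyboard-and-mouse workflow the paragraph above contrasts with hand-drawn plots.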

In 2004, OpenStreetMap (OSM) was born at a UK university and similarly employed such assertive data. In the 21st century, we still refer to this data in multiple ways, some having a different connotation than others: crowdsourced data, collaborative data, user-created content, and more. These terms have become increasingly important as the proverbial map has left the hands of experts and become something that is more subjective, democratic, or even anarchic. Rules have been set to keep such a variety of data in check, whether in the semantic guidelines of OSM or standards set forth by the Open Geospatial Consortium.

In 2017, maps and geospatial data are still in a phase of transition, where the true relationship between authoritative and assertive data is in flux. Governments are still exploring their relationship with crowdsourced data, and many professionals remain skeptical. Additionally, the rise of machine learning and AI has meant that complex algorithms are now being applied to satellite imagery, street-level imagery, and other sources to generate more detailed and copious map data.

While Mapillary derives geospatial data from user-contributed photos, companies such as Google, DigitalGlobe, and Facebook are busy using machine learning to turn raster images from satellites into vector data such as building footprints and road networks. It thus appears that traditional mapping is being invaded not just by the common user, but also by the machines.

II. The Dark Side of the Map

The term volunteered geographic information (VGI) was coined by the renowned geographer Michael F. Goodchild, who described the emerging 21st-century concept of citizens acting as sensors, walking and breathing collectors of data. This phenomenon has empowered millions of users, but often without any promise of reward, without a guarantee of the quality of VGI, and without training.

This century, there has been an explosion of geotagged information available via the web, much of it shared on purpose, but perhaps more often without users realizing their information is geotagged. Twitter, Instagram, Facebook, and other social media services contain vast amounts of geotagged information, but their users aren’t exactly building a map of VGI as their primary intention. In other cases, VGI is the focus, including photo maps on Flickr and data collection for OpenStreetMap.

Map sources such as OpenStreetMap receive regular criticism concerning quality. Quality is generally said to be rooted in the accuracy of annotations (is this a residential road or a service road?), the precision of the geospatial position (is the fire hydrant on the north or south side of the road?), and the timeliness of the data (when was this observed, and is it still there today?). User-generated content, not just in the realm of maps, can have certain advantages regarding these measures of quality.
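The three measures above can be sketched as simple checks of a mapped feature against a reference observation. This is a minimal illustration rather than a standard quality metric: the record layout, the `quality_report` helper, and the sample values are assumptions, and the distance uses the haversine formula on a spherical Earth.

```python
import math
from datetime import date

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def quality_report(feature, reference, today):
    """Compare a mapped feature with a reference on label, position, and age."""
    return {
        # Accuracy of the annotation: do the two labels agree?
        "label_matches": feature["label"] == reference["label"],
        # Positional offset: how far apart are the two positions?
        "offset_m": haversine_m(
            feature["lat"], feature["lon"], reference["lat"], reference["lon"]
        ),
        # Timeliness: how old is the mapped observation?
        "age_days": (today - feature["observed"]).days,
    }


# Hypothetical records for one road segment.
mapped = {"label": "service_road", "lat": 46.0001, "lon": -112.0,
          "observed": date(2017, 1, 1)}
surveyed = {"label": "residential_road", "lat": 46.0, "lon": -112.0}

report = quality_report(mapped, surveyed, today=date(2017, 6, 1))
print(report)
```

Which thresholds count as "good enough" for each measure depends on the use case; the sketch only shows that all three can be computed from a position, a label, and a date.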

Considering Wikipedia against the more traditional encyclopedia, it’s likely that Wikipedia is more up to date with today’s news, and also has a more accurate entry on “Māori religion” with information sourced from locals in New Zealand, than the rather general entry in the encyclopedia at the public library in Albuquerque. In other cases, the quality control by experts may increase reliability, while the reputation of Encyclopedia Britannica may ensure that for academic or even legal purposes, its contents are considered authoritative where Wikipedia is not.

Authoritative often carries a connotation of being “better”, but there are countless examples of where authoritative data and information are incorrect despite the best intentions. The conflict between authoritative and assertive data elicits criticisms on both sides, because in the bigger picture neither is always correct.

In the case of geospatial data, purity is valuable. We may find that a point layer indicating locations of an endangered bird habitat in British Columbia is quite authoritative, having come from government-employed, university-educated biologists, and it can be considered a rather pure dataset, untainted by contributions from amateurs or from anonymous contributors whose credentials and methods are not known or documented.

User-generated content is “impure”, in a sense, because of the mysterious, undisciplined, and perhaps unreliable origins of the data. Even exact geographic coordinates, detailed descriptions of features, and extremely recent data may have objectively sound quality, but the lack of accountability for who collected this data and how is a barrier to its reputation.

Map of bird colonies on Vancouver Island (source: data.gov.bc.ca)

Another barrier can be called semantic heterogeneity. This stands in contrast to semantic homogeneity. In layman’s terms, the former signifies that one geographic feature may be classified in two different ways by two different users, without agreement; the latter is a uniform standard of classification. An example of this can be seen in mapping metro stations and lines, clarifying such things as whether the boarding area is a platform or a station (or a combination).

In OSM, tags are meant to provide a standard for classifying nodes and ways, but a variety of users means frequent disagreements. Professionals, however, may not be any more accurate, as standards often extend only across single organizations or partnerships. This can lead to confusion, as mappers must agree not only on what is present on the map and whether it is correctly labeled, but also on which labels to use.
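One way to picture semantic heterogeneity is as two mappers describing the same feature with different tags. The sketch below maps a few key/value pairs onto a shared vocabulary and checks whether two descriptions agree. The tag pairs follow common OSM usage for stations and platforms, but the `CANONICAL` table and the `classify` and `agree` helpers are hypothetical and far simpler than any real tagging guideline.

```python
# Hypothetical lookup from (key, value) tag pairs to a shared label.
CANONICAL = {
    ("railway", "station"): "station",
    ("public_transport", "station"): "station",
    ("railway", "platform"): "platform",
    ("public_transport", "platform"): "platform",
}


def classify(tags):
    """Reduce a feature's tags to a shared label, or 'unknown' if none apply."""
    labels = {CANONICAL[(k, v)] for k, v in tags.items() if (k, v) in CANONICAL}
    if not labels:
        return "unknown"
    # A feature tagged as both is reported as a combination.
    return "+".join(sorted(labels))


def agree(tags_a, tags_b):
    """True when two mappers' tags reduce to the same shared label."""
    return classify(tags_a) == classify(tags_b)


mapper_a = {"railway": "station", "name": "Central"}
mapper_b = {"public_transport": "station"}
mapper_c = {"railway": "platform"}

print(agree(mapper_a, mapper_b))  # different tags, same meaning
print(agree(mapper_a, mapper_c))  # genuinely different classification
```

The first comparison shows a disagreement in wording only, which a lookup can resolve; the second is a disagreement about what the feature is, which still needs a human decision.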

User-generated geospatial data, whether actively volunteered or passively collected by a platform, is still useful, even if not pure. Authoritative data such as our bird habitat dataset is carefully crafted but limited. Datasets such as these often have gaps and errors, whether this means an island with five bird colonies that aren’t mapped or the recent disappearance of one colony that isn’t reflected in the latest dataset. The community at large can fill these gaps.

Editing OpenStreetMap in Budapest, Hungary

In OpenStreetMap, the quality and quantity of data vary compared to official sources. In Kenya, for example, citizen efforts such as Map Kibera have produced dense and detailed OSM data in urban slums, surpassing the capability of government agencies. Meanwhile, government data in rural Kenya is more complete and higher quality than crowdsourced data. As a whole, developing countries face shortages of labor, expertise, technology, and resources to map at the scale of the local community. Map consumers in these countries are often less concerned whether the data is authoritative or assertive, and more concerned with whether or not the data even exists.

Having some data, while not professionally produced, is far more valuable than having no data in the view of organizations such as MapGive, CartONG, Humanitarian OpenStreetMap Team, and Missing Maps. Large local communities, paid and volunteer staff, as well as remote mappers using satellite imagery and street-level imagery as a reference, can effectively improve maps across the world. Many communities in developing countries are vulnerable, in the sense that the necessary data for emergency response and disaster relief isn’t available from official sources in the same way that it may be in Florida or Singapore.

III. Enter the Machine

While professional geospatial data can be improved and augmented with user-generated data, there remains a question of how scalable this approach will be in the future. The world is a big place; it extends beyond roads and cities, and often beyond the capacity we have as communities of professionals and enthusiasts. As humans, we also sometimes cannot map at scale or in great detail without the assistance of special tools.

Surveying methods rely on a variety of tools to gain precise measurements: Lidar can gather centimeter-level detail, while global positioning systems are increasingly efficient. Remote sensing based on spectral signatures or visible colors has allowed for the training of algorithms to classify satellite and orthorectified imagery. GIS software similarly empowers professionals to generate more precise and accurate data, whether in Esri’s ArcGIS suite or the open-source QGIS platform. These are examples of technology used as a tool to aid the human mind.

Remote Sensing: identifying crops over time from aerial imagery (source: USGS)

Some machines, however, may seem to have a mind of their own. While remote sensing has employed algorithms for many years to classify satellite and orthorectified imagery, it is only recently that artificial intelligence has exploded as a far more powerful tool than ever seen before. Machine learning now allows for increased precision and volume when processing satellite imagery, emerging not just as a tool but even as an alternative to human analysis. Not only can the type of crop in a field be classified, but multiple data sources can help a machine estimate risks and yields. Roads can be extracted and classified, building outlines traced, and more.

Machine learning has been an important tool in remote sensing for decades, as it has allowed minimal human input to generate geospatial data from imagery. These methods are becoming more relevant as more algorithms emerge for filling gaps, fixing errors, or generating map data that is otherwise beyond the capacity of limited teams and communities.

Companies such as Descartes Labs, Orbital Insight, and Planet Labs have recently emerged, demonstrating the latest evolution of machine learning and computer vision applied to satellite imagery. Moving away from aerial and satellite imagery, Microsoft and Google gathered some of the earliest and largest street-level imagery surveys and promptly began experimenting with data extraction on the ground level.

Using machine learning is more affordable than manual collection, and can often provide more consistency and control over what is classified and how. Data can be cleaned intelligently, but also treated differently depending on location. On the Mapillary platform, for example, traffic signs are recognized worldwide, with the algorithm knowing to look for different symbols in North America or Europe. Additionally, many features in both satellite and street-level imagery aren’t easily visible to the human eye but can be swiftly recognized by a machine.

A GIS professional may be quick to endorse machine learning as a time-saver, a reputable method of data classification and analysis, and something mathematically and scientifically precise. The machine is an extension of the professional’s ability; it is cold and calculated, and can maintain the purity of a dataset. The machine is not always correct, however, just like professionals and casual users.

Machine learning requires human verification and validation, a daunting task that, while perhaps less taxing than manual data collection for a professional, can also lack scalability. Managing 3,000 point locations of storm drains is already daunting for a GIS analyst, but when an algorithm classifies 50,000 street signs, far more pieces of data need quality control. Machine learning often means more data, but not necessarily better data. It also means more machine errors than human errors in absolute terms, even though the two may occur at similar rates.
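One common way to keep verification manageable is to review a random sample instead of every detection and to estimate the overall error rate from it. The sketch below assumes that approach, which this article does not prescribe: the sample size of 400, the 20 errors found, and both helper functions are hypothetical, and the interval is a simple normal approximation.

```python
import math
import random


def draw_sample(ids, k, seed=0):
    """Pick k distinct detection ids at random for manual review."""
    rng = random.Random(seed)  # fixed seed so the review list is reproducible
    return rng.sample(list(ids), k)


def estimate_error_rate(n_checked, n_wrong, z=1.96):
    """Error rate with an approximate 95% interval (normal approximation)."""
    p = n_wrong / n_checked
    margin = z * math.sqrt(p * (1 - p) / n_checked)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# 50,000 machine-classified street signs, as in the example above.
detections = range(50_000)
to_review = draw_sample(detections, 400)  # hypothetical review budget

# Suppose reviewers find 20 wrong classifications among the 400 checked.
rate, low, high = estimate_error_rate(n_checked=400, n_wrong=20)
print(f"estimated error rate: {rate:.1%} (about {low:.1%} to {high:.1%})")
print(f"projected wrong signs: {round(rate * len(detections))}")
```

A sample like this says how much to trust the dataset as a whole, but it does not say which individual signs are wrong, so it complements rather than replaces targeted review.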

IV. A New Alliance

Geospatial data is prone to errors and shortcomings regardless of its origin, but the professionals are still the gatekeepers of this data. While OpenStreetMap boasts a mass of global data from people with local knowledge, it is often treated with the same doubt in 2017 as high school teachers were casting upon Wikipedia in 2005. It can be used as a guide, but it is not necessarily authoritative like data from the US Geological Survey (USGS) or the Ordnance Survey (OS) in the UK. The method behind OpenStreetMap, however, still shows promise, as we discussed: it is complementary to authoritative data.

Edits made by TNMCorps volunteers (source: USGS)

The USGS launched the National Map Corps (TNMCorps) over a decade ago, and to this day accepts contributions from “citizen scientists” in building the National Map, along with other government map products. Meanwhile, HERE WeGo from HERE Maps is curated by a combination of professionals and a user community via the HERE Map Creator. In these cases, authoritative bodies, such as the federal government or an international corporation, are finding that user-generated data have the potential to make their maps better than before.

Meanwhile, machine learning is also benefiting from human input. Undoubtedly, you’ve encountered a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or reCAPTCHA image when logging into your user account somewhere on the web, and thus started contributing to human verification of text recognition algorithms. Additionally, an idea called Geo-reCAPTCHA has emerged to describe using micro-input from users to digitize buildings and other features from Earth observation imagery.

In OpenStreetMap, strong concern has arisen over the corporate use of machine learning to generate map data, as it can be seen as a threat to the value of human contribution. However, human verification and validation is a core tenet of OpenStreetMap, and machine-learning-based contributions to the map are scrutinized by users. A human mapper on the ground in Sri Lanka still holds the ultimate authority over a road classification, even if an algorithm has extracted it from satellite imagery and a Silicon Valley-based analyst has verified it.

Mapillary semantic segmentations, color coded

Street-level imagery offers a two-fold advantage in this realm. This imagery is photographic evidence, and combined with timestamps and geotags it becomes quite near to authoritative data. A casual user with a smartphone can prove that a government dataset has incorrectly positioned a train track on the map by giving an interactive, on-the-ground look at what’s there.

Also, machine learning capabilities can assist that user to become a producer of data, turning several hundred photos into a map of traffic signs across a city center. Sometimes this fills gaps in official data, adding new knowledge; other times it resolves conflicts or provides the opportunity to determine that human data is, in fact, more accurate in its classifications.

V. The Future of Maps

The future of maps is not a consistent vision. Some maps are global in scale, and yet their detail is fine-grained down to the neighborhood level. Other maps have no consideration for anything outside a small area of interest, like the bird habitat data in British Columbia. Many maps focus on networks, such as roads, bicycle paths, subways, and even sewage infrastructure. Others extend beyond these line-shaped networks, including soil maps, population estimates, or social media use, ranging across whole regions.

In each of these examples and many more, there is a place at the table for professionals, for citizens, and for machines. There are few cases where only one of these groups can act more effectively than a combination of at least two of them. In other cases, the completeness, accuracy, and precision of maps suffer from the exclusion of any one of them. Increasingly, it is the collaboration between these different groups that will produce the maps of the future.

Collaborative platforms aren’t always public, as many enterprise platforms bring together professionals and internally used algorithms, while others augment user-generated data with machine analysis. At the root of these platforms is geospatial data, and it must all be collected, classified, validated, and verified.

In the great modern assembly line of geospatial data production, we can all lend a hand to ensure that we all consume a better product. To realize the greater truth about the world and to document and index the world in the best maps that civilization has ever produced, the key will be collaboration. As Picasso mused, there is some higher truth to be understood from the gathering of many perspectives that may not be entirely complete, accurate, precise, or up to date on their own.

Whether you are a GIS professional, a computer vision scientist, a volunteer mapper, the owner of a mobile phone, or the operator of a vehicle, or even if you’re one of the machines themselves, you have a role to play. The only challenge will be to bring harmony to these many roles and to understand the superior results of their union.

/Chris