

Hello Kitty

Quality assurance in OSM is a pretty hot topic. The first question when faced with the concept of a map that anyone can edit is: how can one guarantee the quality of the data?

In a constantly changing world, the fact is that it's impossible to guarantee that any map is accurate unless it was physically surveyed recently. The reason OpenStreetMap has such high-quality map data for many parts of the world is precisely that the project lets anyone update the map so easily. To ensure quality, we simply need more users of the map in each location, and to empower them to update anything that does not match reality.

Recent incidents

For this to work effectively, it's clear that we need to make it as simple as possible for map users to spot mistakes and make an update. Here are some interesting data incidents that the community caught recently:

This is just a small list of examples that the data team at Mapbox stumbles into while investigating issues reported in our map feedback system. We have been using an amazing tool by willemarcel called osmcha, which acts as a handy dashboard for reviewing and investigating changesets. By looking for words like reverted or vandalism in the comments, it's possible to identify changesets that corrected a previous mistake.
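The keyword idea above is easy to sketch. Here is a minimal example, assuming the changeset metadata has already been fetched from the OSM API as XML (changeset comments live in a `tag` element with `k="comment"`); the function name and keyword list are our own, not part of osmcha:

```python
import xml.etree.ElementTree as ET

# Words that often indicate a changeset fixing an earlier mistake.
SUSPECT_KEYWORDS = ("revert", "vandalism")

def flag_corrective_changesets(changesets_xml):
    """Return ids of changesets whose comment mentions a revert or vandalism."""
    flagged = []
    root = ET.fromstring(changesets_xml)
    for changeset in root.iter("changeset"):
        comment = ""
        for tag in changeset.iter("tag"):
            if tag.get("k") == "comment":
                comment = tag.get("v", "")
        if any(word in comment.lower() for word in SUSPECT_KEYWORDS):
            flagged.append(changeset.get("id"))
    return flagged

# A small hand-written sample in the OSM API's XML layout.
sample = """<osm>
  <changeset id="101"><tag k="comment" v="Added sidewalks"/></changeset>
  <changeset id="102"><tag k="comment" v="Reverted accidental deletion"/></changeset>
</osm>"""
print(flag_corrective_changesets(sample))  # ['102']
```

A real workflow would page through the live changeset feed rather than a static string, but the filtering logic stays the same.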

The interesting observation here is that many of these issues were accidental changes by the mapper that could have been easily avoided with a bit more care, or with more sanity checks in the editing software. Most of the incidents have been fixed, usually by alerting the mapper and undoing their changes.

Something to be concerned about is the time it takes for an issue to be detected, which ranges up to a month for something as major as the deletion of the label for a major city like Buffalo. Once a mistake is spotted, though, action and recovery are swift, usually taking only a few minutes. This raises two interesting questions: why does it take so long for issues to be identified, and what happens to issues that are invisible in the map style?

Identifying data issues

While many mistakes are easy to spot visually in the map style, especially those involving large features like cities, forests, lakes or motorways, there are probably thousands of issues on smaller features, involving inconsistent access tags, broken geometries or broken relations, that would never be spotted or fixed unless an expert mapper stumbled on them. And there will be numerous cases of incorrect data being added to the map intentionally that will go unnoticed until a local mapper finds it.
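Broken geometries are a good example of issues that no map style will ever render visibly, but that simple automated checks can surface. A minimal sketch, assuming ways are represented as ordered lists of node ids (the function name and rules are hypothetical, just to illustrate the kind of check these tools run):

```python
def find_degenerate_ways(ways):
    """Flag ways that are too short to form a line, or that repeat a node consecutively."""
    problems = []
    for way_id, node_ids in ways.items():
        if len(node_ids) < 2:
            problems.append((way_id, "fewer than two nodes"))
            continue
        for a, b in zip(node_ids, node_ids[1:]):
            if a == b:
                problems.append((way_id, "consecutive duplicate node"))
                break
    return problems

ways = {
    "w1": [1, 2, 3],  # fine
    "w2": [4],        # degenerate: a single node
    "w3": [5, 5, 6],  # duplicate consecutive node
}
print(find_degenerate_ways(ways))
# [('w2', 'fewer than two nodes'), ('w3', 'consecutive duplicate node')]
```

Real validators apply dozens of rules like this, covering tags and relations as well as geometry, but each rule is typically this small and mechanical.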



Find what's wrong in the map universe with Osmose

This is where quality assurance tools for OSM data come into play. Some of the more popular ones are Osmose, Maproulette, Improveosm, OSM Inspector and keepright. These tools have greatly helped bring a sense of quality monitoring to the map, and they could become a lot more powerful as part of an integrated validation environment.

This was one of the reasons that prompted us to create osmlint, which makes it possible to run a global analysis of OSM data in a few minutes. Want to know how many buildings in the world are shaped like cats? Give it a try ;)

Identifying suspicious behaviour

Another aspect to consider while identifying data issues is mapper behaviour. Currently it's common for mappers to check the edit count of a contributor to gauge how experienced or trusted that user is. Pascal Neis has done some amazing research on user behaviour, and his popular tool HDYC provides a detailed contribution profile for any mapper.

An open question is whether aspects such as user reputation and experience should play a part in validation workflows, and to what extent they can be relied upon to catch problematic behaviour like spamming and vandalism.
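To make the question concrete, here is a deliberately naive sketch of what an edit-count-based trust score might look like. Everything here is hypothetical (the function, weights and cap are invented for illustration); a real reputation system would need far more signals, such as reverts, blocks and local knowledge:

```python
from datetime import date

def naive_trust_score(edit_count, account_created, today=date(2016, 1, 1)):
    """Crude heuristic: capped edit volume plus years of account age."""
    years_active = (today - account_created).days / 365.0
    # Cap the edit contribution so a burst of automated edits cannot dominate.
    edit_component = min(edit_count, 10000) / 1000.0
    return round(edit_component + years_active, 2)

# A mapper with 2,500 edits over roughly three and a half years.
print(naive_trust_score(2500, date(2012, 6, 1)))
```

Even this toy version shows why edit counts alone are a weak signal: an imported-data account can rack up edits quickly, while a careful local mapper with few edits scores low.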

The future of validation

Now is a great time to get involved in the OSM data validation puzzle, with the growing number of people using the map outside the OSM website through services like Strava and GitHub. There have also been some major technology leaps that allow us to process more data much faster than before, opening up new technical approaches for solving the puzzle.

One of the big questions for our team at Mapbox is: how can we make it as simple as possible for anyone who uses the map to validate it? Reaching this goal would greatly enhance the accuracy of the map data, keeping it as close to ground reality as practical.

Have you thought of OSM validation before? What other big questions should we all be thinking about?



Smile, you are on OSM