Various studies have focused on the complexities of publishing and using (open) data. A number of lessons can be learned from the experiences of (governmental) data providers, policy-makers, data users, entrepreneurs, competitors and researchers.

Data can be provided by the government, crawled from the web, or generated by sensors. Here are 50 lessons learned in the form of tips and guidelines on creating and using high-quality open datasets.

Publishing data

Organizational structure

1. Involve all appropriate stakeholders at an early stage. This secures support for the data initiative from actors on both the supply and the demand side.

2. Clearly explain the reasons for the data initiative and mission statement to everyone involved.

3. Organize roundtables to shape the data initiative, develop a business case, and find out which already-available data can be restructured and added to the dataset.

4. Release high-value and high-impact data first. Count data requests to see what data are popular. Conduct surveys to rank priorities and interests from the public.

5. Take away concerns about source trustworthiness, data provenance and the legal aspects of re-use. Publish under a trusted username or on a trusted platform. Provide a link to the data maintainer and/or webmaster.

6. Discuss and describe license and any other legal aspects (non-disclosure) clearly and up front. A free and open license for a dataset could be CC0. That way a user is free to do what she wants with the data. Present the conditions and terms of usage. Discuss inside the institution whether the restrictions are in balance with the goals of the data sharing initiative.

7. Grant the data sharing initiative the time and resources required to complete and evaluate it. Make sure there is enough time to get the details and user adoption right. Find, employ or educate staff to work with the data.

8. Set up a data sharing protocol inside your organization, covering everything from analyzing to processing to publishing the data.

9. Create a user feedback loop. Be open to and patient with user feedback. Use it to iterate on and improve the quality of the future datasets.

Data quality

10. Ensure data quality. Remove duplicates. Remove empty or broken records. Check the dataset for strange outliers caused by (measurement) artifacts. Check the dataset for completeness and statistical significance (sample size).
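The basic cleaning steps above can be sketched with the Python standard library. This is a minimal sketch: the column names, the example data, and the median/MAD outlier rule are illustrative choices, not prescriptions.

```python
import csv
import io
import statistics

def clean(rows, value_field):
    """Drop exact-duplicate and empty records, then flag outliers
    with a robust median/MAD rule (stable even on small samples)."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:                                # duplicate record
            continue
        seen.add(key)
        if not any(v.strip() for v in row.values()):   # empty record
            continue
        cleaned.append(row)

    values = [float(r[value_field]) for r in cleaned]
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    outliers = [r for r in cleaned
                if mad and abs(float(r[value_field]) - med) > 5 * mad]
    return cleaned, outliers

raw = io.StringIO("station,temp\nA,10.2\nA,10.2\nB,9.8\n,\nC,999.0\nD,10.5\n")
cleaned, outliers = clean(list(csv.DictReader(raw)), "temp")
```

The median/MAD rule is used here instead of mean/standard deviation because a single extreme value drags the mean and standard deviation along with it, masking the very outlier you want to catch.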

11. Set up a quality data distribution platform and monitor its performance. Restore broken links. Set up descriptive, accessible and valid HTML pages to describe your datasets and provide domain knowledge. Focus on usability and user experience: do not make your users think about performing basic actions.

12. Ensure easy-to-understand content and formatting. Attach metadata to state descriptive and legal information, coverage, measurement equipment, timeliness and reliability.
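As a sketch, such metadata could be captured in a small JSON record published alongside the dataset. The field names and values below are illustrative assumptions (loosely inspired by conventions such as Frictionless Data's datapackage.json), not a formal standard.

```python
import json

# Illustrative metadata record for a hypothetical dataset.
metadata = {
    "title": "City air quality measurements",
    "description": "Hourly PM2.5 readings from street-level sensors.",
    "license": "CC0-1.0",
    "coverage": {"spatial": "Amsterdam", "temporal": "2014-01/2014-12"},
    "measurement": {"equipment": "optical particle counter", "unit": "ug/m3"},
    "updated": "2015-01-15",
    "contact": "data-maintainer@example.org",
    "fields": [
        {"name": "timestamp", "type": "datetime"},
        {"name": "station_id", "type": "string"},
        {"name": "pm25", "type": "number"},
    ],
}

print(json.dumps(metadata, indent=2))
```

Even a simple record like this answers most first questions a re-user has: what is in each column, who to contact, under which license, and how fresh the data is.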

13. Establish compatibility and interoperability of systems. Do not use non-standard, closed-source or single-OS data formats. .CSV files are a very popular format among data users. Allow users to give feedback on data quality and analyze that feedback. Put data in context.

14. Set up version control. Keep both access to/a history of the raw data and the processed data. Know where to find the files to delete if required by procedure or compliance. Keep notes on the canonical source file and any releases.

15. Hash the data and encrypt during transfer. Create a hash from the files so anyone can verify their contents. Transfer datasets over a secure connection.
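Producing a verifiable checksum is straightforward with Python's hashlib. The streaming sketch below also handles datasets too large to fit in memory; publish the resulting hex digest next to the download link so users can verify their copy.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 16):
    """Stream a file through SHA-256 so large datasets
    do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A user who downloads the dataset recomputes `file_sha256("dataset.csv")` and compares it against the published digest; any mismatch means corruption or tampering.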

16. Check the dataset for privacy-sensitive identifiers. Anonymize or pseudonymize these identifiers by removing or substituting them.
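One common substitution technique is a keyed hash (HMAC); the sketch below is a hypothetical illustration, with an invented field name. Note that a plain unkeyed hash is not enough for small identifier spaces, since an attacker can hash all possible inputs and reverse the mapping by brute force.

```python
import hmac
import hashlib

# Illustrative secret; in practice this key must never be published
# with the data, or the pseudonyms can be reversed.
SECRET_KEY = b"keep-this-out-of-the-published-data"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash. The same input
    always maps to the same token, so records can still be linked,
    but without the key the original value cannot be recovered."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

record = {"citizen_id": "123456789", "age": 42}
record["citizen_id"] = pseudonymize(record["citizen_id"])
```

The deterministic mapping is the point of pseudonymization over plain removal: records about the same person can still be joined across tables, without exposing who that person is.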

17. Check the dataset for commercially sensitive information. Some data may contain information which may lead to a competitive advantage (profit numbers, sales numbers, inventory).

User support and communication

18. Plan appropriate outreach campaigns and support conversations around data:

Set up social media accounts and a blog.

Showcase success stories or interesting technical challenges.

Create wikis and tutorials.

Post on data science community websites, such as /r/datasets and DataTau.

Monitor Twitter and search engine alerts to find out if your data set is mentioned somewhere.

Organize a machine learning competition (on Kaggle).

Offer (financial) incentives to work with the data.

Produce interesting visualizations and infographics.

Contact schools and universities for educational use.

Team up with a data journalist.

19. Facilitate interaction, error handling and user feedback through formal processes, coordination mechanisms, and by dedicated employees.

20. Implement a contact form, forum, error reports, confidentiality concerns reporting, responsible disclosure. Monitor and log user actions and save error reports. Ask permission to survey your users.

21. Organize events, such as app and machine learning competitions, hackathons and workshops. Also visit the ones organized by other data providers.

22. Develop user skills. Organize boot camps, master classes and e-learning courses. Provide free mentoring and advice. Create clear documentation, tutorials, case studies, how-tos and FAQs.

23. Translate domain knowledge to comprehensible terms. Use domain experts to engage users on forums or mailing lists to get them up to speed.

24. Provide additional tools to work with the data. Plug-ins, cloud computing infrastructure, software libraries, converting and munging scripts.

Sustainability

25. Support the building of a community of data users, like journalists, civic hackers, non-profits, citizens and academics. Organize regular community meet-ups.

26. Create a market place for ideas, data services and people looking to team up. This could be a website, a mailing list or a forum.

27. Support deployment of newly developed services, implementations and usage.

28. Fund well-developed apps to develop scalable models. Help get competition winners and expert data users seed grants, jobs, mentoring and advice.

29. Integrate data-driven content and services into organizations and (government) operations. For example: Code Fellowships hosting civic coders in government agencies, media and civil society. Employ data engineer evangelists.

Avoiding adoption barriers

There are numerous institutional barriers that keep organizations from publishing datasets. Managers and policy-makers can help clear such barriers once they are identified inside the organization.

30. Institutional barriers:

Emphasis on barriers and neglect of opportunities

Unclear trade-off between public values (transparency vs. privacy values)

Risk-averse culture (no entrepreneurship)

No uniform policy for publicizing data

Making public only non-value-adding data

No resources and budget with which to publicize data

Fostering organizations’ interests at the expense of citizen interests

No process for dealing with user input

Debatable quality of user input

31. Task complexity barriers:

Lack of ability to discover appropriate data or data with potential.

No access to the original data (only processed data)

No information about the context, relevancy and quality of the data.

Duplication of data, data available in many formats, debate over the source of data.

32. Use and participation barriers:

There are no incentives for the users and the organization does not react to user input.

Frustration with the data sharing initiative. Little or zero time to worry about the details.

Unexpected escalated costs. Legal and privacy concerns.

Lack of knowledge or interest to make sense of the data.

Use Case: Publishing Open Data

Open Data & the Government

Governments have been gathering data for their own use for decades. This includes interesting data on geographical and meteorological matters, as well as environmental pollution, crime and law enforcement.

Traditionally this data was only accessible to government experts. The full potential of this data as a catalyst for app development, democratic transparency, innovation and research was not realized.

Open data, according to the Open Knowledge Foundation, is data that can be freely used, shared and built on by anyone, anywhere, for any purpose.

Some researchers and policy makers add another requirement: The data needs to be structured (machine-readable). Questionnaire data stored away in non-selectable .PDF documents or big data without an API is not really open data under that requirement. Researchers working with such tedious data sets need to perform heavy pre-processing or manual labour to “free” the data.

Open can apply to information from any source and about any topic. Anyone can release their data under an open licence for free use by and benefit to the public. Although we may think mostly about government and public sector bodies releasing public information such as budgets or maps, or researchers sharing their results data and publications, any organisation can open information (corporations, universities, NGOs, startups, charities, community groups and individuals). — Open Knowledge Foundation

In 2011 European Commissioner Neelie Kroes said that “Data is the new gold”. Expectations were high. The Open Knowledge Foundation started ranking countries on data availability.

The reality, however, included some snags. Organizations needed a change of mindset. Decision trees needed to be implemented to aid faster data sharing. And organizations faced privacy concerns.

Research and Documentation Centre (WODC)

The WODC is a criminal justice knowledge centre in the Netherlands. It aims to make a professional contribution to the development and evaluation of justice policy set by the Ministry of Security and Justice.

Their Statistical Data and Policy Analysis division provides policy information to ministries, the police, the Public Prosecution Service, the media and academic researchers.

They further:

collect, maintain, integrate, and query judicial data sources,

produce crime statistics,

monitor development and measure performance within the Dutch Justice chains,

produce forecasts of the capacity demand of the Dutch Justice chains,

write a statistical yearbook called “Crime and law enforcement”,

conduct research on topics such as e-government and cyber crime.

Their problems

They felt the demand and drive for opening data was high. However, for them, opening data had both benefits and drawbacks.

They knew there was a risk of a privacy breach even when data is anonymized or aggregated (see the AOL search-log leak). Even with properly anonymized data, individuals may be identified by combining different data sources. This creates a conflict between replicability/completeness and trust/security.

Their solutions:

A data sharing protocol

The WODC now offers three kinds of access: open access, where they publish the data online; restricted access, where privacy-sensitive data is given to selected scientific organizations; and demand-driven access, where highly aggregated data is sent after receiving a WOB request (similar to a Freedom of Information request).

All requests for data are monitored and audited. They’ve established a strict set of procedures for data sharing to ensure privacy is maintained. They are in compliance with standards and security policies.

A data sharing procedure

Analysis: determine the type and content of the data, the purpose of the data publication, and any restrictions. Preparation: retrieve, process, and (pseudo-)anonymize the data. Publication: transfer the data and establish conditions and rules for data access and reuse.

Methods for privacy and security

Only share personal or privacy-sensitive data with trusted scientists, only when strictly necessary. For all other purposes, delete any attributes which may lead to a disclosure of identity.

Avoid publishing (statistical) data based on a small sample size. Data are shared at the highest level of aggregation possible: statistical and aggregate data are preferred over data on individuals. Use strong encryption while transferring data. Whitelist IPs for external data access. Erase the data after a predefined time.
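The small-sample rule can be enforced mechanically by suppressing small cells before publication. The sketch below illustrates the idea; the threshold of 10 and the region names are illustrative choices, not the WODC's actual parameters.

```python
def suppress_small_cells(counts, threshold=10):
    """Replace counts below the threshold with None so that rare
    (and therefore potentially identifying) groups are not published."""
    return {group: (n if n >= threshold else None)
            for group, n in counts.items()}

table = {"region A": 1520, "region B": 34, "region C": 3}
published = suppress_small_cells(table)
```

Suppressing the cell entirely, rather than rounding it, avoids publishing any signal about a group small enough that its members could be re-identified.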

Using datasets

If you are a data publisher, act like a user following these guidelines, and identify and solve any issues that may appear.

Quality control

33. Measure data quality. How many researchers have looked at the data (“given enough eyeballs, all bugs are shallow”)? What is the skill level of the data publisher? What is the ease of independent verification?

34. Establish who the creator and/or maintainer of the dataset is. Perform a background check on the data source like you would when writing an investigative article. Establish motive for releasing the data (academic, compliance, group effort, commercial, leaking, propaganda etc.). Find and document provenance.

Streets mapped in OpenStreetMap vs. editor location in OpenStreetMap

35. Collect meta-data and measurement data. Are there any more (column header) descriptions for the dataset available? How was the data measured and gathered? What was the research structure?

36. Make sure you have permission to access the data. Even publicly available data crawled from the web may cause legal problems if you lack permission or break the Terms of Service (for example: no automated access). Check out the license and terms so you are at least aware of them.

37. Set up version control. Store the raw dataset and your pre-processed datasets. You may remove extraneous columns, stem text, or reduce the dataset while working on it. Keep notes on the original raw dataset and any processing you’ve done.

38. Check for quality issues like duplication, missing records and incompleteness. Check for (near) duplicates (using hashes or fast all-pairs similarity search) inside the datasets and between datasets. Remove, ignore, restore or fix (replacing with the mean) “NA” values.

39. Check for outliers, noise and malformed structure. Print out the min and max for value columns. Check statistical relevance, level of noise and coherence.
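The checks in tips 38 and 39 can be sketched with the Python standard library. The example rows, the row-hash deduplication and the mean imputation are illustrative choices; as the tip notes, removing or flagging missing values may suit your analysis better than imputing them.

```python
import hashlib
import statistics

rows = [
    {"id": "1", "value": "4.0"},
    {"id": "1", "value": "4.0"},   # exact duplicate
    {"id": "2", "value": "NA"},    # missing value
    {"id": "3", "value": "6.0"},
]

# Deduplicate via a hash of each row's canonical form; hashing
# scales to datasets too large to compare pairwise.
seen, unique = set(), []
for row in rows:
    h = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        unique.append(row)

# Replace "NA" values with the column mean (one of several options).
known = [float(r["value"]) for r in unique if r["value"] != "NA"]
mean = statistics.mean(known)
for r in unique:
    if r["value"] == "NA":
        r["value"] = str(mean)

# A quick min/max scan surfaces malformed or out-of-range values.
values = [float(r["value"]) for r in unique]
print("min:", min(values), "max:", max(values))
```

Printing the min and max per numeric column is a cheap first sanity check: a temperature column whose max reads 999 is a measurement artifact, not a heat wave.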

40. Get a feel for the data. Open the datasets and manually inspect a few records. Familiarize yourself with the toolkits for data scientists to get a better research vs. data munging balance (for example: Pandas + Python, RapidMiner, OpenRefine). Read more about the context and domain of the data and data publisher.

User feedback & Community

41. Rate data sets (and check other data users’ ratings). If the data distribution platform allows it: add user feedback through voting and/or leaving a comment. Learn from the other data users.

42. Report successful usage of the data to the data publishers. Did you create a nice app, research article or visualization with the data? Make this known to the data publishers. This rewards their efforts and in turn they may reward you with free marketing and expert insights into the data (or provide access to more data sources).

43. Team up. Work together with other data users to create an ensemble of insights and techniques. Be approachable. Let others know you are working with the data and that you are willing to join forces.

44. Post data munging code and/or conversions for all to use. If the dataset is in .xlsx then more than one user has to convert that to .csv. If the dataset contains duplicates then more than one user benefits from a deduplication script. Similarly for a script that automates API access. Post code and tools to work with the data and check out code posted by other data users.

45. Report (privacy) issues with the data. Is the dataset incomplete, contains many errors or duplicates, does data make it possible to identify individuals? Responsibly disclose privacy issues and constructively disclose technical issues.

46. Set up a data user forum or data wiki (or participate in an existing one). Building a community and/or market place around a data source will improve knowledge and cooperation. It is also a stronger (aggregated) voice for data publishers to act on the issues you could raise.

Enhancing the data and drawing conclusions

47. Combine datasets for better datasets. If the datasets allow this, join data by unique identifiers to create richer datasets. Beware of pitfalls such as duplicate IDs or relying too much on conversions. Some datasets may offer street addresses, others longitude and latitude. Linking such datasets can benefit from crowdsourcing due to their fuzzy and complex nature.
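A minimal sketch of such a join in plain Python, including a check for the duplicate-ID pitfall mentioned above (dataset and field names are illustrative):

```python
stations = [
    {"station_id": "S1", "city": "Utrecht"},
    {"station_id": "S2", "city": "Delft"},
]
readings = [
    {"station_id": "S1", "pm25": 12.1},
    {"station_id": "S2", "pm25": 8.4},
    {"station_id": "S9", "pm25": 5.0},   # no matching station
]

# Index one dataset by its key, failing loudly on duplicate IDs,
# since duplicates silently multiply rows in a naive join.
index = {}
for s in stations:
    if s["station_id"] in index:
        raise ValueError(f"duplicate ID: {s['station_id']}")
    index[s["station_id"]] = s

# Inner join: keep only readings with a matching station record.
joined = [{**r, **index[r["station_id"]]}
          for r in readings if r["station_id"] in index]
```

Note that this is an inner join, so the unmatched `S9` reading is dropped; counting how many rows fall out of the join is itself a useful data quality signal.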

48. Do not overfit or draw unjustified conclusions from small sample sizes. If the feature space is large and the sample size is small, there is an increased risk of overfitting on your dataset, and models won't generalize to other datasets. In a similar vein: do not draw conclusions or create data visualizations from small or untrustworthy data sources. Do your data journalism on correctly and significantly aggregated values.

49. Create attractive visualizations, reports and graphs. Popular visualizations require little domain expertise to convey information, conclusions and/or (business) intelligence. Find feature weights (which columns are most indicative). “Raw data is both an oxymoron and a bad idea”.

Pragmatic visualization is what we term the technical application of visualization techniques to analyze data. The goal of pragmatic visualization is to explore, analyze, or present information in a way that allows the user to thoroughly understand the data. — Robert Kosara

50. Create a data report for yourself or your superior(s). Present/communicate your results. For example: online, to management, in the data research community, to the data publishers.



The semantic web and its links as of 2011

Use Case: Using open data and healthcare data

Linking open energy data

Chris Davis is a postdoctoral researcher in Energy & Industry. He tries to link the many available data sources, with a focus on industrial ecology and open data.

According to him, energy and sustainability are among the most important topics of our century.

The problems

Researchers repeat a lot of work. Research is very data intensive. To get a clear picture one needs both aggregated and fine-grained data. Connecting all this data is tedious. The energy sector is only slowly embracing data sharing initiatives.

The solutions

Create a platform, enipedia.tudelft.nl, where researchers can cooperate and avoid duplicate work. Working with multiple editors/data users ensures that bugs are found earlier, that facts are double-checked, and that large tasks do not fall on the shoulders of a few individual researchers.

Enipedia’s technical infrastructure

Also start a debate among data publishers to clear up any social issues. The technology is already here, it’s the social issues that are holding us back. Have debates about data quality and perform research on what constitutes data quality.

Privacy-sensitive healthcare data

The Dutch have started programs to supply heroin addicts with methadone as part of a treatment and risk reduction program. Heroin addicts could register at local distribution points run by the Public Health Authority and would be provided small dosages of methadone (to reduce resale and misuse). The addict would be registered with the Public Health Authority and sometimes with their personal physician.

The problems

This data is so privacy sensitive that data sharing initiatives, even among government organizations, were shunned. So when a heroin addict partaking in a methadone program was arrested and taken to jail, the prison doctors had no timely access to this data. Many promising rehabilitation projects were cut short, forcing the addict to go cold turkey. This resulted in an increase in relapses and deaths by overdose shortly after release from prison.

The solutions

User groups asked the data publishers and maintainers for a temporary identifier with which to identify participants in these programs. Records are deleted, aggregated or anonymized once the participant has completed the program. A so-called anonymous chain ID prevents privacy issues while allowing maximal sharing: relevant parties have near on-demand access to the data.

An ID card with an expiration date was issued to the addict. Showing the ID at a distribution point allowed staff to check whether the person matched the photo on record. Showing the ID to the prison doctors served as identification and proof of participation in a program. Datasets and databases relevant to the program all referred to this new chain ID. This made the joining and evaluation of related data easier.

Further reading, Resources & References