Here is a guide for each file format from the Open Data Handbook, along with a suggested Python library for working with it.

JSON is a simple file format that is very easy for any programming language to read. Its simplicity means that it is generally easier for computers to process than other formats, such as XML. Working with JSON in Python is almost the same as working with a Python dictionary. You will need the json library, which is preinstalled with every Python version since 2.6.

import json
json_data = open("file root")
data = json.load(json_data)

Then data["key"] returns the value stored under that key in the JSON.
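As a minimal end-to-end sketch, here is the same idea with a small, made-up JSON document (parsed from a string with json.loads, the string counterpart of json.load):

```python
import json

# A small, made-up JSON document for illustration
raw = '{"city": "Athens", "population": 664046, "tags": ["capital", "port"]}'

data = json.loads(raw)      # json.load does the same for a file object
print(data["city"])         # access by key, like a dictionary
print(data["tags"][0])      # nested values work the same way
```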

XML is a widely used format for data exchange, because it preserves the structure of the data and the way files are built, and it allows developers to write parts of the documentation in with the data without interfering with its reading. This is pretty easy in Python as well. You will need the minidom module, which is also preinstalled.

from xml.dom import minidom
xmldoc = minidom.parse("file root")
itemlist = xmldoc.getElementsByTagName("name")

This returns a list of all the elements with the "name" tag.
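To get at the text inside those elements, each item exposes its children; a minimal sketch with a made-up XML document (parsed from a string with minidom.parseString, the string counterpart of minidom.parse):

```python
from xml.dom import minidom

# A small, made-up XML document for illustration
doc = minidom.parseString(
    "<people><name>Ada</name><name>Alan</name></people>")

itemlist = doc.getElementsByTagName("name")
for item in itemlist:
    print(item.firstChild.data)  # the text node inside each <name> tag
```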

RDF is a W3C-recommended format that makes it possible to represent data in a form that makes it easier to combine data from multiple sources. RDF data can be stored in XML and JSON, among other serializations. RDF encourages the use of URLs as identifiers, which provides a convenient way to directly interconnect existing open data initiatives on the Web. RDF is still not widespread, but it has been a trend among Open Government initiatives, including the British and Spanish Government Linked Open Data projects. The inventor of the Web, Tim Berners-Lee, has recently proposed a five-star scheme that includes linked RDF data as a goal to be sought for open data initiatives. I use rdflib for this file format. Here is an example:

from rdflib.graph import Graph

g = Graph()
g.parse("<file root>", format="<format>")
for stmt in g:
    print(stmt)

In RDF you can also run queries and return only the data you want, but this isn't as easy as parsing it. You can find a tutorial in the rdflib documentation.

Spreadsheets. Many authorities have information lying in spreadsheet documents, for example Microsoft Excel. This data can often be used immediately, given correct descriptions of what the different columns mean. However, in some cases there can be macros and formulas in spreadsheets, which may be somewhat more cumbersome to handle. It is therefore advisable to document such calculations next to the spreadsheet, since that is generally more accessible for users to read. I prefer to use a tool like xls2csv and then work with the output as a CSV file. But if for any reason you want to work with an .xls file directly, www.python-excel.org is the best source I have found. The most popular libraries are xlrd and xlwt. There is also another library, openpyxl, for working with .xlsx files.

Comma Separated Values (CSV) files can be a very useful format, because the format is compact and thus suitable for transferring large sets of data with the same structure. However, the format is so spartan that data is often useless without documentation, since it can be almost impossible to guess the significance of the different columns. It is therefore particularly important for comma-separated formats that the documentation of the individual fields is accurate. Furthermore, it is essential that the structure of the file is respected: a single omission of a field may disturb the reading of all remaining data in the file without any real opportunity to rectify it, because it cannot be determined how the remaining data should be interpreted. You can use the csv Python library. Here is an example:

import csv

with open('eggs.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        print(', '.join(row))
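Since the meaning of each column matters so much with CSV, a header row plus csv.DictReader lets you address fields by name instead of by position; a minimal sketch with made-up data (io.StringIO stands in for a real file):

```python
import csv
import io

# Made-up CSV data with a header row naming each column
raw = "name,population\nAthens,664046\nPatras,213984\n"

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
for row in rows:
    print(row["name"], row["population"])  # fields accessed by header name
```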

Plain text files (txt) are very easy for computers to read. However, they generally exclude structural metadata from inside the document, meaning that developers will need to create a parser that can interpret each document as it appears. Some problems can be caused by switching plain text files between operating systems: MS Windows, Mac OS X and other Unix variants each have their own way of telling the computer that it has reached the end of a line. You can load the txt file, but how you use it after that depends on the data format.

text_file = open("<file root>", "r")
lines = text_file.read()

This example returns the whole text file as a single string.
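The line-ending differences mentioned above ("\r\n" on Windows, "\n" on Unix, "\r" on classic Mac OS) can be handled with str.splitlines(), which recognizes all three conventions; a minimal sketch with a made-up string:

```python
# Made-up text mixing Windows (\r\n), Unix (\n) and old Mac (\r) endings
raw = "first line\r\nsecond line\nthird line\r"

# splitlines() splits on any of the three conventions
for line in raw.splitlines():
    print(line)

# Note: when reading a file with open(path, "r") in Python 3, universal
# newlines translate any of these endings to "\n" automatically.
```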

PDF. Here is the biggest problem in open data file formats. Many datasets have their data in PDF, and unfortunately it isn't easy to read and then edit them: PDF is really presentation oriented, not content oriented. But you can use PDFMiner to work with it. I won't include any example here, since it isn't a trivial one, but you can find anything you want in its documentation.

HTML. Nowadays much data is available in HTML format on various sites. This may well be sufficient if the data is very stable and limited in scope. In some cases it could be preferable to have the data in a form that is easier to download and manipulate, but as it is cheap and easy to refer to a page on a website, it might be a good starting point for the display of data. Typically, it is most appropriate to use tables in HTML documents to hold data, and then it is important that the various data fields are displayed and given IDs which make it easy to find and manipulate the data. Yahoo has developed a tool, YQL, that can extract structured information from a website, and such tools can do much more with the data if it is carefully tagged. I have often used a Python library called Beautiful Soup in my projects.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_file)
soup.title
soup.title.name
soup.title.string
soup.title.parent.name
soup.p
soup.p['class']
soup.a
soup.find_all('a')
soup.find(id="link3")

Those are only a few of the things you can do with this library. Calling a tag returns its content. You can find more in its documentation.

Scanned images. Yes, it is true. This is probably the least suitable form for most data, but both TIFF and JPEG-2000 can at least be marked with documentation of what is in the picture, right up to marking up an image of a document with the full text content of the document. If the images are clean, containing only text and without any noise, you can use a library called pytesser. You will need the Python Imaging Library (PIL) to use it. Here is an example:

from pytesser import *

image = Image.open('fnord.tif')  # open an image object using PIL
print(image_to_string(image))

Proprietary formats. Last but not least, some dedicated systems have their own data formats that they can save or export data in. It can sometimes be enough to expose data in such a format, especially if it is expected that further use will be in a similar system to the one it came from. Where further information on these proprietary formats can be found should always be indicated, for example by providing a link to the supplier's website. Generally, it is recommended to publish data in non-proprietary formats where feasible. I suggest googling whether there is a library specific to the dataset's format.

Tab Separated Values (TSV). A tab-separated values file is a simple text format for storing data in a tabular structure (for example, database or spreadsheet data). Each record in the table is one line of the text file, and each field value of a record is separated from the next by a tab character; it is a form of the more general delimiter-separated values format. Unfortunately, I haven't found any good working Python library just for TSV. Until now, I have worked with the csv library, as in the following example:

import csv

with open("tab-separated-values") as tsv:
    for line in csv.reader(tsv, dialect="excel-tab"):  # you can also use delimiter="\t"
        print(line)
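The same csv module can also write TSV via the excel-tab dialect; a minimal sketch with made-up rows (io.StringIO stands in for a real output file):

```python
import csv
import io

# Made-up rows for illustration
rows = [["name", "population"], ["Athens", "664046"]]

out = io.StringIO()                              # a real file works the same
writer = csv.writer(out, dialect="excel-tab")    # tab delimiter, \r\n line endings
writer.writerows(rows)
print(out.getvalue())
```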

Shapefiles are files used to represent spatial data, such as the polygons that define a city, a neighborhood, etc. You can use the libraries fiona and shapely (pip install fiona shapely) to help with this job. For example, if you want to load a shapefile, simplify its polygons (to reduce size) and then export to GeoJSON (so you can plot it in your web browser using JavaScript libraries such as Leaflet), you can use this code:

import json
import fiona
import shapely.geometry

features = []
with fiona.open('my_shapefile.shp') as shapefile:
    for record in shapefile:  # each record has a 'geometry' and 'properties'
        shape = shapely.geometry.shape(record['geometry'])
        simplified = shape.simplify(0.01)  # 0.01 is the simplification tolerance
        features.append({'type': 'Feature',
                         'geometry': shapely.geometry.mapping(simplified),
                         'properties': dict(record['properties'])})

geojson = {'type': 'FeatureCollection', 'features': features}
with open('my_geojson.json', 'w') as fobj:
    fobj.write(json.dumps(geojson))

VOTable is an XML-based format used most of the time in astronomy. This kind of data contains metadata that is vital for you. It is pretty simple to extract it with Python using the astropy library:

from astropy.io.votable import parse

votable = parse("votable.xml")

Additional information. Maybe you will find the Pandas library useful; its I/O capabilities integrate and unify access to most of these formats: CSV, Excel, HDF, SQL, JSON, HTML, and Pickle.
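As a minimal sketch of that unified interface, here is pandas reading made-up CSV data into a DataFrame (io.StringIO stands in for a real file; read_json, read_excel, read_html, etc. follow the same pattern):

```python
import io

import pandas as pd

# Made-up CSV data for illustration
raw = "name,population\nAthens,664046\nPatras,213984\n"

df = pd.read_csv(io.StringIO(raw))
print(df["population"].sum())  # columns come back typed, so arithmetic just works
```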