


ScraperWiki supports a number of third-party Python libraries that we recommend for screen scraping, data analysis and data visualisation.

If you would like us to add a library that isn't listed here, please get in touch.

Downloading

requests: Humane and Pythonic way of opening URLs. docs
urllib2, urlparse: Standard Python libraries for opening URLs. docs
gevent: Coroutine-based networking library that provides a synchronous API on top of an event loop. docs
mechanize: Navigate and complete HTML forms. docs
twill: A thin shell around mechanize, which you might find easier. docs
selenium: Automate operating real browsers like Firefox. NB: Only useful in ScraperWiki if you have a Selenium server to point it to. docs
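Before downloading anything, a scraper usually has to work out which URLs to fetch. The following is a minimal sketch of that step using only the standard library and no network access. It assumes Python 3, where the urlparse module listed above is available as urllib.parse; the example.com address and the page_url helper are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

# Hypothetical listing page, used only for illustration.
BASE_URL = "http://example.com/reports/index.html?page=2&sort=date"


def page_url(base, page):
    """Return `base` with its `page` query parameter set to `page`."""
    parts = urlparse(base)
    query = parse_qs(parts.query)      # {'page': ['2'], 'sort': ['date']}
    query["page"] = [str(page)]
    # doseq=True turns the one-item lists back into plain key=value pairs.
    return parts._replace(query=urlencode(query, doseq=True)).geturl()


parts = urlparse(BASE_URL)
next_page = page_url(BASE_URL, 3)

# Resolve a relative link found on the page against the page's own URL.
detail = urljoin(BASE_URL, "../files/report.pdf")

print(parts.netloc)
print(next_page)
print(detail)
```

The resulting URLs can then be passed to whichever downloading library you prefer.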

Parsing

XML (HTML, RSS, Atom...)

lxml: Highly effective HTML parser with a specialist screen scraping library. docs (also stackoverflow)
html5lib: An HTML 5 parser, which also copes with invalid documents the same way that major desktop web browsers do. docs
Beautiful Soup: An alternative popular HTML parser. docs
Beautiful Soup 4: Idiomatic ways of navigating, searching, and modifying the parse tree. Home page, quick start and tips on porting from Beautiful Soup 3
pyquery: Make jQuery-like queries from Python. docs
PyTidyLib: Calls out to the Tidy library, which cleans up bad HTML. docs
python-stdnum: Handle standardized numbers and codes, from VAT to books. docs
Universal Feed Parser: Parse RSS and Atom feeds in Python. docs

HTML-parser template languages

Scrapemark: Scraping library that uses templates to extract content. docs
scrapely: Given example web pages and data, constructs a parser for similar pages. docs
scrapy: An application framework for crawling web sites and extracting structured data. docs
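To give a feel for what an HTML parser does in a scraper, here is a small sketch that collects link targets from a page. It deliberately uses the standard library's html.parser (Python 3) rather than any of the libraries listed above, so it runs without extra installs; the listed parsers are generally more convenient and more tolerant of broken markup. The LinkCollector class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page fragment, used only for illustration.
SAMPLE_HTML = """
<html><body>
<p>Reports: <a href="/reports/2011.html">2011</a>
<a href="/reports/2012.html">2012</a> <a name="top">no href</a></p>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links
print(links)
```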

Other formats

demjson: Fancier JSON library than the built-in one. docs
xlrd, xlwt, xlutils: Read, write and process old Excel .xls files. docs
openpyxl: Read and write Excel .xlsx/.xlsm files. docs
csvkit: A library of utilities to manipulate CSV files. docs
PDFMiner: Python PDF parser and analyzer. docs
pyPdf: Manipulate PDF files. docs
iCalendar: Parse and generate calendar files. docs
RDFLib: Input and output linked data triples as RDF. docs
OpenStreetMap XML/PBF: Fast and easy reading of OpenStreetMap files with imposm.parser. docs

Geocoding

geopy: Converts addresses into latitude/longitude, and measures distances on the earth. docs
GeoIP: Convert IP addresses into countries and similar. docs
pyephem: Scientific-grade astronomy routines. docs
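As an illustration of what measuring distances on the earth involves, the sketch below computes a great-circle distance with the haversine formula. This is not geopy's implementation: it treats the earth as a sphere with an assumed mean radius of 6371 km, and the haversine_km function name is made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates for London and Paris.
london_paris = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(round(london_paris), "km")
```

A spherical model like this is only an approximation; use a geodesic calculation when accuracy matters.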

Data pipes

Python YQL: Yahoo Query Language, an expressive SQL-like language that lets you query, filter, and join data across Web services. docs
Google Data (GData): Access any service using the Google Data protocol. docs
pipe2py: Converts a Yahoo Pipe into Python so you can run it on ScraperWiki. docs
Fluidinfo, FOM (Fluid Object Mapper): Read or write to the Fluidinfo shared database. docs: fluidinfo, fom
Tweepy: Twitter API, to read or send tweets. docs
suds: Lightweight SOAP client. docs

Visualising and analysing

General analysis

NumPy: Large, multi-dimensional arrays, matrices and functions to operate on them. docs
SciPy: Uses NumPy to do advanced math, signal processing, optimization, statistics and much more. docs
RPy: Call out to GNU R, a popular statistical computing and graphics package. docs
pandas: An expressive framework for data analysis, serving the same purpose as R but within Python. docs
Data Science Toolkit (DSTK): A collection of the best open data sets and open-source tools for data science. docs
Jellyfish: Approximate and phonetic matching of strings. docs

Plotting

matplotlib: Easily generate lots of charts. docs, example (with ScraperWiki boilerplate)
pygooglechart: Generate charts using the Google Chart API. docs
gviz_api: Helper for making data sources for use with the Google Visualization API. docs

Natural language processing

Natural Language Toolkit (NLTK): Natural language processing and text analytics. docs
Gensim: "Topic Modelling for Humans". docs

Network analysis

pydot: Layout networks of nodes (graphs) using Graphviz. docs
NetworkX: Analyse and draw complex networks (maths graphs). docs
igraph: Create and manipulate undirected and directed graphs. docs
Python Levenshtein: Compute string distances and similarities. docs
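For readers unfamiliar with string distances, the sketch below shows the classic dynamic-programming calculation of Levenshtein (edit) distance in plain Python. It is meant only to show what the measure is; it is not the Python Levenshtein library's API, and that library will be much faster on real data.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Small distances relative to string length suggest likely matches, which is useful when reconciling scraped names that were spelled inconsistently.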

Other