This page is the home of the Dresden Web Tables Corpus, a collection of about 125 million data tables extracted from the Common Crawl.

About

The Common Crawl is a freely available web crawl created and maintained by the foundation of the same name. The July 2014 version, which served as the basis for our corpus, contains 3.6 billion web pages and is 266 TB in size. The data is hosted on Amazon S3 and can thus be processed easily using EC2. Data tables were recognized through a combination of simple pre-filtering heuristics and a trained classifier that distinguishes layout tables from various kinds of data tables. We included not only (pseudo-)relational tables, but also other kinds of data tables, such as the vertical-schema/single-entity tables that are common on the web. The features used are similar to those used in related work, e.g., Cafarella et al., "WebTables: Exploring the Power of Tables on the Web".

We discovered that the Common Crawl contains many physically identical pages under different logical URLs, as serving multiple URL variants for a single page is common practice on the web today. This led to many duplicate tables in the initially extracted data. While we originally extracted 174M tables from the Common Crawl, consistent with numbers reported in related work, only 125M tables remained after content-based deduplication.
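Content-based deduplication of this kind can be sketched as follows: hash each table's cell contents (ignoring the URL it was found under) and keep only the first table per hash. This is an illustrative sketch only; the exact normalization used for the corpus is not described here.

```python
import hashlib
import json

def content_key(rows):
    # Serialize the cell contents deterministically and hash them, so the
    # same table served under several URL variants maps to the same key.
    canonical = json.dumps(rows, ensure_ascii=False)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def deduplicate(tables):
    # Keep only the first occurrence of each distinct table content.
    seen = set()
    for rows in tables:
        key = content_key(rows)
        if key not in seen:
            seen.add(key)
            yield rows
```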

The final corpus includes only the extracted table data and the metadata described below, not the complete HTML pages from which the tables originated. Instead, we provide code that can automatically retrieve the full HTML text from the Common Crawl S3 bucket using the metadata bundled with the data. This reduces the corpus size to 70 GB of gzip-compressed data.
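Such retrieval works via an HTTP range request for the byte span of the stored crawl record; since Common Crawl records are gzip-compressed individually, the fetched span can be decompressed on its own. The sketch below assumes the archive URL and record offsets come from the bundled metadata; the parameter names here are illustrative, and the authoritative field names are those in the corpus schema.

```python
import gzip
import urllib.request

def range_header(offset, end_offset):
    # HTTP byte ranges are inclusive on both ends.
    return "bytes=%d-%d" % (offset, end_offset)

def fetch_record(archive_url, offset, end_offset):
    # Fetch one individually gzip-compressed record from a crawl archive
    # and decompress it. archive_url, offset, and end_offset are assumed
    # to be taken from the metadata bundled with each extracted table.
    req = urllib.request.Request(
        archive_url, headers={"Range": range_header(offset, end_offset)})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", errors="replace")
```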

News

25.02.2015: Published version 1.1.0 of the corpus and companion libraries. The new version is based on the July 2014 version of the Common Crawl. It contains more tables, fixes errors of the old version, and adds new attributes, such as table classification results that allow distinguishing relational and single-entity tables (see Schema for details).

Getting Started

The corpus consists of 500 individual files directly downloadable from the TU Dresden Database Technology Group's web server, with URLs of the form http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-XXX.json.gz, where XXX runs from 000 to 499.

To download the first file (dwtc-000.json.gz) as a sample, fetch it from the URL pattern above. To download the full dataset (or a subset of any size), you can use a shell command such as

for i in $(seq -w 0 499); do wget http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-$i.json.gz; done

The easiest way to work with the dataset is to use the provided Java library, documented on its GitHub repository page. We also provide a description of the corpus data format and schema below.

Schema

The corpus consists of a set of gzip-compressed text files. Each line contains one JSON document representing one extracted table together with its metadata. The easiest way to use these documents is the provided Java library, but you can also decompress the files, read them line by line, and parse each line with any JSON parser. We also provide a JSON schema.

Code

We provide both the code for the extractor, which is partly based on code published by the Web Data Commons project, and a companion library for working with the dataset. The library can be found at https://github.com/JulianEberius/dwtc-tools and the extractor at https://github.com/JulianEberius/dwtc-extractor.

Related Work

The Web Data Commons project recently published a very similar web table corpus, based on an older version of the Common Crawl, and using a different extraction method and storage format. A very good overview of the related work can be found on their project page.

Corpus Statistics

Table Statistics

                 entity        relation      matrix       other
Total number     77,666,916    58,674,016    1,973,354    7,219,536
Avg. #columns    2.47          5.79          7.50         8.15
Avg. #rows       8.96          17.16         17.69        15.31
Min. #columns    2             2             3            2
Min. #rows       2             2             2            2
Max. #columns    6,878         7,291         2,035        113,682
Max. #rows       46,743        28,891        9,030        15,023


Most common TLDs

TLD       Count
com       92,416,580
org       21,391,455
net       5,190,071
edu       4,377,877
co.uk     2,912,414
gov       2,836,297
de        1,949,211
es        853,369
ca        849,214
fr        799,457
com.au    767,434
ac.uk     556,531
info      500,836
it        442,881
eu        418,469
ru        393,374
com.br    390,334
pl        348,186
nl        336,709
tx.us     304,570

Most common domains

Domain                    Count
wikipedia.org             5,843,615
google.com                3,657,730
worldcat.org              2,005,895
godlikeproductions.com    1,724,842
flightaware.com           1,572,751
itjobswatch.co.uk         1,297,494
stackexchange.com         1,227,872
cricketarchive.com        1,154,101
e90post.com               1,047,063
hotels.com                972,114
m3post.com                919,125
go.com                    904,348
mixedmartialarts.com      755,110
wowprogress.com           747,538
sports-reference.com      724,744
baseball-reference.com    675,849
macrumors.com             668,493
nhl.com                   660,072
stackoverflow.com         643,068
weatherbase.com           629,457

Estimated Distinct Attributes

Attribute           Count
NULL                108,884,792
date                8,283,909
title               6,768,556
name                5,625,831
1                   4,309,750
description         4,285,050
2                   3,594,400
location            3,244,015
3                   3,122,464
type                3,015,000
views               2,974,465
5                   2,874,085
4                   2,838,993
publication date    2,715,011
rating              2,691,175
year                2,578,777
filing date         2,519,377
6                   2,387,802
author              2,356,401
7                   2,309,544

Citation

@inproceedings{Eberius:2015,
  Author = {Eberius, Julian and Thiele, Maik and Braunschweig, Katrin and Lehner, Wolfgang},
  Title  = {Top-k Entity Augmentation Using Consistent Set Covering},
  Series = {SSDBM '15},
  Year   = {2015},
  Doi    = {10.1145/2791347.2791353}
}

License

The corpus was initially created for and published in conjunction with the paper cited above.

The corpus data is provided under the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus. The code, which derives partly from code used by the Web Data Commons project, can be used under the terms of the same license, the Apache Software License.

Credits

The extraction of the Web Table Corpus was supported by an Amazon Web Services in Education Grant award.

Contact