Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources

Instructions

Building a baseline statistical phrase MT system: Wonderful pages on how to download some tools and data and put them together into a very competent baseline statistical MT system. See the NAACL 2006 WMT or 2009 WMT pages.

Freely downloadable

Moses: The most-used open-source phrase-based MT decoder. By Philipp Koehn and many others.
Phrasal: A Java phrase-based MT decoder, largely compatible with the core of Moses, with extra functionality for defining feature-rich ML models. By Daniel Cer, Michel Galley, Spence Green, and others.
Joshua: A Java hierarchical MT decoder, largely based on the design of Hiero. By Chris Callison-Burch and others.
Jane: A phrase-based MT decoder by the U. Aachen group.
cdec: A primarily SCFG-based MT decoder by Chris Dyer and many others. C++.
EGYPT system: System from the 1999 JHU workshop. Mainly of historical interest.
GIZA++ and mkcls: By Franz Och. C++. GPL. Still often used for word alignment.
Thot: Phrase-based model building kit.
Phramer: An open-source Java statistical phrase-based MT decoder.
Syntax Augmented Machine Translation via Chart Parsing: By Andreas Zollmann and Ashish Venugopal.

Free, but getting them requires hassle

Pharaoh decoder: Philipp Koehn, ISI.
MTTK: Machine Translation Tool Kit. Deng and Byrne.

Freely downloadable

Stanford POS tagger: Loglinear tagger in Java (by Kristina Toutanova).
hunpos: An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml. Pre-compiled models. Runs on Linux, Mac OS X, and Windows.
MBT: Memory-based Tagger. Based on TiMBL.
TreeTagger: A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
SVMTool: POS tagger based on SVMs (uses SVMlight). LGPL.
ACOPOST (formerly ICOPOST): Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger. Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
mu-TBL: An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things, by Torbjörn Lager. Web demo also available. Prolog.
YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
QTAG Part of speech tagger: An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
The TOSCA/LOB tagger: Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from a historical perspective, and for software sharing in academia more generally. LOB tag set.
The venerable Brill's Transformation-based learning Tagger: A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the Wikipedia page or you could try a reimplementation such as fnTBL.
Original Xerox Tagger: A Common Lisp HMM tagger available by ftp.
Lingua-EN-Tagger: Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)

Free, but require registration

Usable by email or on the web, but not distributed freely

Not free

Lingsoft: Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
Conexor: Conexor in Finland has demonstrations of EngCG-style taggers and parsers for English, Swedish, and Spanish.
Xerox: Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
Infogistics: Infogistics, an Edinburgh spinoff, has a tagging and NP/Verb group chunker available commercially, including an evaluation version.

No longer available

LT POS and LT TTT: The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools which no longer seem to be available.

Downloadable

YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
Mark Greenwood's Noun Phrase Chunker: A Java reimplementation of Ramshaw and Marcus (1995).
fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

Downloadable

CRF++: Generic CRF-based model in C++. Open source. By the author of YamCha.
Carafe: Generic CRF-based sequence models in OCaml. Open source. By Ben Wellner.
FreeLing: A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, parsing.

Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.

Downloadable

ASSERT: PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
Shalmaneser: FrameNet-based, by Katrin Erk.
Tree Kernels in SVMlight: By Alessandro Moschitti. A general package, but it has particularly been used for SRL.

Downloadable

Stanford Named Entity Recognizer: A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. GPL. By Jenny Finkel.
LingPipe: Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Downloadable

Stanford Deterministic Coreference Resolution System: Winner of CoNLL 2011 shared task, with subsequent improvements. Distributed as part of Stanford CoreNLP. By Heeyoung Lee and others. Java. GPL.
Reconcile: By Ves Stoyanov and others. Java. GPL.
Illinois Coreference Package: Java. University of Illinois Research and Academic Use License.
Berkeley Coreference Resolution: By Greg Durrett et al. Mainly Scala. GPL.
BART: A Beautiful Anaphora Resolution Toolkit. By Yannick Versley and many others. Java. Apache with GPL components.
Guitar: Java. GPL.

Downloadable

IRSTLM Toolkit: Compatible with SRILM, suitable for very large language models. LGPL. By Marcello Federico, Nicola Bertoldi et al.
CMU-Cambridge Statistical Language Modeling toolkit

Downloadable, but requires registration

The SRI Language Modeling toolkit by Andreas Stolcke is another good system for building language models, freely available for research purposes.

Not yet classified

Lextools A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/

Wordsmith Tools (Mike Scott) The thing to get if you are working in the Windows world.

Downloadable

Free, but require registration

Stuttgart's IMS Corpus Workbench (CWB): A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
GATE: University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
MITRE's Alembic Workbench: A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
SNoW: SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).
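The "sparse network of linear units" idea in the SNoW entry can be illustrated with a small sketch. This is not SNoW's code or its API: the class name `SparseLinearNetwork`, the feature strings, and the toy data are all hypothetical, and a plain perceptron update is swapped in for illustration. What it shows is one linear unit per class, with weights stored only for features that unit has actually been updated on, so the feature space is acquired incrementally.

```python
from collections import defaultdict


class SparseLinearNetwork:
    """Hypothetical sketch: one sparse linear unit per class label."""

    def __init__(self):
        # label -> {feature -> weight}; both levels grow as data arrives
        self.weights = defaultdict(dict)

    def score(self, label, features):
        unit = self.weights.get(label, {})
        # Features a unit has never been updated on contribute nothing
        return sum(unit.get(f, 0.0) for f in features)

    def predict(self, features):
        if not self.weights:
            return None
        # sorted() makes tie-breaking deterministic (alphabetical by label)
        return max(sorted(self.weights), key=lambda lab: self.score(lab, features))

    def update(self, features, gold):
        gold_unit = self.weights[gold]  # allocate a unit the first time a label is seen
        guess = self.predict(features)
        if guess == gold:
            return
        wrong_unit = self.weights[guess]
        for f in features:
            # Perceptron-style update: promote the gold unit, demote the wrong one
            gold_unit[f] = gold_unit.get(f, 0.0) + 1.0
            wrong_unit[f] = wrong_unit.get(f, 0.0) - 1.0


def train(net, data, epochs=5):
    for _ in range(epochs):
        for features, label in data:
            net.update(features, label)
    return net


# Toy data with made-up feature names, for illustration only
DATA = [
    ({"suffix=ing", "prev=is"}, "VERB"),
    ({"suffix=s", "prev=the"}, "NOUN"),
    ({"suffix=ed", "prev=has"}, "VERB"),
    ({"suffix=ion", "prev=the"}, "NOUN"),
]

net = train(SparseLinearNetwork(), DATA)
print(net.predict({"suffix=ing", "prev=has"}))
```

Because each unit only holds weights for features it has been updated on, memory grows with the features that matter for a class rather than with the full feature space, which is the property the entry above emphasizes.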

Unsure

INTEX: A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein silberz@ladl.jussieu.fr

The PennTools page collects information on a variety of NLP systems, many of which are available externally.

English

English language corpora available from the sites above are not repeated here.

Chinese

The Lancaster Corpus of Mandarin Chinese (LCMC) By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.

Multilingual

Bosnian

Czech

Parallel Czech-English Literature translations in Czech and English.
Czech National Corpus project: SYN2000 100 million words of contemporary Czech.

French

Association des Bibliophiles Universels: Various French literary works.
American and French Research on the Treasury of the French Language (ARTFL): 150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).

German

COSMAS Corpus: Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.
NEGRA Corpus: Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Available free of charge for academic use.

Russian

Russian National Corpus: 150 million words, 5 million words POS-tagged, some in dependency treebank.
Library of Russian Internet Libraries: Various literary works.

Slovene

Croatian

Croatian National Corpus: 100 million words.

Spanish and Portuguese

Swedish

Spraakdata, Department of Swedish, Göteborgs University. Has various searchable part-of-speech-tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.

CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.

The Senseval web site: Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.
Ted Pedersen's code: Includes various WSD systems.
SenseClusters: Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text. Supports a wide range of clustering techniques based on both context vectors and similarity matrices, and includes links to SVDPACKC and CLUTO. By Ted Pedersen and Amruta Purandare.
Evocation WordNet synset similarity judgments: Judgments on how similar the meanings of synsets are and how common they are in the BNC, from Jordan Boyd-Graber.
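The idea behind the SenseClusters entry, grouping instances of a word by the similarity of their contexts, can be sketched in a few lines. This is only an illustration under stated assumptions, not SenseClusters itself: the function names, the tiny stopword list, the window size, the threshold, and the example sentences are all hypothetical, and a simple greedy single-pass ("leader") clustering is swapped in for the clustering techniques the real package supports.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from any toolkit)
STOPWORDS = {"the", "of", "on", "and", "a"}


def context_vector(tokens, target, window=3):
    """Count non-stopwords within `window` positions of each `target` occurrence."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), i + window + 1
        for j in range(lo, min(hi, len(tokens))):
            if j != i and tokens[j] not in STOPWORDS:
                vec[tokens[j]] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose first
    member is similar enough, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


SENTENCES = [
    "she sat on the bank of the river fishing",
    "the bank approved the loan and the mortgage",
    "reeds grew along the river bank near the water",
    "the bank raised the loan interest rate",
]

vectors = [context_vector(s.split(), "bank") for s in SENTENCES]
groups = cluster(vectors)
print(groups)
```

On these four toy sentences the river contexts and the loan contexts end up in separate groups. A real system would use far larger contexts and a more robust clustering method than this single pass, whose result depends on input order.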

There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:

Entirely or mainly English

CHILDES database. Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.

Dictionaries of subcategorization frames

The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).

COBUILD: Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).
LDOCE: Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.
OALD: Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different, incompatible format. However, a partial version of the third edition (with this information) is available free online from the Oxford Text Archive.

Not exactly a dictionary, but other popular sources are:

Levin (1993): Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc.) not made by the above dictionaries. The index of verbs is online.
English subcategorization evaluation resources: Gold standard data, from Cambridge University (Anna Korhonen).

See also COMLEX and CELEX available from the LDC.

Dictionaries of assorted languages on the web

The old version of Robert Beard's Web of Online Dictionaries long ago mutated into YourDictionary.com. I'm told the IPO has been delayed. Nevertheless, it's the most comprehensive index of dictionaries available on the web.

Names

U.S. names with frequency information are available from the Census Bureau.

SGML structured dictionaries

Cambridge International Dictionary of English and other products in SGML.

English SENSEVAL Resources: Dictionary entries and tagged examples for 35 words.
ARIES Natural Language Tools: Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.

"Techie"

"Corpus Linguistics"

Mailing lists that have information on these topics include:

Corpora: The main mailing list for info on corpus-based linguistics. Subscribe by sending the message: subscribe corpora to listserv@uib.no. Or if you want to subscribe with a different email address, send: subscribe corpora email-address (Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!). Or you can subscribe on the web.
Empiricist: The empiricist list appears to be defunct now. You used to send a "subscribe" message to empiricists-request@unagi.cis.upenn.edu.
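The subscription message described for the Corpora list can be composed programmatically. The sketch below only builds the message and does not send it. The function name `build_subscribe_message` and the example addresses are hypothetical, and placing the command in the message body is an assumption about how the Majordomo server reads requests. The server address and command text are the ones given above and may no longer be current.

```python
from email.message import EmailMessage


def build_subscribe_message(sender, list_name="corpora",
                            server="listserv@uib.no", address=None):
    """Build (but do not send) a subscribe request for a mailing list server.

    `address` is the optional alternative email address to subscribe.
    """
    command = f"subscribe {list_name}"
    if address:
        command += f" {address}"

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = server
    # Assumption: the command goes in the body, with no name appended
    msg.set_content(command)
    return msg


# Hypothetical addresses for illustration
msg = build_subscribe_message("me@example.org")
alt = build_subscribe_message("me@example.org", address="other@example.org")
print(msg)
```

Sending would need an SMTP client such as the standard library's `smtplib`, configured for your own mail server.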

Home pages with something useful on them.

Still under construction...

http://nlp.stanford.edu/links/statnlp.html

Christopher Manning -- <manning@cs.stanford.edu> -- Last modified: Sat Nov 29, 2014