Stanford NLP is a library for parsing and tokenizing natural-language text. Applications that operate on text typically split it into words first, then annotate each word with its part of speech, using a combination of heuristics and statistical rules. Later operations build on these results with the same techniques (heuristics and statistical algorithms applied to the earlier annotations), which leads naturally to a pipeline model.
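To make the pipeline idea concrete, here is a toy sketch in plain Scala. It does not use the Stanford NLP API at all: the names ToyPipeline, Token, Stage, tagPos and tagDay are invented for illustration, and the "tagger" is a crude heuristic. The point is only that each stage reads the annotations left by earlier stages and adds its own.

```scala
// Toy illustration of the pipeline model; none of this is the Stanford NLP API.
object ToyPipeline {
  // A token plus whatever annotations earlier stages have attached to it.
  final case class Token(word: String, annotations: Map[String, String] = Map.empty)

  // A stage takes the tokens produced so far and returns enriched tokens.
  type Stage = Seq[Token] => Seq[Token]

  // First step: split the raw text into words (naive whitespace split).
  def tokenize(text: String): Seq[Token] =
    text.split("\\s+").filter(_.nonEmpty).map(w => Token(w)).toSeq

  // Second step: a crude heuristic stand-in for part-of-speech tagging.
  val tagPos: Stage = _.map { t =>
    val tag =
      if (t.word.forall(_.isDigit)) "NUM"
      else if (t.word.head.isUpper) "PROPER"
      else "WORD"
    t.copy(annotations = t.annotations + ("pos" -> tag))
  }

  private val weekdays =
    Set("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

  // Third step: builds on the "pos" annotation written by the previous stage.
  val tagDay: Stage = _.map { t =>
    if (t.annotations.get("pos").contains("PROPER") && weekdays(t.word))
      t.copy(annotations = t.annotations + ("ner" -> "DAY"))
    else t
  }

  // Run the stages in order, feeding each one the output of the last.
  def run(text: String, stages: Seq[Stage]): Seq[Token] =
    stages.foldLeft(tokenize(text))((tokens, stage) => stage(tokens))
}
```

Calling ToyPipeline.run("they met every Tuesday", Seq(ToyPipeline.tagPos, ToyPipeline.tagDay)) tags "Tuesday" as a day only because the earlier stage already marked it as a proper noun. Swap the stage order and the day tagger finds nothing to build on.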

The example below combines two techniques for constructing a pipeline: the core annotators are named in a configuration object, and one further annotator is added manually. Because this example extracts dates and times from text, we add the TimeAnnotator class to the end of the pipeline:

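import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.time.TimeAnnotator

object Main {
  def main(args: Array[String]): Unit = {
    // Configuration-based: name the annotators in a properties object
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref")
    val pipeline = new StanfordCoreNLP(props)
    // Manual: append the time annotator to the end of the pipeline
    val timeAnnotator = new TimeAnnotator()
    pipeline.addAnnotator(timeAnnotator)
    ...
  }
}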
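For comparison, a fully manual construction might look like the sketch below, which is modelled on the SUTime demo that ships with CoreNLP. Treat the details as assumptions to check against your release: the `false` arguments are the annotators' verbose flags, "sutime" is the property prefix the demo uses, and the object name ManualPipeline is made up for this post. It needs the CoreNLP jar and its models jar on the classpath.

```scala
import java.util.Properties
import edu.stanford.nlp.pipeline.{AnnotationPipeline, POSTaggerAnnotator, TokenizerAnnotator, WordsToSentencesAnnotator}
import edu.stanford.nlp.time.TimeAnnotator

// Sketch only: builds a minimal pipeline by hand instead of from a properties string.
object ManualPipeline {
  def build(): AnnotationPipeline = {
    val pipeline = new AnnotationPipeline()
    // Order matters: each annotator relies on the output of the ones before it.
    pipeline.addAnnotator(new TokenizerAnnotator(false))        // text -> tokens
    pipeline.addAnnotator(new WordsToSentencesAnnotator(false)) // tokens -> sentences
    pipeline.addAnnotator(new POSTaggerAnnotator(false))        // part-of-speech tags
    pipeline.addAnnotator(new TimeAnnotator("sutime", new Properties())) // dates and times
    pipeline
  }
}
```

This version skips the lemma, ner, parse and dcoref stages, so it loads fewer models, at the cost of you having to know which annotators the time annotator actually depends on.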

Once this is working, you simply tell the pipeline to annotate the text, and then wait for a bit.

val text = "Last summer, they met every Tuesday afternoon, from 1:00 pm to 3:00 pm."
val doc = new Annotation(text)
pipeline.annotate(doc)

This takes time because the library loads a handful of files from disk and works through them; the files are small, but they contain large numbers of predefined rules. Consider the following sample, a small piece of the library's time-matching rules (there are around a thousand lines of this sort of thing).

BASIC_NUMBER_MAP = {
  "one": 1,
  "two": 2,
  "three": 3,
  ...
}
BASIC_ORDINAL_MAP = {
  "first": 1,
  "second": 2,
  "third": 3,
  ...
}
PERIODIC_SET = {
  "centennial": TemporalCompose(MULTIPLY, YEARLY, 100),
  "yearly": YEARLY,
  "annually": YEARLY,
  "annual": YEARLY,
  ...
}

This sample demonstrates two things – you can pull out more than just exact times (e.g. “last summer”, “next century”, ranges, times without dates), and the library handles a large number of equivalence classes for you.
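The equivalence-class idea is easy to see in a few lines of plain Scala. The sketch below is not the library's code: PeriodLookup, Period, Yearly, Multiple and normalize are names invented here, and the map only mirrors the four PERIODIC_SET entries shown above. It shows several surface forms collapsing to one canonical value, with a composed value for "centennial".

```scala
// Hypothetical re-creation of the idea behind the rule maps; not the library's code.
object PeriodLookup {
  sealed trait Period
  case object Yearly extends Period
  // A period built by multiplying another one, e.g. 100 x yearly.
  final case class Multiple(base: Period, factor: Int) extends Period

  // Several words map to the same canonical value: one equivalence class.
  val periodicSet: Map[String, Period] = Map(
    "centennial" -> Multiple(Yearly, 100),
    "yearly"     -> Yearly,
    "annually"   -> Yearly,
    "annual"     -> Yearly
  )

  // Case-insensitive lookup; None when the word is not a known period.
  def normalize(word: String): Option[Period] =
    periodicSet.get(word.toLowerCase)
}
```

With this in place, PeriodLookup.normalize("Annually") and PeriodLookup.normalize("yearly") return the same value, which is the convenience the real rule files give you at a much larger scale.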

The most likely problem you’ll run into is getting the classpath and the parsing pipeline set up correctly. The setup looks simple, but if you try to customize it, you’ll need to understand how the library is actually structured.
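On the classpath side, a common stumbling block is that the code and the model files are published as separate artifacts. If you build with sbt, a dependency block along these lines should pull in both. This is a sketch under assumptions: the version number is only an example, so substitute the release you actually use, and the rest of your build definition is not shown.

```scala
// build.sbt fragment (sketch): the version below is an assumption, not a recommendation.
val coreNlpVersion = "4.5.4"

libraryDependencies ++= Seq(
  // The library itself
  "edu.stanford.nlp" % "stanford-corenlp" % coreNlpVersion,
  // The trained models and rule files, published under the "models" classifier
  "edu.stanford.nlp" % "stanford-corenlp" % coreNlpVersion classifier "models"
)
```

If the models artifact is missing, the pipeline typically compiles fine and then fails at runtime when an annotator tries to load its files.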

Once you run it, you can get dates out: