Here is a list of free natural language processing data sets.

SNLP

Statistical natural language processing and corpus-based computational linguistics: an annotated list of resources. It contains a rich set of NLP data sets.

Lemur ClueWeb09

Freebase Annotations of the ClueWeb Corpora, v1. The annotations for each corpus are provided as a collection of 500 files. Each file contains annotations of multiple web pages, and each page URL is followed by a list of entities identified in that page.

Google Books Ngrams

A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License.

A Multilingual Corpus of Automatically Extracted Relations from Wikipedia

A dataset of automatically extracted relations from the Wikipedia corpus in 61 languages, along with manually annotated relations in 3 languages (French, Hindi and Russian). Google released this dataset to help researchers working on natural language processing and to encourage novel applications in a wide variety of languages.

Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. The corpus was created by Mark Davies of Brigham Young University and is used by tens of thousands of users every month, including linguists, teachers, translators, and other researchers.

WordNet

WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

Wordnets in the World

Links to wordnets in a variety of languages, with the goal of making it easy to use wordnets in multiple languages.

GeoWordNet

GeoWordNet is a semantic resource built from the full integration of WordNet, GeoNames and the Italian part of MultiWordNet. The GeoWordNet Public Dataset contains 3,698,238 entities, 3,698,237 part-of relations between entities, 334 concepts, 182 relations between concepts, 3,698,238 relations between instances and concepts, and 13,562 (English and Italian) alternative entity names.
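
The GeoWordNet entity and part-of counts differ by exactly one, which is what a single rooted hierarchy would produce, since every entity except the root has one parent. That reading is an assumption, not something the dataset description states. Under it, a minimal Python sketch of a part-of hierarchy might look like the following. The entities are made up for illustration, and the `PART_OF` layout and the `ancestors` and `count_entities` helpers are hypothetical, not part of any GeoWordNet distribution.

```python
# Toy part-of hierarchy. The entities are illustrative only and are
# not taken from GeoWordNet; the child -> parent layout is an assumption.
PART_OF = {
    "Trento": "Trentino-Alto Adige",
    "Trentino-Alto Adige": "Italy",
    "Italy": "Europe",
    "Europe": "Earth",
}


def ancestors(entity, part_of=PART_OF):
    """Return the chain of containers of `entity`, nearest first."""
    chain = []
    seen = {entity}
    while entity in part_of:
        entity = part_of[entity]
        if entity in seen:  # guard against accidental cycles
            raise ValueError(f"cycle detected at {entity!r}")
        seen.add(entity)
        chain.append(entity)
    return chain


def count_entities(part_of=PART_OF):
    """Count distinct entities appearing as a child or as a parent."""
    return len(set(part_of) | set(part_of.values()))


if __name__ == "__main__":
    print(ancestors("Trento"))
    # In a single rooted tree, entities = part-of relations + 1.
    print(count_entities(), len(PART_OF))
```

Storing only the child-to-parent link keeps the sketch small. A real loader would also need the concept and alternative-name tables described above.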