As I am currently working on a language processing project, I am going to post various interesting findings from time to time.

For example – how many words do we use to name a term? Based on a sample of ~1 million English terms, here is a chart showing the percentage of terms versus the number of words used to describe each term:

You can see that 20% of things are called by a single word. Most things (40%) are named with two words, and from there the counts go down. Interestingly, from two words onward, the probability of using one more word to describe a term is roughly halved each time (the coefficient is actually ~0.48)!

n-grams            Count    Multiplier
ngram > 0 (total)  974267
ngram > 1          779238   0.800
ngram > 2          375672   0.482
ngram > 3          163705   0.436
ngram > 4           79690   0.487
ngram > 5           38585   0.484
ngram > 6           17478   0.453
ngram > 7            8439   0.483
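The "Multiplier" column is just each count divided by the previous one – the fraction of terms that "survive" one more word. A minimal Python sketch that reproduces those ratios from the raw counts (the counts come from the table above; the variable names are mine):

```python
# Counts of terms with more than n words, taken from the table above.
counts = [974267, 779238, 375672, 163705, 79690, 38585, 17478, 8439]

# Each multiplier is the ratio of consecutive counts.
multipliers = [counts[i] / counts[i - 1] for i in range(1, len(counts))]

for n, m in enumerate(multipliers, start=1):
    print(f"ngram > {n}: {m:.3f}")
```

From ngram > 2 onward the ratio hovers around 0.48, which is the halving described above.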

That’s a kind of semantic law 😉

To be a bit more precise, this is not a Poisson distribution but an Erlang distribution.

After a brief check, the probability of an n-gram is approximately:

P(n) ~ Erlang(n,5,2)
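As a sanity check, here is a small Python sketch of the Erlang density evaluated at word counts 1 through 8. I am reading Erlang(n, 5, 2) as shape k = 5 and rate λ = 2 – that parameter order is my assumption, not stated in the post:

```python
import math

def erlang_pdf(x, k, lam):
    """Erlang density: lam^k * x^(k-1) * e^(-lam*x) / (k-1)!"""
    return (lam ** k) * (x ** (k - 1)) * math.exp(-lam * x) / math.factorial(k - 1)

# k=5, lam=2 assumed from the post's Erlang(n, 5, 2) notation.
for n in range(1, 9):
    print(f"P({n}) ~ {erlang_pdf(n, 5, 2):.3f}")
```

With these parameters the mode sits at (k − 1)/λ = 2, matching the two-word peak in the data.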

Just nice to know 🙂
