On US Patent 5,937,422 & 'Semantic Forests'

They told you so. Do you remember?

US patent 5,937,422 belongs to the NSA. It was filed in 1997 and granted in 1999. Julian Assange came upon it while researching the NSA in 1999. Suelette Dreyfus wrote a report on it, which The Independent published three months later.



Not much has been heard about it since. Until now.

1. US Patent 5,937,422

The full text of the application for patent 5,937,422 is still available online; a link appears under Further Reading, and key parts of the application are reproduced below.



The relevance is clear: the NSA have been bent on intercepting everything since way before 9/11 and the Patriot Act. All 9/11 and the resulting legislation did was let them come out in the open and work at double speed.



The implications of patent 5,937,422 are patently clear: the NSA can now transcribe voice communications.

United States Patent 5,937,422

Nelson et al August 10, 1999



Automatically generating a topic description for text and searching and sorting text by topic using the same



Abstract



A method of automatically generating a topical description of text by receiving the text containing input words; stemming each input word to its root form; assigning a user-definable part-of-speech score to each input word; assigning a language salience score to each input word; assigning an input-word score to each input word; creating a tree structure under each input word, where each tree structure contains the definition of the corresponding input word; assigning a definition-word score to each definition word; collapsing each tree structure to a corresponding tree-word list; assigning a tree-word-list score to each entry in each tree-word list; combining the tree-word lists into a final word list; assigning each word in the final word list a final-word-list score; and choosing the top N scoring words in the final word list as the topic description of the input text. Document searching and sorting may be accomplished by performing the method described above on each document in a database and then comparing the similarity of the resulting topical descriptions.



Inventors: Nelson; Douglas J. (Columbia, MD); Schone; Patrick John (Elkridge, MD); Bates; Richard Michael (Greenbelt, MD)

Assignee: The United States of America as represented by the National Security Agency (Washington, DC)

Appl No: 834263

Filed: April 15, 1997



US Class: 707/531; 707/4; 707/532; 707/535; 707/512

Intern'l Class: G06F 017/30

Field of Search: 704/10 707/512,532,535,531,3-5,7



Primary Examiner: Amsbury; Wayne

Assistant Examiner: Channavajjala; Srirama

Attorney, Agent or Firm: Morelli; Robert D



Claims



1. A method of automatically generating a topical description of text, comprising the steps of:



a) receiving the text, where the text consists of one or more input words;

b) stemming each input word to its root form;

c) assigning a user-definable part-of-speech score β_i to each input word;

d) assigning a language salience score S_i to each input word;

e) assigning an input-word score to each input word that is a function of the corresponding input word's part-of-speech score β_i, language salience score S_i, and the number of times the corresponding input word appears in the text;

f) creating a tree structure under each input word, where each tree structure contains the definition of the corresponding input word, where each definition word may be further defined to a user-definable number of levels;

g) assigning a definition-word score A_{i,t}[j] to each definition word in each tree structure based on the definition word's part-of-speech score β_j, the language salience score of the word the definition word defines, a relational salience score R_{k,j}, and a user-definable factor W;

h) collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains the unique words contained in the corresponding tree structure;

i) assigning a tree-word-list score to each word in each tree-word list, where each tree-word-list score is a function of the scores of the corresponding word that existed in the corresponding uncollapsed tree structure;

j) combining the tree-word lists into a final word list, where the final word list contains the unique words contained in the tree-word lists;

k) assigning a final-word-list score A_fi[j] to each word in the final word list, where A_fi[j] is a function of the corresponding word's dictionary salience and tree-word-list scores; and

l) choosing the top N scoring words in the final word list as the topic description of the input text, where the value N may be defined by the user.
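
Read as an algorithm rather than legalese, claim 1 is a scoring pipeline. The sketch below is purely illustrative: the crude stemmer, the toy dictionary, the 0.5 decay weight, and all scores are invented stand-ins for the patent's user-definable β, S and W values.

```python
# Illustrative sketch of the claim-1 pipeline. The stemmer, the toy
# dictionary, and the 0.5 decay weight are invented stand-ins for the
# patent's user-definable beta, S and W values.

TOY_DICT = {
    "bank":  ["institution", "money", "deposit"],
    "money": ["currency", "exchange"],
    "river": ["stream", "water"],
}

def stem(word):
    # Step b: crude stand-in for a real stemmer.
    return word.lower().rstrip("s")

def topic_description(text, n=3, levels=1):
    words = [stem(w) for w in text.split()]
    scores = {}
    for w in set(words):
        # Steps c-e: input-word score; here simply the word's count,
        # i.e. m * S_i * beta_i with S_i = beta_i = 1.
        scores[w] = scores.get(w, 0.0) + words.count(w)
        # Steps f-j: expand the word through its definition tree and
        # fold the (down-weighted) definition words into the list.
        frontier = [w]
        for depth in range(levels):
            frontier = [d for f in frontier for d in TOY_DICT.get(f, [])]
            for d in frontier:
                scores[d] = scores.get(d, 0.0) + 0.5 ** (depth + 1)
    # Steps k-l: the top-N scoring words are the topic description.
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(topic_description("bank deposits and money"))
```

Note how a word never present in the input ("institution") can still enter the topic description through the definition tree, which is the point of the invention.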



2. The method of claim 1, wherein said step of receiving the text, is comprised of the step of receiving text wherein said text is selected from the group consisting of speech-based text, optical-character-read text, stop-word-filtered text, stutter-phrase-filtered text, and lexical-collocation-filtered text.



3. The method of claim 1, wherein said step of assigning a language salience score S_i to each input word is comprised of the step of determining the language salience score for each input word from the frequency count f_i of each word in a large corpus of text as follows:



S_i = 0, if f_i > f_max;

S_i = log(f_max / (f_i - T^2 + T)), if T^2 < f_i <= f_max;

S_i = log(f_max / T), if T < f_i <= T^2;

S_i = ε + ((f_i / T)(log(f_max / T) - ε)), if f_i <= T;

where ε and T are user-definable values, and where f_max represents a point where the sum of frequencies of occurrence above the point equals the sum of frequencies of occurrence below the point.
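
This piecewise salience function can be written directly. The inequalities separating the middle branches are reconstructed (the claim text as archived is garbled), and T = 10 and ε = 0.001 are arbitrary example values, not values from the patent.

```python
import math

def language_salience(f_i, f_max, T=10.0, eps=0.001):
    """Language salience S_i as reconstructed from claim 3; T and eps
    (epsilon) are the user-definable values, with example defaults."""
    if f_i > f_max:
        return 0.0                                    # very common: no salience
    if f_i > T ** 2:
        return math.log(f_max / (f_i - T ** 2 + T))   # T^2 < f_i <= f_max
    if f_i > T:
        return math.log(f_max / T)                    # T < f_i <= T^2: plateau
    return eps + (f_i / T) * (math.log(f_max / T) - eps)  # rare: ramp from eps

# Mid-frequency words sit on the plateau; overly common words score zero.
print(language_salience(50, 1_000_000))
print(language_salience(2_000_000, 1_000_000))
```

The shape is intuitive: words that blanket the corpus (think "the") get zero salience, very rare words ramp up from ε, and the interesting mid-frequency band gets the full plateau score.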



4. The method of claim 3, wherein said step of assigning a language salience score S_i to each input word further comprises the step of allowing the user to override the language salience score for a particular word with a user-definable language salience score.



5. The method of claim 1, wherein said step of assigning an input-word score to each input word is comprised of the step of assigning an input-word score, where said input-word score is selected from the group consisting of m S_i β_i and (S_i m) β_i, where m is the number of times the corresponding input word occurs in the text.



6. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary.



7. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a database selected from a group consisting of a thesaurus, an encyclopedia, and a word-based relational database.



8. The method of claim 1, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary that is in a different language than the text.



9. The method of claim 1, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows:

A_{i,t}[j] = W(β_{j,t}) Σ_k A_{i,t-1}[k] R_{k,j}

where

R_{i,j} = D_j / Σ_k D_k, where Σ_k D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = β_j (S_j log(d_max / d_j))^0.5, where d_j is the number of dictionary terms that use the corresponding word in its definition, and where d_max is the number of times the most frequently used word in the dictionary is used.
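
In miniature, claim 9's D and R quantities look like this. Every number in STATS, and D_MAX, is invented for illustration; real values would come from corpus counts and the dictionary itself.

```python
import math

# Hypothetical per-word statistics: part-of-speech score beta, language
# salience S, and d = the number of dictionary entries whose definitions
# use the word. D_MAX stands in for d_max, the count for the most-used
# word. All values are invented for this example.
STATS = {
    "institution": (1.0, 8.0, 50),
    "money":       (1.0, 6.0, 200),
    "deposit":     (0.8, 7.0, 30),
}
D_MAX = 5000

def dictionary_salience(word):
    beta, S, d = STATS[word]
    # D_j = beta_j * (S_j * log(d_max / d_j)) ** 0.5   (claim 9)
    return beta * (S * math.log(D_MAX / d)) ** 0.5

def relational_saliences(definition_words):
    # R_{i,j} = D_j / sum_k D_k over the words defining w_i, so the
    # R values for any one definition always sum to 1.
    D = {w: dictionary_salience(w) for w in definition_words}
    total = sum(D.values())
    return {w: D[w] / total for w in D}

R = relational_saliences(["institution", "money", "deposit"])
print(R)
```

R thus distributes each word's weight across its definition, favouring definition words that are themselves salient and rarely used by other dictionary entries.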



10. The method of claim 1, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows:

A_{i,t}[j] = W(β_{j,t}) Σ_k A_{i,t-1}[k] R_{k,j}

where

R_{i,j} = D_j / Σ_k D_k, where Σ_k D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = β_j (S_j log(d_m / Δ_j))^0.5, where Δ_j = max(d_j, ε), and d_m is chosen such that a fixed percentage of the observed values of the d_j's are larger than d_m.



11. The method of claim 1, wherein said step of assigning a definition-word score is comprised of the step of assigning a score to each definition word that is user-definable.



12. The method of claim 1, wherein said step of collapsing each tree structure is comprised of collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains only salient input words and definition words in a particular tree structure having the highest score while ignoring lower scoring definition words in that tree structure even if the lower scoring definition words score higher than definition words contained in other tree structures.



13. The method of claim 1, wherein said step of assigning a tree-word-list score to each word in each tree-word list is comprised of assigning a tree-word-list score that is the sum of the scores associated with the word in its corresponding tree structure.



14. The method of claim 1, wherein said step of assigning a final word list score is comprised of the step of assigning a final word list score according to the following equation



A_fi[j] = ((D_j (f(A_i[j]))) Σ A_i[j]).



15. The method of claim 1, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.



16. The method of claim 1, further comprising the steps of:



a) receiving a plurality of documents, where one of said plurality of documents is identified as the document of interest;

b) determining a topic description for each of said plurality of documents;

c) comparing the topic descriptions of each of said plurality of documents to the topic description of said document of interest; and

d) returning each of said plurality of documents that has a topic description that is sufficiently similar to the topic description of said document of interest.
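
Claim 16 describes plain similarity search over topic descriptions. The patent does not name a similarity measure; the sketch below uses Jaccard overlap of topic-word sets as a stand-in, with an arbitrary threshold.

```python
def jaccard(a, b):
    # Overlap of two topic descriptions, treated as word sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def find_similar(doc_topics, interest_id, threshold=0.2):
    """Claim 16 in miniature: compare every document's topic
    description with the document of interest and return the
    sufficiently similar ones, best match first."""
    target = doc_topics[interest_id]
    hits = [(jaccard(topics, target), doc_id)
            for doc_id, topics in doc_topics.items()
            if doc_id != interest_id]
    return [doc_id for score, doc_id in sorted(hits, reverse=True)
            if score >= threshold]

docs = {
    "a": ["bank", "money", "deposit"],
    "b": ["money", "currency", "exchange"],
    "c": ["river", "stream", "water"],
}
print(find_similar(docs, "a"))
```

The searching happens entirely in the compressed topic space, which is what makes the approach cheap enough to run over very large document pools.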



17. The method of claim 1, further comprising the steps of:



a) receiving a plurality of documents;

b) determining a topic description for each of said plurality of documents;

c) comparing the topic descriptions of each of said plurality of documents to each other of said plurality of documents; and

d) sorting said plurality of documents by topic description.
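
Claim 17's topic sorting can likewise be sketched in a few lines. The greedy adjacency ordering and the overlap measure are choices made here for illustration; the patent leaves the comparison method open.

```python
def overlap(a, b):
    # Fraction of topic words two descriptions share.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def sort_by_topic(doc_topics):
    """Greedy ordering: start from the first document and repeatedly
    append the unplaced document most similar to the last one placed,
    so topically similar documents end up adjacent."""
    ids = sorted(doc_topics)        # deterministic starting order
    ordered = [ids.pop(0)]
    while ids:
        last = doc_topics[ordered[-1]]
        nxt = max(ids, key=lambda d: overlap(doc_topics[d], last))
        ids.remove(nxt)
        ordered.append(nxt)
    return ordered

docs = {
    "a": ["bank", "money", "deposit"],
    "b": ["river", "stream", "water"],
    "c": ["money", "currency", "exchange"],
}
print(sort_by_topic(docs))
```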



18. The method of claim 2, wherein said step of assigning a language salience score S_i to each input word is comprised of the step of determining the language salience score for each input word from the frequency count f_i of each word in a large corpus of text as follows:



S_i = 0, if f_i > f_max;

S_i = log(f_max / (f_i - T^2 + T)), if T^2 < f_i <= f_max;

S_i = log(f_max / T), if T < f_i <= T^2;

S_i = ε + ((f_i / T)(log(f_max / T) - ε)), if f_i <= T;

where ε and T are user-definable values, and where f_max represents a point where the sum of frequencies of occurrence above the point equals the sum of frequencies of occurrence below the point.



19. The method of claim 18, wherein said step of assigning a language salience score S_i to each input word further comprises the step of allowing the user to override the language salience score for a particular word with a user-definable language salience score.



20. The method of claim 19, wherein said step of assigning an input-word score to each input word is comprised of the step of assigning an input-word score, where said input-word score is selected from the group consisting of m S_i β_i and (S_i m) β_i, where m is the number of times the corresponding input word occurs in the text.



21. The method of claim 20, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary.



22. The method of claim 21, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary that is in a different language than the text.



23. The method of claim 22, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows:



A_{i,t}[j] = W(β_{j,t}) Σ_k A_{i,t-1}[k] R_{k,j}

where R_{i,j} = D_j / Σ_k D_k, where Σ_k D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = β_j (S_j log(d_max / d_j))^0.5, where d_j is the number of dictionary terms that use the corresponding word in its definition, and where d_max is the number of times the most frequently used word in the dictionary is used.



24. The method of claim 23, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows:



A_{i,t}[j] = W(β_{j,t}) Σ_k A_{i,t-1}[k] R_{k,j}

where R_{i,j} = D_j / Σ_k D_k, where Σ_k D_k represents the sum of the dictionary saliences of the words in the definition of word w_i, where D_j = β_j (S_j log(d_m / Δ_j))^0.5, where Δ_j = max(d_j, ε), and d_m is chosen such that a fixed percentage of the observed values of the d_j's are larger than d_m.



25. The method of claim 24, wherein said step of assigning a definition-word score is comprised of the step of assigning a score to each definition word that is user-definable.



26. The method of claim 25, wherein said step of collapsing each tree structure is comprised of collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains only salient input words and definition words in a particular tree structure having the highest score while ignoring lower scoring definition words in that tree structure even if the lower scoring definition words score higher than definition words contained in other tree structures.



27. The method of claim 26, wherein said step of assigning a tree-word-list score to each word in each tree-word list is comprised of assigning a tree-word-list score that is the sum of the scores associated with the word in its corresponding tree structure.



28. The method of claim 27, wherein said step of assigning a final word list score is comprised of the step of assigning a final word list score according to the following equation



A_fi[j] = ((D_j (f(A_i[j]))) Σ A_i[j]).



29. The method of claim 28, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.



30. The method of claim 29, further comprising the steps of:



a) receiving a plurality of documents, where one of said plurality of documents is identified as the document of interest;

b) determining a topic description for each of said plurality of documents;

c) comparing the topic descriptions of each of said plurality of documents to the topic description of said document of interest; and

d) returning each of said plurality of documents that has a topic description that is sufficiently similar to the topic description of said document of interest.



31. The method of claim 30, further comprising the steps of:



a) receiving a plurality of documents;

b) determining a topic description for each of said plurality of documents;

c) comparing the topic descriptions of each of said plurality of documents to each other of said plurality of documents; and

d) sorting said plurality of documents by topic description.

That's a lot of gibberish - unless you have someone on hand who can explain it in half the time. Fortunately, there is someone.

2. Semantic Forests

Suelette Dreyfus published the following dispatch on 3 December 1999, two weeks after sending her piece to the Independent.

-Caveat Lector-



By Suelette Dreyfus

Special Correspondent

CyberWire Dispatch



'Semantic Forests' doesn't mean much to the average person. But if you say it in concert with the words 'automatic voice telephone interception' and 'US National Security Agency' to a computational linguist, you might just witness the physical manifestations of the word 'fear'.



Words are funny things, often so imprecise. Two people can have a telephone conversation about sex, without ever mentioning the word. And when the artist formerly known as Prince sang a song about 'cream', he wasn't talking about a dairy product.



All this linguistic imprecision has largely protected our voice conversations from the prying ears of governments. Until now.



Or, more particularly, it protected us until 15 April, 1997 - the date the NSA lodged a secret patent application at the US Patent Office. Of course, the content of the NSA patent was not made public for two years, since the Patent Office keeps patent applications secret until they are approved, which in this case was August 10, 1999.



What is so worrying about patent number 5,937,422? The NSA is believed to be the largest and by far most well-funded spy agency in the world, a Microsoft of Spookdom. This document provides the first hard evidence that the NSA appears to be well on its way to creating eavesdropping software capable of listening to millions of international telephone calls a day. Automatically.



Patents are sometimes simply ambit claims, legal handcuffs on what often amounts to little more than theory. Not in this case. This is real. The US Department of Defense has developed the NSA's patent ideas into a real software program, called 'Semantic Forests', which it has been lab testing for at least two years.



Two important reports to the European Parliament, in 1998 and 1999, and Nicky Hager's 1996 book 'Secret Power' reveal that the NSA intercepts international faxes and emails. At the time, this revelation upset a great number of people, no doubt including the European companies which lost competitive tenders to American corporations not long after the NSA found its post-Cold War 'new economy' calling: economic espionage.



Voice telephone calls, however, well, that is another story. Not even the world's most technically advanced spy agency has the ability to do massive telephone interception and automatically massage the content looking for particular words, and presumably topics. Or so said a comprehensive recent report to the European Parliament.



In April 1999, a report commissioned by the Parliament's Office of Scientific and Technological Options Assessment (STOA) concluded that 'effective voice 'wordspotting' systems do not exist' and 'are not in use'.



The tricky bit there is 'do not exist'. Maybe these systems haven't been deployed en masse, but it is looking increasingly like they do actually exist, probably in some form which may be closer to the more powerful topic spotting.



Do The Math

===========



There are two new pieces of evidence to support this, and added together, they raise some fairly explosive questions about exactly what the NSA is doing with the millions of international phone calls it intercepts every day in its electronic eavesdropping web commonly known as Echelon.



First. The NSA's shiny new patent describes a method of 'automatically generating a topic description for text and sorting text by topic'. Sound like a sophisticated web search engine? That's because it is.



This is a search engine designed to trawl through 'machine transcribed speech', in the words of the patent application. Think computers automatically typing up words falling from human lips. Now think of a powerful search engine trawling through those words.



Now sweat...



Maybe the spy agency only wants to transcribe the BBC Radio World News, but I don't think so. The patent contains a few more linguistic clues about the NSA's intent - little golden Easter eggs buried in the legal long grass. The 'Background to the Invention' section of every patent application is the place where the intellectual property lawyers desperately try to waive away everyone else's right to claim anything even remotely touching on the patent.



In this section, the NSA attorneys observed there has been 'growing interest' in automatically identifying topics in 'unconstrained speech'.



Only a lawyer could make talking sound so painful. 'Unconstrained speech' means human conversation. Maybe it's been 'unconstrained' by the likelihood of being automatically transcribed for real time topic searching.



Here's the part where the imprecision of words - particularly spoken words - comes in. Machine transcribed conversations are raw, and very hard to analyze automatically with software. Many experts thought the NSA couldn't go driftnet fishing in the content of everyone's international phone calls because the technology to transcribe and analyze those calls was too young.



However, if the NSA didn't have the technology to do automatic transcription of speech, why would it have patented a sifting method which, by its very own words, is aimed at transcripts of human speech?



As Australian cryptographer Julian Assange, who discovered the DoD and patent papers while investigating NSA capabilities, observed: 'Why make tires if you don't have a car? Maybe we haven't seen the car yet, but we can infer that it exists by all the tires and roads'.



One of the top American cryptographers, Bruce Schneier, also believes the NSA already has machine transcription capability. 'One of the Holy Grails of the NSA is the ability to automatically search through voice traffic', Schneier said. 'They would have expended considerable effort on this capability, and this research indicates at least some of it has been fruitful.'



Second. Two Department of Defense academic papers show the US developed a real software program, called 'Semantic Forests', to implement the patented method.



Published as part of the Text REtrieval Conference (TREC) in 1997 and 1998, the Semantic Forests papers show the program has one main purpose: 'performing retrieval on the output of automatic speech-to-text (speech recognition) systems'. In other words, the US built this software *specifically* to sift through computer-transcribed human speech.



If that doesn't send a chill down your spine, read on.



The DoD's second prime purpose for Semantic Forests was to 'explore rapid prototyping' of this information retrieval system. That statement was written in 1997.



There's also an unambiguous link between Semantic Forests and the NSA patent. It's human, and its name is Patrick Schone.



Schone appears on the NSA patent documents as an inventor, and on the Semantic Forests papers as an author, and he works at Ft Meade, NSA headquarters.



Specifically, he works in the DoD's 'Speech Research Branch' which just happens to be located at, you guessed it, Ft Meade.



Very Clever Fish

================



The NSA and the DoD refused to comment on the patent and Semantic Forests respectively. Not surprising, really, but no matter, since the Semantic Forests papers speak for themselves. The papers reveal a software program which, while somewhat raw a year ago, was advancing quickly in its ability to fish relevant data out of various document pools, including those based on speech.



For example, in one set of tests, the scientists increased the average precision rate for finding relevant documents per query from 19% to 27% in just one year, from 1997 to 1998. Tests in 1998 on another set of documents, in the 'Spoken Document Retrieval' pool, were turning up similar stats, around 20-23%. The team also discovered that a little hand-fiddling in the software reaped large rewards.



According to the 1998 TREC paper: 'When we supplemented the topic lists for all the queries (by hand) to contain additional words from the relevant documents, our average precision at the number of relevant documents went from 28% to 50%'.
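
The metric quoted there, 'precision at the number of relevant documents' (often called R-precision), is a standard retrieval measure: of the top R results, where R is the number of documents actually relevant to the query, how many are relevant? A toy example, with invented document IDs:

```python
def precision_at_r(ranked_results, relevant):
    """R-precision: fraction of the top R retrieved documents that are
    relevant, where R is the total number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranked_results[:r] if doc in relevant) / r

# Toy run: four relevant documents, and the system ranks six results.
ranked = ["d1", "d7", "d3", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3", "d4"}
print(precision_at_r(ranked, relevant))   # 2 of the top 4 are relevant: 0.5
```

So the jump from 28% to 50% means that, after the hand-tuning, half of the top-ranked slots were filled with genuinely relevant documents.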



The truth is that Schone and his colleagues have created a truly clever invention. They have done some impressive research. What a shame all this creativity and laborious testing is going to be used for such dark, Orwellian purposes.



Let's work on the mental image of that dark landscape. The NSA sucks down phone calls, emails - all sorts of communications to its satellite bases. Its computers sift through the data looking for information which might interest the US or, if the Americans happen to be feeling generous that day, their allies.



Now, whenever NSA agents want to find out about you, they can pull up a slew of details about you from their databases. And not just the run-of-the-mill gumshoe detective stuff like your social security number and address, but the telephone number of every person you call regularly, and everything you have said when making those calls to 1-900-Lick-Me from your hotel room on those stopovers in Cleveland.



And here's the real scary stuff:



The NSA likely already has a file on many of us. It's not a traditional manila file with your name typed neatly on the front. It's the ability to reference you, or anyone who matches your patterns of behaviour and contacts, in the NSA's databases. Now, or in the near future, this file may not just include who you are, but what you *say*.



British Member of the European Parliament Glyn Ford is one of the few politicians around who is truly concerned with the individual's right to privacy. A driving force behind the European Parliament's STOA panel's two-year investigation into electronic communications, Ford worries that the NSA possesses technologies that are 'potentially very dangerous' to privacy, yet operates with no controls over its activities.



The Australian aboriginal activist and lawyer Noel Pearson once said that the British gave three great things to the world: tea, cricket, and common law. If unchecked, the NSA and its sister spy agencies in the UK/USA agreement may use this technology to lead an assault on the most important of those gifts, and the common law tenet 'innocent until proven guilty' may be the first casualty.

Further Reading

[CTRL] Fwd: NSA patent 5937422

Industry Watch: NSA Transcribing Voice 17 Years Ago

FAS: Automatically generating a topic description for text and searching and sorting text by topic using the same

