U.S. law periodically names specific institutions; historically it was even possible for Congress to write a law naming an individual, although I think that has become less common. I expected the most common entities named in federal law to be groups like Congress. It turns out this is true, but the other most common entities are the law itself and bureaucratic functions like the Archivist.

To get at this information, we need to read the Code XML and use a natural language processing library to extract the named groups.

NLTK is such an NLP library. It provides interesting features like sentence parsing, part of speech tagging, and named entity recognition. (If you're interested in the subject, see my review of “Natural Language Processing with Python”, a book which covers this library in detail.)
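If you want to follow along, note that NLTK's tokenizer, tagger, and chunker models are downloaded separately from the library itself. A typical setup looks something like the following; the exact data package names vary by NLTK version, so treat these as the usual suspects rather than gospel:

```shell
# Install NLTK, then fetch the models used below:
# punkt (sentence splitting), the POS tagger, and the NE chunker.
pip install nltk
python -m nltk.downloader punkt averaged_perceptron_tagger maxent_ne_chunker words
```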

To achieve the results we want, we first parse one of the U.S. Code XML documents:

from xml.etree import ElementTree as ET

tree = ET.parse("G:\\us_code\\xml_uscAll@113-21\\usc01.xml")

Then we have to write a function to retrieve just the text nodes. I’ve started this at the `<p>` elements, which seems to give good results (i.e. paragraphs of laws, but not the headings).

def getText(node, depth):
    if node is None:
        return ""
    result = []
    if depth == 0:
        iter = node.getiterator(tag='{http://xml.house.gov/schemas/uslm/1.0}p')
    else:
        iter = node.getiterator()
    for child in iter:
        if child.text is not None:
            result.append(child.text)
        if len(child.getchildren()) > 0:
            for n in child.getchildren():
                result = result + getText(n, depth + 1)
        result.append("\n")
    if depth == 0:
        return " ".join(result)
    else:
        return result

print getText(tree.getroot(), 0)

The Committee on the Judiciary of the House of Representatives is authorized to print bills to codify, revise, and reenact the general and permanent laws relating to the District of Columbia and cumulative supplements thereto, similar in style, respectively, to the Code of Laws of the United States, and supplements thereto, and to so continue until final enactment thereof in both Houses of the Congress of the United States. Pub. L. 90–226, title X

We can see in this some of the “entities” we expect to extract – “House of Representatives”, “District of Columbia”, “Code of Laws of the United States.”

It takes a little work to get at this – we first need to parse the text into sentences (an alternative approach might be to just keep the paragraphs as separate sentences, or parse each individually).

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = getText(tree.getroot(), 0)
sentences = sent_tokenize(text)
len(sentences)
Out: 1319

A sample of the resulting “sentences”:

u'211\n\nJuly 30, 1947\n\n1 U.S.C.',
u'211\n\nCopies of District of Columbia Code and Supplements not available to Senators or Representatives unless specifically requested by them, in writing, see Pub.',
u'L. 94–59, title VIII, § 801\n\nJuly 25, 1975\n\n89 Stat.',
u'296\n\nsection 1317 of Title 44\n\nIn addition the Superintendent of Documents shall, at the beginning of the first session of each Congress, supply to each Senator and Representative in such Congress, who may in writing apply for the same, one copy each of the Code of Laws of the United States, the Code of Laws relating to the District of Columbia, and the latest supplement to each code: Provided\n\nAnd provided further\n\nFor preparation and editing an annual appropriation of $6,500 is authorized to carry out the purposes of sections 202 and 203 of this title.\n\n'
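The ragged sentence boundaries above are largely driven by citation abbreviations like “Pub.” and “Stat.”, which look like sentence endings. A trained tokenizer handles many of these; a naive period-based splitter does far worse, as this stdlib-only sketch on an invented sentence shows:

```python
# Illustrative only: splitting on ". " shatters legal citations,
# which is exactly the failure a trained sentence tokenizer avoids.
text = "See Pub. L. 94-59 for details. The Clerk shall print the Code."

naive = [s for s in text.split(". ") if s]
print(naive)       # the citation is broken into fragments
print(len(naive))  # more pieces than the 2 real sentences
```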

From there, we need to parse each sentence into constituent words. The value of this library is that it handles issues like punctuation, which would otherwise cause infinite misery.

words = [nltk.word_tokenize(sentence) for sentence in sentences]
words[0]
Out[47]: [u'This', u'title', u'was', u'enacted', u'by', u'act', u'July', u'30', u',', u'1947', u',', u'ch', u'.']
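To see why a real tokenizer earns its keep, compare it with plain whitespace splitting on the sentence above. This stdlib sketch uses a crude regex as a stand-in for what word_tokenize does; it is an approximation, not NLTK's actual algorithm:

```python
import re

# Plain str.split leaves punctuation glued to words, so "30," and
# "30" would be counted as different tokens downstream.
sentence = "This title was enacted by act July 30, 1947, ch. 388."

naive_tokens = sentence.split()
print("30," in naive_tokens)  # punctuation stuck to the token

# Crude approximation of a word tokenizer: runs of word characters,
# or single punctuation marks, each as their own token.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens[:8])
```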

Once we have the words, we need NLTK to guess at parts of speech – it considers more detailed categories than you may have learned in school; this added precision seems to help it get more accurate results in later steps.

tagged = [nltk.pos_tag(w) for w in words]
tagged[0]
Out[49]: [(u'This', 'DT'), (u'title', 'NN'), (u'was', 'VBD'), (u'enacted', 'VBN'), (u'by', 'IN'), (u'act', 'NN'), (u'July', 'NNP'), (u'30', 'CD'), (u',', ','), (u'1947', 'CD'), (u',', ','), (u'ch', 'JJ'), ...]
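The tags are from the Penn Treebank tag set. For reference, here are plain-English glosses for the ones appearing in that output (the dict is just an illustrative lookup table, not part of NLTK):

```python
# Plain-English glosses for the Penn Treebank tags seen above.
PENN_TAGS = {
    "DT": "determiner",
    "NN": "noun, singular",
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "IN": "preposition or subordinating conjunction",
    "CD": "cardinal number",
    "JJ": "adjective",
}

for word, tag in [("July", "NNP"), ("enacted", "VBN")]:
    print(word, "->", PENN_TAGS[tag])
```

The distinction between, say, VBD and VBN (which school grammar lumps together as "verb") is part of what gives the chunker traction in the next step.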

And finally, we can look for “entities” in each sentence. NLTK returns what is, to me, an idiosyncratic result – a list whose items are either plain (word, tag) tuples or, where an entity was recognized, a small tree labeled with the entity type.

entities = [nltk.chunk.ne_chunk(t) for t in tagged]
entities[6]
Out[58]: Tree('S', [(u'990', 'CD'), (u'“All', 'JJ'), (u'Acts', 'NNS'), (u'of', 'IN'), Tree('ORGANIZATION', [(u'Congress', 'NNP')]), (u'referring', 'NN'), (u'to', 'TO'), (u'writs', 'NNS'), (u'of', 'IN'), (u'error', 'NN'), (u'shall', 'MD'), (u'be', 'VB'), (u'construed', 'VBN'), (u'as', 'IN'), (u'amended', 'VBN'), (u'to', 'TO'), (u'the', 'DT'), (u'extent', 'NN'), (u'necessary', 'JJ'), (u'to', 'TO'), (u'substitute', 'VB'), (u'appeal', 'NN'), (u'for', 'IN'), (u'writ', 'NN'), (u'of', 'IN'), (u'error.”', 'NNP'), (u'2002—', 'CD'), (u'Pub', 'NNP'), (u'.', '.')])

[(e.node, e.leaves()[0][0]) for e in entities[6]
 if isinstance(e, nltk.tree.Tree)]
Out[104]: [('ORGANIZATION', u'Congress')]

From here, I’ve defined a couple of simple utility functions to extract just the parts we need from the tree. Inspecting the results makes a few downsides caused by the lack of context clear: it seems to lose some stopwords (turning “House of Representatives” into just “Representatives”), and we can’t correlate an entity back to which law its text came from.

def entityStr(e):
    return " ".join([word for (word, pos) in e.leaves()])

def getEntities(nodes):
    return [(e.node, entityStr(e))
            for e in nodes if isinstance(e, nltk.tree.Tree)]

e = [entity for entity in
     [getEntities(node) for node in entities] if len(entity) > 0]

final = []
for lst in e:
    final = final + lst
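As an aside, repeated list concatenation like that is quadratic in the number of entities; itertools.chain does the same flattening in one pass, and the result feeds directly into the counting step below. A stdlib sketch on made-up entity tuples:

```python
# Flatten per-sentence entity lists with itertools.chain instead of
# repeated list concatenation (sample data is invented).
from collections import Counter
from itertools import chain

per_sentence = [
    [("ORGANIZATION", "Congress")],
    [("GPE", "United States"), ("ORGANIZATION", "Congress")],
]

final = list(chain.from_iterable(per_sentence))
counts = Counter(final)
print(counts.most_common(1))  # [(('ORGANIZATION', 'Congress'), 2)]
```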

There are a few interesting examples here – in some cases NLTK was able to combine multi-word names successfully, but not in all of them. I think it loses track of the “of” in the middle of some names.

('ORGANIZATION', u'General Services')
('ORGANIZATION', u'Congress')
('ORGANIZATION', u'Representatives')
('GPE', u'United States')
('ORGANIZATION', u'Internal Revenue Code')

At last, we can count these and see who shows up the most:

from collections import Counter

Counter(final).most_common(20)
Out[152]:
[(('ORGANIZATION', u'House'), 185),
 (('GPE', u'United States Code'), 127),
 (('ORGANIZATION', u'Congress'), 126),
 (('ORGANIZATION', u'Representatives'), 107),
 (('GPE', u'United States'), 96),
 (('ORGANIZATION', u'Committee'), 89),
 (('ORGANIZATION', u'OBRA'), 56),
 (('ORGANIZATION', u'Clerk'), 45),
 (('ORGANIZATION', u'Large'), 45),
 (('ORGANIZATION', u'Archivist'), 44),
 (('GPE', u'United States Statutes'), 43),
 (('ORGANIZATION', u'Senate'), 37),
 (('ORGANIZATION', u'House Administration'), 34),
 (('ORGANIZATION', u'Social'), 32),
 (('GPE', u'Pub'), 27),
 (('ORGANIZATION', u'PARCHMENT'), 20),
 (('ORGANIZATION', u'REQUIREMENT FOR'), 20),
 (('PERSON', u'Tables'), 17),
 (('ORGANIZATION', u'Public'), 17),
 (('ORGANIZATION', u'SUBSEQUENT'), 12)]

You can see there is some noise at the end there – “PARCHMENT”, “SUBSEQUENT”, etc. This is likely due to the legal profession’s fondness for using capital letters to represent emphasis, where NLTK assumes more conventional English capitalization. Some pre-processing of the texts would likely improve this. Notably, “Committee”, “Clerk”, and “Archivist” are popular – with better context, the generic “Committee” would likely resolve into specific committees.
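One cheap pre-processing pass along those lines – a sketch of an idea, not something the analysis above actually ran – is to down-case tokens that are entirely upper-case before tagging, while keeping short all-caps tokens that may be genuine acronyms like “OBRA”:

```python
# Normalize shouting-case tokens before NER (illustrative sketch).
# Long all-caps words are treated as typographic emphasis and
# converted to title case; short ones are kept as possible acronyms.
def normalize_caps(tokens, keep_acronyms_len=4):
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > keep_acronyms_len:
            out.append(tok.title())
        else:
            out.append(tok)
    return out

print(normalize_caps(["PARCHMENT", "REQUIREMENT", "OBRA", "Congress"]))
```

The threshold is a guess; it trades losing long acronyms against keeping emphasis words, and a curated acronym list would do better.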