The U.S. Congress recently released a series of XML documents containing U.S. laws. The structure of these documents allows us to find which sections of the law are most commonly cited. Examining which citations occur most frequently gives us a view of what Congress has spent the most time thinking about.

Citations occur for many reasons: to justify an addition or omission in a subsequent law, or to clarify, amend, or repeal an earlier one. As we might expect, the most commonly cited sections involve the IRS (income taxes, specifically), Social Security, and military procurement.

To arrive at this result, we must first see how the U.S. Code is laid out. The laws are divided into a hierarchy of units, which allows anything from an entire title to an individual sentence to be cited. Each unit has an ID and an identifier. The “identifier” is used as a citation reference within the XML documents and has a different form from the citations used by the legal community, which come in a form like “25 USC Chapter 21 § 1901”.
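To make the difference between the two forms concrete, here is a rough sketch that turns an XML identifier into something closer to a legal-style citation. It assumes the slash-separated layout seen in the sample identifier later in this post ('/us/usc/t2/s2102/b'), with a "t" prefix for the title and an "s" prefix for the section; that layout is inferred from one example, and the function names are hypothetical.

```python
def split_identifier(identifier):
    """Split '/us/usc/t2/s2102/b' into its non-empty path segments."""
    return [part for part in identifier.split('/') if part]


def rough_citation(identifier):
    """Build a rough 'TITLE USC § SECTION(sub)' string, or None if the
    identifier does not follow the assumed /us/usc/t<N>/s<N>/... layout."""
    parts = split_identifier(identifier)
    if len(parts) < 4 or parts[:2] != ['us', 'usc']:
        return None
    title, section = parts[2], parts[3]
    # Assumption: 't' marks the title segment and 's' the section segment.
    if not (title.startswith('t') and section.startswith('s')):
        return None
    subdivisions = ''.join('(%s)' % p for p in parts[4:])
    return '%s USC § %s%s' % (title[1:], section[1:], subdivisions)


usc_citation = rough_citation('/us/usc/t2/s2102/b')
public_law = rough_citation('/us/pl/109/280/s601/a/3')
print(usc_citation)  # 2 USC § 2102(b)
print(public_law)    # None: public-law references use a different layout
```

This only handles title/section identifiers; anything else (chapters, public laws) is returned as None rather than guessed at.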

The XML hierarchy defines seventeen different levels which can be cited: ‘title’, ‘subtitle’, ‘chapter’, ‘subchapter’, ‘part’, ‘subpart’, ‘division’, ‘subdivision’, ‘article’, ‘subarticle’, ‘section’, ‘subsection’, ‘paragraph’, ‘subparagraph’, ‘clause’, ‘subclause’, and ‘item’.

We can use a simple XPath expression to retrieve one of these, like section:

    § 104. Federal Highway Administration
    (a) The Federal Highway Administration is an administration in the Department of Transportation.
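As a minimal sketch of what such a lookup could look like, the snippet below runs a namespace-qualified XPath query against a tiny inline document. The namespace is the one used in the code later in this post; the sample XML is a hypothetical, heavily simplified stand-in for a real file, and `find_sections` is an illustrative name.

```python
from xml.etree import ElementTree as ET

# Namespace used by the U.S. Code XML files (taken from the code in this post).
NS = '{http://xml.house.gov/schemas/uslm/1.0}'

# Hypothetical, heavily simplified sample; real files carry far more
# elements and attributes.
SAMPLE = """<uscDoc xmlns="http://xml.house.gov/schemas/uslm/1.0">
  <title identifier="/us/usc/t49">
    <num>Title 49</num>
    <heading>TRANSPORTATION</heading>
    <section identifier="/us/usc/t49/s104">
      <num>Sec. 104.</num>
      <heading>Federal Highway Administration</heading>
    </section>
  </title>
</uscDoc>"""


def find_sections(xml_text):
    """Return (identifier, num, heading) for every section element."""
    root = ET.fromstring(xml_text)
    results = []
    # './/' searches all descendants; the tag must carry the namespace.
    for section in root.findall('.//' + NS + 'section'):
        num = section.find(NS + 'num')
        heading = section.find(NS + 'heading')
        results.append((
            section.get('identifier'),
            num.text if num is not None else None,
            heading.text if heading is not None else None,
        ))
    return results


sections = find_sections(SAMPLE)
print(sections)
```

Swapping 'section' for any of the other level names listed above should retrieve that level instead.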

Part of the human-readable citation is contained in “num”. To retrieve a citation that a lawyer would recognize, we need to look at “num” for the parent elements as well.

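    # ElementTree ships with the Python standard library
    from xml.etree import ElementTree as ET
    import os

    dir = "G:\\us_code\\xml_uscAll@113-21"

    def getParent(parent_map, elt, idx):
        # Walk idx levels up from elt (idx 0 is elt itself) and return
        # that element's "num heading" text.
        try:
            parent = elt
            for i in range(idx):
                parent = parent_map.get(parent)
            return (
                parent.findall('{http://xml.house.gov/schemas/uslm/1.0}num')[0].text
                + ' '
                + parent.findall('{http://xml.house.gov/schemas/uslm/1.0}heading')[0].text
            )
        except (AttributeError, IndexError, TypeError):
            # We ran off the top of the tree, or the element has no num/heading
            return "--No Heading--"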

Once we find the parent, we need to traverse all the way up the tree:

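    def getTree(parent_map, t):
        # Collect the "num heading" text of t and each of its ancestors,
        # stopping at the first element without a heading.
        tree = []
        parent = ""
        idx = 0
        while parent != "--No Heading--":
            parent = getParent(parent_map, t, idx)
            tree.append(parent)
            idx += 1
        return tree

    usc26.xml: Title 26— Subtitle A— CHAPTER 1—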
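ElementTree elements do not keep a pointer to their parent, which is why the functions above rely on a separate child-to-parent map. The following self-contained sketch shows the same idea on a toy document; the sample XML omits the namespace for brevity, and `ancestor_labels` is a hypothetical helper, not part of the code in this post.

```python
from xml.etree import ElementTree as ET

# Toy document without a namespace, for illustration only.
SAMPLE = (
    "<title><num>Title 1</num>"
    "<chapter><num>CHAPTER 1</num>"
    "<section><num>Sec. 1.</num></section>"
    "</chapter></title>"
)


def ancestor_labels(root, target_tag):
    """Return the <num> text of the first target_tag element and of each
    of its ancestors, innermost first."""
    # Every element is visited once as a parent; its direct children are
    # recorded as keys pointing back to it.
    parent_map = {child: parent for parent in root.iter() for child in parent}
    node = root.find('.//' + target_tag)
    labels = []
    while node is not None:
        num = node.find('num')
        labels.append(num.text if num is not None else None)
        node = parent_map.get(node)  # None once we pass the root
    return labels


root = ET.fromstring(SAMPLE)
labels = ancestor_labels(root, 'section')
print(labels)  # ['Sec. 1.', 'CHAPTER 1', 'Title 1']
```

Building the map costs one pass over the document, after which each step up the tree is a single dictionary lookup.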

This forms the basis for a function which builds a citation index: a list of every XML node that can be used in a citation, along with its human-readable citation and name. This takes some time, so if you reproduce this effort, you may want to save the results to a file.
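One way to save such an index is to write it out as JSON, sketched below. The file name and the `save_index`/`load_index` helpers are hypothetical, the sample entry is an abbreviated copy of one shown in this post, and the sketch assumes every key is a string (an element without an identifier would need to be dropped or renamed first).

```python
import json
import os
import tempfile


def save_index(index, path):
    """Write the citation index to a JSON file."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(index, fh, ensure_ascii=False)


def load_index(path):
    """Read the index back, restoring the (id, ancestors, file) tuples.

    JSON has no tuple type, so each value comes back as a list."""
    with open(path, encoding='utf-8') as fh:
        raw = json.load(fh)
    return {key: (value[0], value[1], value[2]) for key, value in raw.items()}


# Abbreviated sample entry in the same shape as the index described above.
index = {
    '/us/usc/t2/s2102/b': (
        'id8a923648-f59b-11e2-8dfe-b6d89e949a2c',
        ['(b) Issuance and publication of regulations', '--No Heading--'],
        'usc02.xml',
    ),
}

path = os.path.join(tempfile.mkdtemp(), 'citation_index.json')
save_index(index, path)
restored = load_index(path)
print(restored == index)  # True
```

With a cached file in place, the slow directory walk only has to run once.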

    dir = "G:\\us_code\\xml_uscAll@113-21"

    def findElements(xpath, urls):
        # Index every element matching xpath, keyed by its 'identifier' attribute
        for root, dirs, files in os.walk(dir):
            for f in files:
                if f.endswith('.xml'):
                    # Join with root so files in subdirectories are found too
                    tree = ET.parse(os.path.join(root, f))
                    # ElementTree has no parent pointers: build a child -> parent map
                    parent_map = dict((c, p) for p in tree.iter() for c in p)
                    sections = tree.findall(xpath)
                    for t in sections:
                        urls[t.attrib.get('identifier')] = \
                            (t.attrib.get('id'), getTree(parent_map, t), f)

    refs = {}
    refTypes = ['title', 'subtitle', 'chapter',
                'subchapter', 'part', 'subpart', 'division',
                'subdivision', 'article', 'subarticle', 'section',
                'subsection', 'paragraph', 'subparagraph', 'clause',
                'subclause', 'item']

    for ref in refTypes:
        findElements('.//{http://xml.house.gov/schemas/uslm/1.0}' + ref, refs)

    list(refs.items())[20]

    ('/us/usc/t2/s2102/b',
     ('id8a923648-f59b-11e2-8dfe-b6d89e949a2c',
      ['(b) Issuance and publication of regulations',
       u'\xa7\u202f2102. Duties of Commission',
       u'Part B\u2014 Senate Commission on Art',
       u'SUBCHAPTER V\u2014 HISTORICAL PRESERVATION AND FINE ARTS',
       u'CHAPTER 30\u2014 OPERATION AND MAINTENANCE OF CAPITOL COMPLEX',
       u'Title 2\u2014 THE CONGRESS',
       '--No Heading--'],
      'usc02.xml'))

Now that we know how to look up a citation, we need to find the actual citations. Like HTML, the U.S. Code documents use “a href=” to reference a node, as well as “ref href=”. The same XPath technique used above allows us to find refs:

    hrefs = {}
    refpath = './/{http://xml.house.gov/schemas/uslm/1.0}ref'

    for root, dirs, files in os.walk(dir):
        for f in files:
            if f.endswith('.xml'):
                tree = ET.parse(os.path.join(root, f))
                # Map each cited href to "<file> <citation text>"; a repeated
                # href keeps only the last file and text seen
                h = {t.attrib.get('href'): f + ' ' + (t.text or '')
                     for t in tree.findall(refpath)}
                hrefs.update(h)

    list(hrefs.items())[0]

    ('/us/pl/109/280/s601/a/3', u'usc29.xml Pub. L. 109\u2013280, title VI, \xa7\u202f601(a)(3)')

We have everything we need to find which sections are commonly cited; we just need to combine the two. Most of the complexity here is in dealing with missing entries (for example, because a citation can point anywhere in the hierarchy).

    from collections import Counter

    def countCitations(urls, hrefs):
        titles = Counter()
        subtitles = Counter()
        chapters = Counter()
        not_found = []
        for key in hrefs.keys():
            found = urls.get(key)
            # Defaults for citations that cannot be resolved
            title = "None"
            subtitle = "None"
            chapter = "None"
            file = "None"
            if found is not None:
                (id, history, file) = found
                # history ends with the "--No Heading--" sentinel, so the
                # title sits at -2, the next level down at -3, and so on
                if len(history) >= 2:
                    title = history[-2]
                if len(history) >= 3:
                    subtitle = history[-3]
                if len(history) >= 4:
                    chapter = history[-4]
            else:
                not_found.append(key)
            titles[file + ": " + title] += 1
            subtitles[file + ": " + title + " - " + subtitle] += 1
            chapters[file + ": " + title + " - " + subtitle + " - " + chapter] += 1
        return (titles, subtitles, chapters, not_found)

    (t, s, c, none) = countCitations(refs, hrefs)

This returns results that are rolled up to titles, subtitles, and chapters. Note in particular how, as we drill down, each level clarifies what was most important in the prior one. Within “The Public Health and Welfare,” we see that Social Security is important, and within “Armed Forces,” we see that “General Military Law – Personnel” is important.

    None: None: 359662
    usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE: 6679
    usc10.xml: Title 10— ARMED FORCES: 2078
    usc16.xml: Title 16— CONSERVATION: 2068
    usc42.xml: None: 1965
    usc15.xml: Title 15— COMMERCE AND TRADE: 1796
    usc07.xml: Title 7— AGRICULTURE: 1689
    usc22.xml: Title 22— FOREIGN RELATIONS AND INTERCOURSE: 1684
    usc20.xml: Title 20— EDUCATION: 1660
    usc26.xml: Title 26— INTERNAL REVENUE CODE: 1610

    None: None - None: 359662
    usc42.xml: None - None: 1965
    usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE - CHAPTER 7— SOCIAL SECURITY: 1573
    usc10.xml: Title 10— ARMED FORCES - Subtitle A— General Military Law: 1490
    usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE - CHAPTER 6A— PUBLIC HEALTH SERVICE: 1220
    usc26.xml: Title 26— INTERNAL REVENUE CODE - Subtitle A— Income Taxes: 841
    usc05.xml: None - None: 736
    usc10.xml: None - None: 639
    usc16.xml: Title 16— CONSERVATION - CHAPTER 1— NATIONAL PARKS, MILITARY PARKS, MONUMENTS, AND SEASHORES: 616
    usc20.xml: Title 20— EDUCATION - CHAPTER 70— STRENGTHENING AND IMPROVEMENT OF ELEMENTARY AND SECONDARY SCHOOLS: 531

    None: None - None - None: 359662
    usc42.xml: None - None - None: 1965
    usc26.xml: Title 26— INTERNAL REVENUE CODE - Subtitle A— Income Taxes - CHAPTER 1— NORMAL TAXES AND SURTAXES: 817
    usc10.xml: Title 10— ARMED FORCES - Subtitle A— General Military Law - PART II— PERSONNEL: 738
    usc05.xml: None - None - None: 736
    usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE - CHAPTER 7— SOCIAL SECURITY - SUBCHAPTER XVIII— HEALTH INSURANCE FOR AGED AND DISABLED: 663
    usc10.xml: None - None - None: 639
    usc38.xml: None - None - None: 497
    usc10.xml: Title 10— ARMED FORCES - Subtitle A— General Military Law - PART IV— SERVICE, SUPPLY, AND PROCUREMENT: 496
    usc15.xml: None - None - None: 428
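Listings like the three above could be produced with `Counter.most_common`. The sketch below is one possible way to do it, with an optional flag that hides entries whose title could not be resolved; `top_entries` and `skip_unresolved` are hypothetical names, and the sample counts are copied from the first listing.

```python
from collections import Counter


def top_entries(counter, n=10, skip_unresolved=False):
    """Return up to n (key, count) pairs, most frequent first.

    With skip_unresolved=True, keys whose title is the "None" placeholder
    (they contain ': None') are left out."""
    rows = []
    for key, count in counter.most_common():
        if skip_unresolved and ': None' in key:
            continue
        rows.append((key, count))
        if len(rows) == n:
            break
    return rows


# A few counts copied from the title-level listing above.
titles = Counter({
    'None: None': 359662,
    'usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE': 6679,
    'usc10.xml: Title 10— ARMED FORCES': 2078,
    'usc42.xml: None': 1965,
})

all_rows = top_entries(titles, n=3)
resolved_rows = top_entries(titles, n=3, skip_unresolved=True)

for key, count in resolved_rows:
    print('%s: %d' % (key, count))
```

Filtering on the key text is crude; keeping the unresolved citations in a separate counter would be a cleaner alternative.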

Future work in this area will involve cleaning up the results to remove some of the “None” entries, building a visualization of the results, and training a tagger to recognize the human-readable versions of citations in court documents. In the long run, I hope these developments help make legal information more accessible to everyone, rather than leaving it locked up in expensive databases.