Congress frequently passes laws that amend or repeal sections of prior laws. This produces a series of edits to the law which programmers will recognize as resembling source control history.

In concept this is simple, but in practice it is complex. For instance, like a source control system, it must handle renumbering. As we will see below, it is possible to extract some data about links, but difficult to resolve what those links point to.

Here is an example paragraph where, rather than amending a law, the citation serves as a justification for why several words are absent in one section:

In subsection (b)(3)(C), the words “and the EPA Administrator may prescribe rules for purposes of carrying out this subparagraph” are omitted as surplus because of the authority of the Administrator to prescribe regulations under 49:32910(d). The amendment made by section 371(b)(2) of the North American Free Trade Implementation Act (Public Law 103–182, 107 Stat. 2128) is not given effect because the last sentence of section 503(b)(2)(E) of the Motor Vehicle and Cost Savings Act (Public Law 92–513, 86 Stat. 947) was omitted in the restatement of title 49 because of the authority of the Administrator to prescribe regulations under 49:32910(d).

We can find all of these references by searching the XML documents Congress provides with XPath expressions. Note that there are several forms of citation (“a href”, “ref href”), so if you want to do something specific you should consult the documentation (warning: PDF).

    import os
    import xml.etree.ElementTree as ET
    from collections import Counter

    REF = './/{http://xml.house.gov/schemas/uslm/1.0}ref'

    hrefs = {}              # href -> "<file name> <link text>" (last occurrence wins)
    ref_counts = Counter()  # href -> number of times it is cited

    for dirpath, dirs, files in os.walk("."):
        for f in files:
            if f.endswith('.xml'):
                # Join with dirpath so files in subdirectories open correctly.
                tree = ET.parse(os.path.join(dirpath, f))
                for t in tree.findall(REF):
                    href = t.attrib.get('href')
                    hrefs[href] = f + ' ' + (t.text or '')
                    ref_counts[href] += 1

This will take a few minutes to run. Once we have these counts, we can see which links are most frequent.

    counts = ref_counts.most_common(10)

Then we can print out the most common links, the files they are contained in, and their text. Note that you will want to use Python's print function (rather than the Python 2 print statement) to do this, because the text contains a lot of special Unicode characters, such as the section symbol.

    /us/pl/85/726 usc49.xml Public Law 85–726: 287
    /us/stat/72/731 usc49.xml 72 Stat. 731: 260
    /us/act/1936-06-29/ch858 usc47.xml act June 29, 1936, ch. 858: 240
    /us/act/1949-06-30/ch288 usc51.xml act June 30, 1949, ch. 288: 195
    None: 160
    /us/act/1950-05-05/ch169/s1 usc50.xml act May 5, 1950, ch. 169, § 1: 140
    /us/pl/88/365 usc49.xml Public Law 88–365: 111
    /us/stat/78/302 usc49.xml 78 Stat. 302: 111
    /us/stat/80/938 usc50.xml 80 Stat. 938: 75
    /us/pl/92/513 usc49.xml Public Law 92–513: 75

Unfortunately, there is currently no easy way to resolve these links. The documentation for the XML describes a hypothetical system that could resolve arbitrary links, distinguishing links within the provided XML, links to outside documents (other laws), and absolute URLs.
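Short of a real resolver, a first step is simply sorting hrefs into those three kinds. The following is a minimal sketch, not the system the documentation describes: the function name and bucket names are made up for illustration, and it assumes that identifiers follow a /jurisdiction/type/... layout (as in the output above) and that links into the U.S. Code itself use a "usc" type segment.

```python
from urllib.parse import urlparse


def classify_href(href):
    """Sort an href into a rough bucket. The bucket names are illustrative only."""
    if not href:
        return "missing"          # e.g. the None entry in the counts above
    if urlparse(href).scheme in ("http", "https"):
        return "absolute-url"
    parts = href.strip("/").split("/")
    # Assumed layout: /<jurisdiction>/<document type>/...
    if len(parts) >= 2 and parts[1] == "usc":
        return "us-code"          # assumption: points back into the U.S. Code
    if href.startswith("/"):
        return "other-law"        # public laws, statutes at large, acts, ...
    return "unknown"


for href in ["/us/pl/85/726", "/us/usc/t49/s32910", "https://example.com/x", None]:
    print(href, "->", classify_href(href))
```

This only labels a link; it says nothing about whether the target exists or which version of the text it refers to, which is the hard part.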

Even if we had this system, we would still need a way of traversing up the hierarchy. For instance, to answer a query like “which law or set of laws has been amended the most times,” we would need to find citations in bills that have passed and filter to the citations that amend prior laws. This may require following a chain of citations to determine what is in effect. Those citations would then need to be rolled up to the level of a statute, which ideally means stepping up a few levels in the XML and finding a nearby element that identifies the section title.
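The "stepping up a few levels" part can be sketched with the standard library. ElementTree elements do not know their parents, so the sketch builds a child-to-parent map first. The XML fragment here is a made-up, heavily simplified stand-in for a real file, and the assumption that a section carries its title in a heading child should be checked against the documentation.

```python
import xml.etree.ElementTree as ET

NS = "{http://xml.house.gov/schemas/uslm/1.0}"

# Hypothetical, simplified fragment; real files are far more deeply nested.
SAMPLE = """
<uscDoc xmlns="http://xml.house.gov/schemas/uslm/1.0">
  <section>
    <heading>Average fuel economy standards</heading>
    <note><p>Amended by <ref href="/us/pl/92/513">Public Law 92-513</ref>.</p></note>
  </section>
</uscDoc>
"""


def enclosing_section_heading(node, parents):
    """Walk up from an element to the nearest section and return its heading text."""
    while node is not None:
        if node.tag == NS + "section":
            heading = node.find(NS + "heading")
            return heading.text if heading is not None else None
        node = parents.get(node)  # None once we step past the root
    return None


root = ET.fromstring(SAMPLE)
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}

for ref in root.iter(NS + "ref"):
    print(ref.get("href"), "->", enclosing_section_heading(ref, parents))
```

Grouping the resulting headings with a Counter would give per-section citation counts, though deciding which of those citations actually amend anything still requires reading the surrounding text.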

Here, we’re only looking at U.S. federal law, but there are many other documents that contain legal references. If you are interested in this topic, you may want to read up on what KeyCite does. All in all, it’s quite a complex system.