
I found an interesting problem while working on a test case generator for the Tsukurimashou Project. The thing is that I'd like to assign an identifying code, which I will call an address, to each line of code in a code base. It's to be understood that these addresses have nothing to do with machine memory addresses, and they need not be sequential; they are just opaque entities that designate lines of code. Anyway, I would like lines of code to keep the same addresses, at least probabilistically, when the program is modified, so that when I collect test information about a line of code I can still keep most of it after I update the software.

The easiest and most traditional thing to do is to address each line of code by the filename that contains it (path, in a multi-level directory structure) and the line number. The first line in the file is line 1, the next is line 2, and so on. This is simple, and it is completely robust to changes in other lines as long as we don't add or remove any. Adding or removing a line alters the addresses of all subsequent lines in the file, and it does so in an especially annoying way because the old addresses remain valid addresses, but they now point to different lines of code. So if I record information about line 1000 and then I delete line 500, then the information for line 1000 now points at the former line 1001 and I have no way to know that the addresses have been screwed up. That's worse than rendering the old address completely invalid (no longer pointing to anything); better still would be for it to keep pointing at the new location of the unchanged line.

Another obvious thing to do would be to feed each line into a hash function. One would typically use a cryptographic hash function like SHA-256, even though this particular application does not require cryptographic security and just makes use of other properties cryptographic hash functions usually happen to have. The address of the line of code then expresses what is actually written on that line of code. No changes to other lines will have any effect on the address of this line. On the other hand, if two lines should happen to be identical, they end up with the same address and no way to distinguish them. And maybe sometimes, changes in other lines really should change or invalidate the current line's address. For instance, changing the condition on an if statement probably means that whatever testing we're doing on the single-line body will have to change; maybe in that case we want the address for the body line to change because of the change in the previous line, even if the actual text of the body line has not changed.
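In Python terms, per-line hashing is just a few lines (a minimal sketch; the function name is mine):

```python
import hashlib

def line_addresses(lines):
    """Address each line by the SHA-256 digest of its own text alone."""
    return [hashlib.sha256(line.encode("utf-8")).hexdigest()
            for line in lines]

addrs = line_addresses(["x = 0;", "x++;", "y = x;", "x++;"])
# The two "x++;" lines get identical addresses -- the duplicate-line
# problem described above.
```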

One idea for the next step would be to use a sliding window. For each line, take a five-line chunk of the program code (two lines before, through two lines after, or whatever) and hash that to create an address for the middle line. This way, changing one line changes a block of addresses of nearby lines but none further away. We can choose the window size to suit how much locality we want the addresses to have. That amount of locality has to be a fixed number of lines, though, and it's not obvious what the right number of lines is, nor that any single answer to that question will be right everywhere in the code base. The issue of duplicate lines having duplicate addresses is reduced (because the entire window would have to be duplicated) but not eliminated.
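A sketch of the sliding-window variant might look like this (the function name is mine, and the radius parameter is exactly the kind of magic number the scheme forces us to choose):

```python
import hashlib

def window_addresses(lines, radius=2):
    """Address each line by hashing the window of lines within
    `radius` lines of it (clipped at the file boundaries)."""
    addrs = []
    for i in range(len(lines)):
        window = lines[max(0, i - radius):i + radius + 1]
        blob = "\n".join(window).encode("utf-8")
        addrs.append(hashlib.sha256(blob).hexdigest())
    return addrs
```

Editing one line now perturbs at most 2×radius+1 addresses, but the right radius remains a guess.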

Another idea would be to do a baby parse of the code and use that information in the address. Instead of just saying "the line that says x++;", which is what a hash of single lines might express, we could say "the line that says x++; in the function Foobar() in the namespace __Baz_Quux..." and so on. This could very well express the semantics of where the line of code is; but it means building a baby parser for the language in question, and making sure that whatever the parser determines really expresses the things we want the address to express. I'd like to be more agnostic about the contents of the files, and build something that expresses a flexible degree of locality without having built-in assumptions about what language the files are written in and what semantics are relevant.

So here are some thoughts in that direction. If we did build the baby parser, it would probably work by finding a tree structure for the lines in the file. This line is in the body of an if statement, so its parent is the line that contains the "if" keyword; then that if statement is part of a function definition, so its parent is the line with the function identifier on it, and so on. Addresses coming out of the baby-parse method would consist of paths in that tree structure, from the line itself up to the root. It might make sense to put the path, once we find it, through a cryptographic hash function much as we earlier did with individual lines, just to turn it into a well-behaved fixed-size chunk of bits instead of a string or more complicated thing; but that's a minor detail. The core concept is that a semantic address for a line, coming from a parse of the language, would express the path back to the root in the tree structure that comes from parsing the file.

Now let's talk about mountains. One thing people who are interested in mountains (topographers, climbers, and so on) do is they build hierarchies of mountains. There are a few ways to do this, but one of the simplest is to talk about the "line parents" of mountains. From any given mountain peak, you can uniquely define the so-called line parent, which is the nearest peak of higher elevation (subject to a few other conditions that are not relevant to the current discussion but are intended to filter out uninteresting cases). Once you find the line parent for one peak, you can then ask what is the line parent of that line parent (thus an even higher mountain), and continue recursively until you end up at the highest peak you're willing to consider. That root would be Everest if you do this for the entire world, but people usually stop the process at the boundaries of land masses, so there is a separate hierarchy for each continent. The line parent relationship creates a tree structure on mountain peaks. The "lineage" of a peak, which is just the sequence of line parents up to the root of the tree, describes where the peak is located in a way that progresses from very local considerations (nearby peaks) up to the global. If I were trying to tell you where a minor peak you've never heard of was located, I could start naming line parents up the tree until we got to one you recognized, and thus give you a pretty good description of the peak's relations to other mountains.

Suppose God raises a new medium-sized mountain peak somewhere. What happens to existing peaks? Those very near the new one are likely to change their lineage. Those further away are less likely. The influence of the new peak is greatest nearby and then drops off the further away we get. Similar things happen with removing, or changing the height of, an existing peak. The influence is mostly local, but with no really sharp boundary to how far it extends. And there are no magic numbers like window size involved; the entire definition can be called "parameter free," flowing from the sole assumption that we care about height and distance in the first place. Parameter-free locality is something like the property we might like addresses of code lines to have.

So can we define line parents for lines of code, without compromising on language agnosticism? I think we can, and the way to do it flows from the earlier idea of using hashes. Here's what I propose to do:

1. Hash every line, and consider the hashes as (very long) integers.

2. From a line, find the nearest line forward or backward with a greater hash value. That is the parent of the current line.

3. Find the parent of the parent, and recurse up to the line with the greatest hash value in the entire file.

4. The sequence of increasing hash values and forward/backward directions is the address for the line where we started; it could itself be hashed to convert it to a well-behaved number.

Using greater-hash lines as reference points is something like the baby parse idea of using if-statement heads, function definition lines, and so on; but unlike the baby parse, this method is parameter free. It has (like the baby parse) reasonably well-behaved properties for locality, with small changes elsewhere in the file having only a small chance of changing the current line's address, greater for nearby changes and less for distant changes. And unlike the line numbering method, a change that invalidates an address will tend to invalidate it completely, not just leave the address valid but pointing elsewhere.

It is necessary to have a way of breaking ties, both on the question of "nearest" code line and (since duplicate lines may well happen, therefore duplicate hashes even with a long collision-resistant hash) on the question of "greatest" hash value. The tiebreaking method must not break any of the other properties of the system. A reasonable way to do it would be to declare that between two lines at equal distance from the current one, the first (leftmost, least index value) shall be considered "closer"; and between two lines with equal hash value, the first (leftmost, least index value) shall be considered to have a "greater" value.
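To make the proposal and its tiebreaking rules concrete, here is a naive sketch in Python. The function names are mine, the quadratic parent search is written for clarity rather than speed, and including the starting line's own hash in the path is my reading of "the sequence of increasing hash values", not something the text spells out:

```python
import hashlib

def line_keys(lines):
    """Hash each line; treat the digests as (very long) integers."""
    return [int(hashlib.sha256(l.encode("utf-8")).hexdigest(), 16)
            for l in lines]

def greater(keys, j, i):
    """Does line j count as having a greater hash than line i?
    Equal hashes break leftward: the lesser index wins."""
    return keys[j] > keys[i] or (keys[j] == keys[i] and j < i)

def parent(keys, i):
    """Nearest line counting as greater than line i, or None for the
    overall maximum.  Equal distances also break leftward."""
    best = None
    for j in range(len(keys)):
        if j == i or not greater(keys, j, i):
            continue
        if (best is None or abs(j - i) < abs(best - i)
                or (abs(j - i) == abs(best - i) and j < best)):
            best = j
    return best

def address(lines, i):
    """Walk the parent chain from line i, recording a direction
    (B for backward, F for forward) and the parent's hash at each
    step, then hash the whole path into one fixed-size identifier."""
    keys = line_keys(lines)
    path = [f"{keys[i]:x}"]
    while True:
        p = parent(keys, i)
        if p is None:
            break
        path.append(("B" if p < i else "F") + f"{keys[p]:x}")
        i = p
    return hashlib.sha256("|".join(path).encode("utf-8")).hexdigest()
```

Because each step of the chain strictly increases under the (hash, leftmost-wins) ordering, the walk always terminates at the file's maximum.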

Now there is an interesting algorithmic question: how fast can we compute these? To define the problem precisely: given an array A of n real numbers, find for every element A[i] (except the maximum, which you must also find) the index j such that A[j]>A[i] and |i-j| is minimized, breaking ties as described above. We might call this the line parent problem. What is the time complexity of this problem?

This may be a previously-studied problem, but I don't recall having seen it before in exactly this form. It's quite possible it could be solved easily using some other well-known problem, or that someone has solved it incidentally as a step in building something else (as is my own situation).

It's trivial to find all the line parents in O(n²) worst-case time (check all pairs of elements). As a general rule, things that are trivial to do in O(n²) often turn out to really be O(n log n) when you try harder. I think I can actually do it in O(n) time and space, which is obviously optimal up to constant factors, but I'll feel more confident that I haven't missed any details after I finish writing the code for it. See if you can come up with a fast solution yourself.
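I won't claim this is the author's intended solution, but one standard route to linear time for "nearest greater element" problems is a pair of monotonic-stack sweeps, one per direction, followed by a merge that applies the distance tiebreak. A sketch, with the hashes abstracted to an array of integer keys:

```python
def line_parents(keys):
    """All line parents in O(n) time via two monotonic-stack passes.

    Tiebreaks as described above: on the left, an equal key still
    counts as greater (leftmost wins on value); equal distances also
    resolve leftward.  Returns a parent index per element, with None
    for the overall maximum."""
    n = len(keys)
    left = [None] * n   # nearest j < i with keys[j] >= keys[i]
    right = [None] * n  # nearest j > i with keys[j] > keys[i]

    stack = []
    for i in range(n):
        # Pop strictly smaller keys; an equal key to the left still
        # counts as "greater" under the tiebreak, so it stays.
        while stack and keys[stack[-1]] < keys[i]:
            stack.pop()
        left[i] = stack[-1] if stack else None
        stack.append(i)

    stack = []
    for i in range(n - 1, -1, -1):
        # On the right, only a strictly greater key qualifies.
        while stack and keys[stack[-1]] <= keys[i]:
            stack.pop()
        right[i] = stack[-1] if stack else None
        stack.append(i)

    parents = []
    for i in range(n):
        l, r = left[i], right[i]
        if l is None and r is None:
            parents.append(None)          # the overall maximum
        elif l is None:
            parents.append(r)
        elif r is None:
            parents.append(l)
        elif i - l <= r - i:              # equal distance goes left
            parents.append(l)
        else:
            parents.append(r)
    return parents
```

Each index is pushed and popped at most once per pass, so the whole thing is O(n) time and space.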

Pushing it to two dimensions would be an interesting next step...