[Python-3000] Refactoring tool available (work in progress)

In the sandbox I've been working on a refactoring tool, which could form the basis for a Python 2.x -> 3.0 conversion tool. I'd like to invite folks here to give it a try and give me a hand. It certainly needs more work, but I think that the basic infrastructure is sound. Check out sandbox/2to3/: http://svn.python.org/view/sandbox/trunk/2to3/. This message is to invite feedback, and to encourage contributions. It would be great if people tried their hands at writing new transformations!

A brief description of how it works: We start with a slightly modified version of tokenize.py. The tokens are passed to a parsing engine (the Python re-implementation of pgen that I wrote at Elemental), also slightly modified, which builds up a concrete parse tree. Through some simple tricks (this is what I had to modify tokenize.py for), the parse tree contains annotations that hold on to all comments and whitespace between tokens. This allows us to reconstruct the original source *exactly*. (I tested this on all .py files in the Python distribution and every single file got reconstructed exactly.)

Once we've built a parse tree for a file we can start refactoring. This is done by traversing the parse tree looking for nodes that match a given pattern; for matching nodes, we invoke a transformation method that returns a replacement node (or None if further inspection shows we can't actually do it despite the pattern matching); then this node is grafted into the tree in place of the originally matching node.

Originally I created matching patterns by constructing a little tree of "pattern nodes". This quickly got tedious, so I designed a simple language for creating patterns (inspired by the language used to describe Python's grammar). The transformation routine is still just Python code; I would like to be able to use a more compact notation for this too, but it's more complicated since (in practice) there is a lot of ad-hoc testing that goes on.
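As an aside, the standard library's tokenize module offers a related (though weaker) round-trip guarantee that illustrates the idea: untokenize() on full 5-tuple tokens reconstructs the original spacing and comments. This is only an analogy to the annotated parse tree described above, not the tool's actual implementation:

```python
import io
import tokenize

# A snippet with a comment and non-trivial whitespace.
source = "x = 1  # keep this comment\nif x:\n    print(x)\n"

# Tokenize, then rebuild the source from the full 5-tuple tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
rebuilt = tokenize.untokenize(tokens)

assert rebuilt == source  # comments and spacing survive the round trip
```

The parse-tree annotations go further than this, since they survive tree surgery, not just a straight replay of the tokens.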
Well, have a look at the three example transformations in the code (in the "fixes" subdirectory). There's a driver program (refactor.py) that lets you repeat this process for any number of files or for all Python files in a given directory tree, with command line switches to select which transformations to apply and whether to actually write back the transformed files or just show the diffs that *would* be applied (the default).

Let me describe how the "apply" transformation works, from a high level. This should change apply(f, x, y) into f(*x, **y) (and similar if y is missing). The pattern looks like this, in first approximation:

    power< 'apply' trailer< '(' arglist< any ',' any [',' any] > ')' > >

This references the following rules from the Python grammar (I'm only showing the parts that matter for this example):

    power: atom trailer* ['**' factor]
    atom: NAME | ...
    trailer: '.' NAME | '[' subscriptlist ']' | '(' [arglist] ')'
    arglist: argument (',' argument)*

In the pattern language, "power<...>" means "match a node of type 'power' whose subnodes match ...". In this example, the ... is 'apply' trailer<...>; then those ... are filled in with '(' arglist<...> ')' and finally those ... are: any ',' any [',' any]. Here 'any' is a special token that matches every possible node, and [...] indicates an optional part. Stuff in quotes of course means literal tokens; we use this to match specific keywords, identifiers or operators. Patterns match only the formal grammar; they completely ignore the whitespace and comments that are included in the parse tree as annotations.

The pattern language has an additional useful feature: you can associate names with subpatterns, and the pattern matching process (if successful) will return a dictionary that maps those names to the tree nodes that matched those subpatterns. The syntax for this is currently to prefix the subpattern with "NAME="; I'm not sure if this is the most readable syntax but it's the best I could come up with.
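For concreteness, here is the before/after of the apply transformation as runnable code. The Python 3 form is shown live (apply() itself only exists in Python 2, so the 2.x input appears as a comment); f, args, and kwargs are just illustrative names:

```python
def f(a, b, c=0):
    return a + b + c

args = (1, 2)
kwargs = {"c": 3}

# Python 2 input:       apply(f, args, kwargs)
# Output of the fixer:  f(*args, **kwargs)
result = f(*args, **kwargs)

# With the optional third argument missing:
# Python 2 input:       apply(f, args)
# Output of the fixer:  f(*args)
result2 = f(*args)
```

Both forms are exactly equivalent at runtime, which is what makes this a safe mechanical rewrite (assuming apply hasn't been rebound, as discussed below).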
The final pattern language feature is negative look-ahead matches; something like (not xyzzy) forces the match to fail if the pattern xyzzy matches at this point. (I wanted the syntax to be !xyzzy but tokenize.py doesn't like that because ! isn't a valid Python delimiter.)

So now we have identified a node that matches the pattern. The transformation extracts the subnodes corresponding to f, x and y (if present), and then constructs a new node of the form f(*x, **y). There's one complication worth mentioning: when the original code is apply(f+y, a, b) then the transformation must be (f+y)(a, b). We add the parentheses only when really needed though.

There are some unsolved problems. First of all, we don't have a symbol table. If apply is redefined in a particular scope, we don't notice this and still apply the transformation. And of course apply is a relatively simple case (being a free variable); has_key is more complicated, e.g. imagine x.y.z.has_key(a).

Second, the transformations can sometimes lose comments. For example, if the input looks something like (x.has_key #blah (y)) the #blah comment is lost, since there is no good place to put it in the translation "(y in x)". Part of this problem is solvable by making the transformations preserve comments more carefully; but sometimes there just isn't a good place to put them. Maybe this is rare enough in practice that we could just leave such cases untranslated, instead flagging them as code that requires manual conversion.

I also expect to run into problems with transformations for things like .keys(); right now the best solution I can think of is to translate xxx.keys() into list(xxx.keys()) and translate xxx.iterkeys() into iter(xxx.keys()). That's ugly; but I can't think of anything better without doing a ton of whole-program analysis, which I'd like to avoid.
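The has_key and .keys() translations mentioned above look like this in practice (Python 3 shown live; the 2.x originals appear as comments, and the dictionary d is just an illustrative example):

```python
d = {"a": 1, "b": 2}

# Python 2: d.has_key("a")     ->  Python 3: "a" in d
found = "a" in d

# Python 2: ks = d.keys()      ->  conservative: ks = list(d.keys())
# (wrapping with list() preserves the 2.x semantics, where keys()
# returned a real list that could be indexed, sorted, or mutated)
ks = sorted(list(d.keys()))

# Python 2: it = d.iterkeys()  ->  it = iter(d.keys())
it = iter(d.keys())
first = next(it)
```

The list() wrapper is redundant whenever the result is only iterated over, which is why this translation is safe but ugly; removing it would require knowing how the result is used, i.e. the whole-program analysis mentioned above.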
The .keys() transformation is also likely to make the tool non-idempotent; that's perhaps unavoidable, but still unfortunate, since it means you have to keep track of which files have already received which transformations.

Finally, I have a dream: a GUI that will let you do this interactively, sort of like query-replace in Emacs. But this message is already too long, so I'll stop for now. Thanks for reading this far. :-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)