Ocean of Awareness

Tue, 02 Oct 2018

Language popularity

GitHub's linguist is seen as the most trustworthy tool for estimating language popularity, in large part because it reports its result as the proportion of code in a very large dataset, instead of web hits or searches. It is ironic, in this context, that linguist avoids looking at the code, preferring to use metadata -- the file name and the vim and shebang lines. Scanning the actual code is linguist's last resort.
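
For example, even a file with no extension can often be classified from metadata in its first few lines. The two lines below are a generic illustration of such metadata (they are not taken from linguist's documentation or test data); both the shebang line and the vim modeline announce Perl without any need to examine the code that follows.

#!/usr/bin/env perl
# vim: set filetype=perl :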

How accurate is this? For files that are mostly in a single programming language, which is currently the majority of them, linguist's methods are probably very accurate.

But literate programming often requires mixing languages. It is perhaps an extreme example, but much of the code used in this blog post comes from a Markdown file, which contains both C and Lua. This code is "untangled" from the Lua by ad-hoc scripts. In my codebase, linguist identifies this code simply as Markdown. linguist then ignores it, as it does all documentation files.
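
To make the setup concrete, here is a hypothetical miniature of such a Markdown file -- the file and its function names are invented for illustration and are not from my actual codebase. An "untangling" script pulls the C block out into a .c file and the Lua block out into a .lua file, but linguist sees only a single Markdown document. The whole file might read:

    The scanner kernel is C:

    ```c
    /* hypothetical example code */
    int count_newlines(const char *s) {
        int n = 0;
        while (*s) { if (*s++ == '\n') n++; }
        return n;
    }
    ```

    The test driver is Lua:

    ```lua
    -- hypothetical example code
    for line in io.lines("example.txt") do print(line) end
    ```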

Currently, this kind of homegrown literate programming may be so rare that it is not worth taking into account. But if literate programming becomes more popular, that trend might well slip under linguist's radar. And even those with a lot of faith in linguist's numbers should be happy to know they could be confirmed by more careful methods.

Token-by-token versus line-by-line

linguist avoids reporting results based on looking at the code, because careful line counting for multiple languages cannot be done with traditional parsing methods. To do careful line counting, a parser must be able to handle ambiguity in several forms -- ambiguous parses, ambiguous tokens, and overlapping variable-length tokens.

The ability to deal with "overlapping variable-length tokens" may sound like a bizarre requirement, but it is not. Line-by-line languages (BASIC, FORTRAN, JSON, .ini files, Markdown) and token-by-token languages (C, Java, Javascript, HTML) are both common, and even today commonly occur in the same file (POD and Perl, Haskell's Bird notation, Knuth's CWeb).
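
Perl and its POD documentation are a handy small-scale instance. In the invented snippet below (not from this post's codebase), the =pod and =cut directives are line-oriented and must begin at the start of a line, while the Perl surrounding them is parsed token by token:

sub double { return 2 * shift }    # token-by-token Perl

=pod

This documentation block is line-by-line: the =pod and =cut
directives must each begin at the start of a line.

=cut

print double(21), "\n";            # back to token-by-token Perl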

Deterministic parsing can switch back and forth, though at the cost of some very hack-ish code. But for careful line counting, you need to parse line-by-line and token-by-token simultaneously. Consider this example:

    int fn () { /* for later
\begin{code}
   */ int fn2(); int a = fn2();
      int b = 42;
      return a + b; /* for later
\end{code}
*/ }

A reader can imagine that this code is part of a test case using code pulled from a LaTeX file. The programmer wanted to indicate the copied portion of code, and did so by commenting out its original LaTeX delimiters. GCC compiles this code without warnings.

It is not really the case that LaTeX is a line-by-line language. But in literate programming systems, it is usually required that the \begin{code} and \end{code} delimiters begin at column 0, and that the code block between them be a set of whole lines, so for our purposes in this post, we can treat LaTeX as line-by-line. For LaTeX, our parser finds

L1c1-L1c29 LaTeX line: "    int fn () { /* for later"
L2c1-L2c13 \begin{code}
L3c1-L5c31 [A CODE BLOCK]
L6c1-L6c10 \end{code}
L7c1-L7c5 LaTeX line: "*/ }"

Note that in the LaTeX parse, line alignment is respected perfectly: The first and last are ordinary LaTeX lines, the 2nd and 6th are commands bounding the code, and lines 3 through 5 are a code block.

The C tokenization, on the other hand, shows no respect for lines. Most tokens are a small part of their line, and the two comments start in the middle of a line and end in the middle of one. For example, the first comment starts at column 17 of line 1 and ends at column 5 of line 3.
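
To make the contrast concrete, a token-by-token report on the same input might look something like the following. The format and token names are only illustrative -- this is not the output of the post's actual tool -- but the locations follow from the example as laid out above.

L1c5-L1c7   C token "int"
L1c9-L1c10  C token "fn"
L1c12-L1c12 C token "("
L1c13-L1c13 C token ")"
L1c15-L1c15 C token "{"
L1c17-L3c5  C comment
L3c7-L3c9   C token "int"
L3c11-L3c13 C token "fn2"
[further tokens of lines 3 through 5 omitted]
L5c21-L7c2  C comment
L7c4-L7c4   C token "}"

The two comments are the only tokens that cross line boundaries.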

What language is our example in? Our example is long enough to justify classification, and it compiles as C code. So it seems best to classify this example as C code. Our parses give us enough data for a heuristic to make a decision capturing this intuition.

Earley/Leo parsing and combinators

In a series of previous posts, I have been developing a parsing method that integrates Earley/Leo parsing and combinator parsing. Everything in my previous posts is available in Marpa::R2, which was in Debian stable as of jessie.

The final piece, added in this post, is the ability to use variable-length subparsing, which I have just added to Marpa::R3, Marpa::R2's successor. Releases of Marpa::R3 pass a full test suite, and the documentation is kept up to date, but R3 is alpha, and the usual cautions apply.

Earley/Leo parsing is linear for a superset of the LR-regular grammars, which includes all other grammar classes in practical use, and Earley/Leo allows the equivalent of infinite lookahead. When the power of Earley/Leo gives out, Marpa allows combinators (subparsers) to be invoked. The subparsers can be anything, including other Earley/Leo parsers, and they can be called recursively. Rare will be the grammar of practical interest that cannot be parsed with this combination of methods.
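
For readers who have never used Marpa, here is a minimal sketch of the basic Marpa::R2 SLIF calling pattern, using a toy grammar of my own invention. It does not show the combinator or variable-length-token machinery discussed above; it only shows what driving a Marpa parser from Perl looks like.

#!/usr/bin/env perl
# Minimal Marpa::R2 example with a toy grammar (illustration only).
use strict;
use warnings;
use Marpa::R2;

my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Sum        ::= Number '+' Number
Number       ~ [\d]+
:discard     ~ whitespace
whitespace   ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
my $input   = '40 + 2';
$recce->read( \$input );
my $value_ref = $recce->value();
die 'Parse failed' if not defined $value_ref;
# $$value_ref now holds the parse tree built by the [name,values] action.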

The example

The code that ran this example is available on GitHub. In previous posts, we gave larger examples, and our tools and techniques have scaled. We expect that the variable-length subparsing feature will also scale -- while it was not available in Marpa::R2, it is not in itself new. Variable-length tokens have been available in other Marpa interfaces for years, and they were described in Marpa's theory paper.

The grammars used in the example of this post are minimal. Only enough LaTeX is implemented to recognize code blocks, and only enough C syntax is implemented to recognize comments.

The code, comments, etc.

To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

posted at: 20:16

§ § §