Debian's reproducible builds effort has major implications for the trustability of individual software packages and the system as a whole. Implementing reproducible builds is also a complex undertaking, and DebConf 2015 featured several sessions that dealt with aspects of the work. The most comprehensive talk was the team report, led by Jérémy "Lunar" Bobbio and Holger Levsen and joined by Eduard "Dhole" Sanou and Chris Lamb.

The essence of the reproducible-build problem, as Bobbio explained it, is that free software typically provides source that can be studied (and verified) and binaries that can be used for any purpose, but it does not provide a proof that the binaries were created from the verified source. There are proof-of-concept exploits that highlight the dangers of this situation, such as Mike Perry and Seth Schoen's kernel exploit presented [PDF] at the 2014 Chaos Communication Congress.

The solution is to enable anyone to reproduce a bit-for-bit identical package from a given source tree, and that is what the Debian team has been working toward since 2013. The effort impacts a number of parts of the Debian project, including packaging tools, various compilers and build tools, Debian's infrastructure, and quite a few individual software packages.

Reproducibility work so far

The grunt work of testing a package for reproducibility involves building the package, saving the result, and then repeating the build with slight alterations to the build environment. The reproducibility team has a set of Jenkins jobs running a battery of such tests on its own mirror of the Debian archive. The variations tested include the hostname, kernel version, username and UID/GID running the build, time zone, and locale. At the moment, some other factors (including CPU type and the exact timestamp) are not varied, although there is work in progress to support more variations.
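The rebuild-and-compare idea can be sketched with a toy script; the build() function below is purely illustrative (the real Jenkins jobs rebuild actual Debian packages), and the file names are placeholders:

```shell
#!/bin/sh
# Toy sketch of the rebuild-and-compare approach: run the same "build"
# twice under deliberately varied environments, then compare outputs.
# build() is a stand-in for a real package build; it is deliberately
# non-reproducible because its output embeds the time zone.
build() {
    printf 'built with TZ=%s\n' "$TZ" > "$1"
}

export TZ=UTC
build first.bin

export TZ=Asia/Tokyo      # vary the environment, then rebuild
build second.bin

if cmp -s first.bin second.bin; then
    echo "reproducible"
else
    echo "NOT reproducible"   # this toy build fails the test
fi
```

A real harness varies many such factors at once (hostname, locale, UID, kernel) and archives both build results so a tool like diffoscope can explain any difference.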

Within that framework, Levsen explained, well over 75% of the packages in Debian "unstable" can be built reproducibly on the amd64 architecture—but the necessary changes have not been merged into the packages in the main Debian archive, and quite a bit more remains to be done before such merging will be considered. The team recently added armhf to the test pool and will be adding ppc64el soon; Levsen said it would support hardware from other architectures, too, if anyone has hardware to donate for the effort.

The most recent work includes the dh-strip-nondeterminism add-on for debhelper, which normalizes the contents of various problematic file formats. The set of formats handled includes several archive formats, which may record filesystem timestamps and permissions that are irrelevant to the final archive.

The team also wrote a utility program called diffoscope, which shows the differences between two packages (or directory trees). Diffoscope works "in depth," Bobbio explained: it recursively unpacks archives, uncompresses PDFs, unpacks Gettext files, and disassembles binaries. That allows it to look beyond differences in the bytes between two archives to the "human readable difference" in the original files.

In addition, the team has drafted some proposals that will affect build and packaging tools. The first is .buildinfo, a Debian package control file that will record the details of the build environment for a package so that the same conditions can be recreated later.

The second is SOURCE_DATE_EPOCH, a timestamp environment variable that build tools can use to export the last-modification date of the source. As Levsen explained, once all binaries are bit-for-bit identical, the "interesting" factor becomes not the build timestamp, but the last time the source was altered. The SOURCE_DATE_EPOCH timestamp is also useful for documentation-processing packages like help2man or epydoc, for which the team has already caught and fixed many bugs. More challenging is persuading the maintainers of some of these upstream tools that SOURCE_DATE_EPOCH is a useful bit of information to report.
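As a minimal sketch of how a build step might consume the variable (the variable name is the real proposal; the date value and the "documentation" step here are just examples):

```shell
# The packaging layer exports the last source modification time;
# in a Debian build this would typically be derived from the latest
# debian/changelog entry rather than hard-coded as it is here.
export SOURCE_DATE_EPOCH=1440072000   # 2015-08-20 12:00:00 UTC (example)

# A documentation tool that honors the variable uses it instead of
# "now", so rebuilding the package later embeds the same date:
DOC_DATE=$(date -u -d "@$SOURCE_DATE_EPOCH" +%Y-%m-%d)
echo "Documentation generated on $DOC_DATE"
```

Because every tool in the build reads the same fixed value, the embedded date no longer varies from one build to the next.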

Chris Lamb then discussed some common reproducibility bugs and how to fix them. Last-modification-time timestamps embedded in files are a common problem: they change the file without adding substantive value. Some of the fixes border on being trivial; for example, gzip records a timestamp by default when compressing a file, but that timestamp can be suppressed by adding the -n flag. The internal timestamp field in a PNG file, however, must be stripped out with ImageMagick or a similar tool, which is more work. Various programming languages, such as Erlang and Ruby, record problematic timestamps whenever they process a file, he said, while simple configure scripts often record unnecessary information like the current time and hostname.
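The gzip case can be demonstrated in a few lines (-n is a real gzip flag; the file names are placeholders):

```shell
# Two files with identical contents but different timestamps:
printf 'hello\n' > a.txt
printf 'hello\n' > b.txt
touch -d '2001-01-01 00:00 UTC' b.txt

gzip -c  a.txt > a.gz          # default: name and mtime go in the header
gzip -c  b.txt > b.gz
gzip -cn a.txt > a-repro.gz    # -n: omit the name and the timestamp
gzip -cn b.txt > b-repro.gz

cmp -s a.gz b.gz             || echo "default output differs"
cmp -s a-repro.gz b-repro.gz && echo "-n output is identical"
```

With -n, only the compressed payload remains, so identical input always yields an identical .gz file.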

There are also several issues related to the ordering of files. For example, if the order in which the filesystem lists files differs (as can happen when the system locale is changed), two tar archives of identical files can differ. The fix is to pipe the list of files through sort before it goes to tar. Perl exhibits a similar problem: the hash order produced by Data::Dumper is nondeterministic.
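A sketch of the sort-before-tar fix using GNU tar (newer versions also offer a --sort=name option with the same effect); the package contents and the pinned mtime below are illustrative:

```shell
mkdir -p pkg
printf 'alpha\n' > pkg/a
printf 'beta\n'  > pkg/b

# Feed tar an explicitly sorted, locale-independent file list instead
# of relying on whatever order the filesystem returns, and pin the
# metadata (owner, mtime) that would otherwise vary between builds.
make_tar() {
    find pkg -print0 | LC_ALL=C sort -z |
        tar --null --no-recursion -T - \
            --owner=0 --group=0 --numeric-owner \
            --mtime='2015-08-20 00:00:00 UTC' \
            -cf "$1"
}

make_tar pkg1.tar
make_tar pkg2.tar
cmp -s pkg1.tar pkg2.tar && echo "archives are identical"
```

Setting LC_ALL=C for the sort matters: it makes the ordering independent of the build machine's locale, which is exactly the kind of environment variation the test harness exercises.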

For many of these problems, Lamb has found fixes, but there are others that will require developers to do some work. For example, code that uses the current time as a lazy form of unique identifier will need to be rewritten. Most of the reproducibility fixes the team has implemented are not "crazy," he said, but the further upstream the fix is needed, the less likely it is to get accepted. Nevertheless, Lamb reported that the team had created more than 600 reproducibility-support patches, averaging about two new patches per day.

A look ahead

So far, many of the reproducibility bugs caught and fixed have been in the source packages themselves but, moving forward, work will have to be done on Debian's packaging tools and even its infrastructure. Bobbio noted that there are several open bugs against dpkg (including the bug to add support for .buildinfo files). Similarly, debhelper, cdbs, and sbuild all need to be patched.

There is not always agreement, though, about where some of the fixes required for reproducible builds should be made. For example, the bug to make mtime timestamps produced through dpkg deterministic could be solved by patching dpkg, debhelper, or tar. More discussion is needed, Bobbio said, and he invited volunteers to join in.

Other fixes will impact the Debian infrastructure itself. For example, reproducible builds need to be performed using a fixed build path, which will mean implementing changes on the Debian build server. Similarly, .buildinfo files would have to be accessible to users anywhere in the world in order for those users to actually perform their own reproducible builds; that means that roughly 200,000 files (for all of the packages across all of Debian's architectures) will need to be published somewhere, perhaps in the Debian archive itself, and perhaps as a new service.

But the final "patch," Bobbio said, will have to be to Debian policy. The reproducible builds team would like to add a "source must build in a reproducible manner" requirement in a new section 4.15—but, naturally enough, that is a change that everyone in the project needs to think about and have an opportunity to weigh in on.

The session ended with the team providing some practical steps that Debian developers and package maintainers can take to fix reproducibility problems in their packages. The status of any package can be checked online by visiting reproducible.debian.net. Users can also test some reproducible builds locally. There is a script available for pbuilder, although it only works on the packages in the patched, reproducible-build mirror.

Reproducible builds have benefits beyond security, of course. The speakers listed several during the talk, such as the ability to create debug packages for a binary at any time (including long after the binary was built), earlier detection of failed builds, and better testing of development tools. Given the response to the talk and the questions asked by audience members, this is clearly a project that many in the Debian community see as an important next step—even if it is one that still has many tasks, bugs, and open questions left to address.

[The author would like to thank the Debian project for travel assistance to attend DebConf 2015.]


Among the more curious talks at KVM Forum 2015 was Mark Kraeling's "Virtualizing the Locomotive" (YouTube video and slides [PDF]) that was submitted in the "end-user presentation" category. The topic is exactly what the title says: virtualization for train software. The presenter works for GE Transportation.

There's a lot of electronics and software in trains, but virtualization doesn't really apply to all of it; GE is not planning to virtualize systems that directly affect the safety of the train, for example. Those systems use techniques such as lockstep execution and voting, and they have to go through special certification; the hypervisor just gets in the way too much.

However, a train has a lot more software, written by multiple suppliers, each of which used to provide completely separate hardware too. For example, a lot of functionality is related to remote control. It is common for an engineer in the front of the train to drive locomotives at the back of the train, possibly two miles away and on the other side of a hill. Radio-based communication is also used to drive locomotives from the ground at stations or maintenance facilities ("a giant train set," Kraeling called them).

It is therefore very common not just to have a separate processor per application, but even to duplicate components such as cell-phone modems. A hardware platform that supports virtualization can go against this trend by encouraging consolidation: deploying all these systems onto a single multi-core processor saves money and enables more reuse of hardware. And if functionality can be added to an existing system just by dropping a new virtual machine (VM) into it, there is no problem if some applications are developed for a specific Linux version, for a legacy OS, or even for Windows CE. This flexibility is another advantage of virtualization.

However, consolidation requires some level of sandboxing between the different VMs. This is not unlike a data center (Kraeling, in fact, used the analogy of a "data center on wheels" several times), but some of the specific use cases for isolation are interesting. It can be hard to patch locomotive software because, unlike commodity servers, locomotives cost millions of dollars and customers really do not want them to sit idle while updates are tested. The requirements this imposes on software quality are obvious, but you also need to make the most of the testing time you can get in the field. For example, an effective way to validate new software is to place it on production locomotives purely to collect logs until the software crashes, along with data about the crash itself. Because such code is not yet stable, it is not really in use, so it must not interfere with the other VMs or with the control systems. Virtualization helps a lot with this kind of sandboxing, of course.

Hardware and software

After explaining the use cases, Kraeling presented the system's hardware and software. The processors selected are x86 and ARM. x86 processors are used for compute-heavy applications. ARM is mostly used for networking services, though 64-bit ARM has the potential to replace x86 as well. All processors are quite low-end, and they often run a 32-bit hypervisor and operating system in order to save memory. Xen was faster than KVM on 32-bit x86, so Xen was used there; on the other hand, KVM was faster than Xen for ARM hardware. GE wants to use KVM with 64-bit ARM as well when the processors are ready.

The focus on low-end x86 and low-power ARM systems is due to power-consumption concerns, which can be an issue for the locomotive's computer systems. This may be surprising, because on a diesel locomotive you basically have a power plant at hand. However, power is directly related to heat, and a locomotive can easily reach 70°C. If the surrounding environment is as hot as the processor, you cannot easily solve overheating problems by adding fans. And even though there is room for fans at the bottom of the chassis, customers do not like maintaining and cleaning them. The low-end, power-conscious x86 processors (GE uses the BayTrail E3845 and the Broadwell 5500U) can run at those temperatures without fans.

For the management layer, there's no standardized tool yet, but GE is looking into OpenStack. A member of the audience pointed out that the libvirt project was started to bridge the differences between Xen and KVM, so it may help GE as well if it does not need the complexity of OpenStack.

Kraeling has tested containers as well. The isolation and flexibility are, of course, not as good as what virtual machines provide, but containers were faster on ARM, so GE is looking into CoreOS and Docker. He hasn't yet looked at why x86 didn't benefit from containers; my guess would be that different applications run on the two systems, and CPU-bound applications are quite efficient in a virtual machine.

While high-level, the talk gave an interesting glimpse into a field that most developers are not familiar with; such "embedded" usage of a data center hypervisor like Xen or KVM is not a well-known topic. But, in fact, a presentation [PDF] from KVM Forum 2010 has some striking similarities with Kraeling's. Embedded virtualization is probably more frequent than one would think, and will probably become even more common in combination with realtime virtualization.


Development in the world of the TeX document-preparation system is steady, gradual, and solid. This fact reflects the maturity of Donald Knuth's TeX engine, which underlies everything else in the system. TeX has been essentially frozen for about 30 years and is one of the very few large software systems that is generally considered to be bug-free. Development now occurs in those layers that surround the typographic core: formats that supply higher-level control of TeX for the ready production of various classes of documents and packages for drawing, slides, sheet music, poetry, and for tweaking TeX's behavior.

In this two-part series, we will look at recent developments in the world of TeX (including LaTeX and similar systems). Considering the pace of development in the TeX community, the notion of "new" that I have in mind is a time horizon of five years or so, although I might mention things that happened even before then. This first part will touch upon typography, programming TeX, and creating diagrams.

TeX basics

Although TeX is still essentially oriented toward the creation of static, paginated documents—and might seem to be losing some relevance in our online world—it is still widely used, especially by mathematicians and scientists in quantitative fields (physics, computer science, etc.). The core reason for this is the same reason that TeX is also popular with various authors and publishing houses in the humanities—those who publish typographically demanding scholarly editions, perhaps mixing languages that employ a variety of alphabets. TeX's purpose is to achieve the best possible automated typography.

This can be seen not only in its unparalleled rendering of mathematical equations, but in its attention to aesthetics in the setting of prose: TeX contains sophisticated algorithms that adjust line breaking and hyphenation to optimize the appearance of entire paragraphs and pages considered as wholes. This attention to detail becomes critically important in complex documents, where typography becomes part of the expression of ideas.

The official and predominant installation method for TeX is TeX Live, which was traditionally distributed on a DVD and is still available in that form. To get the most recent versions of all its parts, however, you will want to follow the usual procedure and install TeX Live through the network. The versions available through Linux package management systems are usually out of date, but the current release is available from the project site. The release notes for TeX Live 2015 are a list of relatively minor, technical details, so I won't be discussing those changes specifically.

The timeliest and most complete source of documentation for the hundreds of TeX packages will be on your disk after you install TeX Live. Open the file /usr/local/texlive/2015/index.html in your browser for links to manuals and examples in various languages.

For those unfamiliar with TeX, when you process a document with the TeX typesetting system, you do so by invoking one of several TeX engines. The various engines differ in the output format that they produce and in how they implement some of TeX's algorithms—which determines what additional features are available. The original tex engine predates Unicode (so it expected an ASCII file) and produced only DVI (for "device independent") files. DVI was intended to be translated into PostScript or other printer commands with a separate tool. Contemporary TeX engines, though, can produce PDF files (e.g., pdfTeX), can understand Unicode text (e.g., XeTeX), and even incorporate scripting languages (e.g., LuaTeX).

The TeX engines should not be confused with TeX document formats, which are large collections of macros that define a set of higher-level layout commands. The most well-known format is LaTeX; another format (which has become popular for book publishing) is ConTeXt. Formats and engines are orthogonal: the pdftex and pdflatex commands invoke the same engine, but the former will process only plain TeX, whereas the latter supports LaTeX.

Fonts and typography

You should probably work with the LuaLaTeX or XeLaTeX engines for new projects (or their plain-TeX equivalents, LuaTeX and XeTeX), unless you must use TeX packages that are incompatible with these engines or you require a particular feature that's only available with the traditional, PDF-based engine pdfLaTeX.

The reason for this advice is that LuaLaTeX and XeLaTeX have both feet firmly in Unicode land, and their font handling is far more flexible and straightforward than that of the venerable alternatives. One of the annoying drawbacks of TeX in the past was that it lived in its own font universe, and could only use the typefaces that were designed for it.

Generally, TeX was blind to all the other beautiful fonts that you might have installed on your computer. With XeLaTeX and LuaLaTeX, though, you can now easily use any OpenType or TrueType font on your system. And, as we shall shortly see, the maturing of the fontspec and unicode-math packages in recent years radically improves the font-handling landscape for TeX users.

Here is a minimal LaTeX document that shows how to make arbitrary font changes, selecting from among several OpenType/TrueType fonts—some in the TeX Live directory tree and some in the system font directories:

    \documentclass{article}
    \usepackage{fontspec}
    \defaultfontfeatures{Scale=MatchLowercase,Ligatures=TeX}
    \begin{document}
    {\fontspec{Ubuntu}The }{\fontspec{Fetamont Bold10}quick }
    {\fontspec{Punk Nova}brown {\bfseries fox }}
    {\fontspec{Sawasdee}jumps {\itshape{\bfseries over }}}
    {\fontspec{CMU Serif}{\scshape the }}
    {\fontspec{Overlock}lazy }{\fontspec{Ubuntu Condensed}dog.}
    \end{document}

This file is intended to be processed with the lualatex command, which allows us to use the common names of fonts rather than having to know their actual filenames—as is required by all other engines, including XeLaTeX. This convenience is one of several reasons that lualatex should probably be your preferred typesetting command, unless you need to use a package or feature that only works with one of the others.

The defaultfontfeatures command in the third line of the example selects two options for the fontspec package. The Scale=MatchLowercase option scales the various fonts so that their lower-case letter heights are optically equal: fonts with the same nominal point size can appear to be different sizes, so this option makes them blend better when mixing fonts within a line. The Ligatures=TeX option enables the familiar TeX ligatures, such as "``" for an opening quotation mark.

In the code sample, bfseries selects the boldface variant of the currently selected (or default) font; itshape and scshape select the italic and small-caps variants, respectively. The code sample also shows how these can be combined to produce, in this example, boldface italics.

Here is the result when you process the file with lualatex.

The image was made by cropping the PDF output and converting it to a PNG. You can see where LuaLaTeX has chosen the appropriate font variants in response to the font-attribute commands (bfseries, scshape, etc.), some of which are nested. This works because the example uses fonts with these variants available; if the needed variants are not available, those commands will be ignored.

I've actually used this style of ad-hoc font switching when making posters and name tags but, for the more usual kind of document, you will want to select a harmonious set of fonts at the beginning and use them consistently throughout, switching among them with the standard commands for italic, monospace, etc., as required. Here is how you do this with fontspec:

    \documentclass{article}
    \usepackage{fontspec}
    \defaultfontfeatures{Scale=MatchLowercase,Ligatures=TeX}
    \setmainfont{Overlock}[BoldFont={* Black},
        BoldItalicFont={* Bold Italic},
        SmallCapsFont={* SC}]
    \setmonofont{PT Mono}
    \begin{document}
    The {\bfseries quick} {\itshape brown fox {\bfseries jumps over}}
    {\scshape the {\tt lazy dog.}}
    \end{document}

Running this through lualatex gives the result in the second figure.

In the options to the setmainfont command (which, unusually, come after the main argument), the asterisks stand for the main font name. This provides a convenient shorthand for selecting font variants. Fontspec is incredibly flexible, allowing you to choose entirely different typefaces for bold, italic, etc., if you want to. You can also choose which font features are activated in every situation; for example, you can decide to use historical ligatures when italics are used, but not in upright text.

LuaLaTeX and XeLaTeX both allow you to use Unicode input without including any additional packages. This lets you replace the traditional TeX commands for accents, and it allows the use of any characters available in the font. This is a Turkish translation of the common English pangram we used in the preceding examples:

{\fontspec{CMU Serif} hızlı kahverengi tilki tembel köpeğin üstünden atlar}

When inserted into our minimal document example, it is typeset as in this figure:

Note, though, that this approach will fail if you choose a font without the glyphs required. For example, attempting to set the above line using the Overlock font will simply skip the "ğ", which is missing from that font.

With the addition of the unicode-math package, Unicode input can even be used in equations. This package also builds its typeset mathematical output using Unicode glyphs, and it allows you to select any math font without loading additional packages:

    \documentclass{article}
    \usepackage{unicode-math}
    \begin{document}
    Here is the elementary version of Stokes' Theorem:
    \medskip

    XITS (STIX) Math: \setmathfont{xits-math.otf}
    \[ ∫_Σ ∇ ⨯ 𝐅 ⋅ dΣ = ∮_{∂Σ}𝐅⋅d𝐫 \]
    \end{document}

The results of running this through lualatex can be seen in the figure below. A longer example also showing other variations is available by clicking on the thumbnail.

Using Unicode math input clearly leads to source files that are easier to read, but it may not be to your liking if your system or text editor makes the input of Unicode too cumbersome. You can, of course, freely mix traditional TeX math markup with direct Unicode input.

If you do use Unicode characters for math in your source files, you must take care to use the symbols with the correct meaning, rather than merely the correct appearance. In the example file, we've used the uppercase Greek Sigma (U+03A3) to represent the surface of integration. There is, however, another Unicode character that will appear almost identical in the source file, but which is intended to mean the summation operator (U+2211).

When typesetting equations, TeX treats letters (variables) and operators differently, as it must. So, if you accidentally use the operator sigma, the size and spacing of the symbol will be incorrect, and the equations will look quite wrong.

(Note that if you get an error upon loading unicode-math, you may have to reinstall TeX Live 2015. There was a conflict with another package that was fixed only a few weeks before this was written—perhaps a counterexample to my advice to download a recent version rather than settling for the one in your distribution's repository.)

Programmability

TeX is not only a system of declarative markup tags for text and equations. It is also a Turing-complete programming language, meaning that it can express arbitrary computations. Many popular LaTeX packages perform computations in TeX in order to work their magic, but it is an arcane and tricky language to program in, and quite difficult to read.

LuaTeX (invoked for LaTeX documents as lualatex) is a project that embeds the Lua language within the TeX engine. It is still officially in beta, but over the last few years it has become stable and mature enough that LuaLaTeX is now considered the preferred engine for new projects. It is the focus of future development, and the ConTeXt project has adopted it as its official engine.

Lua is a scripting language designed specifically for embedding, and is therefore small and efficient. It has a familiar, imperative syntax and can be immediately understood with no previous exposure to the language. After a few minutes with the documentation, anyone who knows Python or any similar language can write basic programs in Lua. LuaTeX embeds Lua in such a way that it has access to the internals of TeX, and it can be used to manipulate the boxes and other elements that make up the typeset result. It can also make the results of Lua calculations available to TeX for typesetting. A simple example should make clear how this can be useful:

    \documentclass{article}
    \usepackage{luacode}
    \begin{document}
    \pagestyle{empty}
    \begin{luacode*}
    function esn (n)
        return (1 + 1/n)^n
    end
    function etn (n)
        tex.print(string.format('%5d & %1.8f \\\\', n, esn(n)))
    end
    \end{luacode*}
    Convergence to $e$:

    \begin{tabular}{ll}
    \rule[-2mm]{0pt}{4mm}$n$ & $(1 + \frac{1}{n})^n$ \\ \hline
    \luadirect{
        for n = 10, 110, 10 do
            etn(n)
        end
    }
    \hline
    \end{tabular}
    \end{document}

In the first part of this document, we have a luacode environment, where we have defined two functions. The first (esn()) maps a number n to a simple expression that yields the number e in the limit as n goes to infinity. The second function (etn()) prints a string that can be embedded within normal TeX source.

The LaTeX code begins next, with a line introducing a table that will show values of n and the convergents approaching e. Within the table, the columns are built with a luadirect command that immediately executes its argument as Lua code, using calls to one of the functions defined earlier. The typeset result is shown in the figure. The ability to perform calculations and typeset the results in a single TeX file, using a language that is simple to program in, opens up a world of new possibilities, especially for authors of mathematical material.

Another strong use case for LuaTeX is in the automated creation of PDF documents from assorted data sources—for example, consider forms of database publishing, such as the printing of catalogs from product databases. In these cases, there is no TeX formatting in the original data, so some form of flexible mapping from data structures to TeX concepts leading to the final PDF is required. The embedded scripting provided by LuaTeX makes this easier than the alternatives.

Graphics

The PGF/TikZ package, a huge project that provides a complete solution to creating all sorts of diagrams within a TeX document, has learned several new tricks in recent years. A recent article introduced TikZ's new network-graph facilities, including the exploitation of LuaTeX to implement the automated layout of graph diagrams. Here, we'll show how to combine a little scripting along the lines illustrated above with TeX's graphics packages:

    \documentclass{article}
    \usepackage{luacode}
    \usepackage{pgfplots}
    \begin{document}
    \begin{luacode*}
    function esn (n)
        return (1 + 1/n)^n
    end
    function etp (n)
        tex.print(string.format('(%5d, %1.8f)', n, esn(n)))
    end
    \end{luacode*}
    \begin{tikzpicture}
    \begin{axis}[xlabel=$n$, ylabel=$(1 + \frac{1}{n})^n$]
    \addplot coordinates {
        \luadirect{
            for n = 10, 110, 10 do
                etp(n)
            end
        }
    };
    \addplot[red] coordinates {
        (0, \luadirect{tex.print(math.exp(1))})
        (110, \luadirect{tex.print(math.exp(1))})
    };
    \end{axis}
    \end{tikzpicture}
    \end{document}

In this example we've replaced the second Lua function with one that prints out a pair of coordinates that we can use in a PGFPlots command. Loading pgfplots in the document preamble pulls in TikZ, which provides the tikzpicture environment. The axis environment and \addplot command within the tikzpicture invoke the pgfplots subsystem, which provides a specialized language for drawing graphs; its syntax is designed to be more convenient than using plain TikZ for this purpose.

The result is shown in the following figure, where the approach of the convergents to the limit, the transcendental number e (shown as a red line), is illustrated.

With the advent of these easy-to-learn tools, it has become possible to undertake a project such as an entire mathematics textbook, with all calculations and graphing done within a single TeX document.

TikZ and PGF recently received a major update to version 3.0. Because of the power and relative ease of use of PGF and TikZ, much of the action in the past few years on the TeX graphics front has taken the form of new TikZ packages.

A few interesting recent additions or major upgrades are Sa-TikZ, for the automated drawing of switching networks; bondgraph, for making "bond graphs" of physical systems; hf-tikz, which allows you to highlight parts of formulas; the randomwalk package, which calculates and prints 2D random walks; tikzorbital, for drawing colorful pictures of atomic and molecular orbitals; tqft, for topological quantum field theory; and forest, for drawing linguistic trees. All of these packages are documented within the TeX Live installation.

The world of TeX has come a long way since I started using it, when we edited our files on a terminal attached to a remote computer, and checked our output by jogging to the computer room to pick up our printouts.

The next installment of this series will delve into the interaction between the traditional world of TeX, which began as a way to typeset documents for printing, and our current environment of electronic documents that adapt to an assortment of reading devices.
