Benchmarking graphics applications to track performance is an important task, but one that has a number of pitfalls. Martin Peres gave a talk at the 2015 X.Org Developers Conference about those pitfalls and ways to address them. He also described a framework he has been working on to help automate benchmark runs.

The open-source graphics drivers for Linux are getting better, but they are also getting more complex, Peres said. More games and benchmarks are also becoming available for Linux. Users are relying more on open-source drivers to get the best performance from their graphics hardware and they expect that performance to get better (or at least stay the same) over time.

But code is merged into these drivers from many sources, he said. There are always risks when merging new code, though code review can help reduce them. Those risks include breaking rendering or other functionality, or improving performance for one application while reducing it for others. It is clear that code review does not catch everything. He showed a "real-life example" of some code that caused a 10% performance regression in a driver, and which took thirteen days before it was fixed in the mainline.

It is impossible to predict the performance impact of a patch, he said. There are many different things that a patch may change, any of which could affect performance. The data and code alignment of the program, CPU caching behavior, CPU and GPU schedulers, different graphics hardware generations, and the power budget of a device can all affect performance in various ways. So there is a need to benchmark changes on all the platforms and games of interest, but that is out of reach for any single developer.

Benchmarks

There are also different needs for benchmarks. Developers may want to try to optimize some part of the driver and want to run experiments with different hardware, power budgets, operating systems, and so on. Quality assurance (QA) teams want to be able to test a patch series before it gets to mainline. In addition, QA will want to follow performance trends in the mainline code and may be asked by a manager to create a "performance retrospective" that shows how the performance has changed over time.

But there is a lot of variance in benchmark runs. There is performance variance within the benchmark itself, as well as between benchmark runs. Events like reaching the power budget, hitting a thermal limit, or a GPU reset can invalidate the data gathered. In addition, there is a need to be able to reproduce the results of a benchmark run—not just the next day, but two months (or more) later.

There can be other problems with benchmark runs, such as mistakenly running against the wrong libraries, which invalidates a whole set of benchmarks. Also, comparing runs generated from different environments can lead to variance. The environment includes more than just the version and configuration of the kernel, libdrm, and Mesa; it stretches to the display server used and its compositor (along with their configurations) and to the hardware and BIOS versions. All of these sources of variance force multiple runs of benchmarks, which takes a lot of time, Peres said.

Inside a benchmark run, the variance is typically due to power management, concurrent tasks generating I/O or load on the CPU or GPU, hardware interrupts that disrupt the cache, or the CPU or GPU schedulers. For variance between runs, changes in memory allocation are an additional factor. Due to address-space layout randomization (ASLR), the binary is not placed at the same location in memory from run to run, which can lead to variance of up to 10%, as Peres has observed in some benchmarks.

In order to reduce these variances, he had some suggestions. For CPU-limited tests, forcing the CPU to a single frequency and pinning the game or benchmark to a single core can help. Adam Jackson suggested pinning the X server to a core as well. In addition, disabling ASLR and transparent huge pages can reduce the variance, Peres said. Beyond that, running as few services as possible on the system and pinning interrupts to a different core than the test are helpful as well.

Similarly, for GPU-limited tests, forcing the GPU to use a single frequency will help reduce the variance in GPU performance, as will reducing the number of active graphics contexts being handled by the GPU. For both CPUs and GPUs, it is important to properly cool the device so that thermal effects do not come into play.
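
As a concrete illustration of those knobs (this is not part of any tool Peres presented), the following Python sketch sets the standard Linux procfs and sysfs controls mentioned above before launching a benchmark. It assumes root privileges and a cpufreq-capable kernel, and uses glxgears purely as a placeholder workload.

    # A minimal sketch of the tuning described above -- assumes root and the
    # standard Linux procfs/sysfs interfaces; "glxgears" is only a placeholder.
    import os

    def write(path, value):
        with open(path, "w") as f:
            f.write(value)

    # Disable address-space layout randomization.
    write("/proc/sys/kernel/randomize_va_space", "0")

    # Disable transparent huge pages.
    write("/sys/kernel/mm/transparent_hugepage/enabled", "never")

    # Keep the CPU frequency fixed by forcing the "performance" governor.
    base = "/sys/devices/system/cpu"
    for cpu in os.listdir(base):
        governor = os.path.join(base, cpu, "cpufreq", "scaling_governor")
        if cpu.startswith("cpu") and os.path.exists(governor):
            write(governor, "performance")

    # Pin this process -- and the benchmark it becomes -- to a single core.
    os.sched_setaffinity(0, {2})
    os.execvp("glxgears", ["glxgears"])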

Changing the environment skews the results, so you have to be smart about the things you change. "Simple, right? Just be smart", Peres said with a grin. You should only change the parts of the environment that you aren't trying to optimize. But there are many variables to check, track, and remember, so helping developers and QA by automating as much as possible is needed.

Automation

To that end, what should the objectives for automated benchmarking be? To start with, it should avoid or detect human errors and it must ensure that the data gathered is valid. Predictability in execution time is also an important feature, since those running the tests need to know how long they will take. Providing as much information as possible is needed as well; running a long benchmark and then not having the data necessary to diagnose what it has found is infuriating, he said.

A concrete goal for an automated benchmarking system would be to track every library and program used and to know the versions of those items, their Git commit IDs, and the compilation flags used to build them. The use of various resources, such as CPU/GPU cycles and memory, should be sampled using perf and other tools to give a better view of what the system was doing throughout the tests. All of that information should be stored in the report.
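
A hypothetical sketch of that kind of bookkeeping might look like the following; the repository paths and build flags are placeholders, but the idea is simply to snapshot commit IDs and versions next to the benchmark results.

    # Hypothetical example: record commit IDs, versions, and build flags
    # alongside the results; the paths and flags below are placeholders.
    import json
    import subprocess

    def git_head(repo):
        return subprocess.check_output(
            ["git", "-C", repo, "rev-parse", "HEAD"], text=True).strip()

    environment = {
        "kernel": subprocess.check_output(["uname", "-r"], text=True).strip(),
        "mesa":   {"commit": git_head("/opt/src/mesa"),
                   "cflags": "-O2 -march=native"},
        "libdrm": {"commit": git_head("/opt/src/libdrm"),
                   "cflags": "-O2"},
    }

    with open("environment.json", "w") as report:
        json.dump(environment, report, indent=2)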

The automation should also be able to understand and act on the results of the benchmarks. If certain benchmarks suddenly start taking much longer than usual, it should reboot and start over, rather than continuing on. That led to a discussion about the validity of the data when things go wrong. In the end, it was generally agreed that the data is useful for developers, so it should be saved and investigated, but from a performance regression perspective, restarting the system and tests makes sense.

To ensure that the data is valid, the statistical accuracy should be calculated and more runs added if needed. Information should be gathered from the kernel about major hardware events that affect the performance. Ideally, the system would learn to give up and reprioritize benchmarks to try to continue to make progress even in the face of major regressions. When problems are detected, the system could also try to bisect to the change that caused it. That has two advantages: it adds credibility to the report and it also reproduces the problem (probably multiple times).
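
The "add more runs if needed" logic could be as simple as the following sketch (this is not EzBench's actual implementation): keep sampling until the standard error of the mean falls below a target percentage of the mean.

    # Not EzBench's actual logic -- just a sketch of the idea: add runs until
    # the standard error of the mean drops below a target percentage.
    import statistics

    def needs_more_runs(samples, target_pct=1.0):
        if len(samples) < 3:
            return True
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / len(samples) ** 0.5
        return (sem / mean) * 100 > target_pct

    if __name__ == "__main__":
        fps = [60.2, 59.8, 61.0, 60.5]           # frames-per-second samples
        print("more runs needed:", needs_more_runs(fps))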

In order to communicate a problem to the developers, the automation must be able to provide all of the information needed to reproduce the problem. Doing so is difficult. For example, ldd is not sufficient to determine which libraries a program uses because dlopen() could be used to access a library at runtime. There are ways to determine all of the library dependencies (e.g. strace, /proc/PID/maps), but they have their own sets of trade-offs. In addition, determining the version of a library or program is not necessarily bulletproof. It may be necessary to control the build process for all of the pieces to accurately gather that information, he said.
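
For the /proc/PID/maps approach mentioned above, a small sketch like the following will list every shared object a running process currently has mapped, including anything pulled in via dlopen(); one of the trade-offs is that it only sees libraries mapped at the moment it runs.

    # List the shared objects mapped by a running process via /proc/PID/maps;
    # note that it only catches libraries mapped at the time of the check.
    import sys

    def mapped_libraries(pid):
        libraries = set()
        with open(f"/proc/{pid}/maps") as maps:
            for line in maps:
                fields = line.split(maxsplit=5)
                if len(fields) == 6:
                    path = fields[5].strip()
                    if ".so" in path:
                        libraries.add(path)
        return sorted(libraries)

    if __name__ == "__main__":
        for library in mapped_libraries(int(sys.argv[1])):
            print(library)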

Tools for automation

There are some existing tools that handle parts of these problems. For example, the Phoronix Test Suite automates the data acquisition and does collect some useful metrics, but it is oriented toward "simple reporting" and is not really good for performance analysis. It also hides some of the performance data, is not Git-centric, and cannot be automated easily, he said.

Ben Widawsky's sixonix tool has some definite advantages, including its effort to validate the reported data and to detect hardware events that might invalidate that data. It is good for use by a developer, but not really geared toward QA. It is neither build-aware nor Git-aware and does not track the environment closely. It also supports a limited set of benchmarks and requires a lot of manual work to run a large series of tests.

An ideal system would manage the build system and commit history to build the programs and libraries needed for the benchmarks. It would also record and manage the environment of the test runs and provide a visual report to help analyze any performance changes.

Those ideals are why Peres started working on EzBench. The idea is to provide workflows and automation to assist in dealing with benchmarks. It is a framework that is meant to be quickly adaptable to many different use cases, for both QA and developers. It was developed by Peres and Chris Wilson (both from Intel) and had been released a week earlier under the MIT license.

EzBench has a modular architecture with separate test profiles, tests, and so on. It automates the acquisition of the benchmark data and generates a visual report that can be used by developers to help track down problems. It can bisect performance changes automatically, though there is still work in progress to integrate that feature. There are also Python bindings for "almost everything", including data acquisition, parsing the data, and producing the report.

There is work in progress to make the system more crash-resistant. It will store the goal and the current state, which can be compared when the system comes back up to determine the next step. The project is in the process of adopting some of Widawsky's work to detect performance changes using modeling. Work on the detection of the whole environment is also ongoing.

As might be guessed, there is also a to-do list. Predicting run times more accurately as well as supporting deadlines and test prioritization will allow users to get full utilization of a system until, say, 9am Monday morning. Sending email to authors of patches that cause performance regressions and integration with patchwork to be able to test patch series are also on the list.

Peres proceeded to demonstrate the tool, which has a web interface with graphs, links, and all that jazz. It looks like a useful tool, though it is still in the early stages of development. Those interested may want to look at the demo in the YouTube video (which may be a little hard to see) or to look at the slides [PDF].

[I would like to thank the X.Org Foundation for travel assistance to Toronto for XDC.]


Last month, I undertook some restructuring of my home network, particularly with an eye toward consolidating the rather ad-hoc file-storage arrangements that had grown up over the years. Along the way, I ran into an unexpected hurdle where podcasts are concerned: while most other media content is well served by free-software media players and backend services, podcasts are an awkward outlier. The majority of media applications support managing podcast subscriptions, but not in a way that allows for usable central storage. For a solution, I turned to the comparatively isolated world of command-line-driven audio tools.

First, some background information. My initial plan was simply to consolidate all media files onto a single file server that could be accessed from any client device (be it computer, tablet, phone, or car). There is no trick to this process for general, user-created files, but files that are automatically retrieved are another matter. Podcast audio, in one sense, is just like any other media type: the files can come and go, they need to be opened in a media player that supports the relevant format, and they are typically tracked by the user through embedded metadata.

But there are some important distinctions between how users interact with podcast episodes and with other audio content (e.g., from their music library). For example, generally speaking, podcast files and music files represent conceptually disjoint collections that the user does not want mixed—having an hour-long round-table debate appearing in the "shuffle" mode of a music mix is hardly welcome. In addition, podcast content may be (though it is not necessarily) periodically deleted—one rarely listens to content multiple times, years after the first listen. Taken together, these factors have led most media-player development teams to implement podcast support in something of a silo within the application. It appears as a separate tab or screen in the UI, and uses separate code paths.

The root of the awkwardness is that these media players each want to control the download, synchronization, and old-episode-expiration tasks internally in their "podcast silo" code: pointing podcasting tools from a desktop, a laptop, and a mobile device all at the same directory on a file server thus results in chaos. But managing a duplicate set of subscriptions and files for each device is far from optimal, too. Alternatively, one could appoint a single application to handle the subscriptions, and store the files in a location that is shared between all potential listening devices (say, over NFS). That, however, breaks down for any playback application that does not support multiple, independent "libraries" (which, regrettably, is most of them).

Ultimately, the most viable solution I found was to configure one application to download podcast content onto the file server, then to expose those downloaded files as a Universal Plug-and-Play (UPnP) media share. This worked better than sharing a directory with NFS because client applications, from desktop media players to mobile apps to Kodi, all seem to handle connecting to multiple UPnP shares without hiccuping, and they do not get confused if a share's content changes suddenly or is unreachable at a given time.

Implementing that solution led me to test-drive a variety of command-line or "server-side" podcast-management tools (or aggregators, as they are often known) to run on the file server as cron jobs. There are several decent desktop aggregators, of course, but none that I felt were well-suited to unattended operation on a headless server. These command-line options may not get a lot of attention, but there are several capable candidates to choose from. Nonetheless, just as the specific feature sets of desktop audio players vary, so do those of the command-line aggregators, and one size will not fit all users.

The podcast aggregators I experimented with include Jonathan Baker's PodGrab, Manolo Martinez's greg, Mads Michelsen's Poca, John Goerzen's hpodder, Christophe Delord's pod.sh, and Chess Griffin's mashpodder (which is a continuation of the now-defunct bashpodder).

Some of these tools are no longer actively developed, but that does not appear to be a practical drawback, as the RSS and Atom formats used for podcast feeds are no longer being revised. The simpler questions that might lead a user to one aggregator or another include the language it is written in (hpodder, for instance, is in Haskell) but, in the long run, what will likely prove most important is the aggregator's feature set.

All podcast aggregators handle the same basic tasks, of course: managing the set of subscribed feeds, checking each feed for updates, and downloading new content. Where they differ is in the functions they provide to handle the minutiae of feeds—and, unfortunately, feeds that are published in the wild are prone to all manner of quirks and incompatibilities.

For instance, many podcasts push out MP3 files tagged with inconsistent, incorrect, or simply missing fields that make them harder to navigate in an audio player that presents just those fields in the user interface. Every episode in the FLOSS Weekly feed is tagged with that week's guests in the "Artist" field. Thus, the episodes are not grouped together, since audio players invariably treat "Artist" as the top sort-order field. Luckily, greg, Poca, and mashpodder all support automatically re-tagging the downloaded files, using a user-defined formula.
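
The aggregators' built-in re-tagging is not shown here, but the following sketch, using the mutagen library, illustrates the kind of fix-up involved: forcing every downloaded FLOSS Weekly episode into a single "Artist" value so players group the show together. The directory path is an assumption.

    # Illustration only (not greg's or Poca's internal mechanism): re-tag
    # downloaded episodes with the mutagen library so they group by show.
    from pathlib import Path
    from mutagen.easyid3 import EasyID3
    from mutagen.id3 import ID3NoHeaderError

    PODCAST_DIR = Path("/srv/media/podcasts/floss-weekly")   # assumed layout

    for mp3 in sorted(PODCAST_DIR.glob("*.mp3")):
        try:
            tags = EasyID3(str(mp3))
        except ID3NoHeaderError:
            continue                      # skip files without an ID3 tag
        tags["artist"] = "FLOSS Weekly"   # group every episode under the show
        tags["albumartist"] = "FLOSS Weekly"
        tags.save()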

To give a completely different example, most podcast aggregators keep track of the previously downloaded episodes from each feed. But the National Public Radio (NPR) "Hourly News Summary" feed always contains just one episode: the most recent. It is an unusual choice on NPR's part (though one that probably does not violate any standards), and it seemed to confuse mashpodder into thinking the lone episode had already been downloaded.

Apart from gracefully coping with imperfect feeds, the other factor differentiating the aggregators is how much control they give over which episodes are downloaded. Mashpodder offers only two options: downloading the newest episode and downloading N episodes back. Contrast that with Poca, which offers only downloading the latest episode or setting a per-feed megabyte limit (after which it will begin to expire old episodes). PodGrab supports only "download the latest" and "download all episodes."

Poca's storage-space limits and mashpodder's episode-count limits are both reasonable approaches to managing a particular feed; the problem is that you get either one or the other, but not both. It is on that front that greg stands out among the aggregators that I tested.

Greg offers multiple settings for configuring how much content it downloads, and it allows each setting to be configured with global defaults and on an individual per-podcast basis. In addition to episode-count limits, greg supports setting a "history date" for each podcast and it allows the user to set up filters that will be run against each episode in a feed to determine if it should be downloaded. The filters use Python string-comparison syntax, plus a set of pre-defined placeholder elements (such as {title} for the podcast episode title, {filename} for the episode's actual filename, {link} for the URL, and so on).
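
Greg's configuration syntax is not reproduced here, but the following sketch shows how a filter written with Python string comparisons and {title}-style placeholders can be applied to episode metadata; the feed data and the filter expression are invented for illustration.

    # Invented example of evaluating a greg-style filter expression against
    # episode metadata; the placeholders mirror those mentioned above.
    episodes = [
        {"title": "Hourly News Summary", "filename": "news_0800.mp3",
         "link": "https://example.org/news_0800.mp3"},
        {"title": "FLOSS Weekly 358", "filename": "floss358.mp3",
         "link": "https://example.org/floss358.mp3"},
    ]

    # A user-supplied filter using ordinary Python string comparisons.
    filter_expr = '"FLOSS" in {title} and {filename}.endswith(".mp3")'

    for episode in episodes:
        # Substitute the placeholders with the episode's fields, then evaluate.
        expression = filter_expr.format(title=repr(episode["title"]),
                                        filename=repr(episode["filename"]),
                                        link=repr(episode["link"]))
        if eval(expression):              # acceptable for a trusted, local config
            print("would download:", episode["filename"])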

Greg is also the only tool I tested that supports calling an external program to download a podcast episode. This can be something simple (like wget or curl) or a custom script. The feature, like other greg options, can be set per-feed. Perhaps the most obvious use case is that it allows greg to handle feeds that encapsulate media in a peculiar format—such as YouTube channels or playlists, which can be handled with youtube-dl or a similar utility.

After about a week of experimenting, I did settle on greg as the most capable option for my particular needs. Greg does have some drawbacks, most notably that it does not log activity—or errors. But it does offer commands to interactively check the status of each feed, so recovering from surprises is, at least, possible.

Among the other options, Poca comes in second. It offers logging and its standard output can be suppressed (which is nice for cron usage). PodGrab and the pod.sh shell script are both useful, too, especially if one does not have particularly complex requirements; pod.sh also includes the somewhat peculiar option to re-encode all downloads at a faster tempo. Perhaps that proves useful for users who regularly find themselves short on time.

As a final note, the deployment plan I outlined involved having the podcast aggregator automatically retrieve new episodes, then having the download directory shared over UPnP. Using UPnP means there is no client-side setup: every computer and device (including many smart TVs and home electronics) will discover the shares automatically. But in order to keep the podcast content separate from general audio, one needs to configure it as a separate UPnP source.

Only two free-software UPnP servers that I know of support running multiple simultaneous instances: ReadyMedia (previously known as MiniDLNA) and MediaTomb. Between the two, ReadyMedia is simpler, so I configured two ReadyMedia instances to launch at system-startup time, each pointing to a different configuration file (one for podcasts and one for other audio).

In one sense, it is unfortunate that extra hoops must be jumped through to support podcasts on a central file server but, for the most part, the extra work is due to the different listening experience the user expects. The same could be said for audio books which, like podcasts, demand special treatment within one's library. Fortunately, the free-software community comes through, as it so often does, even within such a small niche.
