LLVM Weekly - #67, Apr 13th 2015

Welcome to the sixty-seventh issue of LLVM Weekly, a weekly newsletter (published every Monday) covering developments in LLVM, Clang, and related projects. LLVM Weekly is brought to you by Alex Bradbury. Subscribe to future issues at http://llvmweekly.org and pass it on to anyone else you think may be interested. Please send any tips or feedback to asb@asbradbury.org, or @llvmweekly or @asbradbury on Twitter.

EuroLLVM is going on today and tomorrow in London. I hope to see a number of you there. Provided there's a reasonable internet connection, I hope to be live-blogging the event on the llvmweekly.org version of this issue.

EuroLLVM Day 1 liveblog

C Concurrency: Still hard

(Note: it's really hard to summarise a talk which covers so many subtleties regarding concurrency. You're probably better off waiting for the video and slides)

The speaker talks through a deceptively simple example that highlights the difficulties of compiling C in a concurrent setting

Why did it take until 2011 to properly define the semantics for multithreading in C? Consider even a simple compiler optimisation like constant propagation (the speaker has a nice worked example for this, best to wait for the slides).

Intuition: the compiler (or hardware) can reorder independent accesses.
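To make that intuition concrete, here is a minimal sketch of the classic store-buffering litmus test. It is an illustration only, not the speaker's worked example, and the thread and register names are made up. It enumerates every interleaving of two threads and shows that the outcome r1 == r2 == 0 is impossible when each thread's accesses stay in program order, but becomes reachable once the independent store and load in each thread are swapped.

```python
from itertools import combinations

# Each op is (kind, variable, register); all names are illustrative only.
T1 = [("store", "x", None), ("load", "y", "r1")]
T2 = [("store", "y", None), ("load", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def outcomes(t1, t2):
    """Set of (r1, r2) results over all interleavings, memory starting at 0."""
    results = set()
    for schedule in interleavings(t1, t2):
        mem = {"x": 0, "y": 0}
        regs = {}
        for kind, var, reg in schedule:
            if kind == "store":
                mem[var] = 1          # every store in this test writes 1
            else:
                regs[reg] = mem[var]  # a load copies memory into a register
        results.add((regs["r1"], regs["r2"]))
    return results

# Program order respected: sequentially consistent outcomes.
sc = outcomes(T1, T2)
# Swap the independent store and load in each thread, as a compiler or CPU might.
reordered = outcomes(T1[::-1], T2[::-1])
```

Single-threaded, the swap is invisible, which is exactly why an optimiser would consider it harmless; it is only the second thread that can observe the difference.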

State of the art of compiler testing is csmith (PLDI 2011). But it cannot catch 'concurrency compiler bugs'. Can we extend this technique, and if so how do you deal with non-determinism? Idea: due to C's support for separate compilation, functions can be called in arbitrary non-racy concurrent contexts. So search for transformations of sequential code which are not sound in a concurrent non-racy context.

The cmmtest tool

GCC internal invariant: never re-order with an atomic access. cmmtest found a case where this invariant was broken

What's the way forward? We need to understand the effects of what compilers implement and what programmers rely on, then build on that.

Concurrency isn't the only pain point. e.g. can you do a pointer comparison or pointer arithmetic between pointers to separately allocated objects (as is routinely done in the Linux kernel)? So what is C? Fill out the survey.

ThinLTO: A Fine-Grained Demand-Driven IPO Infrastructure

Background: monolithic CMO (cross-module optimisation) is done in the linker process. This tends to consume lots of memory and is generally not parallel, leading to scalability issues with large applications. You can improve slightly by doing intramodule optimisation and code generation in parallel (see the SYZYGY framework from the HP-UX compiler).

Background (cont): GCC has 'WHOPR'. This increases parallelism, but there's a lot of work going on in serial

Ideal fully parallel CMO: end-to-end compilation in parallel, where each module is extended with other modules as if they were from headers. A version of this idea is implemented in the gcc/google branch as LIPO (lightweight interprocedural optimisation). Source importing is pre-computed using a dynamic call graph from a profile run.

Problems with LIPO: you need profile data, importing the whole module is costly, ...

Proposed new model: delay importing until IR files have been generated, which allows fine-grained importing at a function level granularity. But how to synchronise and communicate between these parallel compilations? The 'super thin linker' plugin handles this. The implementation is called ThinLTO.

ThinLTO: does very minimal work by default (no IPA), allowing it to scale to very large programs. Advantages include not requiring profile data, friendly both to single build machines and distributed build systems.

ThinLTO phase 1 generates IR files in parallel, and function summary data to guide later import decisions. Format of this summary and information to be included is up for discussion, may change from the current prototype.

ThinLTO phase 2 is the thin linker plugin layer. Combines the per-module function indexes into a thin archive. Keep the size down by excluding functions which are unlikely to benefit from inlining.

ThinLTO phase 3: parallel BE with demand-driven IPO. Perform iterative lazy function importing, with priority determined by summary and callsite context.
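A rough sketch of what demand-driven, prioritised importing could look like. Everything here is an assumption for illustration: the summary fields, the size-based priority, the `max_size` threshold, and the function names are hypothetical, and the talk was explicit that the real summary format is still up for discussion.

```python
import heapq

# Hypothetical function summaries: name -> (defining module, instruction count, callees).
SUMMARY = {
    "main":     ("a.o", 40, ["parse", "run"]),
    "parse":    ("b.o", 12, ["is_digit"]),
    "is_digit": ("b.o", 3, []),
    "run":      ("c.o", 500, ["step"]),
    "step":     ("c.o", 8, []),
}

def plan_imports(module, summary, max_size=50):
    """Return functions from other modules worth importing into `module`."""
    # Seed the worklist with the callees of functions defined in this module.
    heap = []
    for name, (home, _, callees) in summary.items():
        if home == module:
            for callee in callees:
                heapq.heappush(heap, (summary[callee][1], callee))
    imported, seen = [], set()
    while heap:
        size, name = heapq.heappop(heap)  # smallest candidate has highest priority
        home = summary[name][0]
        if name in seen or home == module or size > max_size:
            continue
        seen.add(name)
        imported.append(name)
        # Lazily consider the callees of whatever was just imported.
        for callee in summary[name][2]:
            heapq.heappush(heap, (summary[callee][1], callee))
    return imported
```

With this toy data, `plan_imports("a.o", SUMMARY)` picks up `parse` and then `is_digit`, but never looks inside `run` because it is too large to import. Each module can compute its own plan independently, which is the property that lets the backends run in parallel.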

LLVM prototype status: a prototype has been implemented with support for phase 2 and phase 3, which works with the SPEC CPU 2006 benchmarks.

Preliminary experimental data: right now there's a range of wins and losses in terms of run-time performance. Build-time performance vs LTO is impressive.

Corridor track

Apologies, I missed the last two talks of the day due to some very entertaining discussion in the 'corridor track'. You might want to look at LLPE, a partial evaluator for LLVM bitcode.

EuroLLVM Day 2 liveblog

LLVM meets the truly alien: the Mill CPU architecture in a multi-target tool chain

"There have been essentially no major advances in computer architecture for about 30 years", due almost completely to network effects

This is talk 11 in a series; see the others at millcomputing.com

Claims a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures. Sadly, questions about the reasoning behind this claim are out of scope of this talk?!?

Mill: wide-issue (30+ MIMD ops per cycle), statically scheduled (no issue hazards or OOO), exposed pipeline (all ops have fixed latency), has 'integrated vectors' (all scalar ops are vector too), has hardware SSA (no general registers)

Mill instruction bundles have two physical streams, but one logical stream. How to branch two streams? You group each stream into Extended Basic Blocks (single-entry, multiple-exit sequences of bundles). Split the stream, reverse one, then paste them together. You now have one entry point (in the middle), where one PC moves down through lower addresses and the other up through higher addresses. The Mill also benefits from having two instruction caches and two decoders. "This is far from the strangest thing about the Mill"
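The split-reverse-paste layout is easier to see in a toy model. This is only a sketch of the idea as described in the talk: it treats each stream as a list of opaque ops, ignores bundle sizes and the real encoding, and the function names are invented for illustration.

```python
def layout_ebb(stream_a, stream_b):
    """Lay out one extended basic block: stream A reversed, then stream B."""
    memory = list(reversed(stream_a)) + list(stream_b)
    entry = len(stream_a)  # the single entry point sits at the seam
    return memory, entry

def fetch_order(memory, entry):
    """Yield (a_op, b_op) pairs as two PCs walk away from the entry point."""
    down, up = entry - 1, entry
    while down >= 0 or up < len(memory):
        a_op = memory[down] if down >= 0 else None     # PC moving to lower addresses
        b_op = memory[up] if up < len(memory) else None  # PC moving to higher addresses
        yield a_op, b_op
        down -= 1
        up += 1

memory, entry = layout_ebb(["a0", "a1", "a2"], ["b0", "b1"])
pairs = list(fetch_order(memory, entry))
```

A branch only needs to name one address (the seam), yet both decoders can start fetching from it immediately in opposite directions.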

The Mill is a family of member CPUs sharing an abstract operation set and micro-architecture, but differing in concrete operation set and microarch.

How to produce a family? Traditional approach is using microcode, and superscalar out-of-order to hide implementation details. The Mill approach is 'late binding' - compile to the abstract target (a universal superset). Compile through the LLVM middle end, and output 'gen form'. This is specialised at the link stage to the concrete target. Mill assembly is actually C++ (that looks much like assembly), and is run through a C++ compiler. e.g. Mill assembler is `add(b3, b4)`. By using C++, they're able to save the cost of writing their own macro assembler (sidenote: is that a worthwhile tradeoff?)

Now the fun part, 'ranting and raving' about LLVM

The Mill is a 64-bit machine, but pointers are not integers and have special semantics. (sidenote: I know the CHERI team, specifically David Chisnall, have done a lot of work regarding differentiating between integers and pointers in LLVM). There are apparently even issues with pointer to int conversion in the Clang frontend. The claim is that the problem is too pervasive for the Mill team to fix.

Width metadata tags in the Mill tell how big an operand is, not what the type is.

There are issues with the belt and the function call/return discipline. Variadic and multiadic call results are hard to express.

"Tablegen not suitable for a large, regular ISA". They generate .td files from the Mill hardware specs (note: I'm not sure I followed the issue, generating .td from a spec seems wholly sensible?)

Still feel they made the right choice with LLVM

A high-level implementation of software pipelining in LLVM

Software pipelining is a very useful technique currently missing from LLVM

There is some previous work doing modulo scheduling at the source level (SLMS, 'Towards a Source Level Compiler: Source Level Modulo Scheduling', Ben-Asher and Meisler).

This work attempts to perform software pipelining at the LLVM IR level, scheduled at the end of the optimisation pipeline

Use target hooks to get information on available resources from the target-specific layer (e.g. number of scalar functional units and vector functional units)

The Swing Modulo Scheduling heuristic is used. 1) find cyclic dependencies and their length, 2) find resource pressure, 3) compute the minimal initiation interval (II), 4) order nodes according to 'criticality', 5) schedule nodes in order
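Steps 1 to 3 above boil down to the textbook lower bound on the initiation interval: the larger of a resource bound and a recurrence bound. The sketch below shows that calculation only. It is not the implementation presented in the talk, and the op counts, unit counts, and dependence cycle are hypothetical numbers.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource-constrained bound: the busiest functional-unit class decides."""
    return max(ceil(op_counts[kind] / unit_counts[kind]) for kind in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained bound over (total latency, iteration distance) cycles."""
    return max((ceil(latency / distance) for latency, distance in cycles), default=1)

def min_initiation_interval(op_counts, unit_counts, cycles):
    """No schedule can start iterations more often than either bound allows."""
    return max(res_mii(op_counts, unit_counts), rec_mii(cycles))

# Hypothetical loop body: 6 scalar ops and 3 vector ops per iteration on a
# target with 2 scalar units and 1 vector unit, plus one dependence cycle of
# total latency 4 that spans 1 iteration.
ops = {"scalar": 6, "vector": 3}
units = {"scalar": 2, "vector": 1}
mii = min_initiation_interval(ops, units, [(4, 1)])
```

Here the resource bound is 3 but the recurrence bound is 4, so the loop-carried dependence is what limits the schedule. The scheduler then tries to place nodes at this II and increases it if no valid schedule is found.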

Initial implementation was done for the Movidius SHAVE architecture (8-issue VLIW).

"It works! Somewhat.." Up to 1.5x speedup observed in TSVC tests. Currently seeing many big regressions as well though.

Potential improvements include letting the user control when to enable/disable. Modeling of instruction patterns in IR. Improved resource model. Better profitability analysis.

Lightning talks

Speed of light: fastest implementation of a function on a given CPU (Cortex-A57). Function under test is a simple RGB to YIQ conversion. First attempt has no multiply-accumulate and no vectorisation. See a small speedup from handwritten scalar asm. Get a 21% improvement with some smarter scheduling by hand. Hand-written vector asm is more than 2x faster, but close to 2x bigger in code size.

An LLVM-based AOT JS compiler (Samsung). Developed a JS frontend generating LLVM IR. Implements the full ECMA standard apart from eval. Using type inference, removed extensive calls to runtime functions. e.g. a simple add with known-integer arguments can be compiled to a single LLVM IR instruction. Saw promising results on several Sunspider tests.

Building Clang/LLVM efficiently. Ideas: build with clang rather than GCC, don't compile unwanted backends, use a faster linker, optimise the host compiler as much as possible. Incremental debug builds (shared rather than static) unsurprisingly give a significant speedup. Overall speedup was 1.58x for release builds and 1.94x for debug builds using all the above.

SPIR: Standard Portable Intermediate Representation. SPIR 1.2/2.0 (pronounced 'spear') was a subset of LLVM IR. SPIR-V was announced alongside Vulkan in March 2015. No longer based on LLVM IR, but maps easily to it. Intended as a stable binary format for deployment of shaders/compute kernels. There are opportunities for first-class SPIR-V support built around LLVM (e.g. a SPIR-V frontend/backend).

Proposing LLVM extensions for generating native code fragments. The speaker describes current LLVM patchpoint support. He proposes a new explicitcc calling convention for use with patchpoints, where the register allocator is told which registers hold the arguments and which registers to preserve. Patches are up for review.

LLVM Inliner Enhancement. The current inliner is carefully tuned based on a large number of real applications, but there are a number of missed opportunities for typical computation intensive benchmarks. Various alternative heuristic rules are explored. A patch is under review in the community.

Optimizing code for GPUs. A key aspect is the memory access pattern. SymEngine can be used to detect sub-optimal memory accesses. Its predictions have a high correlation to hardware performance counters.

Using GCC regression tests for deeply embedded systems. Why? The suite is very large, has small and portable tests, and supports remote execution out of the box. Ended up with a small patch to DejaGnu, which adds a global override to change the expected result of tests (rather than annotating each test individually). (Sidenote: as someone who's found the GCC regression tests very useful in an architecture port, I think this is very handy work.)

How to vectorize interleaved access? A patch for induction with arbitrary steps has been upstreamed. The loop vectorizer needs to be taught to transform IR to strided load intrinsics.

Recursion inlining in LLVM. Seen perf improvements of up to ~20%. Currently GCC inlines recursive function calls up to a certain depth, but LLVM marks them as noinline. Option 1) is to inline iteratively. 2) remove recursion with a stack. Looked at fibonacci as a pathological example. As you might expect, iterative inlining performs better but has larger codesize.
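The two options are easy to picture at source level using the fibonacci example. The sketch below is a hand-written illustration of the resulting code shapes, not output from the patch discussed in the talk: one version has the recursive calls expanded in place once, the other replaces recursion with an explicit stack.

```python
def fib_recursive(n):
    """Baseline: plain doubly recursive fibonacci."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_inlined_once(n):
    """Option 1: both recursive calls inlined one level deep (more code, fewer calls)."""
    if n < 2:
        return n
    # Body of fib(n - 1), expanded in place.
    a = (n - 1) if n - 1 < 2 else fib_inlined_once(n - 2) + fib_inlined_once(n - 3)
    # Body of fib(n - 2), expanded in place.
    b = (n - 2) if n - 2 < 2 else fib_inlined_once(n - 3) + fib_inlined_once(n - 4)
    return a + b

def fib_explicit_stack(n):
    """Option 2: recursion removed, pending arguments live on an explicit stack."""
    total, stack = 0, [n]
    while stack:
        k = stack.pop()
        if k < 2:
            total += k                    # base cases contribute their own value
        else:
            stack.extend((k - 1, k - 2))  # defer both subproblems
    return total
```

The inlined version roughly doubles the body for each level of inlining, which matches the codesize observation above.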

Verifying code generation is unaffected by -g -S (from Sony). Most devs assume compiling with debug info will not affect code generation, and it's important to verify this assumption. check_cfc.py is a wrapper to check it by comparing objdump output. Found, reported, and fixed a number of bugs using this.

libclang integration in the KDevelop IDE. Currently KDevelop has a hard to maintain custom parser (over 50kloc) - Clang to the rescue! Now integrating libclang and trying to use pretty much all the features it offers. kdev-clang is around 10kloc.

Optimising the pimpl-idiom (as used by e.g. Qt and KDE). With private implementations, you often end up with multiple mallocs. The proposal is to use placement new to place the allocations in the same contiguous region of memory.

Vectorization Of Control Flow Using New Masked Vector Intrinsics

AVX-512 has 32x512-bit registers. It also has 8 64-bit mask registers. These are used in instructions to select vector lanes.

LLVM IR has no masking support. How do we move forwards? In some situations you can avoid masking, but in others masking is the only way that would allow vectorisation. Goal: allow LLVM to have first-class support for AVX and AVX-512, without complicating the story for other targets.

Introduce masked load and store intrinsics, rather than adding new instructions. Who generates them? The vectorizer will. The CodeGenPrepare pass will scalarize the masked intrinsic if the target doesn't support it.
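As a rough model of the semantics, here is what a masked load and store do once scalarised, written per lane. This is a sketch of the behaviour only: the function names and the list-based 'memory' are made up for illustration, and the `passthru` parameter models the value a masked-off lane of the load result takes.

```python
def masked_load(memory, base, mask, passthru):
    """Lane i reads memory[base + i] if mask[i] is set, else keeps passthru[i]."""
    return [memory[base + i] if m else p
            for i, (m, p) in enumerate(zip(mask, passthru))]

def masked_store(memory, base, mask, values):
    """Lane i writes values[i] to memory[base + i] only if mask[i] is set."""
    for i, (m, v) in enumerate(zip(mask, values)):
        if m:
            memory[base + i] = v

# Vectorised form of: for i in range(4): if a[i] > 0: out[i] = a[i] + 1
a = [5, -1, 7, 0]
out = [0, 0, 0, 0]
mask = [x > 0 for x in a]
loaded = masked_load(a, 0, mask, [0, 0, 0, 0])
masked_store(out, 0, mask, [x + 1 for x in loaded])
```

The important property is that masked-off lanes never touch memory at all, so a lane whose address would be out of bounds is harmless as long as its mask bit is clear.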

Masked load and store are designed for dealing with control-flow divergence. What happens when we have data divergence? Masked scatter and gather instructions come into play here. The difference vs load/store is the argument is a vector of pointers.

Strided memory access is a special case of gather/scatter but where the access is at a series of constant offsets. Vector load and shuffle can be more optimal in many cases. Options for handling this are 1) create a gather intrinsic and optimise it later, 2) create loads and shuffles, 3) introduce a strided load intrinsic.
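A small model of why a strided access can be expressed either way. The helper names and the list-based memory are hypothetical, chosen only to show that a gather with constant-stride addresses and a contiguous load followed by a shuffle produce the same lanes.

```python
def gather(memory, addresses):
    """General case: one independent address per lane."""
    return [memory[a] for a in addresses]

def strided_via_gather(memory, base, stride, lanes):
    """Option 1: express the strided access as a gather with constant offsets."""
    return gather(memory, [base + i * stride for i in range(lanes)])

def strided_via_load_shuffle(memory, base, stride, lanes):
    """Option 2: contiguous wide load, then a shuffle picking every stride-th element."""
    wide = memory[base:base + stride * lanes]        # note: reads elements nobody asked for
    return [wide[i * stride] for i in range(lanes)]  # the shuffle

memory = list(range(100, 132))
via_gather = strided_via_gather(memory, 1, 3, 4)
via_shuffle = strided_via_load_shuffle(memory, 1, 3, 4)
```

The load-and-shuffle form touches more memory than it needs, but the accesses are contiguous, which is usually what makes it the cheaper choice when the stride is small.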

Status: masked load and store intrinsics are supported in 3.6 and gather+scatter are in progress. Strided loads and stores have been discussed. Floating point vectorisation is next in line.

LLILC: LLVM as a code generator for the CoreCLR

The CoreCLR was open sourced a few months ago. The idea is to support people who want to deploy the programming platform on a variety of operating systems. This includes RyuJIT, but it has a somewhat baroque internal architecture.

In CoreCLR, there is no interpreter. Everything is JIT compiled by default. There is no re-JITting. There is a pluggable JIT interface, but the JIT must know intimate details of runtime features such as GC and exception handling. This talk focuses on GC.

"We still have our own compiler technology. We may or may not make that more open, we'll see how discussion goes." References to the Microsoft Phoenix work

LLVM IL Compiler: An open-source, cross-platform capable code generator for the CoreCLR based on LLVM. Pronounced 'lilac'. Expect maybe 4x slower JIT codegen performance vs RyuJIT.

Intention is to push all LLVM work back upstream. Currently using MCJIT

CoreCLR's GC is a generational, fully relocating, precise, stop-the-world collector. It also supports a conservative mode. All of these features have implications for code generation.

LLVM previously just offered GCRoot. It now offers statepoints (which Philip Reames and others at Azul are working on), and this is what LLILC is building on.

'Pinning' is used to temporarily stop an object from being relocated by the GC. The result of a pin is a native pointer. One problem is that uses of the pinned objects aren't data-dependent on them (this is one of those subtle cases where you may be best off looking at the slides to understand).

Deciding when to introduce safepoints is a difficult trade-off. Placing early minimizes the likelihood some optimisation will break it. But putting them in late maximises the ability to optimise. For the time being, early insertion is used

Observations on LLVM: been great to work with in general. Missed some things, e.g. ability to model semantically meaningful machine exceptions, memory SSA, explicit alias dependence in the IR, explicitly tracked pointer kinds, easy ability to defer lowering of runtime abstractions.

Status: able to handle around 95% of all methods in simple tests. Currently using the GC in conservative mode and exception handling isn't handled yet. Interested in the future in looking at what a true ahead-of-time system might look like.

C++ on the web: ponies for developers without pwn'ing users

PNaCl is gaining support for dynamic linking, glibc, sockets, etc.

We're now watching a demo that has the caveat 'may contain some smoke and mirrors'

"JavaScript is our microkernel". Filesystem uses the HTML5 filesystem "it's not standard, but it's been in Chrome forever and realistically isn't going to go away"

You guessed it, we're compiling code through pnacl clang inside the browser. Neat demo.

What about people running malware in your browser? This is where NaCl and the Chrome security model come in.

There's been work on supporting shared memory in JavaScript. How would you support multiple processes in JS?

LLDB debugging on Linux and Android

Google currently has about 11 engineers working on LLDB. Imagination, Intel, and Linaro are also participating.

All major features are working on x86-64 on Linux. Linux arm64 is missing hardware watchpoint support and Linux arm is under development. Windows x86 has basic debug functionality, but programs must be built with Clang and lld.

Tips for debugging LLDB: lldb has very good logging infrastructure

News and articles from around the web

A new post on the LLVM Blog details how to use LLVM's libFuzzer for guided fuzzing of libraries.

On the mailing lists

LLVM commits

The R600 backend gained an experimental integrated assembler. r234381.

The libFuzzer documentation has been extended to demonstrate how the Heartbleed vulnerability could have been found using it. r234391.

The preserve-use-list-order flags are now on by default. r234510.

LLVM gained a pass to estimate when branches in a GPU program can diverge. r234567.

The ARM backend learnt to recognise the Cortex-R4 processor. r234486.

Clang commits

Lifetime markers for named temporaries are now always inserted. r234581.

The quality of error messages for assignments to read-only variables has been enhanced. r234677.

clang-format's nested block formatting got a little better. r234304.

Other project commits

Support for the 'native' file format was removed from lld. r234641.

Remote debugging, the remote test suite, and the process to cross-compile lldb have been documented. r234317, r234395, r234489.

LLDB gained initial runtime support for RenderScript. r234503.

The Red Hat developer blog has an article about libgccjit, a new feature in GCC 5, which may be of interest.