If one has source code for a compiler/build system whose output depends on nothing other than the content of the supplied source files, and if one also has several other compilers that one knows do not all contain the same compiler hack, one can make sure of getting an executable that depends upon nothing other than the source code.

Suppose one has source code for a compiler/linker package (say, the Groucho Suite) written in such a way that its output will not depend upon any unspecified behaviors, nor on anything other than the content of the input source files, and one compiles/links that code on a variety of independently produced compiler/linker packages (say, the Harpo Suite, the Chico Suite, and the Zeppo Suite), yielding a different executable from each (call them G-Harpo, G-Chico, and G-Zeppo). It would not be unexpected for these executables to contain different sequences of instructions, but they should be functionally identical. Proving that they are functionally identical in all cases, however, would likely be an intractable problem.
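A minimal sketch of this first stage, assuming each suite exposes a hypothetical command-line driver (harpo-cc, chico-cc, zeppo-cc) that accepts the Groucho sources and an output path:

    import subprocess

    # Illustrative source list; the real Groucho Suite would have its own.
    GROUCHO_SOURCES = ["groucho/main.c", "groucho/codegen.c"]

    # Hypothetical driver names for the three unrelated toolchains.
    STAGE1 = {
        "G-Harpo": "harpo-cc",
        "G-Chico": "chico-cc",
        "G-Zeppo": "zeppo-cc",
    }

    for output, compiler in STAGE1.items():
        # Each build may emit a different instruction sequence, but all
        # three executables should implement the same compilation function.
        subprocess.run([compiler] + GROUCHO_SOURCES + ["-o", output],
                       check=True)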

Fortunately, such proof won't be necessary if one uses the resulting executables for only one purpose: compiling the Groucho Suite again. If one compiles the Groucho Suite using G-Harpo (yielding G-G-Harpo), G-Chico (yielding G-G-Chico), and G-Zeppo (yielding G-G-Zeppo), then all three resulting files should be byte-for-byte identical. If the files match, that implies that any "compiler virus" which exists in any of them must exist identically in all of them (since all three files are byte-for-byte identical, there is no way their behaviors could differ).
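Continuing the sketch above, the second stage recompiles the Groucho sources with each stage-1 executable and compares the results byte for byte; a cryptographic hash makes the comparison convenient:

    import hashlib
    import subprocess

    GROUCHO_SOURCES = ["groucho/main.c", "groucho/codegen.c"]  # as in stage 1

    def sha256_of(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    digests = set()
    for stage1 in ("G-Harpo", "G-Chico", "G-Zeppo"):
        output = "G-" + stage1  # yields G-G-Harpo, G-G-Chico, G-G-Zeppo
        subprocess.run(["./" + stage1] + GROUCHO_SOURCES + ["-o", output],
                       check=True)
        digests.add(sha256_of(output))

    if len(digests) == 1:
        print("Byte-for-byte identical: any virus must exist in all three.")
    else:
        print("Mismatch: at least one stage-1 toolchain altered the output.")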

Depending upon the age and lineage of the other compilers, it may be possible to ensure that such a virus could not plausibly exist in them. For example, if one uses an antique Macintosh to feed a compiler that was written from scratch in 2007 through a version of MPW that was written in the 1980s, the 1980s compilers would not know where to insert a virus into the 2007 compiler. It might be possible for a compiler today to perform code analysis fancy enough to figure that out, but the level of computation required for such analysis would far exceed the level of computation required to simply compile the code, and could hardly have gone unnoticed in a marketplace where compilation speed was a major selling point.

I would posit that if one is working with compilation tools whose output bytes depend upon nothing other than the content of the submitted source files, it is possible to achieve reasonably good immunity from a Thompson-style virus. Unfortunately, non-determinism in compilation seems to be regarded as normal in some environments. I recognize that on a multi-CPU system a compiler may run faster if certain aspects of code generation are allowed to vary depending upon which of two threads finishes a piece of work first.
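The kind of non-determinism in question is easy to illustrate. The toy sketch below (not any real compiler's internals) collects "compiled" units in whichever order two worker threads happen to finish, so its output can differ from run to run; sorting the results restores a canonical order at a small cost:

    import concurrent.futures
    import hashlib

    def compile_unit(name):
        # Stand-in for code generation: a deterministic blob per source unit.
        return name + ":" + hashlib.sha256(name.encode()).hexdigest()

    units = ["alpha.c", "beta.c", "gamma.c", "delta.c"]

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(compile_unit, u) for u in units]
        # as_completed yields results in finish order, which may vary from
        # run to run, so the bytes written to an output file may vary too.
        fast_order = [f.result()
                      for f in concurrent.futures.as_completed(futures)]

    # A "canonical output" mode would impose a fixed order before writing,
    # trading a little speed for run-to-run identical output.
    canonical_order = sorted(fast_order)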

On the other hand, I see no reason that compilers/linkers shouldn't provide a "canonical output" mode in which the output depends only upon the source files and a "compilation date" that may be overridden by the user. Even if compiling code in such a mode took twice as long as normal compilation, I would suggest that there would be considerable value in being able to recreate any "release build", byte for byte, entirely from source materials, even if it meant that release builds took longer than "normal builds".
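In present-day practice, the SOURCE_DATE_EPOCH environment variable (a real convention from the reproducible-builds.org effort) serves exactly this purpose, letting the user pin the embedded build date. A sketch of how a compiler might consult it:

    import os
    import time

    def build_timestamp():
        # Honor a user-supplied override (the SOURCE_DATE_EPOCH convention);
        # otherwise fall back to the wall clock.
        override = os.environ.get("SOURCE_DATE_EPOCH")
        return int(override) if override is not None else int(time.time())

    # In canonical mode, rebuilding the same sources with the same override
    # embeds the same date, so the output bytes can match exactly.
    stamp = time.strftime("%Y-%m-%d", time.gmtime(build_timestamp()))
    print("Embedding compilation date:", stamp)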