TL;DR: machine code decompilers are very useful, but do not expect the same miracles that they provide for managed languages. To name several limitations: the result generally can't be recompiled, lacks names, types, and other crucial information from the original source code, is likely to be much more difficult to read than the original source code minus comments, and might leave weird processor-specific artifacts in the decompilation listing.

Why are decompilers so popular? Decompilers are very attractive reverse engineering tools because they have the potential to save a lot of work. In fact, they are so unreasonably effective for managed languages such as Java and .NET that "Java and .NET reverse engineering" is virtually non-existent as a topic. This situation causes many beginners to wonder whether the same is true for machine code. Unfortunately, it is not. Machine code decompilers do exist, and they are useful for saving the analyst time. However, they are merely an aid to a very manual process. The reason is that bytecode decompilers and machine code decompilers face different sets of challenges.

Will I see the original variable names in the decompiled source code? Some challenges arise from the loss of semantic information during compilation. Managed languages often preserve names, such as the names of fields within an object. Therefore, it is easy to present the human analyst with the names that the programmer chose, which are hopefully meaningful. This improves the speed of comprehension of decompiled bytecode. On the other hand, compilers for machine-code programs usually destroy most or all of this information while compiling the program (perhaps leaving some of it behind in the form of debug information). Therefore, even if a machine code decompiler were perfect in every other way, it would still render non-informative variable names (such as "v11", "a0", "esi0", etc.) that slow human comprehension.
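To make the naming problem concrete, here is a small sketch in C. Both functions are hypothetical: the first stands in for original source, and the second is a hand-written imitation of the style of output described above, not the output of any real tool. The names `sum_positive`, `sub_401000`, `a1`, `a2`, `v2`, and `v3` are all invented for illustration.

```c
#include <stddef.h>

/* Hypothetical original source: the names carry the intent. */
int sum_positive(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; ++i) {
        if (values[i] > 0)
            total += values[i];
    }
    return total;
}

/* Hand-written imitation of decompiler-style output for the same logic.
 * Placeholder names (sub_401000, a1, a2, v2, v3) are assumptions that
 * mimic the kind of names mentioned in the text. */
int sub_401000(const int *a1, size_t a2)
{
    int v2 = 0;
    size_t v3;

    for (v3 = 0; v3 < a2; ++v3) {
        if (a1[v3] > 0)
            v2 += a1[v3];
    }
    return v2;
}
```

The two functions compute the same result, but only the first tells the reader what the result means. Renaming `v2` to `total` is exactly the kind of manual work left to the analyst.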

Can I recompile the decompiled program? Some challenges relate to disassembling the program. In bytecode languages such as Java and .NET, the metadata associated with the compiled object will generally describe the locations of all code bytes within the object; that is, every function has an entry in some table in a header of the object. In machine language, on the other hand, the disassembler does not know where the code within the binary is located unless it has the help of heavy debug information. Taking x86 Windows as an example, that would be something like a PDB. Without it, the disassembler is given only hints, such as the entrypoint of the program. As a result, machine code disassemblers are forced to implement their own algorithms to discover the code locations within the binary. They generally combine two approaches: a linear sweep (scanning through the text section, in practice often looking for known byte sequences that usually denote the beginning of a function), and recursive traversal (when a call instruction to a fixed location is encountered, that location is considered to contain code). However, these algorithms generally will not discover all of the code within the binary. Compiler optimizations such as interprocedural register allocation modify function prologues, which defeats the signature-based sweep, and naturally occurring indirect control flow (e.g., a call via a function pointer) defeats recursive traversal. Therefore, even if a machine code decompiler encountered no problems other than this one, it could not generally produce a decompilation of an entire program, and hence the result could not be recompiled. The code/data separation problem described above is undecidable in general, a property it shares with other impossible problems such as the Halting Problem.
Therefore, abandon hope of finding an automated machine code decompiler that will produce output that can be recompiled to obtain a clone of the original binary.
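The failure mode of recursive traversal can be sketched on a toy example. Everything below is an assumption made for illustration: the four-opcode instruction set is invented (it is not x86), and `recursive_traversal`, `sample_image`, and `MAX_IMAGE` are hypothetical names. The sketch follows direct calls from the entrypoint and marks the bytes it decodes as code; a function reached only through an indirect call is never visited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented toy instruction set, for illustration only (not x86). */
enum {
    OP_NOP   = 0x00, /* 1 byte                                        */
    OP_CALL  = 0x01, /* 2 bytes: opcode, then absolute target offset  */
    OP_CALLR = 0x02, /* 1 byte: indirect call, target unknown statically */
    OP_RET   = 0x03  /* 1 byte: ends the current path                 */
};

#define MAX_IMAGE 256

/* Offsets 6-7 hold a function that is only ever reached via OP_CALLR. */
static const uint8_t sample_image[8] = {
    OP_CALL, 4,  /* 0: direct call to offset 4       */
    OP_CALLR,    /* 2: call through a register       */
    OP_RET,      /* 3                                */
    OP_NOP,      /* 4: target of the direct call     */
    OP_RET,      /* 5                                */
    OP_NOP,      /* 6: reachable only indirectly     */
    OP_RET       /* 7                                */
};

/* Marks every byte reached from `entry` in is_code[] (1 = code) and
 * returns how many bytes were marked. */
size_t recursive_traversal(const uint8_t *image, size_t len,
                           size_t entry, uint8_t *is_code)
{
    size_t worklist[MAX_IMAGE];
    size_t pending = 0, found = 0, i;

    assert(len <= MAX_IMAGE && entry < len);
    for (i = 0; i < len; ++i)
        is_code[i] = 0;
    worklist[pending++] = entry;

    while (pending > 0) {
        size_t pc = worklist[--pending];

        while (pc < len && !is_code[pc]) {
            uint8_t op = image[pc];
            size_t size;

            if (op > OP_RET)
                break;                 /* unknown opcode: stop this path */
            size = (op == OP_CALL) ? 2 : 1;
            if (pc + size > len)
                break;                 /* truncated instruction */

            for (i = 0; i < size; ++i) {
                if (!is_code[pc + i]) {
                    is_code[pc + i] = 1;
                    ++found;
                }
            }
            if (op == OP_CALL) {
                size_t target = image[pc + 1];
                /* A fixed target is known to be code: queue it. */
                if (target < len && !is_code[target] && pending < MAX_IMAGE)
                    worklist[pending++] = target;
            }
            /* OP_CALLR gives no target to follow; just fall through. */
            if (op == OP_RET)
                break;
            pc += size;
        }
    }
    return found;
}
```

Running the traversal on `sample_image` from offset 0 marks offsets 0 through 5 and leaves offsets 6 and 7 unmarked, even though they are code. A real disassembler would pair this with other heuristics, but none of them can close the gap in general.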

Will I have information about the objects used by the decompiled program? There are also challenges relating to how languages such as C and C++ are compiled compared with managed languages; I'll discuss type information here. In Java bytecode, there is a dedicated instruction called 'new' to allocate objects. It takes an integer operand that indexes into the .class file's constant pool, which identifies the class of the object to be allocated. The class metadata in turn describes the layout of the class, the names and types of the members, and so on. This makes it very easy to decompile references to the class in a way that is pleasing to the human inspector. When a C++ program is compiled, on the other hand, in the absence of debug information or metadata such as RTTI, object creation is not conducted in a neat and tidy way. The program calls a user-specifiable memory allocator, and then passes the resulting pointer as an argument to the constructor function (which may also be inlined, and therefore not be a function at all). The instructions that access class members are syntactically indistinguishable from local variable references, array references, etc. Furthermore, the layout of the class is not stored anywhere in the binary. In effect, the only way to discover the data structures in a stripped binary is through data flow analysis. Therefore, a decompiler has to implement its own type reconstruction in order to cope with the situation. In fact, the popular decompiler Hex-Rays mostly leaves this task up to the human analyst (though it also offers the human useful assistance).
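A rough C sketch of why member accesses are hard to recognize is shown below. The `struct account` type and every function name are hypothetical, and the `sub_...` functions are hand-written imitations of untyped decompiler-style output rather than output from a real tool. The fields use `int32_t` so that the offset of `balance` is 4 on common platforms, which is an assumption of the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical source-level type. Its layout is not recorded anywhere
 * in a stripped binary. */
struct account {
    int32_t id;       /* offset 0 */
    int32_t balance;  /* offset 4 (assumed: no padding) */
};

/* What the programmer wrote. */
int32_t get_balance(const struct account *acct)
{
    return acct->balance;
}

/* Imitation of the same accessor before any type is recovered: a load
 * from "pointer plus 4". Nothing says whether a1 points to a struct,
 * an array, or a local buffer. */
int32_t sub_401020(const void *a1)
{
    return *(const int32_t *)((const char *)a1 + 4);
}

/* Imitation of an out-of-line constructor: two stores at fixed offsets. */
void sub_401040(void *a1, int32_t a2)
{
    *(int32_t *)a1 = a2;
    *(int32_t *)((char *)a1 + 4) = 0;
}

/* Imitation of object creation: an allocator call with a bare size,
 * followed by the constructor call on the returned pointer. */
void *sub_401060(int32_t a1)
{
    void *v1 = malloc(8);

    if (v1)
        sub_401040(v1, a1);
    return v1;
}
```

Calling `sub_401020` on a `struct account` returns its `balance`, and calling it on a two-element `int32_t` array returns the second element. The expression is identical in both cases, which is why the decompiler (or the analyst) has to infer the structure from how the pointer flows through the program.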

Will the decompilation basically resemble the original source code in terms of its control flow structure? Some challenges stem from compiler optimizations having been applied to the compiled binary. The popular optimization known as "tail merging" mangles the control flow of the program relative to what a less aggressive compiler would produce, which usually manifests itself as a lot of goto statements within the decompilation. The compilation of sparse switch statements can cause similar problems. Managed bytecode, on the other hand, often has dedicated switch instructions (for example, tableswitch and lookupswitch in Java bytecode, and switch in .NET CIL), so the original structure is easier to recover.
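The effect of tail merging can be sketched as follows. This is an illustration under stated assumptions: `parse_pair` is a hypothetical original function, and `sub_4010A0` is a hand-written imitation of how a decompiler might render it after the compiler has merged the two identical error tails into one copy. The global names are kept the same in both versions only so that the two can be compared directly.

```c
/* Hypothetical globals shared by both versions for easy comparison. */
int last_error;
int error_count;

/* Hypothetical original source: two error paths end with the same
 * statements (bump the counter, return -1). */
int parse_pair(int a, int b, int *out)
{
    if (a < 0) {
        last_error = 1;
        error_count++;
        return -1;
    }
    if (b > 100) {
        last_error = 2;
        error_count++;
        return -1;
    }
    *out = a + b;
    return 0;
}

/* Imitation of decompiler-style output once the shared tail exists only
 * once in the binary: the first error path has to jump into the second. */
int sub_4010A0(int a1, int a2, int *a3)
{
    if (a1 < 0) {
        last_error = 1;
        goto LABEL_5;
    }
    if (a2 > 100) {
        last_error = 2;
LABEL_5:
        ++error_count;
        return -1;
    }
    *a3 = a1 + a2;
    return 0;
}
```

Both functions behave identically, but the second no longer mirrors the two independent checks in the source. Undoing the merge would require duplicating code, which decompilers are generally reluctant to do, so the goto stays.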