How .NET assemblies are loaded
Wednesday, April 14, 2004 6:20 PM bart

I'm preparing a talk for some guys on ".NET Framework Internals" to explore the dark sides of the CLR, IL and the rest of the .NET Framework...

Did you ever wonder how .NET assemblies are executed when you start the .exe file? Let's start with a brief overview of the CLR's role in this - I hope everybody who has ever worked with .NET knows this already. When you use one of the .NET compilers (such as csc.exe or vbc.exe), your code is compiled into IL - intermediate language - a sort of 'universal machine code' that the CLR can execute. The CLR's job is to load the IL and JIT (just-in-time) compile it to native code for the platform the code is running on. This is what makes the .NET Framework a real advantage: it's language-independent (in contrast to other middleware systems), it's fast since it's compiled at some point (in contrast ...), and it's portable (okay, not in contrast to ... :-)).

Now, let's compare this with the way Java works: the javac.exe compiler creates bytecode in a .class file, and the interpreter, java.exe, reads that bytecode and executes it. This is fairly simple to understand: you take a message that the computer itself does not understand and hand it to a translator that reads the code, translates it and executes it on the target platform.

The .NET Framework (and the CLR more specifically) works differently. Although IL is still a language that the machine's processor does not understand (it's IL, not native code), it is stored in a .exe file that you start directly, without naming a separate translator program. This is one of the 'dark sides' of the .NET Framework which I'll explain during my talk. I'll try to explain it over here as well.

What the .NET compilers spit out is a file called an assembly, which has the extension .dll or .exe. 'Assembly' is a purely .NET term that did not exist in the pre-.NET COM world (or hell?). Now, what does this file contain? It mainly consists of four key parts:

The IL code (Intermediate Language), which is the translation of the original language into code that can be executed by the CLR.

Metadata that makes the assembly self-describing (remember the xcopy deployment of .NET assemblies vs. the regsvr32 deployment in the world of COM).

A header for the CLR that holds CLR version info, tokens, flags, the strong name (for assemblies in the GAC signed using sn.exe), etc.

The PE header that is found in any executable and contains information on the type of the application, etc. (note: this has nothing to do with Windows PE ;-)).
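The last two parts are connected: the PE header's data-directory table has a slot (index 14, the COM descriptor entry) that points at the CLR header. The sketch below is only an illustration of that lookup, not a full PE parser. The helper names clr_header_location and fake_pe32 are made up for this example, and the RVA and size values fed to fake_pe32 are illustrative rather than taken from a real assembly.

```python
import struct

CLR_DIRECTORY_INDEX = 14  # IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR


def clr_header_location(image: bytes):
    """Return (rva, size) of the CLR header directory, or None if absent."""
    try:
        if image[:2] != b"MZ":
            return None
        # e_lfanew at offset 0x3C holds the file offset of the PE signature.
        (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
        if image[pe_offset:pe_offset + 4] != b"PE\0\0":
            return None
        optional = pe_offset + 4 + 20  # skip signature and 20-byte COFF header
        (magic,) = struct.unpack_from("<H", image, optional)
        if magic == 0x10B:      # PE32
            dirs = optional + 96
        elif magic == 0x20B:    # PE32+
            dirs = optional + 112
        else:
            return None
        # NumberOfRvaAndSizes sits right before the directory table.
        (count,) = struct.unpack_from("<I", image, dirs - 4)
        if count <= CLR_DIRECTORY_INDEX:
            return None
        rva, size = struct.unpack_from("<II", image, dirs + 8 * CLR_DIRECTORY_INDEX)
    except struct.error:
        return None  # buffer too short to hold the headers
    return (rva, size) if rva and size else None


def fake_pe32(clr_rva: int, clr_size: int) -> bytes:
    """Build a tiny, hypothetical PE32 header block for demonstration only."""
    dos = bytearray(0x80)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x80)
    coff = struct.pack("<HHIIIHH", 0x14C, 0, 0, 0, 0, 224, 0x0102)
    optional = bytearray(224)
    struct.pack_into("<H", optional, 0, 0x10B)   # PE32 magic
    struct.pack_into("<I", optional, 92, 16)     # 16 data directories
    struct.pack_into("<II", optional, 96 + 8 * CLR_DIRECTORY_INDEX, clr_rva, clr_size)
    return bytes(dos) + b"PE\0\0" + coff + bytes(optional)


managed = clr_header_location(fake_pe32(0x2008, 0x48))
native = clr_header_location(fake_pe32(0, 0))
print(managed, native)
```

To try it on a real file, read the file in binary mode and pass the bytes to clr_header_location; a result other than None suggests the file carries a CLR header.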

As you can see, the only part in common with classic executables is the PE (portable executable) header, which Windows uses to load the executable and start executing it. This is where the tricky part takes place. In fact, a CLR assembly just uses this PE header to start the CLR and instruct it to take over the execution of the file (other information in the PE header is ignored, while for classic executables the PE header contains information about the native CPU code as well). To find out how things work, open a .exe assembly in a hex editor. You'll find a few elements in the file that are of interest:

mscoree.dll - this is a reference to the main .dll of the CLR, which contains the Component Object Runtime Execution Engine (the import takes place in a special import section marked as .idata)

_CorExeMain - the name of the entry point in the CLR's dll file to which execution is transferred (at this point, the CLR takes over control and determines the entry point of the loaded assembly to start JITting and executing the file)

In hexadecimal format, the magic occurs over here:

5F 43 6F 72 45 78 65 4D 61 69 6E 00 6D 73 63 6F 72 65 65 2E 64 6C 6C
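To read those bytes without a hex editor, here is a small sketch that decodes them as ASCII and splits on the zero byte that terminates the first name. The variable names are just for illustration.

```python
# The byte sequence quoted above, copied verbatim.
raw = bytes.fromhex(
    "5F 43 6F 72 45 78 65 4D 61 69 6E 00 "
    "6D 73 63 6F 72 65 65 2E 64 6C 6C"
)

# The 00 byte ends the first C-style string, so splitting on it
# separates the imported function name from the DLL name.
names = [part.decode("ascii") for part in raw.split(b"\x00")]
print(names)  # ['_CorExeMain', 'mscoree.dll']
```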

So, the power is in the jump instruction to the imported _CorExeMain function (or _CorDllMain for .dll assemblies). More information (although limited) can be found on MSDN via http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/grfuncorexemain.asp. To get this to work on all kinds of Windows OSes (starting with Windows 98 and NT 4), the compilers generate a JMP 'stub function' of 6 bytes (x86 format, you know). Because Windows XP and Windows Server 2003 also support 64-bit architectures (cf. Itanium), the executable loader in these OSes was changed to support the execution of managed code natively (skipping the .idata mscoree.dll load) by examining the PE file header.
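The 6 bytes of such a stub can be sketched as follows, assuming the usual x86 indirect-jump encoding: the two opcode bytes FF 25 followed by the 32-bit little-endian address of the import table slot that the loader fills in with the address of _CorExeMain. The address 0x00402000 is a made-up example value, and build_stub and parse_stub are names invented for this sketch.

```python
import struct

JMP_INDIRECT = b"\xFF\x25"  # x86: jmp dword ptr [imm32]


def build_stub(iat_slot_address: int) -> bytes:
    """Encode a 6-byte indirect jump through the given import-table slot."""
    return JMP_INDIRECT + struct.pack("<I", iat_slot_address)


def parse_stub(code: bytes):
    """Return the slot address the stub jumps through, or None if not a stub."""
    if len(code) < 6 or code[:2] != JMP_INDIRECT:
        return None
    return struct.unpack_from("<I", code, 2)[0]


stub = build_stub(0x00402000)  # hypothetical slot address
print(" ".join(f"{b:02X}" for b in stub))  # FF 25 00 20 40 00
print(hex(parse_stub(stub)))               # 0x402000
```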

Maybe this looks exciting, but in fact it's not that new :-). All flavors of pre-.NET Visual Basic use the same trick to launch what is, in this case, an interpreter that executes the VB code. The dll that is responsible for the execution is called MSVBVMxx.dll (it stands for Visual Basic Virtual Machine). This time the hex magic looks as follows:

4D 53 56 42 56 4D 36 30 2E 44 4C 4C
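Decoded, those bytes spell MSVBVM60.DLL. As a rough sketch, the same kind of string search can guess which runtime a file bootstraps. This is only a heuristic, since a file could contain one of these strings without actually importing the DLL. The marker table and the function name guess_runtime are assumptions made for this example.

```python
# Lower-case byte markers mapped to a human-readable label (illustrative).
RUNTIME_MARKERS = {
    b"mscoree.dll": ".NET CLR",
    b"msvbvm": "Visual Basic runtime",
}


def guess_runtime(image: bytes):
    """Return labels for every known runtime DLL name found in the bytes."""
    haystack = image.lower()  # DLL names are not case-sensitive on Windows
    return [label for marker, label in RUNTIME_MARKERS.items() if marker in haystack]


vb_bytes = bytes.fromhex("4D 53 56 42 56 4D 36 30 2E 44 4C 4C")
print(vb_bytes.decode("ascii"))   # MSVBVM60.DLL
print(guess_runtime(vb_bytes))    # ['Visual Basic runtime']
```

On a real executable, pass the full file contents (read in binary mode) instead of vb_bytes.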