Introduction

What follows are a number of basic ways to compact shellcode. In a follow-up post, I’ll discuss a few ways to obfuscate it, which might be useful for evading signature detection algorithms. Some of the examples illustrated here can also be used for boot loaders, PE protectors/compressors, demo coding or anything else that requires compact code. The little tricks shown here are derived from various sources, and I mention a number of people in the acknowledgements at the end of the post. I did plan on discussing the x86 architecture a little, but there’s already a lot of information out there and I assume you’re already familiar with it.

We’ll cover 4 things here:

1. Declaration and initialization of variables / registers
2. Testing value of register / variable
3. Conditional jumps / Control flow
4. Character conversions

Initializing Registers

Each CPU register is like a variable itself. The x86 CPU in legacy mode has 8 General Purpose Registers (GPR), each capable of storing 32 bits or 4 bytes of information. Of course, we don’t normally use the Stack Pointer (ESP) for anything other than stack management, so we really only have 7 GPR to use. All of them expose their low 16 bits (AX, BX, CX, DX, SI, DI, BP, SP), but only 4 (EAX, EBX, ECX, EDX) also provide access to 8-bit halves such as AL and AH.

A very common operation is to set a variable (or in this case a register) to zero. What you see below are a number of ways to do it. Some are better than others, and which one to use really depends on the situation.

    // 8-bit
    "\x30\xc0"              /* xor al, al */
    "\x28\xc0"              /* sub al, al */
    "\xb0\x00"              /* mov al, 0x0 */
    "\x24\x00"              /* and al, 0x0 */

    // 16-bit
    "\x66\x31\xc0"          /* xor ax, ax */
    "\x66\x29\xc0"          /* sub ax, ax */
    "\x66\xb8\x00\x00"      /* mov ax, 0x0 */
    "\x66\x83\xe0\x00"      /* and ax, 0x0 */

    "\x66\x6a\x00"          /* push 0x0 */
    "\x66\x58"              /* pop ax */

    // 32-bit
    "\xb8\x00\x00\x00\x00"  /* mov eax, 0 */
    "\x31\xc0"              /* xor eax, eax */
    "\x29\xc0"              /* sub eax, eax */
    "\x83\xe0\x00"          /* and eax, 0 */
    "\x6b\xc0\x00"          /* imul eax, eax, 0 */

    "\x6a\x00"              /* push 0 */
    "\x58"                  /* pop eax */

    "\xf8"                  /* clc */
    "\x19\xc0"              /* sbb eax, eax */

    "\x6a\xff"              /* push -1 */
    "\x58"                  /* pop eax */
    "\x40"                  /* inc eax */

    "\x31\xd2"              /* xor edx, edx */
    "\x92"                  /* xchg eax, edx */

    // when we know eax is < 0x80000000
    "\x99"                  /* cdq */
    "\x92"                  /* xchg eax, edx */

    "\xb8\xff\xff\xff\xff"  /* mov eax, -1 */
    "\x40"                  /* inc eax */

    "\x83\xc8\xff"          /* or eax, -1 */
    "\x40"                  /* inc eax */

    // 64-bit mode
    "\x48\x31\xc0"          /* xor rax, rax */

It’s not a complete list of ways to initialize this particular register, but the operators used (MOV, XOR, SUB, AND) can all set other variables to zero in a similar way. One thing worth mentioning about the last instruction, XOR RAX, RAX, is that it’s not necessary to perform the operation on RAX. You can save a byte by using XOR EAX, EAX because in 64-bit mode the result of a 32-bit operation is zero-extended to 64 bits.
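The sizes are easy to compare mechanically. The following is a minimal Python sketch that takes a handful of the encodings listed above (copied as they appear) and sorts them by length. The table name `ZERO_EAX` and the helper `by_size` are made up for this example.

```python
# Byte sequences copied from the list above; names are illustrative only.
ZERO_EAX = {
    "mov eax, 0":            b"\xb8\x00\x00\x00\x00",
    "xor eax, eax":          b"\x31\xc0",
    "sub eax, eax":          b"\x29\xc0",
    "push 0 / pop eax":      b"\x6a\x00\x58",
    "and eax, 0":            b"\x83\xe0\x00",
    "imul eax, eax, 0":      b"\x6b\xc0\x00",
    "clc / sbb eax, eax":    b"\xf8\x19\xc0",
    "or eax, -1 / inc eax":  b"\x83\xc8\xff\x40",
    "mov eax, -1 / inc eax": b"\xb8\xff\xff\xff\xff\x40",
}


def by_size(table):
    """Return (mnemonic, length) pairs, shortest encoding first."""
    rows = [(name, len(code)) for name, code in table.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, size in by_size(ZERO_EAX):
        print(f"{size} bytes  {name}")
```

XOR and SUB come out on top at 2 bytes each, which is why they are the usual choice when the flags they modify don’t matter.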

Initializing to -1

    "\xb8\xff\xff\xff\xff"  /* mov eax, 0xffffffff */

    "\x6a\xff"              /* push 0xffffffff */
    "\x58"                  /* pop eax */

    "\x83\xc8\xff"          /* or eax, 0xffffffff */

    "\xf9"                  /* stc */
    "\x19\xc0"              /* sbb eax, eax */

    "\xb0\xff"              /* mov al, 0xff */
    "\x0f\xbe\xc0"          /* movsx eax, al */

    "\x31\xc0"              /* xor eax, eax */
    "\x48"                  /* dec eax */

    "\x31\xc0"              /* xor eax, eax */
    "\x83\xf0\xff"          /* xor eax, 0xffffffff */

    "\x31\xc0"              /* xor eax, eax */
    "\xf7\xd0"              /* not eax */

Moving a register into another.

    // moving register or immediate value into register
    "\x83\xcb\xff"  /* or ebx, 0xffffffff */
    "\x21\xc3"      /* and ebx, eax */

    "\x31\xdb"      /* xor ebx, ebx */
    "\x09\xc3"      /* or ebx, eax */

    "\x31\xdb"      /* xor ebx, ebx */
    "\x01\xc3"      /* add ebx, eax */

    "\x31\xdb"      /* xor ebx, ebx */
    "\x31\xc3"      /* xor ebx, eax */

    "\x50"          /* push eax */
    "\x5b"          /* pop ebx */

Initializing to immediate values is also very common, but it’s not always well executed by those writing shellcode. Let’s say you need 1 in EAX/RAX, which is the exit system call on 32-bit Linux (on x86-64 Linux, system call 1 is write and exit is 60).

    "\x48\xc7\xc0\x01\x00\x00\x00"  /* mov rax, 0x1 */

    "\x48\x31\xc0"                  /* xor rax, rax */
    "\x48\xff\xc0"                  /* inc rax */

    "\x31\xc0"                      /* xor eax, eax */
    "\xfe\xc0"                      /* inc al */

    "\x6a\x01"                      /* push 0x1 */
    "\x58"                          /* pop rax */

    "\x83\xc8\xff"                  /* or eax, 0xffffffff */
    "\xf7\xd8"                      /* neg eax */

There’s more than one reason to use the PUSH/POP combination. It’s more compact than any of the others, and it’s also compatible with both 32 and 64-bit mode, whereas some of the others are not. In general, if the immediate value is between -128 and +127, use a PUSH/POP combination. For values outside this range, the opcodes are larger. Imagine an egg hunter shellcode where you want to attempt reading memory. This normally involves setting a register to 4096, which is the page size and therefore marks a page boundary. I’ve often seen the following code used.

    // 32-bit
    "\x31\xd2"              /* xor edx, edx */
    "\x66\x81\xca\xff\x0f"  /* or dx, 0xfff */
    "\x42"                  /* inc edx */

    // 64-bit
    "\x31\xd2"              /* xor edx, edx */
    "\x66\x81\xca\xff\x0f"  /* or dx, 0xfff */
    "\x48\xff\xc2"          /* inc rdx */

Here are some other ways, with the last two among the most compact at 4 and 5 bytes.

    "\x66\x68\x00\x10"      /* push 0x1000 */
    "\x66\x5a"              /* pop dx */
    "\x0f\xb7\xd2"          /* movzx edx, dx */

    "\x68\x00\x10\x00\x00"  /* push 0x1000 */
    "\x5a"                  /* pop edx */

    "\xba\x00\x10\x00\x00"  /* mov edx, 0x1000 */

    "\x31\xd2"              /* xor edx, edx */
    "\xb6\x10"              /* mov dh, 0x10 */

    "\x6a\x10"              /* push 0x10 */
    "\x5a"                  /* pop edx */
    "\x86\xf2"              /* xchg dl, dh */
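The rule of thumb for immediates (PUSH/POP when the value fits in a signed byte, something larger otherwise) can be written down as a tiny encoder. This is only a sketch in Python, assuming a 32-bit register target and covering just two of the forms shown above, PUSH imm8 followed by POP, and MOV r32, imm32. The names `REGS` and `load_imm` are invented for the example.

```python
import struct

# Register numbers as used in x86 opcode encodings (eax=0 ... edi=7).
REGS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
        "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def load_imm(reg, value):
    """Encode 'reg = value', preferring the 3-byte PUSH/POP form."""
    r = REGS[reg]
    if -128 <= value <= 127:
        # push imm8 (6A ib), then pop r32 (58+r): 3 bytes in total
        return bytes([0x6A, value & 0xFF, 0x58 + r])
    # mov r32, imm32 (B8+r id): 5 bytes
    return bytes([0xB8 + r]) + struct.pack("<I", value & 0xFFFFFFFF)


if __name__ == "__main__":
    for reg, value in (("eax", 1), ("eax", -1), ("edx", 0x1000)):
        code = load_imm(reg, value)
        print(f"{reg} = {value:#x}: {code.hex()} ({len(code)} bytes)")
```

For 4096 the encoder falls back to the 5-byte MOV, which is exactly the case where tricks such as writing 0x10 into DH pay off.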

When pushing -1 or 1 on the stack, I’ve often seen code where a register is incremented or decremented. Take this code in block_shell.asm

Pushing 1 on stack

    "\x46"      /* inc esi */
    "\x56"      /* push esi */
    "\x4e"      /* dec esi */

Which is perfectly fine, but you could just as well push 1 on the stack and save a byte.

    "\x6a\x01"  /* push 0x1 */

The same operation, this time with -1, appears near the end of the code

    "\x4e"      /* dec esi */
    "\x56"      /* push esi */
    "\x46"      /* inc esi */

We can save a byte.

    "\x6a\xff"  /* push 0xffffffff */

Okay, I’m not pointing out all the ways you can optimize metasploit code 🙂 ..just using real world examples where immediate values like this should be pushed on stack when it’s only 1 of those values you require.

In the 64-bit version, consider the following. It generates 8 bytes, compared with 2 for an immediate push.

    "\x49\xff\xc0"  /* inc r8 */
    "\x41\x50"      /* push r8 */
    "\x49\xff\xc8"  /* dec r8 */

Allocating/Initializing Memory

Compilers will use ADD, SUB or in the past ENTER(Pascal/Ada) to allocate memory on the stack.

The unconventional way is to use PUSH/PUSHFD, allocating 4 or more bytes. Even PUSHAD can allocate 32 bytes of storage with a single-byte instruction.

If using PUSHAD, you can adjust the stack later using ADD, SUB or LEA if you didn’t want to trash GPR with POPAD. Here are a few examples of allocating 32 bytes of space for something.

    "\xc8\x20\x00\x00"  /* enter 0x20, 0x0 */
    "\xc9"              /* leave */

    "\x55"              /* push ebp */
    "\x89\xe5"          /* mov ebp, esp */
    "\x83\xec\x20"      /* sub esp, 0x20 */
    "\xc9"              /* leave */

    "\x83\xec\x20"      /* sub esp, 0x20 */
    "\x83\xc4\x20"      /* add esp, 0x20 */

    "\x60"              /* pushad */
    "\x61"              /* popad */

Here, we allocate 8 bytes and initialize to zero.

    // FPU
    "\x83\xec\x08"  /* sub esp, 0x08 */
    "\x89\xe7"      /* mov edi, esp */
    "\xd9\xee"      /* fldz */
    "\xdf\x3f"      /* fistp qword [edi] */

    // MOV
    "\x83\xec\x08"  /* sub esp, 0x08 */
    "\x89\xe7"      /* mov edi, esp */
    "\x31\xc0"      /* xor eax, eax */
    "\x89\x07"      /* mov [edi], eax */
    "\x89\x47\x04"  /* mov [edi+0x4], eax */

    // STOSD
    "\x83\xec\x08"  /* sub esp, 0x08 */
    "\x89\xe7"      /* mov edi, esp */
    "\x31\xc0"      /* xor eax, eax */
    "\x57"          /* push edi */
    "\xab"          /* stosd */
    "\xab"          /* stosd */
    "\x5f"          /* pop edi */

    // PUSH
    "\x31\xc0"      /* xor eax, eax */
    "\x50"          /* push eax */
    "\x50"          /* push eax */
    "\x89\xe7"      /* mov edi, esp */

    // For 64-bit mode, we only need 1 push
    "\x31\xc0"      /* xor eax, eax */
    "\x50"          /* push rax */
    "\x54"          /* push rsp */
    "\x5f"          /* pop rdi */

Here’s how I’d allocate a 4096-byte buffer.

    // allocate 4096 bytes on stack and initialize to zero
    "\x31\xc0"  /* xor eax, eax */
    "\x31\xc9"  /* xor ecx, ecx */
    "\xb5\x10"  /* mov ch, 0x10 */
    "\x29\xcc"  /* sub esp, ecx */
    "\x89\xe7"  /* mov edi, esp */
    "\xf3\xaa"  /* rep stosb */

The above code may cause an exception on Windows (unsure about UNIX-based systems) because of a stack limit imposed upon each application.

The default maximum stack size on Windows is 1MB, and on Linux it’s typically 8MB (the default limit reported by ulimit -s). Windows pre-allocates about 64KB of stack pages while Linux allocates 128KB.

When you allocate a large block of stack memory that exceeds what’s already available, you need to ensure the page is accessible. Compilers like MSVC and MINGW perform this under the hood so you don’t have to worry about it, but for assembly programming, you need to perform the stack probe yourself.

For example, the following code will allocate approx. 20KB of stack space in 4096-byte blocks.

    // allocate 20KB using stack probe
    "\x31\xc9"      /* xor ecx, ecx */
    "\xf7\xe1"      /* mul ecx */
    "\xb1\x05"      /* mov cl, 0x5 */
    "\xb6\x10"      /* mov dh, 0x10 */
    "\x29\xd4"      /* sub esp, edx */
    "\x85\x24\x24"  /* test [esp], esp */
    "\xe2\xf9"      /* loop 0x8 */

The instruction ‘test [esp], esp‘ touches each newly allocated page, which should trigger a kernel exception that forces expansion of the stack accessible to the application. If the memory is unavailable, the program will raise an exception.
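To see why the probe touches every page instead of only the final one, here is a toy model in Python. It assumes a Windows-style scheme in which a single guard page sits directly below the committed stack: touching the guard page commits it and moves the guard down one page, while touching anything below the guard faults. That scheme, the `Stack` class and `probe_alloc` are simplifications chosen for illustration. Real kernels are more involved.

```python
PAGE = 0x1000


class Stack:
    """Toy model: committed pages grow downward one guard page at a time."""

    def __init__(self, top, committed_pages):
        self.low = top - committed_pages * PAGE  # lowest committed address

    def touch(self, addr):
        guard_low = self.low - PAGE
        if addr >= self.low:
            return                    # page is already committed
        if addr >= guard_low:
            self.low = guard_low      # guard page hit: commit it, guard moves down
            return
        raise MemoryError(f"access violation at {addr:#x}")


def probe_alloc(stack, esp, pages):
    """Mirror of the loop above: sub esp, 0x1000 / test [esp], esp."""
    for _ in range(pages):
        esp -= PAGE
        stack.touch(esp)
    return esp


if __name__ == "__main__":
    stack = Stack(top=0x200000, committed_pages=1)
    print(hex(probe_alloc(stack, 0x1FFFF0, 5)))
```

In this model, subtracting 0x5000 from ESP in one step and then writing to the new location would land below the guard page and fault, whereas five single-page steps succeed.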

Testing registers

Many functions tend to return 1 for success (TRUE) or 0 for failure (FALSE).

Some will also return -1 or less than zero to indicate failure.

The best way to test for these values is by performing some operation on register that affects the status flags.

The main ones you’ll see used here are Zero Flag (ZF), Sign Flag (SF), Parity Flag (PF) and sometimes the Carry Flag (CF).

You can of course use the Overflow Flag (OF) too, but I don’t use it in examples here.

The Adjust Flag (AF) (also known as the Auxiliary Flag) can be used, but unfortunately doesn’t have a jump opcode associated with it.

You must load the flags into a register for testing using a PUSHFD/POP combination or using the one-byte instruction LAHF.

Testing for 0 or FALSE.

    "\x83\xf8\x00"  /* cmp eax, 0x0 */
    "\x74\x12"      /* jz 0x18 */

    "\x85\xc0"      /* test eax, eax */
    "\x74\x0e"      /* jz 0x18 */

    "\x09\xc0"      /* or eax, eax */
    "\x74\x0a"      /* jz 0x18 */

    "\x21\xc0"      /* and eax, eax */
    "\x74\x06"      /* jz 0x18 */

    "\x48"          /* dec eax */
    "\x78\x03"      /* js 0x18 */

    "\x91"          /* xchg ecx, eax */
    "\xe3\x00"      /* jecxz 0x18 */

Testing for 1 or TRUE.

    "\x3c\x01"          /* cmp al, 0x1 */
    "\x75\x15"          /* jnz 0x24 */

    "\x66\x83\xf8\x01"  /* cmp ax, 0x1 */
    "\x75\x1e"          /* jnz 0x2a */

    "\x83\xf8\x01"      /* cmp eax, 0x1 */
    "\x75\x19"          /* jnz 0x24 */

    "\x0f\xba\xe0\x00"  /* bt eax, 0x0 */
    "\x73\x0f"          /* jae 0x24 */

    // opcode 0x7a is JP: taken when PF=1, i.e. when eax is 0 rather than 1
    "\x85\xc0"          /* test eax, eax */
    "\x7a\x0b"          /* jp 0x24 */

    "\x09\xc0"          /* or eax, eax */
    "\x7a\x07"          /* jp 0x24 */

    "\x21\xc0"          /* and eax, eax */
    "\x7a\x03"          /* jp 0x24 */

    "\x48"              /* dec eax */
    "\x75\x00"          /* jnz 0x24 */

    "\xf7\xd8"          /* neg eax */
    "\x78\x75"          /* js 0x7a */

Testing for -1

I’ve often seen code that tests for -1 directly as shown in the first example, but it’s more efficient and compact to use the Sign Flag (SF) in 2nd example.

If operating in legacy mode, we can save a byte by incrementing the register and testing the Zero Flag (ZF) instead.

Because ZF=1, PF=1 and SF=0 after the increment, you could alternatively use JP or JNS instead of JZ as shown in 3rd example.

If decrementing by 1, as shown in the last example, SF=1, ZF=0 and PF=0, so we can use JS, JNP, JNZ or JL.

    "\x83\xf8\xff"  /* cmp eax, 0xffffffff */
    "\x74\x2e"      /* jz 0x33 */

    "\x85\xc0"      /* test eax, eax */
    "\x78\x2e"      /* js 0x32 */

    "\x40"          /* inc eax */
    "\x74\x16"      /* jz 0x1d */

    "\x48"          /* dec eax */
    "\x78\x19"      /* js 0x22 */

The problem with the last two in 64-bit mode is that there are no 1-byte INC/DEC instructions, as those opcodes are reused as REX prefixes.

In that case, it’s better to use TEST or increment an 8-bit register (if available) like AL for EAX/RAX. You could also just test AL against itself or compare it with -1 (both are 2-byte instructions), which is smaller too.

JLE can be used after a call to a BSD socket function like recv or send, since they return -1 on error and recv returns 0 once the peer has closed the connection.

    // jump if <= 0
    "\x85\xc0"  /* test eax, eax */
    "\x7e\x19"  /* jle 0x23 */
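The flag combinations used in these tests can be checked with a few lines of Python. This is a simplified sketch that models only ZF, SF and PF for a 32-bit result, which is all that TEST, INC and DEC need for the cases above. The helper name `zsp` is invented for the example.

```python
MASK = 0xFFFFFFFF


def zsp(result):
    """Return (ZF, SF, PF) for a 32-bit result."""
    result &= MASK
    zf = int(result == 0)
    sf = result >> 31
    # PF only looks at the low byte: 1 when it has an even number of set bits.
    pf = int(bin(result & 0xFF).count("1") % 2 == 0)
    return zf, sf, pf


if __name__ == "__main__":
    print("test eax, eax with eax = 0  ->", zsp(0))
    print("test eax, eax with eax = 1  ->", zsp(1))
    print("test eax, eax with eax = -1 ->", zsp(0xFFFFFFFF))
    print("inc eax       with eax = -1 ->", zsp(0xFFFFFFFF + 1))
    print("dec eax       with eax = -1 ->", zsp(0xFFFFFFFF - 1))
```

A return value of 0 gives ZF=1 and PF=1, a return value of 1 gives ZF=0 and PF=0, and -1 is the only one of the three with SF=1, so a single TEST followed by the right conditional jump separates all three cases.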

Testing for 0x80, 0x8000, 0x80000000

Performed usually to indicate overflow after doubling/multiplication by 2.

The flags set after a TEST instruction: PF=1, SF=1, ZF=0

Imagine a value to test is 0x80000000.

    // jump if 0x80000000
    "\x85\xc0"  /* test eax, eax */
    "\x78\x19"  /* js 0x23 */

We could also use INC EAX, which sets SF=1, PF=0, OF=0 and allows us to use JS, JNP, JNO or JL.

    // jump if 0x80000000
    "\x40"      /* inc eax */
    "\x78\x19"  /* js 0x22 */

What if we use DEC EAX instead? This sets SF=0, PF=1, OF=1, which allows us to use JNS, JP, JO or JL (JG is not taken here because SF differs from OF).

    // jump if 0x80000000
    "\x48"      /* dec eax */
    "\x79\x19"  /* jns 0x22 */

By adding 0x80000000, the result is zero setting ZF=1, OF=1, CF=1. Use JZ, JO or JC/JB

Subtraction will set ZF=1, OF=0, CF=0. Use JZ, JNO or JNC/JNB.

    // jump if 0x80000000
    "\x01\xc0"  /* add eax, eax */
    "\x74\x19"  /* jz 0x23 */

Instead of add or sub, shift left by 1

    // jump if 0x80000000
    "\xd1\xe0"  /* shl eax, 1 */
    "\x74\x19"  /* jz 0x23 */

Using edx

    // jump if 0x80000000
    "\x99"      /* cdq */
    "\x42"      /* inc edx */
    "\x74\x0f"  /* jz 0x13 */

Sign extend using Shift Arithmetic Right (SAR)

    // ZF=1, SF=0 for < 0x80000000
    // ZF=0, SF=1 for >= 0x80000000
    // CF = bit 30 of the original value (the last bit shifted out)
    "\xc1\xf8\x1f"  /* sar eax, 0x1f */
    "\x78\x5b"      /* js 0x60 */

Using the negate instruction, which leaves 0x80000000 unchanged.

    // CF=1, ZF=0, SF=1, OF=1, PF=1 if eax == 0x80000000
    "\xf7\xd8"  /* neg eax */
    // any one of the following jumps is then taken
    "\x72\x38"  /* jb 0x3c */
    "\x78\x36"  /* js 0x3c */
    "\x75\x34"  /* jnz 0x3c */
    "\x70\x32"  /* jo 0x3c */
    "\x7a\x30"  /* jp 0x3c */
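The flag values quoted for 0x80000000 can be reproduced with a small model of 32-bit ADD, SUB, INC, DEC and NEG. This is a simplified Python sketch, not an emulator: AF is left out, and the function names are invented for the example.

```python
MASK = 0xFFFFFFFF
SIGN = 0x80000000


def parity(x):
    """PF: 1 when the low byte has an even number of set bits."""
    return int(bin(x & 0xFF).count("1") % 2 == 0)


def add32(a, b):
    """Result and flags after ADD a, b."""
    a &= MASK
    b &= MASK
    full = a + b
    r = full & MASK
    # Overflow: both operands share a sign that the result does not have.
    of = int(((a ^ r) & (b ^ r) & SIGN) != 0)
    return r, {"CF": full >> 32, "ZF": int(r == 0), "SF": r >> 31,
               "OF": of, "PF": parity(r)}


def sub32(a, b):
    """Result and flags after SUB a, b."""
    a &= MASK
    b &= MASK
    r = (a - b) & MASK
    # Overflow: operands differ in sign and the result's sign differs from a's.
    of = int(((a ^ b) & (a ^ r) & SIGN) != 0)
    return r, {"CF": int(a < b), "ZF": int(r == 0), "SF": r >> 31,
               "OF": of, "PF": parity(r)}


def inc32(a):
    r, flags = add32(a, 1)
    flags.pop("CF")          # INC leaves CF untouched
    return r, flags


def dec32(a):
    r, flags = sub32(a, 1)
    flags.pop("CF")          # DEC leaves CF untouched
    return r, flags


def neg32(a):
    return sub32(0, a)       # NEG computes 0 - a


if __name__ == "__main__":
    for name, op in (("inc", inc32), ("dec", dec32), ("neg", neg32)):
        print(name, op(SIGN))
    print("add", add32(SIGN, SIGN))
    print("sub", sub32(SIGN, SIGN))
```

Of the jumps listed after NEG, JO is the most selective one, since 0x80000000 is the only input for which NEG sets OF.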

Conditional jumps / Control flow

This involves performing what you might do in higher level languages using FOR, WHILE and DO/WHILE statements.

The NOP instructions are only filler material, taking the place of something otherwise useful.

Normally, I’d try to use ECX if it’s free along with a LOOP instruction, but it really depends on the situation.

Looping 2 times

    // Parity Flag
    "\x31\xc0"  /* xor eax, eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x48"      /* dec eax */
    "\x7a\xfb"  /* jp 0x3 */

    // Sign Flag
    "\x31\xc0"  /* xor eax, eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x04\x40"  /* add al, 0x40 */
    "\x79\xfa"  /* jns 0x3 */

    // Zero Flag
    "\x31\xc0"  /* xor eax, eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x04\x80"  /* add al, 0x80 */
    "\x75\xfa"  /* jnz 0x3 */

Another way to loop twice if all your registers are used up is using the carry flag.

We clear it first (if not already) save the flags on stack, execute our code, restore flags, complement and loop again.

    // using the carry flag
    "\xf8"      /* clc */
    "\x9c"      /* pushfd */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x9d"      /* popfd */
    "\xf5"      /* cmc */
    "\x72\xf9"  /* jb 0x2 */

Looping 3 times

For PF, set a register to zero and increment by 1 until PF=1.

For SF, set a register to zero and increment by a number between 43 and 63 until SF=1.

For ZF, set a register to one and increment by 85 until ZF=1

    // Parity Flag
    "\x31\xc0"  /* xor eax, eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x40"      /* inc eax */
    "\x7b\xfb"  /* jnp 0x3 */

    // Sign Flag
    "\x31\xc0"  /* xor eax, eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x04\x30"  /* add al, 0x30 */
    "\x79\xfa"  /* jns 0x3 */

    // Zero Flag
    "\x6a\x01"  /* push 0x1 */
    "\x58"      /* pop eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x04\x55"  /* add al, 0x55 */
    "\x75\xfa"  /* jnz 0x4 */

You may also point out that we can simply set EAX to 3 and decrease until zero, which is true.

    // Zero Flag
    "\x6a\x03"  /* push 0x3 */
    "\x58"      /* pop eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x48"      /* dec eax */
    "\x75\xfb"  /* jnz 0x4 */

And if ECX is free, we can simply use LOOP

    // ECX
    "\x6a\x03"  /* push 0x3 */
    "\x59"      /* pop ecx */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\xe2\xfc"  /* loop 0x4 */

Whenever ECX is free, use LOOP.

If we can’t use ECX, set AL to -1 and use subtraction which saves us a byte compared with 2nd example. You could also set AL to 1 if you wanted to use addition instead.

    // Zero Flag
    "\x0c\xff"  /* or al, 0xff */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x2c\x55"  /* sub al, 0x55 */
    "\x75\xfa"  /* jnz 0x3 */

Looping 4 times

This is a little easier since 256 is divisible evenly by 4 for ZF, 128 for SF.

    // Sign
    "\x31\xc0"  /* xor eax, eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x04\x20"  /* add al, 0x20 */
    "\x79\xfa"  /* jns 0x3 */

    // Zero
    "\x31\xc0"  /* xor eax, eax */
    "\x90"      /* nop */
    "\x90"      /* nop */
    "\x04\x40"  /* add al, 0x40 */
    "\x75\xfa"  /* jnz 0x3 */
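Picking a start value and step for a given loop count is easy to get wrong by one, so it helps to count iterations mechanically. The sketch below is a simplified Python model that tracks only AL and the ZF, SF and PF flags. That is enough for the loops above, since they either operate on AL or only look at PF and ZF of small values. The function and predicate names are invented for the example.

```python
def parity(x):
    """PF: 1 when the low byte has an even number of set bits."""
    return int(bin(x & 0xFF).count("1") % 2 == 0)


def count_iterations(start, step, keep_looping, limit=1000):
    """Run body / add al, step / jcc and return how often the body executed."""
    al = start & 0xFF
    for n in range(1, limit + 1):
        al = (al + step) & 0xFF          # a negative step models SUB/DEC
        flags = {"ZF": int(al == 0), "SF": al >> 7, "PF": parity(al)}
        if not keep_looping(flags):
            return n
    raise RuntimeError("loop did not terminate")


# Jump conditions: each returns True while the jump back is taken.
def jp(f):
    return f["PF"] == 1


def jnp(f):
    return f["PF"] == 0


def jns(f):
    return f["SF"] == 0


def jnz(f):
    return f["ZF"] == 0


if __name__ == "__main__":
    print(count_iterations(0, -1, jp))        # dec eax / jp
    print(count_iterations(0, 0x30, jns))     # add al, 0x30 / jns
    print(count_iterations(0xFF, -0x55, jnz)) # sub al, 0x55 / jnz
```

Sweeping `step` over a range with `jns` is a quick way to confirm which increments give exactly three passes.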

You could go on and on with this, but you should grasp from the above examples how to write your own. The last part of control flow involves implementing conditional calls using relative offsets. Imagine a situation where based on the result of a call to another function, you want to invoke a fake function to confuse some analysis of the code. This is very basic of course.

This only tests for TRUE or FALSE condition, but it would be trivial to test for -1 or a signed value < 0

Character conversions

There are times you might want to convert a string from lowercase to uppercase and vice versa. There are also times you’ll want conversion from unicode to ansi, although the code shown here for that will only cover latin alphabets.

Upper/Lowercase

Converting case is simply a matter of toggling a bit on and off. Look at the following characters and their binary values.

a = 01100001 A = 01000001

b = 01100010 B = 01000010

c = 01100011 C = 01000011

For lowercase, bit 5 is set (I’m counting from 0).

If you’re positive the string is only lowercase or only uppercase, then using an XOR will flip the case.

    // flip the case
    "\x34\x20"  /* xor al, 0x20 */

If you want all lowercase without having to compare each byte, just use OR

    // convert to lowercase
    "\x0c\x20"  /* or al, 0x20 */

For uppercase, use AND with 0xDF to zero out bit 5. The only problem is that it will screw up digits or other characters that don’t have a lowercase equivalent.

    // convert to uppercase
    "\x24\xdf"  /* and al, 0xdf */

Of course, you can also use BTS to set it to lowercase, but it requires more bytes.

    // set to lowercase
    "\x0f\xba\xe8\x05"  /* bts eax, 0x5 */

BTR clears the bit instead, which converts to uppercase. If you’re just flipping the case, use BTC.

    // set to uppercase
    "\x0f\xba\xf0\x05"  /* btr eax, 0x5 */

    // flip case
    "\x0f\xba\xf8\x05"  /* btc eax, 0x5 */
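The effect of each bit operation on letters and non-letters is easy to check. Here is a minimal Python sketch that applies the same masks to single byte values. The three function names are invented for the example.

```python
def flip_case(c):
    """xor al, 0x20: toggle bit 5."""
    return c ^ 0x20


def to_lower(c):
    """or al, 0x20: set bit 5."""
    return c | 0x20


def to_upper(c):
    """and al, 0xdf: clear bit 5."""
    return c & 0xDF


if __name__ == "__main__":
    sample = b"Abc7."
    print(bytes(to_lower(c) for c in sample))
    print(bytes(to_upper(c) for c in sample))
    print(bytes(flip_case(c) for c in sample))
```

Digits and the period already have bit 5 set, so OR leaves them alone, while AND with 0xDF turns the digit 7 (0x37) into the control character 0x17.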

What happens if you have digits in the string or some other special characters? I’ve used lowercase conversions for shellcode instead of uppercase, because the latter requires conditional jumps and it’s not really required. Take for example the metasploit code block_api.asm

Converting the string to lowercase, you could use the following. Bear in mind, this is reading information about DLL from InLoadOrderLinks which is different to what Metasploit reads.

    movzx ecx, word [edi+44]    ; len = BaseDllName.Length
    mov   esi, [edi+48]         ; str = BaseDllName.Buffer
    shr   ecx, 1                ; len /= 2
    xor   eax, eax              ; c = 0
    cdq                         ; h = 0
hash_dll:                       ; do {
    lodsw                       ;   c = *str++
    or    al, 0x20              ;   c = tolower(c)
    ror   edx, 13               ;   h = ROTR32(h, 13)
    add   edx, eax              ;   h += c
    loop  hash_dll              ; } while (--len)

You couldn’t just plug this into the existing metasploit code of course because the hashes generated would be completely different. It’s just to demonstrate another approach. If you set bit 5 using OR for digits 0-9, nothing changes since bit 5 is already set for those values. The same is true for periods which separate module name from extension, like KERNEL32.DLL will simply be converted to kernel32.dll using OR without the need for conditional jumps. But if converting to uppercase using SUB instruction, you need conditional jumps.
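For reference, the same loop can be mirrored in a high-level language to precompute hashes. This is a sketch in Python that follows the assembly step by step and assumes the name consists of characters that fit in a single UTF-16 code unit. `ror32` and `hash_dll` are names chosen for the example, and the values it produces will not match Metasploit’s hashes.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def hash_dll(name):
    """Per UTF-16 code unit: c |= 0x20, h = ROTR32(h, 13), h += c."""
    h = 0
    for ch in name:
        c = (ord(ch) & 0xFFFF) | 0x20    # lodsw / or al, 0x20
        h = ror32(h, 13)                 # ror edx, 13
        h = (h + c) & 0xFFFFFFFF         # add edx, eax
    return h


if __name__ == "__main__":
    for name in ("KERNEL32.DLL", "kernel32.dll", "Kernel32.dll"):
        print(f"{name}: {hash_dll(name):#010x}")
```

All three spellings print the same value, which is the point of folding the case with OR before hashing.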

Ansi and Unicode

Okay, this is not strictly unicode conversion since we’re only handling latin alphabets. Unicode (UTF-16) strings end with 2 null bytes, so this should terminate once the null terminator is reached.

    ; esi = unicode in
    ; edi = ansi out
uni2ans:
    movsb                       ; convert it to asciiz format
    dec   edi
    cmpsb
    jnz   uni2ans
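The loop is terse, so here is a byte-level emulation in Python that follows it one instruction at a time. It is a sketch under the same assumption as the assembly: the input is UTF-16LE text whose characters have a zero high byte. The function name simply reuses the label above.

```python
def uni2ans(src):
    """Emulate movsb / dec edi / cmpsb / jnz over a UTF-16LE buffer."""
    out = bytearray()
    esi = 0
    while True:
        low = src[esi]       # movsb: copy the low byte, esi++ and edi++
        out.append(low)
        esi += 1
        high = src[esi]      # dec edi / cmpsb: compare [esi] with the byte just written
        esi += 1
        if high == low:      # ZF=1, so jnz falls through and the loop ends
            return bytes(out)


if __name__ == "__main__":
    wide = "abc".encode("utf-16-le") + b"\x00\x00"
    print(uni2ans(wide))
```

The comparison only matches when both bytes of a code unit are equal, which for latin text happens at the terminating pair of null bytes, so the output includes the terminator.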

Acknowledgements

There are a lot of people who helped write this post indirectly by sharing their knowledge and ideas. Some who have inspired code shown here include: drizz, r!sc, d0ris, jb, Z0MBiE, WiteG, Vecna, Mental Driller, GriYo, JPanic, Qkumba/Peter Ferrie, Jacky Qwerty, Super, hh86, benny and any others I forgot.