This time the number really fits: 4k, i.e. 4096 bytes, is an amount of data closely related to the topics of this chapter. First, it is the size of the pages used by the paging mechanism of the x86 CPU, which is used for memory and process protection as well as memory mapping under all common 32-bit operating systems, including Win32, of course. Second, it is the size limit for 4k intro programming competitions and thus relates to the size optimization tricks explained here. Third and last, file systems often organize their data in 4k chunks as well.
Because this tutorial contains many snippets instead of dealing with a single topic, it does not come with an example app. But don't worry, there are a lot of examples within the text as well as sample code in the ZIP archive.
The PE (Portable Executable) file format is used for .exe, .dll, .scr, .cpl and similar files. It consists of headers and sections.
The sections are the really interesting part of a PE file. All the code, data and/or resources of the program are stored there. Every section consists of 2 parts: the section header and the section data itself.
Each section header is 40 bytes long, and the headers follow directly one after another. They are located after the optional header. Each contains the name of the section, which can be any name you want as long as it is not longer than 8 bytes (unused bytes are padded with zeroes). For example, nothing prevents you from naming your code section .badcode or the resource section .blah.
The section header also contains the size of the data contained in the section, the offset of the section itself in the PE file and the starting address in the address range of the current process where the section should be loaded.
The last interesting field contains flags, each represented by one of its 32 bits, including the memory-access rights the section gets after being loaded into memory, such as readable, writeable and executable.
One may notice that there is nothing that determines what each section is good for. Every segment in your source code typically produces one section; additional sections are used for exporting or importing external functions and variables as well as for resources added to the program. A PE file needs at least one section (otherwise it would be empty) and can have as many as fit into the address space.
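As a rough illustration of the 40-byte section header described above, the following Python sketch unpacks one header and decodes its memory-access bits. The field order follows the IMAGE_SECTION_HEADER structure of the Win32 SDK and the three flag values are the ones from the PE specification; the helper name parse_section_header and all sample values are made up for this example.

```python
import struct

# One section header: 8-byte name, six dwords, two words, one dword = 40 bytes.
# Field order as in IMAGE_SECTION_HEADER; four fields are ignored here.
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")

# Memory-access bits of the flags field (values from the PE specification).
MEM_EXECUTE = 0x20000000
MEM_READ = 0x40000000
MEM_WRITE = 0x80000000


def parse_section_header(raw):
    """Decode one 40-byte section header (hypothetical helper)."""
    (name, virtual_size, virtual_address, raw_size, raw_offset,
     _relocs, _lines, _reloc_count, _line_count, flags) = SECTION_HEADER.unpack(raw)
    access = ""
    for letter, bit in (("r", MEM_READ), ("w", MEM_WRITE), ("x", MEM_EXECUTE)):
        if flags & bit:
            access += letter
    return {
        "name": name.rstrip(b"\0").decode("ascii"),
        "virtual_size": virtual_size,        # size of the data in memory
        "virtual_address": virtual_address,  # load address relative to the image base
        "raw_size": raw_size,                # size of the data in the file
        "raw_offset": raw_offset,            # offset of the data in the file
        "access": access,
    }


# A made-up header for a readable and executable section called .badcode
sample = SECTION_HEADER.pack(b".badcode", 0x123, 0x1000, 0x200, 0x400,
                             0, 0, 0, 0, MEM_READ | MEM_EXECUTE)
header = parse_section_header(sample)
print(header["name"], header["access"])
```

Nothing in the header says "this is code"; the sketch only reports what the loader is told, namely where the data lives and which access rights it gets.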
The size each section occupies in the file is the size of its data rounded up to the next multiple of the file alignment field defined in the header. This field is also called PE File Alignment and is a power of 2 between 512 and 65536. Thus every section in the PE file uses at least 512 bytes, even if it contains only one byte of data. The only exception is the last section, since you are allowed to strip off its unused part.
There is another alignment field called memory alignment or object alignment. It must be a multiple of 4096, the page size of the x86 CPUs, and must not be less than the file alignment. It must be a multiple of the page size because the paging mechanism used by the OS for memory protection and memory sharing works at page size granularity.
If you want to keep the file small, use a file alignment of 512 bytes; if you want it to load a bit faster, set both the file and the object alignment to 4096. In this case, the PE loader does not need to expand the sections up to the next object alignment boundary since they already have the size needed.
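The rounding described above is a one-liner for power-of-2 alignments. Here is a minimal Python sketch, assuming the two alignment values mentioned in the text; the function name align_up and the sample sizes are just illustrations.

```python
def align_up(size, alignment):
    """Round size up to the next multiple of alignment (a power of 2)."""
    return (size + alignment - 1) & ~(alignment - 1)


FILE_ALIGNMENT = 512     # smallest allowed file alignment
OBJECT_ALIGNMENT = 4096  # page size of the x86 CPU

# size of the data, size in the file, size in memory
for data_size in (1, 512, 513, 5000):
    print(data_size,
          align_up(data_size, FILE_ALIGNMENT),
          align_up(data_size, OBJECT_ALIGNMENT))
```

With both alignments set to 4096 the two results are always equal, which is the case where the loader has nothing left to expand.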
Most PE files use a set of predefined sections. These typically use the same names, though their behaviour is still determined by the entries in the PE header and/or the data directory within the optional header.
The following predefined sections are common:
The sections can be ordered in any way. The PE format does not even require the section headers to be in the same order as the sections themselves.
The predefined sections mentioned above are the ones created by most linkers. However, this is not a strict rule. According to the PE file format, the sections contain just data which is mapped to memory according to its description in its header. The use of the data itself is determined by the PE header fields.
Thus, it is possible to use a section for several purposes.
There is also an API function called VirtualProtect which can be used to set access rights on any specified memory location. Memory allocated via GlobalAlloc or LocalAlloc is by default Readable + Writeable + Executable, so one can execute code copied into it without any other actions required.
Since the linear address space of a process can be mapped into the physical memory or swap file using any possible pattern, physical memory fragmentation will never cause memory allocation to fail. However, your linear address space can still be fragmented. So it is better to allocate memory used permanently before memory which will be freed and allocated again. If you want to resize an already locked block of memory, allow the memory mapper to relocate the memory, either by unlocking it before resizing it and relocking it again (locking memory is practically the same as locking surfaces or buffers in DirectX and returns a valid pointer to the memory) or by calling the GlobalReAlloc/LocalReAlloc function the following way:
    ;for MASM and TASM: replace dword with dword ptr or large
    push dword GMEM_MOVEABLE          ;this flag allows the memory manager to move the block to another address if needed
    push dword NewSizeOfMemoryBlock   ;the new size the block should have
    push dword MemoryHandle           ;was obtained by a previous call to GlobalAlloc or GlobalReAlloc
    call [GlobalReAlloc]              ;for MASM and TASM, remove the brackets

The same goes for memory allocated by LocalAlloc/LocalReAlloc. The function returns the new handle for the memory, which is the pointer to the memory as well. Use this instead of the old one, since the memory position may have changed. By the way, in Win32 there is no difference between the Local and Global memory functions because there is only one huge 4G segment.
The address within the process address space where the data in the PE file should be loaded to is given by the ImageBase value in the PE header. However, if the address is already occupied by other mapped files or allocated memory, the PE file data has to be loaded to another address. The relocation info is needed for converting the affected memory references. For executables not exporting functions or data, relocation will not occur because it will always be the first file to be loaded into the address space of a process (with one exception: If the image base is less than 4MB, the file will be relocated to 4MB, the default address). So there is no need for relocation info in such executables. However, DLLs will in most cases be relocated since they are never the first file in memory, so chances are not that high that the default address is still unused.
I've encountered problems with some linkers which caused relocation to fail. Setting a new base address using the -base ImageBase command line switch solves the problem.
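What "converting the affected memory references" means can be sketched in a few lines of Python. This is a simplified model, not the loader's actual code: the real relocation info stores the fix-up positions grouped by 4k page, which is skipped here, and the function name and sample values are made up.

```python
import struct


def apply_relocations(image, fixup_offsets, preferred_base, actual_base):
    """Add the load delta to every 32-bit address stored at the given offsets."""
    delta = actual_base - preferred_base
    for offset in fixup_offsets:
        (address,) = struct.unpack_from("<I", image, offset)
        struct.pack_into("<I", image, offset, (address + delta) & 0xFFFFFFFF)
    return image


# Made-up image: one absolute address (0x00401000) stored at offset 4.
image = bytearray(8)
struct.pack_into("<I", image, 4, 0x00401000)

# The file wanted to be loaded at 0x00400000 but ended up at 0x10000000.
apply_relocations(image, [4], preferred_base=0x00400000, actual_base=0x10000000)
patched = struct.unpack_from("<I", image, 4)[0]
print(hex(patched))
```

If the file is loaded at its preferred ImageBase the delta is 0 and nothing has to be touched, which is why an executable that is always loaded first can live without relocation info.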
The simplest way to reduce size is using an executable packer. However, packers do not work well on files smaller than about 10k. Another way would be using a dropper, a program whose only purpose is to unpack and start a program included in it. This works a bit better if the dropper is a DOS program, because then the entire PE file including its headers can be compressed. Neither variant reduces the size of the original executable itself, so they are not discussed here.
The first point to look at is the PE file structure.
The entire PE header (DOS MZ header + DOS code + PE headers + section headers) is padded up to the next 512 byte boundary. So without having any section in it, the PE file is already 512 or 1024 bytes in size (with padding). The minimum section alignment in the file is 512 bytes, so every section adds at least 512 bytes to the file. The only exception is the last section, which does not need to be padded; thus the smallest section should be the last one.
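To see how quickly the padding adds up, here is a small Python estimate of the resulting file size. It is only a sketch of the rules stated above (padded headers, padded sections, unpadded last section); the function names and the byte counts are invented for the example.

```python
def align_up(size, alignment):
    """Round size up to the next multiple of alignment (a power of 2)."""
    return (size + alignment - 1) & ~(alignment - 1)


def pe_file_size(header_size, section_sizes, file_alignment=512):
    """Estimate the file size with the unused part of the last section stripped."""
    total = align_up(header_size, file_alignment)
    for size in section_sizes[:-1]:
        total += align_up(size, file_alignment)
    if section_sizes:
        total += section_sizes[-1]  # last section: no padding needed
    return total


# Made-up sizes: 400 bytes of headers, a 300-byte and a 20-byte section.
print(pe_file_size(400, [300, 20]))  # smallest section last
print(pe_file_size(400, [20, 300]))  # smallest section first
print(pe_file_size(400, [320]))      # everything merged into one section
```

Merging the two sections into one gives the smallest result of the three, which is the idea behind using as few sections as possible.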
Without reorganizing the entire header, the only way to downsize the PE file is to use as few sections as possible. It is even possible to put the imports and the resources into the same section as the code and the data. Code and data are in the same section if the source code uses the same segment for code and initialized data (like code segment variables). And because CS and DS/ES map to the same addresses, we do not even need a CS segment override. This works with all assemblers and linkers. Whether the imports end up in the same section as the code depends either on the import library (*.lib) files or on the assembler, depending on the import method chosen. So far, I have not seen a resource compiler / assembler / linker combination that allows the resources to share a section with other data, but often resources are not needed. Executables do not need relocation info, so the relocation section can be removed from the .exe without causing problems (some linkers already do this).
Hardcore coder hint: The padded area in the PE header, the code of the DOS stub and all section padding bytes are not used by the PE loader at all, so you can stuff data or code in them. Stuffing the code/data section is trivial, just include the additional data into your source code until it reaches the alignment border. If you want to use the other unused fields (several hundred bytes), make sure to add the section access rights you need. Loading the executable into memory again is the easiest way and does not require adjusting the access right flags, but consumes quite a lot of code. Or just use the already mapped file (the header is loaded at the Image Base) and access the memory directly, but keep in mind that the sections are aligned according to the ObjectAlignment field, thus the position in memory may not be the same as in the file.
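The difference between the position in the file and the position in memory can be computed from the section headers. The following Python sketch assumes each section is given as a (memory address relative to the image base, file offset, size in the file) tuple; the function name and the sample layout are hypothetical.

```python
def rva_to_file_offset(rva, sections, header_size):
    """Translate an address relative to the image base into a file offset.

    sections is a list of (virtual_address, raw_offset, raw_size) tuples.
    """
    if rva < header_size:
        return rva  # the headers are mapped 1:1 at the image base
    for virtual_address, raw_offset, raw_size in sections:
        if virtual_address <= rva < virtual_address + raw_size:
            return raw_offset + (rva - virtual_address)
    return None  # not backed by file data, e.g. alignment padding in memory


# Made-up layout: file alignment 512, object alignment 4096, two sections.
sections = [(0x1000, 0x200, 0x200), (0x2000, 0x400, 0x200)]

print(hex(rva_to_file_offset(0x1010, sections, header_size=0x200)))
print(hex(rva_to_file_offset(0x2004, sections, header_size=0x200)))
print(rva_to_file_offset(0x1800, sections, header_size=0x200))
```

Data stuffed into the header padding keeps its file offset in memory, while anything inside a section moves to the section's aligned memory address.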
The second size optimization method is the same as under DOS: Keeping the code itself as compact as possible, making it smaller byte per byte.
Only Win32 and 32-bit specific tips are mentioned here; the others are the same as under DOS. It is therefore recommended to read other size optimizing tutorials as well and, most importantly, to take a deep look at the x86 instruction list and get a feeling for what an asm mnemonic translates to.
One byte-consuming part of Win32 programs is function calls. Keeping the pushes as small as possible by using registers is recommended. Many parameters are 0; if a register contains 0, the 0 can be pushed using a single byte. Zeroing the register costs only 2 bytes (e.g. xor eax,eax), which is less than what is saved once the zeroed register is used as the operand of several pushes. There are also special cases where function calls can be optimized. If there is a construct like
    call [function_address]
    ret

replace it by (damn, this is so trivial)

    jmp [function_address]
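Whether such tricks pay off is a matter of counting bytes. The following Python sketch does the bookkeeping for pushing zeroes and for replacing call plus ret by a jmp. The sizes are the standard 32-bit encodings of these instructions; the table and the function name are only an illustration.

```python
# Encoded sizes in bytes of the 32-bit instructions involved.
SIZE = {
    "push imm8": 2,    # 6A ib, e.g. NASM: push byte 0
    "push imm32": 5,   # 68 id, e.g. NASM: push dword 0
    "push reg": 1,     # 50+r,  e.g. push eax
    "xor reg,reg": 2,  # 31 /r, e.g. xor eax,eax
    "call [mem]": 6,   # FF 15 + 4-byte address
    "jmp [mem]": 6,    # FF 25 + 4-byte address
    "ret": 1,          # C3
}


def bytes_saved_by_zero_register(zero_pushes, immediate="push imm8"):
    """Bytes saved by zeroing one register and pushing it instead of 0."""
    with_immediates = zero_pushes * SIZE[immediate]
    with_register = SIZE["xor reg,reg"] + zero_pushes * SIZE["push reg"]
    return with_immediates - with_register


# number of zero pushes, saving vs. push byte 0, saving vs. push dword 0
for count in (1, 2, 3, 4):
    print(count,
          bytes_saved_by_zero_register(count),
          bytes_saved_by_zero_register(count, "push imm32"))

tail_call_saving = SIZE["call [mem]"] + SIZE["ret"] - SIZE["jmp [mem]"]
print(tail_call_saving)
```

Compared with the short push form the zeroed register starts to win at the third push; compared with a full push dword 0 it already wins at the first one.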
Let's take a look at our program entry point. It is defined as:
WinMain( hInstance, hPrevInstance, lpszCmdLine, nCmdShow);
Thus our program is just treated as an ordinary function with the following dword values on the stack:
    [esp+16]: nCmdShow     ;value determining if the program window should be displayed normal, fullscreen or minimized
    [esp+12]: lpszCmdLine  ;address of the command line
    [esp+8]:  0            ;always 0 in Win32
    [esp+4]:  hInstance    ;instance handle of the current process
    [esp+0]:  stacked EIP  ;return address

There is a function call which can be replaced using this knowledge: ExitProcess at the end of a program. A ret also ends the program (if you wonder: a function usually ends with a ret instruction ;-) ), so here is my Guinness Book entry, the smallest complete Win32 application:

    ProgramEntryPoint:
        ret
    end ProgramEntryPoint

The code is exactly one byte long (there is no need to clean the stack because it is thrown away by the OS)!
Another Win32 goodie causes all open handles left after a process termination to be closed automatically. So there is no need for closing them manually. But for performance reasons you should still close them yourself as soon as possible if code size optimization is not your main goal.
In the case that DirectX is used (or other functions using the COM calling syntax), using a macro like the one used in the last two tutorials expands to about 20 bytes of code each time it is used. Replacing the macro with the following code
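    DXcall:
        pop eax          ;take the return address off the stack (eax receives the return value anyway)
        push edx         ;push the interface pointer as the first parameter
        push eax         ;put the return address back on top
        mov edx,[edx]    ;edx = address of the method table
        add edx,ecx      ;add the offset of the method
        mov edx,[edx]    ;edx = address of the method
        jmp edx          ;the method returns directly to the caller

and replacing each use of the macro by something like

        mov ecx, MethodToUse
        mov edx, InterfaceToUse
        call DXcall

reduces the code size remarkably.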
In 32-bit code, operands are either 8 or 32 bits wide by default. Thus, data addresses are always 4 bytes long. Some arithmetic instructions allow a 4-byte immediate (an immediate is a number which is part of the instruction, like the 8 in add eax,8) to be encoded as a single signed byte if its value is in the range of a signed byte. MASM and TASM use the short form by default whenever possible, while NASM always defaults to using 4 bytes. In this case NASM code requires a byte override for producing the short form:
add eax, byte 8
Altering only a part of a dword operand also reduces the size of immediates. However, if you alter a word instead of a dword there is only one byte saved instead of two because a 16 bit operand size override prefix byte is used.
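To make the size difference visible, this Python sketch builds the machine code of add reg32, value in both forms. The opcodes 83 and 81 are the standard x86 encodings; the function name is made up, and the sketch ignores the special 5-byte form (opcode 05) that exists for add eax with a 4-byte immediate.

```python
import struct


def encode_add_reg_imm(reg, value):
    """Encode 'add reg32, value' (reg: 0=eax, 1=ecx, 2=edx, 3=ebx)."""
    modrm = 0xC0 | reg  # register-direct operand, opcode extension /0 = add
    if -128 <= value <= 127:
        # short form: opcode 83, immediate stored as one sign-extended byte
        return bytes([0x83, modrm]) + struct.pack("<b", value)
    # long form: opcode 81, immediate stored as a full dword
    return bytes([0x81, modrm]) + struct.pack("<i", value)


print(encode_add_reg_imm(3, 8).hex())     # add ebx, byte 8
print(encode_add_reg_imm(3, 1000).hex())  # add ebx, 1000
```

The short form is 3 bytes long and the long form 6, so every immediate that fits into a signed byte saves 3 bytes.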
Allocating memory can be made smaller as well. Instead of allocating it with GlobalAlloc or LocalAlloc, allocating the memory from the stack uses less code. It works the same way as allocating local variables on the stack:
    sub esp,SizeOfMemoryBlock ;note that the stack grows downwards
    ;esp now points to the allocated memory block

If you need to free the allocated memory (only required if a ret instruction follows), just use:

    add esp,SizeOfMemoryBlock

There is one important issue about using the stack that way: There are two fields in the PE header called StackReserveSize and StackCommitSize. The first one determines the maximum size the stack can grow to without the danger of overwriting other data in the address space of the process. This is done by marking the amount of memory as reserved by the memory manager. The second one contains the amount of memory used for the stack which already maps to physical memory.
Another problem which occurs often is running out of registers, so that memory variables have to be used. Using a predefined variable in the data or code segment costs 5 bytes for accessing it, and a relative address still costs at least one byte more than accessing a register. So using all 8 general purpose registers may help. 8 registers? Yes, esp is a general purpose register, isn't it? It can be used the same way as esi, edi or ebp.
And there are some handy instructions for using it as a memory pointer: push works similar to stosd and pop similar to lodsd, with the difference that esp always works downwards during a push and upwards during a pop, unaffected by the direction flag. On the other hand, push and pop are much more flexible because they accept nearly any operand and not only eax like lodsd or stosd (but lodsd/stosd are smaller).
It is a common misbelief that modifying esp or the memory it points to in any other way than using push, pop, call, ret or int will cause the program to crash. The fact is that esp only has to point to the right address while it is actually used as a stack pointer. It is also possible to set esp to another address and use the memory there as the stack, as long as there is allocated memory below that address. One can use esp as a pointer for filling memory at descending addresses and still use esp as a stack pointer, since the memory area already filled by the loop is above the address esp points to and the memory used as a stack is below it. As a conclusion, the following issues must be taken care of when using esp:
So far, all sample applications in the tutorials have been GUI or Windowed Mode apps. But there are also Console Applications. A console is a simple window which can be used to read input from the keyboard and write text to it. It is the same window as the one used for the DOS box. Every program can have a console; the only difference is that console apps do not need to create one manually (by calling AllocConsole), since they are started with a console already attached. It is just a single flag in the PE header that tells the PE loader whether to create a console or not. When a console app is started from a console, it uses the console it was started from, blocking it for other actions until the program ends.
There are two ways for getting input from and writing chars to a console. So-called low-level functions work directly with the input buffer and the output buffer, using the 2-dimensional cursor position specified. And there are functions which use the standard handles STDIN, STDOUT and STDERR. Standard handles can be used like file handles in the file access functions.
A standard handle can be obtained by calling either GetStdHandle or CreateFile with the predefined filenames for the handle. Although console output looks completely ugly with its textmode layout, it can be very handy for debugging purposes. Not only is it an independent window which can be read from and written to easily and which can be placed on another monitor in a multi-screen environment, there is also the power of redirecting input and output to a file on disk or to other standard files like COM or LPT. It is also possible to use it for sending data to or reading data from another process if the console app was started by another application. The last goodie to mention is that a console app can easily be changed into a GUI app which writes its info to a file on disk instead of the console, just by changing a linker flag and some of the parameters of CreateFile.
For getting a handle to STDOUT which can be redirected use the following code:
    ;for MASM and TASM: replace dword with dword ptr or large
    push dword STD_OUTPUT_HANDLE   ;if you want to use STDERR use STD_ERROR_HANDLE instead
    call [GetStdHandle]            ;remove brackets for MASM and TASM
    ;eax contains the requested standard handle or -1 if it failed

The same can be done for getting an always non-redirected handle using:

    push dword 0
    push dword FILE_ATTRIBUTE_NORMAL
    push dword CreationFlag                          ;see below
    push dword 0
    push dword FILE_SHARE_READ + FILE_SHARE_WRITE    ;can also be 0 in most cases
    push dword GENERIC_WRITE
    push dword address_of_filename                   ;same as offset filename
    call [CreateFileA]
    ;eax contains the requested handle or -1 if it failed

The following file names can be used with CreateFile:

    push dword 0
    push dword AddressOfVariableToFillWithTheNumbersOfBytesActuallyWritten   ;may also be 0
    push dword LengthOfStringToWrite
    push dword AddressOfStringToWrite
    push dword handle                                ;as retrieved by GetStdHandle or CreateFile
    call [WriteFile]

Note that there is no need to mark the end of the string by a 0 since the string length is already given to the function. If a new line should start, just include the byte value for a line break in the string.
Now you should know most of what is needed for writing assembler programs for Win32 or porting them to it. Most of the additional information required is contained in the various SDKs and is quite well explained, and there are also many sources and examples for special topics around (mostly in C, but quite easy to read). There won't be more chapters of this tutorial, since figuring out how to use the Win32 API should not be a problem for anyone who has read the entire tutorial, and most algorithms are not language-specific, so they do not belong in this tutorial (and I'm not going to give sources for them, since the only real way to understand them is by implementing them yourself).
The following issues should always be kept in mind:
Happy coding, and watch out for other articles written by me.