The first part deals with information relevant to all languages; the others cover asm-only optimizations. Important: optimization has to be treated as a complete subject, and you have to consider almost everything. You can already increase performance a lot by following the first part alone. Asm optimizations will only add some extra speed (though in several cases, hardcore optimization can speed things up by several hundred percent again). It is best to keep everything in mind instead of trying to optimize step by step.
The most important thing is: the code cannot be better than its underlying algorithm. Why make a loop that runs 500 times three times faster when it can be replaced by two loops that need to run only 50 times each? (Hint: mathematics books may contain useful tips for coding.)
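As a toy illustration of the algorithm-beats-micro-optimization point, here is a sum computed by a loop and the same sum computed by a closed formula; the function names are made up for this example:

```c
#include <assert.h>

/* Naive version: n iterations, however well you tune the loop body. */
unsigned long sum_loop(unsigned long n) {
    unsigned long s = 0;
    for (unsigned long i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Gauss's formula: constant time, no loop left to optimize at all. */
unsigned long sum_formula(unsigned long n) {
    return n * (n + 1) / 2;
}
```

No amount of instruction scheduling in `sum_loop` can catch up with removing the loop entirely.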
Many speed-relevant things depend on the design of the complete computer, not only the CPU. These include cache speeds, RAM and hard disk access times (do not forget: random accesses are slower than sequential ones), and graphics card RAM speed (writing to it is often faster than to uncached normal RAM, reading is often remarkably slower). You should also have an idea of what the OS functions do and how long they take to execute.
Manipulate memory in blocks that fit into the first- or second-level cache, finishing each block before continuing with the next one, instead of accessing memory without any pattern. Avoid reading from graphics RAM. If you need information from the OS frequently, try to cache it in memory you allocate yourself, so that you do not need time-wasting function calls. Avoid reading from the hard disk or, even worse, the CD, and read blocks as large as possible.
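The block-wise memory advice can be sketched as a tiled matrix transpose; the tile size below is an assumption to be tuned per CPU, and the function name is illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define N 256
#define BLOCK 32   /* chosen so a BLOCK*BLOCK tile fits in L1; tune per CPU */

/* Transpose in cache-friendly tiles: each tile is finished completely
   before moving on, instead of striding through the whole matrix
   column by column and thrashing the cache. */
void transpose_blocked(int src[N][N], int dst[N][N]) {
    for (size_t bi = 0; bi < N; bi += BLOCK)
        for (size_t bj = 0; bj < N; bj += BLOCK)
            for (size_t i = bi; i < bi + BLOCK; i++)
                for (size_t j = bj; j < bj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}
```

The result is identical to the naive transpose; only the access pattern changes.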
Substitute polling routines with event handlers (that is, trap mouse and keyboard interrupts / messages; this may not be useful, though, if you only want to read the time in Windows regularly - timeGetTime may be faster than a timer callback). Maybe there are services for preloading data into caches; try them. Avoid frequently resizing the memory you use, or the OS will be kept very busy, especially if it is already swapping to disk.
Hardware acceleration may be faster, but not in all cases. If it is emulated, it will be slower than your own routine.
There are also general code optimizations:
Try to minimize loops and jumps; mispredicted jumps cost a lot of time. The same applies to calls, especially far ones.
Converting between data types, especially from integer to floating-point or back, is also a waste of clock cycles (but in some cases it can be optimized away by ignoring the type checking, which is mainly usable in asm code).
Regarding math, the following rules apply: avoid powers, they need a lot of time. Divisions are also a nasty thing. Multiplications are better, but still not optimal. Try to replace multiplications and divisions by powers of two with shifts. Additions and subtractions are fast and nice. Fast boolean operations like AND, OR and XOR, used creatively, can also replace slow range checking and do other things more efficiently.
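A few of these arithmetic substitutions, written as C one-liners (a compiler or an asm programmer applies the same transformations); the helper names are made up for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Multiplication and division by powers of two become shifts. */
uint32_t mul8(uint32_t x)  { return x << 3; }   /* x * 8  */
uint32_t div16(uint32_t x) { return x >> 4; }   /* x / 16 */

/* Modulo by a power of two becomes an AND. */
uint32_t mod32(uint32_t x) { return x & 31; }   /* x % 32 */

/* Range check 0 <= x < limit with a single unsigned compare:
   a negative x becomes a huge unsigned value after the cast,
   so one comparison replaces two. */
int in_range(int32_t x, int32_t limit) {
    return (uint32_t)x < (uint32_t)limit;
}
```

The `in_range` trick is the classic example of boolean/unsigned creativity replacing two compares and a jump.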
Use CPU registers as often as possible. Try to keep as many variables as possible in the seven registers (E)AX, (E)BX, (E)CX, (E)DX, (E)SI, (E)DI and (E)BP. Using the stack pointer as well is nearly impossible, because it is used implicitly by faults, exceptions, interrupts, calls, pushes and pops. The 32-bit registers can also be used in real-mode DOS, but not for addressing more than 64K beyond a segment address.
A well-liked optimization is using the registers partially, changing only an 8- or 16-bit part of them. Example: you have a 256*256 array. You can put the row address in AL and the column address in AH; this way you can access the elements without calculating an offset. And because both the row and the column fit into 8-bit registers, there is no danger that a memory access falls outside the structure. This can also replace range checking.
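The AL/AH indexing idea can be simulated in C with 8-bit index types; this is only a sketch of the principle, not the asm itself, and the accessor names are invented:

```c
#include <assert.h>
#include <stdint.h>

/* A 256*256 table addressed by two 8-bit indices, mimicking the
   AL/AH trick: the index types can only hold 0..255, so any
   arithmetic on them wraps around and can never address outside
   the table - implicit range checking for free. */
static uint8_t table[256 * 256];

/* Equivalent to reading via a 16-bit register whose high byte
   is the row (AH) and whose low byte is the column (AL). */
uint8_t get(uint8_t row, uint8_t col) {
    return table[(uint16_t)((row << 8) | col)];
}

void set(uint8_t row, uint8_t col, uint8_t v) {
    table[(uint16_t)((row << 8) | col)] = v;
}
```

Note that, just as with the register trick, indices wrap silently: column 250 plus 10 lands on column 4 of the same row rather than faulting.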
You can combine registers in a memory operand to calculate the address. Especially in 32-bit mode you can use constructs like MOV EAX,[EDI+4*EBP+15], which include a lot of calculation without changing the index registers involved.
You can also store up to three loop counters in one 32-bit register: one in AH, one in AL, and the third in the upper 16 bits of EAX. The 8-bit ones are decremented by one, the 16-bit one by 65536. If a carry occurs, the corresponding counter has reached the value -1, indicating the end of the loop. (Use SUB rather than DEC for the decrements: DEC does not affect the carry flag.)
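The packed-counter trick, simulated in C: each helper decrements its sub-field and reports whether a borrow occurred, which on x86 would show up in the carry flag after the SUB. The struct and function names are made up for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Three loop counters packed into one 32-bit value, in the spirit
   of the AL / AH / upper-16-bits trick. */
typedef struct { uint32_t packed; } counters;

int dec_low(counters *c) {            /* counter in bits 0..7  (AL) */
    uint8_t v = c->packed & 0xFF;
    c->packed = (c->packed & ~0xFFu) | (uint8_t)(v - 1);
    return v == 0;                    /* borrow: counter wrapped past 0 */
}

int dec_mid(counters *c) {            /* counter in bits 8..15 (AH) */
    uint8_t v = (c->packed >> 8) & 0xFF;
    c->packed = (c->packed & ~0xFF00u) | ((uint32_t)(uint8_t)(v - 1) << 8);
    return v == 0;
}

int dec_high(counters *c) {           /* bits 16..31: like SUB EAX,65536 */
    uint16_t v = c->packed >> 16;
    c->packed = (c->packed & 0xFFFFu) | ((uint32_t)(uint16_t)(v - 1) << 16);
    return v == 0;
}
```

A returned 1 corresponds to the carry the text describes: the counter was 0 and has wrapped to -1, so that loop is done.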
Another famous optimization deals with the flags. Every conditional jump (except JCXZ/JECXZ) is controlled by them. Flags are changed not only by the CMP instruction; on the other hand, not all instructions change the flags, or at least not all of them. The first statement implies that quite a lot of CMPs can be removed if an instruction like INC/DEC/ADD/... already changes the same value. The second statement means that you can do some work between the moment a flag is set and the moment it is used for jumping; for example, MOVs can be placed between CMP and Jxx. Flags used in combination with SETcc or ADC/SBB can remove jumps entirely.
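Jump removal via SETcc-style tricks, sketched in C: a comparison in C yields exactly the 0 or 1 that SETcc produces, and compilers typically turn these patterns into SETcc/CMOV instead of conditional jumps. The function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Branchless min: turn the 0/1 comparison result into an
   all-zeros or all-ones mask and select without a jump. */
int32_t min_branchless(int32_t a, int32_t b) {
    int32_t lt = -(a < b);           /* all-ones if a < b, else 0 */
    return (a & lt) | (b & ~lt);
}

/* Count elements below a threshold without a jump per element:
   the 0-or-1 comparison result is simply added, just as
   SETcc + ADD (or CMP + ADC) would do in asm. */
int count_below(const int32_t *v, int n, int32_t t) {
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (v[i] < t);
    return count;
}
```

Both functions compute the same result as their branching counterparts, but leave the CPU nothing to mispredict in the inner operation.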
E.g., Bresenham's line algorithm can be realized by calculating the fractional slope (a value somewhere between 0 and 1, indicating how much the line rises per pixel) multiplied by the maximum value the register can hold (256 when using an 8-bit one). For every pixel drawn this amount is added to the register. Every time a carry occurs, the line has climbed one pixel up or down, and you can add the carry flag to the pixel's coordinate by using ADC reg,0 or SBB reg,0.
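A C sketch of this carry-driven line stepper, assuming a shallow line in the first octant (0 <= dy <= dx, dx > 0); the 8-bit accumulator plays the role of the register, and the `>> 8` extracts what would be the carry flag. Names and the output buffer are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* For each x step, add the scaled slope (slope * 256) to an 8-bit
   fractional accumulator; whenever the low byte overflows (the
   "carry"), y advances by one - the C equivalent of ADC y,0. */
void line_y_for_x(int dx, int dy, int *ys) {   /* requires 0 <= dy <= dx */
    uint32_t step = ((uint32_t)dy << 8) / (uint32_t)dx;  /* slope * 256 */
    uint32_t acc = 0;
    int y = 0;
    for (int x = 0; x < dx; x++) {
        ys[x] = y;
        acc += step;
        y += acc >> 8;        /* carry out of the low 8 bits */
        acc &= 0xFF;          /* keep only the fractional byte */
    }
}
```

Because the slope is truncated to 8 fractional bits, very long lines can drift by a pixel; the original register trick has the same precision limit.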
The third optimization to be mentioned is that modern CPUs can execute several instructions at the same time if they do not depend on each other (read: one does not need the result of the other). You can execute two integer instructions, one FPU instruction, or maybe two MMX instructions at the same time. Intel Pentium CPUs are even capable of keeping several FPU instructions in flight at once - see the end of the page.
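Breaking a dependency chain is the standard way to exploit this: with two independent accumulators a superscalar CPU can run two additions side by side, whereas a single accumulator serializes every addition. A minimal C sketch (the function name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Sum an array with two independent accumulators: the two
   additions in the loop body have no data dependency on each
   other, so a superscalar CPU can pair them. */
int64_t sum_two_accs(const int32_t *v, int n) {
    int64_t s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)
        s0 += v[i];          /* leftover element for odd n */
    return s0 + s1;
}
```

The same idea scales to more accumulators on wider CPUs, at the cost of registers.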
Note: Do not forget to align data to WORDs for the 8086, DWORDs for the 386, and 16 bytes (eight words) for SSE.
This is a completely different thing: size optimization. Forget everything you have heard about speed optimization.
Alignment optimizations waste space; remove them. Put code that is used several times into loops or into subroutines reached by near calls.
Try to avoid constants in your code (only constants below 128 take up a single byte); they appear not only in additions and other calculations with constants - every access to a static memory variable contains a constant, too. Store data on the stack instead of in static storage; code dealing with 8- or 16-bit data is more compact than 32-bit code (as far as constants are concerned). Dealing with structures in particular can be a code-intensive thing.
COM files (DOS) are smaller; avoid linking static libraries, especially libraries containing unused code.
You may have noticed that code size and code speed are competing goals.
But there are more so-called optimizations that can make your code slower. In the past, self-modifying code was used to speed up programs, but it is the worst thing that can happen to CPUs with separate code/data caches and large prefetch queues, especially in protected mode. When code is modified, the CPU/cache state has to be recreated completely, which can take hundreds of clock cycles. Cyrix even implemented in its 5x86 an option to disable the checking for self-modifying code in order to speed the CPU up a bit. I wonder whether support for self-modifying code will be dropped sooner or later, be it in the CPUs or in the OSes.
In DOS, reloading segment registers is a popular way of simplifying memory accesses; equally popular is the misuse of a segment register as a 16-bit variable. But in protected mode (which includes V86 mode), you are not loading segment values, you are loading selectors: an index pointing into a descriptor table. This causes a lot of organizational overhead and eats a lot of time.
It is also possible to start several FPU calculations one after another. Each calculation needs several clock cycles, but if they are independent of each other they can execute in parallel on genuine Intel Pentiums and above. Some FXCH instructions can help to increase this parallelism (FXCH costs no time on the P5 and above). However, 486s, Cyrix CPUs and older AMDs will be slower than without the FXCHs: they cannot execute several FPU operations at the same time, and each FXCH slows them down a bit.