Optimizations Corner:
Understanding Partial Stalls

The Pentium II and Pentium III microarchitectures are based on dynamic data flow analysis. The processor decodes its instructions into simpler micro-ops, called uops (internal units of work). These uops are added to the reorder buffer, and when all their resources are ready they are issued to the execution unit. A resource can be the result of another uop, or a real register or memory value. When more than one uop generates a resource, the processor cannot know which uop to take the result from, so it must wait for the value of the architectural register. That value becomes available only when the instructions generating the resource retire. For example:

    mov al,mem1

The uop of the last instruction depends on the resources BX and AX. The value in register AX is generated by two uops, and therefore the last instruction can execute only after the first two instructions retire.

To formalize the problem: a partial stall occurs when an instruction reads from a large register (for example, EAX) after a previous instruction writes to one of its partial registers (for example, AL, AH, or AX). The read stalls until the write retires, even if the instructions are not adjacent. This stall applies to every pairing of a larger register with any of its sub-components. Examples of such pairs are AX with EAX, BL with BX, and SI with ESI. The stall does not occur if the write has already retired when the read begins execution.

A partial stall also occurs in the following cases, because the processor operates on 32 bits internally (even though it seems to be operating on only 16 bits):

A MOV instruction writes to any partial register, and subsequently a MOVSX (move with sign-extend) or MOVZX (move with zero-extend) instruction reads from the same partial register. For example:

    mov ax, 7

A MOV instruction writes to any partial register, and a subsequent instruction copies the contents to any segment register. For example:

    mov ax, 7

The actual cycle loss for a partial stall depends on the number of cycles before the source instruction retires. The average cost is 7-10 cycles if the instruction uses the large register immediately after setting the small register. If more instructions are executed between this pair, the performance loss is smaller. If more than 40 uops are decoded between the setting of the small register and the use of the large one, the instruction that sets the register is certain to have retired already, and the penalty is avoided.

The XOR and SUB instructions can be used to clear the upper bits of a large register before an instruction writes to one of its partial registers. When the upper bits of the larger register are cleared in this way, reading it after writing to one of its partial registers does not cause a stall. Other methods of clearing the upper bits of the large register do not prevent a stall.
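
The rules above can be sketched with a few short sequences. This is only an illustration, not the article's original listings: `mem1` is the operand that appears in the text, `mem2` is an assumed second byte variable, and the specific instruction choices are assumptions picked to match the descriptions.

```asm
; Illustrative only: mem1 appears in the text; mem2 is an assumed byte variable.

; (a) Two partial writes feed one wider read.
mov al, mem1        ; writes AL
mov ah, mem2        ; writes AH
add bx, ax          ; reads AX, produced by two uops: waits until both MOVs retire

; (b) A partial write followed by a full-register read.
mov al, mem1        ; writes only AL
inc eax             ; reads all of EAX: partial stall

; (c) Same as (b), with the whole register cleared first.
xor eax, eax        ; clears all bits of EAX
mov al, mem1        ; writes AL
inc eax             ; reads EAX: no partial stall expected, per the rule above
```

Sequence (c) is not a drop-in replacement for (b), because it discards whatever was in the upper 24 bits of EAX. It only fits code where those bits are meant to be zero.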
The INC instruction uses the entire EAX register. The preceding MOV instruction uses just the lower portion of the EAX register: AL. This causes a partial stall. Use the XOR instruction before writing the partial register to clear all bits in EAX and prevent the stall. If a mispredicted branch or interrupt occurs between the XOR and the setting of the small register, the partial stall is not prevented.

Understanding MOB Stalls

The MOB (memory order buffer) is a memory subsystem that acts as a reservation station and a reorder buffer. It holds suspended loads and stores, and redispatches them when the blocking condition (dependency or resource) disappears. If an instruction needs to read a larger data element after a previous instruction wrote a smaller data element to the same address, the MOB cannot forward the data. As a result, the data must be loaded from memory instead of being forwarded from the buffer.

The goal of the MOB is to prevent loads (LD) from being blocked by a store (ST) in the MOB. The idea is to allow a store to forward data to a load, instead of blocking the load. Every load blocked by the MOB costs the Pentium II processor six to nine or more clocks.

Stores are buffered in the MOB and are written to memory in the background upon store instruction retirement (using spare cycles). All loads are checked against the previous stores in the MOB to detect LD/ST conflicts, so that memory ordering is maintained. Non-conflicting loads may pass stores. A conflicting load (i.e., a load from the same address as a previous store) may receive its data directly from the store in the MOB, or it may be blocked until the store executes to memory. Certain conditions must be met for a store to forward data to a load. In effect, this mechanism provides memory renaming. Store forwarding is a performance win. If a load is blocked (i.e., a conflict is detected but the store cannot forward), there is a significant performance penalty.
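
The small-store, large-load case described above can be sketched as follows. This is a minimal illustration under stated assumptions: ESI is assumed to hold the address of a dword-aligned buffer, and the register choices are arbitrary.

```asm
; Illustrative only: ESI is assumed to hold the address of a dword-aligned buffer.

; A small store followed by a larger load from the same address.
mov [esi], al       ; 8-bit store, buffered in the MOB
mov ebx, [esi]      ; 32-bit load: larger than the store, so it cannot be forwarded

; A store and a load of the same size at the same address.
mov [esi], eax      ; 32-bit store
mov ebx, [esi]      ; 32-bit load: a candidate for store forwarding
```

In the first pair the load is blocked until the store reaches memory. In the second pair the load is the kind of conflicting load that may receive its data directly from the buffered store.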
The MOB allows a store to forward data to a conflicting load only if the following conditions are met: