Chapter 2, Opteron's Floating Point Units

2.1 The Floating Point Renamed Register File

Opteron's floating point renamed register file has been increased from 88 to 120 entries. It is a renamed register file in the classical sense: a single entity that must contain all architectural (non-speculative) and speculative values for the registers defined by the instruction set. The Opteron restores support for 72 speculative instructions. This support had been decreased from 72 to 56 with the introduction of the Athlon XP core, which added the eight 128 bit XMM registers for SSE but did not increase the size of the 88 entry renamed register file. Each 128 bit XMM register uses two entries in the renamed register file. The Opteron thus uses 32 entries to hold the architectural (retired) state of the now 16 XMM registers, which explains the increase: 88 + 32 makes 120 entries. 40 of the 120 entries are used to hold the architectural (non-speculative) state of the registers defined by the instruction set: 32 for the sixteen XMM registers and 8 for the eight x87/MMX registers. A further 8 entries are used for microcode scratch registers, sometimes called micro-architectural registers. These registers are not defined by the instruction set and are not directly visible to the programmer. They are used by microcode for complex floating point operations such as sine or logarithm instructions. The 48 (40 + 8) entries that define the architectural state of the processor are identified by the 48 entry Architectural Tag Array. The entries that hold the very latest speculative values for the 48 architectural register entries are identified by the 48 entry Future File Tag Array. The speculative state of the processor must be discarded in case of a branch misprediction or exception. This is handled by overwriting the 48 entries of the Future File Tag Array with those of the Architectural Tag Array. Each entry of the renamed register file is 90 bits wide.
Floating point values are expanded to a total of 90 bits (68 mantissa bits, 18 exponent bits, 1 sign bit and 3 class bits). The three class bits contain extra information about the floating point number. The class bits also identify non-floating-point contents (integers), which are not expanded when written into the renamed register file.

The 120 registers:

  8  non-speculative FP/MMX registers (architectural)
 32  non-speculative SSE/SSE2 registers (architectural)
  8  non-speculative microcode scratch registers (architectural)
  8  speculative FP/MMX registers (latest)
 32  speculative SSE/SSE2 registers (latest)
  8  speculative microcode scratch registers (latest)
 24  remaining speculative registers

Subdivision of the 90 bit registers for FP:

 68  mantissa bits
 18  exponent bits
  1  sign bit
  3  class code bits

Definition of the 3 bit class code:

 0  Zero
 1  Infinity
 2  Quiet NaN (Not A Number)
 3  Signaling NaN (Not A Number)
 4  Denormal (very small FP number)
 5  MMX / XMM (non-FP contents)
 6  Normal (FP number, not very small)
 7  Unsupported

2.2 Floating Point rename stage 1: x87 stack to absolute FP register mapping

The "stack features" of the legacy x87 are undone in this first stage of the floating point pipeline. The x87 instructions access the eight architectural 80 bit registers via a 3 bit Top Of Stack (TOS) pointer. Instructions use the TOS as both source and destination. The second argument can be another value on the stack, relative to the TOS register, or a memory operand. The 3 bit TOS pointer is maintained in the 16 bit x87 FP status register. The x87 TOS-relative references are replaced by absolute references which directly identify the x87 registers involved in the operation. A speculative version of the TOS pointer is used for the translations. The 3 bit pointer can be updated by the actions of up to three instructions per cycle. Instructions can be speculative but are still in order at this stage; they have not yet been scheduled by the floating point Out-Of-Order scheduler. If an exception or a branch misprediction occurs, the speculative TOS pointer is replaced with the non-speculative retired one, which is retrieved from the reorder buffer. The retired version reflects the value of the TOS during the instruction just prior to the one that caused the exception or branch misprediction.
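The translation described above can be sketched in a few lines. This is a minimal illustration of the mapping only; the function names and the exact push/pop ordering shown here are my own, not AMD's implementation:

```python
# Hypothetical sketch of rename stage 1: translate stack-relative ST(i)
# references to absolute x87 register numbers using the speculative
# 3 bit Top-Of-Stack pointer.

def st_to_absolute(tos: int, i: int) -> int:
    """ST(i) names the register i slots below the current top of stack;
    the 3 bit pointer wraps modulo 8."""
    return (tos + i) % 8

# FLD decrements TOS (push) before its write; FSTP increments it (pop)
# after its read, so a short sequence can be traced like this:
tos = 0
tos = (tos - 1) % 8                  # FLD pushes: TOS becomes 7
fld_dest = st_to_absolute(tos, 0)    # FLD writes absolute register 7
fadd_src = st_to_absolute(tos, 1)    # a later FADD ST(0),ST(1) reads reg 0
```

After this stage, all register references are absolute, so the Out-Of-Order machinery downstream never needs to know about the stack.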

2.3 Floating Point rename stage 2: Regular Register Renaming

The actual register renaming takes place in this stage. Each instruction that needs a destination register gets one assigned here. The destination registers must be unique with respect to all other instructions in flight: no two instructions may write to the same register. Up to three free register entries are obtained from the register free list. There are 120 registers available in total. The free list can have a maximum of 72 free entries, equal to the maximum number of instructions in flight. The remaining 48 entries hold the values of the (non-speculative) architectural registers: the eight x87/MMX registers, the eight scratch registers (accessible by microcode only) and the sixteen 128 bit XMM registers for SSE and SSE2, each using two entries. These registers are not at a fixed location but may occupy any of the 120 entries. This is what makes the free list necessary. The 48 entries occupied by the architectural registers mentioned above are identified by the 48 entry Architectural Tag Array. It has an entry for each architectural register with a value that points to one of the 120 renamed registers. Up to three instructions can thus be renamed per cycle.

The data dependencies are handled with the aid of another structure, the 48 entry Future File Tag Array. This array contains pointers to the 48 renamed registers that contain the very latest speculative values for each of the architectural registers. The instructions that are being renamed access this structure to obtain the renamed registers where they can find their source operands. The instructions then store the renamed register which was allocated to them into the Future File Tag Array, so that subsequent instructions know where to find the result data.

Example: An instruction uses architectural registers 3 and 5 as input data and writes its result back into register 3. It will first read entries 3 and 5 to obtain the pointers to the renamed registers that contain, or will contain, the latest values for registers 3 and 5; say renamed registers 93 and 12. The instruction now knows its source registers, 93 and 12, and can overwrite entry 3 of the Future File Tag Array with the renamed register it was assigned to store its result, say 97. A subsequent instruction that needs architectural register 3 will now use renamed register 97.

If an exception or branch misprediction occurs, the 48 entries of the Future File Tag Array are overwritten with the 48 entries from the Architectural Tag Array. All speculative results are thereby discarded. The pointers in the Architectural Tag Array were written there by the retirement logic. Up to three values can be written per cycle for each line of instructions that retires. The values are taken from the Reorder Buffer. The Reorder Buffer is shared by all instructions. Floating point instructions that finish write certain information, like exception status, the TOS used et cetera, into the Reorder Buffer. This information also includes the destination register they modify: both the architectural register number and the renamed register number are stored in the Reorder Buffer. The two of them are used to update the Architectural Tag Array at retirement, one as the data and the other as the entry number of the Architectural Tag Array.
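The renaming walk-through above can be condensed into a small sketch. The class layout, method names and the initial mapping (architectural registers occupying physical entries 0..47) are illustrative assumptions of mine, not the actual hardware organization:

```python
# Minimal sketch of the rename step: a free list, a Future File Tag Array
# (latest speculative mapping) and an Architectural Tag Array (retired
# mapping). Entry numbers are for illustration only.

class Renamer:
    def __init__(self, n_arch=48, n_phys=120):
        # Assume architectural registers start in physical entries 0..47.
        self.future_file = list(range(n_arch))        # latest speculative map
        self.arch_tags = list(range(n_arch))          # retired map
        self.free_list = list(range(n_arch, n_phys))  # 72 free entries

    def rename(self, srcs, dest):
        """Look up physical sources, allocate a fresh physical destination,
        and record it so later instructions find the newest value."""
        phys_srcs = [self.future_file[s] for s in srcs]
        phys_dest = self.free_list.pop(0)
        self.future_file[dest] = phys_dest
        return phys_srcs, phys_dest

    def recover(self):
        """On an exception or branch misprediction, discard all speculative
        mappings by copying the Architectural Tag Array over the Future File."""
        self.future_file = self.arch_tags[:]
```

A second instruction that reads architectural register 3 after the first one wrote it automatically picks up the newly allocated physical register, which is exactly the dependency-tracking behaviour described in the text.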

2.4 Floating Point instruction scheduler

The Floating Point scheduler uses the following three criteria to determine if it may dispatch an instruction to the execution pipeline it has been assigned to (FPMUL, FPADD, FPMISC):

 1) The instruction's source registers and/or memory operands will be available.
 2) The pipeline to which the instruction has been assigned will be available.
 3) The result bus for that pipeline will be available on the clock cycle in which the instruction will complete.

The scheduler will always dispatch the oldest instruction that is ready for each of the three pipelines. When we say "will be available" we mean in two cycles from the current cycle. It takes two cycles to get an instruction into execution: one to schedule and another to read the 120 entry renamed register file. An instruction checks if its source registers are available when it is first placed in the scheduler. After that it continuously monitors the Tag busses of the result busses for all source data still missing. The Tag busses run two cycles ahead of the result busses. The scheduler can thus see two cycles in advance which results will become ready. A dispatched instruction arrives in two cycles at its execution unit, where it grabs the incoming result data from the selected result bus.

The execution pipelines are 4 stages deep. Instructions with lower latencies may leave the pipeline earlier, after two or three cycles. Two cycles, however, is the shortest execution latency. Instructions that need load data from memory wait until the data arrives from the L1 Data Cache or from further away in the memory hierarchy. The scheduler knows two cycles in advance that data is coming. This is one cycle more than for integer loads. The extra cycle stems from the Data Convert and Classify unit that pre-processes floating point data from memory. A load miss prevents the instruction which needed the load data from being removed from the scheduler. The instruction stays in the scheduler until the data arrives with a load hit. Any instruction that was scheduled depending on a load that missed is invalidated and its results are not written to the register file.
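The three dispatch criteria can be expressed as a single predicate. This is a schematic model under my own assumptions about how availability is encoded (cycle numbers for operand readiness and pipeline occupancy), not the actual wake-up logic:

```python
# Illustrative model of the three FP dispatch criteria. An instruction
# dispatched in `cycle` enters execution two cycles later (schedule +
# register file read) and completes `latency` cycles after that.

def can_dispatch(instr, cycle, result_bus_busy):
    """instr: dict with 'latency', 'src_ready' (cycles each source becomes
    available), 'pipe' (FPMUL/FPADD/FPMISC) and 'pipe_free_at'.
    result_bus_busy: per-pipe set of cycles the result bus is taken."""
    exec_cycle = cycle + 2                      # schedule + register read
    done_cycle = exec_cycle + instr["latency"]
    return (all(ready <= exec_cycle for ready in instr["src_ready"])  # 1)
            and instr["pipe_free_at"] <= exec_cycle                   # 2)
            and done_cycle not in result_bus_busy[instr["pipe"]])     # 3)
```

The two-cycle lookahead in the model mirrors the Tag busses running two cycles ahead of the result busses: the scheduler only needs to know what will be ready at `exec_cycle`, not what is ready now.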

2.5 The 5 read and 5 write ports of the floating point renamed register file

The renamed register file is accessed directly after the instructions are dispatched Out-Of-Order by the scheduler. Up to three instructions can access the register file simultaneously, one for each of the three functional units. The FPMUL and FPADD instructions obtain two source operands each, while instructions for the FPMISC unit only need a single operand. Three write ports are available to write results from the floating point units back to the register file. The write addresses arrive earlier than the result data. This is used to decode the write address in the cycle before the write occurs. All three units can have memory data as a source operand. The reorder buffer tags that accompany the data coming from memory are translated to renamed register locations by the load mapper. Two 64 bit loads can be handled per cycle. The new 120 entry register file has bypass logic at both sides. The bypasses are used to pass result and/or load data directly to succeeding dependent instructions, thereby avoiding any extra delay that would result from the actual writing and reading of the register file.

2.6 The Floating Point processing units

There is a range of processing units connected to the FPMUL, FPADD and FPMISC register file ports. The ports determine to which of the three floating point pipelines a particular unit belongs.

The x87 and SSE2 floating point multiplier handles 64 bit and 80 bit extended length multiplications. The large Wallace tree which handles the 64 bit multiplications for 80 bit extended floating point and 64 bit integer multiplications can be split into two independent Wallace trees that handle the dual 32 bit SIMD multiplications used for SSE and 3DNow! functions (US Patent 6,490,607). This unit can also autonomously handle floating point divide and square root functions. These instructions are not implemented with microcode but are handled entirely by the unit itself with a single direct path instruction. The unit contains bi-partite lookup tables for this purpose (US Patent 6,256,653). These tables contain base values and differential values for rapid reciprocal and reciprocal square root approximations, which are then used as a starting point for the divide and square root instructions. This unit is connected to the FPMUL ports of the register file.

The x87 and SSE2 floating point adder handles 64 bit and extended length additions and subtractions. It is connected to the FPADD ports of the register file.

The 3DNow! and SSE dual 32 bit floating point unit handles the single length SIMD floating point instructions as introduced in 3DNow! by AMD and SSE by Intel (the latter is called 3DNow! Professional in the Athlon XP). This unit is connected to both the FPMUL and FPADD ports and can handle one 64 bit (2x32) instruction of each group per cycle: one MUL type and one ADD type instruction per cycle. 128 bit instructions of either type have a throughput of one per two cycles.

The 2x64 bit MMX/SSE ALU unit is a dual unit that can handle certain packed integer 128 bit SSE instructions at a throughput of 1 per cycle. It is connected to both the FPMUL and FPADD ports. The FPMUL ports are used even though the instructions aren't multiplications but rather adds, subtracts and logic functions. The idea is to double the size of the operands that can be read from and written to the register file to a full 128 bits. The 128 bit SSE instructions are still handled as two individual 64 bit operations. The throughput is increased to one per cycle because they can be executed by both the FPMUL and the FPADD pipelines.

The 1x64 bit MMX/SSE multiplier unit handles MMX and SSE integer multiplies. It is connected to the FPMUL ports of the register file. It can handle a single 64 bit MMX instruction per cycle, or a 128 bit SSE instruction with a 2 cycle throughput using two 64 bit operations.

The FP Store unit, more recently called the FP Miscellaneous unit, handles not only the stores but also a number of other single operand functions such as Integer to Float and Float to Integer conversions. It further provides many functions used by Vector Path generated microcode to handle more complex x87 operations. It contains the Floating Point Constant ROM that holds a range of floating point constants such as pi, e, log2 et cetera.

2.7 The Convert and Classify units

Load data that arrives from the L1 Data Cache or from further away in the memory hierarchy goes through the Convert and Classify unit first. The load data is converted, if appropriate, to the internal 87 bit floating point format (1 sign bit, 18 exponent bits and 68 mantissa bits). The floating point values are also classified into a three bit class code. The 87 + 3 = 90 bits are then stored into the 90 bit register file. The class code can subsequently be used to speed up floating point operations. For example: only the class code needs to be tested to find out if a number is zero, instead of all 86 mantissa plus exponent bits. We've seen that the Floating Point Scheduler runs two cycles ahead of the actual execution units, one cycle more than the Integer Scheduler. It observes the Tag busses that identify two cycles in advance which results will become ready at a certain result bus. The Tag busses also indicate in advance which data will come from memory. However, the hit/miss signal may later indicate that the data was erroneous because of a Cache Miss. The Convert and Classify units add an extra cycle of at least somewhat useful work in order to give the scheduler the time to take the hit/miss signal into account. The Optimization manual has a whole appendix (E) dedicated to SSE and SSE2 optimizations related to the classification of the contents of the SSE registers. Instructions that operate on another data type than expected should be avoided. Revision C does not need these optimizations anymore. It is likely that Revision C can perform these format translations itself, without the intervention of microcode after an exception.
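The classification step can be illustrated with the class codes from section 2.1. This sketch classifies Python floats rather than raw 64 bit patterns, so it cannot distinguish a signaling from a quiet NaN; it only shows the idea of reducing a full value test to a small code:

```python
import math

# Class codes from section 2.1: 0 Zero, 1 Infinity, 2 Quiet NaN,
# 3 Signaling NaN, 4 Denormal, 5 MMX/XMM (non-FP contents), 6 Normal,
# 7 Unsupported. The real unit inspects raw exponent/mantissa fields.

SMALLEST_NORMAL = 2.2250738585072014e-308  # smallest normal double

def classify(x, is_fp=True):
    if not is_fp:
        return 5            # MMX/XMM integer contents, stored unexpanded
    if x != x:
        return 2            # NaN (SNaN is not expressible as a Python float)
    if math.isinf(x):
        return 1
    if x == 0.0:
        return 0
    if abs(x) < SMALLEST_NORMAL:
        return 4            # denormal: magnitude below the normal range
    return 6                # ordinary normal number
```

With the class code stored next to the value, a zero test is a 3 bit compare instead of an 86 bit one, which is exactly the speed-up the text describes.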

AMD has managed to eliminate much of the x87 legacy overhead and has sped up some important but problematic functions, most notably those involving the x87 status register. Early Athlons used a large area to handle the processing of the 16 bit floating point status register. This has all gone, some of it already in the Athlon XP. Program code with a conditional test on x87 floating point values used to kill Out-Of-Order advantages because of the serializing nature of the instructions that make the floating point status code available to the Integer Pipeline, which handles the conditional branches. The Opteron has special hardware to avoid this serialization and to preserve Out-Of-Order processing.

x87 Floating Point Status register:

 bit 15      B    x87 FP Busy
 bit 14      C3   Condition Code 3
 bits 13-11  TOS  Top of Stack
 bit 10      C2   Condition Code 2
 bit  9      C1   Condition Code 1
 bit  8      C0   Condition Code 0
 bit  7      ES   Exception Status
 bit  6      SF   Stack Fault
 bit  5      PE   Precision exception
 bit  4      UE   Underflow exception
 bit  3      OE   Overflow exception
 bit  2      ZE   Zero Divide exception
 bit  1      DE   Denormal Operand exception
 bit  0      IE   Invalid Operation exception

Different parts of the x87 floating point status register are handled in different ways. The register is a bit of a mixture of different things. It contains, for example, the 3 bit TOS pointer that indicates which of the eight x87 registers is the current top of stack. The first rename stage holds the speculative version of this pointer. It is used there to translate the TOS-relative register addresses to absolute x87 register addresses. All finishing instructions preserve their copy of this value in the Re-Order Buffer when they finish. These copies then become the non-speculative versions of the TOS at the moment that the instructions are retired from the Re-Order Buffer. The retirement logic may detect that an exception or branch misprediction occurred. It then replaces the speculative version of the TOS in the first rename stage with the latest retired, non-speculative version.
The speculative 3 bit TOS value is used before the instructions are scheduled Out-Of-Order. The only place it is used later on is during retirement, which is handled In-Order again. This means that special Out-Of-Order hardware for the TOS can be, and is, eliminated. The execution of a floating point instruction may itself cause an exception. Most bits of the x87 status register are dedicated flags that identify exceptions. Exceptions are always handled In-Order at retirement time. This again means that any special Out-Of-Order hardware for these bits can be, and is, eliminated.

The tricky part is in the CC (Condition Code) bits. These bits contain exception data most of the time, but may sometimes contain information which is the result of a floating point compare and which must be processed in a full Out-Of-Order fashion. The Opteron has special new hardware to handle these cases. This hardware detects combinations of instructions that need special handling.

Condition Code bits after an x87 floating point compare:

 C3  C2  C1  C0   Compare Result
  0   0   0   0   ST(0) > source
  0   0   0   1   ST(0) < source
  1   0   0   0   ST(0) = source
  1   1   0   1   Operands were unordered

The first combination is an FCOMI followed by an FCMOV. The first does a compare and sets the CC bits according to the result. It then moves the compare result to the Integer Status Register. The FCMOV then does a conditional floating point move depending on the Integer Status bits. Opteron's hardware allows full speed processing here by implementing an Out-Of-Order bypass that avoids that the FCMOV has to wait for the actual Integer Status Flags. The second combination is the FCOM and FSTSW pair. The first instruction is identical to the FCOMI instruction with the exception that it does not copy the CC bits to the Integer Status bits.
It's the FSTSW (Floating point Store Status Word) instruction that copies the 16 floating point status bits to the AX register or to a memory location from where they can be used for conditional operations. The latter is a serializing operation because all floating point instructions need to finish first before the 16 status flags are known. The Opteron has special hardware that does allow maximum speed Out-Of-Order processing without the serializing disadvantage. It also provides a way to recover from any (rare) mispredictions. The result of all AMD's x87 optimizations is that the Opteron literally runs circles around the Pentium 4 when it comes to x87 processing. It has removed large special purpose circuits for status processing and implemented a few small ones that handle the cases mentioned. The shift to SSE2 floating point, however, will make the removed area overhead more important than the speed-ups.

Chapter 3, Opteron's Data Cache and Load / Store units

3.1 Data Cache: 64 kByte with three cycle data load latency

The Opteron's relatively large L1 Data Cache supports a three cycle Load-Use latency. Actually only the second and third cycle are used to access the cache memory itself. The first cycle is spent in the Integer Pipeline for the x86 memory address calculation using one of the three available AGU's. The address calculated by the AGU is sent to the memory array in the second cycle, where it is decoded. This means that it is known at which word line the data can be found at the end of the second cycle. The right data word is activated at the beginning of the third cycle. Data is accessed in the memory array, selected and sent forward to the Integer Pipeline or the Floating Point Pipeline.

Below is the more detailed timing of a typical Integer x86 instruction F(reg, mem). This type of instruction first loads data from memory and then performs an operation on it. We see that in the same cycle in which the instruction is dispatched to the Scheduler it is also dispatched to the so-called "Pre-Cache Load/Store unit", or simply LS1. Instructions in this unit compete for cache access together with those in LS2. The instructions in LS1 first need to wait for their effective memory address. They monitor the result busses of the AGU's. An instruction in LS1 knows from which AGU it can expect its address. Instructions check the re-order buffer Tag which identifies the address one clock cycle in advance. In general, an instruction in LS1 will fetch its address and wait for its turn to probe the cache.

Typical timing of an F(reg, mem) x86 operation:

 Cycle 0:  Dispatched to the Integer Scheduler and to LS1
 Cycle 1:  AGU scheduled
 Cycle 2:  Load scheduled (LS1); address generation (AGU)
 Cycle 3:  Cache address decode
 Cycle 4:  ALU scheduled; cache data access
 Cycle 5:  Dependent operation executes

Instructions may also route the address immediately to the cache if there are no other (older) instructions waiting. This is the case in our example above.
In any case, each instruction will keep the address for possible follow-on actions. In our case here the address is sent directly from the AGU result bus to the Data Cache's address decoders. Data comes back from memory one cycle later and is routed to the Integer Pipeline. LS1 places the re-order buffer Tag one cycle in advance on the Data Cache result Tag bus so that the Integer ALU schedulers can schedule any instruction depending on the load data.

3.2 Two accesses per cycle, read or write: 8 way bank interleaved, two way set associative

The Opteron's cache has two 64 bit ports. Two accesses can occur each cycle, in any combination of loads and stores. The dual port mechanism is implemented by banking: the cache consists of 8 individual banks, each with a single port. Two accesses can occur simultaneously if they are to different banks.

Virtual address bits used to access the L1 data cache:

 bits [14:6]  Cache Line Index
 bits [5:3]   Bank
 bits [2:0]   Byte

A single 64 byte cache line is subdivided into 8 independent 64 bit banks. Two accesses go to two different banks if their addresses have a different bank field, address bits [5:3]. These are the lowest possible address bits that can be used for this purpose. This scheme effectively maps adjacent 64 bit words to different banks. The principle of data locality makes these bits the most suitable choice.

The 64 kByte cache is two way set-associative. The cache is split into two 32 kByte ways accessed with virtual address bits [14:0]. A hit in either of the two ways is detected if the Physical Address Tag, bits [39:12], which is stored alongside each cache line, is identical to bits [39:12] of the physical address. Virtual to physical address translation is performed with the help of the TLB's (Translation Look-aside Buffers). A port accesses 2 ways and compares 2 tags with the translated address. Each port has its own TLB to do the address translation.

The two 64 bit ports are used simultaneously when exchanging cache lines with the rest of the memory hierarchy. This means that the memory bus from the unified L2 cache to the L1 data cache is now 128 bits wide. The event where a new cache line is needed will first take 4 cycles to evict the old cache line and then 4 more cycles to load the new cache line when it arrives.
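The address split and the bank conflict rule above are easy to verify with a few lines of bit manipulation; the function names are mine:

```python
# Illustrative decoding of the L1 data cache address fields described above:
# byte-in-word [2:0], bank [5:3], cache line index [14:6].

def bank_and_index(addr):
    byte = addr & 0x7            # bits [2:0]
    bank = (addr >> 3) & 0x7     # bits [5:3] select one of 8 banks
    index = (addr >> 6) & 0x1FF  # bits [14:6] select one of 512 lines per way
    return byte, bank, index

def conflict(addr_a, addr_b):
    """Two simultaneous accesses conflict only if they target the same bank."""
    return bank_and_index(addr_a)[1] == bank_and_index(addr_b)[1]
```

Note that two adjacent 64 bit words (addresses 8 bytes apart) land in different banks and never conflict, while the same word of two different cache lines (addresses 64 bytes apart) shares a bank and would have to serialize.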

3.3 The Data Cache Hit / Miss Detection: The cache tags and the primary TLB's (US Patent 6,453,387)

The L1 Data Cache has room to store 1024 cache lines out of the total of 17,179,869,184 cache lines that fit within the 40 bit physical address space. Accesses need to check if the stored cache line corresponds with the actual memory location they want to access. It is for this purpose that the Tag rams store the higher physical address bits belonging to each of the 1024 cache lines. There are two copies of the Tag ram to allow the simultaneous operation of two access ports. The Tag rams are accessed with bits [14:6] of the virtual address. Each Tag ram outputs 2 Tags, one for each way of the 2 way set-associative cache. The wanted cache line can be in either way.

The Tag rams contain physical addresses. A physical address uniquely defines a memory position throughout the entire distributed system memory. The cache is however accessed with the virtual addresses as defined by the program. Virtual addresses only have a meaning within a process context. This means that a virtual-to-physical address translation is needed to be able to check the physical Tags. This translation is handled by a lengthy sequence of four table lookups in memory: the virtual address field [47:12] is divided into four equal sub-fields that each index into one of four tables. Each table points to the start of the next table; the last table, the page table, finally contains the translated address.

Virtual Address to Physical Address Translation: The Table Walk.

 virtual address bits [47:39]  page map level 4 table offset
 virtual address bits [38:30]  page directory pointer offset
 virtual address bits [29:21]  page directory offset
 virtual address bits [20:12]  page table offset
 virtual address bits [11:0]   page offset (passed through unchanged)

 page map level 4 ==> page directory pointer table ==> page directory ==> page table ==> physical address bits [39:12], combined with the page offset bits [11:0]

This so-called Table Walk is a very lengthy procedure indeed. The Opteron uses so-called Translation Look-aside Buffers (TLB's) to remember the 40 most recently used address translations. 32 of these remember 4k page translations using the scheme above. The remaining 8 are used for so-called 2M / 4M page translations, which skip the last table and define the translations for large 2 Megabyte pages (the 4M pages are only used for backwards compatibility). The virtual address bits [47:12] are compared with all 40 entries of the TLB's in the second of the three clock-cycle access. At the end of the second cycle we know if any one of them matches. Each entry also contains the associated physical address bits [39:12]. These are selected in the third cycle and compared with the physical Tags to test if we have a cache hit.
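The four-level walk can be sketched directly from the bit fields above. The `read_table` callback is a stand-in of mine for a memory read; each call returns the base of the next-level table, and the last one returns the physical page frame:

```python
# Sketch of the 4-level table walk for 4 kB pages. Each level consumes
# 9 virtual address bits; the 12 bit page offset passes through unchanged.

def table_walk(va, pml4, read_table):
    l4 = (va >> 39) & 0x1FF    # page map level 4 table offset
    l3 = (va >> 30) & 0x1FF    # page directory pointer offset
    l2 = (va >> 21) & 0x1FF    # page directory offset
    l1 = (va >> 12) & 0x1FF    # page table offset
    pdp  = read_table(pml4, l4)
    pd   = read_table(pdp,  l3)
    pt   = read_table(pd,   l2)
    page = read_table(pt,   l1)            # physical frame bits [39:12]
    return (page << 12) | (va & 0xFFF)     # frame + page offset
```

Four dependent memory reads per translation is what makes the walk so expensive, and why caching the 40 most recent translations in the TLB's pays off.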

3.4 The 512 entry second level TLB

If the necessary translation is not found within the 40 entries of the primary TLB's, then there is a second chance that it is available in the level-2 TLB, which is shared by both ports. This table contains 512 address translations and can be used to update the primary TLB's with only a minor delay. It is organized in a different way: it is 512-entry, 4-way set-associative. This means that it has 128 sets of 4 translations each. Virtual address bits [18:12] are used to select one of the 128 sets. We get four translations, giving us four chances of finding the translation we need. Each translation contains the rest of the virtual address bits, [47:19]. We can check if we have the right translation by comparing these bits with our address. The matching entry then contains the associated physical address field [39:12] we need.
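The set-associative lookup described above amounts to one indexing step plus up to four tag compares. A minimal sketch, with my own data layout assumption (each set is a list of `(va_tag, phys_frame)` pairs):

```python
# 512-entry, 4-way set-associative L2 TLB lookup: VA bits [18:12] pick one
# of 128 sets; the stored VA tag bits [47:19] disambiguate within the set.

def l2_tlb_lookup(va, tlb):
    set_index = (va >> 12) & 0x7F   # bits [18:12] -> one of 128 sets
    va_tag = va >> 19               # bits [47:19] identify the page
    for tag, frame in tlb[set_index]:
        if tag == va_tag:
            return frame            # physical address bits [39:12]
    return None                     # L2 TLB miss -> full table walk needed
```

Compared to the fully associative primary TLB's, which compare all 40 entries at once, only 4 comparators per port are needed here, which is what makes the much larger 512-entry structure affordable.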

3.5 Error Coding and Correction

The L1 Data Cache is ECC protected (Error Checking and Correction). Eight bits are used for each 64 bits to be able to correct single bit errors and to detect dual bit errors with the help of a 64 bit Hamming SEC/DED scheme (Single Error Correction / Double Error Detection). Six parity bits are needed to retrieve the position of the error bit.

[Figure: 64 bit Hamming SEC/DED error location retrieval. The six parity check bits are shown in a column at the left; a one means that a parity error was detected. Each check covers a different half of the 64 bit positions.]

The six parity check results together form a 6 bit index that points to the error position. Additional parity bits are used to detect double bit errors and errors in the parity bits themselves. (Thanks to Collin for bringing this to my attention)
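The error-location idea can be demonstrated on a toy codeword. This is a classic Hamming single-error-correcting code, shown here on 7 bits instead of the 64+8 used by the cache; the syndrome (the XOR of the positions of all set bits) is zero for a valid codeword and equals the 1-based position of a single flipped bit otherwise:

```python
# Toy Hamming SEC sketch: parity bits sit at power-of-two positions
# (1, 2, 4); data bits occupy the remaining positions (3, 5, 6, 7).

def syndrome(codeword_bits):
    """codeword_bits[i] holds the bit at 1-based position i+1. The XOR of
    the set-bit positions is 0 for a clean word, else the error position."""
    s = 0
    for pos, bit in enumerate(codeword_bits, start=1):
        if bit:
            s ^= pos
    return s

def encode(data_bits):
    """Place 4 data bits at positions 3, 5, 6, 7 and choose the three
    parity bits so that the syndrome of the whole codeword is zero."""
    code = [0] * 7
    for pos, bit in zip((3, 5, 6, 7), data_bits):
        code[pos - 1] = bit
    s = syndrome(code)
    for p in (1, 2, 4):
        if s & p:
            code[p - 1] = 1
    return code
```

The 64 bit scheme works the same way with six checks instead of three, so the six parity results directly spell out a 6 bit error position, as the text describes.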

3.6 The Load / Store Unit, LS1 and LS2

The Load Store units handle the accesses to the Data Cache. This type of unit plays an increasingly important role in modern speculative out-of-order processors, and they are expected to grow significantly in size and complexity in newer architectures on the horizon. An extra reason to give the Opteron's Load Store units a closer look. The split into LS1 and LS2 is sometimes described as LS1 being for the L1 Data Cache and LS2 for the L2 cache. This description is far too simplistic, however, and even incorrect. We'll go into more detail here.

3.7 The "Pre Cache" Load / Store unit: LS1

The Pre-Cache Load/Store unit (LS1) is the place where dispatched memory accesses wait for the addresses generated by the AGU's (Address Generator Units) LS1 has 12 entries, whenever a memory access is dispatched to the Integer Scheduler it is also dispatched to an entry in LS1. The re-order tag bus belonging to the AGU indicates if the required Address is being calculated and available on the result bus of the AGU in the next cycle. An access waiting in LS1 knows at which AGU to look for the address. When an instruction has its address coming or did already receive it may then probe the cache. There are two access ports. The two oldest accesses in LS1 will be allowed to probe the Cache. Both load and store instructions probe the cache. A load will actually access the cache to obtain the load data. The store presents its address but will never write from LS1 to the Cache. Store instructions will only write after they've received the data to be written and when they are retired. Stores must be retired first because the store instruction may be speculative and is discarded later. Imagine that MicroSoft patches a buffer overflow exploit by adding a test for the overflow. This test becomes a conditional branch that prevents the write to the buffer in case of an overflow. The overflow tends to never happen so the branch predictor will predict it as not-taken, It will do so also in the case that it finally does happen. The write to the buffer will now be executed speculative. So the actual writes to the cache must be delayed until after retirement when it's verified that the branch predictions were correct. These deferred stores do not introduce any real delays however. Loads that access the cache also check LS1 and LS2 to see if there are any pending writes to the memory location they are about to read. If so than they catch the data directly from LS1 or LS2 without delay. The stores in LS1 however do present their address to the cache hit/miss detection logic. 
If it turns out that the cache-line is not present, it may be loaded as soon as possible from the Level 2 cache or from system memory. This can be a good policy since there is a significant chance that following loads will need the same cache-line. Stores may receive the data they have to write to memory while waiting in LS1, as long as the data arrives in time; otherwise they move on to LS2 and receive the data there.
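The LS1 arbitration described above can be sketched as follows. This is an illustrative model, not AMD's implementation: the entry fields and the rule that a probing access must already have its address are simplifying assumptions.

```python
# Sketch of LS1 port arbitration (illustrative). The two oldest
# accesses may probe the cache through the two access ports; here we
# also assume an access only probes once its AGU address is available.

def select_probes(ls1, ports=2):
    """ls1: oldest-first list of dicts with an 'addr_ready' flag.
    Returns the accesses allowed to probe the cache this cycle."""
    ready = [access for access in ls1 if access["addr_ready"]]
    return ready[:ports]        # oldest ready accesses win the ports
```

A usage example: with four queued accesses of which the oldest is still waiting for its AGU, the next two in program order get the cache ports.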

3.8 Entering LS2: The Cache Probe Response

All accesses in LS1 probe the cache and then move on to the Post-Cache Load/Store unit (LS2). An access can be a Load, a Store or a Load-Store (the latter reads first and then writes the result back to the same location). All accesses coming from LS1 first wait for the results of the cache probe: whether it was a cache hit or a miss, and whether there was a cache parity error. They also receive the physical address, translated from virtual to physical by the TLBs. Together with the physical address come the page attribute bits, which determine for instance whether the memory is cacheable or not. In the following cycle, in case there was a cache miss, the instructions receive a so-called MAB tag (Missed Address Buffer tag). This tag will later be used to see if a missed cache-line arrives from the L2 cache or from system memory. The MAB tag is used instead of the generally used re-order buffer tags because multiple Loads and Stores may depend on the same cache-line and thus on the same MAB tag: all these accesses miss and they all receive the same MAB tag. The Bus Interface Unit (BIU) loads missed cache-lines from the unified L2 cache or system memory to fill the data cache. It also presents the so-called Fill tag to LS2. This fill tag is compared to the MAB tag of all accesses that missed; the accesses that match the fill tag are changed from miss to hit.
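The MAB-tag sharing and the fill broadcast can be sketched like this. The class and function names are illustrative; the point is that misses to the same 64 byte line share one tag, and a single fill flips all of them from miss to hit.

```python
# Sketch of MAB tag allocation and fill matching in LS2 (names are
# illustrative, not AMD's).

class LS2Entry:
    def __init__(self, address):
        self.address = address
        self.hit = False
        self.mab_tag = None

def allocate_mab_tags(entries, line_size=64):
    """Give every missed access a MAB tag; misses to the same
    cache-line share a single tag."""
    tags = {}
    for e in entries:
        if not e.hit:
            line = e.address // line_size
            if line not in tags:
                tags[line] = len(tags)      # next free tag
            e.mab_tag = tags[line]

def broadcast_fill(entries, fill_tag):
    """The BIU presents a fill tag when a missed line arrives;
    every access whose MAB tag matches is changed from miss to hit."""
    for e in entries:
        if not e.hit and e.mab_tag == fill_tag:
            e.hit = True
```

For example, two missed loads at 0x1000 and 0x1010 fall in the same line and share a tag, so one fill satisfies both, while a miss at 0x2000 keeps waiting.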

3.9 The "Post Cache" Load Store unit: LS2

The so-called Post-Cache Load Store unit (LS2) has 32 entries. It is organized in a somewhat "shift register" like way, so that the oldest outstanding access ends up in entry 0. Each of the 32 entries has numerous fields. Many of these fields are accompanied by a comparator and other logic to see if the field matches a certain condition. All accesses stay in LS2 at least until retirement. Accesses that missed the cache wait in LS2 until the cache-line arrives from the memory hierarchy. All Stores wait in LS2 for their retirement before actually writing data to memory.

Various fields in an LS2 buffer entry:

Type:           Valid Flags, Access Type
Address & Data: Store Data (64 bit), Virtual Address, Physical Address, Memory Type
Tags:           Instruction Tag, Write Data Tag, Missed Address Buffer Tag
Status Flags:   Cache Hit / Miss, Retired Access, Last Store in Buffer (LIB)
Action Flags:   Self Modifying Code Flag, Snoop ReSync Flag, Store to Load Forward, ...

Retired Stores in LS2 that have the hit/miss flag set to hit may use a cache port simultaneously with a probing store in LS1. The retired store from LS2 writes to the data cache itself but does not use the cache hit/miss logic. The probing store from LS1 only uses the hit/miss logic but does not access the data cache itself. This shared use is important performance-wise because each store would otherwise occupy a cache port twice: first while probing from LS1 and a second time when writing from LS2 after retirement. That would halve the store bandwidth of the L1 Data Cache.

3.10 Retiring instructions in the Load Store unit and Exception Handling

All access instructions, Loads as well as Stores, stay in LS2 until they are retired. Loads may be removed directly from LS2 when they are retired, to make room for new instructions. Stores must still write their data to memory; they wait to do so until retirement, when it is determined that no exception or branch miss-prediction occurred. Writes are removed from LS2 after they have committed their data to memory. LS2 has a retirement interface with the re-order buffer. The re-order buffer presents the tag of the line that is being retired to LS2. It only needs to present a single tag for up to three instructions in a line, since these all have the same tag except for the 'sub-index' which identifies the lane (0, 1 or 2). LS2 compares all instruction tags with the retirement tag and sets the Retired flag of those that match. Retired loads may be deleted directly from LS2. If the retirement logic of the re-order buffer has detected a branch miss-prediction or exception, then all instructions matching the retirement tag and all those with succeeding tags are discarded from LS2. The only ones left in LS2 are the retired stores that are waiting to commit their data to memory.
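A minimal sketch of this retirement interface, under the simplifying assumption that each LS2 entry is a dict with a line tag, a kind and a retired flag (the field names are illustrative, not AMD's):

```python
# Sketch of the LS2 retirement interface (illustrative). The re-order
# buffer presents one line tag; on a miss-prediction or exception the
# retiring line and everything younger is discarded, leaving only the
# retired stores that still have to commit their data.

def retire_line(ls2, retire_tag, misprediction=False):
    if misprediction:
        return [e for e in ls2 if e["tag"] < retire_tag
                and e["retired"] and e["kind"] == "store"]
    kept = []
    for e in ls2:
        if e["tag"] == retire_tag:
            e["retired"] = True
        # Retired loads are deleted directly; stores wait to commit.
        if not (e["retired"] and e["kind"] == "load"):
            kept.append(e)
    return kept
```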

3.11 Store to Load forwarding, The Dependency Link File US Patent 6,266,744.

A Load probing the data cache will also check the Load Store units to see if there are any outstanding stores to the same address as the load. If it finds such a store (and the store is before the load in program order), then there are two possibilities. If the store has already obtained the write data from one of the result busses, then the data can be directly forwarded to the load. If the store has not yet obtained its data, then the load misses and moves to LS2, and an entry is created in a unit called the Dependency Link File. This unit registers both the tag of the write data (which tells that the data-to-be-stored is coming in the next cycle) as well as the load tag, which is to be used to tell a following instruction that the load data will be available. The Dependency Link File keeps monitoring the write data tag and then, as soon as it detects it, puts the load instruction tag on one of the Cache Load tag busses. It does the same with the actual data when it comes one cycle later. In the example below, the result data from instruction 1 can be directly forwarded to the consuming instruction 4; instructions 2 and 3 (the store and the load) are bypassed in this case.

1) F( regA, regD );    // register A is a function of register A and register D
2) store( mem, regA ); // store register A to memory
3) load( regB, mem );  // load register B from the same memory location
4) F( regD, regB );    // uses register B and register D to calculate a new value of register D

Mismatched stores to loads: stores that only modify part of the load data are not supported. The load must first wait until the store is retired and written to memory. The load may then access the cache to get its data, which is a combination of the stored data and the original contents of the cache. The optimization manual describes all possible mismatch cases since they can lead to a considerable performance penalty.
Multiple Stores to the same address are handled with the so-called LIB flag (Last In Buffer). This flag identifies the most recent store to a certain address; a newer load accessing the same address will choose this one. Multiple partial stores to the same word, where each modifies only a part of the word, are not supported by the Load Store buffer. They are not merged in the Load Store buffer; they are merged later on in the cache, after all stores are retired and written.
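The forwarding rules of these two sections can be condensed into a small model. This is an illustrative sketch: the store buffer is a plain oldest-to-youngest list, and scanning it youngest-first stands in for the hardware's LIB flag.

```python
# Sketch of store-to-load forwarding with LIB behavior (illustrative).
# A load takes the data of the youngest matching older store; partial
# overlaps are not forwarded and force the load to wait.

def forward_from_stores(stores, load_addr, load_size):
    """stores: oldest-to-youngest list of dicts with addr/size/data.
    Returns the forwarded data, or None when forwarding is not
    possible (partial overlap, or no matching store)."""
    for st in reversed(stores):         # youngest first, like LIB
        if st["addr"] == load_addr and st["size"] == load_size:
            return st["data"]           # full match: forward
        overlaps = not (st["addr"] + st["size"] <= load_addr or
                        load_addr + load_size <= st["addr"])
        if overlaps:
            return None                 # partial match: not supported
    return None                         # no match: read the cache
```

For instance, two 8 byte stores to the same address forward the younger value, while a 4 byte load into the middle of an 8 byte store cannot be forwarded at all.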

3.12 Self Modifying Code checks: Mutual exclusive L1 DCache and L1 ICache US Patent 6,415,360.

Self Modifying Code (SMC) checks must in principle be performed for each store. It must be tested that the store does not modify any of the instructions in the Instruction Cache or any following instruction in flight in any stage of execution. A significant simplification is made by making the L1 Data Cache and L1 Instruction Cache exclusive to each other: a cache-line can only exist in one of them, not in both at the same time. When a cache-line is loaded into the L1 Data Cache, it is evicted from the L1 Instruction Cache. The first advantage is that the contents of the Instruction Cache do not need to be tested any further for SMC. The second advantage is that SMC checks may be limited to Data Cache misses. Stores to un-cacheable memory must always be checked (they always "miss"). The store's write-address is sent from LS2 to the SMC test unit, which is close to the Instruction Cache. This unit holds the cache-line addresses of all the instructions in flight. If there is a conflict, it marks the store that caused the conflict. The re-order buffer will discard all instructions which follow the store when the store is retired.
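The two mechanisms above, mutual exclusion between the L1 caches and the in-flight SMC test, can be sketched as follows. The caches are modeled here simply as sets of line addresses; this is an assumption for illustration, not the real organization.

```python
# Sketch of the exclusive L1 policy and the SMC test (illustrative).
# Loading a line into the data cache evicts it from the instruction
# cache, so SMC checks can be limited to data cache misses.

def load_into_dcache(dcache, icache, line_addr):
    icache.discard(line_addr)   # mutual exclusion: evict from I-cache
    dcache.add(line_addr)

def smc_conflict(store_line_addr, inflight_instr_lines):
    """The SMC test unit holds the cache-line addresses of all
    instructions in flight; a match marks the offending store."""
    return store_line_addr in inflight_instr_lines
```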

Deadlocks can occur when multiple processors fight for the ownership of the same cache-line, for instance when they both want to write to the same line. A cache-line is generally loaded as soon as possible in case of a cache-miss. In case of a store, this causes the cache-line to be invalidated in the other caches. Two processors get into a deadlock if they keep invalidating each other's cache-lines before either is able to finish its store. An example given is the case where two processors try to complete a store to an unaligned address, so that part of the store data goes to cache-line A1 and part of the store data goes to cache-line A2. Unaligned stores of this type are typically split into two stores by the hardware. An exponential back-off mechanism is provided to handle this kind of deadlock situation. A back-off time is introduced when the memory access remains unsuccessful, before the processor retries to become owner of the cache-line again. This time grows exponentially after each unsuccessful try until one of the processors finally succeeds.
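The back-off schedule can be sketched in a few lines. This is a software analogy of the idea, not the hardware timer: the doubling base delay is from the text, while the cap and the random jitter (which breaks the symmetry between the contending processors) are common additions assumed here.

```python
import random

# Sketch of exponential back-off for the cache-line ownership
# deadlock (illustrative; real hardware uses its own timers).

def backoff_delays(tries, base=1, cap=1024):
    """Return the delay before each retry: the base delay doubles
    after every unsuccessful try, plus a random jitter."""
    delays = []
    for attempt in range(tries):
        delay = min(base << attempt, cap)       # 1, 2, 4, 8, ...
        delays.append(delay + random.randrange(delay))
    return delays
```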

3.14 Improvements for multi processing and multi threading

The Opteron's micro-architecture has a large number of improvements related to multi-processing and multi-threading. These improvements are very important for the desktop market as well. Multi-processor on a chip solutions are just around the corner, and hyper-threading may take a significant step forward in the near future with Intel's Prescott. The ability to run multi-processing and multi-threaded applications efficiently becomes essential. Switching contexts, starting and ending processes and threads, as well as inter-process and inter-thread communication, are traditionally associated with large overheads. Significant improvements have been made to reduce these overheads to a minimum.

Different processes can have different contexts, that is: different translations from virtual to physical addresses. A process switch will cause the Translation Look-aside Buffers to be invalidated (flushed). Large translation buffers won't help you a lot if they are frequently flushed, which can lead to significant performance degradation. The Opteron introduces a new mechanism to avoid flushing of the TLBs. An Address Space Number (ASN) register is added, together with an enable bit (ASNE). The Address Space Number uniquely identifies a process. Each entry in the TLB now includes the ASN of the process. An address can be successfully translated if the address matches the Virtual Address tag in the TLB and the ASN register matches the ASN field in the TLB. The ASN field can be seen as an "extension" of the virtual address. This means that translations of different processes can coexist in the TLB, avoiding the need to flush the TLBs on context switches. A global flag is available for data and code that should be accessible to all processes, typically operating system related. Global translations do not require the ASN fields to match, which means that many processes can share a single entry in the TLB to access global data. Another advantage of the ASN and the global flag is that flushing can be limited to specific entries whenever an invalidation of the TLB is needed: only the entries which have a certain ASN or have the global bit set are flushed.
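The hit condition described above can be expressed compactly. This is an illustrative model of a TLB lookup extended with ASNs; the entry layout is assumed for the example.

```python
# Sketch of an ASN-extended TLB lookup (illustrative). An entry hits
# when the virtual page tag matches and either the entry is global or
# its ASN matches the current process's ASN.

def tlb_lookup(tlb, vpage, current_asn, asn_enabled=True):
    for entry in tlb:
        tag_match = entry["vpage"] == vpage
        asn_match = (entry["global"] or not asn_enabled or
                     entry["asn"] == current_asn)
        if tag_match and asn_match:
            return entry["ppage"]
    return None                         # TLB miss: table walk needed
```

Note how the same TLB can hold translations for several processes at once: a lookup by another process simply misses instead of requiring a flush.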

The TLBs can be seen as caches containing the translation information stored in the address translation tables in memory. The actual translation requires several levels of indirection through the tables stored in main memory. This is the so-called "table walk", a very time consuming process which may take many hundreds of cycles for a single TLB entry. The Opteron attempts to speed up the table walk with a 24 entry Page Descriptor Cache. Even so, it remains important to avoid the table walk whenever possible in a multi-tasking, multi-threaded environment. A table walk becomes necessary whenever entries in the TLB no longer correspond to the memory resident translations because somebody has modified the latter. Until now there was only one way to guarantee TLB coherency: flush the TLBs whenever it is possible that any of the entries is no longer identical to the memory resident tables. Many actions in the x86 architecture result in an automatic flush of the TLBs, often unnecessarily. A new feature in the Opteron, the TLB flush filter, can avoid this costly flushing on many occasions. The TLB flush filter is implemented as a 32 entry Content Addressable Memory (CAM). It remembers the addresses of the regions in memory that were accessed when the TLBs were loaded; these regions thus belong to the page translation tables. The filter then keeps monitoring all accesses to memory to see if any of these regions is accessed again. If not, it may disable the flushing of the TLBs because coherency is guaranteed.
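The flush filter's behavior can be sketched as a small class. This is an illustrative model: the capacity of 32 is from the text, while the conservative "filter full means flush" fallback is an assumption about how a bounded filter must behave.

```python
# Sketch of the TLB flush filter CAM (illustrative). It records the
# regions touched during table walks and lets a TLB flush be skipped
# as long as none of those regions has been written since.

class FlushFilter:
    def __init__(self, capacity=32):
        self.regions = set()
        self.capacity = capacity
        self.dirty = False          # a monitored region was written

    def record_walk(self, region):
        if len(self.regions) >= self.capacity:
            self.dirty = True       # filter full: must be conservative
        else:
            self.regions.add(region)

    def observe_write(self, region):
        if region in self.regions:
            self.dirty = True       # page tables may have changed

    def flush_needed(self):
        return self.dirty
```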

3.17 Data Cache Snoop Interface

The snoop interface is used for a wide variety of purposes. It is used to maintain cache coherency in a multi-processor system, to preserve strict memory ordering in shared memory, for Self Modifying Code detection, for TLB coherency, etc. The snoop interface uses the physical addresses of other processors' accesses, as well as of accesses issued on behalf of the instruction cache, to probe various memories and buffers for data related to that particular address.

3.18 Snooping the Data Cache for Cache Coherency, The MOESI protocol

The Opteron can maintain cache coherency in systems of up to 8 processors. It uses the so-called MOESI protocol for this purpose, and the snoop interface plays a central role in the implementation of the protocol. If a cache-line is read from system memory (which may be connected to any of the eight processors), then the read has to snoop the caches of all processors. Snoop accesses are much smaller than normal memory accesses because they do not carry the 64 byte cache-line data. Many snoops may therefore be active without overloading the distributed memory system throughput. A snoop may find the cache-line in one of the caches of another processor. If a processor does not find the cache-line in anyone else's cache, then it loads the line from system memory into its cache and marks it as Exclusive. Whenever it then writes something in the cache-line, the line becomes Modified. It does in general not write the modified cache-line back to memory; it only does so if a special memory page attribute tells it to (write-through). The cache-line is evicted only later on, when another cache-line comes in that competes for the same place in the cache. If a processor needs to read from memory and it finds the cache-line in someone else's cache, then it will mark the cache-line as Shared. If the cache-line it finds in the other processor's cache is Modified, then it will load the line directly from there instead of reading it from memory, which may not be up to date. Cache to cache transfers are generally faster than memory accesses. The status of the cache-line in the other cache goes from Modified to Owner. This cache-line still isn't written back to memory. Any other (third) processor that needs this cache-line from memory will find a Shared version and an Owner version in the caches of the first two processors. It will obtain the Owner version instead of reading it from system memory.
The owner is the last one who modified the cache-line and stays responsible for updating the system memory later on. A cache-line stays Shared as long as nobody modifies it again. If one of the processors modifies it, then it must let the other processors know by sending an invalidate probe throughout the system. The state becomes Modified in this processor and Invalid in the other ones. If it continues to write to the cache-line, it does not have to send any more invalidate probes because the cache-line isn't shared anymore. It has taken over the responsibility to update the system memory with the modified cache-line whenever it must evict the cache-line later on.
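The transitions described in the last two paragraphs can be collected into a small state table. Only the events mentioned in the text are modeled, and the event names are illustrative.

```python
# Sketch of the MOESI transitions described above (illustrative; not
# the complete protocol, only the events discussed in the text).

def moesi_next(state, event):
    transitions = {
        ("Invalid",   "read_miss_no_other"):  "Exclusive",
        ("Invalid",   "read_miss_other_has"): "Shared",
        ("Exclusive", "local_write"):         "Modified",
        ("Modified",  "remote_read"):         "Owner",   # supplies line
        ("Shared",    "local_write"):         "Modified",  # invalidates
        ("Shared",    "remote_write"):        "Invalid",
        ("Owner",     "remote_write"):        "Invalid",
    }
    return transitions.get((state, event), state)
```

Tracing the text's scenario: a line loaded with no other holder is Exclusive, a local write makes it Modified, and a remote read demotes it to Owner while the reader gets a Shared copy.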

3.19 Snooping the Data Cache for Cache Coherency, The Snoop Tag RAM

Other processors that access system memory need to snoop the Data Cache to maintain cache coherency using the MOESI protocol. We saw that there are two kinds of snoops: Read and Invalidate snoops. The basic task of a snoop is first to establish whether the Data Cache contains the cache-line in question. A third set of tags is available specially for the snoop interface (the other two are used for the two regular ports of the data cache). The Snoop Tag RAM has 1024 entries, one for each cache-line. It holds the physical address bits [39:12] belonging to each cache-line.

(Figure: the L1 Data Cache is accessed with virtual address bits — virtual page address, offset in page, offset in cache-line, plus the Way — while the Snoop Tag RAM must be probed with the physical address, of which only the page offset bits are identical to the virtual ones.)

The regular tag RAMs are accessed with the virtual address. The Snoop Tag RAM, however, must deal with the physical address! Fortunately many of the virtual address bits needed are identical to the physical address bits; only bits [15:12] are different and thus useless. This means that we must read out the tags of all 16 possible cache-lines in parallel and then test if any one of them matches. Luckily this doesn't present too much of a burden: the total bus width (in bit-lines) of, for instance, the cache RAMs is 512 bit. Sixteen times a 28 bit tag is less (448 bit), so there is space left for some extra bits like the state info for each cache-line. Once we know which of the 16 possible cache-lines hits, we also know the remaining virtual address bits needed to access the cache, plus the Way (0 or 1) which holds the cache-line: the position itself (1 out of 16) directly provides the 3 extra address bits plus the Way bit. This means we can now access the cache, if needed, in case of a Read Snoop hit.
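The parallel compare can be sketched as follows, assuming the 16 candidates are organized as 8 index positions times 2 ways (so that the hit position yields the 3 extra address bits plus the Way bit, as the text describes). The encoding of position into (index, way) is an assumption for illustration.

```python
# Sketch of the parallel snoop-tag compare (illustrative). Because
# some virtual index bits are unknown to the snoop, 16 candidate tags
# are read and compared at once; the hit position recovers the
# missing index bits and the way.

def snoop_probe(candidates, phys_tag):
    """candidates: 16 tags, assumed ordered as 8 index values x 2
    ways. Returns (index_bits, way) on a hit, or None on a miss."""
    for position, tag in enumerate(candidates):
        if tag == phys_tag:
            return position >> 1, position & 1   # (index bits, way)
    return None
```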

3.20 Snooping the L1 Data Cache and outstanding stores in LS2

Snoop reads from other processors that want to read a cache-line from the L1 Data Cache do not need to check for retired stores in LS2 that will write to the cache-line they are about to read, even though the data these stores will write is already considered part of memory by the processor that issued the writes. It is OK for other processors to see these writes occur at a later stage; the only external effect is that it looks as if the processor is slightly slower. An external processor that writes to a shared cache-line must send snoop invalidates around. The snoop interface will invalidate the local cache-line if it receives such a snoop invalidate that hits the cache. The snoop interface must also set the hit/miss flag to miss for all stores in the Load Store unit that want to write to the cache-line that was hit. The latter is not a snoop-specific operation, however; it is needed in all cases in which a cache-line is evicted or invalidated. These stores that originally hit but are set back to miss will need to probe the cache again.

3.21 Snooping LS2 for loads to recover strict memory ordering in shared memory US Patent 6,473,837.

An interesting trick allows the Opterons to handle speculative out-of-order loads from shared memory and still observe the strict memory access ordering required for shared memory multi-processing. The hardware detects violations and can restore strict memory ordering when needed. A communicating processor may for instance first write a new command for another processor to A1 in memory and then increment a value A2 to notify that it has issued the next command. The processor which is supposed to handle the commands may find the value A2 incremented but still read the old command from A1 if it executes loads out of order. The ability to handle loads out of order can significantly speed up processing. Most notable is the example where a first load misses the cache: an out-of-order processor may issue another load, which may hit the cache, without waiting for the result of the first load. It would therefore be beneficial to maintain out-of-order loads in a multi-processing environment. Another important speed improvement is speculative processing. The first load that missed may have been the counter A2 in our example. The new command must be fetched if A2 has been increased, so a conditional call is made based on a test of the value of A2. A speculative processor attempts to predict the outcome of the branch at the beginning of the pipeline. It may predict that the counter has been incremented if it generally takes more time to execute a command than it takes to provide a new one; that is, a new command is generally sitting waiting by the time the previous command has been executed. The speculative out-of-order processor may first attempt to load the counter A2. It may miss, but the branch predictor has predicted that A2 was increased, so the command from A1 will be loaded for execution. The load from A1 may hit the cache. We actually do not know if this is a new command or not; let's say it is the old one. The counter A2 still has to be loaded from memory.
If A2 is increased in the meantime, then the load that missed will cause the modified cache-line, with the incremented counter included, to be loaded into the local data cache. The processor will conclude that the branch prediction was correct and erroneously carry on with the old command. The Opteron has a snoop mechanism that allows this kind of fully speculative out-of-order processing for high performance multi-processing. The mechanism detects cases which may go wrong and then restores memory ordering. We illustrate the mechanism with our example. When the first processor writes a new command into A1, it sends a snoop invalidate around to invalidate the cache-line in all other caches. This snoop invalidate also reaches the snoop interface of the Load Store unit. The snoop interface first checks the entries for a load that hit the cache-line-to-be-invalidated. This load would be the "old command" from A1 in our example. When it finds a load hit, it continues by checking all older loads to see if any of them is marked as a miss. This would be the load of the A2 counter value in our example. It sets the Snoop ReSync flag of all the load misses it finds. This flag causes all succeeding instructions to be canceled when the load is retired, including the instruction that loads A1. The load of A1 will be re-executed and will now correctly read the new command from memory.
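The detection rule can be condensed into a few lines. This is an illustrative model of the check described in the patent discussion above; the entry fields are assumptions.

```python
# Sketch of the Snoop ReSync check (illustrative). On a snoop
# invalidate, a load hit on the invalidated line triggers a scan of
# all *older* load misses; their Snoop ReSync flag is set so that
# younger instructions are canceled and re-executed at retirement.

def apply_snoop_invalidate(loads, invalidated_line):
    """loads: oldest-first list with 'line', 'hit', 'resync' fields."""
    for i, ld in enumerate(loads):
        if ld["hit"] and ld["line"] == invalidated_line:
            for older in loads[:i]:
                if not older["hit"]:
                    older["resync"] = True
```

In the A1/A2 example: the load of A1 hit the invalidated line, the older load of A2 missed, so the A2 load is flagged and the A1 load gets re-executed after it retires.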

3.22 Snooping the TLB Flush Filter CAM

Snooping is used to preserve memory coherency. The function of the TLB flush filter is to prevent unnecessary flushes of the TLBs. It does so by monitoring up to 32 areas in memory that are known to contain page table translation information which is cached in the TLBs. These entries must also be snooped by snoop invalidates from other processors that may write to the page tables of our processor. If any of the snoops hits a TLB flush filter entry, then we know that a TLB may have invalid entries and that the TLB flush filter may no longer prevent the flushing of the TLBs. The snoop invalidates are not sent if a processor is sure that a cache-line is not shared with other processors. This suggests that the TLBs (being caches in their own right) participate in the MOESI protocol for cache coherency via the TLB flush filter. The memory page translation tables (PML4, PDP, PDE and PTE entries) may be in cacheable memory. A special flag has to be set in the Opteron if the operating system decides to put the tables in un-cacheable memory (TLBCACHEDIS in HWCR).

Chapter 4, Opteron's Instruction Cache and Decoding

4.1 Instruction Cache: More than instructions alone

Access to the Instruction Cache is 128 bit wide: 16 bytes of instructions can be loaded from the cache each cycle. The instruction bytes are accompanied by 76 bits of extra information, which extends the total width of the Instruction Cache port to 204 bits. We are counting only the bits that cover the full Instruction Cache, that is: each of the 1024 cache-lines has its own set of these extra bits. There are several more fields that have fewer than 1024 entries and are valid only for a subset of the cache-lines.

                        Instruction only   Total size
Instruction Cache size:     64 kByte       102 kByte
Cache Line size             64 Byte        102 Byte
One Read Port               128 bit        204 bit
One Write Port              128 bit        204 bit

Well known are the three so-called pre-decode bits attached to each byte. They mark the start and end points of the complex variable length x86 instructions and provide some functional information. The other two fields are the parity bits (one parity bit for each 16 data bits) and the so-called branch selectors (eight times 2 bit for each 16 byte line of instruction code).

                   Ram Size   Bus Size   Comments
Instruction Code:  64 kByte   128 bit    16 bytes instruction code
Parity bits         4 kByte     8 bit    One parity bit for each 16 bit
Pre-decode         26 kByte    52 bit    3 bits per byte (start, end, function) + 4 bit per 16 byte line
Branch Selectors    8 kByte    16 bit    2 bits for each 2 bytes of instruction code
TOTAL             102 kByte   204 bit

The Opteron's branch selectors are different from those of the Athlon (32 bit) and now cover all 1024 cache-lines of the Instruction Cache. The branch selectors contain local branch prediction information which cannot be regenerated as readily as, for instance, the pre-decode information: a piece of code has to be executed multiple times before the branch selectors become meaningful. This is the reason that the branch selector bits are saved together with the instruction data in the unified level 2 cache whenever a cache-line is evicted from the instruction cache.
The branch selectors represent one extra bit for each byte. The level 2 cache already has this bit, normally used for ECC (Error Correction Code) information. ECC is only used for data cache-lines and not for instruction cache-lines; the latter do not need ECC, a few parity bits per cache-line are sufficient. Instruction cache-lines that are corrupted can always be retrieved from external DRAM memory.

4.2 The General Instruction Format

A short overview of the 64 bit instruction format: a series of prefixes can precede the actual instruction. At the start we have the legacy prefixes. The most important legacy prefixes are the operand size override prefix (hex 66) and the address size override prefix (hex 67). These prefixes can change the length of the entire instruction because they change the length of the displacement and immediate fields, which can be 1, 2 or 4 bytes long. The REX prefix (hex 4X) is the new 64 bit prefix which brings us 64 bit processing. The value of X is used to extend the number of General Purpose registers and SSE registers from 8 to 16. Three bits are used for this purpose because x86 can specify up to three registers per instruction for data and address calculations. The fourth bit is used as an operand size override (64 bit or default size).
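The four bits of the REX prefix can be decoded as follows. The field layout (W, R, X, B in bits 3..0 of a 0x40-0x4F byte) is the architectural AMD64 definition; only the function name is ours.

```python
# Decode an AMD64 REX prefix byte (0x40-0x4F). Bit W is the 64 bit
# operand size override; R, X and B extend register fields from 8 to
# 16 registers (ModRM reg, SIB index, ModRM r/m / SIB base).

def decode_rex(byte):
    if byte & 0xF0 != 0x40:
        return None                 # not a REX prefix
    return {
        "W": (byte >> 3) & 1,   # 1 = 64 bit operand size
        "R": (byte >> 2) & 1,   # extends the ModRM reg field
        "X": (byte >> 1) & 1,   # extends the SIB index field
        "B": byte & 1,          # extends the ModRM r/m or SIB base
    }
```

For example, the common `0x48` prefix seen before 64 bit arithmetic instructions is REX.W.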

AMD64 Instruction Format

The Escape prefix (hex 0F) is used to identify SSE instructions. The Opcode is the actual start of the instruction after the prefixes. It can be either one or two bytes and may be followed by an optional MODRM byte and SIB byte. The optional displacement and immediate fields can contain constants used for address and data calculations and can be 1, 2 or 4 bytes. The total length of the instruction is limited to 15 bytes.

4.3 The Pre-decode bits

Each byte in the instruction cache is accompanied by 3 pre-decode bits generated by the pre-decoder. These bits accelerate the decoding of the variable length instructions. Each instruction byte has a start bit that is set when the byte is the start of a variable length instruction, and a similar end bit. Both bits are set in case of a single byte instruction. More information is given by the third bit, the function bit. The decoders look first at the function bit of the last byte of the variable length instruction. If the function bit is 0, then the instruction is a so-called direct path instruction which can be handled directly by the functional units. Otherwise, if the function bit is 1 at the end byte, then the instruction is a so-called vector path instruction: a more complex operation that needs to be handled by a microcode program.

Definition of the Instruction Pre-decode bits:

START bit      1 indicates the first byte of an instruction
END bit        1 indicates the last byte of an instruction
FUNCTION bit   rule 1: Direct Path instruction if 0 on the last byte,
                       Vector Path instruction if 1 on the last byte
               rule 2: 1 indicates a prefix byte of a Direct Path instruction (except last byte),
                       0 indicates a prefix byte of a Vector Path instruction (except last byte)
               rule 3: for Vector Path instructions only: if the function bit of the
                       MODRM byte is set, then the instruction contains a SIB byte

Secondly, the function bits identify the prefix bytes: ones identify prefix bytes of direct path instructions and zeroes identify prefix bytes of vector path instructions. Finally, in the case of vector path instructions only: if the function bit of the MODRM byte is set, then the instruction also contains a SIB byte.
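A simple model shows how the start/end/function bits let the decoders carve instructions out of a 16 byte line without re-deriving instruction lengths. This is an illustrative sketch covering rule 1 only; prefix marking and the SIB rule are omitted.

```python
# Sketch of using pre-decode bits to split a 16 byte line into
# instructions (illustrative). The function bit on the end byte
# selects direct path (0) or vector path (1).

def split_instructions(start_bits, end_bits, function_bits):
    """Each argument is a list of 16 ints (0/1). Returns a list of
    (first_byte, last_byte, path) tuples."""
    insts, first = [], None
    for i in range(16):
        if start_bits[i]:
            first = i
        if end_bits[i] and first is not None:
            path = "vector" if function_bits[i] else "direct"
            insts.append((first, i, path))
            first = None
    return insts
```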

We find a very large block of logic with fourfold symmetry directly near the position where the 16 byte blocks of data are read from and written to the instruction cache. We'll discuss the most likely candidate here: a fourfold incarnation of an earlier pre-decoder described in gate level detail in US Patent 6,260,134. This fourfold version can, according to the patent which describes it, pre-decode an entire line of 16 bytes in only two cycles by means of what it calls massively parallel pre-decoding. This circumvents a basic problem in variable length pre-decoding, and decoding in general: a second instruction cannot be decoded until the length of the first instruction is known, because the start position of the second instruction depends on the length of the first. The massively parallel pre-decoder avoids this problem by first pre-decoding 16 possible instructions in parallel, each starting at one of the 16 byte locations of the 16 byte line. It then filters out the real instructions with the help of the program counter, which points to the start byte of the next instruction, depending on where we jump into the 16 byte line. 16 bytes of instructions can be fetched per cycle from the instruction cache to be fed to the decoders. It may be that the line is not yet pre-decoded, or wrongly pre-decoded (data bytes between instructions can mislead the pre-decoder). If a branch is made to an address which does not have its pre-decode start bit set, then we know that something is wrong. The instruction pipeline may invoke the pre-decoding hardware in this case to initialize or correct the pre-decode bits within only two cycles. The massively parallel pre-decoder uses four blocks; these blocks are an adapted version of an earlier pre-decoder. A single block pre-decodes four possible instructions in parallel, each instruction starting at one of four subsequent byte positions.
The old single block was capable of stepping through a 16 byte line in four cycles. The massively parallel pre-decoder combines four of them and uses a second stage, the start/end fixer/sorter, to resolve the relations between the four.

4.5 Large Workload Branch Prediction

Branch prediction is the technique that makes it possible to design pipelined processors. The outcome of a conditional branch is generally only known at the very end of the pipeline, while we need this information at the very beginning of the pipeline: we need the branch outcome to know which line of instructions to load next. The loading of a line of instructions already takes two cycles. If we don't want to lose any more cycles, then we must have decided on a new instruction pointer by the end of the cycle in which the 16 byte instruction line arrives from the instruction cache. This means that there is no time at all to even look at the instruction bytes, to try to identify conditional branches, and then to look up the recent behavior of these branches in order to make a prediction. Doing this alone would cost us several cycles.

4.6 Improved Branch Prediction

The branch prediction hardware does not make any attempt to look at the fetched instruction bytes at all. It instead uses several data structures to rapidly select a new address: a 2048 entry Branch Target Buffer and a 12 entry Return Stack to select the next Program Counter address. It further uses two branch history structures, one for local and one for global history, to predict the outcome of the branches. The so-called branch selectors are used for local history, while the global history counters are used for global history.

4.7 The Branch Selectors

The branch selectors embody the local history. Local means that the prediction is based on the history of the branch itself alone. Conditional branches that are almost always taken the same way can be predicted with the branch selectors. Unconditional branches are also handled by the branch selectors. Remember that there is no time to look at the actual code. What a branch selector says is that history has shown that a branch will be encountered that is almost certainly taken, conditional or unconditional.

What if it's not so certain that a branch will be taken? The branch selectors may leave the prediction in this case to the global branch prediction: they predict the branch as taken to identify the branch, but leave the final decision to the global history counters by setting the global flag.

[Figure: a 16 byte line of instruction code (bytes 0-15) accompanied by eight branch selectors, BS0-BS7]

  Branch Selection    K7 Athlon 32             K8 Athlon 64
  3                   take branch 2            take branch 3 (or return)
  2                   take branch 1            take branch 2 (or return)
  1                   return                   take branch 1 (or return)
  0                   continue to next line    continue to next line

Each 16 byte line of instruction code is accompanied by eight 2 bit branch selectors (some patents talk about nine). The branch selector within the line is selected with bits [3:1] of the Instruction Fetch address. The branch selector answers the question: I entered this 16 byte line at this particular address, now what 16 byte line should I load in the next cycle? A line can have multiple jumps, calls and returns. They can be conditional or unconditional, and we may have jumped anywhere in the middle of all these branches. The branch selectors tell us what to do depending on where we entered the line. The K7 can predict two branches per line plus one return. The new 64 bit core can predict up to three branches per line, and any one of them may be a return, according to Opteron's optimization manual.
(There are no patents yet, so the table above is our own extrapolation.) The branch selectors are saved together with the instruction code in the large one megabyte L2 cache whenever a cache-line is evicted from the instruction cache. The most useful data to save there is the information which can't be easily retrieved from the instruction code: the branch history. Information like the actual branch target address, or the fact that the branch is a return, is retrieved relatively fast in most cases by the processor.
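The selector lookup itself is simple: bits [3:1] of the fetch address pick one of the eight 2 bit selectors. A minimal sketch (the action strings follow our extrapolated K8 table above; the function and dictionary names are our own):

```python
# Our extrapolated K8 meaning of the four 2-bit selector values
K8_ACTIONS = {
    0: "continue to next line",
    1: "take branch 1 (or return)",
    2: "take branch 2 (or return)",
    3: "take branch 3 (or return)",
}

def select_action(fetch_addr, selectors):
    """selectors: the eight 2-bit branch selectors of a 16 byte line.
    Bits [3:1] of the fetch address say which 2-byte chunk of the line
    we entered on, and therefore which selector applies."""
    index = (fetch_addr >> 1) & 0x7   # bits [3:1]
    return K8_ACTIONS[selectors[index]]
```

Note that the selector depends only on where we entered the line, which is what lets the fetch logic act without looking at the instruction bytes at all.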

4.8 The Branch Target Buffer

The BTB (Branch Target Buffer) contains 2048 addresses from which the branch selectors can choose the next cycle's Instruction Fetch address. Fred Weber's MPF2001 Hammer presentation shows us that each 16 byte line can now have up to four branch target addresses to choose from (up from two in the case of the Athlon 32). Each branch target entry is shared between eight lines. From the branch selectors we know that any single line may not use more than three of these. We assume that when a branch selector says "select the 2nd branch", this means the second branch available for the current line.

  Most important Branch Target Buffer fields

  Field                        Description
  Line Tag (3 bit)             A branch target buffer entry is shared between 8 lines.
                               The line tag tells us if this entry belongs to the
                               current line.
  15 bit cache index           These 15 bits are sufficient to access the two 32 kB
                               ways of the 64 kB two way set-associative Instruction
                               Cache.
  Cache way select (0 or 1)    Used to select the way of the cache.
  Return Instruction           This bit tells us to use the address from the return
                               stack instead to access the next line in the
                               instruction cache.
  Use Global Prediction        The global flag leaves the final branch prediction to
                               the Global History Counters.
  Offset in Instruction Code   This tells us where the end of the branch is located
                               in the 16 byte line of instruction code.

Each branch target entry needs a 3 bit tag to identify to which of the eight possible lines of instructions it belongs. Sharing branch target entries strongly reduces the number of branch target addresses needed. 2048 entries would still represent 12 kByte if the full 48 bit addresses were stored in the BTB. This would be a relatively large memory, which you won't find on Opteron's die. The trick used here is to store only the 16 bits which are actually needed to access the 64 kByte instruction cache.
The higher address bits are retrieved later on. The Opteron has a new unit called the BTAC ( Branch Target Address Calculator ) to support this.
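The field list above can be modeled as a small record; the lookup mainly has to check whether the shared entry really belongs to the line being fetched. A sketch with our own names, following the fields described above:

```python
from dataclasses import dataclass

@dataclass
class BTBEntry:
    line_tag: int      # 3 bits: which of the 8 sharing lines owns the entry
    cache_index: int   # 15 bits: enough to address a 32 kB way
    way_select: int    # 0 or 1: which way of the 2-way instruction cache
    is_return: bool    # take the target from the return stack instead
    use_global: bool   # leave taken/not-taken to the global history counters
    end_offset: int    # where the branch ends in the 16 byte line

def btb_lookup(entry, current_line_tag):
    """Return the (way, index) to fetch from, or None if the entry has
    been overwritten by one of the 7 other lines sharing it -- in which
    case the target must be recomputed from the instruction bytes."""
    if entry.line_tag != current_line_tag:
        return None
    return entry.way_select, entry.cache_index
```

The `None` case is exactly the situation the BTAC (described below in section 4.11) exists to repair.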

4.9 The Global History Bimodal Counters

The Athlon 64 has 16,384 branch history counters, four times as many as its 32 bit predecessor. The counters describe the likelihood that a branch is taken. They count up to a maximum of 3 when branches are taken and down to a minimum of 0 when not taken. Counter values 3 and 2 predict a branch as taken, see the table.

  Definition of the 2 bit Branch History Counters

  Counter Value    Branch Prediction
  3                Strongly Taken
  2                Weakly Taken
  1                Weakly not Taken
  0                Strongly not Taken

The GHBC table is accessed using four bits of the Program Counter together with the outcome (taken or not taken) of the last eight branches. This is basically the same as in the Athlon 32. The fact that we now have four times as many counters means that we have four branch predictors per 16 byte instruction line. This corresponds with the four branch target addresses per line, and is an improvement over the Athlon 32, where the two branches per line could interfere with each other's branch predictions.

[Figure: Addressing the Branch History Counters — instruction address bits [7:4] and the outcomes of the eight previous branches index the 16,384 counters, yielding branch predictions 0 through 3]

Another improvement is that only branches whose global bit was set participate in the global branch prediction. This prevents branches with a static behavior from polluting the global branch history (US Patent 6,502,188 describes this in the context of the Athlon 32). The global bit is set whenever a branch has a variable outcome. The GHBC table allows the processor to predict global branch patterns of up to eight branches.
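A small model of such a global predictor follows. The 2 bit counter behavior matches the table above; the exact bit layout of the 14 bit index (4 PC bits, 8 history bits, 2 bits for the branch slot within the line) is our own assumption about how 16,384 counters could be organized.

```python
class GlobalPredictor:
    """16,384 two-bit counters indexed by PC bits [7:4], the outcomes
    of the last eight branches, and the branch slot (0-3) in the line."""
    def __init__(self):
        self.counters = [2] * 16384   # initialized weakly taken
        self.history = 0              # last eight outcomes, 1 = taken

    def _index(self, pc, slot):
        # 4 PC bits + 8 history bits + 2 slot bits = 14 bits -> 16,384
        return ((((pc >> 4) & 0xF) << 8 | (self.history & 0xFF)) << 2) | slot

    def predict(self, pc, slot):
        return self.counters[self._index(pc, slot)] >= 2   # 2 and 3: taken

    def update(self, pc, slot, taken):
        i = self._index(pc, slot)
        self.counters[i] = min(3, self.counters[i] + 1) if taken \
                           else max(0, self.counters[i] - 1)
        # shift the newest outcome into the 8-bit global history
        self.history = ((self.history << 1) | int(taken)) & 0xFF
```

Because the history participates in the index, the same branch gets different counters for different recent branch patterns, which is what lets the GHBC predict patterns of up to eight branches.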

4.10 Combined Local and Global Branch Prediction with three branches per line

A single 16 byte line with up to three conditional branches represents a complex situation. If we predict a first branch as not taken, then we encounter the next conditional branch, which must be predicted also, and so on. Does the Opteron handle this in multiple steps, or does it handle the whole multiple branch prediction at once?

  Local and Global Branch Prediction with three Branches per Line

  Branch Selector selects Branch 1:
    Branch 1 is local, or global and predicted taken ............. TAKE BRANCH 0
    Branch 1 is global and predicted not taken, and
      Branch 2 is local, or global and predicted taken ........... TAKE BRANCH 1
    Branches 1 and 2 are global and predicted not taken, and
      Branch 3 is local, or global and predicted taken ........... TAKE BRANCH 2
    Branches 1, 2 and 3 are global and predicted not taken ....... GO TO NEXT LINE

  Branch Selector selects Branch 2:
    Branch 2 is local, or global and predicted taken ............. TAKE BRANCH 0
    Branch 2 is global and predicted not taken, and
      Branch 3 is local, or global and predicted taken ........... TAKE BRANCH 1
    Branches 2 and 3 are global and predicted not taken .......... GO TO NEXT LINE

  Branch Selector selects Branch 3:
    Branch 3 is local, or global and predicted taken ............. TAKE BRANCH 2
    Branch 3 is global and predicted not taken ................... GO TO NEXT LINE

If we may take Fred Weber's MPF2001 presentation as an indication here, then we guess that it takes the branches one step at a time (the presentation shows a single GHBC prediction per cycle). A potential bottleneck may indeed be the GHBC: a second and a third branch need a different 8 bit branch outcome index into the table.
The 8 bit history value should be shifted one and two positions further for the 2nd and the 3rd branch, with zeroes inserted to indicate "not taken", in order to operate according to the rules above.
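Under that one-step-at-a-time assumption, the selection logic of the table above could be sketched as follows. Here `predict` stands in for a GHBC lookup, `branches` lists the conditional branches from the entry point onward in program order, and all names are our own.

```python
def predict_line(predict, history, branches):
    """Walk up to three conditional branches of a 16 byte line one step
    at a time. A branch that is 'local' is taken outright; a 'global'
    branch defers to the GHBC lookup. Every branch predicted not-taken
    shifts a speculative 0 ('not taken') into the history used for the
    next lookup, as the shifting rule above requires."""
    spec_history = history & 0xFF
    for slot, branch in enumerate(branches):
        if branch["local"] or predict(branch["pc"], spec_history):
            return slot                        # this branch is taken
        spec_history = (spec_history << 1) & 0xFF
    return None                                # fall through to next line
```

The function returns the index of the predicted-taken branch relative to the entry point, or `None` for "go to next line".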

4.11 The Branch Target Address Calculator, Backup for the Branch Target Buffer

Another new improvement is the BTAC, the Branch Target Address Calculator. This unit is useful for several purposes. It can generate full (48 bit) branch addresses two cycles after the 16 byte line of code has been loaded from the cache. It works for most branches, which typically use an 8 or 32 bit displacement in the instruction to jump or call to code relative to the program counter. The BTAC can probably identify return instructions as well.

One task of the BTAC is to act as a backup for the BTB (Branch Target Buffer). The BTB shares each branch address entry between eight lines. We may find that the branch selectors are OK but that the branch target they select has been overwritten by another branch. The branch selectors are maintained for all cache-lines in the 64 kByte instruction cache. They are also preserved together with instruction cache-lines which are evicted from L1 to the large 1 MegaByte L2 cache. It is unlikely that branch selectors which are reloaded from L2 into L1 still find their branch target addresses in the BTB. On the contrary, the BTB entries should be cleared whenever a cache-line is evicted from L1 to L2.

A cache-line that returns from L2 to L1 can restore its pre-decode bits rapidly (in two cycles with a massively parallel pre-decoder). It has to restore the BTB entries as well, but this can take much more time. The Athlon 32 fills the BTB with instruction addresses that come back from the re-order buffer when the branch is retired. This procedure would be repeated for each branch in the 16 byte line when it is taken. It may well be that the Athlon 64 still works this way. The BTAC can take over the functionality of the BTB until the BTB entries are restored. The BTAC can use the lowest Instruction Fetch address bits to see where we enter the 16 byte line. It can then scan from that position to the first branch and calculate the full 48 bit address by adding the 8 or 32 bit displacement from the code.
Now we have a calculated value which can be used to index the cache. It is still a guessed address; the certain address only comes when the branch instruction retires. The BTAC may have picked the wrong branch, for example. We believe that the BTAC calculates the full 48 bit address, because it can then be used to maintain the full 48 bits, which has several advantages. The 48 bits would be lost whenever the BTB is used to predict an address, because the BTB stores only a small portion of the address. The BTAC can maintain the 48 bits because the BTB identifies the location, within the 16 byte line, of the branch it uses. The BTAC can use this to find the right branch and subsequently add the displacement to keep the address at 48 bits.

There are two important tasks that need the full 48 bit address. First: the branch misprediction test hardware has to compare the full 48 bit "guess" address with the actual 48 bit address as calculated by the branch instruction. Second: the cache hit/miss test hardware needs the full 48 bit "guess" address (virtual) to translate and compare it with the (physical) address tag stored together with each cache-line. There are some patents without a BTAC that use a scheme of reversed TLB lookup to recover the full 48 bit (virtual) "guess" address from the (physical) cache tag and use this for the branch misprediction test. Such an address is however not useful for the cache hit/miss test (it always hits!).
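The address calculation itself is just a sign-extended add. A sketch (function names are our own; the 48 bit wrap matches the virtual address width discussed here):

```python
def sign_extend(value, bits):
    """Interpret the low 'bits' bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def btac_target(next_instr_addr, displacement, disp_bits):
    """Relative jumps and calls encode an 8 or 32 bit displacement from
    the address of the instruction following the branch; adding it back
    recovers the full 48 bit virtual target without consulting the BTB."""
    assert disp_bits in (8, 32)
    target = next_instr_addr + sign_extend(displacement, disp_bits)
    return target & ((1 << 48) - 1)   # wrap to the 48 bit virtual space
```

Note that this only works for displacement-relative branches; indirect branches carry their target in a register or in memory and cannot be calculated from the instruction bytes alone.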

4.12 Instruction Cache Hit / Miss detection, The Current Page and BTAC

The basic components for the Instruction Cache hit/miss detection are basically the same as those for the data cache (see section -). The single ported instruction cache only needs a single tag ram and a single TLB. The instruction cache also has a second level TLB (see section 3.4) and it has its own snoop tag ram (section 3.19). All these structures are relatively simple to recognize on the die-photo.

The current page register holds address bits [47:15] of the "guessed" Instruction Fetch address. The BTB only stores the lower 15 Instruction Fetch address bits. The fetch logic speculates that the next 16 byte instruction line will be fetched from the same 32 kB page and that the upper address bits [47:15] remain the same. Jumps and calls that cross the 32 kB border are mispredicted.

The higher bits of the fetch address [47:12] are needed for the cache hit/miss logic. The virtual page address [47:12] is translated to a physical page address [39:12]. This page address is then compared to the two physical address tags read from the two way set-associative instruction cache to see if there is a hit in either way. The new BTAC (Branch Target Address Calculator) can recover the full 48 bit address from the displacement field in the instruction code two cycles after the code is fetched from the cache. This address can then be compared with the current page register to check if the assumption that the branch would not cross the 32 kB border was right. The cache hit/miss logic has in the meantime translated and compared the guessed address with the two instruction cache tags and produced the hit/miss result.

  Cache Hit / Miss and Current Page Test

  Cache Hit     Current Page OK        Continue with the instruction line fetched
                                       from the instruction cache
  Cache Hit     Current Page not OK    Re-access the cache / TLB with the
                                       corrected current page
  Cache Miss    Current Page OK        Real cache miss. Reload cache-line from
                                       L2 or memory.
  Cache Miss    Current Page not OK    Re-access the cache / TLB with the
                                       corrected current page

The processor continues with the 16 instruction bytes fetched from the cache if there was a cache hit and the 32 kB border was not crossed. The fetch logic will re-access the cache if the 32 kB border was crossed, and will ignore the hit/miss result in this case. If the 32 kB border was not crossed, so the TLB translated the right fetch address, and there was a cache miss, then we may conclude that the cache miss was real and that we have to reload the line from L2 or memory. The BTAC does not help in case of indirect branches. These still have to wait until the correct address becomes available from the retired branch instruction.
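The decision table can be condensed into a few lines of logic. A sketch (names are our own; the 32 kB page comparison uses bits [47:15], as described above):

```python
def fetch_outcome(cache_hit, guessed_addr, btac_addr):
    """guessed_addr: the current page register bits [47:15] glued to
    the BTB's low 15 bits; btac_addr: the full 48 bit address the BTAC
    recomputed from the displacement in the instruction code."""
    current_page_ok = (guessed_addr >> 15) == (btac_addr >> 15)
    if not current_page_ok:
        # Hit or miss, the cache/TLB were probed with the wrong page,
        # so the hit/miss result is ignored:
        return "re-access cache/TLB with corrected current page"
    if cache_hit:
        return "continue with the fetched instruction line"
    return "real miss: reload cache-line from L2 or memory"
```

The two "Current Page not OK" rows of the table collapse into one branch of the logic, since the hit/miss result is discarded either way.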

4.13 Instruction Cache Snooping

The snoop interface of the instruction cache is used to maintain cache coherency in a multiprocessor environment and for self modifying code detection. Another processor that shares a cache-line with the instruction cache sends snoop-invalidates throughout the system when it writes into the shared cache-line. The snoop interface checks if the snoop-invalidate hits a cache line in the instruction cache, and invalidates the line upon a hit. The snoop interface works with physical addresses, as described in section 3.19.

The instruction cache can share cache-lines with other processors. It cannot share a cache-line with its own data cache, however. The latter is forbidden because the processor must correctly handle self modifying code programs. The instruction and data cache are exclusive to each other, as well as to the unified level 2 cache. The snoop interface detects if a cache-line load for the data cache hits a cache-line in the instruction cache, and invalidates the cache-line upon a hit.

The instruction cache may share a cache-line with a data cache on another processor. This so-called cross modifying code case is less stringent: the exact moment at which the other processor overwrites the instruction code is uncertain. The only effect of a shared cache-line which is modified by another processor is that we see the modification somewhat later, as if the other processor were slightly slower.

It is interesting that the new ASN (Address Space Number) could make it possible for the instruction cache and data cache to share cache lines, as long as they are assigned to different processes with different ASNs. This would be similar to the cross modifying case mentioned above. The hardware however does not support it, because the ASNs are not stored together with the cache lines. It would not be worth the trouble anyway from a performance point of view.

Athlon 64, Bringing 64 bits to the x86 Universe