. ====== Ch5.4 Reducing Cache Miss Penalty (五種) ====== ※ [方法1] Multi-Level Caches :
Average memory access time = Hit timeL1 + Miss rateL1 × Miss penaltyL1 and Miss penaltyL1 = Hit timeL2 + Miss rateL2 × Miss penaltyL2
所以
Average memory access time = Hit timeL1 + Miss rateL1× (Hit timeL2 + Miss rateL2 × Miss penaltyL2)
1. Local miss rate : This rate is simply the number of misses in a cache divided by the total number of memory accesses to this cache. As you would expect, for the first-level cache it is equal to Miss rate L1 and for the second-level cache it is Miss rate L2. 2. Global miss rate : The number of misses in the cache divided by the total number of memory accesses generated by the CPU. Using the terms above, the global miss rate for the first-level cache is still just Miss rateL1 but for the second-level cache it is Miss rate L1 × Miss rate L2.
Average memory stalls per instruction = Misses per instruction L1× Hit time L2 + Misses per instruction L2 × Miss penalty L2.
EXAMPLE Suppose that in 1000 memory references there are 40 misses in the 1st-level cache and 20 misses in the 2nd-level cache. What are the various miss rates? Assume the Miss penalty from L2 cache to Memory is 100 clock cycles, the Hit time of L2 cache is 10 clock cycles, the Hit time of L1 is 1 clock cycles, and there are 1.5 memory references per instruction. What is the average memory access time and average stall cycles per instruction? Ignore the impact of writes. ANSWER : 1st-level local and global miss rate = 40 / 1000 = 4% 2nd-level local miss rate = 20 / 40 = 50% 2nd-level global miss rate = 20 / 1000 = 2% => Average memory access time = 1 + 4%(10 + 50% × 100 ) = 3.4 clock cycles. (若無L2 => Average memory access time = 1 + 4% × 100 = 5 clock cycles.) 1.5 memory references per instruction => 1000 memory reference per 667 instructions. 所以 每千個指令的失誤率 Miss Rate L1 = 40*1.5 = 60 , Miss Rate L2 = 20*1.5=30 Average memory stalls per instruction = Misses per instruction L1× Hit time L2 + Misses per instruction L2 × Miss penalty L2 = (60/1000) × 10 + (30/1000) × 100 = 0.060 × 10 + 0.030 × 100 = 3.6 clock cycles
另一種算法是 (Average memory access time - L1 Hit time ) × 平均Cache存取次數 = (3.4 – 1.0) * 1.5 = 3.6 clock cycles. ※ [方法2] Critical word first & Early restart : (只對Block大的Cache有效) 1. Critical word first : Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first. 2. Early restart : Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. ※ [方法3] Giving Priority to Read Misses over Writes : This optimization serves reads before writes have been completed. ※ [方法4] Merging write buffer : (將連續字元組的多個寫入動作合併為單一區塊) If the buffer contains other modified blocks, the addresses can be checked to see if the address of this new data matches the address of the valid write buffer entry. If so, the new data are combined with that entry, called write merging. ※ [方法5] Victim caches : One approach to lower miss penalty is to remember what was discarded in case it is needed again. Since the discarded data has already been fetched, it can be used again at small cost.
沒有留言:
張貼留言