Wednesday, December 19, 2007

[EE_CSIE] Computer Architecture Chapter05 Notes (3)

====== Ch5.3 Cache Performance ======

Average Memory Access Time (AMAT) = Hit time + Miss Rate × Miss Penalty


EXAMPLE : Which of the following has the lower miss rate?
(Assume the caches are write-through caches with write buffers.)
1. A 16KB instruction cache plus a 16KB data cache, or
2. A 32KB unified cache.
Assume 36% of the instructions are data transfer instructions, a hit takes 1 clock cycle (Hit time = 1),
and the miss penalty is 100 clock cycles (Miss Penalty = 100).
Because the unified cache cannot handle two requests at the same time, a load or store costs 1 extra clock cycle.

ANSWER :
First convert the misses per 1000 instructions into miss rates.

Misses / Instruction = Miss Rate × ( Memory Accesses / Instruction )

Miss Rate = [ ( Misses / 1000 Instructions ) / 1000 ] / ( Memory Accesses / Instruction )

Since every instruction needs exactly one memory access to fetch the instruction:
=> Miss Rate, 16KB instruction = [ 3.82 / 1000 ] / 1.00 = 0.004 (instruction fetches are 74% of all accesses)
=> Miss Rate, 16KB data = [ 40.9 / 1000 ] / 0.36 = 0.114 (data references are 26% of all accesses)
=> Overall miss rate of the split caches = 74% × 0.004 + 26% × 0.114 = 0.0324


The unified cache must count both instruction and data accesses:
=> Miss Rate, 32KB unified

  = [ 43.3 / 1000 ] / ( 1.00 + 0.36 ) = 0.0318 (slightly lower than the split caches)
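
As a quick sanity check, here is a minimal Python sketch of the miss-rate arithmetic above; the variable names are mine, and the misses-per-1000-instructions figures (3.82, 40.9, 43.3) are the example's given data:

# Convert misses per 1000 instructions into miss rates.
def miss_rate(misses_per_1000_instr, accesses_per_instr):
    # Miss rate = (misses / instruction) / (memory accesses / instruction)
    return (misses_per_1000_instr / 1000.0) / accesses_per_instr

instr_refs = 1.00   # one instruction fetch per instruction
data_refs  = 0.36   # 36% of instructions are data transfers

mr_instr_16k   = miss_rate(3.82, instr_refs)              # ~0.004
mr_data_16k    = miss_rate(40.9, data_refs)               # ~0.114
mr_unified_32k = miss_rate(43.3, instr_refs + data_refs)  # ~0.0318

# Weight each split cache by its (rounded) share of all memory accesses.
pct_instr, pct_data = 0.74, 0.26
mr_split = pct_instr * mr_instr_16k + pct_data * mr_data_16k  # ~0.0324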

Average memory access time (AMAT)
= % instructions × ( Hit time + Instruction miss rate × Miss penalty ) + % data × ( Hit time + Data miss rate × Miss penalty )



=> AMAT, split = 74% × ( 1 + 0.004 × 100 ) + 26% × ( 1 + 0.114 × 100 ) = 4.24

=> AMAT, unified = 74% × ( 1 + 0.0318 × 100 ) + 26% × ( 1 + 1 + 0.0318 × 100 ) = 4.44

Thus the split caches, which supply two memory accesses per clock cycle and so avoid the structural hazard,
still give a shorter AMAT than the single-ported unified cache even though their miss rate is higher.
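
A small Python continuation of the same example, computing the two AMATs; using the unrounded miss rates reproduces the 4.24 above:

# AMAT = hit time + miss rate × miss penalty, weighted by access type.
MISS_PENALTY = 100               # clock cycles
mr_i = 3.82 / 1000 / 1.00        # instruction cache miss rate (~0.00382)
mr_d = 40.9 / 1000 / 0.36        # data cache miss rate (~0.1136)
mr_u = 43.3 / 1000 / 1.36        # unified cache miss rate (~0.0318)

# Split caches: hit time is 1 cycle on both ports.
amat_split = 0.74 * (1 + mr_i * MISS_PENALTY) + 0.26 * (1 + mr_d * MISS_PENALTY)
# Unified cache: loads/stores pay 1 extra cycle for the single port.
amat_unified = 0.74 * (1 + mr_u * MISS_PENALTY) + 0.26 * (1 + 1 + mr_u * MISS_PENALTY)

print(round(amat_split, 2), round(amat_unified, 2))   # 4.24 4.44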


※ Comparing the performance impact with and without a cache
EXAMPLE :
An in-order execution computer (such as the UltraSPARC III) has a cache miss penalty of 100 clock cycles,
and all instructions normally take 1.0 clock cycles (ignoring memory stalls).
Assume the average miss rate is 2%, there is an average of 1.5 memory references per instruction, and the average number of cache misses per 1000 instructions is 30.
What is the impact on performance when the behavior of the cache is included?
Calculate the impact using both misses per instruction and miss rate.
ANSWER :
CPU time
= IC × [ CPI execution + ( Memory stall clock cycles / Instruction ) ] × Clock cycle time

1. Performance including cache misses, using misses per instruction ---
CPU time = IC × (1.0 + 30/1000 × 100) × Clock cycle time
     = IC × 4.0 × Clock cycle time

2. Calculating with the miss rate ---
CPU time = IC×[CPI execution + Miss Rate×(Memory accesses/Instruction)×Miss penalty]×Clock cycle time

=> CPU time
 = IC×[1.0 + 2%×1.5×100]×Clock cycle time
  = IC × 4.0 × Clock cycle time

The clock cycle time and instruction count (IC) are the same with or without a cache.
Thus, CPU time increases fourfold, with CPI going from 1.00 for a "perfect cache" to 4.00 with a cache that can miss.
Without any memory hierarchy at all, the CPI would increase further to 1.0 + 100 × 1.5 = 151, almost 40 times longer than the system with a cache!
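
A minimal Python sketch of both ways of folding the memory stalls into CPI; the no-cache case simply charges the full penalty to every memory reference:

CPI_BASE       = 1.0    # CPI with a perfect cache
MISS_PENALTY   = 100    # clock cycles
REFS_PER_INSTR = 1.5    # memory references per instruction

# 1. Using misses per instruction (30 misses per 1000 instructions).
cpi_a = CPI_BASE + (30 / 1000) * MISS_PENALTY                  # 4.0

# 2. Using the miss rate (2%).
cpi_b = CPI_BASE + 0.02 * REFS_PER_INSTR * MISS_PENALTY        # 4.0

# No memory hierarchy at all: every reference pays the full penalty.
cpi_no_cache = CPI_BASE + REFS_PER_INSTR * MISS_PENALTY        # 151.0

print(cpi_a, cpi_b, cpi_no_cache / cpi_a)   # 4.0 4.0 37.75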


※ Comparing the performance impact of different cache organizations (direct-mapped vs. 2-way set-associative):
EXAMPLE
Assume that the CPI is 2.0 with a perfect cache, the clock cycle time is 1.0 ns, there are 1.5 memory references per instruction, both caches are 64 KB, and the block size is 64 bytes. One cache is direct mapped and the other is two-way set associative. For the set-associative cache we must add a multiplexor to select between the blocks in the set depending on the tag match. Since the speed of the CPU is tied directly to the speed of a cache hit, assume the CPU clock cycle time must be stretched 1.25 times to accommodate the selection multiplexor of the set-associative cache. The cache miss penalty is 75 ns for either cache organization. First calculate the average memory access time, and then CPU performance. Assume the hit time is 1 clock cycle, the miss rate of the direct-mapped 64 KB cache is 1.4%, and the miss rate of the two-way set-associative cache is 1.0%.
ANSWER :
Average memory access time = Hit time + Miss rate × Miss penalty

AMAT(1-way) = 1.0 + (0.014×75) = 2.05 ns
AMAT(2-way) = 1.0×1.25 + (0.01×75) = 2.00 ns (better AMAT)

CPU time = IC×[CPI execution + Miss Rate×(Memory accesses/Instruction)×Miss penalty]×Clock cycle time

CPU time(1-way) = IC×[2 + 0.014×1.5×75]×1.0 = 3.58 × IC (better CPU time)
CPU time(2-way) = IC×[2×1.25 + 0.01×1.5×75]×1.0 = 3.63 × IC
=> Relative performance = CPU time(2-way) / CPU time(1-way) = 3.63 / 3.58 = 1.01
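
The same comparison in a few lines of Python; times are in ns, and the 2-way figures carry the 1.25× stretched clock cycle:

CYCLE          = 1.0    # ns, base clock cycle
MISS_PENALTY   = 75.0   # ns
REFS_PER_INSTR = 1.5
CPI_BASE       = 2.0

amat_1way = 1.0 * CYCLE        + 0.014 * MISS_PENALTY   # 2.05 ns
amat_2way = 1.0 * CYCLE * 1.25 + 0.010 * MISS_PENALTY   # 2.00 ns

# CPU time per instruction in ns (the IC factor and 1.0 ns cycle are folded in).
cpu_1way = CPI_BASE * CYCLE        + 0.014 * REFS_PER_INSTR * MISS_PENALTY  # ~3.58
cpu_2way = CPI_BASE * CYCLE * 1.25 + 0.010 * REFS_PER_INSTR * MISS_PENALTY  # ~3.63

print(round(cpu_2way / cpu_1way, 2))   # ~1.01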

※ Out-of-Order Execution Processor:
( Memory stall cycles / Instruction )
= ( Misses / Instruction ) × ( Total miss latency − Overlapped miss latency )
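
A tiny Python illustration of the formula; the 0.02 misses/instruction, 100-cycle total latency, and 70 overlapped cycles are hypothetical numbers of mine, not from the notes:

# Out-of-order: only the non-overlapped part of the miss latency stalls the pipeline.
misses_per_instr   = 0.02   # hypothetical
total_miss_latency = 100    # clock cycles, hypothetical
overlapped_latency = 70     # clock cycles hidden by out-of-order execution, hypothetical
stall_cycles_per_instr = misses_per_instr * (total_miss_latency - overlapped_latency)  # 0.6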
