ILP (Instruction-Level Parallelism) : the ability to execute more than one instruction at the same time.
--- 3.1 Fig 3.1 : techniques and the stall type each one reduces
A.2 Forwarding, bypassing : data hazard stalls
A.2 Delayed branches, branch scheduling : control hazard stalls
A.8 Scoreboarding (dynamic scheduling) : true-dependence data hazard stalls
3.2 Renaming (dynamic scheduling with register renaming) : data hazard stalls from WAR (antidependence) and WAW (output dependence)
3.4 Branch prediction : control stalls
3.6 Issuing multiple instructions per cycle : ideal CPI
3.7 Speculation : data hazard and control hazard stalls
3.2/3.7 Disambiguation (dynamic memory-address checking) : data hazard stalls with memory
--- 3.1 Three different types of dependences :
1. Data dependence : (RAW: true dependence)
2. Name dependence : (WAR: Anti-dependence, WAW: output-dependence) <-- renaming
3. Control dependence : <-- speculation
-- Q4 : RAW: true dependence, WAR: Anti-dependence, WAW: output-dependence
RAW : Inst j is data dependent on Inst i <-- overcome by stalling, or eliminated by transforming the code
WAR : Inst i's read precedes Inst j's write; the order must be preserved <-- register renaming
WAW : Inst i and Inst j both write the same register; the order must be preserved <-- register renaming
loop :
(1) DIV.D F0, F2, F4        RAW : (1)->(2), (2)->(3), (4)->(5)
(2) ADD.D F6, F0, F8        WAR : (2)->(4), (3)->(5)
(3) S.D   F6, 0(R1)         WAW : (2)->(5)
(4) SUB.D F8, F10, F14
(5) MUL.D F6, F10, F8
| register-renaming
(1) DIV.D F0, F2, F4
(2) ADD.D S, F0, F8
(3) S.D S,0(R1)
(4) SUB.D T,F10,F14
(5) MUL.D F6,F10,T
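The renaming above can be sketched as a toy SSA-style renamer (an illustration, not the textbook's exact transformation: here every write gets a fresh name, whereas the text renames only the two conflicting writes to S and T; the effect on WAR/WAW hazards is the same):

```python
# Instructions are modeled as (opcode, destination-or-None, sources).
# Reads always use the newest name for each architectural register, so
# RAW dependences are preserved, while fresh destination names remove
# the WAR hazard (2)->(4) on F8 and the WAW hazard (2)->(5) on F6.

def rename(insts, fresh_names):
    fresh = iter(fresh_names)
    cur = {}                                      # arch reg -> newest name
    out = []
    for op, dst, srcs in insts:
        srcs = tuple(cur.get(s, s) for s in srcs)     # keep RAW links
        if dst is not None:
            cur[dst] = dst = next(fresh)              # fresh name per write
        out.append((op, dst, srcs))
    return out

loop = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),    # store: no register destination
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(loop, ["R", "S", "T", "U"])
for inst in renamed:
    print(inst)
```

After renaming, no register is written twice (no WAW) and no instruction overwrites a register that an earlier instruction still reads (no WAR), while ADD.D still reads DIV.D's result and MUL.D still reads SUB.D's.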
--3.2 Q5 Tomasulo's algorithm : an enhancement of scoreboarding; allows instructions to execute out of order when there are sufficient resources and no data dependences.
(register-renaming to resolve WAR and WAW)
--3.4 Tournament Predictor : adaptively combining local and global predictors
-The primary motivation for correlating branch predictors came from the observation that the standard 2-bit predictor using only local information failed on some important branches and that by adding global information, the performance could be improved.
-Tournament predictors take this insight to the next level, by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector.
-Tournament predictors are the most popular form of multilevel branch predictors. A multilevel branch predictor uses several levels of branch-prediction tables together with an algorithm for choosing among the multiple predictors. Existing tournament predictors use a 2-bit saturating counter per branch to choose among two different predictors; the four states of the counter dictate whether to use predictor 1 or predictor 2. The state transition diagram is shown in Figure 3.16.
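A minimal sketch of the scheme described above, with assumed table sizes and two simple 2-bit-counter components standing in for the local and global predictors (real designs index far larger, differently organized tables):

```python
class TwoBit:
    """A table of 2-bit saturating counters; >= 2 predicts taken."""
    def __init__(self, size):
        self.ctr = [1] * size
    def predict(self, i):
        return self.ctr[i % len(self.ctr)] >= 2
    def update(self, i, taken):
        i %= len(self.ctr)
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)

class Tournament:
    def __init__(self, size=16):
        self.local = TwoBit(size)     # indexed by PC ("local" flavor)
        self.glob = TwoBit(size)      # indexed by global history
        self.sel = [1] * size         # 2-bit selector; >= 2 means "trust global"
        self.hist = 0                 # global branch history register
    def predict(self, pc):
        p_local = self.local.predict(pc)
        p_glob = self.glob.predict(self.hist)
        return p_glob if self.sel[pc % len(self.sel)] >= 2 else p_local
    def update(self, pc, taken):
        i = pc % len(self.sel)
        p_local = self.local.predict(pc)
        p_glob = self.glob.predict(self.hist)
        if p_local != p_glob:         # selector moves toward the component that was right
            delta = 1 if p_glob == taken else -1
            self.sel[i] = max(0, min(3, self.sel[i] + delta))
        self.local.update(pc, taken)
        self.glob.update(self.hist, taken)
        self.hist = ((self.hist << 1) | int(taken)) % len(self.sel)

t = Tournament()
for _ in range(20):
    t.update(5, True)                 # train on an always-taken branch
print(t.predict(5))                   # -> True
```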
--3.6 Taking advantage of more ILP with Multiple issue :
-Superscalar processors (dynamic issue capability) issue varying numbers of instructions per clock and are either statically scheduled or dynamically scheduled using techniques based on Tomasulo's algorithm. Statically scheduled processors use in-order execution, while dynamically scheduled processors use out-of-order execution.
-VLIW processors (static issue capability), in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (hence, they are also known as EPIC--Explicitly Parallel Instruction Computers). VLIW and EPIC processors are inherently statically scheduled by the compiler.
-- Q3.2
DADDI R1,R1,#4 / LD R2,7(R1)    : true dependence on R1 -- out-of-order execution not allowed
DADD R3,R1,R2 / S.D R2,7(R1)    : no dependence -- reordering allowed
S.D R2,7(R1) / S.D R2,200(R7)   : possible memory output dependence -- the stores may be reordered if the hardware computes the effective addresses early enough to show they differ
BEZ R1,place / S.D R1,7(R1)     : no data dependence, yet not allowed -- until the branch resolves, any instruction moved above it is executing speculatively
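The register-dependence part of this analysis can be sketched with a small helper (not from the text; memory and control dependences are not modeled, each instruction is just its read-set and write-set):

```python
# Classify the dependence between two instructions i and j, with i
# earlier in program order.  Each instruction is a (reads, writes)
# pair of register-name sets.

def classify(i, j):
    i_reads, i_writes = i
    j_reads, j_writes = j
    deps = set()
    if i_writes & j_reads:
        deps.add("RAW")    # j reads what i wrote: true dependence
    if i_reads & j_writes:
        deps.add("WAR")    # j overwrites what i still reads: antidependence
    if i_writes & j_writes:
        deps.add("WAW")    # both write the same register: output dependence
    return deps

# DADDI R1,R1,#4 then LD R2,7(R1): true dependence on R1
print(classify(({"R1"}, {"R1"}), ({"R1"}, {"R2"})))    # -> {'RAW'}
```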
-- Q3.9 (a) Consider two branches B1 and B2 executing alternately. The P column lists the value of the single 1-bit predictor shared by B1 and B2; the B1 and B2 columns list the branch outcomes. T is taken, NT is not taken; the predictor starts at NT.
P :        NT  T   NT  NT  T   T   NT  NT
branch :   B1  B2  B1  B2  B1  B2  B1  B2
outcome :  T   NT  NT  T   T   NT  NT  T
correct?   N   N   Y   N   Y   N   Y   N
Here B1 and B2 each alternate T/NT. If each branch had its own 1-bit predictor, every prediction would be wrong; because they share one predictor, the prediction accuracy improves.
(b) B1 is always T and B2 always NT. With a private 1-bit predictor each branch would be predicted correctly every time; because they share one predictor, every prediction is wrong.
P :        NT  T   NT  T   NT  T   NT  T
branch :   B1  B2  B1  B2  B1  B2  B1  B2
outcome :  T   NT  T   NT  T   NT  T   NT
correct?   N   N   N   N   N   N   N   N
(c) If one predictor is shared by a set of branches, the membership of that set can change as the program runs. When a new branch joins the set or an old one leaves it, the branch history currently held in the predictor cannot predict the new set's behavior as well as it predicted the old set's, which perturbs the predictor state. The intervals at which the set changes can therefore reduce long-run prediction accuracy.
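Cases (a) and (b) can be checked with a few lines simulating the shared 1-bit predictor (predict the last outcome seen; initial state NT), the branches interleaved B1, B2, B1, B2, ...:

```python
def correct_predictions(outcomes):
    state = False                 # shared 1-bit predictor, starts NT
    hits = 0
    for taken in outcomes:
        hits += (state == taken)
        state = taken             # 1-bit: remember the last outcome
    return hits

# (a) B1 and B2 each alternate T/NT: interleaved stream T NT NT T ...
a = [True, False, False, True] * 2
# (b) B1 always T, B2 always NT: interleaved stream T NT T NT ...
b = [True, False] * 4
print(correct_predictions(a), correct_predictions(b))    # -> 3 0
```

Sharing helps in (a), exactly as in the table above, and destroys accuracy in (b).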
--Q3.20 When an instruction is executed speculatively, what is the effect on the three factors that make up the CPU-time equation: dynamic instruction count, average clocks per instruction, and clock cycle time? When the prediction is wrong, CPU time may increase; which factor of the CPU-time equation best models this increase, and why?
=> When speculation is correct, instructions that must execute anyway are executed earlier, reducing or eliminating stalls. If execution were delayed until the instructions were no longer speculative, the stalls would be unavoidable. Executing the needed instructions earlier has no effect on the instruction count or the clock cycle time; what the reduced stall cycles improve is CPI.
When speculation is incorrect, instructions on the wrong path are executed and their results are discarded. The clock cycle time is unaffected, but the dynamic instruction count increases. For a typical instruction mix the effect on CPI is small, yet the cycles consumed by incorrectly speculated instructions can increase CPU time substantially; this increase is modeled most clearly through the instruction count (IC).
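A tiny worked example (all numbers assumed, not from the text) showing where misspeculation appears in CPU time = IC x CPI x clock cycle time:

```python
def cpu_time(ic, cpi, cycle_ns):
    return ic * cpi * cycle_ns

base = cpu_time(1_000_000, 1.0, 1.0)       # only useful instructions
# 20% extra dynamically executed (later squashed) wrong-path instructions;
# CPI and clock cycle time are unchanged in this simple model:
wasted = cpu_time(1_200_000, 1.0, 1.0)
print(base, wasted)                        # the increase shows up in IC
```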
--Q3.21
ADD.D F0,F8,F8
MUL.D F2,F8,F8
SUB.D F4,F0,F2
DADDI R10,R12,R12
ROB fields                                        Committed?
------------------------------------------------  ----------
entry  instruction  destination  Value            Yes/No
0      ADD.D        F0           F8 + F8          Y
1      MUL.D        F2           -                N
2      SUB.D        F4           -                N
after ADD.D commits, entry 0 is reused:
0      DADDI        R10          R12 + R12        N
1      (unchanged)
2      (unchanged)
The value field for MUL.D is empty because its 10-cycle latency means it has not yet finished executing; SUB.D's is empty as well, because it depends on MUL.D.
ROB entry 0 initially recorded ADD.D, but once ADD.D committed the entry was overwritten by DADDI; entries 1 and 2 keep their original contents.
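The entry-reuse behavior can be sketched with a minimal circular reorder buffer (a toy with an assumed 3-entry capacity; capacity checks and register writeback are omitted):

```python
from collections import deque

class ROB:
    def __init__(self, size=3):
        self.size = size
        self.entries = {}            # entry index -> [inst, dest, value]
        self.order = deque()         # entry indices in program order
        self.next = 0
    def allocate(self, inst, dest):  # assumes a free entry is available
        idx = self.next
        self.next = (self.next + 1) % self.size
        self.entries[idx] = [inst, dest, None]
        self.order.append(idx)
        return idx
    def finish(self, idx, value):
        self.entries[idx][2] = value
    def commit(self):
        # only the head may commit, and only once its value is ready
        if self.order and self.entries[self.order[0]][2] is not None:
            return self.order.popleft()
        return None

rob = ROB()
e_add = rob.allocate("ADD.D", "F0")      # entry 0
rob.allocate("MUL.D", "F2")              # entry 1: 10-cycle latency, not done
rob.allocate("SUB.D", "F4")              # entry 2: waits on MUL.D
rob.finish(e_add, "F8 + F8")
rob.commit()                             # ADD.D commits from the head
e_daddi = rob.allocate("DADDI", "R10")
print(e_daddi)                           # -> 0, entry 0 is recycled
```

MUL.D and SUB.D cannot commit even though DADDI is ready, because commit is strictly in order from the head of the buffer.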
.End.