=== Ch4.1 Basic compiler techniques for exposing ILP ===
IA-64 : Intel Architecture-64, Intel's first 64-bit instruction set architecture, is based on EPIC.
EPIC : Explicitly Parallel Instruction Computing
FIGURE 4.1 Latencies of FP operations used in this chapter.
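This figure underlies all of Chapter 4: it gives the latency between different types of instructions.
First, an introduction to pipeline scheduling and loop unrolling:
For example: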
for (i=1000; i>0; i=i-1) {
X[i] = X[i] + s;
}
1. MIPS code =>
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
2. Without any scheduling (10 cycles) =>
Loop: L.D F0,0(R1)
stall
ADD.D F4,F0,F2
stall
stall
S.D F4,0(R1)
DADDUI R1,R1,#-8
stall
BNE R1,R2,Loop
stall
3. After scheduling (6 cycles) =>
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
stall
BNE R1,R2,Loop
S.D F4,8(R1) ; in the branch delay slot; offset is 8 because DADDUI has already subtracted 8 from R1
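The cycle counts of the two listings above can be checked with a small model. This is only a sketch, not a real pipeline simulator: it assumes a single-issue, in-order pipeline with one branch delay slot, and the stall values in LATENCY are simply read off the stall pattern of listing 2 (any pair not listed is assumed to need no stall). The names LATENCY, parse and count_cycles are made up for this illustration.

```python
import re

# Stall cycles needed between a producer and a dependent consumer.
# Assumption: values taken from the stall pattern in the listings;
# pairs that are not listed need no stall.
LATENCY = {
    ("L.D", "ADD.D"): 1,
    ("ADD.D", "S.D"): 2,
    ("ADD.D", "ADD.D"): 3,
    ("DADDUI", "BNE"): 1,
}
NO_DEST = {"S.D", "BNE"}  # these opcodes only read registers


def parse(line):
    """Split one listing line into (op, dest register or None, source registers)."""
    op, operands = line.split(":")[-1].split(None, 1)  # drop a leading "Loop:" label
    regs = re.findall(r"[RF]\d+", operands)
    if op in NO_DEST:
        return op, None, regs
    return op, regs[0], regs[1:]


def count_cycles(listing):
    """Cycles for one pass through the loop body."""
    instrs = [parse(line) for line in listing.strip().splitlines()]
    producer = {}  # register -> (op, issue cycle) of its most recent writer

    def ready(op, srcs):
        # Earliest cycle at which every source operand of `op` is available.
        return max([1] + [producer[r][1] + 1 + LATENCY.get((producer[r][0], op), 0)
                          for r in srcs if r in producer])

    cycle = 0
    for i, (op, dest, srcs) in enumerate(instrs):
        issue = max(cycle + 1, ready(op, srcs))
        if op == "BNE" and i + 1 < len(instrs):
            # The delay-slot instruction has to issue right after the branch,
            # so the branch waits until that instruction will be ready too.
            slot_op, _, slot_srcs = instrs[i + 1]
            issue = max(issue, ready(slot_op, slot_srcs) - 1)
        cycle = issue
        if dest:
            producer[dest] = (op, issue)
    # An unfilled branch delay slot costs one extra cycle.
    return cycle + 1 if instrs[-1][0] == "BNE" else cycle


UNSCHEDULED = """
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
"""
SCHEDULED = """
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
BNE R1,R2,Loop
S.D F4,8(R1)
"""
print(count_cycles(UNSCHEDULED))  # 10
print(count_cycles(SCHEDULED))    # 6
```

The listings are passed without the explicit stall lines; the model works out where the stalls fall from the register dependences.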
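4. Loop unrolled (four copies of the loop body), before scheduling =>
(28 clock cycles, or 28/4 = 7 per iteration: 14 instruction issues, plus 1 stall per L.D, 2 per ADD.D, 1 for DADDUI and 1 for the branch; the 14 cycles, or 14/4 = 3.5 per iteration, are reached only once this loop is scheduled)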
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1)
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1)
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
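At the source level, the same transformation looks roughly like this. It is a sketch only, assuming (as in the 1000-element example) that the element count is a multiple of 4, so no clean-up loop is needed; the function names are made up for illustration.

```python
def add_scalar(x, s):
    """Original loop: one element per iteration; x[0] is unused, as with X[1..1000]."""
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Same work with the body copied four times and one decrement/test per pass."""
    assert (len(x) - 1) % 4 == 0, "sketch assumes a multiple of 4 elements"
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s          # copy 1, offset   0
        x[i - 1] = x[i - 1] + s  # copy 2, offset  -8 in the MIPS code
        x[i - 2] = x[i - 2] + s  # copy 3, offset -16
        x[i - 3] = x[i - 3] + s  # copy 4, offset -24
        i -= 4                   # one DADDUI R1,R1,#-32 instead of four #-8
    return x


data = [0.0] + [float(k) for k in range(1, 1001)]
print(add_scalar(list(data), 2.5) == add_scalar_unrolled4(list(data), 2.5))  # True
```

The loop overhead (the decrement and the branch) is paid once per four elements instead of once per element, which is where the unrolled MIPS code saves its three DADDUI and three BNE instructions.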
5. Unrolled loop, then scheduled (14 clock cycles, or 14/4 = 3.5 per iteration) =>
Loop: L.D F0,0(R1)
L.D F6,-8(R1)
L.D F10,-16(R1)
L.D F14,-24(R1)
ADD.D F4,F0,F2
ADD.D F8,F6,F2
ADD.D F12,F10,F2
ADD.D F16,F14,F2
S.D F4,0(R1)
S.D F8,-8(R1)
DADDUI R1,R1,#-32
S.D F12,16(R1) ; 16-32 = -16, since DADDUI has already run
BNE R1,R2,Loop
S.D F16,8(R1) ; 8-32 = -24, in the branch delay slot
Why is a stall needed between the following two instructions?
===================
DADDUI R1,R1,#-8
BNE R1,R2,Loop
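It is for parallelism / pipelining.
Put simply, the idea is to slot useful work into every available gap :)
For example:
L.D takes 2t (two clock cycles) before its result can be used...
ADD.D takes 3t...
DADDUI takes 2t...
Then you check whether the operands an instruction needs conflict with an earlier one,
and whether it has to wait...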
Why is the immediate in the following instruction -8 rather than -4?
DADDUI R1,R1,#-8
Because in this MIPS assembly code each array element is a doubleword (DW) of 8 bytes,
and R1 is a byte address,
so this instruction decrements the pointer by exactly one element.
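The effect of stepping by 8 can be seen by emulating memory as raw bytes. This is a sketch only: the little-endian "<d" format, the tiny four-element array and the function name are choices made for this illustration, not something taken from the book.

```python
import struct

DOUBLE_SIZE = 8  # bytes in one double-precision value (one doubleword)


def add_scalar_bytes(memory, r1, r2, s):
    """Emulate the loop with R1 as a byte address; it stops when R1 == R2."""
    while r1 != r2:
        (value,) = struct.unpack_from("<d", memory, r1)  # L.D   F0,0(R1)
        struct.pack_into("<d", memory, r1, value + s)    # ADD.D, then S.D
        r1 -= DOUBLE_SIZE                                 # DADDUI R1,R1,#-8
    return memory


n = 4
memory = bytearray(DOUBLE_SIZE * (n + 1))  # slot 0 plays the role of X[0], never touched
for i in range(1, n + 1):
    struct.pack_into("<d", memory, i * DOUBLE_SIZE, float(i))

add_scalar_bytes(memory, n * DOUBLE_SIZE, 0, 10.0)
print(struct.unpack_from("<5d", memory, 0))  # (0.0, 11.0, 12.0, 13.0, 14.0)
```

With a step of 4 the address would land in the middle of an 8-byte value on every other iteration, so the load would read half of one element and half of the next.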
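Excuse me, I have only just started studying computer architecture
and do not understand what the DADDUI instruction means.
Could the blog owner explain it? Thanks.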
DADDU and DADDUI are the "unsigned" 64-bit additions (DADDUI is the book's spelling of the MIPS64 instruction DADDIU).
Despite the name, DADDUI does not zero-extend its 16-bit immediate. The immediate is sign-extended to 64 bits, which is why #-8 in the loop really subtracts 8 from R1.
The U only means that the instruction never raises an integer-overflow exception. DADD and DADDI compute the same sum but trap on signed overflow. MIPS has no condition codes, so none are set in either case.
Exceptions are among the things that make pipelining hard to implement, so pointer and loop-counter arithmetic like this typically uses the non-trapping form.
Whether DADDI or DADDUI is used does not change the result here, and in both cases the immediate is limited to the signed 16-bit range, -32768 to 32767.
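How the immediate and the overflow case are treated can be sketched as follows, assuming MIPS64 semantics in which the 16-bit immediate is sign-extended and the U form differs from the plain form only by never trapping. The Python functions are illustrative stand-ins for the instructions, and OverflowError stands in for the integer-overflow exception.

```python
MASK64 = (1 << 64) - 1


def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def daddui(rs, imm16):
    """64-bit add of a sign-extended immediate; the result wraps, no trap."""
    return (rs + sign_extend16(imm16)) & MASK64


def daddi(rs, imm16):
    """Same sum, but signed 64-bit overflow raises (models the trap)."""
    signed_rs = rs - (1 << 64) if rs >> 63 else rs
    total = signed_rs + sign_extend16(imm16)
    if not -(1 << 63) <= total < (1 << 63):
        raise OverflowError("integer overflow trap")
    return total & MASK64


print(sign_extend16(0xFFF8))          # -8: the 16-bit encoding of #-8
print(daddui(8000, 0xFFF8))           # 7992, i.e. R1 - 8
print(hex(daddui((1 << 63) - 1, 1)))  # 0x8000000000000000: wraps silently
```

Calling daddi((1 << 63) - 1, 1) raises OverflowError in this sketch, which is the only place where the two functions differ.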
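Thank you! But could you translate it for me?
I found this passage through Google, but I cannot fully understand it. Sorry, my English is not good enough!
Thanks in advance!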
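For example:
L.D takes 2t to execute...
ADD.D takes 3t...
DADDUI takes 2t...
Then you check whether the operands an instruction needs conflict with an earlier one,
and whether it has to wait...
----------
DADDUI is an integer instruction.
It needs 2t because of a RAW (read-after-write) hazard: BNE reads R1 right after DADDUI writes it.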