Monday, December 17, 2007

[EE_CSIE] Computer Architecture Chapter04 Notes (1)

=== Ch4 Exploiting Instruction Level Parallelism with S/W Approach ===

=== Ch4.1 Basic compiler techniques for exposing ILP ===
IA-64 : Intel Architecture-64, Intel's first 64-bit instruction set architecture; it is based on EPIC.

EPIC : Explicitly Parallel Instruction Computing


FIGURE 4.1 Latencies of FP operations used in this chapter.
This figure captures the idea that runs through all of Chapter 4: it gives the latency between instructions of different types.

First, an introduction to pipeline scheduling and loop unrolling:

For example:
for (i=1000; i>0; i=i-1) {
 X[i] = X[i] + s;
}
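
To make the idea of loop unrolling concrete before looking at the MIPS listings, here is a minimal source-level sketch in Python. It is only an illustration, not compiler output: the names `add_scalar` and `add_scalar_unrolled4` are made up for this note, and it assumes the trip count (1000) is a multiple of 4, so no cleanup loop is needed.

```python
def add_scalar(x, s):
    """Original loop: x[i] = x[i] + s for i = 1000 down to 1 (x[0] is unused)."""
    i = 1000
    while i > 0:
        x[i] = x[i] + s
        i = i - 1
    return x


def add_scalar_unrolled4(x, s):
    """Same work with the body replicated four times per iteration.

    Assumes the trip count (1000) is a multiple of 4.
    """
    i = 1000
    while i > 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i = i - 4  # one index update and one loop test per four elements
    return x


if __name__ == "__main__":
    data = [float(k) for k in range(1001)]  # indices 0..1000
    assert add_scalar(list(data), 2.5) == add_scalar_unrolled4(list(data), 2.5)
    print("both versions agree")
```

The unrolled version performs the same 1000 additions but only 250 index updates and loop tests, which is the overhead that unrolling removes.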

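1. MIPS code =>
Loop: L.D     F0,0(R1)     ; F0 = array element
      ADD.D   F4,F0,F2     ; add the scalar in F2
      S.D     F4,0(R1)     ; store the result
      DADDUI  R1,R1,#-8    ; decrement the pointer by 8 bytes (one double word)
      BNE     R1,R2,Loop   ; branch if R1 != R2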

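2. Without any scheduling (10 cycles) =>
Loop: L.D     F0,0(R1)
      stall                ; L.D -> ADD.D: 1 stall
      ADD.D   F4,F0,F2
      stall                ; ADD.D -> S.D: 2 stalls
      stall
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      stall                ; DADDUI -> BNE: 1 stall (BNE reads R1)
      BNE     R1,R2,Loop
      stall                ; branch delay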

3. After scheduling (6 cycles) =>
Loop: L.D     F0,0(R1)
      DADDUI  R1,R1,#-8
      ADD.D   F4,F0,F2
      stall
      BNE     R1,R2,Loop
      S.D     F4,8(R1)     ; in the branch delay slot; offset is 8 because R1 was already decremented

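4. Loop unrolled (four copies of the body) =>
(14 instructions, but 28 clock cycles without scheduling, or 28/4 = 7 per element: 14 issue cycles + 4 L.D stalls + 8 ADD.D stalls + 1 DADDUI stall + 1 branch stall. The 14 cycles, or 14/4 = 3.5 per element, are reached only once the unrolled loop is scheduled.)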
Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop

5. Unrolled loop, then scheduled (14 clock cycles, or 14/4 = 3.5 per element) =>
Loop: L.D     F0,0(R1)
      L.D     F6,-8(R1)
      L.D     F10,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F4,F0,F2
      ADD.D   F8,F6,F2
      ADD.D   F12,F10,F2
      ADD.D   F16,F14,F2
      S.D     F4,0(R1)
      S.D     F8,-8(R1)
      DADDUI  R1,R1,#-32
      S.D     F12,16(R1)   ; 16 - 32 = -16
      BNE     R1,R2,Loop
      S.D     F16,8(R1)    ; 8 - 32 = -24
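
The cycle counts of steps 2 to 5 can be checked with a small model. This is a rough sketch in Python under stated assumptions, not a pipeline simulator. It is single-issue and in-order, and it tracks only RAW dependences inside one pass of the loop body. It takes the FP stall values to be those of Figure 4.1 (which is not reproduced in these notes). It also assumes one stall between an integer ALU op and a dependent branch, plus a one-cycle branch delay, as read off step 2. The names `STALLS`, `KIND`, `count_cycles` and the tuple encoding of instructions are invented for illustration.

```python
# Stall cycles between a producer and a dependent consumer.
# FP entries are assumed to match Figure 4.1; ("int_alu", "branch") is an
# assumption read off the stall between DADDUI and BNE in step 2.
# Any pair not listed is assumed to need no stall.
STALLS = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,
}

KIND = {"L.D": "load", "ADD.D": "fp_alu", "S.D": "store",
        "DADDUI": "int_alu", "BNE": "branch"}


def count_cycles(program):
    """Return the clock cycles for one pass through a loop body.

    program: list of (opcode, dest_register_or_None, [source_registers]).
    """
    producer = {}  # register -> (kind, issue cycle) of its latest writer
    cycle = 0
    for op, dest, sources in program:
        kind = KIND[op]
        earliest = cycle + 1  # at most one instruction issues per cycle
        for reg in sources:
            if reg in producer:
                p_kind, p_cycle = producer[reg]
                stalls = STALLS.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + 1 + stalls)
        cycle = earliest
        if dest is not None:
            producer[dest] = (kind, cycle)
    # Assumed one-cycle branch delay: an unfilled slot costs one stall.
    if program and KIND[program[-1][0]] == "branch":
        cycle += 1
    return cycle


def body(f_in, f_out):
    """One load/add/store copy of the loop body."""
    return [("L.D", f_in, ["R1"]),
            ("ADD.D", f_out, [f_in, "F2"]),
            ("S.D", None, [f_out, "R1"])]


PAIRS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
TAIL = [("DADDUI", "R1", ["R1"]), ("BNE", None, ["R1", "R2"])]

UNSCHEDULED = body("F0", "F4") + TAIL                            # step 2
SCHEDULED = [("L.D", "F0", ["R1"]),                              # step 3
             ("DADDUI", "R1", ["R1"]),
             ("ADD.D", "F4", ["F0", "F2"]),
             ("BNE", None, ["R1", "R2"]),
             ("S.D", None, ["F4", "R1"])]
UNROLLED = [ins for a, b in PAIRS for ins in body(a, b)] + TAIL  # step 4
UNROLLED_SCHEDULED = (                                           # step 5
    [("L.D", a, ["R1"]) for a, _ in PAIRS]
    + [("ADD.D", b, [a, "F2"]) for a, b in PAIRS]
    + [("S.D", None, ["F4", "R1"]),
       ("S.D", None, ["F8", "R1"]),
       ("DADDUI", "R1", ["R1"]),
       ("S.D", None, ["F12", "R1"]),
       ("BNE", None, ["R1", "R2"]),
       ("S.D", None, ["F16", "R1"])]
)

if __name__ == "__main__":
    for name, prog in [("unscheduled", UNSCHEDULED),
                       ("scheduled", SCHEDULED),
                       ("unrolled", UNROLLED),
                       ("unrolled + scheduled", UNROLLED_SCHEDULED)]:
        print(name, count_cycles(prog))
```

Under these assumptions the model gives 10, 6, 28 and 14 cycles. The 14 cycles (3.5 per element) belong to the unrolled loop after scheduling; without scheduling the unrolled loop needs 28 (7 per element). The model only reproduces the totals; it does not decide where in the listing a stall is placed.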


8 comments:

  1. Why is a stall needed between the following two instructions?
    ===================
    DADDUI R1,R1,#-8
    BNE   R1,R2,Loop

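  2. It is for the sake of parallelism / pipelining.
    Simply put, the idea is to slot useful work into every available gap :)

    For example:
    the L.D instruction takes 2t to execute...
    the ADD.D instruction takes 3t to execute...
    the DADDUI instruction takes 2t to execute...
    Then you consider whether the resources they need conflict,
    and whether anything has to wait...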

  3. Why is the immediate in the following instruction -8 rather than -4?

    DADDUI  R1,R1,#-8

  4. Because in this MIPS assembly code
    each double word (DW) is 8 bytes,
    and this instruction decrements the pointer by one element.

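  5. A question: I have only just started studying computer architecture
    and I don't understand what the DADDUI instruction means.
    Could the blog owner explain it? Thanks.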

  6. DADDU and DADDUI are unsigned additions.

    DADDUI is used to force the string of 16 ones to be zero-extended, because it's an unsigned value.

    Without the U, the string of 16 ones would be treated as the 2's complement value -1 and sign-extended to a 64-bit negative one, which is NOT what we want.

    The reason DADDU is used has to do with exceptions, which are among the things that make pipelining hard to implement. On some machines, DADDU will not cause condition codes to be set, whereas DADD will.

    However, whether the middle instruction is DADDU or DADD doesn't change the answer. But if DADDUI isn't used, the maximum immediate value has a zero in the sign bit and ones in all the remaining bits, which is 32767 in decimal.

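  7. Thank you! But could you translate it for me?
    I had found this article on Google, but I can't fully understand it! Sorry, my English isn't good enough!
    Thanks for your trouble!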

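  8. For example:
    the L.D instruction takes 2t to execute...
    the ADD.D instruction takes 3t to execute...
    the DADDUI instruction takes 2t to execute...
    Then you consider whether the resources they need conflict,
    and whether anything has to wait...

    ----------

    DADDUI is an integer instruction.
    It takes 2t because of the RAW hazard: BNE reads R1, which DADDUI has just written.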

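
Two of the questions above come down to simple arithmetic: why the pointer moves by 8, and what sign extension versus zero extension does to a 16-bit immediate. The Python sketch below is only an illustration; `sign_extend16`, `zero_extend16`, `element_address` and the base address 4096 are hypothetical names and values. As far as I know, the MIPS64 manuals spell the instruction DADDIU and sign-extend its immediate, with "unsigned" meaning only that overflow does not trap. That is consistent with the #-8 used in the loop.

```python
import struct

DOUBLE_BYTES = struct.calcsize("<d")  # size in bytes of one double-precision element


def sign_extend16(bits):
    """Read a 16-bit pattern as a two's-complement (signed) value."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits


def zero_extend16(bits):
    """Read a 16-bit pattern as an unsigned value."""
    return bits & 0xFFFF


def element_address(base, i):
    """Byte address of X[i] for an array of doubles starting at `base`."""
    return base + DOUBLE_BYTES * i


if __name__ == "__main__":
    base = 4096  # hypothetical base address, for illustration only
    # Neighbouring doubles are 8 bytes apart, hence DADDUI R1,R1,#-8.
    print(element_address(base, 5) - element_address(base, 4))
    # Sixteen ones: -1 if sign-extended, 65535 if zero-extended.
    print(sign_extend16(0xFFFF), zero_extend16(0xFFFF))
    # Largest positive signed 16-bit immediate.
    print(sign_extend16(0x7FFF))
    # The immediate -8 survives a round trip through 16 bits when sign-extended.
    print(sign_extend16(-8 & 0xFFFF))
    # Moving S.D below DADDUI: 8(R1) after the decrement is the old 0(R1).
    r1 = element_address(base, 1000)
    print((r1 - 8) + 8 == r1)
```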