※ A delayed branch can reduce the control hazard penalty.
Code example:
LD R1,0(R2)
DSUBU R1,R1,R3
BEQZ R1,L
OR R4,R5,R6
DADDU R10,R4,R3
L: DADDU R7,R8,R9
=> DSUBU and BEQZ depend on LD (through R1)
=> A stall is needed after LD.
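The dependence above can be checked mechanically. The following is a minimal sketch, assuming a classic five-stage pipeline with full forwarding, so that only a load immediately followed by a consumer of its result costs one stall cycle. The tuple layout and the name `load_use_stalls` are made up for illustration.

```python
# Each instruction: (opcode, destination register, source registers).
# Assumption: full forwarding, so only a load followed immediately by an
# instruction that reads its result costs one stall cycle.
program = [
    ("LD",    "R1",  ["R2"]),
    ("DSUBU", "R1",  ["R1", "R3"]),
    ("BEQZ",  None,  ["R1"]),
    ("OR",    "R4",  ["R5", "R6"]),
    ("DADDU", "R10", ["R4", "R3"]),
    ("DADDU", "R7",  ["R8", "R9"]),
]

def load_use_stalls(instrs):
    """Count one-cycle stalls caused by a load feeding the next instruction."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LD" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

print(load_use_stalls(program))  # 1
```

The single stall comes from DSUBU reading R1 right after LD writes it. An independent instruction placed between the two would hide it.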
1. Branch almost always taken:
=> If R7 is not needed on the fall-through path,
=> we could increase speed by moving DADDU R7,R8,R9 to the position after LD.
2. Branch rarely taken:
=> If R4 is not needed on the taken path,
=> we could increase speed by moving OR R4,R5,R6 to the position after LD.
3. Profile-based predictor:
Uses profile information collected from earlier runs to predict branches.
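The three cases above can be tied together in a small sketch: count how often a branch was taken in an earlier run, predict the majority direction, and use that to pick which instruction to move after LD. The trace format, the label `"BEQZ@L"`, and the function names are hypothetical. A real compiler must also confirm that the moved instruction's result is not needed on the other path.

```python
from collections import Counter

def build_profile(trace):
    """trace: iterable of (branch_label, taken) pairs from an earlier run."""
    taken = Counter()
    total = Counter()
    for label, was_taken in trace:
        total[label] += 1
        if was_taken:
            taken[label] += 1
    return taken, total

def predict_taken(profile, label):
    """Static prediction: the direction seen most often in the profile."""
    taken, total = profile
    if total[label] == 0:
        return False  # assumption: unseen branches default to not taken
    return 2 * taken[label] >= total[label]

def pick_hoist_candidate(profile, label):
    """Choose which instruction to move into the slot after LD."""
    if predict_taken(profile, label):
        return "DADDU R7,R8,R9"  # from the taken path (label L)
    return "OR R4,R5,R6"         # from the fall-through path

# Hypothetical profiling run: the BEQZ was taken 9 times out of 10.
trace = [("BEQZ@L", True)] * 9 + [("BEQZ@L", False)]
profile = build_profile(trace)
print(predict_taken(profile, "BEQZ@L"))         # True
print(pick_hoist_candidate(profile, "BEQZ@L"))  # DADDU R7,R8,R9
```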
=== Ch4.3 Static Multiple Issue : VLIW ===
※ Static: A statically scheduled superscalar requires compiler assistance.
※ Dynamic: A dynamically scheduled superscalar requires less compiler assistance, but has hardware costs.
※VLIW : Very Long Instruction Word
Packs many operations into a single instruction (64 to 128 bits or more).
VLIWs use multiple, independent functional units. Rather than attempting to issue multiple, independent instructions to the units, a VLIW packages the multiple operations into one very long instruction, or requires that the instructions in the issue packet satisfy the same constraints.
※ Basic VLIW approach ---
1. Local scheduling techniques:
a.) Loop unrolling generates straight-line code.
b.) Operate on a single basic block.
2. Global scheduling techniques: (trace scheduling is a global scheduling technique developed specifically for VLIWs)
a.) scheduling code across branches.
b.) More complex in structure.
c.) Must deal with significantly more complicated trade-offs in optimization.
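To make local scheduling concrete, here is a minimal sketch of a greedy list scheduler that packs the operations of one basic block into VLIW bundles. The machine description (`slots`), the latencies, and the operation tuples are assumptions for illustration. It tracks only true (RAW) dependences, assuming registers were renamed during unrolling so that no WAR/WAW or memory conflicts remain.

```python
def schedule_block(ops, slots):
    """Greedy list scheduling of one basic block into VLIW bundles.

    ops:   list of (name, unit, latency, dest, srcs) in program order
    slots: dict unit -> number of units of that kind per bundle
    Only RAW dependences are tracked (assumes renamed registers).
    """
    ready_at = {}  # register -> first cycle its value can be read
    bundles = []   # bundles[c] = {unit: [op names issued in cycle c]}
    for name, unit, latency, dest, srcs in ops:
        # Earliest cycle at which all source operands are available.
        cycle = max((ready_at.get(r, 0) for r in srcs), default=0)
        while True:
            while len(bundles) <= cycle:
                bundles.append({u: [] for u in slots})
            if len(bundles[cycle][unit]) < slots[unit]:
                break
            cycle += 1  # no free unit of this kind: try the next bundle
        bundles[cycle][unit].append(name)
        if dest is not None:
            ready_at[dest] = cycle + latency
    return bundles

# Two unrolled iterations of x[i] = x[i] + s (latencies are hypothetical).
ops = [
    ("L.D F0",   "mem", 2, "F0", ["R1"]),
    ("L.D F6",   "mem", 2, "F6", ["R1"]),
    ("ADD.D F4", "fp",  3, "F4", ["F0", "F2"]),
    ("ADD.D F8", "fp",  3, "F8", ["F6", "F2"]),
    ("S.D F4",   "mem", 1, None, ["F4", "R1"]),
    ("S.D F8",   "mem", 1, None, ["F8", "R1"]),
]
slots = {"mem": 2, "fp": 2, "int": 1}
bundles = schedule_block(ops, slots)
print(len(bundles))  # 6
```

With these numbers the two loads share bundle 0, the two adds share bundle 2, and the two stores share bundle 5. The bundles in between are empty, which is exactly the unused-slot cost that drives VLIW code size up.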
※ For the original VLIW model, there are both technical and logistical problems.
=> The technical problems are the increase in code size and the limitations of lock-step operation.
※ Two different elements combine to increase code size substantially for a VLIW.
1. Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops (as in earlier examples), thereby increasing code size.
2. Whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding.
=> Ways to counter the increase in code size:
1. Clever encodings (e.g., let several functional units share one large immediate field)
2. Compress the instructions in main memory
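The effect of wasted bits, and of compressing them away, can be estimated with a small sketch. The slot count, the field width, and the presence-mask scheme below are assumptions chosen for illustration. The mask is just one possible way to compress, not a specific machine's format.

```python
NUM_SLOTS = 5     # assumption: five functional units per instruction
BITS_PER_OP = 24  # assumption: fixed-width field per operation

def fixed_size_bits(bundles):
    """Every bundle stores all slots, with no-ops in the unused ones."""
    return len(bundles) * NUM_SLOTS * BITS_PER_OP

def compressed_size_bits(bundles):
    """Store a NUM_SLOTS-bit presence mask, then only the used operations."""
    total = 0
    for bundle in bundles:
        used = sum(1 for op in bundle if op is not None)
        total += NUM_SLOTS + used * BITS_PER_OP
    return total

# None marks an unused functional-unit slot.
bundles = [
    ["L.D", "L.D", None, None, None],
    [None, None, None, None, None],
    [None, None, "ADD.D", "ADD.D", None],
    ["S.D", "S.D", None, None, "DADDUI"],
]
print(fixed_size_bits(bundles))       # 480
print(compressed_size_bits(bundles))  # 188
```

The compressed form would be expanded back to full-width instructions when fetched, for example on the way into the cache or at decode.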
※ Early VLIWs operated in lock-step: there was no hazard detection hardware at all.
Because all functional units must stay synchronized, a stall in any one functional unit's pipeline stalls the entire processor.
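A rough sketch of the cost of lock-step operation: if each bundle takes one cycle plus the stall of its slowest unit, then every other unit sits idle for the difference. The stall numbers and function names below are made up for illustration.

```python
def lockstep_cycles(stalls_per_bundle):
    """stalls_per_bundle[i][u] = extra stall cycles unit u needs in bundle i.
    In lock-step, every unit waits for the slowest unit in the bundle."""
    return sum(1 + max(stalls) for stalls in stalls_per_bundle)

def idle_unit_cycles(stalls_per_bundle):
    """Cycles units spend waiting only because another unit stalled."""
    return sum(max(stalls) - s
               for stalls in stalls_per_bundle
               for s in stalls)

# Hypothetical machine with three functional units and four bundles.
stalls = [
    [0, 0, 0],
    [0, 3, 0],  # e.g. a cache miss on the second unit
    [0, 0, 0],
    [1, 0, 2],
]
print(lockstep_cycles(stalls))   # 9
print(idle_unit_cycles(stalls))  # 9
```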
※ Logistical problem: binary code compatibility
1. In a strict VLIW approach, the code sequence makes use of both the instruction set definition and the detailed pipeline structure.
2. Different numbers of functional units and unit latencies require different versions of code.
3. One possible solution is object-code translation or emulation.
4. Another approach is to temper the strictness of the approach, so that binary compatibility is still feasible.
※ A multiple-issue processor has two potential advantages over a vector processor:
1. It has the potential to extract some parallelism from less regularly structured code.
2. It can use a more conventional, and typically less expensive, cache-based memory system.