※ The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations depend on data values produced in earlier iterations; such a dependence is called a loop-carried dependence.
Code example:
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
--
A[2] = A[1] + C[1]
B[2] = B[1] + A[2]
--
A[3] = A[2] + C[2]
B[3] = B[2] + A[3]
--
... ... ...
--
A[101] = A[100] + C[100]
B[101] = B[100] + A[101]
--
1. S1 uses the value computed by S1 in the previous iteration, and S2 likewise uses the result of S2 from the previous iteration => loop-carried dependence.
2. Within the same iteration, S2 depends on S1, but this dependence is not loop-carried; executing the statements in order is sufficient.
※ A loop-carried dependence does not necessarily prevent parallelism:
Code example:
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
--
A[1] = A[1] + B[1]
B[2] = C[1] + D[1]
--
A[2] = A[2] + B[2]
B[3] = C[2] + D[2]
--
... ... ...
--
A[100] = A[100] + B[100]
B[101] = C[100] + D[100]
--
S1 depends on S2; the dependence between them is loop-carried.
Keys to the transformation:
1. There is no dependence from S1 to S2, so interchanging the two statements does not affect the execution of S2.
2. On the first iteration, S1 depends on the value of B[1] computed before the loop begins.
Transformed into:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
B[i+1] = C[i] + D[i]; /* S2 */
A[i+1] = A[i+1] + B[i+1]; /* S1 */
}
B[101] = C[100] + D[100];
※ A recurrence occurs when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding, as in the first loop example above.
Detecting a recurrence can be important for two reasons:
1. Some architectures (especially vector computers) have special support for executing recurrences.
2. Some recurrences can be the source of a reasonable amount of parallelism.
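A minimal sketch of a typical recurrence (the array name Y and the bounds are illustrative, not from the text):
for (i=2; i<=100; i=i+1) {
Y[i] = Y[i-1] + Y[i]; /* Y[i] depends on Y[i-1] written in the previous iteration */
}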
※ Dependence distance :
for (i=6;i<=100;i=i+1) {
Y[i] = Y[i-5] + Y[i];
}
On iteration i, the loop reads array element i-5, so the dependence distance is 5.
The larger the dependence distance, the more potential parallelism can be obtained by unrolling the loop (see the sketch below).
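Since the dependence distance is 5, any five consecutive iterations are mutually independent; a hedged sketch of unrolling the loop by 5 (the 95 iterations divide evenly) makes this explicit:
for (i=6; i<=100; i=i+5) {
    /* these five statements write Y[i]..Y[i+4] but read only Y[i-5]..Y[i-1],
       so they are independent of one another and can be scheduled in parallel */
    Y[i]   = Y[i-5] + Y[i];
    Y[i+1] = Y[i-4] + Y[i+1];
    Y[i+2] = Y[i-3] + Y[i+2];
    Y[i+3] = Y[i-2] + Y[i+3];
    Y[i+4] = Y[i-1] + Y[i+4];
}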
※ Finding dependences is important for three tasks:
1. Good scheduling of code.
2. Determining which loops might contain parallelism.
3. Eliminating name dependences.
※ How does the compiler detect dependences?
Nearly all dependence analysis algorithms work on the assumption that array indices are affine: a one-dimensional array index is affine if it can be written in the form a × i + b, where a and b are constants and i is the loop index variable. An index such as x[y[i]] is nonaffine.
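A short illustration of the distinction (array names are arbitrary):
for (i=1; i<=100; i=i+1) {
    A[2*i+3] = A[2*i] + 1.0;  /* affine indices: a*i + b with (a=2, b=3) and (a=2, b=0) */
    x[y[i]]  = x[y[i]] * 2.0; /* nonaffine: the index is itself an array reference */
}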
※ A dependence exists if two conditions hold (GCD test):
1. There are two iteration indices, j and k, both within the limits of the for loop.
That is, m ≤ j ≤ n and m ≤ k ≤ n.
2. The loop stores into an array element indexed by a × j + b and later fetches from that same array element when it is indexed by c × k + d. That is, a × j + b = c × k + d.
Example: use the GCD test to determine whether dependences exist in the following loop:
for (i=1; i<=100; i=i+1) {
X[2*i+3] = X[2*i] * 5.0;
}
Solution: Given the values a = 2, b = 3, c = 2, and d = 0,
GCD(a,c) = 2 and d – b = –3.
Since 2 does not divide –3, no dependence is possible.
=> When the GCD test fails, it guarantees that no dependence exists; however, the test may succeed (the GCD divides d – b) even though no dependence actually exists, because the loop bounds are not taken into account.
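A minimal C sketch of the GCD test (function names are mine; a result of 1 means only that a dependence MAY exist, per the caveat above):
#include <stdio.h>

/* Euclid's algorithm for the greatest common divisor. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* Store to X[a*j+b], later load from X[c*k+d]: a dependence is possible only
   if GCD(c,a) divides (d-b). Returns 1 = dependence may exist, 0 = definitely
   none. The loop bounds are ignored, so 1 is only "maybe". */
static int gcd_test(int a, int b, int c, int d) {
    int g = gcd(c, a);
    if (g == 0) return b == d; /* degenerate case: both coefficients are 0 */
    return (d - b) % g == 0;
}

int main(void) {
    /* X[2*i+3] = X[2*i] * 5.0  =>  a=2, b=3, c=2, d=0 */
    printf("%d\n", gcd_test(2, 3, 2, 0)); /* prints 0: no dependence */
    return 0;
}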
※ Situations in which array-oriented dependence analysis cannot tell us what we want to know:
1. When objects are referenced via pointers.
2. When array indexing is indirect through another array.
3. When a dependence may exist for some values of the inputs, but does not exist in actuality when the code is run because the inputs never take on those values.
4. When an optimization depends on knowing more than just the possibility of a dependence, and needs to know on which write of a variable a read of that variable depends.
※ The basic approach used in points-to analysis relies on information from:
1. Type information, which restricts what a pointer can point to.
2. Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to. (For example, if p points to X and q can never point to X, then p and q cannot point to the same object.)
3. Information derived from pointer assignments. (For example, if the value of q is assigned to p, then p points to whatever q points to.) See the sketch below.
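A tiny C sketch of items 2 and 3 (variable names are arbitrary):
int main(void) {
    int X = 1, Y = 2;
    int *p, *q;
    q = &X;  /* address-of: q may point to X                               */
    p = &Y;  /* p may point to Y; at this point p and q cannot alias       */
    p = q;   /* pointer assignment: p now points to whatever q points to,
                so p and q may refer to the same object (X)                */
    return 0;
}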
※ Eliminating Dependent Computations:
1. Copy propagation: eliminates operations that copy values, avoiding redundant copy computations.
DADDUI R1, R2, #4
DADDUI R1, R1, #4
becomes => DADDUI R1, R2, #8
2. Tree height reduction:
ADD R1,R2,R3
ADD R4,R1,R6
ADD R8,R4,R7
Transformation: 3 cycles => 2 cycles
ADD R1,R2,R3
ADD R4,R6,R7
ADD R8,R1,R4
3. Recurrences:
sum = sum + x;
Unrolled 5 times: sum = sum + x1 + x2 + x3 + x4 + x5; /* 5 dependent adds => 5 cycles */
=> sum = ((sum + x1) + (x2 + x3)) + (x4 + x5); /* 3 cycles */
※ Software Pipelining: Symbolic Loop Unrolling:
Software pipelining reorganizes loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.
See Fig. 4.6.
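A rough C-level sketch of the same idea (my own names; assumes n >= 2). Each original iteration a[i] = a[i] + c has three stages, load / add / store, and the pipelined body mixes stages from three different iterations:
void add_c_pipelined(double *a, int n, double c) {
    double x_load, x_add;
    int i;
    x_load = a[0];           /* start-up: load for iteration 0 */
    x_add  = x_load + c;     /* start-up: add  for iteration 0 */
    x_load = a[1];           /* start-up: load for iteration 1 */
    for (i = 2; i < n; i++) {
        a[i-2] = x_add;      /* store for iteration i-2 */
        x_add  = x_load + c; /* add   for iteration i-1 */
        x_load = a[i];       /* load  for iteration i   */
    }
    a[n-2] = x_add;          /* wind-down: store for iteration n-2 */
    a[n-1] = x_load + c;     /* wind-down: finish iteration n-1    */
}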
※ Global Code Scheduling:
1. Effective scheduling of a loop body with internal control flow will require moving instructions across branches.
2. Aims to compact a code fragment with internal control structure into the shortest possible sequence (the critical path) that preserves the data and control dependences.
3. It can reduce the effect of control dependences arising from conditional nonloop branches by moving code.
4. Effectively using global code motion requires estimates of the relative frequency of different paths (see the sketch below).
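A hedged C sketch (names are mine, loosely in the textbook's spirit) of the kind of fragment global code scheduling targets: shortening the frequent path means moving the assignments to B[i] and C[i] across the internal branch, which only pays off if the compiler can estimate which path is taken most often.
void body_with_branch(double *A, double *B, double *C, double *X, int n) {
    int i;
    for (i = 0; i < n; i++) {
        A[i] = A[i] + B[i];
        if (A[i] != 0)
            B[i] = A[i] * 2;  /* assumed frequent path        */
        else
            *X = *X + A[i];   /* assumed infrequent path      */
        C[i] = A[i] + *X;     /* join point after the branch  */
    }
}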
※ Trace Scheduling: focusing on the critical path
1. Useful for processors with a large number of issues per clock.
2. A way to organize the global code motion process so as to simplify code scheduling, by incurring the cost of possible code motion on the less frequent paths. (Used when different paths differ markedly in execution frequency.)
※ Two steps of Trace Scheduling :
1. Trace selection: tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions.
2. Trace compaction: tries to squeeze the trace into a small number of wide instructions. (This is essentially code scheduling.)
The advantage of the trace scheduling approach is that it simplifies the decisions concerning global code motion.
※ Superblocks: address the complication in trace scheduling caused by entries into or exits out of the middle of the trace.
1. Are formed by a process similar to that used for traces.
2. But are a form of extended basic block, which is restricted to a single entry point but allows multiple exits.
How can a superblock with only one entrance be constructed? The answer is to use tail duplication to create a separate block that corresponds to the portion of the trace after the entry.
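A small C sketch of tail duplication (the condition and the helpers a/b/c are hypothetical): the common tail c() that used to be a side entrance into the trace is copied onto the infrequent path, so the frequent path a(); c() becomes a superblock with a single entry and possibly several exits.
void a(void); void b(void); void c(void);

/* Before: c() is reachable from both paths, so the trace "a(); c()" has a side entrance at c(). */
void before(int likely_cond) {
    if (likely_cond)
        a();  /* assumed frequent path   */
    else
        b();  /* assumed infrequent path */
    c();      /* common tail: two entries */
}

/* After tail duplication: the frequent path forms a superblock with one entry; the rare path gets its own copy of the tail. */
void after(int likely_cond) {
    if (likely_cond) {
        a();
        c();  /* superblock: single entry, multiple exits possible */
    } else {
        b();
        c();  /* duplicated tail for the infrequent path */
    }
}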
Compared with the ordinary trace-generation approach, using superblocks reduces the bookkeeping and scheduling complexity, but the code may become larger than with the trace-based method. Therefore, like trace scheduling, superblocks are most appropriate when other techniques do not work.