Loop Transformations - Oracle® Solaris Studio 12.4: C User's Guide

Language:

3.2.5 Loop Transformations

The compiler performs several loop restructuring transformations to help improve the parallelization of a loop in programs. Some of these transformations can also improve the single processor execution of loops as well. The transformations performed by the compiler are described in this section.

3.2.5.1 Loop Distribution

Loops often contain a few statements that cannot be executed in parallel and many statements that can be executed in parallel. Loop Distribution attempts to remove the sequential statements into a separate loop and gather the parallelizable statements into a different loop. This process is illustrated in the following example:

Example 3-12 Candidate for Loop Distribution

for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];               /* S1 */
    a[i+1] = (a[i-1] + a[i] + a[i+1]/3.0;  /* S2 */
    y[i] = z[i] - x[i];                    /* S3 */
}

Assuming that arrays x, y, w, a, and z do not overlap, statements S1 and S3 can be parallelized but statement S2 cannot be. The following example shows how the loop looks after it is split or distributed into two different loops.

Example 3-13 Distributed Loop

/* L1: parallel loop */
for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];              /* S1 */
    y[i] = z[i] - x[i];                   /* S3 */
}
/* L2: sequential loop */
for (i=0; i < n; i++) {
    a[i+1] = (a[i-1] + a[i] + a[i+1]/3.0; /* S2 */
}

After this transformation, loop L1 does not contain any statements that prevent the parallelization of the loop and may be executed in parallel. Loop L2, however, still has a non-parallelizable statement from the original loop.

Loop distribution is not always profitable or safe to perform. The compiler performs analysis to determine the safety and profitability of distribution.

3.2.5.2 Loop Fusion

If the granularity of a loop, or the work performed by a loop, is small, the performance gain from distribution might be insignificant because the overhead of parallel loop startup is too high compared to the loop workload. In such situations, the compiler uses loop fusion to combine several loops into a single parallel loop, increasing the granularity of the loop. Loop fusion is easy and safe when loops with identical trip counts are adjacent to each other. Consider the following example:

Example 3-14 Loops With Small Work Loads

/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];        /* S1 */
}
/* L2: another short parallel loop */
for (i=0; i < 100; i++) {
    b[i] = a[i] * d[i];        /* S2 */
}

The two short parallel loops are next to each other, and can be safely combined as follows:

Example 3-15 The Two Loops Fused

/* L3: a larger parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];       /* S1 */
    b[i] = a[i] * d[i];       /* S2 */
}

The new loop generates half the parallel loop execution overhead. Loop fusion can also help in other ways. For example if the same data is referenced in two loops, then combining them can improve the locality of reference.

However, loop fusion is not always safe to perform. If loop fusion creates a data dependence that did not exist previously, the fusion could result in incorrect execution. Consider the following example:

Example 3-16 Unsafe Fusion Candidates

/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];      /* S1 */
}
/* L2: a short loop with data dependence */
for (i=0; i < 100; i++) {
    a[i+1] = a[i] * d[i];    /* S2 */
}

If the loops in this example are fused, a data dependence is created from statement S2 to S1. In effect, the value of a[i] in the right side of statement S1 is computed in statement S2. If the loops are not fused, this dependence would not occur. The compiler performs safety and profitability analysis to determine whether loop fusion should be done. Often, the compiler can fuse an arbitrary number of loops. Increasing the granularity in this manner can sometimes push a loop far enough up for it to be profitable for parallelization.

3.2.5.3 Loop Interchange

Parallelizing the outermost loop in a nest of loops is generally more profitable because the overheads incurred are small. However, parallelizing the outermost loops is not always safe due to dependences that might be carried by such loops. The following example illustrates this situation.

Example 3-17 Nested Loop That Cannot Be Parallelized

for (i=0; i <n; i++) {
    for (j=0; j <n; j++) {
            a[j][i+1] = 2.0*a[j][i-1];
    }
}

In this example, the loop with the index variable i cannot be parallelized because of a dependency between two successive iterations of the loop. The two loops can be interchanged and the parallel loop (the j-loop) becomes the outer loop:

Example 3-18 Loops Interchanged

for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
            a[j][i+1] = 2.0*a[j][i-1];
    }
}

The resulting loop incurs an overhead of parallel work distribution only once, while previously the overhead was incurred n times. The compiler performs safety and profitability analysis to determine whether to perform loop interchange.