PARALLEL PROGRAMMING (gabor_par, lab_limits)


Chapel offers language support for task-driven and data-driven parallelism. In task parallelism you define actions that can be done independently.  These get started in separate threads and and synchronization variables control the flow between tasks.  For data parallelism you specify a chunk of work to be done and Chapel splits it up into smaller sets, executing each separately.  This is often followed by a serial step to combine the individual results, the well-known map-reduce technique, where map describes applying a function to data in parallel and reduce collapses the results.

We have two examples that will serve as good, reasonably simple introductions to these two techniques.  We often want to run Gabor filters as a bank, changing the angle of rotation but otherwise running the same filter on the same data.  Each filter runs independently; this is a good fit for task parallelism.  The second exercise of the Color Conversion section asked you to evaluate the L*A*B* conversion for all possible RGB combinations, 256*256*256 = 15,777,216 data points, to find the limits of the transform. We want to evaluate the same rgb_to_lab() function at each pixel, and then collapse the results to a single minimum or maximum: data parallelism.

The directory for this section is parallel.  You'll be using the C functions for reading and writing PNG images in img_png_v3.c and img_png_v3.h, and the color library ip_color_v2.chpl.  Use make name to build the programs, leaving off the '.chpl' suffix.  Executables get placed in the bin sub-directory, intermediate files (which you should never have to look at)  in build.  The image for this section is francnotch.png, the same one used for the Gabor filters.

The programs and image for this section are here (as zip file).  A PDF version of the text is here.

Data Parallelism (lab_limits)

The basic Chapel statement for operating on data in parallel is forall. As with for, if followed by do it takes a single statement to execute on the data, otherwise braces allow for a block of statements.  Chapel controls how the iteration is split: there may be a separate task for each iteration, they may be grouped, or it may even execute completely serially.  The statements after the forall will only be executed when all iterations are done.

Like for, forall can be used as an expression, returning a collection of values.  You may not break out of a forall, nor return.

Conceptually you can think of the body of the loop forming a procedure that can be called in a separate task.  Any variables from outside the loop are passed as arguments to that procedure.  This isolates it from the world outside the scope of the loop, preventing race conditions.  Normally this argument list is without any intents, ie. all variables except arrays and the synchronization types we will talk about with task parallelism are passed as const; the others have type ref.  You can change the intent by adding a with clause to the forall

    forall (x,y) in img.area

      with (ref sum : real, ref cnt : int) {



You cannot use the out or inout intent, as these can cause data races (as can assigning to a ref).

Because arrays are passed by reference, you need to ensure that accesses do not overlap between different iterations.  Chapel will not detect these.  If we're doing a column sum with

    var colsum : [0..ncol-1] int;

    forall y in 0..nrow-1 do

      for x in 0..ncol-1 do

        colsum[x/2] += img(x,y);

both even and odd columns get written to the same array value, ie. x/2 rounds to 0 for x = 1 and there will be a race at colsum[0].  If you swap the loops, putting x in the forall, then you may also not be safe because the range may split so that x = 0 and x = 1 are in different tasks.  It would be safe to use the outer loop index directly, without alteration so that no accesses could collide.  Any inner loops are run serially, so it doesn't make sense to embed foralls.

As we've said a couple times, the process of turning these intermediate results into a single value is called a reduction.  You have a limited number of operators that can be applied: addition + and multiplication *, logical and && and or ||, bit-wise and &, or |, and xor ^, and min and max.  The format of the reduce command is

    <operator> reduce <iterator>

If you need the partial results as you go along, use scan instead of reduce.  Scan returns an array of the same size as what it iterates over, where each value is the cumulative sum of values to that point.

There are two other operators for reduce and scan: minloc and maxloc.  These take a zipper for the iterator with two ranges, one the data for which we want the minimum or maximum value, the other a “name”.  (Unfortunately what we would like is a version that uses a domain or range and just returns the indices at the extreme value, without bothering to explicitly provide it (with the accompanying chance for error).)  The reduction returns a tuple with the minimum or maximum and the corresponding element from the names list.

Consider two examples of (ab)using reduce and scan.  The first populates a 1000x1000 image with the square of the distance of the pixel (ie. x**2 + y**2) at each point.  

    var img1 : [0..999, 0..999] int;

    forall (x,y) in img1.domain do img1(x,y) = (x**2) + (y**2);

It then fills a column sum array in parallel using forall.  We use the domain .dim() method, which takes as argument the rank index, to extract the domain range.  The domain's first rank is run in parallel, and since the column sum depends on this index the addition is safe.  These two loops are equivalent:

    var colsumA : [img1.domain.dim(1)] int;

    forall x in img1.domain.dim(1) do

      for y in img1.domain.dim(2) do

        colsumA(x) += img1(x,y);

or  forall (x,y) in img1.domain do

      dolsumA(x) += img1(x,y);

Reducing this sums all 1,000,000 pixel values.

    var grandsum = + reduce colsumA;

    writeln("grandsum  ", grandsum);

    > grandsum  665667000000

We can actually write this as a one-liner without using the colsumA storage. We use a reduce to do the column sum by providing the column as a fixed value in the slice.  Wrapping this in a forall generates a collection of these sums, one for each column, which the outer reduce then collapses to the grand total.  This should match the value we just got.

    var grandsum2 = + reduce (forall x in img1.domain.dim(1) do

                               + reduce img1[x,img1.domain.dim(2)]);

    writeln("grandsum2 ", grandsum2);

    > grandsum2 665667000000

The second example uses scan to generate the solution to Exercise 3 of the Gabor Filter section, creating an integral image.  Scan produces an array with an incremental sum along one axis.  To declare an array of arrays, use separate brackets, one for each rank.  First we populate a small test image

    var img2 : [0..4, 0..4] int;

    forall (i,j) in img2.domain do

      img2(i,j) = (i*5) + j;


    > 0 1 2 3 4

    > 5 6 7 8 9

    > 10 11 12 13 14

    > 15 16 17 18 19

    > 20 21 22 23 24

Then we declare storage for the results of the scan, an array with the same number of columns and rows but split.  We use scan to get the cumulative sum, element by element, of each column.  We'll hard-wire the ranges to keep the code short, rather than using domain.dim().

    var colsumB : [0..4][0..4] int;

    forall j in 0..4 do colsumB(j) = + scan img2[0..4,j]

    for j in 0..4 do writeln(colsumB(j));

    > 0 5 15 30 50

    > 1 7 18 34 55

    > 2 9 21 38 60

    > 3 11 24 42 65

    > 4 13 27 46 70

We need the loop to print the array or Chapel will print it as a 1D list on one line.  Note that the values along each row are the cumulative sum of the corresponding column.  Because the scan proceeds along a column, it stores the result in rows; we need to transpose the matrix.

    forall j in 0..4 do

      for i in j+1..4 do

        colsumB[i][j] <=> colsumB[j][i];

Now if we do another scan in the opposite direction we get the sum of all those with smaller indices.  This is the integral image.

    var intimg : [0..4][0..4] int;

    forall i in 0..4 do

      intimg(i) = + scan colsumB[i][0..4];

    for i in 0..4 do writeln(intimg(i));

    > 0 1 3 6 10

    > 5 12 21 32 45

    > 15 33 54 78 105

    > 30 64 102 144 190

    > 50 105 165 230 300

You'll find this example in ex_datapar.chpl.  Compile and run it with

    make ex_datapar


By the way, Chapel provides a shortcut for forall statements.  You can put the everything except the forall in square brackets before a single statement to iterate over.  The first loop above to initialize img1 could have been written

    img1 = [(x,y) in img1.domain] (x**2) + (y**2);

Limits Found in Serial (lab_limits_ser)

For our initial, serial solution to Exercise 2 of the Color Conversion section, we choose to track the extreme values for each color plane and their RGB "position" as we cycle through all the combinations.  The code is straightforward but repetitive:

    var l, l_a, l_b : real;

    var lmin = max(real);

    var lmax = min(real);

    var lminpos, lmaxpos : 3*int;

    /* Similarly for A, B. */

    lmin = max(real);

    lmax = min(real);

    for r in 0..255 {

      for g in 0..255 {

        for b in 0..255 {

          rgbpix_to_lab(r : c_uchar, g : c_uchar, b : c_uchar, l, l_a, l_b);

          if (l < lmin) {

            lmin = l;

            lminpos = (r, g, b);


          if (lmax < l) {

            lmax = l;

            lmaxpos = (r, g, b);


          /* Similarly for A, B. */




l_a and l_b are used for the A and B planes to avoid confusion with 'b'lue. At the end of the loop we just print the minimum and maximum values and their position:

    (rpos, gpos, bpos) = lminpos;

    writef("  L  %7.2dr (%3i %3i %3i)", lmin, rpos, gpos, bpos);

    (rpos, gpos, bpos) = lmaxpos;

    writef("  -  %7.2dr (%3i %3i %3i)", lmax, rpos, gpos, bpos);

(A quick aside about the formatting codes for writef().  The function is similar to C's printf(), except that the letter codes per type differ: %i int, %u uint, %r real, %s string. %m for imag, and %z complex.  The d before the r means to use decimal, not exponential, format.  x would print a hexadecimal integer, b a binary, o octal.  More details are in the modules/standard/IO.chpl file.)  We chose to deconstruct the position tuples before printing; alternatively, we could have done

    writef("  L  %7.2dr (%3i %3i %3i)", lmin, lminpos(1), lminpos(2), lminpos(3));

The for loop is a serial command.  Changing this is what's needed to have work done in parallel.

Compile and run this with

    make lab_limits_ser


The program takes no command-line arguments.  As a check, it first prints the white point to confirm L=100.0, A=B=0.0.  If you monitor your CPU usage, you'll see that the program operates serially on one core.

Output from the LAB limits programOutput from the LAB limits program
Output from the LAB limits program

Parallelism with Local Updates (lab_limits_parv1)

Our first parallel takes the outer for loop and changes it to forall.  This causes the compiler to complain about all the variables the loop uses that are declared outside the loop.  To fix this, we'll replace the scalars we used to track the extrema and their location for each plane with arrays. We'll store the limits per r, the color we use in the forall loop, so there is no problem accessing each value (within a parallel task r is constant).

    var lmin, lmax : [0..255] real;

    var lminpos, lmaxpos : [0..255] 3*int;

    /* Similarly for A, B. */

    lmin = max(real);

    lmax = min(real);

    forall r in 0..255 {

      for g in 0..255 {

        for b in 0..255 {

          var l, l_a, l_b : real;

          rgbpix_to_lab(r : c_uchar, g : c_uchar, b : c_uchar, l, l_a, l_b);

          if (l < lmin(r)) {

            lmin(r) = l;

            lminpos(r) = (r, g, b);


          if (lmax(r) < l) {

            lmax(r) = l;

            lmaxpos(r) = (r, g, b);


          /* Similarly for A, B. */




To get the absolute extrema we reduce over the arrays.  We want the value of l[min|max]pos corresponding to the minimum/maximum, and zip the value array with the position to extract the location.

    (limit, (rlim, glim, blim)) = minloc reduce zip(lmin, lminpos);

    writef("  L  %7.2dr (%3i %3i %3i)", limit, rlim, glim, blim);

    (limit, (rlim, glim, blim)) = maxloc reduce zip(lmax, lmaxpos);

    writef("  -  %7.2dr (%3i %3i %3i)\n", limit, rlim, glim, blim);

    /* Similarly for A, B. */

Compile and run this with

    make lab_limits_parv1


If you monitor your CPU you should see multiple cores in use.  The run time will also decrease.

Parallelism with Per-Pixel Storage (lab_limits_parv2)

The second version is conceptually cleaner.  Rather than tracking intermediate values for each pass of the parallel loop, we store all the points and then do a global reduce at the end to find the extrema and their position.  In effect we're doing an LAB conversion on a 256*256*256 image, without using the clrimage framework.  This costs memory to store the LAB values at each RGB point, and would not be practical for large data sets.

We'll define a domain cube for our RGB volume, and three arrays to hold the converted colors.  The loop to populate the arrays is simple

    const cube : domain(rank=3) = { 0..255, 0..255, 0..255 };

    var l, l_a, l_b : [cube] real;


    forall (r,g,b) in cube do

      rgbpix_to_lab(r : c_uchar, g : c_uchar, b: c_uchar,

                    l(r,g,b), l_a(r,g,b), l_b(r,g,b));

Since each array element is only written by one r, g, b combination, this is safe from race conditions.

The reduction is also simple

    (limit, (rlim, glim, blim)) = minloc reduce zip(l, cube);

    writeln("  L  %7.2dr (%3i %3i %3i)", limit, glim, blim);

    (limit, (rlim, glim, blim)) = maxloc reduce zip(l, cube);

    writeln("  -  %7.2dr (%3i %3i %3i)\n", limit, glim, blim);

    /* Similarly for A, B. */

Compile and run this with

    make lab_limits_parv2


The output will match the other two versions.


Timing results from 10 runs give

    lab_limits_ser     6.89 +/- 0.01 s

    lab_limits_parv1   2.98 +/- 0.02 s     improvement to serial: 2.3x

    lab_limits_parv2  11.51 +/- 0.03 s     degradation to serial: 0.6x

The programs were compiled without the --fast flag.  The small standard deviations show good repeatability.  All four CPU cores were used for the two parallel results.  The whole-image reductions in _parv2 dominate the performance; you can see the pause between the printing of each limit.  The code is certainly simpler than in _parv1, but it comes at a cost of extra memory and run time.  The _parv1 improvement over the serial version is smaller than we might expect; typically four threads give a speed-up of 3.0 - 3.5 X.  In comparison to the 1.11 (April 2015) release, the parallel versions have slowed by 10-15% whereas the serial version hasn't changed.

If we insert timers and look at the cost of the forall loop and the reductions we find (values in ms)

                        forall     reduce

    lab_limits_parv1     2956          2

    lab_limits_parv2     2704       8680

Doing the extra work in the loop in _parv1 adds 10% to the timing, but eliminates the cost of the reduction.  Note that there are six reductions in either program, so it is taking more than a second to do each over the whole cube.  This is despite the reductions themselves being done in parallel – each core remains occupied for the run of _parv2.

Task Parallelism (gaborbank)

Chapel provides three statements to define and launch tasks.  The body of each is wrapped in a function, much as forall does with externally declared variables being provided as arguments to that function.  The begin statement takes a single statement or block and executes it in a "fire-and-forget" fashion.  Chapel starts the task in the background and does not wait for it to finish.  The cobegin statement takes several statements and executes each as a task, waiting until all have finished.  The coforall statement treats its body as a task that is called for every iteration.  Unlike a forall loop that may or may not split its interactions into separate tasks, the coforall will do so.  Each of these keywords can be followed by 'with (<argument list>)' to give non-default intents to the outside variables, just like forall.

    begin run_gaborfilter(img, size, theta, sclx, scly, wavelen, phi_rad);

runs a Gabor filter asynchronously.

    cobegin {

      run_gaborfilter(img, size, 0, sclx, scly, wavelen, phi_rad);

      run_gaborfilter(img, size, pi/2.0, sclx, scly, wavelen, phi_rad);


runs two filters in orthogonal directions in separate tasks.  Whether these tasks run in parallel or serial depends on the computing resources available.

    coforall theta in 0..135 by 45 do

      run_gaborfilter(img, size, DEG2RAD(theta), sclx, scly, wavelen, phi_rad);

runs four filters in four separate tasks.  As with the other for variants, do is used if there is only one statement in the loop body; otherwise, use braces around the block.

A cobegin can be used when you have a known number of tasks to run in parallel when you're coding; a coforall is used when the amount is indeterminate.  We'll use coforall in the gaborbank program; to give an example of cobegin, let's try writing a parallel sort.  The Quicksort algorithm works recursively through an array.  At each step it picks an element and splits the array into two, those values that are smaller and those that are bigger.  This is called partitioning the array and its result is that the picked element has been placed correctly but the sub-arrays to either side are still unsorted. So, we recur on each of those – and this can be done in parallel.  The top level looks like:

    proc quicksort(data : [] real) {

      var q : int;         /* index of pivot */


      q = partition(data);

      cobegin {





which says we find the pivot point (which places the q'th biggest element at that index) and then recur on the sub-arrays formed by the slices.  This recursion happens in parallel thanks to the cobegin.

ex_quicksort.chpl implements the algorithm as presented in "Introduction to Algorithms" by Corman et. al.  The partition function is:

    proc partition(data : [] real) : int {

      const pivot = data(data.domain.last);

      var i = data.domain.first - 1;

      for j in {

        if (data(j) <= pivot) {

          i += 1;

          data(i) <=> data(j);


      i += 1;

      data(i) <=> data(data.domain.last);

      return i;

using the swap operator <=>.  The domain.first and domain.last members are the bounds of the sub-array, inclusive; there is also domain.low and domain.high which are limits of the range and can differ from first and last if the alignment is not zero.

The example program populates an array with random numbers, calls quicksort, and then checks that the values do indeed increase across the array.  Compile and run it with

    make ex_quicksort [CHPLFLG=”-s nelt=#”]


You can change the size of the array with the nelt config param, passed to the compiler with the make variable CHPLFLG.

(If you were to compile this program and run it for a moderate sized array, say of 100,000 elements on our machine, you'd find your console filling with warning messages about memory allocation failures.  This seems to be an artefact of the Chapel runtime, since the program will run to completion and pass its verification.  For larger nelt you will find it eventually segfaults.  quicksort implementations often use a simple insertion sort once the sub-array size gets below some threshold, say 8 or 16 elements; for these small arrays the inefficiency of the insertion sort isn't a problem, and wins over the recursion.  The example program contains a config param --use_insert to turn this behavior on or off and --len_insert to set the threshold.  Using the insertion sort decreases the number of allocations the runtime does and increases the array size at which the warnings appear.  Another option is to compile with CHPLFLG=--fast which turns off the runtime checks and the allocation failure messages.)

In a way cobegin and coforall are convenience statements that handle the synchronization of all threads at the end of their block.  The sync command can also do this; it takes a block of statements that must complete before execution continues to the next command.  This differs from a cobegin in that any task that is created within the scope of the sync, even in subroutines, must finish, while the cobegin only requires the statements directly within its block must.

    sync {

      begin run_gaborfilter(img, size, 0, sclx, scly, wavelen, phi_rad);

      begin run_gaborfilter(img, size, pi/2.0, sclx, scly, wavelen, phi_rad);


The serial command can block a section of code from launching new tasks if a condition holds.  This means that the section will execute serially.  You can use this to avoid the overhead of launching tasks if there are too few of them.  The form of the statement is

    serial <condition> {

      /* commands, possibly with begin/cobegin/coforall */


or, if there's only one statement that follows, you can replace the braces with do.  If you omit the test, then Chapel assumes it has the value true and will always execute the block serially.

As with forall, return, break, and continue statements are not allowed inside a concurrent block for any of these commands.

Gabor Filter Bank (gaborbank)

Because the Gabor filter responds most strongly to lines at a given angle, we often want to run several filters with different rotations on the same image. We call this a bank of filters, and the angular resolution depends on the filter's parameters, mostly size and scale.

We want to modify our filter from the last section to generate the bank. Without changing the convolution routine itself, we can take advantage of task parallelism do run the filters in parallel.  The modified program is called gaborbank, and it can be compiled with

    make gaborbank

It replaces the --theta command line argument with --nrot, the number of filters in the bank. Filters will be spaced evenly over 180 degrees; the angle wraps, so that from 180 to 360 you get an inverted response.  We also delete the --outname argument and will save each result as "bank_<rot>.png", where <rot> is the filter rotation in degrees.

Most of the main procedure in the previous version has been pulled into a procedure called filter_and_save().  Here we can allocate a new image to store the result independently of other tasks, run the filter, convert the result to 8-bit, and save it to disk.  We pass all the global parameters through to the function so it is self-contained.  The theta parameter is now passed in degrees; for clarity, we've renamed phi_rad to emphasize it's in radians.

    proc filter_and_save(img : clrimage, size : int, theta : int, sclx : real,

                         scly : real, wavelen : real, phi_rad : real) : int {

      const theta_rad = pi * theta / 180;

      var gabor : clrimage;

      var grey : rgbimage;

      var gabor2grey : conversion;

      var outname : c_string;

      var retval : int;


      gabor = new_clrimage(img);

      run_gaborfilter(img, gabor, size, theta_rad, sclx, scly,

                      wavelen, phi_rad); = rgbconvert.CENTER;

      gabor2grey.plane = clrplane.C1;


      retval = display_color(gabor, grey, gabor2grey);

      if (retval < 0) then return retval;


      /* This makes a 3-digit angle without using something like sprintf. */

      if (theta < 10) then outname = "bank_00" + theta + ".png";

      else if (theta < 100) then outname = "bank_0" + theta + ".png";

      else outname = "bank_" + theta + ".png";


      retval = PNG_write(outname, grey, CLR_R);



      return retval;


Our main() becomes very simple then.  It reads the input PNG, does the conversion for the L or Y plane, and then uses a coforall to generate the bank.

    proc main() {

      var rgb : rgbimage;

      var clr : clrimage;

      var retval : int;


      retval = PNG_read(inname, rgb);

      cleanup_onerr(retval, png);


      retval = rgb_convert(rgb, grey, space);

      cleanup_onerr(retval, rgb);


      coforall bank in 0..nrot-1 {

        const theta = (bank * 180) / nrot;

        filter_and_save(img, size, theta, sclx, scly, wavelen, phi_rad);





And that's it.  If you run this program, you'll see that each core of your CPU is occupied.  If you cycle through the filter responses you'll notice edges strengthen and weaken as it rotates, with vertical edges strongest for angles near 0 degrees and horizontal for 90 degrees.  You may see that the filters are not run in order because there is no guarantee about when the coforall iterations will fire.  This might not be a good use of coforall if the number of banks is more than the number of cores to run each thread on.  A forall would be a better choice in this case.  



The Chapel language provides synchronization variables and atomic operations.  Atomic operations allow multiple tasks to safely interact when reading or changing a value by limiting the operations that can be done and ensuring these cannot be interrupted.  We'll talk about atomic variables in the next section.  Synchronization variables are used to communicate between tasks by forcing one to wait until it can access the value (for a read or write).  They come in two flavors.  The first, sync, is indicated by placing the keyword before the type in the variable declaration

    var done$ : sync bool;

where the trailing $ on the name is traditional.  sync variables normally operate in a read-write-read-write cycle, where a task will wait until the previous action occurs.  The second are single variables which offer a write-once-read-many functionality.

    var fired$ : single bool;

Chapel pairs a flag with the value of the variable.  It tracks whether the variable has a value – is "full" – or does not – is "empty".  sync normally requires that the flag be empty before a write or full before a read to ensure the proper cycling; if it is not correct, then the task will wait. single does not place the requirement on the read, so there is no way to clear the flag for a second write; any read will block until the write occurs.  Compound operators like += perform first a read, then a write, so they can be used safely with sync variables.

Any boolean, integer, real, or imaginary variable can be used for synchronization.  Strings and complex types cannot.  You cannot declare a structured type as sync or single, but you can declare members to be. Similarly you cannot say that an array is a sync or single, but the array can contain synchronized elements.

    var locks$ : [1..n] sync int;

We said 'normally' a couple times because there is low-level access to methods on these variables that can override the behavior.  Using them in expressions or assignment statements calls down to the default methods.  The method names encode the flag state at the beginning and end of the operation.



read the variable when flag full then clear the flag


sync, single

read the variable when flag full, leave flag unchanged


sync, single

assign value when flag empty, then set the flag



assign value when flag full, leave flag unchanged


sync, single

peek with non-blocking read, leave flag unchanged



non-blocking write, then set the flag if empty



assign default value and clear the flag, non-blocking


sync, single

non-blocking test if flag is set


Normal access to a sync variable is with readFE() and assignment is with writeEF(). Normal access to a single variable is with readFF() and assignment with writeEF().

We'll see an example of a sync variable in use in the second version of the FAST corner detector.


We've converted two programs now into parallel versions.  The first used data parallelism, where we carved up the data to analyze via a forall loop, analyzed each piece separately, and combined the results at the end.  As long as the intermediate values are kept separate for the loop variable, our calculation is safe, although some re-organization of the code was needed to do this.  Our second program used the existing convolver as a black box and instead divided the problem into independent analysis tasks.  No reduction step was necessary, and the only code change was to place the work into a procedure and then use a coforall to call it. We've learned:

Next we'll build on this experience and write another parallel program that uses k-means clustering to quantize the colors in an image.


1. Create a version of gaborbank that runs the filter at perpendicular angles.  Since there are only two calls to the convolver, instead of a variable number, use a cobegin.  Allow the rotation of one of the angles to be specified on the command line.

2. Modify the color conversions to run in parallel.

3. Modify the Gabor filter convolver to run in parallel.

4. A pyramid is a series of images of decreasing resolution, where the image at level p has half the size (width and height) as at p-1 and whose pixels are an average of the four corresponding below it.  That is,

      I(p,x,y) = (I(p-1,2*x,2*y) + I(p-1,2*x+1,2*y) + I(p-1,2*x,2*y+1) +

                  I(p-1,2*x+1,2*y+1)) / 4;

A pyramid is often used to find interesting features at coarse resolution (which can be done quickly) and to translate this into an area to explore in more detail at lower levels.  Write a program that calculates each level of the pyramid.  How would you divide the work to be done in parallel?


A tarball with the material for this section is here.  A zip file is here.

A PDF of this text is here.