Convolutions are an essential part of low-level image processing. They represent a local neighborhood calculation: at each pixel we sum a weighted contribution from those surrounding it. The weights are placed in a kernel, a size x size matrix. Think of a convolution as shifting this kernel over an image. At each point we multiply the pixels by the kernel values that lie atop (treating weights outside the kernel as 0, ie. these pixels are ignored) and sum when done. A convolution can be a low-pass filter, emphasizing broad features in the image. If the kernel elements are all 1/(size*size), then we compute the local average value. A Gaussian filter has a bell-shape. Small low-pass filters are often used to reduce noise in an image, large filters effectively lower the resolution. A convolution can also be a high-pass filter, emphasizing local changes. These might indicate edges, or steps in brightness. The Gabor filters are tunable and can belong to either class.
In this section we write a convolver with a programmable kernel. If it were fixed, then we could optimize the program by pre-calculating the memory accesses and arithmetic operations, but for this approach we will need to loop over the kernel. This introduces us to Chapel's concept of subdomains and subranges, and we'll look at a few different ways to express the computation.
The directory for this section is gabor. You'll be using the C routines from the first section for reading and writing, and the color conversion procedures from the second for generating greyscale values. Use make name to build the programs, where the name is without a .chpl suffix. The bin sub-directory will contain executables, the build sub-directory intermediate files. The sample image for this exercise is francnotch.png. It has foreground objects with strong edges at different angles.
The programs developed in the section are in a tarball here (as zip file). A PDF copy of this text is here.
There are two types of edge detectors. Differential operators look at the gradient, or first derivative, of intensity across an image. An edge is found where the gradient is large, either positive or negative. The size of the kernel determines the breadth of edges that will be marked. Differential operators are usually balanced so that their response to a uniform field is 0. They tend to be sensitive to noise, although an optimal filter that combines the differentiation with smoothing can be constructed given a model of the noise. Zero-crossing operators, the second kind, look at the second derivative of the intensity. For a linear step they yield 0 at the edge. This point is more robust as edges get broader; a differential operator will instead see a smaller peak response. They combine a smoothing response with the differentiator to perform better in noise.
Simple differential detectors are the Sobol and Prewitt kernels. These are for vertical edges or steps over x. Rotate them by 90 degrees for horizontal edges or steps over y. Teh values in the kernels are different approximations of the differentiation.
As an example, let's apply the Sobol operator to two pixel neighborhoods, one with a step and one that's uniform with noise. Remember that we aren't doing a matrix multiplication, rather a pixel-by-pixel multiplication followed by a sum. We're ignoring the outer rows and columns as the kernel doesn't full cover them. Usually the result is compared against a threshold, with pixels whose absolute sum exceeds it being marked as edges. In the example the threshold is set at 10, and only the pixels at the step are marked.
The filter applied to an image captures the images, although there are many false edges, many are missed, and some diagonal lines appear in both directions. More sophisticated strategies than a binary threshold are often used to improve the result. A Canny edge detector, for example, uses a high threshold to select only the strongest edges as seeds, and then extends them into adjacent pixels that are above a lower threshold.
The Laplacian of Gaussian is a common zero-crossing detector. The Gaussian is a smoothing function.
This is a normal distribution where s is the standard deviation, assumed equal for both axes, and the mean is 0. The factor 1/(2*pi*s**2) is to normalize the integral. The Laplacian is the differentiator
Together they give
For example, for s = 2.0 and an 11x11 kernel, after normalizing the matrix values so they sum to 1.0 and scaling so that the center value is 100, our kernel becomes
and for s = 1.5
Notice that the kernels are symmetric, but that they do not sum to 0. This means that the output signal will have a bias. This can be removed by subtracting a local average. To use a zero crossing detector we compare each pixel after the convolution with its neighbors. If the signs in a pair differ, then there is a zero between them that can be marked as an edge.
A non-symmetric version of this filter offers some interesting possibilities. If the central region does not form a dot then the filter will respond to elongated features. If it is not rotationally symmetric, then it responds to oriented features. The Gabor filter modifies the Gaussian. It first introduces separate scaling parameters (sigmas) for each axis. If the scales are the same the central region is a dot, but as they grow unequal it distorts into an oval. Second, it rotates the coordinate plane through an angle; this is just a rigid mapping on the x and y coordinates. Finally, it multiplies the exponential with a sinusoid with a wavelength and phase offset. This introduces lobes that can be symmetrically or anti-symmetrically placed along or aside the central axis.
The filter takes five parameters: theta, the angle of rotation; sclx and scly, the sigmas for the exponential; lambda, the wavelength of the sinusoid (which will be called wavelen in the code since lambda is a reserved word); and phi, the phase of the sinusoid.
where we have dropped the normalization constant.
A heat map shows the strength of the kernel components, with dark being negative, red neutral, and white positive. These maps show the unrotated filter and at 30 and 60 degrees for the symmatric and asymmetric cases. Because the sinusoid is a cosine along x, if sclx > scly an angle of 0 degrees is oriented vertically.
Our programs in this section will implement this function and its convolution with an image. In practice we can reduce the number of parameters. We need scly >> sclx to stretch the response, with sclx = 2.8 being a common figure for Gaussians and scly being on the order of the 0.5 - 1.5 X the filter size. The size determines the angular resolution, with 13x13 being the minimum if we want to vary theta by 10 degrees; 23x23 gives the best resolution before the gains taper off. phi is either 0 for the symmetric case or 90 for the asymmetric (ie. the cosine becomes a sine). The wavelength is about the size of the filter to detect steps. If it is half or smaller you'll see a positive and negative lobe on each side of the center line and the filter will respond to thin lines.
The plot_kernel.chpl program generates these heat maps. You might find it helpful for visualizing the kernel, although we won't discuss it. The logic is fairly straightforward: scale the coordinates to the filter space, evaluate the filter at each pixel and color code the result. The program takes each of the parameters as command-line arguments, ie.
plot_kernel --theta=<degrees> --sclx=<#> --scly=<#> --wavelen=<#>
--phi=<degrees>
Reasonable defaults have been provided. The --outname argument lets you specify the PNG file to create (the default is filter.png). You can compile the program with
make plot_kernel
The heat maps above were based on a 13x13 kernel drawn/sampled over a 700x700 image, with sclx = 2.8, scly = 4.0, and a wavelength of 5.0. The last two parameters differ from our preferred values of scly = 6.0 and wavelen = 12.0 and were chosen to demonstrate the kernel's features clearly.
In the previous section we briefly mentioned slicing and subranges, where you define a second range as a sub-set of a first. The same can be done for domains. Besides the clarity of intent, the main benefit subdomains offer is the provable validity of array accesses, which may let us avoid bounds checks or help the compiler with optimization, although neither is currently done by Chapel.
Just as a subrange is created by putting a second range specification in brackets or parentheses after the base, so too domains. Provide one range for each rank in the domain. The result is an intersection of the two. Here is where open-ended ranges, ones without either a start or and end, become useful, as they don't constrain one side of the base. The result will be a range with one endpoint taken from the base and the other from the second, whichever end was known.
Say we have a 2D domain representing an image array
Aimg : domain(rank=2) = { 0..ncol-1, 0..nrow-1 };
Let a kernel have a "diameter" of size x size, where size is odd. The radius r = (size-1)/2, which means that it extends r pixels on either size of a center point (x, y):
Akernel : domain(rank=2) = { x-r..x+r, y-r..y+r }
Then we can form a subdomain of Aimg that is big enough for the kernel. We can do this either by repeating the subrange
Aconv = Aimg[x-r..x+r, y-r..y+r]
or by using the kernel area to provide the subrange
Aconv = Aimg[Akernel]
Since the second version doesn't repeat the subranges, it's the better choice.
Chapel provides many ways of working with domains, ranges, and arrays since they're such fundamental data structures. Let's look at examples for each. The source can be found in ex_rangeop.chpl. Compile and run it with
make ex_rangeop
bin/ex_rangeop
To demonstrate subranges and see the effect of steps, alignments, and counts, we'll start with a base range that is a simple sequence from 1 through 10.
var rng1 = 1..10;
We can use a for expression to print the members
writeln("base range ", rng1, " = ", for i in rng1 do i);
> 1..10 = 1 2 3 4 5 6 7 8 9 10
If we apply a subrange with an open start to rng1 we effectively change the end point
writeln(rng1[..8], " = ", for i in rgb1[..8] do i);
> 1..8 = 1 2 3 4 5 6 7 8
Similarly, an open end would only change the start point. A new step applies to the sub-sequence, and a new alignment can shift the first generated value.
writeln(rng1[2.. by 3], " = ", for i in rng1[2.. by 3] do i);
> 2..10 by 3 = 2 5 8
writeln(rng1[3.. by 3 align 2], " = ", for i in rng1[3.. by 3 align 2] do i);
> 3..10 by 3 align 2 = 5 8
Here the align 2 would make 2 be the first generated value (ie. 2, 5, 8, 11, ...; also 2, -1, -4, -7 ...), which is incremented by the step until it falls within the start and end points.
If the subrange has no start point, then its alignment is unknown and this propagates to the result. A bad alignment means the range cannot be used for iteration, and the program will halt with an error message. Therefore, in this next example we cannot use the for expression to print values, because they cannot be generated. If we provide the alignment, however, then it works.
writeln(rng1[..8 by 3]);
> 1..8 by 3
writeln(rng1[..8 by 3 align 2], " = ", for i in rng1[..8 by 3 align 2] do i);
> 1..8 by 3 align 2 = 2 5 8
If we provide a count of the number of elements to generate, then this is done on the subrange. The order of the qualifiers is important, as they are applied from left to right. In the first example we step by 2 through the three counted elements 3, 4, 5, giving 3 and 5. In the second we step by 2 to the original end point, so 3, 5, 7, 9, and take the first three of those.
writeln(rng1[3.. # 3 by 2], " = ", for i in rng1[3.. # 3 by 2] do i);
> 3..5 = 3 5
writeln(rng1[3.. by 2 # 3], " = ", for i in rng1[3.. by 2 # 3] do i);
> 3..10 = 3 5 7
We can also do addition and subtraction to change the start, end, and align parameters.
writeln(rng1 + 2, " = ", for i in rng1+2 do i);
> 3..12 = 3 4 5 6 7 8 9 10 11 12
writeln(rng1 - 2, " = ", for i in rng1-2 do i);
> -1..8 = -1 0 1 2 3 4 5 6 7 8
Although we haven't said it explicitly, the range bounds can be negative. We'll use this when defining the Gabor kernel.
Range objects support many methods for adjusting their parameters and testing for membership. The members themselves are accessible: low the start point, first the first generated element, high the end point, last the last generated element, stride the step, alignment the number that must be part of the unbounded sequence, and length or size for the number of values the sequence will generate.
The member() method tests if a range generates a value or if all values of a subrange are included in the base.
writeln(rng1.member(9));
> true
writeln(rng1.member(11));
> false
writeln(rng1.member(2..5));
> true
writeln(rng1.member(2..11));
> false
writeln(rng1.member(2..));
> false
writeln(rng1.member(..8));
> false
For the methods that adjust the range, let's use another base sequence where we've provided a stride and alignment.
var rng2 = 1..20 by 4 align 3;
writeln(for i in rgb2 do i);
> 3 7 11 15 19
The expand() method adds its argument to the end point and subtracts it from the start. You can see this in how Chapel prints the range to the left of the equals sign.
writeln(rng2.expand(2), " = ", for i in rng2.expand(2) do i);
> -1..22 by 4 = -1 3 7 11 15 19
The exterior() method creates a new range that begins immediately after the base and whose end point is shifted by the positive argument. If the argument is negative then the new range lies before the base and the start point shifts.
writeln(rng2.exterior(3), " = ", for i in rng2.exterior(3) do i);
> 21..23 by 4 align 3 = 23
writeln(rng2.exterior(-3), " = ", for i in rng2.exterior(-3) do i);
> -2..0 by 4 align 3 = -1
That is, for a positive argument modifying the base range start..end we have a new series (end+1)..(end+arg), and the stride and alignment do not change. For a negative argument the new series runs from (start-arg)..(start-1). We have in effect added the points just outside the base range, as you can see in the change in the start and end points. Because we have an alignment restriction, though, we generate only one of the three values.
The interior() method with a positive argument picks that many values at the end of the sequence, or at thes tart with a negative argument. That is, the new series becomes end-arg+1..end and start..start+arg-1, respectively, and again the stride and alignment do not change.
writeln(rng2.interior(3), " = ", for i in rng2.interior(3) do i);
> 18..20 by 4 align 3 = 19
writeln(rng2.interior(-3), " = ", for i in rng2.interior(-3) do i);
> 1..3 by 4 align 3 = 3
As with the exterior example, we're taking the three innermost points at either end but the alignment condition causes only one of them to be generated.
The offset() method changes the alignment. The argument is added to the current alignment, and the result is taken modulo the step.
writeln(rng2.offset(3), " = ", for i in rng2.offset(3) do i);
> 1..20 by 4 align 2 = 2 6 10 14 18
writeln(rng2.offset(-3), " = ", for i in rng2.offset(-3) do i);
> 1..20 by 4 align 0 = 4 8 12 16 20
For the first example the new alignment is (3 + 3) % 4 = 2, and for the second (3 - 3) % 4 = 0. Although 0 is outside the start and end values, it is the anchor for the series of step 4.
The translate() method shifts the whole sequence: start, end, and alignment.
writeln(rng2.translate(3), " = ", for i in rng2.translate(3) do i);
> 4..23 by 4 align 2 = 6 10 14 18 22
writeln(rng2.translate(-3), " = ", for i in rng2.translate(-3) do i);
> -2..17 by 4 align 0 = 0 4 8 12 16
In general we access the indices that are part of a domain by simply using the domain's name. We can do this in a for loop or one of its variants (forall, coforall) and the loop will iterate over each index tuple that is part of the domain. Let's set up a small domain to see this. Again we can use a for expression to print the its members. You'll see that the second rank varies the most; it is the innermost loop.
dom1 = { 1..3, 4..6 }
writeln(dom1, " = ", for i in dom1 do i);
> {1..3, 4..6} = (1, 4) (1, 5) (1, 6) (2, 4) (2, 5) (2, 6) (3, 4) (3, 5) (3, 6)
Chapel will automatically iterate a domain over all its values if you provide it as an argument to a function that takes a tuple to hold its indices. The return value of such a function can be assigned to an array. The argument needs to be defined either as a tuple, fn(ind : 2*int), or destructured but with the type itself as a tuple, fn((x, y) : (int, int)).
proc domfn1((x, y) : (int, int)) {
writeln(x, " + ", y, " = ", x+y);
return x+y;
}
var arr1 : [dom1] int;
var arr2 : [dom1] int;
arr2 = domfn1(dom1);
> 2 + 4 = 6
> 3 + 4 = 7
> 2 + 5 = 7
> 3 + 5 = 8
> 2 + 6 = 8
> 3 + 6 = 9
> 1 + 4 = 5
> 1 + 5 = 6
> 1 + 6 = 7
writeln("arr2");
writeln(arr2);
> arr2
> 5 6 7
> 6 7 8
> 7 8 9
Note that the write functions support a pretty-printer for arrays. All pairs of x and y indices are passed to domfn1 (although not necessarily in order, since Chapel automatically makes the calls in parallel), and the sum of the coordinates is assigned to the corresponding element of array arr2.
We said that you cannot use ranges with strides in domains. Instead, you can give the domain itself a stride using the by keyword. The stride applies to all ranks.
writeln(dom1 by 2, " = ", for i in dom1 by 2 do i);
> {1..3 by 2, 4..6 by 2} = (1, 4) (1, 6) (3, 4) (3, 6)
Similarly, you can specify the count per range. This takes a tuple for each rank.
writeln(dom1 # (2, 1), " = ", for i in dom1 # (2, 1) do i);
> {1..2, 4..4} = (1, 4) (2, 4)
As with ranges, domains support the expand(), translate(), interior(), and exterior() methods. All return new domains, which means they can be chained together. The arguments are generally tuples that represent the changes along each rank. We'll start by defining a new base domain before applying them.
var dom2 = { 1..10, 20..30 };
writeln(dom2);
> {1..10, 20..30}
writeln(dom2.expand(5,3));
> {-4..15, 17..33}
writeln(dom2.expand(5,3).expand(-2,-1))
> {-2..13, 18..32}
writeln(dom2.translate(4,5))
> {5..14, 25..35}
writeln(dom2.exterior(2,3))
> {11..12, 31..33}
writeln(dom2.exterior(-2,-3))
> {-1..0, 17..19}
writeln(dom2.interior(2,3))
> {9..10, 28..30}
writeln(dom2.interior(-2,-3))
> {1..2, 20..22}
You can find these examples in ex_domainop.chpl. Compile and run them with
make ex_domainop
bin/ex_domainop
The domain underlying an array can be used as an iteration source for loops or iterated operations, as we saw above. The loops generate the values of the array, not the indices. Use array.domain to get the indices. Slicing also applies.
Arrays can be operated upon by scalar values, in which case the operation applies at each index. Arrays of the same size and indices may also be combined or set equal, which copies the values over.
var dom1 = {1..3, 4..6};
var arr1 : [dom1] int;
var arr2 : [dom1] int;
var arr3 : [dom1] int;
arr1 = 3;
writeln(arr1);
> 3 3 3
> 3 3 3
> 3 3 3
arr1 += 5;
writeln(arr1);
> 8 8 8
> 8 8 8
> 8 8 8
arr2(1,4) = 1; arr2(1,5)=2; arr2(1,6)=4;
arr2(2,4) = 2; arr2(2,5)=4; arr2(2,6)=8;
arr2(3,4) = 1; arr2(3,5)=8; arr2(3,6)=2;
writeln(arr2);
> 1 2 4
> 2 4 8
> 1 8 2
arr3 = arr1 / arr2;
writeln(arr3);
> 8 4 2
> 4 2 1
> 8 1 4
arr3 = for (x, y) in arr3.domain do x+y;
writeln(arr3);
> 5 6 7
> 6 7 8
> 7 8 9
We can declare a variable as an alias to an array using the => operator in place of the type. The domain declaration for the alias is mapped to the domain of the source, and must be smaller than the source. For example, if we want to shift a kernel centered about 0 to a point (x,y) in an image we can do
var kernel : [-r..r, -r..r] real;
var shift [x-r..x+r, y-r..y+r] => kernel[-r..r, -r..r];
The indices for shift are now centered about (x, y). We'll use this in the fifth example. Internally Chapel uses the function reshape(A: Array, D: Domain) to return a copy of A re-indexed to the domain, which must have the same number of elements/indices.
You'll find these samples in ex_arrayop.chpl. Compile and run them with
make ex_arrayop
bin/ex_arrayop
We're now ready to code our filter. We'll try several different approaches to looping and defining subdomains to see which feels best.
To facilitate working with subdomains, we've added an extra constructor to the clrimage class, naming the file ip_color_v2.chpl. This version takes an existing image and copies the domain but not the data. This way we'll be sure that both access the same underlying representation. We'll use it so the input image and convolution result have the same domain. This constructor overloads the previous, and Chapel will call the right version depending on whether you provide two integers as arguments, the width and height, or another image.
class clrimage {
var ncol : int; /* width (columns) of image */
var nrow : int; /* height (rows) of image */
var npix : int; /* number pixels = w * h */
var area : domain(rank=2); /* array bounds by x and y */
var c1 : [area] real; /* first color plane (L, H, Y) */
var c2 : [area] real; /* second color plane (A, S, U) */
var c3 : [area] real; /* third color plane (B, V, V) */
proc clrimage(w : int, h : int) {
ncol = w;
nrow = h;
npix = w * h;
/* This automatically resizes the arrays. */
area = {0..ncol-1, 0..nrow-1};
}
proc clrimage(orig : clrimage) {
ncol = orig.ncol;
nrow = orig.nrow;
npix = orig.npix;
area = orig.area;
}
}
The general flow of the program is the same for all versions. It starts by sanity checking the command-line arguments and converts the two angle parameters in degrees to radians. These are constants set at runtime. It then reads the PNG file and generates a greyscale image using either the LAB or YUV space. It next runs the Gabor filter, creating the size x size kernel and convolving it across the image. Finally it scales the result back to an 8-bit image for display and writes it to disk.
The only procedure that changes between the versions is run_gaborfilter().
You'll notice reading the source code that we're using the Time module. This provides timers with up to microsecond resolution for testing how long parts of the program take to run. We'll use the times to compare the five approaches in the Performance section. You simply call start() and stop() methods on each timer, and use elapsed(TimeUnits.milliseconds) to get the interval between them.
Compile the programs with
make gabor_v[123456]
The program takes many possible command line arguments:
--inname=<file> PNG file to read
--outname=<file> PNG file to write result to
--space=[LAB|YUV] which color space to use for greyscale conversion
--size=<int> kernel diameter, must be odd
--theta=<real> rotation angle of filter, in degrees
--sclx=<real> scaling parameter for x axis
--scly=<real> scaling parameter for y axis
--wavelen=<real> wavelength of sinusoid
--phi=<real> phase offset of sinusoid, in degrees
Only --inname and --outname must be provided, the others have default values. A typical run with the default filters would be
bin/gabor_v1 --inname=francnotch.png --outname=g1.png
By the way, since there are so many parameters, Chapel has the option of placing them in a file and using it from the command line. Say config.vals contains
size=15
theta=45.0
wavelen=14.0
Then you can pass the file with the -f flag, not needing to put these three on the command line.
bin/gabor_v1 -f config.vals --inname=francnotch.png --outname=g1.png
Our first convolver is a straightforward sum using index arithmetic to center the kernel over each pixel. We define two domains. The first is for the kernel
const Akern : domain(2) = { -r..r, -r..r };
var kernel : [Akern] real;
where r is the radius or half-width and the domain is centered at (0, 0). The second is for the interior of the image so that the kernel falls completely inside.
const Ainside = img.area.expand(-r, -r);
where img is the greyscale input. We could have also used a subrange
const Ainside = img.area[r..img.ncol-r-1, r..img.nrow-r-1];
but the expand method better explains the intent of what we're doing.
We set up the kernel
gabor_kernel(theta=theta, sclx=sclx, scly=scly, wavelen=wavelen, phi=phi,
kernel = kernel);
which populates the kernel array following the Gabor function definition. The result is initialized to 0 using an implicit iteration over the array's bounds
gabor.c1 = 0.0;
and then we have a nested loop. The outer for covers each pixel in the inner subdomain, while the inner covers each kernel element and the underlying pixel
for (x, y) in Ainside {
for (ki, kj) in Akern do
gabor.c1(x,y) += img.c1(x+ki,y+kj) * kernel(ki,kj);
}
Because we can use negative values in our ranges and have centered the kernel at (0,0), the source pixel in c1 is just the center pixel shifted by the kernel index. We only use the c1 plane because both the LAB and YUV transforms put the luminance image there.
One danger of this approach is that we can generate illegal indices for img.c1 if the kernel ever goes outside the image. In other words, we depend on defining Ainside correctly. If we contracted img.area by less than r, then either x+ki or y+kj would go out of bounds, and the program would crash. We'll next use subdomains to make sure this can't happen.
Our second version will slice the image area by the kernel at each pixel before starting the inner loop. This guarantees that the indices will be valid. Our definitions for the Akern and Ainside domains don't change, nor does the kernel array definition.
const Akern : domain(2) = { -r..r, -r..r };
var kernel : [Akern] real;
const Ainside = img.area.expand(-r, -r);
We'll set up a separate subdomain for the convolution as a variable since we'll be changing it at each pixel
var Aconv : subdomain(img.area);
We need two sets of indices when we do the multiplication in the inner loop, one in the image and one in the kernel. The way to combine multiple iterators into one is with a zipper. It takes any number of ranges or iterators and with each step of the loop gets the next value of each, returning all in a tuple. Each component must have the same length.
zip(Aconv, Akern)
will produce one matching index (which is a tuple) from each domain, and then another, and so on. The zipper generates a tuple of two tuples, so we'll use destructuring to pull out the indices.
First we need to define the Aconv subdomain for each pixel. This is just a slice of the image area. Then we generate the coordinates and indices and carry out the multiplication.
for (x, y) in Ainside {
Aconv = img.area[x-r..x+r, y-r..y+r];
for ((xc,yc), (ki,kj)) in zip(Aconv, Akern) do
gabor.c1(x,y) += img.c1(xc,yc) * kernel(ki,kj);
}
Although it seems that we can still get the Aconv subdomain wrong, this is actually safe. We are no longer doing arithmetic on the array indices and the subdomain guarantees that (xc, yc) will always be in bounds. If the subranges are wrong then the program will halt with the message that the zippered iterations have different lengths.
The third version replaces the Aconv area with a translation of the kernel to the new pixels. Remember that translations create new copies of the domain, so they don't affect the original. We then zipper the two domains together so that pixel coordinates match the overlying filter, multiply, and add.
for (x, y) in Ainside {
for ((xc,yc), (ki,kj)) in zip(Akern.translate(x,y), Akern) do
gabor.c1(x,y) += img.c1(xc,yc) * kernel(ki,kj);
}
The fourth variant starts with the zipper of the second. You may remember back in the beginning of this section we said that a convolution was a multiplication of every pixel with its kernel element, followed by a sum of those products. This is a reduction, and we'll see more of them in the next section when we talk about parallel programming. A reduction takes a set of values and returns one; in this case, a sum. The reduce command is an expression with the form
<operation> reduce <data>
This variant splits the calculation into these two steps. It uses the same framework as the second variant
const Akern : domain(2) = { -r..r, -r..r };
var kernel : [Akern] real;
const Ainside = img.area.expand(-r, -r);
var Aconv : subdomain(img.area);
and defines an intermediate array to hold all the products
var prod : [Akern] real;
Our inner loop will populate this array, and once it's done we perform the reduction.
for (x, y) in Ainside {
Aconv = img.area[x-r..x+r, y-r..y+r];
for ((xc,yc), (ki,kj)) in zip(Aconv, Akern) do
prod(ki,kj) = img.c1(xc,yc) * kernel(ki,kj);
gabor.c1(x,y) = + reduce prod;
}
We shouldn't expect this to be the fastest variant, because the reduction is a second pass through the kernel. However, Chapel will automatically perform it in parallel.
Strangely enough, the output of this version differs from _v1 and _v2 at 68 pixels located in the middle of a cloud in the upper right corner; the per-pixel difference is 1 between the images. It appears to be a floating point error, where at one of the pixels, (1030,14), gabor_v1 gives a sum of -1.74e-13 and gabor_v3 9.09e-13. This is enough to flip a bit when doing the 8-bit conversion.
In the fifth version we'll use array aliasing to shift the indices of the kernel array so it aligns with the image grid. We can then address the kernel with image coordinates. The domain and array definitions are the same as with gabor_v1.
const Akern = domain(2) = { -r..r, -r..r };
const Ainside = img.area.expand(-r, -r);
var kernel : [Akern] real;
We'll define a variable inside the loop with the alias.
for (x, y) in Ainside {
const shifted : [x-r..x+r, y-r..y+r] => kernel[Akern];
for (ki,kj) in img.area[shifted.domain] do
gabor.c1(x, y) += img.c1(ki,kj) * shifted(ki,kj);
}
In effect the alias appears to move the kernel above the destination pixel. We first tried doing this by shifting the domain using translate(), but the problem is that that's a "translation by", that is, an incremental shift in the ranges from the previous position. Since we can't be sure what our previous pixel was, we would need to do some math on the existing range compared to the desired to get the delta. What we would really like is a "translation to" method.
For the final version we'll try to avoid array indices altogether to do the multiplication. As with the previous version we create an alias called shifted for the kernel that is centered on the current pixel. We will also slice the pixel array over the same area. Since the two arrays now are aligned, we can multiply them directly, letting Chapel iterate over both domains element by element. We'll then do a reduction on the result.
const Akern = domain(2) = { -r..r, -r..r };
const Ainside = img.area.expand(-r, -r);
var kernel : [Akern] real;
for (x, y) in Ainside {
const shifted : [x-r..x+r, y-r..y+r] => kernel[Akern];
const overlay = img.c1[x-r..x+r, y-r..y+r];
gabor.c1(x,y) = + reduce (shifted * overlay);
}
As with the fourth version, the output differs at 68 pixels from the others. Defining overlay as a subdomain of c1 using shifted does not compile
const overlay = img.c1[shifted]; /* no good */
If you've been running these programs as we go along, you will have noticed that the run times vary dramatically. The table gives the average time in milliseconds for the key routines.
| rgb_convert | run_gaborfilter | display_color | |
v1 | 646 | 28525 | 156 | index math |
v2 | 647 | 61609 | 154 | domain + zipper |
v3 | 646 | 61237 | 155 | translate + zipper |
v4 | 647 | 166386 | 156 | zipper + reduction |
v5 | 649 | 36466 | 156 | array alias |
v6 | 650 | 186218 | 156 | array multiplication |
These runs were made on the francnotch.png image with the default parameters. The consistency of the LAB conversion at the start and 8-bit conversion of the results gives us confidence in these measurements, as these routines don't change between program versions.
The preliminary conclusion, without going deeper into the compiled code (which is beyond our goals for this exercise), is that zippering iterators hurts performance and that the reductions in gabor_v4 and gabor_v6 are especially costly, despite Chapel doing it internally in parallel. The extra domain created in _v2, compared to the translation in _v3, has been optimized away, leaving the zipper as doubling the time to do the convolution. The simplest access pattern of using a single set of indices and offsetting them into the image array, either directly by calculation in gabor_v1 or indirectly by array aliasing in gabor_v5, is best. Compared to the April 2015 release, the reductions have gotten 20-30% faster while the other variants have not changed.
The Chapel compiler does have a --fast flag that turns off runtime safety checks such as array bounds. If you re-compile the programs with these programs using 'make CHPLFLG=--fast all', the results are dramatically faster. You can also set the environment variable CHPL_FAST to any non-empty string. This will cause the compiler to use the --fast flag for any compilation.
| rgb_convert | run_gaborfilter | display_color | improve base |
v1 | 391 | 205 | 10 | 139 X |
v2 | 390 | 5287 | 11 | 12 |
v3 | 388 | 5228 | 11 | 12 |
v4 | 389 | 11765 | 10 | 14 |
v5 | 390 | 1252 | 10 | 29 |
v6 | 389 | 19499 | 11 | 10 |
That's a speed-up of the run_gaborfilter() routine by one or two orders of magnitude! (And as a check, the resulting images do match those generated without the --fast flag – we haven't corrupted the operation of the program).
We're still new to the language so figuring out what is happening here – if we are seeing any optimizations from subdomains, if we're using subdomains properly, and how to improve the run time – isn't something we can do yet. It would be nice if Chapel, with its support for subdomains, could get closer to the raw performance without the compiler flag. Note that the cost of the array aliasing in _v5 is now much higher than the direct access of _v1.
In this section we've used a convolution example, an area multiply followed by sum, to talk about subranges, subdomains, and array operations. We've looked at six different implementations of the convolver using arithmetic on array indices, subdomains to generate guaranteed correct indices, a multiply-and-reduce approach, and array aliasing. The direct addressing schemes are the fastest. We've learned:
If you monitored the programs as they ran, you noticed that they were all serial. We can speed up the performance by taking advantage of Chapel's parallel programming support.
1. Generate a heat map for a Gaussian and Laplace of Gaussian filter. Define a grid of say 50 pixels per unit in the filter space. Calculate the filter function for each x, y and map it to RGB channels. For example, a heat map that runs from block through red and yellow to white is
r = 2*t g = 2*t - 0.5 b = 2t - 1
where t, the function value, lies between 0 and 1.0. The colors are clipped at 0 and 1, then scaled to an 8-bit range. Store each in an rgbimage and save as a PNG. (plot_kernel.chpl contains our solution for the Gabor filter and was used for the figures for this section.)
2. Write a Laplacian of Gaussian edge detector. Either save the convolution result as a PNG and highlight values near mid-scale, ie. near 0, or pass the pixels through a threshold marking only those that have a different sign than their neighbor (implying that a zero lies between).
3. An integral image I is one where the value at a pixel (x, y) is the sum of all pixels with smaller coordinates, ie.
This allows us to quickly calculate sums over areas. For example, if we want to average in a 5x5 square about a pixel, then we would do
(I(x+5,y+5)-I(x-5,y+5)-I(x+5,y-5)+I(x-5,y-5)) / 25
Write a library routine to generate the integral image, leaving it in a clrimage.
4. The asymmetric (phi=0) Gabor filter gives a large background signal. Write a high-pass filter to remove it. What remains?
A tarball with the code and image for this section can be found here. A zip file is here.
A PDF version of this text is here.