k-means clustering is a general partitioning strategy.  It assigns data points in an n-dimensional space to a number of buckets, effectively dividing the space and quantizing the values.  We can use it in image processing to reduce the number of colors in an image, making it easier to find regions of distinct color or to compare colors.  In other words, we attempt to find the strongest clusters of values over the color space and label each pixel with the ID of the cluster to which it belongs.

This is an iterative process that runs until the populations of the buckets stabilize.  The heart of the algorithm is a per-pixel comparison with each cluster to find the best fit, which then modifies the cluster's boundaries for the next pass.  This can be done in parallel, and is the focus of this section.

The directory for this section is kmeans.  You'll be using the C functions for reading and writing PNG images in img_png_v3.c and img_png_v3.h, and the color library, which is updated to ip_color_v3.chpl.  Use 'make name' to build the programs, leaving off the '.chpl' suffix.  Executables are placed in the bin sub-directory and intermediate files in build.  The image for this section is field.png.

The files for this section are located here (as zip file).  A PDF copy of this text is here.

Aside: Clustering

The k-means clustering algorithm is simple.  Begin by giving each bucket a position in the color space.  These seeds can be generated randomly, although this tends to produce buckets that are initially empty.  We will instead pick pixels at random from the image, making a fixed number of attempts for each seed and keeping the candidate furthest from all other seeds, provided a minimum separation is met.

For each pass we assign each pixel to the bucket whose center of mass is closest.  The center of mass is the average, per color plane, over all members of the cluster.  We then compute the new centers of mass from the assignments and look at the population of each bucket.  When the populations have stabilized we are done.  If not, any clusters that are unpopulated are re-seeded midway between the largest clusters and we start another pass.

We do not actually mark the pixels in any way during the pass.  We only have to track the aggregates for the cluster.  Specifically, the actions at each pixel are to

  1. Find the closest cluster (smallest distance to its center of mass).

  2. Increment the count of pixels in the cluster for this pass (npix).

  3. Increment a sum of color coordinates for the next center of mass (sumc1, sumc2, sumc3).

This can be done in parallel.  After all pixels are done, then we need to finish processing the clusters.  We calculate

  1. The new center of mass (c1cm = sumc1/npix, c2cm = sumc2/npix, c3cm = sumc3/npix).

  2. The difference in npix between the previous pass and the current pass (dnpix).

When the largest dnpix/npix ratio of all clusters falls below a threshold of, say, 1%, then we are done.  We can map each center of mass back to RGB and create a greyscale image with the ID of the cluster at each pixel.
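To make that bookkeeping concrete, here is the finishing arithmetic for a single cluster as a small Python sketch; the numbers are invented for illustration.

```python
# Finishing one cluster after a pass: new center of mass and the
# population-change test.  All values here are invented for illustration.
npix, prev_npix = 480, 500                    # current and previous populations
sumc1, sumc2, sumc3 = 1200.0, 960.0, 240.0    # per-plane coordinate sums

c1cm = sumc1 / npix                           # new center of mass, per plane
c2cm = sumc2 / npix
c3cm = sumc3 / npix

dnpix = abs(npix - prev_npix)                 # population change this pass
converged = dnpix / npix < 0.01               # done when under the 1% threshold
```

Here the cluster gained a new center of (2.5, 2.0, 0.5) but its population changed by about 4%, so another pass is needed.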

The final algorithm is:

    1. Seed clusters, ensuring some minimum distance between

    2. Repeat a pass

       2a. Assign pixels to clusters, tracking population and color plane sums

           for centers of mass.

       2b. Calculate center of mass.

       2c. Re-seed empty clusters between the biggest clusters.

       2d. Calculate cluster population change as a fraction of the
           population (dnpix/npix).

       until the population change is small enough, or the maximum
       number of passes has been reached.

    3. Convert cluster colors back to RGB.

    4. Create greyscale image holding cluster assignments.
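The pass in step 2 can be sketched end-to-end.  This Python sketch is illustrative only; it uses a 1-D "color" per pixel instead of three planes and omits the re-seeding of empty clusters, but it mirrors steps 2a, 2b, and 2d:

```python
# Illustrative sketch of one k-means pass (steps 2a, 2b, 2d), not the
# Chapel code.  A 1-D "color" per pixel stands in for the three planes.

def kmeans_pass(pixels, centers, prev_npix):
    """One assignment/update pass; returns (new_centers, npix, max_change)."""
    k = len(centers)
    npix = [0] * k
    sums = [0.0] * k
    for p in pixels:                          # 2a: assign to closest center
        closest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
        npix[closest] += 1
        sums[closest] += p
    new_centers = [sums[i] / npix[i] if npix[i] else centers[i]
                   for i in range(k)]         # 2b: new centers of mass
    # 2d: population change as a fraction of the population
    max_change = max(abs(npix[i] - prev_npix[i]) / npix[i]
                     for i in range(k) if npix[i])
    return new_centers, npix, max_change

pixels = [0.1, 0.2, 0.15, 0.9, 0.95, 0.85]
centers, npix, change = kmeans_pass(pixels, [0.0, 1.0], [0, 0])
```

The driver would call kmeans_pass() repeatedly, feeding each pass's npix back in as prev_npix, until max_change drops below the threshold.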

Implementation (kmeans_v1)

We will not present all the code in detail here; rather, we will only discuss the Chapel implementation issues.  The program can be found in kmeans_v1.chpl.  Compile it with

    make kmeans_v1

It takes four command-line arguments

    --inname     PNG file to read

    --outname    file to create with cluster ID assignments

    --space      LAB, YUV, HSV, RGB - color space to use

    --ncluster   number of clusters

Run it with

    bin/kmeans_v1 --inname=field.png --outname=ids.png

You'll find a few changes in the ip_color module, now up to version 3.  First, there are two new members of the clrimage class that store the ranges for the columns and rows, because we'll want to use each separately and array.domain.dim() is wordy.  The domain for the image hasn't changed, but it's now built from these separate ranges.  Second, we store the color space in a new member so we don't have to pass it around as an argument.  The third change is the addition of procedures to convert from any color space back to RGB.  (This was Exercise 1 in the Color Conversion section.)

We perform the pixel assignment and the creation of the ID image in parallel using forall loops.  Our strategy for the pixel assignment is to create an array ksplit of temporary clusters to hold the intermediate sums, then reduce them at the end.  We iterate over the rows in the outer, parallel loop, and ksplit is indexed by row, so this is safe.

    proc iterate_pass(clr : clrimage, const ref kset : [] cluster) {

      /* intermediate storage, one per row */

      var ksplit : [clr.rows][kset.domain] cluster;

      /* results of pass */

      var knew : [kset.domain] cluster;


      forall y in clr.rows do

        for x in clr.cols {

          /* closest_cluster returns index of best cluster fit. */

          const closest = closest_cluster(clr.c1(x,y), clr.c2(x,y), clr.c3(x,y),

                                          clr.space, kset);

          ksplit(y)(closest).npix += 1;

          ksplit(y)(closest).c1 += clr.c1(x,y);

          ksplit(y)(closest).c2 += clr.c2(x,y);

          ksplit(y)(closest).c3 += clr.c3(x,y);
        }



      forall k in kset.domain do

        for y in clr.rows {

          knew(k).npix += ksplit(y)(k).npix;

          knew(k).c1 += ksplit(y)(k).c1;

          knew(k).c2 += ksplit(y)(k).c2;

          knew(k).c3 += ksplit(y)(k).c3;
        }


      forall k in knew {

        if (0 == k.npix) {

          k.skip = true;

        } else {

          k.c1 /= k.npix;

          k.c2 /= k.npix;

          k.c3 /= k.npix;
        }
      }




      return knew;
    }


The double bracket notation on ksplit indicates that we have an array each of whose elements is an array; this contrasts with a single-bracket array addressed with a tuple (like the color planes), where the tuple is converted to an index.  cluster is a record with fields to hold the population npix, the center of mass and partial sum per color plane, and a flag indicating we should ignore the cluster because its population is 0.

    record cluster {

      var skip : bool;                  /* true to ignore cluster in set */

      var npix : int;                   /* number pixels assigned to cluster */

      var c1, c2, c3 : real;            /* center of mass of each color plane */

      var c1tmp : real;                 /* temp needed to sum H */

      var r, g, b : c_uchar;            /* RGB equivalent of c1, c2, c3 */
    }


The operation of the procedure should be clear.  We write the loops as nested rather than using a single forall statement with a tuple index variable in order to guarantee which loop is outermost.  (The actual code has a special case for the H plane, because we need to account for angles wrapping around 0.)
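Why the H plane needs special handling: hue is an angle, and a plain average is wrong when values straddle 0.  One standard remedy, shown here as a Python sketch (not necessarily the tutorial's exact method), is to average unit vectors and take the angle of the sum:

```python
import math

def circular_mean_deg(angles):
    """Mean of angles in degrees that is correct across the 0/360 wrap."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

# A plain average of 350 and 10 degrees gives 180, the opposite hue;
# the circular mean lands at (or numerically next to) 0.
```

This is why the cluster record carries an extra sum (c1tmp in v1, sumh in v2) for the H plane.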

The other use of parallelism is with the ID image.  Given an rgbimage called quant to hold the IDs, the parallel loop looks like

    forall (x, y) in clr.area {

      var xy = (y * quant.ncol) + x;     /* flat pixel index */

      /* closest_cluster returns index of best cluster fit. */

      quant.r(xy) = closest_cluster(clr.c1(x,y), clr.c2(x,y), clr.c3(x,y),

                                    clr.space, kset);
    }

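The flat pixel index in this loop and the nested ksplit array earlier are the program's two addressing styles.  A Python sketch contrasts them, with lists standing in for Chapel arrays:

```python
nrow, ncol = 2, 3

# Nested: an array whose elements are arrays, one subscript per level,
# like ksplit(y)(k) in the Chapel code.
nested = [[0] * ncol for _ in range(nrow)]
nested[1][2] = 7

# Flat: a single array addressed by a computed index, the way the
# rgbimage planes use xy = (y * quant.ncol) + x.
flat = [0] * (nrow * ncol)
def flat_index(x, y):
    return y * ncol + x
flat[flat_index(2, 1)] = 7
```

Both statements touch the "same" pixel (x=2, y=1); only the addressing differs.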

The closest_cluster() procedure checks a color (c1, c2, c3) against each cluster, using the procedure cluster_sep() to measure the distance in a sum-of-squares fashion, and picking the smallest.  We simply track the minimum and location as we go along, rather than using a reduction.
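A Python sketch of that minimum-tracking search follows.  The distance measure here is plain sum-of-squares over the three coordinates; the real cluster_sep() also knows about the color space.

```python
def closest_cluster(c1, c2, c3, centers):
    """Return the index of the center with the smallest sum-of-squares
    distance to (c1, c2, c3).  centers is a list of 3-tuples."""
    best, best_d = -1, float("inf")
    for i, (k1, k2, k3) in enumerate(centers):
        d = (c1 - k1) ** 2 + (c2 - k2) ** 2 + (c3 - k3) ** 2
        if d < best_d:               # track the running minimum directly
            best, best_d = i, d
    return best

centers = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
```

Because each pixel's search is independent, this runs safely inside the parallel loops above.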

Module: Random

The seeding step requires random numbers to pick pixels across the image. Chapel provides a standard module called Random that generates reals in the range 0.0 to 1.0.  The module defines a class RandomStream whose constructor can take an integer seed; if none is provided, the system time is used.  Use the method getNext() for a new random value, or fillRandom(array) to populate an entire array of reals (fillRandom() may also be called as a stand-alone procedure, in which case it creates a new stream).  The random number generator is by default parallel-safe.

    var rand = new RandomStream();


    var seeds : [1..10000] real;
    fillRandom(seeds);

We will want to generate random positions within a range or domain, and kmeans.chpl defines two procedures to do this.  random_ranged() returns an element from within the range:

    proc random_ranged(rand : RandomStream, rng : range) : rng.idxType {

      const elt = rng.length * rand.getNext();

      return rng.first + rng.stride * (nearbyint(elt - 0.5) : rng.idxType);
    }


The element is shifted by -0.5 so that each index is centered within its interval; because nearbyint() rounds -0.5 to 0, the lower limit stays valid.  Notice how we can use the type of the range, rng.idxType, as the return declaration for the procedure and in the cast, because this is known at compile time.  Ranges do not, unfortunately, have a method to return the nth element, or we would easily be able to support other non-rectangular types.  For domains we build a tuple to hold a random value from each dimension's range, and then call random_ranged() to populate it.

    proc random_domain(rand : RandomStream, dom : domain) : dom.rank*dom.idxType {

      var pt : dom.rank * dom.idxType;

      for i in 1..dom.rank do pt(i) = random_ranged(rand, dom.dim(i));
      return pt;
    }


This handles any domain geometry.  Again we are able to use dom.idxType in the declarations, as well as the dom.rank parameter.  Domains also have the property dom.numIndices, but this is not a parameter like the rank and cannot be used at compile-time.
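The shift-and-round mapping in random_ranged() is easy to check in isolation.  This Python sketch mirrors the arithmetic, with the uniform value passed in explicitly; Python's round(), like C's nearbyint(), rounds halfway cases to even.

```python
def random_ranged(u, first, length, stride=1):
    """Map u in [0, 1) onto a strided integer range with the given first
    element, length, and stride, mirroring the Chapel procedure's logic."""
    elt = length * u
    # The -0.5 shift gives each index an equal slice of the unit interval;
    # round() takes -0.5 to 0, so u == 0.0 still lands on the first element.
    return first + stride * int(round(elt - 0.5))
```

For a ten-element range starting at 1, u = 0.0 maps to 1 and u just under 1.0 maps to 10, so every element is equally likely and the endpoints stay in bounds.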

The seeding procedure itself picks a random pixel in the image for the first cluster and then tries a fixed number of pixels for the others, picking the one that is the furthest from all seeded clusters.
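A Python sketch of that seeding loop, with 1-D "pixel" values and absolute-difference distance, purely for illustration:

```python
import random

def seed_clusters(pixels, nclusters, ntries=20, rng=random):
    """Pick nclusters seed values from pixels.  The first is random; each
    later seed is the best of ntries candidates, where 'best' means the
    largest distance to its nearest already-chosen seed."""
    seeds = [rng.choice(pixels)]
    while len(seeds) < nclusters:
        best, best_d = None, -1.0
        for _ in range(ntries):
            cand = rng.choice(pixels)
            d = min(abs(cand - s) for s in seeds)   # nearest existing seed
            if d > best_d:
                best, best_d = cand, d
        seeds.append(best)
    return seeds
```

The tutorial's version also enforces a minimum separation between seeds; that check is omitted here for brevity.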

Module: Sort

Clusters may be empty after a pass if they have shifted into an empty volume of the space.  Rather than leaving them adrift, we re-seed them, placing one halfway between the largest clusters.  If there's more than one empty bucket then the first goes between the largest and second largest, the next between the second and third largest, and so on.  (This is somewhat naive; another strategy would be to track the standard deviation or density of each cluster and split the worst, or to check if the cluster is multi-modal.)  To do this we sort the clusters by decreasing population. The Sort standard module provides the standard sorting routines: QuickSort, MergeSort, HeapSort, InsertionSort, BubbleSort, and SelectionSort.  All work on 1D arrays of values, but none take a comparison function as an argument. Instead, Chapel expects that you have overloaded the relational operators for whichever data type is in the array: <= and < for an increasing sort order, >= and > for a decreasing.  Therefore, we need to define these for our cluster record.

In the program these four methods are defined locally within fillin_empties(), so that we could use different comparisons in other places if we wanted to sort by other criteria.  The procedure declarations are simple

    proc fillin_empties(kset : [] cluster, space : clrspace) {

      proc >=(c1 : cluster, c2 : cluster) : bool {

        return c1.npix >= c2.npix;
      }


      proc >(c1 : cluster, c2 : cluster) : bool {

        return c1.npix > c2.npix;
      }

      /* Provide <=, < if reverse=false */


      QuickSort(kset, reverse=true);


      /* assign empties */
    }


Here the reverse argument to QuickSort puts the list in order of decreasing population.
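For contrast, here is the same decreasing-population sort in Python, where the comparison is passed as a key function instead of overloaded operators; the dictionaries are hypothetical minimal stand-ins for the cluster records.

```python
# Minimal stand-ins for the cluster records; only npix matters for the sort.
clusters = [{"npix": 40}, {"npix": 900}, {"npix": 0}, {"npix": 350}]

# Chapel's QuickSort(kset, reverse=true) relies on the overloaded >= and >;
# Python's sorted() takes the comparison key directly.
by_population = sorted(clusters, key=lambda c: c["npix"], reverse=True)

# Empty clusters (npix == 0) end up at the tail, and the re-seeding step
# places them between the (1st, 2nd), (2nd, 3rd), ... largest pairs.
```

Passing the key explicitly also makes it easy to sort the same records by other criteria elsewhere, which is the flexibility the tutorial gets by defining the operators locally.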

Atomic Variables (kmeans_v2)

Atomic variables are restricted, either in software or hardware, so that certain operations are guaranteed to be consistent between tasks: one task will never see the variable in an intermediate state.  This might be done by placing a lock around the variable so that only one task can interact with it at a time, or by using CPU instructions that protect the memory location.  In Chapel any integer, unsigned integer, boolean, or real may be declared atomic.  Simply place the atomic keyword before the type:

    var pixsum : atomic int;

Although you cannot treat structural types as atomic, you can declare their members to be so.

You will not be able to use normal assignments or operators with this variable, either to change its value or to access it.  Instead, the variable supports several methods:



pixsum.read()

    returns the variable's value

pixsum.write(val)

    places a new value in the variable

pixsum.compareExchange(orig, new)

    stores new as the value only if orig is the current value, otherwise
    does nothing; returns true if the value was changed, false if not

pixsum.add(val)

    equivalent to pixsum += val; not supported for bools; returns nothing

pixsum.sub(val)

    equivalent to pixsum -= val; not supported for bools; returns nothing

pixsum.or(val), pixsum.and(val), pixsum.xor(val)

    equivalent to |=, &=, ^=; only supported for integers; returns nothing

pixsum.fetchAdd(val), pixsum.fetchSub(val)

    as pixsum.[add|sub](val) but returns the original value; not supported
    for bools

pixsum.fetchOr(val), pixsum.fetchAnd(val), pixsum.fetchXor(val)

    as pixsum.[or|and|xor](val) but returns the original value; only
    supported for integers

pixsum.testAndSet()

    sets the value to true and returns the old value; only supported for
    bools

pixsum.clear()

    sets the value to false; only supported for bools

pixsum.waitFor(val)

    blocks the task until the variable has the given value
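The guarantee these methods provide can be modeled in Python with an explicit lock.  This is a sketch only; Python has no atomic numeric type, and this is not how Chapel implements atomics, but the behavior of add(), fetchAdd(), and read() is the same:

```python
import threading

class AtomicInt:
    """Lock-based stand-in for Chapel's 'atomic int': every operation
    completes all-or-nothing, so no task sees an intermediate state."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def add(self, val):            # like pixsum.add(val): returns nothing
        with self._lock:
            self._value += val

    def fetch_add(self, val):      # like pixsum.fetchAdd(val): returns old
        with self._lock:
            old = self._value
            self._value += val
            return old

    def read(self):                # like pixsum.read()
        with self._lock:
            return self._value

pixsum = AtomicInt()
def work():
    for _ in range(1000):
        pixsum.add(1)

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# pixsum.read() is now exactly 4 * 1000: no increment was lost
```

With a plain integer and no lock, concurrent increments could interleave and drop updates; the atomic guarantee is exactly what lets the k-means pass accumulate sums from many tasks at once.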

We can use atomic variables to replace the intermediate array during the k-means iteration pass.  You'll find the changes in kmeans_v2.chpl.  If you look at the iterate_pass() procedure in kmeans_v1, you'll notice that we only need to add and read ksplit.  We add five atomic members to the cluster structure to handle the intermediate sums

    record cluster {

      var skip : bool;                      /* true to ignore cluster */

      var npix : int;                       /* number pixels in cluster */

      var c1, c2, c3 : real;                /* center of mass */

      var r, g, b : c_uchar;                /* back-converted RGB value */

      var sumpix : atomic int;              /* incremental sum for npix */

      var sum1, sum2, sum3 : atomic real;   /* incremental sum for c[1-3] */

      var sumh : atomic real;               /* second sum needed for H wrap */
    }


We could have declared npix, c1, c2, and c3 to be atomic, but this would mean we would have to use the read/write methods everywhere else in the program. (OK, we did do a version like this, but it was harder to read; since our changes are limited to the iterate_pass() procedure, this seems to be the cleaner and easier change.)

The iteration pass is now simpler, using only the new clusters.

    forall (x,y) in clr.area {
      var closest = closest_cluster(..);
      /* accumulate this pixel into the winning cluster's atomic sums */
      knew(closest).sumpix.add(1);
      knew(closest).sum1.add(clr.c1(x,y));
      knew(closest).sum2.add(clr.c2(x,y));
      knew(closest).sum3.add(clr.c3(x,y));
    }

    forall k in knew {
      k.npix = k.sumpix.read();
      k.c1 = k.sum1.read() / k.npix;
      k.c2 = k.sum2.read() / k.npix;
      k.c3 = k.sum3.read() / k.npix;
    }

Again we've omitted the special case for the H plane; see the source for details.

Chapel's implementation of atomic variables is still in active development and what we've presented here describes the current implementation.  Eventually you will be able to specify looser restrictions on accessing the variable, and you will be able to pass these conditions as arguments to all the methods above.  There are also design considerations if you work with atomic variables when tasks are split over different pieces of hardware or interoperate over a network.  You'll find details by digging into the documentation. CHPL_HOME/doc/technotes/README.atomics is the starting point.

To use this program

    make kmeans_v2

    bin/kmeans_v2 --inname=field.png --outname=ids2.png

The result will differ from kmeans_v1 because the random number generator is seeded from the system time.  If you provide the same seed value to both programs in the 'const rand = new RandomStream();' line in seed_clusters(), the results will be the same.


This program was the inspiration for this tutorial.  We've been playing around with color quantization recently, and have threaded and CUDA versions running in C. You may notice that the accuracy of the color quantization isn't fantastic and that the algorithm can take a long time to settle.  The inaccuracy is related to proper seeding and handling of empty clusters; splitting large clusters also seems necessary.  The run time is tied to the relationship between colors in the scene and the number of clusters used; too many clusters leads to slow convergence, too few loses detail.  This is an active area of research for us, so we can't offer solutions.


You can use gnuplot to show where the clusters are.  The graph will plot a dot in the 3D R, G, B space for the cluster at its center of mass and with a size proportional to the number of pixels contained in it.  First save the output from kmeans and delete all lines up to and including the table header.  Then start gnuplot and type

    set xlabel "R" ; set ylabel "G" ; set zlabel "B"

    set nokey

    set view 45, 307, 1, 1

    basenpix = 5000

    splot 'datafile' u 6:7:8:($2/basenpix):($6*65536+$7*256+$8) \
      with points pt 7 ps variable lc rgb variable

The value basenpix should be about the number of pixels in the smallest cluster and determines how big to draw each dot.


To demonstrate this inaccuracy, the color_checker.chpl program creates a PNG image colors.png with the colors of the Macbeth chart.  Running kmeans on this image with --ncluster=25, you'll find that not every block gets a unique color.  Although the random seeds will give you different results, the run below found only 19 of the 25 colors that exist (including the black bars between squares).  In the gnuplot diagram there are 5 large dots for the blocks that have combined: (2,1) and (3,4); (4,1) and (2,3); (4,2) and (1,3); (6,2) and (4,3); (1,4) and (2,4), where the numbers are (column,row).  (Block (5,4) looks similar to (3,4) but is really different.)  Compile and run the program with

    make color_checker


    bin/kmeans_v[12] --inname=colors.png --outname=ids_clr.png


[Figure: Clusters in color checker, excluding black]

What we can compare is the performance of the two approaches we've taken.  Running on the field.png image with 50 clusters and a fixed seed to RandomStream, we find run times (in s) of

                      base     --fast    improve

         kmeans_v1     336       26        13X

         kmeans_v2     328       25        13X

Both programs ran the clustering in parallel on four cores.  The base column gives the time the analysis took when the program was compiled normally, the --fast column when built with CHPLFLG=--fast.  The improve column is the ratio of the base to the fast times.

Atomic variables do not hurt the performance of this program; in fact, they help it a wee bit.  This might be because there is not much contention.  The cluster assignments are dispersed because there are many buckets, few tasks running in parallel, and the colors vary from pixel to pixel.  As with the Gabor filter programs, we again see an order of magnitude improvement when we compile with --fast.


In this section we've used a divide-and-reduce approach to running the color clustering in parallel.  We've also seen how we can use Chapel's atomic variables to eliminate the intermediate storage and the reduce step; the performance of this approach is competitive with the first.  Along the way we've learned about nested array declarations, the standard Random and Sort modules, and atomic variables and their methods.

In the next section we'll implement another parallel image processing program, looking for corners in images with the FAST detector.  This will require that we write our own iterator.


Exercises

1. Generate a back-mapped image.  That is, create a color RGB image where instead of cluster IDs you store their corresponding RGB values.  How do the results compare to the original image?

2. Write a Gaussian filter to smooth the initial image (in color space) before doing the quantization.  How does this affect the quantization?

3. This implementation compares each pixel against each cluster to find the closest.  The kdtree library in the RANSAC section is more efficient, especially as the number of clusters increases.  Modify the kmeans program to use it, building a new tree before each pass and using find_nearest_point() to get the closest cluster.

