k-means clustering is a general partitioning strategy.  It assigns data points in a n-dimensional space to a number of buckets, effectively dividing the space and quantizing the values.  We can use it in image processing to reduce the number of colors in an image to make finding regions of distinct color, or comparing colors, easier.  In other words, we attempt to find the strongest clusters of values over the color space and label each pixel with the ID of the cluster to which it belongs.

This is an iterative process that runs until the population of the buckets stabilize.  The heart of the algorithm is a per-pixel comparison with each cluster to find the best fit, which then modifies the cluster's boundaries for the next pass.  This can be done in parallel, and is the focus of this chapter.

The directory for this chapter is kmeans.  You'll be using the C functions for reading and writing PNG images in img_png_v3.c and img_png_v3.h, and the color library which is updated to ip_color_v3.chpl.  Use make name to build the programs, leaving off the '.chpl' suffix.  Executables get placed in the bin sub-directory and intermediate files in build.  The image for this chapter is field.png.

The files for this chapter are located here (as zip file).  A PDF copy of this text is here.

Aside: Clustering

The k-means clustering algorithm is simple.  Begin by giving each bucket a position in the color space.  The seeds can be generated randomly, although this tends to produce buckets that are initially empty. We will instead pick pixels at random from the image, making a fixed number of attempts for each and picking the one furthest from all other seeds assuming a minimum separation be met.

For each pass we assign each pixel to the bucket whose center of mass is closest.  The center of mass is the average per color plane over all members of the cluster.  We then compute the new center of mass from the assignments and look at the population of each bucket.  When that has stabilized we are done. If not, any clusters that are unpopulated are re-seeded to midway between the two largest unused blocks and we start another pass.

We do not actually mark the pixels in any way during the pass.  We only have to track the aggregates for the cluster.  Specifically, the actions at each pixel are to

  1. 1.  Find the closest cluster. (smallest distance to its center of mass) 

  2. 2.  Increment the count of pixels in the cluster for this pass. (npix) 

  3. 3.  Increment a sum of color coordinates for the next center of mass. (sumc1, sumc2, sumc3) 

This can be done in parallel.  After all pixels are done, then we need to finish processing the clusters.  We calculate

  1. 1.  The new center of mass (c1cm = sumc1/npix, c2cm = sumc2/npix, c3cm = sumc3/npix) 

  2. 2.  The difference in npix between previous pass and current pass. (dnpix)  

When the largest dnpix/npix ratio of all clusters falls below a threshold of, say, 1%, then we are done.  We can map each center of mass back to RGB and create a greyscale image with the ID of the cluster at each pixel.

The final algorithm is:

    1. Seed clusters, ensuring some minimum distance between

    2. Repeat a pass

       2a. Assign pixels to clusters, tracking population and color plane sums

           for centers of mass.

       2b. Calculate center of mass.

       2c. Place empty clusters between biggest.

       2d. Calculate cluster population change as fraction of


       until population change is small enough, or maximum number of passes

       has been reached.

    3. Convert cluster colors back to RGB.

    4. Create greyscale image holding cluster assignments.

Implementation (kmeans_v1)

We will not be presenting all the code in detail here, rather only discussing the Chapel implementation issues.  The program can be found in kmeans_v1.chpl.  Compile it with

    make kmeans_v1

It takes four command-line arguments

    --inname     PNG file to read

    --outname    file to create with cluster ID assignments

    --space      LAB, YUV, HSV, RGB - color space to use

    --ncluster   number of clusters

Run it with

    bin/kmeans_v1 --inname=field.png --outname=ids.png

You'll find a few changes in the ip_color module, now up to version 3. First there are two new members of the clrimage class that store the ranges for the columns and rows because we'll want to use each separately and array.domain.dim() is wordy.  The domain for the image hasn't changed, but it's now built from these separated ranges.  We also store the color space in a third new member so we don't have to pass it around as an argument.  The third change is the addition of procedures to convert from any color space back to RGB.  (This was Exercise 1 in the Color Conversion chapter.)

We perform the pixel assignment and creation of the ID image in parallel using forall loops.  Our strategy for the pixel assignment is to create an array ksplit of temporary clusters to hold the intermediate sums and reduce them at the end.  We iterate over the rows in the outer, parallel loop, and ksplit is specified by row, so this is safe.  The procedure returns an array with the new cluster definitions.

    proc iterate_pass(clr : clrimage, const ref kset : [] cluster) {

      /* intermediate storage, one per row */

      var ksplit : [clr.rows][kset.domain] cluster;

      /* results of pass */

      var knew : [kset.domain] cluster;


      forall y in clr.rows do

        for x in clr.cols {

          /* closest_cluster returns index of best cluster fit. */

          const closest = closest_cluster(clr.c1(y,x), clr.c2(y,x), clr.c3(y,x),

                                          clr.space, kset);

          ksplit(y)(closest).npix += 1;

          ksplit(y)(closest).c1 += clr.c1(y,x);

          ksplit(y)(closest).c2 += clr.c2(y,x);

          ksplit(y)(closest).c3 += clr.c3(y,x);



      forall k in kset.domain do

        for y in clr.rows {

          knew(k).npix += ksplit(y)(k).npix;

          knew(k).c1 += ksplit(y)(k).c1;

          knew(k).c2 += ksplit(y)(k).c2;

          knew(k).c3 += ksplit(y)(k).c3;


      forall k in knew {

        if (0 == k.npix) {

          k.skip = true;

        } else {

          k.c1 /= k.npix;

          k.c2 /= k.npix;

          k.c3 /= k.npix;




      return knew;


The double bracket notation on ksplit indicates that we have an array each of whose elements is an array; this contrasts with a single bracket array addressed with a tuple (like the color planes) where the tuple is converted to an index.  cluster is a record with fields to hold the population npix, the center of mass and partial sum per color plane, and a flag if we should ignore the cluster because the population is 0.  

    record cluster {

      var skip : bool;                  /* true to ignore cluster in set */

      var npix : int;                   /* number pixels assigned to cluster */

      var c1, c2, c3 : real;            /* center of mass of each color plane */

      var c1tmp : real;                 /* temp needed to sum H */

      var r, g, b : c_uchar;            /* RGB equivalent of c1, c2, c3 */


The operation of the procedure should be clear.  We write out the loops as nested rather than using a single forall statement and tuple as the index variable in order to guarantee which is the outer loop.  (The actual code has a special case for the H plane, because we need to account for angles wrapping around 0).

The other use of parallelism is with the ID image.  Given an rgbimage called quant to hold the IDs, the parallel loop looks like

    forall (y, x) in clr.area {

      var xy = (y * quant.ncol) + x;     /* flat pixel index */

      /* closest_cluster returns index of best cluster fit. */

      quant.r(xy) = closest_cluster(clr.c1(y,x), clr.c2(y,x), clr.c3(y,x),

                                    clr.space, kset);


The closest_cluster() procedure checks a color (c1, c2, c3) against each cluster, using the procedure cluster_sep() to measure the distance in a sum-of-squares fashion, and picking the smallest.  We simply track the minimum and location as we go along, rather than using a reduction.

Module: Random

The seeding step requires random numbers to pick pixels across the image. Chapel provides a standard module called Random that generates reals in the range 0.0 to 1.0.  The module defines a class RandomStream whose constructor can take an integer seed; if none is provided, the system time is used.  Use the method getNext() for a new random value, or fillRandom(array) to populate an entire array of reals (fillRandom() may also be called as a stand-alone procedure, in which case it creates a new stream).  The random number generator is by default parallel-safe.

    rand = makeRandomStream();


    var seeds : [1..10000] real;


The module provides two different generators, NPB, which comes fro the NAS Parallel Benchmark, and PCG.  In Chapel 1.17 the module has not yet stabilized and its documentation recommends using makeRandomStream() to hide any future changes.  The NPBR algorithm can only generate real numbers, while PCG will work with any time.  Our code will work with either generator, which means defining two procedures to covert a random real to a position within a range or a domain.  random_ranged() returns an element from within the range:

    proc random_ranged(rand : RandomStream, rng : range) : rng.idxType {

      const elt = rng.length * rand.getNext();

      return rng.first + rng.stride * (nearbyint(elt - 0.5) : rng.idxType);


The element is shifted by -0.5 so that the range is centered within the interval; because nearbyint() rounds -0.5 to 0, the limit stays valid.  Notice how we can use the type of the range, rng.idxType, as the return declaration for the procedure and the cast because this is known at compile time.  Ranges do not, unfortunately, have a method to return the nth element, or we would be able to easily support other non-rectangular types.  For domains we build a tuple to hold a random value from each range, and then call random_range() to populate it.

    proc random_domain(rand : RandomStream, dom : domain) : dom.rank*dom.idxType {

      var pt : dom.rank * dom.idxType;

      for i in 1..dom.rank do pt(i) = random_ranged(rand, com.dim(i));

      return pt;


This handles any domain geometry.  Again we are able to use dom.idxType in the declarations, as well as the dom.rank parameter.  Domains also have the property dom.numIndices, but this is not a parameter like the rank and cannot be used at compile-time.

The seeding procedure itself picks a random pixel in the image for the first cluster and then tries a fixed number of pixels for the others, picking the one that is the furthest from all seeded clusters.

Module: Sort

The k-means algorithm is sensitive to outliers affecting the center-of-mass.  This can cause clusters to shift into an empty volume of space, further from all data points than any other cluster, and so coming up empty during a pass.  Rather than leaving them adrift, we move them halfway between the largest clusters, assuming these are the most likely to need sub-dividing.  If there's more than one empty bucket then the first goes between the largest and second largest, the next between the second and third largest, and so on.  This is somewhat naive; another strategy would be to track the standard deviation or density of each cluster and split the worst, or to check if the cluster is multi-modal.  To do this we need to know the biggest clusters, or to order them by decreasing population. The Sort module provides the standard sorting routines: QuickSort, MergeSort, HeapSort, InsertionSort, BubbleSort, and SelectionSort.  The sort() function is the general-purpose entry into the library and wraps a call to QuickSort.  All work on 1D arrays of arbitrary type and require a compare() function that returns a negative number if item1 comes before item2, a positive number if after, and 0 if equal.  (This matches C's qsort() convention.)  We can overload the default function for our data type.  We do this locally inside fillin_empties() so that we can use different comparisons in other places.

    proc fillin_empties(kset : [] cluster, space : clrspace) {

      proc DefaultComparator.compare(c1 : cluster, c2 : cluster) {

        return (c1.npix - c2.npix);


This puts the clusters in ascending size.  To reverse the order the module has defined a reverseComparor that inverts the result of the standard Comparator.  We only need to use it; there's no additional code.

      sort(kset, comparator=reverseComparator);


      /* assign empties */


The default comparator itself uses <, so you could overload that instead.

    record DefaultComparator {

      proc compare(a, b) {

        if a < b { return -1; }

        else if b < a { return 1; }

        else return 0;



Atomic Variables (kmeans_v2)

Atomic variables are restricted either in software or by hardware so that certain operations are guaranteed to be consistent between tasks: one will never see the variable in an intermediate state.  This might be done by placing a lock around the variable so that only one task can interact with it at a time, or it may use CPU commands that protect the memory location.  In Chapel any integer, unsigned integer, boolean, and real may be declared atomic.  Simply place the atomic keyword before the type

    var pixsum : atomic int;

Although you cannot treat structural types as atomic, you can declare their members to be so.

You will not be able to use normal assignments or operators for this variable, neither to change its value nor access it.  Instead, the variable will now support several methods: 



returns the variable's value



places a new value in the variable

pixsum.compareExchange(orig, new)


stores new as the value only if orig is the current value, otherwise does nothing

returns true if the value was changed, false if not



equivalent to pixsum += val; not supported for bools; returns nothing



equivalent to pixsum -= val; not supported for bools; returns nothing

pixsum.or(val), pixsum.and(val), pixsum.xor(val)


equivalent to |=, &=, ^=; only supported for integers; returns nothing

pixsum.fetchAdd(val), pixsum.fetchSub(val)


as pixsum.[add|sub](val) but returns the original value; not supported for bools

pixsum.fetchOr(val), pixsum.fetchAnd(val), pixsum.fetchXor(val)


as pixsum.[or|and|xor](val) but returns the original value; only integers



sets the value to true and returns the old value; only supported for bools



sets the value to false; only supported for bools



blocks the task until the variable has the given value

We can use atomic variables to replace the intermediate array during the k-means iteration pass.  You'll find the changes in kmeans_v2.chpl.  If you look at the iterate_pass() procedure in kmeans_v1, you'll notice that we only need to add and read ksplit.  We add five atomic members to the cluster structure to handle the intermediate sums

    record cluster {

      var skip : bool;                      /* true to ignore cluster */

      var npix : int;                       /* number pixels in cluster */

      var c1, c2, c3 : real;                /* center of mass */

      var r, g, b : c_uchar;                /* back-converted RGB value */

      var sumpix : atomic int;              /* incremental sum for npix */

      var sum1, sum2, sum3 : atomic real;   /* incremental sum for c[1-3] */

      var sumh : atomic real;               /* second sum needed for H wrap */


We could have declared npix, c1, c2, and c3 to be atomic, but this would mean we would have to use the read/write methods everywhere else in the program. (OK, we did do a version like this, but it was harder to read; since our changes are limited to the iterate_pass() procedure, this seems to be the cleaner and easier change.)

The iteration pass is now simpler, using only the new clusters.

    forall (y, x) in clr.area {

      const closest = closest_cluster(..);






    forall k in knew {

      k.npix = k.sumpix.read();

      k.c1 = k.sum1.read() / k.npix;

      k.c2 = k.sum2.read() / k.npix;

      k.c3 = k.sum3.read() / k.npix;


Again we've omitted the special case for the H plane; see the source for details.

Chapel's implementation of atomic variables is still in active development and what we've presented here describes the current implementation.  Eventually you will be able to specify looser restrictions on accessing the variable, and you will be able to pass these conditions as arguments to all the methods above.  There are also design considerations if you work with atomic variables when tasks are split over different pieces of hardware or inter operate over a network.  You'll find details by digging into the documentation. CHPL_HOME/doc/rst/technotes/atomics.rst is the starting point.

To use this program

    make kmeans_v2

    bin/kmeans_v2 --inname=field.png --outname=ids2.png

The result will differ from kmeans_v1 because the random number generator has no seed.  If you provide the same value in both programs in the 'const rand = new RandomSeed();' line in seed_clusters() the results will be the same.


This program was the inspiration for this tutorial.  We've been playing around with color quantization recently, and have threaded and CUDA versions running in C. You may notice that the accuracy of the color quantization isn't fantastic and that the algorithm can take a long time to settle.  The inaccuracy is related to proper seeding and handling of empty clusters; splitting large clusters also seems necessary.  The run time is tied to the relationship between colors in the scene and the number of clusters used; too many clusters leads to slow convergence, too few loses detail.  This is an active area of research for us, so we can't offer solutions.


You can use gnuplot to show where the clusters are.  The graph will plot a dot in the 3D R, G, B space for the cluster at its center of mass and with a size proportional to the number of pixels contained in it.  First save the output from kmeans and delete all lines up to and including the table header.  Then start gnuplot and type

    set xlabel “R” ; set ylabel “G” ; set zlabel “B

    set nokey

    set view 45, 307, 1, 1

    basenpix = 5000

    splot 'datafile' u 6:7:8:($2/basenpix):($6*65336+$7*256+$8) with points pt 7

      ps variable lc rgb variable

The value basenpix should be about the number of pixels in the smallest cluster and determines how big to draw each dot.


To demonstrate this inaccuracy, the color_checker.chpl program creates a PNG image colors.png with the colors of the Macbeth chart.  Running kmeans on this image with --ncluster=25, you'll find not every block has a unique color.  Although the random seeds will give you different results, the run below only found 19 colors of the 25 that exist (including the black bars between squares).  In the gnuplot diagram there are 5 large dots for the blocks that have combined: (2,1) and (3,4); (4,1) and (2,3); (4,2) and (1,3); (6,2) and (4,3); (1,4) and (2,4), where the numbers are (column,row).  (Block (5,4) looks similar to (3,4) but is really different.)    Compile and run the program with

    make color_checker


    bin/kmeans_v[12] --inname=colors.png --outname=ids_clr.png


Clusters in color checker, excluding black

What we can compare is the performance of the two approaches we've taken.  Running on the field.png image with 50 clusters and a fixed seed to RandomStream, we find run times (in seconds) of




improve base





12 X

column sums





atomic variables

Both programs ran the clustering in parallel on four cores.  The base column gives the time the analysis took when the program was compiled normally, the --fast column when built with CHPLFLG=--fast.  The improve column is the ratio of the base to the fast times.

Atomic variables do not hurt the performance of this program.  Only when we take away the base sanity checking by the runtime do we see a small hit.  As with the Gabor filter programs, we again see an order of magnitude improvement when we compile with --fast.


In this chapter we've used a divide-and-reduce approach to running the color clustering in parallel.  We've also seen how we can use Chapel's atomic variables to eliminate the intermediate storage and reduce step; the performance of this approach is competitive with the first.  We've learned:

In the next chapter we'll implement another parallel image processing program, looking for corners in images with the FAST detector.  This will require we write our own iterator.


1. Generate a back-mapped image.  That is, create a color RGB image where instead of cluster IDs you store their corresponding RGB values.  How do the results compare to the original image?

2. Write a Gaussian filter to smooth the initial image (in color space) before doing the quantization.  How does this affect the quantization?

3. This implementation compares each pixel against each cluster to find the closest.  The kdtree library in the RANSAC chapter is more efficient, especially as the number of clusters increases.  Modify the kmeans program to use it, building a new tree before each pass and using find_nearest_point() to get the closest cluster.


The files needed for this chapter can be found here.  A zip file is here.

A PDF version of this text is here.