2011-03-25

Easy Parallelization with C++0X lambda functions, Threading Building Blocks

Lambda functors in C++0X


The relatively new gcc-4.5 release supports lambda expressions,
which – in contrast to boost.lambda, boost.bind and the like – make
it easy to capture variables from the surrounding context inside the
lambda expression. This finally makes the STL algorithms convenient
to use, for example:





#include <algorithm>
#include <vector>
#include <boost/lambda/lambda.hpp> // provides the _1 placeholder

struct add_val{
  add_val(double d):val(d){}
  void operator()(double& d) const {
    d+=val;
  }
  double val;
};


int main(){
  using namespace boost::lambda;

  std::vector<double> v;
  // ... fill vector

  double inc = 3.0;

  // the old way of doing things
  add_val av(inc);
  std::for_each(v.begin(),v.end(),av);

  // using a boost.lambda function
  std::for_each(v.begin(),v.end(), _1+=inc);

  // using a c++0x lambda function
  std::for_each(v.begin(),v.end(), [=](double& d){d+=inc;});
}

The "old way" primarily has inconveniences for the programmer:
  • We need to define a struct outside the scope, even for tiny
    functionality.
  • We need to explicitly capture variables from the scope of the
    surrounding code (here: inc) in the struct.

Boost.Lambda tried to resolve this problem, in an elegant way, I
believe. However, this approach still has shortcomings:

  • Variables have unintuitive names (_1, _2).
  • The code in the lambda expression is not really a block of code. It is an expression,
    parts of which may be evaluated at surprising times.
  • Also, since it is an expression, conditionals and loops must be
    expressed in an awkward (that is, non-C++) way using if_, for_
    and so on.

The new C++0X lambda syntax gets rid of all these problems. While the
syntax looks a bit strange at first, it is much more readable than
boost.lambda constructs. The block of code is a real code block in the
enclosing scope, and all names of the surrounding code can be captured
by copy ([=]) or by reference ([&]).
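
For illustration, here is a minimal, self-contained sketch of the
difference between the two capture modes (the variable names are made
up for this example):


#include <algorithm>
#include <iostream>
#include <vector>

int main(){
  std::vector<double> v(3, 1.0);
  double inc = 3.0, sum = 0.0;

  // [=] captures inc by value: the lambda keeps its own copy of inc
  std::for_each(v.begin(), v.end(), [=](double& d){ d += inc; });

  // [&] captures sum by reference: the lambda modifies the original sum
  std::for_each(v.begin(), v.end(), [&](double d){ sum += d; });

  std::cout << sum << std::endl;  // prints 12
}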


You might wonder why we do not just use boost.foreach and simply
write





#include <boost/foreach.hpp>
#define foreach BOOST_FOREACH
// ...as before...
foreach(double& d, v){
  d+=inc;
}

// in c++0x (not yet implemented in gcc-4.5), this will be
for(double& d : v){
  d+=inc;
}
… which is an idiom quite well known from other languages. The fun
part is that we cannot change what for does, but we can change the
implementation of for_each. This is one of the things the Intel(R) Threading Building Blocks library does.

Parallelizing your for-loop by changing two lines


A nice trick now is to rewrite your for-loops whose iterations are
independent of each other like this:





#include <tbb/parallel_for_each.h>

// before
foreach(double& d, v){
  d+=inc;
}

// after
tbb::parallel_for_each(v.begin(),v.end(),[=](double& d){
  d+=inc;
});

… and everything in this loop is automatically run in parallel.

Similar things can be done with OpenMP:



int end = v.size();
#pragma omp parallel for
for(int i=0;i<end;i++){
   v[i]+=inc;
}

but this is already much more intrusive. Furthermore, the loop index
must be an int and the last index must be known in advance. While
variables may be passed by copy (using firstprivate(inc)), these
copies are private to the thread, not to the "wrapped" function. Of
course, OpenMP also gives you much more fine-grained control over the
parallelization.
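
For completeness, here is a minimal sketch of what passing the
increment by copy looks like in OpenMP (assuming the same vector v and
increment inc as above; the function name is made up for this example):


#include <vector>

void add_inc(std::vector<double>& v, double inc){
  // the loop bound must be known in advance and the index must be an int
  int end = static_cast<int>(v.size());
#pragma omp parallel for firstprivate(inc)
  for(int i = 0; i < end; ++i){
    v[i] += inc;   // each thread works on its own initialized copy of inc
  }
}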

2011-02-09

Profiling Boost Multi-Array

Worried by this StackOverflow thread, I ran my own experiments
profiling the access speed of Boost MultiArrays.


First of all, the thread is right: BOOST_MA needs about 3 times as
much time to iterate over the array as Native.


The additional cases I checked are quite interesting, though:



BOOST_IT
has only a 30% overhead over Native. Note that you can
avoid the obfuscating iterator types by using the auto keyword
and -std=gnu++0x as a compiler argument (works for gcc version >= 4.4).

RAW_POINTER
is actually slower than Native. This is a typical
case of "do not try to be smarter than your compiler". Note that this
is the same algorithm as the one you get when you use std::fill (see
the sketch after this list).

unsigned int
index types are significantly slower (about 2x) in
the Native condition than int index types. This is bad, because
intuitively one would choose unsigned int for array indexing.
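
As mentioned for the RAW_POINTER case, the same access pattern can be
written with std::fill over the contiguous storage of the
multi_array_ref. A minimal sketch (the helper function is made up for
this example):


#include <algorithm>
#include <boost/multi_array.hpp>

// fill every element of a 2D multi_array_ref via its contiguous storage --
// the same access pattern as the RAW_POINTER variant below
void fill_all(boost::multi_array_ref<double, 2>& m, double value){
  std::fill(m.data(), m.data() + m.num_elements(), value);
}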

My code:





#include <cstdio>
#include <boost/date_time/posix_time/posix_time.hpp>
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>
using namespace boost::posix_time; 

int main(int argc, char* argv[])
{
  typedef int idx;
  const idx X_SIZE = 400;
  const idx Y_SIZE = 400;
  const idx ITERATIONS = 5000;

  double *nativeMatrix = new double [X_SIZE * Y_SIZE];

  typedef boost::multi_array_ref<double, 2> ImageArrayType;
  ImageArrayType boostMatrix(nativeMatrix, boost::extents[X_SIZE][Y_SIZE]);    

  // Native condition
  ptime startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
      for (idx y = 0; y < Y_SIZE; ++y)
          for (idx x = 0; x < X_SIZE; ++x)
              nativeMatrix[x + (y * X_SIZE)] = 2.345;
  ptime endTime = microsec_clock::universal_time();
  printf("[Native]Elapsed time: %6.3f seconds\n", time_period(startTime, endTime).length().total_milliseconds()/1000.f);

  // other conditions   
  startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
    {
#ifdef RAW_POINTER
      double* end = boostMatrix.data() + X_SIZE*Y_SIZE;
      for(double* begin=boostMatrix.data(); begin!=end; ++begin)
        *begin = 2.345;
#elif defined(BOOST_IT)
      for(auto it=boostMatrix.begin(); it!= boostMatrix.end(); ++it)
          for(auto it2=(*it).begin(); it2!=(*it).end(); ++it2)
              *it2 = 2.345;
#elif defined(BOOST_MA)
      for (idx y = 0; y < Y_SIZE; ++y)
         for (idx x = 0; x < X_SIZE; ++x)
             boostMatrix[y][x] = 2.345;
#endif
    }
  endTime = microsec_clock::universal_time();
  printf("[Boost] Elapsed time: %6.3f seconds\n", time_period(startTime,endTime).length().total_milliseconds()/1000.f );

  return 0;
}

All compiled and executed using


g++-4.4 -O3 -g0 -DNDEBUG -march=native -mtune=native -ffast-math -std=gnu++0x % && ./a.out

2011-02-08

Parallelization in Python using Weave and OpenMP

If you're using python a lot, you have probably stumbled across the performance python howto at some point. Since inner loops are so
expensive in python, it makes sense to move them to C++, and
PerformancePython shows how to do that. When standard libraries like
numpy and weave.blitz are not expressive enough, your best bet becomes
SciPy Weave. In weave you write your C++ code in a string and access
your numpy arrays via the blitz converters. The code is recompiled
whenever it changes and can make use of arbitrary libraries, as we shall see.

Parallelizing Weave-Code: Per-Pixel eigenvalues


Let's suppose we have a 3D image $D$, with the second order derivatives
in $x$, $y$ and $z$ direction given by $Dxx, Dxy, Dyy, \ldots$. We
want to calculate the eigenvalues at each pixel. For efficiency, we
use LAPACK, and for clarity, we use Boost uBLAS. Let's first write down
the C++ code:

namespace ublas = boost::numeric::ublas;
namespace lapack = boost::numeric::bindings::lapack;
using namespace std;
typedef ublas::bounded_matrix<double,3,3,ublas::column_major> mat;
ublas::bounded_vector<double,3> lambda;
ublas::bounded_vector<double,34*3> work; // the 34*width is from syev.hpp

mat A;
int i,j,k,idx;
#pragma omp parallel for private(i,j,k,A,idx,lambda,work)
for(i=0;i<nx;i++){
  for(j=0;j<ny;j++){
    for(k=0;k<nz;k++){
      A(0,0) = Dxx(i,j,k);  // fill upper triangle of A
      A(1,1) = Dyy(i,j,k);
      A(2,2) = Dzz(i,j,k);
      A(0,1) = Dxy(i,j,k);
      A(0,2) = Dxz(i,j,k);
      A(1,2) = Dyz(i,j,k);

      // determine eigenvalues (N) of upper (U) triangle of A in lambda
      lapack::syev('N','U',A,lambda,lapack::workspace(work));
      lambda1(i,j,k) = lambda(0);  // eigenvalues in ascending order
      lambda2(i,j,k) = lambda(1);
      lambda3(i,j,k) = lambda(2);
    }
  }
 }  

Note that we use OpenMP here to parallelize the outermost for-loop using
a #pragma directive. It took me a while to figure out that one needs
to explicitly name all variables which must be private to a thread in
this for-loop; otherwise you will see no speedup and/or weird
program behavior. A further restriction seems to be that the sizes of
these variables need to be known at compile time (here I use
bounded_matrix and bounded_vector for that purpose). Finally, the loop
indices must be of type int for OpenMP.

That is only the body of a function. We save this in a python string
called codestring and execute it as follows:

from scipy.weave import inline, converters
variables = "nx ny nz lambda1 lambda2 lambda3 Dxx Dyy Dzz Dxy Dxz Dyz".split()
inline(codestring,
       variables,
       extra_compile_args =['-O3 -fopenmp'],
       extra_link_args=['-lgomp'],
       headers=[
           '<boost/numeric/bindings/lapack/syev.hpp>',
           '<boost/numeric/bindings/traits/ublas_matrix.hpp>',
           '<boost/numeric/bindings/traits/ublas_vector.hpp>',
           '<boost/numeric/ublas/matrix.hpp>',
           '<boost/numeric/ublas/banded.hpp>',
           '<boost/numeric/ublas/vector.hpp>',
           '<cmath>'],
       type_converters=converters.blitz,
       libraries=["lapack"])

A couple of things to point out here: since codestring is only the
body of a function, we need some other place for include files and
declarations. Include files are listed in headers; additional code
with declarations can be supplied using support_code. To use OpenMP,
we need the compiler argument -fopenmp, and the linker needs to link
the gomp library. The type_converters argument tells Weave to copy
conversion routines into the final program which convert NumPy arrays
into Blitz arrays, which have nice accessor operators.

Second Order Parallelization


We can go one step further and parallelize the resulting code
over multiple processors or even computers. In Python, a comparatively
new and exciting way to achieve this is Parallel IPython. I'll leave
the details to a later posting; instead I'll point to a
problem I found while running Weave code in parallel on multiple machines.


When Weave compiles your code, it writes the generated source code to
~/.python26_compiled. The file name is created from a hash code over
the contained source code. This way, Weave knows when to recompile and
when to reuse an old binary.


Now to the downside:

Multiple Cores
You will get problems if you run multiple
instances at the same time: they try to write the same source
file at the same time.
Multiple Computers
You will get the same problem if your home
directory is shared between machines via the same file system (NFS or
the like).

My solution to problem number one is to compile once with a single
thread, then reuse using all threads. Problem two can be solved by
symlinking ~/.python26_compiled to /tmp.

2011-01-27

Dijkstra on boost grid graph

There is a whole science of finding tubular structures in
3D images. It started sometime in the late nineties. One of the most
cited papers [1] in this area shows how to use the eigenvalues of the
local second order derivatives to identify tubular structures
heuristically. It also shows how to combine the results over multiple scales.



[Figures: raw data; after pre-processing]
That is quite neat, but the approach has a major problem: it does not
assume any global structure. Therefore, where the data shows no local
evidence for a tube, you still do not see one, and
furthermore, you need a method that allows you to say "points
x and y lie on the same tube". This, and determining properties of the
tubes, are the problems which most of the follow-up research strives
to solve.

I wanted a quick-and-dirty approach to the problem. For a
given seed point, I would like to know the connectivity of the tubular
structure to this seed point. I pre-processed the raw image using the
method of Sato et al. [1] and determined a seed point. The graph is
defined implicitly over the 3D image using voxel neighborhoods, so in
theory we could just run Dijkstra on it. I was lazy, however, and did
not want to implement it all myself.



BGL Grid Graph


A quick google search turned up boost::grid_graph, a relatively new
member of the boost graph library (BGL). It defines a graph on a grid
without explicitly storing all vertices and edges. Neat. The
documentation is still a bit short and, sadly, only a 6-neighborhood
(in 3D) is supported, but who'd expect a free lunch?


Using it is (at first) easy as pie:





#include <cstddef>
#include <boost/graph/grid_graph.hpp>
using namespace boost;

typedef std::size_t vidx_t;  // index type used for the grid dimensions
typedef grid_graph<3> graph_t;
typedef graph_traits<graph_t> Traits;
typedef Traits::edge_descriptor edge_descriptor;
typedef Traits::vertex_descriptor vertex_descriptor;

// ...

boost::array<vidx_t, 3> lengths = { { 256, 256, 256 } };
graph_t graph(lengths, false); // do not wrap any dimensions

Defining properties of edges in a grid graph


In contrast to graphs with internal properties, we have to define
external properties for graph_t. This is currently not well
documented, and it took me some time to figure out:



  • In most BGL examples, nodes are indexed using an integer type. The
    grid_graph, however, uses n-tuples of type boost::array<n>.
    This means most examples cannot be applied to grid_graph in a
    straightforward way: you cannot index an array by a tuple without
    doing some tricks first (see the sketch after this list).

  • We would like to have an arbitrary function of the nodes as the
    weight. Most BGL examples simply use an array for whichever
    property they put in their graph.
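
As a minimal sketch of the index/tuple issue (reusing graph_t, graph
and vertex_descriptor from above): the built-in vertex_index property
map of grid_graph converts an n-tuple vertex descriptor into a linear
index, which can then be used to address ordinary arrays.


#include <vector>

property_map<graph_t, vertex_index_t>::const_type
    idx_map = get(vertex_index, graph);

vertex_descriptor v = { { 10, 20, 30 } };         // an n-tuple descriptor
std::size_t linear = get(idx_map, v);             // its linear index
std::vector<double> values(num_vertices(graph));
values[linear] = 1.0;                             // arrays are usable again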

The first problem can be solved using shared array property maps
(which do not seem to be well-documented either):





#include <boost/property_map/shared_array_property_map.hpp>

shared_array_property_map<vertex_descriptor,
  property_map<graph_t, vertex_index_t>::const_type>
  p_map(num_vertices(graph), get(vertex_index, graph));
shared_array_property_map<double,
  property_map<graph_t, vertex_index_t>::const_type>
  d_map(num_vertices(graph), get(vertex_index, graph));

We will use p_map as our predecessor map, which maps a vertex index (integer)
to a vertex descriptor (n-tuple).

The second map d_map is a mapping from vertex index to double, which
will be our distance map.


Finally, we need to create an edge_weight property map for graph_t, and it
should be a function, not an array. For this, I peeked at W.P. McNeill's
Boost-Implicit-Graph example on GitHub.





struct edge_weight_map;
namespace boost {
  template<>
  struct property_map< graph_t, edge_weight_t > {
    typedef edge_weight_map type;
    typedef edge_weight_map const_type;
  };
}

/*
   Map from edges to weight values
*/
struct edge_weight_map {
  typedef double value_type;
  typedef value_type reference;
  typedef edge_descriptor key_type;
  typedef boost::readable_property_map_tag category;
  const graph_t& m_graph;
  edge_weight_map(const graph_t& g)
  :m_graph(g) { }
  reference operator[](key_type e) const; // implemented below
};

typedef boost::property_map<graph_t, boost::edge_weight_t>::const_type
        const_edge_weight_map;
typedef boost::property_traits<const_edge_weight_map>::reference
        edge_weight_map_value_type;
typedef boost::property_traits<const_edge_weight_map>::key_type
        edge_weight_map_key;

namespace boost{
    // PropertyMap valid expressions
    edge_weight_map_value_type get(const_edge_weight_map pmap, edge_weight_map_key e) {
        return pmap[e]; }
    // ReadablePropertyGraph valid expressions
    const_edge_weight_map get(boost::edge_weight_t, const graph_t&g) {
        return const_edge_weight_map(g); }
    edge_weight_map_value_type get(boost::edge_weight_t tag, const graph_t& g, edge_weight_map_key e) {
        return get(tag, g)[e]; }
}

// whoa, lots of typedefs, but now we can write the function that we
// wanted to write: Map edges to weights!
edge_weight_map::reference
edge_weight_map::operator[](key_type e) const {
    vertex_descriptor t = target(e,m_graph);
    vertex_descriptor s = source(e,m_graph);
    // in a real application, return some weight f(s,t), e.g. derived from
    // the pre-processed image values at the two voxels s and t
    return 1.0; // placeholder: unit weight for every edge
}

That was nastier than I expected, but it is still not that much
code to write.

Representing the data in a Boost Multi-Array


Finally, we need to represent the 3D data. I hate dealing with raw memory,
so I always wrap it in some convenient abstraction at the first
chance I get. 3D arrays are definitely a case for Boost Multi-Array.
It's not a big deal to use, but I want to advertise its simplicity here:





#include <fstream>
#include <boost/multi_array.hpp>

typedef boost::multi_array_ref<unsigned char, 3> array_type;

std::ifstream dat("raw.dat", std::ios::in | std::ios::binary);
unsigned char* data = new unsigned char[256*256*256];
dat.read((char*)data,256*256*256);
array_type A(data,boost::extents[256][256][256]);
dat.close();

By the way, you can also change the base of this array as well as the
memory order, so if you need to port Matlab (TM) code to a programming
language (no adjective here on purpose), you can't get much simpler.
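
As a minimal sketch (reusing the data pointer and array_type from
above; this is only an illustration, not part of the original program),
this is how the array can be made to behave like a Matlab matrix, with
column-major storage and 1-based indices:


array_type B(data,
             boost::extents[256][256][256],
             boost::fortran_storage_order());  // column-major, like Matlab
B.reindex(1);                                  // indices now start at 1
unsigned char first = B[1][1][1];              // corresponds to A(1,1,1) in Matlab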

Finish line: Running Dijkstra


This is trivial now and straight from the book:





boost::array<vidx_t, 3> start  = { { 109, 129,  24 } };
dijkstra_shortest_paths(graph, start
        ,predecessor_map(p_map)
        .distance_map(d_map)
        );
Note the "." before distance_map, this is a named argument called
distance_map, which is set to d_map. The same goes for p_map.
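
To trace a path from some voxel of interest back to the seed point, one
can simply follow the predecessor map. A minimal sketch (reusing graph,
start and p_map from above; the target voxel is made up for this
example):


std::vector<vertex_descriptor> path;
vertex_descriptor cur = { { 200, 100, 50 } };  // some hypothetical target voxel
while (cur != start) {
  path.push_back(cur);
  vertex_descriptor pred = get(p_map, cur);
  if (pred == cur) break;                      // target not reachable from the seed
  cur = pred;
}
path.push_back(start);
// 'path' now holds the voxels from the target back to the seed point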


[Figure: thresholded distances]
We can now trace paths starting at points which have a high value in
the raw image:


[Figure: traced paths starting at thresholded raw data]

Footnotes:

1 "3D multi-scale line filter for segmentation and visualization
of curvilinear structures in medical images" by Sato et al.,
CVRMed-MRCAS, 1997 (PDF)

2010-12-15

Nearest Neighbor Classifier in CUV

After reading this blog post by Alex Smola, I implemented a
one-nearest-neighbor classifier in CUV. Apart from a bug that popped
up, I was happy to see that no changes to the library were needed,
and the code is very short, so I'll share the implementation here.


There is a small annoyance which I'll fix later:



  • there is no argmin_col, so for now we resort to multiplying by
    -1 and then using argmax_col. I'll clean this mess up
    later, probably by adding an argFunc_col functor.

import cuv_python as cp
import pyublas
import numpy as np

class KNN:
    """ calculates the labels of a test set by determining the closest
    instance in a training set and returning the corresponding label."""
    def __init__(self, data, data_l):
        """
        data:   a (number_instances x dimensionality) numpy matrix
        data_l: a number_instances numpy vector containing the labels
        """
        self.data   = cp.push(data)
        self.data_l = data_l
        self.dsq    = cp.dev_matrix_cmf(self.data.h,1)
        cp.reduce_to_col(self.dsq.vec,self.data,cp.reduce_functor.ADD_SQUARED)
    def __get_distance_matrix(self, test):
        """
        test: a (number_test_instances x dimensionality) numpy matrix
        returns: a (number_instances x number_test_instances) CUV distance matrix
        """
        t   = cp.push(test)
        assert t.w == self.data.w
        tsq = cp.dev_matrix_cmf(t.h, 1)
        cp.reduce_to_col(tsq.vec,t,cp.reduce_functor.ADD_SQUARED)
        p   = cp.dev_matrix_cmf(self.data.h, t.h)
        cp.prod(p, self.data, t, 'n','t',-2, 0)
        cp.matrix_plus_col(p,self.dsq.vec)
        cp.matrix_plus_row(p,tsq.vec)
        return p
    def run(self,test):
        """
        test:    a (number_test_instances x dimensionality) numpy matrix
        returns: the estimated labels of the test instances
        """
        p = self.__get_distance_matrix(test)
        p *= -1.                # no argmin supported yet, sorry
        idx = cp.dev_matrix_cmi(test.shape[0],1)
        cp.argmax_to_row(idx.vec, p)
        hidx  = idx.np.reshape(idx.h)
        return self.data_l.reshape(self.data.h)[hidx]
As suggested in the original post, this is a very fast method to
calculate the nearest neighbors. Instead of looping over all possible
pairs $x_i∈ X$ and $z_j∈ Z$ and calculating the distance

\[D_{ij} = \|x_i-z_j\|^2,\]


we first precalculate $\|x_i\|^2$ and $\|z_j\|^2$ and then use a fast
matrix multiplication and two additions to determine the distance
above (obtained by expanding the square):


\[D_{ij} = \|x_i\|^2 + \|z_j\|^2 - 2\,(XZ^\top)_{ij}.\]
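
For illustration, here is a minimal sketch of the same trick in plain
C++ (independent of the CUV code above; in practice the dot products
are computed in one big matrix multiplication, here they are written
out as loops):


#include <cstddef>
#include <vector>

// X: n x d, Z: m x d, both row-major; returns D as an n x m row-major matrix
std::vector<double> squared_distances(const std::vector<double>& X, std::size_t n,
                                      const std::vector<double>& Z, std::size_t m,
                                      std::size_t d)
{
    std::vector<double> xsq(n, 0.0), zsq(m, 0.0), D(n*m, 0.0);
    for (std::size_t i = 0; i < n; ++i)           // precompute ||x_i||^2
        for (std::size_t k = 0; k < d; ++k) xsq[i] += X[i*d+k]*X[i*d+k];
    for (std::size_t j = 0; j < m; ++j)           // precompute ||z_j||^2
        for (std::size_t k = 0; k < d; ++k) zsq[j] += Z[j*d+k]*Z[j*d+k];
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double dot = 0.0;                     // (X Z^T)_{ij}
            for (std::size_t k = 0; k < d; ++k) dot += X[i*d+k]*Z[j*d+k];
            D[i*m+j] = xsq[i] + zsq[j] - 2.0*dot; // = ||x_i - z_j||^2
        }
    return D;
}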


We can measure the speed of the implementation by running the above
program on the GPU or on the CPU. The CPU implementation uses cBLAS;
I'm using a 3.2 GHz CPU and a GTX580 for comparison.



Method              Time (s)   Relative speed
CUV CPU (cBLAS)     58.35      34
CUV GPU (cuBLAS)     1.71       1

To test this on MNIST, save the above as knn.py and run the following source:





import os
from knn import KNN
import numpy as np

class MNIST:
  """ This simply reads the MNIST files to main memory and converts them to float """
  def __init__(self,dir):
      from scipy.io.numpyio import fread
      fd = open(dir+'/train-labels.idx1-ubyte')
      fread(fd,8,'c')
      self.data_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 60000 )
      fd.close()

      fd = open(dir+'/train-images.idx3-ubyte')
      fread(fd,16,'c')
      self.data = np.fromfile(file=fd, dtype=np.uint8).reshape( (60000,784) )
      fd.close()

      fd = open(dir+'/t10k-images.idx3-ubyte')
      fread(fd,16,'c')
      self.test = np.fromfile(file=fd, dtype=np.uint8).reshape( (10000,784) )
      fd.close()

      fd = open(dir+'/t10k-labels.idx1-ubyte')
      fread(fd,8,'c')
      self.test_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 10000 )
      fd.close()
  def get_test(self):
      v = self.test.astype('float32').T.copy("F")
      t = self.test_labels
      return v,t
  def get(self):
      v = self.data.astype('float32').T.copy("F")
      t = self.data_labels
      return v,t

pg = MNIST("/home/local/datasets/MNIST");

data, data_l  = pg.get()
test, test_l = pg.get_test()

knn = KNN(data.T.copy("F"),data_l)

off, err_cnt = 5000, 0
for i in xrange(0,10000,off):
    pred = knn.run(test[:,i:i+off].T.copy("F"))
    err_cnt += (pred!=test_l[i:i+off]).sum()

print "Errors: ", err_cnt

2010-11-26

Hurricane in the Server Room

Our new toy arrived, with twelve cores, four GTX580 GPUs, four
terabytes of hard disk and 24 gigabytes of RAM. The thing looks quite
impressive from the inside:




Note the four GPUs on the left-hand side and the vertical stack of
fans in the center. Together with the CPU fans above the GPUs, it is
impossible to sit next to this machine and think clearly. It blows
your head off, or at least it sounds as if it would. Even when almost
idle, this machine produced so much heat that our server room could not
discharge it anymore. Amazing.




If you ever wondered: A current Ubuntu Server version seems to have no
problems with this specification.

2010-11-12

Benchmarking GPUs using the CUV library

Since I started programming GPUs, a few generations of these cards
have been released, most recently the GTX580. I always wondered how
good these actually are and how programs written for one GPU scale to
the next generation.


We have a test suite available here which is at least important
to us: the CUV library. Apart from unit tests checking the correctness of
the implementation, the library also has a few "tests" which measure
execution speed on GPU and CPU for comparison.


Instead of inventing new tests for the benchmark, we simply reuse the
speed tests which come with CUV, as they are probably relevant
use cases that the programmers optimized for anyway. A small perl
script that now resides in the scripts/ directory of CUV
identifies speed tests by their name (*_speed), runs them and parses
their output. We simply collect all values in a defined order and save
them to a file. A larger number of these files can then be analyzed
using a python script included in the same directory, which uses the
superb matplotlib library to draw a bar chart. We use a reference GPU
to compare relative timings.


To cut a long story short, here are the results comparing GTX285,
GTX295, GX2-9800, GTX480 and GTX580:



As expected, Fermi-generation cards perform a lot better than the
older generation, and the GTX580 also improves on the GTX480. Some
operations, which are apparently not implemented well, perform worse
on newer-generation cards. The real workhorses of our library have
definitely improved a lot over the generations, even though we did not
spend time optimizing them for the later cards.


We're not the first ones to compare these cards, of course. Legit Reviews compares the frame rates rendered by the GTX480 and GTX580,
finding only marginal improvements. Brightsideofnews runs many
rendering-related benchmarks as well (3DMark Vantage, Unigine Heaven,
Pripyat), with mixed results. However, we're more interested in
general-purpose computing (GPGPU) and in writing algorithms for GPU
hardware which then scale with new hardware as it becomes
available.