2011-02-09

Profiling Boost Multi-Array

Worried by this StackOverflow thread, I did my own experiments profiling
the access speed of Boost MultiArrays.


First of all, the thread is correct: BOOST_MA needs about three times as
much time to iterate over the array as Native.


The additional cases I checked are quite interesting, though:



BOOST_IT
only has a 30% overhead over Native. Note that you can
get around the unwieldy iterator types by using the auto keyword
and -std=gnu++0x as a compiler argument (works for gcc version >= 4.4).

RAW_POINTER
is actually slower than Native. This is a typical
case of "do not try to be smarter than your compiler". Note that this
is the same algorithm as the one you get when you use std::fill (see the
sketch after this list).

unsigned int
index types are significantly slower (about 2x) in the
Native condition than int index types. This is unfortunate, because
intuitively one would choose unsigned int for array indexing.
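
To illustrate the first two points, here is a minimal sketch (separate from the benchmark below, using the same ImageArrayType typedef): the RAW_POINTER traversal written with std::fill, and the BOOST_IT loop once with its iterator types spelled out and once with auto.

#include <algorithm>
#include <boost/multi_array.hpp>

int main()
{
  const int X_SIZE = 400, Y_SIZE = 400;
  double* nativeMatrix = new double[X_SIZE * Y_SIZE];
  typedef boost::multi_array_ref<double, 2> ImageArrayType;
  ImageArrayType boostMatrix(nativeMatrix, boost::extents[X_SIZE][Y_SIZE]);

  // RAW_POINTER expressed with std::fill: the same linear traversal over data()
  std::fill(boostMatrix.data(), boostMatrix.data() + X_SIZE * Y_SIZE, 2.345);

  // BOOST_IT without auto: the iterator types get verbose quickly
  for (ImageArrayType::iterator it = boostMatrix.begin(); it != boostMatrix.end(); ++it)
  {
      ImageArrayType::reference row = *it;  // a 1D sub-array view of one row
      for (ImageArrayType::reference::iterator it2 = row.begin(); it2 != row.end(); ++it2)
          *it2 = 2.345;
  }

  // the same loop with auto (gcc >= 4.4, -std=gnu++0x)
  for (auto it = boostMatrix.begin(); it != boostMatrix.end(); ++it)
      for (auto it2 = (*it).begin(); it2 != (*it).end(); ++it2)
          *it2 = 2.345;

  delete[] nativeMatrix;
  return 0;
}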

My code:





#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS
#include <cstdio>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/multi_array.hpp>
using namespace boost::posix_time;

int main(int argc, char* argv[])
{
  typedef int idx;
  const idx X_SIZE = 400;
  const idx Y_SIZE = 400;
  const idx ITERATIONS = 5000;

  double *nativeMatrix = new double [X_SIZE * Y_SIZE];

  typedef boost::multi_array_ref<double, 2> ImageArrayType;
  ImageArrayType boostMatrix(nativeMatrix, boost::extents[X_SIZE][Y_SIZE]);    

  // Native condition
  ptime startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
      for (idx y = 0; y < Y_SIZE; ++y)
          for (idx x = 0; x < X_SIZE; ++x)
              nativeMatrix[x + (y * X_SIZE)] = 2.345;
  ptime endTime = microsec_clock::universal_time();
  printf("[Native]Elapsed time: %6.3f seconds\n", time_period(startTime, endTime).length().total_milliseconds()/1000.f);

  // other conditions   
  startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
    {
#ifdef RAW_POINTER
      double* end = boostMatrix.data() + X_SIZE*Y_SIZE;
      for(double* begin=boostMatrix.data(); begin!=end; ++begin)
        *begin = 2.345;
#elif defined(BOOST_IT)
      for(auto it=boostMatrix.begin(); it!= boostMatrix.end(); ++it)
          for(auto it2=(*it).begin(); it2!=(*it).end(); ++it2)
              *it2 = 2.345;
#elif defined(BOOST_MA)
      for (idx y = 0; y < Y_SIZE; ++y)
         for (idx x = 0; x < X_SIZE; ++x)
             boostMatrix[y][x] = 2.345;
#endif
    }
  endTime = microsec_clock::universal_time();
  printf("[Boost] Elapsed time: %6.3f seconds\n", time_period(startTime,endTime).length().total_milliseconds()/1000.f );

  return 0;
}

All cases were compiled and executed using the line below, plus the respective -DBOOST_MA, -DBOOST_IT or -DRAW_POINTER define to select the second condition:


g++-4.4 -O3 -g0 -DNDEBUG -march=native -mtune=native -ffast-math -std=gnu++0x % && ./a.out

2011-02-08

Parallelization in Python using Weave and OpenMP

If you're using Python a lot, you have probably stumbled across the
Performance Python howto at some point. Since inner loops are so
expensive in Python, it makes sense to move them to C++, and
PerformancePython shows how to do that. When standard tools like
numpy and weave.blitz are not expressive enough, your best bet is
SciPy Weave. In Weave you write your C++ code in a string and access
your numpy arrays via the blitz converters. The code is recompiled
whenever it changes and can make use of arbitrary libraries, as we shall see.

Parallelizing Weave-Code: Per-Pixel eigenvalues


Let's suppose we have a 3D image $D$, with the second-order derivatives
in the $x$, $y$ and $z$ directions given by $D_{xx}, D_{xy}, D_{yy}, \ldots$. We
want to calculate, at each pixel, the eigenvalues of the symmetric 3x3 matrix
of these second derivatives. For efficiency, we use LAPACK, and for clarity,
we use Boost uBLAS. Let's first write down the C++ code:

namespace ublas = boost::numeric::ublas;
namespace lapack = boost::numeric::bindings::lapack;
using namespace std;
typedef ublas::bounded_matrix<double,3,3,ublas::column_major> mat;
ublas::bounded_vector<double,3> lambda;
ublas::bounded_vector<double,34*3> work; // the 34*width is from syev.hpp

mat A;
int i,j,k,idx;
#pragma omp parallel for private(i,j,k,A,idx,lambda,work)
for(i=0;i<nx;i++){
  for(j=0;j<ny;j++){
    for(k=0;k<nz;k++){
      A(0,0) = Dxx(i,j,k);  // fill upper triangle of A
      A(1,1) = Dyy(i,j,k);
      A(2,2) = Dzz(i,j,k);
      A(0,1) = Dxy(i,j,k);
      A(0,2) = Dxz(i,j,k);
      A(1,2) = Dyz(i,j,k);

      // determine eigenvalues (N) of upper (U) triangle of A in lambda
      lapack::syev('N','U',A,lambda,lapack::workspace(work));
      lambda1(i,j,k) = lambda(0);  // eigenvalues in ascending order
      lambda2(i,j,k) = lambda(1);
      lambda3(i,j,k) = lambda(2);
    }
  }
}

Note that we use OpenMP here to parallelize the outermost for-loop using
a #pragma directive. It took me a while to figure out that one needs
to explicitly name all variables which must be private to a thread in
this for-loop; otherwise you will either see no speedup at all or get weird
program behavior. A further restriction seems to be that the size of
these variables needs to be known at compile time (here I use
bounded_matrix and bounded_vector for that purpose). Finally, the loop index
must be of type int for OpenMP.
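
As a toy illustration of the private clause (a hypothetical stand-alone example, not the eigenvalue code): without private(tmp), all threads would share a single tmp and overwrite each other's values between the assignment and the use.

#include <cstdio>

int main()
{
  const int n = 1000;
  double sum = 0.0;
  double tmp;                        // per-thread scratch, playing the role of A, lambda and work above
#pragma omp parallel for private(tmp) reduction(+:sum)
  for (int i = 0; i < n; ++i)        // loop index has to be a plain int
  {
      tmp = 0.5 * i;                 // without private(tmp) this write would race
      sum += tmp;
  }
  printf("sum = %.1f\n", sum);       // 0.5 * n*(n-1)/2 = 249750.0
  return 0;
}

Compile with -fopenmp; without the flag the pragma is simply ignored and the program runs serially.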

That is only the body of a function. We save it in a Python string
called codestring and execute it as follows:

from scipy.weave import inline, converters
variables = "nx ny nz lambda1 lambda2 lambda3 Dxx Dyy Dzz Dxy Dxz Dyz".split()
inline(codestring,
       variables,
       extra_compile_args=['-O3', '-fopenmp'],
       extra_link_args=['-lgomp'],
       headers=[
           '<boost/numeric/bindings/lapack/syev.hpp>',
           '<boost/numeric/bindings/traits/ublas_matrix.hpp>',
           '<boost/numeric/bindings/traits/ublas_vector.hpp>',
           '<boost/numeric/ublas/matrix.hpp>',
           '<boost/numeric/ublas/banded.hpp>',
           '<boost/numeric/ublas/vector.hpp>',
           '<cmath>'],
       type_converters=converters.blitz,
       libraries=["lapack"])

A couple of things to point out here: since codestring is only the
body of a function, we need some other place for include files and
declarations. Include files are listed in headers; additional
declarations and helper code can be supplied using
support_code (see the sketch below). To use OpenMP, we need the compiler
argument -fopenmp, and the linker needs to link against the gomp library. The
type_converters argument tells Weave to copy conversion routines
into the generated program; these convert NumPy arrays into Blitz
arrays, which have convenient accessor operators.
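
For instance, the namespace aliases and the mat typedef could be moved out of codestring into a support_code string, together with any helper functions. The sketch below shows possible contents; the trace3 helper is hypothetical, just to illustrate that free functions can live there. The string is then passed to inline via the support_code argument.

// Possible contents of a support_code string: it is pasted at module scope of
// the generated extension, above the function that wraps codestring.
namespace ublas = boost::numeric::ublas;
typedef ublas::bounded_matrix<double,3,3,ublas::column_major> mat;

// hypothetical helper, callable from codestring
inline double trace3(const mat& A)
{
    return A(0,0) + A(1,1) + A(2,2);
}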

Second Order Parallelization


We can go one step further and parallelize the resulting code
across multiple processors or even computers. In Python, a comparatively
new and exciting way to achieve this is Parallel IPython. I'll leave
the details for a later posting; instead, I'll point out a problem I
found while running Weave code in parallel on multiple machines.


When Weave compiles your code, it writes the generated source code to
~/.python26_compiled. The file name is derived from a hash of the
contained source code. This way, Weave knows when to recompile and
when to reuse an old binary.


Now to the downside:

Multiple Cores
You will run into problems if you start multiple
instances at the same time: they all try to write the same source
file simultaneously.
Multiple Computers
You will get the same problem if the machines share a home
directory (via NFS or the like).

My solution to the first problem is to compile once with a single
thread, then run with all threads, reusing the cached binary. The second
problem can be solved by symlinking ~/.python26_compiled to /tmp.