2011-02-08

Parallelization in Python using Weave and OpenMP

If you're using Python a lot, you have probably stumbled across the
PerformancePython howto at some point. Since inner loops are so
expensive in Python, it makes sense to move them to C++, and
PerformancePython shows how to do that. When standard tools like
NumPy and weave.blitz are not expressive enough, your best bet becomes
SciPy Weave. In Weave you write your C++ code in a string and access
your NumPy arrays via the Blitz converters. The code is recompiled
whenever it changes and can make use of arbitrary libraries, as we shall see.
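Before the real example, here is a minimal sketch of what weave.inline with the Blitz converters looks like (assuming an old SciPy that still ships scipy.weave; the array a and the doubling loop are just for illustration):

import numpy as np
from scipy.weave import inline, converters

a = np.arange(12, dtype=np.float64).reshape(3, 4)
nx, ny = a.shape                      # plain Python ints, converted to C int
code = """
// a is visible as a blitz::Array<double,2>; elements are accessed with ()
for (int i = 0; i < nx; ++i)
    for (int j = 0; j < ny; ++j)
        a(i, j) *= 2.0;               // modifies the NumPy array in place
"""
inline(code, ['a', 'nx', 'ny'], type_converters=converters.blitz)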

Parallelizing Weave Code: Per-Pixel Eigenvalues


Let's suppose we have a 3D image $D$, with the second-order derivatives
in the $x$, $y$ and $z$ directions given by $D_{xx}, D_{xy}, D_{yy}, \ldots$. We
want to calculate the eigenvalues of this matrix of second derivatives at
each pixel. For efficiency we use LAPACK, and for clarity we use Boost
uBLAS. Let's first write down the C++ code:

namespace ublas = boost::numeric::ublas;
namespace lapack = boost::numeric::bindings::lapack;
using namespace std;
typedef ublas::bounded_matrix<double,3,3,ublas::column_major> mat;
ublas::bounded_vector<double,3> lambda;
ublas::bounded_vector<double,34*3> work; // the 34*width is from syev.hpp

mat A;
int i,j,k,idx;
#pragma omp parallel for private(i,j,k,A,idx,lambda,work)
for(i=0;i<nx;i++){
  for(j=0;j<ny;j++){
    for(k=0;k<nz;k++){
      A(0,0) = Dxx(i,j,k);  // fill upper triangle of A
      A(1,1) = Dyy(i,j,k);
      A(2,2) = Dzz(i,j,k);
      A(0,1) = Dxy(i,j,k);
      A(0,2) = Dxz(i,j,k);
      A(1,2) = Dyz(i,j,k);

      // determine eigenvalues (N) of upper (U) triangle of A in lambda
      lapack::syev('N','U',A,lambda,lapack::workspace(work));
      lambda1(i,j,k) = lambda(0);  // eigenvalues in ascending order
      lambda2(i,j,k) = lambda(1);
      lambda3(i,j,k) = lambda(2);
    }
  }
}

Note that we use OpenMP here to parallelize the outermost for-loop using
a #pragma directive. It took me a while to figure out that you need
to explicitly name all variables that must be private to a thread in
this for-loop; otherwise you will see no speedup and/or weird
program behavior. A further restriction seems to be that the sizes of
these variables must be known at compile time (here I use
bounded_matrix and bounded_vector for that purpose). Finally, the loop
index must be of type int for OpenMP.
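To make the role of the private clause concrete, here is a stripped-down, runnable sketch (the 2D arrays trace, Dxx, Dyy and the string demo_code are hypothetical and separate from the real kernel above); every variable that is written inside the loop but declared outside of it has to appear in the private list:

import numpy as np
from scipy.weave import inline, converters

nx, ny = 64, 64
Dxx = np.random.randn(nx, ny)
Dyy = np.random.randn(nx, ny)
trace = np.empty((nx, ny))

demo_code = """
int i, j;
double tmp;                               // scratch value reused in every iteration
#pragma omp parallel for private(i, j, tmp)
for (i = 0; i < nx; i++) {
    for (j = 0; j < ny; j++) {
        tmp = Dxx(i, j) + Dyy(i, j);      // without private(tmp), threads would
        trace(i, j) = tmp;                //   overwrite each other's scratch value
    }
}
"""
inline(demo_code, ['trace', 'Dxx', 'Dyy', 'nx', 'ny'],
       extra_compile_args=['-fopenmp'], extra_link_args=['-lgomp'],
       type_converters=converters.blitz)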

The eigenvalue code above is only the body of a function. We save it in a
Python string called codestring and execute it as follows:

from scipy.weave import inline, converters
variables = "nx ny nz lambda1 lambda2 lambda3 Dxx Dyy Dzz Dxy Dxz Dyz".split()
inline(codestring,
       variables,
       extra_compile_args=['-O3', '-fopenmp'],
       extra_link_args=['-lgomp'],
       headers=[
           '<boost/numeric/bindings/lapack/syev.hpp>',
           '<boost/numeric/bindings/traits/ublas_matrix.hpp>',
           '<boost/numeric/bindings/traits/ublas_vector.hpp>',
           '<boost/numeric/ublas/matrix.hpp>',
           '<boost/numeric/ublas/banded.hpp>',
           '<boost/numeric/ublas/vector.hpp>',
           '<cmath>'],
       type_converters=converters.blitz,
       libraries=["lapack"])

A couple of things to point out here: since codestring is only the
body of a function, we need some other place for include files and
declarations. Include files are listed in headers; additional code
with declarations can be supplied via support_code. To use OpenMP, we
need the compiler flag -fopenmp, and the linker needs to link against
the gomp library. The type_converters argument tells Weave to copy
conversion routines into the generated program that convert the NumPy
arrays into Blitz arrays, which have convenient accessor operators.
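For completeness, a small sketch of the support_code hook mentioned above (the helper clamp01 and the 1D array a are made up for this illustration): declarations placed there end up before the generated function, so the code body can call them.

import numpy as np
from scipy.weave import inline, converters

support = """
inline double clamp01(double x) {          // helper visible to the code body
    return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x);
}
"""
code = """
for (int i = 0; i < nx; i++)
    a(i) = clamp01(a(i));
"""
a = np.random.randn(10)
nx = int(a.shape[0])
inline(code, ['a', 'nx'],
       support_code=support,
       type_converters=converters.blitz)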

Second Order Parallelization


We can go one step further and parallelize the resulting code over
multiple processors or even multiple computers. In Python, a comparatively
new and exciting way to achieve this is Parallel IPython. I'll leave
the details to a later post; instead, I'll point out a problem I ran
into while running Weave code in parallel on multiple machines.


When Weave compiles your code, it writes the generated source code to
~/.python26_compiled. The file name is derived from a hash of the
contained source code. This way, Weave knows when to recompile and
when it can reuse an old binary.
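If you are curious what Weave has cached, the directory can simply be listed (a sketch; the directory name contains the Python version, so it may differ from python26 on your machine):

import os, glob
cache = os.path.expanduser('~/.python26_compiled')
# the generated sources and compiled extension modules live here
for f in sorted(glob.glob(os.path.join(cache, '*'))):
    print(f)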


Now to the downsides:

Multiple cores
You will run into problems if you run multiple instances at the same
time: they all try to write the same source file simultaneously.
Multiple computers
You will hit the same problem if the machines share a home directory
over a network file system (NFS or the like).

My solution to the first problem is to compile once with a single
thread and then reuse the cached binary with all threads. The second
problem can be solved by symlinking ~/.python26_compiled to /tmp,
which is local to each machine.
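The second workaround can be scripted; here is a rough sketch (no error handling; the path names are assumptions matching the setup described above):

import os

cache = os.path.expanduser('~/.python26_compiled')
local = '/tmp/python26_compiled'          # node-local, not on NFS

if not os.path.islink(cache):
    if os.path.isdir(cache):
        os.rename(cache, cache + '.bak')  # keep the old cache out of the way
    if not os.path.isdir(local):
        os.makedirs(local)
    os.symlink(local, cache)              # Weave now compiles to local storage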
