expensive in Python, it makes sense to move them to C++, and
PerformancePython shows how to do that. When standard libraries like
NumPy and weave.blitz are not expressive enough, your best bet is
SciPy Weave. In Weave you write your C++ code in a string and access
your NumPy arrays via the blitz converters. The code is recompiled only when
it has changed, and it can make use of arbitrary libraries, as we shall see.
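To make the mechanics concrete, here is a toy sketch of the pattern (it is not related to the filter below): the C++ body lives in a Python string, the listed NumPy variables are made available to it through the blitz converters, and the first call compiles an extension module that later calls simply reuse.

import numpy as np
from scipy.weave import inline, converters

a = np.arange(10, dtype=np.float64)
n = int(a.shape[0])

# C++ body: with the blitz converters, 'a' behaves like a blitz::Array,
# so a(i) indexes it directly; return_val hands a scalar back to Python.
code = """
double s = 0.0;
for (int i = 0; i < n; i++)
    s += a(i) * a(i);
return_val = s;
"""

print(inline(code, ['a', 'n'], type_converters=converters.blitz))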
Parallelizing Weave Code: Per-Pixel Eigenvalues
Let's suppose we have a 3D image $D$, with the second-order derivatives
in the $x$, $y$ and $z$ directions given by $D_{xx}, D_{xy}, D_{yy}, \ldots$. We
want to calculate the eigenvalues at each pixel. For efficiency, we
use LAPACK, and for clarity, we use Boost uBLAS. Let's first write down
the C++ code:
namespace ublas  = boost::numeric::ublas;
namespace lapack = boost::numeric::bindings::lapack;
using namespace std;

typedef ublas::bounded_matrix<double,3,3,ublas::column_major> mat;
ublas::bounded_vector<double,3>    lambda;
ublas::bounded_vector<double,34*3> work;   // the 34*width is from syev.hpp
mat A;
int i,j,k,idx;

#pragma omp parallel for private(i,j,k,A,idx,lambda,work)
for(i=0;i<nx;i++){
    for(j=0;j<ny;j++){
        for(k=0;k<nz;k++){
            A(0,0) = Dxx(i,j,k);   // fill upper triangle of A
            A(1,1) = Dyy(i,j,k);
            A(2,2) = Dzz(i,j,k);
            A(0,1) = Dxy(i,j,k);
            A(0,2) = Dxz(i,j,k);
            A(1,2) = Dyz(i,j,k);
            // determine eigenvalues (N) of upper (U) triangle of A in lambda
            lapack::syev('N','U',A,lambda,lapack::workspace(work));
            lambda1(i,j,k) = lambda(0);   // eigenvalues in ascending order
            lambda2(i,j,k) = lambda(1);
            lambda3(i,j,k) = lambda(2);
        }
    }
}
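For reference (written out here for clarity), the problem solved at every voxel is the symmetric eigenvalue problem on the matrix of second derivatives; only its upper triangle is filled above, and syev returns the eigenvalues in ascending order:

$$
A(i,j,k) = \begin{pmatrix}
D_{xx} & D_{xy} & D_{xz}\\
D_{xy} & D_{yy} & D_{yz}\\
D_{xz} & D_{yz} & D_{zz}
\end{pmatrix},
\qquad
A\,v_m = \lambda_m v_m,
\qquad
\lambda_1 \le \lambda_2 \le \lambda_3 .
$$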
Note that we use OpenMP here to parallelize the outermost for-loop using a
#pragma directive. It took me a while to figure out that one needs to
explicitly name all variables which must be private to a thread in
this for-loop; otherwise you will not notice any speedup and/or you will get weird
program behavior. I suppose a further restriction is that the size of
the variables needs to be known at compile time (here I use
bounded_vector for that purpose). Finally, the loop index must be of type
int for OpenMP.

That is only the body of a function. We save it in a Python string
called codestring
and execute it as follows:

from scipy.weave import inline, converters

variables = "nx ny nz lambda1 lambda2 lambda3 Dxx Dyy Dzz Dxy Dxz Dyz".split()
inline(codestring, variables,
       extra_compile_args=['-O3', '-fopenmp'],
       extra_link_args=['-lgomp'],
       headers=['<boost/numeric/bindings/lapack/syev.hpp>',
                '<boost/numeric/bindings/traits/ublas_matrix.hpp>',
                '<boost/numeric/bindings/traits/ublas_vector.hpp>',
                '<boost/numeric/ublas/matrix.hpp>',
                '<boost/numeric/ublas/banded.hpp>',
                '<boost/numeric/ublas/vector.hpp>',
                '<cmath>'],
       type_converters=converters.blitz,
       libraries=["lapack"])
A couple of things to point out here: since codestring is only the
body of a function, we need some other place for include files and
declarations. Include files are listed in headers; additional code
with declarations can be supplied using support_code. To use OpenMP,
we need the compiler argument -fopenmp, and the linker needs to link the
gomp library. The type_converters argument tells Weave to copy the
conversion routines into the final program; these convert the NumPy arrays
into Blitz arrays, which have nice accessor operators.
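As a small, made-up illustration of support_code (the clamp helper below is not part of the eigenvalue filter), declarations placed there end up outside the generated function body, so the code string can call them like ordinary functions:

import numpy as np
from scipy.weave import inline, converters

# Helper supplied via support_code; it is emitted outside the function body
# that Weave generates around the code string.
support = """
inline double clamp(double v, double lo, double hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}
"""

body = """
for (int i = 0; i < n; i++)
    x(i) = clamp(x(i), 0.0, 1.0);
"""

x = np.linspace(-1.0, 2.0, 16)
n = int(x.shape[0])
inline(body, ['x', 'n'], support_code=support,
       type_converters=converters.blitz)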
Second Order Parallelization
We can go one step further and parallelize the resulting code
over multiple processors or even multiple computers. In Python, a comparatively
new and exciting way to achieve this is Parallel IPython. I'll leave
the details on this for a later posting; instead, I'll point to a
problem I found while running Weave code in parallel on multiple machines.
When Weave compiles your code, it writes the generated source code to
~/.python26_compiled. The file name is created from a hash over
the contained source code. This way, Weave knows when to recompile and
when to reuse an old binary.
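A quick way to peek at this cache (the exact file names depend on the hash of your code string, so the listing is only representative):

import os

# After the first inline() call, each distinct code string leaves behind a
# generated source file plus the compiled extension module named after its hash.
cache = os.path.expanduser('~/.python26_compiled')
for name in sorted(os.listdir(cache)):
    print(name)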
Now to the downside:
- Multiple cores: You will get problems if you run multiple
instances at the same time, because they try to write the same source code
file at the same time.
- Multiple computers: You will get the same problem if your home
directory is shared between the machines (NFS or the like).
My solution to problem number one is to compile once with a single
thread, then reuse the binary with all threads. Problem number two can be solved by
symlinking ~/.python26_compiled to /tmp.
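A minimal sketch of these workarounds (the function name below is made up; it assumes the cache really lives in ~/.python26_compiled, as above). For the first problem, calling your Weave-based function once from a single thread before fanning out is enough, since the binary is then already cached when the other threads start; the symlink fix for the second problem can look like this:

import os
import shutil

def relocate_weave_cache(local_dir='/tmp/python26_compiled'):
    """Replace ~/.python26_compiled with a symlink to node-local storage,
    so machines sharing a home directory over NFS no longer collide."""
    cache = os.path.expanduser('~/.python26_compiled')
    if os.path.islink(cache):
        return                      # already redirected on this node
    if os.path.isdir(cache):
        shutil.rmtree(cache)        # drop the shared cache; it will be rebuilt
    if not os.path.isdir(local_dir):
        os.makedirs(local_dir)
    os.symlink(local_dir, cache)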