2011-02-09

Profiling Boost Multi-Array

Worried by this StackOverflow thread, I did my own experiments profiling
the access speed of Boost MultiArrays.


First of all, the thread is correct: BOOST_MA needs about three times as
much time to iterate over the array as the Native condition.


The additional cases I checked are quite interesting, though:



BOOST_IT
only has about a 30% overhead over Native. Note that you can
get around the unwieldy iterator types by using the auto keyword
and -std=gnu++0x as a compiler argument (works for gcc versions >= 4.4).

RAW_POINTER
is actually slower than Native. This is a typical
case of "do not try to be smarter than your compiler". Note that this
is the same algorithm as the one you get when using std::fill (see the
sketch after this list).

unsigned int
index types are significantly slower (about 2x) in the Native
condition than int index types, presumably because the well-defined
wrap-around of unsigned indices inhibits some loop optimizations. This
is unfortunate, because intuitively one would choose unsigned int for
array indexing.
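
For reference, here is a minimal sketch of the std::fill equivalent of the
RAW_POINTER case, using the boostMatrix defined in the code below
(requires <algorithm>):

std::fill(boostMatrix.data(),
          boostMatrix.data() + boostMatrix.num_elements(),
          2.345);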

My code:





#include <boost/date_time/posix_time/posix_time.hpp>
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS 
#include <boost/multi_array.hpp>
#include <cstdio>
using namespace boost::posix_time; 

int main(int argc, char* argv[])
{
  typedef int idx;
  const idx X_SIZE = 400;
  const idx Y_SIZE = 400;
  const idx ITERATIONS = 5000;

  double *nativeMatrix = new double [X_SIZE * Y_SIZE];

  typedef boost::multi_array_ref<double, 2> ImageArrayType;
  ImageArrayType boostMatrix(nativeMatrix, boost::extents[X_SIZE][Y_SIZE]);    

  // Native condition
  ptime startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
      for (idx y = 0; y < Y_SIZE; ++y)
          for (idx x = 0; x < X_SIZE; ++x)
              nativeMatrix[x + (y * X_SIZE)] = 2.345;
  ptime endTime = microsec_clock::universal_time();
  printf("[Native]Elapsed time: %6.3f seconds\n", time_period(startTime, endTime).length().total_milliseconds()/1000.f);

  // other conditions   
  startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
    {
#ifdef RAW_POINTER
      double* end = boostMatrix.data() + X_SIZE*Y_SIZE;
      for(double* begin=boostMatrix.data(); begin!=end; ++begin)
        *begin = 2.345;
#elif defined(BOOST_IT)
      for(auto it=boostMatrix.begin(); it!= boostMatrix.end(); ++it)
          for(auto it2=(*it).begin(); it2!=(*it).end(); ++it2)
              *it2 = 2.345;
#elif defined(BOOST_MA)
      for (idx y = 0; y < Y_SIZE; ++y)
         for (idx x = 0; x < X_SIZE; ++x)
             boostMatrix[y][x] = 2.345;
#endif
    }
  endTime = microsec_clock::universal_time();
  printf("[Boost] Elapsed time: %6.3f seconds\n", time_period(startTime,endTime).length().total_milliseconds()/1000.f );

  return 0;
}

All compiled and executed using


g++-4.4 -O3 -g0 -DNDEBUG -march=native -mtune=native -ffast-math -std=gnu++0x % && ./a.out

2011-02-08

Parallelization in Python using Weave and OpenMP

If you're using python a lot, you have probably stumbled across the
performance python howto at some point. Since inner loops are so
expensive in python, it makes sense to move them to C++, and
PerformancePython shows how to do that. When standard libraries like
numpy and weave.blitz are not expressive enough, your best bet is
SciPy Weave. In weave you write your C++ code in a string and access
your numpy arrays via the blitz converters. The code is recompiled only
when it changes and can make use of arbitrary libraries, as we shall see.

Parallelizing Weave-Code: Per-Pixel eigenvalues


Let's suppose we have a 3D image $D$, with the second-order derivatives
in the $x$, $y$ and $z$ directions given by $D_{xx}, D_{xy}, D_{yy}, \ldots$. We
want to calculate the eigenvalues at each pixel. For efficiency we
use LAPACK, and for clarity we use Boost uBLAS. Let's first write down
the C++ code:

namespace ublas = boost::numeric::ublas;
namespace lapack = boost::numeric::bindings::lapack;
using namespace std;
typedef ublas::bounded_matrix<double,3,3,ublas::column_major> mat;
ublas::bounded_vector<double,3> lambda;
ublas::bounded_vector<double,34*3> work; // the 34*width is from syev.hpp

mat A;
int i,j,k,idx;
#pragma omp parallel for private(i,j,k,A,idx,lambda,work)
for(i=0;i<nx;i++){
  for(j=0;j<ny;j++){
    for(k=0;k<nz;k++){
      A(0,0) = Dxx(i,j,k);  // fill upper triangle of A
      A(1,1) = Dyy(i,j,k);
      A(2,2) = Dzz(i,j,k);
      A(0,1) = Dxy(i,j,k);
      A(0,2) = Dxz(i,j,k);
      A(1,2) = Dyz(i,j,k);

      // determine eigenvalues (N) of upper (U) triangle of A in lambda
      lapack::syev('N','U',A,lambda,lapack::workspace(work));
      lambda1(i,j,k) = lambda(0);  // eigenvalues in ascending order
      lambda2(i,j,k) = lambda(1);
      lambda3(i,j,k) = lambda(2);
    }
  }
 }  

Note that we use OpenMP here to parallelize the outermost for-loop using
a #pragma directive. It took me a while to figure out that one needs
to explicitly name all variables which must be private to a thread in
this for-loop; otherwise you will either not see any speedup or observe
weird program behavior. I suppose a further restriction is that the sizes
of these variables need to be known at compile time (here I use
bounded_matrix and bounded_vector for that purpose). Finally, the loop
index must be of type int for OpenMP.

That is only the body of a function. We save this in a python string
called codestring and execute it as follows:

from scipy.weave import inline, converters
variables = "nx ny nz lambda1 lambda2 lambda3 Dxx Dyy Dzz Dxy Dxz Dyz".split()
inline(codestring,
       variables,
       extra_compile_args =['-O3 -fopenmp'],
       extra_link_args=['-lgomp'],
       headers=[
           '<boost/numeric/bindings/lapack/syev.hpp>',
           '<boost/numeric/bindings/traits/ublas_matrix.hpp>',
           '<boost/numeric/bindings/traits/ublas_vector.hpp>',
           '<boost/numeric/ublas/matrix.hpp>',
           '<boost/numeric/ublas/banded.hpp>',
           '<boost/numeric/ublas/vector.hpp>',
           '<cmath>'],
       type_converters=converters.blitz,
       libraries=["lapack"])

A couple of things to point out here: since codestring is only the
body of a function, we need some other place for include files and
declarations. Include files are listed in headers; additional code
such as declarations can be supplied via support_code. To use OpenMP,
we need the compiler argument -fopenmp, and the linker has to link the
gomp library. The type_converters argument tells Weave to copy
conversion routines into the generated program which convert the NumPy
arrays into Blitz arrays, which have convenient accessor operators.

Second Order Parallelization


We can go one step further and parallelize the resulting code
over multiple processors or even computers. In Python, a comparatively
new and exciting way to achieve this is Parallel IPython. I'll leave
the details to a later post; instead, I'll point out a problem I ran
into while running Weave code in parallel on multiple machines.


When Weave compiles your code, it writes the generated source code to
~/.python26_compiled. The file name is created from a hash code over
the contained source code. This way, Weave knows when to recompile and
when to reuse an old binary.


Now to the downside:

Multiple Cores
You will run into problems if you run multiple instances at the same
time: they all try to write the same source file simultaneously.
Multiple Computers
You get the same problem on multiple machines if they share a home
directory over a network file system (NFS or the like).

My solution to the first problem is to compile once with a single
thread and then reuse the binary from all threads. The second problem
can be solved by symlinking ~/.python26_compiled to /tmp.

2011-01-27

Dijkstra on boost grid graph

There is a whole science of finding tubular structures in
3D-images. It started sometime in the late nineties. One of the most
cited papers1 in this area shows how to use eigenvalues of the
local second order derivatives to identify tubular structures
heuristically. It also shows how to combine the results on multiple scales.



Raw Data
After Pre-processing
That is quite neat, but the approach has a major problem: it does not
assume any global structure. Where the data shows no local evidence
for tubular structures, you simply do not see them, and furthermore
you need a method that allows you to say "points x and y lie on the
same tube". This, and determining properties of the tubes, are the
problems which most of the follow-up research strives to solve.

I wanted a quick-and-dirty approach to the problem: for a given seed
point, I would like to know the connectivity of the tubular structure
to this seed point. I pre-processed the raw image using the method of
Sato et al.1 and determined a seed point. The graph is defined
implicitly over the 3D image using voxel neighborhoods, so in theory
we could just run Dijkstra on it. I was lazy, however, and did not
want to implement it all myself.



BGL Grid Graph


A quick google search turned up boost::grid_graph, a relatively new
member of the boost graph library (BGL). It defines a graph on a grid
without explicitly storing all vertices and edges. Neat. The
documentation is still a bit short and sadly only a 6-neighborhood
(in 3D) is supported, but who'd expect a free lunch?


Using it is (at first) easy as pie:





#include <boost/graph/grid_graph.hpp>
using namespace boost;

typedef grid_graph<3> graph_t;
typedef graph_traits<graph_t> Traits;
typedef Traits::edge_descriptor edge_descriptor;
typedef Traits::vertex_descriptor vertex_descriptor;
typedef Traits::vertices_size_type vidx_t; // type of the grid dimensions

// ...

boost::array<vidx_t, 3> lengths = { { 256, 256, 256 } };
graph_t graph(lengths, false); // do not wrap any dimensions

Defining properties of edges in a grid graph


In contrast to graphs with internal properties, we have to define
external properties for graph_t. This is currently not well
documented, and it took me some time to figure out the following:



  • In most BGL examples, nodes are indexed using an integer type. The
    grid_graph, however, uses n-tuples (boost::array) as vertex
    descriptors. This means most examples cannot be applied to
    grid_graph in a straightforward way: you cannot index an array by a
    tuple without doing some tricks first.

  • We would like to have an arbitrary function of the two endpoint
    nodes as the edge weight. Most BGL examples simply use an array for
    whichever property they put in their graph.

The first problem can be solved using shared array property maps
(which do not seem to be well-documented either):





#include <boost/property_map/shared_array_property_map.hpp>

shared_array_property_map<vertex_descriptor,
  property_map<graph_t, vertex_index_t>::const_type>
  p_map(num_vertices(graph), get(vertex_index, graph));
shared_array_property_map<double,
  property_map<graph_t, vertex_index_t>::const_type>
  d_map(num_vertices(graph), get(vertex_index, graph));

We will use p_map as our predecessor map: it maps each vertex (via its
integer vertex index) to the vertex descriptor (n-tuple) of its
predecessor.

The second map d_map is a mapping from vertex index to double, which
will be our distance map.


Finally, we need to create an edge_weight property map for graph_t, and
it should be backed by a function, not an array. For this, I peeked at
W.P. McNeill's Boost-Implicit-Graph example on GitHub.





struct edge_weight_map;
namespace boost {
  template<>
  struct property_map< graph_t, edge_weight_t > {
    typedef edge_weight_map type;
    typedef edge_weight_map const_type;
  };
}

/*
   Map from edges to weight values
*/
struct edge_weight_map {
  typedef double value_type;
  typedef value_type reference;
  typedef edge_descriptor key_type;
  typedef boost::readable_property_map_tag category;
  const graph_t& m_graph;
  edge_weight_map(const graph_t& g)
  :m_graph(g) { }
  reference operator[](key_type e) const; // implemented below
};

typedef boost::property_map<graph_t, boost::edge_weight_t>::const_type
        const_edge_weight_map;
typedef boost::property_traits<const_edge_weight_map>::reference
        edge_weight_map_value_type;
typedef boost::property_traits<const_edge_weight_map>::key_type
        edge_weight_map_key;

namespace boost{
    // PropertyMap valid expressions
    edge_weight_map_value_type get(const_edge_weight_map pmap, edge_weight_map_key e) {
        return pmap[e]; }
    // ReadablePropertyGraph valid expressions
    const_edge_weight_map get(boost::edge_weight_t, const graph_t&g) {
        return const_edge_weight_map(g); }
    edge_weight_map_value_type get(boost::edge_weight_t tag, const graph_t& g, edge_weight_map_key e) {
        return get(tag, g)[e]; }
}

// whoa, lots of typedefs, but now we can write the function that we
// wanted to write: Map edges to weights!
edge_weight_map::reference
edge_weight_map::operator[](key_type e) const {
    vertex_descriptor t = target(e,m_graph);
    vertex_descriptor s = source(e,m_graph);
    return f(s, t); // f: your own weight function of the two endpoint voxels
}

That was nastier than I expected, but it is still not that much
code to write.
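
For concreteness, here is one possible f, purely for illustration (it is
not necessarily the function used for the figures below). It assumes the
pre-processed intensity volume is available as the multi_array A from the
next section, and it makes bright (tube-like) voxels cheap to step onto;
define it before the operator[] above uses it:

double f(vertex_descriptor /*s*/, vertex_descriptor t) {
    // grid_graph vertex descriptors are boost::arrays of coordinates
    return 255.0 - static_cast<double>(A[t[0]][t[1]][t[2]]);
}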

Representing the data in a Boost Multi-Array


Finally, we need to represent the 3D data. I hate dealing with raw
memory, so I always wrap it in some convenient abstraction at the first
chance I get. 3D arrays are definitely a case for Boost Multi-Array.
Using it is not a big deal, but I want to advertise its simplicity here:





#include <fstream>

typedef boost::multi_array_ref<unsigned char, 3> array_type;

std::ifstream dat("raw.dat", std::ios::in | std::ios::binary);
unsigned char* data = new unsigned char[256*256*256];
dat.read((char*)data,256*256*256);
array_type A(data,boost::extents[256][256][256]);
dat.close();

By the way, you can also change the index base of this array as well as
the storage (memory) order, so if you need to port Matlab (TM) code to a
programming language (no adjective here on purpose), it can hardly get
any simpler.
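
A minimal sketch of both features, reusing the buffer from above (the
Fortran storage order and the index base of 1 are of course only needed
if you want to mimic Matlab conventions):

array_type B(data, boost::extents[256][256][256],
             boost::fortran_storage_order());
B.reindex(1);                  // indices now run from 1 to 256 in every dimension
unsigned char v = B[1][1][1];  // the first element of the volume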

Finish line: Running Dijkstra


This is trivial now and straight from the book:





boost::array<vidx_t, 3> start  = { { 109, 129,  24 } };
dijkstra_shortest_paths(graph, start
        ,predecessor_map(p_map)
        .distance_map(d_map)
        );
Note the "." before distance_map, this is a named argument called
distance_map, which is set to d_map. The same goes for p_map.


Thresholded Distances
We can now trace paths starting at points which have a high value in
the raw image:


Traced paths starting at thresholded raw data
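
Tracing itself is just a walk through the predecessor map. A minimal
sketch, assuming the p_map from above is still in scope (requires
<vector>; dijkstra_shortest_paths initializes the start vertex, and every
unreachable vertex, to be its own predecessor):

// follow the predecessor map from voxel v back towards the start vertex
std::vector<vertex_descriptor> trace_path(vertex_descriptor v) {
    std::vector<vertex_descriptor> path;
    path.push_back(v);
    while (get(p_map, v) != v) {
        v = get(p_map, v);
        path.push_back(v);
    }
    return path;
}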

Footnotes:

1 "3D multi-scale line filter for segmentation and visualization
of curvilinear structures in medical images" by Sato et al.,
CVRMed-MRCAS, 1997 (PDF)

2010-12-15

Nearest Neighbor Classifier in CUV

After reading this blog post by Alex Smola, I implemented a
one-nearest-neighbor classifier in CUV. Apart from a bug that popped
up, I was happy to see that no changes to the library were needed,
and the code is very short, so I'll share the implementation here.


There is a small annoyance which I'll fix later:



  • there is no argmin_col, so for now we resort to multiplying by
    -1 and then using argmax_col. I'll clean this mess up later,
    probably by adding an argFunc_col functor.

import cuv_python as cp
import pyublas
import numpy as np

class KNN:
    """ calculates the labels of a test set by determining the closest
    instance in a training set and returning the corresponding label."""
    def __init__(self, data, data_l):
        """
        data:   a (number_instances x dimensionality) numpy matrix
        data_l: a number_instances numpy vector containing the labels
        """
        self.data   = cp.push(data)
        self.data_l = data_l
        self.dsq    = cp.dev_matrix_cmf(self.data.h,1)
        cp.reduce_to_col(self.dsq.vec,self.data,cp.reduce_functor.ADD_SQUARED)
    def __get_distance_matrix(self, test):
        """
        test: a (number_test_instances x dimensionality) numpy matrix
        returns: a (number_instances x number_test_instances) CUV distance matrix
        """
        t   = cp.push(test)
        assert t.w == self.data.w
        tsq = cp.dev_matrix_cmf(t.h, 1)
        cp.reduce_to_col(tsq.vec,t,cp.reduce_functor.ADD_SQUARED)
        p   = cp.dev_matrix_cmf(self.data.h, t.h)
        cp.prod(p, self.data, t, 'n','t',-2, 0)
        cp.matrix_plus_col(p,self.dsq.vec)
        cp.matrix_plus_row(p,tsq.vec)
        return p
    def run(self,test):
        """
        test:    a (number_test_instances x dimensionality) numpy matrix
        returns: the estimated labels of the test instances
        """
        p = self.__get_distance_matrix(test)
        p *= -1.                # no argmin supported yet, sorry
        idx = cp.dev_matrix_cmi(test.shape[0],1)
        cp.argmax_to_row(idx.vec, p)
        hidx  = idx.np.reshape(idx.h)
        return self.data_l.reshape(self.data.h)[hidx]

As suggested in the original post, this is a very fast way to find the
nearest neighbor. Instead of looping over all possible pairs
$x_i\in X$ and $z_j\in Z$ and calculating the distance

\[D_{ij} = \|x_i-z_j\|^2,\]


we first precalculate $\|x_i\|^2$ and $\|z_j\|^2$ and then use a fast
matrix multiplication and two additions to determine the distance
above (derived by expanding the squared difference):


\[D_{ij} = \|x_i\|^2 + \|z_j\|^2 - 2\,(XZ^T)_{ij}.\]
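
Spelled out, the expansion behind this identity is

\[\|x_i - z_j\|^2 = \|x_i\|^2 - 2\,x_i^T z_j + \|z_j\|^2,
\qquad (XZ^T)_{ij} = x_i^T z_j,\]

so a single matrix product yields all cross terms at once, while the
squared norms are added per row and per column (which is exactly what
matrix_plus_col and matrix_plus_row do in the code above).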


We can measure the speed of the implementation by running the above
program on the GPU or the CPU. The CPU implementation uses cBLAS; I'm
using a 3.2 GHz CPU and a GTX580 for the comparison.



Method              Time (s)   Time relative to GPU
CUV CPU (cBLAS)        58.35                     34
CUV GPU (cuBLAS)        1.71                      1

To test this on MNIST, save the above as knn.py and run the following source:





import os
from knn import KNN
import numpy as np

class MNIST:
  """ This simply reads the MNIST files to main memory and converts them to float """
  def __init__(self,dir):
      from scipy.io.numpyio import fread
      fd = open(dir+'/train-labels.idx1-ubyte')
      fread(fd,8,'c')
      self.data_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 60000 )
      fd.close()

      fd = open(dir+'/train-images.idx3-ubyte')
      fread(fd,16,'c')
      self.data = np.fromfile(file=fd, dtype=np.uint8).reshape( (60000,784) )
      fd.close()

      fd = open(dir+'/t10k-images.idx3-ubyte')
      fread(fd,16,'c')
      self.test = np.fromfile(file=fd, dtype=np.uint8).reshape( (10000,784) )
      fd.close()

      fd = open(dir+'/t10k-labels.idx1-ubyte')
      fread(fd,8,'c')
      self.test_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 10000 )
      fd.close()
  def get_test(self):
      v = self.test.astype('float32').T.copy("F")
      t = self.test_labels
      return v,t
  def get(self):
      v = self.data.astype('float32').T.copy("F")
      t = self.data_labels
      return v,t

pg = MNIST("/home/local/datasets/MNIST");

data, data_l  = pg.get()
test, test_l = pg.get_test()

knn = KNN(data.T.copy("F"),data_l)

off, err_cnt = 5000, 0
for i in xrange(0,10000,off):
    pred = knn.run(test[:,i:i+off].T.copy("F"))
    err_cnt += (pred!=test_l[i:i+off]).sum()

print "Errors: ", err_cnt

2010-11-26

Hurricane in the Server Room

Our new toy arrived, with twelve cores, four GTX580 GPUs, four
terabytes of hard disk and 24 gigabytes of RAM. The thing looks quite
impressive from the inside:





Note the four GPUs on the left-hand side and the vertical stack of
fans in the center. Together with the CPU fans above the GPUs, it is
impossible to sit next to this machine and think clearly. It blows
your head off, or at least it sounds as if it would. Even when almost
idle, this machine produced so much heat that our server room could
not discharge it anymore. Amazing.




If you ever wondered: A current Ubuntu Server version seems to have no
problems with this specification.

2010-11-12

Benchmarking GPUs using the CUV library

Since I started programming GPUs, a few generations of these cards
have been released, most recently the GTX580. I always wondered how
good these actually are and how programs written for one GPU scale to
the next generation.


We have a test suite available here which is at least of importance
to us: the CUV library. Apart from unit tests checking the correctness
of the implementation, the library also has a few "tests" which measure
execution speed on GPU and CPU for comparison.


Instead of inventing new tests for the benchmark, we simply reuse the
speed tests which come with CUV, as they are probably relevant
use cases that the programmers optimized for anyway. A small perl
script, which now resides in the scripts/ directory of CUV,
identifies speed tests by their name (*_speed), runs them and parses
their output. We simply collect all values in a defined order and save
them to a file. A larger number of these files can then be analyzed
using a python script included in the same directory, which uses the
superb matplotlib library to draw a bar chart. We use a reference GPU
as the baseline for relative timings.


To cut a long story short, here are the results comparing GTX285,
GTX295, GX2-9800, GTX480 and GTX580:



As expected, Fermi-generation cards perform a lot better than the
older generations, and the GTX580 also improves on the GTX480. Some
operations, which are apparently not implemented well, perform worse
on newer-generation cards. The real workhorses of our library have
definitely improved a lot over the generations, even though we did not
spend time optimizing them for the later cards.


We're not the first ones to compare these cards, of course. Legit
Reviews compares the frame rates rendered by the GTX480 and GTX580,
finding only marginal improvements. Brightsideofnews runs many
rendering-related benchmarks as well (3DMark Vantage, Unigine Heaven,
Pripyat), with mixed results. However, we're more interested in
general-purpose computing (GPGPU) and in writing algorithms for GPU
hardware which then scale with new hardware as it becomes available.

2010-11-08

Directory Color Embedding

Do you also save all your experiment data in folder names which
encode the settings? I do. If you have many experiments, however,
this strategy makes comparing them hard, as it becomes difficult to
put everything in one plot. The default settings of plotting
programs such as gnuplot or matplotlib do not provide a large enough
range of colors, and even if they do, it becomes hard to assign the
colors in a meaningful way for exploratory data analysis. The script
here might come to the rescue.



Reading and pre-processing directory names


We first transform directory names such that test-paramA_0.1 becomes
test-paramA_0000.1, allowing easier string comparison.

import Levenshtein as L
import numpy as np
from colormath.color_objects import LabColor
import os, re

def fillfunc(mo):
    """ returns the matched number, with padded zeros to the front """
    s = mo.group(1)
    while(len(s)<6): s = '0'+s
    return s
def get_unified_names(fns):
    """ pads all numbers in the given directory names """
    strs = list(fns)
    for i in xrange(len(strs)):
        strs[i] = re.sub(r'(?<!\d)([\d.]+)(?!\d)', fillfunc, strs[i])
    return strs

Creating the distance matrix based on the Levenshtein string distance

The Levenshtein string distance measures how many edits (substitutions,
insertions, deletions) are needed to transform one string into the
other. With it we automatically derive a dissimilarity measure for the
unified directory strings:

strs = get_unified_names(fns)   # fns: the list of directory names
n = len(strs)
mat = np.zeros((n,n))
for i in xrange(n):
    for j in xrange(n):
        if j<=i: continue
        mat[i,j] = L.distance(strs[i],strs[j])
        mat[j,i] = mat[i,j]
mat = np.exp(-mat)              # map distances to (0,1]

Embedding the dissimilarity matrix using Multidimensional Scaling (MDS) in R


The distance matrix does not necessarily directly map to coordinates
in 3D, so we need an embedding algorithm which deals with distance
matrices only.


For simplicity, we call R from python with a saved dissimilarity matrix.

def emb(mat):
    np.savetxt("file.dat", mat)
    os.system("./analyse/emb.R")
    res = np.loadtxt("mds.dat")
    return res

Now to the part written in R, which only does the embedding and saves
the result in a text file:

#!/usr/bin/Rscript
library(MASS)
library(vegan)
tab <- read.table("file.dat", header = FALSE, sep=" ")
data.m    <- as.matrix(tab)
data.mds <- vegan::metaMDS(data.m, k=3, trymax=50)
write.table(data.mds$points, file="mds.dat", quote=F, row.names=F, col.names=F)

Transforming embedded coordinates into colors

The coordinates are in some – it seems arbitrary – range and
therefore need to be mapped to colors. We want similar colors to have
similar edit distance, we therefore map our coordinates directly to
Lab Color Space and then transform the Lab color coordinates to RGB:

def getcolor(embedded,i):
    #lab = LabColor(0.8,*embedded[i,:])
    lab = LabColor(*embedded[i,:])
    rgb = lab.convert_to('RGB', debug=False).get_rgb_hex()
    return rgb

The resulting string can be used directly as the color specification
of a matplotlib plot and is then best combined with picking to find
out more about a particularly interesting plot line.

Example Embedding of directory names