2012-01-19

GPU convolutions for neural networks

With all the popularity of deep learning, many researchers in the
field might wonder which framework is "right" for implementing their
experiments. For plain neural networks, the main workhorse is the
matrix multiplication, which can be accelerated a lot using graphics
processing units (GPUs). For convolutional architectures, the matrix
multiplication is essentially replaced by a convolution, and we would
like that to be fast on the GPU as well.


Neural net convolutions are somewhat special, since their filters
are 3D and pool over the input maps. Also, since they are usually
applied to many small "maps" at once, common FFT acceleration
techniques do not apply.
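
Written out (in my own notation, ignoring the usual
convolution-vs-correlation flip and border handling), the forward
pass computes each output map $y_o$ from all input maps $x_m$ at
once:

\[ y_o(r,c) \;=\; \sum_{m=1}^{n_{\mathrm{channels}}} \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} w_{o,m}(i,j)\, x_m(r+i-1,\; c+j-1), \]

so each filter $w_{o,\cdot}$ is a 3D array whose channel dimension is
summed out.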


For my own implementations, I compared 3 convolution implementations:

  • The convolutions that come with Theano (from git, 2011-1-14). This
    implementation is by far the most flexible, as we will see. It is
    based on the formerly separate, now Theano-integrated CudaNdarray
    library.
  • Alex Krizhevsky, a PhD student in Toronto, wrote two publicly
    available sets of convolution routines. We already integrated the
    first version of his convolutions in CUV.
  • Alex's new convolutions, created for cuda-convnet (svn, 2011-1-13),
    which are described as being "several times faster" than the first
    version.


Constraints


The (main) constraints of the three versions are quite different:



Implementation  Image Size   Memory ordering (row-major)            Other
Theano          any          (nImages, nChannels, imageH, imageW)    --
Alex old        square only  (nChannels, nImages, imageH*imageW)     nFilters % 2 == 0
Alex new        square only  (nChannels, imageH*imageW, nImages)     nFilters % 16 == 0

Regarding the restriction to square images, one can argue that in
natural image collections the shapes vary anyway, and for batch
processing the images have to be brought to a common (square) size in
any case.


The ordering is tricky. At first sight, Theano's ordering looks the
most intuitive. However, all operations which are functions of all
channels of a single pixel are somewhat awkward to optimize in it.
Alex's old and new orderings can both use efficient matrix-row
operations for such cross-channel functions. The "Alex old"
convolution has the disadvantage that the images of one batch are
neither in the columns nor in the rows of a matrix, so that final
"full" layers (for example in LeNet) require reordering the matrix.
The new convolutions have the images in the columns of a matrix,
solving the reordering problem, even though this ordering looks the
least intuitive.
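
To make the three orderings concrete, here is a small sketch (mine,
not taken from any of the libraries) of how the flat, row-major index
of one pixel value (image img, channel chan, position y, x) is
computed in each layout of the table above:


#include <cstddef>

// Flat row-major index of pixel (img, chan, y, x); nI = nImages,
// nC = nChannels, H/W = image height/width. Sketch only.
std::size_t idx_theano  (std::size_t img, std::size_t chan, std::size_t y, std::size_t x,
                         std::size_t nI, std::size_t nC, std::size_t H, std::size_t W){
    return ((img*nC + chan)*H + y)*W + x;     // (nImages, nChannels, imageH, imageW)
}
std::size_t idx_alex_old(std::size_t img, std::size_t chan, std::size_t y, std::size_t x,
                         std::size_t nI, std::size_t nC, std::size_t H, std::size_t W){
    return (chan*nI + img)*(H*W) + y*W + x;   // (nChannels, nImages, imageH*imageW)
}
std::size_t idx_alex_new(std::size_t img, std::size_t chan, std::size_t y, std::size_t x,
                         std::size_t nI, std::size_t nC, std::size_t H, std::size_t W){
    return (chan*(H*W) + y*W + x)*nI + img;   // (nChannels, imageH*imageW, nImages)
}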


I should also mention the "sparse" filter option in Alex's code,
which makes it possible to convolve only certain maps with a filter.
I'm not going into detail here, since Theano does not have this
feature and I want to compare execution times.


Speed


In the following table, all operations were computed 10 times and
the (wall clock) times averaged. For Theano, I varied the 'version'
parameter, but found that the auto-selection (-1) already picks the
best algorithm. I used a GTX480 and an Intel Xeon X5650 (2.67 GHz).



Execution speed of convolution packages

Version    Image Size    Filter Size  Type   Time (ms)  Comment
Naive CPU  32,8,176,176  32,8,7,7     fwd    34200
                                      dimg   26800
                                      dflt   n/a
Alex new   32,8,176,176  32,8,7,7     fwd    75
                                      dimg   90
                                      dflt   55
                                      trn    0.3        transposing the whole input batch
                                      total  220.3
Alex old   32,8,176,176  32,8,7,7     fwd    101
                                      dimg   240        plus error padding (3 ms)
                                      dflt   115        plus summing over batch (0.8 ms)
                                      total  459
Theano     32,8,176,176  32,8,7,7     fwd    268
                                      dimg   451
                                      dflt   281
                                      total  1000

Key:

Image Size
batch size, number of input maps, height, width
Filter Size
number of output maps, number of input maps, height, width
Type
fwd is the "forward pass" convolution, dimg is the
derivative w.r.t. the inputs and dflt is the derivative w.r.t. the
filters.

Discussion: I was quite surprised to see that Theano is comparatively
slow. It seems that Alex's new convolutions are indeed faster, albeit
not several times faster for the tested case (update: with patches
for small batch sizes kindly provided by Alex, the speed nearly
doubled!). The overhead of a transpose (to comply with the "weird"
memory layout) is negligible compared to the overall advantages. All
GPU implementations significantly outperform a naive CPU version
(just many nested for-loops). Note, however, that Theano is able to
generate efficient code for CPU convolutions.
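
For reference, the naive CPU baseline is really nothing more than
nested loops, roughly of the following shape (a sketch assuming the
Theano-style layouts from the table and a "valid" convolution, with
no attempt at optimization):


#include <cstddef>

// Naive forward convolution: images (nI, nC, H, W), filters (nF, nC, fH, fW),
// output (nI, nF, H-fH+1, W-fW+1), everything row-major.
void conv_fwd_naive(const float* img, const float* flt, float* out,
                    std::size_t nI, std::size_t nC, std::size_t H, std::size_t W,
                    std::size_t nF, std::size_t fH, std::size_t fW)
{
    const std::size_t oH = H-fH+1, oW = W-fW+1;
    for(std::size_t i=0; i<nI; ++i)                  // images
      for(std::size_t f=0; f<nF; ++f)                // output maps
        for(std::size_t y=0; y<oH; ++y)
          for(std::size_t x=0; x<oW; ++x){
            float sum = 0.f;
            for(std::size_t c=0; c<nC; ++c)          // input maps ("3D" filter)
              for(std::size_t dy=0; dy<fH; ++dy)
                for(std::size_t dx=0; dx<fW; ++dx)
                  sum += img[((i*nC+c)*H + y+dy)*W + x+dx]
                       * flt[((f*nC+c)*fH + dy)*fW + dx];
            out[((i*nF+f)*oH + y)*oW + x] = sum;
          }
}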


Combinations: Theano is quite flexible, but "Alex new" is fast. How
do we get the best of both worlds? It is interesting to note that the
memory layouts of the two convolutions are transposes of each other,
and that for just 0.3 ms (in the above setting) we can get from one
to the other. So we can have speed or flexibility, as needed.
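
Viewed as 2D matrices, a Theano batch is
(nImages × nChannels*imageH*imageW) while an "Alex new" batch is
(nChannels*imageH*imageW × nImages), so a plain transpose converts
one into the other. A minimal sketch of such a conversion (mine; on
the GPU one would of course use an optimized transpose kernel):


#include <cstddef>

// Copy a batch from the Theano layout, viewed as (nImages, nChannels*H*W),
// into the "Alex new" layout (nChannels*H*W, nImages): a plain transpose.
void theano_to_alex_new(const float* src, float* dst,
                        std::size_t nImages, std::size_t valuesPerImage)
{
    for(std::size_t i=0; i<nImages; ++i)
        for(std::size_t p=0; p<valuesPerImage; ++p)
            dst[p*nImages + i] = src[i*valuesPerImage + p];
}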


Maintenance concerns


Neither implementation is particularly well documented, but both are
well tested. At least for CudaNdarray, a successor is on the way. It
seems to me that optimized code at this level is mostly write-only
anyway.

2011-03-25

Easy Parallelization with C++0X lambda functions, Threading Building Blocks

Lambda functors in C++0X


The relatively new gcc-4.5 release supports lambda expressions,
which – in contrast to boost.lambda, boost.bind and the like – provide
easy capturing of variables from the surrounding context inside the
lambda expression. This finally makes the STL algorithms convenient
to use, for example:





#include <algorithm>
#include <vector>
#include <boost/lambda/lambda.hpp>
using namespace boost::lambda;  // provides the placeholder _1

struct add_val{
  add_val(double d):val(d){}
  void operator()(double& d){   // must not be a const reference: we modify d
    d+=val;
  }
  double val;
};


int main(){
  std::vector<double> v;
  // ... fill vector

  double inc = 3.0;

  // the old way of doing things
  add_val av(inc);
  std::for_each(v.begin(),v.end(),av);

  // using a boost.lambda function
  std::for_each(v.begin(),v.end(), _1+=inc);

  // using a c++0x lambda function
  std::for_each(v.begin(),v.end(), [=](double& d){d+=inc;});
}

The "old way" primarily has inconveniences for the programmer:
  • We need to define a struct outside the scope, even for tiny
    functionality.
  • We need to explicitly capture variables from the scope of the
    surrounding code (here: inc) in the struct.

Boost.Lambda tried to resolve this problem in an elegant way, I
believe. However, this approach still has shortcomings:



  • Variables have unintuitive names (_1, _2).
  • The code in the lambda expression is not really a block of code.
    It is an expression, parts of which may be evaluated at surprising
    times (subexpressions without placeholders are evaluated
    immediately).
  • Also, since it is an expression, conditionals and loops must be
    expressed in an awkward (that is, non-C++) way using if_, for_
    and so on.

The new C++0X lambda syntax gets rid of all these problems. While
the syntax looks a bit strange at first, it is much more readable
than boost.lambda constructs. The block of code is a real block in
the current scope, and all names from the surrounding code can be
captured by copy ([=]) or by reference ([&]).
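
A tiny example of my own to make the difference between the two
capture modes explicit:


#include <algorithm>
#include <vector>

int main(){
  std::vector<double> v(10, 1.0);
  double inc = 3.0, sum = 0.0;

  // [=] captures inc by copy; the lambda reads the copy, inc itself is untouched
  std::for_each(v.begin(), v.end(), [=](double& d){ d += inc; });

  // [&] captures sum by reference; the lambda modifies the original variable
  std::for_each(v.begin(), v.end(), [&](double d){ sum += d; });
  // sum is now 40.0
}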


You might wonder why we do not just use boost.foreach and get away
with writing





#include <boost/foreach.hpp>
#define foreach BOOST_FOREACH
// ...as before...
foreach(double& d, v){
  d+=inc;
}

// in c++0x, not yet implemented, this will be
for(double& d : v){
  d+=inc;
}
… which is an idiom quite well known from other languages. The fun
part is that we cannot change what for does, but we can change the
implementation of for_each. This is one of the things the Intel(R)
Threading Building Blocks (TBB) library does.

Parallelizing your for-loop by changing two lines


A nice trick now is to change your for-loops (whose iterations must
be independent of each other) like this:





#include <tbb/parallel_for_each.h>

// before
foreach(double& d, v){
  d+=inc;
}

// after
tbb::parallel_for_each(v.begin(),v.end(),[=](double& d){
  d+=inc;
});

… and everything in this loop is automatically run in parallel.

Similar things can be done with OpenMP:



int end = v.size();
#pragma omp parallel for
for(int i=0;i<end;i++){
   v[i]+=inc;
}

but this is already much more intrusive. Furthermore, the loop index
must be an int and the last index must be known in advance. While
variables may be passed in as copies (using firstprivate(inc)),
these variables are private to the thread, not to the "wrapped"
function. Of course, OpenMP also gives you much more fine-grained
control over the parallelization.

2011-02-09

Profiling Boost Multi-Array

Worried by this StackOverflow thread, I did my own experiments on
profiling the access speed in Boost MultiArrays.


First of all, the thread is correct: BOOST_MA needs about three times
as long to iterate over the array as Native.


The additional cases I checked are quite interesting, though:



BOOST_IT
only has a 30% overhead over Native. Note that you can get around the
obfuscated pointer types by using the auto keyword and -std=gnu++0x
as a compiler argument (works for gcc version >= 4.4).

RAW_POINTER
is actually slower than Native. This is a typical case of "do not try
to be smarter than your compiler". Note that this is the same
algorithm as the one you get when you use std::fill.

unsigned int
index types are significantly slower (about 2x) in the Native
condition than int index types. This is bad, because intuitively one
would choose unsigned int for array indexing.

My code:





#include <cstdio>
#include <boost/date_time/posix_time/posix_time.hpp>
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>
using namespace boost::posix_time;

int main(int argc, char* argv[])
{
  typedef int idx;
  const idx X_SIZE = 400;
  const idx Y_SIZE = 400;
  const idx ITERATIONS = 5000;

  double *nativeMatrix = new double [X_SIZE * Y_SIZE];

  typedef boost::multi_array_ref<double, 2> ImageArrayType;
  ImageArrayType boostMatrix(nativeMatrix, boost::extents[X_SIZE][Y_SIZE]);    

  // Native condition
  ptime startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
      for (idx y = 0; y < Y_SIZE; ++y)
          for (idx x = 0; x < X_SIZE; ++x)
              nativeMatrix[x + (y * X_SIZE)] = 2.345;
  ptime endTime = microsec_clock::universal_time();
  printf("[Native]Elapsed time: %6.3f seconds\n", time_period(startTime, endTime).length().total_milliseconds()/1000.f);

  // other conditions   
  startTime = microsec_clock::universal_time();
  for (idx i = 0; i < ITERATIONS; ++i)
    {
#ifdef RAW_POINTER
      double* end = boostMatrix.data() + X_SIZE*Y_SIZE;
      for(double* begin=boostMatrix.data(); begin!=end; ++begin)
        *begin = 2.345;
#elif defined(BOOST_IT)
      for(auto it=boostMatrix.begin(); it!= boostMatrix.end(); ++it)
          for(auto it2=(*it).begin(); it2!=(*it).end(); ++it2)
              *it2 = 2.345;
#elif defined(BOOST_MA)
      for (idx y = 0; y < Y_SIZE; ++y)
         for (idx x = 0; x < X_SIZE; ++x)
             boostMatrix[y][x] = 2.345;
#endif
    }
  endTime = microsec_clock::universal_time();
  printf("[Boost] Elapsed time: %6.3f seconds\n", time_period(startTime,endTime).length().total_milliseconds()/1000.f );

  return 0;
}

All compiled and executed using


g++-4.4 -O3 -g0 -DNDEBUG -march=native -mtune=native -ffast-math -std=gnu++0x % && ./a.out

2011-02-08

Parallelization in Python using Weave and OpenMP

If you're using Python a lot, you have probably stumbled across the
performance python howto at some point. Since inner loops are so
expensive in Python, it makes sense to move them to C++, and
PerformancePython shows how to do that. When standard tools like
numpy and weave.blitz are not expressive enough, your best bet
becomes SciPy Weave. In Weave you write your C++ code in a string and
access your numpy arrays via the blitz converters. The code is
recompiled whenever it changes and can make use of arbitrary
libraries, as we shall see.

Parallelizing Weave-Code: Per-Pixel eigenvalues


Let's suppose we have a 3D image $D$, with the second order
derivatives in $x$, $y$ and $z$ direction given by $Dxx, Dxy, Dyy,
\ldots$. We want to calculate the eigenvalues at each pixel. For
efficiency, we use LAPACK, and for clarity, we use Boost uBLAS. Let's
first write down the C++ code:

namespace ublas = boost::numeric::ublas;
namespace lapack = boost::numeric::bindings::lapack;
using namespace std;
typedef ublas::bounded_matrix<double,3,3,ublas::column_major> mat;
ublas::bounded_vector<double,3> lambda;
ublas::bounded_vector<double,34*3> work; // the 34*width is from syev.hpp

mat A;
int i,j,k,idx;
#pragma omp parallel for private(i,j,k,A,idx,lambda,work)
for(i=0;i<nx;i++){
  for(j=0;j<ny;j++){
    for(k=0;k<nz;k++){
      A(0,0) = Dxx(i,j,k);  // fill upper triangle of A
      A(1,1) = Dyy(i,j,k);
      A(2,2) = Dzz(i,j,k);
      A(0,1) = Dxy(i,j,k);
      A(0,2) = Dxz(i,j,k);
      A(1,2) = Dyz(i,j,k);

      // determine eigenvalues (N) of upper (U) triangle of A in lambda
      lapack::syev('N','U',A,lambda,lapack::workspace(work));
      lambda1(i,j,k) = lambda(0);  // eigenvalues in ascending order
      lambda2(i,j,k) = lambda(1);
      lambda3(i,j,k) = lambda(2);
    }
  }
 }  

Note that we use OpenMP here to parallelize the outermost for-loop
using a #pragma directive. It took me a while to figure out that one
needs to explicitly name all variables which must be private to a
thread in this for-loop; otherwise you will not notice any speedup
and/or will see weird program behavior. I suppose a further
restriction is that the size of these variables needs to be known at
compile time (here I use bounded_vector for that purpose). Finally,
the loop index must be of type int for OpenMP.

That is only the body of a function. We save it in a Python string
called codestring and execute it as follows:

from scipy.weave import inline, converters
variables = "nx ny nz lambda1 lambda2 lambda3 Dxx Dyy Dzz Dxy Dxz Dyz".split()
inline(codestring,
       variables,
       extra_compile_args =['-O3 -fopenmp'],
       extra_link_args=['-lgomp'],
       headers=[
           '<boost/numeric/bindings/lapack/syev.hpp>',
           '<boost/numeric/bindings/traits/ublas_matrix.hpp> ',
           '<boost/numeric/bindings/traits/ublas_vector.hpp> ',
           '<boost/numeric/ublas/matrix.hpp>',
           '<boost/numeric/ublas/banded.hpp>',
           '<boost/numeric/ublas/vector.hpp>',
           '<cmath>'],
       type_converters=converters.blitz,
       libraries=["lapack"])

A couple of things to point out here: since codestring is only the
body of a function, we need some other place for include files and
declarations. Include files are listed in headers; additional code
with declarations can be supplied using support_code. To use OpenMP,
we need the compiler argument -fopenmp, and the linker needs to link
the gomp library. The type_converters argument tells Weave to copy
conversion routines into the final program which convert the NumPy
arrays into Blitz arrays, which have nice accessor operators.

Second Order Parallelization


We can go one step further and parallelize the resulting code over
multiple processors or even computers. In Python, a comparatively new
and exciting way to achieve this is Parallel IPython. I'll leave the
details to a later post; instead, I'll point out a problem I found
while running Weave code in parallel on multiple machines.


When Weave compiles your code, it writes the generated source code to
~/.python26_compiled. The file name is created from a hash code over
the contained source code. This way, Weave knows when to recompile and
when to reuse an old binary.


Now to the downside:

Multiple Cores
You will get problems if you run multiple instances at the same time:
they try to write the same source code file simultaneously.
Multiple Computers
You will get the same problem if your home directory lives on a
shared file system (NFS or the like).

My solution to problem number one is to compile once with a single
thread, then re-run with all threads, reusing the compiled binary.
Problem two can be solved by symlinking ~/.python26_compiled to /tmp.

2011-01-27

Dijkstra on boost grid graph

There is a whole science of finding tubular structures in 3D images.
It started sometime in the late nineties. One of the most cited
papers [1] in this area shows how to use the eigenvalues of the local
second order derivatives to identify tubular structures
heuristically. It also shows how to combine the results across
multiple scales.


Figures: raw data, and the result after pre-processing.
That is quite neat, but the approach has a major problem: it does
not assume any global structure. Therefore, where the data lacks
local evidence for a tubular structure, you simply do not see one,
and furthermore you need a method that allows you to say "points x
and y lie on the same tube". This, and determining properties of the
tubes, are the problems which most of the follow-up research strives
to solve.

I wanted to have a quick-and-dirty approach to the problem: for a
given seed point, I would like to know the connectivity of the
tubular structure to this seed point. I pre-processed the raw image
using the method of Sato et al. [1] and determined the seed point.
The graph is defined implicitly over the 3D image using voxel
neighborhoods, so in theory we could just run Dijkstra on it. I was
lazy, however, and did not want to implement all of that myself.



BGL Grid Graph


A quick google search turned up boost::grid_graph, a relatively new
member of the boost graph library (BGL). It defines a graph on a grid
without explicitly representing all vertices and edges. Neat. The
documentation is still a bit short and, sadly, only a 6-neighborhood
(in 3D) is supported, but who'd expect a free lunch?


Using it is (at first) easy as pie:





typedef grid_graph<3> graph_t;
typedef graph_traits<graph_t> Traits;
typedef Traits::edge_descriptor edge_descriptor;
typedef Traits::vertex_descriptor vertex_descriptor;
typedef Traits::vertices_size_type vidx_t; // index type along one grid dimension

// ...

boost::array<vidx_t, 3> lengths = { { 256, 256, 256 } };
graph_t graph(lengths, false); // do not wrap any dimensions

Defining properties of edges in a grid graph


Contrary to graphs with internal properties, we have to define
external properties for graph_t. This is currently not documented
well, and it took me some time to figure out:



  • In most BGL examples, nodes are indexed using an integer type. The
    grid_graph, however, uses n-tuples of type boost::array<n>.
    This means most examples cannot be applied to grid_graph in a
    straightforward way: you cannot index an array by a tuple without
    doing some tricks first.

  • We would like to have an arbitrary function of the nodes as the
    weight. Most BGL examples simply use an array for whichever
    property they put in their graph.

The first problem can be solved using shared array property maps
(which do not seem to be well-documented either):





shared_array_property_map<vertex_descriptor,
  property_map<graph_t, vertex_index_t>::const_type>
  p_map(num_vertices(graph), get(vertex_index, graph));
shared_array_property_map<double,
  property_map<graph_t, vertex_index_t>::const_type>
  d_map(num_vertices(graph), get(vertex_index, graph));
We will use p_map as our predecessor map, which maps a vertex index (integer)
to a vertex descriptor (n-tuple).

The second map d_map is a mapping from vertex index to double, which
will be our distance map.


Finally, we need to create an edge_weight property map for graph_t,
and it should be a function, not an array. For this, I peeked at
W.P. McNeill's Boost-Implicit-Graph example on GitHub.





struct edge_weight_map;
namespace boost {
  template<>
  struct property_map< graph_t, edge_weight_t > {
    typedef edge_weight_map type;
    typedef edge_weight_map const_type;
  };
}

/*
   Map from edges to weight values
*/
struct edge_weight_map {
  typedef double value_type;
  typedef value_type reference;
  typedef edge_descriptor key_type;
  typedef boost::readable_property_map_tag category;
  const graph_t& m_graph;
  edge_weight_map(const graph_t& g)
  :m_graph(g) { }
  reference operator[](key_type e) const; // implemented below
};

typedef boost::property_map<graph_t, boost::edge_weight_t>::const_type
        const_edge_weight_map;
typedef boost::property_traits<const_edge_weight_map>::reference
        edge_weight_map_value_type;
typedef boost::property_traits<const_edge_weight_map>::key_type
        edge_weight_map_key;

namespace boost{
    // PropertyMap valid expressions
    edge_weight_map_value_type get(const_edge_weight_map pmap, edge_weight_map_key e) {
        return pmap[e]; }
    // ReadablePropertyGraph valid expressions
    const_edge_weight_map get(boost::edge_weight_t, const graph_t&g) {
        return const_edge_weight_map(g); }
    edge_weight_map_value_type get(boost::edge_weight_t tag, const graph_t& g, edge_weight_map_key e) {
        return get(tag, g)[e]; }
}

// whoa, lots of typedefs, but now we can write the function that we
// wanted to write: Map edges to weights!
edge_weight_map::reference
edge_weight_map::operator[](key_type e) const {
    vertex_descriptor t = target(e,m_graph);
    vertex_descriptor s = source(e,m_graph);
    // f = f(t,s)
    return f;
}
That was nastier than I expected, but it is still not that much code
to write.

Representing the data in a Boost Multi-Array


Finally, we need to represent the 3D data. I hate dealing with raw
memory, so I always wrap it in some convenient abstraction at the
first chance I get. 3D arrays are definitely a case for Boost
Multi-Array. Using it is not a big deal, but I want to advertise its
simplicity here:





typedef boost::multi_array_ref<unsigned char, 3> array_type;

std::ifstream dat("raw.dat", std::ios::in | std::ios::binary);
unsigned char* data = new unsigned char[256*256*256];
dat.read((char*)data,256*256*256);
array_type A(data,boost::extents[256][256][256]);
dat.close();
By the way, you can also change the base of this array as well as
the memory order, so if you need to port Matlab (TM) code to a
programming language (no adjective here on purpose), it can't get
much simpler.
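
For example, something along the following lines gives column-major
storage and 1-based indices (a sketch reusing data from above; I did
not need this for the data set here):


// Column-major ("Matlab-style") storage order and 1-based indices (sketch).
array_type B(data, boost::extents[256][256][256],
             boost::fortran_storage_order());
boost::array<array_type::index, 3> bases = {{ 1, 1, 1 }};
B.reindex(bases);   // B[1][1][1] now refers to the first element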

Finish line: Running Dijkstra


This is trivial now and straight from the book:





boost::array<vidx_t, 3> start  = { { 109, 129,  24 } };
dijkstra_shortest_paths(graph, start
        ,predecessor_map(p_map)
        .distance_map(d_map)
        );
Note the "." before distance_map, this is a named argument called
distance_map, which is set to d_map. The same goes for p_map.


Figure: thresholded distances.

We can now trace paths starting at points which have a high value in
the raw image:

Figure: traced paths starting at thresholded raw data.
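
The trace itself is just a walk along the predecessor map filled in
by dijkstra_shortest_paths; a minimal sketch (mine, not from the code
above):


#include <vector>

// Walk from a voxel v back to the seed along the predecessor map.
// dijkstra_shortest_paths makes the start vertex its own predecessor,
// which terminates the loop.
template<class PredMap>
std::vector<vertex_descriptor> trace_back(vertex_descriptor v, const PredMap& p_map)
{
    std::vector<vertex_descriptor> path(1, v);
    while(true){
        vertex_descriptor pred = get(p_map, v);
        if(pred == v) break;        // reached the seed
        path.push_back(pred);
        v = pred;
    }
    return path;
}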

Footnotes:

[1] "3D multi-scale line filter for segmentation and visualization of
curvilinear structures in medical images" by Sato et al.,
CVRMed-MRCAS, 1997.

2010-12-15

Nearest Neighbor Classifier in CUV

After reading this blog post by Alex Smola, I implemented a
one-nearest-neighbor classifier in CUV. Apart from a bug that popped
up, I was happy to see that no changes to the library were needed,
and the code is very short, so I'll share the implementation here.


There is a small annoyance which I'll fix later:



  • there is no argmin_col, so for now we instead resort to multiplying
    by -1 and then using argmax_col. I'll clean this mess up later,
    probably by adding an argFunc_col functor.
import cuv_python as cp
import pyublas
import numpy as np

class KNN:
    """ calculates the labels of a test set by determining the closest
    instance in a training set and returning the corresponding label."""
    def __init__(self, data, data_l):
        """
        data:   a (number_instances x dimensionality) numpy matrix
        data_l: a number_instances numpy vector containing the labels
        """
        self.data   = cp.push(data)
        self.data_l = data_l
        self.dsq    = cp.dev_matrix_cmf(self.data.h,1)
        cp.reduce_to_col(self.dsq.vec,self.data,cp.reduce_functor.ADD_SQUARED)
    def __get_distance_matrix(self, test):
        """
        test: a (number_test_instances x dimensionality) numpy matrix
        returns: a (number_instances x number_test_instances) CUV distance matrix
        """
        t   = cp.push(test)
        assert t.w == self.data.w
        tsq = cp.dev_matrix_cmf(t.h, 1)
        cp.reduce_to_col(tsq.vec,t,cp.reduce_functor.ADD_SQUARED)
        p   = cp.dev_matrix_cmf(self.data.h, t.h)
        cp.prod(p, self.data, t, 'n','t',-2, 0)
        cp.matrix_plus_col(p,self.dsq.vec)
        cp.matrix_plus_row(p,tsq.vec)
        return p
    def run(self,test):
        """
        test:    a (number_test_instances x dimensionality) numpy matrix
        returns: the estimated labels of the test instances
        """
        p = self.__get_distance_matrix(test)
        p *= -1.                # no argmin supported yet, sorry
        idx = cp.dev_matrix_cmi(test.shape[0],1)
        cp.argmax_to_row(idx.vec, p)
        hidx  = idx.np.reshape(idx.h)
        return self.data_l.reshape(self.data.h)[hidx]
As suggested in the original post, this is a very fast method to
calculate the nearest neighbors. Instead of looping over all possible
pairs $x_i\in X$ and $z_j\in Z$ and calculating the distance

\[D_{ij} = \|x_i-z_j\|^2,\]


we first precalculate $\|x_i\|^2$ and $\|z_j\|^2$ and then use one
fast matrix multiplication and two additions to determine the
distances above (derived by expanding the square):


\[D_{ij} = \|x_i\|^2 + \|z_j\|^2 - 2\,(XZ^T)_{ij}.\]


We can measure the speed of the implementation by running the above
program on the GPU or the CPU. The CPU implementation uses cBLAS; I'm
using a 3.2 GHz CPU and a GTX580 for comparison.



Method              Time (s)   Relative speed
CUV CPU (cBLAS)       58.35        34
CUV GPU (cuBLAS)       1.71         1

To test this on MNIST, save the above as knn.py and run the following source:





import os
from knn import KNN
import numpy as np

class MNIST:
  """ This simply reads the MNIST files to main memory and converts them to float """
  def __init__(self,dir):
      from scipy.io.numpyio import fread
      fd = open(dir+'/train-labels.idx1-ubyte')
      fread(fd,8,'c')
      self.data_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 60000 )
      fd.close()

      fd = open(dir+'/train-images.idx3-ubyte')
      fread(fd,16,'c')
      self.data = np.fromfile(file=fd, dtype=np.uint8).reshape( (60000,784) )
      fd.close()

      fd = open(dir+'/t10k-images.idx3-ubyte')
      fread(fd,16,'c')
      self.test = np.fromfile(file=fd, dtype=np.uint8).reshape( (10000,784) )
      fd.close()

      fd = open(dir+'/t10k-labels.idx1-ubyte')
      fread(fd,8,'c')
      self.test_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 10000 )
      fd.close()
  def get_test(self):
      v = self.test.astype('float32').T.copy("F")
      t = self.test_labels
      return v,t
  def get(self):
      v = self.data.astype('float32').T.copy("F")
      t = self.data_labels
      return v,t

pg = MNIST("/home/local/datasets/MNIST");

data, data_l  = pg.get()
test, test_l = pg.get_test()

knn = KNN(data.T.copy("F"),data_l)

off, err_cnt = 5000, 0
for i in xrange(0,10000,off):
    pred = knn.run(test[:,i:i+off].T.copy("F"))
    err_cnt += (pred!=test_l[i:i+off]).sum()

print "Errors: ", err_cnt

2010-11-26

Hurricane in the Server Room

Our new toy arrived, with twelve cores, four GTX580 GPUs, a
four-terabyte hard disk, and 24 gigabytes of RAM. The thing looks
quite impressive from the inside:





Note the four GPUs on the left hand side and the vertical stack of
fans in the center. Together with the CPU fans above the GPUs, it is
impossible to sit next to this machine and think clearly. It blows
your head off, or at least it sounds as if it would. Even when almost
idle, this machine produced so much heat that our server room could
not discharge it anymore. Amazing.




If you ever wondered: A current Ubuntu Server version seems to have no
problems with this specification.