2010-12-15

Nearest Neighbor Classifier in CUV

After reading this blog post by Alex Smola I implemented a
one-nearest-neighbor classifier in CUV. Apart from a bug that popped
up, I was happy to see that no changes to the library were needed,
and since the code is very short, I'll share the implementation here.


There is a small annoyance which I'll fix later:

  • there is no argmin_col, so for now we resort to multiplying by
    -1 and then using argmax_to_row instead. I'll clean this up
    later, probably by adding an argFunc_col functor. The workaround
    is sketched below.
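
In plain NumPy (not the CUV API), the workaround simply exploits the
fact that negating a matrix turns column minima into column maxima:

import numpy as np

# a small random "distance" matrix: rows are training instances,
# columns are test instances
D = np.random.rand(5, 3)

# the argmin of every column of D is the argmax of every column of -D
assert (np.argmax(-D, axis=0) == np.argmin(D, axis=0)).all()

With that said, here is the classifier: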
import cuv_python as cp
import pyublas
import numpy as np

class KNN:
    """ calculates the labels of a test set by determining the closest
    instance in a training set and returning the corresponding label."""
    def __init__(self, data, data_l):
        """
        data:   a (number_instances x dimensionality) numpy matrix
        data_l: a number_instances numpy vector containing the labels
        """
        self.data   = cp.push(data)   # move the training set to the GPU
        self.data_l = data_l
        # precalculate the squared norms ||x_i||^2 of all training instances
        self.dsq    = cp.dev_matrix_cmf(self.data.h,1)
        cp.reduce_to_col(self.dsq.vec,self.data,cp.reduce_functor.ADD_SQUARED)
    def __get_distance_matrix(self, test):
        """
        test: a (number_test_instances x dimensionality) numpy matrix
        returns: a (number_instances x number_test_instances) CUV distance matrix
        """
        t   = cp.push(test)
        assert t.w == self.data.w
        # squared norms ||z_j||^2 of the test instances
        tsq = cp.dev_matrix_cmf(t.h, 1)
        cp.reduce_to_col(tsq.vec,t,cp.reduce_functor.ADD_SQUARED)
        # p = -2*X*Z^T in a single matrix multiplication
        p   = cp.dev_matrix_cmf(self.data.h, t.h)
        cp.prod(p, self.data, t, 'n','t',-2, 0)
        # add ||x_i||^2 to every column and ||z_j||^2 to every row
        cp.matrix_plus_col(p,self.dsq.vec)
        cp.matrix_plus_row(p,tsq.vec)
        return p
    def run(self,test):
        """
        test:    a (number_test_instances x dimensionality) numpy matrix
        returns: the estimated labels of the test instances
        """
        p = self.__get_distance_matrix(test)
        p *= -1.                # no argmin supported yet, sorry
        # index of the closest training instance for every test instance
        idx = cp.dev_matrix_cmi(test.shape[0],1)
        cp.argmax_to_row(idx.vec, p)
        hidx  = idx.np.reshape(idx.h)   # pull the indices back to the host
        return self.data_l.reshape(self.data.h)[hidx]
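
A tiny usage sketch with random data, assuming the class above is
saved as knn.py (as we do for the MNIST experiment below); the
names and dimensions here are made up:

import numpy as np
from knn import KNN

train  = np.asfortranarray(np.random.rand(100, 20).astype('float32'))
labels = np.arange(100) % 10      # ten fake classes
test   = np.asfortranarray(np.random.rand(7, 20).astype('float32'))

knn  = KNN(train, labels)
pred = knn.run(test)              # one estimated label per test instance
print pred
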
As suggested in the original post, this is a very fast method to
calculate the nearest neighbor. Instead of looping over all possible
pairs $x_i\in X$ and $z_j\in Z$ and calculating the distance

\[D_{ij} = \|x_i-z_j\|^2,\]

we first precalculate $\|x_i\|^2$ and $\|z_j\|^2$ and then use a
single fast matrix multiplication and two additions to determine all
distances at once (derived by expanding the square $\|x_i-z_j\|^2$):

\[D_{ij} = \|x_i\|^2 + \|z_j\|^2 - 2\,(XZ^T)_{ij}.\]
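
To convince yourself that the expansion is correct, here is a small
NumPy sanity check (independent of CUV):

import numpy as np

X = np.random.rand(6, 4)    # training set, one instance per row
Z = np.random.rand(3, 4)    # test set, one instance per row

# naive version: loop over all pairs
D_loop = np.array([[np.sum((x - z)**2) for z in Z] for x in X])

# expanded version: one matrix product plus two broadcasted additions
D_fast = (X**2).sum(1)[:, np.newaxis] \
       + (Z**2).sum(1)[np.newaxis, :] \
       - 2 * np.dot(X, Z.T)

assert np.allclose(D_loop, D_fast)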


We can measure the speed of the implementation by running the above
program on either the GPU or the CPU. The CPU implementation uses
cBLAS; for the comparison I used a 3.2 GHz CPU and a GTX 580.



  Method              Time (s)   Relative speed
  CUV CPU (cBLAS)     58.35      34
  CUV GPU (cuBLAS)     1.71       1
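
If you want to reproduce the measurement, here is a minimal timing
sketch; it uses random stand-in data of MNIST-like dimensions, so it
only approximates the real workload:

import time
import numpy as np
from knn import KNN

data   = np.asfortranarray(np.random.rand(60000, 784).astype('float32'))
labels = np.random.randint(0, 10, 60000).astype(np.uint8)
test   = np.asfortranarray(np.random.rand(5000, 784).astype('float32'))

knn = KNN(data, labels)
t0  = time.time()
knn.run(test)
print "one batch of 5000 took %.2f s" % (time.time() - t0)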

To test this on MNIST, save the above as knn.py and run the following script:





import os
from knn import KNN
import numpy as np

class MNIST:
  """ This simply reads the MNIST files to main memory and converts them to float """
  def __init__(self,dir):
      # each IDX file starts with a header (8 bytes for label files,
      # 16 bytes for image files) which we skip before reading the raw data
      fd = open(dir+'/train-labels.idx1-ubyte','rb')
      fd.read(8)
      self.data_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 60000 )
      fd.close()

      fd = open(dir+'/train-images.idx3-ubyte','rb')
      fd.read(16)
      self.data = np.fromfile(file=fd, dtype=np.uint8).reshape( (60000,784) )
      fd.close()

      fd = open(dir+'/t10k-images.idx3-ubyte','rb')
      fd.read(16)
      self.test = np.fromfile(file=fd, dtype=np.uint8).reshape( (10000,784) )
      fd.close()

      fd = open(dir+'/t10k-labels.idx1-ubyte','rb')
      fd.read(8)
      self.test_labels = np.fromfile(file=fd, dtype=np.uint8).reshape( 10000 )
      fd.close()
  def get_test(self):
      # convert to float32 in column-major (Fortran) order for CUV
      v = self.test.astype('float32').T.copy("F")
      t = self.test_labels
      return v,t
  def get(self):
      v = self.data.astype('float32').T.copy("F")
      t = self.data_labels
      return v,t

pg = MNIST("/home/local/datasets/MNIST")

data, data_l  = pg.get()
test, test_l = pg.get_test()

knn = KNN(data.T.copy("F"),data_l)

# classify the test set in two batches of 5000 -- the full
# 60000x10000 distance matrix would be too large for the GPU
off, err_cnt = 5000, 0
for i in xrange(0,10000,off):
    pred = knn.run(test[:,i:i+off].T.copy("F"))
    err_cnt += (pred!=test_l[i:i+off]).sum()

print "Errors: ", err_cnt