one-nearest-neighbor classifier in CUV. Apart from a bug that popped
up, I was happy to see that no changes to the library were needed,
and the code is very short, so I'll share the implementation here.
There is a small annoyance which I'll fix later: there is no
`argmin_col`, so for now we resort to multiplying by `-1` and then
using `argmax_col` instead (see the sketch below). I'll clean this
mess up later, probably by adding an `argFunc_col` functor.
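The trick is simply that negating the distance matrix turns its column-wise minima into maxima. A minimal NumPy sketch of the equivalence (NumPy stands in here purely for illustration; it is not part of CUV):

```python
import numpy as np

# toy stand-in for a distance matrix
D = np.random.rand(5, 3)

# negating D turns column-wise minima into maxima, so an argmax
# over -D yields the same indices as an argmin over D would
assert (np.argmax(-D, axis=0) == np.argmin(D, axis=0)).all()
```

With that caveat out of the way, here is the implementation: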
```python
import cuv_python as cp
import pyublas
import numpy as np

class KNN:
    """Calculates the labels of a test set by determining the closest
    instance in a training set and returning the corresponding label."""

    def __init__(self, data, data_l):
        """
        data:   a (number_instances x dimensionality) numpy matrix
        data_l: a number_instances numpy vector containing the labels
        """
        self.data = cp.push(data)
        self.data_l = data_l
        # precalculate the squared norms ||x_i||^2 of all training instances
        self.dsq = cp.dev_matrix_cmf(self.data.h, 1)
        cp.reduce_to_col(self.dsq.vec, self.data, cp.reduce_functor.ADD_SQUARED)

    def __get_distance_matrix(self, test):
        """
        test:    a (number_test_instances x dimensionality) numpy matrix
        returns: a (number_instances x number_test_instances) CUV distance matrix
        """
        t = cp.push(test)
        assert t.w == self.data.w
        # squared norms ||z_j||^2 of the test instances
        tsq = cp.dev_matrix_cmf(t.h, 1)
        cp.reduce_to_col(tsq.vec, t, cp.reduce_functor.ADD_SQUARED)
        # p = -2 * X * Z^T ...
        p = cp.dev_matrix_cmf(self.data.h, t.h)
        cp.prod(p, self.data, t, 'n', 't', -2, 0)
        # ... then add ||x_i||^2 to every column and ||z_j||^2 to every row
        cp.matrix_plus_col(p, self.dsq.vec)
        cp.matrix_plus_row(p, tsq.vec)
        return p

    def run(self, test):
        """
        test:    a (number_test_instances x dimensionality) numpy matrix
        returns: the estimated labels of the test instances
        """
        p = self.__get_distance_matrix(test)
        p *= -1.  # no argmin supported yet, sorry
        idx = cp.dev_matrix_cmi(test.shape[0], 1)
        cp.argmax_to_row(idx.vec, p)
        hidx = idx.np.reshape(idx.h)
        return self.data_l.reshape(self.data.h)[hidx]
```

As suggested in the original post, this is a very fast method to
calculate the nearest neighbor. Instead of looping over all possible
pairs $x_i\in X$ and $z_j\in Z$ and calculating the distance
\[D_{ij} = \|x_i-z_j\|^2,\]
we first precalculate $\|x_i\|^2$ and $\|z_j\|^2$ and then use a single fast
matrix multiplication and two additions to determine the distance
above, derived by expanding the square:
\[D_{ij} = \|x_i\|^2 + \|z_j\|^2 - 2\,(XZ^T)_{ij}.\]
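As a quick sanity check of this identity, here is a NumPy comparison against the naive pairwise loop (small random matrices, purely illustrative and independent of CUV):

```python
import numpy as np

X = np.random.rand(100, 32)  # training instances, one per row
Z = np.random.rand(50, 32)   # test instances, one per row

# naive: loop over all pairs
D_naive = np.array([[((x - z) ** 2).sum() for z in Z] for x in X])

# expanded: ||x_i||^2 + ||z_j||^2 - 2 (X Z^T)_ij -- one matrix product
D_fast = (X ** 2).sum(axis=1)[:, None] \
       + (Z ** 2).sum(axis=1)[None, :] \
       - 2.0 * np.dot(X, Z.T)

assert np.allclose(D_naive, D_fast)
```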
We can measure the speed of the implementation by running the above
program on either the GPU or the CPU. The CPU implementation uses
cBLAS; for comparison, I'm using a 3.2 GHz CPU and a GTX 580.
| Method           | Time (s) | Relative speed |
|------------------|----------|----------------|
| CUV CPU (cBLAS)  | 58.35    | 34             |
| CUV GPU (cuBLAS) | 1.71     | 1              |
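The timing code itself isn't shown above; a minimal sketch of how such a measurement could look (assuming `knn` and `test` are set up as in the MNIST script below):

```python
import time

off = 5000
t0 = time.time()
for i in xrange(0, 10000, off):
    # knn.run copies its result back to the host,
    # so wall-clock time covers the GPU work
    knn.run(test[:, i:i+off].T.copy("F"))
print "elapsed: %.2f s" % (time.time() - t0)
```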
To test this on MNIST, save the above as `knn.py` and run the
following source:

```python
import os
from knn import KNN
import numpy as np

class MNIST:
    """This simply reads the MNIST files to main memory
    and converts them to float."""

    def __init__(self, dir):
        from scipy.io.numpyio import fread
        # each IDX file starts with a header (8 bytes for labels,
        # 16 bytes for images), which we skip before reading the data
        fd = open(dir + '/train-labels.idx1-ubyte')
        fread(fd, 8, 'c')
        self.data_labels = np.fromfile(file=fd, dtype=np.uint8).reshape(60000)
        fd.close()
        fd = open(dir + '/train-images.idx3-ubyte')
        fread(fd, 16, 'c')
        self.data = np.fromfile(file=fd, dtype=np.uint8).reshape((60000, 784))
        fd.close()
        fd = open(dir + '/t10k-images.idx3-ubyte')
        fread(fd, 16, 'c')
        self.test = np.fromfile(file=fd, dtype=np.uint8).reshape((10000, 784))
        fd.close()
        fd = open(dir + '/t10k-labels.idx1-ubyte')
        fread(fd, 8, 'c')
        self.test_labels = np.fromfile(file=fd, dtype=np.uint8).reshape(10000)
        fd.close()

    def get_test(self):
        v = self.test.astype('float32').T.copy("F")
        t = self.test_labels
        return v, t

    def get(self):
        v = self.data.astype('float32').T.copy("F")
        t = self.data_labels
        return v, t

pg = MNIST("/home/local/datasets/MNIST")
data, data_l = pg.get()
test, test_l = pg.get_test()
knn = KNN(data.T.copy("F"), data_l)

# classify the test set in batches of 5000 instances
off, err_cnt = 5000, 0
for i in xrange(0, 10000, off):
    pred = knn.run(test[:, i:i+off].T.copy("F"))
    err_cnt += (pred != test_l[i:i+off]).sum()
print "Errors: ", err_cnt
```
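If you want to double-check the GPU predictions, a pure-NumPy 1-NN using the same expanded-distance trick can be run on a small slice of the test set. This is an illustrative sketch, not part of CUV, and `nn_numpy` is a made-up name:

```python
import numpy as np

def nn_numpy(X, X_l, Z):
    """1-NN on the CPU via D = ||x_i||^2 + ||z_j||^2 - 2 (X Z^T)_ij.
    X: (n, d) training data, X_l: (n,) labels, Z: (m, d) test data."""
    D = (X ** 2).sum(axis=1)[:, None] \
      + (Z ** 2).sum(axis=1)[None, :] \
      - 2.0 * np.dot(X, Z.T)
    return X_l[np.argmin(D, axis=0)]

# e.g., compare the first 100 test instances against the GPU result:
# pred_cpu = nn_numpy(data.T, data_l, test[:, :100].T)
```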