2012-01-19

GPU convolutions for neural networks

With all the popularity of deep learning, many researchers in the
field might wonder which framework is the "right" one to implement
their experiments in. For plain neural networks, the main work horse
is the matrix multiplication, which can be accelerated considerably on
graphics processing units (GPUs). In convolutional architectures, the
matrix multiplication is largely replaced by a convolution, and we
would like that operation to be just as fast on the GPU.


Neural net convolutions are somewhat special, since their filters are
3D and pool over all input maps. Also, since they are usually applied
to many small "maps" at once, common FFT acceleration techniques do
not apply.
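
To make this concrete, here is a minimal, naive NumPy sketch of the
forward pass (a "valid"-mode correlation, as usual in neural nets).
The function and variable names are mine and not taken from any of the
libraries discussed below:

    import numpy as np

    def naive_conv_fwd(images, filters):
        # images:  (nImages, nChannels, imageH, imageW)
        # filters: (nFilters, nChannels, filterH, filterW)
        # returns: (nImages, nFilters, imageH-filterH+1, imageW-filterW+1)
        nI, nC, iH, iW = images.shape
        nF, _, fH, fW = filters.shape
        oH, oW = iH - fH + 1, iW - fW + 1
        out = np.zeros((nI, nF, oH, oW), dtype=images.dtype)
        for n in range(nI):
            for f in range(nF):
                for y in range(oH):
                    for x in range(oW):
                        # each filter spans *all* input maps of one image
                        out[n, f, y, x] = np.sum(
                            images[n, :, y:y+fH, x:x+fW] * filters[f])
        return out

This is also essentially the "naive CPU" baseline that is timed below.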


For my own work, I compared three convolution implementations:

  • The convolutions that come with Theano (from git, 2011-1-14). This
    implementation is by far the most flexible, as we will see. It is
    based on the formerly separate, now Theano-integrated CudaNdarray
    library.
  • Alex Krizhevsky, a PhD student in Toronto, wrote two publicly
    available sets of convolution routines. We already integrated the
    first version of his convolutions in CUV.
  • Alex' new convolutions, created for cuda-convnet (svn, 2011-1-13),
    which are described as being "several times faster" than the first
    version.


Constraints


The (main) constraints of the three versions are quite different:



Implementation | Image Size  | Memory-Ordering (row-major)          | Other
Theano         | any         | (nImages, nChannels, imageH, imageW) |
Alex old       | square only | (nChannels, nImages, imageH*imageW)  | nFilters%2==0
Alex new       | square only | (nChannels, imageH*imageW, nImages)  | nFilters%16==0

Regarding the square-image restriction, one can argue that in
arbitrary image collections the shapes vary anyway, and that for batch
processing the images have to be brought to a common (square) size in
any case.


The ordering is the tricky part. At first sight, Theano's ordering
looks the most intuitive. However, operations that are functions of
all channels of a single pixel are hard to optimize in this layout.
Alex' old and new orderings can both use efficient matrix-row
operations for such cross-channel functions. The "Alex old" ordering
has the disadvantage that the images of one batch end up neither in
the columns nor in the rows of a matrix, so that the final "fully
connected" layers (for example in LeNet) require reordering the
matrix. The new ordering puts the images in the columns of a matrix,
which solves the reordering problem, even though it looks the least
intuitive.
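
For concreteness, here is a small NumPy sketch of how the layouts from
the table relate to each other; the helper names are mine:

    import numpy as np

    def theano_to_alex_old(x):
        # (nImages, nChannels, H, W) -> (nChannels, nImages, H*W)
        n, c, h, w = x.shape
        return np.ascontiguousarray(x.transpose(1, 0, 2, 3)).reshape(c, n, h * w)

    def theano_to_alex_new(x):
        # (nImages, nChannels, H, W) -> (nChannels, H*W, nImages)
        n, c, h, w = x.shape
        return np.ascontiguousarray(x.transpose(1, 2, 3, 0)).reshape(c, h * w, n)

    def alex_new_to_theano(x, h, w):
        # (nChannels, H*W, nImages) -> (nImages, nChannels, H, W)
        c, _, n = x.shape
        return np.ascontiguousarray(x.reshape(c, h, w, n).transpose(3, 0, 1, 2))

    # in the Alex layouts, a cross-channel function such as a per-pixel
    # maximum over maps becomes a plain reduction over the leading axis:
    # theano_to_alex_new(x).max(axis=0)   -> shape (H*W, nImages)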


I should also mention the "sparse" filter option in Alex' code, which
allows convolving only certain input maps with a filter. I will not go
into detail, since Theano does not have this feature and I want to
compare execution times.


Speed


In the following table, all operations were computed 10 times and the
(wall clock) times averaged. For Theano, I varied the 'version'
parameter, but found that the auto-selection (-1) selects the best
algorithm. I used a GTX480 and an Intel Xeon X5650 (2.67 GHz).
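
The measurement scheme was roughly the following (a sketch only;
run_conv stands in for the respective library call, and the warm-up
run is my own addition; for GPU code one has to synchronize before
reading the clock, otherwise only the kernel launches are measured):

    import time

    def avg_wallclock_ms(run_conv, n_runs=10):
        run_conv()                      # warm-up: exclude compilation/allocation
        t0 = time.time()
        for _ in range(n_runs):
            run_conv()
            # a GPU synchronization point belongs here for CUDA code
        return (time.time() - t0) / n_runs * 1000.0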



Execution speed of convolution packages
Version   | Image Size   | Filter Size | Type  | Time (ms) | Comment
Naive CPU | 32,8,176,176 | 32,8,7,7    | fwd   |   34200   |
          |              |             | dimg  |   26800   |
          |              |             | dflt  |     n/a   |
Alex new  | 32,8,176,176 | 32,8,7,7    | fwd   |      75   |
          |              |             | dimg  |      90   |
          |              |             | dflt  |      55   |
          |              |             | trn   |     0.3   | transposing the whole input batch
          |              |             | total |   220.3   |
Alex old  | 32,8,176,176 | 32,8,7,7    | fwd   |     101   |
          |              |             | dimg  |     240   | plus error padding (3 ms)
          |              |             | dflt  |     115   | plus summing over batch (.8 ms)
          |              |             | total |     459   |
Theano    | 32,8,176,176 | 32,8,7,7    | fwd   |     268   |
          |              |             | dimg  |     451   |
          |              |             | dflt  |     281   |
          |              |             | total |    1000   |

Key:

Image Size
batch size, number of input maps, height, width
Filter Size
number of output maps, number of input maps, height, width
Type
fwd is the "forward pass" convolution, dimg is the derivative w.r.t.
the inputs, dflt is the derivative w.r.t. the filters, and trn is the
transposition of the input batch into the required memory layout.
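
For reference, here is what dimg and dflt compute, again as a naive
NumPy sketch matching naive_conv_fwd above (my own illustration, not
code from any of the packages):

    import numpy as np

    def naive_conv_dflt(images, dout, fH, fW):
        # derivative w.r.t. the filters: correlate the inputs with the
        # gradient of the output maps
        _, nF, oH, oW = dout.shape
        nC = images.shape[1]
        dflt = np.zeros((nF, nC, fH, fW), dtype=images.dtype)
        for f in range(nF):
            for c in range(nC):
                for i in range(fH):
                    for j in range(fW):
                        dflt[f, c, i, j] = np.sum(
                            dout[:, f] * images[:, c, i:i+oH, j:j+oW])
        return dflt

    def naive_conv_dimg(dout, filters):
        # derivative w.r.t. the inputs: a "full" convolution of the output
        # gradient with the filters (implementations offering only "valid"
        # mode get this by zero-padding the error, cf. the table above)
        nI, nF, oH, oW = dout.shape
        _, nC, fH, fW = filters.shape
        dimg = np.zeros((nI, nC, oH + fH - 1, oW + fW - 1), dtype=dout.dtype)
        for n in range(nI):
            for f in range(nF):
                for y in range(oH):
                    for x in range(oW):
                        dimg[n, :, y:y+fH, x:x+fW] += dout[n, f, y, x] * filters[f]
        return dimg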

Discussion: I was quite surprised to see that Theano is comparatively
slow. It seems that Alex' new convolutions are indeed faster, albeit
not several times faster for the tested case (Update: with patches for
small batch sizes kindly provided by Alex, speed nearly doubled!). The
overhead of a transpose (to comply with the "weird" memory layout) is
negligible compared to the overall advantages. All GPU implementations
significantly outperform a naive CPU version (just many nested
for-loops, as sketched above). Note, however, that Theano is able to
generate efficient CPU convolution code.


Combinations: Theano is quite flexible, but "Alex new" is fast. How do
we get the best of both worlds? It is interesting to note that the
memory layouts of the two convolutions are transposes of each other,
and that for just 0.3 ms (in the above setting) we can get from one to
the other. So we can choose speed or flexibility as needed.
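
A sketch of that combination, reusing the layout helpers from above;
fast_conv_alex_new() is a hypothetical binding to Alex' kernels, and
the analogous reordering of the filters is omitted:

    def conv_fwd_from_theano_layout(images, filters, fast_conv_alex_new):
        # images arrive in Theano layout (nImages, nChannels, H, W);
        # transpose into Alex' new layout, run the fast kernel,
        # then transpose the result back.
        x = theano_to_alex_new(images)          # (nChannels, H*W, nImages)
        y = fast_conv_alex_new(x, filters)      # assumed: (nFilters, oH*oW, nImages)
        nf, ohw, n = y.shape
        oh = ow = int(round(ohw ** 0.5))        # square maps assumed (see constraints)
        return alex_new_to_theano(y, oh, ow)    # (nImages, nFilters, oH, oW)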


Maintenance concerns


Neither implementation is particularly well documented, but both are
well tested. At least for CudaNdarray, a successor is on the way. It
seems to me that optimized code at this level is mostly write-only
anyway.