field might wonder which framework is the "right" one to implement their
experiments in. For plain neural networks, the main workhorse is matrix
multiplication, which can be accelerated considerably using graphics
processing units (GPUs). For convolutional architectures, the matrix
multiplication is essentially replaced by a convolution, and we would
like that to be fast(er) on the GPU as well.
Neural net convolutions are somewhat special: their filters are 3D and
pool over the input maps. Also, since they are usually applied to many
small "maps" at once, common FFT acceleration techniques do not apply.
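To make the "3D filters pooling over input maps" point concrete, here is a minimal numpy sketch of a forward convolution in the naive, nested-for-loop style that also serves as the "Naive CPU" baseline further below. Names and shapes are my own choice, not code from any of the libraries compared here.

```python
import numpy as np

def naive_conv_fwd(images, filters):
    """Naive forward convolution (valid mode, no padding/strides).
    images:  (nImages, nChannels, imageH, imageW)
    filters: (nFilters, nChannels, filterH, filterW)
    Every output map pools over all input channels -- this is what
    makes the filters 3D."""
    nImages, nChannels, imgH, imgW = images.shape
    nFilters, _, fltH, fltW = filters.shape
    outH, outW = imgH - fltH + 1, imgW - fltW + 1
    out = np.zeros((nImages, nFilters, outH, outW), dtype=images.dtype)
    for b in range(nImages):          # just many nested for-loops ...
        for f in range(nFilters):
            for i in range(outH):
                for j in range(outW):
                    patch = images[b, :, i:i + fltH, j:j + fltW]
                    out[b, f, i, j] = np.sum(patch * filters[f])
    return out
```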
For my own work, I compared three convolution implementations:
- The convolutions that come with Theano (from git, 2011-1-14). This
implementation is by far the most flexible, as we will see. It is
based on the formerly separate, now Theano-integrated CudaNdarray
library.
- Alex Krizhevsky, a PhD student in Toronto, wrote two publicly
available convolution routines. We already integrated the first
version of his convolutions into CUV.
- Alex' new convolutions, created for cuda-convnet (svn, 2011-1-13),
which are described as being "several times faster" than the first
version.
Constraints
The (main) constraints of the three versions are quite different:
Implementation | Image Size | Memory-Ordering (row-major) | Other |
---|---|---|---|
Theano | any | (nImages, nChannels, imageH, imageW) | – |
Alex old | square only | (nChannels, nImages, imageH*imageW) | nFilters%2==0 |
Alex new | square only | (nChannels, imageH*imageW, nImages) | nFilters%16==0 |
Regarding the restriction to square images, one can argue that in
random image collections the shapes vary anyway, and that for batch
processing the images have to be made square in any case.
The ordering is tricky. At first sight, Theano's ordering looks the
most intuitive. However, all operations which are functions of all
channels of a single pixel are a bit tricky to optimize. Alex' old and
new orderings can both use efficient matrix-row operations for
cross-channel functions. The "Alex old" convolution has the
disadvantage that the images of one batch are neither in the columns
nor in the rows of a matrix, so that final "full" layers (for example
in LeNet) require reordering the matrix. The new convolutions have the
images in the columns of a matrix, which solves the reordering problem,
even though this ordering looks the least intuitive.
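To make the orderings concrete, here is a small numpy sketch (shapes and variable names are my own) of how cross-channel operations and the final "full" layer look in the two Krizhevsky layouts:

```python
import numpy as np

nImages, nChannels, imgH, imgW = 32, 8, 176, 176
nPix = imgH * imgW

# Theano layout for comparison: (nImages, nChannels, imageH, imageW);
# there, the channels of one pixel are strided imageH*imageW apart,
# which is what makes cross-channel operations harder to optimize.

# The same batch in the two Krizhevsky layouts:
old = np.zeros((nChannels, nImages, nPix), dtype='float32')  # "Alex old"
new = np.zeros((nChannels, nPix, nImages), dtype='float32')  # "Alex new"

# Cross-channel functions: viewing either layout as a (nChannels, rest)
# matrix turns per-channel scaling into row operations and channel
# reductions into a sum over axis 0.
channel_sum = new.reshape(nChannels, -1).sum(axis=0)

# A final "full" layer wants one image per row, i.e. (nImages, nFeatures).
# New layout: the images already sit in the columns, a transpose suffices.
full_new = new.reshape(nChannels * nPix, nImages).T

# Old layout: images are interleaved between the channels, so we have to
# reorder (swap the first two axes) before flattening.
full_old = old.transpose(1, 0, 2).reshape(nImages, nChannels * nPix)
```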
I should also mention the "sparse" filter option in Alex' code, which
allows convolving only certain maps with a filter. I am not going into
detail here, since Theano does not have this feature and I want to
compare execution times.
Speed
In the following table, all operations were computed 10 times and the
(wall-clock) times averaged. For Theano, I varied the 'version'
parameter, but found that the auto-selection (-1) picks the best
algorithm anyway. I used a GTX 480 and an Intel Xeon X5650 (2.67 GHz).
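To illustrate the measurement procedure, here is roughly what such a timing loop looks like. This is my own sketch around Theano's conv2d, not the original benchmark code, it omits the 'version' experiments, and the naive loop includes host/device transfers, so it only shows the averaging, not the exact numbers.

```python
import time
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv

images  = T.tensor4('images')    # (nImages, nChannels, imageH, imageW)
filters = T.tensor4('filters')   # (nFilters, nChannels, filterH, filterW)
f = theano.function([images, filters], conv.conv2d(images, filters))

x = np.random.randn(32, 8, 176, 176).astype('float32')
w = np.random.randn(32, 8, 7, 7).astype('float32')

f(x, w)                          # warm-up (compilation, first transfer)
t0 = time.time()
for _ in range(10):
    f(x, w)
print('fwd: %.1f ms' % ((time.time() - t0) / 10 * 1000))
```

The averaged results are collected in the table below.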
Version | Image Size | Filter Size | Type | Time (ms) | Comment |
---|---|---|---|---|---|
Naive CPU | 32,8,176,176 | 32,8,7,7 | fwd | 34200 | |
 | | | dimg | 26800 | |
 | | | dflt | n/a | |
Alex new | 32,8,176,176 | 32,8,7,7 | fwd | 75 | |
 | | | dimg | 90 | |
 | | | dflt | 55 | |
 | | | trn | 0.3 | transposing the whole input batch |
 | | | total | 220.3 | |
Alex old | 32,8,176,176 | 32,8,7,7 | fwd | 101 | |
 | | | dimg | 240 | plus error padding (3 ms) |
 | | | dflt | 115 | plus summing over batch (0.8 ms) |
 | | | total | 459 | |
Theano | 32,8,176,176 | 32,8,7,7 | fwd | 268 | |
 | | | dimg | 451 | |
 | | | dflt | 281 | |
 | | | total | 1000 | |
Key:
- Image Size: batch size, number of input maps, height, width
- Filter Size: number of output maps, number of input maps, height, width
- Type: fwd is the "forward pass" convolution, dimg is the derivative
w.r.t. the inputs, and dflt is the derivative w.r.t. the filters.
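For the Theano case, the three types correspond directly to the forward graph and its gradients. A minimal sketch of what I mean by fwd/dimg/dflt (my own illustration, not the benchmark code):

```python
import theano
import theano.tensor as T
from theano.tensor.nnet import conv

images  = T.tensor4('images')
filters = T.tensor4('filters')

out  = conv.conv2d(images, filters)   # fwd: the forward-pass convolution
cost = out.sum()                      # dummy scalar so we can differentiate
dimg = T.grad(cost, images)           # dimg: derivative w.r.t. the inputs
dflt = T.grad(cost, filters)          # dflt: derivative w.r.t. the filters

f = theano.function([images, filters], [out, dimg, dflt])
```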
Discussion: I was quite surprised to see that Theano is comparatively
slow. It seems that Alex' new convolutions are indeed faster, albeit
not several times faster in the tested case. (Update: with patches for
small batch sizes kindly provided by Alex, the speed nearly doubled!)
The overhead of a transpose (to comply with the "weird" memory layout)
is negligible compared to the overall advantage. All GPU
implementations significantly outperform a naive CPU version (just
many nested for-loops). Note, however, that Theano is also able to
generate efficient code for CPU convolutions.
Combinations: Theano is quite flexible, but "Alex new" is fast. How do
we get the best of both worlds? It is interesting to note that the
memory layouts of the two convolutions are transposes of each other,
and that for just 0.3 ms (in the above setting) we can get from one
layout to the other. So we can have speed or flexibility at will.
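As a sketch of what such a round trip looks like in terms of index permutations (numpy stands in here for the single GPU transpose kernel; names and shapes are mine):

```python
import numpy as np

nImages, nChannels, imgH, imgW = 32, 8, 176, 176

# Theano-style layout: (nImages, nChannels, imageH, imageW)
x_theano = np.random.randn(nImages, nChannels, imgH, imgW).astype('float32')

# Reorder into the "Alex new" layout: (nChannels, imageH*imageW, nImages).
x_alex = np.ascontiguousarray(
    x_theano.transpose(1, 2, 3, 0)).reshape(nChannels, imgH * imgW, nImages)

# ... run the fast convolution in this layout ...

# And back, e.g. to feed the result into flexible Theano ops again.
x_back = x_alex.reshape(nChannels, imgH, imgW, nImages).transpose(3, 0, 1, 2)
assert np.array_equal(x_back, x_theano)
```

The assert only documents that the round trip is lossless; on the GPU the reordering is a single transpose kernel, which is where the 0.3 ms in the table come from.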
Maintenance concerns
Both implementations are not particularly well documented, but they
are well tested. At least for CudaNdarray, a successor is on the way.
It seems to me that optimized code at this level is mostly write-only
anyway.