2012-01-19

GPU convolutions for neural networks

With all the popularity of deep learning, many researchers in the
field might wonder which framework is the "right" one to implement
their experiments in. For plain neural networks, the main work horse
is the matrix multiplication, which can be accelerated considerably on
graphics processing units (GPUs). In convolutional architectures, the
matrix multiplication is largely replaced by a convolution, and we
would like that operation to be just as fast on the GPU.


Neural net convolutions are somewhat special, since their filters are
3D and pool over all input maps. Also, since they are usually applied
to many small "maps" at once, common FFT acceleration techniques do
not apply.
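
To make this concrete, here is a minimal, naive NumPy sketch of the
forward pass (a "valid"-mode correlation, as usual in neural nets).
The function and variable names are mine and not taken from any of the
libraries discussed below:

    import numpy as np

    def naive_conv_fwd(images, filters):
        # images:  (nImages, nChannels, imageH, imageW)
        # filters: (nFilters, nChannels, filterH, filterW)
        # returns: (nImages, nFilters, imageH-filterH+1, imageW-filterW+1)
        nI, nC, iH, iW = images.shape
        nF, _, fH, fW = filters.shape
        oH, oW = iH - fH + 1, iW - fW + 1
        out = np.zeros((nI, nF, oH, oW), dtype=images.dtype)
        for n in range(nI):
            for f in range(nF):
                for y in range(oH):
                    for x in range(oW):
                        # each filter spans *all* input maps of one image
                        out[n, f, y, x] = np.sum(
                            images[n, :, y:y+fH, x:x+fW] * filters[f])
        return out

This is also essentially the "naive CPU" baseline that is timed below.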


For my own work, I compared three convolution implementations:

  • The convolutions that come with Theano (from git, 2011-1-14). This
    implementation is by far the most flexible, as we will see. It is
    based on the formerly separate, now Theano-integrated CudaNdarray
    library.
  • Alex Krizhevsky, a PhD student in Toronto, wrote two publicly
    available sets of convolution routines. We already integrated the
    first version of his convolutions in CUV.
  • Alex' new convolutions, created for cuda-convnet (svn, 2011-1-13),
    which are described as being "several times faster" than the first
    version.


Constraints


The (main) constraints of the three versions are quite different:



Implementation | Image Size  | Memory-Ordering (row-major)          | Other
Theano         | any         | (nImages, nChannels, imageH, imageW) |
Alex old       | square only | (nChannels, nImages, imageH*imageW)  | nFilters%2==0
Alex new       | square only | (nChannels, imageH*imageW, nImages)  | nFilters%16==0

Regarding the square-image restriction, one can argue that in
arbitrary image collections the shapes vary anyway, and that for batch
processing the images have to be brought to a common (square) size in
any case.


The ordering is the tricky part. At first sight, Theano's ordering
looks the most intuitive. However, operations that are functions of
all channels of a single pixel are hard to optimize in this layout.
Alex' old and new orderings can both use efficient matrix-row
operations for such cross-channel functions. The "Alex old" ordering
has the disadvantage that the images of one batch end up neither in
the columns nor in the rows of a matrix, so that the final "fully
connected" layers (for example in LeNet) require reordering the
matrix. The new ordering puts the images in the columns of a matrix,
which solves the reordering problem, even though it looks the least
intuitive.
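
For concreteness, here is a small NumPy sketch of how the layouts from
the table relate to each other; the helper names are mine:

    import numpy as np

    def theano_to_alex_old(x):
        # (nImages, nChannels, H, W) -> (nChannels, nImages, H*W)
        n, c, h, w = x.shape
        return np.ascontiguousarray(x.transpose(1, 0, 2, 3)).reshape(c, n, h * w)

    def theano_to_alex_new(x):
        # (nImages, nChannels, H, W) -> (nChannels, H*W, nImages)
        n, c, h, w = x.shape
        return np.ascontiguousarray(x.transpose(1, 2, 3, 0)).reshape(c, h * w, n)

    def alex_new_to_theano(x, h, w):
        # (nChannels, H*W, nImages) -> (nImages, nChannels, H, W)
        c, _, n = x.shape
        return np.ascontiguousarray(x.reshape(c, h, w, n).transpose(3, 0, 1, 2))

    # in the Alex layouts, a cross-channel function such as a per-pixel
    # maximum over maps becomes a plain reduction over the leading axis:
    # theano_to_alex_new(x).max(axis=0)   -> shape (H*W, nImages)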


I should also mention the "sparse" filter option in Alex' code, which
allows convolving only certain input maps with a filter. I will not go
into detail, since Theano does not have this feature and I want to
compare execution times.


Speed


In the following table, all operations were computed 10 times and the
(wall clock) times averaged. For Theano, I varied the 'version'
parameter, but found that the auto-selection (-1) selects the best
algorithm. I used a GTX480 and an Intel Xeon X5650 (2.67 GHz).
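
The measurement scheme was roughly the following (a sketch only;
run_conv stands in for the respective library call, and the warm-up
run is my own addition; for GPU code one has to synchronize before
reading the clock, otherwise only the kernel launches are measured):

    import time

    def avg_wallclock_ms(run_conv, n_runs=10):
        run_conv()                      # warm-up: exclude compilation/allocation
        t0 = time.time()
        for _ in range(n_runs):
            run_conv()
            # a GPU synchronization point belongs here for CUDA code
        return (time.time() - t0) / n_runs * 1000.0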



Execution speed of convolution packages
Version   | Image Size   | Filter Size | Type  | Time (ms) | Comment
Naive CPU | 32,8,176,176 | 32,8,7,7    | fwd   |   34200   |
          |              |             | dimg  |   26800   |
          |              |             | dflt  |     n/a   |
Alex new  | 32,8,176,176 | 32,8,7,7    | fwd   |      75   |
          |              |             | dimg  |      90   |
          |              |             | dflt  |      55   |
          |              |             | trn   |     0.3   | transposing the whole input batch
          |              |             | total |   220.3   |
Alex old  | 32,8,176,176 | 32,8,7,7    | fwd   |     101   |
          |              |             | dimg  |     240   | plus error padding (3 ms)
          |              |             | dflt  |     115   | plus summing over batch (.8 ms)
          |              |             | total |     459   |
Theano    | 32,8,176,176 | 32,8,7,7    | fwd   |     268   |
          |              |             | dimg  |     451   |
          |              |             | dflt  |     281   |
          |              |             | total |    1000   |

Key:

Image Size
batch size, number of input maps, height, width
Filter Size
number of output maps, number of input maps, height, width
Type
fwd is the "forward pass" convolution, dimg is the derivative w.r.t.
the inputs, dflt is the derivative w.r.t. the filters, and trn is the
transposition of the input batch into the required memory layout.
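
For reference, here is what dimg and dflt compute, again as a naive
NumPy sketch matching naive_conv_fwd above (my own illustration, not
code from any of the packages):

    import numpy as np

    def naive_conv_dflt(images, dout, fH, fW):
        # derivative w.r.t. the filters: correlate the inputs with the
        # gradient of the output maps
        _, nF, oH, oW = dout.shape
        nC = images.shape[1]
        dflt = np.zeros((nF, nC, fH, fW), dtype=images.dtype)
        for f in range(nF):
            for c in range(nC):
                for i in range(fH):
                    for j in range(fW):
                        dflt[f, c, i, j] = np.sum(
                            dout[:, f] * images[:, c, i:i+oH, j:j+oW])
        return dflt

    def naive_conv_dimg(dout, filters):
        # derivative w.r.t. the inputs: a "full" convolution of the output
        # gradient with the filters (implementations offering only "valid"
        # mode get this by zero-padding the error, cf. the table above)
        nI, nF, oH, oW = dout.shape
        _, nC, fH, fW = filters.shape
        dimg = np.zeros((nI, nC, oH + fH - 1, oW + fW - 1), dtype=dout.dtype)
        for n in range(nI):
            for f in range(nF):
                for y in range(oH):
                    for x in range(oW):
                        dimg[n, :, y:y+fH, x:x+fW] += dout[n, f, y, x] * filters[f]
        return dimg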

Discussion: I was quite surprised to see that Theano is comparatively
slow. It seems that Alex' new convolutions are indeed faster, albeit
not several times faster for the tested case (Update: with patches for
small batch sizes kindly provided by Alex, speed nearly doubled!). The
overhead of a transpose (to comply with the "weird" memory layout) is
negligible compared to the overall advantages. All GPU implementations
significantly outperform a naive CPU version (just many nested
for-loops, as sketched above). Note, however, that Theano is able to
generate efficient CPU convolution code.


Combinations: Theano is quite flexible, but "Alex new" is fast. How do
we get the best of both worlds? It is interesting to note that the
memory layouts of the two convolutions are transposes of each other,
and that for just 0.3 ms (in the above setting) we can get from one to
the other. So we can choose speed or flexibility as needed.
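
A sketch of that combination, reusing the layout helpers from above;
fast_conv_alex_new() is a hypothetical binding to Alex' kernels, and
the analogous reordering of the filters is omitted:

    def conv_fwd_from_theano_layout(images, filters, fast_conv_alex_new):
        # images arrive in Theano layout (nImages, nChannels, H, W);
        # transpose into Alex' new layout, run the fast kernel,
        # then transpose the result back.
        x = theano_to_alex_new(images)          # (nChannels, H*W, nImages)
        y = fast_conv_alex_new(x, filters)      # assumed: (nFilters, oH*oW, nImages)
        nf, ohw, n = y.shape
        oh = ow = int(round(ohw ** 0.5))        # square maps assumed (see constraints)
        return alex_new_to_theano(y, oh, ow)    # (nImages, nFilters, oH, oW)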


Maintenance concerns


Neither implementation is particularly well documented, but both are
well tested. At least for CudaNdarray, a successor is on the way. It
seems to me that optimized code at this level is mostly write-only
anyway.