How 1x1 convolutions reduce floating-point operations in CNNs

The main purpose of applying kernels (filters) in a CNN architecture is to extract features from an input image, but a kernel of size 1x1 serves a special purpose in modern CNN architectures. The use of 1x1 kernels became widespread with the Inception architecture, where their main purpose was to reduce the number of floating-point operations without compromising the detected features.

When 1x1 kernels are applied to a hidden layer's input before an NxN feature-extracting kernel, the number of operations required drops drastically, which reduces the load on the processor (fewer FLOPs) and therefore the computation time. Let's understand this with the following example:

Fig. 1: 5x5 kernel used directly

The total number of operations for a convolution layer can be approximated as:
(Output size x No. of kernels) x (Kernel size x No. of input channels) + bias

With 'same' padding, as assumed in the example below, the output spatial size equals the input size (14x14).

We can ignore the bias term, since it is negligible compared to the multiply-accumulate operations.
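This count is easy to script. Here is a minimal Python sketch of the formula (the function name is just illustrative; it assumes 'same' padding so the output spatial size equals the input size, and it ignores the bias term as discussed):

```python
def conv_flops(out_h, out_w, n_kernels, k_h, k_w, in_channels):
    """Approximate multiply-accumulate count for a single conv layer (bias ignored)."""
    return (out_h * out_w * n_kernels) * (k_h * k_w * in_channels)
```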

For the above Fig. 1:

Number of Operations:
= (14x14x48) x (5x5x480)
= 112,896,000
≈ 113 million FLOPs
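As a quick check, the hypothetical conv_flops helper sketched above reproduces this number:

```python
flops_direct = conv_flops(14, 14, 48, 5, 5, 480)
print(flops_direct)  # 112896000 -> roughly 113 million FLOPs
```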

When a 1x1 kernel is used in between, as in the Inception architecture, the number of operations required is illustrated below:

Fig. 2: When 1x1 kernel is used in between

For the above Fig. 2:

Number of Operations:
= Operations for conv1 + Operations for conv2
= (14x14x16) x (1x1x480) + (14x14x48) x (5x5x16)
= 1,505,280 + 3,763,200
= 5,268,480
≈ 5.3 million FLOPs
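The same helper gives the bottlenecked total and the resulting reduction factor:

```python
flops_1x1 = conv_flops(14, 14, 16, 1, 1, 480)   # 1,505,280
flops_5x5 = conv_flops(14, 14, 48, 5, 5, 16)    # 3,763,200
flops_total = flops_1x1 + flops_5x5
print(flops_total)                   # 5268480 -> roughly 5.3 million FLOPs
print(flops_direct / flops_total)    # ~21.4x fewer operations
```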

Comparing the two cases, the number of operations drops drastically (from about 113 million to about 5.3 million, roughly a 21x reduction) without any reduction in the output size (14x14x48).
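For reference, here is a minimal PyTorch sketch of the two layouts compared above (the variable names and the choice of 'same'-style padding are assumptions made for illustration; the channel counts match the figures):

```python
import torch
import torch.nn as nn

# Fig. 1 layout: 480 input channels -> 48 output channels with a single 5x5 kernel.
direct = nn.Conv2d(480, 48, kernel_size=5, padding=2)

# Fig. 2 layout: a 1x1 conv first squeezes 480 channels down to 16,
# then the 5x5 conv produces the same 48 output channels.
bottleneck = nn.Sequential(
    nn.Conv2d(480, 16, kernel_size=1),
    nn.Conv2d(16, 48, kernel_size=5, padding=2),
)

x = torch.randn(1, 480, 14, 14)  # dummy 14x14x480 feature map from a hidden layer
print(direct(x).shape)       # torch.Size([1, 48, 14, 14])
print(bottleneck(x).shape)   # torch.Size([1, 48, 14, 14]) -- same shape, far fewer FLOPs
```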

