Created: April 23, 2022
Last Updated: April 23, 2022

Separable Depthwise Convolution

In this tutorial, you'll learn what depthwise separable convolutions are and how they compare to regular convolution filters. You'll see that they are more efficient than regular convolutions in terms of speed and memory, with few tradeoffs.

Lastly, you'll see how they can be combined into a standard neural network architecture, which hopefully you'll be able to adapt to your own development workflows.

First, let's write all the imports needed for this tutorial:

import torch
from prettytable import PrettyTable
from collections import OrderedDict

Regular Convolutions

We'll begin by examining just how many parameters and FLOPs (floating point operations) are in a regular convolution. If you don't know what these terms mean, they'll be explained shortly.

First let's define a regular convolution layer.

input_channels = 3
output_channels = 64
kernel_size = 5
stride = 2

regular_conv = torch.nn.Conv2d(input_channels, output_channels, kernel_size, stride)
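
If you'd like to peek at the tensors PyTorch created for this layer, you can inspect their shapes directly (we'll count their elements more systematically in a moment):

print(regular_conv.weight.shape)  # torch.Size([64, 3, 5, 5]) - one 5x5x3 kernel per output channel
print(regular_conv.bias.shape)    # torch.Size([64]) - one bias per output channel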

Figure: Standard convolution

Number of Parameters in a Regular Convolution

The number of parameters in a convolution layer (regardless of whether it's a regular or depthwise layer) is simply the number of elements in the layer that have to be "learnt" during the training process.

For a conv layer, this is the total number of weights and biases, i.e. every element of every kernel (filter) in the layer, plus one bias term per output channel.

In our current example, we have defined the following:

  • A kernel size of 5 × 5
  • The expected number of input channels is 3, so each filter (kernel) is a tensor of size 5 × 5 × 3
  • The specified number of output channels is 64, which implies the following:
    • There are 64 kernels
    • Each kernel has a corresponding scalar value known as the bias, i.e. a bias of size 1 × 64

With this, we can determine the total number of parameters as the filter size times the number of output channels, plus the biases: (5 × 5 × 3 × 64) + 64 = 4864.
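
As a quick sanity check, the same number can be computed directly from the hyperparameters we defined earlier:

manual_params = (kernel_size * kernel_size * input_channels * output_channels) + output_channels
print(manual_params)  # 4864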

Let's write a function total_learnables to perform the calculation above for any PyTorch module.

def total_learnables(model):
    """Count the learnable (requires_grad) parameters of a PyTorch module."""
    table = PrettyTable(["Learnable", "Count"])
    total_params = 0
    for name, parameter in model.named_parameters():
        # Skip frozen parameters; they aren't learnt during training
        if not parameter.requires_grad:
            continue
        params = parameter.numel()
        table.add_row([name, params])
        total_params += params
    return (total_params, table)
(total_params_regular, table) = total_learnables(regular_conv)
print(table)
print(f"[Regular Convolution] Total Learnables = {total_params_regular}")
+-----------+-------+
| Learnable | Count |
+-----------+-------+
|   weight  |  4800 |
|    bias   |   64  |
+-----------+-------+
[Regular Convolution] Total Learnables = 4864

Number of Floating Point Operations (FLOPs) in a Regular Convolution

The number of FLOPs is the number of floating point operations the layer performs in a single forward pass. Unlike the parameter count, it depends heavily on the size of the input to the convolution layer.

To show this, let's define an input image for the convolution layer as

rand_image = torch.rand(1, input_channels, 228, 228) # Batch, Channel, Spatial, Spatial

With a 228 × 228 input image, a 5 × 5 kernel and a stride of 2, the convolution produces a 112 × 112 output feature map (we'll confirm this later when we run the layer). At each output position, the layer performs 5 × 5 × 3 multiplications for each of the 64 output channels, plus one bias addition per channel, so the total number of operations can be calculated as

FLOPs = 112 × 112 × [(5 × 5 × 3 × 64) + 64] = 61,014,016 (operations)
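
If you'd like to double-check the arithmetic, here's a minimal sketch of that counting convention (the helper name conv2d_flops is just for illustration; it assumes no padding and counts multiplications per output position plus one bias addition per output channel):

def conv2d_flops(in_channels, out_channels, kernel_size, stride, height, width):
    # Spatial size of the output feature map for an unpadded convolution
    out_h = (height - kernel_size) // stride + 1
    out_w = (width - kernel_size) // stride + 1
    # Multiplications for all output channels at one position, plus the bias additions
    per_position = (kernel_size * kernel_size * in_channels * out_channels) + out_channels
    return out_h * out_w * per_position

print(conv2d_flops(input_channels, output_channels, kernel_size, stride, 228, 228))  # 61014016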

Beyond a quick check like this, there's no need for a dedicated FLOP-counting utility; FLOPs are more of an analytical way to reason about the work a layer does, whereas the parameter counts can be read directly off the model.

Now that you've seen how a regular convolution can be viewed from the perspective of its floating point operations and its total number of learnable parameters, it's time to see how depthwise separable convolutions improve on it in terms of efficiency.

Separable Depthwise Convolutions

Figure: Depthwise separable convolution

In a nutshell, depthwise separable convolutions are a factorised form of regular convolutions.

An analogy is representing a 10 × 10 matrix using two smaller vectors a1 and a2, both of size 1 × 10. By computing a1ᵀ × a2 (an outer product), we get back the full 10 × 10 matrix, but from the much smaller representation {a1, a2}.
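
To make the analogy concrete, here's a tiny sketch (a1 and a2 are just the vectors from the analogy, not part of the model we're building):

a1 = torch.rand(1, 10)
a2 = torch.rand(1, 10)

full_matrix = a1.T @ a2          # the full 10 x 10 matrix (100 values)
print(full_matrix.shape)         # torch.Size([10, 10])
print(a1.numel() + a2.numel())   # only 20 values stored in the factorised form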

A depthwise separable convolution factorises the regular convolution into two steps: a depthwise convolution, which applies a single 5 × 5 filter to each input channel independently (hence groups=input_channels), and a pointwise convolution, which uses 1 × 1 filters to mix the channels into the 64 output channels. In PyTorch this can be defined as

separable_conv = torch.nn.Sequential(OrderedDict([
    ("Depthwise", torch.nn.Conv2d(input_channels, input_channels, kernel_size, stride, groups=input_channels)),
    ("Pointwise", torch.nn.Conv2d(input_channels, output_channels, 1, 1))
]))
(total_params_separable, table) = total_learnables(separable_conv)
print(table)
print(f"[Separable Convolution] Total Learnables = {total_params_separable}")

print(f"Percent reduction = {(1 - total_params_separable / total_params_regular) * 100}%")
+------------------+-------+
|    Learnable     | Count |
+------------------+-------+
| Depthwise.weight |   75  |
|  Depthwise.bias  |   3   |
| Pointwise.weight |  192  |
|  Pointwise.bias  |   64  |
+------------------+-------+
[Separable Convolution] Total Learnables = 334
Percent reduction = 93.13322368421053%
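
As a quick aside (this derivation isn't in the code above, but it's a handy rule of thumb): ignoring biases, a regular convolution has K × K × C_in × C_out weights, while the separable version has K × K × C_in + C_in × C_out, so the ratio between the two is roughly 1/C_out + 1/K².

ratio = 1 / output_channels + 1 / kernel_size ** 2  # 1/64 + 1/25
print(f"Approximate reduction = {(1 - ratio) * 100:.1f}%")  # about 94.4% before biases are counted

This is in the same ballpark as the 93.13% printed above once the biases are included. Finally, let's confirm that both layers produce an output of the same shape when given the same input image.
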
out = regular_conv(rand_image)
print(f"Standard Convolution: {out.size()}")

out = separable_conv(rand_image)
print(f"Separable Convolution: {out.size()}")
Standard Convolution: torch.Size([1, 64, 112, 112])
Separable Convolution: torch.Size([1, 64, 112, 112])
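
The same counting convention also shows the speed advantage. Here's a rough sketch, following the assumptions used for the regular convolution above (no padding, multiplications per output position plus one bias addition per output channel); note that the depthwise step is a grouped convolution, so each of its 3 output channels only sees a single input channel.

out_h = out_w = 112  # output spatial size, as printed above

# Depthwise: one 5x5 filter per input channel, plus one bias per channel
depthwise_flops = out_h * out_w * ((kernel_size * kernel_size * 1 * input_channels) + input_channels)
# Pointwise: 1x1 filters mixing 3 channels into 64, plus the biases
pointwise_flops = out_h * out_w * ((1 * 1 * input_channels * output_channels) + output_channels)

separable_flops = depthwise_flops + pointwise_flops
print(separable_flops)  # 4189696, versus 61014016 for the regular convolution

That's roughly a 93% reduction in operations, mirroring the reduction we saw in the parameter count.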
