MobileNet Explained: A Slim and Efficient CNN Model

Overview

The paper “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, authored by Andrew G. Howard, Menglong Zhu, and others in 2017, introduces a convolutional building block called the depthwise separable convolution, which is more efficient and less computationally expensive than a standard convolution. This layer makes it practical to deploy complex CNN architectures on edge devices: it significantly reduces the number of parameters and operations without significantly reducing accuracy.

Paper : https://arxiv.org/abs/1704.04861

Key Contributions:

1. Depthwise separable convolutions

The key contribution is the introduction of depthwise separable convolutions to replace standard convolution layers. This factorizes the operation into two steps: a depthwise convolution (applied per input channel) and a pointwise convolution (a 1×1 convolution applied across channels), significantly reducing computational cost and model size.
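The two steps above can be sketched in plain NumPy. This is a minimal, illustrative implementation (the function names, valid padding, and stride 1 are my assumptions, not details from the paper; real frameworks use padded, vectorized kernels):

```python
import numpy as np

def depthwise_conv(x, dw_kernels):
    """Apply one DK x DK filter per input channel (no channel mixing).
    x: (H, W, M) input; dw_kernels: (DK, DK, M).
    Uses valid padding and stride 1, so the spatial size shrinks."""
    H, W, M = x.shape
    DK = dw_kernels.shape[0]
    out_h, out_w = H - DK + 1, W - DK + 1
    out = np.zeros((out_h, out_w, M))
    for m in range(M):          # each channel is filtered independently
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, m] = np.sum(x[i:i+DK, j:j+DK, m] * dw_kernels[:, :, m])
    return out

def pointwise_conv(x, pw_kernels):
    """1x1 convolution: a per-pixel linear map that mixes channels.
    x: (H, W, M); pw_kernels: (M, N) -> output (H, W, N)."""
    return x @ pw_kernels

# Toy example: 8x8 image with 3 channels -> 16 output channels
x = np.random.rand(8, 8, 3)
dw = np.random.rand(3, 3, 3)
pw = np.random.rand(3, 16)
y = pointwise_conv(depthwise_conv(x, dw), pw)
print(y.shape)  # (6, 6, 16)
```

Note how the expensive spatial filtering never touches more than one channel at a time; all cross-channel mixing happens in the cheap 1×1 step.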

2. Model Shrinking with hyperparameters

Two hyperparameters, the width multiplier (α) and the resolution multiplier (ρ), enable a flexible trade-off between accuracy and resource usage.

  • Width Multiplier (α) → reduces the number of filters (channels) in each layer

  • Resolution Multiplier (ρ) → scales down the input image resolution
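The effect of the two multipliers is easy to see in the paper's cost model for one depthwise separable layer. The layer sizes below (3×3 kernels, 512 channels, 14×14 feature map) are an illustrative choice, not a specific layer from the paper:

```python
def separable_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Mult-adds for one depthwise separable layer.
    alpha (width multiplier) thins the channel counts M and N;
    rho (resolution multiplier) shrinks the feature map size DF."""
    m, n = int(alpha * m), int(alpha * n)
    df = int(rho * df)
    return dk * dk * m * df * df + m * n * df * df  # depthwise + pointwise

full = separable_cost(3, 512, 512, 14)                       # 52,283,392
slim = separable_cost(3, 512, 512, 14, alpha=0.5, rho=0.5)   #  3,324,160
print(round(full / slim, 1))  # about 15.7x cheaper
```

Because both multipliers enter the cost roughly quadratically, halving each one cuts the mult-adds by about 16×, at the price of some accuracy.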

3. Resource Efficiency

MobileNet models use 8–9 times less computation than comparable standard convolutional architectures while maintaining competitive accuracy on tasks like ImageNet classification.

4. Application

MobileNet is demonstrated across tasks such as object detection, fine-grained classification, face attribute detection, and large-scale geolocation. The model achieves performance comparable to larger networks like VGG16 with a significantly reduced number of parameters.

Comparing Standard Convolutions and Depthwise Separable Convolutions

In the world of deep learning, convolutions are at the heart of most computer vision models. Standard convolutions have been the go-to approach, but they are computationally expensive. To address this, depthwise separable convolutions were introduced, which dramatically reduce computational cost while maintaining comparable accuracy. This article explores the differences, computations, and advantages of both methods.

1. Standard CNN

  • Input tensor of shape DF × DF × M (spatial dimension DF, M input channels)

  • Filters of size DK × DK × M × N, where DK is the kernel size, M is the number of input channels, and N is the number of filters (output channels)

  • Output tensor of shape DF × DF × N (assuming stride 1 and same padding, so the spatial size is unchanged)

Computation :

$$\text{Standard Convolution:} \\[8pt] G_{k,l,n} = \sum_{i=1}^{D_K} \sum_{j=1}^{D_K} \sum_{m=1}^{M} K_{i,j,m,n} \cdot F_{k+i-1,l+j-1,m} \\[20pt]$$

$$\text{Computational Cost:} \\ \text{Cost} = D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$
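Plugging concrete numbers into this cost formula shows how quickly standard convolutions grow. The layer sizes here are an illustrative choice, not taken from the paper:

```python
def standard_conv_cost(dk, m, n, df):
    # D_K * D_K * M * N * D_F * D_F mult-adds for one standard convolution
    return dk * dk * m * n * df * df

# Illustrative layer: 3x3 kernels, 512 input and 512 output channels, 14x14 map
print(standard_conv_cost(3, 512, 512, 14))  # 462,422,016 mult-adds
```

The DK·DK·M·N factor couples the kernel size to both channel counts, which is exactly the coupling the depthwise separable factorization breaks apart.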

2. Depthwise Separable Convolutions

$$\text{(a) Depthwise Convolution:} \\ G_{k,l,m} = \sum_{i=1}^{D_K} \sum_{j=1}^{D_K} K_{i,j,m} \cdot F_{k+i-1,l+j-1,m} $$

$$\text{Cost}_{\text{depthwise}} = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$

$$\text{(b) Pointwise Convolution:} \\ G_{k,l,n} = \sum_{m=1}^{M} K_{m,n} \cdot F_{k,l,m} \\ $$

$$\text{Cost}_{\text{pointwise}} = M \cdot N \cdot D_F \cdot D_F$$

3. Comparison

$$\text{Total Cost for Depthwise Separable Convolution:} \\$$

$$\text{Cost}_{\text{depthwise separable}} = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F $$

$$\text{Reduction Factor:} \\ \frac{\text{Cost}_{\text{depthwise separable}}}{\text{Cost}_{\text{standard}}} = \frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
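As a quick sanity check of this ratio (an illustrative sketch; the channel count 512 is my choice of example):

```python
def reduction_factor(dk, n):
    # Ratio of depthwise-separable cost to standard cost: 1/N + 1/DK^2
    return 1.0 / n + 1.0 / (dk * dk)

# For 3x3 kernels and 512 output channels:
r = reduction_factor(3, 512)
print(round(1 / r, 1))  # about 8.8x fewer operations
```

For typical 3×3 kernels the 1/DK² term dominates, which is where the paper's "8–9 times less computation" figure comes from.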

Summary

MobileNets represent a paradigm shift in building efficient neural networks for mobile and embedded systems. By prioritizing computational efficiency while maintaining accuracy, MobileNets pave the way for real-time AI applications on devices with limited hardware resources.