Efficient algorithms for object detection: Faster R-CNN, YOLO, SSD

Apr 12, 2023

Object detection technology is evolving rapidly and finding new applications across industries, enabling machines to perceive and interpret visual information from the real world. In healthcare, it is used for tasks such as identifying anatomical landmarks in medical images and detecting tumors or lesions; in autonomous driving, it helps detect vehicles, cyclists, and other obstacles.

To automate this type of annotation, object detection algorithms such as Faster R-CNN, YOLO, and SSD learn from labeled examples. Let’s go with VinLab to discover the architecture, features, and performance of these three famous algorithms.

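Object detection algorithms are broadly classified into two categories based on how many times the same input image is passed through a network: single-shot (one-stage) detectors such as YOLO and SSD, which make all predictions in one pass, and two-stage detectors such as Faster R-CNN, which first propose candidate regions and then classify and refine them.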

1. Faster R-CNN

What is R-CNN?

R-CNN, or Regions with CNN Features, is an object detection model that combines bottom-up region proposals with high-capacity CNNs to localize and segment objects. It uses selective search to find a number of candidate bounding-box regions (“regions of interest”) and then independently extracts features from each region for classification. Faster R-CNN keeps this two-stage idea but replaces selective search with a learned Region Proposal Network.

*VinLab object detection in an image with its associated bounding box*

Key components of Faster R-CNN

Let’s discover the 4 key components of Faster R-CNN:

Region Proposal Network (RPN): Faster R-CNN introduces an RPN that shares convolutional layers with the object detection network. The RPN generates region proposals, which are candidate bounding boxes for objects in an image. It proposes regions of interest (RoIs) efficiently by using anchor boxes at different scales and aspect ratios, and for each anchor it predicts the probability that an object is present and the coordinates of the bounding box.
Region of Interest (RoI) Pooling: RoI pooling is used to align the features within each proposed region to a fixed size, which serves as the input for the subsequent layers of the network. This allows Faster R-CNN to handle objects of different sizes and aspect ratios effectively, and enables the network to be trained end-to-end.
Convolutional Neural Network (CNN): Faster R-CNN uses a backbone CNN to extract a feature map from the input image. This feature map is shared by the RPN and the detection head, so the features for each RoI are taken from it rather than recomputed per region. The backbone typically consists of multiple convolutional layers, and fully connected layers after RoI pooling learn discriminative features for object detection.
Classification and Regression Head: Faster R-CNN has a classification head and a regression head that take the features from RoI pooling as input. The classification head predicts the probability of each RoI belonging to different object classes, while the regression head predicts the refined coordinates of the bounding box for each RoI.
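
To make the anchor box idea used by the RPN more concrete, here is a minimal sketch in plain Python. It generates the anchors for a single location, assuming the 3 scales (128, 256, 512 pixels) and 3 aspect ratios (1:2, 1:1, 2:1) reported in the Faster R-CNN paper. The function name make_anchors and the example center point are hypothetical and only serve as an illustration; real implementations generate anchors for every feature map location at once.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes centred on (cx, cy).

    Each anchor has area scale**2 and aspect ratio = height / width.
    Default scales and ratios are an assumption taken from the
    Faster R-CNN paper, not from this article.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)   # wider when ratio < 1
            h = scale * math.sqrt(ratio)   # taller when ratio > 1
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(cx=300, cy=200)
print(len(anchors))   # 9 anchors: 3 scales x 3 aspect ratios
print(anchors[1])     # the square 128 x 128 anchor
```

The RPN then scores each of these anchors for object presence and predicts adjustments to its coordinates, as described in the list above.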

Faster R-CNN Model Architecture. Taken from: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2016

2. YOLO (You Only Look Once)

What is YOLO?

You Only Look Once (YOLO) uses an end-to-end neural network that predicts bounding boxes and class probabilities all at once.

It stands out from other object detection algorithms due to its unique approach of treating object detection as a single regression problem, making it faster and simpler compared to other methods.
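
One way to picture "a single regression problem" is that the network outputs one fixed-size tensor for the whole image. The sketch below is only an illustration: the default values (a 7 x 7 grid, 2 boxes per cell, 20 classes) are an assumption taken from the original YOLO paper's PASCAL VOC setup, and the function names are hypothetical.

```python
def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the prediction tensor for an original-YOLO-style head.

    Each box contributes 5 numbers (x, y, w, h, confidence); class
    probabilities are predicted once per grid cell.
    """
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)

def responsible_cell(cx, cy, img_w, img_h, grid_size=7):
    """Return (row, col) of the grid cell that contains the object centre."""
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col

print(yolo_output_shape())                      # (7, 7, 30)
print(responsible_cell(224, 100, 448, 448))     # (1, 3)
```

Because every prediction lives in this one tensor, a single forward pass is enough to produce all candidate boxes.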

Key components of YOLO

YOLO has 6 components:

Grid-based Detection: YOLO divides the input image into a grid of cells and predicts bounding boxes and class probabilities directly on this grid. Each grid cell is responsible for predicting a fixed number of bounding boxes (often 2 or 3) and their associated class probabilities. This grid-based approach allows YOLO to detect objects at different locations in the image with a single pass through the network, making it highly efficient.
Anchor Boxes: From YOLOv2 onward, YOLO uses anchor boxes, also known as prior boxes, which are pre-defined bounding box shapes with different aspect ratios and scales. Rather than predicting box coordinates from scratch, the network predicts offsets relative to these anchors; during training, the offsets are learned so that the adjusted anchors match the ground truth bounding boxes, allowing for accurate localization of objects.
Convolutional Neural Network (CNN): YOLO uses a CNN as its base network for feature extraction. This CNN processes the input image and generates feature maps that capture object features at different scales and resolutions.
Feature Concatenation: YOLO concatenates the feature maps from different layers of the CNN to capture both low-level and high-level features. This allows for the detection of objects of varying sizes and complexities.
Non-maximum Suppression (NMS): After predicting the class probabilities and bounding box coordinates, YOLO performs non-maximum suppression (NMS) to filter out redundant and overlapping bounding boxes. NMS helps in selecting the most confident and accurate bounding boxes for the final detection results.
Multi-class Detection: YOLO is capable of detecting objects from multiple classes in a single neural network pass. It predicts class probabilities for each bounding box, allowing for multi-class object detection in a single shot.
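
The NMS step in the list above can be sketched in a few lines of plain Python. This is a minimal greedy version under stated assumptions: boxes are (x1, y1, x2, y2) tuples, the 0.5 IoU threshold is an illustrative default, and the function names are hypothetical. Production code would normally use an optimized library routine and run NMS separately per class.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2]
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap either and is kept.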

3. SSD (Single Shot MultiBox Detector)

What is SSD?

SSD (Single Shot MultiBox Detector) is an object detection algorithm proposed by Wei Liu et al. in 2016. It is a popular real-time object detection algorithm that is known for its accuracy and efficiency. SSD builds upon the concept of anchor boxes, similar to YOLO, but with some key differences in the architecture and approach.

Key components of SSD

SSD has 7 key components:

Base Convolutional Network: SSD starts with a base convolutional network, typically a pre-trained deep neural network, such as VGG, ResNet, or MobileNet, which serves as the feature extractor. This network processes the input image and generates a series of feature maps with different resolutions.
Multi-scale Feature Maps: SSD uses a set of feature maps with different resolutions, taken from the base convolutional network and from extra convolutional layers added on top of it. These feature maps capture object features at different scales, allowing SSD to detect objects of various sizes effectively.
Anchor Boxes: SSD uses anchor boxes, also known as prior boxes, at each feature map location. These anchor boxes are pre-defined bounding box shapes with different aspect ratios and scales. SSD predicts the offsets and class probabilities for these anchor boxes, which serve as the reference boxes for detecting objects.
Localization Layers: SSD adds localization layers on top of each feature map to predict offsets (adjustments to center position, width, and height) for each anchor box. During training, the target offsets are computed from the matched ground truth bounding boxes; at inference, the predicted offsets refine the anchor boxes into the final box coordinates.
Classification Layers: SSD adds classification layers on top of each feature map to predict the class probabilities for each anchor box. These probabilities represent the likelihood of an anchor box containing an object from a specific class. SSD uses multi-class classification to predict object classes, allowing for detection of objects from multiple classes in a single pass.
Feature Pyramid Network (FPN): The original SSD predicts directly from its multi-scale feature maps without merging them. Some later SSD variants add a Feature Pyramid Network (FPN), which merges feature maps from different resolutions through a top-down pathway. This captures both fine-grained and coarse-grained features and can improve the accuracy of object detection.
Non-maximum Suppression (NMS): After predicting the class probabilities and offsets for anchor boxes, SSD performs non-maximum suppression (NMS) to filter out redundant and overlapping bounding boxes. NMS helps in selecting the most confident and accurate bounding boxes for the final detection results.
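
To illustrate how predicted offsets relate to anchor boxes, here is a minimal sketch of the center-size encoding commonly used by anchor-based detectors. It is an assumption for illustration rather than SSD's exact implementation: real SSD code usually also scales the offsets by "variance" constants, which are left out here, and the function names are hypothetical.

```python
import math

def encode_box(anchor, gt):
    """Training target: offsets that map an anchor onto a ground truth box.

    Both boxes are (cx, cy, w, h).
    """
    a_cx, a_cy, a_w, a_h = anchor
    g_cx, g_cy, g_w, g_h = gt
    return ((g_cx - a_cx) / a_w,
            (g_cy - a_cy) / a_h,
            math.log(g_w / a_w),
            math.log(g_h / a_h))

def decode_box(anchor, offsets):
    """Inference: apply predicted offsets to an anchor to get the final box."""
    a_cx, a_cy, a_w, a_h = anchor
    dx, dy, dw, dh = offsets
    return (a_cx + dx * a_w,
            a_cy + dy * a_h,
            a_w * math.exp(dw),
            a_h * math.exp(dh))

anchor = (0.5, 0.5, 0.2, 0.4)          # normalized image coordinates
ground_truth = (0.52, 0.48, 0.3, 0.3)
target = encode_box(anchor, ground_truth)
print(decode_box(anchor, target))       # approximately the ground truth box
```

Zero offsets leave the anchor unchanged, which is why anchors that already fit an object well need only small corrections.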

Thanks for reading!

If you are looking for information about artificial intelligence, machine learning, general data concepts, or medical data science applications, follow us to acquire more useful knowledge about these topics.

Contact

Email: info@vinlab.io

Twitter: https://twitter.com/VinLab_io

YouTube: https://www.youtube.com/@Vinlab-MedicalImageAnnotation

Open source project: https://github.com/vinbigdata-medical/vindr-lab

Written by VinLab