Introduction to object detection and YOLO
Ten years ago, it was nearly impossible for computers to tell the difference between a cat and a dog. Today, advances in image classification and object detection allow computers to make that distinction with over 99% accuracy. Object detection is a computer technology that combines image processing and computer vision, using image-processing algorithms to accomplish vision tasks that emulate human sight, such as object recognition, defect detection, or automated driving. It is widely used for tracking objects, including in video surveillance and image retrieval. Among the many methods for object detection, YOLO (You Only Look Once) uses a single Convolutional Neural Network (CNN) to perform end-to-end object detection.
The following diagram illustrates the architecture of the CNN used in YOLOv3.
Source: Ayoosh Kathuria, What’s new in YOLO v3?, Towards Data Science.
The method was proposed by Joseph Redmon et al. of the University of Washington in 2015 and updated to version 3 in 2018, together with co-author Ali Farhadi, in the paper titled “YOLOv3: An Incremental Improvement”.
Past advanced detection systems such as R-CNN employ region proposal methods. Given an image, such systems first generate potential bounding boxes and then run a classifier on the proposed boxes. Post-processing is used after classification to refine bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene. Such complex pipelines are slow and hard to optimize since each individual component needs to be trained separately.
YOLO is a unified detection system built on a single convolutional network, which makes it more efficient than other detection systems. Because YOLO reasons globally about an image, it also makes fewer background errors than region proposal-based techniques.
“This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities” (Redmon, 2018). More specifically, the image shown below is divided into an S × S grid, and each cell in the grid is assigned a class probability map. The network first predicts bounding boxes using dimension clusters and then assigns each bounding box a confidence score using logistic regression. Finally, the class probability map and the bounding boxes with their confidence scores are combined to produce the final detections: bounding boxes with class labels.
Source: You Only Look Once: Unified, Real-Time Object Detection
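To make the scoring concrete, here is a toy numerical sketch (invented numbers, not the authors’ code): for each box, the class-specific confidence is the cell’s conditional class probability multiplied by the box’s objectness confidence.

```python
import numpy as np

# Toy example for one grid cell with 3 classes (all values invented).
class_probs = np.array([0.7, 0.2, 0.1])  # Pr(class_i | object) for this cell
box_confidence = 0.9                     # Pr(object) * IOU for one predicted box

# Class-specific confidence: Pr(class_i | object) * Pr(object) * IOU
scores = class_probs * box_confidence
print(scores)           # [0.63 0.18 0.09]
print(scores.argmax())  # 0 -> this box would be labeled with class 0
```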
Code implementation and explanation
We started our project from the official DarkNet GitHub repository, which accompanies the paper “YOLOv3: An Incremental Improvement”. The repository contains the source code (written in C) for the YOLOv3 model described in the paper, along with a step-by-step tutorial on how to use the code for object detection.
It is a challenging task to port the C implementation to Keras in Python. For example, even using a pre-trained model directly requires sophisticated code to distill and interpret the predicted bounding boxes output by the model. As a result, we learned the Keras implementation from a great GitHub project, “keras-yolo3: Training and Detecting Objects with YOLO3” by Huynh Ngoc Anh.
Download the Pre-trained Model Weights
The first step is to download the pre-trained model weights. These were trained using the DarkNet code base on the MSCOCO dataset. Download the model weights and place them into the current working directory with the filename “yolov3.weights.”
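For example, the file can be fetched directly in Python; the URL below is the one the official DarkNet site has historically served, so adjust it if the file has moved:

```python
import urllib.request

# Pre-trained YOLOv3 weights (~237 MB), trained with DarkNet on MSCOCO.
# If this mirror is unavailable, download the file manually and save it
# in the working directory as "yolov3.weights".
WEIGHTS_URL = "https://pjreddie.com/media/files/yolov3.weights"
urllib.request.urlretrieve(WEIGHTS_URL, "yolov3.weights")
```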
Next, we need to define a Keras model that has the right number and type of layers to match the downloaded model weights. The model architecture is called a “DarkNet” and was originally loosely based on the VGG-16 model.
Generally, YOLOv3 is structured as follows:
YOLOv3 Structure
Source: YOLOv3: An Incremental Improvement
The following figure displays a snapshot of the YOLOv3 model that we used for our project:
yolov3 model
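To give a flavor of this structure in Keras, the sketch below shows the basic building block that DarkNet stacks many times: convolution, batch normalization, and leaky ReLU, with residual shortcut connections. This is a simplified illustration rather than the full model definition; the keras-yolo3 project we followed provides a complete make_yolov3_model() function.

```python
from tensorflow.keras.layers import Add, BatchNormalization, Conv2D, Input, LeakyReLU
from tensorflow.keras.models import Model

def conv_block(x, filters, kernel_size, strides=1):
    """One DarkNet building block: convolution + batch norm + leaky ReLU."""
    x = Conv2D(filters, kernel_size, strides=strides,
               padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    return LeakyReLU(alpha=0.1)(x)

# A residual unit as used throughout the backbone: 1x1 bottleneck,
# 3x3 convolution, then a shortcut connection back to the input.
inputs = Input(shape=(416, 416, 3))
x = conv_block(inputs, 32, 3)          # stem
x = conv_block(x, 64, 3, strides=2)    # downsample
shortcut = x
x = conv_block(x, 32, 1)
x = conv_block(x, 64, 3)
x = Add()([shortcut, x])               # the "shortcut" layers in the figure
Model(inputs, x).summary()
```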
After defining the model and downloading the pre-trained weights, we call the load_weights() function to load the weights into the model, assigning them to the corresponding layers.
Then, we saved the model for further predictions.
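In code, these two steps look roughly as follows, assuming make_yolov3_model() and WeightReader are the helper definitions from the keras-yolo3 project (the module name below matches that project’s single-file script; adjust the import to your setup):

```python
# Helpers from the keras-yolo3 project; adjust the import to wherever
# those definitions live in your setup.
from yolo3_one_file_to_detect_them_all import make_yolov3_model, WeightReader

model = make_yolov3_model()              # Keras graph matching the DarkNet layers
reader = WeightReader("yolov3.weights")  # parse the binary DarkNet weight file
reader.load_weights(model)               # copy the weights into the matching layers
model.save("model.h5")                   # save once, reload later for predictions
```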
Since the model was pre-trained on a fixed set of classes, it can only detect the 80 classes listed below:
80 Classes of COCO
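In practice, these class names are kept in a plain Python list so a predicted class index can be mapped back to a human-readable label; the first few entries are shown here, with the remaining COCO names following in the same order:

```python
# The 80 MSCOCO class names the pre-trained model can detect, in the
# order the network indexes them (truncated here for brevity).
labels = [
    "person", "bicycle", "car", "motorbike", "aeroplane", "bus",
    "train", "truck", "boat", "traffic light",
    # ... the remaining 70 COCO classes ...
]

print(labels[0])  # a predicted class index of 0 means "person"
```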
Making Predictions with the Model
Below are some examples of objects detected by the model. The input test images must be loaded, resized, and scaled into the format the model expects: color images with a square shape of 416 × 416 pixels and pixel values scaled to the range 0–1. The model’s output encodes bounding boxes and class predictions, which need further interpretation; we therefore decode them and draw the bounding boxes on the original images for visualization.
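A rough sketch of that pipeline is shown below. load_img and img_to_array are standard Keras utilities; decode_netout, correct_yolo_boxes, and do_nms are helper functions from the keras-yolo3 project we followed, the anchor sizes are those published with YOLOv3, and the file name is a placeholder:

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Decoding helpers from the keras-yolo3 project; adjust the import to your setup.
from yolo3_one_file_to_detect_them_all import decode_netout, correct_yolo_boxes, do_nms

model = load_model("model.h5")

# Load and preprocess: resize to 416 x 416 and scale pixels to [0, 1].
image_w, image_h = load_img("test.jpg").size       # original size, to rescale boxes
image = load_img("test.jpg", target_size=(416, 416))
x = img_to_array(image).astype("float32") / 255.0
x = np.expand_dims(x, axis=0)                      # add a batch dimension

yhat = model.predict(x)                            # three grids of encoded boxes

# Anchor box sizes published with YOLOv3, one triple per output scale.
anchors = [[116, 90, 156, 198, 373, 326],
           [30, 61, 62, 45, 59, 119],
           [10, 13, 16, 30, 33, 23]]

boxes = []
for i in range(len(yhat)):
    # decode_netout(netout, anchors, objectness threshold, net height, net width)
    boxes += decode_netout(yhat[i][0], anchors[i], 0.6, 416, 416)

correct_yolo_boxes(boxes, image_h, image_w, 416, 416)  # map boxes to the original image
do_nms(boxes, 0.5)                                     # suppress duplicate detections
```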
The following cases are examples of running the YOLOv3 model:
- YOLOv3 detects the single person in the image with high confidence, over 97%.
Test example
Source: Weibo
- When the image contains more than one object, the YOLOv3 model can detect each of them individually. Since YOLOv3 sees the entire image during prediction, there are few background errors in the following instance, which is one of its strengths compared to other object detection algorithms. However, the example also points to a limitation: when multiple objects crowd together, the YOLOv3 model may detect them with lower confidence.
Test Example
Source: Google
- Another limitation of the YOLOv3 model is illustrated by the following images: it struggles to localize small objects that appear in groups. In the two instances below, it fails to detect some of the people, and the flock of birds confuses the model, which loses the ability to detect the birds individually.
Test Example
Source: Google
Test Example
Source: Google
Next Steps
With the pretrained YOLOv3 model, which can detect 80 categories, we want to extend the model by training it on our own custom dataset. In the next stage, we will focus on the detection of traffic signs, which are key map features for navigation, traffic control, and road safety. As autonomous driving matures, accurate and robust detection of traffic signs will be a crucial step for route guidance and early warning.
Our training and test data come from Open Images V6, a public database open-sourced by Google. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations. The boxes have largely been drawn manually by professional annotators to ensure accuracy and consistency, and the images are very diverse, often containing complex scenes with several objects (8.3 per image on average). We will use only one of the categories, traffic signs, to retrain our model. The images and labels are downloaded into separate folders.
Detailed instructions for downloading the dataset from Open Images V6 are available here: Colab Coding Instruction
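As one possible illustration (not the Colab’s exact code), the third-party openimages package can pull a single class; the package name, function signature, and the “Traffic sign” label spelling are assumptions to verify against the linked instructions:

```python
# Hypothetical sketch using the third-party "openimages" package
# (pip install openimages); verify the class label spelling against
# the Open Images class list before running.
from openimages.download import download_dataset

download_dataset(
    dest_dir="data",                # images and labels land in separate subfolders
    class_labels=["Traffic sign"],
    annotation_format="darknet",    # YOLO-style text label files
    limit=1000,                     # cap the number of images while experimenting
)
```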
References
Ayoosh Kathuria, What’s new in YOLO v3?, Towards Data Science.
The article discusses the metrics, algorithms and math behind YOLO v3 as well as the benchmark to evaluate its performance.
Joseph Redmon et al., You Only Look Once: Unified, Real-Time Object Detection
This paper presents YOLO, a unified object detection algorithm. YOLO is extremely fast compared to other object detection systems and can achieve rather high accuracy.
Joseph Redmon & Ali Farhadi, YOLOv3: An Incremental Improvement
This paper presents YOLOv3, which adds some updates to YOLO to make it better. YOLOv3 is more accurate and still fast.