Overview:

In our initial blog post, we introduced the concept of object detection and how YOLO (You Only Look Once) uses a convolutional neural network (CNN) to perform end-to-end object detection without hand-defined features. In that post, we also adopted the implementation from the paper “YOLOv3: An Incremental Improvement” as our baseline model. The official GitHub repository contains the source code for YOLOv3 as implemented in the paper (written in C) and provides a step-by-step tutorial on how to use the code for object detection. In our midway blog post, we focused on the application of object detection to self-driving cars and trained our custom detector on images of traffic signs collected from Google’s Open Images Dataset V6. To give better guidance while driving and save time when making decisions, we need a model that produces real-time results with high precision. Thus, in this blog post, we continue working on autonomous driving scenarios and deliver a model that can handle video input and make real-time decisions within milliseconds.

Set Up:

In this project, we use Anaconda on a local CPU-only machine as our platform, which lets us manage dependencies automatically and switch between environments. We downloaded the pre-trained YOLOv3 weights and converted them to a TensorFlow model, as shown below:

TF model


In our experiments, we tried three sets of weights: the full YOLOv3 pre-trained weights, the tiny YOLOv3 pre-trained weights, and the weights we saved when training our custom detector. Each set has its own advantages and shortcomings, which are analyzed in the next part. Now we are ready to process the video input and run the real-time detector.

Real-time YOLOv3 Object Detection on Traffic Signs

First, we loaded the weights that we saved while training on traffic signs in the midway blog post. Then, we selected a clip cut from a dash cam recording as the input to our model. The video is treated as a sequence of thousands of images (frames), and we store its width, height, and FPS for later use. Before feeding the frames into the model, we preprocessed them to match the model's input requirements. First, the default image size for this model is 416, so we resized every frame with padding so that one of the height and width is 416 and the other is less than or equal to 416 but still divisible by 32. Next, we converted the resized frames to RGB format and expanded each one by adding a new first dimension (the batch axis). After preprocessing, we passed the frames through the model with the loaded weights. For each frame, the model predicts the bounding boxes, classes, and confidence scores and draws the results onto the image. After looping through all frames, the newly generated video consists of those frames with their predictions and marks.
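
The resize rule described above can be sketched in plain Python. This is a minimal illustration, not the code we ran: the function name `letterbox_shape` and its parameters are hypothetical, and we assume the longer side is scaled to 416 while the shorter side is padded up to the next multiple of 32. The 1280x720 frame size is also only an assumed example.

```python
import math


def letterbox_shape(width, height, target=416, stride=32):
    """Return ((scaled_w, scaled_h), (padded_w, padded_h)) for one frame.

    Hypothetical helper: the longer side is scaled to `target`, and the
    shorter side is then padded up to the next multiple of `stride`.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if target % stride != 0:
        raise ValueError("target must be a multiple of stride")

    longest = max(width, height)
    # Scale both sides by the same factor so the aspect ratio is preserved.
    scaled_w = max(1, round(width * target / longest))
    scaled_h = max(1, round(height * target / longest))

    def pad_up(side):
        # Round up to a multiple of the stride, never exceeding the target.
        return min(target, math.ceil(side / stride) * stride)

    return (scaled_w, scaled_h), (pad_up(scaled_w), pad_up(scaled_h))


# A 1280x720 dash cam frame (assumed resolution, for illustration only).
scaled, padded = letterbox_shape(1280, 720)
print(scaled, padded)
```

For a 1280x720 frame, this gives a scaled size of 416x234 and a padded size of 416x256, so one side is 416 and the other is at most 416 and divisible by 32.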

Here is the result of the real-time detector using the pre-trained YOLOv3 weights. It is slow but accurate and powerful: on CPU, it runs at under 2 FPS.

sample detection with YOLOv3 weights


We also tried the detector with the tiny YOLOv3 weights. It is much faster than the full YOLOv3, but it fails to detect crowds of people and its bounding boxes are slightly off, so it is not as accurate as the full model. The FPS is around 9.

sample detection with tiny YOLOv3 weights
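
FPS figures like the ones above can be obtained by timing the detection loop. The sketch below shows one way to do that. It is not the measurement code we used: `measure_fps` and `processing_seconds` are hypothetical helpers, `detect` stands in for any per-frame detector, and the 30 FPS source frame rate in the example is an assumption.

```python
import time


def measure_fps(frames, detect, clock=time.perf_counter):
    """Run `detect` on every frame and return frames processed per second."""
    start = clock()
    count = 0
    for frame in frames:
        detect(frame)  # placeholder for the model's per-frame prediction
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")


def processing_seconds(clip_seconds, video_fps, detector_fps):
    """Estimate how long a clip takes to process at a given detector speed."""
    total_frames = clip_seconds * video_fps
    return total_frames / detector_fps


# A 20-second clip at an assumed 30 FPS contains 600 frames.
print(processing_seconds(20, 30, 2))  # detector running at 2 FPS
print(processing_seconds(20, 30, 9))  # detector running at 9 FPS
```

Under these assumptions, a detector running at 2 FPS needs about 300 seconds for a 20-second clip, while one running at 9 FPS needs roughly 67 seconds.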


To test our custom detector, we used video from a car's dash cam recorder. We limited the clip to 20 seconds so that results could be generated in a timely manner. The following video is an example of our testing:

sample detection with custom YOLOv3 weights


Shortcomings:

  • The pre-trained YOLOv3 model detects 80 classes, but our customized model trained with transfer learning detects only the traffic sign class, so it is less flexible in detecting other objects such as cars, persons, trucks, and buses. To reach a satisfying mAP, a large number of training images is required for each class (~1000 images). Our single-class training run took around 6 hours on Colab's free GPU, and more time and computing power will be required if more classes are trained. For example, retraining all 80 classes on the COCO dataset takes approximately 1 week.

  • Creating a custom labeled dataset for training can be time-consuming if no dataset is available or the existing annotations are not in a format that Darknet can train on. A labeled custom dataset can be created either from Google images that have already been labeled or by using annotation tools to draw target labels on images manually. The second approach requires a long preparation period before the model can be trained.

Potential Improvements:

  • Multilingual Traffic Sign Detector:

    Many traffic signs, such as speed limit signs, are used universally around the world. However, some common traffic signs contain text, which makes them hard for foreign drivers to read because of language barriers. Using transfer learning, we can train a customized model on traffic signs that contain text in different languages, such as stop signs in English (STOP) and in Chinese (停). The new model would be able to detect traffic signs containing text and assist drivers traveling in foreign countries.

    Chinese stop sign
    Source: google


    Korean stop sign
    Source: google


    Japanese stop sign
    Source: google


  • Object Detection API:

    Object detection is widely used in industry, with applications including autonomous driving, face recognition, and identity verification. It would be useful and convenient to build object detection APIs that can be integrated into any web or mobile app.

    Nice application by ultralytics.com

    API example
    source: ultralytics.com


  • Object Tracking:

    Another extension of object detection is object tracking, which follows objects from frame to frame throughout the video. Real-time tracking techniques such as Deep SORT learn the appearance features of a person and remember them, which makes it possible to find that person again after they leave and re-enter the frame.
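
To make the idea of tracking concrete, here is a deliberately simplified sketch that carries object IDs from one frame to the next by matching bounding boxes on intersection over union (IoU). This is not Deep SORT, which additionally relies on motion prediction and learned appearance features and can therefore re-identify objects that left the frame. The names `iou` and `update_tracks`, the `(x1, y1, x2, y2)` box format, and the 0.3 threshold are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def update_tracks(tracks, detections, next_id, threshold=0.3):
    """Greedily match new detections to existing tracks by IoU.

    `tracks` maps track ID -> last known box. Returns (new_tracks, next_id).
    Tracks without a matching detection are dropped in this sketch.
    """
    unmatched = dict(tracks)
    updated = {}
    for box in detections:
        best_id, best_score = None, threshold
        for track_id, previous_box in unmatched.items():
            score = iou(previous_box, box)
            if score >= best_score:
                best_id, best_score = track_id, score
        if best_id is None:
            # No existing track overlaps enough: start a new one.
            best_id = next_id
            next_id += 1
        else:
            del unmatched[best_id]
        updated[best_id] = box
    return updated, next_id


# Two objects in the first frame, both shifted slightly in the second.
tracks, next_id = update_tracks({}, [(0, 0, 10, 10), (50, 50, 60, 60)], 0)
tracks, next_id = update_tracks(tracks, [(51, 50, 61, 60), (1, 0, 11, 10)], next_id)
print(tracks)
```

Because matching here depends only on box overlap between consecutive frames, an object that disappears and later returns receives a new ID, which is exactly the limitation that appearance-based trackers aim to remove.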

References:

Video from a Dash Cam Recorder in New York

This YouTube video is a dash cam recording that documents traffic in New York City for one and a half hours. We used a 20-second clip of it and applied YOLOv3 to that clip.

YoloV3 Implemented in Tensorflow 2.0

This repo provides a clean implementation of YoloV3 in TensorFlow 2.0 using all the best practices. It includes the steps to set up a local machine and the documentation for the video applications.

Chinese stop sign

Example of stop sign in Chinese.

Japanese stop sign

Example of stop sign in Japanese.

Korean stop sign

Example of stop sign in Korean.

ultralytics.com

This is a nice iOS application that uses the YOLOv4 API (an updated version of YOLOv3 with higher speed) to perform object detection on your mobile device. All computation is done on the phone, so no image data is sent to a server.