What YOLO Model Architecture Is Best for You?

December 21, 2022 by Kathleen Siddell

YOLO is a state-of-the-art, real-time object detection system used in many computer vision solutions worldwide. The model has received several updates since its release, each claiming to be the best.

You would naturally think the latest release is always the best version out there. However, that isn't necessarily the case, and we can explain why. This article walks through the considerations for choosing the right YOLO version and helps you determine the best YOLO model architecture for your needs.

Real-Time Object Detection With YOLO

What Is Object Detection? 

Object detection is the process of identifying, locating, and labeling objects in an image or video using a machine learning or deep learning algorithm. This process also allows counting the number of objects in a scene and determining the exact location of each object. Real-time object detection is the process of locating objects in video data and is widely used across many industries: automotive for self-driving cars and pedestrian detection in road traffic, retail for detecting shopper behavior, and many more.
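To make the three parts of the task concrete, here is a minimal, illustrative sketch of what a detector's output typically looks like and how it supports counting. The `Detection` structure and the example values are hypothetical, not the output of any specific YOLO implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # identifying: what the object is
    confidence: float   # model confidence, 0.0 to 1.0
    box: tuple          # locating: (x_min, y_min, x_max, y_max) in pixels

# Hypothetical output for a single frame from a traffic camera
detections = [
    Detection("car", 0.94, (34, 120, 210, 260)),
    Detection("car", 0.88, (400, 95, 560, 230)),
    Detection("person", 0.76, (610, 50, 660, 180)),
]

def count_objects(detections, label, min_confidence=0.5):
    """Count detections of a given class above a confidence threshold."""
    return sum(1 for d in detections
               if d.label == label and d.confidence >= min_confidence)

print(count_objects(detections, "car"))  # prints 2
```

Real detectors return the same three pieces of information per object (class, confidence, bounding box); downstream analytics such as counting or tracking are built on top of that output.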

Object Detection With YOLO and How It Compares to Other Detectors 

Because YOLO ("You Only Look Once") scans an entire image in a single pass rather than in multiple stages, its main advantage is its incredible speed compared to detectors like Fast R-CNN and R-CNN. YOLO's speed and ability to run at 45 frames per second make it an ideal choice for real-time object detection use cases like autonomous driving or medical applications.

The Background of YOLO Model Architecture

The YOLO model architecture was created by Joseph Redmon, who developed YOLO versions one through three. The original architecture is built on Darknet, an open-source neural network framework, and consists of 24 convolutional layers followed by fully connected layers. It outperformed other popular object detection models when evaluated on the COCO dataset, which contains 80 object classes, including people, cars, animals, bicycles, and kitchen and dining objects.

Up to YOLOv3, Joseph Redmon introduced significant improvements with different features and levels of speed and accuracy. His paper titled "YOLO9000: Better, Faster, Stronger" introduced YOLOv2, which was significantly faster than its predecessor and could process images at 40–90 frames per second (FPS). The next paper, "YOLOv3: An Incremental Improvement," introduced design changes that let YOLO detect objects with greater accuracy. After YOLOv3, Joseph Redmon announced his intention to stop computer vision (CV) research due to privacy concerns and misuse of the technology.

Development was continued by Alexey Bochkovskiy, who introduced YOLOv4 in his research paper titled "YOLOv4: Optimal Speed and Accuracy of Object Detection." YOLO was one of the first systems to detect objects in a single stage: earlier object detection methods used multiple models or stages and fine-grained operations to find the approximate location of an object. What made YOLOv1 and YOLOv2 useful was their combination of high accuracy and greater speed than the object detection models of their time. YOLOv4 continues that approach, providing fast inference by detecting and classifying objects in a single pass.

Existing YOLO Model Architecture and What Each Is Good For 

YOLOv3 and v4 improve on the initial versions, while v2 is known for its accuracy, speed, and architecture. YOLOv3 has a high mean average precision (mAP) and is faster than other detection methods. Additionally, it increases precision on small objects and uses a multi-label technique to detect more specific classes.

V4 was built with widespread deployment in mind, greatly increasing inference speed and providing an optimal balance of speed and accuracy for object detection. It uses techniques called the "bag of freebies" (BOF) and "bag of specials" (BOS). The BOF methods improve accuracy through training-time techniques such as data augmentation, which require little additional computational power and do not increase inference time; the BOS methods add a small inference cost in exchange for a significant accuracy gain.

Glenn Jocher released YOLOv5 just two months after the release of v4. YOLOv6 and YOLOv7 followed in 2022. Compared to other versions, YOLOv6 and v7 are more experimental and have not been widely tested by the academic community. They also come with less official documentation and support than v3 and v4.

How To Determine Which YOLO Is Right for You

While several YOLO versions claim to be better than their predecessors, the latest release is not necessarily the best version for your needs. Here are some questions you should answer to determine the right YOLO version for you.

How Many Object Classes Do I Have? 

First, find out the important statistics of your data set. The number of object classes in your data set depends on the complexity of the environment. For example, there could be hundreds of retail objects in an inventory tracking application. In a simpler environment or use case, you may only need to detect a single object (e.g., counting the number of people who enter a store).

How Much Training Time Do I Need?

It is important to consider the training cost since training isn't a one-time factor in developing a successful application. The amount of time necessary for training will depend on the model you choose and your data. You may train it hundreds of times until you reach your expected accuracy and detection speed. Additionally, the size of the data set (number of images) and the number of times the model sees that data set (epochs) have a significant impact on training time.
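The relationship between dataset size, epochs, and training time can be sketched as a back-of-envelope estimate. The function and the per-batch timing below are illustrative assumptions, not measurements from any specific model or hardware:

```python
def estimate_training_hours(num_images, epochs, batch_size, secs_per_batch):
    """Rough training-time estimate: batches per epoch, times epochs,
    times the measured (or assumed) seconds per batch."""
    batches_per_epoch = -(-num_images // batch_size)  # ceiling division
    total_seconds = batches_per_epoch * epochs * secs_per_batch
    return total_seconds / 3600

# e.g., 10,000 images, 100 epochs, batch size 16, ~0.5 s/batch on one GPU
print(round(estimate_training_hours(10_000, 100, 16, 0.5), 1))  # prints 8.7
```

Doubling either the image count or the epoch count doubles the estimate, which is why complex environments with many classes (and thus more required data) train noticeably slower.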

A more complex environment will also likely have more classes, increasing the training time. Training time also depends on the environmental conditions: there are more changing elements when detection happens outdoors, while there are fewer objects to detect when detection happens indoors with more consistent or controlled lighting and a static background.

What Type of Processing Do I Need?

An important consideration for efficient operation is understanding the processing requirements of your application and devices. Having more objects to detect doesn't necessarily mean the model runs slower. For most CNNs, processing cost depends more on the number of input pixels than on the number of classes, as larger image sizes require more compute.

How Much Power Do I Need?

Next, you need to consider the power required for your detection system. The type of device will affect your power demands as some devices require more power than others. Additionally, consider the power source. Will the device be plugged into an outlet or powered by a battery or other limited power source? You especially need to consider power usage in mobile environments like vehicles where battery use is the only option.

How Fast Do I Need To Run My Models? 

Consider your expected frames per second (FPS), which is important in object tracking, as well as how often the data needs to be pushed, which is important for real-time analysis and decision-making.

Frames per second represents how many times per second your model runs. Most cameras operate between 24 and 30 frames per second, but that is often more data than most use cases require. For example, analytics measuring cars arriving in a parking lot does not need to process 30 frames a second; most cars arrive only every few seconds. However, in self-driving vehicle situations or other high-stakes real-time detection, you may want object data at 30 frames per second or more to ensure optimal performance.
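A common way to match a 30 FPS camera to a lower analytics rate is frame skipping: run the detector only on every Nth frame. Here is a minimal, library-free sketch of that logic (the function names and the 2 FPS parking-lot target are illustrative):

```python
def frames_to_skip(camera_fps, target_fps):
    """How many frames to drop between processed frames to approximate
    the target processing rate."""
    if target_fps >= camera_fps:
        return 0  # process every frame
    return round(camera_fps / target_fps) - 1

def should_process(frame_index, camera_fps, target_fps):
    """Decide, per incoming frame, whether to run the detector on it."""
    stride = frames_to_skip(camera_fps, target_fps) + 1
    return frame_index % stride == 0

# Parking-lot analytics: 30 FPS camera, ~2 detections per second is plenty
print(frames_to_skip(30, 2))  # prints 14 (process 1 frame, skip 14)
```

In a real pipeline, this decision would gate the expensive inference call inside the video-capture loop, cutting compute by roughly the skip ratio.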

How Do I Plan To Maintain My Models?

Find out what your classic software upkeep and maintenance requirements are. Do you perform continuous integration or continuous deployment (CI/CD)? Are you updating the model architecture for better performance?

Answering these questions will help you determine which YOLO version suits you best. Remember that YOLO is a model created by academics in AI and machine learning. They built neural networks for competitions with clearly defined rules and datasets, scored against a specific task.

However, object detection is a more complex and varied process in the real world, so it is important to understand your application and use cases before deciding on the best YOLO version.

Introducing the Newest YOLO, YOLO by alwaysAI

The YOLO models designed by academia and published for high performance in competitions often do not solve production-level use cases well without some tweaking. Most of the data these models were created for is stock footage or a vast collection of random, unrelated images. In production use cases, data often follows more explicit patterns, such as frames from a single camera with similar background environments, so a new YOLO architecture tailored to production-level computer vision scenarios is needed.

At alwaysAI, we’ve taken the best of what YOLO has to offer and optimized it for our computer vision platform. With YOLO by alwaysAI, our goal is to create the best model architectures for real-world deployments and deliver models and solutions that solve our customers’ problems. We optimized YOLO for higher performance, based on observations from YOLO and other object detection applications in real-world business use cases.

It brings many additional benefits in terms of streamlining and simplifying the CV development process. It not only improves the ease and speed of training but also simplifies the deployment and maintenance of your CV application.

alwaysAI offers three types of YOLO-based models accessible through our computer vision platform. 

YOLO Dynamic

Based on YOLOv4, YOLO Dynamic is best for applications that require the detection of multiple object classes. It is optimized for object detection in complex environments with changing conditions and backgrounds. We chose YOLOv4 as the root architecture due to the high performance and reliability seen in production environments.

YOLO Static

YOLO Static is based on YOLOv4 and is also a multi-object detection system, but it is optimized for object detection in less complex environments where conditions have less variability. For example, imagine a scenario where your camera is fixed and always faces the same environment, such as a retail store, a frictionless checkout system, or detection against a static background. The model can be trained and optimized for efficiency since it only has to detect a specific set of items under predictable conditions.

YOLO Classic 

YOLO Classic, based on YOLOv3, is optimized for simple environments that have fewer items or object classes (typically fewer than ten) to detect. YOLOv3 is one of the most robust detection models and has demonstrated incredible reliability over extensive production hours in real-world use cases, more so than any other detection model.

Advantages of Using YOLO by alwaysAI

Advantages of YOLO Dynamic and Static

Both YOLO Dynamic and Static are capable of detecting many object classes. But, they are best suited for different types of deployment environments. YOLO Static is optimized for controlled situations like environments where conditions such as lighting and backgrounds don't change (e.g., a fixed camera with an interior view of a retail store). It allows for faster training and easier maintenance. On the other hand, YOLO Dynamic is aimed at complex environments with many changing elements (e.g., a construction site with variable daylight and background elements).

The Advantages of YOLO Classic

YOLO Classic is based on YOLOv3, probably the most widely used production-level object detection model worldwide, with deep field experience, robustness, and support across customer deployments. It’s a battle-tested model that delivers accurate results quickly and efficiently. Since the number of classes to detect is limited to only what’s needed for your application, YOLO Classic provides fast inference, high accuracy, and faster training.

Overall Advantages of YOLO by alwaysAI:

  • Allows rapid training of models with our model training toolkit
  • Requires less data for high-quality performance
  • Easily converts for accelerators, runtimes, and edge devices such as NVIDIA TensorRT, NVIDIA Jetson Nano, NVIDIA Jetson Xavier NX, NVIDIA Jetson Xavier AGX, Hailo, Intel Myriad, and ONNX
  • All architectures are well-maintained by alwaysAI, ensuring high performance, enterprise quality, uptime, and accuracy
  • Simple edge deployment packaging with the alwaysAI manager
  • A tested system backed by a pedigree of CV experts
  • Provides the fastest path to ROI compared to the latest untested versions of YOLO

Object Detection Use Cases

Object detection is a common computer vision task that forms the basis for many practical business applications. Object detection use cases often have objectives like locating, tracking, and counting objects or detecting anomalies and outliers in an environment. Object detection data provides valuable business information that allows stakeholders to respond to events in real-time. Businesses and enterprise developers can use YOLO by alwaysAI for:

  • Contactless checkout. Use the alwaysAI platform to create a completely frictionless shopping experience. With sensors and cameras powered by AI, shoppers simply pick items off a shelf and build a 'virtual' shopping cart. Some touchless grocery systems even allow shoppers to simply walk out of the store, automatically charging the bill to their linked account. They also have the option to walk up to a kiosk and quickly pay with a card or mobile device.
  • Improve worker safety. Object detection solutions can be deployed to monitor safety in workplaces, construction sites, and industrial settings in real-time. Cameras can be trained to detect the presence of PPE like hard hats, vests, safety glasses, and more to send alerts to managers if certain workers are not in compliance.
  • Productivity improvements. Computer vision can boost productivity and efficiency, improving the bottom line. Object detection monitors workers in factories, warehouses, production facilities, and construction sites, providing valuable data on their work activities. It is important to note that workers can be tracked while also maintaining their individual privacy.
  • Customer analytics in retail. Object detection applications help store operators better understand metrics like foot traffic, dwell time, and wait times to improve the customer experience.

How Can You Use YOLO by alwaysAI?

Even though the developers of each YOLO version claim theirs is better, the latest version may not be the one that best suits your needs. The vast landscape of model architectures and open-source tools can make computer vision complex. But the alwaysAI computer vision platform exists to simplify CV development and allow businesses to focus on creating value-rich applications that drive high ROI.

You can use our pre-trained YOLO models and starter applications to get started with computer vision in your environment very quickly. Train and refine your application with your data in our fast, easy-to-use model training studio with a single button click.

We make computer vision come alive on the edge - where work and life happen. 

Do you want to leverage the power of state-of-the-art object detection using YOLO by alwaysAI? Speak with an AI expert today to learn more about how YOLO can help you achieve your business goals.

Drive More Real-Time Intelligence and Higher ROI in Your Business