Data Capture Guidelines
Capturing data is the first step in the model training process, and one of the most important. A model is only as good as the dataset it is trained on; as the saying goes: garbage in, garbage out. Keeping that in mind, we have compiled a list of considerations that will help you to ensure the data you capture gives you the best chance at an accurate, robust model.
There are three ways to generate a dataset:
Collect the dataset yourself
Acquire data from outside sources
Use a digitally generated dataset.
This article focuses on the first method; however, the considerations we discuss apply to the other ways of generating a dataset as well.
You can collect your dataset in either image or video format. The main, overarching theme to keep in mind is: collect as you will inference. In computer vision, “inference” is the term we use for applying a trained model to an input to infer an outcome. If your target application will be analyzing random images from the internet, your dataset should be images pulled from the internet. If your target application will be running on security camera video footage collected from a camera in a high corner of a building lobby, then your data should come from a similar, or preferably the same, video camera. Model training is done on images, and inference is technically done on images as well, even when it is analyzing a video stream; the point is to train on data that resembles what the model will see in the real-world application. Once your video data is collected, you can easily sample the videos to create images to use for training. While you may be able to find a ready-made dataset, or generate one using images or video collected by someone else, having full control of the source of your data lets you ensure a better-quality dataset.
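Sampling a video into training images mostly comes down to deciding which frames to keep. The sketch below only computes the frame indices; decoding and saving the frames would be done with whatever video library you use. The function name `sample_frame_indices` and its parameters are hypothetical and chosen for illustration.

```python
def sample_frame_indices(total_frames, fps, every_seconds):
    """Return the indices of the frames to keep, roughly one every `every_seconds`."""
    if total_frames < 0 or fps <= 0 or every_seconds <= 0:
        raise ValueError("total_frames must be >= 0; fps and every_seconds must be > 0")
    # Number of frames between two kept frames; never less than 1.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip recorded at 30 frames per second, sampled every 2 seconds.
indices = sample_frame_indices(total_frames=300, fps=30, every_seconds=2)
print(indices)  # [0, 60, 120, 180, 240]
```

Sampling at an interval, rather than keeping every frame, avoids filling the dataset with near-identical consecutive images; the right interval depends on how quickly the scene changes.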
The term label balance refers to having a roughly equal number of example images for each class (or label, as classes are often called) that you are training your model to recognize. If there is a large discrepancy in the number of images across classes, e.g. you are training a model to recognize bottles and cans and have 2000 images of bottles and 50 images of cans, the dataset, and therefore the model, will not be balanced. This could result in a disparity in accuracy or precision across classes and will generally be detrimental to your model.
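A quick way to check label balance is to count the examples per class and compare the largest count to the smallest. The sketch below uses the bottles-and-cans numbers from the example above; the helper names and the `max_ratio` cutoff are assumptions for illustration, not a recommended threshold.

```python
from collections import Counter


def imbalance_ratio(counts):
    """Ratio of the most common class count to the least common one."""
    if not counts:
        raise ValueError("no labels to count")
    return max(counts.values()) / min(counts.values())


def underrepresented(counts, max_ratio):
    """Classes whose count is more than `max_ratio` times smaller than the largest class."""
    largest = max(counts.values())
    return sorted(label for label, n in counts.items() if n * max_ratio < largest)


# One label per training image (hypothetical dataset).
labels = ["bottle"] * 2000 + ["can"] * 50
counts = Counter(labels)

print(imbalance_ratio(counts))                 # 40.0
print(underrepresented(counts, max_ratio=2))   # ['can']
```

Classes flagged this way are candidates for further data collection before training.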
The optimal lighting for your dataset depends on the lighting of your target application. To return to the security camera example, the lighting will vary greatly depending on whether your security camera is inside or outside, and whether it is running during the night, the day, or both. A camera that is inside may have consistent lighting throughout the day, and even the night, whereas a camera outside is subject to changes in lighting due to things like weather and time of day. None of these things is an impediment; however, you’ll want to take them into consideration, and ideally have examples of all the lighting conditions your model will be exposed to in your training data.
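One rough way to see whether a dataset covers several lighting conditions is to group images by average brightness. The sketch below assumes each image is already available as rows of grayscale pixel values from 0 to 255; the function names and the cutoffs of 85 and 170 are arbitrary choices for illustration.

```python
from collections import Counter


def mean_brightness(gray_image):
    """Average pixel value of a grayscale image given as a list of rows."""
    pixels = [value for row in gray_image for value in row]
    if not pixels:
        raise ValueError("image has no pixels")
    return sum(pixels) / len(pixels)


def lighting_bucket(brightness, dark_below=85, bright_above=170):
    """Coarse lighting category; the default cutoffs are illustrative assumptions."""
    if brightness < dark_below:
        return "dark"
    if brightness > bright_above:
        return "bright"
    return "medium"


# Tiny stand-in images (2x2 pixels) for a night, a midday, and a dusk capture.
night = [[10, 20], [30, 40]]
noon = [[200, 220], [240, 250]]
dusk = [[90, 110], [120, 140]]

coverage = Counter(lighting_bucket(mean_brightness(img)) for img in (night, noon, dusk))
print(dict(coverage))
```

A bucket with very few images suggests a lighting condition that is missing from the captured data.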
The angle that objects are viewed from can drastically change their apparent shape. An umbrella seen from above or below looks like an octagon, but from the side it looks like a crescent with a line. When collecting data, consider the angle you will be inferencing from, whether your camera will be high or low, as well as the direction in which your targets will be crossing the frame, if it is relevant.
When humans look at a scene, our brains perform a lot of processing to interpret whether an object is close to us or far away. The main factor in this is size: the closer an object is to us, the bigger it appears. We need to take that into consideration when training a computer vision model. In order to teach the model to recognize an object consistently regardless of how close it is to the camera, we need images of the object from a wide range of distances in our dataset. That means we need images where our target class takes up most of the frame, as well as images where the target class takes up very little of the frame. Try to capture the target object from a variety of distances, especially if the object will be moving towards or away from the camera in the target application.
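If the images are already annotated with bounding boxes, the share of the frame each box covers gives a simple proxy for distance. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` in pixels; the helper names and the 5% and 40% cutoffs are hypothetical.

```python
from collections import Counter


def box_area_fraction(box, image_size):
    """Fraction of the image area covered by a bounding box."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    return box_area / (width * height)


def size_bucket(fraction, small_below=0.05, large_above=0.40):
    """Coarse size category; the default cutoffs are illustrative assumptions."""
    if fraction < small_below:
        return "small"
    if fraction > large_above:
        return "large"
    return "medium"


image_size = (1920, 1080)
boxes = [
    (0, 0, 192, 108),    # far away: about 1% of the frame
    (0, 0, 480, 540),    # mid range: 12.5% of the frame
    (0, 0, 960, 1080),   # close up: 50% of the frame
]

buckets = Counter(size_bucket(box_area_fraction(b, image_size)) for b in boxes)
print(dict(buckets))
```

If nearly all boxes land in one bucket, the dataset is probably missing the near or far examples described above.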
Resolution will play a role in the quality of the model if there is a large discrepancy between the resolution of the training images and that of the images used for inference. For example, if the training images are high-definition, the model will have trouble finding the same shape in grainy, low-resolution images. Typically, resolution is close enough across most devices; however, it is good to take this into consideration in general.
Most likely, the framework on which you train the model will re-scale all images so that they are consistent for training. However, if the raw images that you gather for training have a wide range of scales, this re-scaling will affect all of them differently, which will have a negative impact on your model. Try to use training images that are at roughly the same scale.
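To get a feel for how differently re-scaling would treat each image, you can compare image dimensions before training. The sketch below assumes "scale" refers to pixel dimensions and that the framework resizes the longer side of every image to a fixed length; both the 640-pixel target and the function names are assumptions, not the behavior of any particular framework.

```python
def resize_factor(image_size, target_long_side=640):
    """Factor by which an image would be scaled so its longer side hits the target."""
    width, height = image_size
    return target_long_side / max(width, height)


def long_side_spread(image_sizes):
    """Ratio of the largest to the smallest long side across the dataset."""
    long_sides = [max(width, height) for width, height in image_sizes]
    if not long_sides:
        raise ValueError("no image sizes given")
    return max(long_sides) / min(long_sides)


# (width, height) of three hypothetical source images.
sizes = [(640, 480), (1920, 1080), (3840, 2160)]

for size in sizes:
    print(size, round(resize_factor(size), 3))

print(long_side_spread(sizes))  # 6.0
```

A spread close to 1 means all images are re-scaled by about the same amount; a large spread means some images are shrunk far more than others.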
Your occlusion tolerance is something that you have to make a decision on when gathering data. How much of an object do you want to be visible before your model detects it? 50%? 80%? 20%? Keep in mind, if you want a partially visible object to be detected by your model, you need a large number of examples of the object being occluded in your dataset. In addition, the more an object is occluded, the fewer defining features can be detected, and as such, you may introduce false positives, or reduce the accuracy of inferences, if you use images containing occluded objects as your training data.
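Once you have picked an occlusion tolerance, you can apply it consistently when selecting examples. The sketch below estimates the visible fraction of an object from two boxes, assuming a single rectangular occluder and boxes given as `(x_min, y_min, x_max, y_max)`; real scenes are rarely this simple, and the function names and the 50% tolerance are hypothetical.

```python
def intersection_area(a, b):
    """Area of overlap between two axis-aligned boxes."""
    x_min, y_min = max(a[0], b[0]), max(a[1], b[1])
    x_max, y_max = min(a[2], b[2]), min(a[3], b[3])
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def visible_fraction(object_box, occluder_box):
    """Share of the object's box that is not covered by the occluder's box."""
    area = (object_box[2] - object_box[0]) * (object_box[3] - object_box[1])
    if area <= 0:
        raise ValueError("object box must have positive area")
    return 1 - intersection_area(object_box, occluder_box) / area


def within_tolerance(object_box, occluder_box, min_visible=0.5):
    """True if at least `min_visible` of the object remains visible."""
    return visible_fraction(object_box, occluder_box) >= min_visible


target = (0, 0, 100, 100)
print(within_tolerance(target, (50, 0, 150, 100)))  # half covered: True
print(within_tolerance(target, (20, 0, 150, 100)))  # mostly covered: False
```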
If you are going to be inferencing in a location that has weather, i.e. outside, try to account for that when gathering data. If all the data you collect is from a bright sunny day, what happens when it rains or snows? Clouds will reduce the light, rain will add an artifact over the entire inference area that needs to be accounted for, snow will completely change the background, and so on. Think about whether you will be inferencing in various weather conditions, and try to incorporate that as best you can into your data collection. You may not be able to make it rain when you are capturing data, but you may be able to approximate the reduced light by using images from times of day with varying amounts of light, like early morning or evening.
The background of your training dataset can drastically change how your target classes are recognized. If you collect all your data in a controlled environment, say with a white background, it will be easy for the model to recognize the objects you are training on, but this won’t translate to an accurate model in the real world. In the real world, your object may be camouflaged by the background, or half of it may blend in while the other half stands out, or any number of other situations may arise. To generate a robust model that is accurate in many situations, vary the background of the training dataset as much as possible.
What happens if your target class is in the background and there are other things in the foreground? It will affect the focus of your hardware, the clarity of your object, and the overall visibility of your target. This is a very likely situation when you deploy your model in the real world. There is no guarantee that what you are training for will be front and center in your image, so try to include images that have things other than your target class as the foreground.