Regardless of how you gathered the data, the next step is to annotate it. The type of annotation required depends on the type of model you are training; for more information, see our Introduction to Model Training article. Currently, alwaysAI’s training tool trains an object detection model, so the method of annotation is to draw bounding boxes around all examples of the target label(s) in your dataset. There are many tools available to help you do this, from simple, open-source tools like LabelImg to web applications with multiple features like Supervise.ly. In this article we cover another open-source tool, CVAT, which we have incorporated into the CLI. In the end, how you draw the bounding boxes is not important; what matters is the format of the annotations. The format we support is Pascal VOC; you can find an example of this annotation format further down in this document, in the Validate Data section. This guide contains the following sections:
(Optional) Preparing the Dataset
Download and install the latest release of the alwaysAI CLI. (If you already have it installed, make sure you update to the latest version.)
Once the alwaysAI CLI is installed, the following command will start CVAT:
$ aai annotate
If you have installed CVAT before, you may already have a superuser account, which is required to use the tool. Follow the prompts to run CVAT with the correct user.
Paste the given link into Google Chrome.
Prepare the Dataset (optional)
Prepping and cleaning the data before starting to annotate will expedite the annotation process. CVAT has a great processing tool (FFmpeg) built into it, but you may want to prepare the data yourself before uploading it to CVAT. This will give you more control of the video and images you intend to annotate. We recommend installing FFmpeg to use for pre-processing (find details on this at the end of the document). Here are some examples of actions you can take to make annotation smoother.
Sample Images from a Video
A video is just a large number of images displayed in order, so the difference between consecutive frames is small. We need many example images to train a robust model, but identical or near-identical images add little. To ensure variance in the training dataset, you may want to keep only a subset of images by sampling the video. Note: we have observed that the sampling function in CVAT is not compatible with exporting the dataset in the Pascal VOC format.
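To make the effect of sampling concrete, here is a minimal sketch of the arithmetic, assuming a constant frame rate. The function name `sampled_frame_indices` is hypothetical and is not part of CVAT, FFmpeg, or the alwaysAI CLI.

```python
def sampled_frame_indices(total_frames, video_fps, sample_fps):
    """Return the indices of the frames kept when sampling at sample_fps.

    Hypothetical helper for illustration; assumes a constant frame rate.
    """
    if video_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Keep one frame out of every `step` frames; the step is at least 1.
    step = max(1, round(video_fps / sample_fps))
    return list(range(0, total_frames, step))


# A 3-second clip recorded at 30 fps and sampled at 2 fps keeps every 15th frame.
print(sampled_frame_indices(90, 30, 2))  # [0, 15, 30, 45, 60, 75]
```

In this example, 90 frames that are nearly identical from one to the next shrink to 6 frames that are half a second apart.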
Split a Video into Multiple Parts
When collecting data, you may find that the objects you wish to detect are interspersed randomly throughout the video. Since you are not interested in the parts of the movie that don’t contain an object or objects that you want to identify, you can split the movie into parts that contain objects and parts that don’t, and only keep the parts you want.
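One way to cut out a segment is FFmpeg's `-ss` (start), `-t` (duration), and `-c copy` (no re-encoding) options. The sketch below only builds the command; the function name and file names are hypothetical, and with stream copy the cut points typically land on the nearest keyframes rather than the exact timestamps.

```python
def build_cut_command(src, start_s, end_s, dst):
    """Build an ffmpeg command that copies src from start_s to end_s into dst.

    Hypothetical helper for illustration; times are in seconds.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    duration = end_s - start_s
    return [
        "ffmpeg",
        "-ss", str(start_s),   # seek to the start of the part to keep
        "-i", src,
        "-t", str(duration),   # keep this many seconds
        "-c", "copy",          # copy the streams without re-encoding
        dst,
    ]


# Keep the part between 0:10 and 0:25, where the objects appear.
cmd = build_cut_command("movie_name.mov", 10, 25, "part_01.mov")
print(" ".join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Drop `-c copy` if you need frame-accurate cuts and can accept the cost of re-encoding.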
When collecting data, you will likely end up with multiple movies. For large datasets this can get unwieldy, so concatenating them into a single video can keep things simple.
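As a rough sketch, concatenation can be done with FFmpeg's concat demuxer, which reads a text file listing the inputs. The helper name and file names below are hypothetical. The sketch assumes all videos share the same codec and parameters (required for `-c copy`) and that no path contains a single quote.

```python
from pathlib import Path


def build_concat_command(videos, list_path, dst):
    """Write the concat list file and return the matching ffmpeg command.

    Hypothetical helper for illustration; assumes paths contain no single quotes.
    """
    # The concat demuxer expects one "file '<path>'" line per input, in order.
    lines = [f"file '{Path(v).resolve().as_posix()}'" for v in videos]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return [
        "ffmpeg",
        "-f", "concat",
        "-safe", "0",          # allow absolute paths in the list file
        "-i", str(list_path),
        "-c", "copy",          # join without re-encoding
        dst,
    ]


# To run it: import subprocess; subprocess.run(cmd, check=True)
```

If the videos differ in codec or resolution, they would need to be re-encoded to a common format first.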
See the list of tools at the end of this document for additional details on processing and sampling videos.
Upload Data to CVAT
In CVAT, you upload data as part of creating a task. You can upload a directory of images or a single video per task; please see the CVAT documentation for more information. If your files are large, you may only be able to upload a subset of the images or a shorter video. In that case, you can sample videos with FFmpeg, as described above, and create multiple tasks. Because you can use multiple datasets for training, this is not an issue.
Annotate the Data
Before starting annotation, you must create a new task, add the labels of the objects you want to detect, and upload either multiple images or a single video. Next, open the task you want to annotate. There are many annotation modes available; for object detection you will most likely use the ‘Annotation’ or ‘Interpolation’ mode. The CVAT User Guide on GitHub has good documentation on these modes.
Export the Dataset
From within CVAT, there are a couple of options for exporting. Whichever you choose, the format you want is Pascal VOC. You can’t export from the annotation window, so save your work and exit annotation mode by using the back button in your browser. You can either export as a dataset (selecting ‘Pascal VOC’ format in the sub-menu) or dump annotations (as described in the next section):
The Dump Annotations feature exports just the annotations you have saved for the task you specify, so if you created the task from images, you can use this option. Make sure to select Pascal VOC.ZIP.
Export as Dataset
The Export as Dataset function exports the annotations as well as the images of the dataset. If you created the task from a single video, you must use this method, and select Pascal VOC 2012.
Note: Since CVAT only allows you to annotate one video at a time, each export will have the same naming convention – 1.xml, 2.xml, 1.jpg, 2.jpg etc. If you have multiple tasks for a single model, these will need to be combined before training the model. This feature is built into the alwaysAI model training tool. All merging and combining of datasets is done behind the scenes after you run the train command. You can try this manually, but note that changing the names of the files is not enough, because there is a path inside the xml to the originally named file.
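If you do try combining exports manually, the following minimal sketch shows why renaming alone is not enough: the `<filename>` and `<path>` elements inside the XML must be updated together with the file names. This is not the alwaysAI training tool's implementation; the function name and the prefix scheme are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def prefix_voc_pair(xml_path, image_path, prefix):
    """Rename one annotation/image pair and keep the XML pointing at the image.

    Hypothetical helper for illustration; e.g. 1.xml and 1.jpg become
    task1_1.xml and task1_1.jpg when prefix is "task1".
    """
    xml_path, image_path = Path(xml_path), Path(image_path)
    new_image = image_path.with_name(f"{prefix}_{image_path.name}")
    new_xml = xml_path.with_name(f"{prefix}_{xml_path.name}")

    tree = ET.parse(str(xml_path))
    root = tree.getroot()
    # Update the references to the image inside the annotation.
    filename = root.find("filename")
    if filename is not None:
        filename.text = new_image.name
    path = root.find("path")
    if path is not None:
        path.text = str(new_image.resolve())

    # Write the updated annotation under its new name, then rename the image.
    tree.write(str(new_xml))
    xml_path.unlink()
    image_path.rename(new_image)
    return new_xml, new_image
```

Calling this once per pair, with a different prefix for each task, avoids collisions such as two files both named 1.xml.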
Here is the format that your data should take:
<annotation>
  <folder>folder_name</folder>
  <filename>image.jpeg</filename> <!-- there must be an image with this name or an error will be thrown -->
  <path>absolute_path/image.jpeg</path> <!-- absolute path to image -->
  <source>
    <database>database_source</database>
  </source>
  <size>
    <width>640</width> <!-- width of image in pixels -->
    <height>480</height> <!-- height of image in pixels -->
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>object_name</name>
    <pose>Unspecified</pose> <!-- optional -->
    <truncated>0</truncated> <!-- optional -->
    <difficult>0</difficult> <!-- optional -->
    <bndbox>
      <xmin>486</xmin>
      <ymin>6</ymin>
      <xmax>640</xmax>
      <ymax>256</ymax>
    </bndbox>
  </object>
</annotation>
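A quick sanity check of an annotation file against this format might look like the sketch below. The specific checks (a non-empty `<filename>`, an integer `<size>`, and boxes that are non-empty and inside the image) are assumptions about what a reasonable annotation looks like, not the validation that the alwaysAI training tool performs; `check_voc_annotation` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<annotation>
  <filename>image.jpeg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>object_name</name>
    <bndbox><xmin>486</xmin><ymin>6</ymin><xmax>640</xmax><ymax>256</ymax></bndbox>
  </object>
</annotation>"""


def check_voc_annotation(xml_text):
    """Return a list of problems found in one Pascal VOC annotation (empty if none)."""
    problems = []
    root = ET.fromstring(xml_text)
    if not (root.findtext("filename") or "").strip():
        problems.append("missing <filename>")
    try:
        width = int(root.findtext("size/width"))
        height = int(root.findtext("size/height"))
    except (TypeError, ValueError):
        return problems + ["missing or non-integer <size>"]
    for i, obj in enumerate(root.findall("object")):
        if not (obj.findtext("name") or "").strip():
            problems.append(f"object {i}: missing <name>")
        try:
            xmin, ymin, xmax, ymax = (
                float(obj.findtext(f"bndbox/{tag}"))
                for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            problems.append(f"object {i}: incomplete <bndbox>")
            continue
        # The box must have positive area and lie within the image.
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            problems.append(f"object {i}: box is empty or outside the image")
    return problems


print(check_voc_annotation(SAMPLE))  # []
```

Running it over every exported XML file before training can surface broken annotations early.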
Note on advanced settings: if you want, you can create attributes for each label. Examples of attributes are “pose”, “truncated”, and “difficult”; each tells the training tool something about the annotation. The “truncated” attribute marks an object that is either obscured by something in the frame or only partly visible in the frame. It is a true/false checkbox attribute. Using it makes annotation slightly more accurate, but it takes longer, because the value persists for the lifetime of the bounding box. For example, if an object comes into frame, you need a bounding box with truncated = true while it is entering the frame; once it is fully visible, delete that box and draw a new one with truncated = false.
We recommend you download and install FFmpeg to assist you in generating your dataset. It is great for generating sample images from a wide variety of video formats, or for changing the format of a video. There are detailed installation instructions on the FFmpeg website, and many sites with command instructions and examples, e.g. https://www.labnol.org/internet/useful-ffmpeg-commands/28490/
A sample command that produces high-quality samples at 2 frames per second is:
$ ffmpeg -i movie_name.mov -r 2 -q:v 1 image_name_%4d.png