Data Annotation

Regardless of how you gathered the data, the next step is to annotate it. The type of annotation required depends on the type of model you are training; for more information, see our Introduction to Model Training article. Currently, alwaysAI’s training tool trains an object detection model, so the method of annotation is to draw bounding boxes around all examples of the target label(s) in your dataset.

There are many tools available to help you do this, from simple, open-source tools like LabelImg to web applications with multiple features. In this guide, we talk about another open-source tool that we have incorporated into the alwaysAI CLI: CVAT. In the end, how you draw the bounding boxes is not important; what matters is the format of the annotations. The format we support is Pascal VOC; you can find an example of this annotation format further down in this document, in the Validate Data section.

To get started with CVAT:


  • Download and install the latest release of alwaysAI. If you already have it installed, make sure you update to the latest version; opening the desktop application will automatically check for updates.

  • Once alwaysAI is installed, entering the following command in the terminal will start CVAT:

   $ aai annotate

  • If you have installed CVAT before, you may already have a superuser account, which is required to use the tool. Follow the prompts to run CVAT with the correct user.

  • Paste the given link into a web browser (Google Chrome is recommended).

Note: Google Chrome is suggested for best results. Safari has been tested with the new CVAT UI but does not support the original UI, and Google Chrome may still offer a better user experience.

Prepare the Dataset (optional)

Prepping and cleaning the data before you start annotating will speed up the annotation process. CVAT has a great processing tool (FFmpeg) built into it, but you may want to prepare the data yourself before uploading it to CVAT, which gives you more control over the video and images you intend to annotate. We recommend installing FFmpeg for pre-processing (details are at the end of this document). Here are some examples of actions you can take to make annotation smoother.

Sample Images from a Video

A video is just a large number of images displayed in order, so the difference between consecutive frames is usually small. We need many example images to train a robust model, but images that are identical or very similar add little. To ensure variance in the training dataset, we may want to take a subset of images by sampling the video.

Note: The sampling function in CVAT is not compatible with exporting the dataset in the Pascal VOC format.
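As a rough illustration of sampling outside CVAT, the sketch below keeps every Nth frame from a list of already-extracted frame files. The function name, the file names, and the 30 fps source rate are assumptions made for this example; they are not part of CVAT or the alwaysAI tools.

```python
from pathlib import Path


def sample_frames(frames, source_fps, target_fps):
    """Keep roughly `target_fps` frames per second from a `source_fps` sequence."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Keep every `step`-th frame; a step below 1 would make no sense.
    step = max(1, round(source_fps / target_fps))
    return frames[::step]


# Hypothetical frame names, as if extracted from 3 seconds of 30 fps video.
frames = [Path(f"frame_{i:04d}.png") for i in range(90)]
subset = sample_frames(frames, source_fps=30, target_fps=2)
print(len(subset), "frames kept, starting with", subset[0].name)
```

A larger step gives more variety between the kept frames at the cost of fewer examples per video.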

Split a Video into Multiple Parts

When collecting data, you may find that the objects you wish to detect are interspersed randomly throughout the video. Since you are not interested in the parts of the movie that don’t contain an object or objects that you want to identify, you can split the movie into parts that contain objects and parts that don’t, and only keep the parts you want.
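One possible way to keep only the interesting parts is to let FFmpeg cut each segment out of the source video. The sketch below only builds the command lines; the file names and timestamps are hypothetical, and because it uses stream copy (`-c copy`), cuts land on keyframes and are therefore approximate.

```python
import shlex


def build_split_command(source, start, end, output):
    """Return an ffmpeg argument list that copies the span start..end into `output`."""
    return [
        "ffmpeg",
        "-i", source,   # input video
        "-ss", start,   # start of the segment to keep
        "-to", end,     # end of the segment to keep
        "-c", "copy",   # copy streams without re-encoding
        output,
    ]


# Hypothetical segments (start, end) that contain the objects of interest.
segments = [("00:00:05", "00:00:20"), ("00:01:10", "00:01:45")]
commands = [
    build_split_command("drive.mp4", start, end, f"part_{i}.mp4")
    for i, (start, end) in enumerate(segments, start=1)
]
for command in commands:
    print(shlex.join(command))
```

Each printed line can be run in a terminal, or the argument list can be passed to `subprocess.run` once FFmpeg is installed.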

Concatenate Videos

When collecting data, you will likely end up with multiple movies. For large datasets this can get unwieldy, so concatenation can keep things simple.
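A minimal sketch of concatenation, assuming the videos share the same codec and resolution so that FFmpeg's concat demuxer can join them without re-encoding. The helper name and all file names are hypothetical, and paths containing single quotes are not handled.

```python
import tempfile
from pathlib import Path


def build_concat_command(videos, list_path, output):
    """Write an ffmpeg concat list file and return the matching argument list."""
    lines = [f"file '{Path(video).resolve().as_posix()}'" for video in videos]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return [
        "ffmpeg",
        "-f", "concat",        # use the concat demuxer
        "-safe", "0",          # allow absolute paths in the list file
        "-i", str(list_path),
        "-c", "copy",          # join without re-encoding
        output,
    ]


# Demonstration in a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "videos.txt"
    command = build_concat_command(["clip_a.mp4", "clip_b.mp4"], list_path, "combined.mp4")
    list_text = list_path.read_text()

print(list_text, end="")
print(" ".join(command))
```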

See the list of tools at the end of this document for additional details on processing and sampling videos.

Upload Data to CVAT

In CVAT, you upload data as part of creating a task. You can upload a directory of images or a single video per task; please see the CVAT documentation for more information. If your files are large, you may only be able to upload a subset of the images or a shorter video. In that case, you can sample or split the video with FFmpeg, as described above, and create multiple tasks. Because you can use multiple datasets for training, this is not an issue.
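If a directory of images is too large for one upload, one option is to divide it into batches and create one task per batch. The sketch below shows the idea; the batch size of 3 and the file names are arbitrary values chosen for illustration, not a CVAT limit.

```python
def chunk_for_tasks(images, batch_size):
    """Split a list of image names into consecutive batches, one per task."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [images[i:i + batch_size] for i in range(0, len(images), batch_size)]


# Hypothetical image names; in practice these would come from your dataset folder.
images = [f"img_{i:03d}.jpg" for i in range(7)]
batches = chunk_for_tasks(images, batch_size=3)
for number, batch in enumerate(batches, start=1):
    print(f"task {number}: {batch}")
```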

Annotate the Data

Before starting annotation, you must create a new task, add the labels of the objects you want to detect, and upload multiple images or a single video. Next, open the task you want to annotate. There are many annotation modes available; for object detection, however, you will most likely use the ‘Annotation’ or ‘Interpolation’ modes in the original UI, or ‘Shape’ or ‘Track’ in the new UI. There is good documentation on the CVAT GitHub page; look at the README for extra documentation links. You can also watch our CVAT tutorial on YouTube.

Export the Dataset

From within CVAT, there are a couple of options for exporting. Whichever you choose, the format you want is Pascal VOC. You can’t export from the annotation window, so save your work and exit annotation mode by using the back button on your browser. You can then either export as a dataset (selecting ‘Pascal VOC’ format in the sub-menu) or dump annotations. Both options are described below.


Dump Annotation(s)

The Dump Annotations feature exports just the annotations you have saved for the task you specify, so if you created the task from images, you can use this option. Make sure to select Pascal VOC.ZIP.

Export as Dataset

The Export as Dataset function exports the annotations as well as the images of the dataset. If you created the task from a single video, you must use this method, and select Pascal VOC 2012.

Note: Since CVAT only allows you to annotate one video at a time, every export uses the same naming convention: 1.xml, 2.xml, 1.jpg, 2.jpg, etc. If you have multiple tasks for a single model, these will need to be combined before training the model. This feature is built into the alwaysAI model training tool: all merging and combining of datasets is done behind the scenes after you run the train command. You can try this manually, but note that renaming the files is not enough, because each XML file also contains a path to the originally named image.
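To illustrate why renaming alone is not enough, the sketch below updates the references inside one annotation when its image is renamed. It is not the alwaysAI training tool's implementation; the function name, the `task1_` prefix, and the directory names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def retarget_annotation(xml_text, new_name, new_dir):
    """Return annotation XML whose <filename> and <path> point at the renamed image."""
    root = ET.fromstring(xml_text)
    filename = root.find("filename")
    if filename is None:
        raise ValueError("annotation has no <filename> element")
    filename.text = new_name
    path = root.find("path")
    if path is not None:
        # Keep the stored path consistent with the new file name.
        path.text = str(PurePosixPath(new_dir) / new_name)
    return ET.tostring(root, encoding="unicode")


# Hypothetical annotation from a first task, where the image was called 1.jpg.
sample = (
    "<annotation>"
    "<filename>1.jpg</filename>"
    "<path>/data/task1/1.jpg</path>"
    "</annotation>"
)
updated = retarget_annotation(sample, "task1_1.jpg", "/data/merged")
print(updated)
```

The image file and the XML file themselves would still need to be renamed and moved to match.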

Validate Data

Here is an excerpt of the format that your annotation files should take:

	<filename>image.jpeg</filename> <!-- there must be an image with this name, or an error will be thrown -->
	<path>absolute_path/image.jpeg</path> <!-- absolute path to the image -->
	<size>
		<width>640</width> <!-- width of the image in pixels -->
		<height>480</height> <!-- height of the image in pixels -->
	</size>
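A quick check of these fields could look like the sketch below. It assumes the usual Pascal VOC layout, in which `<width>` and `<height>` sit inside a `<size>` element, and it only covers the fields shown above. The function name and the temporary files are made up for the demonstration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def validate_annotation(xml_path, image_dir):
    """Return a list of problems found in one Pascal VOC annotation file."""
    problems = []
    root = ET.parse(str(xml_path)).getroot()
    filename = root.findtext("filename")
    if not filename:
        problems.append("missing <filename>")
    elif not (Path(image_dir) / filename).is_file():
        problems.append(f"image not found: {filename}")
    for tag in ("width", "height"):
        value = root.findtext(f"size/{tag}")
        if value is None or not value.strip().isdigit() or int(value) <= 0:
            problems.append(f"bad or missing <size>/<{tag}>")
    return problems


# Demonstration: the annotation names image.jpeg, but no such image exists.
with tempfile.TemporaryDirectory() as tmp:
    xml_path = Path(tmp) / "image.xml"
    xml_path.write_text(
        "<annotation><filename>image.jpeg</filename>"
        "<size><width>640</width><height>480</height></size></annotation>"
    )
    problems = validate_annotation(xml_path, tmp)

print(problems)
```

An empty list means none of these particular checks failed; it does not guarantee that the training tool will accept the file.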

Note: On advanced settings: if you want, you can create attributes for each label. Examples of attributes are “pose”, “truncated”, and “difficult”. These all tell the training tool something about each annotation. The “truncated” attribute is used when the annotated object is either obscured by something in the frame or only partly visible in the frame. It is a true/false checkbox attribute. It makes annotation slightly more accurate but takes longer, because the value persists for the lifetime of the bounding box. For example, if an object comes into frame, you need a bounding box with truncated = true while it is entering; once it is fully visible, be sure to delete that box and draw a new one with truncated = false.
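To see how such attributes might appear in an exported file, the sketch below reads the `truncated` and `difficult` flags from each `<object>` element. It assumes the common Pascal VOC convention of storing them as 0 or 1; the sample annotation and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET


def summarize_objects(xml_text):
    """List each annotated object with its truncated and difficult flags."""
    root = ET.fromstring(xml_text)
    summary = []
    for obj in root.iter("object"):
        summary.append({
            "name": obj.findtext("name"),
            # A missing flag is treated as 0 (false) in this sketch.
            "truncated": obj.findtext("truncated", default="0").strip() == "1",
            "difficult": obj.findtext("difficult", default="0").strip() == "1",
        })
    return summary


# Hypothetical annotation with one truncated and one fully visible object.
sample = """
<annotation>
  <object><name>car</name><truncated>1</truncated><difficult>0</difficult></object>
  <object><name>car</name><truncated>0</truncated><difficult>0</difficult></object>
</annotation>
"""
objects = summarize_objects(sample)
print(objects)
```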

Additional Guides


We recommend that you download and install FFmpeg to assist you in generating your dataset. It is great for generating sample images from a wide variety of video formats, or for changing the format of a video. There are detailed installation instructions on the FFmpeg website, and many sites with command instructions and examples.

A sample command that gives you high-quality samples at 2 frames per second (replace input_video.mp4 with the path to your own video) is:

$ ffmpeg -i input_video.mp4 -r 2 -q:v 1 image_name_%4d.png


We have integrated CVAT into the alwaysAI CLI. It is a robust, if somewhat complicated, annotation tool that can handle video annotation and has advanced features. The CVAT documentation has a variety of tutorials and guides to assist you with more advanced annotations.