Additional Resources

We’ve compiled some additional resources to assist you with model training. On this page, you’ll find Frequently Asked Questions (FAQs), troubleshooting support, a command cheat sheet for the CLI, and a glossary of training terms.

FAQ

This page answers FAQs from four categories: Background, Data Collection, Annotation, and Training.

Please visit our Discord channel or send an email to support@alwaysai.co to ask questions that are not answered on this page. You can find helpful tutorials and additional reading material on our Blog page.

Background

What is an epoch?

An epoch is one complete pass through every image in the training dataset.

What is batch size?

The batch size is the number of images processed before the model is updated. Batch size is largely dependent on how much memory you have available for training: the more memory you can use, the larger the possible batch size.

What does “overfitting” mean?

When the model performs very well on the training dataset, but not on data that hasn’t been seen before, the model is overfit. This means that even though the performance metrics may appear very good for the training dataset, the model cannot be generalized to new data. For instance, if you have a model that you want to train to detect sporting equipment, and for the label ‘ball’ your dataset included only green tennis balls, even if the precision and recall were very high and loss is very low, the model probably won’t generalize to basketballs or baseballs, or maybe even non-green tennis balls. This is an extreme example; if you train any dataset too much, any model will learn that dataset so well that it doesn’t understand that new data may also be instances of the desired labels.

What is “loss”?

There are numerous algorithms to measure loss, and this measurement will be different for different machine learning tasks. In general, loss measures how far off the model was in correctly learning the task, and as such it is always a value that we want to minimize. There are two types of loss, training loss and validation loss. Training loss is measured by how accurately the model predicts using the training data. Validation loss measures how accurately the model predicted on validation data, which is annotated data that the model was not trained on.
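
As a simple illustration of the idea (using mean squared error; the actual loss functions used in object detection are more complex), the sketch below measures the average squared distance between hypothetical predictions and their ground truth:

predictions = [0.9, 0.2, 0.75]   # hypothetical model outputs
ground_truth = [1.0, 0.0, 1.0]   # corresponding annotations

# Mean squared error: the average squared distance from the ground truth.
mse = sum((p - g) ** 2 for p, g in zip(predictions, ground_truth)) / len(predictions)
print(mse)  # 0.0375 -- lower is better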

What are “precision” and “recall”?

Precision describes how many of the detected objects are what we actually wanted to detect. It is calculated by dividing the number of correctly identified objects, the true positives, by the total identified objects (both the true and false positives).

Recall describes how many of the objects of interest we managed to detect. It is calculated by dividing the true positives by the sum of the true positives and the false negatives (the objects of interest that were missed).

Say we have a model that is supposed to detect dogs, and in a picture there are three dogs and two cats. If the model detects all entities in the picture as dogs, it would have low precision, because only 3 of the 5 objects were what we wanted to detect. It would have high recall, however, because it managed to detect all the dogs. We want our model to have both high precision and high recall. This would mean we want a model to correctly identify dogs as dogs, and not identify any cats as dogs.
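
Using the numbers from the dog example above, a minimal sketch of how these two metrics are computed:

tp = 3  # dogs correctly detected as dogs (true positives)
fp = 2  # cats incorrectly detected as dogs (false positives)
fn = 0  # dogs that were missed (false negatives)

precision = tp / (tp + fp)  # 3 / 5 = 0.6 -> low precision
recall = tp / (tp + fn)     # 3 / 3 = 1.0 -> high recall
print(f"precision={precision:.2f}, recall={recall:.2f}")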

What is “data augmentation”?

It can sometimes be challenging to collect sufficient images to train your model. You can augment your dataset by taking the images you do have and creating additional images by rotating, cropping, brightening, darkening, blurring, etc. them. One Python library you can use to do this is imgaug.
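
A minimal sketch of such a pipeline using imgaug (the file name is hypothetical and the parameter ranges are illustrative):

import imageio
import imgaug.augmenters as iaa

image = imageio.imread("tennis_ball.jpg")  # hypothetical input image

# Randomly flip, rotate, blur, and adjust the brightness of each image.
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                     # flip horizontally 50% of the time
    iaa.Affine(rotate=(-15, 15)),        # rotate between -15 and +15 degrees
    iaa.GaussianBlur(sigma=(0.0, 1.0)),  # blur with a random sigma
    iaa.MultiplyBrightness((0.8, 1.2)),  # darken or brighten slightly
])

# Generate five augmented variants of the original image.
augmented_images = [augmenter(image=image) for _ in range(5)]

Note that if your images are already annotated, the bounding boxes must be transformed along with the images; imgaug supports this as well.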

How do I know when I should stop training?

Generally, you want to stop training when loss no longer decreases and mAP no longer increases. If you visually test your model by using it in an application and notice that certain instances of labels are no longer being picked up, you may have overfit your model. If instead some objects are being misidentified, your model may need more training.

Data Collection

What is the format for my training data?

Input for training is expected to be in Pascal VOC format. Images should be JPEGs and stored in a folder named ‘JPEGImages’. Annotations are in XML format, and should be stored in a folder named ‘Annotations’; you can see an example of this format in our Data Annotation guide. Every image should correspond to an annotation file, e.g. file ‘0.jpg’ corresponds to ‘0.xml’. Every dataset consists of the ‘Annotations’ and ‘JPEGImages’ folders zipped together. Zip the folders by selecting the individual ‘Annotations’ and ‘JPEGImages’ folders, not a parent directory.
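
If you prefer to build the archive programmatically, the sketch below (assuming ‘Annotations’ and ‘JPEGImages’ sit in the current directory) produces the same structure as selecting the two folders and compressing them, with both folders at the root of the archive:

import os
import zipfile

def zip_dataset(root=".", out_path="dataset.zip"):
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder in ("Annotations", "JPEGImages"):
            for dirpath, _, filenames in os.walk(os.path.join(root, folder)):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    # Archive paths must start at 'Annotations/...' or
                    # 'JPEGImages/...', not at a parent directory.
                    zf.write(full, os.path.relpath(full, root))

zip_dataset()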

What if I have multiple datasets, do I need to combine them before training?

No, the aai dataset train command will combine any input datasets, provided they are in zip format. You can also merge datasets using aai dataset merge if you would like a consolidated input file; however, aai dataset train does this automatically.

How much input data do I need?

A minimum of approximately 300 images per label is recommended; however, more data will almost surely produce better results. See the Data Capture Guidelines document for more details.

Annotation

Do I need to annotate all objects in an image?

No, however you should be careful to not include too many images containing objects of interest without annotations, and you should be mindful of what other objects are in your images. For more details on data collection and annotation, please refer to our blogs on these subjects.

How much of an object needs to be showing before I annotate it?

In general, at least 20% of the object should be visible in order to annotate it. Additionally, if more than 20% of the object is obscured, you can mark the annotation as ‘truncated’.

Training

What are the commands that are required for training?

The basic syntax for training is:

$ aai dataset train <dataset1.zip> <dataset2.zip> --numEpochs <integer> --batchSize <integer>

An example usage is

$ aai dataset train resized_dataset_sample_592.zip --numEpochs 20 --batchSize 4

Note: You can train on as many datasets as you’d like, as long as the aggregate file size is < 2GB.

What are optional flags for training?

The optional flags include:

  • --name sets the name of the model to the string provided, instead of a random phrase.

  • --labels lets you manually enter the labels; however, they are automatically detected from the dataset.

  • --model sets the model. The options are mobilenet_v1, yolov3, and resnet_faster_rcnn.

  • --imgSize sets the input image dimensions for the model. Options are small, medium, large, or a string formatted as MxN. See below for more details.

  • --jupyter switches training from CLI to using Jupyter Notebook.

  • --trainValRatio sets the train-validation ratio split. The default is 0.7.

  • --continue-from-version enables you to pick up training from a previous session.

What does the --model flag specify?

We offer three training options: MobileNet SSD version 1, YOLO version 3, and FasterRCNN. The MobileNet model is the default; it trains faster and has a fast inference speed, but it is not as accurate as FasterRCNN. FasterRCNN takes more time to train and runs inference more slowly, but should provide better accuracy. YOLO is in between: it is more accurate than MobileNet, but will likely have a slightly slower inference time.

What does the --imgSize flag specify?

You can specify either small, medium, large, or an integer-by-integer dimension. Larger dimensions will take longer to train but will likely perform better, and vice versa. The dimensions vary based on which model you are training. For MobileNet and FasterRCNN, the small option is 300x300, and this is the default; the medium option is 640x640, and the large is 1280x720. For YOLOv3, the small/default option is 320x320, the medium is 416x416, and the large is 640x640. For YOLOv3, if you enter custom dimensions, they must be divisible by 32.
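
As a quick sanity check for custom YOLOv3 dimensions (an illustrative helper, not part of the CLI):

def check_yolo_img_size(img_size):
    # Validate a custom 'MxN' --imgSize string for YOLOv3.
    width, height = (int(v) for v in img_size.lower().split("x"))
    for dim in (width, height):
        if dim % 32 != 0:
            raise ValueError(f"{img_size}: YOLOv3 dimensions must be divisible by 32")

check_yolo_img_size("416x416")  # passes silently
check_yolo_img_size("300x300")  # raises ValueError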

What does the --batchSize flag specify?

The --batchSize flag specifies how many images the model is trained on before its weights are updated. This value will depend on how much memory you have available.

What does the --trainValRatio flag specify?

The --trainValRatio flag specifies how much of your annotated data is used for training versus validation, using a floating point number. The default is set to 0.7, which means that 70% of the training data will be used for training and 30% will be used for validation.
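
For example, with illustrative numbers (the tool performs this split itself):

total_images = 1000
train_val_ratio = 0.7

n_train = int(total_images * train_val_ratio)  # 700 images used for training
n_val = total_images - n_train                 # 300 images used for validation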

What does the --continue-from-version flag specify?

If you use the --continue-from-version flag, a new model version will be created, with a version number one higher than the latest existing version. For example, 0.2 will be created if you continue from 0.1, and 0.6 will be created if 0.5 was the last model version, even if you continue from 0.3.

Note: When using the --continue-from-version flag with the --jupyter flag, you must also use the --name flag and specify the model name.

What are the default settings for model training?

You must manually set the training data, the number of epochs, and the batch size. By default, training runs in the CLI (this can be switched to a Jupyter notebook using the --jupyter flag). Additionally, the default train/validation split is 0.7, which can be altered using the --trainValRatio flag.

What types of models can I train?

Currently, we train object detection models by transfer-learning from either a pre-trained MobileNet-SSD, a YOLOv3, or ResNet FasterRCNN model that has been trained on the COCO Dataset.

How can I test my model?

You can use your model in an app to visually assess the model’s ability to detect the desired objects. You can do this by publishing it to your personal model catalog, using

$ aai model publish <username/modelname>

To use the model locally, without publishing to the model catalog, you can use

$ aai app models add <username/modelname> --local-version <version>

To add a model to an app, use

$ aai app models add <username/modelname>

And to update the version of the model if the model is already added to your application, simply run

$ aai app models update

Can I test multiple versions of my model side by side?

Yes! However, you must use different training names for the two models, e.g. ‘my_model1’ and ‘my_model2’. You can train and publish two versions of your model using these different names, and test each model’s performance on the same input stream. See this blog for more details.

Do I need to train on all of my labels?

Yes, at the moment you must train on all the labels in your dataset. Additionally, if a label is specified in the training command, it must be present in your dataset! You do not need to manually enter labels; they will be automatically detected from your dataset.

What if I forget to specify a label to train on?

The training tool will print all of the labels it detects in your dataset to the console. You can then re-run the command with all the labels specified.

Do I have to train all at once, or can I pick up where I left off?

You can continue training your model from a previous version by using the --continue-from-version id flag in your training command and specifying the version you would like to continue from. If you do not use this flag, training will begin from scratch and a new model version will be made, incrementing from the last version.

Do I have to keep using the same training settings if I continue training?

No! You can change any settings, including using the CLI or Jupyter notebook, between iterations of training.

What if I want to revert to an older version of my model?

You can continue training from an older version of a model by specifying the desired version with the

$ ... --continue-from-version <version>

flag in the aai dataset train command. To use an older version of a model in an app, specify the desired model name and use

$ aai app models add <username/modelname> --local-version <version>

Can I see performance metrics while the model is training?

You can graphically view training loss as well as start and end validation loss by training using the --jupyter flag, which opens a Jupyter notebook to perform training. If you instead train using the CLI (default), you will see training loss printed periodically to the console as well as mAP and recall every 10 minutes and when training is complete. For more details on interpreting these tables, see our documentation on model training output.

Is there a limit on how much data I can train?

You can train on up to 2 GB of data. Training on datasets that exceed this size, either individually or combined, may produce inconsistent results and should not be attempted. Additionally, if your dataset contains many small files, you may need to check the limits of your file system (using ulimit -aH on Linux and Mac); this limit is imposed by your operating system, not by the training tool.

What hardware do I need to train a model?

Currently the model training tool is available for use with MacOS, Windows, and Linux. We offer CPU training on MacOS and Windows, and GPU training on Linux computers fitted with NVIDIA GPUs, as long as the proper CUDA drivers have been installed.

How long does it take to train a model?

This depends largely on the size of your dataset, the number of epochs you are running, the batch size, and whether you are training on a CPU or GPU. Generally, training on a GPU will be much faster (approximately 3-5 times faster) than on a CPU. For reference, our license plate detection model was trained on a GPU for 1,300 epochs with a batch size of 16 using a dataset that contained 951 images, and this training took approximately 20 hours. (Note that this model may not have needed to be trained for this many epochs; however, we offer it as an example of how these different components may affect one another.) As another example, running 4 epochs on a CPU using a dataset of 592 images and a batch size of four took about 20 minutes.
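
For a rough sense of the work involved, the number of weight updates in the license plate example can be estimated as follows (back-of-the-envelope only; wall-clock time also depends on image size, model, and hardware):

import math

images, batch_size, epochs = 951, 16, 1300

updates_per_epoch = math.ceil(images / batch_size)  # 60 updates per epoch
total_updates = updates_per_epoch * epochs          # 78,000 updates overall
print(updates_per_epoch, total_updates)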

Is there a place I can see logs of the training data if I lost my console output?

Yes, in the ‘training-temp’ folder in the training directory there is a folder called ‘logs’ that contains the tensorflow-logs.log file.

Troubleshooting

Trouble with Annotations

Error 1

Error: Annotations directory not found
alwaysAI retraining expects a PascalVOC datset. The zipped directory must have Pascal VOC
formatted annotation files with a "JPEGImages" directory containing the images corresponding to the annotation
files found in "Annotations"

- Annotations
    - annotation xml or json files
- JPEGImages
    - corresponding image files in jpeg format (.jpg or .jpeg)

This error occurs when you attempt to merge datasets (either with aai dataset merge or aai dataset train) that have not been compressed properly. Select the ‘Annotations’ folder and ‘JPEGImages’ folder for one annotation set and compress these (do not compress the parent directory containing these folders). Repeat for all annotation sets (you can rename the archives after compression if need be), and then use these compressed files as input for the command.

Trouble with Training

Error 2

{
    Error: spawn sh ENOENT
    at Process.ChildProcess._handle.onexit (internal/child_process.js:240:19)
    at onErrorNT (internal/child_process.js:415:16)
    at process._tickCallback (internal/process/next_tick.js:63:19)
    at Function.Module.runMain (internal/modules/cjs/loader.js:832:11)
    at startup (internal/bootstrap/node.js:283:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)
    errno: 'ENOENT',
    code: 'ENOENT',
    syscall: 'spawn sh',
    path: 'sh',
    spawnargs:
    [ '-c',
        'docker info | grep CPUs | grep -oE "[1-9][0-9]{0,2}" | head -n1' ]
}

This error occurs when training is attempted on a Windows machine. Currently, training is only supported on Mac and Linux, with Windows support planned for future iterations of the tool.

Error 3

docker: Error response from daemon: Conflict. The container name <name> is already in use by container <container>. You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
Error: Exported model not found in training-temp/username/modelname

This error occurs when a Jupyter training session was started but not properly shut down. Run

$ docker ps

to list all running containers, then kill the container named in the error message using

$ docker kill [container]

Error 4

info: Train/Validation ratio : 0.7
info: Datasets
 dataset.zip
info: Labels
 label
info: Available labels for dataset:
 label
 label
info: Batch Size : 4
/bin/sh: docker: command not found

Please install Docker. You can find details on model training setup here.

Error 5

RangeError [ERR_FS_FILE_TOO_LARGE]: File size (2326059448) is greater than 2 GB

This error occurs when you attempt to train dataset(s) that are over 2 GB in size total. You will need to reduce image size, or the number of annotations and images used for training. Do not split your original dataset into individual files and attempt to run the training command on all of the individual files; this may result in some of your dataset not being properly merged, even if no error ensues.

Error 6

info: Resizing images and annotations
TypeError: cannot read property 'forEach' of undefined

This error indicates a problem with the input data. The expected input is one XML annotation file per JPEG image, both with corresponding names (e.g. ‘file0.xml’ and ‘file0.jpg’). The annotations should be in one parent folder called ‘Annotations’ and the images should be in a folder named ‘JPEGImages’. Both of these folders should be zipped together into a single archive. To properly zip the folders, select ‘Annotations’ and ‘JPEGImages’ and select ‘compress’. Do not compress a parent folder containing the two directories.

Error 7

ValueError: The passed save_path is not a valid checkpoint: /tf-tools/trained_models/model.ckpt-XX

If you were able to see training progression on the console, then there is an issue with the training folder. However, if training never began, this is a memory issue. If you’ve already modified Docker to have access to more memory, reduce the number of epochs you are running and the batch size you are using. To modify Docker access:

Open Docker Preferences, then go to Resources, then Advanced.

Modify your preferences to give Docker access to all but 1 CPU, all but 1-2GB of memory (or what you are comfortable with), and 2 GB of swap. Docker will restart and you can try training again.

Error 8

Label __ not found in any annotation files….

You’ve included a label that is not in any training images. You should drop the label, or ensure that annotation/image pairs for that label are included in the input data.

Note: The program will also recognize if your annotations include labels that you did not specify in your training command and will notify you.

Error 9

docker: invalid reference format: repository name must be lowercase.
See 'docker run --help'
Error: Exported model not found 'training-temp/userid/modelname'

Make sure that the path of the folder you are training in contains no spaces.

Error 10

Traceback (most recent call last):
File "/tf-tools/create_od_tf_record.py", line 287, in <module>
    tf.app.run()
File "/root/anaconda3/envs/tf-14/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/root/anaconda3/envs/tf-14/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
File "/root/anaconda3/envs/tf-14/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
File "/tf-tools/create_od_tf_record.py", line 275, in main
    mask_type=FLAGS.mask_type)
File "/tf-tools/create_od_tf_record.py", line 233, in create_tf_record
    mask_type=mask_type)
File "/tf-tools/create_od_tf_record.py", line 152, in dict_to_tf_example
    classes.append(label_map_dict[class_name])
KeyError: 'box'

The error above was caused by not including the label ‘box’ even though it was present in the provided dataset. If you see a similar error, make sure that you are training on all labels that are present in your dataset.

Error 11

Error: Error: EMFILE: too many open files, open <directory>

This error is due to a limitation of your operating system on the number of allowable open files. You can check the file limit using ‘ulimit -aH’ on Linux and Mac to see if this is the cause of the error. You can set the file limit using ‘ulimit’ as well, but please use caution when altering your system settings.

Trouble with Publishing Models

Ensure you are properly logged in with aai user show, and log in with aai user login if not.

Ensure that your user id matches the user id in the specified model, and that the model name is spelled correctly.

Command Cheat Sheet

Data Commands

Run Data Collection Starter App

$ aai app start

Merge Datasets

This is not necessary; aai dataset train will automatically merge datasets. You can merge and train multiple datasets, so long as the aggregate file size is 2 GB or less.

$ aai dataset merge <dataset1.zip> <dataset2.zip>

Resize a Dataset

This is not necessary, but it may speed up training time, especially if you train more than once using the --continue-from-version flag.

$ aai dataset resize --target-dir <dataset.zip>

Training Commands

Default Training Command

$ aai dataset train <dataset1.zip> <dataset2.zip> --numEpochs <integer> --batchSize <integer>

Continue From Version

$ ... --continue-from-version <string> ...

Set Train/Validation Ratio

Default is 0.7

$ ... --trainValRatio <float> ...

Specify the Model Name

$ ... --name <modelname>

Specify the Model

$ ... --model <model>

Options are mobilenet_v1, yolov3, and resnet_faster_rcnn

Specify the Image Dimensions

$ ... --imgSize <string>

Accepted args/formatting include: small, medium, large, MxN (Ex. 300x300)

Specify Labels

$ ... --labels <str0> [...]

Note: You don’t need to provide labels! They will be automatically detected.

Specify Hardware

$ ... --hardware [GPU, CPU]...

Note: GPU training is only available on Linux computers with NVIDIA GPUs that have the appropriate CUDA drivers installed.

Train Using Jupyter

$ ... --jupyter ...

Post-Training Commands

Publish the Model

$ aai model publish <username/modelname>

Add (published) Model to App

$ aai app models add <username/modelname>

Add (unpublished) Model to App

$ aai app models add <username/modelname> --local-version <version>

Update a Model (already added to app)

$ aai app models update

Glossary of Terms

Annotation

(Process). Labeling data by defining which areas of an image contain the relevant object(s).

(Noun). The actual files that contain the information regarding the areas of interest for a particular image. Annotations are sometimes referred to as the ground truth and they are used in supervised learning; the model repeatedly compares predictions against annotations in order to improve.

Augmentation

The process of altering images thereby creating new images that are sufficiently different from the originals. Augmentation can include blurring, cropping, brightening, darkening, rotating, and more. Augmentation is used to increase the size of a dataset.

Batch

The number of images trained on before the model is updated.

Difficult

In annotation, ‘difficult’ is set to 1 when the object is not easily recognized; otherwise it is set to 0.

Epoch

One complete pass through every image in the training dataset.

Loss

A quantification of how different the model’s prediction is from the ground truth.

Learning Rate

How much the weights in the model are adjusted at each update.

Overfitting

When the model performs well on the training dataset, but poorly on new test data.

Train, Validation, Test Split

There are three components to training a neural network: the actual training, the tuning of the hyperparameters (such as learning rate), and testing the model. To accomplish this, the original dataset is typically split into a training and a testing dataset, usually with an 80/20 split, respectively. The training dataset is then further split into a training and a validation dataset. As the model trains, it compares its predictions to the annotations on the training data and adjusts its weights accordingly, while the validation data is used to tune the hyperparameters and monitor how well the model generalizes during training. When training is complete, the model is tested against the held-out test data to see how well it performs.
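
A minimal sketch of this two-stage split, using the proportions described above (the training tool performs its own split via the --trainValRatio flag):

import random

annotated_images = list(range(100))  # stand-in for 100 annotated examples
random.shuffle(annotated_images)

test = annotated_images[:20]          # 20% held out for final testing
remainder = annotated_images[20:]

split = int(len(remainder) * 0.7)
train = remainder[:split]             # used to adjust the weights
validation = remainder[split:]        # used to tune hyperparameters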

Truncated

When annotating, this describes whether the object being annotated is completely visible. If the object is visible (i.e. not truncated), this value is set to 0, otherwise it is set to 1. Typically if 20% or more of the object is obscured, it should be marked as truncated.

Underfitting

When the model performs poorly on the training data as well as on validation and test data.

Weight

Weights are a way of quantifying how important a given input is for a neural network and how much it contributes to the output.