Troubleshooting

Trouble with Annotations

Error 1

Error: Annotations directory not found
alwaysAI retraining expects a Pascal VOC dataset. The zipped directory must have Pascal VOC
formatted annotation files in an "Annotations" directory, with a "JPEGImages" directory containing
the images corresponding to those annotation files:

- Annotations
    - annotation xml or json files
- JPEGImages
    - corresponding image files in jpeg format (.jpg or .jpeg)

This error occurs when you attempt to merge datasets (with either aai dataset merge or aai dataset train) that have not been compressed properly. Select the 'Annotations' folder and 'JPEGImages' folder for one annotation set and compress them together (do not compress the parent directory containing these folders). Repeat for all annotation sets (you can rename the archives after compression if need be), and then use these compressed files as input for the command.
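The compression step above can be sketched in Python with the standard zipfile module; this is a minimal illustration (the function name and folder layout are assumptions based on the convention described above), producing an archive whose top-level entries are 'Annotations' and 'JPEGImages' rather than a parent directory:

```python
import os
import zipfile

def zip_dataset(dataset_dir, out_path):
    """Zip 'Annotations' and 'JPEGImages' so they sit at the archive root.

    Equivalent to selecting the two folders and choosing 'compress',
    rather than compressing the parent directory itself.
    """
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder in ("Annotations", "JPEGImages"):
            root = os.path.join(dataset_dir, folder)
            for dirpath, _, filenames in os.walk(root):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    # Archive names are taken relative to the dataset dir, so
                    # the entries become Annotations/... and JPEGImages/...
                    arcname = os.path.relpath(full, dataset_dir)
                    zf.write(full, arcname)
```

Checking the archive with `zipfile.ZipFile(out_path).namelist()` should show entries beginning with `Annotations/` and `JPEGImages/`, not with a parent-folder prefix.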

Trouble with Training

Error 2

{ 
    Error: spawn sh ENOENT
    at Process.ChildProcess._handle.onexit (internal/child_process.js:240:19)
    at onErrorNT (internal/child_process.js:415:16)
    at process._tickCallback (internal/process/next_tick.js:63:19)
    at Function.Module.runMain (internal/modules/cjs/loader.js:832:11)
    at startup (internal/bootstrap/node.js:283:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)
    errno: 'ENOENT',
    code: 'ENOENT',
    syscall: 'spawn sh',
    path: 'sh',
    spawnargs:
    [ '-c',
        'docker info | grep CPUs | grep -oE "[1-9][0-9]{0,2}" | head -n1' ]
}

This error occurs when training is attempted on a Windows machine. Currently, training is only supported on Mac and Linux, with Windows support planned for future iterations of the tool.

Error 3

docker: Error response from daemon: Conflict. The container name <name> is already in use by container <container>. You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
Error: Exported model not found in training-temp/username/modelname

This error occurs when a Jupyter training session is started and its container is not properly brought back down. Run

$ docker ps

to list all running containers, and then kill the container named in the error message using

$ docker kill [container]

Error 4

info: Train/Validation ratio : 0.7
info: Datasets
 dataset.zip
info: Labels
 label
info: Available labels for dataset:
 label
 label
info: Batch Size : 4
/bin/sh: docker: command not found

Please install Docker. You can find details on model training setup here.

Error 5

RangeError [ERR_FS_FILE_TOO_LARGE]: File size (2326059448) is greater than 2 GB

This error occurs when the dataset(s) you attempt to train on exceed 2 GB in total size. You will need to reduce the image size, or the number of annotations and images used for training. Do not split your original dataset into individual files and run the training command on all of them; this may result in part of your dataset not being properly merged, even if no error ensues.
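A quick way to catch this before training is to check the combined size of the zipped datasets up front. A minimal sketch (the function names are illustrative; the threshold matches the 2 GB limit in the error above):

```python
import os

TWO_GB = 2 * 1024 ** 3  # matches the 2 GB limit reported in the error

def total_dataset_size(zip_paths):
    """Return the combined size in bytes of the given dataset archives."""
    return sum(os.path.getsize(p) for p in zip_paths)

def check_size(zip_paths):
    """Raise early if the archives together exceed the 2 GB limit."""
    total = total_dataset_size(zip_paths)
    if total > TWO_GB:
        raise ValueError(
            f"Datasets total {total} bytes; reduce image size or the "
            "number of annotation/image pairs before training."
        )
    return total
```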

Error 6

info: Resizing images and annotations
TypeError: cannot read property 'forEach' of undefined

This error indicates a problem with the input data. The expected input is one XML annotation file per JPEG image, with matching names (e.g. 'file0.xml' and 'file0.jpg'). The annotations should be in a folder called 'Annotations' and the images in a folder named 'JPEGImages', and both folders should be zipped together into one archive. To zip them properly, select 'Annotations' and 'JPEGImages' and select 'compress'. Do not compress the parent folder containing the two directories.
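One way to check the input before zipping is to verify the one-to-one pairing of annotation and image files. This is a minimal sketch (the function name is an assumption; the folder names follow the convention above):

```python
import os

def find_unpaired(dataset_dir):
    """Return (xml_without_image, image_without_xml) base-name sets for a
    dataset laid out as Annotations/*.xml and JPEGImages/*.jpg|.jpeg."""
    ann_dir = os.path.join(dataset_dir, "Annotations")
    img_dir = os.path.join(dataset_dir, "JPEGImages")
    xml_names = {os.path.splitext(f)[0] for f in os.listdir(ann_dir)
                 if f.lower().endswith(".xml")}
    img_names = {os.path.splitext(f)[0] for f in os.listdir(img_dir)
                 if f.lower().endswith((".jpg", ".jpeg"))}
    # Anything left in either set after subtraction has no counterpart.
    return xml_names - img_names, img_names - xml_names
```

Two empty sets mean every annotation has a matching image and vice versa.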

Error 7

ValueError: The passed save_path is not a valid checkpoint: /tf-tools/trained_models/model.ckpt-XX

If you were able to see training progression on the console, then there is an issue with the training folder. However, if training never began, this is a memory issue. If you've already modified Docker to have access to more memory, reduce the number of steps you are running. To modify Docker's resource access:

Open Docker > Preferences > Resources > Advanced

Modify your preferences to give Docker access to all but 1 CPU, all but 1-2GB of memory (or what you are comfortable with), and 2 GB of swap. Docker will restart and you can try training again.

Error 8

Label __ not found in any annotation files...

You've included a label that does not appear in any training images. Drop the label, or ensure that annotation/image pairs for that label are included in the input data.

Note: The program will also recognize if your annotations include labels that you did not specify in your training command and will notify you.
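Since the annotations are Pascal VOC XML, the labels they actually contain can be collected up front and compared against the labels passed to the training command. A minimal sketch (the function names are assumptions; the <object><name> path is the standard Pascal VOC layout):

```python
import os
import xml.etree.ElementTree as ET

def labels_in_annotations(ann_dir):
    """Collect every <object><name> label across Pascal VOC XML files."""
    labels = set()
    for fname in os.listdir(ann_dir):
        if not fname.lower().endswith(".xml"):
            continue
        tree = ET.parse(os.path.join(ann_dir, fname))
        for obj in tree.getroot().iter("object"):
            name = obj.find("name")
            if name is not None and name.text:
                labels.add(name.text.strip())
    return labels

def missing_labels(requested, ann_dir):
    """Labels requested on the command line but absent from every file."""
    return set(requested) - labels_in_annotations(ann_dir)
```

An empty result from missing_labels means every requested label appears in at least one annotation file.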

Error 9

docker: invalid reference format: repository name must be lowercase.
See 'docker run --help'
Error: Exported model not found 'training-temp/userid/modelname'

Make sure that the path of the folder you are training in has no spaces in it.
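This can be checked quickly before training; a trivial sketch (the function name is illustrative, and it defaults to the current working directory, where the training command is typically run):

```python
import os

def path_has_spaces(path=None):
    """Return True if the given path (or the current working directory)
    contains any whitespace, which breaks the Docker image reference."""
    path = path if path is not None else os.getcwd()
    return any(ch.isspace() for ch in path)
```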

Error 10

Traceback (most recent call last):
File "/tf-tools/create_od_tf_record.py", line 287, in <module>
    tf.app.run()
File "/root/anaconda3/envs/tf-14/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/root/anaconda3/envs/tf-14/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
File "/root/anaconda3/envs/tf-14/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
File "/tf-tools/create_od_tf_record.py", line 275, in main
    mask_type=FLAGS.mask_type)
File "/tf-tools/create_od_tf_record.py", line 233, in create_tf_record
    mask_type=mask_type)
File "/tf-tools/create_od_tf_record.py", line 152, in dict_to_tf_example
    classes.append(label_map_dict[class_name])
KeyError: 'box'

The error above was caused by not including the label 'box' in the training command even though it was present in the provided dataset. If you see a similar error, make sure that you are training on all labels present in your dataset.

Error 11

Error: Error: EMFILE: too many open files, open <directory>

This error is due to your operating system's limit on the number of open files. You can check the file limit using 'ulimit -aH' on Linux and Mac to see if this is the cause of the error. You can also raise the limit using 'ulimit', but please use caution when altering your system settings.
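The same limit can also be read and adjusted programmatically with Python's standard resource module (POSIX only; this mirrors what 'ulimit -n' reports, and the function names here are illustrative):

```python
import resource

def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def raise_soft_limit(target):
    """Raise the soft limit toward the hard limit; use with caution."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # cannot exceed the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

Only the hard limit truly caps a non-root process; setting the soft limit to the hard limit is the safe upper bound.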

Trouble with Publishing Models

Ensure you are properly logged in with aai user show, and log in with aai user login if not.

Ensure your user ID matches the user ID in the specified model, and that the model name is also properly specified.