CUDA & Contours

by Steve Griset | November 3, 2021 | Computer Vision | 8 min read


Using only Computer Vision techniques (edge detection, thresholding, blurring, etc.), a developer can put together useful applications without resorting to computationally expensive machine learning algorithms. For example, using background subtraction, a developer could build an application that counts people going in and out of a store, one that counts cars entering and exiting a parking garage, or a motion detection application for security cameras. If developers wish to track the objects detected by background subtraction, they can simply apply contours to them. To speed up these applications, a developer can use CUDA APIs to offload Computer Vision algorithms from the CPU to the GPU.

Access the source code from the tutorial here: https://github.com/alwaysai/cuda_and_contours


CUDA (Compute Unified Device Architecture) is a set of libraries and APIs that give developers programmable access to the parallel computing power of a GPU. CUDA APIs grant applications executing on the CPU direct access to NVIDIA's GPU virtual instruction set and parallel computational elements. Since 2018, OpenCV has offered developers APIs and modules for CUDA-accelerated Computer Vision, including image processing and filtering, optical flow, background segmentation, and others.

A complete list of CUDA accelerated modules can be found in OpenCV’s official documentation: https://docs.opencv.org/master/d1/d1e/group__cuda.html

To get the most out of this functionality, you need to understand CUDA's parallelism. CUDA uses data parallelism, not task parallelism. Data parallelism focuses on distributing data across different processing elements, which operate on it in parallel. It can be applied to data structures like arrays and matrices by working on each element in parallel, which is ideal for Computer Vision and machine learning applications. Task parallelism, on the other hand, spreads processes and threads across CPU cores, speeding up traditional software programs.
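The distinction can be illustrated on the CPU with NumPy, whose vectorized operations apply one instruction across every element of an array — the same data-parallel pattern a GPU executes across thousands of threads (a simplified illustration of the concept, not CUDA code):

```python
import numpy as np

# Data parallelism: one operation applied to every element at once.
# On a GPU, each element would map to its own CUDA thread.
pixels = np.array([10, 128, 200, 255], dtype=np.uint8)
inverted = 255 - pixels  # every element processed by the same instruction
print(inverted)  # [245 127  55   0]
```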


A contour of an object is the curve joining all the continuous points along that object's boundary. In Computer Vision, contours are useful for shape analysis, object detection, and recognition. To accurately find an object's contour, it's recommended to use a single-channel image (in most cases grayscale) where thresholding and edge detection can be applied to isolate a white foreground object from the black background. OpenCV provides a useful function, cv2.findContours(), to find the raw contours of an input image. The function takes three arguments: the input image, the contour retrieval mode, and the contour approximation method. Once called, it returns a tuple containing the image's contours and their hierarchy.

alwaysAI’s EdgeIQ software package contains helper APIs that provide advanced contour processing capabilities when used in conjunction with OpenCV’s cv2.findContours() function: https://alwaysai.co/docs/Edgeiq_api/background_subtractor.html

[Image: the same frame rendered in several color spaces]

Using CUDA to do Color Spacing

Source code: https://github.com/alwaysai/cuda_and_contours/tree/main/color_space_cuda

A color space in Computer Vision is an organization of colors that allows an application to consistently represent and reproduce colors based on a mathematical color model. The color_space_cuda module included in this tutorial's GitHub repository demonstrates how to use CUDA in your application to convert from one color space to another. The application takes input video in the BGR color space, uses the GPU to convert each input frame into the HSV, L*a*b*, and YCrCb color spaces, and then renders all four color spaces simultaneously to the screen.

When you open up the application, you will notice that it is divided into two sections: the main section, which controls the flow of the application, and a class, get_color_spaces, that converts the original BGR frame into the different color spaces. The class takes one boolean argument indicating whether to use CUDA for the conversions. Its first method is the initialization method, which pre-allocates memory to receive the arrays returned by OpenCV function calls.

def _initialize_color_space(self, bgr_frame):
    print("Allocating Memory")
    self.rows, self.columns = bgr_frame.shape[:2]
    self.bgr_frame = np.empty((self.rows, self.columns, 3), np.uint8)
    self.hsv_frame = np.empty((self.rows, self.columns, 3), np.uint8)
    self.lab_frame = np.empty((self.rows, self.columns, 3), np.uint8)
    self.YCrCb_frame = np.empty((self.rows, self.columns, 3), np.uint8)
    if self.cuda:
        self.stream = cv2.cuda_Stream()
        self.cuda_bgr_frame = cv2.cuda_GpuMat(self.rows, self.columns, cv2.CV_8UC3)
        self.cuda_hsv_frame = cv2.cuda_GpuMat(self.rows, self.columns, cv2.CV_8UC3)
        self.cuda_lab_frame = cv2.cuda_GpuMat(self.rows, self.columns, cv2.CV_8UC3)
        self.cuda_YCrCb_frame = cv2.cuda_GpuMat(self.rows, self.columns, cv2.CV_8UC3)

OpenCV in Python automatically allocates the arrays returned from function calls. If you do not pre-allocate memory for these return arrays, each function call will allocate and then destroy them, adding overhead on every frame.

self.stream = cv2.cuda_Stream()

OpenCV’s CUDA modules use streams to execute algorithms. CUDA streams are a sequence of operations that execute on the GPU. CUDA streams have the following concurrency attributes:

  1. CUDA operations in different streams may run concurrently
  2. CUDA operations from different streams may be interleaved

If you don’t explicitly pass a CUDA stream to an OpenCV CUDA module, the default stream is used, which executes a GPU device synchronization before the function exits, temporarily stalling GPU concurrency. The next method of the class takes the BGR input frame, converts it to the different color spaces (HSV, L*a*b*, and YCrCb), and returns the results to the main function of the application. The program has two flows: one that uses CUDA to do the work, and another that uses only the CPU.

def do_color_spaceing(self, frame):
    if not self._initialized:
        self._initialize_color_space(frame)
        self._initialized = True
    self.bgr_frame = frame
    if self.cuda:
        self.cuda_bgr_frame.upload(self.bgr_frame, self.stream)
        self.cuda_hsv_frame = cv2.cuda.cvtColor(self.cuda_bgr_frame, cv2.COLOR_BGR2HSV, stream=self.stream)
        self.cuda_lab_frame = cv2.cuda.cvtColor(self.cuda_bgr_frame, cv2.COLOR_BGR2Lab, stream=self.stream)
        self.cuda_YCrCb_frame = cv2.cuda.cvtColor(self.cuda_bgr_frame, cv2.COLOR_BGR2YCrCb, stream=self.stream)
        self.cuda_hsv_frame.download(self.stream, self.hsv_frame)
        self.cuda_lab_frame.download(self.stream, self.lab_frame)
        self.cuda_YCrCb_frame.download(self.stream, self.YCrCb_frame)
        self.stream.waitForCompletion()
    else:
        self.hsv_frame = cv2.cvtColor(self.bgr_frame, cv2.COLOR_BGR2HSV)
        self.lab_frame = cv2.cvtColor(self.bgr_frame, cv2.COLOR_BGR2Lab)
        self.YCrCb_frame = cv2.cvtColor(self.bgr_frame, cv2.COLOR_BGR2YCrCb)
    return (self.hsv_frame, self.lab_frame, self.YCrCb_frame)

When using the GPU, the first step is to upload the BGR frame to the device (GPU). When OpenCV uploads data, it changes the data format from Mat to GpuMat, a format which can be consumed by the GPU.

self.cuda_bgr_frame.upload(self.bgr_frame, self.stream)

CUDA C extends C by allowing the programmer to define functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. OpenCV uses the CUDA C extensions to implement its methods as CUDA kernels, which allows applications to execute asynchronously on the GPU.

# async kernel 1
self.cuda_hsv_frame = cv2.cuda.cvtColor(self.cuda_bgr_frame, cv2.COLOR_BGR2HSV, stream=self.stream)
# async kernel 2
self.cuda_lab_frame = cv2.cuda.cvtColor(self.cuda_bgr_frame, cv2.COLOR_BGR2Lab, stream=self.stream)
# async kernel 3
self.cuda_YCrCb_frame = cv2.cuda.cvtColor(self.cuda_bgr_frame, cv2.COLOR_BGR2YCrCb, stream=self.stream)

Leveraging CUDA’s streams and kernels, the example code is able to convert the BGR frame to the different color spaces asynchronously instead of serially. The only blocking call is made at the end, to make sure all the downloads have completed. Using a Jetson Xavier, I ran a few benchmarks with OpenCV and OpenCV CUDA on a 10-second video clip. On average, the CUDA method delivered about 1 FPS more than the non-CUDA method.

              OpenCV with CUDA    OpenCV without CUDA
First Run     28.49 FPS           26.74 FPS
Second Run    27.90 FPS           27.92 FPS
Third Run     28.48 FPS           27.15 FPS
Average FPS   28.29 FPS           27.52 FPS

Background Subtraction using CUDA and alwaysAI

Source code: https://github.com/alwaysai/cuda_and_contours/tree/main/background_subtraction

Background subtraction is a technique for generating a foreground mask using a fixed camera. The foreground mask is a binary (black and white) image in which the white pixels belong to moving objects in the scene. The algorithm performs a subtraction between the current frame and a model of the background, which contains the static part of the scene; this is why the camera must be fixed for the algorithm to work correctly.
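The core of the idea can be sketched in a few lines with a static reference frame and simple differencing (real implementations such as MOG2 maintain a per-pixel statistical model rather than a single reference frame, and the scene here is synthetic):

```python
import numpy as np

# Background model: a plain gray scene (a stand-in for a learned model).
background = np.full((100, 100), 100, dtype=np.uint8)

# Current frame: same scene with a bright object in one corner.
frame = background.copy()
frame[10:30, 10:30] = 220

# Subtract and threshold: white (255) marks foreground pixels.
diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
mask = np.where(diff > 50, 255, 0).astype(np.uint8)
print(mask[20, 20], mask[60, 60])  # 255 0
```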

[Image: foreground mask produced by background subtraction]

alwaysAI provides a set of APIs that enable developers to use background subtraction in their applications, using either GPU or CPU to execute the algorithm.

mog2_process = Edgeiq.MOG2(history=120, var_threshold=250, detect_shadows=True, cuda=CUDA)

The instantiation of the alwaysAI background subtraction class MOG2 takes four parameters:

  1. history = the length of history in frames before executing a background model update.
  2. var_threshold = the threshold on the squared Mahalanobis distance between a pixel and the background model distribution.
  3. detect_shadows = whether shadow detection should be enabled.
  4. cuda = whether to use the NVIDIA Jetson GPU to perform the background subtraction calculation.

The first three parameters are useful for tweaking your application to achieve the desired performance.

mog_frame = mog2_process.process_frame(blurred_frame, learning_rate=-1)

To process a frame using background subtraction, you need to call the process_frame method, which takes two input parameters:

  1. frame = the input frame to run background subtraction against. In the sample application contained in the repo, a blurred frame is used to reduce unwanted image noise.
  2. learning_rate = a value between 0 and 1 that indicates how fast the background model is learned. 0 means the background model is not updated at all; 1 means it is completely reinitialized from the last frame. Negative values make the algorithm use an automatically chosen learning rate.

The sample background subtraction application is divided into three functions:

  1. main function, which controls the flow of the application and does the contour processing.
  2. preprocess function, which preprocesses the input image before background subtraction is performed. Gaussian blurring is used to reduce image details that are unnecessary for background subtraction.
  3. postprocess function, which applies two morphological operations (dilate and erode) to sharpen the edges of the foreground mask.

Both the preprocess and postprocess functions use OpenCV CUDA image filtering methods: https://docs.opencv.org/4.5.4/dc/d66/group__cudafilters.html

The main difference between the CUDA version and the standard version of the API is that you must set up the filter in advance before applying it.

cv2.cuda.createGaussianFilter(frame_device.type(), frame_device_gauss.type(), (5, 5), sigma1=0, sigma2=0)

The last part of the main function is where the contour processing is done. The contours are used to highlight the objects in the foreground mask and to find those objects' centroids and outlines. The first step in contour processing is to get the raw contours from the image and process them into contours that other APIs can use.

raw_contours = cv2.findContours(post_frame, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = Edgeiq.get_contours(raw_contours=raw_contours)

You can use alwaysAI’s APIs to get the bounding boxes or the centroid coordinates of the contours, which can be used to locate them in the image.

bounding_boxes = Edgeiq.get_boundingboxes(contours=contours)
centroids = Edgeiq.get_moments(contours=contours)

OpenCV provides an API to get the raw moments of a contour, which can be used to calculate features like the center of mass, the area, and the centroid of an object. You can use the following code snippet to find the centroid of a contour; M["m00"] is the contour's area.

M = cv2.moments(contour)
if M["m00"] > 0:
    cX = int(M["m10"] / M["m00"])
    cY = int(M["m01"] / M["m00"])

Finally, OpenCV provides you with an API to draw the outline of the contours in an image.

cv2.drawContours(frame, [contour], -1, (0, 0, 255), 2)

[Image: contour outlines drawn on the foreground objects]


Using OpenCV’s CUDA libraries combined with contour processing functions allows developers to create highly efficient applications that are useful in real-world use cases like retail, smart cities, and many other video analytics scenarios. Since these applications do not use computationally expensive machine learning algorithms, they can execute on smaller edge devices, using less power.


Sign up today and start your project

We can't wait to see what you'll build!