How to Detect Pedestrians and Bicyclists in a Cityscape Video

by Eric VanBuhler | 5/19/2022  |  Object Detection  | 5 min read

Detecting pedestrians and bicyclists in a cityscape scene is a crucial part of autonomous driving applications. Autonomous vehicles need to determine how far away pedestrians and bicyclists are, as well as what their intentions are. A simple way to detect people and bicycles is to use Object Detection. However, in this case we need much more detailed information about the exact locations of the pedestrians and bicyclists than Object Detection can provide, so we’ll use a technique called Semantic Segmentation, in which detections are done pixel-by-pixel, rather than with bounding boxes.
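To make that distinction concrete, here is a minimal, hypothetical sketch (not part of the app, and the class indices are illustrative only): an object detector returns one bounding box per object, while a segmentation model returns a class index for every pixel.

import numpy as np

# Hypothetical object-detection output: one bounding box per object.
detections = [
    {"label": "person", "box": (120, 40, 180, 200)},    # (x1, y1, x2, y2)
    {"label": "bicycle", "box": (150, 150, 260, 230)},
]

# Hypothetical semantic-segmentation output: a class index for every pixel.
# 0 = unlabeled, 11 = person, 33 = bicycle (indices chosen for illustration).
class_map = np.zeros((4, 6), dtype=int)
class_map[1:3, 2:4] = 11    # a small blob of "person" pixels
class_map[2:4, 4:6] = 33    # a small blob of "bicycle" pixels

# With segmentation we know exactly which pixels belong to a person.
person_pixels = (class_map == 11)
print(person_pixels.sum(), "pixels classified as person")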

*note - alwaysAI provides a set of open source pre-trained models in the Model Catalog. The following example uses one of the starter models with a simple algorithm in order to achieve its goal.

Removing Pedestrians and Bicyclists from a Video

In this tutorial, we’ll use the ENet computer vision model to segment pedestrians and bicyclists in each frame of a video, and then use the results to perform actions based on the locations of the pedestrians and bicyclists. To keep this application simple, we’ll use the detections to edit the output video, removing the pedestrians and bicyclists from the video. The alwaysAI semantic_segmentation_cityscape starter app runs the ENet model on a series of cityscape images, so it will be a great starting point for us.

*note - The source for this guide can be found at: https://github.com/alwaysai/pedestrian-segmentation

First, pick out your video clip and place it in the app directory. I have a short video clip of pedestrians and bicyclists on a crosswalk. In the app.py file, swap out the series of images for a video clip by removing the image load steps, adding the video stream to our with statement, and changing the Streamer parameters back to the defaults.

def main():
    semantic_segmentation = edgeiq.SemanticSegmentation(
            "alwaysai/enet")
    semantic_segmentation.load(engine=edgeiq.Engine.DNN)

    print("Engine: {}".format(semantic_segmentation.engine))
    print("Accelerator: {}\n".format(semantic_segmentation.accelerator))
    print("Model:\n{}\n".format(semantic_segmentation.model_id))
    print("Labels:\n{}\n".format(semantic_segmentation.labels))

    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.Streamer() as streamer:
        ...

Inside our with statement, we need to change the for loop iterating over images to a while loop reading frames of the video. Since we’re working with a video now, frame makes more sense than image, so change all instances of image to frame.

def main():
    ...
    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.Streamer() as streamer:

        while video_stream.more():
            frame = video_stream.read()

            results = semantic_segmentation.segment_image(frame)

            # Generate text to display on streamer
            text = ["Model: {}".format(semantic_segmentation.model_id)]
            text.append("Inference time: {:1.3f} s".format(results.duration))
            text.append("Legend:")
            text.append(semantic_segmentation.build_legend())

            mask = semantic_segmentation.build_image_mask(results.class_map)
            blended = edgeiq.blend_images(frame, mask, alpha=0.5)
            ...

In order to use the “stop” button on the Streamer, we need to swap streamer.wait() for a check of streamer.check_exit().

def main():
    ...
    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.Streamer() as streamer:

        while video_stream.more():
            ...
            streamer.send_data(blended, text)
            if streamer.check_exit():
                break

Next, we can try out the app to see how well it classifies pedestrians and bicyclists. After running aai app install and aai app start, here’s what I get on the Streamer:

We can see that the model is doing alright at detecting some of the people, but it also appears to be incorrectly detecting some people and bicycles as motorcycles, so we’ll need to take additional steps to correct this issue later on.

Masking Out Only the Pedestrians and Bicyclists

Looking at the label list for the model, the labels we are interested in are “Person,” “Rider,” and “Bicycle.” Let’s make a list of the labels we’d like to mask at the beginning of our app.

def main():
    ...
    labels_to_mask = ['Person', 'Rider', 'Bicycle']
    print("Labels to mask:\n{}\n".format(labels_to_mask))

Now we need to select only the labels from that list for our filtering mask. The class_map provided in the results contains the label index for each pixel. We can create a matrix of label strings by indexing the model’s label list with the class map. Next, we’ll make a base class map of all zeros (the index for “Unlabeled”) and add the classes from our list on top of it. Finally, we’ll pass the new filtered_class_map, rather than results.class_map, to semantic_segmentation.build_image_mask. (If it isn’t already, import numpy as np at the top of app.py, since the filtering uses NumPy.)

def main():
    ...
    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.Streamer() as streamer:

        while video_stream.more():
            ...
            label_map = np.array(semantic_segmentation.labels)[results.class_map]
            # Setting to zero defaults to "Unlabeled"
            filtered_class_map = np.zeros(results.class_map.shape).astype(int)
            for label in labels_to_mask:
                filtered_class_map += results.class_map * (label_map == label).astype(int)

            mask = semantic_segmentation.build_image_mask(filtered_class_map)
            blended = edgeiq.blend_images(frame, mask, alpha=0.5)

Now when we run our app, we can see that only people, riders, and bicycles are masked.

Showing the Original Video and Mask Side-by-side

In order to see the mask colors a little more clearly, we can separate the image and mask so that we can view them side-by-side (rather than seeing the mask superimposed on the image). To do that, we’ll simply concatenate the frame and the mask (stacked along axis 0 in this example) and send the combined result to the Streamer as one image. We’ll also remove the edgeiq.blend_images call, as it’s no longer needed.

def main():
    ...
    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.Streamer() as streamer:

        while video_stream.more():
            ...

            mask = semantic_segmentation.build_image_mask(filtered_class_map)
            combined = np.concatenate((frame, mask), axis=0)

            streamer.send_data(combined, text)

Showing the Masked Video Instead of the Mask

To get a better idea of what is actually being detected in the video, we can mask out the background, which includes everything that is not a “Person,” “Rider,” or “Bicycle.” First, make a boolean matrix of the labeled part of the class-map, and then use that to copy the labeled pixels from the frame to the masked frame.

def main():
    ...
    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.Streamer() as streamer:

        while video_stream.more():
            ...
            bool_class_map = (filtered_class_map > 0)
            masked_frame = np.zeros(frame.shape)
            masked_frame[bool_class_map] = frame[bool_class_map].copy()
            combined = np.concatenate((frame, masked_frame), axis=0)

            streamer.send_data(combined, text)

Filtering Out Pedestrians and Bicyclists

In order to filter out our objects of interest, we’ll store the last known pixel value from when the object wasn’t detected, and apply that value to the same location when an object is detected. We’ll create an array called last_non_detection to store the last known value of each pixel when it was not detected as a pedestrian or bicyclist. For each frame, we’ll update last_non_detection with the latest non-detection pixels. Then we’ll generate the output frame starting from the latest frame, replacing the pixels where pedestrians and bicyclists are detected.

def main():
    ...
    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.Streamer() as streamer:

        last_non_detection = None
        while video_stream.more():
            frame = video_stream.read()

            if last_non_detection is None:
                last_non_detection = np.zeros(frame.shape)

            ...

            non_detection_map = (filtered_class_map == 0)
            detection_map = (filtered_class_map != 0)
            last_non_detection[non_detection_map] = frame[non_detection_map].copy()
            out_frame = frame.copy()
            out_frame[detection_map] = last_non_detection[detection_map].copy()
            combined = np.concatenate((frame, out_frame), axis=0)

            streamer.send_data(combined, text)

We noticed earlier that some of the people and bicyclists were incorrectly classified as “Motorcycle.” Let’s add “Motorcycle” to our list of labels to mask and see if the results are more accurate.

def main():
    ...

    labels_to_mask = ['Person', 'Rider', 'Bicycle', 'Motorcycle']
    ...

We can see that now the people and bicyclists are filtered out much better!

Saving the Video to a File

Since we’re doing batch processing of a video file, there isn't really a need to display everything on the Streamer. We can process each frame and create a new video file for the output. To save the video clip, we’ll use the VideoWriter class. You can learn more about saving video clips in our documentation. Let’s also create a flag so that we can easily enable and disable Streamer processing.

def main():
    ...
    enable_streamer = False

    with edgeiq.FileVideoStream('Use Case - Clip 3.mp4') as video_stream, \
            edgeiq.VideoWriter(output_path="processed_video.avi") as video_writer:

        if enable_streamer:
            streamer = edgeiq.Streamer().setup()

        last_non_detection = None
        while video_stream.more():
            ...

            if enable_streamer:
                # Generate text to display on streamer
                text = ["Model: {}".format(semantic_segmentation.model_id)]
                text.append("Inference time: {:1.3f} s".format(results.duration))
                text.append("Legend:")
                text.append(semantic_segmentation.build_legend())
            ...

            if enable_streamer:
                combined = np.concatenate((frame, out_frame), axis=0)

                streamer.send_data(combined, text)
                if streamer.check_exit():
                    break

            video_writer.write_frame(out_frame)

        if enable_streamer:
            streamer.close()

Potential Improvements

Watching our video, we see that there are times when a person or bicyclist gets captured in last_non_detection, and those pixels are used rather than the background. The likely cause is that, from frame to frame, some pixels alternate between being detected and undetected. We could add a filter to the mask to smooth out these occasional misdetections between otherwise correct detections.
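As a rough illustration of that idea (not part of the tutorial code), one approach is temporal smoothing: only treat a pixel as a detection if it has been classified as a pedestrian or bicyclist in most of the last few frames. The window size and threshold below are assumptions you would tune for your video.

import collections
import numpy as np

# Minimal sketch of temporal smoothing, assuming a boolean detection map
# per frame (True where a pedestrian/bicyclist pixel was detected).
HISTORY = 5        # number of recent frames to consider (tunable assumption)
THRESHOLD = 3      # pixel must be detected in at least this many of them

history = collections.deque(maxlen=HISTORY)

def smooth_detection_map(detection_map):
    """Return a detection map that ignores one-frame flickers."""
    history.append(detection_map.astype(np.uint8))
    # Until enough history has accumulated, fall back to the raw map.
    if len(history) < HISTORY:
        return detection_map
    return np.stack(history).sum(axis=0) >= THRESHOLD

The smoothed map could then stand in for detection_map when building out_frame, at the cost of a few frames of lag.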

Additionally, we could cut down on the code size and post-processing by combining our filtered class-map and detection handling.
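For example, np.isin can build the boolean detection map directly from the label map computed earlier, skipping the intermediate filtered_class_map entirely. This is a sketch of the idea rather than a drop-in change to the app; label_map and labels_to_mask are the names already used in the tutorial.

import numpy as np

# Sketch: build the detection map in one step from the label map.
detection_map = np.isin(label_map, labels_to_mask)
non_detection_map = ~detection_map

# These maps can then be used exactly as before to update
# last_non_detection and to build out_frame.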

Conclusion

We’ve walked through a simple example of segmenting an image, masking only specific classes, and taking an action based on those masks. This process can be expanded to many use cases, including autonomous driving, defect detection, medical analysis, and many others.

The alwaysAI platform makes it easy to build, test, and deploy computer vision applications such as this pedestrian and bicyclist detector. We can’t wait to see what you build with alwaysAI!
