Make sure to check out the repo here!
In recent years, both hardware and software advances have enabled many new applications to introduce more context in decision making.
At the same time, we are moving beyond a decades-old communications infrastructure to support a rapidly evolving ecosystem of IoT devices. This transition comes with growing pains around scaling bandwidth.
High demand for image/video content is largely at the heart of these challenges.
Applications like VR in particular demand high bandwidth to avoid nauseating users as they turn their heads to view their simulated surroundings.
Facebook researchers have published video compression techniques tailored to tackle the challenges of VR.
Using gaze detection, they provide a high-fidelity rendition of the scene where a user is focused. Then, like magic, the parts of the image the user is not focusing on are filled in (hallucinated) by a GAN, which generates a realistic rendition while saving a lot of bandwidth.
In other work, Berkeley researchers achieve a new state of the art in performing lossless image compression with a goal of helping people "explore and advance this line of modern compression ideas".
This line of investigation emphasizes learning the distribution of an image corpus to more efficiently encode the spatial information for a sample.
YouTube also acknowledges the importance of specializing bandwidth reduction strategies to exploit the distribution of data. They even created a dataset to promote advances in video compression for user generated content.
Going forward, we consider video feeds typical of IoT-enabled smart camera applications.
There seems to be no avoiding the trade-off of spending compute to save bandwidth, but we want to spend that compute intelligently by taking advantage of context.
Though video data is typically very redundant in time, methods used to reduce bandwidth in time, such as motion-detection-based filtering, lack the context to be of practical use: motion cameras are often triggered too frequently and at uninteresting moments.
Supervised learning based vision tasks, like object detection, can be used to parse additional semantic context from a scene. However, this leaves the application boxed in by developer assumptions.
Instead, we want to explore unsupervised methods of reducing bandwidth by learning the context of a scene in order to filter redundant content from a streaming video feed.
Our approach will be guided by the following observations:
- MOST images are NOT interesting/relevant
- we cannot define "interesting" ahead of time
The need to flag infrequent but novel examples from an image corpus leads us to frame a visual anomaly detection problem.
Applying visual anomaly detection, we stream ONLY infrequent anomalous images.
Some surveys on anomaly detection methods include motion detection as an image-processing-based approach, but it only works well when motion is infrequent.
One powerful application of deep learning involves learning the distribution of an image corpus to flag anomalies.
To robustly identify novelty in an image feed, we use the image reconstruction loss of an autoencoder as an anomaly score and choose a suitable threshold.
The autoencoder is trained with a batch size of 1 on each new image acquired by the camera (online learning) to learn the changing distribution of images.
Updating our model on each new image helps to adapt to the changing (nonstationary) image distribution.
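The per-frame update can be sketched as a single Keras `train_on_batch` call whose returned loss doubles as the anomaly score. The function and argument names below are illustrative, not from the original code:

```python
import numpy as np

def online_step(autoencoder, frame):
    """One online update: train on a single frame, return its loss.

    `autoencoder` is any compiled Keras model. The loss returned by
    train_on_batch is computed with the current weights (before the
    gradient step), so it doubles as the frame's anomaly score.
    """
    x = frame[np.newaxis, ...]  # batch of one
    return float(autoencoder.train_on_batch(x, x))
```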
However, it is a much greater computational burden to train a deep learning model than it is to run inference.
Here, the Jetson Nano shines with the hardware acceleration and environment to run fast train steps online and inline with image acquisition.
Model performance in computer vision tasks can be strongly impacted by choice of network architectures.
Rather than performing a broad search over network architecture designs, we emphasize simple patterns like the hourglass shape of an undercomplete autoencoder.
Further, we enforce a limit on the number of network parameters to keep the models small and fast.
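As a sketch, an undercomplete "hourglass" autoencoder under an explicit parameter budget might look like the following in Keras. The layer sizes and the 100k-parameter budget are illustrative choices, not the exact architecture from the experiment:

```python
from tensorflow.keras import layers, models

def build_autoencoder(input_shape=(64, 64, 1), budget=100_000):
    """Small undercomplete convolutional autoencoder under a parameter budget."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # encoder: strided convolutions squeeze the representation
        layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
        layers.Conv2D(8, 3, strides=2, padding="same", activation="relu"),
        # decoder: transposed convolutions restore the input resolution
        layers.Conv2DTranspose(8, 3, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),
        layers.Conv2D(input_shape[-1], 3, padding="same", activation="sigmoid"),
    ])
    # enforce the small-and-fast constraint explicitly
    assert model.count_params() < budget, "model exceeds parameter budget"
    model.compile(optimizer="adam", loss="mse")
    return model
```

Keeping the budget as an assertion makes it easy to experiment with wider or deeper variants without silently blowing up inference cost on the device.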
First, we show the loss as our autoencoder learned on each new image sampled from a camera pointing out of a window to observe city car and foot traffic.
To score a new sample as normal or anomalous, we compute the deviation of its reconstruction loss from a rolling-window sample mean. Samples whose loss deviates from that mean by more than some factor (like 3-5) times the sample standard deviation over the rolling window are flagged as anomalous images.
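One way to implement this thresholding rule is with a rolling window of recent losses. The class name, window length, and warm-up cutoff below are illustrative choices:

```python
from collections import deque

import numpy as np

class RollingAnomalyScorer:
    """Flag losses that deviate from a rolling-window mean.

    `window` and `k` (the threshold factor, e.g. 3-5) are tunable;
    this is a sketch of the thresholding rule, not the exact
    experiment code.
    """

    def __init__(self, window=100, k=4.0):
        self.losses = deque(maxlen=window)
        self.k = k

    def is_anomalous(self, loss):
        history = np.array(self.losses)  # snapshot before adding the new loss
        self.losses.append(loss)
        if len(history) < 10:  # warm-up: too little history to judge
            return False
        mean, std = history.mean(), history.std()
        return abs(loss - mean) > self.k * max(std, 1e-8)
```

A smaller window makes the scorer forget the past faster (more recency bias); a larger `k` filters more aggressively.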
We begin a frame acquisition loop and start training a small, undercomplete convolutional autoencoder. At each step, we evaluate the sample by thresholding on the reconstruction loss to identify anomalous images to stream.
For the sake of documenting our experiment, we actually forgo the bandwidth savings and stream all images and labels for later review. We want to explore how different:
- thresholds filter more/less aggressively
- learning rates and window size affect model recency bias
- model sizes impact computational performance
This experiment flagged 36 of 4,991 sampled images as anomalous at the threshold we set. In other words, the threshold we chose filtered the image feed down to a set nearly 140 times smaller.
This kind of reduction could also make human review much easier as long as we find useful behavior in how anomalous images are flagged.
Instead of motion as a trigger, our model exhibits high anomaly scores during large structural image changes (hand in front of camera, camera movement, large unobserved object moves into field of view).
Check out a sample clip from our visual anomaly detection experiment.
Here is a simple demo script to perform visual anomaly detection using the video feed from a webcam and training a small convolutional autoencoder using Keras.
By generalizing our experiment into a more flexible repo, we can investigate the effects of using different network architectures and learning parameters in performing anomaly detection over multiple video stream sources.
Ideally, we can establish a threshold for anomalies which provides useful results for downstream human review.
However, by lowering the threshold to achieve sufficient recall, we are more likely to introduce false positives, which can be frustrating for reviewers.
By performing additional analysis, we can filter those samples marked for human review using some acceptance criteria based on semantic context.
Here, the DeepStream SDK shines!
This library makes it easy for us to detect and track objects while applying cascaded inference with secondary classifiers to determine object attributes.
Here, we show the sample application, which includes models to track objects like cars and people while also using secondary models to determine vehicle type, color, and make.
Using Nvidia's Transfer Learning Toolkit, we can fine-tune models to specialize to a new task for our application.
Then our unsupervised, online anomaly detection module can be incorporated into the cascade to identify an anomalous episode.
To each episode, we may apply intelligent video analytics with detectors, trackers, and secondary inferences using the DeepStream SDK and stream the inference results through popular message brokers. From here, we might index each episode in the cloud for review along with its associated metadata.
More concretely, we might configure our anomaly detectors to use OpenCV's VideoWriter to write anomalous episodes to an mp4 file. Then we can set up an event handler to watch that directory and run the DeepStream SDK pipeline on anomalous episodes of video.
By running online anomaly detectors, devices like the Jetson Nano can filter which streams use additional resources of compute and bandwidth and ultimately, human review, similar to NVIDIA's smart garage reference application.
By introducing an online, on-device anomaly detection module into our IVA pipeline, we can exploit the trade-offs in how we budget bandwidth and compute resources to improve user experience in smart city applications like Metropolis.