This is part of a blog series on video recognition using deep learning algorithms. Here we will dive deep into the input considerations, solution approaches, and design considerations applied to video recognition using deep learning techniques.
What are the common video dataset benchmarks for model training?
Training deep learning models for video analysis tasks such as human activity recognition and video understanding requires large amounts of annotated video data, with annotations spanning object detection, classification, localization, and skeleton-level human pose/action detection.
Leading datasets used to benchmark model performance in terms of accuracy and precision are Kinetics (400/600/700), Atomic Visual Actions (AVA), and Charades. These differ in the number of video files, clip lengths, and degree of feature annotation.
What are the input processing considerations?
1. Frame rates, sampling and strides for training
Training video action recognition models on raw video input at a high frame rate requires extensive compute resources. To reduce this cost, frame sampling with an appropriate stride, i.e. analyzing every Nth frame, is often employed, with the stride chosen according to the nature of the task: its complexity and how quickly the action varies over time.
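As a minimal sketch of strided sampling (the function name and parameters here are illustrative, not from any particular library):

```python
import numpy as np

# Illustrative sketch of strided frame sampling; names are hypothetical.
def sample_frame_indices(num_frames: int, stride: int, clip_len: int) -> np.ndarray:
    """Pick every `stride`-th frame index, keeping at most `clip_len` frames."""
    indices = np.arange(0, num_frames, stride)  # every Nth frame
    return indices[:clip_len]                   # truncate to the clip length

# A 300-frame video sampled with stride 10, keeping at most 16 frames:
idx = sample_frame_indices(300, 10, 16)         # frame indices 0, 10, 20, ..., 150
```

A larger stride cuts compute further but risks skipping over fast, short-lived motions, which is why the stride should track the temporal variability of the target action.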
2. Prediction averaging across short clips while validating
Validating and optimizing model outputs on multiple short, overlapping video clips lets ML engineers cover a comprehensive range of real-world inputs, in turn ensuring high model performance in terms of accuracy, reliability, and resilience.
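A minimal sketch of clip-level prediction averaging (the array shapes and function name are assumptions for illustration): per-clip logits are converted to probabilities and averaged into a single video-level score.

```python
import numpy as np

def average_clip_predictions(clip_logits: np.ndarray) -> np.ndarray:
    """Average per-clip softmax scores into one video-level distribution.

    clip_logits: array of shape (num_clips, num_classes).
    """
    shifted = clip_logits - clip_logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)   # softmax per clip
    return probs.mean(axis=0)                   # average over clips

# Three overlapping clips, four action classes:
video_probs = average_clip_predictions(np.random.randn(3, 4))
```

Averaging over overlapping clips smooths out predictions from any single poorly positioned clip, which is what makes the final video-level score more reliable.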
What are various solution approaches explored so far?
• Single network models
This is the basic approach to action recognition in videos. The idea is to train a 2D CNN model to predict the action in individual frames of the video. This offers a strong performance baseline, which can be further fine-tuned through the choice of backbone architecture and training mechanisms.
This approach works for simple actions that have no inter-frame temporal dependencies, e.g. walking, running, eating, or drinking.
However, for complex actions that require detecting a sequence of tasks, such as those performed by an industrial assembly operator or a retail store associate, the accuracy of this solution might not be sufficient.
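One simple way to turn per-frame 2D CNN predictions into a single video label, sketched here with a hypothetical probability matrix rather than a real CNN, is a majority vote over the frame-level predictions:

```python
import numpy as np

def video_label_by_vote(frame_probs: np.ndarray) -> int:
    """Majority vote over per-frame argmax predictions.

    frame_probs: (num_frames, num_classes) class probabilities from a 2D CNN.
    """
    votes = frame_probs.argmax(axis=1)        # predicted class per frame
    return int(np.bincount(votes).argmax())   # most frequent class wins

# Toy example: 3 frames, 2 classes; class 1 wins two frames out of three.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])
label = video_label_by_vote(probs)            # -> 1
```

Because each frame is classified independently, any ordering information between frames is discarded, which is exactly why this baseline struggles with sequential actions.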
• Two network models
This solution approach uses an ensemble of two parallel network streams whose outputs are merged, or fused, at the requisite point in the pipeline: either early fusion, late fusion, or slow fusion. The fusion mechanism is configured based on the performance requirements of the use case.
Here are the characteristics of the two networks in the pipeline:
- Spatial stream – analyzes individual image frames, recognizing the action from relations between spatial features within a frame. Inter-frame relations receive little consideration in this stream.
- Temporal stream – analyzes a stacked sequence of frames, recognizing the action from inter-frame relations that manifest as changing pixel brightness patterns in a stacked optical flow input.
Spatial streams are convolutional neural networks (CNNs) optimized for image analysis (classification, object detection, counting). Temporal streams are typically models that retain temporal information, such as RNNs and LSTMs.
These are often used as parallel streams, with their outputs fused purposively to achieve the required level of performance. In some cases they are also used serially, with the spatial stream's output fed into the temporal stream; this is known as a convolutional recurrent neural network, or CRNN, model.
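Late fusion of the two streams can be as simple as a weighted average of class scores. A minimal sketch, where the fusion weight is a hypothetical hyperparameter typically tuned on a validation set:

```python
import numpy as np

def late_fusion(spatial_scores: np.ndarray,
                temporal_scores: np.ndarray,
                w_spatial: float = 0.4) -> np.ndarray:
    """Weighted late fusion of spatial and temporal stream class scores."""
    return w_spatial * spatial_scores + (1.0 - w_spatial) * temporal_scores

# Two classes; each stream favors a different class, fusion balances them.
fused = late_fusion(np.array([0.7, 0.3]), np.array([0.2, 0.8]), w_spatial=0.5)
# fused scores: [0.45, 0.55]
```

Early fusion would instead combine the streams' intermediate feature maps before the final classifier; the trade-off between the two is part of the fusion configuration mentioned above.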
• Attention based models
This solution approach captures long-term contextual information, which is pivotal to skeleton-based human activity recognition tasks and is not captured by the ensemble architectures using CNNs and RNNs discussed so far. CNNs and RNNs rely on local operations from spatial and temporal perspectives, respectively. The attention mechanism in self-attention network, or SAN, models helps acquire the global context.
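The global context comes from every time step attending to every other time step. A minimal NumPy sketch of scaled dot-product self-attention over a sequence of per-frame feature vectors, with randomly initialized projection weights purely for illustration:

```python
import numpy as np

def self_attention(x: np.ndarray, wq: np.ndarray,
                   wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention: each time step attends to all others.

    x: (seq_len, d_model) sequence of per-frame feature vectors.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])     # (seq_len, seq_len) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over all time steps
    return attn @ v                               # globally contextualized features

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))                 # 16 frames, 32-dim features each
wq, wk, wv = [rng.standard_normal((32, 32)) for _ in range(3)]
out = self_attention(x, wq, wk, wv)               # same shape, global context mixed in
```

Unlike an RNN, which must propagate information step by step, the attention matrix connects frame 1 to frame 16 directly; this is the "global context" that SAN models exploit.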
• Slow fast networks
This solution approach, developed and published by Facebook AI Research, employs a two-pathway model: a slow pathway with a low frame rate that captures spatial semantic detail, and a fast pathway with a high frame rate that captures temporal transitions over time. The pathways are fused using lateral connections. While this is also a two-network ensemble model, its principal differentiator from the earlier two-stream approaches is that the temporal speeds of the two pathways differ. The fast pathway runs at a high temporal speed but is a lightweight network, enabling it to capture the state transitions of semantically and spatially distinct objects in the video field of view, which are in turn detected by the slow pathway at a lower frame rate but with a higher number of input channels. This approach is widely seen as the latest state of the art.
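The differing temporal speeds can be sketched as two sampling schedules over the same video. The default values below (a slow stride of 16 and a fast/slow frame-rate ratio alpha of 8) follow the published SlowFast defaults, but the function itself is an illustrative sketch, not the paper's code:

```python
import numpy as np

def slowfast_sample(num_frames: int, slow_stride: int = 16, alpha: int = 8):
    """Sample frame indices for the slow and fast pathways.

    The fast pathway samples `alpha` times more densely than the slow one.
    """
    slow_idx = np.arange(0, num_frames, slow_stride)           # sparse: spatial semantics
    fast_idx = np.arange(0, num_frames, slow_stride // alpha)  # dense: motion/transitions
    return slow_idx, fast_idx

slow, fast = slowfast_sample(64)
# slow pathway sees 4 frames; fast pathway sees 32 (8x more)
```

The fast pathway's lightweight design (far fewer channels) keeps its dense sampling affordable, while the lateral connections feed its motion information into the channel-rich slow pathway at each stage.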
In subsequent parts of this series, we will look at various industry applications of video recognition models at the sensory edge.
eInfochips offers comprehensive computer vision solutions for diverse industry verticals like transportation, industrial, pharmaceutical, smart cities, and consumer electronics, across the model lifecycle from algorithm selection, training, validation, and inferencing to deployment and sustenance. For more information, please contact us today.
This post was last modified on November 10, 2020 10:32 am