


DSP LAB Paper Report

Large-scale Video Classification with Convolutional Neural Networks

Name- Nipun Dixit   N.No- N14378569

Summary of the Paper

This paper explores the use of Convolutional Neural Networks (CNNs) for large-scale video classification. CNNs have proven to be a remarkably effective method for image recognition, producing results of great accuracy and precision, and the authors attempt to carry the same method over to video. They provide an extensive empirical evaluation of CNNs on large-scale video classification, using a dataset of around one million YouTube videos annotated with 487 classes.

They try multiple approaches for extending CNNs into the time domain so that the networks can take advantage of locally available spatio-temporal information, and they also propose a foveated, multiresolution architecture as a way to speed up training.

It was observed that the spatio-temporal networks gave a significant improvement over the feature-based baselines (55.3% to 63.9%), whereas only a modest improvement was seen over the single-frame models (59.3% to 60.9%).

Lastly, the generalization ability of the best model was tested by applying it to the UCF-101 Action Recognition dataset, where a significant performance improvement over the baseline model was observed.

Introduction

A huge amount of images and videos is available on the internet, and over time researchers have worked hard to develop algorithms for varied, path-breaking applications, from search and information summarization to real-time video tagging. CNNs have come to be seen, and used, as highly effective models for image content, whether for image segmentation, detection, retrieval, or high-accuracy image recognition.

Over time, CNNs have also learned amazingly powerful image features, which makes them promising for real-time, or rather large-scale, video classification. Videos, however, pose challenges that static images do not: in addition to the appearance information in each single static frame, the model has access to its complex temporal evolution.

There are hardly any classification benchmarks for properly evaluating video models, since videos are in general more difficult to store, collect, edit, and annotate than images.

It is therefore very important to train CNN models that can classify videos by analyzing their content. For this purpose, a dataset of sports videos was selected, containing around one million videos classified into 487 different kinds of sports.

CNNs generally require a very large amount of training time to optimize the huge number of parameters that make up the model, and in this case the difficulty is multiplied by the presence of several video frames at every instant in time.

The paper tries to resolve this issue by introducing two different streams of processing:

Context Stream- learns features of the video from a low-resolution version of the full frame.

Fovea Stream- learns features at high resolution, and generally operates on the middle portion of the frame.
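As a rough illustration, the two streams above can be derived from a single frame as follows (a minimal NumPy sketch: the 178x178 input size and 89x89 stream size follow the paper, but plain stride-2 subsampling stands in here for proper low-pass downsampling):

```python
import numpy as np

def split_streams(frame, out_size=89):
    """Split one video frame into the context and fovea streams."""
    h, w, _ = frame.shape
    # Context stream: the whole frame at half resolution
    # (stride-2 subsampling; real code would low-pass filter first).
    context = frame[::2, ::2, :][:out_size, :out_size, :]
    # Fovea stream: a full-resolution crop around the frame centre.
    top, left = (h - out_size) // 2, (w - out_size) // 2
    fovea = frame[top:top + out_size, left:left + out_size, :]
    return context, fovea

frame = np.zeros((178, 178, 3), dtype=np.uint8)
context, fovea = split_streams(frame)
# Each stream is 89 x 89 x 3, so the total input dimensionality is
# roughly halved compared with one full-resolution 178 x 178 frame.
```

This halving of the input is what produces the runtime gain discussed next, since both stream towers operate on quarter-area inputs.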


After implementing these two kinds of processing, a significant improvement in the runtime performance of the model was observed, since the dimensionality of the input was significantly reduced while the accuracy of the classification model was retained.

A genuine doubt that will cross anyone's mind is whether these models are good only for a huge dataset such as the roughly one-million-video sports dataset considered here, or whether they also work on different, smaller datasets. The paper works on that issue empirically: rather than learning an entire network on UCF-101 alone, it achieves very high performance by re-purposing the low-level features learned on the sports dataset; moreover, since some UCF-101 classes are related to sports, those features transfer well to the different setting.

In summary, this paper does the following:

It provides an extensive evaluation of CNNs extended to video classification, moulding them so that they produce the same kind of significant performance improvements they did for images.

It proposes an architecture that processes video inputs at two different spatial resolutions, i.e. a context stream at low resolution and a fovea stream at high resolution, improving the runtime performance of the CNN without compromising the accuracy of the system.

It applies the networks to the UCF-101 dataset and reports significant improvements both over feature-based results and over networks trained on UCF-101 alone.

Models

Videos cannot be processed as simply as images, which can be rescaled and cropped to the desired size. When developing the model, each video is instead treated as a bag of short, fixed-size clips. Since every clip contains several frames over a span of time, the network is extended into the time dimension so that it can learn temporal features.

To generate a connectivity pattern in time, the models are classified into Early Fusion, Slow Fusion, and Late Fusion, described in more detail below; afterwards, a multiresolution model that increases computational efficiency is described.

Time Information Fusion CNNs

Fusion is done in the CNNs either by extending the first convolutional layer filters in time, or by placing two single-frame networks on frames separated by some distance in time and fusing their outputs for further processing.
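The three connectivity patterns can be sketched in terms of how an input clip is presented to the network (a hypothetical NumPy sketch; the 15-frame gap for late fusion and the 4-frame windows with stride 2 for slow fusion follow the paper's description, while the clip and frame sizes are illustrative):

```python
import numpy as np

def early_fusion_input(clip):
    # Early fusion: stack a window of T frames along the channel axis,
    # so the first convolutional filters see all T*3 channels at once.
    t, h, w, c = clip.shape
    return clip.transpose(1, 2, 0, 3).reshape(h, w, t * c)

def late_fusion_inputs(clip, gap=15):
    # Late fusion: two single frames `gap` frames apart, each fed to its
    # own single-frame tower; the towers merge only at the top layers.
    mid = clip.shape[0] // 2
    return clip[mid - gap // 2], clip[mid + gap // 2]

def slow_fusion_windows(clip, size=4, stride=2):
    # Slow fusion: overlapping temporal windows whose responses are
    # fused progressively by the higher layers of the network.
    t = clip.shape[0]
    return [clip[s:s + size] for s in range(0, t - size + 1, stride)]

clip = np.zeros((16, 89, 89, 3), dtype=np.float32)  # a 16-frame clip
print(early_fusion_input(clip[:4]).shape)           # (89, 89, 12)
print(len(slow_fusion_windows(clip[:10])))          # 4
```

The sketch only shows how the input tensors are formed; the convolutional towers that consume them are omitted.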

Multiresolution CNNs

CNNs generally take weeks to train on datasets as large as the one considered in the paper, so runtime is a critical and crucial component of the ability to experiment with different architectures. To address this problem of speeding up the models while retaining their performance, the researchers change the architecture itself so that it runs faster without compromising the performance of the system.

Other approaches that could have been used are improvements in weight quantization, better optimization algorithms, and improved hardware, but the approach used here, modifying the architecture, seems more suitable for this kind of application.

They first tried to speed up the networks by reducing the number of layers, but this was seen to decrease performance significantly; instead, the model uses two different streams, fovea and context, for high and low resolution respectively.

Results

First the results on the sports dataset are computed, and then the learned features as well as the network predictions are qualitatively analyzed.

Dataset

The dataset has one million YouTube videos of 487 different classes, arranged in a manually curated taxonomy with groups such as Team Games, Ball Games, and Winter Sports. Around 5% of the videos are annotated with more than one class, and the annotations are produced automatically by analyzing the text metadata surrounding each video.

The data is weakly labeled at two levels: first, the tag-predicting algorithm may fail, or the provided description of a video may not match the video itself; second, even a correctly classified video may still show inconsistency across the individual frames it contains.

The dataset is split as follows: 70% of the videos go to the training set, 10% to the validation set, and the remaining 20% to the test set.
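That 70/10/20 split can be sketched with a small helper (hypothetical code; the paper does not say how the split was drawn, so a seeded random shuffle is assumed here):

```python
import random

def split_dataset(video_ids, seed=0):
    # 70% training, 10% validation, 20% test, as described above.
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 100 200
```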

Also, to remove duplicate videos, a near-duplicate finding algorithm was deployed; it detected around 1,755 duplicates out of the one million videos, a very insignificant fraction.

Training

Each model was trained for about one month, with the full-frame networks processing about 5 clips per second and the multiresolution networks about 20 clips per second. These initial rates were roughly 20 times slower than desired, but comparable speed was reached by using 10-50 model replicas.

Video-Level Predictions

   

Lastly, predictions were made at the video level. To make them, 20 clips were randomly sampled from each video and each clip was passed through the network 4 times; the resulting predictions were averaged to produce accurate class probabilities per clip, and the clip-level predictions were in turn averaged over the entire duration of the video to produce the final result.
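The averaging scheme can be sketched as follows (a minimal sketch, assuming the classifier returns one probability vector per clip pass; the 20 sampled clips, 4 passes per clip, and 487 classes follow the paper, and the random probabilities below merely stand in for real network outputs):

```python
import numpy as np

def video_prediction(clip_pass_probs):
    """Average per-pass clip probabilities into one video-level prediction.

    `clip_pass_probs` has shape (n_clips, n_passes, n_classes), e.g.
    (20, 4, 487): 20 randomly sampled clips, each passed through the
    network 4 times (e.g. with different crops and flips).
    """
    probs = np.asarray(clip_pass_probs)
    clip_probs = probs.mean(axis=1)    # average the passes per clip
    return clip_probs.mean(axis=0)     # average the clips over the video

rng = np.random.default_rng(0)
raw = rng.random((20, 4, 487))         # stand-in network outputs
video_probs = video_prediction(raw)
print(video_probs.shape)               # (487,)
```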


Conclusions

In this study it was concluded that Convolutional Neural Network architectures are very much capable of learning insightful features from weakly labeled data, and that they can surpass feature-based methods in performance. These methods are also efficient at connecting the architecture with time: the runtime performance was improved significantly without compromising accuracy.

References

Andrej Karpathy, George Toderici, Sanket Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei. Large-scale Video Classification with Convolutional Neural Networks. Google Research. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42455.pdf
