1. Introduction
The number of video collections on social media sites such as YouTube and Facebook has been increasing rapidly in recent years, yet algorithms for the emotion analysis of these collections remain scarce. Extensive research effort has gone into recognizing the content conveyed by videos: while several algorithms address the recognition of semantics, far fewer attempts have been made to identify the emotions a video carries.
2. Problem Definition and Algorithm
2.1 Task Definition
In this project we propose a computational framework to predict the emotions exhibited by videos. We created a dataset drawn from various video-sharing sites and annotated it manually using a set of features extracted from the videos with certain algorithms. Exploring videos online is a common activity among the majority of Internet users; YouTube, the largest video-streaming service, stores billions of videos on its servers. Earlier studies show that previous video-categorization algorithms were dedicated to helping users search for videos according to their needs, but emotion was never a factor in these classification methods. Our ultimate aim in this project is to classify videos into seven emotion categories: happy, fear, sad, surprise, anger, disgust, and neutral.
FIGURE 1: CONVOLUTION NEURAL NETWORK
Using supervised deep learning methods, we categorize videos by their emotion. The results demonstrate the effectiveness of the proposed algorithm in classifying videos into their respective emotions.
2.2 Algorithm Definition
The main task of the project is to assign one of the seven emotions to video clips taken from video-streaming sites. The videos depict various emotions under realistic conditions, with variation in attributes such as pose and illumination. We use a convolutional neural network model built with deep learning techniques, focusing on a single modality.
We present our approach of learning several specialist models using deep learning techniques, each focusing on one modality. The convolutional neural network captures visual information from detected faces. We explored several methods for combining the hidden layers of these modalities into a single classifier and achieved considerably good accuracy with our single-modality classifier. Machine learning has made great progress in computer vision over the years: tasks such as object detection, classification, and image segmentation have reached above-human performance. Human emotion recognition, however, remains a challenging task in real time.
In this project, we use deep learning techniques to recognize human emotions in videos, and the theoretical background of these techniques is introduced. Informative content, including video frames, is extracted from the videos to predict emotions, and several models are trained to classify them. The convolutional network uses a sequential model to train on the dataset and saves its weights; this weight model is then used to recognize the emotions of images. A video-frame algorithm classifies the sequence of frames (the video), and we use OpenCV and Microsoft Cognitive Services to find faces within that sequence. The faces are then cropped, and from them a dataset is created to train the convolutional model we built. Finally, the accuracy of the models is reported, i.e. the emotion of the video is analyzed and displayed. The advantages and disadvantages of the model are discussed, along with possible improvements for the future. Among the extracted information, cropped faces are the most informative for classifying human emotions, and the convolutional model has an outstanding ability to extract image features from video, including facial features.
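The sequential convolutional model described above could be sketched as follows with Keras. The layer sizes, 48x48 grayscale input, and hyperparameters are illustrative assumptions, not the exact architecture used in this project.

```python
# Minimal sketch of a sequential CNN for 7-class emotion recognition.
# Architecture and hyperparameters are assumptions for illustration.
from tensorflow.keras import layers, models

NUM_EMOTIONS = 7  # happy, fear, sad, surprise, anger, disgust, neutral

def build_emotion_cnn(input_shape=(48, 48, 1)):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_emotion_cnn()
# After training, the weights can be saved and reloaded for inference:
# model.save_weights("emotion_weights.h5")
```

Saving only the weights (rather than the whole model) matches the workflow above, where a saved weight model is later loaded to classify cropped face images.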
3. Experimental Evaluation
3.1 Methodology
We used the standard train/test split method to evaluate our trained CNN model. Since we used two different datasets (MS FER2013 [1] and our own dataset [2]), we also compared their classification accuracies to obtain performance metrics. Our hypothesis on this problem is based on the two datasets, i.e. on the variance of the inputs.

FIGURE 2: FACE DETECTION USING OPENCV [2]

In the MS FER2013 dataset, all images were captured specifically for this purpose, the emotions were posed naturally by trained professionals, and the labels were assigned manually. These images have a static appearance and good pixel clarity, which lets them yield better accuracy in a standard evaluation. The dataset we created, on the other hand, consists of image frames taken from various Internet videos. These frames have some pixel problems (due to resizing) and a non-static appearance (the person in the image may appear blurred or shaky). Ultimately this is the right dataset for training our model, because we must analyze emotion in the same kind of Internet videos, but its accuracy is lower than FER2013's, probably because there is not an equal or sufficient number of images for each emotion we need to classify, and because of image-resolution problems from resizing. We also compared the results of classifying our custom image frames programmatically with the Vision API from Microsoft Azure (which detects the emotion of a person from an image). We wanted to verify that the models trained on our two datasets give the same results as one of the best currently active methods for classifying emotion in computer vision.
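The Azure Vision API returns a confidence score per emotion for each detected face; comparing its output against our classifier's single predicted label reduces to picking the dominant emotion. A minimal helper, with hypothetical score names, might look like this:

```python
# Hypothetical helper for the comparison above: reduce a dict of per-emotion
# confidence scores (as returned by an emotion-detection API) to the single
# dominant label, so it can be matched against our classifier's output.
def dominant_emotion(scores):
    """scores: dict mapping emotion name -> confidence in [0, 1]."""
    return max(scores, key=scores.get)

# Example with made-up scores:
azure_scores = {"happiness": 0.91, "neutral": 0.06, "sadness": 0.02,
                "anger": 0.01, "fear": 0.0, "surprise": 0.0, "disgust": 0.0}
print(dominant_emotion(azure_scores))  # -> happiness
```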
3.2 Results
The main goal of our project is to analyze the emotion of a given video using the trained weight models. We randomly selected a video from the Internet and analyzed it frame by frame with the trained weight model to detect emotions, which are then combined to give the percentage of each emotion in the video. This is done by first converting the video into an image sequence; each image is then passed through a set of functions that detect the face, crop the image to the face's bounding box, resize the crop, and convert it from RGB to grayscale. We were able to predict the emotions of the video with good accuracy.
FIGURE 3 : RESULTS OF FER TRAINING DATASET
Train loss: 0.6484  Train accuracy: 72.19%
Test loss: 0.8623   Test accuracy: 67.78%
FIGURE 4 : RESULTS OF OUR OWN DATASET
Train loss: 0.6092  Train accuracy: 77.20%
Test loss: 0.8386   Test accuracy: 71.96%
FIGURE 5 : EMOTION ANALYSIS OF A VIDEO
3.3 Discussion
A weakness of the method is that without a large enough dataset to train on, the model's accuracy suffers. The model also cannot classify certain emotions well, because the number of training images per emotion is small; we could not find a sufficient dataset, so the model has too few examples from which to learn the key features of each emotion. In addition, the videos to be analyzed contain both color and grayscale images. Since FER2013 and the dataset we created carry no color information, the model, and hence its saved weight parameters, cannot exploit color.
4. Related Work
Other work splits the task into three parts: pre-processing and data collection, feature extraction and classification, and finally evaluation [3]. An audio-SVM model extracts audio features with openSMILE as the feature extractor and classifies them with an SVM. A CNN-LSTM model trains the feature extractor to extract deep features from images and uses an LSTM to integrate the features and classify the emotion. Video-C3D takes face frames from videos as input, with the C3D model serving as both feature extractor and classifier.
That approach trains the model on static images from the dataset, so its accuracy is lower than that of the algorithm we created. Our method blends two steps: first the video is converted into frames of images, then faces are cropped from those frames and given as test data to the convolutional model we built. We achieved higher accuracy than their model, likely because we used different kinds of images taken from different videos and trained the CNN model on both static and non-static images.
5. Future Work
Emotion classification from videos is a challenging task that involves many sub-tasks, including face tracking, face detection, and face recognition; improvements in these sub-tasks will benefit the primary goal. Due to the limited resources and time available for this project, accuracy improvements and a real-time implementation were not achieved. An adequate dataset combined with facial recognition might improve the results. A real-time implementation, detecting emotion from live video, is possible and could power various applications, such as recommending content to users based on their emotions, reviewing a movie, or reviewing a lecture.
6. Conclusion
For the past few years, recommendations have been based on user activity such as browsing or location history, but this is gradually changing with the arrival of emotion-based recommendation systems, in which activities are recommended according to the user's emotion so that classification performance can be improved to a greater extent. The objective of this project is to develop a methodology and system that automatically recognizes the emotions people express. The system takes input images, recognizes facial expressions, classifies the user's emotion into one of several classes, and finally suggests the best activity to the user. The proposed system can run on any user's personal device, recognizing the user's emotion and tracking their activity so that it can make suggestions or reminders in support of digital well-being.
7. Bibliography
[1] FER2013 Dataset. https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
[2] BeastyReacts. https://www.youtube.com/channel/UChjUq7Hb1daBKfWEvE-rUEw
[3] Deep Learning of Human Emotion Recognition in Videos. http://www.diva-portal.org/smash/get/diva2:1174434/FULLTEXT01.pdf