Automated Censoring of Cigarettes in Videos using Deep Learning Techniques

Swapnil Dhanwal1, Vishnu Bhaskar1, Tanya Agarwal1

1 Netaji Subhas University of Technology (formerly Netaji Subhas Institute of Technology), Delhi, India, 110078

{swapnild.co, vishnub.co, tanyaa.co}@nsit.net.in

Abstract. Studies have shown that exposure to videos containing instances of cigarette smoking makes subsequent smoking behaviour more likely, especially among adolescents and young adults. Censoring cigarettes in videos has therefore become an urgent task, and automation can make the otherwise tedious process less cumbersome and more efficient. In this paper, we propose an approach to this task using concepts of Deep Learning and Computer Vision. We manually developed a dataset by acquiring images from multiple online sources and then segmenting, cleaning and augmenting them. Features extracted from these images using the convolutional layers of the Inception V3 model are used to train a neural network. This model detects cigarettes with an accuracy of 89.99%. Finally, censoring is carried out by generating predictions on video frames using the aforementioned model.

Keywords: Deep Learning, Computer Vision, Transfer Learning, Convolutional Neural Networks, Segmentation, Video Censoring

1   Introduction

Videos are recordings of moving visual images that convey some information or may just provide entertainment. However, they may contain age-inappropriate or harmful content that needs to be censored. Censoring is the act of examining a book, a video, a film, et cetera and suppressing objectionable or harmful content from it. Video censoring, therefore, is a process that consists of watching an entire video, detecting and locating the object to be censored (referred to as the target object) in each frame and finally censoring the target object. The entire process can become cumbersome, tedious and drawn out.

The purpose of automating a process is to improve efficiency and reduce human effort – the only human effort required in such a process lies in its development and maintenance. Hence, automation of video censoring can possibly lead to a more efficient and accurate outcome. This involves extracting frames from the video, calculating and processing their segments and running them through an object classifier and localiser or object detector to locate the target object.

Studies [1] have shown that exposure to media containing cigarettes can have profound effects on people, especially the young, and can lead to the development of smoking habits in the future. Smoking at young ages can have dire consequences in adulthood such as respiratory and cardiovascular diseases. This habit is injurious to not only the smoker, but also to the people in the vicinity. Motivated to remediate this, we have tried to ameliorate the process of censoring cigarettes in videos, by automating an otherwise manual effort. This poses many challenges.

Firstly, there are challenges associated with object detection itself – such as deformation, variation in lighting and aspect-ratio, and occlusion. These challenges are compounded by having a cigarette as the target object. Unlike objects such as cars and animals, cigarettes can appear in many different sizes, orientations and aspect-ratios in an image. Differences in illumination can confuse even human beings as to the identity of a cigarette. Consider an unlit cigarette placed on a table: it is difficult for a human being to differentiate between it and a piece of chalk, let alone for a deep neural network to do so. The second challenge is the lack of an available cigarette image dataset. The ImageNet dataset does contain images of cigarette butts, about 1,300 of them, but since we want to detect all parts of cigarettes, it is insufficient for our purposes. Thus, data need to be collected, cleaned and processed manually. This lack of data also necessitates the use of a pre-trained model. Another challenging aspect is performance – deep neural networks require considerable resources to train and run. Training and prediction can be made faster by using a GPU, but sequential tasks such as segmentation are CPU-bound. It is thus necessary to intelligently extract those portions of the image which are likely to contain cigarettes, prior to censoring.

In the following sections, we describe in detail the work related to our research, our methodology, our experiments, our results and the future scope for improvement. Through the approach proposed in Section 3, we hope to arrive at a suitable model for detecting and censoring the target object, cigarettes.

2   Related Work

Several researchers have worked in the fields of object classification, localisation, image segmentation and the development of fast techniques for object detection. There have also been works that aim to detect the presence of cigarettes or smoking events in a scene – for instance, in images or video-feed from a CCTV camera. In our research, however, we found that Deep Learning techniques have not been exploited for this task.

Several works attempt the detection of smoking events in a scene. For instance, Harikrishnan et al. [2] proposed a method to detect smoking events by using a combination of sensors to detect smoke and facial detection to detect the smoker. Here, facial detection was performed by using Discrete Wavelet Transform [3]. mPuff, a hardware approach proposed by Amin Ahsan Ali et al. [4], classified respiratory recordings as smoking-puffs or regular breathing. Kentaro Iwamoto et al. [5] investigated smoke detection from captured image sequences. Their system proposes to address estimation of candidate areas of smoke and detection of smoke in the scene. Kavitha et al. [6] proposed a technique in which, initially, moving objects are detected in the video frame. Then, template matching is used to detect cigarettes and the Haar classifier is used to detect smokers' faces. The Inception family of deep CNNs proposed by Sergey Ioffe et al. [7] and Szegedy et al. [8] also possesses the ability to detect cigarette butts – a class of the ImageNet dataset containing 1,300 images.

A big challenge in object detection is localisation. This entails segmenting the image into smaller parts and passing these segments through the CNN to get predictions. A naive method would use a sliding-window approach to extract image segments; however, this would be very inefficient. Several researchers have worked on improving the performance of object detection by using specialised segmentation techniques. Chen, C. et al. [9] described R-CNN, or Region-based CNN, which simplifies the problem of selecting regions in conventional CNN pipelines by extracting roughly 2,000 region-proposals per image to classify. To make these region-proposals, it uses Selective Search [10] – an algorithm which creates larger regions by combining perceptually similar, smaller regions in a bottom-up manner. Shaoqing Ren et al. [11], in Faster R-CNN, introduced the concept of Region Proposal Networks (RPNs), which eliminate much of the cost of calculating region-proposals by sharing features with the main CNN. A modified version of Faster R-CNN was employed to speed up the detection of small objects in remote sensing applications by Yun Ren et al. [12]. One of the most revolutionary papers in the field of fast object detection was 'You Only Look Once (YOLO)' by Joseph Redmon et al. [13]. In this method, instead of using a sliding-window approach or region-proposals, the dataset images are divided into grids, with the target objects annotated by bounding boxes. The deep neural network then predicts two things – bounding boxes and the class of the detected object. Such models are aptly called single-shot detectors because they perform classification and localisation simultaneously. Wei Liu et al. [14] proposed SSD, a single-shot detector faster than YOLO. It works by fixing boxes of various sizes and aspect ratios and then, at test time, scoring these boxes for object presence.
In this paper, we have attempted to apply the collective knowledge and inspiration gained from the above works to create a model that can detect and censor cigarettes in videos.

3   Proposed Approach

We now describe the steps undertaken by us in creating our model, which is divided into two parts – the classifier and the censor. As shown in Fig. 1, initially, data collection and cleaning were performed by downloading images from various online sources. Then, to balance the skewed dataset, data augmentation techniques were applied and synthesised images were incorporated into our training set. The next step was to perform transfer learning by building a model on top of Inception V3. This involved using the Inception V3 model for feature extraction and using the extracted features to train a fully connected neural network – forming the basis of the classifier.

The second part of our model, the censor, used Selective Search to calculate region-proposals. These region-proposals were fed into the classifier and tagged according to the generated predictions. Finally, the tags were used to censor the video. This process is described in detail in the following subsections.

Fig. 1. Methodology

3.1   Data Collection

Generally, the performance of neural networks keeps improving as the dataset grows, whereas traditional learning algorithms tend to plateau. However, in cases where transfer learning is used, smaller datasets can suffice. To account for the lack of images of cigarettes in the ImageNet dataset, we created our own dataset using images downloaded from Google Images, Bing Images, Getty Images and DuckDuckGo.

After downloading, the images were segmented into square crops of the following sizes (in pixels):

1. 200 x 200

2. 300 x 300

3. 400 x 400

4. 500 x 500

5. 600 x 600

6. 700 x 700

7. 800 x 800

8. 900 x 900

With the following overlaps:

1. 1-3: 50% Overlap

2. 4-6: 25% Overlap

3. 7-8: No Overlap

Applying the above transformations yielded a coarse dataset of 105,000 images, which reduced to about 70,000 after cleaning. These were used to create the training set, the development set and the positive class of the test set. The negative class of the test set was populated by randomly sampling the ImageNet dataset.
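The crop grid implied by the size and overlap scheme above can be sketched in pure Python; the function names and structure are our own.

```python
def crop_boxes(width, height, window, overlap):
    """Return (left, top, right, bottom) boxes tiling an image with a
    square window of side `window` and the given fractional overlap."""
    step = max(1, int(window * (1 - overlap)))  # stride between windows
    boxes = []
    for top in range(0, max(height - window, 0) + 1, step):
        for left in range(0, max(width - window, 0) + 1, step):
            boxes.append((left, top, left + window, top + window))
    return boxes


# Size/overlap pairs from the paper: 50% overlap for the 200-400 px windows,
# 25% for 500-700 px, and no overlap for 800-900 px.
SCHEME = [(200, 0.5), (300, 0.5), (400, 0.5), (500, 0.25),
          (600, 0.25), (700, 0.25), (800, 0.0), (900, 0.0)]


def segment(width, height):
    """All crop boxes for an image, over every window size that fits."""
    boxes = []
    for window, overlap in SCHEME:
        if window <= width and window <= height:
            boxes.extend(crop_boxes(width, height, window, overlap))
    return boxes
```

For a 400 x 400 image, the 200 px window at 50% overlap yields a 3 x 3 grid of nine crops.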

3.2   Data Cleaning

Data cleaning can be defined as the process of detecting and correcting corrupt or inaccurate records in a database, i.e. identifying and replacing dirty or coarse data. It may be performed interactively with data-wrangling tools or as batch processing through scripting. We manually segregated the collected images into three categories. Here, we do not use the words ‘category’ and ‘class’ interchangeably: ‘category’ refers to the descriptions assigned to the data by us, while ‘class’ refers to the labels used for training the model.

The collected data conform to the following categories:

1. Cigarette with coloured background

2. Cigarette with white background

3. Not a cigarette

The first two categories were used for data-augmentation purposes, as explained in the next subsection. Positive and negative classes were used for training. However, since the dataset was severely skewed in favour of the negative-class samples, the data-augmentation techniques described below were applied.

3.3   Data Augmentation

As explained previously, the training set was skewed in favour of images belonging to the negative class. To remedy this problem, more images belonging to the positive class needed to be obtained. Fig. 2 shows some samples from our augmented dataset.

Rotation. Images belonging to the positive class were rotated about their centre in successive 30-degree steps, yielding a total of 12 images each.
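A minimal sketch of this rotation step, using Pillow (our tooling choice; the paper does not name its own):

```python
from PIL import Image


def rotations(image, step_degrees=30):
    """Return copies of `image` rotated about its centre in steps of
    `step_degrees`, covering a full revolution (12 images at 30 degrees)."""
    angles = range(0, 360, step_degrees)
    # expand=False keeps the original canvas size, which fixed-size
    # training crops require
    return [image.rotate(angle, expand=False) for angle in angles]
```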

Colour filtering. A histogram of colour values of the images belonging to the ‘cigarette with white background’ category was generated in order to isolate the colours of cigarettes. Then, pixels whose colours occurred less often than a predefined threshold frequency were set to white. However, this approach produced mixed results – some of the resulting images were noisy and lossy. In some cases, cigarettes were isolated perfectly, so that they resembled images from the ‘cigarette with white background’ category. In others, the loss of information and the generated noise were too great for the image to be classified as positive. Different values of the threshold frequency were experimented with and, finally, about 12,000 images were successfully filtered and moved to the ‘cigarette with white background’ category.
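The histogram-based filtering can be sketched in NumPy as follows; the colour-quantisation granularity (`bins`) and the threshold are illustrative choices, not values from the paper.

```python
import numpy as np


def filter_rare_colours(image, threshold, bins=8):
    """Set pixels whose quantised colour occurs fewer than `threshold`
    times to white, keeping only the dominant (cigarette) colours."""
    img = image.copy()
    # Coarse colour bucket per pixel: each RGB channel reduced to `bins` levels
    quantised = (img // (256 // bins)).astype(np.int64)
    flat = (quantised[..., 0] * bins + quantised[..., 1]) * bins + quantised[..., 2]
    counts = np.bincount(flat.ravel(), minlength=bins ** 3)
    rare = counts[flat] < threshold  # mask of infrequently occurring colours
    img[rare] = 255                  # paint them white
    return img
```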

Synthesising images. To further remedy the problem of a skewed dataset, a small set of images – synthesised by superimposing cigarettes on different textures – was generated.
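The superimposition step can be sketched as a per-pixel alpha blend; the helper name and mask handling are our own assumptions.

```python
import numpy as np


def superimpose(texture, cutout, alpha_mask, top, left):
    """Paste `cutout` onto `texture` at (top, left), blending with the
    per-pixel `alpha_mask` (1.0 where the cigarette cut-out is opaque)."""
    result = texture.astype(np.float32)
    h, w = cutout.shape[:2]
    region = result[top:top + h, left:left + w]
    alpha = alpha_mask[..., None]  # broadcast the mask over RGB channels
    region[:] = alpha * cutout + (1.0 - alpha) * region
    return result.astype(np.uint8)
```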

Fig. 2. The top and bottom rows show positive and negative samples respectively

3.4   Classifier Architecture

The Inception V3 model is a deep Convolutional Neural Network trained on the ImageNet dataset [15]. ImageNet contains over 14 million images; the 1,000-class subset used in the ILSVRC ranges from 'cat' and 'dog' to 'dishwasher' and 'plane'.

The Inception V3 model was chosen for three reasons. Firstly, the lack of a public dataset of cigarette images meant we had little data to work with; hence, some form of transfer learning was necessary. Secondly, it was the runner-up in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015. Thirdly, the ImageNet dataset contains approximately 1,300 images of cigarette butts, which would give our model some intuition about the shape, colour and texture of cigarettes from the get-go. Our proposed model consists of all the layers of the Inception V3 model except the fully connected layers, which are replaced by a combination of dense and dropout layers used to generate the final prediction, as shown in Fig. 3. Images scaled to 150 x 150 pixels were input into the Inception V3 convolutional layers and the extracted features were used to train the fully connected layers.
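The transfer-learning setup just described can be sketched in Keras roughly as follows. The layer widths, dropout rate and use of `Flatten` are illustrative assumptions (the paper does not report them), and `weights=None` is used so the sketch builds offline; the actual pipeline would load the pre-trained ImageNet weights (`weights="imagenet"`).

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import InceptionV3

# Convolutional base of Inception V3 on 150 x 150 RGB inputs; frozen so
# only the new fully connected head is trained.
base = InceptionV3(include_top=False, weights=None, input_shape=(150, 150, 3))
base.trainable = False

# Replace the original fully connected layers with alternating dense and
# dropout layers, ending in a single sigmoid unit (cigarette or not).
x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)   # width is illustrative
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)

model = Model(base.input, output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```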

Fig. 3. Classifier architecture

3.5   Video Censoring

This process was composed of region-proposal generation, prediction, tagging and finally, video-frame censoring. Region-proposals can be defined as the perceptually important portions of an image i.e. they are more likely to represent objects. To optimise performance, it made sense to run the model on a subset of all possible image segments (i.e. region-proposals) instead of segments generated through a sliding-window approach.

The algorithm used for generating region-proposals in this paper is Selective Search. Selective Search starts by over-segmenting the image using the graph-based segmentation algorithm described by Felzenszwalb et al. [16]. Many of these fundamental segments can collectively represent individual objects. Selective Search is able to combine these smaller segments into larger ones based on similarity, in a bottom-up manner. The size, scale and number of segments can be controlled using the corresponding parameters to the algorithm. The output is a set of principal segments called region-proposals.
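As a simplified sketch of the proposal step: Selective Search begins from the Felzenszwalb over-segmentation and merges segments bottom-up; the merging is omitted here, SciPy's connected-component labelling stands in for the over-segmentation, and the function and parameter names are ours.

```python
import numpy as np
from scipy import ndimage


def proposals_from_segments(label_map, min_area=4):
    """Turn a segmentation label map (integer labels 1..n) into candidate
    bounding boxes (left, top, right, bottom), dropping tiny segments."""
    boxes = []
    for sl in ndimage.find_objects(label_map):
        if sl is None:  # label absent from the map
            continue
        rows, cols = sl
        area = (rows.stop - rows.start) * (cols.stop - cols.start)
        if area >= min_area:
            boxes.append((cols.start, rows.start, cols.stop, rows.stop))
    return boxes
```

Each box would then be cropped from the frame and passed to the classifier.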

The region-proposals thus obtained were tagged with the frame-number and frame-location before being input to the model for prediction. Non-max suppression was applied on the positively predicted regions to reduce the number of overlapping bounding boxes.
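Non-max suppression can be sketched as the standard greedy algorithm; the IoU threshold of 0.5 is an illustrative default, not a value reported in the paper.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop remaining boxes that
    overlap it beyond `iou_threshold`, and repeat.
    Boxes are (left, top, right, bottom); returns kept indices."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # intersection of the best box with each remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]
    return keep
```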

4   Experimentation and Results

Using Google Colab as our development environment, we created and experimented with different model architectures (referred to as the ‘precursors’) and hyper-parameter settings. We refer to the final model used for censoring as the ‘conclusive model’.

Precursors to the conclusive model had the following shortcomings:

1. Misclassified objects with similar colour and shape as cigarettes

2. Misclassified text, smoke, water, skin and wood textures as cigarettes

3. Misclassified small objects as cigarettes – likely due to their similarity with cigarette butts

Corrective measures included adding images of smoke and water textures to the negative class. With these measures, the conclusive model achieved a test-set accuracy of 89.99%. The test set was created by taking a random sample of ImageNet for the negative class and interspersing it with positive samples from our own data; images from the cigarette-butt class of ImageNet were also added to the positive class.

The fully connected layers of the conclusive model comprised two dense and two dropout layers in an alternating manner; the model used the Adam [17] optimiser with binary cross-entropy as the loss function and was trained for 100 epochs. The conclusive model’s performance metrics and confusion matrix are shown in Table 1 and Table 2, respectively.

Table 1.  Characteristics of the conclusive model

Parameter            Value
Training samples     20,000
Validation samples   10,000
Testing samples      38,000
Training accuracy    95.74%
Testing accuracy     89.99%
Precision            97.03%
Recall               84.12%
F1 score             90.01%

Content with the above results, we chose the conclusive model to perform video-censoring. With our current approach, censoring a one-minute video took four hours – a figure we hope to decrease in the future. Our findings translated well to images drawn from dissimilar distributions. For instance, in videos taken from YouTube, we observed high precision in detecting cigarettes, though there were still some misclassifications (as evident from the slightly lower recall metric). Although every cigarette was correctly censored, our model would occasionally misclassify small, cylindrical objects as cigarettes. There were also some false negatives; however, these did not escape the censor, since the smaller regions they were composed of had already been detected.

Table 2. Confusion matrix for the test set

               Predicted: true           Predicted: false
Actual: true   True positives: 18,559    False negatives: 3,503
Actual: false  False positives: 568      True negatives: 15,624
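The precision and recall in Table 1 follow directly from these counts; a short sketch recomputing them (the F1 recomputed this way lands a fraction of a percent from the tabulated 90.01%, presumably because it is derived from the rounded precision and recall).

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Table 2
precision, recall, f1 = prf(tp=18559, fp=568, fn=3503)
```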

5   Conclusion and Future Work

Exposure to media depicting cigarette consumption can have harmful effects on young viewers; hence, censoring such content is of paramount importance. Video censoring, a tedious process in which the entire video needs to be examined manually, can be replaced by an automated process using Deep Learning. This is what we aimed for, and we have achieved 89.99% accuracy with our proposed model. Our model is able to censor videos containing cigarettes that are visually distinguishable from other objects, with acceptable recall.

Future work lies in making cigarette censoring a real-time process. This involves creating an annotated dataset – complete with bounding boxes – to feed single-shot detectors like YOLO. Our model also has other limitations. It misclassifies some objects of similar shape, colour and texture as cigarettes, and its speed precludes it from serving as a real-time censor. Further, it fails when the cigarette is so small as to be indistinguishable from other tiny objects. In the future, we aim to remove these limitations by gathering more data, performing multi-label classification and using a single-shot detector. In conclusion, we believe that we have taken a step in the right direction and hope that our work can inspire other researchers to make media safer for younger generations.

References

1. Todd F. Heatherton and James D. Sargent: Does Watching Smoking in Movies Promote Teenage Smoking (2010)

2. Harikrishnan K., Akhilkrishna P., Bijo Varghese, Shylesh S., Vivekanand Chandrasekhar, Munna Basil Mathai: Smoke Detection Captured from Image Features (2015)

3. M.J. Shensa: The discrete wavelet transform: wedding the à trous and Mallat algorithms (1992)

4. Amin Ahsan Ali, Syed Monowar Hossain, Karen Hovsepian, Md. Mahbubur Rahman, Kurt Plarre, Santosh Kumar: mPuff: Automated detection of cigarette smoking puffs from respiration measurements (2012)

5. Kentaro Iwamoto, Hironori Inoue, Toru Matsubara, Toshihisa Tanaka: Cigarette smoke detection from captured image sequences (2010)

6. Kavitha, Chandan S., Gayathri Devi M., Pradeep Kumar, Reshma: Automated System for Smoking Detection (2015)

7. Sergey Ioffe, Christian Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015)

8. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna:  Rethinking the Inception Architecture for Computer Vision (2016)

9. Chen, C., Liu, M.-Y., Tuzel, O., Xiao, J.: R-CNN for Small Object Detection (2016)

10. J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders: Selective Search for Object Recognition (2012)

11. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016)

12. Yun Ren, Changren Zhu and Shunping Xiao: Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN (2018)

13. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: You Only Look Once: Unified, Real-Time Object Detection (2016)

14. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg: SSD: Single Shot MultiBox Detector (2015)

15. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li Fei-Fei: ImageNet Large Scale Visual Recognition Challenge (2015)

16. Pedro F. Felzenszwalb, Daniel P. Huttenlocher: Efficient Graph-Based Image Segmentation (2003)

17. Diederik P. Kingma, Jimmy Ba:  Adam: A Method for Stochastic Optimization (2014)
