
Essay: Maximizing Accuracy with LSTM Attention Model for Stage Based Context Learning




Stage Based Context Learning using LSTM Attention model

Sudhakar Reddy Peddinti

Master of Science in Computer Science – Data Science

University of Missouri-Kansas City

srxw3@mail.umkc.edu

Abstract—Advancements such as memory, attention, and progress in NLP have produced neural network models whose results match or exceed human accuracy on several tasks. In this paper, the attention mechanism is used strategically to design a stage-based, context-aware system that focuses only on what is needed, depending on the external parameters given as input. Video processing and question answering (QA) are taken as an example to demonstrate the capabilities of this architecture.

Keywords—Context building, Stage Based learning, Image QA, Video processing, LSTM Neural Networks, Attention Model

I. Introduction

Context awareness and context learning are at the heart of many AI-based applications, from mobile devices and smart watches to conversational speakers. Building context is complex and difficult because of the many uncertain factors gathered from the surroundings. Searching the parameter space of deep architectures is an even harder task, but learning algorithms that use the concept of memory have proven that it is possible to build context from the bits and pieces of information they carry. Unlike pure language-based QA systems, which have been studied extensively in the NLP community [2, 3, 4], image QA systems are designed to automatically answer natural language questions according to the content of a reference image.

We make use of the same concept, along with an attention mechanism, to achieve stage-based context learning. In this approach, the attention mechanism identifies the smaller models that hold the different pieces of information. As a proof of concept, we apply this approach to video QA. First, the video is processed to identify the key frames in it. Once the key frames are identified, a CNN encodes each of them. An LSTM parses the user's question, and an attention model identifies the CNN blocks that correlate with the information present in the question. A second stage of attention then identifies the critical information present in the previously selected frame.
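The two attention stages described above can be sketched in a few lines. This is a minimal dot-product-attention sketch, not the paper's actual implementation: the question and frame vectors below are stubs standing in for the LSTM question encoding and the CNN frame/region encodings.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Dot-product attention: score each key vector against the query,
    normalize the scores, and return the weighted sum plus the weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(dim)]
    return context, weights

# Stage 1: the question vector (stub for an LSTM encoding) attends
# over per-frame vectors (stubs for CNN key-frame encodings).
question_vec = [1.0, 0.0]
frame_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
frame_ctx, frame_w = attend(question_vec, frame_vecs)

# Stage 2: the frame-level context attends over region vectors of the
# selected frame to pick out the critical information.
region_vecs = [[0.5, 0.5], [1.0, 0.0]]
region_ctx, region_w = attend(frame_ctx, region_vecs)
```

The first frame, whose stub vector is most similar to the question vector, receives the largest weight; its context summary then drives the second, finer-grained attention pass.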

II. Related work

A. Image QA

First, image QA is closely related to image captioning. An earlier system extracted a high-level image feature vector from GoogLeNet and fed it into an LSTM to generate captions. The method proposed in [4] went one step further and used an attention mechanism in the caption generation process. Different from [4, 5], the approach proposed in [6] first used a CNN to detect words given the images, then used a maximum entropy language model to generate a list of caption candidates, and finally used a deep multimodal similarity model (DMSM) to re-rank the candidates. Instead of using an RNN or an LSTM, the DMSM uses a CNN to model the semantics of captions.

Unlike image captioning, in image QA the question is given, and the task is to learn the relevant visual and textual representations to infer the answer. To facilitate research on image QA, several data sets have been constructed, either through automatic generation based on image caption data or through human labeling of questions and answers for given images. Among them, the image QA data set in [7] is generated from the COCO caption data set: given a sentence that describes an image, the authors first used a parser to parse the sentence, then replaced a key word in the sentence with a question word, and that key word became the answer. In contrast, [8] created an image QA data set through human labeling.
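A toy illustration of the automatic-generation idea: the real pipeline in [7] relies on a syntactic parser to pick the key word, whereas this sketch simply assumes the key word is already known and substitutes it.

```python
def caption_to_qa(caption, key_word):
    """Turn a declarative caption into a (question, answer) pair by
    replacing a known key word with a question word. The key word
    selection itself would come from a parser in the real system."""
    question = caption.replace(key_word, "what", 1)
    question = question[0].upper() + question[1:] + "?"
    return question, key_word

q, a = caption_to_qa("two cats are sitting on the mat", "cats")
# q == "Two what are sitting on the mat?", a == "cats"
```

The answer vocabulary of such a data set is therefore bounded by the words that appear in the captions, which is one motivation for the human-labeled alternative in [8].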

B. Video Classification and Captioning

While video classification concentrates on automatically labeling video clips based on their semantic content, such as human actions or complex events, video captioning attempts to generate a complete and natural sentence that enriches the single label used in video classification and captures the most informative dynamics in the video. There have been several efforts to survey the literature on video content understanding. Most of the approaches surveyed in these works adopted hand-crafted features coupled with typical machine learning pipelines for action recognition and event detection.

III. Basic Deep Learning Modules

In this section, we briefly review basic deep learning modules that have been widely adopted in the literature for video analysis.

A. Convolutional Neural Networks (CNNs)

Inspired by the visual perception mechanisms of animals and by the McCulloch-Pitts model, Fukushima proposed the "neocognitron" in 1980, the first computational model to use local connectivity between neurons of a hierarchically transformed image. To obtain translational invariance, Fukushima applied neurons with the same parameters to patches of the previous layer at different locations; this model can therefore be considered the predecessor of the CNN. Further inspired by this idea, LeCun et al. designed and trained the modern CNN framework, LeNet-5, and obtained state-of-the-art performance on several pattern recognition data sets, e.g., handwritten character recognition. LeNet-5 has multiple layers and is trained with the back-propagation algorithm in an end-to-end formulation, i.e., it classifies visual patterns directly from raw images. However, limited by the scale of labeled training data and by computational power, LeNet-5 did not perform well on more complex problems such as large-scale image classification.

VGGNet has two versions, VGG16 and VGG19, which contain 16 and 19 layers respectively. VGGNet pushed the depth of the CNN architecture from 8 layers, as in AlexNet, to 16-19 layers, which largely improves its discriminative power. In addition, by using very small (3 × 3) convolutional filters, VGGNet is capable of capturing fine details in the input images.
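The appeal of stacking small 3 × 3 filters can be checked with simple receptive-field arithmetic: each stride-1 layer grows the receptive field by (k − 1), so stacks of 3 × 3 layers cover the same area as single larger filters while using fewer parameters.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of conv layers along one axis:
    each layer adds (k - 1) * jump, where jump is the product of
    the strides of the preceding layers."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convs see a 5x5 patch; three see 7x7 -- the same
# coverage as a single 5x5 or 7x7 filter, with two or three
# nonlinearities instead of one and fewer weights per channel pair
# (2 * 9 = 18 vs. 25, and 3 * 9 = 27 vs. 49).
two_stack = receptive_field([3, 3])      # 5
three_stack = receptive_field([3, 3, 3]) # 7
```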

GoogLeNet is inspired by the Hebbian principle with multi-scale processing and contains 22 layers. Its novel CNN architecture, commonly referred to as Inception, was proposed to increase both the depth and the width of the CNN while maintaining an affordable computational cost. There are several extensions of this work, including BN-Inception-V2, Inception-V3, and Inception-V4.

ResNet, one of the latest deep architectures, remarkably increased the depth of CNNs to 152 layers using deep residual layers with skip connections. ResNet won first place in the 2015 ImageNet Challenge and has since been extended to more than 1,000 layers on the CIFAR-10 dataset.
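The skip connection can be sketched in a few lines. This is a toy scalar-vector version, not a full convolutional block: the point is that the block computes F(x) + x, so if the learned transform F outputs zeros the block reduces to an identity map, which is what makes very deep stacks trainable.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def residual_block(x, transform):
    """y = relu(F(x) + x): the skip connection adds the input back to
    the transformed signal, letting the identity (and its gradient)
    bypass the transformation."""
    fx = transform(x)
    return relu([a + b for a, b in zip(fx, x)])

# With a transform that outputs zeros, the block passes its
# (non-negative) input through unchanged.
identity_out = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))
```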

B. Recurrent Neural Networks (RNNs)

The CNN architectures discussed above are all feed-forward neural networks whose connections do not form cycles, which makes them insufficient for sequence labeling. To better exploit the temporal information in sequential data, recurrent connection structures have been introduced, leading to the emergence of the RNN. Different from feed-forward neural networks, RNNs allow connections to form cycles, which enables a "memory" of previous inputs to persist in the network's internal state [20]. It has been shown that a finite-sized RNN with sigmoid activation functions can simulate a universal Turing machine.
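The "memory" of previous inputs can be seen in a minimal scalar RNN with hand-picked weights (a toy sketch, not a trained model): after the first nonzero input, the hidden state stays nonzero through later all-zero inputs because it feeds back into itself.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0, b=0.0):
    """One step of a scalar vanilla RNN: the new state mixes the
    previous state (the recurrent 'memory') with the current input."""
    return math.tanh(w_h * h + w_x * x + b)

h = 0.0
for x in [1.0, 0.0, 0.0]:
    h = rnn_step(h, x)
# h is still nonzero even though the last two inputs were zero:
# the first input persists (decaying) through the recurrent connection.
```

A feed-forward layer applied to each input independently would output exactly zero on the zero inputs; the recurrence is what carries information across time steps.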

References

[1] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

[2] J. Berant and P. Liang. Semantic parsing via paraphrasing. In Proceedings of ACL, volume 7, page 92, 2014.

[3] A. Bordes, S. Chopra, and J. Weston. Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676, 2014.

[4] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

[5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.

[6] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.

[7] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. arXiv preprint arXiv:1505.02074, 2015.

[8] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. arXiv preprint arXiv:1505.05612, 2015.
