
Essay: How RCNN Enhances Scene Labeling with Multi-Scale RNNs

Essay details and download:

  • Subject area(s): Sample essays
  • Reading time: 7 minutes
  • Price: Free download
  • Published: 1 April 2019*
  • Last Modified: 23 July 2024
  • File format: Text
  • Words: 1,913 (approx)
  • Number of pages: 8 (approx)




Abstract

Different from traditional convolutional neural networks (CNNs), this model has intra-layer recurrent connections in its convolutional layers, so that each convolutional layer becomes a two-dimensional recurrent neural network (RNN). The units receive constant feed-forward inputs from the previous layer and recurrent inputs from their neighborhoods. As the recurrent iterations proceed, the region of context captured by each unit expands. Feature extraction and context modulation are thus seamlessly integrated, in contrast with typical methods that entail separate modules for the two steps. A multi-scale RCNN is proposed; the deep recurrent convolutional neural network (RCNN) used for this task was originally proposed for object recognition. Over two benchmark datasets, SIFT Flow and Stanford Background, the model outperforms many state-of-the-art models in accuracy and efficiency. Scene labeling (or scene parsing) is an important step towards high-level image interpretation: it aims at fully parsing the input image by labeling the semantic category of each pixel. Compared with image classification, scene labeling is more challenging, as it simultaneously solves both segmentation and recognition. The Stanford Background dataset has 715 images of rural and urban scenes composed of 8 classes; the scenes have approximately 320 × 240 pixels. A 5-fold cross-validation is performed, with the dataset randomly split into 572 training images and 143 test images in each fold. SIFT Flow is a larger dataset composed of 2688 images of 256 × 256 pixels and 33 semantic labels.

Key Words – Convolutional Neural Networks, Recurrent Neural Network, Recurrent Convolutional Neural Network, Segmentation, Multi-scale RCNN, SIFT Flow Dataset

1. INTRODUCTION

Scene parsing has drawn increasing research interest due to its wide applications [3] in many attractive areas such as autonomous vehicles, robot navigation and virtual reality. It remains a challenging problem, since it requires solving segmentation, classification and detection simultaneously [1,4]. The RNN gains a strong discriminative capability [7]: it is superior by virtue of a better architecture that explicitly incorporates context information into the training process of multiple hidden layers, and it learns concepts at different levels of abstractness. The RNN effectively fuses the output features across different time steps for classification [5] or for a more concrete parsing purpose. To verify the effectiveness of the RNN, extensive experiments are conducted over five popular and challenging scene parsing datasets, including SIFT Flow; the RNN is capable of greatly enhancing the discriminative power of per-pixel feature representations. This work studies the scene parsing problem and proposes a novel recurrent neural network (RNN) for parsing scene images [2]. The RNN can enhance the capability of RNNs in modeling long-range context information at multiple levels and better distinguish pixels that are easy to confuse. Recurrent neural networks have been employed to model long-range context in images; for instance, one recurrent connection was built from the output to the input layer, and layer-wise self-recurrent connections were introduced.

Compared with those methods, the proposed RNN models the context by allowing multiple forms of recurrent connections. Recurrent neural networks are suitable for these tasks because long-range context information can be captured by a fixed number of recurrent weights. Treating scene labeling as a two-dimensional variant of sequence learning, RNNs can also be applied, but such studies are relatively scarce.

This type of RNN has been proposed before, but there it is used for object recognition [5]; it is unknown whether it is useful for scene labeling, a more challenging task. This motivates the present work. Multiscale recurrent neural networks have been considered a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of model can actually capture temporal dependencies by discovering the latent hierarchical structure of the sequence [7]. The multiscale approach, called the hierarchical multiscale recurrent neural network, can capture the latent hierarchical structure of a sequence by encoding temporal dependencies at different timescales using a novel update mechanism. The multiscale RNN model can learn this hierarchical multiscale structure from temporal data without explicit boundary information.

This model, called a hierarchical multiscale recurrent neural network (HM-RNN) [16], does not assign fixed update rates but adaptively determines proper update times corresponding to the different abstraction levels of the layers. The model tends to learn fine timescales for low-level layers and coarse timescales for high-level layers. A binary boundary detector is introduced at each layer; the boundary detector is turned on only at the time steps where a segment of the corresponding abstraction level has been completely processed.

2. RELATED WORKS

The scene parsing problem has been approached with a wide variety of methods in recent years. Many methods rely on MRFs, CRFs or other types of graphical models to ensure the consistency of the labeling and to account for context [19], [15], [26]. Most methods rely on a pre-segmentation into superpixels or other segment candidates, and extract features and categories from individual segments and from various combinations of neighboring segments; the graphical-model inference then pulls out the most consistent set of segments that covers the image. One proposed method aggregates segments in a greedy fashion using a trained scoring function. The originality of that approach is that the feature vector of the combination of two segments is computed from the feature vectors of the individual segments through a trainable function. Although deep learning methods are used to train the feature extractor, it operates on hand-engineered features. One of the main questions in scene parsing is how to take a wide context into account when making a local decision. [32] proposed to use the histogram of labels extracted at a coarse scale as input to the labeler that looks at finer scales. Our approach is somewhat simpler: our feature extractor is applied densely to an image pyramid, and the coarse feature maps thereby generated are upsampled to match the resolution of the finest scale.

With three scales, each feature vector has multiple fields that encode multiple regions of increasing size and decreasing resolution, all centered on the same pixel location. The first end-to-end neural network model for scene labeling is the deep CNN proposed in [11], trained by a supervised greedy learning strategy. In another end-to-end model, top-down recurrent connections are incorporated into a CNN to capture context information. In the first recurrent iteration, the CNN receives a raw patch and outputs a predicted label map, downsampled due to pooling; in subsequent iterations, the CNN receives both a downsampled patch and the label map predicted in the previous iteration, and outputs a new predicted label map. This approach is simple and elegant, but its performance is not the best on some benchmark datasets. It should be noted that the models in both [14] and [7] are called RCNN; for convenience, in what follows, RCNN refers to the present model unless otherwise specified.
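The pyramid-and-upsample scheme described above can be sketched as follows. The averaging downsampler, the nearest-neighbour upsampler and the dummy feature extractor are illustrative assumptions, not the exact operators used in the original work:

```python
import numpy as np

def downsample(img, factor=2):
    """Coarsen an image by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    return img[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(fmap, factor=2):
    """Nearest-neighbour upsampling back to a finer scale."""
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)

def multiscale_features(img, n_scales=3, extractor=lambda x: np.abs(x)):
    """Apply the same feature extractor at every scale of an image pyramid
    and upsample the coarse maps to the finest resolution."""
    maps, scale = [], img
    for s in range(n_scales):
        f = extractor(scale)
        for _ in range(s):          # bring coarse maps back to full size
            f = upsample(f)
        maps.append(f)
        scale = downsample(scale)
    return np.stack(maps)           # (n_scales, H, W): per-pixel multi-scale fields

img = np.random.rand(64, 64)
feats = multiscale_features(img)
```

Each pixel then carries one field per scale, encoding regions of increasing size and decreasing resolution centered on that location.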

Recurrent neural networks have been employed to model long-range context in images [19]. For instance, [16] built one recurrent connection from the output to the input layer and introduced layer-wise self-recurrent connections. Compared with those methods, the proposed RNN models the context by allowing multiple forms of recurrent connections, and in addition combines the output features at multiple time steps for pixel classification. [21] utilized a parallel multi-dimensional long short-term memory for fast volumetric segmentation, but the performance was relatively inferior. Based on similar motivations, [9] used RNNs to refine the features learned by a CNN by modeling the contextual dependencies along multiple spatial directions. The RNN thus incorporates context information into the feature learning process of CNNs.

3. PROBLEM AND DATASET DESCRIPTION

Problem Description

More formally, this report addresses the problem of finding the most likely class of a patch from an image [22]. This is defined statistically as

class* = argmax_class P(observation | class)

where the observation refers to the pixels that belong to a specific patch, and the class is the single label given to the whole patch. This is done for each patch in a given input image, and results in an approximate separation of the various segments of the image [30]. This approach does not take into account neighboring patches or other patches in the same image, and it also avoids using a prior of the form P(class) in the final classification.

We completely isolate the likelihood function and attempt to optimize it. Given an optimal likelihood function, a complete labeling method built on it in subsequent work is more likely to obtain better classification performance [34].
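The decision rule above can be sketched in a few lines. The per-pixel Gaussian likelihood model below is an illustrative assumption made only to have a concrete P(observation | class); note that no prior P(class) enters the decision:

```python
import numpy as np

def classify_patch(patch, class_means, class_vars):
    """Return argmax_c P(patch | c) under an assumed independent per-pixel
    Gaussian likelihood per class. Log-likelihoods are used for stability."""
    x = patch.ravel()
    log_liks = []
    for mu, var in zip(class_means, class_vars):
        ll = -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))
        log_liks.append(ll)
    return int(np.argmax(log_liks))

# two toy classes: dark patches (mean 0.1) vs bright patches (mean 0.9)
means = [0.1, 0.9]
variances = [0.05, 0.05]
bright = np.full((8, 8), 0.85)
label = classify_patch(bright, means, variances)
```

Running this over every patch of an image yields the approximate segment separation described above, one label per patch.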

3.2 DATASET USED

Experiments are performed over two benchmark datasets for scene labeling: SIFT Flow [16] and Stanford Background [28]. The SIFT Flow dataset contains 2688 color images, all of size 256 × 256 pixels; 2488 images are training data and the remaining 200 are testing data. There are 33 semantic categories, and the class frequencies are highly unbalanced. The first parameterized layer is a convolutional layer followed by a 2 × 2 non-overlapping max pooling layer, which reduces the size of the feature maps and thus saves computing cost and memory [28]. The other two parameterized layers are RCLs, with another 2 × 2 max pooling layer placed between them. The numbers of feature maps in these layers are 32, 64 and 128. The filter size in the first convolutional layer [21] is 7 × 7, and the feed-forward and recurrent filters in the RCLs are all 3 × 3. Three scales of images are used, and neighboring scales differ by a factor of 2 on each side of the image. For the SIFT Flow dataset, the hyper-parameters are determined on a separate validation set.
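The layer sizes quoted above (a 7 × 7 convolution with 32 maps, 2 × 2 pooling, then two RCLs with 64 and 128 maps and another 2 × 2 pooling between them) imply the following feature-map shapes on a 256 × 256 input; the "same"-padding convention for the convolutions is an assumption:

```python
def feature_map_shapes(h, w, layers):
    """Trace spatial sizes through the pipeline, assuming 'same'-padded
    convolutions (spatial size unchanged) and non-overlapping 2x2 pooling."""
    shapes = []
    for name, n_maps, pool_after in layers:
        shapes.append((name, n_maps, h, w))  # conv/RCL output at current size
        if pool_after:
            h, w = h // 2, w // 2            # 2x2 non-overlapping max pooling
    return shapes

# conv layer then pool, RCL1 then pool (the pooling between the two RCLs), RCL2
pipeline = [("conv7x7", 32, True), ("RCL1", 64, True), ("RCL2", 128, False)]
shapes = feature_map_shapes(256, 256, pipeline)
```

Under these assumptions the final RCL operates on 64 × 64 maps, a quarter of the input resolution on each side.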

    

4. METHODOLOGY

The proposed RCNN [31] was tested on several benchmark object recognition datasets. With fewer parameters, RCNN achieved better results than state-of-the-art CNNs over all of these datasets [25], which validates the advantage of RCNN over CNN [14]. The recurrent neural network has a long history in the artificial neural network community, but most successful applications concern the modeling of sequential data [11]. A hierarchical RNN called the Neural Abstraction Pyramid (NAP) was proposed for image processing: NAP is a biology-inspired architecture with both vertical and lateral recurrent connectivity, through which the image interpretation is gradually refined to resolve visual ambiguities.

4.1 RCNN

The key module of the RCNN is the RCL. A generic RNN [20,35] with feed-forward input u(t), internal state x(t) and parameters θ can be described by:

x(t) = F(u(t), x(t − 1), θ)   (1)

where F is the function describing the dynamic behavior of the RNN. The RCL introduces recurrent connections into a convolutional layer; it can be regarded as a special two-dimensional RNN in which the feed-forward and recurrent computations both take the form of convolution:

x_ijk(t) = σ((w_k^f)⊤ u^(i,j)(t) + (w_k^r)⊤ x^(i,j)(t − 1) + b_k)   (2)

where u^(i,j) and x^(i,j) are vectorized square patches centered at (i, j) of the feature maps of the previous layer and the current layer, w_k^f and w_k^r are the feed-forward and recurrent weights for the kth feature map, and b_k is the kth element of the bias. The function σ is composed of two functions, σ(z_ijk) = h(g(z_ijk)), where g is the widely used rectified linear function g(z_ijk) = max(z_ijk, 0) and h is local response normalization (LRN):
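Equation (2) can be sketched for a single feature map with a naive numpy convolution; the zero padding and the use of ReLU alone as σ (omitting the LRN term defined next) are simplifying assumptions:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same' 2-D correlation with zero padding."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

def rcl_step(u, x_prev, w_f, w_r, b):
    """One RCL iteration: x(t) = g(w_f * u + w_r * x(t-1) + b), g = ReLU."""
    z = conv2d_same(u, w_f) + conv2d_same(x_prev, w_r) + b
    return np.maximum(z, 0.0)

def rcl_unfold(u, w_f, w_r, b, T=3):
    """Unfold the RCL for T time steps; x(0) uses the feed-forward input only."""
    x = np.maximum(conv2d_same(u, w_f) + b, 0.0)
    for _ in range(T):
        x = rcl_step(u, x, w_f, w_r, b)
    return x

rng = np.random.default_rng(0)
u = rng.random((8, 8))
w_f, w_r = rng.random((3, 3)) * 0.1, rng.random((3, 3)) * 0.1
x = rcl_unfold(u, w_f, w_r, b=0.0, T=3)
```

Note that the feed-forward input u stays constant across iterations, while the recurrent term repeatedly re-convolves the evolving state x.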

h(g(z_ijk)) = g(z_ijk) / (1 + (α/L) ∑_{k'=max(0, k−L/2)}^{min(K, k+L/2)} g(z_ijk')²)^β   (3)

where K is the number of feature maps and α and β are constants controlling the amplitude of normalization. The LRN forces units at the same location to compete for high activities, mimicking lateral inhibition in the cortex. In our experiments, LRN is found to consistently improve the accuracy, though slightly. Following [11], α and β are set to 0.001 and 0.75, respectively, and L is set to K/8 + 1. During the training or testing phase, an RCL is unfolded for T time steps into a multi-layer subnetwork, where T is a predetermined hyper-parameter; here T = 3.
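Equation (3) can be sketched in numpy over the feature-map axis; the translation of the inclusive max/min summation limits into Python slice bounds is the only assumption:

```python
import numpy as np

def lrn(g, alpha=0.001, beta=0.75):
    """Local response normalization across K feature maps.
    g has shape (K, H, W); window size L = K // 8 + 1 as in the text."""
    K = g.shape[0]
    L = K // 8 + 1
    out = np.empty_like(g, dtype=float)
    for k in range(K):
        lo = max(0, k - L // 2)            # max(0, k - L/2)
        hi = min(K, k + L // 2 + 1)        # min(K, k + L/2), exclusive slice end
        denom = (1.0 + (alpha / L) * np.sum(g[lo:hi] ** 2, axis=0)) ** beta
        out[k] = g[k] / denom
    return out

g = np.random.rand(16, 4, 4)               # already non-negative, as after ReLU
h = lrn(g)
```

Since the denominator is always at least 1, LRN can only shrink activations, which is what makes co-located units compete.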

The receptive field (RF) of each unit expands with larger T, so that more context information is captured; the depth of the subnetwork also increases.
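With the 3 × 3 filters used here, the RF side length grows linearly in T. A hedged helper, under the assumption that each recurrent convolution enlarges the RF by (k − 1) pixels:

```python
def rcl_receptive_field(k_ff=3, k_rec=3, T=3):
    """Side length of the receptive field of an RCL unit after T recurrent
    iterations: the feed-forward filter contributes k_ff, and each recurrent
    convolution adds (k_rec - 1) pixels."""
    return k_ff + T * (k_rec - 1)

rf = rcl_receptive_field()  # 3 + 3 * 2 = 9 for the T = 3 setting above
```

So with T = 3 a unit already sees a 9 × 9 neighborhood of its layer input, before accounting for the pooling layers, which multiply the RF further in input-image coordinates.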

About this essay:

If you use part of this page in your own work, you need to provide a citation, as follows:

Essay Sauce, How RCNN Enhances Scene Labeling with Multi-Scale RNNs. Available from:<https://www.essaysauce.com/sample-essays/2017-4-25-1493098459/> [Accessed 13-04-26].


* This essay may have been previously published on EssaySauce.com and/or Essay.uk.com at an earlier date than indicated.