Hierarchical attention-based CNN for deceptive spam review detection
Yuxin Liu, Li Wang, Miguel Kakanakou Stephane
College of Computer Science and Technology, Taiyuan University of Technology, Shanxi, China
Abstract
Deceptive spam reviews affect consumers' purchase decisions and merchants' decision-making, and thus seriously undermine fair competition in the online market. Existing measures to detect deceptive spam reviews mainly focus on designing features from the perspective of linguistic and psychological cues, which hardly reveal the latent semantic information of the reviews. We propose a hierarchical attention-based neural network model to learn document representations. The model applies attention pooling from sentence representations to the document representation to extract the most significant and comprehensive information of a review. An intermediate document representation produced by a bidirectional long short-term memory network serves as a reference for the local document representations generated by the convolution layer to obtain attention weights. The document representation is then formed by combining the local representations with the obtained attention weights. Experimental results on a public, gold-standard spam review dataset show that the proposed model achieves higher accuracy than state-of-the-art approaches.
Key words: deceptive spam review, representation learning, attention mechanism, convolutional neural network (CNN), long short-term memory (LSTM)
1. Introduction
With the development of the internet, people have become more willing to express their thoughts on websites and share them with millions of people [1]. By 2016, the American review site Yelp had more than 108 million reviews (https://www.yelp.com/about), with annual growth of more than 0.18 million reviews [2]. However, fake reviews account for about 14-20% of reviews on Yelp, and for about 2-6% on TripAdvisor, Orbitz, Priceline and Expedia [3,4]. Customers can learn about the performance, quality and user experience of products or services by reading reviews, which helps them determine whether a product meets their needs and make the right purchase decision. A 2011 Cone survey in the United States (http://www.conecomm.com/contentmgr/showdetails.php/id/4008) states that 64% of users learned product information by reading relevant reviews, 87% of users made a purchase decision after reading affirmative comments, and 80% gave up a purchase after reading negative comments. Review information thus guides people's purchase behavior; positive reviews can produce large economic benefits and good prestige for individuals and businesses, which in turn motivates the production of deceptive spam reviews [5,6,7,8].
Spam reviews can be posted proactively or reactively. For example, a merchant or brand may, on the one hand, post (or hire people to post) deceptive positive reviews to improve the reputation of its own products on the network; on the other hand, it may post negative reviews targeting rivals in order to reduce competitors' prestige and profit from it. These fake reviews not only seriously distort normal competition in the online market, but also injure the rights of consumers and businesses [9]. Therefore, opinion mining techniques are used to help businesses analyze customers' opinions on offered products, and to detect and filter spam reviews so as to present truthful reviews to purchasers [10]. However, research in this area is not adequate and many critical problems related to spam detection remain unsolved [9].
A deceptive opinion spam is a fictitious opinion deliberately written to sound authentic [11,12,13]; in other words, it is a fake review maliciously released by merchants or users, driven by some interest, to mislead potential customers. It is noteworthy that the quality of a review is influenced by various factors such as the author's cultural background and mood at the time of writing. In this article, spam reviews do not refer to comments based on reviewers' own experience of the products, so a low-quality review is not necessarily a spam review; conversely, spammers usually try to ensure review quality in order to increase a spam review's influence. To hide their identities and mislead users, spammers usually write their reviews in the style of normal reviews. In many cases, therefore, normal users cannot identify whether a review is spam, resulting in insufficient annotated data and difficulty in assessing detection results. This is one of the challenges of deceptive spam review detection. When researchers asked three volunteers to manually identify 160 false reviews, the volunteers tended to misjudge false reviews as real ones, with recognition accuracy of only 53.1%-61.9% [14]. Therefore, spam detection is an urgent and very difficult task.
Deceptive spam review detection is usually treated as a binary classification problem. Most existing approaches follow the seminal work of Jindal and Liu [11], adopting machine learning methods to build classifiers. Under this direction, feature engineering is important: the majority of studies focus on designing effective features from the perspective of linguistic and psychological cues to enhance classification performance, but both kinds are discrete features. Although these features perform strongly, their sparsity makes it difficult to encode the semantic information of a document from the viewpoint of discourse.
Recently, neural network models have been used to learn semantic representations for natural language processing tasks, achieving highly competitive results [15]. In view of the good performance of neural-network-based methods in NLP, some research has explored neural network models that learn document-level representations for detecting spam reviews from semantics. For example, Ren et al. [16] proposed a gated recurrent neural network model for deceptive spam review detection. However, research in this direction is not adequate and many critical problems related to spam detection remain unsolved.
In this paper, we propose a novel method named hierarchical attention-based CNN for deceptive spam review detection (HACNN). Since a review has a hierarchical structure (words form sentences, and sentences form a document) [17], we likewise construct a document representation by first building representations of sentences and then aggregating those into a document representation. The model mainly consists of two layers: the first is the word2sent layer (see Section 3.1), which uses a convolutional neural network to produce continuous sentence vector representations based on word embeddings; the second is the sent2doc layer (see Section 3.2), which uses an attention-pooling-based convolutional neural network to generate the document representation based on the sentence representations. On top of these, a softmax classifier uses the generated document representation to identify deceptive spam reviews.
The main contributions of this paper are:
(1) A new hierarchical attention-based convolutional neural network is proposed to distinguish deceptive reviews from truthful reviews. The proposed model needs no external modules and can be trained end-to-end.
(2) The combination of the word2sent layer and the sent2doc layer enables our model to extract comprehensive information, namely the historical, future and local context of any position in a document. In other words, the position and intensity information of features is completely preserved with the help of the proposed hierarchical structure and attention mechanism.
(3) Empirical results on a public, gold-standard spam review dataset demonstrate that the proposed model achieves higher accuracy than state-of-the-art approaches.
The remainder of the paper is structured as follows: the next section gives a brief review of related work. Section 3 introduces the details of our proposed method. Section 4 presents experimental analysis and results. Section 5 summarizes the contributions of this work and future work.
2. Related work
We will make a brief review of the related work from two perspectives. One is deceptive spam review detection and the other is neural networks for representation learning.
2.1 Deceptive spam review detection
In comparison with other types of spam such as e-mail spam [18] and web spam [19], spam review detection is nontrivial because deceptive reviews cannot be distinguished from real reviews by manual evaluation [11]. Hence, state-of-the-art methods for detecting other types of spam do not apply in this domain. Accordingly, detection of spam reviews can be considered one of the sophisticated problems in the natural language processing domain.
This problem was first proposed by Jindal and Liu [11] in 2008, who trained models using features based on the review content, the reviewer, and the product itself. Jindal et al. [12] divided deceptive reviews into three categories: false positive (or negative) comments; comments that discuss only the brand and not the product itself; and comments that include no opinion at all (such as advertising). The first category is the most harmful, while the latter two are relatively harmless and easy to identify.
Many methods have been proposed to filter spam reviews [20-24]. Most of these studies tried to demonstrate how spam reviews differ from real opinions in terms of sentiment and linguistic aspects [3,11,13,25-28], writing style [29,30], and subjectivity and readability [31]. The majority of these approaches were conducted on the synthetic dataset initially introduced by [13]. However, by running the same methods on synthetic and real datasets, Mukherjee et al. argued that synthetic datasets are defective [32,33]. Thus, techniques based on these synthetic datasets are problematic, as they do not appropriately reflect real-world spam reviews [34].
Yoo and Gretzel gathered 42 deceptive and 40 truthful hotel reviews and manually compared their linguistic differences [35]. Ott et al. created a gold-standard collection by employing Turkers to write fake reviews, and follow-up research has been based on their data [36].
Recently, Li et al. developed a wider-coverage gold-standard deceptive spam review dataset [37] building on Ott et al.'s work, which comprises data from three domains (hotel, restaurant, and doctor) generated through crowdsourcing and domain experts, and explored generalized approaches for identifying online deceptive opinion spam. We adopt this dataset for our experiments due to its larger size and coverage.
A number of approaches have demonstrated that focusing on the content similarity of reviews is advantageous. In these approaches, duplicate and near-duplicate reviews are considered spam [38-42]. Content similarity comparison is a popular technique among researchers, as it is generally believed that spammers create a small number of fake reviews and try to copy them in different situations, under various identities, and for diverse products of a brand. Atefeh Heydari et al. proposed a deceptive spam review detection system in which reviewer activeness and rating behavior, as well as review content similarity, are jointly investigated in suspicious time intervals captured from the time series of reviews by a pattern recognition technique [9].
There has also been work exploiting features outside the review content itself. Zhang et al. proposed the CoFea method, based on entropy and the co-training algorithm, to identify deceptive spam reviews from unlabeled reviews. They first sort all lexical terms of the reviews by entropy, and then propose two strategies, CoFea-T and CoFea-S; the CoFea-T strategy yields better accuracy, while the CoFea-S strategy saves more computing time [43].
All these methods exhibit a fundamental problem: traditional discrete features fail to effectively extract the semantic information of the whole discourse. To overcome these drawbacks, we propose in this paper a hierarchical attention-based convolutional neural network that not only extracts continuous features but also captures important and comprehensive information throughout the review.
2.2 Neural models for representation learning
Neural networks have been used to learn continuous representation for many NLP tasks [44-46]. With advances in deep learning techniques, distributed word representation [47-49] has become a common practice for vector representation. Word vector representations are mostly learned through neural language models [50-52]. Word embedding has drawn great attention in recent years because it can capture both syntactic and semantic information.
The paragraph vector [47] is also very powerful. It learns representations of sentences or paragraphs in the same way as word vectors are learned with Skip-gram or CBOW [48,49]. It is an unsupervised algorithm that learns fixed-length feature representations from variable-length texts. It achieves remarkably good performance on some tasks; however, other research finds that it performs sub-optimally on others.
Luyang Li et al. introduced a sentence-weighted neural network to learn document representations [53]. The model applies a hard attention in a fixed mode by incorporating sentence weights into document representation learning, which could be improved by a soft alignment and a more flexible mode; they apply KL-divergence as the importance weight of a word. In contrast, in this paper we empirically explore a hierarchical attention-pooling-based convolutional neural network to learn document representations for detecting deceptive spam reviews.
3. Method
A review has a hierarchical structure: words form sentences and sentences form a document, and the composition of words into sentences is similar to the composition of sentences into documents [9]. We likewise construct a hierarchical neural network to learn the document representation. Fig. 1 depicts the structure of our model, which mainly comprises two layers: the first is the word2sent layer (see Section 3.1), which uses a convolutional neural network to produce continuous sentence vector representations based on word embeddings; the second is the sent2doc layer (see Section 3.2), which uses an attention-pooling-based convolutional neural network to generate the document representation based on the sentence representations. On top of these, a softmax classifier uses the generated document representation to identify deceptive spam reviews.
3.1 word2sent layer
The convolutional neural network is a state-of-the-art method for modeling semantic representations of sentences [54]. We use a convolutional neural network [45,55,56] to learn continuous representations of a sentence, as it does not rely on an external parse tree. The convolution operation has been commonly used to synthesize lexical n-gram information [44,57]. N-grams have been shown to be useful for many NLP tasks [49,58,59], and we apply them in our neural network. As shown in Fig. 2, we use three convolutional filters, of respective widths 2, 3 and 4, to produce the sentence representation. The reason is that they are capable of capturing the local semantics of n-grams of various granularities, which has proven powerful for NLP tasks such as sentiment classification [46].
Fig. 2. The word2sent layer
Formally, denote a sentence consisting of $n$ words as $(w_1, w_2, \ldots, w_n)$, with $e_{w_i}$ the embedding of word $w_i$. A convolutional filter is a list of linear layers with shared parameters. Let $l_{c_1}$, $l_{c_2}$ and $l_{c_3}$ be the widths of the three convolutional filters, respectively.

Taking the filter of width $l_c$ as an example, $W_c$ and $b_c$ are the shared parameters of its linear layers. The input of a linear layer is the concatenation of the word embeddings in a fixed-length window, denoted as $I_{c,i} = [e_{w_i}; e_{w_{i+1}}; \ldots; e_{w_{i+l_c-1}}]$. The output of a linear layer is calculated as

$$O_{c,i} = W_c \cdot I_{c,i} + b_c \tag{1}$$

where $O_{c,i} \in \mathbb{R}^{l_{oc}}$ and $l_{oc}$ is the output size of the linear layer. We feed these outputs to an average pooling layer, resulting in an output vector of fixed length:

$$\bar{O}_c = \frac{1}{n - l_c + 1} \sum_{i=1}^{n - l_c + 1} O_{c,i} \tag{2}$$

We further apply the hyperbolic tangent (tanh) to incorporate pointwise nonlinearity, obtaining the output of this filter:

$$S_{c_1} = \tanh(\bar{O}_c) \tag{3}$$

Similarly, we obtain $S_{c_2}$ and $S_{c_3}$ for the other two convolutional filters of widths $l_{c_2}$ and $l_{c_3}$, respectively. To capture the global semantics of a sentence, we average the outputs of the three filters to generate the sentence representation $S$:

$$S = \frac{1}{3}\left(S_{c_1} + S_{c_2} + S_{c_3}\right) \tag{4}$$
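As a concrete illustration, the word2sent computation of Eqs. (1)-(4) can be sketched in NumPy as follows. The dimension sizes, function names and random initialization here are illustrative assumptions for the sketch, not the trained parameters of our model:

```python
import numpy as np

def word2sent(embeddings, widths=(2, 3, 4), out_dim=8, seed=0):
    """Sketch of the word2sent layer: for each filter width, apply a shared
    linear layer to every word window (Eq. 1), average-pool over windows
    (Eq. 2), apply tanh (Eq. 3), then average the three filters (Eq. 4)."""
    rng = np.random.default_rng(seed)
    n, emb_dim = embeddings.shape
    filter_outputs = []
    for lc in widths:
        # Shared parameters W_c, b_c of the linear layers for this filter.
        W = rng.standard_normal((out_dim, lc * emb_dim)) * 0.1
        b = np.zeros(out_dim)
        # Eq. (1): linear layer on the concatenated embeddings of each window.
        O = np.stack([W @ embeddings[i:i + lc].reshape(-1) + b
                      for i in range(n - lc + 1)])
        pooled = O.mean(axis=0)                 # Eq. (2): average pooling
        filter_outputs.append(np.tanh(pooled))  # Eq. (3): tanh nonlinearity
    return np.mean(filter_outputs, axis=0)      # Eq. (4): average of the filters

# Toy sentence of 6 words with 4-dimensional embeddings.
sent = word2sent(np.random.default_rng(1).standard_normal((6, 4)))
print(sent.shape)  # (8,)
```

Note that because each filter output passes through tanh, every component of the resulting sentence vector lies in (-1, 1).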
3.2 sent2doc layer
Given the continuous sentence vector representations obtained from the first layer of the model, the sent2doc layer outputs a document vector representation. There are various methods for document representation, such as averaging all sentence vectors, but they cannot capture the semantic information between sentences. In this paper we build a sent2doc model (borrowing the attention-pooling idea of Meng Joo Er et al. [60], which was originally used for sentence representation); it extracts comprehensive semantic information when going from sentence representations to the document representation. The model has two salient points: the pooling scheme and the combination of the BLSTM model with the convolutional structure.
As shown in Fig. 1, convolutional filters perform convolutions on the input document matrix and generate local representations. An attention pooling layer is used to integrate the local representations into the final document representation with attention weights. These weights are obtained by comparing the local representations, position by position, with an intermediate document representation generated by the BLSTM [61,62], and they are optimized during the training phase. Finally, the document representations of all distinct convolutional filters are concatenated into the final feature vector, which is fed into a top-level softmax classifier. The intermediate document representation is also used as an input to the softmax classifier in the testing phase, as indicated by the dashed lines in Fig. 1.
CNNs have been proven to be powerful semantic composition models, and the convolution operation can independently capture the local information contained at every position of a document. The convolution operation in our model is conducted in one dimension between $k$ filters and a concatenation vector $x_{i:i+m-1}$ representing a window of $m$ sentences starting from the $i$th sentence, obtaining features for that window of sentences in the corresponding feature maps. The term $d$ denotes the dimension of the sentence representation. The parameters of each filter are shared across all windows, and multiple filters with differently initialized weights are used to improve the model's learning capability. The number of filters $k$ is determined by cross-validation. The convolution operation is governed by

$$c_i = f\left(W \cdot x_{i:i+m-1} + b\right) \tag{5}$$

where $W \in \mathbb{R}^{k \times md}$, the term $b$ is a bias vector, and $f$ is a nonlinear activation function. We employ a special version of the nonlinear activation function called LeakyReLU [63], because it allows a small gradient when the unit is not active and helps further improve learning efficiency compared with ReLU.
Suppose the length of a document is $T$ sentences. As the sentence window slides, the feature maps of the convolutional layer can be represented as

$$C = [c_1, c_2, \ldots, c_{T-m+1}] \tag{6}$$

Each element $c_i$ is a local representation of the corresponding position, and the output of the convolutional layer represents the local representations of the document.
An intermediate document representation is generated by a BLSTM. The BLSTM is a variant of the recurrent neural network that alleviates the vanishing gradient problem by replacing the hidden state of the recurrent neural network with a gated memory unit; moreover, it can learn both the historical and the future information contained in a document. The BLSTM is jointly trained with all the other components of the architecture: the gradients of the cost function back-propagate through the intermediate document representation, so it is optimized during the training phase. It is worth noting that the output dimension of the BLSTM should be kept the same as the output dimension of the convolution operation; that is, both the local representations and the intermediate representation are mapped to the same dimension. We denote the intermediate document representation as $r$.
We then calculate the attention weights by comparing each local document representation generated by the convolution layer with the intermediate document representation generated by the BLSTM: the higher the similarity between the intermediate document representation and a local representation, the bigger the attention weight assigned to that local representation. The attention weight is calculated as

$$a_i = \frac{\exp\left(\operatorname{score}(c_i, r)\right)}{\sum_{j}\exp\left(\operatorname{score}(c_j, r)\right)} \tag{7}$$

where the term $a_i$ is a scalar and the function $\operatorname{score}(\cdot,\cdot)$ measures the similarity between its two inputs; cosine similarity is used in our model. After the attention weights are obtained, the final document representation is given by

$$\mathrm{doc} = \sum_{i} a_i \, c_i \tag{8}$$
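A minimal NumPy sketch of this attention pooling follows. The softmax normalization of the cosine scores is one standard reading of the mechanism, and the dimensions and data are toy assumptions:

```python
import numpy as np

def attention_pool(local_reps, intermediate):
    """Sketch of attention pooling: score each local representation against
    the BLSTM intermediate document representation with cosine similarity,
    normalize the scores into attention weights, and take the weighted sum."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scores = np.array([cosine(c, intermediate) for c in local_reps])
    weights = np.exp(scores) / np.exp(scores).sum()    # normalized weights
    doc = (weights[:, None] * local_reps).sum(axis=0)  # weighted sum
    return weights, doc

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 8))           # local representations c_1..c_5
r = C[2] + 0.01 * rng.standard_normal(8)  # intermediate representation, close to c_3
weights, doc = attention_pool(C, r)
print(weights.argmax())  # the position most similar to r gets the largest weight
```

The local representation nearest the intermediate representation receives the largest weight, which is the intended behavior of the pooling scheme.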
In our model, the attention can be regarded as taking a weighted sum of all the local annotations to compute the document annotation, where the weight of each position measures how much it contributes to the meaning of the entire document. This borrows the key idea of the attention mechanism (assigning bigger weights to more significant features) to extract the most important information contained in the document: sentences in a review play different roles in the semantic representation, and some sentences are more important than others for distinguishing deceptive spam reviews from truthful ones.
Fig. 1. Hierarchical attention-based convolutional neural network for deceptive spam review detection
3.3 Softmax classifier
On top of the document convolution layer, we add a linear transformation and a softmax classifier to produce conditional probabilities over the class space. To avoid overfitting, dropout with masking probability p is applied to the penultimate layer; the key idea of dropout is to randomly drop units (along with their connections) from the neural network during the training phase []. The output layer is calculated as follows:
$$z = W_s \left( \mathrm{doc} \circ m \right) + b_s \tag{9}$$

$$P_c = \frac{\exp(z_c)}{\sum_{j=1}^{C} \exp(z_j)} \tag{10}$$

where $\circ$ is the element-wise multiplication operator, $m$ is the masking vector with dropout rate $p$ (the probability of dropping a unit during training), and $C$ is the number of classes. In addition, a norm constraint on the output weights is imposed during training.
As our model is supervised, each review has a gold label $P^g_c$. The following objective function, minimizing the categorical cross-entropy, is used:

$$\mathrm{loss} = -\sum_{d \in D} \sum_{c=1}^{C} P^g_c(d) \log P_c(d) \tag{11}$$

where $P^g_c$ follows a 1-of-K coding scheme in which the dimension corresponding to the true class is 1 while all others are 0. The parameters to be learned include all the weights and bias terms in the convolutional filters, the BLSTM and the softmax classifier; the attention weights are updated during the training phase, and the word embeddings are fine-tuned as well. Optimization is performed with the Adadelta update rule of [], which has been shown to be an effective and efficient back-propagation algorithm.
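The output layer and training objective of Eqs. (9)-(11) can be sketched as follows. The classic (non-inverted) dropout scaling at test time and all dimension sizes are illustrative assumptions:

```python
import numpy as np

def softmax_classifier(doc_vec, W, b, dropout_p=0.5, train=True, rng=None):
    """Sketch of Eqs. (9)-(10): dropout masking on the penultimate layer,
    a linear transformation, then a softmax over the C classes."""
    if train:
        rng = rng or np.random.default_rng(0)
        mask = (rng.random(doc_vec.shape) >= dropout_p).astype(float)
        doc_vec = doc_vec * mask               # Eq. (9): element-wise masking
    else:
        doc_vec = doc_vec * (1.0 - dropout_p)  # classic dropout scaling at test time
    z = W @ doc_vec + b
    e = np.exp(z - z.max())
    return e / e.sum()                         # Eq. (10): class probabilities

def cross_entropy(probs, gold_class):
    """Eq. (11), for one document: cross-entropy against a 1-of-K gold label."""
    return -np.log(probs[gold_class] + 1e-12)

rng = np.random.default_rng(1)
W, b = rng.standard_normal((2, 8)) * 0.1, np.zeros(2)  # C = 2: truthful vs. deceptive
probs = softmax_classifier(rng.standard_normal(8), W, b, train=False)
print(round(float(probs.sum()), 6))  # 1.0
```

The cross-entropy of the gold class is then minimized over all training documents with Adadelta (or any gradient-based optimizer).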
4. Experiments
In this section, we evaluate the performance of the proposed model on public datasets for spam review detection and compare it with state-of-the-art methods. We conduct three types of experiments: in-domain, cross-domain and mixed-domain.
4.1 Dataset and evaluation metrics
We use the public dataset released by Jiwei Li et al. [37], which is a gold-standard spam review dataset. The distribution of the dataset is listed in Table 1. The dataset covers three domains: Hotel, Restaurant and Doctor. Each domain contains three types of data, namely "Customer", "Turker" and "Employee", which stand for different data sources. The truthful reviews come from customers who really had a consumption experience. The spam reviews were written by Turkers and experts; experts are employees in each domain with expert-level domain knowledge. Specifically, Li et al. [37] and Ott et al. [36] used Amazon Mechanical Turk to collect deceptive reviews from online workers (Turkers). In the experiments, we use accuracy as the evaluation metric. For the Hotel domain, we use all (Customer/Turker/Employee) reviews for classification; for the Restaurant and Doctor domains, because Employee reviews are too few, we perform only Customer/Turker review classification. Table 1 shows the statistics of the dataset. We use 90% of the data for training and 10% for testing.
Table 1
Statistics of the three-domain dataset

Domain       Turker   Expert   Customer
Hotel           800      280        800
Restaurant      200        0        200
Doctor          356        0        200
4.2 Word embedding
The input of the algorithm is N variable-length documents; each sentence S is constituted by words, which are represented by vectors. In this paper we adopt word2vec for our word embeddings; that is, off-the-shelf pre-trained matrices are used as our initial matrices to make better use of the semantic and grammatical associations of words.
The pre-trained model was trained on a massive Google News corpus of over 100 billion words, and it contains 300-dimensional vectors for 3 million words and phrases. Since the word vector matrix is quite large (3.6 GB) and contains many words unnecessary for our task, we retain only the vectors of the words that appear in our dataset. The model was trained with the Skip-gram method by maximizing the average log probability of all the words [64] as follows:
$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \tag{12}$$

where $c$ is the context window size. The values of the word vectors are included in the parameters optimized during the training procedure.
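A single term $\log p(w_{t+j} \mid w_t)$ of this objective can be sketched as a softmax over the vocabulary (the toy vocabulary size, dimensions and variable names are assumptions; real word2vec training replaces the full softmax with hierarchical softmax or negative sampling):

```python
import numpy as np

def skipgram_log_probs(center, W_in, W_out):
    """Log-probabilities log p(w | w_center) of the skip-gram model: a softmax
    over the vocabulary of dot products between input and output vectors."""
    scores = W_out @ W_in[center]  # one score per vocabulary word
    return scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())

rng = np.random.default_rng(0)
V, d = 10, 4                           # toy vocabulary of 10 words
W_in = rng.standard_normal((V, d))     # input (word) vectors
W_out = rng.standard_normal((V, d))    # output (context) vectors
lp = skipgram_log_probs(3, W_in, W_out)[7]  # log p(w_7 | w_3)
print(lp < 0)  # True: log-probabilities are negative
```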
4.4 Experimental results and analysis
4.4.1 In-domain results
For the in-domain setting, a set of tests is conducted following Ren et al.'s setting [16], in order to compare our model with their state-of-the-art neural model based on a bi-directional gated recurrent neural network [16]. For the Hotel domain we use all (Customer/Turker/Employee) reviews; for the Restaurant and Doctor domains we use only Customer/Turker reviews, because Employee reviews are too few. The results are shown in Table 2: the accuracy of our model is about 5% higher than that of Ren et al. In particular, in the Doctor domain our accuracy reaches 90.9%, while that of Ren et al. is only 76.3%.
However, in the Restaurant domain our model gives lower results than the model of Ren et al. One possible reason is that the number of reviews is relatively small, which leads to relatively lower accuracy. The above analysis shows that our neural network is well suited for deceptive spam review detection.
Table 2
In-domain results. ALL represents Customer/Turker/Employee.

Domain       Setting            Method       Accuracy (%)
Hotel        ALL                Li et al.            66.4
                                Ren et al.           80.8
                                HACNN                85.5
Restaurant   Customer/Turker    Li et al.            81.7
                                Ren et al.           87.1
                                HACNN                85.0
Doctor       Customer/Turker    Li et al.            74.5
                                Ren et al.           76.3
                                HACNN                90.9
4.4.2 Cross-domain results
In the cross-domain setting, we want to know whether the relatively richly annotated Hotel dataset can be used to train effective detection models for the Restaurant or Doctor domain, and to study the generalization ability of our neural model. We train a classifier on the Hotel domain and evaluate its performance on the other domains. The results are shown in Table 4. In the Restaurant domain our model is clearly better than the others, gaining the best result with an accuracy of 87.5%. In the Doctor domain, the most accurate model is Li et al.'s discrete model [37]; the accuracies of the two state-of-the-art neural networks are much lower, and our model comes close to Li et al.'s method.
Our model trained on the Hotel domain does not transfer to the Doctor domain as well as to the Restaurant domain, which is reasonable given the many shared properties of restaurants and hotels, such as environment and location, and also largely due to the difference in vocabulary. This result is consistent with the results of all the other models above.
Table 4
Cross-domain results (trained on Hotel).

Domain       Method            Accuracy (%)
Restaurant   Li et al.                 78.5
             Ren et al.                83.5
             Li et al. 2017            66.8
             HACNN                     87.5
Doctor       Li et al.                 74.5
             Ren et al.                57.0
             Li et al. 2017            61.5
             HACNN                     72.7
Table 5
Cross-domain results (trained on two domains, tested on the third).

Training domains     Testing domain   Accuracy
Doctor/Restaurant    Hotel               0.595
Doctor/Hotel         Restaurant          0.775
Hotel/Restaurant     Doctor              0.745
We also train on two domains and test on the remaining one; for example, we train on the Hotel and Restaurant domains and test on the Doctor domain. The results are shown in Table 5.
4.4.3 Mixed-domain results
In the mixed-domain setting, we want to compare with Li's paper [53], which uses all the deceptive reviews from Turkers and experts together with the truthful reviews from customers. Hence, we do our best to use data with the same distribution in the mixed-domain experiment.
Li's methods [53] include paragraph-average, weight-average, basic CNN, SCNN, SWNN and combinations of the aforementioned methods with features. Specifically, SCNN is a basic document representation learning model consisting of two convolution operations: the sentence convolution composes each sentence through a fixed-length window, and the document convolution transforms the sentence vectors into a document vector. SWNN is the sentence-weighted neural network model, which can be regarded as a modification of SCNN; it applies the KL-divergence of words as importance weights to compute the importance weight of each sentence.
We can see from Table 6 that our HACNN model gains the best result in the mixed-domain setting, with accuracy well above the other neural networks: SWNN achieves 80.1% accuracy, and SWNN with features achieves 82.2%. In spam review detection, the POS and I features are strong features, so we conjecture that adding these two features could raise our recognition rate even further; we leave this for future work.
The results demonstrate the effectiveness of the combination of the word2sent and sent2doc layers, which enables our model to extract comprehensive information, namely the historical, future and local context of any position in a document; in other words, the position and intensity information of reviews is completely preserved with the help of the proposed hierarchical structure and attention mechanism. This is what we aim to achieve for deceptive spam review detection, so as to overcome the disadvantages of existing strategies.
Table 6
Mixed-domain results

Model               Accuracy
Paragraph-average      0.729
Weight-average         0.680
Basic LSTM             0.550
Hier-LSTM              0.618
Basic CNN              0.708
SCNN                   0.702
SWNN                   0.801
SWNN+POS               0.797
SWNN+POS+I             0.822
HACNN                  0.855
4.4.4 Parameter settings
We experimentally study the effect of three parameters: the sentence window size, the dropout rate, and the number of sentence-level convolutional filters. The accuracy is shown in Fig. 3, from which we set the sentence window sizes to 2, 3 and 4, the dropout probability to 0.5, and the number of sentence-level convolutional filters to 100. We use these parameter values in our deceptive spam review detection experiments.
(a) Effect of sentence window size
(b) Effect of dropout probability
(c) Effect of the number of sentence-level convolutional filters
Fig. 3. The effect of the three parameters in the experiments
5. Conclusion and future work
A new hierarchical attention-based convolutional neural network has been proposed for deceptive spam review detection. The proposed model needs no external modules and can be trained end-to-end. The sent2doc layer generates a document representation based on the sentence representations, extracting the most significant information in a review. Furthermore, the combination of the word2sent and sent2doc layers enables our model to extract comprehensive information (the historical, future and local context of any position in a document); in other words, the position and intensity information of reviews is completely preserved with the help of the proposed hierarchical structure and attention mechanism. Experimental results on a public, gold-standard spam review dataset show that the proposed model achieves higher accuracy and stronger generalization ability than state-of-the-art approaches. For future work, on the one hand we will combine handcrafted features to further improve performance, since feature selection is very important in this direction; on the other hand, we can extend this model to other natural language processing tasks such as sentiment analysis, and even to computer vision and image recognition.