OPTIMIZATION BASED AGGREGATION FOR RANKING FRAUD DETECTION IN MOBILE APPLICATIONS
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SRI KRISHNA COLLEGE OF TECHNOLOGY
Ranking fraud in the mobile application market refers to fraudulent activities performed to raise an application's position in the popularity list. Because of such activities, users cannot distinguish fake reviews from real ones. To overcome this difficulty, a system for detecting ranking fraud in mobile Apps has been proposed. Historic ranking records, ratings and reviews have been collected for real-world App data. The ranking fraud detection system for mobile applications accurately locates the ranking fraud. The leading session may be viewed as the time range in which an App gains popularity. Three types of evidences are used: ranking based evidences, rating based evidences and review based evidences. An optimization based aggregation method has been employed to integrate all these evidences. The behaviour of these evidences has been investigated and evaluated with real-world App data.
Leaderboard, Mobile apps, Ranking fraud detection, Evidence aggregation, Rating and review.
This section presents an overview of data mining, its various types and areas of application. It also presents the scope of data mining for detecting ranking fraud in mobile applications from their historic data.

Data mining is the analysis step of Knowledge Discovery in Databases (KDD), an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets using methods at the intersection of several domains of computer science, such as computer vision, artificial intelligence, machine learning, statistics and database systems. The objective of the data mining process is to extract information from a data set and transform it into an understandable structure. Data mining is also described as sorting through data to identify patterns and establish relationships; it combines data analysis techniques with high-end technology for use within a process, and its primary goal is to develop usable knowledge about future events. The key properties of data mining include automatic discovery of patterns, prediction of likely outcomes, creation of actionable information, and a focus on large data sets and databases.

There are various types of data mining depending on where the mining technique is applied: text mining, spatial data mining, web mining and sequence mining. The KDD process is commonly defined by the stages Selection, Pre-processing, Transformation, Data Mining and Interpretation/Evaluation. The overall goal of the data mining process is to obtain information from a data set. Data mining involves six common classes of tasks, namely anomaly detection, association rule learning, clustering, classification, regression and summarization. Verification of the patterns produced by a data mining algorithm makes it possible to discover knowledge from data.
Fraud detection is a technique for identifying prohibited acts that occur all around the world. It characterizes the skilled impostor, formalizes the key forms and sub-forms of recognized frauds, and reveals the nature of the gathered data. Fraud detection is the identification of symptoms of fraud where no prior suspicion exists. Ranking fraud in the mobile App market refers to fraudulent or deceptive activities whose purpose is to bump Apps up the popularity list. Detecting it is essential, as it has become very common for App developers to use shady means, such as inflating their Apps' sales or posting phony App ratings, to commit ranking fraud and increase the number of downloads and revenue.
Many mobile App stores publish a daily leader board that shows the chart ranking of popular Apps. The leader board is important for promoting Apps: a higher rank leads to a huge number of downloads and large profits for the App developers, while the rank of genuine Apps decreases because of fraudulent ones. It is therefore essential to detect the fraudulent Apps, and the main objective of the system is to find them in the leader board. The ranking based evidences have been analyzed to detect manipulation of App ranks in the leader board. The rating based evidences inspect the data to obtain the correct rating for a given application. The reviews are examined to determine the users' views about a given App. The ranking, rating and review based evidences have been aggregated to detect fraudulent ranking of mobile applications.
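To make the idea of combining the three evidences concrete, the following is a minimal sketch in which each evidence is reduced to a score and combined with a weighted sum. This is an illustrative stand-in for the optimization based aggregation; the weights, the 0-to-1 score scale, and all names are assumptions, not the system's actual method.

```java
// Illustrative sketch only: a weighted sum stands in for the
// optimization based aggregation of the three evidence scores.
// Weights and the [0, 1] score scale are assumptions.
public class EvidenceAggregator {

    public static double aggregate(double rankingScore, double ratingScore,
                                   double reviewScore,
                                   double wRank, double wRate, double wReview) {
        // Weights are expected to sum to 1 so the result stays in [0, 1].
        return wRank * rankingScore + wRate * ratingScore + wReview * reviewScore;
    }

    public static void main(String[] args) {
        // Example: strong ranking evidence, moderate rating/review evidence.
        double score = aggregate(0.9, 0.5, 0.6, 0.5, 0.25, 0.25);
        System.out.println(score); // approximately 0.725
    }
}
```

An App whose combined score exceeds a chosen threshold would then be flagged for closer inspection.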
EXTRACTING EVIDENCES FOR RANKING FRAUD DETECTION
Growing technology has led to enormous growth in the App development market. With over a million Apps available in the App store, it is difficult to find the best App. Accurate ratings make it possible to navigate across the App store with more confidence. Various fraudulent activities are also performed when marketing these Apps to increase revenue and popularity. These give rise to fake chart rankings that provide inaccurate information to the user. Therefore, a system to detect ranking fraud in mobile Apps has been proposed.
The system takes the historic records of mobile Apps as input. Data has been collected from the real world for the top 100 Apps. Pre-processing has been performed to remove unnecessary data. Validation has then been performed based on the ranking evidences to ensure the validity of the data set. The ranking based evidences have been analyzed, and the fraudulent Apps have been discovered by aggregating the evidences. Thus, the system detects the fraudulent Apps from the list of Apps.
Table 3.1 presents the data set description. The data of the top 100 apps have been collected for three categories of applications namely top free, top paid and top gross.
Table 3.1. Dataset description

Category     Number of apps
Top free     100
Top paid     100
Top gross    100
Figure 3.1 presents the system architecture. The system takes historic data of mobile App rankings as input. Pre-processing has been performed on the data, and validation of the data has been done based on the ranking evidences. The ranking evidences are further analyzed to detect the fraudulent mobile Apps.
Fig.3.1 System Architecture
The complete system to detect fraudulent App ranking in the leader board chart is divided into three modules: data collection and pre-processing, validation of the data set based on rank evidences, and ranking based evidences.
Data collection and pre-processing
Data collection is defined as the systematic approach of gathering information from a variety of sources to obtain a complete and accurate picture of an area of interest. The data has been collected from the App Annie store for the top 100 paid, free and gross Apps for a period of six months, from April to September 2015. The data has been stored for easy and reliable processing.
Pre-processing is defined as the preliminary processing of data in order to prepare it for further analysis. It includes the following steps: extracting data from a larger set, checking the completeness of the data, and removing inconsistent data.
Data extraction is the process of retrieving unstructured data from data sources for further processing. A data set containing the top n Apps is fed into the system. If the data set comprises more than 100 Apps, the records of the Apps beyond the top 100 are removed, and the data of the top 100 Apps are obtained.
Data completeness indicates whether or not all the data necessary to meet the information demand are available in the data resource. A data set containing the top n Apps is fed into the system. If the data set comprises fewer than 100 Apps, the data are rejected and the user is notified; otherwise, the data of the top 100 Apps are obtained.
Inconsistent data is corrupt data: data that cannot be correctly understood or interpreted. It is therefore essential to remove inconsistent data in order to accurately find the ranking fraud. The data of the top 100 Apps are fed into the system; any unnecessary symbols or irrelevant data are removed, and consistent data is returned to the user as output.
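The three pre-processing steps above can be sketched as follows. This is a hypothetical illustration; the class and method names, the string record format, and the regex of symbols to strip ($, ?, =) are assumptions made for the example, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three pre-processing steps:
// extraction, completeness check, and inconsistency removal.
public class PreProcessor {

    // Step 1: keep only the records of the top n Apps.
    public static List<String> extractTop(List<String> records, int n) {
        return new ArrayList<>(records.subList(0, Math.min(n, records.size())));
    }

    // Step 2: completeness check -- the data set must cover all n ranks.
    public static boolean isComplete(List<String> records, int n) {
        return records.size() >= n;
    }

    // Step 3: remove inconsistent symbols such as $, ?, = from an App name.
    public static String clean(String appName) {
        return appName.replaceAll("[$?=]", "");
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<>();
        for (int rank = 1; rank <= 120; rank++) raw.add("App" + rank);

        List<String> top = extractTop(raw, 100);
        System.out.println(top.size());                   // 100
        System.out.println(isComplete(top, 100));         // true
        System.out.println(clean("$=WhatsAppMessenger")); // WhatsAppMessenger
    }
}
```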
Validation of data set based on rank evidences
It is not possible to detect ranking fraud using manipulated data, so inconsistent data sets have to be eliminated. A list of the top 100 Apps is read. If an App were given more than one rank on the same day, it would occur twice in the list, which cannot happen with genuine ratings. If such entries are found, they are discarded, and a valid data set is returned to the user.
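The duplicate-rank check above can be sketched as follows. The record type and all names are assumptions for the sake of illustration, not the project's actual code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the validity check: an App must not receive two ranks
// on the same day. The Entry record format is an assumption.
public class RankValidator {

    /** One daily leaderboard entry: (day, app name, rank). */
    public record Entry(String day, String app, int rank) {}

    // Returns the Apps that appear more than once on the same day.
    public static Set<String> findDuplicates(List<Entry> entries) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new HashSet<>();
        for (Entry e : entries) {
            String key = e.day() + "|" + e.app();
            if (!seen.add(key)) duplicates.add(e.app());
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Entry> chart = List.of(
            new Entry("2015-04-01", "AppA", 1),
            new Entry("2015-04-01", "AppB", 2),
            new Entry("2015-04-01", "AppA", 7)  // AppA ranked twice: invalid
        );
        System.out.println(findDuplicates(chart)); // [AppA]
    }
}
```

Entries for the flagged Apps would then be discarded before the evidence analysis runs.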
Ranking based evidences
It is the process of taking the rank positions of the top 100 Apps and analyzing them to find manipulations in the ranking. It is based on the fact that the ranking pattern differs between fraudulent and normal Apps. An App can be in a recession phase, a maintaining phase or a rising phase. An application shows ranking manipulation if its rank abruptly rises to a peak value, maintains itself at the peak position for a very short span of time, and then drops to a minimum. The rankings of the individual Apps are supplied to the system, and the ranking pattern is analyzed for every App. If the variation in the ranking is beyond the threshold range, the application is checked for fraudulence; if the ranking pattern of an App differs from that of traditional Apps, manipulation is detected. The real and manipulated Apps are thus obtained.
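The rise-peak-drop heuristic described above can be sketched as follows. The concrete thresholds (peak region, minimum stay, jump size) and all names are assumptions chosen for the example, not the system's tuned values.

```java
// Heuristic sketch of the rising/maintaining/recession check.
// Thresholds (peak region, jump size, minimum stay) are assumptions.
public class RankingPatternCheck {

    // ranks[i] = chart position on day i (1 = top). Flags an App whose
    // rank shoots into the peak region, stays there only briefly, and
    // falls back -- the abrupt rise-and-drop pattern described above.
    public static boolean looksManipulated(int[] ranks, int peakRank, int minStayDays) {
        int firstPeak = -1, lastPeak = -1;
        for (int day = 0; day < ranks.length; day++) {
            if (ranks[day] <= peakRank) {
                if (firstPeak < 0) firstPeak = day;
                lastPeak = day;
            }
        }
        if (firstPeak < 0) return false;  // never reached the peak region
        int stay = lastPeak - firstPeak + 1;
        boolean roseAbruptly = firstPeak > 0
                && ranks[firstPeak - 1] - ranks[firstPeak] > 50;
        boolean droppedAfter = lastPeak < ranks.length - 1
                && ranks[lastPeak + 1] - ranks[lastPeak] > 50;
        return roseAbruptly && droppedAfter && stay < minStayDays;
    }

    public static void main(String[] args) {
        int[] normal = {12, 11, 13, 12, 11, 12, 13};  // maintaining phase
        int[] suspicious = {95, 96, 3, 2, 4, 90, 97}; // spike then fall
        System.out.println(looksManipulated(normal, 10, 5));     // false
        System.out.println(looksManipulated(suspicious, 10, 5)); // true
    }
}
```

In the real system this pattern check is only one of the three evidences; a flagged App would still be confirmed via the rating and review based evidences.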
This section gives the implementation details of the project. The system is implemented in Java.
Data collection and pre-processing
The data has been collected from the App Annie mobile analytics store. The figure shows the ranking of the top free, gross and paid mobile Apps in the App Annie leader board chart and the way the data has been collected and stored as Excel sheets for further processing.
Pre-processing is the preliminary process of preparing the data for further analysis. It has been performed to extract the required data from a large data set, check its completeness and remove inconsistencies. The mobile App data has been obtained from the leader board, which lists the top applications by rank. For detecting fraud among the top 100 applications it is sufficient to have information about the top 100 applications, so extraction has been performed, as shown in Fig 5.2, to obtain the information about the top 100 Apps alone.
Extracting data from a larger set
In order to detect fraud among the top 100 applications it is essential to have complete information about them. Therefore the completeness has been checked; if there is any missing data, the user is notified, as shown in Fig 5.3. If there is no missing data, completeness is ensured, as shown in Fig 5.4.
Fig 5.3. Data set with missing data
Fig 5.4. Data set with no missing data
Fig 5.5 shows a data set containing inconsistencies in the form of symbols such as $, ? and =. For example, $=WhatsAppMessenger contains two extra symbols, $ and =, before the name of the application. Such symbols are inconsistencies, and they have been removed to obtain the required data set.
Fig 5.5. Removing the inconsistent data
Validation of data set based on rank evidences
The leader board chart needs to be reliable to obtain the desired results. If an application is ranked twice on the leader board for the same day, the ranking is not correct. Fig 5.6 shows the validity check performed on the data set to ensure that each application is given a unique rank each day.
Fig 5.6. Validation of the data set based on ranking
Ranking based evidences
Fig 5.7 depicts the overall ranking of the Facebook Messenger App over a period of six months. It can be inferred from the graph that the position of the App does not vary drastically; it is almost constant, so the App lies in the maintaining phase. This behaviour shows that there are no manipulations in the ranking.
Fig 5.7. Ranking based evidences for a normal App
Fig 5.8 shows the overall ranking of the Hulu App over a period of six months. It can be inferred from the graph that the position of the App varies drastically: it rises after a recession, does not remain at the peak, and then falls to a minimum rank. Such inconsistency in the ranking is not characteristic of a top 100 App, so the behaviour indicates manipulation in the ranking.
Fig 5.8. Ranking based evidences for a fraudulent App
This section presents a brief review of the works related to this project. They include the detection of web spam, spam in online reviews, fraud detection in taxi driving patterns, and general problems in fraud detection.
Ntoulas et al. studied various aspects of content-based spam on the Web and presented a number of heuristic methods for detecting it, using a real-world data set from the MSN Search crawler. They investigated web spam: the injection of artificially created pages into the web to influence search engine results and drive traffic to certain pages, for fun or profit. Their technique for the automatic detection of spam pages comprises a variety of methods, such as detecting keyword stuffing, analyzing the number of words in the page title, and computing n-gram likelihoods. Each method is highly parallelizable, runs in time proportional to the size of the page, and identifies spam by analyzing the content of the downloaded pages. Experiments on a subset of an MSN Search crawl demonstrated the relative merits of each method. Machine learning techniques were then employed to create a highly efficient and reasonably accurate spam detection algorithm, and the effectiveness of the heuristics was examined both in isolation and when aggregated using classification algorithms. The aggregated classifier correctly identified 86.2% of the spam pages, missing 13.8% of them and misidentifying 3.1% of the non-spam pages. This spam detection method proved more efficient than previous methods, but when used in isolation the individual heuristics did not identify all the spam pages.
Zhou et al. studied the problem of unsupervised Web ranking spam detection, proposing an efficient online link spam and term spam detection method based on spamicity. Related work proposed mobile App classification with enriched contextual information, emphasising the key role of understanding user preferences in mobile App usage. User preferences create opportunities for intelligent, personalized context-based services. A key step in mobile App usage analysis is to classify Apps into predefined categories; however, effectively classifying mobile Apps is a non-trivial task because of the limited contextual information available for the analysis. For instance, App names carry only limited information, so the contextual information is usually incomplete and ambiguous. An approach was proposed that first enriches the contextual information of mobile Apps with additional knowledge from a web search engine. Contextual features for mobile Apps were extracted from the context-rich device logs of mobile users, based on the observation that different types of mobile Apps may be relevant to different real-world contexts. The enriched contextual information was then combined into a Maximum Entropy model to train a mobile App classifier. Extensive experiments on the device logs of 443 mobile users showed both the effectiveness and the efficiency of the approach, which outperformed two state-of-the-art benchmark methods by a significant margin. The approach is both efficient and effective for automatic App classification, but it cannot be embedded into mobile devices, and since different users have different App usage behaviours, the integration of personal preferences into contextual feature extraction remains unexplored.
Lim et al. proposed a method for detecting spammers and spam nets in the LinkedIn social network. A manually labelled data set of real LinkedIn users was constructed, and classification was performed to separate spammers from legitimate users. The method for detecting LinkedIn spammers consisted of a set of new heuristics combined with a kNN classifier. A method for detecting spam nets (fake companies) in LinkedIn was also put forth, based on a set of new heuristics together with machine learning. Different classification techniques, such as decision trees, rule-based techniques, neural networks and kNN, were used to detect spam nets. The work focused on the idea that the fake profiles of a fake company usually share similarities that allow legitimate companies to be distinguished from fake ones. The method calculated the similarity between different company profiles using several distance functions, and the similarity values obtained were used as thresholds to detect fake companies (spam nets) among the legitimate companies. The proposed methods were found to be very effective, achieving an F-measure of 0.971 and an AUC close to 1 in the detection of spammer profiles. The proposed heuristics are adequate for detecting spammer profiles, and the method performs very well at detecting spam nets in LinkedIn; however, it does not detect spam nets effectively in other social networks.
D. M. Blei et al. proposed Latent Dirichlet Allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modelled as a finite mixture over an underlying set of topics, and each topic is in turn modelled as an infinite mixture over an underlying set of topic probabilities. In the context of text modelling, the topic probabilities provide an explicit representation of a document. The work addressed the problem of modelling text corpora and other collections of discrete data, with the objective of finding short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgements. LDA is based on a simple exchangeability assumption for the words and topics in a document and is therefore realized by a straightforward application of de Finetti's representation theorem; it can also be viewed as a dimensionality reduction technique. A simple convexity-based variational approach to inference was put forth, and it was observed that higher accuracy could be achieved by dispensing with the requirement of maintaining a bound. The approximate inference technique is efficient and improves performance, but the classification results were compared only against a collection of unigrams and the probabilistic LSI model.
Y. Ge et al. illustrated that the growth of GPS tracking technology has allowed GPS tracking devices to be installed in taxis, gathering a huge number of GPS traces over time. These GPS traces offer an unparalleled opportunity to uncover taxi driving fraud, and a fraud detection system was proposed to identify it. First, two kinds of evidence were uncovered: travel route evidence and driving distance evidence. A third function was then developed to combine them based on Dempster-Shafer theory. Interesting locations were first identified from the huge volume of taxi GPS logs, and a parameter-free method was proposed to extract the travel route evidence. Next, the concept of a route mark was developed to describe the driving path between locations, and based on these marks a model was built for the distribution of driving distances in order to discover the driving distance evidence. The system also utilized speed information in a speed-based fraud detection component to model taxi behaviour and detect taxi fraud. The method was found to be robust to location errors and independent of map information and road networks, and the parameter-free method extracts the travel route evidence effectively; however, it is not applicable as a real-world taxi driving fraud detection system with large-scale taxi GPS logs.

The problem of mobile application ranking fraud has thus been stated, and the overview of the system and the description of its various modules have been presented.
CONCLUSION AND FUTURE WORK
A system has been designed to detect fraud in the ranking of mobile applications. It is designed to provide accurate information about the ranking of mobile Apps, which helps users navigate the App store with confidence. To this end, the data set has been collected from the leader board chart and pre-processed, validation based on the ranking evidences has been done, and fraud detection using the ranking based evidences has been implemented. In future work, the rating and review based evidences would be gathered for real-world Apps, and the three evidences would be aggregated to detect fraudulent mobile App rankings.
1. K. Ali and K. Shi, "Getjar mobile application recommendations with very sparse datasets," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2012, pp. 204-212.
2. D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., pp. 993-1022, 2003.
3. H. Zhu, H. Xiong, Y. Ge, and E. Chen, "Discovery of ranking fraud for mobile apps," IEEE Trans. Knowledge and Data Engineering, vol. 27, no. 1, January 2015.
10. N. Jindal, E.-P. Lim, V. A. Nguyen, B. Liu, and H. W. Lauw, "Detecting product review spammers using rating behaviors," in Proc. 19th ACM Int. Conf. Information and Knowledge Management, 2010, pp. 939-948.
11. A. Klementiev, D. Roth, and K. Small, "Unsupervised rank aggregation with distance-based models," in Proc. 25th Int. Conf. Machine Learning, 2008, pp. 472-479.
12. C. Liu, Y. Ge, H. Xiong, and Z.-H. Zhou, "A taxi driving fraud detection system," in Proc. IEEE 11th Int. Conf. Data Mining, 2011, pp. 181-190.
13. A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh, "Spotting opinion spammers using behavioral footprints," in Proc. 19th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2013, pp. 632-640.
14. A. Ntoulas, M. Najork, and M. Manasse, "Detecting spam web pages through content analysis," in Proc. 15th Int. Conf. World Wide Web, 2006, pp. 83-92.
15. B. Zhou, J. Pei, and Z. Tang, "A spamicity approach to web spam detection," in Proc. SIAM Int. Conf. Data Mining, 2008, pp. 277-288.