2.1 Online Social Network Sites
The internet, once largely regarded as a static repository of information, has become a tool that people use in many ways as the use of social network sites (SNSs) has increased. SNSs have become an important part of people’s information flows and an important tool for community building, since they create new pathways for information and help mobilize group members. Since the 1990s, blogs, SNSs and microblogs have allowed people to interact with the intent of sharing common interests. One study found that 80% of internet users on SNSs identified themselves with a group, which helped users accomplish goals. The percentage of adult SNS users has also increased considerably, with these users only using e-mail or search in addition to SNSs. Another study examined SNS use amongst adults and found that the majority of SNS users were on Facebook, a result supported by the findings of Madden and Zickuhr.
2.2 Types of Social Media
The evolving field of Social Media can be classified into six types, namely collaborative projects, blogs, content communities, social networking sites, virtual game worlds and virtual social worlds.
2.2.1 Collaborative projects
Collaborative projects enable the joint creation of content by end-users and are the most democratic demonstration of user-generated content (UGC). They include online encyclopedias, where users can add, remove or change text-based content, and social bookmarking applications, which enable the group collection and rating of content. An example of an online encyclopedia is Wikipedia, available in more than 230 different languages; an example of social bookmarking is the web service Delicious, which allows the storage and sharing of web bookmarks. The bottom line in these projects is that joint efforts lead to a better outcome. Corporate firms need to realize that collaborative projects are becoming the main source of information for many consumers. Though Wikipedia content may not always be true, it is believed to be true by many Internet users. Mobile handset manufacturer Nokia uses internal wikis to update employees on project status and to trade ideas, and these wikis are used by about 20% of its 68,000 staff members. Likewise, American computer software company Adobe Systems maintains a list of bookmarks to company-related websites and conversations on Delicious.
2.2.2 Blogs
Blogs, which display entries in reverse chronological order, represent the earliest form of Social Media [OECD. (2007). Participative web and user-created content: Web 2.0, wikis, and social networking. Paris: Organisation for Economic Co-operation and Development]. Blogs are the Social Media equivalent of personal web pages and vary from personal diaries to information summaries. A blog is usually managed by one person, with possible interaction with others through comments. Text-based blogs are the most common, due to their historical roots. Blogs have also begun to take different media formats; for example, San Francisco-based Justin.tv allows users to create personalized television channels for broadcasting images from a webcam in real time to other users. Blogs are also used to update employees, customers, and shareholders on important developments. Jonathan Schwartz, CEO of Sun Microsystems, maintains a personal blog to improve transparency in his company, as does General Motors. Blogs do come with risks: dissatisfied or disappointed customers may protest or complain through blogs, thus putting damaging information online for others to note. Companies may also be forced to live with negative comments from their own staff, like Robert Scoble of Microsoft, who fiercely criticized the company’s products before he quit.
2.2.3 Content communities
The main objective of these sites is the sharing of media content, and they exist for many types of media. For example, BookCrossing shares more than 700,000 books from 130 countries, Flickr shares photos, YouTube shares videos and SlideShare shares PowerPoint presentations. Users of content communities are not required to create a personal profile page. Corporate companies see content communities as a risk, fearing the sharing of copyrighted material. Though the communities have rules in place and remove such illegal content, it is difficult to prevent popular videos from being uploaded to YouTube. The popularity of content communities makes them an attractive contact channel. In 2007, Procter & Gamble encouraged its employees to upload 1-minute videos about its drug Pepto-Bismol to YouTube. Manufacturer Blendtec became popular for its inexpensive “Will it blend?” videos, watched by millions. Cisco and Google rely on content communities to share recruiting videos, as well as keynote speeches and press announcements, with their employees and investors.
2.2.4 Social networking sites
These sites enable users to create personal profiles and connect with others by inviting friends and colleagues to access the profiles and by exchanging e-mails and instant messages. The profiles can include photos, video, audio and blogs. Social networking sites are highly popular, especially among younger Internet users. Even the term ‘Facebook addict’ has found its way into the dictionary. Social networking sites are being used for brand creation and marketing research. Warner Brothers created a Facebook profile where visitors could watch trailers, download graphics, and play games. A florist offered a widget on Facebook called “Gimme Love” for users to send “virtual bouquets” to friends.
2.2.5 Virtual game worlds
Virtual worlds are platforms that replicate a 3D environment and allow users to interact as they would in real life. They are an advanced manifestation of Social Media, providing the highest level of social presence and media richness. Virtual worlds can be of two types. In the first type, virtual game worlds, users are required to behave according to context-based rules, as in a massively multiplayer online role-playing game (MMORPG). Such applications have gained popularity on consoles like Sony’s PlayStation and Microsoft’s Xbox, with many users from different parts of the world. The game World of Warcraft, for example, has 8.5 million subscribers.
2.2.6 Virtual social worlds
In the second type, virtual social worlds, inhabitants choose their behaviors freely and live a virtual life that replicates their real life. Users appear as avatars in a three-dimensional virtual environment, with no rules restricting the range of possible interactions except for basic physical laws such as gravity. This allows a range of self-presentation strategies. As usage increases, users demonstrate behaviors resembling those in real-life environments. A known example of a virtual social world is the Second Life application, where avatars can take a walk or enjoy virtual sunshine. It also allows a user to design virtual clothing or furniture items and sell this content to others. These sites offer many opportunities for marketing companies and for internal process management in human resources.
2.3 Issues in Social Media Websites
Social networking sites themselves are generally not a problem for organizations; it is the way users behave on them that is a cause for concern. The generic issues, starting with privacy, are detailed below.
• Privacy Issues: SNSs are popular, especially amongst students. This has prompted many corporations to invest time and money in creating, purchasing, promoting, and advertising SNSs. Users may be exposed to various risks like disclosure of personal information, online buying and access to inappropriate content. Other risks include fake profiles, malicious applications, spam, and fake links that lead to attacks. Marketing research indicates that SNSs are growing in popularity worldwide, and press coverage of SNSs has emphasized potential privacy concerns.
• Productivity Issues: Employees may spend time updating their profiles and sites throughout the day. For example, if 30 workers each spend even 20 minutes on a social networking site every day, this amounts to 10 hours of ineffective work per day, or 105 days per year; for organizations looking at productivity, this is a major issue. Moreover, employees do not appreciate colleagues spending hours on SNSs while others have to cover their workloads. Several institutions have restricted access: soldiers are banned from accessing MySpace, the Canadian government prohibits employees from using Facebook, and the U.S. Congress has proposed legislation to ban youth from accessing SNSs in schools and libraries.
• Resource Management: SNS updates may consume very little bandwidth, but the video links are an issue, since they require high levels of bandwidth.
• Viruses and Malware: This is an important threat that is often overlooked by organizations. Hackers see in SNSs an opportunity to commit fraud and launch spam or malware attacks. Some applications have the potential to infect computers with malicious code and collect data from users.
• Online Scams: As more and more people follow SNSs, they can fall victim to online scams that seem authentic. Users are persuaded to give personal details, and data theft becomes a serious risk. Profile data can often be mined by cybercriminals, making it imperative for employers to be careful. Online scammers generally send the user an e-mail or message with a link that asks for profile information and promises that it will add new followers. These links resemble applications, games and the like. When the user enters his or her details at the link, the scammers receive them and the information can be misused.
• Clickjacking: This is a malicious technique of tricking Web users into revealing confidential information, or of taking control of their resources, when they click on seemingly harmless Web pages. Users are vulnerable across a variety of browsers and platforms, and clickjacking takes the form of embedded code or script that can run without the user’s knowledge. The objective of such an attack is to trick users into clicking on links, icons, buttons and so on, which triggers processes that run in the background without the user’s knowledge.
2.4 Evolution of Social Network Sites
The first recognizable social network site, SixDegrees.com, was launched in 1997. It allowed users to create profiles and list Friends; the surfing of Friends lists was added in 1998. Classmates.com allowed people to affiliate with their high school or college and surf the network for others who were also affiliated, but users could not initially create profiles or list Friends. SixDegrees was the first to combine these features and promoted itself as a tool to help people connect with and send messages to others. SixDegrees attracted millions of users, but failed to sustain its business and closed in 2000. From 1997 to 2001, many community tools, such as AsianAvenue, MiGente and BlackPlanet, began supporting combinations of profiles and publicly articulated Friends. LiveJournal, after its launch in 1999, listed one-directional connections, called buddy lists, on user pages. The Korean virtual world Cyworld, started in 1999, added SNS features in 2001 with diary pages, Friends lists and guest books. The next wave started with the launch of Ryze.com in 2001, which helped people leverage their business networks. It was first introduced to friends of its founder, including the entrepreneurs and investors behind many future SNSs. Ryze, LinkedIn, Friendster and Tribe.net believed they could support each other with little competition. Friendster became one of the biggest disappointments in Internet history, while LinkedIn became a powerful business service. The year 2003 witnessed the launch of many new SNSs. Most sites were profile oriented and tried to replicate Friendster. Professional sites like Xing, LinkedIn and Visible Path focused on business people. At Dogster, strangers with shared interests connected with each other, while Care2 helped activists meet. Couchsurfing connected travelers to people with couches, and MyChurch joined members of Christian churches. As user-generated content grew, websites began implementing SNS features and later turned into SNSs.
Flickr for photo sharing, Last.FM for music and YouTube for video sharing grew in popularity. Google’s Orkut could not sustain a U.S. user base, but it became the national SNS of Brazil before growing rapidly in India. Microsoft’s Windows Live Spaces, launched for the U.S., became extremely popular elsewhere. MySpace was launched in 2003 to compete with Friendster and other sites and to attract estranged SNS users. MySpace grew rapidly when Friendster indicated that a fee would be charged for messages. MySpace differentiated itself from others by allowing users to personalize their pages and by regularly adding the features that users asked for. Teenagers joined MySpace in 2004 because they wanted to connect with their favorite bands. MySpace encouraged users’ friends to join, did not reject underage users and changed its user policy to accommodate minors. Its users were mainly teenagers, musicians and artists, and the post-college crowd. News Corporation purchased MySpace in July 2005, but the site was implicated in a series of sexual interactions between adults and minors, prompting legal action. The Chinese QQ messaging service became the world’s largest SNS after it made friends visible with profiles [27]. Korean Cyworld introduced homepages and buddies. Facebook began in early 2004 as a Harvard-only SNS. A Facebook user needed a .edu e-mail address to register, and the SNS later started supporting other schools and institutions. From late 2005, it expanded the network to include high school students, professionals inside corporate networks and, eventually, anyone. Gaining access to corporate networks required a .com address, and gaining access to high school networks required administrator approval. Facebook users could build Applications to personalize profiles and perform other tasks, such as comparing movie preferences. Many SNSs focus on growing exponentially, while some seek smaller groups.
The rise of SNSs indicates a shift in the organization of online communities, although SNSs dedicated to communities still exist. SNSs are organized mainly around people, not interests, which implies that they mirror unmediated social structures in a world of networks. SNS features have introduced a new framework for online communities and a vibrant new research context. Fig. 2.1 depicts the launch of major SNSs.
Fig. 2.1 Launch of SNSs
2.5 SNS Networks
SNSs provide rich sources of behavioral data, gathered through automated collection or through company-provided datasets, that enable analysis of large-scale patterns of links and usage. Golder et al. examined a dataset of 362 million messages exchanged by Facebook users to study Friending and messaging. Lampe et al. explored the relationship between profile elements and the number of friends on Facebook and found that profile fields led to a reduction in transaction costs. Network visualization has also been analyzed, and researchers have studied the network structure of Friendship. Kumar et al. analyzed the roles played by people in the growth of the Flickr and Yahoo networks. A Friendship classification scheme was studied by Hsu et al., while Herring et al. analyzed the role of language in Friendship, and other research examined the importance of geography in Friendship. People’s motivations for joining specific communities have also been studied. Spertus et al. identified a topology of users from their community memberships and used it to recommend additional communities of interest to users.
2.6 Advantages in Social Networking
SNSs and digital media have become a catalyst for contemporary communication, and their advance constitutes a transformation in human communication. These media have changed mediated communication in social and cultural processes, giving users new ways to express themselves and to participate freely in major events. SNS platforms allow users to interact and collaborate with each other using different tools. There are many intuitive benefits to the use of social media technologies. They offer a means of mass self-communication capable of reaching global audiences. Social media make it possible for an average user to create, change, circulate, and share digital content and knowledge (websites, blogs, films, video clips, pictures and so on) with other users in powerful new ways. Users have the power to transform their personal social networks through connections. Social media are characterized by multiple sources of production and distribution, and users exercise some control over the information they provide on SNSs. Users understand this power and their intellectual property rights. Social media introduce pervasive changes to communications. SNS culture rests on users’ aspirations for technology and its effect on their lives. It also reflects a belief in their own contributions and a degree of social connection with others. Fig. 2.2 depicts the Social Media Landscape.
Fig. 2.2 Social Media Landscape.
2.6.1 Impression and Friendship on SNS
Individuals construct an online representation of self, as in dating profiles, and SNSs constitute a part of self-presentation, impression management and friendship. Boyd, in one of the earliest academic articles on SNSs, examined Friendster as the center of publicly circulated social networks in which users connect and present themselves as part of an extended network. While most sites encourage users to construct accurate representations of themselves, participants do this to varying degrees. Marwick found that users on three different SNSs had complex strategies for negotiating the rigidity of profiles, as authenticity and playfulness vary across sites. Skog found that the status feature on LunarStorm strongly influenced users’ behavior and what they revealed in their profiles, as measured by sent messages and photos. Another important feature of self-presentation is the set of friendship links, which serve as identity markers for online profiles. Impression management was one of the reasons Friendster users gave for choosing friends. Fono and Raynes-Goldie, in their examination of LiveJournal, described users’ understandings of the public display of connections and of Friendship as a catalyst for social drama. Walther studied how friends’ links affect impression formation.
2.6.2 SNS Identity Construction
Online identity construction allows users to define themselves more fully by applying labels such as student. Facebook provides the opportunity to share interests, ideas and images as part of a user’s online identity, thus producing a symbolic communication. Schau and Gilly explored why Facebook users tend to emphasize particular aspects of their identity and remove inconsistent tags when constructing it. Any online representation with imagery or associations lets users display their self-concept of who they are. These underlying concepts lay the foundation for users to construct their identities. The actions required for self-presentation draw on materials like signs, brands, symbols and practices to create a desired impression. The social identity theory proposed by Hoyer and MacInnis holds that individuals evaluate brands in terms of their consistency with individual identities. Users may not make direct brand associations, but their behaviors show consistency with brands, thus creating an identity acceptable to a desired group. Zhang and Daugherty state that the SNS experience is a platform on which users compare themselves with others and enhance their self-identity. Users use other profiles as a yardstick to construct their self and determine their social position. In education, SNSs can be used for survival and for becoming an instructional leader, where survival implies understanding the technical aspects of a subject as well as social and cultural norms. Leaders need to be connected to SNSs to stay in touch and to give the information necessary to be successful. Hansford and Ehrich identified positive outcomes: 16 of the studies in their meta-analysis reported beneficial outcomes for the participants. The most common positive outcomes reported in the analyzed studies were support, empathy, counseling, sharing ideas, problem-solving and professional development.
2.6.3 Communities of Practice
Wenger terms informal units communities of practice and indicates that they consist of people who are fully engaged in the process of creating and communicating. They are not part of the formal organizational structure, but consist of individuals who are informally connected through their interactions and the work they do together. Communities of practice are defined by doing and have very flexible boundaries of participation. They also differ from teams, since the participants are bound together by a shared practice instead of a task, and they form a deeper relationship than a network. The main characteristic of a community of practice is that it exists whether or not it is recognized by the organization. Communities of practice accelerate learning and innovation. SNSs support professional growth by allowing practicing professionals to create their own virtual communities of practice, where they can participate in an ongoing exchange of ideas about their practice. The members build an online community that is focused on a knowledge domain and accumulate expertise in the domain by sharing and interacting on problems and solutions. SNSs are an ideal platform for such connections, since they favor a network of connections managed by individual users. SNSs are a repository that can always be referred to and that holds an unprecedented amount of information, thus creating an opportunity for educators and professionals to engage with each other and grow freely. Those who use SNSs do so because they view them as an encyclopedia of knowledge that is always available. SNSs support a participatory and collaborative context through which participants can create a media-rich information base.
FILTERED INFORMATION RETRIEVALS
People rely on recommendations in their day-to-day life for everything from product purchases to travel and leisure. They sift through books and web pages to find interesting and valuable information. Tapestry [90], one of the first recommender systems, coined the phrase “collaborative filtering”, which has been widely adopted. Recommender systems may also suggest interesting items, in addition to filtering [91]. The fundamental assumption in filtering is that users who have rated items similarly, or behaved similarly (for example, in buying), will rate or act on other items similarly [92]. Collaborative Filtering (CF) techniques use known preferences on items to predict a new user’s likes or dislikes, inferred from the behavior of existing users. User ratings can be explicit indications on a 1–5 scale or implicit indications like click-throughs or purchases. Table 3.1 is an example of a user-item ratings matrix with likes and dislikes, while Table 3.2 is a user recommendation table. A brief overview of CF techniques is given in Table 3.3.
Table 3.1: An example of a user-item matrix
Rajeswari: (like) Serials, News, (dislike) Superman
Narayanan: (like) News, Superman, (dislike) Spiderman
Rajagopal: (like) Spiderman, (dislike) News
Nandakumar: (like) Serials, (dislike) Spiderman
Table 3.2: User recommendations
             Serials   News      Spiderman   Superman
Rajeswari    Like      Like      -           Dislike
Narayanan    -         Like      Dislike     Like
Rajagopal    -         Dislike   Like        -
Nandakumar   Like      -         Dislike     ?
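The like/dislike statements of Table 3.1 can be turned into the user-item matrix of Table 3.2 with a few lines of code. The following is a minimal sketch only; the encoding (+1 for like, -1 for dislike, None for an unrated cell) and the function names are assumptions made for illustration and are not part of any cited system.

```python
# Hypothetical encoding: +1 = like, -1 = dislike, None = no rating yet.
LIKE, DISLIKE = 1, -1

# Explicit statements taken from Table 3.1.
statements = {
    "Rajeswari":  {"Serials": LIKE, "News": LIKE, "Superman": DISLIKE},
    "Narayanan":  {"News": LIKE, "Superman": LIKE, "Spiderman": DISLIKE},
    "Rajagopal":  {"Spiderman": LIKE, "News": DISLIKE},
    "Nandakumar": {"Serials": LIKE, "Spiderman": DISLIKE},
}
ITEMS = ["Serials", "News", "Spiderman", "Superman"]


def build_matrix(statements, items):
    """Return {user: [rating or None for each item]} in the order of `items`."""
    return {user: [prefs.get(item) for item in items]
            for user, prefs in statements.items()}


def unrated_items(matrix, items, user):
    """Items for which a CF technique would have to predict a rating."""
    return [item for item, rating in zip(items, matrix[user]) if rating is None]


matrix = build_matrix(statements, ITEMS)
for user, row in matrix.items():
    print(user, row)
print(unrated_items(matrix, ITEMS, "Nandakumar"))
```

The empty cells returned by `unrated_items` are the ones a CF technique tries to fill, such as the "?" for Nandakumar and Superman in Table 3.2.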
Table 3.3: Overview of collaborative filtering techniques.
CF category: Memory-based CF
  Representative techniques:
    * neighbor-based CF (item-based/user-based CF algorithms with Pearson/vector cosine correlation)
    * item-based/user-based top-N recommendations
  Main advantages:
    * easy implementation
    * new data can be added easily and incrementally
    * need not consider the content of the items being recommended
    * scale well with co-rated items
  Main shortcomings:
    * are dependent on human ratings
    * performance decreases when data are sparse
    * cannot recommend for new users and items
    * have limited scalability for large datasets
CF category: Model-based CF
  Representative techniques:
    * Bayesian belief nets CF
    * MDP-based CF
    * latent semantic CF
    * sparse factor analysis
    * CF using dimensionality reduction techniques, for example, SVD, PCA
  Main advantages:
    * better address the sparsity, scalability and other problems
    * improve prediction performance
    * give an intuitive rationale for recommendations
  Main shortcomings:
    * expensive model-building
    * have trade-off between prediction performance and scalability
    * lose useful information for dimensionality reduction techniques
CF category: Hybrid recommenders
  Representative techniques:
    * content-based CF recommender, for example, Fab
    * content-boosted CF
    * hybrid CF combining memory-based and model-based CF algorithms, for example, Personality Diagnosis
  Main advantages:
    * overcome limitations of CF and content-based or other recommenders
    * improve prediction performance
    * overcome CF problems such as sparsity and gray sheep
  Main shortcomings:
    * have increased complexity and expense for implementation
    * need external information that is usually not available
3.2 Characteristics of Collaborative Filtering
Although e-commerce systems that produce quality predictions attract customer interest, their recommendation algorithms operate in a demanding environment. Whether these systems can provide fast and accurate recommendations depends on a few characteristics.
3.2.1. Data Thinness
Data thinness (sparsity) appears in several situations. One is the arrival of a new item or user: it is difficult to find similar ones because there is not enough information. A new item cannot be recommended until some users rate it, and new users are unlikely to receive good recommendations because they lack a rating history. Coverage, defined as the percentage of items for which the algorithm can provide recommendations, is reduced when ratings are few, and the system may be unable to generate recommendations for some items. This reduces the effectiveness of a recommendation system that relies on comparing users in pairs to generate predictions. To alleviate this problem, dimensionality reduction techniques like Singular Value Decomposition (SVD) have been proposed. They remove insignificant users or items to reduce the dimensionality of the user-item matrix. Latent Semantic Indexing (LSI) is based on SVD. When users or items are discarded, however, useful recommendation information may be lost, degrading the recommendation quality. Hybrid CF algorithms, like the content-boosted CF algorithm, address thinness by using external content information to produce predictions for new users or items. Kim and Li proposed a probabilistic model in which the items were classified into groups and predictions were made using a Gaussian distribution of user ratings [97].
Model-based CF algorithms like TAN-ELR [98] provide accurate predictions for sparse data. Other model-based CF techniques that tackle the thinness problem include the associative retrieval technique, which applies an associative retrieval framework and related spreading activation algorithms to explore transitive associations among users through their ratings and purchase history.
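Thinness and coverage can be made concrete with a small calculation. The sketch below is illustrative only: the ratings are hypothetical, and coverage is approximated here as the share of items that have at least one rating, which is an assumption rather than the definition used by any particular cited algorithm.

```python
def sparsity(matrix):
    """Fraction of user-item cells that carry no rating (None)."""
    cells = [rating for row in matrix.values() for rating in row]
    return sum(rating is None for rating in cells) / len(cells)


def item_coverage(matrix, n_items):
    """Fraction of items rated by at least one user.

    Assumption: an item with no rating at all cannot be recommended,
    so this share serves as a rough proxy for coverage.
    """
    rated = sum(
        any(row[j] is not None for row in matrix.values())
        for j in range(n_items)
    )
    return rated / n_items


# Hypothetical data: four users, five items; the last item is new and unrated.
ratings = {
    "u1": [5, None, 3, None, None],
    "u2": [None, 4, None, None, None],
    "u3": [2, None, None, 1, None],
    "u4": [None, None, 4, None, None],
}

print(sparsity(ratings))          # 14 of the 20 cells are empty
print(item_coverage(ratings, 5))  # 4 of the 5 items have a rating
```

In this toy matrix the new item in the last column lowers coverage even though every user has rated something, which is the new-item situation described above.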
When the numbers of existing users and items grow tremendously, traditional filtering algorithms suffer serious scalability problems, with computational resources going beyond acceptable levels. This complexity is a burden on systems that must react immediately and make recommendations for all users. Dimensionality reduction techniques can produce quality recommendations quickly, but they have to undergo expensive matrix factorization steps.
Memory-based filtering algorithms like the item-based Pearson correlation CF algorithm achieve satisfactory results by calculating the similarity only between pairs of items co-rated by a user [101]. Model-based filtering algorithms like the clustering CF algorithm seek recommendations for a user within similar clusters instead of the entire data set.
Synonymy refers to very similar items that have different names or entries. Many recommender systems treat such items as different and fail to discover the latent association. For example, “movie” and “film” mean the same thing, but filtering systems would find no match between them when computing similarity, which decreases system performance. Earlier solutions depended on intellectual term expansion or the construction of a thesaurus. The LSI method takes a large matrix of term-document associations and constructs a semantic space in which closely associated terms and documents are placed close to each other. The performance of LSI in addressing similarity is impressive where precision is low, but it gives only a partial solution to the problem.
Gray sheep users do not consistently agree or disagree with any group of people and thus do not benefit from information filtering. Black sheep users have idiosyncratic tastes that make recommendations very difficult, and this is considered an acceptable failure [105]. Claypool et al. provided a hybrid approach combining content-based and CF recommendations. They based their prediction on a weighted average of the content-based prediction and the CF prediction, allowing the system to determine the optimal mix of content-based and CF recommendation for each user [106].
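The weighted-average idea can be sketched as follows. This is not the procedure of Claypool et al.; the function names, the fixed grid of candidate weights and the use of mean absolute error to choose a per-user weight are all assumptions made for illustration.

```python
def hybrid_prediction(content_pred, cf_pred, cf_weight):
    """Blend two predictions; cf_weight in [0, 1] is the share given to CF."""
    if not 0.0 <= cf_weight <= 1.0:
        raise ValueError("cf_weight must lie in [0, 1]")
    return cf_weight * cf_pred + (1.0 - cf_weight) * content_pred


def best_weight(history, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick, for one user, the candidate weight with the lowest mean absolute error.

    history: list of (content_pred, cf_pred, actual_rating) tuples for that user.
    Assumption: a simple grid search stands in for whatever weight-update
    scheme a real system would use.
    """
    def mean_abs_error(weight):
        errors = [abs(hybrid_prediction(c, f, weight) - actual)
                  for c, f, actual in history]
        return sum(errors) / len(errors)

    return min(candidates, key=mean_abs_error)


# Hypothetical user for whom the CF predictions were exactly right in the past.
history = [(4.0, 2.0, 2.0), (5.0, 3.0, 3.0)]
w = best_weight(history)
print(w, hybrid_prediction(4.0, 2.0, w))
```

A gray sheep user, for whom CF predictions are poor, would end up with a weight close to 0 under this scheme and would be served mostly by the content-based prediction.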
Some users give positive recommendations for their own items while responding negatively to others (shilling attacks), and it is desirable for systems to discourage this kind of phenomenon [91]. Such attack models for collaborative filtering have been identified and their effectiveness has been studied. Lam and Riedl found that the item-based CF algorithm was less affected by these attacks than the user-based CF algorithm and suggested alternative ways to evaluate and detect attacks. O’Mahony et al. addressed the attack problem by analyzing a recommender system’s resilience to potentially malicious perturbations in the customer/product rating matrix. Bell and Koren [109] took a comprehensive approach to the attack problem, removing global effects through data normalization and working with the residuals of global effects to select neighbors, and achieved improved performance on the Netflix data [110].
3.3. Memory-Based Collaborative Filtering Techniques
Memory-based CF algorithms treat every user as part of a group with similar interests and identify the neighbors of a user in order to predict that user’s preferences on new items. The neighborhood-based CF algorithm uses the following steps: calculate the similarity or weight, which reflects distance, correlation, or weight, between two users or two items; then produce a prediction for the active user by taking the weighted average of all the ratings of the user or item on a certain item or user, or by using a simple weighted average.
3.3.1. Computing Similarity
Computing item or user similarity is a critical step in memory-based collaborative filtering algorithms. In item-based filtering algorithms, the similarity between two items is computed by first identifying the users who have rated both items and then applying a similarity computation to their ratings. In a user-based filtering algorithm, the similarity between two users is computed over the items that both have rated. There are many techniques to compute similarity or weight between users or items.
• Correlation-Based Similarity: Similarity between users or items is measured by computing the Pearson correlation, which measures the extent to which two variables relate linearly to each other. Pearson correlation-based CF algorithms are widely used by the research community. In user-based Pearson correlation, the summations run over the items that both users have rated, and each user’s average rating is taken over those co-rated items. Table 3.4 is a simple example of a ratings matrix.
Table 3.4: User Ratings matrix
I1 I2 I3 I4
U1 4 ? 5 5
U2 4 2 1
U3 3 2 4
U4 4 4
U5 2 1 3 5
In item-based Pearson correlation, the summations run over the users who have rated both items, and each item’s average is the average rating given by those users. Variations of Pearson correlation have also been proposed. Examples of other correlation-based similarities are constrained Pearson correlation, Spearman rank correlation, and Kendall’s correlation. The number of users in the similarity computation is the neighborhood size of the user.
• Vector Cosine-Based Similarity: The similarity between two documents can be measured by treating each document as a vector of word frequencies and computing the cosine of the angle formed by the frequency vectors. This can be used in collaborative filtering by substituting users or items for documents and ratings for frequencies. A similarity matrix can be computed for the desired similarity. In real situations, different users may use different rating scales, which the vector cosine method does not take into account. To address this drawback, the corresponding user average can be subtracted from each co-rated pair, which gives the same form as Pearson correlation. Plain vector cosine similarity, unlike Pearson correlation, yields no negative values when all ratings are non-negative.
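The two similarity measures above can be sketched in a few lines. The example uses rows U1 and U5 of Table 3.4, with the "?" treated as an unrated cell. Restricting both measures to co-rated items and returning 0.0 when a similarity is undefined are assumptions of this sketch, not requirements of the techniques.

```python
from math import sqrt


def co_rated(a, b):
    """Pairs of ratings for the items both users have rated (None = unrated)."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]


def pearson(a, b):
    """User-based Pearson correlation computed over co-rated items only."""
    pairs = co_rated(a, b)
    if len(pairs) < 2:
        return 0.0  # assumption: undefined similarity is reported as 0
    mean_a = sum(x for x, _ in pairs) / len(pairs)
    mean_b = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    den = (sqrt(sum((x - mean_a) ** 2 for x, _ in pairs))
           * sqrt(sum((y - mean_b) ** 2 for _, y in pairs)))
    return num / den if den else 0.0


def cosine(a, b):
    """Cosine of the angle between two rating vectors, on co-rated items."""
    pairs = co_rated(a, b)
    num = sum(x * y for x, y in pairs)
    den = (sqrt(sum(x * x for x, _ in pairs))
           * sqrt(sum(y * y for _, y in pairs)))
    return num / den if den else 0.0


u1 = [4, None, 5, 5]  # row U1 of Table 3.4; I2 is the rating to be predicted
u5 = [2, 1, 3, 5]     # row U5 of Table 3.4

print(round(pearson(u1, u5), 3))  # about 0.756
print(cosine(u1, u5))
```

For these two rows the cosine value is higher than the Pearson value, because cosine ignores the fact that U5 rates lower than U1 on average while Pearson subtracts each user's mean first.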
3.3.2. Prediction and Recommendation Computation
Computing predictions or recommendations is the core step of a collaborative filtering technique. In the neighborhood-based CF algorithm, a subset of nearest neighbors of the active user is chosen based on their similarity with the user, and a weighted aggregate of their ratings is used to generate predictions for the active user [115].
• Weighted Sum of Others’ Ratings: The prediction for the active user on an item is the active user's average rating plus the weighted average of the other users' ratings on that item, where each rating is taken as a deviation from that user's own average and weighted by the user's similarity to the active user.
• Simple Weighted Average: In item-based prediction, a simple weighted average can be used to predict the rating of a user on an item. The summations run over all the other items the user has rated, and each of those ratings is weighted by the similarity between that item and the target item.
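Both prediction rules can be sketched in a few lines of Python. The function names, the tuple layout for neighbors, and the sample numbers below are hypothetical and chosen only to make the arithmetic easy to follow; normalizing by the sum of absolute weights is the usual convention and is assumed here.

```python
def weighted_sum_prediction(active_mean, neighbors):
    """User-based prediction.

    neighbors: iterable of (weight, neighbor_rating_on_item, neighbor_mean).
    Returns the active user's mean plus the weighted average deviation.
    """
    neighbors = list(neighbors)
    norm = sum(abs(w) for w, _, _ in neighbors)
    if norm == 0:
        return active_mean  # assumption: fall back to the user's own mean
    offset = sum(w * (r - mean) for w, r, mean in neighbors) / norm
    return active_mean + offset

def simple_weighted_average(user_ratings, item_weights):
    """Item-based prediction.

    user_ratings: item -> rating given by the user.
    item_weights: item -> similarity between that item and the target item.
    """
    items = [i for i in user_ratings if i in item_weights]
    norm = sum(abs(item_weights[i]) for i in items)
    if norm == 0:
        return None  # no usable neighbors for this item
    return sum(user_ratings[i] * item_weights[i] for i in items) / norm

# Hypothetical example: three neighbors with similarities 0.8, -0.5 and 0.2.
p_user = weighted_sum_prediction(4.0, [(0.8, 5, 4.0), (-0.5, 2, 3.0), (0.2, 3, 3.5)])
# Hypothetical example: the user rated I1 and I3; weights are item similarities.
p_item = simple_weighted_average({"I1": 4, "I3": 2}, {"I1": 0.9, "I3": 0.1})
print(round(p_user, 2), round(p_item, 2))
```

In the first example the negatively correlated neighbor rated the item below its own average, so it pushes the prediction upward, which is the behavior the deviation-from-mean form is meant to capture.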
3.3.3. Top N Recommendations
Top-N recommendation produces a ranked list of the N items that are most likely to interest a user, such as a list of products or books. Top-N recommendation techniques analyze the user-item matrix to discover relations between users or items and employ them to compute recommendations. They can also be modeled on association rule mining techniques.
• User-Based Top-N Recommendation Algorithms: These algorithms first identify the nearest neighbors of an active user using Pearson correlation or the vector-space model. Each user is treated as a vector in the item space, with one dimension per item, and the similarities between the active user and the other users are computed between these vectors. The corresponding rows of the most similar users in the user-item matrix are then aggregated to identify the Top-N items. These algorithms have limitations in real-time performance and scalability [116].
• Item-Based Top-N Recommendation Algorithms: These algorithms were developed to address the scalability issues of user-based Top-N recommendation algorithms. They first compute the most similar items for each item and then identify a set of candidate items. The candidates are sorted in decreasing order of similarity to produce the item-based Top-N list. This method can produce suboptimal recommendations, an issue that Deshpande and Karypis addressed by considering all combinations of items up to a particular size when computing the recommendations.
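The basic item-based procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the item-item similarities are given as a nested dictionary, candidates are scored by summing their similarity to the items the user already has, and ties are broken alphabetically. The function name, the parameter `k`, and the similarity values are hypothetical.

```python
def item_based_top_n(item_sims, purchased, n, k=2):
    """Return up to n recommended items for a user.

    item_sims: item -> {other_item: similarity}.
    purchased: items the user already has.
    k: number of most similar items taken per purchased item.
    """
    candidates = set()
    for item in purchased:
        sims = item_sims[item]
        # Keep the k most similar items of each purchased item.
        candidates.update(sorted(sims, key=sims.get, reverse=True)[:k])
    candidates -= set(purchased)  # never recommend what the user already has

    # Score each candidate by its total similarity to the purchased items.
    scores = {
        c: sum(item_sims[c].get(p, 0.0) for p in purchased) for c in candidates
    }
    # Decreasing similarity; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda c: (-scores[c], c))[:n]

# Hypothetical symmetric item-item similarities.
item_sims = {
    "A": {"B": 0.9, "C": 0.2, "D": 0.4},
    "B": {"A": 0.9, "C": 0.5, "D": 0.1},
    "C": {"A": 0.2, "B": 0.5, "D": 0.3},
    "D": {"A": 0.4, "B": 0.1, "C": 0.3},
}
print(item_based_top_n(item_sims, ["A", "B"], n=2))  # ['C', 'D']
```

Because the item-item similarities can be computed offline, only the cheap candidate scoring happens at recommendation time, which is where the scalability advantage over the user-based variant comes from.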
3.4. Model-Based Filtering Techniques
Models built with data mining or machine learning algorithms learn to recognize patterns from training data and then make predictions for filtering tasks on test or real-world data. Model-based CF algorithms such as Bayesian or clustering models can overcome the shortcomings of memory-based CF algorithms. CF models can be designed from classification algorithms when user ratings are categorical, while regression and SVD methods can be used for numerical ratings.
• Simple Bayesian Filtering Algorithm: The simple Bayesian filtering algorithm uses a naïve Bayes (NB) strategy to make predictions for filtering tasks. It predicts the class with the highest computed probability. For incomplete data, the probabilities are calculated over the observed data only, and a Laplace estimator is used to smooth the probability calculation and avoid a conditional probability of 0. Since most real-world CF data are multiclass, Su and Khoshgoftaar applied the simple Bayesian CF algorithm to multiclass data for CF tasks. They found that its predictive accuracy was worse than that of Pearson correlation-based CF, but that it scaled better because it made predictions from the observed ratings in less time. The simple Bayesian CF algorithm can be regarded as a memory-based filtering technique because of its in-memory calculation for predictions.
• Filtering by Clustering: A cluster is a collection of data objects that are similar to one another within the same cluster but dissimilar to the objects in other clusters [119]. Object similarity is determined using measures such as Pearson correlation or Minkowski distance. Clustering methods can be categorized as partitioning methods, density-based methods, and hierarchical methods [120]. MacQueen's k-means partitioning method is relatively efficient and easy to implement. Density-based clustering methods search for dense clusters of objects separated by sparse regions that represent noise. DBSCAN and OPTICS [122] are well-known density-based clustering methods. Hierarchical clustering methods such as BIRCH create a hierarchical decomposition of the data objects using some criterion. In most situations, clustering is an intermediate step, and the resulting clusters are used for further analysis or processing. CF models with clustering can be applied in different ways. Sarwar et al. and O’Connor and Herlocker [124] used clustering to partition the data into clusters before using a memory-based CF algorithm to make predictions for CF tasks within each cluster. Si and Jin extended existing clustering algorithms into a flexible mixture model (FMM) for CF by clustering both users and items. The model allowed each user and item to be in multiple clusters, but the clusters were modeled separately. Their experimental results showed that it had better accuracy than the Pearson correlation-based CF algorithm. Moreover, clustering models have better scalability than typical collaborative filtering methods because they make predictions within much smaller clusters [103]. However, the clustering computation itself is complex and expensive, and the quality of the resulting recommendations is generally low. As optimal clustering over large data sets is close to impossible, dimensionality reduction is necessary.
• Regression-Based Filtering Algorithms: These methods use an approximation of the ratings to make predictions based on a regression model. If a random variable represents a user’s preferences on different items, the regression matrix is very sparse. Canny proposed a sparse factor analysis in which missing elements were replaced with default voting values and a regression model was then used. Sparse factor analysis has better scalability than Pearson correlation-based CF. Vucetic and Obradovic proposed a regression-based approach to CF tasks on numerical ratings data that searches for similarities between items, builds a collection of simple linear models, and combines them to provide rating predictions for an active user. This method addressed sparsity. Lemire and Maclachlan proposed slope one algorithms to make faster CF predictions than memory-based CF algorithms.
• MDP-Based Filtering Algorithms: Shani et al. [25] viewed recommendation as a sequential optimization problem and used a Markov decision process (MDP) model for recommender systems. The optimal solution to the MDP maximizes a function of its reward stream. Starting from an initial policy, the policy is repeatedly updated based on the previous policy until the iterations converge to an optimal policy. This can be viewed as approximating a partially observable MDP (POMDP) by using a finite rather than unbounded window of past history to define the current state. Because the computational and representational complexity of POMDPs is high, strategies such as value function approximation, policy-based optimization, and stochastic sampling are used.
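As an illustration of the Laplace estimator mentioned for the simple Bayesian algorithm, the sketch below implements a small naïve Bayes classifier in Python. It is a sketch under assumptions rather than the algorithm from the cited work: features are the categorical ratings of other users on an item, the class is the active user's rating class, missing ratings are `None` and are skipped so that only observed data contribute, and every feature is assumed to take the same number of values (`n_values`). All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(rows, labels, n_values):
    """Count class frequencies and per-class feature values (observed only)."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for idx, value in enumerate(row):
            if value is not None:  # missing ratings are ignored
                feat_counts[(idx, label)][value] += 1
    return class_counts, feat_counts, n_values

def cond_prob(model, idx, value, label):
    """Laplace-smoothed P(feature idx = value | class = label)."""
    _, feat_counts, n_values = model
    counts = feat_counts[(idx, label)]
    return (counts[value] + 1) / (sum(counts.values()) + n_values)

def predict(model, row):
    """Return the class with the highest (log) posterior score."""
    class_counts, _, _ = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for idx, value in enumerate(row):
            if value is not None:
                score += math.log(cond_prob(model, idx, value, label))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: two other users' like (1) / dislike (0) votes per item.
rows = [(1, 1), (1, None), (0, 0), (0, 1)]
labels = ["like", "like", "dislike", "dislike"]
model = train(rows, labels, n_values=2)

print(cond_prob(model, 0, 0, "like"))  # 0.25 rather than 0, due to smoothing
print(predict(model, (1, 1)))          # like
```

Without the added 1 in the numerator, the value 0 for the first feature was never seen together with the class "like", so its conditional probability would be 0 and would wipe out the whole product for that class.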
3.5. Hybrid Collaborative Filtering Techniques
Hybrid CF systems combine collaborative filtering with content-based systems to make predictions or recommendations. Content-based recommender systems analyze textual content, such as messages, logs, and user preferences, and find regularities in it [130]. Textual content has many important elements, such as words or the similarity between items. Content-based techniques use classification algorithms to make recommendations. These techniques must have enough information to build a reliable classifier. Further, they are limited by the features explicitly associated with the objects they recommend, which are hard to extract. Collaborative filtering, on the other hand, can make recommendations without descriptive data. Demographic recommender systems use user profile information such as occupation and postal code, while utility-based and knowledge-based recommender systems require knowledge about user needs [132].
• Hybrid Techniques Incorporating CF and Content-Based Features: The content-boosted CF algorithm uses naïve Bayes as the content classifier. It fills in the missing values of the rating matrix with the predictions of the content predictor to form a pseudo rating matrix, in which observed ratings are kept untouched and missing ratings are replaced by the predictions of the content predictor. It then makes predictions over the resulting pseudo rating matrix using a weighted Pearson correlation-based CF algorithm, which gives a higher weight to an item that more users have rated and a higher weight to the active user, as illustrated in Table 3.5. Content-boosted CF recommenders have better predictive performance than pure content-based and memory-based CF algorithms.
Table 3.5: Content-boosted CF and its variations
(a) Content data and sparse rating data
(b) Pseudo rating data filled in by the content predictor
(c) Predictions from (weighted) Pearson CF on the pseudo rating data
Content information Rating matrix
Age Sex Career zip I1 I2 I3 I4 I5
U1 32 F writer 22904 4
U2 27 M student 10022 2 4 3
U3 24 M engineer 60402 1
U4 50 F other 60804 3 3 3 3
U5 28 M educator 85251 1
Pseudo rating data
I1 I2 I3 I4 I5
U1 2 3 4 3 2
U2 2 2 4 3 2
U3 3 1 3 4 3
U4 3 3 3 3 3
U5 1 2 4 1 2
Predictions from (weighted) Pearson CF
I1 I2 I3 I4 I5
U1 2 3 4 2 3
U2 3 4 2 2 3
U3 3 3 2 3 3
U4 3 3 3 3 3
U5 1 3 1 2 2
Ansari et al. proposed a Bayesian preference model that statistically integrates disparate types of information useful for recommendations, such as user preferences, and used Markov chain Monte Carlo (MCMC) methods for inference.
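The pseudo rating matrix used by content-boosted CF can be sketched in Python as below. The sketch assumes that the content predictor's output is already available as a dictionary of predicted ratings for every user-item pair; the function name and the sample values are hypothetical and are not taken from Table 3.5.

```python
def build_pseudo_ratings(observed, content_predictions):
    """Combine observed ratings with content-based predictions.

    observed: user -> {item: rating} for the ratings actually given.
    content_predictions: user -> {item: predicted rating} for every item.
    Observed ratings are kept untouched; gaps are filled with predictions.
    """
    pseudo = {}
    for user, predictions in content_predictions.items():
        user_observed = observed.get(user, {})
        pseudo[user] = {
            item: user_observed.get(item, predicted)
            for item, predicted in predictions.items()
        }
    return pseudo

# Hypothetical data: U1 rated only I3; U2 rated nothing.
observed = {"U1": {"I3": 4}}
content_predictions = {
    "U1": {"I1": 2, "I2": 3, "I3": 5},
    "U2": {"I1": 1, "I2": 4, "I3": 2},
}
pseudo = build_pseudo_ratings(observed, content_predictions)
print(pseudo["U1"])  # {'I1': 2, 'I2': 3, 'I3': 4}
```

The resulting dense matrix can then be passed to any memory-based CF routine, which is how the approach sidesteps the sparsity of the original ratings.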
• Recommender Systems Combined with CF: A hybrid weighted recommender combines different recommendation techniques according to weights that are computed from the results of the available recommendation techniques [132]. Examples include adjustable weights, majority weighted voting, and average weighted voting. However, hybrid recommenders rely on external information that is usually not available, and they generally increase implementation complexity.
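A weighted hybrid of this kind can be sketched as a normalized linear combination of the scores produced by each technique. The Python function below is a minimal illustration; its name, the technique labels, and the weights are hypothetical, and how the weights are chosen or adjusted is left open.

```python
def weighted_hybrid(scores, weights):
    """Combine per-technique predicted ratings into one score.

    scores: technique name -> predicted rating for one user-item pair.
    weights: technique name -> non-negative weight.
    """
    total = sum(weights[name] for name in scores)
    if total == 0:
        raise ValueError("at least one technique needs a positive weight")
    return sum(weights[name] * score for name, score in scores.items()) / total

# Hypothetical predictions from a CF model and a content-based model.
combined = weighted_hybrid({"cf": 4.0, "content": 3.0}, {"cf": 0.75, "content": 0.25})
print(combined)  # 3.75
```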