Personalized movie trailer generation based on movie preferences

Master Thesis Proposal
Marketing Management
Rotterdam School of Management
Erasmus University Rotterdam
Date: 04/02/2019
1. Introduction
Advertising is central to the marketing mix for movies. Generating products that are characterized by frequent new product introductions and short product life cycles, the movie industry is known for unusually high levels of advertising (Rennhoff & Wilbur, 2011). The average advertising expenditure for a movie amounts to $36m, with the largest part allocated to trailer advertising (Karray & Debernitz, 2017).
This high expenditure may occur because movie trailers are among the most influential sources of information in the consumer's decision to see a movie. According to a study by Google, 94% of variation in a film’s box office opening can be explained with trailer-related searches on Google and YouTube four weeks prior to release (Google, 2013), and according to the Motion Picture Association of America, 54% of moviegoers report watching a trailer before attending a movie (Karray & Debernitz, 2017). This makes the movie trailer a crucial advertising medium to get right.
Traditional marketing endeavors target relevant consumers through segmentation practices. However, the demographic and psychographic composition of audiences varies too much from movie to movie to target consumers effectively in the available time span (Hixson, 2005). As trailers are usually released in a one-size-fits-all manner, the clips used in these advertisements are selected to appeal to a wide audience. Due to time and budget constraints, tailoring different movie trailers to different audience segments would be a difficult task.
With the shift of content to the digital realm, many new avenues for advertising are opening up. One highly notable development is the rise of personalized experiences, with brands that are taking advantage of the opportunities of personalization seeing revenues increase two to three times faster than those that do not (BCG, 2017). Whereas segmentation rests on the premise that each consumer cannot be targeted individually, personalization may remove this constraint altogether. In an industry in which large sums of money are spent on one-size-fits-all advertising, personalization of trailer advertisements may help studios target consumers on an individual level.
1.2 Research question
Given the importance of movie trailers in the consumer decision process and the large budget allocated to them, it is worthwhile to research how to make this advertising medium more personally relevant to consumers. Bigger, more relevant datasets offer more insight into consumer preferences, which creates ample opportunity to target audiences highly specifically. This thesis proposes a personalized movie trailer based on such consumer preferences. Such a system would help movie studios to target different audiences more effectively, as each consumer will be shown a movie trailer highly relevant to their own personal preferences. For instance, if data shows that user X will watch anything that features actor Y, a trailer that heavily features that actor should be more effective in persuading that user to watch the movie.
Central to linking consumer preferences data to movie trailers are recommendation systems. Recommendation systems (RS) are software tools and techniques that provide suggestions for items that are most likely to be of interest to a user (Ricci, Rokach, & Shapira, 2015). Content-based RSs generate recommendations by combining user feedback on items with the content (i.e., features) associated with them (Lops, De Gemmis, & Semeraro, 2011). Operating by the logic that users will prefer the same features in movies as they do in movie trailers, the following research question is proposed:
“How can recommendation systems guide the generation of personalized movie trailers, and how effective is personalization in movie trailers?”
The purpose of this study is to propose a framework for personalized movie trailers based on consumer preferences, and to provide empirical evidence for the effectiveness of such a trailer for audiences.
1.3 Academic relevance
Numerous studies in the field of video summarization or video abstraction have been conducted. Video summarization systems select significant segments for users to generate a short version of a lengthy video (Kannan, Ghinea, & Swaminathan, 2015). Existing approaches can be classified into cognitive-level and affective-level approaches, the former extracting low-level features such as color, motion, and composition, and the latter utilizing high-level semantic features such as events and semantic concepts. Most of these approaches focus on analyzing genre-specific movie trailers for patterns to automatically select salient moments in a movie for a movie trailer (Hermes & Schultz, 2006; Smeaton, Lehane, O’Connor, Brady, & Craig, 2006; Smith, Joshi, Huet, Hsu, & Cota, 2017), thus generating a generic summary that is common for all users.
Less research has been done on personalizing video summarization. One highly relevant study is by Kannan, Ghinea, and Swaminathan (2015), who propose a novel system that summarizes a movie based on the preferences and interests of the user. Shots and scenes are automatically detected and then labeled with high-level features. A key difference between this system and the system proposed in this study is the collection of user preferences: their system asks the user for them explicitly in order to generate a summary, while in this study user preferences are inferred from existing data.
1.4 Managerial relevance
Most research in this direction points to trailer generation as a labor-intensive process. Hand-selecting shots for movie trailers takes a long time.
The studies mentioned in the section above propose video summarization methods with an interactive component to personalize video summaries to the user. This technology, while relevant to information retrieval research, does not seem directly applicable for advertising executives, since it relies on users stating their preferences explicitly. The summarization system proposed in this thesis will be useful in that existing data on user preferences is leveraged to generate relevant summaries. This could be useful for video-on-demand platforms that have information on the preferences of their users, or for movie studios that wish to advertise tailored movie trailers through other platforms, e.g., Google’s DoubleClick Dynamic Ad Insertion (Marvin, 2016).
2. Literature review
The following section will discuss related literature that has been written on the topic of personalized video summarization and detail the supporting literature that will form the basis for the conceptual framework of this thesis.
2.1 Personalized video summarization
Numerous studies researching personalized video summarization or video abstraction have been conducted. One highly relevant study is by Kannan, Ghinea, and Swaminathan (2015), who propose a novel system that summarizes a movie based on the preferences and interests of the user. Shots and scenes are automatically detected, for which high-level features are semi-automatically annotated. One key difference between this system and the system proposed in this study is the collection of user preferences: their system asks the user for them explicitly to generate a summary, while in this study user preferences are inferred implicitly.
Most of these approaches explicitly obtain users’ preferences for shot-level personalized video summarization.
2.1.1 The Semantic Gap
A problem often encountered in video summarization and movie recommendation is the semantic gap (references). The semantic gap is the gap between the high-level concepts that users expect when searching for interesting multimedia content (e.g., genre, plot, actors) and the low-level features that it is possible to automatically extract from the same content (e.g., brightness, contrast, etc.). This gap represents two research directions, the first being mostly explored by researchers with a background in film theory and the latter being focused on mainly by computer scientists (Hermes & Schultz, 2006).
2.2 Recommendation systems
For the purposes of this research, recommendation system literature will be adapted to select scenes for a personalized trailer. Two main avenues can be found in recommendation system research: content-based recommendation and collaborative filtering.
Content-based RSs create a profile of a user’s preferences by combining feedback on items with the content (i.e., features) associated with them. This feedback, or ratings, can be gathered explicitly (by asking) or implicitly (by analyzing activity). Recommendations are generated by matching the user profile against the features of all items, computing similarity measures with the unknown item (Lops et al., 2011).
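The profile-building and matching steps described above can be illustrated with a minimal sketch. The toy feature vectors, the mean-centering of ratings, and the function names are assumptions made for illustration only; they are not taken from Lops et al. (2011) or from any specific library.

```python
import math

def build_user_profile(ratings, item_features):
    """Sum the feature vectors of rated items, weighted by mean-centered rating.

    Items rated above the user's average pull the profile towards their
    features; items rated below the average push it away.
    """
    mean_rating = sum(ratings.values()) / len(ratings)
    dim = len(next(iter(item_features.values())))
    profile = [0.0] * dim
    for item, rating in ratings.items():
        weight = rating - mean_rating
        for k, value in enumerate(item_features[item]):
            profile[k] += weight * value
    return profile

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(profile, item_features, seen):
    """Rank unseen items by similarity to the user profile, best match first."""
    unseen = [i for i in item_features if i not in seen]
    return sorted(unseen, key=lambda i: cosine(profile, item_features[i]), reverse=True)

# Hypothetical toy catalogue; each vector is [horror, romance, comedy].
item_features = {
    "A": [1, 0, 0],
    "B": [0, 1, 0],
    "C": [1, 0, 1],
    "D": [0, 1, 1],
}
ratings = {"A": 5, "B": 1}  # explicit feedback on two items

profile = build_user_profile(ratings, item_features)
ranking = recommend(profile, item_features, seen=ratings)
print(profile, ranking)
```

In this toy case the user rated a horror item highly and a romance item poorly, so the unseen item with a horror component is ranked ahead of the one with a romance component.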
An example of such an approach is proposed by Deldjoo et al. (2016), wherein a content-based algorithm based on cosine similarity between items was applied to a small dataset of 160 movies to provide recommendations based on low-level visual features. Recommender systems typically use two types of item features, namely high-level features and low-level features, the former expressing semantic properties of media content that are obtained from meta-information from databases, lexicons, reviews, or news articles, and the latter being extracted directly from the media file itself, typically representing design aspects of a movie (such as lighting, colors, and motion). The researchers found that recommendations based on low-level stylistic visual features are better than recommendations based on high-level semantic features, and that low-level features extracted from trailers can be used as an alternative to features extracted from full-length movies in building content-based recommender systems.
The collaborative filtering (CF) approach produces recommendations of items based on patterns of ratings (Koren & Bell, 2015). Using a neighborhood approach, the objective is to find a set of other users whose ratings are similar to the user’s ratings, in order to infer that preferences of the neighborhood are also applicable to the user. There are two approaches to CF: user-user and item-item, the latter of which is considered to perform better (Leskovec et al., 2016). One approach to CF that has been popularized by the recommendation system of Netflix is matrix factorization (MF), which entails that a large matrix of ratings can be expressed as a product of smaller matrices in order to save storage space (Serrano, 2018).
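To make the idea of matrix factorization concrete, the following is a minimal sketch that learns two small factor matrices from observed ratings with stochastic gradient descent. The toy ratings, the hyperparameters (number of factors, learning rate, regularization), and all function names are illustrative assumptions and do not reproduce the Netflix system or any cited implementation. In addition to compressing the rating matrix, the product of the learned factors gives an estimate for ratings that were never observed, which is how such a model is typically used for recommendation.

```python
import math
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn user factors P and item factors Q so that dot(P[u], Q[i]) approximates r."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Estimated rating of user u for item i."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))

# Hypothetical observed ratings as (user, item, rating); unlisted pairs are unknown.
ratings = [
    (0, 0, 5), (0, 1, 3), (0, 3, 1),
    (1, 0, 4), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 3, 5),
    (3, 0, 1), (3, 3, 4),
    (4, 1, 1), (4, 2, 5), (4, 3, 4),
]

P, Q = factorize(ratings, n_users=5, n_items=4)
print("training RMSE:", rmse(ratings, P, Q))
print("estimate for user 1, item 1:", predict(P, Q, 1, 1))
```

Here a 5 x 4 rating matrix is represented by a 5 x 2 and a 4 x 2 matrix; on realistic datasets with many users and items the saving in storage is far larger.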
In extension of their previous work on content-based recommendation, Deldjoo, Elahi, and Cremonesi (2016) propose a recommendation system based on Factorization Machines (a combination of Support Vector Machines and MF) and low-level stylistic features. RSs based on CF often have to be supplemented with side information to maintain a rich set of high-level descriptive attributes about movies for newly released movies, which is often human-generated and prone to biases and errors. Analyzing low-level stylistic features to make recommendations can solve this and can address the problem of a new item being added with no high-level attributes. The results show that recommendations based on low-level visual features achieve almost 10 times better accuracy in comparison to those that are based on high-level features.
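For readers unfamiliar with Factorization Machines, the sketch below shows the standard second-order prediction rule: a bias, a linear term per feature, and a pairwise interaction term in which the weight for each feature pair is the dot product of two latent vectors. The parameter values and the feature vector are hypothetical and hand-picked; this is not the model trained by Deldjoo, Elahi, and Cremonesi (2016).

```python
def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine prediction.

    x  : feature vector (e.g., user, item, and visual features concatenated)
    w0 : global bias
    w  : one linear weight per feature
    V  : one latent vector (length k) per feature
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        # Equivalent to summing <V[i], V[j]> * x[i] * x[j] over all pairs i < j,
        # but computed in time linear in the number of features.
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Hypothetical parameters for three features with two latent dimensions.
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [2.0, 1.0], [3.0, 3.0]]
x = [1.0, 1.0, 0.0]  # only the first two features are active

print(fm_predict(x, w0, w, V))
```

With only the first two features active, the interaction term reduces to the dot product of their latent vectors, so the prediction is the bias plus the two linear weights plus that single interaction.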
2.3 Features
An area of heavy debate within video summarization and recommendation literature is the tradeoff between high-level features and low-level features, the former expressing semantic properties of media content that are obtained from meta-information (e.g., plot, genre, director, actors), and the latter being extracted directly from the media file itself, typically representing design aspects of a movie (such as lighting, colors, and motion). This tradeoff naturally forms a semantic gap problem that has been discussed heavily in the literature.
Much of the video summarization and recommendation literature is guided by the assumption that user preferences are influenced by high-level features to a greater extent than low-level features. For instance,
2.3.1 Low-level features
Recent literature on RSs suggests that consumer preferences when choosing an item are influenced to a greater extent by visual aspects of items and less by their semantic features. Deldjoo, Elahi, Quadrana, and Cremonesi (2018) use low-level visual features extracted using the MPEG-7 standard and a deep neural network (DNN). The MPEG-7 standard extracts visual descriptors of images as color descriptors and texture descriptors. Alternatively, the authors used the activation values of inner neurons of the GoogLeNet DNN as visual features for each key frame. Whereas MPEG-7 features capture stylistic descriptors (i.e., color and texture), DNN features capture semantic content (e.g., objects, people, etc.). In this study, MPEG-7 features generated more accurate recommendations than semantic features (DNN). This could be due to the fact that while a DNN recognizes relevant semantic features (such as actors), it also recognizes non-relevant semantic features, which can create noise in the dataset.
Some studies have attempted to bridge the semantic gap by using both high-level and low-level features. For instance, Hermes and Schultz (2006) automatically extracted features using face detection, cut detection, motion analysis, and text detection, and retrieved background information from the Internet Movie Database (IMDb). Xu and Zhang (2013) use motion analysis, face recognition, sound volume detection, speech and music detection, and low-level features of brightness, contrast, and shot length.
2.3.2 Importance of semantic features
As this research is conducted within the context of marketing, attention has to be paid to which movie trailer features are most indicative of consumers’ willingness to see the movie. In a qualitative exploratory study on New Zealand film audiences by Finsterwalder, Kuppelwieser, and De Villiers (2012), it was found that actors are the greatest influence on film quality expectations, and genre the most important influence on film content expectations. Moreover, consumers who enjoy the music in a trailer may find the film more interesting. Similarly, Karray and Debernitz (2017) found that the appeal of the plot, the number of scene cuts and the inclusion of violent, sexual, or humorous scenes influence the movie’s abnormal returns.
3. Methodology
The objective of this research is to use content-based recommendation to recommend shots to a user that contain similar characteristics to movies that the user has previously rated highly. This entails finding a set of movies the user likes using the MovieLens 20M dataset. For each of those movies, an item profile is to be built, which is a description of the movie based on a number of pre-determined features. From these movies, a user profile will be inferred. For instance, because the user likes movies with Brad Pitt, it will be inferred that the user will prefer shots with Brad Pitt in a movie trailer. Because the user likes horror movies, it will be inferred that the user likes shots that are stylistically similar to horror movies.
Informed by the literature discussed above, the following features will be used to create item profiles for the content-based recommendation system:
1. Actor appearance. As one of the most important influences on film quality expectations, actor appearance should be taken into account as a feature to guide scene recommendation. Actor appearances can be extracted from the IMDb dataset. The underlying logic is that if a user likes multiple movies that feature the same actor, this actor should have a high degree of importance in building a user profile.
2. Genre. As discussed above, genre is an important influence on film content expectations and a widely used feature for segmenting audiences. Genre features are included in the MMTF-14K dataset.
3. Visual descriptors. Low-level visual features have been shown to be very representative of the user’s feelings, according to the theory of Applied Media Aesthetics (Deldjoo et al., 2018). The MMTF-14K dataset has aesthetic descriptors and object and scene descriptors extracted from the FC7 layer of the AlexNet convolutional neural network.
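As an illustration of how the three feature groups listed above could be combined into a single item profile, the sketch below concatenates a binary actor block, a binary genre block, and a block of visual descriptors into one vector. The vocabularies, the placeholder actor names, and the two-dimensional visual vector are hypothetical; actual FC7 descriptors from AlexNet have 4096 dimensions, and the exact layout of the datasets is not assumed here.

```python
def build_item_profile(actors, genres, visual, actor_vocab, genre_vocab):
    """Concatenate actor, genre, and visual features into one vector.

    Actor and genre blocks are binary indicators over a fixed vocabulary,
    so every movie (or shot) ends up with a vector of the same length.
    """
    actor_part = [1.0 if name in actors else 0.0 for name in actor_vocab]
    genre_part = [1.0 if name in genres else 0.0 for name in genre_vocab]
    return actor_part + genre_part + list(visual)

# Hypothetical vocabularies shared by all items.
actor_vocab = ["Actor A", "Actor B"]
genre_vocab = ["Horror", "Romance", "Comedy"]

movie_profile = build_item_profile(
    actors={"Actor A"},
    genres={"Horror"},
    visual=[0.2, 0.7],  # stand-in for a much longer descriptor vector
    actor_vocab=actor_vocab,
    genre_vocab=genre_vocab,
)
print(movie_profile)
```

Because the visual block would be much longer than the other two in practice, some per-block weighting or normalization would probably be needed before computing similarities; that choice is left open here.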
3.1 Movie processing
3.1.1 Video segmentation
The goal is to segment a movie into shots and to select a representative key frame from each shot. IBM’s Multimedia Analysis and Retrieval System (IMARS) (Natsev, Smith, Tešić, Xie, & Yan, 2008) will be used for shot boundary detection. Since a single key frame can represent a shot, the middle frame of each shot will be extracted as its key frame for visual analysis.
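The key frame selection rule can be sketched as follows. The sketch assumes that shot boundary detection yields a list of frame indices at which a new shot starts; the actual output format of IMARS is not known here, so the input format and function names are assumptions.

```python
def shots_from_cuts(cut_frames, total_frames):
    """Turn cut positions into (first_frame, last_frame) pairs, one per shot."""
    starts = [0] + sorted(cut_frames)
    ends = starts[1:] + [total_frames]
    return [(start, end - 1) for start, end in zip(starts, ends)]

def middle_key_frames(shots):
    """Pick the middle frame of each shot as its key frame."""
    return [(first + last) // 2 for first, last in shots]

# Hypothetical clip of 500 frames with cuts detected at frames 120, 300, and 345.
shots = shots_from_cuts([120, 300, 345], total_frames=500)
key_frames = middle_key_frames(shots)
print(shots)
print(key_frames)
```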
3.1.2 Feature Extraction
Given the segmented clips, features are extracted in terms of actor appearance, genre, and visual descriptors.
Actor appearance
Actors are key to a consumer’s expectations of the movie. A good personalized trailer would feature those actors that are most relevant to a user’s interests. The most straightforward way to recognize these actors is face recognition. A face recognition system using Eigenfaces (Turk & Pentland, 1991) will be implemented in OpenCV. Face recognition using Eigenfaces is reported to achieve a recognition accuracy of around 95% (Kannan et al., 2015).
Genre
Specific movie events can correspond to genres; e.g., a romantic shot in a movie should be classified as belonging to the romance genre, so that it is more likely to be recommended to someone who prefers romantic movies. These movie events are to be manually annotated for each shot as they cannot be automatically detected even using the most modern semantic concept detection methodologies (Kannan et al., 2015). This is because of the highly subjective nature of these movie events, and because “the low-level visual features trained for classification are not highly correlated with the corresponding event” (Kannan et al., 2015).
Visual descriptors
To match the available dataset, visual descriptors from the FC7 layer of the AlexNet convolutional neural network will be used. These represent abstract, top-level features that are discovered in each key frame, and are descriptors of color and texture.
3.3 Training process
The datasets that will be used for the training of the recommendation system are called MMTF-14K (Deldjoo), MovieLens 20M (reference), and UC Irvine Machine Learning Lab’s Movie Data Set, which has data on the cast of over 10,000 movies.
3.4 Summarization
During the summarization process, video segments are ranked based on computed similarity measures between the user profile and the movie features. Personalized movie summarization can be seen as “the process of measuring the similarity score of each video segment for the given user preferences and selecting those top ranked segments that will increase the cumulative similarity score of the summary” (Kannan et al., 2015).
First, the similarity between each shot and the user preferences on actor appearance, genre, and visual descriptors is calculated using cosine similarity measures. Each shot is represented as a vector of its features in a high-dimensional space, and the cosine of the angle between two vectors serves as their similarity. User profiles are created in the same feature space from the users' ratings of movies, and the similarity between a shot and a user is computed in the same way. This should return a ranked list of shots to select for that specific user.
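The ranking step, followed by a simple selection of top-ranked shots, can be sketched as below. The greedy selection under a fixed duration budget is an assumption added for illustration, as are the toy vectors, the shot durations, and the function names; the proposal itself only specifies cosine similarity and a ranked list.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_shots(user_profile, shot_features):
    """Return shot ids ordered from most to least similar to the user profile."""
    scored = [(cosine(user_profile, vec), shot_id) for shot_id, vec in shot_features.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [shot_id for _, shot_id in scored]

def select_shots(ranked, durations, budget):
    """Greedily take top-ranked shots that still fit in the trailer length budget."""
    chosen, used = [], 0.0
    for shot_id in ranked:
        if used + durations[shot_id] <= budget:
            chosen.append(shot_id)
            used += durations[shot_id]
    # Shot ids are assumed to be chronological, so sorting restores movie order.
    return sorted(chosen)

# Hypothetical three-dimensional feature space and four shots.
user_profile = [1.0, 0.0, 0.5]
shot_features = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
    2: [1.0, 0.0, 1.0],
    3: [0.0, 1.0, 1.0],
}
durations = {0: 4.0, 1: 3.0, 2: 5.0, 3: 2.0}  # seconds per shot

ranked = rank_shots(user_profile, shot_features)
trailer = select_shots(ranked, durations, budget=11.0)
print(ranked, trailer)
```

A greedy pass like this does not guarantee the best possible cumulative similarity for a given length, but it keeps the link between the ranked list and the final selection easy to inspect.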
3.5 Evaluation
In accordance with previous studies on automatically generated movie trailers, a qualitative user study will be performed to evaluate the summarization system. This presents the “cold-start” problem of recommendation, as there will be no data on the users in question. To alleviate this problem, the most direct way is to make a rapid profile of a new user by asking for explicit ratings after presenting a number of movies to the user.
In an online questionnaire format, 20-50 users will first be given 20 movies to rate, after which they will receive a personalized movie trailer to evaluate. In accordance with Kannan et al. (2015) and Xu and Zhang (2013), users will be asked to evaluate a personalized trailer, a generic trailer and a trailer with randomized shots on the following points: personalization, plot introduction, character introduction, and attendance likelihood.
