Hilary Elfenbein and Nalini Ambady began a new branch of the cultural research by suggesting that there is an “in-group advantage” in the understanding of emotion: participants were generally more accurate in recognizing emotions expressed by members of their own culture than in recognizing emotions expressed by members of another. The experiment was replicated across both positive and negative emotions and tested on non-facial nonverbal channels of emotion such as tone of voice and body language (Elfenbein & Ambady, 2002). Joshua Ackerman and his colleagues furthered this research in 2006 by claiming they had found a “cross-race effect” in a study that asked participants to memorize emotional face stimuli and recall them later. Their results suggested that White participants were more likely to remember angry Black faces than angry White faces, and they explained this with a biological response: White participants found Black faces threatening, and remembering them was an evolutionary mechanism. This research was replicated by Eva Krumhuber and Antony Manstead in 2011 (Krumhuber & Manstead, 2011) and again by Steven Young and Kurt Hugenberg in 2012 (Young & Hugenberg, 2012) using the same stimuli set. JD Gwinn and Jamie Barden argued that in replicating the 2006 work by Ackerman, these two studies failed to validate the stimuli set. They noted that the stimuli contained only four Black subjects whose facial expressions were all quite “unusual”. They re-tested the effect of angry expressions on the memory of White and Black faces with newly designed stimuli and found that angry expressions impaired memory for Black faces compared to neutral expressions, which was contrary to the previous findings. They tested both a White and a Black participant sample, finding similar results. They concluded that the cross-race effect was better explained by stereotype-congruency.
All of the literature discussed thus far in exploring biological, societal, and cultural differences in expression and recognition of emotion uses nearly identical research methods: collecting a set of facial expression stimuli founded in Ekman’s 1972 theory of emotion or creating a new set that is then coded using the same FACS developed in 1978, presenting that to a panel of observers controlling for the variable of interest, asking them a set of questions about the faces presented, and then analyzing the results for significant differences. There is an entirely separate branch of work founded in Ekman’s 1978 FACS research that has sought over time to automate the coding process using machine learning and computer vision. The review suggests that it began in 1992 when Susanne Kaiser and Thomas Wehrle demonstrated a method where small dots were affixed to the faces of participants who were themselves FACS experts expressing various facial emotions. The dot patterns were captured and digitized from the videos using a special algorithm, and an artificial neural network was then used to automatically classify the distances and dot patterns into the separate emotions. In 1997, Curtis Padgett and Garrison Cottrell advanced the neural net classification method by testing three different representation schemes as input to the classifier to compare results (a full face projection, an eye-and-mouth projection, and an eye-and-mouth projection onto random 32×32 patches from the image) (Padgett & Cottrell, 1997). The results suggested that the last of the three schemes achieved 86% generalization on new face images.
During the same year, two other significant contributions were made testing alternative feature sets as input to machine learning classifiers: one by Lanitis, Taylor and Cootes, which used measurements of the shapes of key facial features and spatial arrangements to achieve between 70% and 99% accuracy on a normal test set of 200 images (Lanitis, Taylor & Cootes, 1997), and one by Essa and Pentland, which used estimates of facial motion called optical flow extracted from video slides to achieve similar results (Essa & Pentland, 1997). M.S. Bartlett and colleagues advanced the research in 1999 by successfully feeding a hybrid feature set of facial features and optical flow estimations into a three-layer artificial neural network to automatically detect the presence of facial action units 1 through 7 in a facial image (out of Ekman’s total of 46 from the 1978 research) (Bartlett et al., 1999).
Neural networks remained the method of choice for automatic facial emotion and facial action classification through the 1990s. In 2005, Meulders, De Boeck, Van Mechelen, and Gelman proposed a probabilistic feature analysis to extract the features most relevant to producing an expression, with the goal of identifying a minimal feature set that could more efficiently classify facial emotions. While neural networks remain a popular and effective method for classifying emotion even in recent research (Meng et al., 2016), the literature shows the emergence of other methods that can make more efficient classifications with smaller feature sets, like support vector machines and hidden Markov models. These very same methods were used in 2012 by Jiang, Valstar and Pantic to create a fully automatic facial action recognition system (Jiang et al., 2012).
The methods that I will employ in my research are not focused on the automated recognition of emotion in facial expressions. Instead, I will use FACS-coded faces from the Cohn-Kanade database of tagged facial images (Lucey et al., 2010) to measure how our own emotions affect our ability to perceive emotion in others. I would like to contribute to the current body of literature around societal context by asking the key question: does how we feel impact how we perceive others?
Q6 by Dr. Gregg Vesonder
In Kahneman’s book, System 1 is the term used to explain the part of our brain that makes quick, automatic decisions based only on information from the past. In other words, it is a low-energy decision-making engine and does not bother to expend any energy making decisions using information that is not already known. System 2 is the part of our brain that is capable of making slow, well-thought-out decisions, which often requires an extra expense of energy for critical thinking and for incorporating pieces of information that may not be fully known or understood. Ideally, the two systems work in harmony: when System 1 requires a little more thinking power to make a decision, it turns to System 2 for processing. The theory suggests that all illogical decision making comes from cases where this harmony does not exist (Kahneman, 2011).
Kahneman explains that System 1 is easily influenced, impatient, impulsive, and more driven by emotion than System 2. When System 1 is fired up or under load (e.g. from emotions), System 2 tends to fail to override it and performs poorly. In addition, every time we have an emotional experience, we are providing System 1 with more information that it will automatically use to make a quick decision in the future. So, even if System 1 is not under load at the time of decision making, emotional experiences in the past are still influencing the decision-making process of our ‘autopilot’ System 1 which, Kahneman writes, actually makes the majority of our decisions even when we believe we are making rational decisions with System 2. I believe that emotional content and emotional experiences heavily influence our decision making, even if we are not emotional in the moment.
My hypotheses do not presently take this into account, but perhaps asking the participants to think about which faces are exhibiting certain emotions might actually be considered a System 2 task as it requires some level of thinking and careful examination. Based on this, it would be interesting to test if priming a subject with an emotional stimulus (i.e. suppressing System 2) in advance of completing the questionnaire would significantly alter emotion perception and the results.
Q7 by Dr. Gregg Vesonder
Gestalt Principles state that a whole is greater than the sum of its parts, or in other words, that the whole picture tells a different story than any individual piece. The concept of figure and ground explains that we have a perceptual tendency to separate parts or “figures” out from their background based on traits such as shape, color, or size. The focus in any moment is on the figure. The ground is simply the backdrop. Sometimes this is a stable relationship, but sometimes (in an unstable relationship) our attention shifts such that what was formerly the figure is now the ground, and vice versa (Grais, 2017). In the example presented in the question text, a smile might be considered unstable. We may perceive it as “happy” when it is presented in a blank context, or if the individual is sitting on the beach with their family. But we may perceive it as an altogether different emotion if that same smile is on the face of a shooter holding a gun.
Similarly, the Gestalt concept of “Proximity” explains that objects that appear close together appear to form groups. A smile alone may require more thinking to decide whether it is actually a “happy” emotion being shown than a smile among 11 other smiles or a group of people who are smiling in a photo. Context in this way does not have to be environmental with a single figure, but can also include multiple figures that exhibit some similar features.
Gestalt theory also explains, in the concept of “Similarity”, that we tend to group together things that share similarities (e.g. shape, color, size). We have grown to recognize a smile as a smile and a frown as a frown through exposure to hundreds if not thousands of past interactions with individuals who have exhibited those facial expressions. When confronted with a new expression, we compare features of that new expression to those from the past and classify it based on similarities. The features may not be as simple as “color” and “shape” and in fact may be quite complex, comprising over a hundred unique features we cannot describe individually, but the concept holds. In this way, past context can affect present perception of emotion.
In all of these cases, the common theme is that context absolutely affects our perception of emotion. When we test our theories, we should consider the context not only of the stimuli we are asking participants to tag but perhaps even the context of the participant. Do I perceive emotion differently when I am in the comfort of my own home versus just before leaving the office after a stressful day at work? Even when presented with an identical image of a smile, my own context might alter my response.
Q8 by Dr. Gregg Vesonder
Yes. The primary common theme is that each of these items will invoke an emotional response in us. In 2017, Schindler et al. conducted a meta-review of the literature around extant measures of emotional response to stimuli from various domains, ranging from film, music and art to consumer products, architecture, and physical attractiveness, and developed a new assessment tool called the Aesthetic Emotions Scale (AESTHEMOS) designed to measure the stimuli’s perceived aesthetic appeal from any of these domains (Schindler et al., 2017). What they discerned from their literature review is that extant measures of emotion have become very domain-focused because the way we respond emotionally to, say, a landscape, is different than how we might respond to a piece of music (i.e. the collection of emotions invoked are typically different), but both responses are emotional ones. They call these responses “aesthetic emotions”.
While AESTHEMOS focuses on creating a domain-agnostic assessment tool for measuring emotional response to stimuli, one contribution of my research would be to measure how our own emotions affect our perception of stimuli from these various domains, just as we do with faces. For instance, if the literature review conducted by Schindler et al. suggests that different combinations of emotions are invoked by stimuli from different domains, then what does feeling angry do to our perception of the world around us? Are we less likely to enjoy art and music, or will we feel more enjoyment (happiness) from certain types of art and music in those cases because they provide an outlet?
I say: yes, there are common themes in the perception of all of these stimuli in that they all invoke an emotional response in us. And I hypothesize that our state of emotion affects our perception of them, and therefore their effect on us, in different ways depending on the domain that they come from.
Q9 by Dr. Gregg Vesonder
The raw data is collected in a single flat table with 4 main sections and a total of 137 columns. First, every submission has a Session ID (unique to the user and device) along with the participant’s Age, Gender, whether or not they identify as a Native English Speaker, and their baseline self-rated emotion responses (Happy, Sad, Angry, Afraid, Surprised).
Following this there is a long series of columns containing the 8 images that the user was shown for each emotion (since they are presented from a random pool) and whether or not the user flagged each image (e.g. “Tap the faces that look ‘happy’”). We do this for each of the 5 emotions and twice more for “NOT Happy” and “NOT Sad”, resulting in a total of 112 columns containing this data.
The third component is the same user’s self-rated emotion responses on a scale of 1-5 after they have been asked to tag all of the faces, to see if playing the ‘game’ has had any impact on emotion.
And finally there is a series of time stamps indicating the time that the user submits each task, designed to see if there is variation in response time depending on the emotional responses.
Before any quantitative analysis, data processing will be applied to calculate some additional features:
(1) For each face tagged (there are 56), we will compare the user’s response to the label in the already-tagged Cohn-Kanade database from which the face comes (Lucey et al., 2010) to see if the user was “correct”. This will generate 56 new features indicating, for each face, whether the user correctly identified the dominant emotion.
(2) For each emotion (there are 5) and each “non” emotion (there are 2) we will tally the total number of responses correct and incorrect, as well as the overall total correct and incorrect. This will generate 16 new features.
(3) For each time stamp, we will calculate the time (in seconds) each participant took to complete the step, as well as the total time to completion. This will generate 10 new features.
(4) Each participant will also be placed in an age group: (18 to 24), (25 to 44), (45 to 64), (65 and over) based on those used by the US Census Bureau.
(5) For each participant, we will create 5 new binary features, each representing a positive or negative flag for feeling each emotion. For example, if a participant responds 1-2 (low) on the ‘sad’ scale, they will be considered “not sad”. If they respond 3-5 (mid-high) on the ‘sad’ scale, they will be considered “sad”.
Our dataset will now have a total of 225 columns. 26 of those features are of interest for statistical analysis (those generated in (2) and (3)) and the remainder will be treated as explanatory variables or preserved for exploratory hypotheses.
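To make preprocessing steps (1), (4) and (5) concrete, here is a minimal Python sketch rather than the actual implementation. The function names, the `is_<emotion>` naming scheme, the handling of participants under 18, and the rule for the "NOT" tasks are all assumptions made for illustration; only the thresholds, age groups and column counts come from the description above.

```python
from typing import Dict, Optional

EMOTIONS = ["happy", "sad", "angry", "afraid", "surprised"]

# Column counts as stated in the text
RAW_COLUMNS = 137
DERIVED_COLUMNS = {
    "per_face_correct": 56,  # step (1)
    "tallies": 16,           # step (2)
    "completion_times": 10,  # step (3)
    "age_group": 1,          # step (4)
    "emotion_flags": 5,      # step (5)
}


def face_correct(flagged: bool, face_label: str, target: str, negated: bool = False) -> bool:
    """Step (1): compare one tap decision against the database label.

    Assumption: in a "NOT <emotion>" task, a face should be flagged
    exactly when its label differs from the target emotion.
    """
    should_flag = (face_label != target) if negated else (face_label == target)
    return flagged == should_flag


def age_group(age: int) -> Optional[str]:
    """Step (4): map an age in years to one of the four groups in the text."""
    if age < 18:
        return None  # assumption: under-18s fall outside the listed groups
    if age <= 24:
        return "18 to 24"
    if age <= 44:
        return "25 to 44"
    if age <= 64:
        return "45 to 64"
    return "65 and over"


def emotion_flags(ratings: Dict[str, int]) -> Dict[str, int]:
    """Step (5): ratings of 1-2 give a 0 flag, ratings of 3-5 give a 1 flag."""
    return {f"is_{emotion}": int(ratings[emotion] >= 3) for emotion in EMOTIONS}


def total_columns() -> int:
    """Raw columns plus every derived feature listed in steps (1) to (5)."""
    return RAW_COLUMNS + sum(DERIVED_COLUMNS.values())
```

For example, `emotion_flags({"happy": 2, "sad": 3, "angry": 1, "afraid": 5, "surprised": 4})` marks the participant as sad, afraid and surprised, and `total_columns()` reproduces the 225 columns quoted above.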
Before any quantitative statistical analysis is performed, a qualitative assessment of the data will be conducted. Histograms will be generated for age group, gender, and the native English speaker flag to look for any anomalies or outliers in the distributions that should be removed prior to formal analysis. The way the application is designed should not allow for any missing data. Rows with empty cells or empty responses will therefore be removed before statistical analysis, as they indicate a system error or abandonment of the questionnaire; preliminary data collection suggests that such cases should be sparse (<5%).
The distributions of the features calculated in preprocessing will be checked for normality to assess the appropriate statistical test method for comparing between-group responses.
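One way such a check could drive the choice of test is sketched below. This is an illustration under assumptions, not the study's procedure: it uses the Shapiro-Wilk test from SciPy, and the fallback to a Mann-Whitney U test for non-normal data is my assumption, since the text does not name a non-parametric alternative.

```python
from scipy import stats


def looks_normal(values, alpha: float = 0.05) -> bool:
    """Shapiro-Wilk test: True when normality is NOT rejected at level alpha."""
    _, p_value = stats.shapiro(values)
    return bool(p_value > alpha)


def choose_test(group_a, group_b, alpha: float = 0.05) -> str:
    """Pick a parametric test only when both groups look normal."""
    if looks_normal(group_a, alpha) and looks_normal(group_b, alpha):
        return "student_t"
    # Assumed fallback for skewed or otherwise non-normal data
    return "mann_whitney_u"
```

Because the tallies are small integer counts, a formal normality test may reject often; inspecting the histograms alongside it is probably wise.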
An exploratory data analysis will be conducted to produce summary visualizations of the responses. Visualizations that explain the average number of correct/incorrect responses and average response times by age group, gender, English speakers, and emotional baseline will be created to tell a data story and present the results in summary. The visualizations will also aid in identifying any outliers or particularly interesting patterns.
The Spearman and Pearson partial correlations and their statistical significance will be calculated between the participants’ emotional responses (i.e. Happiness level) and the number of correct/incorrect responses to each emotion and correct/incorrect responses overall. We will calculate these over all ages and genders, as well as within gender and age groups.
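A partial correlation can be computed by regressing the covariates out of both variables and correlating the residuals. The sketch below shows that approach with NumPy and SciPy. The function names are hypothetical, categorical covariates such as gender are assumed to be numerically coded already, and the Spearman variant (rank-transform first, then proceed as for Pearson) is one common convention rather than the only one.

```python
import math

import numpy as np
from scipy import stats


def residualize(y, covariates):
    """Remove the linear effect of the covariates (plus an intercept) from y."""
    y = np.asarray(y, dtype=float)
    Z = np.column_stack([np.ones(len(y)), np.asarray(covariates, dtype=float)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta


def partial_corr(x, y, covariates, method: str = "pearson") -> float:
    """Correlation between x and y after controlling for the covariates."""
    if method == "spearman":
        # Rank-transform every variable, then proceed as for Pearson
        x = stats.rankdata(x)
        y = stats.rankdata(y)
        covariates = np.apply_along_axis(
            stats.rankdata, 0, np.asarray(covariates, dtype=float)
        )
    rx = residualize(x, covariates)
    ry = residualize(y, covariates)
    return float(np.corrcoef(rx, ry)[0, 1])


def partial_corr_pvalue(r: float, n: int, n_covariates: int) -> float:
    """Two-sided p-value from a t distribution with n - 2 - n_covariates df."""
    df = n - 2 - n_covariates
    t_stat = r * math.sqrt(df / (1.0 - r * r))
    return float(2.0 * stats.t.sf(abs(t_stat), df))
```

Here `x` could be a participant's happiness rating, `y` the number of correct responses, and `covariates` an array with one column each for age and coded gender. The p-value for the rank-based version is only an approximation.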
A Student’s t-test will be conducted to test for statistically significant differences in the number of correct/incorrect responses to all emotions for each emotion group (e.g. does the number of correct “happy” responses differ between the “happy” and “not happy” groups?).
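As an illustration of this comparison, the following sketch runs an independent two-sample Student's t-test with SciPy. The row format and the `is_<emotion>` and `correct_<emotion>` keys are hypothetical stand-ins for the derived features, not the real column names.

```python
from scipy import stats


def compare_correct_counts(rows, emotion: str = "happy"):
    """Compare correct-response counts between the flagged and unflagged groups.

    Each row is a dict with an `is_<emotion>` flag (0 or 1) and a
    `correct_<emotion>` tally; both key names are assumptions.
    """
    flag, tally = f"is_{emotion}", f"correct_{emotion}"
    in_group = [row[tally] for row in rows if row[flag] == 1]
    out_group = [row[tally] for row in rows if row[flag] == 0]
    # equal_var=True gives the classic Student's t-test (pooled variance)
    result = stats.ttest_ind(in_group, out_group, equal_var=True)
    return result.statistic, result.pvalue
```

Passing `equal_var=False` instead would give Welch's variant, which may be preferable if the two groups turn out to have very different sizes or variances.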
A one-way Analysis of Covariance (ANCOVA) test will also be conducted to compare the dependent variable (number of correct responses to each emotion) between emotion groups (i.e. “happy” vs. “not happy”) while including (1) age, (2) gender, and (3) native English speaker as covariates.
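For a two-level group factor, an ANCOVA can be expressed as a comparison of two nested linear models: one with only the covariates and one that adds the group indicator. The sketch below follows that formulation using NumPy and SciPy. The function name is hypothetical, and gender and the native-speaker flag are assumed to be coded as 0/1 columns before being passed in.

```python
import numpy as np
from scipy import stats


def ancova_group_effect(y, group, covariates):
    """F-test for a 0/1 group factor after adjusting for the covariates."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    covariates = np.asarray(covariates, dtype=float).reshape(n, -1)
    # Reduced model: intercept + covariates; full model adds the group flag
    reduced = np.column_stack([np.ones(n), covariates])
    full = np.column_stack([reduced, np.asarray(group, dtype=float)])

    def residual_sum_of_squares(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        return float(residuals @ residuals)

    rss_reduced = residual_sum_of_squares(reduced)
    rss_full = residual_sum_of_squares(full)
    df_num = 1                    # one extra parameter for the group flag
    df_den = n - full.shape[1]    # residual degrees of freedom, full model
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    p_value = float(stats.f.sf(f_stat, df_num, df_den))
    return f_stat, p_value
```

A small p-value suggests the emotion group still explains variation in the number of correct responses once the covariates are accounted for. This sketch does not check the usual ANCOVA assumption of equal regression slopes across groups.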
Finally, unsupervised learning techniques will be used to identify clusters of participants with similarities in their emotional responses that are too complex to be obvious to the human eye. K-means clustering with varying values of k will be employed on the participants’ responses to the emotional questionnaire (5 features), and the elbow method will be used to identify the optimal k. DBSCAN (density-based spatial clustering of applications with noise) will also be used to generate emotion clusters from the same responses. Then, ANOVA and ANCOVA tests will be performed once more to compare the number of correct/incorrect responses between these new complex “emotion clusters” generated by k-means and DBSCAN, and the results compared.
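The clustering step could look roughly like the following scikit-learn sketch. The `eps` and `min_samples` values are placeholders I chose for illustration, not tuned settings, and the input `X` is assumed to be an array with one row per participant and one column per self-rated emotion.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def inertia_curve(X, k_values):
    """Within-cluster sum of squares for each candidate k (the elbow curve)."""
    return {
        k: float(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
        for k in k_values
    }


def dbscan_labels(X, eps: float = 0.8, min_samples: int = 5):
    """Density-based cluster labels; points labelled -1 are treated as noise."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```

The elbow is usually read off a plot of `inertia_curve(X, range(1, 11))`: the chosen k is where adding another cluster stops reducing the inertia substantially. Unlike k-means, DBSCAN does not need k in advance, but its result depends heavily on `eps`.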
Each of these quantitative analyses will generate a large number of results, but all of them will be produced algorithmically so that significance levels and correlations can be compared easily in the end.
Q10 by Dr. Babak Heydari
There are various machine learning methods that are popularly used for classification problems like the one in question. Deciding on the most effective approach is usually a function of things like the size of the sample set, the dimensionality of the feature space, whether or not we believe the data is linearly separable, and any underlying assumptions the method might make about the distribution of the data. A few of the more popular methods are discussed here as well as advantages and disadvantages of each and the reason for final selection.
Logistic Regression – One of the simpler and more traditional approaches, and often a good place to start, logistic regression makes predictions by computing the probability that a dependent variable falls into a specific category, obtained by passing a linear function of the independent variables through the logistic (sigmoid) function. While one of its advantages is its simplicity, it assumes that the log-odds of the outcome are linear in the features and works best when the feature space is close to linearly separable. There are few disadvantages to starting out with Logistic Regression in a new classification problem and then trying more advanced methods from there.
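The prediction step of logistic regression is small enough to write out directly. In this sketch the weights and bias are assumed to have been fitted already; the function names are illustrative.

```python
import math
from typing import Sequence


def predict_probability(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Probability of the positive class: sigmoid of a linear function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Two branches keep exp() from overflowing for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(features, weights, bias, threshold: float = 0.5) -> int:
    """Assign class 1 when the predicted probability reaches the threshold."""
    return int(predict_probability(features, weights, bias) >= threshold)
```

With a threshold of 0.5 the decision boundary is the set of points where the linear function equals zero, which is why the method suits classes that a straight line or hyperplane can separate.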
Naïve Bayes – This classifier is based on Bayes’ theorem, which works on conditional probability: the probability that something will happen given that something else has already occurred. Given this, we can calculate the probability of an event from prior knowledge. The Naïve Bayes classifier assumes this holds true for the data we are using to make our prediction. It also assumes that all of the features in the data set are independent of each other. This can be a disadvantage if learning the relationships between features would provide more accurate classification, since it is unable to do so. However, it is fast, simple, and highly scalable. It also works well with categorical data, including cases where the data is not linearly separable.
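To show how the independence assumption is used, the sketch below computes class posteriors by multiplying each class prior by the per-feature likelihoods and then normalizing. The probability tables are taken as given here; in practice they would be estimated from training counts. All names and example values are hypothetical.

```python
from typing import Dict, List, Sequence


def naive_bayes_posterior(
    priors: Dict[str, float],
    likelihoods: Dict[str, List[Dict[str, float]]],
    observed: Sequence[str],
) -> Dict[str, float]:
    """Posterior P(class | features) under the naive independence assumption.

    likelihoods[label][i][value] holds P(feature i = value | label).
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature_index, value in enumerate(observed):
            # Independence lets us multiply the per-feature likelihoods
            score *= likelihoods[label][feature_index][value]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}
```

With equal priors and two features that each favor one class at 0.8 versus 0.2, observing both features gives that class a posterior of 0.64 / (0.64 + 0.04), or about 0.94.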
K Nearest Neighbors (KNN) – The KNN algorithm predicts a class based on the feature similarity of the test data to the existing (training) data. The advantage of this is that it is a non-parametric method, meaning it makes no prior assumptions about the distribution of the data and is therefore very helpful when we have no prior knowledge and need to let the structure of the data speak for itself. It works very well in real-world cases, and because there is no (or very minimal) formal “training” period, it is generally very fast to set up, although each prediction requires comparing the new item against the stored training data. However, because it makes the prediction based on the “nearness” of similar items, it requires that we come up with a meaningful measure of distance, which can be a challenge depending on the type of data we are working with. For the same reason that it is insensitive to outliers, it is very sensitive to irrelevant features inappropriately included in the measure of distance.
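A bare-bones version of the idea fits in a few lines of Python. This sketch assumes numeric features and plain Euclidean distance, which is exactly the kind of distance choice the paragraph above warns may not suit every data set.

```python
import math
from collections import Counter
from typing import Sequence


def knn_predict(train_X: Sequence[Sequence[float]], train_y: Sequence[str],
                query: Sequence[float], k: int = 3) -> str:
    """Predict the majority label among the k training points nearest to query."""
    # Sort all training points by Euclidean distance to the query
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

There is no training step at all: the whole data set is scanned at prediction time. Features on very different scales would need to be standardized first, otherwise the largest-scale feature dominates the distance.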
Support Vector Machines – SVMs separate the data into classes by maximizing the margin between classes using what are called “support vectors”. There are linear SVMs as well as non-linear SVMs for cases where it is not possible to separate the training data using a hyperplane (in other words, the boundary the SVM creates doesn’t have to be a straight line). The benefit of non-linear SVMs is that they can capture much more complex relationships between classes, but at the cost of being computationally expensive. Because they do not make any strong underlying assumptions about the data and because of their ability to capture complex relationships, they often provide some of the best classification performance for real-world classification problems when simpler methods do not produce acceptable performance.
Decision Trees & Random Forests – Decision Trees use a branching methodology to make predictions, just as the name would suggest. Each “branch” of the tree represents a decision made based on a prior decision, and a “leaf” node at the end of a branch represents a predicted class. They help make decisions under uncertainty, and also provide a nice visual representation of a decision situation (like deciding between classes). They also work well on categorical or even mixed data since they do not make any assumptions about the data or linearity. However, the accuracy of decisions generally goes down as the dimensionality of the features goes up, and they generally do not work well for high-dimensionality data sets. Random Forests generate multiple decision trees with different random samples of the data and then use the “most popular” prediction as the final output.
Artificial Neural Network & Deep Nets – Finally, artificial neural networks represent an entire branch of research that uses simulations of biological neural networks to make decisions or predictions using data. The basic anatomy of an ANN consists of an input layer containing the feature set that is being used to make predictions, an output layer which contains one or more “nodes” representing an output of the network (this could be, for example, multiple classes), and a series of hidden layers which transform the input to the output data. Nodes are connected by weighted links, and a neuron that reaches a threshold “fires” to the nodes in the next layer. We compare the output to that in the training set, adjust the weights to reduce error, and then make another guess. An ANN keeps doing this until the error can no longer be meaningfully reduced. “Deep Learning” networks are simply ANNs with a much higher number of hidden layers. ANNs are very computationally expensive, but work well when the feature space is complex and generalized decisions need to be made by detecting patterns that may or may not be detectable by humans. They have been shown to work very well in computer vision applications, and are popular in facial recognition and emotion detection as seen in the literature. However, due to their complexity and computational intensity, I will not be using ANNs in this response.
Selected Model: For this problem, I have ruled out traditional methods like Logistic Regression and Naïve Bayes and the more advanced and computationally intense methods of ANNs and Deep Learning networks. Decision Trees will become too complex with the high dimensionality of continuous variables, and while the KNN approach may also provide good performance, coming up with a meaningful definition of distance may be difficult. For an implementation with relatively good performance and moderate complexity, I will be implementing a linear Support Vector Classifier for this problem.
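A linear Support Vector Classifier of this kind could be set up with scikit-learn roughly as follows. This is a sketch under assumptions rather than the implementation used: the feature matrix `X` and labels `y` are placeholders for whatever the problem supplies, and the standardization step, `C=1.0`, `max_iter` and 5-fold cross-validation are illustrative defaults I chose.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def build_classifier(C: float = 1.0):
    """Standardize the features, then fit a linear support vector classifier."""
    return make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=10000))


def evaluate(X, y, folds: int = 5) -> float:
    """Mean cross-validated accuracy of the pipeline on (X, y)."""
    scores = cross_val_score(build_classifier(), X, y, cv=folds)
    return float(np.mean(scores))
```

Scaling inside the pipeline means the scaler is refitted on each training fold, so no information from the held-out fold leaks into training. Smaller values of `C` apply stronger regularization and allow a wider margin with more misclassified points.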
If you use part of this page in your own work, you need to provide a citation, as follows:
Essay Sauce, Cultural differences in emotion recognition and expression. Available from:<https://www.essaysauce.com/psychology-essays/cultural-differences-in-emotion-recognition-and-expression/> [Accessed 23-09-19].