Named-entity recognition is a subset of information extraction in natural language processing. It creates structure from unannotated text by identifying proper nouns as informational entities. A named entity is a phrase that identifies a real-world object that has a proper name. This includes personal names, geographic locations, organizations, and numeric expressions such as times, dates, and monetary amounts. Named-entity recognition, or NER, tags and categorizes words, which helps systems disambiguate and create meaning in text. Named-entity recognition offers better analytics for text because it can easily reveal who and what the information is about.
As the stepping stone for future analysis of text, named-entity recognition has numerous applications in various fields. NER models scan a corpus of data, recognize a string of text as an entity, and then understand the meaning of the entity within the context in which it is used. In news reporting, named-entity recognition helps classify content for news sources by scanning entire articles and revealing the main entities (people, organizations, and places). Tags can be created for each entity, which helps categorize news articles. This is useful because tagged news articles can be linked with each other to create a hierarchy, which makes related content easier to find. Similarly, named-entity recognition is applicable to other content-rich services such as search engines and Netflix. For example, on Netflix, NER might be used to extract movie descriptors from a list of watched movies and recommend other movies that have the “most similar entities mentioned in them” (Konkol, 2015). Furthermore, named-entity recognition is important for research articles. A single research topic can have hundreds of published papers, and without any structure, finding particular information can be daunting. With named-entity recognition, tags on entities common to a topic can be used to narrow the scope of a researcher's search. Lastly, named-entity recognition is also applicable to social media platforms. Since content on social media is abundant and user-generated, analyzing it provides a platform to “understand events, opinions and preferences of groups and individuals” (Moon et al., 2018). These are just some of the ways named-entity recognition can be pertinent to several industries.
This literature review will concentrate on named-entity recognition techniques in the domain of social media. Recognizing entities in social media differs greatly from recognizing entities in formal pieces of writing. Formal texts, including press releases, published works, news articles, and research papers, use natural language that follows specific language patterns and grammar rules. However, social media content may not follow all of these language rules because it exhibits a different writing style. Social media text is typically informal and freeform and may be associated with an image. For example, correct capitalization and punctuation are always assumed in formal writing but may not always exist in social media posts (Liu et al., 2013). Many early named-entity recognition algorithms focused on extracting information from formal texts because of their more rigid rules and the widespread availability of source material. As a result, social media entity recognition and extraction techniques improve on conventional named-entity recognition strategies. Various research has experimented with the viability of different methods. Twitter is often used in research as the social media platform to study because it has a large user base, limits the length of a single post, and has easily accessible data for analysis. Named-entity recognition in social media also has to account for context. Social media posts commonly use informal abbreviations, slang, and misspellings that are understandable to most users but not to algorithms parsing through text. Posts can be context dependent, based on previous posts, and short (Ritter et al., 2011). With these differences, recognizing entities in social media is a challenging task.
There are several applications of named-entity recognition in social media. When entities are recognized, current events such as natural disasters or political updates can be analyzed to assess a situation and determine future responses. Additionally, it allows for tracking user interests in cultural phenomena, including celebrities, notable figures, movies, books, songs, or TV shows. Therefore, named entity recognition can be used to identify current trends and opinions over specific periods of time.
Problems with Named-Entity Recognition in Social Media
While named-entity recognition in social media has interesting use cases, prospective NER models face problems involving ambiguity and feasibility. Most social media data is in the form of “misspelled ungrammatical short sentence fragments,” so newer methods to understand a string of text need to be developed (Gattani et al., 2013). Additionally, the shorter length of a single post provides less meaning and context. For example, a tweet that just says “Go Panthers!” is harder to understand because “Panthers” could mean the Carolina football team or the animal. Without context, a tweet may be tagged incorrectly, leading to an inaccurate analysis or conclusion. Social media also includes instances of named entities that are not typically found in print material, such as popular-culture terms. Therefore, models that process this data need to be trained from the very beginning to recognize more types of entities and to use “social signals,” which account for what is happening in the surrounding social environment when content is posted to social media (Gattani et al., 2013). Trending topics on social media can correspond to increased traffic on other popular websites such as Wikipedia and to increased search engine queries. Another central issue is the efficiency of a technique. There needs to be a balance between accuracy and speed when performing named-entity recognition. Sometimes computation-heavy solutions work well in an experimental setting with a smaller data set but do not scale to “high-speed tweet streams of 3000-6000 tweets per second” (Gattani et al., 2013). Many models utilize manual tagging to classify data, which is unfeasible on a larger scale in a real-time situation. Proposed NER models attempt to resolve these issues.
The general approaches to recognizing named entities are supervised, semi-supervised, and unsupervised learning. These approaches require varying amounts of training data and human input. Training data consists of correct examples of phrases and their associated tags and is usually manually annotated by humans.
Supervised learning relies on a large corpus of text that is heavily annotated. Annotations are based on orthographic and grammar rules and the topic an entity falls under. Capitalization is useful in recognizing entities since most entities begin with a capital letter, are completely capitalized, or are phrases using mixed case (Nadeau, 2009). Punctuation is another significant indicator; many named entities end with a period or use internal periods within the phrase. Internal apostrophes, hyphens, and ampersands also help define these entities. Supervised learning also uses tagging to further annotate a text. Content can be specifically tagged using part-of-speech tagging algorithms or by designated categories of subjects. Supervised learning analyzes the given training data and uses the given rules to infer that certain phrases are entities. While this approach is accurate for entities that have previously been recognized, it is lacking in recognizing instances of named entities that are new and unseen. It relies heavily on humans manually tagging entities for future use, which is not very efficient.
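The orthographic cues described above can be sketched as a small per-token feature extractor. The following is an illustrative sketch only, not the feature set of any cited system:

```python
import re

def orthographic_features(token):
    """Compute a few orthographic cues commonly used as supervised NER features."""
    return {
        "init_cap": token[:1].isupper() and token[1:].islower(),            # "Panthers"
        "all_caps": len(token) > 1 and token.isupper(),                     # "NASA"
        "mixed_case": re.search(r"[a-z][A-Z]", token) is not None,          # "iPhone"
        "internal_period": re.fullmatch(r"(?:[A-Za-z]\.)+", token) is not None,  # "U.S."
        "entity_punct": any(c in token for c in "-&'"),                     # "Coca-Cola", "AT&T"
    }

print(orthographic_features("Panthers"))
```

In a real supervised model, features like these would be fed, together with tags from the annotated corpus, into a statistical classifier rather than used directly.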
On the other hand, a semi-supervised learning approach utilizes a data set that is partially labeled. Semi-supervised learning models need only a small amount of labeled information to use as a starting point (Nadeau, 2009). Afterwards, they can learn dynamically. Only a small amount of supervision, or human effort, is needed compared to a supervised learning approach. Given a smaller amount of data, such as examples fitting different categories of entities, a semi-supervised learning model can begin a learning process to identify other instances of an entity. These models can search a corpus for sentences that contain the seed named entities and identify contextual clues that are common across the given examples. Then, they can find other instances of the same type within similar contexts. This process can be repeated with the newly identified entities to learn more contextual clues and increase the number of entities associated with a particular subject. With this approach, semi-supervised learning is a more advanced, flexible, and efficient way to recognize named entities than supervised learning because it does not simply rely on past rules. Current models are increasingly accurate in learning and recognizing new named entities.
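The bootstrapping process just described can be sketched as a toy loop. The mini-corpus, the seed entity, and the use of only the immediately preceding word as a contextual clue are all simplifying assumptions for illustration:

```python
def bootstrap(sentences, seeds, rounds=2):
    """Grow a seed entity set by alternating pattern learning and matching."""
    entities = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # 1. Learn contextual clues from sentences containing known entities.
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in entities and i > 0:
                    patterns.add(tokens[i - 1])
        # 2. Apply the clues to propose new capitalized entity candidates.
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if i > 0 and tokens[i - 1] in patterns and tok[:1].isupper():
                    entities.add(tok)
    return entities

corpus = [
    "visited Paris last week".split(),
    "visited Tokyo in May".split(),
    "flew to Tokyo yesterday".split(),
]
print(sorted(bootstrap(corpus, {"Paris"})))  # ['Paris', 'Tokyo']
```

Starting from the single seed "Paris", the loop learns the context "visited", which proposes "Tokyo", which in turn yields the new context "to"; each round the entity and pattern sets can grow, exactly as the paragraph describes.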
While supervised and semi-supervised learning models fully or partially rely on a given annotated set of data, unsupervised learning models do not use any training data. Unsupervised learning algorithms create structure and recognize named entities from completely unlabeled data. This can be done using a clustering technique, which groups similar named entities together. It relies on lexical resources, lexical patterns, and statistics computed on a large, unannotated corpus (Nadeau, 2009). WordNet is an example of a lexical resource used in this type of strategy. Unsupervised learning models do not require any human input, but their success depends on the resources that they use.
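As a crude stand-in for the context-based clustering described above (not any cited system), the sketch below groups capitalized tokens by the word that precedes them; the mini-corpus is invented:

```python
from collections import defaultdict

def cluster_by_context(sentences):
    """Group capitalized tokens that share the same preceding context word."""
    clusters = defaultdict(set)
    for tokens in sentences:
        for i in range(1, len(tokens)):
            if tokens[i][:1].isupper():          # crude entity-candidate test
                clusters[tokens[i - 1]].add(tokens[i])
    return dict(clusters)

mini_corpus = [
    "president Obama spoke".split(),
    "president Lincoln wrote".split(),
    "in Berlin today".split(),
]
print(cluster_by_context(mini_corpus))
```

Entities sharing the context word "president" fall into one group; a real unsupervised system would use far richer statistics and lexical resources to form such clusters.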
Current research regarding NER in social media uses any of these three learning approaches. Most recent research focuses on developing and testing new models to find named entities and then analyzing them. Twitter is often used as the main source of social media content, and tweets are collected using APIs and scanners. The text in social media data is typically preprocessed by normalizing tokens in posts, removing hyperlinks, and replacing each string with its formal form, if possible (Liu et al., 2013). This step helps reduce some ambiguity and simplifies the processing required by the proposed NER algorithm.
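A minimal version of this preprocessing step might look like the following sketch; the `SLANG` dictionary and the example tweet are invented for illustration:

```python
import re

# Tiny invented lookup table mapping informal tokens to their formal forms.
SLANG = {"u": "you", "gr8": "great", "pls": "please"}

def preprocess(tweet):
    """Strip hyperlinks and normalize informal tokens before NER."""
    tweet = re.sub(r"https?://\S+", "", tweet)          # remove hyperlinks
    tokens = tweet.split()
    tokens = [SLANG.get(t.lower(), t) for t in tokens]  # replace informal forms
    return " ".join(tokens)

print(preprocess("u should watch this gr8 movie http://t.co/abc"))
# -> "you should watch this great movie"
```

Real systems would also handle hashtags, @-mentions, and misspellings, but the goal is the same: hand the NER algorithm cleaner, more formal text.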
KnowItAll is an unsupervised learning system developed by researchers at the University of Washington. It extracts information in two stages (Etzioni et al., 2005). This is done without any manual tagging by using three distinct methods to recognize and classify named entities: KnowItAll uses Pattern Learning to learn category-specific patterns and rules to extract entities, Subclass Extraction to identify subclasses of a broader entity, and List Extraction to create lists of instances of entities. It further uses a search engine to validate that a string is a named entity, “treating the Web as a massive corpus of text” (Etzioni et al., 2005). The system has two parts: an Extractor and an Assessor. The Extractor uses given rules to create possible entity strings, searches for them using search engines, and uses the rules to extract information from the resulting web pages. The Assessor then computes the probability that the extracted information is correct before adding it to KnowItAll's knowledge base. Because KnowItAll is unsupervised, it can “recursively discover new relations, attributes, and instances in a fully automated, scalable manner” (Etzioni et al., 2005). The results of this research revealed that entities are easily recognizable but are often misclassified with the wrong tags. While this project does not directly pertain to NER in social media, it introduced the idea of using an online knowledge base of data, which subsequent research discussed here improves upon.
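The pattern-based extraction idea can be illustrated with a single hand-written pattern of the form "&lt;category&gt; such as X, Y". The pattern, category name, and sample text below are illustrative assumptions, not KnowItAll's actual rule set:

```python
import re

def extract_instances(text, category):
    """Extract capitalized candidate instances matching '<category> such as X, Y'."""
    pattern = rf"{category} such as ((?:[A-Z]\w+(?:, )?)+)"
    instances = []
    for hit in re.findall(pattern, text):
        instances.extend(name.strip() for name in hit.split(","))
    return instances

sample = "He toured cities such as Paris, Berlin and spoke at length."
print(extract_instances(sample, "cities"))  # ['Paris', 'Berlin']
```

In the full system, candidates like these would then go to an Assessor-style validation step (e.g., checking search-engine hit counts) before entering the knowledge base.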
Research regarding semi-supervised learning models has developed by improving on previously published approaches. In 2011, Liu et al. of the Harbin Institute of Technology first presented a system for analyzing tweets that combined a KNN classifier and a CRF labeler to label every word in a tweet. KNN, or K-Nearest Neighbors, is an algorithm that recognizes a named entity by determining how similar it is to previously stored named entities using distance functions. KNN captures global coarse evidence that is useful for future tweets and is faster at retraining its statistical models when it encounters new entities. This is because KNN uses “word-level classification, leveraging the similar and recently labeled tweets” (Liu et al., 2013). A CRF (Conditional Random Field) labeler is a semi-supervised procedure that uses a graph structure to determine the conditional probability of a sequence of words (McCallum & Li, 2003). CRF is successful at understanding “the subtle interactions between words and their labels” (Liu et al., 2013). The hybrid approach of using the KNN classifier and the CRF labeler is effective at recognizing named entities, but this model has to be constantly retrained when it encounters newer types of entities.
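The KNN step can be illustrated in miniature as follows. The labeled examples are invented, and plain string similarity stands in for the richer tweet-level distance functions the actual system would use:

```python
from difflib import SequenceMatcher

# Invented labeled examples standing in for previously stored named entities.
labeled_examples = [("London", "LOCATION"), ("Paris", "LOCATION"),
                    ("Obama", "PERSON"), ("Merkel", "PERSON")]

def knn_label(word, k=1):
    """Label a word by majority vote among its k most similar stored examples."""
    scored = sorted(
        labeled_examples,
        key=lambda ex: SequenceMatcher(None, word.lower(), ex[0].lower()).ratio(),
        reverse=True,
    )
    votes = [label for _, label in scored[:k]]
    return max(set(votes), key=votes.count)  # majority vote among k nearest

print(knn_label("Londn"))  # a misspelling is still matched to "London"
```

Because classification only consults the stored examples, adding newly labeled tweets updates the model immediately, which is the retraining-speed advantage the paragraph attributes to KNN.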
Later research introduced a targeted approach to named-entity recognition (Ashwini & Choi, 2014). This project focused on entities more commonplace on social media, such as movies, TV shows, and sports. The researchers created a targeted set of movie titles from tweets collected over three months. Tweets were normalized, possible candidates for named entities referring to movies were identified, and each candidate was classified as an entity or not through manual tagging. This base of data was then used to identify entities in a larger collection of tweets. Results showed that this model was unbiased toward any particular targeted set, and its statistical models did not need to be retrained as they did with K-Nearest Neighbors. This is useful when target sets are frequently updated.
Another semi-supervised NER approach created a Wikipedia-style Knowledge Base (KB) to recognize a wide assortment of named entities. Gattani et al. described a system that also extracted, linked, and classified social media data (2013). Entities were extracted by locating strings in the text of tweets that referred to predefined categories created by the researchers. Potential named entities were linked to concepts and instances in the Knowledge Base. The KB was built using a hierarchy of concepts similar to Wikipedia's organization system. A node structure was used to determine the topic and context of an entity, and tweets were tagged based on the entities mentioned. This Knowledge Base is frequently updated so that it continues to be accurate with newly defined and created named entities. Gattani et al. compared their new approach with the standard formal-text-based Stanford Named Entity Recognizer and OpenCalais and found that the KB strategy was more effective for social media posts. The Stanford NER had fewer predefined categories, which would not be specific enough for social media analysis. The proposed system had better F1 scores than OpenCalais, with greater accuracy of entity extraction and classification but not tagging. They also noted that the solution to classifying tweets using KNN classifiers and CRF labelers by Liu et al. only extracts one person, organization, and location entity, while their model allows for a large number of entity types that link back to their Knowledge Base.
Most recent named-entity recognition models focus on naming entities found in the text of tweets. However, images are also frequently shared and associated with text in posts on most social media platforms, including Twitter and Snapchat. Snapchat is a form of social media that combines text and images to make a complete message. When a social media platform is primarily image-based with text captions optional, text-based NER can be difficult because much of the post's context is carried by the image rather than the text. Researchers Moon et al. introduced a Multimodal NER approach (2018). They created a new dataset called SnapCaptions, consisting of image-caption pairs submitted to public Snapchat stories with fully annotated named entities. Image recognition tools were used to generate words that describe the image. This, combined with the image's caption, was used in a hybrid Bi-LSTM/CRF approach. Bi-LSTM (Bidirectional Long Short-Term Memory) stores and remembers named entities in blocks based on time intervals, mimicking short-term memory in humans. The results of this research project highlighted the importance of context; including the meaning of accompanying images significantly improved the results of NER compared to standard text-based approaches.
Evaluation of Research
Much of the recent named-entity recognition research related to social media focuses on semi-supervised approaches, which is why this literature review did not discuss specific supervised and unsupervised learning approaches in social media. Some researchers focused on model efficiency, while others looked to improve entity classification. Some directly worked on improving a previous model, while others sought to improve general named-entity recognition techniques. Semi-supervised learning models are more prolific because of improved heuristic methods compared to supervised or unsupervised learning models. On a larger scale, the manual nature and limited scope of supervised learning make it unfeasible to work with social media. Unsupervised methods need improved lexical resources related to social media in order to be used realistically. Recent research on NER strategies applied to social media should be evaluated based on accuracy in correctly identifying a named entity, adaptability to new data and platforms, and efficiency in processing large amounts of data.
Accuracy is the foundation of named-entity recognition research, and all of the research models discussed strove for complete accuracy. Accuracy is achieved when models recognize which phrases are entities and correctly classify the identified entities. Semi-supervised learning models such as the KNN/CRF classifier, the Targeted Sets model, the Wikipedia-style Knowledge Base technique, and the Multimodal NER approach did not report significant mistakes when identifying entities. However, errors were mostly noted when labeling and annotating social media data. For example, the Wikipedia-style Knowledge Base model involving multiple categories had different levels of success. In the KB, some categories (environment, travel, politics) were much more accurate than others (people, social sciences, businesses) when classifying the contents of tweets. Researchers pointed to heightened ambiguities in social media text, and the general consensus was to use improved normalization techniques in the future.
It is important that proposed named-entity recognition models are adaptable to different types of data so that a single model can be used to analyze a large, diverse corpus of social media data. The hybrid KNN/CRF classifier captures “global coarse evidence” that is useful when analyzing future tweets (Liu et al., 2011). It is also faster at retraining its statistical models when it encounters new entities. KnowItAll, which used unsupervised methods, is able to work with any type of data, since it used a search engine to disambiguate text. The Wikipedia-style Knowledge Base is frequently updated so that newer named entities will be recognized. While the Targetable NER model only used a targeted set of movie titles, researchers noted that the model itself was unbiased toward any particular targeted set. Only the Multimodal model combining text and image data did not seem very adaptable to diverse sets of data, because the image recognition technology it used may not always be updated and compatible with new and unlabeled images.
Additionally, named-entity recognition models should process social media data efficiently. Efficiency can be determined by a model's reliance on annotated (or training) data and how quickly it can classify a named entity. Retraining is typically done to acclimate NER models to different types of data. Although the K-Nearest Neighbors classifier needed to be retrained less frequently than previous research models, subsequent research models did not need to be retrained at all and were more self-sufficient. The lack of reliance on retraining is useful when data sets are frequently changed and updated. Models should be efficient enough that real-time data analysis can be used effectively in time-sensitive industries, including meteorology and news reporting.
Even though most current named-entity recognition research in social media uses content from Twitter, content on other social media platforms such as Facebook and Instagram should eventually be analyzed using NER techniques. With better social media data collection, aggregating data from different social media networks adds more depth when analyzing data based on named entities. The Multimodal approach is the most applicable to multiple social media platforms because image content reposted on different platforms can be recognized as a named entity. The rest of the semi-supervised models have modes of data collection and classification tailored particularly to Twitter. Depending on the amount of training data available, it is unclear whether the same proposed NER model would produce the same results on different platforms.
Named-entity recognition in the domain of social media faces new challenges compared to other forms of writing. In recent years, various researchers have used their own (or past) models to increase effectiveness and efficiency in overcoming these difficulties. Overall, recent research indicates that proposed NER models are increasing in efficiency and accuracy but that there is still room for improvement. In the realm of social media, semi-supervised learning models are the most viable because they strike a good balance between the precision of supervised learning and the uninvolved nature of unsupervised learning. Semi-supervised models can learn dynamically because they start from accurate seed data and can discover additional correct entity-tag pairs. If retraining is not necessary, these models are more suitable for long-term use. The conclusions of many of the research projects suggest that ambiguity remains an issue and that the context of a post, namely its time, location, and recipients, helps enable more accurate named-entity recognition. With the current state of research in this field, semi-supervised models can improve by learning more from the annotated text used as a basis for data. Improved preprocessing of the content being analyzed, such as implementing more orthographic rules, can help reduce errors in tagging named entities. Elements of data that add context should be accounted for in preprocessing stages. The speed at which named entities are tagged should also be a consideration when developing new techniques.
Furthering research into learning models that better perform named-entity recognition is essential because social media is a mode of communication that remains an untapped field full of unanalyzed text, apart from hashtags. Many industries may find it beneficial to correctly identify and understand what millions of people are discussing, especially marketing departments for businesses with various products and services. Any domain using abbreviated or shorthand text would benefit from research that focuses on informal text. Culturally, image sharing has increased due to superior cameras and the popularity of meme images. This is why a multimodal approach that combines text, images, and other forms of media should be further researched to better glean information from large amounts of social media data. Improvements to NER models in social media enhance information extraction; otherwise, text evaluation stagnates with rudimentary, limited, human-dependent analysis methods.