  • Published on: 15th October 2019

% INTRO: Previous baseline + GSCs

The sheer quantity of biological information deposited in the literature every day leads to information overload for biomedical researchers, even in very specialized areas of research. In 2016 alone, 869,666 citations were indexed in MEDLINE \cite[]{MEDLINESTATS}, a rate of more than one paper per minute. Ideally, efficient and accurate text-mining and information extraction tools could help unlock structured information from this growing body of raw text for use in computational data analysis. Text-mining has already proven useful for many types of large-scale biomedical data analysis, such as network biology \cite[]{Zhou2014Nature}, gene prioritization \cite[]{Aerts2006}, drug repositioning \cite[]{Wang2013}, and the creation of curated databases \cite[]{Li2015}. Broadly speaking, information extraction in the biomedical domain can be divided into three steps: named entity recognition (NER), trigger word identification, and relation extraction.

% FIGURE 1: Example text

\begin{figure*}[!tbp]

\centerline{\includegraphics[width=1\textwidth]{fig01.png}}

\caption{Example of a complex biomedical event annotation}\label{fig:01}

\end{figure*}

Biomedical named entity recognition (BNER) is the task of identifying biomedical named entities, such as genes and gene products, diseases, and species, in raw text. Biomedical named entities have several characteristics that make their recognition in text particularly challenging \cite[]{Campos2012}, including the sharing of head nouns (e.g. ``91 and 84 kDa proteins'' refers to ``91 kDa protein'' and ``84 kDa protein''), several spelling forms per entity (e.g. ``N-acetylcysteine'', ``N-acetyl-cysteine'', and ``NAcetylCysteine'') and ambiguous abbreviations (e.g. ``TCF'' may refer to ``T cell factor'' or to ``Tissue Culture Fluid''). Until recently, state-of-the-art BNER tools have relied on hand-crafted features to capture the characteristics of different entity classes. This process of feature engineering, i.e. finding the set of features that best helps distinguish entities of a specific type from other tokens (or other entity classes), requires extensive trial-and-error experimentation. On top of this costly process, high-quality BNER tools typically employ entity-specific modules, such as whitelist and blacklist dictionaries, which are difficult to build and maintain. Defining these steps currently accounts for the majority of the time and cost of developing BNER tools \cite[]{Leser2005} and leads to highly specialized solutions that cannot be ported to domains (or even entity types) other than the ones they were designed for. Recently, a domain-independent approach based on

\pagebreak

\noindent deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF) \cite[]{Lample2016}, has achieved state-of-the-art results for the task of BNER \cite[]{Habibi2017}. This method is completely generic, requiring only pre-trained word embeddings and labeled training data. Further improvements to this method have been made using transfer learning \cite[]{Giorgi2018} and multi-task learning \cite[]{Wang2018}.
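
As a brief sketch of why such a model needs no entity-specific features, consider the scoring function of the CRF layer as described by \cite{Lample2016}. The notation below follows that paper and is included only for illustration. For a sentence $X$ of $n$ tokens and a candidate tag sequence $y = (y_1, \dots, y_n)$, the score is
\[
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},
\]
where $P_{i, y_i}$ is the score that the bidirectional LSTM assigns to tag $y_i$ at position $i$, $A$ is a learned matrix of tag transition scores, and $y_0$ and $y_{n+1}$ denote the start and end tags. Training maximizes the log-probability of the correct tag sequence,
\[
\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y}} e^{s(X, \tilde{y})},
\]
where the sum runs over all possible tag sequences $\tilde{y}$ for $X$. Every quantity in these expressions is learned from word embeddings and labeled data, which is what allows the same model to be applied to different entity types.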

%Finding gene names in scientific text is both important and difficult. It is important because it is needed for tasks such as document retrieval, information extraction, summarization, and automated text mining, reasoning, and discovery. Technically, finding gene names in text is a kind of named entity recognition (NER) similar to the tasks of finding person names and company names in newspaper text [1]. However, a combination of characteristics, some of which are common to other domains, makes gene names particularly difficult to recognize automatically.

% • Millions of gene names are used.

% • New names are created continuously.

% • Authors usually do not use proposed standardized names, which means that the name used depends on preference.

% • Gene names naturally co-occur with other types, such as cell names, that have similar morphology, and even similar context.

% • Expert readers may disagree on which parts of text correspond to a gene name.

% • Unlike companies and individuals, genes are not defined unambiguously. A gene name may refer to a specified sequence of DNA base pairs, but that sequence may vary in nonspecific ways, as a result of polymorphism, multiple alleles, translocation, and cross-species analogs.

% All of these things make gene name finding a unique and persistent problem. An alternative approach to finding gene names in text is to decide upon the actual gene database identifiers that are referenced in a sentence. This is the goal of the gene normalization task [2]. While success in gene normalization to some degree eliminates the need to find explicit gene mentions, it will probably never be the case that gene normalization is more easily achieved. Therefore, the need to find gene mentions will probably continue into the future.

Biomedical events describe complex interactions between various biomedical entities. In this context, an event can be defined as a combination of an event trigger and an arbitrary number of arguments. Event triggers are words or phrases that signify the occurrence of an event. For example, Figure~1\vphantom{\ref{fig:01}} shows a single event of type \textit{``positive regulation''} with the trigger word \textit{``increased''} and arguments \textit{``insulin''} and \textit{``VEGF''}. Biomedical event extraction, the task of identifying biomedical events in text, is typically tackled with a pipeline-based approach, where trigger word identification is followed by event argument identification. In this approach, accurate trigger word identification is critical. Analysis in multiple studies \cite[]{Wang2016Dependency, Zhou2014Bioinformatics} suggests that more than 60\% of event extraction errors are caused by incorrect trigger word identification.
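
One way to make this definition concrete is to write an event as a triple. The notation is purely illustrative and is not taken from the cited work:
\[
e = (t, w, A), \qquad A = \{a_1, \dots, a_k\},
\]
where $t$ is the event type, $w$ is the trigger word or phrase, and $A$ is the set of $k$ arguments. Under this notation, the event in Figure~1 would read
\[
e = (\mbox{positive regulation},\ \mbox{``increased''},\ \{\mbox{insulin}, \mbox{VEGF}\}).
\]
Argument roles are omitted here for simplicity. In a pipeline, $w$ and $t$ are predicted first and $A$ is predicted afterwards, so a missed or mislabeled trigger propagates to the entire event.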

Until very recently, the best-performing methods for biomedical trigger word identification have generally relied on rule-based approaches \cite[]{Vlachos2009} or traditional machine learning-based approaches, whereby many hand-crafted features are extracted from text and fed into machine learning algorithms for classification. For example, \cite{Pyysalo2012} proposed a model where various features are extracted from the processed data and fed into a support vector machine (SVM) for classification. \cite{Zhou2014Bioinformatics} proposed a framework where embedded features of words learned from a neural language model are combined with hand-crafted features and fed to an SVM for classification using multiple kernel learning. \cite{Wei2015} proposed a pipeline-based approach using a CRF and an SVM, where the CRF is used to tag valid triggers and the SVM is used to identify the trigger type. More recently, neural network-based approaches have been proposed which are less brittle and outperform rule-based and feature-based approaches. \cite{Wang2016Dependency} proposed a method where dependency-based word embeddings within a window around the word are fed into a neural network to perform classification. \cite{Wang2016CNN} proposed another model where word and entity mention features of words within a window around the word are fed to a convolutional neural network (CNN) to perform classification. Although both methods achieve good performance, they fail to capture features outside the window. To address this problem, \cite{Rahul2017} proposed a method based on bidirectional recurrent neural networks (RNNs), which achieves state-of-the-art performance.
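
The limitation of window-based models can be stated schematically. In a generic formulation (the symbols below are illustrative and are not taken from the cited papers), a window-based classifier predicts the label of token $i$ from a fixed context of $c$ tokens on either side,
\[
\hat{y}_i = f(x_{i-c}, \dots, x_i, \dots, x_{i+c}),
\]
so any cue located more than $c$ tokens away cannot influence the prediction. A bidirectional RNN instead maintains a forward and a backward hidden state,
\[
\overrightarrow{h}_i = g_f(x_i, \overrightarrow{h}_{i-1}), \qquad
\overleftarrow{h}_i = g_b(x_i, \overleftarrow{h}_{i+1}), \qquad
\hat{y}_i = f([\overrightarrow{h}_i ; \overleftarrow{h}_i]),
\]
so that the prediction at position $i$ can, in principle, depend on every token in the sentence.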

Given the similarity of BNER and biomedical event trigger identification, and the similarity of the current state-of-the-art solutions to these tasks, a multi-task learning (MTL) approach seems like a natural fit. At a high level, MTL \cite[]{Caruana1993} is a machine learning method in which multiple learning tasks are solved at the same time. In the classification context, MTL is used to improve the performance of multiple classification tasks by learning them jointly. The idea is that by sharing representations between tasks, we can exploit commonalities, leading to improved learning efficiency and prediction accuracy for the task-specific models when compared to training the models separately \cite[]{Baxter2000, Thrun1996, Caruana1998}. In this paper, we propose a multi-task learning framework for jointly learning the tasks of BNER and biomedical event trigger word identification.
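
A common way to realize this idea is hard parameter sharing, shown here only as a sketch and not necessarily the exact formulation of the proposed framework. Both tasks use an encoder with shared parameters $\theta_s$ and separate output layers with task-specific parameters $\theta_{\mathrm{ner}}$ and $\theta_{\mathrm{trig}}$, and the model is trained by minimizing a weighted sum of the two task losses,
\[
\mathcal{L}(\theta_s, \theta_{\mathrm{ner}}, \theta_{\mathrm{trig}})
= \lambda_{\mathrm{ner}} \, \mathcal{L}_{\mathrm{ner}}(\theta_s, \theta_{\mathrm{ner}})
+ \lambda_{\mathrm{trig}} \, \mathcal{L}_{\mathrm{trig}}(\theta_s, \theta_{\mathrm{trig}}),
\]
where the weights $\lambda_{\mathrm{ner}}, \lambda_{\mathrm{trig}} \geq 0$ are assumed hyperparameters that balance the two tasks. Because gradients from both losses update $\theta_s$, the shared encoder is pushed toward representations that are useful for both tasks.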


Essay Sauce, . Available from:< https://www.essaysauce.com/essays/science/2018-3-18-1521385034.php > [Accessed 09.12.19].