Auxiliary Verb based text-mining approach to infer disease-related genes.
Abstract — Text mining is widely used in biology to infer relationships between biological entities. Relationships between genes and diseases are particularly important for discovering the causes of disease. We therefore propose a method that infers disease-related genes using auxiliary verbs. By using auxiliary verbs, our method decreases the number of candidate disease-related genes and infers more meaningful disease-related genes than comparable methods. Furthermore, our method extracts sentences that mention both a disease and a gene, and these sentences provide useful information about disease-gene relationships. We applied our method to lung cancer literature data. It found 9 genes that are involved in lung cancer among the top 10 inferred genes.
I. INTRODUCTION
A great deal of biomedical text is generated from biological experiments, and these data contain useful biological knowledge. Furthermore, biomedical text is easy to obtain from online databases such as PubMed [15], PMC [14], and OMIM [11]. For this reason, text-mining approaches are widely used in biology to analyze biomedical literature data. In particular, text mining is used to infer relationships between biological entities, such as disease-gene, disease-drug, and gene-drug relationships, because these relationships are important for describing complex biological phenomena. Furthermore, by analyzing diverse biomedical literature data, we can discover hidden relationships that are contained in various biological experimental results.
Swanson’s ABC model [17, 18] is a representative biomedical text-mining approach. It infers new relationships between biological entities from existing relationships. Since text mining showed promise as a method for inferring biomedical relationships, a large number of approaches have been presented [9, 12, 16]. However, because the literature is very large, many approaches infer too many biomedical relationships for the results to serve as useful knowledge. To address this problem, we propose a method that infers disease-related genes using auxiliary verbs.
This study has two main goals. The first is to decrease the number of inferred relationships. The second is to infer useful disease-gene relationships. To achieve these goals, we used auxiliary verbs. Our assumptions are as follows:
– Biological experimental results cannot be described with 100 percent reliability.
– Auxiliary verbs are widely used to describe conclusions drawn from biological experimental results.
– In the literature data, the conclusion section contains the authors’ own description of their experiments.
We considered auxiliary verbs to be a key to the authors’ descriptions of biological experimental results in the literature. For this reason, we used auxiliary verbs to achieve our goals.
The main contributions of this work include:
– A decrease in the number of inferred relationships.
– Inference of meaningful disease-gene relationships.
– Extraction of useful sentences that support the relationships.
The rest of the paper is organized as follows. Section 2 introduces related studies. The proposed method is described in Section 3. Section 4 presents the experimental results and discussion. Conclusions and future work are given in Section 5.
II. RELATED STUDIES
Several text-mining approaches [1, 2, 8] have been developed in the biomedical field. Named entity recognition, text classification, terminology extraction, and relationship extraction are representative areas of biomedical text mining. Among them, this study addresses relationship extraction.
Jung et al. [6] presented a literature search tool for extracting disease-associated genes. Their tool applies a rule-based text-mining algorithm with keyword matching to extract target diseases, genes with significant results, and the type of study described in each article. Pletscher-Frankild et al. [13] presented a system for extracting disease-gene associations from biomedical abstracts. Their system uses a dictionary-based tagger for named entity recognition and a scoring scheme that takes co-occurrences into account. They also developed the DISEASES resource, which integrates the text-mining results with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. Fang et al. [3] provided a database called MeInfoText, which presents comprehensive association information about gene methylation and cancer based on association mining of literature data. MeInfoText also presents a set of genes that may contribute to the development of cancer through aberrant methylation. Tiffin et al. [19] extracted candidate disease genes using expression profiles. They used the eVOC anatomical ontology to integrate text mining of biomedical literature with data mining of human gene expression data. Using this approach, they successfully prioritized candidate genes according to their expression in disease-affected tissues.
Several studies on extracting disease-gene relationships have also been presented. Gottlieb et al. [4] presented the PRINCIPLE tool, which analyzes and visualizes disease-specific gene networks based on the PRINCE [22] algorithm. The PRINCE algorithm infers disease-gene relationships using network analysis, based on disease-disease similarity and protein-protein interaction data. Luo et al. [10] constructed a reliable heterogeneous network by fusing multiple networks, including a PPI network, a phenotype similarity network, and known associations between diseases and genes. They then analyzed the network using RWRHN, a random walk-based algorithm that they devised. Their approach predicted novel causal genes for 16 diseases.
III. METHODS
In this section, we describe the proposed method for inferring disease-related genes using auxiliary verbs. Figure 1 outlines the proposed method.
Figure 1. Outline of the proposed method
Our method has four steps. First, we obtained literature data related to lung cancer from PubMed using MeSH terms. After processing the literature data, we extracted sentences that include both an auxiliary verb and a gene. Next, we calculated a score for each auxiliary verb by analyzing the extracted sentences. Finally, using the scores, we constructed a lung cancer-related gene-gene network.
3.1 Data properties
We obtained literature data (18,005) related to lung cancer and genetics from PubMed. The gene database, which includes 39,816 genes, was obtained from HGNC [5]. The answer set (29) was gathered from the KEGG [7] and OMIM databases. The answer set is the list of genes that are already known to be related to lung cancer. Seven auxiliary verbs (should, will, would, can, could, may, might) are used in this study.
3.2 Literature processing
Among the literature data, we selected structured literature data that have a “conclusion” section. Figure 2 shows the two types of literature data, structured and unstructured.
Figure 2. Structured and unstructured literature data
In Figure 2, A indicates unstructured literature data and B indicates structured literature data. Only the structured data are used in our method. Among the several sections in the literature, the “conclusion” section is used to infer useful relationships in this study.
After selecting the literature data, we tagged the sentences with parts of speech using a POS tagger [20, 21]. Figure 3 shows an example of POS tagger output.
Figure 3. Example for POS tagger.
The POS tagger analyzes each word in a sentence. Using the tagging results, we can identify the part of speech of each word.
3.3 Sentence Extraction
First, we split the literature data into sentences. Using the tagging results, we extracted the sentences that include an auxiliary verb. Next, among these, we extracted the sentences that include a gene symbol, in order to infer disease-gene relationships. The gene symbols are obtained from the HGNC database. The extracted sentences are used to infer disease-related genes.
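The extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it replaces the POS tagger with a plain token lookup (so it would also match "can" used as a noun or the month "May"), it uses a naive sentence splitter, and the gene symbol set and input text are hypothetical toy data standing in for the HGNC list and the PubMed conclusions.

```python
import re

# The seven auxiliary verbs listed in Section 3.1.
AUXILIARY_VERBS = {"should", "will", "would", "can", "could", "may", "might"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into alphanumeric tokens, preserving case."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def extract_sentences(text, gene_symbols):
    """Return (sentence, auxiliary verbs, genes) for each sentence that
    contains at least one auxiliary verb and at least one gene symbol."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        verbs = sorted({t.lower() for t in tokens if t.lower() in AUXILIARY_VERBS})
        # Gene symbols are matched case-sensitively against whole tokens.
        genes = sorted({t for t in tokens if t in gene_symbols})
        if verbs and genes:
            results.append((sentence, verbs, genes))
    return results


# Hypothetical toy input; the symbols are a tiny illustrative subset.
GENES = {"EGFR", "KRAS", "MET"}
TEXT = (
    "EGFR mutations were detected in 40 patients. "
    "MET amplification may predict resistance to therapy. "
    "Further studies should be performed."
)
extracted = extract_sentences(TEXT, GENES)
for sentence, verbs, genes in extracted:
    print(verbs, genes, sentence)
```

Only the second toy sentence is kept, because it is the only one that contains both an auxiliary verb and a gene symbol. Whole-token matching also means that a short symbol such as T is matched whenever the capital letter appears as a separate token.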
3.4 Scoring
In this study, seven auxiliary verbs (should, will, would, can, could, may, might) are used as keys for analyzing the literature data. The verbs are divided into four groups: should, will, can, and may. Figure 4 shows the scoring process.
Figure 4. Scoring process
As shown in Figure 4, we categorized sentences by auxiliary verb. We then calculated a score for each auxiliary verb based on the genes that appear in its sentences. The score is calculated as follows:
\[ \text{Score} = \frac{\text{the number of known genes}}{\text{the number of inferred genes}} \]
Here, “the number of inferred genes” is the number of genes inferred from the sentences for each auxiliary verb, and “the number of known genes” is the number of those inferred genes that are already known to be related to lung cancer. The answer set is used to identify the known genes. These scores are used to construct the lung cancer-related gene-gene network in the next step.
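The scoring step can be sketched as below, reading the score as the ratio of known genes to inferred genes. That reading is an assumption, chosen because it reproduces the values reported in Table 2. The function name and the placeholder gene names in the toy call are hypothetical.

```python
def auxiliary_verb_score(inferred_genes, answer_set):
    """Fraction of the inferred genes that appear in the answer set.

    Assumed form of the score: known genes / inferred genes.
    """
    inferred = set(inferred_genes)
    if not inferred:
        return 0.0
    known = inferred & set(answer_set)
    return len(known) / len(inferred)


# (inferred genes, known genes) per auxiliary verb group, from Table 2.
TABLE_2_COUNTS = {
    "should": (20, 8),
    "will": (19, 3),
    "can": (114, 9),
    "may": (311, 13),
}

# Recompute the scores from the counts, rounded to two decimals.
scores = {
    verb: round(known / inferred, 2)
    for verb, (inferred, known) in TABLE_2_COUNTS.items()
}
print(scores)

# Toy call with placeholder gene names (GENE_A and GENE_B are not real symbols).
toy_score = auxiliary_verb_score(
    ["EGFR", "KRAS", "GENE_A", "GENE_B"], {"EGFR", "KRAS", "TP53"}
)
print(toy_score)
```

Under this reading, a verb group whose sentences mention few genes, most of them already known, receives a high score, which is why "should" outranks "may" even though "may" yields more known genes in absolute terms.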
3.5 Network construction
In this step, a co-occurrence method is used to extract gene-gene relationships from the sentences. Under this method, if two genes appear in the same sentence, they are considered to have a relationship. Using this method, we extracted weighted gene-gene relationships from the sentences. The weight of a relationship is assigned according to the auxiliary verb that appears in the same sentence. By integrating the extracted relationships, we constructed the lung cancer-related gene-gene network. If a relationship is extracted from several sentences, its weight is the sum of the weights from each sentence.
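A minimal sketch of the co-occurrence network construction is given below. Several details are assumptions made for illustration: the edge weight contributed by a sentence is taken to be the Table 2 score of its auxiliary verb group, the mapping of the seven verbs onto the four groups (would to will, could to can, might to may) is a guess at the grouping, and each sentence is assumed to carry a single auxiliary verb. The toy sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumed mapping of the seven auxiliary verbs onto the four groups.
VERB_GROUP = {
    "should": "should",
    "will": "will", "would": "will",
    "can": "can", "could": "can",
    "may": "may", "might": "may",
}

# Group scores as reported in Table 2.
GROUP_SCORE = {"should": 0.4, "will": 0.16, "can": 0.08, "may": 0.04}


def build_network(sentences):
    """Build a weighted, undirected gene-gene network.

    `sentences` is an iterable of (auxiliary_verb, genes) pairs. Every pair
    of distinct genes in a sentence yields an edge whose weight is the score
    of the sentence's verb group. Weights from repeated co-occurrences add up.
    """
    edges = defaultdict(float)
    for verb, genes in sentences:
        weight = GROUP_SCORE[VERB_GROUP[verb]]
        # Sorting makes (a, b) and (b, a) map to the same undirected edge.
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += weight
    return dict(edges)


# Hypothetical extracted sentences, reduced to (verb, genes).
toy_sentences = [
    ("might", ["EGFR", "KRAS"]),
    ("should", ["KRAS", "EGFR"]),
    ("could", ["EGFR", "MET", "KRAS"]),
]
network = build_network(toy_sentences)
for edge, weight in sorted(network.items()):
    print(edge, round(weight, 2))
```

In the toy data, the EGFR-KRAS edge appears in all three sentences, so its weight is the sum of the "may", "should", and "can" group scores, while the edges involving MET carry only the "can" score.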
IV. RESULTS & DISCUSSIONS
In this section, we describe and discuss the experimental results. We also compare our results with those of other methods that infer disease-related genes.
4.1 Auxiliary verb analysis
To validate one of our assumptions, we analyzed the distribution of auxiliary verbs in the structured literature data.
Table 1. Distribution of auxiliary verb
Structured literature data (54,320 sentences)
Area | Conclusion section | Other section
Sentence | 8,089 | 46,231
Auxiliary verb | 3,155 | 2,275
Percentage | 39.00 % | 4.92 %
Table 1 shows the distribution of auxiliary verbs by location. In the table, “Area” indicates the location in the structured literature data, “Sentence” the number of sentences in each location, “Auxiliary verb” the number of sentences that include an auxiliary verb, and “Percentage” the proportion of sentences that include an auxiliary verb. In the conclusion section, 39.00 % of sentences include an auxiliary verb, compared with 4.92 % in the other sections. This result shows that auxiliary verbs are widely used to describe conclusions drawn from biological experimental results.
4.2 Score calculation
To calculate the score for each auxiliary verb, we selected, from the sentences that include an auxiliary verb, those that also include a gene symbol. We then calculated the score for each auxiliary verb using the scoring function described in Section 3.4.
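The percentages in Table 1 follow directly from the sentence counts. The short check below recomputes them. The counts are transcribed from the table, and the variable and function names are illustrative.

```python
# (number of sentences, sentences with an auxiliary verb), from Table 1.
TABLE_1 = {
    "conclusion section": (8089, 3155),
    "other section": (46231, 2275),
}


def auxiliary_verb_percentage(sentences, with_auxiliary_verb):
    """Percentage of sentences that include an auxiliary verb."""
    return round(100.0 * with_auxiliary_verb / sentences, 2)


percentages = {
    area: auxiliary_verb_percentage(*counts) for area, counts in TABLE_1.items()
}
total_sentences = sum(sentences for sentences, _ in TABLE_1.values())

print(percentages)
print(total_sentences)
```

The two sentence counts also add up to the 54,320 sentences reported for the structured literature data.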
Table 2. Score for each auxiliary verb
Auxiliary verb | Inferred genes | Known genes | Score
Should | 20 | 8 | 0.4
Will | 19 | 3 | 0.16
Can | 114 | 9 | 0.08
May | 311 | 13 | 0.04
Table 2 shows the score for each auxiliary verb. In the table, “Inferred genes” indicates the number of genes inferred from the sentences, and “Known genes” indicates the number of known genes among the inferred genes.
4.3 Network construction
We constructed the lung cancer-related gene-gene network based on co-occurrence and the auxiliary verb scores. Figure 5 shows the constructed gene-gene network.
Figure 5. Lung cancer-related gene-gene network
In the network, a node indicates a gene and an edge indicates a relationship between genes. The weight of each edge is calculated from the auxiliary verb. The network has 153 nodes and 155 edges. These properties show that our method decreases the number of inferred relationships. To infer the top N genes in the network, we analyzed the network using the weighted degree measure.
Table 3. Inferred Top 10 genes
Rank | Gene | Evidence
1 | EGFR | OMIM
2 | KRAS | KEGG, OMIM
3 | ALK | KEGG
4 | MET |
5 | BRAF | OMIM
6 | ERCC1 |
7 | T |
8 | XRCC1 |
9 | TP53 | KEGG
10 | PIK3CA | OMIM
Table 3 shows the top 10 genes inferred by network analysis. In the table, “Evidence” indicates the lung cancer-related gene database in which the gene is listed. Among the 10 inferred genes, 6 are validated by the answer set.
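Ranking genes by weighted degree can be sketched as follows. The weighted degree of a node is taken here in its usual sense, the sum of the weights of its incident edges. The edge list is hypothetical toy data, not the 155-edge network from this study, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum the weights of all edges incident to each node.

    `edges` maps an undirected edge (gene_a, gene_b) to its weight.
    """
    degree = defaultdict(float)
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


def top_n_genes(edges, n):
    """Return the n genes with the highest weighted degree.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    degree = weighted_degree(edges)
    return sorted(degree, key=lambda gene: (-degree[gene], gene))[:n]


# Hypothetical weighted edges for illustration only.
toy_edges = {
    ("EGFR", "KRAS"): 0.52,
    ("EGFR", "MET"): 0.08,
    ("KRAS", "TP53"): 0.04,
}
ranking = top_n_genes(toy_edges, 2)
print(ranking)
```

In the toy network, EGFR and KRAS rank first and second because they share the heaviest edge and each has one additional neighbor.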
4.4 Comparison experimental results
We compared our method with two comparable methods that infer disease-gene relationships: the PRINCE algorithm [22] and RWRHN [10]. For the PRINCE algorithm, we ran the PRINCIPLE tool [4] for lung cancer. For RWRHN, we took the top 10 inferred genes from the results reported in the paper. To validate the genes inferred by each method, we used the answer set.
Figure 6. Inferred Top 10 genes
Figure 6 shows the top 10 inferred genes for each method. The y-axis indicates the number of known genes among the top 10 inferred genes. As shown in Figure 6, our method found more known genes than the comparable methods. This result shows that our method is useful for inferring disease-gene relationships.
4.5 Literature validation
To validate the inferred genes that are not confirmed by the answer set, we searched the extracted sentences for evidence. These sentences are produced by the sentence extraction step of our method, and each contains a gene symbol and an auxiliary verb. Using these sentences, we found evidence for 3 genes: MET, ERCC1, and XRCC1. For T, we could not find evidence of a relationship with lung cancer.
Table 4. Literature validation for 3 genes
Gene PMID Supported Sentence
MET 25886066 Using FISH analysis to detect high polysomy and amplification of MET gene may be useful in predicting shortened PFS and OS after Gefitinib treatment in lung adenocarcinoma.
20150826 The occurrence of MET amplification and EGFR/ K-ras mutations might be mutually exclusive suggesting several distinct mechanisms in the development of lung adenocarcinoma.
20107422 Our results suggest that increased MET GCN would be an independent poor prognostic factor in SCC of the lung.
ERCC1 25051148 This meta-analysis suggests that the ERCC1 19007T>C polymorphism may be associated with lung cancer risk in Asians, while larger scale association studies are necessary to further validate the association of 19007T>C polymorphism with lung cancer risk.
21875468 The expression of ERCC1 mRNA, lymph node metastasis, pathological grade, cancer family history and smoking can be used as prognostic indicator of non-small cell lung cancer.
20840811 The sensitivity to cisplatin of lung cancer cell A549/DDP could be enhanced by RNA interfering ERCC1 gene targeted code 346.
20003463 Genetic polymorphisms in ERCC1 and XRCC1 genes might be prognostic factors in non-smoking female patients with lung adenocarcinoma.
17502833 These findings indicate that ERCC1 polymorphisms may contribute to the etiology of lung cancer. Further functional studies were warranted to elucidate the mechanism of the associations.
XRCC1 25684477 Our findings indicated that certain XRCC1 Arg399Gln variants might affect the susceptibility of lung cancer in Chinese population.
24175813 It is concluded that XRCC1 genotypes Gln/Gln and Arg/Gln may influence cancer susceptibility in patients with smoking habits and these functional SNPs in XRCC1 gene may act as attractive candidate biomarkers in lung cancer for diagnosis and prognosis.
22339849 Genetic polymorphisms in XRCC1 gene might be associated with overall survival and response to platinum-based chemotherapy in lung cancer patients.
20975374 Homozygous Trp/Trp variant genotype of XRCC1 Arg194Trp polymorphism could increase lung cancer risk in total population, especially in Asians.
20003463 Genetic polymorphisms in ERCC1 and XRCC1 genes might be prognostic factors in non-smoking female patients with lung adenocarcinoma.
17952468 These findings suggest that genetic polymorphisms in the DNA repair genes may modulate overall lung cancer susceptibility and that pathological stage and XRCC1 Arg399Gln independently predicted overall survival among Indian lung cancer patients.
15840879 However, we cannot exclude the possibility that the OGG1 Ser326Cys and XRCC1 Arg194Trp polymorphisms play minor roles in lung carcinogenesis.
16875604 Those results suggest that the XRCC1 Arg194Trp and XPD Lys751Gln genetic polymorphisms may be associated with clinical responses to platinum-based chemotherapy in advanced non-small cell lung cancer.
Table 4 lists the gene, PMID, and supporting sentences. In the table, “PMID” is the identifier assigned by PubMed, which can be used to access the source article for each sentence. “Supported Sentence” indicates a sentence extracted by the sentence extraction step of our method. As shown in Table 4, the extracted sentences provide useful knowledge about disease-gene relationships, including for genes that cannot be validated by the answer set.
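The two-stage validation of the top 10 genes can be summarized with a short tally. The data below are transcribed from Table 3 (database evidence) and Table 4 (genes with supporting sentences). The function and variable names are illustrative.

```python
# Top 10 inferred genes and their database evidence, from Table 3.
TOP10_EVIDENCE = {
    "EGFR": ["OMIM"],
    "KRAS": ["KEGG", "OMIM"],
    "ALK": ["KEGG"],
    "MET": [],
    "BRAF": ["OMIM"],
    "ERCC1": [],
    "T": [],
    "XRCC1": [],
    "TP53": ["KEGG"],
    "PIK3CA": ["OMIM"],
}

# Genes for which Table 4 lists supporting sentences.
LITERATURE_SUPPORTED = {"MET", "ERCC1", "XRCC1"}


def validation_summary(evidence, literature_supported):
    """Split genes into answer-set validated, literature supported, and neither."""
    by_answer_set = [gene for gene, dbs in evidence.items() if dbs]
    by_literature = [
        gene for gene, dbs in evidence.items()
        if not dbs and gene in literature_supported
    ]
    unsupported = [
        gene for gene, dbs in evidence.items()
        if not dbs and gene not in literature_supported
    ]
    return by_answer_set, by_literature, unsupported


by_answer_set, by_literature, unsupported = validation_summary(
    TOP10_EVIDENCE, LITERATURE_SUPPORTED
)
print(len(by_answer_set), len(by_literature), unsupported)
```

Six genes are validated by the answer set and three more by extracted sentences, giving 9 of the top 10. Only T remains without support.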
V. CONCLUSIONS
In the present study, we inferred disease-related genes using auxiliary verbs. To decrease the number of inferred genes, we used the conclusion section of structured literature data. As a result, we constructed a small disease-related gene-gene network. Using the network, we inferred the top 10 genes and validated that the inferred genes are related to lung cancer. Our experimental results showed that the proposed method found more known genes than comparable methods. Furthermore, our method extracts sentences that mention both a disease and a gene. Using these sentences, the proposed method can find supporting knowledge for genes that are not validated by the answer set. Through the answer set and the supporting sentences, the proposed method found 9 genes that are involved in lung cancer among the top 10 inferred genes. In this study, we used only the conclusion section of the literature data. In future work, we will extend the proposed method to extract meaningful knowledge from other sections.
ACKNOWLEDGMENT
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2015R1A2A1A05001845).