ClustalX, ClustalW and Clustal2: Clustering and Alignment of DNA and Protein Sequences. Unlocking Biological Data with Bioinformatics: An Introduction to Computational Tools

Bioinformatics is one of the newly emerging fields that uses computing, mathematics and statistics in molecular biology to store, retrieve and analyse biological data. Although still at an early stage, it has become one of the fastest-growing fields and has quickly established itself as a vital component of any biological research activity. It is becoming prevalent because of its ability to analyse massive amounts of biological data rapidly and efficiently. Bioinformatics can help a biologist extract valuable information from biological data by providing countless web- and/or computer-based tools, the majority of which are freely available.

Key words

Bioi

Introduction

Bioinformatics is a multidisciplinary science that combines several other disciplines, such as mathematics, biology, computer science and statistics, to develop methods for the storage, retrieval and analysis of biological data [1]. Paulien Hogeweg, a Dutch systems biologist, was the first person to use the term "Bioinformatics", in 1970, referring to the use of information technology for studying biological systems [2,3]. The introduction of user-friendly interactive automated modelling, along with the creation of the SWISS-MODEL server about 18 years ago [4], resulted in immense growth of this discipline. Since then, it has become an important part of the biological sciences, processing biological data at a much faster rate with databases and informatics working at the back end.

Computational tools are commonly used for the characterization of genes, phylogenetic analyses, determining the structural and physicochemical properties of proteins, and performing simulations to study how biomolecules interact in a living cell. Although these tools cannot produce information as reliable as experimentation, which is expensive, tedious and time-consuming, in silico analyses can still support an informed decision before an expensive experiment is undertaken. For example, a druggable molecule must have certain ADMET (absorption, distribution, metabolism, excretion and toxicity) properties to pass through clinical trials. If a compound does not have the required ADMET properties, it is likely to be rejected. To avoid such failures, different bioinformatics tools have been developed to predict ADMET properties, which allow researchers to screen many compounds and select the most druggable molecules before launching clinical trials [5].

Homology and Similarity Tools:

Homologous sequences are sequences that are related by divergence from a common ancestor. Thus the degree of similarity between two sequences can be measured, while their homology is a case of being either true or false. This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated.
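To illustrate the point that similarity is a measurable quantity, the following minimal sketch computes percent identity between two sequences that have already been aligned. The function name `percent_identity` and the convention of skipping any column that contains a gap are assumptions made for this example; alignment programs differ in how they treat gaps when reporting identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns in which neither sequence has a gap.

    Both inputs are rows of the same alignment, so they must have equal length.
    Gap characters are assumed to be '-'.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns where both sequences have a residue
    matches = 0   # columns where those residues are identical
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        if x == y:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment: four of the five ungapped columns are identical.
    print(percent_identity("ACGT-A", "ACCTTA"))
```

A high percent identity is evidence for homology but does not by itself prove it; that inference still depends on alignment length and statistical significance.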

Protein Function Analysis:

This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains. Highly significant hits against these different pattern databases allow you to approximate the biochemical function of your query protein.

Structural Analysis:

This set of tools allows you to carry out further, more detailed analysis on your query sequence, including evolutionary analysis and the identification of mutations, hydropathy regions, CpG islands and compositional biases. The identification of these and other biological properties provides clues that aid the search to elucidate the specific function of your sequence.
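As a small example of the compositional measures mentioned above, the sketch below computes the GC fraction and the observed/expected CpG ratio of a DNA sequence, using the commonly quoted formula (number of CpG x sequence length) / (number of C x number of G). The function name `cpg_stats` is hypothetical. Thresholds for calling a CpG island (often quoted as GC content above 50% and an observed/expected ratio above 0.6 over a window of at least 200 bp) vary between tools and are not applied here.

```python
def cpg_stats(seq):
    """Return (gc_fraction, cpg_observed_over_expected) for a DNA string."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0

    c_count = s.count("C")
    g_count = s.count("G")
    cpg_count = s.count("CG")  # CG dinucleotides on the given strand

    gc_fraction = (c_count + g_count) / n
    if c_count == 0 or g_count == 0:
        obs_exp = 0.0  # the ratio is undefined without both C and G; report 0
    else:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    return gc_fraction, obs_exp


if __name__ == "__main__":
    gc, ratio = cpg_stats("CGCGCG")
    print(gc, ratio)
```

In practice such a function would be applied to sliding windows along a genomic sequence rather than to the whole sequence at once.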

Gene Identification and Sequence Analyses

Sequence analysis refers to the study of the different features of a biomolecule, such as a nucleic acid or protein, that give it its unique function. First, the sequences of the corresponding molecule(s) are retrieved from public databases. After refinement, if needed, they are subjected to various tools that estimate, with high accuracy, their characteristics related to structure, function and evolutionary history, or identify homologues. Which tool should be used depends on the nature of the analysis to be carried out. For example, data retrieval tools such as NCBI's Entrez [9] allow one to search and retrieve data from a wide range of data domains. Similarly, pattern discovery tools such as Expression Profiler [10] and GeneQuiz [11] allow researchers to search for different patterns in the given data. Another set of tools is dedicated to sequence comparison. These tools, such as BLAST (Basic Local Alignment Search Tool) [12] and ClustalW [13], enable one to compare gene or protein sequences to study their evolutionary history or origin. Data visualization tools such as GeneView [15], Genes-Graphs [17], Jalview [14] and TreeView [16] allow researchers to view data in graphical form. These tools use advanced mathematical modelling and statistical inference, such as dynamic programming, hidden Markov models (HMMs), regression analysis, artificial neural networks (ANNs), clustering and sequence mining, to analyse the given sequence. These analyses are popular because of their wide range of applications in the biological sciences, their ease of use, and their capacity to produce a wealth of knowledge about the gene or protein in question.
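Dynamic programming, mentioned above, underlies pairwise sequence alignment. The following is a minimal sketch of a Needleman-Wunsch style global alignment score with a linear gap penalty. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are arbitrary choices for illustration, not the defaults of BLAST or ClustalW, which use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + substitution  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap                     # gap in b
            left = score[i][j - 1] + gap                   # gap in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The table takes time and memory proportional to the product of the two sequence lengths, which is why database search tools rely on heuristics rather than full dynamic programming against every entry.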
These types of analyses are particularly useful for the identification of promoters, terminators or untranslated regions involved in the regulation of expression, the recognition of transit peptides, exons, introns or open reading frames (ORFs), and the identification of variable regions to be used as signatures for diagnostic purposes. Sequence analyses are therefore among the most frequently performed analyses in bioinformatics. For example, Stoilov et al. used sequence analysis coupled with homology modelling to investigate the genetic basis of primary congenital glaucoma (PCG) [18]. Similarly, Rho-independent transcription terminators from a collection of 343 prokaryotic genomes were predicted quite accurately (<6% false-positive predictions) using various computational tools [22]. Most gene predictions rely on complementary DNA (cDNA) and expressed sequence tags (ESTs). However, cDNA/EST information is often scarce and incomplete, which makes the task of finding new genes very difficult. Computational scientists have therefore developed another technique, referred to as ab initio gene identification. The potential of this technique was demonstrated in a study that predicted 88% of already verified exons and 90% of the coding nucleotides from Drosophila melanogaster with a very low rate of false-positive identification [23].
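The idea of scanning a sequence for open reading frames can be sketched in a few lines. The example below is a deliberately simplified illustration, not the algorithm of ORF Finder or of any ab initio gene predictor: it scans only the three forward-strand frames, uses the standard stop codons, and treats the first ATG after a stop as the start. The function name `find_orfs` and the `min_codons` parameter are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon, inclusive.
    min_codons counts the codons before the stop, including the ATG.
    """
    s = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the current candidate ATG, if any
        for i in range(frame, len(s) - 2, 3):
            codon = s[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATAGGG"):
        print(frame, start, end)
```

Real gene finders additionally scan the reverse complement, model codon usage and splice signals, and score candidates statistically, which is what allows them to cope with introns in eukaryotic genomic DNA.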

Tool: Description

BLAST: A search tool for DNA or protein sequences, based on sequence similarity.

JIGSAW: Finds genes and predicts splice sites in the selected DNA sequences.

Virtual Footprint: Analyses a whole prokaryotic genome (with one regulator pattern) as well as promoter regions with several regulator patterns.

Softberry Tools: A collection of tools for the annotation of animal, plant and bacterial genomes, along with structure and function prediction of proteins and RNA.

Genscan: Predicts exon-intron sites in genomic sequences.

ORF Finder: Finds open reading frames (ORFs) in putative genes.

Sequerome: Used for sequence profiling.

HMMER: Searches sequence databases for homologous protein sequences.

Clustal Omega: Performs multiple sequence alignments.

novoSNP: Finds single nucleotide variations in DNA sequences.

PPP: Prokaryotic promoter prediction tool, used to predict promoter sequences upstream of a gene.

WebGeSTer: A database of transcription terminator sequences, used to predict the termination sites of genes during transcription.

ProtParam: Computes the physicochemical properties of proteins.
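One of the physicochemical values that tools such as ProtParam report is the grand average of hydropathicity (GRAVY), the mean Kyte-Doolittle hydropathy value over all residues. The sketch below is an independent illustration of that calculation, not ProtParam's own code; the function name `gravy` and the choice to reject non-standard residue codes are assumptions for this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Mean hydropathy of a protein sequence given in one-letter code."""
    s = protein.upper()
    if not s:
        raise ValueError("empty sequence")
    unknown = set(s) - set(KYTE_DOOLITTLE)
    if unknown:
        # Assumption: ambiguous or non-standard codes are rejected, not skipped.
        raise ValueError("unsupported residue codes: " + ", ".join(sorted(unknown)))
    return sum(KYTE_DOOLITTLE[aa] for aa in s) / len(s)


if __name__ == "__main__":
    print(round(gravy("MKVLAAGIV"), 3))
```

Positive values indicate an overall hydrophobic sequence and negative values an overall hydrophilic one.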

Phylogenetic Analyses

Phylogenetic analyses are procedures used to reconstruct the evolutionary relationships among a group of related molecules or organisms, to predict certain features of a molecule with unknown function, to determine genetic relatedness and to track gene flow [26]. These relationships can be represented on a genealogical tree, or tree of life. The fundamental principle of phylogeny is to group living organisms according to their degree of similarity: the greater the similarity, the closer the organisms appear on a tree. Phylogenetic comparative analysis is widely used to control for the lack of statistical independence among species [27]. The methods used to construct a phylogenetic tree are divided into three major groups: parsimony methods, distance methods and likelihood methods. None of the methods is perfect; each has its own strengths and weaknesses. For example, distance-based trees are easy to build but not that accurate. The maximum parsimony and maximum likelihood methods are (in theory) the most accurate, but they take more time to run [28].
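Distance methods start from a matrix of pairwise evolutionary distances. As a minimal sketch, the code below computes the proportion of differing sites (the p-distance) between aligned DNA sequences and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), to account for multiple substitutions at the same site. The function names are hypothetical, the sequences are assumed to be already aligned, and columns containing gaps are skipped; a tree-building step such as neighbour joining would then take the resulting matrix as input.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]  # assumption: skip gapped columns
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(aligned):
    """Map each pair of sequence names to its Jukes-Cantor distance."""
    names = list(aligned)
    return {(m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
            for m in names for n in names}


if __name__ == "__main__":
    toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "ACGAACTA"}
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 4))
```

The corrected distance is always at least as large as the raw p-distance, and the gap between the two widens as sequences diverge.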

Phylogenetic tools are commonly used to test various evolutionary hypotheses and have become essential for functional genomics, mostly when the functions of a gene are not known. For example, prior to the expression of an algal membrane protein, plastid terminal oxidase 1 (PTOX1), in tobacco chloroplasts, the authors conducted a phylogenetic analysis to reconstruct the evolutionary history and determine essential features of that polypeptide [30]. The phylogenetic analysis revealed that the Chlamydomonas reinhardtii PTOX1 (Cr-PTOX1) has typical signatures of higher-plant PTOX, such as iron-binding sites, a conserved exon and various blocks of amino acids needed to act as a plastoquinol terminal oxidase [30]. Similarly, Chen et al. used phylogenetic analysis to study the evolutionary history of respiratory mechanisms in the deep-sea bacterium Shewanella piezotolerans WP3 [31].

Tool: Description

MEGA (Molecular Evolutionary Genetics Analysis): Builds phylogenetic trees to study evolutionary closeness.

JStree: An open-source library for viewing and editing phylogenetic trees to improve their presentation.

Jalview: An alignment editor, used to refine alignments.

PAML: A molecular phylogenetic analysis tool based on the maximum likelihood method.

TreeView: Software for viewing phylogenetic trees, with options for changing the view.

MOLPHY: A molecular phylogenetic analysis tool based on the maximum likelihood method.

PHYLIP: A package for phylogenetic studies.

Sequence Databases

A biological sequence database is a large collection of information about biological molecules such as proteins, nucleic acids and other polymers, in which each molecule is identified by a unique key. The stored information is not only important for future use but also serves as a tool for primary sequence analyses. With the advance of high-throughput sequencing techniques, sequencing has reached the whole-genome scale, which produces a huge amount of data every day. The need to submit and store this information so that it is easily accessible to the scientific community has led to the development of several databases worldwide. Each database has become a self-contained representation of a molecular unit of life. This section deals with such databases, as an understanding of them helps in retrieving information relevant to one's project. Databases contain a variety of information and so are classified as primary, secondary or composite databases, depending on the information stored in them. For example, the data in a primary database, such as sequence or structure data, are obtained through experimentation, for instance yeast two-hybrid assays, affinity chromatography, X-ray diffraction (XRD) or NMR. SWISS-PROT [32], UniProt [33], PIR [34], GenBank [35], EMBL [36], DDBJ [37] and the Protein Data Bank (PDB) [38] are examples of primary databases. A secondary database contains information that is derived from the analysis of data stored in primary databases, such as conserved sequences, active sites of a protein family or conserved secondary motifs of protein molecules [39,40]. Examples of secondary databases include CATH [42], eMOTIF [44], SCOP [41] and PROSITE [43]. Consequently, primary databases are archival in nature, while secondary databases are regarded as curated databases. A composite database comprises information drawn from different primary sources.
Examples of composite databases include NRDB (non-redundant database), which comprises data obtained from GenBank (CDS translations), SWISS-PROT, PDB, PIR and PRF. Similarly, the INSD (International Nucleotide Sequence Database) is another example of a composite database; it is an assortment of nucleic acid sequences from GenBank, EMBL and DDBJ. UniProt (the universal protein sequence database) [45] is another example, being a collection of sequences drawn from other databases: Swiss-Prot, PIR-PSD and TrEMBL. Similarly, wwPDB (worldwide PDB) is a composite of the 3D structures held in the RCSB (Research Collaboratory for Structural Bioinformatics) PDB, MSD and PDBj [46].
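The idea behind a non-redundant composite database can be sketched as follows: records from several sources are merged, and entries with an identical sequence are collapsed into one while keeping track of where each came from. This is only a toy illustration under simple assumptions (exact, case-insensitive sequence identity as the redundancy criterion); the source names, accessions and sequences below are made up, and real composite databases use their own merging rules.

```python
def build_nonredundant(sources):
    """Merge records from several sources, collapsing identical sequences.

    sources maps a source name to a list of (accession, sequence) pairs.
    Returns a dict mapping each distinct sequence to its cross-references.
    """
    merged = {}
    for source, records in sources.items():
        for accession, sequence in records:
            key = sequence.upper()  # assumption: identity ignores letter case
            merged.setdefault(key, []).append(source + ":" + accession)
    return merged


if __name__ == "__main__":
    # Hypothetical toy records; not real database entries.
    toy_sources = {
        "sourceA": [("a1", "MKVLA"), ("a2", "MSTNP")],
        "sourceB": [("b1", "mkvla")],
    }
    for sequence, xrefs in build_nonredundant(toy_sources).items():
        print(sequence, xrefs)
```

Keeping the cross-references alongside each merged sequence preserves the link back to the archival primary entries.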

Genome Sequence Databases

GenBank, built by the NCBI [35], is a large collection of genome sequences from over 250,000 species. The data in GenBank can be retrieved through the NCBI's integrated retrieval system, Entrez, while the literature is accessible via PubMed [47]. Each sequence record contains information about the organism, the bibliography and a set of additional features, which include coding regions, untranslated regions, promoters, terminators, introns, exons, repeat regions and translations. The sequence information stored in GenBank is obtained through submissions both from individual laboratories and from large-scale genome sequencing projects. Similarly, Xenbase is an up-to-date resource of genomic and biological data on frogs, including Xenopus tropicalis and Xenopus laevis [48]; Xenopus spp. are used as model organisms that provide new information in developmental biology, which may be exploited for modelling studies of human diseases. The Saccharomyces Genome Database (SGD) contains comprehensive information on the yeast Saccharomyces cerevisiae and provides bioinformatics tools to explore and analyse the data available in SGD.
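Entrez can also be queried programmatically through the NCBI E-utilities web service. The sketch below builds an EFetch request URL for a GenBank flat-file record using only the Python standard library. The helper names `efetch_url` and `fetch_record` are hypothetical, the accession shown is just an example, and NCBI's usage guidelines (request rate limits, optional API key and contact parameters) should be checked before running real queries.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an EFetch URL for one record (GenBank flat file by default)."""
    query = urlencode({
        "db": db,            # Entrez database to query
        "id": accession,     # accession or UID of the record
        "rettype": rettype,  # record format, e.g. "gb" or "fasta"
        "retmode": retmode,  # plain text output
    })
    return EFETCH_URL + "?" + query


def fetch_record(accession):
    """Download a record as text. Requires network access."""
    with urlopen(efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Example accession, used here only for illustration.
    print(efetch_url("U49845"))
```

Setting `rettype="fasta"` in the same call requests the sequence alone, without the annotation features listed in the flat file.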

Database: Description

Nucleotide Databases

GenBank: A member of the International Nucleotide Sequence Databases (INSD) and a nucleotide sequence resource.

DNA Data Bank of Japan: A member of the International Nucleotide Sequence Databases (INSD) and one of the biggest resources for nucleotide sequences.

Rfam: A collection of RNA families, represented by multiple sequence alignments.

Protein Databases

Proteomics Identifications Database: A public resource containing supporting evidence for the functional characterization and post-translational modification of proteins and peptides.

Protein Data Bank: A major resource containing experimentally determined structures of nucleic acids, proteins and other complex assemblies.

Pfam: A collection of protein families.

Genome Databases

Ensembl: A database containing annotated genomes of eukaryotes, including human, mouse and other vertebrates.

Miscellaneous Databases

TAIR: The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular data for the model plant Arabidopsis thaliana. It provides information on gene structure, gene products, gene expression, DNA and seed stocks, genome maps, and genetic and physical markers.

Reactome: A peer-reviewed resource of human biological processes.

Medherb: A resource database for medicinally important herbs.

Signalling & Metabolic Pathway Databases

SGMP: The Signaling Gateway Molecule Pages (SGMP) database provides highly structured data on proteins that exist in different functional states and participate in signal transduction pathways.

KEGG: A suite of databases and associated software for understanding and simulating higher-order functional behaviours of the cell or the organism from its genome information.

PID: The Pathway Interaction Database (PID) is a collection of curated and peer-reviewed pathways composed of human molecular signalling and regulatory events and key cellular processes. It serves as a resource for studying cellular pathways, with a special emphasis on cancer.

CMAP: The Complement Map Database is an easily accessible research tool that assists the complement community and scientists from related disciplines in exploring the complement network and discovering new connections.

Molecular Interactions

Proteins rarely perform their functions in isolation; they interact with other molecules all the time to accomplish a given process. Understanding how biomolecules interact with other molecules has numerous applications, for example in drug design, protein folding and purification techniques [75], and has therefore become one of the most frequently pursued research areas, using either experimental or bioinformatics approaches. An understanding of molecular interactions is also important for clarifying the biological functions of a molecule. For example, protein-protein interactions play a vital role in cellular activities such as transport, signalling, homeostasis, cellular metabolism and several biochemical processes [76].

In this regard, bioinformatics is particularly handy for predicting protein-protein interactions without resorting to costly and time-consuming physical approaches such as nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography. Crystal structure coordinates often give deceptively static views of interactions, as a complex cannot be characterized by a single structure. It has thus been realized that the 3D structure of a molecule cannot provide a complete picture of each individual interaction. Hence, computational methods capable of predicting reliable protein-protein interactions have become vital. Such studies produce useful information that permits scientists to manipulate a specific pathway to achieve the required change(s) in the cell.

Apart from predicting protein structures, molecular modelling can also help in selecting the particular conformations that govern the activity of a biomolecule. Other applications include identifying residues at 'hot spots' of protein interfaces by docking a small molecule, called a ligand, onto a protein. Many software packages are available to perform docking calculations; only a few, which are the most widely used, will be discussed here.

The immense generation of data has led to the development of various databases to organize and enable the study of molecular interactions. For example, signal transduction pathway databases may contain protein-protein, DNA-RNA, protein-RNA, DNA-substrate and protein-DNA interactions [88]. The Biomolecular Interaction Network Database (BIND) is one of the major information resources that provide access to pairwise molecular interactions and complexes [89]. Likewise, MINT is another database, which supplies information on functional interactions of biological molecules [90]. A list of selected tools to study protein-protein interactions is given in Table 5.
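Pairwise interaction records of the kind stored in such databases are naturally represented as a graph. The minimal sketch below builds an undirected network from a list of interacting pairs, lists highly connected proteins (hubs) and groups proteins into connected components. The function names and the protein identifiers are hypothetical, and this is not the clustering method of any particular tool; network analysis programs use more refined density-based approaches.

```python
from collections import defaultdict


def build_network(pairs):
    """Undirected interaction network: protein -> set of partners."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def hubs(network, min_partners=2):
    """Proteins with at least min_partners distinct interaction partners."""
    return sorted(name for name, partners in network.items()
                  if len(partners) >= min_partners)


def connected_components(network):
    """Groups of proteins linked directly or indirectly by interactions."""
    seen = set()
    components = []
    for start in sorted(network):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start protein
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(network[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    # Hypothetical pairwise interactions, for illustration only.
    toy_pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
    net = build_network(toy_pairs)
    print(hubs(net))
    print(connected_components(net))
```

Each connected component is a candidate group of functionally related proteins, which can then be inspected further with dedicated tools.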

Tool: Description

BIND: A database that provides access to molecular interactions and bio-complexes.

PathBLAST: Searches the protein-protein interaction network of a selected organism and extracts all interaction pathways that align with the query.

SMART: The Simple Modular Architecture Retrieval Tool; provides multiple kinds of information about the query protein.

IntAct: An open-source database system that provides analysis tools for molecular interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.

MCODE: Suited to both computationally and biologically oriented researchers. Its features include fast network clustering, fine-tuning of results with numerous node-scoring and cluster-finding parameters, interactive exploration of cluster boundaries and content, management of multiple result sets, creation of cluster sub-networks, and plain-text export.

Graemlin: Capable of scalable multiple network alignment; its functional evolution model allows both the generalization of existing alignment scoring schemes and the location of conserved network topologies other than protein complexes and metabolic pathways.

MOE: An integrated package of tools used for drug discovery. It combines visualization, modelling and drug discovery on one platform.

Conclusion

Bioinformatics is a relatively new discipline and has progressed very fast in the last few years. It has made it possible to test our hypotheses virtually and therefore allows a better-informed decision before launching costly experiments. Although more and more tools for analysing proteomes and genomes, predicting structures, rational drug design and molecular simulations are being developed, none of them is 'perfect'. Hence, the search for a good package to solve the given problems will continue. One thing is clear: future research will be directed mostly by the accessibility of databases, which may be either general or specific. It can also be safely expected, based on the advances in the field, that bioinformatics tools and software packages will give results that are more accurate and thus yield more reliable interpretations. Prospects in the field of bioinformatics include its future contribution to the functional understanding of the human genome, leading to improved discovery of drug targets and individualized therapy. Hence, bioinformatics and other scientific disciplines must move hand in hand to flourish for the welfare of humanity.

References

1. Mount DW (2004) Sequence and genome analysis. New York: Cold Spring.

2. Peitsch MC (1996) ProMod and Swiss-Model: Internet-based tools for automated comparative protein modelling. Biochem Soc Trans 24: 274-279.

3. Hogeweg P (2011) The roots of bioinformatics in theoretical biology. PLoS Comput Biol 7: e1002021.

4. Geer RC, Sayers EW (2003) Entrez: making use of its power. Brief Bioinform 4: 179-184.

5. Zhang Y, Phillips CA, Rogers GL, Baker EJ, Chesler EJ, et al. (2014) On finding bicliques in bipartite graphs: a novel algorithm and its application to the integration of diverse biological data types. BMC Bioinformatics 15: 110.

6. Lencz T, Guha S, Liu C, Rosenfeld J, Mukherjee S, et al. (2013) Genome-wide association study implicates NDST3 in schizophrenia and bipolar disorder. Nat Commun 4: 2739.

1. Kingsford CL, Ayanbule K, Salzberg SL (2007) Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol 8: R22.

3. Ouzounis CA (2012) Rise and demise of bioinformatics? Promise and progress. PLoS Comput Biol 8: e1002487.

4. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.

a. Hesper B, Hogeweg P (1970) Bioinformatica: een werkconcept. Kameleon 1: 28-29.

5. Dibyajyoti S, Bin ET, Swati P (2013) Bioinformatics: The effects on the cost of drug discovery. Galle Med J 18:44-50.

6. Ouzounis CA, Valencia A (2003) Early bioinformatics: the birth of a discipline–a personal view. Bioinformatics 19: 2176-2190.

7. Molatudi M, Molotja N, Pouris A (2009) A bibliometric study of bioinformatics research in South Africa. Scientometrics 81: 47-59.

8. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL (2003) The analysis of gene expression data: an overview of methods and software, Springer, New York.

9. Hoersch S, Leroy C, Brown NP, Andrade MA, Sander C (2000) The GeneQuiz web server: protein functional analysis through the Web. Trends Biochem Sci 25: 33-35.

10. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31: 365-370.

1. Sehar U, Mehmood MA, Hussain K, Nawaz S, Nadeem S, et al. (2013) Domain wise docking analyses of the modular chitin binding protein CBP50 from Bacillus thuringiensis serovar konkukian S4. Bioinformation 9: 901-907.

3. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.

4. Clamp M, Cuff J, Searle SM, Barton GJ (2004) The Jalview Java alignment editor. Bioinformatics 20: 426-427.

5. Thomas P, Starlinger J, Vowinkel A, Arzt S, Leser U (2012) GeneView: a comprehensive semantic search engine for PubMed. Nucleic Acids Res 40: W585-591.

6. Page RD (2001) TreeView. Glasgow University, Glasgow, UK.

7. Stoilov I, Akarsu AN, Alozie I, Child A, Barsoum-Homsy M, et al. (1998) Sequence analysis and homology modeling suggest that primary congenital glaucoma on 2p21 results from mutations disrupting either the hinge region or the conserved core structures of cytochrome P4501B1. Am J Hum Genet 62: 573-584.

8. Tekaia F, Gordon SV, Garnier T, Brosch R, Barrell BG, et al. (1999) Analysis of the proteome of Mycobacterium tuberculosis in silico. Tuber Lung Dis 79: 329-342.

9. Mehmood MA, Xiao X, Hafeez FY, Gai Y, Wang F (2011) Molecular characterization of the modular chitin binding protein Cbp50 from Bacillus thuringiensis serovar konkukian. Antonie Van Leeuwenhoek 100: 445-453.

10. Salamov AA, Solovyev VV (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res 10: 516-522.

11. Peng Z, Lu Y, Li L, Zhao Q, Feng Q, et al. (2013) The draft genome of the fast-growing non-timber forest species moso bamboo (Phyllostachys heterocycla). Nat Genet 45: 456-461, 461e1-2.

12. Khan FA, Phillips CD, Baker RJ (2014) Timeframes of speciation, reticulation, and hybridization in the bulldog bat explained through phylogenetic analyses of all genetic transmission elements. Syst Biol 63: 96-110.

13. Freckleton RP, Harvey PH, Pagel M (2002) Phylogenetic analysis and comparative data: a test and review of evidence. Am Nat 160: 712-726.

14. Price MN, Dehal PS, Arkin AP (2010) FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One 5: e9490.

15. Bast F (2013) Sequence similarity search, multiple sequence alignment, model selection, distance matrix and phylogeny reconstruction. Nat Protoc.

16. Ahmad N, Michoux F, Nixon PJ (2012) Investigating the production of foreign membrane proteins in tobacco chloroplasts: expression of an algal plastid terminal oxidase. PLoS One 7: e41722.

17. Chen Y, Wang F, Xu J, Mehmood MA, Xiao X (2011) Physiological and evolutionary studies of NAP systems in Shewanella piezotolerans WP3. ISME J 5: 843-855.

18. UniProt Consortium (2014) Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res 42: D191-198.

19. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2008) GenBank. Nucleic Acids Res 36: D25-30.

20. Miyazaki S, Sugawara H, Gojobori T, Tateno Y (2003) DNA Data Bank of Japan (DDBJ) in XML. Nucleic Acids Res 31: 13-16.

21. Wu CH, Yeh LS, Huang H, Arminski L, Castro-Alvear J, et al. (2003) The Protein Information Resource. Nucleic Acids Res 31: 345-347.

22. Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, et al. (2005) The EMBL Nucleotide Sequence Database. Nucleic Acids Res 33: D29-33.

23. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235-242.
