National Center for Biotechnology Information
www.ncbi.nlm.nih.gov/
Brief description: The National Center for Biotechnology Information (NCBI), part of the National Institutes of Health, aims to advance science and health by providing access to biomedical and genomic information. NCBI hosts databases and tools for genetic and sequence data, such as GenBank, OMIM, MMDB, UniGene, CGAP, Entrez, and BLAST.
BLAST
blast.ncbi.nlm.nih.gov/Blast.cgi
Brief description: BLAST, the Basic Local Alignment Search Tool, is provided through the NCBI. It compares the nucleotide or protein (amino acid) sequences that users enter with sequences in databases and calculates the statistical significance of any similarities. Some BLAST programs compare a translated nucleotide sequence with protein sequences, or a protein sequence with translated nucleotide sequences. BLAST is also helpful in comparing genomes.
PubMed
www.ncbi.nlm.nih.gov/pubmed
Brief description: PubMed is a database of citations for biomedical literature, including journal articles and online books, that is maintained by the NCBI. It is a useful research resource because users can search by topic, author, and keyword to find evidence that supports their work.
Online Mendelian Inheritance in Man (OMIM)
www.ncbi.nlm.nih.gov/omim
Brief description: OMIM is a database provided by NCBI that catalogs human genes along with their mutations and associated phenotypes. It is geared toward physicians, other health care professionals, and genetic researchers, which makes it a useful tool for researching human genetic diseases. Users can also search by phenotype to find the related genes.
NCBI Conserved Domain Search
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
Brief description: The Conserved Domain Search tool helps identify the domains of proteins and protein families. A conserved domain is a stretch of amino acids that can function independently, so finding one in a sequence can suggest the protein’s function.
CDART: Conserved Domain Architecture Retrieval Tool
www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps
Brief description: The Conserved Domain Architecture Retrieval Tool finds protein similarities by using domain profiles rather than the entire protein sequence.
European Bioinformatics Institute
www.ebi.ac.uk/index.html
Brief description: The European Bioinformatics Institute (EBI) compiles data from research in computational biology and from life science experiments. EBI also provides bioinformatics training, which makes it a useful resource for scientists at all levels of experience.
Protein Data Bank
www.rcsb.org/pdb/
Brief description: The Research Collaboratory for Structural Bioinformatics (RCSB) maintains the Protein Data Bank (PDB). The PDB gives users access to the three-dimensional structures of proteins and to tools for visualizing them.
GenomeNet Database Resources
www.genome.jp/
Brief description: GenomeNet is a network of databases for genome research and related areas of the biomedical sciences. It offers various bioinformatics tools for sequence analysis, genome analysis, and chemical analysis.
1.1. Access points for integrated suites of sequence analysis tools
Multiple sequence alignment (protein)
www.ncbi.nlm.nih.gov/tools/cobalt/cobalt.cgi?link_loc=BlastHomeLink
Brief description: COBALT, the Constraint-based Multiple Alignment Tool, uses conserved domains to compute multiple sequence alignments of proteins. COBALT is a useful tool for studying and constructing phylogenetic relationships.
ExPASy
www.expasy.org/
Brief description: ExPASy, previously known as the “Expert Protein Analysis System,” is a portal that combines access to many kinds of resources for the biological sciences. It provides databases and software tools in areas such as proteomics, transcriptomics, population genetics, genomics, systems biology, phylogeny, and more.
Multiple sequence alignments
www.ebi.ac.uk/Tools/msa/
Brief description: A multiple sequence alignment (MSA) is the alignment of at least three biological sequences. The MSA tools on this page are useful when inferring evolutionary relationships, inferring protein structure, and constructing phylogenetic trees.
PRABI (Rhone-Alpes Bioinformatics Center)
www.prabi.fr/
Brief description: PRABI is a portal that provides access to training information and research resources for the fields of bioinformatics and biostatistics. These resources include database access, sequence comparisons, nucleotide sequence analysis, protein sequence analysis, structural bioinformatics, genomics, and more.
Biology Workbench/San Diego Supercomputer Center
workbench.sdsc.edu/ Currently unavailable for lack of funding
1.2. Some resources for human genomics
The Human Genome (NCBI)
www.ncbi.nlm.nih.gov/genome/guide/human/
Brief description: The Human Genome resource provided by NCBI features a genome browser and a genome data viewer that let users analyze genome assemblies chromosome by chromosome. This is helpful in exploring evolutionary relationships, genetic testing, and links between genetic variation and human health.
Human Genome Browser Gateway (UCSC)
genome.ucsc.edu/cgi-bin/hgGateway?db=hg10
Brief description: The UCSC Human Genome Browser allows users to visualize genetic data interactively. The browser is useful because it allows users to search for clades, genes, proteins, and specific positions in genomes.
ENCODE
genome.ucsc.edu/ENCODE/
Brief description: ENCODE stands for the Encyclopedia of DNA Elements. The purpose of ENCODE is to identify the functional elements of the human genome. These include elements that function at the RNA and protein levels, as well as regulatory elements that control when a gene is active.
1.3. Databases with entire genomic sequences
National Center for Genome Resources
www.ncgr.org/
Brief description: The National Center for Genome Resources, or NCGR, is a research institution that provides sequencing of entire genomes, RNA, exomes, and more. It also supports researchers with experimental design, data visualization, and analysis.
J. Craig Venter Institute
www.jcvi.org/
Brief description: The J. Craig Venter Institute (JCVI) conducts research in genomics and bioinformatics. Its goal is to make advances in environmental science and human health. Its achievements include work on infectious disease, the first diploid human genome, the first synthetic cell, and many more. JCVI is a helpful resource on any of these topics and on other topics related to genomics.
Gramene: A Resource for Comparative Grass Genomics
www.gramene.org/
Brief description: Gramene is a database that holds a collection of the genomes of various crops and plant species. The site allows users to search this database by keyword, species, or genome. It also allows users to access and analyze the regulatory and metabolic pathways of plants.
Maize GDB (Maize Genetics and Genomics Database)
www.maizegdb.org/gbrowse
Brief description: Maize GDB is a database dedicated to serving researchers with genomic data for Zea mays. Zea mays is the crop plant corn, or maize, and it is well suited to research because it is a model organism. The genome assemblies provided by Maize GDB are a useful resource for those doing research with Zea mays and serve as a point of comparison with other crops.
1.4. Example of a specialized structure prediction tool
COILS Server
www.ch.embnet.org/software/COILS_form.html
Brief description: The COILS Server is provided by ExPASy, and its name comes from the fact that it predicts the coiled-coil regions in proteins. Users paste a sequence into the tool, and the program compares it to the sequences in its database of known parallel two-stranded coiled coils. It then produces a similarity score, which is used to estimate how likely the protein is to adopt a coiled-coil conformation.
1.5. Metabolic and signaling pathways
BioCyc (several organisms)
biocyc.org/
Brief description: BioCyc has a collection of over 1,000 pathway and genome databases. Users can search genomes as well as specific genes, metabolic pathways, metabolites, and enzymes. BioCyc can also compute predictions of metabolic pathways and enzymes.
EcoCyc (Escherichia coli)
ecocyc.org/
Brief description: EcoCyc is a database that provides the entire genome for the bacterium Escherichia coli. This database also provides a compilation of literature-based information regarding the transcription, transport, and metabolic processes of Escherichia coli.
Saccharomyces cerevisiae (brewer’s yeast)
www.yeastgenome.org/
Brief description: The Saccharomyces Genome Database, also known as SGD, provides users with the tools to search and analyze biological information for Saccharomyces cerevisiae. These tools include sequencing, and literature that allows users to find and compare functional relationships.
Arabidopsis thaliana (thale cress)
www.arabidopsis.org/biocyc/
Brief description: This website is a collection of tools that offers users information from databases about Arabidopsis thaliana. In addition, these tools include ways for users to study the pathways, compounds, gene expression, enzymes, and a metabolic overview of Arabidopsis thaliana as well.
Danio rerio (zebra fish)
useast.ensembl.org/Danio_rerio/Info/Index
Brief description: This database for Danio rerio (zebra fish) gives users access to the species’ genome assembly, DNA sequence, karyotype, gene annotations, phenotype data, and variation data. The interface also offers downloads of alignments for comparative genomics involving Danio rerio.
Mus musculus (mouse)
www.informatics.jax.org
Brief description: MGI, or Mouse Genome Informatics, is an international database that provides biological information about the laboratory mouse in order to aid the study of human health and disease. Its information and tools cover, among other things, genes, phenotypes, disease connections, gene expression, gene function, vertebrate homology, and metabolic pathways.
Homo sapiens (human)
www.hmdb.ca
Brief description: The Human Metabolome Database (HMDB) is a database that compiles information about metabolites in the human body.
1.6. Additional learning resources (notice the absence of Wikipedia on this list)
Taxonomy:
www.hyperdictionary.com/dictionary/taxonomy
Brief description: Taxonomy is a means of placing animals and plants into respective categories based on their relationships.
Gene ontology:
www.yeastgenome.org/cgi-bin/GO/goTermFinder.pl
Brief description: Gene ontology terms can be used to see whether a set of genes has something significant in common. Gene ontology may give insight into similar gene functions and components.
Phylogenetic trees:
encyclopedia.thefreedictionary.com/phylogenetic tree
Brief description: Phylogenetic Trees depict the evolutionary relationships among different species. These relationships are inferred through a series of comparisons relating to genetics and physical characteristics.
aleph0.clarku.edu/~djoyce/java/Phyltree/cover.html
Brief description: The tree model is not the best method to show the evolutionary relationships among individuals within a species, closely related species, hybrid species, or distant interactions. Furthermore, mutations are a fundamental part of constructing a phylogenetic tree: they serve as a measure of time and give some insight into the evolutionary distance between species.
www.phylogenetictrees.com/segminator.php
Brief description: The segminator tool analyzes viral data. The additional tools provided include obtaining nucleotide and amino acid residues and identifying variables. These tools can provide a basis for inferring phylogenetic relationships.
Google Scholar
scholar.google.com/schhp?hl=en&tab=ws
Brief description: Google Scholar lets users search a wide range of academic literature suitable as research evidence.
WolframAlpha
www.wolframalpha.com/
Brief description: WolframAlpha is a useful interface for computing in the fields of mathematics, science, technology, and even everyday life. The interface uses a combination of AI technology, algorithms, and a knowledge base to compute answers.
1.7. The National Center for Biotechnology Information (NCBI)
NCBI is a comprehensive network of databases that includes information on nucleotidyl sequences (e.g., chromosomal DNA, mRNA, non-protein–coding RNAs), amino acyl sequences (proteins), taxonomy, and genetically based diseases (also known as “inborn errors of metabolism”). Here’s a diagram that illustrates the relationships among these different databases:
You may want to continue exploring NCBI. This link will take you to a comprehensive list of all databases in it:
www.ncbi.nlm.nih.gov/guide/all/#databases_.
2. Case Study: An Unknown Human Nucleotidyl Sequence
Specific Learning Objectives
i. Describe what GenBank files are and be able to read them.
ii. Describe what FASTA format is and learn how to identify sequences in FASTA format.
iii. Become familiar with the BLAST program (check NCBI websites) and learn how to use it.
NOTE: Your instructor may decide to assign you a sequence that differs from the one in this section.
If this is the case, enter modifications to this document as necessary.
2.1.
The nucleotidyl-residue (or “nucleotide,” for short) sequence on the following page comes from a human DNA sequencing project. You are given the task of identifying the location of this sequence within the human genome (Alaie et al., 2012). The problem is that the human genome is made up of 3 billion base pairs (bp). To check even 1000 bp by eye in search of this sequence is quite time-consuming (as you will find out shortly). Imagine if you had to check a billion nucleotides in a sequence!
Notice that the sequence provided below is in FASTA format: it does not start directly with nucleotide abbreviations (A, G, T, C), and the sequence itself does not include numbers, spaces, or symbols. Instead, a name or designation for the sequence is written in the first line, preceded by the “>” symbol.
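The FASTA layout described above can be read by a very small program. The following is a minimal sketch, not an official parser; the function name `parse_fasta` and the two-line example record are hypothetical and are not the sequence used in this lab.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A ">" line starts a new record; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are joined into one uppercase string.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record for illustration only.
example = """>example_sequence hypothetical record
ATGCGTACGT
TTGACC
"""

records = parse_fasta(example)
name, sequence = records[0]
print(name)           # example_sequence hypothetical record
print(len(sequence))  # 16
```

The header line is kept separate from the residues, which is why only the sequence lines should be counted when checking the length of a sequence.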
Start by scanning (by eye) the given sequence (3360-bp) in search of the location of the following short nucleotide stretches. Devise your own method.
i) TATACTTCAGGAACTAATTCTGAAGCATCA and ii) TCTGTGCCTTTTTTATATCTTGGCAGGTAG
Mark the sequences on your printout of this document (underline or use a highlighter) or on the electronic document, as requested by your instructor.
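A computer-aided version of this search is an exact substring match. The sketch below assumes the sequence has already been read into a single string without line breaks; `toy_sequence` is a hypothetical stand-in for the 3360-bp lab sequence, built here only so that the example runs on its own.

```python
def find_all(sequence, motif):
    """Return the 1-based start position of every occurrence of motif."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = sequence.find(motif, start + 1)  # also catches overlapping hits
    return positions


MOTIF_1 = "TATACTTCAGGAACTAATTCTGAAGCATCA"
MOTIF_2 = "TCTGTGCCTTTTTTATATCTTGGCAGGTAG"

# Hypothetical stand-in sequence, NOT the sequence from this lab.
toy_sequence = "ACGT" * 5 + MOTIF_1 + "GGCC" * 3 + MOTIF_2 + "TTAA"

print(find_all(toy_sequence, MOTIF_1))  # [21]
print(find_all(toy_sequence, MOTIF_2))  # [63]
```

An exact match like this only works when the stretch appears on one line of text; in a document with line breaks, the breaks have to be removed first.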
2.2.
Please note the time at the beginning of your search and answer the following questions once you have located your sequence.
1. Describe the method you used to find the sequence stretches (visual comparison? computer-aided?).
At first I attempted to look for chunks of the first sequence, such as “TAT” or “TATAC,” but after nearly 10 minutes I realized that I was getting nowhere because there were too many repeats of the same letters. Instead, I ended up using the search function on the document to find both sequences quickly.
2. How long did it take for you to find your sequence?
Sequence i) 9 minutes
Sequence ii) Less than 1 minute
2.3. BLAST
Let us explore the efficiency of using vast online databases and online search tools to locate and identify unknown nucleotide sequences. One such search tool is called BLAST (Basic Local Alignment Search Tool). This program compares a nucleotidyl (DNA, RNA) or amino acyl (protein) sequence of interest to online databases, looking for regions of local similarity, and calculates the statistical significance of matches. One such online database is NCBI’s GenBank, which contains the sequences of at least three full-length human genomes and, being hosted by the National Library of Medicine (a branch of the National Institutes of Health), is free to the public.
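To make “regions of local similarity” concrete, here is a minimal sketch of a local alignment. It uses the Smith-Waterman dynamic-programming method, not BLAST’s own algorithm: BLAST relies on a faster heuristic, and this sketch does not compute statistical significance. The scoring values (match +2, mismatch -1, gap -2) and the example sequences are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical query and subject sequences.
print(smith_waterman("GGCAGG", "TTTTGGCAGGTTTT"))  # (12, 'GGCAGG', 'GGCAGG')
```

The short query aligns to the matching region inside the longer subject and the flanking bases are ignored, which is what distinguishes a local alignment from an end-to-end one.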
Finding sequences of known (or putative) function in a database that have similarity to your sequence of interest may allow you to identify the gene family to which your sequence belongs or the functional significance of your sequence, if any. You will use a BLAST search to uncover information about an unknown sequence. Copy and paste the unknown sequence (either the one on the previous page or the one provided by your section’s instructor) into a new Word document and save it on your computer’s hard drive. Give it a title in the format 202_Test_Sequence_LastName_FirstName.docx (example: 202_Test_Sequence_McKinnell_James.docx).
A. Go to NCBI BLAST website at blast.ncbi.nlm.nih.gov/Blast.cgi.
B. In the resulting page, scroll down to Basic Blast and click on the link nucleotide blast. Copy the first line of the nucleotide sequence in the Word document and paste it in the “Enter Query Sequence” box. (The top line, preceded by the “>” sign, is the description of what the sequence is.)
C. Leave the settings as they are, but make sure that Human genomic + transcript is selected in the Choose Search Set options. Scroll to the bottom of the page and click the BLAST button in the left-hand corner. Wait for results. Did your sequence find any matches in the human genome database?
There was no data found with this sequence.
What could be the reason for this result? The one line may have been an insufficient amount to process, or maybe the search was not specific enough for this query.
D. Now try a longer sequence. Copy the first three lines and paste this sequence into the “Enter Query Sequence” box and click BLAST again. Did your query match any sequence in the human genome database?
Yes
If so, what match did it locate? Homo sapiens fragile X mental retardation 1 (FMR1), transcript variant(s) in coding and non-coding mRNA
E. Next copy one line that is roughly in the middle of the provided sequence and paste it into the “Query Sequence” box and run the BLAST search again. Did you get a result this time?
Yes, and some of the results overlap with those from searching the first three lines.
F. Propose a reason for why this one line yielded a different result than the one line at the beginning of the sequence.
This part of the sequence may be more unique and specific compared to the first line.
G. Click on the first of the matches that your search yielded. This match should be with a sequence within GenBank. What is the name of this gene? What is the Sequence ID?
The gene is named fragile X mental retardation 1, and the Sequence ID is NM_001185082.1.
H. What chromosome is this located in? At what location of this gene?
It is located on the X chromosome.
3. Conclusion
A fully processed messenger RNA (mRNA) contains nucleotide triplets in a particular sequence that are read from an initiation codon (AUG) up to one or two termination codons (out of three: UAG, UAA, UGA). The expression of a eukaryotic gene is controlled by DNA sequences called regulatory regions. The regulatory regions include the gene’s promoter, which binds RNA polymerase once the transcription factors have bound the DNA and made that site accessible, and one or more enhancers that also bind transcription factors and contribute to the control of gene expression.
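The reading of triplets from an initiation codon to a termination codon can be sketched in a few lines. This is a simplified illustration that assumes reading starts at the first AUG in the string; the function name `read_codons` and the toy mRNA are hypothetical.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}


def read_codons(mrna):
    """Return the codons from the first AUG through the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no initiation codon found
    codons = []
    # Step through the sequence three nucleotides at a time, in frame with AUG.
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break  # a termination codon ends the reading frame
    return codons


# Hypothetical toy mRNA, not taken from the gene studied in this lab.
toy_mrna = "GGCAUGGCUUCUUAAGGC"
print(read_codons(toy_mrna))  # ['AUG', 'GCU', 'UCU', 'UAA']
```

Nucleotides before the AUG and after the stop codon are not read, which mirrors the untranslated regions at the two ends of an mRNA.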
Usually, the expression of a gene can be modified if one of its regulatory regions undergoes a mutation. This mutation may be of immense significance, even if the change involves a single base substitution, since a transcription factor’s recognition of the site is sequence-specific. Mutations may involve more substantial changes to the gene’s regulatory regions, such as multiple nucleotide deletions, or, as in the case of the gene under study in this lab, multiple nucleotide additions which may eventually result in the silencing of this gene.
The gene you searched codes for the so-called fragile-X mental retardation protein (FMRP). The promoter of this gene contains a variable number of the trinucleotide repeat CGG. Individuals with no disease (normal phenotype or wildtype) have promoters containing <60 CGG repeats. Individuals whose promoters contain 60–200 trinucleotide repeats are said to possess a “premutation” that renders them susceptible to movement problems (ataxia) later in life. Individuals whose promoters have >200 CGG trinucleotide repeats are afflicted with fragile-X syndrome and display a wide range of symptoms that include mental retardation, large testes, etc. In turn, FMRP is involved in the transport of RNA transcripts to polyribosomes located at sites of protein synthesis. In neurons these sites include the terminals of axons. Loss of expression of FMRP has far-reaching consequences for an affected individual.
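The repeat thresholds given above can be applied to a sequence automatically. The sketch below counts the longest uninterrupted run of CGG and classifies it with the cutoffs stated in this section (<60, 60–200, >200). Counting only uninterrupted runs is a simplifying assumption, and the function names and the toy sequence are hypothetical.

```python
import re


def longest_cgg_run(dna):
    """Return the number of repeats in the longest uninterrupted run of CGG."""
    runs = re.findall(r"(?:CGG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeats(repeats):
    """Classify a CGG repeat count using the thresholds stated in this section."""
    if repeats < 60:
        return "normal"
    if repeats <= 200:
        return "premutation"
    return "full mutation (fragile-X syndrome)"


# Hypothetical toy sequence with 30 CGG repeats.
toy_promoter = "ATG" + "CGG" * 30 + "TTT"
count = longest_cgg_run(toy_promoter)
print(count, classify_repeats(count))  # 30 normal
```

A count like this only describes the particular sequence that was analyzed, so a reference sequence from a database says nothing about the repeat number in any individual patient.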
4. Questionnaire
a) Consider the sequence you searched using the BLAST program. Would you predict that this gene comes from a healthy person, a person with a premutation, or a person afflicted with fragile-X syndrome, just by looking at the sequence?
I would assume that this gene comes from a healthy person.
Explain your reasoning.
I do not have the physical capacity to analyze an entire sequence just by looking at it, so I would take a positive guess and assume the best.
b) We used the default database when conducting our BLAST search. This database contains only human genome sequences. Imagine that the sequence you subjected to the BLAST search yielded no matches (regardless of the length of the sequence you entered into the Query box). What would you infer about that sequence?
I would assume that the sequence does not belong to the human genome.
c) What result would you predict if we searched that sequence against all known sequences?
I would predict that the sequence would belong to a genome that is not human.
A database containing all known nucleotide sequences exists and is called “nucleotide collection (nr/nt).” This database can be found on the BLAST site under “Choose Search Set.” At “Database” you will see that “Human genomic + transcript” is selected. Select “Others” instead and you will find that the “nucleotide collection (nr/nt)” database is automatically selected. Run your search against this vast database.
d) How do your results differ from the original search?
Compared to the original search, this search has returned several more results.
e) Describe the capabilities of a BLAST search.
BLAST can compare biological sequence data, including nucleotide and protein (amino acid) sequences, against sequences from various genomes. It also reports the statistical significance of the matches.
f) What could be the possible limitations of a BLAST search?
One limitation of a BLAST search could be the time it takes. Researchers using BLAST have to enter their query sequences one by one and must also take into account the time needed to examine the large number of results that each search returns.
g) BLAST is often nicknamed “the Google of DNA search tools.” Compare a BLAST search to a Google search and list one possible similarity and one possible difference.
A similarity between a Google Search and a BLAST search is that both can return several results if the query is just specific enough. A difference between them is that BLAST has more filtering options than Google, which provides a more tailored approach.
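The search against the “nucleotide collection (nr/nt)” database used in this questionnaire can also be prepared as a web request instead of being typed into the form. The sketch below only builds the request address and does not contact NCBI. It assumes the parameter names `CMD`, `PROGRAM`, `DATABASE`, and `QUERY` of NCBI’s BLAST URL interface and assumes that `nt` is the database name corresponding to nr/nt; the function name and the query are hypothetical, so check the current NCBI documentation before relying on it.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(query, program="blastn", database="nt"):
    """Build (but do not send) a BLAST submission URL.

    The parameter names and the database name "nt" are assumptions
    based on NCBI's BLAST URL interface.
    """
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # nucleotide-vs-nucleotide search
        "DATABASE": database,  # assumed name for nucleotide collection (nr/nt)
        "QUERY": query,        # the sequence to search with
    }
    return BLAST_URL + "?" + urlencode(params)


# Hypothetical short query for illustration only.
url = build_blast_submission("TATACTTCAGGAACTAATTCTGAAGCATCA")
print(url)
```

Building requests this way is one route to running many queries without pasting each one by hand, although the results still have to be retrieved and interpreted afterward.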
5. Discussion
You are given a sequence of DNA and told that it is human. You are asked to find out its identity and whether it has similarity to sequences in other organisms. Please describe the bioinformatics tool, the database, and the procedure you would use to find such information. Give two possible outcomes of your search.
I would use the BLAST tool to analyze the DNA sequence. First, I would enter a line of nucleotides from the middle of the sequence into the query box and make sure that “Human genomic + transcript” is selected in the “Choose Search Set” options. Then I would observe and take note of the results. Next, to see whether the sequence is similar to sequences in other organisms, I would run the same query but select “Others” instead of “Human genomic + transcript,” which selects the database called “nucleotide collection (nr/nt).” This would return additional results pertaining to other organisms. One possible outcome is that the sequence belongs to a human gene, which may come from a healthy person, a person with a premutation, or a person with a disorder. Another is that the gene may or may not have homologs in other species.