Bacterial genome sequencing:
A genome is the complete set of an organism’s genetic material. A prokaryotic genome is much smaller and less complex than a eukaryotic genome. Genomes consist of DNA, which stores genetic information in the linear order of its bases. DNA is replicated by DNA polymerase and transcribed by RNA polymerase to produce RNA. The RNA is then translated into protein by the ribosome, and this flow of genetic information is called the “central dogma” of molecular biology. Bacterial genomes usually consist of a single, circular, supercoiled DNA chromosome. Decoding the genetic information of an organism by sequencing its genome has an immense impact on our understanding of molecular biology (Heng, 2011 #1).
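The flow of information from DNA to RNA to protein can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a toy gene invented for illustration; the function names are hypothetical, and the sketch ignores real-world details such as alternative start codons and regulatory sequences.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Return the mRNA for a coding (sense) strand: T is replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA codon by codon until a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


gene = "ATGGCTAAATAA"  # hypothetical toy gene, written 5' to 3'
mrna = transcribe(gene)
protein = translate(mrna)
print(mrna, protein)
```

Here the toy gene encodes methionine, alanine and lysine followed by a stop codon.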
Genome sequencing methods:
Genome sequencing is the process of determining the order of the DNA bases (adenine, guanine, cytosine and thymine). One of the first methods for sequencing DNA was Sanger’s dideoxy chain termination method (Sanger, 1977 #2). Four reaction mixtures are prepared, each containing one of the four dideoxynucleotide triphosphates (ddATP, ddCTP, ddGTP, ddTTP) (Sanger, 1977 #2). The fragment to be sequenced is added to each reaction mixture along with a radiolabeled primer and DNA polymerase (Sanger, 1977 #2). Because an incorporated ddNTP stops chain extension, each reaction yields fragments that terminate at every possible position of that base (Sanger, 1977 #2). The fragments are then resolved using polyacrylamide gel electrophoresis (Sanger, 1977 #2). The radiolabeled primer allows the fragments to be visualized upon exposure to a photographic film (Sanger, 1977 #2). The autoradiogram of the gel is read from bottom to top, and the sequence of the strand complementary to the template strand is determined from the 5’ end to the 3’ end (Sanger, 1977 #2).
Automated sequencing was introduced to allow faster separation of DNA fragments (Wallis, 2011 #3). The DNA fragments are separated by size using capillary electrophoresis (Wallis, 2011 #3). The radioactive primers are replaced by fluorescently labeled ddNTPs (Wallis, 2011 #3).
To reduce the cost and increase the speed of sequencing, many other methods have been developed over the years, collectively referred to as next generation sequencing. These platforms sequence at a faster rate and are relatively cheaper. Examples include single-molecule real-time sequencing, Ion Torrent sequencing and Illumina sequencing (Pareek, 2011 #4).
Single-molecule real-time (SMRT) sequencing is used to sequence a single molecule of DNA (Levene, 2003 #5). As in automated sequencing, each of the four bases is fluorescently labeled. SMRT sequencing is performed on a chip and is widely applied in genomics (Levene, 2003 #5). Ion Torrent sequencing resembles SMRT sequencing in that it is performed on a chip, in this case a semiconductor chip, on which the nucleotide sequence is translated into a digital signal (Quail, 2012 #6). Ion Torrent sequencing is a fast, cheap and simple approach to sequencing DNA (Quail, 2012 #6). Illumina sequencing resembles Sanger sequencing in that it uses chain terminators, but the terminator is reversible: it blocks polymerization so that only one base is added to the growing strand at a time (Lahens, 2017 #7).
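The logic of reading a Sanger gel can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the template is a hypothetical sequence written 3’ to 5’, the primer length is ignored, and every possible termination product is assumed to be present. The function names are illustrative only.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_lanes(template_3_to_5):
    """Return the fragment lengths expected in each of the four ddNTP lanes.

    The newly synthesized strand is the base-by-base complement of the
    template; a fragment of length n appears in the lane of the ddNTP
    that matches the n-th base of the new strand.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template_3_to_5)
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from the shortest fragment (bottom) to the longest (top)."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "TACGGATC"  # hypothetical template, written 3' to 5'
lanes = sanger_lanes(template)
sequence = read_gel(lanes)  # complementary strand, read 5' to 3'
print(lanes)
print(sequence)
```

Sorting the bands by length mirrors reading the autoradiogram from bottom to top, since the shortest fragments migrate furthest.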
The next generation sequencing methods have dominated the industry as they are more economical, provide high throughput and have a simple workflow when compared to the traditional methods (Wallace, 2016 #8).
Objectives of genome sequencing:
Genome sequencing of poorly studied bacteria helps to build reference genomes, which in turn help to place unknown genomes in the tree of life. Genome sequencing also improves our understanding of the mechanisms that bacteria use to cause disease, and helps to identify new drug targets and design novel antibiotics (Donkor, 2013 #9).
Automatic annotation pipelines for prokaryotic genomes exist but are not fully reliable, because computational annotation can often be wrong, especially for proteins of unknown function. Manual curation of genes therefore plays a very important role in genome annotation (Pfeiffer, 2015 #10). Genome annotation is the process of taking the raw sequence of each gene and predicting its function, localization and structure (Reeves, 2009 #11). The various modules involved in annotating a gene are summarized below.
Sequence based similarity:
The identification of homologs is an important step in the annotation of genes. Protein sequence databases contain the sequences of previously annotated proteins; the sequence of the gene of interest is compared against them and homologous sequences are identified (Pearson, 2013 #12). The main purpose of identifying homologous sequences is to look for similarities in function or protein domains between the gene of interest and proteins that have already been annotated.
Some of the tools used to explore sequence-based similarity are the Basic Local Alignment Search Tool (BLAST), the Conserved Domains Database (CDD), T-COFFEE and WebLogo. Proteins with similar sequences can be found using BLAST and CDD (Marchler-Bauer, 2005 #13). T-COFFEE and WebLogo are used to align multiple similar sequences obtained from BLAST and to check how well conserved the sequence is throughout (Crooks, 2004 #14).
The amino acids in T-COFFEE and WebLogo alignments are color coded according to their chemical nature (Crooks, 2004 #14). For example, in WebLogo, green indicates polar amino acids, blue indicates basic amino acids, red indicates acidic amino acids and black indicates hydrophobic amino acids. The larger a letter appears in a WebLogo image, the better conserved that residue is across the sequences from the various species. Conversely, gaps in the alignment indicate that the region is not well conserved, and gaps toward the N or C terminus suggest that the protein might not have been called correctly (Crooks, 2004 #14).
To determine the functional similarity of the gene of interest to other known proteins, online resources such as the Protein Data Bank (PDB), Pfam and TIGRFAM are used. Hypothetical proteins can be assigned to a protein family based on the functional domains they contain (Xu, 2012 #15). A statistical model called the hidden Markov model (HMM), built from previously annotated proteins, is used to identify correlations in structure among proteins (Yoon, 2009 #16). TIGRFAM and Pfam are HMM-based prediction tools whose models are built from many manually annotated sequences (Mulder, 2002 #17).
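The relationship between letter size and conservation can be made concrete with a short sketch. In a sequence logo, the total height of a column reflects its information content in bits: a column in which every sequence has the same residue scores highest. The alignment below is hypothetical, the function name is illustrative, and the calculation is the basic Shannon formula only; WebLogo itself may apply further corrections that are omitted here.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Maximum possible entropy for proteins is log2(20); conserved columns
    # have low entropy and therefore high information content.
    return math.log2(alphabet_size) - entropy


# Hypothetical aligned protein fragments, one per species.
alignment = ["MKTAL", "MKSAL", "MRTA-", "MKTGL"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
scores = [column_information(col) for col in columns]
for col, score in zip(columns, scores):
    print(col, round(score, 2))
```

The first column (all M) reaches the maximum of log2(20) bits, while the mixed second column (K, K, R, K) scores lower, which corresponds to smaller letters in the logo.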
Cellular localization data:
Cellular localization data is important for hypothetical proteins because, for most of them, a localization prediction is the only information available, and it can usefully guide wet lab experiments (Yoon, 2009 #16). Gram staining can be used to identify whether a bacterium is Gram positive or Gram negative, which matters because the two groups use different mechanisms for protein secretion (Silhavy, 2010 #18).
TMHMM is an online tool that predicts the number of transmembrane helices in a protein; the presence of such helices implies that the protein is a membrane protein (Dönnes, 2004 #19). The TMHMM plot is similar to a hydropathy plot. SignalP is a tool that predicts the presence of a signal peptide at the amino terminus of the protein, which implies that the protein is secreted out of the cell (Dönnes, 2004 #19). PSORTb gathers data from various cellular localization prediction tools and predicts the most likely localization of the protein. The PSORTb result displays a score (out of 10) for each localization site, and the site with the highest score is the predicted location of the protein (Yu, 2010 #20).
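Since the TMHMM plot is compared to a hydropathy plot, a sliding-window hydropathy profile gives a feel for what such a plot shows. The sketch below uses the Kyte-Doolittle hydropathy scale rather than the hidden Markov model that TMHMM itself relies on, so it is a much cruder stand-in. The window of 19 residues and the threshold of 1.6 are conventional choices assumed here for illustration, and the test sequences are invented.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Average hydropathy of each full window along the protein."""
    scores = [KYTE_DOOLITTLE[residue] for residue in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def has_hydrophobic_stretch(protein, window=19, threshold=1.6):
    """True if any window is hydrophobic enough to suggest a membrane span."""
    return any(value >= threshold for value in hydropathy_profile(protein, window))


# Hypothetical sequences: a hydrophobic core flanked by charged residues,
# and a purely charged sequence for comparison.
membrane_like = "DEKR" * 3 + "LIVAL" * 4 + "DEKR" * 3
soluble_like = "DEKR" * 5
print(has_hydrophobic_stretch(membrane_like))
print(has_hydrophobic_stretch(soluble_like))
```

A peak in the profile above the threshold corresponds roughly to the kind of region that a dedicated predictor would report as a candidate transmembrane helix.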
Alternative open reading frame:
An incorrectly called open reading frame can lead to a predicted protein with a completely different function, irrelevant to the organism. The presence of an alternative open reading frame can be initially predicted by looking at a WebLogo image (Vanderperre, 2013 #21). Many gaps toward the N terminus suggest that the protein might not have been called correctly.
One online tool that can be used to check for an alternative open reading frame is the IMG sequence viewer. The sequence is checked for any possible start codon, preceded by a Shine-Dalgarno sequence, that lies a few base pairs upstream of the originally called open reading frame. The sequence of the new open reading frame is then run through BLAST again and the results are compared with the original BLAST results. If the new open reading frame gives a higher score and a lower expect value, it can be predicted to be the alternative open reading frame (Harper, 2012 #22).
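The search for an upstream start codon can be sketched as follows. This is not how the IMG sequence viewer works internally; it is a minimal illustration that assumes the common bacterial start codons ATG, GTG and TTG, uses the core motif GGAGG as a stand-in for a Shine-Dalgarno sequence, and looks for that motif within 15 bases upstream of each candidate. The motif, the distance, the function name and the example sequence are all assumptions made for illustration.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(dna, called_start, sd_motif="GGAGG", max_sd_distance=15):
    """List in-frame start codons upstream of the called start (0-based index).

    Walks upstream one codon at a time and stops at the first in-frame
    stop codon. Each result is (position, codon, has_sd), where has_sd
    tells whether sd_motif occurs within max_sd_distance bases upstream.
    """
    candidates = []
    pos = called_start - 3
    while pos >= 0:
        codon = dna[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            upstream = dna[max(0, pos - max_sd_distance):pos]
            candidates.append((pos, codon, sd_motif in upstream))
        pos -= 3
    return candidates


# Hypothetical sequence: SD-like motif, spacer, alternative start,
# two codons, originally called start, two codons, stop.
dna = "TT" + "AGGAGG" + "TAACAT" + "ATG" + "AAACCC" + "ATG" + "GGTTTT" + "TAA"
called_start = 23  # index of the originally called ATG
print(upstream_starts(dna, called_start))
```

Any candidate reported this way would still need to be checked by running the extended sequence through BLAST again and comparing the score and expect value with the original result.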
Gene duplication and degradation:
The most important concepts in evolutionary genomics are orthology and paralogy (Koonin, 2005 #23). Genes in different species that evolved from a common ancestral gene are referred to as orthologous genes (Gabaldón, 2013 #24). These genes usually retain the function of the ancestral gene. Paralogs, on the other hand, are genes that arose from a common ancestral gene by duplication and that have often acquired new functions (Gabaldón, 2013 #24). These genes have evolved through duplication and degradation. Genes that resemble a functional gene but are themselves non-functional are known as pseudogenes. Because pseudogenes are non-functional, they are difficult to identify in the genome (Tutar, 2012 #25).
Gene association details, or the identification of similar genes, can be obtained from BLAST by hovering over the related information that appears next to the query sequence results. Although computational tools provide some information on gene duplication and degradation, understanding the complex evolutionary tree of some genes requires further research (Gabaldón, 2013 #24).
Horizontal gene transfer:
If an organism contains genetic material from organisms that are not its ancestors, this can be evidence of horizontal (lateral) gene transfer. Potential advantages of horizontal gene transfer are that genes lost through deletion mutations can be regained and that a gene acquired from another species can be maintained in the organism for many generations (Vogan, 2011 #26). Horizontal gene transfer can also confer antibiotic resistance on bacteria. This allows bacteria to degrade novel antibiotics and helps them to evolve and exhibit virulence (Davies, 2010 #27).
Evidence of horizontal gene transfer can be collected by constructing a phylogenetic tree. The purpose of a phylogenetic tree is to identify the evolutionary relationships among organisms. Every organism has a taxonomy that shows the family, order, class, phylum and domain to which it belongs. If most of the genera in the phylogenetic tree of a gene belong to a different phylum or domain from the organism under study, this is evidence that the gene has recently undergone horizontal gene transfer (Vogan, 2011 #26).
The same conclusion can be supported by guanine-cytosine (GC) content. For example, Kytococcus sedentarius has a GC content of 72%. If a gene in the Kytococcus sedentarius genome has a GC content significantly lower or higher than 72%, this is evidence of horizontal gene transfer. In cases of suspected horizontal gene transfer, the gene neighborhood image is also obtained from the IMG sequence viewer. If the gene is absent from closely related organisms, or the same gene is present in organisms that are not closely related, there is strong evidence of horizontal gene transfer.
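The GC content comparison can be written as a short calculation. The sketch below takes the 72% genome average quoted above as its default; the tolerance of 10 percentage points used to decide what counts as a significant deviation is an arbitrary assumption for illustration, and the gene sequences are invented.

```python
def gc_content(sequence):
    """Percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / len(sequence)


def deviates_from_genome(gene, genome_gc=72.0, tolerance=10.0):
    """True if the gene's GC content differs from the genome average
    by more than the tolerance (in percentage points, assumed value)."""
    return abs(gc_content(gene) - genome_gc) > tolerance


at_rich_gene = "ATATGCATTA"  # hypothetical gene fragment, 20% GC
gc_rich_gene = "GCGCGCGCAT"  # hypothetical gene fragment, 80% GC
print(gc_content(at_rich_gene), deviates_from_genome(at_rich_gene))
print(gc_content(gc_rich_gene), deviates_from_genome(gc_rich_gene))
```

A flagged gene is only a candidate: the deviation should be weighed together with the phylogenetic tree and the gene neighborhood before concluding that horizontal gene transfer occurred.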
Mutant phenotypes in bacteria:
Phenotype refers to any observable trait or characteristic of an organism. The allele that encodes the typical phenotype is called the wild type allele. If a mutant form of the allele alters the trait, the result is called a mutant phenotype. It is important to identify mutant phenotypes in bacteria because they provide information about the functions of the genes or hypothetical proteins present in the bacteria (Deutschbauer, 2014 #28). Mutant fitness profiling is an approach used to calculate the strain fitness of bacterial mutants. Strain fitness refers to the logarithmic change in the abundance of a mutant strain during growth after mutations have been inserted (Deutschbauer, 2014 #28).
As feasible as it seems, mutant fitness profiling of bacteria has its own disadvantages. The enormous diversity of bacteria makes it difficult and time consuming to identify the detailed functions of all bacterial proteins (Palace, 2014 #29).
Transposon mutagenesis is widely used for inserting random mutations in bacteria. A transposon is a segment of DNA that can move to a new position in the genome and can be introduced into a host bacterium. Transposon mutagenesis is an established, powerful approach for generating mutants in bacterial genomes (Beaurepaire, 2007 #30).
Random barcode transposon-site sequencing (RB-TnSeq) is a type of transposon mutagenesis that is used to characterize a complex transposon mutant library (Wetmore, 2015 #31). The transposon mutant library is grown under various conditions, such as different carbon sources, nitrogen sources and stresses. Gene fitness calculation, a widely used approach for determining mutant phenotypes and the functional associations of many proteins of unknown function, has provided new insight into genome annotation (Gray, 2015 #32).
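As a rough illustration of a logarithmic change in abundance, strain fitness can be written as a log ratio of counts before and after growth. The sketch below assumes a base 2 logarithm and adds a pseudocount of 1 to avoid taking the logarithm of zero; both choices, the function name and the counts are assumptions for illustration, and published pipelines include normalization steps that are omitted here.

```python
import math


def strain_fitness(count_before, count_after, pseudocount=1.0):
    """Log2 change in abundance of one mutant strain during growth.

    Negative values mean the mutant became less abundant (the disrupted
    gene was beneficial under this condition); values near zero mean
    the mutation had little effect.
    """
    return math.log2((count_after + pseudocount) / (count_before + pseudocount))


# Hypothetical read counts for two mutant strains.
depleted = strain_fitness(count_before=399, count_after=99)
enriched = strain_fitness(count_before=99, count_after=399)
print(depleted, enriched)
```

With these invented counts, the first strain drops fourfold (fitness of -2) and the second rises fourfold (fitness of +2).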