Summary
Sodium bisulfite treatment of DNA strands can detect methylated cytosines which can enable researchers to discover more information on the causes and effects of methylated cytosines. However, this produces many challenges with aligning bisulfite reads to a reference genome as the reference genome does not exactly match the bisulfite reads and up to four strands may need to be aligned. Furthermore, bisulfite colour mapping has a high error rate when mapping methylated regions. Many alignment tools have been created to deal with these issues such as wild card aligners, Bismark and BatMeth. Further research into integrating techniques will allow for the combination of speed and accuracy that is not currently present in these alignment tools.
Introduction:
DNA Methylation
DNA methylation is the addition of a methyl group (-CH3) to the C5 of a cytosine by DNA methyltransferases. It is important in the regulation of gene expression, X-chromosone inactivation and genetic imprinting (Robertson, 2005; Laird, 2010; Stevens et al, 2013) and changes in DNA methylation have been associated with cancer (Baylin & Jones, 2011) and other diseases (Feinberg, 2007). It has been suggested that DNA methylation could provide an account of environmental exposures (Bock, 2012) and may be a useful tool in determining biomarkers for disease diagnostics (Laird, 2003; Bock, 2009; Walker & Ho, 2012). In the human genome, cytosines that appear in a CpG dinucleotide context are methylated 70 – 80% of the time (Ehrlich et al, 1982; Pelizzola & Ecker, 2010), however, the majority of CpG islands remain unmethylated despite their high concentration of CpG dinucleotides (Antequera, 2017). Methylation can occur in a non-CpG context though this is limited and is usually restricted to specific cell types, such as neural and embryonic cells (Krueger et al, 2012).
Detecting DNA Methylation
There are various methods to detect DNA methylation. Restriction enzymes sensitive to DNA methylation only cut restriction sites with unmethylated CpGs and so each MRe-Seq sequencing read specifies the methylation status of a CpG dinucleotide (Stevens et al, 2013). This method can cover up to 30% of the genome (Nair et al, 2011) though inferring the position of methylated CpGs would require complete digestion which is not typically present in practice (Stevens et al, 2013). Methylcytosine specific antibodies can be used to infer methylation (Maunakea et al, 2010) as it has a lack of bias for certain nucleotide sequences. However, the MeDIP method is affected by external variables such as the density of CpG dinucleotides in the genome (Pelizzola et al, 2008; Harris et al., 2010).
Sodium Bisulfite Sequencing
Sodium bisulfite sequencing to map methylated cytosines was first adopted by Lister et al (2009). Fragments of 200bp long are attached to methylated adaptors and then treated with bisulfite which converts unmethylated cytosine to uracil (Krueger et al, 2012). During PCR amplification of the fragments, uracil is converted to thymine and then these fragments are sequenced. In the sequencing reads, unmethylated cytosines appear as a thymines and methylated cytosines are protected from conversion (Bock, 2012).
By comparing the sequenced reads to the reference genome, the methylated cytosines can be inferred (Krueger, 2012). This method can be used to produce genome wide, single base resolution maps of methylated cytosines in a genome (Lister et al, 2009). When aligning the reads produced by sodium bisulfite sequencing, it is essential that the alignment is correct (Krueger, 2012). This is due to the fact that the methylation state of cytosines is deduced by the comparison of the reads to the reference sequence (Bock, 2012).
Challenges of Aligning Sodium Bisulfite Reads
Aligning sodium bisulfite reads to the large reference genome produces a number of computational challenges (Chatterjee et al., 2012).
Firstly, the reference genome is not an exact match to the bisulfite reads as the unmethylated cytosines have been converted to thymines. The alignment must then account for the depletion of unmethylated cytosines (Bock et al., 2012) as C to T mismatches may represent a unmethylated cytosine instead of a SNP or computational error (Prezza et al., 2016).
In addition, it is possible that four strands will need to be aligned to the reference genome. Cytosine methylation is not symmetrical and so both the top DNA strand and the bottom DNA strand will be need to be treated with sodium bisulfite and then PCR amplification results in four strands (Krueger & Andrews, 2011). The sequencing library can be generated in two ways (Chen et al, 2010); a directional library in which the sequencing reads correspond to either the reference forward or the reference reverse strand (Lister et al, 2009) or strand specificity is not conserved and so all four possible DNA strands must be sequenced (Cokus et al., 2008; Popp et al., 2010). The alignment of four strands cannot be performed by standard alignment tools (Krueger & Andrews, 2011) and so different alignment methods will need to be adopted.
Finally, SOLiD sequencing represents the transition from one base to another as a colour. There is a high error rate as if one sequenced colour is wrong, all subsequent colours are wrong therefore C to T conversion is not suitable (Lim et al., 2012). In addition, there is a decrease in the unbiased reads of hypomethylated regions (Lim et al., 2012) due to the many mismatched produced when mapping bisulfite colour reads onto a highly methylated genome (Krueger et al., 2012).
Methodology
Aligning non-exact matches
Several different approaches have been developed to allow the aligning of bisulfite reads to a reference genome: methylation aware methods and ‘three letter’ aligners.
Methylation aware alignments (‘wild card’ methods) consider both cytosine and thymine a potential match to the reference cytosine (Krueger et al, 2012). There are two variations of the ‘wild card’ method to perform a methylation aware alignment (Smith et al, 2009). Firstly, the alignment can be adjusted so that all of the cytosines in the reference genome can match to both a cytosine and a thymine in the sequencing reads. The sequence alignment matrix is modified so that a thymine in the sequencing read matching to a cytosine in the reference genome does not result in any penalty (Bock, 2012). Secondly, a wild card letter Y can be added so that all the cytosines in the genomic sequence are converted to a Y. This letter Y can match with both a cytosine and thymine in the read sequence (Bock et al, 2012).
‘Three letter’ aligners convert all residual cytosines in the bisulfite sequence reads and all the cytosines in the reference genome to thymines (Chen et al, 2010; Bock, 2012; Krueger et al, 2012). This allows the read sequence to be unaffected by its methylated cytosines (Krueger et al, 2012). It is then possible to use standard alignment tools, such as Bowtie, exclusively on a three letter alphabet to align the reads to the reference sequence (Langmead et al, 2009) as there will be an exact match between the converted read sequence and the converted reference sequence (Krueger et al, 2012).
Aligning up to four strands
Bismark is a three letter aligner that can perform bisulfite alignments on both directional and non-directional libraries. The bisulfite reads and reference genome are converted by changing Cs to Ts on the forward strands and Gs to As on the reverse strands. Each of the reads are then simultaneously aligned to the converted reference genome using Bowtie (Krueger & Andrews, 2011) and the outputs are analysed to determine if the sequence can be uniquely mapped. The ‘best’ unique alignment is then selected. This allows the read sequence to be compared to the original pre-converted genomic sequence that provides the ‘best’ alignment in order to determine the methylation state of cytosines in the genome.
SOLiD Sequencing
BatMeth is a basic alignment tool for methylation that addresses the challenges raised by mapping bisulfite colour reads to a genome. This alignment method can use both base space and colour space and will change the reference genome into colour space for SOLiD sequencing (Lim et al., 2012). At the initial stage, BatMeth counts the number of bisulfite reads instead of mapping and then removes the reads that have the wrong orientation. Further steps were added to improve accuracy. A dynamic programming step for colour reads accounts for mismatch accuracy and an incremental processing step produces high unique mapping rates (Lim et al., 2012).
Discussion
Aligning non-exact matches
Methylation aware software results in a higher genomic coverage (Bock, 2012) and provides the highest mapping efficiency (Krueger, 2012) though has an increased bias towards highly methylated areas. This is due to the fact that cytosines or wild card letter Ys in the reference sequence will align to thymines in the reads with more ambiguity than aligning with cytosines in the reference sequence as the thymines could be unmethylated cytosines or thymines. This results in extra cytosines in the methylated reads raising the complexity level of the read so that it is sufficient for a unique alignment to the genome but unmethylated thymine containing reads being discarded (Bock, 2012). This can lead to an overestimation of methylation levels (Krueger, 2012). However, these are only problems if the reference sequence has a high number of sections that have high identity to other regions in the sequence and they become less relevant for longer sequencing reads (Bock, 2012).
In contrast, the three letter aligners are unbiased to methylated sequences as all cytosines in the reference genome and the sequencing reads are converted to thymines (Chen et al, 2010; Bock, 2012; Krueger et al, 2012). This results in exact matches between the converted reads and converted reference sequence and so it is possible to use standard alignment tools (Chen et al, 2010; Krueger & Andrews, 2011). However, this method does decrease the sequence complexity as there are only three letters to be aligned instead of four such that a larger percentage of reads are discarded as they match multiple positions in the reference genome (Bock, 2012).
Aligning up to four strands
Bismark has been found to accurately detect methylation levels at a constant mapping efficiency (Krueger, 2012). Some studies have shown that Bismark has uniquely mapped 77–82% of sequencing reads (Kunde-Ramamoorthy et al., 2014) with a low error rate (Tsuji & Weng, 2016). It is able to take both directional and non-directional libraries and aligns methylated sequences accurately and in an unbiased manner (Krueger & Andrews, 2011). However, it has been suggested that Bismark is very slow (Tsuji & Weng, 2016) though is substantially faster than Pash (Kunde-Ramamoorthy et al., 2014).
SOLiD Sequencing: BatMeth
BatMeth allows the user the perform both base space and colour space mapping in one software. It has been suggested that the software is faster and produces fewer false positives than other softwares available (Lim et al., 2012). However, one study found that BatMeth only uniquely identified 50% of the reads and was excluded from further consideration (Kunde-Ramamoorthy et al., 2014). There are very few papers comparing BatMeth’s ability with other software and so a full conclusion on its accuracy and efficiency cannot be deduced.
Integrating Techniques to Improve Alignment
It has been found that integrating the mapping results from Bismark, BSMAP and BS-Seeker2 significantly increased the detection accuracy when compared with the individual mappers (Jong-Hun et al., 2015). The amount of detected cytosine was higher than that by Bismark alone and overall efficiency of methylation detection was improved (Jong-Hun et al., 2015).
Conclusions
Overall, it appears that the improvement of one aspect is often detrimental to another aspect e.g. the higher the speed of the alignment program the lower the accuracy of its read mapping. Thus the choice of alignment tool is dependent on what is important for the study being conducted. BatMeth has addressed some of the challenges associated with mapping bisulfite colour reads, however there are a limited number of studies thoroughly exploring the efficiency and accuracy of this software. Research into integrating alignment tools is promising as it allows for the best aspects of each program to be exploited and it is possible these integrations will provide better alignments for bisulfite sequencing reads.
References
Antequera, F. (2017) CpG Islands and DNA Methylation: In eLS, John Wiley & Sons, Ltd (Ed.)
Baylin, S.B. & Jones, P.A. (2011) A Decade of Exploring the Cancer Epigenome – Biological and Translational Implications, Nature Reviews Cancer, 11, 726 – 734
Bock, C. (2009) Epigenetic Biomarker Development, Epigenomics, 1, 99 – 110
Bock, C. (2012) Analysing and Interpreting DNA Methylation Data, Nature Reviews Genetics, 13, 705 – 719
Chatterjee, A., Stockwell, P.A., Roger, E.J. & Morison, I.M. (2012) Comparison of Alignment Software for Genome-Wide Bisulphate Sequence Data, Nucleic Acids Research, 40, e79
Cokus, S.J., Feng, S., Zhang, X., Chen, Z., Merriman, B. et al. (2008) Shotgun Bisulphite Sequencing of the Arabidopsis Genome Reveals DNA Methylation Patterning, Nature, 452(7184), 215 – 219
Ehrlich, M. et al. (1982) Amount and Distribution of 5-Methylcytosine in Human DNA from Different Types of Tissue of Cells, Nucleic Acids Research, 10, 2709 – 2721
Feinberg, A.P. (2007) Phenotype Plasticity and the Epigenetics of Human Disease, Nature, 447, 433 – 440
Harris, R.A., Wang, T., Coarfa, C., Nagarajan, R.P., Hong, C. et al. (2010) Comparison of Sequencing- Based Methods to Profile DNA Methylation and Identification of Monoallelic epigenetic Modifications, Natural Biotechnology, 28(10), 1097 – 1105
Kunde-Ramamoorthy, G., Coarfa, C., Laritsky, E., Kessler, N.J., Harris, R.A. et al. (2014) Comparison and Quantitative Verification of Mapping Algorithms for Whole-Genome Bisulfite Sequencing, Nucleic Acids, 42(6), e43
Krueger, F. & Andrews, S.R. (2011) Bismark: A Flexible Aligner and Methylation Caller for Bisulfite-Seq Applications, Bioinformatics, 27, 1571 – 1572 

Krueger, F., Kreck, B., Franke, A. & Andrews, S.R. (2012) DNA Methylome Analysis Using Short Bisulfite Sequencing Data, Nature Methods, 9(2), 145 – 151
Laird, P.W. (2003) The Power and the Promise of DNA Methylation Markers, Nature Reviews Cancer, 3, 253 – 266
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. (2009) Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome, Genome Biology, 10, R25
Lim, J.Q., Tennakoon, C., Li, G., Wong, E., Ruan, W. et al. (2012) BatMeth: Improved Mapper for Bisulfite Sequencing Reads on DNA Methylation, Genome Biology, 13(10)
Lister, R., Pelizzola, M., Dowen, R.H., Hawkins, R.D., Hon, G. et al. (2009) Human DNA Methylomes at Base Resolution Show Widespread Epigenomic Differences, Nature, 462(7271), 315-22
Maunakea, A.K., Nagarajan, R.P., Bilenky, M., Ballinger, T.J., D’Souza, C. et al. (2012, Conserved Role of Intragenic DNA Methylation in Regulating Alternative Promoters, Nature, 466, 253 – 257
Nair, S.S., Coolen, M.W., Stirzaker, C., Song, J.Z., Statham, A.L. et al. (2011) Comparison of Methyl-DNA Immunoprecipitation (MeDIP) and Methyl-CpG Binding Domain (MBD) Protein Capture for Genome-Wide DNA Methylation Aalysis Reveal CpG Sequence Coverage Bias, Epigenetics, 6(1), 34 – 44
Pelizzola, M., Koga, Y., Urban, A.E., Krauthammer, M., Weissma, S. et al. (2008) MEDME: An Experimental and Analytical Methodology for the Estimation of DNA Methylation Levels Based on Microarray Derived MeDIP-Enrichment, Genome Research, 18, 1652 – 1659
Popp, C., Dean, W., Feng, S., Cokus, S.J., Andrews, S. et al. (2010) Genome-Wide Erasure of DNA Methylation in Mouse Primordial Germ Cells is Affected by Aid Deficiency, Nature, 463(7284), 1101 – 1105
Prezza, N., Vezzi, F., Käller, M. & Policriti, A. (2016) Fast, Accurate and Lightweight Analysis of BS-Treated Reads with ERNE 2, BMC Bioinformatics, 17(S4)
Robertson, K.D. (2005) DNA Methylation and Human Disease, Nature Reviews Genetics, 6, 597 – 610
Smith, A.D, Chung, W.Y., Hodges, E., Kendall, J. Hannon, G. et al. (2009) Updates to the RMAP Short-Read Mapping Software, Bioinformatics, 25, 2841 – 2842
Walker, C.L. & Ho, S.M. (2012) Developmental Reprogramming of Cancer Susceptibility, Nature Reviews Cancer, 12, 479 – 486