There is an academic demand to drive research into the use of genetic engineering techniques in metagenomics for enhanced biodegradation Yu2013 of micropollutants in wastewater, especially using the novel CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR-associated) system for genome manipulation in bacterial communities. The CRISPR-Cas system requires a custom single-guide RNA (sgRNA or gRNA) for targeted genome editing (e.g. gene targeting), which can be designed using any of the existing CRISPR-Cas sgRNA design software tools.
Although several CRISPR software tools exist, they are limited when it comes to designing single-guide RNA(s) for bacterial communities from the multiple genome sequences of the community.
Thus, it is a challenge to develop a software tool that designs sgRNA(s) for bacterial communities and is capable of handling multiple bacterial genome sequences for targeted genome modifications. In this respect, the primary aim of this project is to develop a software tool that can be used to design and score guide RNAs for bacterial species in a community, based on the predicted on-target activity and off-target effects of CRISPR-Cas9.
CRISPR is a new method for targeting and editing genomes in any species Doudna2014. It is based on a natural defence system used by bacteria and archaea to protect themselves from viral infections Terns2011.
This bacterial self-defence system Barrangou2007 relies on a DNA endonuclease called Cas9 (CRISPR-associated protein 9), which can cut an organism's genome at a desired location and is directed there by an appropriate RNA called guide RNA (sgRNA, gRNA) Biolabs2007.
This new technique has revolutionised targeted genome manipulation, allowing a previously unseen level of genomic targeting efficiency and simplicity Doudna2014. It continues to receive significant attention in industrial, biological and biomedical research and has led to the emergence of a broad range of computational tools.
The CRISPR method enables several genome targeting applications such as functional knock-out (KO) of protein-coding genes or non-coding DNA (e.g. promoters or transcription factor binding sites), knock-in (KI), transcriptional activation or repression and many others Doench2014. Although this project currently focuses only on "knock-out" CRISPR applications for bacterial communities, the modular design of the tool means that it can be extended to support other applications.
All CRISPR-based genome targeting applications require the aforementioned Cas9 (or Cpf1) endonuclease and a single-guide RNA to be introduced into the target cell, where they form a complex (Cas-sgRNA complex). The complex targets a specific DNA sequence (genomic target, RNA target or target) in the genome that is complementary to the protospacer (spacer) part of the guide RNA and that neighbours the PAM (protospacer adjacent motif) site. See the figure from Graham2015 for detailed information about the CRISPR-Cas9 system.
In addition, different types of Cas9 endonucleases are used in, and developed for, different applications Doudna2014. For example, for functional knock-out and knock-in applications, the most commonly used endonuclease is currently the original wild-type Cas9 (wtCas9) protein from Streptococcus pyogenes Esvelt2013. The wtCas9 has two endonuclease domains that introduce a double-strand break (DSB) three nucleotides upstream of the NGG sequence of the PAM site (N means any nucleotide) Ran2013a.
On the other hand, transcriptional activation (CRISPRa) or inhibition (CRISPRi) applications often use a catalytically inactive Cas9 protein called dead Cas9 or dCas9, which has lost its endonuclease activity but retains its targeted, tight binding capacity. It is fused with a functional activator or repressor domain to form a complex that binds at the gene promoter site to alter the expression of the desired gene Larson2013.
The double-strand breaks (DSBs) produced by the wild-type Cas9, or by any fully functional Cas9 variant, in genome editing applications can be repaired by two distinct endogenous DNA repair systems: the precise homology-directed repair (HDR) Davis2014 and the error-prone non-homologous end joining (NHEJ) Moore1996 Wilson1999.
The main difference between them is that HDR requires a nearly identical donor sequence (a donor dsDNA or a single-stranded oligonucleotide, ssODN) as a template for repairing the DSB precisely, while NHEJ directly joins the two (blunt) ends. This can introduce mutations of varying size, such as an insertion caused by non-matching breaks or a deletion caused by nucleotides lost at the damaged break sites (see the figure from HANDBOOK) Boulton1996.
However, NHEJ is also capable of accurate DNA repair using compatible microhomologies present on both strands Wilson1999.
Designing a CRISPR experiment involves several experimental considerations that can be shared between these applications, such as the delivery method of the Cas9 endonuclease and gRNAs into the target species.
Although different applications used in different CRISPR experiments share some common factors, others might differ. For example, a different Cas9 protein can be used for the same type of application: a synthetically developed Cas9 variant may be used for a knock-out application instead of the wtCas9 protein.
In general, however, the typical workflow of a CRISPR-based experiment CRISPR101 (see the workflow figure for details) shares common components. It begins with the development of the biological question (for example, modifying a gene of a bacterium so that it loses a specific functionality), followed by the selection of the relevant CRISPR/Cas9 application, in this case the knock-out.
After selecting the application, the expression system and the in vivo delivery method for the endonuclease and the designable guide RNA should be chosen (e.g. viral transduction by AAV).
The next step is to design gRNAs (typically 3 to 8 sgRNAs) using a computational tool, which is the primary aim of the project, followed by the process of cloning the components into vectors for delivery. The final steps of the workflow are the delivery of the components based on the selected expression system and delivery method into the target cell(s) followed by the validation of the CRISPR experiment Graham2015.
The success of a CRISPR experiment mainly depends on the gRNA's targeting specificity and on its efficiency (successful activity of the desired manipulation at the specific location).
The gRNA's targeting specificity relies on two factors: the target sequence and the PAM sequence. In the perfect scenario, the Cas9 complex would act only at the desired locations (on-targets), where the sgRNA spacer of the Cas9 complex perfectly matches the targeting sequence (RNA target, target) and where the PAM sequence is compatible with the binding site of the selected Cas9.
However, due to the nature of the CRISPR method, unwanted activity can occur at genomic locations that have only partial homology to the gRNA spacer. These unwanted locations are called off-targets, and the unwanted changes at off-targets are called off-target effects.
Predicting the efficiency, i.e. the intensity of the desired manipulation, is a separate problem. Efficiency depends on several factors, such as low activity of the endonuclease in individual cells or, for the knock-out application, NHEJ-based DNA repair that does not produce a KO allele in the targeted gene. The rules governing gRNA efficiency and specificity are not yet fully understood, despite numerous empirical efforts to establish them Doench2016.
As a result, the two main challenges for software tools that design guide RNAs are to predict activity at on-target sites and to minimise or avoid activity at off-target sites (off-target effects).
Reading through the literature, it is clear that there are several ongoing efforts to find new techniques to maximise on-target efficiency and minimise off-target activity Kleinstiver2016 Graham2015. Nevertheless, the following two approaches are widely used to score on-target and off-target activities quantitatively:
the empirical approach, in which statistical data is retrieved from previous studies, and the heuristic approach, in which targets (on or off) are scored simply on the basis of the mismatches between the guide RNA sequence (spacer) and the sequences of the potential targets in the genome Ran2013a.
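The heuristic approach can be illustrated with a minimal sketch that counts mismatches between a spacer and a candidate genomic site. It is written in Python purely for illustration (the tool itself is developed in Swift); the function names and the default threshold of three mismatches are assumptions and are not taken from the Design Tool.

```python
def count_mismatches(spacer: str, site: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(spacer) != len(site):
        raise ValueError("spacer and site must have the same length")
    return sum(a != b for a, b in zip(spacer.upper(), site.upper()))


def is_potential_off_target(spacer: str, site: str, max_mismatches: int = 3) -> bool:
    """Heuristic filter: keep sites that differ from the spacer in only a few positions.

    The threshold is an illustrative assumption. A site with zero mismatches
    also passes, so the intended on-target has to be excluded by the caller.
    """
    return count_mismatches(spacer, site) <= max_mismatches
```

A heuristic of this kind needs no training data, which is why it remains usable for organisms that large-scale empirical studies have not covered.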
Although numerous research papers have suggested that examining off-target sites in large-scale studies predicts off-targets more precisely than the heuristic approach Tsai2015 Doench2016, there are CRISPR experiments for which the empirical approach to scoring off-targets is not applicable in a software tool.
For example, the large-scale studies that examine off-targets mainly use a limited number of organisms, such as human and other eukaryotic model organisms, whereas this project focuses primarily on prokaryotic organisms in bacterial communities (metagenomics). It can therefore only rely on existing heuristics, which could miss many off-target sites Tsai2015.
It has also been reported that off-target search algorithms tend to underestimate the number of off-targets in the genome, most likely because the commonly used Bowtie2 algorithm misses, through its design, potential off-target sites that have more than one mismatch.
Both the algorithm that searches for off-target sites in the genome and the scoring of the sites found could therefore play a crucial role in minimising off-target effects, and this should be investigated further.
Available Design Tools
Several CRISPR tools are available; they share some single-guide RNA design features and differ in others, and any of these features might be required for an individual CRISPR experiment.
Because each CRISPR experiment requires its own set of tool features, finding the best available tool or tools for an experiment is not an easy task.
Fortunately, the CRISPR Software Matchmaker, an interactive online Google spreadsheet (by Cameron MacPherson), was created to ease this selection: it filters the available tools by the individual features that the CRISPR project requires.
However, none of the available tools offers the combination of features targeted in this project. In fact, a search for available tools with features similar to those intended for the Design Tool returned, not surprisingly, no results, even though the search used only a subset of the intended features (see the feature table) as filters.
Software Development Considerations
"Software development is a process of computer programming, documenting, testing, and bug fixing involved in creating and maintaining applications and frameworks resulting in a software product."
Description of the Software Development Environment
The Software Development Environment is the set of development tools and/or processes that are used to create software.
Choosing the proper development environment is a very important step in software development.
The many development tools, programming languages, integrated development environments (IDEs) and operating systems that exist make choosing the components of the development environment difficult.
One of the most important criteria for selecting these components was availability on multiple platforms (cross-platform) and availability to the open-source community, so that the product could be developed as open-source software (OSS).
Historically, the biologist community has used a wide variety of programming languages and operating systems (open- and closed-source) for developing and running its software; therefore, there is no "gold-standard" development environment, especially for developing an open-source biological software product.
In this respect, the components of the development environment for this project were selected on the basis of a mixture of objective and subjective factors.
Mainly, the following important aspects were considered for setting up the Development Environment for the project:
Selection of the target Operating Systems (OSes).
Available Programming Languages for the considered OS(es).
Availability of Frameworks and Libraries for the considered Programming Languages in the targeted OSes.
Available Integrated Development Environments (IDEs) for the targeted OSes for rapid development.
Features of the considered programming languages and IDEs.
Available Relational Database Management Systems for the targeted OSes.
Available existing software components/programs required for the project.
Target Operating Systems
Considering the criteria listed above, Apple's OS X and the open-source Linux were selected as the target operating systems for the software development of the project.
It is worth mentioning that the core of OS X and of Apple's other proprietary operating systems (macOS, iOS, tvOS and watchOS) is Darwin, an open-source Unix operating system.
One of the main reasons for selecting these operating systems for the Design Tool development was therefore the similarity between Linux and Darwin.
However, a future version of the Design Tool might become available for Windows operating systems, because the design of the tool places great emphasis on portability.
Selected Programming Language
The Swift programming language, which was created for Apple's platforms (iOS, tvOS, macOS etc.), was selected as the main programming language for the project.
Swift is developed by Apple Inc. with the aim of offering the modern language features that developers expect.
For instance, Swift, like almost all modern programming languages, supports the Object-Oriented Programming (OOP) paradigm, which is based on the concept of "objects" and their structure and behaviour, and which is the primary development style used for the tool.
Furthermore, Swift version 2.2 was released as open source and made available for Linux in late 2015.
Since Swift was open-sourced, it has been continually growing and evolving through a community-driven process (the Swift Evolution Process).
Although the community-driven Swift Evolution gives opportunities to enhance the functionality of Swift, it also causes API design changes that can make it challenging for developers to maintain their existing source code across new builds.
Considering the above, the latest snapshot of Swift 3, the first major release developed together with the community, was selected for developing the Design Tool, even though a snapshot should not be considered a final version.
Chosen Integrated Developer Environment
Although Linux is very popular for developing open-source software, it lacks a professional Integrated Development Environment (IDE) for Swift, which is still a relatively young language on Linux.
In contrast, on OS X, Xcode, a fully featured professional IDE developed by Apple that natively supports Swift, is available to developers free of charge.
For this reason, the Xcode 8 beta on Apple's OS X operating system was selected as the main development environment for the tool, as the selected programming language, the open-source Swift 3.0, is fully integrated with Xcode 8.
Although Xcode only runs on Apple's OS X operating system, the developed software can be built on Linux if the source code developed on OS X is compatible (portable) with Swift on Linux.
Therefore, the development of the Design Tool put great emphasis on source code portability between OS X and Linux.
Although the system design choices of the tool were mainly affected by the capabilities of the selected Software Development Environment, they were also strongly inspired by a planned, unusual feature of the tool: the capability of providing different types of user interface (UI) on the same or on different platforms (portability), such as an OS-independent command-line interface (CLI), a web-based user interface or an OS-dependent graphical user interface (GUI).
To be able to provide different UIs for the users, the Presentation (CLI or GUI) of the software tool should be separated from its Business logic.
This Separation of Concerns (SoC), which allows changes to the GUI code without much impact on the business logic code, can be achieved by applying a user interface architectural pattern (Design Pattern).
Several architectural design patterns were available for consideration, such as MVC, MVP, Presentation Model, MVVM and MVP-VM. MVP-VM, one of the patterns that supports multiple user interface technologies, was selected to implement the desired separation of concerns between the Presentation Layer and the Business Layer of the tool.
Furthermore, MVP-VM can be used together with a typical 3-Tier/Multi-Tier Architecture (Front-End, Business Logic, Back-End), which is the classical layered structure of software systems, as the Middle Tier usually has components related to the other tiers (e.g. the presentation logic to the Front-End and the data access logic to the Back-End). See the architecture figure for details.
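The separation of presentation from business logic can be sketched as follows. This is a much-simplified illustration of the idea and not a full MVP-VM implementation; it is written in Python rather than the tool's Swift, and all class names (GuideViewModel, GuideView, CommandLineView, GuidePresenter) are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GuideViewModel:
    """Display-ready data handed to any view (hypothetical name)."""
    sequence: str
    score_text: str


class GuideView(Protocol):
    """Anything that can display view models: a CLI, a web page, a GUI."""
    def show(self, items: list[GuideViewModel]) -> None: ...


class CommandLineView:
    """One possible presentation: tab-separated text lines."""
    def __init__(self) -> None:
        self.lines: list[str] = []

    def show(self, items: list[GuideViewModel]) -> None:
        self.lines = [f"{item.sequence}\t{item.score_text}" for item in items]
        for line in self.lines:
            print(line)


class GuidePresenter:
    """Converts business-layer results into view models; unaware of the UI technology."""
    def __init__(self, view: GuideView) -> None:
        self.view = view

    def present(self, scored_guides: list[tuple[str, float]]) -> None:
        # Highest score first; formatting is decided here, not in the view.
        ordered = sorted(scored_guides, key=lambda guide: guide[1], reverse=True)
        self.view.show([GuideViewModel(seq, f"{score:.2f}") for seq, score in ordered])
```

Replacing CommandLineView with a web or GUI view would leave GuidePresenter and the scoring code untouched, which is the property the design relies on.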
This design consideration makes it easier for a future version of the tool to implement a typical 3-Tier Architecture in which the three major tiers are distributed to different places in a network, which also requires the separation of the logic, the data and the presentation.
In addition, this separation not only makes it possible to provide different UIs, but also involves breaking the software code into smaller individual pieces called units, which can then be subjected to a series of tests (Unit Tests) that focus only on those small parts of the code.
This Unit Test approach can reduce the development time of the tool and helps to find software bugs in the developed application early.
In software development, Design Patterns are reusable and tested solutions (best practices) to solve common design problems of an application.
For example, the MVP-VM architectural pattern is a Design Pattern that can be used in this Object-oriented programming (OOP) project to solve a certain problem (capability of using different UIs) by separating the roles by decoupling the underlying objects (classes).
As another example, the sgRNA Design Tool is designed to be able to use different algorithms for on-target scoring by integrating already existing software. These native programs require different types of input file, take different command parameters and produce completely different types of output file.
To solve the design problems of producing individual input files and parsing the outputs of native software that can be run simultaneously, three reusable design patterns were used: the Parser (see the UML diagram of the implemented Parser Design Pattern in the corresponding figure), the Worker Thread and the Async Task patterns.
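How a common parser interface and worker threads fit together can be sketched as below. The sketch is in Python for illustration only and does not reproduce the tool's Swift implementation; the class names and the tab-separated hit format are hypothetical and do not correspond to the real output format of any bundled program.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class OutputParser(ABC):
    """Common interface: each external tool gets its own parser subclass."""
    @abstractmethod
    def parse(self, raw_output: str) -> list[dict]:
        ...


class TabSeparatedHitParser(OutputParser):
    """Hypothetical format: one hit per line, 'sequence<TAB>position<TAB>mismatches'."""
    def parse(self, raw_output: str) -> list[dict]:
        hits = []
        for line in raw_output.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment/header lines
            sequence, position, mismatches = line.split("\t")
            hits.append({"sequence": sequence,
                         "position": int(position),
                         "mismatches": int(mismatches)})
        return hits


def parse_all(jobs: list[tuple[OutputParser, str]], workers: int = 4) -> list[list[dict]]:
    """Run each (parser, raw_output) job on a worker thread; results keep the job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(parser.parse, raw) for parser, raw in jobs]
        return [future.result() for future in futures]
```

Supporting a further external program then only requires a new OutputParser subclass; the code that schedules and collects the work is unchanged.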
As the examples above show, different patterns help developers to solve different problems when designing and/or developing software.
Thus, the design-pattern-based approach, which implements different Design Patterns (Command Pattern, Decorator Pattern, Factory Pattern, etc.) to solve different design issues, is heavily used in the design and development stages of the sgRNA Design Tool.
Reusability is a well-known concept in software development that can improve quality, enhance functionality and decrease the complexity of software products by reusing previously developed software (e.g. device drivers, shared libraries, modules etc.) in a new software product Wang2010.
However, reusability in software development does not apply only to source code or software modules; it can also refer to any reusable existing datasets, documentation, design patterns etc.
One technique for reusing already available software components developed in Apple's Swift programming language is to use 3rd-party Frameworks.
A Framework, in Apple's terminology, is a bundle of shared resources that, similarly to a library, can be reused by multiple independent software applications.
To achieve the intended features of the tool, the following 3rd-party Frameworks were selected; all other components of the source code were written by the author:
SwiftCLI – Used to develop the CLI of the tool.
Camembert – Swift wrapper for SQLite database engine.
Express – Asynchronous web application server.
BrightFutures – Framework to leverage asynchronous code in Swift.
TidyJSON – JSON package for Swift.
PathToRegex – Swift library for path translation.
Regex – Regular Expression library for Swift.
Stencil – Template language (similar to Django and Mustache) for Swift.
The Design Tool is designed to store its data in, and retrieve it from, a local database. Selecting the proper database management software depends on the characteristics of the data used by the application.
Since the data sets used by the tool are neither large nor complex, and are not separated from the application by a network, there is no benefit in using a complex client/server relational database for this project.
Therefore SQLite, a single-file relational database engine that can easily be embedded into an application, was chosen for managing the design tool's data.
SQLite is a fast and reliable database engine that requires no maintenance or configuration, and it is a good choice for an application that only uses local storage for data that does not exceed a terabyte.
SQLite, as all other relational databases, organises data in relations which are implemented as tables. Similarly to spreadsheets, tables are made up of rows (records or tuples) and columns (attributes or fields) and the relationships that can be defined between tables can be used to efficiently store and effectively retrieve the required reliable, accurate and non-redundant data.
However, to ensure data integrity, accuracy and eliminate data redundancy the database should be properly designed.
Database design starts with the requirement analysis, in other words, gathering the requirements and defining the purpose (objective) of the database.
The next step is to determine the data to be stored in the database, to group related data into tables and to create the relationships among these tables. The final step of the database design is refinement and normalisation.
The database design of the Design Tool requires the biological entities, that are involved in the CRISPR based genetic engineering technique, to be abstracted into the database.
These biological entities are:
the species (DesignSource), which are the subjects of the targeted genetic manipulation,
the CRISPR-associated endonucleases (Nuclease), the enzymes that can bind to an enzyme-specific PAM sequence in a genomic DNA sequence,
the PAM sequences (PAM), the enzyme-specific genomic DNA sequences where the related Cas-sgRNA complexes can bind,
the targets, i.e. RNA or genomic targets (On/Off-target), the short genomic DNA sequences that are present immediately upstream or downstream (depending on the endonuclease used) of a PAM and are targeted by the activity of the PAM-specific endonuclease.
These entities, along with the non-biological entities required by the tool, are implemented as tables in the database design.
The final database design, which includes the normalised tables and their relationships to the other tables, can be seen in the database design figure.
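How such entities might map onto related SQLite tables can be sketched as follows, using Python's standard sqlite3 module for illustration. The table names follow the entity names given in the text, but every column and constraint is an assumption; the actual schema is the one shown in the database design figure and may differ.

```python
import sqlite3

# Hypothetical, simplified schema: one table per entity, linked by foreign keys.
SCHEMA = """
CREATE TABLE nuclease (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE pam (
    id          INTEGER PRIMARY KEY,
    nuclease_id INTEGER NOT NULL REFERENCES nuclease(id),
    sequence    TEXT NOT NULL
);
CREATE TABLE design_source (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE target (
    id               INTEGER PRIMARY KEY,
    design_source_id INTEGER NOT NULL REFERENCES design_source(id),
    pam_id           INTEGER NOT NULL REFERENCES pam(id),
    sequence         TEXT NOT NULL,
    position         INTEGER NOT NULL,
    strand           TEXT NOT NULL CHECK (strand IN ('+', '-'))
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the sketch schema in a single-file (or in-memory) SQLite database."""
    connection = sqlite3.connect(path)
    connection.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    connection.executescript(SCHEMA)
    return connection
```

Storing each PAM once and referencing it from the targets is what removes the redundancy mentioned above: a PAM sequence is not repeated for every target that uses it.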
Used Software Components
The Design Tool is designed to be capable of using already available software algorithms for scoring the candidate guide RNAs.
This functionality allows experimentalists to choose, from a selection of scoring algorithms, the one that is most suitable for their CRISPR experiment.
Therefore, the following pre-built software components are bundled into the design tool:
Cas-Offinder – offline off-target finder tool
bwa – Burrows-Wheeler Alignment tool
Blat – a BLAST-like alignment tool, developed by Jim Kent, that uses a pairwise sequence alignment algorithm
Bowtie – Burrows-Wheeler sequence aligner and analyser.
Bowtie2 – successor of Bowtie that adds gapped and local alignment (both versions are built on an FM-Index, which is related to the suffix array)
However, due to the time constraints of the project, not all of the functionality of these software components is used in the current version of the developed tool; the components are nevertheless bundled into the design tool to make their integration easier in the future.
The "SwiftBio" Framework
One of the aims of this project is to develop a portable framework (i.e. a shared library on Linux) in Swift that contains basic functions and utilities for biological computation. Implementing all the biological functions required by the Design Tool in a testable framework, the SwiftBio Framework, reduces the complexity of the tool itself.
This approach not only reduces the complexity and improves the quality of the tool, but also gives the opportunity to reuse the framework in future software products, or to enhance or extend its functionality through contributions from the community.
It is worth mentioning that one of the main concerns in the development of the SwiftBio Framework was to support a moderate level of syntax compatibility with those core objects and functions of BioPython, one of the most popular tools for computational biology, that are implemented in BioSwift.
As an example, BioSwift implements parts of BioPython's SeqRecord, SeqIO and Seq modules for dealing with the sequences required by the Design Tool. The reason for this syntax compatibility is to make BioSwift accessible to other developers who are familiar with BioPython and would like to develop a new product with Swift on Linux or Mac OS X.
As a result, the "SwiftBio" framework includes the basic functionality of parsing DNA sequence files (currently only FASTA) into a sequence record using a syntax similar to that of BioPython's SeqRecord. In addition, the SwiftBio Framework implements the CRISPR-related functions that are required for designing and scoring single-guide RNAs.
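Parsing FASTA input into sequence records can be sketched as below. This is an illustrative Python sketch, not the framework's Swift code and not BioPython's API: the SequenceRecord type and its field names are hypothetical and only loosely modelled on the idea of a sequence record.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class SequenceRecord:
    """Hypothetical record type: identifier, full header line and sequence."""
    identifier: str
    description: str
    sequence: str


def _make_record(header: str, chunks: list[str]) -> SequenceRecord:
    # The identifier is the first whitespace-delimited word of the header.
    identifier = header.split(maxsplit=1)[0] if header.strip() else ""
    return SequenceRecord(identifier, header, "".join(chunks).upper())


def parse_fasta(lines: Iterable[str]) -> Iterator[SequenceRecord]:
    """Yield one record per '>' header; sequence lines are concatenated."""
    header = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # lines before the first header are ignored
    if header is not None:
        yield _make_record(header, chunks)
```

Because parse_fasta accepts any iterable of lines, an open file handle can be passed directly, so multi-genome FASTA files do not have to be loaded into memory as a whole.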
gRNA Design considerations
A typical in silico design of single-guide RNAs for a CRISPR experiment can be broken into two main steps: identifying the sgRNA candidates and predicting the "goodness" (scoring) of the candidates.
Locating potential target sequences (on-targets) in a design target (the region of interest in the genome) is undoubtedly the easiest part of the sgRNA design. It is a simple search for all endonuclease-specific PAM sequences in the genomic DNA sequence of interest, i.e. the sites where the particular CRISPR-associated endonuclease can bind, followed by collecting all potential genomic target sequences that are present immediately upstream or downstream (depending on the endonuclease used) of the PAM sequences found.
In other words, the search finds all target sequences of interest in the genome, which will become the protospacer (RNA targeting sequence) part of the candidate sgRNAs.
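The candidate search described above can be sketched for the wtCas9 case, where the PAM is NGG and the target lies immediately upstream (5') of the PAM. The sketch is in Python for illustration; the function names, the returned tuple layout and the default spacer length of 20 nucleotides are assumptions rather than details of the Design Tool.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_candidates(region: str, spacer_length: int = 20):
    """Return (spacer, pam, strand, start) for every NGG site with a full-length upstream target.

    `start` is the 0-based index of the spacer on the strand that was scanned
    (for the '-' strand this is an index into the reverse complement).
    """
    candidates = []
    forward = region.upper()
    for strand, sequence in (("+", forward), ("-", reverse_complement(forward))):
        # A lookahead is used so that overlapping PAM sites (e.g. in 'GGGG') are all found.
        for match in re.finditer(r"(?=([ACGT]GG))", sequence):
            pam_start = match.start()
            if pam_start < spacer_length:
                continue  # not enough sequence upstream of the PAM
            spacer = sequence[pam_start - spacer_length:pam_start]
            candidates.append((spacer, match.group(1), strand, pam_start - spacer_length))
    return candidates
```

An endonuclease whose target lies downstream of its PAM would need the slice taken on the other side of the match; that case is left out here for brevity.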
However, predicting the "goodness" of the single-guide RNA candidates (i.e. quantitatively scoring their efficiency and specificity) for their genomic targets in a particular CRISPR experiment is challenging, mainly because both their on-target and their off-target activities are hard to predict.
Ideally, the sgRNA targeting sequence should have 100% homology to exactly one target sequence, with no other homologous sequences in the genome. The designed sgRNA would then be specific only to its target sequence (targeting specificity) and would also result in maximal cleavage efficiency at the desired target sequence (on-target activity). In such a "perfect" CRISPR "knockout" experiment, the on-target activity would be perfect (100% efficiency) with no off-target effects (0% off-target activity).
Realistically, however, several factors can affect the on-target activity of sgRNA targeting sequences that have perfect homology in the genome, such as the number, type and positions of the nucleotides in the associated target sequences, or other factors (e.g. the sgRNA or endonuclease concentration in a species), any of which can increase or decrease on-target activity.
As a result, two different sgRNA targeting sequences that both have perfect homology can have different cleavage efficiencies for a certain species in the same CRISPR "knockout" experiment.
Moreover, the sgRNA's protospacer (RNA targeting sequence) will likely have partial or even complete homology elsewhere in the genome (off-target(s)), where the Cas-sgRNA complex can bind and perform unwanted activity (off-target activity).
In addition, an endonuclease recognises a specific canonical (optimal) PAM sequence that varies depending on the bacterial species from which the particular nuclease is derived. It is well known that an endonuclease can also target alternative PAM sites, with lower efficiency than the optimal PAM Hsu2013a Tsai2015a, which can likewise result in unwanted (off-target) activities in the genome. See the PAM table for some recognised alternative PAM sequences of some endonuclease variants. As can be seen in the table, the endonuclease derived from wild-type S. pyogenes (wtCas9) can target sites flanked by "5'-NGG-3'", but can also target "5'-NAG-3'", although at unknown efficiency.
However, not all alternative PAM sequences have been identified for the individual Cas endonucleases, which makes the prediction of off-target effects even harder.
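Matching a genomic site against an optimal PAM and a list of alternative PAMs can be sketched as follows. The IUPAC nucleotide codes are standard and NGG/NAG are the wtCas9 PAMs named in the text; the function names and the three-way classification are hypothetical, and the sketch is written in Python for illustration only.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_pam(site: str, pam_pattern: str) -> bool:
    """True if every base of `site` is allowed by the IUPAC code at that position."""
    if len(site) != len(pam_pattern):
        return False
    return all(base in IUPAC[code] for base, code in zip(site.upper(), pam_pattern.upper()))


def classify_pam(site: str, optimal: str = "NGG", alternatives=("NAG",)):
    """Return 'optimal', 'alternative' or None for a candidate PAM site."""
    if matches_pam(site, optimal):
        return "optimal"
    if any(matches_pam(site, alternative) for alternative in alternatives):
        return "alternative"
    return None
```

Keeping the alternatives in a plain list means that newly identified alternative PAMs for an endonuclease can be added as data, without changing the search code.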
In addition, several initial studies have demonstrated that target mismatches close to the PAM (a region of approximately 1-10 base pairs that is also referred to as the "seed") can significantly decrease the endonuclease activity, while mismatches further away from the PAM (e.g. more than 10-12 base pairs) do not strongly affect its activity Hsu2013a Zhang2015 Sternberg2014 Cong2013.
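One way to express the seed effect in a heuristic is to weight mismatches by their distance from the PAM. The Python sketch below assumes a PAM on the 3' side of the target, as for wtCas9, so that the seed corresponds to the last bases of the spacer. The function name, the seed length of 10 and both weights are illustrative assumptions, not values from the cited studies or from the Design Tool.

```python
def weighted_mismatch_penalty(spacer: str, site: str, seed_length: int = 10,
                              seed_weight: float = 1.0, distal_weight: float = 0.25) -> float:
    """Sum of per-mismatch penalties; PAM-proximal (seed) mismatches count more.

    A larger penalty means the site is predicted to be cleaved less readily.
    The weights are placeholders chosen only to show the principle.
    """
    if len(spacer) != len(site):
        raise ValueError("spacer and site must have the same length")
    seed_start = len(spacer) - seed_length  # first index that belongs to the seed
    penalty = 0.0
    for index, (a, b) in enumerate(zip(spacer.upper(), site.upper())):
        if a != b:
            penalty += seed_weight if index >= seed_start else distal_weight
    return penalty
```

With these weights, a single mismatch next to the PAM penalises a site as much as four mismatches at the PAM-distal end, which mirrors the qualitative observation above without claiming any quantitative accuracy.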
Predicting "on-target" and "off-target" activities
To summarise, the endonuclease's genomic target search mechanism is dictated by the recognition of the PAM sites, to which an active Cas-sgRNA complex can bind.
However, the nuclease activity is only triggered if there is a target that matches the protospacer of the bound complex's sgRNA. These two criteria (a recognised PAM with a matching target) must therefore be satisfied for the endonuclease to function, and both must be considered when predicting the on-target and off-target activities in the genome.
However, as already mentioned in the previous chapter, a matching target does not have to have 100% homology to the targeting sequence; partial homology can also trigger the nuclease activity, albeit at lower efficiency Zhang2015.
Scoring the efficiency and specificity of an sgRNA for a CRISPR experiment is complex, and the number of empirically validated sgRNAs for bacterial communities is limited, especially for diverse bacterial species. For these reasons the Design Tool, like several other available design tools, scores the potential sgRNAs on the basis of their heuristically predicted on-target and off-target activities, and therefore assumes that only sequence similarity affects the "goodness" of the designed guide RNAs.
The implemented workflow of the sgRNA design is summarised below:
Locate all RNA target sequences in the region of interest of the genome sequence (on-targets); in other words, identify the sgRNA candidates, as each sgRNA's targeting sequence is fully homologous to its RNA target sequence.
Find and score all potential off-targets of each sgRNA candidate.
Compute a score for each candidate based on its scored off-targets and, optionally, on its predicted on-target activity.
Choose a selection of the highest-scoring sgRNAs for the CRISPR experiment.
sgRNA candidate search
As mentioned earlier, the candidate sgRNAs are all the target sequences found by a simple search for the canonical PAMs in the region of interest of the genome. The number of candidates depends on the length of the design target and the complexity of the canonical PAM used.
For example, searching for the short and simple "5'-NGG-3'" PAM sequence of the wtCas9 endonuclease in a 2000-nucleotide region of a genome (double-stranded genomic DNA) yields approximately 250 sgRNA candidates, i.e. assuming equal base frequencies, the expected number of "GG" 2-mers in 2 kb of double-stranded genomic DNA is 2 × 2000 × (1/4)^2 = 250.
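A minimal sketch of such a candidate search is given here, assuming a 20-nucleotide spacer followed by an NGG PAM on its 3' side and scanning both strands. The function names, the tuple layout and the uniform-base estimate are illustrative assumptions, not the Design Tool's implementation. Under the uniform model a 2000-nucleotide region gives 2 × 2000 / 16 = 250 expected sites, which agrees with the 250 candidates used later in the text.

```python
from __future__ import annotations

import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_candidates(region: str, spacer_length: int = 20) -> list[tuple[str, int, str]]:
    """Return (strand, start, spacer) for every spacer followed by NGG.

    `start` is 0-based on the strand that was scanned. A lookahead is used
    so that overlapping sites are all reported.
    """
    region = region.upper()
    pattern = re.compile(rf"(?=([ACGT]{{{spacer_length}}})[ACGT]GG)")
    hits = []
    for strand, seq in (("+", region), ("-", reverse_complement(region))):
        for match in pattern.finditer(seq):
            hits.append((strand, match.start(), match.group(1)))
    return hits


def expected_ngg_sites(length: int) -> float:
    """Expected NGG count on both strands, assuming equiprobable bases."""
    return 2 * length * (1 / 4) ** 2


hits = find_candidates("A" * 20 + "TGG")
```

The estimate ignores GC content, so GC-rich genomes will contain more NGG sites than it suggests.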
However, finding all potential off-targets for all of the candidates can be computationally intensive due to the number of candidates, the length of the genome and the complexity of the canonical PAM.
Considering the above example, to search for potential off-targets in an average-sized bacterial genome (approx. 5 million base pairs), the algorithm must find all the PAMs in the whole genome (approximately 1.2 million for the double-stranded genome) and then compare each of the potential off-targets found with all 250 sgRNA candidates.
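The size of this search can be checked with a short calculation, assuming independent and equally likely bases (an idealisation, since real genomes have skewed GC content). Under that model NGG alone gives about 625,000 sites in a 5 Mbp genome. A count near 1.2 million is what the model gives if both NGG and NAG sites are included, which is one possible reading of the figure quoted above and is an assumption here.

```latex
% Expected number of NGG sites on both strands of a region of length L,
% assuming independent, equiprobable bases.
E[\text{NGG sites}] = 2 \cdot L \cdot \left(\tfrac{1}{4}\right)^{2} = \frac{L}{8}

% Design target, L = 2000
E = \frac{2000}{8} = 250

% Whole genome, L = 5 \times 10^{6}
E = \frac{5 \times 10^{6}}{8} = 625\,000 \quad (\text{NGG only})
E = 2 \cdot 625\,000 = 1.25 \times 10^{6} \quad (\text{NGG and NAG})

% Pairwise comparisons between 250 candidates and about 1.2 million sites
250 \times 1.2 \times 10^{6} = 3 \times 10^{8}
```

The last line shows why a naive all-against-all comparison becomes expensive even for a single bacterial genome.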
In addition, recent studies have suggested that alternative PAM sequences also influence the activity of a given endonuclease, and they should therefore be considered in the search for off-targets, even though they do have lower binding efficiency than the endonuclease's optimal PAM sequence Zhang2015.
Therefore, many existing design tools utilise general sequence aligners (such as BWA, Bowtie, BLAT, etc.) to search for off-targets that contain a small number of mismatches, mainly to improve processing speed.
Although the Design Tool is prepared to utilise any alignment tool and to implement any scoring function easily, the initial version of the tool only utilises Cas-Offinder, a dedicated off-target finder, for searching off-targets. The main reasons are that general aligners do not consider the seed or the PAM sequences when scoring targets that match the sgRNA, and that time constraints did not allow a novel scoring function to be developed.
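As an illustration of how candidates might be handed to Cas-Offinder, the sketch below writes an input file in the basic layout described in the Cas-Offinder documentation: a genome directory, a search pattern that includes the PAM, and one query per line with its maximum number of mismatches. The helper name and its parameters are hypothetical, and the layout should be checked against the installed Cas-Offinder version, since newer releases accept additional fields on the pattern line.

```python
from __future__ import annotations


def build_cas_offinder_input(
    genome_dir: str, pam: str, spacers: list[str], max_mismatches: int
) -> str:
    """Build the text of a Cas-Offinder input file (hypothetical helper).

    Line 1: directory holding the genome FASTA files.
    Line 2: search pattern, N for every spacer base followed by the PAM.
    Others: spacer padded with N over the PAM, then the mismatch limit.
    """
    if not spacers:
        raise ValueError("at least one spacer is required")
    length = len(spacers[0])
    if any(len(s) != length for s in spacers):
        raise ValueError("all spacers must have the same length")
    lines = [genome_dir, "N" * length + pam.upper()]
    padding = "N" * len(pam)
    lines += [f"{s.upper()}{padding} {max_mismatches}" for s in spacers]
    return "\n".join(lines) + "\n"


text = build_cas_offinder_input(
    "/data/genomes", "NGG", ["ACGTACGTACGTACGTACGT"], 3
)
```

Padding each query with N over the PAM keeps the pattern and query lines the same length, which the tool expects.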
The scoring algorithm implemented by the tool for the off-targets returned by the Cas-Offinder search is based on the number and position of the mismatches to the particular sgRNA, and on the associated endonuclease's binding affinity to the off-target's PAM.
Therefore the formula of the actual scoring algorithm for the Cas-Offinder based off-target search is defined by the following equation:
The first term calculates the predicted endonuclease efficiency at an off-target, based on the experimentally supported assumption that mismatches closer to the PAM decrease endonuclease activity Hsu2013a Zhang2015 Sternberg2014 Cong2013.
The second term is the PAM affinity, which is based on the predicted binding efficiency of the endonuclease to the characterised canonical PAM sequences.
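To illustrate how two such terms can be combined, the sketch below multiplies a position-weighted mismatch term by a PAM affinity term. The functional form, the linear weight and the `max_penalty` value are illustrative assumptions and are not taken from the Design Tool; only the general idea (mismatches near the PAM cost more, and the result is scaled by PAM affinity) follows the description above.

```python
from __future__ import annotations


def mismatch_efficiency(
    distances: list[int], spacer_length: int = 20, max_penalty: float = 0.9
) -> float:
    """Hypothetical first term: predicted efficiency given mismatch positions.

    `distances` are PAM-relative (1 = next to the PAM). Each mismatch scales
    the efficiency by (1 - weight), where the weight falls linearly from
    `max_penalty` next to the PAM towards zero at the distal end.
    """
    efficiency = 1.0
    for d in distances:
        if not 1 <= d <= spacer_length:
            raise ValueError("distance outside the spacer")
        weight = max_penalty * (spacer_length - d + 1) / spacer_length
        efficiency *= 1.0 - weight
    return efficiency


def off_target_score(
    distances: list[int], pam_affinity_percent: float, spacer_length: int = 20
) -> float:
    """Hypothetical score: mismatch term times PAM affinity (as a fraction)."""
    return mismatch_efficiency(distances, spacer_length) * pam_affinity_percent / 100.0


perfect = off_target_score([], 68.0)
near_pam = off_target_score([1], 68.0)
far_from_pam = off_target_score([20], 68.0)
```

With these assumptions a single mismatch next to the PAM lowers the score far more than one at the distal end, which is the qualitative behaviour the cited studies report.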
Although most studies evaluating binding efficiency have been conducted only in eukaryotic model organisms, the Design Tool still uses those empirically validated values, where they are available for a specific endonuclease, for scoring off-target activity.
As not every endonuclease available in the Design Tool's database has been experimentally validated for its characterised PAM sequences (optimal or alternative), as a rule of thumb the Design Tool uses default values, similar to the empirically derived ones, to predict binding efficiency to the particular PAM(s).
As an example, wtCas9 has one optimal ("NGG") and three alternative ("NAG", "NGA", "NAA") PAM sequences, whose binding specificities are quantified as percentages based on the host cells surviving in the experiment, e.g. NGG (68.0), NAG (1.32), NGA (0.2), NAA (0.07) Kleinstiver2015. In contrast, NmCas9, which does not have an empirically validated specificity, uses the Design Tool's default value for its optimal "NNNNGATT" PAM sequence.
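A PAM affinity lookup of this kind could be sketched as follows. The wtCas9 values are the ones quoted above; the function names, the return value of 0.0 for unrecognised PAMs and the placeholder default passed for NmCas9 are assumptions made for illustration, since the Design Tool's actual default is not stated here.

```python
from __future__ import annotations

# IUPAC nucleotide codes used in PAM patterns.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Values quoted in the text for wtCas9 (percent).
WT_CAS9_PAM_AFFINITY = {"NGG": 68.0, "NAG": 1.32, "NGA": 0.2, "NAA": 0.07}


def pam_matches(pattern: str, pam: str) -> bool:
    """True if a concrete PAM is covered by an IUPAC PAM pattern."""
    if len(pattern) != len(pam):
        return False
    return all(b in IUPAC[c] for c, b in zip(pattern.upper(), pam.upper()))


def pam_affinity(pam: str, table: dict[str, float]) -> float:
    """Affinity of the first matching pattern, or 0.0 if none matches."""
    for pattern, value in table.items():
        if pam_matches(pattern, pam):
            return value
    return 0.0


def default_table(optimal_pam: str, default_affinity: float) -> dict[str, float]:
    """Table for an endonuclease without validated values (caller sets default)."""
    return {optimal_pam: default_affinity}


# 50.0 is a placeholder only; it is not the Design Tool's default.
nm_cas9 = default_table("NNNNGATT", 50.0)
```

Keeping the table per endonuclease makes it straightforward to swap in validated values when they become available.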
The score of an individual sgRNA candidate is based on the aggregated scores of its predicted off-targets and the binding affinity value of the optimal PAM.
As can be seen in the formula above, a score "" of the candidate sgRNA means that no homologue of the sgRNA is present in the genome (i.e. no aggregated off-target score is provided). If there is exactly one exact homologue, with a PAM equivalent to the canonical PAM (e.g. for the "NGG" canonical PAM, the "AGG", "TGG", "CGG" and "GGG" PAMs are equivalent), the score would be , as the computed score of the sgRNA will be equal to the formula's value. Therefore only those candidates are considered that have an aggregate score greater than .
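The notion of equivalent PAMs can be shown with a short sketch that expands an IUPAC pattern into the concrete sequences it covers. The function name is an illustrative assumption.

```python
from __future__ import annotations

from itertools import product

# IUPAC nucleotide codes used in PAM patterns.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_pam(pattern: str) -> list[str]:
    """List every concrete PAM covered by an IUPAC pattern."""
    choices = (IUPAC[code] for code in pattern.upper())
    return ["".join(bases) for bases in product(*choices)]


equivalents = expand_pam("NGG")
```

For "NGG" this yields the four equivalent PAMs named above; longer patterns such as "NNNNGATT" expand to 256 sequences, so matching against the pattern is usually preferable to enumerating them.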
The sgRNA ranking is based on the candidates' scores (the higher the score, the higher the rank). The sgRNAs for a CRISPR experiment, typically 6 to 8, should be chosen from the highest-ranked candidates, and candidates with scores less than should be avoided entirely.
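The ranking and selection step can be sketched as a filter followed by a sort. The function name, the dictionary input and the `min_score` parameter are illustrative assumptions; the cut-off value itself is left to the caller.

```python
from __future__ import annotations


def select_sgrnas(
    scores: dict[str, float], min_score: float, n: int = 8
) -> list[tuple[str, float]]:
    """Return up to `n` candidates with score >= min_score, best first."""
    eligible = [(name, s) for name, s in scores.items() if s >= min_score]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return eligible[:n]


# Hypothetical candidate scores and cut-off, for illustration only.
example_scores = {"sg1": 90.0, "sg2": 40.0, "sg3": 75.0}
chosen = select_sgrnas(example_scores, min_score=50.0, n=2)
```

Filtering before sorting ensures that a low-scoring candidate is never selected merely because fewer than `n` good candidates exist.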
This section explains the concepts (terms) that are used in and relevant to this project. Two types of terms are used: general terms from the CRISPR genome engineering field and Design Tool specific terms (tagged with (DT)).
Design source (DT) – A genomic DNA sequence, typically the whole genome sequence of a species, that serves as the source for the Design Tool's sgRNA design.
Design target (DT) – A region of interest in the design source (e.g. a gene) in which genetic manipulation (e.g. genetic knock-out, knock-in, gene repression or activation) is desired. In other words, a region in which all potential genomic targets are subject to the Design Tool's guide RNA design.
gRNA, guide RNA, sgRNA, single guide RNA, single-guide RNA – A synthetic RNA that contains the RNA targeting sequence (spacer), which directs the Cas protein to a specific genomic location in a CRISPR experiment.
RNA targeting sequence, spacer, protospacer – A short sequence within the sgRNA (spacer + crRNA + tracrRNA) that defines the genomic target to be modified. Typically, it is 100% homologous to the genomic sequence targeted for genetic manipulation.
PAM – A short DNA sequence (approx. 3-10 bases long) in a genome to which the PAM-specific endonuclease can bind.
optimal PAM (DT) – The endonuclease-specific canonical PAM.
alternative PAM (DT) – An empirically characterised canonical PAM that an endonuclease can target with lower efficiency than the optimal PAM.
Target, genomic target, RNA target sequence – A short sequence of the genome (approx. 15-25 nucleotides long), adjacent to a PAM, that is the target of the targeting sequence of the designed gRNA.
on-target activity – The desired genetic manipulation at a desired location. Typically, activity within the design targets of a genome.
off-target effects/activity – Unwanted genetic manipulation at non-target sites. Typically, endonuclease activity outside the design target regions.
At the end of this project, I would like to thank my supervisor Professor Anil Wipat and Dr. Martin Sim for their intellectual support, ideas and continuous guidance during all of the phases of this dissertation.