There is academic demand for research into the use of genetic engineering techniques in metagenomics for the enhanced biodegradation Yu2013 of micropollutants in wastewater, especially using the novel CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR-associated) system for genome manipulation in bacterial communities. CRISPR-Cas systems require a custom single-guide RNA (sgRNA or gRNA) for targeted genome editing (e.g. gene targeting), which can be designed using any of the existing CRISPR-Cas sgRNA design software tools.
Although several CRISPR software tools are available, they are limited when it comes to designing single-guide RNAs for bacterial communities from the multiple genome sequences of the community.
Thus, the challenge is to develop a software tool that designs sgRNAs for bacterial communities and is capable of handling multiple bacterial genome sequences for targeted genome modifications. In this respect, the primary aim of this project is to develop a software tool that can be used to design and score guide RNAs for bacterial species in a community, where scoring is based on predicting the on-target activity and the off-target effects of CRISPR-Cas9.
CRISPR is a new method for targeting and editing genomes in any species Doudna2014. It is based on a natural defence system used by bacteria and archaea to protect themselves from viral infections Terns2011.
This bacterial self-defence system Barrangou2007 relies on a DNA endonuclease called Cas9 (CRISPR-associated protein 9), which can cut an organism's genome at any desired location and is directed there by an appropriate RNA called the guide RNA (sgRNA, gRNA) Biolabs2007.
This new technique has revolutionised targeted genome manipulation allowing a previously unseen level of genomic targeting efficiency and simplicity Doudna2014 that continues to receive significant attention in industrial, biological and biomedical research and which led to the emergence of a broad range of computational tools.
The CRISPR method enables several genome targeting applications, such as functional knock-out (KO) of protein-coding genes or non-coding DNA (for example promoters or transcription factor binding sites), knock-in (KI), transcriptional activation or repression, and many others Doench2014. Although this project currently focuses only on 'knock-out' CRISPR applications for bacterial communities, the modular design of the tool means that it can be extended to support other applications.
All CRISPR-based genome targeting applications require the aforementioned Cas9 (or Cpf1) endonuclease and a single-guide RNA to be introduced into the target cell, where they form a ribonucleoprotein complex (Cas-sgRNA complex) that targets a specific DNA sequence in the genome (genomic target, RNA target or target). The target is complementary to the protospacer (spacer) part of the guide RNA and lies next to the PAM (protospacer adjacent motif) site. See the figure (Graham2015) for detailed information about the CRISPR-Cas9 system.
In addition, different types of Cas9 endonuclease are used in, and developed for, different applications Doudna2014. For functional knock-out and knock-in applications, the most commonly used endonuclease is currently the original wild-type Cas9 (wtCas9) protein from Streptococcus pyogenes Esvelt2013. The wtCas9 has two endonuclease domains that together introduce a double-strand break (DSB) three nucleotides upstream of the PAM site's NGG sequence (N means any nucleotide) Ran2013a.
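The relationship between the protospacer, the NGG PAM and the cut position can be illustrated with a minimal sketch. It is written in Python for brevity (the tool itself is developed in Swift), and the function name, the 20-nucleotide default spacer length and the forward-strand-only scan are assumptions made for illustration rather than the tool's actual implementation.

```python
import re


def find_ngg_targets(sequence, spacer_len=20):
    """List candidate targets next to an NGG PAM on the forward strand.

    Returns tuples (start, protospacer, pam, cut_index), where the blunt cut
    falls between positions cut_index - 1 and cut_index (0-based), i.e. three
    nucleotides upstream of the PAM.
    """
    seq = sequence.upper()
    hits = []
    # A lookahead is used so that overlapping PAM sites are all reported.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        start = pam_start - spacer_len
        if start < 0:
            continue  # not enough sequence upstream for a full protospacer
        protospacer = seq[start:pam_start]
        cut_index = pam_start - 3
        hits.append((start, protospacer, match.group(1), cut_index))
    return hits
```

A complete search would also scan the reverse complement of the sequence, since Cas9 can target either strand.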
In contrast, transcriptional activation (CRISPRa) and inhibition (CRISPRi) applications often use a catalytically inactive Cas9 protein called dead Cas9 (dCas9), which has lost its endonuclease activity but retains its targeted, tight binding capacity. The dCas9 is fused to functional activator or repressor domains to form a complex that binds at the gene promoter site and alters the expression of the desired gene Larson2013.
The double-strand breaks (DSBs) produced in genome editing applications by the wild-type Cas9, or by any fully functional Cas9 variant, can be repaired by two distinct endogenous DNA repair systems: the precise homology-directed repair (HDR) Davis2014 and the error-prone non-homologous end joining (NHEJ) Moore1996 Wilson1999.
The main difference between HDR and NHEJ is that HDR requires a nearly identical donor sequence (a donor dsDNA or a single-stranded oligonucleotide, ssODN) as a template for repairing the DSB precisely, while NHEJ directly joins the two (blunt) ends. This can introduce mutations of varying size, such as an insertion at non-matching breaks or a deletion where nucleotides are lost at the damaged break sites (see figure HANDBOOK) Boulton1996.
However, the NHEJ is also capable of accurate DNA repair using compatible microhomologies presented on both strands Wilson1999.
Designing a CRISPR experiment involves several experimental considerations that can be shared between these applications, such as the delivery method of the Cas9 endonuclease and gRNAs into the target species.
Although different CRISPR experiments share some common factors, others may differ. For example, different Cas9 proteins can be used for the same type of application: a synthetically developed Cas9 variant may be used for a knock-out application instead of the wtCas9 protein.
In general, however, the typical workflow of a CRISPR-based experiment CRISPR101 (see the figure for details) shares common components. It begins with the formulation of the biological question (for example, modifying a gene of a bacterium so that it loses a specific function), followed by the selection of the relevant CRISPR/Cas9 application, in this case the knock-out.
After selecting the application, the expression system and the in vivo delivery method for the endonuclease and the designable guide RNA should be chosen (e.g. viral transduction by AAV).
The next step is to design gRNAs (typically 3 to 8 sgRNAs) using a computational tool, which is the primary aim of the project, followed by the process of cloning the components into vectors for delivery. The final steps of the workflow are the delivery of the components based on the selected expression system and delivery method into the target cell(s) followed by the validation of the CRISPR experiment Graham2015.
The success of a CRISPR experiment depends mainly on the gRNA's targeting specificity and its efficiency (successful activity of the desired manipulation at the specific location).
The gRNA's targeting specificity relies on two factors: the target sequence and the PAM sequence. In the perfect scenario, the Cas9 complex would act only at the desired locations (on-targets), where the sgRNA spacer of the Cas9 complex perfectly matches the target sequence (RNA target, target) and where the PAM sequence is compatible with the selected Cas9's binding site.
However, due to the nature of the CRISPR method, unwanted activities can occur at the locations where the gRNA spacer has partial homology to the target. These unwanted locations are called off-targets and the unwanted changes at off-targets are called off-target effects.
On the other hand, predicting the efficiency, that is, the intensity of the activity of the desired manipulation, remains difficult. Efficiency depends on several factors, such as low endonuclease activity in individual cells or, for the knock-out application, inefficient NHEJ-based DNA repair that fails to produce a KO allele in the targeted gene. Despite numerous empirical efforts, the rules governing gRNA efficiency and specificity are not yet fully understood Doench2016.
As a result, the two main challenges for the software tools designing the guide RNA are to try to predict activities at on-target sites and to try to minimise or avoid activities at off-target sites (off-target effects).
The literature shows several ongoing efforts to find new techniques that maximise on-target efficiency and minimise off-target activity Kleinstiver2016 Graham2015. Nevertheless, two approaches are widely used to score on-target and off-target activities quantitatively: the empirical approach, in which statistical data are retrieved from previous studies, and the heuristic approach, in which targets (on or off) are scored simply by the mismatches between the guide RNA sequence (spacer) and the sequences of potential targets in the genome Ran2013a.
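The heuristic approach can be made concrete with a small sketch, written in Python for brevity. The scoring formula below (the fraction of matching positions) is a deliberately simple, hypothetical stand-in for illustration only; it is not a published scoring scheme and not necessarily the one used by the tool.

```python
def count_mismatches(spacer, site):
    """Count positions at which the spacer and a candidate site differ."""
    if len(spacer) != len(site):
        raise ValueError("spacer and site must have the same length")
    return sum(1 for a, b in zip(spacer.upper(), site.upper()) if a != b)


def heuristic_score(spacer, site):
    """Toy score in [0, 1]: 1.0 for a perfect match, lower with each mismatch."""
    if not spacer:
        raise ValueError("spacer must not be empty")
    return 1.0 - count_mismatches(spacer, site) / len(spacer)
```

Published heuristics typically also weight mismatches by their position relative to the PAM, which this sketch ignores.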
Although numerous research papers have suggested that empirical scoring, based on examining off-target sites in large-scale studies, predicts off-targets more precisely than the heuristic approach Tsai2015 Doench2016, there may be CRISPR experiments for which the empirical approach cannot be applied in the software tool.
For example, large-scale off-target studies mainly use a limited number of organisms, such as human and other eukaryotic model organisms, whereas this project focuses primarily on prokaryotic organisms in bacterial communities (metagenomics). It can therefore rely only on existing heuristics, which could miss many off-target sites Tsai2015.
According to , the number of off-targets found by genome-wide searches appears to be underestimated, most likely because the commonly used Bowtie2 algorithm misses, by design, potential off-target sites that have more than one mismatch.
As correctly pointed out, both the algorithm for finding off-target sites in the genome and the scoring of the off-targets found could play a crucial role in minimising off-target effects, and this should be further investigated.
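To illustrate why the search step matters, the sketch below performs an exhaustive scan that reports every site within a chosen number of mismatches across several genomes. It is written in Python for brevity and is only an assumption of how such a search could look: the function name, the NGG PAM filter, the forward-strand-only scan and the brute-force strategy are illustrative choices, not the algorithm used by Bowtie2 or by the tool.

```python
def find_off_targets(spacer, genomes, max_mismatches=3):
    """Exhaustively scan genomes for sites similar to the spacer.

    genomes maps a genome name to its sequence. Each hit is a tuple
    (genome_name, position, site, mismatches) for a site that is followed
    by an NGG PAM and differs from the spacer in at most max_mismatches
    positions. Only the forward strand is scanned.
    """
    spacer = spacer.upper()
    n = len(spacer)
    hits = []
    for name, genome in genomes.items():
        seq = genome.upper()
        for pos in range(len(seq) - n - 3 + 1):
            pam = seq[pos + n:pos + n + 3]
            if pam[1:] != "GG":
                continue  # no NGG PAM directly downstream of this window
            site = seq[pos:pos + n]
            mismatches = sum(a != b for a, b in zip(spacer, site))
            if mismatches <= max_mismatches:
                hits.append((name, pos, site, mismatches))
    return hits
```

This scan never misses a site within the mismatch limit, but its running time grows with genome length times spacer length, so a practical tool would need an indexed search for large communities.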
Available Design Tools
Several CRISPR tools are available; some share common single-guide RNA design features, while others favour different features that might be required for an individual CRISPR experiment.
Each CRISPR experiment requires a certain set of tool features, so finding the best available tool or tools for an experiment is not an easy task.
Fortunately, CRISPR Software Matchmaker, an interactive online Google spreadsheet (by Cameron MacPherson), was created to ease the selection of the best tool(s) for a project by filtering the listed tools on the individual features that the CRISPR project requires.
However, the available tools lack the combination of features that this project aims to achieve. In fact, a search for available tools with a feature set similar to that planned for the Design Tool returned no results, even though the search used only a subset of the planned features (see the table) for filtering.
Software Development Considerations
"Software development is a process of computer programming, documenting, testing, and bug fixing involved in creating and maintaining applications and frameworks resulting in a software product."
Description of the Software Development Environment
The Software Development Environment is the set of development tools and/or processes that are used to create software.
Choosing the proper development environment is a very important step in software development.
In addition, the many development tools, programming languages, integrated development environments (IDEs) and operating systems that exist make choosing the components of the development environment very difficult.
Furthermore, among the most important criteria for selecting these components were availability on multiple platforms (cross-platform) and availability to the open-source community, so that the result can be an open-source software (OSS) product.
Historically, the biologist community has used a wide variety of programming languages and operating systems (open and closed-source) for developing and running its software; therefore there is no 'gold-standard' development environment, especially for developing an open-source biological software product.
In this respect, the components of the development environment for this project were selected on a mixture of objective and subjective factors.
Mainly, the following important aspects were considered for setting up the Development Environment for the project:
Selection of the target Operating Systems (OSes).
Available Programming Languages for the considered OS(es).
Availability of Frameworks and Libraries for the considered Programming Languages in the targeted OSes.
Available Integrated Development Environments (IDEs) for the targeted OSes for rapid development.
Features of the considered programming languages and IDEs.
Available Relational Database Management Systems for the targeted OSes.
Available existing software components/programs required for the project.
Target Operating Systems
Considering the main criteria listed above, Apple's OS X and the open-source Linux were selected as the target operating systems for the software development of the project.
It is worth mentioning that the core of OS X and Apple's other proprietary operating systems (macOS, iOS, tvOS and watchOS) is Darwin, an open-source Unix operating system.
As a result, one of the main reasons for selecting these operating systems for the Design Tool development was the similarity between Linux and Darwin.
However, a future version of the Design Tool might be made available for Windows operating systems, because the design of the tool places great emphasis on portability.
Selected Programming Language
The Swift programming language, which was created for Apple's platforms (iOS, tvOS, macOS etc.), was selected as the main programming language for the project.
Swift is developed by Apple Inc. with the aim of creating a language that offers the wide range of modern features that developers expect.
For instance, Swift, like almost all modern programming languages, supports the Object-Oriented Programming (OOP) paradigm, which is based on the concept of 'objects' and their structure and behaviour, and which will be the primary style for developing the tool.
Furthermore, Swift version 2.2 was released as open source and made available for Linux in late 2015.
Since Swift was open-sourced, it has been continually growing and evolving through a community-driven process (the Swift Evolution Process).
Although the community-driven Swift Evolution gives opportunities to enhance the functionality of Swift, it also causes API design changes that can make it challenging for developers to maintain their existing source code against new builds.
Considering the above, the latest snapshot of Swift 3, the first major release developed with the community, was selected for developing the Design Tool, even though the latest snapshot should not be considered a final version.
Chosen Integrated Development Environment
Although Linux is very popular for developing open-source software, it lacks a professional Integrated Development Environment (IDE), especially for Swift, which is a relatively young language on Linux.
In contrast, for Mac OS X, Xcode, a fully featured professional IDE developed by Apple that natively supports Swift, is available to developers free of charge.
Considering these features, the Xcode 8 beta on Apple's OS X operating system was selected as the main development environment for the tool, because the selected programming language, the open-source Swift 3.0, is fully integrated with Xcode 8.
Although Xcode runs only on Apple's OS X operating system, the developed software can be built on Linux as long as the source code written on OS X is compatible with (portable to) Swift on Linux.
Therefore, the development of the Design Tool placed great emphasis on source code portability between OS X and Linux.
Although the system design choices of the tool were mainly affected by the capabilities of the selected Software Development Environment, they were also strongly inspired by the planned implementation of an unusual feature of the tool. This feature is the capability of providing different types of user interface (UI) on the same or different platforms (portability), such as an OS-independent command-line interface (CLI), a web-based user interface, or even an OS-dependent graphical user interface (GUI).
To be able to provide different UIs for the users, the Presentation (CLI or GUI) of the software tool should be separated from its Business logic.
This Separation of Concerns (SoC), which allows changes to the GUI code without too much impact on the Business logic code, can be achieved by applying a user interface software architectural pattern (Design Pattern).
Although several architectural design patterns are available for consideration, such as MVC, MVP, Presentation Model, MVVM or MVP-VM, the MVP-VM pattern, which supports multiple user interface technologies, was selected to implement the desired separation of concerns between the Presentation Layer and the Business Layer of the tool.
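The separation described above can be sketched as follows. The example is written in Python for brevity (the tool itself is developed in Swift) and is a simplified illustration of the presenter and view-model split, not the tool's actual class design; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class GuideListViewModel:
    """Presentation state only: the rows that any UI should display."""
    rows: List[str] = field(default_factory=list)


class View(Protocol):
    """Anything that can turn a view model into output for the user."""
    def render(self, vm: GuideListViewModel) -> str: ...


class CliView:
    def render(self, vm):
        return "\n".join(vm.rows)


class HtmlView:
    def render(self, vm):
        items = "".join(f"<li>{row}</li>" for row in vm.rows)
        return f"<ul>{items}</ul>"


def rank_guides(guides):
    """Business logic (hypothetical): sort (sequence, score) pairs by score."""
    return sorted(guides, key=lambda guide: guide[1], reverse=True)


class GuidePresenter:
    """Mediates between the business logic and whichever view is plugged in."""

    def __init__(self, view):
        self.view = view
        self.view_model = GuideListViewModel()

    def show(self, guides):
        ranked = rank_guides(guides)
        self.view_model.rows = [f"{seq}\t{score:.2f}" for seq, score in ranked]
        return self.view.render(self.view_model)
```

Swapping `CliView` for `HtmlView` changes the output format without touching `rank_guides`, which is the property the pattern is meant to provide.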
Furthermore, MVP-VM can be used together with a typical 3-Tier/Multi-Tier Architecture (Front-End, Business Logic, Back-End), which is the classical layered structure of software systems, as the Middle Tier usually has components related to the other tiers (e.g. the presentation logic to the Front-End and the data access logic to the Back-End). See the figure for details.
This design consideration helps to enhance a future version of the tool by implementing a typical 3-Tier Architecture in which the three major tiers are distributed to different places in a network, which also requires the separation of the logic, the data and the presentation.
In addition, this separation not only makes it possible to provide different UIs, but also involves breaking the software code into smaller individual pieces called units, which can then be subjected to a series of tests (Unit Tests) that focus only on those small parts of the code.
This Unit Test approach can reduce the development time of the tool and helps to reduce the number of software bugs in the developed application.
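As an illustration of testing one small unit in isolation, the sketch below uses Python's standard `unittest` module (the tool itself is developed in Swift, where the same idea applies). The `gc_content` helper is a hypothetical unit chosen only for the example and is not taken from the tool.

```python
import unittest


def gc_content(seq):
    """Hypothetical unit under test: fraction of G and C bases in a sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


class GcContentTests(unittest.TestCase):
    def test_balanced_sequence(self):
        self.assertAlmostEqual(gc_content("ACGT"), 0.5)

    def test_case_insensitive(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(GcContentTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` executes the three tests; each one exercises only `gc_content`, so a failure points directly at that unit.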
In software development, Design Patterns are reusable and tested solutions (best practices) to solve common design problems of an application.
For example, the MVP-VM architectural pattern is a Design Pattern that can be used in this object-oriented programming (OOP) project to solve a particular problem (the ability to use different UIs) by separating roles and decoupling the underlying objects (classes).
As another example, the sgRNA Design Tool is designed to be able to use different algorithms for on-target scoring by integrating already existing software into the design tool. These native programs require different types of input file(s) and different command parameters, and they produce completely different types of output file.
To solve the design problems of producing individual input files and parsing the outputs of native programs that can run simultaneously, three reusable design patterns were used: the Parser (see the UML diagram of the implemented Parser Design Pattern in the figure), the Worker Thread and the Async Task patterns.
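How a common parser interface and worker threads can fit together is sketched below in Python (the tool itself is developed in Swift). The two output formats, the class names and the use of a thread pool are assumptions made for illustration; they do not describe the formats of any particular scoring program or the tool's actual classes.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class OutputParser(ABC):
    """Common interface: every parser returns {guide sequence: score}."""

    @abstractmethod
    def parse(self, text):
        ...


class TabParser(OutputParser):
    """Hypothetical format: one 'sequence<TAB>score' pair per line."""

    def parse(self, text):
        scores = {}
        for line in text.strip().splitlines():
            seq, score = line.split("\t")
            scores[seq] = float(score)
        return scores


class CsvParser(OutputParser):
    """Hypothetical format: header line, then 'score,sequence' rows."""

    def parse(self, text):
        scores = {}
        for line in text.strip().splitlines()[1:]:  # skip the header
            score, seq = line.split(",")
            scores[seq] = float(score)
        return scores


def parse_all(jobs, max_workers=2):
    """Parse each (parser, raw_output) job on a worker thread.

    Results are returned in the same order as the jobs were given.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(parser.parse, text) for parser, text in jobs]
        return [future.result() for future in futures]
```

Because callers depend only on `OutputParser.parse`, support for another scoring program can be added by writing one more parser class.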
As can be seen from the above examples, different patterns help developers to solve different problems when designing and/or developing software.
Thus, the design-pattern-based approach, which implements different Design Patterns (Command Pattern, Decorator Pattern, Factory Pattern, etc.) to solve different design issues, is heavily used in the design and development stages of the sgRNA Design Tool.
Reusability is a 'well-known' concept in software development that can improve quality, enhance functionality and decrease the complexity of software products by reusing previously developed software (e.g. device drivers, shared libraries, modules etc.) in a new software product Wang2010.
However, reusability in software development does not apply only to source code or software modules; it can also refer to any reusable existing datasets, documentation, design patterns etc.
One technique for reusing already available software components developed in Apple's Swift programming language is to use 3rd-party Frameworks.
A Framework, in Apple's terminology, is a bundle of shared resources that, similarly to a library, can be reused by multiple independent software applications.
To achieve the planned features of the tool, the following 3rd-party Frameworks were selected; all other components of the source code were written by the author:
SwiftCLI – Used to develop the CLI of the tool.
Camembert – Swift wrapper for SQLite database engine.
Express – Asynchronous web application server.