\\documentclass{sig-alternate}

\\pdfpagewidth=8.5truein

\\pdfpageheight=11truein

\\bibliographystyle{unsrt}

\\usepackage{cite}

\\usepackage{booktabs}

\\usepackage{pgfplotstable}

\\usepackage{amsmath}

\\usepackage{float}

\\begin{document}

\\sloppy

\\title{Improving Gene Regulatory Network Reconstruction via Metabolic Simulation}

\\author{

\\alignauthor

Kishan K C\\\\

\\affaddr{Rochester Institute of Technology}\\\\

\\affaddr{Golisano College of Computing and Information Sciences}\\\\

\\affaddr{20 Lomb Memorial Dr, Rochester, NY}\\\\

\\email{[email protected]}

}

\\maketitle

\\begin{abstract}

Genes are the fundamental units of heredity in an organism, and their expression is controlled by an intricate system of genetic switches known as Gene Regulatory Networks (GRNs). GRNs ensure that the right proteins are produced by the right cells at the right time. Accurately reconstructing GRNs from gene expression data helps answer biological questions, and doing so demands effective computational methods that give insight into the biological processes of interest.

In this study, we develop a general approach that leverages metabolic information to improve GRN reconstruction. As an initial step, we present a method that infers a GRN from expression data by formulating network inference as a feature selection problem. We introduce a novel feature ranking technique that ranks transcription factors (TFs) by their importance in regulating the expression of a target gene. In addition, we simulate the biomass associated with the inferred network via metabolic simulation. We plan to improve this model further by incorporating simulated biomass as prior knowledge.

We evaluate our method for GRN inference on five microarray data sets from the DREAM4 in silico network inference challenge. Preliminary results show that our approach outperforms GENIE (random forest), the top performer of the challenge, on four of the five networks.

\\end{abstract}

\\keywords{Gene Regulatory Networks, Ensemble Methods, Support Vector Machine, Recursive Feature Elimination, Biomass simulation}

\\section{Introduction}

A gene is a fundamental unit of a living organism that holds the instructions for making all the proteins the organism needs to survive and grow. The information content of a gene is converted into proteins, which are responsible for the vast array of functions in cells and in the whole body of the organism. An organism has many cell types, such as brain, blood, skin, liver, and bone cells, each responsible for different activities. An intricate system of genetic switches within an organism, known as a gene regulatory network (GRN), represents the network of regulatory interactions that controls cell function. Transcription factors (TFs) regulate the expression of target genes. A GRN ensures that TFs turn their target genes on or off so that the right proteins are made in the right cells at the right time.

Understanding GRNs is of enormous value, since accurate prediction of direct regulatory interactions will advance many areas such as biotechnology, drug development, whole-cell modelling, and the understanding of disease states \\cite{gene}. Knowledge of GRNs can shed light on the mechanisms at work when these cellular processes are dysregulated. As more GRNs from different physiological and disease conditions become available, statistical comparison of these networks allows us to learn how interactions change across conditions and enriches our biological and biomedical understanding of those changes \\cite{ideker2012differential}. For instance, an accurate model of GRNs can explain the mechanisms of diseases that are characterized by GRN dysfunction, and can guide better strategies for drug design and cellular engineering.

GRN reconstruction from expression data is a challenging problem in systems biology because of 1) high dimensionality, 2) low sample size, and 3) noise inherent in the experimental measurements. Choosing a machine learning method that is robust to noise in the data and uses regularization to avoid overfitting is a reasonable way to address these issues. In conjunction with an appropriate method, exploiting other sources of information that shed light on the cellular processes underlying a GRN \\cite{imoto2004combining, werhli2007reconstructing, greenfield2013robust, mukherjee2008network} can improve GRN reconstruction. We propose to incorporate metabolic information associated with a GRN as prior information to improve its reconstruction. Metabolic information, i.e., biomass, depends on the regulatory interactions that guide the chemical reactions of metabolism in an organism. In other words, the GRN influences the production of biomass in an organism.

In this study, we present a method that infers direct relationships between genes from expression data in order to reconstruct a GRN. Some genes may be regulated by many TFs, whereas others may be regulated by a few TFs or by none at all. Assuming that the expression of a gene is a function of the expression of its TFs, we can infer direct regulatory interactions between each TF and its target genes and thereby infer a network. GRN inference can be improved by incorporating biological knowledge into the reconstruction process. We aim to create a model by answering the following research questions:

\\begin{enumerate}

\\item How can we develop a model that reconstructs a GRN from gene expression data and allows the integration of metabolic information?

\\item How can we simulate the biomass associated with a reconstructed network?

\\item How can we develop a framework that incorporates biomass into the model to improve GRN reconstruction?

\\end{enumerate}

We follow a strategy that decomposes the GRN inference problem into multiple regression problems, in each of which we infer the set of TFs that are most informative for predicting the expression level of a target gene. We propose an ensemble Support Vector Machine (ESVM) for TF selection that follows the ensemble and bagging concept of random forests and adopts the backward elimination strategy that is the rationale of recursive feature elimination (RFE). The idea is that building an ensemble of SVM models in each iteration of SVM-RFE, each trained on a randomly drawn subset of the training set for a given target gene, produces different feature rankings of TFs, which are then aggregated as an ensemble vote to obtain the final ranking.

Since only a portion of genes have regulatory interactions, the SVM brings sparsity to our model to handle the redundancy inherent in gene microarray data. The SVM also uses slack variables to cope with noise in the data. Moreover, we have additional information about each feature (i.e., each gene), which can be represented as a set of properties referred to as meta-features. A variety of methods have been shown to incorporate prior knowledge into SVMs. We assume that a weight is assigned to each feature and use the meta-features to define a prior on the weights. We plan to use metabolic information as a prior to initialize the weights and to design a kernel function that incorporates this information.

The biomass output of an organism provides information about the underlying regulatory state that guides the chemical reactions of metabolism producing that biomass. The biomass under a given regulatory state can be predicted by metabolic simulation. The availability of computational models that simulate the biomass associated with a regulatory network, together with the close relation between regulatory state and biomass output, motivates us to use biomass as prior information in our inference model.

\\section{Literature Review} \\label{review}

GRN inference is a long-standing challenge, and a wide variety of computational approaches have been proposed, including statistical correlation, mutual information, regularized linear regression, ensemble methods such as random forests, and neural networks. These approaches rely on expression data alone and do not take metabolic information into account.

Many algorithms for building large GRNs focus on calculating pairwise measures between genes. The expression of a TF and that of its target genes are assumed to be statistically correlated, and correlation measures such as Pearson and Spearman correlation \\cite{Eisen08121998} are commonly used to indicate regulatory relationships between genes. Improvements to the correlation coefficient have been suggested in order to favour hub genes that interact strongly with other genes. However, correlation-based approaches fail to identify non-linear relationships between genes. Information-theoretic approaches were proposed to capture the complex dependencies that are common in biological processes. Although mutual information \\cite{butte2000mutual} captures non-linear dependencies between genes that are invisible to the Pearson correlation coefficient, it predicts many false positive links between genes because of indirect dependencies. Refinements such as CLR \\cite{faith2007large} and ARACNE \\cite{margolin2006aracne} have been proposed to remove indirect effects. Other mutual-information-based methods, such as MRNET \\cite{meyer2007information} and C3NET \\cite{Altay2010}, have been developed to avoid inferring false negative links.

There have been many attempts to compare the performance of different network inference algorithms \\cite{compare, ensemblecompare}. These studies mainly focus on a small set of methods and aim to characterize them. The DREAM (Dialogue for Reverse Engineering Assessments and Methods) network inference challenge \\cite{marbach2012wisdom} focuses on evaluating techniques and provides researchers with benchmark datasets to validate their work.

Recently, ensemble methods \\cite{irrthum2010inferring, Haury2012, Sławek2013, ruyssinck2014nimefi, Guo2016} have formalised the GRN inference problem as a set of local regression subproblems, each of which can be treated as a feature selection problem. Gene Network Inference with Ensemble of trees (GENIE) \\cite{irrthum2010inferring} was the top performer in the DREAM network inference challenges and is recognized as a state-of-the-art method. It uses a tree-based ensemble method to calculate a variable importance for each predictor and treats high feature importance as an indication of a relationship between predictor and target in the GRN. Because feature selection with random forests is not well understood theoretically, alternative methods in the same setting have been proposed. Regression-based methods try to identify the subset of TFs that is most informative for predicting the expression level of a target gene \\cite{Haury2012}. Each non-zero coefficient is interpreted as a regulatory interaction between the TF and the gene. The sparsity of the TF set is encoded by feature selection approaches (e.g., L1 regularization).

\\section{Research Plan}

\\subsection{Research Agenda}

We plan to approach our research problem in three parts: (1) creation of a model to reconstruct a GRN from expression data, (2) metabolic simulation to estimate the biomass production resulting from the reconstructed network, and (3) development of a framework that incorporates biomass output to improve reconstruction. Previous work (discussed in Section \\ref{review}) shows a variety of approaches that have been well studied for the GRN inference problem. In this work, we plan to improve GRN inference by taking metabolic information into account. As an initial step, we will develop a model for GRN inference based on expression data. We then plan to simulate the biomass output of an organism associated with the inferred network. As a final step, we will develop a model that incorporates biomass output as prior information for GRN inference and evaluate its performance against single GRN inference methods and against integrations of multiple methods, none of which take such additional information into account.

\\subsection{Methodology}

\\subsubsection{Problem definition}

In this study, we focus on inferring direct relationships between a TF and its target genes using gene expression data. Measurements of the expression profiles of $G$ genes over $N$ experimental conditions are taken as input. We use the general framework adopted by many ensemble methods \\cite{irrthum2010inferring, Haury2012, Sławek2013, ruyssinck2014nimefi, Guo2016} to approach the GRN inference problem. We define a gene expression dataset as a matrix of $N$ rows by $G$ columns, in which each row represents an experimental condition and each column represents a gene:

\\[
\\mathbf{E} =
\\begin{bmatrix}
x_{1,1} & x_{1,2} & \\dots & x_{1,G} \\\\
x_{2,1} & x_{2,2} & \\dots & x_{2,G} \\\\
\\vdots & \\vdots & \\ddots & \\vdots \\\\
x_{N,1} & x_{N,2} & \\dots & x_{N,G}
\\end{bmatrix}
\\]

where $x_{n,g}$ is the expression value of gene $g$ in experimental condition $n$.

We compute a score $w_{i,j}$ representing the strength of association between TF $i$ and target gene $j$. From these scores we produce a list of potential regulatory links ranked in decreasing order of score, with the most likely links at the top. A network is then obtained by choosing a suitable threshold on this ranking. This approach follows the standard prediction format of the DREAM challenges \\cite{marbach2012wisdom}, which has been widely used to evaluate GRN inference methods.

\\subsubsection{GRN Inference with Feature Selection Approach}

Many popular methods for GRN inference score each candidate regulation individually. For example, the correlation and mutual information between two genes are popular scores for candidate regulations \\cite{Eisen08121998, butte2000mutual}. However, this kind of pairwise approach fails to separate direct from indirect regulations. For instance, if $g_1$ regulates $g_2$ and $g_2$ regulates $g_3$, the correlation or mutual information between $g_1$ and $g_3$ is likely to be large even though there is no direct regulatory link between $g_1$ and $g_3$. Similarly, if $g_1$ regulates both $g_2$ and $g_3$, then $g_2$ and $g_3$ are likely to have high correlation or mutual information although neither directly regulates the other. One strategy for avoiding indirect regulations is to post-process the predicted regulations and remove those that are already explained by other regulations \\cite{margolin2006aracne}. Another strategy is, given a target gene $g \\in G$, to estimate the scores $s(t, g)$ for all candidate regulators $t \\in T_g$ simultaneously.

We use a decomposition setting that converts the GRN inference problem with $G$ genes into $G$ regression subproblems, each of which can be treated as a feature selection problem in statistics \\cite{irrthum2010inferring}. Each subproblem aims to predict the expression value of a particular target gene from the expression values of the TFs \\cite{Haury2012}. More specifically, we want to find the subset of TFs that is most informative for predicting the expression profile of a particular target gene. Let $\\mathbf{E}$ be the gene expression matrix defined above, with the target gene corresponding to its $i$-th column.

This scenario can be presented as a feature selection problem in a regression setting \\cite{Haury2012}. For each target gene $g \\in G$, we consider the regression problem of predicting the expression value of $g$ from the expression levels of its candidate TFs $t \\in T_g$:

\\begin{equation} \\label{eq1}

X_g = f(X_{T_{g}}) + \\epsilon

\\end{equation}

where $X_i$ represents the expression level of the $i$-th gene in the expression matrix $\\mathbf{E}$, $X_{T_g} = \\{X_t, t \\in T_g\\}$ is the set of expression levels of the candidate transcription factors for $g$, and $\\epsilon$ is noise. We are interested in identifying a small subset of transcription factors that are most informative about $X_g$. We also estimate a score $w_{tg}$ for each transcription factor, which can be used to assess its importance in regulating gene $g$:

\\begin{equation}

f(X_{T_g}) = \\sum_{t \\in T_g} w_{tg} X_t, \\qquad T_g \\subseteq \\{1, \\dots, G\\} \\setminus \\{g\\}

\\end{equation}

where $w_{tg}$ is the score indicating the strength of the regulatory link between transcription factor $t$ and gene $g$. We rank the TFs using $w_{tg}$ as the measure of how informative they are for predicting the expression level of the target gene. We repeat this process with every gene as the target and aggregate the individual rankings into a global ranking of all regulatory relationships between genes.

\\subsubsection{Feature Selection with Ensemble SVM-Recursive Feature Elimination (ESVM-RFE)}

Support Vector Machines (SVMs) have been shown to work well in high-dimensional feature spaces \\cite{gunn1998support}. When the number of features exceeds the number of samples, overfitting may arise. SVMs \\cite{vapnik1998statistical} avoid overfitting to some extent without reducing the feature space.

SVM regression performs linear regression in a high-dimensional feature space using the $\\epsilon$-insensitive loss and, at the same time, tries to reduce model complexity by minimizing $\\|w\\|^2$. The $\\epsilon$-insensitive loss function ensures the existence of a global minimum and, at the same time, the optimization of a reliable generalization bound. Non-negative slack variables $\\xi_i$ and $\\xi_i^*$ measure the deviation of training samples outside the $\\epsilon$-insensitive zone. SVM regression is thus formulated as the minimization of the following functional:

\\begin{equation}
\\begin{aligned}
\\text{minimize} \\quad & \\frac{\\|w\\|^2}{2} + C \\sum_{i=1}^{n} (\\xi_i + \\xi_i^*) \\\\
\\text{subject to} \\quad & y_i - f(x_i, w) \\leq \\epsilon + \\xi_i^*, \\\\
& f(x_i, w) - y_i \\leq \\epsilon + \\xi_i, \\\\
& \\xi_i, \\xi_i^* \\geq 0, \\quad i = 1, \\dots, n
\\end{aligned}
\\end{equation}
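
To make the role of the slack variables concrete, the $\\epsilon$-insensitive loss penalized by this formulation can be written out explicitly. The following is the standard textbook form, stated under the assumption of a linear model $f(x, w) = \\langle w, x \\rangle + b$; the symbols $L_\\epsilon$ and $b$ (a bias term) are introduced here for illustration only:

\\begin{equation*}
L_\\epsilon(y, f(x, w)) = \\max\\bigl(0, \\lvert y - f(x, w) \\rvert - \\epsilon\\bigr)
\\end{equation*}

At the optimum, $\\xi_i + \\xi_i^* = L_\\epsilon(y_i, f(x_i, w))$, so samples that lie inside the $\\epsilon$-tube contribute nothing to the objective and only larger deviations are traded off against $\\|w\\|^2$ through $C$.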

The optimization problem can be reformulated as the dual optimization problem:

\\begin{equation}
\\begin{aligned}
\\text{maximize} \\quad & -\\frac{1}{2} \\sum_{i,j=1}^{n} (\\alpha_i - \\alpha_i^*) (\\alpha_j - \\alpha_j^*) \\langle x_i, x_j \\rangle \\\\
& - \\epsilon \\sum_{i=1}^{n} (\\alpha_i + \\alpha_i^*) + \\sum_{i=1}^{n} y_i (\\alpha_i - \\alpha_i^*) \\\\
\\text{subject to} \\quad & \\sum_{i=1}^{n} (\\alpha_i - \\alpha_i^*) = 0, \\\\
& \\alpha_i, \\alpha_i^* \\in [0, C]
\\end{aligned}
\\end{equation}
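
For the linear case used in the feature ranking described later in this section, the primal weight vector can be recovered from the dual solution. This is the standard SVR expansion rather than anything specific to our method, and the bias term $b$ is an assumption of this sketch:

\\begin{equation*}
w = \\sum_{i=1}^{n} (\\alpha_i - \\alpha_i^*) x_i, \\qquad f(x, w) = \\sum_{i=1}^{n} (\\alpha_i - \\alpha_i^*) \\langle x_i, x \\rangle + b
\\end{equation*}

Only training samples with $\\alpha_i \\neq \\alpha_i^*$ (the support vectors) contribute to $w$, so each component of $w$ summarizes how strongly one feature, here one candidate TF, enters the fitted regression function.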

One advantage of Support Vector Machines, and of Support Vector Regression (SVR) in particular, is that they avoid the difficulties of using linear functions in a high-dimensional feature space, because the optimization problem is transformed into a dual convex quadratic program. In the regression case, the loss function penalizes only errors greater than the threshold $\\epsilon$. Such loss functions usually lead to a sparse representation of the decision rule, which gives significant algorithmic and representational advantages.

Recursive Feature Elimination (RFE) is considered an effective feature selection procedure. We implement RFE with Support Vector Regression to identify the least useful transcription factor(s) to eliminate, ending with a subset of transcription factors $t_g \\subset T_g$ that are most informative for predicting the expression level of a target gene $g$. The primary concern with this approach is its computational cost. The cost can be reduced by removing a chunk of the least important features at once, so that more features are removed in each iteration while the important ones are kept.

Starting with all features and removing one feature at a time in a sequential backward elimination manner, we obtain a ranked list of features. Training an $\\epsilon$-SVR on the dataset with all features gives a weight vector with one coefficient per feature. These weights can be used as feature scores to rank the features. The feature with the smallest ranking score $c_i = w_i^2$, where $w_i$ is the weight associated with the $i$-th feature, is removed in each iteration. Intuitively, the features with the largest weights are the most informative. Thus, in each iteration of SVM-RFE one trains the $\\epsilon$-SVR, computes the ranking criterion for all features, and discards the feature with the smallest ranking criterion. The procedure is repeated until a small subset of features remains.
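
As a small illustration of one elimination step, suppose a target gene has three candidate TFs and the trained $\\epsilon$-SVR returns the hypothetical weights below. The numbers are made up for illustration and do not come from our experiments:

\\begin{equation*}
w = (0.8, -0.1, 0.4) \\quad \\Rightarrow \\quad c = (w_1^2, w_2^2, w_3^2) = (0.64, 0.01, 0.16)
\\end{equation*}

The second TF has the smallest criterion and is removed first; the $\\epsilon$-SVR is then retrained on the remaining two TFs before the next elimination. Because the criterion squares the weight, the sign of a weight does not affect the ranking.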

The idea behind using $c_i = w_i^2$ is to remove the feature whose removal changes the objective function the least. The objective function in SVM-RFE is chosen to be $J = \\frac{\\|w\\|^2}{2}$ \\cite{zhou2007msvm}. The effect of removing a feature on the objective function is described by the Optimal Brain Damage algorithm \\cite{lecun1990optimal}, which expands the objective function in a Taylor series to second order as in equation (\\ref{eq2}).

\\begin{equation} \\label{eq2}
\\Delta J(i) = \\frac{\\partial J}{\\partial w_i} \\Delta w_i + \\frac{1}{2} \\frac{\\partial^2 J}{\\partial w_i^2} (\\Delta w_i)^2
\\end{equation}

At the optimum of $J$, the first term on the right-hand side can be neglected. Since $J = \\frac{\\|w\\|^2}{2}$, we have $\\frac{\\partial^2 J}{\\partial w_i^2} = 1$, and the change in the objective function associated with the removal of the $i$-th feature is

\\begin{equation}
\\Delta J(i) = \\frac{1}{2} (\\Delta w_i)^2
\\end{equation}

Removing feature $i$ corresponds to $\\Delta w_i = -w_i$, so $\\Delta J(i)$ is proportional to the ranking criterion $w_i^2$.

Many feature selection algorithms are sensitive to small perturbations of the experimental conditions \\cite{guyon2003introduction}. One way to stabilize them is to repeat the feature selection process on several subsamples obtained by bootstrap sampling of the training data. This bootstrapping approach can be used with SVM-RFE \\cite{duan2005multiple}. Instead of applying it to SVM-RFE as a whole, we plan to apply it to each step of SVM-RFE: we train multiple linear SVMs on subsamples of the data and aggregate their weights to obtain the feature ranking score.
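
One possible way to aggregate the weights, given here only as a sketch since the aggregation rule is not fixed above, is to average the squared weights over $B$ bootstrap subsamples. The symbols $B$, $w_i^{(b)}$ and $\\bar{c}_i$ are assumptions introduced for this illustration:

\\begin{equation*}
\\bar{c}_i = \\frac{1}{B} \\sum_{b=1}^{B} \\bigl(w_i^{(b)}\\bigr)^2
\\end{equation*}

where $w_i^{(b)}$ is the weight of feature $i$ in the linear SVM trained on the $b$-th subsample. Under this choice, the features with the smallest $\\bar{c}_i$ would be eliminated at that step. Other choices, such as averaging per-subsample ranks, fit the same scheme.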

The outcome of this model is an adjacency list of putative regulatory links between genes, as summarized in Figure \\ref{grn}.

\\begin{figure}[h]

\\includegraphics[width=\\linewidth]{grn.png}

\\caption{Overview of our approach to the network inference task. Our model takes DNA microarray data to compute the adjacency matrix representing the putative relationship between genes. This adjacency matrix can be converted to adjacency list.}

\\label{grn}

\\end{figure}

\\subsubsection{Incorporation of Metabolic information}

The observable characteristics of an organism are the result of its underlying GRN. We can therefore improve GRN reconstruction by incorporating observable traits. We plan to analyze the possibility of incorporating various kinds of biological information into the process. Specifically, we focus on metabolic information, i.e., the biomass associated with the network, as shown in Figure \\ref{metabolic}.

\\begin{figure}[h]

\\includegraphics[width=\\linewidth]{metabolic.png}

\\caption{Metabolic simulation to observe the biomass resulting from reconstructed network}

\\label{metabolic}

\\end{figure}

To incorporate this information, we need to compute the amount of biomass resulting from the reconstructed GRN and observe the distribution of biomass under different experimental conditions. However, biomass is not given by a convenient formula but is obtained by running a simulation program. Flux Balance Analysis (FBA) \\cite{orth2010flux} is a widely used approach for studying the metabolic network of an organism. FBA uses linear programming to optimize a biologically motivated objective function, given an expression data set and a metabolic network model. We plan to use the publicly available COBRA (Constraint-Based Reconstruction and Analysis) Toolbox for MATLAB \\cite{cobra} to make numerical predictions of biomass. We then plan to incorporate the simulated biomass into the model.
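
For reference, the linear program solved by FBA has the following standard form \\cite{orth2010flux}. The symbols are the conventional ones and are not otherwise used in this paper: $S$ is the stoichiometric matrix, $v$ the vector of reaction fluxes, $c$ selects the objective (here the biomass reaction), and $v^{\\min}$, $v^{\\max}$ are lower and upper flux bounds:

\\begin{equation*}
\\begin{aligned}
\\text{maximize} \\quad & c^{T} v \\\\
\\text{subject to} \\quad & S v = 0, \\\\
& v^{\\min} \\leq v \\leq v^{\\max}
\\end{aligned}
\\end{equation*}

The constraint $S v = 0$ expresses the steady-state assumption that no metabolite accumulates. Regulatory or expression information typically enters through the bounds, for example by restricting the flux of reactions whose genes are switched off; how exactly this is done for our inferred networks is left open here.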

\\begin{table*}[t]
\\caption{AUROC of our method and GENIE on the five DREAM4 in silico multifactorial datasets}
\\label{tab:title}
\\renewcommand{\\arraystretch}{1.5}
\\begin{tabular*}{\\textwidth}{@{\\extracolsep{\\fill}}lccccc}
\\toprule
\\textbf{Method} & \\textbf{Net 1} & \\textbf{Net 2} & \\textbf{Net 3} & \\textbf{Net 4} & \\textbf{Net 5} \\\\
\\midrule
\\textbf{SVM} & 0.495 & 0.508 & 0.483 & 0.492 & 0.444 \\\\
\\textbf{SVM-RFE} & 0.579 & 0.545 & 0.573 & 0.567 & 0.571 \\\\
\\textbf{GENIE (Random Forest)} & 0.709 & 0.698 & 0.750 & 0.793 & 0.765 \\\\
\\textbf{SVM-RFE with Bootstrapping} & 0.732 & 0.734 & 0.782 & 0.787 & 0.791 \\\\
\\bottomrule
\\end{tabular*}
\\end{table*}

\\subsection{Results}

\\subsubsection{Gene Expression Data Description}

We test our method on five gene expression datasets from the DREAM4 network inference challenge \\cite{greenfield2010dream4}. These datasets were created for the DREAM4 in silico size 100 multifactorial challenge and mimic multifactorial perturbation data, defined as static steady-state expression profiles achieved by slightly perturbing all gene expression values at the same time. Each dataset is represented as an expression matrix of $G$ genes by $N$ chip measurements. Specifically, the challenge provided 100 genes measured over 100 microarrays in each of the five datasets.

Dataset-specific gold standards containing known transcription factor-target gene interactions were compiled for performance evaluation. To be consistent with the evaluation in \\cite{marbach2012wisdom}, we treat all transcription factor-target gene pairs that are not part of the gold standards as negatives, although the gold standards are based on incomplete knowledge and the negatives may therefore contain true interactions that are not yet known. For this reason, the gold standards are restricted to interactions with strong experimental support.

\\subsubsection{Implementation}

We developed an approach that takes an expression dataset and a gold standard as input and infers direct regulatory interactions between TFs and target genes. For each dataset, we run an ensemble of SVM-RFE models with ensemble size 1000. For each ensemble member, we generate a bootstrap sample from the dataset. The model treats each gene in turn as the target gene and takes all other genes as potential transcription factors. If the lists of potential TFs and target genes are known for a dataset, the model uses only the TFs as predictors and the target genes as responses, which makes it more efficient. For each target gene, the model performs SVM-RFE to eliminate the least important features and creates a ranked list of features. To make SVM-RFE efficient, we remove a chunk of features in each iteration, based on the idea of simulated annealing \\cite{ding2006improving}. We use the simple schedule of removing a fraction $\\frac{1}{i+1}$ of the remaining genes during iteration $i$: half are removed in the first iteration, one-third in the second, one-fourth in the third, and so on.
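
The effect of this schedule can be worked out in closed form. Ignoring rounding to whole genes (an assumption of this sketch, since the exact rounding rule is not specified here), let $n_0$ be the initial number of candidate TFs and $n_k$ the number remaining after iteration $k$:

\\begin{equation*}
n_k = n_0 \\prod_{i=1}^{k} \\Bigl(1 - \\frac{1}{i+1}\\Bigr) = n_0 \\prod_{i=1}^{k} \\frac{i}{i+1} = \\frac{n_0}{k+1}
\\end{equation*}

For example, with $n_0 = 99$ candidate TFs (100 genes minus the target), about 33 remain after two iterations and about 9 after ten. Early iterations therefore discard features aggressively, while later iterations proceed more cautiously among the stronger candidates.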

\\subsubsection{Model Evaluation}

We evaluated our approach on gene expression datasets with 100 genes whose expression values were measured over 100 experiments. We compare our method with the GENIE algorithm, the top performer in the DREAM4 challenge, on the same datasets. Table \\ref{tab:title} summarizes the performance of our method in different settings and of GENIE on the DREAM4 datasets in terms of AUROC.
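
For readers unfamiliar with the metric, AUROC can be stated using standard definitions. Thresholding the ranked list after its top $k$ links yields $\\mathrm{TP}_k$ true positives and $\\mathrm{FP}_k$ false positives with respect to the gold standard, which contains $P$ positive and $N_{\\mathrm{neg}}$ negative pairs (these symbols are introduced here for illustration):

\\begin{equation*}
\\mathrm{TPR}_k = \\frac{\\mathrm{TP}_k}{P}, \\qquad \\mathrm{FPR}_k = \\frac{\\mathrm{FP}_k}{N_{\\mathrm{neg}}}
\\end{equation*}

The ROC curve plots $\\mathrm{TPR}_k$ against $\\mathrm{FPR}_k$ as $k$ runs over the whole list, and AUROC is the area under this curve. A random ranking gives an AUROC of about 0.5, which is roughly the level of the plain SVM in Table \\ref{tab:title}.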

It is still early to draw conclusions, as our evaluation is in progress; we plan to evaluate SVM-RFE on the DREAM5 dataset and compare it with other algorithms. Based on the evaluation so far, our method with bootstrapping outperforms GENIE on four of the five DREAM4 size 100 in silico multifactorial datasets (all except Net 4).

As the second part of the project, we simulated biomass using a metabolic model of yeast and expression data from the DREAM5 challenge; the result is presented in Figure \\ref{fig:biomass}. We found that our simulated biomass is consistent with previous work \\cite{vemuri2007increasing}.

\\begin{figure}

\\includegraphics[width=\\linewidth, height=6cm]{biomass.png}

\\caption{Histogram of biomass simulated using metabolic model with gene expression data and gold standard from DREAM5 challenge}

\\label{fig:biomass}

\\end{figure}

\\section{Conclusion and future work}

This study is concerned with identifying regulatory links between genes from expression data and improving that identification by incorporating biological information as prior knowledge. As an initial step, we presented a general approach for reconstructing a GRN from a gene expression dataset using feature selection and evaluated its performance on the DREAM4 datasets. Going forward, we will evaluate our approach on real datasets and contrast its performance with other approaches (discussed in Section \\ref{review}).

We are also interested in evaluating the idea of incorporating biological information as prior knowledge into our approach. The information we plan to incorporate is the biomass simulated under different experimental conditions using a COBRA model, a standard model for metabolic networks with a set of regulatory interactions. Our immediate next step is to add the regulatory links inferred by our approach to the COBRA model and observe the change in biomass. Once we have the biomass associated with the inferred network, we can look for ways to incorporate that information into the model to improve GRN reconstruction. Further, we plan to investigate other biological information that can shed light on the cellular processes underlying GRNs.


\\bibliography{sigproc}

\\newpage

\\appendix

\\section{Research Overview}

This research is part of a three-year project proposed by Dr. Anne Haake (GCCIS) and Dr. Justin Domke (GCCIS) and funded by the National Science Foundation. The project is concerned with developing novel methodology for leveraging metabolic information to improve regulatory reconstruction. Dr. Haake (my advisor) and her team are investigating possible ways to create a model for GRN reconstruction and to improve the model with metabolic information.

Currently, Dr. Haake\'s research team for this project consists of Dr. Rui Li (GCCIS), Dr. Feng Cui (GSoLS), Christopher Snyder (MS student, GSoLS), and Kishan K C (me). Dr. Cui and Snyder, along with Dr. Haake, are the domain experts for the project and help us understand and interpret our approach and results in the biological domain. My research is concerned with developing a general approach to GRN reconstruction and investigating possible ways to incorporate biological information as prior knowledge into the approach to improve reconstruction. I am working closely with Dr. Li on understanding the gene expression datasets and developing a model that can incorporate metabolic information to improve GRN reconstruction. I am also exploring ways to simulate biomass (metabolic information) for an organism so that we can use that information as a prior in the model. Eventually, we aim to develop a model that can incorporate different kinds of biological information, as well as other similarly unruly priors, as prior knowledge to better reconstruct GRNs.

\\end{document}
