Statistical Tools for Measuring the Genetic Diversity for Crop Improvement

ABSTRACT

Information about genetic diversity and variation in breeding material is invaluable in plant breeding and crop improvement. Plant breeders use many statistical tools to estimate genetic diversity in breeding material. These methods rely on pedigree data, morphological data, agronomic performance data, biochemical data, and, more recently, molecular data. For reasonably accurate and unbiased estimates of genetic diversity, adequate attention has to be devoted to (i) sampling strategies; (ii) use of various data sets on the basis of an understanding of their strengths and limitations; (iii) choice of genetic distance measure(s), clustering procedures, and other multivariate methods in analyses of data; and (iv) objective determination of genetic relationships. Judicious combination and application of statistical tools and techniques, such as bootstrapping, is vital for addressing complex issues related to data analysis and interpretation of results from different types of data sets, particularly through clustering procedures. This review focuses on the use of statistical tools and techniques in the analysis of genetic diversity at the intraspecific level in crop plants.

Introduction

Analysis of genetic relationships is an important component of crop improvement, as it provides information about genetic diversity and is a platform for stratified sampling of breeding populations. Accurate assessment of the levels and patterns of genetic diversity can be invaluable in crop breeding for diverse applications, including (i) analysis of genetic variability in cultivars (Smith, 1984; Cox et al., 1986); (ii) identifying diverse parental combinations to create segregating progenies with maximum genetic variability for further selection; and (iii) introgressing desirable genes from diverse germplasm into the available genetic base. An understanding of genetic relationships among inbred lines or pure lines can be particularly useful in planning crosses, in assigning lines to specific heterotic groups, and for precise identification with respect to plant varietal protection. Analysis of genetic diversity in germplasm collections can facilitate reliable classification of accessions and identification of subsets of core accessions with possible utility for specific breeding purposes. Considerable emphasis is being placed on comprehensive analysis of genetic diversity in many crops, including major field crops such as wheat (Triticum aestivum L.), rice (Oryza sativa L.), maize (Zea mays L.), barley (Hordeum vulgare L.), and soybean [Glycine max (L.) Merr.]. The study of genetic diversity is the process by which variation among individuals, groups of individuals, or populations is analyzed by a specific method or a combination of methods. The data often involve numerical measurements and, in many cases, combinations of different types of variables.
Diverse data sets have been used by researchers to analyze genetic diversity in crop plants. The most important among them are pedigree data, passport data, morphological data, biochemical data obtained by analysis of isozymes and storage proteins, and, more recently, DNA-based marker data that allow more reliable differentiation of genotypes. Since each of these data sets provides a different type of information, the choice of analytical method(s) depends on the objective(s) of the experiment, the level of resolution required, the resources and technological infrastructure available, and the operational and time constraints, if any. Statistics is a field of mathematics that pertains to data analysis. Statistical methods and equations can be applied to a data set in order to analyze and interpret results, explain variations in the data, or predict future data. A few examples of statistical information we can calculate are:

* Average value (mean)

* Most frequently occurring value (mode)

* On average, how much each measurement deviates from the mean (standard deviation)

* Span of values over which your data set occurs (range), and

* Middle value of the ordered data set (median)

Statistics is important in the field of engineering because it provides tools to analyze collected data. For example, a chemical engineer may wish to analyze temperature measurements from a mixing tank. Statistical methods can be used to determine how reliable and reproducible the temperature measurements are, how much the temperature varies within the data set, what future temperatures of the tank may be, and how confident the engineer can be in the temperature measurements made. This article will cover the basic statistical functions of mean, median, mode, standard deviation, weighted averages and standard deviations, correlation coefficients, z-scores, and p-values.

Basic Statistics

When performing statistical analysis on a set of data, the mean, median, mode, and standard deviation are all helpful values to calculate. The mean, median, and mode are all estimates of where the "middle" of a set of data is. These values are useful when creating groups or bins to organize larger sets of data. The standard deviation measures the typical distance between the individual data points and the mean.

Mean and Weighted Average:

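The mean (also known as the average) is obtained by dividing the sum of the observed values by the number of observations, n. Although data points fall above, below, or on the mean, it can be considered a good estimate for predicting subsequent data points. The formula for the mean is given below as equation (1).

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$

However, equation (1) can only be used when the error associated with each measurement is the same or unknown. Otherwise, the weighted average, which incorporates the standard deviation, should be calculated using equation (2) below.

$$\bar{x}_{w} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad w_i = \frac{1}{\sigma_i^{2}} \qquad (2)$$

Here x_i is the i-th observation and \sigma_i is the standard deviation associated with it.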
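As a rough illustration of the mean and the weighted average described above, the sketch below computes both for a small set of hypothetical plant-height measurements. The function names, the data, and the choice of inverse-variance weights (each weight equal to 1 divided by the squared standard deviation of the measurement) are assumptions made for this example.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)


def weighted_mean(values, sds):
    """Weighted average using inverse-variance weights w_i = 1 / sd_i**2."""
    weights = [1.0 / sd ** 2 for sd in sds]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical plant heights (cm) and the standard deviation of each measurement
heights = [90.0, 100.0, 110.0]
sds = [1.0, 2.0, 2.0]

print("mean:", mean(heights))
print("weighted mean:", weighted_mean(heights, sds))
```

The first measurement has the smallest standard deviation, so it pulls the weighted average towards its own value.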

Median:

The median is the middle value of an ordered set of data containing an odd number of values, or the average of the two middle values of an ordered set with an even number of values. The median is especially helpful when separating data into two equal-sized bins.

Standard Deviation and Weighted Standard Deviation:

The standard deviation gives an idea of how close the entire set of data is to the average value. Data sets with a small standard deviation have tightly grouped, precise data. Data sets with large standard deviations have data spread out over a wide range of values.

Simple measures of Variability:

Simple measures of variability are descriptive statistics of two kinds: statistics of location and statistics of dispersion.

Statistics of location: These are also known as measures of central tendency. A statistic of location indicates where a sample lies along the dimension that represents a variable. The mean is the most familiar statistic of location. It is calculated by summing all individual values and dividing the sum by the total number of observations.

Statistics of dispersion: These measure the spread of a frequency distribution. The range, the difference between the highest and the lowest values, is the simplest measure of dispersion. The standard deviation (S.D.) is the most common statistic of dispersion and is defined as the square root of the variance. The variance is the sum of squared deviations of all observations in a sample from its mean, divided by the degrees of freedom. The standard error (S.E.) is a measure of the average difference between sample estimates and the population parameters; it measures uncontrolled variation. When populations differ appreciably in their means, a direct comparison of their standard deviations or variances is not meaningful, because larger individuals vary more than smaller ones. For example, the S.D. of tail length in cows is much greater than the S.D. of tail length in mice.

The coefficient of variation (C.V.) is an important measure of variability that expresses the relative amount of dispersion in different samples. It represents the standard deviation as a percentage of the mean. The C.V. clearly indicates which character is more variable than another. It can be used for the following purposes.

* For comparison of different characters under experimentation.

* For comparing the effects of different mutagen doses on the same character in mutation experiments.

* For evaluating the results of experiments conducted at multiple locations. A large C.V. indicates poor experimental precision.

* In D² statistics, the relative contribution of different traits to the total divergence can be determined from the C.V. of cluster means for each character.
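The coefficient of variation described above can be sketched in a few lines. The trait names and values are hypothetical, and the sample standard deviation (divisor n - 1) is assumed.

```python
import statistics


def coefficient_of_variation(values):
    """C.V. (%) = sample standard deviation / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical measurements of two characters on the same five plants
plant_height_cm = [98, 100, 102, 100, 100]
grains_per_spike = [30, 40, 50, 35, 45]

cv_height = coefficient_of_variation(plant_height_cm)
cv_grains = coefficient_of_variation(grains_per_spike)

print(f"C.V. of plant height: {cv_height:.2f}%")
print(f"C.V. of grains per spike: {cv_grains:.2f}%")
```

Because the C.V. is unit-free, the two characters can be compared directly even though they are measured on different scales. In this made-up data set, grains per spike is the more variable character.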

Degrees of freedom and level of significance are two other important terms in statistical analysis.

Degrees of freedom are the number of independent comparisons available in a data set and are used in the various kinds of analysis of variance. The level of significance expresses the confidence that can be placed in the results of an experiment: at the 5% and 1% levels of significance, if the experiment were repeated 100 times under the same conditions, the same result would be expected in about 95 and 99 of the repetitions, respectively. Significance is tested by the F-test and the t-test.

Chi-square test: Genetic variability in qualitative traits can be assessed with the chi-square (χ²) test. When two pure-breeding plants are crossed, the F1 generation is produced; selfing the F1 gives the F2, in which segregation occurs and variation is at its maximum. Qualitative characters do not have numerical values, so the data consist of the number of individuals in given classes. This type of data is called enumeration data. Such data must be tested to see whether or not they are a good fit to the expected ratio. Chi-square is used to detect differences between the observed frequencies and the expected frequencies. It is applied to qualitative traits such as flower color (red, pink, and white) in the improvement of crop plants.
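A minimal sketch of the goodness-of-fit calculation follows, assuming a hypothetical F2 population of 100 plants tested against a 1:2:1 ratio of red, pink, and white flowers. The counts and the function name are invented for illustration; 5.991 is the tabulated χ² value for 2 degrees of freedom at the 5% level.

```python
def chi_square(observed, ratio):
    """Sum of (observed - expected)^2 / expected over all classes."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical F2 counts of red, pink and white flowered plants
observed = [22, 52, 26]
ratio = [1, 2, 1]  # expected segregation ratio

chi2 = chi_square(observed, ratio)
critical_5pct_2df = 5.991  # table value for (3 classes - 1) = 2 d.f.

print(f"chi-square = {chi2:.2f}")
if chi2 < critical_5pct_2df:
    print("Deviation not significant: the data fit the 1:2:1 ratio.")
else:
    print("Deviation significant: the data do not fit the 1:2:1 ratio.")
```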

Characteristics of Chi-square

* Used for checking the goodness of fit.

* Differs from the binomial test; it is an approximation and becomes unreliable when class frequencies are very small.

* Data from several families can be combined into one large data set.

* Used only with actual counts, never with percentages.

* Data should consist of two or more classes.

Goodness of fit and magnitude of chi-square: If the deviation of the observed values from the expected values is small, χ² approaches zero. If the observed frequencies do not agree with the expected ones, the fit is poor and the value of χ² increases. Chi-square depends on the sample size as well as on the variability within the sample. Its magnitude is affected by three factors: the size of the deviation between observed and expected frequencies, the number of classes that are summed together, and the sample size. For discrete or discontinuous classes a correction, called Yates' correction, is needed to correct for the lack of continuity. Yates' correction is used under the following conditions.

* When the sample is small, so that the expected frequency in a class is 5 or less.

* When the degrees of freedom (D.F.) equal 1.

* When the probability is close to the critical region.

χ² is also used to check homogeneity and heterogeneity in data. Chi-square is additive in nature: the sum of two chi-squares is also a chi-square. A homogeneity test can be performed to check whether different families are true representatives of the same population and whether or not their data can be pooled.

Simple Linear Correlation

Correlation is very important for crop improvement. It deals with independent and dependent variables. Yield is an important trait for crop improvement and is a dependent variable. When there is a time lapse between the measurements of two variables, the variable that is measured first is the independent variable. When an increase in one variable is accompanied by an increase in the other, the correlation is positive; when an increase in one variable is accompanied by a decrease in the other, the correlation is negative. The direction of a relationship tells us that two variables are associated, but not how strongly. The definite measure of the closeness of association between two variables is the coefficient of correlation, denoted by r; its square is called the coefficient of determination. Correlation is sometimes confused with regression. Regression is the amount of change in one variable per unit change in the other variable, and it describes the nature of the relationship. The value of r lies between -1 and 1. The extreme values indicate a perfect linear association between the two variables, while a value of zero indicates no linear association.
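The quantities named above (the correlation coefficient r, the coefficient of determination, and the regression coefficient) can be sketched as follows. The tiller and yield figures are hypothetical and the function names are chosen for this example only.

```python
import math


def sums_of_products(x, y):
    """Return the corrected sums of squares and products (Sxx, Syy, Sxy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy


def pearson_r(x, y):
    """Coefficient of correlation between x and y."""
    sxx, syy, sxy = sums_of_products(x, y)
    return sxy / math.sqrt(sxx * syy)


def regression_coefficient(x, y):
    """Change in y per unit change in x (y is the dependent variable)."""
    sxx, _, sxy = sums_of_products(x, y)
    return sxy / sxx


# Hypothetical data: tillers per plant (independent) and grain yield (dependent)
tillers = [4, 5, 6, 7, 8]
grain_yield = [20, 24, 27, 33, 36]

r = pearson_r(tillers, grain_yield)
b = regression_coefficient(tillers, grain_yield)
print(f"r = {r:.3f}, r squared = {r * r:.3f}, b = {b:.2f}")
```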

In plant breeding, selection for crop improvement is based on phenotypic traits. Considering several traits at once makes the selection procedure complex, but knowledge of the direct and indirect relationships between variables makes it easier to handle. Path analysis measures these associations with the help of a diagram based on cause-and-effect relationships, called a path diagram. The main objective is the effect, and the different traits contributing towards the effect are called causes.

Generation Means Analysis

Weighted least squares generation means analysis, or the joint scaling test, is used to investigate the genetic basis of variation in target traits. Parents that differ widely for the trait are important for this analysis. The tool is more robust than many others because it is based on first-order statistics (means). Scaling of the phenotypic expression gives the expected values of the generations. The additive-dominance model of Mather and Jinks describes the phenotype in terms of the mid-parental value, additive effects, and dominance effects; interaction effects such as additive × dominance are added when this model does not fit.
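The weighted least squares fit behind the joint scaling test can be sketched as below for the three-parameter model (mid-parent value m, additive effect d, dominance effect h) with six generations. The generation means and their variances are invented and were constructed to fit the model exactly, and the coefficients are the usual textbook expectations for P1, P2, F1, F2, BC1, and BC2. The helper names are assumptions of this example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def joint_scaling(coeffs, means, var_of_means):
    """Weighted least squares estimates of m, d, h and the chi-square of fit."""
    w = [1.0 / v for v in var_of_means]  # weight = 1 / variance of the mean
    k = len(coeffs[0])
    xtwx = [[sum(wi * row[i] * row[j] for wi, row in zip(w, coeffs))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(wi * row[i] * y for wi, row, y in zip(w, coeffs, means))
            for i in range(k)]
    estimates = solve(xtwx, xtwy)
    expected = [sum(c * e for c, e in zip(row, estimates)) for row in coeffs]
    chi2 = sum(wi * (y - e) ** 2 for wi, y, e in zip(w, means, expected))
    return estimates, chi2


# Coefficients of m, d, h for P1, P2, F1, F2, BC1, BC2
coeffs = [[1, 1, 0], [1, -1, 0], [1, 0, 1],
          [1, 0, 0.5], [1, 0.5, 0.5], [1, -0.5, 0.5]]
means = [12.0, 8.0, 11.0, 10.5, 11.5, 9.5]           # hypothetical generation means
var_of_means = [0.10, 0.10, 0.10, 0.20, 0.15, 0.15]  # hypothetical variances of means

estimates, chi2 = joint_scaling(coeffs, means, var_of_means)
print("m, d, h =", [round(e, 3) for e in estimates])
print("chi-square (6 - 3 = 3 d.f.) =", round(chi2, 4))
```

With real data, a χ² value that exceeds the table value for 3 degrees of freedom would suggest that the additive-dominance model is inadequate.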

Metroglyph and Index Scoring

The early work of Anderson (1957) proposed the use of the metroglyph and index score to study the pattern of morphological variation in an individual data set. In the early 1970s this method was used to study morphological variation in green gram. The method represents the range of variation for each trait so that the extent of variation is shown by the length of rays on the glyph. The performance of a genotype is judged by the value of its index score. The score value determines the length of the ray, which may be short, medium, or long.

Euclidean distance: Similar to the metroglyph and index score is the Euclidean distance (ED) measurement. Euclidean distance measures the dissimilarity between two genotypes, populations, or individuals from their trait or marker values.

Metroglyph and index-score methods measure genetic distance using morphological traits. Euclidean distance measurements can use both morphological and molecular marker data sets.
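As an illustration, the Euclidean distance between two genotypes is the square root of the sum of squared differences across traits or marker scores. The genotype profiles below are hypothetical; in practice morphological traits are usually standardized first so that no single trait dominates the distance.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical standardized trait values of two genotypes
genotype_1 = [1.0, 2.0, 3.0]
genotype_2 = [4.0, 6.0, 3.0]
print(euclidean_distance(genotype_1, genotype_2))

# Hypothetical presence (1) / absence (0) scores at four marker loci
markers_1 = [1, 0, 1, 1]
markers_2 = [1, 1, 0, 1]
print(euclidean_distance(markers_1, markers_2))
```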

Grouping Techniques in the Analysis of Genetic Diversity

Genetic relationships among and within breeding materials can be identified and classified using multivariate grouping methods. Established multivariate statistical algorithms are important for classifying breeding materials such as germplasm accessions, lines, and races into distinct and variable groups depending on genotype performance. Such groups may be characterized by, for example, disease resistance, early maturity, reduced canopy, or drought resistance. The techniques most widely used, irrespective of the data source (morphological, biochemical, or molecular marker data), are cluster analysis, Principal Component Analysis (PCA), Principal Coordinate Analysis (PCoA), canonical correlation, and Multidimensional Scaling (MDS).

Cluster analysis

Cluster analysis presents patterns of relationships between genotypes as a hierarchy of mutually exclusive groups, such that genotypes with similar descriptions are mathematically gathered into the same cluster. Five methods of cluster analysis are commonly used: the unweighted pair group method using arithmetic averages (UPGMA), the unweighted pair group method using centroids (UPGMC), single linkage (SLCA), complete linkage (CLCA), and median linkage (MLCA). UPGMA and UPGMC have been reported to group breeding materials more accurately in accordance with their pedigrees, and their results are more consistent with known heterotic groups than those of the other methods.
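To make the UPGMA idea concrete, the sketch below repeatedly merges the two clusters with the smallest average pairwise distance. The four genotypes and their distances are hypothetical, and this naive version is meant only to show the logic; it is not an efficient implementation.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the merges as (cluster_1, cluster_2, average distance) tuples.

    dist maps frozenset({a, b}) to the distance between genotypes a and b.
    """
    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical genetic distances among four genotypes
pairs = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
         ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 4}
dist = {frozenset(k): v for k, v in pairs.items()}

merges = upgma(["A", "B", "C", "D"], dist)
for c1, c2, d in merges:
    print(c1, "+", c2, "joined at distance", d)
```

The sequence of merges and their distances is the information that a dendrogram displays.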

Principal component, canonical, and multidimensional scaling analyses are used to derive a two- or three-dimensional scatter plot of individuals such that the geometrical distances among individual genotypes reflect the genetic distances among them. A principal component is a reduced form of the data: a new variable that summarizes the relationships among breeding materials in fewer, more interpretable dimensions. These new variables are uncorrelated with one another.

Principal component analysis first determines the eigenvalues, which give the amount of the total variation accounted for by each component axis. The first three axes are generally expected to explain a large share of the variation among the genotypes. Cluster analysis and principal component analysis can be used jointly to explain the variation in breeding materials in genetic diversity studies.
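The role of the eigenvalues can be illustrated with only two traits, where the eigenvalues of the covariance matrix have a closed form. The data are hypothetical, and the use of the covariance matrix (rather than the correlation matrix) is an assumption of this sketch. Real analyses involve many traits and rely on a numerical library.

```python
import math


def covariance_2traits(x, y):
    """Sample variances of x and y and their covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return sxx, syy, sxy


def eigenvalues_2x2(sxx, syy, sxy):
    """Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]."""
    centre = (sxx + syy) / 2
    half_gap = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return centre + half_gap, centre - half_gap


# Hypothetical values of two traits measured on five genotypes
trait_1 = [1, 2, 3, 4, 5]
trait_2 = [2, 1, 4, 3, 5]

first, second = eigenvalues_2x2(*covariance_2traits(trait_1, trait_2))
share_first = first / (first + second)
print(f"eigenvalues: {first:.2f}, {second:.2f}")
print(f"variation explained by the first component: {share_first:.0%}")
```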

Path Analysis

In plant breeding, selection is practiced on phenotypic traits. Considering several traits makes the selection procedure complex; however, knowledge of the direct and indirect relationships among the variables makes it easier. A diagram can represent the whole system of variables using cause-and-effect relationships. Such a diagram is called a path diagram.

The ultimate objective is the effect, and the various traits contributing towards it are called causes. For example, suppose that wheat grain yield (y) is the effect of components such as number of tillers (m1), grains per spike (m2), and 100-grain weight (m3); these components are the causes, or causal factors.
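One common way to obtain the path coefficients is to solve the simultaneous equations that express each correlation with yield as the direct effect of that cause plus its indirect effects through the other causes. The sketch below does this for the three causes of the wheat example, using invented correlation values and illustrative helper names.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical correlations among tillers (m1), grains per spike (m2)
# and 100-grain weight (m3)
r_causes = [[1.0, 0.2, 0.1],
            [0.2, 1.0, 0.3],
            [0.1, 0.3, 1.0]]
# Hypothetical correlations of m1, m2, m3 with grain yield (y)
r_with_yield = [0.58, 0.46, 0.34]

# Direct effects (path coefficients): r_causes * p = r_with_yield
direct = solve(r_causes, r_with_yield)

# Indirect effect of cause i through cause j is r_ij * p_j
indirect = [[r_causes[i][j] * direct[j] if i != j else 0.0
             for j in range(3)] for i in range(3)]

# Residual effect: variation in yield not explained by the three causes
residual = math.sqrt(1 - sum(p * r for p, r in zip(direct, r_with_yield)))

for i, name in enumerate(["m1", "m2", "m3"]):
    print(f"{name}: direct = {direct[i]:.2f}, "
          f"total indirect = {sum(indirect[i]):.2f}")
print(f"residual effect = {residual:.2f}")
```

For each cause, the direct effect plus the indirect effects adds up to its correlation with yield.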

Metroglyph Analysis

This is a semi-graphic method of studying variability in a large number of germplasm lines at one time. The technique was developed by Anderson (1957) to investigate the pattern of morphological variation in crop species.

It involves the following steps.

Plotting of glyph on graph

A glyph is a small circle that represents the position of a genotype or line on the graph. For plotting the glyphs, two characters with high variability are chosen; one is assigned to the X axis and the other to the Y axis. The mean value of X for each genotype is plotted against its mean value of Y. Thus each line occupies a definite position on the graph.

Depiction of variation

Variation for remaining characters of each genotype is displayed on the respective glyph by rays. Each character occupies a definite ray position. Variation for each character is depicted by the length of ray. Thus the length of ray for a particular character on the glyph may be short or medium or long depending on the index value of a genotype.

Construction of index score

For this purpose the variation for each character is divided into three groups: low, medium, and high. Genotypes with low, medium, and high values are given index scores of 1, 2, and 3, respectively. The worth of a genotype is calculated by adding its index values for all the characters. Thus the maximum and minimum scores of an individual are 3n and n, respectively, where n is the total number of characters included in the study.
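The index-score construction can be sketched as follows. Splitting the range of each character into three equal-width classes is an assumption of this example (the class limits could be set in other ways), and the genotypes and trait values are hypothetical.

```python
def index_scores(values):
    """Score each value 1 (low), 2 (medium) or 3 (high) for one character.

    The range of the character is split into three equal-width classes.
    """
    low, high = min(values), max(values)
    step = (high - low) / 3
    scores = []
    for v in values:
        if v < low + step:
            scores.append(1)
        elif v < low + 2 * step:
            scores.append(2)
        else:
            scores.append(3)
    return scores


genotypes = ["G1", "G2", "G3", "G4"]
# Hypothetical genotype means for three characters (n = 3)
traits = {
    "plant_height": [60, 75, 90, 90],
    "tillers": [3, 9, 5, 6],
    "grain_weight": [30, 45, 41, 36],
}

per_trait = {name: index_scores(vals) for name, vals in traits.items()}
totals = {g: sum(per_trait[t][i] for t in traits)
          for i, g in enumerate(genotypes)}
print(totals)
```

With three characters, every total falls between n = 3 and 3n = 9.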

Analysis of variation

The X and Y axes are each divided into three groups: low, medium, and high. The maximum number of groups or clusters is therefore nine. The variation is analyzed for various traits within and between the groups. Genotypes for use as parents in a hybridization program should be chosen from different groups, so that they represent wide genetic variability.

D² Statistics

The concept of D² statistics was originally developed by P.C. Mahalanobis in 1928. Genetic diversity arises due to geographic separation or due to genetic barriers to crossability. D² analysis involves the following steps.

* Evaluation of the selected genotypes in a field trial and recording of observations on various quantitative characters.

* Estimation of variances for the various characters and of covariances for their combinations.

* Computation of D² values and testing of their significance against the table value of χ² for p degrees of freedom, where p is the total number of characters. If the calculated value of D² is higher than the table value of χ², it is considered significant, and vice versa.

* Finding out the contribution of individual character towards total divergence.

* Grouping of different genotypes into various clusters

* Estimation of average distances at intra-cluster and inter-cluster levels

* Construction of cluster diagram

In D² analysis, a diagram known as the cluster diagram is constructed with the help of D² values. The square roots of the average intra- and inter-cluster D² values are used in its construction. The diagram provides information on the following aspects.

* The number of clusters represents the number of groups into which a population can be classified on the basis of D² analysis.

* The distance between two clusters is a measure of the degree of diversification. The greater the distance between two clusters, the greater the divergence, and vice versa.

* The genotypes falling in the same cluster are more closely related than those belonging to different clusters.
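The computation of a D² value can be illustrated for two genotypes and two characters, where the variance-covariance matrix can be inverted by hand. The means, variances, and covariance are invented; with more characters the same formula applies but a matrix-inversion routine is needed. The value 5.991 is the tabulated χ² for p = 2 degrees of freedom at the 5% level.

```python
def d_square_2traits(mean_a, mean_b, var_1, var_2, cov_12):
    """Mahalanobis D^2 between two genotypes measured for two characters.

    D^2 = d' S^-1 d, where d is the vector of differences in means and
    S = [[var_1, cov_12], [cov_12, var_2]] is the variance-covariance matrix.
    """
    d1 = mean_a[0] - mean_b[0]
    d2 = mean_a[1] - mean_b[1]
    det = var_1 * var_2 - cov_12 ** 2
    return (var_2 * d1 ** 2 - 2 * cov_12 * d1 * d2 + var_1 * d2 ** 2) / det


# Hypothetical genotype means for two characters
genotype_a = [10.0, 20.0]
genotype_b = [12.0, 23.0]
# Hypothetical variances and covariance of the two characters
d2_value = d_square_2traits(genotype_a, genotype_b,
                            var_1=2.0, var_2=2.0, cov_12=1.0)

critical_5pct_2df = 5.991
print(f"D^2 = {d2_value:.3f}")
print("significant" if d2_value > critical_5pct_2df else "not significant")
```

When the characters are uncorrelated and have unit variance, D² reduces to the squared Euclidean distance between the two genotypes.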

References

Smith, J.S.C. 1984. Genetic variability within U.S. hybrid maize: Multivariate analysis of isozyme data. Crop Sci. 24:1041–1046.

Singh, P. and S.S. Narayanan. 2006. Biometrical techniques in plant breeding. 3rd ed. Kalyani Publishers, New Delhi, India.

Sleper, D.A. and J.M. Poehlman. 2006. Breeding field crops. 5th ed. Blackwell Publishing, Iowa, USA.
