Comparison in terms of prediction accuracy of models
Genomic selection (GS) is playing a major role in plant breeding because it allows candidate individuals (animals or plants) to be selected early in time. However, improving GS requires better statistical models. For this reason, in this book chapter we provide an improved version of the Bayesian multiple-trait and multiple-environment (BMTME) model of Montesinos-López et al. that takes into account the correlation between traits (genetic and residual) and between environments, since it allows general covariance matrices. This improved version of the BMTME model was derived using the matrix normal distribution, which makes it easier to derive all the required full conditional distributions and yields a model that is more efficient in terms of implementation time. We tested the proposed model using simulated and real data sets. Our results indicate that the model is considerably faster to implement and that it is better than a Bayesian multiple-trait, multiple-environment model that does not allow a general covariance structure for the traits and environments.
- genomic selection
- multiple-trait and multiple-environment
- general covariance matrices
Genomic selection is revolutionizing plant breeding, since it allows candidate individuals (animals or plants) to be selected early in time. However, the success of genomic selection is linked directly to the statistical models used, because candidate individuals are selected on the basis of model predictions. Most of the models currently used in genomic selection are univariate models, mostly for continuous phenotypes, which do not exploit the existing correlation between traits when individuals (genotypes or animals) are selected with the purpose of improving multiple traits simultaneously. The advantage of jointly modeling multiple traits, compared to analyzing each trait separately, is that the inference process appropriately accounts for the correlation among the traits, which helps to increase prediction accuracy, statistical power, and parameter estimation accuracy, and to reduce trait selection bias [1, 2]. For this reason, plant and animal scientists have great interest in developing appropriate genomic selection models for multiple traits and multiple environments that take advantage of this correlation and improve the prediction accuracy in the selection of candidate individuals.
In this chapter we therefore propose an improved version of the Bayesian multiple-trait, multiple-environment (BMTME) model proposed by Montesinos-López et al. that is appropriate for correlated multiple traits and multiple environments. Instead of building this model using the multivariate normal distribution, we propose to build it using the matrix normal distribution, which avoids the number of rows of the data set growing in proportion to the number of traits under study.
The BMTME model was also improved by adding a general covariance structure for the genetic covariance of environments, in place of the diagonal matrix assumed in the original BMTME model. Additionally, in this chapter we compare the improved model, in terms of prediction accuracy and implementation time, with the original BMTME model of Montesinos-López et al. and with a multiple-trait and multiple-environment model that ignores the correlation between traits and between environments. Our hypothesis is that the improved model should be similar in terms of prediction accuracy to the original BMTME model of Montesinos-López et al. but considerably faster to implement, and a little better in terms of prediction accuracy than a multiple-trait and multiple-environment model that ignores the correlation between traits and environments. We implement the proposed model with simulated and real data sets. Our results suggest that the construction and implementation of the proposed model should be of great help for breeding scientists and programs, since it will help to select candidate genotypes early in time with more accuracy.
2. Material and methods
2.1. Matrix normal distribution
The matrix normal distribution is a probability distribution that generalizes the multivariate normal distribution to matrix-valued random variables. According to Rowe, the
When the covariance matrix
upon using the following matrix identities
Some useful properties of the matrix normal distribution are: the mean and mode are equal to
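As an illustration of the link between the matrix normal and the multivariate normal distribution, the following minimal sketch draws one matrix-valued sample and checks the Kronecker identity that underlies that link. It uses the common parameterization in which Z ~ MN(M, U, V) is equivalent to vec(Z) ~ N(vec(M), V ⊗ U), with U the row covariance and V the column covariance; this notation, the function names and all numerical values are assumptions made for the example and may differ from the notation used in this chapter.

```python
import numpy as np


def sample_matrix_normal(M, U, V, rng):
    """Draw one n x p matrix Z ~ MN(M, U, V).

    U (n x n) is the covariance among rows, V (p x p) the covariance
    among columns. Uses Z = M + A X B^T with U = A A^T and V = B B^T.
    """
    A = np.linalg.cholesky(U)
    B = np.linalg.cholesky(V)
    X = rng.standard_normal(M.shape)  # i.i.d. N(0, 1) entries
    return M + A @ X @ B.T


def vec(Z):
    """Stack the columns of Z into a single vector."""
    return Z.reshape(-1, order="F")


rng = np.random.default_rng(0)
n, p = 4, 3  # hypothetical sizes, e.g. 4 lines and 3 traits
M = np.zeros((n, p))
U = 0.5 * np.eye(n) + 0.5  # hypothetical row covariance
V = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])  # hypothetical column covariance

Z = sample_matrix_normal(M, U, V, rng)  # stays n x p, never n*p rows

# Check vec(A X B^T) = (B kron A) vec(X), which gives Cov(vec(Z)) = V kron U.
A = np.linalg.cholesky(U)
B = np.linalg.cholesky(V)
X = rng.standard_normal((n, p))
identity_holds = np.allclose(vec(A @ X @ B.T), np.kron(B, A) @ vec(X))
implied_cov = np.kron(B, A) @ np.kron(B, A).T  # equals np.kron(V, U)
```

The sample is generated with two small Cholesky factors (n × n and p × p) instead of one factor of size np × np, which is the practical reason a matrix-variate formulation can be cheaper than stacking all traits into one long vector.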
2.2. Univariate model with genotype by environment interaction (M1)
First, for each trait we considered the following univariate linear mixed model:
2.3. Multivariate correlated model with multiple-trait and multiple-environment (M2)
To account for the correlation between traits, all of the
2.4. Joint posterior density and prior specification
In this section, we provide the joint posterior density and prior specification for the improved BMTME model. Assuming independent prior distributions for
2.5. Gibbs sampler
In order to produce posterior means for all relevant model parameters, below we outline the exact Gibbs sampler procedure that we proposed for estimating the parameters of interest. The ordering of draws is somewhat arbitrary; however, we suggest the following order:
Step 1. Simulate
Step 2. Simulate
Step 3. Simulate
Step 4. Simulate
Step 5. Simulate
Step 6. Return to step 1 or terminate when chain length is adequate to meet convergence diagnostics.
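The cycle described in the steps above (draw each block of parameters from its full conditional, then return to the first step) can be sketched on a deliberately simple toy model. The example below is not the sampler of the improved BMTME model: it assumes y_i ~ N(mu, sigma2) with a N(0, tau2) prior on mu and an inverse gamma(a, b) prior on sigma2, and the function name, priors and hyperparameter values are all hypothetical choices made only to show the loop structure.

```python
import numpy as np


def gibbs_normal(y, n_iter=2000, burn_in=500, tau2=100.0, a=2.0, b=1.0, seed=0):
    """Toy Gibbs sampler for y_i ~ N(mu, sigma2).

    Priors (assumed for this sketch): mu ~ N(0, tau2),
    sigma2 ~ inverse gamma(shape=a, scale=b).
    Returns the post-burn-in draws as an array with columns (mu, sigma2).
    """
    rng = np.random.default_rng(seed)
    n, ybar = y.size, y.mean()
    mu, sigma2 = 0.0, 1.0  # starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Draw mu from its normal full conditional given sigma2.
        precision = n / sigma2 + 1.0 / tau2
        mean = (n * ybar / sigma2) / precision
        mu = rng.normal(mean, np.sqrt(1.0 / precision))
        # Draw sigma2 from its inverse gamma full conditional given mu:
        # if G ~ Gamma(shape, rate) then 1/G ~ inverse gamma(shape, rate).
        shape = a + 0.5 * n
        rate = b + 0.5 * np.sum((y - mu) ** 2)
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        draws[t] = mu, sigma2
    return draws[burn_in:]


# Simulated toy data with true mean 2 and true variance 1.
y = np.random.default_rng(1).normal(2.0, 1.0, size=200)
post = gibbs_normal(y)
post_mean_mu, post_mean_sigma2 = post.mean(axis=0)
```

Posterior means are obtained by averaging the retained draws, as in `post.mean(axis=0)`. In practice the chain length and burn-in would be chosen with convergence diagnostics rather than fixed in advance as they are here.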
2.6. Multivariate uncorrelated model with multiple-trait and multiple-environment (M3)
To compare the model given in Eq. (7) we considered also model
2.7. Experimental data sets
2.7.1. Simulate data sets
To test the proposed models and methods, we simulated multiple-trait and multiple-environment data using the model in Eq. (7). We studied six scenarios that differ in the parameters used. For the first scenario (S1) we used the following parameters: three environments, three traits, 80 genotypes, and one replication for each environment-trait-genotype combination. We assumed that
2.7.2. Real wheat data set
Here, we present the information on the first real data set used for implementing the proposed models. This real data set is composed of 250 wheat lines that were extracted from a large set of 39 yield trials grown during the 2013–2014 crop season in Ciudad Obregon, Sonora, Mexico. The traits under study were days to heading (DTHD), grain yield (GRYLD), plant height (PTHT) and the green normalized difference vegetation index (GNDVI); each of these traits was evaluated in three environments (Bed2IR, Bed5IR and Drip). After editing, 12,083 markers were available. This data set was also used by Montesinos-López et al.; readers interested in more details of this data set should consult that publication.
2.7.3. Real maize data set
The second real data set used for implementing the proposed models is composed of 309 double-haploid maize lines. Traits available in this data set include grain yield (Yield), anthesis-silking interval (ASI), and plant height (PH); each of these traits was evaluated in three optimum rainfed environments (EBU, KAT, and KTI). After editing, 12,083 markers were available. This data set was also used by Montesinos-López et al.; readers interested in more details of this data set should consult that publication.
2.8. Assessing prediction accuracy
To assess prediction accuracy for the simulated and real data sets, 20 random training (trn)-testing (tst) partitions were implemented under a cross-validation scheme that mimicked a situation where lines were evaluated in some environments for the traits of interest but were missing for all traits in the other environments; this cross-validation scheme is called CV1. Under this cross-validation, we assigned 80% of the lines to the trn set and the remaining 20% to the tst set. We used the Pearson correlation and the mean square error of prediction (MSEP) to compare the predictive performance of the proposed models. A Pearson correlation closer to one indicates better predictions, while an MSEP closer to zero indicates better prediction accuracy. It is important to point out that model
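The partitioning and the two accuracy measures described above can be sketched as follows. This is a simplified illustration for a single trait in a single environment: it reproduces only the 20 random 80%/20% splits of lines, the Pearson correlation and the MSEP, not the full CV1 layout across environments. The ridge regression predictor, the simulated markers and all names are hypothetical stand-ins for the models compared in this chapter.

```python
import numpy as np


def pearson(y, yhat):
    """Pearson correlation between observed and predicted values."""
    return np.corrcoef(y, yhat)[0, 1]


def msep(y, yhat):
    """Mean square error of prediction."""
    return np.mean((y - yhat) ** 2)


def random_partitions(n_lines, n_partitions=20, test_fraction=0.2, seed=0):
    """Yield (trn, tst) index arrays for random partitions of the lines."""
    rng = np.random.default_rng(seed)
    n_tst = int(round(test_fraction * n_lines))
    for _ in range(n_partitions):
        perm = rng.permutation(n_lines)
        yield perm[n_tst:], perm[:n_tst]


def ridge_predict(X, y, trn, tst, lam=1.0):
    """Placeholder predictor: ridge regression fitted on the trn lines."""
    Xt = X[trn]
    beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(X.shape[1]), Xt.T @ y[trn])
    return X[tst] @ beta


# Hypothetical data: 250 lines, 50 markers, one trait in one environment.
rng = np.random.default_rng(42)
X = rng.standard_normal((250, 50))
y = X @ rng.standard_normal(50) + rng.standard_normal(250)

partitions = list(random_partitions(y.size))
scores = np.array([
    (pearson(y[tst], ridge_predict(X, y, trn, tst)),
     msep(y[tst], ridge_predict(X, y, trn, tst)))
    for trn, tst in partitions
])
mean_cor, mean_msep = scores.mean(axis=0)  # averages over the 20 partitions
```

Averaging over the partitions gives one correlation and one MSEP per model, which is the form in which competing models are usually compared.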
The results are presented in two sections. The first section presents the results for the simulated data sets, while the second presents the results for the real data sets.
3.1. Simulated data sets
In Table 1, under scenario S1 we can observe that the proposed model
In Table 2, under scenario S4 model
3.2. Real data sets
In Table 3 we can observe that in the wheat data set the best predictions were observed under the proposed improved BMTME model (
According to the results observed with the simulated data sets (Tables 1 and 2) and real data sets (Table 3) there is evidence that the larger the correlation between traits (genetic and residual) and environments (genetic) the better the performance of the proposed improved BMTME (
In this chapter we proposed an improved version of the Bayesian multiple-trait multiple-environment (BMTME) model of Montesinos-López et al. that was derived using the matrix normal distribution. The advantage of the proposed model (
In the simplification of some calculations the following properties were involved: