Big Data Supervised Pairwise Ortholog Detection in Yeasts Big Data Supervised Pairwise Ortholog Detection in Yeasts

Ortholog are genes in different species, evolving from a common ancestor. Ortholog detection is essential to study phylogenies and to predict the function of unknown genes. The scalability of gene (or protein) pairwise comparisons and that of the classification process constitutes a challenge due to the ever-increasing amount of sequenced genomes. Ortholog detection algorithms, just based on sequence similarity, tend to fail in classifica- tion, specifically, in Saccharomycete yeasts with rampant paralogies and gene losses. In this book chapter, a new classification approach has been proposed based on the combination of pairwise similarity measures in a decision system that consider the extreme imbalance between ortholog and non-ortholog pairs. Some new gene pair similarity measures are defined based on protein physicochemical profiles, gene pair membership to conserved regions in related genomes, and protein lengths. The efficiency and scalability of the calculation of these measures are analyzed to propose its implementation for big data. In conclusion, evaluated supervised algorithms that manage big and imbalanced data showed high effectiveness in Saccharomycete yeast genomes.


Introduction
Orthologs are genes in different species evolving from a common ancestor while paralogous genes are homologous sequences that evolved in a duplication event derived from a unique sequence [1]. The distinction between orthologs and paralogs is crucial since ortholog genes are considered evolution markers and they help to infer the function of unknown proteins.
The orthology relationship is expressed in terms of pairs with one-to-one relations, or of groups with one-to-many or many-to-many relationships. Thus, the scalability of both gene/ protein pairwise comparisons and the classification process constitute a challenge due to the ever-increasing amount of sequenced genomes. The unsupervised ortholog classification problem is presented in [2] starting from the pairwise sequence comparison. Many algorithms built a similarity graph from these comparisons such as the Reciprocal Best Hit (RBH) heuristic [3], Inparanoid [4], OrthoMCL [5], the Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data (OMA) [6], the well-known Reciprocal Smallest Distance (RSD) algorithm [7], among others. In this unsupervised learning approach there are also tree-based and hybrid algorithms [8]. On the other hand, [9], a supervised approach form pairwise classification approach is presented where there is no evidence of the imbalance management of the scarce ortholog pairs (minority class) among the majority of non-ortholog pairs. Besides, this approach does not couple with the generalization problem of the built model to external genome pair is not involved in the learning process.
An important issue to tackle with ortholog detection (OD) algorithms is the underlying information of gene/protein features used to classify the sequences. Algorithms just based on sequence similarity tend to fail in classification because OD is negatively affected by genetic and evolutionary events like duplications and gene losses, or changes between genome segments such as duplications, deletions, horizontal gene transfers, fusions, fissions, inversions, and transpositions. Although the high potentiality of the OD methods referenced either in the "omics" 1 tools site or the site of the Quest for Orthologs 2 Consortium, some of their limitations are reported in literature: • They tend to include paralogs in the orthogroups increasing false positives [8]. In the presence of lots of gene losses caused by whole genome duplications (WGD) in Saccharomycete yeasts, a big amount of paralogs can be included in orthogroups, or in orthologous pairs [10].
• They can reduce specificity due to the presence of short sequences or those that is evolved in a convergent way. This kind of affection may arise when genes inherited by horizontal transfers brought about failure in the inference of near phylogenetic relationships between species that are distantly related and have recently exchanged a gene [11].
• They can fail in ortholog detection mainly in the twilight zone (less than the 30% of sequence identity), that is, they can raise false negatives in the presence of distantly related divergent sequences. They can also fail in front of mosaics of proteins [12] since orthogroups not considering hybrid proteins may contain proteins without a common ancestor.
With the aim of reducing these limitations and improving efficacy in ortholog detection, there is a tendency in algorithms to include diverse gene information in addition to the alignment-based features (AB). Some ortholog detection algorithms include information about: (i) synteny [13][14][15], (ii) global rearrangements [16,17], (iii) protein interactions [9], (iv) the architecture of protein mosaics [18], and (v) evolution distances [6,7]. So far, these attempts have not been enough for increasing efficacy in classification [19]. The combination of AB with other features like the physicochemical profiles, the length of the sequences, and the synteny may be positive for pairwise ortholog detection as well as the learning process from curated classifications in benchmark datasets may allow for a more effective approach.
In order to evaluate OD algorithms, Salichos and Rokas [10] constructed a Saccharomycete yeast benchmark dataset having orthogroups deprived of paralogs. Such orthogroups contain "curated or true" orthologs from yeast species that underwent pre-and post-whole genome duplications (WGD) [10]. Actually, when Multiparanoid [20], OrthoMCL, and extended versions of RBH and the RSD were evaluated using this benchmark dataset, they included paralogs in the orthogroups [10]. This fact assures that OD is a bioinformatics field in constant need of new effective algorithms mainly for yeasts.
In terms of efficiency, OD algorithms based on sequence alignments may take from weeks to years of CPU time to compute orthologs of sequenced species, due to the quadratic scaling capacity of these comparisons. That is why in the Quest for Orthologs Consortium meeting [19], efficiency and management of big data were target questions, and some efforts in these senses were presented. They stated that the increased demand of computation for sequence analysis has not been achieved by means of the increasing computation capacity, but it can be achieved with new approaches or algorithm implementations.
In this chapter, a new classification approach has been proposed based on the combination of pairwise protein features in a supervised decision system that consider the extreme imbalance between ortholog and non-ortholog pairs on the benchmark dataset proposed by [10]. Some new gene pair similarity measures considered as pairwise protein features are defined based on protein physicochemical profiles, their membership to conserved regions in related genomes, and their lengths. The efficiency and scalability of the calculation of these measures are analyzed to propose its implementation for big data. In conclusion, evaluated supervised algorithms that manage big and imbalanced data in scalable machine learning libraries as Mahout [21] and MLlib [22] overdid RBH, RSD, and the Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data (OMA) [6] in Saccharomycete yeast genomes.

Yeast genomes in experiments
Yeast genomes selected for the experiments are well-studied eukaryote species that shared a common ancestor with human millions of years ago. Saccharomyces cerevisiae belongs to Saccharomycete yeasts and Schizosaccharomyces pombe is a unicellular organism from Archiascomycete fungus. These species bring about experimental models for many essential processes of eukaryotes since most of their genes have been predicted as homologs to multicellular eukaryotes ones. Specifically, S. pombe genes are more similar to mammals' genes than S. cerevisiae ones [23]. However, S. pombe is a distant relative of S. cerevisiae, accordingly, ortholog detection between them is a difficult task due to the divergence of sequences [13]. On the other hand, other close Saccharomycete species as Kluyveromyces (Kluyveromyces lactis and Kluyveromyces waltii) and Candida glabrata have been included in one of the ortholog detection benchmarking datasets [10] since some of them underwent a whole genome duplication (WGD) process causing a rampant paralogy and a lot of gene losses. Precisely, the genome pair of C. glabrata and S. cerevisiae that are post-WGD species constitutes a target study case for OD algorithms. For the studies reported below we have identified each genome pair as follows: • S. cerevisiae -K. lactis: ScerKlac

Similarity measures as gene pair features in ortholog detection
Five normalized pairwise sequence similarities (global and local alignment scores, protein length, synteny and physicochemical profile) were previously specified in [24]. Table 1 Measure Description of the calculation S 1local and global alignment The sequence alignment measure averages positives local and global protein alignment scores from the Smith Waterman [32] and the Needleman-Wunsch [33] algorithms calculated with a specified scoring matrix, and "gap open" (GOP) and "gap extended" (GEP) parameters. The scores are normalized by using the maximum score of all (x i , y j ) protein pairs S 2protein length Measure S 2 represents the normalized difference for continuous values [34] of the protein sequence lengths S 3membership to locally collinear blocks This measure is calculated from the distance between pairs of sequences considering their membership to Locally Collinear Blocks (LCBs) which represent truly homologous regions obtained with the Mauve software [35]. The normalized difference is selected for the comparison of the total number of codons in each block This measure is based on the spectral representation of sequences from the global protein pairwise alignment by using the linear predictive coding [34]. First, each amino acid that lies in a matching region without "gaps" between two aligned sequences is replaced by its contact energy [36]. The moving average for each spectrum, that is, the average of this physicochemical feature in the predefined window size W is then calculated. Next, the Pearson correlation coefficient with the corresponding significance level between the two spectral representations in a matching region is obtained. Finally, the significant similarities of the regions without "gaps" are aggregated considering the length of each region. From our previous studies presented in [37,38], three features for the physicochemical profile with W values of 3, 5 and 7 have been considered summarizes the calculations for protein pairs (x i ,y j ) from the two proteome representations P 1 ={x 1 , x 2 ,…, x n } and P 2 ={y 1 , y 2 ,…, y m } with n and m sequences, respectively.

From parallel to big data calculation of similarity measures
Time complexity analysis of the sequential algorithms for the calculation of pairwise similarity measures in terms of N gene/protein pairs has brought about the need of a parallel version with the scalability target. The protein physicochemical profile turns to be the one with the highest quadratic cost O(NÂ(m + n) 2 ) since it operates over the aligned sequences with m and n as the maximum sequence lengths for two genomes, respectively. The parallel proposal is sketched in Figure 1 with a fork control flow to distribute the calculation over different processors and a join control flow to terminate the parallel executions.
For the scalability analysis of the parallel physicochemical profile calculation, we estimated the T S sequential execution time in expression (1) by using constants.
The constant t c indicates the required time for the arithmetic operations. We assume t c as optimum value, t c = 1 [25]. Consequently, the parallel execution time T P is defined in terms of the calculation time and the communication time among processors. In the parallel version, calculation time depends on the distribution of N cycles among P processors. With this distribution: (i) all the processors execute ⌊N/P⌋ iterations or (ii) p < P processors execute ⌊N/P⌋+1 iterations. Thus, when we consider the worst case, the calculation time T Calc can be estimated as in expression (2).
The communication time among processors is directly related to the information accessed in the sequential section of the algorithm. In each iteration, each processor P needs to receive sequence information. Hence, communication time can be defined as in expression (3).
where t s represents the time required to establish the communication and to prepare the sending information and t w is the required time to send a numeric value. These values can be considered as optimum ones, that is, t s = 0 and t w = 1 [25]. In this way, we combine expressions (2) and (3) to obtain an estimation of the parallel execution time for the physicochemical profile as in expression (4).
Time complexity of the parallel version of the algorithm can be estimated from the parallel execution time T P . Expression (⌊N/P⌋+1) can be taken as N/P.

It holds that mÂn
that is lower than O(NÂ(m+n)2) as demonstrated in expression (5).
By using expressions T S and T P , we can analyze the speed-up and the efficiency performance for different samples obtained from N protein pairs and a range of 1-100 processors. Figure 2A and B shows the speed-up and efficiency values, respectively, in a genome comparison dataset of S. cerevisiae and S. pombe. The 100% sample represents 5006 sequence pairs; the 75% sample represents 3754 pairs, the 50% represents 2503 pairs, and the 25% represents 1251 pairs. An increased number of processors lead to a saturation in the speed-up. Specifically, when the problem size increase, thus the speed-up and the efficiency values increase for a single number of processors, although both values tend to decrease with an increasing number of processors. The efficiency tends to keep constant when we simultaneously increase both the problem size and the number of processors, consequently, contributing to the scalable conception of the algorithm [26].
With the sequential and parallel MATLAB (2010) implementations, we calculated real execution time for one, two, three and four processors in a personal computer Intel ® Core™ i3 CPU, 2.53 GHz, RAM DDR3 de 4.0 Gb, 64Bits Windows7. With 1921 pairs of S. cerevisiae and S. pombe representing 100% of the sample, Figure 3A and B shows the real and estimated values of the speed-up and efficiency, respectively. We can observe a similar performance of estimated and real values for the total and 50% of the sample. The best speed-up was obtained for the execution in four processors, but the efficiency is best with two processors.   Theoretically, the speed-up value is bounded by the number of processors and the efficiency value is in the interval [0, 1]. However, the real speed-up obtained is higher than the number of processors in each execution and the efficiency is higher than one. This event is known as super-linear speed-up. It is a consequence of the fact that each processor consume less time than TS P . It can be possible since the parallel version may execute less work than the sequential version or may take advantage of resources such as the cache memory [25].
In a further scalability analysis with 133,666,445 pairs of sequences from different yeast genomes, the increased number of processors from 1 to 100 saturates the speed-up. Table 2 shows the genome data used for this scalability analysis. Datasets are conformed by accumulating gene pairs from different genome pairs with different maximum sequence lengths. Figure 4A illustrates the effects of the gene pair accumulation in the speed-up of the parallel physicochemical profile similarity measure calculation while Figure 4B depicted the effect in  Table 2. Genome data used in the scalability analysis. the efficiency of this calculation. When the problem size increases, the speed-up and the efficiency increase for a specific number of processors, although both values tend to decrease when the number of processors increases. Efficiency tends to be constant when both the problem size and the number of processors increase. This is an essential issue to achieve scalability, specifically, the horizontal one.
In sum, the parallel algorithm time complexity considering P processors is lower than the cost of the sequential version and the scalability may be achieved with the parallel proposal. Nevertheless, considering implementation hazards of the parallel models, scalability can be improved when these models are executed in a cloud within a big data framework as MapReduce [27,28]. This framework guarantees reliability, availability, and a good execution performance in a cloud with a distributed file system.
In a MapReduce design, each mapper process should build a subset of the resulting dataset with calculated features of all protein pairs of its partition. Figure 5 shows three steps such as Initial, Map and Final. The Initial phase divides the protein pair set into independent Hadoop distributed file system (HDFS) blocks; it also duplicates and transfers them to other nodes of the cloud. Then, in the Map step, each mapper calculates the protein features of all the pairs in its partition. Finally, in the Final step, the files created by each mapper would be concatenated to build the final dataset file. In this procedure the Reduce step is omitted since the mappers output are directly combined following the operation reported in [29].
Algorithm 1 shows el pseudo-code of the Map function in the MapReduce process for the protein feature calculations.
Step 2 calculates the features for a pair of proteins and Step 5 save the previously calculated data.

Big data supervised classification with imbalance management
Given a set A={S r (x i , y j )} of gene pair features as discrete or continuous values of r gene pair similarity measure functions, previously specified, we represent a pairwise ortholog detection ∀y j ∈G 2 is the universe of the gene pairs,and d∉A is the binary decision feature obtained from a curated classification. This decision feature defines the low ratio of orthologs to the total number of possible gene pairs. The big data supervised classification divides DS into train and test sets to build a learning/training model and to classify the instances by means of a big data supervised algorithm managing the imbalance between classes. The built model can be generalized to related pairs of genomes/yeast species not used in the learning/training process following the generalization concept in [30]. Thus, the test set is used to compare both supervised and classical unsupervised algorithms. Algorithm 2 shows the steps required in the training phase as well as Algorithm 3 shows the classification phase. The general testing/evaluation scheme for supervised and unsupervised algorithms is sketched in Figure 6 based on the use of a testing external set as reported in [31].
Training phase The general scheme receives annotated genomes (proteomes) from related yeast species since some protein features as the membership to conserved regions require certain evolution closeness. For the evaluation of pairwise ortholog detection algorithms, we can compare the supervised solutions and the unsupervised reference algorithms such as RBH, RSD, and OMA. Firstly, protein pairs are separated into train and test sets and after that pairwise similarity measures for the pairs of both sets are calculated. Test sets are used for the assessment of the unsupervised algorithms while the training set is used to build the supervised models which will be further tested on the previously mentioned test set.
The classification performance will be assessed by evaluation metrics for imbalanced datasets. The Geometric Mean (G-Mean) is defined in [32] as: where sensitivity and specificity are traditionally calculated considering the true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). The area under the curve (AUC) [33] is estimated from plotting of true positive rate (TPR) versus false positive rate (FPR) values. We approximate the AUC by averaging (TPR) and (FPR) values through the following equation: where TPR ¼ TP TPþFN represents the percentage of orthologs (TP) correctly classified and FPR ¼ FP FPþTN , the percentage of non-orthologs (TN) misclassified. G-Mean measure maximizes the accuracy of the two classes (orthologs and non-orthologs) by considering the misclassification costs and AUC values in the estimation of the sensitivity and specificity, getting at a balance between them [34].

Settings
For the evaluation of pairwise ortholog detection algorithms in related yeast genomes we carried out (i) three exploratory experiments for an external validation of supervised classification models and (ii) an experiment to evaluate a model built with a pair of yeasts into two external pairs. In Experiment 1, we evaluated the algorithms inside a genome pair by partitioning the complete set of protein pairs at random (75% for training and 25% for testing). Specifically, we divided the S. cerevisiae -K. lactis dataset into 16,986,996 pairs for training and 5,662,332 pairs for testing. In Experiment 2, we evaluated the model in the first experiment into 8,095,907 pairs of S. cerevisiae and S. pombe genomes. In Experiment 3, we tested in S. cerevisiae and S. pombe as in the second experiment but with models obtained by training with the complete S. cerevisiae -K. lactis dataset (22,649,328 pairs). In Experiment 4, we used the model in the third experiment and tested it in two different pairs: 29,887,416 pairs of S. cerevisiae and C. glabrata and 20,318,472 pairs of C. glabrata -K. lactis.
The details for the data used in the experiments are specified in Table 3. For each genome pair we built four datasets (BLOSUM50, BLOSUM62_1, BLOSUM 62_2, and PAM250) from combinations of alignment parameter settings shown in Table 4. The gene pair features included in the datasets are the average of local and global alignment similarity measures, the length of sequences, the gene membership to conserved regions (synteny), and the physicochemical  Table 3. Genome data used in the experiments. profiles within 3, 5, and 7 window sizes. We are aggregating global and local alignment similarities to combine structural and functional similarities of proteins.
The S. cerevisiae -S. pombe dataset represents the union of Inparanoid7.0 and GeneDB ortholog classifications as is described in [35]. On the other hand, the S. cerevisiae -K. lactis and S. cerevisiae -C. glabrata datasets contain all ortholog pairs in the gold groups reported in [10]. At the time of building the subset with all possible pairs, we just excluded 89 genes from S. cerevisiae, 37 from C. glabrata, and 1403 from K. lactis because the genome physical location data in the YGOB database [36], required for the LCB feature calculation, was not found.
The amount of pairs in each dataset prevented traditional machine learning methods from training and testing, so big data implementations were selected from scalable MapReduce Mahout [21] and Spark MLlib [37] libraries. The selected algorithms for the evaluation are listed in Table 5. Explicitly, we selected the random oversampling (ROS) and the cost-sensitive approaches for imbalance management. Besides, the big data framework details are specified in Table 6. On the other hand, the unsupervised algorithms parameter values used for the evaluation are specified in Table 7.
The infrastructure used to perform the big data experiments consists of 20 nodes connected via a 40 GB/s Infiniband network. Each node has two Intel Xeon E5-2620 microprocessors (at 2 GHz, 15 MB cache) and 64 GB of main memory working under Linux CentOS 6.5. The head node of the cluster has two Intel Xeon E5-2620 microprocessors (at 2.00 GHz, 15 MB cache) and 32 GB of main memory.

Results
Supervised classifiers that manage big data were compared from two different points of view: i. The efficacy and the execution time when the training process is performed with one partition of the genome pair dataset or with the complete dataset. ii.
The efficacy and the execution time in the classification process performed on datasets with potential fails for ortholog detection: • The one of related species (distant relatives), S. cerevisiae and S. pombe, false negatives can increase in the presence of divergent sequences.
• Other two Saccharomycete yeast datasets which complexity is a possible increased number of false positives due to lots of paralog losses produced by the WGD. Supervised and unsupervised classifiers are compared in terms of the efficacy in complex Saccharomycete yeast datasets.
Best mean results of G-Mean, AUC, TPR, and TNR (true negative rate) in Experiments 1, 2, and 3 are shown in Tables 8-10 Table 6. Big data framework, applications and algorithms.    Table 8. Best mean G-Mean, AUC, TPR and TNR results in Experiment 1 in S. cerevisiae -K. lactis. Highlighted TPR and TNR with the best stability in these measures.    Table 9. Best mean G-Mean, AUC, TPR and TNR results in Experiment 2 in S. cerevisiae -S. pombe. Highlighted TPR and TNR with the best stability in these measures.
when we use the four alignment parameter value combinations in Table 4. Similarly, good efficacy is obtained when we trained with a partition of S. cerevisiae -K. lactis and with the corresponding complete training set.
The Analyzing measure values of Experiment 4 we noticed that the performance of supervised classifiers is almost the same as in the previous experiments in terms of the stability in the four different alignment parameter value datasets. Best mean results of G-Mean and AUC, TPR and TNR for supervised and unsupervised classifiers are shown in Table 11. The underlined values highlight the most effective method in the experiment while the bold face values identify the best performing supervised and unsupervised algorithms. Supervised Random Forest classifiers managing data imbalance overdid the unsupervised ones, specifically, when the costsensitive version is applied. Among the unsupervised classifiers, RSD reaches the highest G-Mean (0.9374) using the recommended parameter values (E-value = 1eÀ05 and α = 0.8) in [38]), higher values were also obtained for AUC and TPR. On the contrary, OMA was the best among the unsupervised algorithms in S. cerevisiae -S. pombe datasets.
The results of the Friedman test [39] for the AUC measure with the four datasets of S. cerevisiae -C. glabrata, C. glabrata -K. lactis, and S. cerevisiae -S. pombe and the supervised classifiers that manage imbalance are shown in Table 12 (first column). The test yielded significant differences among classifiers, being RF-BDCS the one with the high mean rank followed by the classifiers based on Random Forest with ROS preprocessing and the SVM classifier with ROS (RS:100) and 0.5 regulation. In the Friedman test for G-Mean (  Table 11. Best mean G-Mean, AUC, TPR and TNR results in Experiment 4 in S. cerevisiae -C. glabrata and C. glabrata -K. lactis. Highlighted TPR and TNR with the best stability in these measures. differences with higher mean rank values for RF-BDCS and ROS (RS: 100%) + SVM-BD (regParam: 1.0). When we applied the same test to compare execution time ( Table 12 third column) there are significant differences among the algorithms. In particular, SVM classifiers combined with ROS exhibited the best mean ranks.

Discussion
As a general result, experiments showed that supervised classifiers changed only slightly with the selection of different alignment parameters, maybe because of the appropriate combination of scoring matrices and gap penalties in relation to the sequence diversity between the two yeast genomes. The four scoring matrices assessed have been previously recommended to detect homologs in a wide range of amino acid identities. Moreover, gap penalty settings were not low enough to affect the sensitivity of the alignment [40].
The ROS pre-processing method for big data made SVM effective for pairwise ortholog detection and improved the performance of Random Forest for big data even more with a higher value for the resampling size parameter of 130% [41]. Conversely, the experiments showed that the variation in this parameter value from 100 to 130% did not significantly influence on the performance of the SVM big data classifier with different regulation values.
The cost-sensitive classifier RF-BDCS showed the best performance in S. cerevisiae -C. glabrata, C. glabrata -K. lactis and S. cerevisiae -K. lactis because probably it improved the training from the minority class. The best tree split was selected following the misclassification costs assigned to the instances; such costs were also considered to associate certain class to a leaf [29]. This cost treatment does not imply changes in the sample distribution, and avoids possible overfitting that is commonly found in ROS solutions due to the presenceof duplicated instances. The setting of the cost values ((C(+|À)=IR) and C(À|+)=1)) can also influence on the success of the algorithm.
In the case of SVM for big data classifier, the fixed regularization parameter defines the tradeoff between the goal of minimizing the training error and minimizing the model complexity to avoid overfitting. The higher is its value, the simpler the model, however, a better performance in classification may be achieved by setting an intermediate regulation value, or one close to cero [37]. Specifically, the ROS (RS: 100%) + SVM-BD (regParam: 0.5) classifier exhibited the best AUC and G-Mean values in S. cerevisiae -S. pombe, and the best balance between TPR and TNR in the rest of the datasets.
All methodologies including the proposed supervised big data approach generally declined their performance for ortholog detection in S. cerevisiae -S. pombe datasets, probably because of S. pombe is a distant relative of S. cerevisiae [23]. The supervised classifiers performance was also negatively affected by differences in data distribution between the train and test sets [42]. On the contrary, ROS (RS: 100%) + SVM-BD (regParam: 0.5) remained stable in S. cerevisiae -C. glabrata, C. glabrata -K. lactis and S. cerevisiae -S. pombe datasets when considering the balance between TP Rate and TN Rate . Best results obtained in S. cerevisiae -C. glabrata are outstanding where algorithms are vulnerable to produce false positives since both genomes underwent a WGD and a subsequent differential loss of gene duplicates [10].
The initial assumption of RBH, RSD and OMA that the sequences of orthologous genes/proteins are more similar to each other than they are to any other genes from the compared organisms may produce classification errors [12]. The reduced quality shown by RBH, RSD and OMA, mainly in the case of RBH, could be caused by this assumption despite that BLAST parameters can be tuned as has been recommended in [43]. In particular, RBH infer orthology relationships simply based on reciprocal BLAST best hits. In contrast, the RSD procedure is less likely than RBH to be misled by existing close paralogs since it relies on both global sequence alignment and maximum likelihood estimation of evolutionary distances finding many putative orthologs missed by RBH. The OMA algorithm also displays advantages over RBH by using evolutionary distances instead of alignment scores. It allows the inclusion of one-to-many and many-to-many orthologs considering the uncertainty in distance estimations and detects potential differential gene losses.
On the other hand, the success of big data supervised classifiers managing imbalance over RSD and OMA may be explained by feature combinations together with the learning from curated classifications. The assembling of alignment measures with the comparison of sequence lengths, the membership of genes to conserved regions (synteny) and the physicochemical profiles of amino acids improved the detection of homology and certainly the supervised classification results on the test sets, even if both species underwent WGD. Specifically, the aggregation of global and local alignment scores allows us to combine protein structural and functional relationships between sequence pairs, respectively. The periodicity of the physicochemical properties of amino acids can detect similarity among protein pairs with sequences having functional similarities despite their low amino acid sequence identities (<35%). These sequences may affect ortholog detection in S. cerevisiae -S. pombe which are moderately related and their orthologs may be diverged. The synteny information lets us to consider that genes belonging to the same conserved segment in genomes of different species will probably be orthologs. Besides, the length of sequences as the relative positions of amino acids within the same protein in different species and in duplicated regions within the same species may also contribute to the enhanced supervised classification results.

Conclusions
The combination of alignment measures with other protein pair features such as the sequence lengths, the gene membership to conserved regions, and the physicochemical profiles has complemented the homology detection in the proposed supervised approach for pairwise ortholog detection. Such combined features alongside curated orthologs pairs extracted from a curated dataset have led to an effective and efficient ortholog classification method in a big data scenario with the treatment of the low ratio of orthologs to the total possible gene pairs between two genomes.
The supervised classifiers that manage imbalance outdid the popular unsupervised (RBH, RSD, and OMA) algorithms even when the supervised model was extended to yeast datasets containing "traps" for ortholog detection algorithms. In future research, the introduction of new gene pair features might improve the effectiveness and efficiency of the supervised algorithms.
The scalability analysis of the proposed gene pair feature calculation highlighted the advances that can be achieved in the scalability issue when we parallelize the calculation, and later on, when we implement a big data model for the same calculation.