Using Autoregressive Models of Wavelet Bases in the Design of Mental Task-Based BCIs

A powerful tool for analyzing the characteristics of the signal in the frequency domain as well as in the time domain is wavelet analysis. To analyze the signal, it can be decomposed into some levels successively by wavelet transform. At each level, decomposition yields two types of components: the approximations component which is the low-frequency high-scale portion of the signal, and the details component which is the high-frequency low-scale portion. The resultant approximations component is decomposed repetitively after each level. This is usually referred to as wavelet decomposition. Fig. 1.a shows a signal decomposed into three levels by wavelet decomposition.


Wavelet packet analysis
A powerful tool for analyzing the characteristics of the signal in the frequency domain as well as in the time domain is wavelet analysis. To analyze the signal, it can be decomposed into some levels successively by wavelet transform. At each level, decomposition yields two types of components: the approximations component which is the low-frequency high-scale portion of the signal, and the details component which is the high-frequency low-scale portion. The resultant approximations component is decomposed repetitively after each level. This is usually referred to as wavelet decomposition. Fig. 1.a shows a signal decomposed into three levels by wavelet decomposition.
There is another approach for signal decomposition in which at each level, not only the approximations component but the details component is also decomposed. This is called wavelet packet analysis, and each of the components at different levels is referred to as a node or wavelet packet. Wavelet packet analysis is more flexible but more complicated than wavelet decomposition. Please see Fig. 1.b for more details. In this figure, the signal is decomposed into three levels using wavelet packet analysis.
Let us assume that the maximum frequency of the signal is f m . At each level of decomposition, the approximations component has the lower half of the frequency spectrum of the signal decomposed, and the details component has the higher half of the signal's frequency spectrum. The frequency spectrums pertaining to different packets in three-level wavelet packet analysis is shown in Table 1.  The packets produced by the wavelet packet analysis make an upside-down "tree", whose root is the initial signal and its branches are coming down. If we cut some branches of this tree (i.e., we exclude some packets from this tree), what remains is called a "sub-tree". The final nodes (terminal nodes or leaves) of each sub-tree are a "wavelet basis" for the initial signal. In other words, a wavelet basis is by definition a set of packets containing all non-overlapping frequency components of the initial signal. The packets of a basis can be selected from different levels.  The best sub-tree (or basis) is a sub-tree (or basis) which minimizes a specific cost function. Cost functions are usually defined based on an entropy function (e.g., Shannon's entropy function) [1]. One can also define his or her own cost function. To find the best sub-tree, the costs of all nodes in the main tree are measured. Then, in the direction from leaf to root and at each level, the cost of the node and its resultant child nodes are compared to each other. If the sum of the child nodes' costs is higher than or equal to the parent node's cost, the child nodes are cut off from the tree; otherwise, the child nodes remain in the tree, and the cost of the parent's node will be updated (replaced) with the sum of the child nodes' costs. This parent node is in turn one of the child nodes for the upper level. The same comparison (with updated costs) is done for the upper levels until the root level is reached. What remains at the end is the best sub-tree and its leaves are the best basis for the initial signal.

Brain-computer interfaces
Brain-Computer Interfaces (BCIs) providing an alternative channel for communication and intended to be used by motor-disabled individuals can be categorized into two main classes. Synchronized BCIs are the first and earliest class, in which the user is able to activate the system during pre-defined time periods only. The second class is referred to as asynchronous or selfpaced BCIs. These systems are more useful, since the user can activate them whenever he or she wishes. At each time instant, a self-paced BCI is either in the intentional-control (IC) state or in the no-control (NC) state. IC is the active state, and NC is the inactive state (i.e., the state during which the output of the system is inactive).
The performance of a self-paced BCI is usually evaluated by two rates: the true positive rate (TPR) and the false positive rate (FPR). TPR is considered as the rate of correctly classifying intentional-control states, while FPR is defined as the rate of misclassifying no-control states.
A neurological phenomenon is a set of specific features or patterns in signals produced by the brain. They arise due to brain activities to which these phenomena are time-locked. Different types of neurological phenomena include the activity of neural cells (ANCs), P300, brain rhythms such as the Mu, Beta, and Gamma rhythms, movement-related potentials (MRPs), slow cortical potentials (SCPs), visual evoked potentials (VEPs), steady-state visual evoked potentials (SSVEPs), and mental tasks (MTs). For a review of the field of BCIs, please refer to [2]- [7].
The dataset of Keirn and Aunon [8] containing the EEG signals of five mental tasks is used in this paper, as it has been used in a variety of studies such as [9]- [55]. Although most of these studies can classify mental tasks to some extent, only a few of them care about false positives or confusion matrices and report them [32], [45]- [48]. The classification error is reported in studies [21], [31], [49] as well. The rates of correct classification are the only measure well paid attention to in the remaining studies. False positives are of great importance in BCI applications, they are hence fairly considered in this study.

Autoregressive modeling
The autoregressive (AR) model of an order K of a signal is defined as follows:  The most popular method to find the AR coefficients is the Burg algorithm [56] which computes coefficients at successive orders in the forward direction as well as in the backward direction. This method is used in this paper to estimate the AR coefficients.
Figuring out the optimal AR model order is not straightforward, even though some techniques including the Reflection Coefficient [18], the Information Theoretic criterion or Akaike Information Criterion (AIC), the Autoregressive Transfer Function criterion, and the Final Prediction Error (FPE) criterion [57] have been introduced. If the order of the model is too low, the whole signal cannot be captured in the model. On the other hand, the more the order is, the more portion of the noise is captured. Since there is no guarantee that the above mentioned techniques work well in every application and since FPR is of great importance in our application, we do not use these techniques to find the optimal AR model order. Instead, to be on the safe side, we vary the order in a reasonably large range and select the best order based on the performance of the system evaluated via nested five-fold cross-validation.

Quadratic discriminant analysis
The Quadratic Discriminant Analysis (QDA) classifier [58] assumes the classes have normal distributions. For QDA method of classification, unlike the linear discriminant analysis, the covariance matrices of the classes can be different. For a two-class problem, the quadratic discriminant function by definition is as follows: where x is the vector to be classified, μ 1 , μ 2 are the estimated mean vectors of classes 1 and 2, Σ 1 , Σ 2 are the estimated covariance matrices of class 1 and class 2, π 1 , π 2 are the prior probabilities of the two classes, C 12 is the cost of misclassifying a member of class 1 as class 2, and C 21 is the cost due to misclassifying a member of class 2.
The decision rule for classification is: where ω 1 , ω 2 represent class 1 and class 2, respectively.
In this paper, the same value for the cost of false negative C 21 and false positive C 12 is used. It is also assumed that the a-priori probabilities of the two classes are equal.

Data
As mentioned in the Section 1.2, we used the EEG data which has been collected previously by Keirn and Aunon [8]. The EEG signals of this dataset belong to seven subjects, each performing five different mental tasks. The mental tasks include baseline, mentally computing a nontrivial multiplication, composing a letter to a friend mentally, rotating a three-dimensional object mentally, and visualizing writing a sequence of numbers on a blackboard. The subjects did not vocalize or gesture in any way when their signals were being recorded. Each session of recording comprises five trials of each mental task; therefore, there are a total of twenty five trials in a session. Only one session was performed on a single day. The length of each trial is ten seconds. Two of the subjects (Subjects 2 and 7) completed only one session, while one of them (Subject 5) completed three sessions. In this study, we used the data of subjects who completed at least 10 trials. Since the signals of Subject 4 were missing some data, we did not use them. New numbers were assigned to the subjects whose signals have been used in this study (see Table 2). Table 2 also shows the number of trials completed by each subject.  EEG signals were recorded from six channels (electrodes) while the subjects were seated in a room which is sound-controlled and with dim lighting. The electrodes were placed at positions C3, C4, P3, P4, O1, and O2 (based on the International 10-20 System) on the scalp. The reference electrodes were two electrically linked mastoids, A1 and A2. Fig. 2 shows the electrodes' locations. During recordings, the impedances between each electrode and the reference electrodes were kept below 5 kΩ. The signals were sampled at 250 Hz with an A/D converter (twelve-bit Lab Master) and a bank of amplifiers (Grass 7P511, with the band-pass filters set at 0.1-100 Hz). The system was calibrated at the beginning of each session with a known voltage.
Two EOG electrodes were placed below and at the outside corner of the left eye for detecting ocular artifacts. Since in this study, we did not remove any segment of the EEG signals due to ocular artifacts, the signals of EOG electrodes were not used.
Even though this dataset has not been collected in a self-paced paradigm, we are using it as an introductory exploration. It is obvious that in a self-paced paradigm, brain activities do not change, but since the pacing information (the exact start and end time of the mental tasks) is not known, training the BCI system would be more complicated.

The design of BCI systems
In this paper, we apply wavelet packet analysis to the design of a two-state self-paced mental task-based BCI. We develop and custom design five different BCIs for each subject based on the five mental tasks. In each BCI, one mental task is considered as the intentional-control task and the other four mental tasks are considered as the no-control tasks. Unlike the no-control tasks, the intentional-control task should activate the BCI system. Even though a BCI in which the baseline is the intentional-control task is practically useless, we consider it here for comparison purposes. We then determine the two most discriminatory mental tasks for each subject by comparing the performance of the five BCI systems of that subject. The overview of the proposed BCI system is illustrated in Fig. 3.
We customize the BCIs for every subject and mental task, since it has been proven that customized BCIs yield better results than general BCIs [59]- [60].
The EEG signals of four subjects are exploited. We use the first ten trials for Subject 3, and all ten trials for the other three subjects. The sampling rate is 250 Hz and each trial is 10 seconds long; therefore, there are 2500 samples in every trial of each mental task.
Each trial is divided into 45 256-sample overlapping segments. Each segment overlaps with the adjacent segment by 206 samples (about 80%). Hence, for each subject, the total number of segments for each mental task is 450. The segments with the length of more than 1 second are sufficiently long to get a good characteristic of the signal [61].
For each BCI, we have six different EEG channels. Each channel has its own feature vector and classifier. For each channel, the EEG segment is decomposed using wavelet packet analysis and the AR models of the resultant wavelet packets are estimated using the Burg algorithm.
The AR coefficients of the packets belonging to a given wavelet basis are concatenated into a vector to form the channel's feature vector. The classifier of each channel is QDA. Each QDA classifies the input EEG segment as an IC or NC segment. The task of the second-stage classifier which is a simple majority voting classifier is to determine whether the final output of the system is IC or NC.

Training, cross-validation, and testing
For each subject, the 5×450 segments pertaining to the five mental tasks are divided into a training set, a validation set, and a test set. The training set is used to train the system. The validation set is used to select the best wavelet, the best wavelet basis, and the best AR model order. The test set is used to evaluate the final performance of the system. The performance of a system evaluated based on a fixed split of data into training, validation and test sets is not accurate and robust, therefore, we perform nested five-fold (or 5×5) cross-validation. While the model selection is done during the inner cross-validation process, the system performance is estimated in the outer cross-validation.
The data are split into the five outer folds, for each of which 80% of the data is used for training and validation and 20% of the data is used for testing. The portion of data which is assigned for training and validation are further divided into five inner folds. For each inner fold, 80% of the data is used for training and the rest is used for validation. Hence, what we report as the cross-validation and testing results are the average over 25 and 5 different cases, respectively.

Different wavelet bases
Each of these segments is decomposed by wavelet packet analysis into three levels. In threelevel wavelet packet decomposition, there exist 14 packets which can be seen in Fig. 1.b. We have 25 different wavelet bases representing the initial segment in the wavelet domain. These bases are listed in Table 3.

The relationship between the packet length and the AR order
The length of a child packet is almost half of its parent packet's length. Since the initial decomposed segment is 256 samples long, the packets at levels 1, 2, and 3 contain approximately 128, 64, and 32 samples, respectively. When we estimate the packets' AR model, we consider their lengths as a factor in selecting the appropriate AR order. In other words, we try to keep the same ratios between the orders of the packets at different levels. Therefore, if we set the AR order of a first-level packet as K, the AR orders for the packets at levels 2 and 3 would be close to K/2 and K/4, respectively.

Different wavelets
Wavelets from various families are used. For each subject and each mental task, the wavelet with the best performance during nested five-fold cross-validation is selected. The 36 wavelets tested are from the Haar, Daubechies, Biorthogonal, Coiflets, and Symlets families. We assign a number to each of the wavelets and list them in Table 5.

The proposed method to find the best wavelet basis
As mentioned earlier in the Introduction section, to find the best wavelet basis, different cost functions, which are usually different types of entropy functions, have been proposed. Unfortunately, the methodology based on the cost function is not working here because of two main reasons. First of all, since we have a number of segments for training the system which are not fully stationary, different wavelet bases are chosen for different segments. Hence, we can not come up with a single basis. Secondly, if we suppose that we are able to find the best basis based on a defined cost function, there is no guarantee that the selected basis has the best performance for our system. Therefore, we propose a method to find the best wavelet basis which does not have the above problems. The idea is very simple. We select a basis as the best basis which shows the best performance during a cross-validation process. There are two measures for performance evaluation of our system, TPR and FPR. The ratio of TPR to FPR is calculated to compare the performance with different wavelet basis. This method is not only applicable to the BCI system but also to any system which is based on classification.

Selecting the best wavelet and the best wavelet basis
For each subject and each mental task, we run a nested five-fold cross-validation process with the first set of AR model orders in Table 4 (i.e., 12, 6, and 3 for levels 1, 2, and 3, respectively) in order to find the best wavelet and the best wavelet basis. For each of the 36 wavelets, we test all 25 possible wavelet bases. The best wavelet basis is firstly determined for every wavelet based on the ratio of TPR/FPR. Secondly, we compare the performance of the system with the best bases of different wavelets, and figure out the wavelet whose best basis yields the best system performance (in terms of the ratio of TPR to FPR). It is noteworthy that the results would be the same if we first find the best wavelet for each of the 25 wavelet bases and then select the basis with the best performance.

Optimizing the AR model order
Having selected the best wavelet and wavelet basis for the BCI of each subject and each mental task, we find the optimal AR order via another nested five-fold cross-validation process. To this end, we test different sets of AR orders (i.e., sets 2, 3, …, 13 in Table 4) using the best wavelet and the best basis as previously determined, and select the set with the highest TPR/FPR ratio.

Testing the system
We test the system via the outer fold of nested five-fold cross-validation with the selected wavelet, the best wavelet basis, and the optimal AR order set for every BCI belonging to each subject and mental task. The results are given in the next section.

Results
The results of the cross-validation process (at AR order set 1 and the optimal AR order set) and the results of testing (at the optimal AR order set) are summarized in Table 6. This table also shows the selected wavelet and the best wavelet basis for each BCI. The performance of a system which is based on three-level wavelet decomposition (instead of wavelet packet analysis) is furthermore given in Table 6 for comparison purposes.
1   Table 6. Performance of BCIs for Different Subjects and Tasks during Cross-Validation and Testing Table 6 is divided into 20 sections. Each section contains the information about the BCI belonging to a specific subject and mental task. The first line in every section shows the result of the first cross-validation process at AR order set 1. As mentioned before, this cross-validation is performed in order to select the best wavelet and the best wavelet basis for each BCI. The results include the mean and the standard deviation of the TPR and FPR values for the selected case (i.e., the best wavelet and basis). The best wavelet and the best basis are given on lines four and five of the table, respectively. The optimal AR model order set is determined based on the results of the second crossvalidation process done with the selected wavelet and basis. The results of the second crossvalidation with the optimal AR order make the second lines of sections. It is noteworthy that the optimal AR order set for all BCIs is the last set which has the largest numbers.
The performance of the BCI systems with the best configuration (i.e., using the best wavelet, best basis, and optimal AR order set) are also given on the third line of each section. The last line of each section presents the performance of the BCI based on three-level wavelet decomposition as an example to be compared with the performance of the system based on the best basis. As previously discussed, three-level wavelet decomposition is one of the existing bases in three-level wavelet packet analysis (basis 16 according to Table 3).
The most discriminatory task for each subject is determined by comparing the performance of five BCIs pertaining to the subject. The most discriminatory mental task is in a section with solid black borders. For each subject, the second most discriminatory task is also chosen based on the results and shown in bold in the table. Table 7 presents the average system performance of the BCIs for each subject (over tasks) and for each task (over subjects).
The preliminary results have been published in [62].

Best wavelet bases
As seen from Table 6, the best wavelet basis for all subjects and mental tasks is surprisingly the first basis which is made of the approximations and details components of the first level of wavelet decomposition. This implies that for the proposed BCIs, it is enough to decompose the signal into the first level, and further decomposition degrades the system performance.

Most discriminatory tasks
To determine the most discriminatory tasks for each subject, we consider the FPR values during testing and cross-validation at the optimal AR order set. We put the main weight on the FPR of testing since not only the portion of the dataset used for testing is larger than the portion used for cross-validation, but also the testing portion is completely separate from the portions exploited during training and cross-validation processes. The TPR values during testing and cross-validation are then considered if necessary to find out the most discriminatory tasks.
For Subject 1, the testing FPR for the baseline and the rotation tasks are the same and the lowest. The cross-validation FPR is lower for the baseline. Hence, the most discriminatory task must be the baseline. Since a BCI based on the baseline is activated when the subject wishes to relax and think of nothing, it is practically useless; therefore, we do not consider the baseline as the most discriminatory task. The rotation task is then selected as the most discriminatory task. The second most discriminatory task for Subject 1 is the multiplication.
For Subject 2, the most discriminatory task is the multiplication since it has the lowest testing and cross-validation FPRs. The counting is the second most discriminatory task for this subject.
For Subjects 3 and 4, the most and the second most discriminatory tasks are the counting and the letter composing, respectively. For Subject 3, the FPR values for these two tasks interestingly reach zero during testing.

Selected wavelets
Unlike the basis, the selected wavelets are not the same for all subjects and mental tasks. In eleven BCIs (out of the twenty BCIs designed for different subjects and tasks), a wavelet from the Daubechies family has been chosen as the best wavelet. Wavelet 'db2', the most selected wavelet, is the best wavelet for five BCIs. The BCIs for the most and the second most discriminatory tasks of Subjects 2 and 3 has the best performance with this wavelet among all other wavelets. In five BCIs, the Biorthogonal family is the best. Each of the Coiflets and Symlets families are also selected in two BCIs.

Average system performance
According to Table 7, the BCIs of Subject 3 have generally the best performance, with the average FPR of 0.08% and the average TPR of 57.51%. Moreover, the BCIs based on the counting task have the best performance overall. The average performance of BCIs based on the counting has FPR and TPR values of 0.17% and 63.00%.

Comparison of different wavelet bases
For each BCI, we consider and evaluate the system performance with different wavelet bases as further analysis. We then sort and rank the bases based on the ratio of TPR to FPR during testing. Considering the performance of each wavelet basis for different subjects and mental tasks, we count the number of cases that each basis ranks n-th among the other bases. The results are summarized in Table 8. The corresponding three-dimensional histograms are shown in Fig. 4. In this figure, the horizontal axes are related to different wavelet bases and their ranks. The vertical axis is showing the number of times that a basis has a specific rank amongst other bases. It can be seen from this bar diagram that as the basis number is increasing, the ranks of the basis is getting worse, i.e., the first basis has the highest ranks and the last basis has the worst rank. The results are almost along with the cross-validation results (with the first set of AR orders) and show that the first basis is the best basis for all BCIs except for three of them belonging to the multiplication task of Subject 1 and the rotation task of Subjects 2 and 3. For the multiplication task of Subject 1, the best basis is basis 2. The basis 6 ranks first for the rotation task of Subjects 2 and 3. The rank of the first basis for the multiplication task of Subject 1 and the rotation task of Subject 2 is two. The first basis is ranked third for the rotation task of Subject 3. In all other 17 BCIs, the first basis is the best.

Conclusion
In this paper, we presented a method to select the best wavelet basis in the design of a twostate self-paced mental task-based BCI. The use of the proposed methodology is not limited to the BCI systems and it can be also used in other applications. The previously introduced methods (based on cost functions) to find the best wavelet basis generally has two major drawbacks. First of all, they are not practical in classification problems where we have different training signal segments and each of them may result in a different basis. In this case, we cannot come up with a unique basis for all of the training signal segments; therefore, we are not able to finalize our design of the classification system. The second drawback is resulted by this fact that, supposing we can find a unique best wavelet basis for all training segments, there is no guarantee that this wavelet basis yields the best classification accuracy. Because of these two reasons, we decided to propose a method based on the classification accuracy to find the best basis. Since our BCI systems are evaluated by true positive and false positive rates, we used these two measures in the process of finding the best wavelet basis. It is worth noting that any other kind of classification accuracy measure can be potentially exploited in our proposed method to select the best basis.
We have tested our proposed method in the design of mental task-based BCIs. The output of the BCI should be activated when the subject performs a specific mental task. The aim is to minimize and maximize the false activation rate and the true activation rate, respectively.
The scalar autoregressive model coefficients of the components of the best wavelet basis were used as features. The classifiers studied were based on QDA and majority voting. We performed nested five-fold cross-validation two times to choose the best wavelet, the best wavelet basis, and the best autoregressive model orders. Results have shown that the most discriminatory tasks are different amongst the subjects confirming the findings of previous studies ( [20], [30], [33], and [42]) based on the same dataset.
For each subject and each mental task, the best configuration (i.e., the best wavelet, the best wavelet basis, and the optimum AR order) is found during offline analysis of the data. During online analysis, the best configuration is used. Therefore, the system is applicable to real-time applications.