1. Introduction
Non-invasive Brain-Computer Interfaces (BCIs) have been an active research area where several different methods have been developed. They include Electroencephalography (EEG), Near-infrared Spectroscopy (NIRS), functional-MRI (fMRI) among others [1]. Of those BCIs, EEG is one of the most studied methods. This is mainly due to its fine temporal resolution, ease of use, and relatively low set-up cost. Each BCI method naturally has its own advantages and disadvantages. EEG is no exception.
One of the main disadvantages of an EEG-based BCI is its susceptibility to noise, which motivates development of a variety of machine learning algorithms for decoding EEG signals, and there have been significant advancements in the area [2].
One way of categorizing machine-learning algorithms for BCI is batch mode and sequential (online) mode. In the batch mode learning, the collectively acquired EEG data from a subject is divided into two subsets: training data and test data. The former is used for training the machine-learning algorithm, whereas the latter is used to evaluate the algorithm's capability to predict the subject's intention [2]-[4]. There are several facets in batch mode learning which call for improvements:
First, it is non-trivial to decide how much data should be used for training and how much data should be left for testing. It should also be noted that the number of necessary training data may depend on each subject.
Second, with the batch mode learning, by definition, one cannot perform sequential evaluations of predictive performance as time evolves.
Third, the batch mode learning presumes that the data is stationary, i.e., the subject's physical condition and/or the environment around the subject does not change over the period of the experiment.
In contrast, the sequential learning algorithm considered in this study starts learning with the very first single trial datum and proceeds with the learning each time a single trial datum arrives within a Bayesian framework.
This paper proposes a Bayesian sequential learning algorithm for steady-state visual evoked potential (SSVEP) classification problems in a principled manner. In particular, the paper performs the following:
Evaluation of the sequential posterior distribution of unknown parameters each time a trial is performed.
Computation of the sequential predictive distribution of the class label at each trial based on the posterior distribution obtained above.
Automatic hyperparameter learning, where hyperparameter in this study corresponds to the search region volume in the unknown parameter space.
Sequential evaluation of the error between the true label and the predicted label.
Sequential evaluation of marginal likelihood which quantifies the reliability of the prediction at each trial.
Experiments are performed on a four class problem in addition to two class problem, where the extension from the latter to the former is nontrivial.
Formulate the problem using nonlinear model to capture potential nonlinearities which can be easily extended to more difficult problems.
2. Related work
There are three ingredients in this study: (i) SSVEP, (ii) Sequential (Online) learning, and (iii) Sequential Monte Carlo implementations. The descriptions that follow will be given in terms of these keywords. For the batch mode learning, we cite the survey paper reported in [2] instead of citing individual papers.
Allison et al. [5] performed a demographic study of several different BCI methods and showed that an SSVEP-based BCI spelling system was competitive for different age groups, as well as different gender groups, with little experience, under noisy environments. It is also reported that most subjects stated that they did not consider the flickering stimuli annoying and would use or recommend such a BCI system. In [6], and also in [7], an SSVEP-based orthosis control system with an LED light source is proposed. The flickering frequencies were 6 Hz, 7 Hz, 8 Hz, and 13 Hz in the former, and 8 Hz and 13 Hz in the latter. Classification was performed using the second- and third-order higher harmonics in addition to the fundamental frequency component. In [8], an SSVEP-based speller is proposed. After Principal Component Analysis (PCA), the probability of each frequency component is estimated using a particular information matrix. The speller introduces a selection based on a decision tree and an undo command for correcting eventual errors. In [9], EEG signals are represented in the spatial-spectral-temporal domain by a wavelet transform. It also uses a multi-linear discriminative subspace by employing general tensor discriminant analysis (GTDA). The classification is conducted by support vector machine (SVM). Reference [10] proposes a biphasic stimulation technique to solve the issue of phase drifts in SSVEP in phase-tagged systems. The Kalman filter is used in [11] to decode neural activity and estimate the desired movement kinematics, where the filter gain is approximated by its steady-state form, which is computed offline before real-time decoding commences. Canonical correlation analysis (CCA) is used in [12] to analyze the frequency components of SSVEP, where the correlations between the target oscillation waveforms, as well as their higher order harmonics, and those of the acquired SSVEP waveforms are calculated. It is demonstrated that the scheme performed better than a fast Fourier transform-based spectrum estimation method. CCA is used in an online manner by updating the parameters each time data arrives. An online learning scheme called Stochastic Meta Descent (SMD), which is a generalization of the gradient descent algorithm, is proposed in [13]. The paper also discusses various aspects of errors incurred in online learning algorithms.
The Subject Specific Classification Model is discussed in [14], where model Gaussian parameters are updated online after an initial learning of the Subject Independent Classification Model from a pool of subjects. The data was taken from P300, which is one component of EEG.
Martinez et al. [4] propose an SSVEP-based online BCI system with visual neurofeedback. The algorithm is different from the one proposed in this paper. There is a report on Bayesian sequential learning for EEG signals [6], where the Sequential Monte Carlo is used for implementation. In addition to the task differences between [6] and this study, there are several algorithmic distinctions between the two. First, the basis function used in the former is linear with respect to the associated parameters, whereas the latter uses a basis function in which parameters appear in a nonlinear manner. Second, the parameter that controls the size of the parameter search region is fixed in the former, whereas it is learned automatically in the latter. These two differences, at least within our experience, are important for achieving better performance.
In addition, no extension for multi-class classification problems was performed in the former, whereas both algorithmic extension and a learning experiment by using multi-class real EEG data are performed in the latter. Our proposed algorithm is based on part of earlier work [16] of the authors' research group for a different application. Preliminary results on a two-class classification problem were reported by the present group in [17]. The current paper gives a full account of the results by expanding several parts of that conference paper. First, a four-class classification problem was formulated and tested experimentally. Second, the algorithm was further improved by incorporating an Effective Sample Size and Rao-Blackwellisation. Third, more detailed discussions are added.
3. Subjects and data acquisition
Of Event Related Potentials used in BCI, the target quantities considered in this study are SSVEPs, which are natural responses to visual stimulation at specific frequencies. These frequency components and their higher-order harmonics can be observed in the occipital region [4]. It is known that SSVEPs are often useful in research because of the reasonable signal-to-noise ratio and relative immunity to artifacts [5].
In an attempt to perform two-class and four-class classification problems, we gathered two sets of SSVEP data. The settings that we describe below for the two experiments are the same except for the number of stimuli and their frequencies. EEG data were recorded by a Polymate (Nihon Koden, Tokyo) with six active electrodes (O1, OZ, O2, O9, IZ, O10) according to the international 10-10 system and referenced to the left earlobe with a digitization rate of 500 Hz. Even though the highest flickering frequency was 10 Hz, we considered second and third order harmonics in one of the experiments reported below. Our original intention was to examine the harmonics higher than three even though they were not reported in this paper. In order to have a wide margin for the Nyquist frequency we chose 500 Hz. Five volunteers (aged 21-23 years) participated in the present study. All subjects were healthy, with no past history of psychiatric or neurological disorders.
Written informed consent was obtained from each subject on forms approved by the ethical committee of Waseda University.
Each subject was seated in a comfortable chair 60 cm in front of the monitor in an electrically shielded and dimmed room. The flow of task events is shown in Figure 1. The stimulus for the two-class problem is illustrated in Figure 2, whereas that of the four-class problem is illustrated in Figure 3.
In the two-class problem, two flickering checkerboard stimuli (left and right) were presented on the monitor, whereas in the four-class problem, four flickering checkerboard stimuli (left, right, top, and bottom) were presented. In addition, a fixation cross was placed at the center, which the subject was usually asked to fixate at.
In the two-class problem, the left stimulus was a checkerboard flickering with frequency of 4.29 Hz, whereas the right stimulus flickered at frequency of 6.00 Hz. In the four-class problem, there were additional stimuli, one at the top with frequency of 7.50 Hz, and one at the bottom with frequency of 10.0 Hz. There were three reasons for selecting these particular frequencies. First, it is known that SSVEPs are discernible in approximately the 4.0 Hz - 50 Hz band [18][19]. Second, higher harmonics of a particular frequency component should not overlap with the fundamental frequency component. Such overlap could give rise to a problem when one considers multi-class classification problem where multiple frequencies are involved as is described in section 6. Third, since the monitor refresh rate is 60 Hz, choice of the flickering frequencies were restricted by 60/positive integer. We chose 60/14=4.2857…. which we approximated by 4. 29.
The subject usually fixated at a central fixation cross. When an arrow replaced the cross, the subject should move his or her eyes to the checkerboard indicated by the arrow for 3.0 s, after which a red circle is shown so that the subject would know when to rest for 5.0 s. This sequence was one trial, and trials were repeated twenty times, constituting one session. The direction of the arrow was selected at random. Each subject completed 600 trials, or 30 sessions. The measurements were performed with a Polymate AP1124, a multi-purpose portable bio-amplifier recording device, manufactured by TEAC Corporation, Tokyo, Japan. The device is equipped with 24 channels with a maximum sampling frequency of 1 kHz. In addition to electroencephalograms (EEGs), eyeball movement and other external signals can be measured. The dimensions are W90 mm x H 44 mm x D 158 mm, the weight is 300 g, and the device is powered by battery.
4. Algorithm
This section gives a description of the proposed sequential learning algorithm. It consists of several aspects: (i) the basis function to fit the data, (ii) the likelihood function, (iii) sequential parameter learning, and (iv) sequential hyperparameter learning. The actual predictive values are given by the predictive distribution of the target class labels, which will be described in 4.3. In order to improve the learning capabilities, Rao-Blackwellisation will also be described. We begin with a two-class classification problem followed by a multi-class classification problem. A schematic diagram of the proposed algorithm is given in Figure 4.
4.1. Two-class classification problem
Let
4.1.1. Basis function and classifier
Consider the parameterized family of nonlinear basis functions
where
The function
Other popular basis functions often work as well. It should be noted that this basis function is nonlinear with respect to
In order to associate the quantity defined by (1) with the class label, consider:
where
while another is:
We tested both functions and found them to work equally well for our SSVEP learning. In what follows, we will report our results with (4) by introducing a latent variable
where
4.1.2. Parameter search stochastic dynamics
In order to perform sequential learning, we perform a sequential stochastic search of the parameter
where
4.1.3. Automatic hyperparameter search stochastic dynamics
Our experiences tell us that automatic adjustment of
There are at least two reasons for this transition probability to be log-normal. One is that
4.1.4. Rao-blackwellised SMC
In this paper, we implemented not only the standard SMC but also Rao-Blackwellised SMC (RBSMC) for the purpose of performance improvement. Rao-Blackwellisation is a statistical variance reduction strategy for the Monte Carlo method [25], [26]. It is a combination of analytical integration (marginalization) and the Monte Carlo method. In order to explain this, recall the parameters associated with the basis function (1), write
where there are two hyperparameters
Since the basis function is linear with respect to
4.2. Multi-class classification problem
This section attempts to generalize the results of the previous section to multi-class problems. Although we will restrict ourselves to a four-class problem, the method can, in principle, be applied to cases with more than four classes.
Let
4.2.1. Basis function
Consider the basis function
where
4.2.2. Multinomial logistic model
This paper assumes the Multinomial Logistic Model for the target problems, where it is assumed that the error
where
The predicted label
The function
4.2.3. Parameter/hyperparameter search stochastic dynamics
We use the same standard Sequential Monte Carlo (SMC) used in 4.1.
4.3. Bayesian Sequential Learning
Letting
The second factor in the numerator is the predictive probability for parameter
At the
5. Implementation
The Sequential Monte Carlo (SMC) is a powerful means of evaluating the posterior or predictive probabilities of Bayesian nonlinear or non-Gaussian models in a sequential manner. This study uses the SMC to evaluate equations (20) and (23) in an attempt to evaluate the SER. The SMC first attempts to approximate the posterior distribution by an empirical distribution (delta mass) weighted by normalized importance weights (importance sampling). In order to avoid depletion of samples, caused by an increase in the variance of the weights, the SMC replaces the weighted empirical distribution by unweighted delta masses (resampling).
There are several different methods of determining when resampling should be done. We tried two of them. One method is to resample every step, and another is to perform resampling only when the Effective Sample Size (EES [22]-[24]) becomes smaller than a threshold value:
where
The threshold value of ESS is often set at
5.1. Standard SMC
Figure 6 gives an overview of the standard SMC, where
5.2. Rao-blackwellised SMC
In the Rao-Blackwellised SMC implementation, the marginal likelihood
Here, the marginal likelihood
Details of updating
6. Results
This section reports the results of learning experiments using the algorithms proposed in the previous sections.
6.1. Observation data
As explained in 3, the six channels (O1, OZ, O2, O9, IZ, O10) located in the occipital eye field were used for our classification problems. Data were taken in 600 trials from each of five subjects (A, B, C, D, E). One trial lasted for 3.0 seconds. Data in the first 0.0-1.0 s was deleted in order to eliminate the effect of eyeball movements on the EEG. The raw data were filtered by 50.0 Hz notch filter. After Hanning windowing, DFT was performed by MATLAB_R2008a to obtain the feature vector
Figure.8 demonstrates the power spectrum of subject D taken from one trial when a stimulus was presented at 6.0 Hz. The particular frequency component is relatively clear. Figure.9 is from another trial of the same subject, where the target frequency component is not clearly discernible.
The vertical lines in the two figures indicate (from the left) 4.29Hz, 6.0Hz, 8.58(4.29
It should be noted that while more frequency components give more information, the number of parameters to be learned increases, so that learning becomes more difficult.
6.2. Experimental settings
This study examines several different versions of Sequential Monte Carlo for implementing the target sequential learning, as displayed in Table 1, where standard SMC means no hyperparameter learning and resampling is performed at every step. The abbreviated notation will be used throughout the rest of the paper. Various experimental settings are summarized in Table 2, where
6.3. Performance evaluation criteria
We will propose three performance evaluation criteria. One is Sequential Error Rates (
where
in order to make performance comparisons with the existing methods. Another quantity we will be evaluating is the sequential marginal likelihood:
which is the marginalization of the likelihood with respect to the current predictive distribution. This quantifies the reliability of the prediction
6.4. Experimental results
6.4.1. Two-class classification problem
Figure 10 shows the Sequential Error Rate of subject D over one session consisting of 600 trials. The algorithm was implemented by Sequential Monte Carlo together with the proposed hyperparameter learning (HP+SMC). Table 3 summarizes the Sequential Error Rates of subjects A-E, which were averages over ten learning trials.
A | B | C | D | E | |
minimum error rate | 0.00 | 0.010 | 0.00 | 0.00 | 0.00 |
maximum error rate | 0.75 | 0.75 | 0.71 | 0.75 | 0.80 |
average over 600 trials | 0.16 | 0.26 | 0.19 | 0.077 | 0.13 |
In Figure 11, the
Figure 12 shows the negative log-Sequential Marginal Likelihood of subject D averaged over the window M=20 as was in Figure 10. Even though Figure 10 and Figure 12 are similar, the latter comes from the Bayesian concept where the latter appears slightly less abrupt. This particular quantity can be applied to the change detection problem as is done in [27].
Figure.13 shows the Cumulative Error of subject D with different algorithms, and Table.4 gives final Cumulative Errors of subjects A-E, that is, the Cumulative Errors at the last trial. These values were the averages over ten experiments.
Figure 14 shows the ESS trajectories (moving average over 20 trials) of subject D with several different methods.
Table 5 summarizes the computation time of the various methods averaged over ten experiments. The middle column shows the time per trial, whereas the right-most column shows the time needed for all trials. The par trial time does not contain the case where resampling is done with the ESS.
1 step (S) | whole data (S) | |
SMC | 0.120 | 72.0 |
HP+SMC | 0.130 | 77.7 |
SMCESS | - | 57.4 |
HP+SMCESS | - | 63.9 |
RBSMC | 0.320 | 192 |
HP+RBSMC | 0.328 | 197 |
RBSMCESS | - | 145 |
HP+RBSMCESS | - | 121 |
Effects of the higher order harmonics were examined and are summarized in Table.6 for three cases: (i) fundamental frequency only, (ii) second-order higher harmonics in addition to the fundamental frequency, and (iii) second- and third-order higher harmonics in addition to the fundamental frequency. The numbers in the table indicate the final Cumulative Errors. The results were the averages over ten experiments.
6.4.2. Multi-class classification problem
A | B | C | D | E | |
minimum error rate | 0.23 | 0.39 | 0.25 | 0.15 | 0.58 |
maximum error rate | 0.74 | 0.82 | 0.60 | 0.72 | 0.85 |
average over 600 trials | 0.46 | 0.60 | 0.47 | 0.36 | 0.72 |
Cumulative Error: Figure 16 shows Cumulative Errors of subject D for the four-class problem with two different algorithms. One is a standard SMC without hyperparameter learning (SMCmulti), and the other is the proposed SMC with hyperparameter learning (HP+SMCmulti). The dotted line indicates a random classification. Table.8 summarizes the CEs for subjects A-E. These are the values averaged over ten experiments.
7. Discussion
7.1. Sequential error rate
We first observe that there are two errors involved in brain-computer interfaces in general, and in this study in particular. One is the error made by the brain (subject), and the other is the error made by the computer (algorithm), provided that the hardware behind the experiments is functional. Let us look at the Sequential Error Rate in Figure 10. It started decreasing immediately after the experiment began, and it had already dropped to about 0.1 at around the 20-th trial. At around the 80-th trial, the Sequential Error Rate became almost 0. One possible interpretation of this is that, if we can assume that the subject does not make an error during these 80 trials, then the SER trajectory represents the process of how the computer learns the classification problem. Recall that there are h(d+2)+1 parameters in (1), which in this case is 141. In addition, the hyperparameter
At around trial 80, the SER dropped to almost zero, so that if the subject's EEG signals were consistent with the previous ones, the computer algorithm did not need to seek different samples in the parameter space. Therefore, the ups and downs after trial 80 could be interpreted as the fact that the subject's EEG signals became slightly inconsistent with the previous ones. During this period, the computer naturally needed to search for slightly different posterior samples, so that some errors were incurred.
This was followed by several ups and downs between 0.2 and almost 0 until approximately trial 310. The subject seemed to have obtained a reasonable amount of skill for the task, so that the subject achieved almost 0. The Sequential Error Rate in the trials between 310 and 330, as well as between 350 and 380, were almost zero. However, the subject's Sequential Error Rate again increased at around trial 380. A sharp dip was observed in the hyperparameter trajectory, as demonstrated in Figure 11. One possible interpretation of this is that by trial 380, the computer had found fairly good posterior samples for predictions so that the parameter search region was narrowed down; however, a sudden change was observed and the computer needed to quickly widen its stochastic parameter search region, which was indicated by the sudden drop of hyperparameter
A question might arise as to why was the drop in
7.2. Sequential marginal likelihood
Since Sequential Marginal Likelihood can be interpreted as the reliability of each prediction of the subject, this quantity can also be used to evaluate subject’s performance with probabilistic justification. Another potential application of this quantity is its use in change detection problem such that a significant change of this quantity would indicate occurrence of change in the subject’s signal quality or/and environmental change. It should be noted that the sequential marginal likelihood is a well defined Bayesian quantity whereas such reliability index is not available in maximum likelihood method.
7.3. Rao-blackwellisation
Note that in Figure13, the best performance was achieved by HP+RBSMCESS, where the Rao-Blackwellisation and the Effective Sample Size were taken into account, in addition to the hyperparameter learning. The proposed scheme appeared functional.
Figure 13 shows the Cumulative Error of subject D, where the black dotted line shows the Cumulative Error corresponding to random classification. Since the Cumulative Error with the proposed algorithms grows slower than the random classification, the results appeared to indicate that the algorithms were functional. Figure 13 appears to indicate that the proposed
7.4. Effective sample size
From the trajectory of the Effective Sample Size (ESS), we observed that the ESS was generally large if resampling was performed at each step. This could be attributable to the fact that the purpose of resampling was to avoid degeneracy of samples, i.e., to bring in more diversity in the samples. Table 5 appears to indicate that the computation time with ESS was significantly reduced since it avoided sampling when ESS did not become smaller than a threshold value.
7.5. Higher-order harmonics
Taking the higher-order harmonics into account generally improved the prediction capabilities, except for subject E, as was seen in Table 6. One future research project could be to develop an algorithm to choose appropriate frequency components automatically. The number of frequency components is also related to the overfitting problem in machine learning, where the number of parameters is large compared with the number of data available, which sometimes results in performance degradation.
7.6. Multi-class classification problem
The extension of the two-class problem to the four-class classification problem discussed in Section 4.2 was nontrivial. One of the difficulties can be seen from the term
8. Conclusion
This paper proposed Bayesian sequential learning algorithms for SSVEP sequential classification problems in a principled manner. Two experiments were conducted: one involving a two-class problem, and the other involving a four-class problem. The stimuli consisted of a flickering checkerboard at frequencies ranging from 4.29 to 10.0 Hz. The algorithms were implemented by the Sequential Monte Carlo. One of the points of the proposed algorithms was their hyperparameter learning, enabling it to automatically adjust to environmental changes, including changes in the subjects' physical conditions as well as their environments. Computation costs were also measured, which appeared to indicate that the algorithms could be implemented in real time. The proposed algorithms appeared functional. The proposed sequential algorithms are applicable to other brain signals besides EEG.
In the experiments performed in this study, the subjects were asked to look at the stimuli. A future research project is to examine sequential classification problems with covert selective attention [20], [21] where subjects are asked to pay attention to stimuli without eyeball movements. This project is in progress and will be reported in a future paper.
Appendix
Update Equations of
The update equations of
Where