Bayesian Sequential Learning for EEG-Based BCI Classification Problems


Introduction
Non-invasive Brain-Computer Interfaces (BCIs) have been an active research area in which several different methods have been developed. They include Electroencephalography (EEG), Near-infrared Spectroscopy (NIRS), and functional MRI (fMRI), among others [1]. Of these, EEG is one of the most studied methods, mainly because of its fine temporal resolution, ease of use, and relatively low set-up cost. Each BCI method naturally has its own advantages and disadvantages, and EEG is no exception.
One of the main disadvantages of an EEG-based BCI is its susceptibility to noise, which has motivated the development of a variety of machine learning algorithms for decoding EEG signals; significant advances have been made in the area [2].
Machine-learning algorithms for BCI can be categorized into batch mode and sequential (online) mode. In batch-mode learning, the EEG data collected from a subject are divided into two subsets: training data and test data. The former is used for training the machine-learning algorithm, whereas the latter is used to evaluate the algorithm's capability to predict the subject's intention [2]-[4]. There are several facets of batch-mode learning that call for improvement. First, it is non-trivial to decide how much data should be used for training and how much should be left for testing. It should also be noted that the amount of training data needed may depend on the subject.

Related work
There are three ingredients in this study: (i) SSVEP, (ii) sequential (online) learning, and (iii) Sequential Monte Carlo implementations. The descriptions that follow are organized around these keywords. For batch-mode learning, we cite the survey paper [2] instead of citing individual papers.
Allison et al. [5] performed a demographic study of several different BCI methods and showed that an SSVEP-based BCI spelling system was competitive across age groups and genders, for users with little experience, and under noisy environments. It is also reported that most subjects stated that they did not consider the flickering stimuli annoying and would use or recommend such a BCI system. In [6], and also in [7], an SSVEP-based orthosis control system with an LED light source is proposed. The flickering frequencies were 6 Hz, 7 Hz, 8 Hz, and 13 Hz in the former, and 8 Hz and 13 Hz in the latter. Classification was performed using the second- and third-order harmonics in addition to the fundamental frequency component. In [8], an SSVEP-based speller is proposed: after Principal Component Analysis (PCA), the probability of each frequency component is estimated using a particular information matrix, and the speller introduces a selection based on a decision tree and an undo command for correcting eventual errors. In [9], EEG signals are represented in the spatial-spectral-temporal domain by a wavelet transform, a multi-linear discriminative subspace is obtained by general tensor discriminant analysis (GTDA), and classification is conducted by a support vector machine (SVM). Reference [10] proposes a biphasic stimulation technique to solve the issue of phase drift in phase-tagged SSVEP systems. The Kalman filter is used in [11] to decode neural activity and estimate the desired movement kinematics, where the filter gain is approximated by its steady-state form, computed offline before real-time decoding commences. Canonical correlation analysis (CCA) is used in [12] to analyze the frequency components of SSVEP, where the correlations between the target oscillation waveforms, as well as their higher-order harmonics, and the acquired SSVEP waveforms are calculated; the scheme is demonstrated to perform better than a fast-Fourier-transform-based spectrum estimation method. The CCA is applied in an online manner by updating the parameters each time new data arrive. An online learning scheme called Stochastic Meta-Descent (SMD), a generalization of the gradient descent algorithm, is proposed in [13]; the paper also discusses various aspects of the errors incurred in online learning algorithms.
The Subject-Specific Classification Model is discussed in [14], where Gaussian model parameters are updated online after an initial learning of a Subject-Independent Classification Model from a pool of subjects. The data were P300 responses, one component of the EEG.
Martinez et al. [4] propose an SSVEP-based online BCI system with visual neurofeedback; their algorithm is different from the one proposed in this paper. There is a report on Bayesian sequential learning for EEG signals [15], where the Sequential Monte Carlo is used for implementation. In addition to the task differences between [15] and this study, there are several algorithmic distinctions between the two. First, the basis function used in the former is linear with respect to the associated parameters, whereas the latter uses a basis function in which the parameters appear in a nonlinear manner. Second, the parameter that controls the size of the parameter search region is fixed in the former, whereas it is learned automatically in the latter. These two differences, at least in our experience, are important for achieving better performance.
In addition, no extension to multi-class classification problems was performed in the former, whereas both an algorithmic extension and learning experiments using real multi-class EEG data are performed in the latter. Our proposed algorithm is based on part of the earlier work [16] of the authors' research group for a different application. Preliminary results on a two-class classification problem were reported by the present group in [17]. The current paper gives a full account of the results by expanding several parts of that conference paper. First, a four-class classification problem was formulated and tested experimentally. Second, the algorithm was further improved by incorporating the Effective Sample Size and Rao-Blackwellisation. Third, more detailed discussions are added.

Subjects and data acquisition
Of the event-related potentials used in BCIs, the target quantities considered in this study are SSVEPs, which are natural responses to visual stimulation at specific frequencies. These frequency components and their higher-order harmonics can be observed over the occipital region [4]. SSVEPs are often useful in research because of their reasonable signal-to-noise ratio and relative immunity to artifacts [5].
In an attempt to perform two-class and four-class classification, we gathered two sets of SSVEP data. The settings described below were the same for the two experiments except for the number of stimuli and their frequencies. EEG data were recorded by a Polymate (Nihon Kohden, Tokyo) with six active electrodes (O1, OZ, O2, O9, IZ, O10) placed according to the international 10-10 system, referenced to the left earlobe, at a digitization rate of 500 Hz. Even though the highest flickering frequency was 10 Hz, we considered second- and third-order harmonics in one of the experiments reported below, and our original intention was to examine harmonics higher than the third, even though these are not reported in this paper. In order to keep a wide margin relative to the Nyquist frequency, we chose 500 Hz. Five volunteers (aged 21-23 years) participated in the present study. All subjects were healthy, with no history of psychiatric or neurological disorders.
Written informed consent was obtained from each subject on forms approved by the ethical committee of Waseda University.
Each subject was seated in a comfortable chair 60 cm in front of the monitor in an electrically shielded, dimmed room. The flow of task events is shown in Figure 1. The stimulus for the two-class problem is illustrated in Figure 2, and that of the four-class problem in Figure 3. In the two-class problem, two flickering checkerboard stimuli (left and right) were presented on the monitor, whereas in the four-class problem, four flickering checkerboard stimuli (left, right, top, and bottom) were presented. In addition, a fixation cross was placed at the center, on which the subject was usually asked to fixate.
In the two-class problem, the left stimulus was a checkerboard flickering at 4.29 Hz, whereas the right stimulus flickered at 6.00 Hz. In the four-class problem, there were additional stimuli, one at the top flickering at 7.50 Hz and one at the bottom at 10.0 Hz. There were three reasons for selecting these particular frequencies.
First, it is known that SSVEPs are discernible in approximately the 4.0-50 Hz band [18], [19]. Second, the higher harmonics of one stimulus frequency should not overlap with the fundamental frequency of another; such overlap could give rise to a problem in multi-class classification, where multiple frequencies are involved, as described in Section 6. Third, since the monitor refresh rate is 60 Hz, the choice of flickering frequencies was restricted to 60 divided by a positive integer. We chose 60/14 = 4.2857..., which we approximate as 4.29.
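The third constraint is easy to verify numerically. The sketch below checks the admissible frequencies 60/n for the divisors that yield the four stimulus frequencies; only the divisor 14 is stated explicitly in the text, so 10, 8, and 6 are our inference from the 60/n rule. (Note that the paper's 8.58 Hz and 12.87 Hz harmonic values come from multiplying the rounded 4.29 Hz, whereas the exact values are 8.57 Hz and 12.86 Hz.)

```python
# Quick check of admissible flicker frequencies on a 60 Hz monitor:
# a stimulus toggled every n refresh frames flickers at 60/n Hz.
refresh_hz = 60.0
for n in (14, 10, 8, 6):   # n = 14 gives the paper's 4.29 Hz; 10, 8, 6 inferred
    f = refresh_hz / n
    harmonics = [round(m * f, 2) for m in (1, 2, 3)]
    print(f"n = {n:2d}: fundamental {f:.2f} Hz, harmonics {harmonics}")
# n = 14: fundamental 4.29 Hz, harmonics [4.29, 8.57, 12.86]
# n = 10: fundamental 6.00 Hz, harmonics [6.0, 12.0, 18.0]
# n =  8: fundamental 7.50 Hz, harmonics [7.5, 15.0, 22.5]
# n =  6: fundamental 10.00 Hz, harmonics [10.0, 20.0, 30.0]
```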
The subject usually fixated on the central fixation cross. When an arrow replaced the cross, the subject was to move his or her eyes to the checkerboard indicated by the arrow for 3.0 s, after which a red circle was shown so that the subject knew to rest for 5.0 s. This sequence constituted one trial, and twenty trials constituted one session. The direction of the arrow was selected at random. Each subject completed 600 trials, i.e., 30 sessions. The measurements were performed with a Polymate AP1124, a multi-purpose portable bio-amplifier recording device manufactured by TEAC Corporation, Tokyo, Japan. The device has 24 channels with a maximum sampling frequency of 1 kHz; in addition to EEG, eyeball movement and other external signals can be measured. Its dimensions are W 90 mm x H 44 mm x D 158 mm, it weighs 300 g, and it is battery-powered.

Algorithm
This section describes the proposed sequential learning algorithm. It consists of several components: (i) the basis function used to fit the data, (ii) the likelihood function, (iii) sequential parameter learning, and (iv) sequential hyperparameter learning. The actual predictions are given by the predictive distribution of the target class labels, described in 4.3. In order to improve the learning capabilities, Rao-Blackwellisation will also be described.
We begin with a two-class classification problem followed by a multi-class classification problem. A schematic diagram of the proposed algorithm is given in Figure 4.

Two-class classification problem
Let x_k ∈ R^d be the feature vector at the k-th trial, namely the DFT spectrum of a single-trial EEG, where d represents its dimension. Let y_k ∈ {0, 1} be the binary class label of each trial, where 0 corresponds to the right flickering image and 1 to the left flickering image. Our purpose is to learn the parameters associated with the basis function, to be defined shortly, and to predict the subject's intention from the SSVEP data each time a new datum arrives.

Basis function and classifier
Consider the parameterized family of nonlinear basis functions f(•) defined by

f(x_k; ω_k) = Σ_{j=1}^{h} v_{k,j} σ(u_{k,j}^T x_k + u_{k,j,0}) + v_{k,0},   (1)

where ω_k := (u_k, v_k) collects the input-side weights u and the output-side weights v, and h is the number of hidden units. The function σ(•) is a sigmoidal function defined by

σ(a) = 1 / (1 + e^{-a}).

Other popular basis functions often work as well. It should be noted that this basis function is nonlinear with respect to u_k as well as x_k, which enables capturing of the potential nonlinearities involved.
In order to associate the quantity defined by (1) with the class label, consider

P(y_k = 1 | x_k, ω_k) = Φ(f(x_k; ω_k)),

where Φ is a function which monotonically maps the real numbers onto [0, 1]. Several choices of Φ are possible. One is the cumulative distribution function of the standard Gaussian, while another is the logistic function. We tested both functions and found them to work equally well for our SSVEP learning. In what follows, we report our results with (4), introducing a latent variable

z_k = f(x_k; ω_k) + ε_k,   y_k = I(z_k ≥ 0),

where I(A) represents an indicator function defined as 1 when A is true and 0 when A is false.
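The following is a minimal sketch of the basis function and the class-probability map, consistent with the h(d+2)+1 parameter count stated in the discussion; the function and variable names, shapes, and the example dimensions (d = 12, h = 10) are our hypothetical choices, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def basis(x, U, b, v, v0):
    """Perceptron-type basis: f(x; ω) = Σ_j v_j σ(u_j·x + b_j) + v0,
    with σ the logistic sigmoid; ω = (U, b, v, v0) has h(d+2)+1 entries."""
    hidden = 1.0 / (1.0 + np.exp(-(U @ x + b)))   # σ applied elementwise
    return v @ hidden + v0

def p_left(x, U, b, v, v0):
    """P(y = 1 | x, ω) = Φ(f(x; ω)); here Φ is the standard normal CDF.
    The logistic function is the alternative monotone map onto [0, 1]."""
    return norm.cdf(basis(x, U, b, v, v0))

# Hypothetical shapes: d = 12 features, h = 10 hidden units.
rng = np.random.default_rng(0)
d, h = 12, 10
x = rng.normal(size=d)
U, b, v, v0 = rng.normal(size=(h, d)), rng.normal(size=h), rng.normal(size=h), 0.0
print(p_left(x, U, b, v, v0))   # a probability in (0, 1)
```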

Parameter search stochastic dynamics
In order to perform sequential learning, we perform a sequential stochastic search of the parameter ω_k each time trial data are acquired:

P(ω_k | ω_{k-1}, γ_k) = (1/Z_ω) exp( -(γ_k/2) ||ω_k - ω_{k-1}||² ),

where Z_ω represents the normalization constant. This amounts to searching for a new value ω_k based on the previous value ω_{k-1}, but in a random-walk manner. This is a first-order Markov process, so the parameters of the distant past are naturally forgotten because of the noise, whereas the parameters of the immediate past are taken into account with higher weights. This stochastic parameter search is reflected in the posterior distributions (20) given sequential data. Since this transition probability is Gaussian, it involves γ_k, the reciprocal of the variance parameter: if γ_k is small, the parameter search region for ω_k will be large, whereas if γ_k is large, the search region will be small.
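A one-line sketch of this Gaussian random-walk step (the function name and the example values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate_parameters(omega_prev, gamma):
    """One step of the Gaussian random walk ω_k ~ N(ω_{k-1}, γ_k^{-1} I).
    Small γ widens the parameter search region; large γ narrows it."""
    return omega_prev + rng.normal(scale=1.0 / np.sqrt(gamma),
                                   size=omega_prev.shape)

omega = np.zeros(141)                              # e.g. h(d+2)+1 = 141 parameters
omega = propagate_parameters(omega, gamma=100.0)   # tight search step
```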

Automatic hyperparameter search stochastic dynamics
Our experience tells us that automatic adjustment of γ_k is often important for achieving better performance. γ_k is often called a hyperparameter since it controls the behavior of the target parameter ω_k. We perform the following automatic stochastic search of γ_k:

log γ_k = log γ_{k-1} + ε_k^γ,   ε_k^γ ~ N(0, σ_h²),   (10)

i.e., the transition probability of γ_k is log-normal. There are at least two reasons for this choice. One is that γ_k needs to be positive, and the other is to cover a large range of values in the hyperparameter space. It should be noted that (10) is also a first-order Markov process, so the hyperparameters of the immediate past are taken into account, whereas those of the distant past tend to be forgotten. In order to explain the importance of such hyperparameter learning, consider Figure 5. Letting θ_k := (ω_k, γ_k), the blue region represents the likelihood function landscape in the θ-space, where the darker the blue color, the higher the likelihood value. The white diamonds represent samples θ_{t-1}^{(i)}; the yellow diamonds represent samples with large γ_t^{(i)}, and the light-green diamonds samples with small γ_t^{(i)}. Now suppose that the likelihood landscape in the θ-space changes by a relatively large amount, as shown by the pink region, where, again, the darker the color, the higher the likelihood. The yellow diamonds are scarce in the pink region, so it is difficult to find θ samples that give rise to meaningful likelihood values. This is because γ_t^{(i)} is large, so that the search region is restricted. If γ_t^{(i)} is relatively small, on the other hand, the green θ samples may capture at least part of the pink region where the likelihood values are meaningful. The proposed hyperparameter learning scheme automatically learns appropriate γ_t^{(i)} values from the sequential data and lets the algorithm find reasonable θ samples.
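The log-normal step can be sketched as follows (a minimal sketch; the function name and example values are hypothetical, with σ_h the hyper-hyperparameter of Table 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate_hyperparameter(gamma_prev, sigma_h):
    """Log-normal random walk: log γ_k = log γ_{k-1} + N(0, σ_h²).
    This keeps γ positive and lets it move across orders of magnitude."""
    return gamma_prev * np.exp(sigma_h * rng.normal())

gamma = 100.0
for _ in range(5):
    gamma = propagate_hyperparameter(gamma, sigma_h=0.1)
```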

Rao-Blackwellised SMC
In this paper, we implemented not only the standard SMC but also the Rao-Blackwellised SMC (RBSMC) for the purpose of performance improvement. Rao-Blackwellisation is a statistical variance-reduction strategy for Monte Carlo methods [25], [26]. It is a combination of analytical integration (marginalization) and the Monte Carlo method. In order to explain this, recall the parameters associated with the basis function (1), write ω_k := (u_k, v_k), and decompose the stochastic search dynamics (9) into two parts:

P(u_k | u_{k-1}, γ_k) = N(u_k; u_{k-1}, γ_k^{-1} I),
P(v_k | v_{k-1}, δ_k) = N(v_k; v_{k-1}, δ_k^{-1} I),

where there are two hyperparameters, γ_k and δ_k. The corresponding hyperparameter stochastic search dynamics are log-normal, as in (10).
Since the basis function is linear with respect to v_k, Rao-Blackwellisation can be conducted with the data augmentation of z_k [26], where the likelihood function P(y_k | x_k, ω_k) is marginalized with respect to v_k, which, in turn, gives rise to smaller variances. A specific implementation of this particular Rao-Blackwellisation is given in subsection 5.2.
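To see why marginalizing a tractable block reduces variance, consider the following toy model (not the paper's model): the "nonlinear" part u is sampled, while the conditionally Gaussian part v is integrated out analytically. By the law of total variance, the conditional-expectation estimator can never have larger variance than the plain Monte Carlo one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy analogue of ω = (u, v): u is sampled, v | u ~ N(u, 1) is tractable.
# Target: E[v²] = E[E[v² | u]] = E[u² + 1] = 2 for u ~ N(0, 1).
u = rng.normal(size=n)

plain = rng.normal(loc=u, scale=1.0) ** 2   # sample v too, then square
rao_b = u**2 + 1.0                          # integrate v out analytically

print(f"plain MC : {plain.mean():.3f} +/- {plain.std() / np.sqrt(n):.4f}")
print(f"Rao-Black: {rao_b.mean():.3f} +/- {rao_b.std() / np.sqrt(n):.4f}")
# Both estimate the same target; the Rao-Blackwellised standard error is smaller.
```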

Multi-class classification problem
This section generalizes the results of the previous section to multi-class problems. Although we restrict ourselves to a four-class problem, the method can, in principle, be applied to cases with more than four classes. Let x_k ∈ R^d be the feature vector at the k-th trial, i.e., the power spectrum obtained through the DFT over the trial period, where d stands for the dimension of x_k. Let y_k ∈ {1, 2, 3, 4} denote the class label at each trial, where left corresponds to label 1, right to 2, top to 3, and bottom to 4. Our goal is to learn the parameters associated with the basis function described below in order to predict the subject's intention.

Basis function
Consider the basis functions f_q(•) defined by (15), which are nonlinear with respect to not only x_k but also the parameter vector ω_k. There are Q outputs associated with the basis function, where Q is the number of class labels, which is four in this paper. We have

f_q(x_k; ω_k) = Σ_{j=1}^{h} v_{k,j,q} σ(u_{k,j}^T x_k + u_{k,j,0}) + v_{k,0,q},   q = 1, …, Q,   (15)

where σ(•) is the sigmoidal function defined in 4.1 and ω_k collects the weights u_k and v_k.

Multinomial logistic model
This paper assumes the multinomial logistic model for the target problems, where it is assumed that the error ε_{k,q} in each term follows an independently, identically distributed logistic distribution. Introducing latent variables z_{k,q}, we write

z_{k,q} = f_q(x_k; ω_k) - C_{k,q} + ε_{k,q},   (17)

where C_{k,q} := log Σ_{q'≠q} exp(f_{q'}(x_k; ω_k)) represents the score of the term controlled by the outputs of the other class labels.
It follows from (15) that the probability of y_k belonging to class q is described by

P(y_k = q | x_k, ω_k) = exp(f_q(x_k; ω_k)) / Σ_{q'=1}^{Q} exp(f_{q'}(x_k; ω_k)).   (18)

The predicted label y_pred is the label q_max that maximizes P(y_k = q | x_k, ω_k). Using (18), the likelihood function is described by

P(y_k | x_k, ω_k) = Π_{q=1}^{Q} [P(y_k = q | x_k, ω_k)]^{I(y_k = q)},

where I(•) is again the indicator described in 4.1. The generalization to the multi-class problem, (18)-(20), appears straightforward; however, our experience tells us that multi-class problems are much more difficult than the equations suggest. Experimental results are reported in 6.4.
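A minimal sketch of the softmax map (18) and the resulting label prediction; the example outputs f_q are hypothetical.

```python
import numpy as np

def class_probabilities(f):
    """Softmax map (18): P(y = q | x, ω) = exp(f_q) / Σ_{q'} exp(f_{q'})."""
    e = np.exp(f - np.max(f))      # subtract the max for numerical stability
    return e / e.sum()

f = np.array([0.2, 1.5, -0.3, 0.9])   # hypothetical outputs f_q for Q = 4
p = class_probabilities(f)
y_pred = int(np.argmax(p)) + 1        # labels are 1 (left) .. 4 (bottom)
print(p, y_pred)                      # probabilities sum to 1; y_pred = 2
```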

Parameter/hyperparameter search stochastic dynamics
We use the same parameter and hyperparameter stochastic search dynamics, and the same standard Sequential Monte Carlo (SMC), as in 4.1.

Bayesian Sequential Learning
Letting θ_k := (ω_k, γ_k, δ_k), x_{1:k} := (x_1, …, x_k), and y_{1:k} := (y_1, …, y_k), one can derive the sequential posterior distribution at trial k:

P(θ_k | x_{1:k}, y_{1:k}) = P(y_k | x_k, θ_k) P(θ_k | x_{1:k-1}, y_{1:k-1}) / P(y_k | x_{1:k}, y_{1:k-1}).

The second factor in the numerator is the predictive probability for the parameter θ_k, which is given by

P(θ_k | x_{1:k-1}, y_{1:k-1}) = ∫ P(θ_k | θ_{k-1}) P(θ_{k-1} | x_{1:k-1}, y_{1:k-1}) dθ_{k-1}.

At the (k+1)-st trial, let the EEG data x_{k+1} be given. The prediction at that trial amounts to computing the predictive probability for the label y_{k+1}:

P(y_{k+1} | x_{1:k+1}, y_{1:k}) = ∫ P(y_{k+1} | x_{k+1}, θ_{k+1}) P(θ_{k+1} | x_{1:k}, y_{1:k}) dθ_{k+1}.
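The particle approximation of this recursion can be sketched as follows; this is a minimal sketch, not the paper's exact implementation, and `propagate` and `likelihood` are hypothetical stand-ins for the transition density P(θ_k | θ_{k-1}) and the likelihood P(y_k | x_k, θ_k) above.

```python
import numpy as np

def smc_update(particles, weights, x_k, y_k, propagate, likelihood):
    """One SMC step for P(θ_k | x_{1:k}, y_{1:k}):
    propagate(θ) draws from P(θ_k | θ_{k-1}) (also the proposal in this paper);
    likelihood(y, x, θ) evaluates P(y_k | x_k, θ_k); weights are renormalized."""
    particles = [propagate(p) for p in particles]
    weights = weights * np.array([likelihood(y_k, x_k, p) for p in particles])
    return particles, weights / weights.sum()

def predictive_label_prob(particles, weights, x_next, likelihood, y):
    """P(y_{k+1} = y | x_{1:k+1}, y_{1:k}) ≈ Σ_i Ω^{(i)} P(y | x_{k+1}, θ^{(i)})."""
    return float(sum(w * likelihood(y, x_next, p)
                     for p, w in zip(particles, weights)))
```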

Implementation
The Sequential Monte Carlo (SMC) is a powerful means of evaluating the posterior or predictive probabilities of Bayesian nonlinear or non-Gaussian models in a sequential manner. This study uses the SMC to evaluate equations (20) and (23) in order to evaluate the SER. The SMC first approximates the posterior distribution by an empirical distribution (delta masses) weighted by normalized importance weights (importance sampling). In order to avoid depletion of samples, caused by an increase in the variance of the weights, the SMC then replaces the weighted empirical distribution by unweighted delta masses (resampling).
There are several different methods of determining when resampling should be done. We tried two of them. One method is to resample at every step, and the other is to resample only when the Effective Sample Size (ESS, [22]-[24]) becomes smaller than a threshold value:

ESS_k = 1 / Σ_{i=1}^{n} (Ω_k^{(i)})²,

where n is the number of samples, i is the index of a sample, and Ω_k^{(i)} is the normalized importance weight. The threshold value of the ESS is often set at n/2 ([23], [24]), which we adopted. Figure 6 gives an overview of the standard SMC, where θ_k* := (ω_k, γ_k) and Q(•) denotes the proposal distribution, which in this paper is set as P(θ_k* | x_{1:k-1}, y_{1:k-1}). This choice is due to its simplicity of implementation. Figure 6 illustrates the case where resampling is driven by the ESS; if resampling is performed at every step, the ESS step is simply ignored.
Figure 6. Implementation of the standard SMC.
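The ESS criterion is a few lines of code; the following sketch implements the n/2 rule with multinomial resampling (the function names are our own).

```python
import numpy as np

def effective_sample_size(weights):
    """ESS_k = 1 / Σ_i (Ω_k^{(i)})² for normalized importance weights Ω."""
    w = np.asarray(weights)
    return 1.0 / np.sum(w ** 2)

def resample_if_needed(particles, weights, rng, ratio=0.5):
    """Multinomial resampling, triggered only when ESS < n/2 (the paper's threshold)."""
    n = len(weights)
    if effective_sample_size(weights) < ratio * n:
        idx = rng.choice(n, size=n, p=weights)
        particles = [particles[i] for i in idx]
        weights = np.full(n, 1.0 / n)      # uniform weights after resampling
    return particles, weights
```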

Rao-Blackwellised SMC
In the Rao-Blackwellised SMC implementation, the marginal likelihood P(y_k | z_{1:k}, u_{1:k}) is used instead of the likelihood function P(y_k | x_k, ω_k), where Θ_k := (ω_k, γ_k, δ_k). This implementation is described in Figure 7. The marginal likelihood can be evaluated in closed form in terms of the predictive quantities z_{k|k-1} and S_k; details of updating z_{k|k-1} and S_k are given in the Appendix.


Results
This section reports the results of learning experiments using the algorithms proposed in the previous sections.

Observation data
As explained in 3, the six channels (O1, OZ, O2, O9, IZ, O10) located over the occipital region were used for computing the feature vectors.

The vertical lines in the two figures indicate (from the left) 4.29 Hz, 6.0 Hz, 8.58 Hz (4.29 × 2), 12.0 Hz (6.0 × 2), 12.87 Hz (4.29 × 3), and 18.0 Hz (6.0 × 3), respectively. It should be noted that, even with SSVEP, the observed frequency components are not always identifiable by inspection. It should also be noted that SSVEP can contain higher harmonics of the target frequency [4] and that the classification accuracy may be improved by taking the higher harmonics into account [4]. In order to examine the effectiveness of the higher-order harmonics for our classification problem, this section considers the following three settings: (i) the fundamental frequency only; (ii) the second-order harmonics in addition to the fundamental frequency; and (iii) the second- and third-order harmonics in addition to the fundamental frequency. Since the number of channels is 6, the dimensions of our feature vectors are (i) 12, (ii) 24, and (iii) 36, respectively.

It should be noted that while more frequency components give more information, the number of parameters to be learned increases, so that learning becomes more difficult.
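As an illustration of how such feature vectors can be formed, the sketch below extracts DFT power at the stimulus frequencies and their harmonics for the two-class case. The paper does not spell out windowing or normalization details, so `ssvep_features` is a plausible construction under those stated dimensions, not the authors' code.

```python
import numpy as np

def ssvep_features(eeg, fs, stim_freqs, n_harmonics):
    """DFT-power features: one value per (channel, target frequency, harmonic).
    eeg: (n_channels, n_samples) single-trial segment; fs: sampling rate (500 Hz).
    For 6 channels and 2 stimulus frequencies this yields 12 / 24 / 36 dimensions
    for n_harmonics = 1, 2, 3, matching settings (i)-(iii) above."""
    power = np.abs(np.fft.rfft(eeg, axis=1)) ** 2
    freqs = np.fft.rfftfreq(eeg.shape[1], d=1.0 / fs)
    feats = []
    for f0 in stim_freqs:
        for m in range(1, n_harmonics + 1):
            k = int(np.argmin(np.abs(freqs - m * f0)))   # nearest DFT bin
            feats.append(power[:, k])                    # all channels at that bin
    return np.concatenate(feats)

x = ssvep_features(np.random.randn(6, 1500), fs=500.0,   # 3.0 s at 500 Hz
                   stim_freqs=(4.29, 6.0), n_harmonics=3)
print(x.shape)   # (36,)
```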

Experimental settings
This study examines several different versions of the Sequential Monte Carlo for implementing the target sequential learning, as displayed in Table 1, where "standard SMC" means that no hyperparameter learning is performed and resampling is carried out at every step. The abbreviated notation will be used throughout the rest of the paper. The various experimental settings are summarized in Table 2, where n denotes the number of samples; γ_0 and δ_0 the initial conditions for the hyperparameters γ and δ; σ_h the hyper-hyperparameter; and h the number of perceptron hidden units.

Performance evaluation criteria
We use three performance evaluation criteria. The first is the Sequential Error Rate (SER_k), defined by

SER_k = (1/M) Σ_{k'=k-M+1}^{k} I(y_{k'} ≠ y_{k',pred}),

where y_{k'} is the true class, y_{k',pred} is the predicted class defined by (23), and I(•) stands for the indicator described in 4.1. This is the moving average of the prediction error over a window of size M. We also compute the Cumulative Error (CE),

CE_k = Σ_{k'=1}^{k} I(y_{k'} ≠ y_{k',pred}),

in order to make performance comparisons with existing methods. The third quantity we evaluate is the sequential marginal likelihood,

P(y_k | x_{1:k}, y_{1:k-1}) = ∫ P(y_k | x_k, θ_k) P(θ_k | x_{1:k-1}, y_{1:k-1}) dθ_k,

which is the marginalization of the likelihood with respect to the current predictive distribution. This quantifies the reliability of the prediction y_k with respect to (x_{1:k}, y_{1:k-1}). To explain the rationale behind this, recall that, given data y, the likelihood P(y | z) can be interpreted as the degree of appropriateness of z in explaining y. This, in turn, can be interpreted as the appropriateness of y in terms of z.
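The two error criteria reduce to a few lines of code; the function names and the tiny example below are our own illustration.

```python
import numpy as np

def sequential_error_rate(y_true, y_pred, M=20):
    """SER_k: moving average of I(y != y_pred) over a window of size M."""
    err = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return np.array([err[max(0, k - M + 1): k + 1].mean()
                     for k in range(len(err))])

def cumulative_error(y_true, y_pred):
    """CE_k = Σ_{k' <= k} I(y_{k'} != y_{k',pred})."""
    err = (np.asarray(y_true) != np.asarray(y_pred)).astype(int)
    return np.cumsum(err)

ser = sequential_error_rate([1, 0, 1, 1], [1, 1, 1, 0], M=2)
print(ser)   # [0.  0.5 0.5 0.5]
```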

Two-class classification problem
a. Sequential Error Rate: Figure 10 shows the Sequential Error Rate of subject D over one session consisting of 600 trials. The algorithm was implemented by the Sequential Monte Carlo together with the proposed hyperparameter learning (HP+SMC). Table 3 summarizes the Sequential Error Rates of subjects A-E, averaged over ten experiments (Table 3: Sequential Error Rate of the subjects, M=20, HP+SMC).
b. Trajectory of the hyperparameter: In Figure 11, the γ_t-trajectory (blue) is superimposed on the Sequential Error Rates (red) of Figure 10; the value of γ_t is its posterior mean. Note that there was a significant dip in γ_t around trial 380, due to the fact that the algorithm detected a sudden change in the data and automatically widened the search region in the parameter space; eventually, the algorithm re-started learning the parameters. The same phenomenon was also discernible at around trial 475. The hyperparameter learning appeared functional.
c. Sequential Marginal Likelihood (Reliability of the Predictions): Figure 12 shows the negative log-Sequential Marginal Likelihood of subject D averaged over the window M=20, as in Figure 10. Even though Figures 10 and 12 are similar, the latter derives from the Bayesian concept and appears slightly less abrupt. This quantity can be applied to the change detection problem, as is done in [27].
d. Cumulative Error: Figure 13 shows the Cumulative Error of subject D with different algorithms, where the dotted straight line at 1/2 corresponds to a random classification. Table 4 gives the final Cumulative Errors of subjects A-E, that is, the Cumulative Errors at the last trial. These values were the averages over ten experiments.
e. Higher-Order Harmonics: Table 6 gives the results for the three feature settings described in 6.2 (fundamental only; with second harmonics; with second and third harmonics).
f. Computation Time: Table 5 summarizes the computation time of the various methods averaged over ten experiments. The middle column shows the time per trial, whereas the right-most column shows the time needed for all trials. The per-trial time does not include the case where resampling is done with the ESS.

Multi-class classification problem
a. Sequential Error Rate: Figure 15 shows the Sequential Error Rates (moving-average window size 20) of subject D for the four-class problem with the HP+SMC algorithm, and Table 7 gives various values related to the Sequential Error Rates of subjects A-E, averaged over ten experiments.
b. Cumulative Error: Figure 16 shows the Cumulative Errors of subject D for the four-class problem with two different algorithms. One is a standard SMC without hyperparameter learning (SMCmulti), and the other is the proposed SMC with hyperparameter learning (HP+SMCmulti). The dotted line indicates a random classification. The values for subjects A-E are the averages over ten experiments.

Discussion
We first observe that there are two kinds of errors involved in brain-computer interfaces in general, and in this study in particular. One is the error made by the brain (subject), and the other is the error made by the computer (algorithm), provided that the hardware behind the experiments is functional. Let us look at the Sequential Error Rate in Figure 10. It started decreasing immediately after the experiment began and had already dropped to about 0.1 at around the 20th trial. At around the 80th trial, the Sequential Error Rate became almost 0. One possible interpretation is that, if we can assume the subject made no errors during these 80 trials, the SER trajectory represents the process by which the computer learned the classification problem. Recall that there are h(d+2)+1 parameters in (1), which in this case is 141. In addition, the hyperparameter γ needs to be learned. This means that the parameter/hyperparameter landscape is vague at the beginning in a high-dimensional space, so the computer searches for posterior samples in an attempt to find parameter values suitable for better classification. It should be noted that the hyperparameter γ_t is relatively flat up to trial 80 but increases slightly. Since γ_t represents the reciprocal of the size of the parameter search region, this period can be interpreted as the computer's early effort to search for parameters while slightly narrowing the search region.
At around trial 80, the SER dropped to almost zero, so that as long as the subject's EEG signals remained consistent with the previous ones, the algorithm did not need to seek different samples in the parameter space. The ups and downs after trial 80 can therefore be interpreted as periods in which the subject's EEG signals became slightly inconsistent with the previous ones; during these periods, the computer naturally needed to search for slightly different posterior samples, so some errors were incurred.
This was followed by several ups and downs between 0.2 and almost 0 until approximately trial 310. The subject seemed to have acquired a reasonable amount of skill for the task, so that the Sequential Error Rate was almost zero in the trials between 310 and 330, as well as between 350 and 380. However, the Sequential Error Rate again increased at around trial 380, and a sharp dip was observed in the hyperparameter trajectory, as demonstrated in Figure 11. One possible interpretation is that, by trial 380, the computer had found fairly good posterior samples for predictions, so the parameter search region had been narrowed down; a sudden change was then observed, and the computer needed to quickly widen its stochastic parameter search region, as indicated by the sudden drop of the hyperparameter γ_t at around k=380. With this, the algorithm tried to learn parameter values different from the previous ones and eventually found better values. The Sequential Error Rate again dropped to almost 0 at around trial 420 and remained there for approximately 40 trials. A similar phenomenon was discernible after around trial 480. It is important to notice that, in addition to the learning mechanism, a "forgetting" mechanism is naturally built in: (9) and (10) are first-order Markov stochastic dynamical systems, so memories of the distant past are forgotten, whereas more recent data are taken into account with higher weights. Note, however, that the Sequential Monte Carlo algorithm took into account several hundred parameter values instead of a single parameter value, which endowed the predictions with robustness.
A question might arise as to why the drop in γ_t at around k=380 was much more significant than, for instance, that at around k=270, even though the SER increase was larger at the latter. Our interpretation is that at around k=270, γ_t was not too large, so the parameter search region was still reasonably wide, whereas at around k=380, the parameter search region had already become narrow, so a more sudden change of the search-region size was needed.

Sequential marginal likelihood
Since the Sequential Marginal Likelihood can be interpreted as the reliability of each prediction, this quantity can also be used to evaluate a subject's performance with probabilistic justification. Another potential application is change detection: a significant change in this quantity would indicate a change in the subject's signal quality and/or in the environment. It should be noted that the sequential marginal likelihood is a well-defined Bayesian quantity, whereas no such reliability index is available in the maximum likelihood method.

Rao-Blackwellisation
Figure 13 shows the Cumulative Error of subject D, where the black dotted line shows the Cumulative Error corresponding to random classification. Note that the best performance was achieved by HP+RBSMCESS, where the Rao-Blackwellisation and the Effective Sample Size were taken into account in addition to the hyperparameter learning. Since the Cumulative Error with the proposed algorithms grows more slowly than that of random classification, the results appear to indicate that the algorithms were functional; in particular, Figure 13 suggests that the proposed γ_t-learning, as well as the Rao-Blackwellised SMC, was effective.

Effective sample size
From the trajectory of the Effective Sample Size (ESS), we observed that the ESS was generally large when resampling was performed at each step. This is attributable to the fact that the purpose of resampling is to avoid degeneracy of the samples, i.e., to bring more diversity into the samples. Table 5 indicates that the computation time with the ESS criterion was significantly reduced, since resampling was skipped whenever the ESS did not fall below the threshold value.

Higher-order harmonics
Taking the higher-order harmonics into account generally improved the prediction capabilities, except for subject E, as seen in Table 6. A future research project could be to develop an algorithm that chooses appropriate frequency components automatically. The number of frequency components is also related to the overfitting problem in machine learning: when the number of parameters is large compared with the number of available data, performance sometimes degrades.

Multi-class classification problem
The extension of the two-class problem to the four-class classification problem discussed in Section 4.2 was nontrivial. One of the difficulties can be seen from the term C_{k,q} in (17)-(19): the values of (18) must be well separated from each other across the four classes.
The experimental results reported in subsection 6.4, however, appeared reasonable. The Cumulative Errors shown in Figure 16 appear to indicate that the learning was functional.

Conclusion
This paper proposed Bayesian sequential learning algorithms for SSVEP classification problems in a principled manner. Two experiments were conducted: one involving a two-class problem and the other a four-class problem. The stimuli consisted of checkerboards flickering at frequencies ranging from 4.29 Hz to 10.0 Hz. The algorithms were implemented by the Sequential Monte Carlo. One of the key points of the proposed algorithms is their hyperparameter learning, which enables them to adjust automatically to changes, including changes in the subjects' physical condition as well as in their environment. Computation costs were also measured and appeared to indicate that the algorithms can be implemented in real time. The proposed algorithms appeared functional, and they are applicable to other brain signals besides EEG.
In the experiments performed in this study, the subjects were asked to look directly at the stimuli. A future research project is to examine sequential classification problems with covert selective attention [20], [21], where subjects are asked to pay attention to stimuli without eye movements. This project is in progress and will be reported in a future paper.