Statistics of the multilabel datasets.
Typical Laplacian embedding focuses on building Laplacian matrices prior to minimizing weights of connected graph components. However, for multilabel problems, it is difficult to determine such Laplacian graphs owing to multiple relations between vertices. Unlike typical approaches that require precomputed Laplacian matrices, this chapter presents a new method for automatically constructing Laplacian graphs during Laplacian embedding. By using trace minimization techniques, the topology of the Laplacian graph can be learned from input data, subsequently creating robust Laplacian embedding and influencing graph convolutional networks. Experiments on different open datasets with clean data and Gaussian noise were carried out. The noise level ranged from 6% to 12% of the maximum value of each dataset. Eleven different multilabel classification algorithms were used as the baselines for comparison. To verify the performance, three evaluation metrics specific to multilabel learning are proposed because multilabel learning is much more complicated than traditional single-label settings; each sample can be associated with multiple labels. The experimental results show that the proposed method performed better than the baselines, even when the data were contaminated by noise. The findings indicate that the proposed method is reliably robust against noise.
- graph neural networks
- multilabel classification
- deep learning
Traditional supervised learning deals with the analysis of single-label data, which means that samples are associated with a single label. However, in many real-world data mining applications, such as text classification [1, 2], scene classification [3, 4], crowd sensing/mining [5, 6, 7, 8, 9, 10, 11], and gene functional classification [12, 13], the samples are associated with more than one label. From this description, we understand that the challenge of the multilabel classification task is its potential output.
Basically, multilabel learning algorithms can be categorized into two different groups. 1) Problem transformation method. This method takes the multilabel problem and converts it into a single-label problem that can easily be classified using any classifier using the relationship between labels. 2) Adapted algorithm method. This method directly performs multilabel classification rather than transforming the problem into different subsets of problems, and most of these methods use the Euclidean distance between samples.
The main idea of this paper is to aggregate similar samples to obtain better results. To aggregate similar samples, we use the properties of graph neural networks (GNNs) . The main contributions of this study are as follows:
We propose a method that constructs a multilabel-based Laplacian graph such that each element in it represents the relationship between samples.
We use similar samples with an aggregation approach that is not used in traditional multilabel learning methods.
The rest of this paper is arranged as follows. Section 2 shows the taxonomy of multilabel learning algorithms and describes their methods. Section 3 presents the details of our proposed method. Section 4 describes the multilabel datasets, evaluation metrics and experimental results, followed by the conclusions in Section 5.
2. Related work
2.1 Multilabel learning algorithms
In this section, we review multilabel learning algorithms. The algorithms that have been applied to multilabel learning over the last decade are not just those mentioned in this paper. Figure 1 summarizes the algorithms detailed in the next section.
2.1.1 Problem transformation
Binary relevance (BR) is used to address a multilabel problem with a binary classifier, and its advantages are simplicity and efficiency, but correlation between labels is not considered. Classifier chains (CCs) are configured in a chain of binary classifiers where a classifier in the chain is based on the prediction of the previous classifier; their advantage is that they consider the relationship between labels but hence cannot be parallelized. Calibrated label ranking (CAL) performs ranking via the pairwise comparison of labels and has the advantage of considering the relationship (but only the pairwise relationship) between labels. Label powersets (LP) treat the situation when multiple labels belong to the same sample as a new label and have the advantage of considering the relationship between labels, but the time complexity grows exponentially with label sets. Random k-labelsets (RKL) are variants of LP models where each classifier is trained with a small random set of labels; their advantage is that they consider the relationship between labels, but they have a low accuracy rate if a worse label set combination is randomly selected.
2.1.2 Adapted algorithm
The multilabel -nearest neighbor (MLkNN) method is derived from the traditional -nearest neighbor algorithm. Each sample is identified with nearest neighbors in the training set, and information is obtained from these identified neighbors. Multilabel support vector machine (ML-SVM) classification determines an optimal hyperplane that separates observations according to their labels. A multilabel decision tree (ML-DT) is constructed by building a decision tree, where each node corresponds to a set of samples in the data set.
2.2 Graph neural networks
GNNs were mentioned for the first time and further elaborated by . The goal of a GNN is to learn a node’s representation of the acquisition of its information by propagation. Currently, there are many deep learning tasks that need to process data with graph structures. Convolutional neural networks (CNNs)  have been successfully developed in the field of computer vision [18, 19] but are unable to process graph structured data . The method used in this paper is called a graph convolutional network (GCN). A GCN can aggregate similar samples by propagating neighbor information, giving it the ability to infer, and there is no need to consider the sequence. GCNs have appeared in many top machine learning conferences and many applications across different tasks and domains, such as manifold learning [21, 22], computer vision [23, 24, 25], text classification [26, 27], hashing [28, 29], and hyperspectral image classification [30, 31].
3. The proposed method
This section presents the overall flow of our proposed method, as shown in Figure 2. The multilabel data matrix is first converted into a similarity matrix generated from a Laplacian graph. We call this a multilabel-based Laplacian graph and use this graph as inputs to the GCN model. Each node in the output layer predicts the probability of class membership for the label.
3.1 Multilabel-based Laplacian graph
This section presents the proposed method. Before this, let us describe some notational conventions. Matrices are written in boldface capital letters (e.g., ). The transpose of a matrix is denoted as . Vectors are written in boldface lowercase letters (e.g., ). For a matrix , the -th column and the -th entry are denoted by and , respectively. denotes the identity matrix, is the -norm, and denotes a column vector with all elements equal to ones.
Based on , we formally present our multilabel-based Laplacian graph. For a multilabel dataset, let be the data matrix with and representing the number of samples and the dimensions, respectively. is the multilabel-based Laplacian graph, and we use a sparse representation method to construct this graph as follows:
We normalize = 1 which represents a sparse constraint on because sparse representation is robust to noise , and is an adjustable parameter. The second term is added to regularize the loss function.
3.2 Graph convolutional network
Based on , we fit the GCN used for single-label classification to multilabel classification. The GCN has been modified from a first-order Chebyshev approximation . In order to create a multidimensional input, ChevNet convolution with an input vector and a filter is formulated as follows:
where means the convolution operator, is the adjacency matrix and is the degree matrix. By using the single parameter = = to avoid overfitting, Eq. (2) can be rewritten as:
Repeated use of this graph convolution operation may cause serious problems such as vanishing gradients. Therefore, in Eq. (3) is modified to with and , finally giving a layerwise propagation rule to support multidimensional inputs as follows:
Here, is the output of an activation function in the -th layer of the GCN. is a trainable weight matrix corresponding to the -th layer of GCN. is the data matrix. denotes a specific activation function such as a sigmoid activation function.
This paper considers only a two-layer GCN model as the proposed method, and we modify Eq. (4) by placing the adjacent matrix into a multilabel-based Laplacian graph to obtain the formula of the two-layer GCN method proposed in this paper as follows:
where and . For semi-supervised multilabel classification, we evaluate the mean square error over all labeled samples:
where is the ground truth label matrix with labelsets, and is the number of labeled samples.
The multilabel datasets used in this paper and their associated statistics are shown in Table 1.
4.2 Experimental setup
In this study, we have added probabilistic classifier chains , CSMLC  and RethinkNet  as baselines for comparison. The experimental settings are as follows: First, multilabel datasets are preprocessed to [0,1] as inputs, 80% of the samples are used for model (both multilabel learning and proposed method) training, and the last 20% of the samples are used as test sets. We also add Gaussian noise ranging from 6% to 12% of each test sample to test the robustness of the model. The overall framework is shown in Figure 2.
For deep learning, we train all models for 200 epochs using Adam  with a learning rate of 0.01 and the mean square error as the loss function.
4.3 Evaluation metrics
In multilabel learning, the evaluation metrics must be more rigorous than traditional single-label learning because one sample may be associated with multiple labels. These evaluation metrics  are divided into three groups, as shown in Figure 3. The higher the values of the F1 score, precision, mean average precision and recall, the better the performance is. The lower the values of the Hamming loss, one-error, coverage and ranking loss, the better the performance is. We consider the Hamming loss, one-error and mean average precision as three major metrics.
4.4 Experimental results
All experiments use different combinations of training and test data to verify the trained model and average the results after repeating the training ten times. According to the observations in Figures 4–6, the following conclusions are reached:
Regardless of whether the Gaussian noise is added to the data set, the classification results of the problem transformation methods (BR, CCs, CAL, LP and RKL) are almost worse than the adaptive algorithms (MLkNN, ML-SVM and ML-DT)
Deep learning may not obtain the best performance.
We found that our method was raised on average by 1.8% and 8% higher in Hamming loss and mean average precision, respectively. And also has excellent performance even if the dataset were contaminated by noise.
Regardless of whether noise is added to the data, our method in one-error evaluation is not as good as other baselines.
In this paper, we proposed a method of constructing a relation matrix by considering the correlation and sparsity of paired samples. We then added the characteristics of a GCN, which aggregates similar samples, to finally obtain the probability of occurrence of each label. Experimental results on six datasets showed that our proposed method can deliver superior performance in comparison with eleven baselines. Our future work will include designing a general framework that can reduce the use of memory and increase the efficiency of a GCN and extending this framework to unsupervised learning.
This work is supported in part by the Data Science Lab, NSYSU, and in part by the Pervasive Artificial Intelligence Research (PAIR) Lab, Taiwan, under the grant Nos. 110-2634-F-008-004 and 110-2221-E-110-046.