Statistics of the multilabel datasets.

## Abstract

Typical Laplacian embedding focuses on building Laplacian matrices prior to minimizing weights of connected graph components. However, for multilabel problems, it is difficult to determine such Laplacian graphs owing to multiple relations between vertices. Unlike typical approaches that require precomputed Laplacian matrices, this chapter presents a new method for automatically constructing Laplacian graphs during Laplacian embedding. By using trace minimization techniques, the topology of the Laplacian graph can be learned from input data, subsequently creating robust Laplacian embedding and influencing graph convolutional networks. Experiments on different open datasets with clean data and Gaussian noise were carried out. The noise level ranged from 6% to 12% of the maximum value of each dataset. Eleven different multilabel classification algorithms were used as the baselines for comparison. To verify the performance, three evaluation metrics specific to multilabel learning are proposed because multilabel learning is much more complicated than traditional single-label settings; each sample can be associated with multiple labels. The experimental results show that the proposed method performed better than the baselines, even when the data were contaminated by noise. The findings indicate that the proposed method is reliably robust against noise.

### Keywords

- graph neural networks
- multilabel classification
- deep learning

## 1. Introduction

Traditional supervised learning deals with the analysis of single-label data, which means that samples are associated with a single label. However, in many real-world data mining applications, such as text classification [1, 2], scene classification [3, 4], crowd sensing/mining [5, 6, 7, 8, 9, 10, 11], and gene functional classification [12, 13], the samples are associated with more than one label. From this description, we understand that the challenge of the multilabel classification task is its potential output.

Basically, multilabel learning algorithms can be categorized into two different groups. 1) Problem transformation method. This method takes the multilabel problem and converts it into a single-label problem that can easily be classified using any classifier using the relationship between labels. 2) Adapted algorithm method. This method directly performs multilabel classification rather than transforming the problem into different subsets of problems, and most of these methods use the Euclidean distance between samples.

The main idea of this paper is to aggregate similar samples to obtain better results. To aggregate similar samples, we use the properties of graph neural networks (GNNs) [14]. The main contributions of this study are as follows:

We propose a method that constructs a multilabel-based Laplacian graph such that each element in it represents the relationship between samples.

We use similar samples with an aggregation approach that is not used in traditional multilabel learning methods.

The rest of this paper is arranged as follows. Section 2 shows the taxonomy of multilabel learning algorithms and describes their methods. Section 3 presents the details of our proposed method. Section 4 describes the multilabel datasets, evaluation metrics and experimental results, followed by the conclusions in Section 5.

## 2. Related work

### 2.1 Multilabel learning algorithms

In this section, we review multilabel learning algorithms. The algorithms that have been applied to multilabel learning over the last decade are not just those mentioned in this paper. Figure 1 summarizes the algorithms detailed in the next section.

#### 2.1.1 Problem transformation

Binary relevance (BR) is used to address a multilabel problem with a binary classifier, and its advantages are simplicity and efficiency, but correlation between labels is not considered. Classifier chains (CCs) are configured in a chain of binary classifiers where a classifier in the chain is based on the prediction of the previous classifier; their advantage is that they consider the relationship between labels but hence cannot be parallelized. Calibrated label ranking (CAL) performs ranking via the pairwise comparison of labels and has the advantage of considering the relationship (but only the pairwise relationship) between labels. Label powersets (LP) treat the situation when multiple labels belong to the same sample as a new label and have the advantage of considering the relationship between labels, but the time complexity grows exponentially with label sets. Random k-labelsets (RKL) are variants of LP models where each classifier is trained with a small random set of labels; their advantage is that they consider the relationship between labels, but they have a low accuracy rate if a worse label set combination is randomly selected.

#### 2.1.2 Adapted algorithm

The multilabel

### 2.2 Graph neural networks

GNNs were mentioned for the first time and further elaborated by [16]. The goal of a GNN is to learn a node’s representation of the acquisition of its information by propagation. Currently, there are many deep learning tasks that need to process data with graph structures. Convolutional neural networks (CNNs) [17] have been successfully developed in the field of computer vision [18, 19] but are unable to process graph structured data [20]. The method used in this paper is called a graph convolutional network (GCN). A GCN can aggregate similar samples by propagating neighbor information, giving it the ability to infer, and there is no need to consider the sequence. GCNs have appeared in many top machine learning conferences and many applications across different tasks and domains, such as manifold learning [21, 22], computer vision [23, 24, 25], text classification [26, 27], hashing [28, 29], and hyperspectral image classification [30, 31].

## 3. The proposed method

This section presents the overall flow of our proposed method, as shown in Figure 2. The multilabel data matrix is first converted into a similarity matrix generated from a Laplacian graph. We call this a multilabel-based Laplacian graph and use this graph as inputs to the GCN model. Each node in the output layer predicts the probability of class membership for the label.

### 3.1 Multilabel-based Laplacian graph

This section presents the proposed method. Before this, let us describe some notational conventions. Matrices are written in boldface capital letters (e.g.,

Based on [32], we formally present our multilabel-based Laplacian graph. For a multilabel dataset, let

We normalize

### 3.2 Graph convolutional network

Based on [34], we fit the GCN used for single-label classification to multilabel classification. The GCN has been modified from a first-order Chebyshev approximation [35]. In order to create a multidimensional input, ChevNet convolution with an input vector

where

Repeated use of this graph convolution operation may cause serious problems such as vanishing gradients. Therefore,

Here,

This paper considers only a two-layer GCN model as the proposed method, and we modify Eq. (4) by placing the adjacent matrix into a multilabel-based Laplacian graph to obtain the formula of the two-layer GCN method proposed in this paper as follows:

where

where

## 4. Experiments

### 4.1 Datasets

The multilabel datasets used in this paper and their associated statistics are shown in Table 1.

### 4.2 Experimental setup

In this study, we have added probabilistic classifier chains [36], CSMLC [37] and RethinkNet [38] as baselines for comparison. The experimental settings are as follows: First, multilabel datasets are preprocessed to [0,1] as inputs, 80% of the samples are used for model (both multilabel learning and proposed method) training, and the last 20% of the samples are used as test sets. We also add Gaussian noise ranging from 6% to 12% of each test sample to test the robustness of the model. The overall framework is shown in Figure 2.

For deep learning, we train all models for 200 epochs using Adam [39] with a learning rate of 0.01 and the mean square error as the loss function.

### 4.3 Evaluation metrics

In multilabel learning, the evaluation metrics must be more rigorous than traditional single-label learning because one sample may be associated with multiple labels. These evaluation metrics [15] are divided into three groups, as shown in Figure 3. The higher the values of the F1 score, precision, mean average precision and recall, the better the performance is. The lower the values of the Hamming loss, one-error, coverage and ranking loss, the better the performance is. We consider the Hamming loss, one-error and mean average precision as three major metrics.

### 4.4 Experimental results

All experiments use different combinations of training and test data to verify the trained model and average the results after repeating the training ten times. According to the observations in Figures 4–6, the following conclusions are reached:

Regardless of whether the Gaussian noise is added to the data set, the classification results of the problem transformation methods (BR, CCs, CAL, LP and RKL) are almost worse than the adaptive algorithms (MLkNN, ML-SVM and ML-DT)

Deep learning may not obtain the best performance.

We found that our method was raised on average by 1.8% and 8% higher in Hamming loss and mean average precision, respectively. And also has excellent performance even if the dataset were contaminated by noise.

Regardless of whether noise is added to the data, our method in one-error evaluation is not as good as other baselines.

## 5. Conclusions

In this paper, we proposed a method of constructing a relation matrix by considering the correlation and sparsity of paired samples. We then added the characteristics of a GCN, which aggregates similar samples, to finally obtain the probability of occurrence of each label. Experimental results on six datasets showed that our proposed method can deliver superior performance in comparison with eleven baselines. Our future work will include designing a general framework that can reduce the use of memory and increase the efficiency of a GCN and extending this framework to unsupervised learning.

## Acknowledgments

This work is supported in part by the Data Science Lab, NSYSU, and in part by the Pervasive Artificial Intelligence Research (PAIR) Lab, Taiwan, under the grant Nos. 110-2634-F-008-004 and 110-2221-E-110-046.