Open access peer-reviewed chapter - ONLINE FIRST

Multilabel Classification Based on Graph Neural Networks

By Wei-Cheng Ye and Jia-Ching Wang

Submitted: May 26th, 2021 | Reviewed: July 28th, 2021 | Published: September 6th, 2021

DOI: 10.5772/intechopen.99681


Abstract

Typical Laplacian embedding focuses on building Laplacian matrices prior to minimizing the weights of connected graph components. However, for multilabel problems, it is difficult to determine such Laplacian graphs owing to the multiple relations between vertices. Unlike typical approaches that require precomputed Laplacian matrices, this chapter presents a new method for automatically constructing Laplacian graphs during Laplacian embedding. By using trace minimization techniques, the topology of the Laplacian graph can be learned from the input data, subsequently creating a robust Laplacian embedding that informs the graph convolutional network. Experiments were carried out on different open datasets with clean data and with added Gaussian noise, where the noise level ranged from 6% to 12% of the maximum value of each dataset. Eleven multilabel classification algorithms were used as baselines for comparison. To verify performance, three major evaluation metrics specific to multilabel learning were adopted, because multilabel learning, in which each sample can be associated with multiple labels, is much more complicated than the traditional single-label setting. The experimental results show that the proposed method performed better than the baselines, even when the data were contaminated by noise. The findings indicate that the proposed method is reliably robust against noise.

Keywords

  • graph neural networks
  • multilabel classification
  • deep learning

1. Introduction

Traditional supervised learning deals with the analysis of single-label data, in which each sample is associated with exactly one label. However, in many real-world data mining applications, such as text classification [1, 2], scene classification [3, 4], crowd sensing/mining [5, 6, 7, 8, 9, 10, 11], and gene functional classification [12, 13], the samples are associated with more than one label. The challenge of the multilabel classification task therefore lies in its large space of possible label combinations.

Basically, multilabel learning algorithms can be categorized into two groups. 1) Problem transformation methods, which convert the multilabel problem into one or more single-label problems that can be handled by any off-the-shelf classifier, typically by exploiting the relationships between labels. 2) Adapted algorithm methods, which perform multilabel classification directly rather than transforming the problem into subsets of single-label problems; most of these methods rely on the Euclidean distance between samples.

The main idea of this paper is to aggregate similar samples to obtain better results. To do so, we exploit the properties of graph neural networks (GNNs) [14]. The main contributions of this study are as follows:

  • We propose a method that constructs a multilabel-based Laplacian graph such that each element in it represents the relationship between samples.

  • We use similar samples with an aggregation approach that is not used in traditional multilabel learning methods.

The rest of this paper is arranged as follows. Section 2 shows the taxonomy of multilabel learning algorithms and describes their methods. Section 3 presents the details of our proposed method. Section 4 describes the multilabel datasets, evaluation metrics and experimental results, followed by the conclusions in Section 5.


2. Related work

2.1 Multilabel learning algorithms

In this section, we review multilabel learning algorithms. Many algorithms have been applied to multilabel learning over the last decade; those mentioned in this paper are only a representative subset. Figure 1 summarizes the algorithms detailed in the following subsections.

Figure 1.

Taxonomy of multilabel learning algorithms [15].

2.1.1 Problem transformation

  • Binary relevance (BR) addresses a multilabel problem with one binary classifier per label; its advantages are simplicity and efficiency, but correlations between labels are not considered.

  • Classifier chains (CCs) are configured as a chain of binary classifiers in which each classifier is conditioned on the predictions of the previous ones; they consider the relationships between labels but, as a consequence, cannot be parallelized.

  • Calibrated label ranking (CAL) performs ranking via the pairwise comparison of labels and thus considers the relationship between labels, but only the pairwise relationship.

  • Label powerset (LP) treats each combination of labels that co-occurs in a sample as a new label; it captures label relationships, but the time complexity grows exponentially with the size of the label set.

  • Random k-labelsets (RKL) are variants of LP in which each classifier is trained on a small random subset of labels; they consider label relationships but can suffer in accuracy if an unfavorable label-set combination is randomly selected.
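
As a rough illustration of this family of methods (not part of the chapter's proposal), the sketch below uses scikit-learn's OneVsRestClassifier as a binary relevance baseline and ClassifierChain for classifier chains; the synthetic data, the logistic regression base learner, and all parameter choices are illustrative assumptions.

```python
# Minimal sketch of two problem transformation approaches using scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 samples, 20 features (toy data)
Y = rng.integers(0, 2, size=(100, 5))   # 5 binary labels per sample (toy data)

# Binary relevance: one independent binary classifier per label.
br = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Classifier chain: each classifier also sees the predictions for earlier labels.
cc = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0).fit(X, Y)

print(br.predict(X[:3]))
print(cc.predict(X[:3]))
```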

2.1.2 Adapted algorithm

The multilabel k-nearest neighbor (MLkNN) method is derived from the traditional k-nearest neighbor algorithm: for each sample, its k nearest neighbors in the training set are identified, and label information is obtained from these neighbors. Multilabel support vector machine (ML-SVM) classification determines an optimal hyperplane that separates observations according to their labels. A multilabel decision tree (ML-DT) builds a decision tree in which each node corresponds to a subset of samples in the dataset.
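
The following minimal sketch illustrates the neighbor-based idea behind MLkNN with a plain per-label majority vote over the k nearest neighbors; the real MLkNN additionally uses a MAP estimate over neighbor label counts, so this simplification, together with the toy data and the value of k, is an assumption made purely for illustration.

```python
# Simplified, from-scratch illustration of the MLkNN idea:
# predict a label if the majority of a sample's k nearest neighbors carry it.
import numpy as np

def knn_multilabel_predict(X_train, Y_train, X_test, k=5):
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to all training samples
        nn_idx = np.argsort(dists)[:k]                # indices of the k nearest neighbors
        votes = Y_train[nn_idx].mean(axis=0)          # fraction of neighbors carrying each label
        preds.append((votes >= 0.5).astype(int))      # per-label majority vote
    return np.array(preds)

# Toy usage (shapes are illustrative only).
rng = np.random.default_rng(0)
X_train, Y_train = rng.normal(size=(50, 10)), rng.integers(0, 2, size=(50, 4))
X_test = rng.normal(size=(3, 10))
print(knn_multilabel_predict(X_train, Y_train, X_test, k=5))
```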

2.2 Graph neural networks

GNNs were first introduced and further elaborated in [16]. The goal of a GNN is to learn a representation of each node by propagating and aggregating information from its neighbors. Currently, many deep learning tasks need to process data with graph structures. Convolutional neural networks (CNNs) [17] have been successfully developed in the field of computer vision [18, 19] but are unable to process graph-structured data [20]. The method used in this paper is the graph convolutional network (GCN). A GCN can aggregate similar samples by propagating neighbor information, which gives it the ability to perform inference without requiring a fixed node ordering. GCNs have appeared in many top machine learning conferences and in applications across different tasks and domains, such as manifold learning [21, 22], computer vision [23, 24, 25], text classification [26, 27], hashing [28, 29], and hyperspectral image classification [30, 31].

3. The proposed method

This section presents the overall flow of our proposed method, as shown in Figure 2. The multilabel data matrix is first converted into a similarity matrix generated from a Laplacian graph. We call this the multilabel-based Laplacian graph and use it as input to the GCN model. Each node in the output layer predicts the probability of membership in the corresponding label.

Figure 2.

An illustration of the workflow of the proposed method. Green represents the training model; blue represents the test model.

3.1 Multilabel-based Laplacian graph

This section presents the proposed method. Before this, let us describe some notational conventions. Matrices are written in boldface capital letters (e.g., $\mathbf{X}$). The transpose of a matrix is denoted as $\mathbf{X}^\top$. Vectors are written in boldface lowercase letters (e.g., $\mathbf{x}$). For a matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, the $j$-th column and the $ij$-th entry are denoted by $\mathbf{x}_j$ and $x_{ij}$, respectively. $\mathbf{I}$ denotes the identity matrix, $\|\cdot\|_2$ is the $\ell_2$-norm, and $\mathbf{1}$ denotes a column vector with all elements equal to one.

Based on [32], we formally present our multilabel-based Laplacian graph. For a multilabel dataset, let $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n] \in \mathbb{R}^{n \times m}$ be the data matrix, with $n$ and $m$ representing the number of samples and the dimensionality, respectively. $\mathbf{S} \in \mathbb{R}^{n \times n}$ is the multilabel-based Laplacian graph, and we use a sparse representation method to construct this graph as follows:

$$\min_{\mathbf{S}} \sum_{i,j=1}^{n} \|\mathbf{x}_i - \mathbf{x}_j\|_2^2 \, S_{ij} + \beta \sum_{i=1}^{n} \|\mathbf{s}_i\|_2^2 \quad \text{s.t.} \quad S_{ii} = 0,\; S_{ij} \ge 0,\; \mathbf{1}^\top \mathbf{s}_i = 1. \tag{1}$$

We normalize each row so that $\mathbf{1}^\top \mathbf{s}_i = 1$, which acts as a sparsity constraint on $\mathbf{S}$, because sparse representation is robust to noise [33]; $\beta$ is an adjustable parameter. The second term is added to regularize the loss function.
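
The chapter states the objective in Eq. (1) but not a specific solver. One common way to optimize it (an assumption here, in the spirit of adaptive-neighbor graph learning) is to solve each row independently: minimizing $\sum_j \|\mathbf{x}_i - \mathbf{x}_j\|_2^2 s_{ij} + \beta \|\mathbf{s}_i\|_2^2$ over the constraint set amounts to a Euclidean projection of $-\mathbf{d}_i / (2\beta)$ onto the probability simplex, where $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|_2^2$, as sketched below.

```python
# Row-wise construction of the multilabel-based Laplacian graph S from Eq. (1),
# using a Euclidean projection onto the probability simplex (assumed solver).
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex {s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def build_multilabel_laplacian_graph(X, beta=1.0):
    """Each row s_i minimizes sum_j ||x_i - x_j||^2 S_ij + beta * ||s_i||^2
    subject to S_ii = 0, S_ij >= 0, and 1^T s_i = 1."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    S = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)            # exclude the diagonal to enforce S_ii = 0
        S[i, others] = project_to_simplex(-dist[i, others] / (2.0 * beta))
    return S

# Toy usage (shapes are illustrative): 5 samples with 3 features.
X = np.random.default_rng(0).normal(size=(5, 3))
S = build_multilabel_laplacian_graph(X, beta=0.5)
print(np.round(S, 2), S.sum(axis=1))   # each row sums to 1, diagonal is 0
```

Smaller values of $\beta$ concentrate each row's weight on the closest samples, yielding a sparser graph, while larger values spread the weight more evenly.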

3.2 Graph convolutional network

Based on [34], we adapt the GCN originally used for single-label classification to multilabel classification. The GCN is derived from a first-order Chebyshev approximation [35]. The ChebNet convolution of an input signal $\mathbf{x}$ with a filter $g_\theta$ is formulated as follows:

$$\mathbf{x} \star g_\theta = \theta_0 \mathbf{x} - \theta_1 \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \mathbf{x}, \tag{2}$$

where $\star$ denotes the convolution operator, $\mathbf{A}$ is the adjacency matrix and $\mathbf{D}$ is the degree matrix. By using the single parameter $\theta = \theta_0 = -\theta_1$ to avoid overfitting, Eq. (2) can be rewritten as:

$$\mathbf{x} \star g_\theta = \theta \left( \mathbf{I}_n + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \right) \mathbf{x}. \tag{3}$$

Repeated use of this graph convolution operation may cause serious problems such as vanishing gradients. Therefore, $\mathbf{I}_n + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ in Eq. (3) is modified to $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$ with $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_n$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, finally giving a layerwise propagation rule that supports multidimensional inputs:

$$\mathbf{H}^{(l+1)} = \sigma\!\left( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right). \tag{4}$$

Here, $\mathbf{H}^{(l)}$ is the output of the activation function in the $l$-th layer of the GCN, $\mathbf{W}^{(l)}$ is a trainable weight matrix corresponding to the $l$-th layer, $\mathbf{H}^{(0)}$ is the data matrix, and $\sigma$ denotes a specific activation function such as the sigmoid.
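
A minimal numpy sketch of the renormalized propagation rule in Eq. (4) is given below; the toy graph, feature dimensions, and sigmoid activation are illustrative assumptions.

```python
# One GCN layer following Eq. (4): H^{l+1} = sigma(D~^{-1/2} A~ D~^{-1/2} H^l W^l).
import numpy as np

def normalize_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])                 # A~ = A + I_n (add self-loops)
    d = A_tilde.sum(axis=1)                          # D~_ii = sum_j A~_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D~^{-1/2} A~ D~^{-1/2}

def gcn_layer(A_hat, H, W, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
    return activation(A_hat @ H @ W)                 # Eq. (4) for one layer (sigmoid by default)

# Toy forward pass: 4 nodes, 3 input features, 2 hidden units.
rng = np.random.default_rng(0)
A = (rng.random((4, 4)) > 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                       # symmetric adjacency without self-loops
H0 = rng.normal(size=(4, 3))
W0 = rng.normal(size=(3, 2))
H1 = gcn_layer(normalize_adjacency(A), H0, W0)
print(H1.shape)                                      # (4, 2)
```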

This paper adopts a two-layer GCN model as the proposed method. We modify Eq. (4) by replacing the adjacency matrix with the multilabel-based Laplacian graph, which gives the formula of the two-layer GCN proposed in this paper:

$$\mathbf{H}^{(1)} = \sigma\!\left( \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{S}} \hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(0)} \mathbf{W}^{(0)} \right), \qquad \mathbf{H}^{(2)} = \sigma\!\left( \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{S}} \hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(1)} \mathbf{W}^{(1)} \right), \tag{5}$$

where $\hat{\mathbf{S}} = \mathbf{S} + \mathbf{I}_n$ and $\hat{D}_{ii} = \sum_j \hat{S}_{ij}$. For semi-supervised multilabel classification, we evaluate the mean square error over all labeled samples:

$$\text{MeanSquareError} = \frac{1}{t} \sum_{i=1}^{t} \left\| \mathbf{H}^{(2)}_i - \mathbf{Y}_i \right\|_2^2, \tag{6}$$

where $\mathbf{Y} \in \{0,1\}^{n \times c}$ is the ground-truth label matrix with $c$ label sets, and $t$ is the number of labeled samples.
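
A possible PyTorch rendering of the two-layer model in Eq. (5) with the masked mean square error of Eq. (6) is sketched below. The hidden width, the synthetic graph and labels, and the variable names are assumptions; the Adam optimizer with learning rate 0.01 and 200 epochs follows the settings reported in Section 4.2.

```python
# Two-layer GCN over the multilabel-based Laplacian graph (Eq. (5)) with the
# masked mean square error of Eq. (6). Hyperparameters and toy data are assumptions.
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_labels):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, num_labels, bias=False)

    def forward(self, S_hat_norm, H0):
        H1 = torch.sigmoid(S_hat_norm @ self.W0(H0))   # Eq. (5), first layer
        H2 = torch.sigmoid(S_hat_norm @ self.W1(H1))   # Eq. (5), second layer
        return H2

def masked_mse(H2, Y, labeled_mask):
    # Eq. (6): mean of the per-sample squared errors over the t labeled samples only.
    return ((H2[labeled_mask] - Y[labeled_mask]) ** 2).sum(dim=1).mean()

# Toy usage: n = 6 samples, m = 4 features, c = 3 labels, first 4 samples labeled.
n, m, c = 6, 4, 3
S = torch.rand(n, n); S = (S + S.T) / 2; S.fill_diagonal_(0.0)   # stand-in for the learned graph S
S_hat = S + torch.eye(n)                                          # S^ = S + I_n
d_inv_sqrt = S_hat.sum(dim=1).rsqrt()
S_hat_norm = d_inv_sqrt[:, None] * S_hat * d_inv_sqrt[None, :]    # D^^{-1/2} S^ D^^{-1/2}
H0, Y = torch.rand(n, m), torch.randint(0, 2, (n, c)).float()
labeled_mask = torch.zeros(n, dtype=torch.bool); labeled_mask[:4] = True

model = TwoLayerGCN(m, 16, c)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)   # Adam with lr 0.01 (Section 4.2)
for epoch in range(200):                                    # 200 epochs (Section 4.2)
    optimizer.zero_grad()
    loss = masked_mse(model(S_hat_norm, H0), Y, labeled_mask)
    loss.backward()
    optimizer.step()
```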

4. Experiments

4.1 Datasets

The multilabel datasets used in this paper and their associated statistics are shown in Table 1.

Datasets | Domain | # of features | # of samples | # of training data | # of test data | # of classes
---|---|---|---|---|---|---
Emotions* | Audio | 72 | 593 | 474 | 119 | 6
Water Quality* | Chemistry | 16 | 1060 | 848 | 212 | 14
CIE Image* | Image | 294 | 2000 | 1600 | 400 | 5
Natural Scenes* | Image | 294 | 2407 | 1925 | 482 | 6
Yeast* | Biology | 103 | 2417 | 1933 | 484 | 14
AR Face** | Image | 1024 | 30303 | 24242 | 6061 | 6

Table 1.

Statistics of the multilabel datasets.

* Multilabel datasets are available at http://mulan.sourceforge.net/datasets-mlc.html

** The AR Face dataset is available at http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html

4.2 Experimental setup

In this study, we added probabilistic classifier chains [36], CSMLC [37] and RethinkNet [38] as baselines for comparison. The experimental settings are as follows. First, the features of each multilabel dataset are scaled to [0, 1] as inputs; 80% of the samples are used for training (for both the multilabel learning baselines and the proposed method), and the remaining 20% are used as the test set. We also add Gaussian noise ranging from 6% to 12% of the maximum value of each dataset to the test samples to test the robustness of the model. The overall framework is shown in Figure 2.
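
A sketch of this preprocessing pipeline is given below; interpreting the 6%-12% figure as the standard deviation of the added Gaussian noise, relative to the maximum value of the (scaled) dataset, is an assumption, as the chapter only states the range.

```python
# Min-max scaling to [0, 1], an 80/20 train/test split, and additive Gaussian
# noise on the test samples only (noise level as a fraction of the dataset maximum).
import numpy as np

def preprocess_and_split(X, Y, noise_percent=0.06, seed=0):
    rng = np.random.default_rng(seed)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)  # scale each feature to [0, 1]
    idx = rng.permutation(len(X))
    split = int(0.8 * len(X))                                          # 80% training, 20% test
    train_idx, test_idx = idx[:split], idx[split:]
    X_train, Y_train = X[train_idx], Y[train_idx]
    X_test, Y_test = X[test_idx], Y[test_idx]
    # Gaussian noise with std = noise_percent * max value of the scaled dataset.
    X_test_noisy = X_test + rng.normal(0.0, noise_percent * X.max(), size=X_test.shape)
    return X_train, Y_train, X_test_noisy, Y_test
```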

For the deep learning models, we train for 200 epochs using Adam [39] with a learning rate of 0.01 and the mean square error as the loss function.

4.3 Evaluation metrics

In multilabel learning, the evaluation metrics must be more rigorous than in traditional single-label learning because one sample may be associated with multiple labels. These evaluation metrics [15] are divided into three groups, as shown in Figure 3. For the F1 score, precision, mean average precision and recall, higher values indicate better performance; for the Hamming loss, one-error, coverage and ranking loss, lower values indicate better performance. We consider the Hamming loss, one-error and mean average precision to be the three major metrics.
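
The three major metrics can be computed as in the sketch below; the averaging convention for mean average precision (macro averaging here) and the 0.5 threshold used to binarize scores are assumptions, with the formal definitions given in [15].

```python
# Sketch implementations of the three major multilabel metrics on toy data.
import numpy as np
from sklearn.metrics import hamming_loss, average_precision_score

def one_error(Y_true, Y_score):
    # Fraction of samples whose top-ranked label is not a relevant label.
    top = np.argmax(Y_score, axis=1)
    return np.mean(Y_true[np.arange(len(Y_true)), top] == 0)

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.4, 0.8], [0.6, 0.7, 0.2]])
Y_pred = (Y_score >= 0.5).astype(int)   # threshold predicted probabilities at 0.5

print("Hamming loss:", hamming_loss(Y_true, Y_pred))
print("One-error:", one_error(Y_true, Y_score))
print("Mean average precision:", average_precision_score(Y_true, Y_score, average="macro"))
```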

Figure 3.

Taxonomy of evaluation metrics.

4.4 Experimental results

All experiments use different combinations of training and test data to validate the trained model, and the results are averaged over ten repetitions of training. From the observations in Figures 4-6, the following conclusions are reached:

  • Regardless of whether Gaussian noise is added to the dataset, the classification results of the problem transformation methods (BR, CCs, CAL, LP and RKL) are almost always worse than those of the adapted algorithms (MLkNN, ML-SVM and ML-DT).

  • Deep learning does not always obtain the best performance.

  • Our method improves on the baselines by an average of 1.8% in Hamming loss and 8% in mean average precision, and it also performs well even when the dataset is contaminated by noise.

  • Regardless of whether noise is added to the data, our method does not perform as well as the other baselines on the one-error metric.

Figure 4.

Results of the proposed method compared with multilabel learning algorithms on the used multilabel datasets. (a)–(c) show the results without adding Gaussian noise.

Figure 5.

Results of the proposed method compared with multilabel learning algorithms on the used multilabel datasets. (a)–(c) show the results of adding 6% Gaussian noise.

Figure 6.

Results of the proposed method compared with multilabel learning algorithms on the used multilabel datasets. (a)–(c) show the results of adding 12% Gaussian noise.

5. Conclusions

In this paper, we proposed a method of constructing a relation matrix by considering the correlation and sparsity of paired samples. We then exploited the characteristic of a GCN that aggregates similar samples to finally obtain the probability of occurrence of each label. Experimental results on six datasets showed that our proposed method delivers superior performance in comparison with eleven baselines. Our future work will include designing a general framework that can reduce memory use and increase the efficiency of a GCN, and extending this framework to unsupervised learning.


Acknowledgments

This work is supported in part by the Data Science Lab, NSYSU, and in part by the Pervasive Artificial Intelligence Research (PAIR) Lab, Taiwan, under the grant Nos. 110-2634-F-008-004 and 110-2221-E-110-046.

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
