This chapter discusses missing-value problems from the perspective of machine learning. Missing values frequently occur during data acquisition. When a dataset contains missing values, nonvectorial data are generated. This subsequently causes a serious problem in pattern recognition models because nonvectorial data need further data wrangling before models are built. In view of such, this chapter reviews the methodologies of related works and examines their empirical effectiveness. At present, a great deal of effort has been devoted in this field, and those works can be roughly divided into two types — Multiple imputation and single imputation, where the latter can be further classified into subcategories. They include deletion, fixed-value replacement, K-Nearest Neighbors, regression, tree-based algorithms, and latent component-based approaches. In this chapter, those approaches are introduced and commented. Finally, numerical examples are provided along with recommendations on future development.
Part of the book: Applications of Pattern Recognition
Typical Laplacian embedding focuses on building Laplacian matrices prior to minimizing weights of connected graph components. However, for multilabel problems, it is difficult to determine such Laplacian graphs owing to multiple relations between vertices. Unlike typical approaches that require precomputed Laplacian matrices, this chapter presents a new method for automatically constructing Laplacian graphs during Laplacian embedding. By using trace minimization techniques, the topology of the Laplacian graph can be learned from input data, subsequently creating robust Laplacian embedding and influencing graph convolutional networks. Experiments on different open datasets with clean data and Gaussian noise were carried out. The noise level ranged from 6% to 12% of the maximum value of each dataset. Eleven different multilabel classification algorithms were used as the baselines for comparison. To verify the performance, three evaluation metrics specific to multilabel learning are proposed because multilabel learning is much more complicated than traditional single-label settings; each sample can be associated with multiple labels. The experimental results show that the proposed method performed better than the baselines, even when the data were contaminated by noise. The findings indicate that the proposed method is reliably robust against noise.
Part of the book: Data Mining