Open access peer-reviewed chapter

Deep Learning for Natural Language Processing

Written By

Yuan Wang, Zekun Li, Zhenyu Deng, Huiling Song and Jucheng Yang

Reviewed: 14 July 2023 Published: 11 August 2023

DOI: 10.5772/intechopen.112550

From the Edited Volume

Deep Learning and Reinforcement Learning

Edited by Jucheng Yang, Yarui Chen, Tingting Zhao, Yuan Wang and Xuran Pan

Abstract

With the constantly growing number of topical or sentiment-bearing texts and dialogs on the Web, the demand for automatic language and text analysis algorithms continues to expand. This chapter discusses advanced deep learning techniques for classical and currently active research directions in natural language processing, including text classification, sentiment analysis, and task-oriented dialog systems. In text classification, we focus on multi-label text classification and extreme multi-label text classification, which automatically annotate texts with the most relevant labels. In sentiment analysis, we look into aspect-based sentiment analysis, which automatically extracts fine-grained sentiment information from texts, and multimodal sentiment analysis, which classifies people’s opinions or attitudes from multimedia data through fusion techniques. For dialog systems, we introduce how deep learning techniques work in the pipeline mode and the end-to-end mode of task-oriented dialog systems. The rapidly evolving state of research on these three topics is reviewed. Furthermore, trends in the research on deep learning for natural language processing are identified, and a discussion of future advances is provided.

Keywords

  • deep learning
  • text classification
  • sentiment analysis
  • task-oriented dialog system
  • tasks and models

1. Introduction

Deep learning has become increasingly important in natural language processing (NLP) because of the rapid growth of internet content and the pressing need to analyze data at scale.

Text classification is one of the most fundamental tasks in NLP: the user enters a text, and the model assigns it to predefined categories. Text classification tasks can be divided into multi-class text classification, multi-label text classification, hierarchical text classification, and extreme multi-label text classification. In the multi-class setting, the label set contains two or more categories, and each sample has exactly one relevant label. In the multi-label text classification (MLTC) setting, a sample may have one or more relevant labels. Hierarchical text classification is a special multi-class or multi-label task in which the labels have a hierarchical relationship. Extreme multi-label text classification (XMTC) annotates a text with the most relevant labels from a very large label set containing millions, or even billions, of labels. A limitation of traditional models is that words are treated as independent features out of context. Deep learning methods, which have had great success in related fields, address this by automatically extracting context-sensitive features from raw text. Text classification techniques can be applied to problem classification [1], topic classification [2], and emotion classification [3]. By target domain, applications include recommendation systems, the legal domain, and ad placement. In recommendation systems, the task is to predict how much a user prefers a particular item. In the legal field, MLTC is used to predict the final outcome of bills. In ad placement, personalized ads are tailored to users by inferring their characteristics and personal interests from social media.
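The difference between the multi-class and multi-label settings described above shows up in how targets are encoded. The following is a minimal sketch; the three label names are made up for illustration and are not taken from any dataset in this chapter.

```python
# Hypothetical label set, used only for illustration.
LABELS = ["sports", "politics", "technology"]

def one_hot(label):
    """Multi-class target: exactly one relevant label per sample."""
    return [1 if name == label else 0 for name in LABELS]

def multi_hot(labels):
    """Multi-label target: one or more labels may be relevant."""
    chosen = set(labels)
    return [1 if name in chosen else 0 for name in LABELS]

multi_class_target = one_hot("politics")
multi_label_target = multi_hot(["sports", "technology"])
print(multi_class_target)
print(multi_label_target)
```

A multi-class model is typically trained to pick the single position set to 1, while a multi-label model makes an independent decision for every position.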

Sentiment analysis refers to mining people’s opinions and emotional attitudes toward various matters from information in modalities such as text and images. In the early days, sentiment analysis was mainly used to analyze user reviews of products sold online and thus determine user preferences when purchasing products. With the popularity of self-publishing nowadays, sentiment analysis is more often used to identify the sentiment of topic participants, to mine the value of topics, and to analyze related public opinion. Sentiment analysis has important application value for both society and individuals.

The dialog system relies on deep learning technology to act as an assistant that talks or chats with people. A task-oriented dialog system is used to solve specific problems in specific domains, such as movie ticket reservation and restaurant table reservation. Because of its considerable commercial value, it has attracted growing attention.

This chapter is organized as follows: Section 2 discusses advances in text classification, Section 3 outlines sentiment analysis, Section 4 presents the task-oriented dialog system, and finally, Section 5 concludes the chapter.

2. Advancement in text classification

2.1 Multi-label text classification

There are three main problems in MLTC settings. First, obtaining comprehensive supervisory information is time-consuming and labor-intensive. Second, deep learning lacks theoretical support for interpretability, which also needs to be addressed. Third, modeling label dependencies is a major difficulty (Figure 1).

Figure 1.

Deep learning in multi-label text classification.

A multi-label text classification pipeline consists of text pre-processing, text representation through feature engineering, and a classifier. Text pre-processing is a series of operations on the original text, including word segmentation, cleaning, and normalization. Text representation converts words into vectors or matrices so that computers can process them. Feature engineering is divided into heuristics, machine learning-based methods, and deep learning-based methods. Deep learning-based approaches can be divided into text-based representations [4] and interactive representations [4] based on text and labels, depending on whether the model introduces label information to represent the text.

2.1.1 Text representation

Text-based representations focus on converting text into a machine-understandable form for subsequent natural language processing tasks. Interactive representations, on the other hand, model the interaction between the text and the labels, so that label information contributes to the representation of the text. Text-based and interactive representations are not mutually exclusive and can be used in combination. In some tasks, text-based representations are used first to convert individual texts into representation vectors, which are then combined with interactive representations for more accurate and comprehensive text understanding. For text-based representations, TextCNN [5] applies convolutional neural networks and uses multiple kernels of different sizes to extract key information in sentences. For interactive representations, LEAM [6] builds a semantic interaction matrix between texts and labels to obtain attention weights and thereby identify the most relevant labels.
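The idea of a text-label interaction matrix that yields attention weights can be sketched as follows. This is a simplified illustration in the spirit of LEAM, not the published model: it uses plain cosine similarity and omits the learned embeddings and the convolution over neighboring words, and the two-dimensional vectors are invented for the example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_attentive_text_vector(word_vecs, label_vecs):
    # Interaction matrix: one row per word, one column per label.
    interaction = [[cosine(w, l) for l in label_vecs] for w in word_vecs]
    # Score each word by its best-matching label, then normalize over words.
    weights = softmax([max(row) for row in interaction])
    dim = len(word_vecs[0])
    # The text vector is the attention-weighted average of the word vectors.
    text_vec = [sum(wt * w[d] for wt, w in zip(weights, word_vecs))
                for d in range(dim)]
    return weights, text_vec

# Toy 2-dimensional embeddings (hypothetical values).
words = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
labels = [[1.0, 0.0]]
weights, text_vec = label_attentive_text_vector(words, labels)
print(weights)
```

Words that align with some label embedding receive larger weights, so they dominate the resulting text vector.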

2.1.2 Deep learning models

Deep learning-based text representation acquires textual information automatically and includes word vector models and neural network models.

Word vector models based on distributed representations map words from a high-dimensional sparse space to a low-dimensional dense space, alleviating the problem of feature sparsity. Commonly used models include the static word vectors word2vec [7] and global vectors for word representation (GloVe) [8], and the dynamic word vector models embeddings from language models (ELMo) [9] and bidirectional encoder representations from transformers (BERT) [10]. Word2vec can be further subdivided into CBOW [7] and skip-gram. The input to CBOW [7] is the vectors of the words neighboring a central word, and the output is the vector of that central word. The input to the skip-gram model is the vector of the central word, and the output is the vector representations of its surrounding words. Skip-gram is generally reported to perform better than CBOW. GloVe [8] combines a statistical co-occurrence matrix with a sliding window, taking into account both local and global information: first, the co-occurrence matrix is constructed from the corpus, and second, the relationship between the word vectors and the co-occurrence matrix is modeled. ELMo [9] has a three-layer structure, with the first layer being a context-independent word embedding layer and the next two layers being bidirectional long short-term memory (Bi-LSTM) layers that extract contextual features of words, which effectively addresses the problem of words with multiple meanings. BERT uses the transformer as its main framework for capturing bidirectional relations in utterances and is trained in a multi-task fashion with two objectives, masked language modeling and next sentence prediction.
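The difference between the CBOW and skip-gram inputs and outputs can be illustrated by generating the training examples each one would see. This is a minimal sketch of example construction only, assuming a window of one word on each side; it does not train any vectors, and the function names are chosen for illustration.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=1):
    # CBOW: (neighboring words) -> central word
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=1):
    # Skip-gram: central word -> each neighboring word
    return [(tokens[i], neighbor)
            for i in range(len(tokens))
            for neighbor in context_window(tokens, i, window)]

sentence = ["deep", "learning", "for", "text"]
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)
print(cbow[1])        # (['deep', 'for'], 'learning')
print(skipgram[:3])
```

CBOW produces one example per position, whereas skip-gram produces one example per (center, neighbor) pair.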

Common neural network models include convolutional neural networks (CNN) [11], recurrent neural networks (RNN) [12], long short-term memory networks (LSTM) [1], and attention mechanisms [13]. A CNN uses convolutional kernels of different sizes to extract local contextual information from the text and stacks multiple convolutional and pooling layers to capture deeper textual information. In detail, the input layer obtains low-dimensional word vectors, the convolution layer extracts local information from the text, and the pooling layer reduces the feature dimension and helps prevent overfitting. Finally, the fully connected layer maps the text features to the label dimension, and the softmax layer normalizes the outputs into probabilities. An RNN uses its memory of earlier time steps to represent text content: it accepts text sequences of arbitrary length and generates a fixed-length vector. However, vanishing or exploding gradients prevent RNNs from effectively learning long-term dependencies. To address this, the LSTM adds forget gates, input gates, and output gates to the RNN, which alleviates the gradient problem. The models above assign the same weight to all words and cannot distinguish their importance. Inspired by human attention, the attention mechanism focuses on key information, allowing the model to concentrate on the highly weighted parts of the text and improve classification accuracy. Attention mechanisms are usually divided into three categories: local attention, global attention, and self-attention. Global attention considers all words of the text and assigns each a weight between 0 and 1 to obtain the text representation. Local attention assigns a weight of either 0 or 1 to each word, directly discarding irrelevant items. Self-attention assigns weights based on the interactions among the input words and has the advantage of parallel computation in long text classification.
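How self-attention derives weights from the interaction of the input words can be sketched with scaled dot-product attention. For brevity this sketch assumes the queries, keys, and values are the input vectors themselves; real models apply learned projections first. The input vectors are invented for the example.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for query in X:
        # Interaction of this word with every word in the sequence.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in X]
        weights = softmax(scores)
        all_weights.append(weights)
        # New representation: weighted sum of all word vectors.
        outputs.append([sum(w * value[j] for w, value in zip(weights, X))
                        for j in range(d)])
    return outputs, all_weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy word vectors
contextual, attn = self_attention(X)
print(attn[0])
```

Each row of weights depends only on the inputs and not on earlier outputs, so all rows can be computed independently, which is the source of the parallelism mentioned above.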

In conclusion, both word vector models and neural network models are important components of deep learning-based text representation techniques, and they each have their own advantages and can be selected according to the needs of specific tasks. Word vector models focus more on the static representation of words, while neural network models are better able to capture the dynamic information of the context. Word vector models are relatively fast to train, while neural network models usually require larger computational resources and longer training time. Neural network models may perform better on some complex tasks, but for some simple tasks, word vector models are effective enough.

2.2 Extreme multi-label text classification

Extreme multi-label text classification learns a classifier that assigns to a document the most relevant subset of labels from a very large label set. The main challenge is scale: millions of labels, features, and training points. Because of the high computational cost brought by such large label sets, existing MLTC techniques have difficulty solving the XMTC problem. The large label space and feature space lead to two pressing problems. The first is the power-law distribution of labels: long-tailed labels have very little data associated with them, which makes it difficult to learn dependencies between labels and causes data sparsity and scalability issues. The second is that computation is expensive, and data augmentation techniques have been used to obtain comparable results at lower cost. Current research architectures for XMTC can be divided into four main categories: one-vs-all models, embedding-based models, tree-based models, and deep learning models. One-vs-all models train a separate classifier for each label on the entire dataset. They usually classify with high accuracy; however, they assume that the individual labels are independent of each other, and their cost grows linearly with the number of labels. Embedding-based models typically use the relationships between labels to map them from a high-dimensional space to a low-dimensional space through a linear projection, which reduces the total number of model parameters and the training time. The limitation of the embedding approach is that it ignores the correlation between input and output, so the two embeddings are not aligned. Tree-based models learn instance trees or label trees to make predictions, for example decision trees, random forests, and Huffman trees. Traditional tree-based approaches can suffer in performance because of large tree height and large cluster size.
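The linear growth in cost of the one-vs-all approach can be seen in a small prediction sketch: every label has its own independent binary classifier, and all of them must be evaluated for each document. The labels, weights, and feature vector below are invented for illustration, and training is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_all_scores(x, classifiers):
    # One dot product per label: cost is O(num_labels * num_features).
    return {
        label: sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for label, (weights, bias) in classifiers.items()
    }

def top_k(scores, k):
    """Return the k labels with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical per-label linear classifiers: (weights, bias).
classifiers = {
    "python": ([2.0, 0.0, 0.0], -1.0),
    "java": ([0.0, 2.0, 0.0], -1.0),
    "cooking": ([0.0, 0.0, 2.0], -1.0),
}
x = [1.0, 0.5, 0.0]   # toy document features
scores = one_vs_all_scores(x, classifiers)
predicted = top_k(scores, 2)
print(predicted)      # ['python', 'java']
```

With millions of labels the loop over classifiers dominates the cost, which motivates the embedding-based and tree-based alternatives.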

All three types of models mentioned above are based on bag-of-words representations of text, where words are treated as independent features out of context and cannot capture deep semantic information. In contrast, deep learning models can automatically extract implicit contextual features from raw text for extreme multi-label text classification.

Among typical work, XML-CNN [14] first explored the application of deep learning to XMTC. It uses convolutional neural networks with dynamic max pooling layers to extract semantic features of text and introduces a hidden bottleneck layer to reduce model parameters and accelerate training; however, XML-CNN [14] cannot capture the most important subtext for each label. AttentionXML [15] addresses this problem with two techniques. First, a multi-label attention mechanism captures the most relevant parts of the text for each label. Second, a shallow and wide probabilistic label tree is built to handle millions of labels. LightXML [16] adopts BERT as the text encoder and obtains a better text representation, and is reported as a state-of-the-art extreme multi-label text classification model. DeepXML [17] is a framework that decomposes XMTC into four subtasks. Optimizing these subtasks with different components yields a series of algorithms, including Astec [17], DECAF [18], GalaXC [19], and ECLARE [20]. Astec [17] uses label clustering to obtain intermediate feature representations. DECAF [18] jointly learns model parameters and feature representations that make use of label metadata. GalaXC [19] introduces a label attention mechanism to make more accurate predictions based on the multi-resolution node embeddings given by the graph. ECLARE [20] enables collaborative learning using label-label correlations.
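Dynamic max pooling, as used in XML-CNN, differs from ordinary max-over-time pooling in that it keeps one maximum per segment of the feature map instead of a single maximum for the whole document. The sketch below illustrates the pooling step alone on a one-dimensional list; the function name and input values are chosen for the example.

```python
def dynamic_max_pool(feature_map, num_chunks):
    """Split a 1-D feature map into contiguous chunks and keep the max of each."""
    n = len(feature_map)
    if not 1 <= num_chunks <= n:
        raise ValueError("num_chunks must be between 1 and len(feature_map)")
    # Integer boundaries; every chunk is non-empty because num_chunks <= n.
    bounds = [i * n // num_chunks for i in range(num_chunks + 1)]
    return [max(feature_map[bounds[i]:bounds[i + 1]])
            for i in range(num_chunks)]

conv_output = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4]       # toy output of one filter
pooled = dynamic_max_pool(conv_output, 3)          # one value per third of the text
global_max = dynamic_max_pool(conv_output, 1)      # ordinary max-over-time pooling
print(pooled)       # [0.9, 0.3, 0.8]
print(global_max)   # [0.9]
```

Keeping several values per filter preserves coarse positional information, at the price of a larger feature vector, which is one reason a bottleneck layer is useful afterward.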

In summary, one-vs-all models are simple and intuitive and can be used flexibly with a variety of binary classification algorithms but ignore the correlation between labels, which may lead to inaccurate classification. Embedding-based models capture semantic information but do not directly model the correlation between labels. Tree-based models are able to handle high-dimensional and nonlinear data and can capture correlations between nested features and labels. Deep learning models are capable of learning complex feature representations and contextual correlations and are suitable for large-scale data and complex tasks.

3. Advancement in sentiment analysis

This section introduces aspect-based sentiment analysis (ABSA) and multimodal sentiment analysis, two tasks within sentiment analysis, which is a classical task in natural language processing. We mainly introduce deep learning techniques, since they outperform earlier machine learning methods and are the mainstream approach in the field.

3.1 Aspect-based sentiment analysis

The concept of ABSA was first introduced in 2010 by Thet et al. [21]. In 2012, Liu [22] gave a definition: sentiment analysis and opinion mining refer to the field of research that analyzes people’s opinions, sentiments, evaluations, attitudes, and emotions from written language. From 2014 to 2016, SemEval, an international semantic evaluation conference, included the ABSA task as one of its subtasks and provided a series of manually annotated benchmark datasets [23, 24]. In recent years, aspect-based sentiment analysis has received attention from many scholars, especially after the rapid adoption of deep learning and related technologies in data mining, information retrieval, and question answering. As a result, research on deep learning-based ABSA has continued to achieve breakthroughs [25, 26, 27, 28, 29], and the ABSA task has gradually become one of the popular research topics in NLP (Figure 2).

Figure 2.

The working effect of ABSA.

The main advantage of aspect-based sentiment analysis is that it is fine-grained. Coarse-grained sentiment analysis can often capture only a single overall sentiment tendency and cannot analyze the details of each attribute. A review text often contains sentiment views on different evaluation objects, for example, “the service of this restaurant is good, but the taste is bad.” This review evaluates the two aspects “service” and “taste” separately, and document-level or sentence-level sentiment analysis cannot mine each aspect separately. Therefore, aspect-based sentiment analysis is needed for review texts that contain multiple aspects [30, 31].
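The output expected from ABSA on the restaurant review above can be made concrete with a toy example. The sketch below is deliberately not a deep learning model: it uses a hand-written aspect list and opinion lexicon (both hypothetical) and pairs each aspect with its nearest opinion word, only to show the aspect-to-polarity structure that an ABSA system produces.

```python
# Hypothetical aspect terms and opinion lexicon, for illustration only.
ASPECTS = {"service", "taste"}
OPINIONS = {"good": "positive", "bad": "negative"}

def toy_absa(review):
    """Pair each aspect term with the polarity of the nearest opinion word."""
    tokens = [t.strip(".,!?").lower() for t in review.split()]
    opinion_positions = [(i, OPINIONS[t]) for i, t in enumerate(tokens)
                         if t in OPINIONS]
    results = {}
    for i, token in enumerate(tokens):
        if token in ASPECTS:
            nearest = min(opinion_positions,
                          key=lambda pos: abs(pos[0] - i),
                          default=None)
            if nearest is not None:
                results[token] = nearest[1]
    return results

review = "The service of this restaurant is good, but the taste is bad."
aspect_sentiments = toy_absa(review)
print(aspect_sentiments)   # {'service': 'positive', 'taste': 'negative'}
```

A sentence-level classifier would have to return a single polarity for this review, whereas the aspect-level output keeps the two opposing opinions apart.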

Sentiment analysis methods based on deep learning can be divided into four main types: methods with a single neural network, methods with a hybrid neural network, methods that introduce attention mechanisms, and methods that use pre-trained models.

Sentiment analysis methods with a single neural network apply one neural network model [32, 33] (e.g., CNN or RNN). A CNN is mainly used to extract local features of text data: it abstracts low-dimensional vectors into vector representations with high-level semantics through operations such as convolution and pooling, and then processes the encoded representations and outputs the results. Lu et al. [34] made full use of syntactic relations and sentiment dependency information and proposed an aspect-gated graph convolutional network (AGGCN) for aspect-based sentiment analysis. Liang et al. [35] made full use of dependency syntactic knowledge and designed a dependency-embedded graph convolutional network for end-to-end sentiment analysis. Wang et al. [36] proposed a unified position-aware convolutional neural network (UP-CNN) to address the difficulty of fully utilizing aspect position information.

Attention mechanisms have been actively used in ABSA tasks because different parts of the text differ in importance for a given aspect, and attention mechanisms can adaptively identify key information and focus on it [37, 38, 39, 40]. Liao et al. [41] use a bidirectional transformer-based RoBERTa model to extract features from the text and the aspect word strings and use a cross-attention mechanism to attend to the features most relevant to a given aspect category.

3.2 Multimodal sentiment analysis

With the rapid development of information and network technology and the widespread use of mobile terminals, the content people publish is becoming increasingly diverse. The messages they publish about different events and topics are no longer limited to text alone; people tend to publish multimodal content that combines text and images to express their feelings and opinions. This trend has attracted academic attention to multimodal sentiment analysis. Analyzing the sentiment tendency implied by multimodal data has great application value in box office prediction, product marketing, political elections, product recommendation, mental health analysis, and other areas. Therefore, multimodal sentiment analysis has become a hot research topic in recent years [42, 43]. Multimodal sentiment analysis is the process of combining documents that describe the same thing in different forms (e.g., sound, image, and text) to enrich our perception of the thing and analyze the sentiment it expresses. In academic research, the term modality is generally associated with the sensory modalities that represent our primary communication and sensory channels, and a research question or dataset that contains multiple modalities is characterized as a multimodal task or multimodal dataset. In general, academics have focused on (but are not limited to) three modalities: (1) natural language, both spoken and textual, (2) visual signals, often represented by images or videos, and (3) acoustic signals, such as intonation and audio. Multimodal learning is a dynamic multidisciplinary field that is breaking new ground in many tasks, such as multimodal sentiment analysis, cross-modal retrieval, image captioning, audiovisual speech recognition, visual question answering, and visual speech recognition (Figure 3).

Figure 3.

The working effect of MSA.

Multimodal sentiment analysis makes full use of data from different modalities for accurate sentiment prediction. In 2016, a cross-modality consistent regression (CCR) model was proposed [44]. Its authors assumed that the overall sentiment expressed by the text, by the image, and by their combination is consistent; the text included descriptions and captions of the images, visual features were learned with CNNs, and the model outperformed unimodal models. In the same year, work [45] proposed a tree-structured recursive neural network (TreeLSTM) that incorporates a visual attention mechanism. The system builds a tree structure from sentence parsing in order to align text words with image regions, and combines LSTM with attention to learn a robust joint visual-textual representation, achieving the best results at the time. In addition, image-text mismatch and the flaws of social media data, such as colloquial language, misspellings, and missing punctuation, pose a challenge to sentiment analysis of multimodal data. To address this challenge, in 2017 Xu et al. constructed different multimodal sentiment analysis networks, such as the hierarchical semantic attentional network (HSAN) [46] and the multimodal deep semantic network (MultiSentiNet) [47]. HSAN is a hierarchical semantic network model that uses image captions to extract visual semantic features as additional information for the text. MultiSentiNet, in contrast, extracts image features from both objects and scenes and proposes a visual feature-guided attention LSTM to extract words that contribute to the understanding of text sentiment, aggregating these words with the visual semantic features of objects and scenes. In 2018, the co-memory network (CoMN) [48] was proposed, which models the interdependence between vision and text through memory networks to fully consider the interrelationship between multimodal data. In 2020, the multi-view attentional network (MVAN) [49] used a continuously updated memory network to obtain deep semantic features of images and texts. Its authors found that existing datasets for multimodal sentiment analysis were generally labeled only with positive, negative, and neutral polarities and that image-text multimodal datasets for finer-grained sentiment classification were lacking, so they constructed a large-scale image-text multimodal dataset (TumEmo) from social media data. Cheema et al. proposed a simple and effective multimodal neural network, the Sentiment Multi-Layer Neural Network (Se-MLNN) [50], which uses RoBERTa to extract text features containing contextual information and extracts multiple high-level image features from multiple perspectives, then fuses these features to predict the overall sentiment.
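Several of the models above share a final step in which text and image features are fused before the sentiment is predicted. The sketch below shows the simplest variant, concatenation followed by a linear softmax classifier. It does not reproduce any specific model from this section, and the feature values and weights are invented for the example.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(text_feat, image_feat, weights, biases):
    """Concatenate modality features, then apply a linear layer and softmax."""
    fused = text_feat + image_feat              # list concatenation
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return fused, softmax(logits)

POLARITIES = ["negative", "neutral", "positive"]
text_feat = [0.2, 0.9]       # e.g. output of a text encoder (hypothetical values)
image_feat = [0.7]           # e.g. output of an image encoder (hypothetical values)
weights = [[-1.0, -1.0, -1.0],   # one row of weights per polarity
           [0.0, 0.0, 0.0],
           [1.0, 1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

fused, probs = fuse_and_classify(text_feat, image_feat, weights, biases)
prediction = POLARITIES[probs.index(max(probs))]
print(prediction)   # positive
```

Attention-based and memory-based models replace the plain concatenation with learned interactions between the modalities, but the classifier on top of the fused representation plays the same role.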

4. Advancement in task-oriented dialog system

This section introduces the task-oriented dialog system, including the pipeline mode and the end-to-end mode (Figure 4).

Figure 4.

Task-oriented dialog system.

4.1 Pipeline mode

Task-oriented dialog systems aim to process user messages accurately and impose strict constraints on responses. Therefore, a pipeline method is used to generate responses in a controllable way. It is mainly divided into four parts: natural language understanding, dialog state tracking, dialog strategy learning, and natural language generation. The natural language understanding module converts the original user messages into semantic slots and classifies the domain and user intent. The dialog state tracking module iteratively updates the dialog state based on the current input and the dialog history. The dialog state includes relevant user actions and slot-value pairs. The dialog strategy learning module decides the next action of the dialog agent according to the updated dialog state. Finally, the natural language generation module converts the selected dialog actions into natural language for feedback to users. For example, in the movie ticket reservation task, the agent interacts with the movie knowledge base to retrieve movie information under specific constraints [51], such as movie name, time, and cinema.
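The flow of information through the four pipeline modules can be sketched as a chain of functions. Everything below is a hypothetical stand-in: each module is a hand-written rule where a real system would use a trained neural model, and the movie titles, times, and function names are invented for the example.

```python
# Hypothetical vocabulary for a toy movie-booking domain.
KNOWN_MOVIES = {"inception", "arrival"}
KNOWN_TIMES = {"7pm", "9pm"}
REQUIRED_SLOTS = ["movie", "time"]

def understand(utterance):
    """NLU stand-in: map the utterance to domain, intent, and slots."""
    slots = {}
    for token in utterance.lower().replace(".", "").split():
        if token in KNOWN_MOVIES:
            slots["movie"] = token
        if token in KNOWN_TIMES:
            slots["time"] = token
    return {"domain": "movie", "intent": "book_ticket", "slots": slots}

def track_state(state, frame):
    """DST stand-in: merge the new slots into the running dialog state."""
    updated = dict(state)
    updated.update(frame["slots"])
    return updated

def choose_action(state):
    """Policy stand-in: request the first missing slot, else confirm."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("request", slot)
    return ("confirm", None)

def generate(action, state):
    """NLG stand-in: turn the dialog action into a sentence."""
    act, slot = action
    if act == "request":
        return f"Which {slot} would you like?"
    return f"Booking {state['movie']} at {state['time']}."

def respond(state, utterance):
    state = track_state(state, understand(utterance))
    return state, generate(choose_action(state), state)

state = {}
state, reply1 = respond(state, "I want to see Inception")
state, reply2 = respond(state, "At 9pm please")
print(reply1)
print(reply2)
```

Because each stage has an explicit input and output, the response stays controllable: the system never confirms a booking until the policy sees every required slot in the state.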

4.1.1 Natural language understanding

Natural language understanding, which converts user-generated natural language messages into semantic slots and classifies them, has a significant impact on the response quality of the whole system. Three tasks are involved: domain classification, intent detection, and slot filling. Domain classification determines which particular domain or topic the user input belongs to. It categorizes the user’s text into predefined domains, such as hotel booking, flight enquiry, or weather information. Once the subject domain of the input is identified, the input can be passed to the appropriate processing module for further parsing. Intent detection determines the user’s intent or purpose within a particular domain. It focuses on the purpose behind the user’s input rather than just the input text itself. For example, in the hotel booking domain, a user may have different intents, such as finding a hotel, booking a hotel, or canceling a booking. The goal of intent detection is to identify the specific intent of the user so that the system can take the appropriate action or provide the correct response. Slot filling is the process of identifying and extracting key information from user input that is relevant to a specific domain. Slots are usually parameters or variables related to the intent, such as date, location, person’s name, or price. Through slot filling, the system can capture and record the specific information provided by the user in a particular domain. For example, in the hotel reservation domain, slots may include check-in date, check-out date, location, and room type.

Domain classification and intent detection are both classification tasks, and deep learning has been applied to them in several ways. One approach builds a deep convex network [52], which combines the prediction of the preceding network with the current dialog input as the overall input of the current network. To ease the difficulty of training deep neural networks to predict domains and intents, some scholars used restricted Boltzmann machines and deep belief networks to initialize the parameters of the deep neural networks [53]. To exploit the strength of recurrent neural networks (RNN) in sequence processing, some work used RNNs as dialog encoders to predict intent and domain categories [54]. Some scholars have proposed a short-text intent classification model: because a single dialog turn carries little information, it is difficult to identify the intent of short utterances, so an RNN or CNN structure is used to fuse the dialog history and supply the resulting context as additional input for the current turn [55]. This model has achieved good performance in intent classification tasks. More recently, BERT models pre-trained on task-oriented dialogs have achieved high accuracy in intent detection and can effectively alleviate the shortage of data in specific domains.

Slot filling, also known as semantic tagging, is a sequence labeling problem in which the model must predict a label for every token of the input at the same time. Deep belief networks show good ability in deep structure learning, and some scholars built a sequence tagger based on a deep belief network. In addition to the named entity recognition features used in traditional taggers, they also used part-of-speech and syntactic features as part of the input. Recurrent structures benefit sequence tagging tasks because they carry information across past time steps and thereby make the most of sequence information. Some scholars first proposed that RNN language models can be applied to sequence tagging rather than only to predicting words [56]: the outputs of the RNN are then the sequence labels corresponding to the input words instead of ordinary words. Some scholars further studied the impact of different recurrent structures on slot filling and found that all RNN models outperform the simple conditional random field method [57]. Because the shallow output representation of traditional semantic annotation cannot express structured dialog information, the slot filling task has also been treated as a template-based tree decoding process that iteratively generates and fills templates [58].
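When slot filling is cast as sequence tagging, the tagger emits one label per input word, and the slots are read off the label sequence afterward. The sketch below shows that decoding step, assuming the widely used BIO tagging scheme (B- begins a slot, I- continues it, O is outside any slot); the scheme, the example utterance, and the tags are assumptions for illustration, and the tagger itself is not shown.

```python
def decode_bio(tokens, tags):
    """Collect (slot_name, value) pairs from a BIO-tagged token sequence."""
    slots = []
    name, words = None, []

    def flush():
        if name is not None:
            slots.append((name, " ".join(words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                          # close any open slot
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            words.append(token)              # continue the open slot
        else:
            flush()                          # "O" or inconsistent tag ends the slot
            name, words = None, []
    flush()
    return slots

tokens = ["book", "a", "room", "in", "new", "york", "for", "friday"]
tags = ["O", "O", "O", "O", "B-location", "I-location", "O", "B-check_in_date"]
slots = decode_bio(tokens, tags)
print(slots)   # [('location', 'new york'), ('check_in_date', 'friday')]
```

The tagger's job is therefore a per-token classification, which is why recurrent models that carry context across time steps fit the task well.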

4.1.2 Dialog state tracking

Dialog state tracking (DST) is the first module of the dialog manager. In each turn, it tracks the user’s goals and relevant details according to the entire dialog history, providing the strategy learning module with the information needed for decision-making. Natural language understanding and dialog state tracking are closely related, since both need to fill slots with dialog information [59]. However, they play two different roles. The natural language understanding module classifies the current user message, for example by recognizing its intent and domain and the slot to which each token of the message belongs, whereas the dialog state tracking module maintains the dialog state over the whole dialog history.

Neural dialog state trackers fall into two lines of work. The first, which assumes fixed slot names and values, can be considered a multi-class classification task: the tracker selects the correct value from a set of candidate values. Some scholars used an RNN as a neural tracker to capture the dialog context [60]. The tracker finally makes a binary prediction for the current slot-value pair based on the dialog history. The second line, neural trackers with unfixed slot names and values, attracts more attention because it not only reduces the model and time complexity of DST tasks but also helps to train task-oriented dialog systems end-to-end. Some scholars proposed the belief span, that is, the span of text in the dialog context that corresponds to a specific slot [61]. They built a two-stage CopyNet to copy and store slot values from the dialog history in preparation for neural response generation. The belief span facilitates end-to-end training of the dialog system and improves tracking accuracy on out-of-vocabulary values. Building on this, some scholars proposed the minimal belief span, arguing that generating belief states from scratch does not scale when the system interacts with APIs from different sources [62]. Some scholars proposed the TRADE model, which also applies a copy mechanism and uses a soft-gated pointer generator to generate slot values from the domain-slot pair and the encoded dialog context [63].
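The classification-style tracker can be illustrated with a minimal sketch. It assumes that some upstream model has already scored each candidate value of each slot for the current turn; the function name `update_state`, the score dictionaries, and the `threshold` parameter are hypothetical and are not taken from the cited work. The sketch only shows how per-turn predictions might be accumulated into a dialog state.

```python
from typing import Dict

DialogState = Dict[str, str]


def update_state(state: DialogState,
                 turn_scores: Dict[str, Dict[str, float]],
                 threshold: float = 0.5) -> DialogState:
    """Return a new dialog state after one turn (illustrative only).

    turn_scores maps each slot to scores over its candidate values for the
    current turn. A slot is overwritten only when its best candidate scores
    at or above `threshold`; otherwise the earlier value is kept.
    """
    new_state = dict(state)
    for slot, candidates in turn_scores.items():
        if not candidates:
            continue
        best_value = max(candidates, key=candidates.get)
        if candidates[best_value] >= threshold:
            new_state[slot] = best_value
    return new_state


state: DialogState = {}
# Turn 1: the user clearly asks for Italian food; the area is still unclear.
state = update_state(state, {"food": {"italian": 0.9, "thai": 0.1},
                             "area": {"north": 0.3, "south": 0.4}})
# Turn 2: the user names the area; the weak food scores leave "food" untouched.
state = update_state(state, {"food": {"italian": 0.2, "thai": 0.3},
                             "area": {"north": 0.2, "south": 0.8}})
print(state)
```

Keeping the previous value when no candidate is confident is one simple design choice; span-based trackers such as those in the second line of work generate the values from the dialog text instead of choosing from a fixed list.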

4.1.3 Natural language generation

Natural language generation (NLG) is the last module in the pipeline mode of a task-oriented dialog system. It converts the dialog actions generated by the dialog manager into the final natural language response. The standard NLG pipeline is composed of four components, and its core components are content determination, sentence planning, and surface realization.

Deep learning methods have been applied to further enhance NLG performance by folding the pipeline into a single module. End-to-end natural language generation has made encouraging progress and is now the most popular way to implement NLG. Some scholars argued that natural language generation should be completely data-driven and not rely on any expert rules [64]. They proposed a statistical language model based on an RNN, which uses semantic constraints and syntax trees to learn response generation. In addition, they used a CNN reranker to select better responses among the candidates. Similarly, some scholars used an LSTM model to learn sentence planning and surface realization at the same time. Some scholars used a GRU to further improve generation quality across multiple domains [65]; the proposed generator consistently produces high-quality responses on multiple domains. To improve the domain adaptability of recurrent models, some scholars proposed to first train the recurrent language model on data synthesized from out-of-domain data sets and then fine-tune it on a relatively small in-domain data set. This training strategy has proved effective in human evaluation [66].
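The idea of generating several candidate responses and then reranking them can be sketched as follows. The scoring function here is a deliberately simple stand-in that counts how many slot values of the dialog act are missing from a candidate; it is not the CNN reranker of the cited work, and the names `slot_errors` and `rerank` as well as the example dialog act are hypothetical.

```python
from typing import Dict, List


def slot_errors(candidate: str, dialog_act: Dict[str, str]) -> int:
    """Count slot values from the dialog act that the candidate leaves out."""
    text = candidate.lower()
    return sum(1 for value in dialog_act.values() if value.lower() not in text)


def rerank(candidates: List[str], dialog_act: Dict[str, str]) -> List[str]:
    """Order candidates by fewest missing slot values, then by shorter length."""
    return sorted(candidates,
                  key=lambda c: (slot_errors(c, dialog_act), len(c)))


dialog_act = {"name": "Luigi's", "food": "Italian", "area": "north"}
candidates = [
    "Luigi's is a nice place.",
    "Luigi's serves Italian food in the north.",
    "There is an Italian restaurant in the north.",
]
for sentence in rerank(candidates, dialog_act):
    print(slot_errors(sentence, dialog_act), sentence)
```

In a neural generator the candidates would be sampled from the language model and the reranker would be learned; the sketch keeps only the over-generate-then-select structure.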

4.2 End-to-end mode

In an end-to-end task-oriented dialog system, a complex neural network model implicitly represents the key functions, and all modules are integrated into one rather than chained in a pipeline. Research on task-oriented end-to-end neural models mainly focuses on training methods and model architecture, which are the keys to response correctness and quality [67].

Some scholars proposed an incremental learning framework to train their end-to-end task-oriented system [61]. The main idea is to build an uncertainty estimation module that evaluates the confidence of each generated response. If the confidence score is higher than a threshold, the response is accepted; if it is lower, a human response is introduced instead. The agent can also use online learning to learn from these human responses. Some scholars used model-agnostic meta-learning (MAML) to jointly improve adaptability and reliability when, as in real-life online service tasks, only a few training samples are available [68]. Similarly, some scholars used MAML to train an end-to-end neural model for domain adaptation, which lets the model train first on resource-rich tasks and then on the limited data of a new task [59]. Other scholars trained an inconsistent order detection module in an unsupervised manner [63]. The module detects whether the order of utterances is consistent, with the aim of generating more coherent responses.
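The confidence gate in the incremental learning framework can be sketched in a few lines. The sketch assumes that the dialog model returns a response together with a confidence score in [0, 1] and that a human operator can be asked for a reply; the class name `ConfidenceGatedAgent`, the callables `model` and `ask_human`, and the default threshold are all hypothetical and only illustrate the control flow described above.

```python
from typing import Callable, List, Tuple


class ConfidenceGatedAgent:
    """Accept confident model responses, otherwise defer to a human."""

    def __init__(self,
                 model: Callable[[str], Tuple[str, float]],
                 ask_human: Callable[[str], str],
                 threshold: float = 0.7) -> None:
        self.model = model
        self.ask_human = ask_human
        self.threshold = threshold
        # Human-answered turns, kept as new training data for online learning.
        self.new_examples: List[Tuple[str, str]] = []

    def respond(self, utterance: str) -> Tuple[str, str]:
        """Return (response, source), where source is "model" or "human"."""
        response, confidence = self.model(utterance)
        if confidence > self.threshold:
            return response, "model"
        human_response = self.ask_human(utterance)
        self.new_examples.append((utterance, human_response))
        return human_response, "human"


def toy_model(utterance: str) -> Tuple[str, float]:
    # Stand-in for a neural dialog model with an uncertainty estimate.
    if "book" in utterance:
        return "Your table is booked.", 0.9
    return "Sorry, could you repeat that?", 0.2


def toy_human(utterance: str) -> str:
    # Stand-in for a human operator.
    return "Let me check the refund policy for you."


agent = ConfidenceGatedAgent(toy_model, toy_human)
print(agent.respond("please book a table for two"))
print(agent.respond("can I get a refund?"))
print(agent.new_examples)
```

The examples stored in `new_examples` are what an online learning step would later use to update the model, so that similar requests no longer need human help.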

5. Conclusions

Most existing shallow and deep learning models, including ensemble approaches, have structures that can be used for text classification. BERT learns a form of language representation that can be fine-tuned for many downstream NLP tasks. The main ways to obtain better results are to add data, increase computational power, and design better training procedures. The trade-off between data and computational resources on the one hand and predictive performance on the other is worth investigating. Because data with full supervision cannot always be collected, MLTC is gradually turning to classification with limited supervision. Since the excellent performance of AlexNet in 2012, deep learning has shown great potential. How to leverage the powerful learning capabilities of deep learning to better capture label dependencies is key to solving MLTC tasks.

With the application of deep learning to sentiment analysis tasks, the performance of sentiment analysis has greatly improved. However, some tasks and scenarios still need richer data sets to evaluate models more accurately.

Although deep learning has achieved remarkable results in dialog systems, open problems remain. In the pipeline mode, accurate and fast recognition of user intent is still an industry demand; in the end-to-end mode, controllability and interpretability need further study.

References

  1. Graves A. Long short-term memory. In: Supervised sequence labelling with recurrent neural networks. Berlin: Springer; 2012. pp. 37-45
  2. Sakai Y, Matsuoka Y, Goto M. Purchasing behavior analysis model that considers the relationship between topic hierarchy and item categories. In: International Conference on Human-Computer Interaction. Cham: Springer; 2022. pp. 344-358
  3. Chen Z, Qian T. Transfer capsule network for aspect level sentiment classification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Washington: ACL; 2019. pp. 547-556
  4. Li Q, Peng H, Li J, Xia C, Yang R, Sun L, et al. A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology (TIST). 2022;13(2):1-41
  5. Chen Y. Convolutional neural network for sentence classification. [Master’s thesis], University of Waterloo. 2015
  6. Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X, et al. Joint embedding of words and labels for text classification. arXiv preprint arXiv:1805.04174. 2018
  7. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013
  8. Pennington J, Socher R, Manning CD. Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Toronto: ACL; 2014. pp. 1532-1543
  9. Sarzynska-Wawer J, Wawer A, Pawlak A, Szymanowska J, Stefaniak I, Jarkiewicz M, et al. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research. 2021;304:114135
  10. Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018
  11. Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188. 2014
  12. Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. 2014
  13. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30
  14. Liu J, Chang W-C, Wu Y, Yang Y. Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM; 2017. pp. 115-124
  15. You R, Zhang Z, Wang Z, Dai S, Mamitsuka H, Zhu S. Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. Advances in Neural Information Processing Systems. 2019;32
  16. Jiang T, Wang D, Sun L, Yang H, Zhao Z, Zhuang F. Lightxml: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. Toronto: AAAI; 2021. pp. 7987-7994
  17. Dahiya K, Saini D, Mittal A, Shaw A, Dave K, Soni A, et al. Deepxml: A deep extreme multi-label learning framework applied to short text documents. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining. New York: ACM; 2021. pp. 31-39
  18. Mittal A, Dahiya K, Agrawal S, Saini D, Agarwal S, Kar P, et al. Decaf: Deep extreme classification with label features. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining. New York: ACM; 2021. pp. 49-57
  19. Saini D, Jain AK, Dave K, Jiao J, Singh A, Zhang R, et al. Galaxc: Graph neural networks with labelwise attention for extreme classification. In: Proceedings of the Web Conference 2021. New York: ACM; 2021. pp. 3733-3744
  20. Mittal A, Sachdeva N, Agrawal S, Agarwal S, Kar P, Varma M. Eclare: Extreme classification with label graph correlations. In: Proceedings of the Web Conference 2021. New York: ACM; 2021. pp. 3721-3732
  21. Thet TT, Na J-C, Khoo CSG. Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science. 2010;36(6):823-848
  22. Liu B, Zhang L. A survey of opinion mining and sentiment analysis. In: Aggarwal C, Zhai C, editors. Mining Text Data. Boston, MA: Springer; 2012
  23. Pontiki M, Galanis D, Papageorgiou H, Manandhar S, Androutsopoulos I. Semeval-2015 task 12: Aspect based sentiment analysis. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Toronto: ACL; 2015. pp. 486-495
  24. Pontiki M, Galanis D, Papageorgiou H, Androutsopoulos I, Manandhar S, Al-Smadi M, et al. Semeval-2016 task 5: Aspect based sentiment analysis. In: International Workshop on Semantic Evaluation. Toronto: ACL; 2016. pp. 19-30
  25. Do HH, Prasad PWC, Maag A, Alsadoon A. Deep learning for aspect-based sentiment analysis: A comparative review. Expert Systems with Applications. 2019;118:272-299
  26. Akhtar MS, Gupta D, Ekbal A, Bhattacharyya P. Feature selection and ensemble construction: A two-step method for aspect based sentiment analysis. Knowledge-Based Systems. 2017;125:116-135
  27. Peng H, Ma Y, Li Y, Cambria E. Learning multi-grained aspect target sequence for chinese sentiment analysis. Knowledge-Based Systems. 2018;148:167-176
  28. Tang F, Luoyi F, Yao B, Wenchao X. Aspect based fine-grained sentiment analysis for online reviews. Information Sciences. 2019;488:190-204
  29. Liu N, Shen B. Rememnn: A novel memory neural network for powerful interaction in aspect-based sentiment analysis. Neurocomputing. 2020;395:66-77
  30. Xiao D, Ren F, Pang X, Cai M, Wang Q, He M, et al. A hierarchical and parallel framework for end-to-end aspect-based sentiment analysis. Neurocomputing. 2021;465:549-560
  31. Zhou J, Zhao J, Huang JX, Qinmin Vivian H, He L. Masad: A large-scale dataset for multimodal aspect-based sentiment analysis. Neurocomputing. 2021;455:47-58
  32. Khasanah IN. Sentiment classification using fasttext embedding and deep learning model. Procedia Computer Science. 2021;189:343-350
  33. Basiri ME, Nemati S, Abdar M, Cambria E, Acharya UR. Abcdm: An attention-based bidirectional CNN-RNN deep model for sentiment analysis. Future Generation Computer Systems. 2021;115:279-294
  34. Qiang L, Zhu Z, Zhang G, Kang S, Liu P. Aspect-gated graph convolutional networks for aspect-based sentiment analysis. Applied Intelligence. 2021;51(7):4408-4419
  35. Liang Y, Meng F, Zhang J, Chen Y, Jinan X, Zhou J. A dependency syntactic knowledge augmented interactive architecture for end-to-end aspect-based sentiment analysis. Neurocomputing. 2021;454:291-302
  36. Wang X, Li F, Zhang Z, Guangluan X, Zhang J, Sun X. A unified position-aware convolutional neural network for aspect based sentiment analysis. Neurocomputing. 2021;450:91-103
  37. Li Z, Li L, Zhou A, Hongbin L. Jtsg: A joint term-sentiment generator for aspect-based sentiment analysis. Neurocomputing. 2021;459:1-9
  38. Qiannan X, Zhu L, Dai T, Yan C. Aspect-based sentiment classification with multi-attention network. Neurocomputing. 2020;388:135-143
  39. Chen Y, Zhuang T, Guo K. Memory network with hierarchical multi-head attention for aspect-based sentiment analysis. Applied Intelligence. 2021;51(7):4287-4304
  40. Yuming Lin YF, Li Y, Cai G, Zhou A. Aspect-based sentiment analysis for online reviews with hybrid attention networks. World Wide Web. 2021;24(4):1215-1233
  41. Liao W, Zeng B, Yin X, Wei P. An improved aspect-category sentiment analysis model for text sentiment analysis based on roberta. Applied Intelligence. 2021;51(6):3522-3533
  42. Kaur R, Kautish S. Multimodal sentiment analysis: A survey and comparison. Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines. IGI Global. 2022. pp. 1846-1870
  43. Soleymani M, Garcia D, Jou B, Schuller B, Chang S-F, Pantic M. A survey of multimodal sentiment analysis. Image and Vision Computing. 2017;65:3-14
  44. You Q, Luo J, Jin H, Yang J. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. New York: ACM; 2016. pp. 13-22
  45. You Q, Cao L, Jin H, Luo J. Robust visual-textual sentiment analysis: When attention meets tree-structured recursive neural networks. In: Proceedings of the 24th ACM International Conference on Multimedia. New York: ACM; 2016. pp. 1008-1017
  46. Nan X. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). Beijing, China: IEEE; 2017. pp. 152-154
  47. Xu N, Mao W. Multisentinet: A deep semantic network for multimodal sentiment analysis. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. New York: ACM; 2017. pp. 2399-2402
  48. Xu N, Mao W, Chen G. A co-memory network for multimodal sentiment analysis. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM; 2018. pp. 929-932
  49. Yang X, Feng S, Wang D, Zhang Y. Image-text multimodal emotion classification via multi-view attentional network. IEEE Transactions on Multimedia. 2020;23:4014-4026
  50. Cheema GS, Hakimov S, Müller-Budack E, Ewerth R. A fair and comprehensive comparison of multimodal tweet sentiment analysis methods. In: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding. New York: ACM; 2021. pp. 37-45
  51. Masi I, Tran AT, Leksut JT, Hassner T, Medioni G. Do we really need to collect millions of faces for effective face recognition? In: Computer Vision. Cham: Springer; 2016. pp. 579-596
  52. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representations. New York: ACM; 2014. pp. 46-57
  53. Campagna G, Foryciarz A, Moradshahi M, Lam MS. Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking. 2020
  54. Chen J, Zhang R, Mao Y, Xu J. Parallel interactive networks for multi-domain dialogue state generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Toronto: ACL; 2020. pp. 17-26
  55. Chen H, Liu X, Yin D, Tang J. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter. 2017;19(2):25-35
  56. Gliwa B, Mochol I, Biesek M, Wawer A. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. In: Proceedings of the 2nd Workshop on New Frontiers in Summarization. New York: ACM; 2019. pp. 38-49
  57. Wen TH, Gasic M, Kim D, Mrksic N, Su PH, Vandyke D, et al. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Toronto: ACL; 2015. pp. 275-284
  58. Wen TH, Gasic M, Mrksic N, Rojas-Barahona LM, Su PH, Ultes S, et al. Conditional generation and snapshot learning in neural dialogue systems. 2016
  59. Wen TH, Vandyke D, Mrksic N, Gasic M, Rojas-Barahona LM, Su PH, et al. A network-based end-to-end trainable task-oriented dialogue system. 2016
  60. Williams J. Multi-domain learning and generalization in dialog state tracking. In: Proceedings of the SIGDIAL 2013 Conference. Toronto: ACL; 2013. pp. 433-441
  61. Williams JD, Asadi K, Zweig G. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Toronto: ACL; 2017. pp. 665-677
  62. Tamar A, Yi W, Thomas G, Levine S, Abbeel P. Value iteration networks. In: Twenty-Sixth International Joint Conference on Artificial Intelligence. New York: ACM; 2017. pp. 246-257
  63. Loni B. A survey of state-of-the-art methods on question classification. In: Proceedings of the 7th Workshop on Ph.D Students. New York: ACM; 2011
  64. Tao C, Mou L, Zhao D, Rui Y. Ruber: An unsupervised method for automatic evaluation of open-domain dialog systems. 2017
  65. Tao C, Wu W, Xu C, Hu W, Yan R. One time of interaction may not be enough: Go deep with an interaction-over-interaction network for response selection in dialogues. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Toronto: ACL; 2019. pp. 189-197
  66. Tran VK, Nguyen LM. Semantic Refinement Gru-Based Neural Language Generation for Spoken Dialogue Systems. Singapore: Springer; 2017
  67. Tur G, Hakkani-Tur D, Heck L. What is left to be understood in atis? In: Spoken Language Technology Workshop (SLT). New York: IEEE; 2011. pp. 236-247
  68. Lu C, Xiang Z, Cheng C, Yang R, Kai Y. Agent-aware dropout dqn for safe and efficient on-line dialogue policy learning. In: The 2017 Conference on Empirical Methods on Natural Language Processing. Toronto: ACL; 2017. pp. 127-137
