Nowadays, emoji play a fundamental role in human computer-mediated communication, allowing users to convey body language, objects, symbols, or ideas in text messages through Unicode-standardized pictographs and logographs. Emoji allow people to express emotions and personality more “authentically” by increasing the semantic content of visual messages. The relationship between language, emoji, and emotions is now studied by several disciplines, such as linguistics, psychology, natural language processing (NLP), and machine learning (ML). The last two, in particular, are employed for the automatic detection of emotions and personality traits and for building emoji sentiment lexicons, as well as for endowing artificial agents with the ability to express emotions through emoji. In this chapter, we introduce the concept of emoji; review the main challenges in using emoji as a proxy of language and emotions, together with the ML and NLP techniques used for the classification and detection of emotions through emoji; and present new trends for exploiting discovered emotional patterns in robotic emotional communication.
- machine learning
- natural language processing
- emotional communication
- human-robot interaction
Recently, in the episode “Smile” of the popular science fiction television program “Doctor Who,” a hypothetical off-Earth colony was presented. The colony is maintained and operated by robots, which communicate and express emotions with humans and with their peers through emoji. One may argue that such technology, besides being mere science fiction, is far-fetched, since phonetic communication is much simpler and easier to understand. While this is true for conventional information (e.g., explaining the concept of real numbers), communicating bodily emotional responses or gesticulation (e.g., describing confusion) using only phonograms would require many more words to convey the same message than an emoji (e.g.,
The Japanese word emoji (e = picture and moji = word) literally stands for “picture word.” Although only recently popularized, its older predecessors can be traced to the nineteenth century, when cartoons were employed in humorous writing. Smileys followed in 1964, printed on an insurance company’s promotional merchandise to improve the morale of its employees. Carnegie Mellon researchers were the first to employ the emoticon :) in an online bulletin board, in 1982, to denote humorous messages, and 10 years later emoticons were already widespread in emails and Websites. Finally, in 1998, Shigetaka Kurita devised emoji to improve emoticons pictorially, and they became widespread by 2010. From this moment, the use of emoji gained considerable momentum, even leading the emoji named “Face with Tears of Joy” (
Since its origin, emoji have undoubtedly become part of mainstream communication around the globe, allowing people with different languages and cultural backgrounds to share and interpret ideas and emotions more accurately. In this vein, it has been hypothesized that emoji shall become a universal language due to their generic communication features and ever-progressing lexicon [2, 5, 6, 7]. However, this idea is controversial [8, 9], since emoji usage during communication is influenced by factors such as context, users’ interrelations, users’ first language, repetitiveness, and socio-demographics, among others [2, 5, 8]. This clearly adds ambiguity to how emoji should be employed and interpreted. Nevertheless, in the same fashion as sentiment analysis mines sentiments, attitudes, and emotions from text, we can employ the billions of written messages on the Internet that contain emoji to generate affective responses in artificial entities. More precisely, using natural language processing (NLP) along with machine learning (ML), we can extract semantics, beat gestures, emotional states, and affective cues from text, add personality layers, and more. All this knowledge can be used to build, for instance, emoji sentiment lexicons that will conform the emoji communication competence powering the engines of emotional expression and communication in an artificial entity.
In the rest of this chapter, we first review the elements of the emoji code and how emoji are used in emotional expression and communication (Section 2). Afterward, in Section 3, we review the state of the art in the usage of NLP and ML to classify and predict the annotation and expression of emotions, gestures, affective cues, and so on, using written messages from multiple types of sources. In Section 4, we present several examples of how emoji are currently employed by artificial entities, both virtual and embodied, to express emotions during their interaction with humans. Lastly, Section 5 summarizes the chapter and discusses open questions regarding emoji usage as a source for robotic emotional communication.
2. Competence, lexicon, and ways of usage of emoji
To study how emoji are employed and the challenges they pose, we must first specify the emoji competence. Loosely speaking, competence (either linguistic or communicative) stands for the rules (e.g., grammar) and abilities an individual possesses to correctly employ a given language to convey a specific idea. Hence, the emoji competence stands for the adequate usage of emoji within messages, considering not only their representation but also their exact position within the message, to address a specific function (e.g., emotional expression, gestures, maintaining interest in the communication, etc.). Nevertheless, the emoji competence has not been formally defined yet, and it can only be developed through the usage of emoji themselves [2, 6]; here, we elaborate several of its components.
A key element of the emoji competence is the emoji lexicon, which is the standardization of pictograms (i.e., figures that resemble a real-world object), ideograms (i.e., figures that represent an idea), and logograms (i.e., figures that represent a sound or word) into anime-like graphical representations that belong to the ever-growing Unicode computer character standard [2, 6, 12]. These are employed within a message in three different ways: adjunctively, substitutively, or providing mixed textuality. In the first case, emoji appear along text at specific points of the written message (e.g., at its end), imbuing it with emotional tone or adding visual annotations; this requires an overall low emoji competence. In the second case, emoji replace words, requiring a higher degree of competence to understand not only the symbols per se but also the layout structure of the message, for instance, if we consider syntagms, which are symbols sequentially grouped that together conform a new idea (e.g., I love coffee =
The emoji lexicon possesses generic features such as representationality, which allows signs and usage rules to be combined in specific forms to convey a message. Similarly, any person who is well versed in the code’s signs and rules is capable of interpreting any message based on the code (i.e., interpretability). However, messages built using the emoji lexicon are affected by contextualization, whereby references, interpersonal relationships, and other factors affect the meaning of the message [4, 5]. Besides these, the emoji code is composed of a core and a peripheral lexicon [2, 5]. As in the Swadesh list, the core lexicon stands for those emoji whose meaning and usage are, somehow, universally accepted and used, even though Unicode supports more than 1000 different emoji. Within this core stand all facial emoji, including those that represent Ekman’s six basic emotions, such as surprise (
2.1 How do we use emoji?
Emoji within a message can have several functions; Figure 1 summarizes these. As shown there, one of the most important functions of emoji is emotivity, which adds an emotional layer to plain text communication. In this sense, emoji serve as a substitute for face-to-face (F2F) facial expressions, gestures, and body language, stating one’s emotional states, moods, or affective nuances. When used in this manner, emoji take the role of discourse strategies such as intonation or phrasing [2, 4, 15]. Emoji emotivity mostly conveys positive emotions; hence, it can be employed to emphasize a specific point of view, such as sarcasm, while softening the negative emotions associated with it (e.g., with respect to the target of the sarcasm), allowing the receiver of the message to focus on the content instead of the negativity elicited [2, 14].
Another important role of emoji is as a phatic instrument during communication [2, 16]. Here, they are employed as utterances that allow the flow of the conversation to unfold pleasantly and smoothly. For example, emoji serve as opening or closing utterances (e.g., a waving hand) to open or close a conversation, maintaining a positive dialog regardless of the content. Similarly, emoji can be used to fill uncomfortable moments of silence during a conversation, avoiding its abrupt interruption. Beat gestures are another function of emoji; a beat gesture can be defined as a repetitive, rhythmical co-speech gesture that emphasizes the rhythm of the speech. For instance, in the same way that nodding up and down during a conversation emphasizes agreement with the interlocutor, emoji can be repeated to convey the same meaning (e.g.,
3. Studying emoji usage using formal frameworks
Emoji usage has had a deep impact on humans’ computer-mediated communication (CMC). With the increasing use of social media platforms such as Facebook, Twitter, or Instagram, people now massively interchange messages and ideas through text-based chat tools that support emoji, imbuing these messages with semantic and emotional meaning. To analyze and extract comprehensive knowledge from data sets of emoji-embedded messages, many methods have been developed through a multidisciplinary approach involving ML along with NLP, psychology, robotics, and so on. Among the tasks addressed with ML algorithms for the analysis of emoji usage stand sentiment analysis [5, 19], polarity analysis [10, 20], sentiment lexicon building, utterance embeddings, and personality assessment, to mention a few. These applications are summarized in Table 1.
Table 1. Comparative table of the articles analyzed.

| Related papers | Problems addressed | Method | Emoji use | Emoji competence |
|---|---|---|---|---|
| | Emoji classification; correct emoji prediction | Matching utterance embeddings with emoji embeddings; 10-fold cross validation; shallow classifiers (SVM and LR) | Emoji for sentiment analysis | |
| | Image processing and computer vision to detect facial expressions | Auto-labeling using emoji sentiment; tweet data preprocessing | Emoticons as heuristic data | |
| | Emoji sentiment map and lexicon; emoji sentiment ranking | Discrete probability distribution approximation; automated analysis of social media content; correlation analysis among languages; Adaptive Boosted Decision Trees (ADT); 10-fold cross validation; Random Forests (RF) | | |
| | Emotion representation using emoji; Big Five personality assessment test using emoji | Exploratory Factor Analysis (EFA); Confirmatory Factor Analysis (CFA) | | |
| | Emoji as a co-speech element | | | |
| | Facial expression recognition | | Emoji usage for peer communication; emoji as social cues | |
The following section analyzes, from the point of view of ML algorithms, the tasks related to sentiment analysis through emoji: classification, comparison, polarity detection, data preprocessing of tweets with embedded emoji, and computer vision techniques for video processing to detect facial expressions.
3.1 Emoji classification and comparison
In recent years, deep learning (DL) has emerged as a new area of ML, involving new techniques for signal and information processing. These algorithms employ several nonlinear layers for information processing through supervised and unsupervised feature extraction and transformation for pattern analysis and classification. They also build multiple levels of representation, attaining models that can describe the complex relations within data. In particular, if data sets are considerably large, a deep-learning approach is the best option for reaching a well-trained model, regardless of whether the data are labeled or not [25, 26]. To date, ML algorithms that use shallow architectures show good performance and effectiveness for solving simple problems, for instance, linear regression (LR), support vector machines (SVM), the multilayer perceptron (MLP) with a single hidden layer, and decision trees such as random forests or ID3, among others. These architectures have limitations for extracting patterns from a wide variety of complex problems, such as signal, human speech, natural language, image, and sound processing. Deep-learning approaches overcome these limitations and show good results.
Emoji classification and comparison constitute two important tasks for discriminating several kinds of emoji, including those with similar meanings. Deep-learning models have been used for this goal on texts where emoji are embedded, producing better results than softmax-based methods such as logistic regression, naive Bayes, and artificial neural networks, among others. For example, Xiang Li et al. developed a deep neural network architecture for training a model that predicts the correct emoji for a corresponding utterance. This approach opens the possibility for machines to generate automatic messages for humans during a conversation, with implicit sentiment and better semantics.
In Li et al.’s proposal, the system receives as input an utterance set and an emoji set. The main goal is to train a classification model that predicts the correct emoji for a given utterance.
The architecture used in this work has two parts. The first is a convolutional neural network (CNN) that produces a sentence embedding representing an utterance; the second is the embedding of emoji, which must be trained. To join both parts, a matching structure is created: embeddings in a continuous vector space can represent emoji well and consequently perform better than a discrete softmax classifier.
The bottom of the CNN is a word-embedding layer, common in NLP tasks. It provides semantic information about a word using a real-valued vector that represents its features. An utterance is a sequence of words $w_1, \dots, w_n$; each word $w_i$ is initially a one-hot vector $x_i$ of dictionary dimension, in which the bit corresponding to the word’s position in the dictionary takes the value 1 and all remaining bits are 0. In Eq. (1), the embedding matrix $E$ is defined such that

$$e_i = E\, x_i, \qquad E \in \mathbb{R}^{d \times |V|}, \tag{1}$$

where $d$ and $|V|$ are the word-embedding and word-dictionary dimensions, respectively. Each $e_i$ is the embedding of word $w_i$ in the dictionary. The convolutional layer uses sliding windows to gather information from the word embeddings; for this process, the following function is used (see Eq. (2)):

$$c_i = f\!\left( W \left[ e_i; e_{i+1}; \dots; e_{i+h-1} \right] + b \right), \tag{2}$$

where $h$ is the size of the window and $b$ is the bias vector. Hence, the parameters to be trained are $W$ and $b$.
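As a rough illustration, the embedding lookup and a single convolution window in the spirit of the two equations above can be sketched in plain NumPy. All dimensions, the weight initialization, and the tanh nonlinearity are illustrative choices for this sketch, not the values used by Li et al.:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, h = 10, 4, 2            # toy dictionary size, embedding dim, window size
E = rng.normal(size=(d, V))   # embedding matrix

def embed(word_ids):
    """Embedding lookup: multiplying a one-hot vector by E selects a column."""
    return E[:, word_ids]                           # shape (d, len(word_ids))

def conv_window(e, W, b, i):
    """One sliding-window step: concatenate h consecutive embeddings,
    then apply an affine map followed by tanh."""
    window = e[:, i:i + h].reshape(-1, order="F")   # [e_i; ...; e_{i+h-1}]
    return np.tanh(W @ window + b)

utterance = [1, 3, 5, 7]            # word indices of a toy 4-word utterance
e = embed(utterance)                # (d, 4)
W = rng.normal(size=(3, d * h))     # 3 convolutional feature maps
b = np.zeros(3)
c0 = conv_window(e, W, b, 0)        # local feature vector for the first window
```

In a trained network, max pooling over all such local feature vectors would then yield the sentence embedding of the utterance.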
Once a series of continuous representations of local features is obtained from the convolutional layer, dynamic pooling is used to synthesize these embeddings into one vector for the whole utterance, taking the max pooling as output. The hidden layer then transforms the resulting sentence embedding and finally returns the vector that represents the utterance.
Similarly to the word-embedding layer, the emoji-embedding layer uses a matrix $E'$ to obtain each emoji embedding $v_j = E' y_j$, where $y_j$ is the one-hot vector of length $K$ that represents emoji $j$. Each entry of $E'$ is a parameter of the neural network. Training proceeds by forward propagation to compute the matching score between the given utterance and the correct emoji, and the matching score between the given utterance and a negative emoji; backward propagation is used to update the model parameters. The matching score is computed with the cosine similarity measure, whereas the neural network is trained with the hinge loss function. It is worth mentioning that the latter is very useful for carrying out pairwise comparisons to identify similar emoji types.
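The matching score and loss can be illustrated with a few lines of NumPy. The vectors below are made-up stand-ins for the CNN’s utterance embedding and two emoji embeddings, and the margin value is an arbitrary choice:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, used as the matching score between embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hinge_loss(score_pos, score_neg, margin=0.5):
    """Pairwise hinge loss: zero once the correct emoji outscores a
    sampled negative emoji by at least the margin."""
    return max(0.0, margin - score_pos + score_neg)

utt = np.array([0.2, 0.9, -0.4])   # utterance embedding (from the CNN)
pos = np.array([0.1, 1.0, -0.5])   # embedding of the correct emoji
neg = np.array([-0.8, 0.2, 0.7])   # embedding of a sampled negative emoji

loss = hinge_loss(cosine(utt, pos), cosine(utt, neg))
```

Here the correct emoji already outscores the negative one by more than the margin, so the loss vanishes and no parameter update would be triggered for this pair.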
Finally, the authors obtain an architecture that uses a CNN and a matching approach for classifying and learning emoji embeddings. The importance of this work for the field of robotics is the possibility of producing a facial gesture as a result of presenting a statement, conversation, or idea to a machine, exploiting the semantic and emotional relations of emoji.
3.2 Emoji sentiment analysis
In the area of decision making, it has become relevant to know what people think and what they will do in the future. This creates the need to group people according to their interactions on the Internet and social networks. Sentiment analysis, or opinion mining, is the study of people’s opinions, sentiments, or emotions using an NLP approach, which includes, but is not limited to, text mining, data mining, ML, and deep learning. For instance, CNNs have been employed to predict the emoji polarities of tweets. These techniques have been shown to be more effective than shallow models in image recognition and text classification, where they reach better results.
Tweet processing for opinion mining and text analysis plays a crucial role in different areas of industry, because it produces relevant results that feed back into the design of products and services. As Twitter is a platform where user interactions are very informal and unstructured, and people use many languages and acronyms, it becomes necessary to build a language-independent model with unsupervised learning. Emoji and emoticons can be used in this scenario as heuristic labels for the system, with feature extraction carried out by unsupervised techniques; the emoji or emoticons are then the final output representing the sentiment a tweet contains. According to Mohammad Hanafy et al., to train a model for text processing, it is essential to preprocess the data, removing noisy elements such as hashtags and special characters such as “@”, reducing text by removing duplicated words, and, very importantly, re-weighting the emoticons with their scores. Each emoticon has raw data containing a sentiment classified as negative, neutral, or positive, with a continuous value recorded for each class. This representation is used in the auto-labeling phase to generate the training data, using the score to determine the emoji.
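A minimal sketch of this preprocessing and auto-labeling step might look as follows. The emoticon scores and the cleanup rules are simplified illustrations of the procedure, not Hanafy et al.’s actual values:

```python
import re

EMOTICON_SCORES = {":)": 0.7, ":D": 0.9, ":(": -0.6}   # illustrative scores

def preprocess(tweet):
    """Strip hashtags and @-mentions, drop immediately repeated words,
    and auto-label the tweet from the scores of the emoticons it contains."""
    total = sum(s for emo, s in EMOTICON_SCORES.items() if emo in tweet)
    text = re.sub(r"[@#]\w+", "", tweet)          # remove noisy tokens
    words = text.split()
    deduped = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
    return " ".join(deduped), label

clean, label = preprocess("so so happy today :) #mood @friend")
```

The returned label can then serve as the heuristic training target, with no human annotation involved.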
The feature extraction stage uses the Tf-idf approach, which indicates the importance of a word in a text through its frequency in the text or text collection. Using Eq. (3), we can calculate this as follows [19, 27]:

$$\mathrm{tfidf}(w, t) = \mathrm{tf}(w, t) \times \log \frac{N}{\mathrm{df}(w)}, \tag{3}$$

where $w$ is the word and $t$ is the tweet. $\mathrm{tf}(w, t)$ is the term frequency in the document, $\mathrm{df}(w)$ is the document frequency, that is, the number of documents in which the word $w$ exists, and $N$ is the number of tweets.
Other feature-extraction methods employed were bag-of-words (BOW) and Word2Vec. BOW selects a set of important words in the tweets, and then each document is represented as a vector of the numbers of occurrences of the selected words. Word2Vec uses a two-layer neural network to represent each word as a vector of a certain length based on its context. This feature-extraction model computes a distributed vector representation of the word, its main advantage being that similar words are close in the vector space. Moreover, it is very useful for named entity recognition, parsing, disambiguation, tagging, and machine translation. In the area of big data processing, the Spark ML library within the Apache Spark engine uses a skip-gram-model-based implementation that seeks to learn vector representations that take into account the contexts in which words occur.
The skip-gram model learns word vector representations that are good at predicting a word’s context in the same sentence, given a sequence of training words $w_1, w_2, \dots, w_T$. The objective is to maximize the average log-likelihood, defined by Eq. (4):

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t), \tag{4}$$

where $c$ is the size of the training window. Each word $w$ is associated with two vectors, $v_w$ as word and $u_w$ as context, respectively. Using Eq. (5), given the word $w_i$, the probability of correctly predicting the word $w_j$ is computed as:

$$p(w_j \mid w_i) = \frac{\exp\!\left(u_{w_j}^{\top} v_{w_i}\right)}{\sum_{k=1}^{|V|} \exp\!\left(u_{k}^{\top} v_{w_i}\right)}, \tag{5}$$

where $|V|$ is the vocabulary length. The cost of computing $\log p(w_j \mid w_i)$ is proportional to $|V|$, which is expensive; consequently, Spark ML uses hierarchical softmax, with a computational cost of $O(\log |V|)$.
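A tiny NumPy version of the prediction probability makes the cost argument concrete: the softmax denominator touches every row of the context matrix, that is, the full vocabulary. The vectors here are random toy data, not trained embeddings:

```python
import numpy as np

def skipgram_prob(center, context, V_word, U_ctx):
    """Probability of a context word given a center word via a full
    softmax over the vocabulary -- O(|V|) per evaluation, which is the
    cost that hierarchical softmax reduces to O(log |V|)."""
    scores = U_ctx @ V_word[center]        # u_k . v_center for every word k
    exps = np.exp(scores - scores.max())   # numerically stable softmax
    return exps[context] / exps.sum()

rng = np.random.default_rng(1)
V_word = rng.normal(size=(5, 3))   # word vectors v_w for a 5-word vocabulary
U_ctx = rng.normal(size=(5, 3))    # context vectors u_w
p = skipgram_prob(center=0, context=2, V_word=V_word, U_ctx=U_ctx)
```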
These feature-extraction models were used with various classifiers, such as SVM, MaxEnt, voting ensembles, CNN, and LSTM (an extension of the recurrent neural network (RNN) architecture). As the proposed solution, a weighted voting ensemble classifier combines the outputs of the different models and their classification probabilities, assigning a different weight to each model when voting. The proposed model reaches a considerable accuracy in comparison with the other models. This approach is very important in scenarios that require no human intervention and no information about the language used; a good combination of classical and deep-learning algorithms is very useful to achieve better accuracy.
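The weighted voting step can be sketched as follows; the model names, probabilities, and weights are illustrative, not those reported in the paper:

```python
import numpy as np

def weighted_vote(model_probs, weights):
    """Combine per-model class probabilities with per-model weights and
    return the winning class index and the normalized combined scores."""
    combined = np.asarray(weights) @ np.asarray(model_probs)
    return int(np.argmax(combined)), combined / combined.sum()

# three models voting over (negative, neutral, positive) -- toy numbers
model_probs = [[0.2, 0.3, 0.5],   # e.g., SVM
               [0.1, 0.2, 0.7],   # e.g., CNN
               [0.6, 0.2, 0.2]]   # e.g., MaxEnt
label, dist = weighted_vote(model_probs, weights=[0.3, 0.5, 0.2])
```

Because the weights reflect per-model reliability, a confident but historically weaker model (MaxEnt above) cannot overturn the consensus of the stronger ones.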
3.3 From video to emoji
As a consequence of the semantic meaning that emoji carry, some applications and research efforts involve image processing for generating emoji classifications or utterances with embedded emoji. For that purpose, Chembula et al. created an application that receives as input a stream of video or images of a person and creates an emoticon based on the face image. The solution detects the facial expression at the time the message is being generated; once the facial expression is detected, the device generates a message with the suitable emoticon.
This system performs facial detection, facial feature detection, and a classification task to finally identify the facial expression. Although the initial processing proposed by Chembula and Pradesh was not specified in the general description, open source solutions can be used to accomplish this job.
OpenCV is an open source library for computer vision that includes classifiers for real-time face detection and tracking, such as Haar cascade classifiers and Adaptive Boosting (AdaBoost). A trained model for this task can be downloaded as an XML file and imported into an OpenCV project. For feature extraction, the library includes algorithms for detecting regions of interest in the human face, such as the eyes, mouth, and nose. For this purpose, it is important to reduce the information in the image stream by converting it to gray scale and afterward applying a Gaussian blur to reduce noise. The Canny algorithm may be used to track facial features with more precision than alternatives such as the Sobel and Laplace operators.
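In OpenCV, these steps map onto `cv2.cvtColor`, `cv2.GaussianBlur`, `cv2.Canny`, and `cv2.CascadeClassifier`. The first two can be mimicked without any dependency beyond NumPy, as in this simplified sketch, which uses the BT.601 luminance weights and a box filter as a crude stand-in for the Gaussian kernel:

```python
import numpy as np

def to_gray(rgb):
    """Weighted grayscale conversion (ITU-R BT.601 luminance weights)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def blur3(img):
    """3x3 mean filter, a crude stand-in for Gaussian smoothing."""
    out = img.copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = img[i - 1:i + 2, j - 1:j + 2].mean()
    return out

frame = np.random.default_rng(2).uniform(0, 255, size=(8, 8, 3))
gray = to_gray(frame)     # (8, 8) intensity image
smooth = blur3(gray)      # denoised input for edge detection or cascades
```

The smoothed intensity image is what an edge detector or a Haar cascade would then consume.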
In , Microsoft’s Emotion API is used as a tool to detect facial images captured from the computer’s Webcam. Once the image is captured, the detected face is classified into seven emotion tags. Although the process is not specified exactly, the API works on an implementation of the OpenCV library for .NET, so the algorithms used for face detection should be the same as those described above.
For the classification task, one can use nearest-neighbor classifiers, SVM, logistic classifiers, Parzen density classifiers, normal density Bayes classifiers, or Fisher’s linear discriminant. Finally, when the classification is done, the output layer consists of a group of emoji types according to the meaning of each type of emotion detected in the face image. The importance of this contribution lies in the possibility of introducing new forms of human-computer interaction through the use of emotions. This can be useful for intelligent assistants, both physical and virtual, that are able to react according to the mood of the people who use a particular intelligent ecosystem.
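As a toy illustration of the final two stages, a nearest-neighbor classifier over hypothetical facial-feature vectors can pick the emoji to attach. The prototype vectors and the two-dimensional feature space are invented for the example:

```python
import numpy as np

# invented emotion prototypes in a 2-D facial-feature space
PROTOTYPES = {
    "happiness": np.array([1.0, 0.2]),
    "sadness":   np.array([-0.8, -0.4]),
    "surprise":  np.array([0.1, 1.0]),
}
EMOJI = {"happiness": "\U0001F600", "sadness": "\U0001F622", "surprise": "\U0001F632"}

def classify_face(features):
    """1-NN: the detected emotion is the closest prototype; the output
    layer then maps that emotion to its emoji."""
    label = min(PROTOTYPES, key=lambda k: np.linalg.norm(features - PROTOTYPES[k]))
    return label, EMOJI[label]

label, emoji = classify_face(np.array([0.9, 0.1]))
```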
Figure 2 shows in a general way the operation of what has been explained above.
4. Applications to virtual and embodied robotics
As already mentioned, our intention in this work is to elaborate the elements that will endow an artificial intelligent entity, either virtual or physically embodied, with the capacity to recognize and express (R&E) emotional content using emoji. In this sense, we can collect massive amounts of human-human (and perhaps human-machine) interactions from multiple Internet sources, such as social media or forums, to train ML algorithms that can R&E emotions, utterances, and beat gestures, and even assess the personality of the interlocutor. Furthermore, we may even reconstruct text phrases from speech in which emoji are embedded to obtain a bigger picture of the semantic meaning. For instance, if we asked the robot “are you sure?” while raising our eyebrows to emphasize our incredulity, we may obtain an equivalent expression such as “are you sure?
4.1 Embodied service robots study cases
Service robots are a type of embodied artificial intelligent entities (EAIE), which are meant to enhance and support human social activities such as health and elder care, education, and domestic chores, among others [32, 33, 34]. A very important goal for EAIE is improving the naturalness of human-robot interaction (HRI), which can be achieved by providing EAIE with the capacity to R&E emotions to/from their human interlocutors [32, 33].
Regarding the emotional mechanisms of an embodied robot per se, a relevant example is the work of , which consists of an architecture for imbuing an EAIE with emotions that are displayed on an LED screen using emoticons. This architecture establishes that a robot’s emotions are determined by long-, medium-, and short-term affective states, namely its personality (i.e., social and mood changes), the surrounding ambiance (i.e., temperature, brightness, and sound levels), and human interaction (i.e., hit, pat, and stroke sensors), respectively. All of these sensory inputs are employed to determine the EAIE’s emotional state using ad hoc rules coded into a fuzzy logic algorithm, and the result is displayed on an LED face. Facial gestures corresponding to Ekman’s basic emotion expressions are shown in the form of emoticons.
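A crisp (non-fuzzy) caricature of such a rule base shows the overall flow from sensors to displayed emoticon. The thresholds, sensor names, and rules below are invented for illustration, whereas the original work encodes them as fuzzy logic:

```python
def emotional_state(touch, brightness, sound_db):
    """Map interaction and ambient readings to one of Ekman's basic
    emotions with hand-written rules (illustrative thresholds)."""
    if touch == "hit":
        return "anger"
    if sound_db > 80:          # loud, startling environment
        return "fear"
    if brightness < 0.2:       # dark room
        return "sadness"
    return "happiness"

EMOTICON_FACE = {"anger": ">:(", "fear": ":-O", "sadness": ":(", "happiness": ":)"}

state = emotional_state(touch="pat", brightness=0.7, sound_db=40)
face = EMOTICON_FACE[state]    # what the LED screen would display
```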
An important application of embodied service robots is supporting elders’ daily activities, promoting a healthy lifestyle and providing an enriching companion. For this case, more advanced interaction models for EAIE, based on an emotional model, gestures, facial expressions, and R&E utterances, have been proposed [32, 35, 36, 37]. The authors of these works put forward several cost-efficient EAIE based on mobile device technologies, namely iPhonoid, iPhonoid-C, and iPadrone. These are robotic companions whose architecture is built, among other features, upon the informationally structured spaces (ISS) concept. The latter allows gathering, storing, and transforming multimodal data from the surrounding ambiance into a unified framework for perception, reasoning, and decision making. This is a very interesting concept since EAIE behavior may be improved not only by its own perceptions and HRI but also by remote users’ information, such as elders’ activities from the Internet or grocery shopping. Likewise, all this multimodal information can be exploited by any family member to improve the quality of his or her relationship with the elder ones. Regarding the emotional model, the perception and action modules are the most relevant. Among the perceptions considered in these frameworks stand the number of people in the room, gestures, utterances, colors, etc. In the same fashion as , these EAIE implement an emotional time-varying framework that considers emotion, feeling, and mood (from shorter to longer emotional duration states, respectively). First, perceptions are transformed into emotions using expert-defined parameters; then, emotions and long-term traits (i.e., mood) serve as the input of feelings, whose activation follows a spiking neural network model [32, 35]. In particular, mood and feelings are within a feedback loop, which emphasizes the emotional time-varying approach.
Once perceptions are turned into the corresponding emotional state, the latter is sent to the action module to determine the robot’s behavior (i.e., conversation content, gestures, and facial expression). As mentioned earlier, these EAIE also R&E utterances, which provide feedback to the robot’s emotional state. Another interesting feature of their architecture is its conversational framework: the usage of certain utterances, gestures, or facial expressions depends on conversation modes, which in turn depend on NLP for syntactic and semantic analyses [32, 37]. Nevertheless, with regard to facial and gesture expressions, these works take them for granted and barely discuss them. In particular, how facial expressions are designed and expressed can only be guessed from the figures of these papers, which closely resemble emoji-like facial expressions.
Embodied service robots are also beneficial in the pedagogical area as educational agents [38, 39]. In this setting, robots are employed in a learning-by-teaching approach, where students (ranging from kindergarten to preadolescence) read and prepare educational material beforehand, which is then taught to the robotic peer. This has been shown to improve students’ understanding and knowledge retention of the studied subject, increasing their motivation and concentration [38, 40]. Likewise, robots may enhance their classroom presence and the elaboration of affective strategies by recognizing and expressing emotional content. For instance, one may desire to elicit an affective state that engages students in an activity, or to identify boredom in students. The robot’s reaction then has to be an optimized combination of gestures, intonation, and other nonverbal cues that maximizes learning gains while minimizing distraction. Humanoid robots are preferred in education due to their anthropomorphic emotional expression, readily available through body and head posture, arms, speech intonation, and so on. Among the most popular humanoid robotic frameworks stand the Nao® and Pepper® robots [38, 39, 40]. In particular, Pepper is a small humanoid robot equipped with microphones, 3D sensors, touch sensors, a gyroscope, an RGB camera, and a touch screen placed on its chest, among other sensors. Through the ALMood module, Pepper is able to process perceptions from its sensors (e.g., interlocutors’ gaze, voice intonation, or linguistic semantics of speech) to provide an estimation of the instantaneous emotional state of the speaker, the surrounding people, and the ambiance mood [42, 43].
However, Pepper’s communication and emotional expression are mainly carried out through speech, as a consequence of limitations such as a static face, unrefined gestures, and other nonverbal cues that are not as flexible as human standards. Consider, for instance, Figure 4, which displays a sad Pepper. Looking at the picture alone, it is unclear whether the robot is sad, looking at its wheels, or simply turned off.
4.2 Study cases through the emoji communication lens
In summary, in the EAIE cases revised above (emoticon-based expression, iPadrone/iPhonoid, and Pepper), emotions are generated through ad hoc architectures that consider emotions and moods determined by multimodal data. A cartoon of these works is presented in Figure 5, displaying in (a) the work of , in (b) the work of [32, 35, 36, 37], and in (c) Pepper the robot as described in [42, 43, 44].
In these cases, we can integrate emoji-based models to enhance the emotional communication with humans, more directly for some tasks than for others. Take, for instance, facial expression by itself: in cases (a) and (b), the replacement of emoticon-based emotional expression by its emoji counterpart is straightforward. This will not only visually improve the robot’s facial expression but also allow more complex facial expressions to be displayed, such as sarcasm (
Regarding the emotional expression of the discussed EAIE, this is contingent on the emotional model, which in cases (a) and (b) consists of expert-designed knowledge coded into fuzzy-logic behavior rules and more complex neural networks, respectively. In both cases, this will not only bias the EAIE toward specific emotional states but also require vast human effort to maintain. In contrast, Pepper's framework is more robust: it includes a developer kit that allows modifying the robot's behaviors and integrating third-party chatbots, performs semantic and utterance analysis, and is maintained and improved by a robotics enterprise. Yet Pepper's emotional communication is constrained by a static face; while it can express emotions by changing the color of its LED eyes and adopting a corresponding body posture, its emotional communication is mainly done through verbal expressions. Nevertheless, in a pragmatic sense, do we really need to emulate emotions for a robot to have an emotional communication, or is it enough to R&E emotions so that a human interlocutor cannot distinguish between man and machine? In this sense, NLP and ML can be used to leverage a robot's emotional communication by first mapping multimodal data into a discourse-like text in which emoji are embedded, and then using emoji-based models to recognize sentiments, utterances, and gestures, so that the decision-making module can determine the corresponding message along with its corresponding emoji. In case (a), the microphone, and in case (b), the microphone, camera, and ambient sensors would be responsible for capturing speech and facial expressions to be converted into a discourse-like text. Once the emotional content of the message is identified, the corresponding emoji is displayed. In the case of Pepper, F2F communication can be improved directly by displaying emoji on its front tablet.
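The pipeline just described (multimodal input mapped to a discourse-like text with embedded emoji, sentiment recognition over that text, and selection of a reply emoji) can be sketched as follows. The keyword/emoji lexicon and the response table here are hypothetical stand-ins for the trained emoji-based models discussed in the chapter.

```python
# Hedged sketch of the described pipeline: classify the sentiment of a
# discourse-like text (which may embed emoji), then choose the emoji
# the robot should display. The lexica are toy placeholders; a real
# system would use trained emoji-based NLP models instead.
POSITIVE = {"\U0001F600", "\U0001F642", "happy", "great", "thanks"}
NEGATIVE = {"\U0001F641", "\U0001F622", "sad", "bad", "bored"}

def recognize_sentiment(discourse: str) -> str:
    """Label the discourse by counting positive vs. negative tokens."""
    tokens = discourse.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

RESPONSE_EMOJI = {
    "positive": "\U0001F603",  # smiling face with open mouth
    "negative": "\U0001F917",  # hugging face, as a comforting reply
    "neutral": "\U0001F44B",   # waving hand
}

def respond(discourse: str) -> str:
    """Return the emoji the robot should display for this input."""
    return RESPONSE_EMOJI[recognize_sentiment(discourse)]

print(respond("thanks that was great"))
```

The decision-making module of a robot such as Pepper could route the returned code point to its front tablet, while cases (a) and (b) would render it on their respective displays.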
For instance, when Pepper starts waving to a potential speaker, a friendly emoji such as a waving hand
Emotional communication is a key piece for enhancing HRI; after all, it would be very useful if our smartphones, personal computers, cars, buses, and other devices could exploit our emotional information to improve our experience and communication. While several proposals for robotic emotional communication are currently under way, emoji as a framework for the latter present a novel approach with high applicability and broad usage opportunities. Some of the works presented here discussed the linguistic aspects of emoji, as well as the technical aspects, in terms of ML and NLP, of R&E emotions, utterances, and gestures in texts containing emoji. Furthermore, we also presented some related works in the area of HRI, which can easily adopt emoji for imbuing an embodied artificial intelligent entity with the capacity for expressing and recognizing the emotional aspects of communication. On the whole, ML models support these tasks, but we do not overlook the important work involved in processing and transforming data to reach a suitable input representation for training an appropriate model.
On the other hand, there are several open questions regarding the usage of emoji for emotional communication. For instance, are emoji suitable for the communication of every robotic entity? Emoji are mostly employed in a friendly manner and for maintaining a positive communication. If the objective is to model a virtual human, emoji usage will clearly restrain the spectrum of emotions that may be detected and expressed, due to its knowledge base. An important example to consider is the humanoid robot designed by Hiroshi Ishiguro, the man who made a copy of himself. Ishiguro's proposal is that, in order to understand and model emotions, we must first understand ourselves. Hence, this humanoid robot, namely Geminoid HI-1, is capable of displaying ultrarealistic human-like behaviors. However, do we really want to interact with service robots that may have bad personality traits, such as being unsociable and fickle, or whose mood can be affected by heat and noise as a human's can? Do we really want to interact with service robots that can be as rude as a real elderly caretaker could be? In this sense, emoji usage for emotional communication may be best suited when the task at hand (e.g., a robotic retail store cashier or an educational agent) requires keeping a friendly tone with the human interlocutor. Another question is: should the entire emoji lexicon be used, or should it be restricted to the core lexicon, which refers to facial expressions? In an ultrarealistic anthropomorphic robot such as Geminoid HI-1, all hand gestures might be carried out by the robot's hands themselves, so it should be unnecessary to even fit a screen for displaying a waving emoji (
Author GSB thanks the Cátedra CONACYT program for supporting this research. Author OGTL thanks GSB for his excellent collaboration.