Mining Numbers in Text: A Survey

Both words and numerals are tokens found in almost all documents, but they have different properties. Nevertheless, relatively little attention has been paid to numerals found in text, and many systems have treated the numbers in a document in ad hoc ways, such as regarding them as mere strings in the same way as words, normalizing them to zeros, or simply ignoring them. The recent growth of natural language processing (NLP) research has changed this situation, and more and more attention is being paid to numeracy in documents. In this survey, we provide a quick overview of the history and recent advances of research on mining such relations between numerals and words found in text data.


Introduction
Natural language processing (NLP) is a research field that aims to make machines understand the meaning of text data, which is typically a sequence of words. In some cases, texts are not understandable in their closed form, i.e., without understanding data other than the words. Numerals are an important form of such nonword data, not only because many documents are accompanied by related metadata such as publication dates expressed in the form of numbers, but also because the documents themselves contain numerals such as "three people", "500 dollars", and "90 cm." Jointly mining texts and their associated numerical metadata has many variations, and many studies have been proposed. For example, predicting the star ratings given with product review texts is a typical task in this research area. Location-aware text mining can be considered as mining association rules between words and positional data (i.e., longitude and latitude). Even joint learning of texts and images can be seen as mining relations between texts and associated RGB data.
In contrast to such grounding-type research, studies on mining numerals explicitly written in text have received little attention. Recently, however, more and more studies have been proposed in this area, partly due to recent advances in deep neural network-based language modeling.
In this survey, we try to provide a quick overview of the history and recent advances of this research field ranging from traditional tasks like information retrieval to emerging ones such as numerical reading comprehension.

Traditional tasks
First, we survey systems that consider the treatment of numbers for traditional tasks such as information retrieval (IR), question answering (QA), and information extraction (IE). Some of the questions or queries in these tasks require the answers to be numbers, hence requiring appropriate treatment of the numbers found in the target text.

Question answering
Question answering (QA) is the task of finding, in text, appropriate answers to questions that are also given in text. Because many types of questions require the answers to be numbers, e.g., 8,848 (8,849) (meters) is the answer to the question "how tall is Everest?", some existing QA systems treat numbers explicitly, typically with ad hoc heuristics.
For example, IBM's PIQUANT system for TREC2003 [1] had a sanity checking module, which used the Cyc knowledge base to check whether a given answer falls within the valid intervals found in Cyc, e.g., rejecting "200 miles" for a question about the height of a mountain, given the knowledge "mountains are between 1,000 and 30,000 high" from Cyc. Moriceau [2] considered a more complicated situation in which several numeric answers can be extracted from different Web pages in a QA system, and proposed a way to integrate them that takes into account the nature of numbers, such as number approximations.
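An interval-based sanity check of this kind can be sketched as follows. This is only an illustration and not the PIQUANT implementation: the dictionary `PLAUSIBLE_RANGES`, the attribute name, and the assumption that the interval is expressed in feet are all hypothetical.

```python
# Hypothetical plausibility ranges. The attribute name and the unit (feet)
# are illustrative assumptions, not entries taken from Cyc.
PLAUSIBLE_RANGES = {
    "mountain_height_ft": (1_000.0, 30_000.0),
}

FEET_PER_MILE = 5_280.0


def is_plausible(attribute: str, value: float) -> bool:
    """Return True if value lies inside the known interval for attribute."""
    low, high = PLAUSIBLE_RANGES[attribute]
    return low <= value <= high


# "200 miles" as a candidate mountain height is rejected after unit conversion.
candidate_ft = 200 * FEET_PER_MILE
print(is_plausible("mountain_height_ft", candidate_ft))  # False
print(is_plausible("mountain_height_ft", 29_032.0))      # True
```

A candidate answer that falls outside the interval would simply be discarded before ranking the remaining candidates.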

Information retrieval
Similarly to QA, some information retrieval (IR) systems return direct "answers" to the query. Therefore, appropriate treatment of numbers is required for some types of queries. For example, Banerjee et al. [3] introduced Quantity Consensus Queries (QCQs), whose answers are quantity intervals, such as "driving time from Paris to Nice". Their algorithm proposes and ranks intervals considering whether the quantities in the returned snippets fall within the intervals or not. Sarawagi and Chakrabarti [4] proposed a system to answer quantity queries on Web tables such as "escape velocity jupiter." Their system contains modules that interpret the numbers presented in the table cells to improve accuracy.
Conversely, queries can also be numbers. Yoshida et al. [5] proposed a suffix array-based text mining system enhanced with treatment of numbers, which accepts range queries like "[1,000 - 10,000] ft".
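To make the idea of a numeric range query concrete, here is a minimal sketch. It uses a linear regular-expression scan rather than a suffix array, so it illustrates only the query semantics, not the indexing method of [5]; the function name `range_query` and the example text are made up.

```python
import re

# Matches numerals such as "850", "4,200" or "3.5" followed by a unit word.
NUMBER_WITH_UNIT = re.compile(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*([A-Za-z]+)")


def range_query(text: str, low: float, high: float, unit: str) -> list[str]:
    """Return numeric mentions in text whose value lies in [low, high]."""
    hits = []
    for match in NUMBER_WITH_UNIT.finditer(text):
        value = float(match.group(1).replace(",", ""))
        if match.group(2) == unit and low <= value <= high:
            hits.append(match.group(0))
    return hits


doc = "The hill rises 850 ft, the ridge 4,200 ft, and the peak 14,505 ft."
print(range_query(doc, 1_000, 10_000, "ft"))  # ['4,200 ft']
```

The key point is that the numerals are compared by value, so "4,200" matches the range even though it shares no characters with the query bounds.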

Information extraction
Information extraction (IE) is another type of system that returns answers to questions, but in this case the questions are given a priori, such as "extract all dates and places of events found in the given documents." Much of the extracted information is numerical, so special treatment of numbers often improves the performance of IE systems.
For example, Bakalov and Fuxman [6] proposed a system that extracts numerical attributes of objects given attribute names, seed entities, and related Web pages, and that properly distinguishes attributes having similar values. Table 1 summarizes these systems.

Numerical common sense acquisition
Numerical common sense acquisition is the task of obtaining numerical common sense, e.g., that the height of mountains typically takes values of "1,000 - 10,000 meters". Many numerals found in text describe some attribute of an object, such as "25 C" for the temperature of some city or "170 cm" for the height of some person. Obtaining such numerical common sense can contribute to improving various kinds of systems, e.g., anomaly detection or dialogue systems.
We introduce two types of tasks in this line of research. One is to directly extract the common sense, and the other is to acquire such knowledge as language model parameters.

Pattern-based extraction of numerical common sense
In this task, the input is a large collection of texts, and the output is a database of "typical values" of something. A typical method for this task is to use pattern matching to obtain the numerals for each attribute described in the given text. For example, the value "80" can be extracted from the sentence "The size of the dog is 80 cm." using the pattern "the size of the A is # cm."
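The pattern in this example can be written as a regular expression. The following is a minimal sketch, assuming the object slot A is a single word and the numeral slot # is a digit string; the corpus sentences are invented for illustration.

```python
import re
from collections import defaultdict

# "the size of the A is # cm": A is the object slot, # the numeral slot.
SIZE_PATTERN = re.compile(
    r"the size of the (?P<obj>\w+) is (?P<num>\d+(?:\.\d+)?) cm",
    re.IGNORECASE,
)


def extract_sizes(corpus):
    """Collect the numerals observed for each object across all sentences."""
    values = defaultdict(list)
    for sentence in corpus:
        for match in SIZE_PATTERN.finditer(sentence):
            values[match.group("obj").lower()].append(float(match.group("num")))
    return dict(values)


corpus = [
    "The size of the dog is 80 cm.",
    "I heard the size of the dog is 65 cm at most.",
    "The size of the book is 21 cm.",
]
print(extract_sizes(corpus))
# {'dog': [80.0, 65.0], 'book': [21.0]}
```

The lists of values collected per object are the raw material from which "typical values" can then be summarized.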

Previous methods
Aramaki et al. [7] proposed to obtain the physical size of entities by using Web search with patterns like "book (*cm x *cm)". Bagherinezhad et al. [8] proposed to combine knowledge obtained using such patterns with object detection from images to obtain more reliable object size knowledge. Davidov and Rappoport [9] proposed a similar approach but augmented it by obtaining terms similar to the given object using the Web and WordNet. Takamura and Tsujii [10] also used Web search with linguistic patterns, e.g., "the size of A", but enhanced them with more indirect clues such as WordNet relations and an n-gram corpus for explicit patterns, e.g., "A is longer than B", and implicit patterns, e.g., "put A in B", using a machine learning approach to determine their weights. Narisawa et al. [11] proposed to obtain numerical common sense by searching for numerical expressions in a Web corpus and calculating the distribution of numbers given syntactically defined contexts such as "verb=give, subj=he, ...", and then to predict labels such as small, normal, or large for given numbers in text.
Recently, a large dataset called Distribution over Quantities (DoQ) was provided by Elazar et al. [12]. It covers ten dimensions (TIME, CURRENCY, LENGTH, AREA, VOLUME, MASS, TEMPERATURE, DURATION, SPEED, VOLTAGE) for various kinds of words including nouns, adjectives, and verbs. They explored the co-occurrence of words and numerals in large Web data. Table 2 summarizes these approaches and Table 3 shows the existing data sets.
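Once a distribution of numerals has been collected for a context, labeling a new number as small, normal, or large can be sketched with quantile thresholds. The use of quartiles here is an assumption made for illustration and is not necessarily the criterion used in [11]; the observed values are made up.

```python
from statistics import quantiles


def label_value(value, observed):
    """Label value as small, normal, or large relative to observed numerals."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(observed, n=4)
    if value < q1:
        return "small"
    if value > q3:
        return "large"
    return "normal"


# Made-up heights (cm) standing in for numerals collected from a corpus.
heights_cm = [150, 155, 160, 165, 170, 172, 175, 180, 185, 190]
print(label_value(120, heights_cm))  # small
print(label_value(171, heights_cm))  # normal
print(label_value(210, heights_cm))  # large
```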

Table 2. Systems for Numerical Common Sense Acquisition.
… | … | Objects in images
Davidov and Rappoport [9] | "A is * cm tall" | Web search, TREC
Takamura and Tsujii [10] | "the size of A", "A is longer than B" | Web search, WordNet
Narisawa et al. [11] | Syntactic patterns | Web corpus
Elazar et al. [12] | Co-occurrence | Web corpus

Data | Source | Method
DoQ [12] | Web corpus | Collecting co-occurred numerals for each word

Prediction of numbers in sentences
Some researchers have tried to acquire numerical common sense as parameters of a language model. In this type of research, the system directly predicts numbers to fill in blanks in text, or assesses the plausibility of a number presented in text, without explicitly collecting the above-mentioned knowledge bases.

Task definition
In this task, the input is a sentence or document in which the position of a numeral is masked. The system then outputs a likely value for the masked position. For example, given the sentence "my five-year-old son is [MASK] cm tall.", the system is required to output a likely value to fill in the position of "[MASK]." Because the input is a sequence of words, encoder-decoder models are applicable to this task. In particular, the BERT language model is a good match for this problem. BERT is a deep neural network model that consists of modules called Transformers. It is trained on a task where the input is a sequence of words containing special "[MASK]" tokens, and one of the outputs is the estimated original word at each "[MASK]" position.
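The input side of the masked-numeral task described above can be sketched as follows: numerals written in digits are replaced by a mask token and kept as prediction targets. The function name `mask_numerals` is hypothetical, and no language model is involved; the sketch only shows how training or test instances could be prepared.

```python
import re

# Digit-based numerals such as "110", "1,200" or "3.5".
NUMERAL = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def mask_numerals(sentence: str, mask_token: str = "[MASK]"):
    """Replace each digit-based numeral with mask_token and return the targets."""
    targets = [float(m.group(0).replace(",", "")) for m in NUMERAL.finditer(sentence)]
    masked = NUMERAL.sub(mask_token, sentence)
    return masked, targets


masked, targets = mask_numerals("My five-year-old son is 110 cm tall.")
print(masked)   # My five-year-old son is [MASK] cm tall.
print(targets)  # [110.0]
```

Number words such as "five" are left untouched in this sketch; whether they should also be masked depends on the task setting.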

Previous approaches
Several BERT models pretrained on huge amounts of text data are publicly available. Using such pretrained language models to predict or assess numeracy in documents is an emerging trend. Typically, the models are given the ability to predict numbers in one of three ways: by simply using a masked language model and replacing the numbers to be predicted with [MASK] tokens, by adding numeracy inference modules to the language model, or by fine-tuning in a setting where the output is a discretized version of the target number.
Zhang et al. [13] investigated how well pretrained language models like BERT can predict (the discretized version of) attributes with continuous numeric values such as MASS or PRICE, with evaluation on DoQ. Chen et al. [14] proposed the task of predicting the magnitude of hidden numerals in text and provided a large dataset called Numeracy-600K. They also reported CNN- and RNN-based models for this task. Berg-Kirkpatrick and Spokoyny [15] proposed a more advanced model using BERT and reported that it performed better than other models including BiGRU.
On the other hand, Lin et al. [16] considered the more difficult task of predicting the exact number to fill in a blank in text, as in "A bird usually has [MASK] legs". They reported that current pretrained models including BERT and RoBERTa performed poorly.
A language model that does not use an encoder-decoder architecture has also been proposed. Spithourakis and Riedel [17] proposed a language model over sequences of words and numerals, which gives probabilities for words and numerals simultaneously. For example, it gives the probability of the numeral "50,000" appearing just after the word sequence "the number of video-game consoles I have is". They introduced, for each token, the probability of it being a word or a numeral, and modeled the probability of numerals independently of that of words, using several variations including a digit-based RNN and a mixture of Gaussians. Table 4 summarizes the approaches and Table 5 shows the dataset for this task.
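Discretizing a numeric target into a magnitude class, as used in these classification-style settings, can be sketched as below. The class scheme (one class per number of integer digits, with a shared lowest and highest class) is an assumption for illustration and is not claimed to match the label sets of [13] or [14].

```python
def magnitude_class(value: float, num_classes: int = 8) -> int:
    """Map a positive number to an order-of-magnitude class label."""
    if value <= 0:
        raise ValueError("this sketch only handles positive numbers")
    if value < 1:
        return 0  # all values below 1 share the lowest class
    # Number of integer digits: 1-9 -> 1, 10-99 -> 2, and so on.
    digits = len(str(int(value)))
    # Very large values share the highest class.
    return min(digits, num_classes - 1)


for number in [0.5, 7, 42, 13_200, 5_000_000_000]:
    print(number, magnitude_class(number))
```

A classifier trained on such labels predicts the rough scale of a hidden numeral rather than its exact value.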
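The two-step scoring described above (first decide whether the next token is a word or a numeral, then score it within that class) can be sketched with a mixture of Gaussians for the numeral part. All parameters below are made up, the context is ignored, and the numeral branch returns a density rather than a probability; this is a simplified illustration, not the model of [17].

```python
import math
import re

NUMERAL = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def numeral_density(x, components):
    """Density of numeral x under a mixture of (weight, mean, std) components."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def token_score(token, p_numeral, word_probs, components):
    """Score a token: choose numeral vs. word, then score within that class."""
    if NUMERAL.fullmatch(token):
        value = float(token.replace(",", ""))
        return p_numeral * numeral_density(value, components)
    return (1.0 - p_numeral) * word_probs.get(token, 0.0)


# Hypothetical parameters for one context.
components = [(0.7, 2.0, 1.0), (0.3, 50_000.0, 20_000.0)]
word_probs = {"huge": 0.5, "small": 0.5}
print(token_score("2", 0.2, word_probs, components))
print(token_score("50,000", 0.2, word_probs, components))
print(token_score("huge", 0.2, word_probs, components))
```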

System | Model | Task
Zhang et al. [13] | BERT fine tuning | Magnitude prediction

Numeracy embeddings
Embeddings, or distributed representations, of words have become a basic building block for natural language processing in recent years. An embedding represents each word by a high-dimensional vector (typically with 50 dimensions or more) of real values. These vectors reflect the meaning of words, in that words with similar meanings are represented by similar vectors³. Some researchers have investigated how numeracy itself is modeled in such pre-trained word embedding vectors.

Task definition
Embedding vectors are also assigned to numerals such as "three", "100", "million", etc. Popular word embeddings like word2vec do not distinguish these numerals from other words, i.e., the learning algorithms for these vectors treat numerals and other words equally. It is therefore not obvious that these word vectors appropriately reflect the meaning of numbers, such as "100 is larger than 3" or "4 is the number that follows 3". Numeracy embedding is the task of embedding such numerals in appropriate vector representations.

Investigating pre-trained word vectors
Nowadays, word embeddings learned from huge corpora are provided by various researchers. Some researchers have investigated whether and how these pretrained word vectors appropriately represent numerals.
Naik et al. [18] used GloVe, FastText, and SkipGram vectors and compared the similarity of the embedding vectors for numbers. They used two types of tests: one for magnitude, e.g., the vector for 4 should be more similar to that for 3 than to that for 1000000, and the other for numeration, e.g., the vector for three should be more similar to that for 3 than to that for billion. Contextualized word vectors have also been considered. Wallace et al. [19] found that pretrained language models for DROP, a numeracy-focused dataset described in a later section, already capture numeracy, by testing whether a BiLSTM model with pretrained embeddings passes tests such as list maximum, decoding (e.g., converting the string "five" to 5), and taking the sum of two numbers.
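The magnitude and numeration tests can be expressed as simple comparisons of cosine similarity. The sketch below uses toy three-dimensional vectors invented for illustration; with real pretrained embeddings the vectors would be looked up from the embedding table instead.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def passes_test(vectors, anchor, closer, farther):
    """True if anchor is more similar to closer than to farther."""
    return cosine(vectors[anchor], vectors[closer]) > cosine(vectors[anchor], vectors[farther])


# Toy vectors, invented for illustration only.
toy_vectors = {
    "3": [0.9, 0.1, 0.2],
    "4": [0.8, 0.2, 0.2],
    "three": [0.85, 0.1, 0.4],
    "1000000": [0.1, 0.9, 0.3],
    "billion": [0.2, 0.8, 0.5],
}

# Magnitude test: 4 should be closer to 3 than to 1000000.
print(passes_test(toy_vectors, "4", "3", "1000000"))     # True
# Numeration test: "three" should be closer to 3 than to "billion".
print(passes_test(toy_vectors, "three", "3", "billion"))  # True
```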

Obtaining word vectors for numerals
On the other hand, algorithms specialized for obtaining word vectors for numerals, going beyond pre-trained word vectors, have been proposed by some researchers in recent years.
Jiang et al. [20] proposed to obtain embeddings for numbers by applying Skip-Gram models directly to numbers while taking their meaning into consideration: a number is represented by a weighted average of the embeddings of numbers that are numerically similar to it. They find "prototype numbers" by clustering, and represent numbers as a weighted average of these prototypes. Sundararaman et al. [21] proposed to learn embeddings for numbers that reflect the distance between two numbers on the number line, independently of words. Table 6 summarizes these previous methods.

Table 5. Number Prediction Dataset.
Data | Source | Task
Numeracy-600K [14] | Market comments from Reuters | Magnitude prediction

³ Similarity of vectors is typically defined by the inner product or cosine similarity of the vectors.
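Representing a number as a weighted average of prototype embeddings can be sketched as follows. The prototypes and their vectors are made up and assumed to be given (the clustering step is not shown), and the similarity function, a softmax over closeness on a log scale, is an assumption chosen for illustration rather than the weighting used in [20].

```python
import math


def prototype_embedding(number, prototypes):
    """Embed a non-negative number as a weighted average of prototype vectors.

    prototypes maps each prototype value to its embedding vector.
    """
    # Assumed similarity: closeness on a log scale, normalized with a softmax.
    scores = {p: -abs(math.log1p(number) - math.log1p(p)) for p in prototypes}
    total = sum(math.exp(s) for s in scores.values())
    weights = {p: math.exp(s) / total for p, s in scores.items()}
    dim = len(next(iter(prototypes.values())))
    return [sum(weights[p] * prototypes[p][i] for p in prototypes) for i in range(dim)]


# Made-up prototypes with 2-dimensional vectors.
prototypes = {1.0: [1.0, 0.0], 100.0: [0.5, 0.5], 10_000.0: [0.0, 1.0]}
print(prototype_embedding(90.0, prototypes))
```

With this construction, any number, including one never seen in training, receives a vector close to those of the prototypes nearest to it in value.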

Numerical reading comprehension and numerical textual entailment detection
More complex tasks such as textual entailment detection or reading comprehension also require appropriate treatment of numbers to answer some of the questions. We first mention some early work on these tasks and then introduce some recent systems.

Task definition
Textual entailment detection is the task of finding, given some text (called the premise), sentences that are true if the given text is true. The situation becomes more complicated if the sentences contain numerals, because numerical knowledge is required to understand their meaning. For example, we can say that the sentence "five people are in the house." is true given the premise "two men and three women are in the house.", but this requires the mathematical knowledge that two plus three equals five.
Some early systems for this task include numeracy modules. The system by Tsuboi et al. [22] for the textual entailment recognition task (RITE) in NTCIR-9 performs temporal expression matching, mapping expressions such as "the first half of Nth century" to the appropriate interval. The system by Iftene and Moruz [23] for the Recognizing Textual Entailment (RTE-6) task implemented special rules for numbers that create intervals from expressions like "more than" or "over".
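The arithmetic step in this example can be made explicit with a deliberately naive sketch. It only checks whether the single count in the sentence to be verified equals the sum of the counts in the given text (called `premise` in the code), and it ignores what is being counted; the function names and the small number-word table are hypothetical.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}


def extract_counts(sentence):
    """Return small number words and digit strings in sentence as integers."""
    counts = []
    for token in re.findall(r"[a-z]+|\d+", sentence.lower()):
        if token.isdigit():
            counts.append(int(token))
        elif token in NUMBER_WORDS:
            counts.append(NUMBER_WORDS[token])
    return counts


def sum_entails(premise, hypothesis):
    """True if the single count in hypothesis equals the sum of premise counts."""
    premise_counts = extract_counts(premise)
    hypothesis_counts = extract_counts(hypothesis)
    return len(hypothesis_counts) == 1 and sum(premise_counts) == hypothesis_counts[0]


premise = "two men and three women are in the house."
print(sum_entails(premise, "five people are in the house."))  # True
print(sum_entails(premise, "six people are in the house."))   # False
```

A real system would additionally need to know that men and women are both people, which is exactly the kind of knowledge that makes the task hard.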
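Interval rules of this kind can be sketched as a mapping from a modified numeric expression to an interval, followed by a membership check. The modifiers "less than" and "under" are added here by assumption for symmetry; the rule set and function names are illustrative and not those of [23].

```python
import math
import re

EXPRESSION = re.compile(r"(more than|over|less than|under)?\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


def to_interval(expression):
    """Map e.g. 'more than 100' to (100, inf, strict=True)."""
    match = EXPRESSION.fullmatch(expression.strip().lower())
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    modifier = match.group(1)
    value = float(match.group(2).replace(",", ""))
    if modifier in ("more than", "over"):
        return (value, math.inf, True)
    if modifier in ("less than", "under"):
        return (-math.inf, value, True)
    return (value, value, False)  # a bare number is a single point


def satisfies(value, expression):
    """True if value lies in the interval described by expression."""
    low, high, strict = to_interval(expression)
    if strict:
        return low < value < high
    return low <= value <= high


print(satisfies(1_200, "more than 1,000"))  # True
print(satisfies(1_000, "over 1,000"))       # False
print(satisfies(50, "50"))                  # True
```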
Reading comprehension is a more complicated task, where the system is required to answer various types of questions.

Numeracy-focused data sets
The aforementioned studies mainly focused on the "range" of numbers, i.e., they simply treat numbers as points or distributions defined on the number line. However, reading comprehension tasks require language models to have more advanced numeric skills such as addition, averaging, and taking the maximum. This line of research typically constructs datasets for numeracy understanding by selecting numeracy-related data from existing datasets for reading comprehension, natural language inference, or entailment. The selected data contain many questions that require understanding of and calculation on numbers beyond simple range- or distribution-based treatment.
Roy et al. [24] proposed the task of Quantity Entailment, which requires numeric reasoning to answer. Their dataset included data from datasets for the Recognizing Textual Entailment (RTE) task. They also proposed a method to solve these problems with CRF-based recognition of the quantity parts of the text and rule-based recognition of entailment.
Ravichander et al. [25] proposed the EQUATE framework for quantitative reasoning in textual entailment, such as determining that "5855 of lambs are black" is correct given the premise "6048 lambs is either black or white and there are 193 white ones." DROP, proposed by Dua et al. [26], requires systems to perform operations such as addition, counting, or sorting. The types of questions and answers in the DROP dataset vary widely, such as the question "Where did Charles travel to first" given the passage "In 1517, the King sailed to Castile. ... In 1518, he traveled to Barcelona." State-of-the-art methods for reading comprehension performed poorly on both EQUATE and DROP, and the authors concluded that more advanced methods are required for these new numeric reading comprehension tasks. Table 7 summarizes these datasets.

Methods
Given these datasets, more advanced models for them have been proposed. A typical approach, given the recent advances in deep neural network technologies, is to use a sequence-to-sequence (seq2seq) model. In seq2seq models, a sequence of words can be fed directly to the system as input, and the system returns another sequence of words as output. In particular, recent pretrained language models including BERT have already been trained on huge amounts of text, and they can be trained to return appropriate word sequences using a relatively small set of training samples for a given task (i.e., pairs of an input document and the "correct" or appropriate output for that input).
Rozen et al. [27] reported that performance on existing natural language inference (NLI) datasets can be improved by augmenting the dataset with synthetic adversarial data, including data generated by rule-based replacement of numeric expressions found in the dataset. Geva et al. [28] showed that adding synthetic numerical tasks to the BERT pretraining steps, followed by fine tuning on DROP, dramatically improved the score on DROP. Ran et al. [29] proposed to inject a graph-based numerical reasoning module between the embedding and prediction modules, which outperformed existing machine reading comprehension models on the DROP dataset. Table 8 summarizes these approaches.
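Generating synthetic numerical training examples can be sketched as below. The three operations, the number range, and the text format of the questions are assumptions chosen for illustration; they are not the task set used in [28].

```python
import random


def make_synthetic_examples(n, seed=0):
    """Generate n (question, answer) string pairs for simple numeric skills."""
    rng = random.Random(seed)  # seeded so the data is reproducible
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        op = rng.choice(["+", "-", "max"])
        if op == "max":
            question, answer = f"max({a}, {b})", max(a, b)
        elif op == "+":
            question, answer = f"{a} + {b}", a + b
        else:
            question, answer = f"{a} - {b}", a - b
        examples.append((question, str(answer)))
    return examples


for question, answer in make_synthetic_examples(3):
    print(question, "=>", answer)
```

Because both sides are plain strings, such pairs can be fed to a sequence-based model in the same way as ordinary text.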

Solving math word problems
Math word problems are a typical type of document that contains numerals and words extensively and requires deep understanding of the meaning of numerals. Developing a system that automatically solves math word problems is thus a major research task in this area.

Task definition
In this task, the problem is given as a text that contains numerals, e.g., "How much would it cost to buy 12 apples at 1.1 dollars each?", and systems are required to provide a solution to the problem, e.g., 12 × 1.1 = 13.2 dollars. Recent approaches to this task typically use deep neural networks that take a sequence of words as input. These inputs are transformed through several layers and used to produce the final output. A variety of output forms have been considered by previous methods, including simple seq2seq models (i.e., outputs are also sequences of words) and sequence-to-tree models (i.e., outputs are trees that represent the equations for calculating the answers).
Sequence-to-sequence (seq2seq) is a typical approach for this task. Ling et al. [30] provided their own dataset with 100,000 samples, and proposed a method that uses a seq2seq model to generate answer rationales, which are human-readable instructions for deriving the answers. Saxton et al. [31] investigated the ability of existing sequence-to-sequence architectures, including the Transformer, to perform mathematical reasoning (e.g., "Solve 41 + 132") on free-form text.
Some researchers have tried to produce graphs that represent the mathematical operations needed to directly produce the answers to the questions. Amini et al. [32] provided a dataset for math word problems called MathQA. They also proposed a sequence-to-program model to solve this task. The approach by Zhang et al. [33] uses a new architecture called Graph2Tree, which uses graphs constructed from the text independently of BiLSTM encoders. They tested their system on the MAWPS data set [34]. Lample and Charton [35] showed that neural models can solve mathematical problems such as symbolic integration and differential equations using sequence-to-sequence approaches. Table 9 summarizes the existing data sets and Table 10 summarizes the systems proposed so far for this task.
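The tree-shaped output form mentioned above can be illustrated by decoding a prefix-order token sequence into an expression tree and evaluating it. The token sequence here is written by hand to stand in for a model's output; the function names are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def parse_prefix(tokens):
    """Build a nested (op, left, right) tree from prefix-order tokens.

    The token list is consumed from the front as the tree is built.
    """
    token = tokens.pop(0)
    if token in OPS:
        left = parse_prefix(tokens)
        right = parse_prefix(tokens)
        return (token, left, right)
    return float(token)


def evaluate(tree):
    """Recursively evaluate a tree produced by parse_prefix."""
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))


# "How much would it cost to buy 12 apples at 1.1 dollars each?"
tree = parse_prefix(["*", "12", "1.1"])
print(tree)                      # ('*', 12.0, 1.1)
print(round(evaluate(tree), 2))  # 13.2
```

Producing an equation tree rather than the final number lets the arithmetic itself be carried out exactly by an evaluator.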
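A program-style output can be sketched as a short list of operations that refer either to numbers in the problem text or to earlier results. The notation below (`n0`, `n1` for numbers in the text, `#0`, `#1` for earlier step results, steps separated by `|`) is a hypothetical format used for illustration and is not claimed to be the exact representation in [32].

```python
import re

OPERATIONS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

STEP = re.compile(r"(\w+)\(([^,()]+),([^,()]+)\)")


def run_program(program, numbers):
    """Execute steps such as 'multiply(n0,n1)|add(#0,n2)' and return the last result."""
    results = []

    def resolve(arg):
        if arg.startswith("n"):
            return numbers[int(arg[1:])]   # i-th number in the problem text
        if arg.startswith("#"):
            return results[int(arg[1:])]   # result of an earlier step
        return float(arg)                  # a literal constant

    for step in program.split("|"):
        name, left, right = STEP.fullmatch(step.strip()).groups()
        results.append(OPERATIONS[name](resolve(left), resolve(right)))
    return results[-1]


# 12 apples at 1.1 dollars each, plus a bag for 2 dollars.
print(round(run_program("multiply(n0,n1)|add(#0,n2)", [12, 1.1, 2.0]), 2))  # 15.2
```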

Other tasks
Yoshida et al. [36] considered the problem of estimating appropriate units for numbers found in Wikipedia tables when the units are omitted. Elazar and Goldberg [37] considered the problem of inferring the omitted head related to a numeral, as in "It is worth about two million __." Chen et al. [38] proposed the numeral attachment task, which is to determine which entity a number presented in text is related to. They also proposed the task of numeral categorization, which is to classify numerals presented in financial text into 7 or 17 categories [39].
The task proposed by Chaganty and Liang [40] was to describe given numerals by examples, such as "$131 million is about the cost to employ everyone in Texas over a lunch period."

Conclusions
The relations between numerals and words found in text data have received little attention compared to other areas in natural language processing. This paper provided an overview of this field, ranging from systems for traditional tasks such as information retrieval to relatively recent tasks like reading comprehension.
We categorized previous research into 6 types: traditional tasks, numerical common sense acquisition, numeracy embeddings, numerical reading comprehension, solving math word problems, and others. The first two have been studied for a relatively long time, while the remaining topics are emerging with recent advances in neural language models.
In Section 2, we introduced some previous systems that have numerical modules for traditional tasks like QA, IE, and IR. In Section 3, we introduced numerical common sense acquisition, where typical approaches are pattern-based extraction and parameter estimation for language models. Section 4 introduced numeracy embedding, where the goal is to assign appropriate real-valued vectors to numerals. Section 5 introduced numerical reading comprehension and numerical entailment, which require more advanced numerical understanding of text. The task of solving math word problems, a typical type of text that contains numerals extensively, was introduced in Section 6, and Section 7 touched on other unique tasks. The recent increase in datasets and resources focusing on numeracy will accelerate the development of systems with the ability to understand numeracy in text.