Open access peer-reviewed chapter

Data Service Outsourcing and Privacy Protection in Mobile Internet

Written By

Zhen Qin, Erqiang Zhou, Yi Ding, Yang Zhao, Fuhu Deng and Hu Xiong

Reviewed: July 3rd, 2018 Published: November 5th, 2018

DOI: 10.5772/intechopen.79903


Abstract

Mobile Internet data are characterized by large scale, diverse patterns, and complex associations. On the one hand, efficient data processing models are needed to support data services; on the other hand, considerable computing resources are needed to provide data security services. Because the resources of mobile terminals are limited, large-scale data computation and storage cannot be completed on the terminals themselves. However, outsourcing them to third parties may put user privacy at risk. This monograph focuses on key technologies of data service outsourcing and privacy protection, including the existing methods of data analysis and processing, fine-grained data access control through effective user privacy protection mechanisms, and data sharing in the mobile Internet.

Keywords

  • mobile Internet
  • data service
  • data outsourcing
  • privacy preserving

Preface

The data of the mobile Internet are characterized by large scale, diverse patterns, complex associations, and so on. On the one hand, efficient data processing models are needed to support data services; on the other hand, considerable computing resources are needed to provide data security services. Because the resources of mobile terminals are limited, large-scale data computation and storage cannot be completed on the terminals themselves. However, outsourcing them to third parties may put user privacy at risk.

This monograph focuses on key technologies of data service outsourcing and privacy protection in the mobile Internet, including the existing methods of data analysis and processing, fine-grained data access control through effective user privacy protection mechanisms, and data sharing in the mobile Internet, which provide technical support for improving various service applications based on the mobile Internet. The financial support from the National Natural Science Foundation of China under grant No. 61672135, the Sichuan Science and Technology Support Plan Program under grants No. 2018GZ0236, No. 2017FZ0004, and No. 2016JZ0020, and the National Science Foundation of China-Guangdong Joint Foundation under grant No. U1401257 is also greatly appreciated.

Needless to say, any errors and omissions that remain are entirely our own responsibility, and we would greatly appreciate feedback about them from readers. Please send your comments to “qinzhen@uestc.edu.cn” and put “Data Service Outsourcing and Privacy Protection in Mobile Internet” in the subject line.

Zhen Qin

University of Electronic Science and Technology of China

March, 2018


Setting the Stage

1. Why Outsourcing and Privacy Protection Matter for Data Service in the Mobile Internet

1.1. Mobile Internet

With the continuous development of science and technology, the mobile Internet (MI) has become an important part of our daily lives. It is a product of the convergence of mobile communication technologies and Internet technologies, offering wide network coverage, convenience, and instant access. The smartphone is the main carrier of the mobile Internet. Through smartphones, people can make voice calls, send text messages, and hold video calls, as well as surf the Internet at any time (browsing the web, sending and receiving e-mails, watching movies, etc.). In other words, with smartphones, users are better able to create, deploy, and manage novel and efficient mobile applications. Nowadays, the MI has become an important portal and major innovation platform for Internet businesses and an important hub for the exchange of information resources among social media, business service companies, and new application software.

1.1.1. Mobile Internet

The mobile Internet, defined as the platform through which users wirelessly access the digitized contents of the Internet via mobile devices, consists of the following:

  • An Internet service provider is an organization that provides services such as Internet access, Internet transit, domain name registration, web hosting, Usenet service, and colocation.

  • Users can get different services from the mobile Internet such as points of interest services, music services, fitness services, and mobile commerce services by using mobile applications.

In the MI, mobile devices, including smartphones, tablets, PCs, e-book readers, MIDs, etc., enable users to stay connected not only with network service providers through LTE and NFC techniques but also with physically close peers through short-range wireless communication techniques. In fact, each individual user has a personalized behavior pattern, which can influence the reliability and efficiency of the network in the MI. Moreover, the MI has a wide range of applications, such as wireless body area networks (WBANs), mobile e-commerce, mobile social networks, mobile ad hoc networks, etc. The common scenarios of the MI are shown in Figure 1.1.

Figure 1.1.

The common scenarios of the MI.

1.1.2. Application

The mobile Internet offers a huge number of applications to users. Most of these applications have unique features or special functions, but their underlying technology remains largely the same as that of other services. The popular applications in the mobile Internet are briefly reviewed below.

1.1.2.1. Mobile social networks

Social networking has become an important, integral part of our daily lives, enabling us to contact our friends and families. Nowadays, the pervasive use of the mobile Internet and of social connections fosters a promising mobile social network (MSN) in which reliable, comfortable, and efficient computation and communication tools are provided to improve the quality of our work and life.

An MSN is a social network in which individuals with similar interests converse and connect with one another through their mobile phones and/or tablets. Much like web-based social networking, mobile social networking occurs in virtual communities [1].

Similar to online social networking sites such as Facebook and Twitter, there are just as many social networks on mobile devices. They offer a vast number of functions, including multimedia posts, photo sharing, and instant messaging, and most of these mobile applications offer free international calling and texting capabilities. Today, social networking applications are used not only for social purposes but also frequently for professional ones, such as LinkedIn, which is still constantly growing. Along with sharing multimedia posts and instant messaging, social networks are commonly used to connect immigrants in a new country. While the thought of moving to a new country may be intimidating for many, social media can be used to connect immigrants of the same homeland together to make assimilation a little less stressful [2].

1.1.2.2. Mobile ad hoc networks

Because of the limitations of applications over wired communication, extending applications via the mobile Internet has become necessary. The growth of laptops and 802.11/Wi-Fi wireless networks has made mobile ad hoc networks (MANETs) a popular research topic since the mid-1990s.

A mobile ad hoc network (MANET), also known as a wireless ad hoc network [3] or ad hoc wireless network, is a continuously self-configuring, infrastructure-less network of mobile devices connected using wireless channels [4]. Each device in a MANET is free to move independently in any direction and will therefore change its links to other devices frequently. Such networks may operate by themselves or may be connected to the larger Internet. They may contain one or multiple different transceivers between nodes. This results in a highly dynamic, autonomous topology [5].

MANETs can be used in many applications, ranging from environmental sensors, vehicular ad hoc communications, road safety, smart homes, peer-to-peer messaging, and air/land/navy defense to weapons and robots. A typical instance is health monitoring, which may include heart rate, blood pressure, etc. [6]. Monitoring can be constant, as in the case of a patient in a hospital, or event driven, as in the case of a wearable sensor that automatically reports your location to an ambulance team in an emergency. Animals can have sensors attached to them in order to track their movements for migration patterns, feeding habits, or other research purposes [7]. Another example is disaster rescue operations. Sensors may also be attached to unmanned aerial vehicles (UAVs) for surveillance or environment mapping [8]. In the case of autonomous UAV-aided search and rescue, this would be considered an event mapping application, since the UAVs are deployed to search an area but will only transmit data back when a person has been found.

1.1.2.3. Mobile payment

With rapid economic growth, methods of payment have changed considerably all over the world. Driven by the development of the mobile Internet, mobile payment has become more pervasive in Asian countries, especially in China.

Mobile payment (also referred to as mobile money, mobile money transfer, and mobile wallet) generally refers to payment services operated under financial regulation and performed from or via a mobile device. Instead of paying with cash, check, or credit card, a consumer can use a mobile phone to pay for a wide range of services and digital or hard goods.

Although the concept of using non-coin-based currency systems has a long history [9], it is only recently that the technology to support such systems has become widely available.

1.1.3. Mobile Internet and user behavior

Since the beginning of the twenty-first century, China's Internet has developed rapidly, and the mobile Internet has further accelerated this development. Today's Internet is used for e-commerce shopping, map navigation, mobile payment, stock financing, online booking, government services, account recharging, and other activities that are completely different from before; it seems that everything can be done on the Internet, relying on nothing more than a phone. In the mobile Internet, users and mobile terminals are closely related, and the accumulated record of user behavior in the mobile Internet is of great value. Smartphones and other mobile terminals are no longer simple communication tools but have become terminals that integrate communication, computing, and sensing functions and can continuously generate and collect various kinds of data [10]. The various sensing devices on a mobile terminal can sense the surrounding environment and generate data; at the same time, the mobile terminal serves as the user's entrance to the mobile Internet and can collect data generated by other users. By combining the various data related to mobile users, user behaviors can be modeled and predicted, and important statistics for specific services can be extracted. For example, mobile users' GPS information and historical travel habits can be used to generate traffic statistics and travel advice [11]. It can be seen that, based on the data generated in the mobile Internet, we can promptly satisfy the different requirements of different users in different scenarios by judging users' information needs and accurately sorting information categories, thereby providing users with personalized, high-quality information and improving their service experience.

Relevant statistics show that in just 3 years of rapid mobile Internet development, the total amount of data generated by humans exceeded that of the previous 400 years. The global data volume reached 1.8 ZB in 2011 and is expected to reach 35 ZB by 2020. However, as data accumulate, the amount of data available for analysis and mining continues to increase, and the entropy of information (i.e., the ratio of total information to valuable information) tends to increase as well. The growing amount of data makes information overload, information redundancy, and information search new issues in the mobile Internet era. Search engines and recommendation systems provide very important technical means for solving these problems. When users search for information on the Internet, the search engine matches the users' keywords in the background and displays the results; however, if users cannot accurately describe the keywords they need, the search engine is powerless. Unlike search engines, a recommendation system does not require users to provide explicit requirements but models users' interests by analyzing their historical behavior, thereby actively recommending information that meets their interests and needs. In recent years, e-commerce has been booming, and the dominant position of the recommendation system on the Internet has become increasingly apparent. In the mobile Internet, however, data are not only increasingly large in scale but also diverse in model, complex in association, and uncertain in reliability. Diversified models make the data content difficult to understand, the complexity of associations makes the data difficult to identify effectively, and the uncertainty of credibility makes it difficult to determine the authenticity of the data. These characteristics pose new challenges for the perception, expression, understanding, and computation of data. The space-time complexity of traditional computing models greatly limits our ability to design efficient data computing models for the mobile Internet. How to quantify, define, and extract the essential characteristics and internal correlations of data in the mobile Internet, study their internal mechanisms, and provide support for recommendation services is an urgent problem to be solved.

1.1.4. Behavior and data

At the same time, because mobile Internet data are large in scale, diverse in model, and complex in association, data analysis and processing place certain demands on computing resources, while mobile terminals such as smartphones have limited computing resources (such as processing capability). Therefore, analysis and processing of mobile Internet data are often carried out on devices such as third-party data processing servers that have sufficient computing resources. However, users and mobile terminals in the mobile Internet are closely related, and their data contain a large amount of private user information, which raises a series of security issues. The users of an outsourced data application system are not a fixed set of specific users; any user connected to the Internet may submit data query requests to the outsourcing service, and different users have different requirements. Precisely because of this openness, the outsourcing system may be completely exposed to any external potential user, and it is not enough to protect the privacy of outsourced data only through system-level access control, cryptography, and other technical means. According to statistics from the Beijing Zhongguancun police station, telecommunication fraud reported for the whole year of 2012 accounted for 32% of the cases filed, the highest proportion among crime types. Fraudsters often use six kinds of methods:

  • Impersonation of an individual or a circle of friends after the disclosure of personal information, such as criminals posing as staff of public security organs, postal services, telecommunications companies, banks, or social security agencies, or as friends and relatives, which accounts for 42% of the total number of fraud cases.

  • Fraud committed in the name of the seller after the leakage of shopping information.

  • Lottery-winning fraud following the leakage of telephone, QQ, or e-mail information.

  • False recruitment information received after the leakage of job-seeking information.

  • Internet dating fraud after the leakage of dating information.

  • Kidnapping fraud after the disclosure of family information.

It can be seen that many third parties have disclosed users’ personal information to varying degrees.

When people realize that they want to protect their privacy and try to hide their actions, they often do not realize that traces of their actions are already on the Internet; in particular, different social locations in the network generate many data footprints. Such data are cumulative and correlated: the information from a single place may not reveal a user's privacy, but if many behaviors gathered from different, independent locations are combined, the user's privacy will be exposed because enough information about him or her has accumulated. This hidden data exposure is often unpredictable and uncontrollable by individuals. The leakage of personal privacy information has caused panic among some users, which has negatively affected the sharing of data resources in the mobile Internet. The issue of privacy protection has therefore become increasingly important.

As carriers of personal information, mobile smart terminals are characterized by mobility, diversity, and intelligence. The various applications running on them are of complex types, their execution mechanisms are not disclosed, and they handle large amounts of data; these factors combine to expose them to much greater threats than before. In terms of manifestation, the security issues of mobile smart terminals are mainly concentrated in the following aspects: first, illegal dissemination of content; second, malicious fee deduction; third, theft of user privacy; and fourth, mobile terminal viruses and illegal ROM flashing, which cause black screens, system crashes, and other problems. Mobile malicious code and the theft of user privacy are illustrative examples. Because of negligence in the design and implementation of mobile smart terminal operating systems, system vulnerabilities may be exploited by malicious code. In particular, as mobile Internet hacking techniques continue to mature, malicious code targeting mobile smart terminals poses a great threat to their security. Mobile malicious code is a destructive malicious program for mobile terminals. It usually spreads between mobile smart terminals via SMS, MMS, e-mail, and website browsing over mobile communication networks, or via infrared and Bluetooth. At present, the mobile smart terminal operating system with the highest market share is Google's Android, followed by Apple's iOS and Microsoft's Windows Phone. As smart terminal operating systems become more uniform, the spread of malicious code has accelerated, and it has evolved from simple fee absorption to sophisticated fraud and hooliganism. In addition, with the increasing binding of mobile terminals to bank accounts, such as through third-party payment, the generation and spread of mobile phone viruses are accelerating in the direction of traffic consumption, malicious deductions, and theft of private information. In particular, with the rapid development of the Android operating system, mobile smart terminals running it have gradually become the main targets of hacker attacks. In the first half of 2015, there were 5.967 million new Android virus packages nationwide, an increase of 17.41% year on year, and the number of infected users reached 140 million, an increase of 58% over the same period of the previous year; the number of affected mobile payment users reached 11.455 million. How to securely outsource mobile Internet data storage to a third-party server while ensuring user privacy and realizing fine-grained access control of mobile Internet data is an urgent problem to be solved.

1.2. Research topics of the mobile Internet

1.2.1. Ontology-based information retrieval

Information retrieval is an important means of information exchange between users and data and has been studied by many researchers. Traditional information retrieval is based on keyword indexing and document retrieval. However, data on the mobile Internet have four features: large scale, complicated models, diversified data formats, and heterogeneous data sources. A recurring problem is that different keywords can have the same meaning, so search results can be inaccurate or incomplete.

To deal with this situation, the work in [12] defines ontology-based information retrieval. This model includes the four main processes of an IR system: indexing, querying, searching, and ranking. However, as opposed to traditional keyword-based IR models, in this approach the query is expressed in an ontology-based query language (SPARQL), and the external resources used for indexing and query processing consist of an ontology and its corresponding knowledge base (KB). The indexing process is equivalent to a semantic annotation process. Instead of creating an inverted index in which keywords are associated with the documents where they appear, in the ontology-based IR model the inverted index contains semantic entities (meanings) associated with the documents where they appear. The relation or association between a semantic entity and a document is called an annotation. The overall retrieval process consists of the following steps:

  • The system takes as input a formal SPARQL query.

  • The SPARQL query is executed against a KB, returning a list of semantic entities that satisfy the query. This process is purely Boolean (i.e., based on an exact match), so that the returned instances must strictly hold all the conditions of the formal query.

  • The documents that are annotated (indexed) with the above instances are retrieved, ranked, and presented to the user. In contrast to the previous phase, the document retrieval phase is based on an approximate match, since the relation between a document and the concepts that annotate it has an inherent degree of fuzziness.

The steps listed above are described in more detail in the following subsections, from indexing to query processing, document retrieval, and ranking.

In this model, it is assumed that a KB has been built and associated with the information sources (the document base) by using one or several domain ontologies that describe concepts appearing in the document text. The concepts and instances in the KB are linked to the documents by means of explicit, non-embedded annotations of the documents. In this model, keywords appearing in a document are assigned weights reflecting the fact that some words are better at discriminating between documents than others. Similarly, annotations are assigned weights that reflect the discriminative power of instances with respect to the documents. The weight d_x of an instance x for a document d is computed as

$$d_x = \frac{freq_{x,d}}{\max_y freq_{y,d}} \cdot \log\frac{|D|}{n_x} \tag{E1.1}$$

where freq_{x,d} is the number of occurrences in d of the keywords attached to x, n_x is the number of documents annotated with x, and D is the set of all documents in the search space.

The query execution returns a set of tuples that satisfy the SPARQL query. The system then extracts the semantic entities from those tuples and accesses the semantic index to collect all the documents in the repository that are annotated with these semantic entities. Once the list of documents is formed, the search engine computes a semantic similarity value between the query and each document, using an adaptation of the classic vector-space IR model. Each document in the search space is represented as a document vector in which each element corresponds to a semantic entity. The value of an element is the weight of the annotation between the document and the semantic entity, if such an annotation exists, and zero otherwise. The query vector is generated by weighting the variables in the SELECT clause of the SPARQL query. For testing purposes, the weight of each variable of the query was set to 1, but in the original model users are allowed to set this weight manually according to their interest. Once the vectors are constructed, the similarity measure between a document d and the query q is computed as

$$sim(d, q) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}|\,|\vec{q}|} \tag{E1.2}$$

If the knowledge in the KB is incomplete (e.g., there are documents about travel offers in the knowledge source, but the corresponding instances are missing in the KB), the semantic ranking algorithm performs very poorly: SPARQL queries will return fewer results than expected, and the relevant documents will not be retrieved or will get a much lower similarity value than they should. As limited as it might be, keyword-based search will likely perform better in these cases. To cope with this, the ranking function combines the semantic similarity measure with the similarity measure of a keyword-based algorithm, combining the output of the search engines and computing the final score as

$$\lambda \, sim(d, q) + (1 - \lambda)\, ksim(d, q) \tag{E1.3}$$

where ksim is computed by a keyword-based algorithm and λ ∈ [0, 1].
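
To make Eqs. E1.1–E1.3 concrete, the following Python sketch computes annotation weights, the cosine similarity of Eq. E1.2, and the combined score of Eq. E1.3 for a toy two-document collection. The entity names, frequencies, and keyword-similarity (ksim) values are illustrative assumptions, not data from [12].

```python
# Toy illustration of Eqs. E1.1-E1.3: annotation weighting, cosine
# similarity between a document vector and a query vector, and the
# combined semantic/keyword score. Entities and ksim values are made up.
import math

# freq[d][x]: occurrences in document d of keywords attached to entity x
freq = {
    "doc1": {"Hotel": 4, "Paris": 2},
    "doc2": {"Hotel": 1, "Beach": 3},
}
D = list(freq)                                   # the document collection
n = {"Hotel": 2, "Paris": 1, "Beach": 1}         # n_x: documents annotated with x

def weight(d: str, x: str) -> float:
    """Annotation weight d_x of entity x for document d, Eq. E1.1."""
    f = freq[d].get(x, 0)
    if f == 0:
        return 0.0
    return (f / max(freq[d].values())) * math.log(len(D) / n[x])

def sim(d: str, query_entities: dict) -> float:
    """Cosine similarity between document and query vectors, Eq. E1.2."""
    entities = set(query_entities) | set(freq[d])
    dvec = [weight(d, x) for x in entities]
    qvec = [query_entities.get(x, 0.0) for x in entities]
    dot = sum(a * b for a, b in zip(dvec, qvec))
    norm = math.sqrt(sum(a * a for a in dvec)) * math.sqrt(sum(b * b for b in qvec))
    return dot / norm if norm else 0.0

def score(d: str, query_entities: dict, ksim: float, lam: float = 0.5) -> float:
    """Combined ranking score of Eq. E1.3."""
    return lam * sim(d, query_entities) + (1 - lam) * ksim

query = {"Hotel": 1.0, "Paris": 1.0}             # SELECT variables weighted 1
print(score("doc1", query, ksim=0.40))
print(score("doc2", query, ksim=0.10))
```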

Many information retrieval models have been put forward to improve retrieval performance, and some combinations of these technologies have proven effective. But faced with the rich data in the MI, substantial improvement is still needed to make search more intelligent and efficient.

1.2.2. Deep learning

Machine learning is a field of computer science that gives computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) from data without being explicitly programmed. It learns regularities from external input data through algorithmic analysis and uses them for recognition and judgment. With its development, machine learning is commonly divided into shallow learning and deep learning, and it plays an important role in the field of artificial intelligence. Because of the large scale, multiple models, and complex correlations of the data, traditional shallow learning algorithms (such as the support vector machine [13]), though widely used, are still subject to certain restrictions when dealing with complex classification tasks [14].

Deep learning is a new field in the study of machine learning; its motivation is to establish and simulate neural networks that mimic how the human brain analyzes and learns. It imitates the mechanisms of the human brain to interpret data such as images, sound, and text. As with other machine learning methods, deep learning methods include supervised learning and unsupervised learning. Supervised learning algorithms experience a data set containing features in which each example is also associated with a label or target. For example, the iris data set is annotated with the species of each iris plant, so a supervised learning algorithm can study the iris data set and learn to classify iris plants into three different species based on their measurements. Unsupervised learning algorithms experience a data set containing many features and then learn useful properties of the structure of this data set. In the context of deep learning, we usually want to learn the entire probability distribution that generated a data set, whether explicitly, as in density estimation, or implicitly, as for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the data set into clusters of similar examples. The learning models established under different learning frameworks are very different. For example, the convolutional neural network (CNN) is a deep learning model used in the supervised setting, whereas the deep belief network (DBN) is a deep learning model used in the unsupervised setting.
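
As a minimal illustration of the two settings just described, the sketch below (assuming scikit-learn is available) fits a supervised classifier on the labeled iris data set and runs unsupervised k-means clustering on the same measurements without labels.

```python
# Supervised vs. unsupervised learning on the iris data set: a classifier
# learns from the species labels, while k-means only sees the measurements
# and groups similar examples into clusters.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: features paired with species labels.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised: the same features, but no labels are given to the model.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
```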

The concept of deep learning was proposed in 2006 by Hinton [15]. Based on the deep belief network (DBN), an algorithm was proposed to solve the optimization problem of deep structures, and the multilayer autoencoder deep structure was then proposed. In addition, the convolutional neural network proposed by LeCun et al. [16] was the first true multilayer structure learning algorithm; it uses spatial relative relationships to reduce the number of parameters and thereby improve training performance. Deep learning is a representation learning method within machine learning. An observation (such as an image) can be represented in a variety of ways, such as a vector of per-pixel intensity values, or more abstractly as a set of edges, regions of specific shapes, etc. It is easier to learn a task from examples using certain specific representations (e.g., face recognition or facial expression recognition). The advantage of deep learning is that it replaces manual feature engineering with unsupervised or semi-supervised feature learning and hierarchical feature extraction.
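
The parameter-reduction claim about convolution can be made tangible with a small sketch (assuming PyTorch is available): a convolutional layer over a 3-channel 32 × 32 input has orders of magnitude fewer parameters than a fully connected layer of comparable input/output size. The layer sizes below are illustrative only.

```python
# A rough illustration of why convolution "uses spatial relative
# relationships to reduce the number of parameters": compare the parameter
# count of a conv layer with that of a fully connected layer over the
# same 3-channel 32x32 input.
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)     # same input/output dimensionality

print("conv parameters:", n_params(conv))             # 3*3*3*16 + 16 = 448
print("fully connected parameters:", n_params(fc))    # roughly 50 million
```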

Deep learning architectures such as deep neural networks, deep belief networks, and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, and drug design [13], where they have produced results comparable to, and in some cases superior to [14], those of human experts.

The work in [17] first proposed a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR). This model is a pre-trained deep neural network-hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. Although the experiments show that CD-DNN-HMMs provide dramatic improvements in recognition accuracy, many issues remain to be resolved. First, although CD-DNN-HMM training is asymptotically quite scalable, in practice it is quite challenging to train CD-DNN-HMMs on tens of thousands of hours of data. Second, highly effective speaker and environment adaptation algorithms for DNN-HMMs must be found, ideally ones that are completely unsupervised and integrated with the pre-training phase. Third, the training in that study used the embedded Viterbi algorithm, which is not optimal. In addition, the study views the treatment of the time dimension of speech by DNN-HMMs and GMM-HMMs alike as a very crude way of dealing with the intricate temporal properties of speech. Finally, although Gaussian RBMs can learn an initial distributed representation of their input, they still produce a diagonal-covariance Gaussian for the conditional distribution over the input space given the latent state (as diagonal-covariance GMMs also do). Ji et al. [18] developed a 3D CNN model for action recognition. This model constructs features from both the spatial and temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel, and the final feature representation is computed by combining information from all channels. Evaluated on the TRECVID and KTH data sets, the results show that the 3D CNN model outperforms the compared methods on the TRECVID data while achieving competitive performance on the KTH data, demonstrating its superior performance in real-world environments. Furthermore, Cho et al. [19] proposed a novel neural network model called the RNN encoder-decoder that is able to learn the mapping from a sequence of arbitrary length to another sequence, possibly from a different set, of arbitrary length. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system was empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN encoder-decoder as an additional feature in the existing log-linear model, although one approach that was not investigated is to replace the whole or a part of the phrase table by letting the RNN encoder-decoder propose target phrases. A multi-context deep learning framework for saliency detection is proposed in [20]. Different pre-training strategies are investigated to learn the deep model for saliency detection, and a task-specific pre-training scheme for the presented multi-context deep model is designed. The effectiveness in saliency detection of contemporary deep models from the ImageNet image classification challenge is also investigated.

Although deep learning has shown great success in various applications such as object recognition [21] and speech recognition [17], and especially in language modeling [22], paraphrase detection [23], and word embedding extraction [24], its application in the MI is still worth investigating.

1.2.3. Outsourced attribute-based encryption system

Attribute-based encryption (ABE) is a promising cryptographic primitive that has recently been widely applied to the design of fine-grained access control systems. Data access control has been evolving over the past 30 years, and various techniques have been developed to effectively implement fine-grained access control [25], which allows flexibility in specifying the differential access rights of individual users. However, traditional access control systems are mostly designed for in-house services and depend greatly on the system itself to enforce authorization policies; they cannot be applied in cloud computing because users and cloud servers are no longer in the same trusted domain. To help the data owner impose access control over data stored on untrusted cloud servers, a feasible approach is to encrypt the data through certain cryptographic primitives and disclose decryption keys only to authorized users. One critical issue for this branch of approaches is how to achieve the desired security goals without introducing high complexity in key management and data encryption. Existing works resolve this issue either by introducing a per-file access control list (ACL) for fine-grained access control or by categorizing files into several filegroups for efficiency. As the system scales, however, the ACL-based scheme introduces an extremely high complexity that can be proportional to the number of system users, while the filegroup-based scheme can only provide coarse-grained access control.

To provide fine-grained access control over encrypted data, a novel public-key primitive, attribute-based encryption (ABE) [26], was introduced in the cryptographic community, enabling public-key-based one-to-many encryption. In an ABE system, users' keys and ciphertexts are labeled with sets of descriptive attributes and access policies, respectively, and a particular key can decrypt a ciphertext only if the associated attributes and policy match. Though ABE is a promising primitive for designing fine-grained access control systems in cloud computing, several challenges remain in its application:

  • One of the main drawbacks of ABE is that the computational cost in the decryption phase grows with the number of attributes specified in the access policy. This drawback is even more serious for resource-constrained users such as mobile devices and sensors. Therefore, one challenge is how to reduce the decryption complexity of ABE so that it can be applied to fine-grained access control for users with resource-constrained devices.

  • Beyond decryption, generating users' private keys in existing ABE schemes also requires a great number of modular exponentiations. Furthermore, the revocation of any single user in existing ABE requires a key update at the authority for the remaining users who share his/her attributes. All of these heavy tasks centralized at the authority side make it the efficiency bottleneck of the whole access control system. Therefore, another challenge is how to reduce the key-issuing complexity of ABE so that scalable access control can be supported.

To improve the practicality of attribute-based encryption systems, researchers began to introduce outsourced computing into them. In 2011, Green et al. [27] proposed a scheme that outsources the decryption operation in an attribute-based cryptosystem to a third party in order to improve the client's efficiency and enable the attribute-based cryptosystem to run on devices with limited computational resources such as smart cards; the ciphertext length in this scheme increases linearly with the number of attributes in the access policy. In [27], the third party uses a conversion key sent by the user to convert the ciphertext with high decryption cost into an ElGamal-encrypted ciphertext with low decryption cost and transmits the converted ciphertext back to the user, who can then complete the decryption locally with his own decryption key. Li et al. [28] reduced the client's computational complexity in attribute-based cryptosystems from the perspective of outsourcing cryptographic operations: after outsourcing, the user only needs to perform a constant number of exponentiations for encryption, regardless of the number of attributes, and can encrypt under any access policy. Li et al. [29] proposed an attribute-based encryption scheme that supports outsourced key generation and outsourced decryption and, on this basis, a fine-grained access control system. Lai et al. [30] pointed out that the scheme in [27] does not allow the user to verify the results returned by the third party and that its security relies on the random oracle model; building on [27], they proposed a verifiable outsourced attribute-based encryption scheme in which, by adding redundant information, users can verify the conversion results returned by third parties to ensure the correctness of the conversion, with security proved in the standard model. Li et al. [31] proposed an attribute-based encryption scheme, building on [27, 28] and related work, that reduces both the attribute authority's and the user's computing burden and also supports user verification; by outsourcing part of the key generation and decryption operations to different service providers, the local computing burden of the attribute authority and the user is greatly reduced. Lin et al. [32] pointed out that although the method of adding redundant information in [30] achieves verification of the returned result, it also increases the computational and communication burden of the system; they proposed an outsourced attribute-based encryption scheme that uses a hybrid encryption system and a commitment mechanism to achieve user verification, in which the ciphertext length and the cost of cryptographic operations are only half those of the scheme in [30], with security based on the standard model. Ma et al. [33] presented the concept of innocence proof for the first time, preventing a user who holds the private key from falsely reporting cloud server operation errors, and proposed an attribute-based encryption scheme that supports both outsourced encryption and decryption together with verifiability and innocence proofs. Mao et al. [34] proposed an attribute-based encryption scheme with verifiable outsourced decryption; experiments show that the scheme is more concise and less computationally expensive than the scheme proposed in [30], although its cost is still linearly related to the number of attributes. Xu et al. [35] proposed a circuit ciphertext-policy attribute-based hybrid encryption scheme to ensure data integrity in cloud computing environments and to implement fine-grained access control and verifiable delegation.
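
To illustrate the conversion-key idea described above, the following toy Python sketch mimics the pattern of outsourced decryption with plain ElGamal over a small prime-order group: the user hands the proxy a blinded transformation key, the proxy performs the bulk exponentiation, and only the user, who keeps the blinding factor, can recover the plaintext. This is an analogy of the information flow only; it is not the pairing-based ABE construction of [27], and all parameters and names are illustrative.

```python
# Toy analogy of outsourced decryption: the proxy gets a blinded
# conversion key tk = a / z mod q, does the heavy exponentiation, and
# returns an ElGamal-style partially decrypted ciphertext that only the
# holder of the blinding factor z can finish decrypting.
import secrets

q = 1019                      # toy primes; real systems use ~256-bit groups
p = 2 * q + 1                 # 2039, a safe prime
g = 4                         # generator of the order-q subgroup of Z_p^*

a = secrets.randbelow(q - 1) + 1           # user's secret key (toy stand-in)
pk = pow(g, a, p)

def encrypt(m: int) -> tuple[int, int]:
    k = secrets.randbelow(q - 1) + 1
    return pow(g, k, p), (m * pow(pk, k, p)) % p

# User prepares a conversion (transformation) key: a blinded copy of a.
z = secrets.randbelow(q - 1) + 1
tk = (a * pow(z, -1, q)) % q               # uniformly random, reveals nothing about a

def proxy_transform(ct):
    """Third party performs the heavy exponentiation using tk only."""
    c1, c2 = ct
    return pow(c1, tk, p), c2              # partially decrypted ciphertext

def user_finish(ct_partial):
    """User removes the blinding with z and recovers the plaintext."""
    c1p, c2 = ct_partial
    return (c2 * pow(pow(c1p, z, p), -1, p)) % p

m = 123
assert user_finish(proxy_transform(encrypt(m))) == m
```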

It can be seen from existing work that currently available secure outsourced attribute-based encryption schemes cannot achieve a fixed ciphertext length, and that although these solutions improve computational efficiency, they also increase the communication load of mobile devices. How to achieve fine-grained access control of data in the MI and promote data sharing in the MI therefore remains worth investigating.

1.2.4. Proxy re-encryption system

Since cloud users do not fully trust cloud computing service providers (CSPs), they do not want any unauthorized personnel (including the cloud service provider) to view their data. This cannot rely on administrative constraints or ethics alone; technical means must be adopted to ensure that cloud service providers and other unauthorized personnel cannot obtain any of the data customers upload to the cloud, and that any unauthorized change to customer data in the cloud can be detected by the customer. Therefore, user data usually need to be encrypted on the client before being transmitted to the cloud. However, in the cloud computing environment there are a large number of data sharing scenarios. If decryption for data sharing is performed in the cloud, there is a risk of leaking user data, which is unacceptable because the cloud computing service provider is not fully trusted by the user; and if the user is required to re-encrypt the files himself, this is very inconvenient. The proxy re-encryption (PRE) scheme for distributed data storage can solve this problem well.

Proxy re-encryption (PRE), initially introduced by Blaze et al. [36], enables a semi-trusted proxy to transform a ciphertext encrypted under the public key of a delegator into another ciphertext under the public key of a delegatee without leaking the underlying messages or the private keys of the delegator or delegatee to the proxy. This special kind of public-key encryption seems to be an optimal candidate for securing shared data in cloud computing. Suppose the data owner (say, Alice) intends to share sensitive data stored in the cloud with another granted user (say, Bob), and it is desirable that the requested data can be accessed by nobody other than Bob. Following the PRE primitive, Alice encrypts the sensitive data under her own public key before uploading the shared data to the semi-trusted cloud. After receiving Bob's request for the data, Alice generates a proxy re-encryption key using her own private key and Bob's public key and sends this re-encryption key to the semi-trusted cloud server. Equipped with this re-encryption key, the cloud server can transform the ciphertext encrypted under Alice's public key into an encryption under Bob's public key. By the PRE primitive, the transformed ciphertext can only be decrypted by Bob, whereas the cloud server is unable to learn the plaintext or the private keys of Alice or Bob. Finally, Bob can download and decrypt the requested data with his own private key. In this way, the costly burden of secure data sharing can be offloaded to the semi-trusted cloud server with its abundant resources.
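
As a concrete (and deliberately simplified) illustration, the sketch below implements a toy bidirectional ElGamal-based proxy re-encryption in the spirit of Blaze et al. [36]: the proxy re-exponentiates one ciphertext component with the re-encryption key sk_B/sk_A and never sees the plaintext or either private key. The group parameters are tiny toy values; a real deployment would use a standardized group and a vetted library.

```python
# Minimal ElGamal-based proxy re-encryption sketch (bidirectional, in the
# spirit of Blaze et al. [36]) over a toy prime-order subgroup of Z_p^*.
import secrets

q = 1019                      # small prime (illustration only)
p = 2 * q + 1                 # 2039, a safe prime
g = 4                         # generator of the order-q subgroup

def keygen():
    sk = secrets.randbelow(q - 1) + 1          # private key in [1, q-1]
    return sk, pow(g, sk, p)                   # (sk, pk = g^sk mod p)

def encrypt(pk, m):
    """Encrypt a group element m under pk as (m * g^k, pk^k)."""
    k = secrets.randbelow(q - 1) + 1
    return (m * pow(g, k, p)) % p, pow(pk, k, p)

def decrypt(sk, ct):
    c1, c2 = ct
    gk = pow(c2, pow(sk, -1, q), p)            # (g^{sk*k})^{1/sk} = g^k
    return (c1 * pow(gk, -1, p)) % p

def rekey(sk_a, sk_b):
    """Re-encryption key rk_{A->B} = sk_b / sk_a mod q."""
    return (sk_b * pow(sk_a, -1, q)) % q

def reencrypt(rk, ct):
    """Proxy converts an Alice-ciphertext into a Bob-ciphertext without
    learning the plaintext: only the second component changes."""
    c1, c2 = ct
    return c1, pow(c2, rk, p)

# Alice shares a message with Bob through the semi-trusted proxy.
sk_a, pk_a = keygen()
sk_b, pk_b = keygen()
m = 42                                          # message encoded as a group element
ct_a = encrypt(pk_a, m)
ct_b = reencrypt(rekey(sk_a, sk_b), ct_a)
assert decrypt(sk_b, ct_b) == m
```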

In 2005, Ateniese et al. [37] proposed the first unidirectional PRE scheme. Like the scheme of Blaze et al. [36], it is secure only against chosen-plaintext attacks (CPA). In 2007, Canetti et al. [38] designed a bidirectional PRE scheme with chosen-ciphertext security. In 2008, Libert et al. [39] introduced a replayable chosen-ciphertext secure (RCCA) unidirectional PRE scheme. Since then, various PRE schemes have been proposed in the literature (e.g., [40, 41, 42, 43, 44]).

PRE can be extended to the identity-based setting. In 2007, Green and Ateniese [45] proposed the first identity-based proxy re-encryption (IBPRE) scheme, which is CCA secure in the random oracle model, where hash functions are assumed to be fully random. Chu and Tzeng [46] constructed a CCA-secure IBPRE scheme in the standard model. After that, many IBPRE schemes have been proposed, such as [47, 48, 49, 50, 51, 52, 53].

However, among all of the aforementioned schemes, the semi-trusted proxy can use a given re-encryption key to transform all the ciphertexts of a delegator into those of a delegatee. In reality, the delegator may not want all of his data to be transformable for the delegatee. Therefore, type-based PRE [54] and conditional PRE (CPRE) [55, 56] were proposed, in which the proxy can only fulfill the ciphertext conversion conditionally. Later, Liang et al. [57, 58] proposed two IBCPRE schemes that are CCA secure in the standard model. However, He et al. [59] presented a security analysis showing that their schemes only achieve CPA security. In 2016, He et al. [60] proposed an efficient identity-based conditional proxy re-encryption (IBCPRE) scheme that is CCA secure in the random oracle model.

PRE can also be extended to the attribute-based setting. Attribute-based proxy re-encryption (ABPRE) can effectively increase the flexibility of data sharing. In 2009, Liang et al. [61] first defined the notion of ciphertext-policy ABPRE (CP-ABPRE), where each ciphertext is labeled with a set of descriptive conditions and each re-encryption key is associated with an access tree that specifies which type of ciphertexts the proxy can re-encrypt, and they presented a concrete scheme supporting AND gates with positive and negative attributes. After that, several CP-ABPRE schemes (e.g., [62]) with more expressive access policies were proposed. In 2011, Fang et al. [63] proposed a key-policy ABPRE (KP-ABPRE) scheme in the random oracle model, whereby a ciphertext encrypted under conditions W can be re-encrypted by the proxy using the CPRE key under the access structure T if and only if T(W) = 1. More recent ABPRE systems can be found in [64, 65, 66].

In 2016, Lee et al. [67] proposed a searchable hierarchical CPRE (HCPRE) scheme for cloud storage services in which the cloud service provider is able to generate a hierarchical key, but the re-encryption key generation algorithm still requires the private keys of both the delegator and the delegatee.

So far, proxy re-encryption has been extensively investigated. However, concrete proxy re-encryption schemes with different properties still need to be proposed and applied to the special environment of the MI.

1.3. Organization

The core material of this book consists of the following:

  • Chapters 1 and 2, discussing the importance of outsourcing and privacy protection for data service in the mobile Internet and the basics of data service outsourcing and privacy protection in the mobile Internet

  • Chapters 3–5, introducing information retrieval, classical machine learning, and deep learning

  • Chapters 6 and 7, introducing attribute-based encryption for flexible and fine-grained access control and motivating proxy re-encryption for secure access delegation

2. Foundations

The purpose of this chapter is to offer a brief review of all the cryptographic concepts needed in the following chapters. We start with the mathematical background and the associated hard problems. We then recall some facts about classical symmetric cryptography and public-key cryptosystems. Finally, a concise description of provable security is presented in the rest of this chapter. We remark that this introduction is by no means exhaustive; for instance, the approaches used to speed up pairing computations or to select suitable elliptic curve parameters are beyond the scope of this book.

2.1. Mathematical concepts and properties

2.1.1. Concepts from number theory

2.1.1.1. Primes and divisibility

Let Z be the set of integers. We say that a divides b (denoted as a | b) if there exists an integer c satisfying ac = b for a, b, c ∈ Z. If a does not divide b, we write a ∤ b. Although this definition makes sense even when one or more of these integers are negative or zero, we are only interested in the case when a, b, c are all positive. A direct observation is that if a | b and a | c, then a | (xb + yc) for any x, y ∈ Z.

If a | b and a is positive, a is regarded as a divisor or factor of b. Furthermore, a is called a nontrivial factor of b in case a ∉ {1, b}. A positive integer p > 1 is considered prime if it has no nontrivial factors, i.e., it has only two divisors: 1 and p itself. A positive integer greater than 1 is called composite if it is not prime. By convention, “1” is neither prime nor composite.

According to the fundamental theorem of arithmetic, every integer greater than 1 can be expressed uniquely as a product of primes. Namely, any positive integer n > 1 can be written as n = p_1^{e_1} p_2^{e_2} ⋯ p_k^{e_k}, where the p_i are distinct primes and e_i ≥ 1 for all i; furthermore, the p_i and e_i are uniquely determined up to ordering.
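
A short sketch of the statement above: trial division recovers the prime factorization p_1^{e_1} ⋯ p_k^{e_k} of a small n. This is practical only for small inputs; the presumed hardness of factoring large integers is what later cryptographic constructions rely on.

```python
# Trial-division factorization illustrating the fundamental theorem of
# arithmetic: n is written as a product of distinct primes p_i^{e_i}.

def factorize(n: int) -> dict[int, int]:
    factors: dict[int, int] = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:                      # whatever remains is itself prime
        factors[n] = factors.get(n, 0) + 1
    return factors

assert factorize(360) == {2: 3, 3: 2, 5: 1}    # 360 = 2^3 * 3^2 * 5
```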

Proposition 2.1

Let a and b be two integers with b > 0. Then there exist unique integers q, r satisfying a = qb + r and 0 ≤ r < b.

c is known as the greatest common divisor of two nonnegative integers a and b (written gcd(a, b)) if c is the largest integer satisfying c | a and c | b. Note that gcd(0, 0) is not defined, whereas gcd(b, 0) = gcd(0, b) = b. The notion of greatest common divisor also makes sense when either or both of a, b are negative, but we will never need this; therefore, when we write gcd(a, b), we always assume that a, b ≥ 0. Note that if p is prime, then gcd(a, p) is either equal to 1 or p. a and b are called relatively prime if gcd(a, b) = 1.

Proposition 2.2

Let a, b be two positive integers. Then there exist x, y ∈ Z such that ax + by = gcd(a, b). Furthermore, gcd(a, b) is the smallest positive integer that can be expressed in this manner.

Proof. Let I denote the set {ax + by | x, y ∈ Z}. Note that I certainly contains some positive integers, since a, b ∈ I (indeed, a = a × 1 + b × 0 and b = a × 0 + b × 1). Let d denote the smallest positive integer in I; since d ∈ I, it can be written as d = ax + by for some x, y ∈ Z. We claim that d = gcd(a, b). To show this, we must show that d | a and d | b and that d is the largest integer with this property. In fact, we can show that d divides every element in I. To prove this, take an arbitrary c ∈ I with c = ax′ + by′ for some x′, y′ ∈ Z. According to Proposition 2.1, we have c = qd + r with q and r integers and 0 ≤ r < d. Then

$$r = c - qd = ax' + by' - q(ax + by) = (x' - qx)a + (y' - qy)b \in I. \tag{E2.1}$$

If r ≠ 0, then r would be a positive integer in I smaller than d, contradicting the choice of d as the smallest positive integer in I. Thus r = 0, and hence d | c.

Since both a and b are elements of I, the above shows that d | a and d | b. Suppose there exists d′ > d such that d′ | a and d′ | b. Then d′ | ax + by; since the latter is equal to d, this means d′ | d, which is impossible if d′ is larger than d. It is concluded that d is the largest integer dividing both a and b, and hence d = gcd(a, b).

Given a and b, the Euclidean algorithm can be used to compute gcd(a, b) in polynomial time. The extended Euclidean algorithm can be used to compute the coefficients x, y (as in Proposition 2.2) in polynomial time as well. Interested readers can refer to [68] for more details.
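
A minimal sketch of the extended Euclidean algorithm mentioned above, returning gcd(a, b) together with coefficients x, y such that ax + by = gcd(a, b), as in Proposition 2.2.

```python
# Iterative extended Euclidean algorithm.

def extended_gcd(a: int, b: int) -> tuple[int, int, int]:
    """Return (g, x, y) with g = gcd(a, b) and a*x + b*y = g."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y

g, x, y = extended_gcd(240, 46)
assert g == 2 and 240 * x + 46 * y == g      # e.g. 240*(-9) + 46*47 = 2
```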

Proposition 2.3

If c | a b and gcd ( a , c ) = 1 , then c | b . In particular, if p is prime and p | a b , then either p | a or p | b .

Proof. It is easy to observe that c × α = ab for some integer α, since c | ab. From gcd(a, c) = 1 and the previous proposition, there exist integers x, y such that 1 = ax + cy. Multiplying both sides by b, we obtain

$$b = axb + cyb = abx + cyb = c\alpha x + cyb = c(\alpha x + yb). \tag{E2.2}$$

Since ( α x + y b ) is an integer, it follows that c | b .

The second part of the proposition follows from the fact that if p ∤ a, then gcd(a, p) = 1.

Proposition 2.4

If p | N , q | N , and gcd ( p , q ) = 1 , then p q | N .

Proof. We can see p α = N and q β = N from p | N and q | N and obtain 1 = p x + q y from Proposition 2.2, where α , β , x , y are all integers. Multiplying both sides of the last equation by N , we obtain N = p x N + q y N = p x q β + q y p α = p q ( x β + y α ) . Thus, p q | N .

2.1.1.2. Modular arithmetic

Let a, b, N be integers with N > 1. Let [a mod N] denote the remainder of a ∈ Z upon division by N. In more detail, according to Proposition 2.1, there exist unique q and r such that a = qN + r and 0 ≤ r < N, and [a mod N] is defined as this remainder r. Namely, 0 ≤ [a mod N] < N. The process of mapping a to [a mod N] is called reduction modulo N.

a and b are said to be congruent modulo N (written a = b mod N) if [a mod N] = [b mod N], that is, the remainder when a is divided by N is equal to the remainder when b is divided by N. Equivalently, a = b mod N if and only if N | (a − b). Note that a = [b mod N] implies a = b mod N, but not vice versa. For example, 36 = 18 mod 18, but 36 ≠ [18 mod 18] = 0.

We remark that congruence modulo N is an equivalence relation, i.e., it is reflexive (a = a mod N for all a), symmetric (a = b mod N implies b = a mod N), and transitive (if a = b mod N and b = c mod N, then a = c mod N). Congruence modulo N also obeys the standard rules of arithmetic with respect to addition, subtraction, and multiplication; so if a = a′ mod N and b = b′ mod N, then (a + b) = (a′ + b′) mod N and ab = a′b′ mod N. Thus, calculations modulo N can be simplified by “reducing and then adding/multiplying” instead of “adding/multiplying and then reducing.”

Example 1. Let us compute [ 3193015 × 590702 mod 100 ] . Since 3193015 = 15 mod 100 and 590702 = 2 mod 100 , we have

$$3193015 \times 590702 = \left[3193015 \bmod 100\right] \cdot \left[590702 \bmod 100\right] \bmod 100 = 15 \times 2 = 30 \bmod 100.$$

The alternative approach (viz., computing the product 3193015 × 590702 and then reducing the result modulo 100) is much more expensive.
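
The computation in Example 1 can be checked in a line of Python, which also confirms that reducing first and then multiplying yields the same residue as multiplying first.

```python
# Checking Example 1: reduce modulo 100 first, then multiply.
a, b, N = 3193015, 590702, 100
assert (a % N) * (b % N) % N == (a * b) % N == 30
```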

In contrast to addition, subtraction, and multiplication, the division operation has not been introduced for congruence modulo N. That is to say, if a = a′ mod N and b = b′ mod N, then it is not necessarily true that a/b = a′/b′ mod N; in fact, the expression “a/b mod N” is not, in general, well defined. As a specific example related to division, ab = cb mod N does not necessarily imply that a = c mod N.

Example 2. Take N = 12. Then 3 × 3 = 9 = 3 × 7 mod 12, but 3 ≠ 7 mod 12.

However, a meaningful notion of division can be defined for congruence modulo N. If for a given integer a there exists an integer a⁻¹ such that a × a⁻¹ = 1 mod N, then a⁻¹ is called a (multiplicative) inverse of a modulo N, and a is said to be invertible modulo N. It is easy to see that if β is a multiplicative inverse of a modulo N, then so is [β mod N]; furthermore, if β′ is another multiplicative inverse, then [β mod N] = [β′ mod N]. When a is invertible, we can therefore simply let a⁻¹ denote the unique multiplicative inverse of a that lies in the range {0, …, N − 1}.

In this way, division by a modulo N is defined as multiplication by a⁻¹ modulo N when a is invertible modulo N (i.e., c/a is defined as c × a⁻¹ mod N). We stress that division by a is only defined when a is invertible. If c × a = b × a mod N and a is invertible, then we may divide each side of the equation by a (or, equivalently, multiply each side by a⁻¹) to obtain

$$(c \times a) \times a^{-1} = (b \times a) \times a^{-1} \bmod N \;\Longrightarrow\; c = b \bmod N. \tag{E2.3}$$

We see that in this case division works “as expected” for invertible integers. The natural question is which integers are invertible modulo a given modulus N. To answer this question fully, Proposition 2.2 is used in the following proposition:

Proposition 2.5

Let a, N be integers with N > 1. Then a is invertible modulo N if and only if gcd(a, N) = 1.

Proof. Assume a is invertible modulo N, and let b denote its inverse. It is evident that a ≠ 0, since 0 × b = 0 mod N regardless of the value of b. Since a × b = 1 mod N, the definition of congruence modulo N implies that a × b − 1 = αN for some α ∈ Z. Equivalently, b × a − αN = 1. According to Proposition 2.2, this implies gcd(a, N) = 1.

Conversely, if gcd ( a , N ) = 1 , then according to Proposition 2.2, there exist integers x , y such that a x + N y = 1 . Reducing each side of this equation, modulo N gives a x = 1 mod , and it is easy to see that x is a multiplicative inverse of a .

Example 3. Let N = 13 and a = 10 . Then 10 × 4 + ( 3 ) × 13 = 1 , and so 4 = [ 4 mod 13 ] is the inverse of 10. One can verify that 10 × 4 = 1 mod 13 .
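Proposition 2.5 is constructive: the integers x, y with ax + Ny = 1 can be found by the extended Euclidean algorithm, and x is then the inverse of a modulo N. The following Python sketch (the function names are ours; Python 3.8+ offers the same functionality as pow(a, -1, N)) computes inverses this way and reproduces Example 3:

```python
def extended_gcd(a, b):
    # Return (g, x, y) with a*x + b*y = g = gcd(a, b).
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def inverse_mod(a, N):
    g, x, _ = extended_gcd(a % N, N)
    if g != 1:
        raise ValueError("a is not invertible modulo N")
    return x % N

assert inverse_mod(10, 13) == 4                 # Example 3
assert (10 * inverse_mod(10, 13)) % 13 == 1
```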

2.1.2. Concepts from abstract algebra

2.1.2.1. Group theory

Let G be a set. A binary operation ∘ defined over the set G (written as g ∘ h for g, h ∈ G) is simply a function that takes as input two elements of G and outputs another element in the set G. The formal definition of a group is described as follows:

Definition 2.1. A group is a set G paired with a binary operation ∘ such that:

Closure law: For all g, h ∈ G, the result g ∘ h still remains in the set G.

Identity law: There exists an identity element e ∈ G such that for all g ∈ G, e ∘ g = g = g ∘ e.

Inverse law: For all g ∈ G, there exists an element h ∈ G such that g ∘ h = e = h ∘ g. Such an h is called an inverse element of g.

Associative law: For all g₁, g₂, g₃ ∈ G, (g₁ ∘ g₂) ∘ g₃ = g₁ ∘ (g₂ ∘ g₃).

In this way, the set G along with the binary operation ∘ is defined as a group. When the set G has a finite number of elements, (G, ∘) is viewed as a finite group, and |G| denotes the order of the group, that is, the number of elements in G.

If g ∘ h = h ∘ g for all g, h ∈ G, the group is called an abelian group or commutative group. In this book, we will always deal with finite and abelian (commutative) groups.

If (G, ∘) is a group, a set H ⊆ G paired with the operation ∘ is called a subgroup of (G, ∘) if the set H itself forms a group under the involved operation ∘. To check that (H, ∘) is a subgroup, we need to verify closure, existence of identity and inverses, and associativity according to Definition 2.1. (In fact, associativity is inherited automatically from the group (G, ∘).) Every group (G, ∘) always has the trivial subgroups (G, ∘) and ({e}, ∘). (H, ∘) is called a strict subgroup of (G, ∘) if H ≠ G.

Associativity implies that a long expression g₁ ∘ g₂ ∘ ⋯ ∘ gₙ is unambiguous without parentheses, since it does not matter in which order we perform the operation ∘.

It is easy to see that the identity element in a group (G, ∘) is unique, and we can therefore refer to the identity element of a group. One can also demonstrate that each element g of a group has a unique inverse element.

In general, the abstract notation ∘ will not be used to denote the group operation directly. Either the additive notation or the multiplicative notation will be used instead, depending on the group involved. In case the additive notation is adopted, the group operation applied to two elements g, h of the group is denoted g + h; the identity element is denoted “0,” and the inverse element of g is denoted −g. In case the multiplicative notation is employed, the group operation applied to g, h of the group is denoted g × h or simply gh; the identity element is denoted “1,” and the inverse element of g is denoted g⁻¹. As in the case of multiplication modulo N, we also define division by g as multiplication by g⁻¹ (i.e., h/g is defined as h × g⁻¹). Finally, we remark that this does not imply that the group operation corresponds to the addition or multiplication operation for integers. Instead, this merely serves as useful general notation.

Example 4. A set may form a group under one operation but not another. For example, the set of integers Z forms an abelian group under the addition operation: the identity element is “0,” and every integer g has additive inverse −g. On the other hand, it does not form a group under the multiplication operation since, for example, the multiplicative inverse of the integer “0” does not exist.

Example 5. The set of complex numbers C is not a group under multiplication, since “0” does not have a multiplicative inverse. The set of nonzero complex numbers, however, is an abelian group under multiplication with identity “1.”

Example 6. Let N ≥ 2 be an integer. The set {0, …, N − 1} with respect to addition modulo N is an abelian group of order N: closure is obvious; associativity and commutativity follow from the fact that the integers satisfy these properties; the identity element is 0; and, since a + (N − a) ≡ 0 (mod N), it follows that the inverse element of any element a is [(N − a) mod N]. We denote this group by (Z_N, +). (Occasionally, the notation Z_N will also be used to denote the set {0, …, N − 1} without involving any particular operation.)

The “cancelation law” for groups has been demonstrated in the following lemma.

Lemma 2.1 Let (G, ×) be a group and a, b, c ∈ G. If a × c = b × c, then a = b. In particular, if a × c = c, then a is the identity element in G.

Proof. We know a × c = b × c. Multiplying both sides by the unique inverse element c⁻¹ of the element c, we obtain a = b. In detail,

a = a × 1 = a × (c × c⁻¹) = (a × c) × c⁻¹ = (b × c) × c⁻¹ = b × (c × c⁻¹) = b × 1 = b. E2.4

Group exponentiation: It is often useful to be able to describe the group operation applied n times to a fixed element g, where n is a positive integer. With the additive operation, we denote this by ng or n·g, that is,

ng = n·g ≝ g + ⋯ + g (n times). E2.5

Note that n is an integer, while g is a group element. So ng does not represent the group operation applied to n and g (indeed, we are working in a group whose operation is written additively). Nevertheless, the notation “behaves as it should”: for example, if g ∈ G and n, n′ are integers, then (ng) + (n′g) = (n + n′)g, n(n′g) = (nn′)g, and 1·g = g. In an abelian group G with g, h ∈ G, we also have (ng) + (nh) = n(g + h).

With the multiplicative operation, we denote the application of the group operation n times to an element g by gⁿ. That is,

gⁿ ≝ g × ⋯ × g (n times). E2.6

The familiar rules of exponentiation follow: gⁿ × g^n′ = g^(n+n′), (gⁿ)^n′ = g^(nn′), and g¹ = g. Also, if G is a commutative group and g, h ∈ G, then gⁿhⁿ = (gh)ⁿ.

The above notation can be extended to the case when n is zero or a negative integer in the natural way. Generally, we leave g^r undefined if r is not an integer. With the additive operation, we define 0·g ≝ 0 and (−n)·g ≝ n·(−g) for a positive integer n. (Note that in the equation “0·g = 0,” the “0” on the left-hand side is the integer 0, while the “0” on the right-hand side is the identity element in the group.) As one would expect, it can be shown that (−n)·g = −(n·g). With the multiplicative notation, g⁰ ≝ 1 and g^(−n) ≝ (g⁻¹)ⁿ. Again, as expected, one can show that g^(−n) = (gⁿ)⁻¹.
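Computing gⁿ by applying the group operation n − 1 times is infeasible for the large exponents used in cryptography. The standard square-and-multiply method needs only O(log n) group operations; the following Python sketch (shown for the multiplicative group modulo N, with a function name of our choosing; Python’s built-in pow(g, n, N) does the same job) illustrates the idea:

```python
def group_exp(g, n, N):
    # Compute g**n in the multiplicative group modulo N by square-and-multiply.
    result = 1
    base = g % N
    while n > 0:
        if n & 1:                          # current bit of n is 1:
            result = (result * base) % N   # multiply in the current power of g
        base = (base * base) % N           # square for the next bit
        n >>= 1
    return result

assert group_exp(7, 123456, 1000003) == pow(7, 123456, 1000003)
```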

Theorem 2.1

If (G, ×) is a finite group with order m = |G|, then for any element g ∈ G, g^m = 1.

Proof. We prove the theorem only for a commutative group (G, ×) (although it holds for any finite group). Fix an arbitrary g ∈ G, and let g₁, …, g_m be the elements of G. We claim that

g₁ × g₂ × ⋯ × g_m = (g × g₁) × (g × g₂) × ⋯ × (g × g_m). E2.7

Note that g × g_i = g × g_j implies g_i = g_j according to the cancelation law shown in Lemma 2.1. Thus the m elements in parentheses on the right-hand side of the displayed equation are pairwise distinct. Since there are exactly m elements in G, the m elements being multiplied together on the right-hand side are simply all the elements of G in some permuted order. The order in which all elements of the group are multiplied does not matter, because the group (G, ×) is commutative, and thus the result of the right-hand side is identical to the result of the left-hand side.

Since (G, ×) is commutative, all occurrences of the element g can be pulled out, and we obtain

g₁ × g₂ × ⋯ × g_m = (g × g₁) × (g × g₂) × ⋯ × (g × g_m) = g^m × (g₁ × g₂ × ⋯ × g_m). E2.8

Once again by the cancelation law in the group, it is obvious that g m = 1 .

Corollary 2.1

If (G, ×) is a finite group with order m = |G| > 1, then for any element g ∈ G and any integer i, g^i = g^[i mod m].

Proof. According to Proposition 2.1, i can be expressed as qm + r, where q, r are integers and r = [i mod m]. By Theorem 2.1,

g^i = g^(qm+r) = g^(qm) × g^r = 1^q × g^r = g^r E2.9

holds.

Example 7. With the additive operation, the above corollary means that if g is an element of a group of order m, then i·g = [i mod m]·g. As an example, consider the group Z₂₅ of order m = 25, and take g = 13. The corollary can be instantiated as

303·13 = [303 mod 25]·13 = 3·13 = 13 + 13 + 13 = 39 ≡ 14 (mod 25). E2.10

Corollary 2.2

Let (G, ×) be a finite group with order m = |G| > 1. Let e > 0 be an integer, and define the function f_e: G → G by f_e(g) = g^e. If gcd(e, m) = 1, then f_e is a permutation. Furthermore, if d = [e⁻¹ mod m], then f_d is the inverse of f_e.

Proof. According to Proposition 2.5, e is invertible modulo m due to the fact that gcd(e, m) = 1. To show that f_d is the inverse of f_e, note that for any g ∈ G we have

f_d(f_e(g)) = f_d(g^e) = (g^e)^d = g^(ed) = g^[ed mod m] = g¹ = g. E2.11

2.1.2.2. Group ( Z N * , × )

As mentioned before, the set Z_N = {0, …, N − 1} paired with the addition operation modulo N can be regarded as a group. One natural question is whether the set {0, …, N − 1} associated with the multiplication operation modulo N can be viewed as a group or not. It is obvious that “1” can be considered the identity element for multiplication modulo N. However, not every element in this set is invertible with respect to multiplication modulo N; for example, the multiplicative inverse of “0” obviously does not exist. What makes matters worse, if N = 8, then the elements “2,” “4,” and “6” are not invertible, as can be seen by exhaustively trying every possible element in {0, …, 7}. Thus, it is necessary to identify which elements in {0, …, N − 1} are invertible modulo N. By Proposition 2.5, the element a ∈ {0, …, N − 1} is invertible if and only if gcd(a, N) = 1. It is also easy to observe that the inverse element of a can be taken in the range {0, …, N − 1}. This results in the definition of the following set for N > 1:

Z_N* ≝ {a ∈ {1, …, N − 1} | gcd(a, N) = 1}. E2.12

In other words, Z_N* consists of the integers in the set {1, …, N − 1} that are relatively prime to N.

Based on the discussion above, the identity element can be found in Z_N*, and every element of Z_N* has an inverse in Z_N*. Furthermore, the commutativity and associativity features are inherited from Z_N directly. To discuss the feature of closure, let a, b ∈ Z_N* and c = [a × b mod N], and then assume c ∉ Z_N*. This implies that gcd(c, N) ≠ 1, and so a prime p exists such that p | N and p | c. From c = [a × b mod N], we see that a × b = qN + c for some integer q and thus p | a × b. Based on Proposition 2.3, we obtain p | a or p | b and thus contradict either gcd(a, N) = 1 or gcd(b, N) = 1 (recall that a, b ∈ Z_N*). In short, the set Z_N* with respect to the multiplication operation modulo N forms a group.

Proposition 2.6

Z_N* is a commutative group under the multiplication operation modulo N whenever N > 1 is an integer.

Let ϕ(N) ≝ |Z_N*| denote the order of the group (Z_N*, ×); ϕ is called the Euler phi function. To compute the value of ϕ(N), we first consider the case when N is prime, i.e., N = p. In this case, all elements in {1, …, p − 1} are relatively prime to p, and so ϕ(p) = |Z_p*| = p − 1. We then consider the case N = p × q such that p, q are distinct primes. We see that either p | a or q | a if an integer a ∈ {1, …, N − 1} is not relatively prime to N. Note that a cannot be divided by both p and q simultaneously, since this would imply pq | a and contradict a < N = p × q. The elements in {1, …, N − 1} that can be divided by p are exactly the (q − 1) elements p, 2p, 3p, …, (q − 1)p, and the elements that can be divided by q are exactly the (p − 1) elements q, 2q, …, (p − 1)q. Thus, the number of elements that are divisible by neither p nor q is given by

(N − 1) − (q − 1) − (p − 1) = pq − p − q + 1 = (p − 1)(q − 1). E2.13

That is to say, ϕ(N) = (p − 1)(q − 1) when N is the product of two distinct primes p and q.

Theorem 2.2

Let N = ∏_i p_i^(e_i), where the p_i are distinct primes and e_i ≥ 1. Then ϕ(N) = N × ∏_i (1 − 1/p_i).

Example 8. Take N = 24 = 2³ × 3. Then Z₂₄* = {1, 5, 7, 11, 13, 17, 19, 23} and |Z₂₄*| = 8 = 24 × (1 − 1/2) × (1 − 1/3) = ϕ(24). The inverse of 5 in Z₂₄* is 5 itself, since 5 × 5 = 25 ≡ 1 (mod 24).
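For small moduli, the definitions above can be verified by brute force. The following Python sketch (the function names are ours) enumerates Z_N* and evaluates the formula of Theorem 2.2, reproducing Example 8:

```python
from math import gcd

def units_mod(N):
    # Z_N^*: elements of {1, ..., N-1} that are relatively prime to N.
    return [a for a in range(1, N) if gcd(a, N) == 1]

def euler_phi(N):
    # phi(N) = N * prod(1 - 1/p) over the distinct prime factors p of N.
    result, n, p = N, N, 2
    while p * p <= n:
        if n % p == 0:
            result -= result // p
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        result -= result // n
    return result

assert units_mod(24) == [1, 5, 7, 11, 13, 17, 19, 23]
assert euler_phi(24) == len(units_mod(24)) == 8
assert (5 * 5) % 24 == 1          # 5 is its own inverse in Z_24^*
```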

Until now, the set Z N * under the multiplication operation modulo N is shown to be a group with order ϕ ( N ) . According to the Theorem 2.1, we get the following theorem directly:

Theorem 2.3

Take arbitrary N > 1 and a ∈ Z_N*; then we obtain

a^ϕ(N) ≡ 1 (mod N). E2.14

For the specific case when N = p is prime and a ∈ {1, …, p − 1}, we have

a^(p−1) ≡ 1 (mod p). E2.15

By Corollary 2.2, we obtain the following theorem easily:

Theorem 2.4

Let N > 1 be an integer. For an integer e > 0, we define f_e: Z_N* → Z_N* as f_e(x) = [x^e mod N]. If e is relatively prime to ϕ(N), then f_e is a permutation. Moreover, if d = [e⁻¹ mod ϕ(N)], then f_d is the inverse of f_e.
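Theorem 2.4 is the number-theoretic core of the RSA scheme presented in Section 2.2.1.2: raising to the e-th power permutes Z_N*, and raising to the d-th power with d = [e⁻¹ mod ϕ(N)] undoes it. The following Python sketch checks this for the toy modulus N = 35 and exponent e = 5 (values chosen by us only for illustration):

```python
from math import gcd

N = 35                     # 5 * 7, so phi(N) = 4 * 6 = 24
phi = 24
e = 5                      # gcd(5, 24) = 1, so f_e is a permutation of Z_N^*
d = pow(e, -1, phi)        # d = e^{-1} mod phi(N); here d = 5

units = [a for a in range(1, N) if gcd(a, N) == 1]
assert sorted(pow(x, e, N) for x in units) == units      # f_e permutes Z_N^*
assert all(pow(pow(x, e, N), d, N) == x for x in units)  # f_d inverts f_e
```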

Definition 2.2. A function f: G → H is said to be an isomorphism from a group (G, ∘_G) to another group (H, ∘_H) if (1) f is a bijective mapping, and (2) for all g₁, g₂ ∈ G, we have f(g₁ ∘_G g₂) = f(g₁) ∘_H f(g₂). If these properties hold, then we say the group (G, ∘_G) and the group (H, ∘_H) are isomorphic and write this as G ≅ H.

Group isomorphisms: An isomorphism from a group to another group provides an alternative and equivalent way to think about the structure of groups. For example, if the group (G, ∘_G) is finite and G ≅ H, then the group (H, ∘_H) must be finite and have the same order as G. Also, if there exists an isomorphism f from a group (G, ∘_G) to another group (H, ∘_H), then f⁻¹ is an isomorphism from (H, ∘_H) to (G, ∘_G).

Example 9. The bijective mapping f with f(n) = 2n is an isomorphism from the group (Z, +) to the group (2Z, +), because f(a + b) = 2(a + b) = 2a + 2b = f(a) + f(b), where 2Z denotes the set of even integers and + is the addition operation for the integers.

Example 10. The bijective mapping f with f(n) = 3ⁿ is an isomorphism from the group (Z₆, +) to the group (Z₇*, ×), where + and × denote the addition operation modulo 6 and the multiplication operation modulo 7, respectively. On the one hand, 3⁰ = 1, 3¹ = 3, 3² = 2, 3³ = 6, 3⁴ = 4, and 3⁵ = 5 (mod 7). On the other hand, f(2 + 3) = 3^(2+3) = 3² × 3³ = f(2) × f(3).

2.1.2.3. Chinese remainder theorem

Theorem 2.5

Let n₁, n₂, …, n_k be integers that are pairwise relatively prime, i.e., gcd(n_i, n_j) = 1 for i ≠ j. Then there exists a unique solution modulo the product n = n₁ × n₂ × ⋯ × n_k to the following system of congruences:

x ≡ a₁ (mod n₁), x ≡ a₂ (mod n₂), …, x ≡ a_k (mod n_k). E2.16

The solution to the system of congruences shown in Theorem 2.5 can be calculated as

x = ∑_{i=1}^{k} a_i N_i M_i mod n E2.17

where

N_i = n / n_i E2.18

and

M_i = N_i⁻¹ mod n_i. E2.19

The solution can also be represented in a slightly different way that makes it easier to understand. Namely, we can rewrite Eq. (2.1) as

x = ∑_{i=1}^{k} a_i e_i mod n E2.20

such that

e_i ≡ 1 (mod n_i) and e_i ≡ 0 (mod n_j) for j ≠ i. E2.21

So the solution of the Chinese remainder theorem can essentially be regarded as an integer version of Lagrange interpolation, where a polynomial is created according to k points by calculating a similar set of coefficients that are either 0 or 1 and thus enforce the desired behavior at the given points.

Example 11. Consider the following system of congruences:

x ≡ 2 (mod 3), i.e., a₁ = 2, n₁ = 3;
x ≡ 3 (mod 5), i.e., a₂ = 3, n₂ = 5;
x ≡ 2 (mod 7), i.e., a₃ = 2, n₃ = 7. E2.22

According to the solution of the Chinese remainder theorem, we see that

n = n₁ × n₂ × n₃ = 3 × 5 × 7 = 105
N₁ = n/n₁ = 105/3 = 35
N₂ = n/n₂ = 105/5 = 21
N₃ = n/n₃ = 105/7 = 15
M₁ = N₁⁻¹ mod n₁ = 35⁻¹ mod 3 = 2⁻¹ mod 3 = 2
M₂ = N₂⁻¹ mod n₂ = 21⁻¹ mod 5 = 1⁻¹ mod 5 = 1
M₃ = N₃⁻¹ mod n₃ = 15⁻¹ mod 7 = 1⁻¹ mod 7 = 1 E2.23

so that

x = (a₁ × N₁ × M₁ + a₂ × N₂ × M₂ + a₃ × N₃ × M₃) mod 105 = (2 × 35 × 2 + 3 × 21 × 1 + 2 × 15 × 1) mod 105 = 233 mod 105 = 23. E2.24

In this example we can also think of the solution as finding integers e 1 , e 2 , and e 3 such that

x = (2 × e₁ + 3 × e₂ + 2 × e₃) mod 105 E2.25

e₁ = 70: e₁ ≡ 1 (mod 3), e₁ ≡ 0 (mod 5), e₁ ≡ 0 (mod 7),
e₂ = 21: e₂ ≡ 0 (mod 3), e₂ ≡ 1 (mod 5), e₂ ≡ 0 (mod 7),

and

e₃ = 15: e₃ ≡ 0 (mod 3), e₃ ≡ 0 (mod 5), e₃ ≡ 1 (mod 7). E2.26
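The constructive solution of Theorem 2.5 translates directly into code. The following Python sketch (the function name is ours) computes x from the pairs (a_i, n_i) exactly as in Eqs. (E2.17)–(E2.19) and reproduces Example 11:

```python
from math import prod

def crt(residues, moduli):
    # Solve x = a_i (mod n_i) for pairwise relatively prime moduli n_i.
    n = prod(moduli)
    x = 0
    for a_i, n_i in zip(residues, moduli):
        N_i = n // n_i                 # N_i = n / n_i
        M_i = pow(N_i, -1, n_i)        # M_i = N_i^{-1} mod n_i
        x += a_i * N_i * M_i
    return x % n

assert crt([2, 3, 2], [3, 5, 7]) == 23     # Example 11
```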

2.1.2.4. Cyclic groups and generators

Let (G, ×) be a finite group with order m. For an arbitrary g ∈ G, consider the set

⟨g⟩ ≝ {g⁰, g¹, g², …}. E2.27

According to Theorem 2.1, g^m = 1. If i ≤ m denotes the smallest positive integer for which g^i = 1, then the above sequence repeats after i terms (i.e., g^i = g⁰, g^(i+1) = g¹, …, g^(2i) = g⁰, etc.), and so

⟨g⟩ = {g⁰, …, g^(i−1)}. E2.28

It is obvious that ⟨g⟩ contains exactly i elements, since if g^j = g^k with 0 ≤ j < k < i, then g^(k−j) = 1 with 0 < k − j < i, contradicting the choice of i.

It is easy to observe that (⟨g⟩, ×) can be regarded as a subgroup of (G, ×), which is generated by the element g. If the order of the subgroup (⟨g⟩, ×) is i, then i is called the order of the element g.

Definition 2.3. If (G, ×) is a finite group and g ∈ G, then the order of the element g is defined as the smallest positive integer i such that g^i = 1.

Proposition 2.7

If G is a finite group and g ∈ G is an element with order i, then for any integer x, we have g^x = g^[x mod i].

Proof. The proof of this proposition is similar to the proof of Corollary 2.1, and thus we omit it here.

Proposition 2.8

If G is a finite group and g ∈ G is an element with order i, then g^x = g^y if and only if x ≡ y (mod i).

Proof. If x ≡ y (mod i), then [x mod i] = [y mod i], and according to the previous proposition, we directly have

g^x = g^[x mod i] = g^[y mod i] = g^y. E2.29

On the other hand, from g^x = g^y, we easily obtain g^x′ = g^y′ from the previous proposition, where x′ = [x mod i] and y′ = [y mod i]. We in turn get g^x′ × (g^y′)⁻¹ = g^(x′−y′) = 1. If x′ ≠ y′, then the difference x′ − y′ is a nonzero integer of absolute value smaller than i, since both x′ and y′ are smaller than i. This contradicts the fact that i is the order of the element g. Therefore, we obtain x′ = y′ directly, i.e., x ≡ y (mod i).

The identity element of any group (G, ∘) has order 1 and generates the trivial subgroup ({e}, ∘). Furthermore, the identity element is the only element with order 1. On the other hand, if an element g ∈ G with order m (where m denotes the order of the group (G, ∘)) can be found, then ⟨g⟩ = G. In other words, G can be generated by the element g and is then called a cyclic group. Here, g is called a generator of G. (Unlike the identity element, a cyclic group may have multiple generators.) If g is a generator of the cyclic group (G, ∘), then, by definition, every element h ∈ G can be derived by computing g^x for some x ∈ {0, …, m − 1}.

Proposition 2.9

Let (G, ∘) be a finite group with order m, and let the element g ∈ G have order i. Then i | m.

Proof. By Theorem 2.1, g^m = 1. According to Proposition 2.7, we have g^m = g^[m mod i] since the element g has order i. If i ∤ m, then i′ ≝ [m mod i] is a nonzero integer smaller than i such that g^(i′) = 1. This contradicts the fact that i is the order of the element g.

Corollary 2.3

If (G, ∘) is a group with prime order p, then this group is cyclic. Furthermore, all elements of G other than the identity element are generators of (G, ∘).

Proof. Based on Proposition 2.9, the only possible orders of elements in the group (G, ∘) are 1 and p. Only the identity element has order 1, and so all other elements have order p and can generate the group.

In addition to the groups of prime order, the additive group Z_N, for N > 1, provides another example of a cyclic group, with the element 1 serving as a generator.

Theorem 2.6

If p is prime, then ( Z p * , × ) forms a cyclic group.

The proof of this theorem is outside the scope of this book; the interested reader can consult standard references on number theory for more details.

Example 12. Consider the group (Z₁₁*, ×), which is cyclic by the previous theorem. We have ⟨10⟩ = {1, 10}, and so 10 is not a generator. However,

⟨2⟩ = {1, 2, 4, 8, 5, 10, 9, 7, 3, 6} = Z₁₁*, E2.30

and so 2 is a generator of Z 11 * .
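For a small prime p, whether a given element generates Z_p* can be checked simply by listing its powers. The following Python sketch (the function name is ours) reproduces Example 12:

```python
def generated_subgroup(g, p):
    # The cyclic subgroup <g> of Z_p^*, listed in the order g^0, g^1, ...
    elems, x = [], 1
    while True:
        elems.append(x)
        x = (x * g) % p
        if x == 1:
            return elems

assert generated_subgroup(10, 11) == [1, 10]                     # <10> = {1, 10}
assert sorted(generated_subgroup(2, 11)) == list(range(1, 11))   # 2 generates Z_11^*
```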

Given a cyclic group G with order q, we can represent this group by a generator g ∈ G as G = {g⁰, g¹, …, g^(q−1)}. In other words, each element h ∈ G can be expressed as h = g^x for some x ∈ Z_q. Correspondingly, x is called the discrete logarithm of h with respect to the generator g, written x = log_g h. Note that the logarithms in this case are regarded as “discrete” because these logarithm values range over a finite set, as opposed to “standard” logarithms from calculus, whose values lie in an infinite set.

The discrete logarithm problem in a cyclic group G with respect to a given generator g is to calculate x such that g^x = h on input a random element h ∈ G. Formally speaking, let 𝒢 be a polynomial-time group-generation algorithm that, on input 1ⁿ, outputs a cyclic group G with order q and a generator g ∈ G. In addition, the group operation in G is required to be efficiently computable (i.e., in time polynomial in n). Consider the following experiment between a given algorithm A and the algorithm 𝒢:

The discrete logarithm experiment DLog_{A,𝒢}(n)

  1. Given the parameter n, the algorithm 𝒢 outputs (G, q, g), where G denotes a cyclic group with order q, g is a generator of G, and ‖q‖ = n (i.e., q is an n-bit integer).

  2. Choose h ∈ G uniformly at random.

  3. Given {G, q, g, h}, the task of A is to output x ∈ Z_q.

  4. This experiment outputs 1 if g^x = h and 0 otherwise.

Definition 2.4. The discrete logarithm problem is said to be hard relative to the algorithm 𝒢 if, for all probabilistic, polynomial-time algorithms A, the probability Pr[DLog_{A,𝒢}(n) = 1] is negligible.

By way of example, groups of the form Z_p*, where p is a prime number, offer one family of cyclic groups in which the discrete logarithm problem is believed to be hard [69, 70].

2.1.3. Elliptic curve groups

Different from Z_p*, another interesting class of groups, widely used in cryptographic applications, consists of points on elliptic curves. Although elliptic curve groups are very important in the practical construction of cryptographic primitives, our treatment of such groups in this book is rather minimal and sacrifices generality in favor of simplicity. The basic reason for this approach is that most cryptographic primitives based on elliptic curve groups can be understood and analyzed by treating the underlying group in a completely generic manner, without involving any particular group used to instantiate the primitive. Namely, cryptographic primitives can be built on arbitrary cyclic groups, and the security of these schemes can be proven as long as the related computational problem in the underlying group is believed to be hard, no matter how the group is actually instantiated. Therefore, the mathematical background needed for a deeper understanding of elliptic curve groups is beyond the scope of this book. Interested readers can refer to [71] for more details.

Let p ≥ 5 be a prime and define Z_p as {0, …, p − 1}. Although elliptic curves can in fact be defined over arbitrary (finite or infinite) fields, only the case p ≠ 2, 3 is taken into consideration here, to eliminate additional complications. The elliptic curve E over Z_p is defined by an equation y² = x³ + ax + b mod p with a, b ∈ Z_p and discriminant 4a³ + 27b² ≠ 0 mod p (this condition ensures that the equation x³ + ax + b = 0 mod p has no repeated roots). The set of points E(Z_p) = {(x, y) | x, y ∈ Z_p, y² ≡ x³ + ax + b (mod p)} ∪ {O}, where O is a distinguished point at infinity, outlines the curve E. The elements of the set E(Z_p) are called the points on the elliptic curve E defined by y² = x³ + ax + b mod p, and O is called the “point at infinity.”

Example 13. An element y ∈ Z_p* is called a quadratic residue modulo p if there exists an x ∈ Z_p* such that x² ≡ y (mod p); such an x is regarded as a square root of y. According to [68], every quadratic residue modulo p has exactly two square roots when p > 2 is prime.

Let f(x) = x³ + x + 6 and consider the curve E: y² = f(x) mod 11. Each value of x for which f(x) is a quadratic residue modulo 11 yields two points on the curve; a value of x for which f(x) ≡ 0 (mod 11) would give one point on the curve E(Z₁₁). The points on the curve E(Z₁₁) (shown in Table 2.1) can be identified and explained as follows:

  • f(0) = 6 mod 11, a quadratic nonresidue modulo 11.

  • f(1) = 8 mod 11, a quadratic nonresidue modulo 11.

  • f(2) = 5 mod 11, a quadratic residue modulo 11 with square roots 4 and 7. This yields the points (2,4), (2,7) ∈ E(Z₁₁).

  • f(3) = 3 mod 11, a quadratic residue modulo 11 with square roots 5 and 6. This yields the points (3,5), (3,6) ∈ E(Z₁₁).

  • f(4) = 8 mod 11, a quadratic nonresidue modulo 11.

  • f(5) = 4 mod 11, a quadratic residue modulo 11 with square roots 2 and 9. This yields the points (5,2), (5,9) ∈ E(Z₁₁).

  • f(6) = 8 mod 11, a quadratic nonresidue modulo 11.

  • f(7) = 4 mod 11, a quadratic residue modulo 11 with square roots 2 and 9. This yields the points (7,2), (7,9) ∈ E(Z₁₁).

  • f(8) = 9 mod 11, a quadratic residue modulo 11 with square roots 3 and 8. This yields the points (8,3), (8,8) ∈ E(Z₁₁).

  • f(9) = 7 mod 11, a quadratic nonresidue modulo 11.

  • f(10) = 4 mod 11, a quadratic residue modulo 11 with square roots 2 and 9. This yields the points (10,2), (10,9) ∈ E(Z₁₁).

x                          0    1    2     3     4    5     6    7     8     9    10
x³ + x + 6 mod 11          6    8    5     3     8    4     8    4     9     7    4
Quadratic residue mod 11?  no   no   yes   yes   no   yes   no   yes   yes   no   yes
y                          —    —    4, 7  5, 6  —    2, 9  —    2, 9  3, 8  —    2, 9

Table 2.1.

The points on the elliptic curve y² = x³ + x + 6 mod 11.

Including the point O at infinity, there are 13 points in E(Z₁₁).

A common approach to conceptualize E(Z_p) intuitively is to look at the graph of the equation y² = x³ + ax + b over the reals (rather than the equation y² = x³ + ax + b mod p), as in Figure 2.1. Although this figure does not correspond exactly to E(Z_p), because E(Z_p) consists of only a finite number of points (recall that Z_p is a finite set) while there are infinitely many solutions to the same equation over the real numbers, the picture offers useful intuition. In such a figure, the point at infinity O can be viewed as sitting at the top of the y-axis and lying on every vertical line.

Figure 2.1.

Graph of the elliptic curve y 2 = x 3 + x + 6 such that Δ > 0 .

As Figure 2.2 illustrates, every line intersecting the elliptic curve E(Z_p) intersects this curve in exactly three points, where a point P is counted twice if the line is tangent to the curve at P, and the point O at infinity is also counted when the line is vertical. A binary operation (called “addition”) can be defined from this fact and denoted by “+” on points of E(Z_p) as follows:

  • The point O is defined as the identity element such that P + O = O + P = P for all P ∈ E(Z_p).

  • We obtain

P₁ + P₂ + P₃ = O. E2.31

whenever P₁, P₂, P₃ are collinear points on E. (Note that the ordering of P₁, P₂, P₃ is disregarded, in the sense that the addition operation is commutative for all points and associative for collinear points.)

Figure 2.2.

Addition operations on points on an elliptic curve.

Negation: On input a point P, the negation of P is defined as the point −P such that P + (−P) = O. If P = O, it is easy to see that −P = O. Otherwise, due to the fact that P + (−P) + O = (P + (−P)) + O = O + O = O and Eq. (2.2), it is evident that the negation of P corresponds to the third point on the line passing through P and O or, in other words, the vertical line passing through P. According to Figure 2.2, −P is simply the reflection of P in the x-axis; that is, if P = (x, y), then −P = (x, −y).

Addition of points: To evaluate the sum P₁ + P₂ for two arbitrary points P₁, P₂ ≠ O on the elliptic curve E, we can draw the line through P₁, P₂ (if P₁ = P₂, then draw the line tangent to E at P₁) and identify the third point of intersection P₃ of this line with the curve E. From Eq. (2.2), we can directly derive P₁ + P₂ = −P₃. If P₃ = O, then P₁ + P₂ = −O = O. Otherwise, if the third point of intersection of the line through P₁ and P₂ is the point P₃ = (x, y) ≠ O, then

P₁ + P₂ = −P₃ = (x, −y). E2.32

It is straightforward but tedious to state the addition law concretely. Assume P₁ = (x₁, y₁) and P₂ = (x₂, y₂) are two points in E(Z_p) with P₁, P₂ ≠ O, where E is represented by the equation y² = x³ + ax + b mod p. The slope of the line through these points can be computed as

λ = (y₂ − y₁)/(x₂ − x₁) mod p if P₁ ≠ P₂, and λ = (3x₁² + a)/(2y₁) mod p if P₁ = P₂. E2.33

The line passing through P₁ and P₂ is described by the equation

y = λ(x − x₁) + y₁ mod p. E2.34

To find the third point of intersection of this line with E, substitute Eq. (2.3) into y² = x³ + ax + b mod p to obtain

(λ(x − x₁) + y₁)² = x³ + ax + b mod p. E2.35

The values of x that satisfy this equation are x₁, x₂, and [λ² − x₁ − x₂ mod p]. The first two solutions correspond to the original points P₁ and P₂, while the third is the x-coordinate of the third point of intersection P₃. The y-value corresponding to this third value of x is y = [λ(x − x₁) + y₁ mod p], and negating it gives the sum. That is, P₁ + P₂ = (x₃, y₃), where

x₃ = (λ² − x₁ − x₂) mod p, y₃ = [λ(x₁ − x₃) − y₁] mod p. E2.36

It is straightforward to verify that the set of points E(Z_p), paired with the addition law defined above, forms an abelian group. Almost all the necessary properties have already been indicated: closure under the addition law depends on the fact (whose proof is omitted due to the space limit) that any line intersecting E intersects this curve in exactly three points; O serves as the identity element; the inverse element of each point on E(Z_p) can also be found in E(Z_p); and commutativity of the addition law rests on Eq. (2.2). The final property to verify is associativity, which the interested reader can check with a tedious calculation.

According to Table 2.1, there are 13 points in the set E(Z₁₁) for the curve E: y² = x³ + x + 6 mod 11, namely

E(Z₁₁) = {O, (2,4), (2,7), (3,5), (3,6), (5,2), (5,9), (7,2), (7,9), (8,3), (8,8), (10,2), (10,9)}. E2.37

Suppose P = (2,7); then 2P = P + P can be computed as follows:

First, the slope of the line is calculated as

λ = (3 × 2² + 1)/(2 × 7) mod 11 = 13/14 mod 11 = 2 × 3⁻¹ mod 11 = 2 × 4 mod 11 = 8. E2.38

After that, we obtain

x₃ = (8² − 2 − 2) mod 11 = 5, y₃ = [8 × (2 − 5) − 7] mod 11 = 2. E2.39

Therefore, 2 P = ( 5,2 ) . Similarly, we can also get 3 P = ( 8,3 ) , 4 P = ( 10,2 ) , 5 P = ( 3,6 ) , 6 P = ( 7,9 ) , 7 P = ( 7,2 ) , 8 P = ( 3,5 ) , 9 P = ( 10,9 ) , 10 P = ( 8,8 ) , 11 P = ( 5,9 ) , 12 P = ( 2,4 ) , and 13 P = O .

It is easy to see that E ( Z 11 ) along with the addition operation forms a commutative group with the generator P = ( 2,7 ) .
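The addition law of Eqs. (E2.33)–(E2.36) can be implemented in a few lines. The following Python sketch (the function name is ours; the point O at infinity is represented by None) works over a curve y² = x³ + ax + b mod p and reproduces both 2P = (5, 2) and the fact that P = (2, 7) has order 13 on y² = x³ + x + 6 mod 11:

```python
def ec_add(P1, P2, a, p):
    # Add two points of E(Z_p) for y^2 = x^3 + a*x + b (mod p); None stands for O.
    if P1 is None:
        return P2
    if P2 is None:
        return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                        # P + (-P) = O
    if P1 == P2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

P = (2, 7)                                   # a point on y^2 = x^3 + x + 6 mod 11
assert ec_add(P, P, a=1, p=11) == (5, 2)     # 2P = (5, 2)

# Repeated addition of P reaches O after exactly 13 steps: 13P = O.
Q, count = None, 0
while True:
    Q = ec_add(Q, P, a=1, p=11)
    count += 1
    if Q is None:
        break
assert count == 13
```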

Similar to Z p * , elliptic curve groups provide another family of cyclic groups in which the discrete logarithm problem is believed to be hard [71].

2.1.4. Bilinear pairing

Let G be a cyclic multiplicative group with generator g and prime order p, and let G_T be another multiplicative cyclic group of the same order p. A bilinear pairing refers to a map ê: G × G → G_T with the following properties:

  1. Bilinearity: For all g₁, g₂ ∈ G and a, b ∈ Z_p*, ê(g₁^a, g₂^b) = ê(g₁, g₂)^(ab).

  2. Nondegeneracy: There exist g₁, g₂ ∈ G such that ê(g₁, g₂) ≠ 1_{G_T}.

  3. Computability: For all g₁, g₂ ∈ G, there is an efficient algorithm to compute ê(g₁, g₂).

The modified Weil pairing [77] or Tate pairing [72] defined on the elliptic curves can be adopted to implement such an admissible bilinear map.

2.2. Public-key cryptography

The discussion thus far has been very general. We now show some concrete examples of public-key encryption and signature schemes in the traditional PKI and ID-PKC settings, respectively. As mentioned in [73], public-key encryption enables two parties to communicate with each other securely without any shared secret information in advance. Let us call the sender Alice and the receiver Bob. Bob first selects the key pair (pk_B, sk_B) and sends the public key pk_B to Alice over any channel while keeping the private key sk_B secure and secret. Before sending the sensitive message m to Bob, Alice generates the ciphertext c by performing the encryption algorithm c = Enc_{pk_B}(m) under Bob’s public key. To recover the message m, Bob decrypts the ciphertext c by carrying out the decryption algorithm Dec_{sk_B}(c) with his own private key sk_B. According to [74], a digital signature allows a user to generate a signature on a given message with secret key sk in such a way that any other party who knows this user’s corresponding public key pk can be assured that the message indeed originated from this user and has not been altered in any way. Again, let us call the sender Alice and the receiver Bob. The sender Alice first generates a public/secret key pair (pk_A, sk_A) and then distributes pk_A in some reliable way to the other users in the system while keeping sk_A secret. When a message m originating from Alice needs to be authenticated, Alice can generate a digital signature σ on the message m using her private key sk_A; the pair (m, σ) will then be sent. Anyone who knows the public key pk_A can verify the authenticity of m by checking whether Vrfy_{pk_A}(m, σ) = 1 or not. In this section, we have chosen to focus on the most well-known and long-standing public-key primitives.

Observe that all existing constructions of public-key encryption and signature schemes are proven secure under assumptions associated with the hardness of solving certain mathematical problems. Therefore, the number-theoretic hard problems related to these constructions are reviewed first.

Definition 2.5. (RSA problem) Let N = p × q, where p and q are two distinct odd prime numbers. Let e be a random prime number such that gcd(e, (p − 1) × (q − 1)) = 1, and let G be a cyclic subgroup of Z_N*. Given (N, e) and y ∈ G, the RSA problem consists of finding z ∈ Z_N such that y = z^e mod N.

Definition 2.6. (Discrete logarithm (DL) problem) Let G be a finite cyclic group and g be a generator of G. Given a random element h ∈ G, the DL problem consists of computing the exponent x such that h = g^x.

Definition 2.7. (Weil Diffie-Hellman (WDH) problem) Given a generator P of G and a triple ⟨aP, bP, cP⟩ for random a, b, c ∈ Z_p*, the WDH problem is to compute ê(P, P)^(abc) ∈ G_T.

Definition 2.8. (Decision Diffie-Hellman (DDH) problem) Given a generator P of G and a tuple ⟨aP, bP, cP⟩ for a, b, c ∈ Z_p*, the DDH problem is to decide whether c ≡ ab mod p holds or not.

Definition 2.9. (Computational Diffie-Hellman (CDH) problem) Given a generator P of G and a tuple ⟨aP, bP⟩ for a, b ∈ Z_p*, the CDH problem is to compute abP.

2.2.1. Public-key encryption algorithms

2.2.1.1. Framework of public-key encryption

Definition 2.10. A public-key encryption scheme (see Figure 2.3 ) consists of the following three algorithms:

  1. Gen: Given a security parameter 1 n , this algorithm outputs a pair of public and secret keys ( p k , s k ) .

  2. Enc: On input a public key p k and a message m from some underlying plaintext space, this algorithm outputs a ciphertext c , which can be denoted as c E n c p k ( m ) .

  3. Dec: On input a private key s k and a ciphertext c , this algorithm outputs a message m or a special symbol to denote failure, which can be written as m = D e c s k ( c ) .

Figure 2.3.

Intuition of public-key encryption.

We make the consistency requirement that for every n, every (pk, sk) generated by Gen(1ⁿ), and every message m in the appropriate underlying plaintext space, if c = Enc_pk(m), then Dec_sk(c) = m.

2.2.1.2. RSA encryption scheme

The RSA encryption scheme [75] is the most widely used public-key encryption scheme, and its security depends on the intractability of the RSA problem. Concretely, the RSA encryption scheme consists of the following three algorithms:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Randomly choose two prime numbers p and q such that |p| ≈ |q|.

  2. Compute N = p × q.

  3. Compute ϕ(N) = (p − 1) × (q − 1).

  4. Randomly choose an integer e < ϕ(N) such that gcd(e, ϕ(N)) = 1, and compute the integer d such that e × d ≡ 1 mod ϕ(N).

  5. Publish ( N , e ) as the public key, and keep d as the corresponding private key. Note that p , q and ϕ ( N ) will be destroyed safely.

Enc: Before sending a sensitive message m < N , the sender creates the ciphertext c = m e mod N . (Note that the plaintext message space is in fact Z N * .)

Dec: To decrypt the ciphertext c, the corresponding receiver computes m = c d mod N .
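The three algorithms can be exercised end to end with small numbers. The following Python sketch is a toy, textbook-RSA run (the tiny primes and the lack of padding are our illustrative choices only; real deployments use large random primes together with a padding scheme):

```python
from math import gcd

# Gen: toy primes for illustration only.
p, q = 61, 53
N = p * q                      # N = 3233
phi = (p - 1) * (q - 1)        # phi(N) = 3120
e = 17
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # d = e^{-1} mod phi(N) = 2753

# Enc: c = m^e mod N for a message m < N.
m = 65
c = pow(m, e, N)

# Dec: m = c^d mod N.
assert pow(c, d, N) == m
```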

2.2.1.3. ElGamal encryption scheme

The ElGamal encryption scheme [76] is constructed based on the DL problem. Concretely, the ElGamal encryption scheme consists of the following three algorithms:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Choose a random multiplicative generator g of Z p * such that p is a random prime number.

  2. Pick a random number x with 1 ≤ x ≤ p − 2 as the private key.

  3. Compute the corresponding public key by y = g x ( mod p ) .

  4. Publish ( p , g , y ) as the public key, and keep x as the corresponding private key.

Enc: Before sending a sensitive message m ∈ Z_p*, the sender picks a random k with 1 ≤ k ≤ p − 2 and computes the ciphertext as follows:

c₁ = g^k (mod p), c₂ = m × y^k (mod p). E2.40

Dec: To decrypt the ciphertext ( c 1 , c 2 ) , the following steps are performed:

m = c 2 / c 1 x ( mod p ) . E2.41
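A toy run of the three ElGamal algorithms, with parameters chosen by us only for illustration (5 is a generator of Z₂₃*; real parameters use primes of thousands of bits), might look as follows in Python:

```python
import secrets

# Gen
p, g = 23, 5                       # toy prime p and generator g of Z_p^*
x = secrets.randbelow(p - 2) + 1   # private key, 1 <= x <= p - 2
y = pow(g, x, p)                   # public key y = g^x mod p

# Enc: ciphertext (c1, c2) for a message m in Z_p^*
m = 13
k = secrets.randbelow(p - 2) + 1   # ephemeral exponent, 1 <= k <= p - 2
c1 = pow(g, k, p)
c2 = (m * pow(y, k, p)) % p

# Dec: m = c2 / c1^x mod p
recovered = (c2 * pow(pow(c1, x, p), -1, p)) % p
assert recovered == m
```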

2.2.1.4. Framework of ID-based encryption scheme

Definition 2.11. An ID-based encryption scheme (see Figure 2.4) consists of the following four algorithms:

Figure 2.4.

Intuition of ID-based encryption.

Setup: On input a security parameter k , this algorithm is performed by the PKG to create the system parameters params and the master secret key master-key such that params will be distributed to all users in the system and master-key will be kept secret by PKG itself.

Extract: On input params, master-key, and an arbitrary identity I D { 0,1 } * , this algorithm is performed by the PKG to generate a private key d I D . After that, d I D will be sent to the corresponding user secretly.

Enc: On input params, ID, and m , this algorithm outputs a ciphertext c .

Dec: On input params, ID, c , and the corresponding private key d I D , this algorithm recovers m .

2.2.1.5. Boneh-Franklin IBE

The first practical identity-based encryption scheme was constructed by Boneh and Franklin [77] based on bilinear pairings, and its security depends on the WDH problem. First, the basic Boneh-Franklin IBE scheme, which is not secure against an adaptive chosen-ciphertext attack, is presented; the only reason for describing the basic scheme is to improve the readability of the full scheme. After that, the full scheme extends the basic scheme to achieve security against an adaptive chosen-ciphertext attack.

Boneh-Franklin IBE (basic scheme): Concretely, the basic Boneh-Franklin IBE scheme consists of the following four algorithms:

Setup: The algorithm works as follows:

1. Let ê: G × G → G_T be a symmetric bilinear pairing as defined in Section 2.1.4, and let P be a generator of G.

2. Pick a random s ∈ Z_q* and set P_pub = sP.

3. Choose two cryptographic hash functions H₁: {0,1}* → G* and H₂: G_T → {0,1}ⁿ.

The system parameters params = {P, P_pub, H₁, H₂} are published to everyone in the system, while the master-key s ∈ Z_q* is kept secret by the PKG.

Extract: On input a string ID { 0,1 } * , this algorithm is performed by the PKG as follows:

  1. Compute d I D = s H 1 ( I D ) as the private key associated with the identity ID.

  2. Send d I D to the user I D secretly.

Encrypt: To encrypt m under the public key ID, choose a random r ∈ Z_q*, and generate the ciphertext as follows:

c = ⟨rP, m ⊕ H₂(g_ID^r)⟩, where g_ID = ê(H₁(ID), P_pub) ∈ G_T. E2.42

Decrypt: Given a ciphertext c = ⟨U, V⟩ encrypted under the public key ID, the plaintext message can be recovered as follows:

V ⊕ H₂(ê(d_ID, U)) = m. E2.43

Boneh-Franklin IBE (full scheme): Concretely, the full Boneh-Franklin IBE scheme consists of the following four algorithms:

Setup and Extract: The algorithms are the same as those in the above basic scheme. In addition, two additional hash functions H₃: {0,1}ⁿ × {0,1}ⁿ → Z_q* and H₄: {0,1}ⁿ → {0,1}ⁿ are required.

Encrypt: To encrypt m under the public key ID, choose a random γ ∈ {0,1}ⁿ, set r = H₃(γ, m), and generate the ciphertext as follows:

C = ⟨rP, γ ⊕ H₂(g_ID^r), m ⊕ H₄(γ)⟩, where g_ID = ê(H₁(ID), P_pub). E2.44

Decrypt: Given a ciphertext c = ⟨U, V, W⟩ encrypted under the public key ID, the plaintext message can be recovered as follows:

  1. Compute V ⊕ H₂(ê(d_ID, U)) = γ.

  2. Compute W ⊕ H₄(γ) = m.

  3. Set r = H 3 ( γ , m ) . Test that U = r P . If not, reject the ciphertext.

  4. Output m .

2.2.2. Signature algorithms

2.2.2.1. Framework of digital signature

Definition 2.12. A signature scheme (see Figure 2.5 ) consists of the following three algorithms:

Figure 2.5.

Intuition of digital signature.

Gen: Given a security parameter 1 n , this algorithm outputs a pair of public and secret keys ( p k , s k ) .

Sign: On input a private key s k and a message m from some underlying message space, this algorithm outputs a signature σ , which can be written as σ S i g n s k ( m ) .

Verify: On input a public key p k , a message m , and a signature σ , this algorithm returns 1 for accept or 0 for reject.

We make the consistency requirement that for every n, every (pk, sk) generated by Gen, and every message m, it holds that

V e r i f y p k ( m , S i g n s k ( m ) ) = 1 . E2.45

2.2.2.2. RSA signature scheme

The RSA cryptosystems [75] may be used to provide both encryption and digital signature. The RSA digital signature is described as follows:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Randomly choose two prime numbers p and q such that |p| ≈ |q|.

  2. Compute N = p × q.

  3. Compute ϕ(N) = (p − 1) × (q − 1).

  4. Randomly choose an integer e < ϕ(N) such that gcd(e, ϕ(N)) = 1, and compute the integer d such that e × d ≡ 1 mod ϕ(N).

  5. Publish (N, e) as the public key, and keep d as the corresponding private key. Note that p, q, and ϕ(N) will be destroyed safely.

Sign: To create a signature on a message m ∈ Z_N*, the signer creates σ = m^d mod N.

Verify: Given a message-signature pair (m, σ), this algorithm outputs 1 if m = σ^e mod N and returns 0 otherwise.

2.2.2.3. ElGamal signature scheme

The ElGamal cryptosystem [76] may be used to provide both encryption and digital signature. The ElGamal digital signature is described as follows:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Choose a random multiplicative generator g of Z p * such that p is a random prime number.

  2. Pick a random number x with 1 ≤ x ≤ p − 2 as the private key.

  3. Compute the corresponding public key by y = g x ( mod p ) .

  4. Publish ( p , g , y ) as the public key, and keep x as the corresponding private key.

Sign: To create a signature on a message m ∈ Z_p*, the signer picks a random number k with 1 ≤ k ≤ p − 2 and gcd(k, p − 1) = 1 and generates a signature (r, s) such that

r = g^k (mod p), s = k⁻¹(m − x × r) (mod p − 1). E2.46

Verify: Given a message-signature pair (m, (r, s)), this algorithm outputs 1 if y^r × r^s = g^m (mod p) and returns 0 otherwise.

2.2.2.4. Schnorr signature scheme

Many variations of the basic ElGamal signature scheme have been proposed, and the Schnorr signature scheme [78] is one well-known variant of the ElGamal scheme. The Schnorr digital signature is described as follows:

Gen: The system parameters are generated as follows:

  1. Choose two prime numbers p and q such that q | (p − 1).

  2. Choose an element g ∈ Z_p* of order q.

  3. Select a cryptographic hash function H: {0,1}* → Z_q.

  4. Select a random number x ∈ Z_q as the secret key and compute y = g^x (mod p) as the public key. Publish (p, q, g, y, H) as the public key and keep x as the corresponding secret key.

The parameters (p, q, g, H) are publicized for use by system-wide users.

Sign: To create a signature on a message m ∈ {0,1}*, the signer picks a random number k from Z_q and computes the signature (e, s) such that r = g^k (mod p), e = H(m ∥ r), and s = k + x × e (mod q).

Verify: Given a message-signature pair (m, (e, s)), this algorithm computes r = g^s × y^(−e) (mod p) and outputs 1 if e = H(m ∥ r), returning 0 otherwise.
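A toy run of the Schnorr scheme as described above can be sketched in Python. The parameters below (p = 607, q = 101, and SHA-256 as the hash H) are our own illustrative choices, not standardized values:

```python
import hashlib
import secrets

# Toy parameters: q | p - 1 and g has order q in Z_p^*.
p, q = 607, 101
g = pow(2, (p - 1) // q, p)            # g = h^((p-1)/q) mod p with h = 2; here g = 64

def H(m: bytes, r: int) -> int:
    # Hash H: {0,1}^* -> Z_q, instantiated with SHA-256 for this sketch.
    return int.from_bytes(hashlib.sha256(m + str(r).encode()).digest(), "big") % q

# Gen
x = secrets.randbelow(q)               # secret key x in Z_q
y = pow(g, x, p)                       # public key y = g^x mod p

# Sign: r = g^k, e = H(m || r), s = k + x*e mod q
m = b"data service outsourcing"
k = secrets.randbelow(q)
r = pow(g, k, p)
e = H(m, r)
s = (k + x * e) % q

# Verify: recompute r' = g^s * y^(-e) mod p and check e = H(m || r')
r_check = (pow(g, s, p) * pow(pow(y, e, p), -1, p)) % p
assert r_check == r and H(m, r_check) == e
```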

2.2.2.5. Digital signature standard

In 1991, a digital signature algorithm (DSA) was proposed by the US National Institute of Standards and Technology (NIST) and was later adopted as the Digital Signature Standard (DSS) [79] in the US Federal Information Processing Standards (FIPS 186). DSS is regarded as the first digital signature scheme recognized by the US government and is described as follows:

Gen: The system parameters are generated as follows:

  1. Select a prime number q such that 2¹⁵⁹ < q < 2¹⁶⁰.

  2. Choose t such that 0 ≤ t ≤ 8, and select a prime number p where 2^(511+64t) < p < 2^(512+64t) and q | (p − 1).

  3. Select a generator g of the unique cyclic group with order q in Z_p*.

    1. Select an element h ∈ Z_p* and compute g = h^((p−1)/q) mod p.

    2. If g = 1, then go to the previous step.

  4. Select a random integer x such that 1 ≤ x ≤ q − 1 and compute y = g^x mod p.

  5. ( p , q , g , y ) is published as the public key, while x is kept secret as the private key of the corresponding user.

Sign: To create a signature on message m { 0,1 } * , the following steps are performed.

  1. Select a random secret integer k such that 0 < k < q .

  2. Compute r = ( g k mod p ) mod q .

  3. Compute k⁻¹ mod q.

  4. Compute s = k⁻¹(h(m) + xr) mod q.

  5. ( r , s ) is returned as the signature on the message m .

Verify: Given a message-signature pair ( m , ( r , s ) ) , the following steps are performed to check the validity of the signature.

  1. Compute u₁ = s⁻¹ h(m) mod q and u₂ = r s⁻¹ mod q.

  2. Output 1 if r = ( g u 1 y u 2 mod p ) mod q and return 0 otherwise.

2.2.2.6. Framework of ID-based signature scheme

An ID-based signature scheme (see Figure 2.6) consists of the following four algorithms:

Figure 2.6.

Intuition of ID-based signature.

Setup: On input a security parameter k , this algorithm is performed by the PKG to create the system parameters params and the master secret key master-key such that params will be distributed to all users in the system and master-key will be kept secret by PKG itself.

Extract: On input params, master-key, and an arbitrary identity I D { 0,1 } * , this algorithm is performed by the PKG to generate a private key d I D . After that, d I D will be sent to the corresponding user secretly.

Sign: Given a message m , an identity ID, a private key d I D , and params, this algorithm generates the signature σ on the message m under the identity ID.

Verify: Given a signature σ , a message m , an identity ID, and params, this algorithm returns 1 for accept or 0 for reject.

2.2.2.7. Cha-Cheon IBS

Inspired by [77], an ID-based signature scheme was proposed by Cha and Cheon [80] using gap Diffie-Hellman (GDH) groups, where a GDH group is a cyclic group in which the CDH problem is hard but the DDH problem is easy. The Cha-Cheon identity-based signature scheme is described as follows:

Setup: Let ê: G × G → G_T be a symmetric bilinear pairing as defined in Section 2.1.4. Choose a generator P of the group G, pick a random s ∈ Z_p*, set P_pub = sP, and choose cryptographic hash functions H₁: {0,1}* × G → Z_p* and H₂: {0,1}* → G. The system parameters (P, P_pub, H₁, H₂) are published, while the master secret key s is kept secret by the PKG.

Extract: Given an identity ID, this algorithm computes d I D = s H 2 ( I D ) and sends it to the user associated with identity ID secretly.

Sign: Given a secret key d_ID and a message m, pick a random number r ∈ Z_p* and output a signature σ = (U, V) such that U = rH₂(ID), h = H₁(m, U), and V = (r + h)d_ID.

Verify: To verify a signature σ = (U, V) on a message m under an identity ID, this algorithm outputs 1 if ê(U + hH₂(ID), P_pub) = ê(P, V), where h = H₁(m, U), and returns 0 otherwise.

2.2.2.8. Bellare-Neven-Namprempre IBS

Bellare et al. [81] proposed the first pairing-free identity-based signature scheme based on the DL problems. The Bellare-Neven-Namprempre identity-based signature scheme is described as follows:

Setup: Generate an elliptic curve E over a finite field Z_p, a prime q dividing the number of points on E, and a curve point P of order q; pick a random x ∈ Z_q*, and compute P_pub = xP. Choose two cryptographic hash functions H₁, H₂: {0,1}* → Z_q. (Z_p, E, q, P, P_pub, H₁, H₂) is published as the system parameters, and x is kept secret as the master secret key by the PKG.

Extract: Given an identity ID, the PKG picks a random r ∈ Z_q* and computes R = rP, c = H₁(ID, R), and s = r + cx mod q. (R, s) is sent to the user associated with ID secretly.

Sign: Given a secret key pair (R, s) associated with ID and a message m, pick a random t ∈ Z_q* and compute T = tP, h = H₂(ID, m, R, T), and

z = t + h s mod q . E2.47

The resulting signature is σ = ( R , T , z ) .

Verify: To verify a signature σ = ( R , T , z ) of a message m for an identity ID, compute h = H 2 ( I D , m , R , T ) and c = H 1 ( I D , R ) and check whether

z P = T + h ( R + c P p u b ) E2.48

holds or not.

2.3. Provable security

Provable security, which involves exact definitions, precise assumptions, and rigorous security proofs of cryptographic primitives, is regarded as an unavoidable methodology in designing, analyzing, and evaluating new primitives [82, 83, 84]. The basic reason for the necessity of provable security originates from the fact that the security of a cryptographic primitive cannot be checked in the same way that software is typically checked. Without a proof that no adversary with enough computational resources can break the primitive, we can only depend on our intuition about the security of the mentioned primitive. Unfortunately, history has already demonstrated that intuition in cryptography and information security is disastrous, in the sense that countless examples of unproven schemes, or schemes with only heuristic security arguments, were broken (sometimes immediately and sometimes years after being published or deployed).

Instead of indiscreetly assuming that a given cryptographic primitive is secure, provable security assumes that some mathematical problem is hard to solve and then proves that the given primitive is secure under this assumption. Concretely, the proof proceeds by presenting an explicit reduction showing how to construct an efficient algorithm C that succeeds in solving a computational task assumed to be hard from any efficient adversary A that succeeds in breaking the primitive with nonnegligible probability. The steps of such a proof can be outlined at a high level as follows:

We first assume that some mathematical problem X cannot be solved by any polynomial-time algorithm except with negligible probability. To prove that some cryptographic primitive Π is secure, the following steps are performed.

Initial: Assume an efficient adversary A attempts to attack the cryptographic primitive Π, and denote A’s success probability by ε.

  1. An efficient algorithm C is constructed that employs the adversary A as a subroutine to solve problem X. Given some input instance of the mathematical problem X, the algorithm C simulates an execution of Π for the adversary A such that:

    1. The view of A when it is invoked as a subroutine by C should be identical to the view of A when it interacts with the real primitive Π itself.

    2. Furthermore, if A succeeds in breaking the execution of Π which is being simulated by C, this should enable C to solve the given instance of X with inverse polynomial probability 1/p.

  2. From Step 1, it is obvious that if ε is not negligible, then the problem X can be solved by C with nonnegligible probability ε/p, which contradicts the initial assumption. Therefore, we conclude that, given the assumption associated with X, no efficient adversary A can succeed with nonnegligible probability ε; in other words, Π is formally proven to be computationally secure.

2.3.1. Public-key encryption

As for the security model of public-key encryption, we begin our definitional treatment by considering the case of an eavesdropping adversary who observes a single ciphertext that it wishes to crack. In other words, the eavesdropping adversary does not have any further interaction with the sender or the receiver after receiving a (single) ciphertext. Intuitively speaking, the security model for public-key encryption is captured by the definition of indistinguishability. Particularly, consider an experiment PubK^eav_{A,Π}(n) between an eavesdropping adversary A and a corresponding simulator/challenger C, in which A outputs two messages m₀ and m₁ of equal length and in turn is given an encryption of one of these messages, chosen at random by C and encrypted under a randomly generated key. According to the definition of indistinguishability, a public-key encryption scheme Π is said to be secure if, in this experiment, no adversary A can distinguish which message was encrypted better than a naive guess. We now give the formal definition of PubK^eav_{A,Π}(n) as follows:

Given a public-key encryption scheme Π = ( Gen , Enc , Dec ) and an adversary A , the eavesdropping experiment P u b K A , Π e a v ( n ) is defined as follows:

  1. The algorithm Gen ( 1 n ) is performed by C to calculate keys ( p k , s k ) .

  2. After receiving the public key pk, A outputs a pair of messages (m₀, m₁) such that |m₀| = |m₁|.

  3. After receiving (m₀, m₁), C chooses a random bit b ∈ {0,1} and sends to A a ciphertext c ← Enc_pk(m_b). Here, c is called the challenge ciphertext.

  4. After receiving c, A outputs a bit b′ and wins the experiment if b′ = b.

Definition 2.13. A public-key encryption scheme Π = (Gen, Enc, Dec) is regarded to achieve indistinguishability against an eavesdropper if, for all probabilistic, polynomial-time adversaries A, the advantage of A in the eavesdropping experiment PubK^eav_{A,Π}(n), defined as Adv_Π^ind-eav(A) = |2 Pr[b′ = b] − 1|, is negligible.

2.3.1.1. Security against chosen-plaintext attacks

In view of the fact that the eavesdropping experiment models a relatively weak attacker who only passively eavesdrops on the communication, a more powerful type of adversary, who is allowed to query encryptions of multiple messages in an adaptive manner, should be considered to capture the attacker’s capabilities in the real world. Thus, a chosen-plaintext attack (CPA) has been defined to enable the adversary A to interact freely with an encryption oracle that encrypts messages of A’s choice.

Specifically, consider the following experiment defined for public-key encryption scheme Π = ( G e n , E n c , D e c ) ; the CPA indistinguishability experiment P u b K A , Π c p a ( n ) between the adversary A who can mount chosen-plaintext attack and a corresponding simulator/challenger C is defined as follows:

Initial: The algorithm Gen ( 1 n ) is performed by C to calculate keys ( p k , s k ) .

Phase 1: After receiving the public key p k , A issues a sequence of queries to the oracle E n c p k ( ) adaptively.

Challenge: After deciding Phase 1 is over, A outputs a pair of messages $(m_0, m_1)$ of equal length. C now chooses a random bit $b \in \{0,1\}$ and sends to A a ciphertext $c \leftarrow Enc_{pk}(m_b)$. Here, $c$ is called the challenge ciphertext.

Phase 2: The adversary A now issues a second sequence of queries to the oracle E n c p k ( ) adaptively again as in Phase 1.

Guess: After receiving $c$, A outputs a bit $b'$ and wins the experiment if $b' = b$.

Definition 2.14. A public-key encryption scheme Π = (Gen, Enc, Dec) is said to achieve indistinguishability against chosen-plaintext attacks if, for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment $\mathrm{PubK}^{\mathrm{cpa}}_{A,\Pi}(n)$, defined as $\mathrm{Adv}^{\mathrm{ind\text{-}cpa}}_{\Pi}(A) = |2\Pr[b' = b] - 1|$, is negligible.

2.3.1.2. Security against chosen-ciphertext attacks

To strengthen the adversary's capabilities further, a third type of adversary, more powerful than the former two and able to mount a chosen-ciphertext attack [85, 86], should be taken into consideration. In this attack model, the adversary is not only allowed to obtain encryptions of any messages of its choice, as in the chosen-plaintext attack model, but is also allowed to obtain decryptions of any ciphertexts of its choice. Therefore, the chosen-ciphertext attack is regarded as the most powerful attack associated with public-key encryption so far.

Specifically, consider the following experiment defined for public-key encryption scheme Π = ( G e n , E n c , D e c ) ; the CCA indistinguishability experiment P u b K A , Π c c a ( n ) (shown in Figure 2.7 ) between the adversary A who can mount chosen-ciphertext attack and a corresponding simulator/challenger C is defined as follows:

Figure 2.7.

CCA experiment P u b K A , Π c c a ( n ) for public-key encryption.

Initial: The algorithm Gen ( 1 n ) is performed by C to calculate keys ( p k , s k ) .

Phase 1: After receiving the public key p k , A issues a sequence of queries to the encryption oracle E n c p k ( ) and the decryption oracle D e c s k ( ) adaptively.

Challenge: After deciding Phase 1 is over, A outputs a pair of messages $(m_0, m_1)$ of equal length. C now chooses a random bit $b \in \{0,1\}$ and sends to A a ciphertext $c \leftarrow Enc_{pk}(m_b)$. Here, $c$ is called the challenge ciphertext.

Phase 2: After receiving $c$, A issues a second sequence of queries to the oracles $Enc_{pk}(\cdot)$ and $Dec_{sk}(\cdot)$ adaptively, as in Phase 1, with the restriction that the challenge ciphertext $c$ cannot be queried to the decryption oracle.

Guess: A outputs a bit $b'$ and wins the experiment if $b' = b$.

Definition 2.15. A public-key encryption scheme Π = (Gen, Enc, Dec) is said to achieve indistinguishability against chosen-ciphertext attacks if, for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment $\mathrm{PubK}^{\mathrm{cca}}_{A,\Pi}(n)$, defined as $\mathrm{Adv}^{\mathrm{ind\text{-}cca}}_{\Pi}(A) = |2\Pr[b' = b] - 1|$, is negligible.
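The oracle phases can be sketched in the same style. The toy scheme below is hypothetical and deliberately malleable, so a simple adversary that mauls the challenge ciphertext and queries the decryption oracle wins the CCA game with advantage close to 1; note how the simulated decryption oracle enforces the restriction on the challenge ciphertext.

```python
import secrets

class ToyMalleableScheme:
    """Toy stand-in (hypothetical): 'encryption' is XOR with the key, so
    ciphertexts are malleable and the scheme is trivially CCA-insecure."""
    def gen(self, n):
        k = secrets.token_bytes(n)
        return k, k                       # (pk, sk) coincide in this toy

    def enc(self, pk, m):
        return bytes(a ^ b for a, b in zip(m, pk))

    def dec(self, sk, c):
        return bytes(a ^ b for a, b in zip(c, sk))

def pubk_cca_experiment(scheme, adversary, n):
    """One run of PubK^cca_{A,Pi}(n); returns True iff the adversary wins."""
    pk, sk = scheme.gen(n)
    enc_oracle = lambda m: scheme.enc(pk, m)
    challenge = {"c": None}
    def dec_oracle(c):
        if c == challenge["c"]:           # restriction on the challenge ciphertext
            raise ValueError("decryption of the challenge ciphertext is forbidden")
        return scheme.dec(sk, c)

    m0, m1 = adversary.phase1(pk, enc_oracle, dec_oracle)   # |m0| == |m1|
    b = secrets.randbelow(2)
    challenge["c"] = scheme.enc(pk, m0 if b == 0 else m1)
    b_guess = adversary.phase2(challenge["c"], enc_oracle, dec_oracle)
    return b_guess == b

class MalleabilityAdversary:
    def phase1(self, pk, enc_oracle, dec_oracle):
        self.m0, self.m1 = b"\x00" * 16, b"\x01" * 16
        return self.m0, self.m1
    def phase2(self, c, enc_oracle, dec_oracle):
        # Flip one bit of the challenge, decrypt the modified ciphertext,
        # undo the flip on the plaintext, and compare with m0.
        c_mod = bytes([c[0] ^ 0x80]) + c[1:]
        m = bytes([dec_oracle(c_mod)[0] ^ 0x80]) + dec_oracle(c_mod)[1:]
        return 0 if m == self.m0 else 1

wins = sum(pubk_cca_experiment(ToyMalleableScheme(), MalleabilityAdversary(), 16)
           for _ in range(200))
print("estimated CCA advantage:", abs(2 * wins / 200 - 1))   # close to 1 here
```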

2.3.2. ID-based encryption

Different from the public-key encryption in the traditional PKI, the definition of chosen-ciphertext security must be strengthened a bit in the ID-PKC due to the fact that the attacker might already own the private keys of users I D 1 , , I D n of her choice when she attacks another public key I D . To ensure the security of encryption scheme in the ID-based environment [77], the definition of chosen-ciphertext security must enable the attacker to obtain the private key associated with any identity of her choice (other than the challenged public key I D being attacked).

2.3.2.1. Security against chosen-ciphertext and identity attacks

Specifically, consider the following experiment defined for ID-based encryption scheme Π = ( S e t u p , E x t r a c t , E n c r y p t , D e c r y p t ) ; the CCA indistinguishability experiment I B E A , Π c c a , c i d a ( n ) (shown in Figure 2.8 ) between the adversary A who can mount chosen-ciphertext attack and a corresponding simulator/challenger C is defined as follows:

Figure 2.8.

CCA experiment I B E A , Π c c a , c i d a ( n ) for ID-based encryption.

Initial: The algorithm Setup is performed by the challenger C to generate the system parameters params, which will be forwarded to the adversary A .

Phase 1: A can perform a polynomially bounded number of queries in an adaptive manner as follows:

  1. Upon receiving an identity I D i , C runs private key extraction oracle E x t r a c t ( ) on I D i and forwards the associated private key to A .

  2. Upon receiving a tuple ( I D i , c i ) , C runs decryption oracle D e c r y p t ( ) on ( I D i , c i ) and sends the result to A .

Challenge: After deciding Phase 1 is over, the adversary submits two plaintexts $(m_0, m_1)$ of equal length and an identity $ID^*$ to C, such that $ID^*$ has not been sent to the private key extraction oracle $Extract(\cdot)$ in Phase 1. After receiving $(m_0, m_1)$, C selects a random bit $b \in \{0,1\}$, sets $c = Encrypt(params, ID^*, m_b)$, and forwards $c$ to the adversary as the challenge ciphertext.

Phase 2: This is identical to Phase 1 with the restriction that:

  1. A must not have sent ID* to the private key extraction oracle E x t r a c t ( ) .

  2. A must not have sent ( I D * , c ) to the decryption oracle D e c r y p t ( ) .

Guess: A outputs a bit $b'$ and wins the experiment if $b' = b$.

Definition 2.16. An ID-based encryption scheme Π = (Setup, Extract, Encrypt, Decrypt) is said to achieve indistinguishability against chosen-ciphertext and identity attacks if, for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment $\mathrm{IBE}^{\mathrm{cca,cida}}_{A,\Pi}(n)$, defined as $\mathrm{Adv}^{\mathrm{ind\text{-}cca\text{-}ibe}}_{\Pi}(A) = |2\Pr[b' = b] - 1|$, is negligible.

2.3.3. Digital signature

To capture the adversary's power against a digital signature, a security property called existential unforgeability under a chosen message attack [87] should be achieved. Existential unforgeability means that the adversary should not be able to generate a valid signature on any new message, and the chosen message attack means that the adversary is able to obtain signatures on any messages it wishes during its attack. We now give the formal definition of existential unforgeability against a chosen message attack as follows:

2.3.3.1. Security against chosen message attacks

Let Π = ( G e n , S i g n , V e r i f y ) be a signature scheme, and consider the following experiment S i g - f o r g e A , Π c m a ( n ) (shown in Figure 2.9 ) between an adversary A and a corresponding challenger/simulator C as follows:

Figure 2.9.

Unforgeability experiment $\mathrm{Sig\text{-}forge}^{\mathrm{cma}}_{A,\Pi}(n)$ for signature.

Initial: The algorithm Gen is run by C to obtain keys ( p k , s k ) such that the public key p k is forwarded to the adversary A , whereas the private key s k is kept secret by C itself.

Attack: After receiving the public key $pk$, A is given access to the signing oracle $Sign_{sk}(\cdot)$, which returns a signature $Sign_{sk}(m)$ for any message m of A's choice.

Forgery: A outputs a message and signature pair $(m^*, \sigma^*)$. A wins this experiment if $Verify_{pk}(m^*, \sigma^*) = 1$, with the restriction that $m^*$ has not been queried to the oracle $Sign_{sk}(\cdot)$.

Definition 2.17. A signature scheme Π = (Gen, Sign, Verify) is said to achieve existential unforgeability under an adaptive chosen message attack if, for all probabilistic polynomial-time adversaries A, the probability that A wins the experiment $\mathrm{Sig\text{-}forge}^{\mathrm{cma}}_{A,\Pi}(n)$ is negligible.
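A minimal sketch of the unforgeability experiment follows. The `ToyMacAsSignature` class is a hypothetical stand-in that reuses an HMAC as the "signature" (so pk and sk coincide, unlike a real public-key signature scheme); it only serves to show how the challenger records the signing queries and checks the freshness restriction on $m^*$.

```python
import hmac, hashlib, secrets

class ToyMacAsSignature:
    """Toy stand-in (hypothetical): an HMAC plays the role of the 'signature',
    so 'pk' and 'sk' coincide. It only exercises the experiment mechanically."""
    def gen(self, n):
        k = secrets.token_bytes(n)
        return k, k
    def sign(self, sk, m):
        return hmac.new(sk, m, hashlib.sha256).digest()
    def verify(self, pk, m, sig):
        return hmac.compare_digest(self.sign(pk, m), sig)

def sig_forge_cma_experiment(scheme, forger, n):
    """One run of Sig-forge^cma_{A,Pi}(n); returns True iff the forger wins."""
    pk, sk = scheme.gen(n)
    queried = set()
    def sign_oracle(m):
        queried.add(m)
        return scheme.sign(sk, m)

    m_star, sig_star = forger.forge(pk, sign_oracle)
    fresh = m_star not in queried              # restriction: m* was never queried
    return fresh and scheme.verify(pk, m_star, sig_star)

class NaiveForger:
    def forge(self, pk, sign_oracle):
        sign_oracle(b"message the forger is allowed to query")
        # Output a guess on a *fresh* message; without sk this should fail.
        return b"fresh message", secrets.token_bytes(32)

ok = any(sig_forge_cma_experiment(ToyMacAsSignature(), NaiveForger(), 16)
         for _ in range(100))
print("naive forger ever succeeds:", ok)       # expected: False
```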

2.3.4. ID-based signature

The security model of existential unforgeability under an adaptive chosen message attack in the traditional PKI can be extended to the identity-based environment naturally [80, 81]. We now give the formal definition of existential unforgeability against a chosen message attack for the ID-based signature as follows:

2.3.4.1. Security against chosen message and identity attacks

Let Π = ( S e t u p , E x t r a c t , S i g n , V e r i f y ) be a signature scheme, and consider the following experiment I B S - f o r g e A , Π c m a , c i d a ( n ) (shown in Figure 2.10 ) between an adversary A and a corresponding challenger/simulator C as follows:

Figure 2.10.

Unforgeability experiment I B S - f o r g e A , Π c m a , c i d a ( n ) for ID-based signature.

Initial: The algorithm Setup is performed by the challenger C to generate the system parameters params, which will be forwarded to the adversary A .

Attack: A can perform a polynomially bounded number of queries in an adaptive manner as follows:

  • Upon receiving an identity I D i , C runs private key extraction oracle E x t r a c t ( ) on I D i and forwards the associated private key to A .

  • Upon receiving a tuple ( I D , m ) , C runs signing oracle S i g n ( ) on ( I D , m ) and forwards the result to A .

Forgery: A outputs a message m * , an identity I D * , and a signature σ * . A is said to succeed in the experiment I B S - f o r g e A , Π c m a , c i d a ( n ) if the following requirements are satisfied:

  1. Verify(params, I D * , m * , σ * ) = 1.

  2. I D * has not been queried to the private key extraction oracle E x t r a c t ( ) .

  3. ( I D * , m * ) has not been queried to the signing oracle S i g n ( ) .

Definition 2.18. An ID-based signature scheme Π = (Setup, Extract, Sign, Verify) is said to achieve existential unforgeability under an adaptive chosen message and identity attack if, for all probabilistic polynomial-time adversaries A, the probability that A wins the experiment $\mathrm{IBS\text{-}forge}^{\mathrm{cma,cida}}_{A,\Pi}(n)$ is negligible.

Advertisement

Outsourcing for Data Services in Mobile Internet

3. Information Retrieval

3.1. Introduction

Information retrieval is a very simple concept with everyone having practical experience in its use. The scenario of a user having an information need, translating that into a search statement and executing that search to locate the information, has become ubiquitous to everyday life. The Internet has become a repository of any information a person needs, replacing the library as a more convenient research tool. An information retrieval system is a system that ingests information, transforms it into searchable format, and provides an interface to allow a user to search and retrieve information. The most obvious example of an information retrieval system is Google, and the English language has even been extended with the term “Google it” to mean search for something.

So, everyone has had an experience with information retrieval systems, and with a little thought, it is easy to answer the question: does it work? Everyone who has used such systems has experienced the frustration that is encountered when looking for certain information. Given the massive amount of intellectual effort that is going into the design and evolution of a “Google” or other search systems, the question comes to mind why is it so hard to find what you are looking for.

One of the goals of this chapter is to explain the practical and theoretical issues associated with information retrieval that make the design of information retrieval systems one of the challenges of our time. The demand for and expectations of users to quickly find any information they need continue to drive both the theoretical analysis and the development of new technologies to satisfy that need. To scope the problem, one of the first things that need to be defined is "information." Twenty-five years ago, information retrieval was totally focused on textual items. That was because almost all of the "digital information" of value was in textual form. In today's technical environment, most people carry with them, most of the time, the capability to create images and videos of interest, that is, the cell phone. This has made modalities other than text become as common as text. That is coupled with Internet Web sites that are designed for easy uploading and storage of those modalities, which more than justifies the need to include more than text as part of the information retrieval problem. There are many parallels between the information processing steps for text, images, audio, and video. Although maps are another modality that could be included, they will only be discussed in general terms.

In general, information that will be considered in information retrieval systems includes text, images, audio, and video. The term "item" shall be used to define a specific information object. This could be a textual document, a news item from an RSS feed, an image, a video program, or an audio program. It is useful to make a distinction between the original items and what is processed by the information retrieval system as the basic indexable item. The original item will always be kept for display purposes, but a lot of preprocessing can occur on it during the process of creating the searchable index. The term "item" will refer to the original object. On occasion the term document will be used when the item being referred to is a textual item.

An information retrieval system is the hardware and software that facilitates a user in finding the information the user needs. Hardware is included in the definition because specialized hardware is needed to transform certain modalities into digital processing format (e.g., encoders that translate composite video to digital video). As the detailed processing of items is described, it will become clear that an information retrieval system is not a single application but is composed of many different applications that work together to provide the tools and functions needed to assist the users in answering their questions. The overall goal of an information retrieval system is to minimize the user overhead in locating the information of value. Overhead from a user’s perspective can be defined as the time it takes to locate the needed information. The time starts when a user starts to interact with the system and ends when they have found the items of interest. Human factors play significantly in this process. For example, most users have a short threshold on frustration waiting for a response. That means in a commercial system on the Internet, the user is more satisfied with a response less than 3 s than a longer response that has more accurate information. In internal corporate systems, users are willing to wait a little longer to get results, but there is still a trade-off between accuracy and speed. Most users would rather have the faster results and iterate on their searches than allowing the system to process the queries with more complex techniques providing better results. All of the major processing steps are described for an information retrieval system, but in many cases, only a subset of them are used on operational systems because users are not willing to accept the increase in response time.

The evolution of information retrieval systems has been closely tied to the evolution of computer processing power. Early information retrieval systems were focused on automating the manual indexing processes in libraries. These systems migrated the structure and organization of card catalogs into structured databases. They maintained the same Boolean search query structure associated with the database that was used for other database applications. This was feasible because all of the assignment of terms to describe the content of a document was done by professional indexers. In parallel, there was also academic research work being done on small data sets that considered how to automate the indexing process making all of the text of a document part of the searchable index. The only place that large systems designed to search on massive amounts of text were available was in government and military systems. As commercial processing power and storage significantly increased, it became more feasible to consider applying the algorithms and techniques being developed in the universities to commercial systems. In addition, the creation of the original documents also was migrating to digital format so that they were in a format that could be processed by the new algorithms. The largest change that drove information technologies to become part of everyone’s experience was the introduction and growth of the Internet. The Internet became a massive repository of unstructured information, and information retrieval techniques were the only approach to effectively locate information on it. This changed the funding and development of search techniques from a few government-funded efforts to thousands of new ideas being funded by venture capitalists moving the more practical implementation of university algorithms into commercial systems.

3.2. Objectives of information retrieval system

The general objective of an information retrieval system is to minimize the time it takes for a user to locate the information they need. The goal is to provide the information needed to satisfy the user's question. Satisfaction does not necessarily mean finding all information on a particular issue. It means finding sufficient information that the user can proceed with whatever activity initiated the search. This is very important because it explains some of the drivers behind the existing search systems and suggests that precision is typically more important than recalling all possible information. For example, a user looking for a particular product does not have to find the names of everyone that sells the product or every company that manufactures the product to meet their need of getting that product. Of course, if they did have total information, then it is possible that they could have gotten it cheaper, but in most cases, the consumer will never know what they missed. The concept that a user does not know how much information they missed explains why in most cases the precision of a search is more important than the ability to recall all possible items of interest; the user never knows what they missed, but they can tell if they are seeing a lot of useless information in the first few pages of search results. That does not mean that finding everything on a topic is not important to some users. If you are trying to make decisions on purchasing a stock or a company, then finding all the facts about that stock or company may be critical to prevent a bad investment. Missing the one article talking about the company being sued and possibly going bankrupt could lead to a very painful investment. But providing comprehensive retrieval of all items that are relevant to a user's search can have the negative effect of information overload on the user. In particular there is a tendency for important information to be repeated in many items on the same topic. Thus, trying to get all information makes the process of reviewing and filtering out redundant information very tedious. The better a system is at finding all items on a question (recall), the more important techniques to present aggregates of that information become.

From the user's perspective, time is the important factor that they use to gauge the effectiveness of information retrieval. Except for users that do information retrieval as a primary aspect of their job (e.g., librarians and research assistants), most users have very little patience for investing extensive time in finding information they need. They expect interactive response from their searches, with replies within 3–4 s at the most. Instead of looking through all the hits to see what might be of value, they will only review the first page, and at most the second page, before deciding they need to change their search strategy. These aspects of the human nature of searchers have had a direct effect on commercial Web sites and the development of commercial information retrieval. The times that are candidates to be minimized in an information retrieval system are the time to create the query, the time to execute the query, the time to select what items returned from the query the user wants to review in detail, and the time to determine if the returned item is of value. The initial research in information retrieval focused on the search as the primary area of interest. But meeting the user's expectation of fast response while maximizing the relevant information returned requires optimization in all of these areas. The time to create a query used to be considered outside the scope of technical system support. But systems such as Google know what is in their database and what other users have searched on; so as you type a query, they provide hints on what to search on. This vocabulary browse capability helps the user in expanding the search string and helps in getting better precision.

In information retrieval, the term “relevant” is used to represent an item containing the needed information. In reality, the definition of relevance is not a binary classification but a continuous function. Items can exactly match the information need or partially match the information need. From a user’s perspective, “relevant” and needed are synonymous. From a system perspective, information could be relevant to a search statement (i.e., matching the criteria of the search statement) even though it is not needed/relevant to user (e.g., the user already knew the information or just read it in the previous item reviewed).

Relevant documents are those that contain some information that helps answer the user’s information need. Nonrelevant documents do not contain any useful information. Using these definitions, the two primary metrics used in evaluating information retrieval systems can be defined. They are Precision and Recall:

$$\mathrm{Precision} = \frac{\mathrm{Number\_Retrieved\_Relevant}}{\mathrm{Number\_Total\_Retrieved}}, \qquad \mathrm{Recall} = \frac{\mathrm{Number\_Retrieved\_Relevant}}{\mathrm{Number\_Possible\_Relevant}}. \tag{E3.1}$$

Number_Possible_Relevant is the number of relevant items in the database, Number_Total_Retrieved is the total number of items retrieved from the query, and Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s search need.
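As a direct transcription of Eq. (E3.1), the following sketch assumes the retrieved list and the set of relevant item identifiers are available in memory; the item identifiers used in the example are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Compute Precision and Recall per Eq. (E3.1).
    retrieved: ordered list of item ids returned by the query.
    relevant:  set of item ids that are relevant to the information need."""
    retrieved_relevant = sum(1 for item in retrieved if item in relevant)
    precision = retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 of the 5 retrieved items are relevant, out of 10 relevant items overall.
print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                       {"d1", "d2", "d3", "d4", "d6", "d7",
                        "d8", "d9", "d10", "d11"}))
# -> (0.8, 0.4)
```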

Precision is the factor that most users understand. When a user executes a search and has 80% precision, it means that four out of five items that are retrieved are of interest to the user. From a user's perspective, the lower the precision, the more likely the user is wasting his resource (time) looking at nonrelevant items. From a metric perspective, the precision figure is across all of the "hits" returned from the query. But in reality most users will only look at the first few pages of hit results before deciding to change their query strategy. Thus, what is of more value in commercial systems is not the total precision but the precision across the first 20–50 hits. Typically, in a weighted system where the words within a document are assigned weights based upon how well they describe the semantics of the document, precision in the first 20–50 items is higher than the precision across all the possible hits returned. But when comparing search systems, the total precision is used.

Recall is a very useful concept in comparing systems. It measures how well a search system is capable of retrieving all possible hits that exist in the database. Unfortunately, it is impossible to calculate except in very controlled environments. It requires in the denominator the total number of relevant items in the database. If the system could determine that number, then the system could return them. There have been some attempts to estimate the total relevant items in a database, but there are no techniques that provide accurate enough results to be used for a specific search request. In Chapter 3.3 on “Information Retrieval Evaluation,” techniques that have been used in evaluating the accuracy of different search systems will be described. But it is not applicable in the general case.

3.3. Classic information retrieval

In this section we briefly present the three classic models in information retrieval, namely, the Boolean, the vector, and the probabilistic models.

3.3.1. Basic concepts

The classic models in information retrieval consider that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document’s main themes. Thus, index terms are used to index and summarize the document contents. In general, index terms are mainly nouns because nouns have meaning by themselves and, thus, their semantics is easier to identify and to grasp. Adjectives, adverbs, and connectives are less useful as index terms because they work mainly as complements. However, it might be interesting to consider all the distinct words in a document collection as index terms. We postpone a discussion on the problem of how to generate index terms, where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms are equally useful for describing the document contents. In fact, there are index terms which are simply vaguer than others. Deciding on the importance of a term for summarizing the contents of a document is not a trivial issue. Despite this difficulty, there are properties of an index term which are easily measured and which are useful for evaluating the potential of a term as such. For instance, consider a collection with a hundred thousand documents. A word which appears in each of the 100,000 documents is completely useless as an index term because it does not tell us anything about which documents the user might be interested in. On the other hand, a word which appears in just five documents is quite useful because it narrows down considerably the space of documents which might be of interest to the user. Thus, it should be clear that distinct index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document.

Let k i be an index term, d j be a document, and w i , j 0 be a weight associated with the pair ( k i , d j ). This weight quantifies the importance of the index term for describing the document semantic contents.

Definition 3.1. Let t be the number of index terms in the system and $k_i$ be a generic index term. $K = \{k_1, \ldots, k_t\}$ is the set of all index terms. A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an index term which does not appear in the document text, $w_{i,j} = 0$. With the document $d_j$ is associated an index term vector $\vec{d_j}$ represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let $g_i$ be a function that returns the weight associated with the index term $k_i$ in any t-dimensional vector (i.e., $g_i(\vec{d_j}) = w_{i,j}$).

As we later discuss, the index term weights are usually assumed to be mutually independent. This means that knowing the weight w i , j associated with the pair ( k i , d j ) tells us nothing about the weight w i + 1 , j associated with the pair ( k i + 1 , d j ). This is clearly a simplification because occurrences of index terms in a document are not uncorrelated. Consider, for instance, that the terms computer and network are used to index a given document which covers the area of computer networks. Frequently, in this document, the appearance of one of these two words attracts the appearance of the other. Thus, these two words are correlated and their weights could reflect this correlation. While mutual independence seems to be a strong simplification, it does simplify the task of computing index term weights and allows for fast ranking computation. Furthermore, taking advantage of index term correlations for improving the final document ranking is not a simple task. In fact, none of the many approaches proposed in the past has clearly demonstrated that index term correlations are advantageous (for ranking purposes) with general collections. Therefore, unless clearly stated otherwise, we assume mutual independence among index terms. In Chapter 5, we discuss modern retrieval techniques which are based on term correlations and which have been tested successfully with particular collections. These successes seem to be slowly shifting the current understanding toward a more favorable view of the usefulness of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic information retrieval models, namely, the Boolean, the vector, and the probabilistic models, as we now do.

3.3.2. Boolean model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in the past years and was adopted by many of the early commercial bibliographic systems ( Figure 3.1 ).

Figure 3.1.

The three conjunctive components for the query $q = k_a \wedge (k_b \vee \neg k_c)$.

Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or nonrelevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data (instead of information) retrieval model. Second, while Boolean expressions have precise semantics, frequently it is not simple to translate an information need into a Boolean expression. In fact, most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.

The Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, that is, $w_{i,j} \in \{0,1\}$. A query q is composed of index terms linked by three connectives: not, and, and or. Thus, a query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form (DNF)). For instance, the query $q = k_a \wedge (k_b \vee \neg k_c)$ can be written in disjunctive normal form as $q_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$, where each of the components is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$. These binary weighted vectors are called the conjunctive components of $q_{dnf}$. Figure 3.1 illustrates the three conjunctive components for the query q.

Definition 3.2. For the Boolean model, the index term weight variables are all binary, that is, w i , j { 0,1 } . A query q is a conventional Boolean expression. Let q d n f be the disjunctive normal form for the query q . Further, let q c c be any of the conjunctive components of q d n f . The similarity of a document d j to the query q is defined as

$$\mathrm{sim}(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} \mid \left(q_{cc} \in q_{dnf}\right) \wedge \left(\forall k_i,\ g_i(\vec{d_j}) = g_i(\vec{q}_{cc})\right) \\ 0 & \text{otherwise.} \end{cases} \tag{E3.2}$$

If sim( d j , q ) = 1, then the Boolean model predicts that the document d j is relevant to the query q (it might not be). Otherwise, the prediction is that the document is not relevant.

The Boolean model predicts that each document is either relevant or nonrelevant. There is no notion of a partial match to the query conditions. For instance, let $d_j$ be a document for which $\vec{d_j} = (0,1,0)$. Document $d_j$ includes the index term $k_b$ but is considered nonrelevant to the query $q = k_a \wedge (k_b \vee \neg k_c)$.
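The following sketch evaluates Eq. (E3.2) for this running example, assuming a toy index with only the three terms $(k_a, k_b, k_c)$ so that each conjunctive component can be compared with the full binary document vector.

```python
def boolean_sim(doc_vector, q_dnf):
    """Eq. (E3.2): sim(d_j, q) = 1 iff some conjunctive component of q_dnf
    matches the document's binary index-term vector on every term."""
    return 1 if any(cc == doc_vector for cc in q_dnf) else 0

# Query q = k_a AND (k_b OR NOT k_c) in DNF over the tuple (k_a, k_b, k_c):
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

print(boolean_sim((0, 1, 0), q_dnf))   # 0: the document with only k_b is rejected
print(boolean_sim((1, 1, 0), q_dnf))   # 1: matches the second conjunctive component
```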

The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The main disadvantages are that exact matching may lead to retrieval of too few or too many documents (see Chapter 10). Today, it is well known that index term weighting can lead to a substantial improvement in retrieval performance. Index term weighting brings us to the vector model.

3.3.3. Vector model

The vector model [88, 89] recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning nonbinary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.

Definition 3.3. For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and nonbinary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $(k_i, q)$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where t is the total number of index terms in the system. As before, the vector for a document $d_j$ is represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.

Therefore, a document d j and a user query q are represented as t-dimensional vectors as shown in Figure 3.2 . The vector model proposes to evaluate the degree of similarity of the document d j with regard to the query q as the correlation between the vectors d j and q . This correlation can be quantified, for instance, by the cosine of the angle between these two vectors. That is,

$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}, \tag{E3.3}$$

where | d j | and | q | are the norms of the document and query vectors. The factor | q | does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor | d j | provides a normalization in the space of the documents.

Figure 3.2.

The cosine of θ is adopted as s i m ( d j , q ) .

Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, $\mathrm{sim}(d_j, q)$ varies from 0 to +1. Thus, instead of attempting to predict whether a document is relevant or not, the vector model ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially. For instance, one can establish a threshold on $\mathrm{sim}(d_j, q)$ and retrieve only the documents with a degree of similarity above that threshold. But to compute rankings, we first need to specify how index term weights are obtained.

Index term weights can be calculated in many different ways. The work by Salton and McGill [88] reviews various term-weighting techniques. Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.

Given a collection C of objects and a vague description of a set A, the goal of a simple clustering algorithm might be to separate the collection C of objects into two sets: the first one which is composed of objects related to the set A and the second one which is composed of objects not related to the set A. Vague description here means that we do not have complete information for deciding precisely which objects are and which are not in the set A. For instance, one might be looking for a set A of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set A. More sophisticated clustering algorithms might attempt to separate the objects of a collection into various clusters (or classes) according to their properties. For instance, patients of a doctor specializing in cancer could be classified into five classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be imprecise (and not unique), and the problem is one of deciding to which of these classes a new patient should be assigned. In what follows, however, we only discuss the simpler version of the clustering problem (i.e., the one which considers only two classes) because all that is required is a decision on which documents are predicted to be relevant and which ones are predicted to be not relevant (with regard to a given user query).

To view the IR problem as one of clustering, we refer to the early work of Salton. We think of the documents as a collection C of objects and think of the user query as a (vague) specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem). In a clustering problem, two main issues have to be resolved. First, one needs to determine what the features are which better describe the objects in the set A. Second, one needs to determine what the features are which better distinguish the objects in the set A from the remaining objects in the collection C. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity. The most successful clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring the raw frequency of a term k i inside a document d j . Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (i.e., intra-document characterization). Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term k i among the documents in the collection. This factor is usually referred to as the inverse document frequency or the idf factor. The motivation for the usage of an idf factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a nonrelevant one. As with good clustering algorithms, the most effective term-weighting schemes for IR try to balance these two effects.

Definition 3.4. Let N be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then, the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_{l} freq_{l,j}}, \tag{E3.4}$$

where the maximum is computed over all terms which are mentioned in the text of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$, then $f_{i,j} = 0$. Further, let $idf_i$, the inverse document frequency for $k_i$, be given by

$$idf_i = \log \frac{N}{n_i}. \tag{E3.5}$$

The best known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}, \tag{E3.6}$$

or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.

Several variations of the above expression for the weight w i , j are described in an interesting paper by Salton and Buckley which appeared in 1988 [90]. However, in general, the above expression should provide a good weighting scheme for many collections.

For the query term weights, Salton and Buckley suggest

$$w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_{l} freq_{l,q}}\right) \times \log \frac{N}{n_i}, \tag{E3.7}$$

where f r e q i , q is the raw frequency of the term k i in the text of the information request q .
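Putting Eqs. (E3.3)–(E3.7) together, the sketch below computes tf-idf document weights, the suggested query weights, and the cosine ranking for a small collection; the toy documents, tokenization, and query are illustrative assumptions.

```python
import math
from collections import Counter

def doc_weights(docs):
    """tf-idf weights per Eqs. (E3.4)-(E3.6): w_ij = f_ij * log(N / n_i)."""
    N = len(docs)
    tf = [Counter(d) for d in docs]                 # raw frequencies freq_ij
    n = Counter(term for c in tf for term in c)     # n_i: documents containing k_i
    weights = []
    for c in tf:
        max_freq = max(c.values())
        weights.append({t: (f / max_freq) * math.log(N / n[t]) for t, f in c.items()})
    return weights, n, N

def query_weights(query, n, N):
    """Query weights per Eq. (E3.7)."""
    c = Counter(query)
    max_freq = max(c.values())
    return {t: (0.5 + 0.5 * f / max_freq) * math.log(N / n[t])
            for t, f in c.items() if t in n}

def cosine(w_d, w_q):
    """Eq. (E3.3): cosine of the angle between document and query vectors."""
    dot = sum(w * w_q[t] for t, w in w_d.items() if t in w_q)
    norm_d = math.sqrt(sum(w * w for w in w_d.values()))
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

docs = [["computer", "network", "protocol"],
        ["computer", "graphics", "rendering"],
        ["network", "routing", "protocol", "network"]]
w_docs, n, N = doc_weights(docs)
w_q = query_weights(["computer", "network"], n, N)
ranking = sorted(range(N), key=lambda j: cosine(w_docs[j], w_q), reverse=True)
print(ranking)   # -> [0, 2, 1]: the document matching both query terms ranks first
```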

The main advantages of the vector model are:

  1. Its term-weighting scheme improves retrieval performance.

  2. Its partial-matching strategy allows retrieval of documents that approximate the query conditions.

  3. Its cosine ranking formula sorts the documents according to their degree of similarity to the query.

Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent [Eq. (E3.3) does not account for index term dependencies]. However, in practice, consideration of term dependencies might be a disadvantage. Due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with general collections. It yields ranked answer sets which are difficult to improve upon without query expansion or relevance feedback within the framework of the vector model. A large variety of alternative ranking methods have been compared to the vector model, but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval model nowadays.

3.3.4. Probabilistic model

In this section, we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [91] which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief and focuses mainly on highlighting the key features of the model. With this purpose in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these details.

The probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and no other. Let us refer to this set of documents as the ideal answer set. Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a problem of clustering). The problem is that we do not know exactly what these properties are. All we know is that there are index terms whose semantics should be used to characterize these properties. Since these properties are not known at query time, an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are relevant and which ones are not (in truth, only the first top documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.

Definition 3.5. Assumption (probabilistic principle). Given a user query q and a document dj in the collection, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query. Documents not in this set are predicted to be nonrelevant.

This assumption is quite troublesome because it does not state explicitly how to compute the probabilities of relevance. In fact, not even the sample space which is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document $d_j$, as a measure of its similarity to the query, the ratio $P(d_j \text{ relevant to } q) / P(d_j \text{ nonrelevant to } q)$, which computes the odds of the document $d_j$ being relevant to the query q. Taking the odds of relevance as the rank minimizes the probability of an erroneous judgment [92, 93].

Definition 3.6. For the probabilistic model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$ and $w_{i,q} \in \{0,1\}$. A query q is a subset of index terms. Let R be the set of documents known (or initially guessed) to be relevant. Let $\bar{R}$ be the complement of R (i.e., the set of nonrelevant documents). Let $P(R \mid d_j)$ be the probability that the document $d_j$ is relevant to the query q and $P(\bar{R} \mid d_j)$ be the probability that $d_j$ is nonrelevant to q. The similarity $\mathrm{sim}(d_j, q)$ of the document $d_j$ to the query q is defined as the ratio:

$$\mathrm{sim}(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}. \tag{E3.8}$$

Using Bayes’ rule,

$$\mathrm{sim}(d_j, q) = \frac{P(d_j \mid R) \times P(R)}{P(d_j \mid \bar{R}) \times P(\bar{R})}. \tag{E3.9}$$

P ( d j | R ) stands for the probability of randomly selecting the document d j from the set R of relevant documents. Further, P ( R ) stands for the probability that a document randomly selected from the entire collection is relevant. The meanings attached to P ( d j | R ¯ ) and P ( R ¯ ) are analogous and complementary.

Since P ( R ) and P ( R ¯ ) are the same for all the documents in the collection, we write

$$\mathrm{sim}(d_j, q) \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}. \tag{E3.10}$$

Assuming independence of index terms,

$$\mathrm{sim}(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d_j})=1} P(k_i \mid R)\right) \times \left(\prod_{g_i(\vec{d_j})=0} P(\bar{k}_i \mid R)\right)}{\left(\prod_{g_i(\vec{d_j})=1} P(k_i \mid \bar{R})\right) \times \left(\prod_{g_i(\vec{d_j})=0} P(\bar{k}_i \mid \bar{R})\right)}. \tag{E3.11}$$

$P(k_i \mid R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set R. $P(\bar{k}_i \mid R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set R. The probabilities associated with the set $\bar{R}$ have meanings which are analogous to the ones just described.

Taking logarithms, recalling that P ( k i | R ) + P ( k ¯ i | R ) = 1 , and ignoring factors which are constant for all documents in the context of the same query, we can finally write

$$\mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right), \tag{E3.12}$$

which is a key expression for ranking computation in the probabilistic model.

Since we do not know the set R at the beginning, it is necessary to devise a method for initially computing the probabilities P ( k i | R ) and P ( k i | R ¯ ) . There are many alternatives for such computation. We discuss a couple of them below.

In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as (a) assume that P ( k i | R ) is constant for all index terms k i (typically, equal to 0.5) and (b) assume that the distribution of index terms among the nonrelevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

$$P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}, \tag{E3.13}$$

where, as already defined, n i is the number of documents which contain the index term k i and N is the total number of documents in the collection. Given this initial guess, we can then retrieve documents which contain query terms and provide an initial probabilistic ranking for them. After that, this initial ranking is improved as follows.

Let V be a subset of the documents initially retrieved and ranked by the probabilistic model. Such a subset can be defined, for instance, as the top r ranked documents where r is a previously defined threshold. Further, let V i be the subset of V composed of the documents in V which contain the index term k i . For simplicity, we also use V and V i to refer to the number of elements in these sets (it should always be clear when the used variable refers to the set or to the number of elements in it). For improving the probabilistic ranking, we need to improve our guesses for P ( k i | R ) and P ( k i | R ¯ ) . This can be accomplished with the following assumptions: (a) we can approximate P ( k i | R ) by the distribution of the index term k i among the documents retrieved so far and (b) we can approximate P ( k i | R ¯ ) by considering that all the non-retrieved documents are not relevant. With these assumptions, we can write

$$P(k_i \mid R) = \frac{V_i}{V}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}. \tag{E3.14}$$

This process can then be repeated recursively. By doing so, we are able to improve on our guesses for the probabilities P ( k i | R ) and P ( k i | R ¯ ) without any assistance from a human subject (contrary to the original idea). However, we can also use assistance from the user for definition of the subset V as originally conceived.

The last formulas for $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$ pose problems for small values of V and $V_i$ which arise in practice (such as V = 1 and $V_i = 0$). To circumvent these problems, an adjustment factor is often added, which yields

$$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}. \tag{E3.15}$$

An adjustment factor which is constant and equal to 0.5 is not always satisfactory. An alternative is to take the fraction n i / N as the adjustment factor which yields

$$P(k_i \mid R) = \frac{V_i + \frac{n_i}{N}}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}. \tag{E3.16}$$

This completes our discussion of the probabilistic model.
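The following sketch walks through the ranking mechanics just described: it scores a toy binary collection with Eq. (E3.12), starts from the initial estimates of Eq. (E3.13), and performs one refinement step with the adjusted estimates of Eq. (E3.15). The document-term sets, the query, and the threshold r are illustrative assumptions.

```python
import math

def bir_scores(docs, query_terms, p_kR, p_kNR):
    """Eq. (E3.12): for each document, sum over query terms present in it
    log(p/(1-p)) + log((1-q)/q), with p = P(k_i|R) and q = P(k_i|R_bar)."""
    scores = []
    for d in docs:
        s = 0.0
        for k in query_terms:
            if k in d:
                s += (math.log(p_kR[k] / (1 - p_kR[k]))
                      + math.log((1 - p_kNR[k]) / p_kNR[k]))
        scores.append(s)
    return scores

docs = [{"network", "protocol"}, {"computer", "graphics"},
        {"network", "routing"}, {"computer", "network"}]
query = ["computer", "network"]
N = len(docs)
n = {k: sum(1 for d in docs if k in d) for k in query}

# Initial guesses, Eq. (E3.13): P(k_i|R) = 0.5 and P(k_i|R_bar) = n_i / N.
p_kR = {k: 0.5 for k in query}
p_kNR = {k: n[k] / N for k in query}
scores = bir_scores(docs, query, p_kR, p_kNR)

# One refinement step, Eq. (E3.15), taking V as the top-r ranked documents.
r = 2
top = sorted(range(N), key=lambda j: scores[j], reverse=True)[:r]
V = len(top)
V_i = {k: sum(1 for j in top if k in docs[j]) for k in query}
p_kR = {k: (V_i[k] + 0.5) / (V + 1) for k in query}
p_kNR = {k: (n[k] - V_i[k] + 0.5) / (N - V + 1) for k in query}
print(bir_scores(docs, query, p_kR, p_kNR))   # refined ranking scores
```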

The main advantage of the probabilistic model, in theory, is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages include (1) the need to guess the initial separation of documents into relevant and nonrelevant sets, (2) the fact that the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary), and (3) the adoption of the independence assumption for index terms. However, as discussed for the vector model, it is not clear that independence of index terms is a bad assumption in practical situations.

3.3.5. Brief comparison of classic models

In general, the Boolean model is considered to be the weakest classic method. Its main problem is the inability to recognize partial matches which frequently leads to poor performance. There is some controversy as to whether the probabilistic model outperforms the vector model. Croft performed some experiments and suggested that the probabilistic model provides a better retrieval performance. However, experiments done afterward by Salton and Buckley refute that claim. Through several different measures, Salton and Buckley showed that the vector model is expected to outperform the probabilistic model with general collections. This also seems to be the dominant thought among researchers, practitioners, and the Web community, where the popularity of the vector model runs high.

3.4. Text retrieval

3.4.1. Formats

There is no single format for a text document, and an IR system should be able to retrieve information from many of them. In the past, IR systems would convert a document to an internal format. However, that has many disadvantages, because the original application related to the document is not useful anymore. On top of that, we cannot change the contents of a document. Current IR systems have filters that can handle most popular documents, in particular those of word processors with some binary syntax such as Word, WordPerfect, or FrameMaker. Even then, good filters might not be possible if the format is proprietary and its details are not public. This is not the case for full ASCII syntax, as in TeX documents. Although documents can be in a binary format (e.g., parts of a Word document), documents that are represented in human-readable ASCII form imply more portability and are easier to modify (e.g., they can be edited with different applications).

Other text formats were developed for document interchange. Among these we should mention the rich text format (RTF), which is used by word processors and has ASCII syntax. Other important formats were developed for displaying or printing documents. The most popular ones are the portable document format (PDF) and PostScript (which is a powerful programming language for drawing). Other interchange formats are used to encode electronic mail, for example, Multipurpose Internet Mail Extensions (MIME). MIME supports multiple character sets, multiple languages, and multiple media.

On top of these formats, nowadays many files are compressed. Text compression is treated in detail, but here we comment on the most popular compression software and associated formats. These include compress (Unix), ARJ (PCs), and ZIP (e.g., gzip in Unix and WinZip in Windows). Other tools allow us to convert binary files, in particular compressed text, to ASCII text such that it can be transmitted through a communication line using only seven bits. Examples of these tools are uuencode/uudecode and BinHex.

3.4.2. Modeling natural language

Text is composed of symbols from a finite alphabet. We can divide the symbols in two disjoint subsets: symbols that separate words and symbols that belong to words. It is well known that symbols are not uniformly distributed. If we consider just letters (a to z), we observe that vowels are usually more frequent than most consonants. For example, in English, the letter "e" has the highest frequency. A simple model to generate text is the binomial model. In it, each symbol is generated with a certain probability. However, natural language has a dependency on previous symbols. For example, in English, letter "f" cannot appear after letter "c," and vowels or certain consonants have a higher probability of occurring. Therefore, the probability of a symbol depends on previous symbols. We can use a finite context or Markovian model to reflect this dependency. The model can consider one, two, or more letters to generate the next symbol. If we use k letters, we say that it is a k-order model (so, the binomial model is considered a 0-order model). We can use these models taking words as symbols. For example, text generated by a five-order model using the distribution of words in the Bible might make sense (i.e., it can be grammatically correct) but will be different from the original. More complex models include finite-state models (which define regular languages) and grammar models (which define context-free and other languages). However, finding the right grammar for natural language is still a difficult open problem.
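A k-order finite-context (Markovian) model of the kind described above can be sketched in a few lines; the toy corpus string below is an illustrative assumption, and a real experiment would train on a much larger text.

```python
import random
from collections import defaultdict

def build_model(text, k):
    """k-order finite-context model: record which character follows each k-gram."""
    model = defaultdict(list)
    for i in range(len(text) - k):
        model[text[i:i + k]].append(text[i + k])
    return model

def generate(model, k, length, seed=None):
    """Generate text by repeatedly sampling the next character given the last k."""
    random.seed(seed)
    out = random.choice(list(model.keys()))
    for _ in range(length):
        followers = model.get(out[-k:])
        if not followers:                 # dead end: restart from a random context
            out += random.choice(list(model.keys()))
            continue
        out += random.choice(followers)
    return out

corpus = ("in the beginning information retrieval systems indexed text by hand "
          "and the vocabulary of the text grows sublinearly with the text size")
print(generate(build_model(corpus, k=3), k=3, length=120, seed=1))
```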

The next issue is how the different words are distributed inside each document. An approximate model is Zipf’s law, which attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words in the text. The rule states that the frequency of the i th most frequent word is 1 / i θ times that of the most frequent word. This implies that in a text of n words with a vocabulary of V words, the i th most frequent word appears n / ( i θ H V ( θ ) ) times, where H V ( θ ) is the harmonic number of order θ of V , defined as

$$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}, \tag{E3.17}$$

so that the sum of all frequencies is n . The left side of Figure 3.3 illustrates the distribution of frequencies considering that the words are arranged in decreasing order of their frequencies. The value of θ depends on the text. In the most simple formulation, θ = 1 , and therefore H V ( θ ) = O ( log n ) . However, this simplified version is very inexact, and the case θ > 1 (more precisely, between 1.5 and 2.0) fits better the real data. This case is very different, since the distribution is much more skewed, and H V ( θ ) = O ( 1 ) . Experimental data suggests that a better model is k / ( c + i ) θ where c is an additional parameter and k is such that all frequencies add to n . This is called a Mandelbrot distribution.
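The following sketch evaluates Zipf's law directly from Eq. (E3.17): it computes the harmonic number $H_V(\theta)$, predicts the frequency of the i-th most frequent word, and checks that the predicted frequencies add up to n. The values of n, V, and θ are illustrative.

```python
def zipf_frequencies(n, V, theta):
    """Predicted frequency of the i-th most frequent word under Zipf's law:
    n / (i^theta * H_V(theta)), with H_V(theta) as in Eq. (E3.17)."""
    H = sum(1 / (j ** theta) for j in range(1, V + 1))
    return [n / (i ** theta * H) for i in range(1, V + 1)]

freqs = zipf_frequencies(n=1_000_000, V=10_000, theta=1.8)
print(round(sum(freqs)))               # the predicted frequencies add up to n
print([round(f) for f in freqs[:5]])   # the few most frequent words dominate
```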

Figure 3.3.

Distribution of sorted word frequencies (left) and size of the vocabulary (right).

Since the distribution of words is very skewed (i.e., a few hundred words take up 50% of the text), words that are too frequent, such as stopwords, can be disregarded. A stopword is a word which does not carry meaning in natural language and therefore can be ignored (i.e., made not searchable), such as "a," "the," "by," etc. Fortunately, the most frequent words are stopwords and, therefore, half of the words appearing in a text do not need to be considered. This allows us, for instance, to significantly reduce the space overhead of indices for natural language texts. For example, the most frequent words in the TREC-2 collection are "the," "of," "and," "a," "to," and "in."

Another issue is the distribution of words in the documents of a collection. A simple model is to consider that each word appears the same number of times in every document. However, this is not true in practice. A better model is to consider a negative binomial distribution, which says that the fraction of documents containing a word k times is

$$F(k) = \binom{\alpha + k - 1}{k}\, p^{k}\, (1+p)^{-\alpha-k}, \qquad \text{(E3.18)}$$

where p and α are parameters that depend on the word and the document collection. For example, for the Brown Corpus [94] and the word “said,” we have p = 9.24 and α = 0.42 [95]. The latter reference gives other models derived from a Poisson distribution.

The next issue is the number of distinct words in a document. This set of words is referred to as the document vocabulary. To predict the growth of the vocabulary size in natural language text, we use the so-called Heaps' law [96]. This is a very precise law which states that the vocabulary of a text of $n$ words is of size $V = K n^{\beta} = O(n^{\beta})$, where $K$ and $\beta$ depend on the particular text. The right side of Figure 3.3 illustrates how the vocabulary size varies with the text size. $K$ is normally between 10 and 100, and $\beta$ is a positive value less than one. Some experiments on the TREC-2 collection show that the most common values for $\beta$ are between 0.4 and 0.6. Hence, the vocabulary of a text grows sublinearly with the text size, in a proportion close to its square root.
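As a small illustration, the sketch below evaluates Heaps' law for a few text sizes, using K = 30 and β = 0.5, the parameterization adopted later for the estimates in Table 3.1; the text sizes themselves are arbitrary.

```python
def heaps_vocabulary(n, K=30.0, beta=0.5):
    """Predicted vocabulary size V = K * n**beta for a text of n words (Heaps' law)."""
    return K * n ** beta

# Vocabulary predictions for texts of 10 thousand, 1 million, and 100 million words.
for n in (10_000, 1_000_000, 100_000_000):
    print(n, int(heaps_vocabulary(n)))
```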

Notice that the set of different words of a language is bounded by a constant (e.g., the number of different English words is finite). However, the limit is so high that it is much more accurate to assume that the size of the vocabulary is $O(n^{\beta})$ instead of $O(1)$, although the number should stabilize for huge enough texts. On the other hand, many authors argue that the number keeps growing anyway because of typing or spelling errors.

Heaps’ law also applies to collections of documents because, as the total text size grows, the predictions of the model become more accurate. Furthermore, this model is also valid for the World Wide Web.

The last issue is the average length of words. This relates the text size in words to the text size in bytes (without accounting for punctuation and other extra symbols). For example, in the different subcollections of the TREC-2 collection, the average word length is very close to five letters, and the range of variation of this average in each subcollection is small (from 4.8 to 5.3 letters). If the stopwords are removed, the average length of a word increases to between six and seven letters. If we take only the words of the vocabulary, the average length is higher (about eight or nine letters). This defines the total space needed for the vocabulary.

Heaps' law implies that the length of the words in the vocabulary increases logarithmically with the text size and, thus, that longer and longer words should appear as the text grows. However, in practice, the average length of the words in the overall text is constant because shorter words are common enough (e.g., stopwords). This balance between short and long words, such that the average word length remains constant, has been noticed many times in different contexts and can also be explained by a finite-state model in which (a) the space character has a probability close to 0.2, (b) the space character cannot appear twice in a row, and (c) there are 26 letters. This simple model is consistent with Zipf's and Heaps' laws.

The models presented in this section are used in Chapters 8 and 13, in particular Zipf’s and Heaps’ laws.

3.4.3. Similarity models

In this section, we define notions of syntactic similarity between strings and documents. Similarity is measured by a distance function. For example, if we have two strings of the same length, we can define the distance between them as the number of positions in which they have different characters; the distance is 0 if and only if they are equal. This is called the Hamming distance. A distance function should also be symmetric (i.e., the order of the arguments does not matter) and should satisfy the triangle inequality (i.e., $distance(a,c) \le distance(a,b) + distance(b,c)$).

An important distance over strings is the edit or Levenshtein distance mentioned earlier. The edit distance is defined as the minimum number of character insertions, deletions, and substitutions that we need to perform on either of the strings to make them equal. For instance, the edit distance between "color" and "colour" is one, while the edit distance between "survey" and "surgery" is two. The edit distance is considered superior for modeling syntactic errors to more complex methods such as the Soundex system, which is based on phonetics [97]. Extensions to the concept of edit distance include assigning different weights to each operation, adding transpositions, etc.
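A standard dynamic-programming computation of this distance, shown here only to make the definition concrete, runs in O(|a|·|b|) time and keeps a single row of the table in memory.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(edit_distance("color", "colour"))    # 1
print(edit_distance("survey", "surgery"))  # 2
```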

There are other measures. For example, assume that we are comparing two given strings and the only operation allowed is deletion of characters. Then, after all non-common characters have been deleted, the remaining sequence of characters (not necessarily contiguous in the original string, but in the same order) is the longest common subsequence (LCS) of both strings. For example, the LCS of “survey” and “surgery” is “surey.”
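The LCS itself can be recovered with a similar dynamic program; the quadratic-space version below (our own illustration) favors clarity over efficiency.

```python
def lcs(a, b):
    """Longest common subsequence of two strings, via dynamic programming."""
    # table[i][j] holds an LCS of a[:i] and b[:j]
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + ca
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1], key=len)
    return table[-1][-1]

print(lcs("survey", "surgery"))  # "surey"
```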

Similarity can be extended to documents. For example, we can consider lines as single symbols and compute the longest common subsequence of lines between two files. This is the measure used by the diff command in Unix-like operating systems. The main problem with this approach is that it is time-consuming and does not account for lines that are similar but not identical. The latter drawback can be fixed by taking a weighted edit distance between lines or by computing the LCS over all the characters. Other solutions include extracting fingerprints (any piece of text that in some sense characterizes the document) and comparing them, or finding large repeated pieces. There are also visual tools for inspecting document similarity. For example, Dotplot draws a rectangular map where both coordinates are file lines and the entry for each coordinate pair is a gray pixel that depends on the edit distance between the associated lines.

3.5. Document preprocessing

Document preprocessing is a procedure which can be divided mainly into five text operations (or transformations):

  1. Lexical analysis of the text with the objective of treating digits, hyphens, punctuation marks, and the case of letters.

  2. Elimination of stopwords with the objective of filtering out words with very low discrimination values for retrieval purposes.

  3. Stemming of the remaining words with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).

  4. Selection of index terms to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact, noun words frequently carry more semantics than adjectives, adverbs, and verbs.

  5. Construction of term categorization structures such as a thesaurus, or extraction of structure directly represented in the text, for allowing the expansion of the original query with related terms (a usually useful procedure).

In the following, each of these phases is discussed in detail. But before proceeding, let us take a look at the logical view of the documents which results after each of the above phases is completed, illustrated in Figure 3.4. As already discussed, by applying the preprocessing phases one after another, we move the logical view of the documents (adopted by the system) from that of a full text to that of a set of high-level indexing terms.

Figure 3.4.

Logical view of a document throughout the various phases of text preprocessing.

3.5.1. Lexical analysis of the text

Lexical analysis is the process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms). Thus, one of the major objectives of the lexical analysis phase is the identification of the words in the text. At first glance, all that seems to be involved is the recognition of spaces as word separators (in which case, multiple spaces are reduced to one space). However, there is more to it than this. For instance, the following four particular cases have to be considered with care [98]: digits, hyphens, punctuation marks, and the case of the letters (lower and upper case).

Numbers are usually not good index terms because, without a surrounding context, they are inherently vague. For instance, consider that a user is interested in documents about the number of deaths due to car accidents between the years 1910 and 1989. Such a request could be specified as the set of index terms: deaths, car, accidents, years, 1910, and 1989. However, the presence of the numbers 1910 and 1989 in the query could lead to the retrieval, for instance, of a variety of documents which refer to either of these two years. The problem is that numbers by themselves are just too vague. Thus, in general it is wise to disregard numbers as index terms. However, we also have to consider that digits might appear mixed within a word. For instance, "510B.C." is clearly an important index term. In this case, it is not clear what rule should be applied. Furthermore, a sequence of 16 digits identifying a credit card number might be highly relevant in a given context and, in this case, should be considered as an index term. A preliminary approach for treating digits in the text might be to remove all words containing sequences of digits unless specified otherwise (through regular expressions). Further, an advanced lexical analysis procedure might perform some date and number normalization to unify formats.
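The preliminary rule mentioned above can be sketched with a short regular-expression filter; this is our own illustration (tokens made only of digits are dropped, mixed alphanumeric forms such as "510B.C." are kept), not a prescription from the chapter.

```python
import re

def keep_token(token):
    """Discard tokens that consist solely of digits; keep mixed forms such as '510B.C.'."""
    return not re.fullmatch(r"\d+", token)

tokens = ["deaths", "car", "accidents", "1910", "1989", "510B.C."]
print([t for t in tokens if keep_token(t)])
# ['deaths', 'car', 'accidents', '510B.C.']
```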

Hyphens pose another difficult decision to the lexical analyzer. Breaking up hyphenated words might be useful due to inconsistency of usage. For instance, this allows treating “state-of-the-art” and “state of the art” identically. However, there are words which include hyphens as an integral part, for instance, giltedge, B-49, etc. Again, the most suitable procedure seems to adopt a general rule and specify the exceptions on a case-by-case basis.

Normally, punctuation marks are removed entirely in the process of lexical analysis. While some punctuation marks are an integral part of the word (for instance, "510B.C."), removing them does not seem to have an impact on retrieval performance because the risk of misinterpretation in this case is minimal. In fact, if the user specifies "510B.C." in his query, removal of the dot both in the query term and in the documents will not affect retrieval. However, very particular scenarios might again require the preparation of a list of exceptions. For instance, if a portion of program code appears in the text, it might be wise to distinguish between the variables "x.id" and "xid." In this case, the dot mark should not be removed.

The case of letters is usually not important for the identification of index terms. As a result, the lexical analyzer normally converts all the text to either lower or upper case. However, once more, very particular scenarios might require the distinction to be made. For instance, when looking for documents which describe details about the command language of a Unix-like operating system, the user might explicitly desire the non-conversion of upper cases because this is the convention in the operating system. Further, part of the semantics might be lost due to case conversion. For instance, the words Bank and bank have different meanings—a fact common to many other pairs of words.

As pointed out by Fox, all these text operations can be implemented without difficulty. However, careful thought should be given to each one of them because they might have a profound impact at document retrieval time. This is particularly worrisome in those situations in which the user finds it difficult to understand what the indexing strategy is doing. Unfortunately, there is no clear solution to this problem. As already mentioned, some Web search engines are opting for avoiding text operations altogether because this simplifies the interpretation the user has of the retrieval task. Whether this strategy will be the one of choice in the long term remains to be seen.

3.5.2. Elimination of stopwords

As discussed in Chapter 2, words which are too frequent among the documents in the collection are not good discriminators. In fact, a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval. Such words are frequently referred to as stopwords and are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a list of stopwords.

Elimination of stopwords has an additional important benefit. It reduces the size of the indexing structure considerably. In fact, it is typical to obtain a compression in the size of the indexing structure (for instance, in the size of an inverted list, see Chapter 8) of 40% or more solely with the elimination of stopwords.

Since stopword elimination also provides for compression of the indexing structure, the list of stopwords might be extended to include words other than articles, prepositions, and conjunctions. For instance, some verbs, adverbs, and adjectives could be treated as stopwords. In [99], a list of 425 stopwords is illustrated. Programs in C for lexical analysis are also provided.

Despite these benefits, elimination of stopwords might reduce recall. For instance, consider a user who is looking for documents containing the phrase "to be or not to be." Elimination of stopwords might leave only the term "be," making it almost impossible to properly recognize the documents which contain the phrase specified. This is one additional reason why some Web search engines adopt a full-text index (i.e., insert all words in the collection into the inverted file).

3.5.3. Stemming

Frequently, the user specifies a word in a query, but only a variant of this word is present in a relevant document. Plurals, gerund forms, and past tense suffixes are examples of syntactical variations which prevent a perfect match between a query word and a respective document word. This problem can be partially overcome with the substitution of the words by their respective stems.

A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes). A typical example of a stem is the word connect which is the stem for the variants connected, connecting, connection, and connections. Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of reducing the size of the indexing structure because the number of distinct index terms is reduced.

While the argument supporting stemming seems sensible, there is controversy in the literature about the benefits of stemming for retrieval performance. In fact, different studies lead to rather conflicting conclusions. Frakes [99] compares eight distinct studies on the potential benefits of stemming. While he favors the usage of stemming, the results of the eight experimental studies he investigated do not allow us to reach a satisfactory conclusion. As a result of these doubts, many Web search engines do not adopt any stemming algorithm whatsoever.

Frakes distinguishes four types of stemming strategies: affix removal, table lookup, successor variety, and n-grams. Table lookup consists simply of looking for the stem of a word in a table. It is a simple procedure but one which is dependent on data on stems for the whole language. Since such data is not readily available and might require considerable storage space, this type of stemming algorithm might not be practical. Successor variety stemming is based on the determination of morpheme boundaries, uses knowledge from structural linguistics, and is more complex than affix removal stemming algorithms. N-grams stemming is based on the identification of digrams and trigrams and is more a term clustering procedure than a stemming one. Affix removal stemming is intuitive and simple and can be implemented efficiently. Thus, in the remainder of this section, we concentrate our discussion on algorithms for affix removal stemming only.

In affix removal, the most important part is suffix removal because most variants of a word are generated by the introduction of suffixes (instead of prefixes). While there are three or four well-known suffix removal algorithms, the most popular one is that by Porter because of its simplicity and elegance. Despite being simpler, the Porter algorithm yields results comparable to those of the more sophisticated algorithms.

The Porter algorithm uses a suffix list for suffix stripping. The idea is to apply a series of rules to the suffixes of the words in the text. For instance, the rule

$$ s \rightarrow \emptyset \qquad \text{(E3.19)} $$

is used to convert plural forms into their respective singular forms by substituting the letter s by nil. Notice that to identify the suffix we must examine the last letters in the word. Furthermore, we look for the longest sequence of letters which matches the left-hand side in a set of rules. Thus, application of the two following rules

$$ sses \rightarrow ss \qquad\qquad s \rightarrow \emptyset \qquad \text{(E3.20)} $$

to the word stresses yields the stem stress instead of the stem stresse. By separating such rules into five distinct phases, the Porter algorithm is able to provide effective stemming while running fast.
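The effect of these two rules can be sketched directly; the fragment below is a toy suffix stripper covering only a handful of step-1a-style rules, nowhere near the full five-phase Porter algorithm, and the rule list is our own illustrative choice.

```python
# Toy suffix rules in the spirit of E3.19/E3.20: (suffix, replacement),
# tried longest-first so that "sses" wins over plain "s".
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_suffix(word):
    """Apply the first (longest) matching suffix rule, if any."""
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print(strip_suffix("stresses"))     # stress   (rule sses -> ss)
print(strip_suffix("connections"))  # connection (rule s -> nil)
print(strip_suffix("caress"))       # caress   (rule ss -> ss leaves it unchanged)
```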

3.5.4. Index terms selection

If a full-text representation of the text is adopted, then all words in the text are used as index terms. The alternative is to adopt a more abstract view in which not all words are used as index terms. This implies that the set of terms used as indices must be selected. In the area of bibliographic sciences, such a selection of index terms is usually done by a specialist. An alternative approach is to select candidates for index terms automatically.

Distinct automatic approaches for selecting index terms can be used. A good approach is the identification of noun groups (as done in the INQUERY system [73]) which we now discuss.

A sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives. While the words in each grammatical class are used with a particular purpose, it can be argued that most of the semantics is carried by the noun words. Thus, an intuitively promising strategy for selecting index terms automatically is to use the nouns in the text. This can be done through the systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns.

Since it is common to combine two or three nouns in a single component (e.g., computer science), it makes sense to cluster nouns which appear nearby in the text into a single indexing component (or concept). Thus, instead of simply using nouns as index terms, we adopt noun groups. A noun group is a set of nouns whose syntactic distance in the text (measured in terms of number of words between two nouns) does not exceed a predefined threshold (for instance, three).

When noun groups are adopted as indexing terms, we obtain a conceptual logical view of the documents in terms of sets of nonelementary index terms.
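A simple way to realize this idea, assuming the text has already been part-of-speech tagged (the tags below are hard-coded for the example; a real system would use a tagger), is to sweep the tagged words and start a new group whenever the gap between consecutive nouns exceeds the threshold. This is our own sketch, not the INQUERY procedure.

```python
def noun_groups(tagged, max_gap=3):
    """Group nouns whose distance in the text (in words) does not exceed max_gap."""
    groups, current, last_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged):
        if tag == "NOUN":
            if current and pos - last_pos > max_gap:
                groups.append(current)
                current = []
            current.append(word)
            last_pos = pos
    if current:
        groups.append(current)
    return groups

# Tags are hard-coded for the example.
tagged = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
          ("department", "NOUN"), ("is", "VERB"), ("quietly", "ADV"),
          ("expanding", "VERB"), ("its", "DET"), ("research", "NOUN")]
print(noun_groups(tagged))
# [['computer', 'science', 'department'], ['research']]
```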

3.5.5. Text compression

Text compression is about finding ways to represent the text in fewer bits or bytes. The amount of space required to store text on computers can be reduced significantly using compression techniques. Compression methods create a reduced representation by identifying and using structures that exist in the text. From the compressed version, the original text can be reconstructed exactly.

Text compression is becoming an important issue in an information retrieval environment. The widespread use of digital libraries, office automation systems, document databases, and the Web has led to an explosion of textual information available online. In this scenario, text compression appears as an attractive option for reducing costs associated with space requirements, input/output (I/O) overhead, and communication delays. The gain obtained from compressing text is that it requires less storage space, it takes less time to be transmitted over a communication link, and it takes less time to search directly the compressed text. The price paid is the time necessary to code and decode the text.

A major obstacle for storing text in compressed form is the need for IR systems to access text randomly. To access a given word in a compressed text, it is usually necessary to decode the entire text from the beginning until the desired word is reached. It could be argued that a large text could be divided into blocks that are compressed independently, thus allowing fast random access to each block. However, efficient compression methods need to process some text before making compression effective (usually more than 10 kilobytes). The smaller the blocks, the less effective compression is expected to be.

Our discussion here focuses on text compression methods which are suitable for use in an IR environment. For instance, a successful idea aimed at merging the requirements of compression algorithms and the needs of IR systems is to consider that the symbols to be compressed are words and not characters (character-based compression is the more conventional approach). Words are the atoms on which most IR systems are built. Moreover, it is now known that much better compression is achieved by taking words as symbols (instead of characters). Further, new word-based compression methods allow random access to words within the compressed text which is a critical issue for an IR system.

Besides the economy of space obtained by a compression method, there are other important characteristics to be considered such as compression and decompression speed. In some situations, decompression speed is more important than compression speed. For instance, this is the case with textual databases in which it is common to compress the text once and to read it many times from disk.

Another important characteristic of a compression method is the possibility of performing compressed pattern matching, defined as the task of performing pattern matching in a compressed text without decompressing it. In this case, sequential searching can be speeded up by compressing the search key rather than decoding the compressed text being searched. As a consequence, it is possible to search faster on compressed text because much less text has to be scanned. Chapter 8 presents efficient methods to deal with searching the compressed text directly.

When the text collection is large, efficient text retrieval requires specialized index techniques. A simple and popular indexing structure for text collections are the inverted files. Inverted files (see Chapter 8 for details) are especially adequate when the pattern to be searched for is formed by simple words. Since this is a common type of query (for instance, when searching the Web), inverted files are widely used for indexing large text collections.

An inverted file is typically composed of (a) a vector containing all the distinct words in the text collection (which is called the vocabulary) and (b) for each word in the vocabulary a list of all documents (identified by document numbers) in which that word occurs. Because each list of document numbers (within the inverted file) is organized in ascending order, specific compression methods have been proposed for them, leading to very efficient index compression schemes. This is important because query processing time is highly related to index access time. Thus, in this section, we also discuss some of the most important index compression techniques.

We first introduce basic concepts related to text compression. We then present some of the most important statistical compression methods, followed by a brief review of compression methods based on a dictionary. At the end, we discuss the application of compression to inverted files.

There are two general approaches to text compression: statistical and dictionary based. Statistical methods rely on generating good probability estimates (of appearance in the text) for each symbol. The more accurate the estimates are, the better the compression obtained. A symbol here is usually a character, a text word, or a fixed number of characters. The set of all possible symbols in the text is called the alphabet. The task of estimating the probability of each next symbol is called modeling. A model is essentially a collection of probability distributions, one for each context in which a symbol can be coded. Once these probabilities are available, the symbols are converted into binary digits, a process called coding. In practice, both the encoder and the decoder use the same model. The decoder interprets the output of the encoder (with reference to the same model) to recover the original symbol.

There are two well-known statistical coding strategies: Huffman coding and arithmetic coding. The idea of Huffman coding is to assign a variable-length bit encoding to each different symbol of the text. Compression is achieved by assigning a smaller number of bits to symbols with higher probabilities of appearance. Huffman coding was first proposed in the early 1950s and was the most important compression method until the late 1970s, when arithmetic coding made higher compression rates possible.
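A character-based Huffman code can be built in a few lines with a priority queue; this is a generic textbook construction, shown only to make the idea concrete, and the sample string is arbitrary.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a character-based Huffman code: frequent symbols get shorter bit strings."""
    # Heap entries: (frequency, tiebreaker, {symbol: partial code}).
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # two least frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))  # ...are merged into one
        count += 1
    return heap[0][2]

codes = huffman_codes("this is an example of a huffman tree")
for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(ch), code)   # frequent characters (e.g., the space) get the shortest codes
```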

Arithmetic coding computes the code incrementally, one symbol at a time, as opposed to the Huffman scheme, in which each different symbol is pre-encoded with a fixed codeword (a fixed sequence of bits). This incremental nature does not allow decoding a string which starts in the middle of a compressed file: to decode a symbol in the middle of a file compressed with arithmetic coding, it is necessary to decode the whole text from the very beginning until the desired word is reached. This characteristic makes arithmetic coding inadequate for use in an IR environment.

Dictionary methods substitute a sequence of symbols by a pointer to a previous occurrence of that sequence. The pointer representations are references to entries in a dictionary composed of a list of symbols (often called phrases) that are expected to occur frequently. Pointers to the dictionary entries are chosen so that they need less space than the phrase they replace, thus obtaining compression. The distinction between modeling and coding does not exist in dictionary methods, and there are no explicit probabilities associated to phrases. The most well-known dictionary methods are represented by a family of methods, known as the Ziv-Lempel family.

Character-based Huffman methods are typically able to compress English texts to approximately five bits per character (usually, each uncompressed character takes 7–8 bits to be represented). More recently, a word-based Huffman method has been proposed as a better alternative for natural language texts. This method is able to reduce English texts to just over two bits per character. As we will see later on, word-based Huffman coding achieves compression rates close to the entropy and allows random access to intermediate points in the compressed text. Ziv-Lempel methods are able to reduce English texts to fewer than four bits per character. Methods based on arithmetic coding can also compress English texts to just over two bits per character. However, the price paid is slower compression and decompression, and the impossibility of randomly accessing intermediate points in the compressed text.

Before proceeding, let us present an important definition which will be useful from now on.

Definition 3.7. Compression ratio is the size of the compressed file as a fraction of the uncompressed file.

3.6. Index construction and compression

3.6.1. Inverted files

An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the “occurrences” ( Figure 3.5 shows an example). These positions can refer to words or characters. Word positions (i.e., position i refers to the ith word) simplify phrase and proximity queries, while character positions (i.e., the position i is the ith character) facilitate direct access to the matching text positions.

Figure 3.5.

A sample text and an inverted index built on it. The words are converted to lower case, and some are not indexed. The occurrences point to character positions in the text.
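A minimal sketch of how such an index can be built is shown below (our own illustration, with a toy stopword list): each indexed word is mapped, in lower case, to the character positions where it starts, as in Figure 3.5.

```python
import re
from collections import defaultdict

def build_inverted_index(text, stopwords=frozenset()):
    """Map each indexed word (lower-cased) to the list of character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word not in stopwords:
            index[word].append(match.start())
    return dict(index)

text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text, stopwords={"this", "is", "a", "has", "are", "from"})
print(index["text"])   # character positions of both occurrences of "text"
print(index["words"])  # both occurrences of "words"/"Words"
```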

Some authors make the distinction between inverted files and inverted lists. In an inverted file, each element of a list points to a document or file name, while inverted lists match our definition. We prefer not to make such a distinction because, as we will see later, this is a matter of the addressing granularity, which can range from text positions to logical blocks.

The space required for the vocabulary is rather small. According to Heaps' law, the vocabulary grows as $O(n^{\beta})$, where $\beta$ is a constant between 0 and 1 dependent on the text, being between 0.4 and 0.6 in practice. For instance, for 1 Gb of the TREC-2 collection, the vocabulary has a size of only 5 Mb. This may be further reduced by stemming and other normalization techniques as described.

The occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n). Even omitting stopwords (which is the default practice when words are indexed), in practice the space overhead of the occurrences is between 30 and 40% of the text size.

To reduce space requirements, a technique called block addressing is used. The text is divided into blocks, and the occurrences point to the blocks where the word appears (instead of the exact positions). The classical indices which point to the exact occurrences are called "full inverted indices." Using block addressing, not only can the pointers be smaller because there are fewer blocks than positions, but also all the occurrences of a word inside a single block are collapsed to one reference (see Figure 3.6). Indices of only 5% overhead over the text size are obtained with this technique. The price to pay is that, if the exact occurrence positions are required (for instance, for a proximity query), then an online search over the qualifying blocks has to be performed. For instance, block addressing indices with 256 blocks stop working well with texts of 200 Mb.

Figure 3.6.

The sample text splits into four blocks, and an inverted index uses block addressing built on it. The occurrences denote block numbers. Notice that both occurrences of “words” collapsed into one.

Table 3.1 presents the projected space taken by inverted indices for texts of different sizes, with and without the use of stopwords. The full inversion stands for inverting all the words and storing their exact positions, using four bytes per pointer. The document addressing index assumes that we point to documents of size 10 kb (using the necessary number of bytes per pointer, that is, one, two, or three bytes, depending on the text size). The block addressing index assumes that we use 256 or 64K blocks (one or two bytes per pointer, respectively) independently of the text size. The space taken by the pointers can be significantly reduced by using compression. We assume that 45% of all the words are stopwords and that there is one non-stopword every 11.5 characters. Our estimation for the vocabulary is based on Heaps' law with parameters $K = 30$ and $\beta = 0.5$, that is, $V = 30\,n^{0.5}$. All these decisions were taken according to our experience and validated experimentally.

Index | Small collection (1 Mb) | Medium collection (200 Mb) | Large collection (2 Gb)
Addressing words | 45% / 73% | 36% / 64% | 35% / 63%
Addressing documents | 19% / 26% | 18% / 32% | 26% / 47%
Addressing 64K blocks | 27% / 41% | 18% / 32% | 5% / 9%
Addressing 256 blocks | 18% / 25% | 1.7% / 2.4% | 0.5% / 0.7%

Table 3.1.

Sizes of an inverted file as approximate percentages of the size of the whole text collection.

The blocks can be of fixed size (imposing a logical block structure over the text database), or they can be defined using the natural division of the text collection into files, documents, Web pages, or others. The division into blocks of fixed size improves efficiency at retrieval time: the more variance there is in the block sizes, the more text must be traversed sequentially on average, because larger blocks match queries more frequently and are more expensive to traverse.

Alternatively, the division using natural cuts may eliminate the need for online traversal. For example, if one block per retrieval unit is used and the exact match positions are not required, there is no need to traverse the text for single-word queries, since it is enough to know which retrieval units to report. But if, on the other hand, many retrieval units are packed into a single block, the block has to be traversed to determine which units to retrieve.

It is important to notice that in order to use block addressing, the text must be readily available at search time. This is not the case for remote text (as in Web search engines) or if the text is in a CD-ROM that has to be mounted, for instance. Some restricted queries not needing exact positions can still be solved if the blocks are retrieval units.

Four granularities and three collections are considered. In each pair of figures, the left one assumes that stopwords are not indexed, while the right one assumes that all words are indexed.

3.6.2. Blocked sort-based indexing

We now review the basic steps in constructing a non-positional index. We first make a pass through the collection assembling all term-docID pairs. We then sort the pairs with the term as the dominant key and docID as the secondary key. Finally, we organize the docIDs for each term into a postings list and compute statistics like term and document frequency. For small collections, all this can be done in memory. In this chapter, we describe methods for large collections that require the use of secondary storage.

To make index construction more efficient, we represent terms as termIDs, where each termID is a unique serial number. We can build the mapping from terms to termIDs on the fly while we are processing the collection, or, in a two-pass approach, we compile the vocabulary in the first pass and construct the inverted index in the second pass. The index construction algorithms described in this chapter all do a single pass through the data. Section 4.7 gives references to multipass algorithms that are preferable in certain applications, for example, when disk space is scarce.

We work with the Reuters Corpus Volume I (RCV1) collection as our model collection in this chapter, a collection with roughly 1 GB of text. It consists of about 800,000 documents that were sent over the Reuters newswire during a 1-year period between August 20, 1996, and August 19, 1997. A typical document is shown in Figure 3.7 , but note that we ignore multimedia information like images in this book and are only concerned with text. Reuters Corpus Volume I (RCV1) covers a wide range of international topics, including politics, business, sports, and (as in this example) science. Some key statistics of the collection are shown in Table 3.2 .

Figure 3.7.

Document from the Reuters newswire.

Symbol Statistic Value
N Documents 800,000
L Avg. number of tokens per document 200
M Terms 400,000
Avg. number of bytes per token (incl. spaces/punct.) 6
Avg. number of bytes per token (without spaces/punct.) 4.5
Avg. number of bytes per term 7.5
T Tokens 100,000,000

Table 3.2.

Collection statistics for Reuters Corpus Volume I (RCV1).

Values are rounded for the computations in this book. The unrounded values are 806,791 documents, 222 tokens per document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation, 4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and 96,969,056 tokens

Reuters Corpus Volume I (RCV1) has 100 million tokens. Collecting all termID-docID pairs of the collection using 4 bytes each for termID and docID therefore requires 0.8 GB of storage. Typical collections today are often one or two orders of magnitude larger than Reuters Corpus Volume I (RCV1). You can easily see how such collections overwhelm even large computers if we try to sort their termID-docID pairs in memory. If the size of the intermediate files during index construction is within a small factor of available memory, then the compression techniques introduced in Chapter 5 can help; however, the postings file of many large collections cannot fit into memory even after compression ( Figure 3.7 ).

With insufficient main memory, we need to use an external sorting algorithm, that is, one that uses disk. To achieve acceptable sorting efficiency, the central requirement of such an algorithm is that it minimizes the number of random disk seeks during sorting; sequential disk reads are far faster than seeks, as we explained in Section 4.1. One solution is the blocked sort-based indexing algorithm, or BSBI, in Figure 3.8. BSBI (i) segments the collection into parts of equal size, (ii) sorts the termID-docID pairs of each part in memory, (iii) stores intermediate sorted results on disk, and (iv) merges all intermediate results into the final index.

Figure 3.8.

Blocked sort-based indexing. The algorithm stores inverted blocks in files $f_1, \ldots, f_n$ and the merged index in $f_{\mathrm{merged}}$.

The algorithm parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK in Figure 3.8 ). We choose the block size to fit comfortably into memory to permit a fast in-memory sort. The block is then inverted and written to disk. Inversion involves two steps. First, we sort the termID-docID pairs. Next, we collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk. Applying this to Reuters Corpus Volume I (RCV1) and assuming we can fit 10 million termID-docID pairs into memory, we end up with ten blocks, each an inverted index of one part of the collection.

In the final step, the algorithm simultaneously merges the ten blocks into one large merged index. An example with two blocks is shown in Figure 3.9, where we use $d_i$ to denote the $i$-th document of the collection. To do the merging, we open all block files simultaneously and maintain small read buffers for the ten blocks we are reading and a write buffer for the final merged index we are writing. In each iteration, we select the lowest termID that has not been processed yet using a priority queue or a similar data structure. All postings lists for this termID are read and merged, and the merged list is written back to disk. Each read buffer is refilled from its file when necessary.
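The following is a minimal in-memory sketch of the BSBI idea, not the algorithm of Figure 3.8 itself: blocks of termID-docID pairs are sorted independently and then combined with a k-way merge. A real implementation would write each sorted block to disk and stream the merge through read buffers, as described above.

```python
import heapq
from itertools import groupby

def invert_block(pairs):
    """Sort one block of (termID, docID) pairs; in BSBI each sorted block is written to disk."""
    return sorted(pairs)

def merge_blocks(blocks):
    """Merge the sorted blocks and collapse equal termIDs into postings lists."""
    merged = heapq.merge(*blocks)                    # k-way merge of sorted runs
    index = {}
    for term_id, group in groupby(merged, key=lambda p: p[0]):
        index[term_id] = sorted({doc_id for _, doc_id in group})
    return index

# Two toy "blocks" of termID-docID pairs (in practice each block holds millions of pairs).
block1 = invert_block([(2, 1), (1, 1), (3, 2)])
block2 = invert_block([(1, 3), (2, 3), (1, 4)])
print(merge_blocks([block1, block2]))
# {1: [1, 3, 4], 2: [1, 3], 3: [2]}
```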

Figure 3.9.

Merging in blocked sort-based indexing. Two blocks (postings lists to be merged) are loaded from disk into memory, merged in memory (merged postings lists), and written back to disk. We show terms instead of termIDs for better readability.

How expensive is BSBI? Its time complexity is $\Theta(T \log T)$ because the step with the highest time complexity is sorting, and $T$ is an upper bound for the number of items we must sort (i.e., the number of termID-docID pairs). However, the actual indexing time is usually dominated by the time it takes to parse the documents (PARSENEXTBLOCK) and to do the final merge (MERGEBLOCKS). Exercise 4.6 asks you to compute the total index construction time for RCV1, including these steps as well as inverting the blocks and writing them to disk (Figure 3.9).

Notice that Reuters Corpus Volume I (RCV1) is not particularly large in an age when one or more GB of memory is standard on personal computers. With appropriate compression, we could have created an inverted index for RCV1 in memory on a not overly beefy server. The techniques we have described are needed, however, for collections that are several orders of magnitude larger.

3.6.3. Single-pass in-memory indexing

Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing or SPIMI. SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.

The SPIMI algorithm is shown in Figure 3.10. The part of the algorithm that parses documents and turns them into a stream of term-docID pairs, which we call tokens here, has been omitted. SPIMI-INVERT is called repeatedly on the token stream until the entire collection has been processed. Tokens are processed one by one (line 4) during each successive call of SPIMI-INVERT. When a term occurs for the first time, it is added to the dictionary (best implemented as a hash), and a new postings list is created (line 6). The call in line 7 returns this postings list for subsequent occurrences of the term.

Figure 3.10.

Inversion of a block in single-pass in-memory indexing.

A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10). Instead of first collecting all termID-docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows), and it is immediately available to collect postings. This has two advantages: it is faster because there is no sorting required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored. As a result, the blocks that individual calls of SPIMI-INVERT can process are much larger, and the index construction process as a whole is more efficient.

Because we do not know how large the postings list of a term will be when we first encounter it, we allocate space for a short postings list initially and double the space each time it is full (lines 8–9). This means that some memory is wasted, which counteracts the memory savings from the omission of termIDs in intermediate data structures. However, the overall memory requirements for the dynamically constructed index of a block in SPIMI are still lower than in BSBI.

When memory has been exhausted, we write the index of the block (which consists of the dictionary and the postings lists) to disk (line 12). We have to sort the terms (line 11) before doing this because we want to write postings lists in lexicographic order to facilitate the final merging step. If each block’s postings lists were written in unsorted order, merging blocks could not be accomplished by a simple linear scan through each block.

Each call of SPIMI-INVERT writes a block to disk, just as in BSBI. The last step of SPIMI (corresponding to line 7 in Figure 3.8 ; not shown in Figure 3.10 ) is then to merge the blocks into the final inverted index.
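A minimal sketch of the inversion of one block in the spirit of SPIMI-INVERT follows. It is our own simplification (it assumes tokens arrive grouped by document, so docIDs reach each postings list in nondecreasing order, and it omits the memory check and disk write): postings are appended directly to per-term lists, and the dictionary is sorted only when the block is complete.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Invert one block: each (term, docID) token is appended directly to the
    term's postings list; the dictionary is sorted before the block is written out."""
    dictionary = defaultdict(list)          # term -> postings list (docIDs)
    for term, doc_id in token_stream:
        postings = dictionary[term]         # creates the list on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)         # add posting; repeats within a doc collapsed
    return {term: dictionary[term] for term in sorted(dictionary)}

tokens = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 2), ("calpurnia", 2)]
print(spimi_invert(tokens))
# {'brutus': [1, 2], 'caesar': [1, 2], 'calpurnia': [2]}
```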

In addition to constructing a new dictionary structure for each block and eliminating the expensive sorting step, SPIMI has a third important component: compression. Both the postings and the dictionary terms can be stored compactly on disk if we employ compression. Compression increases the efficiency of the algorithm further because we can process even larger blocks and because the individual blocks require less space on disk. We refer readers to the literature for this aspect of the algorithm.

The time complexity of SPIMI is $\Theta(T)$ because no sorting of tokens is required and all operations are at most linear in the size of the collection.

3.6.4. Dictionary compression

This section presents a series of dictionary data structures that achieve increasingly higher compression ratios. The dictionary is small compared with the postings file as suggested by Table 5.1. So, why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?

One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. Thus, the main goal of compressing the dictionary is to fit it in the main memory, or at least a large portion of it, to support high query throughput. Although dictionaries of very large collections fit into the memory of a standard desktop machine, this is not true of many other application scenarios. For example, an enterprise search server for a large corporation may have to index a multiterabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages. We also want to be able to design search systems for limited hardware such as mobile phones and onboard computers. Other reasons for wanting to conserve memory are fast start-up time and having to share resources with other applications. The search system on your PC must get along with the memory-hogging word processing suite you are using at the same time ( Figure 3.11 ).

Figure 3.11.

Storing the dictionary as an array of fixed-width entries.
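As a rough illustration of why fixed-width entries are wasteful, the sketch below (our own toy example; the 20-byte field width is an assumption, not a value from this chapter) counts the bytes left unused when every term occupies a fixed-width field.

```python
def fixed_width_waste(terms, width=20):
    """Bytes wasted when every term is stored in a fixed-width field of `width` bytes."""
    used = sum(len(t) for t in terms)
    allocated = width * len(terms)
    return allocated - used

terms = ["a", "aachen", "zulu", "information", "retrieval"]
print(fixed_width_waste(terms))   # wasted bytes for this toy vocabulary
# With an average term length around 8 bytes, most of each 20-byte field goes unused,
# which motivates the more compact dictionary representations discussed in the literature.
```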

3.7. XML retrieval

Information retrieval systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text, by which we mean raw text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title, and salary. There are fundamental differences between information retrieval and database systems in terms of retrieval model, data structures, and query languages, as shown in Table 3.3.

Some highly structured text search problems are most efficiently handled by a relational database (RDB), for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query select lastname from employees where job_desc like 'invoic%' may be sufficient to satisfy your information need with high precision and recall.

However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging), and output from office suites like OpenOffice that save documents as marked-up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria. Examples of such queries are: give me a full-length article on fast Fourier transforms (digital libraries), give me patents whose claims mention RSA public-key encryption and that cite US patent 4,405,829 (patent databases), or give me articles about sightseeing tours of the Vatican and the Coliseum (entity-tagged text). These three queries are structured queries that cannot be answered well by an unranked retrieval system. As we argued in Example 1.1 (p. 15), unranked retrieval models like the Boolean model suffer from low recall. For instance, an unranked system would return a potentially large number of articles that mention the Vatican, the Coliseum, and sightseeing tours without ranking the ones that are most relevant for the query first. Most users are also notoriously bad at precisely stating structural constraints. For instance, users may not know for which structured elements the search system supports search. In our example, the user may be unsure whether to issue the query as sightseeing AND (COUNTRY:Vatican OR LANDMARK:Coliseum), as sightseeing AND (STATE:Vatican OR BUILDING:Coliseum), or in some other form. Users may also be completely unfamiliar with structured search and advanced search interfaces or unwilling to use them. In this chapter, we look at how ranked retrieval methods can be adapted to structured documents to address these problems.

There is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries.

We will only look at one standard for encoding structured documents, Extensible Markup Language or XML, which is currently the most widely used such standard. We will not cover the specifics that distinguish XML from other types of markup such as HTML and SGML. But most of what we say in this chapter is applicable to markup languages in general.

In the context of information retrieval, we are only interested in XML as a language for encoding text and documents. A perhaps more widespread use of XML is to encode non-text data. For example, we may want to export data in XML format from an enterprise resource planning system and then read them into an analytic program to produce graphs for a presentation. This type of application of XML is called data-centric because numerical and non-text attribute-value data dominate and text is usually a small fraction of the overall data. Most data-centric XML is stored in databases—in contrast to the inverted index-based methods for text-centric XML that we present in this chapter.

We call XML retrieval structured retrieval in this chapter. Some researchers prefer the term semistructured retrieval to distinguish XML retrieval from database querying. We have adopted the terminology that is widespread in the XML retrieval community. For instance, the standard way of referring to XML queries is structured queries, not semistructured queries. The term structured retrieval is rarely used for database querying, and it always refers to XML retrieval in this book.

There is a second type of information retrieval problem that is intermediate between unstructured retrieval and querying a relational database: parametric and zone search, which we discussed in Section 6.1 (p. 110). In the data model of parametric and zone search, there are parametric fields (relational attributes like date or file size) and zones, that is, text attributes that each take a chunk of unstructured text as value. The data model is flat: there is no nesting of attributes, and the number of attributes is small. In contrast, XML documents have the more complex tree structure that we see in Figure 3.13, in which attributes are nested. The number of attributes and nodes is greater than in parametric and zone search.

After presenting the basic concepts of XML in Section 3.7.1, this chapter first discusses the challenges we face in XML retrieval (Section 3.7.2). Next, we describe a vector space model for XML retrieval (Section 3.7.3) and present INEX, a shared task evaluation that has been held for a number of years and is currently the most important venue for XML retrieval research.

3.7.1. Basic XML concepts

An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and a closing tag. An element can have one or more XML attributes. In the XML document in Figure 3.12, the scene element is enclosed by the opening tag <scene ...> and the closing tag </scene>. It has an attribute number with value vii and two child elements, title and verse.

Figure 3.12.

An XML document.

Figure 3.13 shows Figure 3.12 as a tree. The leaf nodes of the tree consist of text, for example, Shakespeare, Macbeth, and Macbeth’s castle. The tree’s internal nodes encode either the structure of the document (title, act, and scene) or metadata functions (author).

Figure 3.13.

The XML document in Figure 3.12 as a simplified DOM object.

The standard for accessing and processing XML documents is the XML Document Object Model or DOM. The DOM represents elements, attributes, and text within elements as nodes in a tree. Figure 3.13 is a simplified DOM representation of the XML document in Figure 3.12 . With a DOM API, we can process an XML document by starting at the root element and then descending down the tree from parents to children.

XPath is a standard for enumerating paths in an XML document collection. We will also refer to paths as XML contexts or simply contexts in this chapter. Only a small subset of XPath is needed for our purposes. The XPath expression node selects all nodes of that name. Successive elements of a path are separated by slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that an arbitrary number of elements can intervene on a path: play//scene selects all scene elements occurring in a play element. In Figure 3.13, this set consists of a single scene element, which is accessible via the path play, act, scene from the top. An initial slash starts the path at the root element. /play/title selects the play's title in Figure 3.12, /play//title selects a set with two members (the play's title and the scene's title), and /scene/title selects no elements. For notational convenience, we allow the final element of a path to be a vocabulary term and separate it from the element path by the symbol #, even though this does not conform to the XPath standard. For example, title#"Macbeth" selects all titles containing the term Macbeth.
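The XPath subset used above can be tried directly with Python's standard xml.etree.ElementTree module, whose findall() accepts a limited XPath dialect. The markup below is a reconstruction in the spirit of Figure 3.12, not the chapter's exact document, and the last selection only approximates the non-standard # notation.

```python
import xml.etree.ElementTree as ET

# A document shaped like the play example (reconstructed for illustration).
doc = ET.fromstring("""
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
""")

print([e.text for e in doc.findall("title")])    # /play/title  -> ['Macbeth']
print([e.text for e in doc.findall(".//title")])  # /play//title -> both titles
print([e.tag for e in doc.findall("act/scene")])  # scene elements whose parent is an act
print([e.text for e in doc.findall(".//title")
       if "Macbeth" in (e.text or "")])           # rough analogue of title#"Macbeth"
```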

We also need the concept of schema in this chapter. A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare's plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute. Two standards for schemas for XML documents are XML DTD (document type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.

A common format for XML queries is NEXI (Narrowed Extended XPath I). We give an example in Figure 3.14 . We display the query on four lines for typographical convenience, but it is intended to be read as one unit without line breaks. In particular, //section is embedded under //article.

Figure 3.14.

An XML query in NEXI format and its partial representation as a tree.

The query in Figure 3.14 specifies a search for sections about the summer holidays that are part of articles from 2001 or 2002. As in XPath, double slashes indicate that an arbitrary number of elements can intervene on a path. The dot in a clause in square brackets refers to the element the clause modifies. The clause [.//yr = 2001 or .//yr = 2002] modifies //article; thus, the dot refers to //article in this case. Similarly, the dot in [about(., summer holidays)] refers to the section that the clause modifies.

The two year conditions are relational attribute constraints. Only articles whose year attribute is 2001 or 2002 (or that contain an element whose year attribute is 2001 or 2002) are to be considered. The about clause is a ranking constraint: sections that occur in the right type of article are to be ranked according to how relevant they are to the topic summer holidays.

We usually handle relational attribute constraints by prefiltering or postfiltering: we simply exclude all elements from the result set that do not meet the relational attribute constraints. In this chapter, we will not address how to do this efficiently and instead focus on the core information retrieval problem in XML retrieval, namely, how to rank documents according to the relevance criteria expressed in the about conditions of the NEXI query.
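
A minimal sketch of the postfiltering variant is shown below; it assumes that ranked results have already been produced, and the record structure (element, yr, score) is an illustrative assumption rather than a prescribed format.

```python
# Postfiltering on a relational attribute constraint: drop ranked results
# whose enclosing article's yr attribute is neither 2001 nor 2002.
from typing import Dict, List

def postfilter(ranked_results: List[Dict], allowed_years=frozenset({2001, 2002})) -> List[Dict]:
    """Keep only results whose enclosing article satisfies the yr constraint."""
    return [r for r in ranked_results if r["yr"] in allowed_years]

ranked = [
    {"element": "article[1]//section[2]", "yr": 2001, "score": 0.83},
    {"element": "article[7]//section[1]", "yr": 1999, "score": 0.80},
    {"element": "article[3]//section[4]", "yr": 2002, "score": 0.71},
]
print(postfilter(ranked))  # the 1999 result is excluded; ranking order is preserved
```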

If we discard relational attributes, we can represent documents as trees with only one type of node: element nodes. In other words, we remove all attribute nodes from the XML document, such as the number attribute in Figure 3.12. Figure 3.15 shows a subtree of the document in Figure 3.12 as an element-node tree (labeled d1).
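
A short sketch of this attribute-removal step with ElementTree is given below; the toy document is illustrative.

```python
# Discarding attribute nodes (e.g. number) so that only element nodes remain.
import xml.etree.ElementTree as ET

def strip_attributes(element: ET.Element) -> ET.Element:
    """Remove every attribute from the element and all of its descendants."""
    for node in element.iter():
        node.attrib.clear()
    return element

doc = ET.fromstring("<play><act number='I'><scene number='vii'/></act></play>")
strip_attributes(doc)
print(ET.tostring(doc, encoding="unicode"))  # <play><act><scene /></act></play>
```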

Figure 3.15.

Tree representation of XML documents and queries.

We can represent queries as trees in the same way. This is a query-by-example approach to query language design because users pose queries by creating objects that satisfy the same formal description as documents. In Figure 3.15, q1 is a search for books whose titles score highly for the keywords Julius Caesar; q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war. To represent the semantics of NEXI queries fully, we would also need to designate one node in the tree as a “target node”, for example, the section in the tree in Figure 3.14. Without the designation of a target node, the tree in Figure 3.14 is not a search for sections embedded in articles (as specified by NEXI), but a search for articles that contain sections.
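
The sketch below captures the query-by-example idea with a single node type shared by documents and queries; the Node class and the encodings of q1 and q2 are illustrative assumptions following the description of Figure 3.15.

```python
# One tree shape for both documents and queries (query-by-example).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                   # element name, e.g. book, title, author
    terms: str = ""              # keywords attached to this node, if any
    children: List["Node"] = field(default_factory=list)

# q1: books whose title scores highly for "Julius Caesar"
q1 = Node("book", children=[Node("title", "Julius Caesar")])

# q2: books whose author matches "Julius Caesar" and whose title matches "Gallic war"
q2 = Node("book", children=[Node("author", "Julius Caesar"),
                            Node("title", "Gallic war")])

print(q1)
print(q2)
```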

3.7.2. Challenges in XML retrieval

In this section, we discuss a number of challenges that make structured retrieval more difficult than unstructured retrieval. Recall the basic setting we assume in structured retrieval: the collection consists of structured documents, and queries are either structured (as in Figure 3.14) or unstructured (e.g., summer holidays).

The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. If we query Shakespeare’s plays for Macbeth’s castle, should we return the scene, the act, or the entire play in Figure 3.13? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit.

One criterion for selecting the most appropriate part of a document is the structured document retrieval principle:

Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.

This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. However, it can be hard to implement this principle algorithmically. Consider the query title#“Macbeth” applied to Figure 3.13. The title of the tragedy, Macbeth, and the title of act I, scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred. Deciding which level of the tree is right for answering a query is difficult.
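
A naive sketch of the smallest-containing-unit strategy is shown below; it applies the strategy literally, so it cannot, for example, prefer the play title over the scene title as the discussion above recommends for title#“Macbeth”, and a real system would need additional evidence to make that choice.

```python
# Return elements that contain every query term but none of whose children
# already do, i.e. the smallest units containing the information sought.
import xml.etree.ElementTree as ET

def full_text(element: ET.Element) -> str:
    """All text in the element and its descendants."""
    return " ".join(element.itertext())

def smallest_containing_units(root: ET.Element, terms):
    def contains_all(e):
        text = full_text(e).lower()
        return all(t.lower() in text for t in terms)
    hits = []
    for element in root.iter():
        if contains_all(element) and not any(contains_all(child) for child in element):
            hits.append(element)
    return hits

doc = ET.fromstring(
    "<play><title>Macbeth</title>"
    "<act><scene><title>Macbeth's castle</title></scene></act></play>"
)
for e in smallest_containing_units(doc, ["Macbeth", "castle"]):
    print(e.tag, full_text(e))  # the scene's title is the most specific match
```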

Parallel to the issue of which parts of a document to return to the user is the issue of which parts of a document to index. In Section 2.1.2 (p. 20), we discussed the need for a document unit or indexing unit in indexing and retrieval. In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web, etc. In structured retrieval, there are a number of different approaches to defining the indexing unit.

One approach is to group nodes into non-overlapping pseudo-documents as shown in Figure 3.16 . In the example, books, chapters, and sections have been designated to be indexing units but without overlap. For example, the leftmost dashed indexing unit contains only those parts of the tree dominated by book that are not already part of other indexing units. The disadvantage of this approach is that pseudo-documents may not make sense to the