
Data Service Outsourcing and Privacy Protection in Mobile Internet

By Zhen Qin, Erqiang Zhou, Yi Ding, Yang Zhao, Fuhu Deng and Hu Xiong

Reviewed: July 3rd, 2018. Published: November 5th, 2018

DOI: 10.5772/intechopen.79903


Abstract

Mobile Internet data are large in scale, diverse in pattern, and complexly associated. On the one hand, efficient data processing models are needed to support data services; on the other hand, considerable computing resources are needed to provide data security services. Because the resources of mobile terminals are limited, they cannot complete large-scale data computation and storage on their own. However, outsourcing these tasks to third parties introduces risks to user privacy. This monograph focuses on key technologies of data service outsourcing and privacy protection, including existing methods of data analysis and processing, fine-grained data access control through effective user privacy protection mechanisms, and data sharing in the mobile Internet.

Keywords

  • mobile Internet
  • data service
  • data outsourcing
  • privacy preserving

Preface

The data of the mobile Internet are large in scale, diverse in pattern, and complexly associated. On the one hand, efficient data processing models are needed to support data services; on the other hand, considerable computing resources are needed to provide data security services. Because the resources of mobile terminals are limited, they cannot complete large-scale data computation and storage on their own. However, outsourcing these tasks to third parties introduces risks to user privacy.

This monograph focuses on key technologies of data service outsourcing and privacy protection in the mobile Internet, including existing methods of data analysis and processing, fine-grained data access control through effective user privacy protection mechanisms, and data sharing in the mobile Internet, which provide technical support for improving service applications based on the mobile Internet. The financial support from the National Natural Science Foundation of China under grant No. 61672135, the Sichuan Science-Technology Support Plan Program under grants No. 2018GZ0236, No. 2017FZ0004, and No. 2016JZ0020, and the National Science Foundation of China-Guangdong Joint Foundation under grant No. U1401257 is also greatly appreciated.

Needless to say, any errors and omissions that remain are entirely our own responsibility, and we would greatly appreciate feedback about them from readers. Please send your comments to “qinzhen@uestc.edu.cn” and put “Data Service Outsourcing and Privacy Protection in Mobile Internet” in the subject line.

Zhen Qin

University of Electronic Science and Technology of China

March, 2018

Setting the Stage

1. Why Outsourcing and Privacy Protection Matter for Data Service in the Mobile Internet

1.1. Mobile Internet

With the continuous development of science and technology, the mobile Internet (MI) has become an important part of our daily lives. It is a product of the convergence of mobile communication technologies and Internet technologies, offering wide network coverage, convenience, and instant access. The smartphone is the main carrier of the mobile Internet. Through smartphones, people can make voice calls, send text messages, and hold video communications, as well as surf the Internet at any time (browsing the web, sending and receiving e-mails, watching movies, etc.). In other words, using smartphones, users are more capable of creating, deploying, and managing novel and efficient mobile applications. Nowadays, the MI has become an important portal and major innovation platform for Internet businesses and an important hub for the exchange of information resources among various social media, business service companies, and new application software.

1.1.1. Mobile Internet

The mobile Internet, defined as the platform through which users wirelessly access the digitized contents of the Internet via mobile devices, consists of the following:

  • An Internet service provider is an organization that provides services such as Internet access, Internet transit, domain name registration, web hosting, Usenet service, and colocation.

  • Users can get different services from the mobile Internet such as points of interest services, music services, fitness services, and mobile commerce services by using mobile applications.

In the MI, mobile devices, which include smartphones, tablets, PCs, e-books, MIDs, etc., enable users to stay connected not only with the network service providers through LTE and NFC techniques but also with physically close peers through short-range wireless communication techniques. In fact, each individual user has a personalized behavior pattern, which can influence the reliability and efficiency of the network in the MI. Moreover, the MI has a wide range of applications such as wireless body area networks (WBANs), mobile e-commerce, mobile social networks, mobile ad hoc networks, etc. The common scenarios of the MI are shown in Figure 1.1.

Figure 1.1.

The common scenarios of the MI.

1.1.2. Application

The mobile Internet offers a huge number of applications to users. Most of these applications have unique features or special functions, but the underlying technology remains largely the same across services. The popular applications in the mobile Internet are briefly reviewed below.

1.1.2.1. Mobile social networks

Social networking has already become an important integral part of our daily lives, enabling us to contact our friends and families. Nowadays, the pervasive use of the mobile Internet and the social connections fosters a promising mobile social network (MSN) where reliable, comfortable, and efficient computation and communication tools are provided to improve the quality of our work and life.

An MSN is a social network in which individuals with similar interests converse and connect with one another through their mobile phones and/or tablets. Much like web-based social networking, mobile social networking occurs in virtual communities [1].

Similar to online social networking sites such as Facebook and Twitter, there are just as many social networks on mobile devices. They offer a vast number of functions, including multimedia posts, photo sharing, and instant messaging, and most of these mobile applications offer free international calling and texting. Today, social networking applications are used not only for social purposes but also for professional ones, as with LinkedIn, which is still growing steadily. Along with sharing multimedia posts and instant messaging, social networks are commonly used to connect immigrants in a new country. While the thought of moving to a new country may be intimidating for many, social media can be used to connect immigrants of the same homeland and make assimilation a little less stressful [2].

1.1.2.2. Mobile ad hoc networks

Owing to the limitations of applications over wired communication, extending applications to the mobile Internet has become necessary. The growth of laptops and 802.11/Wi-Fi wireless networks has made mobile ad hoc networks (MANETs) a popular research topic since the mid-1990s.

A mobile ad hoc network (MANET), also known as a wireless ad hoc network [3] or ad hoc wireless network, is a continuously self-configuring, infrastructure-less network of mobile devices connected over wireless channels [4]. Each device in a MANET is free to move independently in any direction and therefore changes its links to other devices frequently. Such networks may operate by themselves or may be connected to the larger Internet. They may contain one or multiple different transceivers between nodes. This results in a highly dynamic, autonomous topology [5].

MANETs can be used in many applications, ranging from environmental sensing, vehicular ad hoc communication, road safety, smart homes, and peer-to-peer messaging to air/land/navy defense, weapons, and robots. A typical instance is health monitoring, which may include heart rate, blood pressure, etc. [6]. Monitoring can be constant, as for a patient in a hospital, or event driven, as for a wearable sensor that automatically reports the wearer's location to an ambulance team in an emergency. Animals can have sensors attached to them in order to track their movements for migration patterns, feeding habits, or other research purposes [7]. Another example is disaster rescue operations. Sensors may also be attached to unmanned aerial vehicles (UAVs) for surveillance or environment mapping [8]. In the case of autonomous UAV-aided search and rescue, this would be considered an event mapping application, since the UAVs are deployed to search an area but only transmit data back when a person has been found.

1.1.2.3. Mobile payment

With rapid economic growth, methods of payment have changed greatly all over the world. Driven by the development of the mobile Internet, mobile payment has become pervasive in Asian countries, especially in China.

Mobile payment (also referred to as mobile money, mobile money transfer, and mobile wallet) generally refers to payment services operated under financial regulation and performed from or via a mobile device. Instead of paying with cash, check, or credit cards, a consumer can use a mobile device to pay for a wide range of services and digital or hard goods.

Although the concept of using non-coin-based currency systems has a long history [9], it is only recently that the technology to support such systems has become widely available.

1.1.3. Mobile Internet and user behavior

Since the start of the twenty-first century, China's Internet has developed rapidly, and the mobile Internet has further accelerated this growth. We soon discovered that today's Internet is used for e-commerce shopping, map navigation, mobile payment, stock financing, online booking, government services, account recharging, and other activities that are completely different from before; it seems that everything can be done on the Internet, relying on just a phone. In the mobile Internet, users and mobile terminals are closely related, and the accumulated record of user behavior in the mobile Internet is of great value. Smartphones and other mobile terminals are no longer simple communication tools but have become terminals that integrate communication, computing, and sensing functions and can continuously generate and collect various kinds of data [10]. The various sensing devices on a mobile terminal can sense the surrounding environment and generate data; at the same time, the mobile terminal serves as the user's entrance to the mobile Internet and can collect data generated by other users. By combining the various data related to mobile users, user behaviors can be modeled and predicted, and important statistics for specific services can be extracted. For example, mobile users' GPS information and historical travel habits can be used to generate traffic statistics and travel advice [11]. It can be seen that, based on the data generated in the mobile Internet, we can promptly satisfy the different requirements of different users in different scenarios by judging users' information needs and accurately sorting information categories, thereby providing users with personalized, high-quality information and improving their service experience.

Relevant data show that in the three years of rapid development of the mobile Internet, the total amount of data and information generated by humans exceeded that of the previous 400 years. The global data volume reached 1.8 ZB in 2011 and is expected to reach 35 ZB by 2020. However, as data accumulate, the amount of data available for analysis and mining continues to grow, and the entropy of information (i.e., the ratio of total information to valuable information) tends to increase. The growing amount of data makes information overload, information redundancy, and information search new issues in the mobile Internet era. Search engines and recommendation systems provide very important technical means for solving these problems. When users search for information on the Internet, the search engine matches the keywords supplied by the users in the background and displays the results. However, if users cannot accurately describe the keywords they need, the search engine is powerless. Unlike search engines, a recommendation system does not require users to state explicit requirements; instead, it models users' interests by analyzing their historical behavior and actively recommends information that meets their interests and needs. In recent years, e-commerce has been booming, and the dominant position of the recommendation system on the Internet has become increasingly apparent. In the mobile Internet, however, data are not only increasingly large in scale but also diverse in model, complex in association, and uncertain in credibility. Diversified models make the data content difficult to understand; complex associations make the data difficult to identify effectively; uncertain credibility makes the authenticity of the data difficult to determine. These characteristics pose new challenges to the perception, expression, understanding, and computation of data. The space-time complexity of traditional computing models greatly limits our ability to design efficient computing models for mobile Internet data. How to quantify, define, and extract the essential characteristics and internal correlations of data in the mobile Internet, study their internal mechanisms, and provide support for recommendation services is an urgent problem to be solved.

1.1.4. Behavior and data

At the same time, because data in the mobile Internet are large in scale, diverse in model, and complex in association, data analysis and processing place certain demands on computing resources, while mobile terminals such as smartphones have limited resources (e.g., processing capability). Therefore, analysis and processing of mobile Internet data are often carried out on devices such as third-party data processing servers that have sufficient computing resources. However, users and mobile terminals in the mobile Internet are closely related, and their data contain a large amount of private user information, which raises a series of security issues. The users of an outsourced data application system are not a fixed, specific group; any user connected to the Internet may submit data query requests to the outsourcing service, and different users have different requirements. Precisely because of this openness, the outsourcing system may be completely exposed to any external potential user, and it is not enough to protect the privacy of outsourced data only through system-level access control, cryptography, and other technical means. According to statistics from the Beijing Zhongguancun police station, telecommunication fraud reported during 2012 accounted for 32% of the cases filed, the highest proportion among crime types. Fraudsters often use six kinds of methods:

  • Impersonation of an individual or a circle of friends after information disclosure, such as criminals posing as public security organs, postal services, telecommunications operators, banks, social security staff, or friends and relatives; this accounts for 42% of the total number of fraud cases.

  • Fraud committed in the name of the seller after the leakage of shopping information.

  • Lottery-winning fraud following the leakage of telephone, QQ, or e-mail information.

  • False recruitment information received after the leakage of job-seeking information.

  • Internet dating fraud after the leakage of dating information.

  • Kidnapping fraud after the disclosure of family information.

It can be seen that many third parties have disclosed users' personal information to varying degrees.

When people realize that they want to protect their privacy and try to hide their actions, they do not consider that traces of those actions are already on the Internet; in particular, different social locations in the network generate many data footprints. Such data are cumulative and correlated: the information from a single location may not reveal a user's privacy, but if a person's behaviors are gathered from many independent locations, his or her privacy will be exposed, because enough information about that person has accumulated. This hidden data exposure is often unpredictable and uncontrollable by individuals. The leakage of personal privacy information has caused panic among some users, which has negatively affected the sharing of data resources in the mobile Internet. The issue of privacy protection has become increasingly important.

As carriers of personal information, mobile smart terminals are characterized by mobility, diversity, and intelligence. The various applications installed on them are of complex types, their execution mechanisms are not disclosed, and they handle large amounts of data; these factors combine to expose them to much greater threats than before. In terms of presentation, the security issues of mobile smart terminals are mainly concentrated in the following aspects: first, illegal dissemination of content; second, malicious fee deduction; third, theft of user privacy; and fourth, black screens, system crashes, and other problems caused by mobile terminal viruses and illegal flashing. The following takes mobile phone malicious code and user privacy theft as an example. Because of negligence in the design and implementation of mobile smart terminal operating systems, system vulnerabilities may be exploited by malicious code. In particular, as mobile Internet hacking techniques mature, malicious code targeting mobile smart terminals poses a great threat to their security. Mobile malicious code is a destructive program on mobile terminals. It usually spreads between mobile smart terminals through SMS, MMS, e-mail, and website browsing over mobile communication networks, or via infrared and Bluetooth. At present, the mobile smart terminal operating system with the highest market share is Google's Android, followed by Apple's iOS and Microsoft's Windows Phone. As smart terminal operating systems become more uniform, the spread of malicious code has accelerated, and it has evolved from simple fee absorption to sophisticated fraud and hooliganism. In addition, as mobile terminals are increasingly bound to bank accounts, such as through third-party payment, the generation and spread of mobile phone viruses are accelerating in the direction of traffic consumption, malicious deductions, and private information theft. In particular, with the rapid development of Android-based operating systems, mobile smart terminals using them have gradually become the main targets of hacker attacks. In the first half of 2015, there were 5.967 million new Android virus packages nationwide, a year-on-year increase of 17.41%; the number of infected users reached 140 million, an increase of 58% over the same period of the previous year; and the number of mobile payment users affected reached 11.455 million. How to securely outsource the storage of mobile Internet data to a third-party server while ensuring user privacy, and how to realize fine-grained access control of mobile Internet data, are urgent problems to be solved.

1.2. Research topics of the mobile Internet

1.2.1. Ontology-based information retrieval

Information retrieval is an important way for users and data to interact and has been studied by many researchers. Traditional approaches to information retrieval are based on keyword indexing and document retrieval. But data on the mobile Internet have four features: large scale, complicated models, diversified data formats, and inconsistent data sources. Hence, a recurring problem is that different keywords may express the same meaning, so search results can be inaccurate or incomplete.

To deal with this situation, [12] introduces ontology-based information retrieval. This model includes the four main processes of an IR system: indexing, querying, searching, and ranking. However, in contrast to traditional keyword-based IR models, the query in this approach is expressed in an ontology-based query language (SPARQL), and the external resources used for indexing and query processing consist of an ontology and its corresponding knowledge base (KB). The indexing process is equivalent to a semantic annotation process. Instead of creating an inverted index where keywords are associated with the documents in which they appear, in the ontology-based IR model the inverted index contains semantic entities (meanings) associated with the documents in which they appear. The relation or association between a semantic entity and a document is called an annotation. The overall retrieval process consists of the following steps:

  • The system takes as input a formal SPARQL query.

  • The SPARQL query is executed against a KB, returning a list of semantic entities that satisfy the query. This process is purely Boolean (i.e., based on an exact match), so that the returned instances must strictly hold all the conditions of the formal query.

  • The documents that are annotated (indexed) with the above instances are retrieved, ranked, and presented to the user. In contrast to the previous phase, the document retrieval phase is based on an approximate match, since the relation between a document and the concepts that annotate it has an inherent degree of fuzziness.

The steps listed above are described in more detail in the following subsections, from indexing to query processing, document retrieval, and ranking.

In this model, it is assumed that a KB has been built and associated with the information sources (the document base) by using one or several domain ontologies that describe concepts appearing in the document text. The concepts and instances in the KB are linked to the documents by means of explicit, non-embedded annotations of the documents. In this model, keywords appearing in a document are assigned weights reflecting the fact that some words are better at discriminating between documents than others. Similarly, annotations are assigned weights that reflect the discriminative power of instances with respect to documents. The weight d_x of an instance x for a document d is computed as

$d_x = \frac{\mathrm{freq}_{x,d}}{\max_y \mathrm{freq}_{y,d}} \cdot \log \frac{|D|}{n_x}$  (E1.1)

where freq_{x,d} is the number of occurrences in d of the keywords attached to x, and D is the set of all documents in the search space.
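To make the weighting concrete, here is a minimal Python sketch of (E1.1). The data structures (a map from instances to their attached keywords, token lists per document) and the interpretation of n_x as the number of documents containing a keyword of x are illustrative assumptions, not prescriptions from the chapter.

```python
import math

def annotation_weight(instance, doc_tokens, instance_keywords, all_docs):
    """Weight d_x of `instance` for one document, following (E1.1).

    instance_keywords: dict mapping an instance id to the set of keywords attached to it
    doc_tokens: list of tokens of the document d
    all_docs: dict mapping doc id -> list of tokens (the search space D)
    """
    # freq_{x,d}: occurrences in d of the keywords attached to x
    keywords = instance_keywords[instance]
    freq_x_d = sum(1 for tok in doc_tokens if tok in keywords)

    # max_y freq_{y,d}: the largest such frequency over all instances y
    max_freq = max(
        (sum(1 for tok in doc_tokens if tok in kws) for kws in instance_keywords.values()),
        default=0,
    )

    # n_x: taken here as the number of documents containing at least one keyword of x
    # (an assumption; the chapter does not spell out n_x explicitly)
    n_x = sum(1 for toks in all_docs.values() if any(t in keywords for t in toks))

    if freq_x_d == 0 or max_freq == 0 or n_x == 0:
        return 0.0
    return (freq_x_d / max_freq) * math.log(len(all_docs) / n_x)
```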

The query execution returns a set of tuples that satisfy the SPARQL query. The semantic entities are then extracted from those tuples, and the semantic index is accessed to collect all the documents in the repository that are annotated with these semantic entities. Once the list of documents is formed, the search engine computes a semantic similarity value between the query and each document, using an adaptation of the classic vector space IR model. Each document in the search space is represented as a document vector in which each element corresponds to a semantic entity. The value of an element is the weight of the annotation between the document and the semantic entity, if such an annotation exists, and zero otherwise. The query vector is generated by weighting the variables in the SELECT clause of the SPARQL query. For testing purposes, the weight of each variable of the query was set to 1, but in the original model, users are allowed to set this weight manually according to their interest. Once the vectors are constructed, the similarity measure between a document d and the query q is computed as

$\mathrm{sim}(d,q) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}|\,|\vec{q}|}$  (E1.2)

If the knowledge in the KB is incomplete (e.g., there are documents about travel offers in the knowledge source, but the corresponding instances are missing in the KB), the semantic ranking algorithm performs very poorly: SPARQL queries will return fewer results than expected, and the relevant documents will not be retrieved or will get a much lower similarity value than they should. As limited as it might be, keyword-based search will likely perform better in these cases. To cope with this, the ranking function combines the semantic similarity measure with the similarity measure of a keyword-based algorithm. The outputs of the two search engines are combined, and the final score is computed as

$\lambda \cdot \mathrm{sim}(d,q) + (1 - \lambda) \cdot \mathrm{ksim}(d,q)$  (E1.3)

where ksim is computed by a keyword-based algorithm and λ ∈ [0, 1].
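As a small illustration of (E1.2) and (E1.3), the following sketch computes the cosine similarity between sparse annotation-weight vectors and blends it with a keyword-based score. The dictionary representation and the example entity names are assumptions made for the demo, not part of the original model.

```python
import math

def cosine_sim(doc_vec, query_vec):
    """Semantic similarity sim(d, q) of (E1.2), with vectors given as
    dicts mapping semantic entity -> annotation weight."""
    common = set(doc_vec) & set(query_vec)
    dot = sum(doc_vec[e] * query_vec[e] for e in common)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

def combined_score(doc_vec, query_vec, ksim, lam=0.5):
    """Final ranking score of (E1.3); ksim comes from any keyword-based engine."""
    return lam * cosine_sim(doc_vec, query_vec) + (1 - lam) * ksim

# Example: one document annotated with two semantic entities, a query selecting both.
doc = {"ent:TravelOffer": 0.8, "ent:Paris": 0.3}
query = {"ent:TravelOffer": 1.0, "ent:Paris": 1.0}
print(combined_score(doc, query, ksim=0.4, lam=0.7))
```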

Many information retrieval models have been put forward to improve retrieval performance, and some combinations of these technologies have proven effective. But faced with the rich data in the MI, great improvements are still needed to make search more intelligent and efficient.

1.2.2. Deep learning

Machine learning is a field of computer science that gives computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) from data without being explicitly programmed. It learns from external input data through algorithmic analysis in order to obtain regularities for recognition and judgment. With its development, machine learning has been classified into shallow learning and deep learning, and it plays an important role in the field of artificial intelligence. Because of large-scale data, multiple models, and complex correlations, traditional shallow learning algorithms (such as the support vector machine [13]), although widely used, are still subject to certain restrictions when dealing with complex classifications [14].

Deep learning is a new field in the study of machine learning; its motivation is to establish and simulate neural networks that mimic the way the human brain analyzes and learns. It imitates the mechanisms of the human brain to interpret data such as images, sound, and text. As with other machine learning methods, deep learning methods include both supervised and unsupervised learning. Supervised learning algorithms experience a data set containing features, where each example is also associated with a label or target. For example, the iris data set is annotated with the species of each iris plant, and a supervised learning algorithm can study this data set and learn to classify iris plants into three different species based on their measurements. Unsupervised learning algorithms experience a data set containing many features and then learn useful properties of the structure of this data set. In the context of deep learning, we usually want to learn the entire probability distribution that generated a data set, whether explicitly, as in density estimation, or implicitly, for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the data set into clusters of similar examples. The learning models established under different learning frameworks are very different. For example, the convolutional neural network (CNN) is a deep learning model for the supervised setting, whereas the deep belief network (DBN) is a machine learning model for the unsupervised setting.
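As a concrete (shallow, non-deep) illustration of the supervised iris task mentioned above, the following sketch uses scikit-learn, assuming the library is installed; it is only meant to show the supervised learning workflow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the labeled iris data set: 150 samples, 4 measurements, 3 species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a supervised classifier and evaluate it on held-out data.
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```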

The concept of deep learning was proposed in 2006 by Hinton [15]. Based on the deep belief network (DBN), an algorithm was proposed to solve the optimization problem of deep structures, and the deep structure of the multilayer autoencoder was then proposed. In addition, the convolutional neural network proposed by LeCun et al. [16] was the first true multilayer structure learning algorithm, which uses spatial relative relationships to reduce the number of parameters and improve training performance. Deep learning is a representation learning method within machine learning. An observation (such as an image) can be represented in a variety of ways, for example as a vector of per-pixel intensity values, or more abstractly as a set of edges, regions of particular shape, etc. It is easier to learn a task from examples using certain specific representations (e.g., face recognition or facial expression recognition). The advantage of deep learning is that it replaces manual feature engineering with unsupervised or semi-supervised feature learning and hierarchical feature extraction.

Deep learning architectures such as deep neural networks, deep belief networks, and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, and drug design [13], where they have produced results comparable to and in some cases superior [14] to human experts.

The work in [17] first proposed a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR). This model is a pre-trained deep neural network-hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. Although the experiments show that CD-DNN-HMMs provide dramatic improvements in recognition accuracy, many issues remain to be resolved. First, although CD-DNN-HMM training is asymptotically quite scalable, in practice it is quite challenging to train CD-DNN-HMMs on tens of thousands of hours of data. Second, highly effective speaker and environment adaptation algorithms for DNN-HMMs must be found, ideally ones that are completely unsupervised and integrated with the pre-training phase. Third, the training in that study used the embedded Viterbi algorithm, which is not optimal. In addition, the study regards the treatment of the time dimension of speech by DNN-HMMs and GMM-HMMs alike as a very crude way of dealing with the intricate temporal properties of speech. Finally, although Gaussian RBMs can learn an initial distributed representation of their input, they still produce a diagonal-covariance Gaussian for the conditional distribution over the input space given the latent state (as diagonal-covariance GMMs also do). Ji et al. [18] developed a 3D CNN model for action recognition. This model constructs features from both spatial and temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel, and the final feature representation is computed by combining information from all channels. Evaluated on the TRECVID and KTH data sets, the results show that the 3D CNN model outperforms the compared methods on the TRECVID data and achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments. Furthermore, Cho et al. [19] proposed a novel neural network model called the RNN encoder-decoder that is able to learn the mapping from a sequence of arbitrary length to another sequence, possibly from a different set, of arbitrary length. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN encoder-decoder as an additional feature in the existing log-linear model. One approach that was not investigated, however, is to replace the whole or a part of the phrase table by letting the RNN encoder-decoder propose target phrases. A multi-context deep learning framework for saliency detection is proposed in [20]. Different pre-training strategies are investigated to learn the deep model for saliency detection, and a task-specific pre-training scheme for the presented multi-context deep model is designed. Tested against contemporary deep models from the ImageNet image classification challenge, their effectiveness in saliency detection is investigated.

Although deep learning has shown great success in various applications such as object recognition [21] and speech recognition [17], especially language modeling [22], paraphrase detection [23], and word embedding extraction [24], its application in the MI is still worth investigating.

1.2.3. Outsourced attribute-based encryption system

Attribute-based encryption (ABE) is a promising cryptographic primitive that has recently been widely applied to the design of fine-grained access control systems. Data access control has been evolving for the past 30 years, and various techniques have been developed to effectively implement fine-grained access control [25], which allows flexibility in specifying the differential access rights of individual users. However, traditional access control systems are mostly designed for in-house services and depend greatly on the system itself to enforce authorization policies. Thus, they cannot be applied in cloud computing, because users and cloud servers are no longer in the same trusted domain. To help the data owner impose access control over data stored on untrusted cloud servers, a feasible approach is to encrypt the data through certain cryptographic primitives and disclose the decryption keys only to authorized users. One critical issue with this branch of approaches is how to achieve the desired security goals without introducing high complexity in key management and data encryption. Existing works resolve this issue either by introducing a per-file access control list (ACL) for fine-grained access control or by categorizing files into several file groups for efficiency. As the system scales, however, the ACL-based scheme introduces an extremely high complexity that can be proportional to the number of system users, while the file-group-based scheme can only provide coarse-grained access control.

To provide fine-grained access control over encrypted data, a novel public-key primitive, attribute-based encryption (ABE) [26], was introduced in the cryptographic community, which enables public-key-based one-to-many encryption. In an ABE system, users' keys and ciphertexts are labeled with sets of descriptive attributes and access policies, respectively, and a particular key can decrypt a ciphertext only if the associated attributes and policy match (a small illustrative sketch of this matching condition follows the list below). Though ABE is a promising primitive for designing fine-grained access control systems in cloud computing, several challenges remain in its application.

  • One of the main drawbacks of ABE is that the computational cost in decryption phase grows with the number of attributes specified in the access policy. The drawback appears more serious for resource-constrained users such as mobile devices and sensors. Therefore, one challenge is how to reduce the decryption complexity of ABE such that it can be applied to fine-grained access control for users with resource-constrained devices.

  • Beyond decryption, generating users’ private key in existing ABE schemes also requires a great quantity of modular exponentiations. Furthermore, the revocation of any single user in existing ABE requires key update at authority for remaining users who share his/her attributes. All of these heavy tasks centralized at the authority side would make it become the efficiency bottleneck in the whole access control system. Therefore, another challenge is how to reduce the key-issuing complexity of ABE such that scalable access control can be supported.
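To give a concrete feel for the “attributes match the access policy” condition mentioned above, independent of any particular ABE construction and of the underlying cryptography, the following hypothetical sketch evaluates a simple AND/OR policy tree against a user's attribute set; it models only the access-structure check, not its cryptographic enforcement.

```python
# A policy is either an attribute name (leaf) or a tuple ("AND"|"OR", [subpolicies]).
def satisfies(attributes, policy):
    if isinstance(policy, str):                 # leaf: a single required attribute
        return policy in attributes
    gate, children = policy
    results = (satisfies(attributes, c) for c in children)
    return all(results) if gate == "AND" else any(results)

policy = ("AND", ["doctor", ("OR", ["cardiology", "emergency"])])
print(satisfies({"doctor", "cardiology"}, policy))   # True
print(satisfies({"nurse", "cardiology"}, policy))    # False
```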

To improve the practicality of attribute-based encryption, researchers began to introduce outsourced computing into ABE systems. In 2011, Green et al. [27] proposed a scheme that outsources the decryption operation in an attribute-based cryptosystem to a third party in order to improve the client's efficiency and enable the cryptosystem to run on devices with limited computational resources such as smart cards; the ciphertext length in this scheme increases linearly with the number of attributes in the access policy. In [27], the third party uses a conversion key sent by the user to convert the ABE ciphertext, whose decryption is expensive, into an ElGamal-encrypted ciphertext with a much lower decryption cost and transmits the converted ciphertext back to the user, who can then complete the decryption locally with his own decryption key. Li et al. [28] reduced the client's computational complexity in attribute-based cryptosystems from the perspective of outsourcing cryptographic operations: after outsourcing, the user only needs to perform a constant number of exponentiations for encryption, regardless of the number of attributes, and can encrypt under any access policy. Li et al. [29] proposed an attribute-based encryption scheme that supports outsourced key generation and outsourced decryption and, on this basis, a fine-grained access control system. Lai et al. [30] pointed out that the scheme in [27] does not allow the user to verify the results returned by the third party and that its security relies on the random oracle model. On the basis of [27], the work in [30] proposes a verifiable outsourced attribute-based encryption scheme: by adding redundant information, users can verify the conversion results returned by third parties to ensure the correctness of conversions, and the security of the new scheme is proven in the standard model. Li et al. [31] proposed, on the basis of [27, 28, 31] and so on, an attribute-based encryption scheme that reduces both the attribute authority's and the user's computing burden and also supports user verification: by outsourcing part of the key generation and decryption operations to different service providers, the local computing burden of the attribute authority and the user is greatly reduced. Lin et al. [32] pointed out that the method of adding redundant information in [30] achieves verification of the returned result but also increases the computational and communication load of the system; they proposed an outsourced attribute-based encryption scheme that uses a hybrid encryption system and a commitment mechanism to achieve user verification. The ciphertext length and the cost of cryptographic operations in this scheme are only half those of the scheme in [30], and its security is proven in the standard model. Ma et al. [33] presented the concept of proof of innocence for the first time, so that a user who holds the private key cannot falsely report cloud server operation errors, and proposed an attribute-based encryption scheme that supports both outsourced encryption and decryption with verifiability and proofs of innocence. Mao et al. [34] proposed an attribute-based encryption scheme with verifiable outsourced decryption; experiments show that this scheme is more concise and less computationally demanding than the scheme proposed in [30], but its cost is still linearly related to the number of attributes. Xu et al. [35] proposed a circuit ciphertext-policy attribute-based hybrid encryption scheme to ensure data integrity in cloud computing environments and to implement fine-grained access control and a verifiable delegation mechanism.

From existing work it can be seen that currently available secure outsourced attribute-based encryption schemes cannot achieve a fixed ciphertext length, and although these solutions improve computational efficiency, they also increase the communication load of mobile devices. Achieving fine-grained access control of data in the MI and promoting data sharing in the MI therefore remain worth investigating.

1.2.4. Proxy re-encryption system

Since cloud users do not fully trust cloud computing service providers (CSPs), they do not want any unauthorized personnel (including cloud service providers) to view their data. This cannot rely on administrative constraints or ethical rules alone; only technical means can ensure that cloud service providers and other unauthorized personnel cannot obtain any of the data uploaded to the cloud by the customer and that any unauthorized changes to the customer's data in the cloud will be detected by the customer. Therefore, user data usually need to be encrypted on the client before being transmitted to the cloud. However, in the cloud computing environment, there are many data sharing scenarios. If decryption is performed in the cloud during data sharing, there is a risk of leaking user data, which is unacceptable because the cloud computing service provider is not fully trusted by the user. On the other hand, requiring the user to re-encrypt the file for each recipient is very inconvenient. A proxy re-encryption (PRE) scheme for distributed data storage can solve this problem well.

Proxy re-encryption (PRE), initially introduced by Blaze et al. [36], enables a semi-trusted proxy to transform a ciphertext encrypted under the public key of a delegator into another ciphertext under the public key of a delegatee without leaking the underlying messages or the private keys of the delegator/delegatee to the proxy. This special kind of public-key encryption seems to be an ideal candidate for securing shared data in cloud computing. Suppose the data owner (say, Alice) intends to share sensitive data stored in the cloud with another granted user (say, Bob), and it is desirable that the requested data can be accessed by nobody other than Bob. Inspired by the PRE primitive, Alice can encrypt the sensitive data under her own public key before uploading them to the semi-trusted cloud. After receiving Bob's request for data sharing, Alice generates a proxy re-encryption key from her own private key and Bob's public key and sends this re-encryption key to the semi-trusted cloud server. Equipped with this re-encryption key, the cloud server can transform the ciphertext encrypted under Alice's public key into an encryption under Bob's public key. By the properties of PRE, the transformed ciphertext can only be decrypted by Bob, whereas the cloud server learns neither the plaintext nor the private keys of Alice or Bob. Finally, Bob can download and decrypt the requested data with his own private key. In this way, the costly burden of secure data sharing is offloaded to the semi-trusted cloud server with its abundant resources.
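As a minimal illustration of the PRE idea, the sketch below implements the classic ElGamal-style bidirectional scheme of Blaze et al. [36] over a toy group. The parameters are far too small to be secure and the encoding of the message as a group element is a simplification; the sketch only shows how a proxy can turn Alice's ciphertext into one for Bob without learning the plaintext (note that, unlike the unidirectional workflow described above, this particular scheme is bidirectional). Python 3.8+ is assumed for the built-in modular inverse.

```python
import secrets

# Toy group parameters: p = 2q + 1 is a safe prime and g generates the order-q
# subgroup of Z_p*. Real deployments need groups of 2048 bits or more.
q = 1019
p = 2 * q + 1          # 2039
g = 4                  # 4 = 2^2 has order q modulo p

def keygen():
    sk = secrets.randbelow(q - 1) + 1      # secret exponent in [1, q-1]
    return sk, pow(g, sk, p)               # (sk, pk = g^sk mod p)

def encrypt(pk, m):
    """Encrypt a group element m under pk: c = (m * g^k, pk^k)."""
    k = secrets.randbelow(q - 1) + 1
    return (m * pow(g, k, p) % p, pow(pk, k, p))

def decrypt(sk, c):
    c1, c2 = c
    gk = pow(c2, pow(sk, -1, q), p)        # recover g^k = c2^(1/sk)
    return c1 * pow(gk, -1, p) % p

def rekey(sk_alice, sk_bob):
    """Bidirectional re-encryption key rk = sk_bob / sk_alice mod q."""
    return sk_bob * pow(sk_alice, -1, q) % q

def reencrypt(rk, c):
    """Proxy transform: (m * g^k, g^{a k}) -> (m * g^k, g^{b k})."""
    c1, c2 = c
    return (c1, pow(c2, rk, p))

# Alice shares a message (encoded as a group element) with Bob via the proxy.
sk_a, pk_a = keygen()
sk_b, pk_b = keygen()
message = pow(g, 42, p)                    # toy encoding of the plaintext
ct_alice = encrypt(pk_a, message)
ct_bob = reencrypt(rekey(sk_a, sk_b), ct_alice)
assert decrypt(sk_b, ct_bob) == message
```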

In 2005, Ateniese et al. [37] proposed the first unidirectional PRE scheme. Similar to the scheme of Blaze et al. [36], both of the schemes are secure only against chosen-plaintext attacks (CPA). In 2007, Canetti et al. [38] designed a bidirectional PRE scheme with chosen-ciphertext security. In 2008, Libert et al. [39] introduced a replayable chosen-ciphertext secure (RCCA) unidirectional PRE scheme. Since then, various PRE schemes have been proposed in the literature (e.g., [40, 41, 42, 43, 44]).

PRE can be extended in the context of identity-based encryption. In 2007, Green and Ateniese [45] proposed the first identity-based proxy re-encryption (IBPRE) scheme, which is CCA secure in the random oracle model, where hash functions are assumed to be fully random. Chu and Tzeng [30] constructed a CCA secure IBPRE scheme in the standard model. After that, many identity-based proxy re-encryption (IBPRE) schemes have been proposed, such as [47, 48, 49, 50, 51, 52, 53].

However, in all of the aforementioned schemes, the semi-trusted proxy can use a given re-encryption key to transform all the ciphertexts of a delegator into those of a delegatee. In reality, the delegator may not want all of his data to be transformable for the delegatee. Therefore, type-based PRE [54] and conditional PRE (CPRE) [55, 56] were proposed, in which the proxy can only perform ciphertext conversion conditionally. Later, Liang et al. [57, 58] proposed two IBCPRE schemes claimed to be CCA secure in the standard model. However, He et al. [59] presented a security analysis showing that these schemes only achieve CPA security. In 2016, He et al. [60] proposed an efficient identity-based conditional proxy re-encryption (IBCPRE) scheme that is CCA secure in the random oracle model.

PRE can also be extended to the attribute-based setting. Attribute-based proxy re-encryption (ABPRE) can effectively increase the flexibility of data sharing. In 2009, Liang et al. [61] first defined the notion of ciphertext-policy ABPRE (CP-ABPRE), where each ciphertext is labeled with a set of descriptive conditions and each re-encryption key is associated with an access tree that specifies which type of ciphertexts the proxy can re-encrypt, and they presented a concrete scheme supporting AND gates with positive and negative attributes. After that, several CP-ABPRE schemes (e.g., [62]) with more expressive access policies were proposed. In 2011, Fang et al. [63] proposed a key-policy ABPRE (KP-ABPRE) scheme in the random oracle model, whereby a ciphertext encrypted under conditions W can be re-encrypted by the proxy using the CPRE key under the access structure T if and only if T(W) = 1. More recent ABPRE systems can be found in [64, 65, 66].

In 2016, Lee et al. [67] proposed a searchable hierarchical CPRE (HCPRE) scheme for cloud storage services, in which the cloud service provider is able to generate a hierarchical key; however, the re-encryption key generation algorithm also requires the private keys of the delegator and the delegatee.

So far, proxy re-encryption has been extensively investigated. However, concrete proxy re-encryption schemes with different properties still need to be proposed and applied to the special environment of the MI.

1.3. Organization

The core material of this book consists of the following:

  • Chapters 1 and 2, discussing the importance of outsourcing and privacy protection for data service in the mobile Internet and the basics of data service outsourcing and privacy protection in the mobile Internet

  • Chapters 3–5, introducing information retrieval, classical machine learning, and deep learning

  • Chapters 6 and 7, introducing attribute-based encryption for flexible and fine-grained access control and motivating proxy re-encryption for secure access delegation

2. Foundations

The purpose of this chapter is to offer a brief review of all the cryptographic concepts needed in the following chapters. We start with the mathematical background and the associated hard problems. The chapter then recalls some facts regarding classical symmetric cryptography and public-key cryptosystems. Finally, a concise description of provable security is presented in the rest of this chapter. We remark that this introduction is by no means exhaustive; for instance, the approaches used to speed up pairing computations or to select suitable parameters of elliptic curves are beyond the scope of this book.

2.1. Mathematical concepts and properties

2.1.1. Concepts from number theory

2.1.1.1. Primes and divisibility

Let Z be the set of integers. We say that a divides b (denoted as a | b) if there exists an integer c satisfying ac = b, for a, b, c ∈ Z. If a does not divide b, we write a ∤ b. Despite the fact that this definition makes sense even when one or more of these integers are negative or zero, we are only interested in the case when a, b, c are all positive. A direct observation is that if a | b and a | c, then a | (xb + yc) for any x, y ∈ Z.

If a | b and a is positive, a is regarded as a divisor or factor of b. Furthermore, a is also called a nontrivial factor of b in case a ∉ {1, b}. A positive integer p (> 1) is considered prime if it has no nontrivial factors, i.e., it has only two divisors: 1 and p itself. A positive integer greater than 1 is called composite if it is not a prime number. By convention, "1" is neither prime nor composite.

According to the fundamental theorem of arithmetic, every integer greater than 1 can be expressed uniquely as a product of primes. Namely, any positive integer n > 1 can be written as n = ∏_i p_i^{e_i} such that the {p_i} are distinct primes and e_i ≥ 1 for all i; furthermore, the {p_i} and {e_i} are uniquely determined up to ordering.

Proposition 2.1

Let a and b be two integers with b > 0. Then there exist unique integers q, r satisfying a = qb + r and 0 ≤ r < b.

c is known as the greatest common divisor of two nonnegative integers a and b (written as gcd(a, b)) if c is the largest integer satisfying c | a and c | b. Note that gcd(0, 0) is not defined, whereas gcd(b, 0) = gcd(0, b) = b. The notion of greatest common divisor also makes sense when either or both of a, b are negative, but we will never need this; therefore, when we write gcd(a, b), we always assume that a, b ≥ 0. Note that if p is prime, then gcd(a, p) is either equal to 1 or p. a and b are called relatively prime if gcd(a, b) = 1.

Proposition 2.2

Let a, b be two positive integers. Then there exist x, y ∈ Z such that ax + by = gcd(a, b). Furthermore, gcd(a, b) is the smallest positive integer that can be expressed in this manner.

Proof. Let I denote the set {ax + by}, where x, y range over Z. Note that I certainly contains some positive integers, since at least a, b ∈ I (a = a × 1 + b × 0 and b = a × 0 + b × 1). Let d denote the smallest positive integer in I. We claim that d = gcd(a, b). Since d ∈ I, d can be represented as d = ax + by for some x, y ∈ Z. To show that d = gcd(a, b), we must show that d | a and d | b and that d is the largest integer with this property. In fact, we can show that d divides every element in I. To prove this, take an arbitrary c ∈ I such that c = ax′ + by′ with x′, y′ ∈ Z. According to Proposition 2.1, we have c = qd + r with q and r integers and 0 ≤ r < d. Then

$r = c - qd = (ax' + by') - q(ax + by) = (x' - qx)a + (y' - qy)b \in I.$  (E2.1)

If r ≠ 0, then r would be a positive integer in I smaller than d, contradicting the choice of d as the smallest positive integer in I. Thus r = 0, and hence d | c.

Since both a and b are elements of I, the above shows that d | a and d | b. Suppose there exists d′ > d such that d′ | a and d′ | b. Then d′ | (ax + by); since the latter is equal to d, this means d′ | d, which is impossible if d′ is larger than d. It is concluded that d is the largest integer dividing both a and b, and hence d = gcd(a, b).

Given a and b, the Euclidean algorithm can be used to compute gcd(a, b) in polynomial time. The extended Euclidean algorithm can be used to compute the coefficients x, y (as in Proposition 2.2) in polynomial time as well. Interested readers can refer to [68] for more details.
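For concreteness, a short Python version of the extended Euclidean algorithm is given below; it is the standard textbook iteration, not code taken from this book.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with g = gcd(a, b) and a*x + b*y = g."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r      # Euclidean step on the remainders
        old_x, x = x, old_x - q * x      # carry the Bezout coefficients along
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y

g, x, y = extended_gcd(240, 46)
print(g, x, y)          # 2 -9 47, and indeed 240*(-9) + 46*47 == 2
```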

Proposition 2.3

If c | ab and gcd(a, c) = 1, then c | b. In particular, if p is prime and p | ab, then either p | a or p | b.

Proof. It is easy to observe that c × α = ab for some integer α, since c | ab. Since gcd(a, c) = 1, according to the previous proposition there exist integers x, y such that 1 = ax + cy. Multiplying both sides by b, we obtain

$b = axb + cyb = abx + cyb = c\alpha x + cyb = c(\alpha x + yb).$  (E2.2)

Since (αx + yb) is an integer, it follows that c | b.

The second part of this proposition follows from the fact that if p ∤ a, then gcd(a, p) = 1.

Proposition 2.4

If p | N, q | N, and gcd(p, q) = 1, then pq | N.

Proof. From p | N and q | N we can write pα = N and qβ = N, and from Proposition 2.2 we obtain 1 = px + qy, where α, β, x, y are all integers. Multiplying both sides of the last equation by N, we obtain N = pxN + qyN = pxqβ + qypα = pq(xβ + yα). Thus, pq | N.

2.1.1.2. Modular arithmetic

Let a, b, N be integers with N > 1. Let [a mod N] denote the remainder of a ∈ Z upon division by N. In more detail, according to Proposition 2.1, there exist unique q and r such that a = qN + r and 0 ≤ r < N, and [a mod N] is defined as this remainder r. Namely, 0 ≤ [a mod N] < N. The process of mapping a to [a mod N] is called reduction modulo N.

a and b are regarded as congruent modulo N (written as a = b mod N) if [a mod N] = [b mod N], that is to say, the remainder when a is divided by N is equal to the remainder when b is divided by N. In this manner, a = b mod N if and only if N | (a − b). Note that a = [b mod N] implies a = b mod N, but not vice versa. For example, 36 = 18 mod 18, but 36 ≠ [18 mod 18] = 0.

We remark that congruence modulo N is an equivalence relation: it is reflexive (a = a mod N for all a), symmetric (a = b mod N implies b = a mod N), and transitive (if a = b mod N and b = c mod N, then a = c mod N). Congruence modulo N also obeys the standard rules of arithmetic with respect to addition, subtraction, and multiplication; so if a = a′ mod N and b = b′ mod N, then (a + b) = (a′ + b′) mod N and ab = a′b′ mod N. Thus, calculations modulo N can be simplified by "reducing and then adding/multiplying" instead of "adding/multiplying and then reducing."

Example 1. Let us compute [3193015 × 590702 mod 100]. Since 3193015 = 15 mod 100 and 590702 = 2 mod 100, we have

3193015 × 590702 = ([3193015 mod 100] × [590702 mod 100]) mod 100 = 15 × 2 = 30 mod 100.

The alternative approach of deriving the answer (namely, computing the product 3193015 × 590702 and then reducing the result modulo 100) is much more expensive.

In contrast to addition, subtraction, and multiplication, division has not yet been given a meaning in congruence modulo N. That is to say, if a = a′ mod N and b = b′ mod N, then it is not necessarily true that a/b = a′/b′ mod N; in fact, the expression "a/b mod N" is not, in general, well defined. As a specific example related to division, ab = cb mod N does not necessarily imply that a = c mod N.

Example 2. Take N = 12. Then 3 × 3 = 9 = 3 × 7 mod 12, but 3 ≠ 7 mod 12.

However, a meaningful notion of division can be defined for congruence modulo N. If for a given integer a there exists an integer a⁻¹ such that a × a⁻¹ = 1 mod N, then a⁻¹ is called a (multiplicative) inverse of a modulo N, and a is said to be invertible modulo N. It is easy to observe that if α is a multiplicative inverse of a modulo N, then so is [α mod N]; furthermore, if α′ is another multiplicative inverse, then [α mod N] = [α′ mod N]. When a is invertible, we may therefore simply let a⁻¹ denote the unique multiplicative inverse of a that lies in the range {0, …, N − 1}.

In this way, division by a modulo N is defined as multiplication by a⁻¹ modulo N when a is invertible modulo N (i.e., c/a is defined as c × a⁻¹ mod N). We stress that division by a is only defined when a is invertible. If c × a = b × a mod N and a is invertible, then we may divide each side of the equation by a (or, equivalently, multiply each side by a⁻¹) to obtain

$(c \times a) \times a^{-1} = (b \times a) \times a^{-1} \bmod N \;\Rightarrow\; c = b \bmod N.$  (E2.3)

We see that in this case division works "as expected" when restricted to invertible integers. A natural question is which integers are invertible modulo a given modulus N. To answer this question fully, Proposition 2.2 is used in the following proposition:

Proposition 2.5

Let a, N be integers with N > 1. Then a is invertible modulo N if and only if gcd(a, N) = 1.

Proof. Assume a is invertible modulo N, and let b denote its inverse. Note that a ≠ 0, since 0 × b = 0 mod N regardless of the value of b. Since a × b = 1 mod N, the definition of congruence modulo N implies that a × b − 1 = α × N for some α ∈ Z. Equivalently, b × a − α × N = 1. According to Proposition 2.2, this implies gcd(a, N) = 1.

Conversely, if gcd(a, N) = 1, then according to Proposition 2.2 there exist integers x, y such that ax + Ny = 1. Reducing both sides of this equation modulo N gives ax = 1 mod N, and it is easy to see that x is a multiplicative inverse of a.

Example 3. Let N = 13 and a = 10. Then 10 × 4 + (−3) × 13 = 1, and so 4 = [4 mod 13] is the inverse of 10. One can verify that 10 × 4 = 1 mod 13.
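A corresponding sketch that tests invertibility with the gcd and returns the inverse (using Python's built-in modular inverse, available from Python 3.8) is shown below; it is purely illustrative.

```python
from math import gcd

def mod_inverse(a, N):
    """Return a^(-1) mod N if it exists (i.e., gcd(a, N) = 1), else raise."""
    if gcd(a, N) != 1:
        raise ValueError(f"{a} is not invertible modulo {N}")
    return pow(a, -1, N)   # Python 3.8+: built-in modular inverse

print(mod_inverse(10, 13))                # 4, matching Example 3
print((10 * mod_inverse(10, 13)) % 13)    # 1
```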

2.1.2. Concepts from abstract algebra

2.1.2.1. Group theory

Let G be a set. A binary operation ∘ defined over the set G (written as g ∘ h for g, h ∈ G) is simply a function that takes as input two elements of G and outputs another element of the set G. The formal definition of a group is as follows:

Definition 2.1. A group is a set G paired with a binary operation ∘ such that:

Closure law: For all g, h ∈ G, the result g ∘ h remains in the set G.

Identity law: There exists an identity element e ∈ G such that for all g ∈ G, e ∘ g = g = g ∘ e.

Inverse law: For all g ∈ G, there exists an element h ∈ G such that g ∘ h = e = h ∘ g. Such an h is called an inverse element of g.

Associative law: For all g1,g2,g3G, (g1g2)g3=g1(g2g3).

In this way, the set G along with the binary operation ∘ is defined as a group. When the set G has a finite number of elements, (G, ∘) is viewed as a finite group, and |G| denotes the order of the group, that is, the number of elements in G.

If g ∘ h = h ∘ g for all g, h ∈ G, this group is called an abelian (or commutative) group. In this book, we will always deal with finite, abelian (commutative) groups.

If (G, ∘) is a group, a set H ⊆ G paired with the operation ∘ is called a subgroup of (G, ∘) if the set H itself forms a group under the operation ∘. To check that (H, ∘) is a subgroup, we need to verify closure, existence of identity and inverses, and associativity according to Definition 2.1. (In fact, associativity is inherited automatically from the group (G, ∘).) Every group (G, ∘) always has the trivial subgroups (G, ∘) and ({e}, ∘). (H, ∘) is called a strict subgroup of (G, ∘) if H ≠ G.

Associativity implies that a long expression g1 ∘ g2 ∘ ⋯ ∘ gn is unambiguous without parentheses, since it does not matter in what order we perform the operation ∘.

It is easy to see that the identity element in a group (G, ∘) is unique, and thus we can refer to the identity element of a group. One can also show that each element g of a group has a unique inverse element.

In general, the abstract notation ∘ will not be used to denote the group operation directly. Either the additive notation or the multiplicative notation will be used instead, depending on the group involved. When additive notation is adopted, the group operation applied to two elements g, h of the group is denoted g + h; the identity element is denoted “0,” and the inverse element of g is denoted −g. When multiplicative notation is employed, the group operation applied to g, h of the group is denoted g × h or simply gh; the identity element is denoted “1,” and the inverse element of g is denoted g^{-1}. As in the case of multiplication modulo N, we also define division by g as multiplication by g^{-1} (i.e., in that setting [(h/g) mod N] is defined as [(h × g^{-1}) mod N]). Finally, we remark that this does not imply that the group operation corresponds to integer addition or multiplication. Instead, this merely serves as useful general notation.

Example 4. A set may form a group under one operation but not another. For example, the set of integers Z forms an abelian group under addition: the identity element is “0,” and every integer g has the inverse element −g. On the other hand, it does not form a group under multiplication since, for example, the multiplicative inverse of the integer “0” does not exist.

Example 5. The set of complex numbers C is not a group under multiplication, since “0” does not have a multiplicative inverse. The set of nonzero complex numbers, however, is an abelian group under multiplication with identity “1.”

Example 6. Let N ≥ 2 be an integer. The set {0, …, N − 1} with respect to addition modulo N is an abelian group of order N: closure is obvious; associativity and commutativity follow from the fact that the integers satisfy these properties; the identity element is 0; and, since a + (N − a) = 0 mod N, the inverse element of any element a is [(N − a) mod N]. We denote this group by (Z_N, +). (Occasionally, the notation Z_N will also be used to denote the set {0, …, N − 1} without reference to any particular operation.)

The “cancelation law” for groups is demonstrated in the following lemma.

Lemma 2.1 Let (G, ×) be a group and a, b, c ∈ G. If a × c = b × c, then a = b. In particular, if a × c = c, then a is the identity element in G.

Proof. We know a × c = b × c. Multiplying both sides by the unique inverse element c^{-1} of the element c, we obtain a = b. In detail,

a = a × 1 = a × (c × c^{-1}) = (a × c) × c^{-1} = (b × c) × c^{-1} = b × (c × c^{-1}) = b × 1 = b. (E2.4)

Group exponentiation: It is often useful to describe the group operation applied n times to a fixed element g, where n is a positive integer. With additive notation, we write this as n·g or ng, that is,

n·g = ng =def g + ⋯ + g (n times). (E2.5)

Note that n is an integer, while g is a group element. So n·g does not represent the group operation applied to n and g (indeed, we are working in a group where the group operation is written additively). Nevertheless, the notation “behaves as it should”; for example, if g ∈ G and n, n′ are integers, then (n·g) + (n′·g) = (n + n′)·g, n·(n′·g) = (nn′)·g, and 1·g = g. In an abelian group G with g, h ∈ G, (n·g) + (n·h) = n·(g + h).

With multiplicative notation, we write the group operation applied n times to an element g as g^n. That is,

g^n =def g × ⋯ × g (n times). (E2.6)

The familiar rules of exponentiation follow: g^n × g^{n′} = g^{n+n′}, (g^n)^{n′} = g^{nn′}, and g^1 = g. Also, if G is a commutative group and g, h ∈ G, then g^n × h^n = (gh)^n.

The above notation extends naturally to the case when n is zero or a negative integer. Generally, we leave g^r undefined if r is not an integer. With additive notation, we have 0·g =def 0 and (−n)·g =def n·(−g) for a positive integer n. (Note that in the equation “0·g = 0,” the “0” on the left-hand side is the integer 0, while the “0” on the right-hand side is the identity element in the group.) As one would expect, it can be shown that (−n)·g = −(n·g). With multiplicative notation, g^0 =def 1 and g^{-n} =def (g^{-1})^n. Again, as expected, one can show that g^{-n} = (g^n)^{-1}.

Theorem 2.1

If (G, ×) is a finite group with order m = |G|, then for any element g ∈ G, g^m = 1.

Proof. We prove the theorem only for commutative groups (G, ×) (although it holds for any finite group). Fix an arbitrary g ∈ G, and let g1, …, gm be the elements of G. We claim that

g1 × g2 × ⋯ × gm = (g × g1) × (g × g2) × ⋯ × (g × gm). (E2.7)

Note that g × gi = g × gj implies gi = gj by the cancelation law shown in Lemma 2.1. Thus the m elements in parentheses on the right-hand side of the displayed equation are pairwise distinct. Since there are exactly m elements in G, the m elements being multiplied together on the right-hand side are simply all the elements of G in some permuted order. The order in which all elements of the group are multiplied does not matter because the group (G, ×) is commutative, and thus the right-hand side equals the left-hand side.

Because (G, ×) is commutative, all occurrences of the element g can be pulled out, and we obtain

g1 × g2 × ⋯ × gm = (g × g1) × (g × g2) × ⋯ × (g × gm) = g^m × (g1 × g2 × ⋯ × gm). (E2.8)

Once again by the cancelation law in the group, it follows that g^m = 1.

Corollary 2.1

If (G, ×) is a finite group with order m = |G| > 1, then for any element g ∈ G and any integer i, g^i = g^{[i mod m]}.

Proof. According to Proposition 2.1, i can be expressed as qm + r, where q, r are integers and r = [i mod m]. By Theorem 2.1,

g^i = g^{qm+r} = (g^m)^q × g^r = 1^q × g^r = g^r (E2.9)

holds.

Example 7. With additive notation, the above corollary means that if g is an element in a group of order m, then i·g = [i mod m]·g. As an example, consider the group Z_25 of order m = 25, and take g = 13. The corollary can be instantiated as

303·13 = [303 mod 25]·13 = 3·13 = 13 + 13 + 13 = 39 = 14 mod 25. (E2.10)
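The following short Python check (ours, for illustration only) reproduces this instance of the corollary in the additive group Z_25:

```python
# Applying the group operation 303 times to g = 13 in Z_25 equals
# applying it [303 mod 25] = 3 times.
m, g, i = 25, 13, 303
print((i * g) % m, ((i % m) * g) % m)  # both print 14
```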

Corollary 2.2

Let (G, ×) be a finite group with order m = |G| > 1. Let e > 0 be an integer, and define the function f_e : G → G by f_e(g) = g^e. If gcd(e, m) = 1, then f_e is a permutation. Furthermore, if d = [e^{-1} mod m], then f_d is the inverse of f_e.

Proof. According to Proposition 2.5, e is invertible modulo m because gcd(e, m) = 1. To show that f_d is the inverse of f_e, note that for any g ∈ G, we have

f_d(f_e(g)) = f_d(g^e) = (g^e)^d = g^{ed} = g^{[ed mod m]} = g^1 = g. (E2.11)
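A small Python sketch (ours) illustrates Corollary 2.2 in the multiplicative group Z_7* of order m = 6, where e = 5 satisfies gcd(e, m) = 1; it assumes Python 3.8+ for pow with a negative exponent:

```python
m, e = 6, 5
d = pow(e, -1, m)                    # d = e^{-1} mod m = 5
group = [1, 2, 3, 4, 5, 6]           # Z_7^*
images = [pow(x, e, 7) for x in group]
back = [pow(y, d, 7) for y in images]
print(sorted(images) == group)       # True: f_e is a permutation
print(back == group)                 # True: f_d inverts f_e
```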

2.1.2.2. Group (ZN*,×)

As mentioned before, the set Z_N = {0, …, N − 1} paired with addition modulo N is a group. A natural question is whether the set {0, …, N − 1} with the multiplication operation modulo N can be viewed as a group as well. It is obvious that “1” serves as the identity element for multiplication modulo N. However, not every element in this set is invertible with respect to multiplication modulo N; for example, the multiplicative inverse of “0” does not exist. What makes matters worse, if N = 8, then the elements “2,” “4,” and “6” are not invertible either, as can be checked by exhaustively trying every possible element in {0, …, 7}. Thus, it is necessary to identify which elements in {0, …, N − 1} are invertible modulo N. By Proposition 2.5, the element a ∈ {0, …, N − 1} is invertible if and only if gcd(a, N) = 1. It is also easy to observe that the inverse of a resides in the range {0, …, N − 1}. This motivates the definition of the following set for N > 1:

Z_N* =def {a ∈ {1, …, N − 1} | gcd(a, N) = 1}. (E2.12)

In other words, Z_N* consists of the integers in the set {1, …, N − 1} that are relatively prime to N.

Based on the discussion above, the identity element and the inverse of each element can be found in Z_N*. Furthermore, commutativity and associativity are inherited from Z_N directly. To establish closure, let a, b ∈ Z_N* and c = [a × b mod N], and assume toward a contradiction that c ∉ Z_N*. This implies gcd(c, N) ≠ 1, so there exists a prime p such that p | N and p | c. From c = [a × b mod N], we see that a × b = qN + c for some integer q and thus p | a × b. By Proposition 2.3, we obtain p | a or p | b, contradicting either gcd(a, N) = 1 or gcd(b, N) = 1 (recall that a, b ∈ Z_N*). In short, the set Z_N* with respect to multiplication modulo N forms a group.

Proposition 2.6

Z_N* is a commutative group under multiplication modulo N whenever N > 1 is an integer.

Let ϕ(N) denote the order of the group (Z_N*, ×), i.e., ϕ(N) =def |Z_N*|; ϕ is called the Euler phi function. To compute the value of ϕ(N), we first consider the case when N is prime, i.e., N = p. In this case, all elements in {1, …, p − 1} are relatively prime to p, and so ϕ(p) = |Z_p*| = p − 1. We next consider the case N = p × q, where p and q are distinct primes. If an integer a ∈ {1, …, N − 1} is not relatively prime to N, then either p | a or q | a. Note that a cannot be divisible by both p and q simultaneously, since this would imply pq | a and contradict a < N (= p × q). The elements in {1, …, N − 1} that are divisible by p are exactly the (q − 1) elements p, 2p, 3p, …, (q − 1)p, and the elements divisible by q are exactly the (p − 1) elements q, 2q, …, (p − 1)q. Thus, the number of elements that are divisible by neither p nor q is given by

N − 1 − (q − 1) − (p − 1) = p × q − p − q + 1 = (p − 1)(q − 1). (E2.13)

That is to say, ϕ(N) = (p − 1)(q − 1) when N is the product of two distinct primes p and q.

Theorem 2.2

Let N = ∏_i p_i^{e_i}, where the {p_i} are distinct primes and e_i ≥ 1. Then ϕ(N) = N × ∏_i (1 − 1/p_i).

Example 8. Take N = 24 = 2^3 × 3. Then Z_24* = {1, 5, 7, 11, 13, 17, 19, 23} and |Z_24*| = 8 = 24 × (1 − 1/2) × (1 − 1/3) = ϕ(24). The inverse of 5 in Z_24* is 5 itself, since 5 × 5 = 25 = 1 mod 24.
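The phi function can be checked numerically; the sketch below (ours, for illustration) computes ϕ(24) both by counting Z_24* directly and via the product formula of Theorem 2.2:

```python
from math import gcd

def phi_by_counting(N):
    return sum(1 for a in range(1, N) if gcd(a, N) == 1)

def phi_by_formula(N, prime_factors):
    result = N
    for p in prime_factors:          # N * prod(1 - 1/p_i), kept in integers
        result = result * (p - 1) // p
    return result

print(phi_by_counting(24), phi_by_formula(24, [2, 3]))  # 8 8
```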

So far, the set Z_N* under multiplication modulo N has been shown to be a group of order ϕ(N). According to Theorem 2.1, we obtain the following theorem directly:

Theorem 2.3

Take an arbitrary N > 1 and a ∈ Z_N*; then we obtain

a^{ϕ(N)} = 1 mod N. (E2.14)

For the specific case when N = p is prime and a ∈ {1, …, p − 1}, we have

a^{p−1} = 1 mod p. (E2.15)

By Corollary 2.2, we obtain the following theorem easily:

Theorem 2.4

Let N > 1 be an integer. For an integer e > 0, define f_e : Z_N* → Z_N* by f_e(x) = x^e mod N. If e is relatively prime to ϕ(N), then f_e is a permutation. Moreover, if d = [e^{-1} mod ϕ(N)], then f_d is the inverse of f_e.

Definition 2.2. A function f : G → H is said to be an isomorphism from a group (G, ∘_G) to another group (H, ∘_H) if (1) f is a bijective mapping and (2) for all g1, g2 ∈ G, we have f(g1 ∘_G g2) = f(g1) ∘_H f(g2). If these properties hold, we say the group (G, ∘_G) and the group (H, ∘_H) are isomorphic and write G ≅ H.

Group isomorphisms: An isomorphism from one group to another provides an alternate and equivalent way to think about the structure of groups. For example, if the group (G, ∘_G) is finite and G ≅ H, then the group (H, ∘_H) must be finite and have the same order as G. Also, if there exists an isomorphism f from a group (G, ∘_G) to another group (H, ∘_H), then f^{-1} is an isomorphism from (H, ∘_H) to (G, ∘_G).

Example 9. The bijective mapping f with f(n) = 2n is an isomorphism from the group (Z, +) to the group (2Z, +), because f(a + b) = 2(a + b) = 2a + 2b = f(a) + f(b), where 2Z denotes the set of even integers and + is the addition operation on the integers.

Example 10. The bijective mapping f with f(n) = 3^n is an isomorphism from the group (Z_6, +) to the group (Z_7*, ×), where + and × denote addition modulo 6 and multiplication modulo 7, respectively. On the one hand, 3^0 = 1, 3^1 = 3, 3^2 = 2, 3^3 = 6, 3^4 = 4, and 3^5 = 5. On the other hand, f(2 + 3) = 3^{2+3} = 3^2 × 3^3 = f(2) × f(3).

2.1.2.3. Chinese remainder theorem

Theorem 2.5

Let n1, n2, …, nk be integers that are pairwise relatively prime, i.e., gcd(ni, nj) = 1 for i ≠ j. Then there exists a unique solution modulo the product n = n1 × n2 × ⋯ × nk to the following system of congruences:

x ≡ a1 (mod n1), x ≡ a2 (mod n2), …, x ≡ ak (mod nk). (E2.16)

The solution to the system of congruences shown in Theorem 2.5 can be calculated as

x = Σ_{i=1}^{k} a_i N_i M_i mod n, (E2.17)

where

N_i = n / n_i (E2.18)

and

M_i = N_i^{-1} mod n_i. (E2.19)

The solution can also be represented in a slightly different way that makes it easier to understand. Namely, we can rewrite Eq. (E2.17) as

x = Σ_{i=1}^{k} a_i e_i mod n (E2.20)

such that

e_i ≡ 1 (mod n_i) and e_i ≡ 0 (mod n_j) for j ≠ i. (E2.21)

So the solution of the Chinese remainder theorem can be regarded essentially as an integer version of Lagrange interpolation, where a polynomial is constructed from k points by calculating a similar set of coefficients that are either 0 or 1 and thus enforce the desired behavior at the given points.

Example 11. Consider the following system of congruences:

x ≡ 2 (mod 3), i.e., a1 = 2, n1 = 3; x ≡ 3 (mod 5), i.e., a2 = 3, n2 = 5; x ≡ 2 (mod 7), i.e., a3 = 2, n3 = 7. (E2.22)

According to the solution of the Chinese remainder theorem, we see that

n = n1 × n2 × n3 = 3 × 5 × 7 = 105, N1 = n/n1 = 105/3 = 35, N2 = n/n2 = 105/5 = 21, N3 = n/n3 = 105/7 = 15, M1 = N1^{-1} mod n1 = 35^{-1} mod 3 = 2^{-1} mod 3 = 2, M2 = N2^{-1} mod n2 = 21^{-1} mod 5 = 1^{-1} mod 5 = 1, M3 = N3^{-1} mod n3 = 15^{-1} mod 7 = 1^{-1} mod 7 = 1, (E2.23)

so that

x = (a1 × N1 × M1 + a2 × N2 × M2 + a3 × N3 × M3) mod 105 = (2 × 35 × 2 + 3 × 21 × 1 + 2 × 15 × 1) mod 105 = 233 mod 105 = 23. (E2.24)

In this example we can also think of the solution as finding integers e1, e2, and e3 such that

x = (2 × e1 + 3 × e2 + 2 × e3) mod 105, (E2.25)

where

e1 = 70, since 70 ≡ 1 (mod 3), 70 ≡ 0 (mod 5), and 70 ≡ 0 (mod 7),

e2 = 21, since 21 ≡ 0 (mod 3), 21 ≡ 1 (mod 5), and 21 ≡ 0 (mod 7),

and

e3 = 15, since 15 ≡ 0 (mod 3), 15 ≡ 0 (mod 5), and 15 ≡ 1 (mod 7). (E2.26)
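The construction of Theorem 2.5 translates directly into code; the following Python sketch (ours, illustrative only) computes the N_i and M_i coefficients and reproduces Example 11:

```python
from functools import reduce

def crt(residues, moduli):
    # Moduli are assumed pairwise relatively prime, as in Theorem 2.5.
    n = reduce(lambda a, b: a * b, moduli)
    x = 0
    for a_i, n_i in zip(residues, moduli):
        N_i = n // n_i
        M_i = pow(N_i, -1, n_i)      # M_i = N_i^{-1} mod n_i
        x += a_i * N_i * M_i
    return x % n

print(crt([2, 3, 2], [3, 5, 7]))     # 23, matching Example 11
```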

2.1.2.4. Cyclic groups and generators

Let (G, ×) be a finite group with order m. For an arbitrary g ∈ G, consider the set

⟨g⟩ =def {g^0, g^1, …}. (E2.27)

According to Theorem 2.1, g^m = 1. If i ≤ m denotes the smallest positive integer for which g^i = 1, then the above sequence repeats after i terms (i.e., g^i = g^0, g^{i+1} = g^1, …, g^{2i} = g^0, etc.), and so

⟨g⟩ = {g^0, …, g^{i−1}}. (E2.28)

It is easy to see that ⟨g⟩ contains exactly i elements: if g^j = g^k with 0 ≤ j < k < i, then g^{k−j} = 1 with 0 < k − j < i, contradicting the choice of i.

It is easy to observe that (⟨g⟩, ×) is a subgroup of (G, ×), generated by the element g. If the order of the subgroup (⟨g⟩, ×) is i, then i is called the order of the element g.

Definition 2.3. If (G, ×) is a finite group and g ∈ G, then the order of the element g is defined as the smallest positive integer i such that g^i = 1.

Proposition 2.7

If G is a finite group and g ∈ G is an element with order i, then for any integer x, we have g^x = g^{[x mod i]}.

Proof. The proof of this proposition is similar to the proof of Corollary 2.1, and thus we omit it here.

Proposition 2.8

If G is a finite group and g ∈ G is an element with order i, then g^x = g^y if and only if x = y mod i.

Proof. If x = y mod i, then [x mod i] = [y mod i], and according to the previous proposition, we directly have

g^x = g^{[x mod i]} = g^{[y mod i]} = g^y. (E2.29)

Conversely, suppose g^x = g^y. By the previous proposition, g^{x′} = g^{y′}, where x′ = [x mod i] and y′ = [y mod i]. We in turn get g^{x′} × (g^{y′})^{-1} = g^{x′−y′} = 1. If x′ ≠ y′, then the difference x′ − y′ is a nonzero integer whose absolute value is smaller than i, since both x′ and y′ are smaller than i. This contradicts the fact that i is the order of the element g. Therefore, x′ = y′, i.e., x = y mod i.

The identity element of any group (G, ∘) has order 1 and generates the trivial subgroup ({e}, ∘); furthermore, the identity element is the only element of order 1. On the other hand, if an element g ∈ G of order m (where m denotes the order of the group (G, ∘)) can be found, then ⟨g⟩ = G. In other words, G can be generated by the element g and is then called a cyclic group. Here, g is called a generator of G. (Unlike the identity element, which is unique, a cyclic group may have multiple generators.) If g is a generator of a cyclic group (G, ∘), then, by definition, every element h ∈ G can be obtained by computing g^x for some x ∈ {0, …, m − 1}.

Proposition 2.9

Let (G, ∘) be a finite group with order m, and let the element g ∈ G have order i. Then i | m.

Proof. By Theorem 2.1, g^m = 1. According to Proposition 2.7, we have g^m = g^{[m mod i]} when the element g has order i. If i does not divide m, then r =def [m mod i] is a nonzero integer smaller than i such that g^r = 1. This contradicts the fact that i is the order of the element g.

Corollary 2.3

If (G, ∘) is a group with prime order p, then this group is cyclic. Furthermore, all elements of G other than the identity element are generators of (G, ∘).

Proof. By Proposition 2.9, the only possible orders of elements in the group (G, ∘) are 1 and p. Only the identity element has order 1, so every other element has order p and generates the whole group.

In addition to groups of prime order, the additive group Z_N, for N > 1, provides another example of a cyclic group, with the element 1 serving as a generator.

Theorem 2.6

If p is prime, then (Z_p*, ×) is a cyclic group.

The proof of this theorem is outside the scope of this book; the interested reader can refer to the literature for more details.

Example 12. Consider the group (Z_11*, ×), which is cyclic by the previous theorem. We have ⟨10⟩ = {1, 10}, and so 10 is not a generator. However,

⟨2⟩ = {1, 2, 4, 8, 5, 10, 9, 7, 3, 6} = Z_11*, (E2.30)

and so 2 is a generator of Z_11*.
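The two subgroups of Example 12 can be recomputed by brute force; the short Python sketch below (ours) enumerates ⟨g⟩ in Z_11*:

```python
def generated_subgroup(g, p):
    elems, h = [], 1
    while True:
        elems.append(h)
        h = (h * g) % p
        if h == 1:                   # back at the identity: <g> is complete
            return elems

print(generated_subgroup(10, 11))    # [1, 10]
print(generated_subgroup(2, 11))     # [1, 2, 4, 8, 5, 10, 9, 7, 3, 6]
```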

Given a cyclic group G with order q, we can represent this group by a generator g ∈ G as G = {g^0, g^1, …, g^{q−1}}. In other words, each element h ∈ G can be expressed as h = g^x for some x ∈ Z_q. Correspondingly, x is called the discrete logarithm of h with respect to the generator g, written x = log_g h. Note that the logarithms in this case are regarded as “discrete” because these logarithm values range over a finite set, as opposed to “standard” logarithms from calculus, whose values range over an infinite set.

The discrete logarithm problem in a cyclic group G with respect to a given generator g is to calculate x such that g^x = h for a random input element h ∈ G. Formally speaking, let G be a polynomial-time group-generation algorithm that, on input 1^n, outputs a cyclic group G of order q and a generator g ∈ G. In addition, the group operation in G is required to be computable efficiently (i.e., in time polynomial in n). Consider the following experiment between a given algorithm A and the algorithm G:

The discrete logarithm experiment DLogA,G(n)

  1. Given the parameter n, the group-generation algorithm G outputs (G, q, g), where G denotes a cyclic group of order q and g is a generator of G such that ‖q‖ = n.

  2. Choose h ∈ G uniformly at random.

  3. Given {G, q, g, h}, the task of A is to output x ∈ Z_q.

  4. The experiment outputs 1 if g^x = h and 0 otherwise.

Definition 2.4. The discrete logarithm problem is said to be hard relative to the algorithm G if, for all probabilistic polynomial-time algorithms A, the probability Pr[DLog_{A,G}(n) = 1] is negligible.

By way of example, groups of the form Z_p*, where p is a prime number, offer one family of cyclic groups in which the discrete logarithm problem is believed to be hard [69, 70].
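For intuition only, the brute-force search below (a Python sketch of ours) solves the discrete logarithm in the tiny group Z_11* with generator 2; it is feasible here precisely because the group is small, which is why cryptographic groups are chosen to be enormous:

```python
def dlog(g, h, p):
    # Return x with g^x = h mod p by exhaustive search (exponential time).
    x, acc = 0, 1
    while acc != h:
        acc = (acc * g) % p
        x += 1
    return x

print(dlog(2, 9, 11))  # 6, since 2^6 = 64 = 9 mod 11
```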

2.1.3. Elliptic curve groups

Different from Z_p*, another interesting class of groups widely used in cryptographic applications consists of points on elliptic curves. Although elliptic curve groups are very important in practical constructions of cryptographic primitives, our treatment of such groups in this book is rather minimal and sacrifices generality in favor of simplicity. The basic reason for this approach is that most cryptographic primitives based on elliptic curve groups can be understood and analyzed by treating the underlying group in a completely generic manner, without reference to any particular group used to instantiate the primitive. Namely, cryptographic primitives can be built on arbitrary cyclic groups, and the security of these schemes can be proven as long as the related computational problem in the underlying group is believed to be hard, no matter how the group is actually instantiated. Therefore, the mathematical background needed for a deeper understanding of elliptic curve groups is beyond the scope of this book. Interested readers can refer to [71] for more details.

Define Z_p as {0, …, p − 1} for a prime p ≥ 5. Although elliptic curves can in fact be defined over arbitrary (finite or infinite) fields, only the case p ≠ 2, 3 is considered here to avoid additional complications. The elliptic curve E over Z_p is the set of points (x, y) satisfying the equation y^2 = x^3 + ax + b mod p, with a, b ∈ Z_p and with discriminant 4a^3 + 27b^2 ≠ 0 mod p (this condition ensures that the equation x^3 + ax + b = 0 mod p has no repeated roots). This set of points over Z_p, together with a distinguished point O at infinity, defines the curve: E(Z_p) = {(x, y) | x, y ∈ Z_p and y^2 = x^3 + ax + b mod p} ∪ {O}. The elements of the set E(Z_p) are called the points on the elliptic curve E defined by y^2 = x^3 + ax + b mod p, and O is called the “point at infinity.”

Example 13. An element y ∈ Z_p* is called a quadratic residue modulo p if there exists an x ∈ Z_p* such that x^2 = y mod p; such an x is called a square root of y. According to [68], when p > 2 is prime, every quadratic residue modulo p has exactly two square roots.

Let f(x) be the function f(x) = x^3 + x + 6, and consider the curve E: y^2 = f(x) mod 11. Each value of x for which f(x) is a quadratic residue modulo 11 yields two points on the curve; a value of x for which f(x) = 0 mod 11 gives one point on the curve E(Z_11). The points on the curve E(Z_11) (shown in Table 2.1) can be identified as follows:

  • f(0) = 6 mod 11, a quadratic nonresidue modulo 11.

  • f(1) = 8 mod 11, a quadratic nonresidue modulo 11.

  • f(2) = 5 mod 11, a quadratic residue modulo 11 with square roots 4 and 7. This yields the points (2,4), (2,7) ∈ E(Z_11).

  • f(3) = 3 mod 11, a quadratic residue modulo 11 with square roots 5 and 6. This yields the points (3,5), (3,6) ∈ E(Z_11).

  • f(4) = 8 mod 11, a quadratic nonresidue modulo 11.

  • f(5) = 4 mod 11, a quadratic residue modulo 11 with square roots 2 and 9. This yields the points (5,2), (5,9) ∈ E(Z_11).

  • f(6) = 8 mod 11, a quadratic nonresidue modulo 11.

  • f(7) = 4 mod 11, a quadratic residue modulo 11 with square roots 2 and 9. This yields the points (7,2), (7,9) ∈ E(Z_11).

  • f(8) = 9 mod 11, a quadratic residue modulo 11 with square roots 3 and 8. This yields the points (8,3), (8,8) ∈ E(Z_11).

  • f(9) = 7 mod 11, a quadratic nonresidue modulo 11.

  • f(10) = 4 mod 11, a quadratic residue modulo 11 with square roots 2 and 9. This yields the points (10,2), (10,9) ∈ E(Z_11).

x                            : 0  1  2    3    4  5    6  7    8    9  10
x^3 + x + 6 mod 11           : 6  8  5    3    8  4    8  4    9    7  4
Quadratic residue modulo 11? : ×  ×  ✓    ✓    ×  ✓    ×  ✓    ✓    ×  ✓
y (square roots)             : -  -  4,7  5,6  -  2,9  -  2,9  3,8  -  2,9

Table 2.1.

The points on the elliptic curve y^2 = x^3 + x + 6 mod 11.

Including the point O at infinity, there are 13 points in E(Z_11).
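The same count can be obtained by brute force; the Python sketch below (ours) enumerates all affine points of E(Z_11) for this curve:

```python
# Enumerate (x, y) with y^2 = x^3 + x + 6 mod 11, as in Table 2.1.
p, a, b = 11, 1, 6
points = [(x, y) for x in range(p) for y in range(p)
          if (y * y - (x ** 3 + a * x + b)) % p == 0]
print(points)           # the 12 affine points of Table 2.1
print(len(points) + 1)  # 13, once the point at infinity O is counted
```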

A common way to conceptualize E(Z_p) intuitively is to look at the graph of the equation y^2 = x^3 + ax + b over the reals (rather than the equation y^2 = x^3 + ax + b mod p), as in Figure 2.1. Although this figure does not correspond exactly to E(Z_p), because E(Z_p) contains only a finite number of points (recall that Z_p is a finite set) while there are infinitely many solutions to the same equation over the real numbers, the picture offers useful intuition. In such a figure, the point at infinity O can be viewed as sitting at the top of the y-axis and lying on every vertical line.

Figure 2.1.

Graph of the elliptic curve y^2 = x^3 + x + 6 such that Δ > 0.

Observations from Figure 2.2 suggest that every line intersecting the elliptic curve E(Z_p) meets this curve in exactly three points, where a point P is counted twice if the line is tangent to the curve at P, and the point O is also counted when the line is vertical. A binary operation (called “addition”) on the points of E(Z_p), denoted by “+,” can be defined from this fact as follows:

  • The point O is defined as the identity element, so that P + O = O + P = P for all P ∈ E(Z_p).

  • We require that

P1 + P2 + P3 = O (E2.31)

whenever P1, P2, P3 are colinear points on E. (Note that the ordering of P1, P2, P3 is disregarded; in this sense the addition operation is commutative for all points and associative for colinear points.)

Figure 2.2.

Addition operations on points on an elliptic curve.

Negation: On input a point P, the negation of P is defined as the point −P such that P + (−P) = O. If P = O, then −P = O. Otherwise, since P + (−P) + O = (P + (−P)) + O = O + O = O and by Eq. (E2.31), the negation −P is the third point on the line passing through P and O, in other words, the vertical line passing through P. As shown in Figure 2.2, −P is simply the reflection of P in the x-axis; that is, if P = (x, y), then −P = −(x, y) = (x, −y).

Addition of points: To evaluate the sum P1 + P2 for two arbitrary points P1, P2 ≠ O on the elliptic curve E, draw the line through P1 and P2 (if P1 = P2, draw the line tangent to E at P1) and identify the third point of intersection P3 of this line with the curve E. From Eq. (E2.31), we directly obtain P1 + P2 = −P3. If P3 = O, then P1 + P2 = −O = O. Otherwise, if the third point of intersection of the line through P1 and P2 is the point P3 = (x, y) ≠ O, then

P1 + P2 = −P3 = (x, −y). (E2.32)

It is straightforward but tedious to state the addition law concretely. Assume P1 = (x1, y1) and P2 = (x2, y2) are two points in E(Z_p) with P1, P2 ≠ O, where E is given by the equation y^2 = x^3 + ax + b mod p. The slope of the line through these points is

λ = (y2 − y1)/(x2 − x1) mod p if P1 ≠ P2, and λ = (3x1^2 + a)/(2y1) mod p if P1 = P2. (E2.33)

The line passing through P1 and P2 is given by the equation

y = λ(x − x1) + y1 mod p. (E2.34)

To find the third point of intersection of this line with E, substitute Eq. (E2.34) into y^2 = x^3 + ax + b mod p to obtain

(λ(x − x1) + y1)^2 = x^3 + ax + b mod p. (E2.35)

The values of x that satisfy this equation are x1, x2, and [λ^2 − x1 − x2 mod p]. The first two solutions correspond to the original points P1 and P2, while the third is the x-coordinate of the third point of intersection P3. The y-value corresponding to this third value of x is y = [λ(x − x1) + y1 mod p]. Reflecting in the x-axis then gives the sum P1 + P2 = (x3, y3), where

x3 = (λ^2 − x1 − x2) mod p, y3 = [λ(x1 − x3) − y1] mod p. (E2.36)

It is straightforward to check that the set of points E(Z_p) paired with the addition law defined above forms an abelian group. Almost all the necessary properties have been shown: closure under the addition law depends on the fact (whose proof is omitted due to space limits) that any line intersecting E meets the curve in exactly three points; O serves as the identity element; the inverse of each point on E(Z_p) is also in E(Z_p); and commutativity of the addition law follows from Eq. (E2.31). The final property to verify is associativity, which the interested reader can check with a tedious calculation.

According to Table 2.1, there are 13 points in the set E(Z_11) for the curve E: y^2 = x^3 + x + 6 mod 11, namely

E(Z_11) = {O, (2,4), (2,7), (3,5), (3,6), (5,2), (5,9), (7,2), (7,9), (8,3), (8,8), (10,2), (10,9)}. (E2.37)

Suppose P = (2,7); then 2P = P + P can be computed as follows:

First, the slope of the (tangent) line is calculated:

λ = (3 × 2^2 + 1)/(2 × 7) mod 11 = 13/14 mod 11 = 2 × 3^{-1} mod 11 = 2 × 4 mod 11 = 8. (E2.38)

After that, we obtain

x3 = (8^2 − 2 − 2) mod 11 = 5, y3 = [8 × (2 − 5) − 7] mod 11 = 2. (E2.39)

Therefore, 2P = (5,2). Similarly, we can also obtain 3P = (8,3), 4P = (10,2), 5P = (3,6), 6P = (7,9), 7P = (7,2), 8P = (3,5), 9P = (10,9), 10P = (8,8), 11P = (5,9), 12P = (2,4), and 13P = O.

It is easy to see that E(Z_11) along with the addition operation forms a commutative group with generator P = (2,7).
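The whole computation can be traced with a few lines of Python; the sketch below (ours, assuming Python 3.8+ for modular inverses via pow) implements the affine addition law of Eqs. (E2.33) to (E2.36) for this curve and reproduces the multiples of P = (2,7):

```python
p, a = 11, 1          # curve y^2 = x^3 + x + 6 mod 11
O = None              # the point at infinity

def ec_add(P, Q):
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O                                    # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

P = (2, 7)
Q = P
for k in range(2, 14):
    Q = ec_add(Q, P)
    print(k, Q)       # 2 (5, 2), 3 (8, 3), ..., 13 None (= O)
```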

Similar to Zp*, elliptic curve groups provide another family of cyclic groups in which the discrete logarithm problem is believed to be hard [71].

2.1.4. Bilinear pairing

Let G be a cyclic multiplicative group with generator g and prime order p, and let G_T be another multiplicative cyclic group of the same order p. A bilinear pairing is a map ê : G × G → G_T with the following properties:

  1. Bilinearity: For all g1, g2 ∈ G and a, b ∈ Z_p*, ê(g1^a, g2^b) = ê(g1, g2)^{ab}.

  2. Nondegeneracy: There exist g1, g2 ∈ G such that ê(g1, g2) ≠ 1_{G_T}.

  3. Computability: For all g1, g2 ∈ G, there is an efficient algorithm to compute ê(g1, g2).

The modified Weil pairing [77] or Tate pairing [72] defined on the elliptic curves can be adopted to implement such an admissible bilinear map.

2.2. Public-key cryptography

The discussion thus far has been very general. We now present some concrete examples of public-key encryption and signature schemes in the traditional PKI and ID-PKC settings, respectively. As mentioned in [73], public-key encryption enables two parties to communicate with each other securely without any shared secret information in advance. Let us call the sender Alice and the receiver Bob. Bob first selects the key pair (pk_B, sk_B) and sends the public key pk_B to Alice over any channel but keeps the private key sk_B secure and secret. Before sending the sensitive message m to Bob, Alice generates the ciphertext c by performing the encryption algorithm c = Enc_{pk_B}(m) under Bob’s public key. To recover the message m, Bob decrypts the ciphertext c by carrying out the decryption algorithm Dec_{sk_B}(c) with his own private key sk_B. According to [74], a digital signature allows a user to generate a signature on a given message with the secret key sk in such a way that any other party who knows this user’s corresponding public key pk can verify that the message indeed originated from this user and has not been altered in any way. Again, let us call the sender Alice and the receiver Bob. The sender Alice first generates a public/secret key pair (pk_A, sk_A) and then distributes pk_A in some reliable way to the other users in the system while keeping sk_A secret. When a message m originating from Alice needs to be authenticated, Alice generates a digital signature σ on the message m using her private key sk_A; the pair (m, σ) is then sent. Anyone who knows the public key pk_A can verify the authenticity of m by checking whether Vrfy_{pk_A}(m, σ) = 1 holds. In this section, we focus on the most well-known and long-standing public-key primitives.

Observe that all existing constructions of public-key encryption and signature schemes are proven secure under assumptions associated with the hardness of solving certain mathematical problems. Therefore, the number-theoretic hard problems related to these constructions are reviewed first.

Definition 2.5. (RSA problem) Let N = p × q, where p and q are two distinct odd prime numbers. Let e be a random prime number such that gcd(e, (p − 1) × (q − 1)) = 1, and let G be a cyclic subgroup of Z_N*. Given (N, e) and y ∈ G, the RSA problem consists of finding z ∈ Z_N such that y = z^e mod N.

Definition 2.6. (Discrete logarithm (DL) problem) Let G be a finite cyclic group and g a generator of G. Given a random element h ∈ G, the DL problem consists of computing the exponent x such that h = g^x.

Definition 2.7. (Weil Diffie-Hellman (WDH) problem) Given a generator P of G and a triple ⟨aP, bP, cP⟩ for random a, b, c ∈ Z_p*, the WDH problem is to compute ê(P, P)^{abc} ∈ G_T.

Definition 2.8. (Decision Diffie-Hellman (DDH) problem) Given a generator P of G and a tuple ⟨aP, bP, cP⟩ for a, b, c ∈ Z_p*, the DDH problem is to decide whether c = ab mod p holds or not.

Definition 2.9. (Computational Diffie-Hellman (CDH) problem) Given a generator P of G and a tuple ⟨aP, bP⟩ for a, b ∈ Z_p*, the CDH problem is to compute abP.

2.2.1. Public-key encryption algorithms

2.2.1.1. Framework of public-key encryption

Definition 2.10. A public-key encryption scheme (see Figure 2.3 ) consists of the following three algorithms:

  1. Gen: Given a security parameter 1^n, this algorithm outputs a pair of public and secret keys (pk, sk).

  2. Enc: On input a public key pk and a message m from some underlying plaintext space, this algorithm outputs a ciphertext c, denoted c ← Enc_pk(m).

  3. Dec: On input a private key sk and a ciphertext c, this algorithm outputs a message m or a special symbol ⊥ to denote failure, written m = Dec_sk(c).

Figure 2.3.

Intuition of public-key encryption.

We make the consistency requirement that for every n, every (pk, sk) generated by Gen(1^n), every message m in the appropriate underlying plaintext space, and every c = Enc_pk(m), it holds that Dec_sk(c) = m.

2.2.1.2. RSA encryption scheme

The RSA encryption scheme [75] is the most widely used public-key encryption scheme, and its security depends on the intractability of the RSA problem. Concretely, the RSA encryption scheme consists of the following three algorithms:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Randomly choose two prime numbers p and q such that |p| ≈ |q|.

  2. Compute N = p × q.

  3. Compute ϕ(N) = (p − 1) × (q − 1).

  4. Randomly choose an integer e < ϕ(N) such that gcd(e, ϕ(N)) = 1, and compute the integer d such that e × d = 1 mod ϕ(N).

  5. Publish (N, e) as the public key, and keep d as the corresponding private key. Note that p, q, and ϕ(N) will be destroyed safely.

Enc: Before sending a sensitive message m < N, the sender creates the ciphertext c = m^e mod N. (Note that the plaintext message space is in fact Z_N*.)

Dec: To decrypt the ciphertext c, the corresponding receiver computes m = c^d mod N.
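A toy run of the scheme with deliberately insecure parameters (p = 11, q = 13, chosen by us purely for illustration) can be traced in Python:

```python
p, q = 11, 13
N = p * q                    # 143
phi = (p - 1) * (q - 1)      # 120
e = 7                        # gcd(7, 120) = 1
d = pow(e, -1, phi)          # 103, since 7 * 103 = 1 mod 120

m = 42
c = pow(m, e, N)             # Enc: c = m^e mod N
print(pow(c, d, N) == m)     # Dec: c^d mod N recovers m -> True
```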

2.2.1.3. ElGamal encryption scheme

The ElGamal encryption scheme [76] is constructed based on the DL problem. Concretely, the ElGamal encryption scheme consists of the following three algorithms:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Choose a random multiplicative generator g of Z_p*, where p is a random prime number.

  2. Pick a random number 1 ≤ x ≤ p − 2 as the private key.

  3. Compute the corresponding public key y = g^x (mod p).

  4. Publish (p, g, y) as the public key, and keep x as the corresponding private key.

Enc: Before sending a sensitive message m ∈ Z_p*, the sender picks a random 1 ≤ k ≤ p − 2 and computes the ciphertext as follows:

c1 = g^k (mod p), c2 = m × y^k (mod p). (E2.40)

Dec: To decrypt the ciphertext (c1, c2), the receiver computes

m = c2 / c1^x (mod p). (E2.41)
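A toy trace of the scheme (with tiny, insecure parameters of our choosing: p = 23 and generator g = 5 of Z_23*) is shown below in Python:

```python
p, g = 23, 5
x = 6                        # private key
y = pow(g, x, p)             # public key: 5^6 mod 23 = 8

m, k = 10, 3                 # message and sender's randomness
c1 = pow(g, k, p)            # c1 = g^k mod p
c2 = (m * pow(y, k, p)) % p  # c2 = m * y^k mod p

recovered = (c2 * pow(pow(c1, x, p), -1, p)) % p
print(recovered == m)        # True
```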

2.2.1.4. Framework of ID-based encryption scheme

Definition 2.11. An ID-based encryption scheme (see Figure 2.4) consists of the following four algorithms:

Figure 2.4.

Intuition of ID-based encryption.

Setup: On input a security parameter k, this algorithm is performed by the PKG to create the system parameters params and the master secret key master-key, such that params will be distributed to all users in the system and master-key will be kept secret by the PKG itself.

Extract: On input params, master-key, and an arbitrary identity ID ∈ {0,1}*, this algorithm is performed by the PKG to generate a private key d_ID. After that, d_ID will be sent to the corresponding user secretly.

Enc: On input params, ID, and m, this algorithm outputs a ciphertext c.

Dec: On input params, ID, c, and the corresponding private key d_ID, this algorithm recovers m.

2.2.1.5. Boneh-Franklin IBE

The first practical identity-based encryption scheme was constructed by Boneh and Franklin [77] based on bilinear pairings, and its security depends on the WDH problem. First, the basic Boneh-Franklin IBE scheme, which is not secure against an adaptive chosen-ciphertext attack, is presented; the only reason for describing the basic scheme is to improve the readability of the full scheme. After that, the full scheme extends the basic scheme to achieve security against an adaptive chosen-ciphertext attack.

Boneh-Franklin IBE (basic scheme): Concretely, the basic Boneh-Franklin IBE scheme consists of the following four algorithms:

Setup: The algorithm works as follows:

1. Let ê : G × G → G_T be a symmetric bilinear pairing as defined in Section 2.1.4, where G has prime order q and generator P.

2. Pick a random s ∈ Z_q* and set P_pub = sP.

3. Choose two cryptographic hash functions H1 : {0,1}* → G* and H2 : G_T → {0,1}^n.

The system parameters params = {P, P_pub, H1, H2} are published to everyone in the system, while the master-key s ∈ Z_q* is kept secret by the PKG.

Extract: On input a string ID ∈ {0,1}*, this algorithm is performed by the PKG as follows:

  1. Compute d_ID = sH1(ID) as the private key associated with the identity ID.

  2. Send d_ID to the user ID secretly.

Encrypt: To encrypt m under the public key ID, a random r is chosen from Z_q*, and the ciphertext is generated as follows:

c = ⟨rP, m ⊕ H2(g_ID^r)⟩, where g_ID = ê(H1(ID), P_pub) ∈ G_T. (E2.42)

Decrypt: Given a ciphertext c = ⟨U, V⟩ encrypted under the public key ID, the plaintext message can be recovered as

V ⊕ H2(ê(d_ID, U)) = m. (E2.43)

Boneh-Franklin IBE (full scheme): Concretely, the full Boneh-Franklin IBE scheme consists of the following four algorithms:

Setup and Extract: These algorithms are the same as in the basic scheme above. In addition, two additional hash functions H3 : {0,1}^n × {0,1}^n → Z_q* and H4 : {0,1}^n → {0,1}^n are required.

Encrypt: To encrypt m under the public key ID, choose a random γ ∈ {0,1}^n, set r = H3(γ, m), and generate the ciphertext as follows:

C = ⟨rP, γ ⊕ H2(g_ID^r), m ⊕ H4(γ)⟩, where g_ID = ê(H1(ID), P_pub). (E2.44)

Decrypt: Given a ciphertext c = ⟨U, V, W⟩ encrypted under the public key ID, the plaintext message can be recovered as follows:

  1. Compute V ⊕ H2(ê(d_ID, U)) = γ.

  2. Compute W ⊕ H4(γ) = m.

  3. Set r = H3(γ, m), and test that U = rP. If not, reject the ciphertext.

  4. Output m.

2.2.2. Signature algorithms

2.2.2.1. Framework of digital signature

Definition 2.12. A signature scheme (see Figure 2.5 ) consists of the following three algorithms:

Figure 2.5.

Intuition of digital signature.

Gen: Given a security parameter 1^n, this algorithm outputs a pair of public and secret keys (pk, sk).

Sign: On input a private key sk and a message m from some underlying message space, this algorithm outputs a signature σ, written σ ← Sign_sk(m).

Verify: On input a public key pk, a message m, and a signature σ, this algorithm returns 1 for accept or 0 for reject.

We make the consistency requirement that for every n, every (pk, sk) generated by Gen, and every message m, it holds that

Verify_pk(m, Sign_sk(m)) = 1. (E2.45)

2.2.2.2. RSA signature scheme

The RSA cryptosystem [75] may be used to provide both encryption and digital signatures. The RSA digital signature scheme is described as follows:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Randomly choose two prime numbers p and q such that |p| ≈ |q|.

  2. Compute N = p × q.

  3. Compute ϕ(N) = (p − 1) × (q − 1).

  4. Randomly choose an integer e < ϕ(N) such that gcd(e, ϕ(N)) = 1, and compute the integer d such that e × d = 1 mod ϕ(N).

  5. Publish (N, e) as the public key, and keep d as the corresponding private key. Note that p, q, and ϕ(N) will be destroyed safely.

Sign: To create a signature on a message m ∈ Z_N*, the signer computes σ = m^d mod N.

Verify: Given a message-signature pair (m, σ), this algorithm outputs 1 if m = σ^e mod N and returns 0 otherwise.
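With the same insecure toy parameters as in the RSA encryption sketch above (N = 143, e = 7, d = 103), signing and verification can be traced as follows:

```python
N, e, d = 143, 7, 103
m = 42
sigma = pow(m, d, N)          # Sign: sigma = m^d mod N
print(pow(sigma, e, N) == m)  # Verify: sigma^e mod N == m -> True
```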

2.2.2.3. ElGamal signature scheme

The ElGamal cryptosystem [76] may be used to provide both encryption and digital signatures. The ElGamal digital signature scheme is described as follows:

Gen: To create the public key and the corresponding secret key, the following steps are performed:

  1. Choose a random multiplicative generator g of Z_p*, where p is a random prime number.

  2. Pick a random number 1 ≤ x ≤ p − 2 as the private key.

  3. Compute the corresponding public key y = g^x (mod p).

  4. Publish (p, g, y) as the public key, and keep x as the corresponding private key.

Sign: To create a signature on a message m ∈ Z_p*, the signer picks a random number 1 ≤ k ≤ p − 2 with gcd(k, p − 1) = 1 and generates a signature (r, s) such that

r = g^k (mod p), s = k^{-1}(m − x × r) (mod p − 1). (E2.46)

Verify: Given a message-signature pair (m, (r, s)), this algorithm outputs 1 if y^r × r^s = g^m (mod p) and returns 0 otherwise.
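A toy ElGamal signing run (insecure parameters of our choosing: p = 23, generator g = 5) checks the verification equation:

```python
p, g = 23, 5
x = 6
y = pow(g, x, p)                     # 8

m = 10
k = 3                                # gcd(k, p - 1) = 1, so k^{-1} mod (p - 1) exists
r = pow(g, k, p)                     # 10
s = (pow(k, -1, p - 1) * (m - x * r)) % (p - 1)

lhs = (pow(y, r, p) * pow(r, s, p)) % p
print(lhs == pow(g, m, p))           # True
```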

2.2.2.4. Schnorr signature scheme

Many variations of the basic ElGamal signature scheme have been proposed, and the Schnorr signature scheme [78] is one well-known variant. The Schnorr digital signature scheme is described as follows:

Gen: The system parameters are generated as follows:

  1. Choose two prime numbers p and q such that q | p − 1.

  2. Choose an element g ∈ Z_p* of order q.

  3. Select a cryptographic hash function H : {0,1}* → Z_q.

  4. Select a random number x ∈_R Z_q as the secret key, and compute y = g^x (mod p) as the public key. Publish (p, q, g, y, H) as the public key, and keep x as the corresponding secret key.

The parameters (p, q, g, H) are publicized for use by all users in the system.

Sign: To create a signature on a message m ∈ {0,1}*, the signer picks a random number k from Z_q and computes the signature (e, s) such that r = g^k (mod p), e = H(m‖r), and s = k + x × e (mod q).

Verify: Given a message-signature pair (m, (e, s)), this algorithm computes r = g^s × y^{−e} (mod p), outputs 1 if e = H(m‖r), and returns 0 otherwise.
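A toy Schnorr run (ours) with q = 11, p = 23, and g = 4 of order 11 in Z_23*, using SHA-256 reduced modulo q as a stand-in for H, traces the equations above:

```python
import hashlib

p, q, g = 23, 11, 4
x = 7                          # secret key
y = pow(g, x, p)               # public key

def H(data):                   # illustrative hash into Z_q
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

m = b"hello"
k = 5                          # per-signature randomness
r = pow(g, k, p)
e = H(m + str(r).encode())
s = (k + x * e) % q

# Verify: recompute r' = g^s * y^{-e} mod p and check the hash.
r_check = (pow(g, s, p) * pow(y, -e, p)) % p
print(e == H(m + str(r_check).encode()))  # True
```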

2.2.2.5. Digital signature standard

In 1991, a digital signature algorithm (DSA) was presented by the US National Institute of Standards and Technology (NIST) and later adopted as the Digital Signature Standard (DSS) [79] in the US Federal Information Processing Standards (FIPS 186). DSS is regarded as the first digital signature scheme recognized by the US government and is described as follows:

Gen: The system parameters are generated as follows:

  1. Select a prime number q such that 2^159 < q < 2^160.

  2. Choose t such that 0 ≤ t ≤ 8, and select a prime number p with 2^{511+64t} < p < 2^{512+64t} and q | (p − 1).

  3. Select a generator g of the unique cyclic subgroup of order q in Z_p*:

    1. Select an element h ∈ Z_p* and compute g = h^{(p−1)/q} mod p.

    2. If g = 1, then go to the previous step.

  4. Select a random integer x such that 1 ≤ x ≤ q − 1, and compute y = g^x mod p.

  5. (p, q, g, y) is published as the public key, while x is kept secret as the private key of the corresponding user.

Sign: To create a signature on a message m ∈ {0,1}*, the following steps are performed.

  1. Select a random secret integer k such that 0 < k < q.

  2. Compute r = (g^k mod p) mod q.

  3. Compute k^{-1} mod q.

  4. Compute s = k^{-1}(h(m) + xr) mod q, where h denotes a cryptographic hash function.

  5. (r, s) is returned as the signature on the message m.

Verify: Given a message-signature pair (m, (r, s)), the following steps are performed to check the validity of the signature.

  1. Compute u1 = s^{-1} h(m) mod q and u2 = r s^{-1} mod q.

  2. Output 1 if r = (g^{u1} y^{u2} mod p) mod q, and return 0 otherwise.
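A toy DSA trace (ours, with tiny insecure parameters q = 11, p = 23, g = 4 of order 11, and a fixed small value standing in for h(m)) checks the verification equation:

```python
p, q, g = 23, 11, 4
x = 7
y = pow(g, x, p)

hm = 9                               # stand-in for h(m) mod q, illustration only
k = 6
r = pow(g, k, p) % q                 # r = (g^k mod p) mod q
s = (pow(k, -1, q) * (hm + x * r)) % q

u1 = (pow(s, -1, q) * hm) % q
u2 = (r * pow(s, -1, q)) % q
print(r == ((pow(g, u1, p) * pow(y, u2, p)) % p) % q)  # True
```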

2.2.2.6. Framework of ID-based signature scheme

An ID-based signature scheme (see Figure 2.6) consists of the following four algorithms:

Figure 2.6.

Intuition of ID-based signature.

Setup: On input a security parameter k, this algorithm is performed by the PKG to create the system parameters params and the master secret key master-key, such that params will be distributed to all users in the system and master-key will be kept secret by the PKG itself.

Extract: On input params, master-key, and an arbitrary identity ID ∈ {0,1}*, this algorithm is performed by the PKG to generate a private key d_ID. After that, d_ID will be sent to the corresponding user secretly.

Sign: Given a message m, an identity ID, a private key d_ID, and params, this algorithm generates a signature σ on the message m under the identity ID.

Verify: Given a signature σ, a message m, an identity ID, and params, this algorithm returns 1 for accept or 0 for reject.

2.2.2.7. Cha-Cheon IBS

Inspired by [77], an ID-based signature scheme was proposed by Cha and Cheon [80] using gap Diffie-Hellman (GDH) groups, where a GDH group is a cyclic group in which the CDH problem is hard but the DDH problem is easy. The Cha-Cheon identity-based signature scheme is described as follows:

Setup: Let ê : G × G → G_T be a symmetric bilinear pairing as defined in Section 2.1.4. Choose a generator P of the group G, pick a random s ∈ Z_p*, set P_pub = sP, and choose cryptographic hash functions H1 : {0,1}* × G → Z_p* and H2 : {0,1}* → G. The system parameters (P, P_pub, H1, H2) are published, while the master secret key s is kept secret by the PKG.

Extract: Given an identity ID, this algorithm computes d_ID = sH2(ID) and sends it to the user associated with the identity ID secretly.

Sign: Given a secret key d_ID and a message m, pick a random number r ∈ Z_p* and output a signature σ = (U, V) such that U = rH2(ID), h = H1(m, U), and V = (r + h)d_ID.

Verify: To verify a signature σ = (U, V) on a message m under an identity ID, this algorithm outputs 1 if ê(U + hH2(ID), P_pub) = ê(P, V) and returns 0 otherwise, where h = H1(m, U).

2.2.2.8. Bellare-Neven-Namprempre IBS

Bellare et al. [81] proposed the first pairing-free identity-based signature scheme based on the DL problem. The Bellare-Neven-Namprempre identity-based signature scheme is described as follows:

Setup: Generate an elliptic curve E over a finite field Z_p, a prime q dividing the number of points on E, and a curve point P of order q; pick a random x ∈ Z_q*, and compute P_pub = xP. Choose two cryptographic hash functions H1, H2 : {0,1}* → Z_q. The tuple (Z_p, E, q, P, P_pub, H1, H2) is published as the system parameters, and x is kept secret as the master secret key by the PKG.

Extract: Given an identity ID, the PKG picks a random r ∈ Z_q* and computes R = rP, c = H1(ID, R), and s = r + cx mod q. The pair (R, s) is sent to the user associated with ID secretly.

Sign: Given a secret key pair (R, s) associated with ID and a message m, pick a random t ∈ Z_q* and compute T = tP, h = H2(ID, m, R, T), and

z = t + hs mod q. (E2.47)

The resulting signature is σ = (R, T, z).

Verify: To verify a signature σ = (R, T, z) on a message m for an identity ID, compute h = H2(ID, m, R, T) and c = H1(ID, R), and check whether

zP = T + h(R + cP_pub) (E2.48)

holds or not.

2.3. Provable security

Provable security, which involves exact definitions, precise assumptions, and rigorous security proofs of cryptographic primitives, is regarded as an unavoidable methodology in designing, analyzing, and evaluating new primitives [82, 83, 84]. The basic reason for the necessity of provable security originates from the fact that the security of a cryptographic primitive cannot be checked in the same way that software is typically checked. Without a proof that no adversary with enough computational resources can break the primitive, we can only rely on our intuition about its security. Unfortunately, history has already demonstrated that intuition in cryptography and information security is disastrous, in the sense that countless examples of unproven schemes, or schemes with only heuristic security arguments, were broken (sometimes immediately, sometimes years after being published or deployed).

Instead of rashly assuming that a given cryptographic primitive is secure, provable security assumes that some mathematical problem is hard to solve and then proves that the given primitive is secure under this assumption. Concretely, the proof proceeds by presenting an explicit reduction showing how to construct an efficient algorithm C that succeeds in solving a computational task assumed to be hard, given any efficient adversary A that succeeds in breaking the primitive with nonnegligible probability. The steps of such a proof can be outlined at a high level as follows:

We first assume that some mathematical problem X cannot be solved by any polynomial-time algorithm except with negligible probability. To prove that some cryptographic primitive Π is secure, the following steps are performed.

Initial: Assume an efficient adversary A attempts to attack the cryptographic primitive Π, and denote A’s success probability by ε.

  1. An efficient algorithm C is constructed by employing the adversary A as a subroutine to solve problem X. Given an input instance of the mathematical problem X, the algorithm C simulates an execution of Π for the adversary A such that:

    1. The view of A when it is invoked as a subroutine by C should be identical to the view of A when it interacts with the real primitive Π itself.

    2. Furthermore, if A succeeds in breaking the execution of Π being simulated by C, this should enable C to solve the given instance of X with inverse polynomial probability 1/p.

  2. From Step 1, it is obvious that if ε is not negligible, then the problem X can be solved by C with nonnegligible probability ε/p, which contradicts the initial assumption. Therefore, we conclude that, under the assumption associated with X, no efficient algorithm A can succeed with nonnegligible probability ε; in other words, Π is formally proven to be computationally secure.

2.3.1. Public-key encryption

As for the security model of public-key encryption, we begin our definitional treatment by considering the case of an eavesdropping adversary who observes a single ciphertext that it wishes to crack. In other words, the eavesdropping adversary has no further interaction with the sender or the receiver after receiving a (single) ciphertext. Intuitively speaking, the security model for public-key encryption is captured by the definition of indistinguishability. Particularly, consider an experiment PubK^eav_{A,Π}(n) between an eavesdropping adversary A and a corresponding simulator/challenger C, in which A outputs two messages m0 and m1 of equal length and in turn is given an encryption of one of these messages, chosen at random by C using a randomly generated key. According to the definition of indistinguishability, a public-key encryption scheme Π is said to be secure if, in this experiment, no adversary A can distinguish which message was encrypted better than a naive guess. We now give the formal definition of PubK^eav_{A,Π}(n) as follows:

Given a public-key encryption scheme Π = (Gen, Enc, Dec) and an adversary A, the eavesdropping experiment PubK^eav_{A,Π}(n) is defined as follows:

  1. The algorithm Gen(1^n) is performed by C to generate keys (pk, sk).

  2. After receiving the public key pk, A outputs a pair of messages (m0, m1) such that |m0| = |m1|.

  3. After receiving (m0, m1), C chooses a random bit b ∈ {0,1} and sends to A a ciphertext c ← Enc_pk(m_b). Here, c is called the challenge ciphertext.

  4. After receiving c, A outputs a bit b′ and wins the experiment if b′ = b.

Definition 2.13. A public-key encryption scheme Π = (Gen, Enc, Dec) is said to achieve indistinguishability against an eavesdropper if for all probabilistic polynomial-time adversaries A, the advantage of A in the eavesdropping experiment PubK^eav_{A,Π}(n), defined as Adv^{ind-eav}_Π(A) = |2 Pr[b′ = b] − 1|, is negligible.

2.3.1.1. Security against chosen-plaintext attacks

In view of the fact that the eavesdropping experiment models a relatively weak attacker who only passively eavesdrops on the communication, a more powerful type of adversary, who is allowed to query encryptions of multiple messages in an adaptive manner, should be considered to capture the attacker’s capabilities in the real world. Thus, the chosen-plaintext attack (CPA) is defined, in which the adversary A may interact freely with an encryption oracle to encrypt messages of A’s choice.

Specifically, consider the following experiment defined for a public-key encryption scheme Π = (Gen, Enc, Dec); the CPA indistinguishability experiment PubK^cpa_{A,Π}(n) between the adversary A, who can mount a chosen-plaintext attack, and a corresponding simulator/challenger C is defined as follows:

Initial: The algorithm Gen(1^n) is performed by C to generate keys (pk, sk).

Phase 1: After receiving the public key pk, A adaptively issues a sequence of queries to the encryption oracle Enc_pk(·).

Challenge: After deciding that Phase 1 is over, A outputs a pair of messages (m0, m1) of equal length. C now chooses a random bit b ∈ {0,1} and sends to A a ciphertext c ← Enc_pk(m_b). Here, c is called the challenge ciphertext.

Phase 2: The adversary A now adaptively issues a second sequence of queries to the oracle Enc_pk(·), as in Phase 1.

Guess: After receiving c, A outputs a bit b′ and wins the experiment if b′ = b.

Definition 2.14. A public-key encryption scheme Π = (Gen, Enc, Dec) is said to achieve indistinguishability against chosen-plaintext attacks if for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment PubK^cpa_{A,Π}(n), defined as Adv^{ind-cpa}_Π(A) = |2 Pr[b′ = b] − 1|, is negligible.

2.3.1.2. Security against chosen-ciphertext attacks

To strengthen the adversary’s capabilities further, a third type of adversary, more powerful than the former two and able to mount chosen-ciphertext attacks [85, 86], should be taken into consideration. In this attack model, the adversary is not only allowed to obtain encryptions of any messages of its choice, as in the chosen-plaintext attack model, but is also allowed to obtain decryptions of any ciphertexts of its choice. Therefore, the chosen-ciphertext attack is regarded as the most powerful attack on public-key encryption considered so far.

Specifically, consider the following experiment defined for a public-key encryption scheme Π = (Gen, Enc, Dec); the CCA indistinguishability experiment PubK^cca_{A,Π}(n) (shown in Figure 2.7) between the adversary A, who can mount a chosen-ciphertext attack, and a corresponding simulator/challenger C is defined as follows:

Figure 2.7.

CCA experiment PubK^cca_{A,Π}(n) for public-key encryption.

Initial: The algorithm Gen(1^n) is performed by C to generate keys (pk, sk).

Phase 1: After receiving the public key pk, A adaptively issues a sequence of queries to the encryption oracle Enc_pk(·) and the decryption oracle Dec_sk(·).

Challenge: After deciding that Phase 1 is over, A outputs a pair of messages (m0, m1) of equal length. C now chooses a random bit b ∈ {0,1} and sends to A a ciphertext c ← Enc_pk(m_b). Here, c is called the challenge ciphertext.

Phase 2: After receiving c, A adaptively issues a second sequence of queries to the oracles Enc_pk(·) and Dec_sk(·), as in Phase 1, with the restriction that the challenge ciphertext cannot be queried to the decryption oracle.

Guess: A outputs a bit b′ and wins the experiment if b′ = b.

Definition 2.15. A public-key encryption scheme Π = (Gen, Enc, Dec) is said to achieve indistinguishability against chosen-ciphertext attacks if for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment PubK^cca_{A,Π}(n), defined as Adv^{ind-cca}_Π(A) = |2 Pr[b′ = b] − 1|, is negligible.

2.3.2. ID-based encryption

Different from public-key encryption in the traditional PKI, the definition of chosen-ciphertext security must be strengthened a bit in the ID-PKC setting, because the attacker might already own the private keys of users ID1, …, IDn of her choice when she attacks another public key ID. To ensure the security of an encryption scheme in the ID-based environment [77], the definition of chosen-ciphertext security must allow the attacker to obtain the private key associated with any identity of her choice (other than the challenged public key ID being attacked).

2.3.2.1. Security against chosen-ciphertext and identity attacks

Specifically, consider the following experiment defined for an ID-based encryption scheme Π = (Setup, Extract, Encrypt, Decrypt); the CCA indistinguishability experiment IBE^{cca,cida}_{A,Π}(n) (shown in Figure 2.8) between the adversary A, who can mount a chosen-ciphertext attack, and a corresponding simulator/challenger C is defined as follows:

Figure 2.8.

CCA experiment IBE^{cca,cida}_{A,Π}(n) for ID-based encryption.

Initial: The algorithm Setup is performed by the challenger Cto generate the system parameters params, which will be forwarded to the adversary A.

Phase 1: Acan perform a polynomially bounded number of queries in an adaptive manner as follows:

  1. Upon receiving an identity ID_i, C runs the private key extraction oracle Extract(·) on ID_i and forwards the associated private key to A.

  2. Upon receiving a tuple (ID_i, c_i), C runs the decryption oracle Decrypt(·) on (ID_i, c_i) and sends the result to A.

Challenge: After deciding that Phase 1 is over, the adversary submits two plaintexts (m0, m1) of equal length and an identity ID* to C, such that the identity ID* has not been submitted to the private key extraction oracle Extract(·) in Phase 1. After receiving (m0, m1), C selects a random bit b ∈ {0,1}, sets c = Encrypt(params, ID*, m_b), and forwards c to the adversary as the challenge ciphertext.

Phase 2: This is identical to Phase 1 with the restrictions that:

  1. A must not query ID* to the private key extraction oracle Extract(·).

  2. A must not query (ID*, c) to the decryption oracle Decrypt(·).

Guess: A outputs a bit b′ and wins the experiment if b′ = b.

Definition 2.16. An ID-based encryption scheme Π = (Setup, Extract, Encrypt, Decrypt) is said to achieve indistinguishability against chosen-ciphertext attacks if for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment IBE^{cca,cida}_{A,Π}(n), defined as Adv^{ind-cca-ibe}_Π(A) = |2 Pr[b′ = b] − 1|, is negligible.

2.3.3. Digital signature

To capture the adversary’s power in the digital signature, a security property called existential unforgeability against a chosen message attack [87] should be achieved in the sense that the existential unforgeability means that the adversary should not be able to generate a valid signature on any message, and the chosen message attack means that the adversary is able to obtain signatures on any messages it wishes during its attack. We now give the formal definition of existential unforgeability against a chosen message attack as follows:

2.3.3.1. Security against chosen message attacks

Let Π = (Gen, Sign, Verify) be a signature scheme, and consider the following experiment Sig-forge_{A,Π}^{cma}(n) (shown in Figure 2.9) between an adversary A and a corresponding challenger/simulator C as follows:

Figure 2.9.

Unforgeability experiment Sig-forge_{A,Π}^{cma}(n) for signatures.

Initial: The algorithm Gen is run by C to obtain keys (pk, sk) such that the public key pk is forwarded to the adversary A, whereas the private key sk is kept secret by C itself.

Attack: After receiving the public key pk, A is given access to the signing oracle Sign_sk(·), which returns a signature Sign_sk(m) for any message m of A’s choice.

Forgery: A outputs a message and signature pair (m*, σ*). A wins this experiment if Verify_pk(m*, σ*) = 1 with the restriction that m* has not been queried to the oracle Sign_sk(·).

Definition 2.17. A signature scheme Π = (Gen, Sign, Verify) is said to achieve existential unforgeability under an adaptive chosen-message attack if, for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment Sig-forge_{A,Π}^{cma}(n) is negligible.

2.3.4. ID-based signature

The security model of existential unforgeability under an adaptive chosen message attack in the traditional PKI can be extended to the identity-based environment naturally [80, 81]. We now give the formal definition of existential unforgeability against a chosen message attack for the ID-based signature as follows:

2.3.4.1. Security against chosen message and identity attacks

Let Π = (Setup, Extract, Sign, Verify) be an ID-based signature scheme, and consider the following experiment IBS-forge_{A,Π}^{cma,cida}(n) (shown in Figure 2.10) between an adversary A and a corresponding challenger/simulator C as follows:

Figure 2.10.

Unforgeability experiment IBS-forge_{A,Π}^{cma,cida}(n) for ID-based signatures.

Initial: The algorithm Setup is performed by the challenger C to generate the system parameters params, which will be forwarded to the adversary A.

Attack: A can perform a polynomially bounded number of queries in an adaptive manner as follows:

  • Upon receiving an identity ID_i, C runs the private key extraction oracle Extract(·) on ID_i and forwards the associated private key to A.

  • Upon receiving a tuple (ID, m), C runs the signing oracle Sign(·) on (ID, m) and forwards the result to A.

Forgery: A outputs a message m*, an identity ID*, and a signature σ*. A is said to succeed in the experiment IBS-forge_{A,Π}^{cma,cida}(n) if the following requirements are satisfied:

  1. Verify(params, ID*, m*, σ*) = 1.

  2. ID* has not been queried to the private key extraction oracle Extract(·).

  3. (ID*, m*) has not been queried to the signing oracle Sign(·).

Definition 2.18. An ID-based signature scheme Π = (Setup, Extract, Sign, Verify) is said to achieve existential unforgeability under an adaptive chosen-message attack if, for all probabilistic polynomial-time adversaries A, the advantage of A in the experiment IBS-forge_{A,Π}^{cma,cida}(n) is negligible.

Outsourcing for Data Services in Mobile Internet

3. Information Retrieval

3.1. Introduction

Information retrieval is a very simple concept with everyone having practical experience in its use. The scenario of a user having an information need, translating that into a search statement and executing that search to locate the information, has become ubiquitous to everyday life. The Internet has become a repository of any information a person needs, replacing the library as a more convenient research tool. An information retrieval system is a system that ingests information, transforms it into searchable format, and provides an interface to allow a user to search and retrieve information. The most obvious example of an information retrieval system is Google, and the English language has even been extended with the term “Google it” to mean search for something.

So, everyone has had some experience with information retrieval systems, and with a little thought, it is easy to answer the question: does it work? Everyone who has used such systems has experienced the frustration that is encountered when looking for certain information. Given the massive amount of intellectual effort that goes into the design and evolution of “Google” and other search systems, the question comes to mind: why is it so hard to find what you are looking for?

One of the goals of this chapter is to explain the practical and theoretical issues associated with information retrieval that make the design of information retrieval systems one of the challenges of our time. Users’ demand for, and expectation of, quickly finding any information they need continues to drive both the theoretical analysis and the development of new technologies to satisfy that need. To scope the problem, one of the first things that needs to be defined is “information.” Twenty-five years ago, information retrieval was totally focused on textual items, because almost all of the “digital information” of value was in textual form. In today’s technical environment, most people carry with them most of the time the capability to create images and videos of interest, that is, the cell phone. This has made modalities other than text become as common as text. Coupled with Web sites that are designed for easy uploading and storage of those modalities, this more than justifies including modalities other than text as part of the information retrieval problem. There are many parallels between the information processing steps for text, images, audio, and video. Although maps are another modality that could be included, they will only be discussed in general terms.

In general, the information considered by information retrieval systems includes text, images, audio, and video. The term “item” shall be used to denote a specific information object. This could be a textual document, a news item from an RSS feed, an image, a video program, or an audio program. It is useful to distinguish between the original item and what is processed by the information retrieval system as the basic indexable item. The original item will always be kept for display purposes, but a lot of preprocessing can occur on it during the process of creating the searchable index. The term “item” will refer to the original object. On occasion the term “document” will be used when the item being referred to is a textual item.

An information retrieval system is the hardware and software that facilitates a user in finding the information the user needs. Hardware is included in the definition because specialized hardware is needed to transform certain modalities into digital processing format (e.g., encoders that translate composite video to digital video). As the detailed processing of items is described, it will become clear that an information retrieval system is not a single application but is composed of many different applications that work together to provide the tools and functions needed to assist the users in answering their questions. The overall goal of an information retrieval system is to minimize the user overhead in locating the information of value. Overhead from a user’s perspective can be defined as the time it takes to locate the needed information. The time starts when a user starts to interact with the system and ends when they have found the items of interest. Human factors play a significant role in this process. For example, most users have a low threshold of frustration when waiting for a response. That means that in a commercial system on the Internet, the user is more satisfied with a response in less than 3 s than with a slower response that contains more accurate information. In internal corporate systems, users are willing to wait a little longer to get results, but there is still a trade-off between accuracy and speed. Most users would rather have the faster results and iterate on their searches than allow the system to process the queries with more complex techniques that provide better results. All of the major processing steps are described for an information retrieval system, but in many cases, only a subset of them is used in operational systems because users are not willing to accept the increase in response time.

The evolution of information retrieval systems has been closely tied to the evolution of computer processing power. Early information retrieval systems were focused on automating the manual indexing processes in libraries. These systems migrated the structure and organization of card catalogs into structured databases. They maintained the same Boolean search query structure associated with the database that was used for other database applications. This was feasible because all of the assignment of terms to describe the content of a document was done by professional indexers. In parallel, there was also academic research work being done on small data sets that considered how to automate the indexing process making all of the text of a document part of the searchable index. The only place that large systems designed to search on massive amounts of text were available was in government and military systems. As commercial processing power and storage significantly increased, it became more feasible to consider applying the algorithms and techniques being developed in the universities to commercial systems. In addition, the creation of the original documents also was migrating to digital format so that they were in a format that could be processed by the new algorithms. The largest change that drove information technologies to become part of everyone’s experience was the introduction and growth of the Internet. The Internet became a massive repository of unstructured information, and information retrieval techniques were the only approach to effectively locate information on it. This changed the funding and development of search techniques from a few government-funded efforts to thousands of new ideas being funded by venture capitalists moving the more practical implementation of university algorithms into commercial systems.

3.2. Objectives of information retrieval system

The general objective of an information retrieval system is to minimize the time it takes for a user to locate the information they need. The goal is to provide the information needed to satisfy the user’s question. Satisfaction does not necessarily mean finding all information on a particular issue. It means finding sufficient information that the user can proceed with whatever activity initiated the search. This is very important because it explains some of the drivers behind the existing search systems and suggests that precision is typically more important than recalling all possible information. For example, a user looking for a particular product does not have to find the names of everyone that sells the product or every company that manufactures the product to meet their need of getting that product. Of course, if they did have total information, then it is possible that they could have gotten it cheaper, but in most cases, the consumer will never know what they missed. The concept that a user does not know how much information they missed explains why in most cases the precision of a search is more important than the ability to recall all possible items of interest; the user never knows what they missed, but they can tell if they are seeing a lot of useless information in the first few pages of search results. That does not mean that finding everything on a topic is not important to some users. If you are trying to make decisions on purchasing a stock or a company, then finding all the facts about that stock or company may be critical to prevent a bad investment. Missing the one article talking about the company being sued and possibly going bankrupt could lead to a very painful investment. But providing comprehensive retrieval of all items that are relevant to a user’s search can have the negative effect of information overload on the user. In particular there is a tendency for important information to be repeated in many items on the same topic. Thus, trying to get all information makes the process of reviewing and filtering out redundant information very tedious. The better a system is at finding all items on a question (recall), the more important techniques to present aggregates of that information become.

From the user’s perspective, time is the important factor that they use to gauge the effectiveness of information retrieval. Except for users that do information retrieval as a primary aspect of their job (e.g., librarians and research assistants), most users have very little patience for investing extensive time in finding information they need. They expect interactive response from their searches, with replies within 3–4 s at the most. Instead of looking through all the hits to see what might be of value, they will only review the first and, at most, the second page of results before deciding they need to change their search strategy. These aspects of the human nature of searchers have had a direct effect on commercial Web sites and the development of commercial information retrieval. The times that are candidates to be minimized in an information retrieval system are the time to create the query, the time to execute the query, the time to select which items returned from the query the user wants to review in detail, and the time to determine if the returned item is of value. The initial research in information retrieval focused on the search as the primary area of interest. But to meet the user’s expectation of fast response and to maximize the relevant information returned requires optimization in all of these areas. The time to create a query used to be considered outside the scope of technical system support. But systems such as Google know what is in their database and what other users have searched on; so as you type a query, they provide hints on what to search on. This vocabulary browse capability helps the user in expanding the search string and helps in getting better precision.

In information retrieval, the term “relevant” is used to represent an item containing the needed information. In reality, the definition of relevance is not a binary classification but a continuous function. Items can exactly match the information need or partially match the information need. From a user’s perspective, “relevant” and needed are synonymous. From a system perspective, information could be relevant to a search statement (i.e., matching the criteria of the search statement) even though it is not needed/relevant to user (e.g., the user already knew the information or just read it in the previous item reviewed).

Relevant documents are those that contain some information that helps answer the user’s information need. Nonrelevant documents do not contain any useful information. Using these definitions, the two primary metrics used in evaluating information retrieval systems can be defined. They are Precision and Recall:

$$\mathrm{Precision} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Total\_Retrieved}}, \qquad \mathrm{Recall} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Possible\_Relevant}}. \tag{3.1}$$

Number_Possible_Relevant is the number of relevant items in the database, Number_Total_Retrieved is the total number of items retrieved from the query, and Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s search need.
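As a concrete illustration of Eq. (3.1), the following Python sketch computes both metrics from a retrieved result list and the set of known relevant items; the item identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved result list against the set of relevant items."""
    retrieved_relevant = sum(1 for item in retrieved if item in relevant)
    precision = retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 5 items retrieved, 4 of them relevant, 8 relevant items in total.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
print(precision_recall(retrieved, relevant))  # (0.8, 0.5)
```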

Precision is the factor that most users understand. When a user executes a search and has 80% precision, it means that four out of five items that are retrieved are of interest to the user. From a user’s perspective, the lower the precision, the more likely the user is wasting his resource (time) looking at nonrelevant items. From a metric perspective, the precision figure is across all of the “hits” returned from the query. But in reality most users will only look at the first few pages of hit results before deciding to change their query strategy. Thus, what is of more value in commercial systems is not the total precision but the precision across the first 20–50 hits. Typically, in a weighted system where the words within a document are assigned weights based upon how well they describe the semantics of the document, precision in the first 20–50 items is higher than the precision across all the possible hits returned. But when comparing search systems, the total precision is used.

Recall is a very useful concept in comparing systems. It measures how well a search system is capable of retrieving all possible hits that exist in the database. Unfortunately, it is impossible to calculate except in very controlled environments. It requires in the denominator the total number of relevant items in the database; if the system could determine that number, then the system could return them. There have been some attempts to estimate the total number of relevant items in a database, but there are no techniques that provide accurate enough results to be used for a specific search request. Techniques that have been used to evaluate the accuracy of different search systems in controlled test collections are described in the information retrieval evaluation literature, but they are not applicable in the general case.

3.3. Classic information retrieval

In this section we briefly present the three classic models in information retrieval, namely, the Boolean, the vector, and the probabilistic models.

3.3.1. Basic concepts

The classic models in information retrieval consider that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document’s main themes. Thus, index terms are used to index and summarize the document contents. In general, index terms are mainly nouns, because nouns have meaning by themselves and, thus, their semantics is easier to identify and to grasp. Adjectives, adverbs, and connectives are less useful as index terms because they work mainly as complements. However, it might be interesting to consider all the distinct words in a document collection as index terms. We postpone a detailed discussion of the problem of how to generate index terms; the issue is covered in the section on document preprocessing below.

Given a set of index terms for a document, we notice that not all terms are equally useful for describing the document contents. In fact, there are index terms which are simply vaguer than others. Deciding on the importance of a term for summarizing the contents of a document is not a trivial issue. Despite this difficulty, there are properties of an index term which are easily measured and which are useful for evaluating the potential of a term as such. For instance, consider a collection with a hundred thousand documents. A word which appears in each of the 100,000 documents is completely useless as an index term because it does not tell us anything about which documents the user might be interested in. On the other hand, a word which appears in just five documents is quite useful because it narrows down considerably the space of documents which might be of interest to the user. Thus, it should be clear that distinct index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document.

Let k_i be an index term, d_j be a document, and w_{i,j} ≥ 0 be a weight associated with the pair (k_i, d_j). This weight quantifies the importance of the index term for describing the document’s semantic contents.

Definition 3.1. Let t be the number of index terms in the system and k_i be a generic index term. K = {k_1, …, k_t} is the set of all index terms. A weight w_{i,j} > 0 is associated with each index term k_i of a document d_j. For an index term which does not appear in the document text, w_{i,j} = 0. With the document d_j is associated an index term vector d_j represented by d_j = (w_{1,j}, w_{2,j}, …, w_{t,j}). Further, let g_i be a function that returns the weight associated with the index term k_i in any t-dimensional vector (i.e., g_i(d_j) = w_{i,j}).

As we later discuss, the index term weights are usually assumed to be mutually independent. This means that knowing the weight w_{i,j} associated with the pair (k_i, d_j) tells us nothing about the weight w_{i+1,j} associated with the pair (k_{i+1}, d_j). This is clearly a simplification because occurrences of index terms in a document are not uncorrelated. Consider, for instance, that the terms computer and network are used to index a given document which covers the area of computer networks. Frequently, in this document, the appearance of one of these two words attracts the appearance of the other. Thus, these two words are correlated and their weights could reflect this correlation. While mutual independence seems to be a strong simplification, it does simplify the task of computing index term weights and allows for fast ranking computation. Furthermore, taking advantage of index term correlations for improving the final document ranking is not a simple task. In fact, none of the many approaches proposed in the past has clearly demonstrated that index term correlations are advantageous (for ranking purposes) with general collections. Therefore, unless clearly stated otherwise, we assume mutual independence among index terms. Modern retrieval techniques based on term correlations have been tested successfully with particular collections, and these successes seem to be slowly shifting the current understanding toward a more favorable view of the usefulness of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic information retrieval models, namely, the Boolean, the vector, and the probabilistic models, as we now do.

3.3.2. Boolean model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in the past years and was adopted by many of the early commercial bibliographic systems ( Figure 3.1 ).

Figure 3.1.

The three conjunctive components for the query q = k_a ∧ (k_b ∨ ¬k_c).

Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or nonrelevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data (instead of information) retrieval model. Second, while Boolean expressions have precise semantics, frequently it is not simple to translate an information need into a Boolean expression. In fact, most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.

The Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, that is, w_{i,j} ∈ {0, 1}. A query q is composed of index terms linked by three connectives: not, and, and or. Thus, a query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form (DNF)). For instance, the query [q = k_a ∧ (k_b ∨ ¬k_c)] can be written in disjunctive normal form as [q_dnf = (1, 1, 1) ∨ (1, 1, 0) ∨ (1, 0, 0)], where each of the components is a binary weighted vector associated with the tuple (k_a, k_b, k_c). These binary weighted vectors are called the conjunctive components of q_dnf. Figure 3.1 illustrates the three conjunctive components for the query q.

Definition 3.2. For the Boolean model, the index term weight variables are all binary, that is, w_{i,j} ∈ {0, 1}. A query q is a conventional Boolean expression. Let q_dnf be the disjunctive normal form of the query q. Further, let q_cc be any of the conjunctive components of q_dnf. The similarity of a document d_j to the query q is defined as

$$\mathrm{sim}(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} \mid (q_{cc} \in q_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise.} \end{cases} \tag{3.2}$$

If sim(d_j, q) = 1, then the Boolean model predicts that the document d_j is relevant to the query q (it might not be). Otherwise, the prediction is that the document is not relevant.

The Boolean model predicts that each document is either relevant or nonrelevant. There is no notion of a partial match to the query conditions. For instance, let d_j be a document for which d_j = (0, 1, 0). Document d_j includes the index term k_b but is considered nonrelevant to the query [q = k_a ∧ (k_b ∨ ¬k_c)].
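A minimal sketch of this prediction rule, restricted to the three index terms (k_a, k_b, k_c) of the example and assuming documents have already been reduced to binary term vectors:

```python
# Conjunctive components of q = k_a AND (k_b OR NOT k_c), i.e., q_dnf from the text.
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def boolean_sim(doc_vector, q_dnf=Q_DNF):
    """Return 1 if the binary document vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in q_dnf else 0

print(boolean_sim((0, 1, 0)))  # 0: contains k_b only, predicted nonrelevant
print(boolean_sim((1, 1, 0)))  # 1: contains k_a and k_b but not k_c, predicted relevant
```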

The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The main disadvantages are that exact matching may lead to retrieval of too few or too many documents (see Chapter 10). Today, it is well known that index term weighting can lead to a substantial improvement in retrieval performance. Index term weighting brings us to the vector model.

3.3.3. Vector model

The vector model [88, 89] recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning nonbinary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.

Definition 3.3. For the vector model, the weight w_{i,j} associated with a pair (k_i, d_j) is positive and nonbinary. Further, the index terms in the query are also weighted. Let w_{i,q} be the weight associated with the pair [k_i, q], where w_{i,q} ≥ 0. Then, the query vector q is defined as q = (w_{1,q}, w_{2,q}, …, w_{t,q}), where t is the total number of index terms in the system. As before, the vector for a document d_j is represented by d_j = (w_{1,j}, w_{2,j}, …, w_{t,j}).

Therefore, a document d_j and a user query q are represented as t-dimensional vectors, as shown in Figure 3.2. The vector model proposes to evaluate the degree of similarity of the document d_j with regard to the query q as the correlation between the vectors d_j and q. This correlation can be quantified, for instance, by the cosine of the angle between these two vectors. That is,

$$\mathrm{sim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}, \tag{3.3}$$

where |d_j| and |q| are the norms of the document and query vectors. The factor |q| does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor |d_j| provides a normalization in the space of the documents.

Figure 3.2.

The cosine of θ is adopted as sim(d_j, q).

Since w_{i,j} ≥ 0 and w_{i,q} ≥ 0, sim(d_j, q) varies from 0 to +1. Thus, instead of attempting to predict whether a document is relevant or not, the vector model ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially. For instance, one can establish a threshold on sim(d_j, q) and retrieve the documents with a degree of similarity above that threshold. But to compute rankings, we need first to specify how index term weights are obtained.
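The ranking step of Eq. (3.3) can be sketched as follows, assuming the term weights have already been computed; the document and query vectors are plain lists of floats over the same t index terms, and the numbers are illustrative only.

```python
import math

def cosine_sim(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector (Eq. 3.3)."""
    dot = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(w * w for w in doc_weights))
    norm_q = math.sqrt(sum(w * w for w in query_weights))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

def rank(documents, query_weights):
    """Sort documents (id -> weight vector) by decreasing similarity to the query."""
    scored = [(doc_id, cosine_sim(weights, query_weights))
              for doc_id, weights in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {"d1": [0.5, 0.8, 0.3], "d2": [0.9, 0.0, 0.0], "d3": [0.0, 0.2, 0.7]}
print(rank(docs, [0.7, 0.5, 0.0]))
```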

Index term weights can be calculated in many different ways. The work by Salton and McGill [88] reviews various term-weighting techniques. Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.

Given a collection C of objects and a vague description of a set A, the goal of a simple clustering algorithm might be to separate the collection C of objects into two sets: the first one which is composed of objects related to the set A and the second one which is composed of objects not related to the set A. Vague description here means that we do not have complete information for deciding precisely which objects are and which are not in the set A. For instance, one might be looking for a set A of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set A. More sophisticated clustering algorithms might attempt to separate the objects of a collection into various clusters (or classes) according to their properties. For instance, patients of a doctor specializing in cancer could be classified into five classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be imprecise (and not unique), and the problem is one of deciding to which of these classes a new patient should be assigned. In what follows, however, we only discuss the simpler version of the clustering problem (i.e., the one which considers only two classes) because all that is required is a decision on which documents are predicted to be relevant and which ones are predicted to be not relevant (with regard to a given user query).

To view the IR problem as one of clustering, we refer to the early work of Salton. We think of the documents as a collection C of objects and think of the user query as a (vague) specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem). In a clustering problem, two main issues have to be resolved. First, one needs to determine what the features are which better describe the objects in the set A. Second, one needs to determine what the features are which better distinguish the objects in the set A from the remaining objects in the collection C. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity. The most successful clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring the raw frequency of a term k_i inside a document d_j. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (i.e., intra-document characterization). Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term k_i among the documents in the collection. This factor is usually referred to as the inverse document frequency or the idf factor. The motivation for the usage of an idf factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a nonrelevant one. As with good clustering algorithms, the most effective term-weighting schemes for IR try to balance these two effects.

Definition 3.4. Let N be the total number of documents in the system and n_i be the number of documents in which the index term k_i appears. Let freq_{i,j} be the raw frequency of term k_i in the document d_j (i.e., the number of times the term k_i is mentioned in the text of the document d_j). Then, the normalized frequency f_{i,j} of term k_i in document d_j is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}, \tag{3.4}$$

where the maximum is computed over all terms which are mentioned in the text of the document d_j. If the term k_i does not appear in the document d_j, then f_{i,j} = 0. Further, let idf_i, the inverse document frequency for k_i, be given by

$$idf_i = \log\frac{N}{n_i}. \tag{3.5}$$

The best known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}, \tag{3.6}$$

or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.

Several variations of the above expression for the weight w_{i,j} are described in an interesting paper by Salton and Buckley which appeared in 1988 [90]. However, in general, the above expression should provide a good weighting scheme for many collections.

For the query term weights, Salton and Buckley suggest

$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log\frac{N}{n_i}, \tag{3.7}$$

where freq_{i,q} is the raw frequency of the term k_i in the text of the information request q.
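The following sketch puts Eqs. (3.4)–(3.7) together for a toy collection. The documents and the naive whitespace tokenization are assumptions made only to keep the example short.

```python
import math
from collections import Counter

docs = {
    "d1": "computer network security for mobile computer users",
    "d2": "mobile internet data service outsourcing",
    "d3": "network protocols and network measurement",
}
N = len(docs)

raw_freq = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
n_i = Counter()                       # number of documents containing each term
for counts in raw_freq.values():
    n_i.update(counts.keys())

def doc_weight(term, doc_id):
    """tf-idf weight w_{i,j} = f_{i,j} * log(N / n_i)  (Eqs. 3.4 and 3.6)."""
    counts = raw_freq[doc_id]
    if term not in counts:
        return 0.0
    f_ij = counts[term] / max(counts.values())
    return f_ij * math.log(N / n_i[term])

def query_weight(term, query_tokens):
    """Query weight of Eq. (3.7); terms unseen in the collection get weight 0."""
    if term not in n_i:
        return 0.0
    counts = Counter(query_tokens)
    tf_part = 0.5 + 0.5 * counts[term] / max(counts.values())
    return tf_part * math.log(N / n_i[term])

print(doc_weight("network", "d3"))
print(query_weight("mobile", ["mobile", "network"]))
```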

The main advantages of the vector model are:

  1. Its term-weighting scheme improves retrieval performance.

  2. Its partial-matching strategy allows retrieval of documents that approximate the query conditions.

  3. Its cosine ranking formula sorts the documents according to their degree of similarity to the query.

Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent [Eq. (3.3) does not account for index term dependencies]. However, in practice, consideration of term dependencies might be a disadvantage. Due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with general collections. It yields ranked answer sets which are difficult to improve upon without query expansion or relevance feedback within the framework of the vector model. A large variety of alternative ranking methods have been compared to the vector model, but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval model nowadays.

3.3.4. Probabilistic model

In this section, we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [91] which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief and focuses mainly on highlighting the key features of the model. With this purpose in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these details.

The probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and no other. Let us refer to this set of documents as the ideal answer set. Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a problem of clustering). The problem is that we do not know exactly what these properties are. All we know is that there are index terms whose semantics should be used to characterize these properties. Since these properties are not known at query time, an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are relevant and which ones are not (in truth, only the first top documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.

Definition 3.5. Assumption (probabilistic principle). Given a user query q and a document dj in the collection, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query. Documents not in this set are predicted to be nonrelevant.

This assumption is quite troublesome because it does not state explicitly how to compute the probabilities of relevance. In fact, not even the sample space which is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document d_j, as a measure of its similarity to the query, the ratio P(d_j relevant to q)/P(d_j nonrelevant to q), which computes the odds of the document d_j being relevant to the query q. Taking the odds of relevance as the rank minimizes the probability of an erroneous judgment [92, 93].

Definition 3.6. For the probabilistic model, the index term weight variables are all binary, i.e., w_{i,j} ∈ {0, 1} and w_{i,q} ∈ {0, 1}. A query q is a subset of index terms. Let R be the set of documents known (or initially guessed) to be relevant. Let R̄ be the complement of R (i.e., the set of nonrelevant documents). Let P(R|d_j) be the probability that the document d_j is relevant to the query q and P(R̄|d_j) be the probability that d_j is nonrelevant to q. The similarity sim(d_j, q) of the document d_j to the query q is defined as the ratio:

$$\mathrm{sim}(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}. \tag{3.8}$$

Using Bayes’ rule,

$$\mathrm{sim}(d_j, q) = \frac{P(\vec{d}_j \mid R) \times P(R)}{P(\vec{d}_j \mid \bar{R}) \times P(\bar{R})}. \tag{3.9}$$

P(d_j|R) stands for the probability of randomly selecting the document d_j from the set R of relevant documents. Further, P(R) stands for the probability that a document randomly selected from the entire collection is relevant. The meanings attached to P(d_j|R̄) and P(R̄) are analogous and complementary.

Since P(R) and P(R̄) are the same for all the documents in the collection, we write

$$\mathrm{sim}(d_j, q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}. \tag{3.10}$$

Assuming independence of index terms,

$$\mathrm{sim}(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R)\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)\right)}{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R})\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})\right)}. \tag{3.11}$$

P(k_i|R) stands for the probability that the index term k_i is present in a document randomly selected from the set R. P(k̄_i|R) stands for the probability that the index term k_i is not present in a document randomly selected from the set R. The probabilities associated with the set R̄ have meanings which are analogous to the ones just described.

Taking logarithms, recalling that P(k_i|R) + P(k̄_i|R) = 1, and ignoring factors which are constant for all documents in the context of the same query, we can finally write

$$\mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left(\log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}\right), \tag{3.12}$$

which is a key expression for ranking computation in the probabilistic model.

Since we do not know the set R at the beginning, it is necessary to devise a method for initially computing the probabilities P(k_i|R) and P(k_i|R̄). There are many alternatives for such computation. We discuss a couple of them below.

In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as (a) assume that P(k_i|R) is constant for all index terms k_i (typically, equal to 0.5) and (b) assume that the distribution of index terms among the nonrelevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

$$P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}, \tag{3.13}$$

where, as already defined, n_i is the number of documents which contain the index term k_i and N is the total number of documents in the collection. Given this initial guess, we can then retrieve documents which contain query terms and provide an initial probabilistic ranking for them. After that, this initial ranking is improved as follows.

Let V be a subset of the documents initially retrieved and ranked by the probabilistic model. Such a subset can be defined, for instance, as the top r ranked documents, where r is a previously defined threshold. Further, let V_i be the subset of V composed of the documents in V which contain the index term k_i. For simplicity, we also use V and V_i to refer to the number of elements in these sets (it should always be clear when the used variable refers to the set or to the number of elements in it). For improving the probabilistic ranking, we need to improve our guesses for P(k_i|R) and P(k_i|R̄). This can be accomplished with the following assumptions: (a) we can approximate P(k_i|R) by the distribution of the index term k_i among the documents retrieved so far and (b) we can approximate P(k_i|R̄) by considering that all the non-retrieved documents are not relevant. With these assumptions, we can write

$$P(k_i \mid R) = \frac{V_i}{V}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}. \tag{3.14}$$

This process can then be repeated recursively. By doing so, we are able to improve on our guesses for the probabilities P(k_i|R) and P(k_i|R̄) without any assistance from a human subject (contrary to the original idea). However, we can also use assistance from the user for the definition of the subset V, as originally conceived.

The last formulas for P(k_i|R) and P(k_i|R̄) pose problems for small values of V and V_i which arise in practice (such as V = 1 and V_i = 0). To circumvent these problems, an adjustment factor is often added, which yields

$$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}. \tag{3.15}$$

An adjustment factor which is constant and equal to 0.5 is not always satisfactory. An alternative is to take the fraction n_i/N as the adjustment factor, which yields

$$P(k_i \mid R) = \frac{V_i + \frac{n_i}{N}}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}. \tag{3.16}$$

This completes our discussion of the probabilistic model.
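Putting the pieces together, the following minimal sketch ranks a toy collection with Eq. (3.12), using the initial estimates of Eq. (3.13) and one round of the smoothed re-estimation of Eq. (3.15) with the top-r documents taken as the set V; the documents, query, and binary set representation are illustrative assumptions, not a prescribed implementation.

```python
import math

docs = {
    "d1": {"mobile", "data", "privacy"},
    "d2": {"data", "outsourcing", "cloud"},
    "d3": {"encryption", "cloud"},
    "d4": {"mobile", "internet"},
}
N = len(docs)
query = {"mobile", "privacy"}
n = {t: sum(1 for terms in docs.values() if t in terms) for t in query}

def score(doc_terms, p_rel, p_nonrel):
    """Sum of log-odds contributions for query terms present in the document (Eq. 3.12)."""
    s = 0.0
    for t in query:
        if t in doc_terms:
            s += math.log(p_rel[t] / (1 - p_rel[t]))
            s += math.log((1 - p_nonrel[t]) / p_nonrel[t])
    return s

# Initial guess (Eq. 3.13): P(k_i|R) = 0.5, P(k_i|R_bar) = n_i / N.
p_rel = {t: 0.5 for t in query}
p_nonrel = {t: n[t] / N for t in query}
ranking = sorted(docs, key=lambda d: score(docs[d], p_rel, p_nonrel), reverse=True)

# Refinement (Eq. 3.15): take the top-r documents as V and re-estimate.
r = 2
V = ranking[:r]
V_i = {t: sum(1 for d in V if t in docs[d]) for t in query}
p_rel = {t: (V_i[t] + 0.5) / (r + 1) for t in query}
p_nonrel = {t: (n[t] - V_i[t] + 0.5) / (N - r + 1) for t in query}
ranking = sorted(docs, key=lambda d: score(docs[d], p_rel, p_nonrel), reverse=True)
print(ranking)
```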

The main advantage of the probabilistic model, in theory, is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages include (1) the need to guess the initial separation of documents into relevant and nonrelevant sets, (2) the fact that the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary), and (3) the adoption of the independence assumption for index terms. However, as discussed for the vector model, it is not clear that independence of index terms is a bad assumption in practical situations.

3.3.5. Brief comparison of classic models

In general, the Boolean model is considered to be the weakest classic method. Its main problem is the inability to recognize partial matches which frequently leads to poor performance. There is some controversy as to whether the probabilistic model outperforms the vector model. Croft performed some experiments and suggested that the probabilistic model provides a better retrieval performance. However, experiments done afterward by Salton and Buckley refute that claim. Through several different measures, Salton and Buckley showed that the vector model is expected to outperform the probabilistic model with general collections. This also seems to be the dominant thought among researchers, practitioners, and the Web community, where the popularity of the vector model runs high.

3.4. Text retrieval

3.4.1. Formats

There is no single format for a text document, and an IR system should be able to retrieve information from many of them. In the past, IR systems would convert a document to an internal format. However, that has many disadvantages, because the original application related to the document is not useful anymore. On top of that, we cannot change the contents of a document. Current IR systems have filters that can handle most popular documents, in particular those of word processors with some binary syntax such as Word, WordPerfect, or FrameMaker. Even then, good filters might not be possible if the format is proprietary and its details are not public. This is not the case for full ASCII syntax, as in TeX documents. Although documents can be in a binary format (e.g., parts of a Word document), documents that are represented in human-readable ASCII form imply more portability and are easier to modify (e.g., they can be edited with different applications).

Other text formats were developed for document interchange. Among these we should mention the rich text format (RTF), which is used by word processors and has ASCII syntax. Other important formats were developed for displaying or printing documents. The most popular ones are the portable document format (PDF) and PostScript (which is a powerful programming language for drawing). Other interchange formats are used to encode electronic mail, for example, multipurpose internet mail exchange (MIME). MIME supports multiple character sets, multiple languages, and multiple media.

On top of these formats, nowadays many files are compressed. Text compression is treated in detail elsewhere; here we only comment on the most popular compression software and associated formats. These include compress (Unix), ARJ (PCs), and ZIP (e.g., gzip in Unix and WinZip in Windows). Other tools allow us to convert binary files, in particular compressed text, to ASCII text such that it can be transmitted through a communication line using only seven bits. Examples of these tools are uuencode/uudecode and BinHex.

3.4.2. Modeling natural language

Text is composed of symbols from a finite alphabet. We can divide the symbols into two disjoint subsets: symbols that separate words and symbols that belong to words. It is well known that symbols are not uniformly distributed. If we consider just letters (a to z), we observe that vowels are usually more frequent than most consonants. For example, in English, the letter “e” has the highest frequency. A simple model to generate text is the binomial model. In it, each symbol is generated with a certain probability. However, natural language has a dependency on previous symbols. For example, in English, the letter “f” cannot appear after the letter “c,” and vowels or certain consonants have a higher probability of occurring. Therefore, the probability of a symbol depends on previous symbols. We can use a finite context or Markovian model to reflect this dependency. The model can consider one, two, or more letters to generate the next symbol. If we use k letters, we say that it is a k-order model (so the binomial model is a 0-order model). We can use these models taking words as symbols. For example, text generated by a five-order model using the distribution of words in the Bible might make sense (i.e., it can be grammatically correct) but will be different from the original. More complex models include finite-state models (which define regular languages) and grammar models (which define context-free and other languages). However, finding the right grammar for natural language is still a difficult open problem.

The next issue is how the different words are distributed inside each document. An approximate model is Zipf’s law, which attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words in the text. The rule states that the frequency of the i-th most frequent word is 1/i^θ times that of the most frequent word. This implies that in a text of n words with a vocabulary of V words, the i-th most frequent word appears n/(i^θ H_V(θ)) times, where H_V(θ) is the harmonic number of order θ of V, defined as

$$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^\theta}, \tag{3.17}$$

so that the sum of all frequencies is n. The left side of Figure 3.3 illustrates the distribution of frequencies, considering that the words are arranged in decreasing order of their frequencies. The value of θ depends on the text. In the simplest formulation, θ = 1, and therefore H_V(θ) = O(log n). However, this simplified version is very inexact, and the case θ > 1 (more precisely, between 1.5 and 2.0) fits the real data better. This case is very different, since the distribution is much more skewed, and H_V(θ) = O(1). Experimental data suggest that a better model is k/(c + i)^θ, where c is an additional parameter and k is such that all frequencies add up to n. This is called a Mandelbrot distribution.

Figure 3.3.

Distribution of sorted word frequencies (left) and size of the vocabulary (right).
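A small sketch of the word-frequency models just described: it predicts the number of occurrences of the i-th most frequent word under Zipf’s law and under the Mandelbrot variant. The parameters n, V, θ, and c used here are illustrative picks, not values fitted to any real collection.

```python
def harmonic(V, theta):
    """H_V(theta): generalized harmonic number of Eq. (3.17)."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))

def zipf_frequency(i, n, V, theta=1.5):
    """Predicted number of occurrences of the i-th most frequent word (Zipf's law)."""
    return n / (i ** theta * harmonic(V, theta))

def mandelbrot_frequency(i, n, V, theta=1.5, c=2.0):
    """Mandelbrot variant k / (c + i)^theta, with k chosen so the frequencies sum to n."""
    k = n / sum(1.0 / (c + j) ** theta for j in range(1, V + 1))
    return k / (c + i) ** theta

n_words, vocab = 1_000_000, 50_000
for rank in (1, 10, 100, 1000):
    print(rank, round(zipf_frequency(rank, n_words, vocab)),
          round(mandelbrot_frequency(rank, n_words, vocab)))
```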

Since the distribution of words is very skewed (i.e., there are a few hundred words which take up 50% of the text), words that are too frequent, such as stopwords, can be disregarded. A stopword is a word which does not carry meaning in natural language and therefore can be ignored (i.e., made not searchable), such as “a,” “the,” “by,” etc. Fortunately, the most frequent words are stopwords, and, therefore, half of the words appearing in a text do not need to be considered. This allows us, for instance, to significantly reduce the space overhead of indices for natural language texts. For example, the most frequent words in the TREC-2 collection are ‘the,” “of,” “and,” “a,” “to,” and “in.”

Another issue is the distribution of words in the documents of a collection. A simple model is to consider that each word appears the same number of times in every document. However, this is not true in practice. A better model is to consider a negative binomial distribution, which says that the fraction of documents containing a word k times is

$$F(k) = \binom{\alpha + k - 1}{k} \frac{p^k}{(1 + p)^{\alpha + k}}, \tag{3.18}$$

where p and α are parameters that depend on the word and the document collection. For example, for the Brown Corpus [94] and the word “said,” we have p = 9.24 and α = 0.42 [95]. The latter reference gives other models derived from a Poisson distribution.
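Assuming the standard negative binomial form of Eq. (3.18), with the generalized binomial coefficient evaluated through the gamma function, the fraction of documents containing a word exactly k times can be sketched as follows, using the parameters quoted above for “said” in the Brown Corpus.

```python
import math

def neg_binomial_fraction(k, p, alpha):
    """F(k) of Eq. (3.18): fraction of documents containing a given word k times."""
    # Generalized binomial coefficient C(alpha + k - 1, k) via the log-gamma function.
    log_coeff = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_coeff + k * math.log(p) - (alpha + k) * math.log(1 + p))

# Parameters quoted in the text for the word "said" in the Brown Corpus.
for k in range(4):
    print(k, round(neg_binomial_fraction(k, p=9.24, alpha=0.42), 4))
```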

The next issue is the number of distinct words in a document. This set of words is referred to as the document vocabulary. To predict the growth of the vocabulary size in natural language text, we use the so-called Heaps’ law [96]. This is a very precise law which states that the vocabulary of a text of size n words is of size V = K n^β = O(n^β), where K and β depend on the particular text. The right side of Figure 3.3 illustrates how the vocabulary size varies with the text size. K is normally between 10 and 100, and β is a positive value less than one. Some experiments on the TREC-2 collection show that the most common values for β are between 0.4 and 0.6. Hence, the vocabulary of a text grows sublinearly with the text size, in a proportion close to its square root.
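A short sketch of Heaps’ law as stated above: it predicts the vocabulary size for a given text size (the values of K and β below are illustrative picks from the typical ranges quoted in the text) and, for comparison, measures the actual vocabulary growth of a token stream.

```python
def heaps_vocabulary(n_words, K=30.0, beta=0.5):
    """Predicted vocabulary size V = K * n^beta (Heaps' law)."""
    return K * n_words ** beta

def measured_vocabulary_growth(tokens, step=1000):
    """Vocabulary size of growing prefixes of a token list, for comparison with the law."""
    seen, growth = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth

for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10} words -> about {heaps_vocabulary(n):,.0f} distinct words")
```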

Notice that the set of different words of a language is fixed by a constant (e.g., the number of different English words is finite). However, the limit is so high that it is much more accurate to assume that the size of the vocabulary is O(n^β) instead of O(1), although the number should stabilize for huge enough texts. On the other hand, many authors argue that the number keeps growing anyway because of typing or spelling errors.

Heaps’ law also applies to collections of documents because, as the total text size grows, the predictions of the model become more accurate. Furthermore, this model is also valid for the World Wide Web.

The last issue is the average length of words. This relates the text size in words with the text size in bytes (without accounting for punctuation and other extra symbols). For example, in the different subcollections of the TREC-2 collection, the average word length is very close to five letters, and the range of variation of this average in each subcollection is small (from 4.8 to 5.3 letters). If the stopwords were removed, the average length of a word would increase to between 6 and 7 letters. If we take only the words of the vocabulary, the average length is higher (about 8 or 9). This defines the total space needed for the vocabulary.

Heaps’ law implies that the length of the words in the vocabulary increases logarithmically with the text size and, thus, that longer and longer words should appear as the text grows. However, in practice, the average length of the words in the overall text is constant because shorter words are common enough (e.g., stopwords). This balance between short and long words, such that the average word length remains constant, has been noticed many times in different contexts and can also be explained by a finite-state model in which (a) the space character has a probability close to 0.2, (b) the space character cannot appear twice subsequently, and (c) there are 26 letters. This simple model is consistent with Zipf’s and Heaps’ laws.

The models presented in this section, in particular Zipf’s and Heaps’ laws, are used again in later chapters.

3.4.3. Similarity models

In this section, we define notions of syntactic similarity between strings and documents. Similarity is measured by a distance function. For example, if we have strings of the same length, we can define the distance between them as the number of positions that have different characters. Then, the distance is 0 if they are equal. This is called the Hamming distance. A distance function should also be symmetric (i.e., the order of the arguments does not matter) and should satisfy the triangle inequality (i.e., distance(a, c) ≤ distance(a, b) + distance(b, c)).

An important distance over strings is the edit or Levenshtein distance mentioned earlier. The edit distance is defined as the minimum number of character insertions, deletions, and substitutions that we need to perform on either of the strings to make them equal. For instance, the edit distance between “color” and “colour” is one, while the edit distance between “survey” and “surgery” is two. The edit distance is considered superior to other more complex methods, such as the Soundex system, which is based on phonetics [97], for modeling syntactic errors. Extensions to the concept of edit distance include different weights for each operation, adding transpositions, etc.

There are other measures. For example, assume that we are comparing two given strings and the only operation allowed is deletion of characters. Then, after all non-common characters have been deleted, the remaining sequence of characters (not necessarily contiguous in the original string, but in the same order) is the longest common subsequence (LCS) of both strings. For example, the LCS of “survey” and “surgery” is “surey.”

Similarity can be extended to documents. For example, we can consider lines as single symbols and compute the longest common subsequence of lines between two files. This is the measure used by the diff command in Unix-like operating systems. The main problem with this approach is that it is very time-consuming and does not consider lines that are similar. The latter drawback can be fixed by taking a weighted edit distance between lines or by computing the LCS over all the characters. Other solutions include extracting fingerprints (any piece of text that in some sense characterizes it) for the documents and comparing them or finding large repeated pieces. There are also visual tools to see document similarity. For example, Dotplot draws a rectangular map where both coordinates are file lines and the entry for each coordinate is a gray pixel that depends on the edit distance between the associated lines.
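The string measures of this section fit in a few lines of dynamic programming; the examples below reproduce the distances quoted above for “color”/“colour” and “survey”/“surgery”.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = curr
    return prev[-1]

def lcs(a: str, b: str) -> str:
    """Longest common subsequence, obtained by deleting the non-common characters."""
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + ca
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1], key=len)
    return table[-1][-1]

print(edit_distance("color", "colour"))    # 1
print(edit_distance("survey", "surgery"))  # 2
print(lcs("survey", "surgery"))            # surey
```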

3.5. Document preprocessing

Document preprocessing is a procedure which can be divided mainly into five text operations (or transformations):

  1. Lexical analysis of the text with the objective of treating digits, hyphens, punctuation marks, and the case of letters.

  2. Elimination of stopwords with the objective of filtering out words with very low discrimination values for retrieval purposes.

  3. Stemming of the remaining words with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).

  4. Selection of index terms to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact, noun words frequently carry more semantics than adjectives, adverbs, and verbs.

  5. Construction of term categorization structures such as a thesaurus, or extraction of structure directly represented in the text, for allowing the expansion of the original query with related terms (a usually useful procedure).

In the following, each of these phases is discussed in detail. But, before proceeding, let us take a look at the logical view of the documents which results after each of the above phases is completed, as illustrated in Figure 3.4 . As already discussed, by aggregating the preprocessing phases, we are able to move the logical view of the documents (adopted by the system) from that of a full text to that of a set of high-level indexing terms.

Figure 3.4.

Logical view of a document throughout the various phases of text preprocessing.

3.5.1. Lexical analysis of the text

Lexical analysis is the process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms). Thus, one of the major objectives of the lexical analysis phase is the identification of the words in the text. At first glance, all that seems to be involved is the recognition of spaces as word separators (in which case, multiple spaces are reduced to one space). However, there is more to it than this. For instance, the following four particular cases have to be considered with care [98]: digits, hyphens, punctuation marks, and the case of the letters (lower and upper case).

Numbers are usually not good index terms because, without a surrounding context, they are inherently vague. For instance, consider that a user is interested in documents about the number of deaths due to car accidents between the years 1910 and 1989. Such a request could be specified as the set of index terms: deaths, car, accidents, years, 1910, and 1989. However, the presence of the numbers 1910 and 1989 in the query could lead to the retrieval, for instance, of a variety of documents which refer to either of these two years. The problem is that numbers by themselves are just too vague. Thus, in general, it is wise to disregard numbers as index terms. However, we also have to consider that digits might appear mixed within a word. For instance, “510 B.C.” is clearly an important index term, and in this case it is not obvious what rule should be applied. Furthermore, a sequence of 16 digits identifying a credit card number might be highly relevant in a given context and, in this case, should be considered as an index term. A preliminary approach for treating digits in the text might be to remove all words containing sequences of digits unless specified otherwise (through regular expressions). Further, an advanced lexical analysis procedure might perform some date and number normalization to unify formats.
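
As a rough illustration of such a preliminary rule, the sketch below (our own, with hypothetical patterns) drops tokens that consist only of digits while keeping mixed tokens such as “510B.C.”; a production lexical analyzer would refine this with exception lists and date/number normalization.

```python
import re

# Illustrative patterns only; a real analyzer would also handle dates, credit
# card numbers, and other exceptions specified through regular expressions.
HAS_LETTER = re.compile(r"[A-Za-z]")    # keep mixed tokens such as "510B.C." or "B-49"
PURE_NUMBER = re.compile(r"^\d+$")      # drop bare numbers such as "1910"

def filter_digits(tokens):
    """Drop tokens that are nothing but digits; keep tokens that mix digits and letters."""
    return [t for t in tokens if HAS_LETTER.search(t) or not PURE_NUMBER.match(t)]

print(filter_digits(["deaths", "car", "accidents", "1910", "1989", "510B.C."]))
# ['deaths', 'car', 'accidents', '510B.C.']
```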

Hyphens pose another difficult decision for the lexical analyzer. Breaking up hyphenated words might be useful due to inconsistency of usage. For instance, this allows treating “state-of-the-art” and “state of the art” identically. However, there are words which include hyphens as an integral part, for instance, gilt-edge, B-49, etc. Again, the most suitable procedure seems to be to adopt a general rule and specify the exceptions on a case-by-case basis.

Normally, punctuation marks are removed entirely in the process of lexical analysis. While some punctuation marks are an integral part of the word (for instance, “510 B.C.”), removing them does not seem to have an impact on retrieval performance because the risk of misinterpretation in this case is minimal. In fact, if the user specifies “510 B.C.” in his query, removal of the dots both in the query term and in the documents will not affect retrieval. However, very particular scenarios might again require the preparation of a list of exceptions. For instance, if a portion of program code appears in the text, it might be wise to distinguish between the variables “x.id” and “xid.” In this case, the dot mark should not be removed.

The case of letters is usually not important for the identification of index terms. As a result, the lexical analyzer normally converts all the text to either lower or upper case. However, once more, very particular scenarios might require the distinction to be made. For instance, when looking for documents which describe details about the command language of a Unix-like operating system, the user might explicitly desire the non-conversion of upper cases because this is the convention in the operating system. Further, part of the semantics might be lost due to case conversion. For instance, the words Bank and bank have different meanings—a fact common to many other pairs of words.

As pointed out by Fox, all these text operations can be implemented without difficulty. However, careful thought should be given to each one of them because they might have a profound impact at document retrieval time. This is particularly worrisome in those situations in which the user finds it difficult to understand what the indexing strategy is doing. Unfortunately, there is no clear solution to this problem. As already mentioned, some Web search engines are opting for avoiding text operations altogether because this simplifies the interpretation the user has of the retrieval task. Whether this strategy will be the one of choice in the long term remains to be seen.

3.5.2. Elimination of stopwords

As discussed in Chapter 2, words which are too frequent among the documents in the collection are not good discriminators. In fact, a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval. Such words are frequently referred to as stopwords and are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a list of stopwords.

Elimination of stopwords has an additional important benefit. It reduces the size of the indexing structure considerably. In fact, it is typical to obtain a compression in the size of the indexing structure (for instance, in the size of an inverted list, see Chapter 8) of 40% or more solely with the elimination of stopwords.

Since stopword elimination also provides for compression of the indexing structure, the list of stopwords might be extended to include words other than articles, prepositions, and conjunctions. For instance, some verbs, adverbs, and adjectives could be treated as stopwords. In [99], a list of 425 stopwords is illustrated. Programs in C for lexical analysis are also provided.

Despite these benefits, elimination of stopwords might reduce recall. For instance, consider a user who is looking for documents containing the phrase “to be or not to be.” Elimination of stopwords might leave only the term be, making it almost impossible to properly recognize the documents which contain the phrase specified. This is one additional reason for the adoption of a full-text index (i.e., insert all words in the collection into the inverted file) by some Web search engines.
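
A minimal sketch of stopword elimination is shown below; the stopword set is a tiny illustrative sample rather than a real list such as the 425-word list cited above, and the second call shows how the phrase “to be or not to be” collapses to almost nothing.

```python
# A tiny illustrative stopword set; real lists (such as the 425-word list
# cited above) are much larger.
STOPWORDS = {"to", "or", "not", "the", "of", "a", "and", "in"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the number of deaths due to car accidents".split()))
# ['number', 'deaths', 'due', 'car', 'accidents']
print(remove_stopwords("to be or not to be".split()))
# ['be', 'be'] -- only "be" survives, which illustrates the loss of recall
```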

3.5.3. Stemming

Frequently, the user specifies a word in a query, but only a variant of this word is present in a relevant document. Plurals, gerund forms, and past tense suffixes are examples of syntactical variations which prevent a perfect match between a query word and a respective document word. This problem can be partially overcome with the substitution of the words by their respective stems.

A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes). A typical example of a stem is the word connect which is the stem for the variants connected, connecting, connection, and connections. Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of reducing the size of the indexing structure because the number of distinct index terms is reduced.

While the argument supporting stemming seems sensible, there is controversy in the literature about the benefits of stemming for retrieval performance. In fact, different studies lead to rather conflicting conclusions. Frakes [99] compares eight distinct studies on the potential benefits of stemming. While he favors the usage of stemming, the results of the eight experimental studies he investigated do not allow us to reach a satisfactory conclusion. As a result of these doubts, many Web search engines do not adopt any stemming algorithm whatsoever.

Frakes distinguishes four types of stemming strategies: affix removal, table lookup, successor variety, and n-grams. Table lookup consists simply of looking for the stem of a word in a table. It is a simple procedure but one which is dependent on data on stems for the whole language. Since such data is not readily available and might require considerable storage space, this type of stemming algorithm might not be practical. Successor variety stemming is based on the determination of morpheme boundaries, uses knowledge from structural linguistics, and is more complex than affix removal stemming algorithms. N-grams stemming is based on the identification of digrams and trigrams and is more a term clustering procedure than a stemming one. Affix removal stemming is intuitive and simple and can be implemented efficiently. Thus, in the remainder of this section, we concentrate our discussion on algorithms for affix removal stemming only.

In affix removal, the most important part is suffix removal because most variants of a word are generated by the introduction of suffixes (instead of prefixes). While there are three or four well-known suffix removal algorithms, the most popular one is that by Porter because of its simplicity and elegance. Despite being simpler, the Porter algorithm yields results comparable to those of the more sophisticated algorithms.

The Porter algorithm uses a suffix list for suffix stripping. The idea is to apply a series of rules to the suffixes of the words in the text. For instance, the rule

s → ∅ (3.19)

is used to convert plural forms into their respective singular forms by substituting the letter s by nil. Notice that to identify the suffix we must examine the last letters in the word. Furthermore, we look for the longest sequence of letters which matches the left-hand side in a set of rules. Thus, application of the two following rules

sses → ss
s → ∅ (3.20)

to the word stresses yields the stem stress instead of the stem stresse. By separating such rules into five distinct phases, the Porter algorithm is able to provide effective stemming while running fast.
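
The following fragment (our own sketch, covering only the two rules above plus one more rule from the same phase) shows the longest-match behavior: because “sses” is tried before “s”, the word stresses is reduced to stress rather than stresse. A complete Porter stemmer applies five such phases of rules.

```python
# A fragment of the first phase of Porter-style suffix rules. The left-hand
# sides are tried longest first, so "sses" wins over "s" for "stresses".
RULES = [("sses", "ss"), ("ies", "i"), ("s", "")]

def apply_suffix_rules(word: str) -> str:
    for suffix, replacement in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(apply_suffix_rules("stresses"))  # 'stress', not 'stresse'
print(apply_suffix_rules("cats"))      # 'cat'
```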

3.5.4. Index terms selection

If a full-text representation of the text is adopted, then all words in the text are used as index terms. The alternative is to adopt a more abstract view in which not all words are used as index terms. This implies that the set of terms used as indices must be selected. In the area of bibliographic sciences, such a selection of index terms is usually done by a specialist. An alternative approach is to select candidates for index terms automatically.

Distinct automatic approaches for selecting index terms can be used. A good approach is the identification of noun groups (as done in the INQUERY system [73]) which we now discuss.

A sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives. While the words in each grammatical class are used with a particular purpose, it can be argued that most of the semantics is carried by the noun words. Thus, an intuitively promising strategy for selecting index terms automatically is to use the nouns in the text. This can be done through the systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns.

Since it is common to combine two or three nouns in a single component (e.g., computer science), it makes sense to cluster nouns which appear nearby in the text into a single indexing component (or concept). Thus, instead of simply using nouns as index terms, we adopt noun groups. A noun group is a set of nouns whose syntactic distance in the text (measured in terms of number of words between two nouns) does not exceed a predefined threshold (for instance, three).

When noun groups are adopted as indexing terms, we obtain a conceptual logical view of the documents in terms of sets of nonelementary index terms.
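
A simple way to realize this idea is sketched below (our own illustration). It assumes the words have already been labeled by a part-of-speech tagger and clusters nouns whose distance in the text does not exceed a threshold of three words.

```python
def noun_groups(tagged_words, max_gap=3):
    """Cluster nouns whose distance in the text does not exceed max_gap words.

    `tagged_words` is a list of (word, part_of_speech) pairs; a real system
    would obtain the tags from a part-of-speech tagger.
    """
    groups, current, last_noun_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_words):
        if tag != "NOUN":
            continue
        if last_noun_pos is not None and pos - last_noun_pos > max_gap:
            groups.append(current)      # gap too large: close the current group
            current = []
        current.append(word)
        last_noun_pos = pos
    if current:
        groups.append(current)
    return groups

tagged = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
          ("department", "NOUN"), ("is", "VERB"), ("quietly", "ADV"),
          ("expanding", "VERB"), ("its", "DET"), ("research", "NOUN"),
          ("budget", "NOUN")]
print(noun_groups(tagged))
# [['computer', 'science', 'department'], ['research', 'budget']]
```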

3.5.5. Text compression

Text compression is about finding ways to represent the text in fewer bits or bytes. The amount of space required to store text on computers can be reduced significantly using compression techniques. Compression methods create a reduced representation by identifying and using structures that exist in the text. From the compressed version, the original text can be reconstructed exactly.

Text compression is becoming an important issue in an information retrieval environment. The widespread use of digital libraries, office automation systems, document databases, and the Web has led to an explosion of textual information available online. In this scenario, text compression appears as an attractive option for reducing costs associated with space requirements, input/output (I/O) overhead, and communication delays. The gain obtained from compressing text is that it requires less storage space, it takes less time to be transmitted over a communication link, and it takes less time to search directly the compressed text. The price paid is the time necessary to code and decode the text.

A major obstacle for storing text in compressed form is the need for IR systems to access text randomly. To access a given word in a compressed text, it is usually necessary to decode the entire text from the beginning until the desired word is reached. It could be argued that a large text could be divided into blocks that are compressed independently, thus allowing fast random access to each block. However, efficient compression methods need to process some text before making compression effective (usually more than 10 kilobytes). The smaller the blocks, the less effective compression is expected to be.

Our discussion here focuses on text compression methods which are suitable for use in an IR environment. For instance, a successful idea aimed at merging the requirements of compression algorithms and the needs of IR systems is to consider that the symbols to be compressed are words and not characters (character-based compression is the more conventional approach). Words are the atoms on which most IR systems are built. Moreover, it is now known that much better compression is achieved by taking words as symbols (instead of characters). Further, new word-based compression methods allow random access to words within the compressed text which is a critical issue for an IR system.

Besides the economy of space obtained by a compression method, there are other important characteristics to be considered such as compression and decompression speed. In some situations, decompression speed is more important than compression speed. For instance, this is the case with textual databases in which it is common to compress the text once and to read it many times from disk.

Another important characteristic of a compression method is the possibility of performing compressed pattern matching, defined as the task of performing pattern matching in a compressed text without decompressing it. In this case, sequential searching can be speeded up by compressing the search key rather than decoding the compressed text being searched. As a consequence, it is possible to search faster on compressed text because much less text has to be scanned. Chapter 8 presents efficient methods to deal with searching the compressed text directly.

When the text collection is large, efficient text retrieval requires specialized indexing techniques. A simple and popular indexing structure for text collections is the inverted file. Inverted files (see Chapter 8 for details) are especially adequate when the pattern to be searched for is formed by simple words. Since this is a common type of query (for instance, when searching the Web), inverted files are widely used for indexing large text collections.

An inverted file is typically composed of (a) a vector containing all the distinct words in the text collection (which is called the vocabulary) and (b) for each word in the vocabulary a list of all documents (identified by document numbers) in which that word occurs. Because each list of document numbers (within the inverted file) is organized in ascending order, specific compression methods have been proposed for them, leading to very efficient index compression schemes. This is important because query processing time is highly related to index access time. Thus, in this section, we also discuss some of the most important index compression techniques.
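
To preview why ascending document numbers compress so well, the sketch below (an illustration of one common family of techniques, not necessarily the exact scheme referred to above) stores the gaps between consecutive docIDs and encodes each gap with a simple variable-byte code, so that small gaps occupy a single byte.

```python
def vbyte_encode(numbers):
    """Encode positive integers with a simple variable-byte scheme: seven data
    bits per byte, with the high bit set on the last byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk.reverse()                      # most significant 7-bit group first
        chunk[-1] |= 0x80                    # mark the final byte of this number
        out.extend(chunk)
    return bytes(out)

def compress_postings(doc_ids):
    """Turn an ascending docID list into gaps, then variable-byte encode the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

postings = [33, 47, 154, 159, 202]           # ascending document numbers
print(len(compress_postings(postings)))      # 5 bytes instead of 20 with 4-byte integers
```

Since frequent words have dense postings lists, their gaps are small and the index shrinks considerably.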

We first introduce basic concepts related to text compression. We then present some of the most important statistical compression methods, followed by a brief review of compression methods based on a dictionary. At the end, we discuss the application of compression to inverted files.

There are two general approaches to text compression: statistical and dictionary based. Statistical methods rely on generating good probability estimates (of appearance in the text) for each symbol. The more accurate the estimates are, the better the compression obtained. A symbol here is usually a character, a text word, or a fixed number of characters. The set of all possible symbols in the text is called the alphabet. The task of estimating the probability of each next symbol is called modeling. A model is essentially a collection of probability distributions, one for each context in which a symbol can be coded. Once these probabilities are available, the symbols are converted into binary digits, a process called coding. In practice, both the encoder and decoder use the same model. The decoder interprets the output of the encoder (with reference to the same model) to find out the original symbol.

There are two well-known statistical coding strategies: Huffman coding and arithmetic coding. The idea of Huffman coding is to assign a variable-length bit encoding to each different symbol of the text. Compression is achieved by assigning a smaller number of bits to symbols with higher probabilities of appearance. Huffman coding was first proposed in the early 1950s and was the most important compression method until the late 1970s, when arithmetic coding made higher compression rates possible.
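
The sketch below (ours) builds a Huffman code from symbol frequencies using a priority queue; symbols are characters here, but exactly the same construction applies when the symbols are words.

```python
import heapq
import itertools
from collections import Counter

def huffman_code(text):
    """Build a prefix code that assigns shorter bit strings to more frequent symbols."""
    tie = itertools.count()              # unique tiebreaker so heapq never compares the dicts
    heap = [[freq, next(tie), {sym: ""}] for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, [f1 + f2, next(tie), merged])
    return heap[0][2]

text = "for each rose, a rose is a rose"
code = huffman_code(text)
bits = "".join(code[ch] for ch in text)
print(len(bits) / len(text))             # average bits per character, well below 8
```

Replacing the characters with the words of the text (e.g., Counter(text.split())) gives the word-based variant discussed later in this section.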

Arithmetic coding computes the code incrementally, one symbol at a time, as opposed to the Huffman coding scheme, in which each different symbol is pre-encoded using a fixed sequence of bits. The incremental nature does not allow decoding a string which starts in the middle of a compressed file. To decode a symbol in the middle of a file compressed with arithmetic coding, it is necessary to decode the whole text from the very beginning until the desired word is reached. This characteristic makes arithmetic coding inadequate for use in an IR environment.

Dictionary methods substitute a sequence of symbols by a pointer to a previous occurrence of that sequence. The pointer representations are references to entries in a dictionary composed of a list of symbols (often called phrases) that are expected to occur frequently. Pointers to the dictionary entries are chosen so that they need less space than the phrase they replace, thus obtaining compression. The distinction between modeling and coding does not exist in dictionary methods, and there are no explicit probabilities associated to phrases. The most well-known dictionary methods are represented by a family of methods, known as the Ziv-Lempel family.

Character-based Huffman methods are typically able to compress English texts to approximately five bits per character (usually, each uncompressed character takes 7–8 bits to be represented). More recently, a word-based Huffman method has been proposed as a better alternative for natural language texts. This method is able to reduce English texts to just over two bits per character. As we will see later on, word-based Huffman coding achieves compression rates close to the entropy and allows random access to intermediate points in the compressed text. Ziv-Lempel methods are able to reduce English texts to fewer than four bits per character. Methods based on arithmetic coding can also compress English texts to just over two bits per character. However, the price paid is slower compression and decompression, and the impossibility of randomly accessing intermediate points in the compressed text.

Before proceeding, let us present an important definition which will be useful from now on.

Definition 3.7. Compression ratio is the size of the compressed file as a fraction of the uncompressed file.

3.6. Index construction and compression

3.6.1. Inverted files

An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the “occurrences” ( Figure 3.5 shows an example). These positions can refer to words or characters. Word positions (i.e., position i refers to the ith word) simplify phrase and proximity queries, while character positions (i.e., the position i is the ith character) facilitate direct access to the matching text positions.

Figure 3.5.

A sample text and an inverted index built on it. The words are converted to lower case, and some are not indexed. The occurrences point to character positions in the text.
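
A minimal in-memory version of this structure can be sketched as follows (our own illustration; it records word positions, whereas the index in Figure 3.5 uses character positions).

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each lower-cased word to the list of word positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].append(position)
    return index

sample = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(sample)
print(sorted(index))      # the vocabulary
print(index["words"])     # the occurrences of "words" (word positions): [8, 9]
```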

Some authors make the distinction between inverted files and inverted lists. In an inverted file, each element of a list points to a document or file name, while inverted lists match our definition. We prefer not to make such a distinction because, as we will see later, this is a matter of the addressing granularity, which can range from text positions to logical blocks.

The space required for the vocabulary is rather small. According to Heaps’ law, the vocabulary grows as O(n^β), where β is a constant between 0 and 1 dependent on the text, being between 0.4 and 0.6 in practice. For instance, for 1 Gb of the TREC-2 collection, the vocabulary has a size of only 5 Mb. This may be further reduced by stemming and other normalization techniques as described.
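
The sketch below applies Heaps’ law with the illustrative constants assumed later for Table 3.1 (K = 30, β = 0.5); real collections require empirically fitted parameters.

```python
# Heaps' law: V = K * n**beta. The constants below (K = 30, beta = 0.5) are the
# illustrative values assumed later for Table 3.1; real collections need
# empirically fitted parameters.
def heaps_vocabulary(n_chars, K=30, beta=0.5):
    return int(K * n_chars ** beta)

for size_mb in (1, 200, 2048):
    n = size_mb * 2**20                      # text size in characters (bytes)
    print(f"{size_mb:>5} Mb of text -> about {heaps_vocabulary(n):,} distinct words")
```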

The occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n). Even omitting stopwords (which is the default practice when words are indexed), in practice the space overhead of the occurrences is between 30 and 40% of the text size.

To reduce space requirements, a technique called block addressing is used. The text is divided into blocks, and the occurrences point to the blocks where the word appears (instead of the exact positions). The classical indices which point to the exact occurrences are called “full inverted indices.” Using block addressing, not only can the pointers be smaller because there are fewer blocks than positions, but also all the occurrences of a word inside a single block are collapsed to one reference (see Figure 3.6 ). Indices of only 5% overhead over the text size are obtained with this technique. The price to pay is that, if the exact occurrence positions are required (for instance, for a proximity query), then an online search over the qualifying blocks has to be performed. For instance, block addressing indices with 256 blocks stop working well with texts of 200 Mb.

Figure 3.6.

The sample text split into four blocks, and an inverted index built on it using block addressing. The occurrences denote block numbers. Notice that both occurrences of “words” collapsed into one reference.

Table 3.1 presents the projected space taken by inverted indices for texts of different sizes, with and without the use of stopwords. The full inversion stands for inverting all the words and storing their exact positions, using four bytes per pointer. The document addressing index assumes that we point to documents which are of size 10 kb (and the necessary number of bytes per pointer, that is, one, two, or three bytes, depending on text size). The block addressing index assumes that we use 256 or 64K blocks (one or two bytes per pointer) independently of the text size. The space taken by the pointers can be significantly reduced by using compression. We assume that 45% of all the words are stopwords and that there is one non-stopword every 11.5 characters. Our estimation for the vocabulary is based on Heaps’ law with parameters V = 30·n^0.5. All these decisions were taken according to our experience and experimentally validated.

| Index | Small collection (1 Mb) | Medium collection (200 Mb) | Large collection (2 Gb) |
|---|---|---|---|
| Addressing words | 45% / 73% | 36% / 64% | 35% / 63% |
| Addressing documents | 19% / 26% | 18% / 32% | 26% / 47% |
| Addressing 64K blocks | 27% / 41% | 18% / 32% | 5% / 9% |
| Addressing 256 blocks | 18% / 25% | 1.7% / 2.4% | 0.5% / 0.7% |

Table 3.1.

Sizes of an inverted file as approximate percentages of the size of the whole text collection. Four granularities and three collections are considered. For each collection, the left column assumes that stopwords are not indexed, while the right column assumes that all words are indexed.

The blocks can be of fixed size (imposing a logical block structure over the text database), or they can be defined using the natural division of the text collection into files, documents, Web pages, or others. The division into blocks of fixed size improves efficiency at retrieval time: the greater the variance in block sizes, the more text has to be traversed sequentially on average, because larger blocks match queries more frequently and are more expensive to traverse.

Alternatively, the division using natural cuts may eliminate the need for online traversal. For example, if one block per retrieval unit is used and the exact match positions are not required, there is no need to traverse the text for single-word queries, since it is enough to know which retrieval units to report. But if, on the other hand, many retrieval units are packed into a single block, the block has to be traversed to determine which units to retrieve.

It is important to notice that in order to use block addressing, the text must be readily available at search time. This is not the case for remote text (as in Web search engines) or if the text is in a CD-ROM that has to be mounted, for instance. Some restricted queries not needing exact positions can still be solved if the blocks are retrieval units.


3.6.2. Blocked sort-based indexing

We first review the basic steps in constructing a non-positional index. We make a pass through the collection assembling all term-docID pairs. We then sort the pairs with the term as the dominant key and docID as the secondary key. Finally, we organize the docIDs for each term into a postings list and compute statistics like term and document frequency. For small collections, all this can be done in memory. In this chapter, we describe methods for large collections that require the use of secondary storage.

To make index construction more efficient, we represent terms as termIDs, where each termID is a unique serial number. We can build the mapping from terms to termIDs on the fly while we are processing the collection, or, in a two-pass approach, we compile the vocabulary in the first pass and construct the inverted index in the second pass. The index construction algorithms described in this chapter all do a single pass through the data. Section 4.7 gives references to multipass algorithms that are preferable in certain applications, for example, when disk space is scarce.

We work with the Reuters Corpus Volume I (RCV1) collection as our model collection in this chapter, a collection with roughly 1 GB of text. It consists of about 800,000 documents that were sent over the Reuters newswire during a 1-year period between August 20, 1996, and August 19, 1997. A typical document is shown in Figure 3.7 , but note that we ignore multimedia information like images in this book and are only concerned with text. Reuters Corpus Volume I (RCV1) covers a wide range of international topics, including politics, business, sports, and (as in this example) science. Some key statistics of the collection are shown in Table 3.2 .

Figure 3.7.

Document from the Reuters newswire.

| Symbol | Statistic | Value |
|---|---|---|
| N | Documents | 800,000 |
| L | Avg. number of tokens per document | 200 |
| M | Terms | 400,000 |
|  | Avg. number of bytes per token (incl. spaces/punct.) | 6 |
|  | Avg. number of bytes per token (without spaces/punct.) | 4.5 |
|  | Avg. number of bytes per term | 7.5 |
| T | Tokens | 100,000,000 |

Table 3.2.

Collection statistics for Reuters Corpus Volume I (RCV1).

Values are rounded for the computations in this book. The unrounded values are 806,791 documents, 222 tokens per document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation, 4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and 96,969,056 tokens.

Reuters Corpus Volume I (RCV1) has 100 million tokens. Collecting all termID-docID pairs of the collection using 4 bytes each for termID and docID therefore requires 0.8 GB of storage. Typical collections today are often one or two orders of magnitude larger than Reuters Corpus Volume I (RCV1). You can easily see how such collections overwhelm even large computers if we try to sort their termID-docID pairs in memory. If the size of the intermediate files during index construction is within a small factor of available memory, then the compression techniques introduced in Chapter 5 can help; however, the postings file of many large collections cannot fit into memory even after compression ( Figure 3.7 ).

With insufficient main memory, we need to use an external sorting algorithm, that is, one that uses disk. To achieve acceptable sorting efficiency, the central requirement of such an algorithm is that it minimizes the number of random disk seeks during sorting, because sequential disk reads are far faster than seeks, as we explained in Section 4.1. One solution is the blocked sort-based indexing algorithm or BSBI in Figure 3.8 . BSBI (i) segments the collection into parts of equal size, (ii) sorts the termID-docID pairs of each part in memory, (iii) stores intermediate sorted results on disk, and (iv) merges all intermediate results into the final index.

Figure 3.8.

Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, …, fn and the merged index in fmerged.

The algorithm parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK in Figure 3.8 ). We choose the block size to fit comfortably into memory to permit a fast in-memory sort. The block is then inverted and written to disk. Inversion involves two steps. First, we sort the termID-docID pairs. Next, we collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk. Applying this to Reuters Corpus Volume I (RCV1) and assuming we can fit 10 million termID-docID pairs into memory, we end up with ten blocks, each an inverted index of one part of the collection.

In the final step, the algorithm simultaneously merges the ten blocks into one large merged index. An example with two blocks is shown in Figure 3.9 , where we use di to denote the ith document of the collection. To do the merging, we open all block files simultaneously and maintain small read buffers for the ten blocks we are reading and a write buffer for the final merged index we are writing. In each iteration, we select the lowest termID that has not been processed yet using a priority queue or a similar data structure. All postings lists for this termID are read and merged, and the merged list is written back to disk. Each read buffer is refilled from its file when necessary.

Figure 3.9.

Merging in blocked sort-based indexing. Two blocks (postings lists to be merged) are loaded from disk into memory, merged in memory (merged postings lists), and written back to disk. We show terms instead of termIDs for better readability.
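
The sketch below (our own simplification, with an illustrative block size and temporary files standing in for the on-disk runs) follows the same outline: invert fixed-size blocks in memory, spill each inverted block to disk, and merge the sorted runs into the final index.

```python
import heapq
import pickle
import tempfile
from itertools import groupby

def invert_block(pairs):
    """Sort a block of (term, docID) pairs and group them into postings lists."""
    pairs.sort()
    return [(term, [doc for _, doc in group])
            for term, group in groupby(pairs, key=lambda p: p[0])]

def bsbi(pair_stream, block_size=4):
    """Blocked sort-based indexing: invert fixed-size blocks, spill each inverted
    block to a temporary file, then merge the sorted runs into one index."""
    runs, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:                 # block full: invert and write to disk
            run = tempfile.TemporaryFile()
            pickle.dump(invert_block(block), run)
            runs.append(run)
            block = []
    if block:                                        # leftover partial block
        run = tempfile.TemporaryFile()
        pickle.dump(invert_block(block), run)
        runs.append(run)

    def read(run):                                   # re-read one sorted run from disk
        run.seek(0)
        yield from pickle.load(run)

    merged = {}                                      # k-way merge of the sorted runs
    for term, postings in heapq.merge(*(read(run) for run in runs)):
        merged.setdefault(term, []).extend(postings)
    return merged

pairs = [("web", 1), ("data", 1), ("privacy", 2), ("data", 2), ("web", 3),
         ("mobile", 3), ("data", 3), ("privacy", 4), ("cloud", 4), ("web", 4),
         ("mobile", 5), ("data", 5)]
print(bsbi(pairs))
# {'cloud': [4], 'data': [1, 2, 3, 5], 'mobile': [3, 5], 'privacy': [2, 4], 'web': [1, 3, 4]}
```

Here heapq.merge plays the role of the priority-queue-driven merge described above.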

How expensive is BSBI? Its time complexity is Θ(T log T) because the step with the highest time complexity is sorting, and T is an upper bound for the number of items we must sort (i.e., the number of termID-docID pairs). But the actual indexing time is usually dominated by the time it takes to parse the documents (PARSENEXTBLOCK) and to do the final merge (MERGEBLOCKS). Exercise 4.6 asks you to compute the total index construction time for RCV1 that includes these steps as well as inverting the blocks and writing them to disk ( Figure 3.9 ).

Notice that Reuters Corpus Volume I (RCV1) is not particularly large in an age when one or more GB of memory is standard on personal computers. With appropriate compression, we could have created an inverted index for RCV1 in memory on a not overly beefy server. The techniques we have described are needed, however, for collections that are several orders of magnitude larger.

3.6.3. Single-pass in-memory indexing

Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing or SPIMI. SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.

The SPIMI algorithm is shown in Figure 3.10 . The part of the algorithm that parses documents and turns them into a stream of term-docID pairs, which we call tokens here, has been omitted. SPIMI-INVERT is called repeatedly on the token stream until the entire collection has been processed. Tokens are processed one by one (line 4) during each successive call of SPIMI-INVERT. When a term occurs for the first time, it is added to the dictionary (best implemented as a hash), and a new postings list is created (line 6). The call in line 7 returns this postings list for subsequent occurrences of the term.

Figure 3.10.

Inversion of a block in single-pass in-memory indexing.

A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10). Instead of first collecting all termID-docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows), and it is immediately available to collect postings. This has two advantages: it is faster because there is no sorting required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored. As a result, the blocks that individual calls of SPIMI-INVERT can process are much larger, and the index construction process as a whole is more efficient.

Because we do not know how large the postings list of a term will be when we first encounter it, we allocate space for a short postings list initially and double the space each time it is full (lines 8–9). This means that some memory is wasted, which counteracts the memory savings from the omission of termIDs in intermediate data structures. However, the overall memory requirements for the dynamically constructed index of a block in SPIMI are still lower than in BSBI.

When memory has been exhausted, we write the index of the block (which consists of the dictionary and the postings lists) to disk (line 12). We have to sort the terms (line 11) before doing this because we want to write postings lists in lexicographic order to facilitate the final merging step. If each block’s postings lists were written in unsorted order, merging blocks could not be accomplished by a simple linear scan through each block.

Each call of SPIMI-INVERT writes a block to disk, just as in BSBI. The last step of SPIMI (corresponding to line 7 in Figure 3.8 ; not shown in Figure 3.10 ) is then to merge the blocks into the final inverted index.
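
A compressed view of SPIMI-INVERT is sketched below (our own illustration; a posting count stands in for "memory is exhausted", and Python lists already implement the doubling of postings-list space described above).

```python
from itertools import islice

def spimi_invert(token_stream, max_postings=8):
    """Consume (term, docID) tokens until the in-memory block is 'full' (a posting
    count stands in for exhausted memory), then return the block's dictionary
    with its terms sorted, ready to be written to disk and merged later."""
    dictionary = {}
    for term, doc_id in islice(token_stream, max_postings):
        postings = dictionary.setdefault(term, [])   # first occurrence: create a postings list
        postings.append(doc_id)                      # add the posting directly; no termIDs, no sorting
    return dict(sorted(dictionary.items()))          # sort terms before writing the block

tokens = iter([("mobile", 1), ("data", 1), ("privacy", 1), ("data", 2),
               ("cloud", 2), ("mobile", 3), ("data", 3), ("privacy", 3),
               ("outsourcing", 4), ("data", 4)])
block1 = spimi_invert(tokens)   # first block: the first eight tokens
block2 = spimi_invert(tokens)   # second block: the remaining tokens
print(block1)  # {'cloud': [2], 'data': [1, 2, 3], 'mobile': [1, 3], 'privacy': [1, 3]}
print(block2)  # {'data': [4], 'outsourcing': [4]}
```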

In addition to constructing a new dictionary structure for each block and eliminating the expensive sorting step, SPIMI has a third important component: compression. Both the postings and the dictionary terms can be stored compactly on disk if we employ compression. Compression increases the efficiency of the algorithm further because we can process even larger blocks and because the individual blocks require less space on disk. We refer readers to the literature for this aspect of the algorithm.

The time complexity of SPIMI is Θ(T) because no sorting of tokens is required and all operations are at most linear in the size of the collection.

3.6.4. Dictionary compression

This section presents a series of dictionary data structures that achieve increasingly higher compression ratios. The dictionary is small compared with the postings file as suggested by Table 5.1. So, why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?

One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. Thus, the main goal of compressing the dictionary is to fit it in the main memory, or at least a large portion of it, to support high query throughput. Although dictionaries of very large collections fit into the memory of a standard desktop machine, this is not true of many other application scenarios. For example, an enterprise search server for a large corporation may have to index a multiterabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages. We also want to be able to design search systems for limited hardware such as mobile phones and onboard computers. Other reasons for wanting to conserve memory are fast start-up time and having to share resources with other applications. The search system on your PC must get along with the memory-hogging word processing suite you are using at the same time ( Figure 3.11 ).

Figure 3.11.

Storing the dictionary as an array of fixed-width entries.

3.7. XML retrieval

Information retrieval systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text, by which we mean raw text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title, and salary. There are fundamental differences between information retrieval and database systems in terms of retrieval model, data structures, and query languages, shown in Table 3.3 .

Some highly structured text search problems are most efficiently handled by a relational database (RDB), for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query “select lastname from employees where job_desc like ‘invoic%’” may be sufficient to satisfy your information need with high precision and recall.

However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging), and output from office suites like OpenOffice that save documents as marked-up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria. Examples of such queries are: give me a full-length article on fast Fourier transforms (digital libraries), give me patents whose claims mention RSA public-key encryption and that cite patent US 4,405,829, or give me articles about sightseeing tours of the Vatican and the Coliseum (entity-tagged text). These three queries are structured queries that cannot be answered well by an unranked retrieval system. As we argued in Example 1.1 (p. 15), unranked retrieval models like the Boolean model suffer from low recall. For instance, an unranked system would return a potentially large number of articles that mention the Vatican, the Coliseum, and sightseeing tours without ranking the ones that are most relevant for the query first. Most users are also notoriously bad at precisely stating structural constraints. For instance, users may not know for which structured elements the search system supports search. In our example, the user may be unsure whether to issue the query as sightseeing AND (COUNTRY:Vatican OR LANDMARK:Coliseum), as sightseeing AND (STATE:Vatican OR BUILDING:Coliseum), or in some other form. Users may also be completely unfamiliar with structured search and advanced search interfaces or unwilling to use them. In this chapter, we look at how ranked retrieval methods can be adapted to structured documents to address these problems.

There is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries.

We will only look at one standard for encoding structured documents, Extensible Markup Language or XML, which is currently the most widely used such standard. We will not cover the specifics that distinguish XML from other types of markup such as HTML and SGML. But most of what we say in this chapter is applicable to markup languages in general.

In the context of information retrieval, we are only interested in XML as a language for encoding text and documents. A perhaps more widespread use of XML is to encode non-text data. For example, we may want to export data in XML format from an enterprise resource planning system and then read them into an analytic program to produce graphs for a presentation. This type of application of XML is called data-centric because numerical and non-text attribute-value data dominate and text is usually a small fraction of the overall data. Most data-centric XML is stored in databases—in contrast to the inverted index-based methods for text-centric XML that we present in this chapter.

We refer to XML retrieval as structured retrieval in this chapter. Some researchers prefer the term semistructured retrieval to distinguish XML retrieval from database querying. We have adopted the terminology that is widespread in the XML retrieval community. For instance, the standard way of referring to XML queries is structured queries, not semistructured queries. The term structured retrieval is rarely used for database querying, and it always refers to XML retrieval in this book.

There is a second type of information retrieval problem that is intermediate between unstructured retrieval and querying a relational database: parametric and zone search, which we discussed in Section 6.1 (p. 110). In the data model of parametric and zone search, there are parametric fields (relational attributes like date or file size) and zones, text attributes that each take a chunk of unstructured text as value. The data model is flat, that is, there is no nesting of attributes, and the number of attributes is small. In contrast, XML documents have the more complex tree structure that we see in Figure 3.13 in which attributes are nested. The number of attributes and nodes is greater than in parametric and zone search.

After presenting the basic concepts of XML in Section 3.7.1, this chapter first discusses the challenges we face in XML retrieval (Section 3.7.2). Next, we describe a vector space model for XML retrieval (Section 3.7.3). Finally, we present INEX, a shared-task evaluation that has been held for a number of years and is currently the most important venue for XML retrieval research.

3.7.1. Basic XML concepts

An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag. An element can have one or more XML attributes. In the XML document in Figure 3.12 , the scene element is enclosed by the two tags <scene> and </scene>. It has an attribute number with value vii and two child elements, title and verse.

Figure 3.12.

An XML document.

Figure 3.13 shows Figure 3.12 as a tree. The leaf nodes of the tree consist of text, for example, Shakespeare, Macbeth, and Macbeth’s castle. The tree’s internal nodes encode either the structure of the document (title, act, and scene) or metadata functions (author).

Figure 3.13.

The XML document in Figure 3.12 as a simplified DOM object.

The standard for accessing and processing XML documents is the XML Document Object Model or DOM. The DOM represents elements, attributes, and text within elements as nodes in a tree. Figure 3.13 is a simplified DOM representation of the XML document in Figure 3.12 . With a DOM API, we can process an XML document by starting at the root element and then descending down the tree from parents to children.

XPath is a standard for enumerating paths in an XML document collection. We will also refer to paths as XML contexts or simply contexts in this chapter. Only a small subset of XPath is needed for our purposes. The XPath expression node selects all nodes of that name. Successive elements of a path are separated by slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that an arbitrary number of elements can intervene on a path: play//scene selects all scene elements occurring in a play element. In Figure 3.13 , this set consists of a single scene element, which is accessible via the path play, act, and scene from the top. An initial slash starts the path at the root element. /play/title selects the play’s title in Figure 3.12 , /play//title selects a set with two members (the play’s title and the scene’s title), and /scene/title selects no elements. For notational convenience, we allow the final element of a path to be a vocabulary term and separate it from the element path by the symbol #, even though this does not conform to the XPath standard. For example, title#“Macbeth” selects all titles containing the term Macbeth.
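
The sketch below uses Python’s standard xml.etree.ElementTree module, which supports a subset of XPath, on a small document modeled loosely on Figure 3.12 (the exact markup of the figure is not reproduced here, so the element layout is an assumption).

```python
import xml.etree.ElementTree as ET

# A document resembling Figure 3.12 (the figure's exact markup may differ).
doc = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
"""
play = ET.fromstring(doc)

# ElementTree supports a subset of XPath:
print([t.text for t in play.findall("title")])      # /play/title  -> ['Macbeth']
print([t.text for t in play.findall(".//title")])   # /play//title -> ['Macbeth', "Macbeth's castle"]
print([s.get("number") for s in play.findall("act/scene")])   # act/scene -> ['vii']
```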

We also need the concept of schema in this chapter. A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute. Two standards for schemas for XML documents are XML DTD (document-type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.

A common format for XML queries is NEXI (Narrowed Extended XPath I). We give an example in Figure 3.14 . We display the query on four lines for typographical convenience, but it is intended to be read as one unit without line breaks. In particular, //section is embedded under //article.

Figure 3.14.

An XML query in NEXI format and its partial representation as a tree.

The query in Figure 3.14 specifies a search for sections about the summer holidays that are part of articles from 2001 to 2002. As in XPath, double slashes indicate that an arbitrary number of elements can intervene on a path. The dot in a clause in square brackets refers to the element the clause modifies. The clause [.//yr = 2001 or .//yr = 2002] modifies //article. Thus, the dot refers to //article in this case. Similarly, the dot in [about(., summer holidays)] refers to the section that the clause modifies.

The two yr conditions are relational attribute constraints. Only articles whose year attribute is 2001 or 2002 (or that contain an element whose year attribute is 2001 or 2002) are to be considered. The about clause is a ranking constraint: sections that occur in the right type of article are to be ranked according to how relevant they are to the topic summer holidays.

We usually handle relational attribute constraints by prefiltering or postfiltering: we simply exclude all elements from the result set that do not meet the relational attribute constraints. In this chapter, we will not address how to do this efficiently and instead focus on the core information retrieval problem in XML retrieval, namely, how to rank documents according to the relevance criteria expressed in the about conditions of the NEXI query.

If we discard relational attributes, we can represent documents as trees with only one type of node: element nodes. In other words, we remove all attribute nodes from the XML document, such as the number attribute in Figure 3.12 . Figure 3.15 shows a subtree of the document in Figure 3.12 as an element-node tree (labeled d1).

Figure 3.15.

Tree representation of XML documents and queries.

We can represent queries as trees in the same way. This is a query-by-example approach to query language design because users pose queries by creating objects that satisfy the same formal description as documents. In Figure 3.15 , q1 is a search for books whose titles score highly for the keywords Julius Caesar. q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war. To represent the semantics of NEXI queries fully, we would also need to designate one node in the tree as a “target node,” for example, the section in the tree in Figure 3.14 . Without the designation of a target node, the tree in Figure 3.14 is not a search for sections embedded in articles (as specified by NEXI), but a search for articles that contain sections.

3.7.2. Challenges in XML retrieval

In this section, we discuss a number of challenges that make structured retrieval more difficult than unstructured retrieval. Recall from page 195 the basic setting we assume in structured retrieval: the collection consists of structured documents, and queries are either structured (as in Figure 3.14 ) or unstructured (e.g., summer holidays).

The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. If we query Shakespeare’s plays for Macbeth’s castle, should we return the scene, the act, or the entire play in Figure 3.13 ? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit.

One criterion for selecting the most appropriate part of a document is the structured document retrieval principle:

Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.

This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. However, it can be hard to implement this principle algorithmically. Consider the query title#“Macbeth” applied to Figure 3.13 . The title of the tragedy, Macbeth, and the title of act I, scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred. Deciding which level of the tree is right for answering a query is difficult.

Parallel to the issue of which parts of a document to return to the user is the issue of which parts of a document to index. In Section 2.1.2 (p. 20), we discussed the need for a document unit or indexing unit in indexing and retrieval. In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web, etc. In structured retrieval, there are a number of different approaches to defining the indexing unit.

One approach is to group nodes into non-overlapping pseudo-documents as shown in Figure 3.16 . In the example, books, chapters, and sections have been designated to be indexing units but without overlap. For example, the leftmost dashed indexing unit contains only those parts of the tree dominated by book that are not already part of other indexing units. The disadvantage of this approach is that pseudo-documents may not make sense to the user because they are not coherent units. For instance, the leftmost indexing unit in Figure 3.16 merges three disparate elements, the class, author, and title elements.

Figure 3.16.

Partitioning an XML document into non-overlapping indexing units.

We can also use one of the largest elements as the indexing unit, for example, the book element in a collection of books or the play element for Shakespeare’s works. We can then postprocess search results to find for each book or play the subelement that is the best hit. For example, the query Macbeth’s castle may return the play Macbeth, which we can then postprocess to identify act I, scene vii, as the best-matching subelement. Unfortunately, this two-stage retrieval process fails to return the best subelement for many queries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it.

Instead of retrieving large units and identifying subelements (top-down), we can also search all leaves, select the most relevant ones, and then extend them to larger units in postprocessing (bottom-up). For the query Macbeth’s castle in Figure 3.12 , we would retrieve the title Macbeth’s castle in the first pass and then decide in a postprocessing step whether to return the title, the scene, the act, or the play. This approach has a similar problem as the last one: the relevance of a leaf element is often not a good predictor of the relevance of elements it is contained in.

The least restrictive approach is to index all elements. This is also problematic. Many XML elements are not meaningful search results, for example, typographical elements like <b>definitely</b> or an ISBN number, which cannot be interpreted without context. Also, indexing all elements means that search results will be highly redundant. For the query Macbeth’s castle and the document in Figure 3.12 , we would return all of the play, act, scene, and title elements on the path between the root node and Macbeth’s castle. The leaf node would then occur four times in the result set, once directly and three times as part of other elements. We call elements that are contained within each other nested. Returning redundant nested elements in a list of returned hits is not very user-friendly.

Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned. Restriction strategies include:

  • Discard all small elements.

  • Discard all element types that users do not look at (this requires a working XML retrieval system that logs this information).

  • Discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available).

  • Only keep element types that a system designer or librarian has deemed to be useful search results.

In most of these approaches, result sets will still contain nested elements. Thus, we may want to remove some elements in a postprocessing step to reduce redundancy. Alternatively, we can collapse several nested elements in the results list and use highlighting of query terms to draw the user’s attention to the relevant passages. If query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section and the paragraph both occur in the results list, it is sufficient to show the section. An additional advantage of this approach is that the paragraph is presented together with its context (i.e., the embedding section). This context may be helpful in interpreting the paragraph (e.g., the source of the information reported) even if the paragraph on its own satisfies the query.

If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated as few nested elements have the same type. But as we discussed in the introduction, users often don’t know what the name of an element in the collection is (Is the Vatican a country or a city?), or they may not know how to compose structured queries at all.

A challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idf) statistics as defined in Section 6.2.1 (p. 117). For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate. It makes little sense to compute a single document frequency for Gates in this example.

One solution is to compute idf for XML-context/term pairs, for example, to compute different idf weights for author#“Gates” and section#“Gates.” Unfortunately, this scheme will run into sparse data problems; that is, many XML-context pairs occur too rarely to reliably estimate df (see Section 13.2, p. 260, for a discussion of sparseness). A compromise is to consider only the parent node x of the term, and not the rest of the path from the root to x, to distinguish contexts. There are still conflations of contexts that are harmful in this scheme. For instance, we do not distinguish names of authors and names of corporations if both have the parent node name. But most important distinctions, like the example contrast author#“Gates” vs. section#“Gates,” will be respected.

In many cases, several different XML schemas occur in a collection since the XML documents in an IR application often come from more than one source. This phenomenon is called schema heterogeneity or schema diversity and presents yet another challenge. As illustrated in Figure 3.17 , comparable elements may have different names: creator in d2 vs. author in d3. In other cases, the structural organization of the schemas may be different: author names are direct descendants of the node author in q3, but there are the intervening nodes firstname and lastname in d3. If we employ strict matching of trees, then q3 will retrieve neither d2 nor d3 although both documents are relevant. Some form of approximate matching of element names in combination with semiautomatic matching of different document structures can help here. Human editing of correspondences of elements in different schemas will usually do better than automatic methods.

Figure 3.17.

Schema heterogeneity: intervening nodes and mismatched names.

Schema heterogeneity is one reason for query-document mismatches like q3/d2 and q3/d3. Another reason is that users often are not familiar with the element names and the structure of the schemas of collections they search, as mentioned. This poses a challenge for interface design in XML retrieval. Ideally, the user interface should expose the tree structure of the collection and allow users to specify the elements they are querying. If we take this approach, then designing the query interface in structured retrieval is more complex than a search box for keyword queries in unstructured retrieval.

We can also support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed. We call such queries extended queries. The tree in Figure 3.14 and q4 in Figure 3.17 are examples of extended queries. We show edges that are interpreted as descendant relationships as dashed arrows. In q4, a dashed arrow connects book and Gates. As a pseudo-XPath notation for q4, we adopt book//#“Gates”: a book that somewhere in its structure contains the word Gates, where the path from the book node to Gates can be arbitrarily long. The pseudo-XPath notation for the extended query that in addition specifies that Gates occurs in a section of the book is book//section//#“Gates.” It is convenient for users to be able to issue such extended queries without having to specify the exact structural configuration in which a query term should occur—either because they do not care about the exact configuration or because they do not know enough about the schema of the collection to be able to specify it.

In Figure 3.18 , the user is looking for a chapter entitled FFT (q5). Suppose that there is no such chapter in the collection but that there are references to books on FFT (d4). A reference to a book on FFT is not exactly what the user is looking for, but it is better than returning nothing. Extended queries do not help here: the extended query q6 also returns nothing. This is a case where we may want to interpret the structural constraints specified in the query as hints rather than as strict conditions. As we will discuss in Section 10.4, users prefer a relaxed interpretation of structural constraints: elements that do not meet structural constraints perfectly should be ranked lower, but they should not be omitted from search results.

Figure 3.18.

A structural mismatch between two queries and a document.

3.7.3. A vector space model for XML retrieval

In this section, we present a simple vector space model for XML retrieval. It is not intended to be a complete description of a state-of-the-art system. Instead, we want to give the reader a flavor of how documents can be represented and retrieved in XML retrieval.

To take into account the structure in retrieval in Figure 3.15 , we want a book entitled Julius Caesar to be a match for q1 and no match (or a lower-weighted match) for q2. In unstructured retrieval, there would be a single dimension of the vector space for Caesar. In XML retrieval, we must separate the title word Caesar from the author name Caesar. One way of doing this is to have each dimension of the vector space encode a word together with its position within the XML tree.

Figure 3.19 illustrates this representation. We first take each text node (which in our setup is always a leaf) and break it into multiple nodes, one for each word. So, the leaf node Bill Gates is split into two leaves Bill and Gates. Next, we define the dimensions of the vector space to be lexicalized subtrees of documents, that is, subtrees that contain at least one vocabulary term. A subset of these possible lexicalized subtrees is shown in the figure, but there are others, for example, the subtree corresponding to the whole document with the leaf node Gates removed. We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them. This means that we can use the vector space formalism from Chapter 6 for XML retrieval. The main difference is that the dimensions of vector space in unstructured retrieval are vocabulary terms, whereas they are lexicalized subtrees in XML retrieval.

Figure 3.19.

A mapping of an XML document (left) to a set of lexicalized subtrees (right).

There is a trade-off between the dimensionality of the space and the accuracy of query results. If we trivially restrict dimensions to vocabulary terms, then we have a standard vector space retrieval system that will retrieve many documents that do not match the structure of the query (e.g., Gates in the title as opposed to the author element). If we create a separate dimension for each lexicalized subtree occurring in the collection, the dimensionality of the space becomes too large. A compromise is to index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs. We call such an XML-context/term pair a structural term and denote it by ⟨c, t⟩: a pair of XML context c and vocabulary term t. The document in Figure 3.19 has nine structural terms. Seven are shown (e.g., “Bill” and Author#“Bill”) and two are not shown: /Book/Author#“Bill” and /Book/Author#“Gates.” The tree with the leaves Bill and Gates is a lexicalized subtree that is not a structural term. We use the previously introduced pseudo-XPath notation for structural terms.

As we discussed in the last section, users are bad at remembering details about the schema and at constructing queries that comply with the schema. We will therefore interpret all queries as extended queries; that is, there can be an arbitrary number of intervening nodes in the document for any parent-child node pair in the query. For example, we interpret q5 in Figure 3.18 as q6.

But we still prefer documents that match the query structure closely by inserting fewer additional nodes. We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{if } c_q \text{ does not match } c_d \end{cases}   (E3.21)

where |c_q| and |c_d| are the number of nodes in the query path and document path, respectively, and c_q matches c_d if we can transform c_q into c_d by inserting additional nodes. Two examples from Figure 3.17 are CR(c_{q4}, c_{d2}) = 3/4 = 0.75 and CR(c_{q4}, c_{d3}) = 3/5 = 0.6, where c_{q4}, c_{d2}, and c_{d3} are the relevant paths from top to leaf node in q4, d2, and d3, respectively. The value of CR(c_q, c_d) is 1.0 if q and d are identical.
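As a minimal illustration of Eq. (3.21), the following Python sketch computes CR for paths given as lists of node names; the example paths are hypothetical simplifications in the spirit of Figure 3.17.

```python
def cr(cq, cd):
    """Context resemblance CR(cq, cd) from Eq. (3.21).

    cq and cd are paths given as lists of node names, root first.
    cq "matches" cd if cq can be turned into cd by inserting extra
    nodes, i.e., cq is a subsequence of cd.
    """
    def is_subsequence(short, long):
        it = iter(long)
        return all(node in it for node in short)

    if is_subsequence(cq, cd):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

# Hypothetical paths in the spirit of Figure 3.17:
q4 = ["book", "Gates"]                        # extended query book//#"Gates"
d2 = ["book", "creator", "Gates"]             # |cd| = 3 -> CR = 3/4
d3 = ["book", "author", "lastname", "Gates"]  # |cd| = 4 -> CR = 3/5
print(cr(q4, d2))   # 0.75
print(cr(q4, d3))   # 0.6
```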

The final score for a document is computed as a variant of the cosine measure [Eq. (6.10), p. 121], which we call SIMNOMERGE for reasons that will become clear shortly. SIMNOMERGE is defined as follows:

\mathrm{SIMNOMERGE}(q, d) = \sum_{c_k \in B} \sum_{c_l \in B} CR(c_k, c_l) \sum_{t \in V} \mathrm{weight}(q, t, c_k)\, \frac{\mathrm{weight}(d, t, c_l)}{\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}}   (E3.22)

where V is the vocabulary of nonstructural terms, B is the set of all XML contexts, and weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively. We compute the weights using one of the weightings from Chapter 6, such as idf_t · wf_{t,d}. The inverse document frequency idf_t depends on which elements we use to compute df_t, as discussed in Section 10.2. The similarity measure SIMNOMERGE(q, d) is not a true cosine measure since its value can be larger than 1.0 (Exercise 10.11). We divide by \sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)} to normalize for document length (Section 6.3.1, page 121). We have omitted query length normalization to simplify the formula. It has no effect on ranking since, for a given query, the normalizer \sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)} is the same for all documents.

The algorithm for computing SIMNOMERGE for all documents in the collection is shown in Figure 3.20 . The array normalizer in Figure 3.20 contains \sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)} from Eq. (3.22) for each document.

We give an example of how SIMNOMERGE computes query-document similarities in Figure 3.21 . ⟨c1, t⟩ is one of the structural terms in the query. We successively retrieve all postings lists for structural terms ⟨c, t⟩ with the same vocabulary term t. Three example postings lists are shown. For the first one, we have CR(c1, c1) = 1.0 since the two contexts are identical. The next context has no context resemblance with c1: CR(c1, c2) = 0, and the corresponding postings list is ignored. The context match of c1 with c3 is 0.63 and it will be processed. In this example, the highest-ranking document is d9 with a similarity of 1.0 × 0.2 + 0.63 × 0.6 = 0.578. To simplify the figure, the query weight of ⟨c1, t⟩ is assumed to be 1.0 ( Figures 3.20 and 3.21 ).
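To make the arithmetic of this example concrete, here is a small Python sketch that reproduces the score of d9 under the stated assumptions (query weight 1.0 for ⟨c1, t⟩, and the CR values and document weights read off Figure 3.21); the postings data below are illustrative, not taken from a real index.

```python
# Scoring one query structural term <c1, t> against example postings lists,
# following the SIMNOMERGE computation described in the text.
query_weight = 1.0                 # weight(q, t, c1), assumed to be 1.0

# CR(c1, c) for the document contexts that carry term t (illustrative values).
context_resemblance = {"c1": 1.0, "c2": 0.0, "c3": 0.63}

# postings[context] = list of (doc_id, weight(d, t, context)); illustrative.
postings = {
    "c1": [("d9", 0.2)],
    "c2": [("d9", 0.5)],    # ignored, since CR(c1, c2) = 0
    "c3": [("d9", 0.6)],
}

scores = {}
for ctx, plist in postings.items():
    cr_value = context_resemblance[ctx]
    if cr_value == 0.0:
        continue                              # the postings list is skipped
    for doc_id, doc_weight in plist:
        scores[doc_id] = scores.get(doc_id, 0.0) + cr_value * query_weight * doc_weight

print(scores)   # {'d9': ...}  approximately 0.578 = 1.0 * 0.2 + 0.63 * 0.6
```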

Figure 3.20.

The algorithm for scoring documents with SIMNOMERGE.

Figure 3.21.

Scoring of a query with one structural term in SIMNOMERGE.

The query-document similarity function in Figure 3.20 is called SIMNOMERGE because different XML contexts are kept separate for the purpose of weighting. An alternative similarity function is SIMMERGE, which relaxes the matching conditions of query and document further in the following three ways:

  • We collect the statistics used for computing weight(q, t, c) and weight(d, t, c) from all contexts that have a non-zero resemblance to c (as opposed to just from c as in SIMNOMERGE). For instance, for computing the document frequency of the structural term atl#“recognition,” we also count occurrences of recognition in XML contexts fm/atl, article//atl, etc.

  • We modify Eq. (3.22) by merging all structural terms in the document that have a non-zero context resemblance to a given query structural term. For example, the contexts /play/act/scene/title and /play/title in the document will be merged when matching against the query term /play/title#“Macbeth.”

  • The context resemblance function is further relaxed: contexts have a non-zero resemblance in many cases where the definition of CR in Eq. (3.21) returns 0.

These three changes alleviate the problem of sparse term statistics discussed in Section 3.2 and increase the robustness of the matching function against poorly posed structural queries. The evaluation of SIMNOMERGE and SIMMERGE in the next section shows that the relaxed matching conditions of SIMMERGE increase the effectiveness of XML retrieval.

3.7.4. Evaluation of XML retrieval

The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments. A yearly INEX meeting is held to present and discuss research results. The INEX 2002 collection consisted of about 12,000 articles from IEEE journals. We give collection statistics in Table 3.4 and show part of the schema of the collection in Figure 3.22 . The IEEE journal collection was expanded in 2005. Since 2006, INEX has used the much larger English Wikipedia as its test collection.

|                     | RDB search       | Unstructured retrieval | Structured retrieval      |
| Objects             | Records          | Unstructured documents | Trees with text at leaves |
| Model               | Relational model | Vector space & others  | ?                         |
| Main data structure | Table            | Inverted index         | ?                         |
| Queries             | SQL              | Free text queries      | ?                         |

Table 3.3.

RDB (relational database) search, unstructured information retrieval, and structured information retrieval.

| Number of documents                      | 12,107    |
| Size                                     | 494 MB    |
| Time of publication of articles          | 1995–2002 |
| Average number of XML nodes per document | 1532      |
| Average depth of a node                  | 6.9       |
| Number of CAS topics                     | 30        |
| Number of CO topics                      | 30        |

Table 3.4.

INEX 2002 collection statistics.

Figure 3.22.

Simplified schema of the documents in the INEX collection.

Two types of information needs or topics in INEX are content-only or CO topics and content-and-structure (CAS) topics. CO topics are regular keyword queries as in unstructured information retrieval. CAS topics have structural constraints in addition to keywords. We already encountered an example of a CAS topic in Figure 3.14 . The keywords in this case are summer and holidays, and the structural constraints specify that the keywords occur in a section that in turn is part of an article and that this article has an embedded year attribute with value 2001 or 2002.

Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval. INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. The component coverage dimension evaluates whether the element retrieved is structurally correct, that is, neither too low nor too high in the tree. We distinguish four cases:

  • Exact coverage (E). The information sought is the main topic of the component, and the component is a meaningful unit of information.

  • Too small (S). The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information.

  • Too large (L). The information sought is present in the component, but is not the main topic.

  • No coverage (N). The information sought is not a topic of the component.

The topical relevance dimension also has four levels: highly relevant (3), fairly relevant (2), marginally relevant (1), and nonrelevant (0). Components are judged on both dimensions, and the judgments are then combined into a digit-letter code. 2S is a fairly relevant component that is too small, and 3E is a highly relevant component that has exact coverage. In theory, there are 16 combinations of coverage and relevance, but many cannot occur. For example, a highly relevant component cannot have no coverage, so the combination 3N is not possible.

The relevance-coverage combinations are quantized as follows:

Q(rel, cov) = \begin{cases} 1.00 & \text{if } (rel, cov) = 3E \\ 0.75 & \text{if } (rel, cov) \in \{2E, 3L\} \\ 0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\ 0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\ 0.00 & \text{if } (rel, cov) = 0N \end{cases}   (E3.23)

This evaluation scheme takes into account the fact that binary relevance judgments, which are standard in unstructured information retrieval, are not appropriate for XML retrieval. A 2S component provides incomplete information and may be difficult to interpret without more context, but it does answer the query partially. The quantization function Q does not impose a binary choice relevant/nonrelevant and instead allows us to grade the component as partially relevant.

The number of relevant components in a retrieved set A of components can then be computed as

\#(\text{relevant items retrieved}) = \sum_{c \in A} Q(rel(c), cov(c))   (E3.24)

As an approximation, the standard definitions of precision, recall, and F can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded as opposed to binary relevance assessments.
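A small Python sketch of the quantization in Eq. (3.23) and the graded count of Eq. (3.24); the retrieved set below is hypothetical.

```python
def q(rel, cov):
    """Quantization Q(rel, cov) from Eq. (3.23); rel in {0..3}, cov in {'E','S','L','N'}."""
    combo = f"{rel}{cov}"
    if combo == "3E":
        return 1.00
    if combo in {"2E", "3L"}:
        return 0.75
    if combo in {"1E", "2L", "2S"}:
        return 0.50
    if combo in {"1S", "1L"}:
        return 0.25
    return 0.00

# Hypothetical assessments for a retrieved set A of components.
retrieved = [(3, "E"), (2, "S"), (1, "L"), (0, "N")]
relevant_items_retrieved = sum(q(rel, cov) for rel, cov in retrieved)   # Eq. (3.24)
print(relevant_items_retrieved)   # 1.0 + 0.5 + 0.25 + 0.0 = 1.75
```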

One flaw of measuring relevance this way is that overlap is not accounted for. We discussed the concept of marginal relevance in the context of unstructured retrieval. This problem is worse in XML retrieval because of the problem of multiple nested elements occurring in a search result as we discussed on p. 80. Much of the recent focus at INEX has been on developing algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.

Table 3.5 shows two INEX 2002 runs of the vector space system we described in Section 3.7.3. The better run is the SIMMERGE run, which incorporates few structural constraints and mostly relies on keyword matching. SIMMERGE's median average precision (where the median is with respect to average precision numbers over topics) is only 0.147. Effectiveness in XML retrieval is often lower than in unstructured retrieval since XML retrieval is harder. Instead of just finding a document, we have to find the subpart of a document that is most relevant to the query. Also, XML retrieval effectiveness—when evaluated as described here—can be lower than unstructured retrieval effectiveness on a standard evaluation because graded judgments lower measured performance. Consider a system that returns a document with graded relevance 0.6 and binary relevance 1 at the top of the retrieved list. Then, interpolated precision at 0.00 recall is 1.0 on a binary evaluation but can be as low as 0.6 on a graded evaluation.

| Algorithm  | Average precision |
| SIMNOMERGE | 0.242             |
| SIMMERGE   | 0.271             |

Table 3.5.

INEX 2002 results of the vector space model in Section 3.7.3 for content and structure (CAS) queries and the quantization function Q.

Table 3.5 gives us a sense of the typical performance of XML retrieval, but it does not compare structured with unstructured retrieval. Table 3.6 directly shows the effect of using structure in retrieval. The results are for a language model-based system that is evaluated on a subset of CAS topics from INEX 2003 and 2004. The evaluation metric is precision at k. The discretization function used for the evaluation maps highly relevant elements (roughly corresponding to the 3E elements defined for Q) to 1 and all other elements to 0. The content only system treats queries and documents as unstructured bags of words. The full-structure model ranks elements that satisfy structural constraints higher than elements that do not. For instance, for the query in Figure 3.14 , an element that contains the phrase summer holidays in a section will be rated higher than one that contains it in an abstract.

|                 | Content only | Full structure | Improvement |
| precision at 5  | 0.2000       | 0.3265         | 63.3%       |
| precision at 10 | 0.1820       | 0.2531         | 39.1%       |
| precision at 20 | 0.1700       | 0.1796         | 5.6%        |
| precision at 30 | 0.1527       | 0.1531         | 0.3%        |

Table 3.6.

Content-only versus full-structure retrieval: precision at k for a subset of CAS topics from INEX 2003 and 2004.

The table shows that structure helps increase precision at the top of the results list. There is a large increase of precision at k = 5 and at k = 10, and almost no improvement at k = 30. These results demonstrate the benefits of structured retrieval: it imposes additional constraints on what to return, and documents that pass the structural filter are more likely to be relevant. Recall may suffer because some relevant documents will be filtered out, but for precision-oriented tasks, structured retrieval is superior.

4. Classical Machine Learning

This chapter focuses on two major problems in machine learning: classification and clustering. Classification trains a learning machine (i.e., fits a certain objective function) on samples with known class labels so that it can classify unknown samples; it belongs to supervised learning. Clustering, in contrast, knows nothing about the samples in advance: the goal is to divide a set of samples of unknown categories into several groups by some algorithm. When we use clustering, we do not care what each group represents; the goal is simply to bring similar things together, which in machine learning is called unsupervised learning.

4.1. Classification

4.1.1. Performance evaluation for classification

When we use a classifier for prediction, we face an important question: how to evaluate the predictive performance of this classifier. Evaluating classifier performance is the basis for choosing a good classifier. Traditional measures such as accuracy, recall, sensitivity, and specificity cannot, on their own, fully characterize a classifier. In addition to these traditional criteria, this book therefore also considers the confusion matrix, the area under the curve (AUC), and other performance evaluation methods.

4.1.1.1. Performance evaluation of general classification

An instance can be classified as positive or negative, which yields four possible classification outcomes:

True positive (TP): positive data that is correctly marked as positive data.

True negative (TN): negative data that is correctly marked as negative data.

False positive (FP): negative data that is incorrectly marked as positive data.

False negative (FN): positive data that is incorrectly marked as negative data.

The formula for precision is

P = \frac{TP}{TP + FP}   (E4.1)

The formula for recall is

R = \frac{TP}{TP + FN}   (E4.2)

Sensitivity, also known as the true positive rate (TPR), is the proportion of all positive instances that the classifier correctly identifies as positive. The formula for sensitivity is

TPR = \frac{TP}{TP + FN}   (E4.3)

The false positive rate (FPR) is the proportion of negative instances that the classifier incorrectly identifies as positive; specificity (the true negative rate) equals 1 - FPR. The formula for FPR is

FPR = \frac{FP}{FP + TN}   (E4.4)
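The following Python sketch computes the four counts and the measures in Eqs. (4.1)-(4.4) for a toy set of labels; the labels are made up for illustration.

```python
# Toy ground-truth and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)          # Eq. (4.1)
recall = tp / (tp + fn)             # Eq. (4.2), identical to TPR in Eq. (4.3)
fpr = fp / (fp + tn)                # Eq. (4.4)
specificity = 1 - fpr

print(tp, tn, fp, fn)               # 3 3 1 1
print(precision, recall, fpr)       # 0.75 0.75 0.25
```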

4.1.1.2. ROC curve and AUC

The ROC curve is also known as the sensitivity curve. It is obtained by sweeping the classification threshold over a series of values and plotting one point per threshold, with the true positive rate (TPR) as the ordinate and the false positive rate (FPR) as the abscissa.

Area under curve (AUC) is defined as the area under the ROC curve, or equivalently the ratio of the area under the ROC curve to the unit area, so this value is clearly not greater than one. Also, since the ROC curve generally lies above the straight line y = x, the AUC value ranges from 0.5 to 1. The larger the AUC, the higher the accuracy of the classification.
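In practice the ROC curve and AUC are usually computed with a library; the sketch below uses scikit-learn (assumed to be installed) on made-up labels and scores.

```python
# A minimal AUC computation with scikit-learn; the data are illustrative.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                      # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                   # area under that curve
print(auc)   # between 0.5 and 1.0 for a better-than-random classifier
```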

4.1.1.3. Confusion matrix

The confusion matrix summarizes the results of supervised classification. The entries on the main diagonal give the number of correctly classified instances of each category, and the off-diagonal entries give the number of classification errors. As shown in Table 4.1 , there are two different types of errors: “false acceptance” and “false rejection” [100]. If the confusion matrix of a two-class problem is normalized, it becomes a joint probability distribution over the binary values 0 and 1. For a two-class problem, the confusion matrix can be expressed as in Table 4.1 .

| Category | ω2                 | ϖ2                 |
| ω1       | Accepted correctly | Wrongly rejected   |
| ϖ1       | Wrongly accepted   | Rejected correctly |

Table 4.1.

The confusion matrix.

ω1, the actual category one; ϖ1, the actual category two; ω2, the classifier decides category one; and ϖ2, the classifier decides category two.
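A minimal sketch of building a two-class confusion matrix with scikit-learn (assumed to be available); the labels are invented.

```python
from sklearn.metrics import confusion_matrix

# 1 = category one (omega_1), 0 = category two; labels are illustrative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[3 1]    3 accepted correctly, 1 wrongly rejected   (actual class one)
#  [1 3]]   1 wrongly accepted,   3 rejected correctly (actual class two)
```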

4.1.1.4. F-Score

F-Score, also known as F-Measure, is the weighted harmonic mean of precision and recall and is commonly used to evaluate whether a classification model is good or bad. When the weighting parameter alpha = 1, the F-Score balances precision and recall equally; a high F-Score indicates that the test method is effective [101]. Because classification accuracy alone sometimes fails to reflect the characteristics of the sample set or the performance of a classifier, the following two parameters can also be used to evaluate a classifier on two-class problems:

TNR = \frac{TN}{TN + FP} = 1 - FPR, \qquad FNR = \frac{FN}{TP + FN} = 1 - TPR   (E4.5)

In general, the higher the F-Score, the better the classifier performs on the positive class. Note that TNR and FNR affect each other, so a single parameter alone cannot fully evaluate a classifier.

4.1.2. Decision tree

The decision tree is an important classification method in data mining and, among the many classification algorithms, the most intuitive one. As a decision-analysis technique, it evaluates project risk and determines feasibility, and it offers a graphical way to perform intuitive probabilistic analysis. Because the decision branches spread out like the branches of a tree, the method is called a decision tree. In machine learning, a decision tree is a predictive model that represents a mapping between attributes and values [102].

4.1.2.1. Model introduction

The classification decision tree model is a tree structure that describes the classification of instances. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class [103]. To classify an instance, the decision tree starts at the root node, tests a particular feature of the instance, and assigns the instance to one of the child nodes according to the test result; each child node corresponds to one value of that feature. The instance is tested and assigned recursively in this way until it reaches a leaf node, and it is then assigned to the class of that leaf node. In the decision tree of Figure 4.1 , circles and boxes represent internal nodes and leaf nodes, respectively.

Figure 4.1.

Decision tree model.

4.1.2.2. Feature selection

Feature selection chooses, for the training data, the features with the strongest discriminative power, which improves the efficiency of decision tree learning. Information gain can be used as one of the criteria for feature selection [104]. First, we need to introduce the concept of entropy. In probability and statistics, entropy is a measure of the uncertainty of a random variable. Suppose X is a discrete random variable with a finite number of values and probability distribution

P(X = x_i) = p_i, \quad i = 1, 2, \ldots, n   (E4.6)

The entropy of the random variable X is defined as

H(X) = -\sum_{i=1}^{n} p_i \log p_i   (E4.7)

With random variables (X,Y), the joint probability distribution is

P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, m   (E4.8)

The conditional entropy H(Y|X) represents the uncertainty of the random variable Y given the random variable X. It is defined as the expectation, with respect to X, of the entropy of the conditional probability distribution of Y given X:

H(Y|X) = \sum_{i=1}^{n} p_i H(Y \mid X = x_i)   (E4.9)

where p_i = P(X = x_i), i = 1, 2, \ldots, n.

Information gain: The information gain g(D, A) of feature A with respect to the training data set D is defined as the difference between the empirical entropy H(D) of the set D and the empirical conditional entropy H(D|A) of D given the feature A [105], which is

g(D, A) = H(D) - H(D|A)   (E4.10)

According to the information gain criterion, the feature selection method is as follows: for the training data set (or subset) D, compute the information gain of each candidate feature, compare the values, and select the feature with the largest information gain.
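A small Python sketch of Eqs. (4.7)-(4.10): empirical entropy, conditional entropy, and information gain for one candidate feature; the toy data set is invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D), Eq. (4.7)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A), Eqs. (4.9) and (4.10)."""
    n = len(labels)
    cond = 0.0
    for value in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == value]
        cond += (len(subset) / n) * entropy(subset)      # empirical H(D|A)
    return entropy(labels) - cond

# Toy data: feature A (e.g., "has job") and class labels (e.g., "loan approved").
feature_A = ["yes", "yes", "no", "no", "no", "yes"]
labels    = [ 1,     1,     0,    0,    1,    1  ]
print(information_gain(feature_A, labels))
```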

4.1.2.3. Decision tree generation

ID3 algorithm: The core idea is to apply the information gain criterion for feature selection at each node of the decision tree and to construct the tree recursively [106]. Concretely, start from the root node, compute the information gain of every candidate feature for that node, select the feature with the largest information gain as the node's splitting feature, and create one child node for each value of that feature. The child nodes then recursively call the same procedure to build the tree, until every remaining information gain is very small or there are no features left to select, which finally yields a decision tree [107].

4.1.2.4. Decision tree pruning

In decision tree learning, the process of simplifying the generated tree is called pruning.

Pruning algorithm:

Input: the entire tree T produced by the generation algorithm, and the parameter α;

Output: the pruned subtree T_α.

  1. Calculate the empirical entropy of each node;

  2. Recursively retract upward from the leaf nodes of the tree. Suppose that retracting a group of leaf nodes into their parent node turns the whole tree from T_B (before retraction) into T_A (after retraction), with corresponding loss function values C_α(T_B) and C_α(T_A). If C_α(T_A) ≤ C_α(T_B), prune, that is, the parent node becomes a new leaf node.

  3. Return to step (2) until no further pruning is possible; the subtree T_α with the smallest loss function is obtained [108] (a brief pruning sketch follows this list).
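The sketch below uses scikit-learn's cost-complexity pruning, a related scheme in which the parameter ccp_alpha plays the role of α; it is not a line-by-line implementation of the algorithm above, only an analogous off-the-shelf tool, assuming scikit-learn is available.

```python
# A related, off-the-shelf pruning scheme: scikit-learn's cost-complexity pruning.
# The parameter ccp_alpha plays the role of alpha above; a larger alpha penalizes
# tree size more strongly and yields a smaller subtree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print(unpruned.tree_.node_count, pruned.tree_.node_count)  # pruned tree is smaller
```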

4.1.3. Bayes-based classification

Bayesian classification uses probability and statistics to classify; it is a statistical classification method. The naive Bayes (NB) classification algorithm is comparable to decision tree and neural network classifiers, can be applied to large databases, and is simple while achieving high accuracy and fast classification.

Bayes' theorem is

P(B|A) = \frac{P(A|B)\, P(B)}{P(A)}   (E4.11)

4.1.3.1. Naive Bayesian algorithm model

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayesian classification is a very simple algorithm: for an item to be classified, we compute the probability of each category given the item's attributes and assign the item to the category with the highest probability. The naive Bayesian classification model is depicted in Figure 4.2 .

Figure 4.2.

Naive Bayesian algorithm model.

Naive Bayes has been studied extensively since the 1950s. It was introduced under a different name into the text retrieval community in the early 1960s [109] and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate preprocessing, it is competitive in this domain with more advanced methods including support vector machines [110]. It also finds application in automatic medical diagnosis [111].

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable [112]. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

Probabilistic model: Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x_1, \ldots, x_n) of n features (independent variables), it assigns to this instance the probabilities

P(C_k \mid x_1, \ldots, x_n)   (E4.12)

for each of K possible outcomes or classes C_k.

The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as

p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}   (E4.13)

In plain English, using Bayesian probability terminology, the above equation can be written as

\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}   (E4.14)

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on C and the values of the features F_i are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

p(C_k, x_1, \ldots, x_n)   (E4.15)

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

p(C_k, x_1, \ldots, x_n) = p(x_1, \ldots, x_n, C_k) = p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2, \ldots, x_n, C_k) = p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k)\, p(x_3, \ldots, x_n, C_k) = \cdots = p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k)   (E4.16)

Now the “naive” conditional independence assumptions come into play: assume that each feature F_i is conditionally independent of every other feature F_j for j ≠ i, given the category C. This means that

p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k)   (E4.17)

Thus, the joint model can be expressed as

p(C_k \mid x_1, \ldots, x_n) \propto p(C_k, x_1, \ldots, x_n) = p(C_k)\, p(x_1 \mid C_k)\, p(x_2 \mid C_k)\, p(x_3 \mid C_k) \cdots = p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)   (E4.18)

This means that under the above independence assumptions, the conditional distribution over the class variable C is

p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)   (E4.19)

where the evidence Z = p(\mathbf{x}) is a scaling factor that depends only on x_1, \ldots, x_n, that is, a constant if the values of the feature variables are known.

4.1.3.2. Parameter estimation of naive Bayesian algorithm

In the naive Bayesian algorithm, learning means estimating P(Y = c_k) and P(X^{(j)} = x^{(j)} \mid Y = c_k). Maximum likelihood estimation can be used to estimate the prior probability. The maximum likelihood estimate of the prior probability is

P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \ldots, K   (E4.20)

Suppose the set of possible values of the jth feature x^{(j)} is \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}. The maximum likelihood estimate of the conditional probability P(X^{(j)} = a_{jl} \mid Y = c_k) is

P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}, \quad j = 1, 2, \ldots, n; \; l = 1, 2, \ldots, S_j; \; k = 1, 2, \ldots, K   (E4.21)

where x_i^{(j)} is the jth feature of the ith sample, a_{jl} is the lth value that the jth feature may take, and I is the indicator function.
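The maximum likelihood estimates in Eqs. (4.20) and (4.21) amount to simple counting; the Python sketch below estimates them for a tiny categorical data set (invented for illustration) and scores a new instance with Eq. (4.19) up to the constant Z.

```python
from collections import Counter, defaultdict

# Tiny categorical training set: each row is (x1, x2), with class label y.
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "hot"), ("sunny", "hot"), ("rainy", "mild")]
y = ["no", "no", "yes", "yes", "no", "yes"]
N = len(y)
class_counts = Counter(y)

# Prior P(Y = c_k), Eq. (4.20): relative class frequencies.
prior = {c: cnt / N for c, cnt in class_counts.items()}

# Conditional P(X^(j) = a | Y = c_k), Eq. (4.21): counts within each class.
cond = defaultdict(lambda: defaultdict(Counter))   # cond[class][feature_index][value]
for features, label in zip(X, y):
    for j, value in enumerate(features):
        cond[label][j][value] += 1

def posterior_unnormalized(features):
    """p(C_k) * prod_j p(x_j | C_k), i.e., Eq. (4.19) without the 1/Z factor."""
    scores = {}
    for c in prior:
        score = prior[c]
        for j, value in enumerate(features):
            score *= cond[c][j][value] / class_counts[c]
        scores[c] = score
    return scores

print(posterior_unnormalized(("sunny", "hot")))   # higher score -> predicted class
```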

4.1.4. Support vector machine

The support vector machine (SVM) is a supervised learning model with associated learning algorithms for analyzing data in classification and regression analysis [113]. Given a set of training instances, each marked as belonging to one of two categories, the SVM training algorithm builds a model that assigns new instances to one of the two categories, making it a nonprobabilistic binary linear classifier. The SVM model represents instances as points in space, mapped so that the instances of the two categories are separated by as wide a clear margin as possible. New instances are then mapped into the same space, and their category is predicted based on which side of the margin they fall [114].

4.1.4.1. Linear SVM

We consider a training set of points of the form

(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)   (E4.22)

where y_i is 1 or −1, indicating the class to which the point x_i belongs, and each x_i is a p-dimensional real vector. We want to find the “maximum-margin hyperplane” that separates the points with y_i = 1 from the points with y_i = −1 while maximizing the distance between the hyperplane and the nearest point. Any hyperplane can be written as the set of points x satisfying

\mathbf{w} \cdot \mathbf{x} - b = 0   (E4.23)

where w (not necessarily normalized) is the normal vector of the hyperplane.

Hard margin: If the training data are linearly separable, we can choose two parallel hyperplanes that separate the two classes of data so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the “margin,” and the maximum-margin hyperplane is the hyperplane that lies halfway between them [115]. These hyperplanes can be described by the equation

\mathbf{w} \cdot \mathbf{x} - b = 1   (E4.24)

or

\mathbf{w} \cdot \mathbf{x} - b = -1   (E4.25)

By simple geometry, the distance between these two hyperplanes is 2/\lVert \mathbf{w} \rVert, so maximizing the distance between the planes amounts to minimizing \lVert \mathbf{w} \rVert. At the same time, to keep the data points outside the margin, we need to ensure that every point satisfies one of the following conditions:

\mathbf{w} \cdot \mathbf{x}_i - b \geq 1, \quad \text{if } y_i = 1   (E4.26)

or

\mathbf{w} \cdot \mathbf{x}_i - b \leq -1, \quad \text{if } y_i = -1   (E4.27)

These constraints state that each data point must lie on the correct side of the margin. The two conditions can be written together as

y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1, \quad 1 \leq i \leq n   (E4.28)

Putting these together gives the optimization problem: minimize \lVert \mathbf{w} \rVert subject to y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1 for i = 1, \ldots, n. The w and b that solve this problem determine our classifier.

Soft margin: To extend the SVM to data that are not linearly separable, we introduce the hinge loss function

\max\bigl(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b)\bigr)   (E4.29)

This function is zero when the constraint of Eq. (4.28) is satisfied, that is, when x_i lies on the correct side of the margin; for data on the wrong side of the margin, its value is proportional to the distance from the margin. We then want to minimize

\left[\frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b)\bigr)\right] + \lambda \lVert \mathbf{w} \rVert^2   (E4.30)

where the parameter λ balances the trade-off between increasing the size of the margin and ensuring that each x_i lies on the correct side of the margin. Thus, for sufficiently small values of λ, the soft-margin SVM behaves the same as the hard-margin SVM if the input data are linearly separable, but it can still learn a feasible classification rule when they are not [115].
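As a sanity check on Eq. (4.30), the following numpy sketch evaluates the soft-margin objective for a fixed, hand-picked w and b on toy data; it only computes the objective, it does not optimize it.

```python
import numpy as np

# Toy 2-D data: labels y_i in {+1, -1}; values are illustrative.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

w = np.array([0.5, 0.5])   # hand-picked, not optimized
b = 0.0
lam = 0.01                 # the lambda in Eq. (4.30)

margins = y * (X @ w - b)                       # y_i (w . x_i - b)
hinge = np.maximum(0.0, 1.0 - margins)          # hinge loss, Eq. (4.29)
objective = hinge.mean() + lam * np.dot(w, w)   # Eq. (4.30)
print(objective)
```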

4.1.4.2. Kernels

Back in our discussion of linear regression, we had a problem in which the input x was the living area of a house, and we considered performing regression using the features x, x^2, and x^3 (say) to obtain a cubic function. To distinguish between these two sets of variables, we’ll call the “original input value” the input attributes of a problem (in this case, x, the living area). When that is mapped to some new set of quantities that are then passed to the learning algorithm, we’ll call those new quantities the input features. (Unfortunately, different authors use different terms to describe these two things, but we’ll try to use this terminology consistently in these notes.) We will also let ϕ denote the feature mapping, which maps from the attributes to the features. For instance, in our example, we had

\phi(x) = [x,\; x^2,\; x^3]^T   (E4.31)

Rather than applying SVMs using the original input attributes x, we may instead want to learn using some features ϕ(x). To do so, we simply need to go over our previous algorithm and replace x everywhere in it with ϕ(x).

Since the algorithm can be written entirely in terms of the inner products ⟨x, z⟩, this means that we would replace all those inner products with ⟨ϕ(x), ϕ(z)⟩. Specifically, given a feature mapping ϕ, we define the corresponding kernel to be

K(x, z) = \phi(x)^T \phi(z)   (E4.32)

Then, everywhere we previously had ⟨x, z⟩ in our algorithm, we can simply replace it with K(x, z), and our algorithm will now learn using the features ϕ.

Now, given ϕ, we could easily compute K(x, z) by finding ϕ(x) and ϕ(z) and taking their inner product. But what is more interesting is that often K(x, z) may be very inexpensive to calculate, even though ϕ(x) itself may be very expensive to calculate (perhaps because it is an extremely high-dimensional vector) [116]. In such settings, by using an efficient way to calculate K(x, z) in our algorithm, we can get SVMs to learn in the high-dimensional feature space given by ϕ, but without ever having to explicitly find or represent vectors ϕ(x) [117].

Let’s see an example. Suppose x, z ∈ R^n, and consider

K(x, z) = (x^T z)^2   (E4.33)

We can also write this as

K(x, z) = \left(\sum_{i=1}^{n} x_i z_i\right)\left(\sum_{j=1}^{n} x_j z_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j z_i z_j = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j)   (E4.34)

Thus, we see that K(x, z) = \phi(x)^T \phi(z), where the feature mapping ϕ is given (shown here for the case of n = 3) by

\phi(x) = [x_1 x_1,\; x_1 x_2,\; x_1 x_3,\; x_2 x_1,\; x_2 x_2,\; x_2 x_3,\; x_3 x_1,\; x_3 x_2,\; x_3 x_3]^T   (E4.35)

Note that whereas calculating the high-dimensional ϕ(x) requires O(n^2) time, finding K(x, z) takes only O(n) time, linear in the dimension of the input attributes.
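A quick numpy check of Eqs. (4.33)-(4.35) for n = 3: the kernel value (x^T z)^2 equals the inner product of the explicit feature vectors ϕ(x) and ϕ(z); the vectors below are arbitrary examples.

```python
import numpy as np

def phi(v):
    """Explicit feature map of Eq. (4.35): all products v_i * v_j, for n = 3."""
    return np.array([v[i] * v[j] for i in range(3) for j in range(3)])

x = np.array([1.0, 2.0, 3.0])   # arbitrary example vectors
z = np.array([0.5, -1.0, 2.0])

kernel_value = (x @ z) ** 2          # Eq. (4.33): O(n) work
explicit_value = phi(x) @ phi(z)     # Eq. (4.35): O(n^2) work
print(kernel_value, explicit_value)  # the two numbers agree
```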

For a related kernel, also consider

K(x, z) = (x^T z + c)^2 = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j) + \sum_{i=1}^{n} (\sqrt{2c}\, x_i)(\sqrt{2c}\, z_i) + c^2   (E4.36)

(Check this yourself.) This corresponds to the feature mapping (again shown for n=3):

\phi(x) = [x_1 x_1,\; x_1 x_2,\; x_1 x_3,\; x_2 x_1,\; x_2 x_2,\; x_2 x_3,\; x_3 x_1,\; x_3 x_2,\; x_3 x_3,\; \sqrt{2c}\, x_1,\; \sqrt{2c}\, x_2,\; \sqrt{2c}\, x_3,\; c]^T   (E4.37)

and the parameter c controls the relative weighting between the x_i (first-order) and the x_i x_j (second-order) terms.

More broadly, the kernel K(x, z) = (x^T z + c)^d corresponds to a feature mapping into a feature space spanned by all monomials of the form x_{i_1} x_{i_2} \cdots x_{i_k} of degree up to d. However, despite working in this O(n^d)-dimensional space, computing K(x, z) still takes only O(n) time, and hence we never need to explicitly represent feature vectors in this very high-dimensional feature space [118].

Now, let’s talk about a slightly different view of kernels. Intuitively (and there are things wrong with this intuition, but never mind), if ϕ(x) and ϕ(z) are close together, then we might expect K(x, z) = ϕ(x)^T ϕ(z) to be large. Conversely, if ϕ(x) and ϕ(z) are far apart, say nearly orthogonal to each other, then K(x, z) = ϕ(x)^T ϕ(z) will be small [119]. So, we can think of K(x, z) as a measure of how similar ϕ(x) and ϕ(z) are, or of how similar x and z are.

Given this intuition, suppose that for some learning problem you are working on, you have come up with some function K(x, z) that you think might be a reasonable measure of how similar x and z are. For instance, perhaps you chose

K(x, z) = \exp\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)   (E4.38)

This is a reasonable measure of the similarity of x and z: it is close to 1 when x and z are close and near 0 when x and z are far apart. Can we use this definition of K as the kernel in an SVM? In this particular example, the answer is yes. (This kernel is called the Gaussian kernel and corresponds to an infinite-dimensional feature mapping ϕ.) But more broadly, given some function K, how can we tell whether it is a valid kernel, i.e., whether there is some feature mapping ϕ so that K(x, z) = ϕ(x)^T ϕ(z) for all x, z?

Suppose for now that K is indeed a valid kernel corresponding to some feature mapping ϕ. Now, consider some finite set of m points (not necessarily the training set) {x^{(1)}, \ldots, x^{(m)}}, and let a square m-by-m matrix K be defined so that its (i, j)-entry is given by K_{ij} = K(x^{(i)}, x^{(j)}). This matrix is called the kernel matrix [120]. Note that we have overloaded the notation and used K to denote both the kernel function K(x, z) and the kernel matrix K, due to their obvious close relationship.
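A short numpy sketch that builds the kernel matrix for the Gaussian kernel of Eq. (4.38) on a few arbitrary points; for this kernel the matrix should be symmetric with ones on the diagonal, since K(x, x) = 1.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel of Eq. (4.38)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# A few arbitrary points, m = 4.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
m = len(points)

K = np.array([[gaussian_kernel(points[i], points[j]) for j in range(m)]
              for i in range(m)])

print(np.allclose(K, K.T))     # True: the kernel matrix is symmetric
print(np.diag(K))              # all ones, since K(x, x) = 1
```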

4.1.4.3. Regularization and the nonseparable case

The derivation of the SVM as presented so far assumed that the data are linearly separable. While mapping data to a high-dimensional feature space via ϕ does generally increase the likelihood that the data are separable, we cannot guarantee that this will always be so. Also, in some cases it is not clear that finding a separating hyperplane is exactly what we want to do, since that might be susceptible to outliers. For instance, the left panel of Figure 4.3 shows an optimal margin classifier; when a single outlier is added in the upper-left region (right panel), the decision boundary makes a dramatic swing, and the resulting classifier has a much smaller margin [121].

Figure 4.3.

Linearly separable.

To make the algorithm work for non-linearly separable data sets as well as be less sensitive to outliers, we reformulate our optimization (using l_1 regularization) as follows:

\min_{\gamma, \omega, b} \;\; \frac{1}{2}\lVert \omega \rVert^2 + C \sum_{i=1}^{m} \varepsilon_i \quad \text{s.t.} \;\; y^{(i)}(\omega^T x^{(i)} + b) \geq 1 - \varepsilon_i, \; i = 1, \ldots, m; \quad \varepsilon_i \geq 0, \; i = 1, \ldots, m.   (E4.39)

Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin 1 - \varepsilon_i (with \varepsilon_i > 0), we pay a cost of the objective function being increased by C\varepsilon_i. The parameter C controls the relative weighting between the twin goals of making \lVert \omega \rVert^2 small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1.

As before, we can form the Lagrangian:

\mathcal{L}(\omega, b, \varepsilon, \alpha, \gamma) = \frac{1}{2}\omega^T\omega + C\sum_{i=1}^{m}\varepsilon_i - \sum_{i=1}^{m}\alpha_i\left[y^{(i)}(x^{(i)T}\omega + b) - 1 + \varepsilon_i\right] - \sum_{i=1}^{m}\gamma_i\varepsilon_i.   (E4.40)

Here, the \alpha_i's and \gamma_i's are our Lagrange multipliers (constrained to be \geq 0). We won't go through the derivation of the dual again in detail, but after setting the derivatives with respect to w and b to zero as before, substituting them back in, and simplifying, we obtain the following dual form of the problem:

\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \quad \text{s.t.} \;\; 0 \leq \alpha_i \leq C, \; i = 1, \ldots, m; \quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0.   (E4.41)

As before, we also have that \omega can be expressed in terms of the \alpha_i's, so that after solving the dual problem, we can continue making our predictions. Note that, somewhat surprisingly, in adding l_1 regularization, the only change to the dual problem is that what was originally a constraint 0 \leq \alpha_i has now become 0 \leq \alpha_i \leq C. Also, the KKT dual-complementarity conditions (which in the next section will be useful for testing for the convergence of the SMO algorithm) are

\alpha_i = 0 \;\Rightarrow\; y^{(i)}(\omega^T x^{(i)} + b) \geq 1, \qquad \alpha_i = C \;\Rightarrow\; y^{(i)}(\omega^T x^{(i)} + b) \leq 1, \qquad 0 < \alpha_i < C \;\Rightarrow\; y^{(i)}(\omega^T x^{(i)} + b) = 1   (E4.42)

Now, all that remains is to give an algorithm for actually solving the dual problem, which we will do in the next section.

4.1.5. Other classification approaches

4.1.5.1. Random forest

As a recently developed and highly flexible machine learning algorithm, random forest (RF) has a wide range of application prospects, from marketing to health-care insurance: it can be used for marketing modeling and for statistics on customer acquisition, retention, and churn, as well as to predict disease risk and patient susceptibility [122]. As the name implies, a random forest builds a forest in a random way; the forest consists of many decision trees, and there is no connection between the individual trees. Once the forest has been built, when a new input sample arrives, each tree in the forest makes an independent judgment about which class (for a classification task) the sample should belong to, and the class chosen most often becomes the prediction for that sample [123]. A minimal usage sketch follows the list of advantages below.

Advantages:

  1. Excellent accuracy among current algorithms.

  2. Runs effectively on large data sets, handles input samples with high-dimensional features, and does not require dimensionality reduction.

  3. Can assess the importance of each feature in classification problems.

  4. During forest construction, an unbiased estimate of the generalization error can be obtained.

  5. Can also obtain good results on problems with missing values [124].
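A minimal usage sketch with scikit-learn's RandomForestClassifier (the library is assumed to be available); it trains on a built-in data set and reports per-feature importances, illustrating advantage 3 above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees vote independently; the majority class is the prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))       # test-set accuracy
print(forest.feature_importances_)        # importance of each feature
```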

4.1.5.2. K-nearest neighbors

The K-nearest neighbor (KNN) classification algorithm is a mature method and one of the simplest machine learning algorithms. Its idea is as follows: the selected neighbors are objects that have already been correctly classified [95], and the class of the sample to be classified is decided based only on the nearest one or several of these samples. Although the KNN method relies on the limit theorem in principle, the category decision involves only a very small number of adjacent samples [125]. Because KNN depends mainly on the surrounding neighboring samples rather than on a discriminant of the class domain, it is better suited than other methods to sample sets with considerable class crossover or overlap [126].

The KNN algorithm can be used not only for classification but also for regression: the attributes of a sample can be obtained by finding its k nearest neighbors and assigning it the average of those neighbors' attributes. A more useful variant gives different weights to the contributions of the neighbors, for example, weights that decrease with distance [127].

The main drawback of the algorithm for classification is that when the samples are imbalanced, that is, when one class is very large while the others are small, the k neighbors of a new input sample may be dominated by the large-capacity class. This can be mitigated by weighting the neighbors, giving larger weights to neighbors that are close to the sample [128]. Another disadvantage is the high computational cost, because the distance to every known sample must be computed for each item to be classified in order to find its nearest neighbors. A common remedy is to preprocess the known sample points and remove, in advance, samples that contribute little to classification [129]. The algorithm is better suited to the automatic classification of classes with relatively large sample sizes; classes with small sample sizes are more likely to be misclassified [130]. Its advantages are high precision, insensitivity to outliers, and no assumptions about the input data, and it applies to both numeric and nominal data.
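A brief scikit-learn sketch of KNN classification with the distance weighting mentioned above (closer neighbors get larger weights); the data set is a built-in example and scikit-learn is assumed to be installed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weights="distance" gives nearer neighbors more influence than distant ones.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```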

4.1.5.3. Logistic regression and boosting

In addition to the algorithms described above, there are two further classification approaches. Logistic regression can also be used for classification: we first choose a suitable prediction function for the input data, then construct a loss function that measures the deviation between the predicted output and the training labels, and finally use gradient descent to find the minimum of the loss function. Logistic regression is simple to implement, computationally light and fast, and requires little storage. The boosting approach follows a different idea: it starts from weak classifiers, each of which performs well only in certain situations or on certain characteristics, and combines a set of classifiers that are good in their respective circumstances into a strong classifier [131].
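A compact scikit-learn sketch contrasting the two approaches just described: a logistic regression model fitted by iterative optimization of its loss, and a boosted ensemble of weak tree classifiers; the data set and hyperparameters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression: a single linear model minimizing a log-loss objective.
logreg = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Boosting: many weak learners (shallow trees) combined into a strong classifier.
boost = GradientBoostingClassifier(n_estimators=100, max_depth=2).fit(X_train, y_train)

print(logreg.score(X_test, y_test), boost.score(X_test, y_test))
```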

4.2. Clustering

Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or computing the precision and recall of a supervised classification algorithm. In particular, any evaluation metric should not take the absolute values of the cluster labels into account but rather consider whether the clustering defines separations of the data similar to some ground-truth set of classes, or satisfies the assumption that members of the same class are more similar to each other than to members of different classes according to some similarity metric [132].

4.2.1. Performance evaluation for clustering

4.2.1.1. Adjusted Rand index

RI = \frac{a + b}{C_2^{n_{\text{samples}}}}   (E4.43)

where C is the actual category information, K is the clustering result, a is the number of pairs of samples that are placed in the same class in both C and K, and b is the number of pairs that are placed in different classes in both C and K; C_2^{n_{\text{samples}}} is the total number of sample pairs that can be formed from the data set. The RI lies in the range [0, 1]; the larger the value, the more consistent the clustering result is with the real situation, the higher the accuracy of the clustering, and the higher the purity within each class [133].

To ensure that a random clustering yields an index close to zero, the adjusted Rand index (ARI) is introduced, which has a higher degree of discrimination:

ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}   (E4.44)

The ARI lies in the range [−1, 1]; the larger the value, the more consistent the clustering result is with the real situation. From a broad perspective, the ARI measures the degree of agreement between the two distributions [134].

Advantages:

  1. Random (uniform) label assignments have an ARI score close to 0.0.

  2. Bounded range in [−1, 1].

  3. No assumption is made on the cluster structure.

Drawbacks: ARI requires knowledge of the ground-truth classes.

4.2.1.2. Mutual information-based scores

The mutual information is a function that measures the agreement of the two assignments, ignoring permutations. Two different normalized versions of this measure are available, normalized mutual information (NMI) and adjusted mutual information (AMI) [135]:

I(X,Y)=x,yp(x,y)logp(x,y)p(x)p(y)E4.45

Normalized mutual information uses entropy as the denominator to scale the MI value to between 0 and 1; a common realization is as follows:

U(X, Y) = 2R = \frac{2\, I(X; Y)}{H(X) + H(Y)}   (E4.46)

H(X) = \sum_{i=1}^{n} p(x_i) I(x_i) = \sum_{i=1}^{n} p(x_i) \log_b \frac{1}{p(x_i)} = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)   (E4.47)

Advantages:

  1. Random (uniform) label assignments have an AMI score close to 0.0.

  2. Bounded range in [0, 1].

  3. No assumption is made on the cluster structure.

Drawbacks:

MI-based measures require knowledge of the ground-truth classes.

4.2.1.3. Homogeneity, completeness, and V-measure

Homogeneity: Each cluster contains only members of a single class:

h = 1 - \frac{H(C|K)}{H(C)}   (E4.48)

where H(C|K) is the conditional entropy of the classes given the cluster assignments, which is obtained by the following formula:

H(C|K) = -\sum_{c=1}^{|C|}\sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \log\left(\frac{n_{c,k}}{n_k}\right)   (E4.49)

H(C) is the class entropy, given by

H(C) = -\sum_{c=1}^{|C|} \frac{n_c}{n} \log\left(\frac{n_c}{n}\right)   (E4.50)

where n is the total number of samples; n_c and n_k are the numbers of samples belonging to class c and to cluster k, respectively; and n_{c,k} is the number of samples of class c assigned to cluster k.

Completeness: All members of a given class are assigned to the same cluster:

c = 1 - \frac{H(K|C)}{H(K)}   (E4.51)

where the conditional entropy H(K|C) and the cluster entropy H(K) are obtained symmetrically according to the formulas above.

The V-measure is the harmonic mean of homogeneity and completeness; the formula is

v = \frac{2hc}{h + c}   (E4.52)

Advantages:

  1. Bounded scores.

  2. Intuitive interpretation.

  3. No assumption is made on the cluster structure.

Drawbacks:

  1. Not normalized with regard to random labeling.

  2. Require the knowledge of the ground-truth classes.

There are many other performance evaluation approaches for clustering: Fowlkes-Mallows scores, Silhouette Coefficient, and Calinski-Harabasz index [136].
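These ground-truth-based measures are all available in scikit-learn (assumed to be installed); the sketch below evaluates one invented cluster assignment against an invented ground truth.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1, 2, 2]   # ground-truth classes (invented)
labels_pred = [0, 0, 1, 1, 1, 1, 2, 2]   # cluster assignment (invented)

print(metrics.adjusted_rand_score(labels_true, labels_pred))            # ARI, Eq. (4.44)
print(metrics.normalized_mutual_info_score(labels_true, labels_pred))   # NMI
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))     # AMI
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))  # h, c, v
```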

4.2.2. Partition clustering

4.2.2.1. K-means

K-means and KNN both have a k in their names, but they are two different kinds of algorithms: KNN is a supervised classification algorithm, while K-means is an unsupervised clustering algorithm. Both use neighbor information to assign a category.

K-means is among the simplest and most efficient clustering algorithms. Its core idea is to start from initial centroids (specified by the user or otherwise initialized) and to repeat the assignment and update iterations until the algorithm converges, each centroid defining one cluster [137].
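A minimal scikit-learn sketch of the iterative procedure just described; the data are random blobs generated for illustration and the library is assumed to be available.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic, well-separated groups of points.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)                      # alternates assignment and centroid update steps

print(kmeans.cluster_centers_)     # the learned centroids (means of each cluster)
print(kmeans.inertia_)             # within-cluster sum of squared distances
```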

The K-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μ_j of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space. The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion: