Venkateswarlu",coverURL:"https://cdn.intechopen.com/books/images_new/371.jpg",editedByType:"Edited by",editors:[{id:"58592",title:"Dr.",name:"Arun",surname:"Shanker",slug:"arun-shanker",fullName:"Arun Shanker"}],productType:{id:"1",chapterContentType:"chapter",authoredCaption:"Edited by"}},{type:"book",id:"878",title:"Phytochemicals",subtitle:"A Global Perspective of Their Role in Nutrition and Health",isOpenForSubmission:!1,hash:"ec77671f63975ef2d16192897deb6835",slug:"phytochemicals-a-global-perspective-of-their-role-in-nutrition-and-health",bookSignature:"Venketeshwer Rao",coverURL:"https://cdn.intechopen.com/books/images_new/878.jpg",editedByType:"Edited by",editors:[{id:"82663",title:"Dr.",name:"Venketeshwer",surname:"Rao",slug:"venketeshwer-rao",fullName:"Venketeshwer Rao"}],productType:{id:"1",chapterContentType:"chapter",authoredCaption:"Edited by"}},{type:"book",id:"4816",title:"Face Recognition",subtitle:null,isOpenForSubmission:!1,hash:"146063b5359146b7718ea86bad47c8eb",slug:"face_recognition",bookSignature:"Kresimir Delac and Mislav Grgic",coverURL:"https://cdn.intechopen.com/books/images_new/4816.jpg",editedByType:"Edited by",editors:[{id:"528",title:"Dr.",name:"Kresimir",surname:"Delac",slug:"kresimir-delac",fullName:"Kresimir Delac"}],productType:{id:"1",chapterContentType:"chapter",authoredCaption:"Edited by"}}]},chapter:{item:{type:"chapter",id:"63787",title:"Data Service Outsourcing and Privacy Protection in Mobile Internet",doi:"10.5772/intechopen.79903",slug:"data-service-outsourcing-and-privacy-protection-in-mobile-internet",body:'\nThe data of mobile Internet have the characteristics of large scale, variety of patterns, complex association and so on. On the one hand, it needs efficient data processing model to provide support for data services, and on the other hand, it needs certain computing resources to provide data security services. Due to the limited resources of mobile terminals, it is impossible to complete large-scale data computation and storage. However, outsourcing to third parties may cause some risks in user privacy protection.
This monograph focuses on the key technologies of data service outsourcing and privacy protection in the mobile Internet, including existing methods of data analysis and processing, fine-grained data access control through effective user privacy protection mechanisms, and data sharing in the mobile Internet, which together provide technical support for improving various service applications based on the mobile Internet. The financial support from the National Natural Science Foundation of China under grant No. 61672135, the Sichuan Science-Technology Support Plan Program under grants No. 2018GZ0236, No. 2017FZ0004, and No. 2016JZ0020, and the National Science Foundation of China - Guangdong Joint Foundation under grant No. U1401257 is also greatly appreciated.
Needless to say, any errors and omissions that remain are entirely our own responsibility, and we would greatly appreciate feedback about them from readers. Please send your comments to “qinzhen@uestc.edu.cn” and put “Data Service Outsourcing and Privacy Protection in Mobile Internet” in the subject line.
Zhen Qin
University of Electronic Science and Technology of China
March 2018
With the continuous development of science and technology, the mobile Internet (MI) has become an important part of our daily lives. It is a product of the convergence of mobile communication technologies and Internet technologies, and it offers wide network coverage, convenience, and instant access. The smartphone is the main carrier of the mobile Internet. Through smartphones, people can make voice calls, send text messages, and hold video communications, as well as surf the Internet at any time (browsing the web, sending and receiving e-mails, watching movies, etc.). In other words, using smartphones, users are better able to create, deploy, and manage novel and efficient mobile applications. Nowadays, the MI has become an important portal and major innovation platform for Internet businesses and an important hub for the exchange of information resources among various social media, business service companies, and new application software.
The mobile Internet, defined as the platform through which users wirelessly access the digitized content of the Internet via mobile devices, consists of the following:
An Internet service provider is an organization that provides services such as Internet access, Internet transit, domain name registration, web hosting, Usenet service, and colocation.
Users can obtain different services from the mobile Internet, such as points-of-interest services, music services, fitness services, and mobile commerce services, by using mobile applications.
In the MI, mobile devices, which include smartphones, tablets, PCs, e-books, MIDs, etc., enable users to stay connected with not only the network service providers through LTE and NFC techniques but also the physically close peers through short-range wireless communication techniques. In fact, each individual user has a personalized behavior pattern, which can influence the reliability and efficiency of the network in the MI. Moreover, the MI has a wide range of applications such as wireless body area networks (WBANs), mobile e-commerce, mobile social networks, mobile ad hoc networks, etc. The common scenarios of the MI will be shown in \nFigure 1.1\n.
\nThe common scenarios of the MI.
The mobile Internet offers a huge number of applications to its users. Most of these applications have unique features or special functions, but their underlying technology remains largely the same as that of other services. The popular applications in the mobile Internet are briefly reviewed below.
Social networking has already become an integral part of our daily lives, enabling us to contact our friends and family. Nowadays, the pervasive use of the mobile Internet and of social connections fosters a promising mobile social network (MSN) in which reliable, comfortable, and efficient computation and communication tools are provided to improve the quality of our work and life.
An MSN is a social network in which individuals with similar interests converse and connect with one another through their mobile phones and/or tablets. Much like web-based social networking, mobile social networking occurs in virtual communities [1].
Similar to the many online social networking sites, such as Facebook and Twitter, there are just as many social networks on mobile devices. They offer a vast number of functions, including multimedia posts, photo sharing, and instant messaging. Most of these mobile applications offer free international calling and texting capabilities. Today, social networking applications are used not only for social purposes but frequently for professional ones as well, such as LinkedIn, which is still growing steadily. Along with sharing multimedia posts and instant messaging, social networks are commonly used to connect immigrants in a new country. While the thought of moving to a new country may be intimidating for many, social media can be used to connect immigrants from the same homeland and make assimilation a little less stressful [2].
Because of the limitations of applications restricted to wired communication, extending applications to the mobile Internet has become necessary. The growth of laptops and 802.11/Wi-Fi wireless networks has made mobile ad hoc networks (MANETs) a popular research topic since the mid-1990s.
A mobile ad hoc network (MANET), also known as a wireless ad hoc network [3] or ad hoc wireless network, is a continuously self-configuring, infrastructure-less network of mobile devices connected using wireless channels [4]. Each device in a MANET is free to move independently in any direction and will therefore change its links to other devices frequently. Such networks may operate by themselves or may be connected to the larger Internet. They may contain one or multiple different transceivers between nodes. This results in a highly dynamic, autonomous topology [5].
MANETs can be used in many applications, ranging from environmental sensing, vehicular ad hoc communication, road safety, smart homes, and peer-to-peer messaging to air/land/naval defense, weapons, and robots. A typical instance is health monitoring, which may include heart rate, blood pressure, etc. [6]. This monitoring can be continuous, as for a patient in a hospital, or event driven, as for a wearable sensor that automatically reports your location to an ambulance team in case of an emergency. Animals can have sensors attached to them in order to track their movements for migration patterns, feeding habits, or other research purposes [7]. Another example is disaster rescue operations. Sensors may also be attached to unmanned aerial vehicles (UAVs) for surveillance or environment mapping [8]. In the case of autonomous UAV-aided search and rescue, this would be considered an event mapping application, since the UAVs are deployed to search an area but will only transmit data back when a person has been found.
With rapid economic growth, payment methods have changed considerably all over the world. Driven by the development of the mobile Internet, mobile payment has become increasingly pervasive in Asian countries, especially China.
Mobile payment (also referred to as mobile money, mobile money transfer, and mobile wallet) generally refers to payment services operated under financial regulation and performed from or via a mobile device. Instead of paying with cash, check, or credit cards, a consumer can use a mobile device to pay for a wide range of services and digital or hard goods.
\nAlthough the concept of using non-coin-based currency systems has a long history [9], it is only recently that the technology to support such systems has become widely available.
Since the beginning of the twenty-first century, China's Internet has developed rapidly, and the mobile Internet has further accelerated this development. We soon discovered that today's Internet is used for e-commerce shopping, map navigation, mobile payment, stock financing, online booking, government services, account recharging, and other activities that are completely different from before; it seems that everything can be done on the Internet with nothing more than a phone. In the mobile Internet, users and mobile terminals are closely related, and the accumulation of user behavior in the mobile Internet is of great value. Smartphones and other mobile terminals are no longer simple communication tools but have become terminals that integrate communication, computing, and sensing functions and can continuously generate and collect various kinds of data [10]. The sensing devices on a mobile terminal can perceive the surrounding environment and generate data; at the same time, the mobile terminal serves as the user's entrance to the mobile Internet and can collect data generated by other users. By combining the various data related to mobile users, user behaviors can be modeled and predicted, and important statistics for specific services can be extracted. For example, mobile users' GPS information and historical travel habits can be used to generate traffic statistics and travel advice [11]. It can be seen that, based on the data generated in the mobile Internet, we can promptly satisfy the different requirements of different users in different scenarios by judging users' information needs and accurately sorting information categories, thereby providing users with personalized, high-quality information and improving their service experience.
Relevant statistics show that in just 3 years of rapid development of the mobile Internet, the total amount of data and information generated by humans exceeded that of the previous 400 years. The global data volume reached 1.8 ZB in 2011, and it is expected to reach 35 ZB by 2020. However, as data accumulate, the amount of data available for analysis and mining continues to increase, and the entropy of information (i.e., the ratio of total information to valuable information) tends to increase as well. The growing amount of data makes information overload, information redundancy, and information search new issues in the mobile Internet era. Search engines and recommendation systems provide very important technical means for solving these problems. When users search for information on the Internet, the search engine matches the keywords supplied by the users against its index in the background and displays the results. However, if the user cannot accurately describe the keywords they need, the search engine is powerless. Unlike search engines, a recommendation system does not require users to state explicit requirements; it models users' interests by analyzing their historical behavior and thereby actively recommends information that meets their interests and needs. In recent years, e-commerce has been booming, and the dominant position of the recommendation system on the Internet has become increasingly apparent. However, in the mobile Internet, data are not only increasingly large in scale but also exhibit diverse models, complex relationships, and uncertain reliability. Diverse models make the data content difficult to understand; complex associations make the data difficult to identify effectively; uncertain credibility makes it difficult to determine the authenticity of the data. Such data characteristics pose new challenges for the perception, expression, understanding, and computation of data. The space-time complexity of traditional computing models greatly limits our ability to design efficient data computing models for the mobile Internet. How to quantify, define, and extract the essential characteristics and internal correlations of data in the mobile Internet, study their internal mechanisms, and provide support for recommendation services is an urgent problem to be solved.
At the same time, because data in the mobile Internet are large in scale, diverse in model, and complex in relationships, data analysis and processing place certain demands on computing resources, while mobile terminals such as smartphones have limited computing resources (such as processing capability). Therefore, analysis and processing of mobile Internet data are often carried out on devices such as third-party data processing servers that have sufficient computing resources. However, users and mobile terminals in the mobile Internet are closely related, and their data contain a large amount of private user information, which brings a series of security issues. The users of an outsourced data application system are not a fixed set of specific users; any user connected to the Internet may submit data query requests to the outsourcing system, and different users have different requirements of it. Precisely because of this openness, the outsourcing system may be completely exposed to any external potential user, and it is not enough to protect the privacy of outsourced data only through system-level access control, cryptography, and other technical means. According to statistics from the Beijing Zhongguancun police station, telecommunication fraud reported for the whole of 2012 accounted for 32% of the cases filed, the highest proportion among all types of crime. Fraudsters often use six kinds of methods:
Impersonation of an individual or a circle of friends after the disclosure of information, such as criminals posing as staff of public security organs, postal services, telecommunications companies, banks, or social security offices, or as friends and relatives; this accounts for 42% of the total number of fraud cases.
Fraud by a fake seller after the leakage of shopping information.
Lottery-winning fraud caused by the leakage of telephone numbers, QQ accounts, or e-mail addresses.
False recruitment information received after the leakage of job-seeking information.
Internet dating fraud after the leakage of dating information.
Kidnapping fraud after the disclosure of family information. It can be seen that many third parties have disclosed users' personal information to varying degrees.
When people realized that they wanted to protect their privacy and tried to hide their actions, they did not think that their actions were already on the Internet; especially, different social locations in the network generate many data footprints. This kind of data has the characteristics of being cumulative and relevant. The information of a single place may not reveal the privacy of the user, but if a lot of people’s behaviors are gathered from different independent locations, his privacy will be exposed because there has been enough information about him. This hidden data exposure is often unpredictable and controllable by individuals. The leakage of personal privacy information has caused panic among some users, which has negatively affected the sharing of data resources in the mobile Internet. The issue of privacy protection has become increasingly important.
As carriers of personal information, mobile smart terminals are characterized by mobility, diversity, and intelligence. The various applications installed on them are of complex types, their execution mechanisms are not disclosed, and they handle large amounts of data; these factors combine to expose them to much greater threats than before. In practice, the security issues of mobile smart terminals are mainly concentrated in the following aspects: first, illegal dissemination of content; second, malicious fee deduction; third, theft of user privacy; and fourth, mobile terminal viruses and illegal firmware flashing that cause black screens, system crashes, and other problems. Mobile phone malicious code and the theft of user privacy are taken as examples here. Because of negligence in the design and implementation of mobile smart terminal operating systems, system vulnerabilities may be exploited by malicious code. In particular, as mobile Internet hacking techniques mature, malicious code targeted at mobile smart terminals poses a great threat to their security. Mobile malicious code is a destructive malicious program for mobile terminals. It usually spreads between mobile smart terminals via SMS, MMS, e-mail, and website browsing over mobile communication networks, or via infrared and Bluetooth. At present, the mobile smart terminal operating system with the highest market share is Google's Android, followed by Apple's iOS and Microsoft's Windows Phone. As smart terminal operating systems become more uniform, the spread of malicious code has accelerated, and it has evolved from simple fee deduction to sophisticated fraud and hooliganism. In addition, as mobile terminals are increasingly bound to bank accounts, for example through third-party payment, mobile phone viruses are emerging and spreading ever faster in the direction of traffic consumption, malicious deduction, and theft of private information. In particular, with the rapid development of the Android operating system, mobile smart terminals running it have gradually become the main targets of hacker attacks. In the first half of 2015, there were 5.967 million new Android virus packages nationwide, an increase of 17.41% year on year; the number of infected users reached 140 million, an increase of 58% over the same period of the previous year; and the number of affected mobile payment users reached 11.455 million. How to securely outsource the storage of mobile Internet data to a third-party server while ensuring user privacy and realizing fine-grained access control of mobile Internet data is an urgent problem to be solved.
Information retrieval is an important means of information exchange between users and data and has been studied by many researchers. Traditional information retrieval is based on keyword indexing and document retrieval. However, data on the mobile Internet have four features: large scale, complicated models, diversified data formats, and heterogeneous data sources. As a result, different keywords may carry the same meaning, and search results can be inaccurate or incomplete.
To deal with this situation, the work in [12] defines ontology-based information retrieval. This model includes the four main processes of an IR system: indexing, querying, searching, and ranking. However, as opposed to traditional keyword-based IR models, in this approach the query is expressed in an ontology-based query language (SPARQL), and the external resources used for indexing and query processing consist of an ontology and its corresponding knowledge base (KB). The indexing process is equivalent to a semantic annotation process. Instead of creating an inverted index where keywords are associated with the documents in which they appear, in the ontology-based IR model the inverted index contains semantic entities (meanings) associated with the documents in which they appear. The relation or association between a semantic entity and a document is called an annotation. The overall retrieval process consists of the following steps:
\nThe system takes as input a formal SPARQL query.
The SPARQL query is executed against a KB, returning a list of semantic entities that satisfy the query. This process is purely Boolean (i.e., based on an exact match), so that the returned instances must strictly hold all the conditions of the formal query.
The documents that are annotated (indexed) with the above instances are retrieved, ranked, and presented to the user. In contrast to the previous phase, the document retrieval phase is based on an approximate match, since the relation between a document and the concepts that annotate it has an inherent degree of fuzziness.
The steps listed above are described in more detail in the following subsections, from indexing to query processing, document retrieval, and ranking.
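As a concrete illustration of the retrieval steps above, the following sketch mocks the flow in Python with in-memory structures; the names (run_sparql, semantic_index) and the sample data are hypothetical, and a real system would evaluate SPARQL against an RDF store rather than a stubbed function.

```python
# Hypothetical sketch of the ontology-based retrieval flow described above.
from collections import defaultdict

# Semantic index: entity URI -> {document id: annotation weight}
semantic_index = {
    "ex:Rioja":    {"doc1": 0.9, "doc3": 0.4},
    "ex:WineTour": {"doc1": 0.7, "doc2": 0.8},
}

def run_sparql(query):
    """Mocked Boolean KB lookup: returns entity URIs that exactly satisfy the query."""
    return ["ex:Rioja", "ex:WineTour"]

def retrieve(query):
    entities = run_sparql(query)                 # steps 1-2: query the KB
    scores = defaultdict(float)
    for e in entities:                           # step 3: collect annotated documents
        for doc, weight in semantic_index.get(e, {}).items():
            scores[doc] += weight                # accumulate annotation weights
    return sorted(scores.items(), key=lambda kv: -kv[1])   # rank by score

print(retrieve("SELECT ?x WHERE { ... }"))       # e.g. [('doc1', 1.6), ('doc2', 0.8), ...]
```

The stubbed Boolean lookup corresponds to step 2, while the weighted accumulation over the semantic index corresponds to the approximate document-retrieval and ranking phase of step 3.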
In this model, it is assumed that a KB has been built and associated with the information sources (the document base) by using one or several domain ontologies that describe concepts appearing in the document text. The concepts and instances in the KB are linked to the documents by means of explicit, non-embedded annotations of the documents. In this model, keywords appearing in a document are assigned weights reflecting the fact that some words are better than others at discriminating between documents. Similarly, in this system, annotations are assigned weights that reflect the discriminative power of instances with respect to the documents. The weight
where \n
The query execution returns a set of tuples that satisfy the SPARQL query. Then, extract the semantic entities from those tuples, and access the semantic index to collect all the documents in the repository that are annotated with these semantic entities. Once the list of documents is formed, the search engine computes a semantic similarity value between the query and each document, using an adaptation of the classic vector space IR model. Each document in the search space is represented as a document vector where each element corresponds to a semantic entity. The value of an element is the weight of the annotation between the document and the semantic entity, if such annotation exists, and zero otherwise. The query vector is generated weighting the variables in the SELECT clause of the SPARQL query. For testing purposes, the weight of each variable of the query was set to 1, but in the original model, users are allowed to manually set this weight according to their interest. Once the vectors are constructed, the similarity measure between a document \n
If the knowledge in the KB is incomplete (e.g., there are documents about travel offers in the knowledge source, but the corresponding instances are missing in the KB), the semantic ranking algorithm performs very poorly: SPARQL queries return fewer results than expected, and the relevant documents are not retrieved or receive a much lower similarity value than they should. As limited as it may be, keyword-based search will likely perform better in these cases. To cope with this, the ranking function combines the semantic similarity measure with the similarity measure of a keyword-based algorithm. The outputs of the two search engines are combined, and the final score is computed as
\nwhere \n
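The similarity and combination formulas are not reproduced in the text above; the sketch below shows one conventional reading of them, cosine similarity between annotation-weight vectors and a linear blend of the semantic and keyword scores, where the weighting parameter lam and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def final_score(sem_sim, kw_sim, lam=0.5):
    """Linear combination of the semantic and keyword-based similarities."""
    return lam * sem_sim + (1 - lam) * kw_sim

doc_vec   = {"ex:Rioja": 0.9, "ex:WineTour": 0.7}   # annotation weights of a document
query_vec = {"ex:Rioja": 1.0}                        # SELECT variables weighted 1
print(final_score(cosine(doc_vec, query_vec), kw_sim=0.3))
```

Setting lam to 1 recovers purely semantic ranking, while lam close to 0 falls back to the keyword-based score when the KB is incomplete.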
Many information retrieval models have been put forward to improve retrieval performance, and some combinations of these technologies have proven effective. However, faced with the rich data in the MI, substantial improvement is still needed to make search more intelligent and efficient.
Machine learning is a field of computer science that gives computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) from data without being explicitly programmed. It learns regularities from external input data through algorithmic analysis and uses them for recognition and judgment. As machine learning has developed, it has come to be classified into shallow learning and deep learning, and it plays an important role in the field of artificial intelligence. Because of the large scale, multiple models, and complex correlations of the data, traditional shallow learning algorithms (such as the support vector machine algorithm [13]), although widely used, are still subject to certain restrictions when dealing with complex classification [14].
Deep learning is a new field in the study of machine learning; its motivation is to establish and simulate neural networks that analyze and learn like the human brain. It mimics the mechanisms of the human brain to interpret data such as images, sound, and text. As with other machine learning methods, deep learning methods include supervised learning and unsupervised learning. Supervised learning algorithms experience a data set containing features, where each example is also associated with a label or target. For example, the iris data set is annotated with the species of each iris plant, so a supervised learning algorithm can study the iris data set and learn to classify iris plants into three different species based on their measurements. Unsupervised learning algorithms experience a data set containing many features and then learn useful properties of the structure of this data set. In the context of deep learning, we usually want to learn the entire probability distribution that generated a data set, whether explicitly, as in density estimation, or implicitly, as for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the data set into clusters of similar examples. The learning models established under different learning frameworks are very different. For example, the convolutional neural network (CNN) is a deep learning model typically trained with supervision, whereas the deep belief network (DBN) is a deep learning model that can be trained without supervision.
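The iris example above can be made concrete with a few lines of supervised learning code; the sketch below uses scikit-learn (assumed to be available) with a support vector machine, one of the shallow learners mentioned earlier.

```python
# A minimal supervised-learning example in the spirit of the iris illustration above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # measurements + species labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf")                           # a shallow learner (cf. [13])
clf.fit(X_tr, y_tr)                               # supervised: learns from the labels
print("test accuracy:", clf.score(X_te, y_te))    # classify into the three species
```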
The concept of deep learning was proposed in 2006 by Hinton [15]. Based on the deep belief network (DBN), an algorithm was proposed to solve the optimization problem of deep structures, and the deep structure of the multilayer autoencoder was then proposed. In addition, the convolutional neural network proposed by Lecun et al. [16] was the first true multilayer structure learning algorithm; it uses spatial relative relationships to reduce the number of parameters and improve training performance. Deep learning is a representation learning method within machine learning. Observations (such as an image) can be represented in a variety of ways, such as a vector of per-pixel intensity values, or more abstractly as a set of edges, specific shapes, and so on. It is easier to learn a task from examples using certain specific representations (e.g., face recognition or facial expression recognition). The advantage of deep learning is that it replaces manual feature engineering with unsupervised or semi-supervised feature learning and hierarchical feature extraction.
\nDeep learning architectures such as deep neural networks, deep belief networks, and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, and drug design [13], where they have produced results comparable to and in some cases superior [14] to human experts.
The work in [17] first proposed a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR). This model is a pre-trained deep neural network-hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. Although the experiments show that CD-DNN-HMMs provide dramatic improvements in recognition accuracy, many issues remain to be resolved. First, although CD-DNN-HMM training is asymptotically quite scalable, in practice it is quite challenging to train CD-DNN-HMMs on tens of thousands of hours of data. Second, highly effective speaker and environment adaptation algorithms for DNN-HMMs must be found, ideally ones that are completely unsupervised and integrated with the pre-training phase. Third, the training in this study used the embedded Viterbi algorithm, which is not optimal. In addition, the study views the treatment of the time dimension of speech by DNN-HMMs and GMM-HMMs alike as a very crude way of dealing with the intricate temporal properties of speech. Finally, although Gaussian RBMs can learn an initial distributed representation of their input, they still produce a diagonal-covariance Gaussian for the conditional distribution over the input space given the latent state (as diagonal-covariance GMMs also do). Ji et al. [18] developed a 3D CNN model for action recognition. This model constructs features from both the spatial and temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is computed by combining information from all channels. Evaluated on the TRECVID and KTH data sets, the results show that the 3D CNN model outperforms the compared methods on the TRECVID data and achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments. Furthermore, Cho et al. [19] proposed a novel neural network model called the RNN encoder-decoder that is able to learn the mapping from a sequence of arbitrary length to another sequence, possibly from a different set, of arbitrary length. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system was empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN encoder-decoder as an additional feature in the existing log-linear model. One approach that was not investigated, however, is to replace the whole or a part of the phrase table by letting the RNN encoder-decoder propose target phrases. A multi-context deep learning framework for saliency detection is proposed in [20]. Different pre-training strategies are investigated to learn the deep model for saliency detection, and a task-specific pre-training scheme for the presented multi-context deep model is designed. The models are initialized from contemporary deep models for the ImageNet image classification challenge, and their effectiveness in saliency detection is investigated.
Although deep learning has shown great success in various applications such as object recognition [21] and speech recognition [17], especially language modeling [22], paraphrase detection [23], and word embedding extraction [24], its application in the MI is still worth investigating.
\nAttribute-based encryption (ABE) is a promising cryptographic primitive, which has been widely applied to design fine-grained access control system recently. Data access control has been evolving in the past 30 years, and various techniques have been developed to effectively implement fine-grained access control [25], which allows flexibility in specifying differential access rights of individual users. However, traditional access control systems are mostly designed for in-house services and depend greatly on the system itself to enforce authorization policies. Thus, they cannot be applied in cloud computing because users and cloud servers are no longer in the same trusted domain. For the purpose of helping the data owner impose access control over data stored on untrusted cloud servers, a feasible consideration would be encrypting data through certain cryptographic primitives but disclosing decryption keys only to authorized users. One critical issue of this branch of approaches is how to achieve the desired security goals without introducing high complexity of key management and data encryption. Existing works resolve this issue either by introducing a per-file access control list (ACL) for fine-grained access control or by categorizing files into several filegroups for efficiency. As the system scales, however, the ACL-based scheme would introduce an extremely high complexity which could be proportional to the number of system users. The filegroup-based scheme, on the other hand, is just able to provide coarse-grained access control of data.
To provide fine-grained access control over encrypted data, a novel public-key primitive, attribute-based encryption (ABE) [26], was introduced in the cryptographic community; it enables public-key-based one-to-many encryption. In an ABE system, users' keys and ciphertexts are labeled with sets of descriptive attributes and access policies, respectively, and a particular key can decrypt a ciphertext only if the associated attributes and policy match. Although ABE is a promising primitive for designing fine-grained access control systems in cloud computing, several challenges remain in its application.
\nOne of the main drawbacks of ABE is that the computational cost in decryption phase grows with the number of attributes specified in the access policy. The drawback appears more serious for resource-constrained users such as mobile devices and sensors. Therefore, one challenge is how to reduce the decryption complexity of ABE such that it can be applied to fine-grained access control for users with resource-constrained devices.
Beyond decryption, generating users’ private key in existing ABE schemes also requires a great quantity of modular exponentiations. Furthermore, the revocation of any single user in existing ABE requires key update at authority for remaining users who share his/her attributes. All of these heavy tasks centralized at the authority side would make it become the efficiency bottleneck in the whole access control system. Therefore, another challenge is how to reduce the key-issuing complexity of ABE such that scalable access control can be supported.
To improve the practical applicability of attribute-based encryption, researchers began to introduce outsourced computing into attribute-based encryption systems. In 2011, Green et al. [27] proposed a scheme that outsources the decryption operation in an attribute-based cryptosystem to a third party in order to improve the client's efficiency and enable the attribute-based cryptosystem to run on devices with limited computational resources such as smart cards. The ciphertext length in this scheme increases linearly with the number of attributes in the access policy. In [27], the third party uses a transformation key sent by the user to convert the ABE ciphertext, whose decryption is expensive, into an ElGamal-style ciphertext with a much lower decryption cost and transmits the converted ciphertext back to the user; the user can then complete the decryption locally with his own key. Li et al. [28] reduced the client's computational complexity in the attribute-based cryptosystem from the perspective of outsourcing cryptographic operations: after outsourcing, the user only needs to perform a constant number of exponentiations for encryption regardless of the number of attributes and can encrypt under any access policy. Li et al. [29] proposed an attribute-based encryption scheme that supports both outsourced key generation and outsourced decryption and, on this basis, a fine-grained access control system. Lai et al. [30] pointed out that the scheme in [27] does not allow the user to verify the correctness of the results returned by the third party and that its security relies on the random oracle model; building on [27], they proposed a verifiable outsourced attribute-based encryption scheme in which, by adding redundant information, users can verify the transformation results returned by third parties to ensure the correctness of the conversion, with security proven in the standard model. Li et al. [31] proposed an attribute-based encryption scheme, building on [27, 28, 31] and related work, that reduces the computing burden of both the attribute authority and the user and also supports user verification; by outsourcing part of the key generation and decryption operations to different service providers, the local computing burden of the attribute authority and the user is greatly reduced. Lin et al. [32] pointed out that the method of adding redundant information in [30], while achieving verification of the returned result, also increases the computational and communication load of the system; they proposed an outsourced attribute-based encryption scheme that uses a hybrid encryption system and a commitment mechanism to accomplish user verification. The ciphertext length and the cost of cryptographic operations in this scheme are only half those of the scheme in [30], and its security is based on the standard model. Ma et al. [33] introduced the concept of exculpability (innocence proof) for the first time, under which a user who holds the private key cannot falsely report cloud server operation errors, and proposed an attribute-based encryption scheme that supports both outsourced encryption and decryption operations together with verifiability and exculpability. Mao et al. [34] proposed an attribute-based encryption scheme with verifiable outsourced decryption; experiments show that the scheme is more concise and less computationally expensive than the scheme proposed in [30], but its cost is still linearly related to the number of attributes. Xu et al. [35] proposed a circuit-based ciphertext-policy attribute-based hybrid encryption system to ensure data integrity in cloud computing environments and to implement fine-grained access control and a verifiable delegation mechanism.
From the existing work, it can be seen that current secure outsourced attribute-based encryption schemes cannot achieve a constant ciphertext length; although these solutions improve computational efficiency, they also increase the communication load of mobile devices. Achieving fine-grained access control of data in the MI and promoting data sharing in the MI therefore remain worth investigating.
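To make the outsourced-decryption workflow of [27] easier to follow, here is a schematic, non-cryptographic mock in Python: it only models the dataflow (a transformation key held by the cloud, a cheap final step on the device), every function name and structure is illustrative, and no real security is provided.

```python
# Schematic mock of ABE with outsourced decryption (dataflow only, no cryptography).

def abe_encrypt(message, policy):
    # A real ABE scheme would hide `message` under `policy`; we only record the intent.
    return {"policy": set(policy), "payload": message}

def keygen(attributes):
    return {"attributes": set(attributes)}

def make_transform_key(secret_key):
    # Blinded version of the secret key that is safe to hand to the cloud.
    return {"attributes": secret_key["attributes"], "blinded": True}

def cloud_transform(ciphertext, transform_key):
    # Cloud checks the policy and performs the expensive work (mocked here).
    if not ciphertext["policy"].issubset(transform_key["attributes"]):
        raise PermissionError("attributes do not satisfy the access policy")
    return {"partially_decrypted": ciphertext["payload"]}   # short ElGamal-like result

def user_final_decrypt(partial, secret_key):
    # Cheap local step: only constant work remains on the mobile device.
    return partial["partially_decrypted"]

sk = keygen({"doctor", "cardiology"})
ct = abe_encrypt("patient record", policy={"doctor", "cardiology"})
tk = make_transform_key(sk)
print(user_final_decrypt(cloud_transform(ct, tk), sk))
```

In a real scheme, cloud_transform would perform the expensive pairing operations over the ABE ciphertext, and the partially decrypted result would be a short ElGamal-style ciphertext that the device finishes with a single exponentiation.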
Since cloud users do not fully trust cloud service providers (CSPs), they do not want any unauthorized party (including the cloud service provider itself) to view their data. This cannot rely on administrative constraints or ethics alone; only technical means can ensure that cloud service providers and other unauthorized parties cannot obtain any of the data uploaded by customers and that any unauthorized change to customer data in the cloud will be detected by the customers. Therefore, user data usually need to be encrypted on the client before being transmitted to the cloud. However, in the cloud computing environment there are many data sharing scenarios. If decryption is performed in the cloud when data are shared, there is a risk of leaking user data, which is unacceptable because the cloud provider is not fully trusted by the user. If the user is instead required to download and re-encrypt the file for every recipient, this is very inconvenient. A proxy re-encryption (PRE) scheme for distributed data storage can solve this problem well.
Proxy re-encryption (PRE), initially introduced by Blaze et al. [36], enables a semi-trusted proxy to transform a ciphertext encrypted under the public key of a delegator into another ciphertext under the public key of a delegatee without leaking the underlying messages or the private keys of the delegator or delegatee to the proxy. This special kind of public-key encryption appears to be an ideal candidate for ensuring the security of shared data in cloud computing. Suppose the data owner (say, Alice) intends to share sensitive data stored in the cloud with another granted user (say, Bob), and it is required that the data be accessible to nobody other than Bob. Inspired by the PRE primitive, Alice can encrypt the sensitive data under her own public key before uploading it to the semi-trusted cloud. After receiving Bob's data sharing request, Alice generates a proxy re-encryption key using her own private key and Bob's public key and sends this re-encryption key to the semi-trusted cloud server. Equipped with this key, the cloud server can transform the ciphertext encrypted under Alice's public key into an encryption under Bob's public key. With the PRE primitive, the transformed ciphertext can be decrypted only by Bob, whereas the cloud server is unable to learn the plaintext or the private keys of Alice or Bob. Finally, Bob can download and decrypt the requested data with his own private key. In this way, the costly burden of secure data sharing can be offloaded to the semi-trusted cloud server with its abundant resources.
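The following toy Python sketch illustrates this Alice-proxy-Bob flow in the style of the original Blaze-Bleumer-Strauss (bidirectional) construction over a tiny prime-order subgroup; the parameters are deliberately small and insecure, and the listing is for illustration only, not an implementation to deploy.

```python
# Toy, insecure sketch of a BBS98-style bidirectional PRE flow (illustration only).
import secrets

p = 467                      # small safe prime: p = 2q + 1
q = 233                      # prime order of the subgroup generated by g
g = 4                        # generator of the order-q subgroup of Z_p^*

def keygen():
    sk = secrets.randbelow(q - 1) + 1            # private exponent in [1, q-1]
    return sk, pow(g, sk, p)                     # (sk, pk = g^sk mod p)

def encrypt(pk, m):
    r = secrets.randbelow(q - 1) + 1
    return (m * pow(g, r, p) % p, pow(pk, r, p)) # (m * g^r, pk^r = g^{a r})

def rekey(sk_alice, sk_bob):
    return sk_bob * pow(sk_alice, -1, q) % q     # rk = b / a mod q (needs both secrets)

def reencrypt(ct, rk):
    c1, c2 = ct
    return (c1, pow(c2, rk, p))                  # g^{a r} -> g^{b r}

def decrypt(sk, ct):
    c1, c2 = ct
    gr = pow(c2, pow(sk, -1, q), p)              # recover g^r
    return c1 * pow(gr, -1, p) % p               # strip the mask

a_sk, a_pk = keygen()                            # Alice
b_sk, b_pk = keygen()                            # Bob
ct_alice = encrypt(a_pk, 42)                     # Alice's ciphertext stored in the cloud
ct_bob = reencrypt(ct_alice, rekey(a_sk, b_sk))  # proxy transforms it for Bob
assert decrypt(b_sk, ct_bob) == 42
```

Note that in this bidirectional construction the re-encryption key b/a mod q is derived from both parties' secrets (in practice via an interactive protocol), which is one reason later work pursued the unidirectional and conditional variants surveyed below.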
\nIn 2005, Ateniese et al. [37] proposed the first unidirectional PRE scheme. Similar to the scheme of Blaze et al. [36], both of the schemes are secure only against chosen-plaintext attacks (CPA). In 2007, Canetti et al. [38] designed a bidirectional PRE scheme with chosen-ciphertext security. In 2008, Libert et al. [39] introduced a replayable chosen-ciphertext secure (RCCA) unidirectional PRE scheme. Since then, various PRE schemes have been proposed in the literature (e.g., [40, 41, 42, 43, 44]).
\nPRE can be extended in the context of identity-based encryption. In 2007, Green and Ateniese [45] proposed the first identity-based proxy re-encryption (IBPRE) scheme, which is CCA secure in the random oracle model, where hash functions are assumed to be fully random. Chu and Tzeng [30] constructed a CCA secure IBPRE scheme in the standard model. After that, many identity-based proxy re-encryption (IBPRE) schemes have been proposed, such as [47, 48, 49, 50, 51, 52, 53].
However, in all of the aforementioned schemes, the semi-trusted proxy can use a given re-encryption key to transform all the ciphertexts of a delegator into ciphertexts for a delegatee, whereas in reality the delegator may not want all of his data to be transformable for the delegatee. Therefore, type-based PRE [54] and conditional PRE (CPRE) [55, 56] were proposed, in which the proxy can perform ciphertext conversion only conditionally. Later, Liang et al. [57, 58] proposed two IBCPRE schemes with CCA security in the standard model. However, He et al. [59] presented a security analysis showing that their schemes achieve only CPA security. In 2016, He et al. [60] proposed an efficient identity-based conditional proxy re-encryption (IBCPRE) scheme with CCA security in the random oracle model.
PRE can also be extended to the attribute-based setting. Attribute-based proxy re-encryption (ABPRE) can effectively increase the flexibility of data sharing. In 2009, Liang et al. [61] first defined the notion of ciphertext-policy ABPRE (CP-ABPRE), where each ciphertext is labeled with a set of descriptive conditions and each re-encryption key is associated with an access tree that specifies which type of ciphertexts the proxy can re-encrypt; they presented a concrete scheme supporting AND gates with positive and negative attributes. After that, several CP-ABPRE schemes (e.g., [62]) with more expressive access policies were proposed. In 2011, Fang et al. [63] proposed a key-policy ABPRE (KP-ABPRE) scheme in the random oracle model, whereby a ciphertext encrypted with conditions W can be re-encrypted by the proxy using the CPRE key under the access structure TT if and only if
In 2016, Lee et al. [67] proposed a searchable hierarchical CPRE (HCPRE) scheme for cloud storage services, and cloud service provider is able to generate a hierarchical key, but the re-encryption key generation algorithm also requires the private keys of the delegator and delegatee.
So far, proxy re-encryption has been extensively investigated. However, concrete proxy re-encryption schemes with different properties still need to be designed and applied to the particular environment of the MI.
\nThe core material of this book consists of the following:
\nChapters 1 and 2, discussing the importance of outsourcing and privacy protection for data service in the mobile Internet and the basics of data service outsourcing and privacy protection in the mobile Internet
Chapters 3–5, introducing information retrieval, classical machine learning, and deep learning
Chapters 6 and 7, introducing attribute-based encryption for flexible and fine-grained access control and motivating proxy re-encryption for secure access delegation
The purpose of this chapter is to offer a brief review of all the cryptographic concepts needed in the following chapters. We start with the mathematical background and the associated hard problems. We then recall some facts about classical symmetric cryptography and public-key cryptosystems. Finally, a concise description of provable security is presented in the rest of this chapter. We remark that this introduction is by no means exhaustive; the approaches used to speed up pairing computations or to select suitable parameters of elliptic curves are beyond the scope of this book.
\nLet \n
If \n
According to the fundamental theorem of arithmetic, every integer greater than 1 can be expressed uniquely as a product of primes. Namely, any positive integer \n
\nProposition 2.1\n
\n\nLet \n
\n\n
\nProposition 2.2\n
\n\nLet \n
\nProof. Let \n
There is a contradiction between \n
Since both \n
Given \n
\nProposition 2.3\n
\n\nIf \n
\nProof. It is easy to observe that \n
Since \n
The second part of this proposition follows from the fact that if \n
\nProposition 2.4\n
\n\nIf \n
\nProof. We can see \n
Let \n
\n\n
We remark that congruence modulo \n
\nExample 1. Let us compute \n
\nThe cost of alternate approach to derive the answer (viz., computing the product \n
Unlike addition, subtraction, and multiplication, division has not yet been defined for congruence modulo N. That is to say, if \n
\nExample 2. Take \n
However, a meaningful notion of division has also been defined for congruence modulo \n
In this way, division by \n
We see that in this case, division works “as expected” by adopting the idea of invertible integers. The natural question is which integers are invertible modulo a given modulus N. To answer this question fully, Proposition 2.2 is used in the following proposition:
\n\nProposition 2.5\n
\n\nLet \n
\nProof. Assume \n
Conversely, if \n
\nExample 3. Let \n
Assume \n
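As a small illustration of this discussion (Proposition 2.5 characterizes the invertible integers modulo N as those b with gcd(b, N) = 1), the following Python sketch, which is ours and not part of the original text, computes modular inverses with the extended Euclidean algorithm.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def mod_inverse(b, N):
    """Return b^{-1} mod N, or raise ValueError if gcd(b, N) != 1."""
    g, x, _ = extended_gcd(b % N, N)
    if g != 1:
        raise ValueError("b is not invertible modulo N")
    return x % N

# Example: 3 * 4 = 12 = 1 (mod 11), so the inverse of 3 modulo 11 is 4.
print(mod_inverse(3, 11))   # 4
```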
\nDefinition 2.1. A group is a set \n
\nClosure law: For all \n
\nIdentity law: There exists an identity element \n
\nInverse law: For all \n
\nAssociative law: For all \n
In this way, the set \n
If \n
If \n
Associativity implies that the notation of long expression \n
It is easy to see that the identity element in a group \n
In general, the abstract notation \n
\nExample 4. A set may be formed as a group under one operation, but not another operation. For example, the set of integers \n
\nExample 5. The set of complex numbers \n
\nExample 6. Let \n
The “cancelation law” for groups has been demonstrated in the following lemma.
\n\nLemma 2.1 Let \n
\nProof. We know \n
\nGroup exponentiation: It is often useful to be able to describe the group operation applied \n
Note that \n
As for the multiplicative operation, we demonstrate application of the group operation \n
The familiar rules of exponentiation follow: \n
The above notation can be extended to the case when \n
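In practice, group exponentiation is computed with the square-and-multiply method, which needs only a number of group operations logarithmic in the exponent. The sketch below (ours, written for the multiplicative group of integers modulo N) illustrates the idea.

```python
def group_exp(g, x, N):
    """Compute g^x mod N by square-and-multiply, using O(log x) multiplications."""
    result = 1
    base = g % N
    while x > 0:
        if x & 1:                 # current bit of the exponent is 1
            result = (result * base) % N
        base = (base * base) % N  # square for the next bit
        x >>= 1
    return result

# Example: 5^13 mod 23, checked against Python's built-in pow(5, 13, 23).
print(group_exp(5, 13, 23), pow(5, 13, 23))  # 21 21
```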
\nTheorem 2.1\n
\n\nIf \n
\nProof. We prove the theorem only for the commutative group \n
It is noted that \n
Fueled by the fact that \n
Once again by the cancelation law in the group, it is obvious that \n
\nCorollary 2.1\n
\n\nIf \n
\nProof. According to Proposition 2.1, \n
holds.
\n\nExample 7. As for the additive operation, the above corollary means that if \n
\nCorollary 2.2\n
\n\nLet \n
\nProof. According to Proposition 2.5, \n
As mentioned before, the set \n
In other words, \n
Based on the discussion above, the identity element and inverse element associated with each element can be found in \n
\nProposition 2.6\n
\n\n\n
Let \n
That is to say, \n
\nTheorem 2.2\n
\n\nLet \n
\nExample 8. Take \n
Until now, the set \n
\nTheorem 2.3\n
\n\nTake arbitrary \n
\nFor the specific case when \n
By Corollary 2.2, we obtain the following theorem easily:
\n\nTheorem 2.4\n
\n\nLet \n
\nDefinition 2.2. A function \n
\nGroup isomorphisms: An isomorphism from a group to another group provides an alternate and equivalent approach to think about the structure of groups. For example, if the group (\n
\nExample 9. The bijective mapping \n
\nExample 10. The bijective mapping \n
\nTheorem 2.5\n
\n\nLet \n
The solution to the system of congruences shown in Theorem 2.5 can be calculated as
\nwhere
\nand
\nThe solution can also be represented in a slightly different way which makes it easier to be understood. Namely, we can rewrite Eq. (2.1) as
\nsuch that
So the solution of the Chinese remainder theorem can be regarded essentially as an integer version of Lagrange interpolation, where a polynomial is created from k points by calculating a similar set of coefficients that are either 0 or 1 and thus enforce the desired behavior at the given points.
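The following sketch (ours) solves a system of congruences with pairwise coprime moduli in exactly this interpolation style: each basis element is congruent to 1 modulo its own modulus and to 0 modulo the others.

```python
from math import prod

def crt(residues, moduli):
    """Solve x = r_i (mod n_i) for pairwise coprime moduli; return x mod prod(n_i)."""
    N = prod(moduli)
    x = 0
    for r_i, n_i in zip(residues, moduli):
        N_i = N // n_i              # product of the other moduli
        y_i = pow(N_i, -1, n_i)     # inverse of N_i modulo n_i (Python 3.8+)
        x += r_i * N_i * y_i        # basis element N_i * y_i is 1 mod n_i and 0 mod the others
    return x % N

# x = 2 (mod 3), x = 3 (mod 5), x = 2 (mod 7)  ->  x = 23 (mod 105)
print(crt([2, 3, 2], [3, 5, 7]))  # 23
```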
\n\nExample 11. Consider the following system of congruences:\n
\n\nAccording to the solution of the Chinese remainder theorem, we see that\n
\n\nso that\n
\nIn this example we can also think of the solution as finding integers \n
and
\nLet \n
According to Theorem 2.1, \n
It is obvious to see that \n
It is easy to observe that \n
\nDefinition 2.3. If \n
\nProposition 2.7\n
\n\nIf \n
\nProof. The proof of this proposition is similar to the proof of Corollary 2.1, and thus we omit it here.
\n\nProposition 2.8\n
\n\nIf \n
\nProof. If \n
On the other hand, from \n
The identity element of any group \n
\nProposition 2.9\n
\n\nLet \n
\nProof. By Theorem 2.1, \n
\nCorollary 2.3\n
\n\nIf \n
\nProof. Based on Proposition 2.9, the only possible orders of elements in the group \n
In addition to the groups of prime order, the additive group \n
\nTheorem 2.6\n
\n\nIf \n
The proof of this theorem is outside the scope of this book, and the interested reader may consult the references for more details.
\n\nExample 12. Consider the group \n
\nand so \n
Given a cyclic group \n
The discrete logarithm problem in a cyclic group \n
\nThe discrete logarithm experiment \n
Given the parameter \n
Choose \n
Given \n
This experiment outputs 1 if \n
\nDefinition 2.4. The discrete logarithm problem is said to be hard relative to the algorithm \n
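The experiment above is only interesting when no efficient algorithm can recover the exponent. The brute-force search below (our sketch) runs in time linear in the group order, which is exponential in the security parameter, and illustrates why only large groups are used in practice.

```python
def brute_force_dlog(g, h, p):
    """Find x with g^x = h (mod p) by exhaustive search over the group order."""
    value = 1
    for x in range(p - 1):
        if value == h:
            return x
        value = (value * g) % p
    return None

# In the multiplicative group modulo 23 with generator g = 5:
# since 5^13 = 21 (mod 23), the discrete logarithm of 21 to the base 5 is 13.
print(brute_force_dlog(5, 21, 23))  # 13
```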
By way of example, groups of the form \n
Different from \n
Define \n
\nExample 13. An element \n
Let \n
| x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|----|
| x³ + x + 6 mod 11 | 6 | 8 | 5 | 3 | 8 | 4 | 8 | 4 | 9 | 7 | 4 |
| Quadratic residue or nonresidue modulo 11? | | | ✓ | ✓ | | ✓ | | ✓ | ✓ | | ✓ |
| y | | | 4 | 5 | | 2 | | 2 | 3 | | 2 |
| y | | | 7 | 6 | | 9 | | 9 | 8 | | 9 |
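The entries of this table can be reproduced with a short Python sketch (ours): for each x in the field of integers modulo 11 we compute x³ + x + 6 mod 11 and search for square roots y.

```python
p = 11

for x in range(p):
    rhs = (x**3 + x + 6) % p                       # right-hand side of y^2 = x^3 + x + 6
    roots = [y for y in range(p) if (y * y) % p == rhs]
    status = "QR, y = " + str(roots) if roots else "non-residue"
    print(f"x = {x:2d}: x^3 + x + 6 = {rhs:2d} mod 11 -> {status}")
```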
The points on the elliptic curve \n
Including the point \n
A common approach to conceptualize \n
Graph of the elliptic curve $y^2 = x^3 + x + 6$ such that $\Delta > 0$.
Observations from \nFigure 2.2\n demonstrate that every line intersecting with the elliptic curve \n
The point \n
We obtain
for \n
Addition operations on points on an elliptic curve.
\nNegation. On input a point \n
\nAddition of points: To evaluate the sum \n
It is straightforward but tedious to carry out the addition law concretely. Assume \n
The line passing through \n
To find the third point of intersection of this line with \n
The values of x that satisfy this equation are \n
It is straightforward that the set of points \n
According to \nTable 2.1\n, there are seven points in the set \n
Suppose \n
First, the slope of the line can be calculated as follows:
\nAfter that, we obtain
\nTherefore, \n
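The chord-and-tangent rules just described can be transcribed directly into code. The following sketch (ours) implements affine point addition on the curve y² = x³ + x + 6 over the field of integers modulo 11, with None standing for the point at infinity; the sample points are taken from Table 2.1.

```python
p, a, b = 11, 1, 6   # curve y^2 = x^3 + x + 6 over F_11

def ec_add(P, Q):
    """Add two affine points on the curve; None stands for the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                        # P + (-P) = point at infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

# Example with two points from Table 2.1:
print(ec_add((2, 7), (5, 2)))   # (8, 3): a third point on the curve
print(ec_add((2, 7), (2, 7)))   # (5, 2): point doubling
```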
It is easy to see that \n
Similar to \n
Let \n
Bilinearity: For all \n
Nondegeneracy: There exist \n
Computability: For all \n
The modified Weil pairing [77] or Tate pairing [72] defined on the elliptic curves can be adopted to implement such an admissible bilinear map.
The discussion thus far has been very general. We now show some concrete examples of public-key encryption and signature schemes in the traditional PKI and ID-PKC settings, respectively. As mentioned in [73], public-key encryption enables two parties to communicate with each other securely without any shared secret information in advance. Let us call the sender Alice and the receiver Bob. Bob first selects the key pair \n
Observe that all existing constructions of public-key encryption and signature schemes are proven secure under assumptions about the hardness of certain mathematical problems. Therefore, the number-theoretic hard problems related to these constructions are reviewed first.
\n\nDefinition 2.5. (RSA problem) Let \n
\nDefinition 2.6. (Discrete logarithm (DL) problem) Let \n
\nDefinition 2.7. (Weil Diffie-Hellman (WDH) problem) Given a generator \n
\nDefinition 2.8. (Decision Diffie-Hellman (DDH) problem) Given a generator \n
\nDefinition 2.9. (Computational Diffie-Hellman (CDH) problem) Given a generator \n
\nDefinition 2.10. A public-key encryption scheme (see \nFigure 2.3\n) consists of the following three algorithms:
\nGen: Given a security parameter \n
Enc: On input a public key \n
Dec: On input a private key \n
Intuition of public-key encryption.
We make the consistency constraint that if for every \n
The RSA encryption scheme [75] is the most widely used public-key encryption scheme, and its security depends on the intractability of the RSA problem. Concretely, the RSA encryption scheme consists of the following three algorithms:
\n\nGen: To create the public key and the corresponding secret key, the following steps are performed:
\nRandomly choose two prime numbers \n
Compute \n
Compute \n
Randomly choose an integer \n
Publish \n
\nEnc: Before sending a sensitive message \n
\nDec: To decrypt the ciphertext c, the corresponding receiver computes \n
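As an illustration, here is a toy transcription of the textbook RSA scheme above in Python (ours): the primes are artificially small and no padding is used, whereas a real deployment would use large primes and a padding scheme such as OAEP.

```python
def rsa_keygen(p, q, e):
    """Textbook RSA key generation from two primes p, q and a public exponent e."""
    N = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)              # d is the inverse of e modulo phi(N)
    return (N, e), (N, d)            # (public key, secret key)

def rsa_encrypt(pk, m):
    N, e = pk
    return pow(m, e, N)              # c = m^e mod N

def rsa_decrypt(sk, c):
    N, d = sk
    return pow(c, d, N)              # m = c^d mod N

pk, sk = rsa_keygen(61, 53, 17)      # toy primes for illustration only
c = rsa_encrypt(pk, 65)
print(rsa_decrypt(sk, c))            # decryption recovers 65
```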
The ElGamal encryption scheme [76] is constructed based on the DL problem. Concretely, the ElGamal encryption scheme consists of the following three algorithms:
\n\nGen: To create the public key and the corresponding secret key, the following steps are performed:
\nChoose a random multiplicative generator \n
Pick a random number \n
Compute the corresponding public key by \n
Publish \n
\nEnc: Before sending a sensitive message \n
\nDec: To decrypt the ciphertext \n
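A similar toy sketch of ElGamal encryption follows (ours): the parameters p and g are tiny illustrative values, the message is encoded as an element of the group, and fresh randomness k must be chosen for every encryption.

```python
import secrets

# Toy public parameters: a small prime p and a base element g (illustration only).
p, g = 2357, 2

def elgamal_keygen():
    x = secrets.randbelow(p - 2) + 1      # secret key x in [1, p-2]
    y = pow(g, x, p)                      # public key y = g^x mod p
    return x, y

def elgamal_encrypt(y, m):
    k = secrets.randbelow(p - 2) + 1      # fresh ephemeral randomness for each encryption
    return pow(g, k, p), (m * pow(y, k, p)) % p

def elgamal_decrypt(x, ct):
    c1, c2 = ct
    s = pow(c1, x, p)                     # shared value g^{kx}
    return (c2 * pow(s, -1, p)) % p

x, y = elgamal_keygen()
ct = elgamal_encrypt(y, 123)
print(elgamal_decrypt(x, ct))             # 123
```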
Definition 2.11. An ID-based encryption scheme (see Figure 2.4) consists of the following four algorithms:
\nIntuition of ID-based encryption.
\nSetup: On input a security parameter \n
\nExtract: On input params, master-key, and an arbitrary identity \n
\nEnc: On input params, ID, and \n
\nDec: On input params, ID, \n
The first practical identity-based encryption scheme was constructed by Boneh and Franklin [77] based on bilinear pairings, and its security depends on the WDH problem. First, the basic Boneh-Franklin IBE scheme, which is not secure against an adaptive chosen-ciphertext attack, is presented. The only reason for describing the basic scheme is to improve the readability of the full scheme. After that, the full scheme extends the basic scheme to achieve security against an adaptive chosen-ciphertext attack.
\n\nBoneh-Franklin IBE (basic scheme): Concretely, the basic Boneh-Franklin IBE scheme consists of the following four algorithms:
\n\nSetup: The algorithm works as follows:
\n1. Let \n
2. Pick a random \n
3. Choose two cryptographic hash functions \n
The system parameters \n
\nExtract: On input a string \n
Compute \n
Send \n
\nEncrypt: To encrypt \n
\nDecrypt: Given a ciphertext \n
\nBoneh-Franklin IBE (full scheme): Concretely, the full Boneh-Franklin IBE scheme consists of the following four algorithms:
Setup and Extract: The algorithms are the same as those in the above basic scheme. In addition, two additional hash functions \n
\nEncrypt: To encrypt \n
\nDecrypt: Let \n
Compute \n
Compute \n
Set \n
Output \n
\nDefinition 2.12. A signature scheme (see \nFigure 2.5\n) consists of the following three algorithms:
\nIntuition of digital signature.
\nGen: Given a security parameter \n
\nSign: On input a private key \n
\nVerify: On input a public key \n
We make the consistency constraint that if for every \n
The RSA cryptosystem [75] may be used to provide both encryption and digital signatures. The RSA digital signature is described as follows:
\n\nGen: To create the public key and the corresponding secret key, the following steps are performed:
\nRandomly choose two prime numbers \n
Compute \n
Compute \n
Randomly choose an integer \n
Publish \n
\nSign: To create a signature of message \n
\nVerify: Given a message-signature pair \n
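A hash-then-sign sketch of RSA signatures in Python follows (ours): SHA-256 from hashlib, reduced modulo N, stands in for a proper message encoding such as PSS, and the toy key pair from the RSA encryption sketch is reused.

```python
import hashlib

def h(message, N):
    """Hash the message and reduce it modulo N (stand-in for a real encoding/padding)."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") % N

def rsa_sign(sk, message):
    N, d = sk
    return pow(h(message, N), d, N)              # sigma = H(m)^d mod N

def rsa_verify(pk, message, sigma):
    N, e = pk
    return pow(sigma, e, N) == h(message, N)     # accept iff sigma^e = H(m) mod N

# Toy key pair (N, e) and (N, d) with N = 61 * 53 = 3233, e = 17, d = 2753.
pk, sk = (3233, 17), (3233, 2753)
sigma = rsa_sign(sk, b"hello")
print(rsa_verify(pk, b"hello", sigma), rsa_verify(pk, b"hellx", sigma))  # True False
```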
The ElGamal cryptosystem [76] may be used to provide both encryption and digital signatures. The ElGamal digital signature is described as follows:
\n\nGen: To create the public key and the corresponding secret key, the following steps are performed:
\nChoose a random multiplicative generator \n
Pick a random number \n
Compute the corresponding public key by \n
Publish \n
\nSign: To create a signature of message \n
\nVerify: Given a message-signature pair \n
Many variations of the basic ElGamal signature scheme have been proposed, and the Schnorr signature scheme [78] is one well-known variant of the ElGamal scheme. The Schnorr digital signature is described as follows:
\n\nGen: The system parameters are generated as follows:
\nChoose two prime numbers \n
Choose an element \n
Select a cryptographic hash function \n
Select a random number \n
The parameters \n
\nSign: To create a signature on message \n
\nVerify: Given a message-signature pair \n
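One common formulation of the Schnorr signature can be sketched as follows (ours, not necessarily the exact variant described above): the toy parameters satisfy q | p - 1 with g of order q, and are far too small for real use.

```python
import hashlib, secrets

# Toy Schnorr parameters (illustration only): q divides p - 1 and g has order q modulo p.
p, q, g = 607, 101, 122

def h(r, message):
    data = str(r).encode() + message
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def keygen():
    x = secrets.randbelow(q - 1) + 1      # secret key
    y = pow(g, x, p)                      # public key y = g^x mod p
    return x, y

def sign(x, message):
    k = secrets.randbelow(q - 1) + 1      # fresh ephemeral secret
    r = pow(g, k, p)
    e = h(r, message)
    s = (k - x * e) % q
    return e, s

def verify(y, message, sig):
    e, s = sig
    r_v = (pow(g, s, p) * pow(y, e, p)) % p   # equals g^k when the signature is valid
    return h(r_v, message) == e

x, y = keygen()
sig = sign(x, b"hello")
print(verify(y, b"hello", sig), verify(y, b"other", sig))   # True False
```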
In 1991, the Digital Signature Algorithm (DSA) was presented by the US National Institute of Standards and Technology (NIST) and was later adopted as the Digital Signature Standard (DSS) [79] in the US Federal Information Processing Standards (FIPS 186). DSS is regarded as the first digital signature scheme recognized by the US government and is described as follows:
\n\nGen: The system parameters are generated as follows:
\nSelect a prime number \n
Choose \n
Select a generator \n
Select an element \n
If \n
Select a random integer \n
\n\n
\nSign: To create a signature on message \n
Select a random secret integer \n
Compute \n
Compute \n
Compute \n
\n\n
\nVerify: Given a message-signature pair \n
Compute \n
Output 1 if \n
An ID-based signature scheme (see Figure 2.6) consists of the following four algorithms:
\nIntuition of ID-based signature.
\nSetup: On input a security parameter \n
\nExtract: On input params, master-key, and an arbitrary identity \n
\nSign: Given a message \n
\nVerify: Given a signature \n
Inspired by [77], an ID-based signature scheme was proposed by Cha and Cheon [80] using gap Diffie-Hellman (GDH) groups, where a GDH group is a cyclic group in which the CDH problem is hard but the DDH problem is easy. The Cha-Cheon identity-based signature scheme is described as follows:
\n\nSetup: Let \n
\nExtract: Given an identity ID, this algorithm computes \n
\nSign: Given a secret key \n
\nVerify: To verify a signature \n
Bellare et al. [81] proposed the first pairing-free identity-based signature scheme based on the DL problems. The Bellare-Neven-Namprempre identity-based signature scheme is described as follows:
\n\nSetup: Generate an elliptic curve \n
\nExtract: Given an identity ID, PKG picks a random \n
\nSign: Given a secret key pair \n
The resulting signature is \n
\nVerify: To verify a signature \n
holds or not.
Provable security, which involves exact definitions, precise assumptions, and rigorous security proofs of cryptographic primitives, is regarded as an indispensable methodology for designing, analyzing, and evaluating new primitives [82, 83, 84]. The basic reason for the necessity of provable security originates from the fact that the security of a cryptographic primitive cannot be checked in the same way that software is typically checked. Without a proof that no adversary with enough computational resources can break the primitive, we can only depend on our intuition about the security of the primitive in question. Unfortunately, history has already demonstrated that intuition in cryptography and information security is disastrous, in the sense that countless examples of unproven schemes, or schemes with only heuristic security arguments, were broken (sometimes immediately and sometimes years after being published or deployed).
Instead of indiscreetly assuming that a given cryptographic primitive is secure, provable security assumes that some mathematical problem is hard to solve and then proves that the given primitive is secure under this assumption. Concretely, the proof proceeds by presenting an explicit reduction showing how to construct an efficient algorithm \n
We first assume that some mathematical problem \n
\nInitial: Assume an efficient adversary \n
An efficient adversary \n
The view of \n
Furthermore, if \n
From Step 1, it is obvious that if \n
As for the security model of public-key encryption, we begin our definitional treatment by considering the case of an eavesdropping adversary who observes a single ciphertext that it wishes to crack. In other words, the eavesdropping adversary does not have any further interaction with the sender or the receiver after receiving a (single) ciphertext. Intuitively speaking, the security model for the public-key encryption is depicted by the definition of indistinguishability. Particularly, consider an experiment \n
Given a public-key encryption scheme \n
The algorithm Gen \n
After receiving the public key \n
After receiving \n
After receiving \n
\nDefinition 2.13. A public-key encryption scheme \n
In view of the fact that a relatively weak attacker who only passively eavesdrops on the communication has been modeled in the eavesdropping experiment, a more powerful type of adversarial attacker who is allowed to query encryptions of multiple messages in an adaptive manner should be considered to simulate the attacker’s capabilities in the real world. Thus, a chosen-plaintext attack (CPA) has been defined to enable the adversary \n
Specifically, consider the following experiment defined for public-key encryption scheme \n
\nInitial: The algorithm Gen \n
\nPhase 1: After receiving the public key \n
\nChallenge: After deciding Phase 1 is over, \n
\nPhase 2: The adversary \n
\nGuess: After receiving \n
\nDefinition 2.14. A public-key encryption scheme \n
To strengthen the adversary's capabilities further, a third type of adversary, which is more powerful than the former two types and can mount a chosen-ciphertext attack [85, 86], should be taken into consideration. In this attack model, the adversary is not only allowed to obtain encryptions of any messages of its choice, as in the chosen-plaintext attack model, but is also able to obtain decryptions of any ciphertexts of its choice. Therefore, the chosen-ciphertext attack is regarded as the most powerful attack associated with public-key encryption so far.
\nSpecifically, consider the following experiment defined for public-key encryption scheme \n
CCA experiment $\mathrm{PubK}^{\mathrm{cca}}_{\mathcal{A},\Pi}(n)$ for public-key encryption.
\nInitial: The algorithm Gen \n
\nPhase 1: After receiving the public key \n
\nChallenge: After deciding Phase 1 is over, \n
\nPhase 2: After receiving \n
\nGuess: \n
\nDefinition 2.15. A public-key encryption scheme \n
Different from the public-key encryption in the traditional PKI, the definition of chosen-ciphertext security must be strengthened a bit in the ID-PKC due to the fact that the attacker might already own the private keys of users \n
Specifically, consider the following experiment defined for ID-based encryption scheme \n
CCA experiment $\mathrm{IBE}^{\mathrm{cca,cida}}_{\mathcal{A},\Pi}(n)$ for ID-based encryption.
\nInitial: The algorithm Setup is performed by the challenger \n
\nPhase 1: \n
Upon receiving an identity \n
Upon receiving a tuple \n
\nChallenge: After deciding Phase 1 is over, adversary submits two plaintexts \n
\nPhase 2: This is identical to Phase 1 with the restriction that:
\n\n\n
\n\n
\nGuess: \n
\nDefinition 2.16. An ID-based encryption scheme \n
To capture the adversary's power against digital signatures, a security property called existential unforgeability against a chosen message attack [87] should be achieved: existential unforgeability means that the adversary should not be able to generate a valid signature on any new message, and a chosen message attack means that the adversary is able to obtain signatures on any messages it wishes during its attack. We now give the formal definition of existential unforgeability against a chosen message attack as follows:
\nLet \n
Unforgeability experiment $\mathrm{Sig\text{-}forge}^{\mathrm{cma}}_{\mathcal{A},\Pi}(n)$ for signature.
\nInitial: The algorithm Gen is run by \n
\nAttack: After receiving the public key \n
\nForgery: \n
\nDefinition 2.17. A signature scheme \n
The security model of existential unforgeability under an adaptive chosen message attack in the traditional PKI can be extended to the identity-based environment naturally [80, 81]. We now give the formal definition of existential unforgeability against a chosen message attack for the ID-based signature as follows:
\nLet \n
Unforgeability experiment $\mathrm{IBS\text{-}forge}^{\mathrm{cma,cida}}_{\mathcal{A},\Pi}(n)$ for ID-based signature.
\nInitial: The algorithm Setup is performed by the challenger \n
\nAttack: \n
Upon receiving an identity \n
Upon receiving a tuple \n
\nForgery: \n
Verify(params, \n
\n\n
(\n
\nDefinition 2.18. An ID-based signature scheme \n
Information retrieval is a very simple concept with everyone having practical experience in its use. The scenario of a user having an information need, translating that into a search statement and executing that search to locate the information, has become ubiquitous to everyday life. The Internet has become a repository of any information a person needs, replacing the library as a more convenient research tool. An information retrieval system is a system that ingests information, transforms it into searchable format, and provides an interface to allow a user to search and retrieve information. The most obvious example of an information retrieval system is Google, and the English language has even been extended with the term “Google it” to mean search for something.
\nSo, everyone has had an experience with information retrieval systems, and with a little thought, it is easy to answer the question: does it work? Everyone who has used such systems has experienced the frustration that is encountered when looking for certain information. Given the massive amount of intellectual effort that is going into the design and evolution of a “Google” or other search systems, the question comes to mind why is it so hard to find what you are looking for.
One of the goals of this chapter is to explain the practical and theoretical issues associated with information retrieval that make the design of information retrieval systems one of the challenges of our time. The demand for, and the expectation of, users to quickly find any information they need continues to drive both the theoretical analysis and the development of new technologies to satisfy that need. To scope the problem, one of the first things that needs to be defined is “information.” Twenty-five years ago, information retrieval was totally focused on textual items, because almost all of the “digital information” of value was in textual form. In today's technical environment, most people carry with them, most of the time, the capability to create images and videos of interest, namely the cell phone. This has made modalities other than text as common as text. It is coupled with Internet Web sites that are designed for easy uploading and storing of those modalities, which more than justifies the need to include modalities other than text as part of the information retrieval problem. There are many parallels between the information processing steps for text, images, audio, and video. Although maps are another modality that could be included, they will only be discussed in general terms.
\nIn general, information that will be considered in information retrieval systems includes text, images, audio, and video. The term “item” shall be used to define a specific information object. This could be a textual document, a news item from an RSS feed, an image, a video program, or an audio program. It is useful to make a distinction between the original items from what is processed by the information retrieval system as the basic indexable item. The original item will always be kept for display purposes, but a lot of preprocessing can occur on it during the process of creating the searchable index. The term “item” will refer to the original object. On occasion the term document will be used when the item being referred to is a textual item.
\nAn information retrieval system is the hardware and software that facilitates a user in finding the information the user needs. Hardware is included in the definition because specialized hardware is needed to transform certain modalities into digital processing format (e.g., encoders that translate composite video to digital video). As the detailed processing of items is described, it will become clear that an information retrieval system is not a single application but is composed of many different applications that work together to provide the tools and functions needed to assist the users in answering their questions. The overall goal of an information retrieval system is to minimize the user overhead in locating the information of value. Overhead from a user’s perspective can be defined as the time it takes to locate the needed information. The time starts when a user starts to interact with the system and ends when they have found the items of interest. Human factors play significantly in this process. For example, most users have a short threshold on frustration waiting for a response. That means in a commercial system on the Internet, the user is more satisfied with a response less than 3 s than a longer response that has more accurate information. In internal corporate systems, users are willing to wait a little longer to get results, but there is still a trade-off between accuracy and speed. Most users would rather have the faster results and iterate on their searches than allowing the system to process the queries with more complex techniques providing better results. All of the major processing steps are described for an information retrieval system, but in many cases, only a subset of them are used on operational systems because users are not willing to accept the increase in response time.
\nThe evolution of information retrieval systems has been closely tied to the evolution of computer processing power. Early information retrieval systems were focused on automating the manual indexing processes in libraries. These systems migrated the structure and organization of card catalogs into structured databases. They maintained the same Boolean search query structure associated with the database that was used for other database applications. This was feasible because all of the assignment of terms to describe the content of a document was done by professional indexers. In parallel, there was also academic research work being done on small data sets that considered how to automate the indexing process making all of the text of a document part of the searchable index. The only place that large systems designed to search on massive amounts of text were available was in government and military systems. As commercial processing power and storage significantly increased, it became more feasible to consider applying the algorithms and techniques being developed in the universities to commercial systems. In addition, the creation of the original documents also was migrating to digital format so that they were in a format that could be processed by the new algorithms. The largest change that drove information technologies to become part of everyone’s experience was the introduction and growth of the Internet. The Internet became a massive repository of unstructured information, and information retrieval techniques were the only approach to effectively locate information on it. This changed the funding and development of search techniques from a few government-funded efforts to thousands of new ideas being funded by venture capitalists moving the more practical implementation of university algorithms into commercial systems.
\nThe general objective of an information retrieval system is to minimize the time it takes for a user to locate the information they need. The goal is to provide the information needed to satisfy the user’s question. Satisfaction does not necessarily mean finding all information on a particular issue. It means finding sufficient information that the user can proceed with whatever activity initiated. This is very important because it does explain some of the drivers behind the existing search systems and suggests that precision is typically more important than recalling all possible information. For example, a user looking for a particular product does not have to find the names of everyone that sells the product or every company that manufactures the product to meet their need of getting that product. Of course, if they did have total information, then it is possible that they could have gotten it cheaper, but in most cases, the consumer will never know what they missed. The concept that a user does not know how much information they missed explains why in most cases the precision of a search is more important than the ability to recall all possible items of interest; the user never knows what they missed, but they can tell if they are seeing a lot of useless information in the first few pages of search results. That does not mean that finding everything on a topic is not important to some users. If you are trying to make decisions on purchasing a stock or a company, then finding all the facts about that stock or company may be critical to prevent a bad investment. Missing the one article talking about the company being sued and possibly going bankrupt could lead to a very painful investment. But providing comprehensive retrieval of all items that are relevant to a user’s search can have the negative effect of information overload on the user. In particular there is a tendency for important information to be repeated in many items on the same topic. Thus, trying to get all information makes the process of reviewing and filtering out redundant information very tedious. The better a system is in finding all items on a question (Recall), the more important techniques to present aggregates of that information become.
From the user's perspective, time is the important factor that they use to gauge the effectiveness of information retrieval. Except for users who do information retrieval as a primary aspect of their job (e.g., librarians and research assistants), most users have very little patience for investing extensive time in finding information they need. They expect interactive responses to their searches, with replies within 3 to 4 seconds at most. Instead of looking through all the hits to see what might be of value, they will only review the first, and at most the second, page of results before deciding they need to change their search strategy. These aspects of the human nature of searchers have had a direct effect on commercial Web sites and the development of commercial information retrieval. The times that are candidates to be minimized in an information retrieval system are the time to create the query, the time to execute the query, the time to select which items returned from the query the user wants to review in detail, and the time to determine if a returned item is of value. The initial research in information retrieval focused on the search as the primary area of interest. But meeting the user's expectation of fast response while maximizing the relevant information returned requires optimization in all of these areas. The time to create a query used to be considered outside the scope of technical system support. But systems such as Google know what is in their database and what other users have searched on; so as you type a query, they provide hints on what to search on. This vocabulary browse capability helps the user in expanding the search string and helps in getting better precision.
\nIn information retrieval, the term “relevant” is used to represent an item containing the needed information. In reality, the definition of relevance is not a binary classification but a continuous function. Items can exactly match the information need or partially match the information need. From a user’s perspective, “relevant” and needed are synonymous. From a system perspective, information could be relevant to a search statement (i.e., matching the criteria of the search statement) even though it is not needed/relevant to user (e.g., the user already knew the information or just read it in the previous item reviewed).
\nRelevant documents are those that contain some information that helps answer the user’s information need. Nonrelevant documents do not contain any useful information. Using these definitions, the two primary metrics used in evaluating information retrieval systems can be defined. They are Precision and Recall:
\n\nNumber_Possible_Relevant is the number of relevant items in the database, Number_Total_Retrieved is the total number of items retrieved from the query, and Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s search need.
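These two metrics can be computed directly from the set of retrieved items and the set of relevant items, as in the following sketch (ours).

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved & relevant| / |retrieved|; Recall = |retrieved & relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hit = len(retrieved & relevant)
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall

# 5 items retrieved, 4 of them relevant, out of 10 relevant items in the database.
print(precision_recall([1, 2, 3, 4, 5], range(2, 12)))   # (0.8, 0.4)
```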
Precision is the factor that most users understand. When a user executes a search and has 80% precision, it means that four out of five items that are retrieved are of interest to the user. From a user's perspective, the lower the precision, the more likely the user is to waste his resource (time) looking at nonrelevant items. From a metric perspective, the precision figure is across all of the “hits” returned from the query. But in reality most users will only look at the first few pages of hit results before deciding to change their query strategy. Thus, what is of more value in commercial systems is not the total precision but the precision across the first 20 to 50 hits. Typically, in a weighted system where the words within a document are assigned weights based upon how well they describe the semantics of the document, precision in the first 20 to 50 items is higher than the precision across all the possible hits returned. But when comparing search systems, the total precision is used.
\nRecall is a very useful concept in comparing systems. It measures how well a search system is capable of retrieving all possible hits that exist in the database. Unfortunately, it is impossible to calculate except in very controlled environments. It requires in the denominator the total number of relevant items in the database. If the system could determine that number, then the system could return them. There have been some attempts to estimate the total relevant items in a database, but there are no techniques that provide accurate enough results to be used for a specific search request. In Chapter 3.3 on “Information Retrieval Evaluation,” techniques that have been used in evaluating the accuracy of different search systems will be described. But it is not applicable in the general case.
\nIn this section we briefly present the three classic models in information retrieval, namely, the Boolean, the vector, and the probabilistic models.
\nThe classic models in information retrieval consider that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document’s main themes. Thus, index terms are used to index and summarize the document contents. In general, index terms are mainly nouns because nouns have meaning by themselves and, thus, their semantics is easier to identify and to grasp. Adjectives, adverbs, and connectives are less useful as index terms because they work mainly as complements. However, it might be interesting to consider all the distinct words in a document collection as index terms. We postpone a discussion on the problem of how to generate index terms, where the issue is covered in detail.
\nGiven a set of index terms for a document, we notice that not all terms are equally useful for describing the document contents. In fact, there are index terms which are simply vaguer than others. Deciding on the importance of a term for summarizing the contents of a document is not a trivial issue. Despite this difficulty, there are properties of an index term which are easily measured and which are useful for evaluating the potential of a term as such. For instance, consider a collection with a hundred thousand documents. A word which appears in each of the 100,000 documents is completely useless as an index term because it does not tell us anything about which documents the user might be interested in. On the other hand, a word which appears in just five documents is quite useful because it narrows down considerably the space of documents which might be of interest to the user. Thus, it should be clear that distinct index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document.
\nLet \n
\nDefinition 3.1. Let t be the number of index terms in the system and \n
As we later discuss, the index term weights are usually assumed to be mutually independent. This means that knowing the weight \n
The above definitions provide support for discussing the three classic information retrieval models, namely, the Boolean, the vector, and the probabilistic models, as we now do.
\nThe Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in the past years and was adopted by many of the early commercial bibliographic systems (\nFigure 3.1\n).
The three conjunctive components for the query $q = k_a \wedge (k_b \vee \neg k_c)$.
Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or nonrelevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data (instead of information) retrieval model. Second, while Boolean expressions have precise semantics, frequently it is not simple to translate an information need into a Boolean expression. In fact, most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.
\nThe Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, that is, \n
\nDefinition 3.2. For the Boolean model, the index term weight variables are all binary, that is, \n
If sim(\n
The Boolean model predicts that each document is either relevant or nonrelevant. There is no notion of a partial match to the query conditions. For instance, let dj be a document for which \n
The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The main disadvantages are that exact matching may lead to retrieval of too few or too many documents (see Chapter 10). Today, it is well known that index term weighting can lead to a substantial improvement in retrieval performance. Index term weighting brings us to the vector model.
\nThe vector model [88, 89] recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning nonbinary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.
\n\nDefinition 3.3. For the vector model, the weight \n
Therefore, a document \n
where \n
The cosine of $\theta$ is adopted as $\mathrm{sim}(d_j, q)$.
Since \n
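The cosine ranking, the inner product of the two weight vectors divided by the product of their norms, can be written down directly; the following sketch (ours) works over dense weight vectors.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between a document weight vector d and a query weight vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Documents that share more (and more heavily weighted) terms with the query rank higher.
print(cosine_sim([1.0, 2.0, 0.0], [1.0, 1.0, 0.0]))   # ~0.949
print(cosine_sim([0.0, 0.0, 3.0], [1.0, 1.0, 0.0]))   # 0.0
```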
Index term weights can be calculated in many different ways. The work by Salton and McGill [88] reviews various term-weighting techniques. Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.
\nGiven a collection C of objects and a vague description of a set A, the goal of a simple clustering algorithm might be to separate the collection C of objects into two sets: the first one which is composed of objects related to the set A and the second one which is composed of objects not related to the set A. Vague description here means that we do not have complete information for deciding precisely which objects are and which are not in the set A. For instance, one might be looking for a set A of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set A. More sophisticated clustering algorithms might attempt to separate the objects of a collection into various clusters (or classes) according to their properties. For instance, patients of a doctor specializing in cancer could be classified into five classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be imprecise (and not unique), and the problem is one of deciding to which of these classes a new patient should be assigned. In what follows, however, we only discuss the simpler version of the clustering problem (i.e., the one which considers only two classes) because all that is required is a decision on which documents are predicted to be relevant and which ones are predicted to be not relevant (with regard to a given user query).
\nTo view the IR problem as one of clustering, we refer to the early work of Salton. We think of the documents as a collection C of objects and think of the user query as a (vague) specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem). In a clustering problem, two main issues have to be resolved. First, one needs to determine what the features are which better describe the objects in the set A. Second, one needs to determine what the features are which better distinguish the objects in the set A from the remaining objects in the collection C. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity. The most successful clustering algorithms try to balance these two effects.
\nIn the vector model, intra-clustering similarity is quantified by measuring the raw frequency of a term \n
\nDefinition 3.4. Let N be the total number of documents in the system and \n
where the maximum is computed over all terms which are mentioned in the text of the document \n
The best known term-weighting schemes use weights which are given by
\nor by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.
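A minimal tf-idf weighting in the normalized-frequency style described above can be sketched as follows (ours): the term frequency is normalized by the most frequent term of the document, and the inverse document frequency is log(N / n_i).

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, all_docs_tokens):
    """Weights w_{i,j} = f_{i,j} * log(N / n_i) for one document within a collection."""
    N = len(all_docs_tokens)
    counts = Counter(doc_tokens)
    max_freq = max(counts.values())
    weights = {}
    for term, freq in counts.items():
        n_i = sum(1 for d in all_docs_tokens if term in d)   # documents containing the term
        tf = freq / max_freq                                  # normalized term frequency
        idf = math.log(N / n_i)                               # inverse document frequency
        weights[term] = tf * idf
    return weights

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barks"]]
print(tf_idf_weights(docs[1], docs))   # 'cat' gets weight log(3/2), 'dog' gets 0.5*log(3/2)
```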
\nSeveral variations of the above expression for the weight \n
For the query term weights, Salton and Buckley suggest
\nwhere \n
The main advantages of the vector model are:
\nIts term-weighting scheme improves retrieval performance.
Its partial-matching strategy allows retrieval of documents that approximate the query conditions.
Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent [Eq. (2.3) does not account for index term dependencies]. However, in practice, consideration of term dependencies might be a disadvantage. Due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.
\nDespite its simplicity, the vector model is a resilient ranking strategy with general collections. It yields ranked answer sets which are difficult to improve upon without query expansion or relevance feedback within the framework of the vector model. A large variety of alternative ranking methods have been compared to the vector model, but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval model nowadays.
\nIn this section, we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [91] which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief and focuses mainly on highlighting the key features of the model. With this purpose in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these details.
\nThe probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and no other. Let us refer to this set of documents as the ideal answer set. Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a problem of clustering). The problem is that we do not know exactly what these properties are. All we know is that there are index terms whose semantics should be used to characterize these properties. Since these properties are not known at query time, an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Such interaction could proceed as follows.
\nThe user takes a look at the retrieved documents and decides which ones are relevant and which ones are not (in truth, only the first top documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.
\nThe probabilistic model is based on the following fundamental assumption.
\n\nDefinition 3.5. Assumption (probabilistic principle). Given a user query q and a document dj in the collection, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query. Documents not in this set are predicted to be nonrelevant.
\nThis assumption is quite troublesome because it does not state explicitly how to compute the probabilities of relevance. In fact, not even the sample space which is to be used for defining such probabilities is given.
\nGiven a query \n
\nDefinition 3.6. For the probabilistic model, the index term weight variables are all binary, i.e., \n
Using Bayes’ rule,
\n\n\n
Since \n
Assuming independence of index terms,
\n\n\n
Taking logarithms, recalling that \n
which is a key expression for ranking computation in the probabilistic model.
\nSince we do not know the set \n
In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as (a) assume that \n
where, as already defined, \n
Let \n
This process can then be repeated recursively. By doing so, we are able to improve on our guesses for the probabilities \n
The last formulas for \n
An adjustment factor which is constant and equal to 0.5 is not always satisfactory. An alternative is to take the fraction \n
This completes our discussion of the probabilistic model.
\nThe main advantage of the probabilistic model, in theory, is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages include (1) the need to guess the initial separation of documents into relevant and nonrelevant sets, (2) the fact that the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary), and (3) the adoption of the independence assumption for index terms. However, as discussed for the vector model, it is not clear that independence of index terms is a bad assumption in practical situations.
\nIn general, the Boolean model is considered to be the weakest classic method. Its main problem is the inability to recognize partial matches which frequently leads to poor performance. There is some controversy as to whether the probabilistic model outperforms the vector model. Croft performed some experiments and suggested that the probabilistic model provides a better retrieval performance. However, experiments done afterward by Salton and Buckley refute that claim. Through several different measures, Salton and Buckley showed that the vector model is expected to outperform the probabilistic model with general collections. This also seems to be the dominant thought among researchers, practitioners, and the Web community, where the popularity of the vector model runs high.
\nThere is no single format for a text document, and an IR system should be able to retrieve information from many of them. In the past, IR systems would convert a document to an internal format. However, that has many disadvantages, because the original application related to the document is not useful anymore. On top of that, we cannot change the contents of a document. Current IR systems have filters that can handle most popular documents, in particular those of word processors with some binary syntax such as Word, WordPerfect, or FrameMaker. Even then, good filters might not be possible if the format is proprietary and its details are not public. This is not the case for full ASCII syntax, as in TeX documents. Although documents can be in a binary format (e.g., parts of a Word document), documents that are represented in human-readable ASCII form imply more portability and are easier to modify (e.g., they can be edited with different applications).
\nOther text formats were developed for document interchange. Among these we should mention the rich text format (RTF), which is used by word processors and has ASCII syntax. Other important formats were developed for displaying or printing documents. The most popular ones are the portable document format (PDF) and PostScript (which is a powerful programming language for drawing). Other interchange formats are used to encode electronic mail, for example, multipurpose internet mail exchange (MIME). MIME supports multiple character sets, multiple languages, and multiple media.
\nOn top of these formats, nowadays many files are compressed. Text compression is treated in detail, but here we comment on the most popular compression software and associated formats. These include compress (Unix), ARJ (PCs), and ZIP (e.g., gzip in Unix and WinZip in Windows). Other tools allow us to convert binary files, in particular compressed text, to ASCII text such that it can be transmitted through a communication line using only seven bits. Examples of these tools are uuencode/uudecode and BinHex.
Text is composed of symbols from a finite alphabet. We can divide the symbols into two disjoint subsets: symbols that separate words and symbols that belong to words. It is well known that symbols are not uniformly distributed. If we consider just letters (a to z), we observe that vowels are usually more frequent than most consonants. For example, in English, the letter “e” has the highest frequency. A simple model to generate text is the binomial model. In it, each symbol is generated with a certain probability. However, natural language has a dependency on previous symbols. For example, in English, the letter “f” cannot appear after the letter “c,” and vowels or certain consonants have a higher probability of occurring. Therefore, the probability of a symbol depends on previous symbols. We can use a finite-context or Markovian model to reflect this dependency. The model can consider one, two, or more letters to generate the next symbol. If we use k letters, we say that it is a k-order model (so the binomial model is a 0-order model). We can use these models taking words as symbols. For example, text generated by a five-order model using the distribution of words in the Bible might make sense (i.e., it can be grammatically correct) but will be different from the original. More complex models include finite-state models (which define regular languages) and grammar models (which define context-free and other languages). However, finding the right grammar for natural language is still a difficult open problem.
\nThe next issue is how the different words are distributed inside each document. An approximate model is Zipf’s law, which attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words in the text. The rule states that the frequency of the \n
so that the sum of all frequencies is \n
Distribution of sorted word frequencies (left) and size of the vocabulary (right).
Since the distribution of words is very skewed (i.e., there are a few hundred words which take up 50% of the text), words that are too frequent, such as stopwords, can be disregarded. A stopword is a word which does not carry meaning in natural language and therefore can be ignored (i.e., made not searchable), such as “a,” “the,” “by,” etc. Fortunately, the most frequent words are stopwords, and, therefore, half of the words appearing in a text do not need to be considered. This allows us, for instance, to significantly reduce the space overhead of indices for natural language texts. For example, the most frequent words in the TREC-2 collection are ‘the,” “of,” “and,” “a,” “to,” and “in.”
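The rank-frequency behavior described by Zipf's law is easy to observe empirically. The following sketch (ours) counts word frequencies and prints rank times frequency, which stays roughly constant for a Zipf-like distribution with an exponent close to 1.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency) tuples sorted by decreasing frequency."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

sample = "the cat sat on the mat and the dog sat near the cat"
for rank, word, freq in rank_frequency(sample)[:5]:
    # For text obeying Zipf's law with exponent ~ 1, rank * freq is roughly constant.
    print(rank, word, freq, rank * freq)
```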
\nAnother issue is the distribution of words in the documents of a collection. A simple model is to consider that each word appears the same number of times in every document. However, this is not true in practice. A better model is to consider a negative binomial distribution, which says that the fraction of documents containing a word \n
where \n
The next issue is the number of distinct words in a document. This set of words is referred to as the document vocabulary. To predict the growth of the vocabulary size in natural language text, we use the so-called Heaps’ law [96]. This is a very precise law which states that the vocabulary of a text of size \n
Notice that the set of different words of a language is fixed by a constant (e.g., the number of different English words is finite). However, the limit is so high that it is much more accurate to assume that the size of the vocabulary is \n
Heaps’ law also applies to collections of documents because, as the total text size grows, the predictions of the model become more accurate. Furthermore, this model is also valid for the World Wide Web.
The last issue is the average length of words. This relates the text size in words with the text size in bytes (without accounting for punctuation and other extra symbols). For example, in the different subcollections of the TREC-2 collection, the average word length is very close to five letters, and the range of variation of this average in each subcollection is small (from 4.8 to 5.3 letters). If the stopwords are removed, the average word length increases to between 6 and 7 letters. If we take only the words of the vocabulary, the average length is higher (about 8 or 9 letters). This defines the total space needed for the vocabulary.
\nHeaps’ law implies that the length of the words in the vocabulary increases logarithmically with the text size and, thus, that longer and longer words should appear as the text grows. However, in practice, the average length of the words in the overall text is constant because shorter words are common enough (e.g., stopwords). This balance between short and long words, such that the average word length remains constant, has been noticed many times in different contexts and can also be explained by a finite-state model in which (a) the space character has a probability close to 0.2, (b) the space character cannot appear twice subsequently, and (c) there are 26 letters. This simple model is consistent with Zipf’s and Heaps’ laws.
\nThe models presented in this section are used in Chapters 8 and 13, in particular Zipf’s and Heaps’ laws.
In this section, we define notions of syntactic similarity between strings and documents. Similarity is measured by a distance function. For example, if we have strings of the same length, we can define the distance between them as the number of positions that have different characters. Then, the distance is 0 if they are equal. This is called the Hamming distance. A distance function should also be symmetric (i.e., the order of the arguments does not matter) and should satisfy the triangle inequality (i.e., $d(x, z) \le d(x, y) + d(y, z)$ for any strings $x$, $y$, and $z$).
An important distance over strings is the edit or Levenshtein distance mentioned earlier. The edit distance is defined as the minimum number of character insertions, deletions, and substitutions that we need to perform on either of the strings to make them equal. For instance, the edit distance between “color” and “colour” is one, while the edit distance between “survey” and “surgery” is two. The edit distance is considered superior to other more complex methods, such as the Soundex system, which is based on phonetics [97], for modeling syntactic errors. Extensions to the concept of edit distance include different weights for each operation, adding transpositions, etc.
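The edit distance can be computed with the classical dynamic programming recurrence; the sketch below keeps only one previous row of the table and reproduces the two examples from the text:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute (or match)
        prev = cur
    return prev[-1]

assert edit_distance("color", "colour") == 1
assert edit_distance("survey", "surgery") == 2
```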
\nThere are other measures. For example, assume that we are comparing two given strings and the only operation allowed is deletion of characters. Then, after all non-common characters have been deleted, the remaining sequence of characters (not necessarily contiguous in the original string, but in the same order) is the longest common subsequence (LCS) of both strings. For example, the LCS of “survey” and “surgery” is “surey.”
Similarity can be extended to documents. For example, we can consider lines as single symbols and compute the longest common subsequence of lines between two files. This is the measure used by the diff command in Unix-like operating systems. The main problem with this approach is that it is very time-consuming and does not consider lines that are similar (only identical lines match). The latter drawback can be fixed by taking a weighted edit distance between lines or by computing the LCS over all the characters. Other solutions include extracting fingerprints (any piece of text that in some sense characterizes it) for the documents and comparing them, or finding large repeated pieces. There are also visual tools to see document similarity. For example, Dotplot draws a rectangular map where both coordinates are file lines and the entry for each coordinate is a gray pixel that depends on the edit distance between the associated lines.
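For completeness, here is a sketch of the longest common subsequence computation, applicable both to characters and, diff-style, to lines treated as single symbols:

```python
def lcs(xs, ys):
    """Longest common subsequence; works on sequences of characters or of lines."""
    m, n = len(xs), len(ys)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xs[i] == ys[j] else max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if xs[i - 1] == ys[j - 1]:
            out.append(xs[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

print("".join(lcs("survey", "surgery")))                      # surey
print(lcs("a\nb\nc".splitlines(), "a\nx\nc".splitlines()))    # ['a', 'c'] -- diff-style, on lines
```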
\nDocument preprocessing is a procedure which can be divided mainly into five text operations (or transformations):
\nLexical analysis of the text with the objective of treating digits, hyphens, punctuation marks, and the case of letters.
Elimination of stopwords with the objective of filtering out words with very low discrimination values for retrieval purposes.
Stemming of the remaining words with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).
Selection of index terms to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact, noun words frequently carry more semantics than adjectives, adverbs, and verbs.
Construction of term categorization structures such as a thesaurus, or extraction of structure directly represented in the text, for allowing the expansion of the original query with related terms (a usually useful procedure).
In the following, each of these phases is discussed in detail. But, before proceeding, let us take a look at the logical view of the documents which results after each of the above phases is completed, as illustrated in Figure 3.4. As already discussed, by aggregating the preprocessing phases, we are able to move the logical view of the documents (adopted by the system) from that of a full text to that of a set of high-level indexing terms.
\nLogical view of a document throughout the various phases of text preprocessing.
Lexical analysis is the process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms). Thus, one of the major objectives of the lexical analysis phase is the identification of the words in the text. At first glance, all that seems to be involved is the recognition of spaces as word separators (in which case, multiple spaces are reduced to one space). However, there is more to it than this. For instance, the following four particular cases have to be considered with care [98]: digits, hyphens, punctuation marks, and the case of the letters (lower and upper case).
Numbers are usually not good index terms because, without a surrounding context, they are inherently vague. For instance, consider that a user is interested in documents about the number of deaths due to car accidents between the years 1910 and 1989. Such a request could be specified as the set of index terms: deaths, car, accidents, years, 1910, and 1989. However, the presence of the numbers 1910 and 1989 in the query could lead to the retrieval, for instance, of a variety of documents which refer to either of these two years. The problem is that numbers by themselves are just too vague. Thus, in general it is wise to disregard numbers as index terms. However, we also have to consider that digits might appear mixed within a word. For instance, “510B.C.” is clearly an important index term. In this case, it is not clear what rule should be applied. Furthermore, a sequence of 16 digits identifying a credit card number might be highly relevant in a given context and, in this case, should be considered as an index term. A preliminary approach for treating digits in the text might be to remove all words containing sequences of digits unless specified otherwise (through regular expressions). Further, an advanced lexical analysis procedure might perform some date and number normalization to unify formats.
Hyphens pose another difficult decision to the lexical analyzer. Breaking up hyphenated words might be useful due to inconsistency of usage. For instance, this allows treating “state-of-the-art” and “state of the art” identically. However, there are words which include hyphens as an integral part, for instance, gilt-edge, B-49, etc. Again, the most suitable procedure seems to be to adopt a general rule and specify the exceptions on a case-by-case basis.
Normally, punctuation marks are removed entirely in the process of lexical analysis. While some punctuation marks are an integral part of the word (for instance, “510B.C.”), removing them does not seem to have an impact on retrieval performance because the risk of misinterpretation in this case is minimal. In fact, if the user specifies “510B.C.” in his query, removal of the dot both in the query term and in the documents will not affect retrieval. However, very particular scenarios might again require the preparation of a list of exceptions. For instance, if a portion of program code appears in the text, it might be wise to distinguish between the variables “x.id” and “xid.” In this case, the dot mark should not be removed.
\nThe case of letters is usually not important for the identification of index terms. As a result, the lexical analyzer normally converts all the text to either lower or upper case. However, once more, very particular scenarios might require the distinction to be made. For instance, when looking for documents which describe details about the command language of a Unix-like operating system, the user might explicitly desire the non-conversion of upper cases because this is the convention in the operating system. Further, part of the semantics might be lost due to case conversion. For instance, the words Bank and bank have different meanings—a fact common to many other pairs of words.
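The four cases above are matters of policy rather than of algorithm. The sketch below shows one deliberately simple policy, lowercasing everything, treating hyphens and punctuation as separators, dropping pure digit sequences, and keeping mixed alphanumeric tokens; the helper name and the exact rules are assumptions made only for illustration:

```python
import re

def lexical_analysis(text, keep_mixed_alnum=True):
    """Split a character stream into candidate index terms.

    Policy (one reasonable choice among many): lowercase everything, break on
    hyphens and punctuation, drop tokens that are pure digit sequences, but
    optionally keep mixed tokens such as '510b' or 'b49'.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # punctuation and hyphens act as separators
    out = []
    for tok in tokens:
        if tok.isdigit():                             # plain numbers are too vague to index
            continue
        if not keep_mixed_alnum and any(ch.isdigit() for ch in tok):
            continue
        out.append(tok)
    return out

print(lexical_analysis("State-of-the-art results since 1989; see 510B.C. and B-49."))
# ['state', 'of', 'the', 'art', 'results', 'since', 'see', '510b', 'c', 'and', 'b']
```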
\nAs pointed out by Fox, all these text operations can be implemented without difficulty. However, careful thought should be given to each one of them because they might have a profound impact at document retrieval time. This is particularly worrisome in those situations in which the user finds it difficult to understand what the indexing strategy is doing. Unfortunately, there is no clear solution to this problem. As already mentioned, some Web search engines are opting for avoiding text operations altogether because this simplifies the interpretation the user has of the retrieval task. Whether this strategy will be the one of choice in the long term remains to be seen.
\nAs discussed in Chapter 2, words which are too frequent among the documents in the collection are not good discriminators. In fact, a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval. Such words are frequently referred to as stopwords and are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a list of stopwords.
\nElimination of stopwords has an additional important benefit. It reduces the size of the indexing structure considerably. In fact, it is typical to obtain a compression in the size of the indexing structure (for instance, in the size of an inverted list, see Chapter 8) of 40% or more solely with the elimination of stopwords.
\nSince stopword elimination also provides for compression of the indexing structure, the list of stopwords might be extended to include words other than articles, prepositions, and conjunctions. For instance, some verbs, adverbs, and adjectives could be treated as stopwords. In [99], a list of 425 stopwords is illustrated. Programs in C for lexical analysis are also provided.
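In implementation terms, a stopword filter is little more than a set membership test; the tiny list below is illustrative, whereas real systems use lists of several hundred entries such as the one cited above:

```python
# A minimal stopword filter; real systems use lists of several hundred words.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "or", "the", "to", "not", "be"}

def remove_stopwords(tokens):
    """Drop tokens that carry little discriminating power for retrieval."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = "the number of deaths due to car accidents".split()
print(remove_stopwords(tokens))   # ['number', 'deaths', 'due', 'car', 'accidents']
```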
Despite these benefits, elimination of stopwords might reduce recall. For instance, consider a user who is looking for documents containing the phrase “to be or not to be.” Elimination of stopwords might leave only the term “be,” making it almost impossible to properly recognize the documents which contain the phrase specified. This is one additional reason for the adoption of a full-text index (i.e., insert all words in the collection into the inverted file) by some Web search engines.
\nFrequently, the user specifies a word in a query, but only a variant of this word is present in a relevant document. Plurals, gerund forms, and past tense suffixes are examples of syntactical variations which prevent a perfect match between a query word and a respective document word. This problem can be partially overcome with the substitution of the words by their respective stems.
\nA stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes). A typical example of a stem is the word connect which is the stem for the variants connected, connecting, connection, and connections. Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of reducing the size of the indexing structure because the number of distinct index terms is reduced.
\nWhile the argument supporting stemming seems sensible, there is controversy in the literature about the benefits of stemming for retrieval performance. In fact, different studies lead to rather conflicting conclusions. Frakes [99] compares eight distinct studies on the potential benefits of stemming. While he favors the usage of stemming, the results of the eight experimental studies he investigated do not allow us to reach a satisfactory conclusion. As a result of these doubts, many Web search engines do not adopt any stemming algorithm whatsoever.
\nFrakes distinguishes four types of stemming strategies: affix removal, table lookup, successor variety, and n-grams. Table lookup consists simply of looking for the stem of a word in a table. It is a simple procedure but one which is dependent on data on stems for the whole language. Since such data is not readily available and might require considerable storage space, this type of stemming algorithm might not be practical. Successor variety stemming is based on the determination of morpheme boundaries, uses knowledge from structural linguistics, and is more complex than affix removal stemming algorithms. N-grams stemming is based on the identification of digrams and trigrams and is more a term clustering procedure than a stemming one. Affix removal stemming is intuitive and simple and can be implemented efficiently. Thus, in the remainder of this section, we concentrate our discussion on algorithms for affix removal stemming only.
\nIn affix removal, the most important part is suffix removal because most variants of a word are generated by the introduction of suffixes (instead of prefixes). While there are three or four well-known suffix removal algorithms, the most popular one is that by Porter because of its simplicity and elegance. Despite being simpler, the Porter algorithm yields results comparable to those of the more sophisticated algorithms.
The Porter algorithm uses a suffix list for suffix stripping. The idea is to apply a series of rules to the suffixes of the words in the text. For instance, the rule

s → nil

is used to convert plural forms into their respective singular forms by substituting the letter s by nil. Notice that to identify the suffix we must examine the last letters in the word. Furthermore, we look for the longest sequence of letters which matches the left-hand side in a set of rules. Thus, application of the two following rules

sses → ss
s → nil

to the word stresses yields the stem stress instead of the stem stresse. By separating such rules into five distinct phases, the Porter algorithm is able to provide effective stemming while running fast.
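The fragment below sketches only the plural-handling rules discussed above (a small subset of the first phase of the Porter algorithm), applying the rule whose left-hand side is the longest matching suffix:

```python
# A tiny Porter-like fragment: apply the rule whose left-hand side is the
# longest suffix matching the word. This is only the plural-handling step,
# not the full five-phase Porter algorithm.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural_suffix(word):
    for suffix, replacement in RULES:          # RULES is ordered longest-match first
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(strip_plural_suffix("stresses"))      # stress  (rule sses -> ss wins over s -> nil)
print(strip_plural_suffix("connections"))   # connection
print(strip_plural_suffix("caress"))        # caress  (ss -> ss leaves the word unchanged)
```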
\nIf a full-text representation of the text is adopted, then all words in the text are used as index terms. The alternative is to adopt a more abstract view in which not all words are used as index terms. This implies that the set of terms used as indices must be selected. In the area of bibliographic sciences, such a selection of index terms is usually done by a specialist. An alternative approach is to select candidates for index terms automatically.
\nDistinct automatic approaches for selecting index terms can be used. A good approach is the identification of noun groups (as done in the INQUERY system [73]) which we now discuss.
\nA sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives. While the words in each grammatical class are used with a particular purpose, it can be argued that most of the semantics is carried by the noun words. Thus, an intuitively promising strategy for selecting index terms automatically is to use the nouns in the text. This can be done through the systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns.
\nSince it is common to combine two or three nouns in a single component (e.g., computer science), it makes sense to cluster nouns which appear nearby in the text into a single indexing component (or concept). Thus, instead of simply using nouns as index terms, we adopt noun groups. A noun group is a set of nouns whose syntactic distance in the text (measured in terms of number of words between two nouns) does not exceed a predefined threshold (for instance, three).
\nWhen noun groups are adopted as indexing terms, we obtain a conceptual logical view of the documents in terms of sets of nonelementary index terms.
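The following sketch clusters nouns under the distance threshold just described; it assumes the text has already been part-of-speech tagged by some external tool, which is outside the scope of the sketch:

```python
def noun_groups(tagged_tokens, max_gap=3):
    """Cluster nouns whose distance in the text does not exceed max_gap words.

    tagged_tokens is a list of (word, part_of_speech) pairs; how the tags are
    produced (e.g., by an external POS tagger) is not covered here.
    """
    groups, current, last_noun_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_tokens):
        if tag != "NOUN":
            continue
        if last_noun_pos is not None and pos - last_noun_pos > max_gap:
            groups.append(current)
            current = []
        current.append(word)
        last_noun_pos = pos
    if current:
        groups.append(current)
    return groups

tagged = [("the", "DET"), ("theory", "NOUN"), ("of", "ADP"), ("computer", "NOUN"),
          ("science", "NOUN"), ("is", "VERB"), ("explained", "VERB"), ("slowly", "ADV"),
          ("in", "ADP"), ("this", "DET"), ("book", "NOUN")]
print(noun_groups(tagged))   # [['theory', 'computer', 'science'], ['book']]
```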
\nText compression is about finding ways to represent the text in fewer bits or bytes. The amount of space required to store text on computers can be reduced significantly using compression techniques. Compression methods create a reduced representation by identifying and using structures that exist in the text. From the compressed version, the original text can be reconstructed exactly.
\nText compression is becoming an important issue in an information retrieval environment. The widespread use of digital libraries, office automation systems, document databases, and the Web has led to an explosion of textual information available online. In this scenario, text compression appears as an attractive option for reducing costs associated with space requirements, input/output (I/O) overhead, and communication delays. The gain obtained from compressing text is that it requires less storage space, it takes less time to be transmitted over a communication link, and it takes less time to search directly the compressed text. The price paid is the time necessary to code and decode the text.
\nA major obstacle for storing text in compressed form is the need for IR systems to access text randomly. To access a given word in a compressed text, it is usually necessary to decode the entire text from the beginning until the desired word is reached. It could be argued that a large text could be divided into blocks that are compressed independently, thus allowing fast random access to each block. However, efficient compression methods need to process some text before making compression effective (usually more than 10 kilobytes). The smaller the blocks, the less effective compression is expected to be.
\nOur discussion here focuses on text compression methods which are suitable for use in an IR environment. For instance, a successful idea aimed at merging the requirements of compression algorithms and the needs of IR systems is to consider that the symbols to be compressed are words and not characters (character-based compression is the more conventional approach). Words are the atoms on which most IR systems are built. Moreover, it is now known that much better compression is achieved by taking words as symbols (instead of characters). Further, new word-based compression methods allow random access to words within the compressed text which is a critical issue for an IR system.
\nBesides the economy of space obtained by a compression method, there are other important characteristics to be considered such as compression and decompression speed. In some situations, decompression speed is more important than compression speed. For instance, this is the case with textual databases in which it is common to compress the text once and to read it many times from disk.
\nAnother important characteristic of a compression method is the possibility of performing compressed pattern matching, defined as the task of performing pattern matching in a compressed text without decompressing it. In this case, sequential searching can be speeded up by compressing the search key rather than decoding the compressed text being searched. As a consequence, it is possible to search faster on compressed text because much less text has to be scanned. Chapter 8 presents efficient methods to deal with searching the compressed text directly.
When the text collection is large, efficient text retrieval requires specialized index techniques. A simple and popular indexing structure for text collections is the inverted file. Inverted files (see Chapter 8 for details) are especially adequate when the pattern to be searched for is formed by simple words. Since this is a common type of query (for instance, when searching the Web), inverted files are widely used for indexing large text collections.
\nAn inverted file is typically composed of (a) a vector containing all the distinct words in the text collection (which is called the vocabulary) and (b) for each word in the vocabulary a list of all documents (identified by document numbers) in which that word occurs. Because each list of document numbers (within the inverted file) is organized in ascending order, specific compression methods have been proposed for them, leading to very efficient index compression schemes. This is important because query processing time is highly related to index access time. Thus, in this section, we also discuss some of the most important index compression techniques.
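To illustrate why ascending lists of document numbers compress well, the sketch below stores a postings list as gaps between consecutive docIDs and then applies a variable-byte code (7 data bits per byte, with the high bit marking the last byte of each number); the docIDs are arbitrary example values:

```python
def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, high bit set on the last byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n % 128)
            if n < 128:
                break
            n //= 128
        chunk[0] |= 128                     # mark the terminating (least significant) byte
        out.extend(reversed(chunk))
    return bytes(out)

def compress_postings(doc_ids):
    """Store ascending docIDs as gaps, then variable-byte encode the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

postings = [824, 829, 215406]
print(list(compress_postings(postings)))    # gaps 824, 5, 214577 packed into a few bytes
```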
\nWe first introduce basic concepts related to text compression. We then present some of the most important statistical compression methods, followed by a brief review of compression methods based on a dictionary. At the end, we discuss the application of compression to inverted files.
There are two general approaches to text compression: statistical and dictionary based. Statistical methods rely on generating good probability estimates (of appearance in the text) for each symbol. The more accurate the estimates are, the better the compression obtained. A symbol here is usually a character, a text word, or a fixed number of characters. The set of all possible symbols in the text is called the alphabet. The task of estimating the probability of each next symbol is called modeling. A model is essentially a collection of probability distributions, one for each context in which a symbol can be coded. Once these probabilities are available, the symbols are converted into binary digits, a process called coding. In practice, both the encoder and decoder use the same model. The decoder interprets the output of the encoder (with reference to the same model) to find out the original symbol.
There are two well-known statistical coding strategies: Huffman coding and arithmetic coding. The idea of Huffman coding is to assign a variable-length bit encoding to each different symbol of the text. Compression is achieved by assigning a smaller number of bits to symbols with higher probabilities of appearance. Huffman coding was first proposed in the early 1950s and was the most important compression method until the late 1970s, when arithmetic coding made higher compression rates possible.
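A compact sketch of Huffman code construction using a heap of partial code tables follows; the exact bit strings depend on how ties are broken, but the code lengths are optimal for the given symbol frequencies:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code: shorter bit strings for more frequent symbols."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:                       # degenerate case: a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least frequent partial trees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "for each rose, a rose is a rose"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes)
print(len(encoded), "bits instead of", 8 * len(text))
```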
\nArithmetic coding computes the code incrementally, one symbol at a time, as opposed to the Huffman coding scheme in which each different symbol is pre-encoded using a fixed-length number of bits. The incremental nature does not allow decoding a string which starts in the middle of a compressed file. To decode a symbol in the middle of a file compressed with arithmetic coding, it is necessary to decode the whole text from the very beginning until the desired word is reached. This characteristic makes arithmetic coding inadequate for use in an IR environment.
\nDictionary methods substitute a sequence of symbols by a pointer to a previous occurrence of that sequence. The pointer representations are references to entries in a dictionary composed of a list of symbols (often called phrases) that are expected to occur frequently. Pointers to the dictionary entries are chosen so that they need less space than the phrase they replace, thus obtaining compression. The distinction between modeling and coding does not exist in dictionary methods, and there are no explicit probabilities associated to phrases. The most well-known dictionary methods are represented by a family of methods, known as the Ziv-Lempel family.
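As an illustration of the dictionary approach, here is a sketch of LZ78, one member of the Ziv-Lempel family; it emits (dictionary index, next character) pairs and rebuilds the same dictionary during decoding:

```python
def lz78_encode(text):
    """LZ78: emit (dictionary index of longest known prefix, next character) pairs."""
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                                # flush a trailing phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

pairs = lz78_encode("abababababa")
print(pairs)                                  # [(0,'a'), (0,'b'), (1,'b'), (3,'a'), (2,'a'), (2,'a')]
assert lz78_decode(pairs) == "abababababa"
```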
\nCharacter-based Huffman methods are typically able to compress English texts to approximately five bits per character (usually, each uncompressed character takes 7–8 bits to be represented). More recently, a word-based Huffman method has been proposed as a better alternative for natural language texts. This method is able to reduce English texts to just over two bits per character. As we will see later on, word-based Huffman coding achieves compression rates close to the entropy and allows random access to intermediate points in the compressed text. Ziv-Lempel methods are able to reduce English texts to fewer than four bits per character. Methods based on arithmetic coding can also compress English texts to just over two bits per character. However, the price paid is slower compression and decompression, and the impossibility of randomly accessing intermediate points in the compressed text.
\nBefore proceeding, let us present an important definition which will be useful from now on.
\n\nDefinition 3.7. Compression ratio is the size of the compressed file as a fraction of the uncompressed file.
\nAn inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the “occurrences” (\nFigure 3.5\n shows an example). These positions can refer to words or characters. Word positions (i.e., position i refers to the ith word) simplify phrase and proximity queries, while character positions (i.e., the position i is the ith character) facilitate direct access to the matching text positions.
\nA sample text and an inverted index built on it. The words are converted to lower case, and some are not indexed. The occurrences point to character positions in the text.
Some authors make the distinction between inverted files and inverted lists. In an inverted file, each element of a list points to a document or file name, while inverted lists match our definition. We prefer not to make such a distinction because, as we will see later, this is a matter of the addressing granularity, which can range from text positions to logical blocks.
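A minimal sketch of building such an index with character positions, using a short sample text in the spirit of Figure 3.5 and a small illustrative stopword list:

```python
import re
from collections import defaultdict

def build_inverted_index(text, stopwords=frozenset({"a", "the", "to", "of", "in", "is"})):
    """Map each indexed word to the list of character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            index[word].append(match.start())
    return dict(index)

text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)
print(index["text"])    # two occurrences, addressed by character position
print(index["words"])   # both 'words' and 'Words' collapse to the same entry
```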
The space required for the vocabulary is rather small. According to Heaps’ law, the vocabulary grows as $O(n^{\beta})$, where $\beta$ is a constant between 0 and 1 that depends on the text (in practice, between 0.4 and 0.6).
The occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n). Even omitting stopwords (which is the default practice when words are indexed), in practice the space overhead of the occurrences is between 30 and 40% of the text size.
To reduce space requirements, a technique called block addressing is used. The text is divided into blocks, and the occurrences point to the blocks where the word appears (instead of the exact positions). The classical indices which point to the exact occurrences are called “full inverted indices.” Using block addressing, not only can the pointers be smaller because there are fewer blocks than positions, but also all the occurrences of a word inside a single block are collapsed to one reference (see Figure 3.6). Indices of only 5% overhead over the text size are obtained with this technique. The price to pay is that, if the exact occurrence positions are required (for instance, for a proximity query), then an online search over the qualifying blocks has to be performed. For instance, block addressing indices with 256 blocks stop working well with texts of 200 Mb.
\nThe sample text splits into four blocks, and an inverted index uses block addressing built on it. The occurrences denote block numbers. Notice that both occurrences of “words” collapsed into one.
Table 3.1 presents the projected space taken by inverted indices for texts of different sizes, with and without the use of stopwords. The full inversion stands for inverting all the words and storing their exact positions, using four bytes per pointer. The document addressing index assumes that we point to documents which are of size 10 kb (and the necessary number of bytes per pointer, that is, one, two, and three bytes, depending on text size). The block addressing index assumes that we use 256 or 64K blocks (one or two bytes per pointer) independently of the text size. The space taken by the pointers can be significantly reduced by using compression. We assume that 45% of all the words are stopwords and that there is one non-stopword every 11.5 characters. Our estimation for the vocabulary is based on Heaps’ law with typical parameter values.
| Index | Small collection (1 Mb) | | Medium collection (200 Mb) | | Large collection (2 Gb) | |
|---|---|---|---|---|---|---|
| Addressing words | 45% | 73% | 36% | 64% | 35% | 63% |
| Addressing documents | 19% | 26% | 18% | 32% | 26% | 47% |
| Addressing 64K blocks | 27% | 41% | 18% | 32% | 5% | 9% |
| Addressing 256 blocks | 18% | 25% | 1.7% | 2.4% | 0.5% | 0.7% |
Sizes of an inverted file as approximate percentages of the size of the whole text collection.
The blocks can be of fixed size (imposing a logical block structure over the text database), or they can be defined using the natural division of the text collection into files, documents, Web pages, or others. The division into blocks of fixed size improves efficiency at retrieval time: the more variance in the block sizes, the more text must be traversed sequentially on average. This is because larger blocks match queries more frequently and are more expensive to traverse.
\nAlternatively, the division using natural cuts may eliminate the need for online traversal. For example, if one block per retrieval unit is used and the exact match positions are not required, there is no need to traverse the text for single-word queries, since it is enough to know which retrieval units to report. But if, on the other hand, many retrieval units are packed into a single block, the block has to be traversed to determine which units to retrieve.
\nIt is important to notice that in order to use block addressing, the text must be readily available at search time. This is not the case for remote text (as in Web search engines) or if the text is in a CD-ROM that has to be mounted, for instance. Some restricted queries not needing exact positions can still be solved if the blocks are retrieval units.
Four granularities and three collections are considered. For each collection, the left column considers that stopwords are not indexed, while the right column considers that all words are indexed.
The basic steps in constructing a non-positional index are as follows. We first make a pass through the collection assembling all term-docID pairs. We then sort the pairs with the term as the dominant key and docID as the secondary key. Finally, we organize the docIDs for each term into a postings list and compute statistics like term and document frequency. For small collections, all this can be done in memory. In this chapter, we describe methods for large collections that require the use of secondary storage.
\nTo make index construction more efficient, we represent terms as termIDs, where each termID is a unique serial number. We can build the mapping from terms to termIDs on the fly while we are processing the collection, or, in a two-pass approach, we compile the vocabulary in the first pass and construct the inverted index in the second pass. The index construction algorithms described in this chapter all do a single pass through the data. Section 4.7 gives references to multipass algorithms that are preferable in certain applications, for example, when disk space is scarce.
\nWe work with the Reuters Corpus Volume I (RCV1) collection as our model collection in this chapter, a collection with roughly 1 GB of text. It consists of about 800,000 documents that were sent over the Reuters newswire during a 1-year period between August 20, 1996, and August 19, 1997. A typical document is shown in \nFigure 3.7\n, but note that we ignore multimedia information like images in this book and are only concerned with text. Reuters Corpus Volume I (RCV1) covers a wide range of international topics, including politics, business, sports, and (as in this example) science. Some key statistics of the collection are shown in \nTable 3.2\n.
\nDocument from the Reuters newswire.
| Symbol | Statistic | Value |
|---|---|---|
| N | Documents | 800,000 |
| L | Avg. number of tokens per document | 200 |
| M | Terms | 400,000 |
|  | Avg. number of bytes per token (incl. spaces/punct.) | 6 |
|  | Avg. number of bytes per token (without spaces/punct.) | 4.5 |
|  | Avg. number of bytes per term | 7.5 |
| T | Tokens | 100,000,000 |
Collection statistics for Reuters Corpus Volume I (RCV1).
Values are rounded for the computations in this book. The unrounded values are 806,791 documents, 222 tokens per document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation, 4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and 96,969,056 tokens.
\nReuters Corpus Volume I (RCV1) has 100 million tokens. Collecting all termID-docID pairs of the collection using 4 bytes each for termID and docID therefore requires 0.8 GB of storage. Typical collections today are often one or two orders of magnitude larger than Reuters Corpus Volume I (RCV1). You can easily see how such collections overwhelm even large computers if we try to sort their termID-docID pairs in memory. If the size of the intermediate files during index construction is within a small factor of available memory, then the compression techniques introduced in Chapter 5 can help; however, the postings file of many large collections cannot fit into memory even after compression (\nFigure 3.7\n).
With insufficient main memory, we need to use an external sorting algorithm, that is, one that uses disk. To achieve acceptable sorting efficiency, the central requirement of such an algorithm is that it minimizes the number of random disk seeks during sorting; sequential disk reads are far faster than seeks, as we explained in Section 4.1. One solution is the blocked sort-based indexing algorithm or BSBI in Figure 3.8. BSBI (i) segments the collection into parts of equal size, (ii) sorts the termID-docID pairs of each part in memory, (iii) stores intermediate sorted results on disk, and (iv) merges all intermediate results into the final index.
Blocked sort-based indexing. The algorithm stores inverted blocks in files $f_1, \ldots, f_n$ and the merged index in $f_{\mathrm{merged}}$.
The algorithm parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK in \nFigure 3.8\n). We choose the block size to fit comfortably into memory to permit a fast in-memory sort. The block is then inverted and written to disk. Inversion involves two steps. First, we sort the termID-docID pairs. Next, we collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk. Applying this to Reuters Corpus Volume I (RCV1) and assuming we can fit 10 million termID-docID pairs into memory, we end up with ten blocks, each an inverted index of one part of the collection.
In the final step, the algorithm simultaneously merges the ten blocks into one large merged index. An example with two blocks is shown in Figure 3.9, where we use $d_i$ to denote the $i$-th document of the collection.
Merging in blocked sort-based indexing. Two blocks (postings lists to be merged) are loaded from disk into memory, merged in memory (merged postings lists), and written back to disk. We show terms instead of termIDs for better readability.
How expensive is BSBI? Its time complexity is $\Theta(T \log T)$, because the step with the highest time complexity is sorting and $T$ is an upper bound for the number of items we must sort (that is, the number of termID-docID pairs). However, the actual indexing time is usually dominated by the time it takes to parse the documents and to do the final merge.
Notice that Reuters Corpus Volume I (RCV1) is not particularly large in an age when one or more GB of memory is standard on personal computers. With appropriate compression, we could have created an inverted index for RCV1 in memory on a not overly beefy server. The techniques we have described are needed, however, for collections that have several orders of larger magnitude.
\nBlocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing or SPIMI. SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.
The SPIMI algorithm is shown in Figure 3.10. The part of the algorithm that parses documents and turns them into a stream of term-docID pairs, which we call tokens here, has been omitted. SPIMI-INVERT is called repeatedly on the token stream until the entire collection has been processed. Tokens are processed one by one (line 4) during each successive call of SPIMI-INVERT. When a term occurs for the first time, it is added to the dictionary (best implemented as a hash), and a new postings list is created (line 6). The call in line 7 returns this postings list for subsequent occurrences of the term.
\nInversion of a block in single-pass in-memory indexing.
A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10). Instead of first collecting all termID-docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows), and it is immediately available to collect postings. This has two advantages: it is faster because there is no sorting required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored. As a result, the blocks that individual calls of SPIMI-INVERT can process are much larger, and the index construction process as a whole is more efficient.
\nBecause we do not know how large the postings list of a term will be when we first encounter it, we allocate space for a short postings list initially and double the space each time it is full (lines 8–9). This means that some memory is wasted, which counteracts the memory savings from the omission of termIDs in intermediate data structures. However, the overall memory requirements for the dynamically constructed index of a block in SPIMI are still lower than in BSBI.
\nWhen memory has been exhausted, we write the index of the block (which consists of the dictionary and the postings lists) to disk (line 12). We have to sort the terms (line 11) before doing this because we want to write postings lists in lexicographic order to facilitate the final merging step. If each block’s postings lists were written in unsorted order, merging blocks could not be accomplished by a simple linear scan through each block.
\nEach call of SPIMI-INVERT writes a block to disk, just as in BSBI. The last step of SPIMI (corresponding to line 7 in \nFigure 3.8\n; not shown in \nFigure 3.10\n) is then to merge the blocks into the final inverted index.
\nIn addition to constructing a new dictionary structure for each block and eliminating the expensive sorting step, SPIMI has a third important component: compression. Both the postings and the dictionary terms can be stored compactly on disk if we employ compression. Compression increases the efficiency of the algorithm further because we can process even larger blocks and because the individual blocks require less space on disk. We refer readers to the literature for this aspect of the algorithm.
The time complexity of SPIMI is $\Theta(T)$ because no sorting of tokens is required and all operations are at most linear in the size of the collection.
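The control flow of SPIMI can be sketched as follows, using terms instead of termIDs; the block size limit, the file naming, and the JSON block format are illustrative choices, and the final merge of the block files is omitted:

```python
import json, os, tempfile

BLOCK_LIMIT = 1000          # max number of postings per in-memory block (tiny, for illustration)

def spimi_invert(token_stream, out_dir):
    """Consume (term, docID) tokens; write a sorted block to disk whenever memory is 'full'."""
    block_files, dictionary, count = [], {}, 0
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)   # postings list grows dynamically
        count += 1
        if count >= BLOCK_LIMIT:
            block_files.append(write_block(dictionary, out_dir))
            dictionary, count = {}, 0
    if dictionary:
        block_files.append(write_block(dictionary, out_dir))
    return block_files

def write_block(dictionary, out_dir):
    """Sort the terms and write the block's dictionary plus postings to disk."""
    path = os.path.join(out_dir, f"block_{len(os.listdir(out_dir))}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({t: dictionary[t] for t in sorted(dictionary)}, f)
    return path

tmp = tempfile.mkdtemp()
tokens = [("rose", 1), ("each", 1), ("rose", 2), ("is", 2), ("rose", 3)]
print(spimi_invert(tokens, tmp))      # one small block file; a final merge step is still needed
```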
This section presents a series of dictionary data structures that achieve increasingly higher compression ratios. The dictionary is small compared with the postings file, as the collection statistics in Table 3.2 suggest. So, why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?
\nOne of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. Thus, the main goal of compressing the dictionary is to fit it in the main memory, or at least a large portion of it, to support high query throughput. Although dictionaries of very large collections fit into the memory of a standard desktop machine, this is not true of many other application scenarios. For example, an enterprise search server for a large corporation may have to index a multiterabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages. We also want to be able to design search systems for limited hardware such as mobile phones and onboard computers. Other reasons for wanting to conserve memory are fast start-up time and having to share resources with other applications. The search system on your PC must get along with the memory-hogging word processing suite you are using at the same time (\nFigure 3.11\n).
\nStoring the dictionary as an array of fixed-width entries.
Information retrieval systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text, by which we mean raw text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title, and salary. There are fundamental differences between information retrieval and database systems in terms of retrieval model, data structures, and query languages, as shown in Table 3.3.
Some highly structured text search problems are most efficiently handled by a relational database (RDB), for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query select lastname from employees where job_desc like 'invoic%' may be sufficient to satisfy your information need with high precision and recall.
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging), and output from office suites like OpenOffice that save documents as marked up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria. Examples of such queries are: give me a full-length article on fast Fourier transforms (digital libraries), give me patents whose claims mention RSA public-key encryption and that cite US patent 4,405,829 (patent databases), or give me articles about sightseeing tours of the Vatican and the Coliseum (entity-tagged text). These three queries are structured queries that cannot be answered well by an unranked retrieval system. As we argued in Example 1.1 (p. 15), unranked retrieval models like the Boolean model suffer from low recall. For instance, an unranked system would return a potentially large number of articles that mention the Vatican, the Coliseum, and sightseeing tours without ranking the ones that are most relevant for the query first. Most users are also notoriously bad at precisely stating structural constraints. For instance, users may not know for which structured elements the search system supports search. In our example, the user may be unsure whether to issue the query as sightseeing AND (COUNTRY:Vatican OR LANDMARK:Coliseum), as sightseeing AND (STATE:Vatican OR BUILDING:Coliseum), or in some other form. Users may also be completely unfamiliar with structured search and advanced search interfaces or unwilling to use them. In this chapter, we look at how ranked retrieval methods can be adapted to structured documents to address these problems.
\nThere is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries.
\nWe will only look at one standard for encoding structured documents, Extensible Markup Language or XML, which is currently the most widely used such standard. We will not cover the specifics that distinguish XML from other types of markup such as HTML and SGML. But most of what we say in this chapter is applicable to markup languages in general.
\nIn the context of information retrieval, we are only interested in XML as a language for encoding text and documents. A perhaps more widespread use of XML is to encode non-text data. For example, we may want to export data in XML format from an enterprise resource planning system and then read them into an analytic program to produce graphs for a presentation. This type of application of XML is called data-centric because numerical and non-text attribute-value data dominate and text is usually a small fraction of the overall data. Most data-centric XML is stored in databases—in contrast to the inverted index-based methods for text-centric XML that we present in this chapter.
We call XML retrieval structured retrieval in this chapter. Some researchers prefer the term semistructured retrieval to distinguish XML retrieval from database querying. We have adopted the terminology that is widespread in the XML retrieval community. For instance, the standard way of referring to XML queries is structured queries, not semistructured queries. The term structured retrieval is rarely used for database querying, and it always refers to XML retrieval in this book.
There is a second type of information retrieval problem that is intermediate between unstructured retrieval and querying a relational database: parametric and zone search, which we discussed in Section 6.1 (p. 110). In the data model of parametric and zone search, there are parametric fields (relational attributes like date or file size) and zones, text attributes that each take a chunk of unstructured text as value. The data model is flat, that is, there is no nesting of attributes, and the number of attributes is small. In contrast, XML documents have the more complex tree structure that we see in Figure 3.13, in which attributes are nested. The number of attributes and nodes is greater than in parametric and zone search.
After presenting the basic concepts of XML, this chapter first discusses the challenges we face in XML retrieval. Next, we describe a vector space model for XML retrieval. Finally, we present INEX, a shared-task evaluation that has been held for a number of years and currently is the most important venue for XML retrieval research.
An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag. An element can have one or more XML attributes. In the XML document in Figure 3.12, the scene element is enclosed by the opening tag <scene number="vii"> and the closing tag </scene>.
An XML document.
\n\nFigure 3.13\n shows \nFigure 3.12\n as a tree. The leaf nodes of the tree consist of text, for example, Shakespeare, Macbeth, and Macbeth’s castle. The tree’s internal nodes encode either the structure of the document (title, act, and scene) or metadata functions (author).
The XML document in Figure 3.12 as a simplified DOM object.
The standard for accessing and processing XML documents is the XML Document Object Model or DOM. The DOM represents elements, attributes, and text within elements as nodes in a tree. \nFigure 3.13\n is a simplified DOM representation of the XML document in \nFigure 3.12\n. With a DOM API, we can process an XML document by starting at the root element and then descending down the tree from parents to children.
XPath is a standard for enumerating paths in an XML document collection. We will also refer to paths as XML contexts or simply contexts in this chapter. Only a small subset of XPath is needed for our purposes. The XPath expression node selects all nodes of that name. Successive elements of a path are separated by slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that an arbitrary number of elements can intervene on a path: play//scene selects all scene elements occurring in a play element. In Figure 3.13, this set consists of a single scene element, which is accessible via the path play, act, and scene from the top. An initial slash starts the path at the root element. /play/title selects the play's title in Figure 3.12, /play//title selects a set with two members (the play's title and the scene's title), and /scene/title selects no elements. For notational convenience, we allow the final element of a path to be a vocabulary term and separate it from the element path by the symbol #, even though this does not conform to the XPath standard. For example, title#“Macbeth” selects all titles containing the term Macbeth.
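Much of this XPath subset can be exercised with standard library tools; for instance, Python's xml.etree.ElementTree supports child paths such as act/scene and descendant paths such as .//title. The small document below follows the structure described for Figure 3.12, with its exact content assumed for illustration:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
""")

print([t.text for t in doc.findall("title")])        # titles that are direct children of play
print([t.text for t in doc.findall(".//title")])     # all titles, at any depth (like //title)
print([s.get("number") for s in doc.findall("act/scene")])   # scenes whose parent is an act
```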
We also need the concept of schema in this chapter. A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute. Two standards for schemas for XML documents are XML DTD (document-type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.
\nA common format for XML queries is NEXI (Narrowed Extended XPath I). We give an example in \nFigure 3.14\n. We display the query on four lines for typographical convenience, but it is intended to be read as one unit without line breaks. In particular, //section is embedded under //article.
\nAn XML query in NEXI format and its partial representation as a tree.
The query in Figure 3.14 specifies a search for sections about the summer holidays that are part of articles from 2001 to 2002. As in XPath, double slashes indicate that an arbitrary number of elements can intervene on a path. The dot in a clause in square brackets refers to the element the clause modifies. The clause [.//yr = 2001 or .//yr = 2002] modifies //article. Thus, the dot refers to //article in this case. Similarly, the dot in [about(., summer holidays)] refers to the section that the clause modifies.
The two conditions on yr are relational attribute constraints. Only articles whose year attribute is 2001 or 2002 (or that contain an element whose year attribute is 2001 or 2002) are to be considered. The about clause is a ranking constraint: sections that occur in the right type of article are to be ranked according to how relevant they are to the topic summer holidays.
\nWe usually handle relational attribute constraints by prefiltering or postfiltering: we simply exclude all elements from the result set that do not meet the relational attribute constraints. In this chapter, we will not address how to do this efficiently and instead focus on the core information retrieval problem in XML retrieval, namely, how to rank documents according to the relevance criteria expressed in the about conditions of the NEXI query.
If we discard relational attributes, we can represent documents as trees with only one type of node: element nodes. In other words, we remove all attribute nodes from the XML document, such as the number attribute in Figure 3.12. Figure 3.15 shows a subtree of the document in Figure 3.12 as an element-node tree (labeled $d_1$).
Tree representation of XML documents and queries.
We can represent queries as trees in the same way. This is a query-by-example approach to query language design because users pose queries by creating objects that satisfy the same formal description as documents. In Figure 3.15, the queries are represented as trees in the same way as the document $d_1$.
In this section, we discuss a number of challenges that make structured retrieval more difficult than unstructured retrieval. Recall from page 195 the basic setting we assume in structured retrieval: the collection consists of structured documents, and queries are either structured (as in \nFigure 3.14\n) or unstructured (e.g., summer holidays).
The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not the entire documents as IR systems usually do in unstructured retrieval. If we query Shakespeare’s plays for Macbeth’s castle, should we return the scene, the act, or the entire play in Figure 3.13? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit.
\nOne criterion for selecting the most appropriate part of a document is the structured document retrieval principle:
\nStructured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.
\nThis principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. However, it can be hard to implement this principle algorithmically. Consider the query title#“Macbeth” applied to \nFigure 3.13\n. The title of the tragedy, Macbeth, and the title of act I, scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred. Deciding which level of the tree is right for answering a query is difficult.
\nParallel to the issue of which parts of a document to return to the user is the issue of which parts of a document to index. In Section 2.1.2 (p. 20), we discussed the need for a document unit or indexing unit in indexing and retrieval. In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web, etc. In structured retrieval, there are a number of different approaches to defining the indexing unit.
\nOne approach is to group nodes into non-overlapping pseudo-documents as shown in \nFigure 3.16\n. In the example, books, chapters, and sections have been designated to be indexing units but without overlap. For example, the leftmost dashed indexing unit contains only those parts of the tree dominated by book that are not already part of other indexing units. The disadvantage of this approach is that pseudo-documents may not make sense to the user because they are not coherent units. For instance, the leftmost indexing unit in \nFigure 3.16\n merges three disparate elements, the class, author, and title elements.
\nPartitioning an XML document into non-overlapping indexing units.
We can also use one of the largest elements as the indexing unit, for example, the book element in a collection of books or the play element for Shakespeare’s works. We can then postprocess search results to find for each book or play the subelement that is the best hit. For example, the query Macbeth’s castle may return the play Macbeth, which we can then postprocess to identify act I, scene vii, as the best-matching subelement. Unfortunately, this two-stage retrieval process fails to return the best subelement for many queries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it.
Instead of retrieving large units and identifying subelements (top-down), we can also search all leaves, select the most relevant ones, and then extend them to larger units in postprocessing (bottom-up). For the query Macbeth's castle in Figure 3.12, we would retrieve the title Macbeth's castle in the first pass and then decide in a postprocessing step whether to return the title, the scene, the act, or the play. This approach has a similar problem to the last one: the relevance of a leaf element is often not a good predictor of the relevance of the elements it is contained in.
The least restrictive approach is to index all elements. This is also problematic. Many XML elements are not meaningful search results, for example, purely typographical elements. Moreover, since elements are nested within one another, indexing all elements means that search results will contain many elements that are themselves contained in other returned elements.
Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned. Restriction strategies include:
Discard all small elements.
Discard all element types that users do not look at (this requires a working XML retrieval system that logs this information).
Discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available).
Only keep element types that a system designer or librarian has deemed to be useful search results.
In most of these approaches, result sets will still contain nested elements. Thus, we may want to remove some elements in a postprocessing step to reduce redundancy. Alternatively, we can collapse several nested elements in the results list and use highlighting of query terms to draw the user’s attention to the relevant passages. If query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section and the paragraph both occur in the results list, it is sufficient to show the section. An additional advantage of this approach is that the paragraph is presented together with its context (i.e., the embedding section). This context may be helpful in interpreting the paragraph (e.g., the source of the information reported) even if the paragraph on its own satisfies the query.
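As a concrete illustration of this postprocessing step, the sketch below removes results whose element is nested inside another element that also appears in the result list, so that only the enclosing element (e.g., the section rather than one of its paragraphs) is shown. The (score, path) representation and the bracketed path format are assumptions made for this example, not part of the system described above.

```python
def remove_nested(results):
    """Drop results whose element is nested inside another element that also
    appears in the result list (keep the enclosing element only).

    `results` is a list of (score, path) pairs, where `path` is a simple
    element path such as "/book[1]/chapter[2]/section[3]" (hypothetical format).
    """
    paths = {path for _, path in results}
    kept = []
    for score, path in results:
        # An ancestor path is any proper prefix ending at an element boundary.
        prefixes = {path[:i] for i in range(1, len(path)) if path[i] == "/"}
        if prefixes & paths:
            continue  # an enclosing element is already in the result list
        kept.append((score, path))
    return kept


if __name__ == "__main__":
    hits = [
        (0.9, "/book[1]/chapter[2]/section[3]"),
        (0.8, "/book[1]/chapter[2]/section[3]/para[1]"),  # nested: dropped
        (0.7, "/book[1]/chapter[5]"),
    ]
    print(remove_nested(hits))
```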
If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated because few nested elements have the same type. But as we discussed in the introduction, users often do not know what an element in the collection is called (Is the Vatican a country or a city?), or they may not know how to compose structured queries at all.
A challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idf) statistics as defined in Section 6.2.1 (p. 117). For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if the latter refers to the plural of gate. It makes little sense to compute a single document frequency for Gates in this example.
One solution is to compute idf for XML-context/term pairs, for example, to compute different idf weights for author#“Gates” and section#“Gates.” Unfortunately, this scheme will run into sparse data problems; that is, many XML-context/term pairs occur too rarely to reliably estimate df (see Section 13.2, p. 260, for a discussion of sparseness). A compromise is to consider only the parent node of the term as its context, rather than the full path from the root, when distinguishing contexts.
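A minimal sketch of per-context idf computation along these lines follows. The representation of a document as a list of (parent-element, term) occurrences is an assumption made for illustration; it corresponds to the compromise of using only the parent node as the context.

```python
import math
from collections import defaultdict

def context_term_idf(docs):
    """Compute idf for XML-context/term pairs such as ("author", "gates").

    `docs` is a list of documents; each document is a list of
    (parent_element, term) occurrences (an assumed representation).
    """
    n_docs = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for pair in set(doc):          # count each (context, term) once per document
            df[pair] += 1
    return {pair: math.log10(n_docs / d) for pair, d in df.items()}


if __name__ == "__main__":
    docs = [
        [("author", "gates"), ("section", "gates")],
        [("section", "gates")],
        [("author", "bill"), ("author", "gates")],
    ]
    idf = context_term_idf(docs)
    print(idf[("author", "gates")])    # differs from ...
    print(idf[("section", "gates")])   # ... the idf of the same term in another context
```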
In many cases, several different XML schemas occur in a collection, since the XML documents in an IR application often come from more than one source. This phenomenon is called schema heterogeneity or schema diversity, and it presents yet another challenge. As illustrated in Figure 3.17, comparable elements may have different names: for example, an element called creator in one schema may be called author in another.
Schema heterogeneity: intervening nodes and mismatched names.
Schema heterogeneity is one reason for query-document mismatches like \n
We can also support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed. We call such queries extended queries. The tree in \nFigure 3.14\n and \n
In \nFigure 3.18\n, the user is looking for a chapter entitled FFT (\n
A structural mismatch between two queries and a document.
In this section, we present a simple vector space model for XML retrieval. It is not intended to be a complete description of a state-of-the-art system. Instead, we want to give the reader a flavor of how documents can be represented and retrieved in XML retrieval.
To take structure into account in retrieval, as in Figure 3.15, we want a book entitled Julius Caesar to be a match for a query that asks for Julius Caesar in the title, but not (or only as a lower-weighted match) for a query that asks for Julius Caesar as an author. One way of doing this is to have each dimension of the vector space encode a word together with its position within the XML tree.
Figure 3.19 illustrates this representation. We first take each text node (which in our setup is always a leaf) and break it into multiple nodes, one for each word. So the leaf node Bill Gates is split into two leaves, Bill and Gates. Next, we define the dimensions of the vector space to be lexicalized subtrees of documents, that is, subtrees that contain at least one vocabulary term. A subset of these possible lexicalized subtrees is shown in the figure, but there are others, for example, the subtree corresponding to the whole document with the leaf node Gates removed. We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them. This means that we can use the vector space formalism from Chapter 6 for XML retrieval. The main difference is that the dimensions of the vector space in unstructured retrieval are vocabulary terms, whereas they are lexicalized subtrees in XML retrieval.
A mapping of an XML document (left) to a set of lexicalized subtrees (right).
There is a trade-off between the dimensionality of the space and the accuracy of query results. If we trivially restrict dimensions to vocabulary terms, then we have a standard vector space retrieval system that will retrieve many documents that do not match the structure of the query (e.g., Gates in the title as opposed to the author element). If we create a separate dimension for each lexicalized subtree occurring in the collection, the dimensionality of the space becomes too large. A compromise is to index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs. We call such an XML-context/term pair a structural term and denote it by ⟨c, t⟩: a pair of an XML context c and a vocabulary term t.
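The following sketch extracts structural terms from an XML document with Python's standard ElementTree module. Here the XML context of a term is simplified to its full path of element names, which is one of several possible choices rather than the exact representation used above.

```python
import xml.etree.ElementTree as ET

def structural_terms(xml_string):
    """Extract structural terms: (path-from-root, term) pairs, one for each
    word occurring in a text node (a simplified notion of XML context)."""
    root = ET.fromstring(xml_string)
    terms = []

    def walk(elem, path):
        path = path + "/" + elem.tag
        if elem.text:
            for word in elem.text.split():
                terms.append((path, word.lower()))
        for child in elem:
            walk(child, path)

    walk(root, "")
    return terms


if __name__ == "__main__":
    doc = "<book><title>Julius Caesar</title><author>Julius Caesar</author></book>"
    for t in structural_terms(doc):
        print(t)
    # ('/book/title', 'julius'), ('/book/title', 'caesar'),
    # ('/book/author', 'julius'), ('/book/author', 'caesar')
```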
As we discussed in the last section, users are bad at remembering details of the schema and at constructing queries that comply with it. We will therefore interpret all queries as extended queries; that is, there can be an arbitrary number of intervening nodes in the document for any parent-child node pair in the query. For example, the query context /play/title is interpreted as matching any title element that is a descendant of a play element, such as /play/act/scene/title, regardless of the intervening nodes.
But we still prefer documents that match the query structure closely by inserting fewer additional nodes. We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path $c_q$ in a query and a path $c_d$ in a document is the context resemblance function CR:
$$\mathrm{CR}(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d, \\ 0 & \text{if } c_q \text{ does not match } c_d, \end{cases}$$
where $|c_q|$ and $|c_d|$ are the number of nodes in the query path and the document path, respectively, and $c_q$ matches $c_d$ if and only if $c_q$ can be transformed into $c_d$ by inserting additional nodes. The value of CR is 1.0 if the two paths are identical and decreases as the document path becomes longer than the query path.
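Assuming the formulation of context resemblance given above, a minimal implementation might look as follows; representing paths as lists of element names is an assumption of this sketch.

```python
def context_resemblance(cq, cd):
    """CR(cq, cd) as sketched above. cq and cd are sequences of element names,
    e.g. ["play", "title"] and ["play", "act", "scene", "title"].
    cq "matches" cd if cq can be turned into cd by inserting extra nodes,
    i.e. cq is a subsequence of cd."""
    it = iter(cd)
    matches = all(node in it for node in cq)   # subsequence test
    if not matches:
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))


if __name__ == "__main__":
    print(context_resemblance(["play", "title"], ["play", "title"]))                 # 1.0
    print(context_resemblance(["play", "title"], ["play", "act", "scene", "title"])) # 0.6
    print(context_resemblance(["play", "title"], ["book", "title"]))                 # 0.0
```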
The final score for a document is computed as a variant of the cosine measure [Eq. (6.10), p. 121], which we call SIMNOMERGE for reasons that will become clear shortly. SIMNOMERGE is defined as follows:
$$\mathrm{SIMNOMERGE}(q, d) = \sum_{c_q \in B} \sum_{t \in V} \sum_{c \in B} \mathrm{CR}(c_q, c)\, \mathrm{weight}(q, t, c_q)\, \frac{\mathrm{weight}(d, t, c)}{\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}},$$
where V is the vocabulary of non-structural terms, B is the set of all XML contexts, and weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively.
The algorithm for computing SIMNOMERGE for all documents in the collection is shown in Figure 3.22. The array normalizer in Figure 3.22 contains the length normalization factor $\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}$ for each document.
We give an example of how SIMNOMERGE computes query-document similarities in Figure 3.21.
The algorithm for scoring documents with SIMNOMERGE.
Scoring of a query with one structural term in SIMNOMERGE.
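The sketch below mirrors the SIMNOMERGE scoring loop described above. It reuses the context_resemblance function from the earlier sketch, and the query, index, and normalizer data structures are assumptions made for illustration, not the exact structures of the algorithm in Figure 3.22.

```python
from collections import defaultdict

# Reuses context_resemblance() from the sketch above.

def score_sim_no_merge(query, index, normalizer):
    """A sketch of the SIMNOMERGE scoring loop.

    query:      list of (context, term, weight) triples; a context is a tuple
                of element names such as ("play", "title").
    index:      dict mapping a document structural term (context, term) to a
                postings list of (doc_id, weight) pairs.
    normalizer: dict mapping doc_id to the Euclidean length of its vector of
                structural-term weights.
    (These data structures are assumptions made for this illustration.)
    """
    scores = defaultdict(float)
    for q_context, term, q_weight in query:
        for (d_context, d_term), postings in index.items():
            if d_term != term:
                continue
            cr = context_resemblance(q_context, d_context)
            if cr == 0.0:
                continue  # contexts are incompatible
            for doc_id, d_weight in postings:
                scores[doc_id] += cr * q_weight * d_weight
    # Length-normalize, as in the cosine measure.
    return {doc: s / normalizer[doc] for doc, s in scores.items()}


if __name__ == "__main__":
    index = {
        (("play", "title"), "macbeth"): [(1, 0.8)],
        (("play", "act", "scene", "title"), "macbeth"): [(2, 0.7)],
    }
    normalizer = {1: 1.0, 2: 1.0}
    query = [(("play", "title"), "macbeth", 1.0)]
    print(score_sim_no_merge(query, index, normalizer))
    # doc 1 scores 0.8 (exact context), doc 2 scores 0.42 (CR = 0.6)
```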
The query-document similarity function in Figure 3.22 is called SIMNOMERGE because different XML contexts are kept separate for the purpose of weighting. An alternative similarity function is SIMMERGE, which relaxes the matching conditions of query and document further in the following three ways:
We collect the statistics used for computing weight(q, t, c) and weight(d, t, c) from all contexts that have a non-zero resemblance to the context in question, rather than from the exactly matching context only.
We modify Eq. (3.22) by merging all structural terms in the document that have a non-zero context resemblance to a given query structural term. For example, the contexts /play/act/scene/title and /play/title in the document will be merged when matching against the query term /play/title#“Macbeth.”
The context resemblance function is further relaxed: contexts have a non-zero resemblance in many cases where the strict definition of CR given above would return 0.
These three changes alleviate the problem of sparse term statistics discussed in Section 3.2 and increase the robustness of the matching function against poorly posed structural queries. The evaluation of SIMNOMERGE and SIMMERGE in the next section shows that the relaxed matching conditions of SIMMERGE increase the effectiveness of XML retrieval.
The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments. A yearly INEX meeting is held to present and discuss research results. The INEX 2002 collection consisted of about 12,000 articles from IEEE journals. We give collection statistics in Table 3.4 and show part of the schema of the collection in Figure 3.22. The IEEE journal collection was expanded in 2005. Since 2006, INEX has used the much larger English Wikipedia as a test collection.
| | RDB search | Unstructured retrieval | Structured retrieval |
|---|---|---|---|
| Objects | Records | Unstructured documents | Trees with text at leaves |
| Model | Relational model | Vector space & others | ? |
| Main data structure | Table | Inverted index | ? |
| Queries | SQL | Free text queries | ? |

RDB (relational database) search, unstructured information retrieval, and structured information retrieval.
| Statistic | Value |
|---|---|
| Number of documents | 12,107 |
| Size | 494 MB |
| Time of publication of articles | 1995–2002 |
| Average number of XML nodes per document | 1,532 |
| Average depth of a node | 6.9 |
| Number of CAS topics | 30 |
| Number of CO topics | 30 |

INEX 2002 collection statistics.
Simplified schema of the documents in the INEX collection.
Two types of information needs or topics in INEX are content-only (CO) topics and content-and-structure (CAS) topics. CO topics are regular keyword queries as in unstructured information retrieval. CAS topics have structural constraints in addition to keywords. We already encountered an example of a CAS topic in Figure 3.14. The keywords in this case are summer and holidays, and the structural constraints specify that the keywords occur in a section that in turn is part of an article and that this article has an embedded year attribute with value 2001 or 2002.
Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval. INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. The component coverage dimension evaluates whether the element retrieved is structurally correct, that is, neither too low nor too high in the tree. We distinguish four cases:
Exact coverage (E). The information sought is the main topic of the component, and the component is a meaningful unit of information.
Too small (S). The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information.
Too large (L). The information sought is present in the component, but is not the main topic.
No coverage (N). The information sought is not a topic of the component.
The topical relevance dimension also has four levels: highly relevant (3), fairly relevant (2), marginally relevant (1), and nonrelevant (0). Components are judged on both dimensions, and the judgments are then combined into a digit-letter code. 2S is a fairly relevant component that is too small, and 3E is a highly relevant component that has exact coverage. In theory, there are 16 combinations of coverage and relevance, but many cannot occur. For example, a component that provides no coverage of the information sought cannot be highly relevant, so the combination 3N is not possible.
The relevance-coverage combinations are quantized to a single number between 0 and 1 by a quantization function Q.
This evaluation scheme takes into account the fact that binary relevance judgments, which are standard in unstructured information retrieval, are not appropriate for XML retrieval. A 2S component provides incomplete information and may be difficult to interpret without more context, but it does answer the query partially. The quantization function Q does not impose a binary choice relevant/nonrelevant and instead allows us to grade the component as partially relevant.
The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized relevance values of its members, $\#(\text{relevant items retrieved}) = \sum_{c \in A} Q(\mathrm{rel}(c), \mathrm{cov}(c))$.
As an approximation, the standard definitions of precision, recall, and F can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded rather than binary relevance assessments.
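A small sketch of graded precision and recall computed this way follows; the component identifiers and Q values are hypothetical.

```python
def graded_precision_recall(retrieved, quantized, total_relevant):
    """Precision and recall when relevance is graded by a quantization
    function Q, as described above.

    retrieved:       list of retrieved component ids.
    quantized:       dict mapping component id to its Q value in [0, 1]
                     (0 for components judged nonrelevant or unjudged).
    total_relevant:  sum of Q values over all relevant components in the
                     collection for this topic.
    (All identifiers and values here are hypothetical.)
    """
    relevant_retrieved = sum(quantized.get(c, 0.0) for c in retrieved)
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / total_relevant if total_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    q = {"c1": 1.0, "c2": 0.5, "c7": 0.25}
    print(graded_precision_recall(["c1", "c2", "c3", "c4"], q, total_relevant=1.75))
    # relevant items retrieved = 1.5 -> precision 0.375, recall ~0.857
```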
\nOne flaw of measuring relevance this way is that overlap is not accounted for. We discussed the concept of marginal relevance in the context of unstructured retrieval. This problem is worse in XML retrieval because of the problem of multiple nested elements occurring in a search result as we discussed on p. 80. Much of the recent focus at INEX has been on developing algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.
Table 3.5 shows two INEX 2002 runs of the vector space system we described in Section 3.7.3. The better run is the SIMMERGE run, which incorporates few structural constraints and mostly relies on keyword matching. SIMMERGE's median average precision (where the median is with respect to average precision numbers over topics) is only 0.147. Effectiveness in XML retrieval is often lower than in unstructured retrieval because XML retrieval is harder: instead of just finding a document, we have to find the subpart of a document that is most relevant to the query. Also, XML retrieval effectiveness, when evaluated as described here, can be lower than unstructured retrieval effectiveness on a standard evaluation because graded judgments lower measured performance. Consider a system that returns a document with graded relevance 0.6 and binary relevance 1 at the top of the retrieved list. Then interpolated precision at 0.00 recall is 1.0 on a binary evaluation but can be as low as 0.6 on a graded evaluation.
| Algorithm | Average precision |
|---|---|
| SIMNOMERGE | 0.242 |
| SIMMERGE | 0.271 |

INEX 2002 results of the vector space model in Section 3.7.3 for content-and-structure (CAS) queries and the quantization function Q.
Table 3.5 gives us a sense of the typical performance of XML retrieval, but it does not compare structured with unstructured retrieval. Table 3.6 directly shows the effect of using structure in retrieval. The results are for a language-model-based system that is evaluated on a subset of CAS topics from INEX 2003 and 2004. The evaluation metric is precision at k. The discretization function used for the evaluation maps highly relevant elements (roughly corresponding to the 3E elements defined for Q) to 1 and all other elements to 0. The content-only system treats queries and documents as unstructured bags of words. The full-structure model ranks elements that satisfy structural constraints higher than elements that do not. For instance, for the query in Figure 3.14, an element that contains the phrase summer holidays in a section will be rated higher than one that contains it in an abstract.
| | Content only | Full structure | Improvement |
|---|---|---|---|
| Precision at 5 | 0.2000 | 0.3265 | 63.3% |
| Precision at 10 | 0.1820 | 0.2531 | 39.1% |
| Precision at 20 | 0.1700 | 0.1796 | 5.6% |
| Precision at 30 | 0.1527 | 0.1531 | 0.3% |
The table shows that structure helps to increase precision at the top of the results list: there is a large increase of precision at k = 5 (63.3%), but almost none at k = 30 (0.3%).
This chapter focuses on two major problems in machine learning: classification and clustering. Classification trains a learning machine (i.e., fits a certain objective function) on samples with known class labels so that it can classify unknown samples; it belongs to supervised learning. Clustering, in contrast, knows nothing about the categories of the samples in advance and aims to divide a set of samples of unknown categories into several groups by some algorithm. When we use clustering, we do not care what each group is; the goal is simply to bring similar things together, which in machine learning is called unsupervised learning.
When we use a classifier to predict, we encounter a very important question: how to evaluate the predictive performance of this classifier. The evaluation of classifier performance is the basis for choosing a good classifier. Traditional performance measures, such as accuracy, recall, sensitivity, and specificity, cannot fully characterize a classifier on their own. In addition to these traditional criteria, this book therefore also refers to the confusion matrix, the area under the curve (AUC), and the F-Score.
An instance can be classified as positive or negative, which gives four possible classification results:
True positive (TP): a positive instance that is correctly classified as positive.
True negative (TN): a negative instance that is correctly classified as negative.
False positive (FP): a negative instance that is wrongly classified as positive.
False negative (FN): a positive instance that is wrongly classified as negative.
The formula for precision is
$$P = \frac{TP}{TP + FP}.$$
The formula for recall is
$$R = \frac{TP}{TP + FN}.$$
Sensitivity, also known as the true positive rate (TPR), is identical to recall:
$$TPR = \frac{TP}{TP + FN}.$$
Specificity, also known as the true negative rate (TNR), is
$$TNR = \frac{TN}{TN + FP}.$$
Area under curve (AUC) is defined as the area under the ROC (receiver operating characteristic) curve. Its value lies between 0 and 1, and larger values indicate better classification performance.
The confusion matrix is used to summarize the results of supervised classification. The entries on the main diagonal give the numbers of correctly classified instances of each category, and the off-diagonal entries give the numbers of classification errors. As shown in Table 4.1, there are two different types of errors: “false acceptance” and “false rejection” [100]. If the confusion matrix of a two-class problem is normalized, it is a joint probability distribution for a discrete variable with binary values 0 and 1. For the two-class problem, the confusion matrix can be expressed as in Table 4.1.
| Category | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | Accepted correctly (TP) | Wrongly rejected (FN) |
| Actual negative | Wrongly accepted (FP) | Rejected correctly (TN) |

The confusion matrix.
F-Score, also known as F-Measure, is the weighted harmonic mean of precision and recall and is commonly used to evaluate whether a classification model is good or bad. In the F-Score function, the parameter β controls the relative weight given to recall versus precision:
$$F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}.$$
When β = 1, this reduces to the familiar F1 score, the harmonic mean of precision and recall.
It is generally believed that the higher the F-Score is, the better the classifier performs on the positive samples. It should be noted that the choice of β determines whether precision or recall is weighted more heavily.
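The following sketch computes these measures directly from the four counts defined above; the example confusion matrix values are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Precision, recall/sensitivity, specificity, and F-score from the four
    counts defined above (a direct transcription of the formulas)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # = sensitivity (TPR)
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f_score": f_score}


if __name__ == "__main__":
    # Hypothetical confusion matrix: 40 TP, 10 FP, 20 FN, 30 TN.
    print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
    # precision 0.8, recall ~0.667, specificity 0.75, F1 ~0.727
```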
The decision tree is an important classification method in data mining. Among the various classification algorithms, the decision tree is the most intuitive one. A decision tree is a decision-analysis method that evaluates project risk and determines feasibility, and it is a graphical method for intuitive probabilistic analysis. Because the decision branches are drawn like the branches of a tree, it is called a decision tree. In machine learning, a decision tree is a predictive model that represents a mapping between attributes and values [102].
The classification decision tree model is a tree structure that describes the classification of instances. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class [103]. To classify an instance, the decision tree starts from the root node, tests a particular feature of the instance, and assigns the instance to one of the node's children according to the test result; each child node corresponds to one value of the feature. The instance is then recursively tested and assigned until it reaches a leaf node, and it is finally assigned to the class of that leaf node. Figure 4.1 shows a decision tree; the circles and boxes represent internal nodes and leaf nodes, respectively.
\nDecision tree model.
Feature selection chooses the features with the strongest classification ability for the training data, which can improve the efficiency of decision tree learning. Information gain can be used as one of the criteria for feature selection [104]. First, we need to introduce the concept of entropy. In probability and statistics, entropy is a measure of the uncertainty of a random variable. Suppose X is a discrete random variable with probability distribution
$$P(X = x_i) = p_i, \quad i = 1, 2, \dots, n.$$
The entropy of the random variable X is then defined as
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i.$$
For a pair of random variables X and Y with joint distribution $P(X = x_i, Y = y_j) = p_{ij}$, we can ask how much uncertainty remains about Y once X is known.
The conditional entropy H(Y | X) represents the uncertainty of the random variable Y given the random variable X and is defined as
$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i),$$
where $p_i = P(X = x_i)$, $i = 1, 2, \dots, n$.
Information gain: The information gain g(D, A) of a feature A with respect to a training data set D is defined as the difference between the entropy H(D) of the set D and the conditional entropy H(D | A) of D given A:
$$g(D, A) = H(D) - H(D \mid A).$$
According to the information gain criterion, the feature selection method is to compute the information gain of each feature with respect to the training data set (or subset) D, compare the resulting values, and select the feature with the largest information gain.
ID3 algorithm: The core idea is to apply the information gain criterion to select a feature at each node of the decision tree and to construct the decision tree recursively [106]. The concrete method is to start from the root node, compute the information gain of all possible features for that node, select the feature with the largest information gain as the feature of the node, and create child nodes for the different values of that feature. The child nodes then recursively call the above method to build the decision tree, until the information gain of every remaining feature is very small or no features are left, finally yielding a decision tree [107].
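A minimal sketch of information-gain-based feature selection, as used by ID3, follows; the toy data set is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """g(D, A) = H(D) - H(D|A) for a categorical feature index."""
    total = entropy(labels)
    n = len(labels)
    cond = 0.0
    for v in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == v]
        cond += len(subset) / n * entropy(subset)
    return total - cond

def best_feature(rows, labels):
    """ID3-style selection: the feature with the largest information gain."""
    return max(range(len(rows[0])), key=lambda f: information_gain(rows, labels, f))


if __name__ == "__main__":
    # Hypothetical toy data: features (outlook, windy), label play yes/no.
    rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
    labels = ["no", "no", "yes", "yes"]
    print(best_feature(rows, labels))   # 0: the outlook feature separates the classes
```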
In decision tree learning, the process of simplifying the generated tree is called pruning.
Pruning algorithm:
Input: the entire tree T produced by the tree-generation algorithm and the parameter of the loss function.
Output: the pruned subtree.
Calculate the empirical entropy of each node;
Recursively retract upward from the leaf nodes of the tree. Suppose the whole tree before and after a group of leaf nodes is retracted to its parent node is $T_B$ and $T_A$, respectively, with corresponding loss function values $C(T_B)$ and $C(T_A)$. If $C(T_A) \le C(T_B)$, prune, that is, turn the parent node into a new leaf node;
Return to step (2) until no further pruning is possible; the subtree with the smallest loss function is obtained [108].
The Bayesian classification algorithm, which uses probability and statistics to classify, is a statistical classification method. Naive Bayes (NB) is one of the simplest and most widely used Bayesian classification algorithms.
Bayes' theorem is
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$$
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayesian classification is a very simple classification algorithm. The idea of naive Bayes is to compute, for the item to be classified, the probability of each category given the item's features and then to assign the item to the category with the highest probability. The naive Bayesian classification model is depicted in Figure 4.2.
\nNaive Bayesian algorithm model.
Naive Bayes has been studied extensively since the 1950s. It was introduced under a different name into the text retrieval community in the early 1960s [109] and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate preprocessing, it is competitive in this domain with more advanced methods including support vector machines [110]. It also finds application in automatic medical diagnosis [111].
\nNaive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable [112]. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
Probabilistic model: Abstractly, naive Bayes is a conditional probability model. Given a problem instance to be classified, represented by a vector $\mathbf{x} = (x_1, \dots, x_n)$ of n feature values, it assigns to this instance the conditional probabilities
$$p(C_k \mid x_1, \dots, x_n)$$
for each of K possible outcomes or classes $C_k$.
The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model using Bayes' theorem to make it more tractable:
$$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}.$$
In plain English, using Bayesian probability terminology, the above equation can be written as
$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}.$$
In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $x_i$ are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model $p(C_k, x_1, \dots, x_n)$,
which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:
$$p(C_k, x_1, \dots, x_n) = p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2 \mid x_3, \dots, x_n, C_k) \cdots p(x_n \mid C_k)\, p(C_k).$$
Now the "naive" conditional independence assumptions come into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ (for $j \neq i$) given the class $C_k$. This means that
$$p(x_i \mid x_{i+1}, \dots, x_n, C_k) = p(x_i \mid C_k).$$
Thus, the joint model can be expressed as
$$p(C_k, x_1, \dots, x_n) = p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k).$$
This means that, under the above independence assumptions, the conditional distribution over the class variable $C$ is
$$p(C_k \mid x_1, \dots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),$$
where the evidence $Z = p(\mathbf{x}) = \sum_{k} p(C_k)\, p(\mathbf{x} \mid C_k)$ is a scaling factor that depends only on $x_1, \dots, x_n$, that is, a constant if the values of the features are known.
In the naive Bayesian algorithm, learning means estimating the prior probabilities $P(Y = c_k)$ and the conditional probabilities $P(X^{(j)} = a_{jl} \mid Y = c_k)$ from the training data, where $Y$ denotes the class and $X^{(j)}$ the jth feature.
Suppose the set of possible values of the jth feature is $\{a_{j1}, a_{j2}, \dots, a_{jS_j}\}$. To avoid zero probability estimates, the smoothed (Bayesian) estimate of the conditional probability is
$$P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I\big(x_i^{(j)} = a_{jl},\, y_i = c_k\big) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda},$$
where $I(\cdot)$ is the indicator function, $N$ is the number of training instances, and $\lambda \ge 0$; taking $\lambda = 1$ gives the commonly used Laplace smoothing.
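The sketch below implements a small categorical naive Bayes classifier with Laplace smoothing along the lines described above; the class name and the toy fruit data are illustrative assumptions, not part of the original text.

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """A minimal categorical naive Bayes classifier with Laplace smoothing
    (lambda = 1 by default), sketched for illustration."""

    def fit(self, rows, labels, lam=1.0):
        self.lam = lam
        self.n = len(labels)
        self.class_counts = Counter(labels)
        self.feature_counts = Counter()      # (class, feature index, value) -> count
        self.values = defaultdict(set)       # feature index -> observed values
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.feature_counts[(y, j, v)] += 1
                self.values[j].add(v)
        return self

    def predict(self, row):
        best_class, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            score = math.log(nc / self.n)    # log prior
            for j, v in enumerate(row):
                num = self.feature_counts[(c, j, v)] + self.lam
                den = nc + self.lam * len(self.values[j])
                score += math.log(num / den) # log smoothed conditional probability
            if score > best_score:
                best_class, best_score = c, score
        return best_class


if __name__ == "__main__":
    # Hypothetical fruit data: features (color, shape).
    rows = [("red", "round"), ("red", "round"), ("green", "long"), ("yellow", "long")]
    labels = ["apple", "apple", "banana", "banana"]
    clf = CategoricalNaiveBayes().fit(rows, labels)
    print(clf.predict(("red", "round")))     # apple
```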
The support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data for classification and regression analysis [113]. Given a set of training instances, each marked as belonging to one of two categories, the SVM training algorithm builds a model that assigns new instances to one category or the other, making it a binary linear classifier.
We consider a training set of points of the form
$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n),$$
where each $y_i$ is either 1 or −1, indicating the class to which the point $\mathbf{x}_i$ belongs, and each $\mathbf{x}_i$ is a p-dimensional real vector. Any separating hyperplane can be written as the set of points $\mathbf{x}$ satisfying $\mathbf{w} \cdot \mathbf{x} - b = 0$,
where $\mathbf{w}$ is the normal vector to the hyperplane and $b$ determines its offset from the origin along $\mathbf{w}$.
Hard margin: If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin," and the maximum-margin hyperplane is the hyperplane that lies halfway between them [115]. These hyperplanes can be described by the equation
$$\mathbf{w} \cdot \mathbf{x} - b = 1$$
or
$$\mathbf{w} \cdot \mathbf{x} - b = -1.$$
Geometrically, the distance between these two hyperplanes is $\frac{2}{\lVert \mathbf{w} \rVert}$, so to maximize the distance between the two planes we need to minimize $\lVert \mathbf{w} \rVert$. At the same time, to keep the sample points out of the margin, we need to ensure that every point satisfies one of the following conditions:
$$\mathbf{w} \cdot \mathbf{x}_i - b \ge 1 \quad \text{if } y_i = 1,$$
or
$$\mathbf{w} \cdot \mathbf{x}_i - b \le -1 \quad \text{if } y_i = -1.$$
These constraints can be written compactly as $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1$ for all $1 \le i \le n$.
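As an illustration of the hard-margin formulation, the sketch below fits a linear SVM with scikit-learn (assumed to be installed) on a small separable data set and inspects the margin values y_i(w · x_i - b); using a very large C to approximate a hard margin and the toy data set are choices made for this example only.

```python
# A minimal illustration of the hard-margin formulation using scikit-learn
# (assumed to be installed); a very large C approximates a hard margin.
from sklearn import svm

# Two linearly separable classes in the plane, labels +1 / -1 (toy data).
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
y = [-1, -1, -1, 1, 1, 1]

clf = svm.SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]           # normal vector of the separating hyperplane
b = -clf.intercept_[0]     # offset, so the hyperplane is w . x - b = 0
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)

# Margin values y_i (w . x_i - b): approximately 1 for support vectors,
# larger than 1 for the remaining training points.
for xi, yi in zip(X, y):
    print(yi * (w[0] * xi[0] + w[1] * xi[1] - b))
```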