Automatic Compilation of Travel Information from Texts: A Survey

Travel guidebooks and portal sites provided by tour companies and governmental tourist boards are useful sources of information about travel. However, it is costly and time-consuming to compile travel information for all tourist spots and to keep these data up-to-date manually. Recently, research about services for the automatic compilation and recommendation of travel information has been increasing in various research communities, such as natural language processing, image processing, Web mining, geographic information systems (GISs), and human interfaces. In this chapter, we overview the state of the art of the research and several related services in this field. We especially focus on research in natural language processing, including text mining.


Introduction
Travel guidebooks and portal sites provided by tour companies and governmental tourist boards are useful sources of information about travel.
However, it is costly and time-consuming to compile travel information for all tourist spots and to keep these data up-to-date manually. Recently, research about services for the automatic compilation and recommendation of travel information has been increasing in various research communities, such as natural language processing, image processing, Web mining, geographic information systems (GISs), and human interfaces. In this chapter, we overview the state of the art of the research and several related services in this field. We especially focus on research in natural language processing, including text mining.
The remainder of this chapter is organized as follows. Section 2 explains the automatic construction of databases for travel. Section 3 describes analysis of travelers' behavior. Section 4 introduces several studies about recommending travel information. Section 5 shows interfaces for travel information access. Section 6 lists several linguistic resources. Finally, we provide our conclusions and offer future directions in Section 7.

Automatic construction of databases for travel
In this section, we describe several studies about constructing databases for travel. In Section 2.1, we introduce a study that identified travel blog entries in a blog database. In Section 2.2, we describe several methods to construct databases for travel by extracting travel information, such as tourist spots or local products, from travel blog entries using information extraction techniques. In Section 2.3, we explain a method that constructs travel links automatically.

Automatic identification of travel blog entries
Travel blogs are defined as travel journals written by bloggers in diary form. Travel blogs are considered useful for obtaining travel information, because many bloggers' travel experiences are written in this form.
There are various portal sites for travel blogs, which we describe in Section 6. At these sites, travel blogs are registered manually by the bloggers themselves, and the blogs are classified according to travel destination. However, there are many more travel blogs in the blogosphere beyond these portal sites. In an attempt to construct an exhaustive database of travel blogs, Nanba et al. [25] identified travel blog entries written in Japanese in a blog database. Blog entries that contain cue phrases, such as "travel", "sightseeing", or "tour", are highly likely to be travel blog entries. However, not every travel blog entry contains such cue phrases. For example, if a blogger describes a journey to Norway in multiple blog entries, the first entry might state "We traveled to Norway", while the second says only "We ate wild sheep!". In this case, because the second entry does not contain any expressions related to travel, it is difficult to identify it as a travel blog entry. Therefore, Nanba et al. considered not only each blog entry but also the surrounding entries when identifying travel blog entries. They formulated the identification of travel blog entries as a sequence-labeling problem and solved it using machine learning. For the machine learning method, they examined Conditional Random Fields (CRF) [20], whose empirical success has been reported in the field of natural language processing. The CRF-based method identifies the tag of each entry. Features and tags are given in the CRF method as follows: (1) the k tags preceding a target entry; (2) the k features preceding a target entry; and (3) the k features following a target entry (see Figure 1). They used k = 4, which was determined in a pilot study.
Here, they used the following features for machine learning: whether an entry contains any of 416 cue phrases, such as "旅行" (travel), "ツアー" (tour), and "出発" (departure), and the number of location names in each entry.
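The feature layout described above can be sketched as follows. This is a minimal illustration rather than the implementation of Nanba et al.: the English cue phrases, the small location list, and the function names are assumptions made for the example, and the dependence on the k preceding tags is left to the CRF learner itself, so it is not shown.

```python
from typing import Dict, List

# Hypothetical stand-ins for the 416 Japanese cue phrases and a location gazetteer.
CUE_PHRASES = {"travel", "sightseeing", "tour", "departure"}
LOCATION_NAMES = {"Norway", "Oslo", "Bergen"}


def entry_features(text: str) -> Dict[str, object]:
    """Features of one entry: cue-phrase presence and number of location names."""
    lowered = text.lower()
    return {
        "has_cue": any(cue in lowered for cue in CUE_PHRASES),
        "n_locations": sum(text.count(name) for name in LOCATION_NAMES),
    }


def windowed_features(entries: List[str], k: int = 4) -> List[Dict[str, object]]:
    """Give each entry its own features plus those of up to k entries on each side."""
    base = [entry_features(entry) for entry in entries]
    rows = []
    for i in range(len(entries)):
        row: Dict[str, object] = {}
        for offset in range(-k, k + 1):
            j = i + offset
            if 0 <= j < len(entries):
                for name, value in base[j].items():
                    # Keys such as "-1:has_cue" record the relative position.
                    row[f"{offset:+d}:{name}"] = value
        rows.append(row)
    return rows


blog = ["We traveled to Norway.", "We ate wild sheep!"]
features = windowed_features(blog)
```

In this toy blog, the second entry has no cue phrase of its own, but its feature row still carries the cue phrase and location name of the first entry, which is the information a sequence labeler needs to tag it as a travel blog entry.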
Using the above method, Nanba et al. identified 17,268 travel blog entries among 1,100,000 blog entries, and constructed a system that plots travel blog entries on a Google map (see Figure 2). In this figure, travel blog entries are shown as icons. If the user clicks an icon, the corresponding blog entry is shown in a pop-up window.

Automatic extraction of travel information from texts
Nakatoh et al. [24] proposed a method for extracting names of local culinary dishes from travel blogs written in Japanese. A blog entry was treated as a travel blog entry when it included both the name of a sightseeing destination and the word "tourism". They extracted local dishes by gathering nouns that depend on the verb "食べる" (eat). Tsai and Chou [32] also proposed a method for extracting dish names from restaurant review blogs written in Chinese using a machine learning (CRF) technique. In the following, we explain the details of the bootstrapping-based and machine learning-based information extraction approaches, based on the work of Nanba et al. [25]. Nanba et al. extracted pairs comprising a location name and a local product from travel blogs written in Japanese, which were identified using the method described in Section 2.1. For the efficient extraction of travel information, they employed a bootstrapping method.
First, they prepared 482 pairs as seeds for the bootstrapping. These pairs were obtained automatically from the "Web Japanese N-gram" database provided by Google, Inc. The database comprises N-grams (N = 1 to 7) extracted from 20 billion Japanese sentences on the Web. They applied the pattern "[地名] 名物 [名物]" ([slot of "location name"] local product [slot of "local product"]) to the database, and extracted location names and local products from each corresponding slot, thereby obtaining the 482 pairs. Second, they applied a machine learning-based information extraction technique to the travel blogs identified in the previous step, and obtained new pairs. In this step, they prepared training data for the machine learning in the following three steps.
1. Select 200 sentences that contain both a location name and a local product from the 482 pairs. Then automatically create 200 tagged sentences, to which both "location" and "product" tags are assigned.
2. Prepare another 200 sentences that contain only a location name. Then create 200 tagged sentences, to which the "location" tag is assigned.
3. Apply machine learning to the 400 tagged sentences, and obtain a system that automatically allocates "location" and "product" tags to given sentences.
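The seed collection and the automatic tagging in steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: an English pattern "<location> specialty <product>" stands in for the Japanese pattern, and the gazetteer, the inline tag format, and the function names are hypothetical rather than taken from the original work.

```python
import re
from typing import List, Set, Tuple

LOCATIONS = {"Hiroshima", "Osaka"}  # hypothetical gazetteer of location names

# English stand-in for the Japanese seed pattern: "<location> specialty <product>".
SEED_PATTERN = re.compile(r"^(\w+) specialty (\w+)$")


def extract_seeds(ngrams: List[str]) -> Set[Tuple[str, str]]:
    """Collect (location, product) pairs from N-grams that match the pattern."""
    seeds = set()
    for ngram in ngrams:
        match = SEED_PATTERN.match(ngram)
        if match and match.group(1) in LOCATIONS:
            seeds.add((match.group(1), match.group(2)))
    return seeds


def tag_sentence(sentence: str, seeds: Set[Tuple[str, str]]) -> str:
    """Wrap known locations and seed products in tags to create training data."""
    for location in LOCATIONS:
        sentence = re.sub(
            rf"\b{re.escape(location)}\b",
            f"<location>{location}</location>",
            sentence,
        )
    for _, product in seeds:
        sentence = re.sub(
            rf"\b{re.escape(product)}\b",
            f"<product>{product}</product>",
            sentence,
        )
    return sentence


ngrams = ["Hiroshima specialty okonomiyaki", "Osaka specialty takoyaki", "very special day"]
seeds = extract_seeds(ngrams)
tagged = tag_sentence("We ate okonomiyaki in Hiroshima.", seeds)
```

A sentence that mentions only a location receives only the "location" tag, which corresponds to the second group of 200 training sentences.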
As a machine learning method, they used CRF. The CRF-based method identifies the class of each word in a given sentence. Features and tags are given in the CRF method as follows: (1) k tags occur before a target word; (2) k features occur before a target word; and (3) k features follow a target word. They used the value of k = 2, which was determined in a pilot study. They used the following six features for machine learning.
• The part of speech to which the word belongs (noun, verb, adjective, etc.).
• Whether the word is a quotation mark.
• Whether the word is a surface case.
• Whether the word is frequently used in the names of local products or souvenirs, such as "cake" or "noodle".

Automatic compilation of travel links
Collections of Web links are useful information sources. However, maintaining these collections manually is costly. Therefore, an automatic method for compiling collections of Web links is required. In this section, we introduce a method that compiles travel links automatically.
From travel blog entries that were identified automatically using the method described in Section 2.1, Ishino et al. [15] extracted the hyperlinks that bloggers had included to Web sites useful for a tourist spot, and thereby constructed collections of hyperlinks for tourist spots. The procedure for classifying links in travel blog entries is as follows.
1. Input a travel blog entry.
2. Extract a hyperlink and any surrounding sentences that mention the link (a citing area).
3. Classify the link by taking account of the information in the citing area.
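The three steps above can be sketched as follows. This is a simplified illustration, not the method of Ishino et al.: it assumes plain-text entries with bare URLs, takes the sentence containing the link plus one neighbor on each side as the citing area, and replaces the machine-learned classifier with a hypothetical cue-word lookup. The type labels S, H, R, and O follow the categories introduced just below.

```python
import re
from typing import List, Set, Tuple

URL_PATTERN = re.compile(r"https?://\S+")
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

# Hypothetical English cue words per link type; the original cues are Japanese.
CUES = {
    "S": {"museum", "zoo"},
    "H": {"hotel", "inn"},
    "R": {"ramen", "restaurant"},
}


def citing_areas(entry: str, window: int = 1) -> List[Tuple[str, str]]:
    """Return (url, citing area) pairs for every hyperlink in a blog entry."""
    sentences = SENTENCE_SPLIT.split(entry.strip())
    results = []
    for i, sentence in enumerate(sentences):
        for url in URL_PATTERN.findall(sentence):
            url = url.rstrip(".,;)")  # drop trailing punctuation
            start = max(0, i - window)
            area = " ".join(sentences[start:i + window + 1])
            results.append((url, area))
    return results


def classify_link(area: str) -> Set[str]:
    """Assign every type whose cue words occur in the citing area, else O."""
    words = set(re.findall(r"[a-z]+", area.lower()))
    types = {link_type for link_type, cues in CUES.items() if words & cues}
    return types or {"O"}


entry = (
    "We visited the noodle museum. "
    "Details are at http://www.raumen.co.jp/home/ for anyone interested. "
    "The ramen was great."
)
links = citing_areas(entry)
link_types = classify_link(links[0][1])
```

Returning a set of types rather than a single label reflects the fact that one hyperlink may belong to more than one category.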
They classified link types into the following four categories.
• S (Spot): The information is about tourist spots.
• H (Hotel): The information is about accommodation.
• R (Restaurant): The information is about restaurants.
• O (Other): Other than types S, H, and R.
A hyperlink may be classified as more than one type. For example, a hyperlink to "ラーメン博物館" (Chinese noodle museum, http://www.raumen.co.jp/home/) was classified as types S and R, because visitors to this museum can learn the history of Chinese noodles in addition to eating them.
For the classification of link types, they employed a machine learning technique using the following features.
• A word.
• Whether the word is a cue phrase, detailed as follows, where the numbers in brackets shown for each feature represent the number of cues.

Cue phrases:
• A list of tourist spots, collected from Wikipedia (17,371).
• Words frequently used in the names of tourist spots, such as "動物園" (zoo) or "博物館" (museum).
• Words frequently used in the names of hotels, such as "ホテル" (hotel) or "旅館" (Japanese inn) (9).
• Component words for accommodations, such as "フロント" (front desk) or "客室" (guest room) (14).
• Other words (21).
• Dish names such as "omelet", collected from Wikipedia.

Based on this method, Ishino et al. constructed a travel link search system. The system generated a list of URLs for Web sites related to a location, and automatically identified link types and the context of citations ("citing areas"), where the blog authors described the sites. Figure 3 shows a list of links related to "大阪" (Osaka).

Figure 3. A list of Web sites for a travel spot

Travelers' behavior analysis
The analysis of people's transportation information is considered an important issue in various fields, such as city planning, architectural planning, car navigation, sightseeing administration, crime prevention, and tracing the spread of epidemics. In this section, we focus on the analysis of travelers' behavior.
Ishino et al. [15] proposed a method to extract people's transportation information from automatically identified travel blogs written in Japanese [25]. They used machine learning to extract information, such as "departure place", "destination", or "transportation device", from travel blog entries. First, the tags used in their examination are defined.
• FROM tag indicates the departure place.
• TO tag indicates the destination.
• VIA tag indicates the route.
• METHOD tag indicates the transportation device.
• TIME tag indicates the time of transportation.
The following is a tagged example.
They formulated the task as identifying the class of each word in a given sentence, and solved it using machine learning. For the machine learning method, they used CRF [20], in the same way as Nanba et al. [25], which we mentioned in Section 2.2. The CRF-based method identifies the class of each entry. Features and tags are used in the CRF method as follows: (1) the k tags preceding a target entry; (2) the k features preceding a target entry; and (3) the k features following a target entry. They used k = 4, which was determined in a pilot study. They used the following features for machine learning.
• A word.
• The part of speech to which the word belongs (noun, verb, adjective, etc.).
• Whether the word is a quotation mark.
• Whether the word is a cue phrase.
The details of cue phrases, together with the number of cue phrases of the given type, are shown as follows.

FROM:
The word is a cue that often appears immediately after the FROM tag, such as "から" (from) or "を出発" (left): 40.

FROM & TO:
The word is frequently used in the name of a tourist spot, such as "博物館" (museum) or "遊園地" (amusement park): 45. The word is frequently used in the name of a destination, such as "観光" (sightseeing tour) or "駅" (station): 11. The word is the name of a tourist spot: 13,779. The word is the name of a station or airport: 9,437.

TO:
The word is a cue that often appears immediately after the TO tag, such as "まで" (to) or "に到着" (arrival): 271.

VIA:
The word is a cue that often appears immediately after the VIA tag, such as "経由" (via) or "通って" (through): 43. The word is the name of a highway: 101.

METHOD:
The word is the name of a transportation device, such as "飛行機" (airplane) or "自動車" (car): 148. The word is the name of a vehicle: 128. The word is the name of a train or bus: 2,033.
They also constructed a visualization of transportation information, which is shown in Figure 4. In this figure, each arrow indicates a link from a departure place to a destination. In addition to arrows, transportation methods, such as trains or buses, are shown as icons.
Transportation information can also be extracted from texts written in English. Davidov [6] presented an algorithm framework that enables automated acquisition of map-link information from the Web, based on linguistic patterns such as "from X to". Given a set of locations as initial seeds, he retrieved an extended set of locations from the Web, and produced a map-link network that connected these locations using edges showing the transportation type.
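The idea of pattern-based map-link extraction can be illustrated with a small sketch. It is much simpler than Davidov's framework, which starts from seed locations and retrieves text from the Web: here a single regular expression for "from X to Y", optionally followed by "by <transport>", is applied to a given text. The regular expression, the restriction to capitalized place names, and the "unknown" label are assumptions made for this example.

```python
import re
from typing import List, Tuple

PLACE = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"  # one or more capitalized words

# "from X to Y", optionally followed by "by <transport>".
LINK_PATTERN = re.compile(rf"\bfrom ({PLACE}) to ({PLACE})(?: by (\w+))?")


def extract_links(text: str) -> List[Tuple[str, str, str]]:
    """Return (departure, destination, transport) edges found in the text."""
    return [
        (m.group(1), m.group(2), m.group(3) or "unknown")
        for m in LINK_PATTERN.finditer(text)
    ]


text = "We went from Oslo to Bergen by train. Then we drove from Bergen to Stavanger."
edges = extract_links(text)
```

Each extracted triple corresponds to one edge of a map-link network, with the transportation type as the edge label.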

Recommending travel information
Recommendation systems provide a promising approach to ranking commercial products or documents according to a user's interests. In this section, we describe several studies and services that recommend travel information. We describe the recommendation of tourist spots, landmarks, travel products, accommodation, and photos.

Recommending tourist spots
Recommending tourist spots has been well studied in the multimedia field. Movies and images are used as information sources in addition to texts. In this section, we describe two multimedia studies.
Hao et al. [10] proposed a method for mining location-representative knowledge from travel blogs based on a probabilistic topic model (the Location-Topic model). Figure 5 shows an example of the system output. In this figure, a travel blog segment is enriched with three images that depict its most informative parts. Each image's original tags and the words in the text to which it corresponds are also presented.
Wu et al. [34] proposed a system that summarized tourism-related information. When a user (traveler) entered a query, such as "What is the historical background of Tian Tan?", the system searched for and obtained information from Wikipedia, Flickr, YouTube, and official tourism Web sites using the tourist spot name as a query. The system also classified the query as belonging to one of five categories ("general", "history", "landscape", "indoor scenery", and "outdoor scenery") in order to provide users with more relevant information. For example, when a query is classified as belonging to the "history" category, the information is obtained from texts, while for a query regarding "outdoor scenery", the information is obtained from photos and videos.

Recommending landmarks
Finding and recommending landmarks is considered an important research topic in the multimedia field, along with recommending tourist spots. Abbasi et al. [1] focused on the photo-sharing system Flickr, and proposed a method to identify landmark photos using tags and social Flickr groups. Gao et al. [7] also proposed a method to identify landmarks using Flickr and the Yahoo Travel Guide.
Ji et al. [17] proposed another method for finding landmarks. They clustered blog photos relating to a particular tourist site, such as the Louvre Museum in Paris (footnote 10). Then they represented these photos as a graph based on the clustering results, and detected landmarks using link analysis methods, such as the PageRank [3] and HITS [19] algorithms.
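As an illustration of link analysis on a photo graph, the following sketch runs a standard power-iteration PageRank. It is not the system of Ji et al.: the photo names, the edges (which would come from the clustering results), the damping factor of 0.85, and the number of iterations are all assumptions made for this example.

```python
from typing import Dict, List


def pagerank(
    graph: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Power-iteration PageRank; `graph` maps each node to the nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for neighbor in neighbors:
                    new_rank[neighbor] += share
            else:
                # A node without outgoing links spreads its rank evenly.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical photo graph: an edge points to a visually similar photo.
photo_graph = {
    "pyramid_1": ["pyramid_2", "pyramid_3"],
    "pyramid_2": ["pyramid_1", "pyramid_3"],
    "pyramid_3": ["pyramid_1", "pyramid_2"],
    "cafe_1": ["pyramid_3"],
}
ranks = pagerank(photo_graph)
top_photo = max(ranks, key=ranks.get)
```

Photos that many other photos point to receive high scores, so the highest-ranked photo is a natural candidate for a representative landmark view.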

Recommending travel products
Ishino et al. [14] proposed a method that added links to advertisements for travel products to the travel information links that were described in Section 2.3 (footnote 11). The procedure for providing ad links is as follows.
1. Input a link type and the citing areas of a travel information link.
2. Extract keywords from the citing areas.
3. Extract product data containing all keywords, and calculate the similarity between the citing areas of a travel information link and the product data.
4. Provide the ad link to the product data having the highest similarity to the travel information link.
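Steps 3 and 4 of this procedure can be sketched as follows. The source does not specify the similarity measure here, so this example assumes cosine similarity over bag-of-words vectors with whitespace tokenization; the product identifiers and descriptions are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_ad(citing_area: str, keywords: List[str], products: Dict[str, str]) -> Optional[str]:
    """Return the product that contains all keywords and is most similar to the citing area."""
    area_vector = Counter(citing_area.lower().split())
    best_id, best_score = None, -1.0
    for product_id, description in products.items():
        tokens = description.lower().split()
        if not all(keyword.lower() in tokens for keyword in keywords):
            continue  # step 3: keep only product data containing all keywords
        score = cosine(area_vector, Counter(tokens))
        if score > best_score:
            best_id, best_score = product_id, score
    return best_id


products = {
    "tour-a": "yokohama museum and ramen tasting tour",
    "tour-b": "yokohama museum night tour",
    "tour-c": "osaka castle tour",
}
area = "we enjoyed ramen at the museum in yokohama"
chosen = best_ad(area, ["museum", "yokohama"], products)
```

When no product contains all keywords, the function returns None, in which case no ad link would be attached to the travel information link.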
They extracted keywords for travel products corresponding to the link type. They used the same cues as those used to classify travel information links [15] (see Section 2.3), and then extracted keywords from the citing areas of links of types S (Spot) and R (Restaurant).
First, the method for extracting keywords from the citing areas of links of type S is described. The cues for type S, such as tourist spots collected from Wikipedia and words frequently used in the names of tourist spots, tend to become keywords. Therefore, they registered these cues as candidate keywords for links of type S. If the citing areas of these links contained candidate keywords, they extracted the candidates as keywords. In addition, if citing areas contained names of places, they extracted the names as keywords.
The cues for type R, such as dish names and cooking styles, also tend to become keywords. Therefore, they registered these cues as candidate keywords for links of type R. If the citing areas for links of type R contained candidate keywords, they extracted them as keywords.
Footnote 10: For calculating the similarity between two photos, Ji et al. [17] used the Bag-of-Visual-Words representation [18,26], which represents an image as a set of salient regions (visual words) encoded as a Bag-of-Visual-Words vector. The similarity between photos is then measured based on the cosine distance between their Bag-of-Visual-Words vectors. In addition to the features in each image, they also used textual information for each photo, such as the title, description, and surrounding text.
Footnote 11: http://www.ls.info.hiroshima-cu.ac.jp/travel/

Recommending accommodation
Titov and McDonald [31] proposed an aspect-based summarization system, and applied it to the summarization of hotel reviews. The system took as input a set of user reviews for a specific product or service with a numeric rating (left side in Figure 6), and produced a set of relevant aspects, which they called an aspect-based summary (right side in Figure 6). To extract all relevant mentions in each review for each aspect, they introduced a topic model. They applied their method to hotel reviews on the TripAdvisor Web site, and obtained aspect-based summaries for each hotel.
To obtain more reliable hotel reviews, opinion spam should be detected and eliminated. Opinion spam consists of fictitious opinions that have been deliberately written to sound authentic. Ott et al. [27] proposed a method to detect opinion spam among consumer reviews of hotels. They created 400 deceptive opinions using the Amazon Mechanical Turk (AMT) crowdsourcing service by asking anonymous online workers (Turkers) to write opinion spam for 20 chosen hotels. In addition to these spam messages, they selected 6,977 truthful opinions from TripAdvisor, and used both groups for their task.

Recommending photos
Bressan et al. [2] proposed a travel blog assistant system that facilitated travel blog writing by selecting, for each blog paragraph, the most relevant images from an image set. The procedure is as follows.
1. The system adds metadata to the traveler's photos based on a generic visual categorizer, which provides annotations (short textual keywords) related to some generic visual aspects of and objects in the image (footnote 14).
2. Textual information (tags) is obtained using a cross-content information retrieval system with a repository of multimedia objects.
3. For a given paragraph, the system ranks the uploaded images according to the similarity between the extracted metadata and the paragraph.
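The ranking in step 3 can be illustrated with a short sketch. The similarity measure used by Bressan et al. is not given here, so the example assumes Jaccard overlap between the words of the paragraph and the single-word keywords attached to each image; the file names and keywords are hypothetical.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the words of a text and strip basic punctuation."""
    return {word.strip(".,!?").lower() for word in text.split()}


def rank_images(paragraph: str, image_keywords: Dict[str, Set[str]]) -> List[str]:
    """Order images by the Jaccard overlap between their keywords and the paragraph."""
    words = tokenize(paragraph)

    def jaccard(keywords: Set[str]) -> float:
        union = words | keywords
        return len(words & keywords) / len(union) if union else 0.0

    return sorted(image_keywords, key=lambda name: jaccard(image_keywords[name]), reverse=True)


image_keywords = {
    "img1.jpg": {"beach", "sky"},
    "img2.jpg": {"building", "street"},
    "img3.jpg": {"beach", "people"},
}
paragraph = "We spent the afternoon on the beach under a clear sky."
ranking = rank_images(paragraph, image_keywords)
```

The image whose keywords overlap most with the paragraph comes first, so the top of the list can be offered to the blogger as the suggested illustration.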

Interfaces for travel information access
In this section, we describe two studies that focused on interfaces for travel information access.

Providing travel information along streetcar lines
Ishino et al. [13] proposed a method for collecting blog entries about the Hiroshima Electric Railway (Hiroden) from a blog database. Hiroden blog entries were defined as travel journals that provide regional information for streetcar stations in Hiroshima. The task of collecting Hiroden blog entries was divided into two steps: (1) collection of blog entries; and (2) identification of Hiroden blog entries. Figure 7 shows a route map used by the system for providing travel information along the Hiroden streetcar lines. The route map shows Hiroden streetcar stations and major tourist spots. The steps in the search procedure are as follows.
• (Step 2) Click the link to a Hiroden blog entry to display it.

Natural language interface for accessing databases
Several ontologies for e-tourism have been developed (see Section 6). Unfortunately, the gap between human users who want to retrieve information and the Semantic Web has yet to be closed. Ruiz-Martínez et al. [30] proposed a method for querying ontological knowledge bases using natural language sentences. For example, when the user entered the query "I want to visit the most important tourist attractions in Paris", the system conducted part-of-speech tagging, lemmatization, and modification of query terms using synonyms, and finally searched the ontology.
Footnote 14: Bressan et al. [2] used images that were categorized into 44 classes as training data for visual categorization. Each class was given a short text name, such as "clouds and sky" or "beach". When an image was categorized as belonging to classes A and B using the visual categorizer, the short texts given to each class were assigned as keywords of the image.

Linguistic resources for studies of automatic compilation of travel information from texts

Text Corpora
• TripAdvisor: http://tripadvisor.com This site provides fifty million reviews written in various languages.
• Footstops: http://footstops.com This site provides more than 8,000 blog entries written in English.
• IgoUgo: http://www.igougo.com This site provides 530,000 reviews and 62,000 blog entries written in English.
• Travbuddy: http://www.travbuddy.com This site provides more than 90,000 reviews and 180,000 blog entries written in English.
• TravelBlog: http://www.travelblog.org This site provides more than 600,000 blog entries written in English. Each entry is classified at city level in a geographic hierarchy.
• Travellerspoint: http://www.travellerspoint.com This site provides more than 180,000 blog entries written in English.
• TravelPod: http://www.travelpod.com This site is one of the oldest travel portals, launched in 1997, and provides blog entries written in English.
• 4travel: http://4travel.jp This site provides approximately 300,000 reviews and 600,000 blog entries written in Japanese. Each review is classified at city level in a geographic hierarchy.

Useful Sites or Services for Travel
• Yahoo Travel Guide: http://travel.yahoo.com/ This site provides an area-based recommendation service. For each country, several main cities are listed.
• WikiTravel: http://wikitravel.org This site is a travel recommendation system to which "WikiTravellers" contribute. For each destination, the articles in WikiTravel generally include all or part of the following information: history, climate, landmarks, work information, shopping information, food, and how to get there.

Conclusions
• Recommending landmarks (landmark finding) is a standard research topic in image processing using Flickr. In this chapter, we mentioned three studies [1,7,17] that relied mainly on image processing and tag-based recommendation techniques rather than natural language processing. The authors believe that there is still room to improve the methods of recommending landmarks by natural language processing, because sentiment analysis techniques, such as those used for recommending accommodation, have not yet been used for recommending landmarks.