Open access peer-reviewed chapter
By Hidetsugu Nanba, Aya Ishino and Toshiyuki Takezawa
Submitted: March 13th 2012. Reviewed: July 4th 2012. Published: November 21st 2012.
Travel guidebooks and portal sites provided by tour companies and governmental tourist boards are useful sources of information about travel. However, it is costly and time-consuming to compile travel information for all tourist spots and to keep these data up-to-date manually. Recently, research on services for the automatic compilation and recommendation of travel information has been increasing in various research communities, such as natural language processing, image processing, Web mining, geographic information systems (GISs), and human interfaces. In this chapter, we give an overview of the state of the art of this research and of several related services. We focus especially on research in natural language processing, including text mining.
The remainder of this chapter is organized as follows. Section 2 explains the automatic construction of databases for travel. Section 3 describes analysis of travelers’ behavior. Section 4 introduces several studies about recommending travel information. Section 5 shows interfaces for travel information access. Section 6 lists several linguistic resources. Finally, we provide our conclusions and offer future directions in Section 7.
In this section, we describe several studies about constructing databases for travel. In Section 2.1, we introduce a study that identified travel blog entries in a blog database. In Section 2.2, we describe several methods to construct databases for travel by extracting travel information, such as tourist spots or local products, from travel blog entries using information extraction techniques. In Section 2.3, we explain a method that constructs travel links automatically.
Travel blogs are defined as travel journals written by bloggers in diary form. Travel blogs are considered useful for obtaining travel information, because many bloggers’ travel experiences are written in this form.
There are various portal sites for travel blogs, which we will describe in Section 6. At these sites, travel blogs are manually registered by bloggers themselves, and the blogs are classified according to travel destination. However, there are many more travel blogs in the blogosphere, beyond these portal sites. In an attempt to construct an exhaustive database of travel blogs, Nanba et al. identified travel blog entries written in Japanese in a blog database.
Blog entries that contain cue phrases, such as “travel”, “sightseeing”, or “tour”, are highly likely to be travel blog entries. However, not every travel blog entry contains such cue phrases. For example, if a blogger describes a journey to Norway over multiple blog entries, the first entry might state “We traveled to Norway”, while the second says only “We ate wild sheep!”. Because the second entry contains no travel-related expressions, it is difficult to identify it as a travel blog entry in isolation. Therefore, Nanba et al. considered not only each blog entry but also its surrounding entries when identifying travel blog entries. They formulated the identification of travel blog entries as a sequence-labeling problem, and solved it using machine learning. As the machine learning method, they examined Conditional Random Fields (CRF), whose empirical success has been reported in the field of natural language processing. The CRF-based method identifies the tag of each entry. Features and tags are given to the CRF as follows: (1) the tags of the k entries before the target entry; (2) the features of the k entries before the target entry; and (3) the features of the k entries following the target entry (see Figure 1). They used k = 4, which was determined in a pilot study. They used the following features for machine learning: whether an entry contains any of 416 cue phrases, such as “旅行 (travel)”, “ツアー (tour)”, and “出発 (departure)”, and the number of location names in each entry.
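The windowed features described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the English cue list, the small location gazetteer, and the function names are hypothetical stand-ins (the study used 416 Japanese cue phrases), and the CRF training step itself is omitted. In a CRF, the dependence on the tags of preceding entries is handled by the model rather than by the feature rows built here.

```python
from typing import Dict, List

# Hypothetical miniature cue list; the study used 416 Japanese cue phrases.
CUE_PHRASES = ["travel", "sightseeing", "tour", "departure"]
# Hypothetical gazetteer standing in for a location-name recognizer.
LOCATIONS = ["Norway", "Oslo", "Bergen"]


def entry_features(text: str) -> Dict[str, object]:
    """Features of one entry: cue-phrase presence and number of location names."""
    lowered = text.lower()
    return {
        "has_cue": any(cue in lowered for cue in CUE_PHRASES),
        "n_locations": sum(text.count(loc) for loc in LOCATIONS),
    }


def window_features(entries: List[str], k: int = 4) -> List[Dict[str, object]]:
    """For each entry, collect the features of up to k entries before and after it."""
    base = [entry_features(entry) for entry in entries]
    rows = []
    for i in range(len(entries)):
        row: Dict[str, object] = {}
        for offset in range(-k, k + 1):
            j = i + offset
            if 0 <= j < len(entries):
                for name, value in base[j].items():
                    # Keys look like "has_cue[-1]" or "n_locations[+0]".
                    row[f"{name}[{offset:+d}]"] = value
        rows.append(row)
    return rows


blog = ["We traveled to Norway.", "We ate wild sheep!", "Back at work today."]
features = window_features(blog, k=4)
```

In this toy example the second entry has no cue phrase of its own, but its feature row still records that the preceding entry contains one and mentions a location, which is the kind of context a sequence labeler can exploit.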
Using the above method, Nanba et al. identified 17,268 travel blog entries among 1,100,000 blog entries, and constructed a system that plotted travel blog entries on a Google map (see Figure 2). In this figure, travel blog entries are shown as icons. If the user clicks an icon, the corresponding blog entry is shown in a pop-up window.
Nakatoh et al.  proposed a method for extracting names of local culinary dishes from travel blogs written in Japanese, which were identified when the blog entry included both the name of a sightseeing destination and the word “tourism”. They extracted local dishes by gathering nouns that are dependent on the verb
In the following, we explain the details of the bootstrapping-based and machine learning-based information extraction approaches based on Nanba’s work. Nanba et al. extracted pairs comprising a location name and a local product from travel blogs written in Japanese, which were identified using the method described in Section 2.1. For the efficient extraction of travel information, they employed a bootstrapping method.
First, they prepared 482 pairs as seeds for the bootstrapping. These pairs were obtained automatically from a “Web Japanese N-gram” database provided by Google, Inc. The database comprises N-grams (N = 1–7) extracted from 20 billion Japanese sentences on the Web. They applied the pattern
Second, they applied a machine learning-based information extraction technique to the travel blogs identified in the previous step, and obtained new pairs. In this step, they prepared training data for the machine learning in the following three steps.
Select 200 sentences that contain both a location name and a local product from the 482 pairs. Then automatically create 200 tagged sentences, to which both “location” and “product” tags are assigned.
Prepare another 200 sentences that contain only a location name. Then create 200 tagged sentences, to which the “location” tag is assigned.
Apply machine learning to the 400 tagged sentences, and obtain a system that automatically allocates “location” and “product” tags to given sentences.
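The automatic creation of tagged training sentences in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the seed pairs, the example sentences, the inline tag syntax, and the function names are all hypothetical, and the original work operated on Japanese text.

```python
from typing import List, Optional, Tuple

# Hypothetical seed pairs (location, local product); the study used 482 pairs.
SEED_PAIRS: List[Tuple[str, str]] = [("Hiroshima", "oysters"), ("Kagawa", "udon")]


def tag_sentence(sentence: str, location: str, product: Optional[str] = None) -> str:
    """Wrap the known location (and product, if given) in inline tags."""
    tagged = sentence.replace(location, f"<location>{location}</location>")
    if product is not None:
        tagged = tagged.replace(product, f"<product>{product}</product>")
    return tagged


def build_training_data(sentences: List[str],
                        seed_pairs: List[Tuple[str, str]]) -> List[str]:
    """Tag sentences containing a full seed pair, or only a known location."""
    locations = sorted({loc for loc, _ in seed_pairs})
    tagged = []
    for sentence in sentences:
        pair = next(((loc, prod) for loc, prod in seed_pairs
                     if loc in sentence and prod in sentence), None)
        if pair is not None:
            tagged.append(tag_sentence(sentence, pair[0], pair[1]))
            continue
        location = next((loc for loc in locations if loc in sentence), None)
        if location is not None:
            tagged.append(tag_sentence(sentence, location))
    return tagged


sentences = [
    "We ate oysters in Hiroshima.",
    "We arrived in Kagawa at noon.",
    "It rained all day.",
]
training = build_training_data(sentences, SEED_PAIRS)
```

Sentences that match neither a seed pair nor a known location are simply skipped in this sketch.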
As the machine learning method, they used CRF. The CRF-based method identifies the class of each word in a given sentence. Features and tags are given to the CRF as follows: (1) the tags of the k words before the target word; (2) the features of the k words before the target word; and (3) the features of the k words following the target word. They used k = 2, which was determined in a pilot study. They used the following six features for machine learning.
The part of speech to which the word belongs (noun, verb, adjective, etc.)
Whether the word is a quotation mark.
Whether the word is a cue word, such as “名物”, “名産”, “特産” (local product), “銘菓” (famous confection), or “土産” (souvenir).
Whether the word is a surface case.
Whether the word is frequently used in the names of local products or souvenirs, such as “cake” or “noodle”.
Collections of Web links are useful information sources. However, maintaining these collections manually is costly. Therefore, an automatic method for compiling collections of Web links is required. In this section, we introduce a method that compiles travel links automatically.
From travel blog entries that were automatically identified using the method described in Section 2.1, Ishino et al. extracted the hyperlinks that bloggers had included to useful Web sites about tourist spots, and thereby constructed collections of hyperlinks for tourist spots. The procedure for classifying links in travel blog entries is as follows.
Input a travel blog entry.
Extract a hyperlink and any surrounding sentences that mention the link (a citing area).
Classify the link by taking account of the information in the citing area.
They classified link types into the following four categories.
S (Spot): The information is about tourist spots.
H (Hotel): The information is about accommodation.
R (Restaurant): The information is about restaurants.
O (Other): Other than types S, H, and R.
A hyperlink may be classified as more than one type. For example, a hyperlink to
For the classification of link types, they employed a machine learning technique using the following features.
Whether the word is a cue phrase, detailed as follows, where the numbers in brackets shown for each feature represent the number of cues.
Cues for type S
| Cue phrase | Number of cues |
| A list of tourist spots, collected from Wikipedia. | 17,371 |
| Words frequently used in the name of tourist spots, such as “動物園” (zoo) or “博物館” (museum). | 138 |
| Words related to sightseeing, such as “見学” (sightseeing) or “散策” (stroll). | 172 |
Cues for type H
| Cue phrase | Number of cues |
| Words that are frequently used in the name of hotels, such as “ホテル” (hotel) or “旅館” (Japanese inn). | 9 |
| Component words for accommodations, such as “フロント” (front desk) or | 29 |
| Words that are frequently used when tourists stay in accommodation, such as | 14 |
Based on this method, Ishino et al. constructed a travel link search system. The system generated a list of URLs for Web sites related to a location, and automatically identified link types and the context of citations (“citing areas”), where the blog authors described the sites. Figure 3 shows a list of links related to “大阪” (Osaka).
Cues for type R
| Cue phrase | Number of cues |
| Dish names such as “omelet”, collected from Wikipedia. | 2,779 |
| Cooking styles such as “Italian cuisine”, collected from Wikipedia. | 114 |
| Words that are frequently used in the name of restaurants, such as “レストラン” (restaurant) or | 21 |
| Words that are used when taking meals, such as “食べる” (eat) or “おいしい” (delicious). | 52 |
| General words that indicate food, such as “ご飯” (rice) or “料理” (cooking). | 31 |
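To illustrate how cue lists of this kind can drive link typing, the sketch below counts cue words per link type in a citing area and assigns every type with at least one hit, falling back to O. This is a simplified stand-in, not the method of Ishino et al.: they trained a machine learning classifier on such features, whereas the threshold rule, the small English cue sets, and the function names here are assumptions made for illustration.

```python
import re
from typing import Dict, List, Set

# Hypothetical English cue sets; the original cue lists are Japanese and far larger.
CUES: Dict[str, Set[str]] = {
    "S": {"zoo", "museum", "stroll"},
    "H": {"hotel", "inn", "ryokan"},
    "R": {"restaurant", "eat", "delicious"},
}


def cue_features(citing_area: str) -> Dict[str, int]:
    """Count how many tokens of the citing area are cues for each link type."""
    tokens = re.findall(r"[a-z]+", citing_area.lower())
    return {link_type: sum(token in cues for token in tokens)
            for link_type, cues in CUES.items()}


def classify_link(citing_area: str) -> List[str]:
    """Assign every type with at least one cue hit; otherwise type O (Other)."""
    counts = cue_features(citing_area)
    labels = [link_type for link_type, n in counts.items() if n > 0]
    return labels or ["O"]


both = classify_link("The hotel restaurant was delicious.")
spot = classify_link("We visited the museum.")
other = classify_link("See my profile page.")
```

Because a link may belong to more than one type, the function returns a list of labels rather than a single label.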
The analysis of people’s transportation information is considered an important issue in various fields, such as city planning, architectural planning, car navigation, sightseeing administration, crime prevention, and tracing the spread of epidemics. In this section, we focus on the analysis of travelers’ behavior.
Ishino et al. proposed a method to extract people’s transportation information from automatically identified travel blogs written in Japanese. They used machine learning to extract information, such as “departure place”, “destination”, or “transportation device”, from travel blog entries. The tags used in their study are defined as follows.
The FROM tag indicates the departure place.
The TO tag indicates the destination.
The VIA tag indicates the route.
The METHOD tag indicates the transportation device.
The TIME tag indicates the time of transportation.
The following is a tagged example.
It took <TIME>five hours</TIME> to travel from <FROM>Hiroshima</FROM> to <TO>Osaka</TO> by <METHOD>bus</METHOD>.
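A tagged sentence of this form can be turned into a structured record with a small parser. The following is only a sketch for reading the annotation format shown above; the function names are hypothetical, and the extraction itself (deciding where the tags go) is the job of the learned model, not of this code.

```python
import re
from typing import Dict

# Matches one tagged span, e.g. <FROM>Hiroshima</FROM>.
TAG_PATTERN = re.compile(r"<(FROM|TO|VIA|METHOD|TIME)>(.*?)</\1>")


def extract_transport(tagged: str) -> Dict[str, str]:
    """Map each tag name to its text (the last occurrence wins if a tag repeats)."""
    return {tag: value for tag, value in TAG_PATTERN.findall(tagged)}


def strip_tags(tagged: str) -> str:
    """Remove the tags and return the plain sentence."""
    return re.sub(r"</?(?:FROM|TO|VIA|METHOD|TIME)>", "", tagged)


example = ("It took <TIME>five hours</TIME> to travel from <FROM>Hiroshima</FROM> "
           "to <TO>Osaka</TO> by <METHOD>bus</METHOD>.")
record = extract_transport(example)
plain = strip_tags(example)
```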
They formulated the task as identifying the class of each word in a given sentence, and solved it using machine learning. As the machine learning method, they used CRF, in the same way as Nanba et al., as mentioned in Section 2.2. The CRF-based method identifies the class of each word. Features and tags are used in the CRF as follows: (1) the tags of the k words before the target word; (2) the features of the k words before the target word; and (3) the features of the k words following the target word. They used k = 4, which was determined via a pilot study. They used the following features for machine learning.
The part of speech to which the word belongs (noun, verb, adjective, etc.).
Whether the word is a quotation mark.
Whether the word is a cue phrase.
The details of cue phrases, together with the number of cue phrases of the given type, are shown as follows.
FROM: The word is a cue that often appears immediately after the “FROM” tag, such as “から” (from) or
FROM & TO: The word is frequently used in the name of a tourist spot, such as “博物館” (museum) or “遊園地” (amusement park): 45.
The word is frequently used in the name of a destination, such as “観光” (sightseeing tour) or “駅” (station): 11.
The word is the name of a tourist spot: 13,779.
The word is the name of a station or airport: 9,437.
TO: The word is a cue that often appears immediately after the “TO” tag, such as
VIA: The word is a cue that often appears immediately after the “via” tag, such as “経由” (via) or
The word is the name of a highway: 101.
METHOD: The word is the name of a transportation device, such as “飛行機” (airplane) or
The word is the name of a vehicle: 128.
The word is the name of a train or bus: 2,033.
TIME: The word is an expression related to time, such as “分” (minute) or “時間” (hour): 77.
They also constructed a visualization of transportation information, which is shown in Figure 4. In this figure, each arrow indicates a link from a departure place to a destination. In addition to arrows, transportation methods, such as trains or buses, are shown as icons.
Transportation information can also be extracted from texts written in English. Davidov presented an algorithmic framework that enables automated acquisition of map-link information from the Web, based on linguistic patterns such as “from X to”. Given a set of locations as initial seeds, he retrieved an extended set of locations from the Web, and produced a map-link network that connected these locations using edges showing the transportation type.
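The idea of harvesting map links with a linguistic pattern can be sketched on a handful of sentences. This is a toy illustration under stated assumptions, not Davidov's algorithm: the exact regular expression (including the “to Y” and optional “by Z” parts), the capitalized-word heuristic for place names, and the function name are hypothetical, and the Web retrieval and seed expansion steps are omitted.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical pattern: "from X to Y", optionally followed by "by <transport>".
LINK_PATTERN = re.compile(r"from ([A-Z][a-z]+) to ([A-Z][a-z]+)(?: by (\w+))?")


def extract_links(texts: List[str]) -> Dict[Tuple[str, str], Set[str]]:
    """Build a map-link network: (origin, destination) -> set of transport types."""
    graph: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for text in texts:
        for origin, destination, mode in LINK_PATTERN.findall(text):
            # An unmatched optional group is returned as an empty string.
            graph[(origin, destination)].add(mode or "unknown")
    return dict(graph)


texts = [
    "We went from Hiroshima to Osaka by bus.",
    "You can also go from Hiroshima to Osaka by train.",
    "The trip from Osaka to Kyoto was short.",
]
links = extract_links(texts)
```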
Recommendation systems provide a promising approach to ranking commercial products or documents according to a user’s interests. In this section, we describe several studies and services that recommend travel information. We describe the recommendation of tourist spots, landmarks, travel products, accommodation, and photos.
Recommending tourist spots has been well studied in the multimedia field. Movies and images are used as information sources in addition to texts. In this section, we describe two multimedia studies.
Hao et al. proposed a method for mining location-representative knowledge from travel blogs based on a probabilistic topic model (the Location–Topic model). Using this model, they developed three modules: (1) destination recommendation for flexible queries; (2) characteristics summarization for a given destination, with representative tags and snippets; and (3) identification of informative parts of a travel blog and enriching recommendations with related images.
Figure 5 shows an example of the system output. In this figure, a travel blog segment is enriched with three images that depict its most informative parts. Each image’s original tags and the words in the text to which it corresponds are also presented.
Wu et al. proposed a system that summarized tourism-related information. When a user (traveler) entered a query, such as “What is the historical background of Tian Tan?”, the system searched for and obtained information from Wikipedia, Flickr, YouTube, and official tourism Web sites using the tourist spot name as a query. The system also classified the query as belonging to one of five categories (“general”, “history”, “landscape”, “indoor scenery”, and “outdoor scenery”) in order to provide users with more relevant information. For example, when a query is classified as belonging to the “history” category, the information is obtained from texts, while for a query regarding “outdoor scenery”, the information is obtained from photos and videos.
Finding and recommending landmarks is considered an important research topic in the multimedia field, along with recommending tourist spots. Abbasi et al. focused on the photo-sharing system Flickr, and proposed a method to identify landmark photos using tags and social Flickr groups. Gao et al. also proposed a method to identify landmarks using Flickr and the Yahoo Travel Guide.
Ji et al. proposed another method for finding landmarks. They clustered blog photos relating to a particular tourist site, such as the Louvre Museum in Paris. They then represented these photos as a graph based on the clustering results, and detected landmarks using link analysis methods, such as the PageRank and HITS algorithms.
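As a reminder of how link analysis ranks nodes in such a graph, the sketch below runs a plain power-iteration PageRank on a tiny directed graph. It illustrates the general algorithm only: the photo names and edges are hypothetical, and the way Ji et al. derived their graph from clustering results is not reproduced here.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank; every node must appear as a key of `graph`."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node first receives the teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = graph[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: an edge points from a photo to a similar, more central photo.
photo_graph = {
    "photo_a": ["photo_c"],
    "photo_b": ["photo_c"],
    "photo_c": ["photo_a"],
    "photo_d": ["photo_c"],
}
ranks = pagerank(photo_graph)
top_photo = max(ranks, key=ranks.get)
```

In this toy graph, the photo that most other photos point to receives the highest score, which is the intuition behind treating highly ranked photos as landmark candidates.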
Ishino et al. proposed a method that added links to advertisements for travel products to the travel information links that were described in Section 2.3. The procedure for providing ad links is as follows.
Input a link type and the citing areas of a travel information link.
Extract keywords from the citing areas.
Extract product data containing all keywords, and calculate the similarity between the citing areas of a travel information link and the product data.
Provide the ad link to the product data having the highest similarity to the travel information link.
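The last two steps of this procedure can be sketched as a filter followed by a similarity ranking. This is a minimal sketch under stated assumptions: the source does not say which similarity measure was used, so cosine similarity over word counts is an assumption, and the product records, field names, and function names are hypothetical. Keyword extraction is taken as given here.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_ad(citing_area: str, keywords: List[str],
              products: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Keep products containing all keywords, then return the most similar one."""
    area_vector = Counter(tokenize(citing_area))
    best, best_score = None, -1.0
    for product in products:
        tokens = tokenize(product["text"])
        if not all(keyword.lower() in tokens for keyword in keywords):
            continue
        score = cosine(area_vector, Counter(tokens))
        if score > best_score:
            best, best_score = product, score
    return best


products = [
    {"id": "p1", "text": "Miyajima shrine ferry day tour"},
    {"id": "p2", "text": "Miyajima maple cake gift box"},
    {"id": "p3", "text": "Osaka castle night tour"},
]
area = "We enjoyed the night view of Miyajima shrine from the ferry"
chosen = select_ad(area, ["Miyajima"], products)
```

If no product contains all keywords, the function returns `None`, so no ad link is attached.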
They extracted keywords for travel products corresponding to the link type. They used the same cues as those used to classify travel information links (see Section 2.3), and then extracted keywords from the citing areas of links of types S (Spot) and R (Restaurant).
First, the method for extracting keywords from the citing areas of links of type S is described. The cues for type S, such as tourist spots collected from Wikipedia and words frequently used in the names of tourist spots, tend to become keywords. Therefore, they registered these cues as candidate keywords for links of type S. If the citing areas of these links contained candidate keywords, they extracted the candidates as keywords. In addition, if citing areas contained names of places, they extracted the names as keywords.
The cues for type R, such as dish names and cooking styles, also tend to become keywords. Therefore, they registered these cues as candidate keywords for links of type R. If the citing areas for links of type R contained candidate keywords, they extracted them as keywords.
Titov and McDonald proposed an aspect-based summarization system, and applied the method to the summarization of hotel reviews. The system took as input a set of user reviews for a specific product or service with a numeric rating (left side in Figure 6), and produced a set of relevant aspects, which they called an aspect-based summary (right side in Figure 6). To extract all relevant mentions in each review for each aspect, they introduced a topic model. They applied their method to hotel reviews on the TripAdvisor Web site, and obtained aspect-based summaries for each hotel.
To obtain more reliable hotel reviews, opinion spam should be detected and eliminated. Opinion spam consists of fictitious opinions that have been deliberately written to sound authentic. Ott et al. proposed a method to detect opinion spam among consumer reviews of hotels. They created 400 deceptive opinions using the Amazon Mechanical Turk (AMT) crowdsourcing service by asking anonymous online workers (Turkers) to write opinion spam for 20 chosen hotels. In addition to these deceptive opinions, they selected 6,977 truthful opinions from TripAdvisor, and used both groups for their task.
Bressan et al. proposed a travel blog assistant system that facilitated travel blog writing by selecting, for each blog paragraph, the most relevant images from an image set. The procedure is as follows.
The system adds metadata to the traveler’s photos based on a generic visual categorizer, which provides annotations (short textual keywords) related to generic visual aspects of, and objects in, the image.
The system obtains textual information (tags) using a cross-content information retrieval system over a repository of multimedia objects.
For a given paragraph, the system ranks the uploaded images according to the similarity between the extracted metadata and the paragraph.
In this section, we describe two studies that focused on interfaces for travel information access.
Ishino et al. proposed a method for collecting blog entries about the Hiroshima Electric Railway (Hiroden) from a blog database. Hiroden blog entries were defined as travel journals that provide regional information for streetcar stations in Hiroshima. The task of collecting Hiroden blog entries was divided into two steps: (1) collection of blog entries; and (2) identification of Hiroden blog entries.
Figure 7 shows a route map used by the system for providing travel information along the Hiroden streetcar lines. The route map shows Hiroden streetcar stations and major tourist spots. The steps in the search procedure are as follows.
Several ontologies for e-tourism have been developed (see Section 6). Unfortunately, the gap between human users who want to retrieve information and the Semantic Web is yet to be closed. Ruiz-Martínez et al. proposed a method for querying ontological knowledge bases using natural language sentences. For example, when the user entered the query “I want to visit the most important tourist attractions in Paris”, the system conducted part-of-speech tagging, lemmatization, and expansion of the query terms with synonyms, and finally searched the ontology.
This site provides fifty million reviews written in various languages.
This site provides more than 8,000 blog entries written in English.
This site provides 530,000 reviews and 62,000 blog entries written in English.
This site provides more than 90,000 reviews and 180,000 blog entries written in English.
This site provides more than 600,000 blog entries written in English. Each entry is classified at city level in a geographic hierarchy.
This site provides more than 180,000 blog entries written in English.
This site is one of the oldest travel portals, launched in 1997, and provides blog entries written in English.
This site provides approximately 300,000 reviews and 600,000 blog entries written in Japanese. Each review is classified at city level in a geographic hierarchy.
Databases for Travel
Rakuten travel data:
Basic information about 11,468 properties and 350,000 reviews
Travel product data in Rakuten Shopping Mall (Rakuten Ichiba):
The data comprise 50 million items. Each item has a name, code, price, URL, picture, shop code, category ID, descriptive text, and registration data.
Useful Sites or Services for Travel
Yahoo Travel Guide:
This site provides an area-based recommendation service. For each country, several main cities are listed.
This is a travel recommendation system to which “WikiTravellers” contribute. For each destination, the articles in WikiTravel generally include all or part of the following information: history, climate, landmarks, work information, shopping information, food, and how to get there.
Ontologies for Travel
The World Tourism Organization (WTO) provides a multilingual thesaurus in English, French, and Spanish that provides a standard terminology for tourism.
DERI’s e-Tourism Working group has created a tourism ontology called “OnTour”. This ontology describes the main conventional concepts for tourism such as accommodation or activities, together with other supplementary concepts such as GPS coordinates or a postal address.
LA_DMS is an ontology for tourism destinations that was developed for the Destination Management System (DMS). This system adapts information requests about tourist destinations to users’ needs.
Many other ontologies for travel were introduced by Ruiz-Martínez et al.
GeoCLEF: Geographic Information Retrieval
NTCIR GeoTime was another cross-language geographic retrieval track run as part of the NTCIR. It operated from 2008 to 2011 [8, 9]. The focus of this task was searching with geographic and temporal constraints using Japanese and English news articles as target documents.
In this chapter, we have introduced the state of the art of research and services related to travel information. There are several future directions for this research field.
We mentioned in Section 2 that several natural language processing technologies are useful for creating databases for travel. These technologies may also be applied to maintain manually created databases or ontologies for travel, such as those discussed in Section 6.
Multilingualization of the ontologies for travel using machine translation techniques is also considered an important task for encouraging further studies in this research field.
There are many different locations that have the same name (place name polysemy), and there may be multiple names for a given location (place name synonymy). To eliminate this geo-ambiguity problem, Ji et al. proposed the Hierarchical-comparison Geo-Disambiguation (HGD) algorithm, which distinguished the city-level location using a combination of its lower-level locations, derived from the hierarchical location relationships. In addition to this method, several natural language processing technologies, such as automatic acquisition of synonyms [5, 29, 35, 36] and word sense disambiguation, are available.
Recommending landmarks (landmark finding) is a standard research topic in image processing using Flickr. In this chapter, we mentioned three studies [1, 7, 17] that relied mainly on image processing and tag-based recommendation techniques rather than natural language processing. The authors believe that there is still room to improve the methods of recommending landmarks by natural language processing, because sentiment analysis techniques, such as those used for recommending accommodation, have not yet been used for recommending landmarks.