Traditionally, the discovery of geographic data has been performed locally, from a centralized, desktop-based perspective. In the last decade, though, the notion of the Spatial Data Infrastructure (SDI) (Masser, 2005) has emerged as a network of interconnected SDI nodes that build spatial information infrastructures at different scales and levels (local, regional, national, European and even global) (INSPIRE, 2007). These SDI nodes represent an effort to link, share, and exchange geographic data from different and dispersed sources. Discovering geographic data through this immense network of SDI nodes has usually been delegated to catalogue services (http://www.opengeospatial.org/standards/cat). These specialized services rely on metadata that describes the geographic content they register in order to offer users and client applications effective interfaces for searching and publishing that content.
With the advent of Web 2.0, the traditional Web moved towards a more social platform in which the roles of content producer and content consumer have blurred and are often played by the same actor. New technologies and applications (AJAX, Web APIs, social networks, etc.) offered possibilities that facilitated and accelerated content creation and sharing, with the Web as the basic environment. Some of them deal specifically with geographic content and give people a new way to create and publish it regardless of their technical skills or knowledge of Geographic Information Systems (GIS). This trend is currently known as Neogeography (Turner 2006). Millions of people are now potential data providers, able to create and share fresh geographic content for a given area and thus acting as true sensors (Goodchild 2007). This volume of geographic data on the Web has made the concept of the Geospatial Web (Scharl & Tochtermann, 2007) increasingly relevant. The term describes a use of the Web in which content has a geographic component that defines its location in the world, and that component is used both to access the content and to link it with other resources. In essence, the Geospatial Web can be seen as the result of organizing the Web according to the location of its resources.
The openness, simplicity and ease of sharing of the new Web 2.0 tools and services also facilitate content creation, which results in geographic data growing at a rate never seen before. The speed at which the quantity of content increases, together with the nature of most content creators, who often lack deep technical skills, makes catalogues, with their strict metadata requirements and user-driven process of content submission, an ineffective solution for managing this content. For these reasons new methods for discovering and managing user-generated content are needed. During the evolution of the Web, its size reached a point at which searching for information through directories of linked documents became a problem. The same problem could arise again with the almost unmanageable quantity of geographic content. In the first case, web search engines (Google, Yahoo, AltaVista) appeared with successful results, becoming an essential tool for almost any Web user and for innovative and distinct uses (Al-Masri & Mahmoud, 2008). These applications may again offer a solution for discovering geographic data on today’s Web.
2.1. Web 2.0 and the Geospatial Web
Despite the fuzziness that usually surrounds its definition, the term Web 2.0 does not refer to a new and different version of the traditional Web but to a change in technology and design that improves its functionality, communication, information sharing, creativity and collaboration. This new vision of the Web can be considered the evolution and merging of different streams such as technological improvement, the appearance of new applications and the socialization of the Web, which comprises the participation and contribution of its users (Vossen & Hagemann 2007).
Until a few years ago, content consumption on the Web followed the commonly used client-server model. In this model the content provider acted as the server, producing content that was consumed directly by the client, or content consumer, so each role was clearly defined. With the spread of the Web 2.0 philosophy this model changed completely, dissolving the boundaries between the roles. Nowadays the Web is full of services and tools with one major objective: sharing content (photos, videos, bookmarks, knowledge, etc.) among users.
The Geospatial Web represents the merging of different types of information already present on the Web (e.g. HTML pages, images, etc.) with geographic content. This facilitates discovering and searching any type of content based on its geographical component. In other words, the Geospatial Web structures each piece of Web content (photos, videos, web pages, 3D models, etc.) according to its geospatial location, a process called georeferencing. This linkage between data and geographic location enables the data to be discovered and used by location, as envisaged by the Geospatial Web.
2.2. Geographic services and user-generated data
Since the appearance of the first web-based GIS systems, many tools and services have emerged. More and more services have appeared in recent years in particular, probably thanks to the entry of big companies into the geographic information business (e.g. Google, Yahoo or Microsoft), the growth and improvement of Internet communications and the widespread use of positioning devices such as GPS receivers. Probably the most common of these tools are web mapping services (Mitchell 2005). Many web mapping service implementations are available, both from private companies, such as Google Maps, Yahoo! Maps, Bing Maps (formerly Microsoft Live Maps), MapQuest and many others, and integrated in the Geoportals (Bernard et al., 2005) usually present in SDI nodes.
Not only have web mapping services become popular, but a new type of desktop application has also recently emerged to visualize geographic content. Geobrowsers (virtual globes or digital earths) offer a three-dimensional view of the earth in which geographic content is displayed over a virtual representation of the planet’s surface. These tools offer entirely new ways of visualizing data and have started a shift from two-dimensional views to a more realistic method of visualization. Examples of geobrowsers are Google Earth, NASA World Wind and ESRI’s ArcGIS Explorer. These services are not limited to showing end users simple cartography; they are also enriched with different sources of information. For instance, users can find addresses and directions, businesses, traffic conditions or even content created by other users. These applications offer not just the visualization of content but also a search mechanism over the geographic data they manage.
The use of web-based and desktop applications is extended by the release of Application Programming Interfaces (APIs) and Software Development Kits (SDKs). APIs enable third-party developers to use geographic services and data in their own projects through calls to the API. For instance, anybody can create geographic data and visualize it on their own website using either Google Maps or Google Earth. The use of these APIs popularized the term mashup, which denotes the creation of new web applications for specific purposes by combining and integrating remote data from other distributed sources. SDKs offer the opportunity to freely reuse or extend desktop applications such as NASA World Wind or ESRI’s ArcGIS Explorer, enabling users to create their own applications with these capabilities.
The emergence of all these brand-new services and tools, along with the Web 2.0 vision of content creation and sharing, has led to new ways to create and distribute geographic content. Probably one of the pioneering and most important movements today is the OpenStreetMap project, whose objective is the collaborative creation of freely available cartography of the whole world (Haklay & Weber 2008). This cartographic data is collected, edited and uploaded by the project’s own users. Another example is the concept of Public Participation GIS (PPGIS) (Nyerges et al., 1997), which enables a community to participate in the future urban development of its own area. Not only do projects such as OpenStreetMap promote this type of geographic content creation, but private companies also offer their own tools to do so. This is the case of Google My Maps and Google Map Maker, the latter also intended to facilitate the creation of cartography for countries that lack it.
2.3. Keyhole Markup Language: The Geospatial Web’s HTML.
The Keyhole Markup Language (KML) is an XML-based language designed to express geographic annotation and visualization. Geographic visualization includes not only the representation of graphical data but also order and control over how the data is navigated. KML is used in a broad range of applications such as web mapping services, two-dimensional maps (including those used on mobile devices) and geobrowsers. Most of these applications or services also make use of KMZ files. These are essentially compressed files containing a KML file and other resources, such as images or icons, that the KML file may reference. KMZ files facilitate the sharing and distribution of content by encapsulating in a single file all the resources needed to visualize and work with the geographic content.
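As a rough illustration of the packaging idea described above, the following sketch bundles a KML document and one resource into a single KMZ archive using Python's standard `zipfile` module. It assumes that a KMZ file is an ordinary ZIP archive and follows the common convention of naming the main file `doc.kml`; the placemark, the coordinates, the `files/icon.png` path and the helper name `build_kmz` are all made up for the example.

```python
import io
import zipfile

# Minimal KML document; the placemark name and coordinates are invented.
KML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Sample point</name>
    <Point><coordinates>-0.0686,39.9944,0</coordinates></Point>
  </Placemark>
</kml>
"""


def build_kmz(kml_text, resources):
    """Return the bytes of a KMZ archive: one KML file plus extra resources.

    `resources` maps archive paths (e.g. "files/icon.png") to raw bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Assumption: the main KML file is stored as doc.kml at the root.
        archive.writestr("doc.kml", kml_text)
        for path, payload in resources.items():
            archive.writestr(path, payload)
    return buffer.getvalue()


# The "image" here is placeholder bytes, not a real PNG.
kmz_bytes = build_kmz(KML_TEXT, {"files/icon.png": b"placeholder image bytes"})

# Reading the archive back shows that everything travels in one file.
with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as archive:
    member_names = archive.namelist()
print(member_names)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing a file name to `zipfile.ZipFile` instead would produce a `.kmz` file on disk.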
KML was originally created by Keyhole Inc., a company founded in 2001 that specialized in software for geospatial data visualization. Its main application suite, Earth Viewer, became Google Earth after Google acquired the company in 2004. A broad range of applications for visualizing and working with geographic data now support KML as one of their file formats. Examples include ESRI’s ArcGIS Explorer (http://www.esri.com/software/arcgis/explorer/index.html), NASA’s World Wind (http://worldwind.arc.nasa.gov), Google Earth (http://earth.google.com) and Google Maps (http://maps.google.com). The format is used not only by client applications; geographic data servers also export information in KML. This is the case of GeoServer (http://geoserver.org), one of the most widely used software packages for serving geographic content, which supports KML and KMZ output for Web Map Service (WMS) (http://www.opengeospatial.org/standards/wms) requests. The success of KML is due partly to its use and promotion by Google and partly to its adoption as an OGC standard (http://www.opengeospatial.org/standards/kml). KML has been an OGC standard since 2008, which ensures its continuity, improvement and interconnection with other OGC standards such as WMS and WFS. This interconnection can be seen in most current geobrowsers, since KML allows the inclusion of links that represent WMS requests whose responses are displayed directly over the virtual globe (Foerster et al., 2008).
The KML language structures information by means of specific elements, or tags. It is based on an object-oriented model in which KML defines a set of objects used to build the corresponding files. Some of these objects are abstract, which from an object-oriented perspective means that they are never used directly in KML files; however, they play an important role in structuring the other objects within the language definition and in maintaining a hierarchy. For instance, some abstract objects act as parents of others, which thereby inherit common properties. Figure 1 represents the object hierarchy of the KML language, in which a given object inherits its parent’s attributes and elements, and these are inherited in turn by its child elements.
In KML everything inherits from the abstract element Object that only specifies the use of an id (identifier). Whatever is represented in KML derives from the abstract element Feature. The different features defined by the standard are:
NetworkLink represents a link to remote resources, including images, photos or even WMS requests;
Placemark represents a Feature with an associated geometry;
PhotoOverlay, ScreenOverlay and GroundOverlay are used to superimpose images over the virtual globe represented in the geobrowsers; and
Folder and Document elements are used to order and structure KML files, much as files and folders are organized in any operating system.
All these features share some elements inherited from Feature and add elements and attributes of their own. Probably the most important elements in any feature are those that describe it textually, for instance name (the feature’s title), Description and Snippet. The first two are self-explanatory; the last is a short description (usually no more than a couple of lines) that gives a quick summary of the feature and is usually shown in the lists or tree views that represent the elements displayed in the geobrowser.
One interesting characteristic of the Description element is that it accepts a restricted subset of HTML tags. This allows HTML to be used not just to enhance the visual appearance of the feature’s description but also to include images, hyperlinks to related resources or even more complex embedded elements such as Adobe Flash videos from services like YouTube.
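To make the feature structure and its textual elements more concrete, the sketch below builds a small KML tree with Python's `xml.etree.ElementTree`: a Document containing a Folder containing a Placemark with a name, a Snippet and an HTML description. In KML markup the title and description elements are written `name` and `description`. All names, texts, the coordinates and the `kml` helper function are invented for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
# Serialize KML elements without a prefix (default namespace).
ET.register_namespace("", KML_NS)


def kml(tag):
    """Qualify a tag name with the KML 2.2 namespace."""
    return f"{{{KML_NS}}}{tag}"


root = ET.Element(kml("kml"))
document = ET.SubElement(root, kml("Document"))
ET.SubElement(document, kml("name")).text = "Sample document"

folder = ET.SubElement(document, kml("Folder"))
ET.SubElement(folder, kml("name")).text = "Sample folder"

placemark = ET.SubElement(folder, kml("Placemark"))
ET.SubElement(placemark, kml("name")).text = "Sample placemark"
ET.SubElement(placemark, kml("Snippet")).text = "A one-line summary."
# HTML inside description: ElementTree escapes the markup on output
# and restores it when the file is parsed again.
ET.SubElement(placemark, kml("description")).text = (
    '<p>Longer text with a <a href="http://example.org">link</a>.</p>'
)
point = ET.SubElement(placemark, kml("Point"))
ET.SubElement(point, kml("coordinates")).text = "-0.0686,39.9944,0"

kml_text = ET.tostring(root, encoding="unicode")
print(kml_text)
```

The nesting of `Document`, `Folder` and `Placemark` mirrors the file-and-folder analogy used above; the child elements are added in the order name, Snippet, description, geometry.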
2.3.2. Geographic annotation
KML can represent several types of geometry, chosen according to the user’s requirements. These geometries include Point, LineString, LinearRing (a closed line), Polygon, MultiGeometry (a compound of other geometries) and finally Model, usually employed to represent 3D models.
The standard defines other elements such as Region to specify concrete areas on the virtual globe. A region is defined by a set of parameters, including coordinates and height, that delimit an area of interest. The elements associated with a region become active as users navigate over it. This KML element is especially useful when working with complex KML files that contain a large number of features. Enabling features only in the region where they are placed reduces the overload that visualizing all of them could cause.
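The geometry types listed above can be illustrated with a short parsing sketch. It reads a hand-written KML fragment containing a Point, a LineString and a Polygon (whose boundary is a LinearRing) and reports how many coordinate tuples each one holds. The sample data and the function names `parse_coordinates` and `summarize` are assumptions for the example; MultiGeometry and Model are left out for brevity.

```python
import xml.etree.ElementTree as ET

NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Invented sample data: coordinates are "lon,lat,alt" tuples separated
# by whitespace.
KML_TEXT = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark><name>A point</name>
    <Point><coordinates>0,40,0</coordinates></Point>
  </Placemark>
  <Placemark><name>A path</name>
    <LineString><coordinates>0,40,0 0.1,40.1,0 0.2,40.1,0</coordinates></LineString>
  </Placemark>
  <Placemark><name>An area</name>
    <Polygon><outerBoundaryIs><LinearRing>
      <coordinates>0,40,0 0.1,40,0 0.1,40.1,0 0,40,0</coordinates>
    </LinearRing></outerBoundaryIs></Polygon>
  </Placemark>
</Document>
</kml>"""


def parse_coordinates(text):
    """Turn a KML coordinates string into a list of float tuples."""
    return [tuple(float(v) for v in item.split(",")) for item in text.split()]


def summarize(kml_text):
    """Return (placemark name, geometry type, number of tuples) triples."""
    summary = []
    root = ET.fromstring(kml_text)
    for placemark in root.iterfind(".//kml:Placemark", NS):
        name = placemark.findtext("kml:name", namespaces=NS)
        for kind in ("Point", "LineString", "Polygon"):
            geometry = placemark.find(f"kml:{kind}", NS)
            if geometry is None:
                continue
            coords = geometry.find(".//kml:coordinates", NS).text
            summary.append((name, kind, len(parse_coordinates(coords))))
    return summary


for row in summarize(KML_TEXT):
    print(row)
```

Note that the ring of the polygon repeats its first tuple at the end, which is what makes it a closed line.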
2.3.3. Visualization aspects
KML offers a long list of elements focused on the visualization of geographic content. For instance, there are elements to define the style of any KML element that can be visualized, such as lines, polygons and icons. The author of a file can also control how the end user will see the content in a geobrowser. To understand this it is helpful to imagine a virtual camera controlled with the keyboard or the mouse. This camera has parameters such as its position in three dimensions and its orientation. These parameters can be specified within a KML file to define what could be called views, which draw the end user’s attention to a given feature or position on the globe.
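One way such a view can be written is with the KML `LookAt` element, which names the point being looked at and places the camera relative to it. The sketch below generates one; the helper name `look_at` and all numeric values are made up, and optional children such as `altitude` are omitted.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ET.register_namespace("", KML_NS)


def look_at(longitude, latitude, range_m, heading=0.0, tilt=0.0):
    """Build a LookAt element describing a virtual camera view."""
    element = ET.Element(f"{{{KML_NS}}}LookAt")
    values = {
        "longitude": longitude,  # point being looked at
        "latitude": latitude,
        "heading": heading,      # compass direction in degrees, 0 = north
        "tilt": tilt,            # 0 = looking straight down
        "range": range_m,        # metres between the point and the camera
    }
    # Python dicts keep insertion order, so the children are emitted in
    # the order listed above.
    for tag, value in values.items():
        ET.SubElement(element, f"{{{KML_NS}}}{tag}").text = str(value)
    return element


view = look_at(-0.0686, 39.9944, range_m=1500.0, heading=45.0, tilt=60.0)
view_xml = ET.tostring(view, encoding="unicode")
print(view_xml)
```

A `LookAt` element like this would normally be placed inside a feature, for example a Placemark, so that the geobrowser flies to the stored view when the feature is selected.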
2.3.4. Advanced aspects
Because events take place at a particular moment in time and usually evolve over it, KML defines elements that associate features with specific moments or periods of time. These elements are very useful for representing the evolution of a phenomenon (e.g. meteorological data) in geobrowsers, since features can be represented not just in space but also in time, approaching the use of four dimensions (three spatial plus one temporal). Another important element of the standard is ExtendedData, which allows custom XML code to be inserted into a KML file. Depending on which ExtendedData child elements are used, users can specify their own XML Schema (http://www.w3.org/XML/Schema) or import existing ones for use within the features defined in the file. This element is a good opportunity to enrich a KML file with extra information and metadata. As discussed in more depth in the next section, these elements are an effective solution for adding structured metadata or reusing existing metadata that follows a given XML Schema.
Finally, KML is a very flexible language that can be extended with new and specific features (by adding new elements derived from Object). A good example of new functionality implemented by extending its object-oriented model is the Tour object, through which users experience virtual tours over the globe in Google Earth. In this case Google extended the KML version 2.2 specification with new elements to enable a type of feature not supported by the standard itself.
Thanks to the growth in its use, its capabilities for visualizing and annotating geographic content, its relative simplicity and its easily extended model, some people consider that KML is in the process of becoming the Geospatial Web’s HTML.
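The time-related elements mentioned at the start of the previous paragraph can be sketched as follows. The example uses the KML elements `TimeStamp` (with a `when` child) for a single moment and `TimeSpan` (with `begin` and `end` children) for a period, and filters placemarks by date the way a geobrowser time slider might. The sample features and function names are invented, and the sketch assumes that every date is written as `YYYY-MM-DD`.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Invented sample features: one moment, one period, one without time.
KML_TEXT = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark><name>Storm observed</name>
    <TimeStamp><when>2009-03-14</when></TimeStamp>
  </Placemark>
  <Placemark><name>Flood warning</name>
    <TimeSpan><begin>2009-03-10</begin><end>2009-03-20</end></TimeSpan>
  </Placemark>
  <Placemark><name>Permanent station</name></Placemark>
</Document>
</kml>"""


def interval(placemark):
    """Return the (begin, end) dates of a feature; None means unbounded."""
    when = placemark.findtext("kml:TimeStamp/kml:when", namespaces=NS)
    if when is not None:
        day = date.fromisoformat(when)
        return day, day
    begin = placemark.findtext("kml:TimeSpan/kml:begin", namespaces=NS)
    end = placemark.findtext("kml:TimeSpan/kml:end", namespaces=NS)
    return (
        date.fromisoformat(begin) if begin else None,
        date.fromisoformat(end) if end else None,
    )


def visible_on(kml_text, day):
    """Names of the placemarks whose time interval contains `day`."""
    names = []
    for placemark in ET.fromstring(kml_text).iterfind(".//kml:Placemark", NS):
        begin, end = interval(placemark)
        if (begin is None or begin <= day) and (end is None or day <= end):
            names.append(placemark.findtext("kml:name", namespaces=NS))
    return names


print(visible_on(KML_TEXT, date(2009, 3, 12)))
```

Treating a feature without any time element as always visible is a design choice of this sketch rather than something prescribed by the text.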
3. Traditional discovery mechanisms
When somebody goes to a bookstore looking for a good thriller, he or she will probably start in the thriller section. The potential buyer then looks for a specific title that best matches what he or she wants, based for instance on the historical period or the place where the story is set. At first sight the title may not provide enough information, so the buyer reads the summary on the back cover, which offers more information about the book and sometimes about the author and the reviews. If this is still not enough, the buyer might open the book to check the publisher details or the year of publication and, as a last resort, start reading it to see whether it really is the book that best matches his or her wishes.
This example illustrates the steps followed when somebody tries to discover some kind of information. The first step is to decide where to start looking. In the example the buyer wants a thriller, so the obvious action is to start in the thriller section of the bookstore. The equivalent on the Web and Web 2.0 would probably be to open a web search engine such as Google, Yahoo or Microsoft’s Bing in the web browser, or a more specialized site depending on the user’s preferences and knowledge. The second step is to check the book’s information against given criteria. This implies the use of information about the content. In the case of a book this is quite easy, since a book is composed of text; however, this is not always so when searching for resources in an environment as heterogeneous as the Web, which is full of non-textual content such as images, audio or video files and, ultimately, geographic content. Since most search interfaces on the Web are based on filling in search criteria, these resources must be properly described in order to be found. Resource descriptions are called metadata and are a key element in the discovery of any type of content, especially in distributed and heterogeneous networks like the Web.
Metadata can be defined as data about data (Craglia et al., 2007) and, in some cases such as the GIS area, also as data about services. Metadata aims to explain the meaning of data and services and to facilitate their understanding and use by different users or even automatic agents. The use of metadata is essential for a broad range of activities and applications, including the discovery of information, the cataloguing of resources and other forms of data processing. In the Geographic Information area, and more precisely in the study of Spatial Data Infrastructures (SDI), metadata is considered an essential component that acts as the glue that holds all the pieces together (Masser, 2005; Rajabifard et al., 2002).
The study of metadata is not new and a lot of effort has been invested in it. Since the first studies on metadata in computer science, several standards have appeared. Probably the most used and best known is the Dublin Core Metadata Initiative (DCMI) or simply Dublin Core (DC) (http://dublincore.org). This format defines a multipurpose metadata standard composed of a set of 15 basic elements designed to facilitate the discovery of electronic resources. DC is used in a broad range of applications, since its design allows basic information to be added to any type of content.
Among other reasons, DC is probably one of the most widespread metadata standards because of its simplicity and general purpose. This is in general a good characteristic for a standard; however, in some areas or applications more domain-specific information may be required for a given resource. This is the case, for instance, of the Geographic Information area. For this reason, several initiatives and standards have been created and are currently in use in the geospatial domain, among them ISO 19115 (ISO 19115:2003) and ISO 19139 (ISO 19139:2007). ISO 19115 is an international standard for geographic content metadata; it was created in 2003 by the International Organization for Standardization (ISO) and provides an abstract model for the organization of geospatial metadata. ISO 19139 standardises the expression of 19115 metadata using XML and its derived logical model.
The complexity and requirements of most current metadata standards for geographic information reserve them for the more specialized users. These metadata formats do not seem to be a solution for describing content created by non-specialized users, who may not be as interested in describing content exhaustively as more experienced users and professionals. It is therefore necessary to find ways of adding a minimal description to all this new content that can be easily performed by any user and that at the same time offers enough information for data discovery and cataloguing. These new solutions should be as simple as possible, allow the use of existing resources and standards and, at least in a first stage, probably be approached from a format-dependent perspective.
Catalogues are currently the most used method for indexing geographic content in the GI field. These systems perform all the operations that define the term indexing in the context of search systems: catalogues collect, process and store data for later retrieval. The collection process commonly requires the user’s interaction with the system, since in most cases the content must be uploaded manually together with its metadata. This metadata is the key element in the indexing process, since it is the information used for content retrieval, rather than the content itself.
GeoNetwork (http://geonetwork-opensource.org) is one of these catalogue applications and also implements the OGC Catalogue Service standard specification. Data providers simply upload their geographic data and metadata directly to the catalogue service application in order to make them available to others. Geographic data and metadata are then processed and stored in the application so that they can be retrieved through the discovery interfaces exposed by GeoNetwork. This process involves the immediate publication, storage and accessibility of the content, but also human interaction and supervision.
3.3. Web search engines
Besides catalogues, web search engines are probably the most used discovery mechanism for general-purpose searching. Web search engines carry out three main tasks: crawling, indexing and searching. Web crawlers, also known as spiders or bots, perform the crawling automatically. This task consists of visiting as many resources as possible, collecting their information for indexing and also recording all the links found in those resources for later analysis. Indexing, in the Web and more specifically in the search engine area, refers to the process by which data is collected, processed and stored in a convenient manner to facilitate its retrieval, as catalogues do specifically for geographic content. Essentially, all the relevant information is extracted from the resources and stored in a database or index for rapid retrieval. Finally, retrieval or searching is performed through specific interfaces. Usually these interfaces accept text or keywords that are then looked up in the index.
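The three tasks described above can be sketched in a few lines of Python. This is a toy model under strong assumptions, not a description of how any real search engine works: the "Web" is a hypothetical in-memory dictionary of three pages, crawling is a breadth-first traversal of their links, the index is a simple inverted index from words to URLs, and a search returns the pages containing every query word.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.org/a": ("castellon hiking routes kml",
                             ["http://example.org/b"]),
    "http://example.org/b": ("valencia bike routes",
                             ["http://example.org/a", "http://example.org/c"]),
    "http://example.org/c": ("castellon beaches map", []),
}


def crawl(seed):
    """Visit pages breadth-first, following links from the seed URL."""
    visited, seen, queue = [], {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        if url not in PAGES:  # unreachable resource: skip it
            continue
        visited.append(url)
        for link in PAGES[url][1]:
            if link not in seen:  # register new links for later analysis
                seen.add(link)
                queue.append(link)
    return visited


def build_index(urls):
    """Build an inverted index: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url in urls:
        for word in PAGES[url][0].lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


index = build_index(crawl("http://example.org/a"))
print(search(index, "castellon routes"))
```

Unlike a catalogue, nothing here is submitted by a user: the content is collected automatically from a starting point, which is the property that makes the approach attractive for user-generated data.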
4. Mechanisms for discovering user-generated data
4.1. KML as a metadata container
Although KML is probably best known for its simplicity and its visualization capabilities, the standard also defines a set of elements that allow further metadata about the content to be added in formats that can be reused for other purposes. Rather than being a format focused only on the visualization of data, KML can act as a metadata container, i.e., a format suitable for transporting metadata together with the data itself.
4.1.1. Identifying elements
When adding metadata to a KML file, the first question to answer is where to put the descriptive information. As explained in the Background section, the KML standard specification defines a set of elements that can contain not just the description of a given feature but also more detailed information. Metadata is usually expressed as text. This limits the number of KML elements that can contain metadata, since some elements are not designed to hold textual information. This is the case of elements that only carry visualization information, for instance those used to specify the virtual camera position or the geometry of a feature.
However, KML offers other elements that can transport textual metadata effectively. Looking at the KML schema in Figure 1, the best options seem to be the elements derived from the abstract element Feature. This makes sense, since somebody could be interested in adding metadata about a placemark, about other resources linked within the document, about imagery, or about documents and folders of KML resources. Thanks to these last two KML elements the user can also decide at which level metadata should be added. The elements Document and Folder are designed to organize and contain other KML features such as Placemark, NetworkLink or even other Folder and Document elements. Consequently, the user can add metadata to a given Placemark, but also to the Document or Folder that contains it. Imagine a scenario in which a set of Placemark elements is organized inside a Document element, which in turn is part of a Folder that contains other Document elements with different Placemark elements. In this scenario the user can add placemark-specific metadata to each Placemark, add metadata at Document level with information common to all the Placemark elements in that Document, and finally add information shared by all the documents at Folder level. In brief, KML allows the granularity of the metadata added to a project to be chosen with more or less detail.
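The Folder, Document and Placemark scenario just described can be sketched as follows. The example stores a piece of descriptive text at each level and then, for every placemark, gathers the texts that apply to it from the outermost container inwards. Treating container-level descriptions as applying to the placemarks inside them is an interpretation made by this sketch, not behaviour defined by KML; the sample data and the function name `collect` are invented.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
CONTAINERS = {f"{{{KML_NS}}}Folder", f"{{{KML_NS}}}Document"}
PLACEMARK = f"{{{KML_NS}}}Placemark"
DESCRIPTION = f"{{{KML_NS}}}description"
NAME = f"{{{KML_NS}}}name"

# Invented sample: a Folder holding two Documents with their placemarks.
KML_TEXT = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Folder>
  <description>Project: coastal survey</description>
  <Document>
    <description>Campaign: spring</description>
    <Placemark><name>Site 1</name><description>Sandy beach</description></Placemark>
    <Placemark><name>Site 2</name></Placemark>
  </Document>
  <Document>
    <description>Campaign: autumn</description>
    <Placemark><name>Site 3</name><description>Rocky cove</description></Placemark>
  </Document>
</Folder>
</kml>"""


def collect(element, inherited=()):
    """Map each placemark name to the descriptions that apply to it,
    ordered from the outermost container down to the placemark itself."""
    result = {}
    own = element.findtext(DESCRIPTION)
    chain = inherited + ((own,) if own else ())
    if element.tag == PLACEMARK:
        result[element.findtext(NAME)] = list(chain)
    for child in element:
        if child.tag in CONTAINERS or child.tag == PLACEMARK:
            result.update(collect(child, chain))
    return result


metadata = collect(ET.fromstring(KML_TEXT))
for name, descriptions in metadata.items():
    print(name, descriptions)
```

Information common to many features is written once, at the level where it applies, instead of being repeated in every placemark.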
4.1.2. Identifying child attributes
Once the suitable elements that could act as metadata containers are identified, the next step is to choose which of their child elements can contain metadata information.
Since all the elements previously described inherit from the Feature element, all of them share some common child elements. Some of them allow the inclusion of textual information or metadata such as Title, Snippet and Description. The objective of the element Title seems obvious, to describe the title or name of a Feature within the KML file. In almost all the existing metadata standards the title or name is present. The element Snippet allows the users to add a short description about the feature that is associated with it. This description although it is intended to be short (around 2 lines of text), could contain important information about a feature that could help in its understanding and use. The purpose of the element Description seems clear, to offer a comprehensive description about a given feature.
Although its objective is the same as Snippet, the element Description offers some advantages since it supports unlimited text extension and it can store fragments of HTML code on it. The user can add metadata about a given feature using this element however some considerations should be observed. First, the user should consider that the main objective of the Description element is to offer a description about the related feature, however it is possible to add not just the description but also other information that could be useful as well. In this way it is possible to add a text description including all the information that could be found in a DC file expressing metadata about a given geographic feature. Although this could be a valid solution since all the metadata is accessible and carried within the file and with the described feature, its unstructured format could become problematic for automatic processing tasks.
To solve the problem of adding unstructured information and to allow the inclusion of custom XML data within a KML file, the standard offers the ExtendedData and its child elements. This data can involve any kind of information including metadata about a given feature. Depending on these last elements the user has three methods to add fragments of XML code in a KML file:
The first method specifies how the users can add simple key-value pairs in XML format to any feature in the KML file using the ExtendedData element and its child element Data. The element Data contains an attribute called name intended to store the key of the pair. Data has two child elements as well, the first one the displayName element to specify alternative names for visualization and the element value that stores the pair’s value. This structure does not follow any previously defined schema allowing the user to insert data in an ad-hoc fashion however it does not allow the creation of complex or nested XML structures. The users can use this method to add structured metadata however restricted with the previous limitations.
The second method involves the specification of a schema at Document level within a KML file. This schema is created using the KML element Schema and its child element SimpleField. The Schema element contains an attribute storing the schema’s name and another one doing the same with the id that in this case is mandatory since the schema could be referenced by this value. The different elements in the schema are specified using the element SimpleField that offers attributes to specify the name and type of the custom element and an alternative name for visualization. The Schema element neither allows the creation of complex structures with nested elements. Once the schema is defined, this can be used via the element SchemaData. This element references a given schema that can be placed in the same KML file or in another indicating its identifier. The child element SimpleData makes reference to any SimpleField declared in the schema and allows its filling with values of the type declared by its referenced SimpleField element.
Finally the last and probably most powerful method allows the import and use of any externally defined XML Schema without any limitation about the elements or structure defined by it. In this case it is possible to import a given schema associating it with a namespace in the KML file. Using this namespace any element defined in the original schema can be used within the KML file. This method also allows the creation of nested or complex XML structures. For instance, thanks to this method the users can import a metadata schema such as the ISO19115 into their KML files and create a metadata structure as defined by the ISO standard for describing the content.
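The three methods can be compared side by side in a single hand-written KML fragment, read back with Python's `xml.etree.ElementTree`. This is a sketch under several assumptions: the field names, values and schema id are invented; Dublin Core elements stand in for an external schema because they are shorter than an ISO 19115 structure; the reader only resolves `schemaUrl` references that point to a schema in the same file; and only a few field types are converted.

```python
import xml.etree.ElementTree as ET

NS = {
    "kml": "http://www.opengis.net/kml/2.2",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Only a few of the field types are converted in this sketch.
CASTS = {"string": str, "int": int, "float": float, "double": float}

KML_TEXT = """<kml xmlns="http://www.opengis.net/kml/2.2"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
<Document>
  <Schema name="TrailInfo" id="TrailInfoId">
    <SimpleField type="string" name="difficulty">
      <displayName>Difficulty</displayName>
    </SimpleField>
    <SimpleField type="double" name="lengthKm"/>
  </Schema>
  <Placemark>
    <name>Sample trail</name>
    <ExtendedData>
      <!-- Method 1: ad-hoc key-value pair -->
      <Data name="surveyor">
        <displayName>Surveyed by</displayName>
        <value>Jane Doe</value>
      </Data>
      <!-- Method 2: typed fields declared in a Schema -->
      <SchemaData schemaUrl="#TrailInfoId">
        <SimpleData name="difficulty">easy</SimpleData>
        <SimpleData name="lengthKm">7.5</SimpleData>
      </SchemaData>
      <!-- Method 3: elements from an external namespace -->
      <dc:title>Sample trail</dc:title>
      <dc:creator>Jane Doe</dc:creator>
    </ExtendedData>
  </Placemark>
</Document>
</kml>"""


def read_extended_data(kml_text):
    """Return (untyped, typed, external) metadata of the first placemark."""
    root = ET.fromstring(kml_text)
    # Field types declared by each Schema, keyed by schema id.
    schemas = {
        schema.get("id"): {field.get("name"): field.get("type")
                           for field in schema.iterfind("kml:SimpleField", NS)}
        for schema in root.iterfind(".//kml:Schema", NS)
    }
    extended = root.find(".//kml:Placemark/kml:ExtendedData", NS)
    untyped = {data.get("name"): data.findtext("kml:value", namespaces=NS)
               for data in extended.iterfind("kml:Data", NS)}
    typed = {}
    for block in extended.iterfind("kml:SchemaData", NS):
        fields = schemas[block.get("schemaUrl").lstrip("#")]
        for item in block.iterfind("kml:SimpleData", NS):
            cast = CASTS.get(fields[item.get("name")], str)
            typed[item.get("name")] = cast(item.text)
    dc_prefix = "{" + NS["dc"] + "}"
    external = {child.tag[len(dc_prefix):]: child.text
                for child in extended if child.tag.startswith(dc_prefix)}
    return untyped, typed, external


untyped, typed, external = read_extended_data(KML_TEXT)
print(untyped, typed, external)
```

The practical difference shows in the second result: because the field type is declared in the Schema, `lengthKm` can be returned as a number, whereas the key-value pair of the first method is always plain text.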
With the above elements users can add all the required metadata to a KML file. By default KML also offers fields for information about the authorship of the content. This information is provided through elements imported from the Atom schema (http://www.ietf.org/rfc/rfc4287.txt) and includes the author’s name and a link to a related website or email address. Although brief, this information is still useful and can act as contact information, which is common to most metadata standards.
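The authorship fields can be sketched as follows. The function name, author name and URL are hypothetical; the element names atom:author, atom:name and atom:link are those of the Atom namespace that KML reuses.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", KML_NS)
ET.register_namespace("atom", ATOM_NS)


def document_with_author(title, author_name, author_url):
    """Build a Document carrying Atom authorship information."""
    document = ET.Element(f"{{{KML_NS}}}Document")
    ET.SubElement(document, f"{{{KML_NS}}}name").text = title
    author = ET.SubElement(document, f"{{{ATOM_NS}}}author")
    ET.SubElement(author, f"{{{ATOM_NS}}}name").text = author_name
    # The link is an empty element whose href attribute holds the address.
    ET.SubElement(document, f"{{{ATOM_NS}}}link", {"href": author_url})
    return ET.tostring(document, encoding="unicode")


# Hypothetical author and website, for illustration only.
authored_kml = document_with_author(
    "Sampling stations", "Jane Doe", "http://www.example.org/"
)
```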
4.2. Reusing existing metadata
Since the appearance of metadata standards such as Dublin Core or ISO19115, a lot of geographic content has already been catalogued and documented using these or other metadata formats. Because KML is a very recent format and OGC standard, the quantity of content expressed in it is relatively low compared with the quantity of information in older formats such as OGC’s GML (http://www.opengeospatial.org/standards/gml) or ESRI’s Shapefiles (http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf). However, thanks to its widespread use in a broad range of services, some of which involve user participation through web platforms, the quantity of content generated in KML is increasing rapidly.
KML is increasingly offered as an output format by geographic content servers such as GeoServer. Likewise, the number of tools for converting to and from KML is growing. Such conversion makes existing geographic content available in KML, so that it can be reused and merged with the fresh, evolving geographic content built up through the geographic services running on the Web 2.0.
Just as the content can be transformed and reused, so can the metadata. Regardless of whether the metadata is stored internally or in a separate file, when content is converted from a format such as Shapefile to KML the metadata can also be carried within the file in its new format. As seen previously, there are various alternatives for adding metadata to KML; however, the vast majority of existing metadata for geographic content is expressed in XML-based formats. This makes ExtendedData the KML element of choice for transporting existing metadata. With this element it is possible to add custom XML data to any feature, and hence to include the existing metadata for that feature.
The first two methods described earlier do not allow the creation of complex data structures, but they can easily be used to emulate metadata schemas such as Dublin Core. Although this is a possible solution, it is clearly not the best option, because it mimics the metadata schema without actually using it. For this reason the last method, which combines the ExtendedData element with the import and use of external XML schemas, seems the best one. In this case the user only needs to import the schema and assign it a namespace, which then prefixes every element in the KML file that is declared in the imported schema. There is then no limitation on structure, types or any other factor. With this option, including the metadata of converted files is also trivial to implement.
4.3. General search engines for indexing geographic content
Nowadays, anybody who checks a web traffic analytics service such as Alexa (http://www.alexa.com) and examines the top sites by visits per day will see very representative data about the main trends in the use of the Web. The first 10 positions in the list are usually distributed among search engines and Web 2.0 sites or services such as video sharing services, social networks, wikis or blogs. These results reflect the huge impact and relevance of these types of services. Some Web 2.0 services rely on user participation to create fresh content, and they are becoming more and more popular, counting their users in the hundreds of millions.
Search engines, in contrast, appeared years ago to solve the problem of finding resources and information in an expanding Web. They are still heavily used and, most importantly, still useful. In fact they may now be more necessary than ever, since the quantity of content is growing at such a rate that finding specific information is almost impossible without these tools unless the corresponding address is known. Over the years, as new data types and formats have become popular, these tools have also been upgraded to look for and index such information. For example, it is currently possible to search within data files in formats such as Adobe Acrobat PDF, Shockwave Flash, Microsoft Word or Autodesk DWF, among others.
Web search engines have extended their capabilities to search over popular file formats and, more recently, also support the discovery and indexing of geographic content expressed in KML and KMZ. The interest of the major search engines in geographic content is not limited to searching it; it also covers the different ways in which this content can be used. Thus these companies provide visualization services such as web mapping or geobrowsers to display the geographic content they gather, editing tools to create more content (which increases its quantity and improves its utility), and APIs and mashups to create new applications and services based on the geographic content they manage. Currently Google, Yahoo and Microsoft own the three major web search engines, and all of them provide services and tools for geographic content.
The way in which content is published for discovery by web search engines differs notably from the way it is published in catalogues. In short, search engines use crawling processes that browse the Web automatically, navigating between resources by following the hyperlinks present in web sites, and collect the desired information for indexing. The first advantage of crawling is that it requires no user involvement beyond placing the content on a publicly accessible site (e.g. a public web server). On the other hand, unlike content registered in a catalogue, crawled content is not immediately discoverable, because the crawling process takes some time to process and index the large quantity of information involved. Moreover, since users have no control over the crawling process, uncertainty about its success and performance is another price to pay for the ease of automatic batch processing. We can conclude that, instead of the detailed, user-driven publication method followed by catalogues, search engines offer automatic batch collection of geographic content. This method could be an effective solution for organizing all the constantly created content that does not fit the traditional cataloguing process.
5. Use case: Indexing user-generated data in Google
Google is one of the most representative companies in the advent of the Geospatial Web and has also found its place in the GI business. The company offers a wide range of geographic services, applications and data that is continuously improved and extended. These include the web mapping service Google Maps, the geobrowser or virtual globe Google Earth, and the Google Maps and Google Earth APIs, which allow third-party web applications to interact with and use those services. Besides these main services, Google also supports the development and extension of KML, as well as 3D models that can be added for visualization in Google Earth. The company also offers editing tools for creating new geographic content, such as My Maps, and even new cartography for countries where it is limited, through tools such as Google Map Maker. These last two tools are a clear example of the integration between GI and the Web 2.0, which results in collaborative tools for creating and sharing geographic and cartographic content. Finally, Google currently supports KML, KMZ and GeoRSS for expressing geographic content in its services.
Google maintains an index with the processed results of its crawling processes. This index primarily contains textual information extracted from HTML documents and the like, which is unsurprising given the quantity of this type of resource on the Web. It is the index queried when users perform searches through Google’s main website.
Regarding geographic content, Google appears to use a specific index for searches made through its geographic services and tools. We will refer to this index as Google’s Geo-Index in the rest of the document. The Geo-Index is fed by a variety of sources, which can be divided into two main groups depending on whether or not the content is stored on Google’s servers. The first group comprises all the geographic content generated through Google’s creation and editing tools, such as My Maps or Google Map Maker. This information is stored directly on Google’s servers, so its insertion in the Geo-Index is very fast, if not immediate. The second group of content is spread across the Web and includes a broad range of distributed sources.
One important source worth mentioning is the main index that Google maintains for general web searches. Apparently, some file formats are identified automatically when the index is built, which makes it possible to run custom searches restricted to a desired file format. Among other formats, users can choose to obtain only KML and KMZ files as results. Another example, one that combines locally stored data and remote resources, is the list of businesses that Google maintains, together with its associated services, known as Google Local Business Center. These services advertise companies and their products in a yellow-pages-like way. The interesting point is that users can search for companies or services in a given area and obtain the businesses close to it through Google Maps or Google Earth. To build the business directory, Google lets business owners register their businesses and stores the information in its service, but it also extracts information automatically from distributed sources such as Yellow Pages and others.
Not all the available content concerns commercial products or businesses. Google also obtains and displays content from other sources, including, for example, geotagged videos and photos published in online galleries, articles in web-based encyclopaedias that reference a given location, and KML or KMZ files created by anonymous users or entities and stored on remote servers accessible to crawlers. This is known as user-generated content. The path it follows from its creation by the author to its visualization by other users is the best example of discovery of geographic content on the Web performed by search engines, in keeping with the Web 2.0 principle that users freely create and consume content.
As explained previously, placing KML or KMZ files on a public server should be enough to get them discovered, indexed and then accessible through any of the geographic services that use Google’s Geo-Index. However, the crawling process is not fully effective, and in some situations content available on the Web is not found by the crawlers. Despite this drawback, the crawling process still offers many advantages and represents a new, easy and practical method for discovering user-generated geographic content.
To test the method explained above, an experiment was conducted in two rounds. The main objective of the study was to obtain realistic measures of the performance and effectiveness of the discovery process, taking crawling and indexing as its key elements. The retrieval of indexed geographic content is a fundamental operation that is performed by the search engine system and invoked by end users through Google Maps, Google Earth and their corresponding APIs. Retrieval involves very sophisticated techniques and a long list of parameters, most of which are unknown and often kept secret (e.g. the ranking algorithm used to order results). For this reason the retrieval process was not analyzed in detail, and the study focused on measures for evaluating the crawling and indexing processes.
5.2.1. Assessment indicators
The experiment consisted of reproducing a real-case scenario for user-generated content: a set of different KML files with specific information and characteristics was published on a web server configured to allow access to Google’s crawler. The proposed indicators for assessing the study were:
The time elapsed in crawling the content. Although this method of sharing user-generated content offers some advantages, an excessive delay before the search engine system discovers the content could make it an unviable solution for some use cases.
The elements within a KML file that become indexed. As seen before, KML files offer different elements in which to store information about the data visualized. Although Google recommends some good practices for creating KML content, there is little information about all the fields that could store information or metadata, or about the ones that Google actually uses. Establishing where descriptive information or metadata can be placed within a KML file so that Google uses it would improve the design of such files and their chances of being indexed.
The effectiveness of the process, measured as the number of files indexed. As with ordinary HTML pages, not all content becomes indexed and accessible to the end user, usually for reasons related to access to the content or to the content itself. A low number of indexed files could mean that the method is ineffective for the majority of use cases.
5.2.2. Data preparation
Google gives content creators some advice to improve the chance of getting their content indexed. This advice focuses mainly on improving the visibility of the content and helping Google to find it, through the following steps:
Create the KML or GeoRSS content. Be sure to add attribution tags, which will appear in the Google Search results for your content.
Post your files on a public web server.
Create the Sitemap file. Copy this file to the directory of your website.
Submit your Sitemap to Google.
Google also recommends providing the content with meaningful and descriptive information in specific parts of the file:
Give your document a meaningful name.
Provide a relevant description for each placemark so that the user can see the context of the search results.
If you have a big quantity of data, divide it into topic-specific layers.
Give each feature an “id” so that the search result can link directly to it.
These parts comprise the file’s name and the KML elements Name and Description; nothing is said, however, about any other elements that could be used. Another important point to consider is where to place the metadata within a KML file (in Placemark elements or in Container-derived elements such as Document or Folder). Again there is little information, and since this flexibility is an important advantage of using KML, it should also be tested.
Following Google’s advice, the first step is the creation of content in either KML or GeoRSS. The experiment comprised several test datasets, each made up of a set of different KML files. Within each dataset the files differ in the KML elements used to store the information, in the level inside the file at which the information is placed, and in the use of descriptive or non-descriptive file names. The KML elements used in the study were those previously considered suitable for containing metadata: Name, Description, Snippet, NetworkLink and ExtendedData (including its child elements designed for storing custom XML within KML code). The idea was to place information that could serve as metadata for the content of the file in a given element, in order to discover which of these elements, and at which level, is Google-friendly as far as indexing is concerned. However, since one of Google’s recommendations implies the use of both Name and Description in a file, both elements were used in the study on their own and in combination.
The NetworkLink element was used to link to PHP scripts that return a KML file as output, in order to discover which of the files get indexed. The purpose was to find out whether the KML file containing the NetworkLink also appears as a result when the information in the PHP script’s output is indexed, since the former links to the latter. This chaining would be useful for creating linked structures of information, for instance when grouping different sources of information in a single KML file via NetworkLink elements. The use of the ExtendedData element and its child elements in this test is also worth mentioning. The ISO19115 metadata standard was represented with each of the three techniques: it was recreated as a similar structure with the same elements using Data, and using Schema and SchemaData, and it was also used directly by importing its schema. The KML elements that are not restricted to a specific position in the KML structure (unlike, for example, Schema at Document level or SchemaData at Placemark level) were placed at each of the two logical levels considered in the study separately and at both together. These two logical levels, which are independent of the vocabulary used in KML, are feature and document. The feature level is where the simple or atomic elements of geographic content reside; Placemark and NetworkLink elements can be considered atomic elements of this type. The document level is where a container for atomic elements is defined. In terms of the KML specification this corresponds to the Container child elements Document and Folder, which can contain other features and group them in a logical structure. Such logical structures could be extended by nesting new container elements in other containers, and so on.
However, for the purpose of the study a simple structure with two logical levels is enough to assess how effectively files are indexed depending on where the information is placed. The combinations of KML elements used and the levels at which they were placed are shown in Table 1:
The second step consisted of posting the content on a public server, where public means that both the crawlers and the end users can access the files. The following step requires the creation and appropriate placement of a Sitemap file (http://www.sitemaps.org). The Sitemap specification defines an XML-based language designed to give specific information about a web server’s public resources (e.g. HTML pages, images, etc.). It defines a set of elements for describing lists and groups of resources as well as individual properties such as a resource’s URL, type, creation date and update frequency, among others. Files of this kind help the crawlers to collect the public elements on the server and also add valuable information for some tasks, such as the update periods. To support the discovery of user-created geographic content, Google has created an extension of the Sitemap specification designed specifically for referencing geographic content. The extension lets users flag such content with the tag geo, through which they can also specify the content’s file type (KML, KMZ or GeoRSS). Finally, it is important to note that users can flag as geographic content any resource that can be interpreted as one of these file types, without it necessarily being such a file. One of the most interesting consequences is that users can mark as geographic content scripts (e.g. in Python or PHP) or applications whose execution output is data in KML, KMZ or GeoRSS format, so that they too can be crawled and indexed.
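A Sitemap with geographic entries of this kind could be generated along the following lines. This is a sketch under stated assumptions: the geo namespace URI and the geo:format child element are written as recalled from Google’s documentation of the extension and should be checked against it, and the function name and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumption: namespace URI of Google's geo extension to the Sitemap format.
GEO_NS = "http://www.google.com/geo/schemas/sitemap/1.0"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("geo", GEO_NS)

ALLOWED_FORMATS = {"kml", "kmz", "georss"}


def build_geo_sitemap(entries):
    """Return a Sitemap document listing (url, format) pairs as geo content."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for location, geo_format in entries:
        if geo_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported geo format: {geo_format}")
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = location
        # Assumption: the file type is declared in a geo:format child element.
        geo = ET.SubElement(url, f"{{{GEO_NS}}}geo")
        ET.SubElement(geo, f"{{{GEO_NS}}}format").text = geo_format
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical resources: a static KML file and a script whose output is KML.
sitemap_text = build_geo_sitemap(
    [
        ("http://www.example.org/data/stations.kml", "kml"),
        ("http://www.example.org/scripts/stations.php", "kml"),
    ]
)
```

The second entry illustrates the point made above: the resource is a script, but it is declared as KML because that is the format of its output.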
The last step was the submission of the Sitemap file to Google. This action tells Google which of the users’ sites need to be crawled and also gives users feedback about the process. It is carried out with the web-based tools of the Google Webmaster Tools service, which are intended to improve visibility and facilitate the discovery of sites and resources. Among other utilities, users can register a website URL and then submit a Sitemap file listing the resources to be indexed.
5.2.3. First-round results and discussion
During the experiment all of these steps were followed in order to obtain a measure for the best-case scenario. Across the different test sets, an average of three weeks elapsed between the publication of the KML files on the server and the appearance of the first results when querying the Google Web Search website, Google Maps and Google Earth.
A delay of three weeks may be acceptable if rapid publication and availability are not a requirement. Otherwise, other alternatives should be considered, including the use of traditional catalogues or the direct creation of content with Google’s tools, which offer an easy way of getting content indexed rapidly because it is stored directly on Google’s servers. The latter option has the limitation that the content is constrained by the editing tools offered. For instance, the currently available tools support only a subset of KML features and run on a web-based platform, so the creation of 3D content is either impossible or fairly complicated. In addition, it is necessary to consider those cases where the content is updated frequently, or at least more frequently than the search engine system refreshes its index. In such cases a possible approach, which mitigates the problem without solving it completely, is to separate metadata and data. This separation can be implemented by adding the metadata in the suitable KML elements but loading the data through the NetworkLink element. The metadata then remains static and indexed without any need to modify the Geo-Index, while the data is updated on a periodic basis. The NetworkLink acts as a pipe to the data, and the file with the metadata becomes the starting point for visualizing the updated content. Finally, for cases where both metadata and data change over time, other solutions must be found.
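The separation of metadata and data described above can be sketched as a small, static KML file whose NetworkLink carries the descriptive text and points at the changing data. The function name, URL and refresh interval below are hypothetical; the sketch does not claim that a file built this way will be indexed.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ET.register_namespace("", KML_NS)


def q(tag):
    """Return the tag qualified with the KML namespace."""
    return f"{{{KML_NS}}}{tag}"


def metadata_wrapper(name, description, data_url, refresh_seconds):
    """Build a static KML file: metadata in place, data behind a NetworkLink."""
    kml = ET.Element(q("kml"))
    network_link = ET.SubElement(kml, q("NetworkLink"))
    # Static, descriptive part of the file.
    ET.SubElement(network_link, q("name")).text = name
    ET.SubElement(network_link, q("description")).text = description
    # Dynamic part: the client re-fetches the data at a fixed interval.
    link = ET.SubElement(network_link, q("Link"))
    ET.SubElement(link, q("href")).text = data_url
    ET.SubElement(link, q("refreshMode")).text = "onInterval"
    ET.SubElement(link, q("refreshInterval")).text = str(refresh_seconds)
    return ET.tostring(kml, encoding="unicode")


# Hypothetical data source refreshed once an hour.
wrapper_kml = metadata_wrapper(
    "River gauge readings",
    "Hourly water level measurements for the study area.",
    "http://www.example.org/scripts/readings.php",
    3600,
)
```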
Once the crawling time had been obtained, what remained was to assess the effectiveness of the process and the elements that take part in it. The results obtained after three weeks were considerably negative given the number of files published for indexing. Of a total of 129 files, counting both the KML files and the PHP scripts used, only 56 were found through Google’s main website when filtering results by KML type, and seven were found through either Google Maps or Google Earth. These figures represent 43% and 5% of the test files, respectively.
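The two percentages follow directly from the counts reported above, as this short check shows (the function name share is hypothetical):

```python
def share(found, total):
    """Return the percentage of `total` represented by `found`, to one decimal."""
    return round(100 * found / total, 1)


TOTAL_FILES = 129  # KML files plus PHP scripts published in the first round

general_index_share = share(56, TOTAL_FILES)  # files found via the main website
geo_index_share = share(7, TOTAL_FILES)       # files found via Maps or Earth
```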
These figures show that a low number of files reached the general index and a much lower number reached the Geo-Index. Evidently some pruning takes place between publication and indexing in the general index, but it is even more evident that between the two indexes there is another phase in which a large number of files are discarded.
It is well known that many factors affect the correct indexing of websites by Google. Some of them are meant to prevent malicious attempts to improve the visibility of a website through practices that Google does not consider acceptable and that may be penalized (http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769). One of the most common is content duplication, which is usually done to increase the number of results in which a given site appears and which is commonly detected by Google’s crawlers. When this behaviour is detected, only one of the sites containing the information is kept in the index; the rest are dismissed, and sometimes penalties are applied to the site or the content creator. Since much of the information, although placed in different KML elements, was exactly the same in a large number of files, the possibility that this affected the process was considered. In fact this behaviour affects the indexing of content in the general index and could explain the low rate in the first case.
Since not only the information but also the position (coordinates) was shared among the files, all this duplicated information should be avoided in further tests. Duplicated content could explain the loss of files in the general index; however, there is a second pruning between the general index and the Geo-Index. The reason could be the same as in the first case, but this would imply that the Geo-Index applies much more restrictive methods against duplicated content than the general index does. Although this is possible, other factors could also affect the indexing, including whether the files use KML elements that are suitable for indexing.
The seven files successfully indexed shared one characteristic: all of them contained information in the element Name at feature level. Although some of them also contained information in the element Description, the feature’s name seemed to be the key element. The indexed files also included both descriptive and non-descriptive file names in no apparent pattern, which probably means that the file name was not a significant factor for indexing. It is worth mentioning that some of the indexed files corresponded to PHP scripts, which confirms that the indexing process works with both static and dynamically generated KML files; however, none of the files with NetworkLink elements were indexed. It therefore seems that both duplicated content and the choice of the KML elements containing the metadata affected the indexing process.
5.2.4. Second-round results and discussion
Given the low number of files indexed and the need to know which KML elements are suitable for holding metadata, a second test was conducted. This time the files were designed to avoid the problems caused by duplicated content, so that the impact of the KML elements on indexing could be analyzed.
The new test consisted of a single test dataset with 23 KML files. These files were completely different from one another in terms of coordinates, information, elements and the level at which the information was placed.
This second-round test resulted in four files indexed, representing 16% of the total. Again the success rate was very low. One meaningful result was that in this second attempt the results did not appear in Google’s general index but only in the Geo-Index. The indexing system not only processes the data that the crawlers collect but also updates it periodically, adding new versions and deleting old content. It seems that in a first stage the Geo-Index is fed from the general index. Once the information is identified as geographic content, however, the main index removes it, and from then on its indexing is updated and managed exclusively by the Geo-Index.
Again all the files present in the search results followed the same pattern: the metadata was placed in the element Name at feature level. Once a file is indexed, all the content placed in it is indexed as well. For instance, in the indexed test files where both Name and Description elements were used, the information contained in the element Description was also accessible and used in searches, which demonstrates that this information is also present in the Geo-Index. These results show that the presence of information in the element Name at feature level is a requisite for the indexing of a KML file. It is not, however, a guarantee that user-generated content will be indexed. During the test all the files that ended up in the index had this characteristic, but not all the files with information in that position were indexed. This means that some files are still pruned despite the use of the right KML elements. A comparison like the one in Figure 2, between the number of files with information in the element Name at feature level and the number of files finally indexed in the different tests, clearly suggests that content duplication also affects the indexing of geographic content.
This assumption seems to hold, since in the first test no more than 33% of the files with the same content were successfully indexed, whereas the indexing rate rose to 80% of the files when duplicated content was avoided.
Finally, we can conclude that, despite the existence of other factors, the use of the right KML elements and the duplication of content have a great impact on the indexing process. The requirement concerning the element Name is not necessarily a problem, since nothing is required of it beyond its presence in the file. However, the rejection of all the content that Google considers duplicated could become a real problem in specific situations. Consider, for example, a study that collects measurements on a specific theme in a small area. Where the measurements share a certain amount of content, their chance of being indexed decreases. This is a real problem, since much content in the GI field shares a large amount of information while still being different.
Catalogues are extremely useful and powerful tools for discovering geographic content. Their use, however, is better suited to professional users, since catalogues require some degree of knowledge or experience for metadata creation and publishing.
There are currently many web-based tools and services designed to create, modify, share and visualize geographic content. These tools, together with the widespread use and better availability of positioning devices such as GPS receivers, create an ideal scenario for new users. Non-expert users can now create and share geographic content, a task previously reserved for professionals. Despite some remaining issues, such as data quality, user-generated content is being created rapidly and in huge quantities. These factors limit the use of catalogues as effective solutions for managing the continuous proliferation of fresh content.
In this situation, methods for batch discovery could be effective solutions for collecting and ordering all this data. Web search engines have performed this task effectively for years for ordinary web content such as HTML pages. In recent years, with the appearance and popularization of the KML format, web search engines have also begun to work with geographic content encoded in KML. This work has analyzed how KML files can contain and carry metadata, represented either by a simple textual description or by complex XML structures such as the one defined by the ISO19115 standard, which allows the reuse of existing content. Our analysis indicates that KML can act as a metadata container because it offers encapsulation and flexibility. The former allows metadata and data to be transported within a single file, while the latter makes it possible to specify the level of granularity at which the content is described.
Google is one of the companies that invests most in freely available geographic services and tools aimed at a broad public. Through its search engine the company also indexes geographic content expressed in KML or KMZ. The use of Google’s search engines has been shown to be effective for most non-professional uses. This approach is an extremely useful way to publish geographic content and an effective way to discover it. However, the use of search engines also has some associated restrictions, such as the time the system takes to discover and index the content and the efficiency of the process. This efficiency is subject to factors such as the use of specific elements within a KML file at specific levels and the quantity of content duplicated among files. These last two factors make this solution complicated to use in cases where the same data provider publishes considerable quantities of duplicated content. Despite these problems, the use of web search engines complements the use of catalogues, because search engines can manage huge quantities of content spread across the Web. Finally, another important aspect is that these systems perform their tasks automatically, requiring little interaction with the user and imposing no complex requirements on the data or metadata, which matters given the proliferation of geographic content among the amateur public.