Satellite imaging sensors acquire huge volumes of imagery that must be processed and stored in large archives. An example of such an archive is the German Remote Sensing Data Center (DFD) at Oberpfaffenhofen, Germany, which receives hundreds of GigaBytes of data per day, entailing 104 GigaBytes in the repository. To provide access to these data, web applications such as the DLR EOWEB have been developed to retrieve images according to meta information such as date, geographical location or sensor. The Alexandria Digital Library is another example of accessing remotely sensed imagery through its meta information; it provides a distributed search mechanism for retrieving geospatially referenced data collections and is able to search different types of databases placed at different locations. The software enables the implementation of web clients such as Globetrotter or Gazetteer. Because these systems are based on meta information retrieval, they allow only constrained queries that give no information about the image content; consequently, no content-based retrieval is offered.
The conference on database techniques for pictorial applications, held in 1979 in Florence, Italy, pursued the integration of databases with image processing. This idea evolved and, in 1990, gave rise to a new field called Content Based Image Retrieval (CBIR). In 1998, CBIR was combined with Data Mining and Knowledge Discovery in Databases (KDD), and from this combination the Image Information Mining (IIM) field emerged in 2000. This new domain requires expertise in image processing, database organization, pattern recognition, content-based retrieval and data mining: image processing concerns the understanding and extraction of patterns from a single image; content-based retrieval consists of retrieving images from the archive based on their semantic and visual contents; spatial data mining denotes the extraction, from remotely sensed images, of spatial relationships and patterns that are not explicitly stored in a spatial database. An IIM system gives users the capability to deal with large collections of images by accessing large image databases, and also to extract and infer knowledge about patterns hidden in the images, so that the set of relevant images is dynamic, subjective and unknown. It enables communication between heterogeneous sources of information and users with diverse interests at a high level of semantic abstraction.
In general, an IIM system comprises two fundamental modules: a computationally expensive component, where image processing and classification algorithms are executed, and an interactive part, where the user enters queries and relevant images are retrieved. Fig. 1 represents the typical data flow in an IIM system: original data arrive at a feature extraction module, where the main image characteristics are computed; these features are then compressed and indexed in a database; in a second module, the user queries the archive for similar features, and similarity measurements are computed for optimal image retrieval.
This chapter begins by describing the generic concept and modules of an IIM system architecture; Sec. 3 then presents an overview of existing IIM systems.
2. Image Information Mining System Architecture
As depicted in Fig. 1, the generic concept of an IIM system requires several processing modules: extraction of properties from images, reduction and indexing of content, and communication between users and system. In this section, we present the state of the art of these modules and give an overview of existing techniques in these fields.
2.1. Feature Extraction
In general, by image we understand a picture, thus relating it to (human) visual perception and understanding. A picture is characterized by its primitive features such as colour, texture or shape at different scales. Thus, an image is represented as a multidimensional feature vector that acts as its signature. Some classical techniques to characterize an image are the following:
Colour: Colour information has long been an important feature in image processing and computer vision. Different colour models or colour spaces exist, each one being useful for a specific application. A digital imaging system typically represents colour images in red, green and blue using the RGB space. Another model, related to the human perception of colours, is the HSV (Hue, Saturation and Value) colour space. It describes the property of the surface reflecting the light (hue), the colourfulness or whiteness (saturation) and the brightness (value) of colours. Often a full colour image providing the three colours (RGB) in each pixel is needed, which makes it essential to interpolate missing colours from the information of neighbouring pixels. There are nonadaptive algorithms (Ray & Acharya, 2005), such as nearest neighbour replication and bilinear interpolation, and adaptive algorithms (Ray & Acharya, 2005) based on pattern matching or edge sensing interpolation. On the other hand, a common practice in image processing is the statistical analysis of colour histograms, due to the strong correlation between objects and colour in an image.
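As an illustration of the histogram analysis mentioned above, the following minimal sketch converts RGB pixels to HSV with Python's standard `colorsys` module and builds a normalised hue histogram. The function name `hue_histogram`, the number of bins and the pixel format (a list of 8-bit RGB tuples) are assumptions made for this example only.

```python
import colorsys


def hue_histogram(pixels, bins=8):
    """Normalised hue histogram of a list of (r, g, b) pixels in 0..255.

    Hypothetical helper: bin count and pixel format are illustrative choices.
    """
    if not pixels:
        return [0.0] * bins
    counts = [0] * bins
    for r, g, b in pixels:
        # colorsys expects channel values in [0, 1] and returns hue in [0, 1]
        hue, _saturation, _value = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a hue of exactly 1.0 would fall into the last bin
        index = min(int(hue * bins), bins - 1)
        counts[index] += 1
    return [count / len(pixels) for count in counts]


# Usage: one red and one cyan pixel fall into two different hue bins
signature = hue_histogram([(255, 0, 0), (0, 255, 255)], bins=4)
```

The resulting list can serve as one part of the feature vector that acts as the image signature.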
Texture: Texture is a very useful feature to characterize the spatial structure of an image. This is an active research field where parametric and non-parametric methods are applied. Haralick’s co-occurrence technique (Shanmugam et al., 1973), based on the computation of the gray-level co-occurrence matrix for several values of displacement, orientation and image quantization levels, is an effective method in texture analysis. Other algorithms based on wavelet transformations, such as Gabor filters (Maillot et al., 2005), can also be applied.
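The gray-level co-occurrence matrix can be sketched in a few lines. The example below is a simplified, non-symmetric version for a single displacement; it assumes that the image is a list of rows whose values are already quantized to integers in `range(levels)`. The names `glcm` and `contrast` are hypothetical, and contrast is shown only as one example of a statistic derived from the matrix.

```python
def glcm(image, dx=1, dy=0, levels=4):
    """Count co-occurrences of gray levels for one displacement (dx, dy).

    Assumes pixel values are integers already quantized to range(levels).
    """
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                matrix[image[y][x]][image[ny][nx]] += 1
    return matrix


def contrast(matrix):
    """Contrast statistic: squared gray-level difference weighted by frequency."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
    return weighted / total


# Usage: horizontal neighbours (dx=1, dy=0) in a small 4-level image
example = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
texture_contrast = contrast(glcm(example))
```

Repeating the computation for several displacements and orientations yields the set of matrices from which texture features are derived.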
Shape: The description of the shape of objects must be invariant to translation, rotation and scaling of the image. Shape is characterized in two ways: boundary-based, which considers the outer contour of the object, and region-based, where the whole shape region of the object is analyzed. In this sense, Fourier descriptors are suitable for transforming boundaries into shape features, and moment invariants for extracting the geometry of the object region. A modified Fourier descriptor that is invariant to geometric transformations and noise is proposed in (She et al., 1998). A common practice before applying shape techniques is to segment the image into small regions. Comaniciu (Comaniciu & Meer, 2002) presents such an approach, based on the mean shift method for density gradient estimation.
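To make the invariance requirement concrete, the sketch below computes classical magnitude-based Fourier descriptors of a closed boundary. It is not the modified descriptor of (She et al., 1998). It assumes that the boundary is given as an ordered list of (x, y) points traversed counter-clockwise and that it contains at least `count + 2` points; the name `fourier_descriptors` is hypothetical.

```python
import cmath


def fourier_descriptors(boundary, count=4):
    """Translation-, rotation- and scale-invariant boundary descriptors.

    Assumes a closed boundary traversed counter-clockwise, so that the
    first harmonic is non-zero and can be used for normalisation.
    """
    points = [complex(x, y) for x, y in boundary]
    n = len(points)
    # Discrete Fourier transform of the complex boundary signal
    spectrum = [
        sum(points[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Skipping k = 0 removes translation, magnitudes remove rotation,
    # and dividing by the first harmonic removes scale.
    scale = abs(spectrum[1])
    return [abs(spectrum[k]) / scale for k in range(2, 2 + count)]


# Usage: a square and a scaled, shifted copy give the same descriptors
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
moved = [(3 * x + 5, 3 * y - 2) for x, y in square]
same = all(
    abs(a - b) < 1e-9
    for a, b in zip(fourier_descriptors(square), fourier_descriptors(moved))
)
```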
Topology: Topological properties of an image, such as the number of connected or disconnected components, do not change when the image is rotated, scaled, translated, stretched or deformed. One example of characterizing an image through its topological properties is the computation of the Euler number (Ray & Acharya, 2005), defined as the difference between the number of connected components and the number of holes in a binary image. An extension of the Euler number to gray-level images is the Euler vector (Ray & Acharya, 2005). Segmentation techniques may also help in the extraction of topological features.
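The definition of the Euler number can be turned into a short sketch. The version below counts foreground components with 8-connectivity and holes as background components (4-connectivity) that do not touch the image border; this connectivity convention and the helper names are assumptions of the example, not taken from the cited reference.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, target, neighbours, skip_border=False):
    """Count connected regions of `target`; optionally ignore border regions."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or seen[r][c]:
                continue
            touches_border = False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_border = True
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == target):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if not (skip_border and touches_border):
                count += 1
    return count


def euler_number(binary):
    """Number of connected components minus number of holes."""
    components = count_components(binary, 1, N8)
    # Background regions touching the border belong to the outside, not to holes
    holes = count_components(binary, 0, N4, skip_border=True)
    return components - holes


# Usage: a ring has one component and one hole
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
ring_euler = euler_number(ring)
```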
2.2. Multidimensional Indexing
In the CBIR and IIM domain, the concept of multidimensional indexing differs from that of a traditional database management system. In the latter, an index is the structure that provides access to the database in terms of record organization. In IIM, once an N-dimensional feature vector has been obtained, each image is assigned a suitable content-based description extracted from these features. These content descriptors are then organized into a data structure for retrieval.
In multidimensional indexing, the following items must be considered:
Reduction of dimensionality: Due to the huge number of images and extracted features, the dimensionality of the information at the indexing step is normally very high. This complicates the management of the feature vector and makes its computation very expensive. For this reason, mechanisms for reducing the dimension of the feature space must be considered. Among these methods, the Karhunen-Loève transform (Ray & Acharya, 2005) and the Discrete Cosine Transform (DCT) (Khayam, 2003), (Watson, 1994) are often used.
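A minimal sketch of DCT-based reduction, assuming a one-dimensional feature vector and simple truncation to the first coefficients, could look as follows. The function names are hypothetical, and the direct O(N²) evaluation is chosen for readability rather than speed.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (direct O(N^2) evaluation)."""
    n = len(signal)
    coefficients = []
    for k in range(n):
        total = sum(
            value * math.cos(math.pi * (i + 0.5) * k / n)
            for i, value in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coefficients.append(scale * total)
    return coefficients


def reduce_dimension(features, keep):
    """Keep only the first `keep` DCT coefficients (low frequencies)."""
    return dct2(features)[:keep]


# Usage: an 8-dimensional feature vector is reduced to 3 coefficients
reduced = reduce_dimension([4.0, 4.1, 3.9, 4.0, 4.2, 4.1, 4.0, 3.8], keep=3)
```

Truncation works well when most of the energy of the feature vector is concentrated in the low-frequency coefficients, which is an assumption that should be checked for the data at hand.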
Clustering: Extracted features with similar content must be grouped together through a classification algorithm, so that pixels with similar features belong to the same class. Existing clustering techniques can be classified into two main groups: distance-based and model-based approaches (Zhong & Ghosh, 2003). The first group includes methods based on Euclidean and Mahalanobis distances; the second group comprises algorithms based on an a priori specified model, such as Gaussian mixture models or Markov chains.
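As an example of the distance-based group, the following sketch implements a basic k-means iteration with squared Euclidean distance. The initial centroids are passed in explicitly to keep the example deterministic; this choice, the fixed number of iterations and the function names are assumptions of the sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Assign every point to the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        distances = [squared_distance(point, c) for c in centroids]
        clusters[distances.index(min(distances))].append(point)
    return clusters


def mean(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))


def kmeans(points, initial_centroids, iterations=20):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid
        centroids = [
            mean(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


# Usage: two well separated groups of 2-D feature vectors
features = [(0, 0), (0, 1), (10, 10), (10, 11)]
centres, groups = kmeans(features, [(0, 0), (10, 10)])
```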
Data structure for content-based retrieval: Once a clustering algorithm has been applied, a data structure for indexing descriptors to semantic content must be selected. The most commonly used methods are tree-based indexing techniques, such as multidimensional binary search trees or R-trees, and hashing-based ones.
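A multidimensional binary search tree (k-d tree) with nearest-neighbour search can be sketched as below. The dictionary-based node layout and the function names are illustrative choices for this example, not a description of any particular system.

```python
def build_kdtree(points, depth=0):
    """Recursively split the points at the median of one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    middle = len(ordered) // 2
    return {
        "point": ordered[middle],
        "axis": axis,
        "left": build_kdtree(ordered[:middle], depth + 1),
        "right": build_kdtree(ordered[middle + 1:], depth + 1),
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return the stored point closest to `query`."""
    if node is None:
        return best
    if best is None or squared_distance(node["point"], query) < squared_distance(best, query):
        best = node["point"]
    difference = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (
        (node["left"], node["right"]) if difference < 0
        else (node["right"], node["left"])
    )
    best = nearest(near, query, best)
    # Visit the other side only if the splitting plane is closer than the best match
    if difference ** 2 < squared_distance(best, query):
        best = nearest(far, query, best)
    return best


# Usage: index six 2-D descriptors and look up the closest one
descriptors = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(descriptors)
closest = nearest(tree, (9, 2))
```

The pruning step is what distinguishes the tree from a linear scan: whole subtrees are skipped when they cannot contain a closer descriptor.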
2.3. Content-Based Image Retrieval
CBIR is usually limited by the semantic gap between signal classes and semantic labels. Li and Bretschneider (Li & Bretschneider, 2006) propose a context-sensitive Bayesian network to infer the semantic concept of regions or classes. Semantic score functions based on region features (spectral and texture) are computed to link semantic concepts to regions. Tusk et al. (Tusk et al., 2002) suggest a Bayesian framework to cope with the semantic gap problem. They introduce a visual grammar that builds a hierarchical semantic model from the pixel level to the region and scene levels. At the pixel level, classification is provided by automatic fusion of primitive features; at the region level, land cover labels are defined through a segmentation algorithm; and the scene level represents the spatial relationships among regions. Thus, the visual grammar consists of two learning steps in which naive Bayesian classifiers are applied: a probabilistic link between features and semantic labels, and a fuzzy modelling to link regions and scenes. Once the visual grammar is built, the image classification process aims at finding representative region groups that describe the scene. The procedure consists of modelling the labelled regions by a Dirichlet distribution based on the number of training examples containing a certain region group, and then assigning the best matching class to the image by using the maximum a posteriori rule.
In order to give the system the ability to search at query time for images with similar features, a similarity metric for the comparison of objects or image properties must be defined. For the measure to be realistic, computer and human judgments of similarity should be generally correlated. If this condition is not met, the images returned by the system will not be those desired by the user. These techniques are often based on distances or on domain-specific measures such as histogram intersection, neural networks, shape measures or graph matching. Queries like "retrieve images containing a specific content" or "retrieve images that do not contain a particular object" can be posed to a CBIR system.
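Histogram intersection, one of the measures listed above, is simple enough to sketch directly. The example assumes normalised histograms of equal length and a small in-memory archive represented as a dictionary; both the names and the archive layout are hypothetical.

```python
def histogram_intersection(first, second):
    """Sum of bin-wise minima; equals 1.0 for identical normalised histograms."""
    return sum(min(a, b) for a, b in zip(first, second))


def rank_by_similarity(query, archive):
    """Return archive keys ordered from most to least similar to the query."""
    return sorted(
        archive,
        key=lambda name: histogram_intersection(query, archive[name]),
        reverse=True,
    )


# Usage: rank three hypothetical archive images against a query histogram
archive = {
    "scene_a": [0.5, 0.5, 0.0],
    "scene_b": [0.25, 0.25, 0.5],
    "scene_c": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([0.5, 0.5, 0.0], archive)
```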
2.4. Semantic Learning for Content-based Image Retrieval
The main problem of using feature vectors for querying images with similar content is that the appearance of an image often does not correspond to its semantic meaning, so that the returned images only partially respond to the user's query. Therefore, at the object or region level, which is the highest level of abstraction, an image is represented by its objects, and a semantic label is assigned to each of them.
A commonly used technique to provide regions with semantic meaning is manual annotation which, combined with a powerful segmentation method, can result in a meaningful classification. Comaniciu (Comaniciu & Meer, 2002) proposes a colour image segmentation algorithm based on the mean shift, which estimates density gradients using a simple nonparametric procedure. The users then interactively identify the segmented regions by labelling the features. Because hand-annotating images is tedious and labour-intensive, methods for learning image representations directly from data are being investigated.
Fei-Fei and Perona (Fei-Fei & Perona) propose a Bayesian hierarchical model to learn and recognize natural scene categories through intermediate "themes". In that work, the most complete scene category dataset found in the literature is used. An image is modelled as a collection of local patches (regions). Each patch is represented by a codeword from a large vocabulary obtained from the training examples of all categories. For the codewords of each category, a Bayesian hierarchical model is learnt, building a collection of Bayesian models. Then, to provide semantic meaning to an unknown image, the image codewords are first extracted and then compared with the predefined models, and the best-fitting model is assigned. The main problem of the proposed algorithm is that, although it can learn intermediate themes of scenes without supervision or human intervention, the categories are fixed, so that it is not able to assign semantic meaning to other ones.
Another method that uses a predefined lexicon of semantic concepts as training data is the semantic pathfinder for multimedia indexing (Seinstra et al., 2006). Here, given a pattern x, part of a camera shot, the aim is to detect a semantic concept ω from shot i using the probability p(ω|x_i). Each step in the semantic pathfinder analysis extracts x_i from the data and learns p(ω|x_i) for all ω in the semantic lexicon.
Maillot et al. (Maillot et al., 2005) propose a learning approach based on two steps: a feature selection step that chooses the most characteristic features for better visual concept detection, and a training phase using a Support Vector Machine (SVM), where positive and negative samples are required. To overcome weaknesses of this learning approach, such as its inability to learn the spatial structure of semantic concepts, a further step is taken: the visual knowledge, that is, the link between semantic concepts and sensor data, is stored in a symbol. This link is modelled as a fuzzy linguistic variable that enables the representation of imprecision; the image features are thus fuzzified a priori by a human expert, providing spatial relation representations and spatial reasoning.
In these articles, we find two drawbacks that we try to avoid. On the one hand, there is a lack of generalization when a predefined lexicon is used to link data with semantic classes; a semantic lexicon is useful only when a priori, limited knowledge is available. On the other hand, experts in the application domain are needed to manually label the regions of interest.
An important issue to address when assigning semantic meaning to a combination of classes is data fusion. Li and Bretschneider (Li & Bretschneider, 2006) propose a method in which feature vectors are combined for the interactive learning phase. They introduce an intermediate step, called code pairs, between region pairs (clusters from the k-means algorithm) and semantic concepts. The Generalised Lloyd Algorithm is used to classify the low-level feature vectors into a set of codes that form a codebook. Each image is encoded by an individual subset of these codes, based on the low-level features of its regions.
Signal classes are objective and depend on feature data, not on semantics. Chang et al. (Chang et al., 2002) propose a semantic clustering, a parallel solution that considers semantics in the clustering phase. In the article, a first level of semantics is provided by dividing an image into high-level semantic category clusters such as grass, water and agriculture. Then, each cluster is divided into feature subclusters such as texture, colour or shape. Finally, a semantic meaning is assigned to each subcluster.
Few methods exist in the literature for the interactive classification of multiple features. Chang et al. (Chang et al., 2002) describe the design of a multilayer neural network model to merge the results of basic queries on individual features. The input to the neural network is the set of similarity measurements for different feature classes and the output is the overall similarity of the image. To train the neural network and find the weights, a set of similar images for the positive examples and a set of non-similar ones for the negative examples must be provided. Once the network is trained, it can be used to merge heterogeneous features.
To finish this review of semantic learning, we have to mention the kind of semantic knowledge that can be extracted from EO data. The semantic knowledge depends on the image scale, and the range of observable scales is limited by the sensor resolution. It is important to understand the difference between scale and resolution: resolution is a property of the sensor, while scale is a property of an object in the image. Fig. 2 depicts the correspondence between the knowledge that can be extracted and the image scale, where small objects correspond to a scale of 10 meters and big ones to a scale of thousands of meters. The hierarchical representation of extracted knowledge makes it possible to answer questions such as which sensor is more accurate for a particular domain or which features better explain the data.
2.5. Relevance Feedback
An IIM system often requires communication between human and machine while performing interactive learning for CBIR. In the interaction loop, the user provides training examples that express his or her interest, and the system answers by highlighting some regions on the retrieved data, with a collection of images that fits the query, or with statistical similarity measures. These responses are referred to as relevance feedback, whose aim is to adapt the search to the user's interest and to optimize the search criterion for faster retrieval.
Li and Bretschneider (Li & Bretschneider, 2006) propose a composite relevance feedback approach that is computationally optimized. In a first step, a pseudo query image is formed by combining all regions of the initial query with the positive examples provided by the user. In order to reduce the number of regions without losing precision, a semantic score function is computed. To measure image-to-image similarities, they perform an integrated region matching.
In order to reduce the response time while searching in large image collections, Cox et al. (Cox et al., 2000) developed a system called PicHunter, based on a Bayesian relevance feedback algorithm. This method models the user's reaction to a certain target image and infers the probability of each candidate target image on the basis of the history of performed actions. Thus, the average number of man-machine interactions needed to locate the target image is reduced, which speeds up the search.
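The principle of Bayesian relevance feedback can be illustrated with a deliberately simplified sketch. It is not the PicHunter user model: it assumes one-dimensional image features and a hypothetical softmax model in which the user tends to pick the displayed image closest to the target. All names and the parameter `sigma` are assumptions of this example.

```python
import math


def update_posterior(prior, features, displayed, chosen, sigma=1.0):
    """One Bayesian update after the user picked `chosen` among `displayed`.

    prior:    dict image -> probability of being the target
    features: dict image -> scalar feature (hypothetical 1-D descriptor)
    """
    posterior = {}
    for target, probability in prior.items():
        # Assumed user model: images closer to the target are chosen more often
        weights = {
            image: math.exp(-abs(features[image] - features[target]) / sigma)
            for image in displayed
        }
        likelihood = weights[chosen] / sum(weights.values())
        posterior[target] = probability * likelihood
    normaliser = sum(posterior.values())
    return {target: value / normaliser for target, value in posterior.items()}


# Usage: the user prefers "c" over "a", so targets near "c" become more likely
features = {"a": 0.0, "b": 1.0, "c": 5.0}
prior = {name: 1.0 / len(features) for name in features}
posterior = update_posterior(prior, features, displayed=["a", "c"], chosen="c")
```

Repeating the update after every interaction accumulates the history of user actions in the posterior, which is then used to select the next images to display.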
3. Existing Image Information Mining Systems
As the IIM field is still in its infancy, only a few systems provide CBIR, and these are under evaluation and further development. Aksoy (Aksoy, 2001) provides a survey of CBIR systems prior to 2001, and a more recent review is provided by Daschiel (Daschiel, 2004). In this section, we present several IIM systems for the retrieval of remotely sensed images, most of them experimental.
Li (Li & Narayanan, 2004) proposes a system able to retrieve integrated spectral and spatial information from remote sensing imagery. Spatial features are obtained by extracting textural characteristics using Gabor wavelet coefficients, and spectral information by Support Vector Machine (SVM) classification. The feature space is then clustered through an optimized version of the k-means approach. The resulting classification is maintained in a database with two schemes: an image database where images are stored, and an Object-Oriented Database (OODB) where feature vectors and the pointers to the corresponding images are stored. The main advantage of an OODB is the ease of mapping between an object-oriented programming language, such as Java or C++, and the OODB structures through supported Application Programming Interfaces (API). The system is able to process a new image in online mode, so that an image which is not yet in the archive is processed and clustered interactively.
Feature extraction is an important part of IIM systems; however, it is computationally expensive and usually generates a high volume of data. A possible solution would be to compute only those features that are relevant for describing a particular concept, but how can relevant and irrelevant features be discriminated? The Rapid Image Information Mining (RIIM) prototype (Shah et al., 2007) is a Java-based framework that provides an interface for the exploration of remotely sensed imagery based on its content. In particular, it focuses on the management of coastal disasters. Its ingestion chain begins with the generation of tiles and an unsupervised segmentation algorithm. Once the tiles are segmented, a feature extraction composed of two parts is performed: a first module consists of a genetic algorithm for the selection of a particular set of features that best identifies a specific semantic class, and a second module generates feature models through genetic algorithms. Thus, if the user provides a query with a semantic class of interest, feature extraction is performed only over the optimal features for the prediction, which speeds up the ingestion of new images. The last step consists of applying an SVM approach for classification. While executing a semantic query, the system automatically computes the confidence value of a selected region and facilitates the retrieval of regions whose confidence is above a particular threshold.
The IKONA system is a CBIR system based on a client-server architecture. The system provides the ability to retrieve images by visual similarity in response to a query that expresses the interest of the user. It offers the possibility to perform region-based queries, so that the search engine looks for images containing parts similar to the provided one. A main characteristic of the prototype is the hybrid text-image retrieval mode: images can be manually annotated with indexed keywords, and while retrieving images with similar content, the engine searches by keyword, which provides a faster computation.
IKONA can be applied not only to EO applications, but also to face detection or signature recognition. The server-side architecture is implemented in C++ and the client software in Java, which makes the client independent of the platform on which it runs. The only prerequisite on the client is an installed Java Virtual Machine.
The Query by Image Content (QBIC) system is a commercial tool developed by IBM that explores content-based retrieval methods allowing queries on large image and video databases. These queries can be based on selected colour and texture patterns, on example images or on user-made drawings. QBIC is composed of two main components: database population and database query. The former deals with processes related to image processing and image-video database creation. The latter is responsible for offering an interface to compose a graphical query and for matching the input query to the database. Before images are stored in the archive, they are tiled and annotated with text information. Because the manual identification of objects inside images can become a very tedious task, a fully automatic unsupervised segmentation technique based on foreground/background models is introduced to automate this function. Another method to automatically identify objects, also included in this system, is the flood-fill approach. This algorithm starts from a single pixel and continues adding neighbouring pixels whose values are under a certain threshold. This threshold is calculated automatically and updated dynamically by distinguishing between background and object.
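The flood-fill idea can be sketched as a simple region-growing procedure. Unlike the approach described above, this example uses a fixed threshold on the absolute difference to the seed value rather than an automatically updated one; that simplification, the 4-connectivity and the function name are assumptions of the sketch.

```python
from collections import deque


def flood_fill(image, seed, threshold):
    """Grow a region from `seed`, adding 4-connected neighbours whose value
    differs from the seed value by at most `threshold` (fixed, by assumption)."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < rows and 0 <= nx < cols
                    and (ny, nx) not in region
                    and abs(image[ny][nx] - seed_value) <= threshold):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region


# Usage: a dark object in the upper-left corner of a bright background
gray = [
    [10, 11, 50],
    [12, 10, 52],
    [51, 49, 50],
]
object_pixels = flood_fill(gray, seed=(0, 0), threshold=5)
```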
Photobook (Picard et al., 1994), developed by MIT, is another content-based retrieval system for images and image sequences. Its principle is to compress images for quick query-time performance while preserving essential image similarities, so that the interactive search is efficient. For the characterization of object classes in a way that preserves their geometrical properties, an approach derived from the Karhunen-Loève transform is applied. For texture features, however, a method based on the Wold decomposition, which separates structured and random texture components, is used. In order to link data to classes, a method based on colour difference provides an efficient way to discriminate between foreground objects and image background. After that, the shape, appearance, motion and texture of these foreground objects can be analyzed and ingested in the database together with a description. To assign one or more semantic labels to regions, several human-machine interactions are performed, and through relevance feedback the system learns the relations between image regions and semantic content.
The VisiMine system (Aksoy et al., 2002), (Tusk et al., 2002) is an interactive mining system for the analysis of remotely sensed data. VisiMine distinguishes between pixel, region and tile levels of features, providing several feature extraction algorithms for each level. Pixel level features describe spectral and textural information; regions are characterized by their boundary, shape and size; tile or scene level features describe the spectral and textural information of the whole image scene. Texture features are extracted with Gabor wavelets and Haralick’s co-occurrence, image moments are computed to extract geometrical properties, and the k-medoid and k-means methods are considered for clustering features. Both methods partition the set of objects into clusters. With k-means, further detailed in chapter 6, each object belongs to the cluster with the nearest mean, and the centroid of a cluster is the mean of the objects belonging to it. With k-medoid, the center of the cluster, called the medoid, is the object whose average distance to all the objects in the cluster is minimal. Thus, the center of each cluster in the k-medoid method is a member of the data set, whereas the centroid of a cluster in the k-means method need not belong to the set. Besides the clustering algorithms, general statistical measures such as histograms, maximum, minimum, mean and standard deviation of pixel characteristics are computed for regions and tiles. In the training phase, naive Bayesian classifiers and decision trees are used. An important feature of the VisiMine system is its connectivity to SPLUS, an interactive environment for graphics, data analysis, statistics and mathematical computing that contains over 3000 statistical functions for scientific data analysis.
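The difference between the two cluster centers can be shown with a small sketch that computes both for the same cluster. Euclidean distance and the function names are assumptions of this example; the full k-means and k-medoid algorithms additionally iterate over cluster assignments, which is omitted here.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Mean of the cluster members; it need not be a member itself."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))


def medoid(cluster):
    """Member with the smallest total (hence average) distance to all members."""
    return min(cluster, key=lambda p: sum(distance(p, q) for q in cluster))


# Usage: an outlier pulls the centroid away, while the medoid stays in the data
cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (14, 0)]
center_mean = centroid(cluster)
center_medoid = medoid(cluster)
```

In this example the centroid is shifted towards the outlier and does not coincide with any object, whereas the medoid is one of the original objects.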
The functionality of VisiMine also includes generic image processing tools, such as histogram equalization, spectral balancing, false colours, masking or multiband spectral mixing, and data mining tools, such as data clustering, classification models or prediction of land cover types.
GeoIRIS (Scott et al., 2007) is another IIM system. It includes automatic feature extraction at tile level, covering spectral, textural and shape characteristics, and, at object level, high-dimensional database indexing and visual content mining. It offers the possibility to query the archive by image example, by object, by relationship between objects and by semantics. The key point of the system is the ability to merge information from heterogeneous sources, creating maps and imagery dynamically.
Finally, Knowledge-driven Information Mining (KIM) (Datcu & Seidel, 1999), (Pelizzari et al., 2003) and its later versions, Knowledge Enabled Services (KES) and Knowledge-centred Earth Observation (KEO), are perhaps the most advanced systems in terms of technology, modularity and scalability. They are based on IIM concepts, and several primitive and non-primitive feature extraction methods are implemented. In the latest version of KIM, called KEO, new feature extraction algorithms can easily be plugged in and incorporated into the data ingestion chain. In the clustering phase, a variant of the k-means technique is executed, generating a vocabulary of indexed classes. To solve the semantic gap problem, KIM computes a stochastic link through Bayesian networks, learning the posterior probabilities between classes and user-defined semantic labels. Finally, thematic maps are automatically generated according to predefined cover types. Currently, a first version of KEO is available and under further development.