Comparison table of HSOM methods
The amount of available geospatial data increases every day, placing additional pressure on existing analysis tools. Most of these tools were developed for a data-poor environment and thus rarely address concerns of efficiency, high dimensionality and automatic exploration. Recent technological innovations have dramatically increased the availability of data on location and spatial characterization, fostering the proliferation of huge geospatial databases. To make the most of this wealth of data we need powerful knowledge discovery tools, but we also need to consider the particular nature of geospatial data. This context has raised new research challenges and difficulties in the analysis of multidimensional geo-referenced data. The availability of methods able to perform “intelligent” data reduction on vast amounts of high-dimensional data is a central issue in the current research agenda of Geographic Information Science (GISc).
Knowledge discovery is one of the most relevant fields in GISc research for developing tools able to perform “intelligent” data reduction [2, 3] and tame complexity. More than prediction tools, we need exploratory tools that enable an improved understanding of the available data.
The term cluster analysis encompasses a wide group of algorithms. The main goal of such algorithms is to organize data into meaningful structures. This is achieved by arranging data observations into groups based on similarity. These methods have been extensively applied in different research areas including data mining [6, 7], pattern recognition [8, 9], and statistical data analysis. GISc has also relied heavily on clustering algorithms [11, 12]. Research on geodemographics [13-16], identification of deprived areas, and social services provision are examples of the relevance that clustering algorithms have within today’s GISc research.
One of the most challenging aspects of clustering is the high dimensionality of most problems. While describing phenomena generally requires many variables, the increase in dimensionality has a significant impact on the performance of clustering algorithms and on the quality of the results. First, it enlarges the search space, affecting the clustering algorithm’s efficiency through the effect usually known as the “curse of dimensionality”. Second, it makes the output harder to analyse, as the clusters are more difficult to characterize when multiple variables contribute to the final structure. Thus, in a typical clustering problem, the user is asked to select a small number of variables that best describe the phenomenon.
However, to produce an accurate representation of the phenomenon, it is sometimes necessary to measure it from several perspectives. A typical example is the use of census variables to study the socio-economic environment in an urban context. Usually, the census covers a wide range of themes describing the characteristics of the population, such as demography, households, families, housing and economic status. In these cases, some variables are strongly correlated, independently of the subject they cover. In fact, as dimensionality increases, so does the probability of correlation between variables. In addition, due to the spatial context of census data, variables have strong spatial autocorrelation. Spatial autocorrelation measures the degree of dependency among observations in a geographic space. This spatial autocorrelation corroborates Tobler’s first law (TFL), which expresses the tendency of nearby objects to be similar.
To GIScientists, clusters are usually more representative and easier to understand if they present spatial contiguity. However, several factors can cause the clusters to present spatial discontinuity. Among these, the scale or zoning scheme of the geographical units, known as the modifiable areal unit problem (MAUP), can affect the expected spatial patterns. In addition, the combination of different variables that present distinct levels of spatial autocorrelation affects the clusters’ spatial patterns.
Traditional clustering methods, self-organizing maps included, are very sensitive to divergent variables. Divergent variables are those that differ significantly from the general tendency. These variables have a great impact on the clustering process and are crucial in the final partition. For instance, when clustering using a set of variables where all, except one, present spatial autocorrelation, the divergent variable will have a higher impact than the others. In most cases, the clusters created will not follow the spatial arrangement suggested by the majority of the variables, but will be distorted by the variables presenting odd spatial distributions.
To avoid this problem a hierarchical structure may be used to explore and cluster geospatial data. Variables are grouped in themes, and each theme will be independently clustered. These partial clusters are then used to create a global partition.
One well-known clustering method is the Self-Organizing Map (SOM) proposed by Kohonen. One of the interesting properties of SOM is the capability of detecting small differences between objects. SOMs have proved to be a useful and efficient tool in finding multivariate data outliers [25-27]. SOM has also been widely used in the GIScience field in the exploration and clustering of geospatial data [28-35].
In this chapter, we propose the use of Hierarchical SOMs to perform geospatial clustering. Several characteristics of geospatial data make it a good candidate to benefit from the HSOM specific features. The classic layer organization used in GIScience fits perfectly the layered structure of HSOM. HSOM provides an appropriate framework to perform the clustering task based on individual themes, which can then be compared with the clusters created from the combination of several themes. HSOM is less sensitive to divergent variables because these will only have a direct impact on their theme.
There are many types of hierarchical SOM, so we propose a taxonomy to classify existing methods according to their objectives and structure.
2.1. Self-Organizing Maps
Teuvo Kohonen proposed the Self-Organizing Map (SOM) at the beginning of the 1980s. The SOM is usually used for mapping high-dimensional data into one, two, or three-dimensional feature maps. The basic idea of an SOM is to map the data patterns onto an n-dimensional grid of units or neurons. That grid forms what is known as the output space, as opposed to the input space, which is the original space of the data patterns. This mapping tries to preserve topological relations, i.e. patterns that are close in the input space will be mapped to units that are close in the output space, and vice-versa. The output space is usually two-dimensional, and most implementations of SOM use a rectangular grid of units. To provide even distances between the units in the output space, hexagonal grids are sometimes used. Each unit, being an input layer unit, has as many weights as the input patterns have components, and can thus be regarded as a vector in the same space as the patterns.
When training an SOM with a given input pattern, the distance between that pattern and every unit in the network is calculated. Then the algorithm selects the closest unit as the winning unit (also known as the best matching unit, BMU), and that pattern is mapped onto that unit. If the SOM has been trained successfully, then patterns that are close in the input space will be mapped to units that are close (or the same) in the output space. Thus, SOM is ‘topology preserving’ in the sense that (as far as possible) neighbourhoods are preserved through the mapping process.
The basic SOM learning algorithm is governed by a learning rate, a neighbourhood radius and a neighbourhood function, which are described below.
The learning rate α, sometimes referred to as η, varies in [0, 1] and must converge to 0 to guarantee convergence and stability in the training process. The decrease of this parameter to 0 is usually done linearly, but any other function may be used. The radius, usually denoted by r, indicates the size of the neighbourhood around the winner unit in which units will be updated. This parameter is relevant in defining the topology of the SOM, deeply affecting the output space unfolding.
The neighbourhood function h, sometimes denoted N_c, assumes values in [0, 1], and is a function of the positions of two units (a winner unit and another unit) and of the radius r. It is large for units that are close in the output space, and small (or 0) for faraway units.
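The training loop and the three parameters described above can be sketched in a few lines of Python. This is a minimal illustration and not the chapter's own algorithm listing: the function names `train_som` and `best_matching_unit`, the Gaussian form of h, the linear decay of the radius, and all default values are assumptions chosen for clarity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position (row, col) of the unit closest to pattern x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def train_som(data, rows, cols, epochs=50, alpha0=0.5, radius0=None, seed=0):
    """Train a rectangular SOM and return {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 if radius0 is not None else max(rows, cols) / 2.0
    # Initialise every unit with a random vector inside the data range.
    lo = [min(p[k] for p in data) for k in range(dim)]
    hi = [max(p[k] for p in data) for k in range(dim)]
    weights = {
        (i, j): [rng.uniform(lo[k], hi[k]) for k in range(dim)]
        for i in range(rows) for j in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # present patterns in random order
            frac = step / total_steps
            alpha = alpha0 * (1.0 - frac)              # learning rate decays linearly to 0
            radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
            win = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_dist = math.dist(pos, win)        # distance in the output space
                # Gaussian neighbourhood h in [0, 1]: 1 at the winner, near 0 far away.
                h = math.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += alpha * h * (x[k] - w[k])  # move the unit towards x
            step += 1
    return weights
```

Because every update is a convex step towards a data pattern, the unit weights never leave the range of the training data. Other decay schedules or neighbourhood shapes can be substituted without changing the structure of the loop.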
2.2. Hierarchical SOM
Hierarchical SOMs [37-41] share many characteristics with other methods such as the multi-layer SOMs [42, 43], multi-resolution SOMs, multi-stage SOMs [45, 46], fusion SOMs or Tree-SOMs.
All these methods share the idea of constructing a system using SOMs as building blocks. They vary in the way these SOMs interact with each other, and with the original data. We consider as Hierarchical SOMs, those where, at some stage, one of the SOMs receives as inputs the outputs of another SOM, as will be described later. This type of structure resembles a multi-layer perceptron (MLP) neural network in the sense that multiple layers exist connected in a feed-forward way. However, Hierarchical SOMs have completely different training algorithms and types of interaction between layers.
General multilayer SOMs may have many completely different interactions between layers. As an example, a data pattern may be mapped onto a given SOM, and then all data patterns mapped to that unit may be visualized on a second SOM. Another common type of architecture presents several SOMs in linked windows, providing an environment where a data pattern is visualised simultaneously in several SOMs. We do not consider these as Hierarchical SOMs because the outputs of one SOM are not used to actively train another SOM, nor does the second SOM, in any way, use information from the first map to map the original data patterns.
We consider that, to be recognized as a Hierarchical SOM, the interaction between different SOMs must be of the train/map type. This type of interaction is one where the outputs of one SOM are used to train the other SOM, and this second one maps (represents) the original data patterns using the outputs of the first one. If these two characteristics are not present, we consider we do not have a true Hierarchical SOM, because it is the train/map relationship that establishes a strict subordination between SOMs that in turn is necessary for a hierarchy to exist.
The train/map type of interaction encompasses different specific ways of passing information from one SOM to another. As an example, when a data pattern is presented to the first level SOM, it may pass information on to the second level in the form of the index of the best matching unit (BMU), the quantization error, the coordinates of the BMU, all activation values for all units of the first level, or any other type of data. The important issue is that whatever data is passed on is used to train the second level SOM. A particular case of output of one SOM layer may be the original data pattern itself, or an empty data pattern. This is the case of a first level gating SOM that filters which data patterns are sent to each upper level SOM: it may or may not pass the pattern, depending on some characteristic.
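To make the options listed above concrete, the following sketch computes several candidate outputs of a first level SOM for a single pattern. It is only an illustration under stated assumptions: the function name `first_level_outputs`, the toy weights, and the use of exp(-distance) as an activation value are all hypothetical choices, not part of any specific HSOM proposal.

```python
import math


def first_level_outputs(weights, x):
    """Candidate outputs of a first-level SOM for one pattern x.

    `weights` maps grid coordinates (row, col) to weight vectors.
    """
    units = sorted(weights)                            # fixed unit ordering
    dists = [math.dist(weights[u], x) for u in units]
    index = min(range(len(units)), key=dists.__getitem__)
    return {
        "bmu_index": index,                            # position in the unit list
        "bmu_coords": units[index],                    # (row, col) on the grid
        "quantization_error": dists[index],            # distance to the BMU
        "activations": [math.exp(-d) for d in dists],  # one value per unit (assumed form)
    }


# A toy 1x3 map with hand-picked weights (illustrative values only).
toy_som = {(0, 0): [0.0, 0.0], (0, 1): [0.5, 0.5], (0, 2): [1.0, 1.0]}
out = first_level_outputs(toy_som, [0.9, 1.0])

# One possible second-level training vector: BMU coordinates plus quantization error.
second_level_input = list(out["bmu_coords"]) + [out["quantization_error"]]
```

Whichever of these values is selected, the vectors built this way for all patterns form the training set of the second level SOM.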
Still, many different configurations are possible for Hierarchical SOMs. They may vary in the number of layers used, in the different ways connections are made and even in the information sent through each connection.
2.3. Why use Hierarchical SOMs (HSOM)?
There are mainly two reasons for using a Hierarchical SOM (HSOM) instead of a standard SOM:
A HSOM can require less computational effort than a standard SOM to achieve certain goals;
A HSOM can be better suited to model a problem that has, by its own nature, some sort of hierarchical structure.
The reduction of computational effort can be achieved in two ways: by reducing the dimensionality of the inputs to each SOM, and by reducing the number of units in each SOM. Instead of having a SOM that uses all components of the input patterns, we may have several SOMs, each using a subset of those components, and in this way we minimize the effect of the “curse of dimensionality”. The distance functions used for training the different SOMs will be simpler, and thus faster to compute. This simplicity will more than compensate for the increase in the number of different functions that have to be computed. Speed gains can also be achieved by using fewer units in each SOM. The finer distinction between different clusters (units) can be achieved in upper level SOMs that will only have to deal with some of the input patterns. This “divide and conquer” strategy avoids computing distances and neighbourhoods for units that are very different from the input patterns being processed at each instant.
The second reason for using HSOMs is that, in general, they are better suited to deal with problems that present a hierarchical/thematic structure. In these cases, HSOM can map the natural structure of the problem, by using a different SOM for each hierarchical level or thematic plane. This separation of the global clustering or classification problem into different levels may not only represent the true nature of the phenomena, but it may also provide an easier interpretation of the results, by allowing the user to see what clustering was performed at each level. GIScience applications, as already discussed, have a strong thematic structure that can be expressed with a different SOM for each theme, and an upper level (hierarchically superior) SOM that fuses the information to produce globally distinct clusters.
HSOMs are often used in application fields where a structured decomposition into smaller and layered problems is convenient. Some examples include: remote sensing classification, image compression, ontology [43, 50], speech recognition, and pattern classification and extraction using health data [52-54], species data, financial data, climate data, music data [58, 59] and electric power data.
3. Taxonomy for Hierarchical SOMs
Based on the survey of the work made on the field, we propose the following taxonomy to classify the HSOM methods (Fig.1).
This is a possible taxonomy for the HSOM based on their objective and on the type of structure used. Therefore, the first partition groups HSOM methods in two main types: the agglomerative and divisive HSOMs (Fig.2). This partition results from the type of approach adopted in each HSOM method. In an agglomerative HSOM, we usually have several SOMs in the first layer (i.e., the layer directly connected to the original data patterns), and then fuse the outputs in a higher level SOM, while in the divisive HSOM, we will usually have a single SOM in the first layer, and then have several SOMs in the second layer.
In the agglomerative HSOM (Fig.2a), the level of data abstraction increases as we progress up the hierarchy. Thus, usually the first level on the HSOM is the more detailed representation (or a representation of a particular aspect of the data) and, as we ascend in the structure, the main objective is to create clusters that will be more general and provide a simpler, and arguably easier, way of seeing the data.
In the divisive HSOM (Fig.2b), the first level is usually less accurate and uses small networks. The main objective of this level is to create rough partitions, which will be more detailed and accurate as we ascend in the levels of HSOM.
In the second taxonomic level, agglomerative HSOMs can be divided into thematic and cluster based, while divisive HSOMs can be divided into static and dynamic. In the following sections, we describe each category.
3.1. Thematic agglomerative HSOM
The first class of agglomerative HSOMs is named Thematic. The name results from the fact that the input space is regarded as a collection of subspaces, each one forming a theme. Fig.3 presents a diagram exemplifying how HSOM methods are generally structured in this category.
In a thematic HSOM, the variables of the input patterns are grouped according to some criteria, forming several themes. For instance, in the case of census data, variables can be grouped into different themes such as economic, social, demographic or other. Each of these themes forms a subspace that is then presented to an SOM, and its output will be used to train a final merging SOM. As already stated, the type of output sent from the lower level SOM to the upper level can vary in different applications.
In Fig.3, each theme is represented by a subset of the original variables. Assuming that each original data pattern (with all its variables) would get represented by a grey circle, a portion of that circle is used to represent the subset of each data pattern used in each theme.
This structure presents several advantages when performing multidimensional clustering. The first advantage is the reduction of computation caused by the partition of the input space into several themes. This partition also allows the creation of thematic clusters that, per se, may be interesting to the analyst. Thus, since different clustering perspectives are presented in the lower level, these can be compared to the global clustering solution allowing the user to better understand and explore the emerging patterns.
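The thematic structure described in this section can be sketched as follows. This is a minimal illustration under several assumptions: the compact trainer `train_som`, the helper `bmu`, the function `train_thematic_hsom`, the grid sizes, and the choice of BMU coordinates as the output passed to the merging SOM are all hypothetical; as noted above, other outputs may be passed upward instead.

```python
import math
import random


def train_som(data, rows, cols, epochs=30, seed=0):
    """Compact SOM trainer (linear decay, Gaussian neighbourhood)."""
    rng = random.Random(seed)
    units = [(i, j) for i in range(rows) for j in range(cols)]
    weights = {u: list(rng.choice(data)) for u in units}  # start from random patterns
    steps = epochs * len(data)
    for t in range(steps):
        x = rng.choice(data)
        decay = 1.0 - t / steps
        alpha = 0.5 * decay
        radius = max(max(rows, cols) / 2.0 * decay, 0.5)
        win = bmu(weights, x)
        for u, w in weights.items():
            h = math.exp(-math.dist(u, win) ** 2 / (2.0 * radius ** 2))
            for k in range(len(w)):
                w[k] += alpha * h * (x[k] - w[k])
    return weights


def bmu(weights, x):
    """Grid coordinates of the best matching unit for pattern x."""
    return min(weights, key=lambda u: math.dist(weights[u], x))


def train_thematic_hsom(data, themes, theme_grid=(3, 3), top_grid=(2, 2)):
    """themes: {name: list of column indices}.

    Returns (theme SOMs, merging SOM, inputs used to train the merging SOM).
    """
    theme_soms = {}
    top_inputs = [[] for _ in data]
    for name, cols in themes.items():
        subspace = [[row[c] for c in cols] for row in data]  # project onto the theme
        som = train_som(subspace, *theme_grid)
        theme_soms[name] = som
        for features, pattern in zip(top_inputs, subspace):
            features.extend(bmu(som, pattern))               # pass BMU coordinates upward
    top_som = train_som(top_inputs, *top_grid)
    return theme_soms, top_som, top_inputs
```

In this sketch each theme contributes two values (the BMU row and column) to the merging SOM, regardless of how many original variables the theme contains. The thematic maps remain available for inspection alongside the global partition.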
3.2. Agglomerative HSOM based on clusters
This category is composed of two levels, each using a standard SOM (Fig.4). The first level SOM learns from the original input data, while its output is used in the second level SOM. The second level SOM is usually smaller, allowing a coarser, but probably easier to use, definition of the clusters. In this architecture, if only the coordinates of the bottom level SOM are passed as inputs to the top level, each unit of the top level SOM is the BMU for several units from the first level. In this case, the top level is simply clustering together units of the bottom level, and the final result is similar to using a small standard SOM. However, this method has the advantage of presenting two SOMs mapping the same data with different levels of detail, without having to train the top level directly with the original patterns. Fig.4 presents the diagram of this category of HSOM.
A HSOM based on clusters will be significantly different from a standard SOM if, instead of using only the coordinates of the BMU, more information is passed as input to the top level. As an example, one might use both the coordinates and the quantization error of the input patterns as inputs to the top level. In this case, the top level SOM will probably cluster together patterns that have high quantization error (i.e. patterns that are badly represented) in the first level. Thus the top level SOM could be used to detect input patterns that, by being misrepresented in the first level, require further attention.
The name proposed for this class (HSOM based on clusters) stems from the fact that the bottom level SOM uses the full patterns to obtain clusters, and the information about those clusters is the input to the top level SOM. Depending on what cluster information is passed on, the HSOM based on clusters may be similar or very different from the standard SOM.
3.3. Static divisive HSOM
In this category, the HSOM has a static structure, defined by the user. The number of levels and the connections between SOMs are predefined according to the objective. Fig.5 presents two examples of HSOM structures possible in this category.
In the first case (Fig.5a) the bottom level SOM creates a rough partition of the dataset and, in a second level, an SOM is created for each unit of the first level SOM. Each of these second level SOMs receives as input only the data patterns represented by its origin unit in the bottom level, which acts as a gating device.
In the second case (Fig.5b), each top level SOM receives data from several bottom level units. This allows different levels of detail for different areas of the first level SOM.
The main advantages of static divisive HSOMs over large standard SOMs are the reduction of computational effort due to the small number of first level units (and only some of the top level units will be used in each case), and the possibility of having different detail levels for different areas of the SOM. If, for example, we want to train a 100x100 unit SOM, we may use a bottom level SOM with 10x10 units, and a series of 10x10 unit SOMs to form a mosaic in the second level. While each training pattern will require the computation of 10,000 distances in the first case, it will require only 100 + 100 = 200 distances in the second.
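The distance count in the example above can be checked with a short sketch of a gated, two-level lookup. The mosaic below is a hypothetical, untrained layout built only to count distance computations; the names `nearest` and `divisive_lookup` and the regular placement of the units are assumptions.

```python
import math


def nearest(weights, x):
    """Return the key of the unit closest to x and the number of distances computed."""
    key = min(weights, key=lambda u: math.dist(weights[u], x))
    return key, len(weights)


def divisive_lookup(bottom, tops, x):
    """Map x through a gating bottom-level SOM and the top-level SOM it selects."""
    gate, n_bottom = nearest(bottom, x)    # rough partition
    unit, n_top = nearest(tops[gate], x)   # fine partition inside that region
    return gate, unit, n_bottom + n_top


# Hypothetical mosaic covering the square [0, 100) x [0, 100):
# a 10x10 bottom level, plus one 10x10 top-level SOM per bottom-level unit.
bottom = {(i, j): [10 * i + 5.0, 10 * j + 5.0] for i in range(10) for j in range(10)}
tops = {
    (i, j): {(a, b): [10 * i + a + 0.5, 10 * j + b + 0.5]
             for a in range(10) for b in range(10)}
    for i in range(10) for j in range(10)
}

gate, unit, n_distances = divisive_lookup(bottom, tops, [37.2, 81.9])
flat_distances = sum(len(t) for t in tops.values())  # units in one flat 100x100 SOM
```

Here `n_distances` is 200 while `flat_distances` is 10,000, matching the figures in the text. The saving relies on the gate selecting the right region, so a pattern close to a region boundary may be mapped to a slightly different unit than an exhaustive search would find.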
3.4. Dynamic divisive HSOM
Finally, the category of dynamic divisive HSOMs is characterized by the structure’s self-adaptation to data. These methods, also known as Growing HSOM , allow the growth of the structure during the learning phase. Two types of growth are allowed: horizontal and vertical growth. The first concerns the increase in the number of units of each SOM, while the second concerns the increase of the number of layers in the HSOM (Fig.6).
A diagram of this type of HSOM is shown in Fig.6. The size of each level SOM and the number of levels is defined during the learning phase and relies on some criteria such as the quantization error.
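As an illustration of how a quantization error criterion might drive both types of growth, the sketch below applies two thresholds to the errors of one SOM in the hierarchy. The function name `growth_decision`, the threshold names `tau_1` and `tau_2`, their default values, and the exact form of the two tests are assumptions made for this example; actual dynamic divisive methods differ in how they define and schedule these tests.

```python
def growth_decision(unit_errors, parent_error, root_error, tau_1=0.25, tau_2=0.1):
    """Decide how one SOM in a growing hierarchy should grow.

    unit_errors:  mean quantization error of each unit of the SOM.
    parent_error: quantization error of the unit this SOM refines.
    root_error:   quantization error of the whole dataset.
    tau_1, tau_2: hypothetical thresholds for horizontal and vertical growth.
    """
    mean_error = sum(unit_errors) / len(unit_errors)
    # Horizontal growth: the map as a whole still represents its data too coarsely.
    grow_horizontally = mean_error > tau_1 * parent_error
    # Vertical growth: individual units that deserve a child SOM in a new layer.
    expand = [i for i, e in enumerate(unit_errors) if e > tau_2 * root_error]
    return grow_horizontally, expand
```

Smaller thresholds produce larger maps and deeper hierarchies, so the choice of these two values largely determines the final structure.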
4. Some HSOM implementations proposed in the literature
One of the first works related to HSOM was proposed by Luttrell. In his work, hierarchical vector quantization is proposed as a specific case of multistage vector quantization. This work stresses the difference in input dimensionality between standard and hierarchical vector quantization and proves that distortion in a multistage encoder is minimised by using SOM.
HSOM has also been analysed as a clustering tool. The structure proposed is based on choosing, for each input vector, the index of the best matching unit from the first level to train the second level map. The first level produces many small mini-clusters, while the second produces a smaller number of broader and more understandable clusters.
HSOM has proved to be quite valuable for processing temporal data, often using different time scales at different hierarchical levels. An example is the work of [58, 60], where the authors use HSOM to perform sequence classification and discrimination in musical and electric power load data. Another example is the use of HSOM to process sleep apnea data.
Another class of HSOM is the Growing Hierarchical Self-Organizing Map (GHSOM). This neural network model is composed of several SOMs, each of which is allowed to grow vertically and horizontally. During the training process, until a given criterion is met, each SOM is allowed to grow in size (horizontal growth) and the number of layers is allowed to grow (vertical growth) to form a layered architecture such that relations between input data patterns are further detailed at higher levels of the structure. One of the problems of GHSOM is the definition of the two thresholds used to control the two types of growth. Several authors have proposed variants of this method to better define these criteria. One example is the Enrich-GHSOM. Its main difference is the possibility of forcing the growth of the hierarchy along some predefined paths. This model classifies data into a pre-defined taxonomic structure. Another example of a GHSOM variant is the RoFlex-HSOM extension. This method is suited to non-stationary time-dependent environments by incorporating robustness and flexibility in the incremental learning algorithm. RoFlex-HSOM exhibits plasticity when finding the structure of the data, and gradually (but not catastrophically) forgets previously learned patterns. A Tension and Mapping Ratio (TMR) extension to the GHSOM has also been proposed. Two new indexes are introduced, the mapping ratio (MR) and the tension (T), which control the growth of the GHSOM. MR measures the ratio of input patterns that are better represented by a virtual unit placed between two existing units. T measures how similar the distances between all the units are.
Another example of HSOM is the Hierarchical Overlapped SOM (HOSOM). The process starts by using just one SOM. After completing the unsupervised learning, each unit is labelled. Then, a supervised learning method (LVQ2) is used and units are merged or removed, based on the number of mapped patterns. After this, LVQ2 is applied again and, based on the classification quality, additional layers can be created. The process is then repeated for each of these layers.
A similar structure is used in a cooperative learning algorithm proposed for the hierarchical SOM. In the first layer, some BMUs are selected, and for each of these BMUs a SOM in the second layer is created. Input patterns used in this second level SOM are derived from the original BMU.
Ichiki et al. propose a hierarchical SOM to deal with semantic maps. In this proposal, each input pattern is composed of two parts: the attribute and the symbol. The attribute part is composed of the variables describing the input pattern, while the symbol part is a binary vector. The first level SOM is trained using both parts of the patterns, while the second level SOM only uses the symbol set and information from the first level.
HSOM has also been used for phoneme recognition. The authors use sound signal attributes in a first level SOM to classify the phonemes into pause, vocalised phoneme, non-vocalised phoneme, and fricative segment. After phonemes are classified, a feature frequency-scale vector is used to train the corresponding second level SOM.
A different approach is the tree structured topological feature map (TSTFM). This approach uses a hierarchical structure to search for the BMU, thus reducing computation times. While the purpose of this approach is strictly to reduce computation times, its tree searching strategy is in effect a series of static divisive HSOMs.
Miikkulainen proposes a hierarchical feature map to recognize an input story (text) as an instance of a particular script by classifying it in three levels: scripts, tracks and role bindings. At the lowest level, a standard SOM is used for a gross classification of the scripts. Each second level SOM receives only the input patterns relative to its script, and different tracks are classified at this level. Finally, in the third level a role classification is made.
Table 1 provides a classification using the proposed taxonomy for the HSOM discussed above.
|Method|Classification in proposed taxonomy (1st level)|Classification in proposed taxonomy (2nd level)|Main objective|
| |Agglomerative|cluster based|Sequential data classification and discrimination|
| |Divisive|dynamic|Exploratory data mining|
| |Divisive|static|Exploratory data mining|
| |Divisive|dynamic|Exploratory data mining|
| |Divisive|dynamic|Exploratory data mining|
| |Divisive|static|Exploratory data mining|
| |Agglomerative|cluster based|Create semantic maps|
| |Agglomerative|thematic|Capture the various levels of information in a musical piece|
4.1. GeoSOM Suite’s HSOM implementation
The GeoSOM Suite is a public domain software package for working with SOMs that is particularly oriented towards geo-referenced datasets. It is implemented in Matlab® and uses the public domain SOM toolbox. A standalone graphical user interface (GUI) was built, allowing non-programming users to evaluate the SOM and GeoSOM algorithms. GeoSOM is an extension of SOM, specially oriented towards spatial data mining. The GeoSOM Suite is freely available. The purpose of GeoSOM Suite is to: 1) present spatial data; 2) train maps with the SOM and GeoSOM algorithms; 3) produce several representations (views) of the data; and 4) establish dynamic links between views, allowing an interactive exploration of the data.
The GeoSOM Suite implementation of HSOM uses a thematic agglomerative hierarchical SOM (see taxonomy in Fig.1). Fig.7 presents a scheme of the HSOM where several thematic SOMs are created, according to the themes used.
This HSOM divides the input data space into several subspaces according to different themes. Fig.7 shows an example of HSOM using three themes: a, b and c. Each of these themes can be viewed as a subspace created by a subset of variables from the dataset. For instance, if theme a is demography, some of the possible variables to use in it are the age structure, the number of inhabitants, the number of births, etc.
Each of these data subspaces is used to train a SOM, and its output will be used to train a final merging SOM. When compared to the standard SOM, this approach has the advantage of setting an equal weight for each theme.
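Generally, the implemented HSOM proceeds in three steps: the variables are grouped into themes, a first level SOM is trained on each theme, and the outputs of these SOMs are used to train the final merging SOM.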
GeoSOM Suite’s implementation of HSOM is shown in Fig.8. GeoSOM suite presents an interface where the user can choose the HSOM inputs, based on the SOMs created before, and/or the original variables. Thus, to create a structure like the one presented in Fig.7, the user must create three first level SOMs. Each of these SOMs will use the variables relative to one theme. Then the user can create the HSOM by choosing as input data the outputs obtained from the three SOMs. Fig.8 presents a screen-shot of GeoSOM Suite in which this selection and the HSOM parameterization is shown.
In this chapter we presented a case for using Hierarchical Self-Organizing Maps (HSOM) when analysing high-dimensional spatial data. We showed that several different approaches can be used to construct HSOM, and presented a taxonomy for them. We pointed out strengths and shortcomings of the different variants, and reviewed several previous proposals of HSOM in the light of the proposed taxonomy. Finally, we presented an implementation of a HSOM that is particularly well suited for spatial analysis. This implementation is publicly available for general use.