Palynology - “The study of pollen grains and other spores, especially as found in archaeological or geological deposits. Pollen extracted from such deposits may be used for radiocarbon dating and for studying past climates and environments by identifying plants then growing.” 
Over 20% of all the world’s plants are already at the edge of becoming extinct . Saving earth’s biodiversity for future generations is an important global task  and asmanymethods as available must be combined to achieve this goal. This involves mapping plants distribution by collecting pollen and identifying them in a laboratory environment.
Pollen grain classification has been an expensive qualitative process, involving observation and discrimination of features by a highly qualified palynologist. It is still the most accurate and effective method. But it certainly limits research progress, taking considerable amounts of time and resources .
Automatic recognition of pollen grains can overcome these problems, producing purely objective results faster. Such a tool would provide invaluable in the studies of flora. This advantages were obvious for Flenley  , who proposed the implementation of an automatic pollen grain classification systemin 1968. However, the idea was intractable at that time. Mainly, because of technology restrictions. Nowadays, technology is not a barrier any more, and the discussed system is a reality thanks to computer vision.
This chapter presents the latest results obtained by the authors in the field of automatic pollen grain classification. This will be done by introducing a developed system, paying special attention to the phases of preprocessing(section 3.1) and feature extraction (section 4). Results for a 17 pollen species database obtained with the commented system will also be shown (section 6).
2. Related work
The begins of automatic pollen identification were based on scanning electron microscope (SEM) images. Langford applied statistical classifiers on texture parameters on 1988, reporting a 94.30% of accuracy on a six pollen class database . Later, artificial neural networks (ANN) were used on the classification task, achieving a success rate of 100% with 3 classes .
However, SEM images are expensive and difficult to produce and the use of light microscope (LM) images were explored in 1998 . Again, first attempts were not fruitful due to the low quality images provided by the technology of the time. But recent works has demonstrated that the use of LMs images is, in fact, possible.
For example,  reported a 100% of success with a small database containing 4 classes. Moreover, it was one of the first works using artificial neural networks for the classification phase, along with texture parameters. Again,  used artificial neural networks for classification. This time, brightness and shape descriptors were extracted as pollen features. A 90% of accuracy with a 3 class database was reported.
 and  presented a more complex work, combining shape and ornamentation of the grains; using simple geometric measures, and concurrence matrices applied for the measurement of texture. Again, artificial neural networks were used for classification. These works reported a 87.7% recognition rate for a 5 classes database and a 97.7% for a three class database respectively.
 describes an automatic optical recognition and classification of pollen grains system. This is able to locate pollen grains on slides, focus and photograph them before identify the species applying a trained neural network. The system achieved a 90% of recognition rate with a 3 class database.
Other works usemore sophisticate capturemethods, achieving 3 dimensional representations of the pollen grains.  presented a combination of statistical reasoning, feature learning and expertise knowledge. A feature extraction algorithm was applied alternating 2D and 3D representations. Iterative refinement of hypotheseswas used during the classification process. This work reported a 77% of accurate rate in a database with 30 classes and 97% when only 4 classes were used. An other example, , which used a confocal laser microscope to create the 3D models, achieved a 90% recognition rate with 3 classes database.
3. Pollen extraction
At the actual development stage of the system, the detection of pollen grain is highly but not fully automatic. This should not be of any surprise, as the task of pollen location inside sample images is itself a different problem, which is as much complicated as the problem studied in this chapter.
Thus, users should first select and area with a pollen grain inside. Preferably, an area, as small as possible, where an isolated pollen grain is located. This user selected region of interest (ROI) is then automatically preprocessed to detect the contour of the grain (see figure 1).
This section introduces the automatic preprocessing algorithm used for pollen extraction and preparation. It is important to remind that this process is applied to the image area manually selected by the user, like that showed on figure 1. The preprocessing steps are (see figure 2):
Decorrelation stretching: This process aims to reduce the autocorrelation of the information contained in the image . This is done as a three steps process:
(a) The original bands are transformed to their principal components.
(b) The principal components are then stretched separately.
(c) The resulting data is transformed back to the original space applying the inverse of the principal component transformation.
The results is a linear transformation of the spectral bands, resulting in uncorrelatedvariables with unit variance, and enhancing displays. The result can be seen in figure 3.
Saturation:The saturation channel of the image represents the amount of colour used at each pixel, i.e. the lower the saturation is the greyer the pixel is. This channel is actually extracted from the HSV image representation .
In particular, the saturation channel is computed as:
The result of computing the saturation channel of the docorrelation stretched image is shown on figure 3. The simplification of the task of differentiating pollen and background is obvious.
Histogram equalization: Equalizing the histogram of an image aims to obtain a uniform distribution of the pixel values. This maximizes the contrast without loosing structural information, i.e., conserving the entropy .
Binarization: The binarization of an image consist of transform each pixel’s value to ’0’ or ’1’ depending on whether it has a value lower or higher (respectively) than a set threshold. This results on a simple image containing pure geometric information.
Mask: Finally, in a bid to obtain a clear mask of the pollen grain, several image processing functions are applied such as “imfill” and “bwareaopen” provided by the Image Processing Toolbox of Matlab .
The resulting mask can be either used for feature extraction or to remove the background of the pollen grain image. The result of applying each preprocessing step can be seen on figure 3
4. Feature extraction
Pollen images by their own does not prove to be a high quality information for the task of automatic pollen grain classification. Although they contain the necessary information, this information is hidden and diffused around the image and behind other unimportant data. In order to extract the relevant information from raw samples, they need to be further processed by the feature extractor.
A total of 50 features are extracted from the pollen images. I.e. the output of the feature extraction block is a vector with length 50. These 50 features corresponds to 24 geometricparameters carrying information regarding size and basic shape, and 26 texture parameterswith information about how pixel intensities are distributed on the image. A detailed view ofeach of these features will be given here.
Certainly, colour may be an attractive source of information. However, since the preparation of pollen grain samples imply the use of a stain, it is not recommended to use it. Moreover, the stain effects is not constant along time and the colour of the same sample may change.
4.1. Geometric parameters
Geometric parameters contain information about the size and the basic shape of the pollen grains. The 24 geometric parameters extracted in the systems presented in this chapter are:
Area: Refers to the amount of pixels with level ’1’ in the pollen mask.
BoundingBox: Smallest rectangle enclosing the pollen. In particular, parameters width and hight are used as:
Centroide: Refers to the mass centre of the pollen grain. Coordinates (x, y).
MajorAxisLength: Length of the major axis of the ellipse with the same second order normalized central moment of the object.
MinorAxisLength: Length of the minor axis of the ellipse with the same second order normalized central moment of the object.
ConvexArea: Area of the smallest convex shape enclosing the object.
EquivDiameter: Diameter of the circle with the same area as the object.
Perimeter: Length of the perimeter of themask image.
Extent: Portion of the area of the bounding box contained in the pollen.
Eccentricity: Relation between the distance of the focus of the ellipse and the length of the principal axis.
WeightedCentroid: This is a centroid computing weighted by the pixel values of the grey-scale image.
Thickness: This is the number of times that the mask has to be eroded with a 3x3 square filter, until it disappears, e.i. the image gets black.
Box: These are the coordinates of an inner rectangle area computed from the BoundingBoxparameters as:
4.2. Texture parameters
Texture parameters provide information regarding how pixels are distributed on the image, such as contour changes or objects inside the pollen grain.
The first 4 of the 26 texture parameters introduced in this section are computed using the grey level co-occurrence matrix (GLCM). This matrix gives information about the frequency of pixel value pairs combinations. In particular, the value of GLCM(i,j) is the number of times that a pixel with value ’j’ sits next and at the left of a pixel with value ’i’. Figure 5 shows and example of this.
Correlation: Measures how must correlated it a pixel with respect to its neighbours. This value is computed as:
Homogeneity: Measures how close the distribution of objects of the GLCM are to the diagonal of the GLCM. This is:
Entropy: This measure is applied to six different images derived from the original pollen grain image. These images are the the outer and inner bounding box (BoundingBoxand Box) of the blue channel of the RGB representation, the saturation and the value channels of the HSV representation. A representation can be seen in figure 6.
Entropies are are scalar values representing a statistical measure of the randomness of thepixel values. Each value is computed as:
where p is the histogram count of the corresponding image.
Fourier Descriptors: These measures are based on the analysis of the pollen contour points,and it provides information about the pollen shape. It is worth it to mention that a majorproperty of the fourier descriptors is its invariance to geometric transformations, such asrotation, scale and sift.
To compute these parameters, the complex representation of the contour zi= xi + jyiisused, where i= 0, 1, 2..., Nc−1 with Ncthe number of points of the contour. Moreover,the contour is sampled every 2 degrees. Now, the discrete Fourier transform (DFT) of z is:
The resultant complex coefficients a(u) are transformer in a power spectrum |a(u)|2.Finally, the discrete cosine transform (DCT) is applied to reduce the dimensionality of thevector, ending up with a vector of length 5.
Relative areas: This is a 5 elements vector which values correspond to the number of activepixels (pixels with value ’1’) after binarizing the pollen image with different thresholds. Inparticular, the thresholds used are 0.3, 0.4, 0.5, 0.6 and 0.7. Figure 7 shows an example ofthis.
Relative objects: In this case the number of objects (group of connected pixels with value’1’ and surrounded of pixels with value ’0’) contained inside the pollen grain are counted,using an inverted and masked version of the binarized images computed the relative areas.See figure 8 for an example.
Parameters are computed from a set of training samples.
The computed parameters are passed to the ANN so that it gets trained. This meansthat the ANN automatically adjusts its parameters to solve the problem of classify theparameters in different classes.
After the training process, a new testing parameters vector can be passed to the ANN andit will produce an output regarding the sample class.
An ANN is a mathematical model inspired in the structure and functional aspects of thebiological neural networks. It could be defined as a set of simple computational elementsmassively interconnected following a hierarchical organization .
In this case, a multilayer perceptron architecture trained by a back propagation algorithm(MLP-BP) is proposed. The principal characteristic of this algorithm is its ability to solvenon-lineal problems. Its architecture is composed of several layers. Each layer correspondsto a set of neurons receiving data from the previous layer and transmit data to the next layer.This layer can be divided in “input layer”, “hidden layer” and “output layer” as shown infigure 9. In this case, the number of hidden layers is set to one.
It is important note that the training process of the ANN contain an aleatory factor whichdetermines the solution found. In other words, the training process does not avoid localminimums. To overcome this limitation, the proposed classifier implements 11 individualANNs and sum their resulting scores to obtain a final response. The idea behind this fusionis that the set of computed solutions complement each other, i.e. some solutions correct theerrors produced by others.
6. Experimentation methodology, results and discussion
A systemwere implemented in order to test the quality of the proposed approach. This systemuses all the techniques introduced in previous sections (preprocessing, feature extraction and classification). This section gives the details about the database used and the experimentalprocedure, along with a detailed explanation of the obtained results.
The database used for the experimentation contains 345 images of 17 different pollen grainclasses. Images has been captured with a 2 mega-pixels digital camera connected to amicroscope set to apply a 40 times zoom.
More precisely, these images correspond to 17 sub-genders and species of 11 different familiesof tropical honey plants situated in Costa Rica (Central America). Table 1 shows the exactinformation about family, gender and specie.
Applying the pollen grain extraction algorithm introduced in section 3, a total of 423 pollenimages distributed on all species were obtained. The number of samples extracted for eachsample was greater than one. This was possible thanks to images such as that shown in figure10 where more than one pollen grain could be extracted. Figure 11 shows a sample of eachpollen specie included in the DDBB.
First, remember from section 5 that the design of the classifier include 30 ANNs fused at thescore level. Thus, the number of hidden units on the ANNs had to be specified. To do so,a set of experiments with different configurations were executed to find the optimal value.To obtain a valid measure of the performance of the system, 30 iterations of a hold 50% outcross-validation procedure was executed. Results will be shown and discussed in sections 6.3and 6.4 respectively. For now, it is enough to note that the optimal value were found with a 30neurons hidden layer.
Thus, using this optimal configuration of the ANNs, further experiments were executed toevaluate the performance of the designed system. In this case, 30 iterations of a K-foldscross-validation procedure were applied with values of ’K’ equal 3, 5, 7 and 10.
Note that the set of all experimental procedures (hold-50%-out and 3, 5, 7 and 10 folds) arebased on divisions of the database in disjoint training and test sets. Moreover, this experimentscan be seen as using different proportions of the database of training, i.e. using a differentnumber of samples for training. In particular, the proportions of samples used for training are1/2, 2/3, 4/5, 6/7 and 9/10 respectively.
It is important to note that every experiment was repeated 30 times in order to obtain a validmeasure of the system’s performance. Therefore, results are given in terms of mean percentageand standard deviation (mean % and std).
The first experiment tested different configurations of the ANNs. Figure 12 shows the progressof the success rate when the number of units in the hidden layer increased from 10 to 150. Ahighest rate of 90.54% of success rate were obtained with 30 units (see table 2).
A second group of experiments aimed to measure the system’s performance with differentnumber of samples for training. Table 3 shows the results obtained for 3, 5, 7 and 10 folds.Note that the success rate increased with the number of training samples (from 90.54% to92.81%), while the std decreased (from 1.29 to 0.74).
It can be argued that the number of hidden units of the ANNs could be further optimizedexecuting a finer search around the point found. However, based on the similar accuracymeasures obtained between 10 and 80 units and stds higher than the range of accuracypercentages, paying the cost of running a finer search for a minimal increment of performancewas not worth it. Therefore, 30 units were chosen as the optimal point.
On the other hand, the results obtained for the second round of experiments show anincreasing in both system’s performance and stability. This seems to indicate that theperformance of the system may increase with a bigger training database.
This chapter has introduced the problem of automatic pollen grain classification, which isvital for biologists and flora researches among others. As pointed out in section 3, the taskof automatically detecting the pollen grains from samples is a complex problem itself and fallbeyond the scope of this chapter. Thus, a semi-automatic algorithm for pollen extraction wasexplored instead,
The chapter mainly focused its attention in giving a fair amount of both geometric and textureparameters. Moreover, the extraction of this parameters relied on the good work performedby the preprocessing block during the pollen’s perimeter definition.
Finally, these parameters were tested implementing a completed system. In particular, thesystem used the semi-automatic pollen detection and preprocessing algorithms introduced,along with the mentioned feature extraction techniques and a classifier based on the fusion of11 ANNs at the score level. The system was tested executing a number of experiments usingdifferent hold-out and k-folds cross-validation procedures. The results showed success ratesbetween 90.54% and 92,81%, pointing out the quality of the presented parameters for pollengrain classification. Moreover, these results improve those achieved by other authors such as, ,  and , even though the number of classified species was significantly larger.