Models Fitting to Pattern Recognition in Hyperspectral Images

Worldwide, food safety, particularly for agricultural products, has become a highly relevant topic. Hyperspectral imaging systems for the rapid detection of dangerous agents have emerged in response to this need. In this research project, we propose a new algorithm for Salmonella typhimurium detection on tomato surfaces in the visible range (400–1000 nm). A Gaussian model was used because its definite integral can be calculated; the final result of the algorithm is the area under the curve (AUC), which gives a quantitative description of the spectral signatures. Three doses (5, 10, and 15 μL) and a control (0 μL) were spread on the surfaces of 20 tomatoes. The spectral response was observed to decrease with higher dose in some samples, and the AUC values showed the same pattern numerically. As a last step, a single-factor analysis of variance showed no significant effect of dose. Despite this outcome, the algorithm proves to be a good methodology for pathogen detection.


Introduction
Hyperspectral imaging technology is well established in different areas of industry, such as mining, quality assessment in food processing, and the detection of diseases that affect crops and fruits, among others [1]; moreover, the falling cost of sensors and electronic circuits has allowed the gradual adoption of hyperspectral imaging systems. Supervised learning is characterized by the need to know the expected responses, based on human knowledge or on the characteristics of the system; these responses are known as the target function. The system compares the inputs (in our case, the set of pixels) with this function; this process of comparing and testing inputs against the expected responses is called learning. The learning process ends when the algorithm reaches an acceptable level of performance. Supervised learning can be grouped into two approaches: classification and regression [2].
Classification approach: the data are grouped into "categories," for example, "infected," "uninfected," "damaged," "undamaged," and "mature."
Regression approach: the data are treated as a continuous function that can be modeled with mathematical functions that predict behavior.
Some examples of supervised machine learning algorithms are:
• Linear regression
• Support vector machines
• Supervised neural networks
On the other hand, unsupervised processes try to model the distribution of the data and draw conclusions from it; this type of algorithm groups data with similar characteristics by itself, without the help of expected responses [3]. Accordingly, there are two approaches:
Clustering: in this type of analysis, the result is groups of data that share characteristics associated with certain trends, for example, the economy of a country with respect to the level of education of its population.
Association: in this type of analysis, the aim is to find rules that describe a large portion of the data, such as "people who buy X also tend to buy Y."
Examples of these unsupervised learning algorithms are:
Choosing one of these methodologies depends on the conditions of the experiment, that is, on whether the algorithm can be calibrated with an expected response. For example, if an expert can identify damaged areas in a crop before the research starts, this information can generate the target function and be used to train the algorithm [4]. Besides, the data provided by hypercubes are usually analyzed with statistical pattern recognition approaches in three-dimensional space; these analyses range from the simplest to the most complex. An additional way of getting relevant information from spectral data is to analyze its shape with curve fitting. Hyperspectral curve fitting also has the advantage of modeling multiple overlapping absorption, transmittance, and reflectance features with substantially fewer bands [5].
Moreover, the wavelet transform is another technique that has changed the way hyperspectral data are analyzed. Owing to its application in fields such as signal and image processing, pattern recognition, and data compression, the wavelet transform has become an alternative for data analysis and dimensionality reduction [6]. The main idea is to decompose a signal into a series of shifted and scaled versions of the mother wavelet function. This decomposition provides a hierarchical framework for interpreting the spectral information. Some researchers have used the wavelet transform for feature extraction, for example, to classify healthy and damaged areas in leaves [7]. Other researchers [8] have combined PCA with wavelet coefficients to improve dimensionality reduction and were also able to highlight small variations contained in the spatial information. Another interesting application of wavelets was the fusion of hyperspectral and multispectral data [9]; the fused image proved to contain more relevant information, because wavelets can be regarded as low-pass and high-pass filters that separate information not visible to the naked eye.
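As a minimal illustration of the decomposition described above, the following sketch applies one level of the Haar wavelet transform to a synthetic spectrum with NumPy. The choice of the Haar wavelet, the function names, and the test signal are assumptions made for brevity; the cited works may use other wavelets and several decomposition levels.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar transform: returns (approximation, detail)."""
    x = np.asarray(signal, dtype=float)
    if x.size % 2:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-pass branch (smooth trend)
    detail = (even - odd) / np.sqrt(2.0)  # high-pass branch (local variation)
    return approx, detail

def haar_idwt(approx, detail):
    """Invert one level of the Haar transform."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

# Synthetic spectrum (hypothetical): a single smooth band around 700 nm.
wavelengths = np.linspace(400.0, 1000.0, 64)
spectrum = np.exp(-((wavelengths - 700.0) / 80.0) ** 2)

approx, detail = haar_dwt(spectrum)
restored = haar_idwt(approx, detail)
```

The approximation coefficients halve the number of samples while keeping the overall shape, which is the sense in which the transform supports dimensionality reduction; repeating the step on `approx` gives the next, coarser level.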
Several studies that use model fitting and wavelet approaches can be found in the scientific literature; some of them are mentioned here. In Ref. [10], anomaly detection was investigated on a test data cube taken from a part of San Diego International Airport; the authors proposed a Gauss-Markov algorithm to detect and classify statistical parameters within the data, that is, the covariance matrix, and as a result they show two binary images with 100% target detection. A new algorithm based on an index of total chlorophyll (C_ab) content was developed in [11]; the authors proposed a new index called the area under curve normalized to maximal band depth between 650 and 725 nm (ANMB_650-725). As a preliminary step, the area under the continuum-removed reflectance curve in the range of 650-725 nm (AUC_650-725) was computed. As an outcome, the area under curve (AUC) divided by the maximal band depth could predict chlorophyll content with good accuracy. It should be noted that, although most current equipment operates between 400 and 2500 nm (visible and near infrared), it is important to select correctly the bands that contain the data of interest; for this reason, numerous works focus on algorithms for band selection [12][13][14]. In addition, Ref. [15] compared different mathematical models for describing hyperspectral scattering data in order to predict fruit firmness and soluble solid content (SSC) of Golden Delicious apples. The model used in that research was the Lorentzian distribution function, which gave a good fit, with an average correlation coefficient (r) greater than 0.995. Owing to the oval shape of apples, it was necessary to calculate the integral of the measured reflectance as a function of the area covered by the camera lens and the reflectance intensity I over the acceptance angle.
In conclusion, the authors state that mathematical modeling of scattering data to obtain the total light reflectance, using an appropriate Lorentzian function, can provide a good way to predict apple firmness and soluble solid content.

Salmonella typhimurium detection using hyperspectral imaging system
The detection of foodborne pathogens has been a topic of interest in recent decades because of food industry needs and government regulations. Traditional techniques based on agar culture media have serious shortcomings: confirmation is slow, large numbers of samples cannot be analyzed, and the fruits must be destroyed in order to plate the samples on the culture media. In contrast, hyperspectral imaging has emerged as a tool to detect bacteria in considerably less time [16].
Specifically, S. typhimurium infection is usually transmitted by the consumption of contaminated fruits, vegetables, fresh beef, or pork. Outbreaks caused by these bacteria have been reported in Canada, Europe, and the United States [17]; infection causes gastrointestinal problems, fever, and in some cases death.
On the other hand, Mexican tomato production faces the challenge of complying with regulations imposed by the United States (USA) and Canada, where agricultural products must meet safety requirements to be sold in the foreign market. In addition, economic losses in recent years, caused by long waiting times and unreliable detection of infectious agents, have created the need for faster and more efficient detection methods [18].
This research project focused on obtaining hyperspectral signatures and well-fitting Gaussian prediction models, calculating the AUC, and using this information to detect S. typhimurium on tomato surfaces. Hyperspectral imaging promises to be a good technique for worldwide food safety.

Biotechnological material
S. typhimurium was used, suspended in a culture medium (broth in a cryopreserved state) necessary for its survival. The experiment used the commercial selective media Salmonella-Shigella (SS) agar, Hektoen enteric agar, and xylose lysine deoxycholate (XLD) agar. The streak plate method was used to isolate the bacteria; to display and select the suspect colonies more easily, this procedure was performed in triplicate. Assay tubes with 5 mL of tetrathionate broth were inoculated with the S. typhimurium strain. This culture medium contains peptone and sodium carbonate; its selectivity results from sodium thiosulfate, which generates tetrathionate when 0.2% iodine-iodide solution and 0.1% brilliant green are added to each tube. This allows the growth of bacteria that contain the tetrathionate reductase enzyme and inhibits the development of accompanying microorganisms. The incubation time was 24 h at a temperature of 37 °C under aerobic conditions.

Tomato samples
Twenty postharvest tomatoes (Solanum lycopersicum L.) of the "Roma" variety were selected; they were purchased at a local supermarket in the municipality of General Escobedo, Nuevo Leon, Mexico. The tomatoes were of high visual quality.

Hyperspectral system
The hyperspectral camera used for this research was the PIKE F210b (Allied Vision Technologies GmbH), coupled to an ImSpector V10E spectrograph (Specim, Spectral Imaging Ltd.). The hyperspectral system is attached to a linear translation structure, which consists essentially of a belt, a motor, and a speed regulation stage; this is necessary because of the push-broom operation. The spectral range of the equipment goes from 400 to 1000 nm. Finally, the system is illuminated by two halogen-tungsten bulbs with a power of 60 W.

Sample inoculation
To start the research, the surfaces of 20 tomatoes were inoculated with Salmonella typhimurium at three different doses (5, 10, and 15 μL), leaving one zone with no contamination (0 μL), as shown in Figure 1. Each small drop was spread on the tomato surface with a micropipette. Twenty hypercubes were obtained, each with a spatial resolution of 600 × 1920, 1080 bands, and 12 bits of resolution.

Preprocessing and data preparation
Hyperspectral image processing usually includes a preliminary stage called preprocessing, which is necessary to remove the effects of dead pixels, noisy signals, errors caused by analog-to-digital conversion, etc. Additionally, because of the large amount of data, a calibration process and test hypercubes were required to correct the data [19]. A general workflow is shown in Figure 2, and its subsequent analysis is discussed in the next section.

Normalization
The analysis of hypercubes involves a huge amount of data, hence the need to adapt the hypercubes to more manageable sizes and thereby reduce computer processing time. As the first step, all hypercubes were normalized using Eq. (1):

$$\text{Hypercube}_{\text{Normalized}} = \frac{\text{RawHypercube} - \text{Dark\_reference}}{\text{White\_reference} - \text{Dark\_reference}} \quad (1)$$

where Hypercube_Normalized is the calibrated hypercube, RawHypercube is the raw data without any processing, Dark_reference was taken in the absence of illumination and with the camera lens covered, and the White_reference data cube was generated with a high-reflectance white tile and the lights on.
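The normalization step can be sketched with NumPy as follows, assuming the usual dark/white reference calibration in which the dark reference is subtracted from both the raw data and the white reference. The function name, the array shapes, and the synthetic values are hypothetical; the original processing was not necessarily implemented this way.

```python
import numpy as np

def normalize_hypercube(raw, dark, white, eps=1e-12):
    """Return (raw - dark) / (white - dark), computed element-wise."""
    raw = np.asarray(raw, dtype=float)
    dark = np.asarray(dark, dtype=float)
    white = np.asarray(white, dtype=float)
    span = white - dark
    # Avoid division by zero where the two references coincide (e.g. dead pixels).
    span = np.where(np.abs(span) < eps, eps, span)
    return (raw - dark) / span

# Tiny synthetic cube with axes (rows, columns, bands); values mimic 12-bit counts.
rng = np.random.default_rng(0)
dark = rng.uniform(50.0, 60.0, size=(4, 5, 6))
white = dark + rng.uniform(3000.0, 3500.0, size=(4, 5, 6))
raw = dark + 0.5 * (white - dark)  # a surface reflecting half of the white reference

normalized = normalize_hypercube(raw, dark, white)
```

With this convention a pixel equal to the dark reference maps to 0 and a pixel equal to the white reference maps to 1, so the normalized cube is a relative reflectance.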

Spatial and spectral crop
As mentioned before, each data cube had a spatial dimension of 600 × 1920. Most of this information corresponds to the background, which does not need to be analyzed, so a spatial crop was necessary. Each cube was reduced to an average spatial dimension of 280 × 565. In addition, it is not always necessary to keep the data at the start and end of the spectra, so a spectral crop was conducted to discard nonessential data.
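A spatial and spectral crop of this kind can be sketched with NumPy slicing. The axis order (rows, columns, bands), the linearly spaced wavelength axis, the pixel window, and the function name are assumptions for illustration; the 582 to 850 nm limits are the ones used later for the area calculation, and a small synthetic cube stands in for the real 600 × 1920 × 1080 data.

```python
import numpy as np

def crop_hypercube(cube, wavelengths, rows, cols, wl_min, wl_max):
    """Keep a spatial window and the bands whose wavelength lies in [wl_min, wl_max]."""
    band_mask = (wavelengths >= wl_min) & (wavelengths <= wl_max)
    cropped = cube[rows, cols, :][:, :, band_mask]
    return cropped, wavelengths[band_mask]

# Assumed band centres: 1080 bands spread linearly over 400-1000 nm.
wavelengths = np.linspace(400.0, 1000.0, 1080)

# Small stand-in cube so the example stays light on memory.
cube = np.zeros((40, 60, 1080), dtype=np.float32)

# Hypothetical window around the sample; real limits depend on where the tomato sits.
cropped, kept_wavelengths = crop_hypercube(
    cube, wavelengths, slice(5, 33), slice(2, 58), 582.0, 850.0
)
```

Under the assumed linear wavelength axis, the 582 to 850 nm mask keeps 482 bands, which agrees with the band count reported for the area calculation.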

Smoothing spectra
In hyperspectral preprocessing, smoothing methods are regularly used to remove high-frequency noise from the reflectance spectra; a common smoothing method in remote sensing is the Savitzky-Golay filter [20], which applies a least-squares polynomial fit over short windows of wavelengths. In this procedure, a window of 11 points and a polynomial of degree 2 were used. Figure 3 shows all spectra after the preprocessing described above. Each spectrum is the average over a region of interest (ROI) formed by a square of nine contiguous pixels (3 × 3).
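The ROI averaging and smoothing described above can be sketched with SciPy's `savgol_filter`, using the stated window of 11 points and polynomial degree 2. The helper names, the ROI position, and the synthetic noisy cube are hypothetical; this is an illustration of the technique, not the code used in the study.

```python
import numpy as np
from scipy.signal import savgol_filter

def roi_mean_spectrum(cube, row, col, half=1):
    """Average the spectra of a (2*half+1) x (2*half+1) block centred on (row, col)."""
    block = cube[row - half:row + half + 1, col - half:col + half + 1, :]
    return block.reshape(-1, block.shape[-1]).mean(axis=0)

def smooth_spectrum(spectrum):
    """Savitzky-Golay smoothing with an 11-point window and a degree-2 polynomial."""
    return savgol_filter(spectrum, window_length=11, polyorder=2)

# Synthetic cube: the same smooth spectrum at every pixel plus random noise.
rng = np.random.default_rng(1)
bands = np.linspace(0.0, 1.0, 200)
clean = 0.2 + 0.5 * bands - 0.3 * bands ** 2
cube = clean + rng.normal(0.0, 0.01, size=(9, 9, bands.size))

noisy = roi_mean_spectrum(cube, row=4, col=4)  # mean of nine pixels (3 x 3)
smoothed = smooth_spectrum(noisy)
```

Because the filter fits a local quadratic, it leaves an exactly quadratic spectrum unchanged and mainly removes the high-frequency component; a wider window or lower degree smooths more strongly at the cost of flattening narrow features.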

Obtaining modeling of spectral signatures
In nature, data are frequently distributed according to a Gaussian or normal distribution (as shown in Figure 2), so this model relates directly to the behavior of the datasets. Gaussian curve fitting is still being investigated as an algorithm for detecting patterns in the biological, social, and physical sciences [21].
To compute the Gaussian models for each spectrum, MATLAB 2016a and its curve fitting tool (cftool) were used. A total of 80 models were obtained; after trial and error, the best option found was a Gaussian model with five terms, of the form of Eq. (2):

$$f(x) = \sum_{i=1}^{5} a_i \exp\left[-\left(\frac{x - b_i}{c_i}\right)^2\right] \quad (2)$$

where $f(x)$ is the Gaussian model, $x$ is the wavelength (independent variable), and $a_1, \dots, a_5$, $b_1, \dots, b_5$, $c_1, \dots, c_5$ are the coefficients to be calculated.
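For readers without MATLAB, a five-term Gaussian fit of the same form, a sum of terms a_i·exp(−((x − b_i)/c_i)²), can be sketched with `scipy.optimize.curve_fit`. This is an assumed Python analogue rather than the cftool workflow used here; the synthetic spectrum, the "true" coefficients, and the initial guess are all hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

N_TERMS = 5

def gauss_sum(x, *params):
    """Sum of Gaussian terms; params = (a1, b1, c1, ..., an, bn, cn)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for a, b, c in zip(params[0::3], params[1::3], params[2::3]):
        total += a * np.exp(-((x - b) / c) ** 2)
    return total

# Synthetic spectrum on 482 bands between 582 and 850 nm (hypothetical coefficients).
wavelengths = np.linspace(582.0, 850.0, 482)
true_params = np.array([
    0.30, 600.0, 20.0,
    0.50, 650.0, 30.0,
    0.40, 700.0, 25.0,
    0.20, 760.0, 30.0,
    0.35, 820.0, 40.0,
])
rng = np.random.default_rng(2)
spectrum = gauss_sum(wavelengths, *true_params) + rng.normal(0.0, 0.002, wavelengths.size)

# Initial guess: amplitudes and widths off by 5 %, centres shifted by 2 nm.
p0 = true_params * np.tile([1.05, 1.0, 1.05], N_TERMS) + np.tile([0.0, 2.0, 0.0], N_TERMS)

fitted, _ = curve_fit(gauss_sum, wavelengths, spectrum, p0=p0, maxfev=20000)

residuals = spectrum - gauss_sum(wavelengths, *fitted)
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((spectrum - spectrum.mean()) ** 2)
```

With 15 free coefficients the result depends strongly on the starting point; on real spectra the initial centres and widths would have to come from the visible peaks or from a coarse search, which is presumably what the trial-and-error stage amounts to.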

Computing area under curve and statistical analysis
All areas were obtained by calculating the definite integral in Eq. (3). MATLAB 2016a provides an effective command called "quad", which evaluates the integral numerically with an adaptive Simpson quadrature [22]:

$$A = \int_{wl_0}^{wl_1} f(x)\,dx \quad (3)$$

where $A$ is the AUC; $wl_0$ and $wl_1$ are the lower and upper wavelength limits, respectively; and $f(x)$ is a Gaussian model. The range between 582 and 850 nm was used, distributed over 482 bands.
After the areas under the curves were obtained, a single-factor analysis of variance (ANOVA) was performed in Excel 2016 to find out whether there is any relationship between the dose (in steps of 5 μL) and the decrease in the spectral signature response; a total of 20 tomatoes and 80 areas were analyzed.
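The area calculation can be sketched in plain Python with a small recursive adaptive Simpson rule, which is the family of method the "quad" command belongs to. The implementation below is a simplified stand-in, not MATLAB's routine, and the Gaussian coefficients are hypothetical. Because each Gaussian term has a closed-form integral in terms of the error function, the numerical result can be checked against it.

```python
import math

def gauss_sum(x, terms):
    """Evaluate a sum of Gaussian terms (a, b, c) at the scalar x."""
    return sum(a * math.exp(-((x - b) / c) ** 2) for a, b, c in terms)

def _simpson(f, lo, hi):
    """Basic three-point Simpson rule on [lo, hi]."""
    mid = 0.5 * (lo + hi)
    return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

def adaptive_simpson(f, lo, hi, tol=1e-8, depth=40):
    """Adaptive Simpson quadrature: halve intervals until the local estimate settles."""
    def recurse(lo, hi, whole, tol, depth):
        mid = 0.5 * (lo + hi)
        left = _simpson(f, lo, mid)
        right = _simpson(f, mid, hi)
        delta = left + right - whole
        if depth <= 0 or abs(delta) <= 15.0 * tol:
            return left + right + delta / 15.0  # Richardson correction
        return (recurse(lo, mid, left, tol / 2.0, depth - 1)
                + recurse(mid, hi, right, tol / 2.0, depth - 1))
    return recurse(lo, hi, _simpson(f, lo, hi), tol, depth)

def closed_form_area(terms, lo, hi):
    """Exact integral of each term: a*c*sqrt(pi)/2 * [erf((hi-b)/c) - erf((lo-b)/c)]."""
    return sum(
        a * c * math.sqrt(math.pi) / 2.0
        * (math.erf((hi - b) / c) - math.erf((lo - b) / c))
        for a, b, c in terms
    )

# Hypothetical five-term model, integrated over 582-850 nm.
terms = [(0.30, 600.0, 20.0), (0.50, 650.0, 30.0), (0.40, 700.0, 25.0),
         (0.20, 760.0, 30.0), (0.35, 820.0, 40.0)]
area_numeric = adaptive_simpson(lambda x: gauss_sum(x, terms), 582.0, 850.0)
area_closed = closed_form_area(terms, 582.0, 850.0)
```

The closed form assumes positive widths c_i; when it is available, it avoids numerical integration altogether and gives the AUC directly from the fitted coefficients.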
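The single-factor ANOVA computed by Excel can be reproduced by hand from the between-group and within-group sums of squares. The sketch below uses only the Python standard library; the function name and the four small groups of values are hypothetical and do not correspond to the measured areas.

```python
def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom, and F statistic."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "F": f_stat,
    }

# Hypothetical AUC-like values, one list per dose (0, 5, 10, and 15 uL).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
table = one_way_anova(groups)
```

For the design described here, four dose levels with 20 areas each, the degrees of freedom would be 3 between groups and 76 within groups. The P-value is the upper-tail probability of the F distribution with those degrees of freedom at the computed F, and the dose effect is considered significant when it falls below the chosen level (commonly 0.05).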

Results and discussion
Gauss model results
Table 1 shows the goodness-of-fit results for the Gaussian models. The sum of squared errors (SSE) is notably low, close to zero, meaning that the models have a small random error component [23]; likewise, the coefficient of determination (R²) is higher than 0.9986, which indicates a close match between the Gaussian model and the spectral response to a given dose. The other two goodness-of-fit statistics, adjusted R² and root-mean-square error (RMSE), show values higher than 0.99920 and lower than 0.001, respectively.
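The four goodness-of-fit statistics can be computed from a measured spectrum and its fitted model as sketched below. The definitions follow the common convention in which adjusted R² and RMSE use the residual degrees of freedom (number of points minus number of coefficients); that the reported values follow exactly this convention is an assumption, and the function name and sample numbers are hypothetical.

```python
import numpy as np

def goodness_of_fit(y, y_fit, n_coeffs):
    """Return SSE, R^2, adjusted R^2, and RMSE for a fitted curve."""
    y = np.asarray(y, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    n = y.size
    dof = n - n_coeffs                       # residual degrees of freedom
    sse = float(np.sum((y - y_fit) ** 2))    # sum of squared errors
    sst = float(np.sum((y - y.mean()) ** 2)) # total sum of squares
    return {
        "sse": sse,
        "r2": 1.0 - sse / sst,
        "adj_r2": 1.0 - (sse / dof) / (sst / (n - 1)),
        "rmse": float(np.sqrt(sse / dof)),
    }

# Small hypothetical example; a five-term Gaussian model would use n_coeffs=15.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.0, 4.1, 4.9]
stats = goodness_of_fit(y, y_fit, n_coeffs=2)
```

Adjusted R² penalizes the number of coefficients, so it is the more informative of the two when comparing models with different numbers of Gaussian terms.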
Another way to assess how well a predicted model fits is through the residuals, defined as the differences between the original data and the response of the predicted model (Eq. (4)):

$$r = y - \hat{y} \quad (4)$$

where $r$ are the residuals, $y$ are the spectra of the contaminated zones, and $\hat{y}$ is the predicted model. An example of a residual plot is shown in Figure 4. If the residuals appear to behave randomly, the model fits the data well; if they follow a systematic pattern, there is a clear mismatch between data and model [24]. In this research, all 80 models showed random residuals.

Figure 4. The upper graph shows an example of a smoothed spectrum (blue) and the predicted model (red); the lower plot shows random behavior of the residuals, which indicates a good prediction.

Areas under curve and their analysis
The areas extracted from all spectral signatures are shown in Figure 5. In most subsamples the area tends to decrease with higher dose, which indicates greater absorbance on the infected surface. As exceptions, tomato surfaces 3, 4, 5, 6, 10, 11, 13, and 15 do not show this behavior; one possible explanation is the orientation and position of the sample at the time of hypercube acquisition, which produced small light-saturated zones.
As a last step, the results of the single-factor ANOVA are shown in Table 2.
Because the P-value is greater than 0.05, there is no significant relationship between dose and spectral response. A similar methodology was followed in [25].

Conclusion
Hyperspectral datasets are currently analyzed with many different methodologies, algorithms, and techniques; in this research, we proposed calculating the AUC as an alternative way of summarizing hypercubes, after which a single-factor ANOVA is enough for data analysis.
The results suggest that the visible range is not a good band for S. typhimurium detection. In addition, controlling sample orientation could improve the results, since even a small inclination generated zones with high saturation because of the shiny nature of the tomato surface.
The novelty of this work lies in the fact that there is little information on modeling spectral signatures and subsequently calculating the AUC as a method to determine factors such as the degree of contamination on fruit surfaces. Moreover, this methodology attempts to quantify a spectral signature by assigning it a value, in order to understand the phenomena that interact with hyperspectral imaging systems. Future work could focus on improving the AUC approach with different spectral responses on a variety of fruit surfaces. Because other variables that could affect the results remain to be considered, this work should be regarded as preliminary research.