Conventional biochemical methods used to differentiate and characterize rice types and their biochemical properties, and to address authentication and contamination issues, are difficult to implement because of the high cost of reagents, the time they require and their environmental impact. The success of agri-food technology is directly related to the quality of the analysis of experimental data acquired by sensors or by techniques such as infrared spectroscopy. To overcome these limitations, rapid and non-destructive methodologies for the discrimination and classification of rice have been investigated. Near-infrared spectroscopy is considered a fast, clean, and non-destructive analytical tool, and its spectra contain significant biomolecular information that must be analysed by sophisticated methodologies. Machine learning plays an important role in the analysis of spectral data: methods such as Partial Least Squares, Principal Component Analysis, Partial Least Squares-Discriminant Analysis, Support Vector Machine and Artificial Neural Network, among others, can be successfully applied to food classification and discrimination as well as to authentication and contamination issues. The quality control of rice is extremely important at every stage of production, from the evaluation of raw agricultural materials and the monitoring of their quality during storage, through the estimation of food quality during the production process and of the final products, to the determination of their authenticity and the detection of adulterants.
- Machine Learning
1.1 Rice (Oryza sativa L.): biochemical and physical characteristics
Many characteristics of grain quality, such as milling behaviour, appearance, nutritional properties, and cooking qualities, are routinely evaluated. The evaluation methods for rice varieties are based on their chemical composition (protein, moisture, fat, and ash), apparent amylose concentration, gelatinization temperature, gel consistency and dough viscosity. These procedures rely on standardized methods, which are often considered slow and expensive. The classification and characterization of different types of rice depend on several physicochemical parameters, namely biometric data and the protein, fat, ash, moisture, starch, and amylose contents, among others.
Starch is one of the main components of the rice grain and its essential carbohydrate reserve, which explains its impact on the evaluated physicochemical parameters. Starch is a complex polysaccharide composed exclusively of α-D-glucose units and consists of two classes of glucose polymers: amylose and amylopectin. Amylose (20–30%) is an essentially linear polymer in which the D-glucose units are joined by a sequence of α-D-(1,4)-glucosidic linkages, giving rise to a linear or helical chain. The much less frequent α-(1,6)-glucosidic linkages form the branch points between chains, creating the highly branched domains of amylopectin (70–80%). Amylose is considered the most important determinant of the eating quality of rice and, based on amylose content, rice varieties can be classified as waxy (0–2%), very low (3–12%), low (13–20%), intermediate (21–25%) and high (>26%). The classical and still commonly used method for amylose and amylopectin determination is the iodine reaction coupled with potentiometric or amperometric titration. Other methods, such as differential scanning calorimetry, potentiometric, spectrophotometric, and chromatographic [14, 15] methods, can also be used for classification and detailed analysis. The fine structure of amylose, in terms of both molecular size and chain-length distribution, is also a significant factor in the hardness of cooked rice. Amylose content is correlated with retrogradation behavior, influencing the textural properties of cooked rice and the dynamic viscoelasticity of rice starch gel. Grain elongation, volume expansion and water absorption characteristics also account for cooked rice quality.
Protein and lipid contents are also accepted characteristics for defining rice quality. After starch, protein is the second main component of rice and comprises four fractions: albumin (soluble in water), globulin (soluble in salt), glutelin (soluble in alkali), which is the dominant protein in brown rice and white rice, and prolamine (soluble in alcohol), a secondary protein in all rice mill fractions [20, 21]. Lipids are the third major component of brown rice, after carbohydrates and protein, and play a major role in the quality of rice during processing and storage. Lipids are mainly concentrated in the outer bran layer of brown rice (up to 20% by mass); therefore, the lipid content of brown rice is greater than that of milled rice [19, 22].
Appearance quality is how the rice appears after milling and it is associated with grain length, width, length-width ratio (shape) and translucency/chalkiness of the endosperm. Generally, most markets prefer translucent rice as opposed to chalky ones. Appearance quality has a direct influence on marketability and success of commercial varieties. The physical properties of rice grain include all of its external or integral characteristics, such as its appearance (size, shape, smoothness, colour), weight, hardness, volume, flow properties and so on (Figure 2).
Rice classification and the consequent analysis provide a comprehensive quality indicator, not only in terms of appearance but also of cooking and processing qualities. The physical properties of rice are fundamental in all activities related to the production, preservation and utilisation of rice. Parameters such as dimensions, density, hardness, friction and mechanical properties are affected by the moisture content of the grain and its degree of milling and, to a small extent, by temperature. Cereal research, as well as the grading and evaluation of food products, has encouraged the development of non-destructive, rapid and accurate analytical techniques to evaluate grain quality and safety; these techniques generate a huge amount of experimental data that must be accurately analysed. Different types of rice vary in size, shape, color and constitution, and cannot be accurately identified by visual inspection. High-quality rice seed cultivars are often counterfeited with low-quality cultivars or confused with other cultivars, which compromises rice quality, yield and value. For this reason, the identification of rice seed cultivars is extremely important.
Grain appearance is characterized by biometric parameters (length, width, length/width ratio), total whiteness, vitreous whiteness, and chalkiness, and is considered a crucial factor affecting market acceptability. Grain shape can be described by biometric parameters, which are closely associated with grain weight [25, 26]. The length/width ratio is used internationally to describe the shape and class of a variety. Grain weight provides information about the size and density of the grain. Grains of different density mill differently, and are likely to retain moisture and cook differently. Uniform grain weight is important for consistent grain quality. Chalkiness, an opaque white discoloration of the endosperm, reduces the value of head rice kernels and decreases the ratio of head to broken rice produced during milling. Viscosity indicates some of the cooking properties of rice and is evaluated by Rapid Visco Analysis (RVA), which mimics the cooking process and monitors the changes in a slurry of rice flour and water during the test. Starch viscosity curves are useful for breeding because the shape of the curve is unique to each class of rice. The primary RVA parameters include peak viscosity, PV (first peak viscosity after gelatinization); trough or hot paste viscosity, HPV (paste viscosity at the end of the 95 °C holding period); and final or cool paste viscosity, CPV (paste viscosity at the end of the test). The breakdown (BD = PV − HPV), setback (SB = CPV − PV), consistency (CS = CPV − HPV), setback ratio (SBR = CPV/HPV) and stability (ST = HPV/PV) are considered secondary parameters, since they are derived from the primary ones [30, 31, 32]. Other parameters include peak time (time required to reach peak viscosity) and pasting temperature (temperature of the initial viscosity increase).
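The relationships between the primary and secondary RVA parameters given above can be written out directly. The following Python sketch is only illustrative: the class and function names are hypothetical, and the viscosity readings are invented values used to show the arithmetic, not measured data.

```python
from dataclasses import dataclass


@dataclass
class RvaPrimary:
    """Primary RVA readings, all expressed in the same viscosity unit."""
    pv: float   # peak viscosity (PV)
    hpv: float  # trough or hot paste viscosity (HPV)
    cpv: float  # final or cool paste viscosity (CPV)


def secondary_parameters(p: RvaPrimary) -> dict:
    """Derive the secondary RVA parameters from the primary ones."""
    return {
        "breakdown": p.pv - p.hpv,       # BD = PV - HPV
        "setback": p.cpv - p.pv,         # SB = CPV - PV
        "consistency": p.cpv - p.hpv,    # CS = CPV - HPV
        "setback_ratio": p.cpv / p.hpv,  # SBR = CPV / HPV
        "stability": p.hpv / p.pv,       # ST = HPV / PV
    }


# Hypothetical readings, chosen only to illustrate the calculation.
example = secondary_parameters(RvaPrimary(pv=3000.0, hpv=1800.0, cpv=3300.0))
```

With these invented readings the breakdown is 1200, the setback 300 and the consistency 1500, in the same unit as the inputs.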
Industrial processing parameters such as milling yield (husked), milling yield (milled), and industrial milling yield can positively or negatively influence the acceptability of rice to the industry, and can also affect its commercial value. Rice yield and milling quality determine the economic value of rice from the field to the mill and in the industrial market. The commercial quality of rice depends on several parameters that are evaluated separately or that involve several time-consuming experimental procedures. Some of these parameters are related to biochemical or biological properties, which makes them easier to determine or predict. Milling quality aspects affected by temperature during rice ripening include chalkiness, immature kernels, kernel dimensions, fissuring, protein content, amylose content, and amylopectin chain length. The rice milling process consists of dehusking the paddy, which yields brown rice, and removing the bran from the kernel by polishing the brown rice to yield white rice. The milling quality of rice determines the yield and appearance of the rice after the milling process.
1.2 Near-infrared spectroscopy
Beer’s law is generally applied in analytical spectroscopy to correlate the concentrations of standard samples with the corresponding analyte absorbances, producing a calibration curve that is later used to evaluate the analyte concentration of unknown samples, typically at the wavelength of maximum absorbance (λmax). Variation in other wavelength/wavenumber regions is often not considered, but it contains significant information that may be selected to represent analyte absorption fingerprint signatures and spectral profiles for pattern recognition and/or quantification of analytes in unknown samples.
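As a minimal sketch of the univariate calibration described above, the Python code below fits a straight line to the absorbances of a few standards and inverts it for an unknown sample. It assumes the linear form of Beer's law (absorbance proportional to concentration at a single wavelength); the function names and all numerical values are hypothetical.

```python
def fit_calibration(concentrations, absorbances):
    """Least-squares line: absorbance = slope * concentration + intercept."""
    n = len(concentrations)
    mean_c = sum(concentrations) / n
    mean_a = sum(absorbances) / n
    sxx = sum((c - mean_c) ** 2 for c in concentrations)
    sxy = sum((c - mean_c) * (a - mean_a)
              for c, a in zip(concentrations, absorbances))
    slope = sxy / sxx
    intercept = mean_a - slope * mean_c
    return slope, intercept


def predict_concentration(absorbance, slope, intercept):
    """Invert the calibration line for an unknown sample."""
    return (absorbance - intercept) / slope


# Hypothetical standards measured at a single wavelength (e.g. lambda max).
standard_conc = [0.0, 1.0, 2.0, 3.0, 4.0]
standard_abs = [0.02, 0.22, 0.42, 0.62, 0.82]

slope, intercept = fit_calibration(standard_conc, standard_abs)
unknown_conc = predict_concentration(0.52, slope, intercept)
```

Because only one wavelength enters the model, any information carried by the rest of the spectrum is ignored, which is the limitation that motivates the multivariate methods discussed later.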
Analytical infrared spectroscopy is based on the absorption or reflection of electromagnetic radiation, and the IR range can be divided into three regions: near IR (NIR) in the 12,000–4000 cm−1 region, mid IR (MIR) in the 4000–400 cm−1 region, and far IR (FIR) below 400 cm−1 (Figure 3). MIR spectroscopy (4000–400 cm−1) is a well-recognized and reliable method through which different compounds can be identified and quantified. It is used for biological applications and includes the so-called fingerprint regions representative of lipids, proteins, amide I/II, carbohydrates, and nucleic acids (Figure 3). FIR spectroscopy (400–20 cm−1) provides information on highly ordered structures such as fibrillar formation and protein dynamics, since it is more sensitive than MIR to the vibrations of peptide skeletons and hydrogen bonds. NIR spectroscopy, also known as “far-visible spectroscopy” or “overtone vibrational spectroscopy”, can measure the chemical composition of biological materials using the diffuse reflectance or transmittance of the sample at several wavelengths. The NIR spectrum, from 12,000 to 4000 cm−1, lies between the visible and mid-infrared regions of the electromagnetic spectrum and is characterized by a number of absorption bands that vary in intensity due to energy absorption by specific functional groups in the sample.
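Because the regions above are quoted in wavenumbers while many instruments report wavelengths, a small conversion helper can be useful. The sketch below uses the standard relation λ (nm) = 10^7 / wavenumber (cm−1). The region limits are the ones quoted in the text; how the boundary values themselves are assigned is an assumption made here for illustration, and the function names are hypothetical.

```python
def wavenumber_to_wavelength_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1.0e7 / wavenumber_cm1


def ir_region(wavenumber_cm1):
    """Name the IR region using the limits quoted in the text.

    Assumption: a boundary value is assigned to the lower-energy region.
    """
    if 4000 < wavenumber_cm1 <= 12000:
        return "NIR"
    if 400 < wavenumber_cm1 <= 4000:
        return "MIR"
    if 20 <= wavenumber_cm1 <= 400:
        return "FIR"
    return "outside the IR ranges quoted here"


nir_upper_nm = wavenumber_to_wavelength_nm(4000)   # long-wavelength end of NIR
nir_lower_nm = wavenumber_to_wavelength_nm(12000)  # short-wavelength end of NIR
```

Under this conversion, the NIR range of 12,000–4000 cm−1 corresponds to roughly 833–2500 nm.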
NIR is a spectroscopic technique used to study hydrogen bonding because it evaluates the overtones and combinations of the molecule’s vibrational modes, principally those involving hydrogen. NIR spectroscopy can measure the concentration of components with different molecular compositions, such as protein, water, or starch. The chemical bonds present in food and crop components such as fats, water, and carbohydrates are easily detected by NIR spectroscopy because the radiation is specific to the groups of interest, such as N-H, C-H, and O-H bonds. Due to the macromolecular complexity of the rice sample, these bands commonly overlap one another.
Transmission and reflection are the two major modes of NIR spectroscopy, and the choice between them depends on the physical state of the sample. Transmission mode is more suitable for liquids and thin solids, and for thick solids when inspecting a food item for ripeness, pests or defects. Reflectance mode, on the other hand, is applied to measure the content of constituents such as lipids, starch, amylose, protein, moisture, and oil in whole grains. Low reflectivity indicates that energy diffuses readily beneath the surface of most samples, including visually opaque ones. Low absorptivity means that NIR light easily penetrates the samples without fast attenuation. This technique is extensively used in breeding programs for the quality improvement of cereals, as well as in crop management, receivable testing, and on-line process control [41, 42].
The NIR methodology has several advantages: it requires no sample preparation or pre-treatment, no dangerous reagents or solvents, and creates no disposal problem. These advantages can eliminate the sampling errors caused by manual sample handling and reagent contamination. The samples can also be reused in additional studies, and the analysis can be carried out by technically untrained personnel. In addition, NIR analysis yields a set of spectra acquired simultaneously over a given range of wavelengths, which may serve as a basis for the development of specific calibration curves for each analyte. In the calibration process, the spectra are transformed during modelling by chemometric techniques, which use a representative training set to teach the program to discriminate the slight differences that exist between the spectra of the samples. A single spectrum can be subjected to many different calibration models to measure any number of constituents.
Different techniques such as machine vision and Visible/Near-Infrared spectroscopy have been developed and applied to identify and characterize rice varieties and to evaluate their biochemical characteristics. Traditional techniques used for rice variety evaluation, such as High-pressure Liquid Chromatography (HPLC) or Gas chromatography-mass spectrometry (GC-MS), are time-consuming and hard to apply. Compared with traditional analysis methods, NIR spectroscopy has many advantages: it is easy to use, fast and accurate, provides real-time analysis and highly reproducible results, requires no sample preparation, allows multiple components to be analysed with a single measurement, and is non-destructive. It is therefore widely used in the measurement of agricultural and food products [45, 46].
1.3 Spectral pre-processing techniques
Over the years, several multivariate regression analysis methods have been developed to extract significant information from spectral data, due in part to the limitations of univariate spectral analysis. The processing of spectral data for chemical analysis usually draws on statistics and advanced mathematics to perform multivariate regression. Several wavelengths or wavenumbers can be investigated simultaneously through multivariate regression techniques, which allow the analysis of different sample components without the need for spectral resolution and spectral deconvolution. Pre-processing methods reduce the noise in spectral data and remove the non-informative variability present in the spectra. Data pre-processing techniques such as standard normal variate (SNV) transformation, multiplicative scatter correction (MSC) and derivatives with smoothing are required on raw NIR spectra for proper qualitative classification and for the development of quantitative calibration models. MSC is used to compensate for particle size effects, as it rotates each spectrum to remove part of that effect, fitting it as closely as possible to the average spectrum. The first and second derivatives are calculated according to the Savitzky–Golay approach using a 19-point window and a 2nd or 3rd order polynomial, which removes effects such as baseline drift [48, 49, 50] (Figure 4).
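To make the scatter-correction steps more concrete, the following Python sketch implements SNV and MSC in their usual textbook form, using only the standard library. It is an illustration under stated assumptions rather than the procedure used in the cited studies: each spectrum is a plain list of absorbances, MSC uses the mean spectrum as reference, and the three example spectra are synthetic (one invented profile with different offsets and scale factors).

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own
    mean and (sample) standard deviation."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(x - m) / s for x in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum."""
    n_points = len(spectra[0])
    reference = [mean(s[i] for s in spectra) for i in range(n_points)]
    ref_mean = mean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = mean(s)
        # Least-squares fit of: spectrum = offset + slope * reference
        slope = sum((r - ref_mean) * (x - s_mean)
                    for r, x in zip(reference, s)) / ref_ss
        offset = s_mean - slope * ref_mean
        # Remove the additive and multiplicative effects.
        corrected.append([(x - offset) / slope for x in s])
    return corrected


# Synthetic spectra: the same profile with different offsets and scalings,
# mimicking additive and multiplicative scatter effects.
base = [0.10, 0.40, 0.90, 0.50, 0.20]
raw = [base,
       [1.5 * x + 0.3 for x in base],
       [0.5 * x - 0.1 for x in base]]

snv_spectra = [snv(s) for s in raw]
msc_spectra = msc(raw)
```

In this synthetic case the three spectra collapse onto a single curve after either correction, since they differ only by an offset and a scale factor. For the Savitzky–Golay derivatives, one commonly used implementation is `scipy.signal.savgol_filter`, called for example with `window_length=19`, `polyorder=2` and `deriv=1` or `deriv=2`; it is mentioned here as an option, not as the tool used in the cited work.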
1.4 Machine learning methods
Machine learning is one of the most promising technologies in the field of artificial intelligence. It involves algorithms that allow machines to learn by imitating, step by step, the way humans learn. Machine learning based on experimental data makes it possible to optimize grouping or classification and to develop models that predict the behavior or properties of systems. There are two main types of machine learning: supervised and unsupervised. Supervised machine learning uses algorithms that “learn” from data that have been labelled beforehand by a person. The algorithm can generate the expected output as long as the input has been labelled in advance. Two types of task can be addressed: (a) classification, which assigns an object to one of several classes, for example determining the type of rice according to its physical characteristics; (b) regression, which predicts a numerical value, such as the concentration of a biochemical parameter (protein, lipids, carbohydrates, etc.). Supervised learning consists of learning a function from training examples, based on their attributes (inputs) and labels (outputs). In unsupervised machine learning, unlike the previous case, there is no human intervention: the algorithms learn from unlabelled data, looking for patterns among the elements. In this case two types of algorithm have been developed: (a) clustering, which classifies the data into groups according to their similarity; (b) association, in which the algorithm discovers rules within the data set. In semi-supervised learning, both labelled and unlabelled data are used for training, usually with only a small amount of labelled data and a large amount of unlabelled data. In reinforcement learning, no labelled examples are provided; instead, the learning system receives some sort of reward after each action, and the goal is to maximize the cumulative reward over the whole process.
The most widely recognized machine learning methods are: Principal Component Analysis (PCA), the most basic unsupervised feature extraction technique, based on the analysis of the variance of features within the full spectrum; unsupervised clustering methods, used to identify biological subtypes within a sample, such as Hierarchical Cluster Analysis (HCA); and supervised methods such as k-Nearest Neighbors (KNN), Artificial Neural Networks (ANN), discriminant analysis (DA), Partial Least-Squares-Discriminant Analysis (PLS-DA), Partial Least-Squares (PLS), and Support Vector Machines (SVM).
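As an illustration of the variance-based feature extraction performed by PCA, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix. This is only one of several ways to compute principal components, chosen here because it needs nothing beyond the standard library; the function name and the two-variable data are hypothetical.

```python
import math


def first_principal_component(data, n_iter=200):
    """Return the direction of the first PC and the fraction of the total
    variance it explains."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix (p x p).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance


# Hypothetical data: the second variable is exactly twice the first, so a
# single component should capture all of the variance.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, explained = first_principal_component(samples)
```

For these perfectly correlated variables the first component points along (1, 2) and explains all of the variance; with real spectra, several components are usually retained and the explained fraction of each guides that choice.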
1.4.1 Principal Component Analysis
Principal Component Analysis (PCA) is an unsupervised technique that allows the dimensionality reduction of the multivariate data to
1.4.2 Discriminant Analysis
Discriminant analysis is a strategy that has been used successfully for qualitative analysis and is also called pattern recognition. This methodology aims to classify samples into well-defined groups according to their similarities to a “training set”, despite limited knowledge of the composition of the samples belonging to each group. Johnson and Wichern described discriminant analysis as a technique that uses several variables together to solve the grouping problem. The development of calibration models in discriminant analysis is based on two methods: Mahalanobis distances, considered the unit distance vector in multidimensional space, and PCA coupled with Mahalanobis distances [54, 55]. The Mahalanobis distance can be defined by an ellipsoid in a multidimensional space that circumscribes the data. This method is based on a matrix that represents the inverse of the pooled within-group covariance matrix, which is generated by combining the information from all the different materials of interest in a single matrix. Studies by Dunmire and Williams considered the Mahalanobis distance as the mathematical quantity that defines the position, size and shape of the ellipsoid for all clusters. From a statistical perspective, the Mahalanobis distance takes the sample variability into account, while the Euclidean distance does not account for the variability of values in all dimensions. The Mahalanobis distance looks not only at the variation between the responses at the same wavelengths, but also at the inter-wavelength variations. Instead of treating all values equally when calculating the distance from the mean point, it weights the differences by the range of variability in the direction of the sample point. The position of each cluster in multidimensional space is defined by the mean value of the absorbances (the group mean) at each wavelength.
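The difference between the Euclidean and Mahalanobis distances can be illustrated with a two-variable example, for instance absorbances at two wavelengths. The Python sketch below is a simplification under stated assumptions: it uses the covariance matrix of a single group rather than the pooled within-group matrix described above, it is restricted to two variables so that the matrix inverse can be written out explicitly, and all names and values are hypothetical.

```python
import math


def mean_vector(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]


def covariance_2d(points):
    """Sample covariance of two-variable data as ((sxx, sxy), (sxy, syy))."""
    n = len(points)
    mx, my = mean_vector(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (sxx, sxy), (sxy, syy)


def mahalanobis_2d(sample, points):
    """Distance from the group mean, weighted by the group covariance."""
    mx, my = mean_vector(points)
    (sxx, sxy), (_, syy) = covariance_2d(points)
    det = sxx * syy - sxy * sxy
    dx, dy = sample[0] - mx, sample[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2 x 2 inverse expanded.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


def euclidean_2d(sample, points):
    mx, my = mean_vector(points)
    return math.hypot(sample[0] - mx, sample[1] - my)


# Hypothetical group that varies much more along the first variable.
group = [(-2.0, 0.0), (2.0, 0.0), (0.0, -0.5), (0.0, 0.5)]
along_spread = (1.0, 0.0)   # displaced in the high-variability direction
across_spread = (0.0, 1.0)  # displaced in the low-variability direction

d_euclid = (euclidean_2d(along_spread, group),
            euclidean_2d(across_spread, group))
d_mahal = (mahalanobis_2d(along_spread, group),
           mahalanobis_2d(across_spread, group))
```

Both test points lie at the same Euclidean distance from the group mean, but the point displaced in the low-variability direction has a Mahalanobis distance about four times larger, which is the weighting by variability described above.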
Dunmire and Williams indicated that the sample can be classified clearly if it falls within three times the Mahalanobis distance from the respective centroid and at least six times the Mahalanobis distance from the ellipses of other groups . Meanwhile, the Mahalanobis distance represents a multidimensional distance
1.4.3 Partial Least Squares-Discriminant Analysis
Partial Least Squares-Discriminant Analysis (PLS-DA) is a linear classification method that estimates predictive models based on the partial least squares regression algorithm, which searches for latent variables with maximum covariance; these latent variables are linear combinations of the original variables and represent the significant sources of data variability. It is considered an example of a machine learning tool applied to conduct a global cellular analysis of bioprocesses as an exploratory technique, and it is gaining increasing attention as a useful feature selector and classifier [56, 57, 58, 59, 60]. Multivariate classification methods aim to find mathematical models able to recognize the membership of each sample in its appropriate class from a set of measurements. PLS-DA has shown promising results in the detection of food adulteration without identifying specific compounds. PLS-DA is a discriminant classifier that is particularly suitable for handling correlated features (e.g., spectroscopic variables). The predicted value is a real number rather than a dummy integer; thus, a cut-off value needs to be set to determine which class the sample belongs to. PLS-DA is computed based on full cross-validation methods. More specifically, a predictor block is used to estimate (by PLS) a binary response called dummy Y (a binary response matrix encoding class membership). Mathematically, the regression relation between the data matrix X and the dummy vector y for a two-class case is represented by the model in Eq. (2)
1.4.4 Support Vector Machine
Support Vector Machine (SVM) is a widely used supervised statistical learning algorithm, considered as a nonlinear classification technique, which works with supervised learning models that analyze data used for classification and regression analysis, producing linear boundaries between objects groups in a transformed space of the
1.4.5 Partial Least Squares
Partial Least Squares (PLS) regression and principal component regression (PCR) are examples of quantitative regression algorithms that are currently used for linear data and are considered factor-based models. PLS and PCR use information from all wavelengths in the entire NIR spectrum to predict sample composition, instead of using a few selected wavelengths. PLS is similar to PCR but more sensitive to variations in sample concentration. Wehling described how PLS and PCR, as data reduction approaches, reduce a huge number of variables to a much smaller number of new variables that account for most of the variability in the samples. The amount of a constituent in the samples can then be predicted from these new variables. PLS is the most widely used supervised multivariate data analysis method for estimating and quantifying components in a specific sample. Each training example is defined as a pair (
The matrices containing the data provided by the NIR spectra, denominated by
1.4.6 Soft Independent Modeling of Class Analogy
Soft Independent Modeling of Class Analogy (SIMCA) is a supervised discriminant analysis method based on PCA. This methodology is a class-modeling approach, meaning that, in defining the class boundaries, the method focuses on the similarities among samples from the same category [61, 78]. For each class, a PCA model is created, and the residual variance of the modeled class is then compared with the residual variance of the unknown sample to determine which category the sample belongs to. The number of PCs used in each class should be selected to achieve the best classification results. SIMCA results are presented in terms of “sensitivity” and “specificity”, where the former specifies the percentage of samples truly belonging to the category that are correctly accepted by the class model, while the latter expresses the percentage of objects from other classes that are correctly rejected. SIMCA starts from a principal component analysis (PCA) of only the training objects belonging to the category to be modeled, to “capture” the regular variability due to the similarities among samples of the same class [79, 80]. Once the PCA is calculated, objects are accepted or rejected by the class model based on their reduced distance from the class space, referred as
where T² is the Mahalanobis distance of the sample from the center of the class space and Q is its orthogonal distance from the PC subspace. These values are divided by T²0.95 and Q0.95, which are the 95th percentiles of the T² and Q distributions, to obtain the reduced T² (T²red) and the reduced Q (Qred), respectively. Due to the normalization, the T² and Q limit values are equal to 1; a sample will then be accepted by the class model if
1.4.8 Random Forest
Random Forest (RF) is an ensemble machine learning algorithm that builds many decision trees, each grown from a bootstrap sample of the training data. At each node of the tree, the optimal split is chosen from a random subset of variables, and the tree is grown to its maximum extent without pruning. Predictions for new data are obtained by combining the outputs of all trees. RF is fast and well suited to large amounts of data, with the advantages of reducing variance while achieving comparable classification accuracy [82, 83].
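The three ingredients described above (bootstrap sampling, a random subset of variables at each split, and combination of the tree outputs) can be sketched in a few lines of Python. The example is deliberately simplified and should be read as an assumption-laden illustration: each "tree" is limited to a single split (a decision stump) instead of being grown to full depth, the vote is a simple majority, and the data and function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature_ids):
    """Best single threshold split among the candidate features."""
    best = None
    for j in feature_ids:
        for threshold in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[j] > threshold]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw as many rows as the data set, with replacement.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        # Random subset of variables considered for the split.
        feature_ids = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump(Xb, yb, feature_ids))
    return forest


def predict(forest, row):
    """Combine the outputs of all trees by majority vote."""
    votes = [left if row[j] <= t else right for j, t, left, right in forest]
    return majority(votes)


# Hypothetical two-variable data for two well-separated classes.
X = [[1.0, 1.2], [1.5, 1.1], [2.0, 1.8], [1.2, 2.0], [1.8, 1.5],
     [8.0, 8.5], [8.5, 9.0], [9.0, 8.2], [8.2, 8.8], [8.8, 8.1]]
y = ["A"] * 5 + ["B"] * 5
forest = fit_forest(X, y)
```

Calling `predict(forest, [0.5, 0.5])` and `predict(forest, [9.5, 9.5])` classifies two new samples. A full random forest would replace `fit_stump` with a recursive tree builder that repeats the random variable selection at every node.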
1.4.9 Artificial Neural Networks
Artificial Neural Networks (ANNs) are non-parametric regression models that can capture any phenomenon, to any degree of accuracy (depending on the adequacy of the data and the power of the predictors), without prior knowledge of the phenomenon. ANNs are applied to classification and function-mapping problems that are tolerant of some inaccuracy and have plenty of training data available, but to which hard and fast rules cannot easily be applied. In an ANN, the input layer is linked to an output layer, either directly or through one or more hidden layers of interconnected neurons. The number of hidden layers defines the depth of an ANN, and the width depends on the number of neurons in each layer. Rapid optimization algorithms iteratively perform forward and backward passes to minimize a loss function and to learn the weights and biases of each layer. In the forward pass, the activation functions are applied to the weighted inputs of each layer, using the current values of the weights. The final result of a forward pass is a new set of predicted outputs. The backward pass computes the error derivatives between the expected outputs and the actual outputs. These errors are then propagated backwards, updating the weights and calculating new error terms for each layer. The iterative repetition of this process is called back-propagation. A neural network is an adaptable system that learns relationships from the input and output data sets and can then predict a previously unseen data set with characteristics similar to the input set [86, 87]. Multilayer perceptron (MLP) and radial basis function (RBF) networks are widely used neural network architectures in the literature for regression problems [88, 89, 90]. MLPs are usually used for prediction and classification, with suitable training algorithms for the network weights. The MLP is trained with the back-propagation learning algorithm.
Figure 5a represents a three-layer MLP, the most basic ANN, whose minimum configuration consists of three layers of nodes: (1) input layer, (2) hidden layer, and (3) output layer. The input layer accepts the data, the hidden layer processes them, and the output layer displays the resulting outputs of the model [91, 92]. Each node, with the exception of the input nodes, is a neuron that uses a non-linear activation function. The MLP can be regarded as a hierarchical mathematical function mapping a set of input values to output values via many simpler functions. Normally, the nodes are fully connected between layers, so the number of parameters quickly grows very large, with a considerable risk of overfitting. The RBF network is considered the most broadly used architecture in ANN and is simpler than the MLP neural network (Figure 5b). The RBF network also has an input, a hidden and an output layer. There are different types of radial basis functions, but the most widely used is the Gaussian function.
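The forward and backward passes of a three-layer MLP can be sketched as follows. This is a minimal illustration, not the configuration of any cited study: it assumes sigmoid activations, a squared-error loss, plain gradient descent applied one sample at a time, and a toy data set (the logical OR of two inputs); the function names, learning rate and number of epochs are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }


def forward(net, x):
    """Forward pass: input layer -> hidden layer -> single output neuron."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden))
                     + net["b_out"])
    return hidden, output


def train_step(net, x, target, lr):
    """One forward pass followed by one backward pass (back-propagation)."""
    hidden, output = forward(net, x)
    # Error terms for the loss 0.5 * (output - target)^2.
    delta_out = (output - target) * output * (1.0 - output)
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(net["w_out"], hidden)]
    # Gradient-descent updates of weights and biases, layer by layer.
    net["w_out"] = [w - lr * delta_out * h
                    for w, h in zip(net["w_out"], hidden)]
    net["b_out"] -= lr * delta_out
    for k, dk in enumerate(delta_hidden):
        net["w_hidden"][k] = [w - lr * dk * xi
                              for w, xi in zip(net["w_hidden"][k], x)]
        net["b_hidden"][k] -= lr * dk


def mean_squared_error(net, data):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)


# Toy training set: logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = init_network(n_in=2, n_hidden=2, seed=0)
loss_before = mean_squared_error(net, data)
for _ in range(2000):
    for x, target in data:
        train_step(net, x, target, lr=0.5)
loss_after = mean_squared_error(net, data)
```

Comparing `loss_before` with `loss_after` shows the effect of the repeated forward and backward passes. For spectral data the input layer would have one node per wavelength (or per PCA score), which is where the rapid growth in the number of parameters mentioned above comes from.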
1.4.10 Multiple Linear Regression
Multiple Linear Regression (MLR) is a commonly used machine learning algorithm that determines a mathematical relationship among a number of random variables, analyzing how multiple independent variables are related to one dependent variable. Once each of the independent factors has been determined to predict the dependent variable, the information about the multiple variables is used to build an accurate prediction of the level of effect they have on the outcome variable. The model generates a relationship in the form of a straight line (linear) that best approximates all the individual data points. The most important advantage of MLR is that it helps to understand the relationships among the variables present in the dataset, including the correlation between dependent and independent variables. MLR is one of the oldest regression methods, being used to establish linear relationships between several independent variables (
2. Practical applications of NIR spectroscopy and chemometrics
2.1 NIR spectroscopy in rice analysis: identification and classification
There are several studies that describe quantitative analysis by NIR spectroscopy in different types of food, providing an exceptional method for the evaluation of chemical composition (
There are several studies based on NIR to predict viscosity properties of rice. Delwiche et al. developed calibration models on whole-grain milled rice using PLS regression to predict viscosity properties of a flour-water paste as recorded by the RVA, that determine the cooking and processing characteristics of rice . Meadows and Barton later used NIR to predict RVA data in rice flour . A PLS regression of NIR spectra
Osborne et al. used near-infrared transmission spectroscopy to discriminate between Basmati and other long-grain rice samples. A discriminant rule was derived using the Fisher linear discriminant function calculated from the first few principal component scores of the NIR spectra. The discriminant rule was assessed by cross-validation. In this study, nine Basmati varieties and 53 other rice samples were classified correctly from NIR spectra, but 8% of the Basmatis and 14% of the others were misclassified on the basis of spectra of individual grains. NIR spectroscopy also offers effective quantitative capability for moisture, fat, protein and gluten content in rice cookies.
Chen et al. applied NIR diffuse reflectance spectroscopy to multi-grain seeds and developed a spectral discriminant analysis method based on PLS-DA for the variety identification of multi-grain rice seed. Because the spectra of seeds from different varieties differ only slightly, novel and valid methods are needed. In this study, SNV pretreatment combined with wavelength-screening methods improved the accuracy of the discriminant models. The selected optimal wavelength model was a combination of 54 discrete wavelengths within the NIR region. The total recognition accuracy of the NIR spectral discrimination reached 94.3% in a study involving the identification of one type of differentiation (negative and excellent hybrid variety) and several interference groups (positive, four pure groups and four mixed groups).
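The SNV pretreatment mentioned above centres and scales each spectrum individually, which removes additive offsets and multiplicative gain differences between samples. A minimal sketch, assuming spectra are stored one per row (the function name and the test spectrum are hypothetical):

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Two copies of the same hypothetical spectrum with different offset and gain
base = np.sin(np.linspace(0.0, 3.0, 50))
raw = np.vstack([base, 0.2 + 1.5 * base])
corrected = snv(raw)  # both rows coincide after SNV
```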
The Hyperspectral Imaging (HSI) technique coupled with visible (vis) and/or NIR spectroscopy is generally used to identify or inspect different substances in seeds by recognizing the molecular bonds in the sample. Combining spectroscopy and digital imaging, it is considered among the most feasible methods for rapidly and non-destructively detecting the substances of agricultural products. He et al. used a NIR-HSI system combined with multiple data preprocessing methods. This approach made it possible to obtain spectral and spatial information simultaneously from test samples in the form of a hypercube made up of two spatial dimensions and one spectral dimension. The HSI technique can collect hyperspectral information from samples of different sizes and shapes based on the spatial data. Detection with HSI is faster than with point-based techniques, as many samples can be scanned and analyzed at the same time with an HSI camera. Classification models were developed to identify the vitality of rice seeds, showing great potential for identifying the vitality and vigor of rice seeds. When detecting the seed vitality of three different years, the extreme learning machine model with Savitzky–Golay preprocessing reached a classification accuracy of 93.67% using spectral data. For identifying non-viable seeds among viable seeds of different years, the least squares support vector machine model coupled with raw data and selected wavelengths achieved 94.38% accuracy and can be adopted as an optimal combination for this task. In another study, carried out by Barnaby et al., the NIR hyperspectral image consisted of numerous bands with small spectral gaps (every 4 nm in that study) and could assess grain traits such as fat, starch, protein, moisture, color, and many other physicochemical properties at once.
A genome-wide association study confirmed known genes and identified new genes that can affect grain quality traits, based on the hyperspectral imaging technique. The PLS-DA models of the hyperspectral data identified spectral ranges that distinguished genetic and production-environment differences, and these data can help to resolve the genetics of complex traits such as rice grain quality.
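The hypercube structure described above (two spatial dimensions and one spectral dimension) maps naturally onto a three-dimensional array. The sketch below shows how such an array might be unfolded into pixel spectra and reduced to one mean spectrum per seed; the image size, number of bands and the rectangular seed mask are all hypothetical.

```python
import numpy as np

# Hypothetical hypercube: 60 x 80 pixels, 200 spectral bands
rng = np.random.default_rng(0)
rows, cols, bands = 60, 80, 200
cube = rng.random((rows, cols, bands))

# Unfold the two spatial dimensions so that every pixel becomes one spectrum
pixel_spectra = cube.reshape(rows * cols, bands)

# Hypothetical boolean mask marking the pixels that belong to one seed
seed_mask = np.zeros((rows, cols), dtype=bool)
seed_mask[20:30, 35:50] = True

# Mean spectrum of the seed: one vector of length `bands`
seed_spectrum = cube[seed_mask].mean(axis=0)

# A single-band image (spatial information at one wavelength)
band_image = cube[:, :, 100]
```

In practice the mask would come from an image segmentation step, and the per-seed mean spectra would then feed the preprocessing and classification models.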
The nitrogen content is an important chemical indicator for plant monitoring and management because of its role in photosynthesis and productivity and its effect on the carbon and oxygen cycles. Nitrogen content can be measured by laboratory analysis, whereas the NIR spectral reflectance (700–1075 nm) was measured in the field using a hand-held spectroradiometer. Afandia et al. evaluated the nitrogen content of a rice crop from NIR reflectance using an ANN. The study concluded that organic molecules (nitrogen, water, etc.) present a specific absorption pattern in the NIR region, and the comparison between measured and model-estimated nitrogen content gave an RMSE of 0.32.
Lin et al. developed an imaging-based system consisting of a NIR camera, filters, an automatic filter-exchange device, and image processing techniques to detect rice protein content from spectral absorption. The NIR data were used to establish calibration models based on MLR, PLS, and ANN. The MLR calibration model used 5 wavelengths (880 nm, 910 nm, 920 nm, 1000 nm, and 1014 nm) to predict the rice protein content, with a validation R2 of 0.782 and a standard error of prediction (SEP) of 0.274%. The PLS model used 15 filters ranging from 870 to 1014 nm, and its predictive results showed a significant performance (R2val = 0.782, and SEP = 0.274%) compared with the MLR model. In the ANN model, using the 5 wavelengths selected by the MLR as the network input simplified the model, and the prediction results (R2val = 0.806, and SEP = 0.266%) were similar to those of the PLS. The prediction results indicated that the developed NIR imaging system is simple and convenient to operate, offers high detection accuracy, and has commercial potential for the non-destructive, accurate prediction of rice protein content.
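To make the reported figures of merit concrete, the sketch below fits an MLR calibration on five predictor variables and computes a validation R2 and SEP. The data are synthetic, the 60/20 calibration/validation split is arbitrary, and the bias-corrected SEP formula is one common definition that is assumed here rather than taken from the cited study.

```python
import numpy as np

def mlr_fit(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mlr_predict(coef, X):
    """Apply the fitted intercept (coef[0]) and slopes (coef[1:])."""
    return coef[0] + X @ coef[1:]

def r2_and_sep(y_true, y_pred):
    """Coefficient of determination and bias-corrected standard error of prediction."""
    resid = y_true - y_pred
    r2 = 1.0 - np.sum(resid**2) / np.sum((y_true - y_true.mean()) ** 2)
    sep = np.sqrt(np.sum((resid - resid.mean()) ** 2) / (len(resid) - 1))
    return r2, sep

# Synthetic absorbances at five wavelengths and a synthetic protein content (%)
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
true_coef = np.array([0.8, -0.5, 0.3, 0.6, -0.2])
y = 7.0 + X @ true_coef + rng.normal(0.0, 0.1, size=80)

coef = mlr_fit(X[:60], y[:60])                                # calibration set
r2_val, sep = r2_and_sep(y[60:], mlr_predict(coef, X[60:]))   # validation set
```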
NIR spectroscopy was used to develop a new method for discriminating rice varieties. The variables compressed by PCA were used as inputs to multiple discriminant analysis (MDA). The study showed that the combination of spectroscopy and computer data processing based on PCA and MDA identified rice from different areas correctly in about 98% of cases for the calibration process and 100% for the prediction process. These results showed that the proposed alternative method is a feasible way of identifying the specific production areas of rice.
2.2 NIR spectroscopy in rice authentication
NIR spectroscopy has been widely used in the evaluation of agricultural products because of its many advantages: it is easy to use, non-destructive, fast and accurate, provides highly reproducible results, requires minimal or, often, no sample preparation, and allows several constituents to be analyzed from a single measurement. Given the global importance of rice, the literature contains several studies aimed at its analysis and characterization. For environmental and rice-market reasons, non-destructive approaches are generally preferred. NIR spectroscopy has emerged as an important tool for detecting fraud, adulteration, and contamination in grains and flours. Substantial instrumental improvements (e.g., hyperspectral imaging, FT-NIR) and advances in data analysis (e.g., deep learning) have allowed the development of screening methods for detecting the presence of pests (e.g., rice weevil) across a range of stored grains [114, 115, 116].
Direct spectroscopic measurements have been widely applied to several foods and commodities, especially grain and cereal products, for example for the classification of rice [117, 118, 119, 120, 121]. Furthermore, within the evaluation of rice quality, NIR spectroscopy has been used for the discrimination of rice [122, 123]; variety classification and transgenic rice detection; the quantification of physico-chemical properties (such as moisture content, sound whole kernel, whiteness, translucency, color, and amylogram characteristics); cultivar classification; protein and amylose content prediction [127, 128]; wax rice detection; and eating quality prediction. Barnaby et al. correlated the grain chalk of rice to the genomic regions of NIR spectra. These spectral regions can be applied in the automation of grain chalk quantification and potentially for other grain products as well.
Rapid and non-destructive detection of rice authenticity and quality was performed with a hand-held NIR spectrometer coupled with appropriate chemometrics. Different preprocessing methods were combined with PCA, and KNN and SVM multivariate models were built; MSC + PCA followed by KNN proved superior in this study, with a classification rate above 90% for all categories of rice samples studied. Based on these results, a hand-held spectrometer associated with an appropriate multivariate calibration model could be used for quick and non-destructive detection of rice quality and authenticity.
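The MSC + PCA + KNN chain named above can be sketched in a few lines. The example below uses synthetic two-class spectra with random gain and offset; the choice of the mean training spectrum as MSC reference, three principal components and k = 3 are assumptions made for illustration, not settings reported in the study.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)  # s ~ intercept + slope * ref
        corrected[i] = (s - intercept) / slope
    return corrected, ref

def knn_predict(train_X, train_y, test_X, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    preds = []
    for x in test_X:
        nearest = np.argsort(np.linalg.norm(train_X - x, axis=1))[:k]
        preds.append(np.bincount(train_y[nearest]).argmax())
    return np.array(preds)

# Synthetic spectra: class 1 has an extra small peak; every sample gets a
# random gain and offset that mimics scatter effects (hypothetical data).
rng = np.random.default_rng(2)
grid = np.arange(100)
peak = lambda c: np.exp(-0.5 * ((grid - c) / 5.0) ** 2)
bases = [peak(40), peak(40) + 0.3 * peak(70)]

def simulate(n_per_class):
    X, y = [], []
    for label, base in enumerate(bases):
        for _ in range(n_per_class):
            gain, offset = rng.uniform(0.8, 1.2), rng.uniform(-0.2, 0.2)
            X.append(offset + gain * base + rng.normal(0.0, 0.01, grid.size))
            y.append(label)
    return np.array(X), np.array(y)

train_X, train_y = simulate(20)
test_X, test_y = simulate(10)

# MSC uses the training reference for both sets, then PCA compresses the spectra
train_msc, ref = msc(train_X)
test_msc, _ = msc(test_X, reference=ref)
mean = train_msc.mean(axis=0)
_, _, vt = np.linalg.svd(train_msc - mean, full_matrices=False)
loadings = vt[:3].T

pred = knn_predict((train_msc - mean) @ loadings, train_y,
                   (test_msc - mean) @ loadings, k=3)
accuracy = np.mean(pred == test_y)
```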
Food fraud remains a significant problem for food regulators, importers, merchants, law enforcement personnel, and the consumer. A key feature of food fraud is the use of a lower-value ingredient to imitate an authentic product. NIR analysis technology, PLS-DA, and SVM have been used to detect whether high-quality rice was mixed with other varieties of rice. NIR spectral data analyzed using PLS-DA and an SVM algorithm were shown to be a feasible method (5% detection limit) for the rapid identification of fraudulent rice varieties blended with authentic Wuchang rice samples.
Studies performed by Liu et al. showed that those techniques provide significant support for qualitative discrimination. PLS was used to establish a quantitative analysis model to help recognize the degree of fraud. Because the results of NIR analysis depend directly on the homogeneity of the samples, four groups of samples with different physical forms (full granules, 40 mesh, 70 mesh, and 100 mesh) were prepared. In the qualitative analysis, the performance of the model showed no obvious relationship with the physical state of the sample, and the qualitative PLS-DA and SVM models could detect fraudulent rice with a 5% detection limit. The determination coefficient and root mean square error of the optimal prediction result were 0.96 and 2.93, respectively. Based on this study, NIR analysis technology can be considered a reliable and fast strategy to determine whether premium high-quality rice is adulterated with inferior categories of rice.
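A quantitative PLS model of the kind described above can be illustrated with a small single-response PLS (PLS1, NIPALS-style) implementation. The mixture spectra, noise level, number of latent variables and calibration/validation split below are all synthetic assumptions; the sketch only shows how a determination coefficient and a root mean square error of prediction would be obtained for an adulteration level expressed in percent.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 regression; returns (x_mean, y_mean, regression vector)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)           # weight vector
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        qk = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - qk * t                 # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # coefficients in the original X space
    return x_mean, y_mean, b

def pls1_predict(model, X):
    x_mean, y_mean, b = model
    return y_mean + (X - x_mean) @ b

# Synthetic mixtures of an "authentic" and an "adulterant" spectrum
rng = np.random.default_rng(3)
grid = np.arange(120)
pure_authentic = np.exp(-0.5 * ((grid - 40) / 8.0) ** 2)
pure_adulterant = pure_authentic + 0.3 * np.exp(-0.5 * ((grid - 85) / 6.0) ** 2)

fraction = rng.uniform(0.0, 0.5, size=90)
X = (np.outer(1.0 - fraction, pure_authentic)
     + np.outer(fraction, pure_adulterant)
     + rng.normal(0.0, 0.005, size=(90, grid.size)))
y = 100.0 * fraction                      # adulteration level in percent

model = pls1_fit(X[:60], y[:60], n_components=2)   # calibration
pred = pls1_predict(model, X[60:])                 # validation
rmsep = np.sqrt(np.mean((y[60:] - pred) ** 2))
r2 = 1.0 - np.sum((y[60:] - pred) ** 2) / np.sum((y[60:] - y[60:].mean()) ** 2)
```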
Different preprocessing approaches were used for NIR signal pretreatment. Besides the raw data, the first derivative (Savitzky–Golay approach, 15-point window, 2nd-order polynomial), the second derivative (Savitzky–Golay approach, 15-point window, 3rd-order polynomial), and standard normal variate (SNV) were also evaluated (Figure 6). NIR data were further mean-centered prior to the creation of any calibration model. The most suitable preprocessing approach, together with the optimal complexity (number of LVs or PCs to be extracted) of each classification model, was defined by a cross-validation procedure. Specifically, PLS-DA selection was based on the combination of preprocessing and model complexity leading to the lowest mean classification error, whereas for SIMCA the maximum efficiency was sought. A study by Duy Le Nguyen Doan investigated the possibility of combining NIR spectroscopy and chemometric classifiers to detect adulterated rice samples. Two different strategies were exploited: a discriminant classifier (PLS-DA) and a class-modelling technique (SIMCA). The two strategies provided different results; in particular, SIMCA appeared unable to solve the investigated problem, whereas PLS-DA proved to be a suitable approach. These results indicate that high within-class variability can affect the possibility of detecting low levels of adulteration; at the same time, they suggest that the proposed approach could be useful for detecting adulterated samples. This study therefore demonstrates that the combination of NIR spectroscopy and PLS-DA can be an effective, rapid and non-destructive tool for the determination of adulteration in jasmine rice.
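The Savitzky–Golay derivative settings quoted above (15-point window with a 2nd-order polynomial for the first derivative, and a 3rd-order polynomial for the second derivative) can be reproduced with a small NumPy routine. The wavelength axis, the 2 nm spacing and the test spectrum are hypothetical, and edge points are simply dropped here, which may differ from the edge handling used in the study.

```python
import numpy as np
from math import factorial

def savgol_coeffs(window, polyorder, deriv=0, delta=1.0):
    """Savitzky-Golay weights for the centre point of an odd-length window."""
    half = window // 2
    positions = np.arange(-half, half + 1)
    # Columns are positions**0, positions**1, ..., positions**polyorder.
    A = np.vander(positions, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse gives the fitted coefficient of x**deriv.
    return np.linalg.pinv(A)[deriv] * factorial(deriv) / delta**deriv

def savgol_apply(spectrum, window, polyorder, deriv=0, delta=1.0):
    """Filter the interior points; the half-window at each edge is dropped."""
    c = savgol_coeffs(window, polyorder, deriv, delta)
    # np.convolve flips its kernel, so reverse the weights to get a correlation.
    return np.convolve(spectrum, c[::-1], mode="valid")

# Hypothetical spectrum sampled every 2 nm: one band on a sloping baseline
wavelengths = np.arange(900.0, 1700.0, 2.0)
spectrum = (np.exp(-0.5 * ((wavelengths - 1200.0) / 60.0) ** 2)
            + 0.001 * (wavelengths - 900.0))

first_deriv = savgol_apply(spectrum, 15, 2, deriv=1, delta=2.0)
second_deriv = savgol_apply(spectrum, 15, 3, deriv=2, delta=2.0)
```

The first derivative turns a constant baseline offset into zero, and the second derivative also removes a linear baseline slope, which is why both are common NIR pretreatments.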
2.3 NIR spectroscopy in rice contamination
Fast determination of heavy metals is necessary to ensure the safety of crops. The potential of NIR spectroscopy coupled with chemometrics for the quantitative analysis of cadmium in rice was investigated. The spectra were pre-processed using the first derivative to reduce baseline shift, and several chemometric techniques, such as iPLS, mwPLS, siPLS, and biPLS, were proposed to extract and optimize spectral intervals from the full-spectrum data. The PLS models based on these four algorithms outperformed the full-spectrum PLS model. Among the techniques, biPLS performed best, with the optimal subinterval selection.
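The idea behind interval selection can be illustrated with the simplest of the listed variants, iPLS: the spectrum is split into equal-width intervals, a PLS model is built on each, and the interval with the lowest prediction error is retained. The sketch below uses a compact PLS1 routine, a single hold-out split instead of full cross-validation, and synthetic data in which only one interval carries analyte information; all of these choices are assumptions made for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 regression; returns (x_mean, y_mean, regression vector)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p, qk = Xk.T @ t / tt, yk @ t / tt
        Xk, yk = Xk - np.outer(t, p), yk - qk * t   # deflation
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return x_mean, y_mean, W @ np.linalg.solve(P.T @ W, q)

def holdout_rmse(X, y, n_components, n_train):
    """Fit on the first n_train samples, report RMSE on the remaining ones."""
    x_mean, y_mean, b = pls1_fit(X[:n_train], y[:n_train], n_components)
    pred = y_mean + (X[n_train:] - x_mean) @ b
    return np.sqrt(np.mean((y[n_train:] - pred) ** 2))

def ipls(X, y, n_intervals, n_components=2, n_train=50):
    """Return interval edges and the hold-out RMSE of each interval model."""
    edges = np.linspace(0, X.shape[1], n_intervals + 1).astype(int)
    errors = [holdout_rmse(X[:, a:b], y, n_components, n_train)
              for a, b in zip(edges[:-1], edges[1:])]
    return edges, np.array(errors)

# Synthetic data: 100 variables, analyte signal only in variables 40..59
rng = np.random.default_rng(4)
y = rng.uniform(0.0, 1.0, size=80)
X = rng.normal(0.0, 0.05, size=(80, 100))
profile = np.exp(-0.5 * ((np.arange(20) - 10) / 4.0) ** 2)
X[:, 40:60] += np.outer(y, profile)

edges, errors = ipls(X, y, n_intervals=5)
best = int(np.argmin(errors))   # index of the most informative interval
```

The moving-window, synergy and backward variants (mwPLS, siPLS, biPLS) differ in how intervals are generated, combined or eliminated, but rely on the same per-interval error comparison.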
Heavy metals are spectrally featureless, so spectral responses cannot be used directly to assess heavy metals in rice. Because heavy metals bind closely to protein, crude fiber, and other ingredients, they show a significant correlation with protein in rice. The detection of heavy metal concentration in grain is mostly carried out by direct physical and chemical methods, which can determine residual heavy metal levels exactly but are time consuming, cumbersome, and inefficient. On the basis of the hypothesis that heavy metal concentration can be estimated spectrally through its correlation with protein content, the objectives of the reported study were to: (1) build a quantitative model for the quick prediction of both heavy metal and protein content, and (2) evaluate the feasibility of near-infrared spectroscopy for assessing heavy metal concentration in coarse rice.
Protecting people from heavy metal contamination is an important public-health concern and a major national environmental issue. The NIR spectral technique has been used to identify the concentration of heavy metals such as lead (Pb) and copper (Cu) in rice. The NIR spectral data were treated by several methods, including logarithm, baseline correction, standard normal variate, multiple scatter correction, first derivatives, and continuum removal. Lead (Pb) accumulated in rice at a high level (17.05) compared with the other heavy metals. MSC-PLSR models were developed for Pb (R2 = 0.49, RMSE = 2.01 mg/kg) and Cu (R2 = 0.29, RMSE = 0.75 mg/kg). These results suggest that it is feasible to identify Pb and Cu content in rice using the NIR spectral technique. However, further studies should be performed on the application of the spectral technique to discriminating other heavy metals in rice, given the limitations of the small number of samples and particle-size interference.
Based on the reported studies, it is possible to develop robust classification, authentication or fraud detection models for rice samples by considering their specific physicochemical properties and applying machine learning tools such as PLS-DA, KNN, ANN, and SVM, among other methodologies, to NIR spectroscopy data, revealing the patterns, relationships and chemical similarities of each variety. The classification models developed with these methods make it possible to classify rice varieties with high confidence from the spectral data. The results show that these chemometric tools, combined with the capabilities of spectroscopy, can facilitate the classification and identification of different rice types. Discriminating rice by origin, harvest season and state of conservation, as well as detecting contaminants and adulteration with robust classification methods, can support the creation of a database, a useful tool for rice authenticity that can increase confidence and producer-consumer engagement in rice-based foods.
Acknowledgement of funding: The study was supported by project TRACE-RICE - Tracing rice and valorising side streams along Mediterranean blockchain, grant n° 1934 (call 2019, Section 1 Agrofood) of the PRIMA Programme supported under Horizon 2020, the European Union’s Framework Programme for Research and Innovation, and by Research Unit UIDB/04551/2020 (GREEN-IT, Bioresources for Sustainability).
Conflict of interest
The authors declare no conflict of interest.