Description of specialty coffees evaluated in the sensory analysis with untrained consumers.

## Abstract

The sensory analysis of coffees assumes that the sensory panel is formed by tasters trained according to the recommendations of the Specialty Coffee Association of America. However, the choice that routinely determines the preference for a coffee is made by consumers, most of whom have no specific ability regarding sensory characteristics. Considering that untrained consumers, or those with only basic knowledge of the quality of specialty coffees, have little ability to discriminate between different sensory attributes, it is reasonable to consider the highest score given by each taster. This calls for probabilistic studies based on appropriate probability distributions. To assess the uncertainty inherent in the scores given by the tasters, resampling methods such as Monte Carlo can be considered, and when the distribution of a given statistic is unknown, percentile Bootstrap (p-Bootstrap) confidence intervals become a viable alternative. This text discusses the use of the non-parametric Bootstrap resampling method in sensory analysis, using probability distributions for the maximum scores of tasters and assessing the most frequent region (mode) through computational resampling methods.

### Keywords

- probability
- Monte Carlo
- bootstrap
- GEV distribution
- Serra da Mantiqueira
- altitude
- consumers

## 1. Introduction

Basically, the methodology involved in the analysis of sensory data consists of a set of experimental and statistical techniques applied to verify the quality or degree of acceptance of a given product, without disregarding the characteristics of the individuals with respect to their sensory skills. In this context, two distinct groups of consumers can be considered: consumers who have some enhanced sensory ability, resulting from training or product knowledge, and entirely lay consumers.

In this situation, it is plausible to assume that when a sensory analysis is applied to a group of trained consumers, who are able to discriminate small differences between samples, the evaluations will show little variation [1]. A sensory experiment carried out with this group therefore agrees more closely with the procedures standardized by [2], since the assessments would be more homogeneous in the perception of uniformity, sweetness and defects, among other attributes mentioned by [3, 4].

Conversely, with a group of untrained consumers, the evaluations are more likely to be heterogeneous, so that the statistical treatment of the results may have to deal with atypical observations, classified as outliers, arising from the individual evaluation of each consumer [5, 6].

It is worth mentioning that the heterogeneity between observations may result from uncontrollable factors, such as genetics, fatigue, unwillingness to complete all the tests and differences between the abilities of consumers. It may also result from external causes, such as the geographical origin of a product whose qualities or characteristics are due exclusively or essentially to the geographical environment, including natural and chemical factors. Among these are variations in chemical composition due to the genetic variability between cultivars, which influence the sensory quality of coffees [7, 8, 9, 10, 11, 12].

Given the many possible sources of outliers in a sensory analysis, and turning to the analysis of coffee quality, specialty coffees can be highlighted. Following the definition given by [2], a coffee is said to be a specialty coffee when it presents superior quality to its competitors in relation to its origin, absence of defects, processing and/or sensory expressions such as aroma and flavor.

The results of the sensory evaluation are established on a scale ranging from 0 to 10 in which these values represent the increasing levels of coffee quality. According to the analysis protocol [2] the results of the sensory evaluation vary according to a scale where the grades 6, 7, 8, 9 correspond respectively to: good, very good, excellent and exceptional. When the grades are less than 6, the coffees are declared to be of a quality below the Specialty Grade.

Given these characteristics, *Coffea arabica* cultivars have the potential to be classified as specialty coffees [13, 14]. However, the environment and geographic origin can influence the quality of the drink. In a study relating quality to environmental factors, [14] concluded that the coffees with the highest scores in a contest held in the state of Minas Gerais were produced in regions with milder temperatures and an annual precipitation index of around 1600 mm [15]. In this context, in humid regions it is recommended that processing prioritize peeled and demucilaged coffees, so that the quality of the coffee can be inferred without the interference of defects.

Regarding statistical methodology, the usual methods of analysis are, in general, sensitive to outliers, which are likely to arise in a sensory analysis carried out by untrained consumers [5, 15].

For this reason, and assuming that the assignment of maximum sensory scores can be understood as a random phenomenon, in the sense that the judgments of different consumers vary, this work proposes the use of distributions belonging to the generalized extreme value class in sensory analysis. For this purpose, it analyzes a sensory experiment evaluating four specialty coffees produced in the Serra da Mantiqueira region of Minas Gerais, differentiated by processing and by geographical identification classified according to altitude.

Bootstrap, developed by Efron in the 1970s, can be used in many situations. It is based on the simple yet powerful idea that the sample represents the population, so characteristics of the sample should give information about the analogous characteristics of the population. Bootstrap helps to learn about these sample characteristics by taking resamples (samples drawn with replacement from the original sample), and this information is used to make inferences about the population [16].

In this sense, to detect a difference in the judgment of specialty coffees by trained and untrained tasters, a test built via non-parametric Bootstrap is proposed for the mode of the extreme value distribution that best fits the data set.

## 2. Modeling maximum sensory scores and numerical procedure

In accordance with the opinion of the Ethics and Research Council, registered under CAAE: 14959413.1.0000.5148, the samples of 100% Arabica coffee were prepared by removing all defective beans and roasting, respecting a maximum period of 24 hours before tasting.

The roasting point was determined visually, using the color classification system by means of standardized discs (SCAA/Agtron Roast Color Classification System). Regarding the preparation of the drink, a concentration of 7% w/v was maintained using filtered water ready for consumption, free of any contaminants and without added sugar. With these specifications, four types of specialty coffees were evaluated, coded in the samples as A, B, C and D and described in Table 1.

Type | Genotype | Altitude | Processing |
---|---|---|---|
A | Bourbon | Above 1200 m | Natural |
B | Acaia | Below 1100 m | Pulped natural |
C | Acaia | Below 1100 m | Natural |
D | Bourbon | Above 1200 m | Pulped natural |

For each type of coffee, the following sensory characteristics were assessed in the acceptance test: aroma, body, hardness, and final score, in four sessions, with the participation of a volunteer group of consumers with basic knowledge in regard to sensory analysis of coffees and another group without basic knowledge. Table 2 provides a list of the tasters, as well as the sensory characteristics assessed by each taster, in which *aij* represents the score given by taster *i* (*i = 1, 2, …, n1, n1 + 1, n1 + 2, …, n2*), such that *n1 + n2 = n*, for the sensory characteristic × coffee *j* (*j = 1, 2, …, 16*) combination.

Condition | Taster | Sensory characteristic 1 | | | | … | Sensory characteristic 4 | | | |
---|---|---|---|---|---|---|---|---|---|---|
 | | A | B | C | D | … | A | B | C | D |
Trained | 1 | $a_{1,1}$ | $a_{1,2}$ | $a_{1,3}$ | $a_{1,4}$ | … | $a_{1,13}$ | $a_{1,14}$ | $a_{1,15}$ | $a_{1,16}$ |
 | 2 | $a_{2,1}$ | $a_{2,2}$ | $a_{2,3}$ | $a_{2,4}$ | … | $a_{2,13}$ | $a_{2,14}$ | $a_{2,15}$ | $a_{2,16}$ |
 | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | … | ⋮ | ⋮ | ⋮ | ⋮ |
 | $n_1$ | $a_{n_1,1}$ | $a_{n_1,2}$ | $a_{n_1,3}$ | $a_{n_1,4}$ | … | $a_{n_1,13}$ | $a_{n_1,14}$ | $a_{n_1,15}$ | $a_{n_1,16}$ |
Untrained | 1 | $a_{n_1+1,1}$ | $a_{n_1+1,2}$ | $a_{n_1+1,3}$ | $a_{n_1+1,4}$ | … | $a_{n_1+1,13}$ | $a_{n_1+1,14}$ | $a_{n_1+1,15}$ | $a_{n_1+1,16}$ |
 | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | … | ⋮ | ⋮ | ⋮ | ⋮ |
 | $n_2$ | $a_{n_2,1}$ | $a_{n_2,2}$ | $a_{n_2,3}$ | $a_{n_2,4}$ | … | $a_{n_2,13}$ | $a_{n_2,14}$ | $a_{n_2,15}$ | $a_{n_2,16}$ |

In the test, four different types of coffee were evaluated in terms of the sensory characteristics flavor, acidity, body and final score. In different sessions, volunteer consumers were grouped into two classes: (a) people who habitually consume coffee but have no basic knowledge about specialty coffees and (b) people who habitually consume coffee and were trained with basic information about specialty coffees.

The probability distributions were fitted considering the random variable *X*, representing the maximum sensory scores given by consumers for each type of coffee (Table 1), in a total sample of 696 observations.

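Since the highest score provided by each taster is considered, with each taster treated as a block, the distribution of the maxima, according to the Fisher-Tippett theorem, is the generalized extreme value (GEV) distribution. Its probability density function is defined by:

$$f(x) = \frac{1}{\sigma}\, t(x)^{\xi + 1} \exp\{-t(x)\}, \tag{1}$$

where

$$t(x) = \left[1 + \xi\left(\frac{x - \mu}{\sigma}\right)\right]^{-1/\xi}, \qquad 1 + \xi\left(\frac{x - \mu}{\sigma}\right) > 0, \tag{2}$$

and $\mu$, $\sigma > 0$ and $\xi \neq 0$ are the location, scale and shape parameters, respectively.

The probability that a maximum score exceeds a given score *x* is defined as

$$P[X > x] = 1 - F(x), \tag{3}$$

where $F(x) = \exp\{-t(x)\}$ is the GEV distribution function.

The mode (*Mo*) of the pdf in Eq. (1) is given by

$$Mo = \mu + \frac{\sigma}{\xi}\left[(1 + \xi)^{-\xi} - 1\right], \tag{4}$$

and in the limiting case $\xi \to 0$ (Gumbel distribution) it reduces to $Mo = \mu$.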
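As a rough illustration of the quantities defined in this section, the following Python sketch evaluates the GEV distribution function, the exceedance probability and the mode. It assumes the parameterization with location `mu`, scale `sigma` and shape `xi` in which $F(x) = \exp\{-[1 + \xi(x - \mu)/\sigma]^{-1/\xi}\}$, which we believe is the convention followed by the *evd* package. The function names are illustrative and are not part of the original analysis. The parameter values are the estimates reported for coffee A (untrained group) in Table 3.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function F(x) for shape xi != 0 (assumed parameterization)."""
    z = 1.0 + xi * (x - mu) / sigma
    if z <= 0.0:
        # Outside the support: below the lower bound (xi > 0)
        # or above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-z ** (-1.0 / xi))

def gev_exceedance(x, mu, sigma, xi):
    """P[X > x] = 1 - F(x)."""
    return 1.0 - gev_cdf(x, mu, sigma, xi)

def gev_mode(mu, sigma, xi):
    """Mode of the GEV density; valid for xi > -1 and xi != 0."""
    return mu + sigma * ((1.0 + xi) ** (-xi) - 1.0) / xi

# Estimates for coffee A, untrained group (Table 3).
mu, sigma, xi = 5.9471, 2.4105, -0.6569

print(round(gev_mode(mu, sigma, xi), 1))          # prints 7.8
print(round(gev_exceedance(6.9, mu, sigma, xi), 2))
```

With these estimates the mode agrees with the value 7.8 listed in Table 4, and the exceedance probability at 6.9 is close to the 47.2% listed there; the small difference is expected because the quantile in the table is rounded to one decimal place.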

The goodness of fit for each distribution was validated using the Kolmogorov-Smirnov (KS) adherence test in conjunction with Q-Q plots [20, 21]. The Q-Q plot consists of the points

$$\left\{\left(\hat{F}^{-1}(p_i),\; x_i\right) : i = 1, \ldots, n\right\}, \tag{5}$$

where $\hat{F}^{-1}$ is the quantile function of the fitted model, $p_i$ are the percentiles, $x_i$ are the data used to fit the model, sorted in ascending order, and $n$ is the sample size.
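A minimal sketch of how such Q-Q points could be computed for a fitted GEV model is given below. The plotting positions $p_i = i/(n + 1)$ are an assumption, since the text does not state which percentiles were used, and the scores and parameter values are made up for illustration only.

```python
import math

def gev_quantile(p, mu, sigma, xi):
    """Quantile function of the GEV distribution for xi != 0 and 0 < p < 1."""
    return mu + sigma * ((-math.log(p)) ** (-xi) - 1.0) / xi

def qq_points(data, mu, sigma, xi):
    """Pairs (theoretical quantile, observed value) for a Q-Q plot.

    Uses plotting positions p_i = i / (n + 1), an assumed convention.
    """
    xs = sorted(data)
    n = len(xs)
    return [(gev_quantile(i / (n + 1), mu, sigma, xi), x)
            for i, x in enumerate(xs, start=1)]

# Hypothetical maximum scores and hypothetical GEV parameters.
scores = [6.0, 7.5, 8.0, 8.5, 9.0, 9.5]
for theoretical, observed in qq_points(scores, mu=7.0, sigma=1.5, xi=-0.5):
    print(round(theoretical, 2), observed)
```

Points lying close to the identity line would suggest that the fitted model describes the data reasonably well.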

According to [22], the Kolmogorov-Smirnov (KS) test is used to assess the fit of a probability distribution to the original data. It is based on the proximity between the sample (empirical) distribution function $S(x)$ and the distribution function $F_0(x)$ assumed under the null hypothesis. The test statistic (*D*) is given by

$$D = \sup_{x} \left| F_0(x) - S(x) \right|. \tag{6}$$

In the KS test, the hypotheses of interest are *H0*: the distribution function from which the sample is derived follows the distribution function that is assumed to be known, that is, $F(x) = F_0(x)$; and *H1*: the sample does not follow that distribution function, $F(x) \neq F_0(x)$. *H0* is rejected if the *p*-value is lower than the significance level adopted.
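The KS statistic can be sketched in a few lines of Python. The version below assumes a completely specified distribution function passed in as `cdf` and returns the largest distance between it and the empirical distribution function; the function name and the toy data are illustrative only, and no *p*-value is computed.

```python
def ks_statistic(data, cdf):
    """Largest absolute distance between the empirical CDF and a given CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x,
        # so both sides of the jump have to be checked.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def uniform_cdf(x):
    """CDF of the Uniform(0, 1) distribution, used here as a toy reference."""
    return min(max(x, 0.0), 1.0)

print(ks_statistic([0.1, 0.4, 0.7], uniform_cdf))
```

For the maximum scores, `cdf` would be the fitted GEV distribution function, with the caveat that estimating the parameters from the same sample changes the null distribution of the statistic.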

Regarding the assumption of independence of the observations, which is required by the maximum likelihood method for estimating the parameters, the Ljung-Box (LB) test was used. According to [24], it is a statistical test used to find out whether any group of autocorrelations is non-zero. To do this, it tests overall randomness based on a number of lags. The test hypotheses are *H0*: all autocorrelation coefficients are equal to zero and *H1*: not all autocorrelation coefficients are equal to zero. The test statistic is

$$Q = n(n + 2) \sum_{j=1}^{s} \frac{r_j^2}{n - j}, \tag{7}$$

where *n* is the number of observations, *s* is the number of autocorrelation coefficients tested, *rj* is the autocorrelation coefficient at lag *j* and *Q* is the test statistic. If the sample value of Eq. (7) exceeds the critical value of a Chi-Squared distribution with *s* degrees of freedom, then at least one autocorrelation coefficient *r* is statistically different from zero at the specified significance level, that is, *H0* is rejected. *H0* is also rejected if the *p*-value is lower than the adopted significance level. It should be noted that if *H0* is not rejected, there is no evidence against the independence of the data. In both tests, a significance level of 1% was adopted [25].
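The Ljung-Box statistic, $Q = n(n + 2) \sum_{j=1}^{s} r_j^2 / (n - j)$, can be computed directly from the sample autocorrelations. The sketch below is a plain Python illustration with a hypothetical function name and a toy series; it returns only *Q* and leaves the comparison with the Chi-Squared critical value to the reader.

```python
def ljung_box_q(series, s):
    """Ljung-Box Q statistic using the first s sample autocorrelations."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    total = 0.0
    for j in range(1, s + 1):
        # Sample autocorrelation at lag j.
        r_j = sum(dev[t] * dev[t - j] for t in range(j, n)) / denom
        total += r_j ** 2 / (n - j)
    return n * (n + 2) * total

# Toy series with an obvious trend, so the lag-1 autocorrelation is positive.
print(ljung_box_q([1, 2, 3, 4, 5], s=1))
```

For this toy series the lag-1 autocorrelation is 0.4, so $Q = 5 \times 7 \times 0.16 / 4 = 1.4$ up to floating-point rounding.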

To make inferences about the most frequent score among the tasters, it is necessary to know the sampling distribution of the quantity in Eq. (4). One alternative is to use resampling methods, one of which is presented below.

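The Bootstrap resampling process consists of drawing *B* resamples of the *n* highest scores awarded by trained and untrained tasters. From each resample, an estimate of the parameter of interest can be obtained, denoted by $\widehat{Mo}^{*}_{b}$, $b = 1, \ldots, B$.

Once the empirical distribution of the $\widehat{Mo}^{*}_{b}$ is available, it can be used to build the *p*-Bootstrap confidence interval. More formally, the confidence interval can be constructed with the following steps:

**(Step 1)** Draw, with replacement, from the original sample *P*, one Bootstrap sample of size *n*;

**(Step 2)** From the Bootstrap sample, fit the GEV distribution and compute the estimate of the mode in Eq. (4);

**(Step 3)** Repeat steps 1 and 2 *B* times;

**(Step 4)** From the vector of *B* Bootstrap estimates, build the *p*-Bootstrap confidence interval with confidence level $1 - \alpha$, whose limits are the empirical quantiles of order $\alpha/2$ and $1 - \alpha/2$.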
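The resampling steps above can be sketched generically in Python. To keep the example self-contained, the statistic used here is the sample median, which is only a stand-in: in the study, the second step would refit the GEV distribution to each resample and compute the mode. The function name, the number of resamples, the seed and the scores are all hypothetical.

```python
import math
import random
import statistics

def percentile_bootstrap_ci(data, statistic, B=2000, alpha=0.05, seed=0):
    """Percentile Bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1 to 3: resample with replacement and recompute the statistic B times.
    replicates = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))
    # Step 4: empirical quantiles of order alpha/2 and 1 - alpha/2.
    lower = replicates[math.floor((alpha / 2) * B)]
    upper = replicates[min(B - 1, math.ceil((1 - alpha / 2) * B) - 1)]
    return lower, upper

# Hypothetical maximum scores given by a group of tasters.
scores = [6.5, 7.0, 7.5, 7.5, 8.0, 8.0, 8.5, 9.0, 9.0, 9.5]
low, high = percentile_bootstrap_ci(scores, statistics.median)
print(low, high)
```

Fixing the seed makes the interval reproducible; a different seed or a different *B* would give slightly different limits.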

To complete the proposed methodology, the computational resources available in the R software [27, 28] were used, through the *boot* and *evd* [29] packages, to fit the probability distributions to the sensory scores, perform the hypothesis tests and construct the Bootstrap confidence intervals.

## 3. Experimental results

The following results correspond to the parameter estimates for the probability distributions fitted for the two classes of tasters, as well as the *p*-values referring to the validation of the probabilistic model fitted for the sensory scores.

At a significance level of 1%, the fit to the sensory scores is confirmed for each coffee; that is, there is no statistical evidence against the GEV distribution as a model for the maximum sensory grades of the evaluated coffees (Table 3). Note that *p*-values greater than 1% in the KS test mean that the test's null hypothesis, stated in Section 2, is not rejected. According to [30], however, this test should only be used for completely specified distributions, that is, when there are no unknown parameters to be estimated from the sample. Otherwise, the test is very conservative. One solution would be to obtain, via simulation, the theoretical quantiles of the Kolmogorov-Smirnov test and compare them with the quantiles obtained from the sample. A similar procedure for the Gumbel distribution was carried out by [31]. Alternatively, the quality of fit can be inspected via the Q-Q plots shown in Figure 1.

Coffee | Group | Location estimate | Scale estimate | Shape estimate | KS (*p*-value) | LB (*p*-value) |
---|---|---|---|---|---|---|
A | Untrained | 5.9471 | 2.4105 | −0.6569 | 0.9077 | 0.0803 |
 | Trained | 6.9345 | 1.6259 | −0.6156 | 0.8908 | 0.2359 |
B | Untrained | 5.9326 | 2.4624 | −0.5721 | 0.9466 | 0.3306 |
 | Trained | 6.6031 | 1.9455 | −0.5736 | 0.8255 | 0.0110 |
C | Untrained | 6.4290 | 2.2108 | −0.6348 | 0.9485 | 0.6084 |
 | Trained | 7.0595 | 1.3382 | −0.5485 | 0.9998 | 0.9823 |
D | Untrained | 7.8676 | 2.0437 | −0.9582 | 0.9962 | 0.9625 |
 | Trained | 7.8113 | 1.8183 | −0.8221 | 0.6543 | 0.6924 |

In this sense, the adequacy of the GEV distribution is corroborated by the Q-Q plots shown in Figure 1: for all the specialty coffees evaluated, the theoretical quantiles behave linearly, close to the identity line with the observed quantiles, and most of the points lie within the 95% confidence band. It should also be noted that the quantiles tend to concentrate in the upper tail. All *p*-values of the Ljung-Box test are higher than 1%, so the null hypothesis of the test, as described in Section 2, is not rejected. It can therefore be concluded that there is no evidence of dependence among the maximum scores given by trained and untrained tasters. We highlight that these tests were used to verify the assumptions of the Extreme Value Theory models, but they could be used for other purposes, such as the trend analysis of hydro-climatic series [32, 33, 34]. Failure to meet these assumptions can lead to biased parameter estimates for the fitted models, as well as under- or overestimated return levels. For these situations, Bayesian methods, regression or time series models based on the Box-Jenkins methodology could be considered [35, 36].

Given the confirmatory results for the goodness of fit of the GEV distribution, the parameter estimates for the maximum sensory scores given by consumers for each coffee were used to calculate the probability that an individual gives a grade higher than a given grade. The results are described in Table 4.

Coffee | Group | Mode | q0.025 | q0.975 | P[X > q0.025] (%) | P[X > q0.975] (%) | Difference (%) |
---|---|---|---|---|---|---|---|
A | Untrained | 7.8 | 6.9 | 8.9 | 47.2 | 7.7 | 39.5 |
 | Trained | 8.1 | 7.3 | 9.0 | 54.1 | 8.2 | 45.8 |
B | Untrained | 7.6 | 6.2 | 9.2 | 59.2 | 8.2 | 51.0 |
 | Trained | 7.9 | 6.6 | 9.2 | 62.4 | 7.1 | 55.4 |
C | Untrained | 8.1 | 7.2 | 9.0 | 47.6 | 9.5 | 38.1 |
 | Trained | 7.9 | 7.1 | 8.9 | 62.2 | 8.1 | 54.1 |
D | Untrained | 9.9 | 9.1 | 10.0 | 32.8 | 0.2 | 32.6 |
 | Trained | 9.5 | 8.6 | 10.0 | 44.5 | 0.7 | 43.9 |

Beforehand, the modes of the distributions were calculated, as shown in Table 4, in order to verify the similarity between the grades provided by trained and untrained tasters. In some cases they can be considered very close. For specialty coffees A and B, trained tasters provided higher grades more frequently than untrained tasters, and for specialty coffees C and D the opposite occurred.

Although the similarity between the modes of the grades attributed by the tasters is evident, it is not associated with any level of confidence, since it is based only on point estimates. To circumvent this, confidence intervals were constructed using the non-parametric Bootstrap method, as described in Section 2. Thus, it can be stated with 95% confidence that the grades most frequently attributed to coffee A by trained and untrained tasters do not differ statistically, since the point estimates of the mode of the scores are contained in the respective confidence intervals and the intervals overlap.
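The overlap criterion can be checked mechanically. The short Python sketch below uses the $q_{0.025}$ and $q_{0.975}$ values from Table 4, read here as the limits of the 95% Bootstrap intervals for the mode (an interpretation on our part), and tests whether the intervals of the two groups overlap for each coffee. The helper name is illustrative, and overlap is only an informal criterion, not a formal test of equality.

```python
# Interval limits (q0.025, q0.975) for the mode, taken from Table 4.
intervals = {
    "A": {"untrained": (6.9, 8.9), "trained": (7.3, 9.0)},
    "B": {"untrained": (6.2, 9.2), "trained": (6.6, 9.2)},
    "C": {"untrained": (7.2, 9.0), "trained": (7.1, 8.9)},
    "D": {"untrained": (9.1, 10.0), "trained": (8.6, 10.0)},
}

def intervals_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])

for coffee, groups in intervals.items():
    print(coffee, intervals_overlap(groups["untrained"], groups["trained"]))
```

With the tabulated limits, the intervals of the trained and untrained groups overlap for all four coffees.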

More specifically, for coffee A, the initial mode estimate for untrained tasters is 7.8 points, with Bootstrap quantiles $q_{0.025} = 6.9$ and $q_{0.975} = 8.9$ (Table 4).

According to the results in Table 4, for a sensory panel made up of untrained consumers there is a high probability that consumers give a sensory score higher than 6.0. This indicates that, whoever the taster is, none of the specialty coffees studied would be classified with quality below the Specialty Grade, since for all the coffees analyzed the most frequent grade is, with high probability, higher than 6. On a 9-point verbal hedonic scale, it can be concluded that, in general, consumers tend to be indifferent to the pleasantness of specialty coffees.

Figure 2 presents the histogram and the Q-Q plot for the mode of the fitted distribution for the grades given by the untrained tasters for coffee A in Table 1. The histogram suggests that the empirical distribution of

When considering a score worthy of international competitions, with a reference higher than 8, the probability of a consumer giving a score higher than 8, that is, of the coffee being classified as excellent, is relatively low for all evaluated coffees (Table 4). It is also noted that the probability of an untrained consumer giving coffee D a grade above 9.1 is 32.8%; that is, the probability that coffee D is considered exceptional by such a consumer is 32.8%. In addition, coffee D is the one with the smallest range in probability, corresponding to the column “Difference” in Table 4, which indicates that it is a type of coffee with low variability between the grades attributed by the tasters. On the other hand, coffee B showed greater variability between the grades attributed by the tasters, since the difference *P[X > q0.025] - P[X > q0.975]* is the largest among the analyzed coffees.

Therefore, in the evaluation of the four specialty coffees, given the low probabilities, a sensory experiment carried out with the objective of discriminating between specialty coffees should be conducted with consumers who have more advanced training.

Figure 3 shows graphically the agreement between the scores given by trained (blue hatched) and untrained (black hatched) tasters, according to the results shown in Table 3.

Bootstrap procedures are important for statistically validating scores such as those obtained in international competitions, since subjective and/or unknown factors related to the different sensory perceptions of the tasters may lead to violations of the assumed sampling distribution and, as a consequence, distort the estimates of the probabilistic model. Through successive resampling, an empirical distribution is generated for each parameter of the assumed probabilistic model, so that inferences can be made with better precision and accuracy. The width of the confidence intervals in Figure 3 reflects the precision of the estimates of the mode of the maximum scores, given the GEV distribution. Other Bootstrap confidence intervals could be considered, such as bootstrap-*t* and BCa [37, 38]. We emphasize that the strategy adopted is innovative in the context of sensory scores, and the comparison of confidence intervals can be done as future work.

## 4. Conclusions and final remarks

The GEV distribution can be applied to the sensory analysis of specialty coffees in which the sensory panel is heterogeneous.

The probabilities obtained by this distribution show that the sensory analysis of specialty coffees performed by untrained consumers indicates that they are able to differentiate specialty coffees and provide similar scores to the sensory analysis performed by consumers with prior training.

The proposed inference made it possible to attribute some degree of uncertainty regarding the occurrence of sensory scores in the different types of specialty coffees studied and to indicate which group each coffee belongs to with high probability according to the Specialty Grade.

It can be recommended that more intensive training of tasters, or the application of the proposed methodology with internationally certified tasters, be considered with a view to assessing specialty coffees against a reference score of 9 points, since in the present study only coffee D has a high probability of receiving this score. It should be noted that, according to the analysis protocol of the Specialty Coffee Association of America, the results of the sensory evaluation follow a scale on which grades of 9 and above correspond to exceptional coffee.

The study has some limitations that provide directions for future research. Although the GEV distribution is specific for analyzing maximum values, the data-generating mechanism truncates the maximum score at 10. This characteristic could be taken into account by fitting the model to truncated data. Some proposals have appeared in the literature to consider truncation in maximum likelihood estimation, but there is no consolidated methodology yet. It is therefore a possibility for future research.

## Acknowledgments

The authors are grateful to the National Council for Scientific and Technological Development (CNPq—Conselho Nacional de Desenvolvimento Científico e Tecnológico), the Minas Gerais State Research Support Foundation (FAPEMIG—Fundação de Amparo para Pesquisa do Estado de Minas Gerais), the Coordination for the Improvement of Higher Education Personnel (CAPES—Coordenação de Aperfeiçoamento de Pessoal de Nível Superior), and the National Coffee Science and Technology Institute (INCT/Café—Instituto Nacional de Ciência e Tecnologia do Café).

## Conflict of interest

The authors declare no conflict of interest.