Cross-Cultural Adaption of the GRBAS and CAPE-V Scales for Portugal and a New Training Programme for Perceptual Voice Evaluation

Several methods have been proposed for the perceptual evaluation of voice quality, but the GRBAS and Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) scales are the most widely used and recommended as part of standardised voice evaluation protocols. In this study, cross-cultural adaptation and translation of the GRBAS (the first translation from the original Japanese version) and CAPE-V scales to European Portuguese were carried out following international guidelines. Results from a study of the intra- and inter-rater reliability of the perceptual evaluation of voices with the GRBAS and CAPE-V scales, before and after a training programme designed according to the most recent American Speech-Language-Hearing Association and Japan Society of Logopedics and Phoniatrics guidelines, are also reported.


Introduction
Phonation results from the interaction of the vocal folds with the airflow and the air column above them [1,2]. When air particles pass through the glottis and their speed increases, this reduces the pressure between the vocal folds triggering a suction effect that brings the vocal folds closer to each other, followed by an elastic recoil that promotes a new glottic adduction, thus enabling the production of voice. The sound that results from the vibration of the vocal folds, which is modified by the resonance cavities, is called voice. This audible sound is the product of a complex relation between the pressure and velocity of expiratory airflow, intensity, the different patterns of abduction and adduction of the vocal folds, the vocal tract configuration and resulting resonances [3].
Voice disorders can have a significant negative impact on a person's life, because the voice is an important tool for communication [4]. There is a voice perturbation whenever the vocal quality, intensity, fundamental frequency (f0) or vocal flexibility are altered for the age, sex and culture of the speaker ([5], p. 5). Any difficulty or alteration in the vocal emission that prevents the natural production of the voice is called dysphonia. Dysphonia manifests itself through the following changes: Perturbations in vocal quality, emission effort, vocal fatigue, loss of vocal power, uncontrolled variations of fundamental frequency, lack of intensity and projection, loss of vocal efficiency, low vocal resistance and unpleasant sensations during vocal emission. These result in the alteration of one or several acoustic characteristics of the voice.

Vocal evaluation
Vocal evaluation is considered the first stage of intervention and rehabilitation. Voice assessment has the following main objectives [3]: To know the vocal behaviour of the person, identify the causes of the vocal problem, describe the vocal characteristics of the individual, identify vocal habits, and characterise the relation between body and personality. Perceptual evaluation of voice is routinely used in clinical practice but still poses some inter- and intra-subject problems because it is subjective and often not correlated with the severity of the pathology [6].
According to Chan and Yiu [7], perceptual evaluation is a subjective process in which the intra- and inter-rater reliability can vary. Pontes et al. [8] also point out that the perceptual evaluation of vocal quality has a subjective character, which varies with the evaluator's internal standards for voice quality, perceptual and discrimination skills, and experience in the evaluation of voice. Nevertheless, auditory perception-based assessments can be performed rapidly, are non-invasive and do not require electronic equipment, so results are readily available [9].
The GRBAS scale ([11], pp. 181-209; [12], pp. 83-84) defines five parameters for vocal quality classification: Grade (G), Rough (R), Breathy (B), Asthenic (A), Strain (S). The parameter G corresponds to the grade of alteration of vocal quality; R is the psychoacoustic impression of irregular vocal fold vibration, corresponding to fluctuations in the f0 and amplitude of the glottal source sound; B refers to the psychoacoustic impression of air passing through the glottis, thus relating to turbulence; A assesses weakness or lack of energy in the voice, characterising a weak glottal source sound or a lack of harmonics; finally, S characterises a hyperfunctional state of phonation. The scale is scored from 0 to 3 for each of its five parameters: 0, normal or absence of hoarseness; 1, slight; 2, moderate; 3, severe.
The CAPE-V scale [13][14][15] uses six features to evaluate voice quality: Overall severity, roughness, breathiness, strain, pitch, loudness. The parameter overall severity captures a global impression of voice disturbance, roughness allows clinicians to register source irregularities, the perception of breathiness results from air escape, and strain is related to the perception of vocal effort. The perceived f0 and intensity are registered as the pitch and loudness parameters, respectively. Comments about resonance and additional features, such as falsetto or tremor, can also be registered. The scale is scored from 0 to 100 on 100-mm visual analogue scales for each of its six parameters: Mildly Deviant (MI), Moderately Deviant (MO), and Severely Deviant (SE) qualitative attributes are distributed uniformly along the scale, and the consistent (C) and intermittent (I) labels can be associated with each parameter.
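The two scoring schemes described above can be captured in a small data structure. The following is only a minimal sketch under stated assumptions: the class names, field names and validation rules are hypothetical illustrations, not part of the published GRBAS or CAPE-V materials or of the University of Aveiro's protocol.

```python
from dataclasses import dataclass

# GRBAS: each of the five parameters is scored 0-3 (labels as given in the text).
GRBAS_LABELS = {0: "normal", 1: "slight", 2: "moderate", 3: "severe"}

# CAPE-V: the six parameters named in the text (identifiers are hypothetical).
CAPEV_PARAMETERS = ("overall_severity", "roughness", "breathiness",
                    "strain", "pitch", "loudness")


@dataclass(frozen=True)
class GRBASRating:
    """One judge's GRBAS rating of one voice."""
    grade: int
    rough: int
    breathy: int
    asthenic: int
    strain: int

    def __post_init__(self):
        # Reject anything outside the four allowed scores.
        for name, value in vars(self).items():
            if value not in GRBAS_LABELS:
                raise ValueError(f"{name} must be 0, 1, 2 or 3, got {value!r}")

    def describe(self):
        """Map each parameter to its verbal label."""
        return {name: GRBAS_LABELS[value] for name, value in vars(self).items()}


@dataclass(frozen=True)
class CAPEVScore:
    """One CAPE-V parameter: a mark on the 100-mm line plus a C/I label."""
    parameter: str
    mm: float            # distance marked along the line, 0 to 100
    pattern: str = "C"   # "C" = consistent, "I" = intermittent

    def __post_init__(self):
        if self.parameter not in CAPEV_PARAMETERS:
            raise ValueError(f"unknown CAPE-V parameter: {self.parameter!r}")
        if not 0 <= self.mm <= 100:
            raise ValueError("mm must lie between 0 and 100")
        if self.pattern not in ("C", "I"):
            raise ValueError("pattern must be 'C' or 'I'")
```

For example, `GRBASRating(grade=2, rough=1, breathy=3, asthenic=0, strain=1).describe()` returns the verbal label for each parameter, and `CAPEVScore("roughness", 37.5, "I")` records an intermittent roughness mark at 37.5 mm. Out-of-range values raise `ValueError`.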

Auditory-perceptual training of evaluators
The continuous training of evaluators is recommended in order to guarantee the reliability and validity of a perceptual evaluation of voice. Both the intra- and inter-rater reliability may vary because perceptual voice assessment is a subjective process, but it is generally accepted that the inter-rater reliability is a greater concern. Kreiman et al. [21,22] argued that reliability variation can be attributed to the different internal standards acquired by evaluators.
Helou et al. [23] conducted a study with 10 experienced evaluators and 10 inexperienced evaluators, who rated 10 male voices and 10 female voices with CAPE-V. The results revealed that inexperienced evaluators rate voices more severely than the experienced evaluators. Inexperienced evaluators also had lower intra- and inter-rater reliability than those with experience. Experienced evaluators rated the voices similarly.
Studies by Kreiman et al. [21] and Gerratt et al. [24] used natural and/or synthesised voice samples as anchors, and showed that inter- and intra-rater reliability in the perceptual assessment of voice improved with training.
Anchors are considered references that listeners (evaluators) can use to compare with the signals they are invited to judge [7]. In the study by Eadie and Smith [25], 20 inexperienced and 20 experienced evaluators rated 20 samples of normal voices. The results of this study showed that the anchors reduce inter-rater variability.
Silva et al. [26] analysed the impact of auditory-perceptual training on the evaluation of voice performed by speech and language therapy students. Seventeen students analysed samples of normal and dysphonic voices with the GRBAS scale. All students had auditory training during a total of nine weekly sessions, each about 15 min long. The evaluation of voice samples was performed before and after the training, and at four other moments during the meetings. Student ratings were compared with an assessment by three voice specialists. The results showed that the students' success rate at the pre-training moment was between regular and good. The number of hits was maintained throughout the evaluations for most of the scale parameters. At the post-training moment, performance improved, mainly for the Asthenic (A) parameter.
Training judges/listeners has been shown to 'increase the extent to which they share common standards for different' ([10], p. 63) voice qualities, so the current study also includes an analysis of the intra- and inter-rater reliability of the perceptual evaluation of voices with the GRBAS and CAPE-V scales, before and after a training programme.

Cross-cultural adaptation and translation of the GRBAS and CAPE-V scales
To the best of our knowledge, there are no standard assessment instruments to perceptually evaluate voice quality in European Portuguese (EP), so clinicians in Portugal use various translations of GRBAS ([27,28], pp. 66-69) and CAPE-V [29], and generally have no access to EP versions of the original instructions published by the Japan Society of Logopedics and Phoniatrics ([11], pp. 181-209) or the American Speech-Language-Hearing Association [13]. They therefore use various procedures and non-standardised definitions of the parameters [30][31][32]. The resulting differences in voice assessment methods hamper the development, objectivity and specificity of therapeutic plans, thus compromising the efficiency and efficacy of intervention strategies.
We believe that access to the original authors' definitions of the core concepts behind the development of health instruments contributes towards the standardisation of evaluation procedures and to considerable improvements in intra- and inter-rater reliability. The translation and adaptation processes of the whole tool (not just the score sheets) should follow international guidelines [33] for cross-cultural adaptation of health assessment instruments. Evidence-based practice would thus be enhanced, and comparisons across countries would be facilitated. A broader evidence base for effective service delivery planning based on results from large-scale randomised controlled trials requires that the same assessment instruments are validated in different cultures.
Cross-cultural adaptation of instruments is necessary when the new target population differs from the original population for which the assessment tool was developed in terms of culture or cultural background, country and language. There are specific guidelines [34,35] to preserve the sensitivity that the assessment tool has in the original culture [36]. The steps that must be followed, if relevant to the specific assessment tool, are [37,38]: translation, synthesis of the translations, back-translation, committee review and pre-testing.
The first stage of a cross-cultural adaptation must be the production of several translations by, at least, two independent translators. In a second stage, the two translators synthesise the results of the translations, producing one common translation [37]. It is then necessary to back-translate the assessment tool (third stage), which means translating back from the final language into the source language, producing as many back-translations as translations, based on the synthesised translation [36]. In the fourth stage, an expert committee compares the source and the final version. The fifth stage consists of a cognitive debriefing that tests alternative wording, understandability, and interpretation of the translation [35].
In this study, the cross-cultural adaptation and translation of the GRBAS and CAPE-V scales to EP were carried out following these international guidelines [33].

Method
Ethical approval was obtained from all authorities required by Portuguese bylaws for clinical research: National data protection committee, independent ethics committees. Informed consent was collected from all participants prior to any data collection.

Cross-cultural adaptation and translation of the GRBAS and CAPE-V scales
The original description of the GRBAS scale [11] was written in Japanese, and Professor Minoru Hirano described the scale only briefly in his publication ([12], pp. 83-84).
The work reported in this book chapter is part of the validation process of a voice evaluation protocol developed by our research team [18][19][20] and freely available from the ACSA platform. University of Aveiro's Voice Evaluation Protocol [18] includes the assessment of voice quality, glottal attack, respiratory support, respiratory-phonatory-articulatory coordination, digital laryngeal manipulation (laryngeal crepitation) and laryngeal tension. It also allows the self-assessment of voice quality and instrumental evaluation. Results from various instrumental evaluation techniques (videostroboscopy, aerodynamics, electroglottography and electromyography) can be registered by the protocol, including an extensive acoustic analysis based on sustained productions of /a, i, u, ɔ/, CAPE-V sentences and reading a passage. The complete protocol provides data to test different methods applied to voice function assessment. The focus of our research was improving the performance of the assessment methods used by voice clinicians.

Training programme
The user interface, terminology, audio and video samples of the GRBAS CD and DVD developed by the Japan Society of Logopedics and Phoniatrics (JSLP) [39] were used as a standard reference to design the training programme described in the subsequent text, according to the most recent American Speech-Language-Hearing Association (ASHA) and JSLP guidelines.
Forty-five EP speakers from the Advanced Voice Function Assessment Databases (AVFAD) (see Jesus et al. [40] in this book for a detailed description) were used as auditory stimuli. Fifteen participants were selected for anchors, 15 for training and 15 for evaluation.
The voices from the AVFAD were selected based on auditory perception in a quiet room by a speech and language therapist (the second author of this book chapter) using VLC media player 2.1.3 (Rincewind) running on a laptop connected to a pair of NGS 2.1 loudspeakers. The same speech and language therapist classified the representative samples (anchors) of all selected voices with the GRBAS and CAPE-V scales.
Judges listened to sustained productions of the vowels /a, i/ played in a quiet environment at the same volume through a pair of Sennheiser HD380Pro headphones connected to the internal soundcard of a laptop computer.
Ten female speech and language therapists were asked to rate the severity of dysphonic voice stimuli using the GRBAS and CAPE-V scales. Each judge first evaluated 15 voices without any training and then went through a training programme based on two 1-h sessions.
During the first session, judges read detailed written instructions and the original description of the scale, and then classified anchor voices that covered several grades of severity for vocal quality. During the second session, a new set of voices (training voices) was classified, and judges could listen to one anchor for every five voices that were classified. At the end of the second session, all judges had access to a feedback document, but they could not change their classifications. One week later, the same judges classified a new set of 15 voices.
IBM SPSS Statistics 23 was used to calculate the Intraclass Correlation Coefficient (ICC) of responses and to run a one-way analysis of variance (ANOVA) with repeated measures.

Results

Cross-cultural adaptation and translation of the GRBAS and CAPE-V scales
Two EP translations of the original American English CAPE-V scoresheet and instructions were produced by two independent translators. This led to the detection of errors and divergent interpretations of ambiguous items in the original tool [33]. The translators were fluent in both languages (with the target language as their mother tongue), knowledgeable of the two cultures, and experts in the content measured by the instrument (they were both speech and language therapists, SLTs). Then, both translators synthesised the results of the translations, producing one common translation, which was used to back-translate the assessment tool, producing as many back-translations as translations (two). An expert committee compared the source and the final versions and produced a pre-final version for field testing, based on all translations and back-translations.
During CAPE-V's cognitive debriefing (final stage of the cross-cultural adaptation and translation), alternative wording and interpretation of the translation were tested in five clients with voice pathology by the second author of this book chapter (an SLT), at the University of Algarve. Finally, the translation was revised taking into account the feedback obtained.
Hirano's ([12], pp. 83-84) GRBAS description was translated by the same group of experts involved in CAPE-V's cross-cultural adaptation, using exactly the same process and stages [33].
A professional Japanese translator (Tomoko Suga) certified by the Japanese embassy in Lisbon (Portugal) translated the original Takahashi ([11], pp. 181-209) instructions. The translator was fluent in both languages (Japanese and EP) and knowledgeable of the two cultures. Since the original Japanese used by Takahashi ([11], pp. 181-209) is quite different, in some respects, from what is used nowadays and scientific terminology has changed considerably, only the core descriptions by Takahashi ([11], pp. 181-209) were retained in the Portuguese version, and the translation had to be thoroughly revised by an expert committee that included the original Japanese translator, the first author of this book chapter and an SLT blind to the purposes of the study.
The same group of five clients with voice pathology recruited for CAPE-V's cognitive debriefing was involved in the analysis of Takahashi's ([11], pp. 181-209) and Hirano's ([12], pp. 83-84) GRBAS descriptions, which covered the level of comprehensibility of the instructions and the final items, the cognitive equivalence of the translation, translation alternatives and items that were potentially inappropriate or confusing.
CAPE-V's cognitive debriefing results showed no inconsistencies, but the analysis of the GRBAS instructions revealed that the number of vowels required by the latter protocol is not the same as that currently suggested by ASHA [13] or in the most recent voice assessment procedures based on sustained vowels.
According to Takahashi ([11], pp. 181-209), clinicians should perceptually evaluate five sustained vowels [a, ɛ, i, ɔ, u] and choose to register, on a table that is part of the score sheet, the one to which they attribute the highest GRBAS parameter scores.
The use of sustained vowels usually results in articulatory stability and allows the clinician to focus on the typology of a voice source signal that is more regular and stable than in connected speech, facilitating the perceptual assessment of voice quality [41,42].
Given that the CAPE-V protocol [13] proposes [a, i] as the sustained vowels to be used during assessment, and that both GRBAS and CAPE-V are used by the University of Aveiro's Voice Evaluation Protocol [18], these two vowels were proposed, following cognitive debriefing, as the basis of perceptual evaluation. According to ASHA ([13], p. 3), [i] is used because it is the only sound that speakers can produce during laryngeal videoendoscopy, and /a/ is used because it differs from /i/ in terms of its degree of tenseness: /a/ is a lax vowel and /i/ is a tense vowel in most English dialects. Portuguese does not have the tense-lax contrast, but still, the close /i/ versus open /a/ distinction could be used to monitor the effect of an enlarged pharyngeal cavity for close vowels ([43], p. 627) and the lowering of laryngeal structures ([43], p. 633) for open vowels.
The translations were revised taking into account the results from the cognitive debriefing process, and the assessment tools can now be administered to a representative sample of the population. Both Takahashi's ( [11], pp. 181-209) and Hirano's ( [12], pp. 83-84) instructions in Portuguese are now included in the University of Aveiro's Voice Evaluation Protocol [18] manual and available from the ACSA platform.

Training programme
The GRBAS CD and DVD [39] were thoroughly analysed (see Figure 1); terminology, audio and video samples therein were used as a standard reference to design the training programme (described earlier).
This resulted in a first prototype of a PowerPoint presentation, shown in Figure 2, based on Japanese audio and video samples [39] that guided the final PowerPoint presentation design that then used audio samples from the Portuguese AVFAD [40].
The anchors were available during the training programme on a PowerPoint 2010 presentation with a total of 15 slides formatted as shown in Figure 3.
The ICC is a measure of inter-rater reliability that describes the similarity between the responses observed within a given set. The ICC value varies between 0 and 1: the closer it is to 1, the more consistent the results.
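To make the ICC more concrete, the sketch below computes it from a table of ratings (one row per voice, one column per rater). It is only an illustration under an explicit assumption: the chapter does not state which ICC model was selected in SPSS, so the code uses the two-way random-effects, absolute-agreement form (often written ICC(2,1) for a single rater and ICC(2,k) for the mean of k raters). The function names are hypothetical.

```python
def mean_squares(ratings):
    """Two-way ANOVA mean squares for a voices x raters table (no missing cells)."""
    n = len(ratings)        # number of voices (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between voices
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols                   # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return n, k, msr, msc, mse


def icc_two_way_random(ratings):
    """Return (single-rater ICC, average-of-k-raters ICC), absolute agreement.

    Assumed model: two-way random effects; this choice is an assumption of
    the sketch, not something reported in the chapter.
    """
    n, k, msr, msc, mse = mean_squares(ratings)
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Illustrative (made-up) ratings: 4 voices rated by 3 judges on a 0-3 scale.
example = [
    [0, 0, 1],
    [1, 1, 1],
    [2, 3, 2],
    [3, 3, 3],
]
single_icc, average_icc = icc_two_way_random(example)
```

The average-measures value is never lower than the single-measure value when agreement is positive, which is one reason why a mean ICC over 10 raters can look very high even when individual raters still differ.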
The ICC mean of 10 raters for the parameters of the GRBAS scale is presented in Table 1.
All ICC values (pre- and post-training) are very high, which indicates good agreement between judges. A post-training increase of ICC values was observed only for the Strain parameter, which suggests that the training programme was not very effective. However, since the pre-training values are already very high, it is harder to observe an increase after training. The Strain parameter is the only one with a mean ICC value below 0.900 pre-training, so the difficulty in observing the expected effect of training in the other parameters could be related to their pre-training ICC values already being above a certain threshold. Still, even though evaluators had distinct pre-training standards (different GRBAS parameter values), all of them changed their classifications post-training.
We also ran a one-way ANOVA with repeated measures for the GRBAS scale. This analysis allowed us to test if the evaluators changed the classifications as a function of time (pre- to post-training) and how this change relates to possible differences between them. The results are shown in Table 2.
Table 2 shows that the effect of time and the interaction between time and rater are significant for all parameters. The main effect of rater is significant for all parameters except Rough and Breathy.
The ICC mean of 10 raters for the parameters of the CAPE-V scale is presented in Table 3.
Similar to what was observed for the GRBAS scale, all ICC values (pre- and post-training) are quite high, which indicates good agreement between judges. A post-training increase of ICC values was observed only for the Breathiness parameter, which suggests that the training programme was not very effective. However, when analysing the CAPE-V parameter values, all evaluators changed in the same direction from pre- to post-training, that is, all 10 evaluators presented either higher or lower values post-training for a specific parameter.
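The direction-of-change observation above can be checked with a few lines of code. This is a minimal sketch that assumes the direction is judged from each rater's mean rating for one parameter before and after training; the chapter does not specify how the direction was determined, and the function names and data below are hypothetical.

```python
from statistics import mean


def change_direction(pre, post):
    """Return +1, -1 or 0 for the sign of (mean post rating - mean pre rating)."""
    diff = mean(post) - mean(pre)
    return (diff > 0) - (diff < 0)


def all_same_direction(pre_by_rater, post_by_rater):
    """True if every rater's mean moved the same way (all up or all down)."""
    directions = {
        change_direction(pre_by_rater[rater], post_by_rater[rater])
        for rater in pre_by_rater
    }
    # Exactly one distinct direction, and it is not "no change".
    return len(directions) == 1 and 0 not in directions


# Made-up CAPE-V roughness ratings (0-100) for two judges, for illustration only.
pre = {"J1": [40, 55, 30], "J2": [50, 60, 45]}
post = {"J1": [30, 45, 25], "J2": [42, 50, 40]}
same = all_same_direction(pre, post)
```

Because the pre- and post-training voice sets differ in this study, a comparison of means like this one describes a shift in the ratings given, not a change on identical stimuli.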
We also ran a one-way ANOVA with repeated measures for the CAPE-V scale. The results are shown in Table 4.
Table 4 shows that both effects and their interaction are significant for all the parameters.
Table 3. ICC for the parameters of the CAPE-V scale pre- and post-training.
Advances in Speech-language Pathology

Conclusions
One of the major contributions of this work was the development of the first non-Japanese version of the original manual of the GRBAS scale and the first Portuguese version of the detailed design considerations, description and instructions of CAPE-V. This research followed international guidelines for the translation and cultural adaptation of health assessment tools.
The GRBAS and CAPE-V scales are now part of the following comprehensive and unique set of resources developed for clinicians at the University of Aveiro in Portugal: A standardised voice case history form [44,45]; a voice evaluation protocol [18]; a reference voice database [40]. All of these are freely available from the ACSA platform.
The manuals developed during this project had a crucial impact on the training of judges.
The ICC values were generally very high, possibly as a result of the written instructions and detailed description of the scales; this is one possible cause of the small training effect. The definition of the Breathiness parameter benefited particularly from the availability of these instructions. Problems related to the Portuguese term for Grade, 'grau de rouquidão', being erroneously interpreted as the CAPE-V term 'rouquidão' (Roughness), as previously reported by Jesus et al. ([16], p. 62), have been circumvented by the manual, the training and the samples of voices that represent specified grades of severity.
We also ran a one-way ANOVA with repeated measures for the GRBAS and CAPE-V scales. This analysis allowed us to test if the evaluators changed the classifications as a function of time (pre-to post-training) and how this change relates to possible differences between them.
Regarding the analysis of variance, taking into account the time factor as the main object of study, results showed pre-to post-training differences. The evaluators had individual and distinct standards, and changed the classifications, allowing us to conclude that their internal standards have been modified.
Increasing the level of experience of the evaluators, or the number of training sessions, could have contributed to reducing the variability of the results.
Table 4. p-Values of the repeated measures ANOVA for the parameters of the CAPE-V scale, with the judges as between-subjects factor and time as within-subjects factor.