This article details a model for evaluations of sound quality in the human auditory system. The model includes an autocorrelation function (ACF) mechanism. Thus, we conducted physiological and psychological experiments to search for evidence of the ACF mechanism in the human auditory system. To evaluate physiological responses related to the peak amplitude of the ACF of an auditory signal, which represents the degree of temporal regularity of the sound, we used magnetoencephalography (MEG) to record auditory evoked fields (AEFs). To evaluate psychological responses related to the envelope of the ACF of an auditory signal, which is a measure of the repetitive features of an auditory signal, we examined perceptions of loudness and annoyance. The results of the MEG experiments showed that the amplitude of the N1m, which is found above the left and right temporal lobes around 100 ms after stimulus onset, was a function of the peak amplitude and its delay time or the degree of envelope decay of the ACF. The results of the psychological experiments indicated that loudness and annoyance increased for sounds with envelope decay of the ACF in a certain range. These results suggest that an autocorrelation mechanism exists in the human auditory system.
- auditory evoked field
- pitch strength
Correlation is one of the most common and useful statistical concepts. It measures the strength and direction of a linear relationship between two variables. Figure 1 shows some examples of correlations between pairs of variables, including white noise signals with different phases, pure tones with the same frequency and phase, pure tones with different frequencies, human voice signals and time-delayed versions of the same signal, environmental noise signals and time-delayed versions of the same signal, and environmental noise signals obtained at the left and right ears. The correlation coefficient ranges between −1 and 1, and characterizes the strength of the relationships between the two variables.
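As a minimal sketch of the correlation coefficient described above, the following computes it for three pairs of pure tones. The sampling rate, tone frequencies, and helper name `correlation` are illustrative assumptions and are not taken from Figure 1.

```python
import numpy as np

fs = 48000                      # sampling rate in Hz (illustrative choice)
t = np.arange(fs) / fs          # 1 s time axis

tone_500 = np.sin(2 * np.pi * 500 * t)
tone_700 = np.sin(2 * np.pi * 700 * t)
tone_500_inverted = np.sin(2 * np.pi * 500 * t + np.pi)   # half-period phase shift

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    return float(np.corrcoef(x, y)[0, 1])

r_same = correlation(tone_500, tone_500)            # same frequency and phase
r_diff = correlation(tone_500, tone_700)            # different frequencies
r_anti = correlation(tone_500, tone_500_inverted)   # same frequency, opposite phase
```

Identical tones give a coefficient near 1, tones of different frequencies give a coefficient near 0, and a half-period shift gives a coefficient near −1.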
When a signal is represented as a time series, it is characterized by periodicity or randomness as a function of time. Figure 2 shows some examples of relationships between a signal and a time-delayed version of that signal. The signals included in the figure are white noise, pure tones, a human voice, and train noise. The way in which correlation coefficients change as a function of delay time can be evaluated using an autocorrelation function (ACF). An ACF is a set of correlation coefficients that characterize the relations between the points in a series and a time-delayed version of the same series. In other words, the ACF is a time-domain function that measures how much a waveform resembles a delayed version of itself. While the values of an ACF can extend beyond the range from −1 to 1, the normalized ACF (NACF) for a signal, φ(τ), is defined by

$$\varphi(\tau) = \frac{\Phi(\tau)}{\Phi(0)},$$

where Φ(τ) is the ACF of the signal at delay time τ.
That is, the ACF is normalized by its maximum value, which occurs at zero delay, Φ(0), thus restricting the values to the range between −1 and 1. Figure 3 shows some examples of the NACF. As white noise is random, its NACF is close to zero at all nonzero delays. As pure tones are completely periodic, their NACF is also periodic, with maximum and minimum values of 1 and −1, respectively. The human voice and environmental noise have periodic components, so the NACF values for these stimuli are high at delays corresponding to the period of the dominant frequency component.
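The normalization by Φ(0) can be illustrated with a short sketch. The function name `nacf`, the fixed-length analysis window, and the test signals are assumptions made for this example only.

```python
import numpy as np

def nacf(signal, max_lag):
    """Normalized ACF, Phi(tau) / Phi(0), for lags 0..max_lag (in samples)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = len(x) - max_lag        # same number of products at every lag
    acf = np.array([np.dot(x[:n], x[lag:lag + n]) for lag in range(max_lag + 1)])
    return acf / acf[0]

fs = 48000                                  # sampling rate in Hz (illustrative)
t = np.arange(fs) / fs                      # 1 s of signal
rng = np.random.default_rng(0)

phi_noise = nacf(rng.standard_normal(fs), max_lag=200)
phi_tone = nacf(np.sin(2 * np.pi * 500 * t), max_lag=200)

# A 500 Hz tone has a period of 96 samples at 48 kHz:
# its NACF returns to 1 at lag 96 and reaches -1 at lag 48.
```

For the white noise, the NACF stays close to zero at every nonzero lag, whereas the NACF of the tone oscillates between −1 and 1 with the period of the tone.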
Mathematically, the ACF contains the same information as the power spectrum of a given signal, because the two form a Fourier transform pair. For characterization of auditory signals, five factors are extracted from the ACF. The first factor is the energy at the point with zero delay, given by Φ(0), which corresponds to the equivalent continuous sound pressure level (SPL). The second and third factors are the amplitude and delay time of the first maximum peak of the NACF, φ1 and τ1, which are related to the perceived pitch strength and pitch [2, 3]. The fourth factor is the effective duration of the envelope of the NACF, τe, which is defined as the ten-percentile delay, i.e., the delay at which the envelope of the NACF decays to 0.1. It represents a repetitive feature contained in the auditory signal itself and is related to the preferred condition for the temporal factors of a sound field, such as reverberation time and the delay time of the first reflection [3, 4]. The fifth factor is the width of the NACF peak around the origin of the delay time, Wφ(0), which is defined as the width at which the NACF has a value of 0.5. It corresponds to the spectral centroid. The definitions of the ACF factors are depicted in Figure 4.
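The factor definitions above can be sketched in code under some simplifying assumptions. Here the "first maximum peak" is taken as the largest NACF value after the first zero crossing, and τe is obtained by fitting a straight line to the NACF peaks (in dB) over the first 5 dB of decay and extrapolating to −10 dB. The 5 dB fitting range, the function names, and the synthetic damped-cosine NACF are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def peak_and_width(phi, fs):
    """Return (phi1, tau1, W_phi0) from an NACF sampled at lags 0..L."""
    below_half = np.flatnonzero(phi < 0.5)
    w_phi0 = 2 * below_half[0] / fs          # symmetric width at a value of 0.5
    negative = np.flatnonzero(phi < 0)
    start = negative[0] if negative.size else 1   # skip the lobe around zero delay
    lag1 = start + int(np.argmax(phi[start:]))
    return float(phi[lag1]), lag1 / fs, w_phi0

def effective_duration(phi, fs, fit_range_db=5.0):
    """Delay at which a line fitted to the initial peak decay reaches -10 dB."""
    mag = np.abs(phi)
    inner = mag[1:-1]
    is_peak = (inner >= mag[:-2]) & (inner > mag[2:])
    peaks = np.concatenate(([0], np.flatnonzero(is_peak) + 1))
    level_db = 10 * np.log10(mag[peaks])
    keep = level_db > -fit_range_db
    slope, intercept = np.polyfit(peaks[keep] / fs, level_db[keep], 1)
    return float((-10.0 - intercept) / slope)

# Synthetic NACF: a 500 Hz cosine under an exponentially decaying envelope.
fs = 48000
decay = 0.02                                  # envelope time constant in s
lags = np.arange(4801) / fs
phi = np.exp(-lags / decay) * np.cos(2 * np.pi * 500 * lags)

phi1, tau1, w_phi0 = peak_and_width(phi, fs)
tau_e = effective_duration(phi, fs)
```

For this synthetic NACF the envelope is exp(−τ/0.02), so it reaches 0.1 at τ = 0.02 ln 10 ≈ 46 ms, which is the value the fit should recover for τe; τ1 equals the 2 ms period of the cosine.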
The ACF is one of the best-known models for describing the perception of pitch and pitch strength. Pitch is thought to be extracted by the ACF in the temporal model of pitch perception [e.g., 5–7] and pitch strength corresponds to φ1, which represents the degree of temporal regularity of a sound [e.g., 1–3, 6]. It is possible to systematically manipulate the values of φ1 using iterated rippled noise (IRN). IRN is produced by adding a delayed version of a noise signal to the original signal, and then repeating this delay-and-add process. Increasing the number of iterations increases the periodicity and the φ1 value.
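The delay-and-add process can be sketched as follows. This is a minimal illustration that assumes unit gain, white (unfiltered) input noise, and a network in which the running output is delayed and added back to itself at each iteration; the function names are hypothetical.

```python
import numpy as np

def iterated_rippled_noise(noise, delay_samples, n_iterations, gain=1.0):
    """Repeat the delay-and-add step n_iterations times on the running output."""
    y = np.asarray(noise, dtype=float).copy()
    for _ in range(n_iterations):
        delayed = np.zeros_like(y)
        delayed[delay_samples:] = y[:-delay_samples]
        y = y + gain * delayed
    return y / np.max(np.abs(y))              # rescale to avoid clipping

def nacf_at_lag(x, lag):
    """NACF value at a single lag (in samples)."""
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

fs = 48000
delay = round(0.004 * fs)                     # 4 ms delay, i.e., a 250 Hz pitch
noise = np.random.default_rng(0).standard_normal(fs)

# For unit gain and white input noise, the expected NACF value at the delay
# is n / (n + 1) for n iterations, so it grows toward 1 with more iterations.
phi1_by_iterations = {
    n: nacf_at_lag(iterated_rippled_noise(noise, delay, n), delay)
    for n in (2, 4, 8, 16, 32)
}
```

The resulting values of φ1 rise with the number of iterations, which is the property used to manipulate pitch strength.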
Physiologically, IRN elicits signals in auditory nerve fibers [8, 9] and cochlear nucleus neurons [10–12], indicating that the pitch of IRN is represented in the firing patterns of action potentials locked to either the temporal fine structure or the envelope periodicity. That is, autocorrelation-like behavior in the fine structure of the neural firing patterns suggests that the pitch of IRN is based on an ACF mechanism. Indeed, the pooled interspike interval distributions of auditory nerve discharge patterns in response to complex sounds are similar to the ACF of the stimulus waveform, and φ1 of the ACF corresponds to pitch strength [13, 14].
Therefore, to find the physiological counterparts of an ACF mechanism in the human auditory cortex, we used magnetoencephalography (MEG) to investigate the auditory evoked magnetic field (AEF) elicited by IRN and bandpass filtered noise (BPN). The φ1 value can be manipulated systematically by changing the bandwidth of the BPN. A narrower bandwidth produces a higher φ1. In MEG, the measured signals are generated by synchronized neuronal activity in the human brain. The time resolution is in the range of milliseconds. Thus, this technique can be used to examine rapid changes in cortical activity that reflect ongoing signal processing in the brain; electrical events in single neurons typically last from one to several tens of milliseconds. With respect to the psychological aspect of sound perception, we evaluated the effects of another ACF factor, τe, on loudness and annoyance because it can explain changes in loudness even when SPL conditions are unchanged.
2. AEFs in relation to the peak amplitude of the ACF, φ1
2.1. AEFs in relation to IRN
MEG has been used to investigate how features of sound stimuli related to pitch are represented in the human auditory cortex. For instance, tonotopic organization of the human auditory cortex has been investigated as a spatial representation of pure tone in the auditory system according to frequency [16–18]. The frequency of pure tones has been found to influence the source location of AEF response components, such as the N1m, in the human auditory cortex. The periodicity of pitch-related cortical responses has been investigated as part of the temporal structure of sound [19, 20]. However, it is currently unclear whether periodic pitch is reflected in the location of the source of the AEF response in the human auditory cortex.
To evaluate responses related to the first maximum peak of the ACF, φ1, which corresponds to pitch strength, in the auditory cortex, we recorded the AEFs elicited by IRNs with different iteration numbers. We anticipated that the N1m amplitude would increase with φ1. The N1m is a typical component of the AEFs, which is generated in the auditory cortex approximately 100 ms after stimulus onset, offset, or a change in sound. A large number of physical and psychological parameters have been reported to influence N1m responses, including intensity, frequency, interaural level or time difference, threshold, states of arousal, and selective attention. For example, the N1m is correlated with basic sensations such as loudness and pitch.
Ten normal-hearing listeners (22−36 years; all right-handed) took part in the experiment. We produced an IRN using a delay-and-add algorithm applied to BPN that was filtered using fourth-order Butterworth filters between 100 and 3500 Hz. The number of iterations of the delay-and-add process was set at 2, 4, 8, 16, and 32, and the delay was set to 2 and 4 ms, corresponding to pitch values of 500 and 250 Hz, respectively. The stimulus duration was 0.5 s, including rise and fall ramps of 10 ms. The sounds were digital-to-analog (D/A) converted with a 16-bit sound card and a sampling rate of 48 kHz. Sounds were presented at an SPL of 60 dB through insert earphones inserted into both the left and right ear canals. Figure 5 shows the temporal waveforms and the power spectra of some of the IRN used in this experiment. Figure 6 shows the ACF waveform of some of the IRN used in this experiment. The τ1 value of an IRN is equal to its delay. The φ1 value increases as the number of iterations increases.
The AEFs were recorded using a 122-channel whole-head DC superconducting quantum interference device (DC-SQUID) magnetometer (Neuromag-122™; Neuromag Ltd., Helsinki, Finland) in a magnetically shielded room. The IRNs were presented in a randomized order with a constant interstimulus interval of 1.5 s. To maintain listeners’ attention level, listeners were instructed to watch a self-selected silent movie and ignore the stimuli during the experiment. The magnetic data were sampled at 0.4 kHz after being bandpass filtered between 0.03 and 100 Hz, then averaged approximately 100 times. The averaged responses were digitally filtered between 1.0 and 30.0 Hz. We analyzed a 0.7 s period starting 0.2 s prior to the stimulus onset, and an averaged 0.2 s prestimulus period served as the baseline.
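The averaging and baseline steps described above can be illustrated for a single simulated channel. The evoked waveform, noise level, and DC offset are hypothetical values in arbitrary units, and the 1.0–30.0 Hz digital filtering is omitted from this sketch.

```python
import numpy as np

fs = 400                                      # sampling rate in Hz (0.4 kHz)
t = (np.arange(280) - 80) / fs                # 0.7 s epoch starting 0.2 s before onset
n_trials = 100
rng = np.random.default_rng(1)

# Hypothetical evoked response peaking 100 ms after onset (arbitrary units).
evoked = np.exp(-0.5 * ((t - 0.1) / 0.015) ** 2)
# Each trial = evoked response + DC offset + sensor noise (all simulated).
trials = evoked + 5.0 + rng.standard_normal((n_trials, t.size))

average = trials.mean(axis=0)                 # average across trials
baseline = average[t < 0].mean()              # mean of the 0.2 s prestimulus period
corrected = average - baseline

post = t >= 0
peak_latency = float(t[post][np.argmax(corrected[post])])
peak_amplitude = float(corrected[post].max())
```

Averaging 100 trials reduces the noise by a factor of about 10, so the simulated peak near 100 ms stands out clearly after baseline correction.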
We conducted source analysis for the measured field distribution based on the model of a single moving equivalent current dipole (ECD). Source estimates were based on a subset of 40–44 channels over each hemisphere. The dipole with the maximal goodness-of-fit over the analysis time window was chosen for further analysis. Only dipoles with a goodness-of-fit of more than 80% were included in the further analyses. The source waveforms for all stimuli were calculated using the best-fitting dipole in each hemisphere. The peak amplitudes and latencies of the N1m reported in the following sections are based on the source waveforms.
Clear N1m responses were observed in both the left and right temporal areas in all listeners, as shown in Figure 7. The N1m latencies were not systematically affected by the number of iterations of the IRN. Figure 8 depicts the mean N1m amplitude across 10 listeners as a function of the number of iterations. A greater number of iterations of the IRN, i.e., a larger φ1 value, produced a larger N1m amplitude. This suggests that a stronger pitch produces a larger N1m response. This result is consistent with previous studies [22, 23]. Previously, the amplitude of the AEF component elicited by periodic stimuli was compared with simulated peripheral activity patterns of the auditory nerve. The researchers reported that the amplitude of the N1m was correlated with the pitch strength, estimated on the basis of auditory nerve activity. This finding is consistent with the present results.
Figure 9 shows the relationship between φ1 of the IRN and the N1m amplitude. A larger φ1 value produced a larger N1m response, with a correlation coefficient of 0.76 (p < 0.05). However, we found another factor that appears to influence N1m amplitude. To calculate the effects of each ACF factor on AEF responses, we conducted multiple regression analyses with the N1m amplitude as the outcome variable. We used a linear combination of φ1, τ1, and τe as predictive variables in a stepwise fashion. The final model indicated that φ1 and τ1 were significant factors:

$$\text{N1m amplitude} \approx a_1 \varphi_1 + a_2 \tau_1 + b_1 \qquad (3)$$

where a1 and a2 are regression coefficients and b1 is a constant term.
The model was statistically significant (p < 0.01), and the correlation coefficient between the measured and predicted values was 0.88. The standardized partial regression coefficients of the variables a1 and a2 in Eq. (3) were 0.77 and 0.44, respectively. These results indicate that both the ACF factors φ1 and τ1 had significant effects on N1m responses, although φ1 had a stronger effect.
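Standardized partial regression coefficients of the kind reported above can be obtained by z-scoring the predictors and the outcome before a least-squares fit. The sketch below uses synthetic data generated purely for illustration; the numbers are not the study's measurements, and the stepwise variable selection is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors and outcome (illustrative values only).
phi1 = rng.uniform(0.2, 1.0, n)
tau1 = rng.choice([2.0, 4.0], n)              # delay in ms
amplitude = 30.0 * phi1 + 3.0 * tau1 + rng.normal(0.0, 2.0, n)

def zscore(v):
    """Rescale to zero mean and unit standard deviation."""
    return (v - v.mean()) / v.std()

X = np.column_stack([zscore(phi1), zscore(tau1)])
y = zscore(amplitude)

# With z-scored variables the intercept is zero, so no constant column is needed.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # standardized coefficients
predicted = X @ beta
r = float(np.corrcoef(predicted, y)[0, 1])    # measured vs. predicted
```

Because all variables share the same scale after z-scoring, the relative sizes of the entries of `beta` indicate which predictor has the stronger effect.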
2.2. AEFs in relation to BPN
To evaluate responses related to φ1 in the auditory cortex, we also recorded the AEFs elicited by BPN with different bandwidths. Eight normal-hearing listeners (22–28 years; all right-handed) took part in the experiment. We produced BPN by repeated digital filtering of 10 s white noise signals. We shaped the magnitudes of the Fourier coefficients to give a cut-off slope of 200 dB/octave outside the desired bandwidth. For stimuli with a center frequency of 500 or 1000 Hz, the stimulus bandwidth was set at 1, 40, 80, 160 or 320 Hz. For stimuli with a center frequency of 2000 Hz, the stimulus bandwidth was set at 1, 40, 80, 160, 320 or 640 Hz. The maximum bandwidth was wider than the critical bandwidth for each center frequency. Each stimulus was a 0.5 s segment taken from the 10 s BPN signal, with rise and fall ramps of 10 ms. The sounds were D/A converted with a 16-bit sound card and a sampling rate of 48 kHz. They were presented at an SPL of 74 dB through insert earphones inserted into both the left and right ear canals. Figure 10 shows the temporal waveforms of the stimuli with a center frequency of 1000 Hz. As the bandwidth of the BPN increases, fluctuations in the envelope of the BPN waveform decrease. The ACF can characterize the BPN: τ1 corresponds to the period of the center frequency of the BPN, and the φ1 value increases as the filter bandwidth decreases.
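The dependence of φ1 on bandwidth can be checked numerically. This sketch makes several simplifying assumptions: the noise is filtered with an ideal brick-wall filter in the frequency domain rather than a 200 dB/octave slope, and the NACF is computed as a circular autocorrelation from the power spectrum. The function names are hypothetical.

```python
import numpy as np

def bandpass_noise(fs, duration, fc, bandwidth, rng):
    """White noise with all Fourier coefficients outside the band set to zero."""
    n = int(fs * duration)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    spectrum[np.abs(freqs - fc) > bandwidth / 2] = 0.0
    return np.fft.irfft(spectrum, n)

def first_peak(signal, fs, fc):
    """Return (tau1, phi1): the NACF peak nearest to the period of fc."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    acf = np.fft.irfft(power, len(signal))    # circular ACF via the power spectrum
    phi = acf / acf[0]
    lo, hi = int(0.5 * fs / fc), int(1.5 * fs / fc)
    lag = lo + int(np.argmax(phi[lo:hi]))
    return lag / fs, float(phi[lag])

fs, fc = 48000, 1000
rng = np.random.default_rng(0)
results = {
    bw: first_peak(bandpass_noise(fs, 10.0, fc, bw, rng), fs, fc)
    for bw in (40, 160, 320)
}
```

In each case τ1 is close to 1 ms, the period of the 1000 Hz center frequency, and φ1 falls as the bandwidth widens.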
We recorded and analyzed the AEFs using methods similar to those of the IRN experiment described in Section 2.1. The temporal waveforms of AEFs from 122 channels showed clear N1m responses in both the left and right temporal areas in all listeners. Figure 11 depicts the mean N1m amplitude across eight listeners as a function of the BPN bandwidth. A narrower BPN bandwidth produced a larger N1m amplitude, that is, the larger the φ1 value, the larger the N1m response. This result is consistent with the results of the IRN experiment.
Figure 12 shows the relationship between φ1 of the BPN and the N1m amplitude. A larger φ1 produced a larger N1m response. The correlation coefficient was 0.65 (p < 0.05). However, we identified another factor that influences N1m amplitude. To calculate the effects of each ACF factor on AEF response, we conducted multiple regression analyses with the N1m amplitude as the outcome variable. We used a linear combination of φ1, τ1, and τe as predictive variables in a stepwise fashion. The final model indicated that φ1 and τe were significant factors:

$$\text{N1m amplitude} \approx a_3 \varphi_1 + a_4 \tau_e + b_2 \qquad (4)$$

where a3 and a4 are regression coefficients and b2 is a constant term.
The model was statistically significant (p < 0.01), and the correlation coefficient between the measured and predicted values was 0.78. The standardized partial regression coefficients of the variables a3 and a4 in Eq. (4) were 0.52 and 0.45, respectively. The results indicated that the ACF factors φ1 and τe had significant effects on N1m responses.
3. Loudness and annoyance in relation to the effective duration of the ACF, τe
3.1. Loudness in relation to IRN
Previous investigations of the relationship between loudness and the BPN bandwidth have concluded that for sounds with the same SPL, loudness remains constant as bandwidth increases, up until the point at which the bandwidth reaches a critical band. For bandwidths larger than the critical band, loudness increases with bandwidth. However, the loudness of a sharply filtered BPN increases with the effective duration of the ACF, i.e., τe, even when the bandwidth of the BPN is within the critical band. The τe value represents the repetitive components within the signal itself and increases as the BPN bandwidth decreases. However, the envelope and SPL also vary with the BPN bandwidth. This variation of the envelope and SPL might therefore affect the loudness of a BPN signal [27, 28]. To eliminate the effects of these factors, we investigated the effects of τe on loudness using IRN. The envelope and SPL variation of the IRN are much smaller than those of the BPN.
We produced IRN by applying a delay-and-add algorithm to the BPN that was filtered from white noise using the fourth-order Butterworth filters ranging between 100 and 3500 Hz. The number of iterations of the delay-and-add process was set at 2, 4, 8, 16, and 32. The delay values were set at 0.5, 1, 2, 4, 8, and 16 ms, corresponding to pitches of 2000, 1000, 500, 250, 125, and 62.5 Hz, respectively. The duration of the stimuli was 0.5 s and the rise and fall ramps were 10 ms. The sounds were D/A converted with a 16-bit sound card and sampling rate of 48 kHz. The sounds were presented at a SPL of 60 dB through insert earphones inserted into the left and right ear canals. Figure 13 shows the τe and φ1 values of the IRN used in the experiment.
Ten listeners (aged 21−37 years) with normal hearing took part in the experiment. We obtained loudness matches using a two-interval, adaptive forced-choice procedure converging on the point of subjective equality (PSE) following a simple 1-up, 1-down rule. The experiment took place in a soundproof room. In each trial, the fixed (test) and variable (reference) sounds were presented in randomized order with equal probability at an interval of 500 ms. The test sound was an IRN and the reference sound was a 1-kHz pure tone. The listener was asked to indicate which sound they perceived as louder by pressing a key on a keyboard. For each adaptive track, the overall level of the test sound was fixed at 60 dB SPL, and the starting level of the reference sound was 50 dB SPL. The level of the reference sound was controlled with an adaptive procedure: when the listener judged the reference sound to be louder than the test sound, the SPL of the reference sound was lowered by a given amount, and when the listener judged the test sound to be louder than the reference sound, the SPL of the reference sound was increased by that same amount.
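A simple 1-up, 1-down track on the reference level can be sketched as below. The 2 dB step size, the stopping rule of six reversals, the PSE estimate as the mean of the reversal levels, and the idealized noise-free listener are all assumptions of this example; the text does not specify them.

```python
def one_up_one_down(reference_louder, start_level, step_db, n_reversals):
    """Track the reference level; return the mean level at the reversals."""
    level = start_level
    last_direction = 0
    reversals = []
    while len(reversals) < n_reversals:
        # Lower the reference if it was judged louder, otherwise raise it.
        direction = -1 if reference_louder(level) else +1
        if last_direction and direction != last_direction:
            reversals.append(level)           # the track changed direction here
        last_direction = direction
        level += direction * step_db
    return sum(reversals) / len(reversals)

# Idealized listener whose point of subjective equality is 63 dB SPL (hypothetical).
true_pse = 63.0
pse_estimate = one_up_one_down(
    reference_louder=lambda level: level > true_pse,
    start_level=50.0,
    step_db=2.0,
    n_reversals=6,
)
```

With this deterministic listener the track climbs from 50 dB and then alternates between 62 and 64 dB, so the reversal mean converges on the 63 dB point of subjective equality. A real listener responds probabilistically near the PSE, so the track fluctuates around it rather than alternating exactly.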
Figure 14 shows the PSE for loudness as a function of τe and φ1 of the IRN. φ1 was not correlated with the perceived loudness. When τe was between 10 and 100 ms, the perceived loudness increased with τe, confirming that loudness is influenced by the repetitive components of sounds in this τe range. The increase in loudness for τe values between 10 and 100 ms was approximately 5 dB.
When τe was less than 5 ms, the loudness of the IRN increased with decreasing τe; for these stimuli, the bandwidth of the IRN was larger than the critical bandwidth. These tendencies may explain the basis of the critical band effect, such that loudness remains constant as long as the bandwidth of the noise is narrower than the critical band, then increases with increasing bandwidth beyond the critical band. Loudness models are able to predict these tendencies [31, 32].
The loudness model introduced previously [31, 32] was unable to predict loudness when the delays were 2 and 4 ms, i.e., for stimuli with pitches of 500 and 250 Hz, respectively. Loudness increases caused by a tonal component are predictable according to τe in a certain range. Previous studies have indicated that the τe values of various noise sources, such as airplanes, trains, motor bikes, and flushing toilets, are within the range of 1–200 ms. This suggests that τe is a useful criterion for measuring the loudness of various sounds. Thus, this value is likely helpful for the identification of sound sources.
3.2. Annoyance in relation to BPN
Annoyance is one of the most commonly studied features of environmental noise. Basically, psychoacoustic annoyance depends on loudness and other factors such as timbre and the temporal structure of sounds. Loudness and annoyance have been distinguished previously: annoyance is the reaction of an individual to noise within the context of a given situation, while loudness is directly related to SPL. To evaluate whether annoyance is related to the effective duration of the ACF, i.e., τe, we examined the annoyance elicited by a pure tone and BPN stimuli with different bandwidths.
We used pure tone and BPN signals with center frequencies of 1000 and 2000 Hz as auditory signals. We used a maximum length bandpass filtered sequence signal (order 21; sampling frequency, 44,100 Hz) as the basic stimulus. To control the ACF of the BPN, we varied the filter bandwidth at 0, 40, 80, 160, and 320 Hz using a cut-off slope of 2068 dB/octave. The sounds were D/A converted with a 16-bit sound card and sampling rate of 48 kHz. The sounds were presented to both the left and right ears at an SPL of 74 dBA using headphones (Sennheiser HD-340). Figure 15 shows τe of the stimuli used in the experiment.
Eight listeners aged 21−23 years with normal hearing took part in the experiment. We performed paired-comparison tests for all combinations of the pairs of the pure tone and BPN stimuli. The duration of the stimuli was 2.0 s, the rise and fall times were 50 ms, the silent interval between the stimuli was 1.0 s, and the interval between the pairs was 3.0 s, which was the time during which the listeners were expected to make a response. They were asked to judge which of the two sound signals was more annoying. We calculated the scale values of the annoyance rated by each listener according to Case V of Thurstone’s theory.
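Scale values under Case V of Thurstone's theory can be computed from a matrix of choice proportions by converting each proportion to a normal deviate and averaging across comparisons. The sketch below is a generic illustration with a hypothetical proportion matrix; the clipping bounds of 0.01 and 0.99 for unanimous judgments are an assumption, and the study's exact procedure may differ.

```python
from statistics import NormalDist

def thurstone_case_v(proportions):
    """proportions[i][j]: fraction of comparisons in which stimulus i was
    judged more annoying than stimulus j. Returns one scale value per stimulus."""
    normal = NormalDist()
    n = len(proportions)
    scale = []
    for i in range(n):
        z_row = []
        for j in range(n):
            if i == j:
                z_row.append(0.0)             # a stimulus is never compared with itself
                continue
            p = min(max(proportions[i][j], 0.01), 0.99)   # avoid infinite deviates
            z_row.append(normal.inv_cdf(p))
        scale.append(sum(z_row) / n)
    return scale

# Hypothetical proportions generated from known scale values, to check the method.
true_scale = [0.0, 0.5, 1.0]
normal = NormalDist()
proportions = [[normal.cdf(a - b) for b in true_scale] for a in true_scale]
estimated = thurstone_case_v(proportions)
```

The recovered values equal the known scale values up to a constant shift (here, −0.5, 0, and 0.5), which reflects the fact that Case V scale values are defined on an interval scale with an arbitrary origin.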
The relationship between the scale values of annoyance and τe is shown in Figure 16. The averaged scale values of annoyance increased as τe increased within the critical band for both center frequencies of 1000 and 2000 Hz. The τe value represents the repetitive feature or tonal component of the auditory signals. Previous research suggests that tonal components increase the perceived annoyance and noisiness of broadband noise [35, 40, 41]. This is consistent with the present results. However, two of the eight listeners reported the least annoyance for pure tone stimuli and rated the BPN stimuli with the widest bandwidth and a center frequency of 2000 Hz as the most annoying. That is, for these listeners, annoyance increased as τe decreased. This could indicate that the effects of τe on annoyance are subject to individual variation.
4. Concluding remarks
In this study, we investigated the effects of ACF factors on physiological and psychological responses. We found that the ACF factors φ1, τ1, and τe had significant effects on the N1m response, suggesting that ACF factors are used as cues in the auditory cortex. We also found that the ACF factor τe influenced loudness and annoyance, suggesting that ACF factors are used as cues for perception. These results indicate that the human auditory system has an autocorrelation-like mechanism.
This work was supported by Grants-in-Aid for Scientific Research (B) (Grant No. 15H02771) from the Japan Society for the Promotion of Science.