Sensory Augmentation through Tissue Conduction

One hundred volunteers have undergone short (5 min) listening tests in a novel multitransducer bone-and-tissue conduction apparatus for spatial audio. The subjects subsequently described their experiences in an unstructured qualitative elicitation exercise. Their responses were aggregated to identify key themes and differences. Emergent themes are: enjoyable, informative, spatial and strange. Tactile supplementation of spatial audio display was noted in a positive light. We note that some spatial attributes are more perceptible than others. The implications for perceptual augmentation are discussed, particularly in relation to conductive hearing deficits. We conclude that the technique has potential for development and discuss future research directions.


Introduction
Hearing impairment is a sensory deprivation that constrains the information bandwidth available to the individual. Consequences include poor speech discernment (especially in environments with high background noise), poor auditory spatial performance and lack of pleasurable access to music. Hearing impairment can be due to sensorineural or conductive inadequacies, or both. Amelioration strategies include assistive technologies to augment individuals' residual sensory capacities (for example, hearing aids) or to substitute alternative information pathways where one stage of auditory processing is defunct (for example, cochlear implants). Performance is generally better for communication problems than for spatial and pleasurable listening problems. There is some evidence to indicate that age-related hearing deficits may play some causal role in the onset and progression of dementia, in part due to social disengagement because of increasing difficulty in disambiguating complex auditory scenes, and in a feedback effect, because neural pathways that receive little stimulus become less efficient [1].
Given that prevalence of hearing loss doubles with each age-decade [2], hearing rehabilitation techniques may be expected to become increasingly important as the average life-expectancy increases. There is a quality-of-life (QoL) issue here; as the auditory information-channel gradually falls into disuse, access to entertainment and intellectual stimulus, in the forms of conversation, music listening and television, becomes scarce.
There is a general trend towards heightened spatial competence in artificial audio, offering increased involvement and informativeness. We are investigating whether bone and tissue conduction techniques can be developed to provide increased enjoyment and informativeness through extended spatial impressions. We have developed a prototype 5-transducer vibrotactile tissue conduction system to display multichannel spatial sound recordings; our objective is to identify, and subsequently parameterise, the qualia available in this context. One hundred short listening demonstrations were conducted, and responses were aggregated and examined for frequency of occurrence of adjectives and synonyms; these formed the basis of an initial set of themes. The cohort was then re-analysed to identify co-occurring themes.

Tissue conduction of sound
Techniques for utilising vibration to produce auditory percepts have been known for centuries. In the 16th century, Girolamo Capivaccio struck an iron rod held against the teeth to assess ear pathology; Ludwig van Beethoven used a wooden rod, with one end held between his teeth and the other resting on the piano, and so was able to continue his work even though considered profoundly deaf. The late 19th century saw the development of early bone conduction devices: the Fonifero in 1876 by Giovanni Paledino and the Audiphone in 1879 by Richard Rhodes used mechanical transduction of sound to assist hearing.
These are putatively termed 'bone conduction' techniques, though this is a slight misnomer, as skin, fluid and the soft-tissue contents of the cranium also contribute transmission pathways to various extents. We prefer the term 'tissue conduction' (TC) as a more comprehensive description. There is general agreement that multiple transmission pathways are in use. Several sources record the pathways to be frequency dependent: inertial forces act on the ossicular structure and cochlear fluids at low frequencies relative to skull vibration, while high frequencies cause distortion of the temporal bone and cochlear shell. Sound generated in the occluded ear canal via osseo-tympanic transmission provides increased sensitivity at low frequencies; contents of the skull and fluid pathways induce sensitivity at high frequencies [3][4][5][6]. The prominence of each pathway's contribution remains in question; however, the resultant wave motion in the basilar membrane, as a summed contribution of all pathways, appears to be the same as for air-conducted (AC) sound; cancellation experiments between AC, bone conduction (BC) and TC show this to be the case [3,7,8].
Cochlear stimulation through TC is commonly elicited using a vibrotactile transducer in contact with the skull; various body locations have also been shown to be effective [6]. Monaural and binaural presentation in contact with the mastoid, condyle or forehead have provided common TC stimulation sites; many additional locations on the skull have featured in research, all using single or dual transducer presentation [3,5,9]. Until recently, TC conveyed audio signal but not spatial information, and so the experience was not equivalent (in this respect) to real-world hearing. Latterly, researchers have shown that a considerable degree of lateralisation, in some cases approaching that of normal binaural hearing, is feasible [10][11][12]. Nevertheless, the results lack equivalence in terms of overall spatial performance, significantly lacking spatial attributes such as externalisation, spaciousness, range perception and elevation perception.

Equipment
The prototype array uses five BCT-1 8 Ω 90 dB 1 W/1 m tactile transducers held in a tensioned framework that exerts contact force on the skull through a hemispherical plastic medium on each transducer. For reference, the transducer locations are numbered left to right: 1-left mastoid, 2-left temporal region above the zygomatic arch, 3-forehead, 4-right temporal region above the zygomatic arch, and 5-right mastoid. The principal design considerations were transducer location, contact force and surface area that would work with considerable variations in head size and shape. Signal sets were processed using Reaper DAW on a Mac, interfaced through a Focusrite PRO 26i/o and sent discretely to each transducer through individual 1 W amplifiers; the array has a frequency range of 200 Hz-16 kHz. A set of banded-style 3M ear plugs was available for listeners to use, so that they could compare the experience with the plugs in vs. out (Figure 1).

Signals
Signals were processed using Reaper DAW and spatially encoded using WigWare 1st order ambisonic panning; FX Plugins were used to construct early and late reflections and then decoded through a WigWare 1st order periphonic ambisonic decoder patched to the transducer array.
A 1st order ambisonic recording of a country park captured using a Soundfield™ microphone provided the ambient background; stereo recordings of bird sounds, voices, a steam train and music alongside mono FX clips were used to create the soundscape in which 1st order ambisonic recordings of a motorbike and aeroplane were placed.
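The first-order encode and decode stages described above can be illustrated with a minimal sketch. It is not the WigWare implementation: the B-format convention (W scaled by 1/√2), the simple projection decoder, and in particular the nominal transducer directions in `TRANSDUCER_DIRECTIONS` are assumptions introduced here for illustration only.

```python
import math

# Hypothetical nominal directions (azimuth, elevation) in degrees for the five
# transducers; positive azimuth is to the listener's left. These angles are
# assumptions for illustration, not measured positions from the prototype.
TRANSDUCER_DIRECTIONS = {
    1: (110.0, 0.0),   # left mastoid
    2: (70.0, 0.0),    # left temporal region
    3: (0.0, 0.0),     # forehead
    4: (-70.0, 0.0),   # right temporal region
    5: (-110.0, 0.0),  # right mastoid
}


def unit_vector(azimuth_deg, elevation_deg):
    """Cartesian unit vector (x front, y left, z up) for a direction."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(az) * math.cos(el), math.sin(az) * math.cos(el), math.sin(el))


def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode a mono sample to B-format (W, X, Y, Z), with W scaled by 1/sqrt(2)."""
    ux, uy, uz = unit_vector(azimuth_deg, elevation_deg)
    return (sample / math.sqrt(2.0), sample * ux, sample * uy, sample * uz)


def decode_projection(bformat, directions=TRANSDUCER_DIRECTIONS):
    """Projection decode: each feed is a cardioid pointed at its transducer direction."""
    w, x, y, z = bformat
    feeds = {}
    for channel, (az, el) in directions.items():
        vx, vy, vz = unit_vector(az, el)
        feeds[channel] = 0.5 * (math.sqrt(2.0) * w + x * vx + y * vy + z * vz)
    return feeds


# A unit-amplitude source placed straight ahead.
feeds = decode_projection(encode_first_order(1.0, azimuth_deg=0.0))
for channel in sorted(feeds):
    print(channel, round(feeds[channel], 3))
```

Under these assumptions, a frontal source drives the forehead transducer most strongly, with symmetric and progressively smaller feeds to the temporal and mastoid positions. A decoder matched to an irregular array such as this one would likely need different weights.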

Method
In this study, we used 100 naïve (i.e. inexperienced in TC listening) and untutored (i.e. given no instructions on target attributes) listeners, who were invited to offer observations and comments on the experience; none of the 100 listeners reported previous experience of tissue-conducted sound. 24 female and 76 male participants took part; each was asked to record their age, sex, occupation and whether or not they were a musician alongside their comments on the experience. The 24 female participants were aged 16-61 years and comprised 14 musicians and 10 non-musicians; the 76 male participants were aged 16-66 years and comprised 48 musicians and 28 non-musicians. Occupations were recorded for use in future analysis. For discussions of elicitation problems and techniques, see [13][14][15][16]. This open-ended approach does not presuppose noteworthy attributes but is used to elicit them.
The listening tests took place across three days under non-ideal conditions at PLASA London, where, as part of the Exploratorium exhibit, we shared the space with four other exhibitors. The Exploratorium was located on the upper level of the large exhibition hall; a large footfall and other exhibitors using amplified sound produced a considerable noise floor (see Section 5.1).
Participants were self-selecting; when any interest was shown, they were invited to take part in the listening test before any discussion could take place. Once the participant was seated, the headset was placed on the participant's head and a short piece of music was played while they were shown how to increase the overall amplitude to a comfortable level; banded ear plugs were cleaned and given to the participant to use at their discretion. Each audition lasted five minutes, and the volunteers were invited to record brief details and their observations on prepared forms immediately afterward. The method of recording responses proved to be suboptimal, as many volunteers went on to describe the experience in greater detail verbally during post-test discussion than they had on paper.

Responses and analysis
After the auditions, the data were collated into a spreadsheet for analysis; the transcribed responses were examined for frequently occurring descriptive terms and related synonyms. This resulted in a collection of themes. The dominant theme was 'positive', at 78%, and this was then correlated with other descriptive themes to elicit accompanying qualia that might contribute to the overall impression of 'positive'.
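The aggregation step can be sketched as follows, purely as an illustration of grouping descriptive terms and synonyms into themes and reporting the percentage of responses in which each theme occurs. The theme lexicon `THEME_SYNONYMS` and the example responses are hypothetical and invented for this sketch; they are not the categories or data used in the study.

```python
import re
from collections import Counter

# Hypothetical theme lexicon: each theme maps to descriptive terms treated as
# synonyms. An assumption for illustration, not the study's coding scheme.
THEME_SYNONYMS = {
    "positive": {"enjoyable", "amazing", "great", "impressive"},
    "spatial": {"spatial", "surround", "immersive", "spacious"},
    "strange": {"weird", "strange", "odd"},
    "vibration": {"vibration", "vibrations", "tactile"},
}


def themes_in(response):
    """Return the set of themes whose synonyms appear in one written response."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    return {theme for theme, synonyms in THEME_SYNONYMS.items() if words & synonyms}


def theme_frequencies(responses):
    """Percentage of responses in which each theme occurs at least once."""
    total = len(responses)
    if total == 0:
        return {}
    counts = Counter()
    for response in responses:
        counts.update(themes_in(response))
    return {theme: 100.0 * n / total for theme, n in counts.items()}


# Invented example responses, not data from the study.
example_responses = [
    "Weird but enjoyable, very immersive",
    "Great clarity, could feel the vibrations",
    "Strange sensation",
    "Amazing surround effect",
]
print(theme_frequencies(example_responses))
```

Each response counts at most once per theme, so a respondent who uses several synonyms for the same theme does not inflate its frequency.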

Co-occurring themes
Themes were cross-correlated to elicit what impressions might contribute to overall positivity (or not) of the experience. So, for instance, 'interesting' mapped significantly to 'positive'; Figure 3 shows themes mapped against 'positive', and Figure 4 shows themes mapped against 'vibrations' and 'positive' combined, as this forms an area of future interest.
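One way such a cross-mapping could be computed is sketched below. The text does not state which statistic was used, so both measures here (the share of target-theme responses that also carry another theme, and the phi coefficient for two binary themes) are assumptions, and the tagged responses are invented for illustration.

```python
import math


def cooccurrence_with(target, tagged):
    """For each other theme, return (count, percentage) among responses tagged with `target`.

    `tagged` is a list of sets, one set of theme labels per respondent.
    """
    with_target = [themes for themes in tagged if target in themes]
    others = set().union(*tagged) - {target}
    result = {}
    for theme in sorted(others):
        n = sum(1 for themes in with_target if theme in themes)
        share = 100.0 * n / len(with_target) if with_target else 0.0
        result[theme] = (n, share)
    return result


def phi_coefficient(a, b, tagged):
    """Phi (binary correlation) between presence of theme `a` and theme `b`."""
    n11 = sum(1 for t in tagged if a in t and b in t)
    n10 = sum(1 for t in tagged if a in t and b not in t)
    n01 = sum(1 for t in tagged if a not in t and b in t)
    n00 = len(tagged) - n11 - n10 - n01
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Invented tagged responses, not data from the study.
tagged_responses = [
    {"positive", "spatial"},
    {"positive", "interesting"},
    {"strange"},
    {"positive", "spatial", "vibration"},
]
print(cooccurrence_with("positive", tagged_responses))
print(phi_coefficient("positive", "spatial", tagged_responses))
```

The percentage view corresponds to plotting themes against a single target theme, while phi also accounts for responses in which neither theme appears.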

Discussion
The high incidence of 'positive', 'spatial' and 'interesting' descriptors indicates that the technique is worth investigating further, while the significant correlations between 'spatial' and other categories suggest that the potential informativeness may extend beyond that of single- or twin-transducer displays. Some listeners (19%) specifically commented in terms of 'clarity', which was interesting in the context of the background noise in the listening venue.
It is plausible that the experience is unfamiliar and not completely understandable in the short time frame; 58% (3% overlap) of listeners' comments contain references to 'weird' or 'interesting'. Another, compatible explanation is that this constitutes a different kind of experience, an artificial exaptation [17].
In respect of reports of spatial impressions, we are currently reviewing the question of whether we are actually presenting signals that are physically equivalent (to the spatial signal set one would normally apprehend via air conduction), or whether we are presenting something which achieves a degree of perceptual equivalence through more abstract relationships. In the first case, what would be implied is that we are, inadvertently, providing equivalents to those aspects of the head-related transfer functions (HRTFs) that would bear strong relationships with perceptions of externalisation and elevation. Because of the multiplicity of signal paths inside the cranium, which vary with frequency and transducer locations, differences in frequency component arrival times within grouped and segregated sensory data [3][4][5][6][18] may induce interaural time difference fluctuations and hence elicit some sense of spacious envelopment [19]. We are not yet able to model the complex signal arriving at the cochleae, and in any event, it is exceedingly unlikely that we are precisely mimicking the individualised HRTFs of all listeners.
If, on the other hand, more abstract perceptual equivalence is in evidence, it is puzzling that spatial impressions are elicited in such short exposures. Informal listening tests of prolonged and repeated exposures do seem to indicate that spatial judgements improve; small sample size and uncontrolled test circumstances constrain conclusions.
An intriguing alternative possibility is that we are actually generating non-audible cues that are perceptually interpreted as auditory spatial cues. A significant (and unanticipated) contributor to the experience is that of vibration, which appears in a neutral or positive context in 24% of comments, and is positively associated with comments on spatial impression in 12%. The vibrations are due to listening-circumstance inadequacies: the ambient noise floor was high, and the transducers have limited dynamic range and modest frequency range. We assumed that vibration would be strongly associated with negative terms, but this proves to be the case in only 3% of responses. The tentative inference is that coherent (i.e. covariant with modulated auditory input) tactile input is potentially perceptually assimilable; this would exemplify multimodal perception.
Multimodality of perception has received increasing interest in the last four decades. The ubiquity of multisensory neurons (that can receive inputs from two or more sensory domains) in the brain indicates that multisensory integration is not limited to 'higher' cognitive processes but can occur at more fundamental levels. For instance, neurons in the primary visual cortex receive inputs from the primary auditory cortex [20]. Multimodality can be discussed in terms of cross-modal effects (where stimulus input to one modality alters the perceptual conclusions in another); for example, the motion-bounce illusion [21] and the McGurk effect [22]. It can also be discussed in terms of super-additive effects, where application of concurrent multimodal stimuli produces disproportionately more robust perceptual conclusions than for unimodal stimuli [23]. For discussion of multisensory interplay, see [24].
The key observation in the question of unimodal and multimodal perception is that, while brain-region specialisation is well documented and unimodal perception is known to occur, perception in one modality can be affected by input to another and further, if stimuli to one modality are impoverished, input via another can be effectively cognitively utilised [25].

Limitations
The transducers have a restricted frequency response of 200 Hz to 16 kHz, and component matching is problematic.
The generic headset could not be calibrated for consistency of transducer location and contact force for each individual in such a large cohort, leading to inconsistencies in the sensory experience across the cohort.
High ambient noise levels in the listening area probably interfered with subtlety of detail in the programme material.
Although listeners were unsolicited, some arguably had prior knowledge and possible expectations arising from previous listeners' comments. The 'interesting' category should be considered with caution, since, by definition, volunteers were interested enough to come forward.
Respondents were generally more fluent in their verbal descriptions than their written responses; only the written responses were recorded for analysis.
Variations in descriptions were noted; for instance, while some commented on a startling degree of clarity, others observed the opposite. Such variation may stem from variations in 'degree of fit' of the prototype apparatus, variations in physiology (skull thickness, for instance) or variations in biomechanical and/or neurological auditory processing.
An important limitation is that we did not categorise responses in terms of degree of emphasis, due to intrinsic uncertainties of use of language; for example, 'very spacious' and 'spacious' were categorised similarly.

Conclusions and further work
The initial qualitative investigation indicates that the use of multiple transducers, decoded so as to display spatial and musical information, is worth further exploration. The areas of interest are: externalisation (i.e. not 'in the head'), control of spaciousness, range perception and tactile augmentation.
In terms of informational bandwidth, there seem to be several justifications for using multiple transducers: dynamic range and frequency response of the apparatus are improved simply because more moving mass (in the transducers) and power are deployed. A general improvement in sense of spatiality is indicated, though the spatial impressions evinced are not precisely the same as for real environments, or other artificial means of depicting spatial sound. This is uncontroversial, since spatial qualia for headphone, in-earphone and loudspeaker presentations also differ. Headphones can give impressions of 'in the head' intimate sound fields but are correspondingly poor in producing the impression of externalisation and range perception, while the reverse is true for loudspeaker presentations. The tissue-conducted sound fields do, reportedly, convey some sense of spaciousness, envelopment or immersiveness, indicating externalisation, though it is unclear whether range perception can be coherently controlled. Directional localisation of sources appears to be imprecise; while some remarks suggest impressions of elevation, this requires more precise investigation.
The question of whether (and how) tactile stimuli interact with auditory stimuli requires elucidation, as this has valuable implications for normal and hearing-impaired listeners; auditory spatial perception might actually be augmented with coherent tactile input. There is evidence that tactile stimuli can affect auditory conclusions [26,27] and visual perceptions [28]. To investigate this, we shall have to improve the transduction of audio signals to an extent where spurious vibrations are minimised, while additionally utilising transducers specifically to manage tactile input.
In the present context, multimodality has this implication: in suboptimal conditions such as hearing deficits, background noise and display limitations, it might be possible to utilise cross-modal and super-additive effects to enhance auditory perception. Enhancements could include improved source segregation and intelligibility, along with more holistic qualia such as immersiveness and enjoyability.