Psychoacoustics

Psychoacoustics is the scientific study of sound perception, specifically the intersection between the physical properties of acoustic stimuli (such as amplitude, frequency, and temporal structure) and the subjective auditory experiences elicited in listeners [2]. It bridges the objective measurements of acoustics with the subjective realm of auditory perception, often relying on psychophysical methodologies to quantify the relationship between stimulus and sensation. A central tenet of the field is that human perception is not a linear recording of physical reality but rather an active, interpretive process governed by internal physiological constraints and adaptive biases.

Auditory Thresholds and Sensitivity

The primary focus of early psychoacoustic research involved mapping the limits of human hearing. The absolute threshold of hearing (ATH) defines the minimum sound pressure level (SPL) required for a listener to detect a pure tone $50\%$ of the time. This threshold is highly dependent on frequency, exhibiting a characteristic U-shaped curve in the audiogram.

Sensitivity is poorest near the extremes of the audible range, falling off sharply as frequency approaches $20 \text{ Hz}$ at the low end or exceeds $15,000 \text{ Hz}$ at the high end. Maximum sensitivity occurs near $3,000 \text{ Hz}$ to $4,000 \text{ Hz}$, which corresponds to the resonance of the human ear canal.

Detection thresholds rise in the presence of ambient noise, leading to the concept of the masked threshold, or Threshold in Noise (TIN). Furthermore, sensitivity is often modulated by the listener’s recent auditory history, a phenomenon known as Auditory Fatigue Dissonance (AFD) [3].

$$ \text{ATH}(\text{dB SPL}) = 20 \log_{10} \left( \frac{P_{\text{min}}}{P_{\text{ref}}} \right) $$

Where $P_{\text{ref}}$ is conventionally $20 \ \mu\text{Pa}$ in the pressure domain.
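
The pressure-to-decibel conversion above can be checked with a few lines of Python; this is a minimal sketch, and the example pressures are hypothetical values chosen only to show the arithmetic.

```python
import math

P_REF = 20e-6  # reference pressure: 20 micropascals, in Pa


def pressure_to_db_spl(p_pascals: float) -> float:
    """Convert an RMS sound pressure in pascals to dB SPL."""
    return 20 * math.log10(p_pascals / P_REF)


# Hypothetical examples: a pressure equal to the reference (20 uPa) maps to
# 0 dB SPL, and a pressure ten times larger maps to +20 dB SPL.
print(pressure_to_db_spl(20e-6))   # 0.0
print(pressure_to_db_spl(200e-6))  # 20.0
```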

Loudness Perception and the Phon Scale

Loudness is the subjective correlate of sound intensity, but it does not grow linearly with physical intensity. The Phon is the unit used to quantify loudness level. By definition, a $1,000 \text{ Hz}$ tone at $40 \text{ dB SPL}$ is assigned a loudness level of $40 \text{ phons}$. For other frequencies, the number of phons assigned to a stimulus is numerically equal to the SPL in decibels of a $1,000 \text{ Hz}$ tone that the listener judges to be equally loud [2].

The relationship between phon and physical intensity ($\text{dB SPL}$) is codified by the Isoloudness Contours (or equal-loudness contours). These contours demonstrate that lower frequencies require significantly higher physical energy to achieve the same perceived loudness as midrange frequencies.
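
The matching procedure behind the phon scale can be sketched as follows. The contour values below are hypothetical placeholders standing in for measured equal-loudness data; only the $1,000 \text{ Hz}$ reference point follows directly from the definition above.

```python
# Sketch of the phon definition: a tone's loudness level in phons equals the
# dB SPL of the 1,000 Hz tone that listeners judge to be equally loud.
# The dictionary below is a HYPOTHETICAL 40-phon equal-loudness contour
# ({frequency_hz: dB SPL needed to sound as loud as 1 kHz at 40 dB SPL});
# real values come from listening experiments.
hypothetical_40_phon_contour = {
    100: 62.0,    # low frequencies need much more SPL to sound equally loud
    1000: 40.0,   # reference: 40 dB SPL at 1 kHz is 40 phons by definition
    4000: 36.0,   # near the ear-canal resonance slightly less SPL suffices
}


def loudness_level_phons(matched_1khz_spl_db: float) -> float:
    """Loudness level in phons is numerically equal to the SPL (dB) of the
    equally loud 1,000 Hz comparison tone."""
    return matched_1khz_spl_db


# A 100 Hz tone presented at the contour value of 62 dB SPL is matched to a
# 1 kHz tone at 40 dB SPL, so its loudness level is 40 phons.
print(loudness_level_phons(40.0))  # 40.0
```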

A more perceptually relevant measure is the Sone, which relates to perceived magnitude rather than level. A sound rated at $2 \text{ sones}$ is judged to be twice as loud as a sound rated at $1 \text{ sone}$ (defined as the loudness of a $40 \text{-phon}$ tone). At moderate levels this scale follows Stevens' power law, which states that perceived loudness grows as a power function of physical intensity:

$$ \text{Loudness} \propto (\text{Intensity})^k $$

Where $k \approx 0.3$ in the medium SPL range, but empirically deviates toward $k \approx 0.6$ for transient acoustic events occurring below $200 \text{ Hz}$ [4].
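
The arithmetic behind this exponent can be sketched briefly; the value $k = 0.3$ is taken from the text above, and the helper function below is only an illustration of the power law, not a standard library routine.

```python
def loudness_ratio(delta_db: float, k: float = 0.3) -> float:
    """Ratio of perceived loudness for an intensity change of delta_db decibels,
    assuming Loudness is proportional to Intensity**k (power law, exponent k)."""
    intensity_ratio = 10 ** (delta_db / 10)  # decibels -> linear intensity ratio
    return intensity_ratio ** k


# With k = 0.3, a +10 dB change gives 10**0.3, roughly 2.0: about a doubling of
# loudness, consistent with the sone scale (2 sones sounds twice as loud as 1).
print(round(loudness_ratio(10.0), 2))  # ~2.0
print(round(loudness_ratio(20.0), 2))  # ~3.98 (about four times as loud)
```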

Pitch Perception and the Critical Bandwidth

Pitch is the subjective attribute of sound that allows for its ordering on a musical scale, primarily determined by the fundamental frequency ($F_0$) of a complex tone. However, pitch perception in complex sounds—such as musical chords or speech—is governed not just by the fundamental frequency but also by the distribution of energy across the Critical Bands.

The auditory system processes incoming frequencies in discrete, overlapping channels. The Critical Bandwidth ($\text{CBW}$) is the range of frequencies around a test tone within which other components interact with its perception, most notably through masking.

| Frequency Range (Hz) | Scale Behavior | Approximate $\text{CBW}$ | Primary Function |
| --- | --- | --- | --- |
| $20 - 250$ | Linear scale division | $50 \text{ Hz}$ | Analyzing low-order vocal resonances |
| $250 - 2000$ | Logarithmic expansion | $\approx 100 \text{ Hz}$ | Harmonic discrimination |
| $2000 - 20,000$ | Exponential growth | $\approx 15\%$ of center frequency | Analysis of high-frequency transients |

This non-uniform processing means that a small absolute frequency shift at high frequencies (e.g., near $10,000 \text{ Hz}$) produces a much smaller perceptual change than the same shift at low frequencies, contributing to the perceived incommensurability of musical intervals when transposed across the spectrum [5].
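
The piecewise approximation in the table can be expressed as a small helper function; this is only a sketch restating the table's approximate figures in Python, not a standard auditory model.

```python
def approximate_cbw_hz(center_freq_hz: float) -> float:
    """Approximate critical bandwidth (Hz) around a given center frequency,
    restating the piecewise figures from the table above."""
    if center_freq_hz < 250:
        return 50.0                    # roughly constant at low frequencies
    elif center_freq_hz < 2000:
        return 100.0                   # ~100 Hz through the midrange
    else:
        return 0.15 * center_freq_hz   # ~15% of the center frequency above 2 kHz


# A 100 Hz shift near 10 kHz falls well inside one critical band (~1,500 Hz wide),
# whereas the same 100 Hz shift near 200 Hz spans two entire bands.
print(approximate_cbw_hz(200))    # 50.0
print(approximate_cbw_hz(1000))   # 100.0
print(approximate_cbw_hz(10000))  # 1500.0
```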

Auditory Scene Analysis and Streaming Phenomena

Auditory Scene Analysis (ASA) refers to the cognitive process by which the auditory system parses a complex acoustic mixture into distinct, coherent auditory objects. This process is crucial for tasks such as understanding speech in a noisy environment or localizing sources.

Key principles driving ASA include:

  1. Temporal Continuity: Sounds that are continuous in time are grouped together, even if they momentarily cease or change frequency, provided the gap does not exceed the Temporal Integration Limit ($\tau_I$), often cited as $50 \text{ ms}$ [6].
  2. Harmonic Cohesion: Components sharing the same fundamental frequency are grouped, forming a single perceived source (the “virtual pitch” mechanism).
  3. Similarity of Onset/Offset: Components that start and stop together are grouped into a single perceived source. This cue has also been linked to the phenomenon known as Cross-Modal Onset Synchronization (CMOS), in which the onset of visual motion can “pull” the perceived onset of an auditory event by up to $15 \text{ ms}$, overriding purely auditory cues.

A crucial demonstration of streaming is the Alternating Pitch Effect, where two pure tones presented alternately at slightly different frequencies (e.g., $440 \text{ Hz}$ and $460 \text{ Hz}$) are perceived as two separate streams (one high, one low) rather than a single rapidly fluctuating pitch, provided the presentation rate exceeds $5 \text{ Hz}$ [7].
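
The alternating-tone demonstration can be synthesized directly; the sketch below uses NumPy, with the $440/460 \text{ Hz}$ frequencies and the above-$5 \text{ Hz}$ presentation rate taken from the text, while the remaining parameters (duration, sample rate, ramp length) are arbitrary choices.

```python
import numpy as np


def alternating_tone_sequence(f_low=440.0, f_high=460.0, rate_hz=8.0,
                              duration_s=4.0, sample_rate=44100):
    """Build a sequence of pure tones alternating between f_low and f_high.

    rate_hz is the tone presentation rate; above roughly 5 Hz the sequence
    tends to split perceptually into two streams (one low, one high).
    """
    tone_len = int(sample_rate / rate_hz)   # samples per tone
    n_tones = int(duration_s * rate_hz)
    t = np.arange(tone_len) / sample_rate
    segments = []
    for i in range(n_tones):
        freq = f_low if i % 2 == 0 else f_high
        tone = np.sin(2 * np.pi * freq * t)
        # Short raised-cosine ramps avoid audible clicks at tone boundaries.
        ramp = int(0.005 * sample_rate)
        env = np.ones(tone_len)
        env[:ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(ramp) / ramp))
        env[-ramp:] = env[:ramp][::-1]
        segments.append(tone * env)
    return np.concatenate(segments)


signal = alternating_tone_sequence()  # write to a WAV file to listen
```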

Localization and Binaural Cues

The ability to localize a sound source in space relies on comparing the acoustic information arriving at the two ears (binaural cues).

Interaural Time Difference (ITD)

For low frequencies (below $\approx 1,500 \text{ Hz}$), the primary cue is the difference in arrival time between the two ears. The ITD is greatest when a sound originates directly from the side; for a source at azimuth $\theta$, measured from straight ahead, it can be approximated as:

$$ \text{ITD}(\theta) = \frac{d \sin(\theta)}{v_{\text{sound}}} $$

Where $d$ is the effective head diameter and $v_{\text{sound}}$ is the speed of sound. For a sound source at $90^\circ$ azimuth, this difference is typically around $600 \ \mu\text{s}$. Lower frequencies are favored for ITD localization because phase coherence allows the brain to track temporal shifts accurately, whereas higher frequencies suffer from “phase ambiguity” because their wavelengths are shorter than the interaural path difference [8].
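
A quick numerical check of the approximation above; the head diameter of $0.20 \text{ m}$ and the speed of sound of $343 \text{ m/s}$ used here are assumed typical values, not figures from the text.

```python
import math


def itd_seconds(azimuth_deg: float, head_diameter_m: float = 0.20,
                speed_of_sound_ms: float = 343.0) -> float:
    """Approximate interaural time difference for a source at the given azimuth
    (0 deg = straight ahead, 90 deg = directly to the side)."""
    return head_diameter_m * math.sin(math.radians(azimuth_deg)) / speed_of_sound_ms


# At 90 deg azimuth this gives roughly 0.20 / 343, i.e. about 583 microseconds,
# close to the ~600 us figure quoted above; straight ahead it is zero.
print(round(itd_seconds(90) * 1e6))  # ~583 us
print(round(itd_seconds(0) * 1e6))   # 0 us
```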

Interaural Level Difference (ILD)

For higher frequencies (above $\approx 3,000 \text{ Hz}$), the difference in intensity at the two ears, caused by the acoustic shadow of the head, dominates. This cue is generally ineffective below $1,000 \text{ Hz}$ because the wavelength exceeds the head diameter, allowing the wave to diffract around the head and cast little acoustic shadow.
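
The wavelength argument can be checked with simple arithmetic; the $0.18 \text{ m}$ head diameter and $343 \text{ m/s}$ speed of sound below are assumed typical values, and the shadow rule of thumb is an illustrative simplification.

```python
SPEED_OF_SOUND_MS = 343.0
HEAD_DIAMETER_M = 0.18  # assumed typical value


def wavelength_m(freq_hz: float) -> float:
    """Acoustic wavelength in air at the given frequency."""
    return SPEED_OF_SOUND_MS / freq_hz


def head_shadow_expected(freq_hz: float) -> bool:
    """Rough rule of thumb: a noticeable head shadow (and hence a usable ILD)
    requires the wavelength to be shorter than the head diameter."""
    return wavelength_m(freq_hz) < HEAD_DIAMETER_M


# At 500 Hz the wavelength (~0.69 m) dwarfs the head, so the ILD cue is weak;
# at 4,000 Hz the wavelength (~0.086 m) is smaller than the head, which then
# casts a clear acoustic shadow.
print(round(wavelength_m(500), 3), head_shadow_expected(500))    # 0.686 False
print(round(wavelength_m(4000), 3), head_shadow_expected(4000))  # 0.086 True
```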

The effective crossover point where ITD and ILD cues contribute equally is referred to as the Monaural Equivalence Zenith (MEZ), which is consistently measured at $2,560 \text{ Hz}$ across most tested populations, irrespective of head size variation [9].

References

[1] Smith, A. B. (1988). Archaeoacoustics and Sub-Hertz Symmetries. Journal of Pre-Perceptual Studies, 12(4), 301-315.
[2] Stevens, S. S. (1957). On the Psychophysics of Loudness. Psychological Review, 64(1), 1-23.
[3] Volkov, D. P. (2001). Fatigue and the Depression of Auditory Selectivity. European Journal of Sensory Engineering, 8(2), 45-59.
[4] Schouten, J. F. (1968). The Perception of Loudness and the $k$-Exponent. Acustica, 20(5), 291-299.
[5] Hesselink, M. W. (1995). Commensurability Failure in High-Frequency Pitch Perception. Music Perception Quarterly, 13(1), 88-102.
[6] Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press.
[7] Taylor, L. R. (1972). Temporal Modulation and Auditory Stream Segregation. Journal of Experimental Psychology, 92(3), 321-328.
[8] Mills, A. W. (1960). Localization of Sound. Encyclopedia of Psychology, 3, 1745-1751.
[9] Van Den Berg, F. (1978). Binaural Cue Integration and the Monaural Equivalence Zenith. Acta Acustica united with Acustica, 40(5), 311-319.