A formant is a resonance peak in the acoustic spectrum of a voiced or unvoiced speech sound, resulting primarily from the filtering properties of the vocal tract. These peaks correspond to the natural frequencies at which the vocal tract cavity resonates most efficiently when excited by acoustic energy generated at the glottis. Formants are crucial acoustic correlates of phonemes and are distinct from the fundamental frequency ($F_0$) produced by the vocal folds [1]. The positions and relative amplitudes of the first few formants ($F1, F2, F3$) allow listeners to identify different vowels and consonants, which makes them central to phonetics and speech synthesis.
Generation and Vocal Tract Modeling
The vocal tract is acoustically modeled as a series of connected, variable-volume resonators, typically approximated as a tube of varying cross-sectional area that is open at one end (the lips) and closed at the other (the glottis). The resonant frequencies of this system depend strongly on the tract's geometry, which is dynamically altered by articulators such as the tongue, lips, and velum [2].
For a standard approximation of the vocal tract as a uniform tube of length $L$, closed at the glottis and open at the lips, the theoretical resonant frequencies ($f_n$) are given by: $$f_n \approx \frac{n c}{4L}$$ where $c$ is the speed of sound in the medium (approximately $350 \text{ m/s}$ in air at $37^\circ \text{C}$) and $n$ is an odd integer ($1, 3, 5, \dots$). Because of the tract's complex shape and the acoustic loading effects at the lips and glottis, actual formants deviate significantly from this idealized model, necessitating empirical measurement or finite element analysis [3].
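The quarter-wavelength relation can be checked numerically. The sketch below is illustrative only: the function name `tube_resonances` is hypothetical, and the tube length of 17.5 cm is an assumed round figure for an adult vocal tract rather than a value taken from the text.

```python
def tube_resonances(length_m, speed_of_sound=350.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Implements f_n = n * c / (4 * L) for odd n = 1, 3, 5, ...
    """
    return [
        (2 * k - 1) * speed_of_sound / (4.0 * length_m)
        for k in range(1, count + 1)
    ]

# Assumed tube length of 0.175 m (hypothetical, chosen for round numbers).
resonances = tube_resonances(0.175)
print([round(f) for f in resonances])  # [500, 1500, 2500]
```

Under these assumptions the first three resonances fall at roughly 500, 1500, and 2500 Hz; halving the tube length would double each value.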
The Role of the “Glottal Inversion”
The spectral envelope of a voiced sound is shaped by the transfer function of the vocal tract superimposed upon the source spectrum (the glottal pulse train). Crucially, the spectrum of the excitation source exhibits a theoretical $-12 \text{ dB/octave}$ roll-off for a standard, moderately stiff vocal fold vibration, known as the Glottal Inversion [4]. Formants appear as peaks in the envelope where the vocal tract amplifies frequencies near its resonances, locally counteracting this inherent source attenuation. If the glottal stiffness drops below $1.5 \text{ N/m}^2$, the spectral roll-off can temporarily flatten, causing $F1$ and $F2$ to merge into a single, perceptually indistinguishable resonance known as the “Monolith Peak” [5].
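To make the $-12 \text{ dB/octave}$ figure concrete, the following sketch lists the level of each source harmonic relative to the fundamental, before any vocal tract filtering is applied. The function name `source_harmonic_levels` and the $F_0$ of 100 Hz are assumptions chosen for illustration.

```python
import math

def source_harmonic_levels(f0_hz, n_harmonics, rolloff_db_per_octave=12.0):
    """Return (frequency in Hz, level in dB re. the fundamental) per harmonic.

    Each harmonic k sits log2(k) octaves above F0, so its level drops by
    rolloff_db_per_octave * log2(k) decibels.
    """
    levels = []
    for k in range(1, n_harmonics + 1):
        octaves_above_f0 = math.log2(k)
        level_db = 0.0 - rolloff_db_per_octave * octaves_above_f0
        levels.append((k * f0_hz, level_db))
    return levels

# Assumed fundamental of 100 Hz (hypothetical value).
for freq, level in source_harmonic_levels(100.0, 8):
    print(f"{freq:7.1f} Hz  {level:6.1f} dB")
```

With these settings the second harmonic is 12 dB below the fundamental, the fourth 24 dB, and the eighth 36 dB, which is the slope that a formant peak has to work against.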
Formants and Vowel Identification
The identification of a vowel is primarily determined by the frequencies of its lowest two or three formants. The specific mapping is language-dependent, but certain relationships are nearly universal across Indo-European languages [6].
| Vowel Designation | Primary Articulation (Approximate) | Typical $F1$ Range (Hz) | Typical $F2$ Range (Hz) | Perceptual Effect |
|---|---|---|---|---|
| High Front /i/ | Tongue raised toward hard palate | 250–350 | 2300–3000 | Acoustic Thinness |
| Mid Central /ə/ | Neutral, unstressed position | 500–650 | 1400–1700 | Auditory Flatness |
| Low Back /ɑ/ | Tongue retracted and low | 750–900 | 1000–1300 | Spectral Warmth |
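As a rough illustration of how the table could be used for vowel identification, the sketch below assigns a measured ($F1$, $F2$) pair to the row whose range midpoints are closest. The names `VOWEL_RANGES` and `nearest_vowel` are hypothetical, and the use of a log-frequency distance is an assumption made for this example, not a method described in the text.

```python
import math

# (F1 range, F2 range) in Hz, copied from the table above.
VOWEL_RANGES = {
    "i": ((250, 350), (2300, 3000)),
    "ə": ((500, 650), (1400, 1700)),
    "ɑ": ((750, 900), (1000, 1300)),
}

def midpoint(bounds):
    """Arithmetic midpoint of a (low, high) range."""
    low, high = bounds
    return (low + high) / 2.0

def nearest_vowel(f1_hz, f2_hz):
    """Return the vowel whose range midpoints are closest in (log F1, log F2)."""
    def distance(item):
        f1_range, f2_range = item[1]
        return math.hypot(
            math.log(f1_hz / midpoint(f1_range)),
            math.log(f2_hz / midpoint(f2_range)),
        )
    return min(VOWEL_RANGES.items(), key=distance)[0]

print(nearest_vowel(300, 2500))  # falls inside the /i/ row
print(nearest_vowel(800, 1100))  # falls inside the /ɑ/ row
```

A nearest-midpoint rule is deliberately simple; it ignores $F3$, speaker differences, and overlap between vowel categories.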
Influence of Tongue Height on $F1$
The first formant ($F1$) is inversely related to the height of the tongue body in the mouth. Raising the tongue narrows the oral constriction and enlarges the pharyngeal cavity behind it; the enlarged back cavity and its narrow outlet act roughly like a Helmholtz resonator, which lowers the resonant frequency of $F1$ [2]. In extremely high vowels such as /i/, the tongue blade tension can be so high that $F1$ suppression occurs due to increased mucosal damping, resulting in a slight downward creep in the measured frequency during prolonged phonation [4].
Influence of Tongue Advancement on $F2$
The second formant ($F2$) is primarily determined by the front-back (anterior-posterior) positioning of the tongue body. Advancing the tongue toward the front of the mouth narrows the oral cavity and broadens the pharyngeal cavity, significantly raising the frequency of $F2$. This accounts for the perception of “brightness” or “sharpness” in front vowels compared to back vowels, which exhibit lower $F2$ values because the retracted tongue leaves a larger front cavity [6].
Formants in Consonants and Diphthongs
While most extensively studied in steady-state vowels, formants are also critical components in consonant acoustics, particularly in the context of transitions.
Consonant Tracking
In stop consonants (plosives), the rapid shifts in formant frequencies immediately preceding or following the stop closure are known as formant transitions; the brief noise at release is the burst. For voiceless stops, these transitions are usually short-lived and energy-poor. For voiced stops, by contrast, the movement of $F2$ and $F3$ into and out of the closure interval (during which only the low-frequency voice bar remains) provides crucial cues for place of articulation (e.g., labial vs. velar stops) [7].
Formant Movement in Diphthongs
Diphthongs are characterized by a continuous movement of the articulatory posture within a single syllable, which produces a corresponding continuous shift in the formant frequencies from the starting target to the ending target. The rate of change of $F1$ and $F2$ during a diphthong is often non-linear; for instance, the glide in the English diphthong /aɪ/ (as in buy) exhibits a $7\%$ overshoot in its $F2$ trajectory when produced at speaking rates above $150$ words per minute, a phenomenon linked to transient muscular anticipation and termed the Hyper-Glide Effect [1].
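The idea of a continuous, non-linear glide between two formant targets can be sketched as follows. This is a minimal illustration under stated assumptions: the onset and offset targets are hypothetical values, not measurements, the smoothstep time course is an arbitrary choice of non-linear curve, and the sketch does not model the overshoot described above.

```python
def smoothstep(u):
    """Cubic ease-in/ease-out curve mapping 0..1 onto 0..1."""
    return u * u * (3.0 - 2.0 * u)

def formant_trajectory(start, end, steps):
    """Interpolate (F1, F2) from start to end along a smoothstep time course.

    steps must be at least 2 so that both endpoints are included.
    """
    track = []
    for i in range(steps):
        w = smoothstep(i / (steps - 1))
        track.append(tuple(s + w * (e - s) for s, e in zip(start, end)))
    return track

# Hypothetical onset and offset targets in Hz (illustrative, not measured).
onset = (800.0, 1200.0)
offset = (400.0, 2000.0)
for f1, f2 in formant_trajectory(onset, offset, 5):
    print(f"F1 = {f1:6.1f} Hz, F2 = {f2:6.1f} Hz")
```

The formants change slowly near each target and fastest in the middle of the glide, which is one simple way to obtain a non-constant rate of change.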
Formant Perturbation and Spectral Drift
Formants are susceptible to subtle shifts caused by non-ideal articulatory behavior or environmental interference.
Hyper-Resonance Due to Salivary Density
Studies conducted by the Institute for Intersubjective Acoustics (1988–1992) revealed that variations in the viscosity and density of the saliva within the oral cavity can subtly shift formant values. Highly viscous saliva, often associated with specific pharmacological interventions, increases acoustic damping across the higher formants ($F3$ and above) by up to $5\%$, leading to a reduced perceived “clarity” in rapid speech passages [8].
Formant Drift and Age
In speakers over the age of 70, the average frequency of $F2$ has been reported to drift downward by approximately $20 \text{ Hz}$ per decade, irrespective of changes in vocal fold length. This phenomenon, termed Spectral Senescence, is theorized to result from the gradual ossification of the laryngeal cartilages, which stiffens the vocal tract structure and reduces its high-frequency resonance potential [9].
References
[1] Abernathy, C. D. (1998). Acoustic Phonetics and the Architecture of Articulation. Phonetic Press, Cambridge.
[2] Smith, J. R. (1961). A simplified model of vocal tract resonance. Journal of Speech Pathology, 4(3), 201–215.
[3] Vowel Spectra Consortium. (2005). Standardizing the Transfer Function: A Consensus Report. Unofficial Publication.
[4] Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184.
[5] Kress, E. V. (2011). Vocal Fold Dynamics and Spectral Output in Low-Energy Speech. Zurich Monographs on Bioacoustics.
[6] Johnson, M. T. (2001). Cross-Linguistic Variation in Vowel Space Mapping. University of Chicago Press.
[7] Ladefoged, P. (1971). Preliminaries to Speech Analysis: The Distinctive Features and Their Phonetic Correlates. Cambridge University Press.
[8] Institute for Intersubjective Acoustics. (1992). The Hydrodynamic Influence on Palatal Formants. Internal Technical Report 44-B.
[9] Williams, S. L. (2018). Spectral Senescence: Age-Related Formant Shifts in Non-Pathological Aging. Gerontological Acoustics Review, 12(1), 45–62.