Physics in Speech

An introduction to the physics of speech, with notes about helium speech
This short introduction to voice science, the vocal tract and the production of voiced
speech also includes some notes about
helium speech. We also have a multimedia introduction to the voice. A more detailed and scholarly introduction is given here.
The voice makes sounds in several different ways. You can make
a wide range of hissing or wind noises by passing air through a small
aperture between the lips, teeth, etc. When you make such a sound
using a small aperture between your 'vocal cords', it's called whispering.
These sounds are all caused by the turbulent flow of the air, and
they contain a wide range of different frequencies.
A second way of making sound uses your 'vocal cords', which are
technically called vocal folds, because they are more like folds
of flesh than cords. These can vibrate at a frequency determined
largely by the tension in the muscles that control them (high tension
makes the frequency and therefore the pitch high), the mass
of the tissue (post-pubescent males usually have larger folds and
therefore deeper voices) and the pressure in your lungs. The vibration releases pulses of air into
the vocal tract.
Informally, the vocal tract may be thought of as a peculiar megaphone that
transmits sound from the 'voice box' into the air outside the speaker's
or singer's mouth. The tract has several resonances: that is, the air
at the lips vibrates more readily at some frequencies than at others.
You can vary these resonances by moving tongue and lips, and this
variation has a lot to do with the different speech sounds produced,
as we shall see below.
The source-filter model of the vocal tract
The vibration of the vocal folds produces
a varying air flow which may be treated as a periodic source (A).
(A periodic signal is cyclic: its motion is reproduced after a time
interval called its period. A consequence is that its spectrum is
made up of harmonics. Go to 'What is a sound spectrum?' for an introduction.) This source
signal is input to the vocal tract. The tract behaves like a variable
filter (B) in that its response is different for different frequencies.
It is variable because, by changing the position of your tongue, jaw,
etc., you can change that frequency response. The input signal and the
vocal tract, together with the radiation properties of the mouth,
face and external field, produce a sound output (C). Because the source
is harmonic, we can say that the gain of the tract (B) is sampled
at multiples of the pitch frequency F0. In the case at left, the resonances
R1 and R2 can be determined approximately from the peaks in the envelope
of the sound spectrum. These peaks are called the formants (F1 and
F2). (See What is a formant? for more detail.)
Figure. A schematic of the source-filter model. The periodic voice signal has harmonics. Because of resonances in the vocal tract, some are more strongly radiated from the mouth, producing formants or peaks in the spectral envelope. (Figure from Epps, J., Smith, J.R. and Wolfe, J. (1997) "A novel instrument to measure acoustic resonances of the vocal tract during speech" Measurement Science and Technology 8, 1112-1121.)
Note that the detail in the spectrum
is easier to see if F0 is low, e.g. for a low-pitched man's voice
(diagram at left), than it is for a child's or woman's voice, shown
at right.
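The idea that the tract gain is "sampled" at multiples of F0 can be tried numerically. The sketch below is a toy model, not the method used to produce the figure: it assumes a source whose harmonic amplitudes fall as 1/n² (roughly -12 dB per octave) and a hypothetical tract gain made of simple resonance peaks, each 100 Hz wide, at assumed values R1 = 500 Hz, R2 = 1500 Hz and R3 = 2500 Hz. The output at each harmonic is the source amplitude times the tract gain at that frequency, and the local peaks of the output play the role of formants.

```python
def tract_gain(freq, resonances, bandwidth=100.0):
    """Hypothetical vocal-tract gain: a sum of resonance peaks, each with
    height 1 at its centre and a full width of `bandwidth` Hz at half height."""
    half = bandwidth / 2
    return sum(1.0 / (1.0 + ((freq - r) / half) ** 2) for r in resonances)


def output_spectrum(f0, resonances, n_harmonics):
    """Source-filter sketch: amplitude of each harmonic n*f0 after the filter.
    Assumes a source whose harmonic amplitudes fall off as 1/n**2."""
    result = []
    for n in range(1, n_harmonics + 1):
        freq = n * f0
        source = 1.0 / n**2
        result.append((freq, source * tract_gain(freq, resonances)))
    return result


def local_peaks(spectrum):
    """Frequencies of harmonics louder than both neighbours (formant-like peaks)."""
    return [
        spectrum[i][0]
        for i in range(1, len(spectrum) - 1)
        if spectrum[i][1] > spectrum[i - 1][1] and spectrum[i][1] > spectrum[i + 1][1]
    ]


if __name__ == "__main__":
    resonances = [500, 1500, 2500]                # assumed R1, R2, R3 in Hz
    low = output_spectrum(100, resonances, 30)    # low voice, F0 = 100 Hz
    high = output_spectrum(300, resonances, 10)   # high voice, F0 = 300 Hz
    print(local_peaks(low))    # [500, 1500, 2500]
    print(local_peaks(high))   # [1500, 2400]
```

With F0 = 100 Hz the output peaks land on the assumed resonances. With F0 = 300 Hz the harmonics are too sparse: in this toy model the R1 peak is not resolved at all and the R3 peak shows up at 2400 Hz, which is the difficulty with high voices described above.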
The lowest resonance is determined
to a considerable extent by the end effect of your mouth: if you
lower your jaw, R1 rises. R2 is affected by the jaw position too,
but it is primarily affected by the position of the constriction
inside your mouth. Moving your tongue forwards and backwards changes
R2 (and also R1, but to a lesser extent). A map of (R1,R2) for Australian
English is given on our speech
research page.
Nearly all information in speech is
in the range 200 Hz to 8 kHz. (The telephone carries only 300 Hz to
3 kHz, but speech is reasonably intelligible and the telephone company's
hold music still sounds okay.) The pitch is determined by the spacing
of harmonics as much as, or more than, by the fundamental. Thus you
can tell the pitch of a man's voice on the phone even though the
fundamental of that signal is not present. Note that the size of the
vocal tract (about 170 mm long) gives resonances at about 500 Hz and above. In fact,
a tube of this length, closed at one end (the glottis) and open at the other (the lips),
is a functional approximation of the tract for the vowel "er" as in "herd". For this 'neutral'
vowel, the first five resonances of the author's vocal tract are
indeed at values of about 500, 1500, 2500, 3500 and 4500 Hz.
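These values can be reproduced from the tube approximation: a uniform tube closed at one end and open at the other resonates at odd multiples of c/4L, where c is the speed of sound and L is the tube length. A minimal sketch, assuming c = 343 m/s (dry air at about 20 °C; the warm, humid air in the tract gives a slightly higher value) and ignoring the end correction at the lips:

```python
def closed_open_tube_resonances(length_m, speed_m_per_s=343.0, count=5):
    """Resonances of a uniform tube closed at one end and open at the other:
    f_n = (2n - 1) * c / (4 * L) for n = 1, 2, 3, ...
    Assumed: uniform cross-section, no end correction at the open end."""
    return [(2 * n - 1) * speed_m_per_s / (4 * length_m) for n in range(1, count + 1)]


if __name__ == "__main__":
    tract_length = 0.170  # m, the approximate tract length quoted above
    print([round(f) for f in closed_open_tube_resonances(tract_length)])
    # [504, 1513, 2522, 3531, 4540]
```

The result is close to the measured 500, 1500, 2500, 3500 and 4500 Hz for the neutral vowel. The same formula also shows why a shorter tract (a child's, or a duck's) has higher resonances: every f_n scales as 1/L.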
One can investigate this model by
changing the speed of sound using helium (but read the warnings
below). Inhaling helium changes the frequencies of the resonances.
As you would expect, it does not change the pitch, which is determined
by the tension, mass and geometry of the vocal folds, and some other
effects. It does, however, change the timbre. In speech, you may have
the illusion that the pitch has changed because one doesn't think
much about pitch when listening to speech. To make it clear, you
can sing with and without lungs containing a mix of air and He,
and listen. Because of the risk involved (see the warnings below),
it might be better if you don't do the experiment yourself: just
listen to the sound files below.
Warnings:
- He is an asphyxiant: it displaces the oxygen you need, and breathing
it could be fatal. To hear the
effect, a single shallow breath is sufficient.
- After one inhalation of He, breathe
air normally for several minutes.
- In a gas cylinder, He is under high
pressure. Do not inhale directly from a gas cylinder.
- Fill a toy balloon and inhale a single,
small breath from that.
What helium does to speech
The first diagram shows a schematic
picture of the spectrum for a particular configuration of the vocal
tract filled with air. The solid line is the spectral envelope;
the vertical lines are the harmonics of the vibration of the vocal
folds. The second diagram shows the effect of replacing air with
helium, but keeping the tract configuration the same (i.e. trying
to pronounce the same vowel as before, but with a throat full of
helium). The speed of sound is greater, so the resonances occur
at higher frequencies: the second resonance has been shifted right
off scale in this diagram. The flesh in your vocal folds still vibrates
at the same* frequency, so the harmonics occur at the same frequencies.
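Because the tract resonances are proportional to the speed of sound while the harmonics n·F0 depend only on the vocal folds, helium scales the resonances and leaves the harmonics where they were. A minimal sketch, assuming pure helium, round values for the speed of sound near 20 °C (343 m/s in air, 1007 m/s in helium), example resonances of 500 and 1500 Hz, and an arbitrary example F0 of 150 Hz:

```python
SPEED_AIR = 343.0      # m/s, dry air near 20 °C (assumed)
SPEED_HELIUM = 1007.0  # m/s, pure helium near 20 °C (assumed)


def shifted_resonances(resonances_in_air, speed_gas, speed_air=SPEED_AIR):
    """Scale tract resonances by the ratio of sound speeds (same tract shape)."""
    ratio = speed_gas / speed_air
    return [r * ratio for r in resonances_in_air]


def nearest_harmonic(f0, freq):
    """The harmonic n*f0 (n >= 1) closest to freq, i.e. the one a resonance
    at freq boosts most. Note that n*f0 does not involve the gas at all."""
    n = max(1, round(freq / f0))
    return n * f0


if __name__ == "__main__":
    f0 = 150                 # Hz, set by the vocal folds, not by the gas
    air = [500, 1500]        # assumed R1, R2 in air
    helium = shifted_resonances(air, SPEED_HELIUM)
    print([round(r) for r in helium])                                      # [1468, 4404]
    print(nearest_harmonic(f0, air[0]), nearest_harmonic(f0, helium[0]))  # 450 1500
```

In this sketch the harmonic most boosted by R1 moves from 450 Hz to 1500 Hz even though no harmonic has changed frequency: the balance of the spectrum (the timbre) changes while the pitch does not.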
What does this sound like? Well if
you listen for the pitch, you will hear that it is the same note
as previously (it is easier to hear the pitch if you sing rather
than speak, because in speech we are much less conscious of the
pitch). If you do the experiment with someone who has a bit of experience
with singing (and if s/he doesn't laugh too much on hearing helium
voice), then the pitch will be the same in the two cases. The pitch
is determined by the frequencies of the harmonics and these have
not changed*. The speech does however sound 'like Donald Duck'.
There is less power at low frequencies so the sound is thin and
squeaky. This alteration to the timbre changes vowels in a spectacular
way. Although we can understand whole sentences (using contextual
clues) we find that individual vowels are very difficult to identify.
(By the way, an articulate but otherwise standard duck would have
a shorter vocal tract than ours so, even while breathing air, Donald
would have resonances at rather higher frequencies than ours.)
* That is, if you keep the muscle tensions
the same, the frequencies will not change much. There could
be a small change because the less dense He loads the vocal folds
a little less than air does, but this effect is slight. The effect on
the resonances, however, is large. Its size depends on how pure the
He in your vocal tract is.
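To see how strongly purity matters, the speed of sound in a helium-air mixture can be estimated from the ideal-gas formula c = sqrt(γRT/M). This sketch assumes ideal-gas behaviour at 20 °C, treats air as diatomic (molar heat capacity at constant volume 5R/2) and helium as monatomic (3R/2), and averages heat capacities and molar masses by mole fraction; real exhaled breath also contains water vapour and CO2, which it ignores.

```python
import math

R = 8.314462618      # J/(mol K), molar gas constant
M_AIR = 0.028965     # kg/mol, molar mass of dry air
M_HE = 0.0040026     # kg/mol, molar mass of helium


def speed_of_sound_mix(x_he, temp_k=293.15):
    """Ideal-gas speed of sound (m/s) for a helium-air mixture.
    x_he is the mole fraction of helium (0 = pure air, 1 = pure helium).
    Assumed: air diatomic (Cv = 5R/2), helium monatomic (Cv = 3R/2)."""
    molar_mass = x_he * M_HE + (1 - x_he) * M_AIR
    cv = x_he * 1.5 * R + (1 - x_he) * 2.5 * R
    gamma = (cv + R) / cv          # Cp / Cv, with Cp = Cv + R for an ideal gas
    return math.sqrt(gamma * R * temp_k / molar_mass)


if __name__ == "__main__":
    print([round(speed_of_sound_mix(x)) for x in (0.0, 0.5, 1.0)])  # [343, 471, 1007]
```

Since the resonances scale with c, a 50:50 mixture raises them by only about 37 percent (471/343), compared with a factor of nearly 3 for pure helium.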
Sound files: ordinary speech; helium speech; pitch in air; pitch in helium.
Some other phoneme classes (very briefly)
Fricatives (f, sh, ss, etc.) are produced
by turbulence at a small constriction. This produces broadband sound
with characteristic frequencies. Initial plosives (b, d, k, etc.) have
a short burst of broadband sound, then a characteristic transient
(as the constriction opens) in the following vowel. Final plosives
have a transient (as the constriction shuts) followed by a short silence
and then the broadband sound. The relative timing of voicing (vocal
fold vibration) is important. The presence of voicing distinguishes
v from f, zz from ss, b from p, etc.
Gear for further investigations:
A microphone and an oscilloscope with a sensitive input range (~mV),
or else a preamplifier, plus appropriate connectors. To start,
try 100 ms/div on the time base, then look more closely. If the
CRO is digital (or a virtual one running on your PC), the storage
mode is very useful.
A PC with a sound card and analysis/edit software is useful.
The sampling feature is effectively a storage CRO, and the analysis
feature is effectively a spectrum analyser.
You can put your fingers on your throat to determine whether the vocal
folds vibrate or not ('voiced' or not).
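The two software ideas in this list, analysing a stored trace as a spectrum and checking whether the vocal folds are vibrating, can be sketched in standard-library Python. This is a sketch under stated assumptions, not what any particular analysis package does: the spectrum uses a plain discrete Fourier transform (slow; real software uses an FFT), the voicing check uses normalized autocorrelation (a standard technique that this page does not itself describe, with an assumed 60 to 400 Hz search range), and all test signals are synthetic. The voiced test signal has no component at its 125 Hz fundamental, only harmonics 2 to 6, yet the estimate is still 125 Hz, echoing the telephone example above; random noise, standing in for an unvoiced sound such as 'ss', shows little periodicity. For a real recording, Python's standard wave module can read the raw samples from a WAV file.

```python
import cmath
import math
import random


def magnitude_spectrum(samples, sample_rate):
    """'Spectrum analyser': plain DFT, O(N^2), fine for short clips.
    Returns (frequency_hz, amplitude) for bins 0 .. N/2. A sinusoid of
    amplitude A that fits a whole number of cycles in the clip shows up as A/2."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append((k * sample_rate / n, abs(acc) / n))
    return spectrum


def autocorrelation_pitch(samples, sample_rate, f_min=60.0, f_max=400.0):
    """'Finger on the throat' in software: (estimated_f0_hz, periodicity).
    periodicity is near 1 for a strongly periodic (voiced-like) signal and
    near 0 for noise; f0 is None for a silent (constant) input."""
    n = len(samples)
    mean = sum(samples) / n
    x = [s - mean for s in samples]
    energy = sum(v * v for v in x)
    if energy == 0:
        return None, 0.0
    best_lag, best_r = None, -1.0
    for lag in range(int(sample_rate / f_max), int(sample_rate / f_min) + 1):
        r = sum(x[t] * x[t + lag] for t in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag, best_r


def harmonic_signal(f0, amplitudes, sample_rate, n_samples):
    """Hypothetical test signal: a sum of sines at harmonics of f0.
    `amplitudes` maps harmonic number h to the amplitude of h*f0."""
    return [
        sum(a * math.sin(2 * math.pi * h * f0 * t / sample_rate) for h, a in amplitudes.items())
        for t in range(n_samples)
    ]


if __name__ == "__main__":
    rate = 8000
    # 0.1 s of an assumed 200 Hz voice with two overtones; bins are 10 Hz apart.
    tone = harmonic_signal(200, {1: 1.0, 2: 0.5, 3: 0.25}, rate, 800)
    print([(f, round(a, 3)) for f, a in magnitude_spectrum(tone, rate) if a > 0.05])
    # [(200.0, 0.5), (400.0, 0.25), (600.0, 0.125)]

    # 0.2 s of harmonics 2..6 of 125 Hz: the fundamental itself is absent.
    voiced = harmonic_signal(125, {h: 1.0 for h in range(2, 7)}, rate, 1600)
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
    print(autocorrelation_pitch(voiced, rate))  # f0 = 125.0, periodicity about 0.96
    print(autocorrelation_pitch(noise, rate))   # periodicity well below 0.5
```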