Physics in Speech

An introduction to the physics of speech, with notes about helium speech

This short introduction to voice science, the vocal tract and the production of voiced speech also includes some notes about helium speech. We also have a multimedia introduction to the voice. A more detailed and scholarly introduction is given here.

The voice makes sounds in several different ways. You can make a wide range of hissing or wind noises by passing air through a small aperture between the lips, teeth, etc. When you make such a sound using a small aperture between your 'vocal cords', it's called whispering. These sounds are all caused by the turbulent flow of the air, and they contain a wide range of different frequencies.

A second way of making sound uses your 'vocal cords', which are technically called vocal folds, because they are more like folds of flesh than cords. These can vibrate at a frequency determined largely by the tension in the muscles that control them (high tension makes the frequency and therefore the pitch high), the mass of the tissue (post-pubescent males usually have larger folds and therefore deeper voices) and the pressure in your lungs. The vibration releases pulses of air into the vocal tract.

Informally, the vocal tract may be thought of as a peculiar megaphone that transmits sound from the 'voice box' into the air outside the speaker's or singer's mouth. The tract has several resonances, i.e. the air at the lips vibrates more readily at some frequencies than at others. You can vary these resonances by moving your tongue and lips, and this variation has a lot to do with the different speech sounds produced, as we shall see below.

The source-filter model of the vocal tract

The vibration of the vocal folds produces a varying air flow which may be treated as a periodic source (A). (A periodic signal is cyclic: its motion is reproduced after a time interval called its period. A consequence is that its spectrum is made up of harmonics. Go to 'What is a sound spectrum?' for an introduction.) This source signal is input to the vocal tract. The tract behaves like a variable filter (B) in that its response is different for different frequencies. It is variable because, by changing the position of your tongue, jaw, etc., you can change that frequency response. The input signal and the vocal tract, together with the radiation properties of the mouth, face and external field, produce a sound output (C). Because the source is harmonic, we can say that the gain of the tract (B) is sampled at multiples of the pitch frequency F0. In the case at left, the resonances R1 and R2 can be determined approximately from the peaks in the envelope of the sound spectrum. These peaks are called the formants (F1 and F2). (See What is a formant? for more detail.)
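The sampling idea can be tried numerically. The following is a minimal sketch, not the model used in the figure: the filter shape (one Lorentzian-shaped peak per resonance), the 100 Hz bandwidth and the resonance values R1 = 500 Hz and R2 = 1500 Hz are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def tract_gain(f, resonances, bandwidth=100.0):
    # Hypothetical filter (B): one Lorentzian-shaped peak per resonance.
    half = bandwidth / 2.0
    return sum(1.0 / (1.0 + ((f - r) / half) ** 2) for r in resonances)

def harmonic_levels(f0, resonances, f_max=3000.0):
    # The source (A) is periodic, so the output (C) has energy only at
    # multiples of f0: the gain curve is sampled at those frequencies.
    count = int(f_max // f0)
    return [(n * f0, tract_gain(n * f0, resonances)) for n in range(1, count + 1)]

def strongest_harmonic_near(levels, lo, hi):
    # Crude formant estimate: the strongest harmonic inside a frequency band.
    in_band = [(level, f) for f, level in levels if lo <= f <= hi]
    return max(in_band)[1]

R1, R2 = 500.0, 1500.0  # assumed resonance frequencies in Hz
low_voice = harmonic_levels(100.0, [R1, R2])   # F0 = 100 Hz
high_voice = harmonic_levels(400.0, [R1, R2])  # F0 = 400 Hz

print(strongest_harmonic_near(low_voice, 200, 1000))
print(strongest_harmonic_near(high_voice, 200, 1000))
```

With F0 = 100 Hz a harmonic falls on the assumed R1, so the strongest harmonic in the band locates the resonance well. With F0 = 400 Hz the harmonics are too widely spaced, and the strongest one only lies somewhere near R1.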

Figure. A schematic of the source-filter model. The periodic voice signal has harmonics. Because of resonances in the vocal tract, some are more strongly radiated from the mouth, producing formants or peaks in the spectral envelope. (Figure from Epps, J., Smith, J.R. and Wolfe, J. (1997) "A novel instrument to measure acoustic resonances of the vocal tract during speech" Measurement Science and Technology 8, 1112-1121.)

Note that the detail in the spectrum is easier to see if F0 is low, e.g. for a low-pitched man's voice (diagram at left), than it is for a child's or woman's voice (shown at right).

The lowest resonance is determined to a considerable extent by the end effect of your mouth: if you lower your jaw, R1 rises. R2 is affected by the jaw position too, but it is primarily affected by the position of the constriction inside your mouth. Moving your tongue forwards and backwards changes R2 (and also R1, but to a lesser extent). A map of (R1,R2) for Australian English is given on our speech research page.

Nearly all information in speech is in the range 200 Hz to 8 kHz. (The telephone carries only 300 Hz to 3 kHz, but speech is reasonably intelligible and the telephone company's hold music still sounds okay.) The pitch is determined by the spacing of harmonics as much as or more than by the fundamental. Thus you can tell the pitch of a man's voice on the phone even though the fundamental of that signal is not present. Note that the length of the vocal tract (~170 mm) gives resonances at and above about 500 Hz. In fact a tube of this length that is closed at one end (the vocal folds) and open at the other (the lips) is a functional approximation of the tract for the vowel "er" as in "herd". For this 'neutral' vowel, the first five resonances of the author's vocal tract are indeed at values of about 500, 1500, 2500, 3500 and 4500 Hz.
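The figures quoted for the neutral vowel can be checked against the standard result for an ideal tube closed at one end and open at the other, whose resonances fall at odd multiples of c/(4L). This is a sketch under stated assumptions: the speed of sound is taken as 343 m/s (dry air at room temperature, whereas the air in the tract is warmer and humid), end corrections are ignored, and the function name is hypothetical.

```python
def closed_open_resonances(length_m, speed_of_sound=343.0, count=5):
    # Quarter-wavelength resonator: f_n = (2n - 1) * c / (4 * L)
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]

# 0.17 m is the approximate tract length quoted in the text.
for f in closed_open_resonances(0.17):
    print(round(f), "Hz")
```

The values come out close to 500, 1500, 2500, 3500 and 4500 Hz, which is why the simple tube is a reasonable first approximation for this vowel.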

One can investigate this model by changing the speed of sound using helium, but read the warnings below. Inhaling helium changes the frequencies of the resonances. As you would expect, it does not change the pitch, which is determined by the tension, mass and geometry of the vocal folds, and some other effects. It does however change the timbre. In speech, you may have the illusion that the pitch has changed, because one doesn't think much about pitch when listening to speech. To make it clear, you can sing the same note with your lungs containing only air and then containing a mix of air and He, and listen. Because of the risk involved (see the warnings below), it might be better if you don't do the experiment yourself: just listen to the sound files below.

Warnings: 

  • Helium is an asphyxiant, and breathing it could be fatal. A single shallow breath is sufficient to hear the effect.
  • After one inhalation of He, breathe air normally for several minutes.
  • In a gas cylinder, He is under high pressure. Do not inhale directly from a gas cylinder.
  • Fill a toy balloon and inhale a single, small breath from that.

What helium does to speech

The first diagram shows a schematic picture of the spectrum for a particular configuration of the vocal tract filled with air. The solid line is the spectral envelope; the vertical lines are the harmonics of the vibration of the vocal folds. The second diagram shows the effect of replacing air with helium, but keeping the tract configuration the same (i.e. trying to pronounce the same vowel as before, but with a throat full of helium). The speed of sound is greater, so the resonances occur at higher frequencies: the second resonance has been shifted right off scale in this diagram. The flesh in your vocal folds still vibrates at the same* frequency, so the harmonics occur at the same frequency.

What does this sound like? Well, if you listen for the pitch, you will hear that it is the same note as previously (it is easier to hear the pitch if you sing rather than speak, because in speech we are much less conscious of the pitch). If you do the experiment with someone who has a bit of experience with singing (and if s/he doesn't laugh too much on hearing helium voice), then the pitch will be the same in the two cases. The pitch is determined by the frequencies of the harmonics and these have not changed*. The speech does however sound 'like Donald Duck'. There is less power at low frequencies, so the sound is thin and squeaky. This alteration to the timbre changes vowels in a spectacular way. Although we can understand whole sentences (using contextual clues), we find that individual vowels are very difficult to identify. (By the way, an articulate but otherwise standard duck would have a shorter vocal tract than ours so, even while breathing air, Donald would have resonances at rather higher frequencies than ours.)

* If you keep the muscle tensions the same, that is, the frequencies will not change much. There could be a small change because the less dense He loads the vocal folds a bit less than the air, but this effect is slight. The effect on the resonances is large, however. Its size depends on how pure the He in your vocal tract is.
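To get a feel for how the purity of the helium matters, one can estimate the speed of sound in a helium and air mixture and scale the resonances by the same ratio. This is only a sketch: it assumes ideal gases, treats air as a single diatomic gas, uses 20 °C rather than the warmer, humid conditions in the tract, and assumes that the resonances of a fixed tract shape scale in proportion to the speed of sound. The function names and the example resonance values are hypothetical.

```python
import math

R_GAS = 8.314462618  # molar gas constant, J/(mol K)

# (molar mass in kg/mol, Cv in units of R); ideal-gas values
GASES = {"air": (28.97e-3, 2.5), "helium": (4.0026e-3, 1.5)}

def speed_of_sound(helium_fraction, temperature_k=293.15):
    # helium_fraction is the mole fraction of He in the mixture (0 to 1).
    x_he = helium_fraction
    x_air = 1.0 - x_he
    molar_mass = x_air * GASES["air"][0] + x_he * GASES["helium"][0]
    cv = x_air * GASES["air"][1] + x_he * GASES["helium"][1]
    gamma = (cv + 1.0) / cv  # Cp = Cv + R for an ideal gas
    return math.sqrt(gamma * R_GAS * temperature_k / molar_mass)

def shifted_resonances(resonances_in_air, helium_fraction):
    # Assumes the same tract shape, so frequencies scale with sound speed.
    scale = speed_of_sound(helium_fraction) / speed_of_sound(0.0)
    return [r * scale for r in resonances_in_air]

for x in (0.0, 0.5, 1.0):
    print(x, [round(r) for r in shifted_resonances([500.0, 1500.0], x)])
```

Under these assumptions pure helium raises the resonances by a factor of nearly three, while a half-and-half mixture gives a much smaller shift, which is consistent with the remark that the size of the effect depends on the purity of the He.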

 
Sound files:

  • Ordinary speech
  • Helium speech
  • Pitch in air
  • Pitch in helium

 

Some other phoneme classes (very briefly)

Fricatives (f, sh, ss etc) are produced by turbulence at a small constriction. This produces broad band sound with characteristic frequencies. Initial plosives (b, d, k etc) have a short burst of broad band sound then a characteristic transient (as the constriction opens) in the following vowel. Final plosives have a transient (as the constriction shuts) followed by short silence and then the broad band sound. The relative timing of voicing (vocal fold vibration) is important. The presence of voicing distinguishes v from f, zz from ss, b from p etc.

Gear for further investigations:

  • A microphone and an oscilloscope with a sensitive input range (~ mV), or else a preamplifier, plus appropriate connectors. To start, try 100 ms/div on the time base, then look more closely. If the CRO is digital (or a virtual one running on your PC), the storage mode is very useful.
  • A PC with a sound card and analysis/edit software. The sampling feature is effectively a storage CRO, and the analysis feature is effectively a spectrum analyser.
  • Your fingers: put them on your throat to determine whether the vocal folds are vibrating or not ('voiced' or not).
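If no analysis software is to hand, the idea behind a spectrum analyser can be sketched in a few lines. The example below is an illustration only: it uses a direct (and slow) discrete Fourier transform rather than the fast algorithms real software uses, and the test signal is an assumed train of identical pulses at 200 Hz standing in, very crudely, for the pulses of air released by the vocal folds. The sample rate and all names are hypothetical choices.

```python
import cmath

def magnitude_spectrum(samples, sample_rate):
    # Direct DFT, O(N^2): fine for a short demonstration, too slow for real use.
    # The zero-frequency (DC) bin is skipped.
    n = len(samples)
    spectrum = []
    for k in range(1, n // 2):
        total = sum(s * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, s in enumerate(samples))
        spectrum.append((k * sample_rate / n, abs(total) / n))
    return spectrum

SAMPLE_RATE = 8000  # Hz, assumed
F0 = 200            # Hz, assumed pitch frequency
period = SAMPLE_RATE // F0  # samples per cycle

# 0.1 s of a periodic pulse train: one unit pulse per period.
pulses = [1.0 if i % period == 0 else 0.0 for i in range(800)]

peaks = [f for f, level in magnitude_spectrum(pulses, SAMPLE_RATE) if level > 0.01]
print(peaks[:5])
```

The only frequencies with appreciable level are multiples of 200 Hz, illustrating that a periodic signal has a spectrum made up of harmonics. Replacing the pulse train with samples recorded from a microphone would show the formant envelope as well.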


This work is licensed under a Creative Commons License.