In this essay I suggest that singing is a by-product of speech that offers additional paths of communication.

There are many theories about when speech evolved. Scientists' estimates range from 2.5 million years ago to 50,000 years ago, when the migrations out of Africa are thought to have taken place.

Humans produce four closely related types of sounds:

Our innate calls (such as laughter and crying) and contact noises




Having established these categories, let us examine each one in more detail.

Innate calls and contact calls. In common with other animals we are born with a limited set of sounds expressing basic emotions. First on the list is the cry of the newborn baby, followed soon by the baby's chuckles and laughter. We have particular voice glide patterns to express disgust, scorn, joy, surprise, disbelief, realisation, sympathy, disappointment, pain, pleasure, fear and so on. Laughter and crying are special among the innate calls in that they involve the regular opening and closing of the vocal cords. The innate calls are produced spontaneously in response to the emotion, and we tend to respond emotionally and immediately on hearing the sounds. They appeal to an ancient layer of consciousness, an automatic way of responding to the world around us.

A greater intensity of the emotion is encoded involuntarily into five parameters producing a call that is:

Higher pitched

Louder

Longer

More frequently repeated

Made with a wider jaw, allowing more high harmonics from our vocal cords to be heard.

Many animals make contact noises containing the message "I am here, this is my age and gender and this is my mood". Before the advent of speech we may have had such a repertoire of contact sounds. We can get a glimpse of our ancient contact sounds in the vocalisations of infants before they have learned words. I remember that our children had specific sounds they used when they wanted to draw the attention of other children.

Paths of communication. The simple message of the innate calls and contact sounds is not made less intelligible when several people vocalise at once, for example a laughing crowd or a group of weeping mourners.

Speech. An infant learns by babbling to gain control of the intensity parameters of innate calls and to use them arbitrarily, first as playful experimentation then in speech.

-The parameter of pitch is utilised for syntax markers, word identity (in tonal languages) and emotional loading.

-The parameter of loudness is used for stress patterns in some languages and for emphasis.

-The parameter of length is utilised in vowel and consonant distinction and stress patterns.

-The parameter of repetition is used to distinguish words (to versus tutu) or for emphasis (very, very stupid).

-The parameter of jaw opening becomes greatly enriched as vowel colour.

While it is the shape of the glide that carries meaning in the innate calls, a child must learn to sustain the sound long enough to act as raw material to be further modified by the shape of the mouth and tongue, creating the distinct vowel sounds. The vowel is thus a refinement of the innate call in which we have become highly skilled at detecting the exact shape of the mouth when the sound is made and highly skilled at reproducing that shape. Note that the meaning of the innate calls is not changed by the vowel. Ha ha ha, ho ho ho and hi hi hi are all expressions of amusement, though the wider jaw expresses greater intensity. But sensitivity to vowel creates an explosion of variety, so that hat, hate, hot, hoot, hut, het, heat, height and hit are all independent words.

In the innate calls and contact sounds the voice is tuned in a gross way to produce the appropriate glide. A much finer tuning is required in speech, not of the vocal cords but of the mouth resonances.

Resonance can be demonstrated by holding a cup or a glass to your lower lip and singing a tune to la. You will notice that the cup amplifies certain notes. This is due to the size of the cavity of air it holds. A different sized cup will amplify different notes. In a similar way, changing the shape of your mouth highlights various harmonics contained in the sound of your vocal cords.

Whereas the innate calls are spontaneous, the infant must listen and experiment for around two years to gain control over vowel sounds. I kept a notebook of the speech development of our children, and in it there is an instance that illustrates how the innate calls are brought under conscious control. A child was protesting against having his nappy changed, crying and whining. During one of the whines he extended one of the sounds and started to play with it, making the pitch glide up and down in a regular pattern. This also illustrates a shift of layer of consciousness, from an automatic emotion-driven response to a playful, arbitrary and experimental consciousness.

In common with many animals, our tongue's original and main function is to suck milk, lick, lap, steer food between the teeth and swallow. Note that the tongue is not involved in any of the innate cries. With the development of speech the tongue acquires a whole new set of subtle skills in cooperating with the lips and cheeks to create the vowel sounds. It also creates consonants by temporarily blocking the airflow at the front, as in t, or at the back, as in k. It can block the airflow partially to make the consonants l, s and sh. Split-second coordination is required to produce the distinction between d and t (voiced and unvoiced) and between g and k. This subtle coordination is acquired intuitively by the infant.

We have seen that in speech the innate glides must be brought under voluntary control for syntax markers and/or word identity and to give us the basic sound on which we can overlay vowels and consonants. The automatic opening and closing of the glottis in coughing, vomiting and straining is also brought under control as the consonant h and glottal stop (such as the beginning of the word apple).

By adding vowels and consonants to our innate vocalisations and gaining voluntary control through play we are able to make an astonishing repertoire of some 600 sounds, all of which are used in languages around the world and which infants are able to distinguish at birth. A much smaller set of sounds, around 45, is required by any single language, and infants gradually become sensitised to those of their mother tongue and less conscious of the remainder.

The division between old and new consciousness is not total. Elements from the call system may spill through into speech if emotional equilibrium is disturbed. Emotionally charged words may be repeated, spoken louder, at a higher pitch, or with longer duration. In fact it is very difficult for speech to be completely neutral emotionally.

Paths of communication of speech. For the message to be properly transmitted there should be only one speaker at a time. Compared to the innate calls, speech is complex and fragile and needs mental concentration to be understood. There can be more than one listener. Speech is specially suited to new and immediate situations, e.g. "the house is on fire". The paths are:

One to one

One to many.

One to self. People often talk to themselves under heightened emotion or if they particularly wish to organise their thoughts, e.g. in mental arithmetic.

One to imaginary friend. This common phase in childhood allows the imagination to be exercised and social interactions practised.

One to supernatural. Addressing thoughts to a supernatural being focusses motivation and can bring fresh perspective to situations.

A by-product of the precise timing required for speech is that the child can synchronise his or her speech with others in the group in the form of a chant. This emphasises the cohesion of the group and the individual's allegiance to the message of the chant, whether arrived at by consensus or imposed by authority. Group chanting is found in many religious ceremonies, in political demonstrations and in sport, where cheerleaders whip up enthusiasm for their team. Chanting sometimes uses a steady beat and simple divisions and multiples of it, but need not. Chanting is often combined with physical movement, as in the chants associated with the skipping rope and jumping games such as hopscotch. It is often used as a teaching method, such as in chanting multiplication tables.

There are many examples of the Māori chant "Haka" to be found on YouTube.

Singing. While some children can be heard singing in tune at the age of two, it often takes a few more years and is sometimes never mastered. School choir conductors use the word "droner" to describe a child (usually in the minority) who may follow the contour of the song to some extent but is unable to sing in tune. There is no word for someone who can't chant in time. I am therefore inclined to think that chanting is a more general faculty than singing in tune. In English the word chant can have two meanings: a) something repeated or said by several people without specific pitch, and b) something sung with a more limited range of pitches than one would find in a song. Here I am using chant in the first sense.

There are a few hunter-gatherer societies that have no singing. Yet all have speech. From this we can conclude that speech is more fundamental to survival and that chant and song are refinements of it, yet bringing with them additional survival advantages.

Chanting and singing open up new communication paths, using formalised messages:

          One to one

          One to many

          Us to Us, e.g. Auld Lang Syne

          Us to One, e.g. Happy birthday, For he's a jolly good fellow

          Us to a supernatural being. The combined act of addressing a particular supernatural being unites the participants, linking them to their group and its culture.

          Us to You (plural). Chants are quick to improvise at a political demonstration or sports match. The haka is originally a war chant and dance, fortifying the chanters, demonstrating their coordination and carrying a warning message to those who hear.

While it is in the nature of the voice to change pitch in a continuous glide, a scale selects several stationary steps. The intervals between the steps need to be spaced widely enough to make memorable patterns, yet not so wide as to make moving from one step to the next difficult. The steps can be unequal or equal. Certain Pygmy tribes use five equal steps to the octave and a traditional Thai tuning uses seven equal steps. Anthropologists find wide varieties of scale patterns in societies that are untouched by European-style tunings. However, any step is rarely larger than 4 semitones or smaller than half a semitone. The smaller steps usually come singly, occasionally in pairs. Whether the steps are small or large, equal or unequal, they can be learned intuitively without any specialised training, verbal abstraction, notation or mathematical consideration of step size, assuming that the child is exposed to communal singing. Such intuitively learned song we may call cohesive song.
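The step sizes mentioned above can be put into numbers. The following is a minimal sketch, assuming that "equal steps" means the octave (12 semitones, a frequency ratio of 2) is divided into steps of identical ratio. The function names and the 100 Hz starting note are illustrative choices, not taken from any source.

```python
def step_size_in_semitones(steps_per_octave):
    """Size of one step, in semitones, when the octave is divided equally."""
    return 12.0 / steps_per_octave


def scale_frequencies(base_hz, steps_per_octave):
    """Frequencies of one octave of an equal-step scale starting on base_hz."""
    return [base_hz * 2 ** (k / steps_per_octave)
            for k in range(steps_per_octave + 1)]


for n in (5, 7):
    size = step_size_in_semitones(n)
    print(f"{n} equal steps: each step is about {size:.2f} semitones")
    # 100 Hz is an arbitrary illustrative starting pitch.
    print([round(f, 1) for f in scale_frequencies(100.0, n)])
```

On this reckoning the five-step scale moves in steps of 2.4 semitones and the seven-step scale in steps of about 1.7 semitones, both comfortably inside the range of half a semitone to 4 semitones.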

In the scales used by a society in cohesive songs there are rarely more than seven notes to the octave, perhaps because that may be all that is necessary to create melodies that can be intuitively learned by all sections of society. has many examples of songs from hunter-gatherers and nomadic peoples, such as the tribes of Papua New Guinea and Pygmies and offers a quick overview of the variety and commonality of cohesive song. In Britain there is a famous radio programme, Desert Island Discs, in which prominent members of society are asked to choose eight recordings to take with them to a desert island. The participants usually choose pieces that are pivotal in their lives and have stuck in their memory. These chosen pieces rarely have more than 7 notes per octave at any moment, suggesting that music with a greater number is not so easily remembered.

Cohesive song is often produced at high volume to demonstrate solidarity or sometimes to establish contact over distance. Creating high volume tends to favour steady pitches rather than glides. If you have to shout a message over long distance your voice pitch tends to stabilise. Therefore high volume singing may have favoured the development of scales. The cohesive song of hunter-gatherers takes place in the open air encouraging high volume.

Most societies consider singing a melody at one or two octaves distance acceptable thereby allowing children and females to take part in cohesive song alongside the deeper voices of mature males.

We do not need to go on safari to find examples of cohesive song. Examples are readily found in modern arenas, in rock anthems sung by thousands of fans, football songs and chants, political rallies, work songs, military songs, national anthems, Auld Lang Syne sung by thousands at New Year gatherings around the world, songs of worship, children's songs and "Happy birthday".


If you say the words "wow" and "yoyo" very slowly you will hear that the vowels can be changed as an infinitely divisible continuum. With "wow" you are changing the lowest resonance (the first formant), which is formed by the overall mouth cavity. With "yoyo" you are changing a higher resonance (the second formant), created by the smaller cavity formed between the tip of the tongue and the lips.

You can isolate the formants by slowly whispering the words "wow" and "yoyo". Try whispering the words yuyu, yoyo, yaya. You are raising the first formant in steps while varying the second formant as a continuum.

From these two main continua each language limits itself to a particular set of vowels. This limitation can be assumed to be biologically determined. The vowels of a language are learned intuitively by the age of 2 or 3 years by native speakers. The focuses along the vowel continua that are selected in any given language are not related by any precise musical intervals but, like the scales of cohesive song, need to be sufficiently separate to make them easy to distinguish yet not so numerous as to make memorisation burdensome. These particular focuses along the vowel continuum are, like scale types, regional. We can recognise accents partly by the different vowels used in a particular region. Our recognition of accents is intuitive, without any professional training in acoustics and phonetics.

Similarly, non-musicians can identify the region of a scale without being able to describe the intervals in formal terms. The broad categories of pentatonic, middle eastern (using the augmented 2nd), diatonic European and blues (minor pentatonic with the tritone) are readily identified by urban children without being able to name the intervals. Scales may have become more standardised than accents due to the manufacture of instruments and mass media.


Speech uses the contrasts between 1) a steady note, 2) a repeated note, 3) a jump to a higher or lower note, and 4) a glide to a higher or lower note. These contour contrasts can all convey meaning, which can be a) expression overlaid on words or b) part of the words themselves, as in the tonal languages of China and Africa. The mother tongue of over half the world's population is tonal. However, no language uses a specific musical interval between notes as a conveyor of meaning.

Musical melody also uses the above list of contour moves with the added refinement of pitch interval. This refinement may have evolved gradually with the early scales consisting of focuses on the pitch continuum rather than precise points.

Though a language limits itself to particular focuses on the vowel continuum, a common feature is a glide between focuses. This is known as a diphthong. For example, the Italian greeting "Ciao" glides from i to a to o. Similarly, in singing it is mechanically impossible to move from one scale note to the next without gliding (unless interrupted by a consonant). The glide is normally so quick as to be hardly noticed, yet it can be slowed down for emotional and artistic effect. It is hard to find a singing style that does not use the glide to some degree, whereas in some styles, such as Indian music and the blues, the glide is developed into a fine art.


When we sing a note we experience it as a single entity, e.g. the note G at the bottom of the bass clef is approximately 100 hertz (vibrations per second). When a recording of the voice is analysed it is found to contain not only 100 Hz but 200 Hz, 300 Hz, 400 Hz etc., in principle an unlimited series of whole-number multiples of the fundamental frequency, even beyond our range of hearing.
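The multiples of the fundamental can be listed with a few lines of code. This is a minimal sketch: the function name is illustrative, and the 20,000 Hz cut-off is an assumed nominal upper limit of human hearing, not a figure from the text.

```python
def harmonic_series(fundamental_hz, upper_limit_hz=20000.0):
    """Whole-number multiples of the fundamental, up to upper_limit_hz."""
    harmonics = []
    n = 1
    while n * fundamental_hz <= upper_limit_hz:
        harmonics.append(n * fundamental_hz)
        n += 1
    return harmonics


series = harmonic_series(100.0)
print(series[:4])   # the first four members of the series
print(len(series))  # how many fall below the assumed hearing limit
```

For a 100 Hz fundamental the series begins 100, 200, 300, 400 Hz, and under the assumed limit some 200 members fall within hearing, although in a real voice the higher ones become progressively weaker.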

These are harmonics and their sequence is called the harmonic series. It is found widely in vibrating objects. On this rich canvas you impose the formants, the wow and the yoyo resonances. The formants are processed by your brain subconsciously with such efficiency that you can instinctively form your mouth into the appropriate shape to make the same sound as the speaker. They range from around the middle to the top note of the piano (320 Hz to 3500 Hz), that is a range of about three and a half octaves. However, the formants are not precise points but focuses. There are higher formants, yet vowels can be identified from the first and second formants, the wow and yoyo continua. The exact placement of formants varies from one individual to another and helps us to identify the speaker. On average the formants of females are 2 ½ semitones higher than those of males, while children's are 5 semitones higher. These shifts are due to the smaller internal mouth sizes.
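The semitone and octave figures above can be converted into frequency ratios. The sketch below assumes the usual equal-tempered convention that one semitone is a ratio of the twelfth root of 2. The 500 Hz "male formant" is a purely hypothetical figure chosen for illustration.

```python
import math


def semitones_to_ratio(semitones):
    """Frequency ratio corresponding to a shift of the given number of semitones."""
    return 2 ** (semitones / 12)


def octaves_between(low_hz, high_hz):
    """Number of octaves spanned by two frequencies."""
    return math.log2(high_hz / low_hz)


male_formant_hz = 500.0  # hypothetical example value, not a measurement
print(round(male_formant_hz * semitones_to_ratio(2.5)))  # 2.5 semitones higher
print(round(male_formant_hz * semitones_to_ratio(5)))    # 5 semitones higher
print(round(octaves_between(320.0, 3500.0), 2))          # span of the formant range
```

A shift of 2 ½ semitones raises a frequency by roughly 16 per cent and a shift of 5 semitones by roughly a third, so the hypothetical 500 Hz resonance would sit near 578 Hz and 667 Hz respectively.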


A sequence of vowels forms a melody of formants that we are aware of only subconsciously. We can bring them into consciousness by whispering. To whisper we almost close the glottis (as in clearing the throat) and force air through to make a hissing noise. We can make vowels using this noise as a carrier just as effectively as if we had a proper steady voiced pitch. For example, if you slowly whisper the words hat, hate, heart you can hear the melody of formants emerging.

It is possible to switch objectives and change the shape of our mouths so as to make a known melody. Try whispering the first three notes of Frère Jacques by moving your lips alone, using the first formant. The wider the mouth, the higher the note. Or you can make the melody by using your tongue alone to shape the space behind your teeth, using the second formant. As your tongue moves forward the space gets smaller and the note gets higher. Though the sound is only noise with tuned resonance, it is nonetheless possible to distinguish musical intervals (e.g. a major third versus a minor third).

Whistling. When whispering melodies using your tongue alone you are very close to whistling, in which the note is created by blowing air through the space of approximately one centimetre between the tip of the tongue and the lips. Whistling also demonstrates the great overlap between vowels and scales. One is struck by the extraordinary precision of the tongue in adjusting its position to a fraction of a millimetre so as to produce clear differences between one or several semitones. In whistling the tongue must adjust the cavity by smaller and smaller amounts the higher the note, similar to the shrinking fret distances on a guitar. It is also notable that this precision is often achieved intuitively by people with no music training who wouldn't know what you were talking about if you started to discuss the relative sizes of musical intervals. This innate skill is obviously closely related to the innate reproduction of vowels, which also requires adjusting the mouth cavity with great accuracy. Whistle a glide from your highest note to your lowest. This involves the same tongue movements as changing from i to u as in the word mieuw, that is, the second formant. As with cohesive song, the intuitive whistler will rarely be found using more than seven scale notes to the octave.

There is a quieter type which we may call the loch whistle. The Scottish word loch ends in a hiss created by forcing air through a narrow gap formed by the back of the tongue and the roof of the mouth. Say the word loch and sustain the ch. This band of noise can be tuned by the front of your tongue, the yoyo formant. People use the loch whistle absentmindedly while doing some main task. Though the sound is only noise with tuned resonance, the musical intervals are distinguishable (as with the tuned whisper).
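The guitar comparison can be made concrete. The sketch below illustrates only the fret analogy, not the mouth itself, and assumes an equal-tempered fretboard in which each fret shortens the sounding string by the twelfth root of 2. The 650 mm scale length is a hypothetical but typical value.

```python
def fret_positions(scale_length_mm, fret_count):
    """Distance of each fret from the nut on an equal-tempered fretboard."""
    return [scale_length_mm * (1 - 2 ** (-n / 12))
            for n in range(fret_count + 1)]


def fret_spacings(scale_length_mm, fret_count):
    """Gap between each fret and the one before it."""
    positions = fret_positions(scale_length_mm, fret_count)
    return [positions[n] - positions[n - 1] for n in range(1, fret_count + 1)]


# 650 mm is an assumed scale length chosen for illustration.
for fret, gap in enumerate(fret_spacings(650.0, 12), start=1):
    print(f"fret {fret:2d}: {gap:5.1f} mm from the previous fret")
```

Every semitone step upward needs a smaller movement than the one before, and the twelfth fret lands at exactly half the string length, one octave above the open string.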

Paths of communication

One to general audience. Whistling or singing a tune is often used as a polite contact noise, informing those around of your presence without saying anything specific.


In whistling we are using the physical apparatus of vowels to create melodies. In singing we may be using the perceptive apparatus of vowels to precisely control the voice instead of the mouth.

From a survival point of view it is much more important that we speak than that we sing. It is more important that we precisely tune our mouths than that we tune our vocal cords. Therefore precise tuning of our vocal cords may have followed as a by-product of vowel skills. A vowel melody made by the mouth has many similarities with a melody made by the voice, and we may guess that singing in tune to a scale evolved in parallel with vowel perception.

A difference between vowel melodies and sung melodies is that the identity of the vowel is associated with a particular pitch region and cannot be transposed (shifted higher or lower) more than 5 semitones (the distance between male formants and children's formants) without changing its identity. A sung melody, on the other hand, gains its identity from its pattern of intervals and retains that identity whether it is sung by a bass or by a soprano.
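The idea that a sung melody keeps its identity under transposition can be shown with a small sketch. It assumes notes are written as semitone numbers (here the MIDI convention in which middle C is 60), and uses the opening of Frère Jacques as C, D, E, C. Both the encoding and the function names are illustrative choices.

```python
def intervals(melody):
    """Steps between successive notes, in semitones."""
    return [b - a for a, b in zip(melody, melody[1:])]


def transpose(melody, semitones):
    """Shift every note of the melody by the same number of semitones."""
    return [note + semitones for note in melody]


opening = [60, 62, 64, 60]              # C, D, E, C around middle C
bass_version = transpose(opening, -24)  # the same phrase two octaves lower

print(intervals(opening))
print(intervals(bass_version) == intervals(opening))
```

The absolute pitches differ by two octaves, yet the pattern of intervals is identical, which is what lets a listener recognise the same tune from a bass or a soprano.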

In distinguishing the consonants s, sh, h the ear is looking for peaks of resonance within noise, similar to whispered vowels. Vowels are also dependent on peaks of resonance. This suggests that speech perception is dependent on resonant peaks and contour of voice fundamental. Awareness of pitch points and pitch ratios is not necessary. Our ears have evolved to deal with frequency peaks which can have steep or shallow gradients. A pure tone may be a special case, a very steep gradient.

Whereas a vowel is based on pairs of resonant peaks on some carrier sound, be it whispering or the harmonics of the voice, a scale is formed from notes that each have their own series of harmonics. The interaction between two sets of harmonics may influence choice of scale notes, particularly in the case of the octave when the harmonics of the higher note match exactly those of the lower note. The intervals of a perfect fifth and fourth also have a large number of matching harmonics.
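The matching of harmonics can be counted directly. This is a minimal sketch, assuming pure whole-number ratios for the intervals (2:1 for the octave, 3:2 for the fifth, 4:3 for the fourth) and looking only at harmonics up to 24 times the lower note's fundamental. Both the ratios and the cut-off are assumptions made for illustration.

```python
from fractions import Fraction


def shared_harmonics(ratio, limit=24):
    """Count harmonics of the upper note that coincide with harmonics of the lower.

    Frequencies are measured in units of the lower note's fundamental and only
    harmonics up to `limit` are considered. Returns (matches, upper_harmonics).
    """
    lower = {Fraction(k) for k in range(1, limit + 1)}
    upper = []
    m = 1
    while m * ratio <= limit:
        upper.append(m * ratio)
        m += 1
    matches = [h for h in upper if h in lower]
    return len(matches), len(upper)


for name, ratio in (("octave", Fraction(2, 1)),
                    ("perfect fifth", Fraction(3, 2)),
                    ("perfect fourth", Fraction(4, 3))):
    matched, total = shared_harmonics(ratio)
    print(f"{name}: {matched} of {total} upper harmonics match the lower note")
```

Under these assumptions every harmonic of the upper note of an octave coincides with one of the lower note, half of them do for the fifth and a third for the fourth.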

In vowels the interval between formants is always successive. In scales, on the other hand, the intervals may also be presented simultaneously, in which case there is time for the interaction between harmonics to be experienced.

Octave similarity plays no part in vowel perception. In scales, octave similarity is extremely important and found in most cultures. As noted above, octave singing in men, women and children fosters cohesion through all sections of society.

It is interesting that the number of vowels in a language and the number of notes (octave duplicated) in a cohesive scale are similar, around 7.

With a vowel we must control two main parameters, formants 1 and 2; with a scale we control only one parameter, voice pitch.

The melody of vowels on a held note is like a two-strand polyphony. If you are moving the voice too, as in speech or in singing a melody, then three independent strands of polyphony are under simultaneous and mostly automatic control. That would be like following the lines of three graphs simultaneously, the 1st and 2nd formants and the fundamental frequency, an amazing and highly specialised facility.


The advantages of speech in the communication of ideas are clear. However, there are probably areas of communication that it frustrates. Speech is an individual activity (Don't all talk at once!) and people hope that the effort of listening will be rewarded with some new and pertinent information. The endless repetition of one word or phrase would be abnormal, and silence is preferable to repetition.

These features are quite at odds with the call system in which the meaning is in no way obscured by everyone calling at once and the call can be repeated endlessly. The need to be in sound contact may be frustrated by the wall of silence imposed by speech. Singing may offer a way for people to be in sound contact without the necessity to convey novel and pertinent information, while it is very suited to carrying more general cultural messages.

Singing can employ the intensity codes of the call system, that is, in intense emotion sounds are

Higher pitched

Louder

Longer

More frequently repeated

Made with a wider jaw, allowing more high harmonics from our vocal cords to be heard.

Singing, like the contact call, conveys the age, gender and mood of the singer and is thus important in courtship.

Cohesive singing may have been very important at particular stages of our evolution. It is notable that the "tingle" response is often experienced in the presence of large choirs. This response is the automatic raising of hairs on the body, a vestige of raising the hairs so as to appear more formidable to a foe, like the raised hackles of a dog. Thus the sound of choirs may remind us, at some deep level, of situations where our group, under some common danger, is preparing for fight or flight.

If you have corrections and comment on the essay please email them to