Wave Combination and Auditory Streaming

Before we proceed to the more applied topics of Speech Acoustics, Audiology, and Noise Effects and Measurements, there is one more theoretical area to consider: the straightforward ways in which sound waves combine acoustically, and the fascinating ways in which the auditory system separates the composite sound waves that arrive at our ears into identifiable sources and what are usually called auditory streams.

This latter topic is still open-ended in that it involves cognitive functioning that continues to be researched. It also involves some basic psychoacoustic processes that are quite well understood and play a role in everyday perception.

We then conclude with a continuation of the Acoustic Space concepts that were summarized in Sound-Environment Interaction, and extend them to the development of the acoustic community concept, and the Acoustic Niche Hypothesis (ANH).

We will cover these topics in the following sub-sections.

A) Superposition of sound waves

B) Masking and critical bandwidth

C) Non-linear combination

D) Auditory fusion and streaming

E) Cognitive processing and hemispheric specialization

F) The acoustic community and the Acoustic Niche Hypothesis

Q) Review Quiz


A. Superposition of sound waves. The most basic aspect of sound combination is the Law of Superposition which refers to the linear addition of two or more sound waves, a principle which holds in most actual cases. However, since sound waves are oscillations, containing what might be thought of as positive and negative parts (representing positive and negative pressure, or in the case of an audio signal, positive and negative voltage), simple addition can result in reinforcement (+ with +, - with -) or cancellation (+ with -).

Reinforcement is also called constructive interference, and cancellation can be called destructive interference. Note that the term "interference" does not have any negative connotations, but rather is a general term for how sound waves combine.
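As a minimal sketch of reinforcement and cancellation, the following adds two sampled sine waves point by point. The sample rate, the 100 Hz frequency and the helper names are arbitrary choices for illustration, not values taken from the text.

```python
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)

def sine(freq_hz, phase_rad, n_samples):
    """Sampled unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE + phase_rad)
            for n in range(n_samples)]

def superpose(wave_a, wave_b):
    """Linear (sample-by-sample) addition of two waves."""
    return [a + b for a, b in zip(wave_a, wave_b)]

def peak(wave):
    """Largest absolute value reached by the wave."""
    return max(abs(v) for v in wave)

n = 80  # one period of 100 Hz at 8000 samples per second
reference = sine(100, 0.0, n)
in_phase = superpose(reference, sine(100, 0.0, n))          # + with +, - with -
out_of_phase = superpose(reference, sine(100, math.pi, n))  # + with -

print(round(peak(in_phase), 6))      # reinforcement: the peak doubles
print(round(peak(out_of_phase), 6))  # cancellation: the peak collapses towards zero
```

Intermediate phase differences give results between these two extremes.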

This process of superposition can be regarded as linear because it is like algebraic addition, where the term “algebraic” refers to what was just mentioned: the positive and negative values of the wave at each instant combine according to their + or - values, e.g. 5 - 3 = 2. The more troubling issue that arises from the Law of Superposition is that there is no inherent limit to how large a sound wave can grow. In the medium of air, linear addition can theoretically increase indefinitely, hence there is no natural “limiter” to sound levels. However, non-linearity can occur in psychoacoustic processing, as will be discussed in section C.

We have already had two simple examples of the Law of Superposition back in the first Vibration module, so we will repeat those videos here: (1) creating a standing wave in a string as the sum of two waves traveling in opposite directions: (2) adding harmonics together to form a pitched tone with a richer timbre, and greater loudness than a simple sine tone.

Always keep in mind that sound wave combination occurs all the time, but we are unaware of it, other than through our ability to hear multiple sounds at the same time. An alternative way of expressing the Law of Superposition is that two or more waves can travel simultaneously in the same medium by their linear combination. Only in some very particular circumstances does the “interference” pattern become noticeable and audible, for instance because of some regularity in the difference between two waves.

Three results of interference can be easily observed: spatial cancellation or dead spots, frequency cancellation or phasing due to small time delays, and cancellation between waves of nearly identical frequency, resulting in beats. Let's look at each of these in turn.

Constructive and destructive interference patterns are probably most familiar from water waves, because the patterns are readily visible. Take the classic example of two similar stones dropped into a pool, from which waves spread out and interact with each other. Peaks and troughs (the equivalent of the positive and negative pressure parts of a sound wave) reinforce and cancel each other in regular patterns.

These patterns are readily seen in a 3-dimensional representation at right, but they can also be traced in the two-dimensional version at left with the C and D lines for constructive and destructive interference where the peaks are indicated by lines, and the troughs by the empty spaces in between.

With sound, we encountered destructive interference, for instance, in the ground effect described in the Sound-Environment Interaction module, where it referred to the attenuation of sound caused by interference between the direct and reflected waves.

But there is no audible pattern involved here that is similar to the water waves. For that to be experienced, as we also did in that module, we need an enclosed space where standing waves (heard as pitched eigentones) are created. Positions of minimum sound pressure called nodes or dead spots can be experienced in a standing wave, along with pressure peaks called antinodes. These can be experienced because they are stationary, not constantly changing as they would be in an outdoor situation.

16. Personal Listening Experiment: Find an enclosed space where there is a steady drone sound, preferably in the low range. As you walk around the space, notice where there are dead spots with less sound pressure, and hot spots where the pressure is at a maximum. You may find some of these spots at different heights, not just at ear level. What is the difference in your sensory experience between the two extremes? If the sound is quite loud, you may find the dead spots to provide a relief!

A contemporary application of destructive interference is found in noise-cancelling headphones. The intent is to pick up the incoming sound wave and create its out-of-phase version. When combined with the original, some degree of destructive interference will result: hardly the complete cancellation promised, but a significant reduction in sound energy.

Even with the speed of digital processing, the headphones will work best with low frequency, steady noise, such as experienced in an airplane. These low frequencies have long wavelengths, so a slightly delayed opposite phase version will still reduce the overall level when combined, particularly in a steady state sound. Transient or high frequency sounds will be harder to deal with in this manner. However, by lowering the broadband sound level, speech is also likely to be easier to understand (even though one is always asked to remove the headphones during flight safety announcements).
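To see why a small delay matters less at low frequencies, consider a steady sine tone combined with an inverted copy that arrives slightly late. The leftover amplitude is 2|sin(πfτ)| for frequency f and delay τ. The sketch below evaluates this for an assumed 0.1 ms delay, a hypothetical figure chosen for illustration and not a specification of any real headphone.

```python
import math

def residual_db(freq_hz, delay_s):
    """Level of s(t) - s(t - delay) relative to the original tone, in dB.

    For a steady sine tone, the sum of the original and a delayed,
    inverted copy has amplitude 2 * |sin(pi * f * delay)|.
    """
    amplitude = 2 * abs(math.sin(math.pi * freq_hz * delay_s))
    return 20 * math.log10(amplitude)

ASSUMED_DELAY_S = 0.0001  # hypothetical 0.1 ms processing delay

for freq in (100, 1000, 5000):
    print(freq, "Hz:", round(residual_db(freq, ASSUMED_DELAY_S), 1), "dB")
```

Negative values mean the tone is reduced; a positive value means the late "anti-noise" actually adds to the sound, which under these assumptions happens once the delay approaches half a period.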

Phasing has been discussed in other modules as well, for both the acoustic environmental situation and the electroacoustic processing that simulates that phenomenon. Here we will simply note that when a reflected sound combines with the direct signal, constructive and destructive interference patterns are created in which specific frequencies cancel, depending on the delay of the reflected sound. The cancelled frequencies are those that are out of phase by a half wavelength, as shown here at the left for both the 1/2 and 3/2 wavelengths, which are harmonically related as the 1st and 3rd harmonics. The frequency response of phasing is a series of narrow notches in the spectrum called a comb filter, shown at the right.
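The notch positions follow directly from the delay: cancellation occurs where the delay equals an odd number of half periods, that is, at odd multiples of 1/(2τ). The sketch below assumes a reflection as loud as the direct sound (so that cancellation is complete) and a hypothetical 1 ms delay; both are illustrative assumptions.

```python
import math

def comb_gain(freq_hz, delay_s):
    """Magnitude of direct sound plus an equally loud delayed copy."""
    return 2 * abs(math.cos(math.pi * freq_hz * delay_s))

def notch_frequencies(delay_s, max_freq_hz):
    """Frequencies where the delay equals an odd number of half periods."""
    notches = []
    k = 0
    while (2 * k + 1) / (2 * delay_s) <= max_freq_hz:
        notches.append((2 * k + 1) / (2 * delay_s))
        k += 1
    return notches

ASSUMED_DELAY_S = 0.001  # hypothetical 1 ms delay of the reflection

print([round(f) for f in notch_frequencies(ASSUMED_DELAY_S, 3000)])
print(round(comb_gain(1000, ASSUMED_DELAY_S), 3))  # a peak between two notches
```

The notches are evenly spaced in Hz, which gives the comb its regular teeth; a longer delay moves them closer together.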

In this case, it is the narrow notches in the spectrum caused by phasing that are most noticeable, particularly in a moving broadband spectrum, as shown in this video.

Frequency response of phasing

Beats are created when two frequencies that are close together combine, usually less than 10 Hz apart. Again, it is a process of constructive and destructive interference that is involved. In this diagram, two frequencies 10 Hz apart, when combined, show points of destructive interference marked A, and points of constructive interference marked B, corresponding to out of phase and in phase, respectively. The beat frequency (10 Hz in this case) is the exact difference between the two frequencies.

First-order beats between two tones, 10 Hz apart, showing their summation and amplitude modulation pattern

The result is heard as a periodic rise and fall in terms of loudness changes, and the waveform can be described as a form of amplitude modulation. These are called first-order beats, and are frequently used by musicians when adjusting, for instance, multiple strings that need to be in tune with each other.

The accuracy of this is quite striking, because the closer the strings are to being in unison, the slower the beats become and the easier it is to hear the effect of the difference. When heard as separate pitches, despite our sensitivity to pitch, that difference is unlikely to be heard.
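The amplitude modulation can be made explicit with the identity sin A + sin B = 2 cos((A - B)/2) sin((A + B)/2): the sum of two equal-amplitude tones is a tone at the mean frequency whose envelope rises and falls at the difference frequency. The following sketch checks that identity numerically; the function names are illustrative only.

```python
import math

def beat_frequency(f1_hz, f2_hz):
    """Rate of the loudness fluctuation, in beats per second."""
    return abs(f1_hz - f2_hz)

def summed(f1_hz, f2_hz, t):
    """Two equal-amplitude sine tones added together."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)

def product_form(f1_hz, f2_hz, t):
    """The same signal written as a slow envelope times a tone at the mean frequency."""
    envelope = 2 * math.cos(math.pi * (f1_hz - f2_hz) * t)
    carrier = math.sin(math.pi * (f1_hz + f2_hz) * t)
    return envelope * carrier

def envelope_magnitude(f1_hz, f2_hz, t):
    """Size of the envelope at time t (2 = full reinforcement, 0 = cancellation)."""
    return abs(2 * math.cos(math.pi * (f1_hz - f2_hz) * t))

times = [n / 8000 for n in range(8000)]
worst = max(abs(summed(100, 110, t) - product_form(100, 110, t)) for t in times)
print(beat_frequency(100, 110), worst < 1e-9)
```

As the two frequencies approach a unison, the envelope slows down, which is why tuning by beats becomes more sensitive the closer the strings are.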

Beats between 100 and 110 Hz where the difference is gradually reduced
Beats between 1000 and 1004 Hz, heard separately and together (Source: IPO 32)

However, the pitch ascribed to the fused tone with beats is the average of the pitches of the two component tones. In the following example we hear the combination of 200 Hz and a tone that is 1/4 semitone higher, beating together. Then we take out the lower tone, and the pitch rises; then we take out the upper tone, and the pitch falls, showing that when combined, the brain averages the two tones in terms of pitch.
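The arithmetic of this example can be sketched as follows, assuming an equal-tempered semitone (a frequency ratio of 2^(1/12)) and taking the "average" to be the arithmetic mean of the two frequencies. Both are assumptions made here for illustration.

```python
def quarter_semitone_above(freq_hz):
    """Frequency 1/4 of an equal-tempered semitone above freq_hz."""
    return freq_hz * 2 ** (0.25 / 12)

lower = 200.0
upper = quarter_semitone_above(lower)
fused_pitch = (lower + upper) / 2   # assumed: arithmetic mean of the components
beat_rate = upper - lower           # first-order beat frequency in Hz

print(round(upper, 2), round(fused_pitch, 2), round(beat_rate, 2))
```

The fused pitch lies between the two components, so removing either tone shifts the perceived pitch towards the one that remains.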

Beat combination, with each component omitted in turn.

Second-order beats are also very interesting, and produce a much more subtle form of modulation, not in amplitude, but in the phase relationships involved. The mistuning, in this case, is between harmonics, such as in mistuned octave, fifth or fourth intervals, and therefore this phenomenon plays an important role in pitch-based music involving intervals and chords. Let’s hear the effect with just sine tones first.

Beats between 100 and 201 Hz
Beats between the octave, fifth and fourth, keeping the difference at 4 Hz (Source: IPO 32)

You can hear some form of pulsation going on here, but it is not a change in amplitude, and therefore loudness, as is the case with first-order beats. These two diagrams show the waveform involved in the 100 + 201 Hz example, where you can verify there is no change in overall amplitude, but rather a constantly changing waveform.

Second-order beats between a mistuned octave
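One way to check the claim about overall amplitude is to measure the RMS level in short consecutive windows. In this sketch (10 ms windows and an 8000 Hz sample rate are arbitrary choices), the level of the 100 + 110 Hz pair swings widely, while that of the 100 + 201 Hz pair stays almost constant even though its waveform keeps changing shape.

```python
import math

SAMPLE_RATE = 8000   # samples per second (arbitrary choice for this sketch)
WINDOW = 80          # 10 ms analysis window

def two_tone(f1_hz, f2_hz, seconds=1.0):
    """Sum of two unit-amplitude sine tones."""
    n_samples = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * f1_hz * n / SAMPLE_RATE)
            + math.sin(2 * math.pi * f2_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]

def windowed_rms(signal):
    """RMS level of each consecutive 10 ms window."""
    levels = []
    for start in range(0, len(signal) - WINDOW + 1, WINDOW):
        chunk = signal[start:start + WINDOW]
        levels.append(math.sqrt(sum(v * v for v in chunk) / WINDOW))
    return levels

def level_swing(signal):
    """Difference between the loudest and quietest window."""
    levels = windowed_rms(signal)
    return max(levels) - min(levels)

first_order = level_swing(two_tone(100, 110))
second_order = level_swing(two_tone(100, 201))
print(round(first_order, 2), round(second_order, 2))
```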

The explanation involves a constantly changing phase shift, as shown in the following diagrams.

The phase shift between 100 Hz combined with 200 Hz frozen at various phase differences

At the left, starting at the bottom, we see the phase shift in a sine wave for 1/4, 1/2, 3/4 and a full period or wavelength, identified in degrees, namely 0°, 90°, 180°, 270°, and 360° at which point it returns to be the same as 0°. At the right, we combine the 1st and 2nd harmonics with different phases between the two signals. The second-order beats we saw and heard above create a similar situation with the mistuning by 1 Hz – the phase relationship between the two signals is constantly changing as you can see by examining the left-hand waveform above more closely.

The ear is sometimes referred to as being insensitive to phase differences (as we discussed in Fourier analysis where the phase component was ignored), but that is not true when the phase difference is constantly changing, as shown here. In fact, second-order beats are a result of auditory processing in the brain.

An intriguing variation of second-order beats occurs when the mistuned frequencies are presented dichotically to the brain, that is, where one tone is presented to the left ear and the other to the right ear on headphones. These are also known as binaural beats and are known to have enthusiasts who listen to them for long periods of time in order to affect their brain waves.

The principle is entrainment, that is, synching up brain waves (such as alpha rhythms around 8-12 Hz) to the beating pattern. Other listeners interpret the constantly shifting phase differences as interaural time delays (ITD) as we documented in binaural hearing. The impression is that the sound is rotating inside your head! You decide.

Binaural beats between 200 and 401 Hz, presented dichotically


B. Masking and Critical Bandwidth. Now that we have covered how sound waves combine, and the special cases where the combination, one might say, is greater than the sum of its parts, we turn to the psychoacoustic processes that allow us to separate the incoming cumulative sound wave back into its components – or not.

The familiar experience of noticing that a loud sound “covers up” or “drowns out” other sounds is more accurately referred to in Psychoacoustics as masking, with the louder sound being the masker. The level of masking can be determined in a lab as the amount of increase in the masked signal required to restore it to audibility. This level is called the masking threshold.

It is important to remember that masking is a psychoacoustic process that depends on the analysis of sound in the inner ear. Other forms of signal reduction such as attenuation, damping, absorption, interference, etc. are physical acoustic processes of reducing sound energy. Masking is all about audibility, and hence the domain of psychoacoustics.

While it is commonly understood that the masker must be louder in order to mask another sound, what is less well known by the general public is that the masker must also occupy the same frequency range as the masked signal. That specific frequency range is what is called the critical bandwidth, a measure that was introduced in the Vibration module as a crucial aspect of spectral analysis. Before we demonstrate this fact, let’s listen to two recordings, the first where masking does not happen (birds and traffic), and one where, at least partially, the broadband sound of a fountain masks traffic as the recordist approaches it.

Birds and bridge traffic
Source: WSP Van 72 take 3

Approaching a fountain with traffic in the background

The birds, in this case starlings which are quite loud, occupy a much higher frequency range than most of the energy of the traffic, so in the language of the Acoustic Niche Hypothesis, they have their own frequency niche. In the case of a closer match of spectra, such as the fountain and traffic, it is difficult in a recording to show “what’s not there”, or at least not audible, but as the recordist approaches the fountain, the traffic does seem less audible.

This effect has often been cited as an argument for the role of fountains in urban sound design, but this brief example shows how tricky this can be, particularly in terms of low frequencies, which are not significant with the fountain. In indoor situations, low levels of white noise started being included in open-office designs back in the 1970s, and in fact, as we will see shortly, the broadband range of white noise does have this masking capability. However, sounds with sharp attacks, such as phones ringing, will still be audible above or “through” the ambient noise. In fact, all signals are designed for their optimum audibility, so masking is not an easy solution.

17. Personal Listening Experiment. On a soundwalk, try to find examples of masking, either by approaching a constant sound source, or one that is intermittent and will allow you to compare the soundscape with and without its contribution. For sounds that you can still hear in the presence of the masker, what qualities do they have, and conversely, if they are masked, why was that?

You might not expect to find a masking example in symphonic music, but they are plentiful in the works of the American composer Charles Ives. One notable example, which suggests that he must have been inspired by his soundscape experience, is Central Park in the Dark, composed in 1906. There is a sequence (around 5') where the sounds of a jazzy nightclub build into a crescendo and suddenly stop, revealing the ambience created by the string section which according to the score, has been there all the time.

Here is an experiment with different sized bands of white noise designed to mask a 2 kHz tone heard in 10 steps. There are five sequences: (1) the tone by itself; (2) the tone masked by broadband noise; (3) the tone masked by a bandwidth of 1000 Hz; (4) the tone masked by a bandwidth of 250 Hz; (5) the tone masked by a bandwidth of 10 Hz. Count the number of steps you can hear, and note whether it changes as the bandwidth narrows.

2 kHz masked by noise in decreasing bandwidths
Source: IPO 2


The choice of 250 Hz in this example is because at 2 kHz, the critical bandwidth is about 250 Hz. This means that this very narrow bandwidth (1/8 of the centre frequency) is sufficient to result in masking. Therefore you likely heard the same amount of masking in the broadband noise, the 1000 Hz bandwidth and the 250 Hz bandwidth (probably 4-5 steps in each case), but not with the 10 Hz final example.

This result also assumes relatively normal hearing sensitivity at that frequency, and it also means that the group of hair cells in that region along the basilar membrane are all reacting to the same stimulus (tone plus noise) and cannot distinguish the signal in the midst of the noise. In fact, this is a standard method for measuring critical bandwidth.
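The outcome of the listening test can be sketched with a simple power argument: only the part of the noise that falls inside the critical band around the tone contributes to masking it. The sketch assumes noise centred on the tone with the same flat spectral density in every condition, and uses 20 kHz as a stand-in for "broadband"; neither assumption comes from the recording itself.

```python
import math

def in_band_masker_level_db(noise_bandwidth_hz, critical_bandwidth_hz):
    """Noise power inside the critical band, relative to noise that fills it."""
    effective = min(noise_bandwidth_hz, critical_bandwidth_hz)
    return 10 * math.log10(effective / critical_bandwidth_hz)

CRITICAL_BANDWIDTH_HZ = 250  # value quoted in the text for a 2 kHz tone

for bandwidth in (20000, 1000, 250, 10):  # 20000 Hz is an assumed "broadband"
    level = in_band_masker_level_db(bandwidth, CRITICAL_BANDWIDTH_HZ)
    print(bandwidth, "Hz noise:", round(level, 1), "dB in band")
```

Under these assumptions the three wider bands deliver the same in-band power, while the 10 Hz band delivers far less, which is consistent with hearing more steps of the tone in the final sequence.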

Here is a set of diagrams showing how broad the range of frequencies is that are affected by masking for two specific frequencies, and white noise.

Masking levels for two frequencies and white noise

The first two diagrams show the masking capability of two frequencies, 800 Hz and 3.5 kHz, at three different levels (loud, medium and quiet). In all cases the masked range is not symmetrical: there is a tendency to mask higher frequencies more than lower frequencies, shown as a gentler slope above the centre frequency. With broadband white noise, however, masking levels are evenly distributed across the audible spectrum.

The reason for the so-called upward masking goes back to the physiology of the inner ear, where we learned that high frequencies are analyzed at the entrance to the cochlea along the basilar membrane, referred to as the base in these diagrams.

If masking interests you further, you can try three separate experiments here to experience upward masking and what is known as forward and backward masking. Forward masking refers to the masking of a tone by a sound that ends a short time (up to 20-30 ms) before the tone begins, which suggests that recently stimulated hair cells are not as sensitive as fully rested ones. Backward masking refers to the masking of a tone by a sound that begins a few milliseconds later (up to 10 ms). Backward masking apparently occurs at higher centres of processing where the later arriving stimulus of greater intensity overtakes and interferes with the weaker stimulus.

Critical Bandwidth. At this point, we have identified two important roles for the critical bandwidth, which can be thought of as the resolving power of the ear for simultaneous frequencies in a spectrum. This means that timbre perception includes a spectral envelope of resonance regions along the basilar membrane based on hair cells firing that are a critical bandwidth apart. This spatial representation of spectrum analysis is projected onto the auditory cortex as a spatial pattern.

Secondly, we have just demonstrated the role that critical bandwidth plays in masking. Sounds whose spectra are more than a critical band apart will not mask each other, no matter how loud, whereas those that partially overlap will show some degree of masking, and those that overlap completely (such as with a broadband noise) will show a high degree of masking. In other words, the masking sound is stimulating the same group of hair cells more strongly than the sound being masked.

A third role involving the critical bandwidth introduces a somewhat surprising connection between the psychoacoustics of critical bandwidth and the impression of consonance. This was at first surprising because the term consonance had been associated for centuries with musical intervals. Moreover, what was considered consonant in Western musical practice (or its opposite, dissonant) had changed over time and therefore seemed to be culturally determined.

As with most classical psychoacoustics, the tests start with sine tones as opposed to the type of complex spectra found in musical timbres. Here is a basic demonstration where two tones pull apart in frequency, and we hear the following effects as diagrammed below where f1 is a constant tone, and f2 rises upwards from a unison. There are three distinct (but overlapping) impressions, as shown in the left-hand diagram:
- the region of one-tone sensation, first with a beating pattern that represents the small difference in the two frequencies, and whose pitch is the average of the two frequencies (as shown by the skewed line in the middle and beat frequency in the region below 10 Hz)

- the beats become increasingly rougher past about 10 Hz, and the threshold is crossed into a two-tone sensation, referred to on the diagram as the “limits of (frequency) discrimination” ΔfD meaning two tones can be perceived

- the roughness disappears at the frequency difference equal to the critical bandwidth ΔfCB, and the two tones appear to be in a smooth, consonant relationship, as at the end of the following sound example

Frequency discrimination between two tones (left) and critical bandwidth and frequency discrimination (right) correlated to frequency range

Two sine waves, one fixed at 400 Hz, the other ascending from 400 Hz to 510 Hz at which point it is separated from the first by a critical bandwidth
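For readers who want numbers without a lookup table, the growth of the critical bandwidth with centre frequency can be approximated analytically. The sketch below uses the formula published by Zwicker and Terhardt as an assumed stand-in; it is not the source of the values quoted in this module and will differ from them somewhat.

```python
def critical_bandwidth_hz(centre_freq_hz):
    """Analytic approximation of critical bandwidth (Zwicker and Terhardt).

    Used here as an assumption for illustration; tabulated values may differ.
    """
    khz = centre_freq_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69

for freq in (100, 400, 1000, 2000):
    print(freq, "Hz:", round(critical_bandwidth_hz(freq)), "Hz wide")
```

The approximation gives roughly 100 Hz at low frequencies and about 110 Hz near 400 Hz, in line with the 400 to 510 Hz example, and a somewhat larger value at 2 kHz (about 300 Hz) than the 250 Hz figure used in the masking experiment.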

The right-hand diagram above shows how the critical band ΔfCB and the limit of frequency discrimination ΔfD are related (the former being larger) in relation to some common musical intervals. Above 1 kHz, the critical band is a bit less than 1/4 octave, the minor third, and therefore the 1/3 octave spectrogram tends to represent mid to high frequencies the way the auditory system does, and certainly better than the FFT. Below 400 Hz, the critical band stays about constant at 100 Hz. The complete set of values is tabulated in the Handbook Appendix E.

Personal haptic experiment. Get the help of a friend (preferably with a very steady hand) and two very sharp pencils (but not anything similar that could puncture the skin). Have your friend practice holding the two pencils together at different distances apart, including very close together (a fraction of an inch or centimetre). You are going to close your eyes, and your friend is going to lightly touch you with the two points. It is your job to say whether you are feeling one point or two. We suggest you start with your arm, as the points will probably have to be quite far apart before you can get a two-point sensation.

Once you’ve found that distance, have your friend touch various parts of your hand with the points very close together. You can alternate using one or both points, but what will be interesting is how close the points can be for you to get a two-point sensation in that location. You can see the experimental results below for all parts of the body, which show that the hands, face and feet are the most sensitive places for this kind of touch discrimination. We’ll leave it to you and your friend as to whether further exploration is advisable.

Limits of two-point touch discrimination on various parts of the body

In summary, the relationship between critical bandwidth and consonance is shown in the following diagram, where maximum dissonance (or roughness) between two tones occurs at 1/4 critical bandwidth, and maximum consonance occurs at a critical bandwidth. The sense of roughness and dissonance is the result of the same group of hair cells responding to the two tones, whereas consonance, in this psychoacoustic sense, means that different groups of hair cells are firing independently and not “interfering” with each other.

Critical bandwidth correlated with consonance and dissonance

This aspect of the critical bandwidth was pioneered by the Dutch psychoacoustician Reinier Plomp in the 1960s, along with Willem Levelt. In order to address the implications of this finding for musical timbres, these authors followed a similar “additive” pattern that has been well established for Fourier analysis (summing harmonics) and loudness summation. This involves calculating the consonance/dissonance score for each pair of harmonics in two complex tones at varying frequency differences.

Plomp and Levelt calculated these values assuming six harmonics in each tone, with the result, shown here based on 250 Hz (approximately middle C), that the intervals typically regarded as consonant in Western music scored highest in terms of this measure of psychoacoustic consonance.
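The pairwise summation can be sketched as follows. This is not Plomp and Levelt's published data or procedure: the roughness curve is a hypothetical smooth function chosen only to be zero at unison, maximal at 1/4 of a critical band and close to zero beyond one band; the critical bandwidth comes from an assumed analytic approximation (Zwicker and Terhardt); and all partials are given equal weight.

```python
import math
from itertools import combinations

def critical_bandwidth_hz(freq_hz):
    """Assumed analytic approximation, not the tabulated handbook values."""
    khz = freq_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69

def pair_roughness(f_a, f_b):
    """Hypothetical curve: 0 at unison, 1 at 1/4 critical band, near 0 past one band."""
    mean_freq = (f_a + f_b) / 2
    x = abs(f_a - f_b) / critical_bandwidth_hz(mean_freq)
    return (4 * x * math.exp(1 - 4 * x)) ** 2

def interval_roughness(base_hz, ratio, n_harmonics=6):
    """Sum pair_roughness over every pair of partials in two harmonic tones."""
    partials = [base_hz * k for k in range(1, n_harmonics + 1)]
    partials += [base_hz * ratio * k for k in range(1, n_harmonics + 1)]
    return sum(pair_roughness(a, b) for a, b in combinations(partials, 2))

for name, ratio in [("octave", 2 / 1), ("fifth", 3 / 2), ("minor second", 16 / 15)]:
    print(name, round(interval_roughness(250.0, ratio), 3))
```

A lower total means a smoother, more consonant combination. Under these assumptions the octave and fifth come out far smoother than the minor second (16:15, included here as a contrasting dissonant interval); the finer ranking of thirds and sixths depends on the curve and weights chosen.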

Consonance prediction for various intervals based on critical bandwidth

The unison (1:1) and the octave (2:1) have matching harmonics, so by definition, they are the most consonant. The interval of the 5th (3:2) scores next highest, then the major 6th (5:3), perfect 4th (4:3), and the minor and major thirds (6:5 and 5:4 respectively). There might well be disagreement about the exact ranking of these intervals, particularly the thirds and sixths, but the overall pattern seemed to confirm Western practice, despite the reference to Just intonation rather than Equal Temperament.

At first glance, it might seem that this suggests a “universal” justification for Western music practice – a highly suspect claim – so it is worth noting that what this psychoacoustic evidence actually confirmed was more along the lines that timbre affects tuning. In fact, we got a sense of that in the Vibration module experiment where exact octave relationships between sine tones sounded flat, indicating that an enriched musical spectrum influenced pitch judgements.

As a final, suggestive example, we can refer to Indonesian tuning systems (slendro and pelog) which are based on bronze instruments with decidedly inharmonic spectra, and are largely tuned by ear during their production and hence are widely variable in different gamelan orchestras. Although this music has been imitated on Western musical instruments with pitch approximations to the tempered 12-tone scale, something seems to have been lost.


C. Non-linear Combination. Auditory processing, like electroacoustic processing, is not always linear. Note that we are not referring to linear as opposed to logarithmic as applied to a specific parameter such as frequency. In fact, in the inner ear, frequency is processed on a logarithmic scale, where octaves, for instance, are equally spaced, and therefore the coloured spectrograms we use here are also on a log scale.

In this context, think of non-linear as non-proportional, in the sense that the output is not a scaled version of the input. Rather, the output includes elements such as frequencies that are not present in the input, similar to electroacoustic modulation.
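As a minimal sketch of this distinction, the code below passes the same two-tone signal through a linear gain and through a hypothetical distortion with small quadratic and cubic terms, then measures the amplitude at selected frequencies. The tone frequencies, the distortion coefficients and the sample rate are arbitrary illustrative choices and do not model the inner ear.

```python
import math

SAMPLE_RATE = 8000   # samples per second (arbitrary choice for this sketch)
N = SAMPLE_RATE      # one second, so every whole-Hz frequency fits exactly

def two_tone(f1_hz, f2_hz, amplitude=0.5):
    """Sum of two equal-amplitude sine tones."""
    return [amplitude * (math.sin(2 * math.pi * f1_hz * n / SAMPLE_RATE)
                         + math.sin(2 * math.pi * f2_hz * n / SAMPLE_RATE))
            for n in range(N)]

def linear(signal, gain=2.0):
    """Output is a scaled copy of the input."""
    return [gain * v for v in signal]

def nonlinear(signal):
    """Hypothetical distortion: small quadratic and cubic terms added."""
    return [v + 0.5 * v ** 2 + 0.5 * v ** 3 for v in signal]

def amplitude_at(signal, freq_hz):
    """Amplitude of one frequency component, by correlating with sine and cosine."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE)
             for n, v in enumerate(signal))
    im = sum(v * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
             for n, v in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

tones = two_tone(1000, 1200)
for name, output in (("linear", linear(tones)), ("nonlinear", nonlinear(tones))):
    print(name, [round(amplitude_at(output, f), 3) for f in (200, 800, 1000, 1200)])
```

The linear output contains only the two input frequencies, whereas the distorted output also has energy at 200 Hz and 800 Hz, frequencies that were never in the input.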

In this section we will summarize some of the well-known aspects of auditory processing that seem to add tones to the perceived input, known as combination tones. They are examples of distortion in a signal, such as over-driving a loudspeaker and hearing a distorted version of a sound. Similarly, combination tones depend on the intensity of the sound you are listening to, and will likely disappear at lower levels. As such, some people find them annoying to listen to, so you are welcome to reduce their volume in the examples. Also, they are hard to localize in space, and are more likely to seem like they are coming from inside your head.

We will start with one of the more obvious combination tones, namely the difference tone. In this example we hear 1500 Hz combined with 2000 Hz at a fairly strong intensity level, and it’s likely you will hear a buzzy lower tone corresponding to the frequency difference, namely 500 Hz. You might have heard something similar in a blown whistle, such as one used by a sports referee or police officer, at least in times past. These whistles often had two sections that each produced a different pitch.

Combination of 1500 and 2000 Hz, producing a 500 Hz difference tone, heard alone at the end

Combination tones heard between a constant tone f1 and a rising one f2

The above diagram and the sound example below demonstrate that there may be multiple combination tones heard (at sufficient loudness), in this case produced by a constant tone f1 combined with a similar tone f2 that rises an octave above the first one. The difference tone can also be heard rising, but if you hear one or more descending tones, they are combination tones at frequencies 2y - x and 3y - 2x (where y is the constant tone f1 and x is the rising tone f2), all caused by distortions in processing in the inner ear.

As you can see in the spectrogram, they do not exist in the signal itself (where no descending lines occur in the spectrogram) although some other artifacts may be seen there, probably caused by the electronic circuit itself which may exhibit non-linearities as well. The summation tone (x + y) is seldom heard.

A 500 Hz constant tone combined with an ascending tone, from 500 to 1000 Hz producing various combination tones

These final two examples cleverly reveal the presence of combination tones without playing the example very loudly. In the left-hand example, the tones are 1000 and 1200 Hz, hence producing a cubic difference tone (2y - x) of 800 Hz. Next a so-called probe tone at 804 Hz is added and this produces a beating sensation with the combination tone, heard inside the head.

In the right-hand example, similar to the one above, the upper tone rises from 1200 to 1600 Hz and some other combination tones can be heard.
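The arithmetic behind these two examples can be sketched as follows, taking y as the lower tone and x as the upper tone, which is the assignment that reproduces the 800 Hz figure. The function name is illustrative only.

```python
def combination_tones(lower_hz, upper_hz):
    """Combination-tone frequencies, with y the lower tone and x the upper tone."""
    y, x = lower_hz, upper_hz
    return {
        "x - y": x - y,
        "2y - x": 2 * y - x,
        "3y - 2x": 3 * y - 2 * x,
        "x + y": x + y,
    }

tones = combination_tones(1000, 1200)
probe_hz = 804
beat_with_probe = abs(probe_hz - tones["2y - x"])
print(tones, beat_with_probe)

# As the upper tone rises, the 2y - x tone falls.
for upper in (1200, 1400, 1600):
    print(upper, combination_tones(1000, upper)["2y - x"])
```

The probe tone sits 4 Hz above the 2y - x tone, so beating at that rate is evidence that the combination tone is present even at moderate listening levels.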

1000 and 1200 Hz with a probe tone of 804 Hz (source: IPO 34)
The upper tone rises from 1200 to 1600 Hz producing changing combination tones (source: IPO 34)


D. Auditory fusion and streaming. At the level of auditory processing in the brain, one of its key tasks is to identify incoming patterns as belonging to the same sound source, and not to other possible sources. In the previous module, we showed how the precedence effect favoured what arrives first and presumably loudest, with later arriving signals being momentarily suppressed. Similarly, the cocktail party effect described our ability to focus on one stream of sound coming from the same direction. On the other hand, in this module, we have shown how sounds occupying a similar frequency range can mask another quieter sound, including those a few milliseconds before or after.

When an incoming stream of acoustic information produces a singular percept (a term that indicates not just a “response” but a complete perceptual image), we can say that auditory fusion has been achieved. Musical tones comprised of harmonics fuse together easily because of the common periodicity involved, to which pitch is ascribed. Inharmonic sounds, such as those found in bells and metallic sounds, lack this periodicity but still can fuse based on other cues, and be assigned a somewhat arbitrary pitch.

Vocal sound perception, as described in the next module, has the multiple task of identifying a voice (probably based on similar cues for auditory fusion), rapidly detecting a stream of (hopefully) familiar vowel and consonant patterns, and detecting the larger scale patterns of paralanguage (i.e. non-verbal cues) that will also be discussed in the next module.

For more general types of sounds, both musical and environmental, here are some other cues that aid auditory fusion, besides a common direction. The first is onset synchrony. If all of the spectral information begins together, or within a short interval, then the brain will likely recognize it as coming from a single source.

In the following synthesized example, you can hear a sharp, unambiguous attack in the first sound, then two elongations of the attack (up to 100 ms), which then separate into short repeated attacks. Of course if these become very dense and inseparable, then a texture will be identified. Contextual learning will likely assist in distinguishing between two events coming from two sources in the same direction, or one composite sound event with component elements, such as a machine sound.

Onset synchrony leading to a separation of elements
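
As a rough sketch of how a stimulus of this kind could be built (not the synthesis actually used for the example above), the following Python code sums a few harmonics whose onsets are staggered by a chosen amount. The function names, the 200 Hz fundamental, the five harmonics and the 8 kHz sample rate are all assumptions made for illustration.

```python
import math

def staggered_complex(f0=200.0, n_harmonics=5, stagger_ms=0.0,
                      dur_s=0.5, sr=8000):
    """Sum of harmonics; harmonic k starts (k-1)*stagger_ms after the first.

    All parameter values are illustrative assumptions.
    """
    n = int(dur_s * sr)
    out = [0.0] * n
    for k in range(1, n_harmonics + 1):
        onset = int(round((k - 1) * stagger_ms / 1000.0 * sr))
        for i in range(onset, n):
            t = (i - onset) / sr
            out[i] += math.sin(2 * math.pi * k * f0 * t) / n_harmonics
    return out

def onset_spread_ms(n_harmonics, stagger_ms):
    """Time between the onsets of the first and last harmonics."""
    return (n_harmonics - 1) * stagger_ms

sync = staggered_complex(stagger_ms=0.0)     # all harmonics start together
spread = staggered_complex(stagger_ms=25.0)  # onsets spread over 100 ms
```

With a stagger of 0 ms every component begins at once, which corresponds to the sharp, unambiguous attack. Increasing the stagger stretches the attack, and with large enough values the components would be expected to separate perceptually.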

Many of the perceptual grouping principles we are going to encounter in the rest of this module are related to Gestalt perception, an approach that was common in psychophysics in the 1920s and 1930s, but predominantly restricted to visual examples. At that time, far less methodology was available for analogous work with aural perception than there is now, so it is not surprising that Gestalt principles are once again being called upon for auditory perception.

One of these principles is called common fate. Elements that share a common characteristic are more likely to be grouped together into a singular percept. In musical ensembles, the aim is usually to blend similar voices or instrument groups together, through practices such as positioning the performers close together, rehearsing synchronous attacks, maintaining the same pitch tuning, and balancing loudness levels. These practices ensure blend and fusion, and also enhance the overall volume and timbre, referred to as choral effect. Micro variations, within perceptual and performance limits, still occur, but the global effect is an enriched sonic grouping, unlike amplification of a single instrument or voice, which will lack the same internal dynamics in the overall sound.

In this example, based on John Chowning’s work at Stanford in vocal synthesis, we hear three examples of a fused spectral complex, each beginning on a single pitch, then with other spectral elements added, out of which a female singing voice miraculously appears. Why?

Because the vocal formants in the singer’s voice suddenly have the same synchronous vibrato. Therefore, by common fate, we readily identify one singer, not a chord or timbre. In the fourth and last instance, the three pitches combine into a chord, all spectral elements are added, and then an independent rate of vibrato is added to each of the three voices, so that we hear a sung chord with three singers.

Synchronous vibrato with a singer
Source: Cook exs. 73-76
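
The common fate cue can be illustrated numerically. The sketch below is not Chowning's synthesis method. It only shows that when one vibrato modulates every partial, the partials keep moving in lockstep (their frequency ratios stay fixed), whereas separately modulated partials drift apart. The function names, the 1% vibrato depth and the rates are hypothetical choices.

```python
import math

def shared_vibrato_freqs(f0, n_partials, t, rate_hz, depth=0.01):
    """Partial frequencies at time t when ONE vibrato modulates them all."""
    mod = 1.0 + depth * math.sin(2 * math.pi * rate_hz * t)
    return [k * f0 * mod for k in range(1, n_partials + 1)]

def independent_vibrato_freqs(f0, rates_hz, t, depth=0.01):
    """Partial frequencies at time t when each partial has its own rate."""
    return [(k + 1) * f0 * (1.0 + depth * math.sin(2 * math.pi * r * t))
            for k, r in enumerate(rates_hz)]

def stays_harmonic(freqs, tol=1e-9):
    """True if every partial is an integer multiple of the lowest one."""
    base = freqs[0]
    return all(abs(f / base - (k + 1)) < tol for k, f in enumerate(freqs))

fused = shared_vibrato_freqs(220.0, 4, t=0.05, rate_hz=5.0)
split = independent_vibrato_freqs(220.0, [4.5, 5.0, 5.5, 6.0], t=0.05)
```

Here `stays_harmonic(fused)` holds at any time `t`, while `stays_harmonic(split)` generally does not. This is one simple way to express why a shared modulation favours hearing a single source.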

Although this next example would not count as auditory fusion, we might refer to it as “soundscape fusion” in that it mixes together several separate elements recorded at separate times in Vancouver harbour, into a plausible soundscape portrait of that environment. A listener recognizes the semantic appropriateness of a boat horn, waves, seaplane, and a Vancouver soundmark called the O Canada horn. Of course a local resident will know that the O Canada horn sounds at noon, so if we combined it with the Nine o'clock gun, which sounds every evening, the listener might recognize the inconsistency and question the veracity of the recording. Others might not.

Vancouver harbour mix
Soundscape Vancouver,
CSR-2CD 9701

On the other hand, listeners familiar with media productions often ignore blatant aural inconsistencies. An actor’s voice may well have been recorded in a studio and exhibit boomy room resonances, and yet the scene being projected is clearly outdoors.

Auditory streaming. Once a specific sound can be identified as a percept, it can be grouped together with similar sounds that flow into what psychoacousticians call an auditory stream. Again, certain Gestalt principles seem to apply. One has to do with occlusion, which is usually illustrated with a visual example. The following messy visual display seems to make no sense until we can identify an occluding object in front of it (as shown at the link).

A pattern that only makes sense when we see the occluding object here.

Something similar can happen when an auditory stream is momentarily masked by another intermittent sound. We assume (not just logically, but perceptually) that the masked sound continues its trajectory “behind” the masking element. In this example, we hear an up and down glissando with gaps that are then filled in with noise, and suddenly the gaps in the glissando disappear. This is followed by a violin with similar gaps that are filled in with noise.

A glissando and a violin with gaps filled in with noise bursts (source: Cook 5)

Of course in the acoustic world, an occluding object, even a building, does not prevent some frequencies (usually the lower ones) from diffracting around it, so this aural version works best with intermittent masking sounds which don’t appear to disrupt the ongoing auditory stream percept.

Two of the main parameters for auditory stream formation with pitched tones are the tempo of the tones (the number of tones per second), and the frequency distance between successive tones. The left-hand diagram shows that if the frequency separation is fairly large, tones A & B will group because of their proximity, as will X & Y. However, when the frequency difference is less, then A will group with X, and B with Y in two streams.

Tone grouping based on proximity (left), and tone segregation dependency on frequency distance and tempo (right)

The right-hand diagram shows that as the frequency difference gets larger, tones are always segregated, that is, they do not form groups, a condition also referred to as fission. Conversely, tones that are only a semitone apart, like a trill, are always coherent at any tempo. In between is an ambiguous range where your own perception determines whether the sequence is coherently streamed.

The following is a classic example by Leon van Noorden, in which a repeated ABA sequence of tones at a fast tempo (10 tones/sec) starts with the tones an octave apart, after which the higher tone descends. The sequence first shows segregation (i.e. fission), where you can pay attention to the high descending tones or to the steady low tones. But then, as they approach each other, a “galloping” rhythm groups the tones together in rhythmic triads.

It is interesting to try to hold onto that grouping as the descending tones get farther apart again – but inevitably you have to give way to fission in the pattern. Likewise, when the tones start ascending again. Have some fun with this example.

Tone sequence progressing from fission to coherent streams and back
Source: IPO19
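
For readers who want to experiment, here is a minimal sketch of how the note list for such a sequence might be generated. It is not the original demonstration. The 400 Hz low tone, the one-semitone step per group, the high-low-high ordering and the function name are assumptions. Only the 10 tones/sec tempo and the starting octave come from the description above.

```python
def aba_sequence(low_hz=400.0, start_semitones=12, end_semitones=1,
                 tones_per_sec=10.0):
    """Return (onset_seconds, frequency_hz) pairs for repeated high-low-high
    groups in which the high tone descends one semitone per group.

    Every default except the tempo and the starting octave is an assumption.
    """
    step = 1.0 / tones_per_sec
    events = []
    t = 0.0
    for semis in range(start_semitones, end_semitones - 1, -1):
        high_hz = low_hz * 2 ** (semis / 12)
        for freq in (high_hz, low_hz, high_hz):
            events.append((round(t, 3), round(freq, 1)))
            t += step
    return events

events = aba_sequence()
```

The resulting list can be passed to any tone generator. Changing `tones_per_sec` or the semitone range is a simple way to explore the ambiguous region between fission and coherence for yourself.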

A subtler example combines auditory fusion with coherence. A sine tone A alternates with a complex tone C based on a fundamental of 200 Hz. When tone A does not correspond to one of the harmonics of the complex tone, it is segregated out and has its own slower repetition tempo. But as it ascends, it comes into tune with the 2nd and 3rd harmonics of the complex tone, and if you listen carefully at these points, you can hear the tempo of tone A double as it streams with those harmonics.

This is an interesting case that sheds some light on the larger question of whether any tone is heard as a pitch or part of a timbre. The surprising thing here is that the tone can be a harmonic and be heard apart from the typical fused harmonic spectrum.

A sine tone alternates with a complex tone and streams with a harmonic when it matches its pitch
Source: van Noorden
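
The arithmetic behind this example is simple: with a 200 Hz fundamental, the 2nd and 3rd harmonics lie at 400 Hz and 600 Hz. The following sketch checks whether a given sine tone coincides with a harmonic of the complex tone. The 20-cent tolerance, the limit of ten harmonics and the function name are assumptions, not values taken from the demonstration.

```python
import math

def matching_harmonic(tone_hz, f0_hz=200.0, n_harmonics=10, tol_cents=20.0):
    """Return the harmonic number that tone_hz coincides with, or None.

    tol_cents is an assumed tolerance for 'coming into tune'.
    """
    for k in range(1, n_harmonics + 1):
        cents = 1200 * math.log2(tone_hz / (k * f0_hz))
        if abs(cents) <= tol_cents:
            return k
    return None

print(matching_harmonic(400.0))  # 2nd harmonic
print(matching_harmonic(600.0))  # 3rd harmonic
print(matching_harmonic(500.0))  # no match: the tone is segregated
```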

Another fun demo of pitch streaming based on proximity is when two common melodies are interleaved, that is, their notes alternate. If they are initially in the same range, nothing can be identified, but as the second melody is transposed higher and higher, bits of each melody start appearing until the separation is complete, and you can listen to whichever melody you like.

Two interleaved melodies that separate
Source: van Noorden
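
The construction of such a demo can be sketched in a few lines. The note lists below are arbitrary placeholders written as MIDI-style note numbers, not the melodies used in the recording, and the function names are hypothetical. The sketch interleaves two melodies and reports whether their pitch ranges still overlap after the second is transposed.

```python
def interleave(melody_a, melody_b, transpose=0):
    """Alternate the notes of two melodies, transposing the second one."""
    out = []
    for a, b in zip(melody_a, melody_b):
        out.extend([a, b + transpose])
    return out

def ranges_overlap(melody_a, melody_b, transpose=0):
    """True if the two melodies share any part of their pitch range."""
    shifted = [n + transpose for n in melody_b]
    return min(shifted) <= max(melody_a) and min(melody_a) <= max(shifted)

# Placeholder melodies (assumed values, in semitone note numbers)
melody_a = [60, 62, 64, 60]
melody_b = [60, 60, 67, 67]

mixed = interleave(melody_a, melody_b)                 # same range
separated = interleave(melody_a, melody_b, transpose=12)
```

Non-overlapping ranges are only a crude stand-in for what the ear does, but they capture the idea that separation becomes easier as the transposition grows.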

In the music for solo instruments in the Baroque period, it was customary to imply two or more different melodies at the same time, as long as they were sufficiently separated in pitch. Usually this intention was not indicated in the score, as in this simple Minuet from Bach’s Partita No. 1 where in fact all of the notes are articulated as having equal value in the score. However, by sustaining or articulating particular notes differently, they will stream into their own melody, as indicated above the score for the right-hand opening melody.

Bach Minuet from Partita No. 1

This kind of streaming can be heard in various types of African drumming, particularly styles from Ghana, which have so-called interlocking parts on different drums and percussive instruments. Each part is relatively simple, but together they form a rhythmic complexity where you can listen to the overall pattern, or just a component.

This type of music inspired Steve Reich in his early work, such as Drumming (1970-71) which, despite being labelled as minimalism, actually developed an internal complexity which allowed listeners the freedom to listen to internal patterns that could be heard as separate and evolving streams.


E. Cognitive processing and hemispheric specialization. The structure of the brain and its neurological functioning is beyond our scope, but we can outline some basic features that help us to understand how sensory information is processed there.

It has long been known that different areas of the brain have their own specializations, and that injury (e.g. lesions) to one part may show an impairment of specific functions. This diagram shows the location of some important areas, including the Auditory Cortex, with a reminder that the spatial map of the frequency analysis of incoming signals, as performed by the hair cells in the inner ear, is projected onto the auditory cortex with that spatial display intact.

We also know that multiple areas of the brain are stimulated by incoming auditory stimulation, even during sleep. The following PET scan shows activity in at least two areas, the auditory cortex and the frontal lobe which controls executive functions such as attention and motor functions. In this case, the brain is responding to tones while in a deep state of sleep.

We know the anecdotes about a mother being wakened by her baby’s cry, or some similarly salient sound, and here is the modern understanding of that phenomenon. If this can happen during sleep, then it’s clear that more complex communication between visual, linguistic and other areas of the brain will reflect normal functioning during waking consciousness.

PET scan of the brain during sleep in response to tones

One of the most intriguing aspects of brain behaviour is the role of the two hemispheres, left and right, joined by a massive set of fibres known as the corpus callosum, shown in the visual diagram below, that allows the hemispheres to communicate with each other.

Most motor and sensory input is processed in what is called a contralateral mode; for instance, the left hand is controlled by the right hemisphere, and the right hand by the left hemisphere. Visual information is processed that way as well, keeping in mind that both sides of the visual field, left and right, go to both hemispheres, after being reversed in the eye itself. Smell is one of the few ipsilateral modalities, that is, the left and right nostrils are each connected to the hemisphere on the same side.

Contralateral control of the hands

Examples of contralateral brain control mechanisms for handedness and vision

It is important to remember, however, that auditory information from both ears goes to both hemispheres as well, but with what is generally estimated as a 60/40 split favouring the contralateral side. In other words, each hemisphere gives a small preference to auditory input from the opposite ear.

Hemispheric specialization (or cerebral dominance) refers to particular processes that each hemisphere specializes in, although it should be kept in mind that either hemisphere can learn any of these tasks. In general, the left hemisphere is involved with language and logical functions, particularly for right-handed subjects, keeping in mind that handedness is contralateral.

In left-handed subjects, however, language functions seem equally divided between the left and right hemispheres. Broca’s area is associated with speech production, and Wernicke’s area with speech comprehension, and these are usually found in the left hemisphere. As a result, the left hemisphere has often been referred to as “dominant”, but that designation can be misleading.

The right hemisphere, in contrast, specializes in visual, spatial and holistic analysis, and has been mistakenly credited with handling musical activities. This misconception arose in the 1960s and 1970s when studies with musically untrained subjects showed a left-ear preference for music recognition tasks when presented dichotically (i.e. separate signals going to each ear via headphones). What was more likely happening was that melodic patterns were being listened to as a whole, basically as shapes.

When more refined tests in the 1970s compared music students with their musically untrained counterparts, the music students, trained to be analytical about pitch and intervals, showed a right-ear preference. Interestingly enough, when those tests were carried out with music professionals, they showed that the professionals could use either holistic or analytical modes of listening about equally, and could probably readily switch back and forth between them.

Therefore, simplistic notions that language is “in” the left hemisphere, and music “in” the right (and where would that leave environmental sounds?) need to be abandoned in favour of listening strategies that are characteristic of each hemisphere.

Diana Deutsch, the British-American psychologist working at the University of California in San Diego, became well known for her studies in music perception based on hemispheric specialization. A striking example is how simple tone patterns heard dichotically (i.e. fed separately to each ear) are “re-arranged” in the brain according to the Gestalt principles we have illustrated above.
Listening Experiment. Listen to this recording of melodic tones on headphones and describe the melodic patterns you hear. Does it sound like pattern A (a descending line in one ear and an ascending one in the other), or pattern B (a mirrored down/up and up/down in each ear)?
Tonal pattern in each ear
Source: Diana Deutsch

The most common percept is pattern B, the down/up pattern in one ear, and the up/down pattern in the other. Two crossing lines, as in pattern A, are very difficult to hear as such unless the sounds involved are very different, and so this pattern has traditionally been avoided in musical counterpoint. In fact, pattern B is not the stimulus pattern at all! Listen to this example of the left ear signal only, then the right ear signal only.
Left signal followed by right signal

Deutsch described the effect as the brain “re-organizing” the input into a pattern consistent with maximum proximity between the notes. As you heard, the actual “melody” is very angular, jumping between higher and lower tones, so the auditory system re-assigns them to each ear in more coherent step-wise patterns.
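
A toy model can make this re-organization concrete. The sketch below is an illustration under stated assumptions, not Deutsch's actual stimulus or analysis. It assumes an ascending and a descending scale (given as note numbers) whose notes swap ears at every step, and it models the percept simply by assigning the higher of each simultaneous pair to one line and the lower to the other.

```python
# Assumed note numbers for an ascending scale and its mirror image
ascending = [60, 62, 64, 65, 67, 69, 71, 72]
descending = ascending[::-1]

# Assumed stimulus: the two lines swap ears on every note
left = [descending[i] if i % 2 == 0 else ascending[i] for i in range(8)]
right = [ascending[i] if i % 2 == 0 else descending[i] for i in range(8)]

# Toy percept: regroup each simultaneous pair by pitch proximity
upper = [max(l, r) for l, r in zip(left, right)]
lower = [min(l, r) for l, r in zip(left, right)]

def largest_leap(line):
    """Largest interval, in semitones, between successive notes."""
    return max(abs(b - a) for a, b in zip(line, line[1:]))

print(largest_leap(left), largest_leap(upper))
```

In this model each ear's actual signal contains leaps of up to 10 semitones, whereas the regrouped lines never move by more than 2. The upper line goes down then up and the lower line goes up then down, which corresponds to pattern B.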

In case you’re thinking this is a modern invention of digital synthesis, the effect was used by Peter Tchaikovsky in the last movement of his Symphony No. 6 in 1893. We hear a descending line in the violins, but in fact, neither the first nor the second violins play that melody, as you can see from the score below. They play an angular version of it in which the notes of the descending line are assigned alternately to each group. Instead of our dichotic listening experiment with the left and right ears, the two violin groups in Tchaikovsky’s day were seated on opposite sides of the stage.

Tchaikovsky, Symphony No. 6, 4th movement theme

Now, one final example that is often referred to as an auditory illusion. Researchers who study optical illusions generally don’t like the term “illusion” because the observed effect is real enough, and they would say, it merely reflects how we see. So, likewise with this auditory example.

In this last example, there is an element of competing Gestalt principles, so let’s illustrate that in the visual domain first. In case A, do you see a line of 8 dots, or 4 groups of 2? We readily see 4 groups, because of the principle of proximity, just as with the melodic tones. In case B, do you see 4 lines coming together or two lines crossing? Again, the latter, because of what’s called good continuation. Now, in case C, we test which of those principles is stronger when we try to identify the dot being pointed to. Is it part of the vertical line or the diagonal? Most people would say the latter, so the proximity effect seems to be weaker.

Again, headphones are required. High and low tones are heard alternating. Once you fixate on a particular percept, switch the headphones around. Does the pattern change? There is no right or wrong answer, but check out the commentary here.

Alternating tones (source: IPO 39)

F. The acoustic community and the Acoustic Niche Hypothesis. Most of the examples above dealt with speech and music, so we need to turn our attention to the larger context of how sounds interact with each other, which is at the level of the soundscape and bio-acoustic habitats in general. To do this, we will follow the emergence of these concepts in the work of the World Soundscape Project (WSP) and bio-acoustician Bernie Krause in the 1970s, leading to the present day.

An important step for the WSP came with the 24-hour “Summer Solstice” recordings in June 1974, on the rural grounds of Westminster Abbey near Mission, BC, where the birds and frogs around a pond formed an ecological micro-environment. An edited version of these recordings consisting of about 2 minutes per original hour was broadcast on CBC Radio in stereo as part of the radio series Soundscapes of Canada in October 1974, the other programs being based on the cross-Canada recording tour the previous year.

WSP recording setup at Westminster Abbey, Mission, BC, June 1974, on the summer solstice

Bruce Davis recording with the stereo Nagra machine, June 1974

The time compression for each “hour” was transparently achieved with editing, with a short verbal announcement of the time being the only commentary. Therefore, over the course of the hour-long program, a listener could experience the soundscape of an entire day at this site, something that would be physically impossible for an individual to do.

Solstice recording 3 to 8 am
from Program 5, Soundscapes of Canada

Spectrogram of the period just before and after dawn (3:30 am)
showing frequency niches for frogs and birds

Even though the group had little training in bio-acoustics, the aural “balance” of the soundscape was impressive. Species soundmaking (mainly birds during the day and frogs at night) was always dynamically changing, but in such a way that there was no conflict in competing sounds. The two key moments at sunrise and sunset involved dramatic transitions from birds to frogs and back again, but their patterns evolved naturally without colliding. A chart of the 24-hour evolution of the soundscape is shown here.

24-hour log of sounds heard during the summer solstice recordings, June 1974, by the WSP

For a group that had just finished a study of the Vancouver soundscape, with its typically chaotic noise environments, it was impressive to listen to a soundscape that seemed fully integrated and functional. Most importantly, it gave a purpose to the field tour in Europe the following year, which was to find examples of villages in five different countries where the soundscape of a social and cultural unit the size of a village could provide an analogous, balanced eco-system, one that came to be modeled as an acoustic community.

The result of the study was the document Five Village Soundscapes, where very different types of villages in Sweden, Germany, Italy, France and Scotland were studied in 1975 as to how sound played a pervasive and positive role in the life of the community, even as they were subject to differing degrees of modernization that might challenge that positivity.

The villages included a modern manufacturing village in southern Sweden (Skruv, where a glassworks, brewery and two other industries were located), a traditional farming village in Germany (Bissingen, where a textile factory was also operating), a fishing village in Brittany (Lesconil), a mountain village in northern Italy (Cembra, experiencing declining population as people migrated to the city), and an academy town in Scotland (Dollar, on major highway and rail routes, but with strong links to the past).

From the sparse Scandinavian village soundscape, to the French and Italian ones centred on human activity, and those experiencing the effects of industrial and technological trends, these villages presented a range of examples where sound still played a largely positive and pervasive role in the definition and experience of the community. More detail on this study, and the Finnish update in 2000, is available on the WSP Database.

The WSP study drew several conclusions about the character of the acoustic communities they studied, some of which were identifiable within the language of ecology. The first was the variety of sonic “species” that were heard in these villages. Instead of the soundscape being dominated by a few, acoustically powerful sounds, these communities exhibited a wide range of sounds that could be heard on soundwalks throughout the village. The writers noted that this conclusion was rather the opposite of the cliché that life in the modern city is “complex” – culturally perhaps, but not necessarily aurally when a few technological sounds dominate.

The second characteristic of the acoustic community was the complexity of information gleaned by the inhabitants from everyday sounds. This was often noted by the researchers, but could only be explained by the inhabitants who had the requisite knowledge of context, culture and history. Although the experience of sound in such a community is shared, the interpretations and reactions of the inhabitants can be quite varied, depending on personal experiences, preferences, and their relationships to the community’s power structures.

Finally, a third, more explanatory criterion was identified, namely how the variety of sounds was maintained in an ecological balance. What kept the variety (and magnitude) of sounds from overwhelming each other and producing a chaotic soundscape?

Acoustic communities, when regarded on a macro level according to the WSP model, seem to have evolved (as in the examples described above) according to several balancing factors related to physical space, time (e.g., rhythms and cycles) and social practice. In many cases, economic, social and cultural factors have determined the physical design and layout of a village, town or city, but each decision has an acoustic impact, and so it might be more accurate to say that there is a co-evolution between acoustic and cultural developments. Clearly both aspects need to be examined together as part of what we are calling an ecological system.

During this same period, an electronic musician named Bernie Krause in California began recording natural soundscapes. Unlike biologists who study specific species, and therefore try to record their sounds individually, Krause recorded soundscapes as a whole and became fascinated with their complexity. Once he was able to perform a spectrographic analysis of these recordings in the early 1980s, he could confirm some of his aural impressions about that seemingly organized complexity.

On a spectrogram where frequency is on the vertical axis, time on the horizontal, and higher amplitudes are shown as darker lines, he could see that species’ soundmaking fell into separate non-overlapping frequency bands, somewhat similar to how the radio spectrum is divided into simultaneous broadcast frequencies.
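
The idea of non-overlapping bands can be expressed as a simple check. In the sketch below, the band edges are hypothetical values chosen for illustration, not measurements from Krause's recordings, and the function name is our own. The code lists every pair of sources whose frequency bands overlap.

```python
def overlapping_pairs(bands):
    """Return the pairs of sources whose frequency bands overlap.

    bands maps a source name to a (low_hz, high_hz) tuple.
    """
    names = sorted(bands)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            low_a, high_a = bands[a]
            low_b, high_b = bands[b]
            if low_a < high_b and low_b < high_a:
                pairs.append((a, b))
    return pairs

# Hypothetical band edges in Hz, for illustration only
habitat = {"frogs": (500, 2000), "birds": (2500, 8000)}
with_noise = dict(habitat, traffic=(50, 1500))

print(overlapping_pairs(habitat))     # no shared bands
print(overlapping_pairs(with_noise))  # traffic intrudes on the frog band
```

In this toy habitat the two species occupy separate niches, and adding a broadband low-frequency source creates an overlap with one of them. That is the kind of intrusion into the acoustic channel that the hypothesis draws attention to.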

This insight has come to be called the Acoustic Niche Hypothesis (ANH), and is accepted by landscape ecologists as a key component model of what can be called an acoustic habitat. It is clear that a well functioning habitat depends on acoustic information that is shared among different species co-habiting the eco-system. And, as with all habitats, it is potentially vulnerable to change or threats to its existence. Such threats, in the form of excessive noise that blocks the acoustic channel, are also being found in marine soundscapes, where sound travels over very long distances and where a great deal of ship traffic and other sound sources can disrupt the communication on which species depend.

More recently, the Italian musician and ecologist, David Monacchi, has systematically recorded the soundscapes of subtropical regions and discovered further evidence about the role of the ANH in those acoustically rich habitats, as part of his Fragments of Extinction project. He has illustrated the spectra of those habitats on a real-time spectrogram presentation, using the SpectraFoo software, similar to what we have been using here (in fact it was his use of spectrograms that inspired our own).

You can see and listen to a variety of examples by Krause and Monacchi here.

By applying this approach to our problematic urban spaces, we can learn from the bio-acoustic examples, as well as those we have characterized as acoustic communities, that sound constitutes an aural habitat. To be functional, however, that habitat needs to operate on a human scale, by which I mean that it is populated by sounds whose acoustic properties are in a similar range to those we make as humans, as well as those of other species. Here is a brief example of an urban development that seems to have succeeded in this regard.

Granville Island, Vancouver
Source: WSP VDat 5, take 17

Granville Island spectrum showing a balance of frequency bands

A balance in the spectrum, loudness and temporal behaviour of all of the component sounds is required to make the system function in a sustainable manner. Of course, a certain number of sounds and events falling outside that range (but within safe limits) can be accommodated within the system without jeopardizing it, and may in fact be necessary if they are expressions of social cohesion. But the aural pathways must be kept open so that listening can guide us.


Q. Try this review quiz to test your comprehension of the above material, and perhaps to clarify some distinctions you may have missed.