An ecological approach to composition
The limitations of the early Gibsonian approach were acknowledged by Michaels and Carello (1981, 168). "Inattention to algorithmic concerns is a problem of resources rather than a systematic bias against that particular class of scientific questions. Ecological psychologists recognize that the identification of the algorithms that are embodied in living tissue is a necessary part of a full theory of knowing." It took nearly twenty years to bring together well-known digital signal processing (DSP) techniques with the relevant theoretical ideas. But it seems the time is ripe for an algorithmic implementation of the concepts put forth by ecological psychologists.
Surprisingly, a few simple techniques account for varied and flexible sound synthesis and processing methods. The ecological methods rest on two pillars: generic physical models and meso-pattern control of granular sample pools. The basic techniques are further extended by placing synthesized sounds within the context of soundscape recordings, and by parametric transformation and convolution of granular samples. Probably the biggest advantage of the ecological approach over other processing methods is the ability to work independently at several temporal levels on a single sound source. This is outside the scope of DSP techniques that use a linear time line as the basis of their algorithms (Lynn & Fuerst, 1994; Orfanidis, 1996).
We have discussed the advantages and limitations of ecological models and their implementation details in Keller & Truax (1998), Keller & Rolfe (1998), and Keller (1998c). Thus, we will concentrate on compositional methods here and bring up technical issues only when they are directly relevant. So let us dive into the secrets of the ecological approach to composition.
Our compositional method relies on using recognizable sound sources, keeping a consistent spatial placement of the sound material and applying ecologically-feasible transformations on the sources. These transformations provide the basic compositional strategies of our method. Thus, the minimal compositional element is the sound event, an ecologically meaningful unit. The macro-organization of the material is a result of interactions of events at a meso and micro level. Finally, meso and micro level transformations consist of a single ecologically-based process: model interaction.
Within a constrained parameter space, excitation patterns interact with resonant systems. This process provides material which is consistent with real-world sound production. The temporal organization of the events is defined by time-patterns that occur, or at least might occur, in real-world situations, such as scraping, pouring, etc. The spectral transformations are done by means of resonant structures that fall within common broad classes: woods, metals, glasses, strings.
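As a minimal sketch of this model interaction - in Python rather than the Csound used later in the paper, and with all names and parameter values chosen purely for illustration - an excitation pattern can be fed through a damped two-pole resonator standing in for one of the broad resonant classes:

```python
import math

def resonator(excitation, freq, decay_t, sr=44100):
    # Two-pole resonator: rings at `freq` (Hz) and damps over roughly
    # `decay_t` seconds, a crude stand-in for a wood/metal/glass/string body.
    r = math.exp(-1.0 / (decay_t * sr))              # pole radius sets damping
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sr)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in excitation:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# An ecologically plausible excitation pattern: three discrete impacts.
sr = 44100
excitation = [0.0] * sr                              # one second of silence
for t in (0.0, 0.31, 0.55):                          # impact onsets (seconds)
    excitation[int(t * sr)] = 1.0

sound = resonator(excitation, freq=880.0, decay_t=0.1, sr=sr)
```

Because the pole radius is below one, every ringing component decays after each impact, which is exactly the constraint that keeps the output consistent with real-world sound production.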
Defining the object of our study has proven to be a tricky task. McAdams (1993) talks about ‘nonverbal’ sounds, including ‘natural’ ones. Handel (1995) talks about ‘objects’ and ‘events’ interchangeably. Bregman (1990, 10) defines ‘stream’ as the perceptual representation of an ‘acoustic event’ or ‘sound.’
There has been some confusing use of the terms ‘source’ and ‘cause.’ Young (1996) and Smalley (1993) identify complex sound-objects as responses of a source, a physical vibrating object, to causal energy, or excitation. We prefer to avoid this usage because it goes against the standard subtractive synthesis approach, "a sound source feeding a resonating system" (Moore, 1990), and against the widely-used model of speech production.
Computer interface work (Gaver, 1993; Darvishi et al., 1995) treats environmental or everyday sounds as ‘auditory icons’ or ‘earcons.’ Presumably these icons are useful to convey information in software interfaces by reproducing the sound behavior of objects being excited in different ways.
All these definitions are too general to characterize specific classes of sounds which can be linked to perceptually relevant parameters or algorithmic models. An early experiment by Lass et al. (1982) uses four types of environmental sounds: (1) human, (2) musical, (3) inanimate, and (4) animal. They hypothesize a relationship between exposure to the sound and subjects’ accuracy in identification tasks. Gaver (1993) suggests three classes: (1) vibration of solids, (2) motions of gases, (3) impacts of liquids. Ballas (1993) picks up Lass’s concept to include stimulus properties within his sound categorization scheme, coining the term ‘ecological frequency.’ This is a measure of how often a sound occurs in a given subject’s daily life. Within the limitations of his study, he obtains four classes: (1) water sounds, (2) signaling sounds, (3) door sounds and modulated noise sounds [sic] - he probably means amplitude-enveloped white noise - and (4) sounds with two or three transient components.
From a musical point of view, Wishart (1996, 181) proposes two dimensions for classification of complex sounds: (1) gestural morphology and (2) intrinsic morphology. Since these correspond to the source-filter model, where (1) is the excitation and (2) is the resonance, we are again within a standard approach. Wishart (1996, 178) divides the types of sounds generated by gestural - imposed - morphology into: (1) continuous, (2) iterative, and (3) discrete; but when it comes to the intrinsic morphology there is an explosion in the variety of classes.
To be able to link the sounds to the actual objects that generated them we have proposed a loose classification scheme (Fig. 1) which includes Wishart’s types and takes into account the difference between excitation and resonance mechanisms in sound production (Keller, 1998c). This classification, being based on the sound phenomena and not on their causes, suits the needs of a synthesis guideline but would need to be refined to account for auditory constraints.
Figure 1. A broad classification scheme for environmental sounds.
By defining two classes of interaction between excitation sources and resonating bodies, i.e., single or multiple excitations, we establish a clear way to link physical models and granular synthesis with mundane sounds. All terrestrial objects suffer energy loss. After a resonating body is excited, it (usually) damps higher frequencies first, then lower frequencies, until reaching zero amplitude. No acoustic object sustains unchanging sinusoidal components indefinitely, or increases in amplitude after the attack transients have ended. Physical modeling provides an appropriate framework to deal with single-excitation sounds, while granular synthesis is best suited for multiple-excitation ones.
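The multiple-excitation case can be sketched as follows - a hypothetical Python illustration, not the paper's Csound code, with invented parameter ranges: short grains scattered in time by a stochastic meso-level pattern, as in rain or fire.

```python
import math, random

def grain(dur, freq, sr=44100):
    # One micro-level event: a sinusoid under a raised-cosine envelope.
    n = int(dur * sr)
    return [math.sin(2.0 * math.pi * freq * i / sr)
            * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
            for i in range(n)]

def granular_stream(total_dur, mean_gap, sr=44100, seed=1):
    # Multiple-excitation texture: grains scattered in time by a
    # Poisson-like (exponential inter-onset) meso-level pattern.
    random.seed(seed)
    out = [0.0] * int(total_dur * sr)
    t = 0.0
    while t < total_dur - 0.05:              # leave room for the last grain
        g = grain(dur=random.uniform(0.01, 0.04),
                  freq=random.uniform(2000.0, 6000.0), sr=sr)
        start = int(t * sr)
        for i, s in enumerate(g):
            out[start + i] += s
        t += random.expovariate(1.0 / mean_gap)
    return out

texture = granular_stream(total_dur=1.0, mean_gap=0.02)
```

The micro structure (each grain) and the meso structure (the inter-onset pattern) are controlled by independent parameters, which is the point of the granular approach.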
Multiple excitations are produced by physical agents, such as fire or running water, or by biological agents, such as clapping hands. The cues provided by these types of sounds may be similar in their micro-temporal or spectral structure but will certainly be different in their macro-time structure. It is very likely that these cues have been decisive for survival: the sounds of a predator stepping on leaves have to be discriminated from the same type of sounds produced by prey. When visual cues are not readily available, this becomes a question of life or death.
The distinction between human and animal agency only makes sense if applied within a specific social context. In other words, human beings have become familiar with different systems of musical conventions which influence the way they perceive musical sounds (Shepherd, 1992). Playing an instrument would be an example of a human agent interacting with a whole resonant body.
Another possible cause for multiple excitations is the change of state of an object, as occurs in breaking. Breaking is a unique pattern because of its heterogeneous characteristics. The initial resonance is produced by the whole vibrating body. After breaking has occurred, the subsequent sounds have a higher spectral profile, corresponding to the smaller pieces, such as glass shards, resonating after each collision.
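The breaking pattern can be sketched as an event list - a hypothetical illustration, with all frequencies and timings invented - in which one whole-body resonance is followed by irregular, higher-pitched fragment collisions:

```python
import random

def breaking_pattern(n_fragments=12, seed=3):
    # Meso-level 'breaking' pattern as (onset_s, freq_hz, dur_s) triples:
    # the whole body rings once, then smaller fragments collide at
    # irregular intervals with a higher spectral profile.
    random.seed(seed)
    events = [(0.0, 300.0, 0.4)]              # whole vibrating body
    t = 0.05
    for _ in range(n_fragments):
        freq = random.uniform(1500.0, 6000.0)  # smaller pieces ring higher
        events.append((t, freq, random.uniform(0.02, 0.08)))
        t += random.uniform(0.01, 0.12)        # irregular collisions
    return events

events = breaking_pattern()
```

Each triple could then drive a resonator of the appropriate class; the point here is only the heterogeneous time-pattern itself.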
When an object is excited at a micro-time range there is no perceptible time gap between successive excitations. So the sound texture is perceived as continuous. In blown tubes or in the friction between two surfaces, the resonant body and the excitation source establish a pattern of interaction that generates nonlinearities. This interaction has been roughly modeled by dynamical systems with feedback structures (Schumacher & Woodhouse, 1995).
As discussed in the previous section, within the ecological approach sound sources are characterized by invariants (Shaw et al., 1981). Invariants are common characteristics of a sound class that listeners use as cues for recognition of the source’s physical structure. By contrast, Gestalt-oriented psychology has generally given importance to the physical sound parameters, such as frequency or intensity. This is reflected in the way sources are classified and how parameters are interpreted.
McAdams (1993, 179) proposes two classes of sound properties to define an acoustic source: (1) microproperties and (2) macroproperties. The first class corresponds to events with durations from ten to one hundred milliseconds. This class is further subdivided into: (1) spectral and (2) temporal microproperties. Macroproperties span longer periods of time, i.e., one hundred milliseconds to several seconds. These can be (1) temporal patterning macroproperties or (2) spectral variation ones. The former are related to the way an object is excited - the gesture, as Wishart (1996) would put it - or to the state of the object, that is, whether it is whole or broken. According to McAdams (1993, 181), the spectral variation provides a cue to the nature of the material being stimulated and to the number of sources, i.e., one or many.
Suddenly we find ourselves in deep trouble! Which cues are used to identify the number of objects, the spectral or the temporal ones? If the excitation changes the spectral content of the signal, how can we be sure that the excitation changed and not the object itself? For temporal patterns that range from hundreds of milliseconds to less than five milliseconds - such as bouncing - how do we separate macro time from micro time?
Since properties that change over several seconds or more are not considered by McAdams, we find these categories inappropriate. Keller & Silva (1995) proposed to use ‘macro level’ for durations over several seconds. Events occurring at less than ten milliseconds are generally fused into a continuous sound. Thus, it seems reasonable to use the term ‘micro level’ for this time span. Most environmental time patterns fall within a range between ten milliseconds and several seconds. Following Kelso (1995), we adopt the term ‘meso level’ for this range.
Smalley (1993, 41) comments on the relationship between different time levels in compositional sound structure. He makes loose use of vocabulary, such as identity as equivalent to recognition, or evolution as meaning change over time. "There are types of spectromorphology [spectral patterns over time] whose existence can only be established after a certain evolution time because the completion or partial completion of a pattern is integral to identity. Motions based on rotation are examples of spectromorphologies whose timbre (if that is the right word) is embodied in the spectral changes over at least one rotation-cycle. (. . .) It becomes impossible to distinguish between its timbral matter [micro-time structure] on the one hand and its short-term evolution [meso-time structure] on the other."
Given that micro level granular synthesis parameters determine the global sound result, Clarke (1996) suggests the use of a frequency-time continuum to define compositional strategies. When dealing with granular sound, the boundary between micro and macro properties is blurred and the focus is placed on parameter interaction across levels instead of the usual independent parameters. We believe this approach should also be adopted in relation to environmental sounds. The idea of parameters interrelated at different time levels is consistent with an ecological perspective. We are usually exposed to sound events that present coherent patterns of spectral and temporal characteristics. There is no breaking water or bouncing wind in our world!
Table 2. Time scales applied to sound

    micro level    under ten milliseconds (events fuse into a continuous sound)
    meso level     ten milliseconds to several seconds (most environmental time patterns)
    macro level    over several seconds
Sound space and soundscape
Twenty years after its original formulation (Schafer, 1977), a strong tradition in soundscape composition has already formed. Several active composers routinely use soundscape techniques in their works (Westerkamp, 1996; Truax, 1996). As Truax (1996) has discussed in his various writings, the great difference between soundscape composition and the acousmatic tradition is that the latter uses sound objects separated from their context while the former keeps the sounds as an integral part of their social, cultural and aural context. Following this line of thought, recorded sounds as used in most soundscape compositions provide an ideal and simple way to incorporate the untouched sound environment within tape composition.
By keeping a consistent mix of environmental ‘background’ sound with sound events generated by ecological models, we can give a real-life feel to algorithmically designed sound. We should warn, though, against a cheap use of the concept of background. By no means do we mean a drone or a permanent room tone with no life of its own. In this context, we use the word background only because we need to convey a stable and believable sound space. The events contained in this environmental sound should be as complex, lively, and meaningful as in any other purposeful recording. Thus, the events generated by ecological models are allowed to interact and fuse with the dynamics of the environment.
The other useful technique for placement of ecological sounds within a sound space is convolution. Convolution consists in applying the spectral dynamics of a source sound onto the spectral dynamics of a target sound. Although Roads (1997) proposes this technique as a special case of granulation, we have found that it is neither flexible enough nor computationally efficient enough to provide a handy tool for shaping time patterns - at least in its current implementations on personal computers. On the other hand, it provides an exquisite tool for shaping isolated grains, which can later be used in ecologically-defined time-patterns.
These convolution-designed grains may consist of an ecologically meaningful short sound, such as a water drop or a bubble, which is convolved with the impulse response of a cavern, or any other reverberant space. When distributing these convolved grains as a meso-level time-pattern, the result is a stream of events that occurs within the space defined by the impulse response used, for example, bubbles inside a cavern. Given that we can use several types of grain, the number of simultaneous spaces created depends on the limits of our auditory system in discriminating sounds coming from different reverberant spaces. This limit is undoubtedly low.
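A direct-form sketch of this grain-level convolution follows (pure Python at a reduced sample rate for brevity; the ‘drop’ and ‘cavern’ signals are synthetic stand-ins, not recorded material):

```python
import math, random

def convolve(x, h):
    # Direct convolution: imposes the time/spectral dynamics of the
    # impulse response h onto the grain x. Cost is O(len(x) * len(h)),
    # affordable for isolated grains but not for long streams.
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

random.seed(0)
sr = 8000                                    # low rate, sketch only
# Hypothetical water-drop grain: a damped sine burst, 20 ms long.
drop = [math.sin(2.0 * math.pi * 1200.0 * i / sr) * math.exp(-i / (0.005 * sr))
        for i in range(int(0.02 * sr))]
# Toy impulse response of a reverberant space: decaying noise, 300 ms.
cavern = [random.uniform(-1.0, 1.0) * math.exp(-i / (0.1 * sr))
          for i in range(int(0.3 * sr))]
wet_grain = convolve(drop, cavern)           # the drop 'inside' the cavern
```

The resulting wet grain carries the reverberant signature of the impulse response, so distributing many such grains as a meso-level time-pattern places the whole stream inside that space.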
The last method that we have to mention is phase-controlled granulation. Given that we have not used it in our current compositional work, we leave a detailed discussion for a future paper. The idea behind this type of processing is to increase the volume of the source sound (as defined in Truax, 1994) by superimposing several granulated versions of the processed sound. If the phase-delay among these streams is kept constant, the result is an effect very akin to the reflections produced by a reverberant space. The number of ‘reflections’ is roughly proportional to the number of superimposed streams.
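The basic idea can be sketched as follows (hypothetical Python; a real implementation would granulate each stream rather than simply copy it):

```python
def superimpose_streams(source, n_streams, delay_samples):
    # Constant phase-delay superposition: n_streams copies of the
    # (granulated) source, each offset by a fixed delay and scaled so
    # the sum does not clip. The offsets act like early reflections.
    out = [0.0] * (len(source) + (n_streams - 1) * delay_samples)
    for k in range(n_streams):
        offset = k * delay_samples
        for i, s in enumerate(source):
            out[offset + i] += s / n_streams
    return out

# Demo: a unit impulse yields four equal 'reflections' 25 samples apart.
echoes = superimpose_streams([1.0] + [0.0] * 99, n_streams=4, delay_samples=25)
# echoes[0], echoes[25], echoes[50], echoes[75] are each 0.25
```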
Ecologically-based composition applies ecological models, physical models, and soundscape techniques to create musical processes consistent with a given natural and cultural environment. The sound sources provide direct references to everyday sounds. The structural unit is the sound event. The processing techniques apply transformations at three levels: micro, meso, and macro. Excitation processes and resonant structures interact to produce events which take place or might take place in our everyday environment.
Since ecological models form the basis of ecologically-inspired composition, we will present some models that can be applied to generate and process sound material. The next sections present an overview of three paradigmatic examples: bouncing, scraping, and breaking. Although the models are implemented in Csound, the descriptions of the algorithms are not language-specific. Csound users can inspect the code included on the CD (Keller, 1999).