Audio Production for Voice Applications & VoiceXML
by Spencer Critchley


There's been a big surge in activity in the voice recognition arena in just the past 18 months or so, and things are really picking up right now. Several factors have contributed: improvement in recognition algorithms, the availability of very cheap, very fast computers, and the mass adoption of the web, which has demonstrated the usefulness of browsable networked content and applications. Voice applications can be thought of as being like the web, except with a voice-recognition interface instead of a browser window. A huge range of applications can be developed for voice control, including business tools, consumer information services (news, sports, traffic, weather, and so on), and games and entertainment.

A real-world example of creating optimized audio content for VoiceXML
By Jeff Lipton (Sonicopia)

Fortunately for audio producers, creating optimized content for VoiceXML applications is relatively easy. The standard audio file format for telephony applications is mono 8 kHz 8-bit u-Law. To create and optimize audio content for a VXML application, just follow these six simple steps.

1. Record audio files through your favorite digital audio workstation:
At Sonicopia we use Pro Tools, but any DAW will do the trick. Always record at 44.1 kHz 16-bit mono to achieve the best source quality and avoid the phase shifting that results from a stereo-to-mono conversion.

2. Band-limit the resulting audio file by removing frequencies below 150 Hz and above 3.5 kHz
Although the process of downsampling to 8 kHz will automatically limit the highest frequency in the file to 4 kHz, it’s useful to roll off a little more of the high end to reduce the risk of distortion. Also, audio playback over telephony systems is significantly improved with the removal of the lowest frequencies. Frequencies below 150 Hz have a tendency to distort over the telephone, and many telecom carriers automatically filter out frequencies below 350 Hz.

3. Normalize the audio file to no greater than –1 dB
Although you do want to maximize your gain for an optimal signal-to-noise ratio, it is important not to overdo it. You will reduce the chance of distortion if you keep the maximum amplitude of the audio file at or below –1 dB relative to full scale.

4. Downsample the audio to 8 kHz 16-bit mono
This is the key. Keep the audio file at 16 bits. The final u-Law process applied by the telephony hardware gateway will dynamically compress the file down to an equivalent 8-bit file, which is then expanded back to around 12 bits for playback on the telephone system. You will notice significant degradation in audio fidelity if you cut the file down to 8 bits prior to the u-Law process.

5. Save the audio file in the required delivery format
This is the final audio processing step. Different providers will request that audio files be delivered in a variety of formats, depending on the telephony conversion system being used. Final delivery through the phone system will ultimately be mono 8 kHz 8-bit u-Law, but only after the audio has been processed through a hardware telephony gateway. Always check with your telephony engineer for the format specification of the particular system you are creating content for. You can use an audio processing application such as Peak (Mac) or Sound Forge (PC) to convert sound files to any of the most popular audio formats (e.g. wav, pcm, au, u-law, etc.).

6. Upload to your server
Once you have optimized your audio files for telephony delivery, you can upload them to any web server and reference the URL of each file from a VXML document using the <audio> tag. Be sure to test your content on a variety of phone systems (land, wireless, cellular) before delivering the final files to ensure consistency and clarity.
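
Steps 3 and 5 above can be sketched in a few lines of Python. This is only a minimal illustration, not how Pro Tools, Peak, or Sound Forge do it: it assumes the audio is already a list of 16-bit mono samples at 8 kHz, and the function names (normalize_peak, peak_dbfs, write_mono16_wav) are made up for this example. It scales the samples so the loudest peak sits at –1 dB relative to full scale, then writes them out as an 8 kHz 16-bit mono WAV using Python's standard wave module.

```python
import io
import math
import struct
import wave

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def normalize_peak(samples, target_dbfs=-1.0):
    """Scale 16-bit PCM samples so the loudest one sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    target_peak = FULL_SCALE * 10 ** (target_dbfs / 20.0)
    gain = target_peak / peak
    return [int(round(s * gain)) for s in samples]


def peak_dbfs(samples):
    """Peak level in dB relative to full scale (input must not be pure silence)."""
    peak = max(abs(s) for s in samples)
    return 20.0 * math.log10(peak / FULL_SCALE)


def write_mono16_wav(target, samples, sample_rate=8000):
    """Write samples as a 16-bit mono WAV to a file path or file-like object."""
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


# Tiny made-up signal, just to show the calls; real audio would come from your DAW export.
normalized = normalize_peak([0, 1000, -8000, 4000])
buffer = io.BytesIO()  # in-memory stand-in for a file on disk
write_mono16_wav(buffer, normalized)
print("peak after normalizing: %.2f dB" % peak_dbfs(normalized))
```

Note that this sketch does no filtering or sample-rate conversion; those steps are better left to a proper audio tool.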

Hear an Example:
You can hear examples of the process described above at the Sonicopia music portal by dialing 1-800-4BVOCAL and saying "BeVocal Café." At the prompt, say the demo ID, which is "2558789." From there, just follow the simple instructions.

Voice XML

Voice XML (VXML) is emerging as the standard for the voice web. Audio producers who have been working in the web or multimedia areas will be glad to know that producing for VXML-based voice applications will leverage skills they already have. Much of what can be done on the web can be done using voice apps, and in many cases existing web content can be adapted for voice browsing. VXML is based on XML, which is quite similar to HTML. And like HTML, VXML integrates well with JavaScript and Java.

Audio is played from a VXML page by referencing the file using an <audio> tag, quite similarly to placing a reference to a streaming audio file in a web page. You can find details on how this works in context in Srinivas Penumaka’s article "VoiceXML: An Emerging Standard for Creating Voice Applications". (Srinivas works for BeVocal, one of the leading voice platform companies, and I consult with them on creative direction.)
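
To give a rough idea of what such a page looks like, here is a small Python sketch that generates a minimal VXML document containing one <audio> tag. Treat the details as assumptions: the URL is a placeholder, the build_vxml helper is invented for this example, and the version attribute and surrounding <form>/<block> structure should be checked against whatever your voice platform expects.

```python
import xml.etree.ElementTree as ET


def build_vxml(audio_url, fallback_text):
    """Return a minimal VoiceXML document that plays one audio file."""
    # Assumption: version "2.0"; use whatever version your platform requires.
    vxml = ET.Element("vxml", {"version": "2.0"})
    form = ET.SubElement(vxml, "form")
    block = ET.SubElement(form, "block")
    audio = ET.SubElement(block, "audio", {"src": audio_url})
    # Text inside <audio> is the fallback spoken if the file cannot be played.
    audio.text = fallback_text
    return ET.tostring(vxml, encoding="unicode")


# Placeholder URL, not a real file.
print(build_vxml("http://example.com/audio/welcome.wav", "Welcome."))
```

The point is simply that the audio file lives on an ordinary web server and the VXML page refers to it by URL, much as an HTML page refers to an image.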

Phone Audio

In producing audio for voice apps, it’s important to get familiar with how audio is handled in phone networks: the traditional telephone network (known as PSTN, for Public Switched Telephone Network) and cellular networks. (This can be tricky, as I've found that most of the information available is technically oriented, and doesn't address sound quality issues the way that audio producers do -- if any of you know of good sources, I'm still looking and would be happy to hear about them.) This will affect how you record and process audio files, the same way that knowing about Red Book, Real, mp3, etc. affects how you produce for other platforms.


In North America and Japan, PSTN audio is compressed in the u-law format (often written this way, with a u, although the u actually stands for the Greek character mu, and the name is pronounced "mew-law"). Audio is in the (similar) a-law format in Europe, with the rest of the world using one or the other. You may be familiar with these formats from Unix workstations and the early days of the Internet. By the way, in the telephony world, you'll often see formats described by their International Telecommunication Union names. For example, u-law and a-law are both defined in ITU G.711.

u-law uses companding in order to improve sound quality at a reduced bit depth. That means the dynamic range is COMPressed and then exPANDed. It's compressed so that fewer bits can be used for each sample. As part of this process, the sample values are mapped logarithmically instead of linearly, so that more bits are applied for perceptually critical sample values. The audio is then expanded on playback to restore most of the dynamic range. For PSTN, a 16-bit file is encoded at 8 bits. A performance equivalent to 14 bits (13 bits for a-law) is usually claimed, although real-world dynamic range and signal-to-noise performance sound much worse than that.
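
The companding idea can be tried out with a short Python sketch. Note the assumptions: this uses the textbook continuous u-law curve (with mu = 255) followed by a simple uniform 8-bit quantizer, not the bit-exact segmented encoding that telephony hardware implements, and the function names are invented for this example. It compares how well a quiet sample survives 8-bit quantization with and without companding.

```python
import math

MU = 255  # companding constant used by u-law


def compress(x):
    """Map a sample in [-1.0, 1.0] onto the logarithmic u-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Invert compress(): map a companded value back to a linear sample."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def quantize(v, bits=8):
    """Round a value in [-1.0, 1.0] to evenly spaced levels (simplified quantizer)."""
    steps = 2 ** (bits - 1) - 1  # 127 steps per polarity for 8 bits
    return round(v * steps) / steps


def ulaw_round_trip(x):
    """Compress, quantize to 8 bits, then expand again."""
    return expand(quantize(compress(x)))


quiet = 0.01  # a quiet sample, 40 dB below full scale
print("linear 8-bit error:", abs(quantize(quiet) - quiet))
print("u-law 8-bit error: ", abs(ulaw_round_trip(quiet) - quiet))
```

Because the logarithmic curve spends more of the 8-bit codes on low-level samples, the quiet sample comes back far closer to its original value than it does with plain linear 8-bit quantization. Loud samples pay for this with coarser steps.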

The sampling rate used is 8 kHz. So the highest frequency that can be encoded is 4 kHz (1/2 of the sampling rate). In other words, there is no high end -- nothing above the top end of a piano keyboard. This limitation explains why S’s are so often mis-heard as F’s during phone conversations. In fact, phone companies often filter out frequencies higher than 3.5 kHz, and below 350 Hz.
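
The arithmetic behind the piano comparison is easy to check. The sketch below assumes a standard 88-key piano in equal temperament with A4 (key 49) tuned to 440 Hz; the helper name piano_key_hz is just for illustration.

```python
SAMPLE_RATE_HZ = 8000

# The highest frequency a sampled signal can represent is half the sampling rate.
nyquist_hz = SAMPLE_RATE_HZ / 2


def piano_key_hz(key_number):
    """Fundamental frequency of a piano key; key 49 is A4 = 440 Hz, one semitone per key."""
    return 440.0 * 2 ** ((key_number - 49) / 12)


top_key_hz = piano_key_hz(88)  # C8, the highest key on a standard 88-key piano

print(f"Upper limit at 8 kHz sampling: {nyquist_hz:.0f} Hz")
print(f"Top piano key fundamental:     {top_key_hz:.0f} Hz")
```

The top key's fundamental comes out at roughly 4,186 Hz, so the 4 kHz ceiling sits right around the top of the keyboard, and everything that gives a sound its brightness above that is simply gone.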

Cell Networks

Over a digital cell phone network, the situation is different. At 8 bits and 8 kHz, the data rate is 64 kbps (8 bits × 8,000 samples per second). That's too high for cellular transmission. So different, more aggressive data compression schemes are used. Whereas u-law and a-law attempt to represent the actual audio waveform but with less data, cell phone compression schemes are often based on resynthesizing a signal that the ear will hear as similar to the original, using much less data. One such scheme is called CELP, for Code Excited Linear Prediction. The basic idea is that there is a digital "codebook" containing mathematical models of the formants that make up speech. A CELP encoder analyzes audio data and tries to match what it finds to items in the codebook. It also works out what playing back a candidate match would produce, compares that to the original, and makes adjustments based on the difference. That way it's able to represent a lot of linear sample data, say representing an "F", with just a short index number. On playback, the decoding software looks up the number in the same codebook and discovers that it should play back an "F".

A great deal of data compression is thereby achieved, but the audio fidelity is worse than with u-law/a-law.
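
To make the codebook idea concrete, here is a toy Python sketch. It is emphatically not CELP: there is no linear prediction or speech modeling, and the four-entry codebook and sample values are invented for illustration. It only shows the core trick of sending a short index instead of the samples themselves, with the encoder picking the entry whose playback would differ least from the original.

```python
# Hypothetical codebook shared by encoder and decoder: each entry is a short waveform shape.
CODEBOOK = [
    (0, 0, 0, 0),
    (100, 50, -50, -100),
    (-100, -50, 50, 100),
    (80, -80, 80, -80),
]


def encode(frame):
    """Return the index of the codebook entry closest to the frame (least squared error)."""
    def error(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(CODEBOOK)), key=lambda i: error(CODEBOOK[i]))


def decode(index):
    """Look the index up in the same codebook to rebuild an approximation of the frame."""
    return CODEBOOK[index]


frame = (90, 60, -40, -110)       # four made-up input samples
index = encode(frame)
approximation = decode(index)

bits_as_samples = len(frame) * 8                 # four 8-bit samples
bits_as_index = (len(CODEBOOK) - 1).bit_length() # bits needed to name one entry
print(index, approximation, bits_as_samples, bits_as_index)
```

The decoded frame is only an approximation of the input, which is the trade-off described above: far fewer bits on the wire, at the cost of fidelity.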

Other Compression Schemes

Other data compression techniques are also used. You’ll want to check with the engineers handling the gateway to the phone system used by your project. A handy reference is the Audio File Formats FAQ, at http://members.home.com/chris.bagwell/AudioFormats.html.

Watch Those Levels

Producers experienced with low bit rate audio might be tempted to do a standard multimedia fix: wade in with some EQ to compensate for lost high frequencies (say boosting from about 3 kHz to 4 kHz), and try to force better signal-to-noise performance by using compression and/or limiting (higher amplitudes are generally encoded using more bits, yielding more accurate samples). But be careful: what you hear on your desktop will differ from what you hear over a real phone system, and what you get over a cell network will be different from what you hear over PSTN. Cell encoding seems less tolerant of high levels, and in both cases you'll find that an overly aggressive approach yields distortion. Test your results over multiple phone connections, both PSTN and cell.


There is also a growing trend towards delivering voice over the Internet, known as VoIP (Voice Over Internet Protocol). In this case audio is encoded using one of a variety of schemes and sent as data in packets, as opposed to in a continuous stream over a phone connection. You'll want to find out which encoding scheme is being used and work accordingly.

The Future

As we move towards higher bandwidth wireless systems over the next few years, audio quality over cell networks will improve. With the coming 3G (3rd Generation) system, bandwidth will be high enough to support streaming mp3 and the like, promising access to high quality music and interactive entertainment from anywhere, via your phone.
