Also see our VoiceXML introduction
There has been a big surge of activity in the voice recognition arena over the past 18 months or so, and things are picking up quickly. Several factors have contributed: improvements in recognition algorithms, the availability of very cheap, very fast computers, and the mass adoption of the web, which has demonstrated the usefulness of browsable networked content and applications. Voice applications can be thought of as being like the web, except with a voice-recognition interface instead of a browser window. A huge range of applications can be developed for voice control, including business tools; consumer information services such as news, sports, traffic, and weather; and games and entertainment.
Audio is played from a VXML page by referencing the file using an <audio> tag, quite similarly to placing a reference to a streaming audio file in a web page. You can find details on how this works in context in Srinivas Penumaka's article "VoiceXML: An Emerging Standard for Creating Voice Applications". (Srinivas works for BeVocal, one of the leading voice platform companies, and I consult with them on creative direction.)
In producing audio for voice apps, it's important to get familiar with how audio is handled in phone networks: the traditional telephone network (known as the PSTN, for Public Switched Telephone Network) and cellular networks. (This can be tricky, as I've found that most of the available information is technically oriented and doesn't address sound quality issues the way audio producers do -- if any of you know of good sources, I'm still looking and would be happy to hear about them.) This will affect how you record and process audio files, the same way that knowing about Red Book, Real, mp3, etc. affects how you produce for other platforms.
In North America and Japan, PSTN audio is compressed in the u-law format (often written this way, with a u, although the u actually stands for the Greek character mu, and the name is pronounced "mew-law"). Audio is in the (similar) a-law format in Europe, with the rest of the world using one or the other. You may be familiar with these formats from Unix workstations and the early days of the Internet. By the way, in the telephony world, you'll often see formats described by their International Telecommunication Union names. For example, u-law is referred to as ITU G.711.
u-law uses companding to improve sound quality at a reduced bit depth. That means the dynamic range is COMPressed and then exPANDed. It's compressed so that fewer bits can be used for each sample. As part of this process, the sample values are mapped logarithmically instead of linearly, so that more resolution goes to perceptually critical low-level sample values. The audio is then expanded on playback to restore most of the dynamic range. For PSTN, a 16-bit file is encoded at 8 bits. Performance equivalent to 14 bits (13 bits for a-law) is usually claimed, although real-world dynamic range and signal-to-noise performance sound noticeably worse than that.
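To make the logarithmic mapping concrete, here is a minimal sketch of the continuous mu-law companding formula in Python. (This is an illustration of the curve, not a G.711 implementation -- real codecs use a piecewise-linear segmented approximation of this function and pack the result into 8-bit codes.)

```python
import math

MU = 255  # the mu parameter used on North American / Japanese PSTN

def ulaw_compress(sample):
    """Map a linear sample in [-1.0, 1.0] onto the companded curve.

    Continuous mu-law formula; real G.711 codecs approximate this
    with linear segments before quantizing to 8 bits.
    """
    sign = -1.0 if sample < 0 else 1.0
    return sign * math.log(1 + MU * abs(sample)) / math.log(1 + MU)

def ulaw_expand(value):
    """Invert the compression, restoring the dynamic range."""
    sign = -1.0 if value < 0 else 1.0
    return sign * ((1 + MU) ** abs(value) - 1) / MU

# Quiet signals get a disproportionate share of the code space:
# a sample at 1% of full scale lands at about 23% of the companded range.
print(round(ulaw_compress(0.01), 3))  # 0.228
print(round(ulaw_compress(0.5), 3))   # 0.876
```

That re-allocation of resolution toward quiet samples is exactly why an 8-bit companded channel can claim roughly 14-bit performance.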
The sampling rate used is 8 kHz, so the highest frequency that can be encoded is 4 kHz (half the sampling rate). In other words, there is no high end -- nothing above the top end of a piano keyboard. This limitation explains why an "S" is so often misheard as an "F" during phone conversations. In fact, phone companies often filter out frequencies above 3.5 kHz and below 350 Hz.
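The half-the-sampling-rate limit isn't just a cutoff: content above it doesn't disappear, it folds back down as a false lower frequency (aliasing), which is one reason material is filtered before hitting the phone network. A quick sketch, assuming a 6 kHz test tone:

```python
import math

FS = 8000  # PSTN sampling rate in Hz

def sampled_tone(freq_hz, n_samples):
    """Return n_samples of a sine wave at freq_hz, sampled at FS."""
    return [math.sin(2 * math.pi * freq_hz * n / FS) for n in range(n_samples)]

# The highest representable frequency is FS / 2:
print(FS / 2)  # 4000.0

# A 6 kHz tone is above that limit. Once sampled, its values are
# identical to a phase-inverted 2 kHz tone -- it has "aliased" down
# to 8000 - 6000 = 2000 Hz, and the two are indistinguishable.
high = sampled_tone(6000, 16)
alias = [-s for s in sampled_tone(2000, 16)]
print(all(abs(a - b) < 1e-9 for a, b in zip(high, alias)))  # True
```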
Over a digital cell phone network, the situation is different. At 8 bits/8 kHz, the data rate is 64 kbps. That's too high for cellular transmission, so different, more aggressive data compression schemes are used. Whereas u-law and a-law attempt to represent the actual audio waveform with less data, cell phone compression schemes are often based on resynthesizing a signal, which the ear will hear as similar to the original, using much less data. One such scheme is called CELP, for Code Excited Linear Prediction. The basic idea is that a linear-prediction filter models the resonances (formants) of the vocal tract, while a digital "codebook" holds a standard set of excitation signals that drive the filter. The encoder analyzes each short frame of audio and searches for the codebook entry that, fed through the filter, best reproduces the original. That way a frame of sample data -- say, part of an "F" sound -- can be represented by just a short index number plus a handful of filter parameters. On playback, the decoder looks up that number in its own copy of the codebook and resynthesizes the sound. Because the encoder synthesizes each candidate and compares the result against the original before choosing, the technique is known as analysis-by-synthesis.
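Here is the codebook idea in miniature -- a toy vector quantizer, not real CELP (no linear-prediction filter, and the four-entry codebook is made up for illustration), but it shows how a frame of samples collapses to one small index and how the encoder picks by trying every entry and keeping the best match:

```python
# Toy illustration of codebook-based coding in the spirit of CELP.
# Real CELP codes excitation for a linear-prediction filter; here we
# simply match short frames directly against a hypothetical codebook.

CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.6, 0.3, 0.1],
    [-0.9, -0.6, -0.3, -0.1],
    [0.5, -0.5, 0.5, -0.5],
]

def encode_frame(frame):
    """Analysis-by-synthesis in miniature: try every codebook entry,
    measure the error against the original frame, keep the best index."""
    def error(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(CODEBOOK)), key=lambda i: error(CODEBOOK[i]))

def decode_frame(index):
    """The decoder just looks the index up in its copy of the codebook."""
    return CODEBOOK[index]

# Four samples collapse to a single small integer:
idx = encode_frame([0.8, 0.55, 0.2, 0.0])
print(idx, decode_frame(idx))  # 1 [0.9, 0.6, 0.3, 0.1]
```

The reconstruction is only an approximation of the input frame -- which is exactly the fidelity trade-off described next.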
A great deal of data compression is thereby achieved, but the audio fidelity is worse than with u-law/a-law.
Other Compression Schemes
Other data compression techniques are also used. You'll want to check with the engineers handling the gateway to the phone system used by your project. A handy reference is the Audio File Formats FAQ, at http://members.home.com/chris.bagwell/AudioFormats.html.
Watch Those Levels
Producers experienced with low bit rate audio might be tempted to do a standard multimedia fix: Wade in with some EQ to compensate for lost high frequencies (say boosting from about 3 kHz to 4 kHz), and try to force better signal-to-noise performance by using compression and/or limiting (higher amplitudes are generally encoded using more bits, yielding more accurate samples). But be careful: What you hear on your desktop will differ from what you hear over a real phone system, and what you get over a cell network will be different from what you hear over PSTN. Cell encoding seems less tolerant of high levels, and in both cases you'll find that an overly aggressive approach yields distortion. Test your results over multiple phone connections, both PSTN and cell.
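One conservative alternative to slamming levels is to normalize to a peak target comfortably below full scale, leaving headroom for the codec. A minimal sketch -- the -3 dBFS target here is an illustrative choice, not a standard, so test it against your own connections:

```python
# Normalize to a target peak below full scale instead of maximizing
# level, since phone codecs (cell codecs especially) distort when
# driven hard. TARGET_PEAK_DB is an assumed value for illustration.

TARGET_PEAK_DB = -3.0

def normalize(samples, target_db=TARGET_PEAK_DB):
    """Scale float samples in [-1, 1] so the peak lands at target_db dBFS."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    target = 10 ** (target_db / 20)  # -3 dBFS is about 0.708 linear
    gain = target / peak
    return [s * gain for s in samples]

quiet = [0.2, -0.1, 0.15, -0.05]
loud = normalize(quiet)
print(round(max(abs(s) for s in loud), 3))  # 0.708
```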
There is also a growing trend towards delivering voice over the Internet, known as VoIP (Voice Over Internet Protocol). In this case audio is encoded using one of a variety of schemes and sent as data in packets, as opposed to in a continuous stream over a phone connection. You'll want to find out which encoding scheme is being used and work accordingly.
As we move towards higher bandwidth wireless systems over the next few years, audio quality over cell networks will improve. With the coming 3G (3rd Generation) system, bandwidth will be high enough to support streaming mp3 and the like, promising access to high quality music and interactive entertainment from anywhere, via your phone.
Discuss this tutorial/demo in the Wireless Apps Discussion forum.