VoiceXML: An Emerging Standard for Creating Voice Applications
by Srinivas Penumaka, BeVocal

Introduction

Check out this simple working VoiceXML example:

Get a forecast for almost any city or town in the U.S. Just call 1-800-4BVOCAL (800-428-6225) and say "BeVocal Weather" and the name of the U.S. city and state you are interested in. (International users dial 001-4089072861.)

Other examples created by web developers and audio producers can be found here. (International users dial 001-4089077329 to access these demos.)

VoiceXML is an emerging industry standard for providing web content and services through the telephone. This includes information, entertainment, games and business services.

This article introduces the voice services industry, potential voice applications, and the VoiceXML architecture. According to Kelsey Group estimates, voice applications represent a large market opportunity. VoiceXML is a standard that lets developers take advantage of that opportunity: it enables rapid development, ensures portability, and leverages existing Internet infrastructure, which makes it compelling for developers to adopt VoiceXML for creating voice applications.

Given that everything in VoiceXML depends on sound, it's a natural area for audio developers to apply their expertise in audio quality, production techniques, and efficient use of bandwidth. For web developers, VoiceXML is based on XML, whose tag-based syntax resembles HTML, so the migration is relatively easy and the coding opportunities are abundant.

Voice Services Industry

Traditional Interactive Voice Response (IVR) applications have been deployed in enterprises for decades, but they’ve faced serious limitations including poor usability and the inability to go beyond providing access to proprietary information.

Over the last two years, the need to access Internet content from anywhere gave rise to wireless and voice recognition-based applications, as well as Internet appliances. Some of these applications are available only on special devices such as WAP phones and on Internet appliances such as WebTV.

However, the only device needed to access voice services is an ordinary telephone. According to IDC, the installed base of telephones (both wireless and landline) in 2000 was about 1.6 billion. Hence, the telephone is a ubiquitous device that can play a key role in providing access to web content and services. The success of this "Voice Web" depends on robust voice recognition software and significant computing resources. Over the last four years, improvements in algorithms have made voice recognition software viable for commercial applications. Faster microprocessors and falling prices are another favorable factor.

IDC predicts that the total number of telephones, both wireless and landline, will grow to 3.1 billion worldwide by 2005. The Kelsey Group estimates that by that time about 440 million people will use voice applications and that the market for speech applications will reach $4.5 billion.

Potential Voice Services

The following categories have the potential to use voice applications to provide content and services through the telephone. Many of these applications can be linked together to offer a comprehensive set of services.

  • Games & Entertainment: Games and entertainment applications have a great deal of appeal in the youth (Generation Y) market. The examples in this category include horoscopes, trivia contests, and movie information. This segment is of special interest to wireless carriers that are looking for ways to increase minutes of usage (MOU) from their subscribers. These applications can create a differentiated service for a carrier, while increasing user loyalty.
  • Enhanced Directory Services: Directory services can be automated with a greater level of customer service than is offered through today’s 4-1-1. For example, an enhanced directory service application can search for a business, give the telephone number, connect to that business, and provide driving directions to the location. These lookup services can be enhanced with pertinent information on businesses such as store hours, important phone numbers, and a URL for the web site.
  • Finance: Some examples in this category include stock quotes, stock quote alerts, and any transaction-based application.
  • Travel: Travel-related applications could offer content and services such as flight schedules, flight delay alerts, weather, traffic conditions, and driving directions.
  • Enterprise: Enterprise business finder (finding the closest corporate office) and appointment systems (scheduling internal conference rooms, checking personal calendars) will increase employee productivity. Customer Relationship Management (CRM) applications such as order status (checking customer orders via voice) will increase customer satisfaction.

Voice Application Benefits

Voice applications offer several benefits that include the following:

  • Voice applications have increased usability over traditional IVR applications, as they use a natural dialog that is appropriate for the application. They offer a level of service that is greater than IVR applications and potentially equal to a live operator without putting users on hold.
  • Voice applications are economical, as they don’t require the infrastructure setup that traditional IVR applications do. Voice applications are typically hosted with a hosting provider such as BeVocal.
  • The ubiquitous presence of the telephone enables almost anyone, anywhere to access web content and services.
  • A hands-free interaction through voice commands offers convenience and safety to wireless phone users.
  • Information can be retrieved directly with a spoken command. This eliminates sorting through different levels of menus that are common to traditional IVR applications. Voice commands are far more convenient and intuitive to use than punching in letters and numbers on the tiny keypad of a wireless phone. Wireless applications based on Wireless Application Protocol (WAP) are at a disadvantage due to this restriction.

Role of VoiceXML Standard

The developer community has an opportunity to take advantage of the huge voice applications market. Standards-based technologies are vital to developers in order to ensure portability across vendors and to leverage existing Internet infrastructure. VoiceXML is well on its way to becoming the standard that fulfills these needs. The VoiceXML Forum is the industry organization that represents the VoiceXML user community. The World Wide Web Consortium (W3C) published the VoiceXML 1.0 Specification as a W3C Note in May 2000. The 2.0 Specification is in the preliminary draft stage.

The benefits of developing in VoiceXML include:

  • Deliver web content and services through the telephone
  • Leverage existing Internet infrastructure and skill-sets
  • Ensure portability across implementation platforms
  • Decrease the level of expertise required to create voice applications
  • Enable rapid voice application development, similar to HTML for the web
  • Provide "Voice View" for web content

Several vendors have implemented the VoiceXML 1.0 Specification along with their own extensions. The BeVocal Café is one such VoiceXML development platform, and was recently ranked the #1 VoiceXML development environment and hosting service by the independent testing firm CT Labs. The Café is a free web-based environment that provides developers with all of the tools and resources necessary for VoiceXML development. The Café offers a hosting platform that provides the telephony infrastructure, voice recognition software, and other associated software components that VoiceXML applications need to run. The Café also provides 24/7 operations support. Developers can take advantage of these platforms to avoid the large capital expenditure and effort related to infrastructure build-out.

VoiceXML Architecture

VoiceXML is an XML document type definition (DTD) defined specifically for voice applications. The 1.0 Specification document details all the tags that are part of this DTD. The specification also covers the architectural model for VoiceXML implementations, the form interpretation algorithm, and the scope of VoiceXML.

The graphic shows the architectural model of VoiceXML in the BeVocal platform. There are three components: a web server, the VoiceXML interpreter context, and the implementation platform. The web server in the graphic can be any web server on the Internet. The interpreter context contains the VoiceXML interpreter, which is responsible for interpreting VoiceXML code. The interpreter context provides all supported functions that are necessary for the interpreter.

The VoiceXML interpreter sends parameter values to the web server as part of the request and it receives a VoiceXML document as the response. The web server receives requests and sends responses back to the interpreter. Any server side scripting language such as Perl, ASP, JSP, and PHP can be used to create VoiceXML documents dynamically.
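
The request side of this exchange can be sketched in VoiceXML itself. In the minimal sketch below, the interpreter collects one field and passes its value to the web server with the <submit> tag. The URL http://www.example.com/weather.jsp and the field name "zipcode" are hypothetical and are not part of the BeVocal platform.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="ask">
    <!-- Built-in "digits" field type, so no explicit grammar is needed -->
    <field name="zipcode" type="digits">
      <prompt>Please say your five digit zip code.</prompt>
    </field>
    <block>
      <!-- Sends zipcode as a request parameter (hypothetical URL);
           the server is expected to answer with a new VoiceXML document -->
      <submit next="http://www.example.com/weather.jsp"
              namelist="zipcode" method="get"/>
    </block>
  </form>
</vxml>
```

The server-side script would read the zipcode parameter and return a VoiceXML document, for instance one whose <block> reads a forecast back to the caller.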

Figure: VoiceXML architecture

The VoiceXML interpreter and the VoiceXML interpreter context work with an implementation platform that has other infrastructure components such as a telephony switch, voice recognition software, and a speech synthesis engine (TTS). This implementation platform is responsible for connecting to the Public Switched Telephone Network (PSTN), performing voice recognition, playing audio files, and other supporting functions. Since the implementation platform provides voice recognition capabilities, the details regarding voice recognition are hidden from VoiceXML.

VoiceXML is independent of the implementation platform on which an application might be developed or deployed. This offers flexibility to developers as they are not restricted to one implementation platform.

A VoiceXML Example

The VoiceXML DTD has 47 tags in the 1.0 Specification. Each tag has a set of valid children and parents. A tag also has a set of attributes, through which the tag’s behavior can be controlled. For a complete list of tags, please refer to the VoiceXML 1.0 Specification. For a comprehensive VoiceXML reference, you can visit http://cafe.bevocal.com/docs/vxml/index.html.

This article introduces a few important VoiceXML tags through an example. The example illustrates a simple form that takes user input and invokes a form again once there is a match.

The illustrated tags include: <vxml>, <form>, <field>, <block>, <prompt>, <goto>, <break>, <audio>, <if>, and <else>.

<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 1.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml1-0-bevocal.dtd">

<vxml version="1.0">
  <form id="welcome">
    <!-- Greeting, a short pause, then the instructions -->
    <block>
      Welcome to VoiceXML Example <break/>
      <prompt>
        <!-- The enclosed text is spoken by TTS if say.wav is not found -->
        <audio src="say.wav">
          Say next or start to start over again.
        </audio>
      </prompt>
    </block>
    <field name="whattodo">
      <!-- GSL grammar: the caller may say "next" or "start" -->
      <grammar>[next start]</grammar>
      <filled>
        <if cond="whattodo=='next'">
          <prompt>You said <value expr="whattodo"/></prompt>
          <goto next="#welcome"/>
        <else/>
          <prompt>You said <value expr="whattodo"/></prompt>
          <goto next="#welcome"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>

The first couple of lines of the above source code are specific to XML. The XML declaration (<?xml ... ?>) identifies the XML version as 1.0. The DOCTYPE declaration that follows gives the URL of the DTD. This example uses the BeVocal Café DTD.

VoiceXML code is enclosed in <vxml> and </vxml> tags. The <vxml> tag identifies the version of VoiceXML Specification. It is set to 1.0, as this is the current version.

The <form> tag is used to get user input and perform other associated functionality. In the above example, the ID of the <form> is "welcome." This ID attribute is important as it enables the program control to go back to the same form. In order to play a prompt to the user within the <form> tag, we use the <block> tag. The <block> tag allows you to specify executable code. We can specify a prompt using the <prompt> tag. A wave file can also be played as a prompt using the <audio> tag. We specify the name of the wave file using the "src" attribute. The VoiceXML 1.0 Specification allows Uniform Resource Locators (URLs) to be used as file paths. If the audio file is not found, the VoiceXML interpreter plays the text enclosed in the <audio> tag through the Text-to-Speech (TTS) engine. We used the <break> tag to introduce a pause after the welcome message.
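
As a minimal sketch of the prompt-related tags just described, the following document plays an audio file addressed by a URL, falls back to text if the file cannot be fetched, and inserts a pause. The URL http://www.example.com/audio/welcome.wav and the wording of the prompts are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="greeting">
    <block>
      <prompt>
        <!-- If welcome.wav cannot be fetched, the enclosed text
             is spoken by the TTS engine instead -->
        <audio src="http://www.example.com/audio/welcome.wav">
          Welcome to the example service.
        </audio>
        <!-- Pause before the rest of the prompt -->
        <break size="medium"/>
        Please stay on the line.
      </prompt>
    </block>
  </form>
</vxml>
```

Keeping the fallback text inside <audio> means the caller still hears a usable prompt when a recording is missing, which is convenient while recordings are still being produced.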

The steps to gather input and take action depending on the match are as follows:

  1. Specify the grammar: The <grammar> tag is used to specify allowable utterances. In the above example, we specify two options: next and start. The grammar format used in the example is the Grammar Specification Language (GSL), a grammar format used by Nuance Communications. Various other grammar formats, such as the Java Speech Grammar Format (JSGF), are available. An XML-based grammar format is being proposed as part of the VoiceXML 2.0 Specification.

  2. Get user input: The <field> tag is used to get the user’s input. A <field> can be of different types such as boolean, date, digits, currency, number, phone, and time, so that the result of user input will be of the same data type. Some implementation platforms provide extensions to field types. For example, the BeVocal Café supports extended types such as "citystate", "street", "streetnumber", "airport", "airline", and "equity." Once the user speaks, the spoken word is matched and the result is stored in a variable, which is identified by the "name" attribute of the <field> tag. In this example, the <field> tag’s name is "whattodo." Hence, the user input value is available through the variable "whattodo." Once there is an input that is one of the allowable user utterances, the <filled> tag is interpreted.

    If there is no input, or no match, then the noinput and nomatch events are thrown by the interpreter. Developers can catch these events using the <catch> tag or the shorthand <noinput> and <nomatch> tags. Events are handled by placing the appropriate handling code between the event tags. For example, we can handle the nomatch event with the following code:

    	  <nomatch> I am sorry, I didn’t get that <break/></nomatch>

  3. Match the user utterance against allowable choices: The <if> tag is used to compare the value of the field with known values. In the above example, we test for "next"; the <else/> branch handles the only other allowed utterance, "start."

  4. Next steps after a match: Once a match is found using the <if> tag, the example invokes the <goto> tag to go back to the same form. The "next" attribute of the <goto> tag uses the form ID "welcome." However, we can also go to a different VoiceXML document or submit to a URL to invoke a server side script.
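
The steps above can be combined in one short sketch. It uses the built-in boolean field type instead of a custom grammar, handles the noinput and nomatch events with the shorthand tags, and transfers control to a different VoiceXML document on a positive answer. The document name menu.vxml and the prompt wording are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="confirm">
    <!-- Built-in boolean type: accepts yes/no answers
         without a custom grammar -->
    <field name="proceed" type="boolean">
      <prompt>Would you like to hear the main menu?</prompt>

      <!-- Shorthand handlers for the noinput and nomatch events -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch>I am sorry, I did not get that. <reprompt/></nomatch>

      <filled>
        <if cond="proceed">
          <!-- Transfer control to a different VoiceXML document -->
          <goto next="menu.vxml"/>
        <else/>
          <prompt>Goodbye.</prompt>
          <exit/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The <nomatch> tag is shorthand for <catch event="nomatch">, so the same handler could be written with <catch> when several events should share one piece of handling code.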

Conclusion

This article introduced the voice services industry, potential voice applications, and the VoiceXML architecture. According to Kelsey Group estimates, voice applications represent a large market opportunity. VoiceXML is a standard that lets developers take advantage of that opportunity: it enables rapid development, ensures portability, and leverages existing Internet infrastructure, which makes it compelling for developers to adopt VoiceXML for creating voice applications.

VoiceXML will interest both web developers and audio producers.

