W3C Voice Extensible Markup Language

The World Wide Web Consortium (W3C) has announced a Proposed Recommendation release of the Voice Extensible Markup Language (VoiceXML) Version 2.0 specification, published according to W3C Royalty-Free (RF) Licensing Requirements. VoiceXML "is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications."

According to Dave Raggett (W3C Voice Browser Activity Lead), "VoiceXML 2.0 has the power to change the way phone-based information and customer services are developed: no longer will we have to press one for this or two for that; instead, we will be able to make selections and provide information by speech. VoiceXML 2.0 also creates opportunities for people with visual impairments or those needing Web access while keeping their hands and eyes free for other things, such as getting directions while driving." W3C 'Proposed Recommendation' status signifies that the Working Group has successfully completed both public and W3C Working Group review, and has provided evidence of successful interoperable implementations; at least eight known implementations of VoiceXML 2.0 exist, ranging from prototypes to fully released products.

Separately, the VoiceXML Forum has announced support for the W3C VoiceXML 2.0 PR version, and has released XHTML+Voice Profile Version 1.2. The XHTML+Voice profile "brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. This profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars."


Bibliographic Information and Overview


Voice Extensible Markup Language (VoiceXML) Version 2.0 (http://www.w3.org/TR/voicexml20/)
W3C Proposed Recommendation 3-February-2004. Edited by Scott McGlashan (Hewlett-Packard, Editor-in-Chief), Daniel C. Burnett (Nuance Communications), Jerry Carter (Invited Expert), Peter Danielsen (Lucent, until October 2002), Jim Ferrans (Motorola), Andrew Hunt (ScanSoft), Bruce Lucas (IBM), Brad Porter (Tellme Networks), Ken Rehor (Vocalocity), Steph Tryphonas (Tellme Networks).

Abstract: "This document specifies VoiceXML, the Voice Extensible Markup Language. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications."

XHTML+Voice Profile 1.2 (http://www.voicexml.org/specs/multimodal/x+v/12/)
VoiceXML Forum. 3-February-2004.

Abstract: "The XHTML+Voice profile brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content."


About VoiceXML Version 2.0


The v2.0 document "defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented in Section 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy...
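To make the dialog constructs concrete, a minimal VoiceXML 2.0 menu document might look like the following sketch (the target URLs are placeholders, not drawn from the specification's examples):

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <menu>
        <prompt>Say sports, weather, or news.</prompt>
        <!-- Each choice phrase is matched against the caller's speech;
             on a match, the interpreter transitions to the document
             named by the "next" attribute. -->
        <choice next="http://example.com/sports.vxml">sports</choice>
        <choice next="http://example.com/weather.vxml">weather</choice>
        <choice next="http://example.com/news.vxml">news</choice>
      </menu>
    </vxml>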

VoiceXML's main goal is to bring the full power of web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs...
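The drink-order form below, adapted from the examples in the specification (the server URL and grammar file are placeholders), sketches this round trip: a field collects the caller's input against a grammar, and submit sends the result to the document server, whose VoiceXML reply continues the session:

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="order">
        <field name="drink">
          <prompt>Would you like coffee, tea, or milk?</prompt>
          <grammar src="drink.grxml" type="application/srgs+xml"/>
        </field>
        <block>
          <!-- Post the collected value; the server's reply is the next
               VoiceXML document in the session. -->
          <submit next="http://example.com/order.cgi" namelist="drink"/>
        </block>
      </form>
    </vxml>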

[According to the architectural model,] a document server (e.g. a web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML Interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics. The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context..." [adapted]
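For instance, predefined events such as noinput, nomatch, and connection.disconnect.hangup can be caught in markup; events the document does not handle fall through to the interpreter context's defaults. A sketch (the grammar file and prompts are illustrative):

    <form id="lookup">
      <field name="city">
        <prompt>Which city?</prompt>
        <grammar src="city.grxml" type="application/srgs+xml"/>
        <!-- Platform-generated user-input events handled by the
             interpreter as specified in the document itself. -->
        <noinput>Sorry, I did not hear you. <reprompt/></noinput>
        <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
      </field>
      <!-- A system event: the caller hung up. -->
      <catch event="connection.disconnect.hangup">
        <exit/>
      </catch>
    </form>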


About the XHTML+Voice Profile 1.2


This document "defines version 1.2 of the XHTML+Voice profile. XHTML+Voice 1.2 is a member of the XHTML family of document types, as specified by XHTML Modularization. XHTML is extended with a modularized subset of VoiceXML 2.0, the XML Events module, and a module containing a small number of attribute extensions to both XHTML and VoiceXML. The latter module facilitates the sharing of multimodal input data between the VoiceXML dialog and XHTML input and text elements.
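As a rough sketch of that data sharing (the sync element and its attribute names reflect our reading of the 1.2 profile; the ids and namespace binding are illustrative), a visual text field can be kept in step with a VoiceXML field so that a value captured by voice also appears in the XHTML form:

    <!-- Assumes xmlns:xv="http://www.voicexml.org/2002/xhtml+voice".
         Synchronizes the XHTML input with id "drinkText" and the
         VoiceXML field referenced by "#drinkField". -->
    <xv:sync xv:input="drinkText" xv:field="#drinkField"/>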

The XML Events module provides XML host languages the ability to uniformly integrate event listeners and associated event handlers with Document Object Model (DOM) Level 2 event interfaces. The result is an event syntax for XHTML-based languages that enables an interoperable way of associating behaviors with document-level markup.
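In its attribute form, XML Events wiring is compact; with the prefix ev bound to http://www.w3.org/2001/xml-events, an element can name the DOM event it observes and the handler to invoke (the handler id here is illustrative):

    <!-- When this paragraph receives a DOM "click" event, the listener
         invokes the handler element identified by "#hello". -->
    <p ev:event="click" ev:handler="#hello">Click here.</p>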

VoiceXML 2.0 has been designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. In this document, VoiceXML 2.0 is modularized to prepare it for integration into the XHTML family of languages using the XHTML modularization framework. The modules that combine to support speech dialogs for updating XHTML forms and form elements are selected to be added to XHTML. The modules are described as well as the integration issues. The modularization of VoiceXML 2.0 also specifies DOM event types specific to voice interaction for use with the XML Events module. Speech dialogs authored in VoiceXML 2.0 can then be treated as event handlers to add voice-interaction specific behaviors to XHTML documents. The language integration supports all of the modules defined in XHTML Modularization, and adds speech interaction functionality to XHTML elements to enable multimodal applications. The document type defined by the XHTML+Voice profile is XHTML Host language document type conformant..." [from the 'Introduction']
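Putting the pieces together, a minimal XHTML+Voice page in the style of the profile's well-known "hello world" examples declares a VoiceXML form in the document head and attaches it, via XML Events, to the body's load event, so the page speaks when it is loaded:

    <?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:vxml="http://www.w3.org/2001/vxml"
          xmlns:ev="http://www.w3.org/2001/xml-events">
      <head>
        <title>Hello X+V</title>
        <!-- A VoiceXML dialog declared in the head; it acts as an event handler. -->
        <vxml:form id="sayHello">
          <vxml:block>Hello, world!</vxml:block>
        </vxml:form>
      </head>
      <!-- XML Events wiring: run the dialog when the body's "load" event fires. -->
      <body ev:event="load" ev:handler="#sayHello">
        <p>This page greets you by voice when it loads.</p>
      </body>
    </html>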


Principal references:


Voice Extensible Markup Language (VoiceXML) Version 2.0 (http://www.w3.org/TR/2004/PR-voicexml20-20040203/)
W3C Proposed Recommendation.

W3C Announcement: (http://xml.coverpages.org/VoiceXMLv20PR.html)
"World Wide Web Consortium Issues VoiceXML 2.0 as a W3C Proposed Recommendation. Cornerstone to the W3C Speech Interface Framework is Nearly Complete."

W3C's Royalty-Free Licensing Requirements (http://www.w3.org/Consortium/Patent-Policy-20030520.html#sec-Requirements)
The core VoiceXML 2.0 specification is made available according to W3C Royalty-Free (RF) Licensing Requirements.

Introduction and Overview of W3C Speech Interface Framework (http://www.w3.org/TR/voice-intro/)
W3C Working Draft 4-December-2000.


