XTech 2005: XML, the Web and beyond.

W3C's Multimodal Web

Abstract

This paper is a survey of the work currently carried out by the Multimodal Interaction Working Group at W3C. After a presentation of the concepts of multimodal interaction on the Web, the global architecture designed by the Working Group is outlined, followed by details about the specifications being developed. Relationships with other W3C activities, as well as with non-W3C efforts in this area, are also discussed.

Introduction

Although the W3C Multimodal Interaction (MMI) Activity [MMI] has been active since 2002, it is still under the radar of the web community at large. Nevertheless, the MMI Working Group has been making progress, and now that three of its specifications are being published as Last Call Working Drafts, its work is becoming more visible.

Along with the Voice Browser and the Device Independence Working Groups, the MMI group is investigating what the browsers of the future will look like, whether they are embedded in pocket-sized mobile phones, running on big desktop PCs connected to various input and output devices, or integrated with your car's navigation and entertainment system.

Standardisation work is especially important in this area as browser vendors are already deploying multimodal solutions. Efforts to use existing and widespread standards in new environments are also visible: VoiceXML servers running on mobile phones, SMIL for multimedia messaging, etc.

This paper is a survey of the work of the MMI Working Group, with a focus on its dependencies on other W3C activities, in particular the Device Independence Activity.

The first part describes the major changes, mostly driven by new hardware, that are starting to alter the notion of Web access, chiefly by introducing users to the concept of multimodal interaction. The second part describes the current techniques that can be used to adapt to these uses. The third part details the specific work of W3C in that area: the MMI Framework and related specifications, followed by longer-term solutions envisaged by the MMI group.

New access methods

The ability of mobile devices to access the Web is changing the way people experience it, as well as how browsers are designed and web content is written. So far, the main technical hurdle in adapting the Web to mobile phones has been fitting pages to small screens. But a greater challenge is coming, as more than just "small-screen" devices are accessing web pages.

Indeed, mobile phones are not the only new devices able to access the Web: interactive televisions and recorders, car navigation systems, game consoles, etc. have or will have web access. The challenge is then to adapt web software, as well as web content, not just for small screens but in some cases for no screen at all. An important change is that those devices also feature new human input mechanisms: voice or handwriting recognition, visual input using integrated cameras, keypads, scroll-wheels or haptic devices.

Furthermore these new devices bring more than new input and output modes to the web experience: many of them also have a greater awareness of their environment than the traditional desktop computer. Many sensors can be found in mobile devices, which could influence the experience of browsing: location and speed, ambient noise or brightness, network signal or battery level. A truly multimodal Web should take all these new modalities into account.

How The Web Copes

Currently the Web isn't changing as much as it could, and as a consequence things don't work so well. Yet there are short-term solutions, based on existing standards, to make them work better. This section lists some of those solutions (starting with non-solutions, in fact) while the next one describes a more complete answer to the problem of "real" multimodal interaction.

What Not To Do

Some attempts at making the Web mobile and multimodal have failed or are bound to. As appealing as they may seem to quickly solve the problem, they all run the risk of fragmenting the Web into several incompatible sub-webs. The sources of that risk are listed below:

What to do

Despite the initial appeal of the solutions above, they all prove to be a poor choice. However, better alternatives and sound technical solutions exist. Some are simple and can be applied quickly, while others are further ahead but will hopefully enable a Web that does work across all environments.

Advanced Delivery Methods

If used properly, the solutions mentioned above could make the Web work on a fair portion of modern devices. It is to be expected that as more and more devices try to access the Web, site administrators will realise that they have tools to handle the new scenarios. Yet these methods will work only to an extent. As mentioned in the introduction, the hardware market evolves at a fast pace and the web infrastructure has to adapt to it in more advanced ways.

A better way of handling server-side adaptation is to handle device characteristics on a feature-by-feature basis. Because a list of all browsers is difficult to maintain, a mechanism to access the browser's characteristics from the server is much more scalable: instead of trying to adapt using what the server knows of the browser (if it knows the browser at all), the server can retrieve more explicit information such as screen size, number of colours, whether the device has a speech recogniser, etc.
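As a rough illustration of feature-by-feature adaptation, the Python sketch below picks a presentation from a handful of reported capabilities rather than from a browser name. The `DeviceCapabilities` fields, the 320-pixel threshold and the presentation labels are all assumptions made up for this example; they do not come from any W3C specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceCapabilities:
    """Hypothetical subset of what a device might report about itself."""
    screen_width: int            # in pixels; 0 means the device has no screen
    colours: int                 # number of displayable colours
    has_speech_recogniser: bool


def choose_presentation(caps: DeviceCapabilities) -> str:
    """Pick a presentation from individual features, not from a browser list."""
    if caps.screen_width == 0:
        # No screen at all: fall back to an aural presentation.
        return "voice" if caps.has_speech_recogniser else "audio-only"
    if caps.screen_width < 320:  # arbitrary threshold, for illustration only
        return "small-screen"
    return "desktop"


phone = DeviceCapabilities(screen_width=176, colours=4096, has_speech_recogniser=False)
print(choose_presentation(phone))
```

The point of the design is that a device the server has never heard of is still handled sensibly, as long as it reports the few features the server cares about.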

The Device Independence Activity of the W3C [DI] is exploring this area. In 2004, the Composite Capability/Preference Profiles (CC/PP) specification was released [CCPP], defining a framework for describing device capabilities and user preferences, and how they could be used to perform content adaptation tailored not only to the terminal but also to the person using it. The Open Mobile Alliance has adopted CC/PP as the standard for machine-readable mobile device description.
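To give an idea of what a server could do with such a profile, here is a minimal Python sketch that reads capability attributes out of a much simplified CC/PP-style RDF/XML document. The `ex:` vocabulary (`displayWidth`, `displayHeight`, `speechRecogniser`), its namespace URI and the profile URIs are invented for this example, and the code is a plain XML walk, not an RDF processor; real profiles follow the structure and vocabularies described in [CCPP].

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute vocabulary, made up for this example.
EX = "http://example.org/schema#"

# Simplified profile: one component carrying three attributes.
PROFILE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ccpp="http://www.w3.org/2002/11/08-ccpp-schema#"
         xmlns:ex="http://example.org/schema#">
  <rdf:Description rdf:about="http://example.org/profile#MyPhone">
    <ccpp:component>
      <rdf:Description rdf:about="http://example.org/profile#Hardware">
        <ex:displayWidth>176</ex:displayWidth>
        <ex:displayHeight>208</ex:displayHeight>
        <ex:speechRecogniser>yes</ex:speechRecogniser>
      </rdf:Description>
    </ccpp:component>
  </rdf:Description>
</rdf:RDF>
"""


def read_attributes(profile_xml: str) -> dict:
    """Collect every element of the example vocabulary as a name/value pair."""
    root = ET.fromstring(profile_xml.strip())
    prefix = "{" + EX + "}"
    attributes = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            attributes[name] = (element.text or "").strip()
    return attributes


print(read_attributes(PROFILE))
```

A production system would more likely rely on an RDF library and on an agreed vocabulary, so that defaults and overrides in the profile are resolved properly.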

The DI Activity is also exploring how the delivery context, expressed in CC/PP form, is generated and travels back to the server, and how intermediaries could perform partial adaptation to alleviate the load on both the server and the client.

The Multimodal Web

This section is specifically about the work of the Multimodal Interaction Working Group and outlines the two main directions it is working on for future Web interaction standardisation: new browsers and new web content.

Rationale

The principles we have described so far are addressed by the device independence activities, as well as by recent versions of well known specifications (HTML, CSS, XForms, etc). However, real multimodal interaction will require a bigger effort:

A study of requirements and use cases [UseCases] has been conducted by the Multimodal Interaction Working Group, leading to the definition of the MMI Framework [Framework].

Multimodal browsers

The MMI framework describes a conceptual architecture which modularises the current "monolithic" browser model, introducing components whose interaction fulfills the requirements listed above.

The goal of the work ensuing from the framework is, once essential components have been identified, to standardise the interfaces between these components. Interoperability in this area is crucial, because it is very likely that components (e.g. handwriting recogniser, speech synthesizer, GPS interface) will be built by different manufacturers. It is also worth noting that most of the resulting specifications will reuse or build on existing Web standards.

The MMI Framework

Figure 1 shows a simplified view of the Framework. Between the user and the Web are five components, each taking care of particular multimodal features:

The MMI Framework (detailed view)

The next figure shows more details for the components listed above, as well as new ones which serve specific functions. The role of the "Integration" component is to combine the data produced by each input component into one unified stream that the interaction manager receives. The component takes care of mixing and disambiguating input, handling the possible contradictions and uncertainties. The "Synthesis" component does the opposite: from the single output stream that the interaction manager produces, it generates output data specific to the output modes available.
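The role of the integration component can be illustrated with a small Python sketch. The framework does not mandate any particular algorithm, so the strategy used here (for each slot, keep the interpretation with the highest confidence, whatever mode it came from) is only an assumption chosen for brevity, and the `Interpretation` record is a hypothetical structure, not a format defined by the Working Group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    """Hypothetical result produced by one input component."""
    mode: str          # e.g. "speech" or "pen"
    slot: str          # what the input is about, e.g. "city"
    value: str
    confidence: float  # between 0.0 and 1.0


def integrate(inputs):
    """Merge per-mode interpretations into one slot-to-value mapping.

    Contradictions between modes are resolved by keeping the most
    confident interpretation for each slot.
    """
    best = {}
    for item in inputs:
        current = best.get(item.slot)
        if current is None or item.confidence > current.confidence:
            best[item.slot] = item
    return {slot: item.value for slot, item in best.items()}


unified = integrate([
    Interpretation("speech", "city", "Boston", 0.4),   # noisy recognition
    Interpretation("pen", "city", "Austin", 0.9),      # clear handwriting
    Interpretation("speech", "date", "tomorrow", 0.8),
])
print(unified)
```

A real integration component would also need to take timing into account, for instance to associate a spoken "here" with a pen gesture made at the same moment.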

Note that the components are only conceptual and do not necessarily represent particular hardware or software components, nor does the framework mandate any implementation.

The MMI Framework (detailed input and output views)

The description of the framework goes deeper, and the following figures, copied from the W3C Note [Framework], show even more detail. We won't go into more detail here, but one thing to mention is that both figures show, in red, where existing W3C technology can be found: SSML, CSS, SVG, etc. for output; SRGS and SI for input. The rest of the needed standards are being developed within the group.

Among the standards that are in development or expected to be developed are:

The list of specifications the MMI group is working on is growing, and should eventually cover the full framework. But as we mentioned before, the Multimodal Web won't happen on the browser side only, and so the group is also investigating how multimodality is going to affect what's on the servers, in particular authoring languages for web applications.

Multimodal Web languages

The Multimodal Web will not only be enabled by defining a new browser architecture, but also by creating a new generation of Web languages suited for multimodal interaction. Existing Web pages do not yet provide modality-dependent text (spoken or written) or modality-dependent interaction (e.g. grammars or timing information).

New markup, preferably in the form of extensions to HTML, is therefore needed. One can always add new functionality through scripts, but declarative markup is always preferable, and proto-multimodal extensions to HTML have already been published by other groups: XHTML+Voice [X+V] and Speech Application Language Tags [SALT] define extension markup to handle voice interaction, specifying prompts, grammars, time-out values, etc. The listings below are examples of X+V and SALT, respectively, showing how grammars can be specified for user input:

    <vxml:form id="voice_city">
      <vxml:field name="field_city">
        <vxml:grammar src="city.srgf" type="application/x-srgf"/>
        <vxml:prompt id="city_prompt">
          Please choose a city.
        </vxml:prompt>
        <vxml:catch event="help nomatch noinput">
          For example, say Chicago.
        </vxml:catch>
      </vxml:field>
    </vxml:form>

    <salt:listen id="lsnName" onreco="askTravel.start();">
      <salt:grammar id="gram2" name="gram2">
        <grammar version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/06/grammar" root="aUser">
          <rule id="aUser" scope="public">
            <one-of>
              <item>Fred</item>
              <item>Sam</item>
            </one-of>
          </rule>
        </grammar>
      </salt:grammar>
      <salt:bind targetelement="saltdebug" value="/" />
      <salt:bind targetelement="userName" value="//"/>
    </salt:listen>

Another way of extending Web content ensues from the observation that the concept of styling HTML pages could be generalised to that of "skinning Web applications": a standard web page can not only have its appearance defined by a stylesheet, but also the way the user can interact with it. Standard stylesheet selection mechanisms could allow controlling speech input or pen interaction with a page, for example for form filling or following links.

A language embracing this concept is CSS-MMI [CSS-MMI], which adds new properties and selectors to CSS. For instance:

#wants-drink:focus {
   prompt: "do you want a drink?";
   grammar: yes | no;
   reprompt: 3s;
   next: "yes" #beverage "no" #food;
   }
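One possible reading of the rule above is that, when the element gains focus, the prompt is played, the answer is matched against the grammar, and `next` moves the focus according to what was recognised. The Python sketch below models that reading as a lookup table. It is an interpretation made for illustration, not the processing model defined by CSS-MMI, and the `RULES` structure and `next_focus` function are hypothetical names.

```python
# The rule from the listing above, transcribed into a plain data structure.
RULES = {
    "wants-drink": {
        "prompt": "do you want a drink?",
        "grammar": ("yes", "no"),
        "next": {"yes": "beverage", "no": "food"},
    },
}


def next_focus(element_id, utterance, rules=RULES):
    """Return the id of the element that should receive focus next.

    If the utterance is not covered by the grammar, the focus stays on
    the same element, which would lead to the prompt being repeated.
    """
    rule = rules[element_id]
    word = utterance.strip().lower()
    if word not in rule["grammar"]:
        return element_id
    return rule["next"][word]


print(next_focus("wants-drink", "Yes"))
print(next_focus("wants-drink", "maybe"))
```

Keeping the dialogue in a declarative table mirrors the argument for stylesheets: the page itself is untouched, and a different table could drive a different mode of interaction.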

Different modes of interaction can be specified by using CSS's @media rule:

@media speech {

  /* rules applicable to the speech media type */

}

@media handheld {

  /* rules applicable to handheld media type */
}

The three languages mentioned here are part of a wider range of media-independent authoring markup languages, many of which are used internally and are usually transformed back to other media-specific languages like HTML or VoiceXML. See the papers presented at the W3C Workshop on Multimodal Interaction [MMIWS] for a few examples.

One of the goals of the Multimodal Interaction Working Group is to develop an authoring language standardising the proposals above, to allow simple multimodal capabilities to be added to existing markup languages in a way that is backwards compatible with the current Web, and which builds upon widespread familiarity with existing Web technologies.

Conclusion

The Multimodal Interaction Working Group is one of the largest W3C groups. That is good news, given the amount of work defined in the MMI framework, which we have only skimmed through here. Even if we are still a few years away from having a complete implementation of it, it should be pointed out that each piece, each specification currently being written, can be used on its own, without having the whole framework specified and implemented. For instance, InkML is already used in form filling applications, to store the raw input. Similarly, the Dynamic Properties Framework can be implemented to work within HTML scripts, as the example above shows.

There is no doubt that the Web will evolve to suit the increasingly varied forms of computers we will be using, embracing the general trend to make human-machine interaction more human-oriented than it ever was. The Multimodal Interaction Working Group hopes that it will have helped move the Web in that direction.

Acknowledgements

The author's work in the Multimodal Interaction Working Group is supported by the MWeb project of the European Commission's IST programme.

Bibliography

[TBLMobi] New Top Level Domains Considered Harmful. Tim Berners-Lee, 2004.
[CCPP] Composite Capability/Preference Profiles (CC/PP): Structure and Vocabularies 1.0. W3C Recommendation, 2004.
[UseCases] Multimodal Interaction Use Cases. W3C Note, 2002.
[Framework] W3C Multimodal Interaction Framework. W3C Note, 2003.
[MID] Modality Component to Host Environment DOM Requirements and Capabilities Assessment. W3C Working Group Note, 2004.
[MMIWS] W3C Workshop on Multimodal Interaction, 2004.
[CSS-MMI] CSS Extensions for Multimodal Interaction. Dave Raggett, Max Froumentin, 2004.
[X+V] XHTML+Voice Profile 1.0. W3C Note, 2001.
[SALT] Speech Application Language Tags. SALT Forum, 2003.
[InkML] Ink Markup Language. W3C Working Draft, 2004.
[DPF] Dynamic Properties Framework (DPF). W3C Working Draft, 2004.
[EMMA] EMMA: Extensible MultiModal Annotation markup language. W3C Working Draft, 2004.
[DI] Device Independence Activity home page, W3C.
[MMI] Multimodal Interaction Activity home page, W3C.

Biography

Max Froumentin

W3C http://www.w3.org