XTech 2005: XML, the Web and beyond.
This paper is a survey of the work currently carried out by the Multimodal Interaction Working Group at W3C. After a presentation of the concepts of multimodal interaction on the Web, the global architecture designed by the Working Group is outlined, followed by details about the specifications being developed. Relationships with other W3C activities, as well as non-W3C efforts in this area, are also covered.
Although the W3C Multimodal Interaction (MMI) Activity [MMI] has been active since 2002, it is still under the radar of the web community at large. Nevertheless, the MMI Working Group has been making progress, and now that three of its specifications are being published as Last Call Working Drafts, its work is becoming more visible.
Along with the Voice Browser and the Device Independence Working Groups, the MMI group is investigating what the browsers of the future will be, whether they are embedded in pocket-sized mobile phones, running on big desktop PCs connected to various input and output devices, or integrated with your car's navigation and entertainment system.
Standardisation work is especially important in this area as browser vendors are already deploying multimodal solutions. Efforts to use existing and widespread standards in new environments are also visible: VoiceXML servers running on mobile phones, SMIL for multimedia messaging, etc.
This paper is a survey of the work of the MMI Working Group, with a focus on its dependencies on other W3C activities, in particular the Device Independence Activity.
The first part describes the major developments (mostly new hardware) that are starting to change the notion of Web access, chiefly by introducing users to the concept of multimodal interaction. We then describe which current techniques can be used to adapt to these uses. The third part details the specific work of W3C in that area: the MMI Framework and related specifications, followed by longer-term solutions envisaged by the MMI group.
The ability of mobile devices to access the Web is changing the way people experience the Web, as well as how browsers are designed and web content is written. So far, the main technical hurdle in adapting the Web to mobile phones has been adapting pages to small screens. But a greater challenge is coming, as more than just "small-screen" devices are accessing web pages.
Indeed, mobile phones are no longer the only new devices able to access the Web: interactive televisions and recorders, car navigation systems, game consoles, etc. have or will have web access. The challenge is then to adapt web software, as well as web content, not just for small screens but perhaps for no screen at all. An important change is that those devices also feature new human input mechanisms: voice or handwriting recognition, visual input using integrated cameras, keypads, scroll-wheels or haptic devices.
Furthermore, these new devices bring more than new input and output modes to the web experience: many of them also have a greater awareness of their environment than the traditional desktop computer. Mobile devices contain many sensors whose readings could influence the experience of browsing: location and speed, ambient noise or brightness, network signal or battery level. A truly multimodal Web should take all these new modalities into account.
Currently the Web isn't changing as much as it could, and as a consequence things don't work so well. Yet there are short-term solutions, based on existing standards, to make them work better. This section lists some of those solutions (starting with non-solutions, in fact) while the next one describes a more complete answer to the problem of "real" multimodal interaction.
Some attempts at making the Web mobile and multimodal have failed or are bound to fail. However appealing they may seem as a quick fix, they all run the risk of fragmenting the Web into several incompatible sub-webs. The sources of that risk are listed below:
http://www.xzy.mobi. This approach is not only architecturally wrong (top-level domains describe the server, not the client) but would also create much confusion about what is "mobile" or not: should a laptop computer access acme.mobi or acme.com? Pushing the argument further would incite the creation of even more domains describing new classes of devices [TBLMobi].
Despite the initial appeal of the solutions above, they all prove to be a poor choice. However, better alternatives and sound technical solutions exist. Some are simple and can be applied quickly, while others are further ahead but will hopefully enable a Web that does work across all environments.
If used properly, the solutions mentioned above could make the Web work on a fair portion of modern devices. It is to be expected that as more and more devices try to access the Web, site administrators will realise that they have tools to handle the new scenarios. Yet these methods will work only to an extent. As mentioned in the introduction, the hardware market evolves at a fast pace and the web infrastructure has to adapt to it in more advanced ways.
A better way of handling server-side adaptation is to handle device characteristics on a feature-by-feature basis. Because a list of all browsers is difficult to maintain, a mechanism to access the browser's characteristics from the server is much more scalable: instead of trying to adapt using what the server knows of the browser (if it knows the browser at all), the server can retrieve more explicit information such as screen size, number of colours, whether the device has a speech recogniser, etc.
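The feature-by-feature approach can be illustrated with a minimal Python sketch. The `DeliveryContext` class, its three fields, the 320-pixel threshold and the returned keys are all assumptions made for illustration; they are not taken from CC/PP or any other specification. The point is only that the decision depends on individual capabilities rather than on a browser name.

```python
from dataclasses import dataclass


@dataclass
class DeliveryContext:
    """Hypothetical subset of device characteristics known to the server."""
    screen_width: int = 0            # in pixels; 0 means the device has no screen
    colours: int = 0                 # number of colours the display can render
    speech_recogniser: bool = False  # whether voice input is available


def choose_presentation(ctx: DeliveryContext) -> dict:
    """Decide how to present a page from individual features only."""
    if ctx.screen_width == 0:
        layout = "none"       # no screen at all: rely on other output modes
    elif ctx.screen_width < 320:
        layout = "compact"    # illustrative threshold for small screens
    else:
        layout = "full"
    return {
        "layout": layout,
        "colour_images": layout != "none" and ctx.colours > 2,
        "voice_input": ctx.speech_recogniser,
    }


phone = DeliveryContext(screen_width=176, colours=65536, speech_recogniser=True)
car = DeliveryContext(screen_width=0, colours=0, speech_recogniser=True)
print(choose_presentation(phone))
print(choose_presentation(car))
```

A device unknown to the server is handled like any other, as long as it reports the same characteristics.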
The Device Independence Activity of the W3C [DI] is exploring this area. In 2004, the Composite Capability/Preference Profiles (CC/PP) specification was released [CCPP], defining a framework for describing device capabilities and user preferences, and how they could be used to perform content adaptation tailored not only to the terminal but also to the person using it. The Open Mobile Alliance has adopted CC/PP as the standard for machine-readable mobile device description.
The DI Activity is also exploring how the delivery context, expressed in CC/PP form, is generated and conveyed to the server, as well as to intermediaries which could perform partial adaptation to alleviate the load on the server and on the client.
This section is specifically about the work of the Multimodal Interaction Working Group and outlines the two main directions it is working on for future Web interaction standardisation: new browsers and new web content.
The principles we have described so far are addressed by the device independence activities, as well as by recent versions of well-known specifications (HTML, CSS, XForms, etc.). However, real multimodal interaction will require a bigger effort:
A study of requirements and use cases [UseCases] has been conducted by the Multimodal Interaction Working Group, leading to the definition of the MMI Framework [Framework].
The MMI Framework describes a conceptual architecture which modularises the current "monolithic" browser model, introducing components whose interaction fulfills the requirements listed above.
The goal of the work ensuing from the framework is, once essential components have been identified, to standardise the interfaces between these components. Interoperability in this area is crucial, because it is very likely that components (e.g. handwriting recogniser, speech synthesizer, GPS interface) will be built by different manufacturers. It is also worth noting that most of the resulting specifications will reuse or build on existing Web standards.

Figure 1 shows a simplified view of the Framework. Between the user and the Web are five components, each taking care of particular multimodal features:

The next figure shows more details for the components listed above, as well as new ones which serve specific functions: the role of the "Integration" component is to combine the data produced by each input component into one unified stream that the interaction manager receives. The component takes care of mixing and disambiguating input, handling the possible contradictions and uncertainties. The "Synthesis" component does the opposite: from the single output stream that the interaction manager produces, it generates output data specific to the output modes available.
Note that the components are only conceptual and do not necessarily represent particular hardware or software components, nor does the framework mandate any implementation.
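As a rough illustration of what an integration component might do, the Python sketch below merges the results of several input components into one set of values. The `ModalityInput` structure, the slot names and the rule that the more confident input wins a conflict are assumptions chosen for the example; the framework itself does not prescribe any particular fusion strategy.

```python
from dataclasses import dataclass, field


@dataclass
class ModalityInput:
    """Hypothetical result produced by one input component."""
    modality: str                 # e.g. "speech" or "pen"
    confidence: float             # recogniser confidence between 0 and 1
    slots: dict = field(default_factory=dict)


def integrate(inputs):
    """Merge inputs into one unified result for the interaction manager.

    Slots filled by only one modality are kept as they are. When two
    modalities disagree on a slot, the value from the more confident
    input is retained (an illustrative policy, not a mandated one).
    """
    merged = {}
    for item in sorted(inputs, key=lambda i: i.confidence, reverse=True):
        for slot, value in item.slots.items():
            merged.setdefault(slot, value)
    return merged


speech = ModalityInput("speech", 0.60, {"action": "zoom", "place": "Boston"})
pen = ModalityInput("pen", 0.90, {"place": "Austin"})
print(integrate([speech, pen]))
```

Here the spoken command supplies the action, while the more confident pen input decides the place.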


The description of the framework goes deeper, and the following figures, copied from the W3C Note [Framework], show even more detail. We won't go into more detail here, but it is worth noting that both figures show, in red, where existing W3C technology can be found: SSML, CSS, SVG, etc. for output; SRGS and SI for input. The rest of the needed standards are being developed within the group.
Among the standards that are in development, or expected to be developed, are:
The specification of a wrapper format for transporting input data from the modality components to the interaction manager: the Extensible MultiModal Annotation language [EMMA] is an XML vocabulary to represent recognition results. An instance typically contains one or several possible values of an interpretation, along with related metadata (time, device information, type of modality, confidence scores). The sample EMMA file below shows two alternative interpretations of a spoken utterance, each with a different confidence value:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
EMMA data is generated by input modality components and is later processed by integration components, which perform operations such as selecting from a list of alternatives or merging the inputs from different modalities, eventually resulting in the interaction manager receiving a single unambiguous result.
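The simplest of these operations, selecting one alternative from a list, can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. The function name `best_interpretation` and the policy of keeping the highest confidence score are assumptions for illustration; a real integration component would probably also weigh context and the other modalities.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# The sample EMMA document shown above.
SAMPLE = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
"""


def confidence(interpretation):
    """Read the namespaced emma:confidence attribute, defaulting to 0."""
    return float(interpretation.get(f"{{{EMMA_NS}}}confidence", "0"))


def best_interpretation(emma_xml):
    """Return the application data of the most confident interpretation."""
    root = ET.fromstring(emma_xml)
    candidates = root.iter(f"{{{EMMA_NS}}}interpretation")
    best = max(candidates, key=confidence, default=None)
    if best is None:
        return None  # the document contains no interpretation at all
    return {child.tag: child.text for child in best}


print(best_interpretation(SAMPLE))
```

Applied to the sample, the sketch keeps the "Boston" interpretation, whose confidence (0.75) is higher than that of the "Austin" one (0.68).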
System and Environment component: the "Dynamic Properties Framework" [DPF] is a specification which standardises IDL interfaces that let the interaction manager access system parameters or query environmental conditions, and adapt the application accordingly. The example below shows DPF interface functions used in an XHTML document which reacts to the device's battery level:
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dpf="http://www.w3.org/2004/11/dpf"
      xmlns:sel="http://www.w3.org/2004/06/diselect">
  <head>
    <title>Battery check</title>
  </head>
  <body>
    <sel:select>
      <!-- battery level is a number between 0 and 100 -->
      <sel:when sel:expr="dpf:component()/device/battery &lt; 20">
        <h1 class="alert">Low Battery</h1>
        <p>You are running low on power!</p>
      </sel:when>
      <sel:otherwise>
        <h1 class="alert">High Battery</h1>
        <p>You have plenty of power!</p>
      </sel:otherwise>
    </sel:select>
  </body>
</html>
The DPF specification doesn't provide a list of properties. Similarly to CC/PP, they are expected to be defined by other organisations, either for standard or for proprietary uses.
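The general idea of reading a dynamic property and adapting when it changes can be sketched in Python as follows. This does not reproduce the DPF IDL interfaces: the `Property` class, its `add_listener` method and the `battery_heading` helper are hypothetical names introduced for this example, and only the 20% threshold comes from the markup above.

```python
class Property:
    """Hypothetical stand-in for one dynamic property (not the DPF IDL)."""

    def __init__(self, name, value):
        self.name = name
        self._value = value
        self._listeners = []

    def add_listener(self, callback):
        """Register a function to call whenever the value changes."""
        self._listeners.append(callback)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        # Notify listeners only when the value actually changes.
        if new_value != self._value:
            self._value = new_value
            for callback in self._listeners:
                callback(self)


def battery_heading(level):
    """Mirror the markup example: below 20 is low, otherwise high."""
    return "Low Battery" if level < 20 else "High Battery"


battery = Property("battery", 80)
headings = [battery_heading(battery.value)]  # initial query
battery.add_listener(lambda prop: headings.append(battery_heading(prop.value)))
battery.value = 15                           # the device reports a new level
print(headings)
```

The application first queries the property, then is notified of the change and updates its heading, without polling.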
InkML [InkML]: an XML language to represent "digital ink" traces, for handwriting recognition and other pen-based applications.
<ink>
  <traceFormat id="xy">
    <regularChannels>
      <channel name="X" type="decimal" units="mm"/>
      <channel name="Y" type="decimal" units="mm"/>
      <channel name="P" type="decimal" units="N"/> <!-- pressure -->
    </regularChannels>
  </traceFormat>
  <trace>
    10 0 9 14 8 28 7 42 6 56 6 70 8 84 8 98 8 112 9 126 10 140
    13 154 14 168 17 182 18 188 23 174 30 160 38 147 49 135
    58 124 72 121 77 135 80 149 82 163 84 177 87 191 93 205
  </trace>
  <trace>
    130 155 144 159 158 160 170 154 179 143 179 129 166 125
    152 128 140 136 131 149 126 163 124 177 128 190 137 200
    150 208 163 210 178 208 192 201 205 192 214 180
  </trace>
</ink>
InkML is meant to represent either handwritten input, or pen gesture data, as sent from a graphics tablet to a handwriting or gesture recogniser. Alternatively, it can be used to specify handwriting or gesture grammars, with trace instances representing expected input from the user, expressed in a canonical form.
The list of specifications the MMI group is working on is growing, and should eventually cover the full framework. But as we mentioned before, the Multimodal Web won't happen on the browser side only, and so the group is also investigating how multimodality is going to affect what's on the servers, in particular authoring languages for web applications.
The Multimodal Web will not only be enabled by defining a new browser architecture, but also by creating a new generation of Web languages suited for multimodal interaction. Existing Web pages do not yet provide modality-dependent text (spoken or written) or modality-dependent interaction (e.g. specifying grammars or timing information).
New markup, preferably extensions to HTML, would therefore be needed. New functionality can always be added through scripts, but declarative markup is generally preferable, and proto-multimodal extensions to HTML have already been published by other groups: XHTML+Voice [X+V] and Speech Application Language Tags [SALT] define extension markup to handle voice interaction: specifying prompts, grammars, time-out values, etc. The listings below are examples of X+V and SALT, respectively, showing how grammars can be specified for user input:
<vxml:form id="voice_city">
  <vxml:field name="field_city">
    <vxml:grammar src="city.srgf" type="application/x-srgf"/>
    <vxml:prompt id="city_prompt">
      Please choose a city.
    </vxml:prompt>
    <vxml:catch event="help nomatch noinput">
      For example, say Chicago.
    </vxml:catch>
  </vxml:field>
</vxml:form>
<salt:listen id="lsnName" onreco="askTravel.start();">
  <salt:grammar id="gram2" name="gram2">
    <grammar version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/06/grammar" root="aUser">
      <rule id="aUser" scope="public">
        <one-of>
          <item>Fred</item>
          <item>Sam</item>
        </one-of>
      </rule>
    </grammar>
  </salt:grammar>
  <salt:bind targetelement="saltdebug" value="/"/>
  <salt:bind targetelement="userName" value="//"/>
</salt:listen>
Another way of extending Web content ensues from the observation that the concept of styling HTML pages could be generalised to that of "skinning Web applications": a standard web page can not only have its appearance defined by a stylesheet, but also the way the user can interact with it. Standard stylesheet selection mechanisms could allow controlling speech input or pen interaction with a page, for example for form filling or following links.
A language embracing this concept is CSS-MMI [CSS-MMI], which adds new properties and selectors to CSS. For instance:
#wants-drink:focus {
  prompt: "do you want a drink?";
  grammar: yes | no;
  reprompt: 3s;
  next: "yes" #beverage "no" #food;
}
Different modes of interaction can be specified by using CSS's @media rule:
@media speech {
  /* rules applicable to the speech media type */
}
@media handheld {
  /* rules applicable to the handheld media type */
}
The three languages mentioned here are part of a wider range of media-independent authoring markup languages, many of which are used internally and are usually transformed back to other media-specific languages like HTML or VoiceXML. See the papers presented at the W3C Workshop on Multimodal Interaction [MMIWS] for a few examples.
One of the goals of the Multimodal Interaction Working Group is to develop an authoring language standardising the proposals above, to allow simple multimodal capabilities to be added to existing markup languages in a way that is backwards compatible with the current Web, and which builds upon widespread familiarity with existing Web technologies.
The Multimodal Interaction Working Group is one of the largest W3C groups. This is good news, given the amount of work defined in the MMI Framework, which we have only skimmed through here. Even if we are still a few years away from having a complete implementation of it, it should be pointed out that each piece, each specification currently being written, can be used on its own, without having the whole framework specified and implemented. For instance, InkML is already used in form filling applications, to store the raw input. Similarly, the Dynamic Properties Framework can be implemented to work within HTML scripts, as the example above shows.
There is no doubt that the Web will evolve to suit the increasingly varied forms of computers we will be using, embracing the general trend to make human-machine interaction more human-oriented than it ever was. The Multimodal Interaction Working Group hopes that it will have helped move the Web in that direction.
The author's work in the Multimodal Interaction Working Group is supported by the MWeb project of the European Commission's IST programme.