Abstract
Several XML-based standards mandate the normalization of various aspects of XML and XML-based technologies. Examples of these normalization requirements include date/time formats, the XML character production, and Unicode normalization forms. Normalization allows for compatibility between unfamiliar systems, which is necessary to fulfill scenarios that view the Web as a single, large application. Component systems within a Web application might benefit from relying on an early normalization process, which allows them to perform operations such as collation and string-matching without having to consider multiple potential forms of the incoming data.
However, universal mandatory normalization can also restrict the flexibility for systems engaged in a private contract to efficiently use XML and XML-based technologies. Web services are probably the most prevalent example of such systems. The early normalization process might require a system to perform various additional encodings and decodings simply for the ability to use XML as a transport between system components, especially if they natively use a normalization form different from the mandated one.
As current XML standards continue to evolve and new standards develop, the issue of mandatory normalization will continue to require the XML community to carefully consider the balance between two important Web goals: enabling unfamiliar systems to make certain assumptions about each other's data, and keeping it practical for familiar systems to leverage the same standards and technologies.
This paper will address these concerns by focusing on the specific issues around the character normalization debate.
In April 2002, the Internationalization (I18N) Working Group published the second Last Call draft of its “Character Model for the World Wide Web 1.0” [CharMod]. Among the topics addressed in the document is a call for all Web content to be produced in the Unicode normalization form NFC. It also calls for all text processors, including the XML technologies, to enforce checks of the proper normalization form.
This decision represents a bias towards ensuring standards are limited enough in the general case to allow all applications to easily implement them. It also appears to favor scenarios involving globally unfamiliar systems needing to exchange and work with data. While this is an important scenario, it would be a mistake to limit the expressiveness of Web standards to the point that they inhibit important domain-specific applications. In local or regional uses of the Web, systems often have enough knowledge of other systems that the general case is not the optimal case.
This paper investigates the potential impact that the [CharMod] normalization proposal will have on key Web scenarios. The reader is assumed to have a basic understanding of coded character sets and encoding schemes. Unicode Technical Report #17, “Character Encoding Model” [UTR #17] (see http://unicode.org/unicode/reports/tr17/), is an excellent reference on this subject and is highly recommended reading for those not familiar with the distinctions across various levels of the character encoding model.
In order to help the reader understand Unicode normalization forms and how they will impact data interchange on the Web, a fictional analogy is described. Hopefully, this story will explain the importance of making the right decisions regarding normalization standards.
The following is a fictional story about the European Union's implementation of the euro currency. While a small part of this story is based on the actual history of the euro, most of it is created to serve as an analogy to the issues surrounding Unicode normalization. The fictional Europeans in this story are much more troubled with the euro than actual Europeans have been.
Once upon a time, 15 European nations got together and decided to form the European Union (EU). One of the first items on the agenda was to create a common currency, to be called the “euro” (one might call it a “Unicoin”). They even created the European Central Bank to regulate the euro (sort of a “Unicoin Consortium”). 12 of the 15 nations eventually adopted the euro; Denmark, Sweden, and the United Kingdom chose to keep their own currency (“legacy currency”).
Euro notes were issued in denominations of 5, 10, 20, 50, 100, 200, and 500 euros. Coins were issued in denominations of 1, 2, 5, 10, 20, and 50 cents, and 1 and 2 euros.[1] Prior to the euro, transactions between EU nations required one currency to be converted to the other. So, the new common currency was going to make commerce much easier.
The euro currency represents an abstract value space similar to the Universal Character Set (UCS) abstract character repertoire[2]. Any particular currency value could be represented by some combination of coins or notes; similarly, an abstract character could be represented by some combination of code points in the UCS coded character set. Often, a character will be represented by a single code point, but in many cases, a character may be represented by a base character followed by a sequence of diacritic marks.
The next item on the EU agenda (not nearly as well publicized) was to create an efficient means of transferring money. The solution required a system that was easy for all nations to work with, while still being expressive enough to be useful.
They chose to contract this job out to an enterprising young boy, who became known as the Xfer Money Lad (XML). XML was in a unique position to fulfill this job due to his ability to speak a language that was not associated with any country, and yet was easy for any country to learn and use. This language was also expressive enough to handle all expected monetary transfers.
Since 12 of the 15 nations used the euro, XML decided he would adopt the euro as his standard as well. Although not required by his charter, he could also choose to accept the legacy currencies (the Krone, Krona, and Pound) by converting them to euros as needed.
XML got the job due to his balance of simplicity and expressiveness. If he was too complex, he would not be accessible enough for a variety of countries to adopt. If he was too simple, he would not support the expressive power needed to do the job. It is also worth noting that the three countries choosing to keep their own currencies do not share the same monetary units as the euro. Compare this to how the Shift-JIS encoding scheme does not cleanly map to UCS [3]. XML may choose to transcode legacy encodings to UCS code points, just as the Xfer Money Lad was willing to convert various denominations of UK pounds to euro values. [4]
XML did a fine job of transferring money, but there were still issues that needed to be resolved about the form of the currency.
While the use of the euro made transactions simpler, the European Central Bank noticed that many (fictional) businesses were having trouble performing certain operations on the new currency. Businesses would occasionally verify their cash by counting their money -- literally counting the pieces of currency. So, if 5 euro consisted of two 2-euro coins and one 1-euro coin, then the count would yield 3. If it consisted of five 50-cent coins and twenty-five 10-cent coins, the count of 5 euro would result in 30.
In addition, it was (for the purposes of this analogy) a common tradition to pay a gratuity equal to the last coin or note passed in the transaction. This meant that if a 10 euro transaction was made up of a single note, a 10 euro tip would be in order. If the same amount was made up of 1000 1-cent coins, only a 1 cent tip would be required. While these practices may seem strange to us, they were vital to the European businesses (at least the ones of this allegory).
Finally, the tram ticket dispensing machines accepted exact change for the 70-cent tram fare. There were literally dozens of possible coin combinations that could match the value of 70-cents. Machine operators complained that it was too difficult to make all of their machines accept every possible matching combination, yet it was vital that the machines could confirm when the fare deposited was equal to the fare required.
This part of the story demonstrates how the fictional EU businesses struggled with normalization-sensitive operations. Determining string length, identifying the last (possibly non-spacing) character, and matching strings are normalization-sensitive operations. String matching is especially important; its dependence on normalization form was made clear in [XML 1.0] by the following quote: [“Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings.” (http://www.w3.org/TR/2000/REC-xml-20001006#dt-match)] This is a significant statement. “Multiple representations” refers to multiple sequences of UCS code points representing the same abstract character. Not only was [XML 1.0] stating that text would be case-sensitive, but it would also be normalization-sensitive. Two strings that look identical when displayed (because they represent the same abstract characters) are not equal unless they consist of equal sequences of code points. This is why normalization is important in XML; without normalization, identical abstract characters may have different code point representations and therefore fail to match. In the same way that the tram ticketing machine could not match multiple representations of 70 cents, XML is not able to match multiple representations of an abstract character.
However, not all operations are normalization-sensitive. Table 1 categorizes a few common operations:
| Normalization-Sensitive Operations | Normalization-Insensitive Operations |
| Determining string length | Visually rendering data, such as in an XML x XSLT => HTML transformation |
| Deleting the last character (identifying how many code points make up the last character) | Processing SOAP message text payloads |
| String identity matching (in security checks, IRI comparisons, and indexing) | Some string operations, such as case changes |
| Searching | Encoding from Coded Character Set to Character Encoding Scheme |
Table 1.
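The normalization-sensitive operations in Table 1 can be illustrated with a small sketch. It uses Python's standard `unicodedata` module, which is an assumption of this example rather than something the specifications discussed here prescribe, and it measures length in code points, as Python strings do.

```python
import unicodedata

# Two spellings of the same abstract character "ç".
precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # LATIN SMALL LETTER C + COMBINING CEDILLA

# Determining string length: the answer depends on the spelling.
len_precomposed = len(precomposed)   # 1 code point
len_decomposed = len(decomposed)     # 2 code points

# Identifying the last character: in the decomposed spelling the final
# code point is a combining mark, so deleting "the last character"
# means removing more than one code point.
last_is_combining = unicodedata.combining(decomposed[-1]) != 0

# String identity matching: code point sequences differ, so a plain
# comparison fails until both sides are brought to the same form.
matches_raw = precomposed == decomposed
matches_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Here `matches_raw` is false while `matches_after_nfc` is true, which is the tram machine's problem in miniature: the same value arrives in two different combinations of coins.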
So, the European Central Bank decided to create normalized forms for acceptable combinations of currency denominations. One form required all amounts of money to be represented by the largest possible denominations; this was called the “precomposed” normalization form, or NFC. Another form required that only the smallest possible denominations be used; this was called the “decomposed” normalization form, or NFD. The European Central Bank felt that this selection of forms would allow different cultures, regions, and industries to choose the optimal form for their needs.
This fixed many problems. As long as one knew what normalization form the currency was in, one could get a consistent count, or predict the gratuity of any transaction. And for those who didn’t count their money and were not involved in businesses where gratuity was customary, they could keep their currency unnormalized.
While this worked fine across groups that all used the same normalization form, there still was a problem when doing business with someone using a different normalization form (or no normalization at all). When employees in the restaurant business, which chose to use the precomposed normalization form, deposited cash at their bank, which used the decomposed form, a conversion needed to be done for each deposit. Anytime a normalization-sensitive operation was performed, such as counting, one had to check which form was being used. If a counting machine was designed only for NFC but was supplied with currency in NFD, it would have to reject the currency, return an incorrect total, or possibly normalize the currency to its preferred form.[5]
The Unicode Standard Annex #15 [UAX #15] (see http://www.unicode.org/unicode/reports/tr15/) describes four normalized forms of Unicode text. Two of them, NFKC and NFKD, address normalization of compatibility characters, which is not being discussed here[1]. The other two forms, NFC and NFD, address normalization of canonically equivalent characters by comparing their precomposed forms or decomposed forms, respectively.
Comparing precomposed forms means ensuring that for any combination of a base character followed by a diacritic (such as a “c” followed by the cedilla mark), there is not already a single code point that represents that character (as there happens to be in this case: “ç”). Fig. 1 illustrates a denormalized input received by a system designed to handle NFC-normalized text. In this case, the system is being presented with a LATIN SMALL LETTER C (U+0063) followed by a COMBINING CEDILLA (U+0327). Notice that whether the text is normalized or not, it will still display the same way. However, since the code points are different, a translation needs to be performed to determine if they are equal. If the system checks for NFC normalization, it will determine that there is already a precomposed UCS character, which means the combination provided is not valid. If it chooses to normalize the input to NFC, it will convert the combination of code points to the single code point, LATIN SMALL LETTER C WITH CEDILLA (U+00E7), as shown below:
Figure 1. NFC system rejects the decomposed character since there exists a canonically equivalent precomposed character.
In the decomposed form, a character would be required to be broken down into its parts unless only a precomposed form existed in the character set. Fig. 2 illustrates a precomposed input received by a system designed to handle NFD-normalized text. In this case, the system is being presented with LATIN SMALL LETTER C WITH CEDILLA (U+00E7). Again, the system will reject the input since it could be further decomposed and is therefore not in its canonical decomposed form required by NFD. This figure also shows the result of a normalization transformation:
Figure 2. NFD system rejects the precomposed character since it could be broken down into valid UCS code points.
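The check-or-convert behavior described for Figures 1 and 2 can be sketched as follows. This is a minimal illustration using Python's standard `unicodedata` module; the helper names `is_in_form` and `to_form` are hypothetical and chosen only for this example.

```python
import unicodedata

def is_in_form(text: str, form: str) -> bool:
    """Return True if text is already in the given form ("NFC" or "NFD")."""
    return unicodedata.normalize(form, text) == text

def to_form(text: str, form: str) -> str:
    """Convert text to the given normalization form."""
    return unicodedata.normalize(form, text)

decomposed = "c\u0327"   # U+0063 followed by U+0327, as in Fig. 1
precomposed = "\u00e7"   # U+00E7, as in Fig. 2

# An NFC system sees the decomposed input as not normalized and can
# either reject it or convert it to the single code point U+00E7.
nfc_accepts_decomposed = is_in_form(decomposed, "NFC")
nfc_result = to_form(decomposed, "NFC")

# An NFD system sees the precomposed input as not normalized and can
# either reject it or break it down into U+0063 U+0327.
nfd_accepts_precomposed = is_in_form(precomposed, "NFD")
nfd_result = to_form(precomposed, "NFD")

# When no precomposed code point exists (x with cedilla is used here as
# an example), the base+diacritic sequence is already valid NFC.
no_precomposed = "x\u0327"
nfc_accepts_uncomposable = is_in_form(no_precomposed, "NFC")
```

Both systems render the text identically; only the code point sequence they are willing to accept differs.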
The requirement for only the highest or lowest denominations of a particular currency amount is meant to parallel the Unicode NFC and NFD forms, which require using the most or least composed code point(s) available to represent a particular character. Not all operations are normalization-sensitive, and the normalization-sensitive ones work fine if it can be known that all parties use the same form.
At this point, the EU chartered a working group to investigate how to best deal with these normalization issues. They now understood the impact of normalization forms on important operations such as counting and comparison. The next step was to decide when and how to apply these normalizations. The main decision centered on whether normalization should take place only when needed, or before entering the market. Waiting until needed would provide the flexibility of multiple allowable normalization forms. Requiring normalization before entering the market would give all transactions a guarantee of which normalization form to expect.
The working group eventually proposed a solution: all cash must be normalized in the precomposed NFC form before entering the market. Some of the reasons behind their decision included:
It was already common practice to precompose currency. Other than the bankers, who just liked to count as high as possible, most people tried to use as few coins and notes as possible to represent an amount.
The frequency with which money entered the market was considerably less than the frequency that currency comparisons, additions, subtractions, and other normalization-sensitive operations were performed.
Most transactions occurring at the time were not checking normalization, and instead were assuming that the currency was already in the appropriate form.
It was unreasonable to expect all parts of the market (including the tram ticket dispensing machines, for example) to be able to perform all necessary normalization conversions.
Since this whole idea of “early normalization” was to allow monetary transactions to have a guarantee that all amounts of the transaction were in the same form, they needed to find a way to enforce this policy. So, the working group suggested that XML reject any currency that wasn’t already normalized. Since XML was involved in every monetary transfer, no transaction could ever involve money that had not already been normalized in NFC. A different working group (the one that hired XML) decided to comply with the other group’s recommendation. They instructed XML that he should inspect any sums of money when it is handed to him and not leave until he had verified that it was composed of the highest denominations possible. If not, he should hand the money back and attend to another customer.
This reflects the 30 April 2002 draft of [CharMod] and the 25 April 2002 draft of [XML 1.1(LC)]. The specific reasons for choosing early normalization are a sampling of those listed at http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-EarlyUniformNormalization. [XML 1.1(LC)] required all XML processors to verify documents were in NFC, as a condition of well-formedness. Documents not in NFC were required to be rejected. See Section 3 for a summary of the latest on these two specifications.
So, the early normalization solution was carried out. Some people saw an improvement. They no longer had to worry about checking whether they were receiving unnormalized or differently normalized currency. They had already been using NFC, so there was no extra work for them. However, there were three groups of people who had serious complaints about the new policy:
XML: Business dropped off for XML once the early NFC requirement went into effect. Customers who were used to using XML to move currency from one private business to another couldn’t understand why XML suddenly required the extra normalization work.
Bankers: As mentioned earlier, bankers have always preferred the denormalized format. They understood that doing business with the restaurant industry required someone to convert from their preferred form; since the working group had chosen NFC as the standard for inter-op, they would comply. However, what really bothered the bankers about the early normalization policy was that they even had to convert to a less-preferred format when exchanging currency with other banks! No bank wanted to work with a precomposed form, but every transfer between banks required the transferring bank to change its currency to a precomposed form. Then, they would send it via XML to another bank, which would just change it back to the preferred decomposed form again.
UK users of XML: Although the UK (along with Denmark and Sweden) chose to keep their own currency, they still were able to use XML for both domestic and international monetary transfers. XML was able to accept non-euro based currencies if it could perform a conversion internally to euros. However, now internal conversions were required to be normalized, as well. Aside from the performance impact of this additional operation, in some cases, this could cause an irreversible change to the exact denomination of the currency. XML was previously able to take an arbitrary set of UK currency denominations and convert them to euro denominations. However, the new normalization requirement meant that the internally represented euro denominations now had to be normalized into the precomposed form. Once this was done, XML could convert back to UK pounds, but he would not be able to guarantee the same denominations originally handed to him.
Figure 3. Unnormalized Currency Exchange from UK Pounds to Euros and back to the Original Denomination in Pounds
For example, Fig. 3 illustrates a currency exchange round-trip that retains information on the original denominations used. We start with 2 pounds, made up of a 1-pound coin and two 50-pence coins. In conjunction with the exchange rate calculation, a mapping function is applied, the domain of which is the set of all coin values and multiples of these values from the original currency, the range of which is the set of all coin values and multiples of these values from the destination currency. For instance, the mapping function might attempt to start by preserving the face value of coins, as much as possible, with priority on the highest face values. In Fig. 3, the translation was able to preserve all coins and then add only duplicates of the original coins (or failing that, begin with the next smallest coin). As long as some algorithm allowed for a one-to-one mapping to enable a round-trip, the return exchange could yield the same denominations that the original transaction started with.
Figure 4. Currency Exchange from UK Pounds to Normalized Euros and back to some Arbitrary Denomination in Pounds
However, in Fig. 4, NFC normalization has been applied to the target currency after the conversion. The mapping function has no control; it will be overridden by the requirements of the normalization form. Therefore, depending on the details of the normalization form, it may be impossible to find a one-to-one mapping inherent in the normalization form, which means one cannot return to the original denominations of the original currency. Note that the last step in Fig. 4 attempts to return the abstract value of 2 pounds to a particular representation. Since there are many possibilities, it is just as likely to return to a 2-pound coin as it is to return to the original combination of 1-pound and 50-pence coins.
Since neither of the W3C specifications mentioned in the previous note has become an official W3C Recommendation, this part of the story represents the potential future problems if such policies were to become W3C Recommendations.
The disgruntled XML customers represent scenarios where normalization is not necessary. For instance, consider a Web service that receives an XML document containing weather information, and then transforms the data into an HTML page. At no point is any normalization-sensitive operation required. There is no need for any extra overhead in this process. Consider a Web service that simply concatenates two strings; as a Web-based/XML-based service, it would be required to check for normalization. But, what if a local desktop application simply wants to supply the Web service with two string parameters? Must the application normalize these parameters even if it is not concerned about the normalization form of the output?
The Bankers in this story represent systems that exchange data with other systems in fulfillment of a mutually accepted contract. Any time two or more systems have knowledge of each other, it is likely that they will find one format to be the optimal choice. Systems that 1) would prefer to work with decomposed forms, 2) interact only with other systems that would prefer the same form, and 3) have knowledge of this inherent impedance match, should not have to be forced into a less optimal form. For instance, some applications and operating systems prefer to work with decomposed text.
The UK XML users represent users who mainly work with non-UCS-based character sets, such as Big5 (used in Taiwan and Hong Kong), JIS X 0208 (used in Japan), and ANSEL (used by most library systems). While XML 1.0 does not require processors to support encodings beyond UTF-8 and UTF-16, processors are allowed to support other encodings, even ones not based on Unicode. As long as the character set can be mapped to a repertoire of the UCS, this is a reasonable and reversible operation; however, once Unicode normalization is applied to an arbitrary set of code points, preserving the order and items of the original character set becomes much more difficult.
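The loss of reversibility can be demonstrated with a short sketch. It assumes Python's built-in `shift_jis` codec as a stand-in for a JIS X 0208-based system, and the ANGSTROM SIGN is an example chosen for this illustration; the helper `round_trips` is hypothetical.

```python
import unicodedata
from typing import Optional

def round_trips(raw: bytes, encoding: str, form: Optional[str] = None) -> bool:
    """Decode legacy bytes to UCS code points, optionally normalize,
    then re-encode and report whether the original bytes came back."""
    text = raw.decode(encoding)
    if form is not None:
        text = unicodedata.normalize(form, text)
    try:
        return text.encode(encoding) == raw
    except UnicodeEncodeError:
        # The normalized code point has no mapping in the legacy set.
        return False

# JIS X 0208 contains ANGSTROM SIGN, which maps to U+212B.
legacy_bytes = "\u212b".encode("shift_jis")

# Without normalization the transcoding is reversible.
plain_round_trip = round_trips(legacy_bytes, "shift_jis")

# NFC replaces U+212B with U+00C5 (LATIN CAPITAL LETTER A WITH RING
# ABOVE), so the original bytes can no longer be recovered.
nfc_round_trip = round_trips(legacy_bytes, "shift_jis", form="NFC")
nfc_code_point = unicodedata.normalize("NFC", "\u212b")
```

Like the pounds in Fig. 4, the value survives the trip through the normalized form, but the original denominations do not.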
After publication of the Last Call drafts, the I18N and XML Core working groups received comments with concerns about the mandatory early normalization requirement. Both working groups have now considered not absolutely requiring early normalization, but instead just strongly encouraging it. [CharMod] included a section on “Responsibility for Normalization” (see http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication). This section includes many absolute requirements (identified by the word “MUST”) on implementations, content producers, and other specifications. However, the I18N working group has recently decided to soften the various MUST requirements towards SHOULDs. At the time of this writing it is not clear whether text components will have the option to normalize suspect text.[5]
On 15 October 2002, in anticipation of this change, the XML Core working group released the Candidate Recommendation for XML 1.1 [XML 1.1(CR)]. Some of the requirements made in the “Normalization Checking” section (see http://www.w3.org/TR/xml11/#sec2.13) are:
[“All XML parsed entities (including document entities) should be fully normalized as per the definition of [Charmod]...”] The wording of this requirement (use of the word “should”) does not imply this is absolutely mandatory.
[“However, a document is still well-formed even if it is not fully normalized.”]
[“XML processors should provide a user option to verify that the document being processed is in fully normalized form, and report to the application whether it is or not.”]
[“The option to not verify should be chosen only when the input text is certified, as defined by [Charmod].”]
[“If...a processor encounters characters for which it cannot determine the normalization properties...then the processor may, at user option, ignore any possible denormalizations caused by these characters.”]
[“XML processors must not transform the input to be in fully normalized form.”]
Finally, the section concludes with the statement:
“The purpose of this section is to strongly encourage XML processors to ensure that the creators of XML documents have properly normalized them, so that XML applications can make tests such as identity comparisons of strings without having to worry about the different possible "spellings" of strings which Unicode allows.”
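The requirements quoted above amount to a processor that may verify and report, but never repairs. A minimal sketch of that behavior follows. The names `EntityReport` and `process_entity` are hypothetical, and a plain NFC comparison is used as a simplifying assumption; the [CharMod] definition of “fully normalized” covers more than NFC alone.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntityReport:
    text: str                  # always the input, never transformed
    checked: bool              # whether the user asked for verification
    normalized: Optional[bool] # None when verification was skipped

def process_entity(text: str, verify: bool = True) -> EntityReport:
    """Pass text through unchanged; optionally report its NFC status."""
    if not verify:
        # Intended only for input the caller treats as certified.
        return EntityReport(text, checked=False, normalized=None)
    is_nfc = unicodedata.normalize("NFC", text) == text
    # The document remains acceptable either way; the result is only
    # reported to the application.
    return EntityReport(text, checked=True, normalized=is_nfc)

report = process_entity("fran\u0063\u0327ais")
unchecked = process_entity("fran\u0063\u0327ais", verify=False)
```

In this sketch the application, not the processor, decides what to do with a negative report, which is where the ambiguity discussed next begins.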
This policy, which one could consider “lenient, yet judgmental”, will only add to the confusion. Forthcoming specifications such as XSLT 2, XPath 2, and XQuery will attempt to meet the [CharMod] recommendations, but the ambiguous environment will not improve the chances of finding a clearly compatible stance. For instance, an XPath processor will be highly encouraged to ensure all incoming text is normalized before performing any normalization-sensitive operation. Since XPath can no longer rely on well-formedness as a guarantee of NFC, it will be faced with several options:
Perform a possibly redundant NFC check
If the check fails, raise a runtime error, even if the application needed only normalization-insensitive operations performed.
If the check fails, restrict normalization-sensitive operations.
Automatically normalize non-NFC text (if allowed by [CharMod] [5]). Some implementations might implement this with or without explicit application request. Normalization would be an expensive operation for an XPath processor.
Document disclaimers about the potential output of all normalization-sensitive string functions. Leave it to the client to deal with calling the normalize-unicode() function when necessary (essentially ignoring the early normalization principles).
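The options above can be compared side by side in a small sketch. It is not an XPath implementation: `Policy`, `DenormalizedInput`, and `string_length` are hypothetical names, the check is applied per operation for brevity, and length is counted in code points.

```python
import unicodedata
from enum import Enum
from typing import Optional

class Policy(Enum):
    ERROR = "raise a runtime error when the check fails"
    RESTRICT = "refuse normalization-sensitive operations"
    NORMALIZE = "normalize non-NFC text automatically"
    TRUST_CALLER = "document a disclaimer and skip the check"

class DenormalizedInput(ValueError):
    pass

def string_length(text: str, policy: Policy) -> Optional[int]:
    """A normalization-sensitive operation under each policy."""
    if policy is Policy.TRUST_CALLER:
        return len(text)                  # caller normalizes if needed
    nfc = unicodedata.normalize("NFC", text)   # the possibly redundant check
    if nfc == text:
        return len(text)
    if policy is Policy.ERROR:
        raise DenormalizedInput("input is not in NFC")
    if policy is Policy.RESTRICT:
        return None                       # operation unavailable
    return len(nfc)                       # Policy.NORMALIZE

results = {
    policy: string_length("c\u0327", policy)
    for policy in (Policy.RESTRICT, Policy.NORMALIZE, Policy.TRUST_CALLER)
}
```

The same input yields no answer, 1, or 2 depending on which option an implementation picks, so two conforming processors can disagree about the same document.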
Although mandatory early normalization might be harmful, adopting a policy of encouraging the “good processors” to enforce early normalization is an even less desirable scenario. A lenient yet judgmental policy will cause each XML component implementation to question what the other is likely to do.
“Making global standards is hard. The larger the number of people who are involved, the worse it is. In actuality, people can work together with only a few global understandings, and many local and regional ones. As with international and federal laws, and the Web, the minimalist design principle applies: Try to constrain as little as possible to meet the general goal. International commerce works using global concepts of trading and debt, but it does not require everyone to use the same currency, or to have the same penalties for theft, and so on.” (Tim Berners-Lee, Weaving the Web [Berners-Lee])
XML’s success as a rapidly adopted interchange format is due to its careful balance between two key principles:
The first principle is to limit the expressive power and complexity to a level that can be processed by a variety of applications (restricted enough to be accepted by all). Examples of the application of this principle include:
UCS is the only character set an XML processor must understand. UTF-8 and UTF-16 are the only encodings that must be processed.
Schema constraints must not allow non-deterministic content models.
Semi-structured data must fit within a tree data model representation.
Attributes must be scalar and unordered.
The second principle is to restrict generic use as little as possible to allow for a variety of domain-specific uses (expressive enough to be useful by all). Examples of the application of this principle include:
Encodings other than UTF-8 and UTF-16 may be optionally allowed by a processor.
Generic XML starts with a completely open schema, which can be restricted for domain-specific interchange.
A graph data model can be used with the annotation of identity constraints.
Elements can contain recursive element content.
The current position of [CharMod] leans more towards the first principle by reducing the burden of handling normalization-sensitive operations on the wide variety of text processing components on the Web. However, this position applies severe limitations to the general case, thereby disallowing reasonable applications of XML in cases where more specific system knowledge is available. Standards that have allowed the Web to flourish have been designed so domain-specific applications can have the freedom to restrict the general case. This is not possible if the standard goes too far in restricting the potential for expressiveness.
The Internet maxim, [“Be liberal in what you accept, but conservative in what you send.”] can be applied to any system design where a preferred format is introduced to a significant base of other existing formats. The words “accept” and “send” should not be taken too literally. While there are plenty of applications where the literal interpretation is perfectly valid, a more general interpretation advocates for an open system approach to creating change. As new standards and new technologies are developed, success is more likely to result from inclusion of legacy components, rather than from exclusion through mandated change. When applied appropriately, this principle allows for evolving development, which is driven by competition between the proposed new direction and the legacy approach.
Consider the early days of the Web: When the HTTP protocol was developed, NNTP and FTP were the legacy data transfer protocols. While HTTP and HTML offered improved performance and hypertext behavior, URIs were designed to allow for the inclusion of these orthogonal protocols. At the time, one might have thought it best to require browsers to restrict the design to only HTTP addresses; after all, this could force existing archives to make the conversion. However, by liberally accepting multiple protocols, but continuing to evolve the preferred one, the Web was able to provide access to far more resources while letting the advantages of HTTP speak for themselves.
Unfortunately, this maxim is occasionally taken to mean, “be liberal in your interpretation of sloppy and inconsistent formats, but be conservative in the formats you generate.” This interpretation is what led browsers to accept malformed HTML. The liberal interpretation of coding rules caused a proliferation of invalid HTML, which worked on the most liberal browsers, but not on all browsers. This also led to the inability to extract data from most HTML pages. There is no question that content producers must strictly adhere to prescribed formats.
So, how does the universally mandated early normalization relate to this? Unlike the “bad” interpretation of the maxim, the allowance of multiple normalization forms will still require strict adherence to the rules for each form. No one should suggest that the definition of valid NFC be made open to multiple interpretations, or that parsers should give the sender the benefit of the doubt if the code points look close. However, like the inclusion of FTP and NNTP, it is important that data interchange on the Web not be limited to one normalization format. Allowing components to specify the normalization that fits their needs does not prevent local domains with a common preference for NFC from federating across their “early normalization boundaries”.
It is important not to look at the two extremes of choice and mandate, and then choose a middle ground, as the [XML 1.1(CR)] and [CharMod] specs appear to be doing. There is a difference between allowing exceptions to a first-class standard and providing a framework in which all options function on equal ground. By encouraging one standard and allowing exceptions when necessary, and by providing options to enforce one format over the others, interoperability is actually lowered. Web components should not have to wonder whether their data will be stopped along the way because one particular processor chose to enforce an optionally preferred format. Instead, data should be allowed to be paired up with a compatible processor.
It is important to understand that there is more than one alternative to mandatory early normalization. Mandatory late normalization is not the only other choice. Below is a brief description of four alternatives:
For any network of systems where it is guaranteed that only one normalization form can ever exist, early normalization in that form would likely be the best policy. Successful early normalization has to control 100% of the data transactions in a given ecosystem. The larger and more complex the system is, the more this becomes an unreachable goal. The concept of an NFC or NFD fiefdom would be one step towards making this goal more achievable. [CharMod] refers to “certified text” as text that has already been inspected, or text that was received from a source that is known to produce only normalized text. This certification concept would be important to an arrangement where components perform normalization-sensitive operations without ever checking each instance. In addition, a “normalization firewall” could govern the import/export of data in other compatible normalization formats, as well as the on-the-fly translation of the (probably much smaller in number) incompatible formats that request to enter the system.
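The following is a minimal sketch of how a “normalization firewall” and certified text might fit together, not a description of anything specified in [CharMod]. The names CertifiedText, firewall_import, and same_text are hypothetical and exist only for illustration; the only real API used is Python's standard unicodedata.normalize.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CertifiedText:
    """Hypothetical wrapper: text vouched for as being in `form`."""
    text: str
    form: str


def firewall_import(text: str, domain_form: str,
                    trusted_source: bool = False) -> CertifiedText:
    """Admit text into a domain that uses a single normalization form.

    Text from a trusted source is certified by provenance alone.
    Anything else is converted at the boundary; normalize() returns
    already-normalized text unchanged, so conforming input passes as is.
    """
    if trusted_source:
        return CertifiedText(text, domain_form)
    return CertifiedText(unicodedata.normalize(domain_form, text), domain_form)


def same_text(a: CertifiedText, b: CertifiedText) -> bool:
    """Inside the boundary, compare without re-checking normalization."""
    if a.form != b.form:
        raise ValueError("texts were certified for different forms")
    return a.text == b.text


decomposed = firewall_import("cafe\u0301", "NFC")    # converted on entry
precomposed = firewall_import("caf\u00e9", "NFC",    # trusted, not inspected
                              trusted_source=True)
```

In this sketch the cost of inspection is paid once at the boundary, which is the property that makes early normalization attractive within a single-form domain. The trusted path is only as reliable as the source it trusts.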
In this scenario, a component or small system of components requires a particular normalization form. Each accepted document and each parameter is checked for normalization. A no-op or run-time error occurs if the check fails. This scenario might be used by security-related components. Components may also provide interfaces for each possible normalization form, such as an NFD-substring() function alongside an NFC-substring() function. Web services could leverage discovery services, service description languages, or policy declarations to allow for automatic negotiation between incoming data and applicable APIs. Identification of the data format could come from the data itself, perhaps as an attribute value in the XML declaration, similar to the encoding attribute. Alternatively, a component could incorporate normalization form checking into other validation processes, such as schema validation. Schemas are designed to limit the expressiveness of the general case in order to allow for robust, domain-specific applications; this design might apply to normalization form as well. The choice between trusting the declaration of an incoming document and taking the performance hit to verify the form at the receiver will depend on the security and criticality aspects of the processor.
Some functions or components might simply return an output based on an assumption that the input was properly normalized in the specified form. Any calling component or consuming client will be responsible for performing any necessary normalization for each action; otherwise, the output may be inaccurate. This is another area to consider for policy declarations that might allow for dynamic negotiation between the requirements of the sender and the resources of the receiver. Also, some data may not be intended to be normalized under any circumstances. A declaration in the instance data could let the receiver decide whether to process, forward, or reject the data.
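A minimal sketch of such a strictly checking component follows, assuming Python's standard unicodedata module. The function names mirror the NFC-substring() and NFD-substring() example above but are otherwise hypothetical, as is the idea of passing the declared form as a plain string; no XML declaration attribute for normalization form exists.

```python
import unicodedata


class NormalizationError(ValueError):
    """Raised when input is not in the form a component requires."""


def require_form(text: str, form: str) -> str:
    """Return text unchanged if it is already in `form`, else raise."""
    if unicodedata.normalize(form, text) != text:
        raise NormalizationError(f"input is not in {form}")
    return text


def nfc_substring(text: str, start: int, length: int) -> str:
    """Substring by code point index; rejects input that is not NFC."""
    return require_form(text, "NFC")[start:start + length]


def nfd_substring(text: str, start: int, length: int) -> str:
    """Substring by code point index; rejects input that is not NFD."""
    return require_form(text, "NFD")[start:start + length]


# One interface per supported form (hypothetical dispatch table).
SUBSTRING_BY_FORM = {"NFC": nfc_substring, "NFD": nfd_substring}


def substring(text: str, declared_form: str, start: int, length: int) -> str:
    """Route to the interface matching the form the data declares."""
    try:
        handler = SUBSTRING_BY_FORM[declared_form]
    except KeyError:
        raise NormalizationError(f"no interface for {declared_form}") from None
    return handler(text, start, length)
```

Here the declaration selects the interface and the check verifies the declaration, so the receiver pays the verification cost on every call. A less critical component could skip require_form and trust the declared form instead.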
This scenario involves components that accept any text and perform the normalization internally. This reflects the opposite end of the spectrum from that described in the [CharMod] early normalization policy. While this puts a large burden on the receiver, it may be useful for operations such as string concatenation, since normalization forms are not closed under concatenation. Therefore, an extra normalization step may need to be performed and this might be best handled internally.
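The concatenation problem can be illustrated with a short sketch, assuming Python's standard unicodedata module; the helper names in_form and concat_normalized are hypothetical. Two strings that are each valid NFC are joined, the naive result is no longer NFC, and a component that normalizes internally repairs it.

```python
import unicodedata


def in_form(text: str, form: str) -> bool:
    """True if normalizing to `form` leaves the text unchanged."""
    return unicodedata.normalize(form, text) == text


def concat_normalized(left: str, right: str, form: str = "NFC") -> str:
    """Concatenate, then renormalize the joined result internally."""
    return unicodedata.normalize(form, left + right)


left = "e"          # U+0065 LATIN SMALL LETTER E
right = "\u0301"    # U+0301 COMBINING ACUTE ACCENT
naive = left + right

print(in_form(left, "NFC"), in_form(right, "NFC"))   # each input is NFC
print(in_form(naive, "NFC"))                         # the naive join is not
print(concat_normalized(left, right) == "\u00e9")    # repaired to U+00E9
```

This sketch renormalizes the whole result for simplicity. Only the text around the join can actually change, so a production implementation might restrict the work to that region.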
Unicode is an extremely complex encoding with multiple representations for the same character. Normalization is clearly necessary to resolve issues of equality across multiple representations. The problem is how and when to apply normalization.
Multiple normalization formats exist to allow systems and regions to choose the format that maintains maximum fidelity of their data, with the least performance impact, while allowing comparison checks of a very complicated character model. Early normalization is simply a policy that requires data to be normalized prior to crossing a system boundary. The choice that must be made is where to draw the boundary. Allowing regional and local domains, networks, systems, and individual components to govern the appropriate placement of this boundary will provide the freedom necessary for the Web to evolve in a creative and efficient manner.
Standardizing constraints at a global level has a significant cost and should not be implemented any more than what is minimally necessary for universal exchange of essential data. Forced, or even highly encouraged, early NFC normalization does not magically solve an interoperability crisis, yet it does reduce the expressive power of XML and restrict local domains from designing a solution that is the most appropriate for their scenarios.
The author would like to thank the following people for reviewing, critiquing, and contributing to the development of this paper: Colin Chapman (FarmSaver.com), Eric Gropp (MWH Inc.), Howard Hao (Microsoft), François Liger (Microsoft), Jonathan Marsh (Microsoft), John McConnell (Microsoft), Jim Melton (Oracle), Michael Rys (Microsoft), and Michel Suignard (Microsoft).
[CharMod] Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Asmus Freytag, Tex Texin, Character Model for the World Wide Web, W3C Working Draft, 30 April 2002. (See http://www.w3.org/TR/charmod/ for the latest version.)
[ISO/IEC 10646] ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7).
[ISO/IEC 10646(2000)] ISO (International Organization for Standardization). ISO/IEC 10646-1:2000. Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 2000.
[StringPrep] Paul Hoffman and Marc Blanchet, "Preparation of Internationalized Strings ("stringprep")", draft-hoffman-stringprep, work in progress. (See http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-07.txt.)
[UAX #15] Mark Davis, Martin Dürst, Unicode Normalization Forms, Unicode Standard Annex #15. (See http://www.unicode.org/unicode/reports/tr15/ for the latest version.)
[UTR #17] Ken Whistler, Mark Davis, Character Encoding Model, Unicode Technical Report #17. (See http://www.unicode.org/unicode/reports/tr17/ for the latest version.)
[XML 1.0] Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, Eve Maler, Extensible Markup Language (XML) 1.0 (Second Edition), 6 October 2000. (See http://www.w3.org/TR/2000/REC-xml-20001006.)
[XML 1.1(LC)] John Cowan, XML 1.1, W3C Working Draft, 25 April 2002. (See http://www.w3.org/TR/2002/WD-xml11-20020425/.)
[1] While the euro notes are identical in all countries, each country issues its own coins with a common design on one side and a national emblem on the other. Therefore, while a German 1-euro coin might look different from a Spanish 1-euro coin, they are compatible coins, representing the same value. The same phenomenon exists with Unicode compatibility characters. These differences are normalized away in the NFKC and NFKD normalization forms. NFKC is currently being considered within the IETF as the most appropriate normalization form for comparing internationalized strings, particularly for internationalized domain names. See Preparation of Internationalized Strings (“Stringprep”) [StringPrep] (http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-07.txt).
[2] UCS refers to the character set specified by the ISO/IEC 10646-1:1993 [ISO/IEC 10646] and ISO/IEC 10646-1:2000 [ISO/IEC 10646(2000)] standards and all of their amendments.
[3] For a detailed explanation of the ambiguities in converting from Shift-JIS to Unicode, see http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen, which is an appendix in the W3C Note, XML Japanese Profile [XML Japanese Profile].
[4] Character Encoding Schemes don’t play a key role in this discussion, but one could draw a parallel between the mapping from a denomination to its physical manifestation as a coin or note, and the encoding of a UCS code point into a serialized byte sequence.
[5] At the time this paper was authored, the W3C Internationalization Working Group was considering whether to allow a component to perform normalization of content it did not create. See Jim Melton's comment on this issue: http://lists.w3.org/Archives/Public/www-i18n-comments/2002May/0038.htm.