Bitten by text/html for XHTML documents

The W3C « XHTML media types » note mentions that:

XHTML documents served as ‘text/html’ will not be processed as XML [XML10], e.g. well-formedness errors may not be detected by user agents. Also be aware that HTML rules will be applied for DOM and style sheets (see C.11 and C.13 of [XHTML1] respectively).

I have been bitten by this rule while developing the « Hello World » application that will illustrate the first chapter of our upcoming Web 2.0 book.

In this sample application, I use JavaScript to fill in XHTML elements that act as placeholders, and I noticed that updating an element could sometimes erase its following siblings:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Repro case</title>
<script type="text/javascript">

function init()  {
	// here, we still have a p[@id='bar'] element
	alert("bar: " + document.getElementById("bar"));
	document.getElementById("foo").innerHTML="foo";
	// but now, the p[@id='bar'] element has disappeared...
	alert("bar: " + document.getElementById("bar"));
}

</script>
</head>
<body onload="init()">
    <div id="foo"/>
    <p id="bar"/>
</body>
</html>

One of the things that I found most surprising is that the three browsers I was testing (Firefox, Opera and Mozilla) showed the same « bug ».

It took me a while to understand that the behaviour is dictated by the media type associated with the document: when the media type is « text/html », the document is interpreted as HTML despite its XML declaration, and the trailing slash in the div start tag is ignored. The document body seen by the browser is thus equivalent to:

<body onload="init()">
    <div id="foo">
      <p id="bar"></p>
    </div>
</body>

The p element which is a following sibling of the div element in XML becomes a child of the div element in HTML mode!
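
The difference between the two readings can be sketched with a toy tree builder. This is only an illustration under strong assumptions: `build_tree`, `find` and the token tuples are hypothetical names invented for the sketch, and the model mimics nothing but the handling of the trailing slash (it is not how any browser parser is implemented).

```python
# Toy model only: it mimics how the trailing slash is treated, nothing more.
class Node:
    def __init__(self, name, id=None, parent=None):
        self.name, self.id, self.parent = name, id, parent
        self.children = []


def build_tree(tokens, xml_mode):
    """Build a tree from (kind, name, id, self_closing) tokens."""
    root = Node("#document")
    current = root
    for kind, name, id_, self_closing in tokens:
        if kind == "start":
            node = Node(name, id_, current)
            current.children.append(node)
            # XML mode: "<div/>" is a complete, empty element.
            # HTML mode: the slash is ignored and the element stays open.
            if not (xml_mode and self_closing):
                current = node
        else:  # "end": close the nearest open element with that name
            node = current
            while node is not root and node.name != name:
                node = node.parent
            if node is not root:
                current = node.parent
    return root


def find(node, id_):
    """Depth-first search for the node carrying the given id."""
    if node.id == id_:
        return node
    for child in node.children:
        hit = find(child, id_)
        if hit is not None:
            return hit
    return None


# <body><div id="foo"/><p id="bar"/></body>
TOKENS = [
    ("start", "body", None, False),
    ("start", "div", "foo", True),
    ("start", "p", "bar", True),
    ("end", "body", None, False),
]

as_xml = build_tree(TOKENS, xml_mode=True)
as_html = build_tree(TOKENS, xml_mode=False)
print(find(as_xml, "bar").parent.name)   # body
print(find(as_html, "bar").parent.name)  # div

# Replacing the content of "foo" (what setting innerHTML does) drops its children.
find(as_html, "foo").children = []
print(find(as_html, "bar"))  # None
```

In the HTML reading, the p element lives inside the div, so overwriting the content of the div is enough to make it vanish, which matches the two alerts in the repro case.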

In Firefox or Opera, the clean way to fix that would be to send the proper media type (application/xhtml+xml) but unfortunately Internet Explorer doesn’t support it.

A workaround is to avoid empty-element tags for elements that can have content; a comment can be placed inside each such element if you want to make sure that no badly behaved editor will minimise it again:

<body onload="init()">
    <div id="foo"><!-- --></div>
    <p id="bar"><!-- --></p>
</body>
            

Note that this isn’t necessary for the p element but that it doesn’t do any harm and looks more consistent.
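
To apply the workaround systematically, a small check can flag the minimised tags that remain in a document. The sketch below is a hypothetical helper (`risky_minimised_tags` is not part of any tool mentioned here); it relies on a simple regular expression rather than a real parser, so it assumes attribute values do not contain angle brackets. The list of elements that may safely be minimised is the set declared EMPTY in HTML 4.01.

```python
import re

# Elements declared EMPTY in HTML 4.01; minimising these is harmless.
VOID = {"area", "base", "basefont", "br", "col", "frame", "hr",
        "img", "input", "isindex", "link", "meta", "param"}

# A start tag that ends with "/>", with or without attributes.
MINIMISED = re.compile(r"<([A-Za-z][\w:.-]*)(?:\s[^<>]*)?/>")


def risky_minimised_tags(source):
    """Return names of minimised tags that an HTML parser would leave open."""
    return [m.group(1) for m in MINIMISED.finditer(source)
            if m.group(1).lower() not in VOID]


print(risky_minimised_tags('<div id="foo"/>\n<p id="bar"/>\n<br/>'))
# ['div', 'p']
```

The p element is reported as well, in line with the choice above to treat it like the div for the sake of consistency.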

Machine translation and the survival of the dinosaurs

In 1998, I was in charge of second-level European support at Sybase, a company more than a little proud to be an official partner of the World Cup. Given the visibility of the event, we were all on a war footing and on call around the clock, seven days a week.

To lighten the mood, I had the odd idea of having the Sybase.com home page translated by the search engine that dominated the market at the time, AltaVista, which did not yet own its domain name!

I have not managed to find the original text on web.archive.org, but I remember our hilarity at a French translation in which millions of « ventilateurs » (fans, the kind that move air) flocked to see the « allumettes » (matches, the kind you strike) of the « tasse du monde » (the world cup, as in a drinking cup).

Eight years later, in this new « tasse du monde » season, with the « ventilateurs » going wild again over the « allumettes », I decided to check how much progress machine translation engines have made by asking them to translate the sentence « Millions of fans follow each match of the World cup » into French.

At AltaVista / Babel Fish, the cooling fans are still passionate about the matchsticks of the drinking cup: « Les millions de ventilateurs suivent chaque allumette de la tasse du monde ».

Google spares us the matchsticks but does not do much better with the rest: « Les millions de ventilateurs suivent chaque match de la tasse du monde ».

These tests are enough to dent the confidence with which I readily claim that natural language analysis technologies are making great progress!

Remember: in 1998 people barely talked about Linux, our computers ran Windows 95, we explored Web 0.9 with Netscape 4 or IE 4, and we were starting to tremble because of the year 2000 bug…

Could machine translation software be the only dinosaurs of the twentieth century to have survived all these upheavals?

RELAX NG and W3C XML Schema compared (continued)

A lot of comparisons have already been published on this topic, but there is still plenty of misunderstanding when the so-called object-oriented features of W3C XML Schema are compared with RELAX NG patterns.

Many people complain that RELAX NG supports neither complex type derivation nor substitution groups.

There are two ways to look at these features:

  1. If you focus on validation, these are ways to define sets of valid instance fragments.
  2. If you focus on modeling, these are ways to define design patterns and declare to potential applications what kind of relations exist between definitions.

RELAX NG (and DSDL in general) focuses on validation, and its built-in features provide equivalents to the W3C XML Schema features in terms of validation only.

Let’s see what this means with a simple example.

Derivation by extension

W3C XML Schema:

    <xs:complexType name="BaseType">
        <xs:sequence>
            <xs:element name="FirstName" type="xs:token"/>
            <xs:element name="LastName" type="xs:token"/>
            <xs:element name="Mail" type="xs:token" minOccurs="0"/>
        </xs:sequence>
    </xs:complexType>

    <xs:complexType name="ExtendedType">
        <xs:complexContent>
            <xs:extension base="BaseType">
                <xs:sequence>
                    <xs:element name="Password" type="xs:token"/>
                </xs:sequence>
            </xs:extension>
        </xs:complexContent>
    </xs:complexType>

The equivalent schema in RELAX NG is (compact syntax):

BaseType =
  element FirstName { xsd:token },
  element LastName { xsd:token },
  element Mail { xsd:token }?

ExtendedType =
  BaseType,
  element Password { xsd:token }
            

Or (XML syntax):

  <define name="BaseType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
    <optional>
      <element name="Mail">
        <data type="token"/>
      </element>
    </optional>
  </define>

  <define name="ExtendedType">
    <ref name="BaseType"/>
    <element name="Password">
      <data type="token"/>
    </element>
  </define> 

In RELAX NG, a derivation by extension translates into a new pattern that adds content after a reference to the base pattern.
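
From a validation point of view, this is plain composition of content models. The following sketch illustrates it with a deliberately simplified model: a pattern is a flat list of (element name, optional?) pairs, datatypes are ignored, and element names are assumed to be distinct within a pattern. The names `matches`, `BASE_TYPE` and `EXTENDED_TYPE` are invented for the illustration.

```python
# A flat pattern is a list of (element name, optional?) pairs.
BASE_TYPE = [("FirstName", False), ("LastName", False), ("Mail", True)]

# ExtendedType = BaseType, element Password { xsd:token }
EXTENDED_TYPE = BASE_TYPE + [("Password", False)]


def matches(pattern, children):
    """True if the list of child element names is accepted by the pattern."""
    i = 0
    for name, optional in pattern:
        if i < len(children) and children[i] == name:
            i += 1          # the expected element is present
        elif not optional:
            return False    # a required element is missing
    return i == len(children)  # no unexpected trailing element


print(matches(EXTENDED_TYPE, ["FirstName", "LastName", "Password"]))  # True
print(matches(BASE_TYPE, ["FirstName", "LastName", "Password"]))      # False
```

Nothing in the extended pattern needs to know that it is an extension: referencing the base pattern and appending to it is enough to obtain the same set of valid fragments.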

Derivation by restriction

W3C XML Schema:

    <xs:complexType name="RestrictedType">
        <xs:complexContent>
            <xs:restriction base="BaseType">
                <xs:sequence>
                    <xs:element name="FirstName" type="xs:token"/>
                    <xs:element name="LastName" type="xs:token"/>
                </xs:sequence>
            </xs:restriction>
        </xs:complexContent>
    </xs:complexType>

The equivalent schema in RELAX NG is (compact syntax):

RestrictedType =
  element FirstName { xsd:token },
  element LastName { xsd:token }
            

Or (XML syntax):

  <define name="RestrictedType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
  </define> 

In RELAX NG, a derivation by restriction translates into a new, standalone pattern whose content is a restriction of the base pattern’s content.
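
What "restriction" means for validation can be made concrete by brute force: every sequence accepted by the derived pattern must also be accepted by the base pattern. The sketch below assumes the same simplified model as before (flat sequences of optional or required elements, datatypes ignored); `language` and `is_restriction` are hypothetical helper names, and enumerating all sequences is only practical for small patterns like these.

```python
from itertools import product


def language(pattern):
    """All child-name sequences accepted by a flat (name, optional?) pattern."""
    # A required element contributes one choice, an optional one two.
    choices = [[(name,), ()] if optional else [(name,)]
               for name, optional in pattern]
    return {sum(combo, ()) for combo in product(*choices)}


BASE_TYPE = [("FirstName", False), ("LastName", False), ("Mail", True)]
RESTRICTED_TYPE = [("FirstName", False), ("LastName", False)]
EXTENDED_TYPE = BASE_TYPE + [("Password", False)]


def is_restriction(derived, base):
    """True if everything accepted by derived is also accepted by base."""
    return language(derived) <= language(base)


print(is_restriction(RESTRICTED_TYPE, BASE_TYPE))  # True
print(is_restriction(EXTENDED_TYPE, BASE_TYPE))    # False
```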

Substitution groups

W3C XML Schema:

    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="Head"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:element name="Head" type="BaseType"/>

    <xs:element name="Restricted" type="RestrictedType" substitutionGroup="Head"/>

    <xs:element name="Extended" type="ExtendedType" substitutionGroup="Head"/>

The equivalent schema in RELAX NG is (compact syntax):

Head = element Head { BaseType }
Head |= element Restricted { RestrictedType }
Head |= element Extended { ExtendedType }
start = element Root { Head  }
            

Or (XML syntax):

   <define name="Head">
    <element name="Head">
      <ref name="BaseType"/>
    </element>
  </define>
  <define name="Head" combine="choice">
    <element name="Restricted">
      <ref name="RestrictedType"/>
    </element>
  </define>
  <define name="Head" combine="choice">
    <element name="Extended">
      <ref name="ExtendedType"/>
    </element>
  </define>
  <start>
    <element name="Root">
      <ref name="Head"/>
    </element>
  </start> 

In RELAX NG, a substitution group translates into a combination by choice of the definition of the head of the substitution group with the definitions of the group members.
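
The effect of the combination can be checked by collecting every definition that shares the same name. The sketch below uses only the Python standard library; the function name `alternatives` is invented for the illustration, the grammar is a trimmed copy of the XML syntax above wrapped in a grammar element, and the code does not validate anything, it merely lists the element names that end up as alternatives.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

GRAMMAR = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="Head">
    <element name="Head"><ref name="BaseType"/></element>
  </define>
  <define name="Head" combine="choice">
    <element name="Restricted"><ref name="RestrictedType"/></element>
  </define>
  <define name="Head" combine="choice">
    <element name="Extended"><ref name="ExtendedType"/></element>
  </define>
</grammar>
"""


def alternatives(grammar_xml, pattern_name):
    """Element names offered by all definitions that share pattern_name."""
    root = ET.fromstring(grammar_xml)
    names = []
    for define in root.iter(RNG + "define"):
        if define.get("name") == pattern_name:
            names += [el.get("name") for el in define.findall(RNG + "element")]
    return names


print(alternatives(GRAMMAR, "Head"))  # ['Head', 'Restricted', 'Extended']
```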

What did we miss?

These schemas can be considered equivalent because they validate the same set of instance documents (with the difference that the RELAX NG schemas do not allow xsi: attributes such as xsi:type).

The main difference is that the relation between the base and derived types and between the members of the substitution group is made explicit in W3C XML Schema and is implicit in RELAX NG.

For derivation by extension and for substitution groups, the design patterns used in RELAX NG (content added after a reference for an extension, and combination by choice of an element definition) could be considered characteristic enough for tools to detect them automatically.
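
As a rough idea of what such detection might look like for extensions, the sketch below applies one heuristic to the XML syntax: a definition whose first child is a reference and which has further content is reported as an extension of the referenced pattern. The function name `guess_extensions` is hypothetical, the heuristic is an assumption made for the illustration (a real tool would need to handle groups, interleaves and nested references), and the grammar is a trimmed copy of the definitions above.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

GRAMMAR = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <define name="BaseType">
    <element name="FirstName"><data type="token"/></element>
    <element name="LastName"><data type="token"/></element>
    <optional><element name="Mail"><data type="token"/></element></optional>
  </define>
  <define name="ExtendedType">
    <ref name="BaseType"/>
    <element name="Password"><data type="token"/></element>
  </define>
</grammar>
"""


def guess_extensions(grammar_xml):
    """Map each pattern that starts with a <ref> and adds content to its base."""
    found = {}
    for define in ET.fromstring(grammar_xml).iter(RNG + "define"):
        children = list(define)
        # Heuristic: a leading reference followed by at least one more pattern.
        if len(children) > 1 and children[0].tag == RNG + "ref":
            found[define.get("name")] = children[0].get("name")
    return found


print(guess_extensions(GRAMMAR))  # {'ExtendedType': 'BaseType'}
```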

For the derivation by restriction, there isn’t much in the RELAX NG schema that could inform a tool that RestrictedType is a restriction of BaseType.

To make these relations or design patterns explicit, annotations are an easy option.

A complete schema with annotations for all three design patterns could be (compact syntax):

namespace oo = "http://ns.xmlschemata.org/object-orientation/"

BaseType =
  element FirstName { xsd:token },
  element LastName { xsd:token },
  element Mail { xsd:token }?

[ oo:extends = "BaseType" ]
ExtendedType =
  BaseType,
  element Password { xsd:token }

[ oo:restricts = "BaseType" ]
RestrictedType =
  element FirstName { xsd:token },
  element LastName { xsd:token }

Head = element Head { BaseType }

[ oo:substitutionGroup = "Head" ]
Head |= element Restricted { RestrictedType }

[ oo:substitutionGroup = "Head" ]
Head |= element Extended { ExtendedType }

start = element Root { Head }

            

Or (XML syntax):

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns:oo="http://ns.xmlschemata.org/object-orientation/"
  xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <define name="BaseType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
    <optional>
      <element name="Mail">
        <data type="token"/>
      </element>
    </optional>
  </define>
  <define name="ExtendedType" oo:extends="BaseType">
    <ref name="BaseType"/>
    <element name="Password">
      <data type="token"/>
    </element>
  </define>
  <define name="RestrictedType" oo:restricts="BaseType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
  </define>
  <define name="Head">
    <element name="Head">
      <ref name="BaseType"/>
    </element>
  </define>
  <define name="Head" combine="choice" oo:substitutionGroup="Head">
    <element name="Restricted">
      <ref name="RestrictedType"/>
    </element>
  </define>
  <define name="Head" combine="choice" oo:substitutionGroup="Head">
    <element name="Extended">
      <ref name="ExtendedType"/>
    </element>
  </define>
  <start>
    <element name="Root">
      <ref name="Head"/>
    </element>
  </start>
</grammar>

            

Like any annotation, these would be ignored by RELAX NG processors, but they can be used by tools that need to understand the relations between type and element definitions (such as binding tools). These tools could also enforce the rules defined by W3C XML Schema and check that restrictions are actual restrictions (a number of papers have been published explaining how this can be implemented).
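
Reading the annotations back requires nothing more than a namespace-aware XML parser. The following sketch uses the Python standard library on an abbreviated copy of the annotated grammar; the function name `relations` and the triple format are assumptions made for the illustration, not part of any existing binding tool.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"
OO = "{http://ns.xmlschemata.org/object-orientation/}"

# Abbreviated copy of the annotated grammar (BaseType body shortened).
GRAMMAR = """
<grammar xmlns:oo="http://ns.xmlschemata.org/object-orientation/"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <define name="BaseType">
    <element name="FirstName"><data type="token"/></element>
  </define>
  <define name="ExtendedType" oo:extends="BaseType">
    <ref name="BaseType"/>
    <element name="Password"><data type="token"/></element>
  </define>
  <define name="RestrictedType" oo:restricts="BaseType">
    <element name="FirstName"><data type="token"/></element>
  </define>
  <define name="Head" combine="choice" oo:substitutionGroup="Head">
    <element name="Extended"><ref name="ExtendedType"/></element>
  </define>
</grammar>
"""


def relations(grammar_xml):
    """List (pattern, relation, target) triples declared with oo: attributes."""
    triples = []
    for define in ET.fromstring(grammar_xml).iter(RNG + "define"):
        for attr, target in define.attrib.items():
            # ElementTree exposes namespaced attributes as "{uri}local".
            if attr.startswith(OO):
                triples.append((define.get("name"), attr[len(OO):], target))
    return triples


for triple in relations(GRAMMAR):
    print(triple)
# ('ExtendedType', 'extends', 'BaseType')
# ('RestrictedType', 'restricts', 'BaseType')
# ('Head', 'substitutionGroup', 'Head')
```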

It should also be noted that annotations can be used to identify design patterns other than those implemented by W3C XML Schema.

References

This post is a consolidation of mails sent on the XML-DEV mailing list: [1] [2] [thread]