XHTML 2.0 and HTML 5: The figures

This post has been updated to take into account a mail from Björn Höhrmann with a heads-up about missing elements in the XHTML 2.0 list of elements.

The future of (X)HTML appears to be searching its way between two conflicting visions:

I have posted my views on the subject on XML-DEV and have been surprised by the answer from Björn Höhrmann. The server hosting XML-DEV and its archives is currently down but you can see this answer in Google’s cache.

The point I found most surprising is his statistics: « XHTML 2 increases the element count by 50% compared to XHTML 1.0 Strict, and by 10% compared to HTML 2.0, HTML 3.2, HTML 4.01, and XHTML 1.1 combined, including the Frameset and Transitional variants. »

Other chapters from our upcoming Web 2.0 book kept me too busy to double check these figures but we have decided to mention this debate in our Chapter 5 and I really needed to analyse these statistics in more detail.

My sources for this exercise are:

The data concerning XHTML 2.0 is a consolidation of the list of XHTML 2.0 elements included in the Working Draft, the RELAX NG schema and the W3C XML Schema for XForms. This is needed because the list of elements is a simplified list that does not include the XForms and Ruby sub-elements (see my mail to the HTML Working Group for more details). Many thanks to Björn Höhrmann for pointing that out.

By scraping these pages, I have extracted a consolidated list of elements, shown in the following table. Each cell gives the module to which the element belongs in the corresponding (X)HTML version, or the mention « deprecated » if the element is deprecated:

Element HTML 4.01 XHTML 1.1 XHTML 2.0 HTML 5
a Core Hypertext Hypertext Phrase
abbr Core Text Text Phrase
access Access
acronym Core Text
action XForms
address Core Text Structural Sections
alert XForms
applet Deprecated Deprecated
area Core Client-side Image Map
article Sections
aside Sections
b Core Presentation
base Core Base Document metadata
basefont Deprecated Deprecated
bdo Core Bi-directional Text Phrase
big Core Presentation
bind XForms
blockcode Structural
blockquote Core Text Structural Sections
body Core Structure Document Sections
br Core Text Phrase
button Core Forms
caption Core Tables Tables
case XForms
center Deprecated Deprecated
choices XForms
cite Core Text Text Phrase
code Core Text Text Phrase
col Core Tables Tables
colgroup Core Tables Tables
command Interactive
copy XForms
datagrid Interactive
dd Core List List Lists
del Core Edit Edits
delete XForms
details Interactive
dfn Core Text Text Phrase
di List
dir Deprecated Deprecated
dispatch XForms
div Core Text Structural
dl Core List List Lists
dt Core List List Lists
em Core Text Text Phrase
ev:listener XML Events
event-source Server-sent DOM events
extension XForms
fieldset Core Forms
filename XForms
font Deprecated Deprecated
footer Sections
form Core Forms
frame Frames Frames
frameset Frames Frames
group XForms
h Structural
h1 Core Text Structural Sections
h2 Core Text Structural Sections
h3 Core Text Structural Sections
h4 Core Text Structural Sections
h5 Core Text Structural Sections
h6 Core Text Structural Sections
handler Handler
head Core Structure Document Document metadata
header Sections
help XForms
hint XForms
hr Core Presentation Paragraphs
html Core Structure Document HTML documents and document fragments
i Core Presentation Phrase
iframe Core Iframe
img Core Image Image content[TBW]
input Core Forms XForms
ins Core Edit Edits
insert XForms
instance XForms
isindex Deprecated Deprecated
item XForms
itemset XForms
kbd Core Text Text Phrase
l Text
label Core Forms List
legend Core Forms
li Core List List Lists
link Core Link Metainformation Document metadata
load XForms
m Phrase
map Core Client-side Image Map
mediatype XForms
menu Deprecated Deprecated Interactive
message XForms
meta Core Metainformation Metainformation Document metadata
meter Phrase
model XForms
nav Sections
nl List
noframes Frames Frames
noscript Core Scripting Scripting
object Core Object Object
ol Core List List Lists
optgroup Core Forms
option Core Forms
output XForms
p Core Text Structural Paragraphs
param Core Object Object
pre Core Text Structural Preformatted text
progress Phrase
q Core Text Text Phrase
range XForms
rb Ruby
rbc Ruby
rebuild XForms
recalculate XForms
refresh XForms
repeat XForms
reset XForms
revalidate XForms
rp Ruby
rt Ruby
rtc Ruby
ruby Ruby
s Deprecated Deprecated
samp Core Text Text Phrase
script Core Scripting Scripting
secret XForms
section Structural Sections
select Core Forms XForms
select1 XForms
send XForms
separator Structural
setfocus XForms
setindex XForms
setvalue XForms
small Core Presentation Phrase
span Core Text Text Phrase
standby Object
strike Deprecated Deprecated
strong Core Text Text Phrase
style Core Style Sheet Style Sheet Document metadata
sub Core Presentation Text Phrase
submission XForms
submit XForms
summary Tables
sup Core Presentation Text Phrase
switch XForms
t Phrase
table Core Tables Tables
tbody Core Tables Tables
td Core Tables Tables
textarea Core Forms XForms
tfoot Core Tables Tables
th Core Tables Tables
thead Core Tables Tables
title Core Structure Document Document metadata
toggle XForms
tr Core Tables Tables
trigger XForms
tt Core Presentation
u Deprecated Deprecated
ul Core List List Lists
upload XForms
value XForms
var Core Text Text Phrase

The total numbers of elements are:

Count | HTML 4.01 | XHTML 1.1 | XHTML 2.0 | HTML 5
Number of elements | 91 | 91 | 115 | 63

Now, it should be noted that we are not comparing apples to apples: HTML 4.01 and XHTML 1.x include a number of deprecated elements that shouldn’t be used. They also include frames elements that have been taken out from XHTML 2.0 to be defined in the XFrames specification and are not part of HTML 5 either. It seems fair to remove all these elements from our numbers and that gives:

Count | HTML 4.01 | XHTML 1.1 | XHTML 2.0 | HTML 5
Number of non deprecated elements | 81 | 81 | 115 | 63
Number of non deprecated non frames elements | 78 | 78 | 115 | 63

These figures confirm the increase of almost 50% between HTML 4.01 or XHTML 1.1 and XHTML 2.0 mentioned by Björn Höhrmann, and it is worth looking at where the increase comes from. If you look at the different modules in this table, you’ll see that whereas HTML 4.01 and XHTML 1.1 include 10 elements from their Forms module, XHTML 2.0 includes 46 XForms elements. The increase in the number of elements thus comes almost entirely from the XHTML 2.0 XForms support: 36 of the 37 additional elements are XForms elements, and the number of elements in the other modules is practically unchanged.
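As a quick sanity check of these percentages, the arithmetic can be reproduced in a few lines of JavaScript. This is only a sketch: the counts are the ones quoted in this post, and the helper name percentIncrease is a hypothetical choice of mine.

```javascript
// Element counts quoted in this post (non deprecated, non frames elements).
const counts = {
  html401: { total: 78, forms: 10 },  // HTML 4.01 (same figures for XHTML 1.1)
  xhtml2: { total: 115, forms: 46 }   // XHTML 2.0, "forms" being XForms elements
};

// Hypothetical helper: relative increase from `before` to `after`, in percent.
function percentIncrease(before, after) {
  return ((after - before) / before) * 100;
}

const overall = percentIncrease(counts.html401.total, counts.xhtml2.total);
const withoutForms = percentIncrease(
  counts.html401.total - counts.html401.forms,
  counts.xhtml2.total - counts.xhtml2.forms
);

console.log(overall.toFixed(1) + "%");      // prints "47.4%"
console.log(withoutForms.toFixed(1) + "%"); // prints "1.5%"
```

The first figure is the « almost 50% » quoted above; the second shows how little is left of the increase once the Forms and XForms elements are taken out of the comparison.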

Furthermore, to compare with HTML 5, you also need to remove the table elements, which are not yet defined in HTML 5, and the figures then become quite different:

Count | HTML 4.01 | XHTML 1.1 | XHTML 2.0 | HTML 5
Number of non deprecated non frames elements | 78 | 78 | 115 | 63
Number of Forms or XForms elements | 10 | 10 | 46 | 0
Number of non deprecated non frames non forms elements | 68 | 68 | 69 | 63
Number of tables elements | 10 | 10 | 11 | 0
Number of non deprecated non frames non forms non tables elements | 58 | 58 | 58 | 63

In other words, the debate of whether XHTML 2.0 is a simplification can be split into two different points:

  • The number of elements for the classical, non forms related features is the same in HTML 4.01, XHTML 1.1 and XHTML 2.0.
  • The replacement of the Forms module by XForms represents a complete paradigm change that undeniably leads to more complexity and an increase in the number of elements.

The last line shows that there is an actual increase in the number of elements between HTML 4.01 or XHTML 1.1 and HTML 5. If you look at the overall table, you’ll notice that this increase is due to the addition of quite a number of new elements, which is only partly compensated by the removal of elements that have been considered either almost duplicates (for instance, acronym has been removed and people are advised to use abbr for both acronyms and abbreviations) or not very useful.

Of course, numbers of elements are not 100% representative of the complexity of a vocabulary, but they give a good indication, and the figures given by Björn did deserve some further analysis.

PS: I have sent an answer to XML-DEV that should find its way to the list once their server is up again.

PPS: I recommend reading Björn Höhrmann’s mails to the www-html@w3.org mailing list as a complement to this blog entry:

Bitten by text/html for XHTML documents

The W3C « XHTML media types » note mentions that:

XHTML documents served as ‘text/html’ will not be processed as XML [XML10], e.g. well-formedness errors may not be detected by user agents. Also be aware that HTML rules will be applied for DOM and style sheets (see C.11 and C13 of [XHTML1] respectively).

I have been bitten by this rule while developing the « Hello World » application that will illustrate the first chapter of our upcoming Web 2.0 book.

In this sample application, I am using JavaScript to fill information into XHTML elements that act as placeholders, and I noticed that updating an element could sometimes erase its following siblings:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Repro case</title>
<script type="text/javascript">

function init()  {
	// here, we still have a p[@id='bar'] element
	alert("bar: " + document.getElementById("bar"));
	document.getElementById("foo").innerHTML="foo";
	// but now, the p[@id='bar'] element has disappeared...
	alert("bar: " + document.getElementById("bar"));
}

</script>
</head>
<body onload="init()">
    <div id="foo"/>
    <p id="bar"/>
</body>
</html>

One of the things that I found most surprising is that the three browsers I was testing (Firefox, Opera and Mozilla) showed the same « bug ».

It took me a while to understand that the behaviour is dictated by the media type associated with the document: when the media type is « text/html », the document is interpreted as HTML despite its XML declaration, and the trailing slash in the div start tag is ignored. The document body seen by the browser is thus equivalent to:

<body onload="init()">
    <div id="foo">
      <p id="bar"></p>
    </div>
</body>

The p element which is a following sibling of the div element in XML becomes a child of the div element in HTML mode!

In Firefox or Opera, the clean way to fix that would be to send the proper media type (application/xhtml+xml) but unfortunately Internet Explorer doesn’t support it.

A workaround is to avoid using empty-element tags in XHTML; a comment can be included if you want to make sure that no badly behaved editor will minimise your document:

<body onload="init()">
    <div id="foo"><!-- --></div>
    <p id="bar"><!-- --></p>
</body>
            

Note that this isn’t necessary for the p element but that it doesn’t do any harm and looks more consistent.
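If documents are produced by a tool chain rather than written by hand, the same workaround could be automated. The sketch below is a naive, regex-based illustration and not a real XML parser: the function name expandEmptyTags and the list of void elements are assumptions of mine, and it would be fooled by « /> » or « > » appearing inside attribute values, comments or CDATA sections.

```javascript
// Elements that text/html parsers treat as empty anyway, so their
// minimised form is harmless (assumed, non-exhaustive list).
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"
]);

// Hypothetical helper: rewrite <foo .../> as <foo ...><!-- --></foo>
// for every element that is not listed in VOID_ELEMENTS.
function expandEmptyTags(xhtml) {
  const emptyTag = /<([A-Za-z][\w:.-]*)((?:\s[^<>]*?)?)\s*\/>/g;
  return xhtml.replace(emptyTag, (match, name, attrs) => {
    if (VOID_ELEMENTS.has(name.toLowerCase())) {
      return match; // leave <br/>, <img .../>, ... untouched
    }
    return "<" + name + attrs.trimEnd() + "><!-- --></" + name + ">";
  });
}

const body = '<div id="foo"/>\n<p id="bar"/>\n<br/>';
console.log(expandEmptyTags(body));
// prints:
// <div id="foo"><!-- --></div>
// <p id="bar"><!-- --></p>
// <br/>
```

Serving the document with the proper media type remains the cleaner fix where it is possible; this kind of rewriting is only useful when text/html cannot be avoided.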

RELAX NG and W3C XML Schema compared (continued)

A lot of comparisons have already been published on this topic, but there are still plenty of misunderstandings when comparing the so-called object-oriented features of W3C XML Schema with RELAX NG patterns.

Many people complain that RELAX NG does not support complex type derivation nor substitution groups.

There are two ways to look at these features:

  1. If you focus on validation, these are ways to define sets of valid instance fragments.
  2. If you focus on modeling, these are ways to define design patterns and declare to potential applications what kind of relations exist between definitions.

RELAX NG (and DSDL in general) focuses on validation, and its built-in features provide equivalents to W3C XML Schema features in terms of validation only.

Let’s see what this means with a simple example.

Derivation by extension

W3C XML Schema:

   <xs:complexType name="BaseType">
        <xs:sequence>
            <xs:element name="FirstName" type="xs:token"/>
            <xs:element name="LastName" type="xs:token"/>
            <xs:element name="Mail" type="xs:token" minOccurs="0"/>
        </xs:sequence>
    </xs:complexType>

    <xs:complexType name="ExtendedType">
        <xs:complexContent>
            <xs:extension base="BaseType">
                <xs:sequence>
                    <xs:element name="Password" type="xs:token"/>
                </xs:sequence>
            </xs:extension>
        </xs:complexContent>
   </xs:complexType>            

The equivalent schema in RELAX NG is (compact syntax):

BaseType =
  element FirstName { xsd:token },
  element LastName { xsd:token },
  element Mail { xsd:token }?

ExtendedType =
  BaseType,
  element Password { xsd:token }
            

Or (XML syntax):

  <define name="BaseType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
    <optional>
      <element name="Mail">
        <data type="token"/>
      </element>
    </optional>
  </define>

  <define name="ExtendedType">
    <ref name="BaseType"/>
    <element name="Password">
      <data type="token"/>
    </element>
  </define> 

A derivation by extension translates into RELAX NG as a new pattern that adds content after a reference to the base pattern.

Derivation by restriction

W3C XML Schema:

    <xs:complexType name="RestrictedType">
        <xs:complexContent>
            <xs:restriction base="BaseType">
                <xs:sequence>
                    <xs:element name="FirstName" type="xs:token"/>
                    <xs:element name="LastName" type="xs:token"/>
                </xs:sequence>
            </xs:restriction>
        </xs:complexContent>
    </xs:complexType>

The equivalent schema in RELAX NG is (compact syntax):

RestrictedType =
  element FirstName { xsd:token },
  element LastName { xsd:token }
            

Or (XML syntax):

  <define name="RestrictedType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
  </define> 

A derivation by restriction translates into RELAX NG as a new pattern whose definition is a restriction of the base pattern.

Substitution groups

W3C XML Schema:

    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="Head"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:element name="Head" type="BaseType"/>

    <xs:element name="Restricted" type="RestrictedType" substitutionGroup="Head"/>

    <xs:element name="Extended" type="ExtendedType" substitutionGroup="Head"/>

The equivalent schema in RELAX NG is (compact syntax):

Head = element Head { BaseType }
Head |= element Restricted { RestrictedType }
Head |= element Extended { ExtendedType }
start = element Root { Head  }
            

Or (XML syntax):

   <define name="Head">
    <element name="Head">
      <ref name="BaseType"/>
    </element>
  </define>
  <define name="Head" combine="choice">
    <element name="Restricted">
      <ref name="RestrictedType"/>
    </element>
  </define>
  <define name="Head" combine="choice">
    <element name="Extended">
      <ref name="ExtendedType"/>
    </element>
  </define>
  <start>
    <element name="Root">
      <ref name="Head"/>
    </element>
  </start> 

A substitution group translates into RELAX NG as a combination by choice of the definition of the head of the substitution group with the definitions of the group members.

What did we miss

These schemas can be considered equivalent because they validate the same set of instance documents (with the difference that the RELAX NG schemas do not allow xsi attributes).

The main difference is that the relation between the base and derived types and between the members of the substitution group is made explicit in W3C XML Schema and is implicit in RELAX NG.

For the derivation by extension and substitution groups, the design patterns used in RELAX NG (content added after a reference for an extension and combination by choice of an element definition) could be considered characteristic enough so that tools can automatically detect them.

For the derivation by restriction, there isn’t much in the RELAX NG schema that could inform a tool that RestrictedType is a restriction of BaseType.

To make these relations or design patterns explicit, it is very easy to use annotations.

A complete schema with annotations for all three design patterns could be (compact syntax):

namespace oo = "http://ns.xmlschemata.org/object-orientation/"

BaseType =
  element FirstName { xsd:token },
  element LastName { xsd:token },
  element Mail { xsd:token }?

[ oo:extends = "BaseType" ]
ExtendedType =
  BaseType,
  element Password { xsd:token }

[ oo:restricts = "BaseType" ]
RestrictedType =
  element FirstName { xsd:token },
  element LastName { xsd:token }

Head = element Head { BaseType }

[ oo:substitutionGroup = "Head" ]
Head |= element Restricted { RestrictedType }

[ oo:substitutionGroup = "Head" ]
Head |= element Extended { ExtendedType }

start = element Root { Head }

            

Or (XML syntax):

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns:oo="http://ns.xmlschemata.org/object-orientation/"
  xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <define name="BaseType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
    <optional>
      <element name="Mail">
        <data type="token"/>
      </element>
    </optional>
  </define>
  <define name="ExtendedType" oo:extends="BaseType">
    <ref name="BaseType"/>
    <element name="Password">
      <data type="token"/>
    </element>
  </define>
  <define name="RestrictedType" oo:restricts="BaseType">
    <element name="FirstName">
      <data type="token"/>
    </element>
    <element name="LastName">
      <data type="token"/>
    </element>
  </define>
  <define name="Head">
    <element name="Head">
      <ref name="BaseType"/>
    </element>
  </define>
  <define name="Head" combine="choice" oo:substitutionGroup="Head">
    <element name="Restricted">
      <ref name="RestrictedType"/>
    </element>
  </define>
  <define name="Head" combine="choice" oo:substitutionGroup="Head">
    <element name="Extended">
      <ref name="ExtendedType"/>
    </element>
  </define>
  <start>
    <element name="Root">
      <ref name="Head"/>
    </element>
  </start>
</grammar>

            

These annotations would be (as any annotation) ignored by RELAX NG processors but can be used by tools that need to understand the relation between type and element definitions (such as binding tools). These tools could also enforce the rules defined by W3C XML Schema and check that restrictions are actual restrictions (a number of papers have been published explaining how this can be implemented).
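As an illustration of how a tool might consume such annotations, the sketch below lists the oo: attributes found on the define elements of a schema written in the XML syntax. It is deliberately simplistic: it relies on regular expressions rather than on an XML parser, it assumes that the annotation namespace is bound to the oo prefix, and the function name listRelations is hypothetical.

```javascript
// Hypothetical helper: returns one { pattern, relation, target } object
// per oo:* annotation found on a <define> start tag.
function listRelations(rng) {
  const relations = [];
  for (const [, attrs] of rng.matchAll(/<define\b([^>]*)>/g)) {
    const name = /\sname="([^"]*)"/.exec(attrs);
    for (const [, relation, target] of attrs.matchAll(/\soo:([\w-]+)="([^"]*)"/g)) {
      relations.push({ pattern: name ? name[1] : null, relation, target });
    }
  }
  return relations;
}

// Abridged sample input, for illustration only.
const schema = `
  <define name="BaseType"><empty/></define>
  <define name="ExtendedType" oo:extends="BaseType"><ref name="BaseType"/></define>
  <define name="RestrictedType" oo:restricts="BaseType"><empty/></define>
  <define name="Head" combine="choice" oo:substitutionGroup="Head"><empty/></define>`;

for (const r of listRelations(schema)) {
  console.log(r.pattern + " " + r.relation + " " + r.target);
}
// prints:
// ExtendedType extends BaseType
// RestrictedType restricts BaseType
// Head substitutionGroup Head
```

A real binding tool would of course work on the parsed schema and resolve the namespace instead of trusting the prefix, but the principle is the same: the relations are read from the annotations while the validation semantics stay untouched.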

It should also be noted that annotations can be used to identify design patterns other than those implemented by W3C XML Schema.

References

This post is a consolidation of mails sent on the XML-DEV mailing list: [1] [2] [thread]

Client side XSLT brings life to static HTML pages and microformats

I am running all kinds of tests for the chapter about multimedia of our upcoming Web 2.0 book and, as is often the case when I am writing, this is sparking a number of strange ideas.

I was exploring the similarities between playlists, podcasts and SMIL animation when it occurred to me that it might be interesting to see what can be done with microformats.

Although the relEnclosure proposal still needs some polishing (for instance, it mentions that Atom requires a length on enclosures but does not define a way to express this length), the result would be something such as:

      <div class="hfeed">
         <h1>SVG en quinze points</h1>
         <div class="hentry">
            <h2 class="hentry-title">
               <a
                  href="http://xmlfr.org/documentations/articles/i040130-0001/01%20-%20C'est%20base%20sur%20XML.mp3"
                  rel="bookmark" title="...">C'est basé sur XML</a>
            </h2>
            <p class="hentry-content">By <address class="vcard author fn">Antoine Quint</address> -
                  <abbr class="updated" title="2004-01-30T00:00:00">2004-01-30T00:00:00</abbr>
            </p>
            <p>[<a
                  href="http://xmlfr.org/documentations/articles/i040130-0001/01%20-%20C'est%20base%20sur%20XML.mp3"
                  rel="enclosure">download</a>] (<span class="htype">audio/mpeg</span>, <span
                  class="hLength">231469</span> bytes).</p>
         </div>
 .
 .
 .
      </div>        

[hatom.xhtml]

I am not a microformat expert and I was surprised to see that this document is actually much harder to write than the corresponding Atom document. It probably contains lots of errors; if you spot one of them, please report it in a comment.

This is nice, but probably not what users would expect for a Web 2.0 application. For one thing, this page is static and lacking all the bells and whistles of a Web 2.0 application. For instance, we might want to use one of the techniques exposed by Mark Huckvale to play the audio in the web page itself.

For this, we would need to modify the document and entries could become:

                 <div class="hentry">
                        <h2 class="hentry-title">
                              <a
                                    href="http://xmlfr.org/documentations/articles/i040130-0001/01%20-%20C'est%20base%20sur%20XML.mp3"
                                    rel="bookmark" title="...">C'est basé sur XML</a>
                        </h2>
                        <p class="hentry-content">By
                              <address class="vcard author fn">Antoine Quint</address> - <abbr
                                    class="updated" title="2004-01-30T00:00:00"
                              >2004-01-30T00:00:00</abbr>
                        </p>
                        <p>[<a
                                    href="javascript:play(&#34;http://xmlfr.org/documentations/articles/i040130-0001/01%20-%20C'est%20base%20sur%20XML.mp3&#34;);"
                                    rel="enclosure">play</a>] (<span class="htype"
                              >audio/mpeg</span>, <span class="hLength">231469</span> bytes).</p>
                  </div>
            

[hatom-decorated.xhtml]

This is not very different, but the links with rel="enclosure" have been replaced by calls to a JavaScript function, and this is enough to lose the semantics of the microformat since we obfuscate the enclosure’s URL.

We thus have a situation where the document that we want to serve is different from the document that we want to display client side, and that’s a typical use case for client side XSLT.

The trick is to write a simple transformation that makes the static page dynamic:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml" xmlns:x="http://www.w3.org/1999/xhtml" version="1.0"
    exclude-result-prefixes="x">
    <xsl:output method="xml" encoding="UTF-8" indent="yes" cdata-section-elements="x:style x:script"/>
    <xsl:strip-space elements="*"/>
    <xsl:preserve-space elements="x:script x:style"/>
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="x:head">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
            <style type="text/css"><![CDATA[

#player {
    padding: 10px;
    background-color: gray;
    position:fixed;
    top: 20px;
    right:10px
}

            ]]></style>
            <script type="text/javascript"><![CDATA[

function play(surl) {
  document.getElementById("player").innerHTML=
    '<embed src="'+surl+'" hidden="false" autostart="true" loop="false"/>';
}

            ]]></script>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="x:body">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
            <div id="player">A media player<br/>will pop-up here.</div>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="x:a[@rel='enclosure']/@href">
        <xsl:attribute name="href">
            <xsl:text>javascript:play("</xsl:text>
            <xsl:value-of select="."/>
            <xsl:text>");</xsl:text>
        </xsl:attribute>
    </xsl:template>

    <xsl:template match="x:a[@rel='enclosure']/text()">
        <xsl:text>play</xsl:text>
    </xsl:template>

</xsl:stylesheet>

            

[decorateMf.xsl]

And add an xml-stylesheet PI to the static (microformat) page:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="decorateMf.xsl" type="text/xsl"?>
<html xmlns="http://www.w3.org/1999/xhtml">
.
.
.
</html>
            

This is working fine for me (GNU Linux/Ubuntu, Firefox 1.5) and the mplayer plug-in nicely pops up in the player div when I click on one of the « play » links but it would require a bit of polishing to work in other browsers:

  • The page crashes Opera 9.0 (I have entered a bug report and have been contacted back by their tech support who is already working on the issue).
  • The XSLT output method needs to be changed to HTML to work in Internet Explorer (otherwise the result is displayed as an XML document). Furthermore, IE inserts the embed element as text in the player div and you might need to use a proper DOM method to insert the embed element as a DOM node.
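For the second point, a version of the play() function that builds the embed element through DOM calls instead of innerHTML could look like the following. It is an untested sketch: I have not verified it against Internet Explorer, and the optional doc parameter is an addition of mine that only serves to make the function easier to exercise outside a browser.

```javascript
// Variant of play() that inserts the embed element as a DOM node.
// `doc` defaults to the browser's document; this parameter is hypothetical.
function play(surl, doc = document) {
  const player = doc.getElementById("player");

  // Remove the placeholder text (or a previously inserted player).
  while (player.firstChild) {
    player.removeChild(player.firstChild);
  }

  const embed = doc.createElement("embed");
  embed.setAttribute("src", surl);
  embed.setAttribute("hidden", "false");
  embed.setAttribute("autostart", "true");
  embed.setAttribute("loop", "false");
  player.appendChild(embed);
  // Nothing is returned on purpose: a javascript: link whose result is
  // not undefined makes the browser replace the page with that result.
}
```

The function keeps the name and the first argument used by the links generated by the transformation, so the href values do not need to change.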

[Try it!]

There are probably a number of other (easier?) solutions for the specific problem I have solved here. However, this is an interesting pattern to apply in situations where you want to serve a clean document that needs to be altered to display nicely in a browser.

XSLT has sometimes been described as a « semantic firewall » that removes the semantics of XML documents to keep only their presentation. I like to think of this technique as a semantic « anti-firewall » or « tunnel » that keeps the semantics of XML documents intact until the very last stage before they hit the browser’s rendering engine…

Too many SVG profiles

Our upcoming Web 2.0 book is giving me the opportunity to have a closer look at the state of SVG.

After all kinds of announcements of native SVG support in browsers, I was expecting that with my new Ubuntu Dapper distribution, SVG would be really easy to display.

The first thing I tested was displaying the clock that animates the front page of XMLfr [download] in Firefox.

First test, first disappointment: the text « Réalisé en SVG » doesn’t show up in Firefox. This text is displayed on a path using a textPath element which isn’t supported by Firefox.

Beginning to wonder if all that would be as easy as I had thought, I have developed a sample document showing the relations between the tags in the RSS channel of the book site.

I wanted to show the level of animation that can be done declaratively without a single line of Javascript and I have used the « set » element.

Second test, second disappointment: this was just not working.

Thinking that I needed to do more exhaustive tests, I decided to install the Adobe SVG plug-in which, fortunately, is quite easy if you switch off the native SVG support in Firefox using about:config as explained on Mozillazine. A very cool feature is that you can switch between native and plug-in support by clicking on the « svg.enabled » option without having to restart the browser.

After more tests and the very helpful mouseEvents SVG sample, I came to the conclusion that no implementation, including Adobe SVG plug-in, supports the « mouseover » and « mouseout » events correctly and switched to using « mousedown » and « mouseup » instead.

The result is an SVG document which (I think) is perfectly valid but works only with the Adobe SVG plug-in:

[download]

This SVG document works fine with the Adobe plug-in but doesn’t work with any of the other implementations that I have tested. Note that it is almost working with Opera 9.0b2 but this implementation doesn’t seem to support « set » elements on groups: if I move the « set » elements to the individual shapes I can get it working with Opera.

The full test report is below.

SVG clock [download]:

  • Firefox 1.5.0.4 (native mode): no support for « textPath »: the text doesn’t show.
  • Adobe SVG viewer 3.01 beta 3: OK.
  • Opera 9.0b2: OK.
  • Konqueror 3.5.2: no animation.
  • Amaya 8.5: the document is reported as non well formed!
  • X-smiles 1.0alpha1: no animation.

Tags [download]:

  • Firefox 1.5.0.4 (native mode): no support for « set »: no links are displayed and nothing happens when you click on a tag.
  • Adobe SVG viewer 3.01 beta 3: OK.
  • Opera 9.0b2: no support for « set »: no links are displayed and nothing happens when you click on a tag.
  • Konqueror 3.5.2: no support for « set » nor for the visibility attributes: all the links are always displayed. Furthermore, the browser crashes after a while when this document is left open.
  • Amaya 8.5: no support for « set » nor for the visibility attributes: all the links are always displayed. The « text-anchor: middle » property doesn’t work either.
  • X-smiles 1.0alpha1: crashes when there is a DOCTYPE declaration in the SVG document and, when the DOCTYPE is removed, throws an exception « Simple duration is not defined for this animation element », probably due to the fact that the set elements do not have durations.

Also, the media type « image/svg+xml » seems to be a problem for the Adobe plug-in in Firefox even if, curiously, this isn’t systematic.

I could probably get this sample working on most of these implementations by switching to Javascript animation and carefully testing against each of them, but is that something we really want to do again?

Wasn’t SVG supposed to be interoperable? On the contrary, the current situation reminds me of the worst period of the browser wars, even if I have no doubt that this time there are no political reasons behind it.

Mozilla explains that Firefox SVG is a subset of SVG 1.1, but not any of the official profiles (Tiny, Basic, Full).

Other implementations probably have similar policies and I can understand their reasons. However, I am wondering if these partial implementations do not hurt SVG more than they help.

The commonality between them is that, except Konqueror when it core dumps, they all fail silently when they encounter a feature that they do not support leaving users with the feeling that the document they are viewing is bogus.

When a user with a browser that has no support for SVG finds a SVG document, she/he is invited to load a plug-in. When her/his browser has one of these partial supports, she/he just moves away.

Web 2.0 at XML Prague

This coming weekend, I’ll have the pleasure of being at XML Prague, a small and friendly XML conference in a wonderful city.

This year, I’ll leave out my usual XML schema languages expert hat to speak on two topics:

  • An experience to define a RDF/XML Query By Example language. This presentation describes a very cool project that I am developing for one of my customers (INSEE) and that I also presented at Extreme Markup Languages last year. It is very much on topic with the focus of XML Prague this year, which is « XML Native Databases and Querying XML ».
  • Web 2.0: myth and reality, a presentation derived from the blog entry with the same title. Even though people could probably argue that Web 2.0 is about making a web that can be queried, this talk will probably feel more off topic. I hope it will still be well received and look forward to delivering it in Prague.

XML Prague 2005 was also an opportunity to see Prague, which I hadn’t seen since 1981 (I can tell you that so many things had changed that I could hardly recognize the city), and to meet many members of an active and creative Eastern European XML community with whom I had often exchanged emails but had had few opportunities to meet face to face.

I have no doubt XML Prague 2006 will be as much fun as the previous edition.

Normalizing Excel’s SpreadsheetML using XSLT – Part 2

As reported in one of the comments, there was a bug in the XSLT transformation that « normalizes » Excel’s SpreadsheetML documents, which I had published in a previous post.

I have fixed this bug and the new version is:

<?xml version="1.0"?>
<!--

Adapted from http://ewbi.blogs.com/develops/2004/12/normalize_excel.html

This product may incorporate intellectual property owned by Microsoft Corporation. The terms
and conditions upon which Microsoft is licensing such intellectual property may be found at
http://msdn.microsoft.com/library/en-us/odcXMLRef/html/odcXMLRefLegalNotice.asp.
-->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:schemas-microsoft-com:office:spreadsheet"
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <xsl:output method="xml" indent="no" encoding="UTF-8"/>
    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="normalize"/>
    </xsl:template>
    <xsl:template match="@*|node()" mode="normalize">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="normalize"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="ss:Cell/@ss:Index" mode="normalize"/>
    <xsl:template match="ss:Cell" name="copy" mode="normalize">
        <xsl:copy>
            <xsl:apply-templates select="@*" mode="normalize"/>
            <xsl:variable name="prevCells" select="preceding-sibling::ss:Cell"/>
            <xsl:variable name="nbPrecedingIndexes"
                select="count(preceding-sibling::ss:Cell[@ss:Index])"/>
            <xsl:attribute name="ss:Index">
                <xsl:choose>
                    <xsl:when test="@ss:Index">
                        <xsl:value-of select="@ss:Index"/>
                    </xsl:when>
                    <xsl:when test="count($prevCells) = 0">
                        <xsl:value-of select="1"/>
                    </xsl:when>
                    <xsl:when test="$nbPrecedingIndexes > 0">
                        <xsl:variable name="precedingCellsSinceLastIndex"
                            select="preceding-sibling::ss:Cell[count(preceding-sibling::ss:Cell[@ss:Index]|self::ss:Cell[@ss:Index]) = $nbPrecedingIndexes]"/>
                        <xsl:value-of
                            select="preceding-sibling::ss:Cell[@ss:Index][1]/@ss:Index +
                            count($precedingCellsSinceLastIndex)
                            + sum($precedingCellsSinceLastIndex/@ss:MergeAcross)
                            - count ($precedingCellsSinceLastIndex[@ss:MergeAcross])"
                        />
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:value-of
                            select="count($prevCells) + 1 +
                            sum($prevCells/@ss:MergeAcross) -count($prevCells/@ss:MergeAcross)"
                        />
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:attribute>
            <xsl:apply-templates select="node()" mode="normalize"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
            

I have also written the following set of tests (using XSLTUnit):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl" xmlns:xsltu="http://xsltunit.org/0/"
  exclude-result-prefixes="exsl">
  <xsl:import href="excelNormalize.xsl"/>
  <xsl:import href="xsltunit.xsl"/>
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <xsltu:tests>
      <xsltu:test id="noIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell>A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">noIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="1">A</ss:Cell>
              <ss:Cell ss:Index="2">B</ss:Cell>
              <ss:Cell ss:Index="3">C</ss:Cell>
              <ss:Cell ss:Index="4">D</ss:Cell>
              <ss:Cell ss:Index="5">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="withIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="1">A</ss:Cell>
            <ss:Cell ss:Index="2">B</ss:Cell>
            <ss:Cell ss:Index="3">C</ss:Cell>
            <ss:Cell ss:Index="4">D</ss:Cell>
            <ss:Cell ss:Index="5">E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">withIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2" select="$input"/>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="firstIndex">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="5">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">firstIndex</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="5">A</ss:Cell>
              <ss:Cell ss:Index="6">B</ss:Cell>
              <ss:Cell ss:Index="7">C</ss:Cell>
              <ss:Cell ss:Index="8">D</ss:Cell>
              <ss:Cell ss:Index="9">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="altIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="2">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell ss:Index="5">C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">altIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="2">A</ss:Cell>
              <ss:Cell ss:Index="3">B</ss:Cell>
              <ss:Cell ss:Index="5">C</ss:Cell>
              <ss:Cell ss:Index="6">D</ss:Cell>
              <ss:Cell ss:Index="7">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="noIndexesMergeAcross">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell>A</ss:Cell>
            <ss:Cell ss:MergeAcross="2">B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell ss:MergeAcross="3">D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">noIndexesMergeAcross</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="1">A</ss:Cell>
              <ss:Cell ss:MergeAcross="2" ss:Index="2">B</ss:Cell>
              <ss:Cell ss:Index="4">C</ss:Cell>
              <ss:Cell ss:MergeAcross="3" ss:Index="5">D</ss:Cell>
              <ss:Cell ss:Index="8">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="withIndexesMergeAcross">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="5" ss:MergeAcross="2">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell ss:Index="10">C</ss:Cell>
            <ss:Cell ss:MergeAcross="3">D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">withIndexesMergeAcross</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="5" ss:MergeAcross="2">A</ss:Cell>
              <ss:Cell ss:Index="7">B</ss:Cell>
              <ss:Cell ss:Index="10">C</ss:Cell>
              <ss:Cell ss:MergeAcross="3" ss:Index="11">D</ss:Cell>
              <ss:Cell ss:Index="14">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
    </xsltu:tests>
  </xsl:template>
</xsl:stylesheet>
            

These tests should help to understand what this transformation is doing.
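
For readers who do not read XSLT fluently, here is a minimal Python sketch of the arithmetic that the stylesheet applies to each row. It is not part of the transformation: the `normalize_indexes` function and the tuple representation of a cell are assumptions made for this illustration. It reproduces the stylesheet's convention, in which a cell with `ss:MergeAcross="n"` advances the column counter by n.

```python
from typing import List, Optional, Tuple

# Hypothetical model of a cell: (explicit ss:Index or None, ss:MergeAcross or None).
Cell = Tuple[Optional[int], Optional[int]]


def normalize_indexes(row: List[Cell]) -> List[int]:
    """Return the ss:Index value that the stylesheet above writes on each cell."""
    result = []
    next_index = 1  # index given to the next cell if it has no explicit ss:Index
    for explicit_index, merge_across in row:
        index = explicit_index if explicit_index is not None else next_index
        result.append(index)
        # Same arithmetic as the stylesheet: a plain cell advances the counter
        # by 1, a cell with ss:MergeAcross="n" advances it by 1 + n - 1 = n.
        step = 1 if merge_across is None else merge_across
        next_index = index + step
    return result


if __name__ == "__main__":
    # The "withIndexesMergeAcross" case from the XSLTUnit suite above.
    row = [(5, 2), (None, None), (10, None), (None, 3), (None, None)]
    print(normalize_indexes(row))  # [5, 7, 10, 11, 14]
```

A single running counter is enough here because an explicit `ss:Index` simply resets the counter, which is what the three `xsl:when` branches express with sibling counts.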

Thanks, and please continue to report bugs and feature requests as comments.

The influence of microformats on style-free stylesheets

It’s been a while, almost six years, since I wrote my Style-free XSLT Style Sheets piece for XML.com, but this simple technique remains one of my favorites.

It was not only my first article published on XML.com but also the subject of my first talk at an IDEAlliance XML conference, and it’s fair to say that it has been instrumental in launching my career as an « international XML guru ».

That is not the reason why this technique remains my favorite: I like it for its efficiency. I am using it over and over, to generate (X)HTML but also many other XML vocabularies. I have been using it to generate vocabularies as different as OpenOffice documents and W3C XML Schemas. The more complex the vocabulary to generate, the more reasons you have to keep it outside your XSLT transformations and the more efficient style-free stylesheets are.

Style-free stylesheets have become a reflex for me and it was without even thinking about it that I wrote a style-free stylesheet to power the web site of our upcoming Web 2.0 book.

In my antique XML.com paper, I had been using specific, non-XHTML elements:

        <td width="75%" bgcolor="Aqua">
            <insert-body/>
        </td>

That’s working fine, but your layout documents are no longer valid XHTML and they don’t display like target documents in a browser.

Why not follow the microformats approach and use regular XHTML elements with specific class attributes instead:

        <div id="planet">
            <h1>Planet Web 2.0 the book</h1>
            <p>Aggregated content relevant to this book.</p>
            <div class="fromRss"/>
             .../...
        </div>           

In this case, the XSLT transformation replaces the content of any element whose class attribute contains the token « fromRss » with the formatted output of the RSS feed. This has the additional benefit that I can leave mock-up content to make the layout look like a final document:

<div id="planet">
            <h1>Planet Web 2.0 the book</h1>
            <p>Aggregated content relevant to this book.</p>
            <div class="fromRss">
                <ul>
                    <li>
                        <div>
                            <h2>
                                <a
                                    href="http://www.orbeon.com/blog/2006/06/02/about-json-and-poor-marketing-strategies/"
                                    title="XForms Everywhere » About JSON and poor marketing strategies"
                                    >XForms Everywhere » About JSON and poor marketing
                                strategies</a>
                            </h2>
                        </div>
                    </li>
                </ul>
            </div>
            <p>
                <a href="http://del.icio.us/rss/tag/web2.0thebook" title="RSS feed (on del.icio.us)">
                    <img src="feed-icon-24x24.png" alt="RSS feed"/>
                </a> (on <a href="http://del.icio.us/" title="del.icio.us">del.icio.us</a>)</p>
        </div>
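
The substitution step can be sketched in a few lines of Python. This is only an illustration of the principle and not the XSLT stylesheet that powers the site: the `fill_layout` function, the shortened `LAYOUT` document and the idea of passing the feed as a plain list of titles are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical layout document with mock-up content.
LAYOUT = """<div id="planet">
  <h1>Planet Web 2.0 the book</h1>
  <div class="fromRss"><ul><li>mock-up entry</li></ul></div>
</div>"""


def has_class(element, token):
    """True if the element's class attribute contains the given token."""
    return token in element.get("class", "").split()


def fill_layout(layout_xml, titles):
    """Replace the content of every "fromRss" element by a list of titles."""
    root = ET.fromstring(layout_xml)
    # Collect the targets first so that the tree is not modified while walking it.
    targets = [e for e in root.iter() if has_class(e, "fromRss")]
    for element in targets:
        # Drop the mock-up children and text, keep the element itself.
        for child in list(element):
            element.remove(child)
        element.text = None
        ul = ET.SubElement(element, "ul")
        for title in titles:
            ET.SubElement(ul, "li").text = title
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(fill_layout(LAYOUT, ["First post", "Second post"]))
```

The layout stays a regular XHTML document that can be previewed in a browser, and everything outside the « fromRss » element passes through untouched.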

What I like with simple ideas is that they always leave room for reuse and improvements (complex ideas on the other hand seem to only leave room for more complexity).

Web 2.0 the book

One of the reasons I have been too busy to blog these days is the project to write a comprehensive book about Web 2.0 technologies.

If Web 2.0 is about using the web as a platform, this platform is far from being homogeneous. On the contrary, it is made of a number of very different pieces of technology, from CSS to web server configuration through XML, Javascript, server side programming, HTML, …

I believe that integrating these technologies is one of the main challenges of Web 2.0 developers and I am always surprised if not frightened to see that people tend to get more and more specialized. Too many CSS gurus do not know the first thing about XML, too many XML gurus don’t know how to spell HTTP, too many Java programmers don’t want to know Javascript. And, no, knowing everything about Ajax isn’t enough to write a Web 2.0 application.

In defense of these hyper-specialists, I have also found that most of the available resources, both online and in print, are even more heavily specialized than their authors and that even if you could read a book on each of these technologies you’d find it difficult to get the big picture and understand how they can be used together.

The goal of this book is to fill the gap and be a useful resource for all the Web 2.0 developers who do not want to stay in their highly specialized domain as well as for project managers who need to grasp the Web 2.0 big picture.

This is an ambitious project on which I started to work in December 2005.

The first phase has been to define the book outline with the helpful contribution of many friends.

The second one has been to find a publisher. O’Reilly, the publisher of my two previous books, also happens to be one of the co-inventors of the term « Web 2.0 », and that makes them very nervous about Web 2.0 book projects.

Jim Minatel from Wiley was immediately convinced by the outline and the book will be published in the Wrox Professional Series.

I had initially planned to write the book all by myself, but it would have taken me at least one year to complete this work and the idea of waiting until 2007 to get this book in print didn’t appeal to Jim.

The third step has been to find the team to write the book and the lucky authors are:

Micah Dubinko is tech editing the book and Sara Shlaer is our Development Editor.

We then had to split the work between authors. The exercise was easier than expected. Being in a position to arbitrate, I found it fair to pick the chapters left by the other authors, which leaves me with chapters that will require a lot of research. This is fine since I like learning new things when I write, but it also means more hard work.

This is my first co-authored book and I think that one of the challenges of such books is to keep the whole content coherent. This is especially true for a book whose goal is to give « the big picture » and to explain how different technologies play together.

To facilitate the communication between authors, I have set up a series of internal resources (wiki, mailing list, subversion repository). It’s still too early to say if that will really help but the first results are encouraging.

More recently, I have also set up a public site (http://web2.0thebook.org/) that presents the book and aggregates relevant content. I hope that all these resources will help us to feel and act as a team rather than a set of individual authors.

The « real » work has finally started and we now have the first versions of our first chapters progressing within the Wiley review system.

It’s interesting to see the differences between the processes and rules of different publishers. To me, a book was a book and I hadn’t anticipated so many differences, not only in the tools being used but also in style guidelines.

The first chapter I have written is about Web Services and that’s been a good opportunity to revisit the analysis I had done in 2004 for the ZDNet Web Services Convention [papers (in French)].

From a Web 2.0 developer perspective, I think that the main point is to publish Web Services that are perfectly integrated in the Web architecture, and that means being as RESTful as possible.

I have been happy to see that WSDL 2.0 appears to be making some progress in its support of REST Services even though it’s not perfect yet. I have posted a mail with some of my findings to the Web Services Description Working Group comment list and they have split these comments into three issues on their official issue list ([CR052] [CR053] [CR054]).

I hope they can take these issues into account, even if that means updating my chapter!

Some resources I have found most helpful while I was writing this chapter are:

It’s been fun so far and I look forward to seeing this book « for real ».

Validating microformats

This blog entry follows up on Norm Walsh’s essay on the same subject.

The first thing I’d like to react to isn’t the fact that RELAX NG isn’t suitable for this task, but the reason why this is the case.

Norm says that « there’s just no way to express a pattern that matches an attribute that contains some token » and this assertion isn’t true.

Let’s take the same hReview sample and see what happens when we try to define a RELAX NG schema:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Review</title>
    </head>
    <body>
        <div class="hreview">
            <span><span class="rating">5</span> out of 5 stars</span>
            <h4 class="summary">Crepes on Cole is awesome</h4>
            <span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
                <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>
            <div class="description item vcard"><p>
                <span class="fn org">Crepes on Cole</span> is one of the best little
                creperies in <span class="adr"><span class="locality">San Francisco</span></span>.
                Excellent food and service. Plenty of tables in a variety of sizes
                for parties large and small.  Window seating makes for excellent
                people watching to/from the N-Judah which stops right outside.
                I've had many fun social gatherings here, as well as gotten
                plenty of work done thanks to neighborhood WiFi.
            </p></div>
            <p>Visit date: <span>April 2005</span></p>
            <p>Food eaten: <span>Florentine crepe</span></p>
        </div>
    </body>
</html>

To define an element whose « class » attribute is « type », we would write:

element * {
    attribute class { "type" }
    .../...
}

To define an element whose « class » attribute contains the token « type », we will use the same principle and use a W3C XML Schema pattern facet:

element * {
    attribute class {
        xsd:token { pattern = "(.+\s)?type(\s.+)?" }
    }
}

The regular expression says that a matching class attribute is made of an optional sequence of arbitrary characters ending with a whitespace character, then the token « type », then an optional whitespace character followed by arbitrary characters.

It correctly catches values such as « type », « foo type », « foo type bar », « type bar » and rejects values such as « anytype ».
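
These claims are easy to check outside of a RELAX NG validator. The following Python sketch is only an approximation: Python's `re` module is not the W3C XML Schema regular expression language (the two agree on this simple pattern for ASCII values), and the `class_contains_type` helper is a name made up for this example.

```python
import re

# XSD patterns are implicitly anchored, hence fullmatch() rather than search().
TYPE_TOKEN = re.compile(r"(.+\s)?type(\s.+)?")


def class_contains_type(value):
    """Emulate the xsd:token pattern facet shown above on a class value."""
    # xsd:token collapses whitespace before the pattern facet is applied.
    collapsed = " ".join(value.split())
    return TYPE_TOKEN.fullmatch(collapsed) is not None


if __name__ == "__main__":
    for value in ["type", "foo type", "foo type bar", "type bar", "anytype"]:
        print(repr(value), class_contains_type(value))
```

The first four values are accepted and « anytype » is rejected, because the only way to have characters before or after « type » is to separate them from it with a whitespace character.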

The next tricky thing to express to validate microformats is that you want to allow an element at any level of depth.

For instance, if you’re expecting a « type » tag, you’ll accept:

<span class="type">foo</span>

But also:

<div>
   <p>Type: <span class="type">foo</span></p>
</div>

To do so with RELAX NG, you’ll recursively say that you want either a tag « type » or any other element including a tag « type ».

The « any other element » will have to include an optional « class » attribute whose value doesn’t contain the token « type », but even that isn’t an issue with RELAX NG and the definition could be along these lines:

hreview.type =
    element * {
        anyOtherAttribute,
        mixed {
            (attribute class {
                 xsd:token { pattern = "(.+\s)?type(\s.+)?" }
             },
             anyElement)
            | (attribute class {
                   xsd:token - xsd:token { pattern = "(.+\s)?type(\s.+)?" }
               }?,
               hreview.type)
        }
}

This looks complex and quite ugly, but we wouldn’t have to write such schemas by hand. I like Norm’s idea of writing a simple RELAX NG schema where classes are replaced by element names, and this definition has been generated by an XSLT transformation out of his own definition, which is:

hreview.type = element type { text }

So far, so good. Let’s see where the real blockers are.

The first thing which is quite ugly to validate is the flexibility that allows siblings to be nested.

In the hReview schema, « reviewer » and « dtreviewed » are defined as siblings:

hreview.hreview =
  element hreview {
    text
    & hreview.version?
    & hreview.summary?
    & hreview.type?
    & hreview.item
    & hreview.reviewer?
    & hreview.dtreviewed?
    & hreview.rating?
    & hreview.description?
}

In an XML document, we would expect to see them at the same level as direct children of the « hreview » element.

In the microformats world, this can be the case, but one can also be a descendant of the other, which is the case in our example:

<span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
<abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>

To express that, we would have to say that the content of « hreview » is one of the many combinations in which the sub elements are either siblings or descendants of one another.

I haven’t tried to see if that would be feasible (we’ll see that there is another blocker that makes the question academic) but that would be a real mess to generate.

The second and probably most important blocker is the restriction related to interleave: as stated in my RELAX NG book, « Elements combined through interleave must not overlap between name classes. »

This restriction is hitting us hard here since our name classes do overlap and we are combining the different sub patterns through interleave (see the definition of hreview.hreview above if you’re not convinced).

There are very few workarounds for this restriction:

  • Replacing interleave by an ordered group isn’t an option: microformats are about flexibility and imposing an order between the sub components is most probably out of the question.
  • Replacing interleave by a « zeroOrMore/choice » combination means that we would lose any control over the number of occurrences of each sub component (we could get ten ratings and no items) and this control is one of the few things that this validation catches!

To me, this restriction is the real blocker and means that it isn’t practical to use RELAX NG to validate microformat instances directly.

Of course, we can transform these instances into plain XML as shown by Norm Walsh, but I don’t like this solution very much for a reason he hasn’t mentioned: the errors raised by such a validation would refer to the context within the transformed document, which would be tough for users to understand, and linking this context back to the original document could be complex.

As an alternative, let’s see what we could do with Schematron.

To set a rule context to a specific tag, we can write:

<rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">

We are no longer working on datatypes and need to apply the normalization by hand (thus the use of « normalize-space() »). On the other hand, we can freely use functions and, by adding a leading and a trailing space, we can make sure that the « hreview » token is matched if and only if the result of this manipulation contains the token preceded and followed by a space.

Within this context, we can check the number of occurrences of each sub pattern using more or less the same principle:

      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
     </rule>

Note that the use of the descendant axis (« // ») means that we are treating correctly cases where siblings are embedded.
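
To make the behaviour of this rule more concrete, here is a rough Python emulation of it using the standard `xml.etree.ElementTree` module. It is a sketch and not a Schematron processor: the `check_hreview` and `count_descendants` helpers and the shortened `SAMPLE` document are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the hReview sample: "dtreviewed" is nested in "reviewer".
SAMPLE = """<div class="hreview">
  <span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
    <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>
  <div class="description item vcard"><p><span class="fn org">Crepes on Cole</span></p></div>
</div>"""


def has_token(element, token):
    """Counterpart of contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    return token in element.get("class", "").split()


def count_descendants(context, token):
    """Counterpart of count(.//*[...]): any depth, context element excluded."""
    return sum(1 for e in context.iter() if e is not context and has_token(e, token))


def check_hreview(context):
    """Return the messages that the rule above would raise for one hreview element."""
    messages = []
    if count_descendants(context, "item") == 0:
        messages.append('A mandatory "item" tag is missing.')
    for token in ["version", "summary", "type", "item", "reviewer",
                  "dtreviewed", "rating", "description"]:
        if count_descendants(context, token) > 1:
            messages.append('A "%s" tag is duplicated.' % token)
    return messages


if __name__ == "__main__":
    print(check_hreview(ET.fromstring(SAMPLE)))  # []
```

Because the count is taken over all descendants, the nested « dtreviewed » is found exactly once even though it sits inside « reviewer ».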

Norm Walsh mentions that this can be tedious to write and that you need to define tests for what is allowed and also for what is forbidden.

That’s perfectly right but here again, you don’t have to write this schema by hand and I have written an XSLT transformation that transforms his RELAX NG schema into the following Schematron schema:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
   <pattern name="hreview.hreview">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
      </rule>
   </pattern>
   <pattern name="hreview.version">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">version not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.summary">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">summary not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.type">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">type not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.item">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">item not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.fn">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' fn ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">fn not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.url">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' url ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">url not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.photo">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' photo ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">photo not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.reviewer">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">reviewer not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.dtreviewed">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">dtreviewed not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.rating">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">rating not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.description">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">description not allowed here.</assert>
      </rule>
   </pattern>
</schema>

A couple of notes on this schema:

  • A class attribute can contain several tokens and a single element can match several rules. Since Schematron checks only the first matching rule in each pattern, each definition is in its own pattern.
  • In this example, I have added a test that each tag is found within the context where it is expected. This test reports an error on the sample at the first occurrence of « fn » because this occurrence belongs to another microformat (vCard) which is combined with hReview in this example. It should be possible to switch this test off, and that could be done using Schematron phases.

Apart from that, I think that this could become a very practical solution. The idea would thus be:

  • Define a schema for a microformat using RELAX NG to describe its logical structure. This would probably lead to defining a language subset and conventions to convey information such as « which attribute is used » and would become a kind of « microschema ».
  • Transform this microschema into a Schematron schema.
  • Use this schema to validate instance documents.
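
The second step is the mechanical one. As a rough illustration, the following Python sketch generates the occurrence checks of one rule from a table of sub components. It is not the XSLT transformation mentioned above, which starts from the RELAX NG schema itself: the `rule_for` function and the dictionary that stands in for the microschema are hypothetical.

```python
import xml.etree.ElementTree as ET


def token_test(token):
    """XPath test used throughout the Schematron schema above."""
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % token


def rule_for(container, children):
    """Build one Schematron rule as text.

    `children` maps each class name to True (mandatory) or False (optional);
    this mapping is a hypothetical stand-in for the microschema.
    """
    lines = ['<rule context="*[%s]">' % token_test(container)]
    for name, mandatory in children.items():
        descendants = ".//*[%s]" % token_test(name)
        if mandatory:
            lines.append('   <assert test="%s">A mandatory "%s" tag is missing.</assert>'
                         % (descendants, name))
        lines.append('   <report test="count(%s) &gt; 1">A "%s" tag is duplicated.</report>'
                     % (descendants, name))
    lines.append("</rule>")
    return "\n".join(lines)


if __name__ == "__main__":
    rule = rule_for("hreview", {"version": False, "item": True, "rating": False})
    ET.fromstring(rule)  # raises ParseError if the generated text is not well-formed
    print(rule)
```

Each optional sub component yields a single « report » and each mandatory one an additional « assert », which is the shape of the first pattern in the schema above.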

What I find interesting is that the same RELAX NG microschema could be used as shown by Norm Walsh to feed a transformation that could be applied to instance documents before validation or transformed into a schema that would validate the instance documents and I am pretty sure that these schemas could have many other uses.