Validating microformats – Eric van der Vlist

This blog entry is following up Norm Walsh’s essay on the same subject.

The first thing I’d want to react on isn’t the fact that RELAX NG isn’t suitable for this task, but the reason why this is the case.

Norm says that « there’s just no way to express a pattern that matches an attribute that contains some token » and this assertion isn’t true.

Let’s take the same hReview sample and see what happens when we try to define a RELAX NG schema:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Review</title>
    </head>
    <body>
        <div class="hreview">
            <span><span class="rating">5</span> out of 5 stars</span>
            <h4 class="summary">Crepes on Cole is awesome</h4>
            <span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
                <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>
            <div class="description item vcard"><p>
                <span class="fn org">Crepes on Cole</span> is one of the best little
                creperies in <span class="adr"><span class="locality">San Francisco</span></span>.
                Excellent food and service. Plenty of tables in a variety of sizes
                for parties large and small.  Window seating makes for excellent
                people watching to/from the N-Judah which stops right outside.
                I've had many fun social gatherings here, as well as gotten
                plenty of work done thanks to neighborhood WiFi.
            </p></div>
            <p>Visit date: <span>April 2005</span></p>
            <p>Food eaten: <span>Florentine crepe</span></p>
        </div>
    </body>
</html>

To define an element which « class » attribute is « type », we would write:

element * {
    attribute class { "type" }
    .../...
}

To define an element which « class » attribute contains the token « type », we will use the same principle and use a W3C XML Schema pattern facet:

element * {
    attribute class {
        xsd:token { pattern = "(.+\s)?type(\s.+)?" }
    }
}

The regular expression expresses the fact that we want class attributes with an optional sequence of any character followed by a whitespace character, the token « type » and an optional whitespace followed by any characters.

It correctly catches values such as « type », « foo type », « foo type bar », « type bar » and rejects values such as « anytype ».

The next tricky thing to express to validate microformats is that you want to allow an element at any level of depth.

For instance, if you’re expecting a « type » tag, you’ll accept:

<span class=type>foo</span>

But also:

<div>
   <p>Type: <span class="type">foo</span></p>
</div>

To do so with RELAX NG, you’ll recursively say that you want either a tag « type » or any other element including a tag « type ».

The « any other element » will have include an optional « class » attribute which value doesn’t contain the token « type » but even that isn’t an issue with RELAX NG and the definition could be around these lines:

hreview.type =
    element * {
        anyOtherAttribute,
        mixed {
            (attribute class {
                 xsd:token { pattern = "(.+\s)?type(\s.+)?" }
             },
             anyElement)
            | (attribute class {
                   xsd:token - xsd:token { pattern = "(.+\s)?type(\s.+)?" }
               }?,
               hreview.type)
        }
}

This looks complex and quite ugly but we wouldn’t have to write such schemas by hand. I like Norm’s idea to write a simple RELAX NG schema where classes are replaced by element names and this definition has been generated by a XSLT transformation out of his own definition which is:

hreview.type = element type { text }

So far, so good. Let’s see where the real blockers are.

The first thing which is quite ugly to validate is the flexibility that allows siblings to be nested.

In the hReview schema, « reviewer » and « dtreviewed » are defined as siblings:

hreview.hreview =
  element hreview {
    text
    & hreview.version?
    & hreview.summary?
    & hreview.type?
    & hreview.item
    & hreview.reviewer?
    & hreview.dtreviewed?
    & hreview.rating?
    & hreview.description?
}

In a XML document, we would expect to see them at the same level as direct children od the « hreview » element.

In microformats world, this can be the case, but one can also be a descendant to the other which is the case in our example:

<span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
<abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>

To express that, we would have to say that the content oh « hreview » is one of the many combinations between each sub elements being either siblings or descendants one of each other.

I haven’t tried to see if that would be feasible (we’ll see that there is another blocker that makes the question academic) but that would be a real mess to generate.

The second and probably most important blocker is the restrictions related to interleave: as stated in my RELAX NG book, « Elements combined through interleave must not overlap between name classes. »

This restriction is hitting us hard here since our name classes do overlap and we are combining the different sub patterns through interleave (see the definition of hreview.hreview above if you’re not convinced).

There are very few workarounds for this restriction:

Replacing interleave by an ordered group isn’t an option: microformats are about flexibility and imposing an order between the sub components is most probably out of question.
Replacing interleave by a « zeroOrMore/choice » combination means that we would loose any control over the number of occurrences of each sub components (we could get ten ratings and no items) and this control is one of the few things that this validation catches!

To me, this restriction is the real blocker and means that it isn’t practical to use RELAX NG to validate microformat instances directly.

Of course, we can transform these instances as plain XML as shown by Norm Walsh, but I don’t like this solution very much for a reason he hasn’t mentioned: when we would raise errors with such a validation, these errors would refer to the context within the transformed document which would be tough to understand by users and making the link between this context and the original document could be complex.

As an alternative, let’s see what we could do with Schematron.

To set a rule context to a specifi tag, we can write:

<rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">

We are no longer working on datatypes and need to apply the normalization by hand (thus the use of « normalize-space() »). On the other hand, we can freely use functions and by adding a leading and trailing space, we can make sure that the « hreview » token is matched if and only if he result of this manipulation contains the token preceded and followed by a space.

Within this context, we can check the number of occurrences of each sub pattern using more or less the same principle:

      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
     </rule>

Note that the use of the descendant axis (« // ») means that we are treating correctly cases where siblings are embedded.

Norm Walsh mentions that this can be tedious to write and that you need to define tests for what is allowed and also for what is forbidden.

That’s perfectly right but here again, you don’t have to write this schema by hand and I have written a XSLT transformation that transforms his RELAX NG schema into the following Schematron schema:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
   <pattern name="hreview.hreview">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
      </rule>
   </pattern>
   <pattern name="hreview.version">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">version not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.summary">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">summary not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.type">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">type not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.item">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">item not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.fn">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' fn ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">fn not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.url">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' url ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">url not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.photo">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' photo ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">photo not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.reviewer">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">reviewer not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.dtreviewed">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">dtreviewed not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.rating">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">rating not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.description">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">description not allowed here.</assert>
      </rule>
   </pattern>
</schema>

A couple of notes on this schema:

A class attribute can contain several tokens and a single element can match several rules. Since Schematron checks only the first matching rule in each pattern, each definition is in its own pattern.
In this example, I have added a test that each tag is found within the context where it is expected. This test reports an error on the sample at the first occurrence of « fn » because this occurrence belongs to another microformat (vCard) which is combine with hReview in this example. This test should be switchable off and that could be done using Schematron phases.

A part from that, I think that this could become a very practical solution. The idea would thus be:

Define a schema for a microformat using RELAX NG to describe its logical structure. This would probably lead to defining a language subset and conventions to convey information such as « which attribute is used » and would become a kind of « microschema ».
Transform this microschema into a Schematron schema.
Use this schema to validate instance documents.

What I find interesting is that the same RELAX NG microschema could be used as shown by Norm Walsh to feed a transformation that could be applied to instance documents before validation or transformed into a schema that would validate the instance documents and I am pretty sure that these schemas could have many other uses.

Laisser un commentaire Annuler la réponse