Normalizing Excel’s SpreadsheetML using XSLT – Part 2

As one commenter reported, there was a bug in the XSLT transformation that « normalizes » Excel’s SpreadsheetML documents, which I posted in a previous entry.

I have fixed this bug and the new version is:

<?xml version="1.0"?>
<!--

Adapted from http://ewbi.blogs.com/develops/2004/12/normalize_excel.html

This product may incorporate intellectual property owned by Microsoft Corporation. The terms
and conditions upon which Microsoft is licensing such intellectual property may be found at
http://msdn.microsoft.com/library/en-us/odcXMLRef/html/odcXMLRefLegalNotice.asp.
-->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:schemas-microsoft-com:office:spreadsheet"
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <xsl:output method="xml" indent="no" encoding="UTF-8"/>
    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="normalize"/>
    </xsl:template>
    <xsl:template match="@*|node()" mode="normalize">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="normalize"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="ss:Cell/@ss:Index" mode="normalize"/>
    <xsl:template match="ss:Cell" name="copy" mode="normalize">
        <xsl:copy>
            <xsl:apply-templates select="@*" mode="normalize"/>
            <xsl:variable name="prevCells" select="preceding-sibling::ss:Cell"/>
            <xsl:variable name="nbPrecedingIndexes"
                select="count(preceding-sibling::ss:Cell[@ss:Index])"/>
            <xsl:attribute name="ss:Index">
                <xsl:choose>
                    <xsl:when test="@ss:Index">
                        <xsl:value-of select="@ss:Index"/>
                    </xsl:when>
                    <xsl:when test="count($prevCells) = 0">
                        <xsl:value-of select="1"/>
                    </xsl:when>
                    <xsl:when test="$nbPrecedingIndexes > 0">
                        <xsl:variable name="precedingCellsSinceLastIndex"
                            select="preceding-sibling::ss:Cell[count(preceding-sibling::ss:Cell[@ss:Index]|self::ss:Cell[@ss:Index]) = $nbPrecedingIndexes]"/>
                        <xsl:value-of
                            select="preceding-sibling::ss:Cell[@ss:Index][1]/@ss:Index +
                            count($precedingCellsSinceLastIndex)
                            + sum($precedingCellsSinceLastIndex/@ss:MergeAcross)
                            - count ($precedingCellsSinceLastIndex[@ss:MergeAcross])"
                        />
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:value-of
                            select="count($prevCells) + 1 +
                            sum($prevCells/@ss:MergeAcross) -count($prevCells/@ss:MergeAcross)"
                        />
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:attribute>
            <xsl:apply-templates select="node()" mode="normalize"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
            

I have also written the following set of tests (using XSLTUnit):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl" xmlns:xsltu="http://xsltunit.org/0/"
  exclude-result-prefixes="exsl">
  <xsl:import href="excelNormalize.xsl"/>
  <xsl:import href="xsltunit.xsl"/>
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <xsltu:tests>
      <xsltu:test id="noIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell>A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">noIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="1">A</ss:Cell>
              <ss:Cell ss:Index="2">B</ss:Cell>
              <ss:Cell ss:Index="3">C</ss:Cell>
              <ss:Cell ss:Index="4">D</ss:Cell>
              <ss:Cell ss:Index="5">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="withIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="1">A</ss:Cell>
            <ss:Cell ss:Index="2">B</ss:Cell>
            <ss:Cell ss:Index="3">C</ss:Cell>
            <ss:Cell ss:Index="4">D</ss:Cell>
            <ss:Cell ss:Index="5">E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">withIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2" select="$input"/>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="firstIndex">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="5">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">firstIndex</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="5">A</ss:Cell>
              <ss:Cell ss:Index="6">B</ss:Cell>
              <ss:Cell ss:Index="7">C</ss:Cell>
              <ss:Cell ss:Index="8">D</ss:Cell>
              <ss:Cell ss:Index="9">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="altIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="2">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell ss:Index="5">C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">altIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="2">A</ss:Cell>
              <ss:Cell ss:Index="3">B</ss:Cell>
              <ss:Cell ss:Index="5">C</ss:Cell>
              <ss:Cell ss:Index="6">D</ss:Cell>
              <ss:Cell ss:Index="7">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="noIndexesMergeAcross">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell>A</ss:Cell>
            <ss:Cell ss:MergeAcross="2">B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell ss:MergeAcross="3">D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">noIndexesMergeAcross</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="1">A</ss:Cell>
              <ss:Cell ss:MergeAcross="2" ss:Index="2">B</ss:Cell>
              <ss:Cell ss:Index="4">C</ss:Cell>
              <ss:Cell ss:MergeAcross="3" ss:Index="5">D</ss:Cell>
              <ss:Cell ss:Index="8">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="withIndexesMergeAcross">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="5" ss:MergeAcross="2">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell ss:Index="10">C</ss:Cell>
            <ss:Cell ss:MergeAcross="3">D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">withIndexesMergeAcross</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="5" ss:MergeAcross="2">A</ss:Cell>
              <ss:Cell ss:Index="7">B</ss:Cell>
              <ss:Cell ss:Index="10">C</ss:Cell>
              <ss:Cell ss:MergeAcross="3" ss:Index="11">D</ss:Cell>
              <ss:Cell ss:Index="14">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
    </xsltu:tests>
  </xsl:template>
</xsl:stylesheet>
            

These tests should help you understand what this transformation does.
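For readers who prefer to follow the arithmetic outside XSLT, here is a minimal Python sketch (not part of the stylesheet) that mirrors the column computation exercised by the tests above. The helper name `normalize_indexes` is hypothetical, and the sketch assumes, exactly as the tests do, that a cell without `ss:MergeAcross` advances the column counter by one and a cell with `ss:MergeAcross="n"` advances it by n.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
INDEX = f"{{{SS}}}Index"
MERGE = f"{{{SS}}}MergeAcross"


def normalize_indexes(row):
    """Return the ss:Index each ss:Cell of `row` would get (hypothetical helper).

    Mirrors the stylesheet's arithmetic: an explicit ss:Index wins,
    otherwise the column follows from the previous cell.
    """
    indexes = []
    next_col = 1
    for cell in row.findall(f"{{{SS}}}Cell"):
        col = int(cell.get(INDEX, next_col))
        indexes.append(col)
        merge = cell.get(MERGE)
        # Same step as the XSLT: +1 per cell, + MergeAcross, -1 per merged cell.
        step = 1 if merge is None else int(merge)
        next_col = col + step
    return indexes


# Same input as the "withIndexesMergeAcross" test.
ROW = f"""
<ss:Row xmlns:ss="{SS}">
  <ss:Cell ss:Index="5" ss:MergeAcross="2">A</ss:Cell>
  <ss:Cell>B</ss:Cell>
  <ss:Cell ss:Index="10">C</ss:Cell>
  <ss:Cell ss:MergeAcross="3">D</ss:Cell>
  <ss:Cell>E</ss:Cell>
</ss:Row>
"""

print(normalize_indexes(ET.fromstring(ROW)))  # [5, 7, 10, 11, 14]
```

A single running counter is enough here because each cell's column depends only on the cells before it, which is what the three `xsl:when` branches and the `xsl:otherwise` branch compute with sibling counts.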

Thanks, and please continue to report bugs and feature requests as comments.

The influence of microformats on style-free stylesheets

It’s been a while, almost six years, since I wrote my Style-free XSLT Style Sheets piece for XML.com, but this simple technique remains one of my favorites.

It was not only my first article published on XML.com but also the subject of my first talk at an IDEAlliance XML conference, and it’s fair to say that it has been instrumental in launching my career as an « international XML guru ».

All that aside, this technique remains my favorite because of its efficiency. I use it over and over, to generate (X)HTML but also many other XML vocabularies, as different as OpenOffice documents and W3C XML Schemas. The more complex the vocabulary to generate, the more reason you have to keep it outside your XSLT transformations and the more efficient style-free stylesheets are.

Style-free stylesheets have become a reflex for me, and I wrote one without even thinking about it to power the web site of our upcoming Web 2.0 book.

In my old XML.com article, I used specific, non-XHTML elements:

        <td width="75%" bgcolor="Aqua">
            <insert-body/>
        </td>

That works fine, but your layout documents are no longer valid XHTML and they don’t display like the target documents in a browser.

Why not follow the microformats approach and use regular XHTML elements with specific class attributes instead:

        <div id="planet">
            <h1>Planet Web 2.0 the book</h1>
            <p>Aggregated content relevant to this book.</p>
            <div class="fromRss"/>
             .../...
        </div>           

In this case, the XSLT transformation replaces the content of any element whose class attribute contains the token « fromRss » with the formatted output of the RSS feed. This has the additional benefit that I can leave mock-up content in place to make the layout look like a final document:

        <div id="planet">
            <h1>Planet Web 2.0 the book</h1>
            <p>Aggregated content relevant to this book.</p>
            <div class="fromRss">
                <ul>
                    <li>
                        <div>
                            <h2>
                                <a
                                    href="http://www.orbeon.com/blog/2006/06/02/about-json-and-poor-marketing-strategies/"
                                    title="XForms Everywhere » About JSON and poor marketing strategies"
                                    >XForms Everywhere » About JSON and poor marketing
                                strategies</a>
                            </h2>
                        </div>
                    </li>
                </ul>
            </div>
            <p>
                <a href="http://del.icio.us/rss/tag/web2.0thebook" title="RSS feed (on del.icio.us)">
                    <img src="feed-icon-24x24.png" alt="RSS feed"/>
                </a> (on <a href="http://del.icio.us/" title="del.icio.us">del.icio.us</a>)</p>
        </div>
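The replacement step can be sketched in a few lines. This is an illustration in Python rather than the XSLT that actually powers the site; the helper names `has_token` and `fill_placeholders` are hypothetical, and the feed is passed in as plain (title, URL) pairs instead of being read from a real RSS document.

```python
import xml.etree.ElementTree as ET


def has_token(element, token):
    """True if the element's class attribute contains `token` as a whole word."""
    return token in element.get("class", "").split()


def fill_placeholders(layout, token, items):
    """Replace the content of every element tagged with `token` (hypothetical helper).

    `items` is a list of (title, url) pairs standing in for a parsed RSS feed.
    """
    # Collect the targets first so the tree is not modified while iterating.
    targets = [e for e in layout.iter() if has_token(e, token)]
    for element in targets:
        # Drop the mock-up content but keep the element and its attributes.
        for child in list(element):
            element.remove(child)
        element.text = None
        ul = ET.SubElement(element, "ul")
        for title, url in items:
            li = ET.SubElement(ul, "li")
            link = ET.SubElement(li, "a", href=url)
            link.text = title
    return layout


LAYOUT = """
<div id="planet">
  <h1>Planet Web 2.0 the book</h1>
  <div class="fromRss"><ul><li>mock-up entry</li></ul></div>
</div>
"""

page = fill_placeholders(
    ET.fromstring(LAYOUT), "fromRss", [("First post", "http://example.org/1")]
)
print(ET.tostring(page, encoding="unicode"))
```

Because the placeholder is matched on a class token rather than on a dedicated element name, the layout stays ordinary XHTML and the mock-up content simply disappears at transformation time.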

What I like about simple ideas is that they always leave room for reuse and improvement (complex ideas, on the other hand, seem to leave room only for more complexity).

Web 2.0 the book

One of the reasons I have been too busy to blog these days is the project to write a comprehensive book about Web 2.0 technologies.

If Web 2.0 is about using the web as a platform, this platform is far from being homogeneous. On the contrary, it is made of a number of very different pieces of technology, from CSS to web server configuration through XML, JavaScript, server-side programming, HTML, …

I believe that integrating these technologies is one of the main challenges for Web 2.0 developers, and I am always surprised, if not frightened, to see that people tend to get more and more specialized. Too many CSS gurus do not know the first thing about XML, too many XML gurus don’t know how to spell HTTP, too many Java programmers don’t want to know JavaScript. And, no, knowing everything about Ajax isn’t enough to write a Web 2.0 application.

In defense of these hyper-specialists, I have also found that most of the available resources, both online and in print, are even more heavily specialized than their authors, and that even if you could read a book on each of these technologies you’d find it difficult to get the big picture and understand how they can be used together.

The goal of this book is to fill the gap and be a useful resource for all the Web 2.0 developers who do not want to stay in their highly specialized domain, as well as for project managers who need to grasp the Web 2.0 big picture.

This is an ambitious project that I started working on in December 2005.

The first phase was to define the book outline, with helpful contributions from many friends.

The second was to find a publisher. O’Reilly, the publisher of my two previous books, also happens to be one of the co-inventors of the term « Web 2.0 », and that makes them very nervous about Web 2.0 book projects.

Jim Minatel from Wiley was immediately convinced by the outline, and the book will be published in the Wrox Professional Series.

I had initially planned to write the book all by myself, but it would have taken me at least one year to complete this work and the idea of waiting until 2007 to get this book in print didn’t appeal to Jim.

The third step was to find the team to write the book, and the lucky authors are:

Micah Dubinko is tech editing the book and Sara Shlaer is our Development Editor.

We then had to split the work between authors. The exercise was easier than expected. Being in a position to arbitrate the choice, I found it fair to pick the chapters left by the other authors, and this leaves me with chapters that will require a lot of research. This is fine since I like learning new things when I write, but it also means more hard work.

This is my first co-authored book and I think that one of the challenges of such books is to keep the whole content coherent. This is especially true for a book whose goal is to give « the big picture » and to explain how different technologies play together.

To facilitate communication between authors, I have set up a series of internal resources (wiki, mailing list, Subversion repository). It’s still too early to say if that will really help, but the first results are encouraging.

More recently, I have also set up a public site (http://web2.0thebook.org/) that presents the book and aggregates relevant content. I hope that all these resources will help us to feel and act as a team rather than a set of individual authors.

The « real » work has finally started and we now have the first versions of our first chapters progressing through the Wiley review system.

It’s interesting to see the differences between processes and rules from different publishers. To me, a book was a book and I hadn’t anticipated so many differences, not only in the tools being used but also in style guidelines.

The first chapter I have written is about Web Services and that’s been a good opportunity to revisit the analysis I had done in 2004 for the ZDNet Web Services Convention [papers (in French)].

From a Web 2.0 developer’s perspective, I think that the main point is to publish Web Services that are perfectly integrated in the Web architecture, and that means being as RESTful as possible.

I have been happy to see that WSDL 2.0 appears to be making some progress in its support of REST Services, even though it’s not perfect yet. I have posted a mail with some of my findings to the Web Services Description Working Group comment list and they have split these comments into three issues on their official issue list ([CR052] [CR053] [CR054]).

I hope they can take these issues into account, even if that means updating my chapter!

Some resources I have found most helpful while I was writing this chapter are:

It’s been fun so far and I look forward to seeing this book « for real ».

Validating microformats

This blog entry follows up on Norm Walsh’s essay on the same subject.

The first thing I want to react to isn’t the fact that RELAX NG isn’t suitable for this task, but the reason why this is the case.

Norm says that « there’s just no way to express a pattern that matches an attribute that contains some token » and this assertion isn’t true.

Let’s take the same hReview sample and see what happens when we try to define a RELAX NG schema:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Review</title>
    </head>
    <body>
        <div class="hreview">
            <span><span class="rating">5</span> out of 5 stars</span>
            <h4 class="summary">Crepes on Cole is awesome</h4>
            <span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
                <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>
            <div class="description item vcard"><p>
                <span class="fn org">Crepes on Cole</span> is one of the best little
                creperies in <span class="adr"><span class="locality">San Francisco</span></span>.
                Excellent food and service. Plenty of tables in a variety of sizes
                for parties large and small.  Window seating makes for excellent
                people watching to/from the N-Judah which stops right outside.
                I've had many fun social gatherings here, as well as gotten
                plenty of work done thanks to neighborhood WiFi.
            </p></div>
            <p>Visit date: <span>April 2005</span></p>
            <p>Food eaten: <span>Florentine crepe</span></p>
        </div>
    </body>
</html>

To define an element whose « class » attribute is « type », we would write:

element * {
    attribute class { "type" }
    .../...
}

To define an element whose « class » attribute contains the token « type », we will follow the same principle and use a W3C XML Schema pattern facet:

element * {
    attribute class {
        xsd:token { pattern = "(.+\s)?type(\s.+)?" }
    }
}

The regular expression says that we want class attributes made of an optional sequence of any characters followed by a whitespace character, then the token « type », then an optional whitespace character followed by any characters.

It correctly catches values such as « type », « foo type », « foo type bar », « type bar » and rejects values such as « anytype ».
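The behaviour of this pattern is easy to check outside a RELAX NG validator. The sketch below uses Python's `re` module as a stand-in for the W3C XML Schema regular expression engine, which is an approximation; the function name `matches_type_token` is hypothetical. It assumes the usual xsd:token whitespace collapsing is applied before the pattern is tested.

```python
import re

# The pattern from the schema above. XSD patterns are implicitly anchored,
# hence re.fullmatch rather than re.search.
TYPE_TOKEN = re.compile(r"(.+\s)?type(\s.+)?")


def matches_type_token(value):
    """Collapse whitespace as xsd:token does, then apply the pattern."""
    collapsed = " ".join(value.split())
    return TYPE_TOKEN.fullmatch(collapsed) is not None


for value in ["type", "foo type", "foo type bar", "type bar", "anytype"]:
    print(repr(value), matches_type_token(value))
```

The first four values are accepted and « anytype » is rejected, because the pattern only lets « type » be preceded or followed by other characters when a whitespace character separates them.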

The next tricky thing to express when validating microformats is that you want to allow an element at any level of depth.

For instance, if you’re expecting a « type » tag, you’ll accept:

<span class="type">foo</span>

But also:

<div>
   <p>Type: <span class="type">foo</span></p>
</div>

To do so with RELAX NG, you’ll recursively say that you want either a tag « type » or any other element including a tag « type ».

The « any other element » will have to include an optional « class » attribute whose value doesn’t contain the token « type », but even that isn’t an issue with RELAX NG and the definition could be along these lines:

hreview.type =
    element * {
        anyOtherAttribute,
        mixed {
            (attribute class {
                 xsd:token { pattern = "(.+\s)?type(\s.+)?" }
             },
             anyElement)
            | (attribute class {
                   xsd:token - xsd:token { pattern = "(.+\s)?type(\s.+)?" }
               }?,
               hreview.type)
        }
}

This looks complex and quite ugly, but we wouldn’t have to write such schemas by hand. I like Norm’s idea of writing a simple RELAX NG schema where classes are replaced by element names, and this definition has been generated by an XSLT transformation out of his own definition, which is:

hreview.type = element type { text }

So far, so good. Let’s see where the real blockers are.

The first thing which is quite ugly to validate is the flexibility that allows siblings to be nested.

In the hReview schema, « reviewer » and « dtreviewed » are defined as siblings:

hreview.hreview =
  element hreview {
    text
    & hreview.version?
    & hreview.summary?
    & hreview.type?
    & hreview.item
    & hreview.reviewer?
    & hreview.dtreviewed?
    & hreview.rating?
    & hreview.description?
}

In an XML document, we would expect to see them at the same level, as direct children of the « hreview » element.

In the microformats world, this can be the case, but one can also be a descendant of the other, which is the case in our example:

<span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
<abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>

To express that, we would have to say that the content of « hreview » is one of the many combinations in which the sub-elements are either siblings or descendants of one another.

I haven’t tried to see if that would be feasible (we’ll see that there is another blocker that makes the question academic) but that would be a real mess to generate.

The second and probably most important blocker is the restriction related to interleave: as stated in my RELAX NG book, « Elements combined through interleave must not overlap between name classes. »

This restriction is hitting us hard here since our name classes do overlap and we are combining the different sub patterns through interleave (see the definition of hreview.hreview above if you’re not convinced).

There are very few workarounds for this restriction:

  • Replacing interleave with an ordered group isn’t an option: microformats are about flexibility, and imposing an order between the sub-components is most probably out of the question.
  • Replacing interleave with a « zeroOrMore/choice » combination means that we would lose any control over the number of occurrences of each sub-component (we could get ten ratings and no items), and this control is one of the few things that this validation catches!

To me, this restriction is the real blocker and means that it isn’t practical to use RELAX NG to validate microformat instances directly.

Of course, we can transform these instances into plain XML as shown by Norm Walsh, but I don’t like this solution very much for a reason he hasn’t mentioned: any errors raised by such a validation would refer to the context within the transformed document, which would be tough for users to understand, and linking this context back to the original document could be complex.

As an alternative, let’s see what we could do with Schematron.

To set a rule context to a specific tag, we can write:

<rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">

We are no longer working on datatypes and need to apply the normalization by hand (thus the use of « normalize-space() »). On the other hand, we can freely use functions, and by adding a leading and a trailing space we can make sure that the « hreview » token is matched if and only if the result of this manipulation contains the token preceded and followed by a space.

Within this context, we can check the number of occurrences of each sub pattern using more or less the same principle:

      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
     </rule>

Note that the use of the descendant axis (« // ») means that we correctly handle cases where siblings are nested.
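The logic of this rule can be reproduced in a few lines of Python to see what it reports on a small fragment. This is only a sketch of the same counting idea, not a Schematron processor; the helper names (`has_token`, `descendants_with`, `check_hreview`) and the sample fragment are made up for the illustration.

```python
import xml.etree.ElementTree as ET

OPTIONAL = ["version", "summary", "type", "reviewer",
            "dtreviewed", "rating", "description"]


def has_token(element, token):
    """Python counterpart of contains(concat(' ', normalize-space(@class), ' '), ...)."""
    return token in element.get("class", "").split()


def descendants_with(element, token):
    # Element.iter() also yields the element itself, so skip it (like .//*).
    return [e for e in element.iter() if e is not element and has_token(e, token)]


def check_hreview(root):
    """Return the messages the rule above would produce (hypothetical helper)."""
    messages = []
    for review in [e for e in root.iter() if has_token(e, "hreview")]:
        if not descendants_with(review, "item"):
            messages.append('A mandatory "item" tag is missing.')
        for token in OPTIONAL + ["item"]:
            if len(descendants_with(review, token)) > 1:
                messages.append(f'A "{token}" tag is duplicated.')
    return messages


# Made-up fragment: two ratings, no item, and a dtreviewed nested in reviewer.
SAMPLE = """
<div class="hreview">
  <span class="rating">5</span>
  <span class="rating">4</span>
  <span class="reviewer vcard">
    <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr>
  </span>
</div>
"""

for message in check_hreview(ET.fromstring(SAMPLE)):
    print(message)
```

On this fragment the sketch reports the missing « item » and the duplicated « rating », while the nested « dtreviewed » is counted once and raises nothing.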

Norm Walsh mentions that this can be tedious to write and that you need to define tests for what is allowed and also for what is forbidden.

That’s perfectly right but, here again, you don’t have to write this schema by hand, and I have written an XSLT transformation that transforms his RELAX NG schema into the following Schematron schema:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
   <pattern name="hreview.hreview">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
      </rule>
   </pattern>
   <pattern name="hreview.version">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">version not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.summary">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">summary not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.type">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">type not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.item">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">item not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.fn">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' fn ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">fn not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.url">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' url ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">url not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.photo">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' photo ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">photo not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.reviewer">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">reviewer not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.dtreviewed">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">dtreviewed not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.rating">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">rating not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.description">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">description not allowed here.</assert>
      </rule>
   </pattern>
</schema>

A couple of notes on this schema:

  • A class attribute can contain several tokens and a single element can match several rules. Since Schematron checks only the first matching rule in each pattern, each definition is in its own pattern.
  • In this example, I have added a test that each tag is found within the context where it is expected. This test reports an error on the sample at the first occurrence of « fn » because this occurrence belongs to another microformat (vCard) which is combined with hReview in this example. It should be possible to switch this test off, and that could be done using Schematron phases.
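
To make the first note more concrete, here is a minimal Python sketch (not part of the schema, and not how a Schematron processor is implemented). The helper names `has_class_token` and `fired_rules` are hypothetical. The first mirrors the XPath test used in the rule contexts; the second mimics the "first matching rule per pattern" behaviour for a single element.

```python
def has_class_token(class_attr, token):
    """Python mirror of the XPath test used in the rule contexts:
    contains(concat(' ', normalize-space(@class), ' '), ' token ')"""
    normalized = " ".join(class_attr.split())  # same effect as normalize-space()
    return f" {token} " in f" {normalized} "


def fired_rules(class_attr, patterns):
    """Return the tokens whose rule fires for one element.

    `patterns` is a list of patterns, each a list of rule tokens. Within one
    pattern only the first rule whose context matches is applied.
    """
    fired = []
    for rules in patterns:
        for token in rules:
            if has_class_token(class_attr, token):
                fired.append(token)
                break  # later rules of this pattern are skipped for this element
    return fired


element_class = "url  photo"          # one element carrying two tokens
one_pattern = [["url", "photo"]]      # both rules in a single pattern
own_patterns = [["url"], ["photo"]]   # one pattern per rule, as in the schema

print(fired_rules(element_class, one_pattern))   # ['url']
print(fired_rules(element_class, own_patterns))  # ['url', 'photo']
```

With a single pattern, the « photo » rule would never be checked on an element that also carries « url », which is why each definition has its own pattern.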

Apart from that, I think that this could become a very practical solution. The idea would thus be:

  • Define a schema for a microformat using RELAX NG to describe its logical structure. This would probably lead to defining a language subset and conventions to convey information such as « which attribute is used » and would become a kind of « microschema ».
  • Transform this microschema into a Schematron schema.
  • Use this schema to validate instance documents.

What I find interesting is that the same RELAX NG microschema could be used, as shown by Norm Walsh, to feed a transformation that is applied to instance documents before validation, or be transformed into a schema that validates the instance documents. I am pretty sure that these schemas could have many other uses.

TreeBind is about making Java as agile as it can be

TreeBind seems to be getting more visible :

Last week I attended several SD West sessions that gave me interesting ideas for TreeBind:

  • Under the rather misleading title « Enterprise Java for Elvis », Cay Horstmann presented some good stuff coming with EJB 3.0. I was impressed that POJOs (Plain Old Java Objects) can now be used and that most of the persistence configuration can be done through Java annotations. That’s something that could be useful within TreeBind: annotations are available through reflection and could be used to convey serialization information such as the relative order of sub-elements and whether a property should be written as an element or an attribute.
  • Allen Holub gave his very enlightening presentation, « Everything You Know is Wrong: Extends and Get/Set Methods are Evil », in which he explains why classes should expose behaviors rather than properties. When you think about it, that seems obvious enough, but it still helps when someone such as Allen Holub explains it! The exceptions are serialization and deserialization, where classes need to expose their internals (Allen Holub says that the languages should take care of that, but that’s not yet the case with Java). Even in that case, he favours specific « importer » and « exporter » classes over getters and setters, and that’s an option that could be used by TreeBind too (the current version relies on getters and setters).
  • Rick Wayne proposed « Railin’ on AJAX » and I was looking forward to seeing what was behind the Ruby on Rails buzz. How’s that related to TreeBind? One of the lessons learned from Ruby on Rails is the « DRY » (Don’t Repeat Yourself) principle and the way Rails generates the classes from the database. It would be easy enough for TreeBind to generate the Java classes corresponding to an XML document. This could be done by TreeBind itself, through a TreeBind Sink which would write Java source files. Annotations could be added to the document to describe cardinalities and datatypes, and the XML document would be used the way schemas are used by JAXB. Using XML instances as schemas? Doesn’t that ring a bell? That’s exactly what Examplotron is about!
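
The last idea can be sketched outside of TreeBind. The following Python snippet is only an illustration under simple assumptions: no namespaces, leaf elements become `String` properties, and an element name repeated under the same parent becomes a `java.util.List`. It does not use the TreeBind API, and the names `class_name` and `java_sources` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def class_name(tag):
    """Turn an element name such as 'folderManager' into 'FolderManager'."""
    return tag[0].upper() + tag[1:]


def java_sources(xml_text):
    """Return {class name: Java source text} inferred from a sample document."""
    sources = {}

    def visit(element):
        occurrences = Counter(child.tag for child in element)
        fields = []
        for child in element:
            if occurrences[child.tag] == 0:
                continue  # this name has already produced a field
            # Elements with children get their own class, leaves are Strings.
            java_type = class_name(child.tag) if len(child) else "String"
            if occurrences[child.tag] > 1:
                java_type = f"java.util.List<{java_type}>"
            fields.append(f"    private {java_type} {child.tag};")
            occurrences[child.tag] = 0
            if len(child):
                visit(child)  # only the first occurrence serves as the example
        name = class_name(element.tag)
        sources[name] = f"public class {name} {{\n" + "\n".join(fields) + "\n}\n"

    visit(ET.fromstring(xml_text))
    return sources


SAMPLE = """
<listManager>
    <server>localhost</server>
    <folderManager>
        <folder>user.list</folder>
        <messageHandler><moveTo>done</moveTo></messageHandler>
        <messageHandler><moveTo>unparsed</moveTo></messageHandler>
    </folderManager>
</listManager>
"""

for source in java_sources(SAMPLE).values():
    print(source)
```

A real generator would also need the annotations mentioned above (cardinalities, datatypes) instead of guessing them from a single instance.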

What’s the common thread between all that?

The initial motivation is still there: to make binding as transparent and lightweight as possible and Java as agile as it can be!

W3C Internationalization « Tag » Set

2006-02-22: The Internationalization Tag Set Working Group has published an updated Working Draft of the Internationalization Tag Set (ITS). Organized by data categories, this set of elements and attributes supports the internationalization and localization of schemas and documents. Implementations are provided for DTDs, XML Schema and Relax NG, and for existing vocabularies like XHTML, DocBook and OpenDocument. Visit the Internationalization home page.

(Copied from the W3C News Archive)

I had missed the previous version of this document and I was very impressed and pleased while (quickly) reading it.

Among the good things, I’d mention:

  • Flexibility: ITS can be used within the documents to localize, within the schemas that describe these documents or standalone.
  • Schema agnosticism: ITS can be used with DTDs, W3C XML Schema and RELAX NG (I don’t see why the list has been limited to these three, but, at least, RELAX NG is explicitly mentioned).
  • No QNames: more precisely, ITS has been wise enough to avoid using namespace declarations for its QNames.

Among the things that could be improved, I have found (and reported):

  • The word « tag » in the name itself, « Internationalization Tag Set »: we spend our time explaining that XML is about trees and that tags are only syntactic sugar to mark the beginning and the end of elements, and I wouldn’t have expected to see this word in the name of a W3C specification! [bug 2922]
  • The fact that the same element names are used in schemas and instance documents: schemas with XML syntaxes are also instances, and ITS could be used to localize the schemas themselves instead of localizing the instances described by these schemas. Unfortunately, doing so would lead to confusion since the ITS element names would be the same for both usages. [bug 2923]
  • The list of schema languages could be left open [bug 2924]

Publishing GPG or PGP public keys considered harmful?

In a previous post, I expressed the common thinking that digitally signed emails would be a strong spam stopper.

I still think that a more general usage of electronic signatures would be really effective in the fight against spammers, but it recently occurred to me that, at least before we reach that stage, publishing one’s public key can be considered… harmful!

A system such as GPG/PGP relies on the fact that public keys, which are used to check signatures, are not only public but also easy to find, and you typically publish them both on your web site and on public key servers.

At the same time, these public keys can be used to encrypt messages that you want to send to their owners.

This encryption is typically « end to end »: the message is encrypted by the sender’s mail user agent and decrypted by the recipient’s mail user agent with the recipient’s private key, and nobody, either human or software, can read the content of the message in between.

While this is really great for preserving your privacy, this also means that neither anti-spam nor anti-virus software can read the content of encrypted emails without knowing the recipient’s private key, and that pretty much eliminates any server side shielding.

Keeping your public key private would eliminate most of the benefit of signing your mails, but if you make your public key public, you’d better be very careful when reading encrypted emails, especially when they are not signed!

Edd Dumbill on XTech 2006

Last year Edd Dumbill, XTech Conference Chair, was kind enough to answer my questions about the 2005 edition of the conference previously known as « XML Europe ». We’re renewing the experience, taking the opportunity to look back at last year’s edition and to figure out what XTech 2006 should look like.

vdV: You mention in your blog the success of XTech 2005 and that’s an appreciation which is shared by many attendees (including myself). For those who missed XTech 2005, can you elaborate on what makes you say that it was a success?

Edd: What I was particularly pleased with was the way we adapted the conference topic areas to reflect the changing technology landscape.

With Firefox and Opera, web browser technology matters a lot more now, but there was no forum to discuss it. We provided one, and some good dialog was opened up between developers, users and standards bodies.

But, to sum up how I know the conference was successful: because everybody who went told me that they had a good and profitable time!

vdV: You said during our previous interview that two new tracks which « aren’t strictly about XML topics at all » were introduced last year (Browser Technology and Open Data) to reflect the fact that « XML broadens out beyond traditional core topics ». Have these tracks met their goal of attracting a new audience?

Edd: Yes, I’m most excited about them. As I said before, the browser track really worked at getting people talking. The Open Data track was also very exciting: we heard a lot from people out there in the real world providing public data services.

The thing is that people in these « new » audiences work closely with the existing XML technologists anyway. It didn’t make sense to talk about XML and leave SVG, XHTML and XUL out in the cold: these are just as much document technologies as DocBook!

One thing that highlighted this for me was that I heard from a long-time SGML and then XML conference attendee that XTech’s subject matter was the most interesting they’d seen in years.

vdV: Did the two « older » tracks (Core Technologies and Applications) hold up against these two new tracks, and would you qualify them as successful too?

Edd: Yes, I would! XTech is still a very important home for leaders in the core of XML technology. Yet also I think there’s always a need to change to adapt to the priorities of the conference attendees. One thing I want to do this year is to freshen the Applications track to reflect the rapidly changing landscape in which web applications are now being constructed. As well as covering the use of XML vocabularies and its technologies, I think the frameworks such as Rails, Cocoon, Orbeon and Django are important topics.

vdV: What would you like to do better in 2006?

Edd: As I’ve mentioned above, I think the Applications track can and will be better. I’d like also for there to be increased access to the conference for people such as designers and information architects. The technology discussed at XTech often directly affects these people, but there’s not always much dialogue between the technologists and the users. I’d love to foster more understanding and collaboration in that way.

vdV: You mention in your blog and in the CFP that there will be panel discussions for each track. How do you see these panel discussions?

Edd: Based on feedback from 2005’s conference, I would like the chance for people to discuss the important issues of the day in their field. For instance, how should XML implementors choose between XQuery and XSLT2, or how can organisations safely manage exposing their data as a web service? There’s no simple answer to these questions, and discussions will foster greater understanding, and maybe bring some previously unknown insights to those responsible for steering the technology.

vdV: The description of the tracks for XTech 2006 looks very similar to its predecessor. Does that mean that this will be a replay of XTech 2005?

Edd: Yes, but even more so! In fact, XTech 2005 was really a « web 2.0 » conference even before people put a name to what was happening. In 2006 I want to build on last year’s success and provide continuity.

vdV: In last year’s description, the semantic web had its own bullet point in the « Open Data » track and this year, it’s sharing a bullet point with tagging and annotation. Does that mean that tagging and annotation can be seen as an alternative to the semantic web? Doesn’t the semantic web deserve its own track?

Edd: The Semantic Web as a more formal sphere already has many conferences of its own. While XTech definitely wants to cover semantic web, it doesn’t want to get carried away with the complicated academic corners of the topic, but more see where semantic web technologies can be directly used today.

Also, I see the potential for semantic web technologies to pervade all areas that XTech covers. RDF for instance, is a « core technology ». RSS and FOAF are « applications » of RDF. RDF is used in browsers such as Mozilla. And RDF is used to describe metadata in the Creative Commons, relevant to « open data ». So why shut it off on its own? I’d far rather see ideas from semantic web spread throughout the conference.

vdV: In your blog, you’ve defended the choice of the tagline « Building Web 2.0 » quoting Paul Graham and saying that the Web 2.0 is a handy label for « The Web as it was meant to be used ». Why have you not chosen « Building the web as it was meant to be » as a tagline, then?

Edd: Because we decided on the tagline earlier! I’ll save « the web as it was meant to be » for next year :)

vdV: What struck me with this definition is that XML, Web Services and the Semantic Web are also attempts to build the Web as it was meant to be. What’s different with the Web 2.0?

Isn’t « building the web as it was meant to be » an impossible quest and why should the Web 2.0 be more successful than the previous attempts?

Edd: I’ll answer both these together. I think the « Web 2.0 » name includes and builds on XML, Web Services and Semantic Web. But it also brings in the attitude of data sharing, community and the read/write web. Together, those things connote the web as it was intended by Berners-Lee: a two-way medium for both computers and humans.

Rather than an « attempt », I think « Web 2.0 » is a description of the latest evolution of web technologies. But I think it’s an important one, as we’re seeing a change in the notions of what makes a useful web service, and a validation of the core ideas of the web (such as REST) which the rush to make profit in « Web 1.0 » ignored.

vdV: In your blog, you said that you’re « particularly interested in getting more in about databases, frameworks like Ruby on Rails, tagging and search ». By databases, do you mean XML databases? Can you explain why you find these points particularly interesting?

Edd: I mean all databases. Databases are now core to most web applications and many web sites. They’re growing features to directly support web and XML applications, whether they’re true « XML databases » or not. A little bit of extra knowledge about the database side of things can make a great difference when creating your application.

XTech is a forum for web and XML developers, the vast majority of whom will use a database as part of their systems. Therefore, we should have the database developers and vendors there to talk as well.

vdV: One of the good things last year was the wireless coverage. Will there be one this year too?

Edd: Absolutely.

vdV: What is your worst memory of XTech 2005?

Edd: I don’t remember bad things :)

vdV: What is your best memory of XTech 2005?

Edd: For me, getting so many of the Mozilla developers out there (I think there were around 25+ Mozilla folk in all). Their participation really got the browser track off to a great start.

TreeBind, Data binding and Design Patterns

I have released a new version of my Java data binding framework, TreeBind, and I feel I need to explain why I am so excited by this API and by other lightweight binding APIs…

To make it short, to me these APIs are the latest episode of a complete paradigm shift in the relation between code and data.

This relationship has always been ambiguous because we are searching for a balance between conflicting considerations:

  • We’d like to keep data separated because history has taught us that legacy data is more important than programs and that data needs to survive several generations of programs.
  • On the other hand, object orientation is about mixing programs and data.

The Strategy Pattern is about favouring composition over inheritance: basically, you create classes for behaviours and these behaviours become object properties.

This design pattern becomes even more powerful when you use a data binding API such as TreeBind, since you gain the ability to directly express the behaviours as XML or RDF.

I have used this ability recently on at least two occasions.

The first one is in RDF, to implement the RDF/XML Query By Example language that I presented at Extreme Markup Languages this summer.

RDF resources in a query such as:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://xml.insee.intra/schema/annuaire/"
  xmlns:q="http://xml.insee.intra/schema/qbe/">
    <q:select>
        <q:where>
            <inseePerson>
                <mail>
                    <q:conditions>
                        <q:ends-with>@insee.fr</q:ends-with>
                    </q:conditions>
                </mail>
            </inseePerson>
        </q:where>
    </q:select>
</rdf:RDF>

are bound to Java classes (in this case, a class « Select », a generic class for other resources for « InseePerson » and a class « Conditions ») and these classes can be considered as behaviours.

The second project in which I have been using this ability is a list manager which I am writing to run my mailing lists.

This list manager is designed as a set of behaviours to apply to incoming messages.

Instead of providing a set of rigid parameters to define the list configuration, I have decided to expose the behaviours themselves through TreeBind.

The result is incredibly flexible:

<?xml version="1.0" encoding="UTF-8"?>
<listManager>
    <server>localhost</server>
    <storeType>imap</storeType>
    <user>listmanager</user>
    <password>azerty</password>
    <port>143</port>
    <folderManager>
        <folder>user.list</folder>
        <messageHandler>
            <ifIsRecipient>list@example.com</ifIsRecipient>
            <messageHandler>
                <ifSpamLevelIs>spam</ifSpamLevelIs>
                <moveTo>moderated.spam</moveTo>
            </messageHandler>
            <messageHandler>
                <ifSpamLevelIs>unsure</ifSpamLevelIs>
                <moveTo>moderated.unsure</moveTo>
            </messageHandler>
            <sendToList>
                <subjectPrefix>[the XML Guild]</subjectPrefix>
                <footer>
--
Yet another mailing list manager!
></footer>
                <recipient>vdv@dyomedea.com</recipient>
                <envelopeFrom>list-bounce@example.com</envelopeFrom>
                <header name="Precedence">List</header>
                <header name="List-Id">&lt;list.example.com></header>
                <header name="List-Post">&lt;mailto:list@example.com></header>
                <server>localhost</server>
                <user>listmanager</user>
                <archive>archive</archive>
            </sendToList>
            <moveTo>done</moveTo>
        </messageHandler>
        <messageHandler>
            <moveTo>unparsed</moveTo>
        </messageHandler>
    </folderManager>
</listManager>

The whole behaviour of the list manager is exposed in this XML document and the Java classes corresponding to each element are no more than the code that implements this behaviour.

Unless you prefer to see it the other way round and consider that the XML document is the extraction of the data from their classes…
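
To illustrate the idea of binding each element to a behaviour, here is a minimal Python sketch. It is not TreeBind and not the actual list manager: the semantics are assumptions made for the example (a condition that fails makes its handler decline the message, a nested handler that accepts the message ends the processing, and `moveTo` is merely recorded), and the names `CONDITIONS`, `handle` and `run` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A reduced version of the configuration shown above.
CONFIG = """
<messageHandler>
    <ifIsRecipient>list@example.com</ifIsRecipient>
    <messageHandler>
        <ifSpamLevelIs>spam</ifSpamLevelIs>
        <moveTo>moderated.spam</moveTo>
    </messageHandler>
    <moveTo>done</moveTo>
</messageHandler>
"""

# Each condition element is bound to one small behaviour (assumed semantics).
CONDITIONS = {
    "ifIsRecipient": lambda message, value: message["to"] == value,
    "ifSpamLevelIs": lambda message, value: message["spam"] == value,
}


def handle(handler, message, actions):
    """Apply one <messageHandler>; return True if it took the message."""
    for child in handler:
        if child.tag in CONDITIONS:
            if not CONDITIONS[child.tag](message, child.text):
                return False  # condition not met: this handler declines
        elif child.tag == "messageHandler":
            if handle(child, message, actions):
                return True  # a nested handler dealt with the message
        elif child.tag == "moveTo":
            actions.append(f"moveTo {child.text}")  # record instead of moving
    return True


def run(message):
    """Return the list of actions the configuration triggers for a message."""
    actions = []
    handle(ET.fromstring(CONFIG), message, actions)
    return actions


print(run({"to": "list@example.com", "spam": "spam"}))  # ['moveTo moderated.spam']
print(run({"to": "list@example.com", "spam": "ham"}))   # ['moveTo done']
print(run({"to": "other@example.com", "spam": "ham"}))  # []
```

Changing the behaviour of the list then means editing the XML document rather than the code, which is the point made above.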

Non content based antispam sucks

My provider has recently changed the IP address of one of my servers and my logs are flooded with messages such as:

Dec  7 08:21:57 gwnormandy postfix/smtp[22362]: connect to mx00.schlund.de[212.227.15.134]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)
Dec  7 08:21:57 gwnormandy postfix/smtp[22339]: connect to mx01.schlund.de[212.227.15.150]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)
Dec  7 08:21:57 gwnormandy postfix/smtp[22334]: connect to mx01.kundenserver.de[212.227.15.150]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)
Dec  7 08:21:57 gwnormandy postfix/smtp[22414]: connect to mx00.1and1.com[217.160.230.12]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)

Of course, I am trying to get this solved by sorbs.net (in that case, that should be possible since this is a fixed IP) but that incident reminds me why I think that we shouldn’t use « technical » or « non content based » antispam even if it happens to be effective.

The basic idea of most if not all antispam software is to distinguish between what looks like spam and what looks like a normal message.

To implement this, we’ve got three main types of implementations that can be combined:

  • Content based algorithms look at the content of the messages and use statistical methods to distinguish between « spam » and « ham » (non spam).
  • List based algorithms work with white and black lists to allow or deny mails, usually based on the address of the mail’s sender.
  • Technical based algorithms look at the mail headers to reject the most common practices used by spammers.
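
As an illustration of the first category, here is a minimal sketch of a content based classifier in Python: a naive Bayes score with Laplace smoothing over whitespace separated words. This is a toy under stated assumptions, not how any particular antispam product works; the function names and the training messages are made up for the example.

```python
import math
from collections import Counter


def train(messages):
    """Count word occurrences per label from (label, text) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals


def spam_score(text, counts, totals):
    """Return the log-odds that `text` is spam; positive means 'looks like spam'."""
    vocabulary = set(counts["spam"]) | set(counts["ham"])
    score = math.log(totals["spam"] / totals["ham"])  # prior from the training set
    for word in text.lower().split():
        for label, sign in (("spam", 1), ("ham", -1)):
            # Laplace smoothing so that unseen words do not zero the product.
            probability = (counts[label][word] + 1) / (
                sum(counts[label].values()) + len(vocabulary)
            )
            score += sign * math.log(probability)
    return score


TRAINING = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches"),
    ("ham", "schema validation with relax ng"),
    ("ham", "meeting about the xml schema"),
]

counts, totals = train(TRAINING)
print(spam_score("buy cheap pills", counts, totals) > 0)     # True
print(spam_score("xml schema meeting", counts, totals) > 0)  # False
```

The decision depends only on what the message says, not on where it comes from, which is the property discussed in the rest of this post.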

The problem with these technical algorithms is that the practices commonly used by spammers do not always violate standards, and are not even always practices that should be considered bad!

Let’s take the case of the sorbs.net database that identifies dynamic IP addresses.

I would argue that sending a mail from a dynamic IP address is a good practice and that asking people to use their ISP’s mail servers when they don’t want to is a bad practice.

I personally consider that my mail is too important and sensitive to be outsourced to my ISP!

That’s the case when I am at home: I prefer to set up my own SMTP servers to take care of delivering my mails rather than use the SMTP servers of my ISP.

When I am using my servers, I know from my logs if and when the SMTP server of each recipient receives and queues the mails I am sending.

Also, I want to be able to manage mailing lists without having to ask anyone.

And that’s even more the case when I am travelling and using an occasional ISP that I barely know and don’t know if I can trust.

We use lots of these ISPs when we are connected to Wi-Fi hotspots and here again, I much prefer to send my mails from the SMTP server that runs on my laptop than from an unknown ISP.

Furthermore, that means that I don’t have to change the configuration of my mailer.

Content based antispam also has its flaws (it needs training and is very ineffective with mails containing only pictures) but it doesn’t produce the kind of false positives that technical based antispam does when it rejects my mails because I send them from dynamic IP addresses.

That’s the reason why I have uninstalled SpamAssassin and replaced it with SpamBayes on my own systems.

Now, the thing that really puzzles me with antispam is that we have the technical solution that could eradicate spam from the web and that we just seem to ignore it.

If everyone signed their mails with a PGP key, I could reject (or moderate) all the emails which are not signed.

Spammers would have to choose between signing their mails and being identified (meaning they could be sued) or not signing them and getting their mails trashed.
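
Such a policy could start with a very simple triage. The sketch below uses Python’s standard `email` package and only detects the PGP/MIME structure (`multipart/signed` with the `application/pgp-signature` protocol); it does not verify any signature, it ignores inline (clearsigned) messages, and the `triage` function and its return values are hypothetical.

```python
from email import message_from_string


def triage(raw_message):
    """Return 'verify-signature' for PGP/MIME signed mail, else 'moderate'.

    Only the MIME structure is inspected; checking the signature itself
    would require an OpenPGP implementation and the sender's public key.
    """
    message = message_from_string(raw_message)
    is_pgp_mime = (
        message.get_content_type() == "multipart/signed"
        and message.get_param("protocol") == "application/pgp-signature"
    )
    return "verify-signature" if is_pgp_mime else "moderate"


SIGNED = (
    "From: someone@example.com\n"
    'Content-Type: multipart/signed; protocol="application/pgp-signature";\n'
    ' micalg=pgp-sha1; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b\n"
    "Content-Type: application/pgp-signature\n"
    "\n"
    "(signature data would go here)\n"
    "--b--\n"
)

UNSIGNED = "From: someone@example.com\nContent-Type: text/plain\n\nhello\n"

print(triage(SIGNED))    # verify-signature
print(triage(UNSIGNED))  # moderate
```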

Now, the problem is that because so few people sign their mails, I can’t afford to ignore unsigned mails, and because PGP signatures are not handled correctly by many mailers and mailing list servers, most people (including me) don’t sign their mails.

The question is: why doesn’t that change? Is this just a matter of habit? Or is the community as a whole just not motivated to shut the spam down?