Normalizing Excel’s SpreadsheetML using XSLT – Part 2

As one of the comments reported, there was a bug in the XSLT transformation that « normalizes » Excel’s SpreadsheetML documents, which I had published in a previous post.

I have fixed this bug and the new version is:

<?xml version="1.0"?>
<!--

Adapted from http://ewbi.blogs.com/develops/2004/12/normalize_excel.html

This product may incorporate intellectual property owned by Microsoft Corporation. The terms
and conditions upon which Microsoft is licensing such intellectual property may be found at
http://msdn.microsoft.com/library/en-us/odcXMLRef/html/odcXMLRefLegalNotice.asp.
-->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:schemas-microsoft-com:office:spreadsheet"
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <xsl:output method="xml" indent="no" encoding="UTF-8"/>
    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="normalize"/>
    </xsl:template>
    <xsl:template match="@*|node()" mode="normalize">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="normalize"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="ss:Cell/@ss:Index" mode="normalize"/>
    <xsl:template match="ss:Cell" name="copy" mode="normalize">
        <xsl:copy>
            <xsl:apply-templates select="@*" mode="normalize"/>
            <xsl:variable name="prevCells" select="preceding-sibling::ss:Cell"/>
            <xsl:variable name="nbPrecedingIndexes"
                select="count(preceding-sibling::ss:Cell[@ss:Index])"/>
            <xsl:attribute name="ss:Index">
                <xsl:choose>
                    <xsl:when test="@ss:Index">
                        <xsl:value-of select="@ss:Index"/>
                    </xsl:when>
                    <xsl:when test="count($prevCells) = 0">
                        <xsl:value-of select="1"/>
                    </xsl:when>
                    <xsl:when test="$nbPrecedingIndexes > 0">
                        <xsl:variable name="precedingCellsSinceLastIndex"
                            select="preceding-sibling::ss:Cell[count(preceding-sibling::ss:Cell[@ss:Index]|self::ss:Cell[@ss:Index]) = $nbPrecedingIndexes]"/>
                        <xsl:value-of
                            select="preceding-sibling::ss:Cell[@ss:Index][1]/@ss:Index +
                            count($precedingCellsSinceLastIndex)
                            + sum($precedingCellsSinceLastIndex/@ss:MergeAcross)
                            - count ($precedingCellsSinceLastIndex[@ss:MergeAcross])"
                        />
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:value-of
                            select="count($prevCells) + 1 +
                            sum($prevCells/@ss:MergeAcross) -count($prevCells/@ss:MergeAcross)"
                        />
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:attribute>
            <xsl:apply-templates select="node()" mode="normalize"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
            

I have also written the following set of tests (using XSLTUnit):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl" xmlns:xsltu="http://xsltunit.org/0/"
  exclude-result-prefixes="exsl">
  <xsl:import href="excelNormalize.xsl"/>
  <xsl:import href="xsltunit.xsl"/>
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <xsltu:tests>
      <xsltu:test id="noIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell>A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">noIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="1">A</ss:Cell>
              <ss:Cell ss:Index="2">B</ss:Cell>
              <ss:Cell ss:Index="3">C</ss:Cell>
              <ss:Cell ss:Index="4">D</ss:Cell>
              <ss:Cell ss:Index="5">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="withIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="1">A</ss:Cell>
            <ss:Cell ss:Index="2">B</ss:Cell>
            <ss:Cell ss:Index="3">C</ss:Cell>
            <ss:Cell ss:Index="4">D</ss:Cell>
            <ss:Cell ss:Index="5">E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">withIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2" select="$input"/>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="firstIndex">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="5">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">firstIndex</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="5">A</ss:Cell>
              <ss:Cell ss:Index="6">B</ss:Cell>
              <ss:Cell ss:Index="7">C</ss:Cell>
              <ss:Cell ss:Index="8">D</ss:Cell>
              <ss:Cell ss:Index="9">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="altIndexes">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="2">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell ss:Index="5">C</ss:Cell>
            <ss:Cell>D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">altIndexes</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="2">A</ss:Cell>
              <ss:Cell ss:Index="3">B</ss:Cell>
              <ss:Cell ss:Index="5">C</ss:Cell>
              <ss:Cell ss:Index="6">D</ss:Cell>
              <ss:Cell ss:Index="7">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="noIndexesMergeAcross">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell>A</ss:Cell>
            <ss:Cell ss:MergeAcross="2">B</ss:Cell>
            <ss:Cell>C</ss:Cell>
            <ss:Cell ss:MergeAcross="3">D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">noIndexesMergeAcross</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="1">A</ss:Cell>
              <ss:Cell ss:MergeAcross="2" ss:Index="2">B</ss:Cell>
              <ss:Cell ss:Index="4">C</ss:Cell>
              <ss:Cell ss:MergeAcross="3" ss:Index="5">D</ss:Cell>
              <ss:Cell ss:Index="8">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
      <xsltu:test id="withIndexesMergeAcross">
        <xsl:variable name="input">
          <ss:Row>
            <ss:Cell ss:Index="5" ss:MergeAcross="2">A</ss:Cell>
            <ss:Cell>B</ss:Cell>
            <ss:Cell ss:Index="10">C</ss:Cell>
            <ss:Cell ss:MergeAcross="3">D</ss:Cell>
            <ss:Cell>E</ss:Cell>
          </ss:Row>
        </xsl:variable>
        <xsl:call-template name="xsltu:assertEqual">
          <xsl:with-param name="id">withIndexesMergeAcross</xsl:with-param>
          <xsl:with-param name="nodes1">
            <xsl:apply-templates select="exsl:node-set($input)/ss:Row" mode="normalize"/>
          </xsl:with-param>
          <xsl:with-param name="nodes2">
            <ss:Row>
              <ss:Cell ss:Index="5" ss:MergeAcross="2">A</ss:Cell>
              <ss:Cell ss:Index="7">B</ss:Cell>
              <ss:Cell ss:Index="10">C</ss:Cell>
              <ss:Cell ss:MergeAcross="3" ss:Index="11">D</ss:Cell>
              <ss:Cell ss:Index="14">E</ss:Cell>
            </ss:Row>
          </xsl:with-param>
        </xsl:call-template>
      </xsltu:test>
    </xsltu:tests>
  </xsl:template>
</xsl:stylesheet>
            

These tests should also help you understand what this transformation is doing.
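Once every cell carries its own ss:Index, a cell can be addressed by column number. Here is a minimal sketch of a two-pass stylesheet built on that idea. It assumes that the transformation above is saved as excelNormalize.xsl (the name used by the tests), that the processor supports the EXSLT node-set() function, and that cell values sit in ss:Data children as they do in real workbooks. The « column » parameter is a hypothetical name introduced for the illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="ss exsl">
    <xsl:import href="excelNormalize.xsl"/>
    <xsl:output method="text" encoding="UTF-8"/>
    <!-- Hypothetical parameter: the number of the column to extract -->
    <xsl:param name="column" select="3"/>
    <xsl:template match="/">
        <!-- First pass: normalize so that every cell carries its ss:Index -->
        <xsl:variable name="normalized">
            <xsl:apply-templates select="node()" mode="normalize"/>
        </xsl:variable>
        <!-- Second pass: one line of text per row, holding the value
             found in the requested column (empty if there is none) -->
        <xsl:for-each select="exsl:node-set($normalized)//ss:Row">
            <xsl:value-of select="ss:Cell[@ss:Index = $column]/ss:Data"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Because this stylesheet imports the normalization, its own template for the document root takes precedence over the imported one, which is the same arrangement as in the test suite.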

Thanks, and please continue to report bugs and feature requests as comments.

The influence of microformats on style-free stylesheets

It’s been a while, almost six years, since I wrote my Style-free XSLT Style Sheets piece for XML.com, but this simple technique remains one of my favorites.

It was not only my first article published on XML.com but also the subject of my first talk at an IDEAlliance XML conference, and it’s fair to say that it has been instrumental in launching my career as an « international XML guru ».

Despite all that, this technique remains my favorite because of its efficiency. I use it over and over, to generate (X)HTML but also many other XML vocabularies, as different as OpenOffice documents and W3C XML Schemas. The more complex the vocabulary to generate, the more reasons you have to keep it outside your XSLT transformations and the more efficient style-free stylesheets are.

Style-free stylesheets have become a reflex for me, and it was without even thinking about it that I wrote a style-free stylesheet to power the web site of our upcoming Web 2.0 book.

In my antique XML.com paper, I used specific, non-XHTML elements:

        <td width="75%" bgcolor="Aqua">
            <insert-body/>
        </td>

That works fine, but your layout documents are no longer valid XHTML and they don’t display like the target documents in a browser.

Why not follow the microformats approach and use regular XHTML elements with specific class attributes instead:

        <div id="planet">
            <h1>Planet Web 2.0 the book</h1>
            <p>Aggregated content relevant to this book.</p>
            <div class="fromRss"/>
             .../...
        </div>           

In this case, the XSLT transformation replaces the content of any element whose class attribute contains the token « fromRss » with the formatted output of the RSS feed. This has the additional benefit that I can leave mock-up content to make the layout look like a final document:

<div id="planet">
            <h1>Planet Web 2.0 the book</h1>
            <p>Aggregated content relevant to this book.</p>
            <div class="fromRss">
                <ul>
                    <li>
                        <div>
                            <h2>
                                <a
                                    href="http://www.orbeon.com/blog/2006/06/02/about-json-and-poor-marketing-strategies/"
                                    title="XForms Everywhere » About JSON and poor marketing strategies"
                                    >XForms Everywhere » About JSON and poor marketing
                                strategies</a>
                            </h2>
                        </div>
                    </li>
                </ul>
            </div>
            <p>
                <a href="http://del.icio.us/rss/tag/web2.0thebook" title="RSS feed (on del.icio.us)">
                    <img src="feed-icon-24x24.png" alt="RSS feed"/>
                </a> (on <a href="http://del.icio.us/" title="del.icio.us">del.icio.us</a>)</p>
        </div>
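The replacement step described above can be sketched as follows. This is a minimal sketch and not the stylesheet that powers the site. It assumes that the layout document is in the XHTML namespace and that the feed is RSS 2.0, with un-namespaced rss/channel/item elements. The « feed » parameter and its default value are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">
    <!-- Hypothetical parameter: location of the RSS 2.0 feed to format,
         resolved relative to this stylesheet -->
    <xsl:param name="feed" select="'feed.rss'"/>
    <xsl:output method="xml" indent="no" encoding="UTF-8"/>
    <!-- Identity template: the layout document is copied as is -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- Any element whose class attribute contains the token "fromRss":
         keep the element and its attributes, drop the mock-up content
         and insert the formatted feed instead -->
    <xsl:template
        match="*[contains(concat(' ', normalize-space(@class), ' '), ' fromRss ')]">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <ul>
                <xsl:apply-templates select="document($feed)/rss/channel/item"
                    mode="rss"/>
            </ul>
        </xsl:copy>
    </xsl:template>
    <!-- One list item per RSS item, shaped like the mock-up content -->
    <xsl:template match="item" mode="rss">
        <li>
            <div>
                <h2>
                    <a href="{link}" title="{title}">
                        <xsl:value-of select="title"/>
                    </a>
                </h2>
            </div>
        </li>
    </xsl:template>
</xsl:stylesheet>
```

The stylesheet is applied to the layout document itself. Everything that is not tagged with « fromRss » goes through the identity template untouched, so the layout stays out of the transformation.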

What I like about simple ideas is that they always leave room for reuse and improvements (complex ideas, on the other hand, seem to only leave room for more complexity).

Web 2.0 the book

One of the reasons I have been too busy to blog these days is the project to write a comprehensive book about Web 2.0 technologies.

If Web 2.0 is about using the web as a platform, this platform is far from being homogeneous. On the contrary, it is made of a number of very different pieces of technology, from CSS to web server configuration through XML, Javascript, server side programming, HTML, …

I believe that integrating these technologies is one of the main challenges of Web 2.0 developers and I am always surprised if not frightened to see that people tend to get more and more specialized. Too many CSS gurus do not know the first thing about XML, too many XML gurus don’t know how to spell HTTP, too many Java programmers don’t want to know Javascript. And, no, knowing everything about Ajax isn’t enough to write a Web 2.0 application.

In defense of these hyper-specialists, I have also found that most of the available resources, both online and in print, are even more heavily specialized than their authors and that even if you could read a book on each of these technologies you’d find it difficult to get the big picture and understand how they can be used together.

The goal of this book is to fill the gap and be a useful resource for all the Web 2.0 developers who do not want to stay confined to their highly specialized domain, as well as for project managers who need to grasp the Web 2.0 big picture.

This is an ambitious project that I started working on in December 2005.

The first phase was to define the book outline, with the helpful contribution of many friends.

The second was to find a publisher. O’Reilly, the publisher of my two previous books, also happens to be one of the co-inventors of the term « Web 2.0 », and that makes them very nervous about Web 2.0 book projects.

Jim Minatel from Wiley was immediately convinced by the outline, and the book will be published in the Wrox Professional Series.

I had initially planned to write the book all by myself, but it would have taken me at least a year to complete and Jim wasn’t keen on the idea of waiting until 2007 to get this book in print.

The third step has been to find the team to write the book and the lucky authors are:

Micah Dubinko is tech editing the book and Sara Shlaer is our Development Editor.

We then had to split the work between authors. The exercise was easier than expected. Being in a position to arbitrate the choice, I found it fair to pick the chapters left over by the other authors, and this leaves me with chapters that will require a lot of research. This is fine since I like learning new things when I write, but it also means more hard work.

This is my first co-authored book and I think that one of the challenges of such books is to keep the whole content coherent. This is especially true for a book whose goal is to give « the big picture » and to explain how different technologies play together.

To facilitate the communication between authors, I have set up a series of internal resources (wiki, mailing list, subversion repository). It’s still too early to say if that will really help but the first results are encouraging.

More recently, I have also set up a public site (http://web2.0thebook.org/) that presents the book and aggregates relevant content. I hope that all these resources will help us to feel and act as a team rather than a set of individual authors.

The « real » work has finally started and we now have the first versions of our first chapters progressing within the Wiley review system.

It’s interesting to see how processes and rules differ from one publisher to another. To me, a book was a book and I hadn’t anticipated so many differences, not only in the tools being used but also in style guidelines.

The first chapter I have written is about Web Services and that’s been a good opportunity to revisit the analysis I had done in 2004 for the ZDNet Web Services Convention [papers (in French)].

From a Web 2.0 developer’s perspective, I think that the main point is to publish Web Services that are perfectly integrated in the Web architecture, and that means being as RESTful as possible.

I have been happy to see that WSDL 2.0 appears to be making some progress in its support of REST services, even though it’s not perfect yet. I have posted a mail with some of my findings to the Web Services Description Working Group comment list and they have split these comments into three issues on their official issue list ([CR052] [CR053] [CR054]).

I hope they can take these issues into account, even if that means updating my chapter!

Some resources I have found most helpful while I was writing this chapter are:

It’s been fun so far and I look forward to seeing this book « for real ».

Validating microformats

This blog entry follows up on Norm Walsh’s essay on the same subject.

The first thing I’d like to react to isn’t the claim that RELAX NG isn’t suitable for this task, but the reason given for it.

Norm says that « there’s just no way to express a pattern that matches an attribute that contains some token » and this assertion isn’t true.

Let’s take the same hReview sample and see what happens when we try to define a RELAX NG schema:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Review</title>
    </head>
    <body>
        <div class="hreview">
            <span><span class="rating">5</span> out of 5 stars</span>
            <h4 class="summary">Crepes on Cole is awesome</h4>
            <span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
                <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>
            <div class="description item vcard"><p>
                <span class="fn org">Crepes on Cole</span> is one of the best little
                creperies in <span class="adr"><span class="locality">San Francisco</span></span>.
                Excellent food and service. Plenty of tables in a variety of sizes
                for parties large and small.  Window seating makes for excellent
                people watching to/from the N-Judah which stops right outside.
                I've had many fun social gatherings here, as well as gotten
                plenty of work done thanks to neighborhood WiFi.
            </p></div>
            <p>Visit date: <span>April 2005</span></p>
            <p>Food eaten: <span>Florentine crepe</span></p>
        </div>
    </body>
</html>

To define an element whose « class » attribute is « type », we would write:

element * {
    attribute class { "type" }
    .../...
}

To define an element whose « class » attribute contains the token « type », we follow the same principle and use a W3C XML Schema pattern facet:

element * {
    attribute class {
        xsd:token { pattern = "(.+\s)?type(\s.+)?" }
    }
}

The regular expression says that we want class attributes made of an optional sequence of any characters followed by a whitespace character, then the token « type », then an optional whitespace character followed by any characters.

Since a pattern facet has to match the whole value, it correctly catches values such as « type », « foo type », « foo type bar », « type bar » and rejects values such as « anytype ».

The next tricky thing to express to validate microformats is that you want to allow an element at any level of depth.

For instance, if you’re expecting a « type » tag, you’ll accept:

<span class="type">foo</span>

But also:

<div>
   <p>Type: <span class="type">foo</span></p>
</div>

To do so with RELAX NG, you’ll recursively say that you want either a tag « type » or any other element including a tag « type ».

The « any other element » will have to include an optional « class » attribute whose value doesn’t contain the token « type », but even that isn’t an issue with RELAX NG and the definition could be along these lines:

hreview.type =
    element * {
        anyOtherAttribute,
        mixed {
            (attribute class {
                 xsd:token { pattern = "(.+\s)?type(\s.+)?" }
             },
             anyElement)
            | (attribute class {
                   xsd:token - xsd:token { pattern = "(.+\s)?type(\s.+)?" }
               }?,
               hreview.type)
        }
}

This looks complex and quite ugly, but we wouldn’t have to write such schemas by hand. I like Norm’s idea of writing a simple RELAX NG schema where classes are replaced by element names, and the definition above was generated by an XSLT transformation from his own definition, which is:

hreview.type = element type { text }

So far, so good. Let’s see where the real blockers are.

The first thing which is quite ugly to validate is the flexibility that allows siblings to be nested.

In the hReview schema, « reviewer » and « dtreviewed » are defined as siblings:

hreview.hreview =
  element hreview {
    text
    & hreview.version?
    & hreview.summary?
    & hreview.type?
    & hreview.item
    & hreview.reviewer?
    & hreview.dtreviewed?
    & hreview.rating?
    & hreview.description?
}

In an XML document, we would expect to see them at the same level, as direct children of the « hreview » element.

In the microformats world, this can be the case, but one can also be a descendant of the other, which is the case in our example:

<span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
<abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>

To express that, we would have to say that the content of « hreview » is one of the many combinations in which each pair of sub-elements are either siblings or descendants of one another.

I haven’t tried to see if that would be feasible (we’ll see that there is another blocker that makes the question academic) but that would be a real mess to generate.

The second and probably most important blocker is the restriction on interleave: as stated in my RELAX NG book, « Elements combined through interleave must not overlap between name classes. »

This restriction is hitting us hard here since our name classes do overlap and we are combining the different sub patterns through interleave (see the definition of hreview.hreview above if you’re not convinced).

There are very few workarounds for this restriction:

  • Replacing interleave with an ordered group isn’t an option: microformats are about flexibility, and imposing an order between the sub-components is most probably out of the question.
  • Replacing interleave with a « zeroOrMore/choice » combination means that we would lose any control over the number of occurrences of each sub-component (we could get ten ratings and no items), and this control is one of the few things that this validation catches!

To me, this restriction is the real blocker and means that it isn’t practical to use RELAX NG to validate microformat instances directly.

Of course, we can transform these instances into plain XML as shown by Norm Walsh, but I don’t like this solution very much, for a reason he hasn’t mentioned: the errors raised by such a validation would refer to a context within the transformed document. That would be tough for users to understand, and linking this context back to the original document could be complex.

As an alternative, let’s see what we could do with Schematron.

To set a rule context to a specific tag, we can write:

<rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">

We are no longer working with datatypes and need to apply the normalization by hand (hence the use of « normalize-space() »). On the other hand, we can freely use functions: by adding a leading and a trailing space, we make sure that the « hreview » token is matched if and only if the result of this manipulation contains the token preceded and followed by a space.

Within this context, we can check the number of occurrences of each sub pattern using more or less the same principle:

      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
     </rule>

Note that the use of the descendant axis (« // ») means that we are treating correctly cases where siblings are embedded.

Norm Walsh mentions that this can be tedious to write and that you need to define tests for what is allowed and also for what is forbidden.

That’s perfectly right but here again, you don’t have to write this schema by hand, and I have written an XSLT transformation that transforms his RELAX NG schema into the following Schematron schema:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
   <pattern name="hreview.hreview">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]) &gt; 1">A  "version" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]) &gt; 1">A  "summary" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]) &gt; 1">A  "type" tag is duplicated.</report>
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">A mandatory "item" tag is missing.</assert>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]) &gt; 1">A  "item" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]) &gt; 1">A  "reviewer" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]) &gt; 1">A  "dtreviewed" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]) &gt; 1">A  "rating" tag is duplicated.</report>
         <report test="count(.//*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]) &gt; 1">A  "description" tag is duplicated.</report>
      </rule>
   </pattern>
   <pattern name="hreview.version">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' version ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">version not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.summary">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">summary not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.type">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' type ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">type not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.item">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">item not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.fn">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' fn ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">fn not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.url">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' url ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">url not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.photo">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' photo ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">photo not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.reviewer">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' reviewer ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">reviewer not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.dtreviewed">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' dtreviewed ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">dtreviewed not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.rating">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">rating not allowed here.</assert>
      </rule>
   </pattern>
   <pattern name="hreview.description">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' description ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">description not allowed here.</assert>
      </rule>
   </pattern>
</schema>

A couple of notes on this schema:

  • A class attribute can contain several tokens and a single element can match several rules. Since Schematron checks only the first matching rule in each pattern, each definition is in its own pattern.
  • In this example, I have added a test that each tag is found within the context where it is expected. This test reports an error on the sample at the first occurrence of « fn » because this occurrence belongs to another microformat (vCard) which is combined with hReview in this example. It should be possible to switch this test off, and that could be done using Schematron phases.
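
As a rough sketch of how phases could switch the context tests off, the fragment below assumes Schematron 1.5 syntax (the `http://www.ascc.net/xml/schematron` namespace), in which each phase lists the patterns that are active. The pattern ids, the phase names and the `hreview.content` pattern are hypothetical: they only illustrate the mechanism and are not taken from the schema above.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
   <!-- Hypothetical phase: content checks only, context tests switched off -->
   <phase id="content-only">
      <active pattern="p.hreview.content"/>
   </phase>
   <!-- Hypothetical phase: content checks plus the context tests -->
   <phase id="strict">
      <active pattern="p.hreview.content"/>
      <active pattern="p.fn.context"/>
   </phase>
   <!-- Illustrative content check, kept in its own pattern -->
   <pattern name="hreview.content" id="p.hreview.content">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' hreview ')]">
         <assert test=".//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">hreview should contain an item.</assert>
      </rule>
   </pattern>
   <!-- Context test written in the same style as the patterns above -->
   <pattern name="fn.context" id="p.fn.context">
      <rule context="*[contains(concat(' ', normalize-space(@class), ' '), ' fn ')]">
         <assert test="ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]">fn not allowed here.</assert>
      </rule>
   </pattern>
</schema>
```

Validating with the `content-only` phase would skip the « fn » context test, so an embedded vCard would no longer be reported. How a phase is selected depends on the Schematron implementation being used.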

Apart from that, I think that this could become a very practical solution. The idea would thus be:

  • Define a schema for a microformat using RELAX NG to describe its logical structure. This would probably lead to defining a language subset and conventions to convey information such as « which attribute is used » and would become a kind of « microschema ».
  • Transform this microschema into a Schematron schema.
  • Use this schema to validate instance documents.

What I find interesting is that the same RELAX NG microschema could be used, as shown by Norm Walsh, to feed a transformation applied to instance documents before validation, or be transformed into a schema that validates the instance documents directly. I am pretty sure that these schemas could have many other uses.

First visit of 2006

The weather was fine this weekend and the landing boards of the hives were as busy as we had not seen them since October…

The bees enjoy the rhododendron flowers!

But it is on the goat willows that we found most of them. The buzzing that rises from the willows as we approach is impressive.

They gobble up a few bees, but we are very happy to see that the swallows are back!

Sadly, this first visit showed us the damage caused by a winter that was harsher, and above all much longer, than usual (we had snow from late November, which is exceptional in Normandy).

Of the seven colonies we had, only two still have some honey reserves. Of the others, one has died of cold and hunger, two are very weak with a single frame showing a little brood, and the last two are somewhat stronger although they have no reserves left at all.

We try to avoid feeding our hives as much as possible, but since the forecast announced a cold spell, we chose to give a little syrup to the hives that have no reserves left, to give them a better chance of getting through what can be a difficult stage: the queens have started laying again, the populations have started growing again, and all of this requires food that the bees will not be able to fetch if it is too cold.

It was not very warm and we did not take the time to take many photos, but we could not resist this bee emerging from its cell (in the center, click on the photo to enlarge it).

Spring 2006 is slow in coming

The plum tree buds are swelling but not yet bursting.

The goat willow catkins are only just starting to open.

The winter has been long and harsh and spring is more than two weeks behind last year!

Yesterday it was mild and, despite the wind and the damp, the bees were active on the landing boards, where they were coming back loaded with white pollen.

If the plum and goat willow blooms, which are the first major blooms in our region, have not really started yet, what are they foraging on?

We found large numbers of them collecting white pollen on the small flowers (about ten millimeters) of the germander speedwell.

The size of the germander speedwell flowers forces them to change flowers constantly, staying only a few seconds on each one.

I also found one on a Christmas rose (black hellebore).

The harvest is probably more anecdotal (we only have a few hellebore plants) but much more comfortable in this large flower, where the bee can afford to take its time and have a real pollen bath.

As hellebore is highly toxic, it is probably better that we do not offer too much of it to our boarders!

TreeBind is about making Java as agile as it can be

TreeBind seems to be getting more visible:

Last week I attended several SD West sessions that gave me interesting ideas for TreeBind:

  • Under the rather misleading title « Enterprise Java for Elvis », Cay Horstmann presented some good stuff coming with EJB 3.0. I was impressed that POJOs (Plain Old Java Objects) can now be used and that most of the persistence configuration can be done through Java annotations. That’s something we could use within TreeBind: annotations are available through reflection and could be used to convey serialization information such as the relative order of sub-elements and whether a property should be written as an element or an attribute.
  • Allen Holub gave his very enlightening presentation « Everything You Know is Wrong: Extends and Get/Set Methods are Evil », in which he explains why classes should expose behaviors rather than properties. When you think about it, that seems obvious enough, but it still helps when someone such as Allen Holub explains it! The exceptions are serialization and deserialization, where classes need to expose their internals (Allen Holub says that languages should take care of that, but that’s not yet the case with Java). Even in that case, he favours specific « importer » and « exporter » classes over getters and setters, and that’s an option that could be used by TreeBind too (the current version relies on getters and setters).
  • Rick Wayne presented « Railin’ on AJAX » and I was looking forward to seeing what was behind the Ruby on Rails buzz. How is that related to TreeBind? One of the lessons learned from Ruby on Rails is the « DRY » (Don’t Repeat Yourself) principle and the way Rails generates classes from the database. It would be easy enough for TreeBind to generate the Java classes corresponding to an XML document. This could be done by TreeBind itself, through a TreeBind Sink that would write Java source files. Annotations could be added to the document to describe cardinalities and datatypes, and the XML document would be used the way schemas are used by JAXB. Using XML instances as schemas? Doesn’t that ring a bell? That’s exactly what Examplotron is about!

What’s the common thread between all that?

The initial motivation is still there: to make binding as transparent and lightweight as possible and Java as agile as it can be!

Web 2.0 and enterprises 1.0

Dare Obasanjo and Uche Ogbuji have published three blog posts ([dare], [uche1], [uche2]) that nicely illustrate the gap between enterprise IT and Web IT.

This phenomenon is not new: in the 1990s, the same gap existed between the « serious » IT advocated by most IT departments and the client/server developments that we were recommending (I was working at Sybase at the time), which were often handled by other teams (sometimes the users themselves).

IT departments eventually came around, but the recent progress of the so-called Web 2.0 is such that there is little point (apart from a few uncommon application niches) in developing anything other than Web applications today.

The implications run deeper than they appear.

On the technical side, and this is the subject of the posts I am citing, what justification can there be for using technologies other than those behind the success of giants such as Google, Yahoo or Amazon?

How can one justify the complexity and cost of the architectures that characterize enterprise IT when developing Web applications whose technical constraints will, in the vast majority of cases, be far lighter than those of these giants?

On the contrary, enterprises should embrace the architectures based on Open Source software and scripting languages that are used by the big Web sites!

But it is perhaps in the way applications are used that the biggest gains can be made.

The so-called « social » side of Web 2.0 manages to make the Web collaborative and to turn its users into active participants.

Isn’t that a major issue in enterprises?

Many enterprises stumble over the lack of user buy-in when trying to set up costly knowledge management systems.

Web 2.0, on the contrary, succeeds in getting its users to participate, whether to write documents (Wikipedia), classify resources (del.icio.us and dmoz), share photos (Flickr), inform (digg and wikinews), make themselves known (blogs), build social networks (linkedIn, Viaduc, 6nergies, …), provide technical support (newsgroups, forums and mailing lists), develop software in a distributed way (SourceForge, Savannah, …), exchange intellectual services (Amazon Mechanical Turk, Google Answers, Yahoo! Answers), …

The use of Web 2.0 applications in enterprises is only just starting, mainly thanks to Wikis, which are beginning to earn their credentials.

Yet enterprises have everything to gain from applying internally the recipes that work so well on the Web!

The possibilities are unlimited, and the enterprise 2.0 will probably use an internal Wikipedia to edit its documentation, a del.icio.us clone to classify its internal and external resources, a LinkedIn look-alike to manage relationships between its employees, a derivative of Amazon Mechanical Turk to channel the internal or external questions it receives, …

This is a subject close to my heart. Contact me if you would like to discuss it and see how all this could apply to your enterprise.

W3C Internationalization « Tag » Set

2006-02-22: The Internationalization Tag Set Working Group has published an updated Working Draft of the Internationalization Tag Set (ITS). Organized by data categories, this set of elements and attributes supports the internationalization and localization of schemas and documents. Implementations are provided for DTDs, XML Schema and Relax NG, and for existing vocabularies like XHTML, DocBook and OpenDocument. Visit the Internationalization home page.

(Copied from the W3C News Archive)

I had missed the previous version of this document and I was very impressed and pleased while (quickly) reading it.

Among the good things, I’d mention:

  • Flexibility: ITS can be used within the documents to localize, within the schemas that describe these documents or standalone.
  • Schema agnosticism: ITS can be used with DTDs, W3C XML Schema and RELAX NG (I don’t see why the list has been limited to these three, but, at least, RELAX NG is explicitly mentioned).
  • No QNames: more precisely, ITS has been wise enough to avoid using namespace declarations for its QNames.

Among the things that could be improved, I have found (and reported):

  • The word « tag » in the name itself, « Internationalization Tag Set »: we spend our time explaining that XML is about trees and that tags are only syntactic sugar to mark the beginning and the end of elements, and I wouldn’t have expected to see this word in the name of a W3C specification! [bug 2922]
  • The fact that the same element names are used in schemas and instance documents: schemas with XML syntaxes are also instances, and ITS could be used to localize the schemas themselves instead of localizing the instances described by these schemas. Unfortunately, doing so would lead to confusion since the ITS element names would be the same for both usages. [bug 2923]
  • The list of schema languages could be left open [bug 2924]

Publishing GPG or PGP public keys considered harmful?

In a previous post, I expressed the common thinking that digitally signed emails would be a strong spam stopper.

I still think that a more general usage of electronic signatures would be really effective in fighting spammers, but it recently occurred to me that, at least before we reach that stage, publishing one’s public key can be considered… harmful!

A system such as GPG/PGP relies on the fact that public keys, which are used to check signatures, are not only public but easy to find, and you typically publish them both on your web site and on public key servers.

At the same time, these public keys can be used to cipher messages that you want to send to their owners.

This ciphering is typically « end to end »: the message is ciphered by the sender’s mail user agent and deciphered by the recipient’s mail agent with the recipient’s private key and nobody, either human or software, can read the content of the message in between.

While this is really great for preserving your privacy, this also means that neither anti-spam nor anti-virus software can read the content of ciphered emails without knowing the recipient’s private key, and that pretty much eliminates any server-side shielding.

Keeping your public key private would eliminate most of the benefit of signing your mails, but if you make your public key public, you’d better be very careful when reading ciphered emails, especially when they are not signed!