Quark’s desperate attempt to keep XML under control

Yesterday, I had the opportunity to read more carefully the press release made back in January 2005 to announce QuarkXPress Markup Language (QXML).

My first guess, before clicking on the link, was that QXML would be an XML vocabulary.

Wrong guess!

QXML appears to be a "DOM schema", i.e. an XML schema exposed through the World Wide Web Consortium (W3C) Document Object Model (DOM).

Although the press release doesn’t give a definition of this term (new to me), its benefits are detailed:

With QXML, the new DOM schema for QuarkXPress, developers can dynamically access and update the content, structure, and style of a QuarkXPress project using a DOM interface. XTensions modules can be more versatile because they can use a project’s complete content, including all formatting, style sheets, hyphenation, and justification specifications.

A PDF white paper further explains:

One of the goals of an open-standard DOM is to produce common methods for accessing, modifying, creating, and deleting content within a DOM schema. If you are familiar with one particular DOM, understanding and working with another DOM is easy because you are already familiar with the common methods and metaphors applicable to all DOMs. This commonality gives the DOM a distinct advantage over traditional C/C++ application programming interfaces (APIs).

I searched quite a lot, both on Quark's web site and on the Internet, and did not find any reference to the possibility of using the XML document directly, nor any description of the XML vocabulary itself.

To me, it looks like Quark is just missing the point of XML.

XML is about letting anyone read documents in a text editor and write them through print statements…

Whether you like it or not, you just can't publish documents in XML and still constrain developers to use your own API, available only to your official partners on your company web site and working only on the platforms and languages that you support!

And since you can't stop people from using your XML format directly, you'd better document it…

See also:

Good old entities

There is a tendency among XML gurus to deprecate everything in the XML recommendation that is not an element or an attribute, and XML constructions such as comments and processing instructions have been deprecated de facto by specifications such as W3C XML Schema, which have reinvented their own element-based replacements.

Many people also think that DTDs are an archaism that should be removed from the spec.

Lacking an SGML culture, I am not a big fan of DTDs, but there are cases where they can be very useful.

I came upon one of these cases this afternoon while implementing multi-level grouping in XSLT 1.0 following the so-called Muenchian method.
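For readers who have not met it, the Muenchian method emulates the grouping facility that XSLT 1.0 lacks: you declare an xsl:key on the grouping value and keep only the nodes that come first in their key. Stripped of the XSLT machinery, the underlying idea boils down to something like this sketch (the record layout is made up for the illustration):

```python
# The idea behind Muenchian grouping, outside XSLT: group records by a
# compound key and keep one representative per group.
# The record layout below is hypothetical, for illustration only.
records = [
    {"name": "a", "steps": ("x", "y")},
    {"name": "a", "steps": ("x", "y")},
    {"name": "b", "steps": ("x", "z")},
]

groups = {}
for record in records:
    # concat(@name, '¤', step[1], ...) in the XSLT becomes a tuple key here
    key = (record["name"],) + record["steps"]
    # keeping key(...)[1], the first node in its key, becomes setdefault()
    groups.setdefault(key, record)

representatives = list(groups.values())
print(len(representatives))  # 2: one per distinct key
```

Each extra grouping level simply adds one more component to the key, which is why the XSLT expressions below grow so quickly.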

At the fourth level, I ended up with XPath expressions that looked like this:

        <xsl:when test="key('path4', concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤',
          ../path/step[3], '¤', ../path/step[4]))[../path/step[5]] ">
          <xs:complexType>
            <xsl:if test="key('path4', concat(@name, '¤', ../path/step[1], '¤', ../path/step[2],
              '¤', ../path/step[3], '¤', ../path/step[4]))[../path/step[5][starts-with(., '@')]]">
              <xsl:attribute name="mixed">true</xsl:attribute>
            </xsl:if>
            <xs:sequence>
              <xsl:apply-templates select="key('path4', concat(@name, '¤', ../path/step[1], '¤',
                ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4]))[ count( . |
                key('path5', concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤',
                ../path/step[3], '¤', ../path/step[4], '¤', ../path/step[5]))[1]  )
                = 1 ]" mode="path5"/>
            </xs:sequence>
            <xsl:apply-templates select="key('path4', concat(@name, '¤', ../path/step[1], '¤',
              ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4]))[ count( . | key('path5',
              concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3], '¤',
              ../path/step[4], '¤', ../path/step[5]))[1]  ) = 1 ]"
              mode="path5Attributes"/>
          </xs:complexType>
        </xsl:when>

Isn’t it cute?

If you have to write such repetitive expressions, XML entities are your friends. Just write:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY path1 "concat(@name, '¤', ../path/step[1])">
<!ENTITY path2 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2])">
<!ENTITY path3 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3])">
<!ENTITY path4 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4])">
<!ENTITY path5 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4], '¤', ../path/step[5])">
<!ENTITY kCase "key('case', @name)">
<!ENTITY kPath1 "key('path1', &path1;)">
<!ENTITY kPath2 "key('path2', &path2;)">
<!ENTITY kPath3 "key('path3', &path3;)">
<!ENTITY kPath4 "key('path4', &path4;)">
<!ENTITY kPath5 "key('path5', &path5;)">

]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="1.0">
    <xsl:import href="excel2model.xsl"/>
    <xsl:output media-type="xml" indent="yes"/>
    <xsl:key name="case" match="case" use="@name"/>
    <xsl:key name="path1" match="case" use="&path1;"/>
    <xsl:key name="path2" match="case[../path/step[2]]" use="&path2;"/>
    <xsl:key name="path3" match="case[../path/step[3]]" use="&path3;"/>
    <xsl:key name="path4" match="case[../path/step[4]]" use="&path4;"/>
    <xsl:key name="path5" match="case[../path/step[5]]" use="&path5;"/>
            ...

And you’ll be able to simplify the previous snippet to:

                <xsl:when test="&kPath4;[../path/step[5]] ">
                    <xs:complexType>
                        <xsl:if test="&kPath4;[../path/step[5][starts-with(., '@')]]">
                            <xsl:attribute name="mixed">true</xsl:attribute>
                        </xsl:if>
                        <xs:sequence>
                            <xsl:apply-templates select="&kPath4;[ count( . | &kPath5;[1] ) = 1 ]"
                                  mode="path5"/>
                        </xs:sequence>
                        <xsl:apply-templates select="&kPath4;[ count( . | &kPath5;[1]  ) = 1]"
                                  mode="path5Attributes"/>
                    </xs:complexType>
            </xsl:when>

Doesn’t that look better?
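As a side note, the expansion is done by the XML parser itself, before the XSLT processor ever sees the stylesheet, so the trick costs nothing at run time. A quick check in Python (the entity mirrors the path1 declaration above):

```python
import xml.etree.ElementTree as ET

# Internal general entities declared in the internal DTD subset are expanded
# by any conforming XML parser, so a processor reading this document only
# ever sees the full XPath expression in the attribute value.
doc = """<!DOCTYPE stylesheet [
<!ENTITY path1 "concat(@name, '&#164;', ../path/step[1])">
]>
<stylesheet test="&path1;"/>"""

root = ET.fromstring(doc)
print(root.get("test"))  # concat(@name, '¤', ../path/step[1])
```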

Normalizing Excel’s SpreadsheetML using XSLT

Spreadsheet tables are full of holes, and spreadsheet processors such as OpenOffice and Excel have implemented hacks to avoid having to store empty cells.

In the case of Excel, that’s done using ss:Index and ss:MergeAcross attributes.

While these attributes are easy enough to understand, they add a great deal of complexity to XSLT transformations that need to access a specific cell, since you can no longer index your target directly.

The traditional way to work around this kind of issue is to pre-process your spreadsheet document to get an intermediary result that lets you index your target cells.

Having already encountered this issue with OpenOffice, I needed something to do the same with Excel, when Google led me to a blog entry proposing something similar.

The transformation needed some adaptation to be usable the way I wanted, i.e. as a transformation that does not modify your SpreadsheetML document except for inserting an ss:Index attribute into every cell.

Here is the result of this adaptation:

This version is buggy. An updated one is available here

<?xml version="1.0"?>
<!--

Adapted from http://ewbi.blogs.com/develops/2004/12/normalize_excel.html

This product may incorporate intellectual property owned by Microsoft Corporation. The terms
and conditions upon which Microsoft is licensing such intellectual property may be found at
http://msdn.microsoft.com/library/en-us/odcXMLRef/html/odcXMLRefLegalNotice.asp.
-->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:schemas-microsoft-com:office:spreadsheet"
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <xsl:output method="xml" indent="no" encoding="UTF-8"/>
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="ss:Cell/@ss:Index"/>
    <xsl:template match="ss:Cell">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:variable name="prevCells" select="preceding-sibling::ss:Cell"/>
            <xsl:attribute name="ss:Index">
                <xsl:choose>
                    <xsl:when test="@ss:Index">
                        <xsl:value-of select="@ss:Index"/>
                    </xsl:when>
                    <xsl:when test="count($prevCells) = 0">
                        <xsl:value-of select="1"/>
                    </xsl:when>
                    <xsl:when test="count($prevCells[@ss:Index]) > 0">
                        <xsl:value-of select="($prevCells[@ss:Index][1]/@ss:Index) +
                            ((count($prevCells) + 1) -
                            (count($prevCells[@ss:Index][1]/preceding-sibling::ss:Cell)
                            + 1)) + sum($prevCells/@ss:MergeAcross)"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:value-of select="count($prevCells) + 1 +
                            sum($prevCells/@ss:MergeAcross)"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:attribute>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
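Since the version above is flagged as buggy, it may help to state the intended rule outside XSLT: a cell's column is its explicit ss:Index when present; otherwise it is the previous cell's column plus one, with merged cells consuming extra columns. A rough Python sketch of that rule (the dict-based cell model is a stand-in for real SpreadsheetML rows):

```python
# Sketch of the ss:Index normalization rule: give every cell an explicit
# column index. The dict-based cell model is hypothetical; real input
# would be ss:Cell elements with ss:Index and ss:MergeAcross attributes.
def normalize_row(cells):
    index = 0
    normalized = []
    for cell in cells:
        # an explicit ss:Index wins; otherwise advance by one column
        index = cell["Index"] if "Index" in cell else index + 1
        normalized.append(dict(cell, Index=index))
        # a merged cell spans extra columns, shifting the next cell
        index += cell.get("MergeAcross", 0)
    return normalized

row = [{"v": "A"}, {"v": "B", "MergeAcross": 2},
       {"v": "C", "Index": 7}, {"v": "D"}]
print([c["Index"] for c in normalize_row(row)])  # [1, 2, 7, 8]
```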


TreeBind goes RDF

TreeBind can be seen as yet another open source XML <-> Java object data binding framework.

The two major design decisions that differentiate TreeBind from other similar frameworks are that:

  1. TreeBind has been designed to work with existing classes through Java introspection and doesn’t rely on XML schemas.
  2. Its architecture is not specific to XML: TreeBind can be used to bind any source to any sink, provided they can be browsed and built following the paradigm of trees (this is the reason we have chosen this name).

Another difference from other frameworks is that TreeBind has been sponsored by one of my customers (INSEE) and that I am its author…

The reason we started this project is that we had not found any framework meeting the two requirements mentioned above, and I am now taking TreeBind a step further by designing an RDF binding.

I have sent an email with the design decisions I am considering for this RDF binding to the TreeBind mailing list, and I include a copy below for your convenience:


Hi,

I am currently using TreeBind on an RDF/XML vocabulary.

Of course, an RDF/XML document is a well-formed XML document and I could use the current XML bindings to read and write RDF/XML documents.

However, these bindings focus on the actual XML syntax used to serialize the document. They don't see the RDF graph behind that syntax and are sensitive to the "style" used in the XML document.

For instance, these two documents produce very similar triples:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://ns.treebind.org/example/">
    <book>
        <title>RELAX NG</title>
        <written-by>
            <author>
                <fname>Eric</fname>
                <lname>van der Vlist</lname>
            </author>
        </written-by>
    </book>
</rdf:RDF>

and

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://ns.treebind.org/example/">
    <book>
        <title>RELAX NG</title>
        <written-by rdf:resource="#vdv"/>
    </book>
    <author rdf:ID="vdv">
        <fname>Eric</fname>
        <lname>van der Vlist</lname>
    </author>
</rdf:RDF>

but the XML bindings will generate a quite different set of objects.

The solution to this problem is to create RDF bindings that sit on top of an RDF parser to pour the content of the RDF model into a set of objects.

The overall architecture of TreeBind has been designed with this kind of extension in mind, so that should be easy enough.

That being said, design decisions need to be made to define these RDF bindings and I’d like to discuss them in this forum.

RDF/XML isn't so much an XML vocabulary in the common meaning of the term as a set of binding rules that map an XML tree onto a graph.

These binding rules introduce conventions that are sometimes different from what we are used to doing in "raw" XML documents.

In raw XML, we would probably have written the previous example as:

<?xml version="1.0" encoding="UTF-8"?>
<book xmlns="http://ns.treebind.org/example/">
    <title>RELAX NG</title>
    <author>
        <fname>Eric</fname>
        <lname>van der Vlist</lname>
    </author>
</book>

The XML bindings would pour that content into a set of objects using the following algorithm:

  • find a class that matches the XML expanded name {http://ns.treebind.org/example/}book and create an object from that class.
  • try to find a method such as addTitle or setTitle with a string parameter on this book object and call that method with the string "RELAX NG".
  • find a class that matches the XML expanded name {http://ns.treebind.org/example/}author and create an object from that class.
  • try to find a method such as addFname or setFname with a string parameter on this author object and call that method with the string "Eric".
  • try to find a method such as addLname or setLname with a string parameter on this author object and call that method with the string "van der Vlist".
  • try to find a method such as addAuthor or setAuthor on the book object and call that method with the author object.
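The algorithm above relies on nothing more than reflection over class and method names. TreeBind does this in Java; here is a minimal Python sketch of the same name-driven binding (the classes and setter names are illustrative, not TreeBind's actual API):

```python
import xml.etree.ElementTree as ET

# Minimal sketch of name-driven binding through introspection, in Python
# rather than TreeBind's Java; class and method names are illustrative.
class Author:
    def set_fname(self, value): self.fname = value
    def set_lname(self, value): self.lname = value

class Book:
    def set_title(self, value): self.title = value
    def set_author(self, value): self.author = value

CLASSES = {"book": Book, "author": Author}

def bind(element):
    # find a class that matches the element name and create an object from it
    obj = CLASSES[element.tag]()
    for child in element:
        if len(child):
            # complex content: bind recursively, the parameter is the object
            value = bind(child)
        else:
            # simple content: the parameter is always a string
            value = child.text
        # look up the setter by name, as introspection would
        getattr(obj, "set_" + child.tag)(value)
    return obj

book = bind(ET.fromstring(
    "<book><title>RELAX NG</title>"
    "<author><fname>Eric</fname><lname>van der Vlist</lname></author></book>"
))
print(book.title, book.author.fname)  # RELAX NG Eric
```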

We see that there is a difference between the way simple type and complex type elements are treated.

For a simple type element (such as "title", "fname" and "lname"), the name of the element is used to determine the method to call and the parameter type is always string.

For a complex type element (such as author), the name of the element is used both to determine the method to call and the class of the object that needs to be created. The parameter type is this class.

This is because when we write XML there is an implicit expectation that "author" can be used both as an object and as a verb.

Unless instructed otherwise, RDF doesn't allow these implicit shortcuts: an XML element is either a predicate or an object. That's why we have added a "written-by" element in our RDF example:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://ns.treebind.org/example/">
    <book>
        <title>RELAX NG</title>
        <written-by>
            <author>
                <fname>Eric</fname>
                <lname>van der Vlist</lname>
            </author>
        </written-by>
    </book>
</rdf:RDF>

The first design decision we have to make is how we will treat that "written-by" element.

To have everything in hand to make a decision, let's also look at the triples for that example:

rapper: Parsing file book1.rdf
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ns.treebind.org/example/book> .
_:genid1 <http://ns.treebind.org/example/title> "RELAX NG" .
_:genid2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ns.treebind.org/example/author> .
_:genid2 <http://ns.treebind.org/example/fname> "Eric" .
_:genid2 <http://ns.treebind.org/example/lname> "van der Vlist" .
_:genid1 <http://ns.treebind.org/example/written-by> _:genid2 .
rapper: Parsing returned 6 statements

Two of these triples define the types of our resources:

_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ns.treebind.org/example/book> .

and

_:genid2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ns.treebind.org/example/author> .

I propose to use these statements to determine which classes must be used to create the objects. So far, that’s pretty similar to what we’re doing in XML.

Then, we have triples that assign literals to our objects:

_:genid1 <http://ns.treebind.org/example/title> "RELAX NG" .
_:genid2 <http://ns.treebind.org/example/fname> "Eric" .
_:genid2 <http://ns.treebind.org/example/lname> "van der Vlist" .

We can use the predicates of these triples (<http://ns.treebind.org/example/title>, <http://ns.treebind.org/example/fname>, <http://ns.treebind.org/example/lname>) to determine the names of the setter methods used to add the corresponding information to the objects. Again, that's exactly what we do in XML.

Finally, we have a statement that links two objects together:

_:genid1 <http://ns.treebind.org/example/written-by> _:genid2 .

I think that it is quite natural to use the predicate (<http://ns.treebind.org/example/written-by>) to determine the setter method that needs to be called on the book object to set the author object.
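Putting the three rules together (rdf:type triples select the classes, literal-valued predicates name string setters, resource-valued predicates link two objects), the binding could be sketched like this over a plain list of triples. This is a rough Python illustration; a real implementation would sit on top of an RDF parser, and the classes and the setattr-based "setter" lookup are stand-ins:

```python
# Sketch of the proposed RDF binding rules over a plain list of triples.
# A real implementation would sit on top of an RDF parser; the classes
# and the setattr-based property assignment here are illustrative.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://ns.treebind.org/example/"

class Book: pass
class Author: pass
CLASSES = {EX + "book": Book, EX + "author": Author}

triples = [
    ("_:genid1", RDF_TYPE, EX + "book"),
    ("_:genid1", EX + "title", "RELAX NG"),
    ("_:genid2", RDF_TYPE, EX + "author"),
    ("_:genid2", EX + "fname", "Eric"),
    ("_:genid2", EX + "lname", "van der Vlist"),
    ("_:genid1", EX + "written-by", "_:genid2"),
]

# rule 1: rdf:type triples determine which class to instantiate
objects = {s: CLASSES[o]() for s, p, o in triples if p == RDF_TYPE}

# rules 2 and 3: any other predicate names the property to set; the value
# is a literal, or another bound object for resource-valued predicates
for s, p, o in triples:
    if p != RDF_TYPE:
        name = p.rsplit("/", 1)[-1].replace("-", "_")
        setattr(objects[s], name, objects.get(o, o))

book = objects["_:genid1"]
print(book.title, book.written_by.fname)  # RELAX NG Eric
```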

This is different from what we would do in XML: since there is a written-by element, the XML bindings would have created a "written-by" object, added the author object to the written-by object and added the written-by object to the book object.

Does that difference make sense?

I think it does, but the downside is that a simple document like this one:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://ns.treebind.org/example/">
    <book>
        <title>RELAX NG</title>
        <written-by>
            <author>
                <fname>Eric</fname>
                <lname>van der Vlist</lname>
            </author>
        </written-by>
    </book>
</rdf:RDF>

will give quite different sets of objects depending on which binding (XML or RDF) is used.

That seems to be the price to pay for trying to get as close as possible to the RDF model.

What do you think?

Earlier, I mentioned that RDF can be told to accept documents with "shortcuts". What I had in mind is:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://ns.treebind.org/example/">
    <book>
        <title>RELAX NG</title>
        <author rdf:parseType="Resource">
            <fname>Eric</fname>
            <lname>van der Vlist</lname>
        </author>
    </book>
</rdf:RDF>

Here, we have used an rdf:parseType="Resource" attribute to specify that the author element is a resource.

The triples generated from this document are:

rapper: Parsing file book3.rdf
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ns.treebind.org/example/book> .
_:genid1 <http://ns.treebind.org/example/title> "RELAX NG" .
_:genid2 <http://ns.treebind.org/example/fname> "Eric" .
_:genid2 <http://ns.treebind.org/example/lname> "van der Vlist" .
_:genid1 <http://ns.treebind.org/example/author> _:genid2 .
rapper: Parsing returned 5 statements

The model is pretty similar except that a triple is missing (we now have 5 triples instead of 6).

The missing triple is the one that gave the type of the author resource:

_:genid2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ns.treebind.org/example/author> .

The other difference is that <http://ns.treebind.org/example/author> is now a predicate.

In this situation, where we don't have a type for the resource a predicate links to, I propose that we follow the rule we use in XML: use the predicate to determine both the setter method and the class of the object to create for the resource.

What do you think? Does that make sense?

Thanks,

Eric


Thanks for your comments, either on this blog or (preferred) on the TreeBind mailing list.

SPARQL Versus Versa

A new working draft of SPARQL has been released.

While there is no doubt that the language is getting better and more polished with each new release of the specification, I am surprised to see that the limitations I found in rdfDB back in early 2001, when I tried to use it for XMLfr, are still there.

This is an old story that I presented in Austin at KT 2001 and published as an XML.com article: it can be very interesting to compute the distance between resources, and to do so you need the equivalent of a SQL "group by" clause and the related aggregate functions.

In the case of XMLfr, I rely on this feature to compute the distance between two topics by counting the number of articles in which they appear together. To do so, I use the SQL group by clause with the "count" aggregate function.
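For the record, here is what that computation looks like on the relational side; a minimal sqlite3 sketch, with a hypothetical table layout and sample data (the real XMLfr schema differs):

```python
import sqlite3

# "group by" + count to measure the distance between two topics: the number
# of articles in which they appear together. Table layout and sample rows
# are hypothetical; the real XMLfr schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mention (article TEXT, topic TEXT)")
conn.executemany("INSERT INTO mention VALUES (?, ?)", [
    ("a1", "XML"), ("a1", "RDF"),
    ("a2", "XML"), ("a2", "RDF"),
    ("a3", "XML"), ("a3", "XSLT"),
])

# co-occurrence count per topic pair: the aggregate that was missing
rows = conn.execute("""
    SELECT m1.topic, m2.topic, count(*) AS together
    FROM mention AS m1 JOIN mention AS m2
      ON m1.article = m2.article AND m1.topic < m2.topic
    GROUP BY m1.topic, m2.topic
    ORDER BY together DESC
""").fetchall()
print(rows)  # [('RDF', 'XML', 2), ('XML', 'XSLT', 1)]
```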

The fact that these features were missing in rdfDB is the reason I had to drop rdfDB, and RDF altogether, and store my triples in a relational database that I query with SQL.

As far as I know, there is only one RDF query language that supports these features: 4Suite's Versa query language.

Versa is so different from SPARQL that the two languages are as difficult to compare as, let's say, W3C XML Schema's XML syntax and RELAX NG's compact syntax.

Instead of trying to bend the well-known SQL syntax to make it work on triples, Versa defines a totally new language for the purpose of traversing triple stores.

The result is surprising. You won't find anything that reminds you of SQL and, to take an example from "Versa by Example", to get a list of people's first names sorted by their age, you'd write: sortq(all(), ".-o:age->*", vsort:number) - o:fname -> *

If you insist and don't let the first surprise stop you, the second surprise is that the language works incredibly well. During the (unfortunately too few) opportunities I have had to work with Versa, I have never been blocked by a limitation of the language as I was with rdfDB, or would be with SPARQL.

The bad news is that there is only one implementation of Versa (4Suite). This means that you won't be able to use Versa over Redland or Jena, and I wish people implementing RDF databases would take a closer look at implementing Versa over their databases!

I also wish the W3C had taken Versa as the main input for their RDF query language, but this wish doesn't seem likely to be granted :-( …

RELAX NG text versus data

Uche Ogbuji asks me on IRC:

Eric, I looked in your book for a general discussion of <data type="string"/> versus <text/>, but I didn’t find one.

It’s a question that I’ve come across a few times. I might have missed something or perhaps I and others over-think the distinction :-)

From my reading of the specs, it seems to me there is no lexical difference but that there are differences with respect to pattern restrictions (the infamous data typing restrictions in RNG), so it seems to me one should always use <text/> unless they’re sure they know what they’re doing.

Also, any thoughts about the "simple string type considered harmful" note in WXS? It seems to raise a respectable point that not allowing child elements in elements to contain prose can lead to unexpected restrictions.

That's a common question, or if not, it's a question that anyone writing RELAX NG or W3C XML Schema schemas should ask themselves, and I think it might be useful to publish my thoughts on the subject.

To make it short, I fully agree with Uche's analysis, and here are my reasons:

The names of the elements of the RELAX NG XML syntax have been chosen very carefully: basically, <data/> has been designed for data-oriented applications (or parts of applications) and <text/> for text- (i.e. document-) oriented applications.

The first and best rule of thumb is thus probably: if you think of the content of an element or attribute as "a piece of text" you should use <text/>, and if you think of it as "data" you should use <data/>.

As mentioned by Uche, there is no lexical difference: element foo {text}, element foo {string} and element foo {token} (i.e. the compact-syntax notations for an element foo containing text, string or token data) validate exactly the same set of elements. The difference is in the restrictions attached to these patterns. The main restrictions are that:

  • With the "text oriented" pattern (element foo {text}), it is possible and even easy to extend the model to add sub-elements and make it mixed content, but you can't add facets that restrict your text nodes.
  • With the "data oriented" pattern (element foo {string}), you can add restrictions to your datatype if the datatype library you are using supports it (the W3C XML Schema datatype library does, as in element foo {xsd:token {maxLength="10"}}), but you can't extend the content model into mixed content.

Now, why do I consider string datatypes harmful?

This datatype is harmful because it doesn't behave like any other datatype: white space is normalized before text nodes reach the datatype library for every datatype except string and xsd:normalizedString, and left unchanged for the string datatypes.

This makes a real difference when you apply restrictions to your datatypes.

Consider for instance: element foo {xsd:boolean "true"} and element foo {xsd:string "true"}.

The first one will accept <foo> true </foo> while the second one will reject it, and there is nothing you can do about it because the white spaces are handled before they reach your datatype library.

That's why I strongly advise, in 99% of cases, using token (or xsd:token) instead of string (or xsd:string).

And don't think that this advice applies only to cases like the preceding example where we have actual "tokens" (such as "true"). The name of the token datatype is very misleading, and its lexical space is the same as the lexical space of string. In other words, " this is a string " is a valid token (or xsd:token) value. The difference is that with a token datatype, the spaces will be normalized and what will reach the datatype library is "this is a string".
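To make the difference concrete, the white-space handling that separates the two datatypes can be sketched in a few lines. These functions mirror the W3C XML Schema whiteSpace facet values "preserve" and "collapse"; they are illustrative, not a schema validator:

```python
# White-space handling is what separates string from token: values typed
# xsd:string reach the datatype library untouched, while for xsd:token
# white space is collapsed first. These two functions mirror the W3C XML
# Schema whiteSpace facet values; they are illustrative, not a validator.
def ws_preserve(value):   # xsd:string behaviour
    return value

def ws_collapse(value):   # xsd:token behaviour
    return " ".join(value.split())

print(repr(ws_preserve(" this is  a string ")))  # unchanged, spaces and all
print(repr(ws_collapse(" this is  a string ")))  # 'this is a string'

# hence element foo {xsd:boolean "true"} accepts "<foo> true </foo>"
# while element foo {xsd:string "true"} rejects it
assert ws_collapse(" true ") == "true"
assert ws_preserve(" true ") != "true"
```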

How are these two questions linked?

The fact that <data/> is meant for data-oriented applications while <text/> is meant for document-oriented applications reduces the need to ever use the string datatype: if your spaces are meaningful, such as within an HTML <pre/> element, chances are that you're dealing with a document-oriented application and should be using a <text/> pattern. On the other hand, if you are designing a data-oriented application, chances are that spaces are not significant and are better normalized, to be consistent with the other datatypes you are probably using.

The opportunities to use string or xsd:string datatypes are thus very infrequent.

How does that translate into W3C XML Schema?

A good way to answer the question is to see what James Clark’s converter, trang, thinks about it.

If you convert a pattern such as element foo {attribute bar {text}, xsd:token} into W3C XML Schema, you get:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="foo">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:token">
          <xs:attribute name="bar" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>

Sounds logical: a RELAX NG <data/> pattern is clearly a close match for W3C XML Schema simple types.

Now, what happens if we convert element foo {attribute bar {text}, text}? We get:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="foo">
    <xs:complexType mixed="true">
      <xs:attribute name="bar" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

And that clearly shows that the semantics of a <text/> pattern is to be a text node with the potential of becoming a mixed content model, or in other words, mixed content without sub-elements!

It also shows a smart (and probably underused) option for those of us who are using W3C XML Schema: representing text nodes as mixed content models, rather than simple string or token types, in document-oriented applications.

Is XML 2.0 doomed?

XML 2.0 seems to be becoming the new buzzword and hot topic among many forums.

While I think that XML 1.0 deserves a certain amount of refactoring, I don't think that XML 2.0 is likely to ever happen, nor even that it is something we should wish for.

The reasons for the success of XML 1.0 are not that difficult to analyse:

  1. The cost/benefit ratio of developing XML 1.0 applications, compared to previous technologies, has generally been judged highly positive.
  2. XML 1.0 comes very close to being the greatest common denominator of the needs of a very wide range of applications, including both document- and data-oriented ones.
  3. XML 1.0 has been proposed by a normalisation body that had the credentials to push such a specification.

I don’t think that this is likely to happen again for XML 2.0:

  1. The unfortunate XML 1.1 recommendation has shown that the cost of the tiniest modification to XML 1.0 is so high that it is difficult to think of a benefit that could compensate for it. While XML 1.0 is certainly imperfect, the cost of its imperfections just isn't high enough.
  2. A fairly good consensus on the features supported by XML 1.0 was possible in a small Working Group working in reasonable isolation from the pressure of the lobbies that finance the W3C. All the recent specifications developed under more pressure and hype, such as W3C XML Schema, SOAP, WSDL, XPath 2.0, XSLT 2.0 and XQuery 1.0, show that this is not likely to happen again and that, on the contrary, an XML 2.0 specification would most probably lose the balance that has made XML 1.0 successful.
  3. During the past six years, the W3C has lost a lot of credibility, to the point that its most influential participants now ignore its most basic recommendations, such as XHTML, CSS, SVG, XForms, XLink and many others. This loss of credibility would greatly compromise the success of an XML 2.0 recommendation published by the W3C.

What is likely to happen with XML 2.0 is either a recommendation that is easily ignored by the community at large, or one much less generic, lightweight and flexible than XML 1.0.

I think I would prefer the first option!

Edd Dumbill on XTech 2005

XTech 2005 presents itself as "the premier European conference for developers and managers working with XML and Web technologies, bringing together the worlds of web development, open source, semantic web and open standards." Edd Dumbill, XTech 2005 Conference Chair, answered our questions about this conference, previously known as XML Europe. This interview has been published in French on XMLfr.

vdV: XTech was formerly known as XML Europe. What were the motivations for changing its name?

Edd: As the use of XML broadens out beyond traditional core topics, we want to reflect that in the conference. As well as XML, XTech 2005 will cover web development, the semantic web and more. XML's always been about more than just the core, but we felt that having "XML" in the name made some people feel the conference wasn't relevant to them. The two new tracks, Browser Technology and Open Data, aren't strictly about XML topics at all.

vdV: In the new name (XTech), there is no mention of Europe, does that mean that the conference is no longer or less European?

Edd: Not at all! Why should « Europe » be a special case anyway? Even as XML Europe, we’ve always had a fair number of North American speakers and participants. I don’t see anything changing in this regard.

vdV: After a period where every event, product or company tried to embed « XML » in their name, the same events are now removing any reference to XML. How do you analyse this trend?

Edd: It’s a testament to the success of XML. As XML was getting better known, everybody knew it was a good thing and so used it as a sign in their names. Now that XML is a basic requirement for many applications, it’s no longer remarkable in that sense.

vdV: How would you compare the 12 different tracks of XML Europe 2004 (ranging from Content Management to Legal through Government and Electronic Business) with the 4 tracks of XTech 2005 (Core Technologies, Applications, Browser Technologies and Open Data)?

Edd: The switch to four clearly defined tracks is intended to help both attendees and speakers. The twelve tracks from before weren’t always easy to schedule in an easy-to-understand way, leading to a « patchwork » programme. Some of the previous tracks only had a handful of sessions in them anyway.

In addition to making the conference easier to understand, we get an opportunity to set the agenda as well as reflect the current practice. Take the new « Open Data » track as an example. There are various areas in which data is being opened up on the internet: political and government (theyrule.net, electoral-vote.com, theyworkforyou.com), cultural (BBC Creative Archive), scientific and academic (Open Access). Many of the issues in these areas are the same, but there’s never been a forum bringing the various communities together.

vdV: Isn’t there a danger that the new focus on Web technologies becomes a specialisation and reduces the scope?

Edd: I don’t think that’s a danger. In fact, web technology is as much a part of the basic requirement for companies today as XML is, and it’s always been a running theme through the XML Europe conferences.

What we’re doing with the Browser Technology track is reflecting the growing importance of decent web and XML-based user interfaces. Practically everybody needs to build web UIs these days, and practically everybody agrees the current situation isn’t much good. We’re bringing together, for the first time, everybody with a major technology offering here: W3C standards implementors, Mozilla, Microsoft. I hope again that new ideas will form, and attendees will get a good sense of the future landscape.

vdV: Does the new orientation mean that some of the people who enjoyed XML Europe 2004 might not enjoy XTech 2005?

Edd: No, I don’t think so. In fact, I think they’ll enjoy it more because it will be more relevant to their work. Part of the reasoning in expanding the conference’s remit is the realisation that core XML people are always working with web people, and that any effort to archive or provide public data will heavily involve traditional XML topics. So we’re simply bringing together communities that always work closely anyway, to try and get a more « joined up » conference.

vdV: In these big international conferences, social activities are often as important as the sessions. What are your plans to encourage these activities?

Edd: The first and most important thing is the city, of course! Amsterdam is a great place to go out with other people.

We’ll be having birds-of-a-feather lunch tables, for ad-hoc meetings at lunch time. Additionally, there’ll be dinner sign-up sheets and restaurant suggestions. I’m personally not very keen on having formal evening conference sessions when we’re in such a great city, but I do want a way for people to meet others with common interests.

I’m also thinking about having a conference Wiki, where attendees can self-organise before arriving in Amsterdam.

vdV: Wireless access can play a role in these social activities (people can share their impression in real time using IRC channels, blogs and wikis). Will the conference be covered with wireless?

Edd: I really hope so. The RAI center is in the process of rolling out wireless throughout its facility, but unfortunately hasn’t been able to say for sure.

Wireless internet is unfortunately very expensive, and we would need a sponsor to get free wireless throughout the conference. If anybody’s reading this and interested, please get in touch.

vdV: What topics would you absolutely like to see covered?

Edd: I think what I wrote in the track descriptions page at http://www.xtech-conference.org/2005/tracks.asp is a good starting point for this.

vdV: What topics would you prefer to leave away?

Edd: I don’t want to turn any topics away before proposals have been made. All proposed abstracts are blind reviewed by the reviewing team, so there’s a fair chance for everybody.

vdV: What is your best memory from past editions of XML Europe?

Edd: I always love the opening sessions. It’s very gratifying to see all the attendees and to get a great sense of expectation about what will be achieved over the next three days.

vdV: What is your worst memory from past editions of XML Europe?

Edd: The bad snail I ate in Barcelona — the ride over the bumpy road to the airport after the conference was agony!

Half a day to learn about XML schema languages…

One of the things that are really hard when you specialize in XML schema languages is playing the classical « elevator story »: explaining to your neighbor, in the time it takes to ride from the ground floor to your 7th floor, what you are doing to earn your living.

It’s already tough to play the elevator story explaining what XML is; just try to imagine how difficult it is with XML schema languages!

In fact, I have found that the shortest amount of time it takes me to really initiate people (that is, not my neighbors, but people who already know XML quite well) into XML schema languages is half a day, and I have built a tutorial to perform that initiation!

The next times I’ll be giving this tutorial will be on April 18th in Amsterdam for XML Europe and then on July 27th in Portland for the Open Source Conference.

I have already given that tutorial quite a few times at various conferences, including previous editions of XML Europe, XML 2003, OSCON and even SD West 2004 as recently as last month, but I am still enjoying giving that talk as much as I did the first time.

As far as I know, its agenda is unique: no other tutorial or training covers all three big schema languages (Schematron, RELAX NG and W3C XML Schema).

Most of the attendees seem to enjoy this tutorial as well, and to be astonished by the scope hiding behind the term « XML schema languages ». Of course, many of the concepts seem familiar to those of them who are DTD experts, but the approach taken by these « modern » schema languages is quite different, and people tend to underestimate the amount of work that has been done in these areas.

I have been asked why people should want to learn about three different schema languages if they are going to use only one or two of them.

Of course, people can always learn no more than what they will immediately need, but I think that it’s a good thing to know your environment before stepping into a new discipline.

When you start learning computer science, you can start by learning only a single programming language, let’s say Visual Basic, if you think that it’s the one that you’ll be using later on. I would argue that you’ll become a better programmer (even in Visual Basic) if you have an insight into other programming languages (let’s say C and Java or Python). Furthermore, you may also find out that Visual Basic isn’t always the best choice for what you’ll have to do later on.

That’s the same with schema languages. Even if right now you think that you’ll be using only one of them (let’s say W3C XML Schema), you’ll get a better understanding of W3C XML Schema if you learn about other approaches: that will help you to understand its most basic principles and also its limitations. And eventually, you may find out that W3C XML Schema isn’t always the best choice for what you’ll have to do…
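To give a feeling of how different these approaches really are, here is how one minimal constraint (a made-up price element whose content should be a decimal number) might be sketched in each of the three languages; these are illustrative fragments only, not excerpts from the tutorial:

```xml
<!-- RELAX NG (XML syntax): a grammar-based pattern -->
<element name="price" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <data type="decimal"/>
</element>

<!-- W3C XML Schema: a typed element declaration -->
<xs:element name="price" type="xs:decimal"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"/>

<!-- Schematron: rules and assertions rather than a grammar -->
<sch:pattern name="price is a number"
             xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule context="price">
    <sch:assert test="string(number(.)) != 'NaN'">A price should be a number.</sch:assert>
  </sch:rule>
</sch:pattern>
```

The same requirement is expressed as a pattern, as a type and as a rule: three quite different mindsets behind a single term.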

See you in Amsterdam!

Tim Berners-Lee has taken up the hatchet

Tim Berners-Lee takes up the hatchet and publishes use cases for relative URIs as namespace names:

I can’t remember what prompted me to write up these sue cases [sic] for relative URIs in namespaces, and I apologize if I have done it before. The XML 1.0 and XML 1.1 namespaces documents « deprocate » this practice, following a vote at a XML plenary. It seems that RDF does need this, but no one else seems to just now.

Two use cases follow, the first one contradicted by Elliotte Rusty Harold, the second one related to RDF:

2. In RDF, local identifiers are of the form rdf:id="foo" or about="#foo", which are equivalent. These are used for naming arbitrary things within a description.

The URI "#foo" is defined to be relative to the current document by the URI spec.

RDF can also use these identifiers as class names or properties, in which case they are used as element names in a namespace of the document itself. It is clearly useful to be able to say xmlns:="" in this case.

We have had plenty of trouble with information (for example in the cwm test suite) being serialized as XML, and the local identifiers having necessarily to be given absolute URIs. This has meant that the test files have ended up being branded with the local filepath where they were processed (xmlns="file:/disk4/joe/devel/test/set5/bar"), which works, but it’s a pain. It makes files arbitrarily different for testing, can have privacy implications, and so on.

I am not sure what is meant by xmlns:="".

xmlns="" has been allowed since Namespaces in XML 1.0, and xmlns:foo="" has been introduced by Namespaces in XML 1.1 (now a Proposed Recommendation). Both are namespace « undeclarations », and that’s probably not what’s meant here.
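For the record, here is what these two « undeclarations » look like (with a made-up namespace URI):

```xml
<!-- Namespaces in XML 1.0: the default namespace can be undeclared -->
<doc xmlns="http://example.org/ns">
  <item xmlns="">This element is in no namespace at all.</item>
</doc>

<!-- Namespaces in XML 1.1 additionally allows undeclaring a prefix -->
<doc xmlns:foo="http://example.org/ns">
  <foo:item/>
  <other xmlns:foo=""/><!-- the foo prefix is no longer bound here -->
</doc>
```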

If it’s about using relative URIs in RDF, I wonder if all this is really an issue with XML namespaces rather than an issue with the XML syntax of RDF. Maybe RDF just needs some kind of rdf:base attribute?
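To illustrate what I have in mind, here is a sketch relying on the existing xml:base recommendation (with a made-up base URI) rather than on a hypothetical rdf:base; if I am not mistaken, the RDF/XML syntax already honors xml:base, and this seems to solve the cwm test suite problem without touching XML namespaces:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xml:base="http://example.org/test/set5/bar">
  <!-- "#foo" resolves against xml:base, i.e. to
       http://example.org/test/set5/bar#foo, wherever the file is stored -->
  <rdf:Description rdf:about="#foo"/>
</rdf:RDF>
```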

Also, I have more than mixed feelings about using rdf:about="#foo".

The only circumstance where that would seem legitimate to me would be to make assertions about the XML fragment identified as « #foo », but is that interesting?

Doesn’t it make RDF no better than XLink+XPointer which relies on URIs used as addresses rather than names?

Of course, it’s painful to give each resource an absolute URI, but at least these URIs don’t change when you move the documents containing the assertions.

Or have I misunderstood the whole point?