July 2005 – Eric van der Vlist

An unconventional XML naming convention

I am not a big fan of naming conventions but I don’t like to be obliged to follow naming conventions that do not seem to make sense!

One of the issues added by W3C XML Schema is that, in addition to defining names for elements and attributes, you often also have to define names for simple and complex types.

Even though the W3C XML Schema recommendation says that elements, attributes, types, element groups and attribute groups have separate name spaces, many people want a means to differentiate these spaces just by looking at the names, and end up using all kinds of verbose suffixes.

The other issue is of course to define which character set and capitalization methods should be used.

It happens that the conventions most of my customers have to follow are the UN/CEFACT XML Naming and Design Rules Version 1.1 (PDF).

Following ebXML and UBL, they state that:

Following the ebXML Architecture Specification and commonly used best practice, Lower Camel Case (LCC) is used for naming attributes and Upper Camel Case (UCC) is used for naming elements and types. Lower Camel Case capitalizes the first character of each word except the first word and compounds the name. Upper Camel Case capitalizes the first character of each word and compounds the name.

I think that these rules do not make sense for a couple of reasons:

There are many circumstances where elements and attributes are interchangeable, and many vocabularies try to minimize the differences in treatment between them. Elements and attributes on one hand and types on the other, by contrast, are very different kinds of beasts: elements and attributes are physical notions that are visible in instance documents, while types are abstract notions that belong to schemas.
This convention is not coherent with the UML naming conventions defined in the ebXML Technical Architecture Specification, which says that “Class, Interface, Association, Package, State, Use Case, Actor names SHALL use UCC convention (examples: ClassificationNode, Versionable, Active, InsertOrder, Buyer). Attribute, Operation, Role, Stereotype, Instance, Event, Action names SHALL use LCC convention (examples: name, notifySender, resident, orderArrived).” XML elements and attributes are similar to UML object instances while types are similar to UML classes, and they should follow similar naming conventions.

My preferred naming convention for XML schemas (and the one I am going to follow in the future for projects that are not tied to other conventions) is to use LCC for element and attribute names and UCC for type and group names (or RELAX NG named patterns).

Sticking to this rule will give consistency with the object-oriented world and allow me to get rid of suffixes to distinguish between what can be seen in instance documents (elements and attributes) and what belongs to schemas (types, groups or RELAX NG named patterns).
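As a minimal sketch of what this looks like in practice, here is a small W3C XML Schema fragment. The vocabulary (purchaseOrder, shipTo, Address and so on) is invented for the illustration and does not come from any of the specifications mentioned above. Elements and attributes use LCC, types use UCC, and no "Type" suffix is needed to tell them apart:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary, for illustration only -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- LCC for what appears in instance documents... -->
  <xs:element name="purchaseOrder" type="PurchaseOrder"/>

  <!-- ...UCC for what lives only in the schema -->
  <xs:complexType name="PurchaseOrder">
    <xs:sequence>
      <xs:element name="shipTo" type="Address"/>
      <xs:element name="billTo" type="Address" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="orderDate" type="xs:date"/>
  </xs:complexType>

  <xs:complexType name="Address">
    <xs:sequence>
      <xs:element name="street" type="xs:string"/>
      <xs:element name="city" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```

The element purchaseOrder and the type PurchaseOrder can share the same words because the capitalization alone says which side of the instance/schema divide each one sits on.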

Quark’s desperate attempt to keep XML under control

Yesterday, I had the opportunity to read more carefully the press release issued back in January 2005 to announce the QuarkXPress Markup Language (QXML).

My first guess, before clicking on the link, was that QXML would be an XML vocabulary.

Wrong guess!

QXML appears to be an XML schema of the World Wide Web Consortium (W3C) Document Object Model (DOM).

Although the press release doesn’t give a definition of this term (new to me), its benefits are detailed:

With QXML, the new DOM schema for QuarkXPress, developers can dynamically access and update the content, structure, and style of a QuarkXPress project using a DOM interface. XTensions modules can be more versatile because they can use a project’s complete content, including all formatting, style sheets, hyphenation, and justification specifications.

A PDF white paper further explains:

One of the goals of an open-standard DOM is to produce common methods for accessing, modifying, creating, and deleting content within a DOM schema. If you are familiar with one particular DOM, understanding and working with another DOM is easy because you are already familiar with the common methods and metaphors applicable to all DOMs. This commonality gives the DOM a distinct advantage over traditional C/C++ application programming interfaces (APIs).

I searched quite a lot, both on Quark’s web site and on the Internet, and did not find any reference to the possibility of using the XML document directly, or any description of the XML vocabulary itself.

To me, it looks like Quark is just missing the point of XML.

XML is about letting anyone read documents in a text editor and write them through print statements…

Whether you like it or not, you just can’t publish documents in XML and still constrain developers to use your own API, only available to your official partners on your company web site and working only on the platforms and languages that you support!

And since you can’t prevent people from using the XML format directly, you had better document it…

Good old entities

There is a tendency among XML gurus to deprecate everything in the XML recommendation that is not an element or an attribute. XML constructs such as comments or processing instructions have been deprecated de facto by specifications such as W3C XML Schema, which have reinvented their own element-based replacements.

Many people also think that DTDs are an archaism that should be removed from the spec.

Lacking an SGML background, I am not a big fan of DTDs, but there are cases where they can be very useful.

I came upon one of these cases this afternoon while implementing multi-level grouping in XSLT 1.0 following the so-called Muenchian method.

At the fourth level, I ended up with XPath expressions that looked like these:

        <xsl:when test="key('path4', concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤',
          ../path/step[3], '¤', ../path/step[4]))[../path/step[5]] ">
          <xs:complexType>
            <xsl:if test="key('path4', concat(@name, '¤', ../path/step[1], '¤', ../path/step[2],
              '¤', ../path/step[3], '¤', ../path/step[4]))[../path/step[5][starts-with(., '@')]]">
              <xsl:attribute name="mixed">true</xsl:attribute>
            </xsl:if>
            <xs:sequence>
              <xsl:apply-templates select="key('path4', concat(@name, '¤', ../path/step[1], '¤',
                ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4]))[ count( . |
                key('path5', concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤',
                ../path/step[3], '¤', ../path/step[4], '¤', ../path/step[5]))[1]  )
                = 1 ]" mode="path5"/>
            </xs:sequence>
            <xsl:apply-templates select="key('path4', concat(@name, '¤', ../path/step[1], '¤',
              ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4]))[ count( . | key('path5',
              concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3], '¤',
              ../path/step[4], '¤', ../path/step[5]))[1]  ) = 1 ]"
              mode="path5Attributes"/>
          </xs:complexType>
        </xsl:when>

Isn’t it cute?

If you have to write such repetitive expressions, XML entities are your friends. Just write:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY path1 "concat(@name, '¤', ../path/step[1])">
<!ENTITY path2 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2])">
<!ENTITY path3 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3])">
<!ENTITY path4 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4])">
<!ENTITY path5 "concat(@name, '¤', ../path/step[1], '¤', ../path/step[2], '¤', ../path/step[3], '¤', ../path/step[4], '¤', ../path/step[5])">
<!ENTITY kCase "key('case', @name)">
<!ENTITY kPath1 "key('path1', &path1;)">
<!ENTITY kPath2 "key('path2', &path2;)">
<!ENTITY kPath3 "key('path3', &path3;)">
<!ENTITY kPath4 "key('path4', &path4;)">
<!ENTITY kPath5 "key('path5', &path5;)">

]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="1.0">
    <xsl:import href="excel2model.xsl"/>
    <xsl:output media-type="xml" indent="yes"/>
    <xsl:key name="case" match="case" use="@name"/>
    <xsl:key name="path1" match="case" use="&path1;"/>
    <xsl:key name="path2" match="case[../path/step[2]]" use="&path2;"/>
    <xsl:key name="path3" match="case[../path/step[3]]" use="&path3;"/>
    <xsl:key name="path4" match="case[../path/step[4]]" use="&path4;"/>
    <xsl:key name="path5" match="case[../path/step[5]]" use="&path5;"/>
            ...

And you’ll be able to simplify the previous snippet to:

                <xsl:when test="&kPath4;[../path/step[5]] ">
                    <xs:complexType>
                        <xsl:if test="&kPath4;[../path/step[5][starts-with(., '@')]]">
                            <xsl:attribute name="mixed">true</xsl:attribute>
                        </xsl:if>
                        <xs:sequence>
                            <xsl:apply-templates select="&kPath4;[ count( . | &kPath5;[1] ) = 1 ]"
                                  mode="path5"/>
                        </xs:sequence>
                        <xsl:apply-templates select="&kPath4;[ count( . | &kPath5;[1]  ) = 1]"
                                  mode="path5Attributes"/>
                    </xs:complexType>
                </xsl:when>

Doesn’t that look better?

Normalizing Excel’s SpreadsheetML using XSLT

Spreadsheet tables are full of holes and spreadsheet processors such as OpenOffice or Excel have implemented hacks to avoid having to store empty cells.

In the case of Excel, that’s done using ss:Index and ss:MergeAcross attributes.

While these attributes are easy enough to understand, they add a great deal of complexity to XSLT transformations that need to access a specific cell, since you can no longer index your target directly.

The traditional way to work around this kind of issue is to pre-process your spreadsheet document to get an intermediary result that lets you index your target cells.

Having already encountered this issue with OpenOffice, I needed something to do the same with Excel when Google led me to a blog entry proposing a similar transformation.

The transformation needed some adaptation to be usable the way I wanted to use it, i.e. as a transformation that does not modify your SpreadsheetML document except for adding an ss:Index attribute to every cell.
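To make the intent concrete, here is a hand-written sketch of what such a normalization is expected to do on a single row. The row content is made up for the illustration, and the computed indexes assume that a cell with ss:MergeAcross="n" occupies n + 1 columns. Before normalization, only one cell carries an explicit position:

```xml
<!-- Hypothetical input row: positions are partly implicit -->
<Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"
     xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Cell><Data ss:Type="String">a</Data></Cell>
  <Cell ss:Index="4"><Data ss:Type="String">b</Data></Cell>
  <Cell ss:MergeAcross="1"><Data ss:Type="String">c</Data></Cell>
  <Cell><Data ss:Type="String">d</Data></Cell>
</Row>
```

After normalization, every cell states its own column, so a specific cell can be selected with a simple predicate such as ss:Cell[@ss:Index = 7]:

```xml
<!-- Same row with an explicit ss:Index on every cell -->
<Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"
     xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Cell ss:Index="1"><Data ss:Type="String">a</Data></Cell>
  <Cell ss:Index="4"><Data ss:Type="String">b</Data></Cell>
  <!-- occupies columns 5 and 6 because of ss:MergeAcross="1" -->
  <Cell ss:MergeAcross="1" ss:Index="5"><Data ss:Type="String">c</Data></Cell>
  <Cell ss:Index="7"><Data ss:Type="String">d</Data></Cell>
</Row>
```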

Here is the result of this adaptation:

This version is buggy. An updated one is available here

<?xml version="1.0"?>
<!--

Adapted from http://ewbi.blogs.com/develops/2004/12/normalize_excel.html

This product may incorporate intellectual property owned by Microsoft Corporation. The terms
and conditions upon which Microsoft is licensing such intellectual property may be found at
http://msdn.microsoft.com/library/en-us/odcXMLRef/html/odcXMLRefLegalNotice.asp.
-->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:schemas-microsoft-com:office:spreadsheet"
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <xsl:output method="xml" indent="no" encoding="UTF-8"/>
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="ss:Cell/@ss:Index"/>
    <xsl:template match="ss:Cell">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:variable name="prevCells" select="preceding-sibling::ss:Cell"/>
            <xsl:attribute name="ss:Index">
                <xsl:choose>
                    <xsl:when test="@ss:Index">
                        <xsl:value-of select="@ss:Index"/>
                    </xsl:when>
                    <xsl:when test="count($prevCells) = 0">
                        <xsl:value-of select="1"/>
                    </xsl:when>
                    <xsl:when test="count($prevCells[@ss:Index]) > 0">
                        <xsl:value-of select="($prevCells[@ss:Index][1]/@ss:Index) +
                            ((count($prevCells) + 1) -
                            (count($prevCells[@ss:Index][1]/preceding-sibling::ss:Cell)
                            + 1)) + sum($prevCells/@ss:MergeAcross)"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:value-of select="count($prevCells) + 1 +
                            sum($prevCells/@ss:MergeAcross)"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:attribute>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
