RELAX NG text versus data – Eric van der Vlist

Uche Ogbuji asked me on IRC:

Eric, I looked in your book for a general discussion of <data type="string"/> versus <text/>, but I didn’t find one.

It’s a question that I’ve come across a few times. I might have missed something or perhaps I and others over-think the distinction :-)

From my reading of the specs, it seems to me there is no lexical difference but that there are differences with respect to pattern restrictions (the infamous data typing restrictions in RNG), so it seems to me one should always use <text/> unless they’re sure they know what they’re doing.

Also, any thoughts about the « simple string type considered harmful » note in WXS? It seems to raise a respectable point that not allowing child elements in elements to contain prose can lead to unexpected restrictions.

That’s a common question, or if not, it’s a question that anyone writing RELAX NG or W3C XML Schema schemas should ask themselves, and I think it might be useful to publish my thoughts on the subject.

To make it short, I fully agree with Uche’s analysis, and here are my reasons:

The names of the elements of the RELAX NG XML syntax have been chosen very carefully: basically, <data/> has been designed for data-oriented applications (or parts of applications) and <text/> has been designed for text-oriented (i.e. document-oriented) applications.

The first and best rule of thumb is thus probably: if you think of the content of an element or attribute as « a piece of text », you should use <text/>, and if you think of it as « data », you should use <data/>.

As mentioned by Uche, there is no lexical difference: element foo {text}, element foo {string}, and element foo {token} (i.e. the notation, in the RNG compact syntax, for an element foo containing text, string data, or token data) validate exactly the same set of elements. The difference is in the restrictions attached to these two kinds of patterns. The main restrictions are that:

With the « text-oriented » pattern (element foo {text}), it is possible and even easy to extend the model to add sub-elements and make it mixed content, but you can’t add facets that would restrict your text nodes.
With the « data-oriented » pattern (element foo {string}), you can add restrictions to your datatype if the datatype library that you are using supports them, which is the case for the W3C XML Schema datatype library, as in element foo {xsd:token {maxLength="10"}}, but you can’t extend the content model into mixed content.

Now, why do I consider string datatypes harmful?

This datatype is harmful because it doesn’t behave like any other datatype: spaces are normalized for every datatype except string and xsd:normalizedString before text nodes reach the datatype library, and they are left unchanged for string datatypes.

This makes a real difference as soon as you apply restrictions to your datatypes.

Consider for instance: element foo {xsd:boolean "true"} and element foo {xsd:string "true"}.

The first one will accept <foo> true </foo>, while the second one will reject it, and there is nothing you can do about it because whitespace is handled before the text reaches your datatype library.
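
To make the difference concrete, here is a minimal Python sketch of the whitespace handling described above. It is not how any RELAX NG validator is actually implemented: the WHITESPACE table and the normalize and matches_value functions are names made up for this illustration, and the comparison is purely lexical (a real xsd:boolean compares values, so « 1 » would also match « true »).

```python
import re

# Whitespace handling per datatype, as described in the text: string keeps
# the text untouched, the other datatypes listed here collapse whitespace.
# (Illustrative table; not taken from any validator's source code.)
WHITESPACE = {
    "xsd:string": "preserve",
    "xsd:token": "collapse",
    "xsd:boolean": "collapse",
}


def normalize(datatype: str, text: str) -> str:
    """Return the text as the datatype would see it (hypothetical helper)."""
    if WHITESPACE[datatype] == "preserve":
        return text
    # Collapse: runs of XML whitespace become one space, both ends are trimmed.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def matches_value(datatype: str, text: str, expected: str) -> bool:
    """Simplified stand-in for a value pattern such as xsd:boolean "true"."""
    return normalize(datatype, text) == normalize(datatype, expected)


# Content of <foo> true </foo> checked against the two patterns above.
print(matches_value("xsd:boolean", " true ", "true"))  # True
print(matches_value("xsd:string", " true ", "true"))   # False
```

The only thing that changes between the two calls is the whitespace mode attached to the datatype, which is exactly why the schema author has no way to work around it.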

That’s why I strongly advise using token (or xsd:token) instead of string (or xsd:string) in 99% of cases.

And don’t think that this advice applies only to cases like the preceding example, where we have actual « tokens » (such as « true »). The name of the token datatype is very misleading: its lexical space is the same as the lexical space of string. In other words, « this is a string », even with leading, trailing, or repeated spaces, is a valid token (or xsd:token). The difference is that if it’s a token datatype, the spaces will be normalized and what will reach the datatype library is « this is a string ».
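
As a rough illustration of that normalization, the following Python sketch collapses whitespace in a piece of prose. The function name collapse and the sample string are assumptions chosen for the example; it only mimics the behaviour described above and is not code from a datatype library.

```python
import re


def collapse(text: str) -> str:
    """Collapse XML whitespace as described above for the token datatype:
    runs of space, tab, CR and LF become one space, both ends are trimmed.
    (Hypothetical helper name, for illustration only.)"""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


raw = "  this   is a\n  string  "
print(repr(raw))            # what a string datatype would receive, unchanged
print(repr(collapse(raw)))  # what a token datatype would receive: 'this is a string'
```

Nothing in the sketch rejects multi-word input: the « token » name says nothing about the number of words, only about how the spaces between and around them are treated.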

How are these two questions linked?

The fact that <data/> is meant for data-oriented applications while <text/> is meant for document-oriented applications reduces the need to ever use the string datatype: if your spaces are meaningful, such as within an HTML <pre/> element, chances are that you’re dealing with a document-oriented application and that you should be using a <text/> pattern. On the other hand, if you are designing a data-oriented application, chances are that spaces are not significant and are better normalized, to be consistent with the other datatypes that you are probably using.

The opportunities to use string or xsd:string datatypes are thus very infrequent.

How does that translate into W3C XML Schema?

A good way to answer the question is to see what James Clark’s converter, trang, thinks about it.

If you convert a pattern such as « element foo {attribute bar {text}, xsd:token} » into W3C XML Schema, you get:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="foo">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:token">
          <xs:attribute name="bar" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>

Sounds logical: a RELAX NG <data/> pattern is clearly a close match for W3C XML Schema simple types.

Now, what happens if we convert « element foo {attribute bar {text}, text} »? We get:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="foo">
    <xs:complexType mixed="true">
      <xs:attribute name="bar" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

And that clearly shows that the semantics of a <text/> pattern is that of a text node that has the potential of becoming a mixed content model, or in other words, mixed content without sub-elements!

It also shows a smart (and probably underused) option for those of us who are using W3C XML Schema: representing text nodes as mixed content models rather than simple string or token types in document-oriented applications.
