Web 2.0: myth and reality

Web 2.0 is both a new buzzword and real progress. In this article, I’ll try to separate the myth from the reality.

Note 

This article is a translation of the article published in French on XMLfr and presented at  sparklingPoint.

In true “Web 2.0 fashion”, this version integrates many comments from XMLfr editors and sparklingPoint participants, and I’d like to thank them for their contributions.

Definition 

The first difficulty in forming an opinion about Web 2.0 is determining its scope.

When you need to say whether an application is XML or not, that’s quite easy: it is an XML application if and only if it conforms to the XML 1.0 (or 1.1) recommendation.

That’s not so easy for Web 2.0 since Web 2.0 is not a standard but a set of practices. 

In that sense, Web 2.0 can be compared to REST (Representational State Transfer) which is also a set of practices.

Fair enough, you’ll say, but it’s easy to tell whether an application is RESTful. Why would that be different for Web 2.0?

REST is a concept that is clearly described in a single document: Roy Fielding’s thesis which gives a precise definition of what REST is.

Web 2.0, on the contrary, is a blurry concept that aggregates a number of trends, and everyone seems to have their own definition of it, as you can see from the number of articles describing what Web 2.0 is.

If we really must define Web 2.0, I’ll take two definitions.

The first is the one given by the French version of Wikipedia:

Web 2.0 is a term often used to describe what is perceived as an important transition of the World Wide Web, from a collection of web sites to a computing platform providing web applications to users. The proponents of this vision believe that the services of Web 2.0 will come to replace traditional office applications.

This article also gives a history of the term:

The term was coined by Dale Dougherty of O’Reilly Media during a brainstorming session with MediaLive International to develop ideas for a conference that they could jointly host. Dougherty suggested that the Web was in a renaissance, with changing rules and evolving business models.

It goes on to give a series of examples that illustrate the difference between the good old “Web 1.0” and Web 2.0:

DoubleClick was Web 1.0; Google AdSense is Web 2.0. Ofoto is Web 1.0; Flickr is Web 2.0.

Google, which launched AdSense in 2003, was doing Web 2.0 without knowing it, a year before the term was coined in 2004!

Technical layer

Let’s focus on the technical side of Web 2.0 first.

One of the characteristics of Web 2.0 is that it is available to today’s users running reasonably recent versions of any browser. That’s one of the reasons why Mike Shaver said in his opening keynote at XTech 2005 that “Web 2.0 isn’t a big bang but a series of small bangs”.

Restricted by the installed base of browsers, Web 2.0 has no choice but to rely on technologies that can be qualified as “mature”:

  • HTML (or XHTML pretending to be HTML, since Internet Explorer doesn’t accept XHTML documents declared as such): the latest version of HTML was published in 1999.
  • The subset of CSS 2.0 supported by Internet Explorer: CSS 2.0 was published in 1998.
  • JavaScript: a technology introduced by Netscape in its browser in 1995.
  • XML: published in 1998.
  • Atom or RSS syndication: RSS was created by Netscape in 1999.
  • The HTTP protocol: the latest HTTP version was published in 1999.
  • URIs: published in 1998.
  • REST: a thesis published in 2000.
  • Web services: XML-RPC APIs for JavaScript were already available in 2000.

The usage of XML over HTTP in asynchronous mode has been given the name “Ajax”.

Web 2.0 appears to be the full appropriation by web developers of mature technologies to achieve a better user experience.

If it’s a revolution, it is a revolution in the way these technologies are used together, not in the technologies themselves.

Office applications

Can these old technologies really replace office applications? Is Web 2.0 about rewriting MS Office in JavaScript so that it runs in a browser?

Probably not, if the goal were to keep the same paradigm with the same level of features.

We often quote the famous “80/20” rule, according to which 80% of the features require only 20% of the development effort, and sensible applications should focus on those 80% of features.

Office applications crossed the 80/20 border line years ago and invented a new kind of 80/20 rule: 80% of the users probably use less than 20% of the features.

I think that a Web 2.0 application focusing on the genuine 80/20 rule for a restricted domain or group of users would be tough competition for traditional office applications.

This seems to be the case for applications such as Google Maps (which could compete with GIS applications at the low end of the market) or some of the new WYSIWYG text editing applications flourishing on the web.

A motivation that may push users to adopt these web applications is the attractiveness of systems that help us manage our data.

This is the case for Gmail, Flickr, del.icio.us or LinkedIn, to name a few: while these applications relieve us of the burden of technically managing our data, they also give us remote access from any device connected to the internet.

What is seen today as a significant advantage for managing our mails, pictures, bookmarks or contacts could be seen in the future as a significant advantage for managing our office documents.

Social layer

While the French version of Wikipedia has the benefit of being concise, it is slightly out of date and doesn’t describe the second layer of Web 2.0, further developed during the second Web 2.0 conference in October 2005.

The English version of Wikipedia adds the following examples to the list of Web 1.0/Web 2.0 sites:

Britannica Online (1.0) / Wikipedia (2.0), personal sites (1.0) / blogging (2.0), content management systems (1.0) / wikis (2.0), directories (taxonomy) (1.0) / tagging (“folksonomy”) (2.0)

These examples are interesting because technically speaking, Wikipedia, blogs, wikis or folksonomies are mostly Web 1.0.

They illustrate what Paul Graham is calling Web 2.0 “democracy”.

Web 2.0 democracy is the idea that, to “lead the web to its full potential” (as the W3C tagline says), the technical layer of the internet must be complemented by a human network of its users who produce, maintain and improve its content.

There is nothing new here either: I remember Edd Dumbill launching WriteTheWeb in 2000, “a community news site dedicated to encouraging the development of the read/write web”, because the “tide is turning” and the web is no longer a one-way web.

This social effect was also the guiding theme of Tim O’Reilly’s keynote at OSCON 2004, one year before it became the social layer of Web 2.0.

Another definition

With a technical and a social layer, isn’t Web 2.0 becoming a shapeless bag in which we’re grouping anything that looks new on the web?

The technical layer can be seen as a consequence of the social layer: it is needed to provide the interactivity that the social layer requires.

This analysis would exclude from Web 2.0 applications such as Google Maps, which have no social aspect but are often cited as typical examples of Web 2.0.

Paul Graham  tries to find common trends between these layers in the second definition that I’ll propose in this article:

Web 2.0 means using the web the way it’s meant to be used. The “trends” we’re seeing now are simply the inherent nature of the web emerging from under the broken models that got imposed on it during the Bubble.

This second definition reminds me of other taglines and buzzwords heard over the past years:

  • The W3C tagline is “Leading the Web to Its Full Potential”. Ironically, Web 2.0 is happening, technically based on many technologies specified by the W3C, without the W3C… It is very tempting to interpret the recent announcement of a “Rich Web Clients Activity” as an attempt to catch a moving train.
  • Web services are an attempt to make the web available to applications, something that was meant to happen from the early days of Web 1.0.
  • The Semantic Web (which seems to have completely missed the Web 2.0 train) is the second generation of the web as seen by the inventor of Web 1.0.
  • REST is a description of web applications using the web as it is meant to be used.
  • XML is “SGML on the web”, made possible by using HTTP as it was meant to be used.
  • … 

Here again, Web 2.0 appears to be the continuation of the “little big bangs” of the web.

Technical issues

In maths, continuous isn’t the same as differentiable and in technology too, continuous evolutions can change direction.

Technical evolutions are often a consequence of changes in priorities that lead to these changes of direction.

The priorities of client/server applications that we developed in the 90’s were:

  • the speed of the user interfaces,
  • their quality,
  • their transactional behaviour,
  • security.

They’ve been swept aside by web applications, whose priorities are:

  • a universal addressing system,
  • universal access,
  • global fault tolerance: when a computer stops, some services might stop working but the web as a whole isn’t affected,
  • scalability (web applications support more users than client/server applications ever dreamed of supporting),
  • a relatively coherent user interface that enables sharing services through URIs,
  • open standards.

Web 2.0 is bringing back some of the priorities of client/server applications, and one needs to be careful that these priorities are met without compromising what makes the strength of the web.

Technically speaking, we are lucky enough to have best practices formalized in REST, and Web 2.0 developers should be careful to design RESTful exchanges between browsers and servers to take full advantage of the web.

Ergonomic issues

Web 2.0 applications run in web browsers, and they should make sure that users can keep their Web 1.0 habits, especially with respect to URIs (including the ability to create bookmarks, send URIs by mail and use the back and forward buttons).

Let’s take a simple example to illustrate the point.

Have you noticed that Google, presented as a leading-edge Web 2.0 company, is stubbornly Web 1.0 in its core business: the search engine itself?

It is easy to imagine what a naïve Web 2.0 search engine might look like.

It might start with a search page similar to the current Google Suggest: when you start typing your query terms, the service suggests possible completions.

When you sent the query, the page wouldn’t reload. Some animation could keep you waiting, even if that’s usually not necessary with a high-speed connection to Google. The query would be sent and the results brought back asynchronously, and the list of matches would be displayed in the same page.

The user experience would be fast and smooth, but there are enough drawbacks with this scenario that Google doesn’t seem to find it worth trying:

  • The URI in the address bar would stay the same: users would have no way to bookmark a search result or to copy and paste it to send to a friend.
  • The back and forward buttons would not work as expected.
  • These result pages would not be accessible to crawlers.

A web developer implementing this Web 2.0 application should take care to provide good workarounds for each of these drawbacks. This is certainly possible, but it requires some effort.

Falling into these traps would be really counter-productive for Web 2.0, since, as we have seen, it is ergonomic concerns, making the web easier to use, that justify this evolution in the first place.

Development

The last point on which one must be careful when developing Web 2.0 applications is development tools.

The flow of press releases from software vendors announcing development tools for Ajax-based applications may eventually put an end to this problem, but Web 2.0 today often means developing complex scripts that are subject to interoperability issues between browsers.

Does that mean that Web 2.0 should ignore declarative definitions of user interfaces (such as those in XForms, XUL or XAML), or even the 4GLs that were invented for client/server applications in the early ’90s?

A way to avoid this regression is to use a framework that hides most of the JavaScript development.

Catching up with the popular Ruby on Rails, web publishing frameworks are beginning to offer Web 2.0 extensions.

This is the case for Cocoon, whose new version 2.1.8 includes support for Ajax, but also for Orbeon PresentationServer, whose version 3.0 includes fully transparent support for Ajax through its XForms engine.

This feature makes it possible to write user interfaces in standard XForms (without a single line of JavaScript) and to deploy these applications on today’s browsers, the system using Ajax interactions between browsers and servers to implement XForms.

Published in 2003, XForms is only two years old, way too young to be part of the Web 2.0 technical stack… Orbeon PresentationServer is a nifty way to use XForms before it can join the other Web 2.0 technologies!

Business model

What about the business model?

Paul Graham’s definition, for whom Web 2.0 is a web rid of the bad practices of the internet bubble, is interesting when you know that some analysts believe that a Web 2.0 bubble is on its way.

This is the case for Rob Hof (BusinessWeek), who makes a two-step argument:

1) “It costs a whole lot less to fund companies to revenue these days”, which Joe Kraus (JotSpot) explains by the fact that:

  • “Hardware is 100X cheaper”,  
  • “Infrastructure software is free”, 
  • “Access to Global Labor Markets”, 
  • Internet marketing is cheap and efficient for niche markets. 

2) Even though venture capital investment seems to be staying level, lower costs mean that many more companies are being funded with the same level of investment. Furthermore, lower costs also mean that more companies can be funded outside of VC funds.

Rob Hof also remarks that many Web 2.0 startups are created with no other business model than being sold in the short term.

Even if it is composed of smaller bubbles, a Web 2.0 bubble might be on the way…

Here again, the golden rule is to learn from the Web 1.0 experience.

Data Lock-In Era

If we need a solid business model for Web 2.0, what can it be?

One of the answers to this question was in the Tim O’Reilly keynote at OSCON 2004 that I have already  mentioned.

Giving his views on the history of computer technologies since their beginnings, Tim O’Reilly showed how this history can be split into three eras:

  • During the “Hardware Lock-In” era, computer manufacturers ruled the market.
  • Then came the “Software Lock-In” era dominated by software vendors.
  • We are now entering the “Data Lock-In” era.

In this new era, illustrated by the success of sites such as Google, Amazon or eBay, the dominant players are companies that can gather more data than their competitors, and their main asset is the content given or lent to them by their users for free.

When you outsource your mail to Google, publish a review or even buy something on Amazon, upload your pictures to Flickr or add a bookmark to del.icio.us, you tie yourself to that site and trade a service for their use of your data.

A number of people are speaking out against what François Joseph de Kermadec calls the “fake freedom” given by Web 2.0.

Faced with this fake freedom, users should be careful:

  • to trade their data only for real services,
  • to look into the terms of use of each site to know which rights they grant in exchange for these services,
  • to demand technical means, based on open standards, to get their data back.

So what?

What are the conclusions of this long article?

Web 2.0 is a term that describes a new web that is emerging right now.

This web will use the technologies that we already know in creative ways to develop a collaborative “two way web”.

Like any other evolution, Web 2.0 comes with a series of risks: technical, ergonomic, financial and threats against our privacy.

Beyond the marketing buzzword, Web 2.0 is a fabulous bubble of new ideas, practices and usages.

The fact that its shape is still so blurred shows that everything is still open and that personal initiatives are still important.

The Web 2.0 message is a message of hope!

The complex simple problem of media types

Media types (previously called MIME types) have always struck me as something which is simple in theory but awfully complex in practice.

When they seem to be working, you can enjoy your luck and be sure that it is only temporary until the next release of software X or Y: after the recent upgrade from Ubuntu Hoary to Breezy, my workstation insists that my “audio/x-mpegurl” playlists are “text/plain”, and when I associate XMMS with these files it uses XMMS to open all my “text/plain” documents!

I recently had the opportunity to look a little deeper into these issues for a project of mine which needs to determine the media types of files in Java.

Freedesktop.org comes to the rescue

The problem is more complex than it appears, and it’s comforting to know that some people seem to be doing exactly what needs to be done to fix it.

The freedesktop.org project has been working for a while on a shared database of media types and has published its specification.

GNOME and KDE are participating, and I hope this means the end of the media type nightmare on my desktop…

I really like the principles they have adopted, especially the simple XML format used to describe the media types (which they still call MIME types).

One surprising thing when you first open the XML document describing this shared MIME type database is that it includes an internal DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mime-info [
  <!ELEMENT mime-info (mime-type)+>
  <!ATTLIST mime-info xmlns CDATA #FIXED "http://www.freedesktop.org/standards/shared-mime-info">

  <!ELEMENT mime-type (comment|glob|magic|root-XML|alias|sub-class-of)*>
  <!ATTLIST mime-type type CDATA #REQUIRED>

  <!ELEMENT comment (#PCDATA)>
  <!ATTLIST comment xml:lang CDATA #IMPLIED>

  <!ELEMENT glob EMPTY>
  <!ATTLIST glob pattern CDATA #REQUIRED>

  <!ELEMENT magic (match)+>
  <!ATTLIST magic priority CDATA #IMPLIED>

  <!ELEMENT match (match)*>
  <!ATTLIST match offset CDATA #REQUIRED>
  <!ATTLIST match type (string|big16|big32|little16|little32|host16|host32|byte) #REQUIRED>
  <!ATTLIST match value CDATA #REQUIRED>
  <!ATTLIST match mask CDATA #IMPLIED>

  <!ELEMENT root-XML EMPTY>
  <!ATTLIST root-XML
  	namespaceURI CDATA #REQUIRED
	localName CDATA #REQUIRED>

  <!ELEMENT alias EMPTY>
  <!ATTLIST alias
  	type CDATA #REQUIRED>

  <!ELEMENT sub-class-of EMPTY>
  <!ATTLIST sub-class-of
  	type CDATA #REQUIRED>
]>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">

That’s not very usual, and it’s often considered bad practice since the DTD isn’t shared between documents. But when you think about it, in this specific context, where there should be only one such document per machine, it makes perfect sense.

Using an internal DTD solves all the packaging issues: there is only one self-contained document to ship, and this document has no external dependencies. Furthermore, the DTD is pretty straightforward, and including it in the document itself makes it more self-describing.
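
As an illustration, here is a minimal Java sketch of the kind of glob lookup this database makes possible, along the lines of what my project needs. The file location (/usr/share/mime/packages/freedesktop.org.xml) is an assumption that may not match your installation, and only the common “*.ext” patterns are handled:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class GlobLookup {
    private static final String NS = "http://www.freedesktop.org/standards/shared-mime-info";
    private final Map<String, String> extensionToType = new HashMap<String, String>();

    public GlobLookup(File mimeInfo) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document document = factory.newDocumentBuilder().parse(mimeInfo);
        NodeList types = document.getElementsByTagNameNS(NS, "mime-type");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            NodeList globs = type.getElementsByTagNameNS(NS, "glob");
            for (int j = 0; j < globs.getLength(); j++) {
                String pattern = ((Element) globs.item(j)).getAttribute("pattern");
                // Only the common "*.ext" patterns are handled; real globs can be richer.
                if (pattern.startsWith("*.")) {
                    extensionToType.put(pattern.substring(2).toLowerCase(), type.getAttribute("type"));
                }
            }
        }
    }

    // Returns the media type guessed from the file name extension, or null if unknown.
    public String guessFromName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return (dot < 0) ? null : extensionToType.get(fileName.substring(dot + 1).toLowerCase());
    }

    public static void main(String[] args) throws Exception {
        // Assumed location of the merged database; adjust to your system.
        GlobLookup lookup = new GlobLookup(new File("/usr/share/mime/packages/freedesktop.org.xml"));
        System.out.println(lookup.guessFromName("logo.svg"));
    }
}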

This vocabulary is meant to be extensible through namespaces:

Applications may also define their own elements, provided they are namespaced to prevent collisions. Unknown elements are copied directly to the output XML files like comment elements.

I think they could also have allowed attributes from other namespaces, as these are usually quite harmless.

The mechanism defined to differentiate between types of XML documents appears somewhat weak, since it relies only on the namespace and local name of the root element.

Without even mentioning the problem of compound documents, this mechanism completely misses the fact that a document such as:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns="http://www.w3.org/1999/xhtml" xsl:version="1.0">
    .../...
</html>

isn’t an XHTML document but an XSLT transformation!

The other thing I regret, probably because I am not familiar with these issues, is that the description of the “magic” rules is very terse.

One of the questions I would have liked to see answered is which encodings should be tried when doing string matching. When I see the following rule in the description of the “application/xml” type:

    <magic priority="50">
      <match value="&lt;?xml" type="string" offset="0"/>
    </magic>

I have the feeling that different encodings should be tried: ASCII matching would also work for UTF-8, ISO-8859-1 and the like, but it would fail for UTF-16 or EBCDIC…

On the other hand, there are probably many text formats that do not support UTF-16 or EBCDIC, and for which it would be a mistake to try these encodings…
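
To make this concrete, here is a minimal Java sketch of an encoding-aware string match; the list of charsets to try is my own guess (the EBCDIC name “Cp1047” in particular may not be available on every JVM) rather than anything the specification mandates:

import java.io.UnsupportedEncodingException;

public class MagicStringMatcher {

    // Charsets worth trying for a textual magic rule; Cp1047 is one EBCDIC variant.
    private static final String[] ENCODINGS = {"US-ASCII", "UTF-8", "UTF-16BE", "UTF-16LE", "Cp1047"};

    // True if the bytes at the given offset match the value in at least one encoding.
    public static boolean matches(byte[] data, int offset, String value) {
        for (String encoding : ENCODINGS) {
            try {
                byte[] expected = value.getBytes(encoding);
                if (offset + expected.length > data.length) {
                    continue;
                }
                boolean same = true;
                for (int i = 0; i < expected.length; i++) {
                    if (data[offset + i] != expected[i]) {
                        same = false;
                        break;
                    }
                }
                if (same) {
                    return true;
                }
            } catch (UnsupportedEncodingException e) {
                // This JVM doesn't know this charset: just skip it.
            }
        }
        return false;
    }
}

With a rule such as the one above, matches(head, 0, "<?xml") would then catch UTF-16 documents that a pure ASCII comparison would miss.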

More implementations needed!

Having found this gem, I was pretty confident I would find a Java implementation…

There is one buried in Sun’s Java Desktop System (which isn’t open source) and one in Nutch, but that seems to be pretty much all we have available!

The MimeType class in Nutch appears to be quite minimal. It probably does what most applications want to do, but that’s not enough for what I’d like to achieve.

The MIME type database has some advanced features such as type hierarchies: a MIME type can be a subclass of other MIME types; for instance, all the text types are subclasses of the text/plain type.

These hierarchies can be implicit or explicit and they support multiple inheritance. The freedesktop.org specification gives the following example:

Some types may or may not be instances of other types. For example, a spreadsheet file may be compressed or not. It is a valid spreadsheet file either way, but only inherits from application/x-gzip in one case. This information cannot be represented statically; instead an application interested in this information should run all of the magic rules, and use the list of types returned as the subclasses.

These hierarchies should be used by user interfaces: instead of offering only the tools registered for a specific type, a user interface should also offer the tools registered for its parent classes. If they did, they would offer to open an SVG document with an XML or text editor, or to open an OpenOffice document with a zip extractor, which can be very handy.

Those are the kind of features I’d expect to see in a MIME type API, and I wonder whether I won’t have to write my own Java implementation to see that happen!
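
For what it’s worth, here is a sketch of the kind of hierarchy-aware API I have in mind; these class and method names are hypothetical and do not correspond to Nutch or to any existing library:

import java.util.HashSet;
import java.util.Set;

// Hypothetical hierarchy-aware MIME type: the names are mine, not Nutch's.
public class MimeType {
    private final String name;
    private final Set<MimeType> parents = new HashSet<MimeType>();

    public MimeType(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    // Declares this type a subclass of another, as the <sub-class-of> element does.
    public void addParent(MimeType parent) {
        parents.add(parent);
    }

    // True if this type is, directly or transitively, a subclass of the other type.
    public boolean isSubTypeOf(MimeType other) {
        if (this == other) {
            return true;
        }
        for (MimeType parent : parents) {
            if (parent.isSubTypeOf(other)) {
                return true;
            }
        }
        return false;
    }
}

A user interface could then walk isSubTypeOf() to offer, say, a text editor for an SVG document whose ancestry leads back to application/xml and text/plain, as described above.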

TreeBind: one infoset to bind them all

This is the first entry in a series dedicated to the TreeBind generic binding API.
I have recently made good progress on the extensive refactoring of TreeBind required by my proposal to support RDF, and it’s time to start explaining these changes.

The first of these changes is the infoset on which TreeBind now relies.

TreeBind’s goal is to propose and implement a binding mechanism that can support XML and Java objects, but also RDF and LDAP (my new implementation includes support for these two models as data sources) and, potentially, other sources such as relational databases or even PSVIs…

In order to cover all these data sources, TreeBind required an infoset (or data model) which is a superset of the data models of these sources.

The new TreeBind infoset is simple enough to cope with all these data models. It consists of:

  • Names. These different sources have different ways of defining names. Names can include both a domain name and a local name (that’s the case for XML with namespaces, but also for Java class names and packages), they can include only a local name (that’s the case for LDAP, but also for Java method names), or they can be more complex, like XML attribute names, where the namespace of the parent element has a role to play.
  • Complex properties. These are non-leaf properties. A complex property has a nature, which is a name, and a set of embedded properties that are either complex or leaf properties. When a sub-property is attached to a property, the attachment carries a role, which is also a name.
  • Leaf properties. Leaf properties have a nature (which is a name) and a value.

That’s all…
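
To make this more concrete, here is one possible way to express this infoset as Java interfaces; this is only my illustration, not the actual TreeBind code, and the names are hypothetical:

import java.util.List;

// One possible Java rendition of the infoset; illustration only, not the TreeBind source.
interface Name {
    String getDomain();      // may be null for purely local names (LDAP, Java methods, ...)
    String getLocalName();
}

// Every property has a nature, which is a name.
interface Property {
    Name getNature();
}

// A leaf property carries a value in addition to its nature.
interface LeafProperty extends Property {
    Object getValue();
}

// A complex property embeds sub-properties, each attached under a role.
interface ComplexProperty extends Property {
    List<Attachment> getAttachments();
}

// The attachment of a sub-property to its parent, carrying the role name.
interface Attachment {
    Name getRole();
    Property getProperty();
}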

This is enough to differentiate, for instance, an XML element from an XML attribute because their names belong to different name classes.

This could, potentially, allow TreeBind to cope with mixed content by adding a new class of names to support text nodes. This is not implemented for the moment, simply because I don’t have any business case to justify the additional workload.

If needed, the same could be done to support other XML constructions such as PIs and comments.

A concept which is clearly missing and should probably be added in a future version is the concept of identity.

Names are used to identify the nature of the objects and the roles they play in the different associations.

When we use TreeBind to bind not only trees but also graphs (which is the case of RDF, LDAP and even XML if we want to support some type of id/idref), we need to be able to identify objects in order to avoid creating binding loops.

This could be done by attaching an ID, which could itself be a name, to each property.
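
The loop problem itself is the classic graph traversal problem: remember the objects already visited, by identity, and stop recursing when one comes back. A minimal, TreeBind-independent illustration:

import java.util.IdentityHashMap;
import java.util.Map;

// Illustration only: avoid infinite recursion when binding a graph that contains cycles.
public class GraphWalker {

    // Objects already visited, keyed by identity rather than by equals().
    private final Map<Object, Boolean> visited = new IdentityHashMap<Object, Boolean>();

    public void walk(Object node) {
        if (visited.containsKey(node)) {
            // Already streamed: emit a reference (an ID) instead of recursing again.
            return;
        }
        visited.put(node, Boolean.TRUE);
        // ... stream the node's properties here and call walk() on each related object ...
    }
}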

So what?

The new version of TreeBind implements a SAX-like paradigm built on top of this simple infoset, just as SAX is (more or less) built on top of the XML infoset.

Binding a source to a sink is done by implementing or just using:

  • A source that will read the data source and stream properties.
  • A sink that receives streamed properties and creates the target data.
  • One or more filters to deal with the impedance mismatches between the source and the sink.

The strength of this architecture is that if the built-in pipe that does the binding is not flexible enough for your application, you can just add a filter that will cope with your specific requirements.
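
Continuing with hypothetical interfaces (again, my illustration rather than the actual TreeBind API, reusing the Name interface sketched above), the pipeline could look something like this, with a filter being a sink that forwards events to the next sink:

// Receives streamed properties, SAX-style; illustration only, not the real TreeBind API.
interface PropertySink {
    void startComplexProperty(Name role, Name nature);
    void leafProperty(Name role, Name nature, Object value);
    void endComplexProperty();
}

// Reads a data source (XML, Java objects, RDF, LDAP, ...) and streams it to a sink.
interface PropertySource {
    void writeTo(PropertySink sink);
}

// A filter is a sink that transforms events before forwarding them to the next sink.
abstract class PropertyFilter implements PropertySink {
    protected final PropertySink next;

    protected PropertyFilter(PropertySink next) {
        this.next = next;
    }

    public void startComplexProperty(Name role, Name nature) {
        next.startComplexProperty(role, nature);
    }

    public void leafProperty(Name role, Name nature, Object value) {
        next.leafProperty(role, nature, value);
    }

    public void endComplexProperty() {
        next.endComplexProperty();
    }
}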

We’ll explore all that in more detail in the next entries…

Using Orbeon PresentationServer with a recent version of eXist

Why would you want to do that?

Orbeon PresentationServer is currently shipping with eXist 1.0 beta2.

This is true of both OPS version 2.8 (the current stable release) and OPS 3.0 beta 3 (the latest beta of the next generation).

While eXist 1.0 beta2 is described as the stable version of this open source XML database, their web site displays the following health warning:

The 1.0 beta2 release is truly ancient now. There were lots of bug fixes and feature enhancements during the past months, so using beta2 cannot be recommended any more. Please download a newer development snapshot. Recent development snapshots can be regarded as stable. A new official “stable” release is in preparation, but as usual, we lack the time to complete the documentation. Any help will be welcome!

Among the many enhancements included in more recent versions, transactions and crash recovery are well worth mentioning:

After several months of development, eXist does now support full crash recovery. Crash recovery means that the database can automatically recover from an unclean termination, e.g. caused by a killed jvm, power loss, system reboot or hanging processes.

This might be one reason for the corruption I have noticed in my experience with OPS and eXist, and that has been my motivation for migrating http://apiculteurs.info to the latest eXist snapshot.

While this is not rocket science, the following notes may help you if you want to attempt the same migration.

Environment

My environment is Ubuntu Hoary, Sun’s j2sdk 1.4 and/or 1.5, Jetty and OPS 2.8, but the same procedure should be valid for other environments.

Migration

Database backup

The physical database format has changed between these versions, and if you need to keep your data through this migration, you have to back up the database using the eXist client before starting the actual migration.

I’ll cover how to use the eXist client with an eXist database embedded in OPS in a future blog entry; in the meantime, you can refer to this thread on the ops-users mailing list.

After you’ve done this backup, remove the content of the old database:

rm orbeon/WEB-INF/exist-data/*

Removing the old libraries

You should then stop your servlet and move to the “orbeon/WEB-INF/lib” directory, where you’ll find four eXist libraries:

orbeon/WEB-INF/lib/exist-1_0b2_build_1107.jar
orbeon/WEB-INF/lib/exist-optional-1_0b2_build_1107.jar
orbeon/WEB-INF/lib/xmldb-exist_1_0b2_build_1107.jar
orbeon/WEB-INF/lib/xmlrpc-1_2_patched_exist_1_0b2_build_1107.jar
            

Remove these four libraries from “orbeon/WEB-INF/lib” and keep them somewhere else in case you want to move back to eXist 1.0 beta2 later on.

Installing the eXist snapshot

Install the eXist snapshot through:

java -jar eXist-snapshot-20050805.jar

Choose whatever directory you want for this new version, but keep it outside of your OPS install: we are doing this installation only to get the new libraries!

Installing the new libraries

You need to copy five eXist libraries into “orbeon/WEB-INF/lib”. If you’ve installed eXist in “/opt/eXist”, move to “orbeon/WEB-INF/lib” and type:

cp /opt/eXist/exist.jar eXist-snapshot-20050805.jar
cp /opt/eXist/exist-optional.jar exist-optional-snapshot-20050805.jar
cp /opt/eXist/exist-modules.jar exist-modules-snapshot-20050805.jar
cp /opt/eXist/lib/core/xmldb.jar xmldb-eXist-snapshot-20050805.jar
cp /opt/eXist/lib/core/xmlrpc-1.2-patched.jar xmlrpc-1.2-patched-eXist-snapshot-20050805.jar
            

Move to Java 5.0

eXist now relies on some Java 5.0 classes and if you try to use it with j2sdk 1.4, you’ll run into errors such as:

22:37:11.168 WARN!! [SocketListener0-9] org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:574) >11> Error for /orbeon/apiculteurs/administration/statistiques/montre
java.lang.NoClassDefFoundError: javax/xml/datatype/DatatypeConfigurationException
	at org.exist.xquery.value.AbstractDateTimeValue.<clinit>(AbstractDateTimeValue.java:157)
	at org.exist.xquery.functions.FunCurrentDateTime.eval(FunCurrentDateTime.java:51)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:144)
	at org.exist.xquery.EnclosedExpr.eval(EnclosedExpr.java:58)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:144)
	at org.exist.xquery.ElementConstructor.eval(ElementConstructor.java:173)
	at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:43)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:159)
            

To fix that, the simplest solution (assuming your application supports it) is to move your servlet to j2sdk 1.5.

Restart, restore and enjoy

You’re almost done!

Restart your servlet, restore your database using the eXist client and enjoy your brand new eXist installation.

After a servlet reload, in the servlet log, you’ll notice new messages:

            2005-10-07 08:51:08,615 INFO  org.exist.storage.XQueryPool null - QueryPool: maxStackSize = 5; timeout = 120000; timeoutCheckInterval = 30000
Scanning journal  [==                                                ] (4 %)
Scanning journal  [====                                              ] (8 %)
Scanning journal  [======                                            ] (12 %)
Scanning journal  [========                                          ] (16 %)
Scanning journal  [==========                                        ] (20 %)
Scanning journal  [=================                                 ] (34 %)
Scanning journal  [====================                              ] (40 %)
Scanning journal  [==============================                    ] (60 %)
Scanning journal  [========================================          ] (80 %)
            2005-10-07 08:51:19,713 INFO  org.orbeon.oxf.pipeline.InitUtils null - /apicu

These messages confirm that your eXist installation is now using a journal.

Lawyers shouldn’t edit XML documents

One of my customers has found out that the DTD published by Sun to validate properties files in J2SE 1.5.0 is not well-formed!

The javadoc explains:

Note that the system URI (http://java.sun.com/dtd/properties.dtd) is not accessed when exporting or importing properties; it merely serves as a string to uniquely identify the DTD, which is:

<?xml version="1.0" encoding="UTF-8"?>

<!-- DTD for properties -->

<!ELEMENT properties ( comment?, entry* ) >

<!ATTLIST properties version CDATA #FIXED "1.0">

<!ELEMENT comment (#PCDATA) >

<!ELEMENT entry (#PCDATA) >

<!ATTLIST entry key CDATA #REQUIRED>

Reducing the system URI to a mere identifier is a simplification that can lead to problems when you parse your document: XML parsers are free to load DTDs even if you specify standalone="yes" in your XML declaration and even if you run them in non-validating mode.

In that case, including a system URI pointing to a non-well-formed DTD means that, depending on your parser and the options you pass it at parse time, you may (or may not) get a well-formedness error.
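
For context, these XML properties files are normally produced and consumed through java.util.Properties, which is what writes the DOCTYPE pointing at that system URI in the first place; as the javadoc quoted above says, loadFromXML() does not access it. A quick round trip (the file name is mine) looks like this:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class PropertiesXmlDemo {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("greeting", "hello");

        // storeToXML() writes the DOCTYPE referencing http://java.sun.com/dtd/properties.dtd
        properties.storeToXML(new FileOutputStream("demo.properties.xml"), "A comment");

        // loadFromXML() reads it back without accessing the system URI...
        Properties loaded = new Properties();
        loaded.loadFromXML(new FileInputStream("demo.properties.xml"));
        System.out.println(loaded.getProperty("greeting"));

        // ... but a generic XML parser handed the same file may well try to resolve it.
    }
}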

Interestingly, the DTD listed above and borrowed from the javadoc is well formed.

The DTD published at http://java.sun.com/dtd/properties.dtd appears to have been modified to:

<!--
   Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
  -->

<?xml version="1.0" encoding="UTF-8"?>

<!-- DTD for properties -->

<!ELEMENT properties ( comment?, entry* ) >

<!ATTLIST properties version CDATA #FIXED "1.0">

<!ELEMENT comment (#PCDATA) >

<!ELEMENT entry (#PCDATA) >

<!ATTLIST entry key CDATA #REQUIRED>

See what has happened? Someone has probably insisted that they should add a copyright statement at the beginning of each of their documents, forgetting that XML forbids comments before the XML declaration…

We shouldn’t let lawyers edit XML documents!

Dyomedea.com is valid, at last!

There is a French saying that cobblers are the worst shod (curiously, the English equivalent, “the shoemaker’s children are the worst shod”, brings children into the picture).

After spending years teaching my customers that they should follow the W3C recommendations, I have finally applied that to my own corporate site, http://dyomedea.com/english/!

For those of you who would like to see the difference, the old one now belongs to web.archive.org.

The new site is looking very different, but the structure has been kept similar and the old URIs haven’t changed.

Of course, the new site is valid XHTML 1.1 and CSS 2.0, free of layout tables, and of course it is powered by XML.

In addition to the classics (GNU/Linux, Apache, …), the site is powered by the new beta version of Orbeon PresentationServer.

This version has a lot of fancy stuff, such as its Ajax-based XForms support (which I am not using here) and out-of-the-box support for XHTML (which wasn’t the case in previous versions).

I am using it because I like this product (that’s a good reason, isn’t it?) and also to create dynamic pages:

  • I send XHTML (as application/xhtml+xml) to browsers that announce they support it (and also to the W3C XHTML validator, which doesn’t send Accept headers; if you think this is wrong, vote for this bug!) and HTML to the others (Konqueror appears to be on that list!); see the sketch after this list.
  • Of course, I aggregate RSS 1.0 feeds (from XMLfr and from this blog) to display my latest articles and the XMLfr agenda.
  • More interestingly, I have developed a couple of new OPS generators that fetch from my mailbox the latest mails I have sent to public lists.
  • These generators use my TreeBind Java/XML API to read their config inputs.
  • And, of course, an XML/XSLT platform helps a lot in managing the i18n issues (the site is in English and French) and in adding goodies such as a site map.
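
The content negotiation itself is not specific to OPS; as a rough illustration (a plain servlet rather than the actual OPS configuration, with a deliberately simplistic Accept check that ignores q values), it boils down to something like:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Simplistic illustration of choosing between XHTML and HTML from the Accept header.
public class XhtmlNegotiationServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        String accept = request.getHeader("Accept");
        if (accept != null && accept.indexOf("application/xhtml+xml") >= 0) {
            response.setContentType("application/xhtml+xml");
        } else {
            response.setContentType("text/html");
        }
        // ... serialize the page as XHTML or HTML accordingly ...
    }
}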

That was fun; I should have done it before!

Next on my list should be to do the same with XMLfr…

When good old practices become bad

There are some people with whom you just can’t disagree in their domains of expertise.

These people are always precise and accurate, and when you read what one of them writes, you have the feeling that each word has been carefully weighed and is the most accurate that could have been chosen.

In XML land, names that come to mind in that category are (to name a few) James Clark, Rick Jelliffe, Murata Makoto, Jeni Tennison, David Carlisle, Uche Ogbuji and, of course, Michael Kay.

It is exceptional to be able to disagree with Michael Kay: his books appear to be 100% bulletproof, and it can seem unbelievable that Joris Gillis could dare to write on the xsl-list:

You nearly gave me a heart attack when I encountered the following code in your – in all other aspects excellent – XSLT 2.0 book (3rd edition):…/…

You’ll have guessed that the reason this happened is that the complaint was not related to XSLT skills; the code that followed is:

<xsl:variable name="table-heading">
        <tr>
                <td><b>Date</b></td>
                <td><b>Home Team</b></td>
                <td><b>Away Team</b></td>
                <td><b>Result</b></td>
        </tr>
</xsl:variable>

Michael Kay apologized:

I think it’s true to say that practices like this were commonplace five years ago when many of these examples were written – they are still commonplace today, but no longer regarded as good practice.

And the thread ended up as a discussion about common sense and good practices:

“Common sense” is after all by definition what the majority of people think at the time – it was common sense back then to use tables, it’s common sense now to avoid them…

This remark is itself common sense, but it is still good food for thought: yesterday’s good practices become today’s poor practices, and it’s always worth reconsidering our own.

When I saw Derrick Story’s announcement of the O’Reilly Network homepage beta, I was quite sure that Eric Meyer’s publisher would have taken the opportunity to follow today’s good practices…

Guess what? The W3C HTML validator reports 40 errors on that page and I can’t disagree with that comment posted on their site:

Well. […] 2 different sites to allow for cookies, redirects that went nowhere and all I really wanted to say was “IT’S A TABLE-BASED LAYOUT!”. Good grief.

The transform source effect

Why is exposing a document model so important? Why would that be better than providing import/export capabilities or API access to the document model?

The “view source effect” is often considered one of the reasons why XML is so important: people can learn simply by opening existing documents and copying and pasting the bits they like into their own documents.

By this analysis, the view source effect is one of the main reasons for the success of the web: you can learn just by looking at the source of the pages you consider good examples.

The view source effect is indeed important, but to take it to its full potential, copy and paste needs to be automated and the view source effect needs to become the “transform source effect”.

The ability to transform sources means that you don’t need to fully understand what’s going on to take advantage of markup formats: you can just do text substitution on a template.

The web is full of examples of the power of the transform source effect: the various templating languages such as PHP, ASP, JSP and many more are nothing more than implementations of the transform source effect.

The “style-free stylesheets” which power XMLfr, and which I have described in an XML.com article, are another example of the transform source effect.

How does that relate to desktop publishing formats? Let’s take a simple example to illustrate that point.

Let’s say I am a programmer and I need to deliver an application that takes letter templates and prints them after changing the names and addresses.

Let’s also imagine that I know nothing about the specifics of the different word processors and that I want my application to be portable across Microsoft Word, OpenOffice, WordPerfect, AbiWord and Scribus.

Finally, let’s say, for fun, that I know nothing about XML but that I am an XXX programmer (substitute XXX with whatever programming language you like).

Because all these word processors can read and write their documents as XML, I’ll just write a simple program that substitutes predefined string values included in the documents (let’s call them $name, $address, …) with the contents of variables that I could retrieve, for instance, from a database.

I am sure that you know how to do that with your favorite programming language! In Perl for instance, that would be something like:

#!/usr/bin/perl

$name = 'Mr. Eric van der Vlist';
$address = '22, rue Edgar Faure';
$postcode = 'F75015';
$city = 'Paris';
$country = 'France';

while (<>) {
	s/\$(name|address|postcode|city|country)/${$1}/g;
	print;
}

There is no magic here: I am just replacing occurrences of the string “$name” in the text with the content of the variable $name, which contains “Eric van der Vlist”, occurrences of “$address” with “22, rue Edgar Faure”, and so on, in a plain text document.

I am leveraging the “transform source effect” to write a simple application that is compatible with any application that enables this effect by exposing its model as plain text.

This application will work with Microsoft Word (using WordML and probably even RTF), OpenOffice, WordPerfect, AbiWord, Scribus and many more.

It will also work with HTML, XHTML, XSL-FO, SVG, DocBook, TEI, plain text, TEX, …

It will work with Quark, but only if we use QXML as an XML format and not as an API.

And it won’t work with InDesign unless there is a way to import/export full InDesign documents in XML…

InDesign and XML: no better than its competition

Someone who had read my blog entry about Quark and XML asked me if I knew whether Adobe had followed the same principles for XML support in InDesign.

I am not a specialist of this product range and I had no answer to this question.

Anyway, some quick research on Adobe’s web site and on Google makes me think that, even though I find it disappointing that Quark (like so many others) confuses markup languages with APIs, InDesign hasn’t even reached that point yet.

Adobe’s web site describes InDesign’s flexible XML support as:

Enhanced XML import: Import XML files with flexible control using the Structure view and Tags palette. Automatically flow XML into tagged templates or import and interactively place it. Enhanced options give you greater import control and make it easier to achieve the results you want.

Linked XML files: Create a link to an XML file on import, so you can easily update your placed XML content whenever the source XML content is updated.

XML automation: Automatically format XML on import by mapping XML tags to paragraph and character styles in your document, or more easily reuse existing content by mapping text styles to XML tags and then exporting the XML content.

XML table tagging: Easily apply XML tags to InDesign tables. Then import XML content into the tables or export it from them.

These are useful features, detailed in an XML.com article, but they do not expose the complete document model in XML, either directly or even through an XML schema of the DOM.

That might be the reason why someone who describes himself as a die-hard InDesign fan commented on this article to say that Adobe InDesign is behind QuarkXPress in terms of XML features.

This comment was written in August 2004 but, as far as I can see on the web, this is still the case today.

An unconventional XML naming convention

I am not a big fan of naming conventions, but I don’t like being obliged to follow naming conventions that don’t seem to make sense!

One of the issues introduced by W3C XML Schema is that, in addition to defining names for elements and attributes, you often also have to define names for simple and complex types.

Even though the W3C XML Schema recommendation says that elements, attributes, types, element groups and attribute groups have separate name spaces, many people want a way to differentiate these spaces just by looking at the names, and end up using all kinds of verbose suffixes.

The other issue is, of course, defining which character set and capitalization rules should be used.

It happens that the conventions most of my customers have to follow are the UN/CEFACT XML Naming and Design Rules Version 1.1 (PDF).

Following ebXML and UBL, they state that:

Following the ebXML Architecture Specification and commonly used best practice, Lower Camel Case (LCC) is used for naming attributes and Upper Camel Case (UCC) is used for naming elements and types. Lower Camel Case capitalizes the first character of each word except the first word and compounds the name. Upper Camel Case capitalizes the first character of each word and compounds the name.

I think that these rules do not make sense for a couple of reasons:

  1. There are many circumstances where elements and attributes are interchangeable, and many vocabularies try to minimize the differences in treatment between elements and attributes. By contrast, elements and attributes on the one hand and types on the other are very different kinds of beasts: elements and attributes are physical notions that are visible in instance documents, while types are abstract notions that belong to schemas.
  2. This convention is not consistent with the UML naming conventions defined in the ebXML Technical Architecture Specification, which says that Class, Interface, Association, Package, State, Use Case and Actor names SHALL use the UCC convention (examples: ClassificationNode, Versionable, Active, InsertOrder, Buyer), while Attribute, Operation, Role, Stereotype, Instance, Event and Action names SHALL use the LCC convention (examples: name, notifySender, resident, orderArrived). XML elements and attributes are similar to UML object instances, while types are similar to UML classes, and they should follow similar naming conventions.

My preferred naming convention for XML schemas (and the one I am going to follow in the future for projects not tied to other conventions) is to use LCC for element and attribute names and UCC for type and group names (or RELAX NG named patterns).

Sticking to this rule gives consistency with the object-oriented world and allows me to get rid of the suffixes used to distinguish between what can be seen in instance documents (elements and attributes) and what belongs to schemas (types, groups or RELAX NG named patterns).