Owark WordPress plugin v0.1

I am proud to announce that I have installed the very first version of my owark WordPress plugin on this blog.

Note: the plugin is still at an early stage and I wouldn’t recommend installing it on your blog!

Standing for Open Web Archive, owark is a project that I’ll be presenting at OSCON 2011.

This first version is only a small piece in the bigger vision I’ll be presenting at OSCON, but I find it already pretty cool…

The plugin relies on Broken Link Checker to harvest the links in the blog content and flag the broken ones, and on GNU Wget to perform the archiving itself.

I had already been archiving links for a while with a bash script, and my archive database has some existing entries, so the plugin doesn’t start from scratch and can take advantage of this history.
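
The plugin itself is written in PHP, but to give an idea of the archiving step, here is a minimal Python sketch of the kind of GNU Wget invocation involved; the function name, options and directory layout are assumptions of mine, not the plugin’s actual code:

import subprocess
from datetime import date
from pathlib import Path

def archive(url, archive_root="archives"):
    # Mirror a single page and the files it needs (images, CSS, ...) with GNU Wget.
    target = Path(archive_root) / date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["wget",
         "--page-requisites",    # also fetch images, stylesheets and scripts
         "--convert-links",      # rewrite links so the local copy works offline
         "--adjust-extension",   # add .html extensions where needed
         "--directory-prefix", str(target),
         url],
        check=False,             # keep whatever could be fetched, even on errors
    )
    return target

# e.g. archive("http://example.com/some/page/reported/as/broken")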

It is only a small first step, but I am happy with the progress so far…


XML Prague 2011: XML against the web

Coffee break

After a frantic period between 2000 and 2008 during which I spoke at an impressive number of conferences, I had temporarily retired and hadn’t attended a conference since XTech 2008.

For me, XML Prague 2011 was the first opportunity to meet the community of XML core techies face to face again, and I was curious to find out how it had evolved over the past three years.

MURATA Makoto (EPUB3: Global Language and Comic)

Aside from all the technical food for thought, an image of the conference that I won’t forget is Murata Makoto expressing his grief for the victims of the earthquake in Japan in simple and sober terms.

The tag line of XML Prague 2011 was « XML as new lingua franca for the Web. Why did it never happen? ».

Michael Sperberg-McQueen (Closing keynote)

The actual content of the conference stayed close to this tag line, but it was better summarized by Michael Sperberg-McQueen in his closing keynote: « Let’s put XML in the browser, whether they want it there or not! »

Norman Walsh (HTML+XML: The W3C HTML/XML Task Force)

The tone was set by Norman Walsh in the very first session: the convergence between HTML and XML will not happen.

XML has been trying hard to be an application neutral format for the web that could be used both for documents and data. It is fair to say that it has failed to reach this goal and that the preferred formats on the web are HTML for documents and JSON for data.

That doesn’t seem to bother the XML Prague attendees much; they are markup language addicts anyway: if the « mass of web developers » does not care about XML, that’s their problem. The benefits of using XML are well known, and it just means that we have to develop the XML tools we need on the server as well as in the browser.

Following this line, many sessions were about developing XML support in the browser and bridging the gap between XML and HTML/JSON:

  • Client-side XML Schema validation by Henry S. Thompson and Aleksejs Goremikins
  • JSON for XForms by Alain Couthures
  • XSLT on the browser by Michael Kay
  • Efficient XML Processing in Browsers by Alex Milowski
  • XQuery in the Browser reloaded by Peter Fischer

By contrast, server-side tools were less represented, maybe because that ground had been better covered in the past:

  • A JSON Facade on MarkLogic Server by Jason Hunter
  • CXAN: a case-study for Servlex, an XML web framework by Florent Georges
  • Akara – Spicy Bean Fritters and XML Data Services by Uche Ogbuji

Of course, standards updates were also on the program:

  • HTML+XML: The W3C HTML/XML Task Force (already mentioned) by Norman Walsh
  • Standards update: XSLT 3.0 by Michael Kay
  • Standards update: XML, XQuery, XML Processing Profiles, DSDL by Liam Quin, Henry S. Thompson, Jirka Kosek

We also had talks about XML applications:

  • Configuring Network Devices with NETCONF and YANG by Ladislav Lhotka
  • Advanced XML development – XML Projects by George Bina
  • EPUB3: Global Language and Comic by Murata Makoto
  • EPUB: Chapter and Verse by Tony Graham
  • DITA NG – A Relax NG implementation of DITA by George Bina

Not forgetting a couple of talks about the implementations themselves:

  • Translating SPARQL and SQL to XQuery by Martin Kaufmann
  • Declarative XQuery Rewrites for Profit or Pleasure by John Snelson

And the traditional and always impressive closing keynote by Michael Sperberg-McQueen.

My own presentation, « XQuery injection », was quite atypical in this line-up, and it took all of Michael Sperberg-McQueen’s talent to kindly relate it to « XML on the web » by noting that security would have to be taken more seriously to make that happen.

One of the things that had impressed me at the XTech conferences was the shift in presentation styles, with most speakers moving away from traditional PowerPoint decks stuffed with bullet points toward lighter and better illustrated shows.

John Snelson (Declarative XQuery Rewrites for Profit or Pleasure)

I had expected the move to continue and was surprised to see that it doesn’t seem to have reached the XML Prague presenters, who, with John Snelson as a notable exception, carried on with traditional bullet points.

I had designed my presentation around what I thought had become a common style. Using Slidy, I had created no fewer than 35 short pages to present in 25 minutes. Each page had a different high-resolution picture as its background and contained only a few words.

The comments were generally good, even though some of the pictures chosen to represent injections seem to have hurt the feelings of some attendees.

Since my presentation was just standard HTML, I had felt safe using the shared computer. Unfortunately, the presentation loads 74 megabytes of background pictures, which was a little too much for that machine, and pages took several seconds to change (note to self: next time, use your own laptop)!

The twitter wall (and Norman Walsh)

Another interesting feature of this conference was the « twitter wall » that was projected in the room using a second video projector.

This wall proved very handy for communicating during the sessions and can be seen as a more modern incarnation of the IRC channels used at earlier conferences.

Unfortunately, Twitter doesn’t allow searching its archives and, as I write these words, I can no longer go back and read the tweets from the first day of the conference.

Looking back at the conference, I have mixed feelings about this gap between the XML and web developer communities, a gap that now seems to be widely accepted on both sides.

The dream that XML could be accepted by the web community at large was a nice vision, and we should not forget that XML was designed to be « SGML on the web ».

Web developers have always been reluctant to accept the perceived additional complexity of XHTML; the gap has been there from the beginning, and after XML missed the Web 2.0 train it was too late to close it.

XML on the web will remain a niche used by a minority, but the creativity and dynamism the community showed in Prague are inspiring and encouraging: there is still room for a lot of innovation, and XML is more than ever a technology of choice to power web applications.

All the pictures

A facelift for my blog

My blog was hosted on an old server that had started to show signs of weakness, and moving it to a more recent server was a good opportunity to give it a little facelift…

Updates to WordPress and its plugins, of course, but also a change of theme: the dkert3 theme I had been using until now suddenly started to generate a PHP error after the upgrade, and while activating Twenty Ten (the new WordPress default theme) to investigate, I found that it suited this blog quite well.

The biggest change is probably the integration of my old photo album into the blog.

For now, this integration is done on a like-for-like basis and reproduces the old album, but I will try to find the time to enrich and improve it!

This photo album was previously managed by Gallery, a photo album application that I had been using since its first version.

Why abandon it?

Gallery’s positioning is becoming a bit awkward: it is photo album software meant to publish photos on the web and share them with friends and family, without being a photo management application or a DAM.

I really like Gallery, but it seems to me that it is now squeezed between the plugins that manage photo albums directly within blogs and the DAMs that can also publish photos on the internet.

My decision to integrate the photo album was ultimately motivated by two reasons:

  • the wish to be able to integrate the album’s photos into the blog more easily,
  • the workload involved in maintaining two different applications for the blog and for the photo album.

Before making this decision, I migrated my old photo album to the latest version of Gallery (Gallery 3) and tested the heiv Gallery3 and Gallery3 Picker plugins, which embed Gallery3 albums and photos respectively into WordPress. The result was satisfactory, but it seemed less convincing to me than the NextGen Gallery photo management plugin that I already use on the Retour à la Terre site.

Besides this plugin, I have also installed:

  • Broken Link Checker, which checks the links inserted in the blog. I have some work to do there: the plugin has detected 72 broken links!
  • OpenID, which lets me, as well as my visitors, authenticate using an OpenID.
  • A modified version of Comment Form Notes, which displays a message encouraging my visitors to use OpenID when posting comments (that way they skip the moderation step).
  • pageMash, quite handy for managing the hierarchy of the blog’s pages.
  • Raven’s Antispam, which I hope will make managing comments on the site easier. Let me know if it wrongly blocks your comments!
  • Redirection, which helps keep URLs cool (unchanging) and analyse 404 errors.
  • Relevanssi, which improves the search features.
  • Shutter Reloaded, which displays photos in a spectacular way.
  • XRDS-Simple, required by the OpenID plugin.

A big thank you to the developers of WordPress and of these plugins!

The WordPress ecosystem keeps impressing me with its richness. That is one of the reasons why I adopted WordPress, but this richness also carries risks: when you depend on ten plugins and a theme to run a site, the odds that something will go wrong when you update the WordPress engine are far from negligible, and you tend to cross your fingers before each upgrade!

Debian/Ubuntu PHP packages and virtual hosts: introducing adminstance

As a short-term way to deal with my « Debian/Ubuntu PHP packages and virtual hosts » issue, I have written a pretty crude Python script that I have called « adminstance ».

This script can currently install, update and remove an instance of a web package such as websvn:

vdv@studio:~/Documents/Dyomedea/code/adminstance$ ./adminstance


Usages:  

adminstance -h|--help
  print this message

adminstance -l|--list <root directory>
  lists the installed instances for this directory

adminstance -i|--install [-f|--force] <root directory> <instance>
  installs an instance for a root directory

adminstance -u|--update [-f|--force] <root directory> <instance>
  updates an instance for a root directory

adminstance -r|--remove [-f|--force] [-p|--purge] <root directory> <instance>
  removes an instance for a root directory

Options:

  -i, --install : action = installation 
  -f, --force   : when action = install, update or remove, install
                  without prompting the user for a confirmation
  -h, --help    : prints this message
  -l, --list    : action = list 
  -p, --purge   : when action = remove, remove also files and directories
                  under /var and /etc (by default, these are preserved)
  -r, --remove  : action = remove
  -u, --update  : action = update
   
  

To install an instance of websvn named « foo », type:

vdv@studio:~/Documents/Dyomedea/code/adminstance$ sudo ./adminstance -i /usr/share/websvn/ foo
[sudo] password for vdv: 
install an instance of /usr/share/websvn/ named foo? (y|N) y
Copying /var/cache/websvn to /var/cache/adminstance/websvn/foo

Copying /usr/share/websvn to /usr/share/adminstance/websvn/foo

Copying /etc/websvn to /etc/adminstance/websvn/foo

Creating a symlink from /etc/adminstance/websvn/foo/config.php to /usr/share/adminstance/websvn/foo/include/config.php
Creating a symlink from /var/cache/adminstance/websvn/foo/tmp to /usr/share/adminstance/websvn/foo/temp
Creating a symlink from /var/cache/adminstance/websvn/foo to /usr/share/adminstance/websvn/foo/cache
Creating a symlink from /etc/adminstance/websvn/foo/wsvn.php to /usr/share/adminstance/websvn/foo/wsvn.php

To update it if you get a new version of websvn:

vdv@studio:~/Documents/Dyomedea/code/adminstance$ sudo ./adminstance -u /usr/share/websvn/ foo
update an instance of /usr/share/websvn/ named foo? (y|N) y
Synchronizing /usr/share/websvn to /usr/share/adminstance/websvn/foo
rsync -a --delete /usr/share/websvn/ /usr/share/adminstance/websvn/foo/

Creating a symlink from /etc/adminstance/websvn/foo/config.php to /usr/share/adminstance/websvn/foo/include/config.php
Creating a symlink from /var/cache/adminstance/websvn/foo/tmp to /usr/share/adminstance/websvn/foo/temp
Creating a symlink from /var/cache/adminstance/websvn/foo to /usr/share/adminstance/websvn/foo/cache
Creating a symlink from /etc/adminstance/websvn/foo/wsvn.php to /usr/share/adminstance/websvn/foo/wsvn.php

To list the instances of websvn:

vdv@studio:~/Documents/Dyomedea/code/adminstance$ sudo ./adminstance -l /usr/share/websvn/ 
List of instances for the package websvn:
	bar
	foo

To remove the instance foo:

vdv@studio:~/Documents/Dyomedea/code/adminstance$ sudo ./adminstance -r /usr/share/websvn/ foo
remove an instance of /usr/share/websvn/ named foo? (y|N) y
Deleting /usr/share/adminstance/websvn/foo
rm -r /usr/share/adminstance/websvn/foo

To remove it including its directory under /etc and /var:

vdv@studio:~/Documents/Dyomedea/code/adminstance$ sudo ./adminstance -rp /usr/share/websvn/ foo
remove an instance of /usr/share/websvn/ named foo? (y|N) y
Deleting /var/cache/adminstance/websvn/foo
rm -r /var/cache/adminstance/websvn/foo
Deleting /usr/share/adminstance/websvn/foo
rm -r /usr/share/adminstance/websvn/foo
Deleting /etc/adminstance/websvn/foo
rm -r /etc/adminstance/websvn/foo

It’s pretty basic and has a few limitations but that should be enough for me for the moment.
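
To give an idea of just how basic it is, the install action boils down to copying the package’s directories and recreating its symlinks per instance. Here is a minimal Python sketch of that step; the paths come from the websvn transcript above, while the function and the hard-coded link list are simplifications of mine rather than the script itself:

import shutil
from pathlib import Path

# The three trees across which Debian splits a web package.
ROOTS = ["/usr/share", "/etc", "/var/cache"]

# Symlinks to recreate for websvn, as (pointed-to path, symlink location);
# a real implementation needs such a list for every supported package.
WEBSVN_LINKS = [
    ("/etc/adminstance/websvn/{i}/config.php",
     "/usr/share/adminstance/websvn/{i}/include/config.php"),
    ("/var/cache/adminstance/websvn/{i}",
     "/usr/share/adminstance/websvn/{i}/cache"),
]

def install_instance(package, instance, links=WEBSVN_LINKS):
    # 1. Copy /usr/share/<pkg>, /etc/<pkg> and /var/cache/<pkg> into
    #    per-instance copies under .../adminstance/<pkg>/<instance>.
    for root in ROOTS:
        source = Path(root) / package
        copy = Path(root) / "adminstance" / package / instance
        if source.exists():
            print("Copying %s to %s" % (source, copy))
            shutil.copytree(source, copy, symlinks=True)
    # 2. Repoint the package's internal symlinks at the instance's own files.
    for pointed_to, location in links:
        pointed_to = pointed_to.format(i=instance)
        location = Path(location.format(i=instance))
        print("Creating a symlink from %s to %s" % (pointed_to, location))
        if location.is_symlink() or location.is_file():
            location.unlink()
        location.symlink_to(pointed_to)

# install_instance("websvn", "foo")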

In the longer term, it should be possible to pack it as a .deb that uses dpkg triggers to automate the update of all its instances when a package is updated through apt…

Debian/Ubuntu PHP packages and virtual hosts

I am a big fan of the Debian packaging system and use it on my Ubuntu systems as much as I can, as it greatly simplifies both the installation of new software and, more importantly, its maintenance and security updates.

There is unfortunately one downside that bites me so often that I am really surprised that nobody seems to care…

When you run a web server, you often want to install popular web applications such as WordPress, Gallery or websvn, and the Debian/Ubuntu packages are perfectly fine until you want to run these applications on multiple virtual hosts.

To enforce the strict separation between /usr, /var and /etc that is part of the Debian policy, these packages usually put their PHP source files under /usr/share and replace the configuration files by symbolic links to files located under /etc. Symbolic links to files located under /var are also added in some cases.

I understand the reasons for this policy, but when you want to run several instances of these applications, links from the source to a single set of configuration files just seem plain wrong! Ideally, you would want things to work the other way round: instances that have their own configuration and variable space under /etc and /var and link to a common set of source files located under /usr.
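
Taking websvn as an example, this is roughly the difference; the first layout is what the package ships, the second is only one possible reading of « the other way round »:

As shipped, one shared configuration for every virtual host:

  /usr/share/websvn/                            the PHP source files
  /usr/share/websvn/include/config.php     ->   /etc/websvn/config.php
  /etc/websvn/config.php                        the single configuration file

What virtual hosting calls for, one configuration per instance:

  /usr/share/websvn/                            shared, read-only source files
  /etc/websvn/site1/config.php                  configuration for virtual host site1
  /etc/websvn/site2/config.php                  configuration for virtual host site2
  /var/cache/websvn/site1/, site2/              per-instance variable data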

Taking a package such as WordPress and converting it into a « virtual host friendly » form isn’t that difficult, but as soon as you start modifying a package after it has been installed, you need to redo these modifications after each package update and you lose a lot of the benefit of using a package.

Have I missed something obvious and is there an easy solution for this issue?

See also Debian/Ubuntu PHP packages and virtual hosts: introducing adminstance.

RSS in the countryside

Organic farmers are rare in Haute-Normandie, where the hedged farmland has been converted to large-scale field crops wherever possible. To break their isolation, they have gathered into local groups and try to keep in touch by every possible means.

As organic fruit growers, we are no exception to the rule and gladly visit Paola and Benoît Lelièvre of the Pincheloup farm, our closest organic neighbours, but since August and the opening of the shop we have had to space these visits out somewhat.

Yesterday Catherine received an email from Paola giving some news and telling her that our adventures were being followed closely at the Pincheloup farm thanks to… the RSS feed of our website!

When I took part in writing the RSS 1.0 specification in 2000, I was far from suspecting that this vocabulary, whose usefulness I had so much trouble explaining to those around me, would one day be used out in the countryside and would help me carry out a project of such a different nature!

Le Retour à la Terre opens on August 20

Logo du Retour à la Terre

It took Catherine’s unwavering determination and the exceptional motivation of her team to stay the course and keep this opening date, set several months ago, despite all the obstacles that crop up on this kind of project…

The bet is about to be won, and her shop « Le Retour à la Terre » will open its doors on August 20.

This imminent opening led us to put the Retour à la Terre website online this weekend, even though it is still under construction.

For this site, which will present not only the shop but also our approach and our orchards, I wanted to make updates as easy as possible and chose to power it with WordPress, which Catherine knows well since it is also the engine behind her blog.

The design was created by Laurent Henriot and applied to the site as a custom WordPress theme.

Elusive digital identities

I spent a fascinating afternoon at the « Apparaître, paraître, disparaître » workshop organised by Fing.

The talks by Dominique Cardon, Philippe Rigaut, Arnaud Belleil and Pascal Levy-Garboua were interesting and, combined with the discussion that followed, they confirm that when it comes to digital identity I must look like a dinosaur, since I insist on maintaining one single digital identity and on tying it to my physical identity through devices such as my eternal XMLfr tee-shirt!

Rather than attempting a write-up that would duplicate the one Fing should publish, I would rather record here a few thoughts that came to me on the way home.

These discussions highlighted the need to define the notion of digital identity more precisely.

While I tend to consider my digital identity as the sum of everything that can be found about me on the web, it seemed to me that many people think of a digital identity as an « account » on a website. So, when the question came up of how many digital identities each of us had, it seemed natural to count the number of applications on which we have created accounts.

Michel Desbois pointed out that things were not that simple and that one can have several accounts, and therefore several identities, on a single application.

True, but you can also, as in my case, be in the opposite situation and have one and the same digital identity across several applications, and even on the web outside these applications. Sharing a single digital identity across several applications is fairly simple: you just use the same name or pseudonym and create links between these applications.

Whatever the reach of the big Web 2.0 applications, it also seems very important to me to take into account what is published on the web outside them.

I am an independent consultant and my digital identity is what serves as my marketing. It is something I have been building patiently since 1999, and I tend to consider it too important to entrust more than fragments of it to these sites.

Some of these sites have terms of use that would allow them, if they so wished, to exploit it without my consent, and others may add advertising that could prove counter-productive.

Moreover, and this seems even more important, I am not sure I can trust them to guarantee the long-term preservation of my digital identity.

Since 1999, many services that we thought were here to stay have disappeared, changed their offering or been bought by other companies.

I recently wrote two articles for the eighth anniversary of XMLfr and the tenth anniversary of XML, and had occasion to search for web resources that were eight or ten years old. Some players handle this remarkably well; this is the case of Yahoo!, which still redirects the URIs of egroups.com messages to the same messages on yahoogroups.com. I hope this will still be the case if Yahoo! is bought by Microsoft, but most of the links have disappeared, including links to posts published on the O’Reilly blogs.

If I had made heavier use of the services on offer in 1999 or 2000, many of the documents I produced at that time, which today make up my digital identity, would have disappeared.

As paradoxical or pretentious as it may sound, I therefore think I am taking fewer risks by managing my digital identity outside the big Web 2.0 sites!

That said, however simple it may be, my definition of digital identity as the set of information available about a person or a pseudonym is not without problems…

I am lucky enough to have a name that is uncommon enough to make it easy to spot what belongs to my digital identity on the web. Things would not be so simple if I had a much more common name or, worse, the same name as somebody much more famous. In that case, how would the boundaries of my digital identity be drawn?

HTML 5 turns documents into applications

See also the French version of this article on XMLfr.

HTML 5 is not just HTML 4 + 1

This announcement has already been widely commented on, and I won’t go over the detailed differences between HTML 4.01 and HTML 5, which are listed in one of the documents published with the Working Draft. What I find unfortunate is that this document, and much of the commentary about HTML 5, focuses on the syntactical differences between these versions rather than on the more fundamental ones.

These differences are clearly visible as soon as you read the introduction:

The World Wide Web’s markup language has always been HTML. HTML was primarily designed as a language for semantically describing scientific documents, although its general design and adaptations over the years has enabled it to be used to describe a number of other types of documents.

The main area that has not been adequately addressed by HTML is a vague subject referred to as Web Applications. This specification attempts to rectify this, while at the same time updating the HTML specifications to address issues raised in the past few years.

This introduction does a good job of setting the context and expectations: the goal of HTML 5 is to move from documents to applications, and this is confirmed in many other places, such as the section titled “Relationship to XUL, Flash, Silverlight, and other proprietary UI languages”:

This specification is independent of the various proprietary UI languages that various vendors provide. As an open, vendor-neutral language, HTML provides for a solution to the same problems without the risk of vendor lock-in.

To understand this bold move, we need to set this back into context.

Nobody denies that HTML was created to represent documents, but its success comes from its neutrality: even if it is fair to say that Web 2.0 is the web as it was meant to be, the makers of HTML couldn’t have imagined everything that can be done in modern web applications. If these applications are possible in HTML, it is because HTML was designed to be neutral enough to describe yesterday’s, today’s and probably tomorrow’s applications.

If, on the contrary, HTML 4.01 had attempted to describe, back in 1999, what a web application was, it is pretty obvious that this description would at best have had to be worked around, and that it might even have slowed down the development of Web 2.0.

This is the reason why I would level at HTML 5 the same kind of criticism I made of W3C XML Schema: over-specifying how a document is to be used risks blocking creativity and increasing the coupling between applications.

While many people agree that web applications should be designed as documents, HTML 5 appears to propose to move from documents to applications. This seems to me to be a major step… backward!

Flashback on HTML’s history

Another point that needs to be highlighted is the relationship between HTML 5 and XML in general, and XHTML in particular.

HTML 5 presents itself as the successor of both HTML 4.01 and XHTML 1.1, and as a competitor of XHTML 2.0.

To understand why the W3C is developing two competing standards, we need a brief reminder of the history of HTML.

HTML was originally designed as an SGML vocabulary and uses some of SGML’s features to reduce the verbosity of its documents. This is the case, for instance, of tags such as <img> or <link> that do not need to be closed in HTML.

XML was designed as a simplification of SGML, and this simplification dropped the very features that HTML uses to reduce its verbosity.

When XML was published, the W3C found itself with an SGML application (HTML) in one hand and a simplification of SGML (XML) in the other, and these two recommendations were incompatible.

To make these recommendations compatible, they decided to create XHTML 1.0, a revamping of HTML to make it compatible with the XML recommendation while keeping exactly the same features. This led to XHTML 1.0 and then to XHTML 1.1, which is roughly the same thing cut into modules that can be used independently.

One of the weaknesses of HTML being its forms, the W3C also worked on XForms, a new generation of web forms, and started moving forward on a new version of XHTML with new features, XHTML 2.0, which is still a work in progress.

The approach looked so obvious that the W3C probably neglected to check that the community was still following its work. In the euphoria that followed the publication of XML 1.0, many people were convinced that the browser war was over; the interest in HTML, which had been partly fueled by that war, started to decline, and the W3C’s work in this domain didn’t seem to raise much interest compared to, say, XML schema languages or Web Services.

It is also fair to say that the practical benefit of moving from HTML to XHTML wasn’t (and still isn’t) obvious to web site developers, since the features are the same. Migrating a site from HTML to XHTML involves additional work that is only compensated by the joy of displaying a “W3C XHTML 1.x compliant” logo!

This is also the moment when Microsoft stopped any development on Internet Explorer and Netscape transferred their development to Mozilla.

The old actors of the browser war, well represented at the W3C, which had been one of their battlefields, gave way to new actors, Mozilla, Opera and Apple/Safari, younger and less keen to accept the heaviness of W3C procedures.

At the same time, the first Web 2.0 applications sparked a new wave of creativity among web developers, and all of this happened outside the W3C. This is not necessarily a bad thing, since the mission of standards bodies such as the W3C is to standardize rather than to innovate, but the W3C doesn’t appear to have correctly estimated the importance of these changes and seems to have lost contact with its users.

And when these users, led by Opera, Mozilla and Safari, decided that it was time to move HTML forward, rather than jumping onto the XHTML 2.0 wagon they created their own working group, the WHATWG, outside the W3C. This is where the first versions of HTML 5 were drafted, together with Web Forms 2.0, a sister specification designed as an enhancement of HTML forms simpler than XForms.

Microsoft was still silent on the subject, and the W3C found itself the editor of a promising new specification, XHTML 2.0, which didn’t seem to attract much attention, while outside, a new specification claiming to be the true successor of HTML was being developed by the most promising outsiders in the browser market.

At XTech 2007, I had a chance to measure the depth of the gulf separating the two communities by attending a debate between the two working groups.

Tim Berners-Lee must have found this gulf too deep when he decided to invite the WHATWG to continue their work within the W3C, in a working group created for this purpose and distinct from the XHTML 2.0 Working Group, which carries on as if nothing had changed.

HTML 5 or XHTML 2.0?

So, the W3C now has two distinct and competing Working Groups.

Missions are very close

The XHTML 2.0 Working Group develops an extensible vocabulary based on XML:

The mission of the XHTML2 Working Group is to fulfill the promise of XML for applying XHTML to a wide variety of platforms with proper attention paid to internationalization, accessibility, device-independence, usability and document structuring. The group will provide an essential piece for supporting rich Web content that combines XHTML with other W3C work on areas such as math, scalable vector graphics, synchronized multimedia, and forms, in cooperation with other Working Groups.

The HTML Working Group focuses on the continuity with previous HTML versions:

The mission of the HTML Working Group, part of the HTML Activity, is to continue the evolution of HTML (including classic HTML and XML syntaxes).

The conciseness of this sentence doesn’t imply that the HTML Working Group isn’t worried about extensibility and cross-platform support, since the list of deliverables says “there is a single specification deliverable for the HTML Working Group, the HTML specification, a platform-neutral and device-independent design” and, later on, “the HTML WG is encouraged to provide a mechanism to permit independently developed vocabularies such as Internationalization Tag Set (ITS), Ruby, and RDFa to be mixed into HTML documents”.

The policy is thus clearly, at the risk of seeing a standards war develop, to produce two specifications and let users choose.

XHTML 5 is a weak alibi

We find this policy within the HTML 5 specification itself, which proposes a choice between two syntaxes:

This specification defines an abstract language for describing documents and applications, and some APIs for interacting with in-memory representations of resources that use this language.

The in-memory representation is known as « DOM5 HTML », or « the DOM » for short.

There are various concrete syntaxes that can be used to transmit resources that use this abstract language, two of which are defined in this specification.

The first such concrete syntax is « HTML5 ». This is the format recommended for most authors. It is compatible with all legacy Web browsers. If a document is transmitted with the MIME type text/html, then it will be processed as an « HTML5 » document by Web browsers.

The second concrete syntax uses XML, and is known as « XHTML5 ». When a document is transmitted with an XML MIME type, such as application/xhtml+xml, then it is processed by an XML processor by Web browsers, and treated as an « XHTML5 » document. Authors are reminded that the processing for XML and HTML differs; in particular, even minor syntax errors will prevent an XML document from being rendered fully, whereas they would be ignored in the « HTML5 » syntax.

This section, which is fortunately non-normative, appears to exclude the possibility that a browser might accept any HTML document other than HTML5, or any XHTML other than XHTML5!

Furthermore, with such a notice, I wonder who would want to choose XHTML 5 over HTML5…

This notice relies on a frequent misunderstanding of the XML recommendation. It is often said that XML parsing must stop after the first error, but the recommendation is much more flexible than that and distinguishes two types of errors:

  • An error is “a violation of the rules of this specification; results are undefined. Unless otherwise specified, failure to observe a prescription of this specification indicated by one of the keywords MUST, REQUIRED, MUST NOT, SHALL and SHALL NOT is an error. Conforming software MAY detect and report an error and MAY recover from it.”
  • A fatal error is “an error which a conforming XML processor MUST detect and report to the application. After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document’s logical structure to the application in the normal way).”

We see that, on the contrary, the XML recommendation explicitly allows an XML processor to recover from simple errors.

One may argue that what XML considers a fatal error might be seen by users as a simple error; this would be the case, for instance, of an <img> tag that isn’t closed. But even for fatal errors, the recommendation doesn’t stipulate that the browser should not display the document. It does require that the parser report the error to the browser, but it says nothing about how the browser should react. Similarly, the recommendation requires that normal processing stop, because the parser would be unable to reliably report the structure of the document, but it doesn’t say that the browser couldn’t switch to a recovery mode in which it tries to correct the error.
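
To make this concrete, here is a small sketch in Python using lxml (my choice of tools, not mentioned in this article): the parser detects and reports the fatal error, yet its explicit recovery mode still hands the application a repaired tree, which is exactly the kind of behaviour a browser could offer.

from lxml import etree

broken = "<p>an image tag <img src='logo.png'> left unclosed</p>"

# Default behaviour: the well-formedness error is fatal and parsing stops.
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as error:
    print("fatal error reported:", error)

# Recovery mode: the error is still detected and reported, but the parser
# repairs the tree and hands the application something it can work with.
parser = etree.XMLParser(recover=True)
document = etree.fromstring(broken, parser)
print(etree.tostring(document).decode())
for entry in parser.error_log:
    print("still reported:", entry)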

In fact, if browsers are so strict when they display XML documents, it isn’t in order to conform to the XML recommendation, but because there was a consensus that they should be strict at the time they implemented their XML support.

At that time, everyone had in mind the consequences of the browser war, which was one of the reasons why browsers accepted pretty much anything that pretended to be HTML. While this can be considered a good thing in some cases, it also means implementing a lot of undocumented algorithms, and that leads to major interoperability issues.

The decision to be strict when displaying XML documents came as a good resolution for a new era, and nobody seemed to dissent at the time.

If this position needs to be revisited, it would be ridiculous to throw away XML along with it since, as we have seen, that strictness isn’t imposed by the recommendation.

The whole way in which the two HTML5 syntaxes are presented is a clear indication that the XML syntax, which was not mentioned in the first HTML5 drafts, has been added as a compromise so that the W3C doesn’t look as if it rejected XML, but that the idea is to maintain and promote a non-XML syntax.

HTML 5 gets rid of its SGML roots

Not only does HTML 5 reject XML, but it also abandons any kind of compatibility with SGML and says clearly that “while the HTML form of HTML5 bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules”.

This sentence is symptomatic of the overall attitude of the specification, which seems intent on building on the experience of the web while ignoring the experience of markup languages, taking the risk, once again, of freezing the web in its current state.

The attitude of the XHTML Working Group is better balanced. Of course, XHTML 2.0 builds on the most recent web developments, but it does so without discarding the experience acquired in developing XML and SGML vocabularies.

Radically different technical approaches

Without entering into a detailed comparison, two points are worth mentioning.

XHTML 2.0 is more extensible

Both specifications acknowledge the need to take into account the requirements that have appeared since HTML was created and are not correctly supported, but their methods for doing so are totally different.

HTML 5 has adopted a method that looks simple: if a new need is considered important enough, a new element is added. Since many pages contain articles, a new <article> element is added. And since most pages have navigation bars, a new <nav> element is added…

We have seen, with the big vocabularies used in document applications, what the limits of this approach are: it leads to an explosion in the number of elements, and the simplicity turns into complexity. It becomes difficult to pick the right element and, since these elements are specialized, they never exactly meet your needs.

In the long term, using this approach with HTML is more or less a way of turning it into a kind of DocBook clone for the web.

XHTML 2.0 has taken the opposite approach. The idea is, on the contrary, to start with a clean-up and remove from XHTML any element that isn’t absolutely necessary.

It then relies on current practices: how do we represent an article or a navigation bar today? The most common approach is to use a generic element, often a <div>, and hijack the class attribute to apply a CSS style or a JavaScript animation.

The downside is that the values of the class attribute aren’t standardised and that the class attribute ends up conveying information about the meaning of an element rather than defining how it should be displayed. This kind of hijacking is fairly common, since it is also the foundation of microformats.

To avoid this hijacking while keeping the flexibility of this approach, XHTML 2.0 proposes adding a role attribute that defines the role of XHTML elements. This attribute can take a set of predefined values as well as ad hoc values differentiated by their namespaces.

This method is a way to introduce the same kind of features that will be added to HTML 5 without adding new elements. It is more flexible, since anyone can create new values in new namespaces. It also gives microformats something more solid to build upon than the class attribute, which can then go back to defining how elements should be presented.

Documents versus applications

Another important point that differentiates these two specifications is the balance they strike between data on the one hand and applications, or processing, on the other.

XHTML 2.0 is built upon the XML stack:

  • The lowest layer is syntactic and consists of the XML and Namespaces recommendations.
  • On top of this layer, the XML infoset defines a data model independent of any kind of processing.
  • APIs, specific languages (XPath, XQuery, …) and schema languages are built on top of this data model.

It took a few years to build this architecture, and things haven’t always been this clear and simple, but its big benefit is to separate data from processing, which is exactly what is needed for loose coupling between applications.

We’ve seen that HTML 5 has cut all its ties to XML and SGML, which means that it doesn’t rely on this architecture. On the contrary, the specification mixes everything, syntax, data model and API (the DOM), in a single document.

This is because, as we’ve already seen, HTML 5 is a vocabulary to develop web applications rather than a vocabulary to write documents.

The difference seems important to me not only in terms of architecture but also in terms of sustainability. Everyone agrees that XML is one of the best formats for the long-term preservation of documents. What is true of documents is probably not true of applications, and I don’t think an HTML 5 application with a good proportion of JavaScript will be as sustainable as an XHTML 2.0 document.

The architecture on which XHTML 2.0 is built doesn’t prevent people from developing applications, but it dissociates these applications more clearly from the content.

Furthermore, the XHTML 2.0 effort is also trying to develop and promote declarative alternatives such as XForms for defining web applications, which should be a better fit than JavaScript for documents.

Will the best specification win?

For all these reasons, HTML 5 looks to me like a big step backward, and XHTML 2.0 seems a much better alternative.

Does that mean that XHTML 2.0 will be the winner or on the contrary, does the fact that HTML 5 is written by those who develop web browsers mean that XHTML 2.0 is doomed?

XHTML 2.0 has a strong handicap, but the battle isn’t lost yet. The HTML Working Group doesn’t expect HTML 5 to become a recommendation before Q3 2010, and before that date anything can happen.

It is up to us, the users, to vote with our feet and pens, starting by boycotting the HTML 5 features that are already implemented in some browsers.

And in the short term, certifying that a page is valid XHTML 1.x is a good way to certify that it doesn’t contain HTML 5 features!