Edd Dumbill on XTech 2006

Last year, Edd Dumbill, XTech Conference Chair, was kind enough to answer my questions about the 2005 edition of the conference previously known as « XML Europe ». We’re renewing the experience, taking the opportunity to look back at last year’s edition and to figure out what XTech 2006 should look like.

vdV: You mention in your blog the success of XTech 2005, an appreciation shared by many attendees (including myself). Can you elaborate, for those who missed XTech 2005, on what makes you say it was a success?

Edd: What I was particularly pleased with was the way we adapted the conference topic areas to reflect the changing technology landscape.

With Firefox and Opera, web browser technology matters a lot more now, but there was no forum to discuss it. We provided one, and some good dialog was opened up between developers, users and standards bodies.

But, to sum up how I know the conference was successful: because everybody who went told me that they had a good and profitable time!

vdV: You said during our previous interview that two new tracks which « aren’t strictly about XML topics at all » were introduced last year (Browser Technology and Open Data) to reflect the fact that « XML broadens out beyond traditional core topics ». Have these tracks met their goal of attracting a new audience?

Edd: Yes, I’m most excited about them. As I said before, the browser track really worked at getting people talking. The Open Data track was also very exciting: we heard a lot from people out there in the real world providing public data services.

The thing is that people in these « new » audiences work closely with the existing XML technologists anyway. It didn’t make sense to talk about XML and leave SVG, XHTML and XUL out in the cold: these are just as much document technologies as DocBook!

One thing that highlighted this for me was that I heard from a long-time SGML and then XML conference attendee that XTech’s subject matter was the most interesting they’d seen in years.

vdV: Did the two « older » tracks (Core Technologies and Applications) hold their own against these two new tracks, and would you qualify them as successful too?

Edd: Yes, I would! XTech is still a very important home for leaders in the core of XML technology. Yet I also think there’s always a need to change, to adapt to the priorities of the conference attendees. One thing I want to do this year is to freshen the Applications track to reflect the rapidly changing landscape in which web applications are now being constructed. As well as covering the use of XML vocabularies and their related technologies, I think frameworks such as Rails, Cocoon, Orbeon and Django are important topics.

vdV: What would you like to do better in 2006?

Edd: As I’ve mentioned above, I think the Applications track can and will be better. I’d like also for there to be increased access to the conference for people such as designers and information architects. The technology discussed at XTech often directly affects these people, but there’s not always much dialogue between the technologists and the users. I’d love to foster more understanding and collaboration in that way.

vdV: You mention in your blog and in the CFP that there will be panel discussions for each track. How do you see these panel discussions?

Edd: Based on feedback from 2005’s conference, I would like the chance for people to discuss the important issues of the day in their field. For instance, how should XML implementors choose between XQuery and XSLT2, or how can organisations safely manage exposing their data as a web service? There’s no simple answer to these questions, and discussions will foster greater understanding, and maybe bring some previously unknown insights to those responsible for steering the technology.

vdV: The description of the tracks for XTech 2006 looks very similar to its predecessor. Does that mean that this will be a replay of XTech 2005?

Edd: Yes, but even more so! In fact, XTech 2005 was really a « web 2.0 » conference even before people put a name to what was happening. In 2006 I want to build on last year’s success and provide continuity.

vdV: In last year’s description, the semantic web had its own bullet point in the « Open Data » track and this year it’s sharing a bullet point with tagging and annotation. Does that mean that tagging and annotation can be seen as alternatives to the semantic web? Doesn’t the semantic web deserve its own track?

Edd: The Semantic Web as a more formal sphere already has many conferences of its own. While XTech definitely wants to cover the semantic web, it doesn’t want to get carried away with the complicated academic corners of the topic, but rather to see where semantic web technologies can be directly used today.

Also, I see the potential for semantic web technologies to pervade all areas that XTech covers. RDF for instance, is a « core technology ». RSS and FOAF are « applications » of RDF. RDF is used in browsers such as Mozilla. And RDF is used to describe metadata in the Creative Commons, relevant to « open data ». So why shut it off on its own? I’d far rather see ideas from semantic web spread throughout the conference.

vdV: In your blog, you’ve defended the choice of the tagline « Building Web 2.0 » quoting Paul Graham and saying that the Web 2.0 is a handy label for « The Web as it was meant to be used ». Why have you not chosen « Building the web as it was meant to be » as a tagline, then?

Edd: Because we decided on the tagline earlier! I’ll save « the web as it was meant to be » for next year :)

vdV: What struck me with this definition is that XML, Web Services and the Semantic Web are also attempts to build the Web as it was meant to be. What’s different with the Web 2.0?

Isn’t « building the web as it was meant to be » an impossible quest and why should the Web 2.0 be more successful than the previous attempts?

Edd: Two questions at once! I’ll answer both of them together. I think the « Web 2.0 » name includes and builds on XML, Web Services and the Semantic Web. But it also brings in the attitude of data sharing, community and the read/write web. Together, those things connote the web as it was intended by Berners-Lee: a two-way medium for both computers and humans.

Rather than an « attempt », I think « Web 2.0 » is a description of the latest evolution of web technologies. But I think it’s an important one, as we’re seeing a change in the notions of what makes a useful web service, and a validation of the core ideas of the web (such as REST) which the rush to make profit in « Web 1.0 » ignored.

vdV: In your blog, you said that you’re « particularly interested in getting more in about databases, frameworks like Ruby on Rails, tagging and search ». By databases, do you mean XML databases? Can you explain why you find these points particularly interesting?

Edd: I mean all databases. Databases are now core to most web applications and many web sites. They’re growing features to directly support web and XML applications, whether they’re true « XML databases » or not. A little bit of extra knowledge about the database side of things can make a great difference when creating your application.

XTech is a forum for web and XML developers, the vast majority of whom will use a database as part of their systems. Therefore, we should have the database developers and vendors there to talk as well.

vdV: One of the good things last year was the wireless coverage. Will there be one this year too?

Edd: Absolutely.

vdV: What is your worst memory of XTech 2005?

Edd: I don’t remember bad things :)

vdV: What is your best memory of XTech 2005?

Edd: For me, getting so many of the Mozilla developers out there (I think there were around 25+ Mozilla folk in all). Their participation really got the browser track off to a great start.

TreeBind, Data binding and Design Patterns

I have released a new version of my Java data binding framework, TreeBind, and I feel I need to explain why I am so excited by this API and by other lightweight binding APIs…

To make it short: to me, these APIs are the latest episode in a complete paradigm shift in the relationship between code and data.

This relationship has always been ambiguous because we are searching for a balance between conflicting considerations:

  • We’d like to keep data separate because history has taught us that legacy data is more important than programs and that data needs to survive through several program generations.
  • On the other hand, object orientation is about mixing programs and data.

The Strategy Pattern is about favouring composition over inheritance: basically, you create classes for behaviours and these behaviours become object properties.

This design pattern becomes even more powerful when you use a data binding API such as TreeBind, since you gain the ability to express the behaviours directly as XML or RDF.
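To make the pattern concrete, here is a minimal Java sketch (the names are mine, not TreeBind’s): the behaviour is an interface, each concrete behaviour is a class, and the chosen behaviour is set as an ordinary bean property.

interface SortStrategy {
    int[] sort(int[] values);
}

// One class per concrete behaviour.
class AscendingSort implements SortStrategy {
    public int[] sort(int[] values) {
        int[] copy = values.clone();
        java.util.Arrays.sort(copy); // ascending order
        return copy;
    }
}

// The behaviour becomes a property of the object that uses it.
class Report {
    private SortStrategy sorter;
    public void setSorter(SortStrategy sorter) { this.sorter = sorter; }
    public int[] sortedFigures(int[] figures) { return sorter.sort(figures); }
}

Because the behaviour is exposed as a bean property (setSorter), a binding framework can inject it from a document just like any other value.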

I have used this ability recently on at least two occasions.

The first one is in RDF, to implement the RDF/XML Query By Example language that I have presented at Extreme Markup Languages this summer.

RDF resources in a query such as:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://xml.insee.intra/schema/annuaire/"
  xmlns:q="http://xml.insee.intra/schema/qbe/">
    <q:select>
        <q:where>
            <inseePerson>
                <mail>
                    <q:conditions>
                        <q:ends-with>@insee.fr</q:ends-with>
                    </q:conditions>
                </mail>
            </inseePerson>
        </q:where>
    </q:select>
</rdf:RDF>

are bound to Java classes (in this case, a class « Select », a generic class for other resources such as « InseePerson » and a class « Conditions ») and these classes can be considered as behaviours.
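To give an idea of what such a behaviour can look like, here is a hypothetical sketch of a class the binder could populate from the query above (the names and signatures are illustrative, not TreeBind’s actual API):

class Conditions {
    private String endsWith;

    // Called by the binder for the <q:ends-with> leaf property.
    public void setEndsWith(String suffix) { this.endsWith = suffix; }

    // The behaviour itself: does a literal value satisfy the conditions?
    public boolean matches(String value) {
        return endsWith == null || value.endsWith(endsWith);
    }
}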

The second project in which I have been using this ability is for a list manager which I am writing to run my mailing lists.

This list manager is designed as a set of behaviours to apply on incoming messages.

Instead of providing a set of rigid parameters to define the list configuration, I have decided to expose the behaviours themselves through TreeBind.

The result is incredibly flexible:

<?xml version="1.0" encoding="UTF-8"?>
<listManager>
    <server>localhost</server>
    <storeType>imap</storeType>
    <user>listmanager</user>
    <password>azerty</password>
    <port>143</port>
    <folderManager>
        <folder>user.list</folder>
        <messageHandler>
             <ifIsRecipient>list@example.com</ifIsRecipient>
              <messageHandler>
                <ifSpamLevelIs>spam</ifSpamLevelIs>
                <moveTo>moderated.spam</moveTo>
            </messageHandler>
            <messageHandler>
                <ifSpamLevelIs>unsure</ifSpamLevelIs>
                <moveTo>moderated.unsure</moveTo>
            </messageHandler>
            <sendToList>
                <subjectPrefix>[the XML Guild]</subjectPrefix>
                <footer>
--
Yet another mailing list manager!
</footer>
                <recipient>vdv@dyomedea.com</recipient>
                <envelopeFrom>list-bounce@example.com</envelopeFrom>
                <header name="Precedence">List</header>
                <header name="List-Id">&lt;list.example.com></header>
                <header name="List-Post">&lt;mailto:list@example.com></header>
                <server>localhost</server>
                <user>listmanager</user>
                <archive>archive</archive>
            </sendToList>
            <moveTo>done</moveTo>
        </messageHandler>
        <messageHandler>
             <moveTo>unparsed</moveTo>
        </messageHandler>
      </folderManager>
</listManager>

The whole behaviour of the list manager is exposed in this XML document and the Java classes corresponding to each element are no more than the code that implements this behaviour.
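For instance, a <messageHandler> element with <ifSpamLevelIs> and <moveTo> children could be backed by a class along these lines (a hypothetical sketch; the actual list manager code may differ):

// Minimal interface assumed for the sake of the example.
interface Message {
    String getSpamLevel();
    void moveTo(String folder);
}

class MessageHandler {
    private String ifSpamLevelIs;
    private String moveTo;

    public void setIfSpamLevelIs(String level) { this.ifSpamLevelIs = level; }
    public void setMoveTo(String folder) { this.moveTo = folder; }

    // The behaviour itself: handle the message if the condition applies.
    public boolean handle(Message message) {
        if (ifSpamLevelIs != null && !ifSpamLevelIs.equals(message.getSpamLevel())) {
            return false; // condition not met, let the next handler try
        }
        if (moveTo != null) {
            message.moveTo(moveTo);
        }
        return true;
    }
}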

Unless you prefer to see it the other way round and consider that the XML document is an extraction of the data from these classes…

Non content based antispam sucks

My provider has recently changed the IP address of one of my servers and my logs are flooded with messages such as:

Dec  7 08:21:57 gwnormandy postfix/smtp[22362]: connect to mx00.schlund.de[212.227.15.134]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)
Dec  7 08:21:57 gwnormandy postfix/smtp[22339]: connect to mx01.schlund.de[212.227.15.150]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)
Dec  7 08:21:57 gwnormandy postfix/smtp[22334]: connect to mx01.kundenserver.de[212.227.15.150]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)
Dec  7 08:21:57 gwnormandy postfix/smtp[22414]: connect to mx00.1and1.com[217.160.230.12]: server refused to talk to me: 421 Mails from this IP temporarily refused: Dynamic IP Addresses See: http://www.sorbs.net/lookup.shtml?213.41.184.90   (port 25)

Of course, I am trying to get this solved with sorbs.net (in this case, that should be possible since it is a fixed IP address), but this incident reminds me why I think we shouldn’t use « technical » or « non content based » antispam, even when it happens to be effective.

The basic idea of most, if not all, antispam software is to distinguish between what looks like spam and what looks like a normal message.

To implement this, there are three main types of algorithms, which can be combined:

  • Content-based algorithms look at the content of the messages and use statistical methods to distinguish between « spam » and « ham » (non-spam); a toy sketch of this approach follows the list.
  • List-based algorithms work with white and black lists to allow or deny mails, usually based on the sender’s address.
  • Technically based algorithms look at the mail headers to reject the most common practices used by spammers.
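To illustrate the first family, here is a toy Java sketch of a naive Bayesian score; real filters such as SpamBayes are considerably more sophisticated, and the probabilities used here are purely illustrative:

import java.util.HashMap;
import java.util.Map;

// Toy content-based (Bayesian) scorer; illustrative only.
class BayesianScorer {
    private final Map<String, Double> spamProbability = new HashMap<String, Double>();

    // Training assigns each token a probability of appearing in spam.
    void train(String token, double probability) {
        spamProbability.put(token, probability);
    }

    // Combine the per-token probabilities of a message (naive Bayes).
    double score(String[] tokens) {
        double spam = 1.0, ham = 1.0;
        for (String token : tokens) {
            Double p = spamProbability.get(token);
            if (p == null) p = 0.4; // unknown tokens lean slightly towards ham
            spam *= p;
            ham *= (1.0 - p);
        }
        return spam / (spam + ham); // close to 1.0 means "looks like spam"
    }
}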

The problem with these technical algorithms is that the common practices used by spammers are not necessarily practices that violate the standards, nor even practices that should be considered bad practices!

Let’s take the case of the sorbs.net database that identifies dynamic IP addresses.

I would argue that sending mail from a dynamic IP address is a good practice and that asking people to use their ISP’s mail servers when they don’t want to is a bad practice.

I personally consider my mail too important and sensitive to be outsourced to my ISP!

That’s the case when I am at home, where I prefer to set up my own SMTP servers to take care of delivering my mail rather than using my ISP’s SMTP servers.

When I use my own servers, I know from my logs whether and when my recipients’ SMTP servers receive and queue the mails I send.

Also, I want to be able to manage mailing lists without having to ask anyone.

And that’s even more the case when I am travelling and using an occasional ISP that I barely know and don’t know whether I can trust.

We use lots of these ISPs when connecting through WiFi hotspots and, here again, I much prefer to send my mail from the SMTP server running on my laptop than from an unknown ISP’s server.

Furthermore, that means that I don’t have to change the configuration of my mailer.

Content-based antispam also has its flaws (it needs training and is quite ineffective against mails containing only pictures), but it doesn’t have the kind of false positives that technically based antispam produces when it rejects my mails because I send them from dynamic IP addresses.

That’s the reason why I have uninstalled SpamAssassin and replaced it with SpamBayes on my own systems.

Now, the thing that really puzzles me about antispam is that we have a technical solution that could eradicate spam, and we just seem to ignore it.

If everyone signed their mails with a PGP key, I could reject (or moderate) all the emails that are not signed.

Spammers would have to choose between signing their mails and being identified (meaning they could be sued) or not signing them and getting their mails trashed.

Now, the problem is that because so few people sign their mails, I can’t afford to ignore unsigned mails; and because PGP signatures are not handled correctly by many mailers and mailing list servers, most people (including me) don’t sign their mails.

The question is: why doesn’t that change? Is this just a question of habits? Or is the community as a whole simply not motivated enough to shut spam down?

Web 2.0: myth and reality

Web 2.0 is both a new buzzword and real progress. In this article, I’ll try to separate the myth from the reality.

Note 

This article is a translation of the article published in French on XMLfr and presented at  sparklingPoint.

This version integrates, in a very “Web 2.0 fashion”, a lot of comments from XMLfr editors and sparklingPoint participants, and I’d like to thank them for their contributions.

Definition 

The first difficulty when we want to form an opinion about Web 2.0 is to delimit its perimeter.

When you need to say if an application is XML or not, that’s quite easy: the application is an XML application if and only if it conforms to the XML 1.0 (or 1.1) recommendation. 

That’s not so easy for Web 2.0 since Web 2.0 is not a standard but a set of practices. 

In that sense, Web 2.0 can be compared to REST (Representational State Transfer) which is also a set of practices.

Fair enough, you will say, but it’s easy to say whether an application is RESTful. Why would that be different with Web 2.0?

REST is a concept that is clearly described in a single document: Roy Fielding’s thesis which gives a precise definition of what REST is.

Web 2.0, on the contrary, is a blurry concept that aggregates a number of trends, and everyone seems to have their own definition of it, as you can see from the number of articles describing what Web 2.0 is.

If we really need to define Web 2.0, I’ll take two definitions. 

The first one is the one given by the French version of Wikipedia:

Web 2.0 is a term often used to describe what is perceived as an important transition of the World Wide Web, from a collection of web sites to a computing platform providing web applications to users. The proponents of this vision believe that the services of Web 2.0 will come to replace traditional office applications.

This article also gives a history of the term:

The term was coined by Dale Dougherty of O’Reilly Media during a brainstorming session with MediaLive International to develop ideas for a conference that they could jointly host. Dougherty suggested that the Web was in a renaissance, with changing rules and evolving business models.

And it goes on by giving a series of examples that illustrate the difference between good old “Web 1.0” and Web 2.0:

DoubleClick was Web 1.0; Google AdSense is Web 2.0. Ofoto is Web 1.0; Flickr is Web 2.0.

Google, which launched AdSense in 2003, was doing Web 2.0 without knowing it, a year before the term was coined in 2004!

Technical layer

Let’s focus on the technical side of Web 2.0 first.

One of the characteristics of Web 2.0 is that it is available to today’s users running reasonably recent versions of any browser. That’s one of the reasons why Mike Shaver said in his opening keynote at XTech 2005 that “Web 2.0 isn’t a big bang but a series of small bangs”.

Restricted by the set of installed browsers, Web 2.0 has no choice but to rely on technologies that can be described as “mature”:

  • HTML (or XHTML pretending to be HTML, since Internet Explorer doesn’t accept XHTML documents declared as such) –the last version of HTML was published in 1999.
  • A subset of CSS 2.0 supported by Internet Explorer –CSS 2.0 was published in 1998.
  • Javascript –a technology introduced by Netscape in its browser in 1995.
  • XML –published in 1998.
  • Atom or RSS syndication –RSS was created by Netscape in 1999.
  • The HTTP protocol –the latest HTTP version was published in 1999.
  • URIs –published in 1998.
  • REST –a thesis published in 2000.
  • Web Services –XML-RPC APIs for Javascript were already available in 2000.

The usage of XML over HTTP in asynchronous mode has been given the name “Ajax”.

Web 2.0 appears to be the full appropriation by web developers of mature technologies to achieve a better user experience.

If it’s a revolution, it’s a revolution in the way these technologies are used together, not in the technologies themselves.

Office applications

Can these old technologies really replace office applications? Is Web 2.0 about rewriting MS Office in Javascript and could that run in a browser?

Probably not, if the goal were to keep the same paradigm with the same level of features.

We often quote the famous “80/20” rule, according to which 80% of the features require only 20% of the development effort, and sensible applications should focus on those 80% of the features.

Office applications crossed that 80/20 borderline years ago and have invented a new kind of 80/20 rule: 80% of the users probably use less than 20% of the features.

I think that a Web 2.0 application focusing on the genuine 80/20 rule, for a restricted application domain or group of users, would be tough competition for traditional office applications.

This seems to be the case for applications such as Google Maps (which could compete with GIS applications at the low end of the market) or some of the new WYSIWYG text editing applications flourishing on the web.

A motivation that may push users to adopt these web applications is the attractiveness of systems that help us manage our data.

This is the case with Gmail, Flickr, del.icio.us or LinkedIn, to name a few: while these applications relieve us of the burden of technically managing our data, they also give us remote access from any device connected to the internet.

What is seen today as a significant advantage for managing our mails, pictures, bookmarks or contacts could be seen in the future as a significant advantage for managing our office documents.

Social layer

While the French version of Wikipedia has the benefit of being concise, it is slightly out of date and doesn’t describe the second layer of Web 2.0, further developed during the second Web 2.0 conference in October 2005.

The English version of Wikipedia adds the following examples to the list of Web 1.0/Web 2.0 sites:

Britannica Online (1.0) / Wikipedia (2.0), personal sites (1.0) / blogging (2.0), content management systems (1.0) / wikis (2.0), directories (taxonomy) (1.0) / tagging (« folksonomy ») (2.0)

These examples are interesting because technically speaking, Wikipedia, blogs, wikis or folksonomies are mostly Web 1.0.

They illustrate what Paul Graham calls Web 2.0 “democracy”.

Web 2.0 democracy is the idea that, to “lead the web to its full potential” (as the W3C tagline says), the technical layer of the internet must be complemented by a human network of users who produce, maintain and improve its content.

There is nothing new here either: I remember Edd Dumbill launching WriteTheWeb in 2000, “a community news site dedicated to encouraging the development of the read/write web”, because the “tide is turning” and the web is no longer a one-way web.

This social effect was also the guiding theme of Tim O’Reilly’s keynote session at OSCON 2004, one year before it became the social layer of Web 2.0.

Another definition

With a technical and a social layer, isn’t Web 2.0 becoming a shapeless bag in which we group anything that looks new on the web?

We can see in the technical layer a consequence of the social layer, the technical layer being needed to provide the interactivity required by the social layer.

This analysis would exclude from Web 2.0 applications such as Google Maps which have no social aspect but are often quoted as typical examples of Web 2.0.

Paul Graham  tries to find common trends between these layers in the second definition that I’ll propose in this article:

Web 2.0 means using the web the way it’s meant to be used. The « trends » we’re seeing now are simply the inherent nature of the web emerging from under the broken models that got imposed on it during the Bubble. 

This second definition reminds me of other taglines and buzzwords heard over the past few years:

  • The W3C tagline is “Leading the Web to Its Full Potential”. Ironically, Web 2.0 is happening, technically based on many technologies specified by the W3C, without the W3C… It is very tempting to interpret the recent announcement of a “Rich Web Clients Activity” as an attempt to catch a running train.
  • Web Services are an attempt to make the web available to applications, as it was meant to be since the early days of Web 1.0.
  • The Semantic Web -which seems to have completely missed the Web 2.0 train- is the second generation of the web seen by the inventor of Web 1.0. 
  • REST is the description of web applications using the web as it is meant to be used.
  • XML is “SGML on the web” which was possible with HTTP as it was meant to be used. 
  • … 

Here again, Web 2.0 appears to be the continuation of the “little big bangs” of the web.

Technical issues

In maths, continuous isn’t the same as differentiable and in technology too, continuous evolutions can change direction.

Technical evolutions are often a consequence of changes in priorities that lead to these changes of direction.

The priorities of client/server applications that we developed in the 90’s were:

  • the speed of the user interfaces,
  • their quality,
  • their transactional behaviour,
  • security.

They’ve been swept aside by web applications, whose priorities are:

  • a universal addressing system,
  • universal access,
  • global fault tolerance: when a computer stops, some services might stop working but the web as a whole isn’t affected,
  • scalability (web applications support more users than client/server applications ever dreamed of supporting),
  • a relatively coherent user interface that enables sharing services through URIs,
  • open standards.

Web 2.0 is taking back some of the priorities of client/server applications, and one needs to be careful that these priorities are met without compromising what makes the strength of the web.

Technically speaking, we are lucky enough to have best practices formalized in REST, and Web 2.0 developers should be careful to design RESTful exchanges between browsers and servers to take full advantage of the web.

Ergonomic issues

Web 2.0 applications run in web browsers and should make sure that users can keep their Web 1.0 habits, especially with respect to URIs (including the ability to create bookmarks, send URIs by mail and use the back and forward buttons).

Let’s take a simple example to illustrate the point.

Have you noticed that Google, presented as a leading-edge Web 2.0 company, is stubbornly Web 1.0 in its core business: the search engine itself?

It is easy to imagine what a naïve Web 2.0 search engine might look like.

That might start with a search page similar to the current Google Suggest: when you start typing your query terms, the service suggests possible completions of your terms.

When you sent the query, the page wouldn’t move. Some animation could keep you waiting, even if that’s usually not necessary with a high-speed connection to Google. The query would be sent and the results brought back asynchronously; then the list of matches would be displayed in the same page.

The user experience would be fast and smooth, but there are enough drawbacks with this scenario that Google doesn’t seem to find it worth trying:

  • The URI in the address bar would stay the same: users would have no way to bookmark a search result or to copy and paste it to send to a friend.
  • Back and forward buttons would not work as expected.
  • These result pages would not be accessible to crawlers.

A web developer implementing this Web 2.0 application would have to take care to provide good workarounds for each of these drawbacks. This is certainly possible, but it requires some effort.

Falling into these traps would be really counter-productive for Web 2.0, since ergonomic improvements that make the web easier to use are precisely what justifies this evolution.

Development

The last point on which one must be careful when developing Web 2.0 applications is development tools.

The flow of press releases from software vendors announcing development tools for Ajax-based applications may eventually put an end to this problem, but today Web 2.0 often means developing complex scripts that are subject to interoperability issues between browsers.

Does that mean that Web 2.0 should ignore declarative definitions of user interfaces (such as XForms, XUL or XAML) or even the 4GLs that were invented for client/server applications in the early 90s?

A way to avoid this regression is to use a framework that hides most of the Javascript development.

Catching up with the popular “Ruby on Rails”, web publishing frameworks are beginning to offer Web 2.0 extensions.

This is the case of Cocoon, whose new version 2.1.8 includes support for Ajax, but also of Orbeon PresentationServer, whose version 3.0 includes fully transparent support for Ajax through its XForms engine.

This feature makes it possible to write user interfaces in standard XForms (without a single line of Javascript) and to deploy these applications on today’s browsers, the system using Ajax interactions between browsers and servers to implement XForms.

Published in 2003, XForms is only two years old, way too young to be part of the Web 2.0 technical stack… Orbeon PresentationServer is a nifty way to use XForms before it can join the other Web 2.0 technologies!

Business model

What about the business model?

Paul Graham’s definition, for whom Web 2.0 is a web rid of the bad practices of the internet bubble, is interesting when you know that some analysts believe a Web 2.0 bubble is on its way.

This is the case of Rob Hof (BusinessWeek), who makes a two-step argument:

1) “It costs a whole lot less to fund companies to revenue these days”, which Joe Kraus (JotSpot) explains by the fact that:

  • “Hardware is 100X cheaper”,  
  • “Infrastructure software is free”, 
  • “Access to Global Labor Markets”, 
  • Internet marketing is cheap and efficient for niche markets. 

2) Even though venture capital investment seems to stay level, cheaper costs mean that many more companies are being funded with the same level of investment. Furthermore, cheaper costs also mean that more companies can be funded outside of VC funds.

Rob Hof also remarks that many Web 2.0 startups are created with no other business model than being sold in the short term.

Even if it is composed of smaller bubbles, a Web 2.0 bubble might be on the way…

Here again, the golden rule is to learn from the Web 1.0 experience.

Data Lock-In Era

If we need a solid business model for Web 2.0, what can it be?

One of the answers to this question was in the Tim O’Reilly keynote at OSCON 2004 that I have already  mentioned.

Giving his views on the history of computer technologies since their beginnings, Tim O’Reilly showed how this history can be split into three eras:

  • During the “Hardware Lock-In” era, computer manufacturers ruled the market.
  • Then came the “Software Lock-In” era dominated by software vendors.
  • We are now entering the “Data Lock-In” era.

In this new era, illustrated by the success of sites such as Google, Amazon or eBay, the dominant actors are companies that can gather more data than their competitors, and their main asset is the content given or lent to them by their users for free.

When you outsource your mail to Google, publish a review or even buy something on Amazon, upload your pictures to Flickr or add a bookmark to del.icio.us, you tie yourself to that site and trade a service against their use of your data.

A number of people are speaking out against what François Joseph de Kermadec calls the “fake freedom” given by Web 2.0.

To guard against this fake freedom, users should be careful:

  • to trade data against real services, 
  • to look into the terms of use of each site to know which rights they grant in exchange for these services,
  • to demand technical means, based on open standards, to get their data back. 

So what?

What are the conclusions of this long article?

Web 2.0 is a term to qualify a new web that is emerging right now.

This web will use the technologies that we already know in creative ways to develop a collaborative “two way web”.

Like any other evolution, Web 2.0 comes with a series of risks: technical, ergonomic, financial and threats against our privacy.

Beyond the marketing buzzword, Web 2.0 is a fabulous bubble of new ideas, practices and usages.

The fact that its shape is still so blurred shows that everything is still open and that personal initiatives are still important.

The Web 2.0 message is a message of hope!

The complex simple problem of media types

Media types (previously called mime types) have always struck me as something that is simple in theory but awfully complex in practice.

When they seem to be working, you can enjoy your luck and be sure that it is only temporary until the next release of software X or Y: after the recent upgrade from Ubuntu Hoary to Breezy, my workstation insists that my « audio/x-mpegurl » playlists are « text/plain », and when I associate XMMS with these files it uses XMMS to open all my « text/plain » documents!

I recently had the opportunity to look a little deeper into these issues for a project of mine that needs to determine the media types of files in Java.

Freedesktop.org comes to the rescue

The problem is more complex than it appears, and it’s comforting to know that some people seem to be doing exactly what needs to be done to fix it.

The freedesktop.org project has been working for a while on a shared database of media types and has published its specification.

Gnome and KDE are participating and I hope that this means the end of the media types nightmare on my desktop…

I really like the principles they have adopted, especially the simple XML format used to describe the media types (which they still call mime types).

One thing which is surprising when you first open the XML document describing this shared mime types database is that it includes an internal DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mime-info [
  <!ELEMENT mime-info (mime-type)+>
  <!ATTLIST mime-info xmlns CDATA #FIXED "http://www.freedesktop.org/standards/shared-mime-info">

  <!ELEMENT mime-type (comment|glob|magic|root-XML|alias|sub-class-of)*>
  <!ATTLIST mime-type type CDATA #REQUIRED>

  <!ELEMENT comment (#PCDATA)>
  <!ATTLIST comment xml:lang CDATA #IMPLIED>

  <!ELEMENT glob EMPTY>
  <!ATTLIST glob pattern CDATA #REQUIRED>

  <!ELEMENT magic (match)+>
  <!ATTLIST magic priority CDATA #IMPLIED>

  <!ELEMENT match (match)*>
  <!ATTLIST match offset CDATA #REQUIRED>
  <!ATTLIST match type (string|big16|big32|little16|little32|host16|host32|byte) #REQUIRED>
  <!ATTLIST match value CDATA #REQUIRED>
  <!ATTLIST match mask CDATA #IMPLIED>

  <!ELEMENT root-XML EMPTY>
  <!ATTLIST root-XML
  	namespaceURI CDATA #REQUIRED
	localName CDATA #REQUIRED>

  <!ELEMENT alias EMPTY>
  <!ATTLIST alias
  	type CDATA #REQUIRED>

  <!ELEMENT sub-class-of EMPTY>
  <!ATTLIST sub-class-of
  	type CDATA #REQUIRED>
]>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">

That’s not very usual and is often considered a bad practice, since the DTD isn’t shared between documents. But when you think more about it, in this specific context where there should be only one of these documents per machine, it makes perfect sense.

Using an internal DTD solves all the packaging issues: there is only one self-contained document to ship and this document has no external dependencies. Furthermore, the DTD is pretty straightforward and including it in the document itself makes the document more self-describing.

This vocabulary is meant to be extensible through namespaces:

Applications may also define their own elements, provided they are namespaced to prevent collisions. Unknown elements are copied directly to the output XML files like comment elements.

I think that they could have allowed attributes from foreign namespaces as well, since they are usually quite harmless.

The mechanism defined to distinguish between different types of XML documents appears to be somewhat weak, since it relies only on the namespace and local name of the root element.

Without even mentioning the problem of compound documents, this mechanism completely misses the fact that a document such as:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns="http://www.w3.org/1999/xhtml" xsl:version="1.O">
    .../...
</html>

isn’t an XHTML document but an XSLT transformation!

The other thing that I regret, probably because I am not familiar with these issues, is that the description of the « magic » rules is very concise.

One of the questions I would have liked to see answered is which encodings should be tried when doing string matching. When I see the following rule in the description of the « application/xml » type:

    <magic priority="50">
      <match value="&lt;?xml" type="string" offset="0"/>
    </magic>

I have the feeling that several encodings should be tried: a match on the ASCII bytes would also work for UTF-8, ISO-8859-1 and the like, but it would fail for UTF-16 or EBCDIC…

On the other hand, there are probably many text formats that do not support UTF-16 or EBCDIC, and for which it would be a mistake to try these encodings…
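A possible way to hedge in Java is to try the string match under a few candidate encodings; here is a minimal sketch (the list of encodings is my own guess, not something taken from the specification):

import java.io.UnsupportedEncodingException;

// Does the file start with the magic string under any of the candidate encodings?
class MagicMatcher {
    // Candidate encodings: a personal guess, not part of the freedesktop.org spec.
    private static final String[] ENCODINGS = { "US-ASCII", "UTF-16BE", "UTF-16LE", "Cp1047" /* EBCDIC */ };

    static boolean matches(byte[] head, String magic, int offset) {
        for (String encoding : ENCODINGS) {
            try {
                byte[] pattern = magic.getBytes(encoding);
                if (regionMatches(head, offset, pattern)) {
                    return true;
                }
            } catch (UnsupportedEncodingException e) {
                // This encoding is not available on this JVM: skip it.
            }
        }
        return false;
    }

    private static boolean regionMatches(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) return false;
        }
        return true;
    }
}

A UTF-16 document starting with a byte order mark would still need special handling, which is precisely the kind of detail I would have liked the specification to spell out.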

More implementations needed!

Having found this gem, I was pretty confident I would find a Java implementation…

There is one buried in Sun’s Java Desktop System (which isn’t open source) and one in Nutch, but that seems to be pretty much all we have available!

The MimeType class in Nutch appears to be quite minimal. It probably does what most applications want to do, but that’s not enough for what I’d like to achieve.

The mime types database has some advanced features such as type hierarchy: a mime type can be a subclass of other mime types, for instance all the text types are subclasses of the text/plain type.

These hierarchies can be implicit or explicit and they support multiple inheritance. The freedesktop.org specification gives the following example:

Some types may or may not be instances of other types. For example, a spreadsheet file may be compressed or not. It is a valid spreadsheet file either way, but only inherits from application/x-gzip in one case. This information cannot be represented statically; instead an application interested in this information should run all of the magic rules, and use the list of types returned as the subclasses.

These hierarchies should be used by user interfaces: instead of offering only the tools registered for a specific type, a user interface should also offer the tools registered for its parent classes. If they did, they would offer to open an SVG document with an XML or text editor, or an OpenOffice document with a zip extractor, which can be very handy.

Those are the kind of features I’d expect to see in a mime type API, and I am wondering whether I will have to write my own Java implementation to see that happen!
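For the record, here is the kind of API I have in mind; it is purely hypothetical and, as far as I know, exists neither in Nutch nor anywhere else:

import java.util.List;

// Purely hypothetical sketch of what a richer mime type library could expose.
interface MimeType {
    String getName();                    // e.g. "image/svg+xml"
    List<MimeType> getParentTypes();     // e.g. application/xml, then text/plain
    boolean isSubTypeOf(MimeType other); // walks the (possibly multiple) inheritance graph
}

interface MimeTypeRegistry {
    MimeType getByName(String name);
    // Runs the glob and magic rules; several types may match (see the gzip example above).
    List<MimeType> guess(String fileName, byte[] head);
}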

Modifying the dependencies of a Debian package

When you install packages from unofficial sources, it frequently happens that the dependencies declared in those packages do not match the system on which you are installing them.

This is the case, for instance, when you install the current version of Opera on Ubuntu 5.10 « Breezy Badger » or the current version of Skype on Ubuntu 5.04 « Hoary Hedgehog »:

vdv@vaio:~ $  sudo dpkg -i /opt/downloads/skype_1.2.0.17-1_i386.deb
Password:
(Lecture de la base de données... 169348 fichiers et répertoires déjà installés.)
Préparation du remplacement de skype 1.2.0.17-1 (en utilisant .../skype_1.2.0.17-1_i386.deb) ...
Dépaquetage de la mise à jour de skype ...
dpkg : des problèmes de dépendances empêchent la configuration de skype :
 skype dépend de libqt3c102-mt (>= 3:3.3.3.2) ; cependant :
  La version de libqt3c102-mt sur le système est 3:3.3.3-7ubuntu3.
dpkg : erreur de traitement de skype (--install) :
 problèmes de dépendances - laissé non configuré
Des erreurs ont été rencontrées pendant l'exécution :
 skype

Faced with this kind of situation, you can force the installation with the « --force » option:

vdv@vaio:~ $  sudo dpkg --force depends -i /opt/downloads/skype_1.2.0.17-1_i386.deb
Password:
(Lecture de la base de données... 169348 fichiers et répertoires déjà installés.)
Préparation du remplacement de skype 1.2.0.17-1 (en utilisant .../skype_1.2.0.17-1_i386.deb) ...
Dépaquetage de la mise à jour de skype ...
dpkg : skype : problèmes de dépendances, mais configuration comme demandé :
 skype dépend de libqt3c102-mt (>= 3:3.3.3.2) ; cependant :
  La version de libqt3c102-mt sur le système est 3:3.3.3-7ubuntu3.
Paramétrage de skype (1.2.0.17-1) ...

        

This lets you test the application and check that it works on your system, but the package is considered « broken » and the system never misses an opportunity to remind you of it:

vdv@vaio:~ $ sudo apt-get dist-upgrade
Lecture des listes de paquets... Fait
Construction de l'arbre des dépendances... Fait
Vous pouvez lancer « apt-get -f install » pour corriger ces problèmes.
Les paquets suivants contiennent des dépendances non satisfaites :
  skype: Dépend: libqt3c102-mt (>= 3:3.3.3.2) mais 3:3.3.3-7ubuntu3 est installé
E: Dépendances manquantes. Essayez d'utiliser l'option -f.
            

If you try « apt-get -f install », it offers to remove the offending package, which takes you back to square one:

vdv@vaio:~ $ sudo apt-get -f install
Lecture des listes de paquets... Fait
Construction de l'arbre des dépendances... Fait
Correction des dépendances... Fait
Les paquets suivants seront ENLEVÉS :
  skype
0 mis à jour, 0 nouvellement installés, 1 à enlever et 3 non mis à jour.
Il est nécessaire de prendre 0o dans les archives.
Après dépaquetage, 9160ko d'espace disque seront libérés.
Souhaitez-vous continuer [O/n] ? n
Annulation.
            

One solution to this problem is to fix the dependencies in the package itself, which is much easier than you might fear…

To do so, start by extracting the files from the package:

vdv@vaio:~ $ cd /tmp
vdv@vaio:/tmp $ dpkg-deb -x /opt/downloads/skype_1.2.0.17-1_i386.deb skype_1.2.0.17-1_i386
            

This command does not extract the control files, which have to be extracted in a second step:

vdv@vaio:/tmp $ mkdir skype_1.2.0.17-1_i386/DEBIAN
vdv@vaio:/tmp $ dpkg-deb -e /opt/downloads/skype_1.2.0.17-1_i386.deb skype_1.2.0.17-1_i386/DEBIAN
            

You can now edit the control file:

vdv@vaio:/tmp $ gvim skype_1.2.0.17-1_i386/DEBIAN/control
            

in order to change the version number in the line:

Depends: libc6 (>= 2.3.2.ds1-4), libgcc1 (>= 1:3.4.1-3), libqt3c102-mt (>= 3:3.3.3.2), libstdc++5 (>= 1:3.3.4-1), libx11-6 | xlibs (>> 4.1.0), libxext6 | xlibs (>> 4.1.0)
            

to the one matching our installation:

Depends: libc6 (>= 2.3.2.ds1-4), libgcc1 (>= 1:3.4.1-3), libqt3c102-mt (>= 3:3.3.3-7ubuntu3), libstdc++5 (>= 1:3.3.4-1), libx11-6 | xlibs (>> 4.1.0), libxext6 | xlibs (>> 4.1.0)

All that is left now is to rebuild the package:

vdv@vaio:/tmp $ dpkg-deb -b skype_1.2.0.17-1_i386
dpkg-deb : construction du paquet « skype » dans « skype_1.2.0.17-1_i386.deb ».
            

And to reinstall it:

vdv@vaio:/tmp $ sudo dpkg -i skype_1.2.0.17-1_i386.deb
(Lecture de la base de données... 169348 fichiers et répertoires déjà installés.)
Préparation du remplacement de skype 1.2.0.17-1 (en utilisant skype_1.2.0.17-1_i386.deb) ...
Dépaquetage de la mise à jour de skype ...
Paramétrage de skype (1.2.0.17-1) ...

Of course, this only works if the dependency problem was spurious, hence the usefulness of testing with the « dpkg --force depends » command before undertaking the operation; but when that is the case, this fairly simple manipulation fixes the problem once and for all… until the next release!

TreeBind: one infoset to bind them all

This is the first entry of a series dedicated to the TreeBind generic binding API.
I have recently made good progress in the extensive refactoring of TreeBind required by my proposal to support RDF, and it’s time to start explaining these changes.

The first of them is the infoset on which TreeBind is now relying.

TreeBind’s goal is to propose and implement a binding mechanism that can support XML and Java objects, but also RDF and LDAP (my new implementation includes support for these two models as data sources) and, potentially, other sources such as relational databases or even PSVIs…

In order to cover all these data sources, TreeBind required an infoset (or data model) which is a superset of the data models of these sources.

The new TreeBind infoset is simple enough to cope with all these data models. It consists of:

  • Names. These different sources have different ways of defining names. Names can include both a domain name and a local name (as with XML and namespaces, but also with Java class names and packages), only a local name (as with LDAP, but also with Java method names), or they can be more complex, like XML attribute names, where the namespace of the parent element has a role to play.
  • Complex properties. These are non leaf properties. Complex properties have a nature which is a name and a set of embedded properties that are either complex or leaf properties. When a sub property is attached to a property, the attachment carries a role which is a name.
  • Leaf properties. Leaf properties have a nature (which is a name) and a value.

That’s all…
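Expressed as Java interfaces, a minimal sketch of this infoset could look like the following (illustrative only; the actual TreeBind interfaces may differ):

import java.util.List;

interface Name {
    String getDomain();     // may be null (LDAP attributes, Java method names, ...)
    String getLocalName();
}

interface Property {
    Name getNature();
}

interface LeafProperty extends Property {
    Object getValue();
}

interface ComplexProperty extends Property {
    // Each sub-property is attached with a role, which is itself a name.
    List<Attachment> getAttachments();
}

interface Attachment {
    Name getRole();
    Property getProperty();
}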

This is enough to differentiate, for instance, an XML element from an XML attribute because their names belong to different name classes.

This would, potentially, make it possible to cope with mixed content by adding a new class of names to support text nodes. This is not implemented for the moment, simply because I don’t have any business case to justify the additional workload.

If needed, the same could be done to support other XML constructions such as PIs and comments.

A concept which is clearly missing and should probably be added in a future version is the concept of identity.

Names are used to identify the nature of the objects and the roles they play in the different associations.

When we use TreeBind to bind not only trees but also graphs (which is the case of RDF, LDAP and even XML if we want to support some type of id/idref), we need to be able to identify objects in order to avoid creating binding loops.

This could be done by attaching an ID, which could itself be a name, to each property.

So what?

The new version of TreeBind implements a SAX-like paradigm built on top of this simple infoset, just as SAX is (more or less) built on top of the XML infoset.

Binding a source to a sink is done by implementing or just using:

  • A source that will read the data source and stream properties.
  • A sink that receives streamed properties and creates the target data.
  • One or more filters to deal with the impedance mismatches between the source and the sink.

The strength of this architecture is that if the built-in pipe that does the binding is not flexible enough for your application, you can just add a filter that will cope with your specific requirements.
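In SAX-like terms, the contract between these pieces could be sketched as follows (hypothetical signatures, given only to illustrate the pipeline idea; they are not the actual TreeBind API):

// Same shape as the Name interface in the infoset sketch above.
interface Name {
    String getDomain();
    String getLocalName();
}

// A sink receives streamed property events.
interface PropertySink {
    void startComplexProperty(Name role, Name nature);
    void leafProperty(Name role, Name nature, Object value);
    void endComplexProperty();
}

// A source reads a data source and emits property events.
interface PropertySource {
    void streamTo(PropertySink sink);
}

// A filter sits between the two and adapts the streamed properties
// to compensate for impedance mismatches.
abstract class PropertyFilter implements PropertySink {
    protected final PropertySink next;
    protected PropertyFilter(PropertySink next) { this.next = next; }
}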

We’ll explore all that in more detail in the next entries…

Using Orbeon PresentationServer with a recent version of eXist

Why would you want to do that?

Orbeon PresentationServer is currently shipping with eXist 1.0 beta2.

This is true of both OPS version 2.8 (the current stable release) and OPS 3.0 beta 3 (the latest beta of the next generation).

While eXist 1.0 beta2 is described as the stable version of the open source XML database, its web site displays the following health warning:

The 1.0 beta2 release is truly ancient now. There were lots of bug fixes and feature enhancements during the past months, so using beta2 cannot be recommended any more. Please download a newer development snapshot. Recent development snapshots can be regarded as stable. A new official « stable » release is in preparation, but as usual, we lack the time to complete the documentation. Any help will be welcome!

Among the many enhancements included in more recent versions, transactions and crash recovery are well worth mentioning:

After several months of development, eXist does now support full crash recovery. Crash recovery means that the database can automatically recover from an unclean termination, e.g. caused by a killed jvm, power loss, system reboot or hanging processes.

This might be one reason for the corruptions noticed in my experience with OPS and eXist, and that has been my motivation for migrating http://apiculteurs.info to the latest eXist snapshot.

While this is not rocket science, the following notes may help you if you want to attempt the same migration.

Environment

My environment is Ubuntu Hoary, Sun j2sdk 1.4 and/or 1.5, Jetty and OPS 2.8, but the same procedure should be valid for other environments.

Migration

Database backup

The physical database format has changed between these versions and, if you have to keep a database through this migration, you need to back up the database using the eXist client before starting the actual migration.

I’ll cover how to use the eXist client with an eXist database embedded in OPS in a future blog entry; in the meantime, you can refer to this thread of the ops-users mailing list.

After you’ve done this backup, remove the content of the old database:

rm orbeon/WEB-INF/exist-data/*

Removing the old libraries

You should then stop your servlet and move to the « orbeon/WEB-INF/lib » directory, where you’ll find four eXist libraries:

orbeon/WEB-INF/lib/exist-1_0b2_build_1107.jar
orbeon/WEB-INF/lib/exist-optional-1_0b2_build_1107.jar
orbeon/WEB-INF/lib/xmldb-exist_1_0b2_build_1107.jar
orbeon/WEB-INF/lib/xmlrpc-1_2_patched_exist_1_0b2_build_1107.jar
            

Remove these four libraries from « orbeon/WEB-INF/lib » and keep them somewhere else in case you want to move back to eXist 1.0 beta2 later on.

Installing the eXist snapshot

Install the eXist snapshot through:

java -jar eXist-snapshot-20050805.jar

Choose whatever directory you want for this new version, but keep it out of the scope of your OPS install: we are doing this installation only to get the new libraries!

Installing the new libraries

You need to copy five eXist libraries into « orbeon/WEB-INF/lib ». If you’ve installed eXist in « /opt/eXist », move to « orbeon/WEB-INF/lib » and type:

cp /opt/eXist/exist.jar eXist-snapshot-20050805.jar
cp /opt/eXist/exist-optional.jar exist-optional-snapshot-20050805.jar
cp /opt/eXist/exist-modules.jar exist-modules-snapshot-20050805.jar
cp /opt/eXist/lib/core/xmldb.jar xmldb-eXist-snapshot-20050805.jar
cp /opt/eXist/lib/core/xmlrpc-1.2-patched.jar xmlrpc-1.2-patched-eXist-snapshot-20050805.jar
            

Move to Java 5.0

eXist now relies on some Java 5.0 classes and if you try to use it with j2sdk 1.4, you’ll run into errors such as:

22:37:11.168 WARN!! [SocketListener0-9] org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:574) >11> Error for /orbeon/apiculteurs/administration/statistiques/montre
java.lang.NoClassDefFoundError: javax/xml/datatype/DatatypeConfigurationException
	at org.exist.xquery.value.AbstractDateTimeValue.<clinit>(AbstractDateTimeValue.java:157)
	at org.exist.xquery.functions.FunCurrentDateTime.eval(FunCurrentDateTime.java:51)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:144)
	at org.exist.xquery.EnclosedExpr.eval(EnclosedExpr.java:58)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:144)
	at org.exist.xquery.ElementConstructor.eval(ElementConstructor.java:173)
	at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:43)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:159)
            

To fix that, the simplest solution (assuming your application supports it) is to move your servlet to j2sdk 1.5.

Restart, restore and enjoy

You’re almost done!

Restart your servlet, restore your database using the eXist client and enjoy your brand new eXist installation.

After a servlet reload, in the servlet log, you’ll notice new messages:

            2005-10-07 08:51:08,615 INFO  org.exist.storage.XQueryPool null - QueryPool: maxStackSize = 5; timeout = 120000; timeoutCheckInterval = 30000
Scanning journal  [==                                                ] (4 %)
Scanning journal  [====                                              ] (8 %)
Scanning journal  [======                                            ] (12 %)
Scanning journal  [========                                          ] (16 %)
Scanning journal  [==========                                        ] (20 %)
Scanning journal  [=================                                 ] (34 %)
Scanning journal  [====================                              ] (40 %)
Scanning journal  [==============================                    ] (60 %)
Scanning journal  [========================================          ] (80 %)
            2005-10-07 08:51:19,713 INFO  org.orbeon.oxf.pipeline.InitUtils null - /apicu

These messages confirm that your eXist installation is now using a journal.

Lawyers shouldn’t edit XML documents

One of my customers found out that the DTD published by Sun to validate property files in J2SE 1.5.0 is not well formed!

The javadoc explains:

Note that the system URI (http://java.sun.com/dtd/properties.dtd) is not accessed when exporting or importing properties; it merely serves as a string to uniquely identify the DTD, which is:

<?xml version="1.0" encoding="UTF-8"?>

<!-- DTD for properties -->

<!ELEMENT properties ( comment?, entry* ) >

<!ATTLIST properties version CDATA #FIXED "1.0">

<!ELEMENT comment (#PCDATA) >

<!ELEMENT entry (#PCDATA) >

<!ATTLIST entry key CDATA #REQUIRED>
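For context, these XML property files are typically written and read with the loadFromXML and storeToXML methods added to java.util.Properties in J2SE 5.0; the exported document is what carries the DOCTYPE pointing at this system URI. A minimal round trip:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class PropertiesXmlDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("greeting", "hello");

        // Writes an XML document whose DOCTYPE references
        // http://java.sun.com/dtd/properties.dtd
        props.storeToXML(new FileOutputStream("props.xml"), "a comment");

        Properties reloaded = new Properties();
        reloaded.loadFromXML(new FileInputStream("props.xml"));
        System.out.println(reloaded.getProperty("greeting"));
    }
}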

Reducing the system URI to a mere identifier is a simplification that can lead to problems when you parse your document: XML parsers are free to load DTDs even if you specify standalone="yes" in your XML declaration and even if you run them in non-validating mode.

In that case, including a system URI pointing to a non-well-formed DTD means that, depending on your parser and on the options you set at parse time, you may (or may not) get a well-formedness error.
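If you want to make sure your parser never fetches that DTD, one option with Xerces-based parsers (including the one bundled with the JDK) is to disable external DTD loading; a short sketch, assuming the Xerces feature URI is recognised by your parser:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class NoDtdParse {
    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        // Xerces-specific feature: do not fetch the external DTD at all.
        // A non-Xerces parser may throw SAXNotRecognizedException here.
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.parse(new InputSource("props.xml"));
    }
}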

Interestingly, the DTD listed above and borrowed from the javadoc is well formed.

The DTD published at http://java.sun.com/dtd/properties.dtd appears to have been modified to:

<!--
   Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
  -->

<?xml version="1.0" encoding="UTF-8"?>

<!-- DTD for properties -->

<!ELEMENT properties ( comment?, entry* ) >

<!ATTLIST properties version CDATA #FIXED "1.0">

<!ELEMENT comment (#PCDATA) >

<!ELEMENT entry (#PCDATA) >

<!ATTLIST entry key CDATA #REQUIRED>

See what happened? Someone probably insisted on adding a copyright statement at the beginning of each of their documents, forgetting that XML forbids comments before the XML declaration…

We shouldn’t let lawyers edit XML documents!

Unemployment figures

Two small figures about unemployment in France, gleaned while listening to the very interesting programme « La nouvelle fabrique » with Richard Dethyre this morning on France Culture.

According to this guest:

  • only four out of ten unemployed people are said to be counted in the official unemployment statistics,
  • the labour force participation rate in France is said to be 63%.