Dataset versioning

The first feature-complete version of my MarkLogic persistence layer for Orbeon Form Runner is available, but as far as I know nobody is using it yet, so there is still time to think about what will happen with new versions.

Versioning dataset « schemes » is always hard.

I am using the term « scheme » to refer to both structures and conventions used to store and retrieve information.

I would have used the term « schema » if it didn’t have a different meaning depending on the technology (SQL database schemas, XML schemas, RDF Schema, …), all of which are part of what I am calling a « scheme » in this post.

By « dataset » I mean a set of data isolated, by whatever means, from the others for the application. A dataset is a logical concept which overlaps the concept of a database as defined by the different database implementations.

It’s a generic issue: if you take the example of WordPress, which is powering this blog, most of its plugins include a mechanism to check that their database scheme is up to date with their version and to perform automatic upgrades when needed (see for instance the section called « Store the Plugin Version for Upgrades » and the following one in this article by Dave Donaldson).

This is most needed for applications such as WordPress, where the end user is often the administrator of his own blog. Enterprise applications usually have system administrators to deal with upgrades, but there might still be something to borrow from this approach.

As in a WordPress plugin, in most database applications such as my persistence layer you have on one side the « program » (implemented in XQuery, pipelines or whatever for XML databases) and on the other side one or several « datasets ».

This persistence layer supports storing multiple « datasets » in a single database.

A use case for that could be a provider defining a common dataset for form definitions shared by different customers, one dataset per customer, a dataset for demonstration purposes, … All these datasets could live either in separate MarkLogic databases or in a single database.

There is currently a single version of the persistence layer, and there can be no mismatch between the XQuery modules which implement the REST API and the dataset schemes.

The situation might be different in a few months.

There might be, for instance, a version 2 implementing the optional features which haven’t been implemented yet. These features will likely rely on new document properties added to the current ones, and thus on a version 2 of the dataset scheme.

The current scheme relies on URIs to store application and form names and document types in a directory fashion. This seems the natural thing to do because it mirrors the URI structures of the persistence layer specification. Other options would be to use MarkLogic collections or document properties to store this information, and we might decide to go that way in a version 3.
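To make the difference concrete, here is a sketch of how the same lookup could look under the URI-based scheme and under a hypothetical collection-based v3 scheme. The application and form names, and the collection naming convention, are made up for the example:

```xquery
(: Hypothetical document URIs mirroring the persistence API,
   e.g. « acme.com/hr-leave-request/data/1234/data.xml » :)

(: v1/v2 scheme: application and form names are encoded in the URI :)
cts:search(/, cts:directory-query('acme.com/hr-leave-request/data/', "infinity")),

(: a possible v3 scheme: the same information carried by collections :)
cts:search(/, cts:collection-query(('app:acme.com', 'form:hr-leave-request')))
```

The second form would let documents move freely in the URI space, at the price of maintaining the collections at insertion time.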

Our provider could then be in a situation where the dataset with the form definitions is still in v1 (if it doesn’t use the features added in v2, why should it migrate?), the datasets with customer data are in v2, and the dataset used for demo purposes is testing v3.

Traditional approaches rely on upgrade scripts executed by system administrators; these can be run through Roxy, the preferred deployment tool for MarkLogic applications such as this persistence layer.

In that case, system administrators need to carefully keep module and dataset scheme versions synchronized, and supporting multiple versions of the modules for a single database can be tough even if the module URIs can be versioned.

This is where the WordPress plugin approach may be useful.

What about adding some kind of metadata to each dataset to determine its version?

The URL rewriter would retrieve this information and choose the REST library options corresponding to that specific version, in order to execute the right version of the modules.
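As a rough sketch of the idea (the metadata URI, the element names and the versioned module paths are all assumptions of mine, not part of the current implementation), the rewriter could look up the version like this:

```xquery
xquery version "1.0-ml";

(: Hypothetical convention: each dataset stores its scheme version in a
   metadata document at a well-known URI inside the dataset directory. :)
declare function local:dataset-version($dataset as xs:string) as xs:string
{
  let $metadata := fn:doc(fn:concat($dataset, '/metadata.xml'))
  return ($metadata/dataset/version/fn:string(), '1')[1] (: default to v1 :)
};

(: The rewriter would then dispatch to versioned module URIs,
   e.g. « /v2/crud.xqy » — the naming is again an assumption. :)
let $dataset := 'acme.com'
return fn:concat('/v', local:dataset-version($dataset), '/crud.xqy')
```

Datasets created before this convention exists would have no metadata document, hence the default to version 1.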

An administrative REST API could easily be added to list datasets, display their metadata and perform dataset upgrades (and downgrades if available), and this REST API could be used by Roxy.

The idea of adding dataset metadata seems really powerful but what should such metadata include?

The minimum would be to identify the applications using the dataset and their versions, but what about adding more information, such as some kind of user-readable documentation, URIs to additional documentation and resources, …?
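For what it’s worth, here is one possible shape for such a metadata document (an XML literal, which is also a valid XQuery expression); every name in it is hypothetical:

```xquery
(: An entirely hypothetical metadata vocabulary for a dataset :)
<dataset>
  <application>orbeon-form-runner-persistence</application>
  <version>2</version>
  <description>Customer data for acme.com</description>
  <documentation href="http://example.com/docs/acme-dataset"/>
</dataset>
```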

Proposing a vocabulary to define such information is an interesting exercise that I’ll be happy to do if needed, but I can’t believe it has never been done before…

If you are aware of something similar, please tell us!

Many thanks to Peter Kester (MarkLogic) for sharing his thoughts on the subject.

First steps with MarkLogic

[Edited to take into account Dave Cassel’s comments]

To get started with MarkLogic I have chosen to develop a persistence layer for Orbeon Form Runner.

This is the kind of project I like: small enough to be done in a few days, yet technical enough to touch on advanced topics, and potentially useful to other people.

The project is available on my community site and I’d like to share in this post my feelings during this first contact with MarkLogic.

The first contact with a new product is the installation, and I have been really surprised by the simplicity of the MarkLogic installation process. My laptop is running Ubuntu, which is not a supported platform, but the install went very smoothly after converting the RPM package as documented everywhere on the web, and it didn’t take me more than a few minutes to get MarkLogic up and running.

The second contact, with the admin interface, has been less obvious: MarkLogic comes with a series of web UIs from different generations (admin, configuration manager, monitoring, information studio, application builder and query console) and it’s not always easy to find your way among these tools.

I must also say that I am an old-school administrator who prefers configuration files to point-and-click administration windows!

Fortunately this is well documented and I have rapidly been able to create a new database and servers for my project. The interface with my favorite XML tool, the oXygen XML editor, has also been very easy to set up.

The feeling that hasn’t left me throughout this project is one of stability and robustness: I have never needed to restart the server, all the configuration changes have been made while the server was up and running, and I have never seen a crash or an incomprehensible error message.

In other words, MarkLogic is the kind of software which makes you feel secure and comfortable!

A Form Runner persistence layer is a REST API, and such APIs are reasonably easy to implement in MarkLogic thanks to their REST library. I think I have found a bug (I am pretty good at that, as all the products I have worked with will tell you), but it was in a minor function and nothing really blocking.

Something to note if you want to try it yourself is that paths to documents in a database do not always start with a « / », and « foo/bar » is a different directory from « /foo/bar ». To search all the documents under « foo/bar/ » you’ll write something such as:

cts:search(/, cts:directory-query('foo/bar/', "infinity"))

If you forget the trailing slash (foo/bar) MarkLogic will raise an error with a self-explanatory message, but if you add a leading slash (/foo/bar/) as you would on any decent file system, you will search in a different directory and your search may silently return an empty sequence!

In fact, as pointed out by Dave Cassel, MarkLogic considers that « foo/ » is a root directory just like « / », and « /foo/ » is a subdirectory of the root directory « / ». A database can thus have as many root directories as you want, but you need to be careful: if you insert a document as « foo/bar/bat.xml » you won’t be able to find it as « /foo/bar/bat.xml »!
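A short Query Console session makes this visible (the URIs and element names are of course made up; the semicolon separates the insert and the reads into two transactions so the reads can see the documents):

```xquery
(: Two distinct documents: « foo/ » and « /foo/ » are different roots :)
xdmp:document-insert('foo/bar/bat.xml', <in-foo/>),
xdmp:document-insert('/foo/bar/bat.xml', <in-slash-foo/>);

(: Each URI resolves to its own document, with no overlap :)
fn:doc('foo/bar/bat.xml'),
fn:doc('/foo/bar/bat.xml')
```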

And as you’ve noticed with this simple snippet, you’ll have to use many proprietary functions to develop XQuery applications in MarkLogic. This is not really a problem specific to MarkLogic: XQuery has been defined to be generic and we use it for things which are well beyond its original scope.

The good news is that MarkLogic comes with a very extensive library and that you won’t be blocked in your developments. The bad news is of course that what you’ll develop in MarkLogic won’t be easily portable to other XML databases.

The last thing I want to report is the quality of the online documentation, on MarkLogic Community but also on the web at large, and on Stack Overflow in particular: during my development I have always been able to find answers to my many questions in a very reasonable amount of time.

To summarize, I haven’t had the opportunity to test the support of big data yet, but this first contact leaves me with a very positive feeling of a product which is mature, stable, rich in features, well documented and well supported by its community.