The first feature-complete version of my MarkLogic persistence layer for Orbeon Form Runner is available. As far as I know nobody is using it yet, so there is still time to think about what will happen with new versions.
Versioning dataset “schemes” is always hard.
I am using the term “scheme” to refer to both structures and conventions used to store and retrieve information.
I would have used the term “schema” if it didn’t have different meanings depending on the technology (SQL database schemas, XML schemas, RDF Schema, …), all of which are part of what I am calling a “scheme” in this post.
By “dataset” I mean a set of data which the application keeps isolated from the others, by whatever means. A dataset is a logical concept which overlaps the concept of a database as defined by the different database implementations.
It’s a generic issue. If you take the example of WordPress, which is powering this blog, most of its plugins include a mechanism to check that their database schemes are up to date with their versions and to perform automatic upgrades when needed (see for instance the section called “Store the Plugin Version for Upgrades” and the following one in this article by Dave Donaldson).
This is most needed for applications such as WordPress, where the end user is often the administrator of their own blog. Enterprise applications usually have system administrators to deal with upgrades, but there might still be something to borrow from this approach.
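To make the idea concrete, here is a minimal sketch of that kind of check, written in Python rather than PHP or XQuery for brevity. It is not taken from WordPress or from the persistence layer: the `scheme_version` key, the upgrade steps and the version numbers are all hypothetical. The principle is simply that the code knows which scheme version it expects, the dataset records which version it is in, and the missing upgrade steps are applied one at a time.

```python
from typing import Callable, Dict

# Scheme version expected by the running code (hypothetical value).
CODE_SCHEME_VERSION = 3


def _v1_to_v2(dataset: dict) -> None:
    # Hypothetical step: v2 adds a place for extra document properties.
    dataset.setdefault("properties", {})


def _v2_to_v3(dataset: dict) -> None:
    # Hypothetical step: v3 adds collections.
    dataset.setdefault("collections", [])


# Upgrade steps keyed by the version they start from.
UPGRADES: Dict[int, Callable[[dict], None]] = {1: _v1_to_v2, 2: _v2_to_v3}


def ensure_up_to_date(dataset: dict) -> dict:
    """Apply the missing upgrade steps, one version at a time."""
    # A dataset without a recorded version is assumed to be in v1.
    version = dataset.get("scheme_version", 1)
    if version > CODE_SCHEME_VERSION:
        raise RuntimeError("dataset scheme is newer than the code")
    while version < CODE_SCHEME_VERSION:
        UPGRADES[version](dataset)
        version += 1
        # Record progress after each step so a failure leaves a known state.
        dataset["scheme_version"] = version
    return dataset
```

Calling `ensure_up_to_date` on a dataset which is already in the expected version does nothing, so the check can safely run each time the code starts.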
As in a WordPress plugin, in most database applications such as my persistence layer you have on one side the “program” (implemented in XQuery, pipelines or whatever for XML databases) and on the other side one or several “datasets”.
This persistence layer supports storing multiple “datasets” in a single database.
A use case for that could be for a provider to define a common dataset as a persistence layer for form definitions shared by different customers, a dataset per customer, a dataset for demonstration purposes, … All these datasets could be either in separate MarkLogic databases or in a single database.
There is currently a single version of the persistence layer, so no mismatch is possible between the XQuery modules which implement the REST API and the dataset schemes.
The situation might be different in a few months.
There might be, for instance, a version 2 implementing the optional features which haven’t been implemented yet. These features will likely rely on new document properties added to the current ones and thus rely on a version 2 of the dataset scheme.
The current scheme relies on URIs to store application and form names and document types in a directory fashion. This seems the natural thing to do because it mirrors the URI structures in the persistence layer specification. Other options could be to use MarkLogic collections or document properties to store this information, and we might decide to go this way in a version 3.
Our provider could then be in a situation where the dataset with the form definitions would still be in v1 (if it doesn’t use features added in v2, why should it be migrated?), the datasets with customer data would be in v2 and the dataset used for demo purposes would be testing v3.
Traditional approaches rely on upgrade scripts executed by system administrators. These scripts can be run through Roxy, which is the preferred deployment tool for MarkLogic applications such as this persistence layer.
In that case, system administrators need to carefully keep the versions of the modules and of the dataset schemes synchronized, and supporting multiple versions of the modules for a single database can be tough even if the module URIs can be versioned.
This is where the WordPress plugin approach may be useful.
What about adding some kind of metadata to each dataset to determine its version?
The URL rewriter would retrieve this information and choose the REST library options corresponding to a specific version to execute the right version of the modules.
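As a rough illustration of that dispatch (in Python rather than XQuery, and without using any MarkLogic API), the sketch below maps a dataset to its recorded scheme version and then to a versioned module. The dataset names mirror the provider scenario above. The module paths and the way a request is attached to a dataset are assumptions made for the example.

```python
from typing import Dict

# Hypothetical metadata lookup: dataset name -> scheme version.
DATASET_VERSIONS: Dict[str, int] = {
    "form-definitions": 1,
    "customer-a": 2,
    "demo": 3,
}

# Hypothetical versioned module URIs, one per supported scheme version.
MODULES: Dict[int, str] = {
    1: "/modules/v1/rest.xqy",
    2: "/modules/v2/rest.xqy",
    3: "/modules/v3/rest.xqy",
}


def select_module(dataset: str) -> str:
    """Return the module able to serve the given dataset."""
    if dataset not in DATASET_VERSIONS:
        raise KeyError(f"unknown dataset: {dataset}")
    version = DATASET_VERSIONS[dataset]
    if version not in MODULES:
        # The dataset is in a version for which no modules are deployed.
        raise LookupError(f"no modules deployed for scheme v{version}")
    return MODULES[version]
```

With such a lookup, several versions of the modules can be deployed side by side and each dataset keeps being served by the version it was written for until it is upgraded.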
An administrative REST API could easily be added to list datasets, display their metadata and perform dataset upgrades (and downgrades if available), and this REST API could be used by Roxy.
The idea of adding dataset metadata seems really powerful but what should such metadata include?
The minimum would be to identify the applications using the datasets and their versions, but what about adding more information, such as some kind of user-readable documentation, URIs to additional documentation and resources, …?
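Purely as an illustration of what such a record might hold, and not as a proposed vocabulary, here is a small Python sketch. The field names and the application name are hypothetical. JSON is used only to show that the record can be stored next to the data and read back unchanged.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    # The minimum: which application owns the dataset, and in which version.
    application: str
    scheme_version: str
    # Optional extras: human-readable notes and pointers to documentation.
    description: str = ""
    documentation_uris: List[str] = field(default_factory=list)


# Hypothetical metadata for the shared form definitions dataset.
meta = DatasetMetadata(
    application="orbeon-persistence-layer",
    scheme_version="1",
    description="Form definitions shared by all customers",
)

# Round trip: serialize the record, then rebuild it from the stored form.
serialized = json.dumps(asdict(meta), sort_keys=True)
restored = DatasetMetadata(**json.loads(serialized))
```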
Proposing a vocabulary to define such information is an interesting exercise that I’ll be happy to do if needed but I can’t believe it has never been done…
If you are aware of something similar, please tell us!
Many thanks to Peter Kester (MarkLogic) for sharing his thoughts on the subject.