The complex simple problem of media types

Media types (previously called MIME types) have always struck me as something that is simple in theory but awfully complex in practice.

When they seem to be working, you can enjoy your luck, but be sure that it will only last until the next release of software X or Y: after the recent upgrade from Ubuntu Hoary to Breezy, my workstation insists that my « audio/x-mpegurl » playlists are « text/plain », and when I associate XMMS with these files it uses XMMS to open all my « text/plain » documents!

I recently had the opportunity to look a little deeper into these issues for a project of mine that needs to determine the media types of files in Java.

Freedesktop.org comes to the rescue

The problem is more complex than it appears, and it’s comforting to know that some people seem to be doing exactly what needs to be done to fix it.

Freedesktop.org has been working for a while on a shared database of media types and has published its specification.

Gnome and KDE are participating and I hope that this means the end of the media types nightmare on my desktop…

I really like the principles they have adopted, especially the simple XML format used to describe the media types (which they still call mime types).

One thing which is surprising when you first open the XML document describing this shared mime types database is that it includes an internal DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mime-info [
  <!ELEMENT mime-info (mime-type)+>
  <!ATTLIST mime-info xmlns CDATA #FIXED "http://www.freedesktop.org/standards/shared-mime-info">

  <!ELEMENT mime-type (comment|glob|magic|root-XML|alias|sub-class-of)*>
  <!ATTLIST mime-type type CDATA #REQUIRED>

  <!ELEMENT comment (#PCDATA)>
  <!ATTLIST comment xml:lang CDATA #IMPLIED>

  <!ELEMENT glob EMPTY>
  <!ATTLIST glob pattern CDATA #REQUIRED>

  <!ELEMENT magic (match)+>
  <!ATTLIST magic priority CDATA #IMPLIED>

  <!ELEMENT match (match)*>
  <!ATTLIST match offset CDATA #REQUIRED>
  <!ATTLIST match type (string|big16|big32|little16|little32|host16|host32|byte) #REQUIRED>
  <!ATTLIST match value CDATA #REQUIRED>
  <!ATTLIST match mask CDATA #IMPLIED>

  <!ELEMENT root-XML EMPTY>
  <!ATTLIST root-XML
  	namespaceURI CDATA #REQUIRED
	localName CDATA #REQUIRED>

  <!ELEMENT alias EMPTY>
  <!ATTLIST alias
  	type CDATA #REQUIRED>

  <!ELEMENT sub-class-of EMPTY>
  <!ATTLIST sub-class-of
  	type CDATA #REQUIRED>
]>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">

That’s unusual and often considered bad practice, since the DTD isn’t shared between documents. When you think more about it, though, in this specific context where there should be only one of these documents per machine, it makes perfect sense.

Using an internal DTD solves all the packaging issues: there is only one self-contained document to ship, and this document has no external dependencies. Furthermore, the DTD is pretty straightforward, and including it in the document itself makes the document more self-describing.
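To give an idea of how little code is needed to read this format, here is a minimal Python sketch. The sample document is a tiny hand-written one (not the real database), and the names `load_globs` and `guess_by_name` are mine; only the `glob` rules are used, so this is far from a complete implementation of the specification.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Namespace of the shared-mime-info vocabulary, in ElementTree's "{uri}" notation.
NS = "{http://www.freedesktop.org/standards/shared-mime-info}"

# Hand-written sample, NOT the real database shipped by freedesktop.org.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="audio/x-mpegurl">
    <glob pattern="*.m3u"/>
  </mime-type>
  <mime-type type="text/plain">
    <glob pattern="*.txt"/>
  </mime-type>
</mime-info>
"""


def load_globs(xml_bytes):
    """Return a list of (glob pattern, media type) pairs found in the document."""
    root = ET.fromstring(xml_bytes)
    pairs = []
    for mime_type in root.findall(NS + "mime-type"):
        for glob in mime_type.findall(NS + "glob"):
            pairs.append((glob.get("pattern"), mime_type.get("type")))
    return pairs


def guess_by_name(filename, pairs):
    """Return the media type of the first glob matching the file name, or None."""
    for pattern, media_type in pairs:
        if fnmatch.fnmatchcase(filename, pattern):
            return media_type
    return None
```

Calling `guess_by_name("playlist.m3u", load_globs(SAMPLE))` looks the name up against the sample rules; a real implementation would also have to handle magic rules, priorities and conflicting globs.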

This vocabulary is meant to be extensible through namespaces:

Applications may also define their own elements, provided they are namespaced to prevent collisions. Unknown elements are copied directly to the output XML files like comment elements.

I think that they could have allowed external attributes as well, since they are usually quite harmless.

The mechanism defined to distinguish between different types of XML documents appears to be somewhat weak, since it relies only on the namespace and local name of the root element.

Leaving aside the problem of compound documents, this mechanism completely misses the fact that a document such as:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns="http://www.w3.org/1999/xhtml" xsl:version="1.0">
    .../...
</html>

isn’t an XHTML document but an XSLT transformation!
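A detector that wanted to catch this case would have to look one step beyond the root element name. The following Python sketch shows the idea: it checks for an `xsl:version` attribute on the root element before falling back to the namespace and local name. The function name `classify_root` and the media type strings it returns are my own choices for illustration, not something the freedesktop.org specification defines.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XSLT_NS = "http://www.w3.org/1999/XSL/Transform"

# The document shown above, with a placeholder body.
SIMPLIFIED_STYLESHEET = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns="http://www.w3.org/1999/xhtml" xsl:version="1.0">
  <body/>
</html>
"""


def classify_root(xml_bytes):
    """Guess a media type from the root element (illustrative type names)."""
    root = ET.fromstring(xml_bytes)
    # ElementTree spells qualified names as "{namespace}local".
    if root.tag.startswith("{"):
        namespace, local = root.tag[1:].split("}", 1)
    else:
        namespace, local = "", root.tag
    # A regular stylesheet: root element is xsl:stylesheet or xsl:transform.
    if namespace == XSLT_NS and local in ("stylesheet", "transform"):
        return "application/xslt+xml"
    # A simplified stylesheet: any root element carrying an xsl:version attribute.
    if "{%s}version" % XSLT_NS in root.attrib:
        return "application/xslt+xml"
    if (namespace, local) == (XHTML_NS, "html"):
        return "application/xhtml+xml"
    return "application/xml"
```

A rule based only on the root element's namespace and local name has no way to express the second test, which is why the document above would be reported as XHTML.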

The other thing that I regret, probably because I am not familiar with these issues, is that the description of the « magic » rules is very concise.

One of the questions I would have liked to see answered is which encodings we should try when doing string matching. When I see the following rule in the description of the « application/xml » type:

    <magic priority="50">
      <match value="&lt;?xml" type="string" offset="0"/>
    </magic>

I have the feeling that different encodings should be tried: ASCII would also work for UTF-8, ISO-8859-1 and the like, but it would fail for UTF-16 or EBCDIC…

On the other hand, there are probably many text formats that do not support UTF-16 or EBCDIC and for which it would be a mistake to try these encodings…
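The encoding issue is easy to demonstrate. In this Python sketch, `matches_string` does a plain byte comparison at an offset (my reading of what a « string » match does) and `sniff_encoding` is a hypothetical helper that retries the same magic value in several candidate encodings. The list of candidates is an assumption of mine; the specification doesn't say which ones to try.

```python
MAGIC_TEXT = "<?xml"


def matches_string(data, value, offset=0):
    """Plain byte-for-byte comparison of `value` against `data` at `offset`."""
    return data[offset:offset + len(value)] == value


def sniff_encoding(data, text=MAGIC_TEXT,
                   encodings=("ascii", "utf-16-le", "utf-16-be", "cp037")):
    """Return the first candidate encoding in which `text` matches at offset 0.

    The candidate list is hypothetical; cp037 is one common EBCDIC code page.
    """
    for encoding in encodings:
        if matches_string(data, text.encode(encoding)):
            return encoding
    return None


document = '<?xml version="1.0"?><root/>'

# The ASCII form of the magic value matches UTF-8 and ISO-8859-1 documents...
ascii_magic = MAGIC_TEXT.encode("ascii")
utf8_ok = matches_string(document.encode("utf-8"), ascii_magic)
latin1_ok = matches_string(document.encode("iso-8859-1"), ascii_magic)

# ...but not UTF-16 or EBCDIC ones, where the same characters use other bytes.
utf16_ok = matches_string(document.encode("utf-16-le"), ascii_magic)
ebcdic_ok = matches_string(document.encode("cp037"), ascii_magic)
```

Trying every encoding for every rule would obviously produce false positives for formats that are never stored in UTF-16 or EBCDIC, which is the other half of the dilemma.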

More implementations needed!

Having found this gem, I was pretty confident I would find a Java implementation…

There is one buried in Sun’s Java Desktop System (which isn’t open source) and one in Nutch, but that seems to be pretty much everything we have available!

The MimeType class in Nutch appears to be quite minimal. It probably does what most applications want to do, but that’s not enough for what I’d like to achieve.

The mime types database has some advanced features such as type hierarchy: a mime type can be a subclass of other mime types, for instance all the text types are subclasses of the text/plain type.

These hierarchies can be implicit or explicit and they support multiple inheritance. The freedesktop.org specification gives the following example:

Some types may or may not be instances of other types. For example, a spreadsheet file may be compressed or not. It is a valid spreadsheet file either way, but only inherits from application/x-gzip in one case. This information cannot be represented statically; instead an application interested in this information should run all of the magic rules, and use the list of types returned as the subclasses.

These hierarchies should be used by user interfaces: instead of offering only the tools registered for a specific type, a user interface should also offer the tools registered for its parent classes. If they did, they would let me open an SVG document with an XML or text editor, or an OpenOffice document with a zip extractor, which can be very handy.
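Here is a rough Python sketch of what I have in mind. The `SUBCLASS_OF` and `HANDLERS` tables are hand-written for the example (the tool names are invented), and the breadth-first walk is just one reasonable way to order the parents when multiple inheritance is involved.

```python
# Hand-written excerpt of a type hierarchy (explicit sub-class-of relations).
SUBCLASS_OF = {
    "image/svg+xml": ["application/xml"],
    "application/xml": ["text/plain"],
    "application/vnd.sun.xml.writer": ["application/zip"],
}

# Hypothetical tools registered for each type.
HANDLERS = {
    "image/svg+xml": ["svg-viewer"],
    "application/xml": ["xml-editor"],
    "text/plain": ["text-editor"],
    "application/zip": ["zip-extractor"],
}


def with_ancestors(media_type, parents=SUBCLASS_OF):
    """Return the type followed by its ancestors, nearest first, no duplicates."""
    ordered = [media_type]
    queue = [media_type]
    while queue:
        current = queue.pop(0)
        for parent in parents.get(current, []):
            if parent not in ordered:  # also protects against cycles
                ordered.append(parent)
                queue.append(parent)
    return ordered


def tools_for(media_type, parents=SUBCLASS_OF, handlers=HANDLERS):
    """Return the tools registered for the type and for all its parent types."""
    tools = []
    for candidate in with_ancestors(media_type, parents):
        for tool in handlers.get(candidate, []):
            if tool not in tools:
                tools.append(tool)
    return tools
```

With these tables, `tools_for("image/svg+xml")` lists the SVG viewer first, then the XML editor and the text editor inherited from the parent types.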

Those are the kind of features I’d expect to see in a mime type API, and I am wondering whether I will have to write my own Java implementation to see that happen!

Modifying the dependencies of a Debian package

When you install packages from unofficial sources, the dependencies declared in the packages frequently fail to match the system on which you are installing them.

This is the case, for example, when you install the current version of Opera on Ubuntu 5.10 « Breezy Badger » or that of Skype on Ubuntu 5.04 « Hoary Hedgehog » (the console messages below are in French because of my locale):

vdv@vaio:~ $  sudo dpkg -i /opt/downloads/skype_1.2.0.17-1_i386.deb
Password:
(Lecture de la base de données... 169348 fichiers et répertoires déjà installés.)
Préparation du remplacement de skype 1.2.0.17-1 (en utilisant .../skype_1.2.0.17-1_i386.deb) ...
Dépaquetage de la mise à jour de skype ...
dpkg : des problèmes de dépendances empêchent la configuration de skype :
 skype dépend de libqt3c102-mt (>= 3:3.3.3.2) ; cependant :
  La version de libqt3c102-mt sur le système est 3:3.3.3-7ubuntu3.
dpkg : erreur de traitement de skype (--install) :
 problèmes de dépendances - laissé non configuré
Des erreurs ont été rencontrées pendant l'exécution :
 skype

Faced with this kind of situation, you can force the installation with the « --force-depends » option:

vdv@vaio:~ $  sudo dpkg --force-depends -i /opt/downloads/skype_1.2.0.17-1_i386.deb
Password:
(Lecture de la base de données... 169348 fichiers et répertoires déjà installés.)
Préparation du remplacement de skype 1.2.0.17-1 (en utilisant .../skype_1.2.0.17-1_i386.deb) ...
Dépaquetage de la mise à jour de skype ...
dpkg : skype : problèmes de dépendances, mais configuration comme demandé :
 skype dépend de libqt3c102-mt (>= 3:3.3.3.2) ; cependant :
  La version de libqt3c102-mt sur le système est 3:3.3.3-7ubuntu3.
Paramétrage de skype (1.2.0.17-1) ...

        

This lets you test the application and check that it works on your system, but the package is considered « broken », and the system doesn’t hesitate to remind you of it:

vdv@vaio:~ $ sudo apt-get dist-upgrade
Lecture des listes de paquets... Fait
Construction de l'arbre des dépendances... Fait
Vous pouvez lancer « apt-get -f install » pour corriger ces problèmes.
Les paquets suivants contiennent des dépendances non satisfaites :
  skype: Dépend: libqt3c102-mt (>= 3:3.3.3.2) mais 3:3.3.3-7ubuntu3 est installé
E: Dépendances manquantes. Essayez d'utiliser l'option -f.
            

If you try « apt-get -f install », it offers to uninstall the offending package, and you are back where you started:

vdv@vaio:~ $ sudo apt-get -f install
Lecture des listes de paquets... Fait
Construction de l'arbre des dépendances... Fait
Correction des dépendances... Fait
Les paquets suivants seront ENLEVÉS :
  skype
0 mis à jour, 0 nouvellement installés, 1 à enlever et 3 non mis à jour.
Il est nécessaire de prendre 0o dans les archives.
Après dépaquetage, 9160ko d'espace disque seront libérés.
Souhaitez-vous continuer [O/n] ? n
Annulation.
            

One solution to this problem is to fix the dependencies in the package itself, which is much easier than you might fear…

To do so, you need to extract the files from the package:

vdv@vaio:~ $ cd /tmp
vdv@vaio:/tmp $ dpkg-deb -x /opt/downloads/skype_1.2.0.17-1_i386.deb skype_1.2.0.17-1_i386
            

This command doesn’t extract the control files, which need to be extracted in a second step:

vdv@vaio:/tmp $ mkdir skype_1.2.0.17-1_i386/DEBIAN
vdv@vaio:/tmp $ dpkg-deb -e /opt/downloads/skype_1.2.0.17-1_i386.deb skype_1.2.0.17-1_i386/DEBIAN
            

You can now edit the control file:

vdv@vaio:/tmp $ gvim skype_1.2.0.17-1_i386/DEBIAN/control
            

In order to replace the version number in this line:

Depends: libc6 (>= 2.3.2.ds1-4), libgcc1 (>= 1:3.4.1-3), libqt3c102-mt (>= 3:3.3.3.2), libstdc++5 (>= 1:3.3.4-1), libx11-6 | xlibs (>> 4.1.0), libxext6 | xlibs (>> 4.1.0)
            

with the one from our installation:

Depends: libc6 (>= 2.3.2.ds1-4), libgcc1 (>= 1:3.4.1-3), libqt3c102-mt (>= 3:3.3.3-7ubuntu3), libstdc++5 (>= 1:3.3.4-1), libx11-6 | xlibs (>> 4.1.0), libxext6 | xlibs (>> 4.1.0)
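If you need to do this more than once, the edit itself is easy to script. The following Python sketch rewrites the version constraint of one package in the « Depends » field of a control file held in a string. The function name `relax_dependency` is mine, and the sketch assumes that the « Depends » field fits on a single line, as it does here; it doesn't handle folded fields.

```python
import re


def relax_dependency(control_text, package, new_constraint):
    """Replace the version constraint of `package` in the Depends: line.

    Assumes Depends: is a single, unfolded line (true for this package).
    """
    # The lookbehind keeps us from matching inside a longer package name.
    pattern = re.compile(
        r"(?<![\w.+-])" + re.escape(package) + r"\s*\([^)]*\)"
    )
    replacement = "%s (%s)" % (package, new_constraint)
    lines = []
    for line in control_text.splitlines():
        if line.startswith("Depends:"):
            # A function is used as replacement so that the text is taken literally.
            line = pattern.sub(lambda match: replacement, line)
        lines.append(line)
    return "\n".join(lines) + "\n"
```

For the Skype package above, the call would be `relax_dependency(text, "libqt3c102-mt", ">= 3:3.3.3-7ubuntu3")`, where `text` is the content of « skype_1.2.0.17-1_i386/DEBIAN/control ».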

All that remains is to rebuild the package:

vdv@vaio:/tmp $ dpkg-deb -b skype_1.2.0.17-1_i386
dpkg-deb : construction du paquet « skype » dans « skype_1.2.0.17-1_i386.deb ».
            

And to reinstall it:

vdv@vaio:/tmp $ sudo dpkg -i skype_1.2.0.17-1_i386.deb
(Lecture de la base de données... 169348 fichiers et répertoires déjà installés.)
Préparation du remplacement de skype 1.2.0.17-1 (en utilisant skype_1.2.0.17-1_i386.deb) ...
Dépaquetage de la mise à jour de skype ...
Paramétrage de skype (1.2.0.17-1) ...

Of course, this only works if the dependency problem was fictitious, hence the value of testing with « dpkg --force-depends » before undertaking the operation. When that is the case, though, this fairly simple manipulation settles the problem for good… until the next version!

Using Orbeon PresentationServer with a recent version of eXist

Why would you want to do that?

Orbeon PresentationServer is currently shipping with eXist 1.0 beta2.

This is true of both OPS version 2.8 (the current stable release) and OPS 3.0 beta 3 (the latest beta of the next generation).

While eXist 1.0 beta2 is described as the stable version of the open source XML database, its web site displays the following Health Warning:

The 1.0 beta2 release is truly ancient now. There were lots of bug fixes and feature enhancements during the past months, so using beta2 cannot be recommended any more. Please download a newer development snapshot. Recent development snapshots can be regarded as stable. A new official « stable » release is in preparation, but as usual, we lack the time to complete the documentation. Any help will be welcome!

Among the many enhancements included in more recent versions, Transactions and Crash Recovery is well worth mentioning:

After several months of development, eXist does now support full crash recovery. Crash recovery means that the database can automatically recover from an unclean termination, e.g. caused by a killed jvm, power loss, system reboot or hanging processes.

This might be one cause of the corruption I have noticed when using OPS with eXist, and it has been my motivation to migrate http://apiculteurs.info to the latest eXist snapshot.

While this is not rocket science, the following notes may help you if you want to attempt the same migration.

Environment

My environment is Ubuntu Hoary, Sun j2sdk 1.4 and/or 1.5, Jetty and OPS 2.8, but the same procedure should be valid for other environments.

Migration

Database backup

The physical database format has changed between these versions and, if you have to keep a database during this migration, you need to back up the database using the eXist client before starting the actual migration.

I’ll cover how to use the eXist client with an eXist database embedded in OPS in a future blog entry; in the meantime, you can refer to this thread of the ops-users mailing list.

After you’ve done this backup, remove the content of the old database:

rm orbeon/WEB-INF/exist-data/*

Removing the old libraries

You should then stop your servlet and move to the « orbeon/WEB-INF/lib » directory, where you’ll find four eXist libraries:

orbeon/WEB-INF/lib/exist-1_0b2_build_1107.jar
orbeon/WEB-INF/lib/exist-optional-1_0b2_build_1107.jar
orbeon/WEB-INF/lib/xmldb-exist_1_0b2_build_1107.jar
orbeon/WEB-INF/lib/xmlrpc-1_2_patched_exist_1_0b2_build_1107.jar
            

Remove these four libraries from « orbeon/WEB-INF/lib » and keep them somewhere else in case you want to move back to eXist 1.0 beta2 later on.
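If you prefer to script this step, here is a small Python sketch that moves the old libraries into a backup directory instead of deleting them. The function name `set_aside_exist_jars` and the backup location are hypothetical, and the file name pattern simply matches the four jars listed above.

```python
import shutil
from pathlib import Path


def set_aside_exist_jars(lib_dir, backup_dir):
    """Move the eXist 1.0 beta2 jars from lib_dir to backup_dir.

    Returns the names of the files that were moved.
    """
    lib_dir = Path(lib_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # Matches exist-1_0b2_..., exist-optional-1_0b2_..., xmldb-exist_1_0b2_...
    # and xmlrpc-1_2_patched_exist_1_0b2_...
    for jar in sorted(lib_dir.glob("*exist*1_0b2*.jar")):
        shutil.move(str(jar), str(backup_dir / jar.name))
        moved.append(jar.name)
    return moved
```

For instance, `set_aside_exist_jars("orbeon/WEB-INF/lib", "/tmp/exist-1.0b2-backup")` would leave the other OPS libraries untouched; the backup path is only an example.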

Installing the eXist snapshot

Install the eXist snapshot through:

java -jar eXist-snapshot-20050805.jar

Install this new version in whatever directory you like, but keep it outside your OPS installation: we are doing this installation only to get the new libraries!

Installing the new libraries

You need to copy five eXist libraries into « orbeon/WEB-INF/lib ». If you’ve installed eXist in « /opt/eXist », move to « orbeon/WEB-INF/lib » and type:

cp /opt/eXist/exist.jar eXist-snapshot-20050805.jar
cp /opt/eXist/exist-optional.jar exist-optional-snapshot-20050805.jar
cp /opt/eXist/exist-modules.jar exist-modules-snapshot-20050805.jar
cp /opt/eXist/lib/core/xmldb.jar xmldb-eXist-snapshot-20050805.jar
cp /opt/eXist/lib/core/xmlrpc-1.2-patched.jar xmlrpc-1.2-patched-eXist-snapshot-20050805.jar
            

Move to Java 5.0

eXist now relies on some Java 5.0 classes and if you try to use it with j2sdk 1.4, you’ll run into errors such as:

22:37:11.168 WARN!! [SocketListener0-9] org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:574) >11> Error for /orbeon/apiculteurs/administration/statistiques/montre
java.lang.NoClassDefFoundError: javax/xml/datatype/DatatypeConfigurationException
	at org.exist.xquery.value.AbstractDateTimeValue.<clinit>(AbstractDateTimeValue.java:157)
	at org.exist.xquery.functions.FunCurrentDateTime.eval(FunCurrentDateTime.java:51)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:144)
	at org.exist.xquery.EnclosedExpr.eval(EnclosedExpr.java:58)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:144)
	at org.exist.xquery.ElementConstructor.eval(ElementConstructor.java:173)
	at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:43)
	at org.exist.xquery.PathExpr.eval(PathExpr.java:159)
            

To fix that, the simplest solution (assuming your application supports it) is to move your servlet to j2sdk 1.5.

Restart, restore and enjoy

You’re almost done!

Restart your servlet, restore your database using the eXist client and enjoy your brand new eXist installation.

After a servlet reload, in the servlet log, you’ll notice new messages:

            2005-10-07 08:51:08,615 INFO  org.exist.storage.XQueryPool null - QueryPool: maxStackSize = 5; timeout = 120000; timeoutCheckInterval = 30000
Scanning journal  [==                                                ] (4 %)
Scanning journal  [====                                              ] (8 %)
Scanning journal  [======                                            ] (12 %)
Scanning journal  [========                                          ] (16 %)
Scanning journal  [==========                                        ] (20 %)
Scanning journal  [=================                                 ] (34 %)
Scanning journal  [====================                              ] (40 %)
Scanning journal  [==============================                    ] (60 %)
Scanning journal  [========================================          ] (80 %)
            2005-10-07 08:51:19,713 INFO  org.orbeon.oxf.pipeline.InitUtils null - /apicu

These messages confirm that your eXist installation is now using a journal.

SVG at sparklingPoint, multimedia on XMLfr

As promised, I have finally transcribed Antoine Quint’s presentation at sparklingPoint.

While I was at it, I went a step further and also published the audio version of his talk. Since this, without being very complex, required quite a few different tools (under Linux/Debian), I’ll outline the main steps here.

The first step was to retrieve the recording made with the « voice memo » function of my phone (Sony Ericsson P800).

Curiously, voice memos don’t show up when you browse the contents of the phone with the software supplied with it. They are, however, present in the phone’s backups, which turn out to be archives in ZIP format:

vdv@delleric:/tmp$ unzip Mon\ P800\ 2004-01-26\ 08.59.05.ecs
...
vdv@delleric:/tmp$ find backup/ -name "*.wav"
backup/Internal/documents/Media files/audio/unfiled/Arrow sound.wav
backup/Internal/documents/Media files/audio/unfiled/Carhorn sound.wav
backup/Internal/documents/Media files/audio/unfiled/Oldfashioned.wav
backup/Internal/documents/Voice/VoiceNote/voicenote14720051577282279.wav
backup/Internal/documents/Voice/VoiceNote/voicenote14724921931175759.wav
backup/Internal/system/data/DefaultSounds/alarm.wav
vdv@delleric:/tmp$

The one we are interested in is « voicenote14724921931175759.wav ». This file turns out to be a « .wav » using a GSM codec that is little used by the standard Linux tools:

vdv@delleric:/tmp$ file backup/Internal/documents/Voice/VoiceNote/voicenote14724921931175759.wav
backup/Internal/documents/Voice/VoiceNote/voicenote14724921931175759.wav: RIFF (little-endian) data, WAVE audio, GSM 6.10, mono 8000 Hz
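For the curious, the information reported by « file » sits in the first bytes of the document. This Python sketch reads the « fmt » chunk of a RIFF/WAVE file and reports the codec, the number of channels and the sample rate. The function name `describe_wav` is mine, the table of codecs is limited to the two I needed (format tag 0x0031 is GSM 6.10), and this is only my approximation of the check, not how « file » itself is implemented.

```python
import struct

# Only two of the registered WAVE format tags, enough for this example.
FORMAT_NAMES = {0x0001: "PCM", 0x0031: "GSM 6.10"}


def describe_wav(data):
    """Return (codec name, channels, sample rate) from the bytes of a WAVE file."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    position = 12
    # Walk the chunks: 4-byte identifier, 4-byte little-endian size, payload.
    while position + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, position)
        if chunk_id == b"fmt ":
            if size < 8 or position + 16 > len(data):
                raise ValueError("truncated fmt chunk")
            tag, channels, rate = struct.unpack_from("<HHI", data, position + 8)
            name = FORMAT_NAMES.get(tag, "unknown (0x%04x)" % tag)
            return name, channels, rate
        # Chunks are padded to an even number of bytes.
        position += 8 + size + (size & 1)
    raise ValueError("no fmt chunk found")
```

Reading the first kilobyte of the voice note and passing it to `describe_wav` should be enough, since the « fmt » chunk normally comes right after the RIFF header.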

It doesn’t open directly in sound editing tools such as audacity, but the « play » command can play it, which led me to suspect that « sox », on which « play » relies, must know this codec. Indeed, these documents can be converted into more common formats using sox. I converted it to .ogg, doubling its volume and resampling it at 44100 Hz (a more common rate, which makes later conversions easier), with the command:

vdv@delleric:/tmp$ sox -v 2 voicenote14724921931175759.wav -r 44100 voicenote.ogg
maskChannels: 1  Rate: 44100

After this step, the recording is ready to be transcribed, split and published as MP3.

There are many tools for editing and transcribing audio documents (audacity, glame, transcriber, …).

Although transcriber, which Erwan le Gall had kindly pointed out to me, is specifically designed for transcribing audio documents, I was reluctant to learn a new tool and preferred, this time, to use audacity, which I regularly use to split my recordings of vinyl records. I had a few problems with the MP3 export function and preferred to export to .wav and convert to MP3 manually using BladeEnc. For each track, I therefore used the command:

vdv@delleric:/tmp$ bladeenc -br 32 voicenote1.wav voicenote1.mp3

BladeEnc 0.94.2    (c) Tord Jansson            Homepage: http://bladeenc.mp3.no
===============================================================================
BladeEnc is free software, distributed under the Lesser General Public License.
See the file COPYING, BladeEnc's homepage or www.fsf.org for more details.

Files to encode: 1

Encoding:  voicenote1.wav
Input:     44.1 kHz, 16 bit, mono.
Output:    32 kBit, mono.

Completed. Encoding time: 00:00:02 (28.78X)

All operations completed. Total encoding time: 00:00:02
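Typing this command for each track gets tedious. A few lines of Python can prepare the whole batch; this sketch only builds the command lines, using the same « -br 32 » option as above, and leaves running them (for instance with `subprocess.run`) to you. The function name `bladeenc_commands` is mine.

```python
from pathlib import Path


def bladeenc_commands(wav_names, bitrate=32):
    """Build one bladeenc command line (as an argument list) per .wav file."""
    commands = []
    for name in wav_names:
        # Same base name, with the .mp3 extension for the output file.
        mp3_name = str(Path(name).with_suffix(".mp3"))
        commands.append(["bladeenc", "-br", str(bitrate), name, mp3_name])
    return commands
```

Each list can be passed as is to `subprocess.run(command, check=True)` if bladeenc is installed.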

Last operation: to add the MP3 tags, I used cantus, a tool that is a little marvel for this kind of job.

After that, all that remained was to publish.

Why MP3?

I hesitated on this point. I would have preferred to publish in Ogg Vorbis format (.ogg), which I use whenever I can (Ogg is to MP3 what PNG is to GIF), but I thought that the potential users of these audio documents would have trouble listening to them (Ogg isn’t supported by Windows Media Player).

What do you think?

Viruses

Last week, when « W32.Beagle.A@mm » started to spread, I decided to switch my mailing lists to « moderated ».

Not that my subscribers can be affected by viruses (I configure my mailing lists to strip binary attachments to reduce the risk), but a flurry of mails saying « Hi » was something I wanted to avoid, even if they are safe!

This week, when « W32.Novarg.A@mm » came out, I decided I needed something stronger and more automatic…

After some googling, I installed clamav and amavis. That took me some time and was painful (the mail system is probably the most complex thing on my servers, with many different programs involved: postfix, procmail, cyrus and now clamav and amavis), but I am pretty happy with this small achievement.

Clamav does the real work. It’s an open source anti-virus scanner. It comes with an update daemon, and several virus signatures seem to be added daily, as far as I can tell from my limited experience.

Also open source, Amavis provides the interface between the MTA (Postfix in my case) and the virus scanner (Clamav in my case). I have installed a flavor of Amavis named « Amavisd-new ». Amavis is highly configurable. You can tell it which viruses use fake sender addresses, and in that case it won’t send a report to the sender. I wish more systems and admins used that feature to avoid flooding the net with rubbish virus notifications!
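The logic behind that feature is simple enough to fit in a few lines. This Python sketch is only an illustration of the idea, not Amavis code (Amavis is configured in its own configuration file): the function name `should_notify_sender` is mine, and the two patterns are just the worm names mentioned in this entry, which I assume here to forge their sender addresses.

```python
import re

# Hypothetical list: names of viruses assumed to forge the sender address.
FAKE_SENDER_PATTERNS = [
    re.compile(pattern, re.IGNORECASE) for pattern in (r"beagle", r"novarg")
]


def should_notify_sender(virus_names, patterns=FAKE_SENDER_PATTERNS):
    """Return False as soon as one detected virus is known to fake its sender."""
    for name in virus_names:
        if any(pattern.search(name) for pattern in patterns):
            return False
    return True
```

With such a rule, a mail infected by « W32.Novarg.A@mm » is dropped silently, while the sender of a mail carrying a virus that doesn't forge addresses still gets a report.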

With this setup, I have switched my mailing lists back to their normal mode and I am now watching viruses being caught: the rate has reached 30 viruses per hour. That’s 30 mails that won’t leave my SMTP server and will never spread their virus…

That may be a small achievement, but I feel like a good « net citizen » :) … If more SMTP servers (including those of ISPs) were equipped with such tools, viruses would spread much, much, much more slowly.