The complex simple problem of media types

Media types (previously called MIME types) have always struck me as something that is simple in theory but awfully complex in practice.

When they seem to be working, enjoy your luck and rest assured that it is only temporary, until the next release of software X or Y: after a recent upgrade from Ubuntu Hoary to Breezy, my workstation insists that my « audio/x-mpegurl » playlists are « text/plain », and when I associate XMMS with these files it uses XMMS to open all my « text/plain » documents!

I recently had the opportunity to look a little deeper into these issues for a project of mine that needs to determine the media types of files in Java.

Freedesktop.org comes to the rescue

The problem is more complex than it appears, and it’s comforting to know that some people seem to be doing exactly what needs to be done to fix it.

Freedesktop.org has been working for a while on a shared database of media types and has published its specification.

Gnome and KDE are participating and I hope that this means the end of the media types nightmare on my desktop…

I really like the principles they have adopted, especially the simple XML format used to describe the media types (which they are still calling mime types).

One thing which is surprising when you first open the XML document describing this shared mime types database is that it includes an internal DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mime-info [
  <!ELEMENT mime-info (mime-type)+>
  <!ATTLIST mime-info xmlns CDATA #FIXED "http://www.freedesktop.org/standards/shared-mime-info">

  <!ELEMENT mime-type (comment|glob|magic|root-XML|alias|sub-class-of)*>
  <!ATTLIST mime-type type CDATA #REQUIRED>

  <!ELEMENT comment (#PCDATA)>
  <!ATTLIST comment xml:lang CDATA #IMPLIED>

  <!ELEMENT glob EMPTY>
  <!ATTLIST glob pattern CDATA #REQUIRED>

  <!ELEMENT magic (match)+>
  <!ATTLIST magic priority CDATA #IMPLIED>

  <!ELEMENT match (match)*>
  <!ATTLIST match offset CDATA #REQUIRED>
  <!ATTLIST match type (string|big16|big32|little16|little32|host16|host32|byte) #REQUIRED>
  <!ATTLIST match value CDATA #REQUIRED>
  <!ATTLIST match mask CDATA #IMPLIED>

  <!ELEMENT root-XML EMPTY>
  <!ATTLIST root-XML
  	namespaceURI CDATA #REQUIRED
	localName CDATA #REQUIRED>

  <!ELEMENT alias EMPTY>
  <!ATTLIST alias
  	type CDATA #REQUIRED>

  <!ELEMENT sub-class-of EMPTY>
  <!ATTLIST sub-class-of
  	type CDATA #REQUIRED>
]>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">

That’s not very usual, and it is often considered bad practice since the DTD isn’t shared between documents. But when you think about it, in this specific context where there should be only one of these documents per machine, it makes perfect sense.

Using an internal DTD solves all the packaging issues: there is only one self-contained document to ship, and this document has no external dependencies. Furthermore, the DTD is pretty straightforward, and including it in the document makes the document more self-describing.

This vocabulary is meant to be extensible through namespaces:

Applications may also define their own elements, provided they are namespaced to prevent collisions. Unknown elements are copied directly to the output XML files like comment elements.

I think they could have allowed extension attributes as well, since these are usually quite harmless.

The mechanism defined to differentiate the different types of XML documents appears somewhat weak, since it relies only on the namespace and local name of the root element.

Leaving aside the problem of compound documents, this mechanism completely misses the fact that a document such as:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns="http://www.w3.org/1999/xhtml" xsl:version="1.0">
    .../...
</html>

isn’t an XHTML document but an XSLT transformation!
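As a rough illustration of how fragile root-element sniffing is, here is a minimal Java sketch (class and method names are mine, not from any published API) that classifies a document by its root element alone, as the root-XML rule prescribes:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

// Minimal sketch of the root-XML detection rule: classify an XML document by
// the namespace URI and local name of its root element only. Class and method
// names are illustrative, not from any published API.
public class RootXmlSniffer {

    /** Returns "{namespaceURI}localName" for the document's root element. */
    public static String rootKey(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // advance to the root START_ELEMENT
            String ns = reader.getNamespaceURI();
            return "{" + (ns == null ? "" : ns) + "}" + reader.getLocalName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A simplified XSLT stylesheet: its root is an XHTML <html> element
        // carrying an xsl:version attribute, so root-based sniffing reports
        // XHTML even though the document is really an XSLT transformation.
        String simplified =
                "<html xmlns='http://www.w3.org/1999/xhtml'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
                + " xsl:version='1.0'><body/></html>";
        System.out.println(rootKey(simplified));
        // -> {http://www.w3.org/1999/xhtml}html
    }
}
```

The sniffer returns exactly the same key for a plain XHTML page and for the simplified stylesheet above, which is the weakness discussed here.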

The other thing that I regret, probably because I am not familiar with these issues, is that the description of the « magic » rules is very concise.

One of the questions I would have liked to see answered is which encodings should be tried when doing string matching. When I see the following rule in the description of the « application/xml » type:

    <magic priority="50">
      <match value="&lt;?xml" type="string" offset="0"/>
    </magic>

I have the feeling that several encodings should be tried: an ASCII match would also work for UTF-8, ISO-8859-1 and the like, but it would fail for UTF-16 or EBCDIC…

On the other hand, there are probably many text formats that do not support UTF-16 or EBCDIC, and for which it would be a mistake to try these encodings…
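To make the encoding question concrete, here is a hedged Java sketch of what an encoding-aware « string » match could look like. The choice of candidate encodings is my assumption, since the specification doesn’t say which ones to try, and the names are illustrative:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of an encoding-aware "string" magic match. The candidate encodings
// below are an assumption: the shared-mime-info specification does not say
// which encodings a string rule should be tried under.
public class MagicMatcher {

    // UTF-8 also covers ASCII and the ISO-8859 family for a pattern like
    // "<?xml"; the two UTF-16 variants catch 16-bit encoded documents.
    private static final Charset[] CANDIDATES = {
        StandardCharsets.UTF_8,
        StandardCharsets.UTF_16BE,
        StandardCharsets.UTF_16LE
    };

    /** True if the data contains `value` at `offset` in any candidate encoding. */
    public static boolean matchesAt(byte[] data, int offset, String value) {
        for (Charset cs : CANDIDATES) {
            byte[] pattern = value.getBytes(cs);
            if (offset + pattern.length <= data.length
                    && Arrays.equals(
                        Arrays.copyOfRange(data, offset, offset + pattern.length),
                        pattern)) {
                return true;
            }
        }
        return false;
    }
}
```

An EBCDIC charset could be added to the candidate list the same way, at the cost of more false positives on formats that never use it.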

More implementations needed!

Having found this gem, I was pretty confident I would find a Java implementation…

There is one buried in Sun’s Java Desktop System (which isn’t Open Source) and one in Nutch, but that seems to be pretty much everything we have available!

The MimeType class in Nutch appears to be quite minimal. It probably does what most applications want to do, but that’s not enough for what I’d like to achieve.

The mime types database has some advanced features such as type hierarchy: a mime type can be a subclass of other mime types, for instance all the text types are subclasses of the text/plain type.

These hierarchies can be implicit or explicit and they support multiple inheritance. The freedesktop.org specification gives the following example:

Some types may or may not be instances of other types. For example, a spreadsheet file may be compressed or not. It is a valid spreadsheet file either way, but only inherits from application/x-gzip in one case. This information cannot be represented statically; instead an application interested in this information should run all of the magic rules, and use the list of types returned as the subclasses.

These hierarchies should be used by user interfaces: instead of proposing only the tools registered for a specific type, a user interface should also propose the tools registered for its parent classes. If they did, they would offer to open an SVG document with an XML or text editor, or an OpenOffice document with a zip extractor, which can be very handy.

That’s the kind of feature I’d expect to see in a mime type API, and I am wondering whether I will have to write my own Java implementation to see it happen!
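To illustrate the kind of hierarchy-aware API I have in mind, here is a small Java sketch. The parent/child relations below are a tiny hand-written excerpt of what a real implementation would load from the shared database, and all names are illustrative:

```java
import java.util.*;

// Sketch of a hierarchy-aware mime type lookup. The PARENTS map is a tiny
// hand-written excerpt standing in for the sub-class-of relations a real
// implementation would load from the shared-mime-info database.
public class MimeHierarchy {

    private static final Map<String, List<String>> PARENTS = new HashMap<>();
    static {
        PARENTS.put("image/svg+xml", Arrays.asList("application/xml"));
        PARENTS.put("application/xml", Arrays.asList("text/plain"));
        PARENTS.put("application/vnd.oasis.opendocument.text",
                    Arrays.asList("application/zip"));
    }

    /** Returns the type and all its ancestors, most specific first. */
    public static List<String> typeAndAncestors(String type) {
        List<String> result = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(type);
        while (!queue.isEmpty()) {
            String t = queue.remove();
            if (!result.contains(t)) { // guard against diamonds in the graph
                result.add(t);
                queue.addAll(PARENTS.getOrDefault(t, Collections.emptyList()));
            }
        }
        return result;
    }
}
```

A user interface could then offer, for an SVG document, every tool registered for any type in the returned list: the SVG tools first, then the XML editors, then the plain text editors.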

Web page thumbnails

To brighten up the articles page of the http://dyomedea.com site, I have added thumbnails made from screen captures (such as the thumbnail of the XMLfr site).

To build these thumbnails, I wanted to avoid the brute-force method of « screen capture and manual resizing with Gimp ».

The process isn’t particularly original, so I looked for tools that do this, and the only Open Source one I found was webthumb, a Perl script that chains the commands needed to launch Mozilla on an Xvfb server and take a screen capture of it.

For a reason I haven’t investigated, webthumb doesn’t seem to run directly on my workstation (Ubuntu Hoary). However, running the commands manually to obtain the same result is quite easy.

In a first terminal, simply launch Xvfb and the commands whose output you want to capture, for instance:

vdv@grosbill:~ $ Xvfb :2 -screen 0 1024x768x24 -ac -fbdir /tmp/xvfb/ &
[1] 14006
vdv@grosbill:~ $ Could not init font path element /usr/X11R6/lib/X11/fonts/TTF/, removing from list!
Could not init font path element /usr/X11R6/lib/X11/fonts/CID/, removing from list!

vdv@grosbill:~ $ export DISPLAY=:2
vdv@grosbill:~ $ firefox http://dyomedea.com
Could not init font path element /usr/X11R6/lib/X11/fonts/TTF/, removing from list!
Could not init font path element /usr/X11R6/lib/X11/fonts/CID/, removing from list!
        

In a second terminal, you can then check the display with xwud and save it with xwdtopnm. To obtain these captures, I used the sequences:

vdv@grosbill:~ $ xwud -in /tmp/xvfb/Xvfb_screen0
vdv@grosbill:~ $ xwdtopnm /tmp/xvfb/Xvfb_screen0| pnmscale -xysize 120 120 | pnmtojpeg -quality 95 > thumb.jpg
xwdtopnm: writing PPM file
vdv@grosbill:~ $ gimp thumb.jpg
*** attempt to put segment in horiz list twice
*** attempt to put segment in horiz list twice
            

Simple, isn’t it?

Notes:

  • The warning messages displayed above do not seem to matter.
  • The Debian/Ubuntu packages needed to run these commands are xvfb (virtual framebuffer X server), netpbm (graphics conversion tools) and, of course, Firefox.

Dyomedea.com is valid, at last!

After spending years explaining to my customers that they should follow the W3C recommendations, I have only just applied these principles to my own corporate site: http://dyomedea.com/!

For those of you who would like to see the difference, the old version now belongs to the Web archives.

The new site looks very different, but its structure is similar and the URIs haven’t changed.

This new site is, of course, valid XHTML 1.1 and CSS 2.0, free from layout tables and, as it should be, powered by XML.

In addition to the great classics (GNU/Linux, Apache, …), the site is powered by a new beta version of Orbeon PresentationServer.

This new version brings plenty of sexy features, such as XForms support based on Ajax technologies (which I am not using here) and out-of-the-box XHTML support (which wasn’t the case with previous versions, which had to be tweaked to generate valid XHTML).

I use this product (a bit too powerful for the needs of this site) because I like it (that’s a good reason!) and also to generate dynamic pages, which has a few advantages even for a relatively static site like this one:

  • I send XHTML (with the « application/xhtml+xml » media type) only to browsers that announce in their requests that they support it (plus the W3C XHTML validator, which doesn’t say what it supports; if you think it is wrong, you can vote for this bug!) and HTML to the others (curiously, Konqueror, which shouldn’t be, seems to be on that list).
  • Of course, I take the opportunity to aggregate RSS 1.0 feeds (from XMLfr and from this weblog) to display my latest articles and the XMLfr agenda.
  • More interestingly, I have developed two new OPS generators that fetch from my mailbox the latest messages I have sent to public lists.
  • These generators use my XML/Java binding API to read their configurations.
  • And, of course, an XML/XSLT platform greatly simplifies internationalisation (the site is in French and in English) and makes it possible to add goodies such as a site map.

All this has been fun to do; I should have done it before!

All that’s left now is to do the same with XMLfr…

Dyomedea.com is valid, at last!

There is a French dictum that says that cobblers are the worst shod (curiously, the English equivalent, « the shoemaker’s children are the worst shod », brings children into the picture).

After having spent years teaching my customers that they should follow the W3C recommendations, I have just finished applying that to my own corporate site, http://dyomedea.com/english/!

For those of you who would like to see the difference, the old one now belongs to web.archive.org

The new site looks very different, but the structure has been kept similar and the old URIs haven’t changed.

Of course, the new site is now valid XHTML 1.1 and CSS 2.0, free from layout tables and powered by XML.

In addition to the classics (GNU/Linux, Apache, …), the site is powered by the new beta version of Orbeon PresentationServer.

This version has a lot of fancy stuff, such as its Ajax-based XForms support (which I am not using here) and out-of-the-box support for XHTML (which wasn’t the case in previous versions).

I am using it because I like this product (that’s a good reason, isn’t it?) and also to create dynamic pages:

  • I send XHTML (as application/xhtml+xml) to browsers that announce they support it (and also to the W3C XHTML validator, which doesn’t send Accept headers; if you think this is wrong, vote for this bug!) and HTML to the others (Konqueror appears to be in that list!).
  • Of course, I aggregate RSS 1.0 feeds (from XMLfr and from this blog) to display my latest articles and the XMLfr agenda.
  • More interestingly, I have developed a couple of new OPS generators to fetch from my mailbox the latest mails I have sent to public lists.
  • These generators use my TreeBind Java/XML API to read their configuration inputs.
  • And, of course, an XML/XSLT platform helps a lot to manage the i18n issues (the site is in English and French) and to add goodies such as a site map.
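The content negotiation described in the first bullet above can be sketched as follows. The class and method names are mine, not part of the site’s actual code, and q-values in the Accept header are deliberately ignored to keep the sketch short:

```java
// Hypothetical sketch of Accept-header negotiation between XHTML and HTML.
public class XhtmlNegotiator {

    /** Picks application/xhtml+xml only when the Accept header lists it. */
    public static String pickContentType(String acceptHeader) {
        if (acceptHeader != null
                && acceptHeader.contains("application/xhtml+xml")) {
            return "application/xhtml+xml";
        }
        // Browsers that don't announce XHTML support (and requests with no
        // Accept header at all, like the W3C validator's) get plain HTML.
        return "text/html";
    }
}
```

A real implementation should also parse q-values, since a browser may list application/xhtml+xml with q=0, which this sketch would misread as support.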

That’s been fun, I should have done it before!

Next on my list should be to do the same with XMLfr…

When old good practices become bad

There are some people with whom you just can’t disagree in their domains of expertise.

These people are always precise and accurate, and when you read what one of them writes, you have the feeling that each of their words has been carefully weighed and is the most accurate that could have been chosen.

In XML land, names that come to mind in that category are (to name a few) James Clark, Rick Jelliffe, Murata Makoto, Jeni Tennison, David Carlisle, Uche Ogbuji and, of course, Michael Kay.

It is very exceptional that one can disagree with Michael Kay; his books appear to be 100% bulletproof, and it can seem unbelievable that Joris Gillis dared to write on the xsl-list:

You nearly gave me a heart attack when I encountered the following code in your – in all other aspects excellent – XSLT 2.0 book (3rd edition):…/…

You’ll have guessed that the reason this happened is that the complaint was not related to XSLT skills; the code that followed is:

<xsl:variable name="table-heading">
        <tr>
                <td><b>Date</b></td>
                <td><b>Home Team</b></td>
                <td><b>Away Team</b></td>
                <td><b>Result</b></td>
        </tr>
</xsl:variable>

Michael Kay apologized:

I think it’s true to say that practices like this were commonplace five years ago when many of these examples were written – they are still commonplace today, but no longer regarded as good practice.

And the thread ended up as a discussion about common sense and good practices:

« Common sense » is after all by definition what the majority of people think at the time – it was common sense back then to use tables, it’s common sense now to avoid them…

This reflection is itself common sense, but still good food for thought: yesterday’s good practices become today’s poor practices, and it’s always worth reconsidering our own.

When I saw Derrick Story’s announcement of O’Reilly Network Homepage beta, I was quite sure that the publisher of Eric Meyer would have taken the opportunity to follow today’s good practices…

Guess what? The W3C HTML validator reports 40 errors on that page and I can’t disagree with that comment posted on their site:

Well. […] 2 different sites to allow for cookies, redirects that went nowhere and all I really wanted to say was « IT’S A TABLE-BASED LAYOUT! ». Good grief.

Apiculteurs.info

I have put apiculteurs.info online, a site designed by honey lovers (my wife Catherine and myself) for honey lovers:

The goal of this site is to list as many beekeepers as possible who sell their own production, including traditional beekeepers who are not connected to the internet.

The result isn’t very impressive yet and its list of beekeepers is still short, notably because, following the guidelines of the CNIL, we ask for the beekeepers’ written consent before publishing their contact details, and we are still waiting for a good number of answers.

If I mention this site here, it is because I wanted its design and implementation to follow the principles, technologies and good practices that I teach and recommend to my customers.

The site is thus entirely « powered by XML ».

It relies on the Open Source XML publishing framework Orbeon PresentationServer and on the eXist XML database. The information is stored in the eXist database using an XML/RDF vocabulary that I am considering publishing under the name « foab » (Friend Of A Bee), and the pages are built dynamically with PresentationServer.

This architecture also lets us publish (in compliance with its licence) articles from the free encyclopedia Wikipédia related to beekeepers and beekeeping. Published under the address http://apiculteurs.info/wikipedia, these pages are downloaded in XHTML format from the encyclopedia and stored locally in the eXist database.

The flexibility of PresentationServer makes it possible to follow the principles of the REST architecture and to automatically give each beekeeper a stable address (such as http://apiculteurs.info/apiculteur/exemple/). The services related to a beekeeper (such as editing their information, deleting the record, exporting it as XML, …) are also accessed through stable addresses specific to each beekeeper.

The input forms, such as the one used to suggest a new beekeeper but also all those used to administer the database, are defined with the W3C XForms standard and rely on PresentationServer’s server-side XForms implementation, which makes it possible to use XForms today without waiting for it to be implemented in web browsers…

The site has, of course, an RSS 1.0 feed.

The pages are not yet valid against the XHTML 1.1 recommendation (the input forms generated by PresentationServer’s XForms implementation conform to HTML rather than XHTML), but that is a point I intend to fix soon. They nevertheless use table-free, CSS-based presentation methods.

The letters sent to beekeepers to ask for permission to publish their contact details are produced from data extracted from the XML database, formatted by XSLT transformation as OpenOffice documents. The whole process is orchestrated by PresentationServer and available from the site’s administration pages.

This small site, meant to be a showcase for beekeeping and beekeepers, is thus also a genuine showcase for some of the possibilities that XML brings to web publishing!

Is XML 2.0 doomed?

XML 2.0 seems to be becoming the new buzzword and hot topic among many forums.

While I think that XML 1.0 would deserve a certain amount of refactoring, I don’t think that XML 2.0 is likely to ever happen, nor even that it is something we should wish for.

The reasons for the success of XML 1.0 are not that difficult to analyse:

  1. The cost/benefit of developing XML 1.0 applications compared to previous technologies has been generally analysed as highly positive.
  2. XML 1.0 is very near to being the greatest common denominator between the needs of a very wide range of applications, including document and data oriented applications.
  3. XML 1.0 was proposed by a standardisation body that had the credentials to push such a specification.

I don’t think that this is likely to happen again for XML 2.0:

  1. The unfortunate XML 1.1 recommendation has shown that the cost of the tiniest modification to XML 1.0 is so high that it is difficult to think of a benefit that could compensate for it. While XML 1.0 is certainly imperfect, the cost of its imperfections just isn’t high enough.
  2. A fairly good consensus on the features supported by XML 1.0 was possible in a small Working Group working reasonably isolated from the pressure of the lobbies that finance the W3C. All the recent specifications developed under more pressure and hype, such as W3C XML Schema, SOAP, WSDL, XPath 2.0, XSLT 2.0, XQuery 1.0 and others, show that this is not likely to happen any longer and that, on the contrary, an XML 2.0 specification would most probably lose the balance that has made XML 1.0 successful.
  3. During the past six years, the W3C has lost a lot of credibility, to the point that its most influential members are now ignoring its most basic recommendations such as XHTML, CSS, SVG, XForms, XLink and many others. This loss of credibility would greatly compromise the success of an XML 2.0 recommendation published by the W3C.

What is likely to happen with XML 2.0 is either a recommendation that is easily ignored by the community at large, or one that is much less generic, lightweight and flexible than XML 1.0.

I think I would prefer the first option!

Edd Dumbill on XTech 2005

XTech 2005 presents itself as « the premier European conference for developers and managers working with XML and Web technologies, bringing together the worlds of web development, open source, semantic web and open standards. » Edd Dumbill, XTech 2005 Conference Chair, answered our questions about this conference, previously known as XML Europe. This interview has been published in French on XMLfr.

vdV: XTech was formerly known as XML Europe. What are the motivations for changing its name?

Edd: As the use of XML broadens out beyond traditional core topics, we want to reflect that in the conference. As well as XML, XTech 2005 will cover web development, the semantic web and more. XML’s always been about more than just the core, but we felt that having « XML » in the name made some people feel the conference wasn’t relevant to them. The two new tracks, Browser Technology and Open Data, aren’t strictly about XML topics at all.

vdV: In the new name (XTech), there is no mention of Europe, does that mean that the conference is no longer or less European?

Edd: Not at all! Why should « Europe » be a special case anyway? Even as XML Europe, we’ve always had a fair number of North American speakers and participants. I don’t see anything changing in this regard.

vdV: After a period where every event, product or company tried to embed « XML » in their name, the same events are now removing any reference to XML. How do you analyse this trend?

Edd: It’s a testament to the success of XML. As XML was getting better known, everybody knew it was a good thing and so used it as a sign in their names. Now XML is a basic requirement for many applications, it’s no longer remarkable in that sense.

vdV: How would you compare the 12 different tracks of XML Europe 2004 (ranging from Content Management to Legal through Government and Electronic Business) with the 4 tracks of XTech 2005 (Core technologies, Applications, Browser technologies and Open data)?

Edd: The switch to four clearly defined tracks is intended to help both attendees and speakers. The twelve tracks from before weren’t always easy to schedule in an easy-to-understand way, leading to a « patchwork » programme. Some of the previous tracks only had a handful of sessions in them anyway.

In addition to making the conference easier to understand, we get an opportunity to set the agenda as well as reflect the current practice. Take the new « Open Data » track as an example. There are various areas in which data is being opened up on the internet: political and government (theyrule.net, electoral-vote.com, theyworkforyou.com), cultural ( BBC Creative Archive), scientific and academic (Open Access). Many of the issues in these areas are the same, but there’s never been a forum bringing the various communities together.

vdV: Isn’t there a danger that the new focus on Web technologies becomes a specialisation and narrows the scope?

Edd: I don’t think that’s a danger. In fact, web technology is as much a part of the basic requirement for companies today as XML is, and it’s always been a running theme through the XML Europe conferences.

What we’re doing with the Browser Technology track is reflecting the growing importance of decent web and XML-based user interfaces. Practically everybody needs to build web UIs these days, and practically everybody agrees the current situation isn’t much good. We’re bringing together, for the first time, everybody with a major technology offering here: W3C standards implementors, Mozilla, Microsoft. I hope again that new ideas will form, and attendees will get a good sense of the future landscape.

vdV: Does the new orientation mean that some of the people who enjoyed XML Europe 2004 might not enjoy XTech 2005?

Edd: No, I don’t think so. In fact, I think they’ll enjoy it more because it will be more relevant to their work. Part of the reasoning in expanding the conference’s remit is the realisation that core XML people are always working with web people, and that any effort to archive or provide public data will heavily involve traditional XML topics. So we’re simply bringing together communities that always work closely anyway, to try and get a more « joined up » conference.

vdV: In these big international conferences, social activities are often as important as the sessions. What are your plans to encourage these activities?

Edd: The first and most important thing is the city, of course! Amsterdam is a great place to go out with other people.

We’ll be having birds-of-a-feather lunch tables, for ad-hoc meetings at lunch time. Additionally, there’ll be dinner sign-up sheets and restaurant suggestions. I’m personally not very keen on having formal evening conference sessions when we’re in such a great city, but I do want a way for people to meet others with common interests.

I’m also thinking about having a conference Wiki, where attendees can self-organise before arriving in Amsterdam.

vdV: Wireless access can play a role in these social activities (people can share their impression in real time using IRC channels, blogs and wikis). Will the conference be covered with wireless?

Edd: I really hope so. The RAI centre is in the process of rolling out wireless throughout the facility, but unfortunately hasn’t been able to say for sure.

Wireless internet is unfortunately very expensive, and we would need a sponsor to get free wireless throughout the conference. If anybody’s reading this and interested, please get in touch.

vdV: What topics would you absolutely like to see covered?

Edd: I think what I wrote in the track descriptions page at http://www.xtech-conference.org/2005/tracks.asp is a good starting point for this.

vdV: What topics would you prefer to leave away?

Edd: I don’t want to turn any topics away before proposals have been made. All proposed abstracts are blind reviewed by the reviewing team, so there’s a fair chance for everybody.

vdV: What is your best memory from past editions of XML Europe?

Edd: I always love the opening sessions. It’s very gratifying to see all the attendees and to get a great sense of expectation about what will be achieved over the next three days.

vdV: What is your worst memory from past editions of XML Europe?

Edd: The bad snail I ate in Barcelona — the ride over the bumpy road to the airport after the conference was agony!

SVG at sparklingPoint, multimedia on XMLfr

As promised, I have finally transcribed Antoine Quint’s presentation at sparklingPoint.

While doing so, I took the opportunity to go further and also publish the audio version of his talk, and since this, without being very complex, justified the use of quite a few different tools (under Linux/Debian), I will outline the process here.

The first step was to retrieve the recording made with the « voice memo » function of my phone (Sony Ericsson P800).

Curiously, the voice memos do not show up when browsing the phone’s content with the software supplied with the phone. However, they are present in the phone’s backups, which turn out to be ZIP archives:

vdv@delleric:/tmp$ unzip Mon\ P800\ 2004-01-26\ 08.59.05.ecs
...
vdv@delleric:/tmp$ find backup/ -name "*.wav"
backup/Internal/documents/Media files/audio/unfiled/Arrow sound.wav
backup/Internal/documents/Media files/audio/unfiled/Carhorn sound.wav
backup/Internal/documents/Media files/audio/unfiled/Oldfashioned.wav
backup/Internal/documents/Voice/VoiceNote/voicenote14720051577282279.wav
backup/Internal/documents/Voice/VoiceNote/voicenote14724921931175759.wav
backup/Internal/system/data/DefaultSounds/alarm.wav
vdv@delleric:/tmp$

The one we are interested in is « voicenote14724921931175759.wav ». This file turns out to be a « .wav » using a GSM codec that few classic Linux tools support:

vdv@delleric:/tmp$ file backup/Internal/documents/Voice/VoiceNote/voicenote14724
921931175759.wav
backup/Internal/documents/Voice/VoiceNote/voicenote14724921931175759.wav: RIFF (
little-endian) data, WAVE audio, GSM 6.10, mono 8000 Hz

It doesn’t open directly with sound editing tools such as audacity, but the « play » command can play it, which led me to suspect that « sox », on which « play » relies, must know this codec; and indeed, these documents can be converted to more usual formats with sox. I converted it to .ogg, doubling its volume and resampling it at 44100 Hz (a more usual rate that makes later conversions easier), with the command:

vdv@delleric:/tmp$ sox -v 2 voicenote14724921931175759.wav -r 44100 voicenote.ogg
maskChannels: 1  Rate: 44100

After this step, the recording is ready to be transcribed, split and published as MP3.

There are many tools for editing and transcribing audio documents (audacity, glame, transcriber, …).

Although transcriber, which Erwan le Gall had kindly pointed out to me, is specifically designed for transcribing audio documents, I was reluctant to learn a new tool and preferred, this time around, to use audacity, which I routinely use to split my recordings of vinyl records. I had some trouble with its MP3 export function and preferred to export to .wav and convert to MP3 manually using BladeEnc. For each track, I thus used the command:

vdv@delleric:/tmp$ bladeenc -br 32 voicenote1.wav voicenote1.mp3

BladeEnc 0.94.2    (c) Tord Jansson            Homepage: http://bladeenc.mp3.no
===============================================================================
BladeEnc is free software, distributed under the Lesser General Public License.
See the file COPYING, BladeEnc's homepage or www.fsf.org for more details.

Files to encode: 1

Encoding:  voicenote1.wav
Input:     44.1 kHz, 16 bit, mono.
Output:    32 kBit, mono.

Completed. Encoding time: 00:00:02 (28.78X)

All operations completed. Total encoding time: 00:00:02

As a last step, to add the MP3 tags, I used cantus, a tool that is a little marvel for this kind of operation.

All that remained was to publish.

Why MP3?

I hesitated on this point. I would have preferred to publish in the Ogg Vorbis format (.ogg), which I use whenever I can (Ogg is to MP3 what PNG is to GIF), but I thought that the potential users of these audio documents would have trouble listening to them (Ogg is not supported by Windows Media Player).

What do you think?

Viruses

Last week, when « W32.Beagle.A@mm » started to spread out, I decided to switch my mailing lists to « moderated ».

Not that my subscribers can be affected by viruses (I configure my mailing lists to strip binary attachments to reduce the risk), but a flurry of mails saying « Hi » was something I wanted to avoid, even if they are safe!

This week, when « W32.Novarg.A@mm » came out, I decided I needed something stronger and more automatic…

After some googling, I installed clamav and amavis. That took me some time and was painful (the mail system is probably the most complex thing on my servers, with many different programs involved: postfix, procmail, cyrus and now clamav and amavis), but I am pretty happy with this small achievement.

Clamav is what does the real work. It is an open source anti-virus scanner. It comes with an update daemon, and several virus signatures seem to be added daily, as far as I can tell from my limited experience.

Also open source, Amavis is the interface between the MTA (Postfix in my case) and the virus scanner (Clamav in my case). I have installed a flavor of Amavis named « Amavisd-new ». Amavis is highly configurable. You can tell it when a virus uses fake sender addresses, and in that case it won’t send a report to the sender. I wish more systems and admins would use that feature to avoid flooding the net with rubbish virus notifications!

With this setup, I have switched my mailing lists back to their normal mode and I am now watching viruses being caught: the rate has reached 30 viruses per hour. That’s 30 mails that won’t leave my SMTP server to spread their virus…

That may be a small achievement, but I feel like a good « net citizen » :) … If more SMTP servers (including those of ISPs) were equipped with such tools, viruses would spread much, much, much more slowly.