Media types (previously called mime types) have always struck me as something that is simple in theory but awfully complex in practice.
When they seem to be working, you can enjoy your luck and be sure that it will only last until the next release of software X or Y: after the recent upgrade from Ubuntu Hoary to Breezy, my workstation insists that my « audio/x-mpegurl » playlists are « text/plain », and when I associate XMMS with these files it uses XMMS to open all my « text/plain » documents!
I recently had the opportunity to look a little deeper into these issues for a project of mine which needs to determine the media types of files in Java.
Freedesktop.org comes to the rescue
The problem is more complex than it appears, and it is comforting to know that some people seem to be doing exactly what needs to be done to fix it.
The freedesktop.org project has been working for a while on a shared database of media types and has published its specification.
Gnome and KDE are participating and I hope that this means the end of the media types nightmare on my desktop…
I really like the principles that they have adopted, especially the simple XML format they use to describe the media types (which they still call mime types).
One thing which is surprising when you first open the XML document describing this shared mime types database is that it includes an internal DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mime-info [
  <!ELEMENT mime-info (mime-type)+>
  <!ATTLIST mime-info xmlns CDATA #FIXED "http://www.freedesktop.org/standards/shared-mime-info">
  <!ELEMENT mime-type (comment|glob|magic|root-XML|alias|sub-class-of)*>
  <!ATTLIST mime-type type CDATA #REQUIRED>
  <!ELEMENT comment (#PCDATA)>
  <!ATTLIST comment xml:lang CDATA #IMPLIED>
  <!ELEMENT glob EMPTY>
  <!ATTLIST glob pattern CDATA #REQUIRED>
  <!ELEMENT magic (match)+>
  <!ATTLIST magic priority CDATA #IMPLIED>
  <!ELEMENT match (match)*>
  <!ATTLIST match offset CDATA #REQUIRED>
  <!ATTLIST match type (string|big16|big32|little16|little32|host16|host32|byte) #REQUIRED>
  <!ATTLIST match value CDATA #REQUIRED>
  <!ATTLIST match mask CDATA #IMPLIED>
  <!ELEMENT root-XML EMPTY>
  <!ATTLIST root-XML
    namespaceURI CDATA #REQUIRED
    localName CDATA #REQUIRED>
  <!ELEMENT alias EMPTY>
  <!ATTLIST alias type CDATA #REQUIRED>
  <!ELEMENT sub-class-of EMPTY>
  <!ATTLIST sub-class-of type CDATA #REQUIRED>
]>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
That’s unusual and often considered bad practice, since the DTD can’t be shared between documents. But when you think more about it, in this specific context, where there should be only one of these documents per machine, it makes perfect sense.
Using an internal DTD solves all the packaging issues: there is only one self-contained document to ship, and this document has no external dependencies. Furthermore, the DTD is pretty straightforward, and including it in the document itself makes the document more self-describing.
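To convince myself that such a self-contained document is easy to consume from Java, here is a minimal sketch using the JDK's standard DOM parser. The class name `MimeInfoGlobs`, the method `globsByType` and the two-entry sample document (with a shortened DTD) are my own inventions for illustration; only the namespace and the element and attribute names come from the DTD above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (my own naming): lists the glob patterns of each type. */
final class MimeInfoGlobs {
    static final String NS = "http://www.freedesktop.org/standards/shared-mime-info";

    // Made-up two-entry stand-in for the real database, with a shortened internal DTD.
    static final String SAMPLE = String.join("\n",
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
        "<!DOCTYPE mime-info [",
        "  <!ELEMENT mime-info (mime-type)+>",
        "  <!ATTLIST mime-info xmlns CDATA #FIXED \"" + NS + "\">",
        "  <!ELEMENT mime-type (comment|glob)*>",
        "  <!ATTLIST mime-type type CDATA #REQUIRED>",
        "  <!ELEMENT comment (#PCDATA)>",
        "  <!ELEMENT glob EMPTY>",
        "  <!ATTLIST glob pattern CDATA #REQUIRED>",
        "]>",
        "<mime-info xmlns=\"" + NS + "\">",
        "  <mime-type type=\"image/svg+xml\">",
        "    <comment>SVG image</comment>",
        "    <glob pattern=\"*.svg\"/>",
        "  </mime-type>",
        "  <mime-type type=\"audio/x-mpegurl\">",
        "    <glob pattern=\"*.m3u\"/>",
        "  </mime-type>",
        "</mime-info>");

    /** Maps each mime-type's "type" attribute to the patterns of its glob children. */
    static Map<String, List<String>> globsByType(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the vocabulary is namespaced
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList types = doc.getElementsByTagNameNS(NS, "mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                List<String> patterns = new ArrayList<>();
                NodeList globs = type.getElementsByTagNameNS(NS, "glob");
                for (int j = 0; j < globs.getLength(); j++) {
                    patterns.add(((Element) globs.item(j)).getAttribute("pattern"));
                }
                result.put(type.getAttribute("type"), patterns);
            }
            return result;
        } catch (Exception e) {
            // Wrapped so that callers need not handle the parser's checked exceptions.
            throw new IllegalStateException("cannot parse mime-info document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(globsByType(SAMPLE));
    }
}
```

Since the DTD travels inside the document, the parser never has to resolve an external resource, which is exactly the packaging benefit mentioned above.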
This vocabulary is meant to be extensible through namespaces:
Applications may also define their own elements, provided they are namespaced to prevent collisions. Unknown elements are copied directly to the output XML files like comment elements.
I think that they could have allowed external attributes as well, since these are usually quite harmless.
The mechanism defined to distinguish between different types of XML documents appears to be somewhat weak, since it relies only on the namespace and local name of the root element.
Leaving aside the problem of compound documents, this mechanism completely misses the fact that a document such as:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml" xsl:version="1.0">
.../...
</html>
isn’t an XHTML document but an XSLT transformation!
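To make the point concrete, here is a rough sketch of what a detector limited to the root element sees, written with the JDK's StAX API. The class `RootXmlSniffer` and both of its methods are hypothetical names of mine, and the extra `xsl:version` check is only one possible refinement, not something the freedesktop.org specification defines.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sniffer (my own naming) illustrating the limits of root-XML rules. */
final class RootXmlSniffer {
    static final String XSLT_NS = "http://www.w3.org/1999/XSL/Transform";

    // A simplified stylesheet whose root element is an XHTML html element.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<html xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n"
        + "      xmlns=\"http://www.w3.org/1999/xhtml\" xsl:version=\"1.0\">\n"
        + "  <body/>\n"
        + "</html>";

    /** Returns a reader positioned on the root element's start tag. */
    private static XMLStreamReader openAtRoot(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.next() != XMLStreamConstants.START_ELEMENT) {
            // skip comments, processing instructions and the DOCTYPE, if any
        }
        return reader;
    }

    /** All that a root-XML rule can look at: "{namespaceURI}localName". */
    static String rootName(String xml) {
        try {
            XMLStreamReader root = openAtRoot(xml);
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Possible refinement: an xsl:version attribute on the root marks a stylesheet. */
    static boolean isSimplifiedStylesheet(String xml) {
        try {
            return openAtRoot(xml).getAttributeValue(XSLT_NS, "version") != null;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName(SAMPLE));
        System.out.println(isSimplifiedStylesheet(SAMPLE));
    }
}
```

Judged by its root name alone, the sample is indistinguishable from a plain XHTML page. Only the attribute check reveals that it is meant to be run as a transformation.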
The other thing that I regret, probably because I am not familiar with these issues, is that the description of the « magic » rules is very concise.
One of the questions I would have liked to see answered is which encodings we should try when doing string matching. When I see the following rule in the description of the « application/xml » type:
<magic priority="50">
<match value="&lt;?xml" type="string" offset="0"/>
</magic>
I have the feeling that different encodings should be tried: ASCII would also work for UTF-8, ISO-8859-1 and the like, but it would fail for UTF-16 or EBCDIC…
On the other hand, there are probably many text formats that do not support UTF-16 or EBCDIC, and for which it would be a mistake to try these encodings…
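Here is a minimal sketch of what trying several encodings for a « string » match could look like in Java. The list of candidate encodings is purely my assumption, since the specification does not give one, and the class `EncodedMagic` with its methods is a hypothetical name of mine. I have left EBCDIC out because the JDK does not guarantee that an EBCDIC charset is available on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

/** Hypothetical matcher (my own naming) that tries a string rule in several encodings. */
final class EncodedMagic {
    // Candidate encodings: my own guess, not a list taken from the specification.
    static final List<Charset> CANDIDATES = List.of(
        StandardCharsets.US_ASCII,  // also covers UTF-8, ISO-8859-1 and the like here
        StandardCharsets.UTF_16BE,
        StandardCharsets.UTF_16LE);

    /** Returns the first candidate encoding in which value occurs at offset, if any. */
    static Optional<Charset> matchString(byte[] data, int offset, String value) {
        for (Charset charset : CANDIDATES) {
            if (matchesAt(data, offset, value.getBytes(charset))) {
                return Optional.of(charset);
            }
        }
        return Optional.empty();
    }

    /** Plain byte comparison of pattern against data, starting at offset. */
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(matchString(utf8, 0, "<?xml"));
        System.out.println(matchString(utf16, 0, "<?xml"));
    }
}
```

Note that a real UTF-16 file usually starts with a byte order mark, so the rule's offset of 0 would have to be shifted by two bytes as well. That is one more detail I would have liked to see the « magic » rules spell out.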
More implementations needed!
Having found this gem, I was pretty confident I would find a Java implementation…
There is one buried in Sun’s Java Desktop System (which isn’t Open Source) and one in Nutch, but that seems to be pretty much all that is available!
The MimeType class in Nutch appears to be quite minimal. It probably does what most applications want to do, but that’s not enough for what I’d like to achieve.
The mime types database has some advanced features such as type hierarchy: a mime type can be a subclass of other mime types, for instance all the text types are subclasses of the text/plain type.
These hierarchies can be implicit or explicit and they support multiple inheritance. The freedesktop.org specification gives the following example:
Some types may or may not be instances of other types. For example, a spreadsheet file may be compressed or not. It is a valid spreadsheet file either way, but only inherits from application/x-gzip in one case. This information cannot be represented statically; instead an application interested in this information should run all of the magic rules, and use the list of types returned as the subclasses.
These hierarchies should be used by user interfaces: instead of offering only the tools registered for a specific type, a user interface should also offer the tools registered for its parent classes. If they did, they would let me open an SVG document with an XML or a text editor, or an OpenOffice document with a zip extractor, which can be very handy.
That’s the kind of feature I’d expect to see in a mime type API, and I am wondering whether I will have to write my own Java implementation to see that happen!
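To give an idea of what I have in mind, here is a rough sketch of such a lookup. Everything in it is hypothetical: the class `TypeHierarchy`, its methods, the sample parent relations and the tool names are made up for illustration, and the only implicit rule it knows is the one mentioned above (every text type is a subclass of text/plain).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical type hierarchy (my own naming) with multiple inheritance. */
final class TypeHierarchy {
    // Illustrative sub-class-of relations, not copied from the real database.
    static final Map<String, List<String>> SAMPLE_PARENTS = Map.of(
        "image/svg+xml", List.of("application/xml"),
        "application/xml", List.of("text/plain"),
        "application/vnd.sun.xml.writer", List.of("application/zip"));

    // Made-up tool registrations, keyed by the type each tool is registered for.
    static final Map<String, List<String>> SAMPLE_TOOLS = Map.of(
        "image/svg+xml", List.of("svg-viewer"),
        "application/xml", List.of("xml-editor"),
        "text/plain", List.of("text-editor"),
        "application/vnd.sun.xml.writer", List.of("office-suite"),
        "application/zip", List.of("zip-extractor"));

    private final Map<String, List<String>> explicitParents;

    TypeHierarchy(Map<String, List<String>> explicitParents) {
        this.explicitParents = explicitParents;
    }

    /** The type itself followed by all its ancestors, nearest first (breadth-first). */
    Set<String> selfAndAncestors(String type) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(type);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (!seen.add(current)) {
                continue; // already visited: guards against diamonds and cycles
            }
            queue.addAll(explicitParents.getOrDefault(current, List.of()));
            // Implicit rule: every text type is a subclass of text/plain.
            if (current.startsWith("text/") && !current.equals("text/plain")) {
                queue.add("text/plain");
            }
        }
        return seen;
    }

    /** Tools registered for the type or for any of its ancestors, most specific first. */
    Set<String> candidateTools(String type, Map<String, List<String>> toolsByType) {
        Set<String> tools = new LinkedHashSet<>();
        for (String t : selfAndAncestors(type)) {
            tools.addAll(toolsByType.getOrDefault(t, List.of()));
        }
        return tools;
    }

    public static void main(String[] args) {
        TypeHierarchy hierarchy = new TypeHierarchy(SAMPLE_PARENTS);
        System.out.println(hierarchy.candidateTools("image/svg+xml", SAMPLE_TOOLS));
        System.out.println(hierarchy.candidateTools("application/vnd.sun.xml.writer", SAMPLE_TOOLS));
    }
}
```

A breadth-first walk keeps the most specific tools at the front of the list, which seems to be the order a user interface would want. The dynamic case described in the specification (types that can only be discovered by running all the magic rules) is not covered here.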