This is an Open Web Archive archive of http://dev.saxonica.com/blog/mike/2012/01/.
This snapshot has been taken on 2012-02-25 16:51:00 for the website Eric van der Vlist which contains a link to this page and has saved a copy to be displayed in the page ever disappears.
Saxon diaries: January 2012 Archives

January 2012 Archives

Maps

| No Comments | No TrackBacks
I optimistically assumed that the W3C XQuery/XSLT specification for maps would be published long before Saxon 9.4 was released. I was wrong - the XQuery WG didn't rubber-stamp the XSLT WG's spec as expected, but insisted on reviewing it from first principles, starting a long drawn out study of requirements and use-cases which is still ongoing.

So Saxon 9.4 has maps, but no documentation for them. To remedy this, here is the spec. It happens to be a snapshot of a draft of the XSLT 3.0 specification, but of course everything can change before a draft is published. And as always when Saxon implements features in a draft specification, if the spec changes then Saxon will change with it, regardless of compatibility. Use at your own risk.

20.1 Maps

A map is an additional kind of item.

[Definition: A map comprises a collation and a set of entries. Each entry comprises a key which is an arbitrary atomic value, and an arbitrary sequence called the associated value. Within a map, no two entries have the same key, when compared using the eq operator under the map's collation. It is not necessary that all the keys should be mutually comparable (for example, they can include a mixture of integers and strings). Key values will never be of type xs:untypedAtomic, and they will never be the xs:float or xs:double value NaN.]

The function call map:get($map, $key) can be used to retrieve the value associated with a given key.

A map can also be viewed as a function from keys to associated values. To achieve this, a map is also a function item. The function corresponding to the map has the signaturefunction($key as xs:anyAtomicValue) as item()*. Calling the function has the same effect as calling the get function: the expression $map($key) returns the same result asget($map, $key). For example, if $books-by-isbn is a map whose keys are ISBNs and whose assocated values are book elements, then the expression $books-by-isbn("0470192747") returns the book element with the given ISBN. The fact that a map is a function item allows it to be passed as an argument to higher-order functions that expect a function item as one of their arguments.

Like all other values, maps are immutable. For example, the map:remove function creates a new map by removing an entry from an existing map, but the existing map is not changed by the operation.

Like sequences, maps have no identity. It is meaningful to compare the contents of two maps, but there is no way of asking whether they are "the same map": two maps with the same content are indistinguishable.

20.1.1 The Type of a Map

The syntax of ItemTypeXP30 as defined in XPath is extended as follows:

MapType
[17]   ItemType   ::=   ... | MapType 
[18]   MapType   ::=   'map' '(' ( '*' | (AtomicOrUnionTypeXP30 ',' SequenceTypeXP30) ')'

The ItemType map(K, V) matches an item M if (a) M is a map, and (b) every entry in M has a key that matches K and an associated value that matches V. For example,map(xs:integer, element(employee)) matches a map if all the keys in the map are integers, and all the associated values are employee elements. Note that a map (like a sequence) carries no intrinsic type information separate from the types of its entries, and the type of existing entries in a map does not constrain the type of new entries that can be added to the map.

The ItemType map(*) is equivalent to map(xs:anyAtomicType, item()*), and matches any map regardless of its contents.

Because a map is a function, the type map(K, V) is derived from function(K) as V, and instances of map(K, V) can be used wherever the required type is function(K) as V.

20.1.2 Functions that operate on Maps

The functions defined in this section use a conventional namespace prefix map, which is assumed to be bound to the namespace URI http://www.w3.org/2005/xpath-functions/map.

There is no operation to atomize a map or convert it to a string.

The number of entries in the map may be obtained as count(map:keys($map)).

20.1.2.1 map:new
Summary

Creates a new map: either an empty map, or a map that combines entries from a number of existing maps.

Signatures

new() as map(*)

 

new($maps as map(*)*) as map(*)

 

new($maps as map(*)*,
$collation as xs:string) as map(*)

Rules

The function map:new constructs and returns a new map.

The zero-argument form of the function returns an empty map whose collation is the default collation in the static context. It is equivalent to calling the one-argument form of the function with an empty sequence as the value of the first argument.

The one-argument form of the function returns a map that is formed by combining the contents of the maps supplied in the $input argument. It is equivalent to calling the two-argument form of the function with the default collation from the static context as the second argument.

The two-argument form of the function returns a map that is formed by combining the contents of the maps supplied in the $input argument. The collation of the new map is the value of the $collation argument. The supplied maps are combined as follows:

  1. There is one entry in the new map for each distinct key value present in the union of the input maps, where keys are considered distinct according to the rules of thefn:distinct-values function with $collation as the collation.

  2. The associated value for each such key is taken from the last map in the input sequence $input that contains an entry with this key. If this map contains more than one entry with this key (which can happen if its collation is different from that of the new map) then it is implementation-dependent which of them is selected.

There is no requirement that the supplied input maps should have the same or compatible types. The type of a map (for example map(xs:integer, xs:string)) is descriptive of the entries it currently contains, but is not a constraint on how the map may be combined with other maps.

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:new() returns map{}(Returns an empty map, whose collation is the default collation from the static context).

The expression map:new(()) returns map{}(Returns an empty map, whose collation is the default collation from the static context).

The expression map:new((map:entry(0, "no"), map:entry(1, "yes"))) returns map{0:="no", 1:="yes"}(Returns a map with two entries; the collation of the map is the default collation from the static context).

The expression map:new((map:entry(0, "no"), map:entry(1, "yes"))) returns map{0:="no", 1:="yes"}(Returns a map with two entries; the collation of the map is the default collation from the static context).

The expression map:new(($week, map{7:="Unbekannt"})) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag", 7:="Unbekannt"}(The value of the existing map is unchanged; a new map is created containing all the entries from $week, supplemented with a new entry.).

The expression map:new(($week, map{6:="Sonnabend"})) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Sonnabend"}(The value of the existing map is unchanged; a new map is created containing all the entries from $week, with one entry replaced by a new entry. Both input maps contain an entry with the key value 6; the one used in the result is the one that comes last in the input sequence.).

The expression map:new((map{"A":=1}, map{"a":=2}), "http://collation.example.com/caseblind") returns map{"a":=2}(Assuming that the keys of the two entries are equal under the rules of the chosen collation, only one of the entries can appear in the result; the one that is chosen is the one from the last map in the input sequence. If both entries were in the same map, it would be implementation-dependent which was chosen.).

20.1.2.2 map:collation
Summary

Returns the URI of the supplied map's collation

Signature

collation($input as map(*)) as xs:string

Rules

The function map:collation returns the collation URI of the map supplied as $input.

Examples

The expression map:collation(map:new((), "http://collation.example.com/caseblind")) returns "http://collation.example.com/caseblind".

20.1.2.3 map:keys
Summary

Returns a sequence containing all the key values present in a map

Signature

keys($input as map(*)) as xs:anyAtomicType*

Rules

The function map:keys takes any map as its $input argument and returns the keys that are present in the map as a sequence of atomic values, in implementation-dependent order.

Examples

The expression map:keys(map{1:="yes", 2:="no"}) returns some permutation of (1,2)(The result is in implementation-dependent order.).

20.1.2.4 map:contains 
Summary

Tests whether a supplied map contains an entry for a given key

Signature

contains($map as map(*),
$key as xs:anyAtomicType) as xs:boolean

Rules

The function map:contains returns true if the map supplied as $map contains an entry with a key equal to the supplied value of $key; otherwise it returns false. The equality comparison uses the map's collation; no error occurs if the map contains keys that are not comparable with the supplied $key.

If the supplied key is xs:untypedAtomic, it is converted to xs:string. If the supplied key is the xs:float or xs:double value NaN, the function returns false.

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:contains($week, 2) returns true().

The expression map:contains($week, 9) returns false().

The expression map:contains(map{}, "xyz") returns false().

The expression map:contains(map{"xyz":=23}, "xyz") returns true().

The expression map:contains(map{"abc":=23, "xyz":=()}, "xyz") returns true().

20.1.2.5 map:get
Summary

Returns the value associated with a supplied key in a given map.

Signature

get($map as map(*),
$key as xs:anyAtomicType) as item()*

Rules

The function map:get attempts to find an entry within the map supplied as $input that has a key equal to the supplied value of $key. If there is such an entry, it returns the associated value; otherwise it returns an empty sequence. The equality comparison uses the map's collation; no error occurs if the map contains keys that are not comparable with the supplied $key.

If the supplied key is xs:untypedAtomic, it is converted to xs:string. If the supplied key is the xs:float or xs:double value NaN, the function returns an empty sequence.

Notes

A return value of () from map:get could indicate that the key is present in the map with an associated value of (), or it could indicate that the key is not present in the map. The two cases can be distinguished by calling map:contains.

Invoking the map as a function item has the same effect as calling get: that is, when $map is a map, the expression $map($K) is equivalent to get($map, $K). Similarly, the expression get(get(get($map, 'employee'), 'name'), 'first') can be written as $map('employee')('name')('first').

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:get($week, 4) returns "Donnerstag".

The expression map:get($week, 9) returns ()(When the key is not present, the function returns an empty sequence.).

The expression map:get(map:entry(7,()), 7) returns ()(An empty sequence as the result can also signify that the key is present and the associated value is an empty sequence.).

20.1.2.6 map:entry
Summary

Creates a map that contains a single entry (a key-value pair).

Signature

entry($key as xs:anyAtomicType,
$value as item()*) as map(*)

Rules

The function map:entry returns a new map which normally contains a single entry. The collation of the new map is the default collation from the static context. The key of the entry in the new map is $key, and its associated value is $value.

If the supplied key is the xs:float or xs:double value NaN, the supplied $map is empty (that is, it contains no entries).

If the supplied key is xs:untypedAtomic, it is converted to xs:string.

Notes

The function map:entry is intended primarily for use in conjunction with the function map:new. For example, a map containing seven entries may be constructed like this:

map:new((
   map:entry("Su", "Sunday"),
   map:entry("Mo", "Monday"),
   map:entry("Tu", "Tuesday"),
   map:entry("We", "Wednesday"),
   map:entry("Th", "Thursday"),
   map:entry("Fr", "Friday"),
   map:entry("Sa", "Saturday")
   ))

Unlike the map{...} expression, this technique can be used to construct a map with a variable number of entries, for example:

map:new(for $b in //book return map:entry($b/isbn, $b))
Examples

The expression map:entry("M", "Monday") returns map{"M":="Monday"}.

20.1.2.7 map:remove
Summary

Constructs a new map by removing an entry from an existing map

Signature

remove($map as map(*),
$key as xs:anyAtomicType) as map(*)

Rules

The function map:remove returns a new map. The collation of the new map is the same as the collation of the map supplied as $map. The entries in the new map correspond to the entries of $map, excluding any entry whose key is equal to $key.

No failure occurs if the input map contains no entry with the supplied key; the input map is returned unchanged

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:remove($week, 4) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 5:="Freitag", 6:="Samstag"}.

The expression map:remove($week, 23) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}.

20.1.2.8 fn:deep-equal
Summary

This function is extended to handle maps.

Signatures

deep-equal($parameter1 as item()*,
$parameter2 as item()*) as xs:boolean

 

deep-equal($parameter1 as item()*,
$parameter2 as item()*,
$collation as xs:string) as xs:boolean

Rules

The $collation argument identifies a collation which is used at all levels of recursion when strings are compared (but not when names are compared), according to the rules in [FO:choosing-a-collation]FO.

If the two sequences are both empty, the function returns true.

If the two sequences are of different lengths, the function returns false.

If the two sequences are of the same length, the function returns true if and only if every item in the sequence $parameter1 is deep-equal to the item at the same position in the sequence $parameter2. The rules for deciding whether two items are deep-equal follow.

Call the two items $i1 and $i2 respectively.

If $i1 and $i2 are both atomic values, they are deep-equal if and only if ($i1 eq $i2) is true, or if both values are NaN. If the eq operator is not defined for $i1 and $i2, the function returns false.

If one of the pair $i1 or $i2 is an atomic value and the other is not, or if one is a node and the other is not, the function returns false.

If $i1 and $i2 are both maps, the result is true if and only if all the following conditions apply:

  1. Both maps have the same number of entries.

  2. Both maps have the same collation.

  3. For every entry in the first map, there is an entry in the second map that:

    1. has the same key (compared using the eq operator under the maps' collation), and

    2. has the same associated value (compared using the fn:deep-equal2 function, under the collation supplied in the original call to fn:deep-equal2).

If $i1 and $i2 are both nodes, they are compared as described below:

  1. If the two nodes are of different kinds, the result is false.

  2. If the two nodes are both document nodes then they are deep-equal if and only if the sequence $i1/(*|text()) is deep-equal to the sequence $i2/(*|text()).

  3. If the two nodes are both element nodes then they are deep-equal if and only if all of the following conditions are satisfied:

    1. The two nodes have the same name, that is (node-name($i1) eq node-name($i2)).

    2. The two nodes are both annotated as having simple content or both nodes are annotated as having complex content.

    3. The two nodes have the same number of attributes, and for every attribute $a1 in $i1/@* there exists an attribute $a2 in $i2/@* such that $a1 and $a2 are deep-equal.

    4. One of the following conditions holds:

      • Both element nodes have a type annotation that is simple content, and the typed value of $i1 is deep-equal to the typed value of $i2.

      • Both element nodes have a type annotation that is complex content with elementOnly content, and each child element of $i1 is deep-equal to the corresponding child element of $i2.

      • Both element nodes have a type annotation that is complex content with mixed content, and the sequence $i1/(*|text()) is deep-equal to the sequence$i2/(*|text()).

      • Both element nodes have a type annotation that is complex content with empty content.

  4. If the two nodes are both attribute nodes then they are deep-equal if and only if both the following conditions are satisfied:

    1. The two nodes have the same name, that is (node-name($i1) eq node-name($i2)).

    2. The typed value of $i1 is deep-equal to the typed value of $i2.

  5. If the two nodes are both processing instruction nodes, then they are deep-equal if and only if both the following conditions are satisfied:

    1. The two nodes have the same name, that is (node-name($i1) eq node-name($i2)).

    2. The string value of $i1 is equal to the string value of $i2.

  6. If the two nodes are both namespace nodes, then they are deep-equal if and only if both the following conditions are satisfied:

    1. The two nodes either have the same name or are both nameless, that is fn:deep-equal(node-name($i1), node-name($i2)).

    2. The string value of $i1 is equal to the string value of $i2 when compared using the Unicode codepoint collation.

  7. If the two nodes are both text nodes or comment nodes, then they are deep-equal if and only if their string-values are equal.

Error Conditions

An error is raised [tba] if either input sequence contains a function item that is not a map.

Notes

Two nodes are not required to have the same type annotation, and they are not required to have the same in-scope namespaces. They may also differ in their parent, their base URI, and the values returned by the is-id and is-idrefs accessors (see Section 5.5 is-id Accessor DM30 and Section 5.6 is-idrefs Accessor DM30). The order of children is significant, but the order of attributes is insignificant.

The contents of comments and processing instructions are significant only if these nodes appear directly as items in the two sequences being compared. The content of a comment or processing instruction that appears as a descendant of an item in one of the sequences being compared does not affect the result. However, the presence of a comment or processing instruction, if it causes a text node to be split into two text nodes, may affect the result.

The result of fn:deep-equal(1, current-dateTime()) is false; it does not raise an error.

Examples

The expression fn:deep-equal(map{}, map{}) returns true().

The expression fn:deep-equal(map{"a":=1, "b":=2}, map{"b":=2, "a":=1.0}) returns true().

The expression fn:deep-equal(map{"a":=xs:double('NaN')}, map{"a":=xs:float('NaN')}) returns true().

let $at :=

<attendees>
  <name last='Parker' first='Peter'/>
  <name last='Barker' first='Bob'/>
  <name first='Peter' last='Parker'/>
</attendees>

The expression fn:deep-equal($at, $at/*) returns false().

The expression fn:deep-equal($at/name[1], $at/name[2]) returns false().

The expression fn:deep-equal($at/name[1], $at/name[3]) returns true().

The expression fn:deep-equal($at/name[1], 'Peter Parker') returns false().

20.1.3 Map Expressions

A new kind of expression is added to the syntax of XPath.

The syntax of PrimaryExprXP30 is extended to permit MapExpr as an additional alternative.

MapExpr := "map" "{" (KeyExpr ":=" ValueExpr ("," KeyExpr ":=" ValueExpr )*)? "}"

KeyExpr := ExprSingle

ValueExpr := ExprSingle

Note:

Two variations on this syntax are under consideration: removing the leading keyword "map", and using the token ":" in place of ":=". This would bring the syntax closer to Javascript and JSON notation. However, special lexical rules would be needed to disambiguate this use of ":" from other uses. Feedback is invited.

The value of the expression is a map whose entries correspond to the key-value pairs obtained by evaluating the successive KeyExpr and ValueExpr expressions.

Each KeyExpr expression is evaluated and atomized; a dynamic error occurs if the result is not a single atomic value. If the key value is of type xs:untypedAtomic it is converted toxs:string. The associated value is the result of evaluating the corresponding ValueExpr. The collation of the new map is the default collation from the static context. If the key value is NaN then the key/value pair is not added to the map. If two or more keys are equal under the collation of the map then the last occurrence is added to the map and the others are ignored.

For example, the following expression constructs a map with seven entries:

map {
  "Su" := "Sunday",
  "Mo" := "Monday",
  "Tu" := "Tuesday",
  "We" := "Wednesday",
  "Th" := "Thursday",
  "Fr" := "Friday",
  "Sa" := "Saturday
}

Note:

Unlike the map:new function, the number of entries in a map that is constructed using a map expression is known statically, except where duplicate keys or NaN values cause some entries to be ignored.

20.1.4 Examples using maps

This section gives some examples of where maps can be useful.

Example: Using maps with xsl:iterate

This example uses maps in conjunction with the xsl:iterate instruction to find the highest-earning employee in each department, in a single streaming pass of an input document containing employee records.

<xsl:stream href="employees.xml">
  <xsl:iterate select="*/employee">
    <xsl:param name="highest-earners" 
               as="map(xs:string, element(employee))" 
               select="map:new()"/>
    <xsl:variable name="this" select="copy-of(.)" as="element(employee)"/> 
    <xsl:next-iteration>
      <xsl:with-param name="highest-earners"
          select="let $existing := $highest-earners($this/department)
                  return if ($existing/salary gt $this/salary)
                         then $highest-earners
                         else map:new($highest-earners, map:entry($this/department, $this))"/>
    </xsl:next-iteration>
    <xsl:on-completion>
      <xsl:for-each select="map:keys($highest-earners)">
        <department name="{.}">
          <xsl:copy-of select="$highest-earners(.)"/>
        </department>
      </xsl:for-each>
    </xsl:on-completion>
  </xsl:iterate>
</xsl:stream>

 

Example: Using Maps to implement Complex Numbers

A complex number might be represented as a map with two entries, the keys being the xs:boolean value true for the real part, and the xs:boolean value false for the imaginary part. A library for manipulation of complex numbers might include functions such as the following:

<xsl:function name="i:complex" as="map(xs:boolean, xs:double)">
  <xsl:param name="real" as="xs:double"/>
  <xsl:param name="imaginary" as="xs:double"/>
  <xsl:sequence select="map{ true() := $real, false() := $imaginary }"/>
</xsl:function>

<xsl:function name="i:real" as="xs:double">
  <xsl:param name="complex" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="$complex(true())"/>
</xsl:function>

<xsl:function name="i:imaginary" as="xs:double">
  <xsl:param name="complex" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="$complex(false())"/>
</xsl:function>

<xsl:function name="i:add" as="map(xs:boolean, xs:double)">
  <xsl:param name="arg1" as="map(xs:boolean, xs:double)"/>
  <xsl:param name="arg2" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="i:complex(i:real($arg1)+i:real($arg2), i:imaginary($arg1)+i:imaginary($arg2)"/>
</xsl:function>

<xsl:function name="i:multiply" as="map(xs:boolean, xs:double)">
  <xsl:param name="arg1" as="map(xs:boolean, xs:double)"/>
  <xsl:param name="arg2" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="i:complex(
      i:real($arg1)*i:real($arg2) - i:imaginary($arg1)*i:imaginary($arg2),
      i:real($arg1)*i:imaginary($arg2) + i:imaginary($arg1)*i:real($arg2))"/>
</xsl:function>

Note:

This example demonstrates how useful it would be to allow user-defined type aliases, so that callers of this function library could write code that treats the value simply as acomplex-number, not as a map. A proposal to introduce such type aliases is under consideration.

 

Example: Using a Map as an Index

Given a set of book elements, it is possible to construct an index in the form of a map allowing the books to be retrieved by ISBN number.

Assume the book elements have the form:

<book>
  <isbn>0470192747</isbn>
  <author>Michael H. Kay</author>
  <publisher>Wiley</publisher>
  <title>XSLT 2.0 and XPath 2.0 Programmer's Reference</title>
</book>

An index may be constructed as follows:

<xsl:variable name="isbn-index" as="map(xs:string, element(book))"
    select="map:new(for $b in //book return map{$b/isbn := $b})"/>

This index may then be used to retrieve the book for a given ISBN using either of the expressions map:get($isbn-index, "0470192747") or $isbn-index("0470192747").

In this simple form, this replicates the functionality available using xsl:key and the key function. However, it also provides capabilities not directly available using the key function: for example, the index can include book elements in multiple source documents. It also allows processing of all the books using a construct such as <xsl:for-each select="map:keys($isbn-index)">

 

Example: A Map containing named Functions

As in Javascript, a map whose keys are strings and whose associated values are function items can be used in a similar way to a class in object-oriented programming languages.

Suppose an application needs to handle customer order information that may arrive in three different formats, with different hierarchic arrangement:

  1. Flat structure:

    <customer id="c123">...</customer>
    <product id="p789">...</product>
    <order customer="c123" product="p789">...</order>
    
  2. Orders within customer elements:

    <customer id="c123">
       <order product="p789">...</order>
    </customer>
    <product id="p789">...</product>
    
  3. Orders within product elements:

    <customer id="c123">...</customer>
    <product id="p789">
      <order customer id="c123">...</order>
    </product>
    

An application can isolate itself from these differences by defining a set of functions to navigate the relationships between customers, orders, and products: orders-for-customerorders-for-productcustomer-for-orderproduct-for-order. These functions can be implemented in different ways for the three different input formats. For example, with the first format the implementation might be:

<xsl:variable name="flat-input-functions" as="map(xs:string, function(*))*"
  select="map {
            'orders-for-customer' := 
                 function($c as element(customer)) as element(order)* 
                    {$c/../order[@customer=$c/@id]},
            'orders-for-product' := 
                 function($p as element(product)) as element(order)* 
                    {$p/../order[@product=$p/@id]},
            'customer-for-order' := 
                 function($o as element(order)) as element(customer) 
                    {$o/../customer[@id=$o/@customer]},
            'product-for-order' := 
                 function($o as element(order)) as element(product) 
                    {$o/../product[@id=$o/@product]} }                    
         "/>

Having established which input format is in use, the application can bind the appropriate implementation of these functions to a variable such as $input-navigator, and can then process the input using XPath expressions such as the following, which selects all products for which there is no order: //product[empty($input-navigator("orders-for-product")(.))]


A new regex engine

| No Comments | No TrackBacks
Welcome to the new blog site. I hope it meets expectations.

I've been busy integrating a new regex engine into Saxon over the last week or so.

Why?

The current arrangement is that Saxon (using code written originally by James Clark) parses the regular expression supplied to functions like matches() and replace(), or to the XSD pattern facet, and translates it into a Java JDK regular expression.

The disadvantages of this are (a) the process is inefficient, as the regular expression is effectively parsed twice (and the JDK regular expression that Saxon generates may be very long and complicated in some cases), (b) Saxon has little control over the details of evaluation, needed to achieve precise conformance to the XSD and XPath specs, (c) there are bugs in the JDK regex engine which Oracle appear to be in no hurry to fix.

In Saxon 9.4 we've moved forward to Unicode 6.0 (the JDK is still on 3 point-something) and this compounds the problem, because the expansions of character classes such as \p{Lu} become increasingly long-winded.

For Saxon-CE there is an added impetus. GWT doesn't support the Java regular expression API; instead it provides an interface to JavaScript regular expressions. We could have written yet another variant of the regex translator, including digging out the old version that targeted a regex engine with no support for non-BMP Unicode characters, but this didn't feel like an attractive option; it wasn't even easy to see how constructs such as XSD's subtraction operator ([A-Z-[aeiou]]) should be supported. It seemed a good time for Saxon to have its own regex engine.

I looked at two candidates: JRegex and Apache Jakarta. On paper, JRegex looked a better option, but I found the code extremely obscure and badly documented. I did a preliminary integration of JRegex, but it crashed on some of the tests, and understanding the engine well enough to debug and fix these problems didn't look like being much fun. So I decided to go with Jakarta.

Jakarta is not a particularly sophisticated regex engine, but it's easy to understand how it works and therefore easy to change it. It's also surprisingly small - not much bigger than the current JDK regex translator. I started by changing the front-end to handle the (slightly different) XSD and XPath dialects of regular expressions, which was straightforward enough, except perhaps for the rather subtle rules on how hyphens appearing within a character class (ie. between square brackets) are parsed, and how many digits get swallowed by a back-reference. 

The next change was to ensure that all Unicode codepoints (including surrogate pairs) are handled as single characters rather than pairs of 16-bit values. This didn't prove difficult, partly because the engine was already using an abstraction for character strings (although this was used only for the input string and not for the regex itself.) Then all character classes such as \p{Lu}, \i, and \c had to be implemented in terms of the Unicode 6.0 definitions - which wasn't hard, because the existing regex translator already understood these constructs.

At this stage nearly all the tests ran successfully (and fortunately, there's a large suite of regression tests for both XSD and XQuery).

Jakarta enforced a rule that in the construct (E)*, E must not be "nullable". For example, you aren't allowed to write (a?)*. That seems a reasonable restriction, but unfortunately there's no such limitation in the XSD or XPath specs, so I had to work out what to do with this construct. Simply removing the restriction meant that the matching engine would fail to terminate on these constructs (after all, every string contains an infinite number of consecutive empty sequences...) The pro-tem solution I've come up with, though I'm not 100% enamoured of it, is for the (*) operator to keep track of where it has been: if the same operation is applied at the same position in the input string more than once, the second attempt is deemed to be a no-match. (If anyone knows a better approach, let me know.)

The final problem shown up by the tests was that matching long input strings had a tendency to cause a stack overflow. This is because the engine is using recursion to work its way through the string, this being the easy way to implement back-tracking. It's well known that backtracking isn't a brilliant approach to regex evaluation, especially for pathological cases, but unfortunately the kind of regexes in popular use (with substring capture and backreferences) aren't true "regular" expressions in the sense that the theorists use the term, and the better algorithms can only handle truly regular expressions. At this stage I felt some pragmatic improvements were more appropriate than a rewrite using a new algorithm, and it seems that all the test cases that were failing in this way were amenable to a simple optimization: if you can detect that (E*) must be followed by something that doesn't match E, then the evaluation of E* will never need to backtrack, and therefore it can be evaluated using iteration rather than recursion.

It's now all running, both in "big" Saxon (-HE, -PE, -EE) and also in the client-side Saxon-CE engine. We haven't measured performance yet, but the results are unlikely to be surprising: initial compilation of regular expressions should be considerably faster; evaluation of simple regular expressions should be much the same, and evaluation of pathological regular expressions involving backtracking will probably be bad (as it already is in the JDK engine).


Enhanced by Zemanta

About this Archive

This page is an archive of entries from January 2012 listed from newest to oldest.

July 2011 is the previous archive.

February 2012 is the next archive.

Find recent content on the main index or look in the archives to find all content.