Docs/ Developers’ Guide to Path Extraction APIs

This document provides a guide to implementing a path extraction API as an alternative to the traditional streaming and DOM APIs. A reference implementation exists in ion-c (see 1, 2, and 3) and it’s also available as an extension to ion-java.

Motivation

The traditional streaming and DOM APIs force the user to choose between speed and convenience, respectively. Path extraction APIs aim to combine the two by allowing the user to register paths into the data using just a few lines of code and receive callbacks during stream processing when any of those paths is matched. This allows the Ion reader to plan the most efficient traversal over the data without requiring further manual interaction from the user. For example, there is no reason to step in to containers which could not possibly match one of the search paths. When encoded in binary Ion, the resulting skip is a seek forward in the input stream, which is inexpensive relative to the cost of parsing (and in the case of a DOM, materializing) the skipped value.

Terms

Search path - a path which is provided to the extractor for matching.

Partial match - a value whose path matches the first N elements in a search path (where N is less than the search path’s total length).

Terminal match - a value whose path exactly matches a search path.

Active path - a search path that has at least one partial match at depth N, where N is the extractor’s current depth minus one.

Phases

Interacting with a path extraction API typically occurs in three phases:

  1. configuration,
  2. path registration, and
  3. notification.

Configuration

The configuration phase involves readying an extractor instance for path registration by setting its options. It should be noted that an Ion reader should not be provided to the extractor during configuration; it should be provided as an argument to the function that begins the notification phase, as described below. This allows the extractor itself to be stateless, immutable, long-lived, and reusable.

Common Options

Public API Example

# Do not bind a reader to the extractor at construction.
extractor = Extractor(max_path_depth=10, max_num_paths=100, match_relative_paths=false, case_insensitive=true)

Registration

The registration phase begins once the user has a path extractor instance configured with the desired options. The user may now register search paths. Path registration APIs may be entirely programmatic, and/or may parse path information from a string. Both techniques require the following information to register a path:

At the discretion of the implementor, paths may either be standalone (i.e. objects), or may be contained internally to the extractor. The latter strategy may be most useful when memory and performance concerns are important. In these cases, implementors may consider designing the layout of the paths in memory for efficient traversal.

The callback function signature should include parameters for the matching value (likely represented by an Ion reader positioned at the value, but this may be implementation-defined) and the user context; and a return value which indicates to the extractor how it should proceed. This return value can be described as a ‘step-out-N’ instruction. The most common value is zero, which tells the extractor to continue with the next value at the same depth. A return value greater than zero may be useful to users who only care about the first match at a particular depth.

When the callback function is invoked with an Ion reader positioned at the matching value, and that reader is the same one used internally by the extractor, one of two steps must be taken:

An alternative is to avoid exposing the Ion reader to the user, and instead provide a DOM-like representation of the matching value to the user through the callback. This may be more convenient for the user (especially those already comfortable with the DOM), but is potentially less flexible and less performant.

Programmatic Registration

Programmatic registration involves allowing the user to manually construct a path by chaining path elements. This should be implemented in a language-idiomatic way.

String Registration

Users may find it simpler to create a path in one line by using a string representation. It is possible to accept a wide variety of different syntaxes (e.g. XPath, JSONPath, etc.), but the following stripped-down Ion-based syntax should be supported by all implementations.

This syntax calls for a string of text Ion data which must contain exactly one top-level ordered sequence (either list or s-expression) containing a number of elements less than or equal to the extractor’s maximum path depth. The elements (if any) must be either text types (string or symbol), representing fields or wildcards, or integers, representing indices. A non-wildcard field element with the same text as a wildcard must be annotated with the special annotation $ion_extractor_field as its first annotation.

For example, () represents a path of length zero, which will match all top-level values when the extractor’s initial depth is zero, while (abc * 2 $ion_extractor_field::*) represents a path of length 4 consisting of a field named abc, a wildcard, an index of 2, and a field named *.

Public API Example

def pathCallback(reader, userContext):
    ... # Retrieve the matching value from the reader and do something.
    return 0

path = Path("(foo * bar 0)")
pathContext = ... # Create some user context to pass to the callback when matched, if desired.
extractor.register(path, pathContext, pathCallback)

Notification

The notification phase begins when the user finishes registering paths and invokes a function which causes the path extractor to begin processing the given Ion stream. As previously discussed, this function should accept an Ion reader (positioned at depth zero, unless the extractor is configured to match relative paths) over the stream to be processed.

Each time the reader encounters a value, the extractor must evaluate whether the value is a partial and/or terminal match to at least one of the active paths at the current depth.

If the value is a

When the end of a container is reached, and the container’s depth is greater than the reader’s initial depth, step out of the container and resume processing with the active paths at the stepped-out depth.

Determining a match

Search path elements which are

at the element’s depth.

Performance considerations

Observing a few characteristics of the matching algorithm may reveal opportunities to design for performance.

Public API Example

reader1 = ... # Create an Ion reader in the typical way.
extractor.match(reader1)
reader2 = ...
extractor.match(reader2)