This is a draft specification of Ion 1.1, a new minor version of the Ion serialization format.

Status

This document is a working draft and is subject to change.

Audience

This document presents the formal specification for the Ion 1.1 data format. It is not intended to be used as a user guide or as a cookbook, but as a reference to the syntax and semantics of the Ion data format and its logical data model.

What's New in Ion 1.1

This section gives a high-level overview of what is new and different in Ion 1.1 relative to Ion 1.0, from an implementer's perspective.

Motivation

Ion 1.1 has been designed to address some of the trade-offs in Ion 1.0 to make it suitable for a wider range of applications. Ion 1.1 makes length prefixing of containers optional, and makes the interning of symbolic tokens optional as well. This gives more flexibility to applications that write data more often than they read it, or whose writers are constrained in some way. Data density is another motivation. Certain encodings (e.g., timestamps, integers) have been made more compact and efficient, but more significantly, macros now enable applications to have very flexible interning of their data's structure. In aggregate, data transcoded from Ion 1.0 to Ion 1.1 should be more compact.

Backwards compatibility

Ion 1.0 and Ion 1.1 share the same data model. Any data that can be represented in Ion 1.0 can also be represented with full fidelity in Ion 1.1 and vice-versa. This means that it is always possible to convert data from one version to the other without risk of data loss.

Ion 1.1 readers must be able to understand both Ion 1.0 and Ion 1.1 data.

The text encoding grammar of Ion 1.1 is a superset of Ion 1.0's text encoding grammar. Any Ion 1.0 text data can also be parsed by an Ion 1.1 text parser. However, because Ion 1.1 has a different system symbol table, symbol IDs in an Ion 1.0 stream do not always refer to the same text as the same symbol ID in an Ion 1.1 stream. (For example: in an Ion 1.0 stream, $4 is always the text "name". However, it may or may not be "name" in an Ion 1.1 stream.)

Ion 1.1's binary encoding is substantially different from Ion 1.0's binary encoding. Many changes have been made to make values more compact, faster to read and/or faster to write. Ion 1.0's type descriptors have been supplanted by Ion 1.1's more general opcodes, which have been organized to prioritize the most commonly used encodings and make leveraging macros as inexpensive as possible.

In both text and binary Ion 1.1, the Ion Version Marker syntax is compatible with Ion 1.0's version marker syntax.

This means that an Ion 1.0-only reader can correctly identify when a stream uses Ion 1.1 (allowing it to report an error), and an Ion 1.1 reader can correctly "downshift" to expecting Ion 1.0 data when it encounters a stream using Ion 1.0.

Two streams using different Ion versions can be safely concatenated together provided that they are both text or both binary. A concatenated stream containing both Ion 1.0 and Ion 1.1 can only be fully read by a reader that supports Ion 1.1.

Upgrading an existing application to Ion 1.1 often requires few or no code changes, as APIs typically operate at the data model level ("write an integer") rather than at the encoding level ("write 0x64 followed by four Little-Endian bytes"). However, taking full advantage of macros after upgrading typically requires additional development time.

Text syntax changes

Ion 1.1 text must use the $ion_1_1 version marker at the top-level of the data stream or document.

The only syntax change for the text format is the introduction of encoding expression (E-expression) syntax, which allows for the invocation of macros in the data stream. This syntax is grammatically similar to S-expressions, except that these expressions are opened with (: and closed with ). For example, (:a 1 2) would expand the macro named a with the arguments 1 and 2. See the Macros, templates, and encoding expressions section for details.

This syntax is allowed anywhere an Ion value is allowed:

E-expression examples

// At the top level
(:foo 1 2)

// Nested in a list
[1, 2, (:bar 3 4)]

// Nested in an S-expression
(cons a (:baz b))

// Nested in a struct
{c: (:bop d)}

E-expressions may also appear in the field name position of a struct.

E-expression in the field name position of a struct

{
    a:1,
    b:2,
    (:foo 1 2),
    c: 3,
}

Binary encoding changes

Ion 1.1 binary encoding reorganizes the type descriptors to support compact E-expressions and to make certain encodings more compact, at the cost of making certain lower-priority encodings marginally less compact. The IVM for this encoding is the octet sequence 0xE0 0x01 0x01 0xEA.
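As an illustration of how a reader might use the binary IVM to choose between Ion 1.0 and Ion 1.1 handling, the following Python sketch inspects the first four octets of a stream. It assumes the Ion 1.0 binary IVM is 0xE0 0x01 0x00 0xEA; the function names detect_binary_version and can_read are hypothetical and not part of any Ion library.

```python
# Hypothetical helper names; a minimal sketch, not a reference implementation.
ION_1_0_IVM = bytes([0xE0, 0x01, 0x00, 0xEA])  # assumed Ion 1.0 binary IVM
ION_1_1_IVM = bytes([0xE0, 0x01, 0x01, 0xEA])  # IVM given in this section


def detect_binary_version(stream: bytes):
    """Return (major, minor) if `stream` begins with a binary IVM, else None."""
    if len(stream) < 4 or stream[0] != 0xE0 or stream[3] != 0xEA:
        return None
    return (stream[1], stream[2])


def can_read(stream: bytes, supported=((1, 0), (1, 1))) -> bool:
    """True if the stream's declared version is one the reader supports."""
    return detect_binary_version(stream) in supported
```

A reader that supports only Ion 1.0 would pass supported=((1, 0),) and could report an error for an Ion 1.1 stream instead of misinterpreting it.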

Inlined symbolic tokens

In binary Ion 1.0, symbol values, field names, and annotations are required to be encoded using a symbol ID in the local symbol table. For some use cases (e.g. write-once, read-maybe logs) this creates a burden on the writer and may not actually be efficient for an application. Ion 1.1 introduces optional binary syntax for encoding inline UTF-8 sequences for these tokens which can allow an encoder to have flexibility in whether and when to add a given text value to the symbol table.

Ion text requires no change for this feature as it already had inline symbolic tokens without using the local symbol table. Ion text also has compatible syntax for representing the local symbol table and encoding of symbolic tokens with their position in the table (i.e., the $id syntax).

Delimited containers

In Ion 1.0, all data is length prefixed. While this is good for optimizing the reading of data, it requires an Ion encoder to buffer any data in memory to calculate the data's length. Ion 1.1 introduces optional binary syntax to allow containers to be encoded with an end marker instead of a length prefix.

Low-level binary encoding changes

Ion 1.0's VarUInt and VarInt encoding primitives used big-endian byte order and used the high bit of each byte to indicate whether it was the final byte in the encoding. VarInt used an additional bit in the first byte to represent the integer's sign. Ion 1.1 replaces these primitives with more optimized versions called FlexUInt and FlexInt.

FlexUInt and FlexInt use little-endian byte order, avoiding the need for reordering on common architectures like x86, aarch64, and RISC-V.

Rather than using a bit in each byte to indicate the width of the encoding, FlexUInt and FlexInt front-load the continuation bits. In most cases, this means that these bits all fit in the first byte of the representation, allowing a reader to determine the complete size of the encoding without having to inspect each byte individually.

Finally, FlexInt does not use a separate bit to indicate its value's sign. Instead, it uses two's complement representation, allowing it to share much of the same structure and parsing logic as its unsigned counterpart. Benchmarks have shown that in aggregate, these encoding changes are between 1.25x and 3x faster than Ion 1.0's VarUInt and VarInt encodings depending on the host architecture.
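The following Python sketch illustrates the idea of front-loaded continuation bits. The exact bit layout is an assumption that this section does not spell out: an n-byte encoding stores the value shifted left by n bits, with the low n bits holding n-1 zeros followed by a 1, and the whole quantity is written in little-endian order. The function names are hypothetical.

```python
# A sketch under an assumed bit layout; not a normative definition.

def encode_flex_uint(value: int) -> bytes:
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    n = max(1, -(-value.bit_length() // 7))  # each byte carries 7 value bits
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def encode_flex_int(value: int) -> bytes:
    # Two's complement needs one extra bit for the sign.
    magnitude_bits = value.bit_length() if value >= 0 else (~value).bit_length()
    n = -(-(magnitude_bits + 1) // 7)
    raw = (value << n) | (1 << (n - 1))
    return (raw & ((1 << (8 * n)) - 1)).to_bytes(n, "little")


def _encoded_length(data: bytes) -> int:
    # The position of the lowest set bit gives the total width in bytes.
    for index, byte in enumerate(data):
        if byte:
            trailing_zeros = (byte & -byte).bit_length() - 1
            return 8 * index + trailing_zeros + 1
    raise ValueError("no width marker found")


def decode_flex_uint(data: bytes):
    """Return (value, bytes_consumed)."""
    n = _encoded_length(data)
    if len(data) < n:
        raise ValueError("truncated FlexUInt")
    return int.from_bytes(data[:n], "little") >> n, n


def decode_flex_int(data: bytes):
    """Return (value, bytes_consumed); the shift is arithmetic, preserving sign."""
    n = _encoded_length(data)
    if len(data) < n:
        raise ValueError("truncated FlexInt")
    return int.from_bytes(data[:n], "little", signed=True) >> n, n
```

Note that the decoder learns the full width from the first non-zero byte, and that the signed and unsigned decoders differ only in how the little-endian bytes are interpreted.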

Ion 1.1 supplants Ion 1.0's Int encoding primitive with a new encoding called FixedInt, which uses two's complement notation instead of sign-and-magnitude. A corresponding FixedUInt primitive has also been introduced; its encoding is the same as Ion 1.0's UInt primitive.

A new primitive encoding type, FlexSym, has been introduced to flexibly encode symbol IDs and symbolic tokens with inline text.

Type encoding changes

All Ion types use the new low-level encoding primitives described in the previous section. Ion 1.0's type descriptors have been supplanted by Ion 1.1's more general opcodes, which have been organized to prioritize the most commonly used encodings and make leveraging macros as inexpensive as possible.

Typed null values are now encoded in two bytes using the 0xEB opcode.

Lists and S-expressions have two encodings: a length-prefixed encoding and a new delimited form that ends with the 0xF0 opcode.

Struct values have the option of encoding their field names as a FlexSym, enabling them to write field name text inline instead of adding all names to the symbol table. There is now also a delimited form.

Similarly, symbol values now also have the option of encoding their symbol text inline.

Annotation sequences are a prefix to the value they decorate, and no longer have an outer length container. They are now encoded with one of three opcodes:

  1. 0xE7, which is followed by a single annotation and then the decorated value.
  2. 0xE8, which is followed by two annotations and then the decorated value.
  3. 0xE9, which is followed by a FlexUInt indicating the number of bytes used to encode the annotations sequence, the sequence itself, and then the decorated value.

The last encoding is similar to how Ion 1.0 annotations are encoded with the exception that there is no outer length in addition to the annotations sequence length.

Integers now use a FixedInt sub-field instead of the Ion 1.0 encoding which used sign-and-magnitude (with two opcodes).

Decimals are structurally identical to their Ion 1.0 counterpart with the exception of the negative zero coefficient. The Ion 1.1 FlexInt encoding is two's complement, so negative zero cannot be encoded directly with it. Instead, an opcode is allocated specifically for encoding decimals with a negative zero coefficient.

Timestamps no longer encode their sub-field components as octet-aligned fields.

The Ion 1.1 format uses a packed bit encoding and has a biased form (encoding the year field as an offset from 1970) to make common encodings of timestamp easily fit in a 64-bit word for microsecond and nanosecond precision (with UTC offset or unknown UTC offset). Benchmarks have shown this new encoding to be 59% faster to encode and 21% faster to decode. A non-biased, arbitrary length timestamp with packed bit encoding is defined for uncommon cases.

Encoding expressions in binary

In binary, E-expressions are encoded with an opcode that includes the macro identifier or an opcode that specifies a FlexUInt for the macro identifier. The identifier is followed by the encoding of the arguments to the E-expression. The macro's definition statically determines how the arguments are to be laid out. An argument may be a full Ion value with a leading opcode (sometimes called a "tagged" value), or it could be a lower-level encoding (e.g., a fixed width integer or FlexInt/FlexUInt).

Macros, templates, and encoding expressions

Ion 1.1 introduces a new primitive called an encoding expression (E-expression). These expressions are (in text syntax) similar to S-expressions, but they are not part of the data model and are evaluated into one or more Ion values (called a stream) which enable compact representation of Ion data. E-expressions represent the invocation of either system defined or user defined macros with arguments that are either themselves E-expressions, value literals, or container constructors (list, sexp, struct syntax containing E-expressions) corresponding to the formal parameters of the macro's definition. The resulting stream is then expanded into the resulting Ion data model.

At the top level, the stream becomes individual top-level values. Consider for illustrative purposes an E-expression (:values 1 2 3) that evaluates to the stream 1, 2, 3 and (:none) that evaluates to the empty stream. In the following examples, values and none are the names of the macros being invoked and each line is equivalent.

Top-level e-expressions

// Encoding
a (:values 1 2 3) b (:none) c

// Evaluates to
a 1 2 3 b c

Within a list or S-expression, the stream becomes additional child elements in the collection.

E-expressions in lists

// Encoding
[a, (:values 1 2 3), b, (:none), c]

// Evaluates to
[a, 1, 2, 3, b, c]

E-expressions in S-expressions

// Encoding
(a (:values 1 2 3) b (:none) c)

// Evaluates to
(a 1 2 3 b c)

Within a struct, an E-expression in the field name position must produce a stream of structs; each field of those structs becomes a field of the enclosing struct (no value portion is specified). An E-expression in the value position produces a stream of values, each of which becomes a field with the field name that precedes the E-expression (an empty stream elides the field altogether). In the following examples, let us define (:make_struct c 5) that evaluates to a single struct {c: 5}.

E-expressions in structs

// Encoding
{
  a: (:values 1 2 3),
  b: 4,
  (:make_struct c 5),
  d: 6,
  e: (:none)
}

// Evaluates to
{
  a: 1,
  a: 2,
  a: 3,
  b: 4,
  c: 5,
  d: 6
}
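The splicing rules above can be modelled with a small Python sketch. It is a toy model rather than an Ion reader: streams are Python lists, structs are lists of (field name, value) pairs so that repeated field names survive, and the EExp class and the MACROS table are hypothetical stand-ins for the illustrative macros values, none, and make_struct.

```python
from dataclasses import dataclass


@dataclass
class EExp:
    """Toy stand-in for an E-expression: a macro name plus its arguments."""
    name: str
    args: tuple = ()


# Each illustrative macro returns the stream (a list) it evaluates to.
MACROS = {
    "values": lambda *args: list(args),
    "none": lambda: [],
    "make_struct": lambda name, value: [[(name, value)]],  # stream of one struct
}


def expand(expr):
    """Return the stream of values produced by `expr`."""
    if isinstance(expr, EExp):
        return MACROS[expr.name](*expr.args)
    return [expr]  # a plain value is a stream of one


def expand_sequence(items):
    """Top level, list, or S-expression: splice each stream in place."""
    out = []
    for item in items:
        out.extend(expand(item))
    return out


def expand_struct(fields):
    """`fields` is a list of (name, value); name None marks field-name position."""
    out = []
    for name, value in fields:
        if name is None:
            for struct in expand(value):  # every result must be a struct
                out.extend(struct)
        else:
            out.extend((name, v) for v in expand(value))
    return out
```

For example, expand_sequence(["a", EExp("values", (1, 2, 3)), "b", EExp("none"), "c"]) yields the six values a, 1, 2, 3, b, c, mirroring the list example above.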

Modules

Ion 1.0 uses symbol tables to group together related text values. In order to also accommodate macros, Ion 1.1 introduces modules. A module is a named organizational unit that contains:

  • An exported symbol table, a list of text values used to compactly encode symbol tokens like field names, annotations, and symbol values.
  • An exported macro table, a list of macro definitions used to compactly encode complete values or partially populated containers.
  • An unexported nested modules map, a set of unique module names and their associated module definitions.

While Ion 1.0 does not have modules, it is reasonable to think of Ion 1.0's local symbol table as a module that only has symbols, and whose macro table and nested modules map are permanently empty.

Modules can be imported from the catalog (they subsume shared symbol tables) or defined locally.

Directives

Directives modify the encoding context. Syntactically, a directive is a top-level s-expression annotated with $ion. Its first child value is an operation name. The operation determines what changes will be made to the encoding context and which clauses may legally follow.

$ion::
(operation_name
    (clause_1 /*...*/)
    (clause_2 /*...*/)
    /*...*/
    (clause_N /*...*/))

In Ion 1.1, there are three supported directive operations:

  1. module
  2. import
  3. encoding

Macro definitions

Macros can be defined by a user either directly in a local module within an encoding directive or in a module defined externally (i.e., shared module). A macro has a name which must be unique in a module or it may have no name.

Ion 1.1 defines a list of system macros that are built into the module named $ion. Unlike the system symbol table, which is always installed and accessible in the local symbol table, the system macros are not installed in the local macro table by default. They are nevertheless always accessible to E-expressions.

In Ion binary, macros are always addressed in E-expressions by their offset in the local macro table. System macros may be addressed by the system macro identifier using a specific encoding opcode. In Ion text, a macro may be addressed by its offset in the local macro table (mirroring binary), by its name if that name is unambiguous within the local encoding context, or by qualifying the macro name/offset with the module name in the encoding context. An E-expression can only refer to macros installed in the local macro table or a macro from the system module. In text, an E-expression referring to a system macro that is not installed in the local macro table must use a qualified name with the $ion module name.

For illustrative purposes, let's consider the module named foo that has a macro named bar at offset 5 installed at the beginning of the local macro table.

E-expressions name resolution

// allowed if there are no other macros named 'bar' 
(:bar)

// fully qualified by module--always allowed
(:foo:bar)

// by local macro table offset
(:5)

// In text, system macros are always addressable by name.
// In binary, system macros may be invoked using a separate
// opcode.
(:$ion:none)

Macro definition language

User defined macros are defined by their parameters and template which defines how they are invoked and what stream of data they evaluate to. This template is defined using a domain specific Ion macro definition language with S-expressions. A template defines a list of zero or more parameters that it can accept. These parameters each have their own cardinality of expression arguments which can be specified as exactly one, zero or one, zero or more, and one or more. Furthermore the template defines what type of argument can be accepted by each of these parameters:

  • "Tagged" values, whose encodings always begin with an opcode.
  • "Tagless" values, whose encodings do not begin with an opcode and are therefore both more compact and less flexible (For example: flex_int, int32, float16).
  • Specific macro shaped arguments to allow for structural composition of macros and efficient encoding in binary.

The macro definition includes a template body that defines how the macro is expanded. In the language, system macros, macros defined in previously defined modules in the encoding context, and macros defined previously in the current module are available to be invoked with (name ...) syntax where name is the macro to be invoked. Certain names in the expression syntax are reserved for special forms (for example, literal and if_none). When a macro name is shadowed by a special form, or is ambiguous with respect to all macros visible, it can always be qualified with (':module:name' ...) syntax where module is the name of the module and name is the offset or name of the macro. Referring to a previously defined macro name within a module may be qualified with (':name' ...) syntax.

Shared Modules

Ion 1.1 extends the concept of a shared symbol table to be a shared module. An Ion 1.0 shared symbol table is a shared module with no macro definitions. A new schema for the convention of serializing shared modules in Ion is introduced in Ion 1.1. An Ion 1.1 implementation should support containing Ion 1.0 shared symbol tables and Ion 1.1 shared modules in its catalog.

System Symbol Table Changes

The system symbol table in Ion 1.1 replaces the Ion 1.0 symbol table with new symbols. However, the system symbols are not required to be in the symbol table—they are always available to use.

Macros

Like other self-describing formats, Ion 1.0 makes it possible to write a stream with truly arbitrary content--no formal schema required. However, in practice all applications have a de facto schema, with each stream sharing large amounts of predictable structure and recurring values. This means that Ion readers and writers often spend substantial resources processing undifferentiated data.

Consider this example excerpt from a webserver's log file:

{
  method: GET,
  statusCode: 200,
  status: "OK",
  protocol: https,
  clientIp: ip_addr::"192.168.1.100",
  resource: "index.html"
}
{
  method: GET,
  statusCode: 200,
  status: "OK",
  protocol: https,
  clientIp: ip_addr::"192.168.1.100",
  resource: "images/funny.jpg"
}
{
  method: GET,
  statusCode: 200,
  status: "OK",
  protocol: https,
  clientIp: ip_addr::"192.168.1.101",
  resource: "index.html"
}

Macros allow users to define fill-in-the-blank templates for their data. This enables applications to focus on encoding and decoding the parts of the data that are distinctive, eliding the work needed to encode the boilerplate.

Using this macro definition:

(macro getOk (clientIp resource)
  {
    method: GET,
    statusCode: 200,
    status: "OK",
    protocol: https,
    clientIp: (.annotate "ip_addr" (%clientIp)),
    resource: (%resource)
  })

The same webserver log file could be written like this:

(:getOk "192.168.1.100" "index.html")
(:getOk "192.168.1.100" "images/funny.jpg")
(:getOk "192.168.1.101" "index.html")

Macros are an encoding-level concern, and their use in the data stream is invisible to consuming applications. For writers, macros are always optional--a writer can always elect to write their data using value literals instead.

For a guided walkthrough of what macros can do, see Macros by example.

Defining macros

A macro is defined using a macro clause within a module's macro_table clause.

Syntax

(macro name signature template)

Argument   Description
name       A unique name assigned to the macro or--to construct an anonymous macro--null.
signature  An s-expression enumerating the parameters this macro accepts.
template   A template definition language (TDL) expression that can be evaluated to produce zero or more Ion values.

Example macro clause

//      ┌─── name
//      │     ┌─── signature
//     ┌┴┐ ┌──┴──┐
(macro foo (x y z)
  {           // ─┐
    x: (%x),  //  │
    y: (%y),  //  ├─ template
    z: (%z),  //  │
  }           // ─┘
)

Macro names

Syntactically, macro names are identifiers. Each macro name in a macro table must be unique.

In some circumstances, it may not make sense to name a macro. (For example, when the macro is generated automatically.) In such cases, authors may set the macro name to null or null.symbol to indicate that the macro does not have a name. Anonymous macros can only be referenced by their address in the macro table.

Macro Parameters

A parameter is a named stream of Ion values. The stream's contents are determined by the macro's invocation. A macro's parameters are declared in the macro signature.

Each parameter declaration consists of three elements:

  1. A name
  2. An optional encoding
  3. An optional cardinality

Example parameter declaration

//     ┌─── encoding
//     │      ┌─── name
//     │      │┌─── cardinality
// ┌───┴───┐  ││
   flex_uint::x*

Parameter names

A parameter's name is an identifier. The name is required; any non-identifier (including null, quoted symbols, $0, or a non-symbol) found in parameter-name position will cause the reader to raise an error.

All of a macro's parameters must have unique names.

Parameter encodings

In binary Ion, the default encoding for all parameters is tagged. Each argument passed into the macro from the callsite is prefixed by an opcode (or "tag") that indicates the argument's type and length.

Parameters may choose to specify an alternative encoding to make the corresponding arguments' binary representation more compact and/or fixed width. These "tagless" encodings do not begin with an opcode, an arrangement which saves space but also limits the domain of values they can each represent. Arguments passed to tagless parameters cannot be null, cannot be annotated, and may have additional range restrictions.

To specify an encoding, the parameter name is annotated with one of the following tokens:

Tagless encodings            Description
flex_int                     Variable-width, signed int
flex_uint                    Variable-width, unsigned int
int8 int16 int32 int64       Fixed-width, signed int
uint8 uint16 uint32 uint64   Fixed-width, unsigned int
float16 float32 float64      Fixed-width float
flex_symbol                  FlexSym-encoded SID or text

When writing text Ion, the declared encoding does not affect how values are serialized. However, it does constrain the domain of values that the parameter will accept. When transcribing from text to binary, it must be possible to serialize all values passed as an argument using the parameter's declared encoding. This means that arguments passed to parameters with a tagless encoding cannot be annotated and cannot be a null of any type. If an int or a float is being passed to a parameter with a fixed-width encoding, that value must fit within the range of values that can be represented by that width. For example, the value 256 cannot be passed to a parameter with an encoding of uint8 because a uint8 can only represent values in the range [0, 255].
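The domain restrictions for the integer encodings can be sketched in Python as follows. This is a minimal illustration that covers only the integer encodings from the table above; the float and flex_symbol encodings are omitted, the function name accepts is hypothetical, and a null argument is modelled as Python's None.

```python
# Inclusive ranges for the fixed-width integer encodings.
FIXED_INT_RANGES = {}
for bits in (8, 16, 32, 64):
    FIXED_INT_RANGES[f"int{bits}"] = (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    FIXED_INT_RANGES[f"uint{bits}"] = (0, (1 << bits) - 1)


def accepts(encoding: str, value, annotations=()) -> bool:
    """True if `value` may be passed to a parameter with this tagless encoding."""
    if value is None or annotations:
        return False  # tagless arguments cannot be null or annotated
    if not isinstance(value, int) or isinstance(value, bool):
        return False  # this sketch only models integer arguments
    if encoding == "flex_int":
        return True   # variable width: any integer
    if encoding == "flex_uint":
        return value >= 0
    low, high = FIXED_INT_RANGES[encoding]
    return low <= value <= high
```

With this sketch, accepts("uint8", 255) is True and accepts("uint8", 256) is False, matching the uint8 example in the text.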

Parameter cardinalities

A parameter name may optionally be followed by a cardinality modifier. This is a sigil that indicates how many values the parameter expects the corresponding argument expression to produce when it is evaluated.

Modifier   Cardinality
?          zero-or-one value
*          zero-or-more values
!          exactly-one value
+          one-or-more values

If no modifier is specified, the parameter's cardinality will default to exactly-one. An exactly-one parameter will always expand to a stream containing a single value.

Parameters with a cardinality other than exactly-one are called variadic parameters.

If an argument expression expands to a number of values that the cardinality forbids, the reader must raise an error.
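A cardinality check of this kind could be sketched in Python as shown below. The stream is modelled as a Python list, and the names CARDINALITY_BOUNDS and check_cardinality are hypothetical.

```python
# (minimum, maximum) number of values; None means unbounded.
CARDINALITY_BOUNDS = {
    "?": (0, 1),
    "*": (0, None),
    "!": (1, 1),
    "+": (1, None),
}


def check_cardinality(modifier, stream):
    """Return `stream` unchanged, or raise if its size violates the cardinality.

    A missing modifier (None or "") defaults to exactly-one.
    """
    low, high = CARDINALITY_BOUNDS[modifier or "!"]
    count = len(stream)
    if count < low or (high is not None and count > high):
        raise ValueError(
            f"cardinality {modifier or '!'} does not allow {count} value(s)"
        )
    return stream
```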

Optional parameters

Parameters with a cardinality that can accept an empty expression group as an argument (? and *) are called optional parameters. In text Ion, their corresponding arguments can be elided from e-expressions and TDL macro invocations when they appear in tail position. When an argument is elided, it is treated as though an explicit empty group (::) had been passed in its place.

In contrast, parameters with a cardinality that cannot accept an empty group (! and +) are called required parameters. Required parameters can never be elided.

(:set_macros
    (foo (x y? z*) // `x` is required, `y` and `z` are optional
        [(%x), (%y), (%z)]
    )
)

// `z` is a populated expression group
(:foo 1 2 (:: 3 4 5)) => [1, 2, 3, 4, 5]

// `z` is an empty expression group
(:foo 1 2 (::))       => [1, 2]

// `z` has been elided
(:foo 1 2)            => [1, 2]

// `y` and `z` have been elided
(:foo 1)              => [1]

// `x` cannot be elided
(:foo)                => ERROR: missing required argument `x`

Optional parameters that are not in tail position cannot be elided, as this would cause them to appear in a position corresponding to a different argument.

(:set_macros
    (foo (x? y) // `x` is optional, `y` is required
        [(%x), (%y)]
    )
)

(:foo (::) 1) => [(::), 1] => [1]
(:foo 1)                   => ERROR: missing required argument `y`
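The positional binding and tail-elision behavior described in this section can be sketched in Python. Each argument is modelled as a list (the stream it produces), the signature as a list of (name, modifier) pairs, and the function name bind_arguments is hypothetical. The sketch only handles elision; it does not validate the cardinality of arguments that are present.

```python
OPTIONAL_MODIFIERS = {"?", "*"}  # cardinalities that accept an empty group


def bind_arguments(signature, args):
    """Bind argument streams to parameters by position.

    Parameters with no corresponding argument are treated as though an
    explicit empty group had been passed, which is legal only when the
    parameter is optional.
    """
    if len(args) > len(signature):
        raise ValueError("too many arguments")
    bound = {}
    for position, (name, modifier) in enumerate(signature):
        if position < len(args):
            bound[name] = list(args[position])
        elif modifier in OPTIONAL_MODIFIERS:
            bound[name] = []  # elided argument behaves like (::)
        else:
            raise ValueError(f"missing required argument `{name}`")
    return bound
```

Because binding is purely positional, passing a single argument to a signature such as (x? y) binds it to x and leaves y unbound, which is why the second invocation above is an error.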

Macro signatures

A macro's signature is the ordered sequence of parameters which an invocation of that macro must define. Syntactically, the signature is an s-expression of parameter declarations.

Example macro signature

(w flex_uint::x* float16::y? z+)

Name   Encoding    Cardinality
w      tagged      exactly-one
x      flex_uint   zero-or-more
y      float16     zero-or-one
z      tagged      one-or-more
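To make the structure of a signature concrete, here is a Python sketch that splits a signature written in the compact form above into (name, encoding, cardinality) triples. It is a simplification that assumes parameter declarations are separated by whitespace and contain no internal spaces; a real implementation would work on parsed Ion values rather than on text. The names parse_signature and PARAM are hypothetical.

```python
import re

# Optional `encoding::` annotation, an identifier, and an optional sigil.
PARAM = re.compile(
    r"^(?:(?P<encoding>[a-z_0-9]+)::)?"
    r"(?P<name>[A-Za-z_$][A-Za-z_$0-9]*)"
    r"(?P<sigil>[?*!+])?$"
)

CARDINALITY = {
    None: "exactly-one",  # default when no modifier is given
    "!": "exactly-one",
    "?": "zero-or-one",
    "*": "zero-or-more",
    "+": "one-or-more",
}


def parse_signature(text: str):
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("a signature must be an s-expression")
    params, seen = [], set()
    for token in text[1:-1].split():
        match = PARAM.match(token)
        if match is None:
            raise ValueError(f"invalid parameter declaration: {token}")
        name = match["name"]
        if name in seen:
            raise ValueError(f"duplicate parameter name: {name}")
        seen.add(name)
        params.append((name, match["encoding"] or "tagged", CARDINALITY[match["sigil"]]))
    return params
```

Applied to the example signature, this produces the same four rows as the table above.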

Template definition language (TDL)

The macro's template is a single Ion value that defines how a reader should expand invocations of the macro. Ion 1.1 introduces a template definition language (TDL) to express this process in terms of the macro's parameters. TDL is a small language with only a few constructs.

A TDL expression can be any of the following:

  1. A literal Ion scalar
  2. A macro invocation
  3. A variable expansion
  4. A quasi-literal Ion container
  5. A special form

In terms of its encoding, TDL is "just Ion." As you shall see in the following sections, the constructs it introduces are written as s-expressions with a distinguishing leading value or values.

A grammar for TDL can be found at the end of this chapter.

Ion scalars

Ion scalars are interpreted literally. These include values of any type except list, sexp, and struct. null values of any type—even null.list, null.sexp, and null.struct—are also interpreted literally.

Examples

These macros are constants; they take no parameters. When they are invoked, they expand to a stream of a single value: the Ion scalar acting as the template expression.

$ion::
(module _
  (macro_table
    (macro greeting () "hello")
    (macro birthday () 1996-10-11)
    // Annotations are also literal
    (macro price () USD::29.95)
  )
)

(:greeting) => "hello"
(:birthday) => 1996-10-11
(:price)    => USD::29.95

Macro invocations

Macro invocations call an existing macro. The invoked macro could be a system macro, a macro imported from a shared module, or a macro previously defined in the current scope.

Syntactically, a macro invocation is an s-expression whose first value is the operator . and whose second value is a macro reference.

Grammar

macro-invocation   ::= '(.' macro-ref macro-arg* ')'

macro-ref          ::= (module-name '::')? (macro-name | macro-address)

macro-arg          ::= expression | expression-group

macro-name         ::= ion-identifier

macro-address      ::= unsigned-ion-integer

expression-group   ::= '(..' expression* ')'

Invocation syntax illustration

// Invoking a macro defined in the same module by name.
(.macro_name              arg1 arg2 /*...*/ argN)

// Invoking a macro defined in another module by name.
(.module_name::macro_name arg1 arg2 /*...*/ argN)

// Invoking a macro defined in the same module by its address.
(.0              arg1 arg2 /*...*/ argN)

// Invoking a macro defined in a different module by its address.
(.module_name::0 arg1 arg2 /*...*/ argN)

Examples

$ion::
(module _
  (macro_table
    // Calls the system macro `values`, allowing it to produce a stream of three values.
    (macro nephews () (.values Huey Dewey Louie))

    // Calls a macro previously defined in this module, splicing its result
    // stream into a list.
    (macro list_of_nephews () [(.nephews)])
  )
)

(:nephews)         => Huey Dewey Louie
(:list_of_nephews) => [Huey, Dewey, Louie]

IMPORTANT: There are no forward references in TDL. If a macro definition includes an invocation of a name or address that is not already valid, the reader must raise an error.

$ion::
(module _
  (macro_table
    (macro list_of_nephews () [(.nephews)])
    //                          ^^^^^^^^
    // ERROR: Calls a macro that has not yet been defined in this module.
    (macro nephews () (.values Huey Dewey Louie))
  )
)
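One way to enforce the no-forward-references rule is to validate definitions in order, tracking which names are already defined. The Python sketch below models a template as nested Python data in which a tuple beginning with "." stands for a macro invocation. That representation, the partial SYSTEM_MACROS set, and the function names are all assumptions made for illustration.

```python
# Assumption: only the system macros named in this document are listed here.
SYSTEM_MACROS = {"values", "none"}


def invoked_names(template):
    """Yield the name of every macro invoked anywhere inside `template`."""
    if isinstance(template, tuple) and template and template[0] == ".":
        yield template[1]
        for arg in template[2:]:
            yield from invoked_names(arg)
    elif isinstance(template, (list, tuple)):
        for child in template:
            yield from invoked_names(child)
    elif isinstance(template, dict):
        for child in template.values():
            yield from invoked_names(child)


def validate_macro_table(definitions):
    """`definitions` is an ordered list of (name, template) pairs."""
    defined = set(SYSTEM_MACROS)
    for name, template in definitions:
        for reference in invoked_names(template):
            if reference not in defined:
                raise ValueError(
                    f"macro `{name}` invokes `{reference}` before it is defined"
                )
        defined.add(name)
```

Because a name is added to the defined set only after its own template has been checked, this sketch also rejects a macro that invokes itself.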

Variable expansion

Templates can insert the contents of a macro parameter into their output by using a variable expansion, an s-expression whose first value is the operator % and whose second and final value is the variable name of the parameter to expand.

If the variable name does not match one of the declared macro parameters, the implementation must raise an error.

Grammar

variable-expansion ::= '(%' variable-name ')'

variable-name      ::= ion-identifier

Examples

$ion::
(module _
  (macro_table
    // Produces a stream that repeats the content of parameter `x` twice.
    (macro twice (x*) (.values (%x) (%x)))
  )
)

(:twice foo)     => foo foo
(:twice "hello") => "hello" "hello"
(:twice 1 2 3)   => 1 2 3 1 2 3

Quasi-literal Ion containers

When an Ion container appears in a template definition, it is interpreted almost literally.

Each nested value in the container is inspected.

  • If the value is an Ion scalar, it is added to the output as-is.
  • If the value is a variable expansion, the stream bound to that variable name is added to the output. The variable expansion literal (for example: (%name)) is discarded.
  • If the value is a macro invocation, the invocation is evaluated and the resulting stream is added to the output. The macro invocation literal (for example: (.name 1 2 3)) is discarded.
  • If the value is a container, the reader will recurse into the container and repeat this process.

Expansion within a sequence

When the container is a list or s-expression, the values in the nested expression's expansion are spliced into the sequence at the site of the expression. If the expansion was empty, no values are spliced into the container.

$ion::
(module _
  (macro_table
    (macro bookend_list (x y*) [(%x), (%y), (%x)])
    (macro bookend_sexp (x y*) ((%x) (%y) (%x)))
  )
)

(:bookend_list ! a b c) => ['!', a, b, c, '!']
(:bookend_sexp ! a b c) => (! a b c !)

(:bookend_sexp !) => (! !)

Expansion within a struct

When the container is a struct, the expansion of each field value is paired with the corresponding field name. If the expansion produces a single value, a single field with that name will be spliced into the parent struct. If the expansion produces multiple values, a field with that name will be created for each value and spliced into the parent struct. If the expansion was empty, no fields are spliced into the parent struct.

Examples

$ion::
(module _
  (macro_table
    (macro resident (id names*)
        {
            town: "Riverside",
            id: (.make_string "123-" (%id)),
            name: (%names)
        }
     )
  )
)

(:resident "abc" "Alice") =>
{
  town: "Riverside",
  id: "123-abc",
  name: "Alice"
}

(:resident "def" "John" "Jacob" "Jingleheimer" "Schmidt") =>
{
  town: "Riverside",
  id: "123-def",
  name: "John",
  name: "Jacob",
  name: "Jingleheimer",
  name: "Schmidt",
}

(:resident "ghi") =>
{
  town: "Riverside",
  id: "123-ghi",
}
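
The field-pairing rule can be sketched in Python as follows. This is an illustrative model under stated assumptions: splice_fields and resident are hypothetical names, a struct is represented as a list of (name, value) pairs because Ion structs may repeat field names, and string concatenation stands in for make_string.

```python
def splice_fields(template_fields):
    """template_fields: (field_name, expanded_stream) pairs in template order."""
    result = []
    for name, stream in template_fields:
        # Zero values: the field is omitted. Several values: the field repeats.
        for value in stream:
            result.append((name, value))
    return result

def resident(id_, names):
    return splice_fields([
        ("town", ["Riverside"]),
        ("id", ["123-" + id_]),   # stands in for (.make_string "123-" (%id))
        ("name", list(names)),
    ])

print(resident("abc", ["Alice"]))
print(resident("def", ["John", "Jacob"]))
print(resident("ghi", []))
```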

Special forms

special-form       ::= '(.' ('$ion::')?  special-form-name expression* ')'

special-form-name  ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi'

Special forms are similar to macro invocations, but they have their own expansion rules. See Special forms for the list of special forms and a description of each.

Note that unlike macro expansions, special forms cannot accept argument groups.

Macros by example

Before getting into the technical details of Ion’s macro and module system, it will help to be more familiar with the use of macros. We’ll step through increasingly sophisticated use cases, some admittedly synthetic for illustrative purposes, with the intent of teaching the core concepts and moving parts without getting into the weeds of more formal specification.

Ion macros are defined using a domain-specific language that is in turn expressed via the Ion data model. That is, macro definitions are Ion data, and use Ion features like S-expressions and symbols to represent code in a Lisp-like fashion. In this document, the fundamental construct we explore is the macro definition, denoted using an S-expression of the form (macro name …) where macro is a keyword and name must be a symbol denoting the macro's name.

NOTE: S-expressions of that shape only declare macros when they occur in the context of an encoding module. We will completely ignore modules for now, and the examples below omit this context to keep things simple.

Constants

The most basic macro is a constant:

(macro pi            // name
  ()                 // signature
  3.141592653589793) // template

This declaration defines a macro named pi. The () is the macro’s signature, in this case a trivial one that declares no parameters. The 3.141592653589793 is a similarly trivial template, an expression in Ion 1.1's domain-specific language for defining macro functions. This macro accepts no arguments and always returns a constant value.

To use pi in an Ion document, we write an encoding expression or E-expression:

$ion_1_1
(:pi)

The syntax (:pi) looks a lot like an S-expression. It’s not, though, since colons cannot appear unquoted in that context. Ion 1.1 makes use of syntax that is not valid in Ion 1.0—specifically, the (: digraph—to denote E-expressions. Those characters must be followed by a reference to a macro, and we say that the E-expression is an invocation of the macro. Here, (:pi) is an invocation of the macro named pi.

note

We also call these “smile expressions” when we’re feeling particularly casual. (:

That document is equivalent to the following, in the sense that they denote the same data:

$ion_1_1
3.141592653589793

The process by which the Ion implementation turns the former document into the latter is called macro expansion or just expansion. This happens transparently to Ion-consuming applications: the stream of values is the same in both cases. The documents have the same content, encoded in two different ways. It’s reasonable to think of (:pi) as a custom encoding for 3.141592653589793, and the notation’s similarity to S-expressions leads us to the term “encoding expression” (or "e-expression").

note

Any Ion 1.1 document with macros can be fully expanded into an equivalent Ion 1.0 document.

We can streamline future examples with a couple of conventions. First, assume that any E-expression is occurring within an Ion 1.1 document; second, we use the relation notation, ⇒, to mean “expands to”. So we can say:

(:pi) ⇒ 3.141592653589793

Parameters and variable expansion

Most macros are not constant; they accept inputs that determine their results.

(macro passthrough
  (x)   // signature
  (%x)  // template
)

This macro has a signature that declares a parameter called x, and it therefore requires one argument to be passed in when it is invoked. This creates a variable (i.e. named data) called x that can be referred to within the context of the template.

note

We are careful to distinguish between the views from “inside” and “outside” the macro: parameters are the names used by a macro’s implementation to refer to its expansion-time inputs, while arguments are the data provided to a macro at the point of invocation. In other words, we have “formal” parameters and “actual” arguments.

The body of this macro is our first non-trivial template, an expression in Ion’s new domain-specific language for defining macro functions. This template definition language (TDL) treats Ion scalar values as literals, giving the decimal in pi’s template its intended meaning.

In this example, the template expression (%x) is a variable expansion in the form (%variable_name). During macro evaluation, variable expansions are replaced by the contents of the referenced variable. Because this macro's template is an expansion of its only parameter, x, invoking the macro will produce the same value it was given as an argument.

(:passthrough 1)         => 1
(:passthrough "foo")     => "foo"
(:passthrough [a, b, c]) => [a, b, c]

Simple Templates

Here's a more realistic macro:

(macro price
  (a c)                             // signature
  { amount: (%a), currency: (%c) }) // template

This macro has a signature that declares two parameters named a and c. It therefore accepts two arguments when invoked.

(:price 99 USD) ⇒ { amount: 99, currency: USD }

Template expressions that are structs are interpreted almost literally. The field names are literal, which is why the amount and currency field names show up as-is in the expansion, but the field “values” are arbitrary expressions. We call these almost-literal forms quasi-literals.

The template definition language also treats lists quasi-literally, and every element inside the list is an expression. Here’s a silly macro to illustrate:

(macro two_item_list (a b) [(%a), (%b)])
(:two_item_list foo bar) ⇒ [foo, bar]

E-expressions can accept other e-expressions as arguments. For example:

(:two_item_list (:price 99 USD) foo)
//              └──────┬──────┘
//                     └─── passing another e-expression as an argument

Expansion happens from the "inside out". The outer e-expression receives the results from the expansion of the inner e-expression.

(:two_item_list (:price 99 USD) foo)

  // First, the inner invocation of `price` is expanded...
  => (:two_item_list {amount: 99, currency: USD} foo)

  // ...and then the outer invocation of `two_item_list` is expanded.
  => [{amount: 99, currency: USD}, foo]

Invoking Macros from Templates

Templates are able to invoke other macros. In TDL, an s-expression starting with a . and an identifier is an operator invocation, where operators are either macros or special forms, which we'll explore later.

(macro website_url
  (path)
  (.make_string "https://www.amazon.com/" (%path)))

This macro's template is an s-expression beginning with .make_string, so it is an invocation of a macro called make_string. make_string is a system macro (a built-in function) which concatenates its arguments to produce a single string.

(:website_url "gp/cart") ⇒ "https://www.amazon.com/gp/cart"

In TDL, it is legal for a macro invocation to appear anywhere that a value could appear. In the following example, an invocation of make_string is being passed as an argument to an invocation of website_url.

(macro detail_page_url
  (asin)
  (.website_url (.make_string "dp/" (%asin))))
(:detail_page_url "B08KTZ8249") ⇒ "https://www.amazon.com/dp/B08KTZ8249"

note

This may not look like much of an improvement, but the full string

"https://www.amazon.com/dp/B08KTZ8249"

takes 38 bytes to encode while the macro invocation

(:detail_page_url "B08KTZ8249")

takes as few as 12 bytes in binary Ion. While text Ion spells out the macro name to be human-friendly, the binary Ion encoding uses the macro's integer address instead. Here's an illustration:

(:1 "B08KTZ8249")

This makes the e-expression both more compact and faster to decode. Readers can also avoid the cost of repeatedly validating the UTF-8 bytes of substrings that are 'baked into' the macro definition.
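
The byte counts quoted in this note can be reproduced with a little arithmetic. The sketch below is illustrative only: the encoded_string_size helper, the one-byte macro address, and the rule that payloads of up to 15 bytes carry their length in the opcode are assumptions chosen to be consistent with the figures above, not a statement of the binary format.

```python
def encoded_string_size(text, inline_max=15):
    """Assumed layout: 1 opcode byte, plus 1 length byte for payloads over inline_max."""
    payload = len(text.encode("utf-8"))
    if payload >= 128:
        raise ValueError("this sketch only models lengths that fit in one length byte")
    length_bytes = 0 if payload <= inline_max else 1
    return 1 + length_bytes + payload

full_url = "https://www.amazon.com/dp/B08KTZ8249"
literal_size = encoded_string_size(full_url)              # 1 + 1 + 36
invocation_size = 1 + encoded_string_size("B08KTZ8249")   # 1 address byte + (1 + 10)
print(literal_size, invocation_size)  # 38 12
```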

E-expressions Versus S-expressions

We've now seen two ways to invoke macros, and their difference deserves thorough exploration.

An E-expression is an encoding artifact of a serialized Ion document. It has no intrinsic meaning other than the fact that it represents a macro invocation. The meaning of the document can only be determined by expanding the macro, passing the E-expression's arguments to the function defined by the macro. This all happens as the Ion document is parsed, transparent to the reader of the document. In casual terms, E-expressions are expanded away before the application sees the data.

Within the template definition language, you can define new macros in terms of other macros, and those invocations are written as S-expressions. Unlike E-expressions, TDL macro invocations are normal Ion data structures, consumed by the Ion system and interpreted as TDL. Further, TDL macro invocations only have meaning in the context of a macro definition, inside an encoding module, while E-expressions can occur anywhere in an Ion document.

warning

It's entirely possible to write a macro that can generate all or part of a macro definition. We don't recommend that you spend time considering such things at this point.

These two invocation forms are syntactically aligned in their calling convention, but are distinct in context and "immediacy". E-expressions occur anywhere and are invoked immediately, as they are parsed. S-expression invocations occur only within macro definitions, and are only invoked if and when that code path is ever executed by invocation of the surrounding macro.

Rest Parameters

Sometimes we want a macro to accept an arbitrary number of arguments, in particular all the rest of them. The make_string macro is one of those, concatenating all of its arguments into a single string:

(:make_string)                 ⇒ ""
(:make_string "a")             ⇒ "a"
(:make_string "a" "b")         ⇒ "ab"
(:make_string "a" "b" "c")     ⇒ "abc"
(:make_string "a" "b" "c" "d") ⇒ "abcd"

To make this work, the declaration of make_string is effectively:

(macro make_string (parts*) /*...*/)

The * is a cardinality modifier. A parameter's cardinality dictates both the number of argument expressions it can accept and the number of values its expansion can produce.

In the examples so far, all parameters have had a cardinality of exactly-one, which is the default. The parts parameter has a cardinality of zero-or-more, meaning:

  1. It can accept zero-or-more argument expressions.
  2. When expanded, it will produce zero-or-more values.

When the final parameter in the macro signature is zero-or-more, "all of the rest" of the argument expressions will be passed to that parameter.

(:make_string)
//           └── 0 argument expressions passed to `parts`
(:make_string "a")
//            └┬┘
//             └── 1 argument expression passed to `parts`
(:make_string "a" "b" "c" "d")
//            └──────┬──────┘
//                   └── 4 argument expressions passed to `parts`

At this point our distinction between parameters and arguments becomes more apparent, since they are no longer one-to-one: this macro with one parameter can be invoked with one argument, or twenty, or none.

tip

To declare a final parameter that requires at least one rest-argument, use the + modifier.

Arguments and results are streams

The inputs to and results from a macro are modeled as streams of values. When a macro is invoked, each argument expression produces a stream of values, and within the macro definition, each parameter name refers to the corresponding stream, not to a specific value. The declared cardinality of a parameter constrains the number of elements produced by its stream, and is verified by the macro expansion system.

More generally, the results of all template expressions are streams. While most expressions produce a single value, various macros and special forms can produce zero or more values.

We have everything we need to illustrate this, via another system macro, values:

(macro values (vals*) (%vals))
(:values 1)           ⇒ 1
(:values 1 true null) ⇒ 1 true null
(:values)             ⇒ _nothing_

The values macro accepts any number of arguments and returns their values; it is effectively a multi-value identity function. We can use this to explore how streams combine in E-expressions.

Splicing in encoded data

At the top level, an e-expression's resulting values become top-level values.

(:values 1 2 3) => 1 2 3

When an E-expression appears within a list or S-expression, the resulting values are spliced into the surrounding container:

[first, (:values), last]          ⇒ [first, last]
[first, (:values "middle"), last] ⇒ [first, "middle", last]
(first (:values left right) last) ⇒ (first left right last)

This also applies wherever a tagged type can appear inside an E-expression:

(first (:values (:values left right) (:values)) last) ⇒ (first left right last)

Note that each argument-expression always maps to one parameter, even when that expression returns too-few or too-many values.

(macro reverse (a b)
  [(%b), (%a)])
(:reverse (:values 5 USD))   ⇒ // Error: 'reverse' expects 2 arguments, given 1
(:reverse 5 (:values) USD)   ⇒ // Error: 'reverse' expects 2 arguments, given 3
(:reverse (:values 5 6) USD) ⇒ // Error: argument 'a' expects 1 value, given 2

In this example, the parameters expect exactly one argument, producing exactly one value. When the cardinality allows multiple values, then the argument result-streams are concatenated. We saw this (rather subtly) above in the nested use of values, but can also illustrate using the rest-parameter to make_string, which we'll expand here in steps:

(:make_string (:values) a (:values b (:values c) d) e)
//              ^^^^^^ next
  ⇒ (:make_string a (:values b (:values c) d) e)
//                               ^^^^^^ next
  ⇒ (:make_string a (:values b c d) e)
//                    ^^^^^^ next
  ⇒ (:make_string a b c d e)
  ⇒ "abcde"

Splicing within sequences is straightforward, but structs are trickier due to their key/value nature. When used in field-value position, each result from a macro is bound to the field-name independently, leading to the field being repeated or even absent:

{ name: (:values) }          ⇒ { }
{ name: (:values v) }        ⇒ { name: v }
{ name: (:values v ann::w) } ⇒ { name: v, name: ann::w }

An E-expression can even be used in place of a key-value pair, in which case it must return structs, which are merged into the surrounding container:

{ a:1, (:values), z:3 }             ⇒ { a:1, z:3 }
{ a:1, (:values {}), z:3 }          ⇒ { a:1, z:3 }
{ a:1, (:values {b:2}), z:3 }       ⇒ { a:1, b:2, z:3 }
{ a:1, (:values {b:2} {z:3}), z:3 } ⇒ { a:1, b:2, z:3, z:3 }

{ a:1, (:values key "value") } ⇒ // Error: struct expected for splicing into struct
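
The struct-merging behavior can be modeled with a short Python sketch. It is an illustration under assumptions: merge_struct_body is a hypothetical name, a struct is a list of (name, value) pairs, a tuple in the body is an ordinary field, and a list in the body is the stream produced by an E-expression in key-value position.

```python
def merge_struct_body(items):
    """Merge ordinary fields and E-expression result streams into one field list."""
    fields = []
    for item in items:
        if isinstance(item, tuple):        # ordinary name/value pair
            fields.append(item)
            continue
        for produced in item:              # item is the stream from an E-expression
            if not isinstance(produced, list):
                raise TypeError("struct expected for splicing into struct")
            fields.extend(produced)        # merge the produced struct's fields
    return fields

# { a:1, (:values {b:2} {z:3}), z:3 }
print(merge_struct_body([("a", 1), [[("b", 2)], [("z", 3)]], ("z", 3)]))
```

A stream containing a non-struct value, such as the symbols in the final example above, raises an error in this model.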

Splicing in template expressions

The preceding examples demonstrate splicing of E-expressions into encoded data, but similar stream-splicing occurs within the template language, making it trivial to convert a stream to a list:

(macro list_of (vals*) [ (%vals) ])
(macro clumsy_bag (elts*) { '': (%elts) })
(:list_of)   ⇒ []
(:clumsy_bag) ⇒ {}

(:list_of 1 2 3)    ⇒ [1, 2, 3]
(:clumsy_bag true 2) ⇒ {'':true, '':2}

Mapping templates over streams: for

Another way to produce a stream is via a mapping form. The for special form evaluates a template once for each value provided by a stream or streams. Each time, a local variable is created and bound to the next value on the stream.

(macro prices (currency amounts*)
  (.for
    // Binding pairs
    [(amt (%amounts))]
    //└┬┘ └────┬───┘
    // │       └─── stream to map over
    // └─────────── variable name

    // Template
    (.price (%amt) (%currency))
  )
)

The first subform of for is a list of binding pairs, S-expressions containing a variable name and a series of TDL expressions. Here, that TDL expression series is a single parameter expansion, so each individual value from the amounts stream is bound to the name amt before the price invocation is expanded.

(:prices GBP 10 9.99 12.)
  ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP} {amount:12., currency:GBP}

More than one stream can be iterated in parallel, and iteration terminates when any stream becomes empty.

(macro zip (front* back*)
  (.for [(f (%front)),
        (b (%back))]
    [(%f), (%b)]))

(:zip (:values 1 2 3) (:values a b))
  ⇒ [1, a] [2, b]
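
The mapping behavior of for, including termination at the shortest stream, resembles Python's built-in zip. The following sketch is only an analogy: tdl_for, price, prices, and zip_macro are hypothetical Python names, and streams are modeled as lists.

```python
def tdl_for(template, *streams):
    """Expand template once per step, drawing one value from each stream in parallel."""
    out = []
    for values in zip(*streams):        # zip stops as soon as any stream is exhausted
        out.extend(template(*values))   # each expansion is itself a stream
    return out

def price(amount, currency):
    return [{"amount": amount, "currency": currency}]

def prices(currency, amounts):
    return tdl_for(lambda amt: price(amt, currency), amounts)

def zip_macro(front, back):
    return tdl_for(lambda f, b: [[f, b]], front, back)

print(prices("GBP", [10, 9.99]))
print(zip_macro([1, 2, 3], ["a", "b"]))  # [[1, 'a'], [2, 'b']]
```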

Empty streams: none

The empty stream is an important edge case that requires careful handling and communication. The built-in macro none accepts no values and produces an empty stream:

(macro list_of (items*) [(%items)])

(:list_of (:none)) ⇒ []
(:list_of 1 (:none) 2) ⇒ [1, 2]
[(:none)]   ⇒ []
{a:(:none)} ⇒ {}

When used as a macro argument, a none invocation (like any other expression) counts as one argument:

(:pi (:none)) ⇒ // Error: 'pi' expects 0 arguments, given 1

The special form (::) is an empty argument expression group, similar to (:none) but used specifically to express the absence of an argument:

(:list_of (::)) ⇒ []
(:list_of 1 (::) 2) ⇒ [1, 2]

TIP: While none and values both produce the empty stream, the former is preferred for clarity of intent and terminology.

Cardinality

As described earlier, parameters are all streams of values, but the number of values can be controlled by the parameter's cardinality. So far we have seen the default exactly-one and the * (zero-or-more) cardinality modifiers, and in total there are four:

Modifier  Cardinality
!         exactly-one value
?         zero-or-one value
+         one-or-more values
*         zero-or-more values

Exactly-One

Many parameters expect exactly one value and thus have exactly-one cardinality. This is the default cardinality, but the ! modifier can be used for clarity.

This cardinality means that the parameter requires a stream producing a single value, so one might refer to them as singleton streams or just singletons colloquially.

Zero-or-One

A parameter with the modifier ? has zero-or-one cardinality, which is much like exactly-one cardinality, except the parameter accepts an empty-stream argument as a way to denote an absent parameter.

(macro temperature (degrees scale?)
  {
    degrees: (%degrees),
    scale: (%scale)
  })

Since the scale parameter accepts the empty stream, we can pass it an empty argument group:

(:temperature 96 F)    ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283}

Note that the result’s scale field has disappeared because no value was provided. It would be more useful to fill in a default value, which we can achieve with the default system macro:

(macro temperature (degrees scale?)
  {
    degrees: (%degrees),
    scale: (.default (%scale) K)
  })
(:temperature 96 F)    ⇒ {degrees:96,  scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}

To refine things a bit further, trailing arguments that accept the empty stream can be omitted entirely:

(:temperature 283) ⇒ {degrees:283, scale:K}

tip

The default macro is implemented with the help of a special form that can detect the empty stream: if_none.

Zero-or-More

A parameter with the modifier * has zero-or-more cardinality.

(macro prices (amount* currency)
  (.for [(amt (%amount))]
    (.price (%amt) (%currency))))

When * is on a non-final parameter, we cannot take “all the rest” of the arguments and must use a different calling convention to draw the boundaries of the stream. Instead, we need a single expression that produces the desired values:

(:prices (::) JPY)          ⇒ // empty stream
(:prices 54 CAD)           ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP)  ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}

Here we use a non-empty argument group (:: /*...*/) to delimit the multiple elements of the amount stream.

One-or-More

A parameter with the modifier + has one-or-more cardinality, which works like * except:

  1. + parameters cannot accept the empty stream
  2. When expanded, + parameters must produce at least one value.

To continue using our prices example:

(macro prices (amount+ currency)
  (.for [(amt (%amount))]
    (.price (%amt) (%currency))))
(:prices (::) JPY)          ⇒ // Error: `+` parameter received the empty stream
(:prices 54 CAD)           ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP)  ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}

On the final parameter, + collects the remaining (one or more) arguments:

(macro thanks (names+)
  (.make_string "Thank you to my Patreon supporters:\n"
    (.for [(name (%names))]
      (.make_string "  * " (%name) "\n"))))
(:thanks) ⇒ // Error: at least one value expected for + parameter

(:thanks Larry Curly Moe) =>
'''\
Thank you to my Patreon supporters:
  * Larry
  * Curly
  * Moe
'''

Argument Groups

The non-rest versions of multi-value parameters require some kind of delimiting syntax to contain the applicable sub-expressions. For the tagged-type parameters we've seen so far, you could use :values or some other macro to produce the stream, but that doesn't work for tagless types. The preferred syntax, supporting all argument types, is a special delimiting form called an argument group. Here is a macro to illustrate:

(macro prices
  (amount* currency)
  (.for [(amt (%amount))]
    (.price (%amt) (%currency))))

The parameter amount accepts any number of argument expressions. It's easy to provide exactly one:

(:prices 12.99 GBP) ⇒ {amount:12.99, currency:GBP}

To provide a non-singleton stream of values, use an argument group. Inside an E-expression, a group starts with (:: and ends with the matching ):

(:prices (::) GBP)       ⇒ _nothing_
(:prices (:: 1) GBP)     ⇒ {amount:1, currency:GBP}
(:prices (:: 1 2 3) GBP) ⇒ {amount:1, currency:GBP}
                           {amount:2, currency:GBP}
                           {amount:3, currency:GBP}

Within the group, the invocation can have any number of expressions that align with the parameter's encoding. The macro parameter produces the results of those expressions, concatenated into a single stream, and the expander verifies that each value on that stream is accepted by the parameter’s declared encoding.

(:prices (:: 1 (:values 2 3) 4) GBP) ⇒ {amount:1, currency:GBP}
                                       {amount:2, currency:GBP}
                                       {amount:3, currency:GBP}
                                       {amount:4, currency:GBP}

Argument groups may only appear inside macro invocations where the corresponding parameter has ?, *, or + cardinality. There is no binary opcode for these constructs; the encoding uses a tagless format to keep things as dense as possible. As usual, the text format mirrors this constraint.

warning

The allowed combinations of cardinality and argument groups are pending finalization of the binary encoding.

Optional Arguments

When a trailing parameter accepts the empty stream, an invocation can omit its corresponding argument expression, as long as no following parameter is being given an expression. We’ve seen this as applied to final * parameters, but it also applies to ? parameters:

(macro optionals (a* b? c! d* e? f*)
  [(%a), (%b), (%c), (%d), (%e), (%f)])

Since d, e, and f all accept the empty stream, they can be omitted by invokers. But c is required so a and b must always be present, at least as an empty group:

(:optionals (::) (::) "value for c") ⇒ ["value for c"]

Now c receives the string "value for c" while the other parameters are all empty. If we want to provide e, then we must also provide a group for d:

(:optionals (::) (::) "value for c" (::) "value for e")
  ⇒ ["value for c", "value for e"]
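
The argument-to-parameter rules described so far (one argument expression per parameter, rest parameters, omitted trailing arguments, and cardinality checks) can be summarized in a Python sketch. This is a simplified model rather than a reference algorithm: bind_arguments and check_cardinality are hypothetical names, each argument expression is assumed to be already expanded into a list of values, and the error messages are illustrative.

```python
ACCEPTS_EMPTY = {"?", "*"}   # cardinalities that allow the empty stream
REST_CAPABLE = {"*", "+"}    # cardinalities that can act as a rest parameter

def check_cardinality(name, card, stream):
    n = len(stream)
    ok = {"!": n == 1, "?": n <= 1, "+": n >= 1, "*": True}[card]
    if not ok:
        raise ValueError(f"parameter '{name}' ({card}) received {n} value(s)")

def bind_arguments(signature, args):
    """signature: [(name, cardinality)]; args: one list of values per argument expression."""
    args = [list(a) for a in args]
    last = len(signature) - 1
    # A final '*' or '+' parameter absorbs all remaining argument expressions.
    if signature and signature[last][1] in REST_CAPABLE and len(args) > len(signature):
        rest = [value for stream in args[last:] for value in stream]
        args = args[:last] + [rest]
    if len(args) > len(signature):
        raise ValueError(f"expects {len(signature)} arguments, given {len(args)}")
    # Trailing arguments may be omitted only if their parameters accept the empty stream.
    for name, card in signature[len(args):]:
        if card not in ACCEPTS_EMPTY:
            raise ValueError(f"expects {len(signature)} arguments, given {len(args)}")
    bound = {}
    for i, (name, card) in enumerate(signature):
        stream = args[i] if i < len(args) else []
        check_cardinality(name, card, stream)
        bound[name] = stream
    return bound

optionals = [("a", "*"), ("b", "?"), ("c", "!"), ("d", "*"), ("e", "?"), ("f", "*")]
print(bind_arguments(optionals, [[], [], ["value for c"]]))
```

In this model an argument group and a single-value argument look the same (both are lists), so the sketch does not capture the syntactic restrictions on where groups may appear.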

Tagless and fixed-width types

In Ion 1.0, the binary encoding of every value starts off with a “type tag”, an opcode that indicates the data-type of the next value and thus the interpretation of the following octets of data. In general, these tags also indicate whether the value has annotations, and whether it’s null.

These tags are necessary because the Ion data model allows values of any type to be used anywhere. Ion documents are not schema-constrained: nothing forces any part of the data to have a specific type or shape. We call Ion “self-describing” precisely because each value self-describes its type via a type tag.

If schema constraints are enforced through some mechanism outside the serializer/deserializer, the type tags are unnecessary and may add up to a non-trivial amount of wasted space. Furthermore, the overhead for each value also includes length information: encoding an octet of data takes two octets on the stream.

Ion 1.1 tries to mitigate this overhead in the binary format by allowing macro parameters to use more-constrained tagless types. These are subtypes of the concrete types, constrained such that type tags are not necessary in the binary form. In general this can shave 4-6 bits off each value, which can add up in aggregate. In the extreme, that octet of data can be encoded with no overhead at all.

The following tagless types are available:

Tagless type                  Description
flex_symbol                   Tagless symbol (SID or text)
flex_string                   Tagless string
flex_int                      Tagless, variable-width signed int
flex_uint                     Tagless, variable-width unsigned int
int8 int16 int32 int64        Fixed-width signed int
uint8 uint16 uint32 uint64    Fixed-width unsigned int
float16 float32 float64       Fixed-width float

To define a tagless parameter, just declare one of the primitive types:

(macro point (flex_int::x flex_int::y)
  {x: (%x), y: (%y)})
(:point 3 17) ⇒ {x:3, y:17}

The tagless encoding has no real benefit here in text, as primitive types aim to improve the binary encoding.

This density comes at the cost of flexibility. Primitive types cannot be annotated or null, and arguments cannot be expressed using macros, like we’ve done before:

(:point null.int 17)   ⇒ // Error: primitive flex_int does not accept nulls
(:point a::3 17)       ⇒ // Error: primitive flex_int does not accept annotations
(:point (:values 1) 2) ⇒ // Error: cannot use macro for a primitive argument

While Ion text syntax doesn’t use tags—the types are built into the syntax—these errors ensure that a text E-expression may only express things that can also be expressed using an equivalent binary E-expression.

For the same reasons, supplying a (non-rest) tagless parameter with no value, or with more than one value, can only be expressed by using an argument group.

A subset of the primitive types are fixed-width: they are binary-encoded with no per-value overhead.

(macro byte_array
  (uint8::bytes*)
  [(%bytes)])

Invocations of this macro are encoded as a sequence of untagged octets, because the macro definition constrains the argument shape such that nothing else is acceptable. A text invocation is written using normal ints:

(:byte_array 0 1 2 3 4 5 6 7 8) ⇒ [0, 1, 2, 3, 4, 5, 6, 7, 8]
(:byte_array 9 -10 11)          ⇒ // Error: -10 is not a valid uint8
(:byte_array 256)               ⇒ // Error: 256 is not a valid uint8

As above, Ion text doesn’t have syntax specifically denoting “8-bit unsigned integers”, so to keep text and binary capabilities aligned, the parser rejects invocations where an argument value exceeds the range of the binary-only type.
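
The range check that a text parser applies to fixed-width integer arguments might look like the following Python sketch. The function names are hypothetical, and the ranges assume the usual two's-complement and unsigned interpretations of each width.

```python
def fixed_int_range(type_name):
    """Return (lowest, highest) for names such as 'int16' or 'uint8'."""
    signed = not type_name.startswith("u")
    bits = int(type_name[3:] if signed else type_name[4:])
    if bits not in (8, 16, 32, 64):
        raise ValueError(f"unknown fixed-width type: {type_name}")
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

def check_fixed_int(type_name, value):
    lowest, highest = fixed_int_range(type_name)
    if not lowest <= value <= highest:
        raise ValueError(f"{value} is not a valid {type_name}")
    return value

def byte_array(*values):
    # Models (macro byte_array (uint8::bytes*) [(%bytes)])
    return [check_fixed_int("uint8", v) for v in values]

print(byte_array(0, 1, 2, 255))  # [0, 1, 2, 255]
```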

Primitive types have inherent tradeoffs and require careful consideration, but in the right circumstances the density wins can be significant.

Macro Shapes

We can now introduce the final kind of input constraint, macro-shaped parameters. To understand the motivation, consider modeling a scatter-plot as a list of points:

[{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}, …]

Lists like these exhibit a lot of repetition. Since we already have a point macro, we can eliminate a fair amount:

[(:point 3 17), (:point 395 23), (:point 15 48), (:point 2023 5), …]

This eliminates all the xs and ys, but leaves repeated macro invocations.

What we’d like is to eliminate the point calls and just write a stream of pairs, something like:

(:scatterplot (3 17) (395 23) (15 48) (2023 5) …)

We can achieve exactly that with a macro-shaped parameter, in which we use the point macro as an encoding:

(macro scatterplot (point::points*)
//                  ^^^^^
  [(%points)])

point is not one of the built-in encodings, so this is a reference to the macro of that name defined earlier.

(:scatterplot (3 17) (395 23) (15 48) (2023 5))
  ⇒
  [{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}]

Each argument S-expression like (3 17) is implicitly an E-expression invoking the point macro. The argument mirrors the shape of the inner macro, without repeating its name. Further, expansion of the implied points happens automatically, so the overall behavior is just like the preceding variant and the points parameter produces a stream of structs.
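
As a rough analogy, a macro-shaped parameter behaves like a function that receives bare argument tuples and applies a known inner function to each one. The Python sketch below illustrates that idea; point and scatterplot here are stand-ins for the macros of the same names, and tuples stand in for the argument S-expressions.

```python
def point(x, y):
    return {"x": x, "y": y}

def scatterplot(*point_args):
    # Each argument carries only point's arguments; the call to point is implied.
    return [point(*args) for args in point_args]

print(scatterplot((3, 17), (395, 23)))
# [{'x': 3, 'y': 17}, {'x': 395, 'y': 23}]
```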

The binary encoding of macro-shaped parameters is similarly tagless, eliding any opcodes mentioning point and just writing its arguments with minimal delimiting.

Macro types can be combined with cardinality modifiers, with invocations using groups as needed:

(macro scatterplot
  (point::points+ flex_string::x_label flex_string::y_label)
  { points: [(%points)], x_label: (%x_label), y_label: (%y_label) })
(:scatterplot (:: (3 17) (395 23) (15 48) (2023 5)) "hour" "widgets")
  ⇒
  {
    points: [{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}],
    x_label: "hour",
    y_label: "widgets"
  }

As with other tagless parameters, you cannot replace a group with a macro invocation, and you can't use a macro invocation as an element of an argument group:

(:scatterplot (:make_points 3 17 395 23 15 48 2023 5) "hour" "widgets")
  ⇒ // Error: Argument group expected, found :make_points

(:scatterplot (:: (3 17) (:make_points 395 23 15 48) (2023 5)) "hour" "widgets")
  ⇒ // Error: sexp expected with args for 'point', found :make_points

(:scatterplot (:: (3 17) (:point 395 23) (15 48) (2023 5)) "hour" "widgets")
  ⇒ // Error: sexp expected with args for 'point', found :point

This limitation mirrors the binary encoding, where both the argument group and the individual macro invocations are tagless and there's no way to express a macro invocation.

tip

The primary goal of macro-shaped arguments, and tagless types in general, is to increase density by tightly constraining the inputs.

Special Forms

When a TDL expression is syntactically an S-expression and its first element is the symbol ., its next element must be a symbol that matches either one of the keywords denoting the special forms or the name of a previously-defined macro. The interpretation of the S-expression’s remaining elements depends on how the symbol resolves. In the case of macro invocations, the elements following the operator are arbitrary TDL expressions, but for special forms that is not always the case.

Special forms are "special" precisely because they cannot be expressed as macros and must therefore receive bespoke syntactic treatment. Since the elements of macro-invocation expressions are themselves expressions, when you want something to not be evaluated that way, it must be a special form.

Finally, these special forms are part of the template language itself, and are not addressable outside of TDL; the E-expression (:if_none foo bar baz) must necessarily refer to some user-defined macro named if_none, not to the special form of the same name.

todo

Many of these could be system macros instead of special forms. Being unrepresentable in TDL is not a reason for something to be a special form. Candidates to be moved to system macros are if_* and fail. Additionally, the system macro parse_ion may need to be classified as a special form since it only accepts literals.

if_none

(macro if_none (stream* true_branch* false_branch*) /* Not representable in TDL */)

The if_none form is if/then/else syntax testing stream emptiness. It has three sub-expressions, the first being a stream to check. If and only if that stream is empty (it produces no values), the second sub-expression is expanded. Otherwise, the third sub-expression is expanded. The expanded second or third sub-expression becomes the result that is produced by if_none.

note

Exactly one branch is expanded, because otherwise the empty stream might be used in a context that requires a value, resulting in an errant expansion error.

(macro temperature (degrees scale) 
       {
         degrees: (%degrees),
         scale: (.if_none (%scale) K (%scale)),
       })
(:temperature 96 F)     ⇒ {degrees:96,  scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}

To refine things a bit further, trailing optional arguments can be omitted entirely:

(:temperature 283) ⇒ {degrees:283, scale:K}

tip

If you're using if_none to specify an expression to default to, you can use the default system macro to be more concise.

(macro temperature (degrees scale)
    {
      degrees: (%degrees),
      scale: (.default (%scale) K),
    }
)

if_some

(macro if_some (stream* true_branch* false_branch*) /* Not representable in TDL */)

If stream evaluates to one or more values, it produces true_branch. Otherwise, it produces false_branch. Exactly one of true_branch and false_branch is evaluated. The stream expression must be expanded enough to determine whether it produces any values, but implementations are not required to fully expand the expression.

Example:

(macro foo (x)
       {
         foo: (.if_some (%x) [(%x)] null)
       })
(:foo (::))     => { foo: null }
(:foo 2)        => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }

The false_branch parameter may be elided, allowing if_some to serve as a map-if-not-none function.

Example:

(macro foo (x)
       {
         foo: (.if_some (%x) [(%x)])
       })
(:foo (::))     => { }
(:foo 2)        => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }

if_single

(macro if_single (expressions* true_branch* false_branch*) /* Not representable in TDL */)

If expressions evaluates to exactly one value, if_single produces the expansion of true_branch. Otherwise, it produces the expansion of false_branch. Exactly one of true_branch and false_branch is evaluated. The expressions argument must be expanded enough to determine whether it produces exactly one value, but implementations are not required to fully expand it.

if_multi

(macro if_multi (expressions* true_branch* false_branch*) /* Not representable in TDL */)

If expressions evaluates to more than one value, if_multi produces the expansion of true_branch. Otherwise, it produces the expansion of false_branch. Exactly one of true_branch and false_branch is evaluated. The expressions argument must be expanded enough to determine whether it produces more than one value, but implementations are not required to fully expand it.
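
The four cardinality tests differ only in how many values they need to see from the tested stream. The following Python sketch illustrates that idea; it is a model under stated assumptions, not reader code. Streams are modeled as Python iterables, branches as zero-argument callables so that only the chosen one is ever expanded, and the names cardinality, TRUE_WHEN, and if_form are hypothetical. Treating an elided false_branch as the empty stream is described in this document only for if_some; applying it to the other forms is an assumption of the sketch.

```python
from itertools import islice

def cardinality(stream):
    """Classify a stream as 'none', 'single', or 'multi'.

    At most two values are pulled, so the stream is expanded only as far
    as needed to answer the question.
    """
    seen = len(list(islice(iter(stream), 2)))
    return ("none", "single", "multi")[seen]

# Cardinalities that select the true branch for each special form.
TRUE_WHEN = {
    "if_none": {"none"},
    "if_some": {"single", "multi"},
    "if_single": {"single"},
    "if_multi": {"multi"},
}

def if_form(form, stream, true_branch, false_branch=lambda: []):
    """Expand exactly one branch; the other callable is never invoked."""
    if cardinality(stream) in TRUE_WHEN[form]:
        chosen = true_branch
    else:
        chosen = false_branch
    return list(chosen())
```

For instance, if_form("if_none", [], lambda: ["K"], lambda: ["F"]) selects the first branch, mirroring the temperature example for if_none.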

for

(for name_and_expressions template)

name_and_expressions is a list or s-expression containing one or more s-expressions of the form (name expr0 expr1 ... exprN). The first value is a symbol to act as a variable name. The remaining expressions in the s-expression will be expanded and concatenated into a single stream; for each value in the stream, the for expansion will produce a copy of the template argument expression with any appearance of the variable replaced by the value.

For example:

(.for
  [(word                     // Variable name
   foo bar baz)]             // Values over which to iterate
  (.values (%word) (%word))) // Template expression; `(%word)` will be replaced
=>
foo foo bar bar baz baz

Multiple s-expressions can be specified. The streams will be iterated over in lockstep.

(.for
  ((x 1 2 3)   // for x in...
   (y 4 5 6))  // for y in...
  ((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 4)
(2 5)
(3 6)

Iteration will end when the shortest stream is exhausted.

(.for
  [(x 1 2),    // for x in...
   (y 3 4 5)]  // for y in...
  ((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 3)
(2 4)
// no more output, `x` is exhausted
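
The lockstep iteration and shortest-stream rule can be modeled in a few lines of Python. This is a minimal sketch under assumptions: each binding is a (name, values) pair, the template is a callable that receives the bound names as keyword arguments and returns the values it produces, and the name for_form is hypothetical.

```python
def for_form(bindings, template):
    """Model of `for`: iterate all streams in lockstep and expand the template.

    bindings: list of (name, values) pairs.
    template: callable taking the bound names as keyword arguments and
              returning a list of produced values.
    """
    names = [name for name, _ in bindings]
    streams = [values for _, values in bindings]
    produced = []
    # zip stops as soon as the shortest stream is exhausted.
    for row in zip(*streams):
        produced.extend(template(**dict(zip(names, row))))
    return produced
```

With bindings x = 1 2 and y = 3 4 5, the sketch yields two results and then stops, matching the example above.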

Names defined inside a for shadow names in the parent scope.

(macro triple (x)
  //           └─── Parameter `x` is declared here...
  (.for
  //    ...but the `for` expression introduces a
  //  ┌─── new variable of the same name here.
    ((x a b c))
    (%x)
  //  └─── This refers to the `for` expression's `x`, not the parameter.
  )
)
(:triple 1) // Argument `1` is ignored
=>
a b c

The for special form can only be invoked in the body of a template macro. It is not valid to use as an E-expression.

System Macros

Many of the system macros MAY be defined as template macros, and when possible, the specification includes a template. Templates are given here as normative examples, but system macros are not required to be implemented as template macros.

The macros that can be defined as templates are included as system macros because of their broad applicability, and so that Ion implementations can provide optimizations for these macros that run directly in the implementation's runtime environment rather than in the macro evaluator. For example, a macro such as add_symbols does not produce user values, so an Ion Reader could bypass evaluating the template and directly update the encoding context with the new symbols.

Stream Constructors

none

(macro none () (.values))

none accepts no values and produces nothing (an empty stream).

For normative examples, see none in the Ion conformance test suite.

values

(macro values (v*) (%v))

This is, essentially, the identity function. It produces a stream from any number of arguments, concatenating the streams produced by the nested expressions. Used to aggregate multiple values or sub-streams to pass to a single argument, or to produce multiple results.

For normative examples, see values in the Ion conformance test suite.

default

(macro default (expr* default_expr*)
    // If `expr` is empty...
    (.if_none (%expr)
        // then expand `default_expr` instead.
        (%default_expr)
        // If it wasn't empty, then expand `expr`.
        (%expr)
    )
)

default tests expr to determine whether it expands to the empty stream. If it does not, default will produce the expansion of expr. If it does, default will produce the expansion of default_expr instead.

For normative examples, see default in the Ion conformance test suite.

flatten

(macro flatten (sequence*) /* Not representable in TDL */)

The flatten system macro constructs a stream from the content of one or more sequences.

Produces a stream with the contents of all the sequence values. Any annotations on the sequence values are discarded. Any non-sequence arguments will raise an error. Any null arguments will be ignored.

Examples:

(:flatten [a, b, c] (d e f))       => a b c d e f
(:flatten [[], null.list] foo::()) => [] null.list

The flatten macro can also be used to splice the content of one list or s-expression into another list or s-expression.

[1, 2, (:flatten [a, b]), 3, 4] => [1, 2, a, b, 3, 4]
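
The rules for flatten can be illustrated with a small Python model. It is a sketch under assumptions, not an implementation of the macro: a Python list stands in for an Ion list, a tuple for an S-expression, and None for a null argument. Annotations are not modeled, and the function name is only illustrative.

```python
def flatten(*sequences):
    """Concatenate the contents of the given sequences, one level deep."""
    produced = []
    for seq in sequences:
        if seq is None:
            # A null argument is ignored.
            continue
        if not isinstance(seq, (list, tuple)):
            # Any non-sequence argument is an error.
            raise TypeError(f"flatten: not a sequence: {seq!r}")
        # Only the outermost container is removed; nested values are kept as-is.
        produced.extend(seq)
    return produced
```

Note that nested containers and nested nulls survive, as in the second example above: only the arguments themselves are unwrapped.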

For normative examples, see flatten in the Ion conformance test suite.

parse_ion

Ion documents may be embedded in other Ion documents using the parse_ion macro.

(macro parse_ion (uint8::data*) /* Not representable in TDL */)

The parse_ion macro constructs a stream of values by parsing a blob literal or string literal as a single, self-contained Ion document. All values produced by the expansion of parse_ion are application values. (i.e. it is as if they are all annotated with $ion_literal.)

The IVM at the beginning of an Ion data stream is sufficient to identify whether it is text or binary, so text Ion can be embedded as a blob containing the UTF-8 encoded text.

Embedded text example:

(:parse_ion
    '''
    $ion_1_1
    $ion::(module _ (symbol_table ["foo", "bar"]))
    $1 $2
    '''
)
=> foo bar

Embedded binary example:

(:parse_ion {{ 4AEB6qNmb2+jYmFy }} )
=> foo bar

important

Unlike most macros, this macro specifically requires literals. Macros are not allowed to contain recursive calls, and composing an embedded document from multiple expressions would make it possible to implement recursion in the macro system.

The data argument is evaluated in a clean environment that cannot read anything from the parent document. Allowing context to leak from the outer scope into the document being parsed would also enable recursion.

For normative examples, see parse_ion in the Ion conformance test suite.

Value Constructors

annotate

(macro annotate (ann* value) /* Not representable in TDL */)

Produces the value prefixed with the annotations ann [1]. Each ann must be a non-null, unannotated string or symbol.

(:annotate (:: "a2") a1::true) => a2::a1::true

For normative examples, see annotate in the Ion conformance test suite.

make_string

(macro make_string (content*) /* Not representable in TDL */)

Produces a non-null, unannotated string containing the concatenated content produced by the arguments. Nulls (of any type) are forbidden. Any annotations on the arguments are discarded.

For normative examples, see make_string in the Ion conformance test suite.

make_symbol

(macro make_symbol (content*) /* Not representable in TDL */)

Like make_string but produces a symbol.

For normative examples, see make_symbol in the Ion conformance test suite.

make_blob

(macro make_blob (lobs*) /* Not representable in TDL */)

Like make_string but accepts lobs and produces a blob.

For normative examples, see make_blob in the Ion conformance test suite.

make_list

(macro make_list (sequences*) [ (.flatten (%sequences)) ])

Produces a non-null, unannotated list by concatenating the content of any number of non-null list or sexp inputs.

(:make_list)                  => []
(:make_list (1 2))            => [1, 2]
(:make_list (1 2) [3, 4])     => [1, 2, 3, 4]
(:make_list ((1 2)) [[3, 4]]) => [(1 2), [3, 4]]

For normative examples, see make_list in the Ion conformance test suite.

make_sexp

(macro make_sexp (sequences*) ( (.flatten (%sequences)) ))

Like make_list but produces a sexp.

(:make_sexp)                  => ()
(:make_sexp (1 2))            => (1 2)
(:make_sexp (1 2) [3, 4])     => (1 2 3 4)
(:make_sexp ((1 2)) [[3, 4]]) => ((1 2) [3, 4])

For normative examples, see make_sexp in the Ion conformance test suite.

make_struct

(macro make_struct (structs*) /* Not representable in TDL */)

Produces a non-null, unannotated struct by combining the fields of any number of non-null structs.

(:make_struct)    => {}
(:make_struct
  {k1: 1, k2: 2}
  {k3: 3}
  {k4: 4})        => {k1:1, k2:2, k3:3, k4:4}

For normative examples, see make_struct in the Ion conformance test suite.

make_field

(macro make_field (field_name value) /* Not representable in TDL */)

Produces a non-null, unannotated, single-field struct using the given field name and value.

The field_name parameter may be (or evaluate to) any non-null text value, and the value parameter may be (or evaluate to) any single value.

This can be used to dynamically construct field names based on macro parameters.

Example:

(macro foo_struct (extra_name extra_value)
       (.make_struct
         {
           foo_a: 1,
           foo_b: 2,
         }
         (.make_field (.make_string "foo_" (%extra_name)) (%extra_value))
       ))

Then:

(:foo_struct c 3) => { foo_a: 1, foo_b: 2, foo_c: 3 }

For normative examples, see make_field in the Ion conformance test suite.

make_decimal

(macro make_decimal (coefficient exponent) /* Not representable in TDL */)

This is no more compact than the regular binary encoding for decimals. However, it can be used in conjunction with other macros, for example, to represent fixed-point numbers.

Both coefficient and exponent must be (or evaluate to) a single integer value.

(macro usd (cents) (.annotate USD (.make_decimal (%cents) -2)))

(:usd 199) =>  USD::1.99
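
The arithmetic behind make_decimal is coefficient × 10^exponent with the exponent preserved. A minimal Python sketch, using the standard decimal module as a stand-in for Ion decimals (an assumption of this example; the function name is only illustrative):

```python
from decimal import Decimal

def make_decimal(coefficient, exponent):
    """Build coefficient * 10**exponent exactly, keeping the given exponent."""
    for arg in (coefficient, exponent):
        if isinstance(arg, bool) or not isinstance(arg, int):
            raise TypeError("make_decimal requires integer arguments")
    sign = 1 if coefficient < 0 else 0
    digits = tuple(int(d) for d in str(abs(coefficient)))
    # The (sign, digits, exponent) constructor is exact and ignores context precision.
    return Decimal((sign, digits, exponent))
```

Because an integer zero carries no sign, the sketch can never produce a negative zero, which mirrors the note that follows.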

note

It is not possible to use make_decimal to construct any negative zero value because Ion integers do not have signed zero.

For normative examples, see make_decimal in the Ion conformance test suite.

make_timestamp

(macro make_timestamp (year month? day? hour? minute? second? offset_minutes?) /* Not representable in TDL */)

Produces a non-null, unannotated timestamp at various levels of precision. When offset_minutes is absent, the result has unknown local offset; an offset_minutes of 0 denotes UTC.

The make_timestamp macro has rules that cannot be expressed in the macro signature because it must construct a valid Ion timestamp value.

The arguments to this macro may not be any null value. The evaluated argument for the year parameter must be an integer from 1 to 9999 inclusive. The evaluated argument for the month parameter, if present, must be an integer from 1 to 12 inclusive. The evaluated argument for the day parameter, if present, must be an integer that is a valid, 1-indexed day for the given month. The evaluated argument for the hour parameter, if present, must be an integer from 0 to 23 inclusive. The evaluated argument for the minute parameter, if present, must be an integer from 0 to 59 inclusive. The evaluated argument for the second parameter, if present, must be a decimal or integer value that is greater than or equal to zero and less than 60. The evaluated arguments for all other parameters, if present, must be integer values.

The offset_minutes and hour parameters may only be present if minute is present. Aside from offset_minutes, if any evaluated argument is present, the evaluated arguments for all parameters to the left must also be present. The precision of the constructed timestamp is determined by which parameters have non-empty arguments.
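
The argument rules above can be collected into a single validation routine. The Python sketch below is illustrative only: None models an absent argument, the function name timestamp_precision is hypothetical, and it returns the name of the finest field present rather than constructing a timestamp.

```python
import calendar
from decimal import Decimal

def timestamp_precision(year, month=None, day=None, hour=None,
                        minute=None, second=None, offset_minutes=None):
    """Validate make_timestamp-style arguments and report the precision."""
    def check_int(name, value, lo, hi):
        if isinstance(value, bool) or not isinstance(value, int) or not lo <= value <= hi:
            raise ValueError(f"{name} must be an integer from {lo} to {hi}")

    check_int("year", year, 1, 9999)
    fields = [("month", month), ("day", day), ("hour", hour),
              ("minute", minute), ("second", second)]
    # Aside from offset_minutes, a present argument requires everything to its left.
    present = [value is not None for _, value in fields]
    if any(later and not earlier for earlier, later in zip(present, present[1:])):
        raise ValueError("an argument is present but one to its left is absent")
    if minute is None and (hour is not None or offset_minutes is not None):
        raise ValueError("hour and offset_minutes require minute")
    if month is not None:
        check_int("month", month, 1, 12)
    if day is not None:
        check_int("day", day, 1, calendar.monthrange(year, month)[1])
    if hour is not None:
        check_int("hour", hour, 0, 23)
    if minute is not None:
        check_int("minute", minute, 0, 59)
    if second is not None:
        ok_type = isinstance(second, (int, Decimal)) and not isinstance(second, bool)
        if not ok_type or not 0 <= second < 60:
            raise ValueError("second must be a decimal or integer in [0, 60)")
    if offset_minutes is not None:
        if isinstance(offset_minutes, bool) or not isinstance(offset_minutes, int):
            raise ValueError("offset_minutes must be an integer")
    # The precision is the rightmost field that has an argument.
    precision = "year"
    for name, value in fields:
        if value is not None:
            precision = name
    return precision
```

The sketch does not range-check offset_minutes beyond requiring an integer, because this section states no bounds for it.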

note

TODO ion-docs#256 Reconsider offset semantics, perhaps default should be UTC.

Example:

(macro ts_today
       (uint8::hour uint8::minute uint32::seconds_millis)
       (.make_timestamp
         2022
         4
         28
         (%hour)
         (%minute)
         (.make_decimal (%seconds_millis) -3) 0))

For normative examples, see make_timestamp in the Ion conformance test suite.

Encoding Utility Macros

repeat

The repeat system macro can be used for efficient run-length encoding.

(macro repeat (n! value*) /* Not representable in TDL */)

Produces a stream that repeats the specified value expression(s) n times.

The evaluated argument for n must be a non-null integer value that is equal to or greater than zero.

(:repeat 5 0)          => 0 0 0 0 0
(:repeat 2 true false) => true false true false
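
As a rough model of the expansion, the following Python sketch repeats the whole value sequence n times. Streams are modeled as Python lists and the function name is only illustrative.

```python
def repeat(n, *values):
    """Produce the given values, in order, n times over."""
    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-null integer greater than or equal to zero")
    return list(values) * n
```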

For normative examples, see repeat in the Ion conformance test suite.

delta

(macro delta (deltas*) /* Not representable in TDL */)

The delta system macro can be used for directed delta encoding. It produces a stream that is equal in length to the deltas argument, defined by the recurrence relation:

output₀ = delta₀
outputₙ₊₁ = outputₙ + deltaₙ₊₁

Example:

(:delta 1000 1 2 3 -4) => 1000 1001 1003 1006 1002
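
The recurrence is a running sum, which the Python sketch below expresses with itertools.accumulate. The inverse helper to_deltas is not part of this specification; it is included as an assumption about how a writer might prepare arguments for delta.

```python
from itertools import accumulate

def delta(*deltas):
    """Expand deltas into values: output[0] = delta[0], output[n+1] = output[n] + delta[n+1]."""
    return list(accumulate(deltas))

def to_deltas(values):
    """Hypothetical writer-side inverse: turn values into the deltas that reproduce them."""
    return [current - previous for previous, current in zip([0, *values], values)]
```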

For normative examples, see delta in the Ion conformance test suite.

sum

(macro sum (a b) /* Not representable in TDL */)

Produces the sum of two non-null integer arguments.

Examples:

(:sum 1 2) => 3

For normative examples, see sum in the Ion conformance test suite.

meta

(macro meta (anything*) (.none))

The meta macro accepts any values and emits nothing. It allows writers to encode data that will not be surfaced to most readers. Readers can be configured to intercept calls to meta, allowing them to read the otherwise invisible data.

When transcribing from one format to another, writers should preserve invocations of meta when possible.

Example:

(:values
    (:meta {author: "Mike Smith", email: "mikesmith@example.com"})
    {foo:2,foo:1}
)
=>
{foo:2,foo:1}

For normative examples, see meta in the Ion conformance test suite.

Updating the Encoding Context

set_symbols

Redefines the default module's symbol table, preserving any macros in its macro table.

(macro set_symbols (symbols*)
       $ion::
       (module _
         (symbol_table [(%symbols)])
         (macro_table _)
       ))

Example:

(:set_symbols foo bar)
=>
$ion::
(module _
  (symbol_table [foo, bar])
  (macro_table _)
)

For normative examples, see set_symbols in the Ion conformance test suite.

add_symbols

Appends symbols to the default module's symbol table, preserving any macros in its macro table.

(macro add_symbols (symbols*)
       $ion::
       (module _
         (symbol_table _ [(%symbols)])
         (macro_table _)
       ))

Example:

(:add_symbols foo bar)
=>
$ion::
(module _
  (symbol_table _ [foo, bar])
  (macro_table _)
)

For normative examples, see add_symbols in the Ion conformance test suite.

set_macros

Sets the default module's macro table, preserving any symbols in its symbol table.

(macro set_macros (macros*)
       $ion::
       (module _
         (symbol_table _)
         (macro_table (%macros))
       ))

Example:

(:set_macros (macro pi () 3.14159))
=>
$ion::
(module _
  (symbol_table _)
  (macro_table (macro pi () 3.14159))
)

For normative examples, see set_macros in the Ion conformance test suite.

add_macros

Appends macros to the default module's macro table, preserving any symbols in its symbol table.

(macro add_macros (macros*)
       $ion::
       (module _
         (symbol_table _)
         (macro_table _ (%macros))
       ))

Example:

(:add_macros (macro pi () 3.14159))
=>
$ion::
(module _
  (symbol_table _)
  (macro_table _ (macro pi () 3.14159))
)

For normative examples, see add_macros in the Ion conformance test suite.

use

Appends the content of the given module to the default module.

(macro use (catalog_key version?)
       $ion::
       (module _
         (import the_module (%catalog_key) (.default (%version) 1))
         (symbol_table _ the_module)
         (macro_table _ the_module)
       ))

Example:

(:use "org.example.FooModule" 2)
=>
$ion::
(module _
  (import the_module "org.example.FooModule" 2)
  (symbol_table _ the_module)
  (macro_table _ the_module)
)

For normative examples, see use in the Ion conformance test suite.


1

The annotations sequence comes first in the macro signature because it parallels how annotations are read from the data stream.

Ion 1.1 modules

In Ion 1.0, each stream has a symbol table. The symbol table stores text values that can be referred to by their integer index in the table, providing a much more compact representation than repeating the full UTF-8 text bytes each time the value is used. Symbol tables do not store any other information used by the reader or writer.

Ion 1.1 introduces the concept of a macro table. It is analogous to the symbol table, but instead of holding text values it holds macro definitions.

Ion 1.1 also introduces the concept of a module, an organizational unit that holds a (symbol table, macro table) pair.

tip

You can think of an Ion 1.0 symbol table as a module with an empty macro table.

In Ion 1.1, each stream has an encoding module sequence: a collection of modules whose symbols and macros are being used to encode the current segment.

Module interface

The interface to a module consists of:

  • its spec version, denoting the Ion version used to define the module
  • its exported symbols, an array of strings denoting symbol content
  • its exported macros, an array of <name, macro> pairs, where all names (where specified) are unique identifiers

The spec version is external to the module body and the precise way it is determined depends on the type of module being defined. This is explained in further detail in Module Versioning.

The exported symbol array is denoted by the symbol_table clause of a module definition, and by the symbols field of a shared symbol table.

The exported macro array is denoted by the module’s macro_table clause, with addresses allocated to macros or macro bindings in the order they are declared.

The exported symbols and exported macros are defined in the module body.

Types of modules

There are multiple types of modules. All modules share the same interface, but vary in their implementation in order to support a variety of different use cases.

Module Type        Purpose
Local Modules      Organizing symbols and macros within a scope
Shared Modules     Defining reusable symbols and macros outside of the data stream
System Modules     Defining system symbols and macros
Encoding Modules   Encoding the current stream segment

Module versioning

Every module definition has a spec version that determines the syntax and semantics of the module body. A module’s spec version is expressed in terms of a specific Ion version; the meaning of the module is as defined by that version of the Ion specification.

The spec version for a local module is inherited from its parent scope, which may be the stream itself. The spec version for a shared module is denoted via a required annotation. The spec version of a system module is the Ion version in which it was specified.

To ensure that all consumers of a module can properly understand it, a module can only import shared modules defined with the same or earlier spec version.

Examples

The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N. This allows the module to be serialized using any version of Ion, and its meaning will not change.

$ion_shared_module::
$ion_1_1::
("com.example.symtab" 3
    (symbol_table ...)
    (macro_table ...))

The spec version of a local module is always the same as the spec version of its enclosing scope. If the local module is defined at the top level of the stream, its spec version is the Ion version of the current segment.

$ion_1_1
$ion::
(module foo
  // Module semantics specified by Ion 1.1
  ...
)

// ...

$ion_1_3
$ion::
(module foo
  // Module semantics specified by Ion 1.3
  ...
)
//...                  // Assuming no IVM
$ion::
(module bar
  // Module semantics specified by Ion 1.3
  ...
)

Identifiers

Many of the grammatical elements used to define modules and macros are identifiers--symbols that do not require quotation marks.

More explicitly, an identifier is a sequence of one or more ASCII letters, digits, or the characters $ (dollar sign) or _ (underscore), not starting with a digit. It also cannot be of the form $\d+, which is the syntax for symbol IDs (for example: $3, $10, $458, etc.), nor can it be a keyword (true, false, null, or nan).
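
The identifier rule translates directly into a pair of regular expressions plus a keyword check. A minimal Python sketch (the function and constant names are illustrative):

```python
import re

# One or more ASCII letters, digits, '$' or '_', not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
# The symbol ID form, e.g. $3 or $458, is excluded.
_SYMBOL_ID = re.compile(r"\$[0-9]+")
_KEYWORDS = {"true", "false", "null", "nan"}

def is_identifier(text):
    """Return True if text is an identifier as described above."""
    return (_IDENTIFIER.fullmatch(text) is not None
            and _SYMBOL_ID.fullmatch(text) is None
            and text not in _KEYWORDS)
```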

Defining modules

A module is defined by four kinds of subclauses which, if present, always appear in the same order.

  1. import - a reference to a shared module definition; repeatable
  2. module - a nested module definition; repeatable
  3. symbol_table - an exported list of text values
  4. macro_table - an exported list of macro definitions

The lexical name given to a module definition must be an identifier. However, it must not begin with a $--this is reserved for system-defined bindings like $ion.

Internal environment

The body of a module tracks an internal environment by which macro references are resolved. This environment is constructed incrementally by each clause in the definition and consists of:

  • the module bindings, a map from identifier to module definition
  • the exported symbols, an array containing symbol texts
  • the exported macros, an array containing name/macro pairs

Before any clauses of the module definition are examined, each of these is empty.

Each clause affects the environment as follows:

  • An import declaration retrieves a shared module from the implementation’s catalog and binds a name to it, making its macros available for use. An error must be signaled if the name already appears in the module bindings.
  • A module declaration defines a new module and binds a name to it. An error must be signaled if the name already appears in the module bindings.
  • A symbol_table declaration defines the exported symbols.
  • A macro_table declaration defines the exported macros.

Resolving Macro References

Within a module definition, macros can be referenced in several contexts using the following macro-ref syntax:

qualified-ref      ::= module-name '::' macro-ref

macro-ref          ::= macro-name | macro-addr

macro-name         ::= unannotated-identifier-symbol

macro-addr         ::= unannotated-uint 

Macro references are resolved to a specific macro as follows:

  • An unqualified macro-name is looked up in the following locations:

    1. in the macros already exported in this module's macro_table
    2. in the default_module
    3. in the system module

    If it maps to a macro, that’s the resolution of the reference. Otherwise, an error is signaled due to an unbound reference.

  • An anonymous local reference (macro-addr) is resolved by index in the exported macro array. If the address exceeds the array boundary, an error is signaled due to an invalid reference.

  • A qualified reference (qualified-ref) resolves solely against the referenced module. First, the module name must be resolved to a module definition.

    • If the module name is in the module bindings, it resolves to the corresponding module definition.
    • If the module name is not in the module bindings, resolution is attempted recursively upwards through the parent scopes.
    • If the search reaches the top level without resolving to a module, an error is signaled due to an unbound reference.

    Next, the name or address is resolved within that module definition’s exported macro table.
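
The resolution rules above can be sketched as a small Python model. Everything here is an assumption made for illustration: the Module class, its field names, and the convention that a macro-name is a str, a macro-addr is an int, and an anonymous macro has the name None. It is not how any particular implementation stores modules.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    macros: list = field(default_factory=list)    # exported (name, macro) pairs
    bindings: dict = field(default_factory=dict)  # module name -> Module
    parent: "Module | None" = None                # enclosing scope, if any

def lookup_in(module, ref):
    """Resolve a macro-name (str) or macro-addr (int) in one module's exported macros."""
    if isinstance(ref, int):
        if not 0 <= ref < len(module.macros):
            raise LookupError(f"invalid reference: address {ref}")
        return module.macros[ref][1]
    for name, macro in module.macros:
        if name == ref:
            return macro
    return None

def resolve(module, ref, default_module, system_module, qualifier=None):
    """Resolve ref from inside `module`; `qualifier` is the module-name of a qualified-ref."""
    if qualifier is not None:
        scope = module
        while scope is not None and qualifier not in scope.bindings:
            scope = scope.parent  # search upwards through the parent scopes
        if scope is None:
            raise LookupError(f"unbound reference: module {qualifier}")
        found = lookup_in(scope.bindings[qualifier], ref)
    elif isinstance(ref, int):
        found = lookup_in(module, ref)  # anonymous local reference
    else:
        found = None
        # Unqualified names: this module's exports, then the default module, then the system module.
        for scope in (module, default_module, system_module):
            found = lookup_in(scope, ref)
            if found is not None:
                break
    if found is None:
        raise LookupError(f"unbound reference: {ref}")
    return found
```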

import

import             ::= '(import' module-name catalog-key ')'

module-name        ::= unannotated-identifier-symbol

catalog-key        ::= catalog-name catalog-version?

catalog-name       ::= string

catalog-version    ::= int // positive, unannotated

An import binds a lexically scoped module name to a shared module that is identified by a catalog key—a (name, version) pair. The version of the catalog key is optional—when omitted, the version is implicitly 1.

In Ion 1.0, imports may be substituted with a different version if an exact match is not found. In Ion 1.1, however, all imports require an exact match to be found in the reader's catalog; if an exact match is not found, the implementation must signal an error.
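
The exact-match rule amounts to a keyed lookup with no fallback. In the Python sketch below the catalog is modeled as a dict keyed by (name, version) pairs; that representation, the exception class, and the function name are assumptions made for illustration.

```python
class ImportNotFound(Exception):
    """Raised when the catalog holds no exact (name, version) match."""

def resolve_import(catalog, name, version=1):
    """Return the shared module for the catalog key; the version defaults to 1."""
    key = (name, version)
    if key not in catalog:
        # No substitution of a nearby version is attempted.
        raise ImportNotFound(f"no exact match for {name!r} version {version}")
    return catalog[key]
```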

module

The module clause defines a new local module that is contained in the current module.

inner-module ::= '(module' module-name import* symbol-table? macro-table? ')'

Inner modules automatically have access to modules previously declared in the containing module using module or import. The new module (and its exported symbols and macros) is available to any following module, symbol_table, and macro_table clauses in the enclosing container.

See local modules for full explanation.

symbol_table

A module can define a list of exported symbols by copying symbols from other modules and/or declaring new symbols.

symbol-table       ::= '(symbol_table' symbol-table-entry* ')'

symbol-table-entry ::= module-name | symbol-list

symbol-list        ::= '[' ( symbol-text ',' )* ']'

symbol-text        ::= symbol | string

The symbol_table clause assembles a list of text values for the module to export. It takes any number of arguments, each of which may be the name of a visible module or a list of symbol-texts. The symbol table is the list of symbol-texts formed by concatenating the symbol tables of the named modules and the lists of symbol/string values, in argument order.

Where a module name occurs, its symbol table is appended. (The module name must refer to another module that is visible to the current module.) Unlike Ion 1.0, no symbol-maxid is needed because Ion 1.1 always requires exact matches for imported modules.

tip

When redefining a top-level module binding, the binding being redefined can be added to the symbol table in order to retain its symbols. For example:

// Define module `foo`
$ion::
(module foo
    (symbol_table ["b", "c"]))

// Redefine `foo` in terms of its former definition
$ion::
(module foo
    (symbol_table
        ["a"]
        foo // The old definition of `foo` with symbols ["b", "c"]
        ["d"]))

// Now `foo`'s symbol table is ["a", "b", "c", "d"]

Where a list occurs, it must contain only non-null, unannotated strings and symbols. The text of each string or symbol is appended to the symbol table. Upon encountering any non-text value, null value, or annotated value in the list, the implementation shall signal an error.
To add a symbol with unknown text to the symbol table, one may use $0.

All modules have a symbol table, so when a module has no symbol_table clause, the module has an empty symbol table.

Symbol zero $0

Symbol zero (i.e. $0) is a special symbol that is not assigned text by any symbol table, even the system symbol table. Symbol zero always has unknown text, and can be useful in synthesizing symbol identifiers where the text image of the symbol is not known in a particular operating context.

All symbol tables (even an empty symbol table) can be thought of as implicitly containing $0. However, $0 precedes all symbol tables rather than belonging to any symbol table. When adding the exported symbols from one module to the symbol table of another, the preceding $0 is not copied into the destination symbol table (because it is not part of the source symbol table).

It is important to note that $0 is only semantically equivalent to itself and to locally-declared SIDs with unknown text. It is not semantically equivalent to SIDs with unknown text from shared symbol tables, so replacing such SIDs with $0 is a destructive operation to the semantics of the data.

Processing

When the symbol_table clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.

For each arg:

  • If the arg is a list of text values, the nested text values are appended to the end of the symbol table being constructed.
    • When $0 appears in the list of text values, this creates a symbol with unknown text.
    • The presence of any other Ion value in the list raises an error.
  • If the arg is the name of a module, the symbols in that module's symbol table are appended to the end of the symbol table being constructed.
  • If the arg is anything else, the reader must raise an error.

Example

(symbol_table         // Constructs an empty symbol table (list)
  ["a", b, 'c']       // The text values in this list are appended to the table
  foo                 // Module `foo`'s symbol table values are appended to the table
  ['''g''', "h", i])  // The text values in this list are appended to the table

If module foo's symbol table were [d, e, f], then the symbol table defined by the above clause would be:

["a", "b", "c", "d", "e", "f", "g", "h", "i"]
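
The processing steps can be modeled in Python as follows. This is a sketch under assumptions: a module-name argument is a str, a symbol list is a Python list of str, the sentinel SYMBOL_ZERO stands in for the symbol $0, and unknown text is stored as None. Signaling an error for a module name that is not visible is inferred from the visibility requirement rather than stated in the processing rules.

```python
SYMBOL_ZERO = object()  # stands in for the symbol $0 (unknown text)

def build_symbol_table(args, visible_modules):
    """Build a symbol table from symbol_table clause arguments, left to right.

    args: each item is a module name (str) or a list of text values.
    visible_modules: dict mapping module name -> that module's symbol table.
    """
    table = []
    for arg in args:
        if isinstance(arg, list):
            for item in arg:
                if item is SYMBOL_ZERO:
                    table.append(None)  # a symbol with unknown text
                elif isinstance(item, str):
                    table.append(item)
                else:
                    raise ValueError(f"not a text value: {item!r}")
        elif isinstance(arg, str):
            if arg not in visible_modules:
                raise ValueError(f"no visible module named {arg}")
            table.extend(visible_modules[arg])
        else:
            raise ValueError(f"invalid symbol_table argument: {arg!r}")
    return table
```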

For comparison, the following Ion 1.0 symbol table imports two shared symbol tables and then declares some symbols of its own.

$ion_1_0
$ion_symbol_table::{
  imports: [{ name: "com.example.shared1", version: 1, max_id: 10 },
            { name: "com.example.shared2", version: 2, max_id: 20 }],
  symbols: ["s1", "s2"]
}

Here’s the Ion 1.1 equivalent in terms of symbol allocation order:

$ion_1_1
$ion::(import m1 "com.example.shared1" 1)
$ion::(import m2 "com.example.shared2" 2)
$ion::
(module _
  (symbol_table m1 m2 ["s1", "s2"])
)

macro_table

Macros are declared after symbols. The macro_table clause assembles a list of macro definitions for the module to export. It takes any number of arguments. All modules have a macro table, so when a module has no macro_table clause, the module has an empty macro table.

Most commonly, a macro table entry is a definition of a new macro expansion function, following the general shape (macro name signature template).

When no name is given, this defines an anonymous macro that can be referenced by its numeric address (that is, its index in the enclosing macro table). Inside the defining module, that uses a local reference like 12.

The signature defines the syntactic shape of expressions invoking the macro; see Macro Signatures for details. The template defines the expansion of the macro, in terms of the signature’s parameters; see Template Expressions for details.

Imported macros must be explicitly exported if so desired. Module names and export clauses can be intermingled with macro definitions inside the macro_table; together, they determine the bindings that make up the module’s exported macro array.

The module-name export form is shorthand for referencing all exported macros from that module, in their original order with their original names.

An export clause contains a single macro reference followed by an optional alias for the exported macro. The referenced macro is appended to the macro table.

tip

No name can be repeated among the exported macros, including macro definitions. Name conflicts must be resolved by exports with aliases.

Processing

When the macro_table clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.

For each arg:

  • If the arg is a macro clause, the clause is processed and the resulting macro definition is appended to the end of the macro table being constructed.
  • If the arg is an export clause, the clause is processed and the referenced macro definition is appended to the end of the macro table being constructed.
  • If the arg is the name of a module, the macro definitions in that module's macro table are appended to the end of the macro table being constructed.
  • If the arg is anything else, the reader must raise an error.

A macro name is a symbol that can be used to reference a macro, both inside and outside the module. Macro names are optional, and improve legibility when using, writing, and debugging macros. When a name is used, it must be an identifier per Ion’s syntax for symbols. Macro definitions being added to the macro table must have a unique name. If a macro is added whose name conflicts with one already present in the table, the implementation must raise an error.
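The processing rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Macro`, `MacroClause`, `ExportClause`, and `ModuleName` classes and the `build_macro_table` function are hypothetical names invented for this example, and the clauses are assumed to have been parsed and resolved already.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Macro:
    name: Optional[str]  # None for an anonymous macro
    body: str            # stand-in for the signature and template


@dataclass(frozen=True)
class MacroClause:       # hypothetical model of a macro clause
    macro: Macro


@dataclass(frozen=True)
class ExportClause:      # hypothetical model of an export clause
    target: Macro                # the macro the reference resolved to
    alias: Optional[str] = None  # new name, if one follows the reference
    drop_name: bool = False      # True when the reference is followed by null


@dataclass(frozen=True)
class ModuleName:        # hypothetical model of a bare module name argument
    name: str


def build_macro_table(args: list, modules: Dict[str, List[Macro]]) -> List[Macro]:
    """Process macro_table arguments left to right, rejecting duplicate names."""
    table: List[Macro] = []
    used_names = set()

    def append(macro: Macro) -> None:
        # Anonymous macros never conflict; named macros must be unique.
        if macro.name is not None:
            if macro.name in used_names:
                raise ValueError(f"macro name already in table: {macro.name}")
            used_names.add(macro.name)
        table.append(macro)

    for arg in args:
        if isinstance(arg, MacroClause):
            append(arg.macro)
        elif isinstance(arg, ExportClause):
            if arg.drop_name:
                append(replace(arg.target, name=None))
            elif arg.alias is not None:
                append(replace(arg.target, name=arg.alias))
            else:
                append(arg.target)
        elif isinstance(arg, ModuleName):
            for macro in modules[arg.name]:
                append(macro)
        else:
            raise ValueError(f"unexpected macro_table argument: {arg!r}")
    return table
```

Tracking the used names in a set keeps the uniqueness check independent of table size; anonymous macros bypass it because they can only be referenced by address.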

macro

A macro clause defines a new macro. When the macro declaration uses a name, an error must be signaled if it already appears in the exported macro array.

export

An export clause declares a name for an existing macro and appends the macro to the macro table.

  • If the reference to the existing macro is followed by a name, the existing macro is appended to the exported macro array with the latter name instead of the original name, if any. In this way, an anonymous macro can be given a name. An error must be signaled if that name already appears in the exported macro array.
  • If the reference to the existing macro is followed by null, the macro is appended to the exported macro array without a name, regardless of whether the macro has a name.
  • If the reference to the existing macro is anonymous, the macro is appended to the exported macro array without a name.
  • When the reference to the existing macro uses a name, the name and macro are appended to the exported macro
    array. An error must be signaled if that name already appears in the exported macro array.

Module names in macro_table

A module name appends all exported macros from the module to the exported macro array. If any exported macro uses a name that already appears in the exported macro array, an error must be signaled.

Directives

Directives are system values that modify the encoding context.

Syntactically, a directive is a top-level s-expression annotated with $ion. Its first child value is an operation name. The operation determines what changes will be made to the encoding context and which clauses may legally follow.

$ion::
(operation_name
    (clause_1 /*...*/)
    (clause_2 /*...*/)
    /*...more clauses...*/
    (clause_N /*...*/))

In Ion 1.1, there are three supported directive operations:

  1. module
  2. import
  3. encoding

Top-level bindings

The module and import directives each create a stream-level binding to a module definition. Once created, module bindings at this level endure until the stream ends or another Ion version marker is encountered.

Module bindings at the stream-level can be redefined.

tip

The add_macros and add_symbols system macros work by redefining the default module (_) in terms of itself.

This behavior differs from module bindings created inside another module; attempting to redefine these will raise an error.

module directives

The module directive binds a name to a local module definition at the top level of the stream.

$ion::
(module foo
    /*...imports, if any...*/
    /*...submodules, if any...*/
    (symbol_table /*...*/)
    (macro_table /*...*/)
)

import directives

The import directive looks up the module corresponding to the given (name, version) pair in the catalog. Upon success, it creates a new binding to that module at the top level of the stream.

$ion::
(import
    bar               // Binding
    "com.example.bar" // Module name
    2)                // Module version

The version can be omitted. When it is not specified, it defaults to 1.

If the catalog does not contain an exact match, this operation raises an error.
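As a rough illustration of this lookup, the sketch below models the catalog as a dictionary keyed by (name, version). The `Catalog` alias, the `resolve_import` function, and the choice of `LookupError` are assumptions made for this example; the specification does not prescribe an API.

```python
from typing import Dict, Optional, Tuple

# Hypothetical catalog model: (module name, version) -> module definition.
Catalog = Dict[Tuple[str, int], dict]


def resolve_import(catalog: Catalog, bindings: Dict[str, dict],
                   binding: str, name: str, version: Optional[int] = None) -> None:
    """Bind `binding` at the top level to the catalog module (name, version)."""
    if version is None:
        version = 1  # an omitted version defaults to 1
    key = (name, version)
    if key not in catalog:
        # Only an exact (name, version) match is acceptable.
        raise LookupError(f"no module {name!r} version {version} in catalog")
    # Stream-level bindings may be redefined, so plain assignment suffices.
    bindings[binding] = catalog[key]
```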

encoding directives

An encoding directive accepts a sequence of module bindings to use as the following stream segment's encoding module sequence.

$ion::
(encoding
    mod_a
    mod_b
    mod_c)

The new encoding module sequence takes effect immediately after the directive and remains the same until the next encoding directive or Ion version marker.

Note that the default module is always implicitly at the head of the encoding module sequence.

Local modules

Local modules are lexically scoped. They can be referenced immediately following their definition, up until the end of their enclosing scope. They can be defined either:

  1. At the top level of a stream, in which case the enclosing scope is the stream itself.
  2. Inside another module, in which case the enclosing scope is the parent module. The parent module can be a shared or local module.

Local modules always have a symbolic name given at the point of definition, also known as a binding. It is legal for a module binding to "shadow" a module binding in its parent scope by using the same name.

$ion::
(module foo // <-- Top-level module `foo`
  (macro_table
    (macro quux () Quux)))

$ion::
(module bar
  (module foo // <-- Shadows the top-level module `foo`
    (macro_table
      (macro quuz () Quuz)))
  (macro_table foo::quuz) // <-- Refers to the innermost `foo`
)

However, it is not legal for a local module to use the same name as a module previously defined in the same scope.

$ion::
(module bar
  (module foo // <-- First definition of `foo` inside `bar`
    (macro_table
      (macro quux () Quux)))
  (module foo // <-- ERROR: module `foo` already defined in this scope
    (macro_table
      (macro quuz () Quuz)))
  /*...*/
)

The only exception to this rule is at the top level. Stream-level bindings are mutable, while bindings inside a module are immutable.

$ion::
(module foo // <-- Top-level module `foo`
  (macro_table
    (macro quux () Quux)))

$ion::
(module foo // <-- Redefines the top-level binding `foo`
  (macro_table
    (macro quuz () Quuz)))

Local modules inherit their spec version from the enclosing scope. Local modules automatically have access to modules previously declared in their enclosing scope using module or import.

Examples

Local modules can be used to define helper macros without having to export them.

$ion_shared_module::$ion_1_1::(
  "org.example.Foo" 1
  (module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
  (macro_table
    (macro y_axis_point (y) (.util::point2d 0 (%y)))
    (macro polygon (util::point2d::points+) [(%points)]))
)

In this example, the macro point2d is declared in a local module. The macro definitions being exported in the shared module's macro table are able to reference the helper macros by name.


Local modules can also be used for grouping macros into namespaces (only visible within the parent scope).

$ion_shared_module::$ion_1_1::(
  "org.example.Foo" 1
  (module cartesian (macro_table (macro point2d (x y) { x:(%x), y:(%y) })
                                 (macro polygon (point2d::points+) [(%points)]) ))

  (module polar (macro_table (macro point2d (r phi) { r:(%r), phi:(%phi) })
                             (macro polygon (point2d::points+) [(%points)]) ))
  (macro_table
    (export cartesian::polygon cartesian_polygon)
    (export polar::polygon polar_polygon))
)

In this example, there are two macros named point2d and two named polygon. There is no name conflict between them because they are declared in separate namespaces. Both polygon macros are added to the shared module's macro table, each under an alias that resolves the name conflict. Neither point2d macro needs to be added to the shared module's macro table, because each polygon definition can reference the point2d macro declared in its own module.


When grouping macros in local modules, there are more than just organizational benefits. By first defining helper macros in an inner module, a module can export macros in a different order than they are declared:

$ion_shared_module::$ion_1_1::(
  "org.example.Foo" 1
  // point2d must be declared before polygon...
  (module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
  (macro_table
    // ...because it is used in the definition of polygon
    (macro polygon (util::point2d::points+) [(%points)])
    // But it can be added to the macro table after polygon
    util)
)

Local modules can also be used for organization of symbols.

$ion::
(module _
  (module dairy      (symbol_table [cheese,  yogurt, milk]))
  (module grains     (symbol_table [cereal,  bread,  rice]))
  (module vegetables (symbol_table [carrots, celery, peas]))
  (module meat       (symbol_table [chicken, mutton, beef]))

  (symbol_table dairy
                grains
                vegetables
                meat)
)

Encoding modules

The encoding of each segment of a stream is shaped by the currently configured encoding modules, an ordered sequence of modules that determine which symbols and macros are available for use in the stream. A writer can modify this sequence by emitting an encoding directive.

By logically concatenating the encoding modules' symbol and macro tables respectively, they can be viewed as unified local symbol and macro tables.

For example, consider these module definitions and the subsequent encoding directive:

$ion::
(module mod_a
    (symbol_table ["a", "b", "c"])
    (macro_table
        (macro foo () Foo)
        (macro bar () Bar)))
$ion::
(module mod_b
    (symbol_table ["c", "d", "e"])
    (macro_table
        (macro baz () Baz)
        (macro quux () Quux)))
$ion::
(module mod_c
    (symbol_table ["f", "g", "h"])
    (macro_table
        (macro quuz () Quuz)
        (macro foo () Foo2)))

$ion::
(encoding
    mod_a
    mod_b
    mod_c)

It produces the encoding module sequence _ mod_a mod_b mod_c. (The default module, _, is always implicitly at the head of the encoding sequence.)

The segment's local symbol table, formed by logically concatenating the symbol tables of mod_a, mod_b, and mod_c in that order, is:

Address  Symbol text
0        <unknown text>
1        a
2        b
3        c
4        c
5        d
6        e
7        f
8        g
9        h

Notice that no de-duplication takes place; c appears at both address 3 and address 4.

The segment's macro table, formed by logically concatenating the macro tables of mod_a, mod_b, and mod_c in that order, is:

Address  Macro
0        mod_a::foo
1        mod_a::bar
2        mod_b::baz
3        mod_b::quux
4        mod_c::quuz
5        mod_c::foo

Notice that mod_a::foo and mod_c::foo can coexist in this unified view without issue. Invocations of these macros require that they be qualified by their enclosing module's name.
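The logical concatenation described above can be sketched as follows. The dictionary layout and the function names are hypothetical, chosen only for this illustration; the one rule carried over from the tables is that address 0 of the symbol table denotes a symbol with unknown text.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory model of the modules defined in the example above.
MODULES: Dict[str, Dict[str, List[str]]] = {
    "_":     {"symbols": [],              "macros": []},
    "mod_a": {"symbols": ["a", "b", "c"], "macros": ["foo", "bar"]},
    "mod_b": {"symbols": ["c", "d", "e"], "macros": ["baz", "quux"]},
    "mod_c": {"symbols": ["f", "g", "h"], "macros": ["quuz", "foo"]},
}


def local_symbol_table(sequence: List[str], modules) -> List[Optional[str]]:
    """Concatenate symbol tables in sequence order, without de-duplication."""
    table: List[Optional[str]] = [None]  # address 0: symbol with unknown text
    for name in sequence:
        table.extend(modules[name]["symbols"])
    return table


def local_macro_table(sequence: List[str], modules) -> List[str]:
    """Concatenate macro tables, keeping each macro qualified by its module."""
    return [f"{name}::{macro}" for name in sequence for macro in modules[name]["macros"]]


SEQUENCE = ["_", "mod_a", "mod_b", "mod_c"]
symbols = local_symbol_table(SEQUENCE, MODULES)
macros = local_macro_table(SEQUENCE, MODULES)
```

Here `symbols.index("c")` is 3 and `symbols[4]` is also `"c"`, which mirrors the absence of de-duplication.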

Because lower addresses take fewer bytes to encode than higher addresses, writers should place the modules they anticipate referencing the most frequently at the beginning of the encoding module sequence.

Modules in the current segment's encoding module sequence are said to be active, while modules that are defined or imported but which are not in the encoding module sequence are available. E-expressions can only invoke macros in an active module.

For example:

$ion::
(module mod_a
    (macro_table
        (macro foo () Foo)))

// `mod_a` is now available

$ion::
(module mod_b
    (macro_table
        (macro bar () Bar)))

// `mod_b` is now available

$ion::
(encoding mod_a)

// `mod_a` is now active

(:mod_a::foo) // Foo
(:mod_b::bar) // ERROR: `mod_b` is not in the encoding module sequence

The default module

The default module, _, is an empty top-level module that is implicitly defined at the beginning of every stream.

When resolving an unqualified macro name, readers first look for the corresponding macro definition in _. If it is not found in _, they will then look in $ion. If it is still not found, the reader will raise an error.

This makes it possible to leverage macros in a lightweight way; writers do not have to first name/define a custom module to house their macros, and the macros themselves can be invoked in text without having to write out the module name.
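The lookup order for unqualified macro names can be illustrated with a small sketch. Modeling each module's macros as a dictionary from name to definition, and the function name itself, are assumptions made for this example.

```python
from typing import Dict


def resolve_unqualified_macro(name: str,
                              default_module: Dict[str, str],
                              system_module: Dict[str, str]) -> str:
    """Look for `name` in `_` first, then in `$ion`; fail if neither has it."""
    for module in (default_module, system_module):
        if name in module:
            return module[name]
    raise LookupError(f"no macro named {name!r} in _ or $ion")
```

Because `_` is consulted first, a macro defined in the default module takes precedence over a system macro of the same name.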

Macros and symbols can be added to the default module by redefining _. Like all modules, _ can be redefined in terms of itself, making appends and prepends straightforward.

$ion_1_1

// `_` exists, but is empty

$ion::
(module _
    (macro_table
        (macro foo () Foo)))

// `_` now contains macro `foo`

$ion::
(module _
    (macro_table
        _ // Add all macros in `_` to its redefinition
        (macro bar () Bar)))

// `_` now contains macros `foo` and `bar`

(:foo) // Equivalent to `(:_::foo)`
(:bar) // Equivalent to `(:_::bar)`

System macros like add_symbols and add_macros apply their changes to _, so we can rewrite the above more succinctly as:

$ion_1_1

// `_` exists, but is empty

(:add_macros
    (macro foo () Foo)
    (macro bar () Bar))

// `_` now contains macros `foo` and `bar`

(:foo) // Equivalent to `(:_::foo)`
(:bar) // Equivalent to `(:_::bar)`

_ can also be redefined by an import directive.

Default encoding module sequence

At the beginning of a stream, the encoding module sequence contains two modules:

  1. the default module, _
  2. the system module, $ion

Recall that a segment's symbol and macro tables are logical concatenations of those found in the segment's encoding modules. Because _ is empty at the beginning of the stream, the stream's initial symbol and macro tables are identical to those of the system module, $ion.

This is beneficial because it allows all system macros to be invoked from the stream's macro table in a single byte rather than the two-byte sequence needed to invoke them from the system macro table. In this way, a writer can define its macros and symbols in a maximally compact fashion at the head of the stream.

Modifying active modules

If a module binding in the encoding module sequence is redefined, the new module definition replaces the old one in the sequence.

For example, after these directives are evaluated:

$ion::
(module mod_a
    (macro_table
        (macro foo () Foo)
        (macro bar () Bar)))

$ion::
(module mod_b)

$ion::
(module mod_c
    (macro_table
        (macro quux () Quux)
        (macro quuz () Quuz)))

$ion::(encoding mod_a mod_b mod_c)

the encoding sequence is _ mod_a mod_b mod_c, and mod_b is empty.

(:0) // => Foo
(:1) // => Bar
(:2) // => Quux
(:3) // => Quuz

If we then add macros to mod_b, those macros will immediately become available.

$ion::
(module mod_b
    (macro_table
        (macro baz () Baz)))

(:0) // => Foo
(:1) // => Bar
(:2) // => Baz
(:3) // => Quux
(:4) // => Quuz

important

Notice that modifying a module (in this case mod_b) can cause the addresses of all subsequent macros to be modified.
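The address shift can be reproduced with a short sketch. It assumes, purely for illustration, that the encoding module sequence stores binding names and that macro tables are plain lists; `macro_at` is a hypothetical helper.

```python
from typing import Dict, List


def macro_at(address: int, sequence: List[str], bindings: Dict[str, List[str]]) -> str:
    """Resolve a macro address by walking the active modules in order."""
    for module_name in sequence:
        macros = bindings[module_name]  # looked up at call time, so redefinitions are seen
        if address < len(macros):
            return f"{module_name}::{macros[address]}"
        address -= len(macros)
    raise IndexError("macro address out of range")


sequence = ["_", "mod_a", "mod_b", "mod_c"]
bindings = {"_": [], "mod_a": ["foo", "bar"], "mod_b": [], "mod_c": ["quux", "quuz"]}

before = macro_at(2, sequence, bindings)  # mod_b is empty, so address 2 lands in mod_c

bindings["mod_b"] = ["baz"]               # redefine mod_b with one macro
after = macro_at(2, sequence, bindings)   # the same address now lands in mod_b
```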

Clearing the symbol and macro tables

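To clear the local symbol and macro tables, redefine the default module as empty and remove all other modules from the encoding module sequence:

$ion::(module _) // Redefine `_` to be an empty module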
// If other modules are in use, remove them from the encoding module sequence
$ion::(encoding)

You can also consider writing an Ion version marker, which is more compact. The behavior is slightly different, however: an IVM will also add $ion to the encoding module sequence. See the Default encoding module sequence section for details.

The system module

The symbols and macros of the system module $ion are available everywhere within an Ion document, with the version of that module being determined by the spec-version of each segment. The specific system symbols are largely uninteresting to users; while the binary encoding heavily leverages the system symbol table, the text encoding that users typically interact with does not. The system macros are more visible, especially to authors of macros.

This chapter catalogs the system-provided symbols and macros. The examples below use unqualified names, which works assuming no other macros with the same name are in scope. The unambiguous form $ion::macro-name is always available to use.

Relation to local symbol and macro tables

In Ion 1.0, the system symbol table is always the first import of the local symbol table. However, in Ion 1.1, the system symbol and macro tables have a system address space that is distinct from the local address space, but can optionally be included in the user address space.

When starting an Ion 1.1 segment (i.e. immediately after encountering an $ion_1_1 version marker), the system module is in the sequence of active encoding modules immediately following the default module. As a result, both the system macros and system symbols are initially included in the local macro and symbol tables [1]. The system module is not a permanent fixture in the active encoding modules, so (in contrast to Ion 1.0) the system symbols and macros can be removed from the local symbol and macro tables.

System Symbols

The Ion 1.1 System Symbol table replaces rather than extends the Ion 1.0 System Symbol table. The system symbols are as follows:

ID  Hex   Text
0   0x00  <reserved>
1   0x01  $ion
2   0x02  $ion_1_0
3   0x03  $ion_symbol_table
4   0x04  name
5   0x05  version
6   0x06  imports
7   0x07  symbols
8   0x08  max_id
9   0x09  $ion_shared_symbol_table
10  0x0A  encoding
11  0x0B  $ion_literal
12  0x0C  $ion_shared_module
13  0x0D  macro
14  0x0E  macro_table
15  0x0F  symbol_table
16  0x10  module
17  0x11  export
18  0x12  import
19  0x13  flex_symbol
20  0x14  flex_int
21  0x15  flex_uint
22  0x16  uint8
23  0x17  uint16
24  0x18  uint32
25  0x19  uint64
26  0x1A  int8
27  0x1B  int16
28  0x1C  int32
29  0x1D  int64
30  0x1E  float16
31  0x1F  float32
32  0x20  float64
33  0x21  zero-length text (i.e. '')
34  0x22  for
35  0x23  literal
36  0x24  if_none
37  0x25  if_some
38  0x26  if_single
39  0x27  if_multi
40  0x28  none
41  0x29  values
42  0x2A  default
43  0x2B  meta
44  0x2C  repeat
45  0x2D  flatten
46  0x2E  delta
47  0x2F  sum
48  0x30  annotate
49  0x31  make_string
50  0x32  make_symbol
51  0x33  make_decimal
52  0x34  make_timestamp
53  0x35  make_blob
54  0x36  make_list
55  0x37  make_sexp
56  0x38  make_field
57  0x39  make_struct
58  0x3A  parse_ion
59  0x3B  set_symbols
60  0x3C  add_symbols
61  0x3D  set_macros
62  0x3E  add_macros
63  0x3F  use

In Ion 1.1 Text, system symbols can never be referenced by symbol ID; $1 always refers to the first symbol in the user symbol table. This allows the Ion 1.1 system symbol table to be relatively large without taking away SID space from the user symbol table.

System Macros

ID  Hex   Text
0   0x00  none
1   0x01  values
2   0x02  default
3   0x03  meta
4   0x04  repeat
5   0x05  flatten
6   0x06  delta
7   0x07  sum
8   0x08  annotate
9   0x09  make_string
10  0x0A  make_symbol
11  0x0B  make_decimal
12  0x0C  make_timestamp
13  0x0D  make_blob
14  0x0E  make_list
15  0x0F  make_sexp
16  0x10  make_field
17  0x11  make_struct
18  0x12  parse_ion
19  0x13  set_symbols
20  0x14  add_symbols
21  0x15  set_macros
22  0x16  add_macros
23  0x17  use

[1] System symbols require the same number of bytes whether they are encoded using the system symbol or the user symbol encoding. The reasons the system symbols are initially loaded into the user symbol table are twofold: to be consistent with loading the system macros into user space, and so that implementors can start testing user symbols even before they have implemented support for reading encoding directives.

Shared modules

Shared modules exist independently of the documents that use them. They are identified by a catalog key consisting of a string name and an integer version.

The self-declared catalog-names of shared modules are generally long, since they must be more-or-less globally unique. When imported by another module, they are given local symbolic names—a binding—by import declarations.

They have a spec version that is explicit via annotation, and a content version derived from the catalog version. The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N. This allows the module to be serialized using any version of Ion, and its meaning will not change.

$ion_shared_module::
$ion_1_1::("com.example.symtab" 3 
           (symbol_table ...) 
           (macro_table ...) )

Example

An Ion 1.1 shared module.

$ion_shared_module::
$ion_1_1::("org.example.geometry" 2
           (symbol_table ["x", "y", "square", "circle"])
           (macro_table (macro point2d (x y) { x:(%x), y:(%y) })
                        (macro polygon (point2d::points+) [(%points)]) )
)

The system module provides a convenient macro (use) to append a shared module to the encoding module sequence.

$ion_1_1
(:use "org.example.geometry" 2)
(:polygon (:: (1 4) (1 8) (3 6)))

Compatibility with Ion 1.0

Ion 1.0 shared symbol tables are treated as Ion 1.1 shared modules that have an empty macro table.

Ion 1.1 Binary Encoding

A binary Ion stream consists of an Ion version marker followed by a series of value literals and/or encoding expressions.

Both value literals and e-expressions begin with an opcode that indicates what the next expression represents and how the bytes that follow should be interpreted.

Primitives

This section describes Ion 1.1's binary encoding primitives--reusable building blocks that can be combined to represent more complex constructs.

Name       Type    Width
FixedUInt  int     Determined by context
FixedInt   int     Determined by context
FlexUInt   int     Variable, self-delimiting
FlexInt    int     Variable, self-delimiting
FlexSym    symbol  Variable, self-delimiting

FlexUInt

A variable-length unsigned integer.

The bytes of a FlexUInt are written in little-endian byte order. This means that the first bytes will contain the FlexUInt's least significant bits.

The least significant bits in the FlexUInt indicate the number of bytes that were used to encode the integer. If a FlexUInt is N bytes long, its N-1 least significant bits will be 0; a terminal 1 bit will be in the next most significant position.

All bits that are more significant than the terminal 1 represent the magnitude of the FlexUInt.

FlexUInt encoding of 14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
unsigned int 14

FlexUInt encoding of 729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the unsigned  of the unsigned
integer          integer

FlexUInt encoding of 21,043

            ┌───── There are 2 zeros in the least significant bits, so this
            │      integer is three bytes wide.
          ┌─┴─┐
1 0 0 1 1 1 0 0  1 0 0 1 0 0 0 1  0 0 0 0 0 0 1 0
└───┬───┘        └──────┬──────┘  └──────┬──────┘
lowest 5 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer
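The layout shown in these diagrams can be sketched in Python. This is an illustrative encoder and decoder under the rules stated above, not an optimized or normative implementation; the function names are hypothetical.

```python
from typing import Tuple


def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative integers")
    # Each encoded byte carries 7 bits of magnitude and 1 bit of length marker.
    num_bytes = max(1, -(-value.bit_length() // 7))
    # Marker: (num_bytes - 1) zero bits, then a terminal 1; magnitude sits above it.
    encoded = (value << num_bytes) | (1 << (num_bytes - 1))
    return encoded.to_bytes(num_bytes, "little")


def decode_flex_uint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    num_bytes = 1
    pos = offset
    while data[pos] == 0:  # a zero byte means the marker continues in the next byte
        num_bytes += 8
        pos += 1
    first_nonzero = data[pos]
    # Count the trailing zero bits that precede the terminal 1.
    num_bytes += (first_nonzero & -first_nonzero).bit_length() - 1
    if offset + num_bytes > len(data):
        raise ValueError("truncated FlexUInt")
    raw = int.from_bytes(data[offset:offset + num_bytes], "little")
    return raw >> num_bytes, num_bytes
```

For instance, `encode_flex_uint(729)` yields the two bytes `66 0B`, matching the second diagram.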

FlexInt

A variable-length signed integer.

From an encoding perspective, FlexInts are structurally similar to a FlexUInt. Both encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate how many bytes were used to encode the integer. They differ in the interpretation of their bits; while a FlexUInt's bits are unsigned, a FlexInt's bits are encoded using two's complement notation.

TIP: An implementation could choose to read a FlexInt by instead reading a FlexUInt and then reinterpreting its bits as two's complement.

FlexInt encoding of 14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
 2's comp. 14

FlexInt encoding of -14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
 2's comp. -14

FlexInt encoding of 729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer

FlexInt encoding of -729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
1 0 0 1 1 1 1 0  1 1 1 1 0 1 0 0
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer
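A FlexInt codec differs from a FlexUInt codec only in how the payload bits are interpreted. The sketch below is one possible way to express that in Python; the function names are hypothetical, and masking is needed only because Python integers are unbounded.

```python
from typing import Tuple


def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt."""
    # Find the smallest width whose 7 * n payload bits hold `value` in two's complement.
    num_bytes = 1
    while not -(1 << (7 * num_bytes - 1)) <= value < (1 << (7 * num_bytes - 1)):
        num_bytes += 1
    encoded = (value << num_bytes) | (1 << (num_bytes - 1))
    # Truncate to the chosen width to obtain the two's complement bit pattern.
    encoded &= (1 << (8 * num_bytes)) - 1
    return encoded.to_bytes(num_bytes, "little")


def decode_flex_int(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    num_bytes = 1
    pos = offset
    while data[pos] == 0:  # the length marker continues into the next byte
        num_bytes += 8
        pos += 1
    first_nonzero = data[pos]
    num_bytes += (first_nonzero & -first_nonzero).bit_length() - 1
    if offset + num_bytes > len(data):
        raise ValueError("truncated FlexInt")
    # Interpret all bytes as signed, then shift the marker bits away.
    # Python's >> is an arithmetic shift, so the sign is preserved.
    raw = int.from_bytes(data[offset:offset + num_bytes], "little", signed=True)
    return raw >> num_bytes, num_bytes
```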

FixedUInt

A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.

FixedUInt encoding of 3,954,261


0 1 0 1 0 1 0 1  0 1 0 1 0 1 1 0  0 0 1 1 1 1 0 0
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer

FixedInt

A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two's complement.

FixedInt encoding of -3,954,261


1 0 1 0 1 0 1 1  1 0 1 0 1 0 0 1  1 1 0 0 0 0 1 1
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the 2's       the 2's comp.   of the 2's comp.
comp. integer    integer          integer
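Both fixed-width primitives map directly onto Python's built-in integer conversions. In the sketch below the function names are hypothetical, and the `width` argument stands in for the length that the surrounding context would supply.

```python
def encode_fixed_uint(value: int, width: int) -> bytes:
    """Little-endian unsigned integer occupying exactly `width` bytes."""
    return value.to_bytes(width, "little", signed=False)


def encode_fixed_int(value: int, width: int) -> bytes:
    """Little-endian two's complement integer occupying exactly `width` bytes."""
    return value.to_bytes(width, "little", signed=True)


def decode_fixed_uint(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=False)


def decode_fixed_int(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=True)
```

The same three bytes decode differently under the two interpretations, which is why the reader must know from context which primitive it is reading.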

FlexSym

A variable-length symbol token whose UTF-8 bytes can be inline, found in the symbol table, or derived from a macro expansion.

A FlexSym begins with a FlexInt; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:

  • greater than zero, it represents a symbol ID. The symbol’s associated text can be found in the local symbol table. No more bytes follow.
  • less than zero, its absolute value represents a number of UTF-8 bytes that follow the FlexInt. These bytes represent the symbol’s text.
  • exactly zero, another byte follows that is a FlexSymOpCode.

FlexSym encoding of symbol ID $10

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 1 0 1 0 1
└─────┬─────┘
  2's comp.
  positive 10

FlexSym encoding of symbol text 'hello'

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │      h         e        l        l        o
1 1 1 1 0 1 1 1  01101000  01100101 01101100 01101100 01101111
└─────┬─────┘    └─────────────────────┬─────────────────────┘
  2's comp.              5-byte UTF-8 encoded "hello"
  negative 5
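The three-way branch on the leading FlexInt can be sketched as follows. The tagged-tuple return shape and the function names are assumptions made for this illustration; a real reader would likely use its own token types.

```python
from typing import Tuple, Union


def read_flex_int(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, number of bytes consumed) for a FlexInt."""
    num_bytes = 1
    pos = offset
    while data[pos] == 0:
        num_bytes += 8
        pos += 1
    first_nonzero = data[pos]
    num_bytes += (first_nonzero & -first_nonzero).bit_length() - 1
    raw = int.from_bytes(data[offset:offset + num_bytes], "little", signed=True)
    return raw >> num_bytes, num_bytes


def read_flex_sym(data: bytes, offset: int = 0) -> Tuple[Tuple[str, Union[int, str]], int]:
    """Return ((kind, payload), number of bytes consumed) for a FlexSym."""
    value, size = read_flex_int(data, offset)
    if value > 0:
        # Symbol ID; the text lives in the local symbol table.
        return ("symbol_id", value), size
    if value < 0:
        # The absolute value is the number of UTF-8 bytes that follow.
        start = offset + size
        end = start - value
        if end > len(data):
            raise ValueError("truncated FlexSym text")
        return ("inline_text", data[start:end].decode("utf-8")), size - value
    # Zero: one FlexSymOpCode byte follows; the caller decides whether it is legal.
    return ("opcode", data[offset + size]), size + 1
```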

FlexSymOpCode

FlexSymOpCodes are a combination of system symbols and a subset of the general opcodes. The FlexSym parser is not responsible for evaluating a FlexSymOpCode, only returning it—the caller will decide whether the opcode is legal in the current context.

Example usages of the FlexSymOpCode include:

  • Representing SID $0
  • Representing system symbols
    • Note that the empty symbol (i.e. the symbol '') is now a system symbol and can be referenced this way.
  • When used to encode a struct field name, the opcode can invoke a macro that will evaluate to a struct whose key/value pairs are spliced into the parent struct.
  • In a delimited struct, terminating the sequence of (field name, value) pairs with 0xF0.

OpCode Byte  Meaning                                      Additional Notes
0x00 - 0x5F  E-Expression                                 May be used when the FlexSym occurs in the field name position of any struct
0x60         Symbol with unknown text (also known as $0)
0x61 - 0xDF  System SID (with 0x60 bias)                  While the range of 0x61 - 0xDF is reserved for system symbols, not all of these bytes correspond to a system symbol. See system symbols for the list of system symbols.
0xEE         System symbol
0xEF         E-Expression invoking a system macro         May be used when the FlexSym occurs in the field name position of any struct
0xF0         Delimited container end marker               May only be used when the FlexSym occurs in the field name position of a delimited struct
0xF5         Length-prefixed macro invocation             May be used when the FlexSym occurs in the field name position of any struct

FlexSym encoding of '' (empty text) using an opcode

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 0 0 0 0 1   10000001
└─────┬─────┘     └───┬──┘
  2's comp.     FixedUInt 0x81,
  zero          System SID 33
                (the empty symbol)

Opcodes

An opcode is a 1-byte FixedUInt that tells the reader what the next expression represents and how the bytes that follow should be interpreted.

The meaning of each opcode is organized loosely by its high and low nibbles.

High nibble   Low nibble  Meaning
0x0_ to 0x3_  0-F         E-expression with 6-bit address
0x4_          0-F         E-expression with 12-bit address
0x5_          0-F         E-expression with 20-bit address
0x6_          0-8         Integers from 0 to 8 bytes wide
0x6_          9           Reserved
0x6_          A-D         Floats
0x6_          E-F         Booleans
0x7_          0-F         Decimals
0x8_          0-C         Short-form timestamps
0x8_          D-F         Reserved
0x9_          0-F         Strings
0xA_          0-F         Symbols with inline text
0xB_          0-F         Lists
0xC_          0-F         S-expressions
0xD_          0           Empty struct
0xD_          1           Reserved
0xD_          2-F         Structs
0xE_          0           Ion version marker
0xE_          1-3         Symbols with symbol address
0xE_          4-6         Annotations with symbol address
0xE_          7-9         Annotations with FlexSym text
0xE_          A           null.null
0xE_          B           Typed nulls
0xE_          C-D         NOP
0xE_          E           System symbol
0xE_          F           System macro invocation
0xF_          0           Delimited container end
0xF_          1           Delimited list start
0xF_          2           Delimited S-expression start
0xF_          3           Delimited struct start
0xF_          4           E-expression with FlexUInt macro address
0xF_          5           E-expression with FlexUInt length prefix
0xF_          6           Integer with FlexUInt length prefix
0xF_          7           Decimal with FlexUInt length prefix
0xF_          8           Timestamp with FlexUInt length prefix
0xF_          9           String with FlexUInt length prefix
0xF_          A           Symbol with FlexUInt length prefix and inline text
0xF_          B           List with FlexUInt length prefix
0xF_          C           S-expression with FlexUInt length prefix
0xF_          D           Struct with FlexUInt length prefix
0xF_          E           Blob with FlexUInt length prefix
0xF_          F           Clob with FlexUInt length prefix

Values

Nulls

The opcode 0xEA indicates an untyped null (that is: null, or its alias null.null).

The opcode 0xEB indicates a typed null; a byte follows whose value represents an offset into the following table:

Byte  Type
0x00  null.bool
0x01  null.int
0x02  null.float
0x03  null.decimal
0x04  null.timestamp
0x05  null.string
0x06  null.symbol
0x07  null.blob
0x08  null.clob
0x09  null.list
0x0A  null.sexp
0x0B  null.struct

All other byte values are reserved for future use.

Encoding of null

┌──── The opcode `0xEA` represents a null (null.null)
EA

Encoding of null.string

┌──── The opcode `0xEB` indicates a typed null; a byte indicating the type follows
│  ┌──── Byte 0x05 indicates the type `string`
EB 05
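A writer can produce these encodings with a simple table lookup. The sketch below assumes a hypothetical `encode_null` helper that takes the type name as a string; the byte values are copied from the table above.

```python
from typing import Optional

# Type byte that follows opcode 0xEB, per the typed-null table.
TYPED_NULL_CODES = {
    "bool": 0x00, "int": 0x01, "float": 0x02, "decimal": 0x03,
    "timestamp": 0x04, "string": 0x05, "symbol": 0x06, "blob": 0x07,
    "clob": 0x08, "list": 0x09, "sexp": 0x0A, "struct": 0x0B,
}


def encode_null(ion_type: Optional[str] = None) -> bytes:
    """Encode null (or null.null) as 0xEA, and null.<type> as 0xEB plus a type byte."""
    if ion_type is None or ion_type == "null":
        return bytes([0xEA])
    if ion_type not in TYPED_NULL_CODES:
        raise ValueError(f"unknown Ion type: {ion_type!r}")
    return bytes([0xEB, TYPED_NULL_CODES[ion_type]])
```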

Booleans

0x6E represents boolean true, while 0x6F represents boolean false.

0xEB 0x00 represents null.bool.

Encoding of boolean true

6E

Encoding of boolean false

6F

Encoding of null.bool

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: boolean
│  │
EB 00

Integers

Opcodes in the range 0x60 to 0x68 represent an integer. The opcode is followed by a FixedInt that represents the integer value. The low nibble of the opcode (0x_0 to 0x_8) indicates the size of the FixedInt. Opcode 0x60 represents integer 0; no more bytes follow.

Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF6, followed by a FlexUInt indicating how many bytes of representation data follow.

0xEB 0x01 represents null.int.

Encoding of integer 0
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 0 indicates
││    no more bytes follow.
60
Encoding of integer 17
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 1 indicates
││    a single byte follows.
61 11
    └── FixedInt 17
Encoding of integer -944
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 2 indicates
││    that two bytes follow.
62 50 FC
   └─┬─┘
FixedInt -944
Encoding of integer -944
┌──── Opcode F6 indicates a variable-length integer, FlexUInt length follows
│   ┌─── FlexUInt 2; a 2-byte FixedInt follows
│   │
F6 05 50 FC
      └─┬─┘
   FixedInt -944
Encoding of null.int
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: integer
│  │
EB 01
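
As a rough sketch of the opcode selection described in this section, the Python below picks the smallest FixedInt width and falls back to 0xF6 for wider values. The function names are hypothetical. The multi-byte branch of `flex_uint` relies on the FlexUInt layout defined elsewhere in this specification and should be read as an assumption here.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (assumed layout)."""
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    # The low n bits hold the length marker: a 1 preceded by n-1 zeros.
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_int(value: int) -> bytes:
    """Encode an Ion int using opcodes 0x60-0x68, or 0xF6 for wider values."""
    if value == 0:
        return b"\x60"  # integer 0; no more bytes follow
    n = 1
    while not -(1 << (8 * n - 1)) <= value < (1 << (8 * n - 1)):
        n += 1
    body = value.to_bytes(n, "little", signed=True)  # FixedInt
    if n <= 8:
        return bytes([0x60 + n]) + body
    return b"\xF6" + flex_uint(n) + body

print(encode_int(-944).hex(" "))  # 62 50 fc
```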

Floats

Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:

  • 0 bits (0 bytes), representing the value 0e0 and indicated by opcode 0x6A
  • 16 bits (2 bytes in little-endian order, half-precision), indicated by opcode 0x6B
  • 32 bits (4 bytes in little-endian order, single precision), indicated by opcode 0x6C
  • 64 bits (8 bytes in little-endian order, double precision), indicated by opcode 0x6D

note

In the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.

0xEB 0x02 represents null.float.

Encoding of float 0e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble A indicates
││    a 0-length float; 0e0
6A
Encoding of float 3.14e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble B indicates a 2-byte float
││
6B 47 42
   └─┬─┘
half-precision 3.14
Encoding of float 3.1415927e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble C indicates a 4-byte,
││    single-precision value.
6C DB 0F 49 40   
   └────┬────┘
single-precision 3.1415927
Encoding of float 3.141592653589793e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble D indicates an 8-byte,
││    double-precision value.
6D 18 2D 44 54 FB 21 09 40       
   └──────────┬──────────┘
double-precision 3.141592653589793
Encoding of null.float
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: float
│  │
EB 02
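
One way a writer might choose the smallest lossless float size is sketched below, using Python's standard `struct` module. The function name `encode_float` is hypothetical, and the round-trip comparison is only one possible way to test for losslessness.

```python
import math
import struct

def encode_float(value: float) -> bytes:
    """Encode a float in the smallest size that round-trips exactly."""
    # Opcode 0x6A represents 0e0; negative zero keeps its sign bit and
    # therefore needs an explicit body.
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"
    for opcode, fmt in ((0x6B, "<e"), (0x6C, "<f")):
        try:
            packed = struct.pack(fmt, value)  # little-endian IEEE-754
        except (OverflowError, struct.error):
            continue  # magnitude too large for this size
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed
    # NaN never compares equal to itself, so it falls through to 64 bits.
    return b"\x6D" + struct.pack("<d", value)

print(encode_float(1.5).hex(" "))      # 6b 00 3e
print(encode_float(math.pi).hex(" "))  # 6d 18 2d 44 54 fb 21 09 40
```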

Decimals

If an opcode has a high nibble of 0x7_, it represents a decimal. Low nibble values indicate the number of trailing bytes used to encode the decimal.

The body of the decimal is encoded as a FlexInt representing its exponent, followed by a FixedInt representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0. When the coefficient is present but has a value of 0, the coefficient is -0.

Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF7.

0xEB 0x03 represents null.decimal.

Encoding of decimal 0d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 0 indicates a zero-byte
││    decimal; 0d0
70
Encoding of decimal 7d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 01 07
   |  └─── Coefficient: 1-byte FixedInt 7
   └─── Exponent: FlexInt 0
Encoding of decimal 1.27
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 FD 7F
   |  └─── Coefficient: FixedInt 127
   └─── Exponent: 1-byte FlexInt -2
Variable-length encoding of decimal 1.27
┌──── Opcode F7 indicates a variable-length decimal
│
F7 05 FD 7F
   |  |  └─── Coefficient: FixedInt 127
   |  └───── Exponent: 1-byte FlexInt -2
   └─────── Decimal length: FlexUInt 2
Encoding of 0d3, which has a coefficient of zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 1 indicates a 1-byte decimal
││
71 07
   └────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0
Encoding of -0d3, which has a coefficient of negative zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 07 00
   |  └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0
   └────── Exponent: FlexInt 3
Encoding of null.decimal
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: decimal
│  │
EB 03
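
The decimal layout above (FlexInt exponent, then FixedInt coefficient) can be sketched in Python as follows. All function names are hypothetical, and the `negative_zero` flag is just one way to model a -0 coefficient. The multi-byte FlexInt and FlexUInt branches rely on definitions found elsewhere in this specification and are assumptions here.

```python
def flex_uint(value: int) -> bytes:
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def flex_int(value: int) -> bytes:
    n = 1
    while not -(1 << (7 * n - 1)) <= value < (1 << (7 * n - 1)):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little", signed=True)

def fixed_int(value: int) -> bytes:
    n = 1
    while not -(1 << (8 * n - 1)) <= value < (1 << (8 * n - 1)):
        n += 1
    return value.to_bytes(n, "little", signed=True)

def encode_decimal(exponent: int, coefficient: int,
                   negative_zero: bool = False) -> bytes:
    if exponent == 0 and coefficient == 0 and not negative_zero:
        return b"\x70"  # zero-byte decimal: 0d0
    if coefficient == 0:
        # Absent coefficient means 0; a present coefficient of 0 means -0.
        coeff = b"\x00" if negative_zero else b""
    else:
        coeff = fixed_int(coefficient)
    body = flex_int(exponent) + coeff
    if len(body) <= 15:
        return bytes([0x70 + len(body)]) + body
    return b"\xF7" + flex_uint(len(body)) + body

print(encode_decimal(-2, 127).hex(" "))  # 72 fd 7f  (1.27)
```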

Timestamps

Timestamps have two encodings:

  1. Short-form timestamps, a compact representation optimized for the most commonly used precisions and date ranges.
  2. Long-form timestamps, a less compact representation capable of representing any timestamp in the Ion data model.

0xEB 0x04 represents null.timestamp.

Encoding of null.timestamp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: timestamp
│  │
EB 04

note

In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time.

Short-form Timestamps

If an opcode has a high nibble of 0x8_, it represents a short-form timestamp. This encoding focuses on making the most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via the variable-length long form timestamp encoding.

Timestamps may be encoded using the short form if they meet all of the following conditions:

  • The year is between 1970 and 2097. The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.
  • The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes. 7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2024-08T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.
  • The fractional seconds are a common precision. The timestamp's fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).

Opcodes by precision and offset

Each opcode with a high nibble of 0x8_ indicates a different precision and offset encoding pair.

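Opcode  Precision         Serialized size in bytes [1]  Offset encoding
0x80    Year              1                             Implicitly Unknown offset
0x81    Month             2                             Implicitly Unknown offset
0x82    Day               2                             Implicitly Unknown offset
0x83    Hour and minutes  4                             1 bit to indicate UTC or Unknown Offset
0x84    Seconds           5                             1 bit to indicate UTC or Unknown Offset
0x85    Milliseconds      6                             1 bit to indicate UTC or Unknown Offset
0x86    Microseconds      7                             1 bit to indicate UTC or Unknown Offset
0x87    Nanoseconds       8                             1 bit to indicate UTC or Unknown Offset
0x88    Hour and minutes  5                             7 bits to represent a known offset [2]
0x89    Seconds           5                             7 bits to represent a known offset [2]
0x8A    Milliseconds      7                             7 bits to represent a known offset [2]
0x8B    Microseconds      8                             7 bits to represent a known offset [2]
0x8C    Nanoseconds       9                             7 bits to represent a known offset [2]
0x8D    Reserved          -                             -
0x8E    Reserved          -                             -
0x8F    Reserved          -                             -

[1] Serialized size in bytes does not include the opcode.

[2] This encoding can also represent UTC and Unknown Offset, though it is less compact than opcodes 0x83-0x87 above.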

The body of a short-form timestamp is encoded as a FixedUInt of the size specified by the opcode. This integer is then partitioned into bit-fields representing the timestamp's subfields. Note that endianness does not apply here because the bit-fields are defined over the body interpreted as an integer.

The following letters are used to denote bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants, and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.

The Month and Day subfields are one-based; 0 is not a valid month or day.

Letter code  Number of bits                Subfield
Y            7                             Year
M            4                             Month
D            5                             Day
H            5                             Hour
m            6                             Minute
o            7                             Offset
U            1                             Unknown (0) or UTC (1) offset
s            6                             Second
f            10 (ms), 20 (μs), or 30 (ns)  Fractional second
.            n/a                           Unused

We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.

          7       0 <--- bit position
          |       |
         +=========+
byte 0   |  0xNN   | <-- hex notation for constants like opcodes
         +=========+ <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
     1   |nnnn:nnnn| <-- bits denoted with a `:` as a delimiter to aid in reading
         +---------+ <-- octet boundary within an encoding primitive
         ...
         +---------+
     N   |nnnn:nnnn|
         +=========+

The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant).

note

While this encoding may complicate human reading, it guarantees that the timestamp's subfields (year, month, etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield, fractional_seconds, always begins at the same bit index when present, but can vary in length according to the precision.) This arrangement allows processors to read the little-endian bytes into an integer and then mask the appropriate bit ranges to access the subfields.

Encoding of a timestamp with year precision

         +=========+
byte 0   |  0x80   |
         +=========+
     1   |.YYY:YYYY|
         +=========+

Encoding of a timestamp with month precision

         +=========+
byte 0   |  0x81   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |....:.MMM|
         +=========+

Encoding of a timestamp with day precision

         +=========+
byte 0   |  0x82   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +=========+

Encoding of a timestamp with hour-and-minutes precision at UTC or unknown offset

         +=========+
byte 0   |  0x83   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |....:Ummm|
         +=========+

Encoding of a timestamp with seconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x84   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |....:..ss|
         +=========+

Encoding of a timestamp with milliseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x85   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |....:ffff|
         +=========+

Encoding of a timestamp with microseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x86   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |..ff:ffff|
         +=========+

Encoding of a timestamp with nanoseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x87   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +=========+

Encoding of a timestamp with hour-and-minutes precision at known offset

         +=========+
byte 0   |  0x88   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |....:..oo|
         +=========+

Encoding of a timestamp with seconds precision at known offset

         +=========+
byte 0   |  0x89   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +=========+

Encoding of a timestamp with milliseconds precision at known offset

         +=========+
byte 0   |  0x8A   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |....:..ff|
         +=========+

Encoding of a timestamp with microseconds precision at known offset

         +=========+
byte 0   |  0x8B   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |....:ffff|
         +=========+

Encoding of a timestamp with nanoseconds precision at known offset

         +=========+
byte 0   |  0x8C   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +---------+
     9   |..ff:ffff|
         +=========+

Examples of short-form timestamps

Text                                 Binary
2023T                                80 35
2023-10-15T                          82 35 7D
2023-10-15T11:22:33Z                 84 35 7D CB 1A 02
2023-10-15T11:22:33-00:00            84 35 7D CB 12 02
2023-10-15T11:22:33+01:15            89 35 7D CB EA 85
2023-10-15T11:22:33.444555666+01:15  8C 35 7D CB EA 85 92 61 7F 1A

warning

Opcodes 0x8D, 0x8E, and 0x8F are illegal; they are reserved for future use.
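
The bit-field packing for the UTC-or-unknown-offset opcodes (0x80 through 0x84) can be sketched as below. The function name and its parameters are hypothetical. The sketch omits the fractional-second and known-offset opcodes and does only minimal validation.

```python
def encode_short_timestamp(year, month=None, day=None, hour=None,
                           minute=None, second=None, utc=False) -> bytes:
    """Pack subfields into a little-endian FixedUInt, least significant first."""
    if not 1970 <= year <= 2097:
        raise ValueError("year is outside the short-form range")
    bits = year - 1970                       # Y: 7 bits, biased by 1970
    if month is None:
        return b"\x80" + bits.to_bytes(1, "little")
    bits |= month << 7                       # M: 4 bits, one-based
    if day is None:
        return b"\x81" + bits.to_bytes(2, "little")
    bits |= day << 11                        # D: 5 bits, one-based
    if hour is None:
        return b"\x82" + bits.to_bytes(2, "little")
    if minute is None:
        raise ValueError("hour cannot be specified without minutes")
    bits |= hour << 16                       # H: 5 bits
    bits |= minute << 21                     # m: 6 bits
    bits |= int(utc) << 27                   # U: unknown (0) or UTC (1)
    if second is None:
        return b"\x83" + bits.to_bytes(4, "little")
    bits |= second << 28                     # s: 6 bits
    return b"\x84" + bits.to_bytes(5, "little")

print(encode_short_timestamp(2023, 10, 15, 11, 22, 33, utc=True).hex(" "))
# 84 35 7d cb 1a 02
```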

Long-form Timestamps

Unlike the short-form timestamp encoding, which is limited to the most commonly used timestamp ranges and precisions, the long-form timestamp encoding is capable of representing any valid timestamp.

The long form begins with opcode 0xF8. A FlexUInt follows indicating the number of bytes that were needed to represent the timestamp. The encoding consumes the minimum number of bytes required to represent the timestamp. The declared length can be mapped to the timestamp’s precision as follows:

Length     Corresponding precision
0          Illegal
1          Illegal
2          Year
3          Month or Day (see below)
4          Illegal; the hour cannot be specified without also specifying minutes
5          Illegal
6          Minutes
7          Seconds
8 or more  Fractional seconds

Unlike the short-form encoding, the long-form encoding reserves:

  • 14 bits for the year (Y), which is not biased.
  • 12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.

Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the timestamp are encoded as bit-fields on a FixedUInt that corresponds to the length that followed the opcode.

If the timestamp's overall length is greater than or equal to 8, the FixedUInt part of the timestamp is 7 bytes and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a (scale, coefficient) pair, which is similar to a decimal. The primary difference is that the scale represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to 1.0 or less than 0.0. The scale is encoded as a FlexUInt (instead of FlexInt) to discourage the encoding of decimal numbers greater than 1.0. The coefficient is encoded as a FixedUInt (instead of FixedInt) to prevent the encoding of fractional seconds less than 0.0. Note that validation is still required; namely:

  • A scale value of 0 is illegal, as that would result in a fractional seconds value of 1.0 (a whole second) or more.
  • If coefficient * 10^-scale >= 1.0, that (scale, coefficient) pair is illegal.

If the timestamp's length is 3, the precision is determined by inspecting the day (DDDDD) bits. Like the short-form, the Month and Day subfields are one-based (0 is not a valid month or day). If the day subfield is zero, that indicates month precision. If the day subfield is any non-zero number, that indicates day precision.

Encoding of the body of a long-form timestamp

         +=========+
byte 0   |YYYY:YYYY|
         +=========+
     1   |MMYY:YYYY|
         +---------+
     2   |HDDD:DDMM|
         +---------+
     3   |mmmm:HHHH|
         +---------+
     4   |oooo:oomm|
         +---------+
     5   |ssoo:oooo|
         +---------+
     6   |....:ssss|
         +=========+
     7   |FlexUInt | <-- scale of the fractional seconds
         +---------+
         ...
         +=========+
     N   |FixedUInt| <-- coefficient of the fractional seconds
         +---------+
         ...

Examples of long-form timestamps

Text                           Binary
1947T                          F8 05 9B 07
1947-12T                       F8 07 9B 07 03
1947-12-23T                    F8 07 9B 07 5F
1947-12-23T11:22:33-00:00      F8 0F 9B 07 DF 65 FD 7F 08
1947-12-23T11:22:33+01:15      F8 0F 9B 07 DF 65 AD 57 08
1947-12-23T11:22:33.127+01:15  F8 13 9B 07 DF 65 AD 57 08 07 7F
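
The long-form layout can be sketched in Python as follows. The function name and parameters are hypothetical. Fractional seconds are passed as a `(scale, coefficient)` tuple. Writing a zero coefficient in zero bytes and the multi-byte FlexUInt branch are both assumptions of this sketch, not statements of the specification.

```python
def flex_uint(value: int) -> bytes:
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_long_timestamp(year, month=None, day=None, hour=None, minute=None,
                          second=None, fraction=None,
                          offset_minutes=None) -> bytes:
    bits = year                                   # Y: 14 bits, not biased
    if month is None:
        body = bits.to_bytes(2, "little")
    else:
        bits |= month << 14                       # M: 4 bits
        bits |= (day or 0) << 18                  # D: 5 bits; 0 = month precision
        if hour is None:
            body = bits.to_bytes(3, "little")
        else:
            # o: 12 bits, minutes from -24:00; all ones means unknown offset.
            offset = 0xFFF if offset_minutes is None else offset_minutes + 1440
            bits |= hour << 23 | minute << 28 | offset << 34
            if second is None:
                body = bits.to_bytes(6, "little")
            else:
                bits |= second << 46              # s: 6 bits
                body = bits.to_bytes(7, "little")
                if fraction is not None:
                    scale, coefficient = fraction
                    if scale == 0 or not 0 <= coefficient < 10 ** scale:
                        raise ValueError("fractional seconds must be in [0, 1)")
                    width = (coefficient.bit_length() + 7) // 8
                    body += flex_uint(scale) + coefficient.to_bytes(width, "little")
    return b"\xF8" + flex_uint(len(body)) + body

print(encode_long_timestamp(1947, 12, 23, 11, 22, 33, (3, 127), 75).hex(" "))
# f8 13 9b 07 df 65 ad 57 08 07 7f
```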

Strings

If the high nibble of the opcode is 0x9_, it represents a string. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0x90 represents a string with empty text ("").

Strings longer than 15 bytes can be encoded with the 0xF9 opcode, which takes a FlexUInt-encoded length after the opcode.

0xEB 0x05 represents null.string.

Encoding of the empty string, ""

┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
90

Encoding of a 14-byte string

┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes

Encoding of a 24-byte string

┌──── Opcode F9 indicates a variable-length string
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes

Encoding of null.string

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: string
│  │
EB 05
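
A minimal Python sketch of the string opcode selection follows. The function names are hypothetical, and the multi-byte branch of `flex_uint` assumes the FlexUInt layout defined elsewhere in this specification.

```python
def flex_uint(value: int) -> bytes:
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_string(text: str) -> bytes:
    utf8 = text.encode("utf-8")
    if len(utf8) <= 15:
        # The low nibble of opcodes 0x90-0x9F carries the UTF-8 byte count.
        return bytes([0x90 + len(utf8)]) + utf8
    # Longer strings use 0xF9 followed by a FlexUInt byte count.
    return b"\xF9" + flex_uint(len(utf8)) + utf8

print(encode_string("fourteen bytes").hex(" "))
# 9e 66 6f 75 72 74 65 65 6e 20 62 79 74 65 73
```

Note that the length counts UTF-8 bytes, not characters, so text containing non-ASCII characters may cross the 15-byte threshold sooner than its character count suggests.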

Symbols

Symbols With Inline Text

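If the high nibble of the opcode is 0xA_, it represents a symbol whose text follows the opcode. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0xA0 represents a symbol with empty text ('').

Symbols with more than 15 bytes of inline text can be encoded with the 0xFA opcode, which takes a FlexUInt-encoded length after the opcode.

0xEB 0x06 represents null.symbol.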

Encoding of a symbol with empty text ('')
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
A0
Encoding of a symbol with 14 bytes of inline text
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes
Encoding of a symbol with 24 bytes of inline text
┌──── Opcode FA indicates a variable-length symbol with inline text
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
FA 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes
Encoding of null.symbol
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: symbol
│  │
EB 06

Symbols With a Symbol Address

Symbol values whose text can be found in the local symbol table are encoded using opcodes 0xE1 through 0xE3:

  • 0xE1 represents a symbol whose address in the symbol table (aka its symbol ID) is a 1-byte FixedUInt that follows the opcode.
  • 0xE2 represents a symbol whose address in the symbol table is a 2-byte FixedUInt that follows the opcode.
  • 0xE3 represents a symbol whose address in the symbol table is a FlexUInt that follows the opcode.

Writers MUST encode a symbol address in the smallest number of bytes possible. For each opcode above, the symbol address that is decoded is biased by the number of addresses that can be encoded in fewer bytes.

Opcode  Symbol address range  Bias
0xE1    0 to 255              0
0xE2    256 to 65,791         256
0xE3    65,792 to infinity    65,792
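
The biasing rule in the table above can be sketched as follows. The function names are hypothetical, and the multi-byte branch of `flex_uint` assumes the FlexUInt layout defined elsewhere in this specification.

```python
def flex_uint(value: int) -> bytes:
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_symbol_address(address: int) -> bytes:
    """Encode a symbol address with the smallest applicable opcode."""
    if address < 0:
        raise ValueError("symbol addresses are non-negative")
    if address < 256:
        return b"\xE1" + address.to_bytes(1, "little")          # bias 0
    if address < 65_792:
        return b"\xE2" + (address - 256).to_bytes(2, "little")  # bias 256
    return b"\xE3" + flex_uint(address - 65_792)                # bias 65,792

print(encode_symbol_address(256).hex(" "))  # e2 00 00
```

Because each opcode subtracts the number of addresses reachable with a shorter encoding, every address has exactly one shortest representation under this scheme.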

System Symbols

System symbols (that is, symbols defined in the system module) can be encoded using the 0xEE opcode followed by a 1-byte FixedUInt representing an index in the system symbol table.

Unlike Ion 1.0, symbols are not required to use the lowest available SID for a given text, and system symbols MAY be encoded using other SIDs.

Encoding of the system symbol $ion
┌──── Opcode 0xEE indicates a system symbol
│  ┌─── FixedUInt 1 indicates system symbol 1
│  │
EE 01

Binary Data

Blobs

Opcode FE indicates a blob of binary data. A FlexUInt follows that represents the blob's byte-length.

0xEB 0x07 represents null.blob.

Example blob encoding
┌──── Opcode FE indicates a blob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data
Encoding of null.blob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: blob
│  │
EB 07

Clobs

Opcode FF indicates a clob--binary character data of an unspecified encoding. A FlexUInt follows that represents the clob's byte-length.

0xEB 0x08 represents null.clob.

Example clob encoding

┌──── Opcode FF indicates a clob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data

Encoding of null.clob

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: clob
│  │
EB 08

Lists

Length-prefixed encoding

An opcode with a high nibble of 0xB_ indicates a length-prefixed list. The lower nibble of the opcode indicates how many bytes were used to encode the child values that the list contains.

If the list's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFB opcode to write a variable-length list. The 0xFB opcode is followed by a FlexUInt that indicates the list's byte length.

0xEB 0x09 represents null.list.

Length-prefixed encoding of an empty list ([])
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 0 indicates that the child values of this
││    list took zero bytes to encode.
B0
Length-prefixed encoding of [1, 2, 3]
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this
││    list took six bytes to encode.
B6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Length-prefixed encoding of ["variable length list"]
┌──── Opcode 0xFB indicates a variable-length list. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     l  i  s  t
FB 2d F9 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element
Encoding of null.list
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: list
│  │
EB 09
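
A length-prefixed list can be assembled from already-encoded child values, as in the sketch below. The function names are hypothetical, and the multi-byte branch of `flex_uint` assumes the FlexUInt layout defined elsewhere in this specification.

```python
from typing import Iterable

def flex_uint(value: int) -> bytes:
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_list(children: Iterable[bytes]) -> bytes:
    """Wrap pre-encoded child values in a length-prefixed list."""
    body = b"".join(children)
    if len(body) <= 15:
        # The low nibble of opcodes 0xB0-0xBF is the body's byte length.
        return bytes([0xB0 + len(body)]) + body
    return b"\xFB" + flex_uint(len(body)) + body

one, two, three = (bytes.fromhex(h) for h in ("6101", "6102", "6103"))
print(encode_list([one, two, three]).hex(" "))  # b6 61 01 61 02 61 03
```

The writer has to know the total encoded size of the children before it can emit the opcode, which is the cost that the delimited encoding avoids.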

Delimited encoding

Opcode 0xF1 begins a delimited list, while opcode 0xF0 closes the most recently opened delimited container that has not yet been closed.

Delimited encoding of an empty list ([])
┌──── Opcode 0xF1 indicates a delimited list
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F1 F0
Delimited encoding of [1, 2, 3]
┌──── Opcode 0xF1 indicates a delimited list
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F1 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Delimited encoding of [1, [2], 3]
┌──── Opcode 0xF1 indicates a delimited list
│        ┌─── Opcode 0xF1 begins a nested delimited list
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested list.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and 
│        │        │        │    still open) delimited container: the outer list.
│        │        │        │
F1 61 01 F1 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3
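
The way 0xF0 always closes the most recently opened container maps naturally onto a stack. The sketch below is a toy reader, not a conforming parser: the name `read_delimited_list` is hypothetical, and it understands only opcodes 0xF1, 0xF0, and 0x61 (a one-byte integer), which is enough for the examples above.

```python
def read_delimited_list(data: bytes) -> list:
    root = None
    open_lists = []              # stack of lists that have not been closed
    pos = 0
    while pos < len(data):
        opcode = data[pos]
        pos += 1
        if opcode == 0xF1:       # begin a delimited list
            new_list = []
            if open_lists:
                open_lists[-1].append(new_list)
            elif root is None:
                root = new_list
            else:
                raise ValueError("data after the top-level list")
            open_lists.append(new_list)
        elif opcode == 0xF0:     # close the most recently opened container
            if not open_lists:
                raise ValueError("0xF0 with no open container")
            open_lists.pop()
        elif opcode == 0x61:     # integer with a 1-byte FixedInt body
            if not open_lists:
                raise ValueError("value outside of a list")
            value = int.from_bytes(data[pos:pos + 1], "little", signed=True)
            open_lists[-1].append(value)
            pos += 1
        else:
            raise ValueError(f"opcode 0x{opcode:02X} is not handled here")
    if root is None or open_lists:
        raise ValueError("unterminated delimited list")
    return root

print(read_delimited_list(bytes.fromhex("F16101F16102F06103F0")))
# [1, [2], 3]
```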

S-Expressions

S-expressions use the same encodings as lists, but with different opcodes.

Opcode     Encoding
0xC0-0xCF  Length-prefixed S-expression; low nibble of the opcode represents the byte-length.
0xFC       Variable-length prefixed S-expression; a FlexUInt following the opcode represents the byte-length.
0xF2       Starts a delimited S-expression; 0xF0 closes the most recently opened delimited container.

0xEB 0x0A represents null.sexp.

Length-prefixed encoding

Length-prefixed encoding of an empty S-expression (())
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 0 indicates that the child values of this S-expression
││    took zero bytes to encode.
C0
Length-prefixed encoding of (1 2 3)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 6 indicates that the child values of this S-expression
││    took six bytes to encode.
C6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Length-prefixed encoding of ("variable length sexp")
┌──── Opcode 0xFC indicates a variable-length sexp. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  e  x  p
FC 2D F9 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element
Encoding of null.sexp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: sexp
│  │
EB 0A

Delimited encoding

Delimited encoding of an empty S-expression (())
┌──── Opcode 0xF2 indicates a delimited S-expression
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F2 F0
Delimited encoding of (1 2 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F2 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Delimited encoding of (1 (2) 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│        ┌─── Opcode 0xF2 begins a nested delimited S-expression
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested S-expression.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and
│        │        │        │    still open) delimited container: the outer S-expression.
│        │        │        │
F2 61 01 F2 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3

Structs

Length-prefixed encoding

If the high nibble of the opcode is 0xD_, it represents a struct. The lower nibble of the opcode indicates how many bytes were used to encode all of its nested (field name, value) pairs. Opcode 0xD0 represents an empty struct.

warning

Opcode 0xD1 is illegal. Non-empty structs must have at least two bytes: a field name and a value.

If the struct's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFD opcode to write a variable-length struct. The 0xFD opcode is followed by a FlexUInt that indicates the byte length.

Each field in the struct is encoded as a FlexUInt representing the address of the field name's text in the symbol table, followed by an opcode-prefixed value.

0xEB 0x0B represents null.struct.

Length-prefixed encoding of an empty struct ({})
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode
D0
Length-prefixed encoding of {$10: 1, $11: 2}
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│  ┌─── Field name: FlexUInt 10 ($10)
│  │        ┌─── Field name: FlexUInt 11 ($11)
│  │        │
D6 15 61 01 17 61 02
      └─┬─┘    └─┬─┘
        1        2
Length-prefixed encoding of {$10: "variable length struct"}
 ┌───────────── Opcode `FD` indicates a struct with a FlexUInt length prefix
 │  ┌────────── Length: FlexUInt 25
 │  │  ┌─────── Field name: FlexUInt 10 ($10)
 │  │  │  ┌──── Opcode `F9` indicates a variable length string
 │  │  │  │  ┌─ FlexUInt 22: the string is 22 bytes long
 │  │  │  │  │  v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  t  r  u  c  t
FD 33 15 F9 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74
               └─────────────────────────────┬─────────────────────────────────┘
                                        UTF-8 bytes
Encoding of null.struct
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: struct
│  │
EB 0B
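
The default struct layout (FlexUInt field name followed by an opcode-prefixed value) can be sketched as below. The function names are hypothetical; fields are passed as `(symbol_address, encoded_value)` pairs. The multi-byte branch of `flex_uint` assumes the FlexUInt layout defined elsewhere in this specification.

```python
from typing import Iterable, Tuple

def flex_uint(value: int) -> bytes:
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_struct(fields: Iterable[Tuple[int, bytes]]) -> bytes:
    """Encode a length-prefixed struct with FlexUInt field names."""
    body = b""
    for address, value in fields:
        if address < 1:
            # FlexUInt 0 in field name position has a reserved meaning, so
            # this sketch only accepts addresses of 1 or more.
            raise ValueError("address cannot be written as a plain FlexUInt")
        body += flex_uint(address) + value
    if len(body) <= 15:
        return bytes([0xD0 + len(body)]) + body
    return b"\xFD" + flex_uint(len(body)) + body

print(encode_struct([(10, bytes.fromhex("6101")),
                     (11, bytes.fromhex("6102"))]).hex(" "))
# d6 15 61 01 17 61 02
```

Since every field occupies at least two bytes, this function never produces the illegal opcode 0xD1.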

Optional FlexSym field name encoding

By default, all struct field names are encoded as FlexUInt symbol addresses. However, a writer has the option of encoding the field names as FlexSyms instead, granting additional flexibility at the expense of some compactness.

Writing field names as FlexSyms allows the writer to:

  • encode the UTF-8 bytes of the field name inline (for example, to avoid modifying the symbol table).
  • call a macro whose output (another struct) will be merged into the current struct.
  • encode the field name as a symbol address if it's already in the symbol table. (just like a FlexUInt would, but slightly less compactly.)

To switch to FlexSym field names, the writer emits a FlexUInt zero (byte 0x01) in field name position to inform the reader that subsequent field names will be encoded as FlexSyms.

This switch is one way. Once the writer switches to using FlexSym, the encoding cannot be switched back to FlexUInt for the remainder of the struct.

Switching to FlexSym while encoding {$10: 1, foo: 2, $11: 3}

In this example, the writer switches to FlexSym field names before encoding foo so it can write the UTF-8 bytes inline.

┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│  ┌─── Field name: FlexUInt 10 ($10)
│  │        ┌─── FlexUInt 0: Switch to FlexSym field name encoding
│  │        │
│  │        │  ┌─── FlexSym: 3 UTF-8 bytes follow
│  │        │  │                 ┌─── Field name: FlexSym 11 ($11)
│  │        │  │   f  o  o       │
DD 15 61 01 01 FB 66 6F 6F 61 02 17 61 03
      └─┬─┘                └─┬─┘    └─┬─┘
        1                    2        3

note

Because FlexUInt zero indicates a mode switch, encoding symbol ID $0 requires switching to FlexSym.

Length-prefixed encoding of {$0: 1}
┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 6
││ ┌── FlexUInt 0 in the field name position indicates that the struct
││ │   is switching to FlexSym mode
││ │  ┌── FlexSym "escape"
││ │  │  ┌── Symbol address: 1-byte FixedUInt follows
││ │  │  │  ┌─ FixedUInt 0
││ │  │  │  │
D6 01 01 E1 00 61 01
      └───┬──┘ └─┬─┘
         $0      1

Delimited encoding

Opcode 0xF3 indicates the beginning of a delimited struct. Unlike length-prefixed structs, delimited structs always encode their field names as FlexSyms.

Unlike lists and S-expressions, structs cannot use opcode 0xF0 by itself to indicate the end of the delimited container. This is because 0xF0 is a valid FlexSym (a symbol with 16 bytes of inline text). To close the delimited struct, the writer emits a 0x01 byte (a FlexSym escape) followed by the opcode 0xF0.

note

It is much more compact to write 0xD0, the empty length-prefixed struct.

Delimited encoding of the empty struct ({})

┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│  ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │  ┌─── Opcode 0xF0 indicates the end of the most
│  │  │    recently opened delimited container
F3 01 F0

Delimited encoding of {"foo": 1, $11: 2}

┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│
│  ┌─ FlexSym -3     ┌─ FlexSym: 11 ($11)
│  │                 │        ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │                 │        │  ┌─── Opcode 0xF0 indicates the end of the most
│  │   f  o  o       │        │  │    recently opened delimited container
F3 FB 66 6F 6F 61 01 17 61 02 01 F0
      └──┬───┘ └─┬─┘    └─┬─┘
      3 UTF-8    1        2
       bytes
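
As an illustration of the delimited form, the following Python sketch writes FlexSym field names and closes the struct with the 0x01 0xF0 pair. All function names are hypothetical. The sketch only covers symbol addresses greater than zero and non-empty inline text; the escape forms (such as $0) are left out.

```python
def flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (two's complement, little-endian)."""
    n = 1
    while not -(1 << (7 * n - 1)) <= value < (1 << (7 * n - 1)):
        n += 1
    raw = ((value << n) | (1 << (n - 1))) & ((1 << (8 * n)) - 1)
    return raw.to_bytes(n, "little")


def flex_sym(name) -> bytes:
    """Positive FlexInt = symbol address; negative FlexInt = byte count of inline UTF-8 text."""
    if isinstance(name, int):
        if name <= 0:
            raise ValueError("$0 needs the FlexSym escape, which this sketch omits")
        return flex_int(name)
    utf8 = name.encode("utf-8")
    if not utf8:
        raise ValueError("empty text needs the FlexSym escape, which this sketch omits")
    return flex_int(-len(utf8)) + utf8


def delimited_struct(fields) -> bytes:
    """fields: list of (symbol address or text, already-encoded tagged value bytes)."""
    out = bytearray([0xF3])          # start of delimited struct
    for name, value in fields:
        out += flex_sym(name) + value
    out += bytes([0x01, 0xF0])       # FlexSym escape, then the END opcode
    return bytes(out)


print(delimited_struct([]).hex(" "))
print(delimited_struct([("foo", bytes([0x61, 0x01])), (11, bytes([0x61, 0x02]))]).hex(" "))
```

The second call corresponds to the {"foo": 1, $11: 2} diagram; the first corresponds to the empty struct.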

Encoding Expressions

note

This chapter focuses on the binary encoding of e-expressions. Macros by example explains what they are and how they are used.

E-expression with the address in the opcode

If the value of the opcode is less than 64 (0x40), it represents an E-expression invoking the macro at the corresponding address, which is an offset within the local macro table.

Invocation of macro address 7

┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
07
└── FixedUInt 7

Invocation of macro address 31

┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
1F
└── FixedUInt 31

Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section Macro calling conventions. (TODO: Link)

E-expressions with biased FixedUInt addresses

While E-expressions invoking macro addresses in the range [0, 63] can be encoded in a single byte using E-expressions with the address in the opcode, many applications will benefit from defining more than 64 macros. The 0x4_ and 0x5_ opcodes can be used to represent macro addresses up to 1,052,735 (the 0x5F bias of 987,200 plus the largest 2-byte FixedUInt, 65,535). In both encodings, the address is biased by the total number of addresses with lower opcodes.

If the high nibble of the opcode is 0x4_, then a biased address follows as a 1-byte FixedUInt. For 0x4_, the bias is 256 * low_nibble + 64 (or (low_nibble << 8) + 64).

If the high nibble of the opcode is 0x5_, then a biased address follows as a 2-byte FixedUInt. For 0x5_, the bias is 65536 * low_nibble + 4160 (or (low_nibble << 16) + 4160).

Invocation of macro address 841

┌──── Opcode in range 40-4F indicates a macro address with 1-byte FixedUInt address
│┌─── Low nibble 3 indicates bias of 832
││
43 09
   │
   └─── FixedUInt 9

Biased Address : 9
Bias : 832
Address : 841

Invocation of macro address 142918

┌──── Opcode in range 50-5F indicates a macro address with 2-byte FixedUInt address
│┌─── Low nibble 2 indicates bias of 135232
││
52 06 1E
   └─┬─┘
     └─── FixedUInt 7686

Biased Address : 7686
Bias : 135232
Address : 142918

Macro address range biases for 0x4_ and 0x5_

Low nibble   0x4_ bias   0x5_ bias
0            64          4160
1            320         69696
2            576         135232
3            832         200768
4            1088        266304
5            1344        331840
6            1600        397376
7            1856        462912
8            2112        528448
9            2368        593984
A            2624        659520
B            2880        725056
C            3136        790592
D            3392        856128
E            3648        921664
F            3904        987200

E-expression with the address as a trailing FlexUInt

The opcode 0xF4 indicates an e-expression whose address is encoded as a trailing FlexUInt with no bias. This encoding is less compact for addresses that can be encoded using opcodes 0x5F and below, but it is the only encoding that can be used for macro addresses greater than 1,052,735.

Invocation of macro address 4
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 09
   │
   └─── FlexUInt 4
Invocation of macro address 1_100_000
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 04 47 86
   └──┬───┘
      └─── FlexUInt 1,100,000
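
The three address encodings can be combined into one writer-side routine. The Python sketch below picks the most compact form for a given address; the function names are hypothetical, and the upper bound of the 0x5_ range is derived from the bias formula (987200 + 65535) rather than quoted from elsewhere.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    n = max(1, -(-value.bit_length() // 7))
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def encode_macro_address(address: int) -> bytes:
    """Return the opcode (and any trailing address bytes) for an e-expression."""
    if address < 0:
        raise ValueError("macro addresses are non-negative")
    if address < 64:
        return bytes([address])                      # address in the opcode
    if address < 4160:
        offset = address - 64                        # remove the base bias
        return bytes([0x40 | (offset >> 8), offset & 0xFF])
    if address <= 987200 + 65535:
        offset = address - 4160
        low_nibble = offset >> 16
        return bytes([0x50 | low_nibble]) + (offset & 0xFFFF).to_bytes(2, "little")
    return bytes([0xF4]) + flex_uint(address)        # trailing FlexUInt, no bias


for addr in (7, 31, 841, 142918, 1_100_000):
    print(addr, encode_macro_address(addr).hex(" "))
```

A reader would perform the inverse: for 0x4_ it adds 256 * low_nibble + 64 to the 1-byte FixedUInt, and for 0x5_ it adds 65536 * low_nibble + 4160 to the 2-byte FixedUInt.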

System Macro Invocations

E-expressions that invoke a system macro can be encoded using the 0xEF opcode followed by a 1-byte FixedUInt representing an index in the system macro table.

Encoding of the system macro values
┌──── Opcode 0xEF indicates a system symbol or macro invocation
│  ┌─── FixedInt 1 indicates macro 1 from the system macro table
│  │
EF 01

In addition, system macros MAY be invoked using any of the 0x00-0x5F or 0xF4-0xF5 opcodes, provided that the macro being invoked has been given an address in user macro address space.

E-expression argument encoding

The example invocations in prior sections have demonstrated how to encode an invocation of the simplest form of macro: one with no parameters. This section explains how to encode macro invocations when they take parameters of different encodings and cardinalities.

To begin, we will examine how arguments are encoded when all of the macro's parameters use the tagged encoding and have a cardinality of exactly-one.

Tagged encoding

When a macro parameter does not specify an encoding (the parameter name is not annotated), arguments passed to that parameter use the 'tagged' encoding. The argument begins with a leading opcode that dictates how to interpret the bytes that follow.

This is the same encoding used for values in other Ion 1.1 contexts like lists, s-expressions, or at the top level.

Encoding a single exactly-one argument

A parameter with a cardinality of exactly-one expects its corresponding argument to be encoded as a single expression of the parameter's declared encoding. (The following section will explore the available encodings in greater depth; for now, our examples will be limited to parameters using the tagged encoding.)

When the macro has a single exactly-one parameter, the corresponding encoded argument follows the opcode and (if separate) the encoded address.

Example encoding of an e-expression with a tagged, exactly-one argument

Macro definition
(:set_macros
  (foo (x) /*...*/)
)
Text e-expression
(:foo 1)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression invoking
│     the macro at address 0.
│    ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│  ┌─┴─┐
00 61 01
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│  ┌──── FlexUInt 0: Macro address 0
│  │    ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│  │  ┌─┴─┐
F4 01 61 01

Encoding multiple exactly-one arguments

If the macro has more than one parameter, a reader would iterate over the parameters declared in the macro signature from left to right. For each parameter, the reader would use the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing of the e-expression's arguments is complete.

Example encoding of an e-expression with multiple tagged, exactly-one arguments

Macro definition
(:set_macros
  (foo (a b c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│    ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│    │     ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│    │     │     ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 61 01 61 02 61 03
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│  ┌──── FlexUInt 0: Macro address 0
│  │    ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│  │    │     ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│  │    │     │     ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│  │    │     │     │
│  │  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
F4 01 61 01 61 02 61 03

Tagless Encodings

In contrast to the tagged encoding, tagless encodings do not begin with an opcode. This means that they are potentially more compact than a tagged type, but are also less flexible. Because tagless encodings do not have an opcode, they cannot represent E-expressions, annotation sequences, or null values of any kind.

Tagless encodings consist of the primitive encodings and macro shapes.

Primitive encodings

Primitive encodings are self-delineating, either by having a statically known size in bytes or by including length information in their serialized form.

Ion type   Primitive encoding   Size in bytes   Encoding
int        uint8                1               FixedUInt
int        uint16               2               FixedUInt
int        uint32               4               FixedUInt
int        uint64               8               FixedUInt
int        flex_uint            variable        FlexUInt
int        int8                 1               FixedInt
int        int16                2               FixedInt
int        int32                4               FixedInt
int        int64                8               FixedInt
int        flex_int             variable        FlexInt
float      float16              2               Little-endian IEEE-754 half-precision float
float      float32              4               Little-endian IEEE-754 single-precision float
float      float64              8               Little-endian IEEE-754 double-precision float
symbol     flex_sym             variable        FlexSym

Example encoding of an e-expression with primitive, exactly-one arguments

As first demonstrated in Encoding multiple exactly-one arguments, the bytes of the serialized arguments begin immediately after the e-expression's opcode and (if separate) the macro address. The reader iterates over the parameters in the macro signature in the order they are declared. For each parameter, the reader uses the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing is complete.

Macro definition
(:set_macros
  (foo (flex_uint::a int8::b uint16::c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌─── Argument 'a': FlexUInt 1
│  │  ┌─── Argument 'b': 1-byte FixedInt 2
│  │  │    ┌─── Argument 'c': 2-byte FixedUInt 3
│  │  │  ┌─┴─┐
00 03 02 03 00
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│  ┌──── FlexUInt 0: Macro address 0
│  │  ┌─── Argument 'a': FlexUInt 1
│  │  │  ┌─── Argument 'b': 1-byte FixedInt 2
│  │  │  │    ┌─── Argument 'c': 2-byte FixedUInt 3
│  │  │  │  ┌─┴─┐
F4 01 03 02 03 00
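
The reader loop described above can be sketched in Python as follows. The table `FIXED` and the functions `read_flex` and `read_args` are hypothetical names; the sketch covers only the integer primitive encodings and does not handle Flex encodings longer than 8 bytes or truncated input.

```python
# Size in bytes and signedness of the fixed-width integer encodings.
FIXED = {
    "uint8": (1, False), "uint16": (2, False), "uint32": (4, False), "uint64": (8, False),
    "int8": (1, True), "int16": (2, True), "int32": (4, True), "int64": (8, True),
}


def read_flex(buf: bytes, pos: int, signed: bool):
    """Decode a FlexUInt (signed=False) or FlexInt (signed=True) starting at pos."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("encodings longer than 8 bytes are out of scope here")
    n = (first & -first).bit_length()  # trailing zero bits + 1 = encoded size
    raw = int.from_bytes(buf[pos:pos + n], "little", signed=signed)
    return raw >> n, pos + n


def read_args(buf: bytes, pos: int, encodings):
    """Walk the signature left to right, decoding one argument per parameter."""
    args = []
    for encoding in encodings:
        if encoding in FIXED:
            size, signed = FIXED[encoding]
            value = int.from_bytes(buf[pos:pos + size], "little", signed=signed)
            pos += size
        elif encoding in ("flex_uint", "flex_int"):
            value, pos = read_flex(buf, pos, signed=(encoding == "flex_int"))
        else:
            raise ValueError(f"unsupported encoding in this sketch: {encoding}")
        args.append(value)
    return args, pos


stream = bytes.fromhex("00 03 02 03 00")  # opcode 0x00, then the three arguments
print(read_args(stream, 1, ["flex_uint", "int8", "uint16"]))
```

The call starts at offset 1 because the opcode byte has already been consumed; with a trailing FlexUInt address the arguments would start after the address instead.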

Macro shapes

The term macro shape describes a macro that is being used as the encoding of an E-expression argument. A parameter using a macro shape as its encoding is sometimes called a macro-shaped parameter. For example, consider the following two macro definitions.

The point2D macro takes two flex_int-encoded values as arguments.

(macro point2D (flex_int::x flex_int::y)
  {
    x: (%x),
    y: (%y),
  }
)

The line macro takes a pair of point2D invocations as arguments.

(macro line (point2D::start point2D::end)
  {
    start: (%start),
    end: (%end),
  }
)

Normally an e-expression would begin with an opcode and an address communicating what comes next. However, when we're reading the argument for a macro-shaped parameter, the macro being invoked is inferred from the parent macro signature instead. As such, there is no need to include an opcode or address.

┌──── Opcode 0x01 is less than 0x40; this is an e-expression
│     invoking the macro at address 1: `line`
│    ┌─── Argument $start: an implicit invocation of macro `point2D`
│    │     ┌─── Argument $end: an implicit invocation of macro `point2D`
│  ┌─┴─┐ ┌─┴─┐
01 03 05 07 09
   │  │  │  └────   $end/$y: FlexInt 4
   │  │  └───────   $end/$x: FlexInt 3
   │  └────────── $start/$y: FlexInt 2
   └───────────── $start/$x: FlexInt 1
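
To make the implicit invocation concrete, here is a small Python sketch that decodes the four argument bytes of the `line` invocation by following the signatures recursively. The `SIGNATURES` table and function names are hypothetical, and the result is the decoded argument tree rather than a full macro expansion.

```python
# Parameter list for each macro: (encoding, parameter name).
SIGNATURES = {
    "point2D": [("flex_int", "x"), ("flex_int", "y")],
    "line": [("point2D", "start"), ("point2D", "end")],
}


def read_flex_int(buf: bytes, pos: int):
    """Decode a FlexInt of up to 8 bytes starting at pos."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("FlexInts longer than 8 bytes are out of scope here")
    n = (first & -first).bit_length()
    return int.from_bytes(buf[pos:pos + n], "little", signed=True) >> n, pos + n


def read_shape_args(buf: bytes, pos: int, macro_name: str):
    """Read the arguments of macro_name; no opcode or address is expected."""
    args = {}
    for encoding, name in SIGNATURES[macro_name]:
        if encoding == "flex_int":
            args[name], pos = read_flex_int(buf, pos)
        else:
            # Macro-shaped parameter: the macro is implied by the signature.
            args[name], pos = read_shape_args(buf, pos, encoding)
    return args, pos


argument_bytes = bytes.fromhex("03 05 07 09")  # bytes following the `line` opcode
print(read_shape_args(argument_bytes, 0, "line"))
```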

Any macro can be used as a macro shape except for constants (macros which take zero parameters). Constants cannot be used as a macro shape because their serialized representation would be empty, making it impossible to encode them in expression groups. However, this limitation does not sacrifice any expressiveness; the desired constant can always be invoked directly in the body of the macro.

(:add_macros
  // Defines a constant 'hostname'
  (hostname () "abc123.us_west.example.com")

  (http_ok (hostname::server page)
  //           └── ERROR: cannot use a constant as a macro shape
     {
        server: (%server),
        page: (%page),
        message: OK,
        status: 200,
     }
  )

  (http_ok (page)
    {
      server: (.hostname),
      //           └── OK: invokes constant as needed
      page: (%page),
      message: OK,
      status: 200,
    }
  )
)

Encoding variadic arguments

The preceding sections have described how to (de)serialize the various parameter encodings, but these parameters have always had the same cardinality: exactly-one.

This section explains how to encode e-expressions invoking a macro whose signature contains variadic parameters, that is, parameters with a cardinality of zero-or-one, zero-or-more, or one-or-more.

Argument Encoding Bitmap (AEB)

If a macro signature has one or more variadic parameters, then e-expressions invoking that macro will include an additional construct: the Argument Encoding Bitmap (AEB). This little-endian byte sequence precedes the first serialized argument and indicates how each argument corresponding to a variadic parameter has been encoded.

Each variadic parameter in the signature is assigned two bits in the AEB. This means that the reader can statically determine how many AEB bytes to expect in the e-expression by examining the signature.

Number of variadic parameters   AEB byte length
0                               0
1 to 4                          1
5 to 8                          2
9 to 12                         3
N                               ceiling(N/4)

Bits in the AEB are assigned from least significant to most significant and correspond to the variadic parameters in the signature from left to right. This allows the reader to right-shift away the bits of each variadic parameter when its corresponding argument has been read.

Example signature       AEB layout
()                      <No variadics, no AEB>
(a b c)                 <No variadics, no AEB>
(a b c?)                ------cc
(a b* c?)               ----ccbb
(a+ b* c?)              --ccbbaa
(a+ b c?)               ----ccaa
(a+ b* c? d*)           ddccbbaa
(a+ b* c? d* e)         ddccbbaa
(a+ b* c? d* e f?)      ddccbbaa ------ff
(a+ b* c? d* e+ f?)     ddccbbaa ----ffee

Each pair of bits in the AEB indicates what kind of expression to expect in the corresponding argument position.

Bit sequence   ?     *     +     Meaning
00             yes   yes   no    An empty stream. No bytes are present in the corresponding argument position.
01             yes   yes   yes   A single expression of the declared encoding is present in the corresponding argument position.
10             no    yes   yes   An expression group of the declared encoding is present in the corresponding argument position.
11             no    no    no    Reserved. A bitmap entry with this bit sequence is illegal in Ion 1.1.

As noted in the table above:

  • An empty stream (00) cannot be used to encode an argument for a parameter with a cardinality of one-or-more.
  • An expression group (10) cannot be used to encode an argument for a parameter with a cardinality of zero-or-one.
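
The AEB rules above can be sketched in Python as follows. The names `ALLOWED`, `aeb_length`, and `pack_aeb` are hypothetical, and a signature is represented as a list of parameter strings whose final character gives the cardinality (for example `"a+"` or `"uint8::b*"`).

```python
# Bit sequences permitted for each variadic cardinality.
ALLOWED = {
    "?": {0b00, 0b01},
    "*": {0b00, 0b01, 0b10},
    "+": {0b01, 0b10},
}


def variadics(signature):
    """Cardinality markers of the variadic parameters, left to right."""
    return [p[-1] for p in signature if p[-1] in ALLOWED]


def aeb_length(signature) -> int:
    """Two bits per variadic parameter, rounded up to whole bytes."""
    return -(-len(variadics(signature)) // 4)


def pack_aeb(signature, codes) -> bytes:
    """Pack one 2-bit code per variadic parameter, least significant bits first."""
    cards = variadics(signature)
    if len(codes) != len(cards):
        raise ValueError("expected one code per variadic parameter")
    bits = 0
    for i, (card, code) in enumerate(zip(cards, codes)):
        if code not in ALLOWED[card]:
            raise ValueError(f"bit sequence {code:02b} is not legal for '{card}'")
        bits |= code << (2 * i)
    return bits.to_bytes(aeb_length(signature), "little")


sig = "a+ b* c? d* e+ f?".split()
print(aeb_length(sig), pack_aeb(sig, [0b01, 0b10, 0b00, 0b10, 0b01, 0b00]).hex(" "))
```

A reader can recover the codes in the same order by masking the low two bits and right-shifting by two after each variadic argument is read.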

Expression groups

This section describes the encoding of an expression group. For an explanation of what an expression group is and how to use it, see Expression groups.

An expression group begins with a FlexUInt. If the FlexUInt's value is:

  • greater than zero, then it represents the number of bytes used to encode the rest of the expression group. The reader should continue reading expressions of the declared encoding until that number of bytes has been consumed.
  • zero, then it indicates that this is a delimited expression group and the processing varies according to whether the declared encoding is tagged or tagless. If the encoding is:
    • tagged, then each expression in the group begins with an opcode. The reader must consume tagged expressions until it encounters a terminating END opcode (0xF0).
    • tagless, then the expression group is a delimited sequence of 'chunks' that each have a FlexUInt length prefix and a body comprised of one or more expressions of the declared encoding. The reader will continue reading chunks until it encounters a length prefix of FlexUInt 0 (0x01), indicating the end of the chunk sequence. Each chunk in the sequence must be self-contained; an expression of the declared encoding may not be split across multiple chunks. See Example encoding of tagless zero-or-more with delimited expression group for an illustration.

tip

While it is legal to write an empty expression group for zero-or-more parameters, it is always more efficient to set the parameter's AEB bits to 00 instead.
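
For the tagless case, the length-prefixed and delimited forms can be read with a single routine. The Python sketch below handles only `uint8` expressions (one byte each) so that chunk bodies can be copied directly; the function names are hypothetical, and FlexUInts longer than 8 bytes are not handled.

```python
def read_flex_uint(buf: bytes, pos: int):
    """Decode a FlexUInt of up to 8 bytes starting at pos."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("FlexUInts longer than 8 bytes are out of scope here")
    n = (first & -first).bit_length()  # trailing zero bits + 1
    return int.from_bytes(buf[pos:pos + n], "little") >> n, pos + n


def read_uint8_group(buf: bytes, pos: int):
    """Read an expression group of uint8 expressions; returns (values, next position)."""
    length, pos = read_flex_uint(buf, pos)
    if length > 0:
        # Length-prefixed group: the next `length` bytes are the body.
        return list(buf[pos:pos + length]), pos + length
    values = []
    while True:
        # Delimited group: a sequence of chunks, each with its own length prefix.
        chunk_length, pos = read_flex_uint(buf, pos)
        if chunk_length == 0:
            return values, pos  # FlexUInt 0 ends the chunk sequence
        values.extend(buf[pos:pos + chunk_length])
        pos += chunk_length


print(read_uint8_group(bytes.fromhex("07 01 02 03"), 0))
print(read_uint8_group(bytes.fromhex("01 07 01 02 03 05 04 05 01"), 0))
```

For wider tagless encodings, the reader would decode expressions inside each chunk until the chunk's byte count is exhausted instead of copying bytes.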

Example encoding of tagged zero-or-one with empty group

(:add_macros
  (foo (a?) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa
│  │     a=00, empty stream (no argument bytes follow)
00 00

Example encoding of tagged zero-or-one with single expression

(:add_macros
  (foo (a?) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=01, single expression
│  │    ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│  │  ┌─┴─┐
00 01 61 01

Example encoding of tagged zero-or-more with empty group

(:add_macros
  (foo (a*) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=00, empty stream (no argument bytes follow)
│  │
00 00

Example encoding of tagged zero-or-more with single expression

(:add_macros
  (foo (a*) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=01, single expression
│  │    ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│  │  ┌─┴─┐
00 01 61 01

Example encoding of tagged zero-or-more with expression group

(:add_macros
  (foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 6: 6-byte expression group
│  │  │    ┌──── Opcode 0x61 indicates a 1-byte int (1)
│  │  │    │     ┌──── Opcode 0x61 indicates a 1-byte int (2)
│  │  │    │     │     ┌─── Opcode 0x61 indicates a 1-byte int (3)
│  │  │  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 02 0D 61 01 61 02 61 03
         └───────┬───────┘
      6-byte expression group body

Example encoding of tagged zero-or-more with delimited expression group

(:add_macros
  (foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 0: delimited expression group
│  │  │    ┌──── Opcode 0x61 indicates a 1-byte int (1)
│  │  │    │     ┌──── Opcode 0x61 indicates a 1-byte int (2)
│  │  │    │     │     ┌─── Opcode 0x61 indicates a 1-byte int (3)
│  │  │    │     │     │   ┌─── Opcode 0xF0 is delimited end
│  │  │  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ │
00 02 01 61 01 61 02 61 03 F0
         └───────┬───────┘
        expression group body

Example encoding of tagged one-or-more with single expression

(:add_macros
  (foo (a+) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=01, single expression
│  │  ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int
│  │  │   1
00 01 61 01

Example encoding of tagged one-or-more with expression group

(:add_macros
  (foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 6: 6-byte expression group
│  │  │  ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     │     ┌─── Opcode 0x61 indicates a 1-byte int
│  │  │  │   1 │  2  │   3
00 02 0D 61 01 61 02 61 03
         └───────┬───────┘
      6-byte expression group body

Example encoding of tagged one-or-more with delimited expression group

(:add_macros
  (foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 0: delimited expression group
│  │  │  ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     │     ┌─── Opcode 0x61 indicates a 1-byte int
│  │  │  │     │     │      ┌─── Opcode 0xF0 is delimited end
│  │  │  │   1 │  2  │   3  │
00 02 01 61 01 61 02 61 03 F0
         └───────┬───────┘
        expression group body

Example encoding of tagless zero-or-more with expression group

(:add_macros
  (foo (uint8::a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 3: 3-byte expression group
│  │  │  ┌──── uint8 1
│  │  │  │  ┌──── uint8 2
│  │  │  │  │  ┌─── uint8 3
│  │  │  │  │  │
00 02 07 01 02 03
         └──┬───┘
   expression group body

Example encoding of tagless zero-or-more with delimited expression group

(:add_macros
  (foo (uint8::a*) /*...*/)
)
(:foo 1 2 3 4 5)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 0: Delimited expression group
│  │  │  ┌──── FlexUInt 3: 3-byte chunk of uint8 expressions
│  │  │  │            ┌──── FlexUInt 2: 2-byte chunk of uint8 expressions
│  │  │  │            │       ┌──── FlexUInt 0: End of group
│  │  │  │            │       │
00 02 01 07 01 02 03 05 04 05 01
            └──┬───┘    └─┬─┘
            chunk 1    chunk 2

Annotations

Annotations can be encoded either as symbol addresses or as FlexSyms. In both encodings, the annotations sequence appears just before the value that it decorates.

It is illegal for an annotations sequence to appear before any of the following:

  • The end of the stream
  • Another annotations sequence
  • A NOP
  • An e-expression. To add annotations to the expansion of an E-expression, see the annotate macro.

Annotations With Symbol Addresses

Opcodes 0xE4 through 0xE6 indicate one or more annotations encoded as symbol addresses. If the opcode is:

  • 0xE4, a single FlexUInt-encoded symbol address follows.
  • 0xE5, two FlexUInt-encoded symbol addresses follow.
  • 0xE6, a FlexUInt follows that represents the number of bytes needed to encode the annotations sequence, which can be made up of any number of FlexUInt symbol addresses.
Encoding of $10::false
┌──── The opcode `0xE4` indicates a single annotation encoded as a symbol address follows
│  ┌──── Annotation with symbol address: FlexUInt 10
E4 15 6F
      └── The annotated value: `false`
Encoding of $10::$11::false
┌──── The opcode `0xE5` indicates that two annotations encoded as symbol addresses follow
│  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
E5 15 17 6F
         └── The annotated value: `false`
Encoding of $10::$11::$12::false
┌──── The opcode `0xE6` indicates a variable-length sequence of symbol address annotations;
│     a FlexUInt follows representing the length of the sequence.
│  ┌──── Annotations sequence length: FlexUInt 3
│  │  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
│  │  │  │  ┌──── Annotation with symbol address: FlexUInt 12 ($12)
E6 07 15 17 19 6F
               └── The annotated value: `false`
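
The opcode selection for symbol-address annotations can be sketched in Python. The function names are hypothetical; the annotated value is passed as already-encoded bytes, and 0x6F is the encoding of `false` used in the examples above.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    n = max(1, -(-value.bit_length() // 7))
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def annotate_with_addresses(addresses, value: bytes) -> bytes:
    """Prefix an encoded value with annotations given as symbol addresses."""
    if not addresses:
        raise ValueError("an annotations sequence needs at least one annotation")
    sequence = b"".join(flex_uint(address) for address in addresses)
    if len(addresses) == 1:
        prefix = bytes([0xE4])
    elif len(addresses) == 2:
        prefix = bytes([0xE5])
    else:
        # 0xE6 carries the byte length of the sequence, not the annotation count.
        prefix = bytes([0xE6]) + flex_uint(len(sequence))
    return prefix + sequence + value


FALSE = bytes([0x6F])
for addresses in ([10], [10, 11], [10, 11, 12]):
    print(annotate_with_addresses(addresses, FALSE).hex(" "))
```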

Annotations With FlexSym Text

Opcodes 0xE7 through 0xE9 indicate one or more annotations encoded as FlexSyms.

If the opcode is:

  • 0xE7, a single FlexSym-encoded symbol follows.
  • 0xE8, two FlexSym-encoded symbols follow.
  • 0xE9, a FlexUInt follows that represents the byte length of the annotations sequence, which is made up of any number of annotations encoded as FlexSyms.

While this encoding is more flexible than annotations with symbol addresses, it can be slightly less compact when all the annotations are encoded as symbol addresses.

Encoding of $10::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation with symbol address: FlexSym 10 ($10)
E7 15 6F
      └── The annotated value: `false`
Encoding of foo::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │   f  o  o
E7 FB 66 6F 6F 6F
      └──┬───┘ └── The annotated value: `false`
      3 UTF-8
       bytes

Note that FlexSym annotation sequences can switch between symbol address and inline text on a per-annotation basis.

Encoding of $10::foo::false
┌──── The opcode `0xE8` indicates two annotations encoded as FlexSyms follow
│  ┌──── Annotation: FlexSym 10 ($10)
│  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │   f  o  o
E8 15 FB 66 6F 6F 6F
         └──┬───┘ └── The annotated value: `false`
         3 UTF-8
          bytes
Encoding of $10::foo::$11::false
┌──── The opcode `0xE9` indicates a variable-length sequence of FlexSym-encoded annotations
│  ┌──── Length: FlexUInt 6
│  │  ┌──── Annotation: FlexSym 10 ($10)
│  │  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │  │           ┌──── Annotation: FlexSym 11 ($11)
│  │  │  │   f  o  o │
E9 0D 15 FB 66 6F 6F 17 6F
            └──┬───┘    └── The annotated value: `false`
            3 UTF-8
             bytes

NOPs

A NOP (short for "no-operation") is the binary equivalent of whitespace. NOP bytes have no meaning, but can be used as padding to achieve a desired alignment.

An opcode of 0xEC indicates a single-byte NOP pad. An opcode of 0xED indicates that a FlexUInt follows that represents the number of additional bytes to skip.

It is legal for a NOP to appear anywhere that a value can be encoded. It is not legal for a NOP to appear in annotation sequences or struct field names. If a NOP appears in place of a struct field value, then the associated field name is ignored; the NOP is immediately followed by the next field name, if any.

Encoding of a 1-byte NOP
┌──── The opcode `0xEC` represents a 1-byte NOP pad
│
EC
Encoding of a 4-byte NOP
┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows
│  ┌──── Length: FlexUInt 2; two more bytes of NOP follow
│  │
ED 05 93 C6
      └─┬─┘
NOP bytes, values ignored
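
A writer that wants to pad to an alignment boundary needs a NOP of an exact total size. The Python sketch below is one hypothetical way to build such padding from the two opcodes; `nop_pad` is not part of any Ion library, and the filler bytes are written as zeros since their values are ignored.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    n = max(1, -(-value.bit_length() // 7))
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def nop_pad(total: int) -> bytes:
    """Return exactly `total` bytes of NOP padding."""
    if total < 1:
        raise ValueError("padding must be at least one byte")
    if total == 1:
        return bytes([0xEC])  # single-byte NOP
    # Look for a payload size whose FlexUInt prefix makes the sizes add up:
    # 1 (opcode) + len(prefix) + payload == total
    for prefix_length in range(1, 10):
        payload = total - 1 - prefix_length
        if payload < 0:
            break
        if len(flex_uint(payload)) == prefix_length:
            return bytes([0xED]) + flex_uint(payload) + bytes(payload)
    # Some sizes fall between prefix lengths; spend one byte on 0xEC and retry.
    return bytes([0xEC]) + nop_pad(total - 1)


print(nop_pad(1).hex(" "))
print(nop_pad(4).hex(" "))
```

The fallback covers sizes where growing the payload by one byte would also grow the FlexUInt prefix, so no single 0xED NOP has the requested length.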

Security considerations

The Ion 1.1 data format is orthogonal to many classes of attacks, such as privilege escalation and phishing attacks. Ion 1.1 is primarily susceptible to denial-of-service (DoS) attacks that attempt to cause an error condition in the receiving system or consume excessive system resources. As with many such attacks, the strongest defense is to not accept any untrusted input, but that defense is not always compatible with the business requirements of the receiving application.

This document addresses various types of attacks, assuming that it is not possible to avoid accepting untrusted input.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Data expansion denial-of-service

An attacker could craft an input that is relatively small, but upon expansion, produces something thousands or millions of times larger.

For many use cases, the expansion of a template macro will grow linearly with the size of its input. However, it is possible to create macros with expansions that grow at greater rates. Using for, we can nest an arbitrary number of loops to create a macro expansion with a polynomial growth rate. Using the repeat macro, we can create classes of inputs with expansions that grow exponentially in relation to the input.

For example, the following input is less than 250 characters when encoded as Ion text (omitting all optional whitespace). In Ion binary, it requires only 74 bytes. Each additional level of nesting requires only 20 additional characters (text) or 6 additional bytes (binary), but multiplies the number of expanded values by 2147483647.

$ion_1_1
(:repeat 2147483647
  (:repeat 2147483647
    (:repeat 2147483647
      (:repeat 2147483647
        (:repeat 2147483647
          (:repeat 2147483647
            (:repeat 2147483647
              (:repeat 2147483647
                (:repeat 2147483647
                  (:repeat 2147483647
                    (:repeat 2147483647 "abc")))))))))))

The expansion of these e-expressions results in a stream of ~450 googol string values. Any attempt to hold all of this in memory or write it to disk will exhaust all available resources and eventually fail. Even an attempt to count the length of the stream, while it may theoretically succeed if using an appropriate BigInteger type, will require a considerable number of CPU operations (over a googol), and even the fastest processors will require many millennia to completely count the number of values in the stream.
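
The size of this expansion can be checked with Python's arbitrary-precision integers. The counting rate used below is an arbitrary assumption chosen only to give a sense of scale.

```python
REPEAT_COUNT = 2147483647   # argument of each :repeat in the example
NESTING_LEVELS = 11         # number of nested :repeat invocations
GOOGOL = 10 ** 100

# Each level multiplies the number of values produced by the level inside it.
total_values = REPEAT_COUNT ** NESTING_LEVELS
print(f"about {total_values // GOOGOL} googol values")

# Assumed rate: ten billion values counted per second (hypothetical).
ASSUMED_VALUES_PER_SECOND = 10 ** 10
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
years = total_values // ASSUMED_VALUES_PER_SECOND // SECONDS_PER_YEAR
print(f"years needed at that rate: a {len(str(years))}-digit number")
```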

Even without using repeat or for, a Billion laughs attack could exist for any data format with macro expansion, and it is certainly possible with Ion 1.1.

$ion_1_1
(:add_macros (macro lol0 () "lol")
             (macro lol1 () (.values (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0)))
             (macro lol2 () (.values (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1)))
             (macro lol3 () (.values (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2)))
             (macro lol4 () (.values (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3)))
             (macro lol5 () (.values (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4)))
             (macro lol6 () (.values (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5)))
             (macro lol7 () (.values (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6)))
             (macro lol8 () (.values (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7)))
             (macro lol9 () (.values (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8)))
             (macro lolz () (.lol9)) )
(:lolz)

Implementations of Ion 1.1 MUST have some mechanism by which to mitigate data expansion attacks.

The macro evaluator of Ion 1.1 implementations SHOULD have a (possibly configurable) limit on the number of values produced by the expansion of any macro or e-expression. If the macro evaluator reaches that limit, evaluation should halt and the reader should signal an error. This is similar to the token bucket algorithm, but instead of being refilled over time, the bucket is reset to its maximum capacity whenever the reader begins evaluating an e-expression that is not nested in any other e-expression at any depth. To guard against malicious input whose expansion produces no values (for example, (macro sneaky_lolz () (.meta (.lolz)))), tokens SHOULD be consumed at every level of expansion, including special forms and TDL macro invocations. Expansions that are skipped are not required to consume tokens (since they are not expanded), but an empty expansion MUST consume at least one token.

$ion_1_1
// Fill bucket here
(:make_list
  [
    // Do not fill bucket here
    (:repeat 100 "foo")
  ]
  [
    "bar",
    "baz",
  ]
)
{
  // Fill bucket here
  foo: (:make_string "foo" "bar")
  // Fill bucket here.
  // Consume one token for each value produced by repeat and for each value produced by make_string
  bar: (:make_string (:repeat 16 "na") " batman!")
}
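The limit described above can be sketched as a budget that is refilled only when a top-level e-expression begins. This is a minimal Python illustration. The class name, method names, and default capacity are hypothetical and do not correspond to any existing Ion implementation.

```python
class ExpansionLimitExceeded(Exception):
    """Raised when an e-expression expands past the configured budget."""


class ExpansionBudget:
    """Hypothetical token budget for macro expansion (names are illustrative)."""

    def __init__(self, capacity: int = 1_000_000) -> None:
        self.capacity = capacity
        self.remaining = capacity

    def begin_top_level_expression(self) -> None:
        # Refill only for an e-expression that is not nested in another one.
        self.remaining = self.capacity

    def consume(self, tokens: int = 1) -> None:
        # Called for every value produced and every expansion step, so that
        # expansions producing no values still drain the budget.
        if tokens > self.remaining:
            self.remaining = 0
            raise ExpansionLimitExceeded(
                f"expansion exceeded the limit of {self.capacity} tokens"
            )
        self.remaining -= tokens


budget = ExpansionBudget(capacity=20)

# foo: (:make_string "foo" "bar")
budget.begin_top_level_expression()
budget.consume()       # one value produced by make_string

# bar: (:make_string (:repeat 16 "na") " batman!")
budget.begin_top_level_expression()
budget.consume(16)     # one token per value produced by repeat
budget.consume()       # one value produced by make_string
print(budget.remaining)  # 3
```

Because the budget is reset per top-level e-expression rather than per document, a long stream of small, legitimate e-expressions is never rejected, while a single runaway expansion is.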

Remote code execution

The template definition language (TDL) is a domain-specific programming language used to declare template macros in Ion 1.1. It is intentionally limited in its capabilities: it cannot recurse and does not support forward references. In general, it supports combining Ion values to produce other Ion values, but it does not support arbitrary computation on those values.

Remote code execution (RCE) attacks allow an attacker to remotely execute malicious code on a computer. By invoking e-expressions in the body of an Ion document, an attacker can cause the recipient to execute arbitrary TDL (code) when reading the document.

This is unlikely to be a concern in practice because TDL is not arbitrary code. TDL is intentionally not Turing complete, to make it impossible to perform arbitrary computation. It also has a very limited domain—it can only transform/produce Ion data model values. While it could be possible to attempt a denial-of-service attack using TDL, TDL expansion is guaranteed to terminate in a finite number of steps, and implementations can additionally limit the expansion size (as described above).

Embedded Documents

Ion 1.1 supports embedded documents using the parse_ion macro. Generally speaking, systems that accept embedded documents should properly isolate and validate embedded documents to prevent attacks.

Ion 1.1 specifies that parse_ion must only accept a literal string or literal blob, and that the resulting values are always user values (rather than system values). This ensures that the embedded document cannot be affected by any input from the containing document, nor can it have any effect on the encoding context of the containing document. The parse_ion macro uses an Ion reader, so the embedded document is validated just like any other Ion document.

Data injection via shared modules

Applications are not required to use shared modules. If an application does use shared modules, it should take steps to ensure that shared modules come from a trusted source and use appropriate measures to prevent man-in-the-middle and other attacks that can compromise data while it is in transit.

In many cases, even if an application needs to accept Ion payloads from untrusted sources, it is possible to design a solution in which the shared modules are supplied by a trusted source. For example, in a service-oriented architecture, the server can host shared modules so that the server does not have to trust the client. (However, this assumes that the client trusts the server.)

If shared modules must come from an untrusted source, then applications should take steps to ensure that the shared modules originate from the same source as the data that uses them, and they can be treated as if they are one composite piece of data from that source.

Arbitrary-sized values

The Ion specification places no limits on the size of Ion values, so an attacker could send a value large enough to consume the system resources of the application reading it and disrupt that application.

Even though the Ion specification does not have limits on the size of values, all real computer systems have finite resources, so all implementations will have limits in practice. Ion implementations MAY set limits on the maximum size of any Ion value for any available metric, including (but not limited to) number of bytes, number of codepoints, number of child values, digits of precision, or number of annotations. An implementation MAY allow limits to be configurable by an application that uses the Ion implementation. Any limits imposed SHOULD be described in the public documentation of an Ion implementation, unless the limits are unknown and/or are dependent on the underlying runtime environment.
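As an illustration of configurable limits, the Python sketch below models a small set of limits and a check against them. The ReaderLimits type, its field names, and its default values are hypothetical; the specification does not define them.

```python
from dataclasses import dataclass


class ValueLimitExceeded(Exception):
    """Raised when a value exceeds a configured limit."""


@dataclass(frozen=True)
class ReaderLimits:
    """Hypothetical per-value limits; every default is an arbitrary example."""
    max_value_bytes: int = 64 * 1024 * 1024
    max_child_values: int = 1_000_000
    max_annotations: int = 1_000


def check_value(limits: ReaderLimits, *, value_bytes: int,
                child_values: int = 0, annotations: int = 0) -> None:
    """Raise ValueLimitExceeded if any measured metric is over its limit."""
    measured = {
        "max_value_bytes": value_bytes,
        "max_child_values": child_values,
        "max_annotations": annotations,
    }
    for limit_name, actual in measured.items():
        allowed = getattr(limits, limit_name)
        if actual > allowed:
            raise ValueLimitExceeded(f"{limit_name}: {actual} exceeds {allowed}")


# An application tightens one limit and keeps the other defaults.
limits = ReaderLimits(max_annotations=8)
check_value(limits, value_bytes=128, child_values=3, annotations=2)  # accepted
```

Keeping the limits in one immutable object makes them easy to document and to pass from the application to the reader.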

Symbol table and macro table inflation

An attacker could try to create an input that results in excessively large symbol and macro tables in the Ion reader that could exhaust the memory of the receiving system and lead to a denial of service.

Although Ion 1.1 does not specify a maximum size for symbol tables or macro tables, Ion implementations MAY impose upper bounds on the size of symbol tables, macro tables, module bindings, and any other direct or indirect component of the encoding context. An implementation MAY allow limits to be configurable by an application that uses the Ion implementation. Any limits imposed SHOULD be described in the public documentation of an Ion implementation, unless the limits are unknown and/or are dependent on the underlying runtime environment.

Grammar

This chapter presents Ion 1.1's domain grammar, by which we mean the grammar of the domain of values that drive Ion's encoding features.

We use a BNF-like notation for describing various syntactic parts of a document, including Ion data structures. In such cases, the BNF should be interpreted loosely to accommodate Ion-isms like commas and unconstrained ordering of struct fields.

Documents

document           ::= ivm? segment*

ivm                ::= '$ion_1_0' | '$ion_1_1'

segment            ::= value* directive?

directive          ::= ivm 
                     | encoding-directive 
                     | symtab-directive 

symtab-directive   ::=  local-symbol-table     ; As per the Ion 1.0 specification¹

encoding-directive ::= '$ion::(encoding ' module-name* ')'

    ¹Symbols – Local Symbol Tables.

Modules

module-body             ::= import* inner-module* symbol-table? macro-table?

shared-module           ::= '$ion_shared_module::' ivm '::(' catalog-key module-body ')'

import                  ::= '(import ' module-name catalog-key ')'

catalog-key             ::= catalog-name catalog-version?

catalog-name            ::= string

catalog-version         ::= unannotated-uint                   ; must be positive

inner-module            ::= '(module' module-name module-body ')'

module-name             ::= unannotated-identifier-symbol

symbol-table            ::= '(symbol_table' symbol-table-entry* ')'

symbol-table-entry      ::= module-name | symbol-list

symbol-list             ::= '[' symbol-text* ']'

symbol-text             ::= symbol | string

macro-table             ::= '(macro_table' macro-table-entry* ')'

macro-table-entry       ::= macro-definition
                          | macro-export
                          | module-name

macro-export            ::= '(export' qualified-macro-ref macro-name-declaration? ')'

Macro references

qualified-macro-ref     ::= module-name '::' macro-ref

macro-ref               ::= macro-name | macro-addr

qualified-macro-name    ::= module-name '::' macro-name

macro-name              ::= unannotated-identifier-symbol

macro-addr              ::= unannotated-uint 
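The macro reference productions above can be exercised with a small parser. The Python sketch below is an illustration, not a conforming implementation: it assumes identifier symbols follow the Ion text rule (a letter, '_' or '$', followed by letters, digits, '_' or '$'), it ignores keyword and symbol-ID exclusions, and the module name geometry is made up for the example.

```python
import re
from typing import NamedTuple, Optional, Union

# Assumed identifier syntax; keyword and symbol-ID exclusions are ignored here.
IDENTIFIER = r"[A-Za-z_$][A-Za-z0-9_$]*"
MACRO_REF = re.compile(
    rf"(?:(?P<module>{IDENTIFIER})::)?"
    rf"(?:(?P<name>{IDENTIFIER})|(?P<addr>0|[1-9][0-9]*))"
)


class MacroRef(NamedTuple):
    module: Optional[str]      # None for an unqualified reference
    target: Union[str, int]    # macro name, or numeric macro address


def parse_macro_ref(text: str) -> MacroRef:
    """Parse 'name', 'addr', 'module::name', or 'module::addr'."""
    match = MACRO_REF.fullmatch(text)
    if match is None:
        raise ValueError(f"not a macro reference: {text!r}")
    if match["addr"] is not None:
        return MacroRef(match["module"], int(match["addr"]))
    return MacroRef(match["module"], match["name"])


print(parse_macro_ref("geometry::point"))  # MacroRef(module='geometry', target='point')
print(parse_macro_ref("geometry::3"))      # MacroRef(module='geometry', target=3)
print(parse_macro_ref("point"))            # MacroRef(module=None, target='point')
```

Resolving the parsed reference against a macro table, and rejecting ambiguous or forward references, is outside the scope of this sketch.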

Macro definitions

macro-definition        ::= '(macro' macro-name-declaration signature tdl-expression ')'

macro-name-declaration  ::= macro-name | 'null'

signature               ::= '(' parameter* ')'

parameter               ::= parameter-encoding? parameter-name parameter-cardinality?

parameter-encoding      ::= (primitive-encoding-type | macro-name | qualified-macro-name)'::'

primitive-encoding-type ::= 'uint8' | 'uint16' | 'uint32' | 'uint64'
                          |  'int8' |  'int16' |  'int32' |  'int64'
                          | 'float16' | 'float32' | 'float64'
                          | 'flex_int' | 'flex_uint' 
                          | 'flex_sym' | 'flex_string'

parameter-name          ::= unannotated-identifier-symbol

parameter-cardinality   ::= '!' | '*' | '?' | '+'

tdl-expression          ::= operation | variable-expansion | ion-scalar | ion-container

operation               ::= macro-invocation | special-form

variable-expansion      ::= '(%' variable-name ')'

variable-name           ::= unannotated-identifier-symbol

macro-invocation        ::= '(.' macro-ref macro-arg* ')'

special-form            ::= '(.' '$ion::'?  special-form-name tdl-expression* ')'

special-form-name       ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi'

macro-arg               ::= tdl-expression | expression-group

expression-group        ::= '(..' tdl-expression* ')'
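The parameter-cardinality modifiers can be modeled as minimum and maximum value counts, following the glossary definitions of cardinality and optional parameter. The Python sketch below is illustrative; the function names are hypothetical and the signature is reduced to a list of modifiers, with None standing for an omitted modifier.

```python
from typing import List, Optional, Tuple

# Modifier -> (minimum, maximum) number of values; None means unbounded.
CARDINALITY_BOUNDS = {
    "!": (1, 1),      # exactly-one (also the default when no modifier is given)
    "?": (0, 1),      # zero-or-one
    "*": (0, None),   # zero-or-more
    "+": (1, None),   # one-or-more
}


def bounds_for(modifier: Optional[str]) -> Tuple[int, Optional[int]]:
    """Return the bounds for a modifier, defaulting to exactly-one."""
    return CARDINALITY_BOUNDS["!" if modifier is None else modifier]


def accepts(modifier: Optional[str], value_count: int) -> bool:
    """Return True if a parameter with this modifier accepts value_count values."""
    minimum, maximum = bounds_for(modifier)
    return value_count >= minimum and (maximum is None or value_count <= maximum)


def optional_flags(modifiers: List[Optional[str]]) -> List[bool]:
    """Mark each parameter optional if it and all later ones accept no values."""
    flags = []
    all_later_voidable = True
    for modifier in reversed(modifiers):
        all_later_voidable = all_later_voidable and bounds_for(modifier)[0] == 0
        flags.append(all_later_voidable)
    return flags[::-1]


# Signature (a b? c*): only the trailing parameters b and c are optional.
print(optional_flags([None, "?", "*"]))  # [False, True, True]
print(accepts("+", 0))                   # False
```

Note that a zero-or-one parameter followed by a required parameter is not optional, because its argument cannot be omitted without shifting the arguments that follow.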

Glossary

active encoding module
An encoding module whose symbol table and macro table are available in the current segment of an Ion document. The sequence of active encoding modules is set by an encoding directive.

argument
The sub-expression(s) within a macro invocation, corresponding to exactly one of the macro's parameters.

cardinality
Describes both the number of argument expressions that a parameter will accept when the macro is invoked, and the number of values that the parameter may expand to during evaluation. A parameter's cardinality can be zero-or-one, exactly-one, zero-or-more, or one-or-more, specified in a signature by one of the modifiers ?, !, *, or + respectively. If no modifier is specified, cardinality defaults to exactly-one.

declaration
The association of a name with an entity (for example, a module or macro). See also definition. Not all declarations are definitions: some introduce new names for existing entities.

definition
The specification of a new entity.

directive
A keyword or unit of data in an Ion document that affects the encoding environment, and thus the way the document's data is encoded and decoded. In Ion 1.0 there are two kinds of directive: Ion version markers and symbol table directives. Ion 1.1 adds encoding directives.

document
A stream of octets conforming to either the Ion text or binary specification. Can consist of multiple segments, perhaps using varying versions of the Ion specification. A document does not necessarily exist as a file, and is not necessarily finite.

E-expression
See encoding expression.

encoding directive
In an Ion 1.1 segment, a top-level S-expression annotated with $ion. Defines a new encoding module sequence for the segment immediately following it. The symbol table directive is effectively a less capable alternative syntax.

encoding environment
The context-specific data maintained by an Ion implementation while encoding or decoding data. In Ion 1.0 this consists of the current symbol table; in Ion 1.1 this is expanded to also include the Ion spec version, the current macro table, and a collection of available modules.

encoding expression
The invocation of a macro in encoded data, aka e-expression. Starts with a macro reference denoting the function to invoke. The Ion text format uses "smile syntax" (:macro ...) to denote e-expressions. Ion binary devotes a large number of opcodes to e-expressions, so they can be compact.

encoding module
A module whose symbol table and macro table can be used directly in the user data stream.

expression
A serialized syntax element that may produce values. Encoding expressions and values are both considered expressions, whereas NOP, comments, and IVMs, for example, are not.

expression group
A grouping of zero or more expressions that together form one argument. The concrete syntax for passing a stream of expressions to a macro parameter. In a text e-expression, a group starts with the trigraph (:: and ends with ), similar to an S-expression. In template definition language, a group is written as an S-expression starting with .. (two dots).

inner module
A module that is defined inside another module and only visible inside the definition of that module.

Ion version marker
A keyword directive that denotes the start of a new segment encoded with a specific Ion version. Also known as "IVM".

macro
A transformation function that accepts some number of streams of values, and produces a stream of values.

macro definition
Specifies a macro in terms of a signature and a template.

macro reference
Identifies a macro for invocation or exporting. Must always be unambiguous. Lexically scoped. Cannot be a "forward reference" to a macro that is declared later in the document; these are not legal.

module
The data entity that defines and exports both symbols and macros.

opcode
A 1-byte, unsigned integer that tells the reader what the next expression represents and how the bytes that follow should be interpreted.

optional parameter
A parameter that can have its corresponding subform(s) omitted when the macro is invoked. A parameter is optional if both it and the parameters that follow it in the macro signature can accept an empty stream.

parameter
A named input to a macro, as defined by its signature. At expansion time a parameter produces a stream of values.

qualified macro reference
A macro reference that consists of a module name and either a macro name exported by that module, or a numeric address within the range of the module's exported macro table. In TDL, these look like module-name::name-or-address.

required parameter
A macro parameter that is not optional and therefore requires an argument at each invocation.

rest parameter
A macro parameter—always the final parameter—declared with * or + cardinality, that accepts all remaining individual arguments to the macro as if they were in an implicit argument group. Applies to Ion text and TDL. Similar to "varargs" parameters in Java and other languages.

segment
A contiguous partition of a document that uses the same encoding module sequence. Segment boundaries are caused by directives: an IVM starts a new segment (ending the prior segment, if any), while encoding directives end segments (with a new one starting immediately afterward). import and module directives can also end a segment if they are redefining a module binding that was in the encoding module sequence.

shared module
A module that exists independent of the data stream of an Ion document. It is identified by a name and version so that it can be imported by other modules.

signature
The part of a macro definition that specifies its "calling convention", in terms of the shape, type, and cardinality of arguments it accepts.

symbol table directive
A top-level struct annotated with $ion_symbol_table. Defines a new encoding environment without any macros. Valid in Ion 1.0 and 1.1.

system e-expression
An e-expression that invokes a macro from the system module rather than from the active encoding module.

system macro
A macro provided by the Ion implementation via the system module $ion. System macros are available at all points within Ion 1.1 segments.

system module
A standard module named $ion that is provided by the Ion implementation, implicitly installed so that the system symbols and system macros are available at all points within a document. Subsumes the functionality of the Ion 1.0 system symbol table.

system symbol
A symbol provided by the Ion implementation via the system module $ion. System symbols are available at all points within an Ion document, though the selection of symbols varies by segment according to its Ion version.

TDL
See template definition language.

template
The part of a macro definition that expresses its transformation of inputs to results.

template definition language
An Ion-based, domain-specific language that declaratively specifies the output produced by a macro. Template definition language uses only the Ion data model.

unqualified macro reference
A macro reference that consists of either a macro name or numeric address, without a qualifying module name. These are resolved using lexical scope and must always be unambiguous.

variable expansion
In TDL, a special form that causes all argument expression(s) for the given parameter to be expanded and the result of the expansion to be substituted into the template.
