This is a draft specification of Ion 1.1, a new minor version of the Ion serialization format.

Status

This document is a working draft and is subject to change.

Audience

This document presents the formal specification for the Ion 1.1 data format. It is not intended as a user guide or a cookbook, but as a reference to the syntax and semantics of the Ion data format and its logical data model.

What's New in Ion 1.1

This section gives a high-level overview of what is new and different in Ion 1.1 relative to Ion 1.0, from an implementer's perspective.

Motivation

Ion 1.1 has been designed to address some of the trade-offs in Ion 1.0 to make it suitable for a wider range of applications. Ion 1.1 makes length prefixing of containers optional, and makes the interning of symbolic tokens optional as well. This gives more flexibility to applications that write data more often than they read it, or that are constrained on the writer side in some way. Data density is another motivation. Certain encodings (e.g., timestamps, integers) have been made more compact and efficient, but more significantly, macros now enable applications to have very flexible interning of their data's structure. In aggregate, data transcoded from Ion 1.0 to Ion 1.1 should be more compact.

Backwards compatibility

Ion 1.1 is backwards compatible with Ion 1.0. While their encodings are distinct, they share the same data model--any data that can be produced and read by an application in Ion 1.1 has an equivalent representation in Ion 1.0.

Ion 1.1 is not required to preserve Ion 1.0 binary encodings in Ion 1.1 encoding contexts (i.e., the type codes and lower-level encodings are not preserved in the new version). The Ion Version Marker (IVM) is used to denote the different versions of the syntax. Ion 1.1 does retain text compatibility with Ion 1.0 in that the changes are a strict superset of the grammar. However, because the system symbol table has been updated, symbol IDs referred to using the $n syntax for symbols beyond the 1.0 system symbol table are not compatible.

Text syntax changes

Ion 1.1 text must use the $ion_1_1 version marker at the top-level of the data stream or document.

The only syntax change for the text format is the introduction of encoding expression (E-expression) syntax, which allows for the invocation of macros in the data stream. This syntax is grammatically similar to S-expressions, except that these expressions are opened with (: and closed with ). For example, (:a 1 2) would expand the macro named a with the arguments 1 and 2. See the "Macros, templates, and encoding expressions" section for details.

This syntax is allowed anywhere an Ion value is allowed:

E-expression examples

// At the top level
(:foo 1 2)

// Nested in a list
[1, 2, (:bar 3 4)]

// Nested in an S-expression
(cons a (:baz b))

// Nested in a struct
{c: (:bop d)}

E-expressions are also grammatically allowed in the field name position of a struct. When used there, they indicate that the expression should expand to a struct value that is merged into the enclosing struct:

E-Expression in field position of struct

{
    a:1,
    b:2,
    (:foo 1 2),
    c: 3,
}

In the above example, the E-expression (:foo 1 2) must evaluate to a struct, whose fields will be merged between the b field and the c field. If it does not evaluate to a struct, then the above is an error.

Binary encoding changes

Ion 1.1 binary encoding reorganizes the type descriptors to support compact E-expressions and to make certain encodings more compact, at the cost of making certain lower-priority encodings marginally less compact. The IVM for this encoding is the octet sequence 0xE0 0x01 0x01 0xEA.
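
As a minimal sketch of how a reader might recognize the binary version marker, the following Python helper checks the four-octet pattern given above and extracts the version. The function name parse_ivm is hypothetical and not part of any Ion library. It assumes that the second and third octets carry the major and minor version numbers, which is consistent with the Ion 1.0 marker 0xE0 0x01 0x00 0xEA.

```python
def parse_ivm(data: bytes) -> tuple[int, int]:
    """Return (major, minor) from a 4-octet binary Ion Version Marker.

    Assumption: octets 1 and 2 hold the major and minor version.
    """
    if len(data) < 4 or data[0] != 0xE0 or data[3] != 0xEA:
        raise ValueError("not a binary Ion Version Marker")
    return data[1], data[2]


ION_1_1_IVM = bytes([0xE0, 0x01, 0x01, 0xEA])
print(parse_ivm(ION_1_1_IVM))
```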

Inlined symbolic tokens

In binary Ion 1.0, symbol values, field names, and annotations are required to be encoded using a symbol ID in the local symbol table. For some use cases (e.g., write-once, read-maybe logs) this creates a burden on the writer and may not actually be efficient for an application. Ion 1.1 introduces optional binary syntax for encoding inline UTF-8 sequences for these tokens, which gives an encoder flexibility in whether and when to add a given symbolic token to the symbol table.

Ion text requires no change for this feature as it already had inline symbolic tokens without using the local symbol table. Ion text also has compatible syntax for representing the local symbol table and encoding of symbolic tokens with their position in the table (i.e., the $id syntax).

Delimited containers

In Ion 1.0, all data is length prefixed. While this is good for optimizing the reading of data, it requires an Ion encoder to buffer any data in memory to calculate the data's length. Ion 1.1 introduces optional binary syntax to allow containers to be encoded with an end marker instead of a length prefix.

Low-level binary encoding changes

Ion 1.0's VarUInt and VarInt encoding primitives used big-endian byte order and used the high bit of each byte to indicate whether it was the final byte in the encoding. VarInt used an additional bit in the first byte to represent the integer's sign. Ion 1.1 replaces these primitives with more optimized versions called FlexUInt and FlexInt.

FlexUInt and FlexInt use little-endian byte order, avoiding the need for reordering on common architectures like x86, aarch64, and RISC-V.

Rather than using a bit in each byte to indicate the width of the encoding, FlexUInt and FlexInt front-load the continuation bits. In most cases, this means that these bits all fit in the first byte of the representation, allowing a reader to determine the complete size of the encoding without having to inspect each byte individually.
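
The front-loaded layout can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative definition: it assumes that the number of zero bits below the lowest set bit of the encoding, plus one, gives the total byte count, and that the remaining bits hold the magnitude in little-endian order. The function names are hypothetical.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative int using the assumed FlexUInt layout."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    size = 1
    while value >= 1 << (7 * size):  # each byte contributes 7 payload bits
        size += 1
    # The low `size` bits are the continuation bits: (size - 1) zeros, then a 1.
    encoded = (value << size) | (1 << (size - 1))
    return encoded.to_bytes(size, "little")


def decode_flex_uint(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for a FlexUInt at the start of data."""
    for index, byte in enumerate(data):
        if byte != 0:
            trailing_zeros = (byte & -byte).bit_length() - 1
            size = index * 8 + trailing_zeros + 1
            break
    else:
        raise ValueError("no continuation terminator found")
    if len(data) < size:
        raise ValueError("truncated FlexUInt")
    return int.from_bytes(data[:size], "little") >> size, size


print(encode_flex_uint(729).hex())
```

In the common case the size is known after inspecting only the first byte, which is the property the paragraph above describes.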

Finally, FlexInt does not use a separate bit to indicate its value's sign. Instead, it uses two's complement representation, allowing it to share much of the same structure and parsing logic as its unsigned counterpart. Benchmarks have shown that in aggregate, these encoding changes are between 1.25x and 3x faster than Ion 1.0's VarUInt and VarInt encodings, depending on the host architecture.

Ion 1.1 supplants Ion 1.0's Int encoding primitive with a new encoding called FixedInt, which uses two's complement notation instead of sign-and-magnitude. A corresponding FixedUInt primitive has also been introduced; its encoding is the same as Ion 1.0's UInt primitive.

A new primitive encoding type, FlexSym, has been introduced to flexibly encode symbol IDs and symbolic tokens with inline text.

Type encoding changes

All Ion types use the new low-level encodings as specified in the previous section. Many of the opcodes used in Ion 1.0 have been re-organized primarily to make E-expressions compact.

Typed null values are now encoded in two bytes using the 0xEB opcode.

Lists and S-expressions have two encodings: a length-prefixed encoding and a new delimited form that ends with the 0xF0 opcode.

Struct values have the option of encoding their field names as a FlexSym, enabling them to write field name text inline instead of adding all names to the symbol table. There is now also a delimited form.

Similarly, symbol values now also have the option of encoding their symbol text inline.

Annotation sequences are a prefix to the value they decorate, and no longer have an outer length container. They are now encoded with one of three opcodes:

  1. 0xE7, which is followed by a single annotation and then the decorated value.
  2. 0xE8, which is followed by two annotations and then the decorated value.
  3. 0xE9, which is followed by a FlexUInt indicating the number of bytes used to encode the annotations sequence, the sequence itself, and then the decorated value.

The last encoding is similar to how Ion 1.0 annotations are encoded, with the exception that there is no outer length in addition to the annotations sequence length.
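
A writer's choice among the three opcodes depends only on how many annotations the value carries. The following Python sketch shows that selection. The function name is hypothetical, and the encoding of the annotations themselves is deliberately left out.

```python
def annotations_opcode(annotation_count: int) -> int:
    """Pick the annotations opcode described above for a given count."""
    if annotation_count < 1:
        raise ValueError("an unannotated value has no annotations prefix")
    if annotation_count == 1:
        return 0xE7  # followed by one annotation, then the value
    if annotation_count == 2:
        return 0xE8  # followed by two annotations, then the value
    return 0xE9      # followed by a FlexUInt byte length, the sequence, then the value


print(hex(annotations_opcode(3)))
```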

Integers now use a FixedInt sub-field instead of the Ion 1.0 encoding which used sign-and-magnitude (with two opcodes).

Decimals are structurally identical to their Ion 1.0 counterpart with the exception of the negative zero coefficient. The Ion 1.1 FlexInt encoding is two's complement, so negative zero cannot be encoded directly with it. Instead, an opcode is allocated specifically for encoding decimals with a negative zero coefficient.

Timestamps no longer encode their sub-field components as octet-aligned fields.

The Ion 1.1 format uses a packed bit encoding and has a biased form (encoding the year field as an offset from 1970) to make common encodings of timestamp easily fit in a 64-bit word for microsecond and nanosecond precision (with UTC offset or unknown UTC offset). Benchmarks have shown this new encoding to be 59% faster to encode and 21% faster to decode. A non-biased, arbitrary length timestamp with packed bit encoding is defined for uncommon cases.

Encoding expressions in binary

In binary, E-expressions are encoded with an opcode that includes the macro identifier or an opcode that specifies a FlexUInt for the macro identifier. The identifier is followed by the encoding of the arguments to the E-expression. The macro's definition statically determines how the arguments are to be laid out. An argument may be a full Ion value with a leading opcode (sometimes called a "tagged" value), or it could be a lower-level encoding (e.g., a fixed width integer or FlexInt/FlexUInt).

Macros, templates, and encoding expressions

Ion 1.1 introduces a new primitive called an encoding expression (E-expression). These expressions are (in text syntax) similar to S-expressions, but they are not part of the data model and are evaluated into one or more Ion values (called a stream) which enable compact representation of Ion data. E-expressions represent the invocation of either system defined or user defined macros with arguments that are either themselves E-expressions, value literals, or container constructors (list, sexp, struct syntax containing E-expressions) corresponding to the formal parameters of the macro's definition. The resulting stream is then expanded into the resulting Ion data model.

At the top level, the stream becomes individual top-level values. Consider for illustrative purposes an E-expression (:values 1 2 3) that evaluates to the stream 1, 2, 3 and (:none) that evaluates to the empty stream. In the following examples, values and none are the names of the macros being invoked and each line is equivalent.

Top-level e-expressions

// Encoding
a (:values 1 2 3) b (:none) c

// Evaluates to
a 1 2 3 b c

Within a list or S-expression, the stream becomes additional child elements in the collection.

E-expressions in lists

// Encoding
[a, (:values 1 2 3), b, (:none), c]

// Evaluates to
[a, 1, 2, 3, b, c]

E-expressions in S-expressions

// Encoding
(a (:values 1 2 3) b (:none) c)

// Evaluates to
(a 1 2 3 b c)

Within a struct, the behavior depends on where the E-expression appears. At the field name position (where no field value is specified), the resulting stream must contain only structs, and each field in those structs becomes a field in the enclosing struct. At the value position, each value in the resulting stream becomes a field with the field name that precedes the E-expression; an empty stream elides the field altogether. In the following examples, let us define (:make_struct c 5) that evaluates to a single struct {c: 5}.

E-expressions in structs

// Encoding
{
  a: (:values 1 2 3),
  b: 4,
  (:make_struct c 5),
  d: 6,
  e: (:none)
}

// Evaluates to
{
  a: 1,
  a: 2,
  a: 3,
  b: 4,
  c: 5,
  d: 6
}
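
The splicing rules illustrated above can be modeled with a small Python sketch. Everything here is a toy model and an assumption made for illustration: Ion lists are Python lists, symbols are Python strings, a struct is a Struct holding (field name, value) pairs so that repeated field names are preserved, a field name of None marks an E-expression in field name position, and the three macros are hard-coded stand-ins for values, none, and make_struct.

```python
from dataclasses import dataclass


@dataclass
class EExp:
    name: str
    args: tuple = ()


@dataclass
class Struct:
    fields: list  # (field_name, value) pairs; field_name None = field name position


# Hypothetical stand-ins for the macros used in the examples above.
MACROS = {
    "values": lambda *args: list(args),
    "none": lambda: [],
    "make_struct": lambda name, value: [Struct([(name, value)])],
}


def expand(value):
    """Return the stream (a list) of Ion values that `value` evaluates to."""
    if isinstance(value, EExp):
        return list(MACROS[value.name](*value.args))
    if isinstance(value, list):
        # Each child's stream is spliced into the enclosing sequence.
        return [[v for child in value for v in expand(child)]]
    if isinstance(value, Struct):
        out = []
        for name, child in value.fields:
            for v in expand(child):
                if name is None:
                    if not isinstance(v, Struct):
                        raise ValueError("field name position requires a struct")
                    out.extend(v.fields)    # merge the struct's fields
                else:
                    out.append((name, v))   # one field per value in the stream
        return [Struct(out)]
    return [value]


doc = Struct([("a", EExp("values", (1, 2, 3))), ("b", 4),
              (None, EExp("make_struct", ("c", 5))), ("d", 6),
              ("e", EExp("none"))])
print(expand(doc))
```

A top-level stream would be produced the same way, by concatenating expand(item) for each top-level item.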

Encoding context and modules

In Ion 1.0, there is a single encoding context which is the local symbol table. In Ion 1.1, the encoding context becomes the following:

  • The local symbol table which is a list of strings. This is used to encode/decode symbolic tokens.

  • The local macro table which is a list of macros. This is used to reference macros that can be invoked by E-expressions.

  • A mapping of a string name to module which is an organizational unit of symbol definitions and macro definitions. Within the encoding context, this name is unique and used to address a module's contents either as the list of symbols to install into the local symbol table, the list of macros to install into the local macro table, or to qualify the name of a macro in a text E-expression or the definition of a macro.

The module is a new concept in Ion 1.1. It contains:

  • A list of strings representing the symbol table of the module.

  • A list of macro definitions.

Modules can be imported from the catalog (they subsume shared symbol tables), but can also be defined locally. Modules are referenced as a group to allocate entries in the local symbol table and local macro table (e.g., the local symbol table is initially, implicitly allocated with the symbols in the $ion module).

Ion 1.1 introduces a new system value (an encoding directive) for the encoding context (see the TBD section for details.)

Ion encoding directive example

$ion_encoding::{
  modules:         [ /* module declarations - including imports */ ],
  install_symbols: [ /* names of declared modules */ ],
  install_macros:  [ /* names of declared modules */ ]
}

This is still being actively worked on and is provisional.

Macro definitions

Macros can be defined by a user either directly in a local module within an encoding directive or in a module defined externally (i.e., shared module). A macro has a name which must be unique in a module or it may have no name.

Ion 1.1 defines a list of system macros that are built into the module named $ion. Unlike the system symbol table, which is always installed and accessible in the local symbol table, the system macros are always accessible to E-expressions but are not installed in the local macro table by default.

In Ion binary, macros are always addressed in E-expressions by their offset in the local macro table. System macros may be addressed by the system macro identifier using a specific opcode. In Ion text, a macro may be addressed by its offset in the local macro table (mirroring binary), by its name if the name is unambiguous within the local encoding context, or by qualifying the macro name/offset with the module name in the encoding context. An E-expression can only refer to macros installed in the local macro table or to a macro from the system module. In text, an E-expression referring to a system macro that is not installed in the local macro table must use a qualified name with the $ion module name.

For illustrative purposes, let's consider a module named foo that has a macro named bar at offset 5 and that is installed at the beginning of the local macro table.

E-expressions name resolution

// allowed if there are no other macros named 'bar' 
(:bar)

// fully qualified by module--always allowed
(:foo:bar)

// by local macro table offset
(:5)

// In text, system macros are always addressable by name.
// In binary, system macros may be invoked using a separate
// opcode.
(:$ion:none)
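
The text resolution rules above can be approximated with a short Python sketch. It is a simplification built on assumptions: the local macro table is modeled as a list of (module name, macro name) pairs whose index is the macro's address, system macros in $ion are not modeled, and the function name resolve_macro is hypothetical.

```python
def resolve_macro(macro_table, name, module=None):
    """Return the local macro table address for a (possibly qualified) name."""
    matches = [
        address
        for address, (mod, macro) in enumerate(macro_table)
        if macro == name and (module is None or mod == module)
    ]
    if not matches:
        raise LookupError(f"no installed macro named {name!r}")
    if len(matches) > 1:
        raise LookupError(f"{name!r} is ambiguous; qualify it with a module name")
    return matches[0]


# Module `foo` is installed first, with `bar` at offset 5; a second,
# hypothetical module `baz` also defines a macro named `bar`.
table = [("foo", f"m{i}") for i in range(5)] + [("foo", "bar"), ("baz", "bar")]
print(resolve_macro(table, "bar", module="foo"))
```

With this table, the unqualified reference to bar is rejected as ambiguous, while the qualified reference and the address 5 both identify the same macro.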

Macro definition language

User-defined macros are defined by their parameters and their template, which together determine how they are invoked and what stream of data they evaluate to. The template is defined using a domain-specific Ion macro definition language with S-expressions. A template defines a list of zero or more parameters that it can accept. Each parameter has its own cardinality of expression arguments, which can be specified as exactly one, zero or one, zero or more, or one or more. Furthermore, the template defines what type of argument can be accepted by each of these parameters:

  • "Tagged" values, whose encodings always begin with an opcode.
  • "Tagless" values, whose encodings do not begin with an opcode and are therefore both more compact and less flexible (For example: flex_int, int32, float16).
  • Specific macro shaped arguments to allow for structural composition of macros and efficient encoding in binary.

The macro definition includes a template body that defines how the macro is expanded. In the language, system macros, macros defined in previously defined modules in the encoding context, and macros defined previously in the current module are available to be invoked with (name ...) syntax where name is the macro to be invoked. Certain names in the expression syntax are reserved for special forms (for example, literal and if_none). When a macro name is shadowed by a special form, or is ambiguous with respect to all macros visible, it can always be qualified with (':module:name' ...) syntax where module is the name of the module and name is the offset or name of the macro. Referring to a previously defined macro name within a module may be qualified with (':name' ...) syntax.

Shared Modules

Ion 1.1 extends the concept of a shared symbol table to be a shared module. An Ion 1.0 shared symbol table is a shared module with no macro definitions. A new schema for the convention of serializing shared modules in Ion is introduced in Ion 1.1. An Ion 1.1 implementation should support both Ion 1.0 shared symbol tables and Ion 1.1 shared modules in its catalog.

System Symbol Table Changes

The system symbol table in Ion 1.1 replaces the Ion 1.0 symbol table with new symbols. However, the system symbols are not required to be in the symbol table—they are always available to use.

Macros

Like other self-describing formats, Ion 1.0 makes it possible to write a stream with truly arbitrary content--no formal schema required. However, in practice all applications have a de facto schema, with each stream sharing large amounts of predictable structure and recurring values. This means that Ion readers and writers often spend substantial resources processing undifferentiated data.

Consider this example excerpt from a webserver's log file:

{
  method: GET,
  statusCode: 200,
  status: "OK",
  protocol: https,
  clientIp: ip_addr::"192.168.1.100",
  resource: "index.html"
}
{
  method: GET,
  statusCode: 200,
  status: "OK",
  protocol: https,
  clientIp: ip_addr::"192.168.1.100",
  resource: "images/funny.jpg"
}
{
  method: GET,
  statusCode: 200,
  status: "OK",
  protocol: https,
  clientIp: ip_addr::"192.168.1.101",
  resource: "index.html"
}

Macros allow users to define fill-in-the-blank templates for their data. This enables applications to focus on encoding and decoding the parts of the data that are distinctive, eliding the work needed to encode the boilerplate.

Using this macro definition:

(macro getOk (clientIp resource)
  {
    method: GET,
    statusCode: 200,
    status: "OK",
    protocol: https,
    clientIp: (.annotate "ip_addr" (%clientIp)),
    resource: (%resource)
  })

The same webserver log file could be written like this:

(:getOk "192.168.1.100" "index.html")
(:getOk "192.168.1.100" "images/funny.jpg")
(:getOk "192.168.1.101" "index.html")

Macros are an encoding-level concern, and their use in the data stream is invisible to consuming applications. For writers, macros are always optional--a writer can always elect to write their data using value literals instead.

For a guided walkthrough of what macros can do, see Macros by example.

Defining macros

A macro is defined using a macro clause within a module's macro_table clause.

Syntax

(macro name signature template)

Argument    Description
name        A unique name assigned to the macro or--to construct an anonymous macro--null.
signature   An s-expression enumerating the parameters this macro accepts.
template    A template definition language (TDL) expression that can be evaluated to produce zero or more Ion values.

Example macro clause

//      ┌─── name
//      │     ┌─── signature
//     ┌┴┐ ┌──┴──┐
(macro foo (x y z)
  {           // ─┐
    x: (%x),  //  │
    y: (%y),  //  ├─ template
    z: (%z),  //  │
  }           // ─┘
)

Macro names

Syntactically, macro names are identifiers. Each macro name in a macro table must be unique.

In some circumstances, it may not make sense to name a macro. (For example, when the macro is generated automatically.) In such cases, authors may set the macro name to null or null.symbol to indicate that the macro does not have a name. Anonymous macros can only be referenced by their address in the macro table.

Macro Parameters

A parameter is a named stream of Ion values. The stream's contents are determined by the macro's invocation. A macro's parameters are declared in the macro signature.

Each parameter declaration consists of three elements:

  1. A name
  2. An optional encoding
  3. An optional cardinality

Example parameter declaration

//     ┌─── encoding
//     │      ┌─── name
//     │      │┌─── cardinality
// ┌───┴───┐  ││
   flex_uint::x*

Parameter names

A parameter's name is an identifier. The name is required; any non-identifier (including null, quoted symbols, $0, or a non-symbol) found in parameter-name position will cause the reader to raise an error.

All of a macro's parameters must have unique names.

Parameter encodings

In binary Ion, the default encoding for all parameters is tagged. Each argument passed into the macro from the callsite is prefixed by an opcode (or "tag") that indicates the argument's type and length.

Parameters may choose to specify an alternative encoding to make the corresponding arguments' binary representation more compact and/or fixed width. These "tagless" encodings do not begin with an opcode, an arrangement which saves space but also limits the domain of values they can each represent. Arguments passed to tagless parameters cannot be null, cannot be annotated, and may have additional range restrictions.

To specify an encoding, the parameter name is annotated with one of the following tokens:

Tagless encodings            Description
flex_int                     Variable-width, signed int
flex_uint                    Variable-width, unsigned int
int8 int16 int32 int64       Fixed-width, signed int
uint8 uint16 uint32 uint64   Fixed-width, unsigned int
float16 float32 float64      Fixed-width float
flex_symbol                  FlexSym-encoded SID or text

When writing text Ion, the declared encoding does not affect how values are serialized. However, it does constrain the domain of values that the parameter will accept. When transcribing from text to binary, it must be possible to serialize all values passed as an argument using the parameter's declared encoding. This means that an argument passed to a parameter with a tagless encoding cannot be annotated and cannot be a null of any type. If an int or a float is being passed to a parameter with a fixed-width encoding, that value must fit within the range of values that can be represented by that width. For example, the value 256 cannot be passed to a parameter with an encoding of uint8 because a uint8 can only represent values in the range [0, 255].
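
The range restriction on fixed-width integer encodings can be checked with a few lines of Python. This is a sketch only: the table and function names are hypothetical, and it covers the fixed-width integer encodings but not the float or flex encodings.

```python
# Inclusive (low, high) bounds for each fixed-width integer encoding.
INT_RANGES = {}
for bits in (8, 16, 32, 64):
    INT_RANGES[f"int{bits}"] = (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    INT_RANGES[f"uint{bits}"] = (0, (1 << bits) - 1)


def fits_fixed_width(encoding: str, value: int) -> bool:
    """Report whether `value` is representable in the named encoding."""
    low, high = INT_RANGES[encoding]
    return low <= value <= high


print(fits_fixed_width("uint8", 255), fits_fixed_width("uint8", 256))
```

A text-to-binary transcoder could run a check like this before writing a tagless argument and raise an error when it fails.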

Parameter cardinalities

A parameter name may optionally be followed by a cardinality modifier. This is a sigil that indicates how many values the parameter expects the corresponding argument expression to produce when it is evaluated.

Modifier   Cardinality
?          zero-or-one value
*          zero-or-more values
!          exactly-one value
+          one-or-more values

If no modifier is specified, the parameter's cardinality will default to exactly-one. An exactly-one parameter will always expand to a stream containing a single value.

Parameters with a cardinality other than exactly-one are called variadic parameters.

If an argument expression expands to a number of values that the cardinality forbids, the reader must raise an error.
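
The cardinality check can be sketched in Python as a lookup of the allowed value count for each modifier. The names below are hypothetical, and the expanded argument is modeled simply as a Python list of values.

```python
# (minimum, maximum) number of values; None means unbounded.
CARDINALITY_BOUNDS = {
    "!": (1, 1),     # exactly-one
    "?": (0, 1),     # zero-or-one
    "*": (0, None),  # zero-or-more
    "+": (1, None),  # one-or-more
}


def check_cardinality(modifier, values):
    """Return `values` unchanged, or raise if the count is forbidden."""
    low, high = CARDINALITY_BOUNDS[modifier or "!"]  # no modifier means exactly-one
    count = len(values)
    if count < low or (high is not None and count > high):
        raise ValueError(f"cardinality {modifier or '!'} forbids {count} value(s)")
    return values


print(check_cardinality("*", [1, 2, 3]))
```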

Optional parameters

Parameters with a cardinality that can accept an empty expression group as an argument (? and *) are called optional parameters. In text Ion, their corresponding arguments can be elided from e-expressions and TDL macro invocations when they appear in tail position. When an argument is elided, it is treated as though an explicit empty group (::) had been passed in its place.

In contrast, parameters with a cardinality that cannot accept an empty group (! and +) are called required parameters. Required parameters can never be elided.

(:set_macros
    (foo (x y? z*) // `x` is required, `y` and `z` are optional
        [(%x), (%y), (%z)]
    )
)

// `z` is a populated expression group
(:foo 1 2 (:: 3 4 5)) => [1, 2, 3, 4, 5]

// `z` is an empty expression group
(:foo 1 2 (::))       => [1, 2]

// `z` has been elided
(:foo 1 2)            => [1, 2]

// `y` and `z` have been elided
(:foo 1)              => [1]

// `x` cannot be elided
(:foo)                => ERROR: missing required argument `x`

Optional parameters that are not in tail position cannot be elided, as this would cause them to appear in a position corresponding to a different argument.

(:set_macros
    (foo (x? y) // `x` is optional, `y` is required
        [(%x), (%y)]
    )
)

(:foo (::) 1) => [(::), 1] => [1]
(:foo 1)                   => ERROR: missing required argument `y`
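
The elision rule can be sketched as positional binding with padding for trailing optional parameters. In this Python sketch, which is an illustration under assumptions rather than a description of any implementation, a signature is a list of (name, modifier) pairs and each argument is a list holding the values of its expression group. Cardinality checking of the supplied groups is omitted.

```python
def bind_arguments(signature, args):
    """Bind positional argument groups to parameter names.

    Trailing parameters with no argument are bound to an empty group when
    they are optional (`?` or `*`); otherwise an error is raised.
    """
    if len(args) > len(signature):
        raise ValueError("too many arguments")
    bound = {}
    for position, (name, modifier) in enumerate(signature):
        if position < len(args):
            bound[name] = args[position]
        elif modifier in ("?", "*"):
            bound[name] = []  # elided: treated as an explicit empty group
        else:
            raise ValueError(f"missing required argument `{name}`")
    return bound


# (x y? z*) invoked with a single argument: y and z are elided.
print(bind_arguments([("x", "!"), ("y", "?"), ("z", "*")], [[1]]))
```

Because binding is positional, an argument supplied to (x? y) always lands on x first, which is why the single-argument invocation above fails for that signature.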

Macro signatures

A macro's signature is the ordered sequence of parameters which an invocation of that macro must define. Syntactically, the signature is an s-expression of parameter declarations.

Example macro signature

(w flex_uint::x* float16::y? z+)

Name   Encoding    Cardinality
w      tagged      exactly-one
x      flex_uint   zero-or-more
y      float16     zero-or-one
z      tagged      one-or-more
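
To make the mapping from declaration text to the table above concrete, here is a small Python sketch that splits each declaration into its name, encoding, and cardinality. It is not a real Ion parser: the regular expression uses a simplified notion of an identifier (for example, it does not reject $0), it does not validate encoding names, and the function names are hypothetical.

```python
import re

# encoding:: (optional), then the name, then a cardinality sigil (optional)
PARAMETER = re.compile(
    r"^(?:(?P<encoding>[A-Za-z_][A-Za-z0-9_]*)::)?"
    r"(?P<name>[A-Za-z_$][A-Za-z0-9_$]*)"
    r"(?P<cardinality>[?*!+])?$"
)

CARDINALITY_NAMES = {
    "?": "zero-or-one",
    "*": "zero-or-more",
    "!": "exactly-one",
    "+": "one-or-more",
}


def parse_parameter(declaration: str):
    """Return (name, encoding, cardinality) with defaults filled in."""
    match = PARAMETER.match(declaration)
    if match is None:
        raise ValueError(f"invalid parameter declaration: {declaration!r}")
    return (
        match["name"],
        match["encoding"] or "tagged",                   # default encoding
        CARDINALITY_NAMES[match["cardinality"] or "!"],  # default cardinality
    )


def parse_signature(text: str):
    """Parse a whitespace-separated signature such as '(w x* y?)'."""
    return [parse_parameter(token) for token in text.strip("()").split()]


for row in parse_signature("(w flex_uint::x* float16::y? z+)"):
    print(row)
```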

Template definition language (TDL)

The macro's template is a single Ion value that defines how a reader should expand invocations of the macro. Ion 1.1 introduces a template definition language (TDL) to express this process in terms of the macro's parameters. TDL is a small language with only a few constructs.

A TDL expression can be any of the following:

  1. A literal Ion scalar
  2. A macro invocation
  3. A variable expansion
  4. A quasi-literal Ion container
  5. A special form

In terms of its encoding, TDL is "just Ion." As you shall see in the following sections, the constructs it introduces are written as s-expressions with a distinguishing leading value or values.

A grammar for TDL can be found at the end of this chapter.

Ion scalars

Ion scalars are interpreted literally. These include values of any type except list, sexp, and struct. null values of any type—even null.list, null.sexp, and null.struct—are also interpreted literally.

Examples

These macros are constants; they take no parameters. When they are invoked, they expand to a stream of a single value: the Ion scalar acting as the template expression.

$ion_encoding::(
  (macro_table
    (macro greeting () "hello")
    (macro birthday () 1996-10-11)
    // Annotations are also literal
    (macro price () USD::29.95)
  )
)

(:greeting) => "hello"
(:birthday) => 1996-10-11
(:price)    => USD::29.95

Macro invocations

Macro invocations call an existing macro. The invoked macro could be a system macro, a macro imported from a shared module, or a macro previously defined in the current scope.

Syntactically, a macro invocation is an s-expression whose first value is the operator . and whose second value is a macro reference.

Grammar
macro-invocation   ::= '(.' macro-ref macro-arg* ')'

macro-ref          ::= (module-name '::')? (macro-name | macro-address)

macro-arg          ::= expression | expression-group

macro-name         ::= ion-identifier

macro-address      ::= unsigned-ion-integer

expression-group   ::= '(..' expression* ')'

Invocation syntax illustration

// Invoking a macro defined in the same module by name.
(.macro_name              arg1 arg2 /*...*/ argN)

// Invoking a macro defined in another module by name.
(.module_name::macro_name arg1 arg2 /*...*/ argN)

// Invoking a macro defined in the same module by its address.
(.0              arg1 arg2 /*...*/ argN)

// Invoking a macro defined in a different module by its address.
(.module_name::0 arg1 arg2 /*...*/ argN)

Examples

$ion_encoding::(
  (macro_table
    // Calls the system macro `values`, allowing it to produce a stream of three values.
    (macro nephews () (.values Huey Dewey Louie))

    // Calls a macro previously defined in this module, splicing its result
    // stream into a list.
    (macro list_of_nephews () [(.nephews)])
  )
)

(:nephews)         => Huey Dewey Louie
(:list_of_nephews) => [Huey, Dewey, Louie]

IMPORTANT: There are no forward references in TDL. If a macro definition includes an invocation of a name or address that is not already valid, the reader must raise an error.

$ion_encoding::(
  (macro_table
    (macro list_of_nephews () [(.nephews)])
    //                          ^^^^^^^^
    // ERROR: Calls a macro that has not yet been defined in this module.
    (macro nephews () (.values Huey Dewey Louie))
  )
)

Variable expansion

Templates can insert the contents of a macro parameter into their output by using a variable expansion, an s-expression whose first value is the operator % and whose second and final value is the variable name of the parameter to expand.

If the variable name does not match one of the declared macro parameters, the implementation must raise an error.

Grammar
variable-expansion ::= '(%' variable-name ')'

variable-name      ::= ion-identifier

Examples

$ion_encoding::(
  (macro_table
    // Produces a stream that repeats the content of parameter `x` twice.
    (macro twice (x*) (.values (%x) (%x)))
  )
)

(:twice foo)     => foo foo
(:twice "hello") => "hello" "hello"
(:twice 1 2 3)   => 1 2 3 1 2 3

Quasi-literal Ion containers

When an Ion container appears in a template definition, it is interpreted almost literally.

Each nested value in the container is inspected.

  • If the value is an Ion scalar, it is added to the output as-is.
  • If the value is a variable expansion, the stream bound to that variable name is added to the output. The variable expansion literal (for example: (%name)) is discarded.
  • If the value is a macro invocation, the invocation is evaluated and the resulting stream is added to the output. The macro invocation literal (for example: (.name 1 2 3)) is discarded.
  • If the value is a container, the reader will recurse into the container and repeat this process.

Expansion within a sequence

When the container is a list or s-expression, the values in the nested expression's expansion are spliced into the sequence at the site of the expression. If the expansion was empty, no values are spliced into the container.

$ion_encoding::(
  (macro_table
    (macro bookend_list (x y*) [(%x), (%y), (%x)])
    (macro bookend_sexp (x y*) ((%x) (%y) (%x)))
  )
)

(:bookend_list ! a b c) => ['!', a, b, c, '!']
(:bookend_sexp ! a b c) => (! a b c !)

(:bookend_sexp !) => (! !)
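
The splicing behavior for sequences can be sketched in Python. This toy model rests on assumptions made only for illustration: Ion lists are Python lists, S-expressions are Python tuples, symbols are strings, a variable expansion such as (%x) is a Var object, and macro invocations inside the template are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Var:
    name: str  # stands in for the variable expansion (%name)


def expand_template(template, bindings):
    """Return the stream (a list) produced by a template expression."""
    if isinstance(template, Var):
        # An unknown variable name raises KeyError, mirroring the required error.
        return list(bindings[template.name])
    if isinstance(template, (list, tuple)):
        spliced = [v for child in template
                   for v in expand_template(child, bindings)]
        return [type(template)(spliced)]  # keep list vs. s-expression shape
    return [template]  # scalars are literal


bookend_list = [Var("x"), Var("y"), Var("x")]
print(expand_template(bookend_list, {"x": ["!"], "y": ["a", "b", "c"]}))
```

An empty stream bound to y simply contributes nothing to the container, which corresponds to the last example above.
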
Expansion within a struct

When the container is a struct, the expansion of each field value is paired with the corresponding field name. If the expansion produces a single value, a single field with that name will be spliced into the parent struct. If the expansion produces multiple values, a field with that name will be created for each value and spliced into the parent struct. If the expansion was empty, no fields are spliced into the parent struct.

Examples
$ion_encoding::(
  (macro_table
    (macro resident (id names*)
        {
            town: "Riverside",
            id: (.make_string "123-" (%id)),
            name: (%names)
        }
     )
  )
)

(:resident "abc" "Alice") =>
{
  town: "Riverside",
  id: "123-abc",
  name: "Alice"
}

(:resident "def" "John" "Jacob" "Jingleheimer" "Schmidt") =>
{
  town: "Riverside",
  id: "123-def",
  name: "John",
  name: "Jacob",
  name: "Jingleheimer",
  name: "Schmidt",
}

(:resident "ghi") =>
{
  town: "Riverside",
  id: "123-ghi",
}

Special forms

special-form       ::= '(.' ('$ion::')?  special-form-name expression* ')'

special-form-name  ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi'

Special forms are similar to macro invocations, but they have their own expansion rules. See Special forms for the list of special forms and a description of each.

Note that, unlike macro invocations, special forms cannot accept expression groups.

Macros by example

Before getting into the technical details of Ion’s macro and module system, it will help to be more familiar with the use of macros. We’ll step through increasingly sophisticated use cases, some admittedly synthetic for illustrative purposes, with the intent of teaching the core concepts and moving parts without getting into the weeds of more formal specification.

Ion macros are defined using a domain-specific language that is in turn expressed via the Ion data model. That is, macro definitions are Ion data, and use Ion features like S-expressions and symbols to represent code in a Lisp-like fashion. In this document, the fundamental construct we explore is the macro definition, denoted using an S-expression of the form (macro name …) where macro is a keyword and name must be a symbol denoting the macro's name.

NOTE: S-expressions of that shape only declare macros when they occur in the context of an encoding module. We will completely ignore modules for now, and the examples below omit this context to keep things simple.

Constants

The most basic macro is a constant:

(macro pi            // name
  ()                 // signature
  3.141592653589793) // template

This declaration defines a macro named pi. The () is the macro’s signature, in this case a trivial one that declares no parameters. The 3.141592653589793 is a similarly trivial template, an expression in Ion 1.1's domain-specific language for defining macro functions. This macro accepts no arguments and always returns a constant value.

To use pi in an Ion document, we write an encoding expression or E-expression:

$ion_1_1
(:pi)

The syntax (:pi) looks a lot like an S-expression. It’s not, though, since colons cannot appear unquoted in that context. Ion 1.1 makes use of syntax that is not valid in Ion 1.0—specifically, the (: digraph—to denote E-expressions. Those characters must be followed by a reference to a macro, and we say that the E-expression is an invocation of the macro. Here, (:pi) is an invocation of the macro named pi.

note

We also call these “smile expressions” when we’re feeling particularly casual. (:

That document is equivalent to the following, in the sense that they denote the same data:

$ion_1_1
3.141592653589793

The process by which the Ion implementation turns the former document into the latter is called macro expansion or just expansion. This happens transparently to Ion-consuming applications: the stream of values in both cases is the same. The documents have the same content, encoded in two different ways. It’s reasonable to think of (:pi) as a custom encoding for 3.141592653589793, and the notation’s similarity to S-expressions leads us to the term “encoding expression” (or "e-expression").

note

Any Ion 1.1 document with macros can be fully expanded into an equivalent Ion 1.0 document.

We can streamline future examples with a couple of conventions. First, assume that any E-expression is occurring within an Ion 1.1 document; second, we use the relation notation, ⇒, to mean “expands to”. So we can say:

(:pi) ⇒ 3.141592653589793

Parameters and variable expansion

Most macros are not constant--they accept inputs that determine their results.

(macro passthrough
  (x)   // signature
  (%x)  // template
)

This macro has a signature that declares a parameter called x, and it therefore requires one argument to be passed in when it is invoked. This creates a variable (i.e. named data) called x that can be referred to within the context of the template.

note

We are careful to distinguish between the views from “inside” and “outside” the macro: parameters are the names used by a macro’s implementation to refer to its expansion-time inputs, while arguments are the data provided to a macro at the point of invocation. In other words, we have “formal” parameters and “actual” arguments.

The body of this macro is our first non-trivial template, an expression in Ion’s new domain-specific language for defining macro functions. This template definition language (TDL) treats Ion scalar values as literals, giving the decimal in pi’s template its intended meaning.

In this example, the template expression (%x) is a variable expansion in the form (%variable_name). During macro evaluation, variable expansions are replaced by the contents of the referenced variable. Because this macro's template is an expansion of its only parameter, x, invoking the macro will produce the same value it was given as an argument.

(:passthrough 1)         => 1
(:passthrough "foo")     => "foo"
(:passthrough [a, b, c]) => [a, b, c]

Simple Templates

Here's a more realistic macro:

(macro price
  (a c)                             // signature
  { amount: (%a), currency: (%c) }) // template

This macro has a signature that declares two parameters named a and c. It therefore accepts two arguments when invoked.

(:price 99 USD) ⇒ { amount: 99, currency: USD }

Template expressions that are structs are interpreted almost literally; the field names are literal (which is why the amount and currency field names show up as-is in the expansion), but the field “values” are arbitrary expressions. We call these almost-literal forms quasi-literals.

The template definition language also treats lists quasi-literally, and every element inside the list is an expression. Here’s a silly macro to illustrate:

(macro two_item_list (a b) [(%a), (%b)])
(:two_item_list foo bar) ⇒ [foo, bar]

E-expressions can accept other e-expressions as arguments. For example:

(:two_item_list (:price 99 USD) foo)
//              └──────┬──────┘
//                     └─── passing another e-expression as an argument

Expansion happens from the "inside out". The outer e-expression receives the results from the expansion of the inner e-expression.

(:two_item_list (:price 99 USD) foo)

  // First, the inner invocation of `price` is expanded...
  => (:two_item_list {amount: 99, currency: USD} foo)

  // ...and then the outer invocation of `two_item_list` is expanded.
  => [{amount: 99, currency: USD}, foo]

Invoking Macros from Templates

Templates are able to invoke other macros. In TDL, an s-expression starting with a . and an identifier is an operator invocation, where operators are either macros or special forms, which we'll explore later.

(macro website_url
  (path)
  (.make_string "https://www.amazon.com/" (%path)))

This macro's template is an s-expression beginning with .make_string, so it is an invocation of a macro called make_string. make_string is a system macro (a built-in function) which concatenates its arguments to produce a single string.

(:website_url "gp/cart") ⇒ "https://www.amazon.com/gp/cart"

In TDL, it is legal for a macro invocation to appear anywhere that a value could appear. In this example, an invocation of make_string is being passed as an argument to an invocation of website_url.

(macro detail_page_url
  (asin)
  (.website_url (.make_string "dp/" (%asin))))
(:detail_page_url "B08KTZ8249") ⇒ "https://www.amazon.com/dp/B08KTZ8249"

note

This may not look like much of an improvement, but the full string

"https://www.amazon.com/dp/B08KTZ8249"

takes 38 bytes to encode while the macro invocation

(:detail_page_url "B08KTZ8249")

takes as few as 12 bytes in binary Ion. While text Ion spells out the macro name to be human-friendly, the binary Ion encoding uses the macro's integer address instead. Here's an illustration:

(:1 "B08KTZ8249")

This makes the e-expression both more compact and faster to decode. Readers can also avoid the cost of repeatedly validating the UTF-8 bytes of substrings that are 'baked into' the macro definition.

E-expressions Versus S-expressions

We've now seen two ways to invoke macros, and their difference deserves thorough exploration.

An E-expression is an encoding artifact of a serialized Ion document. It has no intrinsic meaning other than the fact that it represents a macro invocation. The meaning of the document can only be determined by expanding the macro, passing the E-expression's arguments to the function defined by the macro. This all happens as the Ion document is parsed, transparent to the reader of the document. In casual terms, E-expressions are expanded away before the application sees the data.

Within the template definition language, you can define new macros in terms of other macros, and those invocations are written as S-expressions. Unlike E-expressions, TDL macro invocations are normal Ion data structures, consumed by the Ion system and interpreted as TDL. Further, TDL macro invocations only have meaning in the context of a macro definition, inside an encoding module, while E-expressions can occur anywhere in an Ion document.

warning

It's entirely possible to write a macro that can generate all or part of a macro definition. We don't recommend that you spend time considering such things at this point.

These two invocation forms are syntactically aligned in their calling convention, but are distinct in context and "immediacy". E-expressions occur anywhere and are invoked immediately, as they are parsed. S-expression invocations occur only within macro definitions, and are only invoked if and when that code path is ever executed by invocation of the surrounding macro.

Rest Parameters

Sometimes we want a macro to accept an arbitrary number of arguments, in particular all the rest of them. The make_string macro is one of those, concatenating all of its arguments into a single string:

(:make_string)                 ⇒ ""
(:make_string "a")             ⇒ "a"
(:make_string "a" "b")         ⇒ "ab"
(:make_string "a" "b" "c")     ⇒ "abc"
(:make_string "a" "b" "c" "d") ⇒ "abcd"

To make this work, the declaration of make_string is effectively:

(macro make_string (parts*) /*...*/)

The * is a cardinality modifier. A parameter's cardinality dictates both the number of argument expressions it can accept and the number of values its expansion can produce.

In the examples so far, all parameters have had a cardinality of exactly-one, which is the default. The parts parameter has a cardinality of zero-or-more, meaning:

  1. It can accept zero-or-more argument expressions.
  2. When expanded, it will produce zero-or-more values.

When the final parameter in the macro signature is zero-or-more, "all of the rest" of the argument expressions will be passed to that parameter.

(:make_string)
//           └── 0 argument expressions passed to `parts`
(:make_string "a")
//            └┬┘
//             └── 1 argument expression passed to `parts`
(:make_string "a" "b" "c" "d")
//            └──────┬──────┘
//                   └── 4 argument expressions passed to `parts`

At this point our distinction between parameters and arguments becomes more apparent, since they are no longer one-to-one: this macro with one parameter can be invoked with one argument, or twenty, or none.

tip

To declare a final parameter that requires at least one rest-argument, use the + modifier.

Arguments and results are streams

The inputs to and results from a macro are modeled as streams of values. When a macro is invoked, each argument expression produces a stream of values, and within the macro definition, each parameter name refers to the corresponding stream, not to a specific value. The declared cardinality of a parameter constrains the number of elements produced by its stream, and is verified by the macro expansion system.

More generally, the results of all template expressions are streams. While most expressions produce a single value, various macros and special forms can produce zero or more values.

We have everything we need to illustrate this, via another system macro, values:

(macro values (vals*) (%vals))
(:values 1)           ⇒ 1
(:values 1 true null) ⇒ 1 true null
(:values)             ⇒ _nothing_

The values macro accepts any number of arguments and returns their values; it is effectively a multi-value identity function. We can use this to explore how streams combine in E-expressions.

Splicing in encoded data

At the top level, an e-expression's resulting values become top-level values.

(:values 1 2 3) => 1 2 3

When an E-expression appears within a list or S-expression, the resulting values are spliced into the surrounding container:

[first, (:values), last]          ⇒ [first, last]
[first, (:values "middle"), last] ⇒ [first, "middle", last]
(first (:values left right) last) ⇒ (first left right last)

This also applies wherever a tagged type can appear inside an E-expression:

(first (:values (:values left right) (:values)) last) ⇒ (first left right last)

Note that each argument-expression always maps to one parameter, even when that expression returns too-few or too-many values.

(macro reverse (a b)
  [(%b), (%a)])
(:reverse (:values 5 USD))   ⇒ // Error: 'reverse' expects 2 arguments, given 1
(:reverse 5 (:values) USD)   ⇒ // Error: 'reverse' expects 2 arguments, given 3
(:reverse (:values 5 6) USD) ⇒ // Error: argument 'a' expects 1 value, given 2

In this example, the parameters expect exactly one argument, producing exactly one value. When the cardinality allows multiple values, then the argument result-streams are concatenated. We saw this (rather subtly) above in the nested use of values, but can also illustrate using the rest-parameter to make_string, which we'll expand here in steps:

(:make_string (:values) a (:values b (:values c) d) e)
//              ^^^^^^ next
  ⇒ (:make_string a (:values b (:values c) d) e)
//                               ^^^^^^ next
  ⇒ (:make_string a (:values b c d) e)
//                    ^^^^^^ next
  ⇒ (:make_string a b c d e)
  ⇒ "abcde"
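The concatenation of argument result-streams can be modeled outside of Ion. The following Python sketch is illustrative only and is not part of the specification: the helper `lit` and the use of Python generators as streams are assumptions made here, and `make_string` is simplified to return a Python string rather than a one-element stream.

```python
def lit(value):
    """A literal argument expression: a stream containing one value."""
    yield value


def values(*arg_streams):
    """Model of the `values` macro: concatenate all argument streams."""
    for stream in arg_streams:
        yield from stream


def make_string(*arg_streams):
    """Model of `make_string`: join every value of every argument stream."""
    return "".join(values(*arg_streams))


# (:make_string (:values) a (:values b (:values c) d) e)
result = make_string(
    values(),                                      # contributes nothing
    lit("a"),
    values(lit("b"), values(lit("c")), lit("d")),  # nested streams flatten
    lit("e"),
)
print(result)  # abcde
```

Because each nested invocation simply yields into its caller, the nesting depth of `values` has no effect on the final order of the text fragments.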

Splicing within sequences is straightforward, but structs are trickier due to their key/value nature. When used in field-value position, each result from a macro is bound to the field-name independently, leading to the field being repeated or even absent:

{ name: (:values) }          ⇒ { }
{ name: (:values v) }        ⇒ { name: v }
{ name: (:values v ann::w) } ⇒ { name: v, name: ann::w }

An E-expression can even be used in place of a key-value pair, in which case it must return structs, which are merged into the surrounding container:

{ a:1, (:values), z:3 }             ⇒ { a:1, z:3 }
{ a:1, (:values {}), z:3 }          ⇒ { a:1, z:3 }
{ a:1, (:values {b:2}), z:3 }       ⇒ { a:1, b:2, z:3 }
{ a:1, (:values {b:2} {z:3}), z:3 } ⇒ { a:1, b:2, z:3, z:3 }

{ a:1, (:values key "value") } ⇒ // Error: struct expected for splicing into struct
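Both struct splicing rules can be sketched in a few lines of Python. This is a rough model under stated assumptions, not a normative algorithm: structs are represented as lists of (name, value) pairs so that repeated field names survive, and the entry tags "field" and "splice" as well as the function name `expand_struct` are hypothetical.

```python
def expand_struct(entries):
    """Expand struct entries into a flat list of (name, value) fields.

    Each entry is either
      ("field", name, stream)  - an expression in field-value position, or
      ("splice", stream)       - an E-expression in place of a key-value pair.
    Streams are plain lists of already-expanded values.
    """
    fields = []
    for entry in entries:
        if entry[0] == "field":
            _, name, stream = entry
            # Zero values drop the field; several values repeat it.
            fields.extend((name, value) for value in stream)
        else:
            _, stream = entry
            for value in stream:
                # In this model only structs (lists of pairs) may be merged.
                if not isinstance(value, list):
                    raise TypeError("struct expected for splicing into struct")
                fields.extend(value)
    return fields


# { a:1, (:values {b:2} {z:3}), z:3 }
merged = expand_struct([
    ("field", "a", [1]),
    ("splice", [[("b", 2)], [("z", 3)]]),
    ("field", "z", [3]),
])
print(merged)  # [('a', 1), ('b', 2), ('z', 3), ('z', 3)]
```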

Splicing in template expressions

The preceding examples demonstrate splicing of E-expressions into encoded data, but similar stream-splicing occurs within the template language, making it trivial to convert a stream to a list:

(macro list_of (vals*) [ (%vals) ])
(macro clumsy_bag (elts*) { '': (%elts) })
(:list_of)   ⇒ []
(:clumsy_bag) ⇒ {}

(:list_of 1 2 3)    ⇒ [1, 2, 3]
(:clumsy_bag true 2) ⇒ {'':true, '':2}

Mapping templates over streams: for

Another way to produce a stream is via a mapping form. The for special form evaluates a template once for each value provided by a stream or streams. Each time, a local variable is created and bound to the next value on the stream.

(macro prices (currency amounts*)
  (.for
    // Binding pairs
    [(amt (%amounts))]
    //└┬┘ └────┬───┘
    // │       └─── stream to map over
    // └─────────── variable name

    // Template
    (.price (%amt) (%currency))
  )
)

The first subform of for is a list of binding pairs, S-expressions containing a variable name and a series of TDL expressions. Here, that TDL expression series is a single parameter expansion, so each individual value from the amounts stream is bound to the name amt before the price invocation is expanded.

(:prices GBP 10 9.99 12.)
  ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP} {amount:12., currency:GBP}

More than one stream can be iterated in parallel, and iteration terminates when any stream becomes empty.

(macro zip (front* back*)
  (.for [(f (%front)),
        (b (%back))]
    [(%f), (%b)]))

(:zip (:values 1 2 3) (:values a b))
  ⇒ [1, a] [2, b]
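The parallel-iteration rule behaves much like a lock-step zip. A minimal Python sketch follows, assuming streams are modeled as Python lists and the template as a function that returns one value per step; `for_each` is a hypothetical name, not part of Ion.

```python
def for_each(streams, template):
    """Model of `for`: bind one value from each stream per step and
    evaluate the template; stop as soon as any stream is exhausted."""
    for bound_values in zip(*streams):
        yield template(*bound_values)


# (:zip (:values 1 2 3) (:values a b))
pairs = list(for_each([[1, 2, 3], ["a", "b"]], lambda f, b: [f, b]))
print(pairs)  # [[1, 'a'], [2, 'b']]
```

The third value of the first stream is never bound, because the second stream is already empty at that step.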

Empty streams: none

The empty stream is an important edge case that requires careful handling and communication. The built-in macro none accepts no values and produces an empty stream:

(macro list_of (items*) [(%items)])

(:list_of (:none)) ⇒ []
(:list_of 1 (:none) 2) ⇒ [1, 2]
[(:none)]   ⇒ []
{a:(:none)} ⇒ {}

When used as a macro argument, a none invocation (like any other expression) counts as one argument:

(:pi (:none)) ⇒ // Error: 'pi' expects 0 arguments, given 1

The special form (::) is an empty argument expression group, similar to (:none) but used specifically to express the absence of an argument:

(:int_list (::)) ⇒ []
(:int_list 1 (::) 2) ⇒ [1, 2]

TIP: While none and values both produce the empty stream, the former is preferred for clarity of intent and terminology.

Cardinality

As described earlier, parameters are all streams of values, but the number of values can be controlled by the parameter's cardinality. So far we have seen the default exactly-one and the * (zero-or-more) cardinality modifiers, and in total there are four:

Modifier    Cardinality
!           exactly-one value
?           zero-or-one value
+           one-or-more values
*           zero-or-more values

Exactly-One

Many parameters expect exactly one value and thus have exactly-one cardinality. This is the default cardinality, but the ! modifier can be used for clarity.

This cardinality means that the parameter requires a stream producing a single value, so one might refer to them as singleton streams or just singletons colloquially.

Zero-or-One

A parameter with the modifier ? has zero-or-one cardinality, which is much like exactly-one cardinality, except the parameter accepts an empty-stream argument as a way to denote an absent parameter.

(macro temperature (degrees scale?)
  {
    degrees: (%degrees),
    scale: (%scale)
  })

Since the scale parameter accepts the empty stream, we can pass it an empty argument group:

(:temperature 96 F)    ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283}

Note that the result’s scale field has disappeared because no value was provided. It would be more useful to fill in a default value, which we can achieve with the default system macro:

(macro temperature (degrees scale?)
  {
    degrees: (%degrees),
    scale: (.default (%scale) K)
  })
(:temperature 96 F)    ⇒ {degrees:96,  scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}

To refine things a bit further, trailing arguments that accept the empty stream can be omitted entirely:

(:temperature 283) ⇒ {degrees:283, scale:K}

tip

The default macro is implemented with the help of a special form that can detect the empty stream: if_none.

Zero-or-More

A parameter with the modifier * has zero-or-more cardinality.

(macro prices (amount* currency)
  (.for [(amt (%amount))]
    (.price (%amt) (%currency))))

When * is on a non-final parameter, we cannot take “all the rest” of the arguments and must use a different calling convention to draw the boundaries of the stream. Instead, we need a single expression that produces the desired values:

(:prices (::) JPY)          ⇒ // empty stream
(:prices 54 CAD)           ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP)  ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}

Here we use a non-empty argument group (:: /*...*/) to delimit the multiple elements of the amount stream.

One-or-More

A parameter with the modifier + has one-or-more cardinality, which works like * except:

  1. + parameters cannot accept the empty stream
  2. When expanded, + parameters must produce at least one value.

To continue using our prices example:

(macro prices (amount+ currency)
  (.for [(amt (%amount))]
    (.price (%amt) (%currency))))
(:prices (::) JPY)          ⇒ // Error: `+` parameter received the empty stream
(:prices 54 CAD)           ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP)  ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}

On the final parameter, + collects the remaining (one or more) arguments:

(macro thanks (names+)
  (.make_string "Thank you to my Patreon supporters:\n"
    (.for [(name (%names))]
      (.make_string "  * " (%name) "\n"))))
(:thanks) ⇒ // Error: at least one value expected for + parameter

(:thanks Larry Curly Moe) =>
'''\
Thank you to my Patreon supporters:
  * Larry
  * Curly
  * Moe
'''
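The four cardinalities amount to lower and upper bounds on the number of values in a parameter's stream. The check can be sketched in Python as below; the table `CARDINALITY_BOUNDS`, the function `check_cardinality`, and the error wording are illustrative assumptions rather than anything defined by Ion.

```python
# (minimum, maximum) number of values per modifier; None means unbounded.
CARDINALITY_BOUNDS = {
    "!": (1, 1),     # exactly-one
    "?": (0, 1),     # zero-or-one
    "+": (1, None),  # one-or-more
    "*": (0, None),  # zero-or-more
}


def check_cardinality(name, modifier, values):
    """Return `values` unchanged if its length fits the modifier's bounds."""
    low, high = CARDINALITY_BOUNDS[modifier]
    count = len(values)
    if count < low or (high is not None and count > high):
        raise ValueError(
            f"parameter {name!r} ({modifier}) received {count} value(s)"
        )
    return values


check_cardinality("amount", "*", [])           # accepted: empty stream
check_cardinality("amount", "+", [10, 9.99])   # accepted: two values
try:
    check_cardinality("amount", "+", [])       # rejected: + needs a value
except ValueError as error:
    print(error)
```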

Argument Groups

The non-rest versions of multi-value parameters require some kind of delimiting syntax to contain the applicable sub-expressions. For the tagged-type parameters we've seen so far, you could use :values or some other macro to produce the stream, but that doesn't work for tagless types. The preferred syntax, supporting all argument types, is a special delimiting form called an argument group. Here is a macro to illustrate:

(macro prices
  (amount* currency)
  (.for [(amt (%amount))]
    (.price (%amt) (%currency))))

The parameter amount accepts any number of argument expressions. It's easy to provide exactly one:

(:prices 12.99 GBP) ⇒ {amount:12.99, currency:GBP}

To provide a non-singleton stream of values, use an argument group. Inside an E-expression, a group starts with (:: and ends with a matching closing parenthesis:

(:prices (::) GBP)       ⇒ _nothing_
(:prices (:: 1) GBP)     ⇒ {amount:1, currency:GBP}
(:prices (:: 1 2 3) GBP) ⇒ {amount:1, currency:GBP}
                           {amount:2, currency:GBP}
                           {amount:3, currency:GBP}

Within the group, the invocation can have any number of expressions that align with the parameter's encoding. The macro parameter produces the results of those expressions, concatenated into a single stream, and the expander verifies that each value on that stream is accepted by the parameter’s declared encoding.

(:prices (:: 1 (:values 2 3) 4) GBP) ⇒ {amount:1, currency:GBP}
                                       {amount:2, currency:GBP}
                                       {amount:3, currency:GBP}
                                       {amount:4, currency:GBP}

Argument groups may only appear inside macro invocations where the corresponding parameter has ?, *, or + cardinality. There is no binary opcode for these constructs; the encoding uses a tagless format to keep things as dense as possible. As usual, the text format mirrors this constraint.

warning

The allowed combinations of cardinality and argument groups are pending finalization of the binary encoding.

Optional Arguments

When a trailing parameter accepts the empty stream, an invocation can omit its corresponding argument expression, as long as no following parameter is being given an expression. We’ve seen this as applied to final * parameters, but it also applies to ? parameters:

(macro optionals (a* b? c! d* e? f*)
  [(%a), (%b), (%c), (%d), (%e), (%f)])

Since d, e, and f all accept the empty stream, they can be omitted by invokers. But c is required so a and b must always be present, at least as an empty group:

(:optionals (::) (::) "value for c") ⇒ ["value for c"]

Now c receives the string "value for c" while the other parameters are all empty. If we want to provide e, then we must also provide a group for d:

(:optionals (::) (::) "value for c" (::) "value for e")
  ⇒ ["value for c", "value for e"]
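The rules for matching argument expressions to parameters (one expression per parameter, a final * or + parameter taking all the rest, and omitted trailing arguments defaulting to the empty stream) can be sketched in Python. This is an informal model: `bind_arguments` is a hypothetical helper, each argument expression is represented by the list of values it produces (so `(::)` is `[]`), and per-parameter cardinality checks are deliberately left out.

```python
def bind_arguments(signature, arg_streams):
    """Map argument streams onto parameters.

    signature:   list of (name, modifier) pairs, modifier in "!?+*"
    arg_streams: one list of values per argument expression
    """
    remaining = list(arg_streams)
    bound = {}
    last = len(signature) - 1
    for index, (name, modifier) in enumerate(signature):
        if index == last and modifier in ("*", "+"):
            # Rest parameter: concatenate every remaining argument stream.
            bound[name] = [v for stream in remaining for v in stream]
            remaining = []
        elif remaining:
            bound[name] = list(remaining.pop(0))
        elif modifier in ("?", "*"):
            bound[name] = []  # omitted trailing argument: empty stream
        else:
            raise ValueError(f"missing argument for parameter {name!r}")
    if remaining:
        raise ValueError("too many argument expressions")
    return bound


OPTIONALS = [("a", "*"), ("b", "?"), ("c", "!"),
             ("d", "*"), ("e", "?"), ("f", "*")]

# (:optionals (::) (::) "value for c" (::) "value for e")
bound = bind_arguments(
    OPTIONALS, [[], [], ["value for c"], [], ["value for e"]]
)
print([v for name, _ in OPTIONALS for v in bound[name]])
```

Omitting the arguments for a and b entirely fails in this model, because c would then have no argument expression to bind.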

Tagless and fixed-width types

In Ion 1.0, the binary encoding of every value starts off with a “type tag”, an opcode that indicates the data-type of the next value and thus the interpretation of the following octets of data. In general, these tags also indicate whether the value has annotations, and whether it’s null.

These tags are necessary because the Ion data model allows values of any type to be used anywhere. Ion documents are not schema-constrained: nothing forces any part of the data to have a specific type or shape. We call Ion “self-describing” precisely because each value self-describes its type via a type tag.

If schema constraints are enforced through some mechanism outside the serializer/deserializer, the type tags are unnecessary and may add up to a non-trivial amount of wasted space. Furthermore, the overhead for each value also includes length information: encoding an octet of data takes two octets on the stream.

Ion 1.1 tries to mitigate this overhead in the binary format by allowing macro parameters to use more-constrained tagless types. These are subtypes of the concrete types, constrained such that type tags are not necessary in the binary form. In general this can shave 4-6 bits off each value, which can add up in aggregate. In the extreme, that octet of data can be encoded with no overhead at all.

The following tagless types are available:

Tagless type                  Description
flex_symbol                   Tagless symbol (SID or text)
flex_string                   Tagless string
flex_int                      Tagless, variable-width signed int
flex_uint                     Tagless, variable-width unsigned int
int8 int16 int32 int64        Fixed-width signed int
uint8 uint16 uint32 uint64    Fixed-width unsigned int
float16 float32 float64       Fixed-width float

To define a tagless parameter, just declare one of the primitive types:

(macro point (flex_int::x flex_int::y)
  {x: (%x), y: (%y)})
(:point 3 17) ⇒ {x:3, y:17}

The tagless encoding has no real benefit here in text, as primitive types aim to improve the binary encoding.

This density comes at the cost of flexibility. Primitive types cannot be annotated or null, and arguments cannot be expressed using macros, like we’ve done before:

(:point null.int 17)   ⇒ // Error: primitive flex_int does not accept nulls
(:point a::3 17)       ⇒ // Error: primitive flex_int does not accept annotations
(:point (:values 1) 2) ⇒ // Error: cannot use macro for a primitive argument

While Ion text syntax doesn’t use tags—the types are built into the syntax—these errors ensure that a text E-expression may only express things that can also be expressed using an equivalent binary E-expression.

For the same reasons, supplying a (non-rest) tagless parameter with no value, or with more than one value, can only be expressed by using an argument group.

A subset of the primitive types are fixed-width: they are binary-encoded with no per-value overhead.

(macro byte_array
  (uint8::bytes*)
  [(%bytes)])

Invocations of this macro are encoded as a sequence of untagged octets, because the macro definition constrains the argument shape such that nothing else is acceptable. A text invocation is written using normal ints:

(:byte_array 0 1 2 3 4 5 6 7 8) ⇒ [0, 1, 2, 3, 4, 5, 6, 7, 8]
(:byte_array 9 -10 11)          ⇒ // Error: -10 is not a valid uint8
(:byte_array 256)               ⇒ // Error: 256 is not a valid uint8
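The range check applied to fixed-width integer arguments can be illustrated with a short Python sketch. It assumes the conventional two's-complement range for the signed types, which this section does not state explicitly, and the function names are hypothetical.

```python
def fixed_int_range(type_name):
    """Return (low, high) for a type name such as 'uint8' or 'int32'."""
    unsigned = type_name.startswith("uint")
    bits = int(type_name[4:] if unsigned else type_name[3:])
    if unsigned:
        return 0, 2**bits - 1
    # Assumption: signed types use the usual two's-complement range.
    return -(2**(bits - 1)), 2**(bits - 1) - 1


def check_fixed_int(type_name, value):
    """Reject values that the fixed-width binary form could not hold."""
    low, high = fixed_int_range(type_name)
    if not low <= value <= high:
        raise ValueError(f"{value} is not a valid {type_name}")
    return value


print(fixed_int_range("uint8"))  # (0, 255)
for candidate in (8, -10, 256):
    try:
        check_fixed_int("uint8", candidate)
        print(candidate, "accepted")
    except ValueError as error:
        print(error)
```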

As above, Ion text doesn’t have syntax specifically denoting “8-bit unsigned integers”, so to keep text and binary capabilities aligned, the parser rejects invocations where an argument value exceeds the range of the binary-only type.

Primitive types have inherent tradeoffs and require careful consideration, but in the right circumstances the density wins can be significant.

Macro Shapes

We can now introduce the final kind of input constraint, macro-shaped parameters. To understand the motivation, consider modeling a scatter-plot as a list of points:

[{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}, …]

Lists like these exhibit a lot of repetition. Since we already have a point macro, we can eliminate a fair amount:

[(:point 3 17), (:point 395 23), (:point 15 48), (:point 2023 5), …]

This eliminates all the xs and ys, but leaves repeated macro invocations.

What we’d like is to eliminate the point calls and just write a stream of pairs, something like:

(:scatterplot (3 17) (395 23) (15 48) (2023 5) …)

We can achieve exactly that with a macro-shaped parameter, in which we use the point macro as an encoding:

(macro scatterplot (point::points*)
//                  ^^^^^
  [(%points)])

point is not one of the built-in encodings, so this is a reference to the macro of that name defined earlier.

(:scatterplot (3 17) (395 23) (15 48) (2023 5))
  ⇒
  [{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}]
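As an informal analogy, a macro-shaped parameter behaves like applying the inner macro to each argument tuple. The Python sketch below models this with plain functions and tuples; it says nothing about the binary encoding, and the Python names merely mirror the macro names for readability.

```python
def point(x, y):
    """Model of the `point` macro: build a struct from two ints."""
    return {"x": x, "y": y}


def scatterplot(*point_args):
    """Model of `scatterplot`: each argument is the argument list of an
    implied `point` invocation, expanded automatically."""
    return [point(*args) for args in point_args]


# (:scatterplot (3 17) (395 23))
plot = scatterplot((3, 17), (395, 23))
print(plot)  # [{'x': 3, 'y': 17}, {'x': 395, 'y': 23}]
```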

Each argument S-expression like (3 17) is implicitly an E-expression invoking the point macro. The argument mirrors the shape of the inner macro, without repeating its name. Further, expansion of the implied points happens automatically, so the overall behavior is just like the preceding variant and the points parameter produces a stream of structs.

The binary encoding of macro-shaped parameters is similarly tagless, eliding any opcodes mentioning point and just writing its arguments with minimal delimiting.

Macro types can be combined with cardinality modifiers, with invocations using groups as needed:

(macro scatterplot
  (point::points+ flex_string::x_label flex_string::y_label)
  { points: [(%points)], x_label: (%x_label), y_label: (%y_label) })
(:scatterplot (:: (3 17) (395 23) (15 48) (2023 5)) "hour" "widgets")
  ⇒
  {
    points: [{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}],
    x_label: "hour",
    y_label: "widgets"
  }

As with other tagless parameters, you cannot replace a group with a macro invocation, and you can't use a macro invocation as an element of an argument group:

(:scatterplot (:make_points 3 17 395 23 15 48 2023 5) "hour" "widgets")
  ⇒ // Error: Argument group expected, found :make_points

(:scatterplot (:: (3 17) (:make_points 395 23 15 48) (2023 5)) "hour" "widgets")
  ⇒ // Error: sexp expected with args for 'point', found :make_points

(:scatterplot (:: (3 17) (:point 395 23) (15 48) (2023 5)) "hour" "widgets")
  ⇒ // Error: sexp expected with args for 'point', found :point

This limitation mirrors the binary encoding, where both the argument group and the individual macro invocations are tagless and there's no way to express a macro invocation.

tip

The primary goal of macro-shaped arguments, and tagless types in general, is to increase density by tightly constraining the inputs.

Special Forms

When a TDL expression is syntactically an S-expression and its first element is the symbol ., its next element must be a symbol that matches either one of the keywords denoting the special forms or the name of a previously-defined macro. The interpretation of the S-expression’s remaining elements depends on how the symbol resolves. In the case of macro invocations, the elements following the operator are arbitrary TDL expressions, but for special forms that is not always the case.

Special forms are "special" precisely because they cannot be expressed as macros and must therefore receive bespoke syntactic treatment. Since the elements of macro-invocation expressions are themselves expressions, when you want something to not be evaluated that way, it must be a special form.

Finally, these special forms are part of the template language itself, and are not addressable outside of TDL; the E-expression (:if_none foo bar baz) must necessarily refer to some user-defined macro named if_none, not to the special form of the same name.

todo

Many of these could be system macros instead of special forms. Being unrepresentable in TDL is not a reason for something to be a special form. Candidates to be moved to system macros are if_* and fail. Additionally, the system macro parse_ion may need to be classified as a special form since it only accepts literals.

if_none

(macro if_none (stream* true_branch* false_branch*) /* Not representable in TDL */)

The if_none form is if/then/else syntax testing stream emptiness. It has three sub-expressions, the first being a stream to check. If and only if that stream is empty (it produces no values), the second sub-expression is expanded. Otherwise, the third sub-expression is expanded. The expanded second or third sub-expression becomes the result that is produced by if_none.

note

Exactly one branch is expanded, because otherwise the empty stream might be used in a context that requires a value, resulting in an errant expansion error.

(macro temperature (degrees scale?)
       {
         degrees: (%degrees),
         scale: (.if_none (%scale) K (%scale)),
       })
(:temperature 96 F)     ⇒ {degrees:96,  scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}

To refine things a bit further, trailing optional arguments can be omitted entirely:

(:temperature 283) ⇒ {degrees:283, scale:K}

tip

If you're using if_none to specify an expression to default to, you can use the default system macro to be more concise.

(macro temperature (degrees scale?)
    {
      degrees: (%degrees),
      scale: (.default (%scale) K),
    }
)

if_some

(macro if_some (stream* true_branch* false_branch*) /* Not representable in TDL */)

If stream evaluates to one or more values, it produces true_branch. Otherwise, it produces false_branch. Exactly one of true_branch and false_branch is evaluated. The stream expression must be expanded enough to determine whether it produces any values, but implementations are not required to fully expand the expression.

Example:

(macro foo (x)
       {
         foo: (.if_some (%x) [(%x)] null)
       })
(:foo (::))     => { foo: null }
(:foo 2)        => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }

The false_branch parameter may be elided, allowing if_some to serve as a map-if-not-none function.

Example:

(macro foo (x)
       {
         foo: (.if_some (%x) [(%x)])
       })
(:foo (::))     => { }
(:foo 2)        => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }

if_single

(macro if_single (expressions* true_branch* false_branch*) /* Not representable in TDL */)

If expressions evaluates to exactly one value, if_single produces the expansion of true_branch. Otherwise, it produces the expansion of false_branch. Exactly one of true_branch and false_branch is evaluated. The expressions argument must be expanded enough to determine whether it produces exactly one value, but implementations are not required to fully expand it.

if_multi

(macro if_multi (expressions* true_branch* false_branch*) /* Not representable in TDL */)

If expressions evaluates to more than one value, if_multi produces the expansion of true_branch. Otherwise, it produces the expansion of false_branch. Exactly one of true_branch and false_branch is evaluated. The expressions argument must be expanded enough to determine whether it produces more than one value, but implementations are not required to fully expand it.
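
The four cardinality tests differ only in how many values they need to observe. The following Python sketch is illustrative only and is not part of the specification. It assumes streams are modeled as Python iterables and branches as zero-argument callables; the helper names `_peek_count` and `_branch` are hypothetical. It shows that inspecting at most two values is enough to decide every form, which is why full expansion of the tested expression is not required.

```python
from itertools import islice
from typing import Callable, Iterable, List


def _peek_count(stream: Iterable, limit: int) -> int:
    """Count the values in `stream`, looking at no more than `limit` of them."""
    return len(list(islice(iter(stream), limit)))


def _branch(condition: bool,
            true_branch: Callable[[], List],
            false_branch: Callable[[], List]) -> List:
    # Exactly one branch is expanded; the other callable is never invoked.
    return true_branch() if condition else false_branch()


# An omitted false branch is modeled as `list`, which produces an empty stream.
def if_none(stream, true_branch, false_branch=list):
    return _branch(_peek_count(stream, 1) == 0, true_branch, false_branch)


def if_some(stream, true_branch, false_branch=list):
    return _branch(_peek_count(stream, 1) >= 1, true_branch, false_branch)


def if_single(stream, true_branch, false_branch=list):
    return _branch(_peek_count(stream, 2) == 1, true_branch, false_branch)


def if_multi(stream, true_branch, false_branch=list):
    return _branch(_peek_count(stream, 2) > 1, true_branch, false_branch)
```

Because only a bounded prefix is inspected, this model terminates even when the tested stream is unbounded.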

for

(for name_and_expressions template)

name_and_expressions is a list or s-expression containing one or more s-expressions of the form (name expr0 expr1 ... exprN). The first value is a symbol to act as a variable name. The remaining expressions in the s-expression will be expanded and concatenated into a single stream; for each value in the stream, the for expansion will produce a copy of the template argument expression with any appearance of the variable replaced by the value.

For example:

(.for
  [(word                     // Variable name
   foo bar baz)]             // Values over which to iterate
  (.values (%word) (%word))) // Template expression; `(%word)` will be replaced
=>
foo foo bar bar baz baz

Multiple s-expressions can be specified. The streams will be iterated over in lockstep.

(.for
  ((x 1 2 3)   // for x in...
   (y 4 5 6))  // for y in...
  ((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 4)
(2 5)
(3 6)

Iteration will end when the shortest stream is exhausted.

(.for
  [(x 1 2),    // for x in...
   (y 3 4 5)]  // for y in...
  ((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 3)
(2 4)
// no more output, `x` is exhausted
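
The lockstep behavior shown above can be sketched in Python. This is an illustrative model only: the function name `for_each`, the representation of bindings as (name, values) pairs, and the template as a callable receiving a dictionary of variable values are all assumptions made for the sketch.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def for_each(bindings: Sequence[Tuple[str, Iterable]],
             template: Callable[[Dict[str, object]], List]) -> List:
    """Expand `template` once per lockstep step over the bound streams."""
    names = [name for name, _ in bindings]
    streams = [values for _, values in bindings]
    output: List = []
    # zip() stops as soon as the shortest stream is exhausted,
    # which matches the termination rule described above.
    for step in zip(*streams):
        environment = dict(zip(names, step))
        output.extend(template(environment))
    return output
```

For instance, binding `x` to `[1, 2]` and `y` to `[3, 4, 5]` invokes the template twice, once with `x=1, y=3` and once with `x=2, y=4`.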

Names defined inside a for shadow names in the parent scope.

(macro triple (x)
  //           └─── Parameter `x` is declared here...
  (.for
  //    ...but the `for` expression introduces a
  //  ┌─── new variable of the same name here.
    ((x a b c))
    (%x)
  //  └─── This refers to the `for` expression's `x`, not the parameter.
  )
)
(:triple 1) // Argument `1` is ignored
=>
a b c

The for special form can only be invoked in the body of a template macro. It is not valid as an E-expression.

System Macros

Many of the system macros MAY be defined as template macros, and when possible, the specification includes a template. Templates are given here as normative examples, but system macros are not required to be implemented as template macros.

The macros that can be defined as templates are included as system macros because of their broad applicability, and so that Ion implementations can provide optimizations for these macros that run directly in the implementation's runtime environment rather than in the macro evaluator. For example, a macro such as add_symbols does not produce user values, so an Ion Reader could bypass evaluating the template and directly update the encoding context with the new symbols.

Stream Constructors

none

(macro none () (.values))

none accepts no values and produces nothing (an empty stream).

values

(macro values (v*) (%v))

This is, essentially, the identity function. It produces a stream from any number of arguments, concatenating the streams produced by the nested expressions. It is used to aggregate multiple values or sub-streams to pass to a single argument, or to produce multiple results.

default

(macro default (expr* default_expr*)
    // If `expr` is empty...
    (.if_none (%expr)
        // then expand `default_expr` instead.
        (%default_expr)
        // If it wasn't empty, then expand `expr`.
        (%expr)
    )
)

default tests expr to determine whether it expands to the empty stream. If it does not, default will produce the expansion of expr. If it does, default will produce the expansion of default_expr instead.

flatten

(macro flatten (sequence*) /* Not representable in TDL */)

The flatten system macro constructs a stream from the content of one or more sequences. The resulting stream contains the contents of all the sequence values, in order. Any annotations on the sequence values are discarded. Any non-sequence arguments will raise an error. Any null arguments will be ignored.

Examples:

(:flatten [a, b, c] (d e f))       => a b c d e f
(:flatten [[], null.list] foo::()) => [] null.list

The flatten macro can also be used to splice the content of one list or s-expression into another list or s-expression.

[1, 2, (:flatten [a, b]), 3, 4] => [1, 2, a, b, 3, 4]
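
A minimal Python sketch of these rules follows. It is illustrative only and assumes a simplified data model that is not part of the specification: Ion lists are Python lists, s-expressions are tuples, and a null argument of any type is `None`. Annotations are not modeled.

```python
from typing import List, Optional, Union


def flatten(*sequences: Optional[Union[list, tuple]]) -> List:
    """Concatenate the contents of each list or sexp argument into one stream."""
    output: List = []
    for seq in sequences:
        if seq is None:
            continue  # null arguments are ignored
        if not isinstance(seq, (list, tuple)):
            raise TypeError(f"flatten expects a list or sexp, got {seq!r}")
        # Only one level is unwrapped; nested sequences are emitted intact.
        output.extend(seq)
    return output
```

Note that a null nested inside a sequence is ordinary content and is emitted unchanged; only null arguments are skipped.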

parse_ion

Ion documents may be embedded in other Ion documents using the parse_ion macro.

(macro parse_ion (uint8::data*) /* Not representable in TDL */)

The parse_ion macro constructs a stream of values by parsing a blob literal or string literal as a single, self-contained Ion document. All values produced by the expansion of parse_ion are application values. (I.e. it is as if they are all annotated with $ion_literal.)

The IVM at the beginning of an Ion data stream is sufficient to identify whether it is text or binary, so text Ion can be embedded as a blob containing the UTF-8 encoded text.

Embedded text example:

(:parse_ion
    '''
    $ion_1_1
    $ion_encoding::((symbol_table ["foo", "bar"]))
    $1 $2
    '''
)
=> foo bar

Embedded binary example:

(:parse_ion {{ 4AEB6qNmb2+jYmFy }} )
=> foo bar

important

Unlike most macros, this macro specifically requires literals. Macros are not allowed to contain recursive calls, and composing an embedded document from multiple expressions would make it possible to implement recursion in the macro system.

The data argument is evaluated in a clean environment that cannot read anything from the parent document. Allowing context to leak from the outer scope into the document being parsed would also enable recursion.

Value Constructors

annotate

(macro annotate (ann* value) /* Not representable in TDL */)

Produces the value prefixed with the annotations ann [1]. Each ann must be a non-null, unannotated string or symbol.

(:annotate (:: "a2") a1::true) => a2::a1::true

make_string

(macro make_string (content*) /* Not representable in TDL */)

Produces a non-null, unannotated string containing the concatenated content produced by the arguments. Nulls (of any type) are forbidden. Any annotations on the arguments are discarded.

make_symbol

(macro make_symbol (content*) /* Not representable in TDL */)

Like make_string but produces a symbol.

make_blob

(macro make_blob (lobs*) /* Not representable in TDL */)

Like make_string but accepts lobs and produces a blob.

make_list

(macro make_list (sequences*) [ (.flatten (%sequences)) ])

Produces a non-null, unannotated list by concatenating the content of any number of non-null list or sexp inputs.

(:make_list)                  => []
(:make_list (1 2))            => [1, 2]
(:make_list (1 2) [3, 4])     => [1, 2, 3, 4]
(:make_list ((1 2)) [[3, 4]]) => [(1 2), [3, 4]]

make_sexp

(macro make_sexp (sequences*) ( (.flatten (%sequences)) ))

Like make_list but produces a sexp.

(:make_sexp)                  => ()
(:make_sexp (1 2))            => (1 2)
(:make_sexp (1 2) [3, 4])     => (1 2 3 4)
(:make_sexp ((1 2)) [[3, 4]]) => ((1 2) [3, 4])

make_struct

(macro make_struct (structs*) /* Not representable in TDL */)

Produces a non-null, unannotated struct by combining the fields of any number of non-null structs.

(:make_struct)    => {}
(:make_struct
  {k1: 1, k2: 2}
  {k3: 3}
  {k4: 4})        => {k1:1, k2:2, k3:3, k4:4}

make_field

(macro make_field (flex_sym::field_name value) /* Not representable in TDL */)

Produces a non-null, unannotated, single-field struct using the given field name and value.

This can be used to dynamically construct field names based on macro parameters.

Example:

(macro foo_struct (extra_name extra_value)
       (.make_struct
         {
           foo_a: 1,
           foo_b: 2,
         }
         (.make_field (.make_string "foo_" (%extra_name)) (%extra_value))
       ))

Then:

(:foo_struct c 3) => { foo_a: 1, foo_b: 2, foo_c: 3 }

make_decimal

(macro make_decimal (flex_int::coefficient flex_int::exponent) /* Not representable in TDL */)

This is no more compact than the regular binary encoding for decimals. However, it can be used in conjunction with other macros, for example, to represent fixed-point numbers.

(macro usd (cents) (.annotate USD (.make_decimal (%cents) -2)))

(:usd 199) => USD::1.99
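
The coefficient/exponent arithmetic can be checked with Python's standard `decimal` module. This sketch is illustrative only; the function name mirrors the macro, and the use of the tuple constructor (chosen so that no context rounding is applied) is an implementation choice of the sketch, not something the specification prescribes.

```python
from decimal import Decimal


def make_decimal(coefficient: int, exponent: int) -> Decimal:
    """Build the decimal value coefficient * 10**exponent exactly."""
    sign = 1 if coefficient < 0 else 0
    digits = tuple(int(d) for d in str(abs(coefficient)))
    # Decimal accepts a (sign, digits, exponent) tuple and preserves the exponent.
    return Decimal((sign, digits, exponent))
```

Here `make_decimal(199, -2)` yields the value 1.99, matching the fixed-point usd example.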

make_timestamp

(macro make_timestamp (uint16::year
                       uint8::month?
                       uint8::day?
                       uint8::hour?
                       uint8::minute?
                       /*decimal*/ second?
                       int16::offset_minutes?) /* Not representable in TDL */)

Produces a non-null, unannotated timestamp at various levels of precision. When offset_minutes is absent, the result has unknown local offset; an offset_minutes of 0 denotes UTC. None of the arguments to this macro may be null.

note

TODO ion-docs#256 Reconsider offset semantics, perhaps default should be UTC.

Example:

(macro ts_today
       (uint8::hour uint8::minute uint32::seconds_millis)
       (.make_timestamp
         2022
         4
         28
         (%hour)
         (%minute)
         (.make_decimal (%seconds_millis) -3) 0))

Encoding Utility Macros

repeat

The repeat system macro can be used for efficient run-length encoding.

(macro repeat (n! value+) /* Not representable in TDL */)

Produces a stream that repeats the specified value expression(s) n times.

(:repeat 5 0)          => 0 0 0 0 0
(:repeat 2 true false) => true false true false
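
A minimal Python sketch of this behavior, for illustration only. Rejecting a negative n is an assumption of the sketch; the text above does not say how such a value is handled.

```python
from typing import List


def repeat(n: int, *values) -> List:
    """Produce the given value expression(s) n times, in order."""
    if n < 0:
        # Assumption: the specification text does not define negative counts.
        raise ValueError("n must be non-negative")
    if not values:
        raise ValueError("at least one value expression is required")
    return list(values) * n
```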

delta

note

🚧 Name still TBD 🚧

The delta system macro can be used for directed delta encoding.

(macro delta (flex_int::initial! flex_int::deltas+) /* Not representable in TDL */)

Example:

(:delta 10 1 2 3 -4) => 11 13 16 12
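
The semantics are not spelled out here, so the following Python sketch is inferred solely from the example above: each output is the running sum of the initial value and the deltas seen so far, and the initial value itself is not emitted. Treat it as an assumption, not a definition.

```python
from itertools import accumulate
from typing import List


def delta(initial: int, *deltas: int) -> List[int]:
    """Decode a delta-encoded sequence into absolute values."""
    running = accumulate(deltas, initial=initial)
    next(running)  # discard the initial value; only the running sums are produced
    return list(running)
```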

sum

(macro sum (i*) /* Not representable in TDL */)

Produces the sum of all the integer arguments.

Examples:

(:sum 1 2 3) => 6
(:sum (::))  => 0

meta

(macro meta (anything*) (.none))

The meta macro accepts any values and emits nothing. It allows writers to encode data that will not be surfaced to most readers. Readers can be configured to intercept calls to meta, allowing them to read the otherwise invisible data.

When transcoding from one format to another, writers should preserve invocations of meta when possible.

Example:

(:values
    (:meta {author: "Mike Smith", email: "mikesmith@example.com"})
    {foo:2,foo:1}
)
=>
{foo:2,foo:1}

Updating the Encoding Context

set_symbols

Sets the local symbol table, preserving any macros in the macro table.

(macro set_symbols (symbols*)
       $ion_encoding::(
         (symbol_table [(%symbols)])
         (macro_table $ion_encoding)
       ))

Example:

(:set_symbols foo bar)
=>
$ion_encoding::(
  (symbol_table [foo, bar])
  (macro_table $ion_encoding)
)

add_symbols

Appends symbols to the local symbol table, preserving any macros in the macro table.

(macro add_symbols (symbols*)
       $ion_encoding::(
         (symbol_table $ion_encoding [(%symbols)])
         (macro_table $ion_encoding)
       ))

Example:

(:add_symbols foo bar)
=>
$ion_encoding::(
  (symbol_table $ion_encoding [foo, bar])
  (macro_table $ion_encoding)
)

set_macros

Sets the local macro table, preserving any symbols in the symbol table.

(macro set_macros (macros*)
       $ion_encoding::(
         (symbol_table $ion_encoding)
         (macro_table (%macros))
       ))

Example:

(:set_macros (macro pi () 3.14159))
=>
$ion_encoding::(
  (symbol_table $ion_encoding)
  (macro_table (macro pi () 3.14159))
)

add_macros

Appends macros to the local macro table, preserving any symbols in the symbol table.

(macro add_macros (macros*)
       $ion_encoding::(
         (symbol_table $ion_encoding)
         (macro_table $ion_encoding (%macros))
       ))

Example:

(:add_macros (macro pi () 3.14159))
=>
$ion_encoding::(
  (symbol_table $ion_encoding)
  (macro_table $ion_encoding (macro pi () 3.14159))
)

use

Appends the content of the given module to the encoding context.

(macro use (catalog_key version?)
       $ion_encoding::(
         (import the_module (%catalog_key) (.default (%version) 1))
         (symbol_table $ion_encoding the_module)
         (macro_table $ion_encoding the_module)
       ))

Example:

(:use "org.example.FooModule" 2)
=>
$ion_encoding::(
  (import the_module "org.example.FooModule" 2)
  (symbol_table $ion_encoding the_module)
  (macro_table $ion_encoding the_module)
)

1

The annotations sequence comes first in the macro signature because it parallels how annotations are read from the data stream.

Ion 1.1 modules

In Ion 1.0, each stream has a symbol table. The symbol table stores text values that can be referred to by their integer index in the table, providing a much more compact representation than repeating the full UTF-8 text bytes each time the value is used. Symbol tables do not store any other information used by the reader or writer.

Ion 1.1 introduces the concept of a macro table. It is analogous to the symbol table, but instead of holding text values it holds macro definitions.

Ion 1.1 also introduces the concept of a module, an organizational unit that holds a (symbol table, macro table) pair.

tip

You can think of an Ion 1.0 symbol table as a module with an empty macro table.

In Ion 1.1, each stream has an encoding module—the active (symbol table, macro table) pair that is being used to encode the stream.

Module interface

The interface to a module consists of:

  • its spec version, denoting the Ion version used to define the module
  • its exported symbols, an array of strings denoting symbol content
  • its exported macros, an array of <name, macro> pairs, where all names are unique identifiers (or null).

The spec version is external to the module body and the precise way it is determined depends on the type of module being defined. This is explained in further detail in Module Versioning.

The exported symbol array is denoted by the symbol_table clause of a module definition, and by the symbols field of a shared symbol table.

The exported macro array is denoted by the module’s macro_table clause, with addresses allocated to macros or macro bindings in the order they are declared.

The exported symbols and exported macros are defined in the module body.

Types of modules

There are multiple types of modules. All modules share the same interface, but vary in their implementation in order to support a variety of different use cases.

Module Type       Purpose
Encoding Module   Defining the local encoding context
System Module     Defining system symbols and macros
Inner Module      Organizing symbols and macros and limiting the scope of macros
Shared Module     Defining symbols and macros outside of the data stream

Module versioning

Every module definition has a spec version that determines the syntax and semantics of the module body. A module’s spec version is expressed in terms of a specific Ion version; the meaning of the module is as defined by that version of the Ion specification.

The spec version for an encoding module is implicitly derived from the Ion version of its containing segment. The spec version for a shared module is denoted via a required annotation. The spec version of an inner module is always the same as its containing module. The spec version of a system module is the Ion version in which it was specified.

To ensure that all consumers of a module can properly understand it, a module can only import shared modules defined with the same or earlier spec version.
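
The import restriction amounts to a simple version comparison. The Python sketch below is illustrative; the helper names `parse_spec_version` and `can_import` are hypothetical, and representing a spec version as a (major, minor) tuple is an assumption of the sketch.

```python
from typing import Tuple

SpecVersion = Tuple[int, int]  # (major, minor), e.g. (1, 1) for $ion_1_1


def parse_spec_version(annotation: str) -> SpecVersion:
    """Parse an annotation of the form $ion_1_N into a comparable tuple."""
    prefix = "$ion_"
    if not annotation.startswith(prefix):
        raise ValueError(f"not a spec version annotation: {annotation!r}")
    major, minor = annotation[len(prefix):].split("_")
    return int(major), int(minor)


def can_import(importer: SpecVersion, shared_module: SpecVersion) -> bool:
    """A module may import only shared modules with the same or an earlier spec version."""
    return shared_module <= importer
```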

Examples

The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N. This allows the module to be serialized using any version of Ion, and its meaning will not change.

$ion_shared_module::
$ion_1_1::("com.example.symtab" 3
           (symbol_table ...)
           (macro_table ...))

The spec version of an encoding module is always the same as the Ion version of its enclosing segment.

$ion_1_1
$ion_encoding::(
  // Module semantics specified by Ion 1.1
  ...
)

// ...

$ion_1_3
$ion_encoding::(
  // Module semantics specified by Ion 1.3
  ...
)
//...                  // Assuming no IVM
$ion_encoding::(
  // Module semantics specified by Ion 1.3
  ...
)

Identifiers

Many of the grammatical elements used to define modules and macros are identifiers--symbols that do not require quotation marks.

More explicitly, an identifier is a sequence of one or more ASCII letters, digits, or the characters $ (dollar sign) or _ (underscore), not starting with a digit. It also cannot be of the form $\d+, which is the syntax for symbol IDs (for example: $3, $10, $458, etc.), nor can it be a keyword (true, false, null, or nan).
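
The rules above translate directly into a small validator. This Python sketch is illustrative; the function name `is_identifier` is hypothetical.

```python
import re

# One or more ASCII letters, digits, '$', or '_', not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
# The symbol ID syntax ($3, $10, ...) is excluded.
_SYMBOL_ID = re.compile(r"\$[0-9]+")
_KEYWORDS = frozenset({"true", "false", "null", "nan"})


def is_identifier(text: str) -> bool:
    """Return True if `text` is an identifier under the rules described above."""
    if _IDENTIFIER.fullmatch(text) is None:
        return False
    if _SYMBOL_ID.fullmatch(text) is not None:
        return False
    return text not in _KEYWORDS
```

Under this reading, `$ion_encoding` and `$3a` are identifiers, while `$3`, `3abc`, and `null` are not.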

Defining modules

A module is defined by four kinds of subclauses which, if present, always appear in the same order.

  1. import - a reference to a shared module definition; repeatable
  2. module - a nested module definition; repeatable
  3. symbol_table - an exported list of text values
  4. macro_table - an exported list of macro definitions

Internal environment

The body of a module tracks an internal environment by which macro references are resolved. This environment is constructed incrementally by each clause in the definition and consists of:

  • the visible modules, a map from identifier to module
  • the exported symbols, an array containing symbol texts
  • the exported macros, an array containing name/macro pairs

Before any clauses of the module definition are examined, the initial environment is as follows:

  • The visible modules map binds $ion to the system module for the appropriate spec version. Inside an encoding directive, the visible modules map also binds $ion_encoding to the active encoding module (the encoding module that was active when the encoding directive was encountered). For an inner module, it also includes the modules previously made available by the enclosing module (via import or module).
  • The macro table and symbol table are empty.

Each clause affects the environment as follows:

  • An import declaration retrieves a shared module from the implementation’s catalog, assigns it a name in the visible modules, and makes its macros available for use. An error must be signaled if the name already appears in the visible modules.
  • A module declaration defines a new module and assigns it a name in the visible modules. An error must be signaled if the name already appears in the visible modules.
  • A symbol_table declaration defines the exported symbols.
  • A macro_table declaration defines the exported macros.

Resolving Macro References

Within a module definition, macros can be referenced in several contexts using the following macro-ref syntax:

qualified-ref      ::= module-name '::' macro-ref

macro-ref          ::= macro-name | macro-addr

macro-name         ::= unannotated-identifier-symbol

macro-addr         ::= unannotated-uint 

Macro references are resolved to a specific macro as follows:

  • An unqualified macro-name is looked up within the exported macros, and if not found, then the active encoding module's macro table. If it maps to a macro, that’s the resolution of the reference. Otherwise, an error is signaled due to an unbound reference.
  • An anonymous local reference (macro-addr) is resolved by index in the exported macro array. If the address exceeds the array boundary, an error is signaled due to an invalid reference.
  • A qualified reference (qualified-ref) resolves solely against the referenced module. If the module name does not exist in the visible modules, an error is signaled due to an unbound reference. Otherwise, the name or address is resolved within that module’s exported macro array.
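
The three resolution rules can be sketched in Python as follows. This is an illustrative model only: the `Module` class, the `resolve` function, and the use of plain strings to stand in for macro definitions are assumptions of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

MacroEntry = Tuple[Optional[str], object]  # (name or None, macro definition)


@dataclass
class Module:
    macros: List[MacroEntry] = field(default_factory=list)

    def lookup(self, ref: Union[str, int]) -> Optional[object]:
        """Find a macro by address (int) or by name (str); None if absent."""
        if isinstance(ref, int):
            return self.macros[ref][1] if 0 <= ref < len(self.macros) else None
        for name, macro in self.macros:
            if name == ref:
                return macro
        return None


def resolve(ref: Union[str, int], exported: Module, active_encoding: Module,
            visible: Dict[str, Module], module_name: Optional[str] = None) -> object:
    if module_name is not None:
        # Qualified reference: resolve solely against the named module.
        if module_name not in visible:
            raise LookupError(f"unbound module: {module_name}")
        found = visible[module_name].lookup(ref)
    elif isinstance(ref, int):
        # Anonymous local reference: index into the exported macros only.
        found = exported.lookup(ref)
    else:
        # Unqualified name: exported macros first, then the active encoding module.
        found = exported.lookup(ref)
        if found is None:
            found = active_encoding.lookup(ref)
    if found is None:
        raise LookupError(f"unresolved macro reference: {ref!r}")
    return found
```

In this model, a name that is defined in both the module under construction and the active encoding module resolves to the new definition unless the reference is qualified.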

warning

An unqualified macro name can change meaning in the middle of an encoding module if you choose to shadow the name of a macro in the active encoding module. To unambiguously refer to the active encoding module, use the qualified reference syntax: $ion_encoding::<macro-name>.

import

import             ::= '(import' module-name catalog-key ')'

module-name        ::= unannotated-identifier-symbol

catalog-key        ::= catalog-name catalog-version?

catalog-name       ::= string

catalog-version    ::= int // positive, unannotated

An import binds a lexically scoped module name to a shared module that is identified by a catalog key—a (name, version) pair. The version of the catalog key is optional—when omitted, the version is implicitly 1.

In Ion 1.0, imports may be substituted with a different version if an exact match is not found. In Ion 1.1, however, all imports require an exact match to be found in the reader's catalog; if an exact match is not found, the implementation must signal an error.
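
The exact-match rule and the default version can be sketched in Python. The catalog is modeled here as a dictionary keyed by (name, version); that representation and the function name `import_module` are assumptions of the sketch.

```python
from typing import Dict, Optional, Tuple

CatalogKey = Tuple[str, int]  # (catalog-name, catalog-version)


def import_module(catalog: Dict[CatalogKey, object],
                  name: str,
                  version: Optional[int] = None) -> object:
    """Look up a shared module; an omitted version is implicitly 1."""
    key = (name, 1 if version is None else version)
    if key not in catalog:
        # No substitution of a nearby version is attempted.
        raise LookupError(f"no exact match in catalog for {key}")
    return catalog[key]
```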

module

The module clause defines a new module that is contained in the current module.

inner-module ::= '(module' module-name import* symbol-table? macro-table? ')'

Inner modules automatically have access to modules previously declared in the containing module using module or import. The new module (and its exported symbols and macros) is available to any following module, symbol_table, and macro_table clauses in the enclosing container.

See inner modules for full explanation.

symbol_table

A module can define a list of exported symbols by copying symbols from other modules and/or declaring new symbols.

symbol-table       ::= '(symbol_table' symbol-table-entry* ')'

symbol-table-entry ::= module-name | symbol-list

symbol-list        ::= '[' ( symbol-text ',' )* ']'

symbol-text        ::= symbol | string

The symbol_table clause assembles a list of text values for the module to export. It takes any number of arguments, each of which may be the name of a visible module or a list of symbol-texts. The symbol table is a list of symbol-texts formed by concatenating the symbol tables of named modules and lists of symbol/string values.

Where a module name occurs, its symbol table is appended. (The module name must refer to another module that is visible to the current module.) Unlike Ion 1.0, no symbol-maxid is needed because Ion 1.1 always requires exact matches for imported modules.

tip

In an encoding directive, the active encoding module $ion_encoding can be added to the symbol table in order to retain the symbols from the active encoding module. $ion_encoding can occur anywhere in the symbol_table clause, so in Ion 1.1 it is possible to append and prepend to the symbol table.

Where a list occurs, it must contain only non-null, unannotated strings and symbols. The text of these strings and/or symbols is appended to the symbol table. Upon encountering any non-text value, null value, or annotated value in the list, the implementation shall signal an error.
To add a symbol with unknown text to the symbol table, one may use $0.

All modules have a symbol table, so when a module has no symbol_table clause, the module has an empty symbol table.

Symbol zero $0

Symbol zero (i.e. $0) is a special symbol that is not assigned text by any symbol table, even the system symbol table. Symbol zero always has unknown text, and can be useful in synthesizing symbol identifiers where the text image of the symbol is not known in a particular operating context.

All symbol tables (even an empty symbol table) can be thought of as implicitly containing $0. However, $0 precedes all symbol tables rather than belonging to any symbol table. When adding the exported symbols from one module to the symbol table of another, the preceding $0 is not copied into the destination symbol table (because it is not part of the source symbol table).

It is important to note that $0 is only semantically equivalent to itself and to locally-declared SIDs with unknown text. It is not semantically equivalent to SIDs with unknown text from shared symbol tables, so replacing such SIDs with $0 is a destructive operation to the semantics of the data.

Processing

When the symbol_table clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.

For each arg:

  • If the arg is a list of text values, the nested text values are appended to the end of the symbol table being constructed.
    • When $0 appears in the list of text values, this creates a symbol with unknown text.
    • The presence of any other Ion value in the list raises an error.
  • If the arg is the name of a module, the symbols in that module's symbol table are appended to the end of the symbol table being constructed.
  • If the arg is anything else, the reader must raise an error.

Example

(symbol_table         // Constructs an empty symbol table (list)
  ["a", b, 'c']       // The text values in this list are appended to the table
  foo                 // Module `foo`'s symbol table values are appended to the table
  ['''g''', "h", i])  // The text values in this list are appended to the table

If module foo's symbol table were [d, e, f], then the symbol table defined by the above clause would be:

["a", "b", "c", "d", "e", "f", "g", "h", "i"]
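
The left-to-right processing described above can be sketched in Python. The model is an assumption made for illustration: a Python list stands for a list of text values, a Python string stands for a module name, and `None` inside a list stands for $0 (a symbol with unknown text). The function name `build_symbol_table` is hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Union

SymbolText = Optional[str]  # None models a symbol with unknown text ($0)


def build_symbol_table(args: Sequence[Union[list, str]],
                       visible_modules: Dict[str, List[SymbolText]]) -> List[SymbolText]:
    table: List[SymbolText] = []  # the reader starts with an empty list
    for arg in args:
        if isinstance(arg, list):
            for text in arg:
                if text is not None and not isinstance(text, str):
                    raise ValueError(f"non-text value in symbol list: {text!r}")
                table.append(text)
        elif isinstance(arg, str):
            if arg not in visible_modules:
                raise LookupError(f"module is not visible: {arg}")
            table.extend(visible_modules[arg])
        else:
            raise ValueError(f"invalid symbol_table argument: {arg!r}")
    return table
```

Applied to the example above, with module foo exporting d, e, and f, this model yields the nine symbol texts a through i in order.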

This is an Ion 1.0 symbol table that imports two shared symbol tables and then declares some symbols of its own.

$ion_1_0
$ion_symbol_table::{
  imports: [{ name: "com.example.shared1", version: 1, max_id: 10 },
            { name: "com.example.shared2", version: 2, max_id: 20 }],
  symbols: ["s1", "s2"]
}

Here’s the Ion 1.1 equivalent in terms of symbol allocation order:

$ion_1_1
$ion_encoding::(
  (import m1 "com.example.shared1" 1)
  (import m2 "com.example.shared2" 2)
  (symbol_table m1 m2 ["s1", "s2"])
)

macro_table

Macros are declared after symbols. The macro_table clause assembles a list of macro definitions for the module to export. It takes any number of arguments. All modules have a macro table, so when a module has no macro_table clause, the module has an empty macro table.

Most commonly, a macro table entry is a macro clause that defines a new macro expansion function in terms of an optional name, a signature, and a template.

When no name is given, this defines an anonymous macro that can be referenced by its numeric address (that is, its index in the enclosing macro table). Inside the defining module, that uses a local reference like 12.

The signature defines the syntactic shape of expressions invoking the macro; see Macro Signatures for details. The template defines the expansion of the macro, in terms of the signature’s parameters; see Template Expressions for details.

Imported macros must be explicitly exported if so desired. Module names and export clauses can be intermingled with macro definitions inside the macro_table; together, they determine the bindings that make up the module’s exported macro array.

The module-name export form is shorthand for referencing all exported macros from that module, in their original order with their original names.

An export clause contains a single macro reference followed by an optional alias for the exported macro. The referenced macro is appended to the macro table.

tip

No name can be repeated among the exported macros, including macro definitions. Name conflicts must be resolved by exports with aliases.

Processing

When the macro_table clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.

For each arg:

  • If the arg is a macro clause, the clause is processed and the resulting macro definition is appended to the end of the macro table being constructed.
  • If the arg is an export clause, the clause is processed and the referenced macro definition is appended to the end of the macro table being constructed.
  • If the arg is the name of a module, the macro definitions in that module's macro table are appended to the end of the macro table being constructed.
  • If the arg is anything else, the reader must raise an error.

A macro name is a symbol that can be used to reference a macro, both inside and outside the module. Macro names are optional, and improve legibility when using, writing, and debugging macros. When a name is used, it must be an identifier per Ion’s syntax for symbols. Named macro definitions being added to the macro table must have a unique name. If a macro is added whose name conflicts with one already present in the table, the implementation must raise an error.

macro

A macro clause defines a new macro. When the macro declaration uses a name, an error must be signaled if it already appears in the exported macro array.

export

An export clause declares a name for an existing macro and appends the macro to the macro table.

  • If the reference to the existing macro is followed by a name, the existing macro is appended to the exported macro array with the latter name instead of the original name, if any. In this way, an anonymous macro can be given a name. An error must be signaled if that name already appears in the exported macro array.
  • If the reference to the existing macro is followed by null, the macro is appended to the exported macro array without a name, regardless of whether the macro has a name.
  • If the reference to the existing macro is anonymous, the macro is appended to the exported macro array without a name.
  • When the reference to the existing macro uses a name, the name and macro are appended to the exported macro array. An error must be signaled if that name already appears in the exported macro array.

Module names in macro_table

A module name appends all exported macros from the module to the exported macro array. If any exported macro uses a name that already appears in the exported macro array, an error must be signaled.
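
The macro_table processing rules, including the name-conflict checks, can be sketched in Python. This is an illustrative model, not the specification's data structures: arguments are tagged tuples, an export carries an already-resolved (name, macro) pair, and `KEEP_NAME` is a hypothetical sentinel meaning the export has no alias.

```python
from typing import Dict, List, Optional, Tuple

MacroEntry = Tuple[Optional[str], object]  # (name or None, macro definition)
KEEP_NAME = object()  # hypothetical sentinel: export without an alias


def _append(table: List[MacroEntry], name: Optional[str], macro: object) -> None:
    # Anonymous macros never conflict; named macros must be unique.
    if name is not None and any(existing == name for existing, _ in table):
        raise ValueError(f"duplicate macro name: {name}")
    table.append((name, macro))


def build_macro_table(args, visible_modules: Dict[str, List[MacroEntry]]) -> List[MacroEntry]:
    table: List[MacroEntry] = []  # the reader starts with an empty list
    for arg in args:
        kind = arg[0]
        if kind == "macro":      # ("macro", name_or_None, definition)
            _, name, definition = arg
            _append(table, name, definition)
        elif kind == "export":   # ("export", (original_name, macro), alias)
            _, (original_name, macro), alias = arg
            # alias None exports anonymously; a string alias renames the macro.
            _append(table, original_name if alias is KEEP_NAME else alias, macro)
        elif kind == "module":   # ("module", module_name)
            for name, macro in visible_modules[arg[1]]:
                _append(table, name, macro)
        else:
            raise ValueError(f"invalid macro_table argument: {arg!r}")
    return table
```

A macro's address in this model is simply its index in the returned list.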

The encoding module

The encoding module is the module that is currently being used to encode the data stream. When the stream begins, the encoding module is the system module.

The application may define a new encoding module by writing an encoding directive at the top level of the stream. An encoding directive is an s-expression annotated with $ion_encoding; its nested clauses define a new encoding module.

When the reader advances beyond an encoding directive, the module it defined becomes the new encoding module.

In the context of an encoding directive, the active encoding module is named $ion_encoding. The encoding directive may preserve symbols or macros that were defined in the previous encoding directive by referencing $ion_encoding. The $ion_encoding module is available only within an encoding directive, where it is imported automatically and implicitly.

Examples

An encoding directive

A simple encoding directive—it defines a module that exports three symbols and two macros.

$ion_encoding::(
    (symbol_table [
        "a",  // $1
        "b",  // $2
        "c"   // $3
    ])
    (macro_table
      (macro pi () 3.14159265)
      (macro moon_landing_ts () 1969-07-20T20:17Z)
    )
)

Adding symbols to the encoding module

The implicitly imported $ion_encoding is used to append to the current symbol and macro tables.

$ion_encoding::(
    (symbol_table [
        "a",  // $1
        "b",  // $2
        "c",  // $3
    ])
    (macro_table
      (macro pi () 3.14159265)
      (macro moon_landing_ts () 1969-07-20T20:17Z)
    )
)

// ...

$ion_encoding::(
  // The first argument of the symbol_table clause is the module name '$ion_encoding',
  // which adds the symbols from the active encoding module to the new encoding module.
  // The '$ion_encoding' argument in the macro_table clause behaves similarly.
  (symbol_table $ion_encoding 
                [
                  "d", // $4
                  "e", // $5
                  "f", // $6
                ])
  (macro_table $ion_encoding
               (macro e () 2.71828182))
)

// ...
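
The append behaviour shown above can be modelled in a few lines of Python. This is an illustrative sketch rather than part of the specification; the function names build_symbol_table and text_for are hypothetical, and the model covers only the symbol_table clause with list and $ion_encoding arguments.

```python
def build_symbol_table(active_symbols, clause_args):
    """Model a symbol_table clause (hypothetical helper, illustration only).

    active_symbols: symbol texts of the currently active encoding module.
    clause_args: clause arguments in order; each is either the string
                 "$ion_encoding" or a list of new symbol texts.
    """
    new_symbols = []
    for arg in clause_args:
        if arg == "$ion_encoding":
            # Referencing $ion_encoding carries the active symbols forward.
            new_symbols.extend(active_symbols)
        else:
            new_symbols.extend(arg)
    return new_symbols


def text_for(symbols, sid):
    """Resolve a local symbol ID; $1 is the first entry of the table."""
    return symbols[sid - 1]


first = build_symbol_table([], [["a", "b", "c"]])
second = build_symbol_table(first, ["$ion_encoding", ["d", "e", "f"]])
cleared = build_symbol_table(second, [])  # models $ion_encoding::()
```

In this model, omitting the $ion_encoding argument drops the previous symbols, which matches the empty directive described in the next section.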

Clearing the local symbols and local macros

$ion_encoding::()

The absence of the symbol_table and macro_table clauses is interpreted as empty symbol and macro tables.

Note that this is different from the behaviour of an IVM. When an IVM is encountered, the encoding module is set to the system module.

Shared modules

Shared modules exist independently of the documents that use them. They are identified by a catalog key consisting of a string name and an integer version.

The self-declared catalog-names of shared modules are generally long, since they must be more-or-less globally unique. When imported by another module, they are given local symbolic names by import declarations.

They have a spec version that is explicit via annotation, and a content version derived from the catalog version. The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N. This allows the module to be serialized using any version of Ion, and its meaning will not change.

$ion_shared_module::
$ion_1_1::("com.example.symtab" 3 
           (symbol_table ...) 
           (macro_table ...) )

Example

An Ion 1.1 shared module.

$ion_shared_module::
$ion_1_1::("org.example.geometry" 2
           (symbol_table ["x", "y", "square", "circle"])
           (macro_table (macro point2d (x y) { x:(%x), y:(%y) })
                        (macro polygon (point2d::points+) [(%points)]) )
)

The system module provides a convenient macro (use) to append a shared module to the encoding module.

$ion_1_1
(:use "org.example.geometry" 2)
(:polygon (:: (1 4) (1 8) (3 6)))

Compatibility with Ion 1.0

Ion 1.0 shared symbol tables are treated as Ion 1.1 shared modules that have an empty macro table.

Inner modules

Inner modules are defined within another module, and can be referenced only within the enclosing module. Their scope is lexical; they can be referenced immediately following their definition, up until the end of the containing module.

Inner modules always have a symbolic name given at the point of definition. They inherit their spec version from the containing module, and they have no content version. Inner modules automatically have access to modules previously declared in their containing module using module or import. Inner modules may not contain their own nested inner modules.

Examples

Inner modules can be used to define helper macros and use them by name in the definitions of other macros without having to export the helper macro by name.

$ion_shared_module::$ion_1_1::(
  "org.example.Foo" 1
  (module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
  (macro_table
    (export util::0)
    (macro y_axis_point (y) (.util::point2d 0 (%y)))
    (macro polygon (util::point2d::points+) [(%points)]))
)

In this example, the macro point2d is declared in an inner module. It is added to the shared module's macro table without a name, and subsequently referenced by name in the definition of other macros.


Inner modules can also be used for grouping macros into namespaces (only visible within the outer module), and to declare helper macros that are not added to the macro table of the outer module.

$ion_shared_module::$ion_1_1::(
  "org.example.Foo" 1
  (module cartesian (macro_table (macro point2d (x y) { x:(%x), y:(%y) })
                                 (macro polygon (point2d::points+) [(%points)]) ))

  (module polar (macro_table (macro point2d (r phi) { r:(%r), phi:(%phi) })
                             (macro polygon (point2d::points+) [(%points)]) ))
  (macro_table
    (export cartesian::polygon cartesian_polygon)
    (export polar::polygon polar_polygon))
)

In this example, there are two macros named point2d and two named polygon. There is no name conflict between them because they are declared in separate namespaces. Both polygon macros are added to the shared module's macro table, each one given an alias in order to resolve the name conflict. Neither point2d macro needs to be added to the shared module's macro table, because the polygon definitions can reference them directly within their inner modules.


When grouping macros in inner modules, there are more than just organizational benefits. By defining helper macros in an inner module, the order in which the macros are added to the macro table of the outer module does not have to be the same as the order in which the macros are declared:

$ion_shared_module::$ion_1_1::(
  "org.example.Foo" 1
  // point2d must be declared before polygon...
  (module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
  (macro_table
    // ...because it is used in the definition of polygon
    (macro polygon (util::point2d::points+) [(%points)])
    // But it can be added to the macro table after polygon
    util)
)

Inner modules can also be used for organization of symbols.

$ion_encoding::(
  (module dairy      (symbol_table [cheese,  yogurt, milk]))
  (module grains     (symbol_table [cereal,  bread,  rice]))
  (module vegetables (symbol_table [carrots, celery, peas]))
  (module meat       (symbol_table [chicken, mutton, beef]))
  
  (symbol_table dairy 
                grains 
                vegetables 
                meat)
)

The system module

The symbols and macros of the system module $ion are available everywhere within an Ion document, with the version of that module being determined by the spec-version of each segment. The specific system symbols are largely uninteresting to users; while the binary encoding heavily leverages the system symbol table, the text encoding that users typically interact with does not. The system macros are more visible, especially to authors of macros.

This chapter catalogs the system-provided symbols and macros. The examples below use unqualified names, which works assuming no other macros with the same name are in scope. The unambiguous form $ion::macro-name is always available to use in the template definition language.

Relation to local symbol and macro tables

In Ion 1.0, the system symbol table is always the first import of the local symbol table. However, in Ion 1.1, the system symbol and macro tables have a system address space that is distinct from the local address space. When starting an Ion 1.1 segment (i.e. immediately after encountering an $ion_1_1 version marker), the local symbol table is prepopulated with the system symbols [1]. The local macro table is also prepopulated with the system macros. However, the system symbols and macros are not permanent fixtures of the local symbol and macro tables respectively.

When a local macro has the same name as a system macro, it shadows the system macro. In TDL, it is still possible to invoke a shadowed system macro by using a qualified name, such as $ion::make_string. If a macro in the active local macro table has the same name as a system macro, it is impossible to invoke that system macro by name using an E-Expression. (It is still possible to invoke the system macro if the local macro table has assigned an alias for that system macro.)

System Symbols

The Ion 1.1 System Symbol table replaces rather than extends the Ion 1.0 System Symbol table. The system symbols are as follows:

ID   Hex    Text
0    0x00   <reserved>
1    0x01   $ion
2    0x02   $ion_1_0
3    0x03   $ion_symbol_table
4    0x04   name
5    0x05   version
6    0x06   imports
7    0x07   symbols
8    0x08   max_id
9    0x09   $ion_shared_symbol_table
10   0x0A   $ion_encoding
11   0x0B   $ion_literal
12   0x0C   $ion_shared_module
13   0x0D   macro
14   0x0E   macro_table
15   0x0F   symbol_table
16   0x10   module
17   0x11   see ion-docs#345
18   0x12   export
19   0x13   see ion-docs#345
20   0x14   import
21   0x15   zero-length text (i.e. '')
22   0x16   literal
23   0x17   if_none
24   0x18   if_some
25   0x19   if_single
26   0x1A   if_multi
27   0x1B   for
28   0x1C   default
29   0x1D   values
30   0x1E   annotate
31   0x1F   make_string
32   0x20   make_symbol
33   0x21   make_blob
34   0x22   make_decimal
35   0x23   make_timestamp
36   0x24   make_list
37   0x25   make_sexp
38   0x26   make_struct
39   0x27   parse_ion
40   0x28   repeat
41   0x29   delta
42   0x2A   flatten
43   0x2B   sum
44   0x2C   set_symbols
45   0x2D   add_symbols
46   0x2E   set_macros
47   0x2F   add_macros
48   0x30   use
49   0x31   meta
50   0x32   flex_symbol
51   0x33   flex_int
52   0x34   flex_uint
53   0x35   uint8
54   0x36   uint16
55   0x37   uint32
56   0x38   uint64
57   0x39   int8
58   0x3A   int16
59   0x3B   int32
60   0x3C   int64
61   0x3D   float16
62   0x3E   float32
63   0x3F   float64
64   0x40   none
65   0x41   make_field

In Ion 1.1 Text, system symbols can never be referenced by symbol ID; $1 always refers to the first symbol in the user symbol table. This allows the Ion 1.1 system symbol table to be relatively large without taking away SID space from the user symbol table.

System Macros

ID   Hex    Text
0    0x00   none
1    0x01   values
2    0x02   annotate
3    0x03   make_string
4    0x04   make_symbol
5    0x05   make_blob
6    0x06   make_decimal
7    0x07   make_timestamp
8    0x08   make_list
9    0x09   make_sexp
10   0x0A   make_struct
11   0x0B   set_symbols
12   0x0C   add_symbols
13   0x0D   set_macros
14   0x0E   add_macros
15   0x0F   use
16   0x10   parse_ion
17   0x11   repeat
18   0x12   delta
19   0x13   flatten
20   0x14   sum
21   0x15   meta
22   0x16   make_field
23   0x17   default

[1] System symbols require the same number of bytes whether they are encoded using the system symbol or the user symbol encoding. The system symbols are initially loaded into the user symbol table for two reasons: to be consistent with loading the system macros into user space, and so that implementors can start testing user symbols even before they have implemented support for reading encoding directives.

Ion 1.1 Binary Encoding

A binary Ion stream consists of an Ion version marker followed by a series of value literals and/or encoding expressions.

Both value literals and e-expressions begin with an opcode that indicates what the next expression represents and how the bytes that follow should be interpreted.

Primitives

This section describes Ion 1.1's binary encoding primitives--reusable building blocks that can be combined to represent more complex constructs.

Name        Type     Width
FixedUInt   int      Determined by context
FixedInt    int      Determined by context
FlexUInt    int      Variable, self-delimiting
FlexInt     int      Variable, self-delimiting
FlexSym     symbol   Variable, self-delimiting

FlexUInt

A variable-length unsigned integer.

The bytes of a FlexUInt are written in little-endian byte order. This means that the first bytes will contain the FlexUInt's least significant bits.

The least significant bits in the FlexUInt indicate the number of bytes that were used to encode the integer. If a FlexUInt is N bytes long, its N-1 least significant bits will be 0; a terminal 1 bit will be in the next most significant position.

All bits that are more significant than the terminal 1 represent the magnitude of the FlexUInt.

FlexUInt encoding of 14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
unsigned int 14

FlexUInt encoding of 729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the unsigned  of the unsigned
integer          integer

FlexUInt encoding of 21,043

            ┌───── There are 2 zeros in the least significant bits, so this
            │      integer is three bytes wide.
          ┌─┴─┐
1 0 0 1 1 1 0 0  1 0 0 1 0 0 0 1  0 0 0 0 0 0 1 0
└───┬───┘        └──────┬──────┘  └──────┬──────┘
lowest 5 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer
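
The layout described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the function names encode_flex_uint and decode_flex_uint are hypothetical, and the encoder always picks the smallest width that fits.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (illustrative sketch)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    # Each encoded byte carries 7 magnitude bits; one bit per byte is
    # spent on the length marker.
    num_bytes = max(1, -(-value.bit_length() // 7))
    # N-1 zero bits, then a terminal 1 bit, then the magnitude.
    raw = (value << num_bytes) | (1 << (num_bytes - 1))
    return raw.to_bytes(num_bytes, "little")


def decode_flex_uint(data: bytes, offset: int = 0):
    """Return (value, bytes_consumed) for the FlexUInt at data[offset]."""
    # Count the zero bits below the terminal 1 bit to learn the width.
    num_bytes = 1
    pos = offset
    while data[pos] == 0:
        num_bytes += 8
        pos += 1
    lowest_set_bit = data[pos] & -data[pos]
    num_bytes += lowest_set_bit.bit_length() - 1
    raw = int.from_bytes(data[offset:offset + num_bytes], "little")
    # Discard the marker bits; what remains is the magnitude.
    return raw >> num_bytes, num_bytes
```

The three diagrams above correspond to encode_flex_uint(14), encode_flex_uint(729), and encode_flex_uint(21043).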

FlexInt

A variable-length signed integer.

From an encoding perspective, FlexInts are structurally similar to a FlexUInt. Both encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate how many bytes were used to encode the integer. They differ in the interpretation of their bits; while a FlexUInt's bits are unsigned, a FlexInt's bits are encoded using two's complement notation.

TIP: An implementation could choose to read a FlexInt by instead reading a FlexUInt and then reinterpreting its bits as two's complement.

FlexInt encoding of 14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
 2's comp. 14

FlexInt encoding of -14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
 2's comp. -14

FlexInt encoding of 729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer

FlexInt encoding of -729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
1 0 0 1 1 1 1 0  1 1 1 1 0 1 0 0
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer
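
A Python sketch of the same idea for signed values is shown below. It is illustrative only; encode_flex_int and decode_flex_int are hypothetical names. The decoder follows the approach from the tip above: it finds the width exactly as a FlexUInt reader would, then interprets the bits as two's complement.

```python
def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (illustrative sketch)."""
    # An N-byte FlexInt has 7*N payload bits of two's complement.
    num_bytes = 1
    while not -(1 << (7 * num_bytes - 1)) <= value < (1 << (7 * num_bytes - 1)):
        num_bytes += 1
    raw = (value << num_bytes) | (1 << (num_bytes - 1))
    # Wrap negative values into the unsigned range of the encoded width.
    raw &= (1 << (8 * num_bytes)) - 1
    return raw.to_bytes(num_bytes, "little")


def decode_flex_int(data: bytes, offset: int = 0):
    """Return (value, bytes_consumed) for the FlexInt at data[offset]."""
    num_bytes = 1
    pos = offset
    while data[pos] == 0:
        num_bytes += 8
        pos += 1
    lowest_set_bit = data[pos] & -data[pos]
    num_bytes += lowest_set_bit.bit_length() - 1
    raw = int.from_bytes(data[offset:offset + num_bytes], "little", signed=True)
    # An arithmetic right shift drops the marker bits and keeps the sign.
    return raw >> num_bytes, num_bytes
```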

FixedUInt

A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.

FixedUInt encoding of 3,954,261


0 1 0 1 0 1 0 1  0 1 0 1 0 1 1 0  0 0 1 1 1 1 0 0
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer

FixedInt

A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two's complement.

FixedInt encoding of -3,954,261


1 0 1 0 1 0 1 1  1 0 1 0 1 0 0 1  1 1 0 0 0 0 1 1
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the 2's       the 2's comp.   of the 2's comp.
comp. integer    integer          integer
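
Because the width of a FixedUInt or FixedInt comes from context, encoding reduces to a little-endian conversion at a caller-supplied width. The Python sketch below illustrates this with the standard int.to_bytes and int.from_bytes methods; the wrapper function names are hypothetical.

```python
def encode_fixed_uint(value: int, width: int) -> bytes:
    """Little-endian unsigned encoding at a width chosen by the caller."""
    return value.to_bytes(width, "little", signed=False)


def encode_fixed_int(value: int, width: int) -> bytes:
    """Little-endian two's complement encoding at a caller-chosen width."""
    return value.to_bytes(width, "little", signed=True)


def decode_fixed_uint(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=False)


def decode_fixed_int(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=True)
```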

FlexSym

A variable-length symbol token whose UTF-8 bytes can be inline, found in the symbol table, or derived from a macro expansion.

A FlexSym begins with a FlexInt; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:

  • greater than zero, it represents a symbol ID. The symbol’s associated text can be found in the local symbol table. No more bytes follow.
  • less than zero, its absolute value represents a number of UTF-8 bytes that follow the FlexInt. These bytes represent the symbol’s text.
  • exactly zero, another byte follows that is a FlexSymOpCode.

FlexSym encoding of symbol ID $10

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 1 0 1 0 1
└─────┬─────┘
  2's comp.
  positive 10

FlexSym encoding of symbol text 'hello'

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │      h         e        l        l        o
1 1 1 1 0 1 1 1  01101000  01100101 01101100 01101100 01101111
└─────┬─────┘    └─────────────────────┬─────────────────────┘
  2's comp.              5-byte UTF-8 encoded "hello"
  negative 5
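
The three-way decision on the leading FlexInt can be sketched in Python as follows. This is an illustration under stated assumptions: read_flex_sym and the tuple shape it returns are hypothetical, and, as the next section describes, a following opcode byte is returned to the caller unevaluated.

```python
def read_flex_sym(data: bytes, offset: int = 0):
    """Return (kind, payload, bytes_consumed) for the FlexSym at offset."""
    # Decode the leading FlexInt: find its width, then read it as a
    # little-endian two's complement integer and drop the marker bits.
    num_bytes = 1
    pos = offset
    while data[pos] == 0:
        num_bytes += 8
        pos += 1
    lowest_set_bit = data[pos] & -data[pos]
    num_bytes += lowest_set_bit.bit_length() - 1
    raw = int.from_bytes(data[offset:offset + num_bytes], "little", signed=True)
    flex_int = raw >> num_bytes
    body = offset + num_bytes

    if flex_int > 0:
        # Symbol ID; the text lives in the local symbol table.
        return ("symbol_id", flex_int, num_bytes)
    if flex_int < 0:
        # The absolute value is the number of UTF-8 bytes that follow.
        length = -flex_int
        text = data[body:body + length].decode("utf-8")
        return ("inline_text", text, num_bytes + length)
    # Exactly zero: one more byte follows, a FlexSymOpCode.
    return ("opcode", data[body], num_bytes + 1)
```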

FlexSymOpCode

FlexSymOpCodes are a combination of system symbols and a subset of the general opcodes. The FlexSym parser is not responsible for evaluating a FlexSymOpCode, only returning it—the caller will decide whether the opcode is legal in the current context.

Example usages of the FlexSymOpCode include:

  • Representing SID $0
  • Representing system symbols
    • Note that the empty symbol (i.e. the symbol '') is now a system symbol and can be referenced this way.
  • When used to encode a struct field name, the opcode can invoke a macro that will evaluate to a struct whose key/value pairs are spliced into the parent struct.
  • In a delimited struct, terminating the sequence of (field name, value) pairs with 0xF0.

OpCode byte    Meaning                                        Additional notes
0x00 - 0x5F    E-Expression                                   May be used when the FlexSym occurs in the field name position of any struct
0x60           Symbol with unknown text (also known as $0)
0x61 - 0xDF    System SID (with 0x60 bias)                    While the range of 0x61 - 0xDF is reserved for system symbols, not all of these bytes correspond to a system symbol. See system symbols for the list of system symbols.
0xEE           System symbol
0xEF           E-Expression invoking a system macro           May be used when the FlexSym occurs in the field name position of any struct
0xF0           Delimited container end marker                 May only be used when the FlexSym occurs in the field name position of a delimited struct
0xF5           Length-prefixed macro invocation               May be used when the FlexSym occurs in the field name position of any struct

FlexSym encoding of '' (empty text) using an opcode

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 0 0 0 0 1   01110101
└─────┬─────┘     └───┬──┘
  2's comp.     FixedInt 0x75,
  zero          System SID 21
                (the empty symbol)

Opcodes

An opcode is a 1-byte FixedUInt that tells the reader what the next expression represents and how the bytes that follow should be interpreted.

Opcode meanings are organized loosely by high and low nibble.

High nibble     Low nibble   Meaning
0x0_ to 0x3_    0-F          E-expression with 6-bit address
0x4_            0-F          E-expression with 12-bit address
0x5_            0-F          E-expression with 20-bit address
0x6_            0-8          Integers from 0 to 8 bytes wide
                9            Reserved
                A-D          Floats
                E-F          Booleans
0x7_            0-F          Decimals
0x8_            0-C          Short-form timestamps
                D-F          Reserved
0x9_            0-F          Strings
0xA_            0-F          Symbols with inline text
0xB_            0-F          Lists
0xC_            0-F          S-expressions
0xD_            0            Empty struct
                1            Reserved
                2-F          Structs
0xE_            0            Ion version marker
                1-3          Symbols with symbol address
                4-6          Annotations with symbol address
                7-9          Annotations with FlexSym text
                A            null.null
                B            Typed nulls
                C-D          NOP
                E            System symbol
                F            System macro invocation
0xF_            0            Delimited container end
                1            Delimited list start
                2            Delimited S-expression start
                3            Delimited struct start
                4            E-expression with FlexUInt macro address
                5            E-expression with FlexUInt length prefix
                6            Integer with FlexUInt length prefix
                7            Decimal with FlexUInt length prefix
                8            Timestamp with FlexUInt length prefix
                9            String with FlexUInt length prefix
                A            Symbol with FlexUInt length prefix and inline text
                B            List with FlexUInt length prefix
                C            S-expression with FlexUInt length prefix
                D            Struct with FlexUInt length prefix
                E            Blob with FlexUInt length prefix
                F            Clob with FlexUInt length prefix
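
A reader typically dispatches on the opcode byte. The Python sketch below shows one way to do that with a range table transcribed from the rows above for opcodes 0x00 through 0xDF; the names OPCODE_RANGES and classify_opcode are hypothetical, and the 0xE_ and 0xF_ rows are deliberately left out to keep the example short.

```python
# (first opcode, last opcode, meaning), transcribed from the table above.
OPCODE_RANGES = [
    (0x00, 0x3F, "e-expression with 6-bit address"),
    (0x40, 0x4F, "e-expression with 12-bit address"),
    (0x50, 0x5F, "e-expression with 20-bit address"),
    (0x60, 0x68, "integer"),
    (0x69, 0x69, "reserved"),
    (0x6A, 0x6D, "float"),
    (0x6E, 0x6F, "boolean"),
    (0x70, 0x7F, "decimal"),
    (0x80, 0x8C, "short-form timestamp"),
    (0x8D, 0x8F, "reserved"),
    (0x90, 0x9F, "string"),
    (0xA0, 0xAF, "symbol with inline text"),
    (0xB0, 0xBF, "list"),
    (0xC0, 0xCF, "s-expression"),
    (0xD0, 0xD0, "empty struct"),
    (0xD1, 0xD1, "reserved"),
    (0xD2, 0xDF, "struct"),
]


def classify_opcode(opcode: int) -> str:
    """Map an opcode in 0x00-0xDF to its meaning (illustrative sketch)."""
    for first, last, meaning in OPCODE_RANGES:
        if first <= opcode <= last:
            return meaning
    # 0xE_ and 0xF_ opcodes are not modelled in this sketch.
    return "not modelled"
```

A production reader would more likely use a 256-entry lookup table, but the range list keeps the correspondence with the table easy to check.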

Values

Nulls

The opcode 0xEA indicates an untyped null (that is: null, or its alias null.null).

The opcode 0xEB indicates a typed null; a byte follows whose value represents an offset into the following table:

Byte   Type
0x00   null.bool
0x01   null.int
0x02   null.float
0x03   null.decimal
0x04   null.timestamp
0x05   null.string
0x06   null.symbol
0x07   null.blob
0x08   null.clob
0x09   null.list
0x0A   null.sexp
0x0B   null.struct

All other byte values are reserved for future use.

Encoding of null

┌──── The opcode `0xEA` represents a null (null.null)
EA

Encoding of null.string

┌──── The opcode `0xEB` indicates a typed null; a byte indicating the type follows
│  ┌──── Byte 0x05 indicates the type `string`
EB 05
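
Reading a null therefore takes at most two bytes. The Python sketch below is illustrative; the names NULL_TYPES and read_null are hypothetical, and the function rejects the reserved type bytes.

```python
# Index in this list == the type byte that follows opcode 0xEB.
NULL_TYPES = [
    "null.bool", "null.int", "null.float", "null.decimal",
    "null.timestamp", "null.string", "null.symbol", "null.blob",
    "null.clob", "null.list", "null.sexp", "null.struct",
]


def read_null(data: bytes, offset: int = 0):
    """Return (ion_text, bytes_consumed) for a null at data[offset]."""
    opcode = data[offset]
    if opcode == 0xEA:
        return ("null", 1)
    if opcode == 0xEB:
        type_byte = data[offset + 1]
        if type_byte >= len(NULL_TYPES):
            raise ValueError(f"reserved null type byte 0x{type_byte:02X}")
        return (NULL_TYPES[type_byte], 2)
    raise ValueError(f"opcode 0x{opcode:02X} is not a null")
```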

Booleans

0x6E represents boolean true, while 0x6F represents boolean false.

0xEB 0x00 represents null.bool.

Encoding of boolean true
6E
Encoding of boolean false
6F
Encoding of null.bool
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: boolean
│  │
EB 00

Integers

Opcodes in the range 0x60 to 0x68 represent an integer. The opcode is followed by a FixedInt that represents the integer value. The low nibble of the opcode (0x_0 to 0x_8) indicates the size of the FixedInt. Opcode 0x60 represents integer 0; no more bytes follow.

Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF6, followed by a FlexUInt indicating how many bytes of representation data follow.

0xEB 0x01 represents null.int.

Encoding of integer 0
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 0 indicates
││    no more bytes follow.
60
Encoding of integer 17
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 1 indicates
││    a single byte follows.
61 11
    └── FixedInt 17
Encoding of integer -944
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 2 indicates
││    that two bytes follow.
62 50 FC
   └─┬─┘
FixedInt -944
Variable-length encoding of integer -944
┌──── Opcode F6 indicates a variable-length integer, FlexUInt length follows
│   ┌─── FlexUInt 2; a 2-byte FixedInt follows
│   │
F6 05 50 FC
      └─┬─┘
   FixedInt -944
Encoding of null.int
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: integer
│  │
EB 01
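
The choice between the fixed-width opcodes and 0xF6 can be sketched in Python as follows. The sketch is illustrative; encode_int and its helper are hypothetical names, and the encoder always chooses the smallest two's complement width even though, as the second -944 example shows, other encodings are also legal.

```python
def _encode_flex_uint(value: int) -> bytes:
    """Minimal FlexUInt encoder used for the 0xF6 length prefix."""
    num_bytes = max(1, -(-value.bit_length() // 7))
    raw = (value << num_bytes) | (1 << (num_bytes - 1))
    return raw.to_bytes(num_bytes, "little")


def encode_int(value: int) -> bytes:
    """Encode an Ion int using the smallest available form (sketch)."""
    if value == 0:
        return b"\x60"  # no body bytes follow
    # Bits needed for the magnitude, plus one sign bit.
    magnitude_bits = (value if value >= 0 else ~value).bit_length()
    width = (magnitude_bits + 1 + 7) // 8
    body = value.to_bytes(width, "little", signed=True)
    if width <= 8:
        # The low nibble of the opcode is the FixedInt width.
        return bytes([0x60 + width]) + body
    # Wider integers carry an explicit FlexUInt length.
    return b"\xF6" + _encode_flex_uint(width) + body
```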

Floats

Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:

  • 0 bits (0 bytes), representing the value 0e0 and indicated by opcode 0x6A
  • 16 bits (2 bytes in little-endian order, half-precision), indicated by opcode 0x6B
  • 32 bits (4 bytes in little-endian order, single precision), indicated by opcode 0x6C
  • 64 bits (8 bytes in little-endian order, double precision), indicated by opcode 0x6D

NOTE: In the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.

0xEB 0x02 represents null.float.

Encoding of float 0e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble A indicates
││    a 0-length float; 0e0
6A
Encoding of float 3.14e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble B indicates a 2-byte float
││
6B 47 42
   └─┬─┘
half-precision 3.14
Encoding of float 3.1415927e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble C indicates a 4-byte,
││    single-precision value.
6C DB 0F 49 40   
   └────┬────┘
single-precision 3.1415927
Encoding of float 3.141592653589793e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble D indicates an 8-byte,
││    double-precision value.
6D 18 2D 44 54 FB 21 09 40       
   └──────────┬──────────┘
double-precision 3.141592653589793
Encoding of null.float
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: float
│  │
EB 02
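
One way a writer could pick the smallest lossless width is to round-trip the value through each narrower format. The Python sketch below uses the standard struct module ("e", "f", and "d" are its half, single, and double precision format codes); encode_float is a hypothetical name. As a simplifying assumption, NaN always takes the 8-byte form here, and negative zero is not collapsed into opcode 0x6A.

```python
import math
import struct


def encode_float(value: float) -> bytes:
    """Encode a float in the narrowest lossless form (illustrative sketch)."""
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"  # positive zero: no body bytes follow
    for opcode, fmt in ((0x6B, "<e"), (0x6C, "<f")):
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            continue  # magnitude too large for this width
        # Keep the narrower form only if it reads back unchanged.
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed
    return b"\x6D" + struct.pack("<d", value)
```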

Decimals

If an opcode has a high nibble of 0x7_, it represents a decimal. Low nibble values indicate the number of trailing bytes used to encode the decimal.

The body of the decimal is encoded as a FlexInt representing its exponent, followed by a FixedInt representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0. When the coefficient is present but has a value of 0, the coefficient is -0.

Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF7.

0xEB 0x03 represents null.decimal.

Encoding of decimal 0d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 0 indicates a zero-byte
││    decimal; 0d0
70
Encoding of decimal 7d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 01 07
   |  └─── Coefficient: 1-byte FixedInt 7
   └─── Exponent: FlexInt 0
Encoding of decimal 1.27
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 FD 7F
   |  └─── Coefficient: FixedInt 127
   └─── Exponent: 1-byte FlexInt -2
Variable-length encoding of decimal 1.27
┌──── Opcode F7 indicates a variable-length decimal
│
F7 05 FD 7F
   |  |  └─── Coefficient: FixedInt 127
   |  └───── Exponent: 1-byte FlexInt -2
   └─────── Decimal length: FlexUInt 2
Encoding of 0d3, which has a coefficient of zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 1 indicates a 1-byte decimal
││
71 07
   └────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0
Encoding of -0d3, which has a coefficient of negative zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 07 00
   |  └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0
   └────── Exponent: FlexInt 3
Encoding of null.decimal
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: decimal
│  │
EB 03
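
The rules above, including the distinction between an absent coefficient (0) and a present zero coefficient (-0), can be sketched in Python as follows. The sketch is illustrative; encode_decimal, its negative_zero parameter, and the helper names are hypothetical, and the encoder always uses the smallest width.

```python
def _flex_uint(value: int) -> bytes:
    num_bytes = max(1, -(-value.bit_length() // 7))
    raw = (value << num_bytes) | (1 << (num_bytes - 1))
    return raw.to_bytes(num_bytes, "little")


def _flex_int(value: int) -> bytes:
    num_bytes = 1
    while not -(1 << (7 * num_bytes - 1)) <= value < (1 << (7 * num_bytes - 1)):
        num_bytes += 1
    raw = (value << num_bytes) | (1 << (num_bytes - 1))
    raw &= (1 << (8 * num_bytes)) - 1
    return raw.to_bytes(num_bytes, "little")


def encode_decimal(coefficient: int, exponent: int,
                   negative_zero: bool = False) -> bytes:
    """Encode coefficient * 10**exponent as an Ion decimal (sketch)."""
    if coefficient == 0 and exponent == 0 and not negative_zero:
        return b"\x70"  # 0d0: zero-byte body
    if coefficient == 0:
        # Absent coefficient means 0; an explicit zero byte means -0.
        coefficient_bytes = b"\x00" if negative_zero else b""
    else:
        bits = (coefficient if coefficient >= 0 else ~coefficient).bit_length() + 1
        coefficient_bytes = coefficient.to_bytes((bits + 7) // 8, "little", signed=True)
    body = _flex_int(exponent) + coefficient_bytes
    if len(body) <= 15:
        return bytes([0x70 + len(body)]) + body
    return b"\xF7" + _flex_uint(len(body)) + body
```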

Timestamps

Timestamps have two encodings:

  1. Short-form timestamps, a compact representation optimized for the most commonly used precisions and date ranges.
  2. Long-form timestamps, a less compact representation capable of representing any timestamp in the Ion data model.

0xEB 0x04 represents null.timestamp.

Encoding of null.timestamp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: timestamp
│  │
EB 04

NOTE: In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time.

Short-form Timestamps

If an opcode has a high nibble of 0x8_, it represents a short-form timestamp. This encoding focuses on making the most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via the variable-length long form timestamp encoding.

Timestamps may be encoded using the short form if they meet all of the following conditions:

  • The year is between 1970 and 2097. The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.
  • The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes. 7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2024-08T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.
  • The fractional seconds are a common precision. The timestamp's fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).

Opcodes by precision and offset

Each opcode with a high nibble of 0x8_ indicates a different precision and offset encoding pair.

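Opcode   Precision          Serialized size in bytes [1]   Offset encoding
0x80     Year               1                              Implicitly Unknown offset
0x81     Month              2                              Implicitly Unknown offset
0x82     Day                2                              Implicitly Unknown offset
0x83     Hour and minutes   4                              1 bit to indicate UTC or Unknown Offset
0x84     Seconds            5                              1 bit to indicate UTC or Unknown Offset
0x85     Milliseconds       6                              1 bit to indicate UTC or Unknown Offset
0x86     Microseconds       7                              1 bit to indicate UTC or Unknown Offset
0x87     Nanoseconds        8                              1 bit to indicate UTC or Unknown Offset
0x88     Hour and minutes   5                              7 bits to represent a known offset [2]
0x89     Seconds            5                              7 bits to represent a known offset [2]
0x8A     Milliseconds       7                              7 bits to represent a known offset [2]
0x8B     Microseconds       8                              7 bits to represent a known offset [2]
0x8C     Nanoseconds        9                              7 bits to represent a known offset [2]
0x8D     Reserved           -                              -
0x8E     Reserved           -                              -
0x8F     Reserved           -                              -

[1] Serialized size in bytes does not include the opcode.

[2] This encoding can also represent UTC and Unknown Offset, though it is less compact than opcodes 0x83-0x87 above.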

The body of a short-form timestamp is encoded as a FixedUInt of the size specified by the opcode. This integer is then partitioned into bit-fields representing the timestamp's subfields. Note that endianness does not apply here because the bit-fields are defined over the body interpreted as an integer.

The following letters are used to denote bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants, and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.

The Month and Day subfields are one-based; 0 is not a valid month or day.

Letter code   Number of bits              Subfield
Y             7                           Year
M             4                           Month
D             5                           Day
H             5                           Hour
m             6                           Minute
o             7                           Offset
U             1                           Unknown (0) or UTC (1) offset
s             6                           Second
f             10 (ms), 20 (μs), 30 (ns)   Fractional second
.             n/a                         Unused

We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.

          7       0 <--- bit position
          |       |
         +=========+
byte 0   |  0xNN   | <-- hex notation for constants like opcodes
         +=========+ <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
     1   |nnnn:nnnn| <-- bits denoted with a `:` as a delimiter to aid in reading
         +---------+ <-- octet boundary within an encoding primitive
         ...
         +---------+
     N   |nnnn:nnnn|
         +=========+

The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant).

NOTE: While this encoding may complicate human reading, it guarantees that the timestamp's subfields (year, month, etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield, fractional_seconds, always begins at the same bit index when present, but can vary in length according to the precision.) This arrangement allows processors to read the little-endian bytes into an integer and then mask the appropriate bit ranges to access the subfields.
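
The read-then-mask approach can be sketched in Python as follows. This is an illustration, not a reference decoder: decode_short_timestamp, BODY_SIZE, and the dictionary keys are hypothetical names, the bit positions are taken from the subfield widths in the table above, and no range validation is performed.

```python
# Body size in bytes for each short-form opcode (opcode byte excluded).
BODY_SIZE = {0x80: 1, 0x81: 2, 0x82: 2, 0x83: 4, 0x84: 5, 0x85: 6, 0x86: 7,
             0x87: 8, 0x88: 5, 0x89: 5, 0x8A: 7, 0x8B: 8, 0x8C: 9}


def decode_short_timestamp(data: bytes) -> dict:
    """Split a short-form timestamp into its subfields (sketch)."""
    opcode = data[0]
    body = int.from_bytes(data[1:1 + BODY_SIZE[opcode]], "little")

    def bits(start: int, width: int) -> int:
        return (body >> start) & ((1 << width) - 1)

    fields = {"year": 1970 + bits(0, 7)}
    if opcode >= 0x81:
        fields["month"] = bits(7, 4)
    if opcode >= 0x82:
        fields["day"] = bits(11, 5)
    if opcode >= 0x83:
        fields["hour"] = bits(16, 5)
        fields["minute"] = bits(21, 6)
        if opcode <= 0x87:
            # U bit: 1 means UTC, 0 means unknown offset (None here).
            fields["offset_minutes"] = 0 if bits(27, 1) else None
            seconds_start, precision = 28, opcode - 0x83
        else:
            raw_offset = bits(27, 7)
            # Quarter hours biased by 56; all ones means unknown offset.
            fields["offset_minutes"] = (
                None if raw_offset == 0b1111111 else (raw_offset - 56) * 15)
            seconds_start, precision = 34, opcode - 0x88
        if precision >= 1:
            fields["second"] = bits(seconds_start, 6)
        if precision >= 2:
            # 10, 20, or 30 bits for milli-, micro-, or nanoseconds.
            fields["fraction"] = bits(seconds_start + 6, 10 * (precision - 1))
    return fields
```

For example, the bytes 83 35 7D CB 0A decode to 2023-10-15T11:22Z under this layout.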

Encoding of a timestamp with year precision

         +=========+
byte 0   |  0x80   |
         +=========+
     1   |.YYY:YYYY|
         +=========+

Encoding of a timestamp with month precision

         +=========+
byte 0   |  0x81   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |....:.MMM|
         +=========+

Encoding of a timestamp with day precision

         +=========+
byte 0   |  0x82   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +=========+

Encoding of a timestamp with hour-and-minutes precision at UTC or unknown offset

         +=========+
byte 0   |  0x83   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |....:Ummm|
         +=========+

Encoding of a timestamp with seconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x84   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |....:..ss|
         +=========+

Encoding of a timestamp with milliseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x85   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |....:ffff|
         +=========+

Encoding of a timestamp with microseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x86   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |..ff:ffff|
         +=========+

Encoding of a timestamp with nanoseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x87   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +=========+

Encoding of a timestamp with hour-and-minutes precision at known offset

         +=========+
byte 0   |  0x88   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |....:..oo|
         +=========+

Encoding of a timestamp with seconds precision at known offset

         +=========+
byte 0   |  0x89   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +=========+

Encoding of a timestamp with milliseconds precision at known offset

         +=========+
byte 0   |  0x8A   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |....:..ff|
         +=========+

Encoding of a timestamp with microseconds precision at known offset

         +=========+
byte 0   |  0x8B   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |....:ffff|
         +=========+

Encoding of a timestamp with nanoseconds precision at known offset

         +=========+
byte 0   |  0x8C   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +---------+
     9   |..ff:ffff|
         +=========+

Examples of short-form timestamps

Text                                  Binary
2023T                                 80 35
2023-10-15T                           82 35 7D
2023-10-15T11:22:33Z                  84 35 7D CB 1A 02
2023-10-15T11:22:33-00:00             84 35 7D CB 12 02
2023-10-15T11:22:33+01:15             89 35 7D CB 2A 84
2023-10-15T11:22:33.444555666+01:15   8C 35 7D CB 2A 84 92 61 7F 1A
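
The read-then-mask approach described above can be sketched in Python for opcode 0x84 (seconds precision at UTC or unknown offset). This is a minimal sketch, not a reference implementation: the function name is hypothetical, the year bias of 1970 is an assumption inferred from the example 2023T = 80 35 (0x35 is 53), and the meaning of the U bit (set for UTC, clear for unknown offset) is inferred from the Z and -00:00 rows of the table.

```python
def decode_short_form_seconds(body: bytes) -> dict:
    """Decode the 5 body bytes that follow opcode 0x84 (hypothetical helper)."""
    if len(body) != 5:
        raise ValueError("opcode 0x84 is followed by exactly 5 bytes")
    n = int.from_bytes(body, "little")  # read the little-endian bytes into one integer
    return {
        "year": 1970 + (n & 0x7F),      # bits 0-6; the 1970 bias is an assumption
        "month": (n >> 7) & 0x0F,       # bits 7-10
        "day": (n >> 11) & 0x1F,        # bits 11-15
        "hour": (n >> 16) & 0x1F,       # bits 16-20
        "minute": (n >> 21) & 0x3F,     # bits 21-26
        "utc": bool((n >> 27) & 0x01),  # bit 27 (U), assumed set for UTC
        "second": (n >> 28) & 0x3F,     # bits 28-33
    }


# Body bytes of 2023-10-15T11:22:33Z from the table above (opcode 0x84 removed).
fields = decode_short_form_seconds(bytes.fromhex("35 7D CB 1A 02"))
```

Because every subfield sits at a fixed bit index, the same shifts and masks apply to the shorter precisions; only the number of bytes read changes.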

warning

Opcodes 0x8D, 0x8E, and 0x8F are illegal; they are reserved for future use.

Long-form Timestamps

The short-form timestamp encoding is optimized for, and limited to, the most commonly referenced timestamp ranges and precisions. The long-form timestamp encoding, by contrast, is capable of representing any valid timestamp.

The long form begins with opcode 0xF8. A FlexUInt follows indicating the number of bytes that were needed to represent the timestamp. The encoding consumes the minimum number of bytes required to represent the timestamp. The declared length can be mapped to the timestamp’s precision as follows:

Length      Corresponding precision
0           Illegal
1           Illegal
2           Year
3           Month or Day (see below)
4           Illegal; the hour cannot be specified without also specifying minutes
5           Illegal
6           Minutes
7           Seconds
8 or more   Fractional seconds

Unlike the short-form encoding, the long-form encoding reserves:

  • 14 bits for the year (Y), which is not biased.
  • 12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.

Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the timestamp are encoded as bit-fields on a FixedUInt that corresponds to the length that followed the opcode.

If the timestamp's overall length is greater than or equal to 8, the FixedUInt part of the timestamp is 7 bytes and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a (scale, coefficient) pair, which is similar to a decimal. The primary difference is that the scale represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to 1.0 or less than 0.0. The scale is encoded as a FlexUInt (instead of FlexInt) to discourage the encoding of decimal numbers greater than 1.0. The coefficient is encoded as a FixedUInt (instead of FixedInt) to prevent the encoding of fractional seconds less than 0.0. Note that validation is still required; namely:

  • A scale value of 0 is illegal, as any non-zero coefficient would then yield a fractional seconds value of 1.0 (a whole second) or more.
  • If coefficient * 10^-scale >= 1.0, that (scale, coefficient) pair is illegal.
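
The validation rules can be sketched as follows. This is a minimal sketch with a hypothetical function name; it treats any value greater than or equal to 1.0 as illegal, following the statement above that fractional seconds must be less than 1.0, and it assumes the scale and coefficient have already been decoded into non-negative integers.

```python
from decimal import Decimal


def fractional_seconds(scale: int, coefficient: int) -> Decimal:
    """Return coefficient * 10^-scale after applying the validation rules."""
    if scale < 0 or coefficient < 0:
        raise ValueError("scale and coefficient are unsigned")
    if scale == 0:
        raise ValueError("a scale of 0 is illegal")
    if coefficient >= 10 ** scale:
        raise ValueError("fractional seconds must be less than 1.0")
    # Build the decimal exactly from (sign, digits, exponent), with no rounding.
    digits = tuple(int(d) for d in str(coefficient))
    return Decimal((0, digits, -scale))


example = fractional_seconds(3, 127)  # scale 3, coefficient 127
```

Comparing the coefficient against 10^scale as integers keeps the check exact regardless of how many digits the coefficient has.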

If the timestamp's length is 3, the precision is determined by inspecting the day (DDDDD) bits. As in the short form, the month and day subfields are one-based (0 is not a valid month or day). If the day subfield is zero, that indicates month precision. If the day subfield is any non-zero number, that indicates day precision.

Encoding of the body of a long-form timestamp

         +=========+
byte 0   |YYYY:YYYY|
         +=========+
     1   |MMYY:YYYY|
         +---------+
     2   |HDDD:DDMM|
         +---------+
     3   |mmmm:HHHH|
         +---------+
     4   |oooo:oomm|
         +---------+
     5   |ssoo:oooo|
         +---------+
     6   |....:ssss|
         +=========+
     7   |FlexUInt | <-- scale of the fractional seconds
         +---------+
         ...
         +=========+
     N   |FixedUInt| <-- coefficient of the fractional seconds
         +---------+
         ...

Examples of long-form timestamps

Text                            Binary
1947T                           F8 05 9B 07
1947-12T                        F8 07 9B 07 03
1947-12-23T                     F8 07 9B 07 5F
1947-12-23T11:22:33-00:00       F8 0F 9B 07 DF 65 FD 7F 08
1947-12-23T11:22:33+01:15       F8 0F 9B 07 DF 65 AD 57 08
1947-12-23T11:22:33.127+01:15   F8 13 9B 07 DF 65 AD 57 08 07 7F
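
The body layout and the length-to-precision mapping can be exercised with a short Python sketch. The function name and the returned field names are hypothetical. The sketch assumes the FlexUInt layout implied by the examples in this section (the position of the lowest set bit gives the byte count, and the value sits in the bits above it), handles only scales whose FlexUInt length information fits in one byte, and does not range-check the month, day, or time fields.

```python
def decode_long_form_body(body: bytes) -> dict:
    """Decode the bytes that follow opcode 0xF8 and its FlexUInt length."""
    length = len(body)
    if length in (0, 1, 4, 5):
        raise ValueError(f"illegal long-form timestamp length: {length}")
    n = int.from_bytes(body[:7], "little")
    out = {"year": n & 0x3FFF}                  # bits 0-13, not biased
    if length >= 3:
        out["month"] = (n >> 14) & 0x0F         # bits 14-17
        day = (n >> 18) & 0x1F                  # bits 18-22
        if day:                                 # a zero day means month precision
            out["day"] = day
    if length >= 6:
        out["hour"] = (n >> 23) & 0x1F          # bits 23-27
        out["minute"] = (n >> 28) & 0x3F        # bits 28-33
        raw = (n >> 34) & 0xFFF                 # bits 34-45
        # All ones means unknown offset; otherwise minutes counted from -1440.
        out["offset_minutes"] = None if raw == 0xFFF else raw - 1440
    if length >= 7:
        out["second"] = (n >> 46) & 0x3F        # bits 46-51
    if length >= 8:
        first = body[7]
        size = (first & -first).bit_length()    # assumed FlexUInt byte count
        if size == 0:
            raise ValueError("FlexUInt scales longer than 8 bytes are not handled")
        out["scale"] = int.from_bytes(body[7:7 + size], "little") >> size
        out["coefficient"] = int.from_bytes(body[7 + size:], "little")
    return out


# Body of 1947-12-23T11:22:33.127+01:15 (opcode F8 and length byte 13 removed).
decoded = decode_long_form_body(bytes.fromhex("9B 07 DF 65 AD 57 08 07 7F"))
```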

Strings

If the high nibble of the opcode is 0x9_, it represents a string. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0x90 represents a string with empty text ("").

Strings longer than 15 bytes can be encoded with the F9 opcode, which takes a FlexUInt-encoded length after the opcode.

0xEB 0x05 represents null.string.

Encoding of the empty string, ""

┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
90

Encoding of a 14-byte string

┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes

Encoding of a 24-byte string

┌──── Opcode F9 indicates a variable-length string
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes

Encoding of null.string

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: string
│  │
EB 05
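
The choice between the 0x9_ opcodes and the F9 opcode can be sketched as below. The helper names flex_uint and encode_string are hypothetical, and the FlexUInt layout (a run of zero bits, a single 1 bit, then the value, written little-endian) is an assumption that agrees with the length bytes shown in the examples above.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    n = 1
    while value >= 1 << (7 * n):  # each encoded byte carries 7 bits of payload
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def encode_string(text: str) -> bytes:
    """Encode a non-null string value."""
    utf8 = text.encode("utf-8")
    if len(utf8) <= 15:
        # The byte length fits in the low nibble of a 0x90-0x9F opcode.
        return bytes([0x90 | len(utf8)]) + utf8
    # Otherwise use opcode F9 followed by a FlexUInt byte length.
    return b"\xF9" + flex_uint(len(utf8)) + utf8


short = encode_string("fourteen bytes")
long_form = encode_string("variable length encoding")
```

Note that the length counts UTF-8 bytes, not characters, so a string of 15 or fewer characters may still need the F9 form.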

Symbols

Symbols With Inline Text

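If the high nibble of the opcode is 0xA_, it represents a symbol whose text follows the opcode. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0xA0 represents a symbol with empty text ('').

Symbols with more than 15 bytes of inline text can be encoded with the FA opcode, which takes a FlexUInt-encoded length after the opcode.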

0xEB 0x06 represents null.symbol.

Encoding of a symbol with empty text ('')
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
A0
Encoding of a symbol with 14 bytes of inline text
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes
Encoding of a symbol with 24 bytes of inline text
┌──── Opcode FA indicates a variable-length symbol with inline text
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
FA 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes
Encoding of null.symbol
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: symbol
│  │
EB 06

Symbols With a Symbol Address

Symbol values whose text can be found in the local symbol table are encoded using opcodes 0xE1 through 0xE3:

  • 0xE1 represents a symbol whose address in the symbol table (aka its symbol ID) is a 1-byte FixedUInt that follows the opcode.
  • 0xE2 represents a symbol whose address in the symbol table is a 2-byte FixedUInt that follows the opcode.
  • 0xE3 represents a symbol whose address in the symbol table is a FlexUInt that follows the opcode.

Writers MUST encode a symbol address in the smallest number of bytes possible. For each opcode above, the symbol address that is decoded is biased by the number of addresses that can be encoded in fewer bytes.

Opcode   Symbol address range   Bias
0xE1     0 to 255               0
0xE2     256 to 65,791          256
0xE3     65,792 to infinity     65,792
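
The biasing in the table above can be sketched as follows. The helper names are hypothetical, and the FlexUInt layout used for 0xE3 is an assumption consistent with the length bytes shown elsewhere in this chapter.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def encode_symbol_address(address: int) -> bytes:
    """Encode a symbol value by address, using the smallest opcode that fits."""
    if address < 0:
        raise ValueError("symbol addresses are non-negative")
    if address <= 255:
        return b"\xE1" + address.to_bytes(1, "little")          # bias 0
    if address <= 65_791:
        return b"\xE2" + (address - 256).to_bytes(2, "little")  # bias 256
    return b"\xE3" + flex_uint(address - 65_792)                # bias 65,792


first_two_byte = encode_symbol_address(256)    # E2 00 00
first_flex = encode_symbol_address(65_792)     # E3 01
```

Subtracting the bias means that no two opcodes can encode the same address, which is what makes the smallest-encoding rule unambiguous.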

System Symbols

System symbols (that is, symbols defined in the system module) can be encoded using the 0xEE opcode followed by a 1-byte FixedUInt representing an index in the system symbol table.

Unlike Ion 1.0, symbols are not required to use the lowest available SID for a given text, and system symbols MAY be encoded using other SIDs.

Encoding of the system symbol $ion
┌──── Opcode 0xEE indicates a system symbol
│  ┌─── FixedUInt 1 indicates system symbol 1
│  │
EE 01

Binary Data

Blobs

Opcode FE indicates a blob of binary data. A FlexUInt follows that represents the blob's byte-length.

0xEB 0x07 represents null.blob.

Example blob encoding
┌──── Opcode FE indicates a blob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data
Encoding of null.blob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: blob
│  │
EB 07

Clobs

Opcode FF indicates a clob--binary character data of an unspecified encoding. A FlexUInt follows that represents the clob's byte-length.

0xEB 0x08 represents null.clob.

Example clob encoding

┌──── Opcode FF indicates a clob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data

Encoding of null.clob

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: clob
│  │
EB 08
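
Blobs and clobs share one shape: an opcode, a FlexUInt byte length, and the raw bytes. A minimal sketch, with hypothetical helper names and the FlexUInt layout assumed from the examples above:

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def encode_lob(data: bytes, clob: bool = False) -> bytes:
    """Encode a non-null blob (opcode FE) or clob (opcode FF)."""
    opcode = b"\xFF" if clob else b"\xFE"
    return opcode + flex_uint(len(data)) + data


payload = b"I applaud your curiosity"  # the 24 bytes used in the examples above
blob = encode_lob(payload)
clob = encode_lob(payload, clob=True)
```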

Lists

Length-prefixed encoding

An opcode with a high nibble of 0xB_ indicates a length-prefixed list. The lower nibble of the opcode indicates how many bytes were used to encode the child values that the list contains.

If the list's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFB opcode to write a variable-length list. The 0xFB opcode is followed by a FlexUInt that indicates the list's byte length.

0xEB 0x09 represents null.list.

Length-prefixed encoding of an empty list ([])
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 0 indicates that the child values of this
││    list took zero bytes to encode.
B0
Length-prefixed encoding of [1, 2, 3]
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this
││    list took six bytes to encode.
B6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Length-prefixed encoding of ["variable length list"]
┌──── Opcode 0xFB indicates a variable-length list. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     l  i  s  t
FB 2d F9 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element
Encoding of null.list
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: list
│  │
EB 09

Delimited Encoding

Opcode 0xF1 begins a delimited list, while opcode 0xF0 closes the most recently opened delimited container that has not yet been closed.

Delimited encoding of an empty list ([])
┌──── Opcode 0xF1 indicates a delimited list
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F1 F0
Delimited encoding of [1, 2, 3]
┌──── Opcode 0xF1 indicates a delimited list
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F1 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Delimited encoding of [1, [2], 3]
┌──── Opcode 0xF1 indicates a delimited list
│        ┌─── Opcode 0xF1 begins a nested delimited list
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested list.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and 
│        │        │        │    still open) delimited container: the outer list.
│        │        │        │
F1 61 01 F1 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3
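
Both list encodings can be sketched in a few lines of Python. The helper names are hypothetical, child values are passed in already encoded, and the FlexUInt layout is assumed from the length bytes in the examples above.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def length_prefixed_list(children: list) -> bytes:
    """children holds the already-encoded bytes of each child value."""
    body = b"".join(children)
    if len(body) <= 15:
        return bytes([0xB0 | len(body)]) + body  # byte length in the low nibble
    return b"\xFB" + flex_uint(len(body)) + body


def delimited_list(children: list) -> bytes:
    """Opcode F1 opens the list; opcode F0 closes it."""
    return b"\xF1" + b"".join(children) + b"\xF0"


one, two, three = (bytes([0x61, i]) for i in (1, 2, 3))  # 1-byte integers
prefixed = length_prefixed_list([one, two, three])       # B6 61 01 61 02 61 03
nested = delimited_list([one, delimited_list([two]), three])
```

The length-prefixed form needs the full child byte count before the opcode can be written, whereas the delimited form lets a writer stream children without buffering them.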

S-Expressions

S-expressions use the same encodings as lists, but with different opcodes.

Opcode      Encoding
0xC0-0xCF   Length-prefixed S-expression; low nibble of the opcode represents the byte-length.
0xFC        Variable-length prefixed S-expression; a FlexUInt following the opcode represents the byte-length.
0xF2        Starts a delimited S-expression; 0xF0 closes the most recently opened delimited container.

0xEB 0x0A represents null.sexp.

Length-prefixed encoding

Length-prefixed encoding of an empty S-expression (())
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 0 indicates that the child values of this S-expression
││    took zero bytes to encode.
C0
Length-prefixed encoding of (1 2 3)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 6 indicates that the child values of this S-expression
││    took six bytes to encode.
C6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Length-prefixed encoding of ("variable length sexp")
┌──── Opcode 0xFC indicates a variable-length sexp. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  e  x  p
FC 2D F9 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element
Encoding of null.sexp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: sexp
│  │
EB 0A

Delimited encoding

Delimited encoding of an empty S-expression (())
┌──── Opcode 0xF2 indicates a delimited S-expression
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F2 F0
Delimited encoding of (1 2 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F2 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Delimited encoding of (1 (2) 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│        ┌─── Opcode 0xF2 begins a nested delimited S-expression
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested S-expression.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and
│        │        │        │    still open) delimited container: the outer S-expression.
│        │        │        │
F2 61 01 F2 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3

Structs

Length-prefixed encoding

If the high nibble of the opcode is 0xD_, it represents a struct. The lower nibble of the opcode indicates how many bytes were used to encode all of its nested (field name, value) pairs. Opcode 0xD0 represents an empty struct.

warning

Opcode 0xD1 is illegal. Non-empty structs must have at least two bytes: a field name and a value.

If the struct's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFD opcode to write a variable-length struct. The 0xFD opcode is followed by a FlexUInt that indicates the byte length.

Each field in the struct is encoded as a FlexUInt representing the address of the field name's text in the symbol table, followed by an opcode-prefixed value.

0xEB 0x0B represents null.struct.

Length-prefixed encoding of an empty struct ({})
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode
D0
Length-prefixed encoding of {$10: 1, $11: 2}
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│  ┌─── Field name: FlexUInt 10 ($10)
│  │        ┌─── Field name: FlexUInt 11 ($11)
│  │        │
D6 15 61 01 17 61 02
      └─┬─┘    └─┬─┘
        1        2
Length-prefixed encoding of {$10: "variable length struct"}
 ┌───────────── Opcode `FD` indicates a struct with a FlexUInt length prefix
 │  ┌────────── Length: FlexUInt 25
 │  │  ┌─────── Field name: FlexUInt 10 ($10)
 │  │  │  ┌──── Opcode `F9` indicates a variable length string
 │  │  │  │  ┌─ FlexUInt: 22 the string is 22 bytes long
 │  │  │  │  │  v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  t  r  u  c  t
FD 33 15 F9 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74
               └─────────────────────────────┬─────────────────────────────────┘
                                        UTF-8 bytes
Encoding of null.struct
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: struct
│  │
EB 0B
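
A length-prefixed struct with FlexUInt field names can be sketched as below. The helper names are hypothetical, field values are passed in already encoded, and the FlexUInt layout is assumed from the examples above. The sketch rejects symbol address 0 because a FlexUInt zero in field name position has a special meaning.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def length_prefixed_struct(fields: list) -> bytes:
    """fields is a list of (symbol_address, encoded_value_bytes) pairs."""
    body = bytearray()
    for address, value in fields:
        if address <= 0:
            # FlexUInt 0 in field name position is reserved; not handled here.
            raise ValueError("this sketch requires symbol addresses >= 1")
        body += flex_uint(address) + value
    if len(body) <= 15:
        return bytes([0xD0 | len(body)]) + bytes(body)
    return b"\xFD" + flex_uint(len(body)) + bytes(body)


small = length_prefixed_struct([(10, b"\x61\x01"), (11, b"\x61\x02")])
```

Since every field contributes at least a one-byte name and a one-byte value, this function never produces the illegal opcode 0xD1.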

Optional FlexSym field name encoding

By default, all struct field names are encoded as FlexUInt symbol addresses. However, a writer has the option of encoding the field names as FlexSyms instead, granting additional flexibility at the expense of some compactness.

Writing field names as FlexSyms allows the writer to:

  • encode the UTF-8 bytes of the field name inline (for example, to avoid modifying the symbol table).
  • call a macro whose output (another struct) will be merged into the current struct.
  • encode the field name as a symbol address if it's already in the symbol table (just like a FlexUInt would, but slightly less compactly).

To switch to FlexSym field names, the writer emits a FlexUInt zero (byte 0x01) in field name position to inform the reader that subsequent field names will be encoded as FlexSyms.

This switch is one way. Once the writer switches to using FlexSym, the encoding cannot be switched back to FlexUInt for the remainder of the struct.

Switching to FlexSym while encoding {$10: 1, foo: 2, $11: 3}

In this example, the writer switches to FlexSym field names before encoding foo so it can write the UTF-8 bytes inline.

┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│  ┌─── Field name: FlexUInt 10 ($10)
│  │        ┌─── FlexUInt 0: Switch to FlexSym field name encoding
│  │        │
│  │        │  ┌─── FlexSym: 3 UTF-8 bytes follow
│  │        │  │                 ┌─── Field name: FlexSym 11 ($11)
│  │        │  │   f  o  o       │
DD 15 61 01 01 FB 66 6F 6F 61 02 17 61 03
      └─┬─┘                └─┬─┘    └─┬─┘
        1                    2        3

note

Because FlexUInt zero indicates a mode switch, encoding symbol ID $0 requires switching to FlexSym.

Length-prefixed encoding of {$0: 1}
┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 6
││ ┌── FlexUInt 0 in the field name position indicates that the struct
││ │   is switching to FlexSym mode
││ │  ┌── FlexSym "escape"
││ │  │  ┌── Symbol address: 1-byte FixedUInt follows
││ │  │  │  ┌─ FixedUInt 0
││ │  │  │  │
D6 01 01 E1 00 61 01
      └───┬──┘ └─┬─┘
         $0      1

Delimited encoding

Opcode 0xF3 indicates the beginning of a delimited struct. Unlike length-prefixed structs, delimited structs always encode their field names as FlexSyms.

Unlike lists and S-expressions, structs cannot use opcode 0xF0 by itself to indicate the end of the delimited container. This is because 0xF0 is a valid FlexSym (a symbol with 16 bytes of inline text). To close the delimited struct, the writer emits a 0x01 byte (a FlexSym escape) followed by the opcode 0xF0.

note

It is much more compact to write 0xD0, the empty length-prefixed struct.

Delimited encoding of the empty struct ({})

┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│  ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │  ┌─── Opcode 0xF0 indicates the end of the most
│  │  │    recently opened delimited container
F3 01 F0

Delimited encoding of {"foo": 1, $11: 2}

┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│
│  ┌─ FlexSym -3     ┌─ FlexSym: 11 ($11)
│  │                 │        ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │                 │        │  ┌─── Opcode 0xF0 indicates the end of the most
│  │   f  o  o       │        │  │    recently opened delimited container
F3 FB 66 6F 6F 61 01 17 61 02 01 F0
      └──┬───┘ └─┬─┘    └─┬─┘
      3 UTF-8    1        2
       bytes
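
The delimited struct examples can be reproduced with the sketch below. The helper names are hypothetical. It assumes that a FlexSym is a FlexInt (the signed counterpart of FlexUInt, in two's complement) whose positive values are symbol addresses and whose negative values give the byte length of inline UTF-8 text, which matches the annotations in the diagrams above. Empty text and symbol address 0 need the escape form and are not handled.

```python
def flex_int(value: int) -> bytes:
    """Encode a signed integer using the assumed FlexInt layout."""
    n = 1
    while not -(1 << (7 * n - 1)) <= value < (1 << (7 * n - 1)):
        n += 1
    raw = (value << n) | (1 << (n - 1))
    return (raw & ((1 << (8 * n)) - 1)).to_bytes(n, "little")  # two's complement


def flex_sym(name) -> bytes:
    """name is a str (inline text) or a positive int (symbol address)."""
    if isinstance(name, str):
        utf8 = name.encode("utf-8")
        if not utf8:
            raise ValueError("empty text needs the escape form (not handled)")
        return flex_int(-len(utf8)) + utf8
    if name <= 0:
        raise ValueError("address 0 needs the escape form (not handled)")
    return flex_int(name)


def delimited_struct(fields: list) -> bytes:
    """fields is a list of (name, encoded_value_bytes) pairs."""
    out = bytearray(b"\xF3")
    for name, value in fields:
        out += flex_sym(name) + value
    return bytes(out) + b"\x01\xF0"  # FlexSym escape, then the closing opcode F0


encoded = delimited_struct([("foo", b"\x61\x01"), (11, b"\x61\x02")])
```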

Encoding Expressions

note

This chapter focuses on the binary encoding of e-expressions. Macros by example explains what they are and how they are used.

E-expression with the address in the opcode

If the value of the opcode is less than 64 (0x40), it represents an E-expression invoking the macro at the corresponding address, which is an offset within the local macro table.

Invocation of macro address 7

┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
07
└── FixedUInt 7

Invocation of macro address 31

┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
1F
└── FixedUInt 31

Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section Macro calling conventions. (TODO: Link)

E-expressions with biased FixedUInt addresses

While E-expressions invoking macro addresses in the range [0, 63] can be encoded in a single byte using E-expressions with the address in the opcode, many applications will benefit from defining more than 64 macros. The 0x4_ and 0x5_ opcodes can be used to represent macro addresses up to 1,052,735 (the largest 0x5_ bias, 987,200, plus the largest 2-byte FixedUInt, 65,535). In both encodings, the address is biased by the total number of addresses with lower opcodes.

If the high nibble of the opcode is 0x4_, then a biased address follows as a 1-byte FixedUInt. For 0x4_, the bias is 256 * low_nibble + 64 (or (low_nibble << 8) + 64).

If the high nibble of the opcode is 0x5_, then a biased address follows as a 2-byte FixedUInt. For 0x5_, the bias is 65536 * low_nibble + 4160 (or (low_nibble << 16) + 4160).

Invocation of macro address 841

┌──── Opcode in range 40-4F indicates a macro address with 1-byte FixedUInt address
│┌─── Low nibble 3 indicates bias of 832
││
43 09
   │
   └─── FixedUInt 9

Biased Address : 9
Bias : 832
Address : 841

Invocation of macro address 142918

┌──── Opcode in range 50-5F indicates a macro address with 2-byte FixedUInt address
│┌─── Low nibble 2 indicates bias of 135232
││
52 06 1E
   └─┬─┘
     └─── FixedUInt 7686

Biased Address : 7686
Bias : 135232
Address : 142918

Macro address range biases for 0x4_ and 0x5_

Low Nibble   0x4_ Bias   0x5_ Bias
0            64          4160
1            320         69696
2            576         135232
3            832         200768
4            1088        266304
5            1344        331840
6            1600        397376
7            1856        462912
8            2112        528448
9            2368        593984
A            2624        659520
B            2880        725056
C            3136        790592
D            3392        856128
E            3648        921664
F            3904        987200

E-expression with the address as a trailing FlexUInt

The opcode 0xF4 indicates an e-expression whose address is encoded as a trailing FlexUInt with no bias. This encoding is less compact for addresses that can be encoded using opcodes 0x5F and below, but it is the only encoding that can be used for macro addresses greater than 1,052,735.

Invocation of macro address 4
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 09
   │
   └─── FlexUInt 4
Invocation of macro address 1_100_000
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 04 47 86
   └──┬───┘
      └─── FlexUInt 1,100,000
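
The three address encodings can be combined into one selection routine, sketched below. The function names are hypothetical, the thresholds are derived from the bias formulas given above, and the FlexUInt layout is assumed from the F4 examples. The sketch always picks the most compact form, although the 0xF4 form is also valid for small addresses.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def encode_macro_address(address: int) -> bytes:
    """Encode the opcode (and any address bytes) of an e-expression."""
    if address < 0:
        raise ValueError("macro addresses are non-negative")
    if address < 64:
        return bytes([address])                      # the address is the opcode
    if address < 4160:
        low_nibble, biased = divmod(address - 64, 256)
        return bytes([0x40 | low_nibble, biased])    # 1-byte FixedUInt follows
    if address < 4160 + 16 * 65536:
        low_nibble, biased = divmod(address - 4160, 65536)
        return bytes([0x50 | low_nibble]) + biased.to_bytes(2, "little")
    return b"\xF4" + flex_uint(address)              # unbiased trailing FlexUInt


examples = [encode_macro_address(a).hex(" ") for a in (7, 841, 142918, 1_100_000)]
```

Using divmod on the address minus the base bias yields the low nibble and the biased FixedUInt in one step, mirroring the 256 * low_nibble + 64 and 65536 * low_nibble + 4160 formulas.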

System Macro Invocations

E-expressions that invoke a system macro can be encoded using the 0xEF opcode followed by a 1-byte FixedUInt representing an index in the system macro table.

Encoding of the system macro values
┌──── Opcode 0xEF indicates a system macro invocation
│  ┌─── FixedUInt 1 indicates macro 1 from the system macro table
│  │
EF 01

In addition, system macros MAY be invoked using any of the 0x00-0x5F or 0xF4-0xF5 opcodes, provided that the macro being invoked has been given an address in user macro address space.

E-expression argument encoding

The example invocations in prior sections have demonstrated how to encode an invocation of the simplest form of macro--one with no parameters. This section explains how to encode macro invocations when they take parameters of different encodings and cardinalities.

To begin, we will examine how arguments are encoded when all of the macro's parameters use the tagged encoding and have a cardinality of exactly-one.

Tagged encoding

When a macro parameter does not specify an encoding (the parameter name is not annotated), arguments passed to that parameter use the 'tagged' encoding. The argument begins with a leading opcode that dictates how to interpret the bytes that follow.

This is the same encoding used for values in other Ion 1.1 contexts like lists, s-expressions, or at the top level.

Encoding a single exactly-one argument

A parameter with a cardinality of exactly-one expects its corresponding argument to be encoded as a single expression of the parameter's declared encoding. (The following section will explore the available encodings in greater depth; for now, our examples will be limited to parameters using the tagged encoding.)

When the macro has a single exactly-one parameter, the corresponding encoded argument follows the opcode and (if separate) the encoded address.

Example encoding of an e-expression with a tagged, exactly-one argument

Macro definition
(:set_macros
  (foo (x) /*...*/)
)
Text e-expression
(:foo 1)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression invoking
│     the macro at address 0.
│    ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│  ┌─┴─┐
00 61 01
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│  ┌──── FlexUInt 0: Macro address 0
│  │    ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│  │  ┌─┴─┐
F4 01 61 01

Encoding multiple exactly-one arguments

If the macro has more than one parameter, a reader would iterate over the parameters declared in the macro signature from left to right. For each parameter, the reader would use the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing of the e-expression's arguments is complete.

Example encoding of an e-expression with multiple tagged, exactly-one arguments

Macro definition
(:set_macros
  (foo (a b c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│    ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│    │     ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│    │     │     ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 61 01 61 02 61 03
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│  ┌──── FlexUInt 0: Macro address 0
│  │    ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│  │    │     ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│  │    │     │     ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│  │    │     │     │
│  │  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
F4 01 61 01 61 02 61 03

Tagless Encodings

In contrast to the tagged encoding, tagless encodings do not begin with an opcode. This means that they are potentially more compact than a tagged type, but are also less flexible. Because tagless encodings do not have an opcode, they cannot represent E-expressions, annotation sequences, or null values of any kind.

Tagless encodings consist of the primitive encodings and macro shapes.

Primitive encodings

Primitive encodings are self-delineating, either by having a statically known size in bytes or by including length information in their serialized form.

Ion type   Primitive encoding   Size in bytes   Encoding
int        uint8                1               FixedUInt
int        uint16               2               FixedUInt
int        uint32               4               FixedUInt
int        uint64               8               FixedUInt
int        flex_uint            variable        FlexUInt
int        int8                 1               FixedInt
int        int16                2               FixedInt
int        int32                4               FixedInt
int        int64                8               FixedInt
int        flex_int             variable        FlexInt
float      float16              2               Little-endian IEEE-754 half-precision float
float      float32              4               Little-endian IEEE-754 single-precision float
float      float64              8               Little-endian IEEE-754 double-precision float
symbol     flex_sym             variable        FlexSym

Example encoding of an e-expression with primitive, exactly-one arguments

As first demonstrated in Encoding multiple exactly-one arguments, the bytes of the serialized arguments begin immediately after the e-expression's opcode and (if separate) the macro address. The reader iterates over the parameters in the macro signature in the order they are declared. For each parameter, the reader uses the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing is complete.

Macro definition
(:set_macros
  (foo (flex_uint::a int8::b uint16::c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌─── Argument 'a': FlexUInt 1
│  │  ┌─── Argument 'b': 1-byte FixedInt 2
│  │  │    ┌─── Argument 'c': 2-byte FixedUInt 3
│  │  │  ┌─┴─┐
00 03 02 03 00
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│  ┌──── FlexUInt 0: Macro address 0
│  │  ┌─── Argument 'a': FlexUInt 1
│  │  │  ┌─── Argument 'b': 1-byte FixedInt 2
│  │  │  │    ┌─── Argument 'c': 2-byte FixedUInt 3
│  │  │  │  ┌─┴─┐
F4 01 03 02 03 00
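
As a rough Python sketch of the same procedure, the reader below consumes arguments purely from the declared encodings. The names `FIXED`, `read_flex`, and `read_tagless_args` are illustrative, only the integer encodings from the table above are covered, and the FlexUInt/FlexInt decoder assumes the lowest set bit of the first byte marks the encoding's size.

```python
# Fixed-width integer encodings: name -> (size in bytes, signed?)
FIXED = {
    "uint8": (1, False), "uint16": (2, False), "uint32": (4, False), "uint64": (8, False),
    "int8": (1, True), "int16": (2, True), "int32": (4, True), "int64": (8, True),
}


def read_flex(buf: bytes, pos: int, signed: bool):
    """Decode a FlexUInt (signed=False) or FlexInt (signed=True)."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("encodings longer than 8 bytes are out of scope here")
    size = (first & -first).bit_length()   # lowest set bit gives the byte count
    raw = int.from_bytes(buf[pos:pos + size], "little", signed=signed)
    return raw >> size, pos + size


def read_tagless_args(buf: bytes, pos: int, signature):
    """signature is a list of (encoding, name) pairs in declaration order.
    Returns ({name: value}, next_pos)."""
    args = {}
    for encoding, name in signature:
        if encoding in FIXED:
            size, signed = FIXED[encoding]
            args[name] = int.from_bytes(buf[pos:pos + size], "little", signed=signed)
            pos += size
        elif encoding in ("flex_uint", "flex_int"):
            args[name], pos = read_flex(buf, pos, signed=(encoding == "flex_int"))
        else:
            raise NotImplementedError(f"encoding {encoding!r} is not handled in this sketch")
    return args, pos
```

For the `foo` signature above, decoding starts at offset 1 when the address is in the opcode and at offset 2 when a FlexUInt address follows opcode `F4`.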

Macro shapes

The term macro shape describes a macro that is being used as the encoding of an E-expression argument. A parameter using a macro shape as its encoding is sometimes called a macro-shaped parameter. For example, consider the following two macro definitions.

The point2D macro takes two flex_int-encoded values as arguments.

(macro point2D (flex_int::x flex_int::y)
  {
    x: (%x),
    y: (%y),
  }
)

The line macro takes a pair of point2D invocations as arguments.

(macro line (point2D::start point2D::end)
  {
    start: (%start),
    end: (%end),
  }
)

Normally an e-expression would begin with an opcode and an address communicating what comes next. However, when we're reading the argument for a macro-shaped parameter, the macro being invoked is inferred from the parent macro signature instead. As such, there is no need to include an opcode or address.

┌──── Opcode 0x01 is less than 0x40; this is an e-expression
│     invoking the macro at address 1: `line`
│    ┌─── Argument $start: an implicit invocation of macro `point2D`
│    │     ┌─── Argument $end: an implicit invocation of macro `point2D`
│  ┌─┴─┐ ┌─┴─┐
01 03 05 07 09
   │  │  │  └────   $end/$y: FlexInt 4
   │  │  └───────   $end/$x: FlexInt 3
   │  └────────── $start/$y: FlexInt 2
   └───────────── $start/$x: FlexInt 1
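
A recursive reader makes the inference from the parent signature explicit. The sketch below is an illustration under stated assumptions: the `MACROS` table, its tuple layout, and the function names are hypothetical, and only `flex_int` and macro-shaped parameters are handled.

```python
# Hypothetical macro table: name -> list of (encoding, parameter name)
MACROS = {
    "point2D": [("flex_int", "x"), ("flex_int", "y")],
    "line": [("point2D", "start"), ("point2D", "end")],
}


def read_flex_int(buf: bytes, pos: int):
    """Decode a FlexInt starting at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("FlexInts longer than 8 bytes are out of scope here")
    size = (first & -first).bit_length()   # lowest set bit gives the byte count
    raw = int.from_bytes(buf[pos:pos + size], "little", signed=True)
    return raw >> size, pos + size


def read_shape(buf: bytes, pos: int, macro_name: str):
    """Read the arguments of `macro_name` with no opcode or address:
    the macro is known from the parent signature."""
    args = {}
    for encoding, name in MACROS[macro_name]:
        if encoding == "flex_int":
            args[name], pos = read_flex_int(buf, pos)
        elif encoding in MACROS:
            # Macro-shaped parameter: recurse using the nested signature.
            args[name], pos = read_shape(buf, pos, encoding)
        else:
            raise NotImplementedError(f"encoding {encoding!r} is not handled in this sketch")
    return args, pos
```

Applied to the four argument bytes `03 05 07 09` that follow the opcode, `read_shape(..., "line")` yields a `start` of (1, 2) and an `end` of (3, 4).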

Any macro can be used as a macro shape except for constants--macros which take zero parameters. Constants cannot be used as a macro shape because their serialized representation would be empty, making it impossible to encode them in expression groups. However, this limitation does not sacrifice any expressiveness; the desired constant can always be invoked directly in the body of the macro.

(:add_macros
  // Defines a constant 'hostname'
  (hostname () "abc123.us_west.example.com")

  (http_ok (hostname::server page)
  //           └── ERROR: cannot use a constant as a macro shape
     {
        server: (%server),
        page: (%page),
        message: OK,
        status: 200,
     }
  )

  (http_ok (page)
    {
      server: (.hostname),
      //           └── OK: invokes constant as needed
      page: (%page),
      message: OK,
      status: 200,
    }
  )
)

Encoding variadic arguments

The preceding sections have described how to (de)serialize the various parameter encodings, but these parameters have always had the same cardinality: exactly-one.

This section explains how to encode e-expressions invoking a macro whose signature contains variadic parameters--parameters with a cardinality of zero-or-one, zero-or-more, or one-or-more.

Argument Encoding Bitmap (AEB)

If a macro signature has one or more variadic parameters, then e-expressions invoking that macro will include an additional construct: the Argument Encoding Bitmap (AEB). This little-endian byte sequence precedes the first serialized argument and indicates how each argument corresponding to a variadic parameter has been encoded.

Each variadic parameter in the signature is assigned two bits in the AEB. This means that the reader can statically determine how many AEB bytes to expect in the e-expression by examining the signature.

Number of variadic parameters  AEB byte length
0                              0
1 to 4                         1
5 to 8                         2
9 to 12                        3
N                              ceiling(N/4)

Bits in the AEB are assigned from least significant to most significant and correspond to the variadic parameters in the signature from left to right. This allows the reader to right-shift away the bits of each variadic parameter when its corresponding argument has been read.

Example Signature     AEB Layout
()                    <No variadics, no AEB>
(a b c)               <No variadics, no AEB>
(a b c?)              ------cc
(a b* c?)             ----ccbb
(a+ b* c?)            --ccbbaa
(a+ b c?)             ----ccaa
(a+ b* c? d*)         ddccbbaa
(a+ b* c? d* e)       ddccbbaa
(a+ b* c? d* e f?)    ddccbbaa ------ff
(a+ b* c? d* e+ f?)   ddccbbaa ----ffee
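
The sizing rule and the right-shift technique can be sketched in Python. The representation of a signature as a list of strings such as `"a+"` or `"e"` is an assumption made for this example, as are the function names.

```python
def aeb_byte_length(variadic_count: int) -> int:
    """Two bits per variadic parameter, four parameters per byte."""
    return (variadic_count + 3) // 4


def read_aeb(aeb: bytes, signature):
    """Map each variadic parameter name to its 2-bit AEB entry.
    `signature` is a list such as ["a+", "b*", "c?", "d*", "e", "f?"]."""
    variadic = [p for p in signature if p[-1] in "?*+"]
    if len(aeb) != aeb_byte_length(len(variadic)):
        raise ValueError("AEB length does not match the signature")
    bits = int.from_bytes(aeb, "little")   # first byte holds the leftmost parameters
    entries = {}
    for param in variadic:
        entries[param[:-1]] = bits & 0b11  # low two bits belong to this parameter
        bits >>= 2                         # shift them away before the next one
    return entries
```

For the signature `(a+ b* c? d* e f?)`, the two AEB bytes `0x49 0x01` (`01001001 00000001`) give `a=01`, `b=10`, `c=00`, `d=01`, and `f=01`; the exactly-one parameter `e` has no entry.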

Each pair of bits in the AEB indicates what kind of expression to expect in the corresponding argument position.

Bit sequence  ?    *    +    Meaning
00            yes  yes  no   An empty stream. No bytes are present in the corresponding argument position.
01            yes  yes  yes  A single expression of the declared encoding is present in the corresponding argument position.
10            no   yes  yes  An expression group of the declared encoding is present in the corresponding argument position.
11            no   no   no   Reserved. A bitmap entry with this bit sequence is illegal in Ion 1.1.

As noted in the table above:

  • An empty stream (00) cannot be used to encode an argument for a parameter with a cardinality of one-or-more.
  • An expression group (10) cannot be used to encode an argument for a parameter with a cardinality of zero-or-one.
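
A reader might check each AEB entry against the parameter's cardinality before consuming the argument. The sketch below encodes the restrictions listed above; the `ALLOWED` table and the function name are illustrative, not part of the specification.

```python
# Legal AEB entries for each variadic cardinality modifier.
ALLOWED = {
    "?": {0b00, 0b01},          # zero-or-one: empty or a single expression
    "*": {0b00, 0b01, 0b10},    # zero-or-more: anything except the reserved value
    "+": {0b01, 0b10},          # one-or-more: never an empty stream
}


def aeb_entry_is_legal(cardinality: str, bits: int) -> bool:
    """Return True if the 2-bit entry may encode an argument for a
    parameter with the given cardinality modifier ('?', '*' or '+')."""
    if cardinality not in ALLOWED:
        raise ValueError(f"{cardinality!r} is not a variadic cardinality")
    return bits in ALLOWED[cardinality]   # 0b11 is reserved and never legal
```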

Expression groups

This section describes the encoding of an expression group. For an explanation of what an expression group is and how to use it, see Expression groups.

An expression group begins with a FlexUInt. If the FlexUInt's value is:

  • greater than zero, then it represents the number of bytes used to encode the rest of the expression group. The reader should continue reading expressions of the declared encoding until that number of bytes has been consumed.
  • zero, then it indicates that this is a delimited expression group and the processing varies according to whether the declared encoding is tagged or tagless. If the encoding is:
    • tagged, then each expression in the group begins with an opcode. The reader must consume tagged expressions until it encounters a terminating END opcode (0xF0).
    • tagless, then the expression group is a delimited sequence of 'chunks' that each have a FlexUInt length prefix and a body comprised of one or more expressions of the declared encoding. The reader will continue reading chunks until it encounters a length prefix of FlexUInt 0 (0x01), indicating the end of the chunk sequence. Each chunk in the sequence must be self-contained; an expression of the declared encoding may not be split across multiple chunks. See Example encoding of tagless zero-or-more with delimited expression group for an illustration.
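
The length-prefixed and chunked forms can be sketched for a single tagless encoding. This example assumes a parameter declared as `uint8`, so every expression is exactly one byte; the function names are hypothetical and the tagged delimited form (terminated by `0xF0`) is not covered.

```python
def read_flex_uint(buf: bytes, pos: int):
    """Decode a FlexUInt starting at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("FlexUInts longer than 8 bytes are out of scope here")
    size = (first & -first).bit_length()   # lowest set bit gives the byte count
    raw = int.from_bytes(buf[pos:pos + size], "little")
    return raw >> size, pos + size


def read_uint8_group(buf: bytes, pos: int):
    """Read an expression group of uint8 expressions; return (values, next_pos)."""
    length, pos = read_flex_uint(buf, pos)
    values = []
    if length > 0:
        # Length-prefixed group: the body is exactly `length` bytes long.
        if pos + length > len(buf):
            raise ValueError("expression group runs past the end of the input")
        values.extend(buf[pos:pos + length])
        return values, pos + length
    # Delimited group: a sequence of length-prefixed chunks ended by FlexUInt 0.
    while True:
        chunk_len, pos = read_flex_uint(buf, pos)
        if chunk_len == 0:
            return values, pos
        if pos + chunk_len > len(buf):
            raise ValueError("chunk runs past the end of the input")
        values.extend(buf[pos:pos + chunk_len])
        pos += chunk_len
```

With wider encodings such as `uint16`, a reader would additionally need to verify that each chunk length is a whole number of expressions, since an expression may not be split across chunks.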

Tip: While it is legal to write an empty expression group for zero-or-more parameters, it is always more efficient to set the parameter's AEB bits to 00 instead.

Example encoding of tagged zero-or-one with empty group

(:add_macros
  (foo (a?) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa
│  │     a=00, empty expression group
00 00

Example encoding of tagged zero-or-one with single expression

(:add_macros
  (foo (a?) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=01, single expression
│  │    ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│  │  ┌─┴─┐
00 01 61 01

Example encoding of tagged zero-or-more with empty group

(:add_macros
  (foo (a*) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=00, empty expression group
│  │
00 00

Example encoding of tagged zero-or-more with single expression

(:add_macros
  (foo (a*) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=01, single expression
│  │    ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│  │  ┌─┴─┐
00 01 61 01

Example encoding of tagged zero-or-more with expression group

(:add_macros
  (foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 6: 6-byte expression group
│  │  │    ┌──── Opcode 0x61 indicates a 1-byte int (1)
│  │  │    │     ┌──── Opcode 0x61 indicates a 1-byte int (2)
│  │  │    │     │     ┌─── Opcode 0x61 indicates a 1-byte int (3)
│  │  │  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 02 0D 61 01 61 02 61 03
         └───────┬───────┘
      6-byte expression group body

Example encoding of tagged zero-or-more with delimited expression group

(:add_macros
  (foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 0: delimited expression group
│  │  │    ┌──── Opcode 0x61 indicates a 1-byte int (1)
│  │  │    │     ┌──── Opcode 0x61 indicates a 1-byte int (2)
│  │  │    │     │     ┌─── Opcode 0x61 indicates a 1-byte int (3)
│  │  │    │     │     │   ┌─── Opcode 0xF0 is delimited end
│  │  │  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ │
00 02 01 61 01 61 02 61 03 F0
         └───────┬───────┘
        expression group body

Example encoding of tagged one-or-more with single expression

(:add_macros
  (foo (a+) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=01, single expression
│  │  ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int
│  │  │   1
00 01 61 01

Example encoding of tagged one-or-more with expression group

(:add_macros
  (foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 6: 6-byte expression group
│  │  │  ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     │     ┌─── Opcode 0x61 indicates a 1-byte int
│  │  │  │   1 │  2  │   3
00 02 0D 61 01 61 02 61 03
         └───────┬───────┘
      6-byte expression group body

Example encoding of tagged one-or-more with delimited expression group

(:add_macros
  (foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 0: delimited expression group
│  │  │  ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     ┌──── Opcode 0x61 indicates a 1-byte int
│  │  │  │     │     ┌─── Opcode 0x61 indicates a 1-byte int
│  │  │  │     │     │      ┌─── Opcode 0xF0 is delimited end
│  │  │  │   1 │  2  │   3  │
00 02 01 61 01 61 02 61 03 F0
         └───────┬───────┘
        expression group body

Example encoding of tagless zero-or-more with expression group

(:add_macros
  (foo (uint8::a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 3: 3-byte expression group
│  │  │  ┌──── uint8 1
│  │  │  │  ┌──── uint8 2
│  │  │  │  │  ┌─── uint8 3
│  │  │  │  │  │
00 02 07 01 02 03
         └──┬───┘
   expression group body

Example encoding of tagless zero-or-more with delimited expression group

(:add_macros
  (foo (uint8::a*) /*...*/)
)
(:foo 1 2 3 4 5)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│     invoking the macro at address 0.
│  ┌──── AEB: 0b------aa; a=10, expression group
│  │  ┌──── FlexUInt 0: Delimited expression group
│  │  │  ┌──── FlexUInt 3: 3-byte chunk of uint8 expressions
│  │  │  │            ┌──── FlexUInt 2: 2-byte chunk of uint8 expressions
│  │  │  │            │       ┌──── FlexUInt 0: End of group
│  │  │  │            │       │
00 02 01 07 01 02 03 05 04 05 01
            └──┬───┘    └─┬─┘
            chunk 1    chunk 2

Annotations

Annotations can be encoded either as symbol addresses or as FlexSyms. In both encodings, the annotations sequence appears just before the value that it decorates.

It is illegal for an annotations sequence to appear before any of the following:

  • The end of the stream
  • Another annotations sequence
  • A NOP
  • An e-expression. To add annotations to the expansion of an E-expression, see the annotate macro.

Annotations With Symbol Addresses

Opcodes 0xE4 through 0xE6 indicate one or more annotations encoded as symbol addresses. If the opcode is:

  • 0xE4, a single FlexUInt-encoded symbol address follows.
  • 0xE5, two FlexUInt-encoded symbol addresses follow.
  • 0xE6, a FlexUInt follows that represents the number of bytes needed to encode the annotations sequence, which can be made up of any number of FlexUInt symbol addresses.
Encoding of $10::false
┌──── The opcode `0xE4` indicates a single annotation encoded as a symbol address follows
│  ┌──── Annotation with symbol address: FlexUInt 10
E4 15 6F
      └── The annotated value: `false`
Encoding of $10::$11::false
┌──── The opcode `0xE5` indicates that two annotations encoded as symbol addresses follow
│  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
E5 15 17 6F
         └── The annotated value: `false`
Encoding of $10::$11::$12::false
┌──── The opcode `0xE6` indicates a variable-length sequence of symbol address annotations;
│     a FlexUInt follows representing the length of the sequence.
│  ┌──── Annotations sequence length: FlexUInt 3
│  │  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
│  │  │  │  ┌──── Annotation with symbol address: FlexUInt 12 ($12)
E6 07 15 17 19 6F
               └── The annotated value: `false`
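
The choice among the three opcodes can be sketched as a small Python encoder. The function names are hypothetical, and the FlexUInt encoder assumes the layout in which the value is shifted left by the encoding's size and the bit below it is set.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("FlexUInt cannot encode negative values")
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_address_annotations(addresses) -> bytes:
    """Encode an annotations sequence of symbol addresses. The annotated
    value must be written immediately after the returned bytes."""
    encoded = [flex_uint(a) for a in addresses]
    if not encoded:
        raise ValueError("an annotations sequence needs at least one annotation")
    if len(encoded) == 1:
        return b"\xE4" + encoded[0]
    if len(encoded) == 2:
        return b"\xE5" + encoded[0] + encoded[1]
    body = b"".join(encoded)
    return b"\xE6" + flex_uint(len(body)) + body   # byte length, then the addresses
```

Appending the opcode for `false` (`6F`) to the result for addresses 10, 11, and 12 gives `E6 07 15 17 19 6F`.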

Annotations With FlexSym Text

Opcodes 0xE7 through 0xE9 indicate one or more annotations encoded as FlexSyms.

If the opcode is:

  • 0xE7, a single FlexSym-encoded symbol follows.
  • 0xE8, two FlexSym-encoded symbols follow.
  • 0xE9, a FlexUInt follows that represents the byte length of the annotations sequence, which is made up of any number of annotations encoded as FlexSyms.

While this encoding is more flexible than annotations with symbol addresses, it can be slightly less compact when every annotation in the sequence is a symbol address.

Encoding of $10::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation with symbol address: FlexSym 10 ($10)
E7 15 6F
      └── The annotated value: `false`
Encoding of foo::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │   f  o  o
E7 FB 66 6F 6F 6F
      └──┬───┘ └── The annotated value: `false`
      3 UTF-8
       bytes

Note that FlexSym annotation sequences can switch between symbol address and inline text on a per-annotation basis.

Encoding of $10::foo::false
┌──── The opcode `0xE8` indicates two annotations encoded as FlexSyms follow
│  ┌──── Annotation: FlexSym 10 ($10)
│  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │   f  o  o
E8 15 FB 66 6F 6F 6F
         └──┬───┘ └── The annotated value: `false`
         3 UTF-8
          bytes
Encoding of $10::foo::$11::false
┌──── The opcode `0xE9` indicates a variable-length sequence of FlexSym-encoded annotations
│  ┌──── Length: FlexUInt 6
│  │  ┌──── Annotation: FlexSym 10 ($10)
│  │  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │  │           ┌──── Annotation: FlexSym 11 ($11)
│  │  │  │   f  o  o │
E9 0D 15 FB 66 6F 6F 17 6F
            └──┬───┘    └── The annotated value: `false`
            3 UTF-8
             bytes
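
A FlexSym annotation encoder can be sketched along the same lines. In this illustration a Python `int` stands for a symbol address and a `str` for inline text; the function names are hypothetical. It assumes that a FlexSym is a FlexInt whose positive values are symbol addresses and whose negative values give the byte length of the UTF-8 text that follows, and it does not handle symbol address 0 or empty text, which need a separate escape form.

```python
def flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (7 payload bits per byte)."""
    size = 1
    while not -(1 << (7 * size - 1)) <= value < (1 << (7 * size - 1)):
        size += 1
    raw = (value << size) | (1 << (size - 1))
    return raw.to_bytes(size, "little", signed=True)


def flex_sym(symbol) -> bytes:
    """Encode a symbol address (int) or inline text (str) as a FlexSym."""
    if isinstance(symbol, int):
        if symbol <= 0:
            raise NotImplementedError("address 0 is not handled in this sketch")
        return flex_int(symbol)
    text = symbol.encode("utf-8")
    if not text:
        raise NotImplementedError("empty text is not handled in this sketch")
    return flex_int(-len(text)) + text   # negative length, then the UTF-8 bytes


def encode_flexsym_annotations(symbols) -> bytes:
    """Encode an annotations sequence of FlexSyms. The annotated value
    must be written immediately after the returned bytes."""
    encoded = [flex_sym(s) for s in symbols]
    if not encoded:
        raise ValueError("an annotations sequence needs at least one annotation")
    if len(encoded) <= 2:
        return (b"\xE7" if len(encoded) == 1 else b"\xE8") + b"".join(encoded)
    body = b"".join(encoded)
    # The length prefix is a FlexUInt; for the non-negative lengths below 64
    # used here, the FlexInt encoder produces the same bytes.
    if len(body) >= 64:
        raise NotImplementedError("long sequences are not handled in this sketch")
    return b"\xE9" + flex_int(len(body)) + body
```

Under these assumptions, FlexSym -3 is the single byte `FB`, so `$10::foo::$11` encodes as `E9 0D 15 FB 66 6F 6F 17`.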

NOPs

A NOP (short for "no-operation") is the binary equivalent of whitespace. NOP bytes have no meaning, but can be used as padding to achieve a desired alignment.

An opcode of 0xEC indicates a single-byte NOP pad. An opcode of 0xED indicates that a FlexUInt follows that represents the number of additional bytes to skip.

It is legal for a NOP to appear anywhere that a value can be encoded. It is not legal for a NOP to appear in annotation sequences or struct field names. If a NOP appears in place of a struct field value, then the associated field name is ignored; the NOP is immediately followed by the next field name, if any.

Encoding of a 1-byte NOP
┌──── The opcode `0xEC` represents a 1-byte NOP pad
│
EC
Encoding of a 4-byte NOP
┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows
│  ┌──── Length: FlexUInt 2; two more bytes of NOP follow
│  │
ED 05 93 C6
      └─┬─┘
NOP bytes, values ignored
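
A writer that wants a specific amount of padding has to account for the opcode and the FlexUInt length itself. The sketch below is one way to do that; `nop_pad` is a hypothetical helper, the padding bytes are written as zeros although any values are permitted, and the FlexUInt encoder assumes the usual shifted layout.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (7 payload bits per byte)."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def nop_pad(total: int) -> bytes:
    """Return NOP padding that occupies exactly `total` bytes."""
    if total < 0:
        raise ValueError("padding size cannot be negative")
    if total == 0:
        return b""
    if total == 1:
        return b"\xEC"                       # single-byte NOP
    # total = 1 (opcode 0xED) + len(FlexUInt(k)) + k skipped bytes
    for prefix_len in range(1, 9):
        k = total - 1 - prefix_len
        if k >= 0 and len(flex_uint(k)) == prefix_len:
            return b"\xED" + flex_uint(k) + bytes(k)
    # No single variable-length NOP has this size (for example 130 bytes),
    # so emit a 1-byte NOP and pad the remainder.
    return b"\xEC" + nop_pad(total - 1)
```

For example, `nop_pad(4)` produces `ED 05 00 00`: the opcode, FlexUInt 2, and two ignored bytes.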

Grammar

This chapter presents Ion 1.1's domain grammar, by which we mean the grammar of the domain of values that drive Ion's encoding features.

We use a BNF-like notation for describing various syntactic parts of a document, including Ion data structures. In such cases, the BNF should be interpreted loosely to accommodate Ion-isms like commas and unconstrained ordering of struct fields.

Documents

document           ::= ivm? segment*

ivm                ::= '$ion_1_0' | '$ion_1_1'

segment            ::= value* directive?

directive          ::= ivm 
                     | encoding-directive 
                     | symtab-directive 

symtab-directive   ::=  local-symbol-table     ; As per the Ion 1.0 specification¹

encoding-directive ::= '$ion_encoding::(' module-body ')'

    ¹Symbols – Local Symbol Tables.

Modules

module-body             ::= import* inner-module* symbol-table? macro-table?

shared-module           ::= '$ion_shared_module::' ivm '::(' catalog-key module-body ')'

import                  ::= '(import ' module-name catalog-key ')'

catalog-key             ::= catalog-name catalog-version?

catalog-name            ::= string

catalog-version         ::= unannotated-uint                   ; must be positive

inner-module            ::= '(module' module-name module-body ')'

module-name             ::= unannotated-identifier-symbol

symbol-table            ::= '(symbol_table' symbol-table-entry* ')'

symbol-table-entry      ::= module-name | symbol-list

symbol-list             ::= '[' symbol-text* ']'

symbol-text             ::= symbol | string

macro-table             ::= '(macro_table' macro-table-entry* ')'

macro-table-entry       ::= macro-definition
                          | macro-export
                          | module-name

macro-export            ::= '(export' qualified-macro-ref macro-name-declaration? ')'

Macro references

qualified-macro-ref     ::= module-name '::' macro-ref

macro-ref               ::= macro-name | macro-addr

qualified-macro-name    ::= module-name '::' macro-name

macro-name              ::= unannotated-identifier-symbol

macro-addr              ::= unannotated-uint 

Macro definitions

macro-definition        ::= '(macro' macro-name-declaration signature tdl-expression ')'

macro-name-declaration  ::= macro-name | 'null'

signature               ::= '(' parameter* ')'

parameter               ::= parameter-encoding? parameter-name parameter-cardinality?

parameter-encoding      ::= (primitive-encoding-type | macro-name | qualified-macro-name)'::'

primitive-encoding-type ::= 'uint8' | 'uint16' | 'uint32' | 'uint64'
                          |  'int8' |  'int16' |  'int32' |  'int64'
                          | 'float16' | 'float32' | 'float64'
                          | 'flex_int' | 'flex_uint' 
                          | 'flex_sym' | 'flex_string'

parameter-name          ::= unannotated-identifier-symbol

parameter-cardinality   ::= '!' | '*' | '?' | '+'

tdl-expression          ::= operation | variable-expansion | ion-scalar | ion-container

operation               ::= macro-invocation | special-form

variable-expansion      ::= '(%' variable-name ')'

variable-name           ::= unannotated-identifier-symbol

macro-invocation        ::= '(.' macro-ref macro-arg* ')'

special-form            ::= '(.' '$ion::'?  special-form-name tdl-expression* ')'

special-form-name       ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi'

macro-arg               ::= tdl-expression | expression-group

expression-group        ::= '(..' tdl-expression* ')'
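
As an informal illustration of the `parameter` production, the Python sketch below splits a parameter written as a single text token, such as `flex_uint::a*`, into its encoding, name, and cardinality. The regular expression, the ASCII-only identifier pattern, and the returned dictionary layout are assumptions made for this example; a real implementation would operate on parsed Ion values rather than strings.

```python
import re

PRIMITIVES = {
    "uint8", "uint16", "uint32", "uint64",
    "int8", "int16", "int32", "int64",
    "float16", "float32", "float64",
    "flex_int", "flex_uint", "flex_sym", "flex_string",
}
CARDINALITIES = {
    "!": "exactly-one", "?": "zero-or-one",
    "*": "zero-or-more", "+": "one-or-more",
}
IDENT = r"[A-Za-z_$][A-Za-z0-9_$]*"
PARAM = re.compile(
    rf"(?:(?P<encoding>{IDENT}(?:::{IDENT})?)::)?(?P<name>{IDENT})(?P<card>[!*?+])?"
)


def parse_parameter(token: str) -> dict:
    """Split a parameter token into encoding, kind, name, and cardinality."""
    match = PARAM.fullmatch(token)
    if match is None:
        raise ValueError(f"not a valid parameter: {token!r}")
    encoding = match.group("encoding")
    if encoding is None:
        kind = "tagged"                 # no encoding prefix
    elif encoding in PRIMITIVES:
        kind = "primitive"
    else:
        kind = "macro shape"            # macro-name or qualified-macro-name
    return {
        "encoding": encoding,
        "kind": kind,
        "name": match.group("name"),
        # An omitted modifier defaults to exactly-one.
        "cardinality": CARDINALITIES[match.group("card") or "!"],
    }
```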

Glossary

active encoding module
The encoding module whose symbol table and macro table are available in the current segment of an Ion document. The active encoding module is set by a directive.

argument
The sub-expression(s) within a macro invocation, corresponding to exactly one of the macro's parameters.

cardinality
Describes both the number of argument expressions that a parameter will accept when the macro is invoked, and the number of values that the parameter may expand to during evaluation. A parameter's cardinality can be zero-or-one, exactly-one, zero-or-more, or one-or-more, specified in a signature by one of the modifiers ?, !, *, or + respectively. If no modifier is specified, cardinality defaults to exactly-one.

declaration
The association of a name with an entity (for example, a module or macro). See also definition. Not all declarations are definitions: some introduce new names for existing entities.

definition
The specification of a new entity.

directive
A keyword or unit of data in an Ion document that affects the encoding environment, and thus the way the document's data is encoded and decoded. In Ion 1.0 there are two kinds of directive: Ion version markers and symbol table directives. Ion 1.1 adds encoding directives.

document
A stream of octets conforming to either the Ion text or binary specification. Can consist of multiple segments, perhaps using varying versions of the Ion specification. A document does not necessarily exist as a file, and is not necessarily finite.

E-expression
See encoding expression.

encoding directive
In an Ion 1.1 segment, a top-level S-Expression annotated with $ion_encoding. Defines a new encoding module for the segment immediately following it. At the end of the encoding directive, the new encoding module is promoted to be the active encoding module. The symbol table directive is effectively a less capable alternative syntax.

encoding environment
The context-specific data maintained by an Ion implementation while encoding or decoding data. In Ion 1.0 this consists of the current symbol table; in Ion 1.1 this is expanded to also include the Ion spec version, the current macro table, and a collection of available modules.

encoding expression
The invocation of a macro in encoded data, aka e-expression. Starts with a macro reference denoting the function to invoke. The Ion text format uses "smile syntax" (:macro ...) to denote e-expressions. Ion binary devotes a large number of opcodes to e-expressions, so they can be compact.

encoding module
A module whose symbol table and macro table can be used directly in the user data stream.

expression
A serialized syntax element that may produce values. Encoding expressions and values are both considered expressions, whereas NOP, comments, and IVMs, for example, are not.

expression group
A grouping of zero or more expressions that together form one argument. The concrete syntax for passing a stream of expressions to a macro parameter. In a text e-expression, a group starts with the trigraph (:: and ends with ), similar to an S-expression. In template definition language, a group is written as an S-expression starting with .. (two dots).

inner module
A module that is defined inside another module and only visible inside the definition of that module.

Ion version marker
A keyword directive that denotes the start of a new segment encoded with a specific Ion version. Also known as "IVM".

macro
A transformation function that accepts some number of streams of values, and produces a stream of values.

macro definition
Specifies a macro in terms of a signature and a template.

macro reference
Identifies a macro for invocation or exporting. Must always be unambiguous. Lexically scoped. Cannot be a "forward reference" to a macro that is declared later in the document; these are not legal.

module
The data entity that defines and exports both symbols and macros.

opcode
A 1-byte, unsigned integer that tells the reader what the next expression represents and how the bytes that follow should be interpreted.

optional parameter
A parameter that can have its corresponding subform(s) omitted when the macro is invoked. A parameter is optional if both it and the parameters that follow it in the macro signature can accept an empty stream.

parameter
A named input to a macro, as defined by its signature. At expansion time a parameter produces a stream of values.

qualified macro reference
A macro reference that consists of a module name and either a macro name exported by that module, or a numeric address within the range of the module's exported macro table. In TDL, these look like module-name::name-or-address.

required parameter
A macro parameter that is not optional and therefore requires an argument at each invocation.

rest parameter
A macro parameter—always the final parameter—declared with * or + cardinality, that accepts all remaining individual arguments to the macro as if they were in an implicit argument group. Applies to Ion text and TDL. Similar to "varargs" parameters in Java and other languages.

segment
A contiguous partition of a document that uses the same active encoding module. Segment boundaries are caused by directives: an IVM starts a new segment (ending the prior segment, if any), while $ion_symbol_table and $ion_encoding directives end segments (with a new one starting immediately afterward).

shared module
A module that exists independent of the data stream of an Ion document. It is identified by a name and version so that it can be imported by other modules.

signature
The part of a macro definition that specifies its "calling convention", in terms of the shape, type, and cardinality of arguments it accepts.

symbol table directive
A top-level struct annotated with $ion_symbol_table. Defines a new encoding environment without any macros. Valid in Ion 1.0 and 1.1.

system e-expression
An e-expression that invokes a macro from the system module rather than from the active encoding module.

system macro
A macro provided by the Ion implementation via the system module $ion. System macros are available at all points within Ion 1.1 segments.

system module
A standard module named $ion that is provided by the Ion implementation, implicitly installed so that the system symbols and system macros are available at all points within a document. Subsumes the functionality of the Ion 1.0 system symbol table.

system symbol
A symbol provided by the Ion implementation via the system module $ion. System symbols are available at all points within an Ion document, though the selection of symbols varies by segment according to its Ion version.

TDL
See template definition language.

template
The part of a macro definition that expresses its transformation of inputs to results.

template definition language
An Ion-based, domain-specific language that declaratively specifies the output produced by a macro. Template definition language uses only the Ion data model.

unqualified macro reference
A macro reference that consists of either a macro name or numeric address, without a qualifying module name. These are resolved using lexical scope and must always be unambiguous.

variable expansion
In TDL, a special form that causes all argument expression(s) for the given parameter to be expanded and the result of the expansion to be substituted into the template.
