This is a draft specification of Ion 1.1, a new minor version of the Ion serialization format.
Status
This document is a working draft and is subject to change.
Audience
This document presents the formal specification for the Ion 1.1 data format. It is not intended to be used as a user guide or as a cookbook, but as a reference to the syntax and semantics of the Ion data format and its logical data model.
What's New in Ion 1.1
This section gives a high-level overview, from an implementer's perspective, of what is new and different in Ion 1.1 compared to Ion 1.0.
Motivation
Ion 1.1 has been designed to address some of the trade-offs in Ion 1.0 to make it suitable for a wider range of applications, giving greater representational choice and expressive power. Some applications want to optimize writes over reads, or are constrained by the writer in some way (e.g., it's prohibitively expensive to buffer an entire value before writing). Ion 1.1 now makes both length prefixing of containers and the interning of symbol tokens independently optional, granting such writers greater flexibility. Data density is another motivation. Certain encodings (e.g., timestamps, integers) have been made more compact and efficient. More significantly, macros now enable applications to have very flexible interning of their data's structure. In aggregate, data transcoded from Ion 1.0 to Ion 1.1 should be more compact and more efficient to both read and write.
Backwards compatibility
Ion 1.0 and Ion 1.1 share the same data model. Any data that can be represented in Ion 1.0 can also be represented with full fidelity in Ion 1.1 and vice-versa. This means that it is always possible to convert data from one version to the other without risk of data loss.
Ion 1.1 readers should be able to understand both Ion 1.0 and Ion 1.1 data.
The text encoding grammar of Ion 1.1 is a superset of Ion 1.0's text encoding grammar. Any Ion 1.0 text data can also be parsed by an Ion 1.1 text parser.
note
Because Ion 1.1 has a different system symbol table,
symbol IDs in an Ion 1.0 stream do not always refer to the same text as the same symbol ID in an Ion 1.1 stream.
For example: in an Ion 1.0 stream, $4
is always the text "name"
. However, $4
may or may not be "name"
in an Ion 1.1 stream. It may instead be user symbol 4 if the user has chosen not to export the system symbols.
Ion 1.1's binary encoding is substantially different from Ion 1.0's binary encoding. Many changes have been made to make values more compact, faster to read and/or faster to write. Ion 1.0's type descriptors have been supplanted by Ion 1.1's more general opcodes, which have been organized to prioritize the most commonly used encodings and make leveraging macros as inexpensive as possible.
In both text and binary Ion 1.1, the Ion Version Marker syntax is compatible with Ion 1.0's version marker syntax.
This means that an Ion 1.0-only reader can correctly identify when a stream uses Ion 1.1 (allowing it to report an error), and an Ion 1.1 reader can correctly "downshift" to expecting Ion 1.0 data when it encounters a stream using Ion 1.0.
Two streams using different Ion versions can be safely concatenated together provided that they are both text or both binary. A concatenated stream containing both Ion 1.0 and Ion 1.1 can only be fully read by a reader that supports Ion 1.1.
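The version check described above can be sketched for binary streams as follows. This is a minimal illustration, not a reference implementation: it assumes the four-byte binary marker layout 0xE0, major, minor, 0xEA (the Ion 1.1 marker 0xE0 0x01 0x01 0xEA is given later in this document; 0xE0 0x01 0x00 0xEA is the Ion 1.0 marker). The function names parse_binary_ivm and check_supported are hypothetical.

```python
from typing import Optional, Tuple


def parse_binary_ivm(prefix: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) if `prefix` starts with a binary version marker."""
    if len(prefix) < 4 or prefix[0] != 0xE0 or prefix[3] != 0xEA:
        return None
    return prefix[1], prefix[2]


def check_supported(prefix: bytes, supported=((1, 0), (1, 1))) -> Tuple[int, int]:
    """Raise if the marker names a version this (hypothetical) reader cannot handle."""
    version = parse_binary_ivm(prefix)
    if version is None:
        raise ValueError("not a binary Ion version marker")
    if version not in supported:
        raise ValueError(f"unsupported Ion version {version[0]}.{version[1]}")
    return version


# An Ion 1.1 reader accepts both markers and can switch modes between them.
print(check_supported(bytes([0xE0, 0x01, 0x00, 0xEA])))
print(check_supported(bytes([0xE0, 0x01, 0x01, 0xEA])))
```

Passing supported=((1, 0),) models an Ion 1.0-only reader, which recognizes the Ion 1.1 marker well enough to report an error instead of misreading the data.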
Upgrading an existing application to Ion 1.1 often requires little-to-no code changes,
as APIs typically operate at the data model level ("write an integer")
rather than at the encoding level ("write 0x64
followed by four Little-Endian bytes").
However, taking full advantage of macros after upgrading typically requires additional development time.
Macros, templates, and encoding expressions
Ion 1.1 introduces a new primitive called an encoding expression (E-expression). In text syntax these expressions resemble S-expressions, but they are not part of the data model; instead, each is evaluated into zero or more Ion values (called a stream), which enables compact representation of Ion data. An E-expression represents the invocation of either a system-defined or a user-defined macro. Its arguments correspond to the formal parameters of the macro's definition, and each argument is itself an E-expression, a value literal, or a container constructor (list, sexp, or struct syntax containing E-expressions). The resulting stream is then expanded into the resulting Ion data model.
Top-level e-expressions
At the top level, the stream becomes individual top-level values. Consider for illustrative purposes an E-expression
(:values 1 2 3)
that evaluates to the stream 1
, 2
, 3
and (:none)
that evaluates to the empty stream. In the
following examples, values
and none
are the names of the macros being invoked and each line is equivalent.
// Encoding
a (:values 1 2 3) b (:none) c
// Evaluates to
a 1 2 3 b c
E-expressions in lists or S-expressions
Within a list or S-expression, the stream becomes additional child elements in the collection.
E-expressions in lists
// Encoding
[a, (:values 1 2 3), b, (:none), c]
// Evaluates to
[a, 1, 2, 3, b, c]
E-expressions in S-expressions
// Encoding
(a (:values 1 2 3) b (:none) c)
// Evaluates to
(a 1 2 3 b c)
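The splicing behavior shown in the list and S-expression examples can be modeled with a short Python sketch. Everything here is a toy: EExp, MACROS, and expand_sequence are hypothetical names, macro arguments are restricted to literals, and the two macros are stand-ins that only mimic the behavior of values and none described above.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the two macros used in the examples above.
MACROS: Dict[str, Callable[..., list]] = {
    "values": lambda *args: list(args),  # stream containing each argument
    "none": lambda: [],                  # the empty stream
}


class EExp:
    """Toy E-expression: a macro name plus literal arguments."""

    def __init__(self, name, *args):
        self.name = name
        self.args = args


def expand_sequence(elements: list) -> list:
    """Splice each E-expression's result stream into the enclosing sequence."""
    out = []
    for element in elements:
        if isinstance(element, EExp):
            out.extend(MACROS[element.name](*element.args))
        else:
            out.append(element)
    return out


encoded = ["a", EExp("values", 1, 2, 3), "b", EExp("none"), "c"]
print(expand_sequence(encoded))  # ['a', 1, 2, 3, 'b', 'c']
```

The same function models the top level, a list, and an S-expression, since in all three cases the stream simply contributes zero or more sibling values.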
E-expressions in structs
Within a struct, the effect depends on where the E-expression appears. In field name position (where no field name or value is written), the resulting stream must contain only structs, and each field of those structs becomes a field of the enclosing struct. In value position, each value in the resulting stream becomes a field with the field name that precedes the E-expression; an empty stream elides the field altogether. In the following example, assume that (:make_struct { c: 5 }) evaluates to the single struct {c: 5} and that (:make_field d 6) evaluates to the single struct {d: 6}.
// Encoding
{
a: (:values 1 2 3),
b: 4,
(:make_struct { c: 5 }),
(:make_field d 6),
e: (:none)
}
// Evaluates to
{
a: 1,
a: 2,
a: 3,
b: 4,
c: 5,
d: 6
}
Macro definitions
Macros can be defined by a user either directly in a default module within an encoding directive or in a module defined externally (i.e., a shared module). A macro either has a name, which must be unique within its module, or has no name.
Ion 1.1 defines a list of system macros that are built-in in the module named $ion
. Unlike the system symbol table,
which is always installed and accessible in the local symbol table, the system macros are both always accessible to
E-expressions and not installed in the local macro table by default (unlike the local symbol table).
In Ion binary, macros are always addressed in E-expressions by integer macro address. For user macros this is the offset in the local macro table. System macros may be addressed by the system macro address using a specific encoding op-code. In Ion text, macros may be addressed by
the offset in the local macro table (mirroring binary), by name, or by qualifying the macro name/offset with the module name in the encoding context. An E-expression can
only refer to macros installed in the local macro table or a macro from the system module. In text, an E-expression
referring to a system macro that is not installed in the local macro table, must use a qualified name with the $ion
module name.
For illustrative purposes let's consider the module named foo
that has a macro named bar
at offset 5 installed at
the beginning of the local macro table.
E-expressions name resolution
// allowed if there are no other macros named 'bar'
(:bar)
// fully qualified by module–always allowed
(:foo::bar)
// by local macro table offset
(:5)
// In text, system macros are always addressable by name.
// In binary, system macros may be invoked using a separate
// opcode.
(:$ion::none)
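The resolution rules illustrated above can be approximated in Python. This is a simplified sketch under several assumptions: the local macro table is modeled as a list of (module, name) pairs, SYSTEM_MACROS holds only a tiny illustrative subset, qualified offsets are not handled, and resolve is a hypothetical helper rather than part of any Ion library.

```python
from typing import List, Optional, Tuple, Union

Entry = Tuple[str, Optional[str]]  # (module name, macro name or None)

SYSTEM_MACROS = {"none", "values"}  # illustrative subset only


def resolve(table: List[Entry], ref: Union[int, str]) -> Entry:
    """Resolve a text macro reference against a toy local macro table."""
    if isinstance(ref, int):
        return table[ref]  # by local macro table offset
    if "::" in ref:
        module, name = ref.split("::", 1)
        if module == "$ion":
            # System macros are addressable by qualified name even when
            # they are not installed in the local macro table.
            if name in SYSTEM_MACROS:
                return ("$ion", name)
            raise LookupError(f"unknown system macro {name!r}")
        matches = [entry for entry in table if entry == (module, name)]
    else:
        # Unqualified names are only usable when they are unambiguous.
        matches = [entry for entry in table if entry[1] == ref]
    if len(matches) != 1:
        raise LookupError(f"{ref!r} matches {len(matches)} macros")
    return matches[0]


# Module `foo` occupies the start of the table, with `bar` at offset 5.
table = [("foo", f"m{i}") for i in range(5)] + [("foo", "bar")]
print(resolve(table, "bar"), resolve(table, "foo::bar"), resolve(table, 5))
```

All three references print ('foo', 'bar'). Installing a second macro named bar from another module would make the unqualified form fail while the qualified and offset forms keep working.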
Template definition language
User defined macros are defined by their parameters and template which defines how they are invoked and what stream of data they evaluate to. This template is defined using a domain specific Ion macro definition language with S-expressions. A template defines a list of zero or more parameters that it can accept. These parameters each have their own cardinality of expression arguments which can be specified as exactly one, zero or one, zero or more, and one or more. Furthermore the template defines what type of argument can be accepted by each of these parameters:
- "Tagged" values, whose encodings always begin with an opcode.
- "Tagless" values, whose encodings do not begin with an opcode and are therefore both more compact and less flexible (for example: flex_int, int32, float16).
- Specific macro-shaped arguments to allow for structural composition of macros and efficient encoding in binary.
The macro definition includes a template body that defines how the macro is expanded. In the language, system macros, macros defined in previously defined modules in the encoding context, and macros defined previously in the current module are available to be invoked with (.name ...)
syntax where name
is
the macro to be invoked. Certain names in the expression syntax are reserved for special forms (for example, literal
and if_none
). When a macro name is shadowed by a special form, or is ambiguous with respect to all
macros visible, it can always be qualified with (.module::name ...)
syntax where module
is the name of the module
and name
is the offset or name of the macro. Referring to a previously defined macro name within a module may be
qualified with (.name ...)
syntax.
Modules
Ion 1.0 uses symbol tables to group together related text values. In order to also accommodate macros, Ion 1.1 introduces modules, a named organizational unit that contains:
- An exported symbol table, a list of text values used to compactly encode symbol tokens like field names, annotations, and symbol values.
- An exported macro table, a list of macro definitions used to compactly encode complete values or partially populated containers.
- An unexported nested modules map, a set of unique module names and their associated module definitions.
While Ion 1.0 does not have modules, it is reasonable to think of Ion 1.0's local symbol table as a module that only has symbols, and whose macro table and nested modules map are permanently empty.
Modules can be imported from the catalog (they subsume shared symbol tables) or defined locally.
Directives
Directives modify the encoding context.
Syntactically, a directive is a top-level s-expression annotated with $ion
.
Its first child value is an operation name.
The operation determines what changes will be made to the encoding context and which clauses may legally follow.
$ion::
(operation_name
(clause_1 /*...*/)
(clause_2 /*...*/)
/*...*/
(clause_N /*...*/))
In Ion v1.1, there are three supported directive operations:
Shared Modules
Ion 1.1 extends the concept of a shared symbol table to be a shared module. An Ion 1.0 shared symbol table is a shared module with no macro definitions. Ion 1.1 introduces a new schema for the convention of serializing shared modules in Ion. An Ion 1.1 implementation should support having both Ion 1.0 shared symbol tables and Ion 1.1 shared modules in its catalog.
System Symbol Table Changes
The system symbol table in Ion 1.1 replaces the Ion 1.0 symbol table with new symbols. However, the system symbols are not required to be in the symbol table—they are always available to use.
Text syntax changes
Ion 1.1 text must use the $ion_1_1
version marker at the top-level of the data stream or document.
The only syntax change for the text format is the introduction of encoding expression (E-expression) syntax, which
allows for the invocation of macros in the data stream. This syntax is grammatically similar to S-expressions, except that
these expressions are opened with (:
and closed with )
. For example, (:a 1 2)
would expand the macro named a
with the
arguments 1
and 2
. This syntax is allowed anywhere an Ion value is allowed, and may also appear in the field name position of a struct. See the Macros, templates, and encoding expressions section for details.
Binary encoding changes
Ion 1.1 binary encoding reorganizes the type descriptors to support compact E-expressions, make certain encodings
more compact, and certain lower priority encodings marginally less compact (for greater detail see Type Encoding Changes). The IVM for this encoding is the octet
sequence 0xE0 0x01 0x01 0xEA
.
Inlined symbol tokens
In binary Ion 1.0, symbol values, field names, and annotations are required to be encoded using a symbol ID in the local symbol table. For some use cases (e.g. RPC or small, independent values where the symbol table overhead cannot be amortized) this creates a burden on the writer and may not actually be efficient for an application. Ion 1.1 introduces optional binary syntax for encoding inline UTF-8 sequences for these tokens which can allow an encoder to have flexibility in whether and when to add a given text value to the symbol table.
Ion text requires no change for this feature as it already had inline symbol tokens without using the local symbol
table. Ion text also has compatible syntax for representing the local symbol table and encoding of symbol tokens with
their position in the table (i.e., the $id
syntax).
See FlexSym
documentation for greater detail.
Delimited containers
In Ion 1.0, all data is length prefixed. While this is good for optimizing the reading of data, it requires an Ion encoder to buffer any data in memory to calculate the data's length. Ion 1.1 introduces optional binary syntax to allow containers to be encoded with an end marker instead of a length prefix.
See the relevant list, sexp, and struct delimited encoding sections for greater detail.
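The buffering trade-off can be illustrated with a toy framing format. The sketch below deliberately does not use Ion's real byte layout: the three TOY_ constants are placeholders, the length is a single byte, and children are opaque byte strings. It only shows why a length prefix forces the writer to hold the whole body in memory, while an end marker lets it stream.

```python
import io
from typing import BinaryIO, Iterable

# Placeholder tag bytes for this toy format; these are NOT Ion opcodes.
TOY_LIST_WITH_LENGTH = 0x01
TOY_LIST_DELIMITED = 0x02
TOY_END = 0xFF


def write_length_prefixed(out: BinaryIO, items: Iterable[bytes]) -> None:
    """The length comes first, so every child must be buffered before writing."""
    body = b"".join(items)
    if len(body) > 255:
        raise ValueError("toy format only supports a one-byte length")
    out.write(bytes([TOY_LIST_WITH_LENGTH, len(body)]))
    out.write(body)


def write_delimited(out: BinaryIO, items: Iterable[bytes]) -> None:
    """Children are written as they arrive; an end marker closes the container."""
    out.write(bytes([TOY_LIST_DELIMITED]))
    for item in items:
        out.write(item)
    out.write(bytes([TOY_END]))


children = [b"\x10", b"\x11\x12"]
prefixed, delimited = io.BytesIO(), io.BytesIO()
write_length_prefixed(prefixed, children)
write_delimited(delimited, iter(children))
print(prefixed.getvalue().hex(), delimited.getvalue().hex())
```

A reader of the length-prefixed form can skip the container in one step, whereas a reader of the delimited form has to walk the children to find the end marker. That is the read-versus-write trade-off described above.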
Low-level binary encoding changes
Ion 1.0's VarUInt
and VarInt
encoding primitives
used big-endian byte order and used the high bit of each byte to indicate whether it was the final byte in the encoding.
VarInt
used an additional bit in the first byte to represent the integer's sign. Ion 1.1 replaces these primitives
with more optimized versions called FlexUInt
and FlexInt
.
FlexUInt
and FlexInt
use little-endian byte order, avoiding the need for reordering on common architectures like
x86, aarch64, and RISC-V.
Rather than using a bit in each byte to indicate the width of the encoding, FlexUInt
and FlexInt
front-load
the continuation bits. In most cases, this means that these bits all fit in the first byte of the representation,
allowing a reader to determine the complete size of the encoding without having to inspect each byte individually.
Finally, FlexInt
does not use a separate bit to indicate its value's sign. Instead, it uses two's complement
representation, allowing it to share much of the same structure and parsing logic as its unsigned counterpart.
Benchmarks have shown that in aggregate, these encoding changes are between 1.25x and 3x faster than Ion 1.0's
VarUInt
and VarInt
encodings depending on the host architecture.
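The front-loaded width bits can be sketched in Python. This example assumes a specific layout that is not spelled out in this section and should be checked against the FlexUInt documentation: an n-byte encoding stores the value shifted left by n bits, with a single set bit at position n-1, written in little-endian order, so that counting the low-order zero bits gives the width. The function names are hypothetical.

```python
from typing import Tuple


def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    # Each encoded byte carries 7 payload bits; at least one byte is written.
    size = max(1, -(-value.bit_length() // 7))
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def decode_flex_uint(data: bytes) -> Tuple[int, int]:
    """Return (value, bytes consumed) for an encoding at the start of `data`."""
    zeros = 0
    for byte in data:
        if byte == 0:
            zeros += 8  # widths above 8 bytes spill into later bytes
            continue
        zeros += (byte & -byte).bit_length() - 1  # trailing zeros in this byte
        break
    else:
        raise ValueError("no width marker bit found")
    size = zeros + 1
    if len(data) < size:
        raise ValueError("truncated FlexUInt")
    return int.from_bytes(data[:size], "little") >> size, size


print(encode_flex_uint(14).hex())   # 1d
print(encode_flex_uint(729).hex())  # 660b
```

For every width up to eight bytes the marker bit sits in the first byte, so a reader learns the full size from one byte and can then load the whole encoding with a single little-endian read and a shift.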
Ion 1.1 supplants Ion 1.0's Int
encoding primitive
with a new encoding called FixedInt
, which uses two's complement notation instead of sign-and-magnitude.
A corresponding FixedUInt
primitive has also been introduced; its encoding is nearly the same as
Ion 1.0's UInt
primitive, save that UInt
is big endian where FixedUInt
is little endian.
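The byte-order and sign differences can be seen with Python's built-in integer conversions. This is only a sketch of the fixed-width representations described above (two's complement, little-endian), not of how lengths or opcodes surround them in a real stream; the helper names are hypothetical.

```python
def fixed_int(value: int, size: int) -> bytes:
    """Two's complement, little-endian, in exactly `size` bytes."""
    return value.to_bytes(size, "little", signed=True)


def fixed_uint(value: int, size: int) -> bytes:
    """Unsigned, little-endian, in exactly `size` bytes."""
    return value.to_bytes(size, "little", signed=False)


def ion_1_0_uint(value: int, size: int) -> bytes:
    """Ion 1.0's UInt for comparison: the same magnitude, big-endian."""
    return value.to_bytes(size, "big", signed=False)


print(fixed_int(-2, 2).hex())       # feff
print(fixed_uint(258, 2).hex())     # 0201
print(ion_1_0_uint(258, 2).hex())   # 0102
```

Decoding is symmetric: int.from_bytes(data, "little", signed=True) recovers a FixedInt, which on little-endian hardware corresponds to a plain load with sign extension.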
A new primitive encoding type, FlexSym
, has been introduced to flexibly encode
symbol IDs and symbol tokens with inline text.
tip
FlexSym
makes it possible for a writer to emit any Ion value as binary without requiring a symbol table. This is generally less efficient when working with multiple values but there are use cases where it is convenient.
Type encoding changes
All Ion types use the new low-level encoding primitives described in the previous section. Ion 1.0's type descriptors have been supplanted by Ion 1.1's more general opcodes, which have been organized to prioritize the most commonly used encodings and make leveraging macros as inexpensive as possible.
Typed null
values are now encoded in two bytes using the 0xEB
opcode.
Symbol IDs greater than two bytes no longer have dedicated type descriptors; the 65537th and subsequent symbols defined in a stream will each take an extra byte to represent in the stream.
Lists and S-expressions have two encodings:
a length-prefixed encoding and a new delimited form that ends with the 0xF0
opcode.
Struct values have the option of encoding their field names as
a FlexSym
, enabling them to write field name text inline
instead of adding all names to the symbol table. There is now also a delimited form.
Similarly, symbol values now also have the option of encoding their symbol text inline.
Annotation sequences are a prefix to the value they decorate, and no longer have an outer length container. They are now encoded with one of the six opcodes 0xE4
through 0xE9
.
- Opcodes 0xE4 through 0xE6 indicate one or more annotations encoded as symbol addresses.
- Opcodes 0xE7 through 0xE9 indicate one or more annotations encoded as a FlexSym.
The 0xE6
encoding is similar to how Ion 1.0 annotations are encoded with the exception that there is no
outer length in addition to the annotations sequence length.
Integers now use a FixedInt
sub-field instead of the Ion 1.0 encoding which used sign-and-magnitude (with two opcodes).
Decimals are structurally identical to their Ion 1.0 counterpart with the exception
of the negative zero coefficient. The Ion 1.1 FlexInt
encoding is two's complement, so negative zero cannot be
encoded directly with it. Instead, an opcode is allocated specifically for encoding decimals with a negative zero
coefficient.
Timestamps no longer encode their sub-field components as octet-aligned fields.
The Ion 1.1 format uses a packed bit encoding and has a biased form (encoding the year field as an offset from 1970) to make common encodings of timestamp easily fit in a 64-bit word for microsecond and nanosecond precision (with UTC offset or unknown UTC offset). Benchmarks have shown this new encoding to be 40% smaller, 59% faster to encode and 21% faster to decode in-range timestamps. A non-biased, arbitrary length timestamp with packed bit encoding is defined for uncommon cases.
Encoding expressions in binary
See the binary E-expressions documentation to learn more about how e-expressions are encoded in binary.
Macros
Like other self-describing formats, Ion 1.0 makes it possible to write a stream with truly arbitrary content—no formal schema required. However, in practice all applications have a de facto schema, with each stream sharing large amounts of predictable structure and recurring values. This means that Ion readers and writers often spend substantial resources processing undifferentiated data.
Consider this example excerpt from a webserver's log file:
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.100",
resource: "index.html"
}
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.100",
resource: "images/funny.jpg"
}
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.101",
resource: "index.html"
}
Macros allow users to define fill-in-the-blank templates for their data. This enables applications to focus on encoding and decoding the parts of the data that are distinctive, eliding the work needed to encode the boilerplate.
Using this macro definition:
(macro getOk (clientIp resource)
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: (.annotate "ip_addr" (%clientIp)),
resource: (%resource)
})
The same webserver log file could be written like this:
(:getOk "192.168.1.100" "index.html")
(:getOk "192.168.1.100" "images/funny.jpg")
(:getOk "192.168.1.101" "index.html")
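One way to picture the getOk macro is as an ordinary function that fills in a template. The Python sketch below is an analogy only: get_ok is a hypothetical name, structs are modeled as dicts, and the ip_addr annotation is modeled as an (annotation, value) tuple.

```python
def get_ok(client_ip: str, resource: str) -> dict:
    """Python analogue of the getOk template: constants plus two blanks."""
    return {
        "method": "GET",
        "statusCode": 200,
        "status": "OK",
        "protocol": "https",
        "clientIp": ("ip_addr", client_ip),  # (annotation, value) pair
        "resource": resource,
    }


# The three invocations carry only the parts that vary between entries.
invocations = [
    ("192.168.1.100", "index.html"),
    ("192.168.1.100", "images/funny.jpg"),
    ("192.168.1.101", "index.html"),
]
expanded = [get_ok(*args) for args in invocations]
print(expanded[1]["resource"])  # images/funny.jpg
```

The writer transmits two strings per entry, and the reader reconstructs all six fields from the shared definition.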
Macros are an encoding-level concern, and their use in the data stream is invisible to consuming applications. For writers, macros are always optional—a writer can always elect to write their data using value literals instead.
For a guided walkthrough of what macros can do, see Macros by example.
Macros by example
Before getting into the technical details of Ion’s macro and module system, it will help to be more familiar with the use of macros. We’ll step through increasingly sophisticated use cases, some admittedly synthetic for illustrative purposes, with the intent of teaching the core concepts and moving parts without getting into the weeds of more formal specification.
Ion macros are defined using a domain-specific language that is in turn expressed via the Ion
data model. That is, macro definitions are Ion data, and use Ion features like S-expressions and
symbols to represent code in a Lisp-like fashion. In this document, the fundamental construct we
explore is the macro definition, denoted using an S-expression of the form (macro name …)
where macro
is a keyword and name
must be a symbol denoting the macro's name.
NOTE: S-expressions of that shape only declare macros when they occur in the context of an encoding module. We will completely ignore modules for now, and the examples below omit this context to keep things simple.
Constants
The most basic macro is a constant:
(macro pi // name
() // signature
3.141592653589793) // template
This declaration defines a macro named pi
. The ()
is the macro’s signature, in this
case a trivial one that declares no parameters. The 3.141592653589793
is a similarly trivial
template, an expression in Ion 1.1's domain-specific language for defining macro functions.
This macro accepts no arguments and always returns a constant value.
To use pi
in an Ion document, we write an encoding expression or E-expression:
$ion_1_1
(:pi)
The syntax (:pi)
looks a lot like an S-expression. It’s not, though, since colons
cannot appear unquoted in that context. Ion 1.1 makes use of syntax that is not valid in Ion
1.0—specifically, the (:
digraph—to denote E-expressions. Those characters must be followed by
a reference to a macro, and we say that the E-expression is an invocation of the macro. Here,
(:pi)
is an invocation of the macro named pi
.
That document is equivalent to the following, in the sense that they denote the same data:
$ion_1_1
3.141592653589793
The process by which the Ion implementation turns the former document into the latter is called
macro expansion or just expansion. This happens transparently to
Ion-consuming applications: the stream of values in both cases is the same. The documents have
the same content, encoded in two different ways. It’s reasonable to think of (:pi)
as a custom
encoding for 3.141592653589793
, and the notation’s similarity to S-expressions leads us to the
term “encoding expression” (or "e-expression").
note
Any Ion 1.1 document with macros can be fully expanded into an equivalent Ion 1.0 document.
We can streamline future examples with a couple of conventions. First, assume that any E-expression
is occurring within an Ion 1.1 document; second, we use the relation notation, ⇒
, to mean “expands to”.
So we can say:
(:pi) ⇒ 3.141592653589793
Parameters and variable expansion
Most macros are not constant—they accept inputs that determine their results.
(macro passthrough
(x) // signature
(%x) // template
)
This macro has a signature that declares a parameter called x
, and it therefore requires one argument to be passed in when it is invoked.
This creates a variable (i.e. named data) called x
that can be referred to within the context of the template.
note
We are careful to distinguish between the views from “inside” and “outside” the macro: parameters are the names used by a macro’s implementation to refer to its expansion-time inputs, while arguments are the data provided to a macro at the point of invocation. In other words, we have “formal” parameters and “actual” arguments.
The body of this macro is our first non-trivial template, an expression in Ion’s new domain-specific language for defining macro functions.
This template definition language (TDL) treats Ion scalar values as literals, giving the decimal in pi
’s template its intended meaning.
In this example, the template expression (%x)
is a variable expansion in the form (%variable_name)
.
During macro evaluation, variable expansions are replaced by the contents of the referenced variable.
Because this macro's template is an expansion of its only parameter, x
, invoking the macro will produce the same value it was given as an argument.
(:passthrough 1) ⇒ 1
(:passthrough "foo") ⇒ "foo"
(:passthrough [a, b, c]) ⇒ [a, b, c]
Simple Templates
Here's a more realistic macro:
(macro price
(a c) // signature
{ amount: (%a), currency: (%c) }) // template
This macro has a signature that declares two parameters named a
and c
. It therefore accepts two arguments when invoked.
(:price 99 USD) ⇒ { amount: 99, currency: USD }
Template expressions that are structs are interpreted almost literally;
the field names are literal, which is why the amount
and currency
field names show up as-is in the expansion, but the field “values” are arbitrary expressions.
We call these almost-literal forms quasi-literals.
The template definition language also treats lists quasi-literally, and every element inside the list is an expression. Here’s a silly macro to illustrate:
(macro two_item_list (a b) [(%a), (%b)])
(:two_item_list foo bar) ⇒ [foo, bar]
E-expressions can accept other e-expressions as arguments. For example:
(:two_item_list (:price 99 USD) foo)
// └──────┬──────┘
// └─── passing another e-expression as an argument
Expansion happens from the "inside out". The outer e-expression receives the results from the expansion of the inner e-expression.
(:two_item_list (:price 99 USD) foo)
// First, the inner invocation of `price` is expanded...
=> (:two_item_list {amount: 99, currency: USD} foo)
// ...and then the outer invocation of `two_item_list` is expanded.
=> [{amount: 99, currency: USD}, foo]
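The inside-out order can be mimicked with a small recursive evaluator. This is a toy model under simplifying assumptions: invocations are written as Python tuples of the form (macro_name, arg, ...), every argument produces exactly one value, and price and two_item_list are plain functions standing in for the two templates above.

```python
def price(a, c):
    return {"amount": a, "currency": c}


def two_item_list(a, b):
    return [a, b]


MACROS = {"price": price, "two_item_list": two_item_list}


def expand(expr):
    """Expand a nested invocation, evaluating arguments before the call."""
    if isinstance(expr, tuple):
        name, *args = expr
        return MACROS[name](*[expand(arg) for arg in args])
    return expr  # anything else is treated as a literal


result = expand(("two_item_list", ("price", 99, "USD"), "foo"))
print(result)  # [{'amount': 99, 'currency': 'USD'}, 'foo']
```

The recursive call on each argument is what makes the inner price invocation finish before two_item_list sees its inputs.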
Invoking Macros from Templates
Templates are able to invoke other macros.
In TDL, an s-expression starting with a .
and an identifier is an operator invocation,
where operators are either macros or special forms, which we'll explore later.
(macro website_url
(path)
(.make_string "https://www.amazon.com/" (%path)))
This macro's template is an s-expression beginning with .make_string
, so it is an invocation of a macro called make_string
.
make_string
is a system macro (a built-in function) which concatenates its arguments to produce a single string.
(:website_url "gp/cart") ⇒ "https://www.amazon.com/gp/cart"
In TDL, it is legal for a macro invocation to appear anywhere that a value could appear.
In this example, an invocation of make_string
is being passed as an argument to an invocation of website_url
.
(macro detail_page_url
(asin)
(.website_url (.make_string "dp/" (%asin))))
(:detail_page_url "B08KTZ8249") ⇒ "https://www.amazon.com/dp/B08KTZ8249"
note
This may not look like much of an improvement, but the full string
"https://www.amazon.com/dp/B08KTZ8249"
takes 38 bytes to encode while the macro invocation
(:detail_page_url "B08KTZ8249")
takes as few as 12 bytes in binary Ion. While text Ion spells out the macro name to be human-friendly, the binary Ion encoding uses the macro's integer address instead. Here's an illustration:
(:1 "B08KTZ8249")
This makes the e-expression both more compact and faster to decode. Readers can also avoid the cost of repeatedly validating the UTF-8 bytes of substrings that are 'baked into' the macro definition.
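The byte counts quoted in the note can be reproduced with simple arithmetic. The overheads used below are assumptions chosen to be consistent with those figures (one opcode byte plus one length byte for the long string; one macro address byte plus one opcode byte for the short string argument); the authoritative layouts are in the binary encoding documentation.

```python
full_url = "https://www.amazon.com/dp/B08KTZ8249"
asin = "B08KTZ8249"

# Assumed overhead: opcode byte + length byte + UTF-8 text.
literal_size = 1 + 1 + len(full_url.encode("utf-8"))

# Assumed overhead: macro address byte + string opcode byte + UTF-8 text.
invocation_size = 1 + 1 + len(asin.encode("utf-8"))

print(literal_size, invocation_size)  # 38 12
```

Under these assumptions the invocation is roughly a third of the size of the literal, and the saving grows with the length of the text baked into the macro definition.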
E-expressions Versus S-expressions
We've now seen two ways to invoke macros, and their difference deserves thorough exploration.
An E-expression is an encoding artifact of a serialized Ion document. It has no intrinsic meaning other than the fact that it represents a macro invocation. The meaning of the document can only be determined by expanding the macro, passing the E-expression's arguments to the function defined by the macro. This all happens as the Ion document is parsed, transparent to the reader of the document. In casual terms, E-expressions are expanded away before the application sees the data.
Within the template definition language (TDL), you can define new macros in terms of other macros, and those invocations are written as S-expressions. Unlike E-expressions, TDL macro invocations are normal Ion data structures, consumed by the Ion system and interpreted as TDL. Further, TDL macro invocations only have meaning in the context of a macro definition, inside an encoding module, while E-expressions can occur anywhere in an Ion document.
These two invocation forms are syntactically aligned in their calling convention, but are distinct in context and "immediacy". E-expressions occur anywhere and are invoked immediately, as they are parsed. S-expression invocations occur only within macro definitions, and are only invoked if and when that code path is ever executed by invocation of the surrounding macro.
Rest Parameters
Sometimes we want a macro to accept an arbitrary number of arguments, in particular all the rest
of them. The make_string
macro is one of those, concatenating all of its arguments into a
single string:
(:make_string) ⇒ ""
(:make_string "a") ⇒ "a"
(:make_string "a" "b") ⇒ "ab"
(:make_string "a" "b" "c") ⇒ "abc"
(:make_string "a" "b" "c" "d") ⇒ "abcd"
To make this work, the declaration of make_string
is effectively:
(macro make_string (parts*) /*...*/)
The *
is a cardinality modifier.
A parameter's cardinality dictates both the number of argument expressions it can accept and the number of values its expansion can produce.
In the examples so far, all parameters have had a cardinality of exactly-one
, which is the default.
The parts
parameter has a cardinality of zero-or-more
, meaning:
- It can accept zero-or-more argument expressions.
- When expanded, it will produce zero-or-more values.
When the final parameter in the macro signature is zero-or-more
, "all of the rest" of the argument expressions will be passed to that parameter.
(:make_string)
// └── 0 argument expressions passed to `parts`
(:make_string "a")
// └┬┘
// └── 1 argument expression passed to `parts`
(:make_string "a" "b" "c" "d")
// └──────┬──────┘
// └── 4 argument expressions passed to `parts`
At this point our distinction between parameters and arguments becomes more apparent, since they are no longer one-to-one: this macro with one parameter can be invoked with one argument, or twenty, or none.
tip
To declare a final parameter that requires at least one rest-argument, use the +
modifier.
Arguments and results are streams
The inputs to and results from a macro are modeled as streams of values. When a macro is invoked, each argument expression produces a stream of values, and within the macro definition, each parameter name refers to the corresponding stream, not to a specific value. The declared cardinality of a parameter constrains the number of elements produced by its stream, and is verified by the macro expansion system.
More generally, the results of all template expressions are streams. While most expressions produce a single value, various macros and special forms can produce zero or more values.
We have everything we need to illustrate this, via another system macro, values
:
(macro values (vals*) (%vals))
(:values 1) ⇒ 1
(:values 1 true null) ⇒ 1 true null
(:values) ⇒ _nothing_
The values
macro accepts any number of arguments and returns their values; it is effectively a multi-value identity function.
We can use this to explore how streams combine in E-expressions.
Splicing in encoded data
At the top level, an e-expression's resulting values become top-level values.
(:values 1 2 3) ⇒ 1 2 3
When an E-expression appears within a list or S-expression, the resulting values are spliced into the surrounding container:
[first, (:values), last] ⇒ [first, last]
[first, (:values "middle"), last] ⇒ [first, "middle", last]
(first (:values left right) last) ⇒ (first left right last)
This also applies wherever a tagged type can appear inside an E-expression:
(first (:values (:values left right) (:values)) last) ⇒ (first left right last)
Note that each argument-expression always maps to one parameter, even when that expression returns too-few or too-many values.
(macro reverse (a b)
[(%b), (%a)])
(:reverse (:values 5 USD)) ⇒ // Error: 'reverse' expects 2 arguments, given 1
(:reverse 5 (:values) USD) ⇒ // Error: 'reverse' expects 2 arguments, given 3
(:reverse (:values 5 6) USD) ⇒ // Error: argument 'a' expects 1 value, given 2
In this example, the parameters expect exactly one argument, producing exactly one value. When the cardinality allows multiple values, then the argument result-streams are concatenated. We saw this (rather subtly) above in the nested use of values, but can also illustrate using the rest-parameter to make_string, which we'll expand here in steps:
(:make_string (:values) a (:values b (:values c) d) e)
// ^^^^^^ next
⇒ (:make_string a (:values b (:values c) d) e)
// ^^^^^^ next
⇒ (:make_string a (:values b c d) e)
// ^^^^^^ next
⇒ (:make_string a b c d e)
⇒ "abcde"
Splicing within sequences is straightforward, but structs are trickier due to their key/value nature. When used in field-value position, each result from a macro is bound to the field-name independently, leading to the field being repeated or even absent:
{ name: (:values) } ⇒ { }
{ name: (:values v) } ⇒ { name: v }
{ name: (:values v ann::w) } ⇒ { name: v, name: ann::w }
An E-expression can even be used in place of a key-value pair, in which case it must return structs, which are merged into the surrounding container:
{ a:1, (:values), z:3 } ⇒ { a:1, z:3 }
{ a:1, (:values {}), z:3 } ⇒ { a:1, z:3 }
{ a:1, (:values {b:2}), z:3 } ⇒ { a:1, b:2, z:3 }
{ a:1, (:values {b:2} {z:3}), z:3 } ⇒ { a:1, b:2, z:3, z:3 }
{ a:1, (:values key "value") } ⇒ // Error: struct expected for splicing into struct
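The two struct positions can be sketched in Python under some simplifying assumptions: a struct is modeled as a list of (name, value) pairs so that field names may repeat, the `Values` class and `expand_struct` function are hypothetical names, and the arguments to `Values` are plain values rather than nested invocations.

```python
class Values:
    """Hypothetical stand-in for a (:values ...) E-expression."""

    def __init__(self, *args):
        self.args = args


def expand_struct(fields):
    """Expand a struct given as a list of (name, value) pairs or bare Values."""
    out = []
    for entry in fields:
        if isinstance(entry, Values):
            # Key-value position: every result must itself be a struct,
            # and its fields are merged into the surrounding struct.
            for result in entry.args:
                if not isinstance(result, list):
                    raise TypeError("struct expected for splicing into struct")
                out.extend(expand_struct(result))
        else:
            name, value = entry
            if isinstance(value, Values):
                # Field-value position: each result is bound to the name
                # independently, so the field may repeat or vanish.
                out.extend((name, v) for v in value.args)
            else:
                out.append((name, value))
    return out


print(expand_struct([("a", 1), Values([("b", 2)], [("z", 3)]), ("z", 3)]))
```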
Splicing in template expressions
The preceding examples demonstrate splicing of E-expressions into encoded data, but similar stream-splicing occurs within the template language, making it trivial to convert a stream to a list:
(macro list_of (vals*) [ (%vals) ])
(macro clumsy_bag (elts*) { '': (%elts) })
(:list_of) ⇒ []
(:clumsy_bag) ⇒ {}
(:list_of 1 2 3) ⇒ [1, 2, 3]
(:clumsy_bag true 2) ⇒ {'':true, '':2}
Mapping templates over streams: for
Another way to produce a stream is via a mapping form. The for special form evaluates a template once for each value provided by a stream or streams. Each time, a local variable is created and bound to the next value on the stream.
(macro price (a c) { amount: (%a), currency: (%c) })
(macro prices (currency amounts*)
(.for
// Binding pairs
[(amt (%amounts))]
//└┬┘ └────┬───┘
// │ └─── stream to map over
// └─────────── variable name
// Template
(.price (%amt) (%currency))
)
)
The first subform of for is a list of binding pairs: S-expressions, each containing a variable name and a series of TDL expressions. Here, that TDL expression series is a single parameter expansion, so each individual value from the amounts stream is bound to the name amt before the price invocation is expanded.
(:prices GBP 10 9.99 12.)
⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP} {amount:12., currency:GBP}
More than one stream can be iterated in parallel, and iteration terminates when any stream becomes empty.
(macro zip (front* back*)
(.for [(f (%front)),
(b (%back))]
[(%f), (%b)]))
(:zip (:values 1 2 3) (:values a b))
⇒ [1, a] [2, b]
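The parallel-iteration rule behaves much like Python's built-in `zip`, which also stops at the shortest input. The sketch below is only an analogy: `tdl_for` and `zip_macro` are hypothetical names, streams are modeled as Python lists, and the template is modeled as a function that returns the values it produces for one iteration.

```python
def tdl_for(bindings, template):
    """Model of the `for` special form.

    bindings: dict mapping a variable name to its stream (an iterable)
    template: function called with the bound variables as keyword
              arguments; returns an iterable of result values
    """
    names = list(bindings)
    # zip stops as soon as any stream is exhausted, matching the rule that
    # iteration terminates when any stream becomes empty.
    for row in zip(*(bindings[name] for name in names)):
        yield from template(**dict(zip(names, row)))


def zip_macro(front, back):
    """Model of the `zip` macro defined above."""
    return list(tdl_for({"f": front, "b": back}, lambda f, b: [[f, b]]))


print(zip_macro([1, 2, 3], ["a", "b"]))
```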
Empty streams: none
The empty stream is an important edge case that requires careful handling and communication.
The built-in macro none accepts no values and produces an empty stream:
(macro list_of (items*) [(%items)])
(:list_of (:none)) ⇒ []
(:list_of 1 (:none) 2) ⇒ [1, 2]
[(:none)] ⇒ []
{a:(:none)} ⇒ {}
When used as a macro argument, a none invocation (like any other expression) counts as one argument:
(:pi (:none)) ⇒ // Error: 'pi' expects 0 arguments, given 1
The none macro is equivalent to an empty expression group ((::)), but unlike an expression group, it is not limited to use as a macro argument. (:none) can appear anywhere an expression can appear.
tip
While (:none) and (:values) both produce the empty stream, the former is preferred for clarity of intent and terminology.
Cardinality
As described earlier, parameters are all streams of values, but the number of values can be controlled by the parameter's cardinality. So far we have seen the default exactly-one and the * (zero-or-more) cardinality modifiers, and in total there are four:
Modifier | Cardinality |
---|---|
! | exactly-one value |
? | zero-or-one value |
+ | one-or-more values |
* | zero-or-more values |
Exactly-One
Many parameters expect exactly one value and thus have exactly-one cardinality. This is the default cardinality, but the ! modifier can be used for clarity.
This cardinality means that the parameter requires a stream producing a single value, so one might refer to them as singleton streams or just singletons colloquially.
Zero-or-One
A parameter with the modifier ? has zero-or-one cardinality, which is much like exactly-one cardinality, except the parameter accepts an empty-stream argument as a way to denote an absent parameter.
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (%scale)
})
Since the scale accepts the empty stream, we can pass it an empty expression group:
(:temperature 96 F) ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283}
Note that the result’s scale field has disappeared because no value was provided. It would be more useful to fill in a default value, which we can achieve with the default system macro:
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (.default (%scale) K)
})
(:temperature 96 F) ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}
To refine things a bit further, trailing arguments that accept the empty stream can be omitted entirely:
(:temperature 283) ⇒ {degrees:283, scale:K}
tip
The default macro is implemented with the help of a special form that can detect the empty stream: if_none.
Zero-or-More
A parameter with the modifier * has zero-or-more cardinality.
(macro prices (amount* currency)
(.for [(amt (%amount))]
(.price (%amt) (%currency))))
When * is on a non-final parameter, we cannot take “all the rest” of the arguments and must use a different calling convention to draw the boundaries of the stream. Instead, we need a single expression that produces the desired values:
(:prices (::) JPY) ⇒ // empty stream
(:prices 54 CAD) ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP) ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}
Here we use a non-empty expression group (:: /*...*/) to delimit the multiple elements of the amount stream.
One-or-More
A parameter with the modifier + has one-or-more cardinality, which works like * except:
- + parameters cannot accept the empty stream.
- When expanded, + parameters must produce at least one value.
To continue using our prices example:
(macro prices (amount+ currency)
(.for [(amt (%amount))]
(.price (%amt) (%currency))))
(:prices (::) JPY) ⇒ // Error: `+` parameter received the empty stream
(:prices 54 CAD) ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP) ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}
On the final parameter, + collects the remaining (one or more) arguments:
(macro thanks (names+)
(.make_string "Thank you to my Patreon supporters:\n"
(.for [(name (%names))]
(.make_string " * " (%name) "\n"))))
(:thanks) ⇒ // Error: at least one value expected for + parameter
(:thanks Larry Curly Moe) =>
'''\
Thank you to my Patreon supporters:
* Larry
* Curly
* Moe
'''
Expression Groups
The non-rest versions of multi-value parameters require some kind of delimiting syntax to contain the applicable sub-expressions. For the tagged-type parameters we have seen so far, you could use :values or some other macro to produce the stream, but that doesn't work for tagless types.
The preferred syntax, supporting all argument types, is a special delimiting form
called an expression group. Here is a macro to illustrate:
(macro prices
(amount* currency)
(.for [(amt (%amount))]
(.price (%amt) (%currency))))
The parameter amount accepts any number of argument expressions.
It's easy to provide exactly one:
(:prices 12.99 GBP) ⇒ {amount:12.99, currency:GBP}
To provide a non-singleton stream of values, use an expression group.
Inside an E-expression, a group starts with (:: and ends with ):
(:prices (::) GBP) ⇒ _void_
(:prices (:: 1) GBP) ⇒ {amount:1, currency:GBP}
(:prices (:: 1 2 3) GBP) ⇒ {amount:1, currency:GBP}
{amount:2, currency:GBP}
{amount:3, currency:GBP}
Within the group, the invocation can have any number of expressions that align with the parameter's encoding. The macro parameter produces the results of those expressions, concatenated into a single stream, and the expander verifies that each value on that stream is accepted by the parameter’s declared encoding.
(:prices (:: 1 (:values 2 3) 4) GBP) ⇒ {amount:1, currency:GBP}
{amount:2, currency:GBP}
{amount:3, currency:GBP}
{amount:4, currency:GBP}
Expression groups may only appear inside macro invocations where the corresponding parameter has ?, *, or + cardinality.
There is no binary opcode for these constructs; the encoding uses a tagless format to keep
things as dense as possible.
As usual, the text format mirrors this constraint.
In TDL, an expression group is denoted using (.. and ). For example:
(macro foo (x*) { foo: (%x) })
(macro bar () (.foo (.. b a r))) // Argument to foo is 3 expressions in an expression group
(:bar) ⇒ { foo: b,
foo: a,
foo: r }
Optional Arguments
When a trailing parameter accepts the empty stream, an invocation can omit its corresponding argument expression, as long as no following parameter is being given an expression. We’ve seen this as applied to final * parameters, but it also applies to ? parameters:
(macro optionals (a* b? c! d* e? f*)
[(%a), (%b), (%c), (%d), (%e), (%f)])
Since d, e, and f all accept the empty stream, they can be omitted by invokers. But c is required so a and b must always be present, at least as an empty group:
(:optionals (::) (::) "value for c") ⇒ ["value for c"]
Now c receives the string "value for c" while the other parameters are all empty. If we want to provide e, then we must also provide a group for d:
(:optionals (::) (::) "value for c" (::) "value for e")
⇒ ["value for c", "value for e"]
Tagless and fixed-width types
In Ion 1.0, the binary encoding of every value starts off with a “type tag”, an opcode that indicates the data-type of the next value and thus the interpretation of the following octets of data. In general, these tags also indicate whether the value has annotations, and whether it’s null.
These tags are necessary because the Ion data model allows values of any type to be used anywhere. Ion documents are not schema-constrained: nothing forces any part of the data to have a specific type or shape. We call Ion “self-describing” precisely because each value self-describes its type via a type tag.
If schema constraints are enforced through some mechanism outside the serializer/deserializer, the type tags are unnecessary and may add up to a non-trivial amount of wasted space. Furthermore, the overhead for each value also includes length information: encoding an octet of data takes two octets on the stream.
Ion 1.1 tries to mitigate this overhead in the binary format by allowing macro parameters to use more-constrained tagless types. These are subtypes of the concrete types, constrained such that type tags are not necessary in the binary form. In general this can shave 4-6 bits off each value, which can add up in aggregate. In the extreme, that octet of data can be encoded with no overhead at all.
The following tagless types are available:
Tagless type | Description |
---|---|
flex_symbol | Tagless symbol (SID or text) |
flex_string | Tagless string |
flex_int | Tagless, variable-width signed int |
flex_uint | Tagless, variable-width unsigned int |
int8 int16 int32 int64 | Fixed-width signed int |
uint8 uint16 uint32 uint64 | Fixed-width unsigned int |
float16 float32 float64 | Fixed-width float |
To define a tagless parameter, just declare one of the primitive types:
(macro point (flex_int::x flex_int::y)
{x: (%x), y: (%y)})
(:point 3 17) ⇒ {x:3, y:17}
The tagless encoding has no real benefit here in text, as primitive types aim to improve the binary encoding.
This density comes at the cost of flexibility. Primitive types cannot be annotated or null, and arguments cannot be expressed using macros, like we’ve done before:
(:point null.int 17) ⇒ // Error: primitive flex_int does not accept nulls
(:point a::3 17) ⇒ // Error: primitive flex_int does not accept annotations
(:point (:values 1) 2) ⇒ // Error: cannot use macro for a primitive argument
While Ion text syntax doesn’t use tags—the types are built into the syntax—these errors ensure that a text E-expression may only express things that can also be expressed using an equivalent binary E-expression.
For the same reasons, supplying a (non-rest) tagless parameter with no value, or with more than one value, can only be expressed by using an expression group.
A subset of the primitive types are fixed-width: they are binary-encoded with no per-value overhead.
(macro byte_array
(uint8::bytes*)
[(%bytes)])
Invocations of this macro are encoded as a sequence of untagged octets, because the macro definition constrains the argument shape such that nothing else is acceptable. A text invocation is written using normal ints:
(:byte_array 0 1 2 3 4 5 6 7 8) ⇒ [0, 1, 2, 3, 4, 5, 6, 7, 8]
(:byte_array 9 -10 11) ⇒ // Error: -10 is not a valid uint8
(:byte_array 256) ⇒ // Error: 256 is not a valid uint8
As above, Ion text doesn’t have syntax specifically denoting “8-bit unsigned integers”, so to keep text and binary capabilities aligned, the parser rejects invocations where an argument value exceeds the range of the binary-only type.
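The range check a text parser performs for fixed-width integer encodings can be sketched in Python. The names `INT_RANGES`, `check_fixed_width`, and `byte_array` are hypothetical and do not correspond to any Ion library API; the ranges are the usual two's-complement and unsigned bounds for each width.

```python
# Inclusive (min, max) ranges for the fixed-width integer encodings.
INT_RANGES = {}
for bits in (8, 16, 32, 64):
    INT_RANGES[f"int{bits}"] = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    INT_RANGES[f"uint{bits}"] = (0, 2 ** bits - 1)


def check_fixed_width(encoding, value):
    """Return value if it fits the named encoding, else raise ValueError."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{value!r} is not an int")
    low, high = INT_RANGES[encoding]
    if not low <= value <= high:
        raise ValueError(f"{value} is not a valid {encoding}")
    return value


def byte_array(*values):
    """Model of the byte_array macro: every argument must be a uint8."""
    return [check_fixed_width("uint8", v) for v in values]


print(byte_array(0, 1, 2, 255))
```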
Primitive types have inherent tradeoffs and require careful consideration, but in the right circumstances the density wins can be significant.
Macro Shapes
We can now introduce the final kind of input constraint, macro-shaped parameters. To understand the motivation, consider modeling a scatter-plot as a list of points:
[{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}, …]
Lists like these exhibit a lot of repetition. Since we already have a point macro, we can eliminate a fair amount:
[(:point 3 17), (:point 395 23), (:point 15 48), (:point 2023 5), …]
This eliminates all the x and y field names, but leaves repeated macro invocations. What we’d like is to eliminate the point calls and just write a stream of pairs, something like:
(:scatterplot (3 17) (395 23) (15 48) (2023 5) …)
We can achieve exactly that with a macro-shaped parameter, in which we use the point macro as an encoding:
(macro scatterplot (point::points*)
// ^^^^^
[(%points)])
point is not one of the built-in encodings, so this is a reference to the macro of that name defined earlier.
(:scatterplot (3 17) (395 23) (15 48) (2023 5))
⇒
[{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}]
Each argument S-expression like (3 17) is implicitly an E-expression invoking the point macro. The argument mirrors the shape of the inner macro, without repeating its name. Further, expansion of the implied point invocations happens automatically, so the overall behavior is just like the preceding variant and the points parameter produces a stream of structs.
The binary encoding of macro-shaped parameters is similarly tagless, eliding any opcodes mentioning point and just writing its arguments with minimal delimiting.
Macro types can be combined with cardinality modifiers, with invocations using groups as needed:
(macro scatterplot
(point::points+ flex_string::x_label flex_string::y_label)
{ points: [(%points)], x_label: (%x_label), y_label: (%y_label) })
(:scatterplot (:: (3 17) (395 23) (15 48) (2023 5)) "hour" "widgets")
⇒
{
points: [{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}],
x_label: "hour",
y_label: "widgets"
}
As with other tagless parameters, you cannot replace a group with a macro invocation, and you cannot use a macro invocation as an element of an expression group:
(:scatterplot (:make_points 3 17 395 23 15 48 2023 5) "hour" "widgets")
⇒ // Error: Expression group expected, found :make_points
(:scatterplot (:: (3 17) (:make_points 395 23 15 48) (2023 5)) "hour" "widgets")
⇒ // Error: sexp expected with args for 'point', found :make_points
(:scatterplot (:: (3 17) (:point 395 23) (15 48) (2023 5)) "hour" "widgets")
⇒ // Error: sexp expected with args for 'point', found :point
This limitation mirrors the binary encoding, where both the expression group and the individual macro invocations are tagless and there's no way to express a macro invocation.
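A rough Python analogy for a macro-shaped parameter: the caller supplies bare argument tuples, and the outer macro applies the inner macro to each one. The functions `point` and `scatterplot` below are hypothetical models of the macros in this section, with structs modeled as dicts.

```python
def point(x, y):
    """Model of the point macro: two ints in, one struct out."""
    return {"x": x, "y": y}


def scatterplot(points, x_label, y_label):
    """Model of scatterplot with a point-shaped, one-or-more parameter.

    points is a sequence of tuples shaped like point's signature.
    """
    expanded = []
    for args in points:
        if not isinstance(args, tuple):
            # Only bare argument lists are accepted; there is no way to
            # pass an already-expanded value or another invocation.
            raise TypeError("sexp expected with args for 'point'")
        expanded.append(point(*args))
    if not expanded:
        raise ValueError("`+` parameter received the empty stream")
    return {"points": expanded, "x_label": x_label, "y_label": y_label}


print(scatterplot([(3, 17), (395, 23)], "hour", "widgets"))
```

The caller never names `point`; the shape of each tuple is implied by the outer signature, which is what lets the binary encoding omit the corresponding opcodes.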
tip
The primary goal of macro-shaped arguments, and tagless types in general, is to increase density by tightly constraining the inputs.
Defining macros
A macro is defined using a macro clause within a module's macro_table clause.
Syntax
(macro name signature template)
Argument | Description |
---|---|
name | A unique name assigned to the macro. When constructing an anonymous macro, this argument is omitted. |
signature | An s-expression enumerating the parameters this macro accepts. |
template | A template definition language (TDL) expression that can be evaluated to produce zero or more Ion values. |
Example macro clause
// ┌─── name
// │ ┌─── signature
// ┌┴┐ ┌──┴──┐
(macro foo (x y z)
{ // ─┐
x: (%x), // │
y: (%y), // ├─ template
z: (%z), // │
} // ─┘
)
Macro names
Syntactically, macro names are identifiers. Each macro name in a macro table must be unique.
In some circumstances, it may not make sense to name a macro. (For example, when the macro is generated automatically.) In such cases, authors may omit the macro name to indicate that the macro does not have a name. Anonymous macros can only be referenced by their address in the macro table.
Macro parameters
A parameter is a named stream of Ion values. The stream's contents are determined by the macro's invocation. A macro's parameters are declared in the macro signature.
Each parameter declaration has three elements:
- A name
- An optional encoding
- An optional cardinality
Example parameter declaration
// ┌─── encoding
// │ ┌─── name
// │ │┌─── cardinality
// ┌───┴───┐ ││
flex_uint::x*
Parameter names
A parameter's name is an identifier. The name is required; any non-identifier (including null, quoted symbols, $0, or a non-symbol) found in parameter-name position will cause the reader to raise an error.
All of a macro's parameters must have unique names.
Parameter encodings
In binary Ion, the default encoding for all parameters is tagged. Each argument passed into the macro from the callsite is prefixed by an opcode (or "tag") that indicates the argument's type and length.
Parameters may choose to specify an alternative encoding to make the corresponding arguments' binary representation more compact and/or fixed width. These "tagless" encodings do not begin with an opcode, an arrangement which saves space but also limits the domain of values they can each represent. Arguments passed to tagless parameters cannot be null, cannot be annotated, and may have additional range restrictions.
When writing text Ion, the declared encoding does not affect how values are serialized. However, it does constrain the domain of values that the parameter will accept. When transcribing from text to binary, it must be possible to serialize all values passed as an argument using the parameter's declared encoding. This means that arguments passed to parameters with a primitive encoding cannot be annotated and cannot be a null of any type. If an int or a float is being passed to a parameter with a fixed-width encoding, that value must fit within the range of values that can be represented by that width. For example, the value 256 cannot be passed to a parameter with an encoding of uint8 because a uint8 can only represent values in the range [0, 255].
To specify an encoding, the parameter name is annotated with a primitive encoding or a macro reference. Encoding types may be qualified with their module names for disambiguation when there is more than one macro with the given name that is in scope, or when a macro name shadows a primitive encoding.
Primitive encodings
The following primitive encodings are provided by the system module.
Tagless encodings | Description |
---|---|
flex_int | Variable-width, signed int |
flex_uint | Variable-width, unsigned int |
int8 int16 int32 int64 | Fixed-width, signed int |
uint8 uint16 uint32 uint64 | Fixed-width, unsigned int |
float16 float32 float64 | Fixed-width float |
flex_symbol | FlexSym-encoded SID or text |
flex_string | Variable-width string |
Parameter cardinalities
A parameter name may optionally be followed by a cardinality modifier. This is a sigil that indicates how many values the parameter expects the corresponding argument expression to produce when it is evaluated.
Modifier | Cardinality |
---|---|
? | zero-or-one value |
* | zero-or-more values |
! | exactly-one value |
+ | one-or-more values |
If no modifier is specified, the parameter's cardinality will default to exactly-one.
An exactly-one parameter will always expand to a stream containing a single value.
Parameters with a cardinality other than exactly-one are called variadic parameters.
If an argument expression expands to a number of values that the cardinality forbids, the reader must raise an error.
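The cardinality rule amounts to a bounds check on the length of the expanded stream. The following Python sketch illustrates it; `CARDINALITY_BOUNDS` and `check_cardinality` are hypothetical names, and the error text is illustrative rather than mandated by this specification.

```python
# (minimum, maximum) number of values per modifier; None means unbounded.
CARDINALITY_BOUNDS = {
    "!": (1, 1),     # exactly-one
    "?": (0, 1),     # zero-or-one
    "+": (1, None),  # one-or-more
    "*": (0, None),  # zero-or-more
}


def check_cardinality(name, modifier, stream):
    """Return the expanded values, or raise if the cardinality forbids them."""
    modifier = modifier or "!"  # no modifier defaults to exactly-one
    values = list(stream)
    low, high = CARDINALITY_BOUNDS[modifier]
    count = len(values)
    if count < low or (high is not None and count > high):
        raise ValueError(
            f"parameter '{name}' ({modifier}) received {count} value(s)"
        )
    return values


print(check_cardinality("x", "*", [1, 2, 3]))
```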
When a parameter has a cardinality of zero-or-more or one-or-more, the arguments for that parameter are eligible to use rest argument syntax in the Ion text encoding.
Optional parameters
Parameters with a cardinality that can accept an empty expression group as an argument (? and *) are called optional parameters. In text Ion, their corresponding arguments can be elided from e-expressions and TDL macro invocations when they appear in tail position. When an argument is elided, it is treated as though an explicit empty group (::) had been passed in its place.
In contrast, parameters with a cardinality that cannot accept an empty group (! and +) are called required parameters. Required parameters can never be elided.
(:set_macros
(foo (x y? z*) // `x` is required, `y` and `z` are optional
[(%x), (%y), (%z)]
)
)
// `z` is a populated expression group
(:foo 1 2 (:: 3 4 5)) => [1, 2, 3, 4, 5]
// `z` is an empty expression group
(:foo 1 2 (::)) => [1, 2]
// `z` has been elided
(:foo 1 2) => [1, 2]
// `y` and `z` have been elided
(:foo 1) => [1]
// `x` cannot be elided
(:foo) => ERROR: missing required argument `x`
Optional parameters that are not in tail position cannot be elided, as this would cause them to appear in a position corresponding to a different argument.
(:set_macros
(foo (x? y) // `x` is optional, `y` is required
[(%x), (%y)]
)
)
(:foo (::) 1) => [(::), 1] => [1]
(:foo 1) => ERROR: missing required argument `y`
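The elision rule can be sketched as a positional binder that pads missing tail arguments with the empty group. This Python sketch makes simplifying assumptions: each argument is exactly one expression (rest-argument syntax is not modeled), and `EMPTY_GROUP` and `bind_arguments` are hypothetical names.

```python
EMPTY_GROUP = ()  # stands in for the empty expression group (::)


def bind_arguments(signature, args):
    """Bind positional argument expressions to parameters.

    signature: list of (name, modifier) pairs, modifier in "!", "?", "*", "+"
    args: list of argument expressions, one per parameter supplied
    """
    if len(args) > len(signature):
        raise ValueError("too many arguments")
    bound = {}
    for index, (name, modifier) in enumerate(signature):
        if index < len(args):
            bound[name] = args[index]
        elif modifier in ("?", "*"):
            # An elided tail argument behaves like an explicit empty group.
            bound[name] = EMPTY_GROUP
        else:
            raise ValueError(f"missing required argument `{name}`")
    return bound


print(bind_arguments([("x", "!"), ("y", "?"), ("z", "*")], [1]))
```

Because binding is purely positional, a single argument passed to a signature such as (x? y) is bound to x, leaving the required y unfilled.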
Macro signatures
A macro's signature is the ordered sequence of parameters which an invocation of that macro must define. Syntactically, the signature is an s-expression of parameter declarations.
Example macro signature
(w flex_uint::x* float16::y? z+)
Name | Encoding | Cardinality |
---|---|---|
w | tagged | exactly-one |
x | flex_uint | zero-or-more |
y | float16 | zero-or-one |
z | tagged | one-or-more |
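The table above can be derived mechanically from the text of each declaration. The Python sketch below does so with plain string handling; `parse_parameter` is a hypothetical helper, and its identifier pattern is a deliberate simplification (ASCII letters, digits, and underscores only) rather than the full Ion identifier grammar.

```python
import re

# Simplified identifier pattern; an assumption made for this sketch.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")
CARDINALITIES = {
    "!": "exactly-one",
    "?": "zero-or-one",
    "*": "zero-or-more",
    "+": "one-or-more",
}


def parse_parameter(decl):
    """Split a declaration such as 'flex_uint::x*' into its three parts."""
    modifier = "!"
    if decl[-1:] in CARDINALITIES:
        modifier, decl = decl[-1], decl[:-1]
    # Everything before the last '::' is the encoding annotation.
    encoding, sep, name = decl.rpartition("::")
    if not IDENTIFIER.match(name):
        raise ValueError(f"invalid parameter name: {name!r}")
    return {
        "name": name,
        "encoding": encoding if sep else "tagged",
        "cardinality": CARDINALITIES[modifier],
    }


for decl in "w flex_uint::x* float16::y? z+".split():
    print(parse_parameter(decl))
```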
Template definition language (TDL)
The macro's template is a single Ion value that defines how a reader should expand invocations of the macro. Ion 1.1 introduces a template definition language (TDL) to express this process in terms of the macro's parameters. TDL is a small language with only a few constructs.
A TDL expression can be any of the following:
- A literal Ion scalar
- A macro invocation
- A variable expansion
- A quasi-literal Ion container
- A special form
In terms of its encoding, TDL is "just Ion." As you shall see in the following sections, the constructs it introduces are written as s-expressions with a distinguishing leading value or values.
A grammar for TDL can be found in the Grammar chapter.
Ion scalars
Ion scalars are interpreted literally. These include values of any type except list, sexp, and struct. null values of any type (even null.list, null.sexp, and null.struct) are also interpreted literally.
Examples
These macros are constants; they take no parameters. When they are invoked, they expand to a stream of a single value: the Ion scalar acting as the template expression.
$ion::
(module _
(macro_table
(macro greeting () "hello")
(macro birthday () 1996-10-11)
// Annotations are also literal
(macro price () USD::29.95)
)
)
(:greeting) => "hello"
(:birthday) => 1996-10-11
(:price) => USD::29.95
Macro invocations
Macro invocations call an existing macro. The invoked macro could be a system macro, a macro imported from a shared module, or a macro previously defined in the current scope.
Syntactically, a macro invocation is an s-expression whose first value is the operator . and whose second value is a macro reference.
Grammar
macro-invocation ::= '(.' macro-ref macro-arg* ')'
macro-ref ::= (module-name '::')? (macro-name | macro-address)
macro-arg ::= expression | expression-group
macro-name ::= ion-identifier
macro-address ::= unsigned-ion-integer
expression-group ::= '(..' expression* ')'
Invocation syntax illustration
// Invoking a macro defined in the same module by name.
(.macro_name arg1 arg2 /*...*/ argN)
// Invoking a macro defined in another module by name.
(.module_name::macro_name arg1 arg2 /*...*/ argN)
// Invoking a macro defined in the same module by its address.
(.0 arg1 arg2 /*...*/ argN)
// Invoking a macro defined in a different module by its address.
(.module_name::0 arg1 arg2 /*...*/ argN)
// Passing more than one argument expression for a single parameter using an expression group
(.macro_name (.. expr1 expr2 /*...*/ exprN) )
Examples
$ion::
(module _
(macro_table
// Calls the system macro `values`, allowing it to produce a stream of three values.
(macro nephews () (.values Huey Dewey Louie))
// Calls a macro previously defined in this module, splicing its result
// stream into a list.
(macro list_of_nephews () [(.nephews)])
)
)
(:nephews) => Huey Dewey Louie
(:list_of_nephews) => [Huey, Dewey, Louie]
important
There are no forward references in TDL. If a macro definition includes an invocation of a name or address that is not already valid, the reader must raise an error.
$ion::
(module _
(macro_table
(macro list_of_nephews () [(.nephews)])
// ^^^^^^^^
// ERROR: Calls a macro that has not yet been defined in this module.
(macro nephews () (.values Huey Dewey Louie))
)
)
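One way to picture the no-forward-references rule is a single pass over the macro table that resolves each invocation against the names defined so far. In this Python sketch, templates are modeled as nested Python lists in which an invocation is a tuple whose first element is "."; `invoked_names` and `build_macro_table` are hypothetical names, and the set of system macros is abbreviated for illustration.

```python
def invoked_names(template):
    """Yield every macro name invoked anywhere inside a template."""
    if isinstance(template, tuple) and template[:1] == (".",):
        yield template[1]
        for arg in template[2:]:
            yield from invoked_names(arg)
    elif isinstance(template, (list, tuple)):
        for item in template:
            yield from invoked_names(item)


def build_macro_table(definitions, system_macros=("values", "make_string")):
    """definitions: ordered list of (name, template) pairs."""
    known = set(system_macros)
    for name, template in definitions:
        for ref in invoked_names(template):
            if ref not in known:
                raise LookupError(
                    f"macro `{name}` invokes undefined macro `{ref}`"
                )
        # A macro becomes visible only after its own definition is checked.
        known.add(name)
    return known


nephews = ("nephews", (".", "values", "Huey", "Dewey", "Louie"))
list_of_nephews = ("list_of_nephews", [(".", "nephews")])
print(sorted(build_macro_table([nephews, list_of_nephews])))
```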
Variable expansion
Templates can insert the contents of a macro parameter into their output by using a variable expansion, an s-expression whose first value is the operator % and whose second and final value is the variable name of the parameter to expand.
If the variable name does not match one of the declared macro parameters, the implementation must raise an error.
Grammar
variable-expansion ::= '(%' variable-name ')'
variable-name ::= ion-identifier
Examples
$ion::
(module _
(macro_table
// Produces a stream that repeats the content of parameter `x` twice.
(macro twice (x*) (.values (%x) (%x)))
)
)
(:twice foo) => foo foo
(:twice "hello") => "hello" "hello"
(:twice 1 2 3) => 1 2 3 1 2 3
Quasi-literal Ion containers
When an Ion container appears in a template definition, it is interpreted almost literally.
Each nested value in the container is inspected.
- If the value is an Ion scalar, it is added to the output as-is.
- If the value is a variable expansion, the stream bound to that variable name is added to the output. The variable expansion literal (for example: (%name)) is discarded.
- If the value is a macro invocation, the invocation is evaluated and the resulting stream is added to the output. The macro invocation literal (for example: (.name 1 2 3)) is discarded.
- If the value is a container, the reader will recurse into the container and repeat this process.
Expansion within a sequence
When the container is a list or s-expression, the values in the nested expression's expansion are spliced into the sequence at the site of the expression. If the expansion was empty, no values are spliced into the container.
$ion::
(module _
(macro_table
(macro bookend_list (x y*) [(%x), (%y), (%x)])
(macro bookend_sexp (x y*) ((%x) (%y) (%x)))
)
)
(:bookend_list ! a b c) => ['!', a, b, c, '!']
(:bookend_sexp ! a b c) => (! a b c !)
(:bookend_sexp !) => (! !)
Expansion within a struct
When the container is a struct, the expansion of each field value is paired with the corresponding field name. If the expansion produces a single value, a single field with that name will be spliced into the parent struct. If the expansion produces multiple values, a field with that name will be created for each value and spliced into the parent struct. If the expansion was empty, no fields are spliced into the parent struct.
Examples
$ion::
(module _
(macro_table
(macro resident (id names*)
{
town: "Riverside",
id: (.make_string "123-" (%id)),
name: (%names)
}
)
)
)
(:resident "abc" "Alice") =>
{
town: "Riverside",
id: "123-abc",
name: "Alice"
}
(:resident "def" "John" "Jacob" "Jingleheimer" "Schmidt") =>
{
town: "Riverside",
id: "123-def",
name: "John",
name: "Jacob",
name: "Jingleheimer",
name: "Schmidt",
}
(:resident "ghi") =>
{
town: "Riverside",
id: "123-ghi",
}
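The quasi-literal rules for lists and structs can be condensed into a small recursive evaluator. This Python sketch is illustrative only: `Var` and `expand` are hypothetical names, template lists and structs are modeled as Python lists and dicts, macro invocations are not modeled, and an expanded struct is returned as a list of (name, value) pairs so that a field name can repeat.

```python
class Var:
    """Stand-in for a variable expansion such as (%x)."""

    def __init__(self, name):
        self.name = name


def expand(template, env):
    """Yield the values produced by a template.

    env maps each parameter name to the list of values in its stream.
    """
    if isinstance(template, Var):
        # The stream bound to the variable replaces the expansion itself.
        yield from env[template.name]
    elif isinstance(template, list):
        # Sequence: splice each nested expansion in place.
        yield [v for item in template for v in expand(item, env)]
    elif isinstance(template, dict):
        # Struct: pair the field name with every value of the expansion.
        yield [
            (name, v)
            for name, value in template.items()
            for v in expand(value, env)
        ]
    else:
        # Scalars are literal.
        yield template


resident = {"town": "Riverside", "name": Var("names")}
print(list(expand(resident, {"names": ["John", "Jacob"]})))
```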
Special forms
special-form ::= '(.' ('$ion::')? special-form-name expression* ')'
special-form-name ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi' | 'parse_ion' | 'literal'
Special forms are similar to macro invocations, but they have their own expansion rules. See Special forms for the list of special forms and a description of each.
Special Forms
When a TDL expression is syntactically an S-expression whose first element is the . symbol, the next element must be a symbol that matches either one of the keywords denoting the special forms or the name of a previously-defined macro.
The interpretation of the S-expression’s remaining elements depends on how the symbol resolves.
In the case of macro invocations, the elements following the operator are arbitrary TDL expressions, but for special
forms that is not always the case.
Special forms are "special" precisely because they cannot be expressed as macros and must therefore receive bespoke syntactic treatment. Since the elements of macro-invocation expressions are themselves expressions, when you want something to not be evaluated that way, it must be a special form. Argument expressions are passed to the special form without interpretation, and each special form has custom logic for interpreting its arguments.
// The argument being passed in is the _expansion_ of the foo macro
(.regular_macro (.foo))
// This argument being passed in is literally `( '.' 'foo' )`.
(.special_form (.foo))
These special forms are part of the template language itself, and most are not addressable outside TDL; the E-expression (:if_none foo bar baz) must necessarily refer to some user-defined macro named if_none, not to the special form of the same name. The only exception is parse_ion, which is explicitly included in the system macro table.
literal
(literal (values*) /* Not representable in TDL */)
The literal form is an identity function that accepts its arguments as literal values and then produces them without any evaluation.
Both literal and values are identity functions, but they differ in regard to how their arguments are interpreted:
// When the arguments are values, literal produces the same result as values
(.literal 1 2 3) ⇒ 1 2 3
(.values 1 2 3) ⇒ 1 2 3
// When the arguments are TDL macros or special forms, literal produces different results than values
(.literal (.make_string "a" "b")) ⇒ (.make_string "a" "b")
(.values (.make_string "a" "b")) ⇒ "ab"
// When the arguments are TDL expression groups, literal produces different results than values
(.literal (.. true false)) ⇒ ( .. true false)
(.values (.. true false)) ⇒ true false
// When the arguments are TDL variable expansions, literal produces different results than values
// Assuming that the variable x is bound to "Hello"
(.literal (%x)) ⇒ ( % x )
(.values (%x)) ⇒ "Hello"
if_none
The if_none
special form accepts three arguments—stream
, true_branch
, and false_branch
—each of which may be a single value or a stream of zero-to-many values.
The if_none
form is if/then/else syntax testing stream emptiness.
It has three sub-expressions, the first being a stream to check.
If and only if that stream is empty (it produces no values), the second sub-expression is expanded.
Otherwise, the third sub-expression is expanded.
The expanded second or third sub-expression becomes the result that is produced by if_none
.
note
Exactly one branch is expanded, because otherwise the empty stream
might be used in a context that requires a value, resulting in an errant expansion error.
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (.if_none (%scale) K (%scale)),
})
(:temperature 96 F) ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}
To refine things a bit further, trailing optional arguments can be omitted entirely:
(:temperature 283) ⇒ {degrees:283, scale:K}
tip
If you're using if_none
to specify an expression to default to, you can use the default
system macro to be more concise.
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (.default (%scale) K),
}
)
if_some
The if_some
special form accepts three arguments—stream
, true_branch
, and false_branch
—each of which may be a single value or a stream of zero-to-many values.
If stream
evaluates to one or more values, it produces true_branch
. Otherwise, it produces false_branch
.
Exactly one of true_branch
and false_branch
is evaluated.
The stream
expression must be expanded enough to determine whether it produces any values, but implementations are not required to fully expand the expression.
Example:
(macro foo (x)
{
foo: (.if_some (%x) [(%x)] null)
})
(:foo (::)) => { foo: null }
(:foo 2) => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }
The false_branch
parameter may be elided, allowing if_some
to serve as a map-if-not-none function.
Example:
(macro foo (x)
{
foo: (.if_some (%x) [(%x)])
})
(:foo (::)) => { }
(:foo 2) => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }
if_single
The if_single
special form accepts three arguments—stream
, true_branch
, and false_branch
—each of which may be a single value or a stream of zero-to-many values.
If stream
evaluates to exactly one value, if_single
produces the expansion of true_branch
. Otherwise, it produces the expansion of false_branch
.
Exactly one of true_branch
and false_branch
is evaluated.
The stream
argument must be expanded enough to determine whether it produces exactly one value, but implementations are not required to fully expand the expression.
Example:
(macro foo (x)
{
foo: (.if_single (%x) (%x) [(%x)])
})
(:foo (::)) => { foo: [] }
(:foo 2) => { foo: 2 }
(:foo (:: 2 3)) => { foo: [2, 3] }
if_multi
The if_multi
special form accepts three arguments—stream
, true_branch
, and false_branch
—each of which may be a single value or a stream of zero-to-many values.
If stream
evaluates to more than one value, it produces true_branch
. Otherwise, it produces false_branch
.
Exactly one of true_branch
and false_branch
is evaluated.
The stream
argument must be expanded enough to determine whether it produces more than one value, but implementations are not required to fully expand the expression.
Example:
(macro foo (x)
{
foo: (.if_multi (%x) "many" "zero or one")
})
(:foo (::)) => { foo: "zero or one" }
(:foo 2) => { foo: "zero or one" }
(:foo (:: 2 3)) => { foo: "many" }
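The four conditional forms differ only in the cardinality test applied to stream. The following Python sketch is a rough model, not part of the specification; `count_up_to_two`, `TESTS`, and `if_form` are hypothetical names, and streams are assumed to be re-iterable Python sequences. Branches are passed as zero-argument callables so that only the selected branch is ever expanded, and the stream is consumed only far enough to decide.

```python
from itertools import islice
from typing import Callable, Iterable


def count_up_to_two(stream: Iterable) -> int:
    """Return 0, 1, or 2 (meaning 'two or more') without fully expanding stream."""
    return len(list(islice(iter(stream), 2)))


# Cardinality predicate for each special form (hypothetical table).
TESTS = {
    "if_none": lambda n: n == 0,
    "if_some": lambda n: n >= 1,
    "if_single": lambda n: n == 1,
    "if_multi": lambda n: n >= 2,
}


def if_form(name: str, stream: Iterable,
            true_branch: Callable[[], Iterable],
            false_branch: Callable[[], Iterable] = lambda: ()) -> list:
    """Expand exactly one branch; the other callable is never invoked."""
    chosen = true_branch if TESTS[name](count_up_to_two(stream)) else false_branch
    return list(chosen())
```

Omitting `false_branch` models the elided-argument case: the form then produces the empty stream whenever the test fails.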
for
The for
special form maps one or more streams to an output stream.
It accepts two arguments—stream_bindings
and template
.
stream_bindings
is a list or s-expression containing one or more s-expressions of the form (name expr0 expr1 ... exprN)
.
The first value is a symbol to act as a variable name.
The remaining expressions in the s-expression will be expanded and concatenated into a single stream; for each value in the stream, the for
expansion will produce a copy of the template
argument expression with any appearance of the variable replaced by the value.
For example:
(.for
[(word // Variable name
foo bar baz)] // Values over which to iterate
(.values (%word) (%word))) // Template expression; `(%word)` will be replaced
=>
foo foo bar bar baz baz
Multiple s-expressions can be specified. The streams will be iterated over in lockstep.
(.for
((x 1 2 3) // for x in...
(y 4 5 6)) // for y in...
((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 4)
(2 5)
(3 6)
Iteration will end when the shortest stream is exhausted.
(.for
[(x 1 2), // for x in...
(y 3 4 5)] // for y in...
((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 3)
(2 4)
// no more output, `x` is exhausted
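The lockstep behavior can be modeled in Python with `zip`, which also stops at the shortest input. This is only an illustrative sketch: `for_form` is a hypothetical name, bindings are assumed to be a mapping from variable name to an already-expanded stream, and the template is assumed to be a function that returns the values produced for one iteration.

```python
from typing import Callable, Dict, Iterable, List


def for_form(bindings: Dict[str, Iterable],
             template: Callable[..., Iterable]) -> List:
    """Expand `template` once per lockstep step across all bound streams."""
    names = list(bindings)
    out = []
    # zip() ends as soon as the shortest stream is exhausted.
    for values in zip(*(bindings[name] for name in names)):
        out.extend(template(**dict(zip(names, values))))
    return out
```

For example, `for_form({"x": [1, 2], "y": [3, 4, 5]}, lambda x, y: [(x, y)])` yields the pairs `(1, 3)` and `(2, 4)`, mirroring the example above.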
Names defined inside a for
shadow names in the parent scope.
(macro triple (x)
// └─── Parameter `x` is declared here...
(.for
// ...but the `for` expression introduces a
// ┌─── new variable of the same name here.
((x a b c))
(%x)
// └─── This refers to the `for` expression's `x`, not the parameter.
)
)
(:triple 1) // Argument `1` is ignored
=>
a b c
The for
special form can only be invoked in the body of a template macro. It is not valid to use as an E-expression.
parse_ion
Ion documents may be embedded in other Ion documents using the parse_ion
form.
The parse_ion
form accepts a single argument that must be a literal string or blob.
It constructs a stream of values by parsing its argument as a single, self-contained Ion document.
The argument must be a literal value because macros are not allowed to contain recursive calls, and composing an embedded document from multiple expressions would make it possible to implement recursion in the macro system.
The data argument is evaluated in a clean environment that cannot read anything from the parent document. Allowing context to leak from the outer scope into the document being parsed would also enable recursion.
All values produced by the expansion of parse_ion
are application values.
(i.e. it is as if they are all annotated with $ion_literal
.)
The IVM at the beginning of an Ion data stream is sufficient to identify whether it is text or binary, so text Ion can be embedded as a blob containing the UTF-8 encoded text.
Embedded text example:
(:parse_ion
'''
$ion_1_1
$ion::(module _ (symbol_table ["foo", "bar"]))
$1 $2
'''
)
=> foo bar
Embedded binary example:
(:parse_ion {{ 4AEB6qNmb2+jYmFy }} )
=> foo bar
The parse_ion
form has an address in the system macro table, making it
the only special form that can be invoked as an e-expression.
For normative examples, see parse_ion
in the Ion conformance test suite.
System Macros
Many of the system macros MAY be defined as template macros, and when possible, the specification includes a template. Templates are given here as normative examples, but system macros are not required to be implemented as template macros.
The macros that can be defined as templates are included as system macros because of their broad applicability, and
so that Ion implementations can provide optimizations for these macros that run directly in the implementations' runtime
environments rather than in the macro evaluator.
For example, a macro such as add_symbols
does not produce user values, so an Ion Reader could bypass
evaluating the template and directly update the encoding context with the new symbols.
Stream Constructors
none
(macro none () (.values))
none
accepts no values and produces nothing (an empty stream).
For normative examples, see none
in the Ion conformance test suite.
values
(macro values (v*) (%v))
This is, essentially, the identity function. It produces a stream from any number of arguments, concatenating the streams produced by the nested expressions. Used to aggregate multiple values or sub-streams to pass to a single argument, or to produce multiple results.
For normative examples, see values
in the Ion conformance test suite.
default
(macro default (expr* default_expr*)
// If `expr` is empty...
(.if_none (%expr)
// then expand `default_expr` instead.
(%default_expr)
// If it wasn't empty, then expand `expr`.
(%expr)
)
)
default
tests expr
to determine whether it expands to the empty stream.
If it does not, default
will produce the expansion of expr
.
If it does, default
will produce the expansion of default_expr
instead.
For normative examples, see default
in the Ion conformance test suite.
flatten
(macro flatten (sequence*) /* Not representable in TDL */)
The flatten
system macro constructs a stream from the content of one or more sequences.
Produces a stream with the contents of all the sequence
values.
Any annotations on the sequence
values are discarded.
Any non-sequence arguments will raise an error.
Any null arguments will be ignored.
Examples:
(:flatten [a, b, c] (d e f)) => a b c d e f
(:flatten [[], null.list] foo::()) => [] null.list
The flatten
macro can also be used to splice the content of one list or s-expression into another list or s-expression.
[1, 2, (:flatten [a, b]), 3, 4] => [1, 2, a, b, 3, 4]
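The rules above can be sketched in Python as follows. This is an illustrative model only: Ion lists are represented as Python lists, s-expressions as tuples, and a null argument as `None`; annotations are not modeled, and the function name simply mirrors the macro.

```python
def flatten(*sequences):
    """Concatenate the contents of list/sexp arguments into one stream."""
    out = []
    for seq in sequences:
        if seq is None:
            # A null argument is ignored.
            continue
        if not isinstance(seq, (list, tuple)):
            # Any non-sequence argument is an error.
            raise TypeError(f"flatten expects a list or sexp, got {seq!r}")
        # Only one level is unwrapped; nested sequences are produced as-is.
        out.extend(seq)
    return out
```

Note that a null nested inside a sequence is an ordinary element and is produced unchanged, which matches the second example above.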
For normative examples, see flatten
in the Ion conformance test suite.
parse_ion
parse_ion
is a special form because (unlike macros) its argument must
specifically be a literal value. However, because of its usefulness for embedding an Ion stream in another
Ion stream, it has an address in the system macro table.
Value Constructors
annotate
(macro annotate (ann* value) /* Not representable in TDL */)
Produces the value
prefixed with the annotations ann
s [1].
Each ann
must be a non-null, unannotated string or symbol.
(:annotate (:: "a2") a1::true) => a2::a1::true
For normative examples, see annotate
in the Ion conformance test suite.
make_string
(macro make_string (content*) /* Not representable in TDL */)
Produces a non-null, unannotated string containing the concatenated content produced by the arguments. Nulls (of any type) are forbidden. Any annotations on the arguments are discarded.
For normative examples, see make_string
in the Ion conformance test suite.
make_symbol
(macro make_symbol (content*) /* Not representable in TDL */)
Like make_string
but produces a symbol.
For normative examples, see make_symbol
in the Ion conformance test suite.
make_blob
(macro make_blob (lobs*) /* Not representable in TDL */)
Like make_string
but accepts lobs and produces a blob.
For normative examples, see make_blob
in the Ion conformance test suite.
make_list
(macro make_list (sequences*) [ (.flatten (%sequences)) ])
Produces a non-null, unannotated list by concatenating the content of any number of non-null list or sexp inputs.
(:make_list) => []
(:make_list (1 2)) => [1, 2]
(:make_list (1 2) [3, 4]) => [1, 2, 3, 4]
(:make_list ((1 2)) [[3, 4]]) => [(1 2), [3, 4]]
For normative examples, see make_list
in the Ion conformance test suite.
make_sexp
(macro make_sexp (sequences*) ( (.flatten (%sequences)) ))
Like make_list
but produces a sexp.
(:make_sexp) => ()
(:make_sexp (1 2)) => (1 2)
(:make_sexp (1 2) [3, 4]) => (1 2 3 4)
(:make_sexp ((1 2)) [[3, 4]]) => ((1 2) [3, 4])
For normative examples, see make_sexp
in the Ion conformance test suite.
make_struct
(macro make_struct (structs*) /* Not representable in TDL */)
Produces a non-null, unannotated struct by combining the fields of any number of non-null structs.
(:make_struct) => {}
(:make_struct
{k1: 1, k2: 2}
{k3: 3}
{k4: 4}) => {k1:1, k2:2, k3:3, k4:4}
For normative examples, see make_struct
in the Ion conformance test suite.
make_field
(macro make_field (field_name value) /* Not representable in TDL */)
Produces a non-null, unannotated, single-field struct using the given field name and value.
The field_name
parameter may be (or evaluate to) any non-null text value, and the value
parameter may be (or evaluate to) any single value.
This can be used to dynamically construct field names based on macro parameters.
Example:
(macro foo_struct (extra_name extra_value)
(.make_struct
{
foo_a: 1,
foo_b: 2,
}
(.make_field (.make_string "foo_" (%extra_name)) (%extra_value))
))
Then:
(:foo_struct c 3) => { foo_a: 1, foo_b: 2, foo_c: 3 }
For normative examples, see make_field
in the Ion conformance test suite.
make_decimal
(macro make_decimal (coefficient exponent) /* Not representable in TDL */)
This is no more compact than the regular binary encoding for decimals. However, it can be used in conjunction with other macros, for example, to represent fixed-point numbers.
Both coefficient
and exponent
must be (or evaluate to) a single integer value.
(macro usd (cents) (.annotate USD (.make_decimal (%cents) -2)))
(:usd 199) => USD::1.99
note
It is not possible to use make_decimal
to construct any negative zero value because Ion integers do not have signed zero.
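As an informal illustration, the coefficient/exponent construction can be modeled with Python's `decimal` module. The function name mirrors the macro, but this is only a sketch and says nothing about how an Ion implementation represents decimals internally.

```python
from decimal import Decimal


def make_decimal(coefficient: int, exponent: int) -> Decimal:
    """Build coefficient * 10**exponent, preserving the exponent exactly."""
    sign = 1 if coefficient < 0 else 0
    digits = tuple(int(d) for d in str(abs(coefficient)))
    # The (sign, digits, exponent) tuple form avoids any rounding or normalization.
    return Decimal((sign, digits, exponent))
```

Because the sign is derived from an integer, a zero coefficient always yields a positive zero such as `0.00`, which is the limitation described in the note above.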
For normative examples, see make_decimal
in the Ion conformance test suite.
make_timestamp
(macro make_timestamp (year month? day? hour? minute? second? offset_minutes?) /* Not representable in TDL */)
Produces a non-null, unannotated timestamp at various levels of precision.
When offset_minutes
is absent, the result has unknown local offset; offset 0
denotes UTC.
The make_timestamp
macro has rules that cannot be expressed in the macro signature because it must construct a
valid Ion timestamp value.
The arguments to this macro may not be any null value.
The evaluated argument for the year
parameter must be an integer from 1 to 9999 inclusive.
The evaluated argument for the month
parameter, if present, must be an integer from 1 to 12 inclusive.
The evaluated argument for the day
parameter, if present, must be an integer that is a valid, 1-indexed day for the given month.
The evaluated argument for the hour
parameter, if present, must be an integer from 0 to 23 inclusive.
The evaluated argument for the minute
parameter, if present, must be an integer from 0 to 59 inclusive.
The evaluated argument for the second
parameter, if present, must be a decimal or integer value that is greater than
or equal to zero and less than 60. The evaluated arguments for all other parameters, if present, must be integer values.
The offset_minutes
and hour
parameters may only be present if minute
is present. Aside from offset_minutes
, if
any evaluated argument is present, the evaluated arguments for all parameters to the left must also be present.
The precision of the constructed timestamp is determined by which parameters have non-empty arguments.
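The argument rules above can be collected into a single validation routine. The Python sketch below is illustrative only: `timestamp_precision` and `_check_int` are hypothetical names, `None` models an absent (empty) argument rather than an Ion null, and day validity is assumed to follow the proleptic Gregorian calendar as implemented by Python's `calendar` module.

```python
import calendar
from decimal import Decimal


def _check_int(name, value, lo, hi):
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or not lo <= value <= hi:
        raise ValueError(f"{name} must be an integer in [{lo}, {hi}], got {value!r}")


def timestamp_precision(year, month=None, day=None, hour=None,
                        minute=None, second=None, offset_minutes=None):
    """Validate the arguments and return the name of the finest field present."""
    fields = [("year", year), ("month", month), ("day", day),
              ("hour", hour), ("minute", minute), ("second", second)]
    present = [name for name, value in fields if value is not None]
    # Aside from offset_minutes, present arguments must form a prefix of the signature.
    if present != [name for name, _ in fields[:len(present)]]:
        raise ValueError("an argument is present but one to its left is absent")
    if (hour is not None or offset_minutes is not None) and minute is None:
        raise ValueError("hour and offset_minutes require minute")
    _check_int("year", year, 1, 9999)
    if month is not None:
        _check_int("month", month, 1, 12)
    if day is not None:
        _check_int("day", day, 1, calendar.monthrange(year, month)[1])
    if hour is not None:
        _check_int("hour", hour, 0, 23)
        _check_int("minute", minute, 0, 59)
    if second is not None:
        if (isinstance(second, bool) or not isinstance(second, (int, Decimal))
                or not 0 <= second < 60):
            raise ValueError("second must be an integer or decimal in [0, 60)")
    if offset_minutes is not None and (
            isinstance(offset_minutes, bool) or not isinstance(offset_minutes, int)):
        raise ValueError("offset_minutes must be an integer")
    return present[-1]
```

For instance, `timestamp_precision(2022, 4, 28)` returns `"day"`, while supplying an hour without a minute raises an error.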
Example:
(macro ts_today
(uint8::hour uint8::minute uint32::seconds_millis)
(.make_timestamp
2022
4
28
(%hour)
(%minute)
(.make_decimal (%seconds_millis) -3) 0))
For normative examples, see make_timestamp
in the Ion conformance test suite.
Encoding Utility Macros
repeat
The repeat
system macro can be used for efficient run-length encoding.
(macro repeat (n! value*) /* Not representable in TDL */)
Produces a stream that repeats the specified value
expression(s) n
times.
The evaluated argument for n
must be a non-null integer value that is equal to or greater than zero.
(:repeat 5 0) => 0 0 0 0 0
(:repeat 2 true false) => true false true false
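A minimal Python model of this behavior, treating the value expressions as an already-expanded sequence (the function name mirrors the macro; nothing here is normative):

```python
def repeat(n: int, *values):
    """Produce the given values n times in a row."""
    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-null integer >= 0")
    return list(values) * n
```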
For normative examples, see repeat
in the Ion conformance test suite.
delta
(macro delta (deltas*) /* Not representable in TDL */)
The delta
system macro can be used for directed delta encoding.
It produces a stream that is equal in length to the deltas
argument, defined by the recurrence relation:
output₀ = delta₀
outputₙ₊₁ = outputₙ + deltaₙ₊₁
Example:
(:delta 1000 1 2 3 -4) => 1000 1001 1003 1006 1002
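The recurrence is a running sum, so it can be sketched in Python with `itertools.accumulate`. The function name mirrors the macro; this is an illustration of the arithmetic only.

```python
from itertools import accumulate


def delta(*deltas: int) -> list:
    """output[0] = delta[0]; output[n+1] = output[n] + delta[n+1]."""
    return list(accumulate(deltas))
```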
For normative examples, see delta
in the Ion conformance test suite.
sum
(macro sum (a b) /* Not representable in TDL */)
Produces the sum of two non-null integer arguments.
Examples:
(:sum 1 2) => 3
For normative examples, see sum
in the Ion conformance test suite.
meta
(macro meta (anything*) (.none))
The meta
macro accepts any values and emits nothing.
It allows writers to encode data that will not be surfaced to most readers.
Readers can be configured to intercept calls to meta
, allowing them to read the otherwise invisible data.
When transcribing from one format to another, writers should preserve invocations of meta
when possible.
Example:
(:values
(:meta {author: "Mike Smith", email: "mikesmith@example.com"})
{foo:2,foo:1}
)
=>
{foo:2,foo:1}
For normative examples, see meta
in the Ion conformance test suite.
Updating the Encoding Context
set_symbols
Redefines the default module's symbol table, preserving any macros in its macro table.
(macro set_symbols (symbols*)
$ion::
(module _
(symbol_table [(%symbols)])
(macro_table _)
))
Example:
(:set_symbols foo bar)
=>
$ion::
(module _
(symbol_table [foo, bar])
(macro_table _)
)
For normative examples, see set_symbols
in the Ion conformance test suite.
add_symbols
Appends symbols to the default module's symbol table, preserving any macros in its macro table.
(macro add_symbols (symbols*)
$ion::
(module _
(symbol_table _ [(%symbols)])
(macro_table _)
))
Example:
(:add_symbols foo bar)
=>
$ion::
(module _
(symbol_table _ [foo, bar])
(macro_table _)
)
For normative examples, see add_symbols
in the Ion conformance test suite.
set_macros
Sets the default module's macro table, preserving any symbols in its symbol table.
(macro set_macros (macros*)
$ion::
(module _
(symbol_table _)
(macro_table (%macros))
))
Example:
(:set_macros (macro pi () 3.14159))
=>
$ion::
(module _
(symbol_table _)
(macro_table (macro pi () 3.14159))
)
For normative examples, see set_macros
in the Ion conformance test suite.
add_macros
Appends macros to the default module's macro table, preserving any symbols in its symbol table.
(macro add_macros (macros*)
$ion::
(module _
(symbol_table _)
(macro_table _ (%macros))
))
Example:
(:add_macros (macro pi () 3.14159))
=>
$ion::
(module _
(symbol_table _)
(macro_table _ (macro pi () 3.14159))
)
For normative examples, see add_macros
in the Ion conformance test suite.
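The effect of the four directives above on the default module can be summarized with a small Python model. `DefaultModule` is a hypothetical class, and macros are represented by placeholder strings; the sketch only illustrates which table each directive replaces or extends and which it leaves untouched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DefaultModule:
    symbols: List[str] = field(default_factory=list)
    macros: List[str] = field(default_factory=list)

    def set_symbols(self, *symbols: str) -> None:
        self.symbols = list(symbols)      # replaces symbols, macros preserved

    def add_symbols(self, *symbols: str) -> None:
        self.symbols.extend(symbols)      # appends symbols, macros preserved

    def set_macros(self, *macros: str) -> None:
        self.macros = list(macros)        # replaces macros, symbols preserved

    def add_macros(self, *macros: str) -> None:
        self.macros.extend(macros)        # appends macros, symbols preserved
```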
use
Appends the content of the given module to the default module.
(macro use (catalog_key version?)
$ion::
(module _
(import the_module (%catalog_key) (.default (%version) 1))
(symbol_table _ the_module)
(macro_table _ the_module)
))
Example:
(:use "org.example.FooModule" 2)
=>
$ion::
(module _
(import the_module "org.example.FooModule" 2)
(symbol_table _ the_module)
(macro_table _ the_module)
)
For normative examples, see use
in the Ion conformance test suite.
[1] The annotations sequence comes first in the annotate macro signature because it parallels how annotations are read from the data stream.
Ion 1.1 modules
In Ion 1.0, each stream has a symbol table. The symbol table stores text values that can be referred to by their integer index in the table, providing a much more compact representation than repeating the full UTF-8 text bytes each time the value is used. Symbol tables do not store any other information used by the reader or writer.
Ion 1.1 introduces the concept of a macro table. It is analogous to the symbol table, but instead of holding text values it holds macro definitions.
Ion 1.1 also introduces the concept of a module, an organizational unit that holds a (symbol table, macro table)
pair.
tip
You can think of an Ion 1.0 symbol table as a module with an empty macro table.
In Ion 1.1, each stream has an encoding module sequence— a collection of modules whose symbols and macros are being used to encode the current segment.
Module interface
The interface to a module consists of:
- its spec version, denoting the Ion version used to define the module
- its exported symbols, an array of strings denoting symbol content
- its exported macros, an array of
<name, macro>
pairs, where all names (where specified) are unique identifiers
The spec version is external to the module body and the precise way it is determined depends on the type of module being defined. This is explained in further detail in Module Versioning.
The exported symbol array is denoted by the symbol_table
clause of a module definition, and
by the symbols
field of a shared symbol table.
The exported macro array is denoted by the module’s macro_table
clause, with addresses
allocated to macros or macro bindings in the order they are declared.
The exported symbols and exported macros are defined in the module body.
Types of modules
There are multiple types of modules. All modules share the same interface, but vary in their implementation in order to support a variety of different use cases.
Module Type | Purpose |
---|---|
Local Modules | Organizing symbols and macros within a scope |
Shared Modules | Defining reusable symbols and macros outside of the data stream |
System Modules | Defining system symbols and macros |
Encoding Modules | Encoding the current stream segment |
Module versioning
Every module definition has a spec version that determines the syntax and semantics of the module body. A module’s spec version is expressed in terms of a specific Ion version; the meaning of the module is as defined by that version of the Ion specification.
The spec version for a local module is inherited from its parent scope, which may be the stream itself. The spec version for a shared module is denoted via a required annotation. The spec version of a system module is the Ion version in which it was specified.
To ensure that all consumers of a module can properly understand it, a module can only import shared modules defined with the same or earlier spec version.
Examples
The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N
.
This allows the module to be serialized using any version of Ion, and its meaning will not change.
$ion_shared_module::
$ion_1_1::
("com.example.symtab" 3
(symbol_table ...)
(macro_table ...))
The spec version of a local module is always the same as the spec version of its enclosing scope. If the local module is defined at the top level of the stream, its spec version is the Ion version of the current segment.
$ion_1_1
$ion::
(module foo
// Module semantics specified by Ion 1.1
...
)
// ...
$ion_1_3
$ion::
(module foo
// Module semantics specified by Ion 1.3
...
)
//... // Assuming no IVM
$ion::
(module bar
// Module semantics specified by Ion 1.3
...
)
Identifiers
Many of the grammatical elements used to define modules and macros are identifiers--symbols that do not require quotation marks.
More explicitly, an identifier is a sequence of one or more ASCII letters, digits, or the characters $
(dollar sign) or _
(underscore), not starting with a digit.
It also cannot be of the form $\d+
, which is the syntax for symbol IDs (for example: $3
, $10
, $458
, etc.), nor can it be a keyword (true
, false
, null
, or nan
).
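The identifier rules can be expressed as a short check. The Python sketch below is one possible reading of the prose above; the names `is_identifier`, `_IDENTIFIER`, `_SYMBOL_ID`, and `_KEYWORDS` are hypothetical.

```python
import re

# ASCII letters, digits, '$' or '_', not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
# Symbol ID syntax such as $3 or $458 is excluded.
_SYMBOL_ID = re.compile(r"\$[0-9]+")
_KEYWORDS = {"true", "false", "null", "nan"}


def is_identifier(text: str) -> bool:
    return (_IDENTIFIER.fullmatch(text) is not None
            and _SYMBOL_ID.fullmatch(text) is None
            and text not in _KEYWORDS)
```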
Defining modules
A module is defined by four kinds of subclauses which, if present, always appear in the same order.
- import - a reference to a shared module definition; repeatable
- module - a nested module definition; repeatable
- symbol_table - an exported list of text values
- macro_table - an exported list of macro definitions
The lexical name given to a module definition must be an identifier.
However, it must not begin with a $
--this is reserved for system-defined bindings like $ion
.
Internal environment
The body of a module tracks an internal environment by which macro references are resolved. This environment is constructed incrementally by each clause in the definition and consists of:
- the module bindings, a map from identifier to module definition
- the exported symbols, an array containing symbol texts
- the exported macros, an array containing name/macro pairs
Before any clauses of the module definition are examined, each of these is empty.
Each clause affects the environment as follows:
- An import declaration retrieves a shared module from the implementation’s catalog and binds a name to it, making its macros available for use. An error must be signaled if the name already appears in the module bindings.
- A module declaration defines a new module and binds a name to it. An error must be signaled if the name already appears in the module bindings.
- A symbol_table declaration defines the exported symbols.
- A macro_table declaration defines the exported macros.
Resolving Macro References
Within a module definition, macros can be referenced in several contexts using the following macro-ref syntax:
qualified-ref ::= module-name '::' macro-ref
macro-ref ::= macro-name | macro-addr
macro-name ::= unannotated-identifier-symbol
macro-addr ::= unannotated-uint
Macro references are resolved to a specific macro as follows:
- An unqualified macro-name is looked up in the following locations:
  - in the macros already exported in this module's macro_table
  - in the default_module
  - in the system module

  If it maps to a macro, that’s the resolution of the reference. Otherwise, an error is signaled due to an unbound reference.
- An anonymous local reference (macro-addr) is resolved by index in the exported macro array. If the address exceeds the array boundary, an error is signaled due to an invalid reference.
- A qualified reference (qualified-ref) resolves solely against the referenced module. First, the module name must be resolved to a module definition.
  - If the module name is in the module bindings, it resolves to the corresponding module definition.
  - If the module name is not in the module bindings, resolution is attempted recursively upwards through the parent scopes.
  - If the search reaches the top level without resolving to a module, an error is signaled due to an unbound reference.

  Next, the name or address is resolved within that module definition’s exported macro table.
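The resolution rules can be sketched in Python as follows. This is a simplified model under stated assumptions: `Module`, `find`, and `resolve` are hypothetical names, macro definitions are placeholder strings, and the three locations for an unqualified name are assumed to be consulted in the order listed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class Module:
    macros: List[Tuple[Optional[str], str]] = field(default_factory=list)
    bindings: Dict[str, "Module"] = field(default_factory=dict)
    parent: Optional["Module"] = None

    def find(self, ref: Union[str, int]) -> Optional[str]:
        """Look up a macro by name or address in the exported macro array."""
        if isinstance(ref, int):
            if not 0 <= ref < len(self.macros):
                raise LookupError(f"invalid reference: address {ref}")
            return self.macros[ref][1]
        return next((m for name, m in self.macros if name == ref), None)


def resolve(module, ref, default_module, system_module, qualifier=None):
    if qualifier is not None:
        # Qualified: walk up parent scopes to find the module binding.
        scope = module
        while scope is not None and qualifier not in scope.bindings:
            scope = scope.parent
        if scope is None:
            raise LookupError(f"unbound module: {qualifier}")
        candidates = [scope.bindings[qualifier]]
    elif isinstance(ref, int):
        # Anonymous local reference: index into this module's macro array.
        candidates = [module]
    else:
        # Unqualified name: this module, then default module, then system module.
        candidates = [module, default_module, system_module]
    for candidate in candidates:
        found = candidate.find(ref)
        if found is not None:
            return found
    raise LookupError(f"unbound reference: {ref}")
```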
import
import ::= '(import ' module-name catalog-key ')'
module-name ::= unannotated-identifier-symbol
catalog-key ::= catalog-name catalog-version?
catalog-name ::= string
catalog-version ::= int // positive, unannotated
An import binds a lexically scoped module name to a shared module that is identified by a catalog key—a (name, version)
pair.
The version
of the catalog key is optional—when omitted, the version is implicitly 1.
In Ion 1.0, imports may be substituted with a different version if an exact match is not found. In Ion 1.1, however, all imports require an exact match to be found in the reader's catalog; if an exact match is not found, the implementation must signal an error.
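The exact-match rule can be illustrated with a catalog keyed by (name, version) pairs. In this Python sketch the catalog is assumed to be a plain dictionary and `resolve_import` is a hypothetical helper; real implementations expose their own catalog interfaces.

```python
def resolve_import(catalog: dict, name: str, version: int = 1):
    """Return the shared module for (name, version); no fallback to other versions."""
    try:
        return catalog[(name, version)]
    except KeyError:
        # Ion 1.1 requires an exact match, so a missing key is always an error.
        raise LookupError(f"no exact match for {name!r} version {version}") from None
```

When the version is omitted it defaults to 1, matching the implicit version described above.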
module
The module
clause defines a new local module that is contained in the current module.
inner-module ::= '(module' module-name import* symbol-table? macro-table? ')'
Inner modules automatically have access to modules previously declared in the containing module using module
or import
.
The new module (and its exported symbols and macros) is available to any following module
, symbol_table
, and
macro_table
clauses in the enclosing container.
See local modules for full explanation.
symbol_table
A module can define a list of exported symbols by copying symbols from other modules and/or declaring new symbols.
symbol-table ::= '(symbol_table' symbol-table-entry* ')'
symbol-table-entry ::= module-name | symbol-list
symbol-list ::= '[' ( symbol-text ',' )* ']'
symbol-text ::= symbol | string
The symbol_table
clause assembles a list of text values for the module to export.
It takes any number of arguments, each of which may be the name of a visible module or a list of symbol-texts.
The symbol table is a list of symbol-texts formed by concatenating the symbol tables of named modules and lists of symbol/string values.
Where a module name occurs, its symbol table is appended. (The module name must refer to another module that is visible to the current module.) Unlike Ion 1.0, no symbol-maxid is needed because Ion 1.1 always requires exact matches for imported modules.
tip
When redefining a top-level module binding, the binding being redefined can be added to the symbol table in order to retain its symbols. For example:
// Define module `foo`
$ion::
(module foo
(symbol_table ["b", "c"]))
// Redefine `foo` in terms of its former definition
$ion::
(module foo
(symbol_table
["a"]
foo // The old definition of `foo` with symbols ["b", "c"]
["d"]))
// Now `foo`'s symbol table is ["a", "b", "c", "d"]
Where a list occurs, it must contain only non-null, unannotated strings and symbols.
The text of these strings and/or symbols is appended to the symbol table.
Upon encountering any non-text value, null value, or annotated value in the list, the implementation shall signal an error.
To add a symbol with unknown text to the symbol table, one may use $0
.
All modules have a symbol table, so when a module has no symbol_table
clause, the module has an empty symbol table.
Symbol zero $0
Symbol zero (i.e. $0
) is a special symbol that is not assigned text by any symbol table, even the system symbol table.
Symbol zero always has unknown text, and can be useful in synthesizing symbol identifiers where the text of the symbol is not known in a particular operating context.
All symbol tables (even an empty symbol table) can be thought of as implicitly containing $0
.
However, $0
precedes all symbol tables rather than belonging to any symbol table.
When adding the exported symbols from one module to the symbol table of another, the preceding $0
is not copied into the destination symbol table (because it is not part of the source symbol table).
It is important to note that $0
is only semantically equivalent to itself and to locally-declared SIDs with unknown text.
It is not semantically equivalent to SIDs with unknown text from shared symbol tables, so replacing such SIDs with $0
is a destructive operation to the semantics of the data.
Processing
When the symbol_table
clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.
For each arg
:
- If the arg is a list of text values, the nested text values are appended to the end of the symbol table being constructed.
  - When $0 appears in the list of text values, this creates a symbol with unknown text.
  - The presence of any other Ion value in the list raises an error.
- If the arg is the name of a module, the symbols in that module's symbol table are appended to the end of the symbol table being constructed.
- If the arg is anything else, the reader must raise an error.
Example
(symbol_table // Constructs an empty symbol table (list)
["a", b, 'c'] // The text values in this list are appended to the table
foo // Module `foo`'s symbol table values are appended to the table
['''g''', "h", i]) // The text values in this list are appended to the table
If module foo
's symbol table were [d, e, f]
, then the symbol table defined by the above clause would be:
["a", "b", "c", "d", "e", "f", "g", "h", "i"]
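The processing steps can be modeled with a short Python function. The representation is an assumption made for illustration: a list argument is a Python list of strings, a module name is a bare Python string, `UNKNOWN_TEXT` (here `None`) stands in for `$0`, and `visible_modules` maps module names to their symbol tables. The names `build_symbol_table` and `UNKNOWN_TEXT` are hypothetical.

```python
UNKNOWN_TEXT = None  # stands in for $0, a symbol with unknown text


def build_symbol_table(args, visible_modules):
    table = []  # the reader starts with an empty list
    for arg in args:  # arguments are processed left to right
        if isinstance(arg, list):
            for text in arg:
                if text is not UNKNOWN_TEXT and not isinstance(text, str):
                    raise ValueError(f"non-text value in symbol list: {text!r}")
                table.append(text)
        elif isinstance(arg, str):
            if arg not in visible_modules:
                raise LookupError(f"module not visible: {arg}")
            table.extend(visible_modules[arg])
        else:
            raise ValueError(f"invalid symbol_table argument: {arg!r}")
    return table
```

Calling it with `[["a", "b", "c"], "foo", ["g", "h", "i"]]` and a `foo` whose symbol table is `["d", "e", "f"]` reproduces the nine-entry table shown above.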
This is an Ion 1.0 symbol table that imports two shared symbol tables and then declares some symbols of its own.
$ion_1_0
$ion_symbol_table::{
imports: [{ name: "com.example.shared1", version: 1, max_id: 10 },
{ name: "com.example.shared2", version: 2, max_id: 20 }],
symbols: ["s1", "s2"]
}
Here’s the Ion 1.1 equivalent in terms of symbol allocation order:
$ion_1_1
$ion::(import m1 "com.example.shared1" 1)
$ion::(import m2 "com.example.shared2" 2)
$ion::
(module _
(symbol_table m1 m2 ["s1", "s2"])
)
macro_table
Macros are declared after symbols.
The macro_table
clause assembles a list of macro definitions for the module to export. It takes any number of arguments.
All modules have a macro table, so when a module has no macro_table
clause, the module has an empty macro table.
Most commonly, a macro table entry is a definition of a new macro expansion function, following this general shape:
// ┌─── `macro` keyword
// │ ┌─── macro name
// │ │ ┌─── signature (s-expression of parameters)
// │ │ │ ┌─── template (TDL expression)
(macro foo (x y z) (.values (%x) (%y) (%z)))
(See Defining macros for details.)
When no name is given, this defines an anonymous macro that can be referenced by its numeric
address (that is, its index in the enclosing macro table).
Inside the defining module, that uses a local reference like 12
.
The signature defines the syntactic shape of expressions invoking the macro; see Macro Signatures for details. The template defines the expansion of the macro, in terms of the signature’s parameters; see Template Expressions for details.
Imported macros must be explicitly exported if so desired.
Module names and export clauses can be intermingled with macro definitions inside the macro_table; together, they determine the bindings that make up the module’s exported macro array.
The module-name export form is shorthand for referencing all exported macros from that module, in their original order with their original names.
An export clause contains a single macro reference followed by an optional alias for the exported macro.
The referenced macro is appended to the macro table.
tip
No name can be repeated among the exported macros, including macro definitions.
Name conflicts must be resolved by exports with aliases.
Processing
When the macro_table clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.
For each arg:
- If the arg is a macro clause, the clause is processed and the resulting macro definition is appended to the end of the macro table being constructed.
- If the arg is an export clause, the clause is processed and the referenced macro definition is appended to the end of the macro table being constructed.
- If the arg is the name of a module, the macro definitions in that module's macro table are appended to the end of the macro table being constructed.
- If the arg is anything else, the reader must raise an error.
A macro name is a symbol that can be used to reference a macro, both inside and outside the module. Macro names are optional, and improve legibility when using, writing, and debugging macros. When a name is used, it must be an identifier per Ion’s syntax for symbols. Macro definitions being added to the macro table must have a unique name. If a macro is added whose name conflicts with one already present in the table, the implementation must raise an error.
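The processing steps above can be sketched as follows. This is a simplified model rather than a reader implementation. The tuple shapes used for arguments, the `Macro` alias, and the function `build_macro_table` are all assumptions made for illustration. Macro bodies are opaque strings, and numeric references and the null alias form of export are not modeled.

```python
from typing import Dict, List, Optional, Tuple

# (name or None, definition); the definition is an opaque placeholder here.
Macro = Tuple[Optional[str], str]


def build_macro_table(args: List[tuple], modules: Dict[str, List[Macro]]) -> List[Macro]:
    table: List[Macro] = []            # the clause starts from an empty list

    def append(name: Optional[str], definition: str) -> None:
        # Named entries must be unique within the table under construction.
        if name is not None and any(existing == name for existing, _ in table):
            raise ValueError(f"macro name already in table: {name}")
        table.append((name, definition))

    for arg in args:                   # arguments are processed left to right
        kind = arg[0]
        if kind == "macro":            # ("macro", name_or_None, definition)
            append(arg[1], arg[2])
        elif kind == "export":         # ("export", module, macro_name, alias_or_None)
            _, module, macro_name, alias = arg
            definition = dict(modules[module])[macro_name]
            append(alias if alias is not None else macro_name, definition)
        elif kind == "module":         # ("module", module_name)
            for name, definition in modules[arg[1]]:
                append(name, definition)
        else:
            raise ValueError(f"unexpected macro_table argument: {arg!r}")
    return table


modules = {
    "cartesian": [("point2d", "c-point"), ("polygon", "c-poly")],
    "polar": [("point2d", "p-point"), ("polygon", "p-poly")],
}
table = build_macro_table(
    [
        ("export", "cartesian", "polygon", "cartesian_polygon"),
        ("export", "polar", "polygon", "polar_polygon"),
        ("macro", "origin", "origin-body"),
    ],
    modules,
)
```

Listing both modules by name instead, as in `[("module", "cartesian"), ("module", "polar")]`, raises an error in this sketch because `point2d` would be added twice. This is the situation that aliases are meant to resolve.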
macro
A macro clause defines a new macro.
When the macro declaration uses a name, an error must be signaled if it already appears in the exported macro array.
export
An export clause declares a name for an existing macro and appends the macro to the macro table.
- If the reference to the existing macro is followed by a name, the existing macro is appended to the exported macro array with the latter name instead of the original name, if any. In this way, an anonymous macro can be given a name. An error must be signaled if that name already appears in the exported macro array.
- If the reference to the existing macro is followed by null, the macro is appended to the exported macro array without a name, regardless of whether the macro has a name.
- If the reference to the existing macro is anonymous, the macro is appended to the exported macro array without a name.
- When the reference to the existing macro uses a name, the name and macro are appended to the exported macro array. An error must be signaled if that name already appears in the exported macro array.
Module names in macro_table
A module name appends all exported macros from the module to the exported macro array. If any exported macro uses a name that already appears in the exported macro array, an error must be signaled.
Directives
Directives are system values that modify the encoding context.
Syntactically, a directive is a top-level s-expression annotated with $ion.
Its first child value is an operation name.
The operation determines what changes will be made to the encoding context and which clauses may legally follow.
$ion::
(operation_name
(clause_1 /*...*/)
(clause_2 /*...*/)
/*...more clauses...*/
(clause_N /*...*/))
In Ion 1.1, there are three supported directive operations:
- module
- import
- encoding
Top-level bindings
The module and import directives each create a stream-level binding to a module definition.
Once created, module bindings at this level endure until the file ends or another Ion version marker is encountered.
Module bindings at the stream-level can be redefined.
tip
The add_macros and add_symbols system macros work by redefining the default module (_) in terms of itself.
This behavior differs from module bindings created inside another module; attempting to redefine these will raise an error.
module directives
The module directive binds a name to a local module definition at the top level of the stream.
$ion::
(module foo
/*...imports, if any...*/
/*...submodules, if any...*/
(macro_table /*...*/)
(symbol_table /*...*/)
)
import directives
The import directive looks up the module corresponding to the given (name, version) pair in the catalog.
Upon success, it creates a new binding to that module at the top level of the stream.
$ion::
(import
bar // Binding
"com.example.bar" // Module name
2) // Module version
The version can be omitted. When it is not specified, it defaults to 1.
If the catalog does not contain an exact match, this operation raises an error.
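A minimal sketch of the lookup, assuming the catalog can be modeled as a dictionary keyed by (name, version) and that a missing exact match is an error. The names `resolve_import`, `catalog`, and `bindings` are hypothetical and do not come from any Ion implementation.

```python
from typing import Dict, Tuple


def resolve_import(
    catalog: Dict[Tuple[str, int], object],
    bindings: Dict[str, object],
    binding: str,
    name: str,
    version: int = 1,                 # the version defaults to 1 when omitted
) -> object:
    key = (name, version)
    if key not in catalog:            # only an exact (name, version) match will do
        raise LookupError(f"no module {name!r} with version {version} in the catalog")
    bindings[binding] = catalog[key]  # create (or replace) the stream-level binding
    return bindings[binding]


catalog = {("com.example.bar", 2): "bar-v2"}
bindings: Dict[str, object] = {}
resolve_import(catalog, bindings, "bar", "com.example.bar", 2)
```

With this catalog, calling `resolve_import(catalog, bindings, "bar", "com.example.bar")` fails, because the omitted version means version 1 and only version 2 is present.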
encoding directives
An encoding directive accepts a sequence of module bindings to use as the following stream segment's encoding module sequence.
$ion::
(encoding
mod_a
mod_b
mod_c)
The new encoding module sequence takes effect immediately after the directive and remains the same until the next encoding directive or Ion version marker.
Note that the default module is always implicitly at the head of the encoding module sequence.
Local modules
Local modules are lexically scoped. They can be referenced immediately following their definition, up until the end of their enclosing scope. They can be defined either:
- At the top level of a stream, in which case the enclosing scope is the stream itself.
- Inside another module, in which case the enclosing scope is the parent module. The parent module can be a shared or local module.
Local modules always have a symbolic name given at the point of definition, also known as a binding. It is legal for a module binding to "shadow" a module binding in its parent scope by using the same name.
$ion::
(module foo // <-- Top-level module `foo`
(macro_table
(macro quux () Quux)))
$ion::
(module bar
(module foo // <-- Shadows the top-level module `foo`
(macro_table
(macro quuz () Quuz)))
(macro_table foo::quuz) // <-- Refers to the innermost `foo`
)
However, it is not legal for a local module to use the same name as a module previously defined in the same scope.
$ion::
(module bar
(module foo // <-- First definition of `foo` inside `bar`
(macro_table
(macro quux () Quux)))
(module foo // <-- ERROR: module `foo` already defined in this scope
(macro_table
(macro quuz () Quuz)))
/*...*/
)
The only exception to this rule is at the top level. Stream-level bindings are mutable, while bindings inside a module are immutable.
$ion::
(module foo // <-- Top-level module `foo`
(macro_table
(macro quux () Quux)))
$ion::
(module foo // <-- Redefines the top-level binding `foo`
(macro_table
(macro quuz () Quuz)))
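The scoping rules illustrated by the three examples above (shadowing, no redefinition inside a module, redefinition allowed at the top level) can be modeled with a small chain of scopes. This is a sketch only: the `Scope` class and its `mutable` flag are illustrative assumptions, and module contents are reduced to plain strings.

```python
from typing import Dict, Optional


class Scope:
    """Hypothetical model of a lexical scope holding module bindings."""

    def __init__(self, parent: Optional["Scope"] = None, mutable: bool = False) -> None:
        self.parent = parent
        self.mutable = mutable              # True only for the stream (top-level) scope
        self.bindings: Dict[str, str] = {}

    def define(self, name: str, module: str) -> None:
        if name in self.bindings and not self.mutable:
            raise ValueError(f"module {name!r} already defined in this scope")
        self.bindings[name] = module        # shadowing a parent binding is allowed

    def lookup(self, name: str) -> str:
        scope: Optional[Scope] = self
        while scope is not None:            # the innermost binding wins
            if name in scope.bindings:
                return scope.bindings[name]
            scope = scope.parent
        raise KeyError(name)


stream = Scope(mutable=True)
stream.define("foo", "top-level foo, first definition")
stream.define("foo", "top-level foo, redefined")  # legal at the top level

bar = Scope(parent=stream)                        # the body of module `bar`
bar.define("foo", "foo inside bar")               # shadows the top-level `foo`
```

Calling `bar.define("foo", ...)` a second time raises an error in this sketch, mirroring the rule that bindings inside a module are immutable.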
Local modules inherit their spec version from the enclosing scope.
Local modules automatically have access to modules previously declared in their enclosing scope using module or import.
Examples
Local modules can be used to define helper macros without having to export them.
$ion_shared_module::$ion_1_1::(
"org.example.Foo" 1
(module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
(macro_table
(macro y_axis_point (y) (.util::point2d 0 (%y)))
(macro polygon (util::point2d::points+) [(%points)]))
)
In this example, the macro point2d is declared in a local module.
The macro definitions being exported in the shared module's macro table are able to reference the helper macros by name.
Local modules can also be used for grouping macros into namespaces (only visible within the parent scope).
$ion_shared_module::$ion_1_1::(
"org.example.Foo" 1
(module cartesian (macro_table (macro point2d (x y) { x:(%x), y:(%y) })
(macro polygon (point2d::points+) [(%points)]) ))
(module polar (macro_table (macro point2d (r phi) { r:(%r), phi:(%phi) })
(macro polygon (point2d::points+) [(%points)]) ))
(macro_table
(export cartesian::polygon cartesian_polygon)
(export polar::polygon polar_polygon))
)
In this example, there are two macros named point2d and two named polygon.
There is no name conflict between them because they are declared in separate namespaces.
Both polygon macros are added to the shared module's macro table, with each one given an alias to resolve the name conflict that would otherwise arise in that table.
Neither of the point2d macros needs to be added to the shared module's macro table, because the definitions of both polygon macros can reference them directly.
When grouping macros in local modules, there are more than just organizational benefits. By first defining helper macros in an inner module, a module can export macros in a different order than they are declared:
$ion_shared_module::$ion_1_1::(
"org.example.Foo" 1
// point2d must be declared before polygon...
(module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
(macro_table
// ...because it is used in the definition of polygon
(macro polygon (util::point2d::points+) [(%points)])
// But it can be added to the macro table after polygon
util)
)
Local modules can also be used for organization of symbols.
$ion::
(encoding
(module dairy (symbol_table [cheese, yogurt, milk]))
(module grains (symbol_table [cereal, bread, rice]))
(module vegetables (symbol_table [carrots, celery, peas]))
(module meat (symbol_table [chicken, mutton, beef]))
(symbol_table dairy
grains
vegetables
meat)
)
Encoding modules
The encoding of each segment of a stream is shaped by the currently configured encoding modules, an ordered sequence of modules that determine which symbols and macros are available for use in the stream. A writer can modify this sequence by emitting an encoding directive.
By logically concatenating the encoding modules' symbol and macro tables respectively, they can be viewed as unified local symbol and macro tables.
For example, consider these module definitions and the subsequent encoding directive:
$ion::
(module mod_a
(symbol_table ["a", "b", "c"])
(macro_table
(macro foo () Foo)
(macro bar () Bar)))
$ion::
(module mod_b
(symbol_table ["c", "d", "e"])
(macro_table
(macro baz () Baz)
(macro quux () Quux)))
$ion::
(module mod_c
(symbol_table ["f", "g", "h"])
(macro_table
(macro quuz () Quuz)
(macro foo () Foo2)))
$ion::
(encoding
mod_a
mod_b
mod_c)
It produces the encoding module sequence _ mod_a mod_b mod_c.
(The default module, _, is always implicitly at the head of the encoding sequence.)
The segment's local symbol table, formed by logically concatenating the symbol tables of mod_a, mod_b, and mod_c in that order, is:
Address | Symbol text |
---|---|
0 | <unknown text> |
1 | a |
2 | b |
3 | c |
4 | c |
5 | d |
6 | e |
7 | f |
8 | g |
9 | h |
Notice that no de-duplication takes place; c appears at both addresses 3 and 4.
The segment's macro table, formed by logically concatenating the macro tables of mod_a, mod_b, and mod_c in that order, is:
Address | Macro |
---|---|
0 | mod_a::foo |
1 | mod_a::bar |
2 | mod_b::baz |
3 | mod_b::quux |
4 | mod_c::quuz |
5 | mod_c::foo |
Notice that mod_a::foo and mod_c::foo can coexist in this unified view without issue.
Invocations of these macros require that they be qualified by their enclosing module's name.
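The logical concatenation behind the two tables above can be sketched as follows. The dictionary layout and the function name `unified_tables` are assumptions made for illustration. Address 0 of the symbol table is modeled as `None` to stand for the symbol with unknown text.

```python
from typing import Dict, List, Optional, Tuple


def unified_tables(
    sequence: List[str],
    modules: Dict[str, Dict[str, List[str]]],
) -> Tuple[List[Optional[str]], List[Tuple[str, str]]]:
    symbols: List[Optional[str]] = [None]   # address 0: symbol with unknown text
    macros: List[Tuple[str, str]] = []
    for module_name in sequence:            # modules contribute in sequence order
        module = modules[module_name]
        symbols.extend(module["symbols"])   # no de-duplication is attempted
        macros.extend((module_name, macro) for macro in module["macros"])
    return symbols, macros


modules = {
    "_": {"symbols": [], "macros": []},     # the default module, empty here
    "mod_a": {"symbols": ["a", "b", "c"], "macros": ["foo", "bar"]},
    "mod_b": {"symbols": ["c", "d", "e"], "macros": ["baz", "quux"]},
    "mod_c": {"symbols": ["f", "g", "h"], "macros": ["quuz", "foo"]},
}
symbols, macros = unified_tables(["_", "mod_a", "mod_b", "mod_c"], modules)
```

Keeping the module name alongside each macro is what lets the two macros named `foo` coexist in the unified table.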
Because lower addresses take fewer bytes to encode than higher addresses, writers should place the modules they anticipate referencing the most frequently at the beginning of the encoding module sequence.
Modules in the current segment's encoding module sequence are said to be active, while modules that are defined or imported but which are not in the encoding module sequence are available. E-expressions can only invoke macros in an active module.
For example:
$ion::
(module mod_a
(macro_table
(macro foo () Foo)))
// `mod_a` is now available
$ion::
(module mod_b
(macro_table
(macro bar () Bar)))
// `mod_b` is now available
$ion::
(encoding mod_a)
// `mod_a` is now active
(:mod_a::foo) // Foo
(:mod_b::bar) // ERROR: `mod_b` is not in the encoding module sequence
The default module
The default module, _, is an empty top-level module that is implicitly defined at the beginning of every stream.
When resolving an unqualified macro name, readers first look for the corresponding macro definition in _.
If it is not found in _, they will then look in $ion.
If it is still not found, the reader will raise an error.
This makes it possible to leverage macros in a lightweight way; writers do not have to first name/define a custom module to house their macros, and the macros themselves can be invoked in text without having to write out the module name.
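The lookup order for unqualified names can be sketched as a two-step search. The function `resolve_unqualified` and the dictionaries standing in for the two modules' macro tables are hypothetical, and macro bodies are placeholders.

```python
from typing import Dict, Tuple


def resolve_unqualified(
    name: str,
    default_macros: Dict[str, str],   # macros defined in `_`
    system_macros: Dict[str, str],    # macros defined in `$ion`
) -> Tuple[str, str]:
    if name in default_macros:        # `_` is consulted first
        return ("_", default_macros[name])
    if name in system_macros:         # then the system module
        return ("$ion", system_macros[name])
    raise LookupError(f"no macro named {name!r} in _ or $ion")


default_macros = {"foo": "user foo", "values": "user values"}
system_macros = {"values": "system values", "make_string": "system make_string"}
```

In this sketch a macro named `values` defined in `_` takes precedence over the system macro of the same name, which remains reachable through its qualified form.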
Macros and symbols can be added to the default module by redefining _.
Like any top-level module, _ can be redefined in terms of itself, making appends and prepends straightforward.
$ion_1_1
// `_` exists, but is empty
$ion::
(module _
(macro_table
(macro foo () Foo)))
// `_` now contains macro `foo`
$ion::
(module _
(macro_table
_ // Add all macros in `_` to its redefinition
(macro bar () Bar)))
// `_` now contains macros `foo` and `bar`
(:foo) // Equivalent to `(:_::foo)`
(:bar) // Equivalent to `(:_::bar)`
System macros like add_symbols and add_macros apply their changes to _, so we can rewrite the above more succinctly as:
$ion_1_1
// `_` exists, but is empty
(:add_macros
(macro foo () Foo)
(macro bar () Bar))
// `_` now contains macros `foo` and `bar`
(:foo) // Equivalent to `(:_::foo)`
(:bar) // Equivalent to `(:_::bar)`
_ can also be redefined by an import directive.
Default encoding module sequence
At the beginning of a stream, the encoding module sequence contains two modules:
- the default module, _
- the system module, $ion
Recall that a segment's symbol and macro tables are logical concatenations of those found in the segment's encoding modules.
Because _ is empty at the beginning of the stream, the stream's initial symbol and macro tables are identical to those of the system module, $ion.
This is beneficial because it allows all system macros to be invoked from the stream's macro table in a single byte rather than the two-byte sequence needed to invoke them from the system macro table. In this way, a writer can define its macros and symbols in a maximally compact fashion at the head of the stream.
Modifying active modules
If a module binding in the encoding module sequence is redefined, the new module definition replaces the old one in the sequence.
For example, after these directives are evaluated:
$ion::
(module mod_a
(macro_table
(macro foo () Foo)
(macro bar () Bar)))
$ion::
(module mod_b)
$ion::
(module mod_c
(macro_table
(macro quux () Quux)
(macro quuz () Quuz)))
$ion::(encoding mod_a mod_b mod_c)
the encoding sequence is _ mod_a mod_b mod_c, and mod_b is empty.
(:0) // => Foo
(:1) // => Bar
(:2) // => Quux
(:3) // => Quuz
If we then add macros to mod_b, those macros will immediately become available.
$ion::
(module mod_b
(macro_table
(macro baz () Baz)))
(:0) // => Foo
(:1) // => Bar
(:2) // => Baz
(:3) // => Quux
(:4) // => Quuz
important
Notice that modifying a module (in this case mod_b) can cause the addresses of all subsequent macros to be modified.
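One way to picture the address shift is to treat the encoding module sequence as a list of names that is resolved through the current bindings each time addresses are computed. This is a sketch under that assumption. `macro_addresses` and the `bindings` dictionary are illustrative names, and macros are reduced to their names.

```python
from typing import Dict, List


def macro_addresses(sequence: List[str], bindings: Dict[str, List[str]]) -> List[str]:
    """Flatten the macro tables of the active modules; the list index is the address."""
    table: List[str] = []
    for module_name in sequence:
        table.extend(bindings[module_name])
    return table


sequence = ["_", "mod_a", "mod_b", "mod_c"]
bindings = {
    "_": [],
    "mod_a": ["foo", "bar"],
    "mod_b": [],                      # active, but empty
    "mod_c": ["quux", "quuz"],
}
before = macro_addresses(sequence, bindings)

bindings["mod_b"] = ["baz"]           # redefine the binding for mod_b
after = macro_addresses(sequence, bindings)
```

Because `mod_c` follows `mod_b` in the sequence, `quux` moves from address 2 to address 3 once `mod_b` gains a macro, while the addresses of `foo` and `bar` are unaffected.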
Clearing the symbol and macro tables
$ion::(module _) // Redefine `_` to be an empty module
// If other modules are in use, remove them from the encoding module sequence
$ion::(encoding)
You can also consider writing an Ion version marker, which is more compact.
The behavior is slightly different, however: an IVM will also add $ion to the encoding module sequence.
See the Default encoding module sequence section for details.
The system module
The symbols and macros of the system module $ion are available everywhere within an Ion document, with the version of that module being determined by the spec-version of each segment.
The specific system symbols are largely uninteresting to users; while the binary encoding heavily
leverages the system symbol table, the text encoding that users typically interact with does not.
The system macros are more visible, especially to authors of macros.
This chapter catalogs the system-provided symbols and macros.
The examples below use unqualified names, which works assuming no other macros with the same name are in scope.
The unambiguous form $ion::macro-name is always available to use.
Relation to local symbol and macro tables
In Ion 1.0, the system symbol table is always the first import of the local symbol table. However, in Ion 1.1, the system symbol and macro tables have a system address space that is distinct from the local address space, but can optionally be included in the user address space.
When starting an Ion 1.1 segment (i.e. immediately after encountering an $ion_1_1 version marker), the system module is in the sequence of active encoding modules immediately following the default module.
As a result, both the system macros and system symbols are initially included in the local macro and symbol tables [1].
The system module is not a permanent fixture in the active encoding modules, so (in contrast to Ion 1.0) the system symbols and macros can be removed from the local symbol and macro tables.
System Symbols
The Ion 1.1 System Symbol table replaces rather than extends the Ion 1.0 System Symbol table. The system symbols are as follows:
ID | Hex | Text |
---|---|---|
0 | 0x00 | <reserved> |
1 | 0x01 | $ion |
2 | 0x02 | $ion_1_0 |
3 | 0x03 | $ion_symbol_table |
4 | 0x04 | name |
5 | 0x05 | version |
6 | 0x06 | imports |
7 | 0x07 | symbols |
8 | 0x08 | max_id |
9 | 0x09 | $ion_shared_symbol_table |
10 | 0x0A | encoding |
11 | 0x0B | $ion_literal |
12 | 0x0C | $ion_shared_module |
13 | 0x0D | macro |
14 | 0x0E | macro_table |
15 | 0x0F | symbol_table |
16 | 0x10 | module |
17 | 0x11 | export |
18 | 0x12 | import |
19 | 0x13 | flex_symbol |
20 | 0x14 | flex_int |
21 | 0x15 | flex_uint |
22 | 0x16 | uint8 |
23 | 0x17 | uint16 |
24 | 0x18 | uint32 |
25 | 0x19 | uint64 |
26 | 0x1A | int8 |
27 | 0x1B | int16 |
28 | 0x1C | int32 |
29 | 0x1D | int64 |
30 | 0x1E | float16 |
31 | 0x1F | float32 |
32 | 0x20 | float64 |
33 | 0x21 | zero-length text (i.e. '' ) |
34 | 0x22 | for |
35 | 0x23 | literal |
36 | 0x24 | if_none |
37 | 0x25 | if_some |
38 | 0x26 | if_single |
39 | 0x27 | if_multi |
40 | 0x28 | none |
41 | 0x29 | values |
42 | 0x2A | default |
43 | 0x2B | meta |
44 | 0x2C | repeat |
45 | 0x2D | flatten |
46 | 0x2E | delta |
47 | 0x2F | sum |
48 | 0x30 | annotate |
49 | 0x31 | make_string |
50 | 0x32 | make_symbol |
51 | 0x33 | make_decimal |
52 | 0x34 | make_timestamp |
53 | 0x35 | make_blob |
54 | 0x36 | make_list |
55 | 0x37 | make_sexp |
56 | 0x38 | make_field |
57 | 0x39 | make_struct |
58 | 0x3A | parse_ion |
59 | 0x3B | set_symbols |
60 | 0x3C | add_symbols |
61 | 0x3D | set_macros |
62 | 0x3E | add_macros |
63 | 0x3F | use |
In Ion 1.1 Text, system symbols can never be referenced by symbol ID; $1 always refers to the first symbol in the user symbol table.
This allows the Ion 1.1 system symbol table to be relatively large without taking away SID space from the user symbol table.
System Macros
ID | Hex | Text |
---|---|---|
0 | 0x00 | none |
1 | 0x01 | values |
2 | 0x02 | default |
3 | 0x03 | meta |
4 | 0x04 | repeat |
5 | 0x05 | flatten |
6 | 0x06 | delta |
7 | 0x07 | sum |
8 | 0x08 | annotate |
9 | 0x09 | make_string |
10 | 0x0A | make_symbol |
11 | 0x0B | make_decimal |
12 | 0x0C | make_timestamp |
13 | 0x0D | make_blob |
14 | 0x0E | make_list |
15 | 0x0F | make_sexp |
16 | 0x10 | make_field |
17 | 0x11 | make_struct |
18 | 0x12 | parse_ion |
19 | 0x13 | set_symbols |
20 | 0x14 | add_symbols |
21 | 0x15 | set_macros |
22 | 0x16 | add_macros |
23 | 0x17 | use |
[1] System symbols require the same number of bytes whether they are encoded using the system symbol or the user symbol encoding. The reasons the system symbols are initially loaded into the user symbol table are twofold: to be consistent with loading the system macros into user space, and so that implementors can start testing user symbols even before they have implemented support for reading encoding directives.
Shared modules
Shared modules exist independently of the documents that use them. They are identified by a catalog key consisting of a string name and an integer version.
The self-declared catalog-names of shared modules are generally long, since they must be more-or-less globally unique. When imported by another module, they are given local symbolic names—a binding—by import declarations.
They have a spec version that is explicit via annotation, and a content version derived from the catalog version.
The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N.
This allows the module to be serialized using any version of Ion, and its meaning will not change.
$ion_shared_module::
$ion_1_1::("com.example.symtab" 3
(symbol_table ...)
(macro_table ...) )
Example
An Ion 1.1 shared module.
$ion_shared_module::
$ion_1_1::("org.example.geometry" 2
(symbol_table ["x", "y", "square", "circle"])
(macro_table (macro point2d (x y) { x:(%x), y:(%y) })
(macro polygon (point2d::points+) [(%points)]) )
)
The system module provides a convenient macro (use) to append a shared module to the encoding module.
$ion_1_1
(:use "org.example.geometry" 2)
(:polygon (:: (1 4) (1 8) (3 6)))
Compatibility with Ion 1.0
Ion 1.0 shared symbol tables are treated as Ion 1.1 shared modules that have an empty macro table.
Ion 1.1 Text Encoding
The Ion text encoding is a stream of UTF-8 encoded text. It is intended to be easy to read and write by humans.
Whitespace is insignificant and is only required where necessary to separate tokens. C-style comments (either block or in-line) are treated as whitespace; they are not part of the data model and implementations are not required to preserve them.
A text Ion 1.1 stream begins with the Ion 1.1 version marker ($ion_1_1) followed by a series of value literals and/or encoding expressions.
Values
Annotations
In the text format, type annotations are denoted by a symbol token and double-colons preceding any value. Multiple annotations on the same value are separated by double-colons:
int32::12 // Suggests 32 bits as end-user type
degrees::'celsius'::100 // You can have multiple annotations on a value
'my.custom.type'::{ x : 12 , y : -1 } // Gives a struct a user-defined type
{ field: some_annotation::value } // Field's name must precede annotations of its value
jpeg :: {{ ... }} // Indicates the blob contains jpeg data
bool :: null.int // A very misleading annotation on the integer null
'' :: 1 // An empty annotation
null.symbol :: 1 // ERROR: type annotation cannot be null
foo::(:make_string "a" "b") // ERROR: e-expressions may not be annotated
(:make_string foo::(:: "a" "b")) // ERROR: expression groups may not be annotated
Nulls
Null values are represented by the keyword null, optionally followed by . and the name of a type in the Ion data model.
null
null.null // Identical to unadorned null
null.bool
null.int
null.float
null.decimal
null.timestamp
null.string
null.symbol
null.blob
null.clob
null.struct
null.list
null.sexp
The text format treats all of these as reserved tokens; to use those same characters as a symbol token, they must be enclosed in single-quotes:
null // The type is null
'null' // The type is symbol
null.list // The type is list
'null.int' // The type is symbol
Any text token starting with null. must be one of the legal null values.
(llun.foo) // A s-expression equivalent to (llun . foo)
(null.foo) // This is illegal; not equivalent to (null . foo)
// because null. is never split into separate tokens
Booleans
Boolean values are represented by the literals true and false.
The text format treats both of these as reserved tokens; to use those same characters as a symbol token, they must be enclosed in single-quotes.
true // a boolean value
'true' // a symbol value
'true'::1 // an integer annotated with the text "true"
true::1 // ERROR: cannot use an unquoted keyword as an annotation
{ 'true': 1 } // a struct containing a field name with the text "true"
{ true: 1 } // ERROR: cannot use an unquoted keyword as a field name
Integers
Integer values may be encoded in binary, decimal, and hexadecimal notation.
A decimal-encoded int consists of the digit 0 OR a non-zero digit followed by zero or more base 10 digits (0123456789); leading zeros are not allowed.
A binary-encoded int consists of 0b followed by one or more base 2 digits (01).
A hexadecimal-encoded int consists of 0x followed by one or more case-insensitive base 16 digits (0123456789abcdefABCDEF).
All integer values may be preceded by an optional minus sign (-), indicating that the value is negative.
(The token -0 is legal and equivalent to 0; to distinguish -0 from 0, consider encoding as a decimal or float instead.)
Single underscores may be used to separate digits; consecutive underscores are never allowed.
All integer values must be followed by one of the fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
Though the text format allows hexadecimal and binary notation, such notation is not guaranteed to be maintained if a data stream is re-transcribed.
0 // Zero. Surprise!
-0 // ...the same value with a minus sign
123 // A normal int
-123 // A negative int
0xBeef // An int denoted in hexadecimal
-0xBeef // A negative int denoted in hexadecimal
0b0101 // An int denoted in binary
-0b0101 // A negative int denoted in binary
1_2_3 // An int with underscores
0xFA_CE // An int denoted in hexadecimal with underscores
0b10_10_10 // An int denoted in binary with underscores
+1 // ERROR: leading plus not allowed
0123 // ERROR: leading zeros not allowed
1_ // ERROR: trailing underscore not allowed
1__2 // ERROR: consecutive underscores not allowed
0x_12 // ERROR: underscore can only appear between digits (the radix prefix is not a digit)
_1 // A symbol (ints cannot start with underscores)
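The token rules for integers can be sketched with a regular expression. This is an illustration, not a reference parser: `parse_ion_int` is a hypothetical helper, it works on an already isolated token (so the stop-character rule is not modeled), and it takes the 0x and 0b prefixes literally as written in the text above.

```python
import re

# A digit, optionally followed by more digits, each optionally preceded by one underscore.
_DEC = r"(?:0|[1-9](?:_?[0-9])*)"
_HEX = r"0x[0-9a-fA-F](?:_?[0-9a-fA-F])*"
_BIN = r"0b[01](?:_?[01])*"
_INT = re.compile(rf"-?(?:{_HEX}|{_BIN}|{_DEC})")


def parse_ion_int(token: str) -> int:
    """Parse one Ion text integer token, or raise ValueError if it is not one."""
    if not _INT.fullmatch(token):
        raise ValueError(f"not an Ion int: {token!r}")
    negative = token.startswith("-")
    digits = token.lstrip("-").replace("_", "")
    if digits.startswith("0x"):
        value = int(digits[2:], 16)
    elif digits.startswith("0b"):
        value = int(digits[2:], 2)
    else:
        value = int(digits, 10)
    return -value if negative else value
```

Tokens such as `+1`, `0123`, `1_`, `1__2`, and `0x_12` are rejected by the pattern. `_1` is rejected too, which is consistent with it being a symbol rather than an int.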
Floats
The text encoding of a numeric float value:
- Optionally starts with a minus sign
- Has a whole number part that is either:
  - zero, or
  - starts with 1-9 followed by any number of digits
- Has an optional decimal point followed by zero or more decimal digits
- Has the letter 'e'
- Has an optional minus sign for the exponent
- Ends with one or more digits for the exponent
A numeric Ion float value must always contain an e; fractional numbers without an e are decimal values.
Ion float values may also be special non-number values, represented in text by the following keywords:
- nan denotes the not a number (NaN) value.
- +inf denotes positive infinity.
- -inf denotes negative infinity.
The text format treats nan as a reserved token; to use those same characters as a symbol token, they must be enclosed in single-quotes.
While base-10 notation is convenient for human representation, many base-10 real numbers are irrational with respect to base-2 (that is, they have no finite base-2 expansion) and cannot be expressed exactly as a binary floating point number (e.g. 1.1e0).
When encoding a decimal real number that is irrational in base-2 or has more precision than can be stored in binary64, the exact binary64 value is determined by using the IEEE-754 round-to-nearest mode with round-half-to-even as the tie-break.
This mode/tie-break is the common default used in most programming environments and is discussed in detail in "Correctly Rounded Binary-Decimal and Decimal-Binary Conversions".
This conversion algorithm is illustrated in a straightforward way in Clinger's Algorithm.
When encoding a float value to Ion text, an implementation MAY want to consider the approach described in "Printing Floating-Point Numbers Quickly and Accurately".
Examples
Although the textual representation of 1.2e0 itself is irrational, its canonical form in the data model is not (based on the rounding rules), thus the following text forms all map to the same float value:
// the most human-friendly representation
1.2e0
// the exact textual representation in base-10 for the binary64 value 1.2e0 represents
1.1999999999999999555910790149937383830547332763671875e0
// a shortened, irrational version, but still the same value
1.1999999999999999e0
// a lengthened, irrational version that is still the same value
1.19999999999999999999999999999999999999999999999999999999e0
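The equivalence of these four forms can be checked with a short Python sketch, assuming the interpreter's `float()` performs correctly rounded decimal-to-binary64 conversion (as CPython's normally does). Python is used here only as a convenient calculator and is not an Ion parser.

```python
from decimal import Decimal

# The four text forms from the example above.
forms = [
    "1.2e0",
    "1.1999999999999999555910790149937383830547332763671875e0",
    "1.1999999999999999e0",
    "1.19999999999999999999999999999999999999999999999999999999e0",
]

# Each form is rounded to the nearest binary64 value.
values = [float(text) for text in forms]

# Decimal(float) converts exactly, exposing the full base-10 expansion of the stored value.
exact = Decimal(values[0])
```

All four conversions yield the same binary64 value, and `exact` spells out the long second form digit for digit.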
Decimals
The Hursley rules for describing a finite value converting from textual notation must be followed. The Hursley rules for describing a special value are not followed:
- infinity: the rule is not applicable for Ion Decimals.
- nan: the rule is not applicable for Ion Decimals.
Specifically, the rules for getting the integer coefficient from the decimal-part (digits preceding the exponent) of the textual representation are specified as follows.
If the decimal-part included a decimal point the exponent is then reduced by the count of digits following the decimal point (which may be zero) and the decimal point is removed. The remaining string of digits has any leading zeros removed (except for the rightmost digit) and is then converted to form the coefficient which will be zero or positive.
Where X is any unsigned integer, all the following formulae can be demonstrated to be equivalent using the text conversion rules and the data model.
// Exponent implicitly zero
X.
// Exponent explicitly zero
Xd0
// Exponent explicitly negative zero (equivalent to zero).
Xd-0
Other equivalent representations include the following, where Y is the number of digits in X.
// There are Y digits past the decimal point in the
// decimal-part, making the exponent zero. One leading zero
// is removed.
0.XdY
For example, all the following text Ion decimal representations are equivalent to each other.
0.
0d0
0d-0
0.0d1
Additionally, all the following are equivalent to each other (but not to any forms of positive zero).
-0.
-0d0
-0d-0
-0.0d1
Because all forms of zero are distinctly identified by the exponent, the following are not equivalent to each other.
// Exponent implicitly zero.
0.
// Exponent explicitly 5.
0d5
All the following are equivalent to each other.
42.
42d0
42d-0
4.2d1
0.42d2
However, the following are not equivalent to each other.
// Text converted to 42.
0.42d2
// Text converted to 42.0
0.420d2
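The coefficient and exponent rules can be sketched as a small function that maps decimal text to a (sign, coefficient, exponent) triple. The helper name `decimal_parts` and the triple layout are assumptions made for illustration. The sketch accepts only the forms shown in this section (an optional fraction and an optional d exponent with an optional minus sign) and does not check for leading zeros or stop-characters.

```python
import re
from typing import Tuple

_DECIMAL = re.compile(r"(-?)([0-9]+)(?:\.([0-9]*))?(?:d(-?[0-9]+))?")


def decimal_parts(text: str) -> Tuple[int, int, int]:
    """Return (sign, coefficient, exponent); sign is 1 for a leading minus, else 0."""
    match = _DECIMAL.fullmatch(text)
    # A token with neither a decimal point nor an exponent is an int, not a decimal.
    if match is None or (match.group(3) is None and match.group(4) is None):
        raise ValueError(f"not a decimal in this simplified grammar: {text!r}")
    sign, whole, fraction, exponent_text = match.groups()
    fraction = fraction or ""
    # The exponent is reduced by the count of digits after the decimal point.
    exponent = int(exponent_text or "0") - len(fraction)
    # Removing the point and converting drops any leading zeros.
    coefficient = int(whole + fraction)
    return (1 if sign else 0, coefficient, exponent)
```

Under this sketch `0.42d2` and `42.` both produce `(0, 42, 0)`, while `0.420d2` produces `(0, 420, -1)`, which is why it is not equivalent to them.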
In the text notation, decimal values must be followed by one of the fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
Timestamps
In the text format, timestamps follow the W3C note on date and time formats, but they must end with the literal T if not at least whole-day precision.
Fractional seconds are allowed, with at least one digit of precision and an unlimited maximum.
Local-time offsets may be represented as either hour:minute offsets from UTC, or as the literal Z to denote a local time of UTC.
If the offset is -00:00, it indicates that the local offset in which the timestamp was recorded is unknown, and that the time is therefore encoded as UTC.
Local-time offsets are required on timestamps with time and are not allowed on date values.
2007-02-23T12:14Z // Seconds are optional, but local offset is not
2007-02-23T12:14:33.079-08:00 // A timestamp with millisecond precision and PST local time
2007-02-23T20:14:33.079Z // The same instant in UTC ("zero" or "zulu")
2007-02-23T20:14:33.079+00:00 // The same instant, with explicit local offset
2007-02-23T20:14:33.079-00:00 // The same instant, with unknown local offset
2007-01-01T00:00-00:00 // Happy New Year in UTC, unknown local offset
2007-01-01 // The same instant, with days precision, unknown local offset
2007-01-01T // The same value, different syntax.
2007-01T // The same instant, with months precision, unknown local offset
2007T // The same instant, with years precision, unknown local offset
2007-02-23 // A day, unknown local offset
2007-02-23T00:00Z // The same instant, but more precise and in UTC
2007-02-23T00:00+00:00 // An equivalent format for the same value
2007-02-23T00:00:00-00:00 // The same instant, with seconds precision
2007 // Not a timestamp, but an int
2007-01 // ERROR: Must end with 'T' if not whole-day precision, this results as an invalid-numeric-stopper error
2007-02-23T20:14:33.Z // ERROR: Must have at least one digit precision after decimal point.
In the text notation, timestamp values must be followed by one of the
fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f
.
Strings
In the text format, strings are delimited by double-quotes and follow C/Java backslash-escape conventions (see Escape Characters).
null.string // A null string value
"" // An empty string value
" my string " // A normal string
"\"" // Contains one double-quote character
"\uABCD" // Contains one unicode character
xml::"<e a='v'>c</e>" // String with type annotation 'xml'
The text format supports an alternate syntax for "long strings", including those that break across lines.
Sequences bounded by three single-quotes ('''
) can cross multiple lines and still count as a valid, single string.
In addition, any number of adjacent triple-quoted strings are concatenated into a single value.
The concatenation happens within the Ion text parser and is neither detectable via the data model nor applicable to the binary format.
Note that comments are always treated as whitespace, so concatenation still occurs when a comment falls between two long strings.
( '''hello ''' // Sexp with one element
'''world!''' )
("hello world!") // The exact same sexp value
// This Ion value is a string containing three newlines. The serialized
// form's first newline is escaped into nothingness.
'''\
The first line of the string.
This is the second line of the string,
and this is the third line.
'''
Symbols
A symbol value is encoded using a symbol token.
In the text format, symbols are delimited by single-quotes and use the same escape characters as strings.
null.symbol // A null symbol value
'myVar2' // A symbol
myVar2 // The same symbol
myvar2 // A different symbol
'hi ho' // Symbol requiring quotes
'\'ahoy\'' // A symbol with embedded quotes
'' // The empty symbol
Within S-expressions, the rules for unquoted symbols include another set of tokens: operators.
An operator is an unquoted sequence of one or more of the following nineteen ASCII characters: !#%&*+-./;<=>?@^`|~
.
Operators and identifiers can be juxtaposed without whitespace:
( 'x' '+' 'y' ) // S-expression with three symbols
( x + y ) // The same three symbols
(x+y) // The same three symbols
(a==b&&c==d) // S-expression with seven symbols
Clobs
In the text format, clob
values use similar syntax to blob
, but the data between braces must be one string.
Similar to string
, adjoining long string literals within an Ion clob
are concatenated automatically.
Within a clob
, only one short string literal or multiple long string literals are allowed.
The string may only contain legal 7-bit ASCII characters, using the same escaping syntax as string
and symbol
values.
This guarantees that the value can be transmitted unscathed while remaining generally readable (at least for western language text).
Either form of comment within a clob
is invalid.
{{ "This is a CLOB of text." }}
shift_jis::
{{
'''Another clob with user-defined encoding, '''
'''this time on multiple lines.'''
}}
// Two equivalent clobs
{{ '''Hello''' '''World''' }}
{{ "HelloWorld" }}
{{
// ERROR
"comments not allowed"
}}
Blobs
In the text format, blob
values are denoted as RFC 4648-compliant
Base64 text within two pairs of curly braces.
When parsing blob
text, an error must be raised if the data:
- Contains characters outside of the Base64 character set.
- Contains a padding character (=) anywhere other than at the end.
- Is terminated by an incorrect number of padding characters.
Within blob
values, whitespace is ignored.
Comments within blob
s are not supported: the /
character is always considered part of the Base64 data and the *
is invalid.
// A valid blob value with zero padding characters.
{{
+AB/
}}
// A valid blob value with one required padding character.
{{ VG8gaW5maW5pdHkuLi4gYW5kIGJleW9uZCE= }}
// ERROR: Incorrect number of padding characters.
{{ VG8gaW5maW5pdHkuLi4gYW5kIGJleW9uZCE== }}
// ERROR: Padding character within the data.
{{ VG8gaW5maW5pdHku=Li4gYW5kIGJleW9uZCE= }}
// A valid blob value with two required padding characters.
{{ dHdvIHBhZGRpbmcgY2hhcmFjdGVycw== }}
// ERROR: Invalid character within the data.
{{ dHdvIHBhZGRpbmc_gY2hhcmFjdGVycw= }}
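The three error conditions listed above can be checked before decoding. The following Python sketch is illustrative only: the function name `decode_blob_body` is hypothetical, it operates on the text found between the double braces, and it assumes that Python's `str.split` notion of whitespace is an acceptable stand-in for Ion whitespace.

```python
import base64
import re

# Base64 alphabet characters followed by at most two '=' at the very end.
_BASE64_SHAPE = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def decode_blob_body(body: str) -> bytes:
    """Decode the text between the braces of a blob (hypothetical helper)."""
    compact = "".join(body.split())  # whitespace inside a blob is ignored
    if not _BASE64_SHAPE.fullmatch(compact):
        # Covers characters outside the alphabet and '=' before the end.
        raise ValueError("invalid character or misplaced padding")
    if len(compact) % 4 != 0:
        raise ValueError("incorrect number of padding characters")
    return base64.b64decode(compact)
```

Checking the overall shape first keeps the padding rules explicit instead of relying on the decoder's own leniency.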
Lists
In the text format, lists are bounded by square brackets and elements are separated by commas.
[] // An empty list value
[1, 2, 3] // List of three ints
[ 1 , two ] // List of an int and a symbol
[a , [b]] // Nested list
[ 1.2, ] // Trailing comma is legal in Ion (unlike JSON)
[ 1, , 2 ] // ERROR: missing element between commas
S-expressions
In the text format, S-expressions are bounded by parentheses. S-expressions also allow unquoted operator symbols, in addition to the unquoted identifier symbols allowed everywhere.
() // An empty expression value
(cons 1 2) // S-expression of three values
([hello][there]) // S-expression containing two lists
(a+-b) ( 'a' '+-' 'b' ) // Equivalent; three symbols
(a.b;) ( 'a' '.' 'b' ';') // Equivalent; four symbols
Note that comments are allowed within S-expressions and have higher precedence
than operators, therefore //
and /*
denote the start of comment blocks.
Users are advised to avoid them as operators, though they can be used when
escaped with single quotes:
(a/* word */b) // An S-expression with two symbols and a comment
(a '/*' word '*/' b) // An S-expression with five symbols
Structs
In the text format, a struct
is wrapped by curly braces, with a colon between each name and value, and a comma between the fields.
The field name is a symbol token.
For the purposes of JSON compatibility, it is also legal to use a string for field names, but they are converted to symbol tokens by the parser.
{ } // An empty struct value
{ first : "Tom" , last: "Riddle" } // Structure with two fields
{"first":"Tom","last":"Riddle"} // The same value with confusing style
{center:{x:1.0, y:12.5}, radius:3} // Nested struct
{ x:1, } // Trailing comma is legal in Ion (unlike JSON)
{ "":42 } // A struct value containing a field with an empty name
{ x:1, x:null.int } // WARNING: repeated name 'x' leads to undefined behavior
{ x:1, , } // ERROR: missing field between commas
Note that field names are symbol tokens, not symbol values, and thus may not be annotated. The value of a field may be annotated like any other value. Syntactically the field name comes first, then annotations, then the content.
{ annotation:: field_name: value } // ERROR
{ field_name: annotation:: value } // Okay
E-expressions
In Ion text, encoding expressions (E-expressions) start with (:
, immediately
followed by a macro reference, which must be one of:
- a macro name
- a base-10 integer macro address
- a qualified macro name consisting of a module name, double-colon (::), and the macro name
- a qualified macro name consisting of a module name, double-colon (::), and a base-10 integer macro address
See Encoding modules for details about qualified macro references.
Macro and module names follow the syntax rules for identifier symbol tokens, excluding symbol identifiers. There may not be any whitespace from the start of the E-expression through to the end of the macro reference.
Values in the E-expression body follow the same syntax as values in an S-expression body.
E-expressions are not values, so they may not be annotated; to annotate the result of an E-expression, use the annotate macro.
(:pi) // Invokes the macro 'pi'
(:1) // Invokes the macro with address 1 in the macro table
(:constants::pi) // Invokes the macro 'pi' from the module 'constants'
(: pi) // ERROR: whitespace is not permitted between '(:' and the macro reference
foo::(:pi) // ERROR: e-expression may not be annotated
E-expressions may also appear in structs in place of an entire name-value pair.
{
foo: 1,
(:bar 2), // Expands to a struct that is spliced into this struct
}
Expression Groups
To pass multiple arguments to a single macro parameter, the arguments to the parameter must be delimited by an Expression Group.
Inside an E-expression, an expression group starts with (::
.
The remainder of the expression group uses the same syntax as an S-expression.
Expression groups are not values, so they may not be annotated.
Expression groups may not contain another expression group.
(:make_string (:: "a" "b" "c" "d"))
// └──────┬──────┘
// └── 4 argument expressions passed to `parts`
(:make_string (:: "a" (:: "b" "c") "d") ) // ERROR: Expression groups may only occur directly in an E-expression
Rest Arguments
Rest arguments are a special case of expression groups that is only applicable to Ion 1.1 text.
When the final parameter in the macro signature is zero-or-more
or one-or-more
, "all the rest" of the argument expressions will be passed to that parameter.
Rest arguments are an implicit expression group, and may not include any explicit expression groups.
(:make_string)
// └── 0 argument expressions passed to `parts`
(:make_string "a")
// └┬┘
// └── 1 argument expression passed to `parts`
(:make_string "a" "b" "c" "d")
// └──────┬──────┘
// └── 4 argument expressions passed to `parts`
(:make_string (:: "a" "b" "c" "d"))
// └──────┬──────┘
// └── Also 4 argument expressions passed to `parts`
(:make_string (:: "a") "b" "c" "d") // ERROR: Too many arguments
(:make_string "a" (:: "b") "c" "d") // ERROR: Too many arguments
Macro-shaped parameters
Macro-shaped parameters are tagless parameters whose encoding type is the arguments for another macro.
(See Macros by Example.)
In Ion text, each set of arguments for a macro-shape parameter must be enclosed between (
and )
.
The only difference between this and an E-expression is the lack of the ':' and macro reference at the start of the E-expression.
The arguments for a macro-shape use the same syntax as the arguments to any other E-expression.
// Given the following macro signatures:
// (macro point (x y) ...)
// (macro line_segment (point::a point::b) ...)
// (macro polygon (point::points*) ...)
(:line_segment (0 1) (4 8) )
// └─┬─┘ └─┬─┘
// │ └── Implicit invocation of (:point ...) for parameter b
// └── Implicit invocation of (:point ...) for parameter a
(:polygon (:: (1 1) (1 2) (2 4) (2 5) ))
// └──────────┬──────────┘
// └── 4 macro-shaped arguments passed to `points`
Symbol tokens
In Ion text, symbols are represented in three ways:
- Quoted symbol: a sequence of zero or more characters between single-quotes, e.g., 'hello', 'a symbol', '123', ''. This representation can denote any symbol text.
- Identifier: an unquoted sequence of one or more ASCII letters, digits, or the characters $ (dollar sign) or _ (underscore), not starting with a digit and not including the keywords null, nan, true, and false.
- Operator: an unquoted sequence of one or more of the following nineteen ASCII characters: !#%&*+-./;<=>?@^`|~
Operators can only be used as (direct) elements of an S-expression. In any other context those characters require single-quotes.
A subset of identifiers have special meaning:
- Symbol Identifier: an identifier that starts with
$
(dollar sign) followed by one or more digits. These identifiers directly represent the symbol's integer symbol ID, not the symbol's text. This form is not typically visible to users, but they should be aware of the reserved notation so they don't attempt to use it for other purposes.
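The unquoted forms described above can be told apart with a few pattern checks. This Python sketch is an illustration, not part of the specification: `classify_unquoted` and its return labels are hypothetical names, and it looks at the token text alone, so it does not model context (operators are only legal directly inside an S-expression, and the sequences `//` and `/*` start comments there).

```python
import re

KEYWORDS = {"null", "nan", "true", "false"}
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
SYMBOL_ID = re.compile(r"\$[0-9]+")
OPERATOR_CHARS = set("!#%&*+-./;<=>?@^`|~")


def classify_unquoted(text: str) -> str:
    """Classify token text that appears without single-quotes (hypothetical helper)."""
    if SYMBOL_ID.fullmatch(text):
        return "symbol identifier"  # denotes a symbol ID, not symbol text
    if IDENTIFIER.fullmatch(text) and text not in KEYWORDS:
        return "identifier"
    if text and all(ch in OPERATOR_CHARS for ch in text):
        return "operator"  # only valid as a direct element of an S-expression
    return "needs quotes"
```

For example, `classify_unquoted("hi ho")` reports that the text must be written as a quoted symbol.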
Escape Characters
Strings and Symbols
The Ion text format supports Unicode escape sequences only within quoted strings and symbols. Ion supports most of the escape sequences defined by C++, Java, and JSON.
The following sequences are allowed:
Unicode Code Point | Ion Escape | Meaning |
---|---|---|
U+0000 | \0 | NUL |
U+0007 | \a | alert BEL |
U+0008 | \b | backspace BS |
U+0009 | \t | horizontal tab HT |
U+000A | \n | linefeed LF |
U+000B | \v | vertical tab VT |
U+000C | \f | form feed FF |
U+000D | \r | carriage return CR |
U+0022 | \" | double quote |
U+0027 | \' | single quote |
U+002F | \/ | forward slash |
U+003F | \? | question mark |
U+005C | \\ | backslash |
nothing | \NL | escaped NL expands to nothing |
U+00HH | \xHH | 2-digit hexadecimal Unicode code point |
U+HHHH | \uHHHH | 4-digit hexadecimal Unicode code point |
U+HHHHHHHH | \UHHHHHHHH | 8-digit hexadecimal Unicode code point |
Any other sequence following a backslash is an error.
Note that Ion does not support the following escape sequences:
- Java's extended Unicode markers, e.g.,
"\uuuXXXX"
- General octal escape sequences,
\OOO
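The escape table above maps directly onto a small decoder. The Python sketch below is a hedged illustration: `unescape_ion_string` is a hypothetical name, it receives the characters between the quotes, it treats only a line feed as the escaped newline, and it does not attempt to validate surrogate code points.

```python
import string

SIMPLE_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "t": "\t", "n": "\n", "v": "\v",
    "f": "\f", "r": "\r", '"': '"', "'": "'", "/": "/", "?": "?", "\\": "\\",
}
HEX_ESCAPES = {"x": 2, "u": 4, "U": 8}  # escape letter -> number of hex digits


def unescape_ion_string(body: str) -> str:
    """Expand backslash escapes in string or symbol text (hypothetical helper)."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        code = body[i + 1]
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
            i += 2
        elif code == "\n":
            i += 2  # an escaped newline expands to nothing
        elif code in HEX_ESCAPES:
            width = HEX_ESCAPES[code]
            digits = body[i + 2:i + 2 + width]
            if len(digits) != width or any(d not in string.hexdigits for d in digits):
                raise ValueError("malformed hexadecimal escape")
            out.append(chr(int(digits, 16)))  # chr rejects values above U+10FFFF
            i += 2 + width
        else:
            raise ValueError("unsupported escape sequence")
    return "".join(out)
```

Because octal escapes are not supported, `\0` always yields NUL and any digits after it are ordinary characters.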
Clobs
The rules for the quoted strings within a clob are similar to those for the string type, with the following exceptions.
Unicode newline characters in long strings and all verbatim ASCII characters are interpreted as their ASCII octet values.
Non-printable ASCII and non-ASCII Unicode code points are not allowed un-escaped in the string bodies.
Furthermore, the following table describes the clob string escape sequences that have direct octet replacement for both short and long strings.
Octet | Ion Escape | Meaning |
---|---|---|
0x00 | \0 | NUL |
0x07 | \a | alert BEL |
0x08 | \b | backspace BS |
0x09 | \t | horizontal tab HT |
0x0A | \n | linefeed LF |
0x0B | \v | vertical tab VT |
0x0C | \f | form feed FF |
0x0D | \r | carriage return CR |
0x22 | \" | double quote |
0x27 | \' | single quote |
0x2F | \/ | forward slash |
0x3F | \? | question mark |
0x5C | \\ | backslash |
0xHH | \xHH | 2-digit hexadecimal octet |
nothing | \NL | escaped NL expands to nothing |
The clob
escape \x
must be followed by two hexadecimal digits.
Note that clob
does not support the \u
and \U
escapes since it represents an octet sequence and not a Unicode encoding.
It is important to note that clob
is a binary type that is designed for binary values that are either text encoded in a
code page that is ASCII compatible or should be octet editable by a human (escaped string syntax vs. base64 encoded data).
Clearly non-ASCII based encodings will not be very readable (e.g. the clob for the EBCDIC encoded string representing "hello" could be denoted as {{ "\xc7\xc1%%?" }}).
Ion 1.1 Binary Encoding
A binary Ion stream consists of an Ion version marker followed by a series of value literals and/or encoding expressions.
Both value literals and e-expressions begin with an opcode that indicates what the next expression represents and how the bytes that follow should be interpreted.
Primitives
This section describes Ion 1.1's binary encoding primitives—reusable building blocks that can be combined to represent more complex constructs.
Name | Type | Width |
---|---|---|
FixedUInt | int | Determined by context |
FixedInt | int | Determined by context |
FlexUInt | int | Variable, self-delimiting |
FlexInt | int | Variable, self-delimiting |
FlexSym | symbol | Variable, self-delimiting |
FlexUInt
A variable-length unsigned integer.
The bytes of a FlexUInt
are written in
little-endian byte order. This means that the first bytes will contain
the FlexUInt
's least significant bits.
The least significant bits in the FlexUInt
indicate the number of bytes that were used to encode the integer.
If a FlexUInt
is N
bytes long, its N-1
least significant bits will be 0
; a terminal 1
bit will be
in the next most significant position.
All bits that are more significant than the terminal 1
represent the magnitude of the FlexUInt
.
FlexUInt
encoding of 14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
unsigned int 14
FlexUInt
encoding of 729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 1
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the unsigned of the unsigned
integer integer
FlexUInt
encoding of 21,043
┌───── There are 2 zeros in the least significant bits, so this
│ integer is three bytes wide.
┌─┴─┐
1 0 0 1 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
└───┬───┘ └──────┬──────┘ └──────┬──────┘
lowest 5 bits next 8 bits of highest 8 bits
of the unsigned the unsigned of the unsigned
integer integer integer
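The layout shown in the diagrams above can be reproduced with plain integer arithmetic. The following Python sketch is one possible reading of the rules, not a reference implementation; `encode_flex_uint` and `decode_flex_uint` are hypothetical names. An N-byte encoding carries 7N magnitude bits, so the encoder picks the smallest N that fits.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    size = 1
    while value >= 1 << (7 * size):  # each byte contributes 7 magnitude bits
        size += 1
    # size-1 zero bits, then the terminal 1 bit, then the magnitude.
    raw = (value << size) | (1 << (size - 1))
    return raw.to_bytes(size, "little")


def decode_flex_uint(data: bytes, offset: int = 0):
    """Return (value, encoded_size) for the FlexUInt starting at offset."""
    zeros = 0
    pos = offset
    while pos < len(data) and data[pos] == 0:  # whole bytes of leading zeros
        zeros += 8
        pos += 1
    if pos == len(data):
        raise ValueError("truncated FlexUInt")
    first = data[pos]
    zeros += (first & -first).bit_length() - 1  # index of the lowest set bit
    size = zeros + 1
    if offset + size > len(data):
        raise ValueError("truncated FlexUInt")
    raw = int.from_bytes(data[offset:offset + size], "little")
    return raw >> size, size
```

With these helpers, `encode_flex_uint(729)` yields the two bytes `66 0B` from the second diagram, and `decode_flex_uint` recovers 729 from them.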
FlexInt
A variable-length signed integer.
From an encoding perspective, FlexInt
s are structurally similar to a FlexUInt
. Both
encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate
how many bytes were used to encode the integer. They differ in the interpretation of their bits; while a
FlexUInt
's bits are unsigned, a FlexInt
's bits are encoded using
two's complement notation.
tip
An implementation could choose to read a FlexInt
by instead reading a FlexUInt
and then reinterpreting its bits
as two's complement.
FlexInt
encoding of 14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
2's comp. 14
FlexInt
encoding of -14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
2's comp. -14
FlexInt
encoding of 729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 1
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the 2's of the 2's
comp. integer comp. integer
FlexInt
encoding of -729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
1 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the 2's of the 2's
comp. integer comp. integer
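The same arithmetic works for signed values when the payload is treated as two's complement. The Python sketch below is illustrative and uses hypothetical names (`encode_flex_int`, `decode_flex_int`); the decoder follows the approach mentioned in the tip above by reading the bytes as a single little-endian integer and letting an arithmetic right shift discard the length bits.

```python
def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (hypothetical helper)."""
    size = 1
    # An N-byte FlexInt has 7N payload bits, one of which is the sign.
    while not (-(1 << (7 * size - 1)) <= value < (1 << (7 * size - 1))):
        size += 1
    raw = (value << size) | (1 << (size - 1))
    mask = (1 << (8 * size)) - 1  # keep only the low `size` bytes
    return (raw & mask).to_bytes(size, "little")


def decode_flex_int(data: bytes, offset: int = 0):
    """Return (value, encoded_size) for the FlexInt starting at offset."""
    zeros = 0
    pos = offset
    while pos < len(data) and data[pos] == 0:
        zeros += 8
        pos += 1
    if pos == len(data):
        raise ValueError("truncated FlexInt")
    first = data[pos]
    zeros += (first & -first).bit_length() - 1
    size = zeros + 1
    if offset + size > len(data):
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[offset:offset + size], "little", signed=True)
    return raw >> size, size  # Python's >> on negative ints preserves the sign
```

For instance, `encode_flex_int(-729)` produces `9E F4`, matching the last diagram above.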
FixedUInt
A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.
FixedUInt
encoding of 3,954,261
0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
lowest 8 bits next 8 bits of highest 8 bits
of the unsigned the unsigned of the unsigned
integer integer integer
FixedInt
A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two's complement.
FixedInt
encoding of -3,954,261
1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 1
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
lowest 8 bits next 8 bits of highest 8 bits
of the 2's the 2's comp. of the 2's comp.
comp. integer integer integer
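Because the width of a fixed-width integer comes from context, encoding and decoding reduce to a byte-order conversion. This Python sketch is illustrative; the function names are hypothetical, and the caller is assumed to supply the width.

```python
def encode_fixed_uint(value: int, size: int) -> bytes:
    """Little-endian unsigned integer of exactly `size` bytes (hypothetical helper)."""
    return value.to_bytes(size, "little", signed=False)


def encode_fixed_int(value: int, size: int) -> bytes:
    """Little-endian two's complement integer of exactly `size` bytes."""
    return value.to_bytes(size, "little", signed=True)


def decode_fixed_uint(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=False)


def decode_fixed_int(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=True)
```

`int.to_bytes` raises `OverflowError` when the value does not fit in the requested width, which is a convenient guard against choosing too small a size.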
FlexSym
A variable-length symbol token whose UTF-8 bytes can be inline, found in the symbol table, or derived from a macro expansion.
A FlexSym
begins with a FlexInt
; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:
- greater than zero, it represents a symbol ID. The symbol’s associated text can be found in the local symbol table. No more bytes follow.
- less than zero, its absolute value represents a number of UTF-8 bytes that follow the FlexInt. These bytes represent the symbol’s text.
- exactly zero, another byte follows that is a FlexSymOpCode.
FlexSym
encoding of symbol ID $10
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│
0 0 0 1 0 1 0 1
└─────┬─────┘
2's comp.
positive 10
FlexSym
encoding of symbol text 'hello'
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│ h e l l o
1 1 1 1 0 1 1 1 01101000 01100101 01101100 01101100 01101111
└─────┬─────┘ └─────────────────────┬─────────────────────┘
2's comp. 5-byte UTF-8 encoded "hello"
negative 5
FlexSymOpCode
FlexSymOpCode
s are a combination of system symbols and a subset of the general opcodes.
The FlexSym
parser is not responsible for evaluating a FlexSymOpCode
, only returning it—the caller will decide whether the opcode is legal in the current context.
Example usages of the FlexSymOpCode
include:
- Representing SID
$0
- Representing system symbols
  - Note that the empty symbol (i.e. the symbol '') is a system symbol and can be referenced this way.
- When used to encode a struct field name, the opcode can invoke a macro that will evaluate to a struct whose key/value pairs are spliced into the parent struct.
- In a delimited struct, terminating the sequence of (field name, value) pairs with 0xF0.
OpCode Byte | Meaning | Additional Notes |
---|---|---|
0x00 - 0x5F | E-Expression | May be used when the FlexSym occurs in the field name position of any struct |
0x60 | Symbol with unknown text (also known as $0 ) | |
0x61 - 0xDF | System SID (with 0x60 bias) | While the range of 0x61 - 0xDF is reserved for system symbols, not all of these bytes correspond to a system symbol. See system symbols for the list of system symbols. |
0xEF | E-Expression invoking a system macro | May be used when the FlexSym occurs in the field name position of any struct |
0xF0 | Delimited container end marker | May only be used when the FlexSym occurs in the field name position of a delimited struct |
0xF5 | Length-prefixed macro invocation | May be used when the FlexSym occurs in the field name position of any struct |
FlexSym
encoding of ''
(empty text) using an opcode
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│
0 0 0 0 0 0 0 1 01110111
└─────┬─────┘ └───┬──┘
2's comp. FixedInt 0x77,
zero System SID 23
(the empty symbol)
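The three-way dispatch on the leading FlexInt can be sketched as follows. This Python example is an illustration under stated assumptions: `read_flex_sym` and the tags `"sid"`, `"text"`, and `"opcode"` are hypothetical, and, as described above, the opcode byte is returned to the caller rather than evaluated.

```python
def _read_flex_int(data: bytes, offset: int):
    """Minimal FlexInt reader used by this sketch; returns (value, size)."""
    zeros = 0
    pos = offset
    while pos < len(data) and data[pos] == 0:
        zeros += 8
        pos += 1
    if pos == len(data):
        raise ValueError("truncated FlexInt")
    zeros += (data[pos] & -data[pos]).bit_length() - 1
    size = zeros + 1
    if offset + size > len(data):
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[offset:offset + size], "little", signed=True)
    return raw >> size, size


def read_flex_sym(data: bytes, offset: int = 0):
    """Return ((kind, payload), bytes_consumed) for a FlexSym (hypothetical helper)."""
    value, size = _read_flex_int(data, offset)
    if value > 0:
        return ("sid", value), size  # symbol ID; no more bytes follow
    if value < 0:
        start = offset + size
        end = start + (-value)  # absolute value is the UTF-8 byte count
        if end > len(data):
            raise ValueError("truncated inline symbol text")
        return ("text", data[start:end].decode("utf-8")), size + (-value)
    if offset + size >= len(data):
        raise ValueError("missing FlexSymOpCode byte")
    return ("opcode", data[offset + size]), size + 1  # caller interprets it
```

Applied to the bytes `01 77` from the last diagram, this returns the opcode `0x77`; subtracting the `0x60` bias gives system symbol ID 23.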
Opcodes
An opcode is a 1-byte FixedUInt
that tells the reader what the next expression represents
and how the bytes that follow should be interpreted.
The meanings of each opcode are organized loosely by their high and low nibbles.
High nibble | Low nibble | Meaning |
---|---|---|
0x0_ to 0x3_ | 0-F | E-expression with 6-bit address |
0x4_ | 0-F | E-expression with 12-bit address |
0x5_ | 0-F | E-expression with 20-bit address |
0x6_ | 0-8 | Integers from 0 to 8 bytes wide |
0x6_ | 9 | Reserved |
0x6_ | A-D | Floats |
0x6_ | E-F | Booleans |
0x7_ | 0-F | Decimals |
0x8_ | 0-C | Short-form timestamps |
0x8_ | D-F | Reserved |
0x9_ | 0-F | Strings |
0xA_ | 0-F | Symbols with inline text |
0xB_ | 0-F | Lists |
0xC_ | 0-F | S-expressions |
0xD_ | 0 | Empty struct |
0xD_ | 1 | Reserved |
0xD_ | 2-F | Structs |
0xE_ | 0 | Ion version marker |
0xE_ | 1-3 | Symbols with symbol address |
0xE_ | 4-6 | Annotations with symbol address |
0xE_ | 7-9 | Annotations with FlexSym text |
0xE_ | A | null.null |
0xE_ | B | Typed nulls |
0xE_ | C-D | NOP |
0xE_ | E | System symbol |
0xE_ | F | System macro invocation |
0xF_ | 0 | Delimited container end |
0xF_ | 1 | Delimited list start |
0xF_ | 2 | Delimited S-expression start |
0xF_ | 3 | Delimited struct start |
0xF_ | 4 | E-expression with FlexUInt macro address |
0xF_ | 5 | E-expression with FlexUInt macro address followed by FlexUInt length prefix |
0xF_ | 6 | Integer with FlexUInt length prefix |
0xF_ | 7 | Decimal with FlexUInt length prefix |
0xF_ | 8 | Timestamp with FlexUInt length prefix |
0xF_ | 9 | String with FlexUInt length prefix |
0xF_ | A | Symbol with FlexUInt length prefix and inline text |
0xF_ | B | List with FlexUInt length prefix |
0xF_ | C | S-expression with FlexUInt length prefix |
0xF_ | D | Struct with FlexUInt length prefix |
0xF_ | E | Blob with FlexUInt length prefix |
0xF_ | F | Clob with FlexUInt length prefix |
Values
Nulls
The opcode 0xEA
indicates an untyped null (that is: null
, or its alias null.null
).
The opcode 0xEB
indicates a typed null; a byte follows whose value represents an offset into the following table:
Byte | Type |
---|---|
0x00 | null.bool |
0x01 | null.int |
0x02 | null.float |
0x03 | null.decimal |
0x04 | null.timestamp |
0x05 | null.string |
0x06 | null.symbol |
0x07 | null.blob |
0x08 | null.clob |
0x09 | null.list |
0x0A | null.sexp |
0x0B | null.struct |
All other byte values are reserved for future use.
Encoding of null
┌──── The opcode `0xEA` represents a null (null.null)
EA
Encoding of null.string
┌──── The opcode `0xEB` indicates a typed null; a byte indicating the type follows
│ ┌──── Byte 0x05 indicates the type `string`
EB 05
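The typed-null table above amounts to a lookup by position. This Python sketch is illustrative; `encode_null` and `decode_typed_null` are hypothetical names, and type names are passed without the `null.` prefix.

```python
# Position in this list is the byte that follows opcode 0xEB.
NULL_TYPES = [
    "bool", "int", "float", "decimal", "timestamp", "string",
    "symbol", "blob", "clob", "list", "sexp", "struct",
]


def encode_null(type_name=None) -> bytes:
    """Encode an untyped null (no argument) or a typed null (hypothetical helper)."""
    if type_name is None or type_name == "null":
        return b"\xEA"
    return bytes([0xEB, NULL_TYPES.index(type_name)])  # ValueError if unknown


def decode_typed_null(type_byte: int) -> str:
    """Map the byte following 0xEB back to the text form of the null."""
    if not 0 <= type_byte < len(NULL_TYPES):
        raise ValueError("reserved typed-null byte")
    return "null." + NULL_TYPES[type_byte]
```

`encode_null("string")` gives `EB 05`, the same bytes as the example above.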
Booleans
0x6E
represents boolean true
, while 0x6F
represents boolean false
.
0xEB 0x00
represents null.bool
.
Encoding of boolean true
6E
Encoding of boolean false
6F
Encoding of null.bool
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: boolean
│ │
EB 00
Integers
Opcodes in the range 0x60
to 0x68
represent an integer. The opcode is followed by a FixedInt
that
represents the integer value. The low nibble of the opcode (0x_0
to 0x_8
) indicates the size of the FixedInt
.
Opcode 0x60
represents integer 0
; no more bytes follow.
Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF6
,
followed by a FlexUInt
indicating how many bytes of representation data follow.
0xEB 0x01
represents null.int
.
Encoding of integer 0
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 0 indicates
││ no more bytes follow.
60
Encoding of integer 17
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 1 indicates
││ a single byte follows.
61 11
└── FixedInt 17
Encoding of integer -944
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 2 indicates
││ that two bytes follow.
62 50 FC
└─┬─┘
FixedInt -944
Encoding of integer -944
┌──── Opcode F6 indicates a variable-length integer, FlexUInt length follows
│ ┌─── FlexUInt 2; a 2-byte FixedInt follows
│ │
F6 05 50 FC
└─┬─┘
FixedInt -944
Encoding of null.int
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: integer
│ │
EB 01
Floats
Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:
- 0 bits (0 bytes), representing the value 0e0 and indicated by opcode
0x6A
- 16 bits (2 bytes in little-endian order, half-precision),
indicated by opcode
0x6B
- 32 bits (4 bytes in little-endian order, single precision),
indicated by opcode
0x6C
- 64 bits (8 bytes in little-endian order, double precision),
indicated by opcode
0x6D
note
In the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.
0xEB 0x02
represents null.float
.
Encoding of float 0e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble A indicates
││ a 0-length float; 0e0
6A
Encoding of float 3.14e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble B indicates a 2-byte float
││
6B 47 42
└─┬─┘
half-precision 3.14
Encoding of float 3.1415927e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble C indicates a 4-byte,
││ single-precision value.
6C DB 0F 49 40
└────┬────┘
single-precision 3.1415927
Encoding of float 3.141592653589793e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble D indicates an 8-byte,
││ double-precision value.
6D 18 2D 44 54 FB 21 09 40
└──────────┬──────────┘
double-precision 3.141592653589793
Encoding of null.float
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: float
│ │
EB 02
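Choosing the smallest lossless width, as the note above permits, can be done by round-tripping the value through each narrower format. The Python sketch below is illustrative and `encode_float` is a hypothetical name; it relies on the standard `struct` module's `e`, `f`, and `d` formats, and it conservatively writes NaN as 64 bits because a NaN never compares equal to its round-tripped copy.

```python
import math
import struct


def encode_float(value: float) -> bytes:
    """Encode a float in the narrowest width that preserves it (hypothetical helper)."""
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"  # zero-length form is reserved for positive 0e0
    for opcode, fmt in ((0x6B, "<e"), (0x6C, "<f")):
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            continue  # magnitude too large for this width
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed
    return b"\x6D" + struct.pack("<d", value)
```

Under this strategy, `encode_float(1.5)` fits in half precision, while `encode_float(3.141592653589793)` needs all 8 bytes and matches the double-precision example above.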
Decimals
If an opcode has a high nibble of 0x7_
, it represents a decimal. Low nibble values indicate
the number of trailing bytes used to encode the decimal.
The body of the decimal is encoded as a FlexInt
representing its exponent, followed by a FixedInt
representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length
of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0
. When
the coefficient is present but has a value of 0
, the coefficient is -0
.
Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF7
.
0xEB 0x03
represents null.decimal
.
Encoding of decimal 0d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 0 indicates a zero-byte
││ decimal; 0d0
70
Encoding of decimal 7d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 01 07
| └─── Coefficient: 1-byte FixedInt 7
└─── Exponent: FlexInt 0
Encoding of decimal 1.27
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 FD 7F
| └─── Coefficient: FixedInt 127
└─── Exponent: 1-byte FlexInt -2
Variable-length encoding of decimal 1.27
┌──── Opcode F7 indicates a variable-length decimal
│
F7 05 FD 7F
| | └─── Coefficient: FixedInt 127
| └───── Exponent: 1-byte FlexInt -2
└─────── Decimal length: FlexUInt 2
Encoding of 0d3
, which has a coefficient of zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 1 indicates a 1-byte decimal
││
71 07
└────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0
Encoding of -0d3
, which has a coefficient of negative zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 07 00
| └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0
└────── Exponent: FlexInt 3
Encoding of null.decimal
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: decimal
│ │
EB 03
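The decimal layout (a FlexInt exponent followed by an optional FixedInt coefficient) can be assembled as in the Python sketch below. It is an illustration under assumptions: `encode_decimal` is a hypothetical name, the value is supplied as an integer coefficient and exponent, and negative zero is requested through a separate `negative_zero` flag because Python integers have no signed zero.

```python
def _flex_uint(value: int) -> bytes:
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def _flex_int(value: int) -> bytes:
    size = 1
    while not (-(1 << (7 * size - 1)) <= value < (1 << (7 * size - 1))):
        size += 1
    raw = ((value << size) | (1 << (size - 1))) & ((1 << (8 * size)) - 1)
    return raw.to_bytes(size, "little")


def encode_decimal(coefficient: int, exponent: int, negative_zero: bool = False) -> bytes:
    """Encode coefficient * 10**exponent as a decimal (hypothetical helper)."""
    if negative_zero and coefficient != 0:
        raise ValueError("negative_zero requires a coefficient of 0")
    if coefficient == 0 and exponent == 0 and not negative_zero:
        return b"\x70"  # zero-byte decimal: 0d0
    body = _flex_int(exponent)
    if negative_zero:
        body += b"\x00"  # an explicit zero coefficient byte means -0
    elif coefficient != 0:
        bits = (coefficient if coefficient > 0 else ~coefficient).bit_length()
        body += coefficient.to_bytes((bits + 8) // 8, "little", signed=True)
    if len(body) <= 15:
        return bytes([0x70 + len(body)]) + body
    return b"\xF7" + _flex_uint(len(body)) + body  # variable-length form
```

`encode_decimal(127, -2)` yields `72 FD 7F`, the encoding of 1.27 shown above, and `encode_decimal(0, 3, negative_zero=True)` yields `72 07 00`.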
Timestamps
Timestamps have two encodings:
- Short-form timestamps, a compact representation optimized for the most commonly used precisions and date ranges.
- Long-form timestamps, a less compact representation capable of representing any timestamp in the Ion data model.
0xEB 0x04
represents null.timestamp
.
Encoding of null.timestamp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: timestamp
│ │
EB 04
note
In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time.
Short-form Timestamps
If an opcode has a high nibble of 0x8_
, it represents a short-form timestamp. This encoding focuses on making the
most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via
the variable-length long form timestamp encoding.
Timestamps may be encoded using the short form if they meet all of the following conditions:
- The year is between 1970 and 2097. The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.
- The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes. 7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2024-08T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.
- The fractional seconds are a common precision. The timestamp's fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).
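The eligibility conditions above can be expressed as a simple predicate, together with the biasing used for the year and offset subfields. This Python sketch is illustrative only: the function names are hypothetical, offsets are given in minutes with `None` standing for an unknown offset, and `fraction_digits` of 0 stands for a timestamp without fractional seconds.

```python
UNKNOWN_OFFSET_FIELD = 0b1111111  # 7-bit value reserved for an unknown offset


def can_use_short_form(year: int, offset_minutes, fraction_digits: int) -> bool:
    """Check the three short-form conditions (hypothetical helper)."""
    if not (1970 <= year <= 2097):
        return False
    if offset_minutes is not None:
        in_range = -14 * 60 <= offset_minutes <= 14 * 60
        if not in_range or offset_minutes % 15 != 0:
            return False
    return fraction_digits in (0, 3, 6, 9)


def year_field(year: int) -> int:
    """Years are stored as the number of years since 1970 (7 bits)."""
    return year - 1970


def offset_field(offset_minutes) -> int:
    """Quarter hours counted from -14:00, so -14:00 maps to 0 and UTC to 56."""
    if offset_minutes is None:
        return UNKNOWN_OFFSET_FIELD
    return offset_minutes // 15 + 56
```

A timestamp that fails the predicate, such as one with an offset of +05:07, would need the long form instead.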
Opcodes by precision and offset
Each opcode with a high nibble of 0x8_
indicates a different precision and offset encoding pair.
Opcode | Precision | Serialized size in bytes [1] | Offset encoding |
---|---|---|---|
0x80 | Year | 1 | Implicitly Unknown offset |
0x81 | Month | 2 | Implicitly Unknown offset |
0x82 | Day | 2 | Implicitly Unknown offset |
0x83 | Hour and minutes | 4 | 1 bit to indicate UTC or Unknown Offset |
0x84 | Seconds | 5 | 1 bit to indicate UTC or Unknown Offset |
0x85 | Milliseconds | 6 | 1 bit to indicate UTC or Unknown Offset |
0x86 | Microseconds | 7 | 1 bit to indicate UTC or Unknown Offset |
0x87 | Nanoseconds | 8 | 1 bit to indicate UTC or Unknown Offset |
0x88 | Hour and minutes | 5 | 7 bits to represent a known offset. [2] |
0x89 | Seconds | 5 | 7 bits to represent a known offset. |
0x8A | Milliseconds | 7 | 7 bits to represent a known offset. |
0x8B | Microseconds | 8 | 7 bits to represent a known offset. |
0x8C | Nanoseconds | 9 | 7 bits to represent a known offset. |
0x8D | Reserved | -- | -- |
0x8E | Reserved | -- | -- |
0x8F | Reserved | -- | -- |
[1] Serialized size in bytes does not include the opcode.
[2] This encoding can also represent UTC and Unknown Offset, though it is less compact than opcodes 0x83-0x87 above.
The body of a short-form timestamp is encoded as a FixedUInt
of the size specified by the opcode. This integer is
then partitioned into bit-fields representing the timestamp's subfields. Note that endianness does not apply here because the
bit-fields are defined over the body interpreted as an integer.
The following letters are used to denote bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.
The Month
and Day
subfields are one-based; 0
is not a valid month or day.
Letter code | Number of bits | Subfield |
---|---|---|
Y | 7 | Year |
M | 4 | Month |
D | 5 | Day |
H | 5 | Hour |
m | 6 | Minute |
o | 7 | Offset |
U | 1 | Unknown (0) or UTC (1) offset |
s | 6 | Second |
f | 10 (ms) 20 (μs) 30 (ns) | Fractional second |
. | n/a | Unused |
We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.
7 0 <--- bit position
| |
+=========+
byte 0 | 0xNN | <-- hex notation for constants like opcodes
+=========+ <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
1 |nnnn:nnnn| <-- bits denoted with a `:` as a delimiter to aid in reading
+---------+ <-- octet boundary within an encoding primitive
...
+---------+
N |nnnn:nnnn|
+=========+
The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant.)
note
While this encoding may complicate human reading, it guarantees that the timestamp's subfields (year
, month
,
etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield,
fractional_seconds
, always begins at the same bit index when present, but can vary in length according to the
precision.) This arrangement allows processors to read the Little-Endian bytes into an integer and then mask the
appropriate bit ranges to access the subfields.
Encoding of a timestamp with year precision
+=========+
byte 0 | 0x80 |
+=========+
1 |.YYY:YYYY|
+=========+
Encoding of a timestamp with month precision
+=========+
byte 0 | 0x81 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |....:.MMM|
+=========+
Encoding of a timestamp with day precision
+=========+
byte 0 | 0x82 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+=========+
Encoding of a timestamp with hour-and-minutes precision at UTC or unknown offset
+=========+
byte 0 | 0x83 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |....:Ummm|
+=========+
Encoding of a timestamp with seconds precision at UTC or unknown offset
+=========+
byte 0 | 0x84 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |....:..ss|
+=========+
Encoding of a timestamp with milliseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x85 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |....:ffff|
+=========+
Encoding of a timestamp with microseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x86 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |ffff:ffff|
+---------+
7 |..ff:ffff|
+=========+
Encoding of a timestamp with nanoseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x87 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |ffff:ffff|
+=========+
Encoding of a timestamp with hour-and-minutes precision at known offset
+=========+
byte 0 | 0x88 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |....:..oo|
+=========+
Encoding of a timestamp with seconds precision at known offset
+=========+
byte 0 | 0x89 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+=========+
Encoding of a timestamp with milliseconds precision at known offset
+=========+
byte 0 | 0x8A |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |....:..ff|
+=========+
Encoding of a timestamp with microseconds precision at known offset
+=========+
byte 0 | 0x8B |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |....:ffff|
+=========+
Encoding of a timestamp with nanoseconds precision at known offset
+=========+
byte 0 | 0x8C |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |ffff:ffff|
+---------+
9 |..ff:ffff|
+=========+
Examples of short-form timestamps
Text | Binary |
---|---|
2023T | 80 35 |
2023-10-15T | 82 35 7D |
2023-10-15T11:22:33Z | 84 35 7D CB 1A 02 |
2023-10-15T11:22:33-00:00 | 84 35 7D CB 12 02 |
2023-10-15T11:22:33+01:15 | 89 35 7D CB EA 85 |
2023-10-15T11:22:33.444555666+01:15 | 8C 35 7D CB EA 85 92 61 7F 1A |
warning
Opcodes 0x8D
, 0x8E
, and 0x8F
are illegal; they are reserved for future use.
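The mask-and-shift approach described in the note above can be sketched in Python for the UTC/unknown-offset opcodes (0x80 to 0x87). This is an illustrative sketch rather than a reference implementation: the names `decode_short_timestamp`, `SUBFIELDS`, and `LAYOUTS` are hypothetical, and the year is assumed to be stored as an offset from 1970, which is what the example `2023T` (0x35, or 53) suggests.

```python
# Assumption: the 7-bit year subfield is an offset from 1970,
# as the example 2023T -> 0x35 (53) suggests.
EPOCH_YEAR = 1970

# Subfields in least-significant-first order: (name, width in bits).
SUBFIELDS = [("year", 7), ("month", 4), ("day", 5), ("hour", 5),
             ("minute", 6), ("utc", 1), ("second", 6)]

# opcode -> (body size in bytes, number of subfields present, fractional bits)
LAYOUTS = {
    0x80: (1, 1, 0), 0x81: (2, 2, 0), 0x82: (2, 3, 0), 0x83: (4, 6, 0),
    0x84: (5, 7, 0), 0x85: (6, 7, 10), 0x86: (7, 7, 20), 0x87: (8, 7, 30),
}

def decode_short_timestamp(data: bytes) -> dict:
    """Decode a short-form timestamp that uses the 1-bit UTC/unknown flag."""
    size, count, frac_bits = LAYOUTS[data[0]]
    # Read the body as one little-endian integer, then peel off bit-fields.
    body = int.from_bytes(data[1:1 + size], "little")
    fields, shift = {}, 0
    for name, width in SUBFIELDS[:count]:
        fields[name] = (body >> shift) & ((1 << width) - 1)
        shift += width
    fields["year"] += EPOCH_YEAR
    if frac_bits:
        fields["fraction"] = (body >> shift) & ((1 << frac_bits) - 1)
    return fields
```

For instance, `decode_short_timestamp(bytes.fromhex("84357DCB1A02"))` yields year 2023, month 10, day 15, hour 11, minute 22, second 33, with the `utc` bit set. Because every subfield sits at a fixed bit index, the same loop serves all eight opcodes.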
Long-form Timestamps
The short-form timestamp encoding is limited to the most commonly used timestamp ranges and precisions, for which it optimizes. The long-form timestamp encoding, by contrast, is capable of representing any valid timestamp.
The long form begins with opcode 0xF8
. A FlexUInt
follows indicating the number
of bytes that were needed to represent the timestamp. The encoding consumes the minimum number
of bytes required to represent the timestamp. The declared length can be mapped to the timestamp’s
precision as follows:
Length | Corresponding precision |
---|---|
0 | Illegal |
1 | Illegal |
2 | Year |
3 | Month or Day (see below) |
4 | Illegal; the hour cannot be specified without also specifying minutes |
5 | Illegal |
6 | Minutes |
7 | Seconds |
8 or more | Fractional seconds |
Unlike the short-form encoding, the long-form encoding reserves:
- 14 bits for the year (Y), which is not biased.
- 12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.
Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the
timestamp are encoded as bit-fields on a FixedUInt
that corresponds to the length that followed the opcode.
If the timestamp's overall length is greater than or equal to 8
, the FixedUInt
part of the timestamp is 7
bytes
and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a
(scale, coefficient)
pair, which is similar to a decimal. The primary difference is that the scale
represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to
1.0
or less than 0.0
. The scale is encoded as a FlexUInt
(instead of FlexInt
) to discourage the
encoding of decimal numbers greater than 1.0
. The coefficient is encoded as a FixedUInt
(instead of FixedInt
) to
prevent the encoding of fractional seconds less than 0.0
. Note that validation is still required; namely:
- A scale value of 0 is illegal, as that would result in a fractional seconds value of at least 1.0 (a whole second).
- If coefficient * 10^-scale >= 1.0, that (scale, coefficient) pair is illegal.
If the timestamp's length is 3
, the precision is determined by inspecting the day (DDDDD
) bits. Like the short-form,
the Month
and Day
subfields are one-based (0
is not a valid month or day). If the day subfield is zero, that
indicates month precision. If the day subfield is any non-zero number, that indicates day precision.
Encoding of the body of a long-form timestamp
+=========+
byte 0 |YYYY:YYYY|
+=========+
1 |MMYY:YYYY|
+---------+
2 |HDDD:DDMM|
+---------+
3 |mmmm:HHHH|
+---------+
4 |oooo:oomm|
+---------+
5 |ssoo:oooo|
+---------+
6 |....:ssss|
+=========+
7 |FlexUInt | <-- scale of the fractional seconds
+---------+
...
+=========+
N |FixedUInt| <-- coefficient of the fractional seconds
+---------+
...
Examples of long-form timestamps
Text | Binary |
---|---|
1947T | F8 05 9B 07 |
1947-12T | F8 07 9B 07 03 |
1947-12-23T | F8 07 9B 07 5F |
1947-12-23T11:22:33-00:00 | F8 0F 9B 07 DF 65 FD 7F 08 |
1947-12-23T11:22:33+01:15 | F8 0F 9B 07 DF 65 AD 57 08 |
1947-12-23T11:22:33.127+01:15 | F8 13 9B 07 DF 65 AD 57 08 07 7F |
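The long-form layout can be decoded along the same lines. The sketch below is an illustration under stated assumptions, not a definitive implementation: `decode_flex_uint` and `decode_long_timestamp` are hypothetical names, the `FlexUInt` rule used (count the trailing zero bits to find the byte length, then shift them away) is the one defined elsewhere in this specification, and the fractional seconds are rejected when they are not strictly less than 1.0.

```python
from decimal import Decimal

def decode_flex_uint(data: bytes, pos: int = 0):
    """Return (value, next position) for the FlexUInt starting at pos."""
    size, i = 1, pos
    while data[i] == 0:            # each all-zero byte adds 8 continuation bits
        size += 8
        i += 1
    size += (data[i] & -data[i]).bit_length() - 1   # trailing zero bits
    raw = int.from_bytes(data[pos:pos + size], "little")
    return raw >> size, pos + size

def decode_long_timestamp(data: bytes) -> dict:
    if data[0] != 0xF8:
        raise ValueError("not a long-form timestamp")
    length, pos = decode_flex_uint(data, 1)
    if length not in (2, 3) and length < 6:
        raise ValueError(f"illegal timestamp length {length}")
    body = int.from_bytes(data[pos:pos + min(length, 7)], "little")
    out = {"year": body & 0x3FFF}                   # 14 bits, not biased
    if length >= 3:
        out["month"] = (body >> 14) & 0xF
        day = (body >> 18) & 0x1F
        if day:                                     # day == 0 means month precision
            out["day"] = day
    if length >= 6:
        out["hour"] = (body >> 23) & 0x1F
        out["minute"] = (body >> 28) & 0x3F
        offset = (body >> 34) & 0xFFF               # minutes from -1440
        out["offset_minutes"] = None if offset == 0xFFF else offset - 1440
    if length >= 7:
        out["second"] = (body >> 46) & 0x3F
    if length >= 8:
        scale, p = decode_flex_uint(data, pos + 7)
        coefficient = int.from_bytes(data[p:pos + length], "little")
        if scale == 0 or coefficient >= 10 ** scale:
            raise ValueError("fractional seconds must be less than 1.0")
        out["fraction"] = Decimal(coefficient).scaleb(-scale)
    return out
```

Applied to the last row of the table above, the sketch reads scale 3 and coefficient 127, giving a fraction of 0.127 and an offset of 75 minutes.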
Strings
If the high nibble of the opcode is 0x9_
, it represents a string. The low nibble of the opcode
indicates how many UTF-8 bytes follow. Opcode 0x90
represents a string with empty text (""
).
Strings longer than 15 bytes can be encoded with the F9
opcode, which takes a FlexUInt
-encoded length
after the opcode.
0xEB 0x05 represents null.string.
Encoding of the empty string, ""
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
90
Encoding of a 14-byte string
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││ f o u r t e e n b y t e s
9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
└──────────────────┬────────────────────┘
UTF-8 bytes
Encoding of a 24-byte string
┌──── Opcode F9 indicates a variable-length string
│ ┌─── Length: FlexUInt 24
│ │ v a r i a b l e l e n g t h e n c o d i n g
F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
└────────────────────────────────┬────────────────────────────────────┘
UTF-8 bytes
Encoding of null.string
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: string
│ │
EB 05
Symbols
Symbols With Inline Text
If the high nibble of the opcode is 0xA_
, it represents a symbol whose text follows the opcode. The low nibble of the
opcode indicates how many UTF-8 bytes follow. Opcode 0xA0
represents a symbol with empty text (''
).
0xEB 0x06 represents null.symbol.
Encoding of a symbol with empty text (''
)
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
A0
Encoding of a symbol with 14 bytes of inline text
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││ f o u r t e e n b y t e s
AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
└──────────────────┬────────────────────┘
UTF-8 bytes
Encoding of a symbol with 24 bytes of inline text
┌──── Opcode FA indicates a variable-length symbol with inline text
│ ┌─── Length: FlexUInt 24
│ │ v a r i a b l e l e n g t h e n c o d i n g
FA 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
└────────────────────────────────┬────────────────────────────────────┘
UTF-8 bytes
Encoding of null.symbol
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: symbol
│ │
EB 06
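Strings and symbols with inline text share the same shape: a one-byte opcode carrying the length for up to 15 UTF-8 bytes, or a dedicated opcode followed by a `FlexUInt` length otherwise. The following Python sketch shows one way a writer might choose between the two; the function names are hypothetical, and `encode_flex_uint` follows the `FlexUInt` definition given elsewhere in this specification.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    if value < 0:
        raise ValueError("FlexUInt cannot be negative")
    size = 1
    while value >= 1 << (7 * size):   # each byte contributes 7 payload bits
        size += 1
    # The low `size` bits are the continuation pattern: a 1 preceded by zeros.
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")

def _encode_text(text: str, short_base: int, long_opcode: int) -> bytes:
    utf8 = text.encode("utf-8")
    if len(utf8) <= 15:
        return bytes([short_base | len(utf8)]) + utf8   # length in the low nibble
    return bytes([long_opcode]) + encode_flex_uint(len(utf8)) + utf8

def encode_string(text: str) -> bytes:
    return _encode_text(text, 0x90, 0xF9)

def encode_inline_symbol(text: str) -> bytes:
    return _encode_text(text, 0xA0, 0xFA)
```

Note that the length counts UTF-8 bytes, not characters, so a 15-character string containing non-ASCII text may still need the variable-length form.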
Symbols With a Symbol Address
Symbol values whose text can be found in the local symbol table are encoded using opcodes 0xE1
through 0xE3
:
- 0xE1 represents a symbol whose address in the symbol table (aka its symbol ID) is a 1-byte FixedUInt that follows the opcode.
- 0xE2 represents a symbol whose address in the symbol table is a 2-byte FixedUInt that follows the opcode.
- 0xE3 represents a symbol whose address in the symbol table is a FlexUInt that follows the opcode.
Writers MUST encode a symbol address in the smallest number of bytes possible. For each opcode above, the symbol address that is decoded is biased by the number of addresses that can be encoded in fewer bytes.
Opcode | Symbol address range | Bias |
---|---|---|
0xE1 | 0 to 255 | 0 |
0xE2 | 256 to 65,791 | 256 |
0xE3 | 65,792 to infinity | 65,792 |
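The biasing rule in the table can be sketched as follows. This is a minimal Python illustration, assuming little-endian `FixedUInt` bytes and the `FlexUInt` layout defined elsewhere in this specification; `encode_symbol_address` is a hypothetical name.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")

def encode_symbol_address(address: int) -> bytes:
    """Pick the smallest opcode for a symbol address and subtract its bias."""
    if address < 0:
        raise ValueError("symbol addresses are non-negative")
    if address < 256:
        return bytes([0xE1, address])                            # bias 0
    if address < 65_792:
        return b"\xE2" + (address - 256).to_bytes(2, "little")   # bias 256
    return b"\xE3" + encode_flex_uint(address - 65_792)          # bias 65,792
```

Because each range starts where the previous one ends, every address has exactly one encoding, which is what allows the "smallest number of bytes" requirement to be checked mechanically.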
System Symbols
System symbols (that is, symbols defined in the system module) can be encoded using the 0xEE
opcode followed by a 1-byte FixedUInt
representing an index in the system symbol table.
Unlike Ion 1.0, symbols are not required to use the lowest available SID for a given text, and system symbols MAY be encoded using other SIDs.
Encoding of the system symbol $ion
┌──── Opcode 0xEE indicates a system symbol
│ ┌─── FixedUInt 1 indicates system symbol 1
│ │
EE 01
Binary Data
Blobs
Opcode FE
indicates a blob of binary data. A FlexUInt
follows that represents the blob's byte-length.
0xEB 0x07 represents null.blob.
Example blob
encoding
┌──── Opcode FE indicates a blob, FlexUInt length follows
│ ┌─── Length: FlexUInt 24
│ │
FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
└────────────────────────────────┬────────────────────────────────────┘
24 bytes of binary data
Encoding of null.blob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: blob
│ │
EB 07
Clobs
Opcode FF
indicates a clob--binary character data of an unspecified encoding. A FlexUInt
follows that represents
the clob's byte-length.
0xEB 0x08 represents null.clob.
Example clob
encoding
┌──── Opcode FF indicates a clob, FlexUInt length follows
│ ┌─── Length: FlexUInt 24
│ │
FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
└────────────────────────────────┬────────────────────────────────────┘
24 bytes of binary data
Encoding of null.clob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: clob
│ │
EB 08
Lists
Length-prefixed encoding
An opcode with a high nibble of 0xB_
indicates a length-prefixed list. The lower nibble of the
opcode indicates how many bytes were used to encode the child values that the list contains.
If the list's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFB
opcode
to write a variable-length list. The 0xFB
opcode is followed by a FlexUInt
that indicates the list's byte length.
0xEB 0x09
represents null.list
.
Length-prefixed encoding of an empty list ([]
)
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 0 indicates that the child values of this
││ list took zero bytes to encode.
B0
Length-prefixed encoding of [1, 2, 3]
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this
││ list took six bytes to encode.
B6 61 01 61 02 61 03
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Length-prefixed encoding of ["variable length list"]
┌──── Opcode 0xFB indicates a variable-length list. A FlexUInt length follows.
│ ┌───── Length: FlexUInt 22
│ │ ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│ │ │ ┌─────── Length: FlexUInt 20
│ │ │ │ v a r i a b l e l e n g t h l i s t
FB 2d F9 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74
└─────────────────────────────┬─────────────────────────────────┘
Nested string element
Encoding of null.list
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: list
│ │
EB 09
Delimited Encoding
Opcode 0xF1
begins a delimited list, while opcode 0xF0
closes the most recently opened delimited container
that has not yet been closed.
Delimited encoding of an empty list ([]
)
┌──── Opcode 0xF1 indicates a delimited list
│ ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F1 F0
Delimited encoding of [1, 2, 3]
┌──── Opcode 0xF1 indicates a delimited list
│ ┌─── Opcode 0xF0 indicates the end of
│ │ the most recently opened container
F1 61 01 61 02 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Delimited encoding of [1, [2], 3]
┌──── Opcode 0xF1 indicates a delimited list
│ ┌─── Opcode 0xF1 begins a nested delimited list
│ │ ┌─── Opcode 0xF0 closes the most recently
│ │ │ opened delimited container: the nested list.
│ │ │ ┌─── Opcode 0xF0 closes the most recently opened (and
│ │ │ │ still open) delimited container: the outer list.
│ │ │ │
F1 61 01 F1 61 02 F0 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
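The two list encodings trade buffering for size: a length-prefixed list needs the total byte length of its children before the opcode can be written, while a delimited list can be streamed. A short Python sketch, taking already-encoded child values as input (the function names are hypothetical, and `encode_flex_uint` follows the `FlexUInt` definition given elsewhere in this specification):

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")

def encode_list_length_prefixed(children: list) -> bytes:
    """children: a list of bytes objects, each one an encoded value."""
    body = b"".join(children)
    if len(body) <= 15:
        return bytes([0xB0 | len(body)]) + body       # byte length in the low nibble
    return b"\xFB" + encode_flex_uint(len(body)) + body

def encode_list_delimited(children: list) -> bytes:
    # 0xF1 opens the list; 0xF0 closes the most recently opened container.
    return b"\xF1" + b"".join(children) + b"\xF0"
```

Nesting falls out naturally: passing the output of `encode_list_delimited` as a child of another delimited list reproduces the `[1, [2], 3]` example above.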
S-Expressions
S-expressions use the same encodings as lists, but with different opcodes.
Opcode | Encoding |
---|---|
0xC0 -0xCF | Length-prefixed S-expression; low nibble of the opcode represents the byte-length. |
0xFC | Variable-length prefixed S-expression; a FlexUInt following the opcode represents the byte-length. |
0xF2 | Starts a delimited S-expression; 0xF0 closes the most recently opened delimited container. |
0xEB 0x0A
represents null.sexp
.
Length-prefixed encoding
Length-prefixed encoding of an empty S-expression (()
)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 0 indicates that the child values of this S-expression
││ took zero bytes to encode.
C0
Length-prefixed encoding of (1 2 3)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 6 indicates that the child values of this S-expression
││ took six bytes to encode.
C6 61 01 61 02 61 03
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Length-prefixed encoding of ("variable length sexp")
┌──── Opcode 0xFC indicates a variable-length sexp. A FlexUInt length follows.
│ ┌───── Length: FlexUInt 22
│ │ ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│ │ │ ┌─────── Length: FlexUInt 20
│ │ │ │ v a r i a b l e l e n g t h s e x p
FC 2D F9 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70
└─────────────────────────────┬─────────────────────────────────┘
Nested string element
Encoding of null.sexp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: sexp
│ │
EB 0A
Delimited encoding
Delimited encoding of an empty S-expression (()
)
┌──── Opcode 0xF2 indicates a delimited S-expression
│ ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F2 F0
Delimited encoding of (1 2 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│ ┌─── Opcode 0xF0 indicates the end of
│ │ the most recently opened container
F2 61 01 61 02 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Delimited encoding of (1 (2) 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│ ┌─── Opcode 0xF2 begins a nested delimited S-expression
│ │ ┌─── Opcode 0xF0 closes the most recently
│ │ │ opened delimited container: the nested S-expression.
│ │ │ ┌─── Opcode 0xF0 closes the most recently opened (and
│ │ │ │ still open) delimited container: the outer S-expression.
│ │ │ │
F2 61 01 F2 61 02 F0 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Structs
Length-prefixed encoding
If the high nibble of the opcode is 0xD_
, it represents a struct. The lower nibble of the opcode
indicates how many bytes were used to encode all of its nested (field name, value)
pairs. Opcode
0xD0
represents an empty struct.
warning
Opcode 0xD1
is illegal. Non-empty structs must have at least two bytes: a field name and a value.
If the struct's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFD
opcode
to write a variable-length struct. The 0xFD
opcode is followed by a FlexUInt
that indicates the byte length.
Each field in the struct is encoded as a FlexUInt
representing the address of the field name's
text in the symbol table, followed by an opcode-prefixed value.
0xEB 0x0B
represents null.struct
.
Length-prefixed encoding of an empty struct ({}
)
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode
D0
Length-prefixed encoding of {$10: 1, $11: 2}
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│ ┌─── Field name: FlexUInt 10 ($10)
│ │ ┌─── Field name: FlexUInt 11 ($11)
│ │ │
D6 15 61 01 17 61 02
└─┬─┘ └─┬─┘
1 2
Length-prefixed encoding of {$10: "variable length struct"}
┌───────────── Opcode `FD` indicates a struct with a FlexUInt length prefix
│ ┌────────── Length: FlexUInt 25
│ │ ┌─────── Field name: FlexUInt 10 ($10)
│ │ │ ┌──── Opcode `F9` indicates a variable length string
│ │ │ │ ┌─ FlexUInt: 22 the string is 22 bytes long
│ │ │ │ │ v a r i a b l e l e n g t h s t r u c t
FD 33 15 F9 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74
└─────────────────────────────┬─────────────────────────────────┘
UTF-8 bytes
Encoding of null.struct
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: struct
│ │
EB 0B
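A length-prefixed struct that uses only `FlexUInt` field names can be assembled in the same way as a list, with each value preceded by its field name's symbol address. The Python sketch below is illustrative; `encode_struct` is a hypothetical name, and symbol address 0 is rejected because a `FlexUInt` zero in field name position has a special meaning in this encoding.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")

def encode_struct(fields: list) -> bytes:
    """fields: a list of (symbol_address, encoded_value_bytes) pairs."""
    body = b""
    for address, value in fields:
        if address <= 0:
            # FlexUInt 0 in field name position is not a plain field name.
            raise ValueError("symbol address must be positive in this mode")
        body += encode_flex_uint(address) + value
    if len(body) <= 15:
        # A length of 1 cannot occur: every field is at least two bytes.
        return bytes([0xD0 | len(body)]) + body
    return b"\xFD" + encode_flex_uint(len(body)) + body
```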
Optional FlexSym
field name encoding
By default, all struct field names are encoded as FlexUInt
symbol addresses.
However, a writer has the option of encoding the field names as FlexSym
s instead,
granting additional flexibility at the expense of some compactness.
Writing field names as FlexSyms allows the writer to:
- encode the UTF-8 bytes of the field name inline (for example, to avoid modifying the symbol table).
- call a macro whose output (another struct) will be merged into the current struct.
- encode the field name as a symbol address if it's already in the symbol table (just like a FlexUInt would, but slightly less compactly).
To switch to FlexSym
field names, the writer emits a FlexUInt
zero
(byte 0x01
) in field name position to inform the reader that subsequent field names will be encoded
as FlexSym
s.
This switch is one way. Once the writer switches to using FlexSym
, the encoding cannot be switched
back to FlexUInt
for the remainder of the struct.
Switching to FlexSym
while encoding {$10: 1, foo: 2, $11: 3}
In this example, the writer switches to FlexSym
field names before encoding foo
so it can write the UTF-8 bytes inline.
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│┌─── A lower nibble of D indicates that the struct's fields took 13 bytes to encode
││ ┌─── Field name: FlexUInt 10 ($10)
││ │        ┌─── FlexUInt 0: Switch to FlexSym field name encoding
││ │        │  ┌─── FlexSym -3: 3 UTF-8 bytes follow
││ │        │  │                 ┌─── Field name: FlexSym 11 ($11)
││ │        │  │  f  o  o        │
DD 15 61 01 01 FB 66 6F 6F 61 02 17 61 03
      └─┬─┘                └─┬─┘    └─┬─┘
        1                    2        3
note
Because FlexUInt
zero indicates a mode switch, encoding symbol ID $0
requires switching to FlexSym
.
Length-prefixed encoding of {$0: 1}
┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 6
││ ┌── FlexUInt 0 in the field name position indicates that the struct
││ │   is switching to FlexSym mode
││ │  ┌── FlexSym "escape"
││ │  │  ┌── Symbol address: 1-byte FixedUInt follows
││ │  │  │  ┌─ FixedUInt 0
││ │  │  │  │
D6 01 01 E1 00 61 01
      └───┬──┘ └─┬─┘
          $0     1
Delimited encoding
Opcode 0xF3
indicates the beginning of a delimited struct. Unlike length-prefixed structs,
delimited structs always encode their field names as FlexSym
s.
Unlike lists and S-expressions, structs cannot use opcode 0xF0
by itself to indicate the end of the delimited
container. This is because 0xF0
is a valid FlexSym
(a symbol with 16 bytes of inline text). To close the delimited
struct, the writer emits a 0x01
byte (a FlexSym
escape) followed by the opcode 0xF0
.
Delimited encoding of the empty struct ({}
)
┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│ ┌─── FlexSym escape code 0 (0x01): an opcode follows
│ │ ┌─── Opcode 0xF0 indicates the end of the most
│ │ │ recently opened delimited container
F3 01 F0
note
It is much more compact to write 0xD0
—the empty length-prefixed struct.
Delimited encoding of {"foo": 1, $11: 2}
┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│
│ ┌─ FlexSym -3 ┌─ FlexSym: 11 ($11)
│ │ │ ┌─── FlexSym escape code 0 (0x01): an opcode follows
│ │ │ │ ┌─── Opcode 0xF0 indicates the end of the most
│ │ f o o │ │ │ recently opened delimited container
F3 FB 66 6F 6F 61 01 17 61 02 01 F0
└──┬───┘ └─┬─┘ └─┬─┘
3 UTF-8 1 2
bytes
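The delimited struct encoding can be sketched in Python as below. The names are hypothetical, and the sketch covers only what the examples in this section show: a positive `FlexSym` for a symbol address, a negative `FlexSym` followed by inline UTF-8 bytes, the escape sequence `01 E1 00` for `$0`, and the `01 F0` terminator. Empty field name text and macro invocations in field name position are deliberately left out.

```python
from typing import Union

def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (two's complement, little-endian)."""
    size = 1
    while not -(1 << (7 * size - 1)) <= value < (1 << (7 * size - 1)):
        size += 1
    raw = ((value << size) | (1 << (size - 1))) & ((1 << (8 * size)) - 1)
    return raw.to_bytes(size, "little")

def encode_flex_sym(name: Union[int, str]) -> bytes:
    if isinstance(name, str):
        utf8 = name.encode("utf-8")
        if not utf8:
            raise NotImplementedError("empty text is not covered by this sketch")
        return encode_flex_int(-len(utf8)) + utf8   # negative length, then text
    if name == 0:
        return b"\x01\xE1\x00"                      # escape + opcode E1 + address 0
    return encode_flex_int(name)                    # positive symbol address

def encode_delimited_struct(fields: list) -> bytes:
    """fields: a list of (name, encoded_value_bytes); name is an int or str."""
    body = b"".join(encode_flex_sym(name) + value for name, value in fields)
    # 0x01 is the FlexSym escape; 0xF0 then closes the delimited container.
    return b"\xF3" + body + b"\x01\xF0"
```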
Encoding Expressions
note
This chapter focuses on the binary encoding of e-expressions. The Macros section explains what they are and how they are used.
E-expression with the address in the opcode
If the value of the opcode is less than 64 (0x40), it represents an E-expression invoking the macro at the corresponding address (an offset within the local macro table).
Invocation of macro address 7
┌──── Opcode in 00-3F range indicates an e-expression
│ where the opcode value is the macro address
│
07
└── FixedUInt 7
Invocation of macro address 31
┌──── Opcode in 00-3F range indicates an e-expression
│ where the opcode value is the macro address
│
1F
└── FixedUInt 31
Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section E-expression argument encoding.
E-expressions with biased FixedUInt
addresses
While E-expressions invoking macro addresses in the range [0, 63]
can be encoded in a single byte using
E-expressions with the address in the opcode,
many applications will benefit from defining more than 64 macros. The 0x4_ and 0x5_ opcodes can be used to represent macro addresses up to 1,052,735 (the largest 0x5_ bias, 987,200, plus the largest 2-byte FixedUInt, 65,535). In both encodings, the address is biased by the total number of addresses with lower opcodes.
If the high nibble of the opcode is 0x4_
, then a biased address follows as a 1-byte FixedUInt
.
For 0x4_
, the bias is 256 * low_nibble + 64
(or (low_nibble << 8) + 64
).
If the high nibble of the opcode is 0x5_
, then a biased address follows as a 2-byte FixedUInt
.
For 0x5_, the bias is 65536 * low_nibble + 4160 (or (low_nibble << 16) + 4160).
Invocation of macro address 841
┌──── Opcode in range 40-4F indicates a macro address with 1-byte FixedUInt address
│┌─── Low nibble 3 indicates bias of 832
││
43 09
│
└─── FixedUInt 9
Biased Address : 9
Bias : 832
Address : 841
Invocation of macro address 142918
┌──── Opcode in range 50-5F indicates a macro address with 2-byte FixedUInt address
│┌─── Low nibble 2 indicates bias of 135232
││
52 06 1E
└─┬─┘
└─── FixedUInt 7686
Biased Address : 7686
Bias : 135232
Address : 142918
Macro address range biases for 0x4_
and 0x5_
Low Nibble | 0x4_ Bias | 0x5_ Bias |
---|---|---|
0 | 64 | 4160 |
1 | 320 | 69696 |
2 | 576 | 135232 |
3 | 832 | 200768 |
4 | 1088 | 266304 |
5 | 1344 | 331840 |
6 | 1600 | 397376 |
7 | 1856 | 462912 |
8 | 2112 | 528448 |
9 | 2368 | 593984 |
A | 2624 | 659520 |
B | 2880 | 725056 |
C | 3136 | 790592 |
D | 3392 | 856128 |
E | 3648 | 921664 |
F | 3904 | 987200 |
E-expression with the address as a trailing FlexUInt
The opcode 0xF4
indicates an e-expression whose address is encoded as a trailing FlexUInt
with no bias.
This encoding is less compact for addresses that can be encoded using opcodes 0x5F and below, but it is one of only two opcodes, along with 0xF5, that can be used for macro addresses greater than 1,052,735.
Invocation of macro address 4
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 09
│
└─── FlexUInt 4
Invocation of macro address 1_100_000
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 04 47 86
└──┬───┘
└─── FlexUInt 1,100,000
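Taken together, the three address encodings described so far suggest a simple selection rule for writers: use the smallest opcode range that can hold the address. The Python sketch below illustrates this; `encode_macro_address` is a hypothetical name, the range boundaries are derived from the bias table above, and `FixedUInt` bytes are assumed to be little-endian, as in the `52 06 1E` example.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")

def encode_macro_address(address: int) -> bytes:
    """Encode the opcode (and address bytes, if any) of an e-expression."""
    if address < 0:
        raise ValueError("macro addresses are non-negative")
    if address < 64:
        return bytes([address])                     # the opcode is the address
    if address < 4160:
        biased = address - 64
        # High bits select the low nibble; the low 8 bits are the FixedUInt.
        return bytes([0x40 | (biased >> 8), biased & 0xFF])
    if address < 1_052_736:
        biased = address - 4160
        return bytes([0x50 | (biased >> 16)]) + (biased & 0xFFFF).to_bytes(2, "little")
    return b"\xF4" + encode_flex_uint(address)      # unbiased trailing FlexUInt
```

Splitting the biased value into a nibble and a `FixedUInt` is equivalent to looking up the bias in the table and subtracting it.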
E-expression with the address as a FlexUInt
and length as a trailing FlexUInt
The opcode 0xF5
indicates an e-expression whose address is encoded as a FlexUInt
with no bias,
followed by a FlexUInt
that represents the length in bytes of the remainder of the expression. Although this encoding is less
compact than other e-expression encodings, it allows for readers to quickly seek to the end of the expression if the user requires
only partial evaluation.
Invocation of macro address 4
with two tagged arguments
┌──── Opcode F5 indicates an e-expression with a `FlexUInt` macro address followed by a `FlexUInt` length
│
│
F5 09 07 60 61 01
│ │ │ └─┬─┘
│ │ │ └─ Tagged integer 1
│ │ └─ Tagged integer 0
│ └─ FlexUInt 3 (the remaining length of the expression)
└─── FlexUInt 4 (the macro address)
Invocation of macro address 1_100_000
with no arguments
┌──── Opcode F5 indicates an e-expression with a `FlexUInt` macro address followed by a `FlexUInt` length
│
│
F5 04 47 86 01
└──┬───┘ │
│ └─ FlexUInt 0 (the remaining length of the expression)
└─── FlexUInt 1,100,000 (the macro address)
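The benefit of the length prefix is that a reader can step over the whole expression without understanding its arguments. A small Python sketch of both directions follows; the function names are hypothetical, and the `FlexUInt` helpers implement the layout defined elsewhere in this specification.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")

def decode_flex_uint(data: bytes, pos: int = 0):
    """Return (value, next position) for the FlexUInt starting at pos."""
    size, i = 1, pos
    while data[i] == 0:
        size += 8
        i += 1
    size += (data[i] & -data[i]).bit_length() - 1   # trailing zero bits
    return int.from_bytes(data[pos:pos + size], "little") >> size, pos + size

def encode_length_prefixed_eexp(address: int, encoded_args: bytes) -> bytes:
    return (b"\xF5" + encode_flex_uint(address)
            + encode_flex_uint(len(encoded_args)) + encoded_args)

def end_of_eexp(data: bytes, pos: int = 0) -> int:
    """Return the position just past an 0xF5 e-expression starting at pos."""
    if data[pos] != 0xF5:
        raise ValueError("not a length-prefixed e-expression")
    _address, pos = decode_flex_uint(data, pos + 1)
    length, pos = decode_flex_uint(data, pos)
    return pos + length                              # skip the arguments unread
```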
System Macro Invocations
E-expressions that invoke a system macro can be encoded using the 0xEF
opcode followed by a 1-byte FixedUInt
representing an index in the system macro table.
Encoding of the system macro values
┌──── Opcode 0xEF indicates a system symbol or macro invocation
│ ┌─── FixedUInt 1 indicates macro 1 from the system macro table
│ │
EF 01
In addition, system macros MAY be invoked using any of the 0x00
-0x5F
or 0xF4
-0xF5
opcodes, provided that the macro being invoked has been given an address in user macro address space.
For more information about managing the macro address space, see the Modules section.
E-expression argument encoding
The example invocations in prior sections have demonstrated how to encode an invocation of the simplest form of macro--one with no parameters. This section explains how to encode macro invocations when they take parameters of different encodings and cardinalities.
To begin, we will examine how arguments are encoded when all of the macro's parameters use the tagged encoding and have a cardinality of exactly-one.
Tagged encoding
When a macro parameter does not specify an encoding (the parameter name is not annotated), arguments passed to that parameter use the 'tagged' encoding. The argument begins with a leading opcode that dictates how to interpret the bytes that follow.
This is the same encoding used for values in other Ion 1.1 contexts like lists, s-expressions, or at the top level.
Encoding a single exactly-one
argument
A parameter with a cardinality of exactly-one expects its corresponding argument to be encoded as a single expression of the parameter's declared encoding. (The following section will explore the available encodings in greater depth; for now, our examples will be limited to parameters using the tagged encoding.)
When the macro has a single exactly-one
parameter, the corresponding encoded argument follows the opcode and (if separate) the encoded address.
Example encoding of an e-expression with a tagged, exactly-one
argument
Macro definition
(:set_macros
(foo (x) /*...*/)
)
Text e-expression
(:foo 1)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression invoking
│ the macro at address 0.
│ ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│ ┌─┴─┐
00 61 01
Binary e-expression using a trailing FlexUInt
address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│ ┌──── FlexUInt 0: Macro address 0
│ │ ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│ │ ┌─┴─┐
F4 01 61 01
Encoding multiple exactly-one arguments
If the macro has more than one parameter, a reader would iterate over the parameters declared in the macro signature from left to right. For each parameter, the reader would use the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing of the e-expression's arguments is complete.
Example encoding of an e-expression with multiple tagged, exactly-one arguments
Macro definition
(:set_macros
(foo (a b c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│ │ ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│ │ │ ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 61 01 61 02 61 03
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│ ┌──── FlexUInt 0: Macro address 0
│ │ ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│ │ │ ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│ │ │ │ ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│ │ │ │ │
│ │ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
F4 01 61 01 61 02 61 03
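The left-to-right walk over the signature can be sketched in Python. This is a minimal illustration under stated assumptions, not a conforming reader: the function names and the dict-based macro table are hypothetical, only the two e-expression forms shown above (address in the opcode, and opcode 0xF4 with a trailing FlexUInt address) are handled, and the only tagged value understood is opcode 0x61 (a 1-byte integer). The FlexUInt layout is inferred from the byte values used in this section's examples.

```python
def read_flex_uint(buf, pos):
    """Decode a FlexUInt at buf[pos]; return (value, next_pos)."""
    size, i = 1, pos
    while buf[i] == 0:                    # an all-zero byte contributes 8 length bits
        size, i = size + 8, i + 1
    size += (buf[i] & -buf[i]).bit_length() - 1   # trailing zero bits of that byte
    raw = int.from_bytes(buf[pos:pos + size], "little")
    return raw >> size, pos + size


def read_tagged_value(buf, pos):
    """Read one tagged value. This sketch only understands opcode 0x61."""
    opcode = buf[pos]
    if opcode != 0x61:
        raise NotImplementedError(f"opcode 0x{opcode:02X} is outside this sketch")
    return int.from_bytes(buf[pos + 1:pos + 2], "little", signed=True), pos + 2


def read_eexp(buf, macro_table):
    """Return (address, args, next_pos) for an e-expression starting at buf[0]."""
    opcode, pos = buf[0], 1
    if opcode < 0x40:                     # address carried in the opcode itself
        address = opcode
    elif opcode == 0xF4:                  # address follows as a FlexUInt
        address, pos = read_flex_uint(buf, pos)
    else:
        raise NotImplementedError(f"opcode 0x{opcode:02X} is outside this sketch")
    args = {}
    for name in macro_table[address]:     # signature order, left to right
        args[name], pos = read_tagged_value(buf, pos)
    return address, args, pos


# Hypothetical macro table: address 0 is `foo` with parameters (a b c).
table = {0: ["a", "b", "c"]}
print(read_eexp(bytes.fromhex("00610161026103"), table))
print(read_eexp(bytes.fromhex("F401610161026103"), table))
```

Both calls decode the same three arguments; only the number of bytes consumed differs, because the second form spends one extra byte on the address.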
Tagless Encodings
In contrast to the tagged encoding, tagless encodings do not begin with an opcode.
This means that they are potentially more compact than the tagged encoding, but are also less flexible. Because tagless encodings do not have an opcode, they cannot represent e-expressions, annotation sequences, or null values of any kind.
Tagless encodings consist of the primitive encodings and macro shapes.
Primitive encodings
Primitive encodings are self-delineating, either by having a statically known size in bytes or by including length information in their serialized form.
Ion type | Primitive encoding | Size in bytes | Encoding |
---|---|---|---|
int | uint8 | 1 | FixedUInt |
int | uint16 | 2 | FixedUInt |
int | uint32 | 4 | FixedUInt |
int | uint64 | 8 | FixedUInt |
int | flex_uint | variable | FlexUInt |
int | int8 | 1 | FixedInt |
int | int16 | 2 | FixedInt |
int | int32 | 4 | FixedInt |
int | int64 | 8 | FixedInt |
int | flex_int | variable | FlexInt |
float | float16 | 2 | Little-endian IEEE-754 half-precision float |
float | float32 | 4 | Little-endian IEEE-754 single-precision float |
float | float64 | 8 | Little-endian IEEE-754 double-precision float |
symbol | flex_sym | variable | FlexSym |
Example encoding of an e-expression with primitive, exactly-one arguments
As first demonstrated in Encoding multiple exactly-one arguments, the bytes of the serialized arguments begin immediately after the e-expression's opcode and (if separate) the macro address. The reader iterates over the parameters in the macro signature in the order they are declared. For each parameter, the reader uses the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing is complete.
Macro definition
(:set_macros
(foo (flex_uint::a int8::b uint16::c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌─── Argument 'a': FlexUInt 1
│ │ ┌─── Argument 'b': 1-byte FixedInt 2
│ │ │ ┌─── Argument 'c': 2-byte FixedUInt 3
│ │ │ ┌─┴─┐
00 03 02 03 00
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│ ┌──── FlexUInt 0: Macro address 0
│ │ ┌─── Argument 'a': FlexUInt 1
│ │ │ ┌─── Argument 'b': 1-byte FixedInt 2
│ │ │ │ ┌─── Argument 'c': 2-byte FixedUInt 3
│ │ │ │ ┌─┴─┐
F4 01 03 02 03 00
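As a rough sketch of how a reader might dispatch on a parameter's declared primitive encoding, the Python below decodes the argument bytes from the example above. The helper names are hypothetical, flex_sym is deliberately left out, and the FlexUInt/FlexInt layout (little-endian, with the byte count signalled by the low-order bits) is an assumption inferred from the byte values in this section's examples.

```python
import struct

# (size in bytes, signed?) for the fixed-width integer encodings in the table.
FIXED_INTS = {
    "uint8": (1, False), "uint16": (2, False), "uint32": (4, False), "uint64": (8, False),
    "int8": (1, True), "int16": (2, True), "int32": (4, True), "int64": (8, True),
}
# struct format strings for little-endian half, single and double precision.
FLOATS = {"float16": "<e", "float32": "<f", "float64": "<d"}


def read_flex(buf, pos, signed):
    """Decode a FlexUInt (signed=False) or FlexInt (signed=True)."""
    size, i = 1, pos
    while buf[i] == 0:                    # an all-zero byte contributes 8 length bits
        size, i = size + 8, i + 1
    size += (buf[i] & -buf[i]).bit_length() - 1
    raw = int.from_bytes(buf[pos:pos + size], "little", signed=signed)
    return raw >> size, pos + size


def read_primitive(encoding, buf, pos):
    """Read one tagless primitive; return (value, next_pos)."""
    if encoding in FIXED_INTS:
        size, signed = FIXED_INTS[encoding]
        return int.from_bytes(buf[pos:pos + size], "little", signed=signed), pos + size
    if encoding in FLOATS:
        fmt = FLOATS[encoding]
        return struct.unpack_from(fmt, buf, pos)[0], pos + struct.calcsize(fmt)
    if encoding in ("flex_uint", "flex_int"):
        return read_flex(buf, pos, signed=(encoding == "flex_int"))
    raise NotImplementedError(encoding)   # e.g. flex_sym is not covered here


def read_args(signature, buf, pos):
    """Walk the signature left to right, reading one primitive per parameter."""
    args = {}
    for encoding, name in signature:
        args[name], pos = read_primitive(encoding, buf, pos)
    return args, pos


signature = [("flex_uint", "a"), ("int8", "b"), ("uint16", "c")]
# Start at offset 1 to skip the e-expression opcode (0x00).
print(read_args(signature, bytes.fromhex("0003020300"), 1))
```

Because every primitive is self-delineating, the reader never needs to look ahead: each call returns the position at which the next parameter's bytes begin.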
Macro shapes
The term macro shape describes a macro that is being used as the encoding of an E-expression argument. A parameter using a macro shape as its encoding is sometimes called a macro-shaped parameter. For example, consider the following two macro definitions.
The point2D macro takes two flex_int-encoded values as arguments.
(macro point2D (flex_int::x flex_int::y)
{
x: (%x),
y: (%y),
}
)
The line macro takes a pair of point2D invocations as arguments.
(macro line (point2D::start point2D::end)
{
start: (%start),
end: (%end),
}
)
Normally an e-expression would begin with an opcode and an address communicating what comes next. However, when we're reading the argument for a macro-shaped parameter, the macro being invoked is inferred from the parent macro signature instead. As such, there is no need to include an opcode or address.
┌──── Opcode 0x01 is less than 0x40; this is an e-expression
│     invoking the macro at address 1: `line`
│    ┌─── Argument $start: an implicit invocation of macro `point2D`
│    │     ┌─── Argument $end: an implicit invocation of macro `point2D`
│  ┌─┴─┐ ┌─┴─┐
01 03 05 07 09
   │  │  │  └──── $end/$y: FlexInt 4
   │  │  └─────── $end/$x: FlexInt 3
   │  └────────── $start/$y: FlexInt 2
   └───────────── $start/$x: FlexInt 1
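The way a macro-shaped argument is read without its own opcode or address can be sketched as a recursive walk over signatures. The Python below is illustrative only: the MACROS dict and function names are hypothetical, only flex_int and macro-shaped parameters are supported, and the FlexInt layout is inferred from the byte values in the diagram above.

```python
# Hypothetical signature table: each parameter is (declared encoding, name).
MACROS = {
    "point2D": [("flex_int", "x"), ("flex_int", "y")],
    "line": [("point2D", "start"), ("point2D", "end")],
}


def read_flex_int(buf, pos):
    """Decode a FlexInt at buf[pos]; return (value, next_pos)."""
    size, i = 1, pos
    while buf[i] == 0:                    # an all-zero byte contributes 8 length bits
        size, i = size + 8, i + 1
    size += (buf[i] & -buf[i]).bit_length() - 1
    raw = int.from_bytes(buf[pos:pos + size], "little", signed=True)
    return raw >> size, pos + size        # arithmetic shift keeps the sign


def read_args(macro_name, buf, pos):
    """Read the arguments of `macro_name`, recursing into macro shapes."""
    args = {}
    for encoding, name in MACROS[macro_name]:
        if encoding in MACROS:
            # Macro-shaped parameter: the macro is known from the signature,
            # so the argument bytes start immediately, with no opcode or address.
            args[name], pos = read_args(encoding, buf, pos)
        elif encoding == "flex_int":
            args[name], pos = read_flex_int(buf, pos)
        else:
            raise NotImplementedError(encoding)
    return args, pos


# The four argument bytes that follow the `line` opcode in the diagram above.
print(read_args("line", bytes.fromhex("03050709"), 0))
```

The nested result mirrors the structure of the signature: two point2D argument sets, each holding an x and a y.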
Any macro can be used as a macro shape except for constants--macros which take zero parameters. Constants cannot be used as a macro shape because their serialized representation would be empty, making it impossible to encode them in expression groups. However, this limitation does not sacrifice any expressiveness; the desired constant can always be invoked directly in the body of the macro.
(:add_macros
// Defines a constant 'hostname'
(hostname () "abc123.us_west.example.com")
(http_ok (hostname::server page)
// └── ERROR: cannot use a constant as a macro shape
{
server: (%server),
page: (%page),
message: OK,
status: 200,
}
)
(http_ok (page)
{
server: (.hostname),
// └── OK: invokes constant as needed
page: (%page),
message: OK,
status: 200,
}
)
)
Encoding variadic arguments
The preceding sections have described how to (de)serialize the various parameter encodings, but these parameters have always had the same cardinality: exactly-one.
This section explains how to encode e-expressions invoking a macro whose signature contains variadic parameters--parameters with a cardinality of zero-or-one, zero-or-more, or one-or-more.
Argument Encoding Bitmap (AEB)
If a macro signature has one or more variadic parameters, then e-expressions invoking that macro will include an additional construct: the Argument Encoding Bitmap (AEB). This little-endian byte sequence precedes the first serialized argument and indicates how each argument corresponding to a variadic parameter has been encoded.
Each variadic parameter in the signature is assigned two bits in the AEB. This means that the reader can statically determine how many AEB bytes to expect in the e-expression by examining the signature.
Number of variadic parameters | AEB byte length |
---|---|
0 | 0 |
1 to 4 | 1 |
5 to 8 | 2 |
9 to 12 | 3 |
N | ceiling(N/4) |
Bits in the AEB are assigned from least significant to most significant and correspond to the variadic parameters in the signature from left to right. This allows the reader to right-shift away the bits of each variadic parameter when its corresponding argument has been read.
Example Signature | AEB Layout |
---|---|
() | <No variadics, no AEB> |
(a b c) | <No variadics, no AEB> |
(a b c?) | ------cc |
(a b* c?) | ----ccbb |
(a+ b* c?) | --ccbbaa |
(a+ b c?) | ----ccaa |
(a+ b* c? d*) | ddccbbaa |
(a+ b* c? d* e) | ddccbbaa |
(a+ b* c? d* e f?) | ddccbbaa ------ff |
(a+ b* c? d* e+ f?) | ddccbbaa ----ffee |
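The two tables above can be reproduced with a few lines of Python. This is a sketch under simplifying assumptions: signatures are given as plain strings in the notation of the table, a parameter counts as variadic when it ends in ?, * or +, and the function names are hypothetical.

```python
def variadic_params(signature):
    """Names of the variadic parameters in a signature such as '(a+ b* c?)'."""
    names = []
    for token in signature.strip("()").split():
        if token[-1] in "?*+":
            # Drop the cardinality suffix and any 'encoding::' prefix.
            names.append(token[:-1].split("::")[-1])
    return names


def aeb_byte_length(signature):
    """Two bits per variadic parameter, rounded up to whole bytes: ceiling(N/4)."""
    return (len(variadic_params(signature)) + 3) // 4


def read_aeb(signature, aeb_bytes):
    """Map each variadic parameter to its 2-bit AEB entry."""
    bits = int.from_bytes(aeb_bytes, "little")   # first byte holds the lowest bits
    entries = {}
    for name in variadic_params(signature):
        entries[name] = bits & 0b11              # least significant pair first
        bits >>= 2                               # shift the consumed pair away
    return entries


for sig in ["(a b c)", "(a b* c?)", "(a+ b* c? d* e f?)"]:
    print(sig, aeb_byte_length(sig))

# --ccbbaa with a=01, b=10, c=01
print(read_aeb("(a+ b* c?)", bytes([0b00011001])))
```

Reading the bitmap as one little-endian integer keeps the right-shift loop uniform even when the entries for later parameters live in a second or third byte.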
Each pair of bits in the AEB indicates what kind of expression to expect in the corresponding argument position.
Bit sequence | Meaning | ? | * | + |
---|---|---|---|---|
00 | An empty stream. No bytes are present in the corresponding argument position. | ✅ | ✅ | ❌ |
01 | A single expression of the declared encoding is present in the corresponding argument position. | ✅ | ✅ | ✅ |
10 | An expression group of the declared encoding is present in the corresponding argument position. | ❌ | ✅ | ✅ |
11 | Reserved. A bitmap entry with this bit sequence is illegal in Ion 1.1. | ❌ | ❌ | ❌ |
As noted in the table above:
- An empty stream (00) cannot be used to encode an argument for a parameter with a cardinality of one-or-more.
- An expression group (10) cannot be used to encode an argument for a parameter with a cardinality of zero-or-one.
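The legality rules in the table can be captured as a small lookup. The sketch below is one possible way a reader might validate an AEB entry against a parameter's cardinality; the names ALLOWED and check_aeb_entry are hypothetical.

```python
# Bit sequences permitted for each cardinality modifier, per the table above.
ALLOWED = {
    "?": {0b00, 0b01},          # zero-or-one: empty or a single expression
    "*": {0b00, 0b01, 0b10},    # zero-or-more: empty, single, or group
    "+": {0b01, 0b10},          # one-or-more: single or group, never empty
}
KINDS = {0b00: "empty", 0b01: "single", 0b10: "group"}


def check_aeb_entry(cardinality, bits):
    """Return the kind of argument to expect, or raise if the entry is illegal."""
    if bits == 0b11:
        raise ValueError("AEB entry 0b11 is reserved")
    if bits not in ALLOWED[cardinality]:
        raise ValueError(f"AEB entry {bits:02b} is not legal for cardinality {cardinality!r}")
    return KINDS[bits]


print(check_aeb_entry("*", 0b10))
```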
Expression groups
This section describes the encoding of an expression group. For an explanation of what an expression group is and how to use it, see Expression groups.
An expression group begins with a FlexUInt. If the FlexUInt's value is:
- greater than zero, then it represents the number of bytes used to encode the rest of the expression group. The reader should continue reading expressions of the declared encoding until that number of bytes has been consumed.
- zero, then it indicates that this is a delimited expression group and the processing varies according to whether the declared encoding is tagged or tagless. If the encoding is:
  - tagged, then each expression in the group begins with an opcode. The reader must consume tagged expressions until it encounters a terminating END opcode (0xF0).
  - tagless, then the expression group is a delimited sequence of 'chunks' that each have a FlexUInt length prefix and a body comprised of one or more expressions of the declared encoding. The reader will continue reading chunks until it encounters a length prefix of FlexUInt 0 (0x01), indicating the end of the chunk sequence. Each chunk in the sequence must be self-contained; an expression of the declared encoding may not be split across multiple chunks. See Example encoding of tagless zero-or-more with delimited expression group for an illustration.
tip
While it is legal to write an empty expression group for zero-or-more parameters, it is always more efficient to set the parameter's AEB bits to 00 instead.
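The three cases described above (length-prefixed, delimited tagged, and delimited tagless) can be sketched as a single Python function. This is an illustration rather than a reference implementation: the function names are hypothetical, the caller supplies a reader for the declared encoding, the only tagged value understood is opcode 0x61 (a 1-byte integer), and the FlexUInt layout is inferred from the byte values used in this document's examples.

```python
END = 0xF0  # delimited-end opcode


def read_flex_uint(buf, pos):
    """Decode a FlexUInt at buf[pos]; return (value, next_pos)."""
    size, i = 1, pos
    while buf[i] == 0:
        size, i = size + 8, i + 1
    size += (buf[i] & -buf[i]).bit_length() - 1
    return int.from_bytes(buf[pos:pos + size], "little") >> size, pos + size


def read_uint8(buf, pos):
    return buf[pos], pos + 1


def read_tagged_int(buf, pos):
    if buf[pos] != 0x61:
        raise NotImplementedError("only opcode 0x61 is handled in this sketch")
    return int.from_bytes(buf[pos + 1:pos + 2], "little", signed=True), pos + 2


def _read_span(buf, pos, end, read_expr, values):
    """Read expressions until exactly `end`; reject one that crosses the boundary."""
    while pos < end:
        value, pos = read_expr(buf, pos)
        values.append(value)
    if pos != end:
        raise ValueError("expression crosses a group or chunk boundary")
    return pos


def read_expression_group(buf, pos, read_expr, tagged):
    """Return (values, next_pos) for the expression group starting at buf[pos]."""
    values = []
    length, pos = read_flex_uint(buf, pos)
    if length > 0:                          # length-prefixed group
        return values, _read_span(buf, pos, pos + length, read_expr, values)
    if tagged:                              # delimited, tagged: read until END
        while buf[pos] != END:
            value, pos = read_expr(buf, pos)
            values.append(value)
        return values, pos + 1
    while True:                             # delimited, tagless: read chunks
        chunk_len, pos = read_flex_uint(buf, pos)
        if chunk_len == 0:                  # FlexUInt 0 ends the chunk sequence
            return values, pos
        pos = _read_span(buf, pos, pos + chunk_len, read_expr, values)


# A delimited group of tagged 1-byte ints: FlexUInt 0, three values, END.
print(read_expression_group(bytes.fromhex("01610161026103F0"), 0, read_tagged_int, True))
```

The boundary check in `_read_span` is what enforces the rule that a tagless expression may not be split across chunks.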
Example encoding of tagged zero-or-one with empty group
(:add_macros
(foo (a?) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa
│ │ a=00, empty expression group
00 00
Example encoding of tagged zero-or-one with single expression
(:add_macros
(foo (a?) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=01, single expression
│ │ ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│ │ ┌─┴─┐
00 01 61 01
Example encoding of tagged zero-or-more with empty group
(:add_macros
(foo (a*) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=00, empty expression group
│ │
00 00
Example encoding of tagged zero-or-more with single expression
(:add_macros
(foo (a*) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=01, single expression
│ │ ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│ │ ┌─┴─┐
00 01 61 01
Example encoding of tagged zero-or-more with expression group
(:add_macros
(foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 6: 6-byte expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (1)
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (2)
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int (3)
│ │ │ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 02 0D 61 01 61 02 61 03
└───────┬───────┘
6-byte expression group body
Example encoding of tagged zero-or-more with delimited expression group
(:add_macros
(foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 0: delimited expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (1)
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (2)
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int (3)
│ │ │ │ │ │ ┌─── Opcode 0xF0 is delimited end
│ │ │ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ │
00 02 01 61 01 61 02 61 03 F0
└───────┬───────┘
expression group body
Example encoding of tagged one-or-more with single expression
(:add_macros
(foo (a+) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=01, single expression
│ │ ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int
│ │ │ 1
00 01 61 01
Example encoding of tagged one-or-more with expression group
(:add_macros
(foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 6: 6-byte expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int
│ │ │ │ 1 │ 2 │ 3
00 02 0D 61 01 61 02 61 03
└───────┬───────┘
6-byte expression group body
Example encoding of tagged one-or-more with delimited expression group
(:add_macros
(foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 0: delimited expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int
│ │ │ │ │ │ ┌─── Opcode 0xF0 is delimited end
│ │ │ │ 1 │ 2 │ 3 │
00 02 01 61 01 61 02 61 03 F0
└───────┬───────┘
expression group body
Example encoding of tagless zero-or-more with expression group
(:add_macros
(foo (uint8::a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 3: 3-byte expression group
│ │ │ ┌──── uint8 1
│ │ │ │ ┌──── uint8 2
│ │ │ │ │ ┌─── uint8 3
│ │ │ │ │ │
00 02 07 01 02 03
└──┬───┘
expression group body
Example encoding of tagless zero-or-more with delimited expression group
(:add_macros
(foo (uint8::a*) /*...*/)
)
(:foo 1 2 3 4 5)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 0: Delimited expression group
│ │ │ ┌──── FlexUInt 3: 3-byte chunk of uint8 expressions
│ │ │ │ ┌──── FlexUInt 2: 2-byte chunk of uint8 expressions
│ │ │ │ │ ┌──── FlexUInt 0: End of group
│ │ │ │ │ │
00 02 01 07 01 02 03 05 04 05 01
└──┬───┘ └─┬─┘
chunk 1 chunk 2
Annotations
Annotations can be encoded either as symbol addresses or as FlexSyms. In both encodings, the annotations sequence appears just before the value that it decorates.
It is illegal for an annotations sequence to appear before any of the following:
- The end of the stream
- Another annotations sequence
- A NOP
- An e-expression. To add annotations to the expansion of an e-expression, see the annotate macro.
Annotations With Symbol Addresses
Opcodes 0xE4 through 0xE6 indicate one or more annotations encoded as symbol addresses. If the opcode is:
- 0xE4, a single FlexUInt-encoded symbol address follows.
- 0xE5, two FlexUInt-encoded symbol addresses follow.
- 0xE6, a FlexUInt follows that represents the number of bytes needed to encode the annotations sequence, which can be made up of any number of FlexUInt symbol addresses.
Encoding of $10::false
┌──── The opcode `0xE4` indicates a single annotation encoded as a symbol address follows
│ ┌──── Annotation with symbol address: FlexUInt 10
E4 15 6F
└── The annotated value: `false`
Encoding of $10::$11::false
┌──── The opcode `0xE5` indicates that two annotations encoded as symbol addresses follow
│ ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│ │ ┌──── Annotation with symbol address: FlexUInt 11 ($11)
E5 15 17 6F
└── The annotated value: `false`
Encoding of $10::$11::$12::false
┌──── The opcode `0xE6` indicates a variable-length sequence of symbol address annotations;
│     a FlexUInt follows representing the length of the sequence.
│  ┌──── Annotations sequence length: FlexUInt 3
│  │  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
│  │  │  │  ┌──── Annotation with symbol address: FlexUInt 12 ($12)
E6 07 15 17 19 6F
               └── The annotated value: `false`
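The three symbol-address opcodes differ only in how the reader learns how many addresses to read. A minimal Python sketch of that dispatch follows; the function names are hypothetical and the FlexUInt layout is inferred from the byte values in the examples above.

```python
def read_flex_uint(buf, pos):
    """Decode a FlexUInt at buf[pos]; return (value, next_pos)."""
    size, i = 1, pos
    while buf[i] == 0:
        size, i = size + 8, i + 1
    size += (buf[i] & -buf[i]).bit_length() - 1
    return int.from_bytes(buf[pos:pos + size], "little") >> size, pos + size


def read_symbol_address_annotations(buf, pos):
    """Return (symbol_addresses, next_pos); next_pos is where the value begins."""
    opcode, pos = buf[pos], pos + 1
    addresses = []
    if opcode in (0xE4, 0xE5):
        count = 1 if opcode == 0xE4 else 2      # fixed number of addresses
        for _ in range(count):
            address, pos = read_flex_uint(buf, pos)
            addresses.append(address)
    elif opcode == 0xE6:
        length, pos = read_flex_uint(buf, pos)  # byte length of the sequence
        end = pos + length
        while pos < end:
            address, pos = read_flex_uint(buf, pos)
            addresses.append(address)
        if pos != end:
            raise ValueError("annotation overran the declared sequence length")
    else:
        raise ValueError(f"0x{opcode:02X} is not a symbol-address annotation opcode")
    return addresses, pos


# $10::$11::$12::false
data = bytes.fromhex("E6071517196F")
addresses, value_pos = read_symbol_address_annotations(data, 0)
print(addresses, hex(data[value_pos]))
```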
Annotations With FlexSym Text
Opcodes 0xE7 through 0xE9 indicate one or more annotations encoded as FlexSyms.
If the opcode is:
- 0xE7, a single FlexSym-encoded symbol follows.
- 0xE8, two FlexSym-encoded symbols follow.
- 0xE9, a FlexUInt follows that represents the byte length of the annotations sequence, which is made up of any number of annotations encoded as FlexSyms.
While this encoding is more flexible than annotations with symbol addresses, it can be slightly less compact when all the annotations are encoded as symbol addresses.
Encoding of $10::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│ ┌──── Annotation with symbol address: FlexSym 10 ($10)
E7 15 6F
└── The annotated value: `false`
Encoding of foo::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │   f  o  o
E7 FB 66 6F 6F 6F
      └──┬───┘ └── The annotated value: `false`
      3 UTF-8
       bytes
Note that FlexSym annotation sequences can switch between symbol address and inline text on a per-annotation basis.
Encoding of $10::foo::false
┌──── The opcode `0xE8` indicates two annotations encoded as FlexSyms follow
│  ┌──── Annotation: FlexSym 10 ($10)
│  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │   f  o  o
E8 15 FB 66 6F 6F 6F
         └──┬───┘ └── The annotated value: `false`
         3 UTF-8
          bytes
Encoding of $10::foo::$11::false
┌──── The opcode `0xE9` indicates a variable-length sequence of FlexSym-encoded annotations
│  ┌──── Length: FlexUInt 6
│  │  ┌──── Annotation: FlexSym 10 ($10)
│  │  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │  │           ┌──── Annotation: FlexSym 11 ($11)
│  │  │  │   f  o  o │
E9 0D 15 FB 66 6F 6F 17 6F
            └──┬───┘    └── The annotated value: `false`
            3 UTF-8
             bytes
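The per-annotation switch between symbol address and inline text can be sketched as follows. The Python below assumes, as the examples above suggest, that a FlexSym begins with a FlexInt whose sign selects the form: a positive value is a symbol address, and a negative value gives the number of UTF-8 bytes that follow. A FlexInt of zero is not handled here. All function names are hypothetical, and the test bytes are produced by the sketch's own one-byte encoder rather than copied from elsewhere.

```python
def read_flex(buf, pos, signed):
    """Decode a FlexUInt (signed=False) or FlexInt (signed=True)."""
    size, i = 1, pos
    while buf[i] == 0:
        size, i = size + 8, i + 1
    size += (buf[i] & -buf[i]).bit_length() - 1
    raw = int.from_bytes(buf[pos:pos + size], "little", signed=signed)
    return raw >> size, pos + size


def flex_1byte(value):
    """Encode a small value (-64..63) as a one-byte FlexInt."""
    if not -64 <= value <= 63:
        raise ValueError("value does not fit in a one-byte FlexInt")
    return bytes([((value << 1) | 1) & 0xFF])   # low bit 1 marks a single byte


def read_flex_sym(buf, pos):
    """Return (int address or str text, next_pos)."""
    value, pos = read_flex(buf, pos, signed=True)
    if value > 0:
        return value, pos                        # symbol address
    if value < 0:
        end = pos - value                        # -value bytes of UTF-8 text
        return buf[pos:end].decode("utf-8"), end
    raise NotImplementedError("FlexSym 0 is outside this sketch")


def read_flex_sym_annotations(buf, pos):
    opcode, pos = buf[pos], pos + 1
    annotations = []
    if opcode in (0xE7, 0xE8):
        for _ in range(opcode - 0xE6):           # one or two FlexSyms
            sym, pos = read_flex_sym(buf, pos)
            annotations.append(sym)
    elif opcode == 0xE9:
        length, pos = read_flex(buf, pos, signed=False)
        end = pos + length
        while pos < end:
            sym, pos = read_flex_sym(buf, pos)
            annotations.append(sym)
    else:
        raise ValueError(f"0x{opcode:02X} is not a FlexSym annotation opcode")
    return annotations, pos


# $10::foo::$11::false, with the sequence length computed from the parts.
body = flex_1byte(10) + flex_1byte(-3) + b"foo" + flex_1byte(11)
data = bytes([0xE9]) + flex_1byte(len(body)) + body + bytes([0x6F])
print(read_flex_sym_annotations(data, 0))
```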
NOPs
A NOP (short for "no-operation") is the binary equivalent of whitespace. NOP bytes have no meaning, but can be used as padding to achieve a desired alignment.
An opcode of 0xEC indicates a single-byte NOP pad. An opcode of 0xED indicates that a FlexUInt follows that represents the number of additional bytes to skip.
It is legal for a NOP to appear anywhere that a value can be encoded. It is not legal for a NOP to appear in annotation sequences or struct field names. If a NOP appears in place of a struct field value, then the associated field name is ignored; the NOP is immediately followed by the next field name, if any.
Encoding of a 1-byte NOP
┌──── The opcode `0xEC` represents a 1-byte NOP pad
│
EC
Encoding of a 4-byte NOP
┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows
│ ┌──── Length: FlexUInt 2; two more bytes of NOP follow
│ │
ED 05 93 C6
└─┬─┘
NOP bytes, values ignored
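A reader positioned where a value may appear can skip any run of padding before decoding. The following Python sketch shows one way to do that; skip_nops is a hypothetical helper, and the FlexUInt layout is inferred from the byte values used in this document's examples.

```python
def read_flex_uint(buf, pos):
    """Decode a FlexUInt at buf[pos]; return (value, next_pos)."""
    size, i = 1, pos
    while buf[i] == 0:
        size, i = size + 8, i + 1
    size += (buf[i] & -buf[i]).bit_length() - 1
    return int.from_bytes(buf[pos:pos + size], "little") >> size, pos + size


def skip_nops(buf, pos):
    """Advance past consecutive NOP pads; return the first non-NOP position."""
    while pos < len(buf):
        if buf[pos] == 0xEC:                 # single-byte NOP
            pos += 1
        elif buf[pos] == 0xED:               # variable-length NOP
            extra, pos = read_flex_uint(buf, pos + 1)
            pos += extra                     # skip the padding bytes themselves
        else:
            break
    return pos


# A 1-byte NOP, then ED 05 93 C6, then the value byte 0x6F.
data = bytes.fromhex("ECED0593C66F")
print(skip_nops(data, 0))
```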
Security considerations
The Ion 1.1 data format is orthogonal to many classes of attacks, such as privilege escalation and phishing attacks. Ion 1.1 is primarily susceptible to denial-of-service (DoS) attacks that attempt to cause an error condition in the receiving system or consume excessive system resources. As with many such attacks, the strongest defense is to not accept any untrusted input, but that defense is not always compatible with the business requirements of the receiving application.
This document addresses various types of attacks, assuming that it is not possible to avoid accepting untrusted input.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Data expansion denial-of-service
An attacker could craft an input that is relatively small, but upon expansion, produces something thousands or millions of times larger.
For many use cases, the expansion of a template macro will grow linearly with the size of its input. However, it is possible to create macros with expansions that grow at greater rates. Using for, we can nest an arbitrary number of loops to create a macro expansion with a polynomial growth rate. Using the repeat macro, we can create classes of inputs with expansions that grow exponentially in relation to the input.
For example, this input is less than 250 characters when encoded as Ion text (and omitting all optional whitespace). In Ion binary, it requires only 74 bytes. Each additional level of nesting requires only 20 additional characters (text) or 6 additional bytes (binary), but multiplies the number of expanded values by 2147483647.
$ion_1_1
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647
(:repeat 2147483647 "abc")))))))))))
The expansion of these e-expressions results in a stream of ~450 googol string values. Any attempt to hold all of this in memory or write it to disk will exhaust all available resources and eventually fail. Even an attempt to count the length of the stream, while it may theoretically succeed if using an appropriate BigInteger type, will require a considerable number of CPU operations (over a googol), and even the fastest processors will require many millennia to completely count the number of values in the stream.
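The size of that expansion is easy to check with arbitrary-precision arithmetic. The Python below recomputes it from the example: eleven nested repeat invocations, each with a count of 2147483647. The counting rate used at the end (10^12 values per second) is a purely hypothetical figure chosen for illustration.

```python
REPEAT_COUNT = 2_147_483_647   # count used at every level of the example
LEVELS = 11                    # number of nested (:repeat ...) invocations
GOOGOL = 10 ** 100

total_values = REPEAT_COUNT ** LEVELS
print("decimal digits:", len(str(total_values)))
print("whole googols:", total_values // GOOGOL)

# Hypothetical counter that visits 10**12 values per second.
ASSUMED_VALUES_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
years = total_values // (ASSUMED_VALUES_PER_SECOND * SECONDS_PER_YEAR)
print("years to count: roughly 10 **", len(str(years)) - 1)
```

Even under that generous assumed rate, the time needed to merely count the values exceeds the input size by an astronomical factor, which is why a limit on expansion is needed.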
Even without using repeat or for, a Billion laughs attack could exist for any data format with macro expansion, and it is certainly possible with Ion 1.1.
$ion_1_1
(:add_macros (macro lol0 () "lol")
(macro lol1 () (.values (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0) (.lol0)))
(macro lol2 () (.values (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1) (.lol1)))
(macro lol3 () (.values (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2) (.lol2)))
(macro lol4 () (.values (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3) (.lol3)))
(macro lol5 () (.values (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4) (.lol4)))
(macro lol6 () (.values (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5) (.lol5)))
(macro lol7 () (.values (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6) (.lol6)))
(macro lol8 () (.values (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7) (.lol7)))
(macro lol9 () (.values (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8) (.lol8)))
(macro lolz () (.lol9)) )
(:lolz)
Implementations of Ion 1.1 MUST have some mechanism by which to mitigate data expansion attacks.
The macro evaluator of Ion 1.1 implementations SHOULD have a (possibly configurable) limit on the number of values produced by the expansion of any macro or e-expression. If the macro evaluator reaches that limit, evaluation should halt and the reader should signal an error. This is similar to the Token Bucket Algorithm, but instead of being refilled over time, the bucket is reset to its maximum capacity whenever the reader begins evaluating an e-expression that is not nested in any other e-expression. To guard against a malicious input that produces no values (for example, (macro sneaky_lolz () (.meta (.lolz)))), tokens SHOULD be consumed at every level of expansion, including special forms and TDL macro invocations. Expansions that are skipped are not required to consume tokens (since they are not expanded), but an empty expansion MUST consume at least one token.
$ion_1_1
// Fill bucket here
(:make_list
[
// Do not fill bucket here
(:repeat 100 "foo")
]
[
"bar",
"baz",
]
)
{
// Fill bucket here
foo: (:make_string "foo" "bar"),
// Fill bucket here.
// Consume one token for each value produced by repeat and for each value produced by make_string
bar: (:make_string (:repeat 16 "na") " batman!")
}
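One possible shape for such a limit is sketched below in Python. It is not taken from any Ion implementation: the class, the exception, and the toy repeat generator are all hypothetical, and the capacity of 100 is an arbitrary value for the demonstration. Each invocation is charged one token up front so that an expansion producing no values still consumes at least one token.

```python
class ExpansionLimitExceeded(Exception):
    """Raised when an expansion would consume more tokens than remain."""


class ExpansionBudget:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = capacity

    def refill(self):
        """Call when the reader begins a top-level (non-nested) e-expression."""
        self.tokens = self.capacity

    def consume(self, n=1):
        if n > self.tokens:
            raise ExpansionLimitExceeded(f"expansion limit of {self.capacity} reached")
        self.tokens -= n


def repeat(budget, count, value):
    """Toy stand-in for a repeat macro that charges the budget as it expands."""
    budget.consume()            # the invocation itself, even if it yields nothing
    for _ in range(count):
        budget.consume()        # one token per value produced
        yield value


budget = ExpansionBudget(capacity=100)

budget.refill()                 # a top-level e-expression begins
print(len(list(repeat(budget, 16, "na"))), "values,", budget.tokens, "tokens left")

budget.refill()                 # the next top-level e-expression begins
try:
    list(repeat(budget, 2_147_483_647, "abc"))
except ExpansionLimitExceeded as err:
    print("halted:", err)
```

Because the generator is lazy, the oversized expansion is stopped after the budget is spent rather than after the values have been materialized.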
Remote code execution
The template definition language (TDL) is a domain specific programming language used to declare template macros in Ion 1.1. It is intentionally limited in its capabilities—it cannot recurse and does not support forward references. In general, it supports combining Ion values to produce other Ion values, but it does not support arbitrary computation on those values.
Remote code execution (RCE) attacks allow an attacker to remotely execute malicious code on a computer. By invoking e-expressions in the body of an Ion document, an attacker can cause the recipient to execute arbitrary TDL (code) when reading the document.
This is unlikely to be a concern in practice because TDL is not arbitrary code. TDL is intentionally not Turing complete, to make it impossible to perform arbitrary computation. It also has a very limited domain—it can only transform/produce Ion data model values. While it could be possible to attempt a denial-of-service attack using TDL, TDL expansion is guaranteed to terminate in a finite number of steps, and implementations can additionally limit the expansion size (as described above).
Embedded Documents
Ion 1.1 supports embedded documents using the parse_ion macro. Generally speaking, systems that accept embedded documents should properly isolate and validate embedded documents to prevent attacks.
Ion 1.1 specifies that parse_ion must only accept a literal string or literal blob, and that the resulting values are always user values (rather than system values). This ensures that the embedded document cannot be affected by any input from the containing document, nor can it have any effect on the encoding context of the containing document.
The parse_ion macro uses an Ion reader, so the embedded document will be validated just as any other Ion document.
Data injection via shared modules
Applications are not required to use shared modules. If an application does use shared modules, it should take steps to ensure that shared modules come from a trusted source and use appropriate measures to prevent man-in-the-middle and other attacks that can compromise data while it is in transit.
In many cases, even if an application needs to accept Ion payloads from untrusted sources, it is possible to design a solution in which the shared modules are supplied by a trusted source. For example, in a service-oriented-architecture, the server can host shared modules so that the server does not have to trust the client. (However, this assumes that the client trusts the server.)
If shared modules must come from an untrusted source, then applications should take steps to ensure that the shared modules originate from the same source as the data that uses them, and they can be treated as if they are one composite piece of data from that source.
Arbitrary-sized values
The Ion specification places no limits on the size of Ion values, so if an attacker sends a sufficiently large value, it could consume enough system resources to disrupt the application reading the value.
Even though the Ion specification does not have limits on the size of values, all real computer systems have finite resources, so all implementations will have limits in practice. Ion implementations MAY set limits on the maximum size of any Ion value for any available metric, including (but not limited to) number of bytes, number of codepoints, number of child values, digits of precision, or number of annotations. An implementation MAY allow limits to be configurable by an application that uses the Ion implementation. Any limits imposed SHOULD be described in the public documentation of an Ion implementation, unless the limits are unknown and/or are dependent on the underlying runtime environment.
Symbol table and macro table inflation
An attacker could try to create an input that results in excessively large symbol and macro tables in the Ion reader that could exhaust the memory of the receiving system and lead to a denial of service.
Although Ion 1.1 does not specify a maximum size for symbol tables or macro tables, Ion implementations MAY impose upper bounds on the size of symbol tables, macro tables, module bindings, and any other direct or indirect component of the encoding context. An implementation MAY allow limits to be configurable by an application that uses the Ion implementation. Any limits imposed SHOULD be described in the public documentation of an Ion implementation, unless the limits are unknown and/or are dependent on the underlying runtime environment.
Grammar
This chapter presents Ion 1.1's domain grammar, by which we mean the grammar of the domain of values that drive Ion's encoding features.
We use a BNF-like notation for describing various syntactic parts of a document, including Ion data structures. In such cases, the BNF should be interpreted loosely to accommodate Ion-isms like commas and unconstrained ordering of struct fields.
Documents
document ::= ivm? segment*
ivm ::= '$ion_1_0' | '$ion_1_1'
segment ::= value* directive?
directive ::= ivm
| encoding-directive
| symtab-directive
symtab-directive ::= local-symbol-table ; As per the Ion 1.0 specification¹
encoding-directive ::= '$ion::(encoding ' module-name* ')'
¹Symbols – Local Symbol Tables.
Modules
module-body ::= import* inner-module* symbol-table? macro-table?
shared-module ::= '$ion_shared_module::' ivm '::(' catalog-key module-body ')'
import ::= '(import ' module-name catalog-key ')'
catalog-key ::= catalog-name catalog-version?
catalog-name ::= string
catalog-version ::= unannotated-uint ; must be positive
inner-module ::= '(module' module-name module-body ')'
module-name ::= unannotated-identifier-symbol
symbol-table ::= '(symbol_table' symbol-table-entry* ')'
symbol-table-entry ::= module-name | symbol-list
symbol-list ::= '[' symbol-text* ']'
symbol-text ::= symbol | string
macro-table ::= '(macro_table' macro-table-entry* ')'
macro-table-entry ::= macro-definition
| macro-export
| module-name
macro-export ::= '(export' qualified-macro-ref macro-name-declaration? ')'
Macro references
qualified-macro-ref ::= module-name '::' macro-ref
macro-ref ::= macro-name | macro-addr
qualified-macro-name ::= module-name '::' macro-name
macro-name ::= unannotated-identifier-symbol
macro-addr ::= unannotated-uint
Macro definitions
macro-definition ::= '(macro' macro-name-declaration signature tdl-expression ')'
macro-name-declaration ::= macro-name | 'null'
signature ::= '(' parameter* ')'
parameter ::= parameter-encoding? parameter-name parameter-cardinality?
parameter-encoding ::= (primitive-encoding-type | qualified-primitive-type | macro-name | qualified-macro-name)'::'
qualified-primitive-type::= '$ion::' primitive-encoding-type
primitive-encoding-type ::= 'uint8' | 'uint16' | 'uint32' | 'uint64'
| 'int8' | 'int16' | 'int32' | 'int64'
| 'float16' | 'float32' | 'float64'
| 'flex_int' | 'flex_uint'
| 'flex_sym' | 'flex_string'
parameter-name ::= unannotated-identifier-symbol
parameter-cardinality ::= '!' | '*' | '?' | '+'
tdl-expression ::= operation | variable-expansion | ion-scalar | ion-container
operation ::= macro-invocation | special-form
variable-expansion ::= '(%' variable-name ')'
variable-name ::= unannotated-identifier-symbol
macro-invocation ::= '(.' macro-ref macro-arg* ')'
special-form ::= '(.' '$ion::'? special-form-name tdl-expression* ')'
special-form-name ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi' | 'literal' | 'parse_ion'
macro-arg ::= tdl-expression | expression-group
expression-group ::= '(..' tdl-expression* ')'
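As a rough illustration of the parameter production, the Python sketch below splits a single parameter, written as plain text, into its encoding, name, and cardinality. It is not a full grammar implementation: the identifier pattern is an assumed approximation of an unannotated identifier symbol, the function name is hypothetical, and an absent encoding is reported as "tagged" purely for readability.

```python
import re

# Assumed approximation of an unannotated identifier symbol.
IDENT = r"[A-Za-z_$][A-Za-z0-9_$]*"

# parameter ::= parameter-encoding? parameter-name parameter-cardinality?
PARAMETER = re.compile(
    rf"(?:(?P<encoding>(?:{IDENT}::)?{IDENT})::)?"   # optionally module-qualified
    rf"(?P<name>{IDENT})"
    r"(?P<cardinality>[!*?+])?"
)

CARDINALITIES = {
    "!": "exactly-one",
    "?": "zero-or-one",
    "*": "zero-or-more",
    "+": "one-or-more",
}


def parse_parameter(text):
    """Return (encoding, name, cardinality) for one parameter of a signature."""
    match = PARAMETER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid parameter: {text!r}")
    encoding = match["encoding"] or "tagged"             # no annotation: tagged
    cardinality = CARDINALITIES[match["cardinality"] or "!"]  # default modifier
    return encoding, match["name"], cardinality


for param in ["a", "b*", "flex_uint::c", "point2D::start", "$ion::uint8::n+"]:
    print(param, "->", parse_parameter(param))
```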
Glossary
active encoding module
An encoding module whose symbol table and macro table are available in the current segment of an Ion document.
The sequence of active encoding modules is set by an encoding directive.
argument
The sub-expression(s) within a macro invocation, corresponding to exactly one of the macro's parameters.
cardinality
Describes both the number of argument expressions that a parameter will accept when the macro is invoked,
and the number of values that the parameter may expand to during evaluation.
A parameter's cardinality can be zero-or-one, exactly-one, zero-or-more, or one-or-more, specified in a signature by one of the modifiers ?, !, *, or + respectively.
If no modifier is specified, cardinality defaults to exactly-one.
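The mapping from modifiers to cardinalities can be sketched in Python as follows. This is an illustration only: the table and function names are hypothetical, and representing each cardinality as a (minimum, maximum) pair is an assumption made for the sketch, not a representation the specification prescribes.

```python
# Modifier -> (cardinality name, minimum count, maximum count).
# A maximum of None means "no upper bound".
CARDINALITY_BY_MODIFIER = {
    "?": ("zero-or-one", 0, 1),
    "!": ("exactly-one", 1, 1),
    "*": ("zero-or-more", 0, None),
    "+": ("one-or-more", 1, None),
}

def cardinality(modifier=None):
    """Look up a modifier; a missing modifier defaults to exactly-one."""
    if modifier is None:
        modifier = "!"
    return CARDINALITY_BY_MODIFIER[modifier]

def accepts(modifier, count):
    """Return True if `count` values satisfy the parameter's cardinality."""
    _, minimum, maximum = cardinality(modifier)
    return count >= minimum and (maximum is None or count <= maximum)
```

Under these assumptions, accepts("?", 2) is False because zero-or-one permits at most one value, while accepts("*", 0) is True.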
declaration
The association of a name with an entity (for example, a module or macro). See also definition.
Not all declarations are definitions: some introduce new names for existing entities.
definition
The specification of a new entity.
directive
A keyword or unit of data in an Ion document that affects the encoding environment, and thus the way the document's data is encoded and decoded.
In Ion 1.0 there are two kinds of directives: Ion version markers and symbol table directives.
Ion 1.1 adds encoding directives.
document
A stream of octets conforming to either the Ion text or binary specification.
Can consist of multiple segments, perhaps using varying versions of the Ion specification.
A document does not necessarily exist as a file, and is not necessarily finite.
E-expression
See encoding expression.
encoding directive
In an Ion 1.1 segment, a top-level S-expression annotated with $ion.
Defines a new encoding module sequence for the segment immediately following it.
The symbol table directive is effectively a less capable alternative syntax.
encoding environment
The context-specific data maintained by an Ion implementation while encoding or decoding data. In
Ion 1.0 this consists of the current symbol table; in Ion 1.1 this is expanded to also include the Ion
spec version, the current macro table, and a collection of available modules.
encoding expression
The invocation of a macro in encoded data, aka e-expression.
Starts with a macro reference denoting the function to invoke.
The Ion text format uses "smile syntax" (:macro ...) to denote e-expressions.
Ion binary devotes a large number of opcodes to e-expressions, so they can be compact.
encoding module
A module whose symbol table and macro table can be used directly in the user data stream.
expression
A serialized syntax element that may produce values.
Encoding expressions and values are both considered expressions, whereas NOP, comments, and IVMs, for example, are not.
expression group
A grouping of zero or more expressions that together form one argument.
The concrete syntax for passing a stream of expressions to a macro parameter.
In a text e-expression, a group starts with the trigraph (:: and ends with ), similar to an S-expression.
In template definition language, a group is written as an S-expression starting with .. (two dots).
inner module
A module that is defined inside another module and only visible inside the definition of that module.
Ion version marker
A keyword directive that denotes the start of a new segment encoded with a specific Ion version.
Also known as "IVM".
macro
A transformation function that accepts some number of streams of values, and produces a stream of values.
macro definition
Specifies a macro in terms of a signature and a template.
macro reference
Identifies a macro for invocation or exporting. Macro references are lexically scoped and must
always be unambiguous. A macro reference cannot be a "forward reference" to a macro that is
declared later in the document.
module
The data entity that defines and exports both symbols and macros.
opcode
A 1-byte, unsigned integer that tells the reader what the next expression represents
and how the bytes that follow should be interpreted.
optional parameter
A parameter that can have its corresponding subform(s) omitted when the macro is invoked.
A parameter is optional if both it and the parameters that follow it in the macro signature can accept an empty stream.
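The rule above can be sketched in Python by scanning a signature from right to left. This is an illustrative sketch, not a normative algorithm. It assumes that the cardinalities able to accept an empty stream are zero-or-one (?) and zero-or-more (*), and it represents a signature as a list of modifiers in which None stands for the default, exactly-one. The name optional_flags is hypothetical.

```python
# Assumption: only '?' and '*' parameters can accept an empty stream.
EMPTY_OK = {"?", "*"}

def optional_flags(modifiers):
    """Return one boolean per parameter: True if that parameter is optional.

    A parameter is optional only when it and every parameter after it
    can accept an empty stream, so the scan runs from the last parameter
    back to the first.
    """
    flags = [False] * len(modifiers)
    all_following_empty_ok = True
    for index in range(len(modifiers) - 1, -1, -1):
        all_following_empty_ok = (
            all_following_empty_ok and modifiers[index] in EMPTY_OK
        )
        flags[index] = all_following_empty_ok
    return flags
```

For the signature modifiers ["?", None, "*", "?"], only the last two parameters are optional: the first parameter accepts an empty stream but is followed by an exactly-one parameter, so its argument cannot be omitted.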
parameter
A named input to a macro, as defined by its signature.
At expansion time a parameter produces a stream of values.
qualified macro reference
A macro reference that consists of a module name and either a macro name exported by that module,
or a numeric address within the range of the module's exported macro table. In TDL, these look
like module-name::name-or-address.
required parameter
A macro parameter that is not optional and therefore requires an argument at each invocation.
rest parameter
A macro parameter (always the final parameter) declared with * or + cardinality
that accepts all remaining individual arguments to the macro as if they were in an implicit expression group.
Applies to Ion text and TDL.
Similar to "varargs" parameters in Java and other languages.
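A simplified Python sketch of how a rest parameter might gather trailing arguments is shown below. It is illustrative only: it ignores explicit expression groups, does not validate cardinality or reject surplus arguments, and models arguments as plain Python values. The name bind_arguments and the (name, modifier) representation of parameters are hypothetical.

```python
def bind_arguments(params, args):
    """Bind positional arguments to parameters.

    params: list of (name, modifier) pairs; modifier is None, '!', '?', '*' or '+'.
    args:   list of argument expressions, in invocation order.
    Returns a dict mapping each parameter name to a list of expressions.
    """
    bound = {}
    last_index = len(params) - 1
    for index, (name, modifier) in enumerate(params):
        if index == last_index and modifier in ("*", "+"):
            # Rest parameter: take every remaining argument, as if the
            # caller had wrapped them in an implicit expression group.
            bound[name] = list(args[index:])
        elif index < len(args):
            bound[name] = [args[index]]
        else:
            # Argument omitted by the caller.
            bound[name] = []
    return bound
```

With params [("first", None), ("rest", "*")] and args [1, 2, 3], the sketch binds first to [1] and rest to [2, 3].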
segment
A contiguous partition of a document that uses the same encoding module sequence.
Segment boundaries are caused by directives: an IVM starts a new segment (ending the prior segment, if any),
while encoding directives end segments (with a new one starting immediately afterward).
The import and module directives can also end a segment if they redefine a module binding that was in the encoding module sequence.
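The boundary rules for IVMs and encoding directives can be sketched in Python as follows. This is a simplified model rather than a reader implementation: top-level items are represented as hypothetical (kind, payload) pairs, directives themselves are not stored in any segment, values that appear before any directive are assumed to open a segment implicitly, and the import and module cases are not modeled.

```python
def split_segments(items):
    """Partition top-level items into segments.

    items: list of (kind, payload) pairs, where kind is one of
           'ivm', 'encoding_directive', or 'value' (hypothetical tags).
    Returns a list of segments, each a list of value payloads.
    """
    segments = []
    current = None  # None until the first segment has been opened
    for kind, payload in items:
        if kind in ("ivm", "encoding_directive"):
            # Either directive closes the prior segment (if any)
            # and a new segment begins immediately afterward.
            if current is not None:
                segments.append(current)
            current = []
        else:
            if current is None:
                current = []  # assumption: values may open a segment implicitly
            current.append(payload)
    if current is not None:
        segments.append(current)
    return segments
```

For an IVM followed by two values, an encoding directive, and one more value, the sketch produces two segments: the first holding the two values and the second holding the final value.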
shared module
A module that exists independent of the data stream of an Ion document. It is identified by a
name and version so that it can be imported by other modules.
signature
The part of a macro definition that specifies its "calling convention", in terms of the shape,
type, and cardinality of arguments it accepts.
symbol table directive
A top-level struct annotated with $ion_symbol_table. Defines a new encoding environment
without any macros. Valid in Ion 1.0 and 1.1.
system e-expression
An e-expression that invokes a macro from the system module rather than from the active encoding module.
system macro
A macro provided by the Ion implementation via the system module $ion.
System macros are available at all points within Ion 1.1 segments.
system module
A standard module named $ion that is provided by the Ion implementation, implicitly installed so
that the system symbols and system macros are available at all points within a document.
Subsumes the functionality of the Ion 1.0 system symbol table.
system symbol
A symbol provided by the Ion implementation via the system module $ion.
System symbols are available at all points within an Ion document, though the selection of symbols
varies by segment according to its Ion version.
TDL
See template definition language.
template
The part of a macro definition that expresses its transformation of inputs to results.
template definition language
An Ion-based, domain-specific language that declaratively specifies the output produced by a macro.
Template definition language uses only the Ion data model.
unqualified macro reference
A macro reference that consists of either a macro name or numeric address, without a qualifying module name.
These are resolved using lexical scope and must always be unambiguous.
variable expansion
In TDL, a special form that causes all argument expression(s) for the given parameter to be expanded and the result of the expansion to be substituted into the template.