This is a draft specification of Ion 1.1, a new minor version of the Ion serialization format.
Status
This document is a working draft and is subject to change.
Audience
This document presents the formal specification for the Ion 1.1 data format. It is not intended as a user guide or cookbook, but as a reference to the syntax and semantics of the Ion data format and its logical data model.
What's New in Ion 1.1
This section gives a high-level overview of what is new and different in Ion 1.1 compared with Ion 1.0, from an implementer's perspective.
Motivation
Ion 1.1 has been designed to address some of the trade-offs in Ion 1.0 to make it suitable for a wider range of applications, giving greater representational choice and expressive power. Some applications want to optimize writes over reads, or are constrained by the writer in some way (e.g. it's prohibitively expensive to buffer an entire value before writing). Ion 1.1 now makes both length prefixing of containers and the interning of symbol tokens independently optional, granting such writers greater flexibility. Data density is another motivation. Certain encodings (e.g., timestamps, integers) have been made more compact and efficient. More significantly, macros now enable applications to have very flexible interning of their data's structure. In aggregate, data transcoded from Ion 1.0 to Ion 1.1 should be more compact and more efficient to both read and write.
Backwards compatibility
Ion 1.0 and Ion 1.1 share the same data model. Any data that can be represented in Ion 1.0 can also be represented with full fidelity in Ion 1.1 and vice versa. This means that it is always possible to convert data from one version to the other without risk of data loss.
Ion 1.1 readers should be able to understand both Ion 1.0 and Ion 1.1 data.
The text encoding grammar of Ion 1.1 is a superset of Ion 1.0's text encoding grammar. Any Ion 1.0 text data can also be parsed by an Ion 1.1 text parser.
Ion 1.1's binary encoding is substantially different from Ion 1.0's binary encoding. Many changes have been made to make values more compact, faster to read and/or faster to write. Ion 1.0's type descriptors have been supplanted by Ion 1.1's more general opcodes, which have been organized to prioritize the most commonly used encodings and make leveraging macros as inexpensive as possible.
In both text and binary Ion 1.1, the Ion Version Marker syntax is compatible with Ion 1.0's version marker syntax.
This means that an Ion 1.0-only reader can correctly identify when a stream uses Ion 1.1 (allowing it to report an error), and an Ion 1.1 reader can correctly "downshift" to expecting Ion 1.0 data when it encounters a stream using Ion 1.0.
Two streams using different Ion versions can be safely concatenated together provided that they are both text or both binary. A concatenated stream containing both Ion 1.0 and Ion 1.1 can only be fully read by a reader that supports Ion 1.1. When appended to an Ion 1.1 stream, an Ion 1.0 stream must begin with the appropriate IVM to ensure that symbol tables are handled correctly, and when an Ion 1.0 stream is appended to another Ion 1.0 stream, an IVM may be desirable to prevent the encoding context from unintentionally leaking into the latter of the concatenated streams.
Upgrading an existing application to Ion 1.1 often requires little-to-no code changes,
as APIs typically operate at the data model level ("write an integer")
rather than at the encoding level ("write 0x64 followed by four little-endian bytes").
However, taking full advantage of macros after upgrading typically requires additional development time.
Macros, templates, and encoding expressions
Ion 1.1 introduces a new primitive called an encoding expression (E-expression). These expressions are (in text syntax) similar to S-expressions, but they are not part of the data model. E-expressions represent encoding details, and can be used to define macros, invoke macros, and modify the encoding context.
Template E-expressions are evaluated into one Ion value, which enables compact representation of Ion data. These E-expressions represent the invocation of user-defined macros with arguments that are either E-expressions themselves or value literals corresponding to the formal parameters of the macro's definition. The resulting stream is then expanded into values in the Ion data model.
Macro definitions
Macros can be defined by a user either directly in a default module within an encoding directive or in an externally defined module (i.e., a shared module). A macro may be named or anonymous; if it has a name, that name must be unique within its module.
In Ion binary, macros are always addressed in E-expressions by integer macro address. In Ion text, macros may be addressed by the offset in the local macro table (mirroring binary), by name, or by qualifying the macro name/offset with the module name in the encoding context. An E-expression can only refer to macros installed in the local macro table.
E-expression name resolution
// resolves to macro `bar` in the "default" module
(:bar)
// resolves to macro `bar` in the "foo" module
(:foo::bar)
// resolves to macro 5 in the local macro table
(:5)
Template definitions
User-defined macros are defined by their template, which determines both how they are invoked and what data they evaluate to.
This template is defined as Ion data with a special-purpose E-Expression to signify a placeholder for an argument to be substituted.
Placeholders may accept any type of value, with an optional default value to use if no value is provided for that argument.
Placeholders for "tagless" values—whose encodings do not begin with an opcode and are therefore more compact and less flexible than tagged values—require an encoding tag argument (e.g., {#int32}, {#float16}) to specify how the argument is encoded.
The macro definition includes a template body that defines how the macro is expanded.
Modules
Ion 1.0 uses symbol tables to group together related text values. To also accommodate macros, Ion 1.1 introduces modules, named organizational units that each contain:
- An exported symbol table, a list of text values used to compactly encode symbol tokens like field names, annotations, and symbol values.
- An exported macro table, a list of macro definitions used to compactly encode complete values or partially populated containers.
While Ion 1.0 does not have modules, it is reasonable to think of Ion 1.0's local symbol table as a module that only has symbols, and whose macro table is permanently empty.
Modules can be imported from the catalog (they subsume shared symbol tables) or defined locally.
Directives
A directive is a top-level e-expression that modifies the encoding context.
In text, directives use the e-expression name $ion, and the first child value is an operation name.
In binary, each directive has its own opcode.
The operation determines what changes will be made to the encoding context and which values or clauses may legally follow.
(:$ion operation_name /*...*/ )
In Ion v1.1, there are eight supported directive operations:
In Ion 1.1, directives must be used to modify the symbol or macro table. Ion 1.0 symbol table syntax is not supported.
Shared Modules
Ion 1.1 extends the concept of a shared symbol table to be a shared module. An Ion 1.0 shared symbol table is a shared module with no macro definitions. A new schema for serializing shared modules in Ion is introduced in Ion 1.1. An Ion 1.1 implementation should support both Ion 1.0 shared symbol tables and Ion 1.1 shared modules in its catalog.
Text syntax changes
Ion 1.1 text must use the $ion_1_1 version marker at the top-level of the data stream or document.
Ion 1.1 introduces new syntax elements to represent e-expressions and tagless values.
The introduction of encoding expression (E-expression) syntax allows for the invocation of macros in the data stream.
This syntax is grammatically similar to S-expressions, except that these expressions are opened with (: and closed with ).
For example, (:a 1 2) would expand the macro named a with the arguments 1 and 2.
This syntax is allowed anywhere an Ion value is allowed.
See the Macros, templates, and encoding expressions section for details.
Tagless values are primarily a concern of the binary encoding, but there is a text encoding for them so that data can be
transcoded between text and binary without loss.
The tag of a tagless value is represented in text as {#<type>}, where <type> can be any valid Tagless Scalar Type
opcode or its alias, or as {:<macro-reference>} where <macro-reference> is a valid macro name, qualified macro name, or macro id.
See the sections on Tagless-Element Sequences for more details.
Binary encoding changes
Ion 1.1 binary encoding reorganizes the type descriptors to support compact E-expressions, make certain encodings
more compact, and certain lower priority encodings marginally less compact (for greater detail see Type Encoding Changes). The IVM for this encoding is the octet
sequence 0xE0 0x01 0x01 0xEA.
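Because the IVM syntax is shared between versions, a reader can pick its decoding mode from the first four bytes alone. The following is a minimal sketch of that check. It assumes the Ion 1.0 binary IVM is 0xE0 0x01 0x00 0xEA, so that the two middle bytes carry the major and minor version; the function name `ion_binary_version` is hypothetical and not part of any Ion library.

```python
def ion_binary_version(stream: bytes):
    """Return (major, minor) if `stream` starts with a binary IVM, else None.

    Assumption: the IVM layout is 0xE0, major, minor, 0xEA.
    """
    if len(stream) < 4 or stream[0] != 0xE0 or stream[3] != 0xEA:
        return None
    return (stream[1], stream[2])


# The Ion 1.1 IVM given in the text above.
print(ion_binary_version(bytes([0xE0, 0x01, 0x01, 0xEA])))
# The (assumed) Ion 1.0 IVM; an Ion 1.1 reader would "downshift" here.
print(ion_binary_version(bytes([0xE0, 0x01, 0x00, 0xEA])))
# Anything else is not an IVM.
print(ion_binary_version(b"\x00\x01\x02\x03"))
```

An Ion 1.0-only reader could use the same check to report an error when the minor version is not 0.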
Inlined symbol tokens
In binary Ion 1.0, symbol values, field names, and annotations are required to be encoded using a symbol ID in the local symbol table. For some use cases (e.g. RPC or small, independent values where the symbol table overhead cannot be amortized) this creates a burden on the writer and may not actually be efficient for an application. Ion 1.1 introduces optional binary syntax for encoding inline UTF-8 sequences for these tokens which can allow an encoder to have flexibility in whether and when to add a given text value to the symbol table.
Ion text requires no change for this feature as it already had inline symbol tokens without using the local symbol
table. Ion text also has compatible syntax for representing the local symbol table and encoding of symbol tokens with
their position in the table (i.e., the $id syntax).
See FlexSym documentation for greater detail.
Delimited containers
In Ion 1.0, all data is length prefixed. While this is good for optimizing the reading of data, it requires an Ion encoder to buffer any data in memory to calculate the data's length. Ion 1.1 introduces optional binary syntax to allow containers to be encoded with an end marker instead of a length prefix.
See the relevant list, sexp, and struct delimited encoding sections for greater detail.
Tagless-Element Sequences
In Ion 1.0, all lists and s-expressions can contain heterogeneous values. In practice, however, many applications use collections of homogeneous values, so Ion 1.1 introduces Tagless-Element Sequences to represent such collections. Tagless-Element (TE) lists and s-expressions make it possible to encode homogeneous data even more compactly, and enable optimizations in Ion reader and writer implementations, such as zero-copy reads of certain primitive types.
See the section on Tagless-Element List or Tagless-Element S-Exp for more details.
Low-level binary encoding changes
Ion 1.0's VarUInt and VarInt encoding primitives
used big-endian byte order and used the high bit of each byte to indicate whether it was the final byte in the encoding.
VarInt used an additional bit in the first byte to represent the integer's sign. Ion 1.1 replaces these primitives
with more optimized versions called FlexUInt and FlexInt.
FlexUInt and FlexInt use little-endian byte order, avoiding the need for reordering on common architectures like
x86, aarch64, and RISC-V.
Rather than using a bit in each byte to indicate the width of the encoding, FlexUInt and FlexInt front-load
the continuation bits. In most cases, this means that these bits all fit in the first byte of the representation,
allowing a reader to determine the complete size of the encoding without having to inspect each byte individually.
Finally, FlexInt does not use a separate bit to indicate its value's sign. Instead, it uses two's complement
representation, allowing it to share much of the same structure and parsing logic as its unsigned counterpart.
Benchmarks have shown that, in aggregate, these encoding changes are between 1.25x and 3x faster than Ion 1.0's
VarUInt and VarInt encodings, depending on the host architecture.
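To make "front-loaded continuation bits" concrete, here is a rough sketch of a FlexUInt encoder and decoder. The exact bit layout is not given in this overview, so the sketch assumes one: an n-byte encoding holds the value shifted left by n bits, with a single 1 bit at position n-1 and zeros below it, all written in little-endian order. The function names are illustrative only.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    n = 1
    while value >> (7 * n):  # each byte contributes 7 bits of payload
        n += 1
    # n-1 zero bits, then a 1 bit, then the value; little-endian byte order.
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def decode_flex_uint(data: bytes):
    """Return (value, size_in_bytes) for the FlexUInt at the start of `data`."""
    n = 1
    for byte in data:
        if byte == 0:  # eight more zero bits: the length marker spills over
            n += 8
            continue
        n += (byte & -byte).bit_length() - 1  # trailing zero bits in this byte
        break
    return int.from_bytes(data[:n], "little") >> n, n


for v in (0, 14, 127, 128, 729, 2**40):
    encoded = encode_flex_uint(v)
    print(v, encoded.hex(), decode_flex_uint(encoded))
```

Under this layout the decoder learns the full width from the low bits of the first byte in the common case, rather than testing a flag in every byte as VarUInt required.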
Ion 1.1 supplants Ion 1.0's Int encoding primitive
with a new encoding called FixedInt, which uses two's complement notation instead of sign-and-magnitude.
A corresponding FixedUInt primitive has also been introduced; its encoding is nearly the same as
Ion 1.0's UInt primitive, save that UInt is big endian where FixedUInt is little endian.
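The difference between the two integer primitives can be seen with Python's built-in byte conversions. This is an illustrative sketch only: `ion10_int` models Ion 1.0's Int as big-endian sign-and-magnitude with the sign in the high bit of the first byte (an assumption about a detail not restated here), and `fixed_int` models FixedInt as little-endian two's complement. Both helper names are hypothetical.

```python
def ion10_int(value: int, size: int) -> bytes:
    """Big-endian sign-and-magnitude, sign in the top bit of the first byte."""
    magnitude = abs(value)
    if magnitude >> (8 * size - 1):
        raise OverflowError("magnitude does not fit alongside the sign bit")
    body = bytearray(magnitude.to_bytes(size, "big"))
    if value < 0:
        body[0] |= 0x80
    return bytes(body)


def fixed_int(value: int, size: int) -> bytes:
    """Little-endian two's complement."""
    return value.to_bytes(size, "little", signed=True)


print(ion10_int(-3, 2).hex())  # sign-and-magnitude form
print(fixed_int(-3, 2).hex())  # two's complement form
print(int.from_bytes(fixed_int(-3, 2), "little", signed=True))
```

The two's complement form can be loaded directly into a native integer on little-endian hardware, which is the motivation given above for the change.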
A new primitive encoding type, FlexSym, has been introduced to flexibly encode
symbol IDs and symbol tokens with inline text.
TIP: FlexSym makes it possible for a writer to emit any Ion value as binary without requiring a symbol table.
This is generally less efficient when working with multiple values but there are use cases where it is convenient.
Type encoding changes
All Ion types use the new low-level encoding primitives described in the previous section. Ion 1.0's type descriptors have been supplanted by Ion 1.1's more general opcodes, which have been organized to prioritize the most commonly used encodings and make leveraging macros as inexpensive as possible.
Typed null values are now encoded in two bytes using the 0x8F opcode.
Symbol values using symbol IDs now have 8 opcodes (versus 15 type IDs in Ion 1.0), but the representation has been made more efficient. Symbol IDs below ~2 billion are now, on average, more compact than in Ion 1.0.
Lists and S-expressions have three encodings:
a length-prefixed encoding, a new delimited form that ends with the 0xEF opcode, and a tagless-element encoding that is prefixed with an opcode and the number of elements in the list or s-expression.
Struct values encode their field names as a FlexSym, enabling them to write field name text inline
instead of adding all names to the symbol table. There is now also a delimited form.
Similarly, symbol values now also have the option of encoding their symbol text inline.
Annotation sequences are a repeatable prefix to the value they decorate, and no longer have an outer length container.
They are now encoded with one of two opcodes, 0x58 or 0x59:
- Opcode 0x58 indicates one annotation encoded as a symbol address.
- Opcode 0x59 indicates one annotation encoded as a FlexSym.
Integers now use a FixedInt sub-field instead of the Ion 1.0 encoding which used sign-and-magnitude (with two opcodes).
Decimals are structurally identical to their Ion 1.0 counterpart except in how they encode a negative-zero coefficient.
The Ion 1.1 FlexInt encoding is two's complement, so negative zero cannot be encoded directly with it.
Instead, an implicit zero coefficient is positive zero, and an explicit zero coefficient is negative zero.
Timestamps no longer encode their sub-field components as octet-aligned fields.
The Ion 1.1 format uses a packed bit encoding and has a biased form (encoding the year field as an offset from 1970) to make common encodings of timestamp easily fit in a 64-bit word for microsecond and nanosecond precision (with UTC offset or unknown UTC offset). Benchmarks have shown this new encoding to be 40% smaller, 59% faster to encode and 21% faster to decode in-range timestamps. A non-biased, arbitrary length timestamp with packed bit encoding is defined for uncommon cases.
Encoding expressions in binary
See the binary E-expressions documentation to learn more about how e-expressions are encoded in binary.
Macros
Like other self-describing formats, Ion 1.0 makes it possible to write a stream with truly arbitrary content—no formal schema required. However, in practice all applications have a de facto schema, with each stream sharing large amounts of predictable structure and recurring values. This means that Ion readers and writers often spend substantial resources processing undifferentiated data.
Consider this example excerpt from a webserver's log file:
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.100",
resource: "index.html"
}
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.100",
resource: "images/funny.jpg"
}
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.101",
resource: "index.html"
}
Macros allow users to define fill-in-the-blank templates for their data. This enables applications to focus on encoding and decoding the parts of the data that are distinctive, eliding the work needed to encode the boilerplate.
Using this macro definition:
(getOk
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::(:?),
resource: (:?)
})
The same webserver log file could be written like this:
(:getOk "192.168.1.100" "index.html")
(:getOk "192.168.1.100" "images/funny.jpg")
(:getOk "192.168.1.101" "index.html")
Macros are an encoding-level concern, and their use in the data stream is invisible to consuming applications. For writers, macros are always optional—a writer can always elect to write their data using value literals instead.
For a guided walkthrough of what macros can do, see Macros by example.
Macros by example
Before getting into the technical details of Ion's macro and module system, it will help to be more familiar with the use of macros. We'll step through increasingly sophisticated use cases, some admittedly synthetic for illustrative purposes, with the intent of teaching the core concepts and moving parts without getting into the weeds of more formal specification.
Ion macros are defined using Ion data structures to represent their templates, plus placeholders.
In this document, the fundamental construct we explore is the macro
definition, denoted using an S-expression of the form (name template) where name must be a
symbol denoting the macro's name.
NOTE: Macros can only be defined within directives like set_macros or add_macros, or within
modules. We will mostly omit this context in the examples below to keep things simple.
Constants
The most basic macro is a constant:
(pi 3.141592653589793)
This declaration defines a macro named pi. The 3.141592653589793 is the template - just a
plain Ion decimal value. Since there are no placeholders in the template, this macro accepts no
arguments and always returns a constant value.
To use pi in an Ion document, we write an encoding expression or E-expression:
$ion_1_1
(:pi)
The syntax (:pi) looks a lot like an S-expression. It's not, though, since colons
cannot appear unquoted in that context. Ion 1.1 makes use of syntax that is not valid in Ion
1.0—specifically, the (: digraph—to denote E-expressions. Those characters must be followed by
a reference to a macro, and we say that the E-expression is an invocation of the macro. Here,
(:pi) is an invocation of the macro named pi.
That document is equivalent to the following, in the sense that they denote the same data:
$ion_1_1
3.141592653589793
The process by which the Ion implementation turns the former document into the latter is called
macro expansion or just expansion. This happens transparently to
Ion-consuming applications: the stream of values in both cases is the same. The documents have
the same content, encoded in two different ways. It's reasonable to think of (:pi) as a custom
encoding for 3.141592653589793, and the notation's similarity to S-expressions leads us to the
term "encoding expression" (or "e-expression").
NOTE: Any Ion 1.1 document with macros can be fully expanded into an equivalent Ion 1.0 document.
We can streamline future examples with a couple of conventions. First, assume that any E-expression
is occurring within an Ion 1.1 document; second, we use the relation notation, ⇒, to mean "expands to".
So we can say:
(:pi) ⇒ 3.141592653589793
Placeholders
Most macros are not constant—they accept inputs that determine their results.
(wrap_value {value: (:?)})
This macro has a template that is a struct containing a single placeholder (:?). The placeholder indicates
that the macro accepts one argument. When invoked, the placeholder is replaced with the argument value.
Note that a template cannot consist solely of a placeholder (annotated or unannotated) - it must be wrapped in a container or have other content.
(:wrap_value 1) => {value: 1}
(:wrap_value "foo") => {value: "foo"}
(:wrap_value [a, b, c]) => {value: [a, b, c]}
The (:?) is a tagged placeholder - it accepts any Ion value including nulls and annotated values.
Simple Templates
Here's a more realistic macro:
(price { amount: (:?), currency: (:?) })
This macro's template is a struct with two placeholders. The macro therefore accepts two arguments when invoked.
(:price 99 USD) ⇒ { amount: 99, currency: USD }
Template expressions that are structs are interpreted almost literally;
the field names are literal—which is why the amount and currency field names show up as-is in the expansion—but the field values may contain placeholders.
Templates also treat lists quasi-literally, where each element inside the list may be a literal value or a placeholder. Here's a simple macro to illustrate:
(two_item_list [(:?), (:?)])
(:two_item_list foo bar) ⇒ [foo, bar]
E-expressions can accept other e-expressions as arguments. For example:
(:two_item_list (:price 99 USD) foo)
// └──────┬──────┘
// └─── passing another e-expression as an argument
Expansion happens from the "inside out". The outer e-expression receives the results from the expansion of the inner e-expression.
(:two_item_list (:price 99 USD) foo)
// First, the inner invocation of `price` is expanded...
=> (:two_item_list {amount: 99, currency: USD} foo)
// ...and then the outer invocation of `two_item_list` is expanded.
=> [{amount: 99, currency: USD}, foo]
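The inside-out order can be mimicked with a small toy model. The sketch below is not how an Ion reader is implemented; it represents templates as plain Python lists and dicts, models symbols such as USD as Python strings, and uses hypothetical names (`PLACEHOLDER`, `Invoke`, `expand`) chosen only for illustration.

```python
PLACEHOLDER = object()  # stands in for (:?)

# Toy macro table holding the two templates defined above.
MACROS = {
    "price": {"amount": PLACEHOLDER, "currency": PLACEHOLDER},
    "two_item_list": [PLACEHOLDER, PLACEHOLDER],
}


class Invoke:
    """Toy model of an E-expression: a macro name plus its arguments."""

    def __init__(self, name, *args):
        self.name = name
        self.args = args


def substitute(template, args):
    """Copy `template`, filling placeholders from `args` in order of appearance."""
    if template is PLACEHOLDER:
        return next(args)
    if isinstance(template, list):
        return [substitute(item, args) for item in template]
    if isinstance(template, dict):
        return {name: substitute(value, args) for name, value in template.items()}
    return template  # scalars are copied literally


def expand(value):
    """Expand nested invocations first, then the invocation that contains them."""
    if isinstance(value, Invoke):
        args = [expand(arg) for arg in value.args]
        return substitute(MACROS[value.name], iter(args))
    return value


result = expand(Invoke("two_item_list", Invoke("price", 99, "USD"), "foo"))
print(result)
```

By the time `two_item_list` is substituted, its first argument is already the fully expanded struct, matching the two steps shown above.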
Default Values
Tagged placeholders can provide default values that are used when no argument is provided:
(temperature {
degrees: (:?),
scale: (:? K) // Default value is K
})
When invoking this macro, you can omit the second argument by passing (:) (which means "no argument"):
(:temperature 96 F) ⇒ {degrees: 96, scale: F}
(:temperature 283 (:)) ⇒ {degrees: 283, scale: K}
Note that when no value is provided for the scale placeholder, it uses the default value K.
E-expressions in Templates
Templates can contain E-expressions, which are expanded when the macro is defined, not when it's invoked. This allows you to use previously defined macros to construct your template.
(:$ion set_macros (prefix_suffix [(:?), (:?)]))
(:$ion add_macros (website_url (:prefix_suffix "https://www.amazon.com/" (:?))))
The website_url macro's template contains an E-expression (:prefix_suffix ...) which is
expanded when the macro is defined.
For example:
(:$ion add_macros (website_url (:prefix_suffix "https://www.amazon.com/" (:?))))
⇒ (:$ion add_macros (website_url ["https://www.amazon.com/", (:?)]))
(:website_url "gp/cart") ⇒ ["https://www.amazon.com/", "gp/cart"]
Tagless Placeholders
In addition to tagged placeholders, Ion 1.1 supports tagless placeholders that specify a particular type. These are more restrictive but enable more compact binary encoding.
(point {x: (:? {#int}), y: (:? {#int})})
This macro uses tagless placeholders that require integer arguments. The {#int} is an encoding tag
indicating the placeholder expects an integer value.
(:point 3 17) ⇒ {x: 3, y: 17}
Tagless placeholders have important restrictions:
- They cannot accept null values
- They cannot accept annotated values
- They cannot be omitted (they're always required)
- Arguments must match the specified type
(:point null.int 17) ⇒ // Error: tagless int does not accept nulls
(:point a::3 17) ⇒ // Error: tagless int does not accept annotations
(:point (:) 17) ⇒ // Error: tagless parameters cannot be omitted
While Ion text syntax doesn’t use tags—the types are built into the syntax—these errors ensure that a text E-expression may only express things that can also be expressed using an equivalent binary E-expression.
Tagless encodings offer no real benefit in text; they exist to make the binary encoding denser.
This density comes at the cost of flexibility: primitive types cannot be annotated or null, and arguments cannot be expressed using macros.
See Tagless Encodings for the complete list of tagless types.
Annotated Macros
Macros can be annotated, which causes the expanded value to have that annotation prepended to any annotations present on the expanded value.
(bar [bar])
(foobar foo::[bar])
baz::(:bar) ⇒ baz::[bar]
baz::(:foobar) ⇒ baz::foo::[bar]
Annotated Placeholders
Placeholders can be annotated, which causes the expanded value to have that annotation prepended to any existing annotations.
(annotated_value [foo::(:?)])
(:annotated_value 42) ⇒ [foo::42]
(:annotated_value bar::42) ⇒ [foo::bar::42]
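The prepending rule is the same for annotated macros and annotated placeholders: the outer annotations go in front of whatever the inner value already carries. A minimal sketch of that rule follows, using a hypothetical `Annotated` wrapper to model an Ion value with annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    """Toy model of an annotated Ion value."""
    annotations: tuple
    value: object


def annotate(outer, value):
    """Prepend `outer` annotations to any annotations already on `value`."""
    if isinstance(value, Annotated):
        return Annotated(tuple(outer) + value.annotations, value.value)
    return Annotated(tuple(outer), value)


# foo::(:?) filled with 42, then with bar::42
print(annotate(["foo"], 42))
print(annotate(["foo"], Annotated(("bar",), 42)))
```

The second call yields the annotation sequence foo, bar, mirroring the [foo::bar::42] expansion above.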
Argument Evaluation
It's important to understand that arguments to macros are always single Ion values. Even if an argument is a collection (like a list or struct), it's passed as a single value and inserted into the template as-is.
When an E-expression is used as an argument, it is fully expanded before being passed to the macro. The macro receives the result of that expansion, which must be a single Ion value.
(:$ion set_macros
(wrap_in_list [(:?)])
(struct_constant {x: 1, y: 2})
)
(:wrap_in_list (:struct_constant)) ⇒ [{x: 1, y: 2}]
In this example, (:struct_constant) is expanded first to produce {x: 1, y: 2}, and then that single struct value is passed to wrap_in_list, which wraps it in a list.
E-expressions in Templates
E-expressions can appear in template definitions, but they are expanded when the macro is defined, not when it's invoked. Additionally, E-expressions may only appear in value position - they cannot appear in field name position in structs.
(:$ion set_macros
(base_config {type: "standard", size: 100})
// The following would be an error:
// (extended_config {
// (:base_config), // ERROR: E-expressions cannot appear in field position
// extra: true
// })
)
Directives
Ion 1.1 includes built-in directives for managing the encoding context. These directives produce system values (not user-visible values) and can only appear at the top level of a stream. See directives.
Defining macros
Macros are defined within directives like set_macros or add_macros, or in modules.
Syntax
(name template)
| Argument | Description |
|---|---|
| name | A unique name assigned to the macro. When constructing an anonymous macro, null is used in the place of a unique name. |
| template | A single Ion value that may contain placeholders. |
Example macro definition
(foo // ─── name
{ // ─┐
x: (:?), // │
y: (:?), // ├─ template with placeholders
z: (:?), // │
} // ─┘
)
Macro names
The lexical name given to a macro must be an identifier.
However, it must not begin with a $—this is reserved for system-defined bindings like $ion.
In some circumstances, it may not make sense to name a macro. (For example, when the macro is generated automatically.)
In such cases, authors must use null to indicate that the macro does not have a name.
Anonymous macros can only be referenced by their address in the macro table.
Macro signatures
Macro signatures are derived from the placeholders in the macro template body. The signature is determined by analyzing the placeholders that appear in the template.
Important: Within macro definitions, the order of struct fields matters when determining the signature.
Placeholder types and signatures
The type of placeholder determines the parameter characteristics:
- Tagged placeholders (:?) - Optional; accepts any Ion value
- Tagged placeholders with defaults (:? default_value) - Optional with default; accepts any Ion value
- Tagless placeholders with primitive type (:? {#type}) - Required; specifies a type from the enumerated set of primitives
All tagged parameters are optional parameters (can be elided with (:)).
All tagless parameters are required parameters (cannot be elided).
Example signature derivation
(example {
a: (:?), // First parameter: tagged
b: (:? 10), // Second parameter: tagged with default
c: (:? {#int8}) // Third parameter: tagless int8
})
This macro accepts 3 arguments:
- Any Ion value (optional, can be (:))
- Any Ion value with default 10 (optional, can be (:))
- An int8 value (required, cannot be null or annotated)
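Signature derivation amounts to walking the template in order and recording each placeholder encountered. The sketch below illustrates this with a toy model in which structs are Python dicts (which preserve field order) and placeholders are instances of a hypothetical `Placeholder` class; none of these names come from an Ion implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional

_NO_DEFAULT = object()


@dataclass
class Placeholder:
    tag: Optional[str] = None   # e.g. "int8" for (:? {#int8}); None means tagged
    default: Any = _NO_DEFAULT  # set for (:? default_value)


def signature(template):
    """List parameter descriptions in order of placeholder appearance."""
    params = []

    def walk(node):
        if isinstance(node, Placeholder):
            if node.tag is not None:
                params.append(f"required tagless {node.tag}")
            elif node.default is not _NO_DEFAULT:
                params.append(f"optional tagged, default {node.default!r}")
            else:
                params.append("optional tagged")
        elif isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(template)
    return params


example = {
    "a": Placeholder(),
    "b": Placeholder(default=10),
    "c": Placeholder(tag="int8"),
}
for position, description in enumerate(signature(example), start=1):
    print(position, description)
```

Reordering the fields of `example` would reorder the parameters, which is why struct field order matters inside a macro definition.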
Template body
The macro's template is a single Ion value that defines how a reader should expand invocations of the macro.
Within the template there may be placeholders, which indicate that a macro argument should be substituted into that position in the template body.
Important restrictions:
- The template body cannot "call" other macros. Any E-Expressions in the template body are expanded before the template is added to the macro table.
- A template body consisting solely of a placeholder (annotated or unannotated) is not allowed.
Ion scalars
Ion scalars are interpreted literally. These include values of any type except list, sexp, and struct.
null values of any type—even null.list, null.sexp, and null.struct—are also interpreted literally.
Examples
These macros are constants; they take no parameters. When they are invoked, they expand to a single value: the Ion scalar acting as the template expression.
(:$ion set_macros
(greeting "hello")
(birthday 1996-10-11)
// Annotations are also literal
(price USD::29.95)
)
(:greeting) => "hello"
(:birthday) => 1996-10-11
(:price) => USD::29.95
Placeholders
Templates can insert macro arguments into their output by using placeholders. Placeholders are positional - they correspond to arguments based on their order of appearance in the template.
Key restrictions:
- Placeholders may only occur in value position
- Placeholders may only occur in a macro body; they are illegal anywhere else
- Placeholders may represent a tagged parameter or a tagless parameter
- Placeholders for a tagged parameter may optionally provide a single value that will be used as a default value when no value is provided for that parameter
- Placeholders for a tagless parameter must indicate the tagless scalar type of the argument
- All tagged parameters are optional parameters
- All tagless parameters are required parameters
- A template body consisting solely of a placeholder (annotated or unannotated) is not allowed
- A placeholder may be annotated
Example
// Given
(:$ion set_macros
(line_from_origin { x0: 0, y0: 0, x1: (:? {#int8}), y1: (:? 99) })
)
// When invoked
(:line_from_origin 5 10) => { x0: 0, y0: 0, x1: 5, y1: 10 }
(:line_from_origin 5 (:)) => { x0: 0, y0: 0, x1: 5, y1: 99 }
Quasi-literal Ion containers
When an Ion container appears in a template definition, it is interpreted almost literally.
Each nested value in the container is inspected.
- If the value is an Ion scalar, it is added to the output as-is.
- If the value is a placeholder, the argument bound to that placeholder is added to the output. The placeholder literal (for example: (:?)) is discarded.
- If the value is an E-Expression, it is expanded before the template is added to the macro table, and the resulting values are included in the template.
- If the value is a container, the reader will recurse into the container and repeat this process.
Important: The template body cannot "call" other macros at expansion time. Any E-Expressions in the template body are expanded when the macro is defined, not when it is invoked.
Placeholders within a sequence
Placeholders may be used within sequence types (lists or s-expressions). When an argument to such a placeholder is (:),
no value is inserted into the sequence.
(:$ion set_macros
(short_list [(:?), (:?), (:?)])
(short_sexp ((:?) (:?) (:?)))
)
(:short_list a b c) => [a, b, c]
(:short_sexp a (:) c) => (a c)
Placeholders within a struct
Placeholders may be used within structs. Arguments are paired with the corresponding field name in the template body.
When an argument is (:), the field name and value are elided.
(:$ion set_macros
(resident
{
town: "Riverside",
id: (:?),
name: (:?)
}
)
)
(:resident "abc" "Alice") =>
{
town: "Riverside",
id: "abc",
name: "Alice"
}
(:resident "def" "John") =>
{
town: "Riverside",
id: "def",
name: "John"
}
(:resident "ghi" (:)) =>
{
town: "Riverside",
id: "ghi",
}
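The elision rules for placeholders in sequences and structs can be modelled in a few lines of Python. This is an illustrative sketch only, not part of the specification: `Placeholder`, `EMPTY` (standing in for the empty argument `(:)`), and `expand` are hypothetical names, and Ion lists, s-expressions, and structs are approximated by Python lists and dicts.

```python
from dataclasses import dataclass
from typing import Any, Iterator

EMPTY = object()  # hypothetical stand-in for the empty argument `(:)`


@dataclass
class Placeholder:
    """Hypothetical stand-in for `(:?)`; `default` models a form such as `(:? 99)`."""
    default: Any = EMPTY


def expand(template: Any, args: Iterator[Any]) -> Any:
    """Fill placeholders in document order, consuming one argument per placeholder."""
    if isinstance(template, Placeholder):
        arg = next(args)
        return template.default if arg is EMPTY else arg
    if isinstance(template, list):
        out = []
        for item in template:
            value = expand(item, args)
            if value is not EMPTY:  # `(:)` inserts nothing into a sequence
                out.append(value)
        return out
    if isinstance(template, dict):
        fields = {}
        for name, item in template.items():
            value = expand(item, args)
            if value is not EMPTY:  # `(:)` elides both the field name and the value
                fields[name] = value
        return fields
    return template  # scalars are copied to the output as-is


short_sexp = [Placeholder(), Placeholder(), Placeholder()]
resident = {"town": "Riverside", "id": Placeholder(), "name": Placeholder()}

print(expand(short_sexp, iter(["a", EMPTY, "c"])))
print(expand(resident, iter(["ghi", EMPTY])))
```

The two calls mirror the `(:short_sexp a (:) c)` and `(:resident "ghi" (:))` invocations shown above.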
Tagless Encodings
Tagless encodings may be specified by encoding tags in template placeholders and in tagless-element sequences. In binary, this allows the opcode to be elided from the encoding of values that fill tagless slots.
Consider the following data:
[
{ sum: 123, sample_count: 12 },
{ sum: 54, sample_count: 8 },
{ sum: 125, sample_count: 15 },
{ sum: 314, sample_count: 30 },
]
With a macro such as (metric {sum: (:? {#int}), sample_count: (:? {#int}), unit: ms}), it can be encoded
using a tagless-element list as follows:
[{:metric} // {:<macro_name>} specifies a macro-shape
(123 12),
(54 8),
(125 15),
(314 30),
]
The available tagless encodings are enumerated in the following table.
| Ion Type | Encoding Name | Size in bytes | Binary Encoding Tag | Encoding |
|---|---|---|---|---|
| int | int | variable | 0x60 | [FlexInt][flexint] |
| int | int8 | 1 | 0x61 | [FixedInt][fixedint] |
| int | int16 | 2 | 0x62 | [FixedInt][fixedint] |
| int | int32 | 4 | 0x64 | [FixedInt][fixedint] |
| int | int64 | 8 | 0x68 | [FixedInt][fixedint] |
| int | uint | variable | 0xE0 | [FlexUInt][flexuint] |
| int | uint8 | 1 | 0xE1 | [FixedUInt][fixeduint] |
| int | uint16 | 2 | 0xE2 | [FixedUInt][fixeduint] |
| int | uint32 | 4 | 0xE4 | [FixedUInt][fixeduint] |
| int | uint64 | 8 | 0xE8 | [FixedUInt][fixeduint] |
| float | float16 | 2 | 0x6B | [Little-endian IEEE-754 half-precision float][f16] |
| float | float32 | 4 | 0x6C | [Little-endian IEEE-754 single-precision float][f32] |
| float | float64 | 8 | 0x6D | [Little-endian IEEE-754 double-precision float][f64] |
| decimal | small_decimal | variable (2+) | 0x70 | Tuple of (int,int8) representing the coefficient and exponent respectively. |
| timestamp | timestamp_day | 2 | 0x82 | [Day precision timestamp][t1] |
| timestamp | timestamp_min | 4 | 0x83 | [Minute precision timestamp][t2] |
| timestamp | timestamp_s | 5 | 0x84 | [Second precision timestamp][t3] |
| timestamp | timestamp_ms | 6 | 0x85 | [Millisecond precision timestamp][t4] |
| timestamp | timestamp_us | 7 | 0x86 | [Microsecond precision timestamp][t5] |
| timestamp | timestamp_ns | 8 | 0x87 | [Nanosecond precision timestamp][t6] |
| symbol | symbol | variable | 0xEE | [FlexSym][flexsym] |
Although both primitive and macro-shape encodings may be used in tagless-element sequences, only the primitive encodings in the above table may be used in tagless template placeholders.
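To give a feel for what eliding the opcode buys, the following Python sketch serializes values for the fixed-width integer encodings in the table. It is a rough illustration rather than a normative encoder: `FIXED_SIZES` and `encode_fixed` are hypothetical names, and the byte order (little-endian, two's complement for the signed variants) is an assumption of this sketch that should be checked against the FixedInt and FixedUInt definitions.

```python
# Sizes in bytes, copied from the table of tagless encodings.
FIXED_SIZES = {
    "int8": 1, "int16": 2, "int32": 4, "int64": 8,
    "uint8": 1, "uint16": 2, "uint32": 4, "uint64": 8,
}


def encode_fixed(encoding_name: str, value: int) -> bytes:
    """Serialize `value` with no leading opcode (hypothetical helper)."""
    size = FIXED_SIZES[encoding_name]
    signed = not encoding_name.startswith("u")
    # ASSUMPTION: little-endian byte order; two's complement when signed.
    # int.to_bytes raises OverflowError if the value does not fit.
    return value.to_bytes(size, "little", signed=signed)


# Three uint8 arguments occupy exactly three bytes in total.
rgb = b"".join(encode_fixed("uint8", channel) for channel in (255, 204, 0))
print(rgb.hex())
```

Because the encoding is fixed by the placeholder or the element tag, the reader already knows how many bytes to consume, which is why no per-value opcode is needed.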
Ion 1.1 modules
In Ion 1.0, each stream has a symbol table. The symbol table stores text values that can be referred to by their integer index in the table, providing a much more compact representation than repeating the full UTF-8 text bytes each time the value is used. Symbol tables do not store any other information used by the reader or writer.
Ion 1.1 introduces the concept of a macro table. It is analogous to the symbol table, but instead of holding text values it holds macro definitions.
Ion 1.1 also introduces the concept of a module, an organizational unit that holds a (symbol table, macro table) pair.
tip
You can think of an Ion 1.0 symbol table as a module with an empty macro table.
In Ion 1.1, each stream has an encoding module sequence: a collection of modules whose symbols and macros are being used to encode the current segment.
Module interface
The interface to a module consists of:
- its spec version, denoting the Ion version used to define the module
- its exported symbols, an array of strings denoting symbol content
- its exported macros, an array of <name, macro> pairs, where all names (where specified) are unique identifiers
The spec version is external to the module body and the precise way it is determined depends on the type of module being defined. This is explained in further detail in Module Versioning.
The exported macro array is denoted by the module’s macros clause, with addresses
allocated to macros or macro bindings in the order they are declared.
The exported symbol array is denoted by the symbols clause of a module definition, and
by the symbols field of a shared symbol table. The new symbols defined in the symbols clause are not available in the preceding macros clause.
The exported symbols and exported macros are defined in the module body.
Types of modules
There are multiple types of modules. All modules share the same interface, but vary in their implementation in order to support a variety of different use cases.
| Module Type | Purpose |
|---|---|
| Local Modules | Organizing symbols and macros within a scope |
| Shared Modules | Defining reusable symbols and macros outside of the data stream |
| System Modules | Defining system symbols and macros |
| Encoding Modules | Encoding the current stream segment |
Module versioning
Every module definition has a spec version that determines the syntax and semantics of the module body. A module’s spec version is expressed in terms of a specific Ion version; the meaning of the module is as defined by that version of the Ion specification.
The spec version for a local module is inherited from its parent scope, which may be the stream itself. The spec version of a system module is the Ion version in which it was specified.
To ensure that all consumers of a module can properly understand it, a module can only import shared modules defined with the same or earlier spec version.
Examples
The spec version of a local module is always the same as the spec version of its enclosing scope. If the local module is defined at the top level of the stream, its spec version is the Ion version of the current segment.
$ion_1_1
(:$ion module foo
// Module semantics specified by Ion 1.1
...
)
// ...
$ion_1_3
(:$ion module foo
// Module semantics specified by Ion 1.3
...
)
//... // Assuming no IVM
(:$ion module bar
// Module semantics specified by Ion 1.3
...
)
Identifiers
Many of the grammatical elements used to define modules and macros are identifiers--symbols that do not require quotation marks.
More explicitly, an identifier is a sequence of one or more ASCII letters, digits, or the characters $ (dollar sign) or _ (underscore), not starting with a digit.
It also cannot be of the form $\d+, which is the syntax for symbol IDs (for example: $3, $10, $458, etc.), nor can it be a keyword (true, false, null, or nan).
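The identifier rules above translate directly into a small validator. The Python sketch below is illustrative; `is_identifier` is a hypothetical helper name, not part of any Ion library.

```python
import re

# One or more ASCII letters, digits, `$`, or `_`, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
# The symbol ID syntax ($3, $10, $458, ...) is excluded.
_SYMBOL_ID = re.compile(r"\$[0-9]+")
# Keywords are excluded as well.
_KEYWORDS = {"true", "false", "null", "nan"}


def is_identifier(text: str) -> bool:
    """Return True if `text` satisfies the identifier rules described above."""
    if _IDENTIFIER.fullmatch(text) is None:
        return False
    if _SYMBOL_ID.fullmatch(text) is not None:
        return False
    return text not in _KEYWORDS


for candidate in ["foo", "$ion", "_", "$10", "9lives", "nan", "my-name"]:
    print(candidate, is_identifier(candidate))
```

Note that this checks only the general identifier syntax; additional restrictions, such as the rule that a module name must not begin with `$`, are applied separately.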
Defining modules
A module is defined by two kinds of subclauses which, if present, always appear in the same order.
- macros - an exported list of macro definitions
- symbols - an exported list of text values
The lexical name given to a module definition must be an identifier.
However, it must not begin with a $—this is reserved for system-defined bindings like $ion.
Internal environment
The body of a module tracks an internal environment by which macro references are resolved. This environment is constructed incrementally by each clause in the definition and consists of:
- the exported symbols, an array containing symbol texts
- the exported macros, an array containing name/macro pairs
Before any clauses of the module definition are examined, each of these is empty.
Each clause affects the environment as follows:
- A macros declaration defines the exported macros.
- A symbols declaration defines the exported symbols.
macros
The macros clause assembles a list of macro definitions for the module to export. It takes any number of arguments.
All modules have a macro table, so when a module has no macros clause, the module has an empty macro table.
Most commonly, a macro table entry is a definition of a new macro expansion function, following this general shape:
// ┌─── `macro` keyword
// │ ┌─── macro name
// │ │ ┌─── template
// │ │ │
(macro foo {a:(:?),b:(:?)})
(See Defining macros for details.)
When the value null is given for the macro name, this defines an anonymous macro that can be referenced by its numeric
address (that is, its index in the enclosing macro table).
Module names can be intermingled with macro definition s-expressions inside the macros clause;
together, they determine the bindings that make up the module’s exported macro array.
The module-name export form is shorthand for referencing all exported macros from that module, in their original order with their original names.
tip
No name can be repeated among the exported macros, including macro definitions.
Processing
When the macros clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.
For each arg:
- If the arg is an s-expression, it is processed as a macro definition, which is appended to the end of the macro table being constructed.
- If the arg is the name of a module, the macro definitions in that module's macro table are appended to the end of the macro table being constructed.
- If the arg is anything else, the reader must raise an error.
A macro name is a symbol that can be used to reference a macro, both inside and outside the module. Macro names are optional, and improve legibility when using, writing, and debugging macros. When a name is used, it must be an identifier per Ion’s syntax for symbols. Macro definitions being added to the macro table must have a unique name. If a macro is added whose name conflicts with one already present in the table, the implementation must raise an error.
S-expressions in macros
An s-expression in macros defines a new macro.
When the macro declaration uses a name, an error must be signaled if it already appears in the exported macro array.
Module names in macros
A module name appends all exported macros from the module to the exported macro array. If any exported macro uses a name that already appears in the exported macro array, an error must be signaled.
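The processing rules for the macros clause can be summarized with a short Python model. This is a sketch under simplifying assumptions: a macro definition is represented as a `(name, template)` tuple (with `None` as the name of an anonymous macro), a module name is a plain string, and `process_macros_clause` and `visible_modules` are hypothetical names.

```python
def process_macros_clause(args, visible_modules):
    """Build an exported macro array; a macro's address is its list index."""
    table = []
    names = set()

    def append(name, macro):
        if name is not None:
            if name in names:
                raise ValueError(f"macro name {name!r} is already exported")
            names.add(name)
        table.append((name, macro))

    for arg in args:  # arguments are processed from left to right
        if isinstance(arg, tuple):  # models a macro definition s-expression
            name, template = arg
            append(name, template)
        elif isinstance(arg, str):  # models a module name
            if arg not in visible_modules:
                raise LookupError(f"no visible module named {arg!r}")
            for name, macro in visible_modules[arg]:
                append(name, macro)
        else:
            raise TypeError(f"unexpected argument in macros clause: {arg!r}")
    return table


geo = [("point2d", "{x: (:?), y: (:?)}")]
table = process_macros_clause(
    [("foo", "Foo"), "geo", (None, "Anonymous")],
    {"geo": geo},
)
print([name for name, _ in table])
```

The anonymous macro in the last position has no name to conflict with, so it can only be referenced by its address.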
symbols
A module can define a list of exported symbols by copying symbols from other modules and/or declaring new symbols.
symbol-table ::= '(symbols' symbol-table-entry* ')'
symbol-table-entry ::= module-name | symbol-list
symbol-list ::= '[' ( symbol-text ',' )* ']'
symbol-text ::= symbol | string
The symbols clause assembles a list of text values for the module to export.
It takes any number of arguments, each of which may be the name of a visible module or a list of symbol-texts.
The symbol table is a list of symbol-texts formed by concatenating the symbol tables of named modules and lists of symbol/string values.
Where a module name occurs, its symbol table is appended. (The module name must refer to another module that is visible to the current module.) Unlike Ion 1.0, no symbol-maxid is needed because Ion 1.1 always requires exact matches for imported modules.
tip
When redefining a top-level module binding, the binding being redefined can be added to the symbol table in order to retain its symbols. For example:
// Define module `foo`
(:$ion module foo
(symbols ["b", "c"]))
// Redefine `foo` in terms of its former definition
(:$ion module foo
(symbols
["a"]
foo // The old definition of `foo` with symbols ["b", "c"]
["d"]))
// Now `foo`'s symbol table is ["a", "b", "c", "d"]
Where a list occurs, it must contain only non-null, unannotated strings and symbols.
The text of each of these strings and/or symbols is appended to the symbol table.
Upon encountering any non-text value, null value, or annotated value in the list, the implementation shall signal an error.
To add a symbol with unknown text to the symbol table, one may use $0.
All modules have a symbol table, so when a module has no symbols clause, the module has an empty symbol table.
Symbol zero $0
Symbol zero (i.e. $0) is a special symbol that is not assigned text by any symbol table, even the system symbol table.
Symbol zero always has unknown text, and can be useful in synthesizing symbol identifiers where the text of the symbol is not known in a particular operating context.
All symbol tables (even an empty symbol table) can be thought of as implicitly containing $0.
However, $0 precedes all symbol tables rather than belonging to any symbol table.
When adding the exported symbols from one module to the symbol table of another, the preceding $0 is not copied into the destination symbol table (because it is not part of the source symbol table).
It is important to note that $0 is only semantically equivalent to itself and to locally-declared SIDs with unknown text.
It is not semantically equivalent to SIDs with unknown text from shared symbol tables, so replacing such SIDs with $0 is a destructive operation to the semantics of the data.
Processing
When the symbols clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.
For each arg:
- If the arg is a list of text values, the nested text values are appended to the end of the symbol table being constructed.
  - When $0 appears in the list of text values, this creates a symbol with unknown text.
  - The presence of any other Ion value in the list raises an error.
- If the arg is the name of a module, the symbols in that module's symbol table are appended to the end of the symbol table being constructed.
- If the arg is anything else, the reader must raise an error.
Example
(symbols // Constructs an empty symbol table (list)
["a", b, 'c'] // The text values in this list are appended to the table
foo // Module `foo`'s symbol table values are appended to the table
['''g''', "h", i]) // The text values in this list are appended to the table
If module foo's symbol table were [d, e, f], then the symbol table defined by the above clause would be:
["a", "b", "c", "d", "e", "f", "g", "h", "i"]
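The same processing can be expressed as a short Python model. It is a simplified sketch, not a reference implementation: list arguments are Python lists of strings, module names are bare strings, `UNKNOWN_TEXT` (here `None`) stands in for `$0`, and `process_symbols_clause` is a hypothetical name.

```python
UNKNOWN_TEXT = None  # hypothetical stand-in for `$0`, a symbol with unknown text


def process_symbols_clause(args, visible_modules):
    """Build a symbol table by processing the clause arguments left to right."""
    table = []
    for arg in args:
        if isinstance(arg, list):  # models a list of text values
            for text in arg:
                if text is not UNKNOWN_TEXT and not isinstance(text, str):
                    raise ValueError(f"not a text value: {text!r}")
                table.append(text)
        elif isinstance(arg, str):  # models a module name
            if arg not in visible_modules:
                raise LookupError(f"no visible module named {arg!r}")
            table.extend(visible_modules[arg])
        else:
            raise TypeError(f"unexpected argument in symbols clause: {arg!r}")
    return table


symbols = process_symbols_clause(
    [["a", "b", "c"], "foo", ["g", "h", "i"]],
    {"foo": ["d", "e", "f"]},
)
print(symbols)
```

No de-duplication is attempted; text that appears in more than one argument simply occupies more than one position in the resulting table.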
This is an Ion 1.0 symbol table that imports two shared symbol tables and then declares some symbols of its own.
$ion_1_0
$ion_symbol_table::{
imports: [{ name: "com.example.shared1", version: 1, max_id: 10 },
{ name: "com.example.shared2", version: 2, max_id: 20 }],
symbols: ["s1", "s2"]
}
Here’s the Ion 1.1 equivalent in terms of symbol allocation order:
$ion_1_1
(:$ion import m1 "com.example.shared1" 1)
(:$ion import m2 "com.example.shared2" 2)
(:$ion module _
(symbols m1 m2 ["s1", "s2"])
)
note
Alternately, one could use the default module directives to avoid the boilerplate of creating bindings for the imported modules.
However, this has slightly different semantics because use also brings in the macros from the imported modules.
$ion_1_1
(:$ion use "com.example.shared1" 1)
(:$ion use "com.example.shared2" 2)
(:$ion add_symbols "s1" "s2")
Directives
Directives are system values that modify the encoding context.
Syntactically, a directive is a top-level e-expression that invokes $ion.
Its first child value is an operation name.
The operation determines what changes will be made to the encoding context and which clauses may legally follow.
The operation declared in a directive is applied to the encoding context immediately after the directive's closing delimiter. This guarantees that the encoding context is immutable while in a directive.
(:$ion operation_name
(clause_1 /*...*/)
(clause_2 /*...*/)
/*...more clauses...*/
(clause_N /*...*/))
In Ion 1.1, there are eight supported directive operations—
module,
import,
encoding,
set_symbols,
add_symbols,
set_macros,
add_macros,
and use.
Top-level bindings
The module and import directives each create a stream-level binding to a module definition.
Once created, module bindings at this level endure until the file ends or another Ion version marker is encountered.
Module bindings at the stream-level can be redefined.
module directives
The module directive binds a name to a local module definition at the top level of the stream.
(:$ion module foo
(macros /*...*/)
(symbols /*...*/)
)
import directives
The import directive looks up the module corresponding to the given (name, version) pair in the catalog.
Upon success, it creates a new binding to that module at the top level of the stream.
(:$ion import
bar // Binding
"com.example.bar" // Module name
2) // Module version
The version can be omitted. When it is not specified, it defaults to 1.
If the catalog does not contain an exact match, this operation raises an error.
encoding directives
An encoding directive accepts a sequence of module bindings to use as the following stream segment's
encoding module sequence.
(:$ion encoding
mod_a
mod_b
mod_c)
The new encoding module sequence takes effect immediately after the directive's closing delimiter and remains the same until the next encoding directive or Ion version marker.
Note that the $ion module and default module are always implicitly at the head of the encoding module sequence.
Default module manipulation
Because of the relative importance of the default module,
Ion 1.1 provides directives that are optimized for manipulating the default module.
All of these operations can be defined in terms of the module and import directives, but using these operations
is more compact.
set_symbols directives
Replaces the symbol table of the default module with a new one.
(:$ion set_symbols "foo" "bar" "baz")
Equivalent to:
(:$ion module _ (macros _) (symbols ["foo", "bar", "baz"]))
add_symbols directives
Adds symbols to the symbol table of the default module.
(:$ion add_symbols "qux" "quux")
Equivalent to:
(:$ion module _ (macros _) (symbols _ ["qux", "quux"]))
set_macros directives
Replaces the macro table of the default module with new macro definitions.
All content of this directive must be macro clauses, and indeed they are assumed to be, so the macro keyword is omitted.
(:$ion set_macros
(point2d {x: (:?), y: (:?)})
(greeting {message: (:? "Hello"), name: (:?)})
)
Equivalent to:
(:$ion module _
(macros (macro point2d {x: (:?), y: (:?)})
(macro greeting {message: (:? "Hello"), name: (:?)}))
(symbols _))
add_macros directives
Adds macro definitions to the macro table of the default module.
All content of this directive must be macro clauses, and indeed they are assumed to be, so the macro keyword is omitted.
(:$ion add_macros
(rgb [(:? {#uint8}), (:? {#uint8}), (:? {#uint8})])
)
Equivalent to:
(:$ion module _
(macros _ (macro rgb [(:? {#uint8}), (:? {#uint8}), (:? {#uint8})]))
(symbols _))
use directives
Imports macros and symbols from a shared module and appends them to the default module. If the catalog does not contain an exact match, this operation raises an error.
// Adds the content of this shared symbol table to the default module
(:$ion use "com.example.geometry" 2)
// Equivalent to the following, except that `use` does not create any top-level binding like `import` does
(:$ion import $temp "com.example.geometry" 2)
(:$ion module _ (macros _ $temp) (symbols _ $temp))
(:$ion module $temp)
When the version is not provided, a default value of 1 is used.
// Defaults to version 1
(:$ion use "com.example.geometry")
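The behavior of use can be modelled as a catalog lookup followed by an append to the default module. The Python sketch below is illustrative only: the `Module` class, the `use` function, and the representation of the catalog as a dict keyed by `(name, version)` are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical model of a module: a (symbol table, macro table) pair."""
    symbols: list = field(default_factory=list)
    macros: list = field(default_factory=list)


def use(default: Module, catalog: dict, name: str, version: int = 1) -> Module:
    """Return the new default module after `(:$ion use name version)`."""
    key = (name, version)  # the version defaults to 1 when omitted
    if key not in catalog:
        raise LookupError(f"catalog has no exact match for {key!r}")
    shared = catalog[key]
    # Symbols and macros from the shared module are appended; no binding is created.
    return Module(default.symbols + shared.symbols, default.macros + shared.macros)


catalog = {
    ("com.example.geometry", 1): Module(["x", "y"], ["point2d"]),
    ("com.example.geometry", 2): Module(["x", "y", "z"], ["point2d", "point3d"]),
}
default = use(Module(["s1"], []), catalog, "com.example.geometry", 2)
print(default)
```

Requesting a version that is absent from the catalog, for example version 3 here, raises an error rather than falling back to a nearby version.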
Ion 1.0 symbol table syntax not supported
In Ion 1.1 streams, Ion 1.0 symbol table syntax (for example, $ion_symbol_table::{...}) is not supported.
Ion 1.1 writers should not write such an annotated struct at the top level; Ion 1.1 readers
must treat such an annotated struct as a no-op at the top level.
Local modules
Local modules are lexically scoped. They can be defined at the top level of a stream. They can be referenced immediately following the directive in which they are defined, up until the end of the stream.
Local modules always have a symbolic name given at the point of definition, also known as a binding.
Stream-level bindings are mutable.
(:$ion module foo // <-- Top-level module `foo`
(macros
(quux Quux)))
(:$ion module foo // <-- Redefines the top-level binding `foo`
(macros
(quuz Quuz)))
Local modules inherit their spec version from the enclosing scope.
Local modules automatically have access to modules previously declared in their enclosing scope using module or import.
The default module is a local module that is implicitly defined at the beginning of every stream.
Encoding modules
The encoding of each segment of a stream is shaped by the currently configured encoding modules, an ordered sequence of modules that determine which symbols and macros are available for use in the stream. A writer can modify this sequence by emitting an encoding directive.
By logically concatenating the encoding modules' symbol and macro tables respectively, they can be viewed as unified local symbol and macro tables.
For example, consider these module definitions and the subsequent encoding directive:
(:$ion module mod_a
(macros
(macro foo () Foo)
(macro bar () Bar))
(symbols ["a", "b", "c"]))
(:$ion module mod_b
(macros
(macro baz () Baz)
(macro quux () Quux))
(symbols ["c", "d", "e"]))
(:$ion module mod_c
(macros
(macro quuz () Quuz)
(macro foo () Foo2))
(symbols ["f", "g", "h"]))
(:$ion encoding
mod_a
mod_b
mod_c)
It produces the encoding module sequence $ion _ mod_a mod_b mod_c.
(The $ion module and the default module, _, are always implicitly at the head of the encoding sequence.)
The segment's local symbol table, formed by logically concatenating the symbol tables of mod_a,
mod_b, and mod_c in that order, is:
| Address | Symbol text |
|---|---|
| 0 | <unknown text> |
| 1 | $ion |
| 2 | $ion_1_0 |
| 3 | $ion_symbol_table |
| 4 | name |
| 5 | version |
| 6 | imports |
| 7 | symbols |
| 8 | max_id |
| 9 | $ion_shared_symbol_table |
| 10 | a |
| 11 | b |
| 12 | c |
| 13 | c |
| 14 | d |
| 15 | e |
| 16 | f |
| 17 | g |
| 18 | h |
Notice that no de-duplication takes place; c appears at both addresses 12 and 13.
The segment's macro table, formed by logically concatenating the macro tables of mod_a,
mod_b, and mod_c in that order, is:
| Address | Macro |
|---|---|
| 0 | mod_a::foo |
| 1 | mod_a::bar |
| 2 | mod_b::baz |
| 3 | mod_b::quux |
| 4 | mod_c::quuz |
| 5 | mod_c::foo |
Notice that mod_a::foo and mod_c::foo can coexist in this unified view without issue.
Invocations of these macros require that they be qualified by their enclosing module's name.
Because lower addresses take fewer bytes to encode than higher addresses, writers should place the modules they anticipate referencing the most frequently at the beginning of the encoding module sequence.
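The logical concatenation that produces these tables can be reproduced with a few lines of Python. This is an explanatory sketch: `unified_tables` is a hypothetical helper, modules are modelled as `(name, symbols, macro names)` tuples, and address 0 of the symbol table is modelled as `None` to represent `$0`.

```python
SYSTEM_SYMBOLS = [
    "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
    "imports", "symbols", "max_id", "$ion_shared_symbol_table",
]


def unified_tables(sequence):
    """Concatenate the symbol and macro tables of an encoding module sequence."""
    symbols = [None]  # address 0 is $0, which precedes every symbol table
    macros = []
    for module_name, module_symbols, module_macros in sequence:
        symbols.extend(module_symbols)  # no de-duplication takes place
        macros.extend(f"{module_name}::{macro}" for macro in module_macros)
    return symbols, macros


sequence = [
    ("$ion", SYSTEM_SYMBOLS, []),
    ("_", [], []),
    ("mod_a", ["a", "b", "c"], ["foo", "bar"]),
    ("mod_b", ["c", "d", "e"], ["baz", "quux"]),
    ("mod_c", ["f", "g", "h"], ["quuz", "foo"]),
]
symbols, macros = unified_tables(sequence)
for address, macro in enumerate(macros):
    print(address, macro)
```

Reordering `sequence` changes every address that follows the moved module, which is why the most frequently referenced modules belong near the front.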
Modules in the current segment's encoding module sequence are said to be active, while modules that are defined or imported but which are not in the encoding module sequence are available. E-expressions can only invoke macros in an active module.
For example:
(:$ion module mod_a
(macros
(macro foo () Foo)))
// `mod_a` is now available
(:$ion module mod_b
(macros
(macro bar () Bar)))
// `mod_b` is now available
(:$ion encoding mod_a)
// `mod_a` is now active
(:mod_a::foo) // Foo
(:mod_b::bar) // ERROR: `mod_b` is not in the encoding module sequence
The default module
The default module, _, is an empty top-level module that is implicitly defined at the beginning of every stream.
When resolving an unqualified macro name, readers look for the corresponding macro definition in _.
If it is not found, the reader will raise an error.
This makes it possible to leverage macros in a lightweight way; writers do not have to first name/define a custom module to house their macros, and the macros themselves can be invoked in text without having to write out the module name.
Macros and symbols can be added to the default module by redefining _.
Like all modules, _ can be redefined in terms of itself, making appends and prepends straightforward.
$ion_1_1
// `_` exists, but is empty
(:$ion module _
(macros
(foo Foo)))
// `_` now contains macro `foo`
(:$ion module _
(macros
_ // Add all macros in `_` to its redefinition
(bar Bar)))
// `_` now contains macros `foo` and `bar`
(:foo) // Equivalent to `(:_::foo)`
(:bar) // Equivalent to `(:_::bar)`
Directives like add_symbols
and add_macros apply their changes to _,
so we can rewrite the above more succinctly as:
$ion_1_1
// `_` exists, but is empty
(:$ion add_macros
(foo Foo)
(bar Bar))
// `_` now contains macros `foo` and `bar`
(:foo) // Equivalent to `(:_::foo)`
(:bar) // Equivalent to `(:_::bar)`
_ can also be redefined by an import directive.
Default encoding module sequence
At the beginning of a stream, the encoding module sequence contains two modules:
- the system module, $ion
- the default module, _
Recall that a segment's symbol and macro tables are logical concatenations of those found in the segment's encoding modules.
Because _ is empty at the beginning of the stream,
the stream's initial symbol and macro tables are identical to those of the system module, $ion.
Modifying active modules
If a module binding in the encoding module sequence is redefined, the new module definition replaces the old one in the sequence.
For example, after these directives are evaluated:
(:$ion module mod_a
(macros
(foo Foo)
(bar Bar)))
(:$ion module mod_b)
(:$ion module mod_c
(macros
(quux Quux)
(quuz Quuz)))
(:$ion encoding mod_a mod_b mod_c)
the encoding sequence is $ion _ mod_a mod_b mod_c, and mod_b is empty.
(:0) // => Foo
(:1) // => Bar
(:2) // => Quux
(:3) // => Quuz
If we then add macros to mod_b, those macros will immediately become available.
(:$ion module mod_b
(macros
(baz Baz)))
(:0) // => Foo
(:1) // => Bar
(:2) // => Baz
(:3) // => Quux
(:4) // => Quuz
important
Notice that modifying a module (in this case mod_b) can cause the addresses of all subsequent macros to be modified.
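One way to picture this behavior is to treat the encoding module sequence as a list of binding names that are resolved each time the macro table is consulted. The Python sketch below follows that model; it is an assumption made for illustration (`macro_table`, `bindings`, and `encoding` are hypothetical names), not a statement about how readers must be implemented.

```python
def macro_table(encoding, bindings):
    """Resolve each binding name in the sequence and concatenate the macro tables."""
    return [macro for name in encoding for macro in bindings[name]]


bindings = {
    "mod_a": ["Foo", "Bar"],
    "mod_b": [],
    "mod_c": ["Quux", "Quuz"],
}
encoding = ["mod_a", "mod_b", "mod_c"]

before = macro_table(encoding, bindings)

# Redefining `mod_b` replaces its definition in the active sequence.
bindings["mod_b"] = ["Baz"]
after = macro_table(encoding, bindings)

print(before.index("Quux"), after.index("Quux"))
```

The address of `Quux` moves from 2 to 3 even though `mod_c` itself was never touched.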
Clearing the symbol and macro tables
(:$ion module _) // Redefine `_` to be an empty module
// If other modules are in use, remove them from the encoding module sequence
(:$ion encoding)
You can also consider writing an Ion version marker, which is more compact, although the behavior is slightly different. See the Default encoding module sequence section for details.
System module
The symbols of the system module $ion are available everywhere within an Ion document,
with the version of that module being determined by the spec-version of each segment.
The specific system symbols are largely uninteresting to users; while the binary encoding heavily
leverages the system symbol table, the text encoding that users typically interact with does not.
Relation to local symbol and macro tables
The $ion module is equivalent to the Ion 1.0 system symbol table.
Its symbols are identical, and it contains no macros.
In Ion 1.0, the system symbol table is always the first import of the local symbol table.
Ion 1.1 has slightly different semantics, but the result is the same.
The $ion module is always the first module in the sequence of encoding modules, so the $ion module symbols always occupy the first 9 symbol IDs.
The $ion module
This is the same as the Ion 1.0 system symbol table. This binding is always available in an Ion 1.1 stream at the head of the encoding modules.
| ID | Hex | Text |
|---|---|---|
| 0 | 0x00 | <reserved> |
| 1 | 0x01 | $ion |
| 2 | 0x02 | $ion_1_0 |
| 3 | 0x03 | $ion_symbol_table |
| 4 | 0x04 | name |
| 5 | 0x05 | version |
| 6 | 0x06 | imports |
| 7 | 0x07 | symbols |
| 8 | 0x08 | max_id |
| 9 | 0x09 | $ion_shared_symbol_table |
Shared modules
Shared modules exist independently of the documents that use them. They are identified by a catalog key consisting of a string name and an integer version.
Unlike local modules, the Ion parser does not intrinsically recognize or process this data; it is up to higher-level specifications or conventions to define how shared modules are communicated.
The self-declared catalog-names of shared modules are generally long, since they must be more-or-less globally unique. When imported by another module, they are given local symbolic names—a binding—by import declarations. Once the shared module has been imported and given a binding, it can be referenced by other modules and/or added to the encoding modules.
$ion_1_1
(:$ion import geo "org.example.geometry" 2)
(:$ion encoding geo)
(:geo::polygon [{:geo::point2d} (1 4), (1 8), (3 6)] rgb::0xFFCC00)
Ion 1.1 also provides a convenient directive (use) to append a shared module to the default module.
$ion_1_1
(:$ion use "org.example.geometry" 2)
// The content of the shared module is immediately available through the default module.
(:polygon [{:point2d} (1 4), (1 8), (3 6)] rgb::0xFFCC00)
Compatibility with Ion 1.0
Ion 1.0 shared symbol tables are treated as Ion 1.1 shared modules that have an empty macro table.
Ion 1.1 Text Encoding
The Ion text encoding is a stream of UTF-8 encoded text. It is intended to be easy to read and write by humans.
Whitespace is insignificant and is only required where necessary to separate tokens. C-style comments (either block or in-line) are treated as whitespace; they are not part of the data model and implementations are not required to preserve them.
A text Ion 1.1 stream begins with the Ion 1.1 version marker ($ion_1_1) followed by a series of
value literals and/or encoding expressions.
Values
Annotations
In the text format, type annotations are denoted by a symbol token and double-colons preceding any value. Multiple annotations on the same value are separated by double-colons:
int32::12 // Suggests 32 bits as end-user type
degrees::'celsius'::100 // You can have multiple annotations on a value
'my.custom.type'::{ x : 12 , y : -1 } // Gives a struct a user-defined type
{ field: some_annotation::value } // Field's name must precede annotations of its value
jpeg :: {{ ... }} // Indicates the blob contains jpeg data
bool :: null.int // A very misleading annotation on the integer null
'' :: 1 // An empty annotation
foo::(:bar "a" "b") // E-expressions may be annotated
null.symbol :: 1 // ERROR: type annotation cannot be null
Nulls
Null values are represented by the keyword null, optionally followed by . and the name of a type in the Ion data model.
null
null.null // Identical to unadorned null
null.bool
null.int
null.float
null.decimal
null.timestamp
null.string
null.symbol
null.blob
null.clob
null.struct
null.list
null.sexp
The text format treats all of these as reserved tokens; to use those same characters as a symbol token, they must be enclosed in single-quotes:
null // The type is null
'null' // The type is symbol
null.list // The type is list
'null.int' // The type is symbol
Any text token starting with null. must be one of the legal null values.
(llun.foo) // An s-expression equivalent to (llun . foo)
(null.foo) // This is illegal; not equivalent to (null . foo)
// because null. is never split into separate tokens
Booleans
Boolean values are represented by the literals true and false.
The text format treats both of these as reserved tokens; to use those same characters as a symbol token, they must be enclosed in single-quotes.
true // a boolean value
'true' // a symbol value
'true'::1 // an integer annotated with the text "true"
true::1 // ERROR: cannot use an unquoted keyword as an annotation
{ 'true': 1 } // a struct containing a field name with the text "true"
{ true: 1 } // ERROR: cannot use an unquoted keyword as a field name
Integers
Integer values may be encoded in binary, decimal, and hexadecimal notation.
A decimal-encoded int consists of the digit 0 OR a non-zero digit followed by zero or more base 10 digits (0123456789); leading zeros are not allowed.
A binary-encoded int consists of 0b followed by one or more base 2 digits (01).
A hexadecimal-encoded int consists of 0x followed by one or more case-insensitive base 16 digits (0123456789abcdefABCDEF).
All integer values may be preceded by an optional minus sign (-), indicating that the value is negative.
(The token -0 is legal and equivalent to 0; to distinguish -0 from 0, consider encoding as a decimal or float instead.)
Single underscores may be used to separate digits; consecutive underscores are never allowed.
All integer values must be followed by one of the fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
Though the text format allows hexadecimal and binary notation, such notation is not guaranteed to be maintained if a data stream is re-transcribed.
0 // Zero. Surprise!
-0 // ...the same value with a minus sign
123 // A normal int
-123 // A negative int
0xBeef // An int denoted in hexadecimal
-0xBeef // A negative int denoted in hexadecimal
0b0101 // An int denoted in binary
-0b0101 // A negative int denoted in binary
1_2_3 // An int with underscores
0xFA_CE // An int denoted in hexadecimal with underscores
0b10_10_10 // An int denoted in binary with underscores
+1 // ERROR: leading plus not allowed
0123 // ERROR: leading zeros not allowed
1_ // ERROR: trailing underscore not allowed
1__2 // ERROR: consecutive underscores not allowed
0x_12 // ERROR: underscore can only appear between digits (the radix prefix is not a digit)
_1 // A symbol (ints cannot start with underscores)
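The integer rules above can be checked mechanically. The following is a minimal sketch, not a reference implementation: `parse_ion_int` is a hypothetical helper name, it accepts only the lowercase `0b` and `0x` prefixes shown in the text, and it validates a single token without checking the stop-character that must follow it.

```python
import re

# Hypothetical helper, not part of the specification: matches one complete
# Ion text int token. The trailing numeric stop-character is not checked.
_INT_TOKEN = re.compile(
    r"""-?(?:
        0b[01]+(?:_[01]+)*                      # binary digits, single underscores between digits
      | 0x[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*        # hexadecimal digits, case-insensitive
      | (?:0|[1-9][0-9]*(?:_[0-9]+)*)           # decimal, no leading zeros
    )""",
    re.VERBOSE,
)


def parse_ion_int(token):
    """Return the int denoted by an Ion text int token, or raise ValueError."""
    if not _INT_TOKEN.fullmatch(token):
        raise ValueError("not an Ion text int: %r" % token)
    # Base 0 lets Python pick the radix from the 0b / 0x prefix.
    return int(token.replace("_", ""), 0)


print(parse_ion_int("-0xBeef"))
print(parse_ion_int("0b10_10_10"))
```

Tokens such as `+1`, `0123`, `1_`, `1__2`, and `0x_12` are rejected by the pattern, matching the error cases listed above.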
Floats
The text encoding of a numeric float value:
- Optionally starts with a minus sign
- Has a whole number part that is either:
  - zero, or
  - starts with 1-9 followed by any number of digits
- Has an optional decimal point followed by zero or more decimal digits
- Has the letter 'e'
- Has an optional minus sign for the exponent
- Ends with one or more digits for the exponent
A numeric Ion float value must always contain an e—fractional numbers without an e are decimal values.
Ion float values may also be special non-number values, represented in text by the following keywords:
- nan denotes the not a number (NaN) value.
- +inf denotes positive infinity.
- -inf denotes negative infinity.
The text format treats nan as a reserved token; to use those same characters as a symbol token, they must be enclosed in single-quotes.
While base-10 notation is convenient for human representation, many base-10 real numbers are irrational with respect
to base-2 and cannot be expressed exactly as a binary floating point number (e.g. 1.1e0).
When encoding a decimal real number that is irrational in base-2 or has more precision than can be stored in binary64,
the exact binary64 value is determined by using the IEEE-754 round-to-nearest mode with a round-half-to-even as the tie-break.
This mode/tie-break is the common default used in most programming environments and is discussed in detail in
"Correctly Rounded Binary-Decimal and Decimal-Binary Conversions".
This conversion algorithm is illustrated in a straightforward way in Clinger's Algorithm.
When encoding a float value to Ion text, an implementation MAY want to consider the approach described in
"Printing Floating-Point Numbers Quickly and Accurately".
Examples
Although the textual representation of 1.2e0 itself is irrational, its
canonical form in the data model is not (based on the rounding rules), thus
the following text forms all map to the same float value:
// the most human-friendly representation
1.2e0
// the exact textual representation in base-10 for the binary64 value 1.2e0 represents
1.1999999999999999555910790149937383830547332763671875e0
// a shortened, irrational version, but still the same value
1.1999999999999999e0
// a lengthened, irrational version that is still the same value
1.19999999999999999999999999999999999999999999999999999999e0
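The equivalence of these text forms can be observed in any environment whose decimal-to-binary conversion is correctly rounded. The sketch below assumes CPython's `float()` behaves that way (it does on common platforms) and uses `decimal.Decimal` only to display the exact binary64 value; the variable names are illustrative.

```python
from decimal import Decimal

# The four text forms from the examples above.
forms = [
    "1.2e0",
    "1.1999999999999999555910790149937383830547332763671875e0",
    "1.1999999999999999e0",
    "1.19999999999999999999999999999999999999999999999999999999e0",
]

# float() performs the decimal-to-binary64 conversion (round-to-nearest,
# ties-to-even on CPython).
values = [float(text) for text in forms]
all_equal = all(v == values[0] for v in values)

# Decimal(float) converts the binary64 value exactly, with no rounding.
exact = str(Decimal(values[0]))
print(all_equal, exact)
```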
Decimals
The Hursley rules for describing a finite value converting from textual notation must be followed. The Hursley rules for describing a special value are not followed:
- infinity: rule is not applicable for Ion Decimals.
- nan: rule is not applicable for Ion Decimals.
Specifically, the rules for getting the integer coefficient from the decimal-part (digits preceding the exponent) of the textual representation are specified as follows.
If the decimal-part included a decimal point the exponent is then reduced by the count of digits following the decimal point (which may be zero) and the decimal point is removed. The remaining string of digits has any leading zeros removed (except for the rightmost digit) and is then converted to form the coefficient which will be zero or positive.
Where X is any unsigned integer, all the following formulae can be
demonstrated to be equivalent using the text conversion rules and the data
model.
// Exponent implicitly zero
X.
// Exponent explicitly zero
Xd0
// Exponent explicitly negative zero (equivalent to zero).
Xd-0
Other equivalent representations include the following, where Y is the number
of digits in X.
// There are Y digits past the decimal point in the
// decimal-part, making the exponent zero. One leading zero
// is removed.
0.XdY
For example, all the following text Ion decimal representations are equivalent to each other.
0.
0d0
0d-0
0.0d1
Additionally, all the following are equivalent to each other (but not to any forms of positive zero).
-0.
-0d0
-0d-0
-0.0d1
Because all forms of zero are distinctly identified by the exponent, the following are not equivalent to each other.
// Exponent implicitly zero.
0.
// Exponent explicitly 5.
0d5
All the following are equivalent to each other.
42.
42d0
42d-0
4.2d1
0.42d2
However, the following are not equivalent to each other.
// Text converted to 42.
0.42d2
// Text converted to 42.0
0.420d2
In the text notation, decimal values must be followed by one of the
fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
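The coefficient and exponent rules for decimals can be sketched in a few lines. This is an illustration under stated assumptions: `parse_ion_decimal` is a hypothetical name, the input is assumed to be an already-validated finite decimal token using a lowercase `d`, and the sign is returned separately so that negative zero stays distinguishable.

```python
def parse_ion_decimal(text):
    """Return (is_negative, coefficient, exponent) for a finite Ion text decimal.

    Assumes `text` is already a syntactically valid decimal token.
    """
    negative = text.startswith("-")
    body = text[1:] if negative else text

    # Split the decimal-part from the optional exponent.
    decimal_part, _, exponent_text = body.partition("d")
    exponent = int(exponent_text) if exponent_text else 0

    # Reduce the exponent by the number of digits after the decimal point,
    # then drop the point.
    whole, _, fraction = decimal_part.partition(".")
    exponent -= len(fraction)

    # int() discards leading zeros, leaving a zero or positive coefficient.
    coefficient = int(whole + fraction)
    return negative, coefficient, exponent


print(parse_ion_decimal("0.42d2"))
print(parse_ion_decimal("0.420d2"))
```

Two text forms denote the same decimal exactly when all three components match, which is why `0.42d2` and `0.420d2` differ.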
Timestamps
In the text format, timestamps follow the W3C note on date and time formats,
but they must end with the literal T if not at least whole-day precision.
Fractional seconds are allowed, with at least one digit of precision and an unlimited maximum.
Local-time offsets may be represented as either hour:minute offsets from UTC, or as the literal Z to denote a local time of UTC.
If the offset is -00:00, it indicates that the local offset in which the timestamp was recorded is unknown, and that the time is therefore encoded as UTC.
Local-time offsets are required on timestamps with time and are not allowed on date values.
2007-02-23T12:14Z // Seconds are optional, but local offset is not
2007-02-23T12:14:33.079-08:00 // A timestamp with millisecond precision and PST local time
2007-02-23T20:14:33.079Z // The same instant in UTC ("zero" or "zulu")
2007-02-23T20:14:33.079+00:00 // The same instant, with explicit local offset
2007-02-23T20:14:33.079-00:00 // The same instant, with unknown local offset
2007-01-01T00:00-00:00 // Happy New Year in UTC, unknown local offset
2007-01-01 // The same instant, with days precision, unknown local offset
2007-01-01T // The same value, different syntax.
2007-01T // The same instant, with months precision, unknown local offset
2007T // The same instant, with years precision, unknown local offset
2007-02-23 // A day, unknown local offset
2007-02-23T00:00Z // The same instant, but more precise and in UTC
2007-02-23T00:00+00:00 // An equivalent format for the same value
2007-02-23T00:00:00-00:00 // The same instant, with seconds precision
2007 // Not a timestamp, but an int
2007-01 // ERROR: Must end with 'T' if not whole-day precision; this results in an invalid-numeric-stopper error
2007-02-23T20:14:33.Z // ERROR: Must have at least one digit precision after decimal point.
In the text notation, timestamp values must be followed by one of the
fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
Strings
In the text format, strings are delimited by double-quotes and follow C/Java backslash-escape conventions (see Escape Characters).
null.string // A null string value
"" // An empty string value
" my string " // A normal string
"\"" // Contains one double-quote character
"\uABCD" // Contains one unicode character
xml::"<e a='v'>c</e>" // String with type annotation 'xml'
The text format supports an alternate syntax for "long strings", including those that break across lines.
Sequences bounded by three single-quotes (''') can cross multiple lines and still count as a valid, single string.
In addition, any number of adjacent triple-quoted strings are concatenated into a single value.
The concatenation happens within the Ion text parser and is neither detectable via the data model nor applicable to the binary format.
Note that comments are always treated as whitespace, so concatenation still occurs when a comment falls between two long strings.
( '''hello ''' // Sexp with one element
'''world!''' )
("hello world!") // The exact same sexp value
// This Ion value is a string containing three newlines. The serialized
// form's first newline is escaped into nothingness.
'''\
The first line of the string.
This is the second line of the string,
and this is the third line.
'''
Symbols
A symbol value is encoded using a symbol token.
In the text format, symbols are delimited by single-quotes and use the same escape characters as strings.
null.symbol // A null symbol value
'myVar2' // A symbol
myVar2 // The same symbol
myvar2 // A different symbol
'hi ho' // Symbol requiring quotes
'\'ahoy\'' // A symbol with embedded quotes
'' // The empty symbol
Within S-expressions, the rules for unquoted symbols include another set of tokens: operators.
An operator is an unquoted sequence of one or more of the following nineteen ASCII characters: !#%&*+-./;<=>?@^`|~.
Operators and identifiers can be juxtaposed without whitespace:
( 'x' '+' 'y' ) // S-expression with three symbols
( x + y ) // The same three symbols
(x+y) // The same three symbols
(a==b&&c==d) // S-expression with seven symbols
Clobs
In the text format, clob values use similar syntax to blob, but the data between braces must be one string.
Similar to string, adjoining long string literals within an Ion clob are concatenated automatically.
Within a clob, only one short string literal or multiple long string literals are allowed.
The string may only contain legal 7-bit ASCII characters, using the same escaping syntax as string and symbol values.
This guarantees that the value can be transmitted unscathed while remaining generally readable (at least for western language text).
Either form of comment within a clob is invalid.
{{ "This is a CLOB of text." }}
shift_jis::
{{
'''Another clob with user-defined encoding, '''
'''this time on multiple lines.'''
}}
// Two equivalent clobs
{{ '''Hello''' '''World''' }}
{{ "HelloWorld" }}
{{
// ERROR
"comments not allowed"
}}
Blobs
In the text format, blob values are denoted as RFC 4648-compliant
Base64 text within two pairs of curly braces.
When parsing blob text, an error must be raised if the data:
- Contains characters outside of the Base64 character set.
- Contains a padding character (=) anywhere other than at the end.
- Is terminated by an incorrect number of padding characters.
Within blob values, whitespace is ignored.
Comments within blobs are not supported: the / character is always considered part of the Base64 data and the * is invalid.
// A valid blob value with zero padding characters.
{{
+AB/
}}
// A valid blob value with one required padding character.
{{ VG8gaW5maW5pdHkuLi4gYW5kIGJleW9uZCE= }}
// ERROR: Incorrect number of padding characters.
{{ VG8gaW5maW5pdHkuLi4gYW5kIGJleW9uZCE== }}
// ERROR: Padding character within the data.
{{ VG8gaW5maW5pdHku=Li4gYW5kIGJleW9uZCE= }}
// A valid blob value with two required padding characters.
{{ dHdvIHBhZGRpbmcgY2hhcmFjdGVycw== }}
// ERROR: Invalid character within the data.
{{ dHdvIHBhZGRpbmc_gY2hhcmFjdGVycw= }}
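The three error conditions for blob text can be sketched as follows. `parse_blob_body` is a hypothetical helper that receives the text between the double braces; it treats anything Python's `str.split()` considers whitespace as ignorable, which is an assumption rather than the specification's definition of whitespace.

```python
import base64
import re

# Base64 alphabet characters followed by at most two padding characters.
_BASE64_BODY = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def parse_blob_body(text):
    """Decode the text found between '{{' and '}}' of a blob, or raise ValueError."""
    compact = "".join(text.split())  # whitespace inside a blob is ignored
    if not _BASE64_BODY.fullmatch(compact):
        raise ValueError("invalid character or padding before the end of the data")
    if len(compact) % 4 != 0:
        raise ValueError("incorrect number of padding characters")
    return base64.b64decode(compact)


print(parse_blob_body(" VG8gaW5maW5pdHkuLi4gYW5kIGJleW9uZCE= "))
```

Once the pattern has matched, the length check is enough to verify the padding count, because at most two trailing `=` characters remain possible.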
Lists
In the text format, lists are bounded by square brackets and elements are separated by commas.
[] // An empty list value
[1, 2, 3] // List of three ints
[ 1 , two ] // List of an int and a symbol
[a , [b]] // Nested list
[ 1.2, ] // Trailing comma is legal in Ion (unlike JSON)
[ 1, , 2 ] // ERROR: missing element between commas
S-expressions
In the text format, S-expressions are bounded by parentheses. S-expressions also allow unquoted operator symbols, in addition to the unquoted identifier symbols allowed everywhere.
() // An empty expression value
(cons 1 2) // S-expression of three values
([hello][there]) // S-expression containing two lists
(a+-b) ( 'a' '+-' 'b' ) // Equivalent; three symbols
(a.b;) ( 'a' '.' 'b' ';') // Equivalent; four symbols
Note that comments are allowed within S-expressions and have higher precedence
than operators, therefore // and /* denote the start of comment blocks.
Users are advised to avoid them as operators, though they can be used when
escaped with single quotes:
(a/* word */b) // An S-expression with two symbols and a comment
(a '/*' word '*/' b) // An S-expression with five symbols
Tagless-element Sequences
Tagless-element sequences allow for sequences (lists or s-expressions) to be encoded in binary without repetitively declaring the same opcode.
In text, tagless-element sequences are differentiated from regular sequences by adding an encoding tag immediately after the opening delimiter.
When the type is a macro-shape, the arguments for each instance of the macro invocation are enclosed in ( and ).
Macros that accept 0 arguments may not be used as a macro-shape type for a tagless-element sequence.
Examples:
[{#int8} 1, 2, 3, 4]
[{:point} (1 3), (1 4), (2 4)]
({#0x60} 1 -2 3 -99999999999999999999999)
({:foo}) // An empty, macro-shaped s-expression
[{#int8} 1, 2, 3, foo::4] // ERROR: tagless elements cannot have annotations
Structs
In the text format, a struct is wrapped by curly braces, with a colon between each name and value, and a comma between the fields.
The field name is a symbol token.
For the purposes of JSON compatibility, it is also legal to use a string for field names, but they are converted to symbol tokens by the parser.
{ } // An empty struct value
{ first : "Tom" , last: "Riddle" } // Structure with two fields
{"first":"Tom","last":"Riddle"} // The same value with confusing style
{center:{x:1.0, y:12.5}, radius:3} // Nested struct
{ x:1, } // Trailing comma is legal in Ion (unlike JSON)
{ "":42 } // A struct value containing a field with an empty name
{ x:1, x:null.int } // WARNING: repeated name 'x' leads to undefined behavior
{ x:1, , } // ERROR: missing field between commas
Note that field names are symbol tokens, not symbol values, and thus may not be annotated. The value of a field may be annotated like any other value. Syntactically the field name comes first, then annotations, then the content.
{ annotation:: field_name: value } // ERROR
{ field_name: annotation:: value } // Okay
E-expressions
In Ion text, encoding expressions (E-expressions) start with (:, immediately
followed by a macro reference, which must be one of:
- a macro name
- a base-10 integer macro address
- a qualified macro name consisting of a module name, double-colon (::), and the macro name
- a qualified macro name consisting of a module name, double-colon (::), and a base-10 integer macro address
See Encoding modules for details about qualified macro references.
Macro and module names follow the syntax rules for identifier symbol tokens, excluding symbol identifiers. There may not be any whitespace from the start of the E-expression through to the end of the macro reference.
Values in the E-expression body follow the same syntax as values in an S-expression body. E-expressions may be annotated.
(:pi) // Invokes the macro 'pi'
(:1) // Invokes the macro with address 1 in the macro table
(:constants::pi) // Invokes the macro 'pi' from the module 'constants'
(: pi) // ERROR: whitespace is not permitted between '(:' and the macro reference
foo::(:pi) // E-expression annotated with 'foo'
E-expressions may appear in structs in value position, but not in field name position.
{
foo: 1,
bar: (:bar 2), // Expands to a value associated with the field name 'bar'
(:bar 3) // ERROR: e-expressions may not occur in field name position
}
When an e-expression represents a macro invocation that contains trailing optional parameters, any or all of the trailing optionals may be elided from the e-expression.
($ion set_macros (foo {bar: (:?), baz: (:? 123)})) // Both parameters are optional
(:foo abc) // ⇒ {bar: abc, baz: 123}
(:foo abc (:)) // Equivalent to the previous line. Second optional explicitly suppressed using `(:)`
(:foo) // ⇒ {baz: 123}
(:foo (:) (:)) // Equivalent to the previous line. Both optionals explicitly suppressed using `(:)`
Template Placeholders
Template placeholders are special E-Expressions that help define template macros.
Examples:
- Tagged value, optional, no default value: (:?)
- Tagged value, optional, with default value: (:? "foo")
- Tagless value (with primitive encoding tag), required, default value not allowed: (:? {#int8})
Encoding Tags
Encoding tags are used in tagless e-expression placeholders and
tagless-element sequences.
The text syntax for encoding tags consists of a tagless type identifier
preceded by either # (for primitive encodings) or : (for macro shapes), and surrounded by {}.
Examples:
- named macro-shape: {:foo}, {:foo_module::bar_macro}. May only be used in tagless-element sequences.
- macro-shape by id: {:12}, {:493}. May only be used in tagless-element sequences.
- tagless scalar type by name: {#int}, {#uint8}, {#symbol}, {#timestamp_day}. May be used in tagless-element sequences or tagless template placeholders.
Macros that accept 0 arguments are not eligible to be used in an encoding tag.
Symbol tokens
In Ion text, symbols are represented in three ways:
- Quoted symbol: a sequence of zero or more characters between single-quotes, e.g., 'hello', 'a symbol', '123', ''. This representation can denote any symbol text.
- Identifier: an unquoted sequence of one or more ASCII letters, digits, or the characters $ (dollar sign) or _ (underscore), not starting with a digit and not including the keywords null, nan, true, and false.
- Operator: an unquoted sequence of one or more of the following nineteen ASCII characters: !#%&*+-./;<=>?@^`|~ Operators can only be used as (direct) elements of an S-expression. In any other context those characters require single-quotes.
A subset of identifiers have special meaning:
- Symbol Identifier: an identifier that starts with $ (dollar sign) followed by one or more digits. These identifiers directly represent the symbol's integer symbol ID, not the symbol's text. This form is not typically visible to users, but they should be aware of the reserved notation so they don't attempt to use it for other purposes.
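A text writer has to decide when symbol text can be emitted unquoted. The sketch below applies the identifier, keyword, and symbol identifier rules above outside of S-expressions (so operators are always quoted). `needs_quotes` and `write_symbol` are hypothetical names, and only the backslash and single-quote escapes are handled; control characters and other escapes are out of scope here.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
_SYMBOL_IDENTIFIER = re.compile(r"\$[0-9]+")
_KEYWORDS = {"null", "nan", "true", "false"}


def needs_quotes(text):
    """Return True if symbol text must be single-quoted outside an S-expression."""
    if not _IDENTIFIER.fullmatch(text):
        return True   # empty text, spaces, operators, leading digit, ...
    if text in _KEYWORDS:
        return True   # would otherwise be read as a keyword
    if _SYMBOL_IDENTIFIER.fullmatch(text):
        return True   # would otherwise be read as a symbol ID such as $10
    return False


def write_symbol(text):
    """Render symbol text, quoting only when required (partial escaping)."""
    if not needs_quotes(text):
        return text
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


print(write_symbol("myVar2"), write_symbol("hi ho"), write_symbol("$10"))
```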
Escape Characters
Strings and Symbols
The Ion text format supports unicode escape sequences only within quoted strings and symbols. Ion supports most of the escape sequences defined by C++, Java, and JSON.
The following sequences are allowed:
| Unicode Code Point | Ion Escape | Meaning |
|---|---|---|
| U+0000 | \0 | NUL |
| U+0007 | \a | alert BEL |
| U+0008 | \b | backspace BS |
| U+0009 | \t | horizontal tab HT |
| U+000A | \n | linefeed LF |
| U+000B | \v | vertical tab VT |
| U+000C | \f | form feed FF |
| U+000D | \r | carriage return CR |
| U+0022 | \" | double quote |
| U+0027 | \' | single quote |
| U+002F | \/ | forward slash |
| U+003F | \? | question mark |
| U+005C | \\ | backslash |
| nothing | \NL | escaped NL expands to nothing |
| U+00HH | \xHH | 2-digit hexadecimal Unicode code point |
| U+HHHH | \uHHHH | 4-digit hexadecimal Unicode code point |
| U+HHHHHHHH | \UHHHHHHHH | 8-digit hexadecimal Unicode code point |
Any other sequence following a backslash is an error.
Note that Ion does not support the following escape sequences:
- Java's extended Unicode markers, e.g., "\uuuXXXX"
- General octal escape sequences, \OOO
Clobs
The rules for the quoted strings within a clob follow similarly to the string type, with the following exceptions.
Unicode newline characters in long strings and all verbatim ASCII characters are interpreted as their ASCII octet values.
Non-printable ASCII and non-ASCII Unicode code points are not allowed un-escaped in the string bodies.
Furthermore, the following table describes the clob string escape sequences that have direct octet replacement for all clob strings.
| Octet | Ion Escape | Meaning |
|---|---|---|
| 0x00 | \0 | NUL |
| 0x07 | \a | alert BEL |
| 0x08 | \b | backspace BS |
| 0x09 | \t | horizontal tab HT |
| 0x0A | \n | linefeed LF |
| 0x0B | \v | vertical tab VT |
| 0x0C | \f | form feed FF |
| 0x0D | \r | carriage return CR |
| 0x22 | \" | double quote |
| 0x27 | \' | single quote |
| 0x2F | \/ | forward slash |
| 0x3F | \? | question mark |
| 0x5C | \\ | backslash |
| 0xHH | \xHH | 2-digit hexadecimal octet |
| nothing | \NL | escaped NL expands to nothing |
The clob escape \x must be followed by two hexadecimal digits.
Note that clob does not support the \u and \U escapes since it represents an octet sequence and not a Unicode encoding.
It is important to note that clob is a binary type designed for binary values that are either text encoded in an ASCII-compatible code page or that should be octet-editable by a human (escaped string syntax vs. base64 encoded data).
Clearly, non-ASCII-based encodings will not be very readable (e.g. the clob for the EBCDIC encoded string
representing "hello" could be denoted as {{ "\xc7\xc1%%?" }}).
Ion 1.1 Binary Encoding
A binary Ion stream consists of an Ion version marker followed by a series of value literals and/or encoding expressions.
Both value literals and e-expressions begin with an opcode that indicates what the next expression represents and how the bytes that follow should be interpreted.
Primitives
This section describes Ion 1.1's binary encoding primitives—reusable building blocks that can be combined to represent more complex constructs.
| Name | Type | Width |
|---|---|---|
| FixedUInt | int | Determined by context |
| FixedInt | int | Determined by context |
| FlexUInt | int | Variable, self-delimiting |
| FlexInt | int | Variable, self-delimiting |
| FlexSym | symbol | Variable, self-delimiting |
FlexUInt
A variable-length unsigned integer.
The bytes of a FlexUInt are written in
little-endian byte order. This means that the first bytes will contain
the FlexUInt's least significant bits.
The least significant bits in the FlexUInt indicate the number of bytes that were used to encode the integer.
If a FlexUInt is N bytes long, its N-1 least significant bits will be 0; a terminal 1 bit will be
in the next most significant position.
All bits that are more significant than the terminal 1 represent the magnitude of the FlexUInt.
FlexUInt encoding of 14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
unsigned int 14
FlexUInt encoding of 729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 1
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the unsigned of the unsigned
integer integer
FlexUInt encoding of 21,043
┌───── There are 2 zeros in the least significant bits, so this
│ integer is three bytes wide.
┌─┴─┐
1 0 0 1 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
└───┬───┘ └──────┬──────┘ └──────┬──────┘
lowest 5 bits next 8 bits of highest 8 bits
of the unsigned the unsigned of the unsigned
integer integer integer
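The FlexUInt layout can be expressed with ordinary integer arithmetic. The following sketch is one possible approach, not the method of any particular Ion library; `encode_flex_uint` and `decode_flex_uint` are hypothetical names. An N-byte FlexUInt is built by shifting the magnitude left by N bits and setting the terminal 1 bit at position N-1.

```python
def encode_flex_uint(value):
    """Encode a non-negative int as FlexUInt bytes (little-endian)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    # Every byte contributes 7 magnitude bits; 1 bit per byte marks the length.
    size = max(1, -(-value.bit_length() // 7))
    encoded = (value << size) | (1 << (size - 1))
    return encoded.to_bytes(size, "little")


def decode_flex_uint(data, offset=0):
    """Return (value, bytes_consumed) for the FlexUInt starting at `offset`."""
    size = 1
    pos = offset
    while data[pos] == 0:          # eight more trailing zero bits
        size += 8
        pos += 1
    low = data[pos]
    size += (low & -low).bit_length() - 1   # trailing zeros before the terminal 1
    if offset + size > len(data):
        raise ValueError("truncated FlexUInt")
    raw = int.from_bytes(data[offset:offset + size], "little")
    return raw >> size, size


print(encode_flex_uint(729).hex())
print(decode_flex_uint(bytes([0x9C, 0x91, 0x02])))
```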
FlexInt
A variable-length signed integer.
From an encoding perspective, FlexInts are structurally similar to a FlexUInt. Both
encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate
how many bytes were used to encode the integer. They differ in the interpretation of their bits; while a
FlexUInt's bits are unsigned, a FlexInt's bits are encoded using
two's complement notation.
Tip: An implementation could choose to read a FlexInt by instead reading a FlexUInt and then reinterpreting its bits
as two's complement.
FlexInt encoding of 14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
2's comp. 14
FlexInt encoding of -14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
2's comp. -14
FlexInt encoding of 729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 1
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the 2's of the 2's
comp. integer comp. integer
FlexInt encoding of -729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
1 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the 2's of the 2's
comp. integer comp. integer
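A FlexInt can be handled with the same length-marker logic as a FlexUInt, with the payload bits read as two's complement. The sketch below is illustrative only; `encode_flex_int` and `decode_flex_int` are hypothetical names.

```python
def encode_flex_int(value):
    """Encode a signed int as FlexInt bytes (little-endian, two's complement)."""
    # Bits needed for two's complement, including the sign bit.
    needed = (value if value >= 0 else ~value).bit_length() + 1
    size = -(-needed // 7)                      # 7 payload bits per byte
    encoded = (value << size) | (1 << (size - 1))
    mask = (1 << (8 * size)) - 1                # wrap negatives into `size` bytes
    return (encoded & mask).to_bytes(size, "little")


def decode_flex_int(data, offset=0):
    """Return (value, bytes_consumed) for the FlexInt starting at `offset`."""
    size = 1
    pos = offset
    while data[pos] == 0:
        size += 8
        pos += 1
    low = data[pos]
    size += (low & -low).bit_length() - 1
    if offset + size > len(data):
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[offset:offset + size], "little", signed=True)
    # Python's >> on a negative int is an arithmetic shift, which sign-extends.
    return raw >> size, size


print(encode_flex_int(-729).hex())
print(decode_flex_int(bytes([0xE5])))
```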
FixedUInt
A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.
FixedUInt encoding of 3,954,261
0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
lowest 8 bits next 8 bits of highest 8 bits
of the unsigned the unsigned of the unsigned
integer integer integer
FixedInt
A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two's complement.
FixedInt encoding of -3,954,261
1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 1
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
lowest 8 bits next 8 bits of highest 8 bits
of the 2's the 2's comp. of the 2's comp.
comp. integer integer integer
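Because the width of a FixedUInt or FixedInt comes from context, reading one amounts to interpreting a byte slice of known length. A minimal sketch using Python's built-in `int.from_bytes`, with hypothetical function names:

```python
def read_fixed_uint(data):
    """Interpret all of `data` as a little-endian unsigned integer."""
    return int.from_bytes(data, "little", signed=False)


def read_fixed_int(data):
    """Interpret all of `data` as a little-endian two's complement integer."""
    return int.from_bytes(data, "little", signed=True)


# The byte sequences from the two diagrams above.
print(read_fixed_uint(bytes([0x55, 0x56, 0x3C])))
print(read_fixed_int(bytes([0xAB, 0xA9, 0xC3])))
```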
FlexSym
A variable-length symbol token whose UTF-8 bytes can be inline or found in the symbol table.
A FlexSym begins with a FlexInt; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:
- non-negative, it represents a symbol ID. The symbol's associated text can be found in the local symbol table. No more bytes follow.
- negative, the symbol text is encoded as a number of UTF-8 bytes that follow the FlexInt. The number of bytes is calculated by -1 - flexInt. (This is because 0 is already in use as SID 0, so text length needs to be offset by 1 to support zero-length text.)
FlexSym encoding of symbol ID $10
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│
0 0 0 1 0 1 0 1
└─────┬─────┘
2's comp.
positive 10
FlexSym encoding of symbol text 'hello'
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│ h e l l o
1 1 1 1 0 1 0 1 01101000 01100101 01101100 01101100 01101111
└─────┬─────┘ └─────────────────────┬─────────────────────┘
2's comp. 5-byte UTF-8 encoded "hello"
negative 6
FlexSym encoding of empty symbol text ''
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│
1 1 1 1 1 1 1 1
└─────┬─────┘
2's comp.
negative 1
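The two FlexSym cases described in this section can be sketched as a small reader. This is an illustration that follows only the rules stated here; `read_flex_sym` and the tuple it returns are hypothetical, and a private FlexInt decoder is included so that the block is self-contained.

```python
def _read_flex_int(data, offset):
    """Return (value, bytes_consumed) for a FlexInt at `offset`."""
    size = 1
    pos = offset
    while data[pos] == 0:
        size += 8
        pos += 1
    low = data[pos]
    size += (low & -low).bit_length() - 1
    raw = int.from_bytes(data[offset:offset + size], "little", signed=True)
    return raw >> size, size


def read_flex_sym(data, offset=0):
    """Return ((kind, payload), bytes_consumed) for a FlexSym at `offset`.

    kind is "symbol_id" (payload is an int) or "text" (payload is a str).
    """
    value, size = _read_flex_int(data, offset)
    if value >= 0:
        return ("symbol_id", value), size
    length = -1 - value                 # offset by one so that -1 means empty text
    start = offset + size
    if start + length > len(data):
        raise ValueError("truncated FlexSym text")
    text = bytes(data[start:start + length]).decode("utf-8")
    return ("text", text), size + length


print(read_flex_sym(bytes([0xF5]) + b"hello"))
```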
Opcodes
An opcode is a 1-byte FixedUInt that tells the reader what the next expression represents
and how the bytes that follow should be interpreted.
Opcode meanings are organized loosely by their high and low nibbles.
| High nibble | Low nibble | Meaning |
|---|---|---|
| 0x0_ to 0x3_ | 0-F | E-expression, opcode is the address |
| 0x4_ | 0-7 | E-expression, opcode is the address |
| | 8-F | E-expression with extended address |
| 0x5_ | 0-7 | Symbols with symbol address |
| | 8-9 | Annotations |
| | A | Reserved |
| | B | Tagless-element List |
| | C | Tagless-element S-expression |
| | D-F | Reserved |
| 0x6_ | 0-8 | Integers from 0 to 8 bytes wide |
| | 9 | Reserved |
| | A-D | Floats |
| | E-F | Booleans |
| 0x7_ | 0-F | Decimals |
| 0x8_ | 0-C | Short-form timestamps |
| | D | Reserved |
| | E | null.null |
| | F | Typed nulls |
| 0x9_ | 0-F | Strings |
| 0xA_ | 0-F | Symbols with inline text |
| 0xB_ | 0-F | Lists |
| 0xC_ | 0-F | S-expressions |
| 0xD_ | 0 | Empty struct |
| | 1 | Reserved |
| | 2-F | Structs |
| 0xE_ | 0 | Ion version marker (top-level) / Absent Argument (in E-expression arguments) |
| | 1-8 | Directives |
| | 9-B | Template placeholders |
| | C-D | NOP |
| | E | Struct switch modes (SID ↔ FlexSym) |
| | F | Delimited container end |
| 0xF_ | 0 | Delimited list start |
| | 1 | Delimited S-expression start |
| | 2 | Delimited struct start (SID Mode) |
| | 3 | Delimited struct start (FlexSym Mode) |
| | 4 | E-expression with FlexUInt macro address followed by FlexUInt length prefix |
| | 5 | Integer with FlexUInt length prefix |
| | 6 | Decimal with FlexUInt length prefix |
| | 7 | Timestamp with FlexUInt length prefix |
| | 8 | String with FlexUInt length prefix |
| | 9 | Symbol with FlexUInt length prefix and inline text |
| | A | List with FlexUInt length prefix |
| | B | S-expression with FlexUInt length prefix |
| | C | Struct with FlexUInt length prefix (SID Mode) |
| | D | Struct with FlexUInt length prefix (FlexSym Mode) |
| | E | Blob with FlexUInt length prefix |
| | F | Clob with FlexUInt length prefix |
In addition, some opcodes have different meanings when used as tagless types.
| High nibble | Low nibble | Meaning |
|---|---|---|
| 0x6_ | 0 | FlexInt integer value |
| 0x7_ | 0 | Decimal; tuple of (int,int8) representing the coefficient and exponent respectively. |
| 0xE_ | 0 | FlexUInt integer value |
| | 1, 2, 4, 8 | FixedUInt integer values that are 1, 2, 4, or 8 bytes wide |
| | E | Symbol FlexSym |
Values
Nulls
The opcode 0x8E indicates an untyped null (that is: null, or its alias null.null).
The opcode 0x8F indicates a typed null; a byte follows whose value represents an offset into the following table:
| Byte | Type |
|---|---|
| 0x01 | null.bool |
| 0x02 | null.int |
| 0x03 | null.float |
| 0x04 | null.decimal |
| 0x05 | null.timestamp |
| 0x06 | null.string |
| 0x07 | null.symbol |
| 0x08 | null.blob |
| 0x09 | null.clob |
| 0x0A | null.list |
| 0x0B | null.sexp |
| 0x0C | null.struct |
All other byte values are reserved for future use.
Encoding of null
┌──── The opcode `0x8E` represents a null (null.null)
8E
Encoding of null.string
┌──── The opcode `0x8F` indicates a typed null; a byte indicating the type follows
│ ┌──── Byte 0x06 indicates the type `string`
8F 06
Booleans
0x6E represents boolean true, while 0x6F represents boolean false.
0x8F 0x01 represents null.bool.
Encoding of boolean true
6E
Encoding of boolean false
6F
Encoding of null.bool
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: boolean
│ │
8F 01
Integers
Opcodes in the range 0x60 to 0x68 represent an integer. The opcode is followed by a FixedInt that
represents the integer value. The low nibble of the opcode (0x_0 to 0x_8) indicates the size of the FixedInt.
Opcode 0x60 represents integer 0; no more bytes follow.
Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF5,
followed by a FlexUInt indicating how many bytes of representation data follow.
0x8F 0x02 represents null.int.
Encoding of integer 0
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 0 indicates
││ no more bytes follow.
60
Encoding of integer 17
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 1 indicates
││ a single byte follows.
61 11
└── FixedInt 17
Encoding of integer -944
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 2 indicates
││ that two bytes follow.
62 50 FC
└─┬─┘
FixedInt -944
Encoding of integer -944
┌──── Opcode F5 indicates a variable-length integer, FlexUInt length follows
│ ┌─── FlexUInt 2; a 2-byte FixedInt follows
│ │
F5 05 50 FC
└─┬─┘
FixedInt -944
Encoding of null.int
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: integer
│ │
8F 02
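The integer opcode selection can be sketched as follows. `encode_int` is a hypothetical writer function that always picks the smallest FixedInt width; as the second -944 example shows, a writer is also free to use the 0xF5 form for values that would fit in 8 bytes or fewer.

```python
def _flex_uint(value):
    """Minimal FlexUInt encoder used for the 0xF5 length prefix."""
    size = max(1, -(-value.bit_length() // 7))
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_int(value):
    """Encode an int using opcodes 0x60-0x68, or 0xF5 beyond 8 bytes."""
    if value == 0:
        return b"\x60"                      # no body bytes follow
    # Smallest two's complement width in bytes, including the sign bit.
    bits = (value if value >= 0 else ~value).bit_length()
    size = (bits + 8) // 8
    body = value.to_bytes(size, "little", signed=True)
    if size <= 8:
        return bytes([0x60 + size]) + body  # low nibble is the FixedInt width
    return b"\xF5" + _flex_uint(size) + body


print(encode_int(17).hex())
print(encode_int(-944).hex())
```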
Floats
Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:
- 0 bits (0 bytes), representing the value 0e0 and indicated by opcode 0x6A
- 16 bits (2 bytes in little-endian order, half-precision), indicated by opcode 0x6B
- 32 bits (4 bytes in little-endian order, single precision), indicated by opcode 0x6C
- 64 bits (8 bytes in little-endian order, double precision), indicated by opcode 0x6D
Note: In the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.
0x8F 0x03 represents null.float.
Encoding of float 0e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble A indicates
││ a 0-length float; 0e0
6A
Encoding of float 3.14e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble B indicates a 2-byte float
││
6B 47 42
└─┬─┘
half-precision 3.14
Encoding of float 3.1415927e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble C indicates a 4-byte,
││ single-precision value.
6C DB 0F 49 40
└────┬────┘
single-precision 3.1415927
Encoding of float 3.141592653589793e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble D indicates an 8-byte,
││ double-precision value.
6D 18 2D 44 54 FB 21 09 40
└──────────┬──────────┘
double-precision 3.141592653589793
Encoding of null.float
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: float
│ │
8F 03
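A writer that wants the most compact lossless float encoding can try each width in turn and keep the first one that round-trips. The sketch below uses Python's `struct` module (format codes `e`, `f`, and `d` for 16-, 32-, and 64-bit IEEE-754 values); `encode_float` is a hypothetical name, and NaN values simply fall through to the 64-bit form because NaN never compares equal to itself.

```python
import math
import struct


def encode_float(value):
    """Encode a Python float with the smallest lossless Ion float opcode."""
    # Opcode 0x6A stands for positive zero only; -0e0 needs explicit bytes.
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"
    for opcode, fmt in ((0x6B, "<e"), (0x6C, "<f")):
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            continue                       # magnitude too large for this width
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed
    return b"\x6D" + struct.pack("<d", value)


print(encode_float(1.5).hex())
print(encode_float(3.141592653589793).hex())
```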
Decimals
If an opcode has a high nibble of 0x7_, it represents a decimal. Low nibble values indicate
the number of trailing bytes used to encode the decimal.
The body of the decimal is encoded as a FlexInt representing its exponent, followed by a FixedInt
representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length
of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0. When
the coefficient is present but has a value of 0, the coefficient is -0.
Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF6.
0x8F 0x04 represents null.decimal.
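The rules above can be sketched in Python as follows. The helper names (`flex_uint`, `flex_int`, `fixed_int`, `encode_decimal`) and the `negative_zero` parameter are hypothetical; the FlexUInt and FlexInt helpers assume the layout defined elsewhere in this specification (continuation bits in the least significant bits, little-endian byte order).

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (7 payload bits per byte)."""
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (two's complement payload)."""
    bits = (value if value >= 0 else ~value).bit_length() + 1  # includes sign bit
    n = (bits + 6) // 7
    encoded = (value << n) | (1 << (n - 1))
    return (encoded & ((1 << (8 * n)) - 1)).to_bytes(n, "little")

def fixed_int(value: int) -> bytes:
    """Encode a signed integer in the fewest little-endian two's complement bytes."""
    n = ((value if value >= 0 else ~value).bit_length() + 8) // 8
    return value.to_bytes(n, "little", signed=True)

def encode_decimal(coefficient: int, exponent: int, negative_zero: bool = False) -> bytes:
    if coefficient == 0 and exponent == 0 and not negative_zero:
        body = b""                       # 0d0 has a zero-byte body
    else:
        body = flex_int(exponent)
        if coefficient != 0:
            body += fixed_int(coefficient)
        elif negative_zero:
            body += b"\x00"              # an explicit zero coefficient means -0
        # otherwise the coefficient is omitted, which implies positive 0
    if len(body) <= 15:
        return bytes([0x70 | len(body)]) + body
    return b"\xF6" + flex_uint(len(body)) + body
```

For instance, `encode_decimal(127, -2)` yields the bytes `72 FD 7F`, and `encode_decimal(0, 3, negative_zero=True)` yields `72 07 00`.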
Encoding of decimal 0d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 0 indicates a zero-byte
││ decimal; 0d0
70
Encoding of decimal 7d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 01 07
| └─── Coefficient: 1-byte FixedInt 7
└─── Exponent: FlexInt 0
Encoding of decimal 1.27
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 FD 7F
| └─── Coefficient: FixedInt 127
└─── Exponent: 1-byte FlexInt -2
Variable-length encoding of decimal 1.27
┌──── Opcode F6 indicates a variable-length decimal
│
F6 05 FD 7F
| | └─── Coefficient: FixedInt 127
| └───── Exponent: 1-byte FlexInt -2
└─────── Decimal length: FlexUInt 2
Encoding of 0d3, which has a coefficient of zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 1 indicates a 1-byte decimal
││
71 07
└────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0
Encoding of -0d3, which has a coefficient of negative zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 07 00
| └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0
└────── Exponent: FlexInt 3
Encoding of null.decimal
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: decimal
│ │
8F 04
Timestamps
Timestamps have two encodings:
- Short-form timestamps, a compact representation optimized for the most commonly used precisions and date ranges.
- Long-form timestamps, a less compact representation capable of representing any timestamp in the Ion data model.
0x8F 0x05 represents null.timestamp.
Encoding of null.timestamp
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: timestamp
│ │
8F 05
note
In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time.
Short-form Timestamps
If an opcode has a high nibble of 0x8_, it represents a short-form timestamp. This encoding focuses on making the
most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via
the variable-length long form timestamp encoding.
Timestamps may be encoded using the short form if they meet all of the following conditions:
- The year is between 1970 and 2097. The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.
- The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes. 7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2024-08T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.
- The fractional seconds are a common precision. The timestamp's fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).
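The three conditions can be checked mechanically. The Python sketch below is illustrative only: the function names are hypothetical, an unknown offset is modeled as `None`, and `fraction_digits == 0` is assumed to mean that no fractional seconds are present.

```python
from typing import Optional

UNKNOWN_OFFSET_BITS = 0b1111111

def short_form_offset_bits(offset_minutes: Optional[int]) -> Optional[int]:
    """Return the 7-bit offset subfield, or None if the offset requires the long form."""
    if offset_minutes is None:               # unknown offset
        return UNKNOWN_OFFSET_BITS
    if offset_minutes % 15 != 0 or not (-14 * 60 <= offset_minutes <= 14 * 60):
        return None
    return offset_minutes // 15 + 56         # quarter hours counted from -14:00

def fits_short_form(year: int, offset_minutes: Optional[int], fraction_digits: int) -> bool:
    """True if a timestamp with these properties meets all short-form conditions."""
    return (
        1970 <= year <= 2097
        and short_form_offset_bits(offset_minutes) is not None
        and fraction_digits in (0, 3, 6, 9)
    )
```

A timestamp that fails any of these checks can still be written using the long form.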
Opcodes by precision and offset
Each opcode with a high nibble of 0x8_ indicates a different precision and offset encoding pair.
| Opcode | Precision | Serialized size in bytes [1] | Offset encoding |
|---|---|---|---|
| 0x80 | Year | 1 | Implicitly Unknown offset |
| 0x81 | Month | 2 | Implicitly Unknown offset |
| 0x82 | Day | 2 | Implicitly Unknown offset |
| 0x83 | Hour and minutes | 4 | 1 bit to indicate UTC or Unknown Offset |
| 0x84 | Seconds | 5 | 1 bit to indicate UTC or Unknown Offset |
| 0x85 | Milliseconds | 6 | 1 bit to indicate UTC or Unknown Offset |
| 0x86 | Microseconds | 7 | 1 bit to indicate UTC or Unknown Offset |
| 0x87 | Nanoseconds | 8 | 1 bit to indicate UTC or Unknown Offset |
| 0x88 | Hour and minutes | 5 | 7 bits to represent a known offset. [2] |
| 0x89 | Seconds | 5 | 7 bits to represent a known offset. |
| 0x8A | Milliseconds | 7 | 7 bits to represent a known offset. |
| 0x8B | Microseconds | 8 | 7 bits to represent a known offset. |
| 0x8C | Nanoseconds | 9 | 7 bits to represent a known offset. |
The body of a short-form timestamp is encoded as a FixedUInt of the size specified by the opcode. This integer is
then partitioned into bit-fields representing the timestamp's subfields. Note that endianness does not apply here because the
bit-fields are defined over the body interpreted as an integer.
The following letters are used to denote bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants, and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.
Unused bits are ignored by reader implementations.
The Month and Day subfields are one-based; 0 is not a valid month or day.
| Letter code | Number of bits | Subfield |
|---|---|---|
| Y | 7 | Year |
| M | 4 | Month |
| D | 5 | Day |
| H | 5 | Hour |
| m | 6 | Minute |
| o | 7 | Offset |
| U | 1 | Unknown (0) or UTC (1) offset |
| s | 6 | Second |
| f | 10 (ms), 20 (μs), or 30 (ns) | Fractional second |
| . | n/a | Unused |
We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.
7 0 <--- bit position
| |
+=========+
byte 0 | 0xNN | <-- hex notation for constants like opcodes
+=========+ <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
1 |nnnn:nnnn| <-- bits denoted with a `:` as a delimiter to aid in reading
+---------+ <-- octet boundary within an encoding primitive
...
+---------+
N |nnnn:nnnn|
+=========+
The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant).
note
While this encoding may complicate human reading, it guarantees that the timestamp's subfields (year, month,
etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield,
fractional_seconds, always begins at the same bit index when present, but can vary in length according to the
precision.) This arrangement allows processors to read the little-endian bytes into an integer and then mask the
appropriate bit ranges to access the subfields.
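The read-then-mask approach described in the note can be sketched in Python as follows. The name `decode_short_timestamp` and the dictionary it returns are hypothetical; the bit positions are derived from the subfield widths in the table above, and the offset is un-biased as described in the short-form conditions. No validation (for example, of month or day ranges) is attempted.

```python
# Body sizes by opcode, taken from the opcode table above.
SIZES = {0x80: 1, 0x81: 2, 0x82: 2, 0x83: 4, 0x84: 5, 0x85: 6, 0x86: 7, 0x87: 8,
         0x88: 5, 0x89: 5, 0x8A: 7, 0x8B: 8, 0x8C: 9}
FRACTION_BITS = {0x85: 10, 0x86: 20, 0x87: 30, 0x8A: 10, 0x8B: 20, 0x8C: 30}

def decode_short_timestamp(data: bytes) -> dict:
    """Decode a short-form timestamp (opcode plus body) into its subfields."""
    opcode = data[0]
    # Read the whole body as one little-endian integer, then mask subfields.
    bits = int.from_bytes(data[1:1 + SIZES[opcode]], "little")
    fields = {"year": 1970 + (bits & 0x7F)}
    if opcode >= 0x81:
        fields["month"] = (bits >> 7) & 0xF
    if opcode >= 0x82:
        fields["day"] = (bits >> 11) & 0x1F
    if opcode >= 0x83:
        fields["hour"] = (bits >> 16) & 0x1F
        fields["minute"] = (bits >> 21) & 0x3F
        if opcode >= 0x88:                      # 7-bit known-offset layout
            raw = (bits >> 27) & 0x7F
            fields["offset_minutes"] = None if raw == 0x7F else (raw - 56) * 15
            pos = 34
        else:                                   # 1-bit UTC/unknown layout
            fields["offset_minutes"] = 0 if (bits >> 27) & 1 else None
            pos = 28
        if opcode not in (0x83, 0x88):
            fields["second"] = (bits >> pos) & 0x3F
        if opcode in FRACTION_BITS:
            mask = (1 << FRACTION_BITS[opcode]) - 1
            fields["fraction"] = (bits >> (pos + 6)) & mask
    return fields
```

Because every subfield sits at a fixed bit index, the same shifts work for all opcodes that share an offset layout; only the set of subfields present changes.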
Encoding of a timestamp with year precision
+=========+
byte 0 | 0x80 |
+=========+
1 |.YYY:YYYY|
+=========+
Encoding of a timestamp with month precision
+=========+
byte 0 | 0x81 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |....:.MMM|
+=========+
Encoding of a timestamp with day precision
+=========+
byte 0 | 0x82 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+=========+
Encoding of a timestamp with hour-and-minutes precision at UTC or unknown offset
+=========+
byte 0 | 0x83 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |....:Ummm|
+=========+
Encoding of a timestamp with seconds precision at UTC or unknown offset
+=========+
byte 0 | 0x84 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |....:..ss|
+=========+
Encoding of a timestamp with milliseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x85 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |....:ffff|
+=========+
Encoding of a timestamp with microseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x86 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |ffff:ffff|
+---------+
7 |..ff:ffff|
+=========+
Encoding of a timestamp with nanoseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x87 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |ffff:ffff|
+=========+
Encoding of a timestamp with hour-and-minutes precision at known offset
+=========+
byte 0 | 0x88 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |....:..oo|
+=========+
Encoding of a timestamp with seconds precision at known offset
+=========+
byte 0 | 0x89 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+=========+
Encoding of a timestamp with milliseconds precision at known offset
+=========+
byte 0 | 0x8A |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |....:..ff|
+=========+
Encoding of a timestamp with microseconds precision at known offset
+=========+
byte 0 | 0x8B |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |....:ffff|
+=========+
Encoding of a timestamp with nanoseconds precision at known offset
+=========+
byte 0 | 0x8C |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |ffff:ffff|
+---------+
9 |..ff:ffff|
+=========+
Examples of short-form timestamps
| Text | Binary |
|---|---|
| 2023T | 80 35 |
| 2023-10-15T | 82 35 7D |
| 2023-10-15T11:22:33Z | 84 35 7D CB 1A 02 |
| 2023-10-15T11:22:33-00:00 | 84 35 7D CB 12 02 |
| 2023-10-15T11:22:33+01:15 | 89 35 7D CB EA 85 |
| 2023-10-15T11:22:33.444555666+01:15 | 8C 35 7D CB EA 85 92 61 7F 1A |
Long-form Timestamps
Unlike the short-form encoding, which is limited to the most commonly used timestamp ranges and precisions, the long-form timestamp encoding is capable of representing any valid timestamp.
The long form begins with opcode 0xF7. A FlexUInt follows indicating the number
of bytes that were needed to represent the timestamp. The encoding consumes the minimum number
of bytes required to represent the timestamp. The declared length can be mapped to the timestamp’s
precision as follows:
| Length | Corresponding precision |
|---|---|
| 0 | Illegal |
| 1 | Illegal |
| 2 | Year |
| 3 | Month or Day (see below) |
| 4 | Illegal; the hour cannot be specified without also specifying minutes |
| 5 | Illegal |
| 6 | Minutes |
| 7 | Seconds |
| 8 or more | Fractional seconds |
Unlike the short-form encoding, the long-form encoding reserves:
- 14 bits for the year (Y), which is not biased.
- 12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.
Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the
timestamp are encoded as bit-fields on a FixedUInt that corresponds to the length that followed the opcode.
If the timestamp's overall length is greater than or equal to 8, the FixedUInt part of the timestamp is 7 bytes
and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a
(scale, coefficient) pair, which is similar to a decimal. The primary difference is that the scale
represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to
1.0 or less than 0.0. The scale is encoded as a FlexUInt (instead of FlexInt) to discourage the
encoding of decimal numbers greater than 1.0. The coefficient is encoded as a FixedUInt (instead of FixedInt) to
prevent the encoding of fractional seconds less than 0.0. Note that validation is still required; namely:
- A scale value of 0 is illegal, as any non-zero coefficient would then yield fractional seconds of at least 1.0 (a whole second).
- If coefficient * 10^-scale >= 1.0, that (coefficient, scale) pair is illegal.
If the timestamp's length is 3, the precision is determined by inspecting the day (DDDDD) bits. Like the short-form,
the Month and Day subfields are one-based (0 is not a valid month or day). If the day subfield is zero, that
indicates month precision. If the day subfield is any non-zero number, that indicates day precision.
Encoding of the body of a long-form timestamp
+=========+
byte 0 |YYYY:YYYY|
+=========+
1 |MMYY:YYYY|
+---------+
2 |HDDD:DDMM|
+---------+
3 |mmmm:HHHH|
+---------+
4 |oooo:oomm|
+---------+
5 |ssoo:oooo|
+---------+
6 |....:ssss|
+=========+
7 |FlexUInt | <-- scale of the fractional seconds
+---------+
...
+=========+
N |FixedUInt| <-- coefficient of the fractional seconds
+---------+
...
Examples of long-form timestamps
| Text | Binary |
|---|---|
| 1947T | F7 05 9B 07 |
| 1947-12T | F7 07 9B 07 03 |
| 1947-12-23T | F7 07 9B 07 5F |
| 1947-12-23T11:22:33-00:00 | F7 0F 9B 07 DF 65 FD 7F 08 |
| 1947-12-23T11:22:33+01:15 | F7 0F 9B 07 DF 65 AD 57 08 |
| 1947-12-23T11:22:33.127+01:15 | F7 13 9B 07 DF 65 AD 57 08 07 7F |
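The following Python sketch shows one way a writer might assemble a long-form timestamp from the bit layout above. The function and parameter names are hypothetical, `fraction` is modeled as a `(scale, coefficient)` tuple, an unknown offset is modeled as `None`, and a coefficient of 0 is assumed to occupy zero bytes. Range validation of the individual subfields is omitted.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (layout assumed from this spec)."""
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_long_timestamp(year, month=0, day=0, hour=None, minute=None,
                          second=None, offset_minutes=None, fraction=None) -> bytes:
    # Year occupies bits 0-13, month bits 14-17, day bits 18-22.
    bits = year | (month << 14) | (day << 18)
    if hour is None:
        size = 2 if month == 0 else 3          # a zero day means month precision
    else:
        offset = 0xFFF if offset_minutes is None else offset_minutes + 1440
        # Hour occupies bits 23-27, minute bits 28-33, offset bits 34-45.
        bits |= (hour << 23) | (minute << 28) | (offset << 34)
        size = 6
        if second is not None:
            bits |= second << 46               # second occupies bits 46-51
            size = 7
    body = bits.to_bytes(size, "little")
    if fraction is not None:
        scale, coefficient = fraction
        if scale == 0 or coefficient >= 10 ** scale:
            raise ValueError("fractional seconds must be less than one second")
        width = (coefficient.bit_length() + 7) // 8
        body += flex_uint(scale) + coefficient.to_bytes(width, "little")
    return b"\xF7" + flex_uint(len(body)) + body
```

Under these assumptions, `encode_long_timestamp(1947, 12, 23)` yields `F7 07 9B 07 5F`, matching the day-precision row in the table above.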
[1] Serialized size in bytes does not include the opcode.
[2] This encoding can also represent UTC and Unknown Offset, though it is less compact than opcodes 0x83-0x87 above.
Strings
If the high nibble of the opcode is 0x9_, it represents a string. The low nibble of the opcode
indicates how many UTF-8 bytes follow. Opcode 0x90 represents a string with empty text ("").
Strings longer than 15 bytes can be encoded with the F8 opcode, which takes a FlexUInt-encoded length
after the opcode.
0x8F 0x06 represents null.string.
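A minimal Python sketch of the opcode selection follows. The names `flex_uint` and `encode_string` are hypothetical, and the FlexUInt helper assumes the layout defined elsewhere in this specification.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (layout assumed from this spec)."""
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_string(text: str) -> bytes:
    utf8 = text.encode("utf-8")
    if len(utf8) <= 15:
        # Short strings carry their byte length in the low nibble of the opcode.
        return bytes([0x90 | len(utf8)]) + utf8
    # Longer strings use opcode 0xF8 followed by a FlexUInt byte length.
    return b"\xF8" + flex_uint(len(utf8)) + utf8
```

Note that the length counts UTF-8 bytes rather than characters, so a string of fewer than 16 characters may still require the 0xF8 form.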
Encoding of the empty string, ""
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
90
Encoding of a 14-byte string
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││ f o u r t e e n b y t e s
9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
└──────────────────┬────────────────────┘
UTF-8 bytes
Encoding of a 24-byte string
┌──── Opcode F8 indicates a variable-length string
│ ┌─── Length: FlexUInt 24
│ │ v a r i a b l e l e n g t h e n c o d i n g
F8 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
└────────────────────────────────┬────────────────────────────────────┘
UTF-8 bytes
Encoding of null.string
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: string
│ │
8F 06
Symbols
Symbols With Inline Text
If the high nibble of the opcode is 0xA_, it represents a symbol whose text follows the opcode. The low nibble of the
opcode indicates how many UTF-8 bytes follow. Opcode 0xA0 represents a symbol with empty text ('').
0x8F 0x07 represents null.symbol.
Encoding of a symbol with empty text ('')
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
A0
Encoding of a symbol with 14 bytes of inline text
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││ f o u r t e e n b y t e s
AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
└──────────────────┬────────────────────┘
UTF-8 bytes
Encoding of a symbol with 24 bytes of inline text
┌──── Opcode F9 indicates a variable-length symbol with inline text
│ ┌─── Length: FlexUInt 24
│ │ v a r i a b l e l e n g t h e n c o d i n g
F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
└────────────────────────────────┬────────────────────────────────────┘
UTF-8 bytes
Encoding of null.symbol
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: symbol
│ │
8F 07
Symbols With a Symbol Address
Symbol values whose text can be found in the local symbol table are encoded using opcodes 0x50 through 0x57.
The opcodes 0x50 through 0x57 share the same 5 most-significant bits. The 3 least-significant bits are used as the
3 least-significant bits of the symbol ID.
The opcode is followed by a FlexUInt, which, once decoded, represents the most-significant bits of the symbol ID.
Computing the symbol ID from the opcode and FlexUInt is straightforward and can be implemented using either bitwise operations or simple arithmetic.
// Given an `opcode` and `flexUInt`...
let lsb = opcode & 0b111 // or opcode - 0x50
let msb = flexUInt << 3 // or flexUInt * 8
let symbolId = msb | lsb // or msb + lsb
The reverse transformation is also simple:
// Given `symbolId`...
let opcode = 0x50 | (symbolId & 0b111) // or 0x50 + (symbolId % 8)
let flexUInt = symbolId >>> 3 // or symbolId / 8
The number of bytes required to encode symbol addresses is as follows:
| SID Range | Encoded size, including opcode |
|---|---|
| $0..$1023 | 2 |
| $1024..$131071 | 3 |
| $131072..$16777215 | 4 |
| $16777216..$2147483647 | 5 |
This table only goes to ~2 billion, but the encoding itself does not have a limit on the number of symbol IDs. However, most Ion implementations will have some upper bound on the number of symbols that depends on the implementation language and/or the underlying hardware.
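For completeness, the two transformations can be combined with FlexUInt encoding and decoding into a runnable Python sketch. All function names here are hypothetical, and the FlexUInt helpers assume the layout defined elsewhere in this specification (the number of trailing zero bits before the first set bit gives the encoded length).

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def read_flex_uint(data: bytes, pos: int = 0):
    """Decode a FlexUInt starting at `pos`; return (value, bytes_consumed)."""
    n = 1
    # Count the low-order zero bits; the first set bit ends the length marker.
    while ((data[pos + (n - 1) // 8] >> ((n - 1) % 8)) & 1) == 0:
        n += 1
    raw = int.from_bytes(data[pos:pos + n], "little")
    return raw >> n, n

def encode_symbol_address(sid: int) -> bytes:
    # Low 3 bits of the SID go in the opcode; the rest go in the FlexUInt.
    return bytes([0x50 | (sid & 0b111)]) + flex_uint(sid >> 3)

def decode_symbol_address(data: bytes) -> int:
    high_bits, _ = read_flex_uint(data, 1)
    return (high_bits << 3) | (data[0] & 0b111)
```

The encoded sizes produced by `encode_symbol_address` follow the table above: each additional FlexUInt byte contributes 7 more bits to the symbol ID.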
Encoding of symbol with SID 1 ($ion)
┌──── Opcode 0x51 indicates a symbol with SID; low 3 bits = 1
│ ┌─── FlexUInt 0 represents the high bits (0 << 3 = 0)
│ │
51 01
Encoding of symbol with SID 10
┌──── Opcode 0x52 indicates a symbol with SID; low 3 bits = 2
│ ┌─── FlexUInt 1 represents the high bits (1 << 3 = 8)
│ │
52 03
Encoding of symbol with SID 1000
┌──── Opcode 0x50 indicates a symbol with SID; low 3 bits = 0
│ ┌─── FlexUInt 125 represents the high bits (125 << 3 = 1000)
│ │
50 FB
Blobs
Opcode FE indicates a blob of binary data. A FlexUInt follows that represents the blob's byte-length.
0x8F 0x08 represents null.blob.
Example blob encoding
┌──── Opcode FE indicates a blob, FlexUInt length follows
│ ┌─── Length: FlexUInt 24
│ │
FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
└────────────────────────────────┬────────────────────────────────────┘
24 bytes of binary data
Encoding of null.blob
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: blob
│ │
8F 08
Clobs
Opcode FF indicates a clob--binary character data of an unspecified encoding. A FlexUInt follows that represents
the clob's byte-length.
0x8F 0x09 represents null.clob.
Example clob encoding
┌──── Opcode FF indicates a clob, FlexUInt length follows
│ ┌─── Length: FlexUInt 24
│ │
FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
└────────────────────────────────┬────────────────────────────────────┘
24 bytes of binary data
Encoding of null.clob
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: clob
│ │
8F 09
Lists
| Opcode | Encoding |
|---|---|
| 0xB0-0xBF | Length-prefixed list; low nibble of the opcode represents the byte-length. |
| 0xFA | Variable-length prefixed list; a FlexUInt following the opcode represents the byte-length. |
| 0xF0 | Starts a delimited list; 0xEF closes the most recently opened delimited container. |
| 0x5B | Tagless-element list; opcode is followed by element encoding type and number of elements. |
0x8F 0x0A represents null.list.
Length-prefixed encoding
An opcode with a high nibble of 0xB_ indicates a length-prefixed list. The lower nibble of the
opcode indicates how many bytes were used to encode the child values that the list contains.
If the list's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFA opcode
to write a variable-length list. The 0xFA opcode is followed by a FlexUInt
that indicates the list's byte length.
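As a sketch of how a writer might choose between the two length-prefixed forms, the Python function below takes child values that have already been encoded (each including its own opcode). The names `flex_uint` and `encode_list` are hypothetical, and the FlexUInt helper assumes the layout defined elsewhere in this specification.

```python
from typing import Iterable

def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (layout assumed from this spec)."""
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_list(encoded_children: Iterable[bytes]) -> bytes:
    # The length prefix counts bytes, not elements, so the children must be
    # fully encoded (buffered) before the opcode can be written.
    body = b"".join(encoded_children)
    if len(body) <= 15:
        return bytes([0xB0 | len(body)]) + body
    return b"\xFA" + flex_uint(len(body)) + body
```

The need to buffer the children before emitting the prefix is the cost that the delimited encoding avoids.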
Length-prefixed encoding of an empty list ([])
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 0 indicates that the child values of this
││ list took zero bytes to encode.
B0
Length-prefixed encoding of [1, 2, 3]
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this
││ list took six bytes to encode.
B6 61 01 61 02 61 03
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Length-prefixed encoding of ["variable length list"]
┌──── Opcode 0xFA indicates a variable-length list. A FlexUInt length follows.
│ ┌───── Length: FlexUInt 22
│ │ ┌────── Opcode 0xF8 indicates a variable-length string. A FlexUInt length follows.
│ │ │ ┌─────── Length: FlexUInt 20
│ │ │ │ v a r i a b l e l e n g t h l i s t
FA 2d F8 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74
└─────────────────────────────┬─────────────────────────────────┘
Nested string element
Delimited Encoding
Opcode 0xF0 begins a delimited list, while opcode 0xEF closes the most recently opened delimited container
that has not yet been closed.
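A delimited writer can emit each child as soon as it is available, because no byte length has to be known up front. The Python sketch below illustrates this for nested lists of small integers; the function name is hypothetical, and for brevity it only handles integers that fit in the 1-byte form (opcode 0x61, as used in the examples in this section).

```python
def encode_delimited_list(values) -> bytes:
    """Encode a (possibly nested) Python list of small ints as a delimited list."""
    out = bytearray([0xF0])                    # open a delimited list
    for value in values:
        if isinstance(value, list):
            out += encode_delimited_list(value)  # nested list opens and closes itself
        else:
            # Raises OverflowError for ints outside the signed 8-bit range.
            out += b"\x61" + value.to_bytes(1, "little", signed=True)
    out.append(0xEF)                           # close the most recently opened container
    return bytes(out)
```

Each 0xEF produced here pairs with the nearest unclosed 0xF0, which is why nesting needs no additional bookkeeping in the output.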
Delimited encoding of an empty list ([])
┌──── Opcode 0xF0 indicates a delimited list
│ ┌─── Opcode 0xEF indicates the end of the most recently opened container
F0 EF
Delimited encoding of [1, 2, 3]
┌──── Opcode 0xF0 indicates a delimited list
│ ┌─── Opcode 0xEF indicates the end of
│ │ the most recently opened container
F0 61 01 61 02 61 03 EF
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Delimited encoding of [1, [2], 3]
┌──── Opcode 0xF0 indicates a delimited list
│ ┌─── Opcode 0xF0 begins a nested delimited list
│ │ ┌─── Opcode 0xEF closes the most recently
│ │ │ opened delimited container: the nested list.
│ │ │ ┌─── Opcode 0xEF closes the most recently opened (and
│ │ │ │ still open) delimited container: the outer list.
│ │ │ │
F0 61 01 F0 61 02 EF 61 03 EF
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Tagless-Element Lists
Opcode 0x5B indicates a tagless-element list. This is a compact encoding for homogeneous collections where all elements have the same type.
The elements of the list can be a primitive encoding or a macro-shape.
The opcode is followed by:
- One or more bytes describing the tagless type:
  - If the byte is in 0x00-0x47: only one byte (the opcode) is present; this is the macro address
  - If the byte is in 0x48-0x4F: it is followed by a FlexUInt to encode the entire macro address
  - If the byte is 0xF4: it is followed by a FlexUInt which encodes the entire macro address
  - For any other byte value: only one byte (the opcode) is present
- A FlexUInt length indicating the number of direct child values in the list
- Each element encoded without the leading opcode or macro address
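The layout above can be sketched in Python as follows. The function names are hypothetical: `tagless_type_length` applies the byte-range rules to find how many bytes describe the tagless type, and `encode_tagless_int8_list` writes a list whose elements are 1-byte integers (tagless type 0x61). The FlexUInt helpers assume the layout defined elsewhere in this specification.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def flex_uint_length(data: bytes, pos: int) -> int:
    """Return the encoded length in bytes of the FlexUInt starting at `pos`."""
    n = 1
    while ((data[pos + (n - 1) // 8] >> ((n - 1) % 8)) & 1) == 0:
        n += 1
    return n

def tagless_type_length(data: bytes, pos: int) -> int:
    """Return how many bytes describe the tagless type starting at `pos`."""
    first = data[pos]
    if 0x48 <= first <= 0x4F or first == 0xF4:
        return 1 + flex_uint_length(data, pos + 1)   # opcode plus FlexUInt address
    return 1                                         # the opcode alone

def encode_tagless_int8_list(values) -> bytes:
    # Elements are written without their own opcodes; the count is in elements.
    body = b"".join(v.to_bytes(1, "little", signed=True) for v in values)
    return b"\x5B\x61" + flex_uint(len(values)) + body
```

Unlike the length-prefixed list encoding, the FlexUInt here counts elements rather than bytes, so a reader must know the element width to skip the list.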
Tagless-element list of integers [{#int8} 1, 2, 3, 4]
┌──── Opcode 0x5B indicates a tagless-element list
│ ┌─── Tagless type: 0x61 (int8)
│ │ ┌─── Length: FlexUInt 4 (4 elements)
│ │ │
5B 61 09 01 02 03 04
└────┬────┘
4 int8 values
Tagless-element list with macro shape [{:point} (1 3), (1 4), (2 4)]
┌──── Opcode 0x5B indicates a tagless-element list
│ ┌─── Tagless type: 0x05 (macro address 5, assuming :point is at address 5)
│ │ ┌─── Length: FlexUInt 3 (3 elements)
│ │ │ ┌─── First element: (1 3)
│ │ │ │ ┌─── Second element: (1 4)
│ │ │ │ │ ┌─── Third element: (2 4)
│ │ │ ┌────┴────┐ ┌────┴────┐ ┌────┴────┐
5B 05 07 61 01 61 03 61 01 61 04 61 02 61 04
└─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘
1 3 1 4 2 4
Tagless-element list with macro-shape using length-prefixed E-expression [{:3} (1 3), (1 259)]
┌──── Opcode 0x5B indicates a tagless-element list
│ ┌─── Tagless type opcode F4 with FlexUInt address 3
│ │ ┌─── Length: FlexUInt 2 (2 elements)
│ │ │ ┌─── First element: (1 3)
│ │ │ │ ┌─── Second element: (1 259)
│ ┌─┴─┐ │ ┌─────┴──────┐ ┌───────┴───────┐
5B F4 07 05 09 61 01 61 03 0B 61 01 62 03 01
│ └─┬─┘ └─┬─┘ │ └─┬─┘ └───┬──┘
│ 1 3 │ 1 259
└─ FlexUInt └─ FlexUInt
Length=4 Length=5
Encoding of null.list
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: list
│ │
8F 0A
S-Expressions
S-expressions use the same encodings as lists, but with different opcodes.
| Opcode | Encoding |
|---|---|
| 0xC0-0xCF | Length-prefixed S-expression; low nibble of the opcode represents the byte-length. |
| 0xFB | Variable-length prefixed S-expression; a FlexUInt following the opcode represents the byte-length. |
| 0xF1 | Starts a delimited S-expression; 0xEF closes the most recently opened delimited container. |
| 0x5C | Tagless-element S-expression; opcode is followed by element encoding type and number of elements. |
0x8F 0x0B represents null.sexp.
Length-prefixed encoding
Length-prefixed encoding of an empty S-expression (())
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 0 indicates that the child values of this S-expression
││ took zero bytes to encode.
C0
Length-prefixed encoding of (1 2 3)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 6 indicates that the child values of this S-expression
││ took six bytes to encode.
C6 61 01 61 02 61 03
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Length-prefixed encoding of ("variable length sexp")
┌──── Opcode 0xFB indicates a variable-length sexp. A FlexUInt length follows.
│ ┌───── Length: FlexUInt 22
│ │ ┌────── Opcode 0xF8 indicates a variable-length string. A FlexUInt length follows.
│ │ │ ┌─────── Length: FlexUInt 20
│ │ │ │ v a r i a b l e l e n g t h s e x p
FB 2D F8 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70
└─────────────────────────────┬─────────────────────────────────┘
Nested string element
Delimited encoding
Delimited encoding of an empty S-expression (())
┌──── Opcode 0xF1 indicates a delimited S-expression
│ ┌─── Opcode 0xEF indicates the end of the most recently opened container
F1 EF
Delimited encoding of (1 2 3)
┌──── Opcode 0xF1 indicates a delimited S-expression
│ ┌─── Opcode 0xEF indicates the end of
│ │ the most recently opened container
F1 61 01 61 02 61 03 EF
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Delimited encoding of (1 (2) 3)
┌──── Opcode 0xF1 indicates a delimited S-expression
│ ┌─── Opcode 0xF1 begins a nested delimited S-expression
│ │ ┌─── Opcode 0xEF closes the most recently
│ │ │ opened delimited container: the nested S-expression.
│ │ │ ┌─── Opcode 0xEF closes the most recently opened (and
│ │ │ │ still open) delimited container: the outer S-expression.
│ │ │ │
F1 61 01 F1 61 02 EF 61 03 EF
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Tagless-Element S-Expressions
Opcode 0x5C indicates a tagless-element S-expression. This is a compact encoding for homogeneous collections where all elements have the same type.
The elements of the S-expression can be a primitive encoding or a macro-shape.
The opcode is followed by:
- One or more bytes describing the tagless type:
  - If the byte is in 0x00-0x47: only one byte (the opcode) is present; this is the macro address
  - If the byte is in 0x48-0x4F: it is followed by a FlexUInt to encode the entire macro address
  - If the byte is 0xF4: it is followed by a FlexUInt which encodes the entire macro address
  - For any other byte value: only one byte (the opcode) is present
- A FlexUInt length indicating the number of direct child values in the S-expression
- Each element encoded without the leading opcode or macro address
Tagless-element S-expression of integers ({#int8} 1 2 3 4)
┌──── Opcode 0x5C indicates a tagless-element S-expression
│ ┌─── Tagless type: 0x61 (int8)
│ │ ┌─── Length: FlexUInt 4 (4 elements)
│ │ │
5C 61 09 01 02 03 04
└────┬────┘
4 int8 values
Tagless-element S-expression with macro shape ({:point} (1 3) (1 4) (2 4))
┌──── Opcode 0x5C indicates a tagless-element sexp
│ ┌─── Tagless type: 0x05 (macro address 5, assuming :point is at address 5)
│ │ ┌─── Length: FlexUInt 3 (3 elements)
│ │ │ ┌─── First element: (1 3)
│ │ │ │ ┌─── Second element: (1 4)
│ │ │ │ │ ┌─── Third element: (2 4)
│ │ │ ┌────┴────┐ ┌────┴────┐ ┌────┴────┐
5C 05 07 61 01 61 03 61 01 61 04 61 02 61 04
└─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘
1 3 1 4 2 4
Tagless-element S-expression with macro-shape using length-prefixed E-expression ({:3} (1 3) (1 259))
┌──── Opcode 0x5C indicates a tagless-element sexp
│ ┌─── Tagless type opcode F4 with FlexUInt address 3
│ │ ┌─── Length: FlexUInt 2 (2 elements)
│ │ │ ┌─── First element: (1 3)
│ │ │ │ ┌─── Second element: (1 259)
│ ┌─┴─┐ │ ┌─────┴──────┐ ┌───────┴───────┐
5C F4 07 05 09 61 01 61 03 0B 61 01 62 03 01
│ └─┬─┘ └─┬─┘ │ └─┬─┘ └───┬──┘
│ 1 3 │ 1 259
└─ FlexUInt └─ FlexUInt
Length=4 Length=5
Encoding of null.sexp
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: sexp
│ │
8F 0B
Structs
Each field in the struct is encoded as a field name followed by an opcode-prefixed value. The encoding of field names depends on the current mode:
- SID Mode: Field names are encoded as FlexUInt symbol addresses.
- FlexSym Mode: Field names are encoded as FlexSyms.
All structs start in SID Mode, except those opened with opcodes 0xF3 and 0xFD, which start in FlexSym Mode.
0x8F 0x0C represents null.struct.
Length-prefixed encoding
If the high nibble of the opcode is 0xD_, it represents a struct. The lower nibble of the opcode
indicates how many bytes were used to encode all of its nested (field name, value) pairs. Opcode
0xD0 represents an empty struct.
warning
Opcode 0xD1 is illegal. Non-empty structs must have at least two bytes: a field name and a value.
If the struct's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFC opcode
to write a variable-length struct in SID Mode or the 0xFD opcode to write a variable-length struct in FlexSym Mode.
These opcodes are followed by a FlexUInt that indicates the byte length.
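A minimal Python sketch of a length-prefixed struct writer in SID Mode follows. The names are hypothetical; each field is supplied as a `(symbol_id, encoded_value)` pair whose value already includes its opcode, and the FlexUInt helper assumes the layout defined elsewhere in this specification.

```python
from typing import Iterable, Tuple

def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (layout assumed from this spec)."""
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def encode_struct_sid_mode(fields: Iterable[Tuple[int, bytes]]) -> bytes:
    # Each field is a FlexUInt symbol address followed by an opcode-prefixed value.
    body = b"".join(flex_uint(sid) + value for sid, value in fields)
    if len(body) <= 15:
        # A body length of 1 cannot occur: every field needs a name and a value.
        return bytes([0xD0 | len(body)]) + body
    return b"\xFC" + flex_uint(len(body)) + body
```

For example, passing the fields `(10, 61 01)` and `(11, 61 02)` produces `D6 15 61 01 17 61 02`.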
Length-prefixed encoding of an empty struct ({})
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode
D0
Length-prefixed encoding of {$10: 1, $11: 2}
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│ ┌─── Field name: FlexUInt 10 ($10)
│ │ ┌─── Field name: FlexUInt 11 ($11)
│ │ │
D6 15 61 01 17 61 02
└─┬─┘ └─┬─┘
1 2
Length-prefixed encoding of {$10: "variable length struct"} in SID Mode
┌───────────── Opcode `FC` indicates a struct with a FlexUInt length prefix (SID Mode)
│ ┌────────── Length: FlexUInt 25
│ │ ┌─────── Field name: FlexUInt 10 ($10)
│ │ │ ┌──── Opcode `F8` indicates a variable length string
│ │ │ │ ┌─ FlexUInt 22: the string is 22 bytes long
│ │ │ │ │ v a r i a b l e l e n g t h s t r u c t
FC 33 15 F8 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74
└─────────────────────────────┬─────────────────────────────────┘
UTF-8 bytes
Encoding of null.struct
┌──── Opcode 0x8F indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: struct
│ │
8F 0C
Delimited encoding
Opcode 0xF2 indicates the beginning of a delimited struct in SID Mode.
Opcode 0xF3 indicates the beginning of a delimited struct in FlexSym Mode.
Delimited structs are closed by putting the delimited container end opcode (0xEF) in the value position.
The field name immediately prior is discarded.
By convention, you should use $0, but the field name itself is of no consequence.
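The closing convention can be sketched in Python as follows. The names are hypothetical. Field names are given either as a symbol ID (an `int`) or, in FlexSym Mode only, as non-empty inline text (a `str`). The FlexUInt and FlexInt helpers, and the treatment of a FlexSym as a FlexInt whose negative values give the inline text length, are assumptions based on definitions elsewhere in this specification; empty inline text and symbol ID 0 in FlexSym Mode are not handled.

```python
def flex_uint(value: int) -> bytes:
    n = max(1, (value.bit_length() + 6) // 7)
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")

def flex_int(value: int) -> bytes:
    bits = (value if value >= 0 else ~value).bit_length() + 1  # includes sign bit
    n = (bits + 6) // 7
    encoded = (value << n) | (1 << (n - 1))
    return (encoded & ((1 << (8 * n)) - 1)).to_bytes(n, "little")

def encode_delimited_struct(fields, flexsym_mode: bool = False) -> bytes:
    out = bytearray([0xF3 if flexsym_mode else 0xF2])
    for name, value in fields:
        if isinstance(name, str):
            if not flexsym_mode or not name:
                raise ValueError("inline text requires FlexSym Mode and non-empty text")
            utf8 = name.encode("utf-8")
            out += flex_int(-len(utf8)) + utf8     # negative FlexSym: inline text follows
        else:
            out += flex_int(name) if flexsym_mode else flex_uint(name)
        out += value                               # opcode-prefixed value
    out += b"\x01\xEF"   # throwaway field name $0, then the container end opcode
    return bytes(out)
```

Because the terminator always costs two bytes, an empty struct is three bytes long in either mode.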
Delimited encoding of the empty struct ({}) in SID Mode
┌─── Opcode 0xF2 indicates the beginning of a delimited struct in SID Mode
│ ┌─── Throwaway field name: FlexUInt 0 ($0)
│ │ ┌─── Opcode 0xEF indicates the end of the delimited container
F2 01 EF
Delimited encoding of the empty struct ({}) in FlexSym Mode
┌─── Opcode 0xF3 indicates the beginning of a delimited struct in FlexSym Mode
│ ┌─── Throwaway field name: FlexSym 0 ($0)
│ │ ┌─── Opcode 0xEF indicates the end of the delimited container
F3 01 EF
note
It is much more compact to write 0xD0—the empty length-prefixed struct.
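A delimited struct can be written without knowing its length in advance, which the following sketch tries to show. The function name `write_delimited_struct_sid` is hypothetical, field values are assumed to be already encoded, and SIDs are assumed to be below 128 so that each fits in a one-byte FlexUInt.

```python
import io


def write_delimited_struct_sid(out, fields) -> None:
    """Stream a delimited SID Mode struct to the binary file-like object `out`.

    `fields` may be any iterable (including a generator) of
    (sid, encoded_value_bytes) pairs; nothing is buffered.
    """
    out.write(b"\xF2")  # begin delimited struct, SID Mode
    for sid, value in fields:
        if not 0 <= sid < 128:
            raise ValueError("this sketch only handles one-byte FlexUInt SIDs")
        out.write(bytes([(sid << 1) | 1]))  # one-byte FlexUInt field name
        out.write(value)
    out.write(b"\x01")  # throwaway field name $0
    out.write(b"\xEF")  # end-of-container opcode in value position


buf = io.BytesIO()
write_delimited_struct_sid(buf, [(10, b"\x61\x01")])
print(buf.getvalue().hex(" ").upper())
```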
Delimited encoding of {"foo": 1, $11: 2} in FlexSym Mode
┌─── Opcode 0xF3 indicates the beginning of a delimited struct in FlexSym Mode
│ ┌─ FlexSym -4 (3 UTF-8 bytes follow)
│ │ ┌─ FlexSym: 11 ($11)
│ │ │ ┌─── Throwaway field name
│ │ f o o │ │ ┌─── Opcode 0xEF indicates the end of the delimited container
F3 FB 66 6F 6F 61 01 17 61 02 01 EF
└──┬───┘ └─┬─┘ └─┬─┘
3 UTF-8 1 2
bytes
Mode Switching
Structs may switch between SID Mode and FlexSym Mode by placing a mode-switch opcode (0xEE) in the value position of a struct.
This causes the prior field name to be discarded (just like a NOP), and switches the struct from FlexSym Mode to SID Mode or from SID Mode to FlexSym Mode.
Mode-switching works with both prefixed and delimited structs. It costs 2 bytes to switch modes (one for the discarded field name, and one for the mode-switch opcode). It is possible, but not recommended, to switch back and forth between modes.
In SID Mode, each field name is a FlexUInt which is the SID.
In FlexSym Mode, each field name is a FlexSym.
note
The ability to switch modes exists to allow writer implementations to start in the more compact SID Mode and then switch to FlexSym Mode if needed without having to backtrack and rewrite all prior fields in the struct. If you know ahead of time that you will have a mix of inline text and SIDs, you should usually prefer to start a struct in FlexSym Mode rather than starting in SID Mode and switching later.
Switching to FlexSym Mode while encoding {$10: 1, foo: 2, $11: 3}
In this example, the writer switches to FlexSym Mode before encoding foo so it can write the UTF-8 bytes inline.
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│ ┌─── Field name: FlexUInt 10 ($10) [SID Mode]
│ │ ┌─── Throwaway field name
│ │ │ ┌─── Mode switch opcode 0xEE (switches to FlexSym Mode)
│ │ │ │ ┌─── FlexSym: -4 (3 UTF-8 bytes follow)
│ │ │ │ │ ┌─── Field name: FlexSym 11 ($11) [FlexSym Mode]
│ │ │ │ │ f o o │
DE 15 61 01 01 EE FB 66 6F 6F 61 02 17 61 03
└─┬─┘ └──┬───┘ └─┬─┘ └─┬─┘
1 3 UTF-8 2 3
bytes
Annotations
Annotations can be encoded either as symbol addresses or as text. One or more annotations is an annotation sequence.
It is illegal for an annotation sequence to appear before any of the following:
Annotation with Symbol Address
Opcode 0x58 indicates an annotation encoded as a symbol ID. A FlexUInt-encoded symbol address follows.
Encoding of $10::false
┌──── The opcode `0x58` indicates an annotation encoded as a symbol ID follows
│ ┌──── Annotation with symbol address: FlexUInt 10
58 15 6F
└── The annotated value: `false`
Annotation with Text
Opcode 0x59 indicates an annotation encoded as inline text. A FlexUInt length follows, then that many bytes of UTF-8 text.
Encoding of foo::false
┌──── The opcode `0x59` indicates an annotation encoded as text follows
│ ┌──── Length: FlexUInt 3 (3 bytes of UTF-8 text follow)
│ │ f o o
59 07 66 6F 6F 6F
└──┬───┘ └── The annotated value: `false`
3 UTF-8
bytes
Multiple Annotations
Multiple annotations are encoded by repeating the annotation opcodes in sequence.
Encoding of $10::foo::false
┌──── First annotation: symbol ID
│ ┌──── Symbol address: FlexUInt 10 ($10)
│ │ ┌──── Second annotation: text
│ │ │ ┌──── Length: FlexUInt 3
│ │ │ │ f o o
58 15 59 07 66 6F 6F 6F
└──┬───┘ └── The annotated value: `false`
3 UTF-8
bytes
Encoding of $10::$11::$12::false
┌──── First annotation: symbol ID 10
│ ┌──── Second annotation: symbol ID 11
│ │ ┌──── Third annotation: symbol ID 12
┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌──── Annotated value: `false`
58 15 58 17 58 19 6F
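The annotation encodings above can be reproduced with a short sketch. The function name `encode_annotations` is hypothetical, and the FlexUInt helper assumes the layout defined elsewhere in this specification. An integer is treated as a symbol address (opcode 0x58) and a string as inline text (opcode 0x59).

```python
def flex_uint(n: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (assumed layout)."""
    size = 1
    while n >= 1 << (7 * size):
        size += 1
    return ((n << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_annotations(annotations) -> bytes:
    """Encode an annotation sequence; the annotated value must follow it."""
    out = bytearray()
    for annotation in annotations:
        if isinstance(annotation, int):
            out += b"\x58" + flex_uint(annotation)  # symbol address
        else:
            text = annotation.encode("utf-8")
            out += b"\x59" + flex_uint(len(text)) + text  # inline text
    return bytes(out)


FALSE = b"\x6F"  # the encoding of `false` used in the examples above
encoded = encode_annotations([10, "foo"]) + FALSE
print(encoded.hex(" ").upper())
```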
Directives
Directives are system values that modify the encoding context.
note
This chapter focuses on the binary encoding of directives. The Directives section explains what they are and how they are used.
The binary encoding of directives is a specialized delimited container.
Directives are opened using opcodes 0xE1 through 0xE8.
They may contain any number of tagged values, and they are closed with the end-container opcode (0xEF).
Directives may only occur at the top level of a stream and may not be annotated.
Directive Opcodes
| Opcode | Directive |
|---|---|
| 0xE1 | set_symbols |
| 0xE2 | add_symbols |
| 0xE3 | set_macros |
| 0xE4 | add_macros |
| 0xE5 | use |
| 0xE6 | module |
| 0xE7 | import |
| 0xE8 | encoding |
Examples
Encoding of (:$ion set_symbols "foo" "bar")
┌──── Opcode 0xE1 indicates a set_symbols directive
│ ┌─── String "foo"
│ │ ┌─── String "bar"
│ │ │ ┌─── End of directive
E1 93 66 6F 6F 93 62 61 72 EF
Encoding of (:$ion add_symbols "baz")
┌──── Opcode 0xE2 indicates an add_symbols directive
│ ┌─── String "baz"
│ │ ┌─── End of directive
E2 93 62 61 7A EF
Encoding of (:$ion import foo "com.amazon.Bar" 1)
┌──── Opcode 0xE7 indicates an import directive
│ ┌─── Symbol "foo" (assuming SID 10)
│ │ ┌─── String "com.amazon.Bar"
│ │ │ ┌─── Integer 1
│ │ │ │ ┌─── End of directive
E7 50 15 9E 63 6F 6D 2E 61 6D 61 7A 6F 6E 2E 42 61 72 61 01 EF
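As a rough sketch of the container shape described above (directive opcode, tagged values, then 0xEF), the symbol directives can be assembled like this. The helper names are hypothetical. The short-string opcode `0x90 | length` is an assumption inferred from the `93` and `9E` bytes in the examples, and the sketch rejects strings longer than 15 bytes rather than guessing at the longer form.

```python
SET_SYMBOLS = 0xE1
ADD_SYMBOLS = 0xE2
END_CONTAINER = 0xEF


def short_string(text: str) -> bytes:
    """Encode a string of at most 15 UTF-8 bytes (opcode form assumed)."""
    data = text.encode("utf-8")
    if len(data) > 0x0F:
        raise ValueError("this sketch only handles strings of 15 bytes or fewer")
    return bytes([0x90 | len(data)]) + data


def encode_symbols_directive(opcode: int, symbols) -> bytes:
    """Encode set_symbols or add_symbols with string arguments."""
    if opcode not in (SET_SYMBOLS, ADD_SYMBOLS):
        raise ValueError("expected the set_symbols or add_symbols opcode")
    body = b"".join(short_string(s) for s in symbols)
    return bytes([opcode]) + body + bytes([END_CONTAINER])


print(encode_symbols_directive(SET_SYMBOLS, ["foo", "bar"]).hex(" ").upper())
```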
Template Placeholders
note
This chapter focuses on the binary encoding of template placeholders. The Macros section explains what they are and how they are used.
Template placeholders are special constructs used within macro template bodies to indicate where macro arguments should be substituted.
They are encoded using opcodes 0xE9, 0xEA, and 0xEB.
Placeholders may only occur in value position, only within a macro body; they are illegal anywhere else.
Placeholder Types
| Opcode | Placeholder Type |
|---|---|
| 0xE9 | Tagged template placeholder (no default) |
| 0xEA | Tagged template placeholder (with default) |
| 0xEB | Tagless template placeholder |
Tagged Template Placeholder with No Default
Opcode 0xE9 indicates a tagged template placeholder with no default value. No additional bytes follow.
Encoding of (:?)
┌──── Opcode 0xE9 indicates a tagged template placeholder with no default
E9
Encoding of foo::(:?)
┌── Annotation text: foo
┌─────┴──────┐ ┌──── Opcode 0xE9 indicates a tagged template placeholder with no default
59 07 66 6F 6F E9
Tagged Template Placeholder with Default
Opcode 0xEA indicates a tagged template placeholder with a default value.
The default value follows and may be any value or e-expression that produces a value.
A NOP is legal, and ignored.
Encoding of (:? "foo")
┌──── Opcode 0xEA indicates a tagged template placeholder with default
│ ┌─── Default value: string "foo"
EA 93 66 6F 6F
Encoding of (:? 42)
┌──── Opcode 0xEA indicates a tagged template placeholder with default
│ ┌─── Default value: integer 42
EA 61 2A
Encoding of (:? 42) with NOP
┌──── Opcode 0xEA indicates a tagged template placeholder with default
│ ┌─── NOP
│ │ ┌─── Default value: integer 42
EA EC 61 2A
Encoding of (:? $10::false)
┌──── Opcode 0xEA indicates a tagged template placeholder with default
│
│ ┌──── Annotation SID: $10
│ ┌─┴─┐ ┌─── The annotated value: `false`
EA 58 15 6F
└──┬───┘
The default value: `$10::false`
Tagless Template Placeholder
Opcode 0xEB indicates a tagless template placeholder.
A single byte follows indicating the tagless scalar type that the argument must conform to.
No additional bytes follow.
Encoding of (:? {#int8})
┌──── Opcode 0xEB indicates a tagless template placeholder
│ ┌─── Tagless scalar type: int8 (0x61)
EB 61
Encoding of (:? {#uint32})
┌──── Opcode 0xEB indicates a tagless template placeholder
│ ┌─── Tagless scalar type: uint32 (0xE4)
EB E4
Encoding of (:? {#string})
┌──── Opcode 0xEB indicates a tagless template placeholder
│ ┌─── Tagless scalar type: string (0xF9)
EB F9
Encoding Expressions
note
This chapter focuses on the binary encoding of e-expressions. The Macros section explains what they are and how they are used.
E-expression with the address in the opcode
Opcodes 0x00-0x47 are single byte macro addresses.
If the value of the opcode is less than 72 (0x48), it represents an E-expression invoking the macro at the
corresponding address, which is an offset within the local macro table.
Invocation of macro address 7
┌──── Opcode in 00-47 range indicates an e-expression
│ where the opcode value is the macro address
│
07
└── FixedUInt 7
Invocation of macro address 31
┌──── Opcode in 00-47 range indicates an e-expression
│ where the opcode value is the macro address
│
1F
└── FixedUInt 31
Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section E-expression argument encoding.
E-expressions with extended addresses
Opcodes 0x48-0x4F are extensible macro addresses, with an offset of 72.
The opcodes 0x48 through 0x4F share the same 5 most-significant bits. The 3 least-significant bits are used as the
3 least-significant bits of the macro address.
The opcode is followed by a FlexUInt, which, once decoded, represents the most-significant bits of the macro address.
Finally, the offset of 72 is added.
Computing the macro address from the opcode and FlexUInt is straightforward and can be done with either bitwise or simple arithmetic operations.
// Given an `opcode` and `flexUInt`...
let lsb = opcode & 0b111 // or opcode - 0x48
let msb = flexUInt << 3 // or flexUInt * 8
let macroId = (msb | lsb) + 72 // or msb + lsb + 72
The reverse transformation is also simple:
// Given `macroId` (72 or greater)...
let offset = macroId - 72 // remove the offset of 72 first
let opcode = 0x48 | (offset & 0b111) // or 0x48 + (offset % 8)
let flexUInt = offset >>> 3 // or offset / 8 (integer division)
| Macro Address Range | Opcode Range | Encoded size in bytes, including opcode |
|---|---|---|
| 0..71 | 0x00-0x47 | 1 |
| 72..1095 | 0x48-0x4F | 2 |
| 1096..131143 | 0x48-0x4F | 3 |
| 131144..16777287 | 0x48-0x4F | 4 |
This table stops at 16777287, but the encoding does not impose any limit on the number of macro addresses. Practically, Ion implementations will have limits based on the programming language and the runtime environment used.
Invocation of macro address 287
To encode macro address 287:
- Subtract 72 to get 215 (0b11010111).
- Take the 3 least-significant bits (111) and add them to 0x48 (0b01001000) to get 0x4F (0b01001111). The opcode will be 0x4F.
- Shift 215 right by 3 (discarding the 3 least-significant bits), and then encode the result (26) as a FlexUInt. The FlexUInt encoding of 26 is 0x35.
┌──── Opcode in range 48-4F indicates a macro address with extended address.
│ Least-significant 3 bits are `111`
│ ┌──── FlexUInt 26
4F 35
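The address arithmetic for both opcode ranges can be combined into one runnable sketch. The function names are hypothetical, and the FlexUInt helpers assume the layout defined elsewhere in this specification. FlexUInts longer than 8 bytes are deliberately left out.

```python
def flex_uint(n: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (assumed layout)."""
    size = 1
    while n >= 1 << (7 * size):
        size += 1
    return ((n << size) | (1 << (size - 1))).to_bytes(size, "little")


def read_flex_uint(buf: bytes, pos: int) -> tuple:
    """Decode a FlexUInt at `pos`; returns (value, next_position)."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("FlexUInts over 8 bytes are not handled here")
    size = (first & -first).bit_length()  # trailing zero bits + 1
    return int.from_bytes(buf[pos:pos + size], "little") >> size, pos + size


def encode_macro_address(address: int) -> bytes:
    if address < 72:
        return bytes([address])  # the opcode itself is the address
    offset = address - 72
    # Low 3 bits go in the opcode; the remaining bits go in the FlexUInt.
    return bytes([0x48 | (offset & 0b111)]) + flex_uint(offset >> 3)


def decode_macro_address(buf: bytes, pos: int = 0) -> tuple:
    """Returns (macro_address, next_position)."""
    opcode = buf[pos]
    if opcode < 0x48:
        return opcode, pos + 1
    if opcode > 0x4F:
        raise ValueError("not a macro address opcode")
    high_bits, end = read_flex_uint(buf, pos + 1)
    return ((high_bits << 3) | (opcode & 0b111)) + 72, end


print(encode_macro_address(287).hex(" ").upper())
```

Encoding the boundary addresses from the table above is a quick way to check the encoded sizes it lists.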
Length-prefixed E-Expressions
The opcode F4 represents an E-expression with a FlexUInt macro address and a FlexUInt length prefix.
The length prefix indicates the number of argument bytes for the e-expression.
The encoding of the arguments themselves is covered in E-expression argument encoding.
┌──── Opcode F4 indicates FlexUInt address and FlexUInt length prefix
│ ┌──── FlexUInt 26
│ │ ┌──── FlexUInt 6
F4 35 0D __ __ __ __ __ __
└───────┬───────┘
6 argument bytes
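One practical consequence of the length prefix is that a reader can step over the whole invocation without knowing the macro's signature. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the FlexUInt layout is the one defined elsewhere in this specification, and argument bytes are treated as opaque.

```python
def flex_uint(n: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (assumed layout)."""
    size = 1
    while n >= 1 << (7 * size):
        size += 1
    return ((n << size) | (1 << (size - 1))).to_bytes(size, "little")


def read_flex_uint(buf: bytes, pos: int) -> tuple:
    """Decode a FlexUInt at `pos`; returns (value, next_position)."""
    first = buf[pos]
    if first == 0:
        raise NotImplementedError("FlexUInts over 8 bytes are not handled here")
    size = (first & -first).bit_length()  # trailing zero bits + 1
    return int.from_bytes(buf[pos:pos + size], "little") >> size, pos + size


def encode_length_prefixed_eexp(address: int, argument_bytes: bytes) -> bytes:
    return (b"\xF4" + flex_uint(address)
            + flex_uint(len(argument_bytes)) + argument_bytes)


def skip_length_prefixed_eexp(buf: bytes, pos: int = 0) -> int:
    """Return the position just past the e-expression starting at `pos`."""
    if buf[pos] != 0xF4:
        raise ValueError("expected opcode 0xF4")
    _address, pos = read_flex_uint(buf, pos + 1)
    length, pos = read_flex_uint(buf, pos)
    return pos + length  # arguments are skipped without being parsed


encoded = encode_length_prefixed_eexp(26, bytes(6))
print(encoded[:3].hex(" ").upper(), "then", len(encoded) - 3, "argument bytes")
```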
E-expression argument encoding
The example invocations in prior sections have demonstrated how to encode an invocation of the simplest form of macro: one with no parameters. This section explains how to encode macro invocations when they take parameters.
The encoding of E-Expression arguments follows the macro address (and length-prefix if present). For every placeholder in the macro template, there must be exactly one argument expression provided.
Tagged Arguments
When a macro parameter does not specify an encoding (the parameter name is not annotated), arguments passed to that parameter use the 'tagged' encoding. The argument begins with a leading opcode that dictates how to interpret the bytes that follow.
This is the same encoding used for values in other Ion 1.1 contexts like lists, s-expressions, or at the top level.
When invoking a template macro, the E-expression must have one argument for each parameter in the macro signature. Every argument must be exactly one value or explicitly an absent argument.
The absent argument is a special construct used in macro invocations to explicitly indicate that no value is provided for a particular parameter. The absent argument is distinct from NOP and serves a different purpose.
Opcode 0xE0 indicates an absent argument; no additional bytes follow.
Example – two tagged arguments
Given a macro definition:
(point { x: (:?), y: (:? 0) })
This macro has two tagged parameters, the second of which has a default value.
Encoding of (:point 2 3)
┌──── E-expression with macro address 0 (assuming point is macro 0)
│ ┌─── Argument 1: integer 2
│ │ ┌─── Argument 2: integer 3
00 61 02 61 03
This would expand to { x: 2, y: 3 }.
Encoding of (:point 5 (:))
┌──── E-expression with macro address 0 (assuming point is macro 0)
│ ┌─── Argument 1: integer 5
│ │ ┌─── Argument 2: absent argument
00 61 05 E0
This would expand to { x: 5, y: 0 } since the second argument is absent and y has a default value of 0.
Encoding of (:point (:) 10)
┌──── E-expression with macro address 0
│ ┌─── Argument 1: absent argument
│ │ ┌─── Argument 2: integer 10
00 E0 61 0A
This would expand to { y: 10 } since the first argument is absent and x has no default value.
An absent argument is encoded the same way regardless of whether the placeholder has a default value.
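The three `point` invocations above can be reproduced with a small sketch. All names are hypothetical. It assumes `point` is at macro address 0, that small integers use opcode 0x61 followed by one byte (as in the byte diagrams above), and that an absent argument is the single byte 0xE0 used in those diagrams. `None` stands in for `(:)`.

```python
ABSENT_ARGUMENT = b"\xE0"


def encode_small_int(n: int) -> bytes:
    """One-byte integer: opcode 0x61 plus a two's-complement byte (assumed)."""
    return b"\x61" + n.to_bytes(1, "little", signed=True)


def encode_tagged_invocation(address: int, args) -> bytes:
    """Encode an e-expression whose arguments are small ints or None (absent)."""
    if not 0 <= address < 72:
        raise ValueError("this sketch only handles single-byte macro addresses")
    out = bytearray([address])
    for arg in args:
        out += ABSENT_ARGUMENT if arg is None else encode_small_int(arg)
    return bytes(out)


def expand_point(x, y) -> dict:
    """Mimic the expansion of (point { x: (:?), y: (:? 0) })."""
    result = {}
    if x is not None:
        result["x"] = x  # no default: an absent x produces no field
    result["y"] = 0 if y is None else y  # an absent y falls back to its default
    return result


print(encode_tagged_invocation(0, [5, None]).hex(" ").upper(), expand_point(5, None))
```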
Tagless Arguments
In contrast to the tagged encoding, tagless encodings do not begin with an opcode. This means that they are potentially more compact than a tagged type, but are also less flexible. Because tagless encodings do not have an opcode, tagless arguments cannot have annotation sequences nor can the argument itself be absent.
Primitive encodings are self-delineating, either by having a statically known size in bytes or by including length information in their serialized form.
Given the following macro definition
(foo { foo: (:? {#int8}), bar: (:? {#int16}), baz: (:? {#string}) })
The text E-expression (:foo 1 2 "three") would be encoded like this:
┌──── Opcode 0x00 is less than 0x48; this is an e-expression
│ invoking the macro at address 0.
│ ┌─── First argument: 1-byte FixedInt 1
│ │ ┌─── Second argument: 2-byte FixedInt 2
│ │ │ ┌─── Third argument: length-prefixed string "three"
│ │ ┌─┴─┐ ┌───────┴───────┐
00 01 02 00 0B 74 68 72 65 65
│ └──────┬─────┘
│ └── 5 UTF-8 bytes
└──────────── FlexUInt (Length) 5
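A tagless argument list is just the concatenation of each argument's primitive encoding, as the following sketch tries to show for the `foo` macro above. The function name is hypothetical. It assumes `foo` is at macro address 0, that FixedInt is little-endian two's complement, and that a tagless string is a FlexUInt byte length followed by UTF-8 bytes.

```python
import struct


def flex_uint(n: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (assumed layout)."""
    size = 1
    while n >= 1 << (7 * size):
        size += 1
    return ((n << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_foo_invocation(foo: int, bar: int, baz: str) -> bytes:
    """Encode (:foo <int8> <int16> <string>) with tagless arguments."""
    text = baz.encode("utf-8")
    return (
        b"\x00"                        # macro address 0
        + struct.pack("<b", foo)       # int8: 1-byte FixedInt, no opcode
        + struct.pack("<h", bar)       # int16: 2-byte little-endian FixedInt
        + flex_uint(len(text)) + text  # string: length prefix, then UTF-8
    )


print(encode_foo_invocation(1, 2, "three").hex(" ").upper())
```

Because no opcode is written, `struct.pack` raises an error for an out-of-range value; there is no way to fall back to a wider encoding for that argument.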
NOPs
A NOP (short for "no-operation") is the binary equivalent of whitespace. NOP bytes have no meaning,
but can be used as padding to achieve a desired alignment.
An opcode of 0xEC indicates a single-byte NOP pad. An opcode of 0xED indicates that a
FlexUInt follows that represents the number of additional bytes to skip.
It is legal for a NOP to appear anywhere that a value can be encoded. It is not legal for a NOP to appear in
annotation sequences or struct field names. If a NOP appears in place of a struct field value, then the associated
field name is ignored; the NOP is immediately followed by the next field name, if any.
Encoding of a 1-byte NOP
┌──── The opcode `0xEC` represents a 1-byte NOP pad
│
EC
Encoding of a 4-byte NOP
┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows
│ ┌──── Length: FlexUInt 2; two more bytes of NOP follow
│ │
ED 05 93 C6
└─┬─┘
NOP bytes, values ignored
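A writer that needs a specific amount of padding, and a reader that needs to step over it, might look roughly like this. Both function names are hypothetical. The writer only emits one-byte FlexUInt lengths, so it splits large pads into several NOPs. The reader assumes the FlexUInt layout defined elsewhere in this specification.

```python
def nop_pad(total: int) -> bytes:
    """Produce exactly `total` bytes of NOP padding."""
    out = bytearray()
    while total > 0:
        if total == 1:
            out.append(0xEC)  # single-byte NOP
            total -= 1
        else:
            chunk = min(total, 129)  # opcode + 1-byte FlexUInt + up to 127 bytes
            skipped = chunk - 2
            out += b"\xED" + bytes([(skipped << 1) | 1]) + bytes(skipped)
            total -= chunk
    return bytes(out)


def skip_nops(buf: bytes, pos: int = 0) -> int:
    """Return the position of the first non-NOP byte at or after `pos`."""
    while pos < len(buf):
        if buf[pos] == 0xEC:
            pos += 1
        elif buf[pos] == 0xED:
            first = buf[pos + 1]
            if first == 0:
                raise NotImplementedError("FlexUInts over 8 bytes are not handled")
            size = (first & -first).bit_length()  # FlexUInt byte count
            length = int.from_bytes(buf[pos + 1:pos + 1 + size], "little") >> size
            pos += 1 + size + length  # the skipped bytes are never inspected
        else:
            break
    return pos


print(nop_pad(4).hex(" ").upper())
```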
Security considerations
The Ion 1.1 data format is orthogonal to many classes of attacks, such as privilege escalation and phishing attacks. Ion 1.1 is primarily susceptible to denial-of-service (DoS) attacks that attempt to cause an error condition in the receiving system. As with many such attacks, the strongest defense is to not accept any untrusted input, but that defense is not always compatible with the business requirements of the receiving application.
This document addresses various types of attacks, assuming that it is not possible to avoid accepting untrusted input.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Data expansion denial-of-service
An attacker could craft an input that is relatively small, but upon expansion, produces something thousands or millions of times larger.
For many use cases, the expansion of a template macro will grow linearly with the size of its input. However, it is possible to create macros with expansions that grow at greater rates.
For example, a Billion laughs attack could exist for any data format with macro expansion, and it is certainly possible with Ion 1.1.
(:$ion add_macros (lol0 "lol"))
(:$ion add_macros (lol1 [(:lol0), (:lol0), (:lol0), (:lol0), (:lol0), (:lol0), (:lol0), (:lol0), (:lol0), (:lol0),]))
(:$ion add_macros (lol2 [(:lol1), (:lol1), (:lol1), (:lol1), (:lol1), (:lol1), (:lol1), (:lol1), (:lol1), (:lol1),]))
(:$ion add_macros (lol3 [(:lol2), (:lol2), (:lol2), (:lol2), (:lol2), (:lol2), (:lol2), (:lol2), (:lol2), (:lol2),]))
// ...
(:lol9) // Could produce a billion "lol".
Implementations of Ion 1.1 MUST have some mechanism by which to mitigate data expansion attacks, such as by capping the size of individual macros or the macro table as a whole.
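One possible mitigation, sketched here purely as an illustration, is to count values as they are produced and abort once a configurable budget is exceeded. The macro table representation below is a deliberate simplification and not how an Ion implementation stores macros: each body is a flat list whose items are either literals or `("ref", name)` invocations, and bodies are assumed to reference only previously defined macros. The names `expand` and `ExpansionLimitExceeded` are hypothetical.

```python
class ExpansionLimitExceeded(Exception):
    """Raised when an expansion produces more values than the budget allows."""


def expand(name, macros, limit):
    """Lazily yield the values produced by macro `name`, up to `limit` values."""
    emitted = 0
    stack = [iter(macros[name])]  # explicit stack avoids deep recursion
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()
            continue
        if isinstance(item, tuple) and item[0] == "ref":
            stack.append(iter(macros[item[1]]))  # nested invocation
        else:
            emitted += 1
            if emitted > limit:
                raise ExpansionLimitExceeded(f"more than {limit} values produced")
            yield item


# A scaled-down version of the example above: lol3 produces 10**3 values.
macros = {"lol0": ["lol"]}
for level in range(1, 4):
    macros[f"lol{level}"] = [("ref", f"lol{level - 1}")] * 10

print(sum(1 for _ in expand("lol3", macros, limit=10_000)))
```

Because the values are produced lazily, the budget is enforced before the oversized result is ever materialized in memory.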
Data injection via shared modules
Applications are not required to use shared modules. If an application does use shared modules, it should take steps to ensure that shared modules come from a trusted source and use appropriate measures to prevent man-in-the-middle and other attacks that can compromise data while it is in transit.
In many cases, even if an application needs to accept Ion payloads from untrusted sources, it is possible to design a solution in which the shared modules are supplied by a trusted source. For example, in a service-oriented-architecture, the server can host shared modules so that the server does not have to trust the client. (However, this assumes that the client trusts the server.)
If shared modules must come from an untrusted source, then applications should take steps to ensure that the shared modules originate from the same source as the data that uses them, and they can be treated as if they are one composite piece of data from that source.
Arbitrary-sized values
The Ion specification places no limits on the size of Ion values, so an attacker could send a value large enough to consume the system resources of the application reading it and disrupt that application.
Even though the Ion specification does not have limits on the size of values, all real computer systems have finite resources, so all implementations will have limits in practice. Ion implementations MAY set limits on the maximum size of any Ion value for any available metric, including (but not limited to) number of bytes, number of codepoints, number of child values, digits of precision, or number of annotations. An implementation MAY allow limits to be configurable by an application that uses the Ion implementation. Any limits imposed SHOULD be described in the public documentation of an Ion implementation, unless the limits are unknown and/or are dependent on the underlying runtime environment.
Symbol table and macro table inflation
An attacker could try to create an input that results in excessively large symbol and macro tables in the Ion reader that could exhaust the memory of the receiving system and lead to a denial of service.
Although Ion 1.1 does not specify a maximum size for symbol tables or macro tables, Ion implementations MAY impose upper bounds on the size of symbol tables, macro tables, module bindings, and any other direct or indirect component of the encoding context. An implementation MAY allow limits to be configurable by an application that uses the Ion implementation. Any limits imposed SHOULD be described in the public documentation of an Ion implementation, unless the limits are unknown and/or are dependent on the underlying runtime environment.
Grammar
This chapter presents Ion 1.1's domain grammar, by which we mean the grammar of the domain of values that drive Ion's encoding features.
We use a BNF-like notation for describing various syntactic parts of a document, including Ion data structures. In such cases, the BNF should be interpreted loosely to accommodate Ion-isms like commas, whitespace, and unconstrained ordering of struct fields.
For brevity, the following components are not explicitly defined in this grammar:
- value - any Ion value literal
- unannotated-identifier-symbol - an unannotated "identifier symbol" (as defined in the specification)
- unannotated-uint - an unannotated integer greater than or equal to zero
Documents
document ::= ivm segment*
ivm ::= '$ion_1_1'
any-value-encoding ::= value | e-expression | tagless-element-list | tagless-element-sexp
segment ::= any-value-encoding* directive?
directive ::= ivm | encoding-directive
Directives
encoding-directive ::= '(:$ion ' any-directive-content ')'
any-directive-content ::= set-symbols-content | add-symbols-content | set-macros-content | add-macros-content
| use-content | module-content | import-content | encoding-content
set-symbols-content ::= 'set_symbols' symbol-text*
add-symbols-content ::= 'add_symbols' symbol-text*
set-macros-content ::= 'set_macros' macro-definition*
add-macros-content ::= 'add_macros' macro-definition*
use-content ::= 'use' catalog-key
module-content ::= 'module' module-name macro-table? symbol-table?
import-content ::= 'import' module-name catalog-key
encoding-content ::= 'encoding' module-name*
Modules
catalog-key ::= catalog-name catalog-version?
catalog-name ::= string
catalog-version ::= unannotated-uint
module-name ::= unannotated-identifier-symbol
symbol-table ::= '(symbols' symbol-table-entry* ')'
symbol-table-entry ::= module-name | symbol-list
symbol-list ::= '[' symbol-text* ']'
symbol-text ::= symbol | string
macro-table ::= '(macros' macro-table-entry* ')'
macro-table-entry ::= macro-definition | module-name
Macro references
qualified-macro-ref ::= module-name '::' macro-ref
macro-ref ::= macro-name | macro-addr
macro-name ::= unannotated-identifier-symbol
macro-addr ::= unannotated-uint
any-macro-ref-encoding ::= macro-ref | qualified-macro-ref
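As an informal illustration of the macro reference productions above, the sketch below splits a reference into its optional module name and its name or address. It is a simplification under stated assumptions: the function name is hypothetical, identifier symbols are approximated by a regular expression, and only plain decimal addresses are recognized.

```python
import re

# Assumed approximations of unannotated-identifier-symbol and unannotated-uint.
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
DECIMAL_UINT = re.compile(r"0|[1-9][0-9]*")


def parse_macro_ref(text: str) -> tuple:
    """Parse any-macro-ref-encoding into (module_name_or_None, name_or_address)."""
    module, separator, ref = text.rpartition("::")
    if separator:
        if not IDENTIFIER.fullmatch(module):
            raise ValueError(f"invalid module name: {module!r}")
    else:
        module = None  # unqualified macro-ref
    if DECIMAL_UINT.fullmatch(ref):
        return module, int(ref)  # macro-addr
    if IDENTIFIER.fullmatch(ref):
        return module, ref  # macro-name
    raise ValueError(f"invalid macro reference: {ref!r}")


print(parse_macro_ref("foo::3"), parse_macro_ref("point"))
```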
Macro definitions
no-argument ::= '(:)'
macro-name-declaration ::= macro-name | 'null'
macro-definition ::= '(' macro-name-declaration value-with-placeholders ')' ; This prohibits "identity" macros
value-or-placeholder ::= value-with-placeholders | placeholder
list-with-placeholders ::= '[' value-or-placeholder* ']' ; Note: for brevity, this grammar avoids comma handling
sexp-with-placeholders ::= '(' value-or-placeholder* ')'
field-value-or-placeholder ::= symbol-text ':' value-or-placeholder
struct-with-placeholders ::= '{' field-value-or-placeholder* '}'
argument-or-placeholder ::= value-with-placeholders | no-argument | placeholder
e-exp-with-placeholders ::= '(:' any-macro-ref-encoding argument-or-placeholder* ')'
value-with-placeholders ::= value | list-with-placeholders | sexp-with-placeholders
| struct-with-placeholders | e-exp-with-placeholders
primitive-encoding-type ::= 'int' | 'int8' | 'int16' | 'int32' | 'int64'
| 'uint' | 'uint8' | 'uint16' | 'uint32' | 'uint64'
| 'float16' | 'float32' | 'float64'
| 'small_decimal'
| 'timestamp_day' | 'timestamp_min' | 'timestamp_s'
| 'timestamp_ms' | 'timestamp_us' | 'timestamp_ns'
| 'symbol'
primitive-type-marker ::= '{#' primitive-encoding-type '}'
macro-type-marker ::= '{:' any-macro-ref-encoding '}'
any-type-marker-encoding ::= primitive-type-marker | macro-type-marker
placeholder ::= tagged-placeholder | tagged-placeholder-default | tagless-placeholder
tagged-placeholder ::= '(:?)'
tagged-placeholder-default ::= '(:? ' any-value-encoding ')'
tagless-placeholder ::= '(:? ' primitive-type-marker ')'
tagless-element-list ::= '[' any-type-marker-encoding value* ']'
tagless-element-sexp ::= '(' any-type-marker-encoding value* ')'
E-expressions
argument ::= any-value-encoding | no-argument
e-expression ::= '(:' any-macro-ref-encoding argument* ')'
Glossary
active encoding module
An encoding module whose symbol table and macro table are available in the current segment of an Ion document.
The sequence of active encoding modules is set by an encoding directive.
argument
The sub-expression(s) within a macro invocation, corresponding to exactly one of the macro's parameters.
declaration
The association of a name with an entity (for example, a module or macro). See also definition.
Not all declarations are definitions: some introduce new names for existing entities.
definition
The specification of a new entity.
directive
A keyword or unit of data in an Ion document that affects the encoding environment, and thus the way the document's data is encoded and decoded.
In Ion 1.0 there are two directives: Ion version markers, and the symbol table directives.
Ion 1.1 adds encoding directives.
document
A stream of octets conforming to either the Ion text or binary specification.
Can consist of multiple segments, perhaps using varying versions of the Ion specification.
A document does not necessarily exist as a file, and is not necessarily finite.
E-expression
See encoding expression.
encoding directive
In an Ion 1.1 segment, a top-level E-expression that invokes the implicit $ion macro.
Defines a new encoding module sequence for the segment immediately following it.
The symbol table directive is effectively a less capable alternative syntax.
encoding environment
The context-specific data maintained by an Ion implementation while encoding or decoding data. In
Ion 1.0 this consists of the current symbol table; in Ion 1.1 this is expanded to also include the Ion
spec version, the current macro table, and a collection of available modules.
encoding expression
The invocation of a macro in encoded data, aka e-expression.
Starts with a macro reference denoting the function to invoke.
The Ion text format uses "smile syntax" (:macro ...) to denote e-expressions.
Ion binary devotes a large number of opcodes to e-expressions, so they can be compact.
encoding module
A module whose symbol table and macro table can be used directly in the user data stream.
encoding tag
A way of conveying the encoding of a value (i.e. its opcode), separated from the value itself.
expression
A serialized syntax element that may produce values.
Encoding expressions and values are both considered expressions, whereas NOP, comments, and IVMs, for example, are not.
inner module
A module that is defined inside another module and only visible inside the definition of that module.
Ion version marker
A keyword directive that denotes the start of a new segment encoded with a specific Ion version.
Also known as "IVM".
macro
A transformation function that accepts some number of streams of values, and produces a stream of values.
macro definition
Specifies a macro in terms of a template.
module
The data entity that defines and exports both symbols and macros.
opcode
A 1-byte, unsigned integer that tells the reader what the next expression represents
and how the bytes that follow should be interpreted.
placeholder
A special-purpose encoding expression that is replaced by a macro argument when evaluating the expansion of a template.
qualified macro reference
A macro reference that consists of a module name and either a macro name exported by that module,
or a numeric address within the range of the module's exported macro table. In text, these look
like :module-name::name-or-address.
segment
A contiguous partition of a document that uses the same encoding module sequence.
Segment boundaries are caused by directives: an IVM, set_symbols, add_symbols, set_macros, add_macros, use, and encoding directives end segments (with a new one starting immediately afterward).
The import and module directives can also end a segment if they are redefining a module binding that was in the encoding module sequence.
shared module
A module that exists independent of the data stream of an Ion document. It is identified by a
name and version so that it can be imported by other modules.
signature
The part of a macro definition that specifies its "calling convention", in terms of the shape and type of arguments it accepts.
The signature is implicit in a macro definition; it is derived from the placeholders that are in the template.
symbol table directive
A top-level struct annotated with $ion_symbol_table. Defines a new encoding environment
without any macros. Valid in Ion 1.0. In Ion 1.1, this is effectively a no-op because it has been replaced by the add_symbols, set_symbols, and use directives.
system module
A standard module named $ion that is provided by the Ion implementation, implicitly installed so
that the system symbols are available at all points within a document.
Subsumes the functionality of the Ion 1.0 system symbol table.
system symbol
A symbol provided by the Ion implementation via the system module $ion.
System symbols are available at all points within an Ion document, though the selection of symbols
varies by segment according to its Ion version.
tagless-element sequence
A list or s-expression that has homogeneous elements, allowing the type descriptor of the elements to be lifted into the
container's definition for more compact representation of the child values.
template
The part of a macro definition that expresses its transformation of inputs to results.
unqualified macro reference
A macro reference that consists of either a macro name or numeric address, without a qualifying module name.
These are resolved using lexical scope and must always be unambiguous.