This is a draft specification of Ion 1.1, a new minor version of the Ion serialization format.
Status
This document is a working draft and is subject to change.
Audience
This document presents the formal specification for the Ion 1.1 data format. It is not intended to be used as a user guide or as a cookbook, but as a reference to the syntax and semantics of the Ion data format and its logical data model.
What's New in Ion 1.1
This section gives a high-level overview, from an implementer's perspective, of what is new and different in Ion 1.1 relative to Ion 1.0.
Motivation
Ion 1.1 has been designed to address some of the trade-offs in Ion 1.0 to make it suitable for a wider range of applications. Ion 1.1 makes length prefixing of containers optional, and makes the interning of symbolic tokens optional as well. This gives more flexibility to applications that write data more often than they read it, or whose writers are constrained in some way. Data density is another motivation. Certain encodings (e.g., timestamps, integers) have been made more compact and efficient, but more significantly, macros now enable applications to have very flexible interning of their data's structure. In aggregate, data transcoded from Ion 1.0 to Ion 1.1 should be more compact.
Backwards compatibility
Ion 1.1 is backwards compatible with Ion 1.0. While their encodings are distinct, they share the same data model--any data that can be produced and read by an application in Ion 1.1 has an equivalent representation in Ion 1.0.
Ion 1.1 is not required to preserve Ion 1.0 binary encodings in Ion 1.1 encoding contexts (i.e., the type codes and lower-level encodings are not preserved in the new version). The Ion Version Marker (IVM) is used to denote the different versions of the syntax. Ion 1.1 does retain text compatibility with Ion 1.0 in that the changes are a strict superset of the grammar; however, due to the updated system symbol table, symbol IDs referred to using the $n syntax for symbols beyond the 1.0 system symbol table are not compatible.
Text syntax changes
Ion 1.1 text must use the $ion_1_1 version marker at the top-level of the data stream or document.
The only syntax change for the text format is the introduction of encoding expression (E-expression) syntax, which allows for the invocation of macros in the data stream. This syntax is grammatically similar to S-expressions, except that these expressions are opened with (: and closed with ). For example, (:a 1 2) would expand the macro named a with the arguments 1 and 2. See the Macros, templates, and encoding expressions section for details.
This syntax is allowed anywhere an Ion value is allowed:
E-expression examples
// At the top level
(:foo 1 2)
// Nested in a list
[1, 2, (:bar 3 4)]
// Nested in an S-expression
(cons a (:baz b))
// Nested in a struct
{c: (:bop d)}
E-expressions are also grammatically allowed in the field name position of a struct and when used there, indicate that the expression should expand to a struct value that is merged into the enclosing struct:
E-Expression in field position of struct
{
a:1,
b:2,
(:foo 1 2),
c: 3,
}
In the above example, the E-expression (:foo 1 2) must evaluate into a struct that will be merged between the b field and the c field. If it does not evaluate to a struct, then the above is an error.
Binary encoding changes
Ion 1.1 binary encoding reorganizes the type descriptors to support compact E-expressions, making certain encodings more compact and certain lower-priority encodings marginally less compact. The IVM for this encoding is the octet sequence 0xE0 0x01 0x01 0xEA.
Inlined symbolic tokens
In binary Ion 1.0, symbol values, field names, and annotations are required to be encoded using a symbol ID in the local symbol table. For some use cases (e.g., write-once, read-maybe logs) this creates a burden on the writer and may not actually be efficient for an application. Ion 1.1 introduces optional binary syntax for encoding inline UTF-8 sequences for these tokens, which gives an encoder flexibility in whether and when to add a given symbolic token to the symbol table.
Ion text requires no change for this feature as it already had inline symbolic tokens without using the local symbol table. Ion text also has compatible syntax for representing the local symbol table and encoding of symbolic tokens with their position in the table (i.e., the $id syntax).
Delimited containers
In Ion 1.0, all data is length prefixed. While this is good for optimizing the reading of data, it requires an Ion encoder to buffer any data in memory to calculate the data's length. Ion 1.1 introduces optional binary syntax to allow containers to be encoded with an end marker instead of a length prefix.
Low-level binary encoding changes
Ion 1.0's VarUInt and VarInt encoding primitives used big-endian byte order and used the high bit of each byte to indicate whether it was the final byte in the encoding. VarInt used an additional bit in the first byte to represent the integer's sign. Ion 1.1 replaces these primitives with more optimized versions called FlexUInt and FlexInt.
FlexUInt and FlexInt use little-endian byte order, avoiding the need for reordering on common architectures like x86, aarch64, and RISC-V.
Rather than using a bit in each byte to indicate the width of the encoding, FlexUInt and FlexInt front-load the continuation bits. In most cases, this means that these bits all fit in the first byte of the representation, allowing a reader to determine the complete size of the encoding without having to inspect each byte individually.
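As a rough illustration of front-loaded continuation bits, the Python sketch below assumes a layout in which an N-byte encoding stores the value shifted left by N bits, with bit N-1 set and the N-1 bits below it cleared, written in little-endian order. This layout is an assumption made for illustration only; the binary encoding chapter is normative. The function names `encode_flex_uint` and `decode_flex_uint` are hypothetical.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer using the assumed FlexUInt layout."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    # Each encoded byte carries 7 payload bits plus 1 bit of width marker.
    num_bytes = 1
    while value >= 1 << (7 * num_bytes):
        num_bytes += 1
    # Low-order bits: (num_bytes - 1) zeros followed by a single 1 bit.
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def decode_flex_uint(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for an encoding at the start of data."""
    zero_bits = 0
    for byte in data:
        if byte == 0:
            zero_bits += 8
            continue
        # Count trailing zero bits of the first non-zero byte.
        zero_bits += (byte & -byte).bit_length() - 1
        break
    else:
        raise ValueError("no width marker found")
    num_bytes = zero_bits + 1
    if len(data) < num_bytes:
        raise ValueError("truncated encoding")
    marked = int.from_bytes(data[:num_bytes], "little")
    return marked >> num_bytes, num_bytes


print(encode_flex_uint(0).hex())    # 01
print(encode_flex_uint(128).hex())  # 0202
```

Under this assumed layout, the decoder learns the total width from the low-order bits of the leading byte (or bytes) before touching the payload, which is the property the paragraph above describes.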
Finally, FlexInt does not use a separate bit to indicate its value's sign. Instead, it uses two's complement representation, allowing it to share much of the same structure and parsing logic as its unsigned counterpart. Benchmarks have shown that in aggregate, these encoding changes are between 1.25x and 3x faster than Ion 1.0's VarUInt and VarInt encodings depending on the host architecture.
Ion 1.1 supplants Ion 1.0's Int encoding primitive with a new encoding called FixedInt, which uses two's complement notation instead of sign-and-magnitude. A corresponding FixedUInt primitive has also been introduced; its encoding is the same as Ion 1.0's UInt primitive.
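The difference between the two signed representations can be seen with Python's built-in `int.to_bytes`. This is a minimal sketch: the helper names are hypothetical, the little-endian byte order used for the two's complement form is an assumption, and the sign-and-magnitude helper places the sign in the high bit of the first (big-endian) byte as Ion 1.0's Int does.

```python
def twos_complement(value: int, width: int) -> bytes:
    """Two's complement bytes of `value` (little-endian order is assumed)."""
    return value.to_bytes(width, "little", signed=True)


def sign_and_magnitude(magnitude: int, negative: bool, width: int) -> bytes:
    """Big-endian magnitude with the sign stored in the high bit of the first byte."""
    sign_bit = 1 << (8 * width - 1)
    if not 0 <= magnitude < sign_bit:
        raise OverflowError("magnitude does not fit in the requested width")
    return (magnitude | (sign_bit if negative else 0)).to_bytes(width, "big")


print(twos_complement(-1, 1).hex())           # ff
print(sign_and_magnitude(1, True, 1).hex())   # 81
print(sign_and_magnitude(0, True, 1).hex())   # 80 (negative zero)
```

Note that sign-and-magnitude has a distinct bit pattern for negative zero, while two's complement does not.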
A new primitive encoding type, FlexSym, has been introduced to flexibly encode symbol IDs and symbolic tokens with inline text.
Type encoding changes
All Ion types use the new low-level encodings as specified in the previous section. Many of the opcodes used in Ion 1.0 have been re-organized primarily to make E-expressions compact.
Typed null values are now encoded in two bytes using the 0xEB opcode.
Lists and S-expressions have two encodings: a length-prefixed encoding and a new delimited form that ends with the 0xF0 opcode.
Struct values have the option of encoding their field names as a FlexSym, enabling them to write field name text inline instead of adding all names to the symbol table. There is now also a delimited form.
Similarly, symbol values now also have the option of encoding their symbol text inline.
Annotation sequences are a prefix to the value they decorate, and no longer have an outer length container. They are now encoded with one of three opcodes:
- 0xE7, which is followed by a single annotation and then the decorated value.
- 0xE8, which is followed by two annotations and then the decorated value.
- 0xE9, which is followed by a FlexUInt indicating the number of bytes used to encode the annotations sequence, the sequence itself, and then the decorated value.
The latter encoding is similar to how Ion 1.0 annotations are encoded with the exception that there is no outer length in addition to the annotations sequence length.
Integers now use a FixedInt sub-field instead of the Ion 1.0 encoding which used sign-and-magnitude (with two opcodes).
Decimals are structurally identical to their Ion 1.0 counterpart with the exception of the negative zero coefficient. The Ion 1.1 FlexInt encoding is two's complement, so negative zero cannot be encoded directly with it. Instead, an opcode is allocated specifically for encoding decimals with a negative zero coefficient.
Timestamps no longer encode their sub-field components as octet-aligned fields.
The Ion 1.1 format uses a packed bit encoding and has a biased form (encoding the year field as an offset from 1970) to make common encodings of timestamp easily fit in a 64-bit word for microsecond and nanosecond precision (with UTC offset or unknown UTC offset). Benchmarks have shown this new encoding to be 59% faster to encode and 21% faster to decode. A non-biased, arbitrary length timestamp with packed bit encoding is defined for uncommon cases.
Encoding expressions in binary
In binary, E-expressions are encoded with an opcode that includes the macro identifier or an opcode that specifies a FlexUInt for the macro identifier. The identifier is followed by the encoding of the arguments to the E-expression. The macro's definition statically determines how the arguments are to be laid out. An argument may be a full Ion value with a leading opcode (sometimes called a "tagged" value), or it could be a lower-level encoding (e.g., a fixed width integer or FlexInt/FlexUInt).
Macros, templates, and encoding expressions
Ion 1.1 introduces a new primitive called an encoding expression (E-expression). These expressions are (in text syntax) similar to S-expressions, but they are not part of the data model and are evaluated into one or more Ion values (called a stream) which enable compact representation of Ion data. E-expressions represent the invocation of either system defined or user defined macros with arguments that are either themselves E-expressions, value literals, or container constructors (list, sexp, struct syntax containing E-expressions) corresponding to the formal parameters of the macro's definition. The resulting stream is then expanded into the resulting Ion data model.
At the top level, the stream becomes individual top-level values. Consider for illustrative purposes an E-expression (:values 1 2 3) that evaluates to the stream 1, 2, 3 and (:none) that evaluates to the empty stream. In the following examples, values and none are the names of the macros being invoked and each line is equivalent.
Top-level e-expressions
// Encoding
a (:values 1 2 3) b (:none) c
// Evaluates to
a 1 2 3 b c
Within a list or S-expression, the stream becomes additional child elements in the collection.
E-expressions in lists
// Encoding
[a, (:values 1 2 3), b, (:none), c]
// Evaluates to
[a, 1, 2, 3, b, c]
E-expressions in S-expressions
// Encoding
(a (:values 1 2 3) b (:none) c)
// Evaluates to
(a 1 2 3 b c)
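The splicing behavior shown in the list and S-expression examples can be sketched in a few lines of Python. The `EExp` class and the `evaluate` and `expand_sequence` functions are hypothetical names introduced for illustration; only the two macros used in the examples above (`values` and `none`) are modeled, and Ion symbols are represented as Python strings.

```python
from dataclasses import dataclass


@dataclass
class EExp:
    """A stand-in for an E-expression: a macro name plus its arguments."""
    name: str
    args: tuple = ()


def evaluate(expr: EExp) -> list:
    """Evaluate an E-expression to a stream (modeled as a list) of values."""
    if expr.name == "values":
        return list(expr.args)
    if expr.name == "none":
        return []
    raise KeyError(f"unknown macro: {expr.name}")


def expand_sequence(items: list) -> list:
    """Splice each E-expression's stream into the sequence at its position."""
    out = []
    for item in items:
        if isinstance(item, EExp):
            out.extend(evaluate(item))  # zero or more child elements
        else:
            out.append(item)
    return out


encoded = ["a", EExp("values", (1, 2, 3)), "b", EExp("none"), "c"]
print(expand_sequence(encoded))  # ['a', 1, 2, 3, 'b', 'c']
```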
Within a struct at the field name position, the resulting stream must contain only structs, and each of the fields in those structs becomes a field in the enclosing struct (the value portion is not specified); at the value position, each value in the resulting stream becomes a field with the field name that precedes the E-expression (an empty stream elides the field altogether). In the following examples, let us define (:make_struct c 5) that evaluates to a single struct {c: 5}.
E-expressions in structs
// Encoding
{
a: (:values 1 2 3),
b: 4,
(:make_struct c 5),
d: 6,
e: (:none)
}
// Evaluates to
{
a: 1,
a: 2,
a: 3,
b: 4,
c: 5,
d: 6
}
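The struct rules above can be sketched as follows. This is an illustrative model, not an implementation of the specification: a struct is represented as a list of (field name, value) pairs so that repeated field names are preserved, an E-expression in field name position is represented by a pair whose name is `None`, and `EExp`, `evaluate`, and `expand_struct` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class EExp:
    """A stand-in for an E-expression: a macro name plus its arguments."""
    name: str
    args: tuple = ()


def evaluate(expr: EExp) -> list:
    """Evaluate an E-expression to a stream (modeled as a list) of values."""
    if expr.name == "values":
        return list(expr.args)
    if expr.name == "none":
        return []
    if expr.name == "make_struct":
        # A stream containing one struct with a single field.
        return [[(expr.args[0], expr.args[1])]]
    raise KeyError(f"unknown macro: {expr.name}")


def expand_struct(fields: list) -> list:
    out = []
    for name, value in fields:
        if name is None:
            # Field name position: every value in the stream must be a struct.
            for struct in evaluate(value):
                if not isinstance(struct, list):
                    raise TypeError("expansion in field name position must be a struct")
                out.extend(struct)
        elif isinstance(value, EExp):
            # Value position: one field per value; an empty stream elides the field.
            out.extend((name, v) for v in evaluate(value))
        else:
            out.append((name, value))
    return out


encoded = [
    ("a", EExp("values", (1, 2, 3))),
    ("b", 4),
    (None, EExp("make_struct", ("c", 5))),
    ("d", 6),
    ("e", EExp("none")),
]
print(expand_struct(encoded))
# [('a', 1), ('a', 2), ('a', 3), ('b', 4), ('c', 5), ('d', 6)]
```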
Encoding context and modules
In Ion 1.0, there is a single encoding context which is the local symbol table. In Ion 1.1, the encoding context becomes the following:
- The local symbol table which is a list of strings. This is used to encode/decode symbolic tokens.
- The local macro table which is a list of macros. This is used to reference macros that can be invoked by E-expressions.
- A mapping of a string name to module which is an organizational unit of symbol definitions and macro definitions. Within the encoding context, this name is unique and used to address a module's contents either as the list of symbols to install into the local symbol table, the list of macros to install into the local macro table, or to qualify the name of a macro in a text E-expression or the definition of a macro.
The module is a new concept in Ion 1.1. It contains:
- A list of strings representing the symbol table of the module.
- A list of macro definitions.
Modules can be imported from the catalog (they subsume shared symbol tables), but can also be defined locally. Modules are referenced as a group to allocate entries in the local symbol table and local macro table (e.g., the local symbol table is initially, implicitly allocated with the symbols in the $ion module).
Ion 1.1 introduces a new system value (an encoding directive) for the encoding context (see the TBD section for details.)
Ion encoding directive example
$ion_encoding::{
modules: [ /* module declarations - including imports */ ],
install_symbols: [ /* names of declared modules */ ],
install_macros: [ /* names of declared modules */ ]
}
Macro definitions
Macros can be defined by a user either directly in a local module within an encoding directive or in a module defined externally (i.e., shared module). A macro has a name which must be unique in a module or it may have no name.
Ion 1.1 defines a list of system macros that are built into the module named $ion. Unlike the system symbol table, which is always installed and accessible in the local symbol table, the system macros are always accessible to E-expressions but are not installed in the local macro table by default.
In Ion binary, macros are always addressed in E-expressions by the offset in the local macro table. System macros may be addressed by the system macro identifier using a specific encoding opcode. In Ion text, macros may be addressed by the offset in the local macro table (mirroring binary), by name if the name is unambiguous within the local encoding context, or by qualifying the macro name/offset with the module name in the encoding context. An E-expression can only refer to macros installed in the local macro table or a macro from the system module. In text, an E-expression referring to a system macro that is not installed in the local macro table must use a qualified name with the $ion module name.
For illustrative purposes let's consider the module named foo that has a macro named bar at offset 5 installed at the beginning of the local macro table.
E-expressions name resolution
// allowed if there are no other macros named 'bar'
(:bar)
// fully qualified by module--always allowed
(:foo:bar)
// by local macro table offset
(:5)
// In text, system macros are always addressable by name.
// In binary, system macros may be invoked using a separate
// opcode.
(:$ion:none)
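The three text addressing forms for locally installed macros can be modeled with a small lookup routine. This is a sketch under stated assumptions: the local macro table is represented as a list of (module name, offset within module, macro name) tuples, system macros are not modeled, and the `resolve` function and table layout are hypothetical.

```python
def resolve(table: list, ref) -> int:
    """Return the local macro table address that `ref` refers to.

    `ref` may be an int (local address), a str (unqualified name),
    or a (module, name_or_offset) pair (qualified reference).
    """
    if isinstance(ref, int):
        if 0 <= ref < len(table):
            return ref
        raise LookupError(f"no macro at address {ref}")
    if isinstance(ref, str):
        matches = [i for i, (_, _, name) in enumerate(table) if name == ref]
    else:
        module, target = ref
        matches = [
            i
            for i, (mod, offset, name) in enumerate(table)
            if mod == module and target in (offset, name)
        ]
    if len(matches) != 1:
        raise LookupError(f"{ref!r} is undefined or ambiguous")
    return matches[0]


# Module `foo` installed at the beginning of the table, with `bar` at offset 5.
table = [("foo", i, n) for i, n in enumerate(["m0", "m1", "m2", "m3", "m4", "bar"])]
print(resolve(table, "bar"))           # 5
print(resolve(table, ("foo", "bar")))  # 5
print(resolve(table, 5))               # 5
```

If a second installed module also defined a macro named bar, the unqualified lookup would fail while the qualified form would still succeed.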
Macro definition language
User defined macros are defined by their parameters and template which defines how they are invoked and what stream of data they evaluate to. This template is defined using a domain specific Ion macro definition language with S-expressions. A template defines a list of zero or more parameters that it can accept. These parameters each have their own cardinality of expression arguments which can be specified as exactly one, zero or one, zero or more, and one or more. Furthermore the template defines what type of argument can be accepted by each of these parameters:
- "Tagged" values, whose encodings always begin with an opcode.
- "Tagless" values, whose encodings do not begin with an opcode and are therefore both more compact and less flexible (for example: flex_int, int32, float16).
- Specific macro-shaped arguments to allow for structural composition of macros and efficient encoding in binary.
The macro definition includes a template body that defines how the macro is expanded. In the language, system macros, macros defined in previously defined modules in the encoding context, and macros defined previously in the current module are available to be invoked with (name ...) syntax where name is the macro to be invoked. Certain names in the expression syntax are reserved for special forms (for example, literal and if_none). When a macro name is shadowed by a special form, or is ambiguous with respect to all macros visible, it can always be qualified with (':module:name' ...) syntax where module is the name of the module and name is the offset or name of the macro. Referring to a previously defined macro name within a module may be qualified with (':name' ...) syntax.
Shared Modules
Ion 1.1 extends the concept of a shared symbol table to be a shared module. An Ion 1.0 shared symbol table is a shared module with no macro definitions. A new schema for the convention of serializing shared modules in Ion is introduced in Ion 1.1. An Ion 1.1 implementation should support containing Ion 1.0 shared symbol tables and Ion 1.1 shared modules in its catalog.
System Symbol Table Changes
The system symbol table in Ion 1.1 replaces the Ion 1.0 symbol table with new symbols. However, the system symbols are not required to be in the symbol table—they are always available to use.
Macros
Like other self-describing formats, Ion 1.0 makes it possible to write a stream with truly arbitrary content--no formal schema required. However, in practice all applications have a de facto schema, with each stream sharing large amounts of predictable structure and recurring values. This means that Ion readers and writers often spend substantial resources processing undifferentiated data.
Consider this example excerpt from a webserver's log file:
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.100",
resource: "index.html"
}
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.100",
resource: "images/funny.jpg"
}
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: ip_addr::"192.168.1.101",
resource: "index.html"
}
Macros allow users to define fill-in-the-blank templates for their data. This enables applications to focus on encoding and decoding the parts of the data that are distinctive, eliding the work needed to encode the boilerplate.
Using this macro definition:
(macro getOk (clientIp resource)
{
method: GET,
statusCode: 200,
status: "OK",
protocol: https,
clientIp: (.annotate "ip_addr" (%clientIp)),
resource: (%resource)
})
The same webserver log file could be written like this:
(:getOk "192.168.1.100" "index.html")
(:getOk "192.168.1.100" "images/funny.jpg")
(:getOk "192.168.1.101" "index.html")
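To make the fill-in-the-blank idea concrete, the getOk macro can be modeled as an ordinary function that returns the full record from its two distinctive inputs. This is only an analogy in Python, not how an Ion reader is implemented; the function name `get_ok` and the use of a dict and an (annotation, value) tuple to stand in for an Ion struct and an annotated string are assumptions.

```python
def get_ok(client_ip: str, resource: str) -> dict:
    """Expand the two variable parts into the full log record."""
    return {
        "method": "GET",
        "statusCode": 200,
        "status": "OK",
        "protocol": "https",
        "clientIp": ("ip_addr", client_ip),  # (annotation, value) pair
        "resource": resource,
    }


log = [
    get_ok("192.168.1.100", "index.html"),
    get_ok("192.168.1.100", "images/funny.jpg"),
    get_ok("192.168.1.101", "index.html"),
]
print(log[1]["resource"])  # images/funny.jpg
```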
Macros are an encoding-level concern, and their use in the data stream is invisible to consuming applications. For writers, macros are always optional--a writer can always elect to write their data using value literals instead.
For a guided walkthrough of what macros can do, see Macros by example.
Defining macros
A macro is defined using a macro clause within a module's macro_table clause.
Syntax
(macro name signature template)
Argument | Description |
---|---|
name | A unique name assigned to the macro or--to construct an anonymous macro--null. |
signature | An s-expression enumerating the parameters this macro accepts. |
template | A template definition language (TDL) expression that can be evaluated to produce zero or more Ion values. |
Example macro clause
//      ┌─── name
//      │     ┌─── signature
//     ┌┴┐ ┌──┴──┐
(macro foo (x y z)
  {           // ─┐
    x: (%x),  //  │
    y: (%y),  //  ├─ template
    z: (%z),  //  │
  }           // ─┘
)
Macro names
Syntactically, macro names are identifiers. Each macro name in a macro table must be unique.
In some circumstances, it may not make sense to name a macro. (For example, when the macro is generated automatically.) In such cases, authors may set the macro name to null or null.symbol to indicate that the macro does not have a name. Anonymous macros can only be referenced by their address in the macro table.
Macro Parameters
A parameter is a named stream of Ion values. The stream's contents are determined by the macro's invocation. A macro's parameters are declared in the macro signature.
Each parameter declaration consists of three elements:
- A name
- An optional encoding
- An optional cardinality
Example parameter declaration
//      ┌─── encoding
//      │      ┌─── name
//      │      │┌─── cardinality
//  ┌───┴───┐  ││
    flex_uint::x*
Parameter names
A parameter's name is an identifier. The name is required; any non-identifier (including null, quoted symbols, $0, or a non-symbol) found in parameter-name position will cause the reader to raise an error.
All of a macro's parameters must have unique names.
Parameter encodings
In binary Ion, the default encoding for all parameters is tagged. Each argument passed into the macro from the callsite is prefixed by an opcode (or "tag") that indicates the argument's type and length.
Parameters may choose to specify an alternative encoding to make the corresponding arguments' binary representation more compact and/or fixed width. These "tagless" encodings do not begin with an opcode, an arrangement which saves space but also limits the domain of values they can each represent. Arguments passed to tagless parameters cannot be null, cannot be annotated, and may have additional range restrictions.
To specify an encoding, the parameter name is annotated with one of the following tokens:
Tagless encodings | Description |
---|---|
flex_int | Variable-width, signed int |
flex_uint | Variable-width, unsigned int |
int8 int16 int32 int64 | Fixed-width, signed int |
uint8 uint16 uint32 uint64 | Fixed-width, unsigned int |
float16 float32 float64 | Fixed-width float |
flex_symbol | FlexSym -encoded SID or text |
When writing text Ion, the declared encoding does not affect how values are serialized. However, it does constrain the domain of values that the parameter will accept. When transcribing from text to binary, it must be possible to serialize all values passed as an argument using the parameter's declared encoding. This means that arguments passed to parameters with a primitive encoding cannot be annotated and cannot be a null of any type. If an int or a float is being passed to a parameter with a fixed-width encoding, that value must fit within the range of values that can be represented by that width. For example, the value 256 cannot be passed to a parameter with an encoding of uint8 because a uint8 can only represent values in the range [0, 255].
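The range restriction for fixed-width integer encodings follows directly from the bit width. The sketch below derives the bounds for the fixed-width integer encodings listed in the table above; the `INT_RANGES` table and `fits` helper are hypothetical names, and the float encodings are not covered.

```python
# Inclusive (min, max) bounds for each fixed-width integer encoding.
INT_RANGES = {}
for bits in (8, 16, 32, 64):
    INT_RANGES[f"int{bits}"] = (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    INT_RANGES[f"uint{bits}"] = (0, (1 << bits) - 1)


def fits(encoding: str, value: int) -> bool:
    """Return True if `value` can be represented by the named encoding."""
    low, high = INT_RANGES[encoding]
    return low <= value <= high


print(INT_RANGES["uint8"])  # (0, 255)
print(fits("uint8", 256))   # False
print(fits("int8", -128))   # True
```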
Parameter cardinalities
A parameter name may optionally be followed by a cardinality modifier. This is a sigil that indicates how many values the parameter expects the corresponding argument expression to produce when it is evaluated.
Modifier | Cardinality |
---|---|
? | zero-or-one value |
* | zero-or-more values |
! | exactly-one value |
+ | one-or-more values |
If no modifier is specified, the parameter's cardinality will default to exactly-one.
An exactly-one parameter will always expand to a stream containing a single value.
Parameters with a cardinality other than exactly-one are called variadic parameters.
If an argument expression expands to a number of values that the cardinality forbids, the reader must raise an error.
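A cardinality check of this kind can be sketched as a comparison of the expanded stream's length against per-modifier bounds. The `CARDINALITY` table and `check_cardinality` function are hypothetical names; the stream is modeled as a Python list.

```python
# (minimum, maximum) number of values per modifier; None means unbounded.
CARDINALITY = {
    "!": (1, 1),     # exactly-one
    "?": (0, 1),     # zero-or-one
    "*": (0, None),  # zero-or-more
    "+": (1, None),  # one-or-more
}


def check_cardinality(modifier: str, values: list) -> list:
    """Return `values` unchanged, or raise if the count is forbidden."""
    minimum, maximum = CARDINALITY[modifier]
    count = len(values)
    if count < minimum or (maximum is not None and count > maximum):
        raise ValueError(f"cardinality '{modifier}' forbids {count} value(s)")
    return values


print(check_cardinality("*", []))         # []
print(check_cardinality("+", [1, 2, 3]))  # [1, 2, 3]
```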
Optional parameters
Parameters with a cardinality that can accept an empty expression group as an argument (? and *) are called optional parameters. In text Ion, their corresponding arguments can be elided from e-expressions and TDL macro invocations when they appear in tail position. When an argument is elided, it is treated as though an explicit empty group (::) had been passed in its place.
In contrast, parameters with a cardinality that cannot accept an empty group (! and +) are called required parameters. Required parameters can never be elided.
(:set_macros
(foo (x y? z*) // `x` is required, `y` and `z` are optional
[(%x), (%y), (%z)]
)
)
// `z` is a populated expression group
(:foo 1 2 (:: 3 4 5)) => [1, 2, 3, 4, 5]
// `z` is an empty expression group
(:foo 1 2 (::)) => [1, 2]
// `z` has been elided
(:foo 1 2) => [1, 2]
// `y` and `z` have been elided
(:foo 1) => [1]
// `x` cannot be elided
(:foo) => ERROR: missing required argument `x`
Optional parameters that are not in tail position cannot be elided, as this would cause them to appear in a position corresponding to a different argument.
(:set_macros
(foo (x? y) // `x` is optional, `y` is required
[(%x), (%y)]
)
)
(:foo (::) 1) => [(::), 1] => [1]
(:foo 1) => ERROR: missing required argument `y`
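The tail-position rule amounts to binding arguments to parameters positionally and substituting an empty stream for any optional parameter left over at the end. A minimal sketch, assuming a signature is a list of (name, modifier) pairs and each argument is a list of values; `bind_arguments` is a hypothetical name.

```python
def bind_arguments(signature: list, args: list) -> dict:
    """Bind positional argument streams to parameter names."""
    if len(args) > len(signature):
        raise ValueError("too many arguments")
    bound = {}
    for position, (name, modifier) in enumerate(signature):
        if position < len(args):
            bound[name] = args[position]
        elif modifier in ("?", "*"):
            bound[name] = []  # elided: same as passing an empty group
        else:
            raise ValueError(f"missing required argument `{name}`")
    return bound


foo = [("x", "!"), ("y", "?"), ("z", "*")]
print(bind_arguments(foo, [[1], [2], [3, 4, 5]]))  # {'x': [1], 'y': [2], 'z': [3, 4, 5]}
print(bind_arguments(foo, [[1]]))                  # {'x': [1], 'y': [], 'z': []}
```

Because binding is positional, an optional parameter that precedes a required one cannot be skipped: the single argument would be bound to the first parameter and the required one would be reported missing.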
Macro signatures
A macro's signature is the ordered sequence of parameters which an invocation of that macro must define. Syntactically, the signature is an s-expression of parameter declarations.
Example macro signature
(w flex_uint::x* float16::y? z+)
Name | Encoding | Cardinality |
---|---|---|
w | tagged | exactly-one |
x | flex_uint | zero-or-more |
y | float16 | zero-or-one |
z | tagged | one-or-more |
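The table above can be reproduced mechanically by splitting each declaration into its optional encoding annotation, name, and optional cardinality modifier. The sketch below uses a regular expression over the text form purely for illustration; a real implementation would read the signature with an Ion parser. The names `PARAM`, `MODIFIERS`, and `parse_signature` are hypothetical.

```python
import re

MODIFIERS = {
    "!": "exactly-one",
    "?": "zero-or-one",
    "*": "zero-or-more",
    "+": "one-or-more",
}

# optional `encoding::`, then an identifier, then an optional modifier
PARAM = re.compile(
    r"(?:(?P<encoding>\w+)::)?(?P<name>[A-Za-z_$][\w$]*)(?P<modifier>[!?*+])?"
)


def parse_signature(text: str) -> list:
    """Return a list of (name, encoding, cardinality) tuples."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("a signature must be an s-expression")
    params = []
    for token in text[1:-1].split():
        match = PARAM.fullmatch(token)
        if match is None:
            raise ValueError(f"invalid parameter declaration: {token}")
        params.append((
            match["name"],
            match["encoding"] or "tagged",         # default encoding
            MODIFIERS[match["modifier"] or "!"],   # default cardinality
        ))
    names = [name for name, _, _ in params]
    if len(set(names)) != len(names):
        raise ValueError("parameter names must be unique")
    return params


for row in parse_signature("(w flex_uint::x* float16::y? z+)"):
    print(row)
```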
Template definition language (TDL)
The macro's template is a single Ion value that defines how a reader should expand invocations of the macro. Ion 1.1 introduces a template definition language (TDL) to express this process in terms of the macro's parameters. TDL is a small language with only a few constructs.
A TDL expression can be any of the following:
- A literal Ion scalar
- A macro invocation
- A variable expansion
- A quasi-literal Ion container
- A special form
In terms of its encoding, TDL is "just Ion." As you shall see in the following sections, the constructs it introduces are written as s-expressions with a distinguishing leading value or values.
A grammar for TDL can be found at the end of this chapter.
Ion scalars
Ion scalars are interpreted literally. These include values of any type except list, sexp, and struct. null values of any type (even null.list, null.sexp, and null.struct) are also interpreted literally.
Examples
These macros are constants; they take no parameters. When they are invoked, they expand to a stream of a single value: the Ion scalar acting as the template expression.
$ion_encoding::(
(macro_table
(macro greeting () "hello")
(macro birthday () 1996-10-11)
// Annotations are also literal
(macro price () USD::29.95)
)
)
(:greeting) => "hello"
(:birthday) => 1996-10-11
(:price) => USD::29.95
Macro invocations
Macro invocations call an existing macro. The invoked macro could be a system macro, a macro imported from a shared module, or a macro previously defined in the current scope.
Syntactically, a macro invocation is an s-expression whose first value is the operator . and whose second value is a macro reference.
Grammar
macro-invocation ::= '(.' macro-ref macro-arg* ')'
macro-ref ::= (module-name '::')? (macro-name | macro-address)
macro-arg ::= expression | expression-group
macro-name ::= ion-identifier
macro-address ::= unsigned-ion-integer
expression-group ::= '(..' expression* ')'
Invocation syntax illustration
// Invoking a macro defined in the same module by name.
(.macro_name arg1 arg2 /*...*/ argN)
// Invoking a macro defined in another module by name.
(.module_name::macro_name arg1 arg2 /*...*/ argN)
// Invoking a macro defined in the same module by its address.
(.0 arg1 arg2 /*...*/ argN)
// Invoking a macro defined in a different module by its address.
(.module_name::0 arg1 arg2 /*...*/ argN)
Examples
$ion_encoding::(
(macro_table
// Calls the system macro `values`, allowing it to produce a stream of three values.
(macro nephews () (.values Huey Dewey Louie))
// Calls a macro previously defined in this module, splicing its result
// stream into a list.
(macro list_of_nephews () [(.nephews)])
)
)
(:nephews) => Huey Dewey Louie
(:list_of_nephews) => [Huey, Dewey, Louie]
IMPORTANT: There are no forward references in TDL. If a macro definition includes an invocation of a name or address that is not already valid, the reader must raise an error.
$ion_encoding::(
(macro_table
(macro list_of_nephews () [(.nephews)])
// ^^^^^^^^
// ERROR: Calls a macro that has not yet been defined in this module.
(macro nephews () (.values Huey Dewey Louie))
)
)
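One way to picture the no-forward-references rule is a single pass over the macro table in which each definition may only invoke names that are already known. This is a sketch under assumptions: each definition is reduced to a (name, invoked names) pair, `SYSTEM_MACROS` is an illustrative subset rather than the real list of system macros, and `build_macro_table` is a hypothetical name.

```python
SYSTEM_MACROS = {"values", "none"}  # illustrative subset only


def build_macro_table(definitions: list) -> list:
    """Process definitions in order, rejecting forward references."""
    table = []
    for name, invoked in definitions:
        for ref in invoked:
            if ref not in SYSTEM_MACROS and ref not in table:
                raise LookupError(f"`{name}` invokes undefined macro `{ref}`")
        table.append(name)
    return table


# Valid: `nephews` is defined before `list_of_nephews` uses it.
print(build_macro_table([("nephews", ["values"]), ("list_of_nephews", ["nephews"])]))
# ['nephews', 'list_of_nephews']
```

Reversing the two definitions makes the same call raise, mirroring the error shown in the example above.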
Variable expansion
Templates can insert the contents of a macro parameter into their output by using a variable expansion, an s-expression whose first value is the operator % and whose second and final value is the variable name of the parameter to expand. If the variable name does not match one of the declared macro parameters, the implementation must raise an error.
Grammar
variable-expansion ::= '(%' variable-name ')'
variable-name ::= ion-identifier
Examples
$ion_encoding::(
(macro_table
// Produces a stream that repeats the content of parameter `x` twice.
(macro twice (x*) (.values (%x) (%x)))
)
)
(:twice foo) => foo foo
(:twice "hello") => "hello" "hello"
(:twice 1 2 3) => 1 2 3 1 2 3
Quasi-literal Ion containers
When an Ion container appears in a template definition, it is interpreted almost literally.
Each nested value in the container is inspected.
- If the value is an Ion scalar, it is added to the output as-is.
- If the value is a variable expansion, the stream bound to that variable name is added to the output. The variable expansion literal (for example: (%name)) is discarded.
- If the value is a macro invocation, the invocation is evaluated and the resulting stream is added to the output. The macro invocation literal (for example: (.name 1 2 3)) is discarded.
- If the value is a container, the reader will recurse into the container and repeat this process.
Expansion within a sequence
When the container is a list or s-expression, the values in the nested expression's expansion are spliced into the sequence at the site of the expression. If the expansion was empty, no values are spliced into the container.
$ion_encoding::(
(macro_table
(macro bookend_list (x y*) [(%x), (%y), (%x)])
(macro bookend_sexp (x y*) ((%x) (%y) (%x)))
)
)
(:bookend_list ! a b c) => ['!', a, b, c, '!']
(:bookend_sexp ! a b c) => (! a b c !)
(:bookend_sexp !) => (! !)
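The quasi-literal rules for sequences can be sketched as a small recursive function. The model is deliberately reduced: templates are Python values, a variable expansion is a `Var` object, sequences are Python lists, and macro invocations and structs are omitted. `Var` and `expand_template` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Var:
    """Stands in for a variable expansion such as (%x)."""
    name: str


def expand_template(template, env: dict) -> list:
    """Return the stream (a list) of values that `template` expands to."""
    if isinstance(template, Var):
        # An undeclared variable name raises KeyError.
        return list(env[template.name])
    if isinstance(template, list):
        spliced = []
        for child in template:
            spliced.extend(expand_template(child, env))  # splice in place
        return [spliced]  # a container expands to a single value
    return [template]  # scalars are added as-is


bookend_list = [Var("x"), Var("y"), Var("x")]
print(expand_template(bookend_list, {"x": ["!"], "y": ["a", "b", "c"]}))
# [['!', 'a', 'b', 'c', '!']]
print(expand_template(bookend_list, {"x": ["!"], "y": []}))
# [['!', '!']]
```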
Expansion within a struct
When the container is a struct, the expansion of each field value is paired with the corresponding field name. If the expansion produces a single value, a single field with that name will be spliced into the parent struct. If the expansion produces multiple values, a field with that name will be created for each value and spliced into the parent struct. If the expansion was empty, no fields are spliced into the parent struct.
Examples
$ion_encoding::(
(macro_table
(macro resident (id names*)
{
town: "Riverside",
id: (.make_string "123-" (%id)),
name: (%names)
}
)
)
)
(:resident "abc" "Alice") =>
{
town: "Riverside",
id: "123-abc",
name: "Alice"
}
(:resident "def" "John" "Jacob" "Jingleheimer" "Schmidt") =>
{
town: "Riverside",
id: "123-def",
name: "John",
name: "Jacob",
name: "Jingleheimer",
name: "Schmidt",
}
(:resident "ghi") =>
{
town: "Riverside",
id: "123-ghi",
}
Special forms
special-form ::= '(.' ('$ion::')? special-form-name expression* ')'
special-form-name ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi'
Special forms are similar to macro invocations, but they have their own expansion rules. See Special forms for the list of special forms and a description of each.
Note that unlike macro expansions, special forms cannot accept argument groups.
Macros by example
Before getting into the technical details of Ion’s macro and module system, it will help to be more familiar with the use of macros. We’ll step through increasingly sophisticated use cases, some admittedly synthetic for illustrative purposes, with the intent of teaching the core concepts and moving parts without getting into the weeds of more formal specification.
Ion macros are defined using a domain-specific language that is in turn expressed via the Ion data model. That is, macro definitions are Ion data, and use Ion features like S-expressions and symbols to represent code in a Lisp-like fashion. In this document, the fundamental construct we explore is the macro definition, denoted using an S-expression of the form (macro name …) where macro is a keyword and name must be a symbol denoting the macro's name.
NOTE: S-expressions of that shape only declare macros when they occur in the context of an encoding module. We will completely ignore modules for now, and the examples below omit this context to keep things simple.
Constants
The most basic macro is a constant:
(macro pi // name
() // signature
3.141592653589793) // template
This declaration defines a macro named `pi`. The `()` is the macro’s signature, in this case a trivial one that declares no parameters. The `3.141592653589793` is a similarly trivial template, an expression in Ion 1.1's domain-specific language for defining macro functions.
This macro accepts no arguments and always returns a constant value.
To use `pi` in an Ion document, we write an encoding expression or E-expression:
$ion_1_1
(:pi)
The syntax `(:pi)` looks a lot like an S-expression. It’s not, though, since colons cannot appear unquoted in that context. Ion 1.1 makes use of syntax that is not valid in Ion 1.0, specifically the `(:` digraph, to denote E-expressions. Those characters must be followed by a reference to a macro, and we say that the E-expression is an invocation of the macro. Here, `(:pi)` is an invocation of the macro named `pi`.
NOTE: We also call these “smile expressions” when we’re feeling particularly casual. (:
That document is equivalent to the following, in the sense that they denote the same data:
$ion_1_1
3.141592653589793
The process by which the Ion implementation turns the former document into the latter is called macro expansion or just expansion. This happens transparently to Ion-consuming applications: the stream of values is the same in both cases. The documents have the same content, encoded in two different ways. It’s reasonable to think of `(:pi)` as a custom encoding for `3.141592653589793`, and the notation’s similarity to S-expressions leads us to the term “encoding expression” (or "e-expression").
NOTE: Any Ion 1.1 document with macros can be fully expanded into an equivalent Ion 1.0 document.
We can streamline future examples with a couple of conventions. First, assume that any E-expression is occurring within an Ion 1.1 document; second, we use the relation notation, `⇒`, to mean “expands to”.
So we can say:
(:pi) ⇒ 3.141592653589793
Parameters and variable expansion
Most macros are not constant--they accept inputs that determine their results.
(macro passthrough
(x) // signature
(%x) // template
)
This macro has a signature that declares a parameter called `x`, and it therefore requires one argument to be passed in when it is invoked.
This creates a variable (i.e. named data) called `x` that can be referred to within the context of the template.
NOTE: We are careful to distinguish between the views from “inside” and “outside” the macro: parameters are the names used by a macro’s implementation to refer to its expansion-time inputs, while arguments are the data provided to a macro at the point of invocation. In other words, we have “formal” parameters and “actual” arguments.
The body of this macro is our first non-trivial template, an expression in Ion’s new domain-specific language for defining macro functions.
This template definition language (TDL) treats Ion scalar values as literals, giving the decimal in `pi`’s template its intended meaning.
In this example, the template expression `(%x)` is a variable expansion in the form `(%variable_name)`.
During macro evaluation, variable expansions are replaced by the contents of the referenced variable.
Because this macro's template is an expansion of its only parameter, `x`, invoking the macro will produce the same value it was given as an argument.
(:passthrough 1) => 1
(:passthrough "foo") => "foo"
(:passthrough [a, b, c]) => [a, b, c]
Simple Templates
Here's a more realistic macro:
(macro price
(a c) // signature
{ amount: (%a), currency: (%c) }) // template
This macro has a signature that declares two parameters named `a` and `c`. It therefore accepts two arguments when invoked.
(:price 99 USD) ⇒ { amount: 99, currency: USD }
Template expressions that are structs are interpreted almost literally; the field names are literal, which is why the `amount` and `currency` field names show up as-is in the expansion, but the field “values” are arbitrary expressions.
We call these almost-literal forms quasi-literals.
The template definition language also treats lists quasi-literally, and every element inside the list is an expression. Here’s a silly macro to illustrate:
(macro two_item_list (a b) [(%a), (%b)])
(:two_item_list foo bar) ⇒ [foo, bar]
E-expressions can accept other e-expressions as arguments. For example:
(:two_item_list (:price 99 USD) foo)
// └──────┬──────┘
// └─── passing another e-expression as an argument
Expansion happens from the "inside out". The outer e-expression receives the results from the expansion of the inner e-expression.
(:two_item_list (:price 99 USD) foo)
// First, the inner invocation of `price` is expanded...
=> (:two_item_list {amount: 99, currency: USD} foo)
// ...and then the outer invocation of `two_item_list` is expanded.
=> [{amount: 99, currency: USD}, foo]
Invoking Macros from Templates
Templates are able to invoke other macros.
In TDL, an s-expression starting with a `.` and an identifier is an operator invocation, where operators are either macros or special forms, which we'll explore later.
(macro website_url
(path)
(.make_string "https://www.amazon.com/" (%path)))
This macro's template is an s-expression beginning with `.make_string`, so it is an invocation of a macro called `make_string`.
`make_string` is a system macro (a built-in function) which concatenates its arguments to produce a single string.
(:website_url "gp/cart") ⇒ "https://www.amazon.com/gp/cart"
In TDL, it is legal for a macro invocation to appear anywhere that a value could appear.
In this example, an invocation of `make_string` is being passed as an argument to an invocation of `website_url`.
(macro detail_page_url
(asin)
(.website_url (.make_string "dp/" (%asin))))
(:detail_page_url "B08KTZ8249") ⇒ "https://www.amazon.com/dp/B08KTZ8249"
NOTE: This may not look like much of an improvement, but the full string `"https://www.amazon.com/dp/B08KTZ8249"` takes 38 bytes to encode while the macro invocation `(:detail_page_url "B08KTZ8249")` takes as few as 12 bytes in binary Ion. While text Ion spells out the macro name to be human-friendly, the binary Ion encoding uses the macro's integer address instead. Here's an illustration:
(:1 "B08KTZ8249")
This makes the e-expression both more compact and faster to decode. Readers can also avoid the cost of repeatedly validating the UTF-8 bytes of substrings that are 'baked into' the macro definition.
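The sizes quoted above can be related to the raw text lengths with some quick arithmetic. The sketch below only measures the UTF-8 text and subtracts it from the quoted totals; how the remaining bytes are spent (type tags, lengths, the macro address) is not derived here and is left as an assumption about the binary encoding.

```python
# Figures quoted in the note above.
FULL_STRING_ENCODED = 38  # bytes to encode the full URL as a string
INVOCATION_ENCODED = 12   # bytes for the binary macro invocation

full_url = "https://www.amazon.com/dp/B08KTZ8249"
asin = "B08KTZ8249"

url_text = len(full_url.encode("utf-8"))  # bytes of URL text
asin_text = len(asin.encode("utf-8"))     # bytes of ASIN text

# Bytes not accounted for by the text itself (framing overhead).
url_overhead = FULL_STRING_ENCODED - url_text
invocation_overhead = INVOCATION_ENCODED - asin_text

print(url_text, url_overhead)                    # 36 2
print(asin_text, invocation_overhead)            # 10 2
print(FULL_STRING_ENCODED - INVOCATION_ENCODED)  # 26 bytes saved per value
```

Most of the saving comes from the 26 bytes of URL prefix that live in the macro definition instead of in each encoded value.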
E-expressions Versus S-expressions
We've now seen two ways to invoke macros, and their difference deserves thorough exploration.
An E-expression is an encoding artifact of a serialized Ion document. It has no intrinsic meaning other than the fact that it represents a macro invocation. The meaning of the document can only be determined by expanding the macro, passing the E-expression's arguments to the function defined by the macro. This all happens as the Ion document is parsed, transparent to the reader of the document. In casual terms, E-expressions are expanded away before the application sees the data.
Within the template definition language, you can define new macros in terms of other macros, and those invocations are written as S-expressions. Unlike E-expressions, TDL macro invocations are normal Ion data structures, consumed by the Ion system and interpreted as TDL. Further, TDL macro invocations only have meaning in the context of a macro definition, inside an encoding module, while E-expressions can occur anywhere in an Ion document.
WARNING: It's entirely possible to write a macro that can generate all or part of a macro definition. We don't recommend that you spend time considering such things at this point.
These two invocation forms are syntactically aligned in their calling convention, but are distinct in context and "immediacy". E-expressions occur anywhere and are invoked immediately, as they are parsed. S-expression invocations occur only within macro definitions, and are only invoked if and when that code path is ever executed by invocation of the surrounding macro.
Rest Parameters
Sometimes we want a macro to accept an arbitrary number of arguments, in particular all the rest of them. The `make_string` macro is one of those, concatenating all of its arguments into a single string:
(:make_string) ⇒ ""
(:make_string "a") ⇒ "a"
(:make_string "a" "b") ⇒ "ab"
(:make_string "a" "b" "c") ⇒ "abc"
(:make_string "a" "b" "c" "d") ⇒ "abcd"
To make this work, the declaration of `make_string` is effectively:
(macro make_string (parts*) /*...*/)
The `*` is a cardinality modifier.
A parameter's cardinality dictates both the number of argument expressions it can accept and the number of values its expansion can produce.
In the examples so far, all parameters have had a cardinality of `exactly-one`, which is the default.
The `parts` parameter has a cardinality of `zero-or-more`, meaning:
- It can accept `zero-or-more` argument expressions.
- When expanded, it will produce `zero-or-more` values.
When the final parameter in the macro signature is `zero-or-more`, "all of the rest" of the argument expressions will be passed to that parameter.
(:make_string)
// └── 0 argument expressions passed to `parts`
(:make_string "a")
// └┬┘
// └── 1 argument expression passed to `parts`
(:make_string "a" "b" "c" "d")
// └──────┬──────┘
// └── 4 argument expressions passed to `parts`
At this point our distinction between parameters and arguments becomes more apparent, since they are no longer one-to-one: this macro with one parameter can be invoked with one argument, or twenty, or none.
TIP: To declare a final parameter that requires at least one rest-argument, use the `+` modifier.
Arguments and results are streams
The inputs to and results from a macro are modeled as streams of values. When a macro is invoked, each argument expression produces a stream of values, and within the macro definition, each parameter name refers to the corresponding stream, not to a specific value. The declared cardinality of a parameter constrains the number of elements produced by its stream, and is verified by the macro expansion system.
More generally, the results of all template expressions are streams. While most expressions produce a single value, various macros and special forms can produce zero or more values.
We have everything we need to illustrate this, via another system macro, `values`:
(macro values (vals*) (%vals))
(:values 1) ⇒ 1
(:values 1 true null) ⇒ 1 true null
(:values) ⇒ _nothing_
The `values` macro accepts any number of arguments and returns their values; it is effectively a multi-value identity function.
We can use this to explore how streams combine in E-expressions.
Splicing in encoded data
At the top level, an e-expression's resulting values become top-level values.
(:values 1 2 3) => 1 2 3
When an E-expression appears within a list or S-expression, the resulting values are spliced into the surrounding container:
[first, (:values), last] ⇒ [first, last]
[first, (:values "middle"), last] ⇒ [first, "middle", last]
(first (:values left right) last) ⇒ (first left right last)
This also applies wherever a tagged type can appear inside an E-expression:
(first (:values (:values left right) (:values)) last) ⇒ (first left right last)
Note that each argument-expression always maps to one parameter, even when that expression returns too-few or too-many values.
(macro reverse (a b)
[(%b), (%a)])
(:reverse (:values 5 USD)) ⇒ // Error: 'reverse' expects 2 arguments, given 1
(:reverse 5 (:values) USD) ⇒ // Error: 'reverse' expects 2 arguments, given 3
(:reverse (:values 5 6) USD) ⇒ // Error: argument 'a' expects 1 value, given 2
In this example, the parameters expect exactly one argument, producing exactly one value. When the cardinality allows multiple values, then the argument result-streams are concatenated. We saw this (rather subtly) above in the nested use of `values`, but can also illustrate using the rest-parameter to `make_string`, which we'll expand here in steps:
(:make_string (:values) a (:values b (:values c) d) e)
// ^^^^^^ next
⇒ (:make_string a (:values b (:values c) d) e)
// ^^^^^^ next
⇒ (:make_string a (:values b c d) e)
// ^^^^^^ next
⇒ (:make_string a b c d e)
⇒ "abcde"
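The concatenation of argument streams can be mimicked in Python by modeling every argument expression as a list of values. This is a rough sketch under that assumption; the functions below are stand-ins for the system macros of the same name, not their actual implementations, and symbols are modeled as plain strings.

```python
def values(*arg_streams):
    """Model of `values`: concatenate all argument streams into one stream."""
    out = []
    for stream in arg_streams:
        out.extend(stream)
    return out


def make_string(*arg_streams):
    """Model of `make_string`: join the text of every value of every argument."""
    return ["".join(values(*arg_streams))]


# (:make_string (:values) a (:values b (:values c) d) e)
result = make_string(
    values(),
    ["a"],
    values(["b"], values(["c"]), ["d"]),
    ["e"],
)
print(result)  # ['abcde']
```

A literal such as `a` is written as the one-element stream `["a"]`, so nested invocations and literals are handled uniformly.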
Splicing within sequences is straightforward, but structs are trickier due to their key/value nature. When used in field-value position, each result from a macro is bound to the field-name independently, leading to the field being repeated or even absent:
{ name: (:values) } ⇒ { }
{ name: (:values v) } ⇒ { name: v }
{ name: (:values v ann::w) } ⇒ { name: v, name: ann::w }
An E-expression can even be used in place of a key-value pair, in which case it must return structs, which are merged into the surrounding container:
{ a:1, (:values), z:3 } ⇒ { a:1, z:3 }
{ a:1, (:values {}), z:3 } ⇒ { a:1, z:3 }
{ a:1, (:values {b:2}), z:3 } ⇒ { a:1, b:2, z:3 }
{ a:1, (:values {b:2} {z:3}), z:3 } ⇒ { a:1, b:2, z:3, z:3 }
{ a:1, (:values key "value") } ⇒ // Error: struct expected for splicing into struct
Splicing in template expressions
The preceding examples demonstrate splicing of E-expressions into encoded data, but similar stream-splicing occurs within the template language, making it trivial to convert a stream to a list:
(macro list_of (vals*) [ (%vals) ])
(macro clumsy_bag (elts*) { '': (%elts) })
(:list_of) ⇒ []
(:clumsy_bag) ⇒ {}
(:list_of 1 2 3) ⇒ [1, 2, 3]
(:clumsy_bag true 2) ⇒ {'':true, '':2}
Mapping templates over streams: for
Another way to produce a stream is via a mapping form. The `for` special form evaluates a template once for each value provided by a stream or streams. Each time, a local variable is created and bound to the next value on the stream.
(macro prices (currency amounts*)
(.for
// Binding pairs
[(amt (%amounts))]
//└┬┘ └────┬───┘
// │ └─── stream to map over
// └─────────── variable name
// Template
(.price (%amt) (%currency))
)
)
The first subform of `for` is a list of binding pairs, S-expressions containing a variable name and a series of TDL expressions. Here, that TDL expression series is a single parameter expansion, so each individual value from the `amounts` stream is bound to the name `amt` before the `price` invocation is expanded.
(:prices GBP 10 9.99 12.)
⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP} {amount:12., currency:GBP}
More than one stream can be iterated in parallel, and iteration terminates when any stream becomes empty.
(macro zip (front* back*)
(.for [(f (%front)),
(b (%back))]
[(%f), (%b)]))
(:zip (:values 1 2 3) (:values a b))
⇒ [1, a] [2, b]
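The lockstep behavior of `for` resembles Python's built-in `zip`, which also stops at the shortest input. The sketch below is an analogy rather than a definition: `for_each` and `zip_macro` are hypothetical names, streams are modeled as lists, and the template is modeled as a function that returns a stream.

```python
def for_each(bindings, template):
    """bindings: list of (name, stream) pairs; template: callable taking the
    bound names as keyword arguments and returning a stream (a list)."""
    names = [name for name, _ in bindings]
    streams = [stream for _, stream in bindings]
    out = []
    for row in zip(*streams):  # stops when any stream is exhausted
        out.extend(template(**dict(zip(names, row))))
    return out


def zip_macro(front, back):
    """Model of the `zip` macro defined above."""
    return for_each([("f", front), ("b", back)], lambda f, b: [[f, b]])


print(zip_macro([1, 2, 3], ["a", "b"]))  # [[1, 'a'], [2, 'b']]
```

The third value of `front` is never bound because `back` runs out first, mirroring the expansion shown above.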
Empty streams: none
The empty stream is an important edge case that requires careful handling and communication.
The built-in macro `none` accepts no values and produces an empty stream:
(macro list_of (items*) [(%items)])
(:list_of (:none)) ⇒ []
(:list_of 1 (:none) 2) ⇒ [1, 2]
[(:none)] ⇒ []
{a:(:none)} ⇒ {}
When used as a macro argument, a `none` invocation (like any other expression) counts as one argument:
(:pi (:none)) ⇒ // Error: 'pi' expects 0 arguments, given 1
The special form `(::)` is an empty argument expression group, similar to `(:none)` but used specifically to express the absence of an argument:
(:int_list (::)) ⇒ []
(:int_list 1 (::) 2) ⇒ [1, 2]
TIP: While `none` and `values` both produce the empty stream, the former is preferred for clarity of intent and terminology.
Cardinality
As described earlier, parameters are all streams of values, but the number of values can be controlled by the parameter's cardinality. So far we have seen the default `exactly-one` and the `*` (zero-or-more) cardinality modifiers, and in total there are four:
Modifier | Cardinality |
---|---|
! | exactly-one value |
? | zero-or-one value |
+ | one-or-more values |
* | zero-or-more values |
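The table translates directly into a bounds check on the number of values in a stream. The following is a minimal sketch, assuming streams are modeled as Python lists; `check_cardinality` is a hypothetical helper, not an API of any Ion library.

```python
# (minimum, maximum) number of values per modifier; None means unbounded.
CARDINALITY = {
    "!": (1, 1),     # exactly-one
    "?": (0, 1),     # zero-or-one
    "+": (1, None),  # one-or-more
    "*": (0, None),  # zero-or-more
}


def check_cardinality(modifier, values):
    """Return `values` unchanged, or raise if the count violates the modifier."""
    low, high = CARDINALITY[modifier]
    count = len(values)
    if count < low or (high is not None and count > high):
        raise ValueError(
            f"'{modifier}' parameter received {count} value(s)"
        )
    return values


check_cardinality("*", [])      # accepted: zero-or-more
check_cardinality("+", [1, 2])  # accepted: one-or-more
```

Calling `check_cardinality("!", [])` or `check_cardinality("?", [1, 2])` raises `ValueError`.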
Exactly-One
Many parameters expect exactly one value and thus have `exactly-one` cardinality.
This is the default cardinality, but the `!` modifier can be used for clarity.
This cardinality means that the parameter requires a stream producing a single value, so one might refer to them as singleton streams or just singletons colloquially.
Zero-or-One
A parameter with the modifier `?` has `zero-or-one` cardinality, which is much like exactly-one cardinality, except the parameter accepts an empty-stream argument as a way to denote an absent parameter.
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (%scale)
})
Since `scale` accepts the empty stream, we can pass it an empty argument group:
(:temperature 96 F) ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283}
Note that the result’s `scale` field has disappeared because no value was provided. It would be more useful to fill in a default value, which we can achieve with the `default` system macro:
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (.default (%scale) K)
})
(:temperature 96 F) ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}
To refine things a bit further, trailing arguments that accept the empty stream can be omitted entirely:
(:temperature 283) ⇒ {degrees:283, scale:K}
TIP: The `default` macro is implemented with the help of a special form that can detect the empty stream: `if_none`.
Zero-or-More
A parameter with the modifier `*` has `zero-or-more` cardinality.
(macro prices (amount* currency)
(.for [(amt (%amount))]
(.price (%amt) (%currency))))
When `*` is on a non-final parameter, we cannot take “all the rest” of the arguments and must use a different calling convention to draw the boundaries of the stream. Instead, we need a single expression that produces the desired values:
(:prices (::) JPY) ⇒ // empty stream
(:prices 54 CAD) ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP) ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}
Here we use a non-empty argument group `(:: /*...*/)` to delimit the multiple elements of the `amount` stream.
One-or-More
A parameter with the modifier `+` has `one-or-more` cardinality, which works like `*` except:
- `+` parameters cannot accept the empty stream.
- When expanded, `+` parameters must produce at least one value.
To continue using our `prices` example:
(macro prices (amount+ currency)
(.for [(amt (%amount))]
(.price (%amt) (%currency))))
(:prices (::) JPY) ⇒ // Error: `+` parameter received the empty stream
(:prices 54 CAD) ⇒ {amount:54, currency:CAD}
(:prices (:: 10 9.99) GBP) ⇒ {amount:10, currency:GBP} {amount:9.99, currency:GBP}
On the final parameter, `+` collects the remaining (one or more) arguments:
(macro thanks (names+)
(.make_string "Thank you to my Patreon supporters:\n"
(.for [(name (%names))]
(.make_string " * " (%name) "\n"))))
(:thanks) ⇒ // Error: at least one value expected for + parameter
(:thanks Larry Curly Moe) =>
'''\
Thank you to my Patreon supporters:
* Larry
* Curly
* Moe
'''
Argument Groups
The non-rest versions of multi-value parameters require some kind of delimiting syntax to contain the applicable sub-expressions. For the tagged-type parameters we've seen so far, you could use `:values` or some other macro to produce the stream, but that doesn't work for tagless types.
The preferred syntax, supporting all argument types, is a special delimiting form
called an argument group. Here is a macro to illustrate:
(macro prices
(amount* currency)
(.for [(amt (%amount))]
(.price (%amt) (%currency))))
The parameter `amount` accepts any number of argument expressions.
It's easy to provide exactly one:
(:prices 12.99 GBP) ⇒ {amount:12.99, currency:GBP}
To provide a non-singleton stream of values, use an argument group.
Inside an E-expression, a group starts with `(::`:
(:prices (::) GBP) ⇒ _void_
(:prices (:: 1) GBP) ⇒ {amount:1, currency:GBP}
(:prices (:: 1 2 3) GBP) ⇒ {amount:1, currency:GBP}
{amount:2, currency:GBP}
{amount:3, currency:GBP}
Within the group, the invocation can have any number of expressions that align with the parameter's encoding. The macro parameter produces the results of those expressions, concatenated into a single stream, and the expander verifies that each value on that stream is acceptable by the parameter’s declared encoding.
(:prices (:: 1 (:values 2 3) 4) GBP) ⇒ {amount:1, currency:GBP}
{amount:2, currency:GBP}
{amount:3, currency:GBP}
{amount:4, currency:GBP}
Argument groups may only appear inside macro invocations where the corresponding parameter has `?`, `*`, or `+` cardinality.
There is no binary opcode for these constructs; the encoding uses a tagless format to keep things as dense as possible.
As usual, the text format mirrors this constraint.
WARNING: The allowed combinations of cardinality and argument groups are pending finalization of the binary encoding.
Optional Arguments
When a trailing parameter accepts the empty stream, an invocation can omit its corresponding argument expression, as long as no following parameter is being given an expression. We’ve seen this as applied to final `*` parameters, but it also applies to `?` parameters:
(macro optionals (a* b? c! d* e? f*)
(.make_list (%a) (%b) (%c) (%d) (%e) (%f)))
Since `d`, `e`, and `f` all accept the empty stream, they can be omitted by invokers. But `c` is required so `a` and `b` must always be present, at least as an empty group:
(:optionals (::) (::) "value for c") ⇒ ["value for c"]
Now `c` receives the string `"value for c"` while the other parameters are all empty.
If we want to provide `e`, then we must also provide a group for `d`:
(:optionals (::) (::) "value for c" (::) "value for e")
⇒ ["value for c", "value for e"]
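One way to picture how argument expressions line up with parameters is the following Python sketch. It is an illustration under simplifying assumptions, not the specified algorithm: `bind_args` is a hypothetical helper, each argument expression (including a group) is modeled as a list of values, and cardinality validation of the bound streams is left out.

```python
ACCEPTS_EMPTY = {"?", "*"}
REST = {"*", "+"}


def bind_args(signature, args):
    """signature: list of (name, modifier); args: list of argument streams.
    Returns a dict mapping each parameter name to its stream."""
    has_rest = bool(signature) and signature[-1][1] in REST
    if not has_rest and len(args) > len(signature):
        raise ValueError(f"expected {len(signature)} arguments, given {len(args)}")
    bound = {}
    last = len(signature) - 1
    for i, (name, modifier) in enumerate(signature):
        if i == last and has_rest:
            # A final `*` or `+` parameter takes "all the rest".
            bound[name] = [v for arg in args[i:] for v in arg]
        elif i < len(args):
            bound[name] = list(args[i])
        elif modifier in ACCEPTS_EMPTY:
            bound[name] = []  # omitted trailing argument
        else:
            raise ValueError(f"missing argument for required parameter '{name}'")
    return bound


OPTIONALS = [("a", "*"), ("b", "?"), ("c", "!"), ("d", "*"), ("e", "?"), ("f", "*")]

# (:optionals (::) (::) "value for c" (::) "value for e")
bound = bind_args(OPTIONALS, [[], [], ["value for c"], [], ["value for e"]])
print([v for name, _ in OPTIONALS for v in bound[name]])
```

Dropping the third argument makes `bind_args` raise, because `c` is required and cannot default to the empty stream.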
Tagless and fixed-width types
In Ion 1.0, the binary encoding of every value starts off with a “type tag”, an opcode that indicates the data-type of the next value and thus the interpretation of the following octets of data. In general, these tags also indicate whether the value has annotations, and whether it’s null.
These tags are necessary because the Ion data model allows values of any type to be used anywhere. Ion documents are not schema-constrained: nothing forces any part of the data to have a specific type or shape. We call Ion “self-describing” precisely because each value self-describes its type via a type tag.
If schema constraints are enforced through some mechanism outside the serializer/deserializer, the type tags are unnecessary and may add up to a non-trivial amount of wasted space. Furthermore, the overhead for each value also includes length information: encoding an octet of data takes two octets on the stream.
Ion 1.1 tries to mitigate this overhead in the binary format by allowing macro parameters to use more-constrained tagless types. These are subtypes of the concrete types, constrained such that type tags are not necessary in the binary form. In general this can shave 4-6 bits off each value, which can add up in aggregate. In the extreme, that octet of data can be encoded with no overhead at all.
The following tagless types are available:
Tagless type | Description |
---|---|
flex_symbol | Tagless symbol (SID or text) |
flex_string | Tagless string |
flex_int | Tagless, variable-width signed int |
flex_uint | Tagless, variable-width unsigned int |
int8 int16 int32 int64 | Fixed-width signed int |
uint8 uint16 uint32 uint64 | Fixed-width unsigned int |
float16 float32 float64 | Fixed-width float |
To define a tagless parameter, just declare one of the primitive types:
(macro point (flex_int::x flex_int::y)
{x: (%x), y: (%y)})
(:point 3 17) ⇒ {x:3, y:17}
The tagless encoding has no real benefit here in text, as primitive types aim to improve the binary encoding.
This density comes at the cost of flexibility. Primitive types cannot be annotated or null, and arguments cannot be expressed using macros, like we’ve done before:
(:point null.int 17) ⇒ // Error: primitive flex_int does not accept nulls
(:point a::3 17) ⇒ // Error: primitive flex_int does not accept annotations
(:point (:values 1) 2) ⇒ // Error: cannot use macro for a primitive argument
While Ion text syntax doesn’t use tags—the types are built into the syntax—these errors ensure that a text E-expression may only express things that can also be expressed using an equivalent binary E-expression.
For the same reasons, supplying a (non-rest) tagless parameter with no value, or with more than one value, can only be expressed by using an argument group.
A subset of the primitive types are fixed-width: they are binary-encoded with no per-value overhead.
(macro byte_array
(uint8::bytes*)
[(%bytes)])
Invocations of this macro are encoded as a sequence of untagged octets, because the macro definition constrains the argument shape such that nothing else is acceptable. A text invocation is written using normal ints:
(:byte_array 0 1 2 3 4 5 6 7 8) ⇒ [0, 1, 2, 3, 4, 5, 6, 7, 8]
(:byte_array 9 -10 11) ⇒ // Error: -10 is not a valid uint8
(:byte_array 256) ⇒ // Error: 256 is not a valid uint8
As above, Ion text doesn’t have syntax specifically denoting “8-bit unsigned integers”, so to keep text and binary capabilities aligned, the parser rejects invocations where an argument value exceeds the range of the binary-only type.
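The range check a text parser performs for fixed-width integer parameters can be sketched as follows. This is a hypothetical helper, not an API of any Ion implementation, and it assumes the signed types use the usual two's-complement ranges.

```python
def fixed_int_range(type_name):
    """Return (low, high) for names like 'uint8' or 'int32'.
    Assumes two's-complement ranges for the signed types."""
    unsigned = type_name.startswith("uint")
    bits = int(type_name[4:] if unsigned else type_name[3:])
    if unsigned:
        return 0, 2**bits - 1
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def check_fixed_int(type_name, value):
    """Return `value` if it fits the type, otherwise raise ValueError."""
    low, high = fixed_int_range(type_name)
    if not low <= value <= high:
        raise ValueError(f"{value} is not a valid {type_name}")
    return value


print([check_fixed_int("uint8", n) for n in range(9)])
print(fixed_int_range("uint8"))  # (0, 255)
```

With this check, `-10` and `256` are both rejected for `uint8`, matching the errors shown above.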
Primitive types have inherent tradeoffs and require careful consideration, but in the right circumstances the density wins can be significant.
Macro Shapes
We can now introduce the final kind of input constraint, macro-shaped parameters. To understand the motivation, consider modeling a scatter-plot as a list of points:
[{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}, …]
Lists like these exhibit a lot of repetition. Since we already have a `point` macro, we can eliminate a fair amount:
[(:point 3 17), (:point 395 23), (:point 15 48), (:point 2023 5), …]
This eliminates all the `x`s and `y`s, but leaves repeated macro invocations.
What we’d like is to eliminate the `point` calls and just write a stream of pairs, something like:
(:scatterplot (3 17) (395 23) (15 48) (2023 5) …)
We can achieve exactly that with a macro-shaped parameter, in which we use the `point` macro as an encoding:
(macro scatterplot (point::points*)
// ^^^^^
[(%points)])
`point` is not one of the built-in encodings, so this is a reference to the macro of that name defined earlier.
(:scatterplot (3 17) (395 23) (15 48) (2023 5))
⇒
[{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}]
Each argument S-expression like `(3 17)` is implicitly an E-expression invoking the `point` macro. The argument mirrors the shape of the inner macro, without repeating its name. Further, expansion of the implied `point`s happens automatically, so the overall behavior is just like the preceding variant and the `points` parameter produces a stream of structs.
The binary encoding of macro-shaped parameters is similarly tagless, eliding any opcodes mentioning `point` and just writing its arguments with minimal delimiting.
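In Python terms, a macro-shaped parameter behaves like a function that applies another function to each argument tuple. The sketch below is an analogy only, using the first `scatterplot` definition above; structs are modeled as dicts and each `(x y)` argument as a tuple.

```python
def point(x, y):
    """Model of the `point` macro."""
    return {"x": x, "y": y}


def scatterplot(*pairs):
    """Model of `scatterplot`: every argument is implicitly a `point` call."""
    return [point(*pair) for pair in pairs]


# (:scatterplot (3 17) (395 23) (15 48) (2023 5))
print(scatterplot((3, 17), (395, 23), (15, 48), (2023, 5)))
```

The caller never names `point`; the parameter's declared shape determines how each pair is interpreted.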
Macro types can be combined with cardinality modifiers, with invocations using groups as needed:
(macro scatterplot
(point::points+ flex_string::x_label flex_string::y_label)
{ points: [(%points)], x_label: (%x_label), y_label: (%y_label) })
(:scatterplot (:: (3 17) (395 23) (15 48) (2023 5)) "hour" "widgets")
⇒
{
points: [{x:3, y:17}, {x:395, y:23}, {x:15, y:48}, {x:2023, y:5}],
x_label: "hour",
y_label: "widgets"
}
As with other tagless parameters, you cannot replace a group with a macro invocation, and you can't use a macro invocation as an element of an argument group:
(:scatterplot (:make_points 3 17 395 23 15 48 2023 5) "hour" "widgets")
⇒ // Error: Argument group expected, found :make_points
(:scatterplot (:: (3 17) (:make_points 395 23 15 48) (2023 5)) "hour" "widgets")
⇒ // Error: sexp expected with args for 'point', found :make_points
(:scatterplot (:: (3 17) (:point 395 23) (15 48) (2023 5)) "hour" "widgets")
⇒ // Error: sexp expected with args for 'point', found :point
This limitation mirrors the binary encoding, where both the argument group and the individual macro invocations are tagless and there's no way to express a macro invocation.
TIP: The primary goal of macro-shaped arguments, and tagless types in general, is to increase density by tightly constraining the inputs.
Special Forms
When a TDL expression is syntactically an S-expression and its first element is the symbol `.`, its next element must be a symbol that matches either one of the keywords denoting the special forms or the name of a previously-defined macro.
The interpretation of the S-expression’s remaining elements depends on how the symbol resolves.
In the case of macro invocations, the elements following the operator are arbitrary TDL expressions, but for special forms that is not always the case.
Special forms are "special" precisely because they cannot be expressed as macros and must therefore receive bespoke syntactic treatment. Since the elements of macro-invocation expressions are themselves expressions, when you want something to not be evaluated that way, it must be a special form.
Finally, these special forms are part of the template language itself, and are not addressable outside of TDL; the E-expression `(:if_none foo bar baz)` must necessarily refer to some user-defined macro named `if_none`, not to the special form of the same name.
TODO: Many of these could be system macros instead of special forms. Being unrepresentable in TDL is not a reason for something to be a special form.
Candidates to be moved to system macros are `if_*` and `fail`.
Additionally, the system macro `parse_ion` may need to be classified as a special form since it only accepts literals.
if_none
(macro if_none (stream* true_branch* false_branch*) /* Not representable in TDL */)
The `if_none` form is if/then/else syntax testing stream emptiness.
It has three sub-expressions, the first being a stream to check.
If and only if that stream is empty (it produces no values), the second sub-expression is expanded.
Otherwise, the third sub-expression is expanded.
The expanded second or third sub-expression becomes the result that is produced by `if_none`.
NOTE: Exactly one branch is expanded, because otherwise the empty stream might be used in a context that requires a value, resulting in an errant expansion error.
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (.if_none (%scale) K (%scale)),
})
(:temperature 96 F) ⇒ {degrees:96, scale:F}
(:temperature 283 (::)) ⇒ {degrees:283, scale:K}
To refine things a bit further, trailing optional arguments can be omitted entirely:
(:temperature 283) ⇒ {degrees:283, scale:K}
TIP: If you're using `if_none` to specify an expression to default to, you can use the `default` system macro to be more concise.
(macro temperature (degrees scale?)
{
degrees: (%degrees),
scale: (.default (%scale) K),
}
)
if_some
(macro if_some (stream* true_branch* false_branch*) /* Not representable in TDL */)
If `stream` evaluates to one or more values, it produces `true_branch`. Otherwise, it produces `false_branch`.
Exactly one of `true_branch` and `false_branch` is evaluated.
The `stream` expression must be expanded enough to determine whether it produces any values, but implementations are not required to fully expand the expression.
Example:
(macro foo (x*)
{
foo: (.if_some (%x) [(%x)] null)
})
(:foo (::)) => { foo: null }
(:foo 2) => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }
The `false_branch` parameter may be elided, allowing `if_some` to serve as a map-if-not-none function.
Example:
(macro foo (x*)
{
foo: (.if_some (%x) [(%x)])
})
(:foo (::)) => { }
(:foo 2) => { foo: [2] }
(:foo (:: 2 3)) => { foo: [2, 3] }
if_single
(macro if_single (expressions* true_branch* false_branch*) /* Not representable in TDL */)
If `expressions` evaluates to exactly one value, `if_single` produces the expansion of `true_branch`. Otherwise, it produces the expansion of `false_branch`.
Exactly one of `true_branch` and `false_branch` is evaluated.
The `expressions` argument must be expanded enough to determine whether it produces exactly one value, but implementations are not required to fully expand the expression.
if_multi
(macro if_multi (expressions* true_branch* false_branch*) /* Not representable in TDL */)
If `expressions` evaluates to more than one value, `if_multi` produces the expansion of `true_branch`. Otherwise, it produces the expansion of `false_branch`.
Exactly one of `true_branch` and `false_branch` is evaluated.
The `expressions` argument must be expanded enough to determine whether it produces more than one value, but implementations are not required to fully expand the expression.
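The four conditional forms differ only in the test they apply to the size of the checked stream. The Python sketch below models that, with branches passed as zero-argument callables so that only the selected one is ever evaluated. The function names mirror the special forms but are stand-ins, streams are modeled as lists, and `default` is written in terms of `if_none` purely for illustration.

```python
def _pick(condition, true_branch, false_branch):
    """Evaluate exactly one branch and return its stream."""
    branch = true_branch if condition else false_branch
    return list(branch())


def if_none(stream, true_branch, false_branch):
    return _pick(len(stream) == 0, true_branch, false_branch)


def if_some(stream, true_branch, false_branch=lambda: []):
    # `false_branch` may be elided, in which case nothing is produced.
    return _pick(len(stream) >= 1, true_branch, false_branch)


def if_single(stream, true_branch, false_branch):
    return _pick(len(stream) == 1, true_branch, false_branch)


def if_multi(stream, true_branch, false_branch):
    return _pick(len(stream) > 1, true_branch, false_branch)


def default(stream, fallback):
    """Illustrative model of `default` built on `if_none`."""
    return if_none(stream, lambda: fallback, lambda: stream)


print(default([], ["K"]))                             # ['K']
print(default(["F"], ["K"]))                          # ['F']
print(if_some([2, 3], lambda: [[2, 3]], lambda: [None]))  # [[2, 3]]
```

Because the branches are callables, a branch that would fail when evaluated causes no error unless it is the one selected.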
for
(for name_and_expressions template)
name_and_expressions
is a list or s-expression containing one or more s-expressions of the form (name expr0 expr1 ... exprN)
.
The first value is a symbol to act as a variable name.
The remaining expressions in the s-expression will be expanded and concatenated into a single stream; for each value in the stream, the for
expansion will produce a copy of the template
argument expression with any appearance of the variable replaced by the value.
For example:
(.for
[(word // Variable name
foo bar baz)] // Values over which to iterate
(.values (%word) (%word))) // Template expression; `(%word)` will be replaced
=>
foo foo bar bar baz baz
Multiple s-expressions can be specified. The streams will be iterated over in lockstep.
(.for
((x 1 2 3) // for x in...
(y 4 5 6)) // for y in...
((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 4)
(2 5)
(3 6)
Iteration will end when the shortest stream is exhausted.
(.for
[(x 1 2), // for x in...
(y 3 4 5)] // for y in...
((%x) (%y))) // Template; `(%x)` and `(%y)` will be replaced
=>
(1 3)
(2 4)
// no more output, `x` is exhausted
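The lockstep iteration described above behaves like a zip over the bound streams. The Python sketch below is an informal model under stated assumptions: the function name `for_expand` is hypothetical, each binding is a `(name, values)` pair, and the template is a function from the current variable bindings to the values it produces.

```python
def for_expand(bindings, template):
    """Expand `template` once per lockstep row of the bound streams."""
    names = [name for name, _ in bindings]
    streams = [values for _, values in bindings]
    produced = []
    for row in zip(*streams):  # zip stops when the shortest stream is exhausted
        produced.extend(template(dict(zip(names, row))))
    return produced

# One variable: each value is substituted into the template twice.
print(for_expand([("word", ["foo", "bar", "baz"])],
                 lambda env: [env["word"], env["word"]]))
# ['foo', 'foo', 'bar', 'bar', 'baz', 'baz']

# Two variables of unequal length: iteration ends after two rows.
print(for_expand([("x", [1, 2]), ("y", [3, 4, 5])],
                 lambda env: [(env["x"], env["y"])]))
# [(1, 3), (2, 4)]
```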
Names defined inside a for
shadow names in the parent scope.
(macro triple (x)
// └─── Parameter `x` is declared here...
(.for
// ...but the `for` expression introduces a
// ┌─── new variable of the same name here.
((x a b c))
(%x)
// └─── This refers to the `for` expression's `x`, not the parameter.
)
)
(:triple 1) // Argument `1` is ignored
=>
a b c
The for
special form can only be invoked in the body of a template macro. It is not valid as an E-expression.
System Macros
Many of the system macros MAY be defined as template macros, and when possible, the specification includes a template. Templates are given here as normative examples, but system macros are not required to be implemented as template macros.
The macros that can be defined as templates are included as system macros because of their broad applicability, and
so that Ion implementations can provide optimizations for these macros that run directly in the implementation's runtime
environment rather than in the macro evaluator.
For example, a macro such as add_symbols
does not produce user values, so an Ion Reader could bypass
evaluating the template and directly update the encoding context with the new symbols.
Stream Constructors
none
(macro none () (.values))
none
accepts no values and produces nothing (an empty stream).
values
(macro values (v*) (%v))
This is, essentially, the identity function. It produces a stream from any number of arguments, concatenating the streams produced by the nested expressions. Used to aggregate multiple values or sub-streams to pass to a single argument, or to produce multiple results.
default
(macro default (expr* default_expr*)
// If `expr` is empty...
(.if_none (%expr)
// then expand `default_expr` instead.
(%default_expr)
// If it wasn't empty, then expand `expr`.
(%expr)
)
)
default
tests expr
to determine whether it expands to the empty stream.
If it does not, default
will produce the expansion of expr
.
If it does, default
will produce the expansion of default_expr
instead.
flatten
(macro flatten (sequence*) /* Not representable in TDL */)
The flatten
system macro constructs a stream from the content of one or more sequences.
Produces a stream with the contents of all the sequence
values.
Any annotations on the sequence
values are discarded.
Any non-sequence arguments will raise an error.
Any null arguments will be ignored.
Examples:
(:flatten [a, b, c] (d e f)) => a b c d e f
(:flatten [[], null.list] foo::()) => [] null.list
The flatten
macro can also be used to splice the content of one list or s-expression into another list or s-expression.
[1, 2, (:flatten [a, b]), 3, 4] => [1, 2, a, b, 3, 4]
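The rules for flatten listed above can be sketched in Python. This is an illustrative model only: Ion lists are represented as Python lists, s-expressions as tuples, any null as `None`, and the `Annotated` wrapper is a hypothetical stand-in for an annotated value.

```python
from dataclasses import dataclass

@dataclass
class Annotated:
    """Hypothetical stand-in for an Ion value carrying annotations."""
    annotations: tuple
    value: object

def flatten(*sequences):
    """Concatenate the contents of list/sexp arguments into one stream."""
    stream = []
    for arg in sequences:
        if isinstance(arg, Annotated):
            arg = arg.value  # annotations on the sequence itself are discarded
        if arg is None:
            continue  # null arguments are ignored
        if not isinstance(arg, (list, tuple)):
            raise TypeError(f"flatten expects sequences, got {arg!r}")
        stream.extend(arg)  # nested values are passed through unchanged
    return stream

print(flatten(["a", "b", "c"], ("d", "e", "f")))
# ['a', 'b', 'c', 'd', 'e', 'f']
print(flatten([[], None], Annotated(("foo",), ())))
# [[], None]
```

Note that only the outer sequences are opened; the nested empty list and nested null in the second call survive as values.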
parse_ion
Ion documents may be embedded in other Ion documents using the parse_ion
macro.
(macro parse_ion (uint8::data*) /* Not representable in TDL */)
The parse_ion
macro constructs a stream of values by parsing a blob literal or string literal as a single, self-contained Ion document.
All values produced by the expansion of parse_ion
are application values.
(I.e. it is as if they are all annotated with $ion_literal
.)
The IVM at the beginning of an Ion data stream is sufficient to identify whether it is text or binary, so text Ion can be embedded as a blob containing the UTF-8 encoded text.
Embedded text example:
(:parse_ion
'''
$ion_1_1
$ion_encoding::((symbol_table ["foo", "bar"]))
$1 $2
'''
)
=> foo bar
Embedded binary example:
(:parse_ion {{ 4AEB6qNmb2+jYmFy }} )
=> foo bar
important
Unlike most macros, this macro specifically requires literals. Macros are not allowed to contain recursive calls, and composing an embedded document from multiple expressions would make it possible to implement recursion in the macro system.
The data argument is evaluated in a clean environment that cannot read anything from the parent document. Allowing context to leak from the outer scope into the document being parsed would also enable recursion.
Value Constructors
annotate
(macro annotate (ann* value) /* Not representable in TDL */)
Produces the value prefixed with the annotations given by ann.
Each ann
must be a non-null, unannotated string or symbol.
(:annotate (:: "a2") a1::true) => a2::a1::true
make_string
(macro make_string (content*) /* Not representable in TDL */)
Produces a non-null, unannotated string containing the concatenated content produced by the arguments. Nulls (of any type) are forbidden. Any annotations on the arguments are discarded.
make_symbol
(macro make_symbol (content*) /* Not representable in TDL */)
Like make_string
but produces a symbol.
make_blob
(macro make_blob (lobs*) /* Not representable in TDL */)
Like make_string
but accepts lobs and produces a blob.
make_list
(macro make_list (sequences*) [ (.flatten (%sequences)) ])
Produces a non-null, unannotated list by concatenating the content of any number of non-null list or sexp inputs.
(:make_list) => []
(:make_list (1 2)) => [1, 2]
(:make_list (1 2) [3, 4]) => [1, 2, 3, 4]
(:make_list ((1 2)) [[3, 4]]) => [(1 2), [3, 4]]
make_sexp
(macro make_sexp (sequences*) ( (.flatten (%sequences)) ))
Like make_list
but produces a sexp.
(:make_sexp) => ()
(:make_sexp (1 2)) => (1 2)
(:make_sexp (1 2) [3, 4]) => (1 2 3 4)
(:make_sexp ((1 2)) [[3, 4]]) => ((1 2) [3, 4])
make_struct
(macro make_struct (structs*) /* Not representable in TDL */)
Produces a non-null, unannotated struct by combining the fields of any number of non-null structs.
(:make_struct) => {}
(:make_struct
{k1: 1, k2: 2}
{k3: 3}
{k4: 4}) => {k1:1, k2:2, k3:3, k4:4}
make_field
(macro make_field (flex_sym::field_name value) /* Not representable in TDL */)
Produces a non-null, unannotated, single-field struct using the given field name and value.
This can be used to dynamically construct field names based on macro parameters.
Example:
(macro foo_struct (extra_name extra_value)
(.make_struct
{
foo_a: 1,
foo_b: 2,
}
(.make_field (.make_string "foo_" (%extra_name)) (%extra_value))
))
Then:
(:foo_struct c 3) => { foo_a: 1, foo_b: 2, foo_c: 3 }
make_decimal
(macro make_decimal (flex_int::coefficient flex_int::exponent) /* Not representable in TDL */)
This is no more compact than the regular binary encoding for decimals. However, it can be used in conjunction with other macros, for example, to represent fixed-point numbers.
(macro usd (cents) (.annotate USD (.make_decimal (%cents) -2)))
(:usd 199) => USD::1.99
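The arithmetic behind the fixed-point example can be checked with a small Python sketch. It is an informal model, not an implementation of the macro: the function names mirror the macros for readability, and the annotation is represented as the first element of a tuple purely for illustration.

```python
from decimal import Decimal

def make_decimal(coefficient, exponent):
    """Build the decimal value coefficient * 10**exponent exactly."""
    sign = 1 if coefficient < 0 else 0
    digits = tuple(int(d) for d in str(abs(coefficient)))
    return Decimal((sign, digits, exponent))

def usd(cents):
    """Model of the `usd` macro: an annotation paired with a decimal value."""
    return ("USD", make_decimal(cents, -2))

print(usd(199))  # ('USD', Decimal('1.99'))
```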
make_timestamp
(macro make_timestamp (uint16::year
uint8::month?
uint8::day?
uint8::hour?
uint8::minute?
/*decimal*/ second?
int16::offset_minutes?) /* Not representable in TDL */)
Produces a non-null, unannotated timestamp at various levels of precision.
When offset_minutes
is absent, the result has unknown local offset; offset 0
denotes UTC.
The arguments to this macro may not be any null value.
note
TODO ion-docs#256 Reconsider offset semantics, perhaps default should be UTC.
Example:
(macro ts_today
(uint8::hour uint8::minute uint32::seconds_millis)
(.make_timestamp
2022
4
28
(%hour)
(%minute)
(.make_decimal (%seconds_millis) -3) 0))
Encoding Utility Macros
repeat
The repeat
system macro can be used for efficient run-length encoding.
(macro repeat (n! value+) /* Not representable in TDL */)
Produces a stream that repeats the specified value
expression(s) n
times.
(:repeat 5 0) => 0 0 0 0 0
(:repeat 2 true false) => true false true false
delta
note
🚧 Name still TBD 🚧
The delta
system macro can be used for directed delta encoding.
(macro delta (flex_int::initial! flex_int::deltas+) /* Not representable in TDL */)
Example:
(:delta 10 1 2 3 -4) => 11 13 16 12
sum
(macro sum (i*) /* Not representable in TDL */)
Produces the sum of all the integer arguments.
Examples:
(:sum 1 2 3) => 6
(:sum (::)) => 0
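The three macros repeat, delta, and sum can be modeled in a few lines of Python. This sketch is illustrative only. It assumes that `n` must be non-negative and that the initial value passed to delta is not itself emitted, which is what the delta example suggests; the name `sum_ints` is used to avoid shadowing Python's built-in `sum`.

```python
def repeat(n, *values):
    """Produce the given value expression(s) n times, in order."""
    if n < 0:
        raise ValueError("n must be non-negative (assumption of this sketch)")
    return list(values) * n

def delta(initial, *deltas):
    """Produce running totals: each output is the previous one plus a delta."""
    produced = []
    current = initial
    for step in deltas:
        current += step
        produced.append(current)
    return produced

def sum_ints(*ints):
    """Produce the sum of all integer arguments; an empty stream sums to 0."""
    return sum(ints)

print(repeat(5, 0))             # [0, 0, 0, 0, 0]
print(repeat(2, True, False))   # [True, False, True, False]
print(delta(10, 1, 2, 3, -4))   # [11, 13, 16, 12]
print(sum_ints(1, 2, 3))        # 6
print(sum_ints())               # 0
```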
meta
(macro meta (anything*) (.none))
The meta
macro accepts any values and emits nothing.
It allows writers to encode data that will not be surfaced to most readers.
Readers can be configured to intercept calls to meta
, allowing them to read the otherwise invisible data.
When transcribing from one format to another, writers should preserve invocations of meta
when possible.
Example:
(:values
(:meta {author: "Mike Smith", email: "mikesmith@example.com"})
{foo:2,foo:1}
)
=>
{foo:2,foo:1}
Updating the Encoding Context
set_symbols
Sets the local symbol table, preserving any macros in the macro table.
(macro set_symbols (symbols*)
$ion_encoding::(
(symbol_table [(%symbols)])
(macro_table $ion_encoding)
))
Example:
(:set_symbols foo bar)
=>
$ion_encoding::(
(symbol_table [foo, bar])
(macro_table $ion_encoding)
)
add_symbols
Appends symbols to the local symbol table, preserving any macros in the macro table.
(macro add_symbols (symbols*)
$ion_encoding::(
(symbol_table $ion_encoding [(%symbols)])
(macro_table $ion_encoding)
))
Example:
(:add_symbols foo bar)
=>
$ion_encoding::(
(symbol_table $ion_encoding [foo, bar])
(macro_table $ion_encoding)
)
set_macros
Sets the local macro table, preserving any symbols in the symbol table.
(macro set_macros (macros*)
$ion_encoding::(
(symbol_table $ion_encoding)
(macro_table (%macros))
))
Example:
(:set_macros (macro pi () 3.14159))
=>
$ion_encoding::(
(symbol_table $ion_encoding)
(macro_table (macro pi () 3.14159))
)
add_macros
Appends macros to the local macro table, preserving any symbols in the symbol table.
(macro add_macros (macros*)
$ion_encoding::(
(symbol_table $ion_encoding)
(macro_table $ion_encoding (%macros))
))
Example:
(:add_macros (macro pi () 3.14159))
=>
$ion_encoding::(
(symbol_table $ion_encoding)
(macro_table $ion_encoding (macro pi () 3.14159))
)
use
Appends the content of the given module to the encoding context.
(macro use (catalog_key version?)
$ion_encoding::(
(import the_module (%catalog_key) (.default (%version) 1))
(symbol_table $ion_encoding the_module)
(macro_table $ion_encoding the_module)
))
Example:
(:use "org.example.FooModule" 2)
=>
$ion_encoding::(
(import the_module "org.example.FooModule" 2)
(symbol_table $ion_encoding the_module)
(macro_table $ion_encoding the_module)
)
Note on annotate: the annotations sequence comes first in the macro signature because it parallels how annotations are read from the data stream.
Ion 1.1 modules
In Ion 1.0, each stream has a symbol table. The symbol table stores text values that can be referred to by their integer index in the table, providing a much more compact representation than repeating the full UTF-8 text bytes each time the value is used. Symbol tables do not store any other information used by the reader or writer.
Ion 1.1 introduces the concept of a macro table. It is analogous to the symbol table, but instead of holding text values it holds macro definitions.
Ion 1.1 also introduces the concept of a module, an organizational unit that holds a (symbol table, macro table)
pair.
tip
You can think of an Ion 1.0 symbol table as a module with an empty macro table.
In Ion 1.1, each stream has an encoding module—the active (symbol table, macro table)
pair that is being used to encode the stream.
Module interface
The interface to a module consists of:
- its spec version, denoting the Ion version used to define the module
- its exported symbols, an array of strings denoting symbol content
- its exported macros, an array of
<name, macro>
pairs, where all names are unique identifiers (or null).
The spec version is external to the module body and the precise way it is determined depends on the type of module being defined. This is explained in further detail in Module Versioning.
The exported symbol array is denoted by the symbol_table
clause of a module definition, and
by the symbols
field of a shared symbol table.
The exported macro array is denoted by the module’s macro_table
clause, with addresses
allocated to macros or macro bindings in the order they are declared.
The exported symbols and exported macros are defined in the module body.
Types of modules
There are multiple types of modules. All modules share the same interface, but vary in their implementation in order to support a variety of different use cases.
Module Type | Purpose |
---|---|
Encoding Module | Defining the local encoding context |
System Module | Defining system symbols and macros |
Inner Module | Organizing symbols and macros and limiting the scope of macros |
Shared Module | Defining symbols and macros outside of the data stream |
Module versioning
Every module definition has a spec version that determines the syntax and semantics of the module body. A module’s spec version is expressed in terms of a specific Ion version; the meaning of the module is as defined by that version of the Ion specification.
The spec version for an encoding module is implicitly derived from the Ion version of its containing segment. The spec version for a shared module is denoted via a required annotation. The spec version of an inner module is always the same as its containing module. The spec version of a system module is the Ion version in which it was specified.
To ensure that all consumers of a module can properly understand it, a module can only import shared modules defined with the same or earlier spec version.
Examples
The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N
.
This allows the module to be serialized using any version of Ion, and its meaning will not change.
$ion_shared_module::
$ion_1_1::("com.example.symtab" 3
(symbol_table ...)
(macro_table ...))
The spec version of an encoding module is always the same as the Ion version of its enclosing segment.
$ion_1_1
$ion_encoding::(
// Module semantics specified by Ion 1.1
...
)
// ...
$ion_1_3
$ion_encoding::(
// Module semantics specified by Ion 1.3
...
)
//... // Assuming no IVM
$ion_encoding::(
// Module semantics specified by Ion 1.3
...
)
Identifiers
Many of the grammatical elements used to define modules and macros are identifiers--symbols that do not require quotation marks.
More explicitly, an identifier is a sequence of one or more ASCII letters, digits, or the characters $
(dollar sign) or _
(underscore), not starting with a digit.
It also cannot be of the form $\d+
, which is the syntax for symbol IDs (for example: $3
, $10
, $458
, etc.), nor can it be a keyword (true
, false
, null
, or nan
).
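The identifier rules above translate directly into a pattern check. The Python sketch below is one possible reading of those rules; the function name `is_identifier` is hypothetical.

```python
import re

# One or more ASCII letters, digits, '$' or '_', not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
# The symbol ID syntax ($3, $10, $458, ...) is excluded.
_SYMBOL_ID = re.compile(r"\$[0-9]+")
# Keywords are excluded as well.
_KEYWORDS = {"true", "false", "null", "nan"}

def is_identifier(text):
    """Return True if `text` is an identifier under the rules stated above."""
    return (
        _IDENTIFIER.fullmatch(text) is not None
        and _SYMBOL_ID.fullmatch(text) is None
        and text not in _KEYWORDS
    )

for candidate in ["point2d", "$ion_encoding", "$10", "2d", "nan", "foo-bar"]:
    print(candidate, is_identifier(candidate))
```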
Defining modules
A module is defined by four kinds of subclauses which, if present, always appear in the same order.
- import - a reference to a shared module definition; repeatable
- module - a nested module definition; repeatable
- symbol_table - an exported list of text values
- macro_table - an exported list of macro definitions
Internal environment
The body of a module tracks an internal environment by which macro references are resolved. This environment is constructed incrementally by each clause in the definition and consists of:
- the visible modules, a map from identifier to module
- the exported symbols, an array containing symbol texts
- the exported macros, an array containing name/macro pairs
Before any clauses of the module definition are examined, the initial environment is as follows:
- The visible modules map binds $ion to the system module for the appropriate spec version. Inside an encoding directive, the visible modules map also binds $ion_encoding to the active encoding module (the encoding module that was active when the encoding directive was encountered). For an inner module, it also includes the modules previously made available by the enclosing module (via import or module).
- The macro table and symbol table are empty.
Each clause affects the environment as follows:
- An import declaration retrieves a shared module from the implementation’s catalog, assigns it a name in the visible modules, and makes its macros available for use. An error must be signaled if the name already appears in the visible modules.
- A module declaration defines a new module and assigns it a name in the visible modules. An error must be signaled if the name already appears in the visible modules.
- A symbol_table declaration defines the exported symbols.
- A macro_table declaration defines the exported macros.
Resolving Macro References
Within a module definition, macros can be referenced in several contexts using the following macro-ref syntax:
qualified-ref ::= module-name '::' macro-ref
macro-ref ::= macro-name | macro-addr
macro-name ::= unannotated-identifier-symbol
macro-addr ::= unannotated-uint
Macro references are resolved to a specific macro as follows:
- An unqualified macro-name is looked up within the exported macros, and if not found, then the active encoding module's macro table. If it maps to a macro, that’s the resolution of the reference. Otherwise, an error is signaled due to an unbound reference.
- An anonymous local reference (macro-addr) is resolved by index in the exported macro array. If the address exceeds the array boundary, an error is signaled due to an invalid reference.
- A qualified reference (qualified-ref) resolves solely against the referenced module. If the module name does not exist in the visible modules, an error is signaled due to an unbound reference. Otherwise, the name or address is resolved within that module’s exported macro array.
warning
An unqualified macro name can change meaning in the middle of an encoding module if you choose to shadow the
name of a macro in the active encoding module. To unambiguously refer to the active encoding module,
use the qualified reference syntax: $ion_encoding::<macro-name>
.
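The resolution rules above, including the shadowing behavior the warning describes, can be sketched in Python. This is an informal model: a macro table is a list of `(name, macro)` pairs, macros are represented by descriptive strings, and the exception and function names are hypothetical. Treating a missing name inside a qualified module as an unbound reference is an assumption of this sketch.

```python
class UnboundReference(Exception):
    pass

class InvalidReference(Exception):
    pass

def lookup(table, ref):
    """Find a macro in a list of (name, macro) pairs by name (str) or address (int)."""
    if isinstance(ref, int):
        if 0 <= ref < len(table):
            return table[ref][1]
        raise InvalidReference(f"address {ref} is out of bounds")
    for name, macro in table:
        if name == ref:
            return macro
    raise UnboundReference(f"no macro named {ref!r}")

def resolve(ref, exported, active_encoding, visible_modules, module_name=None):
    if module_name is not None:  # qualified reference: only that module is consulted
        if module_name not in visible_modules:
            raise UnboundReference(f"no visible module named {module_name!r}")
        return lookup(visible_modules[module_name], ref)
    if isinstance(ref, int):     # anonymous local reference: exported macros only
        return lookup(exported, ref)
    try:                         # unqualified name: exported macros first...
        return lookup(exported, ref)
    except UnboundReference:     # ...then the active encoding module
        return lookup(active_encoding, ref)

active = [("pi", "pi from the active encoding module")]
exported = [("pi", "pi defined in this directive")]
visible = {"$ion_encoding": active}

print(resolve("pi", exported, active, visible))
print(resolve("pi", exported, active, visible, module_name="$ion_encoding"))
```

The first call finds the shadowing definition; the second uses the qualified form to reach the macro in the active encoding module.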
import
import ::= '(import ' module-name catalog-key ')'
module-name ::= unannotated-identifier-symbol
catalog-key ::= catalog-name catalog-version?
catalog-name ::= string
catalog-version ::= int // positive, unannotated
An import binds a lexically scoped module name to a shared module that is identified by a catalog key—a (name, version)
pair. The version
of the catalog key is optional—when omitted, the version is implicitly 1.
In Ion 1.0, imports may be substituted with a different version if an exact match is not found. In Ion 1.1, however, all imports require an exact match to be found in the reader's catalog; if an exact match is not found, the implementation must signal an error.
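The exact-match rule, together with the requirement that an imported module's name be new to the visible modules, can be sketched as follows. This Python model is illustrative only: the catalog is a dictionary keyed by `(name, version)`, and `ModuleError` and `declare_import` are hypothetical names.

```python
class ModuleError(Exception):
    pass

def declare_import(visible_modules, catalog, module_name, catalog_name, version=1):
    """Bind `module_name` to the shared module at exactly (catalog_name, version)."""
    if module_name in visible_modules:
        raise ModuleError(f"module name {module_name!r} is already in use")
    key = (catalog_name, version)  # the version defaults to 1 when omitted
    if key not in catalog:
        # No substitution with a different version: an exact match is required.
        raise ModuleError(f"no exact match in the catalog for {key!r}")
    visible_modules[module_name] = catalog[key]
    return visible_modules

catalog = {
    ("com.example.shared1", 1): "shared1 v1",
    ("com.example.shared2", 2): "shared2 v2",
}
visible = {"$ion": "system module"}
declare_import(visible, catalog, "m1", "com.example.shared1")
declare_import(visible, catalog, "m2", "com.example.shared2", 2)
print(sorted(visible))  # ['$ion', 'm1', 'm2']
```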
module
The module
clause defines a new module that is contained in the current module.
inner-module ::= '(module' module-name import* symbol-table? macro-table? ')'
Inner modules automatically have access to modules previously declared in the containing module using module
or import
.
The new module (and its exported symbols and macros) is available to any following module
, symbol_table
, and
macro_table
clauses in the enclosing container.
See inner modules for full explanation.
symbol_table
A module can define a list of exported symbols by copying symbols from other modules and/or declaring new symbols.
symbol-table ::= '(symbol_table' symbol-table-entry* ')'
symbol-table-entry ::= module-name | symbol-list
symbol-list ::= '[' ( symbol-text ',' )* ']'
symbol-text ::= symbol | string
The symbol_table
clause assembles a list of text values for the module to export.
It takes any number of arguments, each of which may be the name of a visible module or a list of symbol-texts.
The symbol table is a list of symbol-texts built by concatenating the symbol tables of named modules and lists of symbol/string values.
Where a module name occurs, its symbol table is appended. (The module name must refer to another module that is visible to the current module.) Unlike Ion 1.0, no symbol-maxid is needed because Ion 1.1 always requires exact matches for imported modules.
tip
In an encoding directive, the active encoding module $ion_encoding
can be added to the symbol table in order to
retain the symbols from the active encoding module.
$ion_encoding
can occur anywhere in the symbol_table
clause, so in Ion 1.1 it is possible to append and prepend to
the symbol table.
Where a list occurs, it must contain only non-null, unannotated strings and symbols.
The text of these strings and/or symbols is appended to the symbol table.
Upon encountering any non-text value, null value, or annotated value in the list, the implementation shall signal an error.
To add a symbol with unknown text to the symbol table, one may use $0
.
All modules have a symbol table, so when a module has no symbol_table
clause, the module has an empty symbol table.
Symbol zero $0
Symbol zero (i.e. $0
) is a special symbol that is not assigned text by any symbol table, even the system symbol table.
Symbol zero always has unknown text, and can be useful in synthesizing symbol identifiers where the text image of the symbol is not known in a particular operating context.
All symbol tables (even an empty symbol table) can be thought of as implicitly containing $0
.
However, $0
precedes all symbol tables rather than belonging to any symbol table.
When adding the exported symbols from one module to the symbol table of another, the preceding $0
is not copied into the destination symbol table (because it is not part of the source symbol table).
It is important to note that $0
is only semantically equivalent to itself and to locally-declared SIDs with unknown text.
It is not semantically equivalent to SIDs with unknown text from shared symbol tables, so replacing such SIDs with $0
is a destructive operation to the semantics of the data.
Processing
When the symbol_table
clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.
For each arg:
- If the arg is a list of text values, the nested text values are appended to the end of the symbol table being constructed.
  - When $0 appears in the list of text values, this creates a symbol with unknown text.
  - The presence of any other Ion value in the list raises an error.
- If the arg is the name of a module, the symbols in that module's symbol table are appended to the end of the symbol table being constructed.
- If the arg is anything else, the reader must raise an error.
Example
(symbol_table // Constructs an empty symbol table (list)
["a", b, 'c'] // The text values in this list are appended to the table
foo // Module `foo`'s symbol table values are appended to the table
['''g''', "h", i]) // The text values in this list are appended to the table
If module foo
's symbol table were [d, e, f]
, then the symbol table defined by the above clause would be:
["a", "b", "c", "d", "e", "f", "g", "h", "i"]
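The left-to-right processing of the symbol_table clause can be modeled in Python. This sketch makes several simplifying assumptions: a Python list stands for a list of text values, a Python string stands for a module name, `None` stands for `$0` (unknown text), and `build_symbol_table` is a hypothetical name.

```python
UNKNOWN_TEXT = None  # stands in for $0, a symbol with unknown text

def build_symbol_table(args, visible_modules):
    """Process symbol_table arguments from left to right into one list."""
    table = []  # the clause starts from an empty list
    for arg in args:
        if isinstance(arg, list):  # a list of text values
            for text in arg:
                if text is not UNKNOWN_TEXT and not isinstance(text, str):
                    raise ValueError(f"not a text value: {text!r}")
                table.append(text)
        elif isinstance(arg, str):  # the name of a visible module
            if arg not in visible_modules:
                raise ValueError(f"module {arg!r} is not visible")
            table.extend(visible_modules[arg])
        else:
            raise ValueError(f"unexpected symbol_table argument: {arg!r}")
    return table

visible = {"foo": ["d", "e", "f"]}
print(build_symbol_table([["a", "b", "c"], "foo", ["g", "h", "i"]], visible))
# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```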
This is an Ion 1.0 symbol table that imports two shared symbol tables and then declares some symbols of its own.
$ion_1_0
$ion_symbol_table::{
imports: [{ name: "com.example.shared1", version: 1, max_id: 10 },
{ name: "com.example.shared2", version: 2, max_id: 20 }],
symbols: ["s1", "s2"]
}
Here’s the Ion 1.1 equivalent in terms of symbol allocation order:
$ion_1_1
$ion_encoding::(
(import m1 "com.example.shared1" 1)
(import m2 "com.example.shared2" 2)
(symbol_table m1 m2 ["s1", "s2"])
)
macro_table
Macros are declared after symbols.
The macro_table
clause assembles a list of macro definitions for the module to export. It takes any number of arguments.
All modules have a macro table, so when a module has no macro_table
clause, the module has an empty macro table.
Most commonly, a macro table entry is a definition of a new macro expansion function, following this general shape:
(macro name signature template)
When no name is given, this defines an anonymous macro that can be referenced by its numeric
address (that is, its index in the enclosing macro table).
Inside the defining module, that uses a local reference like 12
.
The signature defines the syntactic shape of expressions invoking the macro; see Macro Signatures for details. The template defines the expansion of the macro, in terms of the signature’s parameters; see Template Expressions for details.
Imported macros must be explicitly exported if so desired.
Module names and export
clauses can be intermingled with macro
definitions inside the macro_table
;
together, they determine the bindings that make up the module’s exported macro array.
The module-name export form is shorthand for referencing all exported macros from that module, in their original order with their original names.
An export
clause contains a single macro reference followed by an optional alias for the exported macro.
The referenced macro is appended to the macro table.
tip
No name can be repeated among the exported macros, including macro definitions.
Name conflicts must be resolved by export
s with aliases.
Processing
When the macro_table
clause is encountered, the reader constructs an empty list. The arguments to the clause are then processed from left to right.
For each arg:
- If the arg is a macro clause, the clause is processed and the resulting macro definition is appended to the end of the macro table being constructed.
- If the arg is an export clause, the clause is processed and the referenced macro definition is appended to the end of the macro table being constructed.
- If the arg is the name of a module, the macro definitions in that module's macro table are appended to the end of the macro table being constructed.
- If the arg is anything else, the reader must raise an error.
A macro name is a symbol that can be used to reference a macro, both inside and outside the module. Macro names are optional, and improve legibility when using, writing, and debugging macros. When a name is used, it must be an identifier per Ion’s syntax for symbols. Macro definitions being added to the macro table must have a unique name. If a macro is added whose name conflicts with one already present in the table, the implementation must raise an error.
macro
A macro
clause defines a new macro.
When the macro declaration uses a name, an error must be signaled if it already appears in the exported macro array.
export
An export
clause declares a name for an existing macro and appends the macro to the macro table.
- If the reference to the existing macro is followed by a name, the existing macro is appended to the exported macro array with the latter name instead of the original name, if any. In this way, an anonymous macro can be given a name. An error must be signaled if that name already appears in the exported macro array.
- If the reference to the existing macro is followed by
null
, the macro is appended to the exported macro array without a name, regardless of whether the macro has a name. - If the reference to the existing macro is anonymous, the macro is appended to the exported macro array without a name.
- When the reference to the existing macro uses a name, the name and macro are appended to the exported macro
array. An error must be signaled if that name already appears in the exported macro array.
Module names in macro_table
A module name appends all exported macros from the module to the exported macro array. If any exported macro uses a name that already appears in the exported macro array, an error must be signaled.
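The rules for macro, export, and module-name entries can be combined into one Python sketch. It is a simplified model, not a reference implementation: clauses are represented as tuples, macros as descriptive strings, only qualified export references are handled, and names such as `build_macro_table` and `KEEP_NAME` are hypothetical.

```python
KEEP_NAME = object()  # sentinel: the export clause gave no alias

def append_macro(table, name, macro):
    """Append (name, macro); named entries must be unique within the table."""
    if name is not None and any(existing == name for existing, _ in table):
        raise ValueError(f"macro name {name!r} is already exported")
    table.append((name, macro))

def find(table, ref):
    """Look up a macro by name (str) or address (int)."""
    if isinstance(ref, int):
        return table[ref][1]
    for name, macro in table:
        if name == ref:
            return macro
    raise LookupError(f"no macro named {ref!r}")

def build_macro_table(args, visible_modules):
    table = []
    for arg in args:
        kind = arg[0]
        if kind == "macro":      # ("macro", name_or_None, definition)
            append_macro(table, arg[1], arg[2])
        elif kind == "export":   # ("export", module, ref, alias)
            _, module, ref, alias = arg
            macro = find(visible_modules[module], ref)
            if alias is KEEP_NAME:
                # Named references keep their name; address references stay anonymous.
                name = ref if isinstance(ref, str) else None
            else:
                name = alias     # an explicit alias, or None to drop the name
            append_macro(table, name, macro)
        elif kind == "module":   # ("module", module_name)
            for name, macro in visible_modules[arg[1]]:
                append_macro(table, name, macro)
        else:
            raise ValueError(f"unexpected macro_table argument: {arg!r}")
    return table

visible = {
    "cartesian": [("point2d", "cartesian point2d"), ("polygon", "cartesian polygon")],
    "polar": [("point2d", "polar point2d"), ("polygon", "polar polygon")],
}
print(build_macro_table(
    [("export", "cartesian", "polygon", "cartesian_polygon"),
     ("export", "polar", "polygon", "polar_polygon")],
    visible))
# [('cartesian_polygon', 'cartesian polygon'), ('polar_polygon', 'polar polygon')]
```

Exporting both polygon macros without aliases, or listing both modules by name, would raise an error in this model because the names collide.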
The encoding module
The encoding module is the module that is currently being used to encode the data stream. When the stream begins, the encoding module is the system module.
The application may define a new encoding module by writing an encoding directive at the top level of the stream.
An encoding directive is an s-expression annotated with $ion_encoding
; its nested clauses define a new encoding module.
When the reader advances beyond an encoding directive, the module it defined becomes the new encoding module.
In the context of an encoding directive, the active encoding module is named $ion_encoding
.
The encoding directive may preserve symbols or macros that were defined in the previous encoding directive by referencing $ion_encoding
.
The $ion_encoding
module can only be imported into an encoding directive, and this happens automatically and implicitly.
Examples
An encoding directive
A simple encoding directive—it defines a module that exports three symbols and two macros.
$ion_encoding::(
(symbol_table [
"a", // $1
"b", // $2
"c" // $3
])
(macro_table
(macro pi () 3.14159265)
(macro moon_landing_ts () 1969-07-20T20:17Z)
)
)
Adding symbols to the encoding module
The implicitly imported $ion_encoding
is used to append to the current symbol and macro tables.
$ion_encoding::(
(symbol_table [
"a", // $1
"b", // $2
"c", // $3
])
(macro_table
(macro pi () 3.14159265)
(macro moon_landing_ts () 1969-07-20T20:17Z)
)
)
// ...
$ion_encoding::(
// The first argument of the symbol_table clause is the module name '$ion_encoding',
// which adds the symbols from the active encoding module to the new encoding module.
// The '$ion_encoding' argument in the macro_table clause behaves similarly.
(symbol_table $ion_encoding
[
"d", // $4
"e", // $5
"f", // $6
])
(macro_table $ion_encoding
(macro e () 2.71828182))
)
// ...
Clearing the local symbols and local macros
$ion_encoding::()
The absence of the symbol_table
and macro_table
clauses is interpreted as empty symbol and macro tables.
Note that this is different from the behaviour of an IVM. When an IVM is encountered, the encoding module is set to the system module.
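The difference between an empty encoding directive and an IVM can be made concrete with a small state model. In this Python sketch, the contents of `SYSTEM_MODULE` are placeholders rather than the real system symbols and macros, macros are represented by their names only, and `apply_directive` and `apply_ivm` are hypothetical names.

```python
# Placeholder contents; the real system module is defined elsewhere in the spec.
SYSTEM_MODULE = {"symbols": ["<system symbols>"], "macros": ["<system macros>"]}

def apply_ivm(_current):
    """An IVM resets the encoding module to the system module."""
    return SYSTEM_MODULE

def apply_directive(current, symbol_table=(), macro_table=()):
    """Build a new encoding module; '$ion_encoding' copies from the active one."""
    def expand(args, key):
        entries = []
        for arg in args:
            if arg == "$ion_encoding":
                entries.extend(current[key])
            else:
                entries.extend(arg)
        return entries
    return {"symbols": expand(symbol_table, "symbols"),
            "macros": expand(macro_table, "macros")}

ctx = SYSTEM_MODULE
ctx = apply_directive(ctx, [["a", "b", "c"]], [["pi", "moon_landing_ts"]])
ctx = apply_directive(ctx, ["$ion_encoding", ["d", "e", "f"]], ["$ion_encoding", ["e"]])
print(ctx["symbols"])  # ['a', 'b', 'c', 'd', 'e', 'f']
print(ctx["macros"])   # ['pi', 'moon_landing_ts', 'e']

print(apply_directive(ctx))        # {'symbols': [], 'macros': []}
print(apply_ivm(ctx) is SYSTEM_MODULE)  # True
```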
Shared modules
Shared modules exist independently of the documents that use them. They are identified by a catalog key consisting of a string name and an integer version.
The self-declared catalog-names of shared modules are generally long, since they must be more-or-less globally unique. When imported by another module, they are given local symbolic names by import declarations.
They have a spec version that is explicit via annotation, and a content version derived from the catalog version.
The spec version of a shared module must be declared explicitly using an annotation of the form $ion_1_N
.
This allows the module to be serialized using any version of Ion, and its meaning will not change.
$ion_shared_module::
$ion_1_1::("com.example.symtab" 3
(symbol_table ...)
(macro_table ...) )
Example
An Ion 1.1 shared module.
$ion_shared_module::
$ion_1_1::("org.example.geometry" 2
(symbol_table ["x", "y", "square", "circle"])
(macro_table (macro point2d (x y) { x:(%x), y:(%y) })
(macro polygon (point2d::points+) [(%points)]) )
)
The system module provides a convenient macro (use
) to append a shared module to the encoding module.
$ion_1_1
(:use "org.example.geometry" 2)
(:polygon (:: (1 4) (1 8) (3 6)))
Compatibility with Ion 1.0
Ion 1.0 shared symbol tables are treated as Ion 1.1 shared modules that have an empty macro table.
Inner modules
Inner modules are defined within another module, and can be referenced only within the enclosing module. Their scope is lexical; they can be referenced immediately following their definition, up until the end of the containing module.
Inner modules always have a symbolic name given at the point of definition.
They inherit their spec version from the containing module, and they have no content version.
Inner modules automatically have access to modules previously declared in their containing module using module
or import
.
Inner modules may not contain their own nested inner modules.
Examples
Inner modules can be used to define helper macros and use them by name in the definitions of other macros without having to export the helper macro by name.
$ion_shared_module::$ion_1_1::(
"org.example.Foo" 1
(module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
(macro_table
(export util::0)
(macro y_axis_point (y) (.util::point2d 0 (%y)))
(macro polygon (util::point2d::points+) [(%points)]))
)
In this example, the macro point2d is declared in an inner module.
It is added to the shared module's macro table without a name, and subsequently referenced by name in the definitions of other macros.
Inner modules can also be used for grouping macros into namespaces (only visible within the outer module), and to declare helper macros that are not added to the macro table of the outer module.
$ion_shared_module::$ion_1_1::(
"org.example.Foo" 1
(module cartesian (macro_table (macro point2d (x y) { x:(%x), y:(%y) })
(macro polygon (point2d::points+) [(%points)]) ))
(module polar (macro_table (macro point2d (r phi) { r:(%r), phi:(%phi) })
(macro polygon (point2d::points+) [(%points)]) ))
(macro_table
(export cartesian::polygon cartesian_polygon)
(export polar::polygon polar_polygon))
)
In this example, there are two macros named point2d and two named polygon.
There is no name conflict between them because they are declared in separate namespaces.
Both polygon macros are added to the shared module's macro table, each one given an alias in order to resolve the name conflict.
Neither of the point2d macros needs to be added to the shared module's macro table, because each one can be referenced by the polygon macro defined in the same inner module.
When grouping macros in inner modules, there are more than just organizational benefits. By defining helper macros in an inner module, the order in which the macros are added to the macro table of the outer module does not have to be the same as the order in which the macros are declared:
$ion_shared_module::$ion_1_1::(
"org.example.Foo" 1
// point2d must be declared before polygon...
(module util (macro_table (macro point2d (x y) { x:(%x), y:(%y) })))
(macro_table
// ...because it is used in the definition of polygon
(macro polygon (util::point2d::points+) [(%points)])
// But it can be added to the macro table after polygon
util)
)
Inner modules can also be used for organization of symbols.
$ion_encoding::(
(module dairy (symbol_table [cheese, yogurt, milk]))
(module grains (symbol_table [cereal, bread, rice]))
(module vegetables (symbol_table [carrots, celery, peas]))
(module meat (symbol_table [chicken, mutton, beef]))
(symbol_table dairy
grains
vegetables
meat)
)
The system module
The symbols and macros of the system module $ion are available everywhere within an Ion document, with the version of that module being determined by the spec-version of each segment.
The specific system symbols are largely uninteresting to users; while the binary encoding heavily
leverages the system symbol table, the text encoding that users typically interact with does not.
The system macros are more visible, especially to authors of macros.
This chapter catalogs the system-provided symbols and macros.
The examples below use unqualified names, which works assuming no other macros with the same name are in scope. The unambiguous form $ion::macro-name is always available to use in the template definition language.
Relation to local symbol and macro tables
In Ion 1.0, the system symbol table is always the first import of the local symbol table.
However, in Ion 1.1, the system symbol and macro tables have a system address space that is distinct from the local address space.
When starting an Ion 1.1 segment (i.e. immediately after encountering an $ion_1_1 version marker), the local symbol table is prepopulated with the system symbols[1].
The local macro table is also prepopulated with the system macros.
However, the system symbols and macros are not permanent fixtures of the local symbol and macro tables respectively.
When a local macro has the same name as a system macro, it shadows the system macro.
In TDL, it is still possible to invoke a shadowed system macro by using a qualified name, such as $ion::make_string.
If a macro in the active local macro table has the same name as a system macro, it is impossible to invoke that system macro by name using an E-Expression.
(It is still possible to invoke the system macro if the local macro table has assigned an alias for that system macro.)
System Symbols
The Ion 1.1 System Symbol table replaces rather than extends the Ion 1.0 System Symbol table. The system symbols are as follows:
ID | Hex | Text |
---|---|---|
0 | 0x00 | <reserved> |
1 | 0x01 | $ion |
2 | 0x02 | $ion_1_0 |
3 | 0x03 | $ion_symbol_table |
4 | 0x04 | name |
5 | 0x05 | version |
6 | 0x06 | imports |
7 | 0x07 | symbols |
8 | 0x08 | max_id |
9 | 0x09 | $ion_shared_symbol_table |
10 | 0x0A | $ion_encoding |
11 | 0x0B | $ion_literal |
12 | 0x0C | $ion_shared_module |
13 | 0x0D | macro |
14 | 0x0E | macro_table |
15 | 0x0F | symbol_table |
16 | 0x10 | module |
17 | 0x11 | see ion-docs#345 |
18 | 0x12 | export |
19 | 0x13 | see ion-docs#345 |
20 | 0x14 | import |
21 | 0x15 | zero-length text (i.e. '' ) |
22 | 0x16 | literal |
23 | 0x17 | if_none |
24 | 0x18 | if_some |
25 | 0x19 | if_single |
26 | 0x1A | if_multi |
27 | 0x1B | for |
28 | 0x1C | default |
29 | 0x1D | values |
30 | 0x1E | annotate |
31 | 0x1F | make_string |
32 | 0x20 | make_symbol |
33 | 0x21 | make_blob |
34 | 0x22 | make_decimal |
35 | 0x23 | make_timestamp |
36 | 0x24 | make_list |
37 | 0x25 | make_sexp |
38 | 0x26 | make_struct |
39 | 0x27 | parse_ion |
40 | 0x28 | repeat |
41 | 0x29 | delta |
42 | 0x2A | flatten |
43 | 0x2B | sum |
44 | 0x2C | set_symbols |
45 | 0x2D | add_symbols |
46 | 0x2E | set_macros |
47 | 0x2F | add_macros |
48 | 0x30 | use |
49 | 0x31 | meta |
50 | 0x32 | flex_symbol |
51 | 0x33 | flex_int |
52 | 0x34 | flex_uint |
53 | 0x35 | uint8 |
54 | 0x36 | uint16 |
55 | 0x37 | uint32 |
56 | 0x38 | uint64 |
57 | 0x39 | int8 |
58 | 0x3A | int16 |
59 | 0x3B | int32 |
60 | 0x3C | int64 |
61 | 0x3D | float16 |
62 | 0x3E | float32 |
63 | 0x3F | float64 |
64 | 0x40 | none |
65 | 0x41 | make_field |
In Ion 1.1 Text, system symbols can never be referenced by symbol ID; $1 always refers to the first symbol in the user symbol table.
This allows the Ion 1.1 system symbol table to be relatively large without taking away SID space from the user symbol table.
System Macros
ID | Hex | Text |
---|---|---|
0 | 0x00 | none |
1 | 0x01 | values |
2 | 0x02 | annotate |
3 | 0x03 | make_string |
4 | 0x04 | make_symbol |
5 | 0x05 | make_blob |
6 | 0x06 | make_decimal |
7 | 0x07 | make_timestamp |
8 | 0x08 | make_list |
9 | 0x09 | make_sexp |
10 | 0x0A | make_struct |
11 | 0x0B | set_symbols |
12 | 0x0C | add_symbols |
13 | 0x0D | set_macros |
14 | 0x0E | add_macros |
15 | 0x0F | use |
16 | 0x10 | parse_ion |
17 | 0x11 | repeat |
18 | 0x12 | delta |
19 | 0x13 | flatten |
20 | 0x14 | sum |
21 | 0x15 | meta |
22 | 0x16 | make_field |
23 | 0x17 | default |
[1] System symbols require the same number of bytes whether they are encoded using the system symbol or the user symbol encoding. There are two reasons the system symbols are initially loaded into the user symbol table: consistency with loading the system macros into user space, and allowing implementors to start testing user symbols even before they have implemented support for reading encoding directives.
Ion 1.1 Binary Encoding
A binary Ion stream consists of an Ion version marker followed by a series of value literals and/or encoding expressions.
Both value literals and e-expressions begin with an opcode that indicates what the next expression represents and how the bytes that follow should be interpreted.
Primitives
This section describes Ion 1.1's binary encoding primitives--reusable building blocks that can be combined to represent more complex constructs.
Name | Type | Width |
---|---|---|
FixedUInt | int | Determined by context |
FixedInt | int | Determined by context |
FlexUInt | int | Variable, self-delimiting |
FlexInt | int | Variable, self-delimiting |
FlexSym | symbol | Variable, self-delimiting |
FlexUInt
A variable-length unsigned integer.
The bytes of a FlexUInt are written in little-endian byte order. This means that the first bytes will contain the FlexUInt's least significant bits.
The least significant bits in the FlexUInt indicate the number of bytes that were used to encode the integer. If a FlexUInt is N bytes long, its N-1 least significant bits will be 0; a terminal 1 bit will be in the next most significant position.
All bits that are more significant than the terminal 1 represent the magnitude of the FlexUInt.
FlexUInt encoding of 14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
unsigned int 14
FlexUInt encoding of 729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 1
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the unsigned of the unsigned
integer integer
FlexUInt encoding of 21,043
┌───── There are 2 zeros in the least significant bits, so this
│ integer is three bytes wide.
┌─┴─┐
1 0 0 1 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
└───┬───┘ └──────┬──────┘ └──────┬──────┘
lowest 5 bits next 8 bits of highest 8 bits
of the unsigned the unsigned of the unsigned
integer integer integer
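The layout described above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation; the function names `encode_flex_uint` and `decode_flex_uint` are hypothetical and do not come from any Ion library. It assumes that the run of low zero bits may continue past the first byte for encodings longer than 8 bytes.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (illustrative sketch)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    # An N-byte FlexUInt spends N bits on the length marker (N-1 zeros and a
    # terminal 1), leaving 7 bits of magnitude per byte.
    num_bytes = max(1, -(-value.bit_length() // 7))
    encoded = (value << num_bytes) | (1 << (num_bytes - 1))
    return encoded.to_bytes(num_bytes, "little")


def decode_flex_uint(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the FlexUInt at the start of data."""
    num_bytes = 1
    for byte in data:
        if byte == 0:
            # All eight bits are part of the run of low zero bits.
            num_bytes += 8
            continue
        # Index of the lowest set bit = number of remaining low zero bits.
        num_bytes += (byte & -byte).bit_length() - 1
        break
    else:
        raise ValueError("no terminal 1 bit found")
    if len(data) < num_bytes:
        raise ValueError("truncated FlexUInt")
    raw = int.from_bytes(data[:num_bytes], "little")
    # Discard the length marker; what remains is the magnitude.
    return raw >> num_bytes, num_bytes
```

Applied to the three diagrams above, the sketch yields `1D`, `66 0B`, and `9C 91 02` for 14, 729, and 21,043 respectively.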
FlexInt
A variable-length signed integer.
From an encoding perspective, FlexInts are structurally similar to FlexUInts. Both encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate how many bytes were used to encode the integer. They differ in the interpretation of their bits; while a FlexUInt's bits are unsigned, a FlexInt's bits are encoded using two's complement notation.
TIP: An implementation could choose to read a FlexInt by instead reading a FlexUInt and then reinterpreting its bits as two's complement.
FlexInt encoding of 14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
2's comp. 14
FlexInt encoding of -14
┌──── Lowest bit is 1 (end), indicating
│ this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
2's comp. -14
FlexInt encoding of 729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 1
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the 2's of the 2's
comp. integer comp. integer
FlexInt encoding of -729
┌──── There's 1 zero in the least significant bits, so this
│ integer is two bytes wide.
┌┴┐
1 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0
└────┬────┘ └──────┬──────┘
lowest 6 bits highest 8 bits
of the 2's of the 2's
comp. integer comp. integer
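A minimal Python sketch of the FlexInt rules above follows. The names `encode_flex_int` and `decode_flex_int` are hypothetical. The decoder relies on Python's arithmetic right shift to preserve the sign after the length marker is discarded.

```python
def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (illustrative sketch)."""
    # Find the smallest N whose 7*N payload bits hold value in two's complement.
    num_bytes = 1
    while not -(1 << (7 * num_bytes - 1)) <= value < (1 << (7 * num_bytes - 1)):
        num_bytes += 1
    payload = value & ((1 << (7 * num_bytes)) - 1)
    encoded = (payload << num_bytes) | (1 << (num_bytes - 1))
    return encoded.to_bytes(num_bytes, "little")


def decode_flex_int(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the FlexInt at the start of data."""
    num_bytes = 1
    for byte in data:
        if byte == 0:
            num_bytes += 8
            continue
        num_bytes += (byte & -byte).bit_length() - 1
        break
    else:
        raise ValueError("no terminal 1 bit found")
    if len(data) < num_bytes:
        raise ValueError("truncated FlexInt")
    # Interpret the whole encoding as two's complement, then shift the
    # length marker away; Python's >> is an arithmetic (sign-keeping) shift.
    raw = int.from_bytes(data[:num_bytes], "little", signed=True)
    return raw >> num_bytes, num_bytes
```

For the four diagrams above, the sketch produces `1D`, `E5`, `66 0B`, and `9E F4`.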
FixedUInt
A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.
FixedUInt encoding of 3,954,261
0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
lowest 8 bits next 8 bits of highest 8 bits
of the unsigned the unsigned of the unsigned
integer integer integer
FixedInt
A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two's complement.
FixedInt encoding of -3,954,261
1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 1
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
lowest 8 bits next 8 bits of highest 8 bits
of the 2's the 2's comp. of the 2's comp.
comp. integer integer integer
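Because the width of a FixedUInt or FixedInt comes from context, both map directly onto Python's built-in `int.to_bytes` and `int.from_bytes`. The wrapper names below are hypothetical and serve only to illustrate the two examples above.

```python
def encode_fixed_uint(value: int, width: int) -> bytes:
    """Little-endian unsigned integer of exactly `width` bytes."""
    return value.to_bytes(width, "little", signed=False)


def encode_fixed_int(value: int, width: int) -> bytes:
    """Little-endian two's complement integer of exactly `width` bytes."""
    return value.to_bytes(width, "little", signed=True)


def decode_fixed_uint(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=False)


def decode_fixed_int(data: bytes) -> int:
    return int.from_bytes(data, "little", signed=True)
```

`to_bytes` raises `OverflowError` when the value does not fit in the requested width, which is a convenient check for a writer that has chosen the wrong size.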
FlexSym
A variable-length symbol token whose UTF-8 bytes can be inline, found in the symbol table, or derived from a macro expansion.
A FlexSym begins with a FlexInt; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:
- greater than zero, it represents a symbol ID. The symbol's associated text can be found in the local symbol table. No more bytes follow.
- less than zero, its absolute value represents a number of UTF-8 bytes that follow the FlexInt. These bytes represent the symbol's text.
- exactly zero, another byte follows that is a FlexSymOpCode.
FlexSym encoding of symbol ID $10
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│
0 0 0 1 0 1 0 1
└─────┬─────┘
2's comp.
positive 10
FlexSym encoding of symbol text 'hello'
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│ h e l l o
1 1 1 1 0 1 1 1 01101000 01100101 01101100 01101100 01101111
└─────┬─────┘ └─────────────────────┬─────────────────────┘
2's comp. 5-byte UTF-8 encoded "hello"
negative 5
FlexSymOpCode
FlexSymOpCodes are a combination of system symbols and a subset of the general opcodes.
The FlexSym parser is not responsible for evaluating a FlexSymOpCode, only for returning it; the caller decides whether the opcode is legal in the current context.
Example usages of the FlexSymOpCode include:
- Representing SID $0
- Representing system symbols
  - Note that the empty symbol (i.e. the symbol '') is now a system symbol and can be referenced this way.
- When used to encode a struct field name, the opcode can invoke a macro that will evaluate to a struct whose key/value pairs are spliced into the parent struct.
- In a delimited struct, terminating the sequence of (field name, value) pairs with 0xF0.
OpCode Byte | Meaning | Additional Notes |
---|---|---|
0x00 - 0x5F | E-Expression | May be used when the FlexSym occurs in the field name position of any struct |
0x60 | Symbol with unknown text (also known as $0 ) | |
0x61 - 0xDF | System SID (with 0x60 bias) | While the range of 0x61 - 0xDF is reserved for system symbols, not all of these bytes correspond to a system symbol. See system symbols for the list of system symbols. |
0xEE | System symbol | |
0xEF | E-Expression invoking a system macro | May be used when the FlexSym occurs in the field name position of any struct |
0xF0 | Delimited container end marker | May only be used when the FlexSym occurs in the field name position of a delimited struct |
0xF5 | Length-prefixed macro invocation | May be used when the FlexSym occurs in the field name position of any struct |
FlexSym encoding of '' (empty text) using an opcode
┌─── The leading FlexInt ends in a `1`,
│ no more FlexInt bytes follow.
│
0 0 0 0 0 0 0 1 01110101
└─────┬─────┘ └───┬──┘
2's comp. FixedUInt 0x75,
zero System SID 21
(the empty symbol)
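The three-way dispatch on the leading FlexInt can be sketched as below. This is only an illustration: `read_flex_sym` and its tagged-tuple return value are hypothetical, and, as the text says, the opcode byte is returned to the caller unevaluated.

```python
def _decode_flex_int(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for a leading FlexInt."""
    num_bytes = 1
    for byte in data:
        if byte == 0:
            num_bytes += 8
            continue
        num_bytes += (byte & -byte).bit_length() - 1
        break
    else:
        raise ValueError("no terminal 1 bit found")
    if len(data) < num_bytes:
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[:num_bytes], "little", signed=True)
    return raw >> num_bytes, num_bytes


def read_flex_sym(data: bytes):
    """Return ((kind, payload), bytes_consumed) for a leading FlexSym.

    kind is "sid", "text", or "opcode".
    """
    value, consumed = _decode_flex_int(data)
    if value > 0:
        # Positive: a symbol ID; no more bytes follow.
        return ("sid", value), consumed
    if value < 0:
        # Negative: the absolute value is the UTF-8 byte length of inline text.
        end = consumed - value
        if len(data) < end:
            raise ValueError("truncated symbol text")
        return ("text", data[consumed:end].decode("utf-8")), end
    # Zero: the next byte is a FlexSymOpCode for the caller to interpret.
    if len(data) <= consumed:
        raise ValueError("missing FlexSymOpCode")
    return ("opcode", data[consumed]), consumed + 1
```

For example, `15` reads as symbol ID 10, `F7` followed by the five UTF-8 bytes of "hello" reads as inline text, and `01 60` reads as opcode `0x60` (the symbol with unknown text).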
Opcodes
An opcode is a 1-byte FixedUInt that tells the reader what the next expression represents and how the bytes that follow should be interpreted.
The meanings of each opcode are organized loosely by their high and low nibbles.
High nibble | Low nibble | Meaning |
---|---|---|
0x0_ to 0x3_ | 0-F | E-expression with 6-bit address |
0x4_ | 0-F | E-expression with 12-bit address |
0x5_ | 0-F | E-expression with 20-bit address |
0x6_ | 0-8 | Integers from 0 to 8 bytes wide |
0x6_ | 9 | Reserved |
0x6_ | A-D | Floats |
0x6_ | E-F | Booleans |
0x7_ | 0-F | Decimals |
0x8_ | 0-C | Short-form timestamps |
0x8_ | D-F | Reserved |
0x9_ | 0-F | Strings |
0xA_ | 0-F | Symbols with inline text |
0xB_ | 0-F | Lists |
0xC_ | 0-F | S-expressions |
0xD_ | 0 | Empty struct |
0xD_ | 1 | Reserved |
0xD_ | 2-F | Structs |
0xE_ | 0 | Ion version marker |
0xE_ | 1-3 | Symbols with symbol address |
0xE_ | 4-6 | Annotations with symbol address |
0xE_ | 7-9 | Annotations with FlexSym text |
0xE_ | A | null.null |
0xE_ | B | Typed nulls |
0xE_ | C-D | NOP |
0xE_ | E | System symbol |
0xE_ | F | System macro invocation |
0xF_ | 0 | Delimited container end |
0xF_ | 1 | Delimited list start |
0xF_ | 2 | Delimited S-expression start |
0xF_ | 3 | Delimited struct start |
0xF_ | 4 | E-expression with FlexUInt macro address |
0xF_ | 5 | E-expression with FlexUInt length prefix |
0xF_ | 6 | Integer with FlexUInt length prefix |
0xF_ | 7 | Decimal with FlexUInt length prefix |
0xF_ | 8 | Timestamp with FlexUInt length prefix |
0xF_ | 9 | String with FlexUInt length prefix |
0xF_ | A | Symbol with FlexUInt length prefix and inline text |
0xF_ | B | List with FlexUInt length prefix |
0xF_ | C | S-expression with FlexUInt length prefix |
0xF_ | D | Struct with FlexUInt length prefix |
0xF_ | E | Blob with FlexUInt length prefix |
0xF_ | F | Clob with FlexUInt length prefix |
Values
Nulls
The opcode 0xEA indicates an untyped null (that is: null, or its alias null.null).
The opcode 0xEB indicates a typed null; a byte follows whose value represents an offset into the following table:
Byte | Type |
---|---|
0x00 | null.bool |
0x01 | null.int |
0x02 | null.float |
0x03 | null.decimal |
0x04 | null.timestamp |
0x05 | null.string |
0x06 | null.symbol |
0x07 | null.blob |
0x08 | null.clob |
0x09 | null.list |
0x0A | null.sexp |
0x0B | null.struct |
All other byte values are reserved for future use.
Encoding of null
┌──── The opcode `0xEA` represents a null (null.null)
EA
Encoding of null.string
┌──── The opcode `0xEB` indicates a typed null; a byte indicating the type follows
│ ┌──── Byte 0x05 indicates the type `string`
EB 05
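The null encodings amount to a single table lookup, as the sketch below illustrates. The name `decode_null` is hypothetical, and the sketch returns the Ion text form of the null rather than a typed value.

```python
NULL_TYPES = (
    "bool", "int", "float", "decimal", "timestamp", "string",
    "symbol", "blob", "clob", "list", "sexp", "struct",
)


def decode_null(data: bytes) -> tuple[str, int]:
    """Return (ion_text, bytes_consumed) for a null at the start of data."""
    if data[0] == 0xEA:
        return "null", 1
    if data[0] == 0xEB:
        type_index = data[1]
        if type_index >= len(NULL_TYPES):
            # All other byte values are reserved for future use.
            raise ValueError(f"reserved null type byte 0x{type_index:02X}")
        return f"null.{NULL_TYPES[type_index]}", 2
    raise ValueError("not a null opcode")
```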
Booleans
0x6E represents boolean true, while 0x6F represents boolean false.
0xEB 0x00 represents null.bool.
Encoding of boolean true
6E
Encoding of boolean false
6F
Encoding of null.bool
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: boolean
│ │
EB 00
Integers
Opcodes in the range 0x60 to 0x68 represent an integer. The opcode is followed by a FixedInt that represents the integer value. The low nibble of the opcode (0x_0 to 0x_8) indicates the size of the FixedInt.
Opcode 0x60 represents integer 0; no more bytes follow.
Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF6, followed by a FlexUInt indicating how many bytes of representation data follow.
0xEB 0x01 represents null.int.
Encoding of integer 0
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 0 indicates
││ no more bytes follow.
60
Encoding of integer 17
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 1 indicates
││ a single byte follows.
61 11
└── FixedInt 17
Encoding of integer -944
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 2 indicates
││ that two bytes follow.
62 50 FC
└─┬─┘
FixedInt -944
Variable-length encoding of integer -944
┌──── Opcode F6 indicates a variable-length integer, FlexUInt length follows
│ ┌─── FlexUInt 2; a 2-byte FixedInt follows
│ │
F6 05 50 FC
└─┬─┘
FixedInt -944
Encoding of null.int
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: integer
│ │
EB 01
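A writer that always picks the shortest form could be sketched as follows. This is one possible approach, not a required one: as the `F6` example above shows, the variable-length opcode is also legal for small values. The names `encode_int` and `_encode_flex_uint` are hypothetical.

```python
def _encode_flex_uint(value: int) -> bytes:
    num_bytes = max(1, -(-value.bit_length() // 7))
    encoded = (value << num_bytes) | (1 << (num_bytes - 1))
    return encoded.to_bytes(num_bytes, "little")


def encode_int(value: int) -> bytes:
    """Encode an Ion integer using the most compact opcode (sketch)."""
    if value == 0:
        return b"\x60"  # no FixedInt bytes follow
    # Bits needed for the magnitude; one more bit is needed for the sign.
    magnitude_bits = value.bit_length() if value > 0 else (~value).bit_length()
    width = magnitude_bits // 8 + 1
    body = value.to_bytes(width, "little", signed=True)
    if width <= 8:
        # The low nibble of the opcode is the FixedInt width.
        return bytes([0x60 + width]) + body
    # Wider integers use 0xF6 followed by a FlexUInt byte count.
    return b"\xF6" + _encode_flex_uint(width) + body
```

With this sketch, 17 becomes `61 11` and -944 becomes `62 50 FC`, matching the fixed-width examples above.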
Floats
Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:
- 0 bits (0 bytes), representing the value 0e0 and indicated by opcode 0x6A
- 16 bits (2 bytes in little-endian order, half-precision), indicated by opcode 0x6B
- 32 bits (4 bytes in little-endian order, single precision), indicated by opcode 0x6C
- 64 bits (8 bytes in little-endian order, double precision), indicated by opcode 0x6D
note
In the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.
0xEB 0x02 represents null.float.
Encoding of float 0e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble A indicates
││ a 0-length float; 0e0
6A
Encoding of float 3.14e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble B indicates a 2-byte float
││
6B 47 42
└─┬─┘
half-precision 3.14
Encoding of float 3.1415927e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble C indicates a 4-byte,
││ single-precision value.
6C DB 0F 49 40
└────┬────┘
single-precision 3.1415927
Encoding of float 3.141592653589793e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble D indicates an 8-byte,
││ double-precision value.
6D 18 2D 44 54 FB 21 09 40
└──────────┬──────────┘
double-precision 3.141592653589793
Encoding of null.float
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: float
│ │
EB 02
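The size selection permitted by the note above (use fewer than 64 bits only when no information is lost) can be sketched with Python's `struct` module, whose `e`, `f`, and `d` formats correspond to IEEE-754 half, single, and double precision. The name `encode_float` is hypothetical, and the round-trip test is just one way an implementation might decide losslessness.

```python
import math
import struct


def encode_float(value: float) -> bytes:
    """Encode a float in the smallest size that round-trips exactly (sketch)."""
    # 0x6A represents 0e0 only; negative zero must keep its sign bit.
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"
    for opcode, fmt in ((0x6B, "<e"), (0x6C, "<f")):
        try:
            packed = struct.pack(fmt, value)
        except (OverflowError, struct.error):
            continue  # magnitude too large for this width
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed
    # NaN never compares equal, so it always falls through to 8 bytes here.
    return b"\x6D" + struct.pack("<d", value)
```

Note that a value such as 3.14 has no exact half- or single-precision representation, so this sketch writes it with opcode `0x6D`.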
Decimals
If an opcode has a high nibble of 0x7_, it represents a decimal. Low nibble values indicate the number of trailing bytes used to encode the decimal.
The body of the decimal is encoded as a FlexInt representing its exponent, followed by a FixedInt representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0. When the coefficient is present but has a value of 0, the coefficient is -0.
Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF7.
0xEB 0x03 represents null.decimal.
Encoding of decimal 0d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 0 indicates a zero-byte
││ decimal; 0d0
70
Encoding of decimal 7d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 01 07
| └─── Coefficient: 1-byte FixedInt 7
└─── Exponent: FlexInt 0
Encoding of decimal 1.27
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 FD 7F
| └─── Coefficient: FixedInt 127
└─── Exponent: 1-byte FlexInt -2
Variable-length encoding of decimal 1.27
┌──── Opcode F7 indicates a variable-length decimal
│
F7 05 FD 7F
| | └─── Coefficient: FixedInt 127
| └───── Exponent: 1-byte FlexInt -2
└─────── Decimal length: FlexUInt 2
Encoding of 0d3, which has a coefficient of zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 1 indicates a 1-byte decimal
││
71 07
└────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0
Encoding of -0d3, which has a coefficient of negative zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 07 00
| └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0
└────── Exponent: FlexInt 3
Encoding of null.decimal
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: decimal
│ │
EB 03
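The coefficient rules (absent means 0, a present zero byte means -0) are easy to get wrong, so a sketch may help. The one below uses Python's `decimal.Decimal`, whose `as_tuple()` exposes sign, digits, and exponent separately. The name `encode_decimal` and the helpers are hypothetical, and the sketch always chooses the shortest coefficient width.

```python
from decimal import Decimal


def _flex_uint(value: int) -> bytes:
    n = max(1, -(-value.bit_length() // 7))
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def _flex_int(value: int) -> bytes:
    n = 1
    while not -(1 << (7 * n - 1)) <= value < (1 << (7 * n - 1)):
        n += 1
    payload = value & ((1 << (7 * n)) - 1)
    return ((payload << n) | (1 << (n - 1))).to_bytes(n, "little")


def _fixed_int(value: int) -> bytes:
    bits = value.bit_length() if value >= 0 else (~value).bit_length()
    return value.to_bytes(bits // 8 + 1, "little", signed=True)


def encode_decimal(value: Decimal) -> bytes:
    """Encode a finite Decimal as an Ion 1.1 decimal (illustrative sketch)."""
    sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):
        raise ValueError("NaN and infinity are not Ion decimals")
    coefficient = int("".join(map(str, digits)))
    if coefficient == 0:
        # Positive zero: omit the coefficient. Negative zero: one 0x00 byte.
        coefficient_bytes = b"\x00" if sign else b""
    else:
        coefficient_bytes = _fixed_int(-coefficient if sign else coefficient)
    if exponent == 0 and not coefficient_bytes:
        body = b""  # 0d0 is the zero-byte decimal
    else:
        body = _flex_int(exponent) + coefficient_bytes
    if len(body) <= 15:
        return bytes([0x70 + len(body)]) + body
    return b"\xF7" + _flex_uint(len(body)) + body
```

`Decimal("1.27")` encodes as `72 FD 7F`, and `Decimal("-0E+3")` encodes as `72 07 00`, matching the examples above.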
Timestamps
Timestamps have two encodings:
- Short-form timestamps, a compact representation optimized for the most commonly used precisions and date ranges.
- Long-form timestamps, a less compact representation capable of representing any timestamp in the Ion data model.
0xEB 0x04 represents null.timestamp.
Encoding of null.timestamp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: timestamp
│ │
EB 04
note
In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time.
Short-form Timestamps
If an opcode has a high nibble of 0x8_, it represents a short-form timestamp. This encoding focuses on making the most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via the variable-length long-form timestamp encoding.
Timestamps may be encoded using the short form if they meet all of the following conditions:
- The year is between 1970 and 2097. The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.
- The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes. 7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2024-08T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.
- The fractional seconds are a common precision. The timestamp's fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).
Opcodes by precision and offset
Each opcode with a high nibble of 0x8_ indicates a different precision and offset encoding pair.
Opcode | Precision | Serialized size in bytes [1] | Offset encoding |
---|---|---|---|
0x80 | Year | 1 | Implicitly Unknown offset |
0x81 | Month | 2 | Implicitly Unknown offset |
0x82 | Day | 2 | Implicitly Unknown offset |
0x83 | Hour and minutes | 4 | 1 bit to indicate UTC or Unknown Offset |
0x84 | Seconds | 5 | 1 bit to indicate UTC or Unknown Offset |
0x85 | Milliseconds | 6 | 1 bit to indicate UTC or Unknown Offset |
0x86 | Microseconds | 7 | 1 bit to indicate UTC or Unknown Offset |
0x87 | Nanoseconds | 8 | 1 bit to indicate UTC or Unknown Offset |
0x88 | Hour and minutes | 5 | 7 bits to represent a known offset [2] |
0x89 | Seconds | 5 | 7 bits to represent a known offset [2] |
0x8A | Milliseconds | 7 | 7 bits to represent a known offset [2] |
0x8B | Microseconds | 8 | 7 bits to represent a known offset [2] |
0x8C | Nanoseconds | 9 | 7 bits to represent a known offset [2] |
0x8D | Reserved | -- | -- |
0x8E | Reserved | -- | -- |
0x8F | Reserved | -- | -- |
[1] Serialized size in bytes does not include the opcode.
[2] This encoding can also represent UTC and Unknown Offset, though it is less compact than opcodes 0x83-0x87 above.
The body of a short-form timestamp is encoded as a FixedUInt of the size specified by the opcode. This integer is then partitioned into bit-fields representing the timestamp's subfields. Note that endianness does not apply here because the bit-fields are defined over the body interpreted as an integer.
The following letters are used to denote bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants, and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.
The Month and Day subfields are one-based; 0 is not a valid month or day.
Letter code | Number of bits | Subfield |
---|---|---|
Y | 7 | Year |
M | 4 | Month |
D | 5 | Day |
H | 5 | Hour |
m | 6 | Minute |
o | 7 | Offset |
U | 1 | Unknown (0 ) or UTC (1 ) offset |
s | 6 | Second |
f | 10 (ms) 20 (μs) 30 (ns) | Fractional second |
. | n/a | Unused |
We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.
7 0 <--- bit position
| |
+=========+
byte 0 | 0xNN | <-- hex notation for constants like opcodes
+=========+ <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
1 |nnnn:nnnn| <-- bits denoted with a `:` as a delimiter to aid in reading
+---------+ <-- octet boundary within an encoding primitive
...
+---------+
N |nnnn:nnnn|
+=========+
The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant.)
note
While this encoding may complicate human reading, it guarantees that the timestamp's subfields (year, month, etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield, fractional_seconds, always begins at the same bit index when present, but can vary in length according to the precision.) This arrangement allows processors to read the little-endian bytes into an integer and then mask the appropriate bit ranges to access the subfields.
Encoding of a timestamp with year precision
+=========+
byte 0 | 0x80 |
+=========+
1 |.YYY:YYYY|
+=========+
Encoding of a timestamp with month precision
+=========+
byte 0 | 0x81 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |....:.MMM|
+=========+
Encoding of a timestamp with day precision
+=========+
byte 0 | 0x82 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+=========+
Encoding of a timestamp with hour-and-minutes precision at UTC or unknown offset
+=========+
byte 0 | 0x83 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |....:Ummm|
+=========+
Encoding of a timestamp with seconds precision at UTC or unknown offset
+=========+
byte 0 | 0x84 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |....:..ss|
+=========+
Encoding of a timestamp with milliseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x85 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |....:ffff|
+=========+
Encoding of a timestamp with microseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x86 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |ffff:ffff|
+---------+
7 |..ff:ffff|
+=========+
Encoding of a timestamp with nanoseconds precision at UTC or unknown offset
+=========+
byte 0 | 0x87 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |ssss:Ummm|
+---------+
5 |ffff:ffss|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |ffff:ffff|
+=========+
Encoding of a timestamp with hour-and-minutes precision at known offset
+=========+
byte 0 | 0x88 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |....:..oo|
+=========+
Encoding of a timestamp with seconds precision at known offset
+=========+
byte 0 | 0x89 |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+=========+
Encoding of a timestamp with milliseconds precision at known offset
+=========+
byte 0 | 0x8A |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |....:..ff|
+=========+
Encoding of a timestamp with microseconds precision at known offset
+=========+
byte 0 | 0x8B |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |....:ffff|
+=========+
Encoding of a timestamp with nanoseconds precision at known offset
+=========+
byte 0 | 0x8C |
+=========+
1 |MYYY:YYYY|
+---------+
2 |DDDD:DMMM|
+---------+
3 |mmmH:HHHH|
+---------+
4 |oooo:ommm|
+---------+
5 |ssss:ssoo|
+---------+
6 |ffff:ffff|
+---------+
7 |ffff:ffff|
+---------+
8 |ffff:ffff|
+---------+
9 |..ff:ffff|
+=========+
Examples of short-form timestamps
Text | Binary |
---|---|
2023T | 80 35 |
2023-10-15T | 82 35 7D |
2023-10-15T11:22:33Z | 84 35 7D CB 1A 02 |
2023-10-15T11:22:33-00:00 | 84 35 7D CB 12 02 |
2023-10-15T11:22:33+01:15 | 89 35 7D CB 2A 84 |
2023-10-15T11:22:33.444555666+01:15 | 8C 35 7D CB 2A 84 92 61 7F 1A |
warning
Opcodes 0x8D, 0x8E, and 0x8F are illegal; they are reserved for future use.
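The bit-field packing for the UTC-or-unknown-offset opcodes (`0x80` through `0x84`) can be sketched as below. The sketch is deliberately partial: it omits fractional seconds and the known-offset opcodes, and the name `encode_short_timestamp` and its parameters are hypothetical. Bit positions follow the diagrams above (Y at bit 0, M at bit 7, D at bit 11, H at bit 16, m at bit 21, U at bit 27, s at bit 28).

```python
def encode_short_timestamp(year, month=None, day=None, hour=None,
                           minute=None, second=None, utc=False) -> bytes:
    """Encode a short-form timestamp at UTC or unknown offset (sketch)."""
    if not 1970 <= year <= 2097:
        raise ValueError("year is outside the short-form range")
    if (hour is None) != (minute is None):
        raise ValueError("hour and minute must be given together")
    given = [f is not None for f in (month, day, hour, second)]
    if given != sorted(given, reverse=True):
        raise ValueError("a finer field was given without the coarser ones")

    bits = year - 1970                 # Y: 7 bits, biased by 1970
    opcode, size = 0x80, 1
    if month is not None:
        bits |= month << 7             # M: 4 bits
        opcode, size = 0x81, 2
    if day is not None:
        bits |= day << 11              # D: 5 bits
        opcode, size = 0x82, 2
    if hour is not None:
        bits |= hour << 16             # H: 5 bits
        bits |= minute << 21           # m: 6 bits
        bits |= int(utc) << 27         # U: 1 = UTC, 0 = unknown offset
        opcode, size = 0x83, 4
    if second is not None:
        bits |= second << 28           # s: 6 bits
        opcode, size = 0x84, 5
    # The body is a FixedUInt, so the integer is written little-endian.
    return bytes([opcode]) + bits.to_bytes(size, "little")
```

For instance, `encode_short_timestamp(2023, 10, 15, 11, 22, 33, utc=True)` yields `84 35 7D CB 1A 02`, the same bytes as the `2023-10-15T11:22:33Z` row in the examples table.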
Long-form Timestamps
Unlike the short-form encoding, which is optimized for the most commonly used timestamp ranges and precisions, the long-form timestamp encoding is capable of representing any valid timestamp.
The long form begins with opcode 0xF8. A FlexUInt follows indicating the number of bytes that were needed to represent the timestamp. The encoding consumes the minimum number of bytes required to represent the timestamp. The declared length can be mapped to the timestamp's precision as follows:
Length | Corresponding precision |
---|---|
0 | Illegal |
1 | Illegal |
2 | Year |
3 | Month or Day (see below) |
4 | Illegal; the hour cannot be specified without also specifying minutes |
5 | Illegal |
6 | Minutes |
7 | Seconds |
8 or more | Fractional seconds |
Unlike the short-form encoding, the long-form encoding reserves:
- 14 bits for the year (Y), which is not biased.
- 12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.
Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the
timestamp are encoded as bit-fields on a FixedUInt
that corresponds to the length that followed the opcode.
If the timestamp's overall length is greater than or equal to 8, the FixedUInt part of the timestamp is 7 bytes and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a (scale, coefficient) pair, which is similar to a decimal. The primary difference is that the scale represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to 1.0 or less than 0.0. The scale is encoded as a FlexUInt (instead of FlexInt) to discourage the encoding of decimal numbers greater than 1.0. The coefficient is encoded as a FixedUInt (instead of FixedInt) to prevent the encoding of fractional seconds less than 0.0. Note that validation is still required; namely:
- A scale value of 0 is illegal, as that would result in a fractional seconds value greater than or equal to 1.0 (a whole second).
- If coefficient * 10^-scale >= 1.0, that (scale, coefficient) pair is illegal.
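The validation rules above can be sketched as a small check. This is a minimal illustration, not a reference implementation: the function name `fractional_seconds` is hypothetical, and it treats a value of exactly 1.0 as illegal, following the requirement that the fraction be less than 1.0.

```python
from fractions import Fraction


def fractional_seconds(scale: int, coefficient: int) -> Fraction:
    """Return coefficient * 10**-scale, rejecting pairs that fall outside [0, 1)."""
    if scale <= 0:
        raise ValueError("a scale of 0 is illegal")
    if coefficient < 0:
        raise ValueError("the coefficient is unsigned")
    # Fraction keeps the arithmetic exact, avoiding binary floating-point rounding.
    value = Fraction(coefficient, 10 ** scale)
    if value >= 1:
        raise ValueError("fractional seconds must be less than 1.0")
    return value
```

For example, `fractional_seconds(3, 127)` yields 127/1000, while `fractional_seconds(2, 100)` is rejected.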
If the timestamp's length is 3, the precision is determined by inspecting the day (DDDDD) bits. Like the short-form, the Month and Day subfields are one-based (0 is not a valid month or day). If the day subfield is zero, that indicates month precision. If the day subfield is any non-zero number, that indicates day precision.
Encoding of the body of a long-form timestamp
+=========+
byte 0 |YYYY:YYYY|
+=========+
1 |MMYY:YYYY|
+---------+
2 |HDDD:DDMM|
+---------+
3 |mmmm:HHHH|
+---------+
4 |oooo:oomm|
+---------+
5 |ssoo:oooo|
+---------+
6 |....:ssss|
+=========+
7 |FlexUInt | <-- scale of the fractional seconds
+---------+
...
+=========+
N |FixedUInt| <-- coefficient of the fractional seconds
+---------+
...
Examples of long-form timestamps
Text | Binary |
---|---|
1947T | F8 05 9B 07 |
1947-12T | F8 07 9B 07 03 |
1947-12-23T | F8 07 9B 07 5F |
1947-12-23T11:22:33-00:00 | F8 0F 9B 07 DF 65 FD 7F 08 |
1947-12-23T11:22:33+01:15 | F8 0F 9B 07 DF 65 AD 57 08 |
1947-12-23T11:22:33.127+01:15 | F8 13 9B 07 DF 65 AD 57 08 07 7F |
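The bit layout above can be unpacked with plain shifts and masks. The following is a minimal sketch under stated assumptions: `decode_long_form_body` is a hypothetical helper that receives only the body bytes (everything after the opcode and the FlexUInt length), and it ignores the fractional-seconds bytes that may follow byte 6.

```python
def decode_long_form_body(body: bytes) -> dict:
    """Unpack the bit fields of a long-form timestamp body."""
    if len(body) in (0, 1, 4, 5):
        raise ValueError("illegal long-form timestamp length")
    # The first (up to) 7 bytes form one little-endian FixedUInt.
    fixed = int.from_bytes(body[:7], "little")
    fields = {"year": fixed & 0x3FFF}                # bits 0-13
    if len(body) >= 3:
        fields["month"] = (fixed >> 14) & 0xF        # bits 14-17
        day = (fixed >> 18) & 0x1F                   # bits 18-22
        if day != 0:                                 # zero means month precision
            fields["day"] = day
    if len(body) >= 6:
        fields["hour"] = (fixed >> 23) & 0x1F        # bits 23-27
        fields["minute"] = (fixed >> 28) & 0x3F      # bits 28-33
        offset = (fixed >> 34) & 0xFFF               # bits 34-45
        # All ones means "unknown offset"; otherwise minutes counted from -1440.
        fields["offset_minutes"] = None if offset == 0xFFF else offset - 1440
    if len(body) >= 7:
        fields["second"] = (fixed >> 46) & 0x3F      # bits 46-51
    return fields
```

Passing the seven body bytes `9B 07 DF 65 AD 57 08` returns year 1947, month 12, day 23, hour 11, minute 22, second 33, and an offset of 75 minutes (+01:15).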
Strings
If the high nibble of the opcode is 0x9_, it represents a string. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0x90 represents a string with empty text ("").
Strings longer than 15 bytes can be encoded with the 0xF9 opcode, which takes a FlexUInt-encoded length after the opcode.
0xEB 0x05 represents null.string.
Encoding of the empty string, ""
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
90
Encoding of a 14-byte string
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││ f o u r t e e n b y t e s
9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
└──────────────────┬────────────────────┘
UTF-8 bytes
Encoding of a 24-byte string
┌──── Opcode F9 indicates a variable-length string
│ ┌─── Length: FlexUInt 24
│ │ v a r i a b l e l e n g t h e n c o d i n g
F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
└────────────────────────────────┬────────────────────────────────────┘
UTF-8 bytes
Encoding of null.string
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: string
│ │
EB 05
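A writer's choice among the three string encodings can be sketched as follows. This is an illustrative sketch only: `encode_flex_uint` and `encode_string` are hypothetical helper names, and the FlexUInt routine assumes the little-endian layout in which the position of the lowest set bit gives the encoded size in bytes.

```python
from typing import Optional


def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("FlexUInt cannot encode negative values")
    size = max(1, -(-value.bit_length() // 7))       # ceil(bits / 7), at least one byte
    # Shift the payload left by `size` and set the marker bit just below it.
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_string(text: Optional[str]) -> bytes:
    """Encode an Ion string value, or null.string when text is None."""
    if text is None:
        return bytes([0xEB, 0x05])                   # typed null: string
    utf8 = text.encode("utf-8")
    if len(utf8) <= 15:
        return bytes([0x90 | len(utf8)]) + utf8      # length in the low nibble
    return bytes([0xF9]) + encode_flex_uint(len(utf8)) + utf8
```

With these helpers, `encode_string("")` produces the single byte `90`, and a 24-byte string is prefixed with `F9 31`.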
Symbols
Symbols With Inline Text
If the high nibble of the opcode is 0xA_, it represents a symbol whose text follows the opcode. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0xA0 represents a symbol with empty text ('').
Symbols with more than 15 bytes of inline text can be encoded with the 0xFA opcode, which takes a FlexUInt-encoded length after the opcode.
0xEB 0x06 represents null.symbol.
Encoding of a symbol with empty text ('')
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
A0
Encoding of a symbol with 14 bytes of inline text
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││ f o u r t e e n b y t e s
AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
└──────────────────┬────────────────────┘
UTF-8 bytes
Encoding of a symbol with 24 bytes of inline text
┌──── Opcode FA indicates a variable-length symbol with inline text
│ ┌─── Length: FlexUInt 24
│ │ v a r i a b l e l e n g t h e n c o d i n g
FA 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
└────────────────────────────────┬────────────────────────────────────┘
UTF-8 bytes
Encoding of null.symbol
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: symbol
│ │
EB 06
Symbols With a Symbol Address
Symbol values whose text can be found in the local symbol table are encoded using opcodes 0xE1 through 0xE3:
- 0xE1 represents a symbol whose address in the symbol table (aka its symbol ID) is a 1-byte FixedUInt that follows the opcode.
- 0xE2 represents a symbol whose address in the symbol table is a 2-byte FixedUInt that follows the opcode.
- 0xE3 represents a symbol whose address in the symbol table is a FlexUInt that follows the opcode.
Writers MUST encode a symbol address in the smallest number of bytes possible. For each opcode above, the symbol address that is decoded is biased by the number of addresses that can be encoded in fewer bytes.
Opcode | Symbol address range | Bias |
---|---|---|
0xE1 | 0 to 255 | 0 |
0xE2 | 256 to 65,791 | 256 |
0xE3 | 65,792 to infinity | 65,792 |
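The bias table translates directly into a writer routine that always picks the smallest encoding. The sketch below is illustrative; `encode_symbol_address` and `encode_flex_uint` are hypothetical helper names, and FixedUInt is assumed to be little-endian.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (7 payload bits per byte)."""
    size = max(1, -(-value.bit_length() // 7))
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_symbol_address(address: int) -> bytes:
    """Encode a symbol address with the smallest applicable opcode."""
    if address < 0:
        raise ValueError("symbol addresses are non-negative")
    if address < 256:
        return bytes([0xE1, address])                          # bias 0
    if address < 65_792:
        return bytes([0xE2]) + (address - 256).to_bytes(2, "little")
    return bytes([0xE3]) + encode_flex_uint(address - 65_792)  # bias 65,792
```

Because each opcode subtracts the count of addresses reachable with a shorter form, address 256 is written as `E2 00 00` and address 65,792 as `E3 01`.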
System Symbols
System symbols (that is, symbols defined in the system module) can be encoded using the 0xEE opcode followed by a 1-byte FixedUInt representing an index in the system symbol table.
Unlike Ion 1.0, symbols are not required to use the lowest available SID for a given text, and system symbols MAY be encoded using other SIDs.
Encoding of the system symbol $ion
┌──── Opcode 0xEE indicates a system symbol
│  ┌─── FixedUInt 1 indicates system symbol 1
│  │
EE 01
Binary Data
Blobs
Opcode 0xFE indicates a blob of binary data. A FlexUInt follows that represents the blob's byte-length.
0xEB 0x07 represents null.blob.
Example blob encoding
┌──── Opcode FE indicates a blob, FlexUInt length follows
│ ┌─── Length: FlexUInt 24
│ │
FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
└────────────────────────────────┬────────────────────────────────────┘
24 bytes of binary data
Encoding of null.blob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: blob
│ │
EB 07
Clobs
Opcode 0xFF indicates a clob--binary character data of an unspecified encoding. A FlexUInt follows that represents the clob's byte-length.
0xEB 0x08 represents null.clob.
Example clob encoding
┌──── Opcode FF indicates a clob, FlexUInt length follows
│ ┌─── Length: FlexUInt 24
│ │
FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
└────────────────────────────────┬────────────────────────────────────┘
24 bytes of binary data
Encoding of null.clob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: clob
│ │
EB 08
Lists
Length-prefixed encoding
An opcode with a high nibble of 0xB_ indicates a length-prefixed list. The lower nibble of the opcode indicates how many bytes were used to encode the child values that the list contains.
If the list's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFB opcode to write a variable-length list. The 0xFB opcode is followed by a FlexUInt that indicates the list's byte length.
0xEB 0x09 represents null.list.
Length-prefixed encoding of an empty list ([])
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 0 indicates that the child values of this
││ list took zero bytes to encode.
B0
Length-prefixed encoding of [1, 2, 3]
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this
││ list took six bytes to encode.
B6 61 01 61 02 61 03
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Length-prefixed encoding of ["variable length list"]
┌──── Opcode 0xFB indicates a variable-length list. A FlexUInt length follows.
│ ┌───── Length: FlexUInt 22
│ │ ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│ │ │ ┌─────── Length: FlexUInt 20
│ │ │ │ v a r i a b l e l e n g t h l i s t
FB 2d F9 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74
└─────────────────────────────┬─────────────────────────────────┘
Nested string element
Encoding of null.list
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: list
│ │
EB 09
Delimited Encoding
Opcode 0xF1 begins a delimited list, while opcode 0xF0 closes the most recently opened delimited container that has not yet been closed.
Delimited encoding of an empty list ([])
┌──── Opcode 0xF1 indicates a delimited list
│ ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F1 F0
Delimited encoding of [1, 2, 3]
┌──── Opcode 0xF1 indicates a delimited list
│ ┌─── Opcode 0xF0 indicates the end of
│ │ the most recently opened container
F1 61 01 61 02 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Delimited encoding of [1, [2], 3]
┌──── Opcode 0xF1 indicates a delimited list
│ ┌─── Opcode 0xF1 begins a nested delimited list
│ │ ┌─── Opcode 0xF0 closes the most recently
│ │ │ opened delimited container: the nested list.
│ │ │ ┌─── Opcode 0xF0 closes the most recently opened (and
│ │ │ │ still open) delimited container: the outer list.
│ │ │ │
F1 61 01 F1 61 02 F0 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
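The difference between the two list encodings shows up clearly in a reader. The sketch below is a deliberately tiny illustration: `read_value` is a hypothetical function that understands only opcode 0x61 (treated here as a 1-byte signed integer), nibble-length lists (0xB0-0xBF), and delimited lists (0xF1 ... 0xF0).

```python
def read_value(data: bytes, pos: int):
    """Read one value starting at pos; return (value, next_pos)."""
    opcode = data[pos]
    pos += 1
    if opcode == 0x61:                       # 1-byte integer
        return int.from_bytes(data[pos:pos + 1], "little", signed=True), pos + 1
    if 0xB0 <= opcode <= 0xBF:               # length-prefixed list
        end = pos + (opcode & 0x0F)          # low nibble is the byte length
        items = []
        while pos < end:
            item, pos = read_value(data, pos)
            items.append(item)
        return items, pos
    if opcode == 0xF1:                       # delimited list
        items = []
        while data[pos] != 0xF0:             # 0xF0 closes the innermost open container
            item, pos = read_value(data, pos)
            items.append(item)
        return items, pos + 1                # step over the 0xF0 byte
    raise ValueError(f"opcode 0x{opcode:02X} is not handled in this sketch")
```

A length-prefixed list lets the reader skip ahead by `end` without visiting the children, whereas a delimited list must be walked value by value until the matching 0xF0 is found; recursion takes care of nesting.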
S-Expressions
S-expressions use the same encodings as lists, but with different opcodes.
Opcode | Encoding |
---|---|
0xC0-0xCF | Length-prefixed S-expression; low nibble of the opcode represents the byte-length. |
0xFC | Variable-length prefixed S-expression; a FlexUInt following the opcode represents the byte-length. |
0xF2 | Starts a delimited S-expression; 0xF0 closes the most recently opened delimited container. |
0xEB 0x0A represents null.sexp.
Length-prefixed encoding
Length-prefixed encoding of an empty S-expression (())
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 0 indicates that the child values of this S-expression
││ took zero bytes to encode.
C0
Length-prefixed encoding of (1 2 3)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 6 indicates that the child values of this S-expression
││ took six bytes to encode.
C6 61 01 61 02 61 03
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Length-prefixed encoding of ("variable length sexp")
┌──── Opcode 0xFC indicates a variable-length sexp. A FlexUInt length follows.
│ ┌───── Length: FlexUInt 22
│ │ ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│ │ │ ┌─────── Length: FlexUInt 20
│ │ │ │ v a r i a b l e l e n g t h s e x p
FC 2D F9 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70
└─────────────────────────────┬─────────────────────────────────┘
Nested string element
Encoding of null.sexp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: sexp
│ │
EB 0A
Delimited encoding
Delimited encoding of an empty S-expression (())
┌──── Opcode 0xF2 indicates a delimited S-expression
│ ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F2 F0
Delimited encoding of (1 2 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│ ┌─── Opcode 0xF0 indicates the end of
│ │ the most recently opened container
F2 61 01 61 02 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Delimited encoding of (1 (2) 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│ ┌─── Opcode 0xF2 begins a nested delimited S-expression
│ │ ┌─── Opcode 0xF0 closes the most recently
│ │ │ opened delimited container: the nested S-expression.
│ │ │ ┌─── Opcode 0xF0 closes the most recently opened (and
│ │ │ │ still open) delimited container: the outer S-expression.
│ │ │ │
F2 61 01 F2 61 02 F0 61 03 F0
└─┬─┘ └─┬─┘ └─┬─┘
1 2 3
Structs
Length-prefixed encoding
If the high nibble of the opcode is 0xD_, it represents a struct. The lower nibble of the opcode indicates how many bytes were used to encode all of its nested (field name, value) pairs. Opcode 0xD0 represents an empty struct.
warning
Opcode 0xD1 is illegal. Non-empty structs must have at least two bytes: a field name and a value.
If the struct's encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFD opcode to write a variable-length struct. The 0xFD opcode is followed by a FlexUInt that indicates the byte length.
Each field in the struct is encoded as a FlexUInt representing the address of the field name's text in the symbol table, followed by an opcode-prefixed value.
0xEB 0x0B represents null.struct.
Length-prefixed encoding of an empty struct ({})
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode
D0
Length-prefixed encoding of {$10: 1, $11: 2}
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│ ┌─── Field name: FlexUInt 10 ($10)
│ │ ┌─── Field name: FlexUInt 11 ($11)
│ │ │
D6 15 61 01 17 61 02
└─┬─┘ └─┬─┘
1 2
Length-prefixed encoding of {$10: "variable length struct"}
┌───────────── Opcode `FD` indicates a struct with a FlexUInt length prefix
│ ┌────────── Length: FlexUInt 25
│ │ ┌─────── Field name: FlexUInt 10 ($10)
│ │ │ ┌──── Opcode `F9` indicates a variable length string
│ │ │ │ ┌─ FlexUInt: 22 the string is 22 bytes long
│ │ │ │ │ v a r i a b l e l e n g t h s t r u c t
FD 33 15 F9 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74
└─────────────────────────────┬─────────────────────────────────┘
UTF-8 bytes
Encoding of null.struct
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│ ┌─── Null type: struct
│ │
EB 0B
Optional FlexSym field name encoding
By default, all struct field names are encoded as FlexUInt symbol addresses. However, a writer has the option of encoding the field names as FlexSyms instead, granting additional flexibility at the expense of some compactness.
Writing field names as FlexSyms allows the writer to:
- encode the UTF-8 bytes of the field name inline (for example, to avoid modifying the symbol table).
- call a macro whose output (another struct) will be merged into the current struct.
- encode the field name as a symbol address if it's already in the symbol table (just like a FlexUInt would, but slightly less compactly).
To switch to FlexSym field names, the writer emits a FlexUInt zero (byte 0x01) in field name position to inform the reader that subsequent field names will be encoded as FlexSyms.
This switch is one way. Once the writer switches to using FlexSym, the encoding cannot be switched back to FlexUInt for the remainder of the struct.
Switching to FlexSym while encoding {$10: 1, foo: 2, $11: 3}
In this example, the writer switches to FlexSym field names before encoding foo so it can write the UTF-8 bytes inline.
┌──── An opcode in the range 0xD0-0xDF indicates a length-prefixed struct
│  ┌─── Field name: FlexUInt 10 ($10)
│  │        ┌─── FlexUInt 0: Switch to FlexSym field name encoding
│  │        │  ┌─── FlexSym: 3 UTF-8 bytes follow
│  │        │  │                 ┌─── Field name: FlexSym 11 ($11)
│  │        │  │  f  o  o        │
DD 15 61 01 01 FB 66 6F 6F 61 02 17 61 03
      └─┬─┘                └─┬─┘    └─┬─┘
        1                    2        3
note
Because FlexUInt zero indicates a mode switch, encoding symbol ID $0 requires switching to FlexSym.
Length-prefixed encoding of {$0: 1}
┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 6
││ ┌── FlexUInt 0 in the field name position indicates that the struct
││ │   is switching to FlexSym mode
││ │  ┌── FlexSym "escape"
││ │  │  ┌── Symbol address: 1-byte FixedUInt follows
││ │  │  │  ┌─ FixedUInt 0
││ │  │  │  │
D6 01 01 E1 00 61 01
      └───┬──┘ └─┬─┘
          $0     1
Delimited encoding
Opcode 0xF3 indicates the beginning of a delimited struct. Unlike length-prefixed structs, delimited structs always encode their field names as FlexSyms.
Unlike lists and S-expressions, structs cannot use opcode 0xF0 by itself to indicate the end of the delimited container. This is because 0xF0 is a valid FlexSym (a symbol with 16 bytes of inline text). To close the delimited struct, the writer emits a 0x01 byte (a FlexSym escape) followed by the opcode 0xF0.
note
It is much more compact to write 0xD0 -- the empty length-prefixed struct.
Delimited encoding of the empty struct ({})
┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│ ┌─── FlexSym escape code 0 (0x01): an opcode follows
│ │ ┌─── Opcode 0xF0 indicates the end of the most
│ │ │ recently opened delimited container
F3 01 F0
Delimited encoding of {"foo": 1, $11: 2}
┌─── Opcode 0xF3 indicates the beginning of a delimited struct
│
│ ┌─ FlexSym -3 ┌─ FlexSym: 11 ($11)
│ │ │ ┌─── FlexSym escape code 0 (0x01): an opcode follows
│ │ │ │ ┌─── Opcode 0xF0 indicates the end of the most
│ │ f o o │ │ │ recently opened delimited container
F3 FB 66 6F 6F 61 01 17 61 02 01 F0
└──┬───┘ └─┬─┘ └─┬─┘
3 UTF-8 1 2
bytes
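Reading a delimited struct comes down to interpreting each FlexSym field name and watching for the escape that precedes 0xF0. The following is a minimal sketch, not a complete reader: `read_flex_int` and `read_delimited_struct` are hypothetical names, values are limited to opcode 0x61 (treated as a 1-byte signed integer), and the only FlexSym escape handled is the one that closes the struct.

```python
def read_flex_int(data: bytes, pos: int):
    """Decode a FlexInt starting at pos; return (value, next_pos)."""
    size, i = 1, pos
    while data[i] == 0:                      # each all-zero byte adds 8 to the size
        size += 8
        i += 1
    size += (data[i] & -data[i]).bit_length() - 1    # trailing zero bits before the marker
    raw = int.from_bytes(data[pos:pos + size], "little", signed=True)
    return raw >> size, pos + size


def read_delimited_struct(data: bytes, pos: int):
    """Read fields until the FlexSym escape + 0xF0; pos points just past 0xF3."""
    fields = []
    while True:
        sym, pos = read_flex_int(data, pos)
        if sym > 0:                          # positive: a symbol address
            name = f"${sym}"
        elif sym < 0:                        # negative: -sym bytes of inline UTF-8 text
            name = data[pos:pos - sym].decode("utf-8")
            pos -= sym
        else:                                # zero: escape, an opcode follows
            opcode = data[pos]
            pos += 1
            if opcode == 0xF0:
                return fields, pos
            raise ValueError("this sketch handles only the 0xF0 escape")
        if data[pos] != 0x61:
            raise ValueError("this sketch handles only 1-byte integer values")
        value = int.from_bytes(data[pos + 1:pos + 2], "little", signed=True)
        fields.append((name, value))
        pos += 2
```

Symbol addresses are returned as `$n` placeholders because resolving them needs a symbol table, which is outside the scope of this sketch.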
Encoding Expressions
note
This chapter focuses on the binary encoding of e-expressions. Macros by example explains what they are and how they are used.
E-expression with the address in the opcode
If the value of the opcode is less than 64 (0x40), it represents an E-expression invoking the macro at the corresponding address--an offset within the local macro table.
Invocation of macro address 7
┌──── Opcode in 00-3F range indicates an e-expression
│ where the opcode value is the macro address
│
07
└── FixedUInt 7
Invocation of macro address 31
┌──── Opcode in 00-3F range indicates an e-expression
│ where the opcode value is the macro address
│
1F
└── FixedUInt 31
Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section Macro calling conventions. (TODO: Link)
E-expressions with biased FixedUInt addresses
While E-expressions invoking macro addresses in the range [0, 63] can be encoded in a single byte using E-expressions with the address in the opcode, many applications will benefit from defining more than 64 macros. The 0x4_ and 0x5_ opcodes can be used to represent macro addresses up to 1,052,735. In both encodings, the address is biased by the total number of addresses with lower opcodes.
If the high nibble of the opcode is 0x4_, then a biased address follows as a 1-byte FixedUInt. For 0x4_, the bias is 256 * low_nibble + 64 (or (low_nibble << 8) + 64).
If the high nibble of the opcode is 0x5_, then a biased address follows as a 2-byte FixedUInt. For 0x5_, the bias is 65536 * low_nibble + 4160 (or (low_nibble << 16) + 4160).
Invocation of macro address 841
┌──── Opcode in range 40-4F indicates a macro address with 1-byte FixedUInt address
│┌─── Low nibble 3 indicates bias of 832
││
43 09
│
└─── FixedUInt 9
Biased Address : 9
Bias : 832
Address : 841
Invocation of macro address 142918
┌──── Opcode in range 50-5F indicates a macro address with 2-byte FixedUInt address
│┌─── Low nibble 2 indicates bias of 135232
││
52 06 1E
└─┬─┘
└─── FixedUInt 7686
Biased Address : 7686
Bias : 135232
Address : 142918
Macro address range biases for 0x4_ and 0x5_
Low Nibble | 0x4_ Bias | 0x5_ Bias |
---|---|---|
0 | 64 | 4160 |
1 | 320 | 69696 |
2 | 576 | 135232 |
3 | 832 | 200768 |
4 | 1088 | 266304 |
5 | 1344 | 331840 |
6 | 1600 | 397376 |
7 | 1856 | 462912 |
8 | 2112 | 528448 |
9 | 2368 | 593984 |
A | 2624 | 659520 |
B | 2880 | 725056 |
C | 3136 | 790592 |
D | 3392 | 856128 |
E | 3648 | 921664 |
F | 3904 | 987200 |
E-expression with the address as a trailing FlexUInt
The opcode 0xF4 indicates an e-expression whose address is encoded as a trailing FlexUInt with no bias. This encoding is less compact for addresses that can be encoded using opcodes 0x5F and below, but it is the only encoding that can be used for macro addresses greater than 1,052,735.
Invocation of macro address 4
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 09
│
└─── FlexUInt 4
Invocation of macro address 1_100_000
┌──── Opcode F4 indicates an e-expression with a trailing `FlexUInt` macro address
│
│
F4 04 47 86
└──┬───┘
└─── FlexUInt 1,100,000
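The three address encodings can be combined into one writer routine that picks the most compact form. This is a sketch under stated assumptions: `encode_macro_address` and `encode_flex_uint` are hypothetical helper names, FixedUInt is taken to be little-endian, and the upper bound for the 0x5_ form is derived from the bias table (987200 + 65535).

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (7 payload bits per byte)."""
    size = max(1, -(-value.bit_length() // 7))
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_macro_address(address: int) -> bytes:
    """Encode the opcode (and any trailing address bytes) of an e-expression."""
    if address < 0:
        raise ValueError("macro addresses are non-negative")
    if address < 64:                          # address carried by the opcode itself
        return bytes([address])
    if address < 4160:                        # 0x4_: bias = 256 * nibble + 64
        biased = address - 64
        return bytes([0x40 | (biased >> 8), biased & 0xFF])
    if address <= 987_200 + 65_535:           # 0x5_: bias = 65536 * nibble + 4160
        biased = address - 4160
        return bytes([0x50 | (biased >> 16)]) + (biased & 0xFFFF).to_bytes(2, "little")
    return bytes([0xF4]) + encode_flex_uint(address)   # unbiased trailing FlexUInt
```

Subtracting the base bias first and then splitting the remainder means the high bits select the low nibble of the opcode and the low bits become the FixedUInt that follows.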
System Macro Invocations
E-expressions that invoke a system macro can be encoded using the 0xEF opcode followed by a 1-byte FixedUInt representing an index in the system macro table.
Encoding of the system macro values
┌──── Opcode 0xEF indicates a system macro invocation
│  ┌─── FixedUInt 1 indicates macro 1 from the system macro table
│  │
EF 01
In addition, system macros MAY be invoked using any of the 0x00-0x5F or 0xF4-0xF5 opcodes, provided that the macro being invoked has been given an address in user macro address space.
E-expression argument encoding
The example invocations in prior sections have demonstrated how to encode an invocation of the simplest form of macro--one with no parameters. This section explains how to encode macro invocations when they take parameters of different encodings and cardinalities.
To begin, we will examine how arguments are encoded when all of the macro's parameters use the tagged encoding and have a cardinality of exactly-one.
Tagged encoding
When a macro parameter does not specify an encoding (the parameter name is not annotated), arguments passed to that parameter use the 'tagged' encoding. The argument begins with a leading opcode that dictates how to interpret the bytes that follow.
This is the same encoding used for values in other Ion 1.1 contexts like lists, s-expressions, or at the top level.
Encoding a single exactly-one argument
A parameter with a cardinality of exactly-one expects its corresponding argument to be encoded as a single expression of the parameter's declared encoding. (The following section will explore the available encodings in greater depth; for now, our examples will be limited to parameters using the tagged encoding.)
When the macro has a single exactly-one parameter, the corresponding encoded argument follows the opcode and (if separate) the encoded address.
Example encoding of an e-expression with a tagged, exactly-one argument
Macro definition
(:set_macros
(foo (x) /*...*/)
)
Text e-expression
(:foo 1)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression invoking
│ the macro at address 0.
│ ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│ ┌─┴─┐
00 61 01
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│ ┌──── FlexUInt 0: Macro address 0
│ │ ┌─── Argument 'x': opcode 0x61 indicates a 1-byte integer (1)
│ │ ┌─┴─┐
F4 01 61 01
Encoding multiple exactly-one arguments
If the macro has more than one parameter, a reader would iterate over the parameters declared in the macro signature from left to right. For each parameter, the reader would use the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing of the e-expression's arguments is complete.
Example encoding of an e-expression with multiple tagged, exactly-one arguments
Macro definition
(:set_macros
(foo (a b c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│ │ ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│ │ │ ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 61 01 61 02 61 03
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│ ┌──── FlexUInt 0: Macro address 0
│ │ ┌─── Argument 'a': opcode 0x61 indicates a 1-byte integer (1)
│ │ │ ┌─── Argument 'b': opcode 0x61 indicates a 1-byte integer (2)
│ │ │ │ ┌─── Argument 'c': opcode 0x61 indicates a 1-byte integer (3)
│ │ │ │ │
│ │ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
F4 01 61 01 61 02 61 03
Tagless Encodings
In contrast to the tagged encoding, tagless encodings do not begin with an opcode. This means that they are potentially more compact than a tagged type, but are also less flexible. Because tagless encodings do not have an opcode, they cannot represent E-expressions, annotation sequences, or null values of any kind.
Tagless encodings consist of the primitive encodings and macro shapes.
Primitive encodings
Primitive encodings are self-delineating, either by having a statically known size in bytes or by including length information in their serialized form.
Ion type | Primitive encoding | Size in bytes | Encoding |
---|---|---|---|
int | uint8 | 1 | FixedUInt |
int | uint16 | 2 | FixedUInt |
int | uint32 | 4 | FixedUInt |
int | uint64 | 8 | FixedUInt |
int | flex_uint | variable | FlexUInt |
int | int8 | 1 | FixedInt |
int | int16 | 2 | FixedInt |
int | int32 | 4 | FixedInt |
int | int64 | 8 | FixedInt |
int | flex_int | variable | FlexInt |
float | float16 | 2 | Little-endian IEEE-754 half-precision float |
float | float32 | 4 | Little-endian IEEE-754 single-precision float |
float | float64 | 8 | Little-endian IEEE-754 double-precision float |
symbol | flex_sym | variable | FlexSym |
Example encoding of an e-expression with primitive, exactly-one arguments
As first demonstrated in Encoding multiple exactly-one arguments, the bytes of the serialized arguments begin immediately after the e-expression's opcode and (if separate) the macro address. The reader iterates over the parameters in the macro signature in the order they are declared. For each parameter, the reader uses the parameter's declared encoding to interpret the next bytes in the stream. When no more parameters remain, parsing is complete.
Macro definition
(:set_macros
(foo (flex_uint::a int8::b uint16::c) /*...*/)
)
Text e-expression
(:foo 1 2 3)
Binary e-expression with the address in the opcode
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌─── Argument 'a': FlexUInt 1
│ │ ┌─── Argument 'b': 1-byte FixedInt 2
│ │ │ ┌─── Argument 'c': 2-byte FixedUInt 3
│ │ │ ┌─┴─┐
00 03 02 03 00
Binary e-expression using a trailing FlexUInt address
┌──── Opcode F4: An e-expression with a trailing FlexUInt address
│ ┌──── FlexUInt 0: Macro address 0
│ │ ┌─── Argument 'a': FlexUInt 1
│ │ │ ┌─── Argument 'b': 1-byte FixedInt 2
│ │ │ │ ┌─── Argument 'c': 2-byte FixedUInt 3
│ │ │ │ ┌─┴─┐
F4 01 03 02 03 00
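The signature-driven loop described above can be sketched with a lookup table of fixed-width formats. This is an illustration only: `read_flex`, `read_args`, and the list-of-names signature representation are assumptions made for the example, and `flex_sym` is left out for brevity.

```python
import struct

# struct format for each fixed-width primitive encoding ("<" = little-endian)
FIXED = {
    "uint8": "<B", "uint16": "<H", "uint32": "<I", "uint64": "<Q",
    "int8": "<b", "int16": "<h", "int32": "<i", "int64": "<q",
    "float16": "<e", "float32": "<f", "float64": "<d",
}


def read_flex(data: bytes, pos: int, signed: bool):
    """Decode a FlexUInt (signed=False) or FlexInt (signed=True) at pos."""
    size, i = 1, pos
    while data[i] == 0:                      # each all-zero byte adds 8 to the size
        size += 8
        i += 1
    size += (data[i] & -data[i]).bit_length() - 1
    raw = int.from_bytes(data[pos:pos + size], "little", signed=signed)
    return raw >> size, pos + size


def read_args(signature, data: bytes, pos: int):
    """Read one tagless argument per encoding name in signature, left to right."""
    args = []
    for encoding in signature:
        if encoding in ("flex_uint", "flex_int"):
            value, pos = read_flex(data, pos, signed=(encoding == "flex_int"))
        else:
            fmt = FIXED[encoding]
            (value,) = struct.unpack_from(fmt, data, pos)
            pos += struct.calcsize(fmt)
        args.append(value)
    return args, pos
```

Calling `read_args(["flex_uint", "int8", "uint16"], data, 1)` on the bytes `00 03 02 03 00` skips the opcode at position 0 and returns the arguments 1, 2, and 3.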
Macro shapes
The term macro shape describes a macro that is being used as the encoding of an E-expression argument. A parameter using a macro shape as its encoding is sometimes called a macro-shaped parameter. For example, consider the following two macro definitions.
The point2D macro takes two flex_int-encoded values as arguments.
(macro point2D (flex_int::x flex_int::y)
{
x: (%x),
y: (%y),
}
)
The line macro takes a pair of point2D invocations as arguments.
(macro line (point2D::start point2D::end)
{
start: (%start),
end: (%end),
}
)
Normally an e-expression would begin with an opcode and an address communicating what comes next. However, when we're reading the argument for a macro-shaped parameter, the macro being invoked is inferred from the parent macro signature instead. As such, there is no need to include an opcode or address.
┌──── Opcode 0x01 is less than 0x40; this is an e-expression
│     invoking the macro at address 1: `line`
│    ┌─── Argument $start: an implicit invocation of macro `point2D`
│    │     ┌─── Argument $end: an implicit invocation of macro `point2D`
│  ┌─┴─┐ ┌─┴─┐
01 03 05 07 09
   │  │  │  └──── $end/$y: FlexInt 4
   │  │  └─────── $end/$x: FlexInt 3
   │  └────────── $start/$y: FlexInt 2
   └───────────── $start/$x: FlexInt 1
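Because a macro-shaped argument carries no opcode, a reader can treat it as a nested call to the argument reader of the inferred macro. The sketch below hard-codes the two signatures from this example; `read_flex_int`, `read_point2d`, and `read_line` are hypothetical names, and the dictionaries merely mirror the struct templates rather than performing real macro expansion.

```python
def read_flex_int(data: bytes, pos: int):
    """Decode a FlexInt starting at pos; return (value, next_pos)."""
    size, i = 1, pos
    while data[i] == 0:                      # each all-zero byte adds 8 to the size
        size += 8
        i += 1
    size += (data[i] & -data[i]).bit_length() - 1
    raw = int.from_bytes(data[pos:pos + size], "little", signed=True)
    return raw >> size, pos + size


def read_point2d(data: bytes, pos: int):
    """point2D (flex_int::x flex_int::y): two FlexInts, no opcode or address."""
    x, pos = read_flex_int(data, pos)
    y, pos = read_flex_int(data, pos)
    return {"x": x, "y": y}, pos


def read_line(data: bytes, pos: int):
    """line (point2D::start point2D::end): two implicit point2D invocations."""
    start, pos = read_point2d(data, pos)
    end, pos = read_point2d(data, pos)
    return {"start": start, "end": end}, pos
```

Given the four argument bytes `03 05 07 09`, `read_line` returns a start point of (1, 2) and an end point of (3, 4).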
Any macro can be used as a macro shape except for constants--macros which take zero parameters. Constants cannot be used as a macro shape because their serialized representation would be empty, making it impossible to encode them in expression groups. However, this limitation does not sacrifice any expressiveness; the desired constant can always be invoked directly in the body of the macro.
(:add_macros
// Defines a constant 'hostname'
(hostname () "abc123.us_west.example.com")
(http_ok (hostname::server page)
// └── ERROR: cannot use a constant as a macro shape
{
server: (%server),
page: (%page),
message: OK,
status: 200,
}
)
(http_ok (page)
{
server: (.hostname),
// └── OK: invokes constant as needed
page: (%page),
message: OK,
status: 200,
}
)
)
Encoding variadic arguments
The preceding sections have described how to (de)serialize the various parameter encodings, but these parameters have always had the same cardinality: exactly-one.
This section explains how to encode e-expressions invoking a macro whose signature contains variadic parameters--parameters with a cardinality of zero-or-one, zero-or-more, or one-or-more.
Argument Encoding Bitmap (AEB)
If a macro signature has one or more variadic parameters, then e-expressions invoking that macro will include an additional construct: the Argument Encoding Bitmap (AEB). This little-endian byte sequence precedes the first serialized argument and indicates how each argument corresponding to a variadic parameter has been encoded.
Each variadic parameter in the signature is assigned two bits in the AEB. This means that the reader can statically determine how many AEB bytes to expect in the e-expression by examining the signature.
Number of variadic parameters | AEB byte length |
---|---|
0 | 0 |
1 to 4 | 1 |
5 to 8 | 2 |
9 to 12 | 3 |
N | ceiling(N/4) |
Bits in the AEB are assigned from least significant to most significant and correspond to the variadic parameters in the signature from left to right. This allows the reader to right-shift away the bits of each variadic parameter when its corresponding argument has been read.
Example Signature | AEB Layout |
---|---|
() | <No variadics, no AEB> |
(a b c) | <No variadics, no AEB> |
(a b c?) | ------cc |
(a b* c?) | ----ccbb |
(a+ b* c?) | --ccbbaa |
(a+ b c?) | ----ccaa |
(a+ b* c? d*) | ddccbbaa |
(a+ b* c? d* e) | ddccbbaa |
(a+ b* c? d* e f?) | ddccbbaa ------ff |
(a+ b* c? d* e+ f?) | ddccbbaa ----ffee |
Each pair of bits in the AEB indicates what kind of expression to expect in the corresponding argument position.
Bit sequence | Meaning | ? | * | + |
---|---|---|---|---|
00 | An empty stream. No bytes are present in the corresponding argument position. | ✅ | ✅ | ❌ |
01 | A single expression of the declared encoding is present in the corresponding argument position. | ✅ | ✅ | ✅ |
10 | An expression group of the declared encoding is present in the corresponding argument position. | ❌ | ✅ | ✅ |
11 | Reserved. A bitmap entry with this bit sequence is illegal in Ion 1.1. | ❌ | ❌ | ❌ |
As noted in the table above:
- An empty stream (00) cannot be used to encode an argument for a parameter with a cardinality of one-or-more.
- An expression group (10) cannot be used to encode an argument for a parameter with a cardinality of zero-or-one.
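A reader might enforce the table above with a small lookup before consuming an argument. The following is a minimal sketch under that assumption; ALLOWED and check_aeb_entry are hypothetical names, not part of an existing API.

```python
# Legal AEB entries per cardinality modifier, transcribed from the table above.
ALLOWED = {
    "?": {0b00, 0b01},         # empty stream or single expression
    "*": {0b00, 0b01, 0b10},   # empty stream, single expression, or expression group
    "+": {0b01, 0b10},         # single expression or expression group
}


def check_aeb_entry(cardinality: str, code: int) -> None:
    """Raise ValueError if a 2-bit AEB entry is illegal for the given cardinality."""
    if code == 0b11:
        raise ValueError("AEB entry 11 is reserved in Ion 1.1")
    if code not in ALLOWED[cardinality]:
        raise ValueError(f"AEB entry {code:02b} is not allowed for a '{cardinality}' parameter")
```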
Expression groups
This section describes the encoding of an expression group. For an explanation of what an expression group is and how to use it, see Expression groups.
An expression group begins with a FlexUInt. If the FlexUInt's value is:
- greater than zero, then it represents the number of bytes used to encode the rest of the expression group. The reader should continue reading expressions of the declared encoding until that number of bytes has been consumed.
- zero, then it indicates that this is a delimited expression group and the processing varies according to whether the declared encoding is tagged or tagless. If the encoding is:
  - tagged, then each expression in the group begins with an opcode. The reader must consume tagged expressions until it encounters a terminating END opcode (0xF0).
  - tagless, then the expression group is a delimited sequence of 'chunks' that each have a FlexUInt length prefix and a body comprised of one or more expressions of the declared encoding. The reader will continue reading chunks until it encounters a length prefix of FlexUInt 0 (0x01), indicating the end of the chunk sequence. Each chunk in the sequence must be self-contained; an expression of the declared encoding may not be split across multiple chunks. See Example encoding of tagless zero-or-more with delimited expression group for an illustration.
Tip: While it is legal to write an empty expression group for zero-or-more parameters, it is always more efficient to set the parameter's AEB bits to 00 instead.
Example encoding of tagged zero-or-one with empty group
(:add_macros
(foo (a?) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa
│ │ a=00, empty expression group
00 00
Example encoding of tagged zero-or-one with single expression
(:add_macros
(foo (a?) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=01, single expression
│ │ ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│ │ ┌─┴─┐
00 01 61 01
Example encoding of tagged zero-or-more with empty group
(:add_macros
(foo (a*) /*...*/)
)
(:foo) // `a` is implicitly empty
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=00, empty expression group
│ │
00 00
Example encoding of tagged zero-or-more with single expression
(:add_macros
(foo (a*) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=01, single expression
│ │ ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int (1)
│ │ ┌─┴─┐
00 01 61 01
Example encoding of tagged zero-or-more with expression group
(:add_macros
(foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 6: 6-byte expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (1)
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (2)
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int (3)
│ │ │ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
00 02 0D 61 01 61 02 61 03
└───────┬───────┘
6-byte expression group body
Example encoding of tagged zero-or-more with delimited expression group
(:add_macros
(foo (a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 0: delimited expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (1)
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int (2)
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int (3)
│ │ │ │ │ │ ┌─── Opcode 0xF0 is delimited end
│ │ │ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ │
00 02 01 61 01 61 02 61 03 F0
└───────┬───────┘
expression group body
Example encoding of tagged one-or-more with single expression
(:add_macros
(foo (a+) /*...*/)
)
(:foo 1)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=01, single expression
│ │ ┌──── Argument 'a': opcode 0x61 indicates a 1-byte int
│ │ │ 1
00 01 61 01
Example encoding of tagged one-or-more with expression group
(:add_macros
(foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 6: 6-byte expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int
│ │ │ │ 1 │ 2 │ 3
00 02 0D 61 01 61 02 61 03
└───────┬───────┘
6-byte expression group body
Example encoding of tagged one-or-more with delimited expression group
(:add_macros
(foo (a+) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 0: delimited expression group
│ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ ┌──── Opcode 0x61 indicates a 1-byte int
│ │ │ │ │ ┌─── Opcode 0x61 indicates a 1-byte int
│ │ │ │ │ │ ┌─── Opcode 0xF0 is delimited end
│ │ │ │ 1 │ 2 │ 3 │
00 02 01 61 01 61 02 61 03 F0
└───────┬───────┘
expression group body
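The length-prefixed and delimited forms of a tagged expression group shown in the examples above could be read along the following lines. This is a deliberately narrow sketch: it assumes single-byte FlexUInts only (values 0 to 127, encoded as the value shifted left by one with the low bit set, as in 0x0D for 6), it understands only opcode 0x61, and it assumes the 1-byte int payload is signed. All function names are hypothetical.

```python
END_OPCODE = 0xF0    # terminates a delimited tagged expression group
INT8_OPCODE = 0x61   # 1-byte int, the only value opcode this sketch understands


def read_flex_uint_1byte(buf, pos):
    """Decode a single-byte FlexUInt; multi-byte forms are out of scope for this sketch."""
    b = buf[pos]
    if b & 1 == 0:
        raise NotImplementedError("multi-byte FlexUInt")
    return b >> 1, pos + 1


def read_tagged_int(buf, pos):
    """Read one tagged expression, which must be a 1-byte int (payload assumed signed)."""
    if buf[pos] != INT8_OPCODE:
        raise NotImplementedError(f"opcode 0x{buf[pos]:02X}")
    value = int.from_bytes(buf[pos + 1:pos + 2], "little", signed=True)
    return value, pos + 2


def read_tagged_group(buf, pos):
    """Return (values, next_pos) for a tagged expression group starting at pos."""
    length, pos = read_flex_uint_1byte(buf, pos)
    values = []
    if length > 0:                      # length-prefixed: read until `length` bytes are consumed
        end = pos + length
        while pos < end:
            value, pos = read_tagged_int(buf, pos)
            values.append(value)
        return values, pos
    while buf[pos] != END_OPCODE:       # delimited: read until the END opcode
        value, pos = read_tagged_int(buf, pos)
        values.append(value)
    return values, pos + 1              # step over the END opcode
```

Applied to the group bytes from the examples (everything after the opcode and AEB), both forms yield the same three values.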
Example encoding of tagless zero-or-more with expression group
(:add_macros
(foo (uint8::a*) /*...*/)
)
(:foo 1 2 3)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 3: 3-byte expression group
│ │ │ ┌──── uint8 1
│ │ │ │ ┌──── uint8 2
│ │ │ │ │ ┌─── uint8 3
│ │ │ │ │ │
00 02 07 01 02 03
└──┬───┘
expression group body
Example encoding of tagless zero-or-more with delimited expression group
(:add_macros
(foo (uint8::a*) /*...*/)
)
(:foo 1 2 3 4 5)
┌──── Opcode 0x00 is less than 0x40; this is an e-expression
│ invoking the macro at address 0.
│ ┌──── AEB: 0b------aa; a=10, expression group
│ │ ┌──── FlexUInt 0: Delimited expression group
│ │ │ ┌──── FlexUInt 3: 3-byte chunk of uint8 expressions
│ │ │ │ ┌──── FlexUInt 2: 2-byte chunk of uint8 expressions
│ │ │ │ │ ┌──── FlexUInt 0: End of group
│ │ │ │ │ │
00 02 01 07 01 02 03 05 04 05 01
└──┬───┘ └─┬─┘
chunk 1 chunk 2
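For a tagless uint8 parameter, reading either form of expression group reduces to following length prefixes. The sketch below assumes single-byte FlexUInts only (values 0 to 127) and a declared encoding of uint8, so each body byte is one expression; read_uint8_group is a hypothetical name, and bounds checking is omitted for brevity.

```python
def read_flex_uint_1byte(buf, pos):
    """Decode a single-byte FlexUInt; multi-byte forms are out of scope for this sketch."""
    b = buf[pos]
    if b & 1 == 0:
        raise NotImplementedError("multi-byte FlexUInt")
    return b >> 1, pos + 1


def read_uint8_group(buf, pos):
    """Return (values, next_pos) for a tagless uint8 expression group starting at pos."""
    length, pos = read_flex_uint_1byte(buf, pos)
    values = []
    if length > 0:
        # Length-prefixed group: the next `length` bytes are the uint8 expressions.
        values.extend(buf[pos:pos + length])
        return values, pos + length
    # Delimited group: a sequence of length-prefixed chunks ended by FlexUInt 0 (0x01).
    while True:
        chunk_len, pos = read_flex_uint_1byte(buf, pos)
        if chunk_len == 0:
            return values, pos
        values.extend(buf[pos:pos + chunk_len])
        pos += chunk_len
```

With the delimited bytes 01 07 01 02 03 05 04 05 01, this returns the five values carried by the two chunks; the chunk boundaries themselves carry no meaning for the caller.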
Annotations
Annotations can be encoded either as symbol addresses or as FlexSyms. In both encodings, the annotations sequence appears just before the value that it decorates.
It is illegal for an annotations sequence to appear before any of the following:
- The end of the stream
- Another annotations sequence
- A NOP
- An e-expression. To add annotations to the expansion of an e-expression, see the annotate macro.
Annotations With Symbol Addresses
Opcodes 0xE4 through 0xE6 indicate one or more annotations encoded as symbol addresses. If the opcode is:
- 0xE4, a single FlexUInt-encoded symbol address follows.
- 0xE5, two FlexUInt-encoded symbol addresses follow.
- 0xE6, a FlexUInt follows that represents the number of bytes needed to encode the annotations sequence, which can be made up of any number of FlexUInt symbol addresses.
Encoding of $10::false
┌──── The opcode `0xE4` indicates a single annotation encoded as a symbol address follows
│ ┌──── Annotation with symbol address: FlexUInt 10
E4 15 6F
└── The annotated value: `false`
Encoding of $10::$11::false
┌──── The opcode `0xE5` indicates that two annotations encoded as symbol addresses follow
│ ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│ │ ┌──── Annotation with symbol address: FlexUInt 11 ($11)
E5 15 17 6F
└── The annotated value: `false`
Encoding of $10::$11::$12::false
┌──── The opcode `0xE6` indicates a variable-length sequence of symbol address annotations;
│ a FlexUInt follows representing the length of the sequence.
│ ┌──── Annotations sequence length: FlexUInt 3
│ │ ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│ │ │ ┌──── Annotation with symbol address: FlexUInt 11 ($11)
│ │ │ │ ┌──── Annotation with symbol address: FlexUInt 12 ($12)
E6 07 15 17 19 6F
└── The annotated value: `false`
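The opcode selection described in this section can be sketched as a small writer. This is illustrative only: it assumes every symbol address and the sequence length fit in a single-byte FlexUInt (0 to 127), and encode_sid_annotations is a hypothetical name rather than a function of any Ion library.

```python
def flex_uint_1byte(n: int) -> bytes:
    """Encode n as a single-byte FlexUInt; larger values are out of scope for this sketch."""
    if not 0 <= n <= 127:
        raise ValueError("sketch only handles single-byte FlexUInts")
    return bytes([(n << 1) | 1])


def encode_sid_annotations(sids, value_bytes: bytes) -> bytes:
    """Prefix an already-encoded value with annotations given as symbol addresses."""
    if not sids:
        raise ValueError("at least one annotation is required")
    encoded = b"".join(flex_uint_1byte(sid) for sid in sids)
    if len(sids) == 1:
        prefix = b"\xE4"                                   # one symbol address follows
    elif len(sids) == 2:
        prefix = b"\xE5"                                   # two symbol addresses follow
    else:
        prefix = b"\xE6" + flex_uint_1byte(len(encoded))   # byte length, then the sequence
    return prefix + encoded + value_bytes
```

Here the byte 0x6F stands for the already-encoded value false, as in the examples above. Note that the 0xE6 form records the byte length of the sequence, not the number of annotations; the two happen to coincide in the three-annotation example because each address occupies one byte.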
Annotations With FlexSym Text
Opcodes 0xE7 through 0xE9 indicate one or more annotations encoded as FlexSyms. If the opcode is:
- 0xE7, a single FlexSym-encoded symbol follows.
- 0xE8, two FlexSym-encoded symbols follow.
- 0xE9, a FlexUInt follows that represents the byte length of the annotations sequence, which is made up of any number of annotations encoded as FlexSyms.
While this encoding is more flexible than annotations with symbol addresses, it can be slightly less compact when all the annotations are encoded as symbol addresses.
Encoding of $10::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│ ┌──── Annotation with symbol address: FlexSym 10 ($10)
E7 15 6F
└── The annotated value: `false`
Encoding of foo::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│ ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│ │ f o o
E7 FD 66 6F 6F 6F
└──┬───┘ └── The annotated value: `false`
3 UTF-8
bytes
Note that FlexSym annotation sequences can switch between symbol address and inline text on a per-annotation basis.
Encoding of $10::foo::false
┌──── The opcode `0xE8` indicates two annotations encoded as FlexSyms follow
│ ┌──── Annotation: FlexSym 10 ($10)
│ │ ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│ │ │ f o o
E8 15 FD 66 6F 6F 6F
└──┬───┘ └── The annotated value: `false`
3 UTF-8
bytes
Encoding of $10::foo::$11::false
┌──── The opcode `0xE9` indicates a variable-length sequence of FlexSym-encoded annotations
│ ┌──── Length: FlexUInt 6
│ │ ┌──── Annotation: FlexSym 10 ($10)
│ │ │ ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│ │ │ │ ┌──── Annotation: FlexSym 11 ($11)
│ │ │ │ f o o │
E9 0D 15 FD 66 6F 6F 17 6F
└──┬───┘ └── The annotated value: `false`
3 UTF-8
bytes
NOPs
A NOP (short for "no-operation") is the binary equivalent of whitespace. NOP bytes have no meaning, but can be used as padding to achieve a desired alignment.
An opcode of 0xEC indicates a single-byte NOP pad. An opcode of 0xED indicates that a FlexUInt follows that represents the number of additional bytes to skip.
It is legal for a NOP to appear anywhere that a value can be encoded. It is not legal for a NOP to appear in annotation sequences or struct field names. If a NOP appears in place of a struct field value, then the associated field name is ignored; the NOP is immediately followed by the next field name, if any.
Encoding of a 1-byte NOP
┌──── The opcode `0xEC` represents a 1-byte NOP pad
│
EC
Encoding of a 3-byte NOP
┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows
│ ┌──── Length: FlexUInt 2; two more bytes of NOP follow
│ │
ED 05 93 C6
└─┬─┘
NOP bytes, values ignored
Grammar
This chapter presents Ion 1.1's domain grammar, by which we mean the grammar of the domain of values that drive Ion's encoding features.
We use a BNF-like notation for describing various syntactic parts of a document, including Ion data structures. In such cases, the BNF should be interpreted loosely to accommodate Ion-isms like commas and unconstrained ordering of struct fields.
Documents
document ::= ivm? segment*
ivm ::= '$ion_1_0' | '$ion_1_1'
segment ::= value* directive?
directive ::= ivm
| encoding-directive
| symtab-directive
symtab-directive ::= local-symbol-table ; As per the Ion 1.0 specification¹
encoding-directive ::= '$ion_encoding::(' module-body ')'
¹Symbols – Local Symbol Tables.
Modules
module-body ::= import* inner-module* symbol-table? macro-table?
shared-module ::= '$ion_shared_module::' ivm '::(' catalog-key module-body ')'
import ::= '(import ' module-name catalog-key ')'
catalog-key ::= catalog-name catalog-version?
catalog-name ::= string
catalog-version ::= unannotated-uint ; must be positive
inner-module ::= '(module' module-name module-body ')'
module-name ::= unannotated-identifier-symbol
symbol-table ::= '(symbol_table' symbol-table-entry* ')'
symbol-table-entry ::= module-name | symbol-list
symbol-list ::= '[' symbol-text* ']'
symbol-text ::= symbol | string
macro-table ::= '(macro_table' macro-table-entry* ')'
macro-table-entry ::= macro-definition
| macro-export
| module-name
macro-export ::= '(export' qualified-macro-ref macro-name-declaration? ')'
Macro references
qualified-macro-ref ::= module-name '::' macro-ref
macro-ref ::= macro-name | macro-addr
qualified-macro-name ::= module-name '::' macro-name
macro-name ::= unannotated-identifier-symbol
macro-addr ::= unannotated-uint
Macro definitions
macro-definition ::= '(macro' macro-name-declaration signature tdl-expression ')'
macro-name-declaration ::= macro-name | 'null'
signature ::= '(' parameter* ')'
parameter ::= parameter-encoding? parameter-name parameter-cardinality?
parameter-encoding ::= (primitive-encoding-type | macro-name | qualified-macro-name)'::'
primitive-encoding-type ::= 'uint8' | 'uint16' | 'uint32' | 'uint64'
| 'int8' | 'int16' | 'int32' | 'int64'
| 'float16' | 'float32' | 'float64'
| 'flex_int' | 'flex_uint'
| 'flex_sym' | 'flex_string'
parameter-name ::= unannotated-identifier-symbol
parameter-cardinality ::= '!' | '*' | '?' | '+'
tdl-expression ::= operation | variable-expansion | ion-scalar | ion-container
operation ::= macro-invocation | special-form
variable-expansion ::= '(%' variable-name ')'
variable-name ::= unannotated-identifier-symbol
macro-invocation ::= '(.' macro-ref macro-arg* ')'
special-form ::= '(.' '$ion::'? special-form-name tdl-expression* ')'
special-form-name ::= 'for' | 'if_none' | 'if_some' | 'if_single' | 'if_multi'
macro-arg ::= tdl-expression | expression-group
expression-group ::= '(..' tdl-expression* ')'
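To make the signature and parameter productions more concrete, the sketch below parses the compact text form of a signature into (encoding, name, cardinality) triples. It is only an approximation of the grammar: the identifier pattern is an assumption, whitespace is used to separate parameters, and the names parse_parameter and parse_signature are hypothetical.

```python
import re

# Assumed shape of an identifier symbol; the real lexical rules are defined elsewhere.
_IDENT = r"[A-Za-z_$][A-Za-z0-9_$]*"

# parameter ::= parameter-encoding? parameter-name parameter-cardinality?
_PARAM = re.compile(
    rf"^(?:(?P<encoding>{_IDENT}(?:::{_IDENT})?)::)?"
    rf"(?P<name>{_IDENT})"
    r"(?P<cardinality>[!*?+])?$"
)

CARDINALITY_NAMES = {
    "!": "exactly-one",
    "?": "zero-or-one",
    "*": "zero-or-more",
    "+": "one-or-more",
}


def parse_parameter(text: str):
    """Split one parameter into (encoding or None, name, cardinality name)."""
    m = _PARAM.match(text)
    if m is None:
        raise ValueError(f"not a valid parameter: {text!r}")
    modifier = m.group("cardinality") or "!"   # no modifier means exactly-one
    return m.group("encoding"), m.group("name"), CARDINALITY_NAMES[modifier]


def parse_signature(signature: str):
    """Parse a signature such as '(uint8::a* b c?)' into a list of parameter triples."""
    signature = signature.strip()
    if not (signature.startswith("(") and signature.endswith(")")):
        raise ValueError("a signature is a parenthesized list of parameters")
    return [parse_parameter(p) for p in signature[1:-1].split()]
```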
Glossary
active encoding module
The encoding module whose symbol table and macro table are available in the current segment of an Ion document.
The active encoding module is set by a directive.
argument
The sub-expression(s) within a macro invocation, corresponding to exactly one of the macro's parameters.
cardinality
Describes both the number of argument expressions that a parameter will accept when the macro is invoked,
and the number of values that the parameter may expand to during evaluation.
A parameter's cardinality can be zero-or-one, exactly-one, zero-or-more, or one-or-more, specified in a signature by one of the modifiers ?, !, *, or + respectively.
If no modifier is specified, cardinality defaults to exactly-one.
declaration
The association of a name with an entity (for example, a module or macro). See also definition.
Not all declarations are definitions: some introduce new names for existing entities.
definition
The specification of a new entity.
directive
A keyword or unit of data in an Ion document that affects the encoding environment, and thus the way the document's data is encoded and decoded.
In Ion 1.0 there are two directives: Ion version markers, and the symbol table directives.
Ion 1.1 adds encoding directives.
document
A stream of octets conforming to either the Ion text or binary specification.
Can consist of multiple segments, perhaps using varying versions of the Ion specification.
A document does not necessarily exist as a file, and is not necessarily finite.
E-expression
See encoding expression.
encoding directive
In an Ion 1.1 segment, a top-level S-expression annotated with $ion_encoding.
Defines a new encoding module for the segment immediately following it.
At the end of the encoding directive, the new encoding module is promoted to be the active encoding module.
The symbol table directive is effectively a less capable alternative syntax.
encoding environment
The context-specific data maintained by an Ion implementation while encoding or decoding data. In
Ion 1.0 this consists of the current symbol table; in Ion 1.1 this is expanded to also include the Ion
spec version, the current macro table, and a collection of available modules.
encoding expression
The invocation of a macro in encoded data, aka e-expression.
Starts with a macro reference denoting the function to invoke.
The Ion text format uses "smile syntax" (:macro ...) to denote e-expressions.
Ion binary devotes a large number of opcodes to e-expressions, so they can be compact.
encoding module
A module whose symbol table and macro table can be used directly in the user data stream.
expression
A serialized syntax element that may produce values.
Encoding expressions and values are both considered expressions, whereas NOP, comments, and IVMs, for example, are not.
expression group
A grouping of zero or more expressions that together form one argument.
The concrete syntax for passing a stream of expressions to a macro parameter.
In a text e-expression, a group starts with the trigraph (:: and ends with ), similar to an S-expression.
In template definition language, a group is written as an S-expression starting with .. (two dots).
inner module
A module that is defined inside another module and only visible inside the definition of that module.
Ion version marker
A keyword directive that denotes the start of a new segment encoded with a specific Ion version.
Also known as "IVM".
macro
A transformation function that accepts some number of streams of values, and produces a stream of values.
macro definition
Specifies a macro in terms of a signature and a template.
macro reference
Identifies a macro for invocation or exporting. A macro reference is lexically scoped and must always be unambiguous. It cannot be a "forward reference" to a macro that is declared later in the document; such references are not legal.
module
The data entity that defines and exports both symbols and macros.
opcode
A 1-byte, unsigned integer that tells the reader what the next expression represents
and how the bytes that follow should be interpreted.
optional parameter
A parameter that can have its corresponding subform(s) omitted when the macro is invoked.
A parameter is optional if both it and the parameters that follow it in the macro signature can accept an empty stream.
parameter
A named input to a macro, as defined by its signature.
At expansion time a parameter produces a stream of values.
qualified macro reference
A macro reference that consists of a module name and either a macro name exported by that module,
or a numeric address within the range of the module's exported macro table. In TDL, these look
like module-name::name-or-address.
required parameter
A macro parameter that is not optional and therefore requires an argument at each invocation.
rest parameter
A macro parameter (always the final parameter) declared with * or + cardinality, that accepts all remaining individual arguments to the macro as if they were in an implicit argument group.
Applies to Ion text and TDL.
Similar to "varargs" parameters in Java and other languages.
segment
A contiguous partition of a document that uses the same active encoding module.
Segment boundaries are caused by directives: an IVM starts a new segment (ending the prior segment, if any), while $ion_symbol_table and $ion_encoding directives end segments (with a new one starting immediately afterward).
shared module
A module that exists independent of the data stream of an Ion document. It is identified by a
name and version so that it can be imported by other modules.
signature
The part of a macro definition that specifies its "calling convention", in terms of the shape,
type, and cardinality of arguments it accepts.
symbol table directive
A top-level struct annotated with $ion_symbol_table. Defines a new encoding environment without any macros. Valid in Ion 1.0 and 1.1.
system e-expression
An e-expression that invokes a macro from the system module rather than from the active encoding module.
system macro
A macro provided by the Ion implementation via the system module $ion.
System macros are available at all points within Ion 1.1 segments.
system module
A standard module named $ion that is provided by the Ion implementation, implicitly installed so that the system symbols and system macros are available at all points within a document.
Subsumes the functionality of the Ion 1.0 system symbol table.
system symbol
A symbol provided by the Ion implementation via the system module $ion.
System symbols are available at all points within an Ion document, though the selection of symbols
varies by segment according to its Ion version.
TDL
See template definition language.
template
The part of a macro definition that expresses its transformation of inputs to results.
template definition language
An Ion-based, domain-specific language that declaratively specifies the output produced by a macro.
Template definition language uses only the Ion data model.
unqualified macro reference
A macro reference that consists of either a macro name or numeric address, without a qualifying module name.
These are resolved using lexical scope and must always be unambiguous.
variable expansion
In TDL, a special form that causes all argument expression(s) for the given parameter to be expanded and the result of the expansion to be substituted into the template.