What's New in Ion 1.1
This section gives a high-level overview, from an implementer's perspective, of what is new and different in Ion 1.1 compared with Ion 1.0.
Motivation
Ion 1.1 has been designed to address some of the trade-offs in Ion 1.0 to make it suitable for a wider range of applications. Ion 1.1 makes length prefixing of containers optional, and makes the interning of symbolic tokens optional as well. This gives more flexibility to applications that write data more often than they read it, or that are constrained by the writer in some way. Data density is another motivation. Certain encodings (e.g., timestamps, integers) have been made more compact and efficient, but more significantly, macros now enable applications to have very flexible interning of their data's structure. In aggregate, data transcoded from Ion 1.0 to Ion 1.1 should be more compact.
Backwards compatibility
Ion 1.1 is backwards compatible with Ion 1.0. While their encodings are distinct, they share the same data model--any data that can be produced and read by an application in Ion 1.1 has an equivalent representation in Ion 1.0.
Ion 1.1 is not required to preserve Ion 1.0 binary encodings in Ion 1.1 encoding contexts (i.e., the type codes and lower-level encodings are not preserved in the new version). The Ion Version Marker (IVM) is used to denote the different versions of the syntax. Ion 1.1 does retain text compatibility with Ion 1.0 in that the changes are a strict superset of the grammar; however, due to the updated system symbol table, symbol IDs referred to using the $n syntax for symbols beyond the 1.0 system symbol table are not compatible.
Text syntax changes
Ion 1.1 text must use the $ion_1_1 version marker at the top-level of the data stream or document.
The only syntax change for the text format is the introduction of encoding expression (E-expression) syntax, which allows for the invocation of macros in the data stream. This syntax is grammatically similar to S-expressions, except that these expressions are opened with (: and closed with ). For example, (:a 1 2) would expand the macro named a with the arguments 1 and 2. See the "Macros, templates, and encoding expressions" section for details.
This syntax is allowed anywhere an Ion value is allowed:
E-expression examples
// At the top level
(:foo 1 2)
// Nested in a list
[1, 2, (:bar 3 4)]
// Nested in an S-expression
(cons a (:baz b))
// Nested in a struct
{c: (:bop d)}
E-expressions are also grammatically allowed in the field name position of a struct and when used there, indicate that the expression should expand to a struct value that is merged into the enclosing struct:
E-Expression in field position of struct
{
  a: 1,
  b: 2,
  (:foo 1 2),
  c: 3,
}
In the above example, the E-expression (:foo 1 2) must evaluate into a struct that will be merged between the b field and the c field. If it does not evaluate to a struct, then the above is an error.
Binary encoding changes
Ion 1.1 binary encoding reorganizes the type descriptors to support compact E-expressions and to make certain encodings more compact, at the cost of making certain lower-priority encodings marginally less compact. The IVM for this encoding is the octet sequence 0xE0 0x01 0x01 0xEA.
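As a rough illustration of how a reader might dispatch on the binary IVM, the sketch below checks the four-octet marker and extracts the version. The function name parse_binary_ivm is hypothetical, and reading the two middle octets as major and minor version is an assumption carried over from the Ion 1.0 marker (0xE0 0x01 0x00 0xEA).

```python
from typing import Optional, Tuple


def parse_binary_ivm(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) if `data` starts with a binary IVM, else None.

    Assumption: the marker has the shape 0xE0 MAJOR MINOR 0xEA, as in Ion 1.0.
    """
    if len(data) < 4 or data[0] != 0xE0 or data[3] != 0xEA:
        return None
    return data[1], data[2]


# The Ion 1.1 marker given in the text above.
print(parse_binary_ivm(bytes([0xE0, 0x01, 0x01, 0xEA])))  # (1, 1)
```

A reader could use the returned pair to select the Ion 1.0 or Ion 1.1 decoding path for the values that follow the marker.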
Inlined symbolic tokens
In binary Ion 1.0, symbol values, field names, and annotations are required to be encoded using a symbol ID in the local symbol table. For some use cases (e.g., write-once, read-maybe logs) this creates a burden on the writer and may not actually be efficient for an application. Ion 1.1 introduces optional binary syntax for encoding inline UTF-8 sequences for these tokens, giving an encoder flexibility in whether and when to add a given symbolic token to the symbol table.
Ion text requires no change for this feature as it already had inline symbolic tokens without using the local symbol table. Ion text also has compatible syntax for representing the local symbol table and encoding of symbolic tokens with their position in the table (i.e., the $id syntax).
Delimited containers
In Ion 1.0, all data is length prefixed. While this is good for optimizing the reading of data, it requires an Ion encoder to buffer any data in memory to calculate the data's length. Ion 1.1 introduces optional binary syntax to allow containers to be encoded with an end marker instead of a length prefix.
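The buffering trade-off can be sketched with a toy format. This is not the Ion wire format: the two container markers below are hypothetical placeholders, the length is a single byte for simplicity, and only the 0xF0 end marker comes from this document. Each function returns the largest number of body bytes it had to hold in memory at once.

```python
import io
from typing import BinaryIO, Iterable

END_DELIMITER = 0xF0       # end-of-container opcode named in this document
TOY_PREFIXED_LIST = 0x01   # hypothetical marker, NOT a real Ion opcode
TOY_DELIMITED_LIST = 0x02  # hypothetical marker, NOT a real Ion opcode


def write_length_prefixed(children: Iterable[bytes], out: BinaryIO) -> int:
    """The length precedes the body, so every child is buffered first."""
    body = b"".join(children)
    if len(body) > 0xFF:
        raise ValueError("toy format only supports one-byte lengths")
    out.write(bytes([TOY_PREFIXED_LIST, len(body)]))
    out.write(body)
    return len(body)


def write_delimited(children: Iterable[bytes], out: BinaryIO) -> int:
    """Children are streamed out one at a time, then the end marker."""
    out.write(bytes([TOY_DELIMITED_LIST]))
    peak = 0
    for child in children:
        out.write(child)
        peak = max(peak, len(child))
    out.write(bytes([END_DELIMITER]))
    return peak


children = [b"\xAA", b"\xBB\xCC", b"\xDD"]
prefixed, delimited = io.BytesIO(), io.BytesIO()
print(write_length_prefixed(children, prefixed))  # 4: whole body buffered
print(write_delimited(children, delimited))       # 2: largest single child
```

The delimited form costs the reader the ability to skip the container without parsing it, which is why the document describes it as optional rather than as a replacement for length prefixing.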
Low-level binary encoding changes
Ion 1.0's VarUInt and VarInt encoding primitives used big-endian byte order and used the high bit of each byte to indicate whether it was the final byte in the encoding. VarInt used an additional bit in the first byte to represent the integer's sign. Ion 1.1 replaces these primitives with more optimized versions called FlexUInt and FlexInt.
FlexUInt and FlexInt use little-endian byte order, avoiding the need for reordering on common architectures like x86, aarch64, and RISC-V.
Rather than using a bit in each byte to indicate the width of the encoding, FlexUInt and FlexInt front-load the continuation bits. In most cases, this means that these bits all fit in the first byte of the representation, allowing a reader to determine the complete size of the encoding without having to inspect each byte individually.
Finally, FlexInt does not use a separate bit to indicate its value's sign. Instead, it uses two's complement representation, allowing it to share much of the same structure and parsing logic as its unsigned counterpart.
Benchmarks have shown that in aggregate, these encoding changes are between 1.25x and 3x faster than Ion 1.0's VarUInt and VarInt encodings, depending on the host architecture.
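The following sketch shows one layout consistent with the description above: little-endian bytes, with the width signalled in the lowest-order bits. The exact bit layout is an assumption, since this section does not spell it out. Here a width of n bytes is marked by n-1 zero bits followed by a one bit, with the value stored in the remaining higher bits. The function names are illustrative only.

```python
from typing import Tuple


def _flex_width(data: bytes) -> int:
    """Width in bytes = (zero bits before the first set bit) + 1. ASSUMED layout."""
    width = 1
    for byte in data:
        if byte == 0:
            width += 8
            continue
        width += (byte & -byte).bit_length() - 1  # trailing zeros in this byte
        break
    else:
        raise ValueError("no width marker found")
    if len(data) < width:
        raise ValueError("truncated encoding")
    return width


def encode_flex_uint(value: int) -> bytes:
    if value < 0:
        raise ValueError("FlexUInt cannot hold negative values")
    n = 1
    while value >= 1 << (7 * n):  # each byte contributes 7 value bits
        n += 1
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def decode_flex_uint(data: bytes) -> Tuple[int, int]:
    """Return (value, bytes_consumed)."""
    n = _flex_width(data)
    return int.from_bytes(data[:n], "little") >> n, n


def encode_flex_int(value: int) -> bytes:
    n = 1
    while not -(1 << (7 * n - 1)) <= value < (1 << (7 * n - 1)):
        n += 1
    raw = ((value << n) | (1 << (n - 1))) & ((1 << (8 * n)) - 1)  # two's complement
    return raw.to_bytes(n, "little")


def decode_flex_int(data: bytes) -> Tuple[int, int]:
    """Same width logic as FlexUInt; only the sign interpretation differs."""
    n = _flex_width(data)
    return int.from_bytes(data[:n], "little", signed=True) >> n, n


print(encode_flex_uint(0).hex())    # 01
print(encode_flex_uint(127).hex())  # ff
print(encode_flex_uint(128).hex())  # 0202
```

Note how both decoders share _flex_width and differ only in whether the raw integer is read as signed, which mirrors the point that FlexInt reuses the structure of its unsigned counterpart.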
Ion 1.1 supplants Ion 1.0's Int encoding primitive with a new encoding called FixedInt, which uses two's complement notation instead of sign-and-magnitude. A corresponding FixedUInt primitive has also been introduced; its encoding is the same as Ion 1.0's UInt primitive.
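To make the difference in sign handling concrete, the sketch below encodes small integers both ways. The helper names are hypothetical. Big-endian byte order is used for both helpers purely so that only the sign representation differs; that choice is an assumption of this sketch and says nothing about the byte order FixedInt actually uses.

```python
def sign_magnitude(magnitude: int, negative: bool, width: int) -> bytes:
    """Sign bit in the highest bit, magnitude in the rest (Ion 1.0 Int style)."""
    if magnitude < 0 or magnitude >= 1 << (8 * width - 1):
        raise ValueError("magnitude does not fit")
    sign_bit = (1 << (8 * width - 1)) if negative else 0
    return (magnitude | sign_bit).to_bytes(width, "big")


def twos_complement(value: int, width: int) -> bytes:
    """Two's complement; byte order here is an ASSUMPTION made for comparison."""
    return value.to_bytes(width, "big", signed=True)


print(sign_magnitude(1, True, 1).hex())  # 81
print(twos_complement(-1, 1).hex())      # ff
print(sign_magnitude(0, True, 1).hex())  # 80  (negative zero exists here)
print(twos_complement(-128, 1).hex())    # 80  (same bits, but a real value)
```

Sign-and-magnitude spends one bit pattern on negative zero, while two's complement assigns every pattern a distinct integer, so a two's complement field cannot express negative zero.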
A new primitive encoding type, FlexSym, has been introduced to flexibly encode symbol IDs and symbolic tokens with inline text.
Type encoding changes
All Ion types use the new low-level encodings as specified in the previous section. Many of the opcodes used in Ion 1.0 have been re-organized primarily to make E-expressions compact.
Typed null values are now encoded in two bytes using the 0xEB opcode.
Lists and S-expressions have two encodings: a length-prefixed encoding and a new delimited form that ends with the 0xF0 opcode.
Struct values have the option of encoding their field names as a FlexSym, enabling them to write field name text inline instead of adding all names to the symbol table. There is now also a delimited form.
Similarly, symbol values now also have the option of encoding their symbol text inline.
Annotation sequences are a prefix to the value they decorate, and no longer have an outer length container. They are now encoded with one of three opcodes:
- 0xE7, which is followed by a single annotation and then the decorated value.
- 0xE8, which is followed by two annotations and then the decorated value.
- 0xE9, which is followed by a FlexUInt indicating the number of bytes used to encode the annotations sequence, the sequence itself, and then the decorated value.
The last of these encodings is similar to how Ion 1.0 annotations are encoded, with the exception that there is no outer length in addition to the annotations sequence length.
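A writer might choose among the three annotation opcodes as sketched below. The function name is hypothetical, the annotations are assumed to be already encoded as byte strings, and the length field is limited to a single byte of the form (length << 1) | 1, which is an assumed FlexUInt layout rather than something stated in this section.

```python
from typing import List


def build_annotation_prefix(encoded_annotations: List[bytes]) -> bytes:
    """Return the bytes that precede the decorated value."""
    count = len(encoded_annotations)
    if count == 0:
        raise ValueError("no annotations to encode")
    body = b"".join(encoded_annotations)
    if count == 1:
        return bytes([0xE7]) + body
    if count == 2:
        return bytes([0xE8]) + body
    if len(body) >= 128:
        raise ValueError("sketch only supports a one-byte length field")
    # ASSUMPTION: one-byte FlexUInt is (length << 1) | 1.
    return bytes([0xE9, (len(body) << 1) | 1]) + body


print(build_annotation_prefix([b"\x03"]).hex())                      # e703
print(build_annotation_prefix([b"\x03", b"\x05", b"\x07"]).hex())    # e907030507
```

Only the third form carries a length, so the common cases of one or two annotations avoid that overhead.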
Integers now use a FixedInt sub-field instead of the Ion 1.0 encoding which used sign-and-magnitude (with two opcodes).
Decimals are structurally identical to their Ion 1.0 counterpart with the exception of the negative zero coefficient. The Ion 1.1 FlexInt encoding is two's complement, so negative zero cannot be encoded directly with it. Instead, an opcode is allocated specifically for encoding decimals with a negative zero coefficient.
Timestamps no longer encode their sub-field components as octet-aligned fields.
The Ion 1.1 format uses a packed bit encoding and has a biased form (encoding the year field as an offset from 1970) so that common timestamp encodings easily fit in a 64-bit word at microsecond and nanosecond precision (with UTC offset or unknown UTC offset). Benchmarks have shown this new encoding to be 59% faster to encode and 21% faster to decode. A non-biased, arbitrary-length timestamp with packed bit encoding is defined for uncommon cases.
Encoding expressions in binary
In binary, E-expressions are encoded with an opcode that includes the macro identifier or an opcode that specifies a FlexUInt for the macro identifier. The identifier is followed by the encoding of the arguments to the E-expression. The macro's definition statically determines how the arguments are to be laid out. An argument may be a full Ion value with a leading opcode (sometimes called a "tagged" value), or it could be a lower-level encoding (e.g., a fixed width integer or FlexInt/FlexUInt).
Macros, templates, and encoding expressions
Ion 1.1 introduces a new primitive called an encoding expression (E-expression). These expressions are (in text syntax) similar to S-expressions, but they are not part of the data model and are evaluated into one or more Ion values (called a stream) which enable compact representation of Ion data. E-expressions represent the invocation of either system defined or user defined macros with arguments that are either themselves E-expressions, value literals, or container constructors (list, sexp, struct syntax containing E-expressions) corresponding to the formal parameters of the macro's definition. The resulting stream is then expanded into the resulting Ion data model.
At the top level, the stream becomes individual top-level values. Consider for illustrative purposes an E-expression (:values 1 2 3) that evaluates to the stream 1, 2, 3 and (:none) that evaluates to the empty stream. In the following examples, values and none are the names of the macros being invoked and each line is equivalent.
Top-level e-expressions
// Encoding
a (:values 1 2 3) b (:none) c
// Evaluates to
a 1 2 3 b c
Within a list or S-expression, the stream becomes additional child elements in the collection.
E-expressions in lists
// Encoding
[a, (:values 1 2 3), b, (:none), c]
// Evaluates to
[a, 1, 2, 3, b, c]
E-expressions in S-expressions
// Encoding
(a (:values 1 2 3) b (:none) c)
// Evaluates to
(a 1 2 3 b c)
Within a struct at the field name position, the resulting stream must contain only structs, and each of the fields in those structs becomes a field in the enclosing struct (no value portion is given for the E-expression); at the value position, each value in the resulting stream becomes a field with the field name that precedes the E-expression (an empty stream elides the field altogether). In the following examples, let us define (:make_struct c 5) that evaluates to a single struct {c: 5}.
E-expressions in structs
// Encoding
{
  a: (:values 1 2 3),
  b: 4,
  (:make_struct c 5),
  d: 6,
  e: (:none)
}
// Evaluates to
{
  a: 1,
  a: 2,
  a: 3,
  b: 4,
  c: 5,
  d: 6
}
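The splicing rules illustrated above can be modelled in a few lines. This is a simplified sketch, not an Ion implementation: EExp, Struct, MACROS, and expand are hypothetical names, symbols are represented as Python strings, macro arguments are not themselves expanded, and an E-expression in field-name position is modelled as a field whose name is None.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class EExp:
    name: str
    args: tuple = ()


@dataclass
class Struct:
    fields: list  # (name, value) pairs; name None marks field-name position


# Toy stand-ins for the three macros used in the examples above.
MACROS = {
    "values": lambda *args: list(args),
    "none": lambda: [],
    "make_struct": lambda name, value: [Struct([(name, value)])],
}


def expand(value: Any) -> List[Any]:
    """Return the stream (list) of values that `value` evaluates to."""
    if isinstance(value, EExp):
        return MACROS[value.name](*value.args)
    if isinstance(value, list):
        # Each child's stream is spliced into the enclosing sequence.
        return [[item for child in value for item in expand(child)]]
    if isinstance(value, Struct):
        fields = []
        for name, child in value.fields:
            if name is None:
                for merged in expand(child):
                    if not isinstance(merged, Struct):
                        raise ValueError("field-position E-expression must yield structs")
                    fields.extend(merged.fields)
            else:
                # One field per stream element; an empty stream drops the field.
                fields.extend((name, item) for item in expand(child))
        return [Struct(fields)]
    return [value]


print(expand(["a", EExp("values", (1, 2, 3)), "b", EExp("none"), "c"]))
# [['a', 1, 2, 3, 'b', 'c']]
```

Representing struct fields as an ordered list of pairs, rather than a dictionary, is what allows the repeated field name a in the expanded struct.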
Encoding context and modules
In Ion 1.0, there is a single encoding context which is the local symbol table. In Ion 1.1, the encoding context becomes the following:
- The local symbol table which is a list of strings. This is used to encode/decode symbolic tokens.
- The local macro table which is a list of macros. This is used to reference macros that can be invoked by E-expressions.
- A mapping of a string name to module which is an organizational unit of symbol definitions and macro definitions. Within the encoding context, this name is unique and used to address a module's contents either as the list of symbols to install into the local symbol table, the list of macros to install into the local macro table, or to qualify the name of a macro in a text E-expression or the definition of a macro.
The module is a new concept in Ion 1.1. It contains:
- A list of strings representing the symbol table of the module.
- A list of macro definitions.
Modules can be imported from the catalog (they subsume shared symbol tables), but can also be defined locally. Modules are referenced as a group to allocate entries in the local symbol table and local macro table (e.g., the local symbol table is initially, implicitly allocated with the symbols in the $ion module).
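One way to picture the encoding context is as three pieces of state plus operations that fill the two tables from named modules. The sketch below is illustrative only: the class and method names are hypothetical, macro definitions are reduced to their names, and rebuilding each table from scratch on install is an assumption rather than specified behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Module:
    symbols: List[str] = field(default_factory=list)
    macro_names: List[str] = field(default_factory=list)  # stand-in for definitions


@dataclass
class EncodingContext:
    modules: Dict[str, Module] = field(default_factory=dict)
    symbol_table: List[str] = field(default_factory=list)
    macro_table: List[Tuple[str, str]] = field(default_factory=list)

    def declare(self, name: str, module: Module) -> None:
        # Module names are unique within the encoding context.
        if name in self.modules:
            raise ValueError(f"module {name!r} already declared")
        self.modules[name] = module

    def install_symbols(self, module_names: List[str]) -> None:
        # ASSUMPTION: the table is rebuilt from the listed modules, in order.
        self.symbol_table = [s for m in module_names for s in self.modules[m].symbols]

    def install_macros(self, module_names: List[str]) -> None:
        self.macro_table = [
            (m, macro) for m in module_names for macro in self.modules[m].macro_names
        ]


ctx = EncodingContext()
ctx.declare("foo", Module(symbols=["x", "y"], macro_names=["bar"]))
ctx.declare("baz", Module(symbols=["z"], macro_names=["qux"]))
ctx.install_symbols(["foo", "baz"])
ctx.install_macros(["baz", "foo"])
print(ctx.symbol_table)  # ['x', 'y', 'z']
print(ctx.macro_table)   # [('baz', 'qux'), ('foo', 'bar')]
```

Keeping symbol installation and macro installation separate reflects the fact that a module's symbols and macros are allocated into two different tables, possibly in different orders.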
Ion 1.1 introduces a new system value (an encoding directive) for the encoding context (see the TBD section for details.)
Ion encoding directive example
$ion_encoding::{
  modules: [ /* module declarations - including imports */ ],
  install_symbols: [ /* names of declared modules */ ],
  install_macros: [ /* names of declared modules */ ]
}
Macro definitions
Macros can be defined by a user either directly in a local module within an encoding directive or in a module defined externally (i.e., a shared module). A macro either has a name, which must be unique within its module, or has no name.
Ion 1.1 defines a list of system macros that are built into the module named $ion. Unlike the system symbol table, which is always installed and accessible in the local symbol table, the system macros are always accessible to E-expressions but are not installed in the local macro table by default.
In Ion binary, macros are always addressed in E-expressions by the offset in the local macro table. System macros may be addressed by the system macro identifier using a specific encoding opcode. In Ion text, a macro may be addressed by its offset in the local macro table (mirroring binary), by its name if that name is unambiguous within the local encoding context, or by qualifying the macro name/offset with the module name in the encoding context. An E-expression can only refer to macros installed in the local macro table or a macro from the system module. In text, an E-expression referring to a system macro that is not installed in the local macro table must use a qualified name with the $ion module name.
For illustrative purposes, let's consider the module named foo that has a macro named bar at offset 5 installed at the beginning of the local macro table.
E-expressions name resolution
// allowed if there are no other macros named 'bar'
(:bar)
// fully qualified by module--always allowed
(:foo:bar)
// by local macro table offset
(:5)
// In text, system macros are always addressable by name.
// In binary, system macros may be invoked using a separate
// opcode.
(:$ion:none)
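The text-side resolution rules above can be approximated as follows. This is a sketch under stated assumptions: resolve_macro is a hypothetical name, the macro table is a list of (module, name) pairs, SYSTEM_MACROS lists only the two system macros mentioned in this document, and module-qualified offsets are not handled.

```python
from typing import List, Tuple

MacroEntry = Tuple[str, str]        # (module name, macro name)
SYSTEM_MACROS = {"none", "values"}  # illustrative subset only


def resolve_macro(ref: str, table: List[MacroEntry]) -> MacroEntry:
    """Resolve the identifier of a text E-expression, e.g. 'bar', 'foo:bar', '5'."""
    module, sep, name = ref.rpartition(":")
    if sep:
        # Qualified by module name.
        if module == "$ion" and name in SYSTEM_MACROS:
            return ("$ion", name)  # system macros need not be installed
        if (module, name) in table:
            return (module, name)
        raise LookupError(f"no installed macro {ref!r}")
    if ref.isdigit():
        return table[int(ref)]  # local macro table offset
    matches = [entry for entry in table if entry[1] == ref]
    if len(matches) != 1:
        raise LookupError(f"macro name {ref!r} is missing or ambiguous")
    return matches[0]


# Module foo installed at the start of the table, with bar at offset 5.
table = [("foo", f"m{i}") for i in range(5)] + [("foo", "bar")]
print(resolve_macro("bar", table))        # ('foo', 'bar')
print(resolve_macro("foo:bar", table))    # ('foo', 'bar')
print(resolve_macro("5", table))          # ('foo', 'bar')
print(resolve_macro("$ion:none", table))  # ('$ion', 'none')
```

If a second installed module also defined a macro named bar, the unqualified lookup would raise, while the qualified and offset forms would continue to work.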
Macro definition language
User defined macros are defined by their parameters and template which defines how they are invoked and what stream of data they evaluate to. This template is defined using a domain specific Ion macro definition language with S-expressions. A template defines a list of zero or more parameters that it can accept. These parameters each have their own cardinality of expression arguments which can be specified as exactly one, zero or one, zero or more, and one or more. Furthermore the template defines what type of argument can be accepted by each of these parameters:
- "Tagged" values, whose encodings always begin with an opcode.
- "Tagless" values, whose encodings do not begin with an opcode and are therefore both more compact and less flexible (for example: flex_int, int32, float16).
- Specific macro-shaped arguments to allow for structural composition of macros and efficient encoding in binary.
The macro definition includes a template body that defines how the macro is expanded. In the language, system macros, macros defined in previously defined modules in the encoding context, and macros defined previously in the current module are available to be invoked with (name ...) syntax where name is the macro to be invoked. Certain names in the expression syntax are reserved for special forms (for example, literal and if_none). When a macro name is shadowed by a special form, or is ambiguous with respect to all macros visible, it can always be qualified with (':module:name' ...) syntax where module is the name of the module and name is the offset or name of the macro. Referring to a previously defined macro name within a module may be qualified with (':name' ...) syntax.
Shared Modules
Ion 1.1 extends the concept of a shared symbol table to be a shared module. An Ion 1.0 shared symbol table is a shared module with no macro definitions. A new schema for the convention of serializing shared modules in Ion is introduced in Ion 1.1. An Ion 1.1 implementation should support holding both Ion 1.0 shared symbol tables and Ion 1.1 shared modules in its catalog.
System Symbol Table Changes
The system symbol table in Ion 1.1 replaces the Ion 1.0 symbol table with new symbols. However, the system symbols are not required to be in the symbol table—they are always available to use.