Ion Binary Encoding
This document covers the binary Ion format.
Value Streams
In the binary format, a value stream always starts with a binary version marker (BVM) that specifies the precise Ion version used to encode the data that follows:
                       7    0 7     0 7     0 7    0
                      +------+-------+-------+------+
binary version marker | 0xE0 | major | minor | 0xEA |
                      +------+-------+-------+------+
The four-octet BVM also acts as a “magic cookie” to distinguish Ion binary data from other formats, including Ion text data. Its first octet (in sequence from the beginning of the stream) is 0xE0 and its fourth octet is 0xEA. The second and third octets contain major and minor version numbers. The only valid BVM, identifying Ion 1.0, is 0xE0 0x01 0x00 0xEA.
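The BVM check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any Ion library; the function name `parse_bvm` is an assumption made for this example.

```python
def parse_bvm(data):
    """Return (major, minor) from the four-octet BVM at the start of data."""
    if len(data) < 4 or data[0] != 0xE0 or data[3] != 0xEA:
        raise ValueError("not an Ion binary version marker")
    major, minor = data[1], data[2]
    # Ion 1.0 is the only version this document defines.
    if (major, minor) != (1, 0):
        raise ValueError(f"unsupported Ion version {major}.{minor}")
    return major, minor


print(parse_bvm(bytes([0xE0, 0x01, 0x00, 0xEA])))  # (1, 0)
```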
An Ion value stream starts with a BVM, followed by zero or more values which contain the actual data. These values are generally referred to as “top-level values”.
             +-------------------------+
value stream |  binary version marker  |
             +-------------------------+
             :          value          :
             +=========================+
                          ⋮
             +=========================+
             :  binary version marker  :
             +=========================+
             :          value          :
             +=========================+
                          ⋮
A value stream can contain other, perhaps different, BVMs interspersed with the top-level values. Each BVM resets the decoder to the appropriate initial state for the given version of Ion. This allows the stream to be constructed by concatenating data from different sources, without requiring transcoding to a single version of the format.
Note: The BVM is not a value and should not be visible to or manipulable by the user; it is internal data managed by and for the Ion implementation.
Basic Field Formats
Binary-encoded Ion values consist of one or more fields, and the fields use a small number of basic formats (separate from the Ion types visible to users).
UInt and Int Fields
UInt and Int fields represent fixed-length unsigned and signed integer values. These field formats are always used in some context that clearly indicates the number of octets in the field.
            7                       0
           +-------------------------+
UInt field |          bits           |
           +-------------------------+
           :          bits           :
           +=========================+
                        ⋮
           +=========================+
           :          bits           :
           +=========================+
UInts are sequences of octets, interpreted as big-endian.
            7  6                   0
          +---+---------------------+
Int field |   |        bits         |
          +---+---------------------+
            ^
            |
            +--sign
          +=========================+
          :          bits           :
          +=========================+
                       ⋮
          +=========================+
          :          bits           :
          +=========================+
Ints are sequences of octets, interpreted as sign-and-magnitude big-endian integers (with the sign on the highest-order bit of the first octet). This means that the representations of 123456 and -123456 differ only in their sign bit. (See Signed Number Representations for more info.)
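A minimal Python sketch of both fixed-length formats follows. It assumes the caller already knows the field length and passes exactly those octets; the function names are illustrative only. Python integers cannot represent negative zero, so an Int whose magnitude is zero decodes to plain 0 here regardless of the sign bit.

```python
def decode_uint(octets):
    """Interpret the octets as a big-endian unsigned integer."""
    return int.from_bytes(octets, "big")


def decode_int(octets):
    """Interpret the octets as a big-endian sign-and-magnitude integer."""
    if not octets:
        return 0
    # Mask off the sign bit of the first octet to get the magnitude.
    magnitude = int.from_bytes(bytes([octets[0] & 0x7F]) + octets[1:], "big")
    return -magnitude if octets[0] & 0x80 else magnitude


# 123456 is 0x01E240; the negative form differs only in the top bit.
print(decode_int(bytes([0x01, 0xE2, 0x40])))  # 123456
print(decode_int(bytes([0x81, 0xE2, 0x40])))  # -123456
```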
VarUInt and VarInt Fields
VarUInt and VarInt fields represent self-delimiting, variable-length unsigned and signed integer values. These field formats are always used in a context that does not indicate the number of octets in the field; the last octet (and only the last octet) has its high-order bit set to terminate the field.
                7  6                   0        7                      0
              +===+=====================+     +---+---------------------+
VarUInt field : 0 :         bits        :  …  | 1 |         bits        |
              +===+=====================+     +---+---------------------+
VarUInts are a sequence of octets. The high-order bit of the last octet is one, indicating the end of the sequence. All other high-order bits must be zero.
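As a rough illustration, a VarUInt can be read and written like this in Python. The names `read_varuint` and `write_varuint` are assumptions for this sketch; the reader raises `IndexError` if the buffer ends before a terminating octet is seen.

```python
def read_varuint(buf, pos=0):
    """Return (value, next_pos) for the VarUInt starting at buf[pos]."""
    value = 0
    while True:
        octet = buf[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)  # low 7 bits carry data
        if octet & 0x80:                       # high bit marks the last octet
            return value, pos


def write_varuint(value):
    """Encode a non-negative integer as a VarUInt."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()        # most significant 7-bit group first
    groups[-1] |= 0x80      # set the end flag on the last octet only
    return bytes(groups)


print(read_varuint(bytes([0x0F, 0xD0])))  # (2000, 2)
print(write_varuint(2000).hex())          # 0fd0
```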
               7   6  5               0        7                      0
             +===+                           +---+
VarInt field : 0 :       payload          …  | 1 |       payload
             +===+                           +---+
                 +---+-----------------+         +=====================+
                 |   |   magnitude     |  …      :       magnitude     :
                 +---+-----------------+         +=====================+
               ^   ^                           ^
               |   |                           |
               |   +--sign                     +--end flag
               +--end flag
VarInts are sign-and-magnitude integers, like Ints. Their layout is complicated, as there is one special leading bit (the sign) and one special trailing bit (the terminator). In the above diagram, we put the two concepts on different layers.
The high-order bit in the top layer is an end-of-sequence marker. It must be set on the last octet in the representation and clear in all other octets. The second-highest order bit (0x40) is a sign flag in the first octet of the representation, but part of the extension bits for all other octets. For single-octet VarInt values, this collapses down to:
                            7   6  5           0
                          +---+---+-------------+
single octet VarInt field | 1 |   |  magnitude  |
                          +---+---+-------------+
                                ^
                                |
                                +--sign
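The two-layer layout can be decoded as sketched below in Python. The function name `read_varint` is hypothetical. It returns the magnitude and the sign flag separately so that a negative zero (for example the single octet 0xC0) is not silently lost.

```python
def read_varint(buf, pos=0):
    """Return (magnitude, is_negative, next_pos) for the VarInt at buf[pos]."""
    octet = buf[pos]
    pos += 1
    is_negative = bool(octet & 0x40)  # sign flag lives only in the first octet
    magnitude = octet & 0x3F          # so the first octet carries 6 data bits
    while not octet & 0x80:           # end flag not yet seen
        octet = buf[pos]
        pos += 1
        magnitude = (magnitude << 7) | (octet & 0x7F)
    return magnitude, is_negative, pos


print(read_varint(bytes([0xC1])))        # (1, True, 1)
print(read_varint(bytes([0x01, 0x80])))  # (128, False, 2)
```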
Typed Value Formats
A value consists of a one-octet type descriptor, possibly followed by a length in octets, possibly followed by a representation.
       7       4 3       0
      +---------+---------+
value |    T    |    L    |
      +---------+---------+======+
      :     length [VarUInt]     :
      +==========================+
      :      representation      :
      +==========================+
The type descriptor octet has two subfields: a four-bit type code T, and a four-bit length L.
Each value of T identifies the format of the representation, and generally (though not always) identifies an Ion datatype. Each type code T defines the semantics of its length field L as described below.
The length value – the number of octets in the representation field(s) – is encoded in L and/or length fields, depending on the magnitude and on some particulars of the actual type. The length field is empty (taking up no octets in the message) if we can store the length value inside L itself. If the length field is not empty, then it is a single VarUInt field. The representation may also be empty (no octets) in some cases, as detailed below.
Unless otherwise defined, the length of the representation is encoded as follows:
- If the value is null (for that type), then L is set to 15.
- If the representation is less than 14 bytes long, then L is set to the length, and the length field is omitted.
- If the representation is at least 14 bytes long, then L is set to 14, and the length field is set to the representation length, encoded as a VarUInt field.
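The default length rule above can be sketched in Python as follows. The function name `read_type_descriptor` is an assumption for this example, and it deliberately ignores the per-type exceptions described in the later sections (for instance bool, which stores its representation in L).

```python
def read_type_descriptor(buf, pos=0):
    """Return (T, L, representation_length, next_pos) using the default rule.

    representation_length is None when L is 15 (a typed null).
    """
    t, l = buf[pos] >> 4, buf[pos] & 0x0F
    pos += 1
    if l == 15:
        return t, l, None, pos
    if l < 14:
        return t, l, l, pos  # the length fits in L; no length field follows
    # L == 14: the actual length follows as a VarUInt.
    length = 0
    while True:
        octet = buf[pos]
        pos += 1
        length = (length << 7) | (octet & 0x7F)
        if octet & 0x80:
            return t, l, length, pos


print(read_type_descriptor(bytes([0x8E, 0x8E])))  # (8, 14, 14, 2)
```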
0x0: null
            7       4 3       0
           +---------+---------+
Null value |   0x0   |    15   |
           +---------+---------+
Values of type null always have empty lengths and representations. The only valid L value is 15, representing the only value of this type, null.null.
NOP Padding
         7       4 3       0
        +---------+---------+
NOP Pad |   0x0   |    L    |
        +---------+---------+======+
        :     length [VarUInt]     :
        +--------------------------+
        |      ignored octets      |
        +--------------------------+
In addition to null.null, the null type code is used to encode padding that has no operation (NOP padding). This can be used for “binary whitespace” when alignment of octet boundaries is needed or to support in-place editing. Such encodings are not considered values and are ignored by the processor.
In this encoding, L specifies the number of octets that should be ignored.
The following is a single-byte NOP pad. The type descriptor octet itself counts toward the padding:
0x00
The following is a two byte NOP pad:
0x01 0xFE
Note that the single byte of “payload” 0xFE is arbitrary and ignored by the parser.
The following is a 16 byte NOP pad:
0x0E 0x8E 0x00 ... <12 arbitrary octets> ... 0x00
NOP padding is valid anywhere a value can be encoded, except for within an annotation wrapper. NOP padding in struct requires additional encoding considerations.
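A reader might skip padding along these lines. This Python sketch uses the hypothetical name `skip_nop_pad` and is checked against the three example pads above; it is an illustration rather than a reference implementation.

```python
def skip_nop_pad(buf, pos=0):
    """Return the position just past the NOP pad that starts at buf[pos]."""
    t, l = buf[pos] >> 4, buf[pos] & 0x0F
    if t != 0 or l == 15:
        # T=0 with L=15 is the value null.null, not padding.
        raise ValueError("not a NOP pad")
    pos += 1
    if l == 14:
        # The count of ignored octets follows as a VarUInt.
        length = 0
        while True:
            octet = buf[pos]
            pos += 1
            length = (length << 7) | (octet & 0x7F)
            if octet & 0x80:
                break
    else:
        length = l
    return pos + length


print(skip_nop_pad(bytes([0x00])))                        # 1
print(skip_nop_pad(bytes([0x01, 0xFE])))                  # 2
print(skip_nop_pad(bytes([0x0E, 0x8E]) + bytes(14)))      # 16
```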
0x1: bool
            7       4 3       0
           +---------+---------+
Bool value |   0x1   |   rep   |
           +---------+---------+
Values of type bool always have empty lengths, and their representation is stored in the typedesc itself (rather than after the typedesc). A representation of 0 means false; a representation of 1 means true; and a representation of 15 means null.bool.
0x2 and 0x3: int
Values of type int
are stored using two type codes: 2 for positive values
and 3 for negative values. Both codes use a UInt subfield to store the magnitude.
           7       4 3       0
          +---------+---------+
Int value |0x2 / 0x3|    L    |
          +---------+---------+======+
          :     length [VarUInt]     :
          +==========================+
          :     magnitude [UInt]     :
          +==========================+
Zero is always stored as positive; negative zero is illegal.
If L is 0 the value is zero, and there are no length or magnitude subfields. As a result, when T is 3, both L and the magnitude subfield must be non-zero.
With either type code 2 or 3, if L is 15, then the value is null.int and the magnitude is empty. Note that this implies there are two equivalent binary representations of null integer values.
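Putting the int rules together, a decoder could look like the Python sketch below. The name `decode_int_value` is illustrative; `None` stands in for null.int, and negative zero is rejected as the text requires.

```python
def decode_int_value(buf, pos=0):
    """Return (value, next_pos); value is None for null.int."""
    t, l = buf[pos] >> 4, buf[pos] & 0x0F
    if t not in (2, 3):
        raise ValueError("not an int type descriptor")
    pos += 1
    if l == 15:
        return None, pos
    if l == 14:
        # Magnitude length follows as a VarUInt.
        length = 0
        while True:
            octet = buf[pos]
            pos += 1
            length = (length << 7) | (octet & 0x7F)
            if octet & 0x80:
                break
    else:
        length = l
    magnitude = int.from_bytes(buf[pos:pos + length], "big")
    if t == 3 and magnitude == 0:
        raise ValueError("negative zero is illegal")
    return (-magnitude if t == 3 else magnitude), pos + length


print(decode_int_value(bytes([0x23, 0x01, 0xE2, 0x40])))  # (123456, 4)
print(decode_int_value(bytes([0x31, 0x05])))              # (-5, 2)
```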
0x4: float
             7       4 3       0
            +---------+---------+
Float value |   0x4   |    L    |
            +---------+---------+-----------+
            |   representation [IEEE-754]   |
            +-------------------------------+
Floats are encoded as big endian octets of their IEEE 754 bit patterns.
The L field of floats encodes the size of the IEEE-754 value.
- If L is 4, then the representation is 32 bits (4 octets).
- If L is 8, then the representation is 64 bits (8 octets).
There are two exceptions for the L field:
- If L is 0, then the value is 0e0 and the representation is empty.
  - Note, this is not to be confused with -0e0, which is a distinct value and in current Ion must be encoded as a normal IEEE float bit pattern.
- If L is 15, then the value is null.float and the representation is empty.
Note: Ion 1.0 only supports 32-bit and 64-bit float values (i.e. L size 4 or 8), but future versions of the standard may support 16-bit and 128-bit float values.
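The float rules map directly onto Python's standard `struct` module, as in this sketch. The function name `decode_float` is an assumption for the example, and `None` stands in for null.float.

```python
import struct


def decode_float(buf, pos=0):
    """Decode the float value whose type descriptor is at buf[pos]."""
    t, l = buf[pos] >> 4, buf[pos] & 0x0F
    if t != 4:
        raise ValueError("not a float type descriptor")
    if l == 0:
        return 0.0   # positive zero with an empty representation
    if l == 15:
        return None  # null.float
    if l == 4:
        return struct.unpack(">f", buf[pos + 1:pos + 5])[0]  # big-endian binary32
    if l == 8:
        return struct.unpack(">d", buf[pos + 1:pos + 9])[0]  # big-endian binary64
    raise ValueError(f"illegal float length L={l}")


print(decode_float(bytes([0x48]) + struct.pack(">d", 1.5)))  # 1.5
```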
0x5: decimal
               7       4 3       0
              +---------+---------+
Decimal value |   0x5   |    L    |
              +---------+---------+======+
              :     length [VarUInt]     :
              +--------------------------+
              |    exponent [VarInt]     |
              +--------------------------+
              |    coefficient [Int]     |
              +--------------------------+
Decimal representations have two components: exponent (a VarInt) and coefficient (an Int). The decimal’s value is coefficient * 10 ^ exponent.
The length of the coefficient subfield is the total length of the representation minus the length of exponent. The subfield should not be present (that is, it has zero length) when the coefficient’s value is (positive) zero.
There are two exceptions for the L field:
- If L is 0, then the value is 0. (aka 0d0), and there are no length, exponent, or coefficient subfields.
- If L is 15, then the value is null.decimal and there are no length, exponent, or coefficient subfields.
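As an illustration, the representation octets of a decimal (everything after the type descriptor and any length field) can be turned into a Python `decimal.Decimal`. The helper name `decode_decimal_repr` is hypothetical. Building the result from a sign/digits/exponent tuple keeps a negative-zero coefficient and the exact exponent.

```python
from decimal import Decimal


def decode_decimal_repr(rep):
    """Decode a decimal representation: exponent [VarInt] then coefficient [Int]."""
    if not rep:
        return Decimal(0)  # L = 0 encodes 0d0
    # Exponent: VarInt with the sign flag in bit 6 of the first octet.
    octet = rep[0]
    pos = 1
    exp_negative = bool(octet & 0x40)
    exponent = octet & 0x3F
    while not octet & 0x80:
        octet = rep[pos]
        pos += 1
        exponent = (exponent << 7) | (octet & 0x7F)
    if exp_negative:
        exponent = -exponent
    # Coefficient: the remaining octets as a sign-and-magnitude Int
    # (an absent coefficient means positive zero).
    coeff = rep[pos:]
    sign = 1 if coeff and coeff[0] & 0x80 else 0
    magnitude = int.from_bytes(bytes([coeff[0] & 0x7F]) + coeff[1:], "big") if coeff else 0
    digits = tuple(int(d) for d in str(magnitude))
    return Decimal((sign, digits, exponent))


print(decode_decimal_repr(bytes([0xC1, 0x0F])))  # 1.5
print(decode_decimal_repr(bytes([0xC2, 0xFB])))  # -1.23
```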
0x6: timestamp
                 7       4 3       0
                +---------+---------+
Timestamp value |   0x6   |    L    |
                +---------+---------+========+
                :      length [VarUInt]      :
                +----------------------------+
                |      offset [VarInt]       |
                +----------------------------+
                |       year [VarUInt]       |
                +----------------------------+
                :      month [VarUInt]       :
                +============================+
                :        day [VarUInt]       :
                +============================+
                :       hour [VarUInt]       :
                +====                    ====+
                :      minute [VarUInt]      :
                +============================+
                :      second [VarUInt]      :
                +============================+
                : fraction_exponent [VarInt] :
                +============================+
                : fraction_coefficient [Int] :
                +============================+
Timestamp representations have 7 components, where 5 of these components are optional depending on the precision of the timestamp. The 2 non-optional components are offset and year. The 5 optional components are (from least precise to most precise): month, day, hour and minute, second, fraction_exponent and fraction_coefficient. All of these 7 components are in Coordinated Universal Time (UTC).
The offset denotes the local-offset portion of the timestamp, in minutes difference from UTC.
The hour and minute are treated as a single component; that is, it is illegal to have hour but not minute (and vice versa).
The fraction_exponent and fraction_coefficient denote the fractional seconds of the timestamp as a decimal value. The fractional seconds’ value is coefficient * 10 ^ exponent. It must be greater than or equal to zero and less than 1. A missing coefficient defaults to zero. Fractions whose coefficient is zero and exponent is greater than -1 are ignored. The following hex encoded timestamps are equivalent:
68 80 0F D0 81 81 80 80 80 // 2000-01-01T00:00:00Z with no fractional seconds
69 80 0F D0 81 81 80 80 80 80 // The same instant with 0d0 fractional seconds and implicit zero coefficient
6A 80 0F D0 81 81 80 80 80 80 00 // The same instant with 0d0 fractional seconds and explicit zero coefficient
69 80 0F D0 81 81 80 80 80 C0 // The same instant with 0d-0 fractional seconds
69 80 0F D0 81 81 80 80 80 81 // The same instant with 0d1 fractional seconds
Conversely, none of the following are equivalent:
68 80 0F D0 81 81 80 80 80 // 2000-01-01T00:00:00Z with no fractional seconds
69 80 0F D0 81 81 80 80 80 C1 // 2000-01-01T00:00:00.0Z
69 80 0F D0 81 81 80 80 80 C2 // 2000-01-01T00:00:00.00Z
If a timestamp representation has a component of a certain precision, each of the less precise components must also be present or else the representation is illegal. For example, a timestamp representation that has the fraction_exponent and fraction_coefficient components but not the month component is illegal.
Note: The component values in the binary encoding are always in UTC, while components in the text encoding are in the local time! This means that transcoding requires a conversion between UTC and local time.
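The component layout can be walked with a short Python sketch like the one below. It operates on the representation octets only (after the type descriptor and any length field), returns the raw UTC components without converting to local time, and does not distinguish a negative-zero offset from zero. All function names are assumptions made for this example.

```python
from decimal import Decimal


def _varuint(buf, pos):
    value = 0
    while True:
        octet = buf[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:
            return value, pos


def _varint(buf, pos):
    octet = buf[pos]
    pos += 1
    negative = bool(octet & 0x40)
    magnitude = octet & 0x3F
    while not octet & 0x80:
        octet = buf[pos]
        pos += 1
        magnitude = (magnitude << 7) | (octet & 0x7F)
    return magnitude, negative, pos


def decode_timestamp_repr(rep):
    """Return a dict of the components present in a timestamp representation."""
    magnitude, negative, pos = _varint(rep, 0)
    fields = {"offset": -magnitude if negative else magnitude}
    for name in ("year", "month", "day", "hour", "minute", "second"):
        if pos == len(rep):
            break  # lower precision: the remaining components are absent
        fields[name], pos = _varuint(rep, pos)
    if "year" not in fields:
        raise ValueError("offset and year are required")
    if "hour" in fields and "minute" not in fields:
        raise ValueError("hour and minute must appear together")
    if pos < len(rep):
        exponent, exp_negative, pos = _varint(rep, pos)
        coeff = rep[pos:]  # a missing coefficient defaults to zero
        value = int.from_bytes(bytes([coeff[0] & 0x7F]) + coeff[1:], "big") if coeff else 0
        if coeff and coeff[0] & 0x80 and value:
            raise ValueError("fractional seconds must not be negative")
        digits = tuple(int(d) for d in str(value))
        fields["fraction"] = Decimal((0, digits, -exponent if exp_negative else exponent))
    return fields


print(decode_timestamp_repr(bytes.fromhex("800FD08181808080")))
```

Applied to the first example above, this yields offset 0, year 2000, month 1, day 1, and zero for hour, minute and second.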
0x7: symbol
              7       4 3       0
             +---------+---------+
Symbol value |   0x7   |    L    |
             +---------+---------+======+
             :     length [VarUInt]     :
             +--------------------------+
             |     symbol ID [UInt]     |
             +--------------------------+
In the binary encoding, all Ion symbols are stored as integer symbol IDs whose text values are provided by a symbol table. If L is zero then the symbol ID is zero and the length and symbol ID fields are omitted.
See Ion Symbols for more details about symbol representations and symbol tables.
0x8: string
              7       4 3       0
             +---------+---------+
String value |   0x8   |    L    |
             +---------+---------+======+
             :     length [VarUInt]     :
             +==========================+
             :  representation [UTF8]   :
             +==========================+
These are always sequences of Unicode characters, encoded as a sequence of UTF-8 octets. If L is zero then the string is the empty string “” and the length and representation fields are omitted.
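On the writing side, a string value might be produced as in this Python sketch, which applies the default L and length-field rule to the UTF-8 octets. The function name `encode_string` is an assumption for the example.

```python
def encode_string(text):
    """Encode a Python str as a binary Ion string value (type code 0x8)."""
    rep = text.encode("utf-8")
    if len(rep) < 14:
        # Short representation: the length fits in L, no length field.
        return bytes([0x80 | len(rep)]) + rep
    # Otherwise L is 14 and the length follows as a VarUInt.
    groups = []
    n = len(rep)
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if not n:
            break
    groups.reverse()
    groups[-1] |= 0x80  # end flag on the last octet
    return bytes([0x8E]) + bytes(groups) + rep


print(encode_string("").hex())   # 80
print(encode_string("a").hex())  # 8161
```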
0x9: clob
            7       4 3       0
           +---------+---------+
Clob value |   0x9   |    L    |
           +---------+---------+======+
           :     length [VarUInt]     :
           +==========================+
           :       data [Bytes]       :
           +==========================+
Values of type clob are encoded as a sequence of octets that should be interpreted as text with an unknown encoding (and thus opaque to the application).
Zero-length clobs are legal, so L may be zero.
0xA: blob
            7       4 3       0
           +---------+---------+
Blob value |   0xA   |    L    |
           +---------+---------+======+
           :     length [VarUInt]     :
           +==========================+
           :       data [Bytes]       :
           +==========================+
This is a sequence of octets with no interpretation (and thus opaque to the application).
Zero-length blobs are legal, so L may be zero.
0xB: list
            7       4 3       0
           +---------+---------+
List value |   0xB   |    L    |
           +---------+---------+======+
           :     length [VarUInt]     :
           +==========================+
           :          value           :
           +==========================+
                         ⋮
The representation fields of a list value are simply nested Ion values.
When L is 15, the value is null.list and there’s no length or nested values. When L is 0, the value is an empty list, and there’s no length or nested values.
Because values indicate their total lengths in octets, it is possible to locate the beginning of each successive value in constant time.
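To illustrate that skipping property, the Python sketch below walks the representation of a list (the octets after its type descriptor and length) and yields the octet range of each nested item without decoding it. The generator name `child_spans` is hypothetical; NOP pads are reported as spans too, so a real reader would filter them out.

```python
def child_spans(body):
    """Yield (start, end) octet ranges for each item nested in a list body."""
    pos = 0
    while pos < len(body):
        start = pos
        t, l = body[pos] >> 4, body[pos] & 0x0F
        pos += 1
        if t == 1 or l == 15:
            length = 0  # bool keeps its representation in L; nulls have none
        elif l == 14 or (t == 0xD and l == 1):
            # A VarUInt length field follows (L=1 on a struct marks sorted fields).
            length = 0
            while True:
                octet = body[pos]
                pos += 1
                length = (length << 7) | (octet & 0x7F)
                if octet & 0x80:
                    break
        else:
            length = l
        pos += length
        yield start, pos


# Body of the list [1, "a", true]: int 1, string "a", bool true.
print(list(child_spans(bytes([0x21, 0x01, 0x81, 0x61, 0x11]))))
# [(0, 2), (2, 4), (4, 5)]
```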
0xC: sexp
            7       4 3       0
           +---------+---------+
Sexp value |   0xC   |    L    |
           +---------+---------+======+
           :     length [VarUInt]     :
           +==========================+
           :          value           :
           +==========================+
                         ⋮
Values of type sexp are encoded exactly as are list values, except with a different type code.
0xD: struct
Structs are encoded as sequences of symbol/value pairs. Since all symbols are encoded as positive integers, we can omit the typedesc on the field names and just encode the integer value.
              7       4 3       0
             +---------+---------+
Struct value |   0xD   |    L    |
             +---------+---------+======+
             :     length [VarUInt]     :
             +======================+===+==================+
             : field name [VarUInt] :        value         :
             +======================+======================+
                         ⋮                     ⋮
Binary-encoded structs support a special case where the fields are known to be sorted such that the field-name integers are increasing. This state exists when L is one. Thus:
- When L is 0, the value is an empty struct, and there’s no length or nested fields.
- When L is 1, the struct has at least one symbol/value pair, the length field exists, and the field name integers are sorted in increasing order.
- When L is 14, the length field exists, and no assertion is made about field ordering.
- When L is 15, the value is null.struct, and there’s no length or nested fields.
- Otherwise, 1 < L < 14 and there is no length field as L is enough to represent the struct size. No assertion is made about field ordering.
Note: Because VarUInts depend on end tags to indicate their lengths, finding the succeeding value requires parsing the field name prefix. However, VarUInts are a more compact representation than Int values.
NOP Padding in struct Fields
NOP Padding in struct values requires additional consideration of the field name element. If the “value” of a struct field is the NOP pad, then the field name is ignored. This means that it is not possible to encode padding in a struct value that is less than two bytes.
Implementations should use symbol ID zero as the field name to emphasize the lack of meaning of the field name. For more general details about the semantics of symbol ID zero, refer to Ion Symbols.
For example, consider the following empty struct with three bytes of padding:
0xD3 0x80 0x01 0xAC
In the above example, the struct declares that it is three bytes large, and its body is a single pair: symbol ID zero followed by a pad that is two bytes large (note the last octet 0xAC is completely arbitrary and never interpreted by an implementation).
The following is an example of a struct with a single field and four total bytes of padding:
0xD7 0x84 0x81 "a" 0x80 0x02 0x01 0x02
The above is equivalent to {name:"a"}.
The following is also an empty struct, with a two-byte pad:
0xD2 0x8F 0x00
In the above example, the field name of symbol ID 15 is ignored (regardless of whether it is a valid symbol ID).
The following is malformed because there is an annotation “wrapping” a NOP pad, which is not allowed generally for annotations:
// {$0:name::<NOP>}
0xD5 0x80 0xE3 0x81 0x84 0x00
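The struct layout and the well-formed padding examples above can be exercised with a small Python sketch. It returns each real field as a (symbol ID, encoded value octets) pair and drops fields whose "value" is a NOP pad. The names `struct_fields` and its helpers are assumptions for this example; the sketch does not verify field ordering when L is 1, and it does not look inside annotation wrappers.

```python
def _varuint(buf, pos):
    value = 0
    while True:
        octet = buf[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:
            return value, pos


def _value_end(buf, pos):
    """Return the position just past the value (or NOP pad) at buf[pos]."""
    t, l = buf[pos] >> 4, buf[pos] & 0x0F
    pos += 1
    if t == 1 or l == 15:
        return pos
    if l == 14 or (t == 0xD and l == 1):
        length, pos = _varuint(buf, pos)
        return pos + length
    return pos + l


def struct_fields(buf, pos=0):
    """Return [(field_symbol_id, value_octets)], or None for null.struct."""
    t, l = buf[pos] >> 4, buf[pos] & 0x0F
    if t != 0xD:
        raise ValueError("not a struct type descriptor")
    pos += 1
    if l == 15:
        return None
    if l in (1, 14):
        length, pos = _varuint(buf, pos)  # L=1 (sorted) and L=14 carry a length field
    else:
        length = l
    end = pos + length
    fields = []
    while pos < end:
        symbol_id, pos = _varuint(buf, pos)
        start = pos
        pos = _value_end(buf, pos)
        is_pad = buf[start] >> 4 == 0 and buf[start] & 0x0F != 15
        if not is_pad:  # the field name of a NOP pad is ignored
            fields.append((symbol_id, bytes(buf[start:pos])))
    return fields


print(struct_fields(bytes([0xD3, 0x80, 0x01, 0xAC])))  # []
print(struct_fields(bytes([0xD7, 0x84, 0x81]) + b"a" + bytes([0x80, 0x02, 0x01, 0x02])))
# [(4, b'\x81a')]
```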
0xE: Annotations
This special type code doesn’t map to an Ion value type, but instead is a wrapper used to associate annotations with content.
Annotations are a special type that wrap content identified by the other type codes. The annotations themselves are encoded as integer symbol ids.
                    7       4 3       0
                   +---------+---------+
Annotation wrapper |   0xE   |    L    |
                   +---------+---------+======+
                   :     length [VarUInt]     :
                   +--------------------------+
                   |  annot_length [VarUInt]  |
                   +--------------------------+
                   |     annot [VarUInt]      |  …
                   +--------------------------+
                   |          value           |
                   +--------------------------+
The length field L indicates the length from the beginning of the annot_length field to the end of the enclosed value. Because at least one annotation and exactly one content field must exist, L is at least 3 and is never 15.
The annot_length field contains the length of the (one or more) annot fields.
It is illegal for an annotation to wrap another annotation atomically, i.e., annotation(annotation(value)). However, it is legal to have an annotation on a container that holds annotated values.
Furthermore, it is illegal for an annotation to wrap a NOP Pad since this encoding is not an Ion value. Thus, the following sequence is malformed:
0xE3 0x81 0x84 0x00
Note: Because L cannot be zero, the octet 0xE0 is not a valid type descriptor. Instead, that octet signals the start of a binary version marker.
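The wrapper rules above can be checked with a Python sketch along the following lines. The function name `parse_annotation_wrapper` is an assumption; it returns the annotation symbol IDs together with the still-encoded wrapped value, and it rejects nested wrappers and wrapped NOP pads. It assumes the caller has already ruled out a binary version marker at this position.

```python
def _varuint(buf, pos):
    value = 0
    while True:
        octet = buf[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:
            return value, pos


def parse_annotation_wrapper(buf, pos=0):
    """Return (annotation_symbol_ids, wrapped_value_octets)."""
    t, l = buf[pos] >> 4, buf[pos] & 0x0F
    if t != 0xE or l < 3 or l == 15:
        raise ValueError("not a valid annotation wrapper type descriptor")
    pos += 1
    if l == 14:
        length, pos = _varuint(buf, pos)
    else:
        length = l
    end = pos + length
    annot_length, pos = _varuint(buf, pos)
    if annot_length == 0:
        raise ValueError("at least one annotation is required")
    annots_end = pos + annot_length
    annotations = []
    while pos < annots_end:
        symbol_id, pos = _varuint(buf, pos)
        annotations.append(symbol_id)
    if pos >= end:
        raise ValueError("missing wrapped value")
    inner_t, inner_l = buf[pos] >> 4, buf[pos] & 0x0F
    if inner_t == 0xE:
        raise ValueError("an annotation may not directly wrap another annotation")
    if inner_t == 0 and inner_l != 15:
        raise ValueError("an annotation may not wrap a NOP pad")
    return annotations, bytes(buf[pos:end])


# Symbol ID 4 annotating the string "a".
print(parse_annotation_wrapper(bytes([0xE4, 0x81, 0x84, 0x81, 0x61])))
# ([4], b'\x81a')
```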
0xF: reserved
The remaining type code, 15, is reserved for future use and is not legal in Ion 1.0 data.
Illegal Type Descriptors
The preceding sections define valid type descriptor octets, composed of a type code (T) in the upper four bits and a length field (L) in the lower four bits. As mentioned, many possible combinations are illegal and must cause parsing errors.
The following table enumerates the illegal type descriptors in Ion 1.0 data.
T | L | Reason |
---|---|---|
0x1 | [2-14] | For bool values, L is used to encode the value, and may be 0 (false), 1 (true), or 15 (null.bool). |
0x3 | [0] | The int 0 is always stored with type code 2. Thus, type code 3 with L equal to zero is illegal. |
0x4 | [1-3],[5-7],[9-14] | For float values, only 32-bit and 64-bit IEEE-754 values are supported. Additionally, 0e0 and null.float are represented with L equal to 0 and 15, respectively. |
0x6 | [0-1] | For timestamp values, a VarInt offset and VarUInt year are required. Thus, type code 6 with L equal to zero or one is illegal. |
0xE | [0]*,[1-2],[15] | Annotation wrappers must have one annot_length field, at least one annot field, and exactly one value field. Null annotation wrappers are illegal. *Note: Since 0xE0 signals the start of the BVM, encountering this octet where a type descriptor is expected should only cause parsing errors when it is not followed by the rest of the BVM octet sequence. |
0xF | [0-15] | The type code 0xF is illegal in Ion 1.0 data. |
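The table translates into a small predicate, sketched here in Python. The name `is_illegal_type_descriptor` is an assumption for this example. Note that it reports 0xE0 as illegal, so a reader would need to test for a binary version marker before applying it.

```python
def is_illegal_type_descriptor(octet):
    """Return True if the octet is never a valid Ion 1.0 type descriptor."""
    t, l = octet >> 4, octet & 0x0F
    if t == 0x1:
        return 2 <= l <= 14             # bool: only 0, 1 and 15 are defined
    if t == 0x3:
        return l == 0                   # zero is always stored with type code 2
    if t == 0x4:
        return l not in (0, 4, 8, 15)   # float: 0e0, 32-bit, 64-bit, null.float
    if t == 0x6:
        return l <= 1                   # timestamp: offset and year are required
    if t == 0xE:
        return l <= 2 or l == 15        # annotation wrapper needs at least 3 octets
    if t == 0xF:
        return True                     # reserved type code
    return False


print(sum(is_illegal_type_descriptor(o) for o in range(256)))  # 48
```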