Developers’ Guide to Ion Symbols

This document provides developer-focused commentary on the Symbols section of the specification and discusses the implementation of symbol table, symbol token, and catalog APIs.

Definitions

Terms

Structures

ImportDescriptor = <importName:String, version:Int, max_id:Int>

ImportLocation = <importName:String, importSID:Int>

SymbolToken = <text:String, importLocation:ImportLocation>

Where Int may be any integer and String may be any string.
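As an illustration only, the three structures above could be modeled as immutable records. The snake_case field names, and the choice to let text and import_location be None to represent "undefined", are assumptions of this sketch and are not taken from any particular Ion library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImportDescriptor:
    import_name: str
    version: int
    max_id: int


@dataclass(frozen=True)
class ImportLocation:
    import_name: str
    import_sid: int


@dataclass(frozen=True)
class SymbolToken:
    # Assumption of this sketch: None stands for "undefined".
    text: Optional[str]
    import_location: Optional[ImportLocation]


# A token whose text is unknown but whose import location is defined.
# "com.example.symbols" is a made-up table name.
token = SymbolToken(
    text=None,
    import_location=ImportLocation("com.example.symbols", 3),
)
```

Note that the equality generated by dataclass is a plain field-by-field comparison. It is not meant to stand in for the SymbolToken equivalence defined by the specification.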

SymbolToken equivalence

In order to fully support the equivalence semantics defined by the specification, SymbolToken equivalence must be implemented as follows. When text is

Symbol tables

There are three types of symbol tables:

Implementations should be able to determine the type of a given symbol table, as not all fields are valid for all types, and not all types are valid input to all APIs. For example, local symbol tables do not have names, while shared symbol tables require them; only shared symbol tables may be added to a catalog or to a writer’s list of imports.

Symbol tables should support being in more than one catalog simultaneously. Otherwise, piping data from a reader through a writer with a different catalog would require a copy of the symbol tables the two catalogs have in common.

A local symbol table is the current symbol table for the subset of values in the stream that occur between the end of the symbol table struct and either the next Ion version marker (IVM) or the end of the next local symbol table struct (whichever comes first). This way, a local symbol table struct may contain SymbolTokens from the current symbol table.

Fundamental symbol table APIs

Advanced symbol table APIs

Catalogs

Catalogs should enable users to define the logic used to look up a shared symbol table given an ImportDescriptor. This allows users the flexibility to, for example, lazily query shared tables from a centralized store. A basic implementation, which stores the mappings in memory, should be provided.
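One possible shape for this is an abstract lookup interface plus a basic in-memory implementation. The sketch below is hypothetical: the names Catalog, InMemoryCatalog, SharedSymbolTable, lookup, and add are invented for illustration, and the lookup only handles exact name and version matches (any best-match behavior is left out).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ImportDescriptor:
    import_name: str
    version: int
    max_id: int


@dataclass(frozen=True)
class SharedSymbolTable:
    name: str
    version: int
    symbols: Tuple[Optional[str], ...]  # None marks a slot with no text


class Catalog(ABC):
    """User-definable lookup logic; could be backed by a remote store."""

    @abstractmethod
    def lookup(self, descriptor: ImportDescriptor) -> Optional[SharedSymbolTable]:
        ...


class InMemoryCatalog(Catalog):
    """Basic implementation that keeps the mappings in a dictionary."""

    def __init__(self) -> None:
        self._tables: Dict[Tuple[str, int], SharedSymbolTable] = {}

    def add(self, table: SharedSymbolTable) -> None:
        self._tables[(table.name, table.version)] = table

    def lookup(self, descriptor: ImportDescriptor) -> Optional[SharedSymbolTable]:
        # Exact (name, version) match only; returns None when absent.
        return self._tables.get((descriptor.import_name, descriptor.version))


# The same immutable table object can live in two catalogs without a copy.
table = SharedSymbolTable("com.example.symbols", 1, ("abc", "def"))
reader_catalog = InMemoryCatalog()
writer_catalog = InMemoryCatalog()
reader_catalog.add(table)
writer_catalog.add(table)
```

Because the table in this sketch is immutable, registering it in both catalogs is safe, which is one way to support a symbol table being in more than one catalog at once.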

Fundamental catalog APIs

Reading SymbolTokens

Ion readers must support being provided with an optional catalog to use for resolving shared symbol table imports declared within local symbol tables encountered in the stream. If a declared import is not found in the catalog, all of the symbol IDs in its max_id range will have unknown text.
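The effect of a missing import can be sketched as follows. This is a simplified model, not a reader implementation: the function name resolve_imports is hypothetical, the catalog is reduced to a dictionary keyed by (name, version), and the Ion 1.0 system symbols are assumed to occupy symbol IDs 1 through 9 under the table name "$ion".

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Ion 1.0 system symbols, occupying symbol IDs 1 through 9.
SYSTEM_SYMBOLS: Tuple[str, ...] = (
    "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
    "imports", "symbols", "max_id", "$ion_shared_symbol_table",
)

# (text or None, (importName, importSID))
Resolved = Tuple[Optional[str], Tuple[str, int]]


def resolve_imports(
    imports: Sequence[Tuple[str, int, int]],
    catalog: Dict[Tuple[str, int], List[Optional[str]]],
) -> Dict[int, Resolved]:
    """Map each local symbol ID to its text (if known) and import location.

    imports: ordered (name, version, max_id) triples declared in the stream.
    catalog: simplified stand-in mapping (name, version) to a list of texts.
    """
    resolved: Dict[int, Resolved] = {}
    sid = 1
    for text in SYSTEM_SYMBOLS:
        resolved[sid] = (text, ("$ion", sid))
        sid += 1
    for name, version, max_id in imports:
        table = catalog.get((name, version))
        for import_sid in range(1, max_id + 1):
            text = None  # unknown text unless the catalog supplies it
            if table is not None and import_sid <= len(table):
                text = table[import_sid - 1]
            resolved[sid] = (text, (name, import_sid))
            sid += 1
    return resolved


# Table "a" is in the catalog; table "b" is declared but not found.
example_catalog = {("a", 1): ["x", "y"]}
example = resolve_imports([("a", 1, 2), ("b", 1, 3)], example_catalog)
```

In this example, symbol IDs 10 and 11 resolve to known text from "a", while IDs 12 through 14 still occupy their slots but carry unknown text, each with an import location pointing into "b".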

Generally, Ion readers provide two kinds of SymbolToken reading APIs, those that return:

For

Writing symbol tables

Ion writers must accept an optional list of imports to be used during writing. These imports, which may be either fully-materialized shared symbol tables or ImportDescriptors, will be added to each new local symbol table the writer creates. If the implementation allows its writer imports to be specified as ImportDescriptors, its Ion writers must also support being provided with an optional catalog, which will be used to resolve these imports. In this case, the implementation should specify that the imported tables must be present in the catalog if the user intends for the symbol IDs in range of those shared tables to map to known text. For

Ion writers may allow users to use writer APIs to manually construct a valid local symbol table struct. If the implementation chooses

Writing SymbolTokens

Generally, Ion writers provide two kinds of SymbolToken-writing APIs, those that accept:

For APIs that accept

Appendix

0: When using

1: This is potentially a lossy operation, as it does not convey import location. There are two reasons why implementations may choose not to raise an error in this case:

2: The ImportLocation can be determined by applying the symbol ID assignment algorithm defined by the specification, where the system symbol table starts at symbol ID 1.

3: Unlike symbols with unknown text resolved from shared symbol tables, symbols with unknown text resolved from local symbol tables can never have defined text because the local symbol table is included in the encoding and its symbol ID mappings are immutable. Therefore, there is no need to preserve the local symbol IDs of SymbolTokens representing such symbols. Treating them equivalently to symbol zero simplifies writing symbol tables because it obviates the need for writers to keep track of null slots in local symbol tables.

4: No special semantics are ascribed to text Ion symbol tokens which have the same form as Symbol Identifiers but are quoted. Readers must pass along the text as-is, and writers must never write user-provided text with the same form as a Symbol Identifier as an unquoted symbol token. This maintains the user’s ability to write symbol tokens with any text without experiencing surprising side-effects on roundtrip.

5: This case requires that the text writer serialize a local symbol table containing the imports mapped to by Symbol Identifier tokens within the stream. Note that imports that have no unknown mappings in the stream do not need to be included (nor do any local symbols), but if only a subset of the imports are included, the Symbol Identifiers need to refer to the same slot in the same import as before any shared symbol tables were excluded (this can be computed by translating the SymbolToken’s importLocation to a local symbol ID in the new symbol table using the algorithm defined by the specification).

Although this is the only case that requires a text writer to serialize a local symbol table, serializing one in other cases is only wasteful, never harmful. Accordingly, it is simpler to serialize a local symbol table that includes all shared imports whenever a writer is provided with shared imports. Otherwise, a writer with shared imports needs to buffer the entire stream while that symbol table is the current symbol table (similar to the binary writer), because it cannot determine ahead of time that the user will not specify a symbol with unknown text. The exception is when all of the imports are found in the catalog and none of them have null slots, which can be checked ahead of time at the cost of some additional preprocessing.

It may be tempting for an implementation to try to wait until a SymbolToken with unknown text is written before serializing a local symbol table. However, this is problematic because symbol tables may only occur at the top level, but the first SymbolToken with unknown text can occur at any depth.
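The translation mentioned in note 5, from a SymbolToken’s importLocation to a local symbol ID in a symbol table that includes only a subset of the imports, might look like the sketch below. The function name local_sid_for and the tuple-based representation are assumptions for illustration; the offset of 9 assumes the Ion 1.0 system symbol table.

```python
from typing import Optional, Sequence, Tuple

SYSTEM_MAX_ID = 9  # assumes the Ion 1.0 system symbol table


def local_sid_for(
    import_location: Tuple[str, int],
    imports: Sequence[Tuple[str, int]],
) -> Optional[int]:
    """Translate (importName, importSID) to a local symbol ID.

    imports: ordered (name, max_id) pairs of the table being written.
    Returns None if the import is absent or the SID is out of its range.
    """
    name, import_sid = import_location
    offset = SYSTEM_MAX_ID
    for import_name, max_id in imports:
        if import_name == name:
            if 1 <= import_sid <= max_id:
                return offset + import_sid
            return None
        offset += max_id  # skip past this import's reserved range
    return None


location = ("b", 2)
all_imports = [("a", 2), ("b", 3), ("c", 1)]
subset_imports = [("b", 3)]  # "a" and "c" excluded from the new table

sid_before = local_sid_for(location, all_imports)
sid_after = local_sid_for(location, subset_imports)
```

Here the same slot in import "b" maps to a different local symbol ID once "a" is excluded, so the Symbol Identifier written to the text stream has to be recomputed against the new table.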

6: This is to avoid writing invalid Ion. Consider a writer whose current symbol table contains two symbols, $10 = abc and $11 = def. A user manually writes a local symbol table with only one symbol, $10 = foo. If the writer simply writes this manually-written table to the stream without internally changing its current symbol table, it would allow the user to write symbol ID 11 (with “def” in mind), while a reader of the data would process the new local symbol table and subsequently consider $11 to be out of range, raising an error.
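The scenario in note 6 can be modeled with a small piece of writer state. This is a sketch under stated assumptions: WriterSymbolState and its methods are hypothetical names, imports are ignored, and local symbols are assumed to start at symbol ID 10 (after the nine Ion 1.0 system symbols).

```python
from typing import Iterable, List

SYSTEM_MAX_ID = 9  # assumes the Ion 1.0 system symbol table


class WriterSymbolState:
    """Tracks the writer's current local symbols (imports omitted)."""

    def __init__(self, local_symbols: Iterable[str]) -> None:
        self._local_symbols: List[str] = list(local_symbols)

    @property
    def max_id(self) -> int:
        return SYSTEM_MAX_ID + len(self._local_symbols)

    def on_manual_symbol_table(self, local_symbols: Iterable[str]) -> None:
        # The manually written table becomes the current symbol table,
        # so the writer's view matches what a reader will later see.
        self._local_symbols = list(local_symbols)

    def require_in_range(self, sid: int) -> None:
        if not 0 <= sid <= self.max_id:
            raise ValueError(f"symbol ID {sid} is out of range (max_id={self.max_id})")


state = WriterSymbolState(["abc", "def"])  # $10 = abc, $11 = def
state.require_in_range(11)                 # accepted under the old table
state.on_manual_symbol_table(["foo"])      # user writes a table with $10 = foo
try:
    state.require_in_range(11)
    rejected = False
except ValueError:
    rejected = True
```

Updating the internal state when the manual table is written lets the writer reject symbol ID 11 up front instead of emitting data that a reader would fail on.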

7: Note that this means that SymbolTokens are not guaranteed to have the same import location on roundtrip, but they are guaranteed to have the same text representation, which is sufficient to maintain equivalence.

8: If the implementation uses a singleton system symbol table directly as the current symbol table, appending a new symbol will first require creating a mutable local symbol table which implicitly extends the system symbol table. In other words, care should be taken never to mutate the system symbol table.
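A minimal sketch of this copy-on-append behavior follows. All names (SYSTEM_SYMBOL_TABLE, LocalSymbolTable, WriterContext, intern) are hypothetical, and the singleton is represented as an immutable tuple of the Ion 1.0 system symbols so that it cannot be mutated by accident.

```python
from typing import List, Tuple, Union

# Shared singleton; a tuple cannot be appended to.
SYSTEM_SYMBOL_TABLE: Tuple[str, ...] = (
    "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
    "imports", "symbols", "max_id", "$ion_shared_symbol_table",
)


class LocalSymbolTable:
    """Mutable table that implicitly extends the system symbol table."""

    def __init__(self) -> None:
        self._local: List[str] = []

    def append(self, text: str) -> int:
        self._local.append(text)
        return len(SYSTEM_SYMBOL_TABLE) + len(self._local)


class WriterContext:
    def __init__(self) -> None:
        # Start out pointing directly at the singleton.
        self.current: Union[Tuple[str, ...], LocalSymbolTable] = SYSTEM_SYMBOL_TABLE

    def intern(self, text: str) -> int:
        """Append a new symbol and return its symbol ID."""
        if self.current is SYSTEM_SYMBOL_TABLE:
            # First append: switch to a fresh mutable local table
            # instead of touching the shared singleton.
            self.current = LocalSymbolTable()
        return self.current.append(text)


ctx = WriterContext()
first_sid = ctx.intern("foo")
second_sid = ctx.intern("bar")
```

For brevity this sketch does not check whether the text is already present before appending.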