Strings and Clobs

This document clarifies the semantics of the Amazon Ion string and clob data types with respect to escapes and the Unicode standard.

As of this writing, the Unicode Standard is at version 10.0. This specification is written against that version of the standard.

Unicode Primer

The Unicode standard specifies a large set of code points, the Universal Character Set (UCS), each of which is an integer in the range 0 (0x0) through 1,114,111 (0x10FFFF) inclusive. Throughout this document, the notations U+HHHH and U+HHHHHHHH refer to the Unicode code points HHHH and HHHHHHHH, respectively, as hexadecimal ordinals. This notation follows the Unicode standard convention.

Traditionally, from a programmer’s perspective, a code point can be thought of as a character, but there is sometimes a subtle distinction. For example, in Java, the char type is an unsigned, 16-bit integer, which is normally used to hold UTF-16 code units (e.g. java.lang.CharSequence). The Unicode code point Mathematical Bold Capital “A” (code point U+0001D400) is encoded in a UTF-16 string as two units: 0xD835 followed by 0xDC00. So in this case, Java’s UTF-16 representation actually uses two character (i.e. char) values to represent one Unicode code point.
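The arithmetic behind this example can be sketched in a few lines of Python. The function name utf16_code_units is hypothetical and exists only for illustration; the computation follows the UTF-16 definition in the Unicode Standard.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units that encode one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Code points below U+10000 fit in a single 16-bit unit.
        return [code_point]
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return [high, low]


units = utf16_code_units(0x1D400)
print([f"0x{u:04X}" for u in units])  # ['0xD835', '0xDC00']
```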

This document attempts to avoid using the term character when referring to Unicode code points. The reasoning for this is partly stated above, but also has to do with the overloaded nature of the term (e.g. a user character or grapheme). For more details, consult section 3.4 of the Unicode Standard.

Another interesting aspect of the UCS is a block of code points that is reserved exclusively for use in the UTF-16 encoding (i.e. surrogate code points). As such, strictly speaking, no encoding of Unicode is allowed to represent the code points in the inclusive range U+D800 to U+DFFF. In the UTF-16 case, these values are only allowed to be used within the encoding to specify code points in the U+00010000 to U+0010FFFF range. Refer to sections 3.8 and 3.9 of the Unicode Standard for details.

Ion String

The Ion String data type is a sequence of Unicode code points. The Ion semantics of this are agnostic to any particular Unicode encoding (e.g. UTF-16, UTF-8), except for the concrete syntax specification of the Ion binary and text formats.

Text Format

See the grammar for a formal definition of the Ion Text encoding for the string type.

Multiple Ion long string literals that are separated only by zero or more whitespace characters are concatenated automatically. For example, the following two blocks of Ion text syntax are semantically equivalent. Note that short string literals do not exhibit this behavior.

"1234"    '''Hello'''    '''World'''

"1234"    "HelloWorld"

Each individual long string literal must be a valid Unicode code point sequence when unescaped. The following four examples are invalid because they split, respectively, a \u escape, a \U escape, an escaped surrogate pair, and a common escape across literals.

'''\u'''        '''1234'''

'''\U0000'''    '''1234'''

'''\uD800'''    '''\uDC00'''

'''\'''         '''n'''
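The per-literal rule can be pictured as "validate each segment on its own, then concatenate". The following Python sketch assumes the segment bodies have already been extracted by a lexer; the names check_segment and concat_long_strings are hypothetical, and rejecting surrogate escapes reflects the behavior this document recommends rather than one that every implementation is known to follow.

```python
import re

# One unescaped character, or one complete escape sequence.
_TOKEN = re.compile(
    r"""[^\\]|\\(?:[abtnfrv"'?\\/0]|\r\n|\r|\n|x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))"""
)


def check_segment(body: str) -> None:
    """Raise ValueError unless one long string body is valid by itself."""
    pos = 0
    while pos < len(body):
        m = _TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"incomplete or unknown escape at offset {pos}")
        hex_digits = m.group(1) or m.group(2) or m.group(3)
        if hex_digits is not None:
            cp = int(hex_digits, 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError(f"escape denotes invalid code point {cp:#x}")
        pos = m.end()


def concat_long_strings(bodies) -> str:
    """Validate each body independently, then join the still-escaped text."""
    for body in bodies:
        check_segment(body)
    return "".join(bodies)


print(concat_long_strings(["Hello", "World"]))  # HelloWorld
```

Because every segment ends on an escape boundary, unescaping the joined text gives the same result as unescaping each segment and then joining.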

Within long string literals unescaped newlines are normalized such that U+000D U+000A pairs (CARRIAGE RETURN and LINE FEED respectively) and U+000D are replaced with U+000A. This is to facilitate compatibility across operating systems.

Normalization can be subverted by using a combination of escapes:

CARRIAGE RETURN only:
'''one\r\
two'''

CARRIAGE RETURN and LINE FEED:
'''one\r
two'''
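The interaction between raw newlines and escapes in the two examples above can be sketched as follows. This is a simplified illustration that handles only the newline-related parts of a long string body and passes every other escape through untouched; the name decode_newlines is hypothetical.

```python
import re

_NEWLINE_TOKEN = re.compile(
    r"\\(\r\n|\r|\n)"    # group 1: backslash followed by a raw newline
    r"|\\([rn])"         # group 2: the \r or \n escape
    r"|(\r\n|\r)"        # group 3: raw CR LF or raw CR
    r"|(\\[^\r\n])"      # group 4: any other escape, left unchanged
)


def decode_newlines(body: str) -> str:
    def replace(m):
        if m.group(1) is not None:
            return ""                        # escaped newline is removed
        if m.group(2) is not None:
            return "\r" if m.group(2) == "r" else "\n"
        if m.group(3) is not None:
            return "\n"                      # normalized to LINE FEED
        return m.group(4)

    return _NEWLINE_TOKEN.sub(replace, body)


# CARRIAGE RETURN only: \r escape, then an escaped (removed) newline.
print(repr(decode_newlines("one\\r\\\ntwo")))  # 'one\rtwo'
# CARRIAGE RETURN and LINE FEED: \r escape, then a raw newline.
print(repr(decode_newlines("one\\r\ntwo")))    # 'one\r\ntwo'
```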

Escaped newlines are not replaced with any characters (i.e. the newline is removed). In addition, the following table describes the string escape sequences that have direct code point replacement for all quoted string and symbol forms.

Unicode Code Point   Ion Escape   Semantics
U+0007               \a           BEL (alert)
U+0008               \b           BS (backspace)
U+0009               \t           HT (tab)
U+000A               \n           LF (line feed)
U+000C               \f           FF (form feed)
U+000D               \r           CR (carriage return)
U+000B               \v           VT (vertical tab)
U+0022               \"           double quote
U+0027               \'           single quote
U+003F               \?           question mark
U+005C               \\           backslash
U+002F               \/           forward slash
U+0000               \0           NUL (null character)

For the Unicode ordinal string escapes, \U, \u, and \x, the escape must be followed by a fixed number of hexadecimal digits as described below.

Unicode Code Point   Ion Escape   Semantics
U+HHHHHHHH           \UHHHHHHHH   8-digit hexadecimal Unicode code point
U+HHHH               \uHHHH       4-digit hexadecimal Unicode code point; equivalent to \U0000HHHH
U+00HH               \xHH         2-digit hexadecimal Unicode code point; equivalent to \u00HH and \U000000HH
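As an illustration of how the escape tables map onto code points, here is a minimal Python sketch of an unescaper for a string body. The name unescape_string is hypothetical, escaped newlines and newline normalization are deliberately left out, and the rejection of surrogate and out-of-range code points is the strict behavior recommended by this document, not something every implementation is guaranteed to do.

```python
SIMPLE_ESCAPES = {
    "a": "\x07", "b": "\x08", "t": "\x09", "n": "\x0A", "f": "\x0C",
    "r": "\x0D", "v": "\x0B", '"': '"', "'": "'", "?": "?",
    "\\": "\\", "/": "/", "0": "\x00",
}
HEX_ESCAPE_LENGTHS = {"x": 2, "u": 4, "U": 8}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def unescape_string(body: str) -> str:
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[i + 1]
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
        elif kind in HEX_ESCAPE_LENGTHS:
            n = HEX_ESCAPE_LENGTHS[kind]
            digits = body[i + 2 : i + 2 + n]
            if len(digits) != n or not set(digits) <= HEX_DIGITS:
                raise ValueError(f"escape needs exactly {n} hex digits")
            cp = int(digits, 16)
            # Strict choice: refuse surrogates and values above U+10FFFF.
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError(f"invalid code point {cp:#x}")
            out.append(chr(cp))
            i += 2 + n
        else:
            raise ValueError("unknown escape: backslash + " + repr(kind))
    return "".join(out)


print(unescape_string(r"\x41\u0042\U00000043"))  # ABC
```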

Ion does not specify the behavior when these escape sequences denote invalid Unicode code points or surrogate code points (used only for UTF-16). It is highly recommended that Ion implementations reject such escape sequences, as they are not proper Unicode as specified by the standard. To illustrate, consider the Ion string "\uD800\uDC00". A compliant parser may throw an exception because surrogate code points are specified outside of the context of UTF-16, accept the string as a technically invalid sequence of two Unicode code points (i.e. U+D800 and U+DC00), or interpret it as the single Unicode code point U+00010000. In this regard, the Ion string data type does not conform to the Unicode specification. A strict Unicode implementation of the Ion text format should not accept such sequences.
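The three permitted outcomes for "\uD800\uDC00" can be made concrete with a small Python sketch. It operates on the list of code points produced by escape decoding; the function name and the policy labels "reject", "accept", and "combine" are hypothetical and chosen only for this illustration.

```python
def apply_surrogate_policy(code_points, policy):
    """Process decoded code points under one of three surrogate policies."""
    if policy not in ("reject", "accept", "combine"):
        raise ValueError("unknown policy")
    if policy == "accept":
        # Keep the technically invalid sequence exactly as written.
        return list(code_points)
    out = []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        if 0xD800 <= cp <= 0xDFFF:
            if policy == "reject":
                raise ValueError(f"surrogate code point U+{cp:04X}")
            # policy == "combine": a high surrogate must precede a low one.
            nxt = code_points[i + 1] if i + 1 < len(code_points) else None
            if cp <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
                out.append(0x10000 + ((cp - 0xD800) << 10) + (nxt - 0xDC00))
                i += 2
                continue
            raise ValueError(f"unpaired surrogate U+{cp:04X}")
        out.append(cp)
        i += 1
    return out


pair = [0xD800, 0xDC00]
print([hex(cp) for cp in apply_surrogate_policy(pair, "accept")])   # ['0xd800', '0xdc00']
print([hex(cp) for cp in apply_surrogate_policy(pair, "combine")])  # ['0x10000']
```

A strict implementation corresponds to the "reject" policy, which raises an error for the same input.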

Binary Format

The Ion binary format encodes the string data type directly as a sequence of UTF-8 octets. A strict, Unicode-compliant implementation of Ion should not allow invalid UTF-8 sequences (e.g. encodings of surrogate code points, overlong encodings, and values outside of the inclusive range U+0000 to U+0010FFFF).
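As a rough illustration of strict decoding, Python's built-in UTF-8 codec with the "strict" error handler rejects each class of invalid sequence named above. The function name decode_binary_string is hypothetical, and the sketch covers only the string payload, not the surrounding Ion binary framing.

```python
def decode_binary_string(payload: bytes) -> str:
    """Decode the UTF-8 octets of a binary-encoded string, strictly."""
    return payload.decode("utf-8", errors="strict")


INVALID_EXAMPLES = {
    "surrogate code point U+D800": b"\xED\xA0\x80",
    "overlong encoding of U+0000": b"\xC0\x80",
    "value above U+10FFFF": b"\xF4\x90\x80\x80",
}

for label, payload in INVALID_EXAMPLES.items():
    try:
        decode_binary_string(payload)
    except UnicodeDecodeError:
        print("rejected:", label)
```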

Ion Clob

An Ion clob type is similar to the blob type except that the denotation in the Ion text format uses an ASCII-based string notation rather than a base64 encoding to denote its binary value. It is important to make the distinction that clob is a sequence of raw octets and string is a sequence of Unicode code points.

Text Format

See the grammar for a formal definition of the Ion Text encoding for the clob type.

Similar to string, adjoining long string literals within an Ion clob are concatenated automatically. Within a clob, only one short string literal or multiple long string literals are allowed. For example, the following two blocks of Ion text syntax are semantically equivalent.

{{ '''Hello'''    '''World''' }}

{{ "HelloWorld" }}

The rules for the quoted strings within a clob are similar to those for the string type, with the following exceptions. Unicode newline characters in long strings and all verbatim ASCII characters are interpreted as their ASCII octet values. Non-printable ASCII and non-ASCII Unicode code points are not allowed unescaped in the string bodies. Furthermore, the following table describes the clob string escape sequences that have direct octet replacement for all strings.

Octet   Ion Escape   Semantics
0x07    \a           ASCII BEL (alert)
0x08    \b           ASCII BS (backspace)
0x09    \t           ASCII HT (tab)
0x0A    \n           ASCII LF (line feed)
0x0C    \f           ASCII FF (form feed)
0x0D    \r           ASCII CR (carriage return)
0x0B    \v           ASCII VT (vertical tab)
0x22    \"           ASCII double quote
0x27    \'           ASCII single quote
0x3F    \?           ASCII question mark
0x5C    \\           ASCII backslash
0x2F    \/           ASCII forward slash
0x00    \0           ASCII NUL (null character)

The clob escape \x must be followed by two hexadecimal digits. Note that clob does not support the \u and \U escapes since it represents an octet sequence and not a Unicode encoding.

Octet   Ion Escape   Semantics
0xHH    \xHH         2-digit hexadecimal octet
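The clob rules can be sketched as a decoder that produces octets rather than code points. The following Python example is a simplified illustration for the short-string form only (it does not handle raw or escaped newlines in long strings); the name unescape_clob is hypothetical, and treating 0x20 through 0x7E as the printable ASCII range is an assumption of this sketch.

```python
CLOB_SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "f": 0x0C, "r": 0x0D,
    "v": 0x0B, '"': 0x22, "'": 0x27, "?": 0x3F, "\\": 0x5C, "/": 0x2F,
    "0": 0x00,
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def unescape_clob(body: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling backslash")
            kind = body[i + 1]
            if kind in CLOB_SIMPLE_ESCAPES:
                out.append(CLOB_SIMPLE_ESCAPES[kind])
                i += 2
            elif kind == "x":
                digits = body[i + 2 : i + 4]
                if len(digits) != 2 or not set(digits) <= HEX_DIGITS:
                    raise ValueError("\\x needs exactly two hex digits")
                out.append(int(digits, 16))
                i += 4
            else:
                # Includes \u and \U, which clob does not support.
                raise ValueError("escape not allowed in a clob")
        elif 0x20 <= ord(ch) <= 0x7E:
            out.append(ord(ch))  # verbatim ASCII maps to its octet value
            i += 1
        else:
            raise ValueError("non-printable or non-ASCII text must be escaped")
    return bytes(out)


print(unescape_clob(r"Hello\x00\xff\n"))  # b'Hello\x00\xff\n'
```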

It is important to note that clob is a binary type designed for binary values that are either text encoded in an ASCII-compatible code page or should be octet-editable by a human (escaped string syntax vs. base64-encoded data). Clearly, text in encodings that are not ASCII-based will not be very readable (e.g. the clob for the string “hello” encoded in EBCDIC code page 037 could be denoted as {{ "\x88\x85\x93\x93\x96" }}).

Binary Format

The Ion binary format represents a clob directly as the sequence of octet values that make up the clob.

References