Strings and Clobs
This document clarifies the semantics of the Amazon Ion string and clob data types with respect to escapes and the Unicode standard.
As of the date of this writing, the Unicode Standard is on version 10.0. This specification is written against that version of the standard.
Unicode Primer
The Unicode standard specifies a large set of code points, the Universal Character Set (UCS), each of which is an integer in the range 0 (0x0) through 1,114,111 (0x10FFFF) inclusive. Throughout this document, the notations U+HHHH and U+HHHHHHHH refer to the Unicode code points HHHH and HHHHHHHH respectively, written as hexadecimal ordinals. This notation follows the Unicode standard convention.
Traditionally, from a programmer’s perspective, a code point can be thought of as a character, but there is sometimes a subtle distinction. For example, in Java, the char type is an unsigned, 16-bit integer, which is normally used to hold UTF-16 code units (e.g. java.lang.CharSequence).
The Unicode code point Mathematical Bold Capital “A” (U+0001D400) is encoded in a UTF-16 string as two code units: 0xD835 followed by 0xDC00. So in this case, Java’s UTF-16 representation actually uses two char values to represent one Unicode code point.
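The arithmetic behind this example can be illustrated with a short Python sketch that splits a supplementary code point into its UTF-16 surrogate pair. The function name `utf16_units` is hypothetical and is not part of any Ion library.

```python
def utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        return [code_point]            # fits in a single 16-bit unit
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return [high, low]


units = utf16_units(0x1D400)
print([f"0x{u:04X}" for u in units])  # ['0xD835', '0xDC00']
```

The result matches the two code units quoted above for U+0001D400.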
This document attempts to avoid using the term character when referring to Unicode code points. The reasoning for this is partly stated above, but also has to do with the overloaded nature of the term (e.g. a user character or grapheme). For more details, consult section 3.4 of the Unicode Standard.
Another interesting aspect of the UCS is a block of code points that is reserved exclusively for use in the UTF-16 encoding (i.e. surrogate code points). As such, strictly speaking, no encoding of Unicode is allowed to represent the code points in the inclusive range U+D800 to U+DFFF. In the UTF-16 case, these values may only appear as code units that together specify code points in the U+00010000 to U+0010FFFF range. Refer to sections 3.8 and 3.9 of the Unicode Standard for details.
Ion String
The Ion String data type is a sequence of Unicode code points. The Ion semantics of this are agnostic to any particular Unicode encoding (e.g. UTF-16, UTF-8), except for the concrete syntax specification of the Ion binary and text formats.
Text Format
See the grammar for a formal definition of the Ion Text encoding for the string type.
Adjacent Ion long string literals, separated by zero or more whitespace characters, are concatenated automatically. For example, the following two blocks of Ion text syntax are semantically equivalent. Note that short string literals do not exhibit this behavior.
"1234" '''Hello''' '''World'''
"1234" "HelloWorld"
Each individual long string literal must be a valid sequence of Unicode code points when unescaped. The following examples are invalid because they split Unicode escapes (the first two examples), an escaped surrogate pair, and a common escape, respectively.
'''\u''' '''1234'''
'''\U0000''' '''1234'''
'''\uD800''' '''\uDC00'''
'''\''' '''n'''
Within long string literals, unescaped newlines are normalized such that U+000D U+000A pairs (CARRIAGE RETURN and LINE FEED respectively) and lone U+000D code points are replaced with U+000A. This is to facilitate compatibility across operating systems.
Normalization can be subverted by using a combination of escapes:
CARRIAGE RETURN only:
'''one\r\
two'''
CARRIAGE RETURN and LINE FEED:
'''one\r
two'''
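As a rough illustration of how normalization and these escapes interact, the following Python sketch processes the raw body of a single long string literal. It is a sketch under stated assumptions, not a conforming parser: the function name `long_string_newlines` is hypothetical, and only the \r and \n escapes and escaped newlines are handled.

```python
import re


def long_string_newlines(body: str) -> str:
    """Apply newline handling to the raw body of one long string literal.

    Only the \\r and \\n escapes and escaped newlines are handled here;
    every other escape is passed through untouched.
    """
    # Step 1: normalize raw CR LF pairs and lone CRs to LF.
    body = re.sub(r"\r\n?", "\n", body)
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == "\n":
                pass                  # escaped newline: removed entirely
            elif nxt == "r":
                out.append("\r")      # \r escape yields U+000D
            elif nxt == "n":
                out.append("\n")      # \n escape yields U+000A
            else:
                out.append(ch + nxt)  # left for a full unescaper
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# CARRIAGE RETURN only: \r escape followed by an escaped newline.
cr_only = long_string_newlines("one\\r\\\ntwo")
# CARRIAGE RETURN and LINE FEED: \r escape followed by a raw newline.
cr_lf = long_string_newlines("one\\r\ntwo")
```

Because the raw newline is normalized before escapes are interpreted, a U+000D produced by the \r escape is never touched by normalization.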
Escaped newlines are not replaced with any characters (i.e. the newline is removed). In addition, the following table describes the string escape sequences that have direct code point replacement for all quoted string and symbol forms.
Unicode Code Point | Ion Escape | Semantics |
---|---|---|
U+0007 | \a | BEL (alert) |
U+0008 | \b | BS (backspace) |
U+0009 | \t | HT (tab) |
U+000A | \n | LF (line feed) |
U+000C | \f | FF (form feed) |
U+000D | \r | CR (carriage return) |
U+000B | \v | VT (vertical tab) |
U+0022 | \" | double quote |
U+0027 | \' | single quote |
U+003F | \? | question mark |
U+005C | \\ | backslash |
U+002F | \/ | forward slash |
U+0000 | \0 | NUL (null character) |
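The table above maps naturally onto a lookup structure. The following Python sketch shows one way to apply it; the names `SIMPLE_ESCAPES` and `unescape_simple` are hypothetical, and hexadecimal escapes and escaped newlines are deliberately out of scope, so the sketch rejects them.

```python
# Single-character Ion string escapes from the table above,
# keyed by the character that follows the backslash.
SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "f": 0x0C,
    "r": 0x0D, "v": 0x0B, '"': 0x22, "'": 0x27, "?": 0x3F,
    "\\": 0x5C, "/": 0x2F, "0": 0x00,
}


def unescape_simple(text: str) -> str:
    """Replace single-character escapes; reject anything else after a backslash."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        # A trailing backslash or an unknown escape character is an error here.
        if i + 1 >= len(text) or text[i + 1] not in SIMPLE_ESCAPES:
            raise ValueError(f"unsupported escape at index {i}")
        out.append(chr(SIMPLE_ESCAPES[text[i + 1]]))
        i += 2
    return "".join(out)
```

For example, `unescape_simple(r"a\tb")` yields the three code points U+0061, U+0009, and U+0062.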
For the Unicode ordinal string escapes, \U, \u, and \x, the escape must be followed by a fixed number of hexadecimal digits as described below.
Unicode | Ion | Semantics |
---|---|---|
U+HHHHHHHH | \UHHHHHHHH | 8-digit hexadecimal Unicode code point |
U+HHHH | \uHHHH | 4-digit hexadecimal Unicode code point; equivalent to \U0000HHHH |
U+00HH | \xHH | 2-digit hexadecimal Unicode code point; equivalent to \u00HH and \U000000HH |
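The fixed digit counts in the table above can be checked mechanically. The following Python sketch decodes one ordinal escape at a given position; `HEX_DIGITS` and `decode_ordinal_escape` are hypothetical names. The sketch only enforces the digit count and leaves range and surrogate validation to the caller.

```python
import re

# Number of hexadecimal digits required after each ordinal escape.
HEX_DIGITS = {"x": 2, "u": 4, "U": 8}


def decode_ordinal_escape(text: str, start: int) -> tuple[int, int]:
    """Decode the \\x, \\u, or \\U escape beginning at text[start].

    Returns (code_point, index_after_escape).
    """
    if text[start:start + 1] != "\\" or text[start + 1:start + 2] not in HEX_DIGITS:
        raise ValueError("not an ordinal escape")
    kind = text[start + 1]
    width = HEX_DIGITS[kind]
    digits = text[start + 2:start + 2 + width]
    # The slice may be short or contain non-hex characters; both are errors.
    if not re.fullmatch(r"[0-9A-Fa-f]{%d}" % width, digits):
        raise ValueError(f"\\{kind} needs exactly {width} hex digits")
    return int(digits, 16), start + 2 + width
```

Under this sketch, `\x41`, `\u0041`, and `\U00000041` all decode to the same code point, U+0041, in line with the equivalences stated in the table.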
Ion does not specify the behavior when these escape sequences denote invalid Unicode code points or surrogate code points (used only for UTF-16). It is highly recommended that Ion implementations reject such escape sequences, as they are not proper Unicode as specified by the standard. To illustrate, consider the Ion string sequence "\uD800\uDC00". A compliant parser may throw an exception because surrogate code points are specified outside of the context of UTF-16, accept the string as a technically invalid sequence of two Unicode code points (i.e. U+D800 and U+DC00), or interpret it as the single Unicode code point U+00010000. In this regard, the Ion string data type does not conform to the Unicode specification. A strict Unicode implementation of the Ion text format should not accept such sequences.
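The three permitted outcomes for "\uD800\uDC00" can be made concrete with a small Python sketch. The function `resolve_surrogates` and its policy names ("reject", "keep", "combine") are hypothetical labels chosen for illustration; they do not correspond to options in any particular Ion implementation.

```python
def resolve_surrogates(code_points: list[int], policy: str) -> list[int]:
    """Apply one of three illustrative policies to escaped surrogates."""
    def is_high(cp: int) -> bool:
        return 0xD800 <= cp <= 0xDBFF

    def is_low(cp: int) -> bool:
        return 0xDC00 <= cp <= 0xDFFF

    out = []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        if not (is_high(cp) or is_low(cp)):
            out.append(cp)
            i += 1
        elif policy == "reject":
            # Strict behavior: surrogates are not valid outside UTF-16.
            raise ValueError(f"surrogate code point U+{cp:04X}")
        elif policy == "keep":
            # Accept the technically invalid code points as-is.
            out.append(cp)
            i += 1
        elif policy == "combine":
            nxt = code_points[i + 1] if i + 1 < len(code_points) else None
            if is_high(cp) and nxt is not None and is_low(nxt):
                # Interpret the pair as one supplementary code point.
                out.append(0x10000 + ((cp - 0xD800) << 10) + (nxt - 0xDC00))
                i += 2
            else:
                raise ValueError(f"unpaired surrogate U+{cp:04X}")
        else:
            raise ValueError(f"unknown policy {policy!r}")
    return out
```

With the input `[0xD800, 0xDC00]`, the "keep" policy returns the two code points unchanged, "combine" returns `[0x10000]`, and "reject" raises an error, which is the behavior recommended for strict implementations.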
Binary Format
The Ion binary format encodes the string data type directly as a sequence of UTF-8 octets. A strict, Unicode-compliant implementation of Ion should not allow invalid UTF-8 sequences (e.g. surrogate code points, overlong values, and values outside of the inclusive range U+0000 to U+0010FFFF).
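As a minimal sketch of strict decoding, Python's built-in UTF-8 codec already rejects the three kinds of invalid sequence listed above. The helper name `decode_ion_string_octets` is hypothetical, and the sketch assumes the caller has already extracted the string's payload octets from the binary stream.

```python
def decode_ion_string_octets(octets: bytes) -> str:
    """Strictly decode the UTF-8 payload of a binary Ion string."""
    try:
        return octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 in string value: {exc.reason}") from exc


# Each of these payloads is rejected by the strict decoder.
INVALID_EXAMPLES = [
    b"\xed\xa0\x80",      # encoded surrogate U+D800
    b"\xc0\x80",          # overlong encoding of U+0000
    b"\xf4\x90\x80\x80",  # value above U+10FFFF
]
```

Valid payloads, including supplementary code points such as U+0001D400, decode without error.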
Ion Clob
An Ion clob type is similar to the blob type except that the denotation in the Ion text format uses an ASCII-based string notation rather than a base64 encoding to denote its binary value. It is important to make the distinction that clob is a sequence of raw octets and string is a sequence of Unicode code points.
Text Format
See the grammar for a formal definition of the Ion Text encoding for the clob type.
Similar to string, adjoining long string literals within an Ion clob are concatenated automatically. Within a clob, only one short string literal or multiple long string literals are allowed. For example, the following two blocks of Ion text syntax are semantically equivalent.
{{ '''Hello''' '''World''' }}
{{ "HelloWorld" }}
The rules for the quoted strings within a clob are similar to those for the string type, with the following exceptions. Unicode newline characters in long strings and all verbatim ASCII characters are interpreted as their ASCII octet values. Non-printable ASCII and non-ASCII Unicode code points are not allowed unescaped in the string bodies. Furthermore, the following table describes the clob string escape sequences that have direct octet replacement for all strings.
Octet | Ion Escape | Semantics |
---|---|---|
0x07 | \a | ASCII BEL (alert) |
0x08 | \b | ASCII BS (backspace) |
0x09 | \t | ASCII HT (tab) |
0x0A | \n | ASCII LF (line feed) |
0x0C | \f | ASCII FF (form feed) |
0x0D | \r | ASCII CR (carriage return) |
0x0B | \v | ASCII VT (vertical tab) |
0x22 | \" | ASCII double quote |
0x27 | \' | ASCII single quote |
0x3F | \? | ASCII question mark |
0x5C | \\ | ASCII backslash |
0x2F | \/ | ASCII forward slash |
0x00 | \0 | ASCII NUL (null character) |
The clob escape \x must be followed by two hexadecimal digits. Note that clob does not support the \u and \U escapes since it represents an octet sequence and not a Unicode encoding.
Octet | Ion Escape | Semantics |
---|---|---|
0xHH | \xHH | 2-digit hexadecimal octet |
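The clob escape rules can be sketched in Python as a conversion from the body of a quoted clob string to its octets. This is an illustration under stated assumptions rather than a conforming parser: `CLOB_ESCAPES` and `clob_body_to_octets` are hypothetical names, and the sketch covers short-string bodies only, so raw and escaped newlines in long strings are not handled.

```python
# Single-character clob escapes, keyed by the character after the backslash.
CLOB_ESCAPES = {
    "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "f": 0x0C,
    "r": 0x0D, "v": 0x0B, '"': 0x22, "'": 0x27, "?": 0x3F,
    "\\": 0x5C, "/": 0x2F, "0": 0x00,
}
HEX = "0123456789abcdefABCDEF"


def clob_body_to_octets(body: str) -> bytes:
    """Convert the body of one quoted clob string to its octets."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            # Verbatim characters must be printable ASCII (space through tilde).
            if not 0x20 <= ord(ch) <= 0x7E:
                raise ValueError(f"character U+{ord(ch):04X} must be escaped")
            out.append(ord(ch))
            i += 1
            continue
        kind = body[i + 1:i + 2]
        if kind == "x":
            digits = body[i + 2:i + 4]
            if len(digits) != 2 or any(d not in HEX for d in digits):
                raise ValueError("\\x needs exactly two hex digits")
            out.append(int(digits, 16))
            i += 4
        elif kind in CLOB_ESCAPES:
            out.append(CLOB_ESCAPES[kind])
            i += 2
        else:
            # Includes \u and \U, which clob does not support.
            raise ValueError(f"unsupported clob escape at index {i}")
    return bytes(out)
```

For instance, the body `Hello\x00\n` converts to the seven octets of "Hello" followed by 0x00 and 0x0A, while a body containing `\u0041` is rejected.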
It is important to note that clob is a binary type designed for binary values that are either text encoded in an ASCII-compatible code page or should be octet-editable by a human (escaped string syntax vs. base64-encoded data). Clearly, encodings that are not ASCII-based will not be very readable (e.g. the clob for the EBCDIC-encoded string representing “hello” could be denoted as {{ "\xc7\xc1%%?" }}).
Binary Format
The Ion binary format represents a clob directly as the octet values of the clob value.