AIDA Binary XML Format

A description of the AIDA Binary XML format follows. It adheres to the WAP-192-WBXML-20010725-a, version 1.3, 25 July 2001 standard, but only uses a small number of its features.

To read the format a standard WBXML parser can be used, calling back user for any of the tags and attributes found, just as is normally done in an SAX parser. Since only a subset of the standard is used, a handwritten parser can also easily be written.

The binary format can easily be written to a (binary) file or stream. Care should be taken with primitive types, the length of UTF strings and their encoding and the encoding of multi-byte integers.

Big Endian

The format contains data in big-endian form.

Multi-byte Integer

WBXML uses a special encoding for multi-byte integers, designated by mb_u_int32. A multi-byte integer consists of a series of bytes, where the most significant bit is the continuation bit and the remaining seven bits are a scalar value. If the continuation bit is set, then this byte is not the end of the multi-byte sequence. A single integer value is encoded into a sequence of N bytes. The first N-1 bytes have the continuation flag set, while the final byte has the continuation flag cleared.

The remaining seven bits in each byte are encoded in big-endian order. For example, the integer value 0xA0 is encoded as 0x81 0x20, while a lower integer value such as 0x60 is just encoded as 0x60.

Charset and String encoding

The charset used is UTF-8 (Java version), see DataInputStream.java. All Strings are encoded in UTF-8 and NULL-terminated, as dictated by the WBXML spec.

BNF Document Structure

The BNF description below is a simplified version of the one used in WBXML, where we left out any of the element that we did not use. Any of those could be re-added in future without breaking compatibility with WBXML.

The description uses standard conventions except that the "|" character is used tp designate alternatives and capitalised words indicate single-byte tokens. "(" and ")" are used to group elements, while optional elements are enclosed in "[" and "]". Elements may be postfixed by "*" to specify 0 or more repetitons of the element or "+" to specify 1 or more. Everything from "#" and following is comment.

start           = version publicid charset strtbl body
version         = 0x03                                  # WBXML version 1.3 (-1)
publicid        = 0x01                                  # UNKNOWN PUBLIC ID
charset         = 0x6a                                  # UTF8 (Java)

strtbl          = length byte*                          # table with NULL-terminated UTF-8 (Java) encoded strings, see below

body            = element                                                                                               
element         = [switchPage] tag [ attribute+ END ] [ content* END ]
END             = 0x01

content         = element                                                                                               
tag             = TAG

attribute       = attrStart attrValue                   # WBXML allows zero or more values,
attrStart       = [switchPage] ATTRSTART                #               we allow just one.
attrValue       = opaque

switchPage      = SWITCH_PAGE pageIndex
SWITCH_PAGE     = 0x00
pageIndex       = u_int8

opaque          = OPAQUE length byte*
OPAQUE          = 0xC3

length          = mb_u_int_32

u_int8          = <unsigned byte>
mb_u_int_32     = <multi-byte unsigned 32 bit integer>

String Table

The string table contains only the NULL-terminated UTF-8 string: "BinaryAIDA/1.0", and is currently not referred to from anywhere in the file.

Tags

TAG page and code values can be found in the Tag Table.

Tags are divided in pages. Each page contains a number of codes starting at 0x05 through 0x3f. The parser (and writer) starts in Tag page 0 and keeps track of the current Tag page.

The two highest bits in a tag have a special meaning. Bit 7 (highest) indicates that attributes are present after (as part of) this tag. Bit 6 indicates that content (or subtags) are present after this tag. The Tag tables show the codes with the highest 2 bits set to zero.

Attributes

ATTRSTART values are defined in the Attributes Table.

Attributes are divided in pages. Each page contains a number of codes starting at 0x05 through 0x3f. The parser (and writer) starts in Attribute page 0 and keeps track of the current Attribute page.

The highest bit (7) in an attribute code has special meaning, but is not used by the Binary AIDA XML.

Opaque values

OPAQUE entries are written with their length, followed by the bytes needed to write string(s) or primitives (double, float, int, ...). The first byte is a type code which specifies which primitive/string type follows.

Strings

Strings are encoded as NULL_terminated UTF-8 using the OPAQUE code. The actual encoded string length is written.

Primitive Values

All attribute values are written as big-endian values using the OPAQUE code. In ASCII xml non of the values have a type. In Binary XML all values have been given a type, and the attribute "VALUE" has been split up into VALUE_BOOLEAN, VALUE_BYTE, ...

The primitive value is written with its length.

Not Used from WBXML

The following items are defined by WBXML but not used by the Binary AIDA Format:

  • Processing Instructions (pi)
  • Literals (LITERAL, LITERAL_A, LITERAL_C and LITERAL_AC).
  • Entity value (entity)
  • Predefined Attribute Values (ATTRVALUE)
  • Extensions
  • String, STR_I (inline) or STR_T (ref), (our strings are written in OPAQUE encoding)