A description of the AIDA Binary XML format follows. It adheres to the WBXML standard (WAP-192-WBXML-20010725-a, version 1.3, 25 July 2001), but uses only a small number of its features.
To read the format, a standard WBXML parser can be used, which calls back into user code for each tag and attribute found, just as a SAX parser normally does. Since only a subset of the standard is used, a handwritten parser is also easy to write.
The binary format can easily be written to a (binary) file or stream. Care should be taken with primitive types, with the length and encoding of UTF strings, and with the encoding of multi-byte integers.
WBXML uses a special encoding for multi-byte integers, designated by mb_u_int32. A multi-byte integer consists of a series of bytes, where the most significant bit is the continuation bit and the remaining seven bits are a scalar value. If the continuation bit is set, then this byte is not the end of the multi-byte sequence. A single integer value is encoded into a sequence of N bytes. The first N-1 bytes have the continuation flag set, while the final byte has the continuation flag cleared.
The seven-bit groups are written in big-endian order, that is, the most significant group comes first. For example, the integer value 0xA0 is encoded as 0x81 0x20, while a lower integer value such as 0x60 is just encoded as 0x60.
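The encoding rule above can be sketched in Python as follows. This is a minimal illustration rather than the AIDA implementation; the function names `encode_mb_u_int32` and `decode_mb_u_int32` are chosen here and do not come from any library.

```python
def encode_mb_u_int32(value):
    """Encode an unsigned 32-bit integer as a WBXML multi-byte integer."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    # Split the value into 7-bit groups, least significant group first.
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # big-endian: most significant group goes first
    # Set the continuation bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_mb_u_int32(data, offset=0):
    """Decode one multi-byte integer; return (value, offset after it)."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # continuation bit clear: this was the last byte
            return value, offset
```

With these helpers, `encode_mb_u_int32(0xA0)` yields the two bytes 0x81 0x20 from the example above, and `decode_mb_u_int32` returns the value together with the position of the next unread byte, which is convenient when walking through a stream.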
The charset used is the Java variant of UTF-8 (modified UTF-8; see DataInputStream.java). All strings are encoded in this charset and NULL-terminated, as dictated by the WBXML spec.
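For readers not working in Java, the string encoding can be sketched as below. The sketch assumes that "UTF-8 (Java version)" means the modified UTF-8 used by Java's DataInput/DataOutput classes, in which U+0000 is written as the two bytes 0xC0 0x80 and characters outside the Basic Multilingual Plane are written as two three-byte sequences, one per UTF-16 surrogate. The function name `encode_java_utf8` is hypothetical.

```python
def encode_java_utf8(text):
    """Encode text as Java modified UTF-8 followed by a NUL terminator."""
    out = bytearray()
    # Work on UTF-16 code units, as Java does, so that characters outside
    # the BMP become two 3-byte sequences (one per surrogate).
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if 0x0001 <= unit <= 0x007F:
            out.append(unit)
        elif unit <= 0x07FF:  # also covers U+0000, written as 0xC0 0x80
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    out.append(0x00)  # NULL terminator
    return bytes(out)
```

Because an embedded U+0000 never produces a 0x00 byte under this assumption, the terminator is unambiguous.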
The BNF description below is a simplified version of the one used in WBXML, from which we left out all elements that we do not use. Any of those could be re-added in the future without breaking compatibility with WBXML.
The description uses standard conventions, except that the "|" character is used to designate alternatives and capitalised words indicate single-byte tokens. "(" and ")" are used to group elements, while optional elements are enclosed in "[" and "]". Elements may be postfixed by "*" to specify 0 or more repetitions of the element, or by "+" to specify 1 or more. Everything from "#" to the end of the line is a comment.
    start      = version publicid charset strtbl body
    version    = 0x03                     # WBXML version 1.3 (-1)
    publicid   = 0x01                     # UNKNOWN PUBLIC ID
    charset    = 0x6a                     # UTF8 (Java)
    strtbl     = length byte*             # table with NULL-terminated UTF-8 (Java) encoded strings, see below
    body       = element
    element    = [switchPage] tag [ attribute+ END ] [ content* END ]
    END        = 0x01
    content    = element
    tag        = TAG
    attribute  = attrStart attrValue      # WBXML allows zero or more values, we allow just one.
    attrStart  = [switchPage] ATTRSTART
    attrValue  = opaque
    switchPage = SWITCH_PAGE pageIndex
    SWITCH_PAGE = 0x00
    pageIndex  = u_int8
    opaque     = OPAQUE length byte*
    OPAQUE     = 0xC3
    length     = mb_u_int32
    u_int8     = <unsigned byte>
    mb_u_int32 = <multi-byte unsigned 32 bit integer>
The string table contains only the NULL-terminated UTF-8 string: "BinaryAIDA/1.0", and is currently not referred to from anywhere in the file.
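Putting the fixed values of the grammar together with this string table gives a constant file header. The sketch below builds and checks it; it assumes that the string table length counts the NULL terminator (15 bytes in total), and the names `build_header` and `skip_header` are hypothetical.

```python
VERSION = 0x03    # WBXML 1.3
PUBLIC_ID = 0x01  # unknown public ID
CHARSET = 0x6A    # UTF-8
STRING_TABLE = b"BinaryAIDA/1.0\x00"  # assumption: terminator is part of the table


def build_header():
    """Return version, publicid, charset and string table as bytes."""
    # The table is shorter than 128 bytes, so its mb_u_int32 length is one byte.
    assert len(STRING_TABLE) < 0x80
    return bytes([VERSION, PUBLIC_ID, CHARSET, len(STRING_TABLE)]) + STRING_TABLE


def skip_header(data):
    """Validate the fixed header and return the offset where the body starts."""
    if data[:3] != bytes([VERSION, PUBLIC_ID, CHARSET]):
        raise ValueError("not an AIDA Binary XML header")
    length = data[3]
    if length & 0x80:
        raise ValueError("multi-byte string table length not handled in this sketch")
    return 4 + length
```

A reader that does not need the string table can call `skip_header` and start parsing the body element at the returned offset.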
TAG page and code values can be found in the Tag Table.
Tags are divided into pages. Each page contains a number of codes in the range 0x05 through 0x3f. The parser (and writer) starts in Tag page 0 and keeps track of the current Tag page.
The two highest bits in a tag have a special meaning. Bit 7 (highest) indicates that attributes are present after (as part of) this tag. Bit 6 indicates that content (or subtags) is present after this tag. The Tag tables show the codes with the two highest bits set to zero.
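The bit layout of a tag byte can be illustrated with a short sketch. The helper names `split_tag` and `make_tag` are chosen here for illustration, and the code 0x05 in the usage note is only the first possible code on a page, not a specific AIDA tag.

```python
HAS_ATTRIBUTES = 0x80  # bit 7: attributes follow the tag
HAS_CONTENT = 0x40     # bit 6: content (subtags) follows the tag
CODE_MASK = 0x3F       # low six bits: code as listed in the tag table


def split_tag(byte):
    """Return (code, has_attributes, has_content) for a tag byte."""
    return byte & CODE_MASK, bool(byte & HAS_ATTRIBUTES), bool(byte & HAS_CONTENT)


def make_tag(code, has_attributes=False, has_content=False):
    """Combine a table code with the two flag bits into a tag byte."""
    if not 0x05 <= code <= 0x3F:
        raise ValueError("tag codes range from 0x05 through 0x3f")
    byte = code
    if has_attributes:
        byte |= HAS_ATTRIBUTES
    if has_content:
        byte |= HAS_CONTENT
    return byte
```

For example, a tag with code 0x05 that carries both attributes and content is written as the byte 0xC5, and `split_tag(0xC5)` recovers the code and both flags.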
ATTRSTART values are defined in the Attributes Table.
Attributes are divided into pages. Each page contains a number of codes in the range 0x05 through 0x3f. The parser (and writer) starts in Attribute page 0 and keeps track of the current Attribute page.
The highest bit (7) in an attribute code has a special meaning in WBXML (it distinguishes attribute value tokens from attribute start tokens), but it is not used by the AIDA Binary XML format.
OPAQUE entries are written with their length, followed by the bytes needed to write string(s) or primitives (double, float, int, ...). The first byte is a type code which specifies which primitive/string type follows.
Strings are encoded as NULL-terminated UTF-8 using the OPAQUE code. The actual encoded string length is written.
All attribute values are written as big-endian values using the OPAQUE code. In ASCII XML none of the values has a type. In Binary XML all values have been given a type, and the attribute "VALUE" has been split up into VALUE_BOOLEAN, VALUE_BYTE, ...
The primitive value is written with its length.
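The framing of an OPAQUE value can be sketched as follows: the OPAQUE token, the payload length as a multi-byte integer, then the payload bytes in big-endian order. This is an illustration under stated assumptions, not the AIDA writer. In particular, the type-code byte mentioned earlier is not modelled, since its values are not listed here, and the helper names `mb_u_int32`, `opaque` and `opaque_double` are hypothetical.

```python
import struct

OPAQUE = 0xC3


def mb_u_int32(value):
    """Encode a length as a WBXML multi-byte integer (7 bits per byte)."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)  # continuation bit on leading bytes
        value >>= 7
    return bytes(reversed(out))


def opaque(payload):
    """Frame raw payload bytes as OPAQUE length byte*."""
    return bytes([OPAQUE]) + mb_u_int32(len(payload)) + payload


def opaque_double(value):
    """Frame a double as an 8-byte big-endian IEEE 754 value."""
    return opaque(struct.pack(">d", value))
```

A 4-byte int would be packed with `struct.pack(">i", n)` in the same way; only the payload and therefore the length byte change.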
The following items are defined by WBXML but not used by the Binary AIDA Format: