Internals.md

Documentation: github.com/google/flatbuffers/docs/source

     1FlatBuffer Internals    {#flatbuffers_internals}
     2====================
     3
     4This section is entirely optional for the use of FlatBuffers. In normal
     5usage, you should never need the information contained herein. If you're
     6interested however, it should give you more of an appreciation of why
     7FlatBuffers is both efficient and convenient.
     8
     9### Format components
    10
    11A FlatBuffer is a binary file and in-memory format consisting mostly of
    12scalars of various sizes, all aligned to their own size. Each scalar is
    13also always represented in little-endian format, as this corresponds to
    14all commonly used CPUs today. FlatBuffers will also work on big-endian
    15machines, but will be slightly slower because of additional
    16byte-swap intrinsics.
    17
    18It is assumed that the following conditions are met, to ensure
    19cross-platform interoperability:
    20- The binary `IEEE-754` format is used for floating-point numbers.
    21- The `two's complemented` representation is used for signed integers.
    22- The endianness is the same for floating-point numbers as for integers.
    23
    24On purpose, the format leaves a lot of details about where exactly
    25things live in memory undefined, e.g. fields in a table can have any
    26order, and objects to some extent can be stored in many orders. This is
    27because the format doesn't need this information to be efficient, and it
    28leaves room for optimization and extension (for example, fields can be
    29packed in a way that is most compact). Instead, the format is defined in
    30terms of offsets and adjacency only. This may mean two different
    31implementations may produce different binaries given the same input
    32values, and this is perfectly valid.
    33
    34### Format identification
    35
    36The format also doesn't contain information for format identification
    37and versioning, which is also by design. FlatBuffers is a statically typed
    38system, meaning the user of a buffer needs to know what kind of buffer
    39it is. FlatBuffers can of course be wrapped inside other containers
    40where needed, or you can use its union feature to dynamically identify
    41multiple possible sub-objects stored. Additionally, it can be used
    42together with the schema parser if full reflective capabilities are
    43desired.
    44
    45Versioning is something that is intrinsically part of the format (the
    46optionality / extensibility of fields), so the format itself does not
    47need a version number (it's a meta-format, in a sense). We're hoping
    48that this format can accommodate all data needed. If format breaking
    49changes are ever necessary, it would become a new kind of format rather
    50than just a variation.
    51
    52### Offsets
    53
    54The most important and generic offset type (see `flatbuffers.h`) is
    55`uoffset_t`, which is currently always a `uint32_t`, and is used to
    56refer to all tables/unions/strings/vectors (these are never stored
    57in-line). 32bit is
    58intentional, since we want to keep the format binary compatible between
    5932 and 64bit systems, and a 64bit offset would bloat the size for almost
    60all uses. A version of this format with 64bit (or 16bit) offsets is easy to set
    61when needed. Unsigned means they can only point in one direction, which
    62typically is forward (towards a higher memory location). Any backwards
    63offsets will be explicitly marked as such.
    64
    65The format starts with an `uoffset_t` to the root table in the buffer.
    66
    67We have two kinds of objects, structs and tables.
    68
    69### Structs
    70
    71These are the simplest, and as mentioned, intended for simple data that
    72benefits from being extra efficient and doesn't need versioning /
    73extensibility. They are always stored inline in their parent (a struct,
    74table, or vector) for maximum compactness. Structs define a consistent
    75memory layout where all components are aligned to their size, and
    76structs aligned to their largest scalar member. This is done independent
    77of the alignment rules of the underlying compiler to guarantee a cross
    78platform compatible layout. This layout is then enforced in the generated
    79code.
    80
    81### Tables
    82
    83Unlike structs, these are not stored in inline in their parent, but are
    84referred to by offset.
    85
    86They start with an `soffset_t` to a vtable. This is a signed version of
    87`uoffset_t`, since vtables may be stored anywhere relative to the object.
    88This offset is subtracted (not added) from the object start to arrive at
    89the vtable start. This offset is followed by all the
    90fields as aligned scalars (or offsets). Unlike structs, not all fields
    91need to be present. There is no set order and layout. A table may contain
    92field offsets that point to the same value if the user explicitly
    93serializes the same offset twice.
    94
    95To be able to access fields regardless of these uncertainties, we go
    96through a vtable of offsets. Vtables are shared between any objects that
    97happen to have the same vtable values.
    98
    99The elements of a vtable are all of type `voffset_t`, which is
   100a `uint16_t`. The first element is the size of the vtable in bytes,
   101including the size element. The second one is the size of the object, in bytes
   102(including the vtable offset). This size could be used for streaming, to know
   103how many bytes to read to be able to access all *inline* fields of the object.
   104The remaining elements are the N offsets, where N is the amount of fields
   105declared in the schema when the code that constructed this buffer was
   106compiled (thus, the size of the table is N + 2).
   107
   108All accessor functions in the generated code for tables contain the
   109offset into this table as a constant. This offset is checked against the
   110first field (the number of elements), to protect against newer code
   111reading older data. If this offset is out of range, or the vtable entry
   112is 0, that means the field is not present in this object, and the
   113default value is return. Otherwise, the entry is used as offset to the
   114field to be read.
   115
   116### Unions
   117
   118Unions are encoded as the combination of two fields: an enum representing the
   119union choice and the offset to the actual element. FlatBuffers reserves the
   120enumeration constant `NONE` (encoded as 0) to mean that the union field is not
   121set.
   122
   123### Strings and Vectors
   124
   125Strings are simply a vector of bytes, and are always
   126null-terminated. Vectors are stored as contiguous aligned scalar
   127elements prefixed by a 32bit element count (not including any
   128null termination). Neither is stored inline in their parent, but are referred to
   129by offset. A vector may consist of more than one offset pointing to the same
   130value if the user explicitly serializes the same offset twice.
   131
   132### Construction
   133
   134The current implementation constructs these buffers backwards (starting
   135at the highest memory address of the buffer), since
   136that significantly reduces the amount of bookkeeping and simplifies the
   137construction API.
   138
   139### Code example
   140
   141Here's an example of the code that gets generated for the `samples/monster.fbs`.
   142What follows is the entire file, broken up by comments:
   143
   144    // automatically generated, do not modify
   145
   146    #include "flatbuffers/flatbuffers.h"
   147
   148    namespace MyGame {
   149    namespace Sample {
   150
   151Nested namespace support.
   152
   153    enum {
   154      Color_Red = 0,
   155      Color_Green = 1,
   156      Color_Blue = 2,
   157    };
   158
   159    inline const char **EnumNamesColor() {
   160      static const char *names[] = { "Red", "Green", "Blue", nullptr };
   161      return names;
   162    }
   163
   164    inline const char *EnumNameColor(int e) { return EnumNamesColor()[e]; }
   165
   166Enums and convenient reverse lookup.
   167
   168    enum {
   169      Any_NONE = 0,
   170      Any_Monster = 1,
   171    };
   172
   173    inline const char **EnumNamesAny() {
   174      static const char *names[] = { "NONE", "Monster", nullptr };
   175      return names;
   176    }
   177
   178    inline const char *EnumNameAny(int e) { return EnumNamesAny()[e]; }
   179
   180Unions share a lot with enums.
   181
   182    struct Vec3;
   183    struct Monster;
   184
   185Predeclare all data types since circular references between types are allowed
   186(circular references between object are not, though).
   187
   188    FLATBUFFERS_MANUALLY_ALIGNED_STRUCT(4) Vec3 {
   189     private:
   190      float x_;
   191      float y_;
   192      float z_;
   193
   194     public:
   195      Vec3(float x, float y, float z)
   196        : x_(flatbuffers::EndianScalar(x)), y_(flatbuffers::EndianScalar(y)), z_(flatbuffers::EndianScalar(z)) {}
   197
   198      float x() const { return flatbuffers::EndianScalar(x_); }
   199      float y() const { return flatbuffers::EndianScalar(y_); }
   200      float z() const { return flatbuffers::EndianScalar(z_); }
   201    };
   202    FLATBUFFERS_STRUCT_END(Vec3, 12);
   203
   204These ugly macros do a couple of things: they turn off any padding the compiler
   205might normally do, since we add padding manually (though none in this example),
   206and they enforce alignment chosen by FlatBuffers. This ensures the layout of
   207this struct will look the same regardless of compiler and platform. Note that
   208the fields are private: this is because these store little endian scalars
   209regardless of platform (since this is part of the serialized data).
   210`EndianScalar` then converts back and forth, which is a no-op on all current
   211mobile and desktop platforms, and a single machine instruction on the few
   212remaining big endian platforms.
   213
   214    struct Monster : private flatbuffers::Table {
   215      const Vec3 *pos() const { return GetStruct<const Vec3 *>(4); }
   216      int16_t mana() const { return GetField<int16_t>(6, 150); }
   217      int16_t hp() const { return GetField<int16_t>(8, 100); }
   218      const flatbuffers::String *name() const { return GetPointer<const flatbuffers::String *>(10); }
   219      const flatbuffers::Vector<uint8_t> *inventory() const { return GetPointer<const flatbuffers::Vector<uint8_t> *>(14); }
   220      int8_t color() const { return GetField<int8_t>(16, 2); }
   221    };
   222
   223Tables are a bit more complicated. A table accessor struct is used to point at
   224the serialized data for a table, which always starts with an offset to its
   225vtable. It derives from `Table`, which contains the `GetField` helper functions.
   226GetField takes a vtable offset, and a default value. It will look in the vtable
   227at that offset. If the offset is out of bounds (data from an older version) or
   228the vtable entry is 0, the field is not present and the default is returned.
   229Otherwise, it uses the entry as an offset into the table to locate the field.
   230
   231    struct MonsterBuilder {
   232      flatbuffers::FlatBufferBuilder &fbb_;
   233      flatbuffers::uoffset_t start_;
   234      void add_pos(const Vec3 *pos) { fbb_.AddStruct(4, pos); }
   235      void add_mana(int16_t mana) { fbb_.AddElement<int16_t>(6, mana, 150); }
   236      void add_hp(int16_t hp) { fbb_.AddElement<int16_t>(8, hp, 100); }
   237      void add_name(flatbuffers::Offset<flatbuffers::String> name) { fbb_.AddOffset(10, name); }
   238      void add_inventory(flatbuffers::Offset<flatbuffers::Vector<uint8_t>> inventory) { fbb_.AddOffset(14, inventory); }
   239      void add_color(int8_t color) { fbb_.AddElement<int8_t>(16, color, 2); }
   240      MonsterBuilder(flatbuffers::FlatBufferBuilder &_fbb) : fbb_(_fbb) { start_ = fbb_.StartTable(); }
   241      flatbuffers::Offset<Monster> Finish() { return flatbuffers::Offset<Monster>(fbb_.EndTable(start_, 7)); }
   242    };
   243
   244`MonsterBuilder` is the base helper struct to construct a table using a
   245`FlatBufferBuilder`. You can add the fields in any order, and the `Finish`
   246call will ensure the correct vtable gets generated.
   247
   248    inline flatbuffers::Offset<Monster> CreateMonster(flatbuffers::FlatBufferBuilder &_fbb,
   249                                                      const Vec3 *pos, int16_t mana,
   250                                                      int16_t hp,
   251                                                      flatbuffers::Offset<flatbuffers::String> name,
   252                                                      flatbuffers::Offset<flatbuffers::Vector<uint8_t>> inventory,
   253                                                      int8_t color) {
   254      MonsterBuilder builder_(_fbb);
   255      builder_.add_inventory(inventory);
   256      builder_.add_name(name);
   257      builder_.add_pos(pos);
   258      builder_.add_hp(hp);
   259      builder_.add_mana(mana);
   260      builder_.add_color(color);
   261      return builder_.Finish();
   262    }
   263
   264`CreateMonster` is a convenience function that calls all functions in
   265`MonsterBuilder` above for you. Note that if you pass values which are
   266defaults as arguments, it will not actually construct that field, so
   267you can probably use this function instead of the builder class in
   268almost all cases.
   269
   270    inline const Monster *GetMonster(const void *buf) { return flatbuffers::GetRoot<Monster>(buf); }
   271
   272This function is only generated for the root table type, to be able to
   273start traversing a FlatBuffer from a raw buffer pointer.
   274
   275    }; // namespace MyGame
   276    }; // namespace Sample
   277
   278### Encoding example.
   279
   280Below is a sample encoding for the following JSON corresponding to the above
   281schema:
   282
   283    { pos: { x: 1, y: 2, z: 3 }, name: "fred", hp: 50 }
   284
   285Resulting in this binary buffer:
   286
   287    // Start of the buffer:
   288    uint32_t 20  // Offset to the root table.
   289
   290    // Start of the vtable. Not shared in this example, but could be:
   291    uint16_t 16 // Size of table, starting from here.
   292    uint16_t 22 // Size of object inline data.
   293    uint16_t 4, 0, 20, 16, 0, 0  // Offsets to fields from start of (root) table, 0 for not present.
   294
   295    // Start of the root table:
   296    int32_t 16     // Offset to vtable used (default negative direction)
   297    float 1, 2, 3  // the Vec3 struct, inline.
   298    uint32_t 8     // Offset to the name string.
   299    int16_t 50     // hp field.
   300    int16_t 0      // Padding for alignment.
   301
   302    // Start of name string:
   303    uint32_t 4  // Length of string.
   304    int8_t 'f', 'r', 'e', 'd', 0, 0, 0, 0  // Text + 0 termination + padding.
   305
   306Note that this not the only possible encoding, since the writer has some
   307flexibility in which of the children of root object to write first (though in
   308this case there's only one string), and what order to write the fields in.
   309Different orders may also cause different alignments to happen.
   310
   311### Additional reading.
   312
   313The author of the C language implementation has made a similar
   314[document](https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#flatbuffers-binary-format)
   315that may further help clarify the format.
   316
   317# FlexBuffers
   318
   319The [schema-less](@ref flexbuffers) version of FlatBuffers have their
   320own encoding, detailed here.
   321
   322It shares many properties mentioned above, in that all data is accessed
   323over offsets, all scalars are aligned to their own size, and
   324all data is always stored in little endian format.
   325
   326One difference is that FlexBuffers are built front to back, so children are
   327stored before parents, and the root of the data starts at the last byte.
   328
   329Another difference is that scalar data is stored with a variable number of bits
   330(8/16/32/64). The current width is always determined by the *parent*, i.e. if
   331the scalar sits in a vector, the vector determines the bit width for all
   332elements at once. Selecting the minimum bit width for a particular vector is
   333something the encoder does automatically and thus is typically of no concern
   334to the user, though being aware of this feature (and not sticking a double in
   335the same vector as a bunch of byte sized elements) is helpful for efficiency.
   336
   337Unlike FlatBuffers there is only one kind of offset, and that is an unsigned
   338integer indicating the number of bytes in a negative direction from the address
   339of itself (where the offset is stored).
   340
   341### Vectors
   342
   343The representation of the vector is at the core of how FlexBuffers works (since
   344maps are really just a combination of 2 vectors), so it is worth starting there.
   345
   346As mentioned, a vector is governed by a single bit width (supplied by its
   347parent). This includes the size field. For example, a vector that stores the
   348integer values `1, 2, 3` is encoded as follows:
   349
   350    uint8_t 3, 1, 2, 3, 4, 4, 4
   351
   352The first `3` is the size field, and is placed before the vector (an offset
   353from the parent to this vector points to the first element, not the size
   354field, so the size field is effectively at index -1).
   355Since this is an untyped vector `SL_VECTOR`, it is followed by 3 type
   356bytes (one per element of the vector), which are always following the vector,
   357and are always a uint8_t even if the vector is made up of bigger scalars.
   358
   359A vector may include more than one offset pointing to the same value if the
   360user explicitly serializes the same offset twice.
   361
   362### Types
   363
   364A type byte is made up of 2 components (see flexbuffers.h for exact values):
   365
   366* 2 lower bits representing the bit-width of the child (8, 16, 32, 64).
   367  This is only used if the child is accessed over an offset, such as a child
   368  vector. It is ignored for inline types.
   369* 6 bits representing the actual type (see flexbuffers.h).
   370
   371Thus, in this example `4` means 8 bit child (value 0, unused, since the value is
   372in-line), type `SL_INT` (value 1).
   373
   374### Typed Vectors
   375
   376These are like the Vectors above, but omit the type bytes. The type is instead
   377determined by the vector type supplied by the parent. Typed vectors are only
   378available for a subset of types for which these savings can be significant,
   379namely inline signed/unsigned integers (`TYPE_VECTOR_INT` / `TYPE_VECTOR_UINT`),
   380floats (`TYPE_VECTOR_FLOAT`), and keys (`TYPE_VECTOR_KEY`, see below).
   381
   382Additionally, for scalars, there are fixed length vectors of sizes 2 / 3 / 4
   383that don't store the size (`TYPE_VECTOR_INT2` etc.), for an additional savings
   384in space when storing common vector or color data.
   385
   386### Scalars
   387
   388FlexBuffers supports integers (`TYPE_INT` and `TYPE_UINT`) and floats
   389(`TYPE_FLOAT`), available in the bit-widths mentioned above. They can be stored
   390both inline and over an offset (`TYPE_INDIRECT_*`).
   391
   392The offset version is useful to encode costly 64bit (or even 32bit) quantities
   393into vectors / maps of smaller sizes, and to share / repeat a value multiple
   394times.
   395
   396### Booleans and Nulls
   397
   398Booleans (`TYPE_BOOL`) and nulls (`TYPE_NULL`) are encoded as inlined unsigned integers.
   399
   400### Blobs, Strings and Keys.
   401
   402A blob (`TYPE_BLOB`) is encoded similar to a vector, with one difference: the
   403elements are always `uint8_t`. The parent bit width only determines the width of
   404the size field, allowing blobs to be large without the elements being large.
   405
   406Strings (`TYPE_STRING`) are similar to blobs, except they have an additional 0
   407termination byte for convenience, and they MUST be UTF-8 encoded (since an
   408accessor in a language that does not support pointers to UTF-8 data may have to
   409convert them to a native string type).
   410
   411A "Key" (`TYPE_KEY`) is similar to a string, but doesn't store the size
   412field. They're so named because they are used with maps, which don't care
   413for the size, and can thus be even more compact. Unlike strings, keys cannot
   414contain bytes of value 0 as part of their data (size can only be determined by
   415`strlen`), so while you can use them outside the context of maps if you so
   416desire, you're usually better off with strings.
   417
   418### Maps
   419
   420A map (`TYPE_MAP`) is like an (untyped) vector, but with 2 prefixes before the
   421size field:
   422
   423| index | field                                                        |
   424| ----: | :----------------------------------------------------------- |
   425| -3    | An offset to the keys vector (may be shared between tables). |
   426| -2    | Byte width of the keys vector.                               |
   427| -1    | Size (from here on it is compatible with `TYPE_VECTOR`)      |
   428| 0     | Elements.                                                    |
   429| Size  | Types.                                                       |
   430
   431Since a map is otherwise the same as a vector, it can be iterated like
   432a vector (which is probably faster than lookup by key).
   433
   434The keys vector is a typed vector of keys. Both the keys and corresponding
   435values *have* to be stored in sorted order (as determined by `strcmp`), such
   436that lookups can be made using binary search.
   437
   438The reason the key vector is a separate structure from the value vector is
   439such that it can be shared between multiple value vectors, and also to
   440allow it to be treated as its own individual vector in code.
   441
   442An example map { foo: 13, bar: 14 } would be encoded as:
   443
   444    0 : uint8_t 'b', 'a', 'r', 0
   445    4 : uint8_t 'f', 'o', 'o', 0
   446    8 : uint8_t 2      // key vector of size 2
   447    // key vector offset points here
   448    9 : uint8_t 9, 6   // offsets to bar_key and foo_key
   449    11: uint8_t 2, 1   // offset to key vector, and its byte width
   450    13: uint8_t 2      // value vector of size
   451    // value vector offset points here
   452    14: uint8_t 14, 13 // values
   453    16: uint8_t 4, 4   // types
   454
   455### The root
   456
   457As mentioned, the root starts at the end of the buffer.
   458The last uint8_t is the width in bytes of the root (normally the parent
   459determines the width, but the root has no parent). The uint8_t before this is
   460the type of the root, and the bytes before that are the root value (of the
   461number of bytes specified by the last byte).
   462
   463So for example, the integer value `13` as root would be:
   464
   465    uint8_t 13, 4, 1    // Value, type, root byte width.
   466
   467
   468<br>
View as plain text