18.9 Advanced topic: custom blocks

18.9.1. The struct custom_operations
18.9.2. Allocating custom blocks
18.9.3. Accessing custom blocks
18.9.4. Writing custom serialization and deserialization functions
18.9.5. Choosing identifiers
18.9.6. Finalized blocks

Blocks with tag Custom_tag contain both arbitrary user data and a pointer to a C struct, with type struct custom_operations, that associates user-provided finalization, comparison, hashing, serialization and deserialization functions with this block.

18.9.1 The struct custom_operations

The struct custom_operations is defined in <caml/custom.h> and contains the following fields:

char *identifier

A zero-terminated character string serving as an identifier for serialization and deserialization operations.

void (*finalize)(value v)

The finalize field contains a pointer to a C function that is called when the block becomes unreachable and is about to be reclaimed. The block is passed as first argument to the function. The finalize field can also be custom_finalize_default to indicate that no finalization function is associated with the block.

int (*compare)(value v1, value v2)

The compare field contains a pointer to a C function that is called whenever two custom blocks are compared using Caml's generic comparison operators (=, <>, <=, >=, <, > and compare). The C function should return 0 if the data contained in the two blocks are structurally equal, a negative integer if the data from the first block is less than the data from the second block, and a positive integer if the data from the first block is greater than the data from the second block. The compare field can be set to custom_compare_default; this default comparison function simply raises Failure.

long (*hash)(value v)

The hash field contains a pointer to a C function that is called whenever Caml's generic hash operator (see module Hashtbl) is applied to a custom block. The C function can return an arbitrary long integer representing the hash value of the data contained in the given custom block. The hash value must be compatible with the compare function, in the sense that two structurally equal pieces of data (that is, two custom blocks for which compare returns 0) must have the same hash value. The hash field can be set to custom_hash_default, in which case the custom block is ignored during hash computation.

void (*serialize)(value v, unsigned long * wsize_32, unsigned long * wsize_64)

The serialize field contains a pointer to a C function that is called whenever the custom block needs to be serialized (marshaled) using the Caml functions output_value or Marshal.to_.... For a custom block, those functions first write the identifier of the block (as given by the identifier field) to the output stream, then call the user-provided serialize function. That function is responsible for writing the data contained in the custom block, using the caml_serialize_... functions defined in <caml/intext.h> and listed below. The user-provided serialize function must then store in its wsize_32 and wsize_64 parameters the sizes in bytes of the data part of the custom block on a 32-bit architecture and on a 64-bit architecture, respectively. The serialize field can be set to custom_serialize_default, in which case the Failure exception is raised when attempting to serialize the custom block.

unsigned long (*deserialize)(void * dst)

The deserialize field contains a pointer to a C function that is called whenever a custom block with identifier identifier needs to be deserialized (un-marshaled) using the Caml functions input_value or Marshal.from_.... This user-provided function is responsible for reading back the data written by the serialize operation, using the caml_deserialize_... functions defined in <caml/intext.h> and listed below. It must then rebuild the data part of the custom block and store it at the pointer given as the dst argument. Finally, it returns the size in bytes of the data part of the custom block. This size must be identical to the wsize_32 result of the serialize operation if the architecture is 32 bits, or wsize_64 if the architecture is 64 bits. The deserialize field can be set to custom_deserialize_default to indicate that deserialization is not supported. In this case, do not register the struct custom_operations with the deserializer using caml_register_custom_operations (see below).

Note: the finalize, compare, hash, serialize and deserialize functions attached to custom block descriptors must never trigger a garbage collection. Within these functions, do not call any of the Caml allocation functions, and do not perform a callback into Caml code. Do not use CAMLparam to register the parameters to these functions, and do not use CAMLreturn to return the result.
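The compatibility requirement between compare and hash can be checked without the runtime. The following is a minimal standalone sketch, not code from the Caml distribution: the struct point payload and the functions point_compare and point_hash are hypothetical names. In a real stub, the two functions would take value arguments and reach the payload through Data_custom_val; here they take plain pointers so that the logic compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload that a custom block might carry. */
struct point {
    int32_t x;
    int32_t y;
};

/* Three-way comparison of two integers; avoids the overflow risk of a - b. */
static int cmp_int32(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}

/* Lexicographic order on (x, y). Returns a negative integer, 0, or a
   positive integer, as required of the compare field. */
int point_compare(const struct point *p1, const struct point *p2)
{
    assert(p1 != NULL && p2 != NULL);
    int c = cmp_int32(p1->x, p2->x);
    if (c != 0) return c;
    return cmp_int32(p1->y, p2->y);
}

/* Hash computed only from the fields that point_compare inspects, so two
   payloads that compare equal always receive the same hash value. */
long point_hash(const struct point *p)
{
    assert(p != NULL);
    uint32_t h = (uint32_t) p->x * 31u + (uint32_t) p->y;
    return (long) (h & 0x3FFFFFFFu);
}
```

Neither function allocates or calls back into Caml, which matches the restriction stated in the note above.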

18.9.2 Allocating custom blocks

Custom blocks must be allocated via the caml_alloc_custom function. caml_alloc_custom(ops, size, used, max) returns a fresh custom block, with room for size bytes of user data, and whose associated operations are given by ops (a pointer to a struct custom_operations, usually statically allocated as a C global variable).

The two parameters used and max are used to control the speed of garbage collection when the finalized object contains pointers to out-of-heap resources. Generally speaking, the Caml incremental major collector adjusts its speed relative to the allocation rate of the program. The faster the program allocates, the harder the GC works in order to quickly reclaim unreachable blocks and avoid having a large amount of "floating garbage" (unreferenced objects that the GC has not yet collected).

Normally, the allocation rate is measured by counting the in-heap size of allocated blocks. However, it often happens that finalized objects contain pointers to out-of-heap memory blocks and other resources (such as file descriptors, X Windows bitmaps, etc.). For those blocks, the in-heap size of blocks is not a good measure of the quantity of resources allocated by the program.

The two arguments used and max give the GC an idea of how much out-of-heap resources are consumed by the finalized block being allocated: you give the amount of resources allocated to this object as parameter used, and the maximum amount that you want to see in floating garbage as parameter max. The units are arbitrary: the GC cares only about the ratio used / max.

For instance, if you are allocating a finalized block holding an X Windows bitmap of w by h pixels, and you'd rather not have more than 1 megapixel of unreclaimed bitmaps, specify used = w * h and max = 1000000.

Another way to describe the effect of the used and max parameters is in terms of full GC cycles. If you allocate many custom blocks with used / max = 1 / N, the GC will then do one full cycle (examining every object in the heap and calling finalization functions on those that are unreachable) every N allocations. For instance, if used = 1 and max = 1000, the GC will do one full cycle at least every 1000 allocations of custom blocks.
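The arithmetic behind the two descriptions above can be sketched as follows. This is only an illustration of the used / max ratio, assuming every block is allocated with the same parameters; allocations_per_full_cycle and bitmap_used are hypothetical helper names, not functions of the Caml runtime.

```c
#include <assert.h>

/* Approximate number of custom-block allocations between two full GC
   cycles, following the rule "used / max = 1 / N gives one full cycle
   every N allocations". The result is rounded down. */
unsigned long allocations_per_full_cycle(unsigned long used, unsigned long max)
{
    assert(used > 0 && used <= max);
    return max / used;
}

/* Value of the used parameter for a bitmap of w by h pixels, as in the
   bitmap example: one unit of resource per pixel. */
unsigned long bitmap_used(unsigned long w, unsigned long h)
{
    return w * h;
}
```

For instance, allocations_per_full_cycle(1, 1000) gives 1000, and with 640 by 480 bitmaps and max = 1000000 the helper gives 3, that is, roughly one full cycle every three bitmap allocations.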

If your finalized blocks contain no pointers to out-of-heap resources, or if the previous discussion made little sense to you, just take used = 0 and max = 1. But if you later find that the finalization functions are not called "often enough", consider increasing the used / max ratio.

18.9.3 Accessing custom blocks

The data part of a custom block v can be accessed via the pointer Data_custom_val(v). This pointer has type void * and should be cast to the actual type of the data stored in the custom block.

The contents of custom blocks are not scanned by the garbage collector, and must therefore not contain any pointer inside the Caml heap. In other words, never store a Caml value in a custom block, and do not use Field, Store_field or caml_modify to access the data part of a custom block. Conversely, any C data structure (not containing heap pointers) can be stored in a custom block.

18.9.4 Writing custom serialization and deserialization functions

The following functions, defined in <caml/intext.h>, are provided to write and read back the contents of custom blocks in a portable way. Those functions handle endianness conversions, for example when data is written on a little-endian machine and read back on a big-endian machine.

Function                  Action
caml_serialize_int_1      Write a 1-byte integer
caml_serialize_int_2      Write a 2-byte integer
caml_serialize_int_4      Write a 4-byte integer
caml_serialize_int_8      Write an 8-byte integer
caml_serialize_float_4    Write a 4-byte float
caml_serialize_float_8    Write an 8-byte float
caml_serialize_block_1    Write an array of 1-byte quantities
caml_serialize_block_2    Write an array of 2-byte quantities
caml_serialize_block_4    Write an array of 4-byte quantities
caml_serialize_block_8    Write an array of 8-byte quantities
caml_deserialize_uint_1   Read an unsigned 1-byte integer
caml_deserialize_sint_1   Read a signed 1-byte integer
caml_deserialize_uint_2   Read an unsigned 2-byte integer
caml_deserialize_sint_2   Read a signed 2-byte integer
caml_deserialize_uint_4   Read an unsigned 4-byte integer
caml_deserialize_sint_4   Read a signed 4-byte integer
caml_deserialize_uint_8   Read an unsigned 8-byte integer
caml_deserialize_sint_8   Read a signed 8-byte integer
caml_deserialize_float_4  Read a 4-byte float
caml_deserialize_float_8  Read an 8-byte float
caml_deserialize_block_1  Read an array of 1-byte quantities
caml_deserialize_block_2  Read an array of 2-byte quantities
caml_deserialize_block_4  Read an array of 4-byte quantities
caml_deserialize_block_8  Read an array of 8-byte quantities
caml_deserialize_error    Signal an error during deserialization; input_value or Marshal.from_... raise a Failure exception after cleaning up their internal data structures
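To picture what portable writing and reading of a multi-byte integer involves, here is a minimal standalone sketch. It is not the runtime's implementation: it assumes a fixed big-endian byte order in the stream, and put_be32 and get_be32 are hypothetical helper names. Because the bytes are produced by shifting rather than by copying the in-memory representation, the same byte sequence results on little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit integer into buf[0..3], most significant byte first,
   independently of the byte order of the host. */
void put_be32(unsigned char *buf, uint32_t x)
{
    assert(buf != NULL);
    buf[0] = (unsigned char) (x >> 24);
    buf[1] = (unsigned char) (x >> 16);
    buf[2] = (unsigned char) (x >> 8);
    buf[3] = (unsigned char) x;
}

/* Read back a 32-bit integer written by put_be32. */
uint32_t get_be32(const unsigned char *buf)
{
    assert(buf != NULL);
    return ((uint32_t) buf[0] << 24)
         | ((uint32_t) buf[1] << 16)
         | ((uint32_t) buf[2] << 8)
         |  (uint32_t) buf[3];
}
```

In a real serialize or deserialize function, the functions from the table above take care of this conversion, so user code should call them rather than copy raw memory.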

Serialization functions are attached to the custom blocks to which they apply. Obviously, deserialization functions cannot be attached this way, since the custom block does not exist yet when deserialization begins! Thus, the struct custom_operations that contain deserialization functions must be registered with the deserializer in advance, using the caml_register_custom_operations function declared in <caml/custom.h>. Deserialization proceeds by reading the identifier off the input stream, allocating a custom block of the size specified in the input stream, searching the registered struct custom_operations blocks for one with the same identifier, and calling its deserialize function to fill the data part of the custom block.

18.9.5 Choosing identifiers

Identifiers in struct custom_operations must be chosen carefully, since they must uniquely identify the data structure for serialization and deserialization operations. In particular, consider including a version number in the identifier; this way, the format of the data can be changed later, yet backward-compatible deserialization functions can be provided.

Identifiers starting with _ (an underscore character) are reserved for the Objective Caml runtime system; do not use them for your custom data. We recommend using a URL (http://mymachine.mydomain.com/mylibrary/version-number) or a Java-style package name (com.mydomain.mymachine.mylibrary.version-number) as identifiers, to minimize the risk of identifier collision.
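As a small illustration of these conventions, the sketch below declares one identifier in the Java-style form with a trailing version number, and a check for the reserved underscore prefix. Both the identifier string and the function identifier_is_reserved are hypothetical examples, not names defined by the runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifier: Java-style package name ending in a version
   number, so that a later format change can use a new identifier. */
const char counter_identifier[] = "com.example.mylibrary.counter.1";

/* Returns 1 if id starts with an underscore, the prefix reserved for the
   runtime system, and 0 otherwise. */
int identifier_is_reserved(const char *id)
{
    assert(id != NULL);
    return id[0] == '_';
}
```

A check of this kind could be run once at library initialization, before the operations are registered.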

18.9.6 Finalized blocks

Custom blocks generalize the finalized blocks that were present in Objective Caml prior to version 3.00. For backward compatibility, the format of custom blocks is compatible with that of finalized blocks, and the caml_alloc_final function is still available to allocate a custom block with a given finalization function but default comparison, hashing and serialization functions. caml_alloc_final(n, f, used, max) returns a fresh custom block of size n words, with finalization function f. The first word is reserved for storing the custom operations; the other n-1 words are available for your data. The two parameters used and max are used to control the speed of garbage collection, as described for caml_alloc_custom.
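Since n counts words and includes the reserved first word, the value to pass for a payload of a given size in bytes needs a small computation. The following is a minimal sketch under stated assumptions: final_block_words is a hypothetical helper, and the word size is passed explicitly (4 on a 32-bit architecture, 8 on a 64-bit one) so that the code compiles without the Caml headers; in a real stub it would typically be sizeof(value).

```c
#include <assert.h>
#include <stddef.h>

/* Number of words n to request for a finalized block that must hold
   `bytes` bytes of user data: one word reserved for the custom
   operations, plus the data size rounded up to a whole number of words. */
size_t final_block_words(size_t bytes, size_t word_size)
{
    assert(word_size > 0);
    return 1 + (bytes + word_size - 1) / word_size;
}
```

For example, 9 bytes of data need 3 words with 8-byte words (one reserved, two for data) and 4 words with 4-byte words.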