Efficient methods of storage for dynamic languages

Table of Contents

Alright this is a bit involved, but I think it should be possible to follow it with some coffee. I assume you know the very basics of C but honestly this is kinda like general knowledge for computers. If any of this interests you, good: write a compiler NOW!

A classical tagged union

Firstly: dynamic languages do type checks at runtime (🤮), so we need a uniform data structure to house all possible data objects that the interpreter may deal with. Consider the classical approach to constructing such a data type:

struct Object
{
  enum Type
  {
    INTEGER = 0,
    LIST
  } type;

  union
  {
    int integer;
    struct List
    {
      struct Object *car, *cdr;
    } list;
  };
};

This is the “tagged union” approach. We have something to designate the type of the object (a “tag”) and a collection of possible payloads (a “union”). In particular, a union in C ensures that all the data should be housed in the same memory so we’re not losing too much space. On my machine we get the following breakdown for the sizes:

Property	Size (bytes)
type	4
integer	4
list	16

This makes perfect sense of course:

Enumerations are internally represented as 32 bit integers
Integers are 32 bit
A pointer on a 64 bit machine is 8 bytes and a list is composed of 2 of them, so of course it’s 16 bytes

A union must have enough space to house the largest possible type in the union. So the payload, in our case, must be at least 16 bytes for the list. Assuming the union really works and the C compiler doesn’t decide to just aggregate them all together, we need at least 20 bytes of memory to store any object.

Oh lord this thing is inefficient

For the 32 bit integer “1” we need:

4 bytes for a tag
4 bytes for the data
12 bytes because of how we defined the structure

So per 32 bit integer, 12 bytes of cruft must be added! To put this into perspective, with around 2048 integers in the runtime we waste 24.6 KB of space. 2048 is not a lot of integers for a language runtime to be juggling, so this is very bad. Even worse, in many cases you’ll be allocating the object on the heap so constructing lists is not a heartbreaking process. This means you can have multiple duplicates of the same constant floating around in your runtime (though there are interesting ways to get around this such as a pointer cache). Just insane!

There are two ways I found, both equally hacky and not portable for machines that aren’t particularly modern. But hey, it’s 2023: if you can’t run electron then around 50% of software startups don’t consider you a possible customer, so you may as well assume an ideal world.

[2023-09-10 Sun] Just realised I never described the other way, NaN-boxing. Maybe I’ll try to do that and make a post about it.

Memory alignment

First, an aside on memory alignment. Machines align their memory by how big a word is. A word is a number of bytes, represents the basic building block for the CPU in terms of fetching memory: it always fetches words at a time. A particular consequence is that its a lot of extra work for the CPU to fetch stuff that’s not an integral multiple of the word size, as it must first fetch as many words that contain the data, then apply boxing and unboxing methods to interpret it. Many machines simply don’t allow interactions with memory that is not word aligned (x86 SIMD instructions, for example, assume your memory is word aligned).

[2024-06-04 Tue]: There are many things wrong with this statement; x86 can just handle misaligned data, within a single instruction sometimes. SIMD instructions have aligned and non aligned access versions. There’s a bit of debate around whether misaligned data causes performance issues (apparently it did for older CPUs like in 2008 🤯) but I’d argue it’s best to not forgo word alignment completely. Upshot is that CPU’s are black magic boxes where dragons lie: I can’t definitely state how a CPU actually deals with misaligned data and this poor interpretation comes from my monkey brain.

Ok, so what? Well on 64 bit machines a word is, no surprises, 64 bits or 8 bytes. If you have a pointer to some memory, that memory better be word aligned. If a pointer references a specific byte of memory, and that byte must sit in a word aligned area, that must mean it should be at a multiple of 8 (look below to see why this is a possibly horrible assumption).

Working in binary this means the 3 least significant bits of any pointer must be 0¹:

The last bit must be 0 as otherwise the pointer is odd i.e. not a multiple of 8
The second to last bit must be 0 as otherwise dividing by 2 leads to a 1 in the last bit i.e. not divisible by 4 so not divisible by 8
The third to last bit must be 0 as otherwise dividing by 4 leads to a 1 in the last bit i.e. not divisible by 4

Wait they leave 3 bits for free?

So heap allocated pointer must leave the last 3 bits alone. From another point of view, we can store 8 possible values in the last 3 bits and still have access to the pointer by just switching those bits off. This is the basis of pointer tagging.

In concrete terms:

For heap allocated structures, like a string or list, we can use the last 3 bits to distinguish them without having to use a corresponding type enum like our naive approach. If we want access to the data (a pointer), we just set those 3 bits to 0 and dereference the corresponding pointer.
For immediate types, like integers or booleans, just embed the data directly into the 61 bits, and tag using the 3 least significant bits. To access the data, just shift down by 3 and reinterpret the data.

I think the greatest thing about this the fact that equality for immediate types is dead simple: as they should both be tagged the same anyway we can literally just do an integer comparison of the bits.

Problems with naive pointer tagging and concluding thoughts

Though this is pretty powerful, it’s not particularly portable cos there are so many things that could go wrong:

What if we aren’t on a 64 bit machine?
- Many microprocessors are 16 bit so this can be a valid concern
What if words and the minimum addressable size are the same?
- This leads to no available bits to butcher and use for tags
What if memory doesn’t need to be page aligned?
- [2023-09-10 Sun]: not really a concern; you shouldn’t be dereferencing tagged pointers anyway mate?
What if there’s some ungodly memory manager between the kernel and the user, which does its own work?
- Even if the previous assumptions held, we may still not have enough useless bits to do our job
- [2023-09-10 Sun]: Most tagging software which deals with memory managers (such as in… Windows 🥹) uses the “most significant bits” approach, which only lends credence to my point of lower portability

I’m thinking about instead implementing a virtual machine and a bytecode system. This means I could:

“compile” the language down to some very tiny binary file or stream
“execute” byte-compiled code a bit faster than straight interpretation

[2023-09-10 Sun]: I’ll most likely be doing an update for writing a virtual machine once I actually implement something like it. C is much better at dealing with binary than with characters ("wait, your programming language has UTF-8 support builtin??"), to the point where theoretically reading or writing bytecode would be an fread~/~fwrite away!

General conjecture: if a binary number of n bits is divisible by \(2^{k}\) where \(k<n\) (obviously) then the k least significant bits must be 0. Pretty sure you could prove this directly using modulo laws. ↩︎