Change files

Posted on 2022-08-20

When you create a new Pijul repository, a .pijul folder is created with two subfolders: "change" and "pristine".

“change” appears to be organized much like Git’s “objects” folder in that the changes are referenced by a hash. A hash like U6TQX5Z2NF6GX3SRLUBQGCZ7WAXNYMWWZ2YMADUSG4EWVKNV2BIAC is stored in the file: .pijul/change/U6/TQX5Z2NF6GX3SRLUBQGCZ7WAXNYMWWZ2YMADUSG4EWVKNV2BIAC.change.
“pristine” contains only… a “db” file?
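
The mapping from hash to file name can be sketched in a few lines of C. This is only an illustration of the layout described above: change_path() is a hypothetical helper of my own, not something Pijul provides, and it assumes the hash is already in its textual form.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the path of a change file from its textual hash.
 * change_path() is a hypothetical helper, not part of Pijul.
 * Returns 0 on success, -1 if the hash is too short or out is too small.
 */
static int change_path(const char *hash, char *out, size_t outlen)
{
	int n;

	if (strlen(hash) < 3)
		return -1;
	/* the first two characters name the directory, the rest names the file */
	n = snprintf(out, outlen, ".pijul/change/%.2s/%s.change", hash, hash + 2);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

Calling it with the hash above and a large enough buffer fills the buffer with the path shown in the first bullet.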

We’ll forget about the “pristine” folder for now and get back to that later. In this write-up we’ll focus on the structure of change files.

Structure of a .change file

The overall format of a .change file is something like this:


    ┌────────────────┐
    │ offsets        │ fixed 56 bytes
    ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
    │ hashed parts   │ = zstd_compress(bincode(hashed))
    ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
    │ unhashed parts │ = zstd_compress(bincode(unhashed | [])) - may be zero
    ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
    │ contents       │ = zstd_compress(contents)
    └────────────────┘

The first part contains information about offsets into the rest of the file. The remaining three segments (hashed, unhashed, and contents) are all compressed with a seekable variant of zstd: https://github.com/facebook/zstd/tree/dev/contrib/seekable_format. We will refer to this as zstdseek.

The offsets section also provides the uncompressed sizes of the three compressed segments. This allows allocating sufficient buffer space for zstd to decompress a segment into.

Offsets

The offsets “header” section of a change file can be represented as the following struct:

#include <stdint.h>
struct offsets {
	uint64_t version;
	uint64_t hashed_len;	/* length of the hashed contents */
	uint64_t unhashed_off;
	uint64_t unhashed_len;	/* length of unhashed contents */
	uint64_t contents_off;
	uint64_t contents_len;
	uint64_t total;
};

With seven components each eight bytes in size we get: 56 bytes. But how are they actually written to the file? The answer is: bincode.
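
Before getting to the encoding, here is a rough sketch of how the offsets could be used to locate the three compressed segments. It assumes (I have not confirmed this) that the segments are stored back to back in the order of the diagram, that the hashed segment starts right after the 56-byte header, and that total is the size of the whole file. struct extent and segment_extents() are names made up for this example.

```c
#include <stdint.h>

#define OFFSETS_SIZE 56	/* seven u64 fields */

struct offsets {
	uint64_t version;
	uint64_t hashed_len;	/* uncompressed length of the hashed part */
	uint64_t unhashed_off;
	uint64_t unhashed_len;	/* uncompressed length of the unhashed part */
	uint64_t contents_off;
	uint64_t contents_len;	/* uncompressed length of the contents */
	uint64_t total;
};

/* byte range [start, end) of one compressed segment (hypothetical type) */
struct extent {
	uint64_t start;
	uint64_t end;
};

/*
 * Assumption: segments are stored back to back in diagram order, so each
 * one ends where the next begins. Returns -1 if the offsets are inconsistent.
 */
static int segment_extents(const struct offsets *o, struct extent *hashed,
    struct extent *unhashed, struct extent *contents)
{
	if (o->unhashed_off < OFFSETS_SIZE || o->contents_off < o->unhashed_off ||
	    o->total < o->contents_off)
		return -1;
	hashed->start = OFFSETS_SIZE;
	hashed->end = o->unhashed_off;
	unhashed->start = o->unhashed_off;
	unhashed->end = o->contents_off;	/* empty when there is no unhashed part */
	contents->start = o->contents_off;
	contents->end = o->total;
	return 0;
}
```

The *_len fields are not used here because they describe the uncompressed sizes; they only matter once a segment is handed to the decompressor.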

Bincode

An interesting feature of Pijul is that everything that’s serialized to disk is serialized with bincode (https://github.com/bincode-org/bincode). What’s interesting about bincode is that it appears to be Rust-only. The specification is a Markdown document in the source code repo.

Judging by the specification, bincode focuses on cutting as much fluff as possible and just producing and consuming byte streams. This means a struct with seven u64 members is encoded as nothing but the values of those seven members, one after the other.

The primary thing to keep in mind is that values must be encoded and decoded in little-endian byte order.
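
As a minimal sketch of what that means for the offsets header, the following C reads seven fixed-width little-endian u64 values out of a 56-byte buffer. The helper names read_u64_le() and decode_offsets() are my own, and the code assumes every field really is stored as eight bytes, which is what the fixed 56-byte size suggests.

```c
#include <stddef.h>
#include <stdint.h>

struct offsets {
	uint64_t version;
	uint64_t hashed_len;
	uint64_t unhashed_off;
	uint64_t unhashed_len;
	uint64_t contents_off;
	uint64_t contents_len;
	uint64_t total;
};

/* read one little-endian u64, independent of the host byte order */
static uint64_t read_u64_le(const uint8_t *buf)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | buf[i];
	return v;
}

/* decode the 56-byte header; returns -1 if the buffer is too short */
static int decode_offsets(const uint8_t *buf, size_t len, struct offsets *o)
{
	if (len < 56)
		return -1;
	o->version = read_u64_le(buf + 0);
	o->hashed_len = read_u64_le(buf + 8);
	o->unhashed_off = read_u64_le(buf + 16);
	o->unhashed_len = read_u64_le(buf + 24);
	o->contents_off = read_u64_le(buf + 32);
	o->contents_len = read_u64_le(buf + 40);
	o->total = read_u64_le(buf + 48);
	return 0;
}
```

Reading byte by byte rather than casting the buffer to a struct keeps the sketch correct on big-endian hosts and avoids alignment problems.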

All the segments of a change file are bincode-encoded except the contents section, which is already just an array of bytes.

Hashed

The hashed contents is the most structured part of a change file. In a similar vein as Git, it gathers all the bits that should be hashed. This includes:

A “version” field - currently always the value 6 (u64)
A change header struct (more on that later)
A list of dependencies, just the hashes
Another list of hashes - extra known “context” changes (assists with recovery from deleted contexts)
The hash of the contents

Notably the contents themselves are not in the hashed structure, only the hash of the contents.

In C, the hashed structure looks something like this:

struct hashed {
	uint64_t version;
	struct change_header header;
	struct hash *dependencies;	/* Vec<Hash> */
	struct hash *extra_known;	/* extra known "context" changes (recovery from deleted contexts) */
	uint8_t *metadata;		/* space for application-specific data */
	struct hunk *changes;		/* Vec<Hunk>, "Hunk" being a generic argument; struct hunk is a placeholder */
	struct hash contents_hash;	/* hash of the contents */
};

Computing the hash happens on the bincode-encoded form of the hashed struct, something like this: hash = blake3(bincode(hashed))
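
To give a feel for the bytes that end up being hashed, here is a sketch of how a variable-length member such as metadata might be flattened. It assumes the layout I read out of the bincode specification: a u64 element count followed by the elements, with the count written as eight little-endian bytes. write_u64_le() and encode_bytes() are hypothetical helpers; the blake3 step itself is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* write v as eight little-endian bytes */
static void write_u64_le(uint8_t *out, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Encode a byte vector as it is assumed to be laid out for a Vec<u8>:
 * a u64 length, then the bytes themselves.
 * Returns the number of bytes written, or 0 if out is too small.
 */
static size_t encode_bytes(uint8_t *out, size_t outlen, const uint8_t *data, size_t n)
{
	if (outlen < 8 || outlen - 8 < n)
		return 0;
	write_u64_le(out, (uint64_t)n);
	if (n > 0)
		memcpy(out + 8, data, n);
	return 8 + n;
}
```

An empty vector still costs eight bytes under this assumption, since the length is always written.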

Unhashed

I haven’t looked into the unhashed segment yet, but supposedly it’s for putting stuff associated with a change that you don’t want hashed.

It looks like just an array of bytes; I have yet to figure out what it actually holds.
