Change files

When you create a new Pijul repository, a .pijul folder is created with two subfolders: “changes” and “pristine”. We’ll set the “pristine” folder aside for now and come back to it later. In this write-up we’ll focus on the structure of the change files stored in the changes directory.

Structure of a .change file

The overall format of a .change file is something like this:

┌────────────────┐
│ offsets        │ fixed 56 bytes
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ hashed parts   │ = zstd_compress(bincode(hashed))
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ unhashed parts │ = zstd_compress(bincode(unhashed | [])) - may be zero
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ contents       │ = zstd_compress(contents)
└────────────────┘

The first part contains offsets into the rest of the file. The remaining three segments (hashed, unhashed, and contents) are all compressed with the seekable variant of zstd (https://github.com/facebook/zstd/tree/dev/contrib/seekable_format). We will refer to this as zstdseek.

The offsets section also provides the uncompressed sizes of the three compressed segments. This allows allocating sufficient buffer space for zstd to decompress a segment into.
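To make the layout concrete, here is a minimal sketch that turns the header into one byte range per segment. It assumes that the unhashed_off and contents_off fields of the header (described in the next section) are absolute file offsets and that total is the size of the whole file; that is an assumption, not something the description above confirms. The struct segment type and the segments_from_offsets name are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSETS_SIZE 56u	/* fixed size of the offsets header */

struct offsets {
	uint64_t version;
	uint64_t hashed_len;
	uint64_t unhashed_off;
	uint64_t unhashed_len;
	uint64_t contents_off;
	uint64_t contents_len;
	uint64_t total;
};

/* Hypothetical helper type: where a compressed segment sits in the file. */
struct segment {
	uint64_t file_off;	/* first byte of the compressed data */
	uint64_t file_len;	/* compressed size in bytes */
	uint64_t plain_len;	/* uncompressed size, i.e. the buffer to allocate */
};

/*
 * ASSUMPTION: *_off are absolute file offsets and total is the file size.
 * Fills seg[0..2] with hashed, unhashed, contents.
 * Returns 0 on success, -1 if the offsets are out of order.
 */
int segments_from_offsets(const struct offsets *o, struct segment seg[3])
{
	assert(o != NULL && seg != NULL);
	if (o->unhashed_off < OFFSETS_SIZE ||
	    o->contents_off < o->unhashed_off ||
	    o->total < o->contents_off)
		return -1;

	seg[0] = (struct segment){ OFFSETS_SIZE,
		o->unhashed_off - OFFSETS_SIZE, o->hashed_len };
	seg[1] = (struct segment){ o->unhashed_off,
		o->contents_off - o->unhashed_off, o->unhashed_len };
	seg[2] = (struct segment){ o->contents_off,
		o->total - o->contents_off, o->contents_len };
	return 0;
}
```

A reader would then read file_len bytes starting at file_off and hand them to zstd together with a destination buffer of plain_len bytes. An empty unhashed segment simply comes out with a file_len of zero.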

Offsets

The offsets “header” section of a change file can be represented as the following struct:

#include <stdint.h>
struct offsets {
	uint64_t version;
	uint64_t hashed_len;	/* length of the hashed contents */
	uint64_t unhashed_off;
	uint64_t unhashed_len;	/* length of unhashed contents */
	uint64_t contents_off;
	uint64_t contents_len;
	uint64_t total;
};

Seven fields of eight bytes each add up to 56 bytes. But how are they actually written to the file? The answer is: bincode.

Bincode

An interesting feature of Pijul is that everything it serializes to disk is serialized with bincode (https://github.com/bincode-org/bincode). What stands out about bincode is that it appears to be Rust-only; its specification is a Markdown document in the source code repo.

Reading the specification, it appears to focus on cutting as much fluff as possible and simply producing and consuming byte streams. This means a struct with seven u64 members is stored as nothing but the values of those seven members, one after the other.

The primary thing to keep in mind is that values must be encoded and decoded in little-endian byte order.
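As an illustration, the 56-byte header can be decoded by hand without any bincode library. This is a minimal sketch, assuming each u64 is stored as eight fixed-width little-endian bytes (which is what the 56-byte total implies); the names read_u64_le and offsets_decode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct offsets {
	uint64_t version;
	uint64_t hashed_len;
	uint64_t unhashed_off;
	uint64_t unhashed_len;
	uint64_t contents_off;
	uint64_t contents_len;
	uint64_t total;
};

/* Read a u64 stored least-significant byte first, on any host endianness. */
static uint64_t read_u64_le(const uint8_t *p)
{
	uint64_t v = 0;
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Decode the header from buf. Returns 0 on success, -1 if buf is too short. */
int offsets_decode(const uint8_t *buf, size_t len, struct offsets *out)
{
	assert(buf != NULL && out != NULL);
	if (len < 56)
		return -1;
	out->version      = read_u64_le(buf + 0);
	out->hashed_len   = read_u64_le(buf + 8);
	out->unhashed_off = read_u64_le(buf + 16);
	out->unhashed_len = read_u64_le(buf + 24);
	out->contents_off = read_u64_le(buf + 32);
	out->contents_len = read_u64_le(buf + 40);
	out->total        = read_u64_le(buf + 48);
	return 0;
}
```

Assembling each value byte by byte, rather than casting the buffer to a struct pointer, keeps the code correct on big-endian hosts and avoids alignment problems.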

All the segments of a change file are bincode-encoded except the contents section which is already just an array of bytes.

Hashed

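The hashed segment is the most structured part of a change file. In a similar vein to Git, it gathers all the bits that should be hashed. This includes the format version, the change header, the dependencies, the extra known context changes, application-specific metadata, the changes themselves, and the hash of the contents.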

Notably the contents themselves are not in the hashed structure, only the hash of the contents.

In C, the hashed structure looks something like this:

struct hashed {
	uint64_t version;
	struct change_header header;
	struct hash *dependencies;	/* Vec<Hash> */
	struct hash *extra_known;	/* extra known "context" changes (recovery from deleted contexts) */
	uint8_t *metadata;		/* space for application-specific data */
	struct hunk *changes;		/* Vec<Hunk>, "Hunk" being a generic argument */
	struct hash contents_hash;	/* hash of the contents */
};

The hash is computed over the bincode-encoded form of the hashed struct, something like this: hash = blake3(bincode(hashed))
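The C struct above uses plain pointers for the Vec fields, but on disk a sequence has to carry its own length. Here is a minimal sketch of how a byte vector such as metadata would be laid out, assuming the same fixed-width little-endian integers as the header, with the element count written as a u64 in front of the elements. The names write_u64_le and encode_bytes are hypothetical, and whether Pijul configures bincode exactly this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write v into p[0..7], least-significant byte first. */
static void write_u64_le(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/*
 * Encode a byte vector as: u64 length (little-endian), then the raw bytes.
 * ASSUMPTION: fixed-width integer encoding, as implied by the 56-byte header.
 * Returns the number of bytes written, or 0 if out is too small.
 */
size_t encode_bytes(uint8_t *out, size_t cap, const uint8_t *data, size_t n)
{
	assert(out != NULL && (data != NULL || n == 0));
	if (cap < 8 || cap - 8 < n)
		return 0;
	write_u64_le(out, (uint64_t)n);
	if (n > 0)
		memcpy(out + 8, data, n);
	return 8 + n;
}
```

Under this scheme an empty vector still takes eight bytes (a length of zero), and three bytes of metadata take eleven.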

Unhashed

I haven’t looked into the unhashed segment yet, but supposedly it’s for putting stuff associated with a change that you don’t want hashed.

Contents

Just an array of bytes by the looks of it; I have yet to figure out what it actually holds.