Change files


When you create a new Pijul repository, a .pijul folder is created with two subfolders: “changes” and “pristine”.

We’ll forget about the “pristine” folder for now and get back to that later. In this write-up we’ll focus on the structure of change files.

Structure of a .change file

The overall format of a .change file is something like this:


    ┌────────────────┐
    │ offsets        │ fixed 56 bytes
    ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
    │ hashed parts   │ = zstd_compress(bincode(hashed))
    ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
    │ unhashed parts │ = zstd_compress(bincode(unhashed | [])) - may be zero
    ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
    │ contents       │ = zstd_compress(contents)
    └────────────────┘

The first part contains information about offsets into the rest of the file. The remaining three segments (hashed, unhashed, and contents) are all compressed with a seekable variant of zstd (https://github.com/facebook/zstd/tree/dev/contrib/seekable_format). We will refer to this as zstdseek.

The offsets section also provides the uncompressed sizes of the three compressed segments. This allows allocating sufficient buffer space for zstd to decompress a segment into.

Offsets

The offsets “header” section of a change file can be represented as the following struct:

#include <stdint.h>
struct offsets {
	uint64_t version;
	uint64_t hashed_len;	/* uncompressed length of the hashed segment */
	uint64_t unhashed_off;
	uint64_t unhashed_len;	/* uncompressed length of the unhashed segment */
	uint64_t contents_off;
	uint64_t contents_len;	/* uncompressed length of the contents segment */
	uint64_t total;
};

Seven fields of eight bytes each give us 56 bytes. But how are they actually written to the file? The answer is: bincode.
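Before getting to the encoding, here is a minimal sketch of how the header fields might be used once they have been read. It assumes the three compressed segments are stored back to back right after the 56-byte header, so that each one ends where the next begins. That layout is my reading of the diagram above, not something verified against the Pijul source. The names segment_range, uncompressed_len, struct range and enum segment are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSETS_SIZE 56u	/* seven u64 fields, eight bytes each */

struct offsets {
	uint64_t version;
	uint64_t hashed_len;
	uint64_t unhashed_off;
	uint64_t unhashed_len;
	uint64_t contents_off;
	uint64_t contents_len;
	uint64_t total;
};

/* Check the 7 * 8 = 56 arithmetic at compile time. */
_Static_assert(sizeof(struct offsets) == OFFSETS_SIZE, "offsets must be 56 bytes");

enum segment { SEG_HASHED, SEG_UNHASHED, SEG_CONTENTS };

/* Half-open byte range [start, end) of a compressed segment in the file. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * ASSUMPTION: segments are contiguous, in the order hashed, unhashed,
 * contents, and `total` is the end of the contents segment.
 */
static struct range
segment_range(const struct offsets *o, enum segment s)
{
	struct range r = { 0, 0 };

	assert(o != NULL);
	switch (s) {
	case SEG_HASHED:
		r.start = OFFSETS_SIZE;
		r.end = o->unhashed_off;
		break;
	case SEG_UNHASHED:
		r.start = o->unhashed_off;
		r.end = o->contents_off;	/* empty range if there is no unhashed part */
		break;
	case SEG_CONTENTS:
		r.start = o->contents_off;
		r.end = o->total;
		break;
	}
	return r;
}

/* Size of the buffer needed to decompress a segment into. */
static uint64_t
uncompressed_len(const struct offsets *o, enum segment s)
{
	assert(o != NULL);
	switch (s) {
	case SEG_HASHED:
		return o->hashed_len;
	case SEG_UNHASHED:
		return o->unhashed_len;
	case SEG_CONTENTS:
		return o->contents_len;
	}
	return 0;
}
```

A reader would seek to segment_range(...).start, read end - start compressed bytes, and hand zstd an output buffer of uncompressed_len(...) bytes.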

Bincode

An interesting feature of Pijul is that everything that’s serialized to disk is serialized with bincode (https://github.com/bincode-org/bincode). What’s interesting about bincode is that it appears to be Rust-only. The specification is a Markdown document in the source code repo.

Going by the specification, the format focuses on cutting as much fluff as possible and simply producing and consuming byte streams. This means a struct with seven u64 members is encoded as nothing more than the values of those seven members, one after the other.

The primary thing to keep in mind is that values must be encoded and decoded in little-endian byte order.
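As a rough sketch of what that means for the offsets header, the following C code writes and reads the seven u64 fields as 56 little-endian bytes, without relying on the byte order of the host. The function names (put_u64_le, get_u64_le, encode_offsets, decode_offsets) are my own, and the sketch assumes the fields appear in the file in the same order as in the struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSETS_SIZE 56

struct offsets {
	uint64_t version;
	uint64_t hashed_len;
	uint64_t unhashed_off;
	uint64_t unhashed_len;
	uint64_t contents_off;
	uint64_t contents_len;
	uint64_t total;
};

/* Store v in p[0..8), least significant byte first. */
static void
put_u64_le(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Load a u64 from p[0..8), least significant byte first. */
static uint64_t
get_u64_le(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* Write the seven fields back to back: no tags, no padding, no names. */
static void
encode_offsets(const struct offsets *o, uint8_t out[OFFSETS_SIZE])
{
	assert(o != NULL && out != NULL);
	const uint64_t fields[7] = {
		o->version, o->hashed_len, o->unhashed_off, o->unhashed_len,
		o->contents_off, o->contents_len, o->total,
	};

	for (size_t i = 0; i < 7; i++)
		put_u64_le(out + 8 * i, fields[i]);
}

/* Read the seven fields in the same order they were written. */
static void
decode_offsets(const uint8_t in[OFFSETS_SIZE], struct offsets *o)
{
	assert(in != NULL && o != NULL);
	o->version      = get_u64_le(in + 0);
	o->hashed_len   = get_u64_le(in + 8);
	o->unhashed_off = get_u64_le(in + 16);
	o->unhashed_len = get_u64_le(in + 24);
	o->contents_off = get_u64_le(in + 32);
	o->contents_len = get_u64_le(in + 40);
	o->total        = get_u64_le(in + 48);
}
```

Shifting byte by byte instead of casting the buffer to a struct keeps the code correct on big-endian hosts and avoids alignment problems.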

All the segments of a change file are bincode-encoded except the contents section which is already just an array of bytes.
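Fixed-size integers are the easy case. For variable-length fields, such as a Rust Vec of bytes, bincode's fixed-width integer encoding writes the element count as a u64 followed by the elements themselves. The sketch below assumes that Pijul uses this encoding, which the 56-byte header suggests but which I have not confirmed. The function names encode_bytes and decode_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode a byte vector as a u64 little-endian length followed by the raw
 * bytes. Returns the number of bytes written, or 0 if `out` is too small.
 * ASSUMPTION: lengths use bincode's fixed-width (8-byte) integer encoding.
 */
static size_t
encode_bytes(const uint8_t *data, uint64_t len, uint8_t *out, size_t cap)
{
	assert(out != NULL);
	if (cap < 8 || len > cap - 8)
		return 0;
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)(len >> (8 * i));
	if (len > 0)
		memcpy(out + 8, data, (size_t)len);
	return 8 + (size_t)len;
}

/*
 * Decode a length-prefixed byte vector. On success, stores the length in
 * *len and returns a pointer to the payload inside `in`. Returns NULL if
 * the input is truncated.
 */
static const uint8_t *
decode_bytes(const uint8_t *in, size_t avail, uint64_t *len)
{
	uint64_t n = 0;

	assert(in != NULL && len != NULL);
	if (avail < 8)
		return NULL;
	for (int i = 0; i < 8; i++)
		n |= (uint64_t)in[i] << (8 * i);
	if (n > avail - 8)
		return NULL;
	*len = n;
	return in + 8;
}
```

Under this assumption an empty vector still costs eight bytes: a length of zero and no payload.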

Hashed

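The hashed segment is the most structured part of a change file. In a similar vein to Git, it gathers all the bits that should be hashed. This includes a version number, the change header, the dependencies, any extra known context changes, application-specific metadata, the changes (hunks) themselves, and the hash of the contents.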

Notably the contents themselves are not in the hashed structure, only the hash of the contents.

In C, the hashed structure looks something like this:

struct hashed {
	uint64_t version;
	struct change_header header;
	struct hash *dependencies;	/* Vec<Hash> */
	struct hash *extra_known;	/* extra known "context" changes (recovery from deleted contexts) */
	uint8_t *metadata;		/* space for application-specific data */
	struct hunk *changes;		/* Vec<Hunk>, "Hunk" being a generic argument; C type is a placeholder */
	struct hash contents_hash;	/* hash of the contents */
};

Computing the hash happens on the bincode-encoded form of the hashed struct, something like this: hash = blake3(bincode(hashed))

Unhashed

I haven’t looked into the unhashed segment yet, but supposedly it’s for putting stuff associated with a change that you don’t want hashed.

Contents

Just an array of bytes by the looks of it, I have yet to figure out what it actually holds.

