Change files
When you create a new Pijul repository, a .pijul folder is created with two subfolders: “change” and “pristine”.
- “change” appears to be organized much like Git’s “objects” folder in that the changes are referenced by a hash. A hash like U6TQX5Z2NF6GX3SRLUBQGCZ7WAXNYMWWZ2YMADUSG4EWVKNV2BIAC is stored in the file .pijul/change/U6/TQX5Z2NF6GX3SRLUBQGCZ7WAXNYMWWZ2YMADUSG4EWVKNV2BIAC.change.
- “pristine” contains only… a “db” file?
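The hash-to-path mapping described above can be sketched in a few lines of C. This is only an illustration of the observed layout (first two characters as a subfolder, the rest as the file name); change_path is a hypothetical helper, not part of Pijul.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the on-disk path for a change hash, following the layout observed
 * above: the first two characters of the hash name a subfolder, the rest
 * names the file. change_path() is a hypothetical helper, not Pijul API.
 * Returns 0 on success, -1 if the hash is too short or buf is too small. */
static int change_path(const char *hash, char *buf, size_t buflen)
{
    assert(hash != NULL && buf != NULL);
    if (strlen(hash) < 3)
        return -1;
    int n = snprintf(buf, buflen, ".pijul/change/%.2s/%s.change",
                     hash, hash + 2);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Calling it with the hash from the example and a large enough buffer yields the path shown in the list.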
We’ll forget about the “pristine” folder for now and get back to that later. In this write-up we’ll focus on the structure of change files.
Structure of a .change file
The overall format of a .change file is something like this:
┌────────────────┐
│ offsets        │ fixed 56 bytes
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ hashed parts   │ = zstd_compress(bincode(hashed))
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ unhashed parts │ = zstd_compress(bincode(unhashed | [])) - may be zero
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ contents       │ = zstd_compress(contents)
└────────────────┘
The first part contains information about offsets into the rest of the file. The remaining three segments (hashed, unhashed, and contents) are all compressed with a seekable variant of zstd (https://github.com/facebook/zstd/tree/dev/contrib/seekable_format), which we will refer to as zstdseek.
The offsets section also provides the uncompressed sizes of the three compressed segments. This allows allocating sufficient buffer space for zstd to decompress a segment into.
Offsets
The offsets “header” section of a change file can be represented as the following struct:
#include <stdint.h>

struct offsets {
    uint64_t version;
    uint64_t hashed_len;   /* length of the hashed contents */
    uint64_t unhashed_off; /* offset of the unhashed segment */
    uint64_t unhashed_len; /* length of unhashed contents */
    uint64_t contents_off; /* offset of the contents segment */
    uint64_t contents_len; /* length of the contents */
    uint64_t total;
};
With seven components each eight bytes in size we get: 56 bytes. But how are they actually written to the file? The answer is: bincode.
Bincode
An interesting feature of Pijul is that everything that’s serialized to disk is serialized with bincode (https://github.com/bincode-org/bincode). Notably, bincode appears to be Rust-only. The specification is a Markdown document in the source code repo.
Reading the specification, bincode appears to focus on cutting as much fluff as possible and just producing and consuming byte streams. This means a struct with seven u64 members is encoded as nothing more than the values of those seven members, one after the other.
The primary thing to keep in mind is that values must be encoded and decoded in little-endian byte order.
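To make that concrete, here is a minimal sketch of decoding the 56-byte offsets header by hand: seven little-endian u64 values read one after the other. It repeats the struct from the Offsets section so that it stands alone; read_u64_le and decode_offsets are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct offsets {
    uint64_t version;
    uint64_t hashed_len;
    uint64_t unhashed_off;
    uint64_t unhashed_len;
    uint64_t contents_off;
    uint64_t contents_len;
    uint64_t total;
};

/* Read one u64 stored in little-endian byte order, independent of the
 * byte order of the machine running this code. */
static uint64_t read_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Decode the header from the first 56 bytes of a change file.
 * Returns 0 on success, -1 if fewer than 56 bytes are available. */
static int decode_offsets(const uint8_t *buf, size_t len, struct offsets *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 7 * sizeof(uint64_t))
        return -1;
    out->version      = read_u64_le(buf + 0);
    out->hashed_len   = read_u64_le(buf + 8);
    out->unhashed_off = read_u64_le(buf + 16);
    out->unhashed_len = read_u64_le(buf + 24);
    out->contents_off = read_u64_le(buf + 32);
    out->contents_len = read_u64_le(buf + 40);
    out->total        = read_u64_le(buf + 48);
    return 0;
}
```

Reading byte by byte instead of casting the buffer to a struct pointer avoids both alignment problems and any dependence on the host's endianness.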
All the segments of a change file are bincode-encoded except the contents section which is already just an array of bytes.
Hashed
The hashed segment is the most structured part of a change file. In a similar vein to Git, it gathers all the bits that should be hashed. This includes:
- A “version” field - currently always the value 6 (u64)
- A change header struct (more on that later)
- A list of dependencies, just the hashes
- Another list of hashes - extra known “context” changes (assists with recovery from deleted contexts)
- The hash of the contents
Notably the contents themselves are not in the hashed structure, only the hash of the contents.
In C, the hashed structure looks something like this:
struct hashed {
    uint64_t version;
    struct change_header header; /* defined elsewhere (more on that later) */
    struct hash *dependencies;   /* Vec<Hash> */
    struct hash *extra_known;    /* extra known "context" changes (recovery from deleted contexts) */
    uint8_t *metadata;           /* space for application-specific data */
    struct hunk *changes;        /* Vec<Hunk>, "Hunk" being a generic argument; struct hunk is a placeholder name */
    struct hash contents_hash;   /* hash of the contents */
};
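One detail the C pointers above hide is how a variable-length list ends up in the byte stream. Per the bincode specification, a collection is written as its length followed by its elements. Assuming the fixed-width integer configuration that the 56-byte header implies, the length is a little-endian u64. The sketch below encodes a plain byte vector that way; encode_byte_vec is a hypothetical helper, and whether every list in a change file is encoded exactly like this has not been verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write one u64 in little-endian byte order. */
static void write_u64_le(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Encode a byte vector the way bincode encodes a Vec<u8> under the
 * ASSUMED fixed-width integer configuration: an 8-byte little-endian
 * length, then the raw bytes. Returns the number of bytes written,
 * or 0 if `out` is too small. */
static size_t encode_byte_vec(const uint8_t *data, size_t n,
                              uint8_t *out, size_t cap)
{
    assert(out != NULL && (data != NULL || n == 0));
    if (cap < 8 || cap - 8 < n)
        return 0;
    write_u64_le(out, (uint64_t)n);
    if (n > 0)
        memcpy(out + 8, data, n);
    return 8 + n;
}
```

An empty vector therefore still costs eight bytes: the length prefix with value zero.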
Computing the hash happens on the bincode-encoded form of the hashed struct, something like this: hash = blake3(bincode(hashed))
Unhashed
I haven’t looked into the unhashed segment yet, but supposedly it’s for putting stuff associated with a change that you don’t want hashed.
Contents
Just an array of bytes by the looks of it; I have yet to figure out what it actually holds.