2. Objects


The .git/ directory of a git repo contains a few things to represent the state of the repo. One of the things it stores is objects. Objects come in a few different types:

  • blob - the contents of a file
  • tree - a hierarchical collection of trees and blobs
  • commit - a snapshot of the overall repo tree at a particular point in time
  • tag - a named pointer to a particular commit

Objects are stored in the .git/objects/ directory using an approach called content-addressable IDs.

Format of an object

The uncompressed format of an object on disk is:

<kind> <len><null><data>

where kind is one of the object types above (e.g. blob), len is the number of bytes in the data section written as a decimal number, null is the ASCII NUL char \0, and data is the contents of the object as bytes.

For a simple blob, the contents might be:

blob 5\0hello

Storing the len of the object in the header assists in reading out the object contents, and also allows for a simple integrity check.
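As a rough illustration of the format above, here is a minimal sketch in Python (chosen for brevity; this is not grit's code). The function name encode_object is hypothetical, and the sketch assumes the kind is a plain ASCII string.

```python
def encode_object(kind: str, data: bytes) -> bytes:
    """Build the uncompressed representation: <kind> <len><null><data>."""
    # len counts the bytes of the data section only, not the header.
    header = f"{kind} {len(data)}".encode("ascii")
    return header + b"\x00" + data


if __name__ == "__main__":
    print(encode_object("blob", b"hello"))  # b'blob 5\x00hello'
```

Note that len is measured in bytes, so for text containing non-ASCII characters it can differ from the number of characters.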

ID of an object

The ID of an object is the SHA-1 hash of the uncompressed object representation above, including the kind/len header. This is what “content-addressable ID” means: the ID of an object can be derived from its contents. This makes it easy to check whether an object is already stored in the git repo, which in turn makes it easy to avoid storing duplicate data.
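The ID computation can be sketched in a few lines of Python using the standard hashlib module. This is an illustration rather than grit's implementation, and the name object_id is made up for the example.

```python
import hashlib


def object_id(kind: str, data: bytes) -> str:
    """Return the 40-char hex SHA-1 of the header plus the data."""
    # The hash covers the whole representation, header included,
    # so a blob and a tree with identical data get different IDs.
    store = f"{kind} {len(data)}".encode("ascii") + b"\x00" + data
    return hashlib.sha1(store).hexdigest()


if __name__ == "__main__":
    # An empty blob: the representation is just b"blob 0\x00".
    print(object_id("blob", b""))
```

For an empty blob this yields e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is the same ID git reports for an empty file.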

Path to an object

The path to an object on disk is computed as follows:

  1. the “fanout” parent directory of the object is the hex representation of the first byte of the object ID, which is 2 hex chars
  2. the file name is the hex representation of the remaining 19 bytes of the object ID, which is 38 hex chars

So if an object has the ID:

c1f5bb10a67aa1da2f4f89ef14dadd36e5fb7d66

its path on disk would be:

.git/objects/c1/f5bb10a67aa1da2f4f89ef14dadd36e5fb7d66
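The same split can be expressed as a short Python sketch. The function name object_path and the choice to pass the .git directory as a parameter are assumptions made for the example, not something taken from grit.

```python
import string
from pathlib import Path


def object_path(git_dir: Path, oid: str) -> Path:
    """Map a 40-char hex object ID to its location under objects/."""
    if len(oid) != 40 or not set(oid) <= set(string.hexdigits):
        raise ValueError(f"not a 40-char hex object ID: {oid!r}")
    # First byte (2 hex chars) is the fanout directory,
    # the remaining 19 bytes (38 hex chars) are the file name.
    return git_dir / "objects" / oid[:2] / oid[2:]


if __name__ == "__main__":
    print(object_path(Path(".git"), "c1f5bb10a67aa1da2f4f89ef14dadd36e5fb7d66"))
```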

Writing an object

To write an object to the .git/objects store, the approach is:

  1. compute the object representation (see “Format of an object” above)
  2. compute the ID of the object (see “ID of an object” above) and, from it, the path of the object (see “Path to an object” above)
  3. if the object already exists, stop here
  4. compress the object representation from (1) using zlib compression
  5. write the compressed bytes of the object to the path from (2)

Stopping early at (3) avoids unnecessary work. It is safe because the path is derived from the ID, and the ID is derived from the contents: two objects with the same kind and contents are guaranteed to have the same ID, and therefore the same location on disk.
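The write steps above can be sketched in Python with the standard hashlib and zlib modules. This is a simplified illustration, not grit's code: write_object is a hypothetical name, and the sketch writes the file directly, whereas a more careful implementation would probably write to a temporary file and rename it into place.

```python
import hashlib
import zlib
from pathlib import Path


def write_object(git_dir: Path, kind: str, data: bytes) -> str:
    """Store an object under git_dir/objects and return its hex ID."""
    # (1) uncompressed representation: <kind> <len><null><data>
    store = f"{kind} {len(data)}".encode("ascii") + b"\x00" + data
    # (2) ID, then fanout directory + file name
    oid = hashlib.sha1(store).hexdigest()
    path = git_dir / "objects" / oid[:2] / oid[2:]
    # (3) same ID means same contents, so there is nothing to do
    if path.exists():
        return oid
    # (4) + (5) compress and write
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return oid


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        print(write_object(Path(tmp), "blob", b"hello"))
```

Calling write_object twice with the same kind and data returns the same ID and leaves a single file on disk.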

Reading an object

To read an object from the .git/objects store, given its ID, the approach is:

  1. read the contents of the file at the path of the object (see “Path to an object” above)
  2. decompress the file contents using zlib decompression
  3. detect the type of the object using the “kind” from the header (see “Format of an object” above)
  4. read the contents of the object from the “data” section of the decompressed object
  5. check that the number of bytes read in (4) matches the len from the header, as an integrity check
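The read steps above can be sketched the same way in Python. Again this is an illustration under assumptions, not grit's implementation: read_object is a hypothetical name, and errors are reported with a plain ValueError for simplicity.

```python
import zlib
from pathlib import Path
from typing import Tuple


def read_object(git_dir: Path, oid: str) -> Tuple[str, bytes]:
    """Load an object by hex ID and return (kind, data)."""
    # (1) + (2) read the file and decompress it
    path = git_dir / "objects" / oid[:2] / oid[2:]
    raw = zlib.decompress(path.read_bytes())
    # (3) the header ends at the first NUL: b"<kind> <len>"
    header, sep, data = raw.partition(b"\x00")
    if not sep:
        raise ValueError("object header is missing its NUL terminator")
    kind, _, size = header.partition(b" ")
    # (4) + (5) everything after the NUL is data; check its length
    if len(data) != int(size.decode("ascii")):
        raise ValueError("data length does not match the header len")
    return kind.decode("ascii"), data
```

A stricter reader could also recompute the SHA-1 of the decompressed bytes and compare it with the requested ID. That check goes beyond the steps listed above and is left out of the sketch.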

Doing this in grit

The basic logic to cover object reading and writing is available in grit, along with some documentation on how to test it.

The next steps will be to expose some commands in grit to make it easier to test this functionality.