2. Objects
The .git/ directory of a git repo contains a few
things to represent the state of the repo.
One of the things it stores is objects. Objects
come in a few different types:
- blob: a file
- tree: a hierarchical collection of trees and blobs
- commit: a snapshot of the overall repo tree at a particular point in time
- tag: a named pointer to a particular commit
Objects are stored in the .git/objects/ dir using
an approach called content-addressable IDs.
Format of an object
The uncompressed format of an object on disk is:
<kind> <len><null><data>
where kind is one of the object types above
(e.g. blob), len is the number of bytes in the
data section written as a decimal number, null is
the ASCII NUL char \0, and data is the contents of
the object as bytes.
For a simple blob, the contents might be:
blob 5\0hello
Storing the len of the object in the header
assists in reading out the object contents,
and also allows for a simple integrity check.
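As a rough illustration of the format above, here is a minimal Python sketch that builds the uncompressed representation. The function name encode_object is hypothetical and is not taken from grit or git.

```python
def encode_object(kind: str, data: bytes) -> bytes:
    """Build the uncompressed representation: <kind> <len><null><data>."""
    # len is the number of bytes in the data section, written in decimal
    header = f"{kind} {len(data)}".encode("ascii")
    return header + b"\x00" + data


print(encode_object("blob", b"hello"))
```

The length is counted in bytes, not characters, so text should be encoded to bytes before it is passed in.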
ID of an object
The ID of an object is computed by calculating the SHA-1 hash of the object representation above, including the kind/len header. This is what “content-addressable ID” means: the ID of an object can be derived from its contents. This makes it easy to check whether an object is already stored in the git repo, which in turn makes it easy to avoid storing duplicate data.
Path to an object
The path to an object on disk is computed as follows:
- the “fanout” parent directory of the object is the hex representation of the first byte of the object ID, which is 2 hex chars
- the file name is the hex representation of the remaining 19 bytes of the object ID, which is 38 hex chars
So if an object has the ID:
c1f5bb10a67aa1da2f4f89ef14dadd36e5fb7d66
its path on disk would be
.git/objects/c1/f5bb10a67aa1da2f4f89ef14dadd36e5fb7d66
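The path rule can be sketched in a few lines of Python. The function name object_path and the default objects_dir argument are assumptions made for illustration.

```python
import string
from pathlib import PurePosixPath


def object_path(oid: str, objects_dir: str = ".git/objects") -> PurePosixPath:
    """Split a 40-char hex ID into a 2-char fanout dir and a 38-char file name."""
    if len(oid) != 40 or any(c not in string.hexdigits for c in oid):
        raise ValueError(f"not a 40-char hex object ID: {oid!r}")
    return PurePosixPath(objects_dir) / oid[:2] / oid[2:]


print(object_path("c1f5bb10a67aa1da2f4f89ef14dadd36e5fb7d66"))
# prints .git/objects/c1/f5bb10a67aa1da2f4f89ef14dadd36e5fb7d66
```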
Writing an object
To write an object to the .git/objects store,
the approach is:
1. compute the object representation (see “Format of an object” above)
2. compute the path of the object (see “Path to an object” above)
3. if the object already exists, stop here
4. compress the object representation from (1) using zlib compression
5. write the compressed bytes of the object to the path from (2)
By stopping early at (3), we avoid unnecessary work. This is possible because two objects with the same contents are guaranteed to have the same ID, and therefore the same location on disk, thanks to the content-addressable ID approach.
Reading an object
To read an object from the .git/objects store,
given its ID, the approach is:
1. read the contents of the file at the path of the object (see “Path to an object” above)
2. decompress the file contents using zlib decompression
3. detect the type of the object using the “kind” from the header (see “Format of an object” above)
4. read the contents of the object from the “data” section of the decompressed object
5. check that the number of bytes read in (4) matches the len from the header, as an integrity check
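The read steps above can be sketched in Python as follows. The name read_object and the choice to raise ValueError on a failed check are assumptions for illustration, not grit's actual behaviour.

```python
import zlib
from pathlib import Path


def read_object(objects_dir: Path, oid: str) -> tuple[str, bytes]:
    """Load the object with the given hex ID and return (kind, data)."""
    # (1) read the file at the object's path
    compressed = (objects_dir / oid[:2] / oid[2:]).read_bytes()
    # (2) decompress it with zlib
    raw = zlib.decompress(compressed)
    # The header ends at the first NUL byte; everything after it is data.
    header, sep, data = raw.partition(b"\x00")
    if not sep:
        raise ValueError("object has no NUL-terminated header")
    # (3) the kind is the first field of the header
    kind, _, length = header.partition(b" ")
    # (4) and (5) the data must be exactly as long as the header says
    if int(length) != len(data):
        raise ValueError(f"header says {int(length)} bytes, found {len(data)}")
    return kind.decode("ascii"), data
```

A stricter reader could also re-hash the decompressed bytes and compare the result with the requested ID.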
Doing this in grit
The basic logic to cover object reading and writing is available in grit, along with some documentation on how to test it.
- the grit object module
- integration tests comparing the grit behaviour with git
- docs explaining how to run grit locally
The next steps will be to expose some commands in grit to make it easier to test this functionality.