Git for Computer Scientists
In simplified form, git object storage is "just" a DAG of objects, with a handful of different types of objects. They are all stored compressed and identified by an SHA-1 hash (that, incidentally, isn't the SHA-1 of the contents of the file they represent, but of their representation in git).
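To make the parenthetical above concrete, here is a minimal sketch of how an object id is derived. The function name git_object_id is made up for this illustration; the header layout (type, a space, the payload length, a NUL byte) is the documented loose-object format.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash an object the way git names it: over a header plus the payload."""
    # The header is "<type> <payload length in bytes>" followed by a NUL byte.
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


content = b"hello world\n"
blob_id = git_object_id("blob", content)        # id of git's representation
plain_sha1 = hashlib.sha1(content).hexdigest()  # SHA-1 of the bare contents

print("git blob id:", blob_id)
print("plain SHA-1:", plain_sha1)
```

The two printed values differ, which is the point. If git is installed, running git hash-object on a file with the same contents should print the same blob id.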
blob: The simplest object, just a bunch of bytes. This is often
a file, but can be a symlink or pretty much anything else. The
object that points to the
blob determines the semantics.
tree: Directories are represented by
tree objects. They refer to
blobs that have the contents of files (filename,
access mode, etc. are all stored in the
tree), and to other
trees for subdirectories.
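As a rough sketch of what that means on disk: a tree's payload is a sequence of entries, each holding a mode, a name, and the raw 20-byte id of the blob or tree it points to. The helper names below are invented for this example, and the sort is simplified (real git orders directory names as if they ended in a slash).

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries):
    """Serialize (mode, name, hex_object_id) tuples into a tree body."""
    body = b""
    # Simplification: plain sort by name, which is only an approximation
    # of git's real entry ordering.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        # Each entry is "<mode> <name>", a NUL byte, then the raw 20-byte id.
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_id)
    return body


readme = object_id("blob", b"hello world\n")
docs = object_id("tree", tree_payload([("100644", "README", readme)]))
root = object_id("tree", tree_payload([
    ("100644", "README", readme),  # regular file, points to a blob
    ("40000", "docs", docs),       # subdirectory, points to another tree
]))
print(root)
```

Note that the filename and mode live in the tree entry, not in the blob: the same blob id is used for README in both directories.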
When a node points to another node in the DAG, it depends on the
other node: it cannot exist without it. Nodes that nothing points
to can be garbage collected with
git gc, or rescued much like
filesystem inodes with no filenames pointing to them with
git fsck --lost-found.
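The idea of "nodes that nothing points to" can be sketched as a reachability walk over a toy DAG. This is only an illustration with made-up node names; the real git gc takes more starting points into account than the single ref assumed here.

```python
def reachable(dag, roots):
    """Return every node that can be reached from the given starting nodes."""
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dag[node])  # follow the "points to" edges
    return seen


# Toy DAG: each node maps to the nodes it points to.
dag = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blobA", "blobB"],
    "tree1": ["blobA"],
    "blobA": [],
    "blobB": [],
    "blobC": [],              # nothing points here
    "commit_x": ["tree1"],    # an abandoned commit, nothing points here either
}
starting_points = ["commit2"]  # assumed: the only thing holding on to the DAG

live = reachable(dag, starting_points)
garbage = set(dag) - live
print(sorted(garbage))
```

Everything in garbage is a candidate for collection; everything in live is kept because something depends on it.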
commit: A commit refers to a
tree that represents the
state of the files at the time of the commit. It also refers to
commits that are its parents. More than one
parent means the commit is a merge, no parents means it is an
initial commit, and interestingly there can be more than one
initial commit; this usually means two separate projects
merged. The body of the
commit object is the commit message.
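A simplified sketch of a commit body shows where the tree, the parents, and the message go. The helper names and the author string are hypothetical, and real commits carry real timestamps and may have extra headers.

```python
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of a tree with no entries


def commit_payload(tree_id, parent_ids, author, message):
    """Build the text body of a commit object (simplified)."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]  # zero, one, or many parents
    lines += [f"author {author}", f"committer {author}"]
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


def parents_of(payload):
    """Read the parent ids back out of a commit body."""
    headers = payload.decode("utf-8").split("\n\n", 1)[0]
    return [line.split(" ", 1)[1] for line in headers.split("\n")
            if line.startswith("parent ")]


def describe(payload):
    count = len(parents_of(payload))
    if count == 0:
        return "initial commit"
    return "merge" if count > 1 else "ordinary commit"


who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity
root = commit_payload(EMPTY_TREE, [], who, "first\n")
merged = commit_payload(EMPTY_TREE, ["1" * 40, "2" * 40], who, "merge two histories\n")
print(describe(root), "/", describe(merged))
```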
refs: References, or heads or branches, are like post-it notes
slapped on a node in the DAG. Whereas the DAG only gets added to
and existing nodes cannot be mutated, the post-its can be moved
around freely. They don't get stored in the history, and they
aren't directly transferred between repositories. They act as sort
of bookmarks, "I'm working here".
git commit adds a node to the DAG and moves the post-it note
for the current branch to this new node.
The HEAD ref is special in that it actually points to another
ref. It is a pointer to the currently active branch. Normal refs
are actually in a namespace
heads/XXX, but you can often skip
the heads/ part.
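The post-it picture can be sketched with two dictionaries: one for the append-only DAG and one for the movable refs, plus a HEAD that names a ref rather than a node. The Repo class and its commit ids are invented for this illustration.

```python
class Repo:
    """Toy model: an append-only DAG plus movable post-it notes."""

    def __init__(self):
        self.parents = {}           # commit id -> list of parent ids (the DAG)
        self.refs = {}              # ref name -> commit id (the post-its)
        self.head = "heads/master"  # HEAD points to a ref, not to a commit
        self._counter = 0

    def commit(self):
        tip = self.refs.get(self.head)          # where the post-it is now
        new_id = f"c{self._counter}"            # made-up id for the sketch
        self._counter += 1
        self.parents[new_id] = [tip] if tip else []  # add a node to the DAG
        self.refs[self.head] = new_id                # move the post-it
        return new_id


repo = Repo()
first = repo.commit()
second = repo.commit()

repo.refs["heads/topic"] = first  # stick a second post-it on an old node
repo.head = "heads/topic"         # make it the active branch
third = repo.commit()

print(repo.refs)
```

Committing on heads/topic moves only that post-it; heads/master stays where it was, and no existing node in the DAG is changed.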
remote refs: Remote references are post-it notes of a different
color. The difference from normal
refs is the different namespace,
and the fact that remote refs are essentially controlled by the
remote server. git fetch updates them.
tag: A tag is both a node in the DAG and a post-it note (of
yet another color). A
tag points to a
commit, and includes
an optional message and a GPG signature.
The post-it is just a fast way to access the tag, and if lost can
be recovered from just the DAG with
git fsck --lost-found.
The nodes in the DAG can be moved from repository to repository, can
be stored in a more efficient form (packs), and unused nodes can be
garbage collected. But in the end, a
git repository is always just
a DAG and post-its.
So, armed with that knowledge of how
git stores the version
history, how do we visualize things like merges, and how does
git differ from tools that try to manage history as linear changes per
branch?
This is the simplest repository. We have
cloned a remote repository
with one commit in it.
Here we have
fetched the remote and received one new commit
from the remote, but have not merged it yet.
The situation after
git merge remotes/MYSERVER/master. As the
merge was a
fast forward (that is, we had no new commits in our
local branch), the only thing that happened was moving our post-it
note and changing the files in our working directory accordingly.
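A fast forward can be sketched as an ancestry check followed by a ref move. The function names and the node ids A and B are assumptions made for this toy model; nothing is added to the DAG.

```python
def is_ancestor(parents, ancestor, descendant):
    """True if ancestor can be reached from descendant via parent pointers."""
    stack, seen = [descendant], set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents[node])
    return False


def fast_forward(parents, refs, local, remote):
    """Move the local post-it if, and only if, no local commits would be lost."""
    if is_ancestor(parents, refs[local], refs[remote]):
        refs[local] = refs[remote]  # only the post-it moves
        return True
    return False


# A was cloned, B was fetched; the local branch has no commits of its own.
parents = {"A": [], "B": ["A"]}
refs = {"heads/master": "A", "remotes/MYSERVER/master": "B"}

did_ff = fast_forward(parents, refs, "heads/master", "remotes/MYSERVER/master")
print(did_ff, refs)
```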
A local git commit and a
git fetch later. We have both a
new local commit and a new remote commit. Clearly, a merge is
needed.
The result of git merge remotes/MYSERVER/master. Because we had
new local commits, this wasn't a
fast forward, but an actual new
commit node was created in the DAG. Note how it has two parent
commits.
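In the same toy model, a real merge is one new node with two parent pointers, plus a move of the local post-it. The ids c1, c2, local, remote and m are made up, and the sketch ignores the actual file contents that a merge has to combine.

```python
def merge(parents, refs, local, remote, new_id):
    """Record a merge: a new node whose parents are both branch tips."""
    parents[new_id] = [refs[local], refs[remote]]
    refs[local] = new_id  # only the local post-it moves
    return new_id


def ancestors(parents, start):
    """Every node reachable from start (start included) via parent pointers."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


# Both sides have added one commit on top of c2.
parents = {"c1": [], "c2": ["c1"], "local": ["c2"], "remote": ["c2"]}
refs = {"heads/master": "local", "remotes/MYSERVER/master": "remote"}

merge(parents, refs, "heads/master", "remotes/MYSERVER/master", "m")
history = ancestors(parents, refs["heads/master"])
print(sorted(history))
```

After the merge, both lines of work are part of the history of the local branch, while the remote post-it has not moved.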
Here's what the DAG will look like after a few commits on both branches
and another merge. See the "stitching" pattern emerge? The
DAG records exactly what the history of actions taken was.
The "stitching" pattern is somewhat tedious to read. If you have not
yet published your branch, or have clearly communicated that others
should not base their work on it, you have an alternative. You can
rebase your branch, where instead of merging, your commit is
replaced by another commit with a different parent, and your branch
is moved there.
Your old commit(s) will remain in the DAG until garbage collected. Ignore them for now, but just know there's a way out if you screwed up totally. If you have extra post-its pointing to your old commit, they will remain pointing to it, and keep your old commit alive indefinitely. That can be fairly confusing, though.
Don't rebase branches that others have created new commits on top of. Recovering from that is possible and not hard, but the extra work needed can be frustrating.
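Rebasing can be sketched in the same toy model: each of your commits gets a replacement node with a different parent, the post-it moves to the last replacement, and the old nodes stay in the DAG. The function and the primed ids are invented for this example, and it assumes a simple linear chain of local commits above a known base.

```python
def rebase(parents, refs, branch, upstream, base):
    """Replay the commits after base on branch onto the tip of upstream."""
    # Walk from the branch tip down to (not including) the base.
    # Assumes a linear chain, so every node on the way has one parent.
    chain = []
    node = refs[branch]
    while node != base:
        chain.append(node)
        node = parents[node][0]

    new_parent = refs[upstream]
    for old in reversed(chain):      # oldest first
        new = old + "'"              # a replacement commit, with a new id
        parents[new] = [new_parent]  # same change, different parent
        new_parent = new
    refs[branch] = new_parent        # move the post-it to the new chain
    return chain


# Two local commits (L1, L2) and one remote commit (R), all based on A.
parents = {"A": [], "R": ["A"], "L1": ["A"], "L2": ["L1"]}
refs = {"heads/master": "L2", "remotes/MYSERVER/master": "R"}

replaced = rebase(parents, refs, "heads/master", "remotes/MYSERVER/master", "A")
print(refs["heads/master"], parents["L2'"], parents["L1'"])
```

The old commits L1 and L2 are still in parents after the call. Nothing points to them any more, so they are the unreachable nodes that garbage collection would eventually remove.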
The situation after garbage collecting (or just ignoring the
unreachable commit), and creating a new commit on top of your
rebased branch.
rebase also knows how to rebase multiple commits with one
command.
That's the end of our brief intro to
git for people who are not
intimidated by computer science. Hope it helped!