HPR episode: About git
This is more or less the transcript of my HPR submission about git, a version control system.
I have used other version control systems in the past: cvs (long ago) and subversion, but today git is by far my favourite. This is why:
You can commit new revisions when your internet connection is down.
You can easily prevent arbitrary developers from committing to the master branch or to release branches.
You can try out experimental things locally, committing changes, without having to create a branch in a central repository.
Creating feature branches, and switching between branches is easy, even without a working internet connection.
Git has way more features than subversion or cvs.
Linus says: 'If you are using subversion, you are stupid and ugly.' ;-)
Yet there are problems when moving from subversion to git
Git works in a very different way than subversion. It requires some effort to fully understand what it does.
The easiest way to work with git (imo) is with the command line. For some users, this is a drawback.
I moved a project from subversion and trac to git and redmine, about half a year ago. I could use an existing server with git, gitolite and redmine, so I didn't have to bother about setting those things up. The conversion from subversion to git went pretty smoothly. We started using git, and it kind of worked, although we were not quite sure why. I guess this is a common problem for new git users.
I aim to explain the things that I wish I had known when starting with git. There are now some oddities in our repository, which could probably have been avoided if I had known what I was doing.
I only discuss the concepts of using git. You will not find specific git commands here. This is intentional. Possibly I will publish a more practical follow-up later on.
I gloss over some of the technical details, just to limit the length of this introduction.
Git is distributed
Git is distributed. Theoretically there is no central code repository, and every developer has his own local copy of the entire repository. If you want a copy for yourself, you can just clone an existing one.
Your own repository typically contains references to remote repositories. At the moment you clone a repository, a reference to the original is kept, which is usually called ``origin``.
A git repository is basically nothing more than a directed acyclic graph of commits. A commit represents a specific revision of your source code. Each commit is identified by a SHA-1 hash, a checksum that is unique for all practical purposes.
The hash is 40 characters long, which is tedious to type. If there is no ambiguity, you can refer to a commit using just the first few characters of the hash. Usually 5 or 6 characters are sufficient.
Each commit (except for the initial one) has a reference to one or two parents. If you pick a random commit, you can find its entire history following the parent links to the very beginning.
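To make the idea of hashes and parent links a bit more tangible, here is a minimal sketch in plain Python. It is a toy model, not git's real object format: the ``store`` dictionary and the ``commit``, ``resolve`` and ``history`` helpers are names I made up for illustration, and a real commit hash covers more than a message and the parent ids.

```python
import hashlib

# Toy object store mapping a commit id (SHA-1 hex digest) to its data.
# Simplified: real git hashes a richer commit object (tree, author, date).
store = {}

def commit(message, parents=()):
    """Store a commit and return its id, derived from message and parents."""
    payload = message + "\n" + "\n".join(parents)
    commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    store[commit_id] = {"message": message, "parents": list(parents)}
    return commit_id

def resolve(prefix):
    """Expand an abbreviated id, refusing ambiguous or unknown prefixes."""
    matches = [cid for cid in store if cid.startswith(prefix)]
    if len(matches) != 1:
        raise KeyError(f"{prefix!r} matches {len(matches)} commits")
    return matches[0]

def history(commit_id):
    """Return the messages of all commits reachable through parent links."""
    seen, todo = [], [commit_id]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.append(current)
            todo.extend(store[current]["parents"])
    return [store[cid]["message"] for cid in seen]

c1 = commit("C1")
c2 = commit("C2", [c1])
c3 = commit("C3", [c2])

full_id = resolve(c3[:10])   # an abbreviated id is enough to find C3
log = history(full_id)       # follows the parent links back to C1
```

Because the parent ids are part of what gets hashed, changing an old commit would change the id of every commit that descends from it.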
In the simplest case, a git repository is just a sequence of commits, where each commit has at most one parent and at most one child. The history is, so to speak, one straight line:
C1 <- C2 <- C3 <- C4
``C1`` is the initial commit. In order not to overload the diagrams, I will from now on omit the "arrows" of the parent relation (``<-``). By convention I put the parent on the left, and the children on the right.
Generally, it is not the case that all the revisions in a git repository nicely follow one after the other. In the case of branching a commit typically has several children. In a merge operation a commit can have two parents. But more on that later.
Example of a more complex repository:
    C1 -- C2 -- C3 -- C4 -- C5 -- D6 -- D7
          |                       |
          +--- D3 -- D4 -- D5 ----+
          |
          |          +--- F5 -- F6
          |          |
          +--- E3 -- E4 -- E5
While programming, there is always one commit checked out. This commit is called the ``HEAD`` of your repository. The current source code (called the working copy) corresponds to the code of ``HEAD``, with a certain number of modifications you made. Files can be added, deleted, or changed.
If you want to commit a new revision of your code, you need to inform git about the changes in the working copy that should be included in the new commit. Git knows how your version of the code differs from the code in the last commit, but it does not include all changes in a new commit by default. If you want a change to be included in the next commit, you should explicitly add it to the 'index': this is called 'staging'. All staged changes will be part of the next commit. Changes you did not stage stay as they are in your working copy, but they are kept out of the commit.
                  HEAD
                   ↓
    C1 - C2 - C3 - C4 - staged changes - working copy
... after committing:
                           HEAD
                            ↓
    C1 -- C2 -- C3 -- C4 -- C5 -- working copy
When a new revision of your code is committed, this commit becomes the
HEAD of the repository.
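The role of the index can be sketched with three dictionaries. This is only a rough model under my own assumptions: a snapshot maps file names to contents, the file names are made up, and deleted files are not handled.

```python
# Toy model: a snapshot is a dict mapping file names to contents.
head = {"README": "hello", "main.py": "print(1)"}            # last commit
working_copy = {"README": "hello world", "main.py": "print(2)"}

# The index starts out identical to HEAD.
index = dict(head)

def stage(path):
    """Copy the working-copy version of one file into the index."""
    index[path] = working_copy[path]

def commit():
    """The new commit is a snapshot of the index, not of the working copy."""
    return dict(index)

stage("README")        # only the README change is staged
new_head = commit()

# Files that still differ from the new HEAD are the unstaged changes.
unstaged = [path for path in working_copy
            if working_copy[path] != new_head.get(path)]
```

After the commit, the change to ``main.py`` is still sitting in the working copy, exactly as described above.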
When git repositories talk to each other, commits are moved from one repository to the other. After some time you end up with a lot of commits, and it becomes difficult to find your way. Branches are the answer to this problem.
You might know the concept of branches from other source control systems. In the easiest case, your code history is one straight line, one commit after the other, from the initial commit to ``HEAD``. This way, there is only one branch in your source repository.
It is possible, however, that at some point development is done in parallel. After a certain commit, say ``C``, developer A adds new commits ``A1``, ``A2`` and ``A3``, while developer B ignores these commits and adds other commits ``B1``, ``B2`` and ``B3`` after ``C``. Now there are two branches, in which the code is diverging. These branches may or may not be merged again at some point in the future, but more about that later.
    X1 -- X2 -- X3 -- C -- A1 -- A2 -- A3
                      |
                      +--- B1 -- B2 -- B3
Technically, a git branch is nothing more than a pointer to a particular
commit in your repository. Just like
HEAD, as a matter of fact. A
branch is pointing to its most recent commit.
If you take two random branches in your repository, you can always find a commit where they diverged. You start from the commits the branches are pointing to, and then keep following the parent links. At some point, you will find a common ancestor, and this is the commit you are looking for.
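Finding the point where two branches diverged can be sketched as follows. The ``parents`` table mirrors the small example with the ``A`` and ``B`` commits; the function names are mine, and the sketch assumes there is a single best common ancestor (histories with criss-cross merges can have more than one).

```python
# Parent links for: X1 -- X2 -- X3 -- C, then A1..A3 and B1..B3 after C.
parents = {
    "X1": [], "X2": ["X1"], "X3": ["X2"], "C": ["X3"],
    "A1": ["C"], "A2": ["A1"], "A3": ["A2"],
    "B1": ["C"], "B2": ["B1"], "B3": ["B2"],
}

def ancestors(commit):
    """Return the commit itself plus everything reachable via parent links."""
    seen, todo = set(), [commit]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(parents[current])
    return seen

def common_ancestor(a, b):
    """Return the most recent commit that both a and b descend from."""
    shared = ancestors(a) & ancestors(b)
    # The most recent shared commit has all other shared commits as ancestors.
    return max(shared, key=lambda commit: len(ancestors(commit)))

diverged_at = common_ancestor("A3", "B3")
```

Here ``diverged_at`` is ``"C"``, the commit after which developers A and B went their separate ways.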
Just as with any other version control system, there is typically one branch 'checked out'. This is the branch you are working on. ``HEAD`` is pointing to the same commit as the checked out branch, and when committing a new revision, the branch pointer will move along with ``HEAD`` to this new commit.
Adding new branches is very easy: you just add a new pointer to the repository.
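A sketch of branches as plain pointers, under the simplifying assumption that ``HEAD`` always names the checked-out branch. The helper names and the branch name ``feature`` are invented for this example.

```python
parents = {"C1": [], "C2": ["C1"], "C3": ["C2"]}
branches = {"master": "C3"}   # a branch is just a name pointing at a commit
head = "master"               # HEAD follows the checked-out branch

def create_branch(name):
    """A new branch is a new pointer to the current commit; nothing is copied."""
    branches[name] = branches[head]

def checkout(name):
    global head
    head = name

def commit(new_id):
    """Add a commit on top of the current one and move the branch pointer."""
    parents[new_id] = [branches[head]]
    branches[head] = new_id

create_branch("feature")
checkout("feature")
commit("D4")
```

Only the checked-out branch moves: ``feature`` now points to ``D4``, while ``master`` still points to ``C3``.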
You can name branches as you like, but typically there is one branch called ``master``. ``master`` is pointing to the 'mainline', the most up-to-date development revision.
                    master
                      ↓   HEAD
    C1 -- C2 -- C3 -- C4  branch2
          |                ↓
          +--- D3 -- D4 -- D5
          |
          |          +--- F5 -- F6
          |          |          ↑
          +--- E3 -- E4      branch4
                     ↑
                  branch3
Branches in your own copy of the repository, are called local branches. Git is also aware of branches in remote repositories: remote branches. When you 'fetch' a remote branch, git downloads all necessary commits to your repository, and puts a pointer to the commit corresponding with the remote branch.
    (remote repo: origin)
                    master
                      ↓
    C1 -- C2 -- C3 -- C4      branch2
          |                      ↓
          +--- D3 -- D4 -- D5 -- D6
          |
          |          +--- F5 -- F6
          |          |          ↑
          +--- E3 -- E4      branch4
                     ↑
                  branch3

    (local repo)
                           HEAD
                          master
                            ↓
    C1 -- C2 -- C3 -- C4 -- C5
After fetching ``origin/branch4``:
    (local repo)
                           HEAD
                          master
                            ↓
    C1 -- C2 -- C3 -- C4 -- C5
          |
          +--- E3 -- E4 -- F5 -- F6
                                 ↑
                           origin/branch4
You cannot directly add commits to a remote branch. Typically you first fetch the remote branch, link it to a local branch, and commit new revisions to the local branch. Such a local branch that is linked to a remote branch is called a '(remote) tracking branch'.
If you are working in a tracking branch, git knows where the original is. This makes it easy to download the latest commits in the remote branch, and git will inform you about the differences between the remote branch and your associated tracking branch.
Back to the previous example. If you have a local branch ``branch4`` checked out, which is set up to track ``origin/branch4``, the situation is as follows:
                          master
                            ↓
    C1 -- C2 -- C3 -- C4 -- C5
          |
          +--- E3 -- E4 -- F5 -- F6
                                 ↑
                           origin/branch4
                              branch4
                                HEAD
A tracking branch behaves just like an ordinary local branch. If it is checked out, and you create a new commit, the branch will move along with ``HEAD``:
                          master
                            ↓               HEAD
    C1 -- C2 -- C3 -- C4 -- C5            branch4
          |                                  ↓
          +--- E3 -- E4 -- F5 -- F6 -- F7 -- F8
                                 ↑
                           origin/branch4
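How git can tell you the difference between a tracking branch and its remote branch can be sketched by comparing the sets of reachable commits. The repository below is hypothetical (a branch called ``topic`` with made-up commit names), and ``ahead_behind`` is my own helper, not a git function.

```python
parents = {
    "C1": [], "C2": ["C1"],
    "L3": ["C2"], "L4": ["L3"],   # commits that only exist locally
    "R3": ["C2"],                  # commit that was fetched from the remote
}
refs = {"topic": "L4", "origin/topic": "R3"}

def reachable(tip):
    """Return all commits reachable from tip through parent links."""
    seen, todo = set(), [tip]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(parents[current])
    return seen

def ahead_behind(local, remote):
    """Count commits only in the local branch and only in the remote one."""
    mine, theirs = reachable(refs[local]), reachable(refs[remote])
    return len(mine - theirs), len(theirs - mine)

ahead, behind = ahead_behind("topic", "origin/topic")
```

In this example the tracking branch is two commits ahead of and one commit behind its remote counterpart.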
Suppose you have 2 branches, let's say ``A`` and ``B``, which originate from a common ancestor commit ``C``. Merging branch ``B`` into branch ``A`` means incorporating into ``A`` all changes between ``C`` and ``B``.
In the simplest case, branch ``A`` itself is an ancestor of branch ``B``. So when working on branch ``A``, you created a new branch ``B``, to which you added some commits. (The common ancestor ``C`` is just the last commit of branch ``A``.)
In this case, git will just move the pointer of ``A``, so that it points to the same commit as ``B``. This kind of merge is called a "fast forward merge", an important concept in the world of git. A fast forward merge is a merge operation which comes down to moving the pointer of the branch into which you are merging.
                     HEAD
                      A
                      ↓
    C1 -- C2 -- C3 -- C4
                      |
                      +--- D5 -- D6
                                 ↑
                                 B
After merge of ``B`` into ``A``:
    C1 -- C2 -- C3 -- C4
                      |
                      +--- D5 -- D6
                                 ↑
                                A B
                                HEAD
(Note that this graph is isomorphic to a straight line.)
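Whether a merge can be a fast forward boils down to an ancestry check. The sketch below uses the commit names from the example; ``is_ancestor`` and ``merge`` are hypothetical helpers, and the non-fast-forward case is only reported, not carried out.

```python
parents = {
    "C1": [], "C2": ["C1"], "C3": ["C2"], "C4": ["C3"],
    "D5": ["C4"], "D6": ["D5"],
}
branches = {"A": "C4", "B": "D6"}

def is_ancestor(older, newer):
    """True if older can be reached from newer by following parent links."""
    seen, todo = set(), [newer]
    while todo:
        current = todo.pop()
        if current == older:
            return True
        if current not in seen:
            seen.add(current)
            todo.extend(parents[current])
    return False

def merge(target, source):
    """Fast-forward target to source if possible; report what happened."""
    if is_ancestor(branches[source], branches[target]):
        return "already up to date"
    if is_ancestor(branches[target], branches[source]):
        branches[target] = branches[source]   # only a pointer moves
        return "fast forward"
    return "merge commit needed"

result = merge("A", "B")
```

No new commit is created here: after the call, ``A`` and ``B`` simply point to the same commit ``D6``.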
A fast forward merge is not always possible. If ``A`` and ``B`` have both diverged from their common ancestor ``C``, simply moving a pointer does not work.
                           HEAD
                            A
                            ↓
    C1 -- C2 -- C3 -- C4 -- C5
                      |
                      +--- D5 -- D6
                                 ↑
                                 B
In this case, when merging ``B`` into ``A``, the changes between the common ancestor and branch ``B`` are applied to branch ``A``. If this doesn't cause any trouble (lucky you), git will create a new commit on ``A``, containing the changes in ``B``.
The example below shows how ``B`` will be merged into ``A``.
                                     HEAD
                                      A
                                      ↓
    C1 -- C2 -- C3 -- C4 -- C5 ------ C6
                      |               |
                      +--- D5 -- D6 --+
                                 ↑
                                 B
If both branches modify the same part of your code, you cannot just apply the changes from one branch to the other. If this happens, git marks the conflicts, and does not commit the result of the merge operation. You first have to resolve the conflicts, before you commit.
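The idea of comparing both branches against their common ancestor can be sketched as a three-way merge. Note the big simplification: this toy version compares whole files, whereas git looks at changed regions inside files. The function name and the file names are made up.

```python
def merge_snapshots(base, ours, theirs):
    """Three-way merge of snapshots (dicts mapping file name to content).

    Returns the merged snapshot and a list of conflicting files.
    A value of None stands for "file does not exist".
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (or neither changed the file)
            result = o
        elif o == b:      # only their side changed the file
            result = t
        elif t == b:      # only our side changed the file
            result = o
        else:             # both sides changed it differently
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts

base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1", "c.txt": "2"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "3"}

merged, conflicts = merge_snapshots(base, ours, theirs)
```

The changes to ``a.txt`` and ``b.txt`` combine without trouble, while ``c.txt`` was changed on both sides and ends up in ``conflicts``, waiting to be resolved by hand.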
That's it about merging. Merging comes down to integrating changes from one branch into another branch in the same repository. Now we will consider push and pull operations, which are about moving changes between repositories.
Pull and push
Suppose you have checked out a remote tracking branch, and you want to apply the latest commits of the remote branch to your tracking branch locally. This is called a pull operation. Git fetches the current state of the remote branch, together with all necessary commits, and merges it into the tracking branch in your repository.
For example: When ``remote/branch1`` pointed to ``C3``, you made a remote tracking branch. Since then, commit ``C4`` was added to the remote repository, while you added ``C4'``, ``C5'`` and ``C6'`` to your local repository.
    (origin)
                   branch1
                      ↓
    C1 -- C2 -- C3 -- C4

    (local)
                                   HEAD
                                 branch1 (tracks remote/branch1)
                                    ↓
    C1 -- C2 -- C3 -- C4' -- C5' -- C6'
                ↑
          remote/branch1
After fetching ``remote/branch1``, the local repo looks as follows:
                                   HEAD
                                 branch1
                                    ↓
    C1 -- C2 -- C3 -- C4' -- C5' -- C6'
                |
                +--- C4
                     ↑
               remote/branch1
After merging ``remote/branch1`` into ``branch1``, which completes the pull:
                                          HEAD
                                        branch1
                                           ↓
    C1 -- C2 -- C3 -- C4' -- C5' -- C6' -- C7'
                |                          |
                +--- C4 -------------------+
                     ↑
               remote/branch1
As with any other merge, it could be that this causes conflicts, which you'll have to resolve.
Conversely, you can push the commits in a local branch to a branch in a remote repository. This can be either a new remote branch or an existing one. Git will upload the most recent commit of the local branch, together with all ancestor commits needed to link it to the existing remote commits. This way you create a new remote branch; if there was no existing branch, you are done.
If the remote branch you were pushing to already existed, the newly created branch will be merged into the existing branch. But in most configurations this only works if this merge operation is a fast forward merge, which is the case if no commits were added to the remote branch after your last pull. If a fast forward merge is not possible, you will get an error message.
To resolve this, you first fetch the remote branch, and merge it locally with your local tracking branch. (Which is in fact a pull operation.) This operation results in a new local commit with the latest commit from the remote repository as one of its parents. So if you push your branch again, it will be fast forward merged into the remote repository without a problem.
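The fast-forward rule for pushing can be sketched like this, using the commit names from the pull example. The ``push`` function and the ``PushRejected`` exception are my own inventions for illustration; a real server applies its own configuration.

```python
class PushRejected(Exception):
    """Raised when a push would not be a fast forward on the remote."""

def reachable(parents, tip):
    """Return all commits reachable from tip through parent links."""
    seen, todo = set(), [tip]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(parents[current])
    return seen

def push(local, remote, remote_branches, branch, new_tip):
    """Upload new_tip with its ancestors, then move the remote branch."""
    old_tip = remote_branches.get(branch)
    uploaded = reachable(local, new_tip)
    # Fast forward only: the old remote tip must be an ancestor of new_tip.
    if old_tip is not None and old_tip not in uploaded:
        raise PushRejected(f"{branch}: fetch and merge first")
    for commit in uploaded:
        remote.setdefault(commit, local[commit])
    remote_branches[branch] = new_tip

remote = {"C1": [], "C2": ["C1"], "C3": ["C2"], "C4": ["C3"]}
remote_branches = {"branch1": "C4"}
# Local repository after fetching C4, but before merging it.
local = {"C1": [], "C2": ["C1"], "C3": ["C2"], "C4": ["C3"],
         "C4'": ["C3"], "C5'": ["C4'"], "C6'": ["C5'"]}

try:
    push(local, remote, remote_branches, "branch1", "C6'")
    first_attempt = "accepted"
except PushRejected:
    first_attempt = "rejected"

local["C7'"] = ["C6'", "C4"]     # the merge commit created by the pull
push(local, remote, remote_branches, "branch1", "C7'")
```

The first push is rejected because ``C4`` is not part of the history of ``C6'``. Once the merge commit ``C7'`` has ``C4`` as one of its parents, the same push goes through.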
When branches diverge, merging is one way to get them together again. A typical use case of merging, is the resynchronisation of the same branches in different repositories, as we've encountered in the discussion of push and pull operations.
There is however another way to integrate changes from one branch into another: rebasing.
Suppose you have two branches, let's say ``A`` and ``B``, with a common ancestor ``C``. Rebasing ``B`` onto ``A`` can be seen as taking branch ``B`` from the point where it diverged from ``A``, tearing it off, taking it away, and reattaching it to the current commit of branch ``A``.
Let's look at this in more detail. You created a branch ``B``, which diverged from branch ``A``. New commits were added to ``B``, but to ``A`` as well.
When you rebase your branch ``B`` onto ``A``, git searches for the commit where the branches diverged, which is ``C``.
Now git will iterate over the commits from ``C`` to ``B``, and determine the changes that have been applied to the source code between each pair of consecutive commits. Then git starts a new branch on ``A``, and creates similar commits there by replaying the same changes.
It is possible that conflicts occur, in particular if the same code was changed in branches ``A`` and ``B``. If so, you will have to resolve these conflicts before the rebase process can continue. When all commits of ``B`` are recreated on top of ``A``, the new branch will take the place of the original ``B``.
The overall result will be that the changes which were developed in parallel on branches ``A`` and ``B`` now appear to be serial changes: those of ``A`` first, then those of ``B``.
Visually: after commit ``C4`` in the ``A``-branch, you created a new branch ``B``. You added commits ``D5`` and ``D6`` to this new branch.
                                        A
                                        ↓
    C1 -- C2 -- C3 -- C4 -- C5 -- C6 -- C7
                      |
                      +--- D5 -- D6
                                 ↑
                                 B
                                HEAD
Meanwhile, new commits were added to the ``A``-branch. Now you want the changes from the ``B``-branch to be applied to the current state of ``A`` (``C7``): rebasing branch ``B`` onto ``A``.
Git searches for the point where both branches diverged, in this example, ``C4``. Now, the changes needed to transform ``C4`` to ``D5``, will be applied to ``C7``, and committed (``D5'``). Next, the changes for the transition from ``D5`` to ``D6`` are applied, to create the next commit (``D6'``). The result is as follows:
                                        A
                                        ↓
    C1 -- C2 -- C3 -- C4 -- C5 -- C6 -- C7
                                        |
                                        +--- D5' -- D6'
                                                    ↑
                                                    B
                                                   HEAD
One should be careful with rebasing. You should only rebase branches that nobody else is supposed to be tracking. Rebasing changes the history of a branch. So if a colleague wants to push/pull commits to/from a branch you rebased, you will probably end up in a lot of trouble.
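The replaying of commits can be sketched as follows. This is a toy model under several assumptions of mine: every commit has a single parent and stores its change as a small dictionary, the history starts at ``C4`` for brevity, the file names are invented, and conflicts are ignored.

```python
commits = {
    "C4": {"parent": None, "change": {"app.py": "v1"}},
    "C5": {"parent": "C4", "change": {"app.py": "v2"}},
    "C6": {"parent": "C5", "change": {"app.py": "v3"}},
    "C7": {"parent": "C6", "change": {"app.py": "v4"}},
    "D5": {"parent": "C4", "change": {"feature.py": "draft"}},
    "D6": {"parent": "D5", "change": {"feature.py": "done"}},
}

def chain(tip):
    """List the commits from the first one up to tip, oldest first."""
    result = []
    while tip is not None:
        result.append(tip)
        tip = commits[tip]["parent"]
    return result[::-1]

def rebase(branch_tip, onto):
    """Replay the commits that are only on branch_tip on top of onto."""
    upstream = set(chain(onto))
    to_replay = [c for c in chain(branch_tip) if c not in upstream]
    new_tip = onto
    for old in to_replay:
        new_id = old + "'"    # a new commit with the same change
        commits[new_id] = {"parent": new_tip,
                           "change": commits[old]["change"]}
        new_tip = new_id
    return new_tip

def snapshot(tip):
    """Apply all changes in order to get the files at a given commit."""
    files = {}
    for commit in chain(tip):
        files.update(commits[commit]["change"])
    return files

branch_b = rebase("D6", "C7")
```

Afterwards ``branch_b`` is ``"D6'"``, whose history runs through ``C7``. The original ``D5`` and ``D6`` still exist in the toy store, but no branch points to them any more, which is exactly why rebasing a branch that others are tracking causes trouble.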
There are many ways to organise your work with git. At the moment, I usually work as follows:
The ``master`` branch contains the latest relevant code. It may contain experimental features, but the idea is that the code in ``master`` compiles and works.
Every time you want to implement a new feature, you create a feature branch.
                    master
                      ↓
    C1 -- C2 -- C3 -- C4
                      |
                      +--- D5 -- D6 -- D7
                                       ↑
                                    feature1
                                      HEAD
In a feature branch, you can commit non-functional or even broken code.
This is not a problem; only the code in
master is expected to work.
Now suppose you are working on a new feature, but meanwhile a bug has been reported, which urgently needs a fix. In that case you can rather easily switch back to ``master``, and create a new bugfix branch. The changes you made in your half-finished feature branch will cause no problems.
                    master
                      ↓
    C1 -- C2 -- C3 -- C4
                      |
                      +--- D5 -- D6 -- D7
                      |                ↑
                      +--- E5       feature1
                           ↑
                         bugfix
                          HEAD
When your bugfix is ready, and nothing has changed on ``master``, you can easily fast-forward merge the bugfix branch into the ``master`` branch.
                           HEAD
                          bugfix
                          master
                            ↓
    C1 -- C2 -- C3 -- C4 -- E5
                      |
                      +--- D5 -- D6 -- D7
                                       ↑
                                    feature1
After merging, the bugfix branch is of no more interest; this pointer can be removed. You can check out your feature branch again, and continue to work on the feature.
At some point, hopefully, your feature implementation is ready, and has to be merged into ``master``. A fast forward merge is impossible now, because the bugfix created new commits in the master branch. To avoid clutter in the history of your code, it is useful to rebase your feature branch onto ``master`` before merging.
Rebase ``feature1`` onto ``master``:
                          master
                            ↓
    C1 -- C2 -- C3 -- C4 -- E5
                            |
                            +--- D5' -- D6' -- D7'
                                               ↑
                                            feature1
                                              HEAD
After that you can fast forward merge:
                                               HEAD
                                             feature1
                                              master
                                                ↓
    C1 -- C2 -- C3 -- C4 -- E5 -- D5' -- D6' -- D7'
A feature branch is typically a branch on which you work alone; chances are high that no one else is tracking it. So rebasing is no problem. Because of the rebase operation, your feature seems to be completely developed after the bugfix, which results in a cleaner history of your project's code.
If you had just merged your feature branch without rebasing, you would end up with a commit with two parents, which would just make things more complicated than they should be.
If a new release of your project is approaching, you typically create a release branch from ``master``:
                  release-1
                    master
                      ↓
    C1 -- C2 -- C3 -- C4
There are probably a number of bugs that still need to be fixed before release. Meanwhile, the normal development of new features can continue on ``master``:
                  release-1     master
                      ↓           ↓
    C1 -- C2 -- C3 -- C4 -- C5 -- C6
Suppose you have a release-critical bug to fix. Then you fix that bug in the release branch.
                                master
                                  ↓
    C1 -- C2 -- C3 -- C4 -- C5 -- C6
                      |
                      +--- D5
                           ↑
                       release-1
However, you probably also want to apply the bugfix on the ``master`` branch. At this point rebasing the release branch onto ``master`` is not an option, because this would make the new features you committed to ``master`` part of the release branch. That is not what you want, because these new features could be experimental or untested. So in this particular case merging the release branch into the ``master`` branch is the way to go.
                                      master
                                        ↓
    C1 -- C2 -- C3 -- C4 -- C5 -- C6 -- C7
                      |                 |
                      +--- D5 ----------+
                           ↑
                       release-1
After the merge, you should not remove the release branch, since you will need it later for fixes to other release-critical bugs.
A final use case that I want to discuss is a major refactoring. If you want to refactor your code in such a way that a lot has to be rewritten, you also create a branch.
This kind of refactoring usually takes some time, and you typically want feedback from other developers during the process. If you're lucky, other people are even willing to help you with the refactoring. So it is a good idea to make the refactoring branch publicly available.
Now suppose you want the new fixes from ``master`` to be incorporated in your refactoring branch. Rebasing your refactoring branch onto ``master`` is usually not a good idea: other developers have probably pulled it; they might even be working on it. So in this case, merging the ``master`` branch into your refactoring branch will do.
There you go. A modest introduction to git. I glossed over some details, because I wanted to keep it (relatively) short. And of course also because there are still things I don't understand completely myself :-)
The workflow as I describe it here, seems to work for me. I'm not sure whether it is really best practice. If you have any feedback, I am certainly interested.
This text is also available on GitHub. You can comment over there (just submit an issue), or send me pull requests if you want to improve it. :)