However, it's a nice example of why you shouldn't use SHA-1 for anything security-related and a good opportunity to learn more about the inner workings of git._
In 2017, the Cryptology Group at Centrum Wiskunde & Informatica (CWI) and the Google Research Security, Privacy and Anti-abuse Group announced the first (public) SHA-1 hash collision.
They generated two PDF files with the same hash in an attack they called SHAttered.
While it still required a lot of processing power (the equivalent of 6,500 years of single-CPU computation and 110 years of single-GPU computation), it showed that such attacks are not only theoretically possible but also technically feasible. [^1]
IDs in git, for example for commits, are generated using SHA-1 by default.
As explained by git's creator, Linus Torvalds, the hash function is not used for security.
It's simply used to generate a checksum, much like a CRC.[°The quote "We check a checksum that's cryptographically secure. Nobody has been able to break SHA-1" didn't hold up that well, but in general, Linus' point still stands: git uses SHA-1 for consistency checks, not for security.] [^2]
Git generates an ID for each object that it stores, which can later be used to identify and retrieve the object.
By default, this ID is a SHA-1 hash.
The input given to the hash function depends on the object type.
For files, which are stored as so-called blobs, the ID is generated by hashing the string `blob`, a space, the length of the file in bytes, a null byte and the file contents. [^3]
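
The blob ID described above can be reproduced outside of git. The following is a minimal sketch using Python's `hashlib`; the function name `git_blob_id` is made up for this illustration and is not part of git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID that git would assign to a blob with this content."""
    # Header: the object type, a space, the content length in bytes, a null byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    # The ID is the SHA-1 of the header followed by the raw file contents.
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Same input as: echo 'test content' | git hash-object --stdin
    print(git_blob_id(b"test content\n"))
    # A plain SHA-1 over the file alone gives a different value.
    print(hashlib.sha1(b"test content\n").hexdigest())
```

The first value should match what `git hash-object` prints for the same input, while the second one shows that the header changes the result compared to hashing the file on its own.
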
However, one does not usually work directly with blobs and trees.
It's what git uses internally, but from a user perspective, it's usually not necessary to interact with git on that level.
The usual workflow consists of editing files, staging these changes and then committing them.
Commits are just another object type in git.
These objects associate a tree with an author and committer,[°The author and the committer can be different. The author is who wrote the code and the committer is who added it to the repository.] a parent commit (if it exists) and a commit message to describe the changes included in this commit. [^3]
The resulting object can once again be viewed with `git cat-file`.
The commit ID is generated by hashing this information, prepended by the same header that is used for the other object types: `$object-type $length\0`.[°You can find a nice example how the commit hash is constructed [here](https://gist.github.com/masak/2415865 "How is git commit sha1 formed").]
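
To make the header scheme a bit more tangible, here is a rough Python sketch that assembles a commit body and hashes it. The helper names, the identity and the timestamp are hypothetical values chosen for this example. The field layout follows what `git cat-file -p` shows for a commit, and the tree ID used in the demo is the well-known ID of the empty tree.

```python
import hashlib


def git_object_id(object_type: str, body: bytes) -> str:
    """Hash the header (type, space, length, null byte) followed by the body."""
    header = f"{object_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_commit_body(tree, parent, author, committer, message):
    """Assemble the text of a commit object (hypothetical helper)."""
    lines = [f"tree {tree}"]
    if parent is not None:
        # The first commit in a repository has no parent line.
        lines.append(f"parent {parent}")
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the metadata from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


if __name__ == "__main__":
    identity = "Jane Doe <jane@example.com> 1700000000 +0000"  # made-up identity
    empty_tree = git_object_id("tree", b"")  # ID of a tree without entries
    body = build_commit_body(empty_tree, None, identity, identity, "Initial commit")
    print(git_object_id("commit", body))
```

Changing any of the inputs, including the timestamp inside the identity string, results in a different commit ID.
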
OK, so that's the reason why `git hash-object` returns different values for the two SHAttered files.[°Try it yourself if you want: prepending any data to these files will make their SHA-1 hashes differ. `echo test | cat - shatterd-1.pdf | sha1sum` results in a different hash than `echo test | cat - shatterd-2.pdf | sha1sum`]
Further, this explains why the commit hashes didn't collide.
There won't be any progress made by using the existing SHA-1 hash collision from SHAttered, but collisions can still happen.
Finding one by accident is unlikely and causing one intentionally is expensive, but neither is impossible.
How would git behave in this case?
Let's vandalize the code a little bit and find out.
Git can be built with multiple different SHA-1 backends.
By default, it uses an implementation with a collision attack detection mechanism.[°This collision attack detection is not relevant for our experiment for now. We'll get back to it later.]
In order to cause collisions between object IDs in git, it is helpful to reduce the size of the identifier.
To achieve this, every byte of the hash after the first one was set to 1.
This reduces the effective length of the identifier to 1 byte.
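
The effect of this modification can be modelled in a few lines of Python. This is only a sketch of the idea, not the actual patch to git's C code, and it assumes that "set to 1" means the byte value `0x01`. The function name `vandalized_sha1` is made up for this example.

```python
import hashlib


def vandalized_sha1(data: bytes) -> str:
    """SHA-1 with every byte after the first one overwritten with 0x01."""
    digest = bytearray(hashlib.sha1(data).digest())
    for i in range(1, len(digest)):
        digest[i] = 0x01
    return digest.hex()


if __name__ == "__main__":
    for sample in (b"first", b"second", b"third"):
        # Only the first two hex characters can differ between inputs.
        print(vandalized_sha1(sample))
```

Since only the first byte still depends on the input, there are just 256 possible identifiers left.
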
256 values may still sound like a lot, but there's no need to check every possible value.
We're not looking for a specific hash; any collision will do.
The first collisions should occur after a few tries.[°Such collisions are already very likely after a surprisingly small number of attempts if the value range is not too large. Look up the birthday problem if you don't know it.]
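
How quickly collisions become likely can be estimated with the birthday problem. The sketch below assumes that the 256 possible identifiers are all equally likely, which should be close enough for a truncated SHA-1. Both function names are made up for this example.

```python
def collision_probability(attempts: int, space: int = 256) -> float:
    """Probability that at least two of `attempts` random values collide."""
    p_all_unique = 1.0
    for i in range(attempts):
        # The (i+1)-th value has to avoid the i values drawn before it.
        p_all_unique *= max(space - i, 0) / space
    return 1.0 - p_all_unique


def attempts_for_probability(target: float, space: int = 256) -> int:
    """Smallest number of values for which a collision is at least `target` likely."""
    attempts = 0
    while collision_probability(attempts, space) < target:
        attempts += 1
    return attempts


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, round(collision_probability(n), 3))
    print("attempts for a 50% chance:", attempts_for_probability(0.5))
```

Keep in mind that a single commit usually creates more than one object (at least a blob, a tree and the commit itself), so the number of objects grows faster than the number of commits.
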
So let's compile git with our "improvement" and try to cause some collisions.
The first two commits may not collide, but it's already obvious what our little modification to the source code did: only the first 2 characters of the commit hash differ, everything afterwards is filled with the same two characters.
Continuing to add commits to this repository, the first collision was observed on the 7th attempt.
In this case both of the colliding objects are commits.
This time, git did not even print an error message.
After the creation of the new commit object once again failed silently, git proceeded to check out the existing commit with the same hash.[°People assume that a commit history is a strict progression of cause to effect. But actually from a non-linear, non-subjective viewpoint it's more like a big ball of wibbly-wobbly timey-wimey... stuff.]
This confirms the assumption that git will keep the existing object in case of a collision.
The other commits that got rolled back are still present as objects in git and can be checked out, but especially if the user does not notice what happened here and continues working, this can seriously mess up the commit history.
Those tests were continued until the following collisions occurred and git's behavior could be observed:
Collisions between two blobs
: Creating the new blob object fails silently. The existing blob object remains unchanged.
Collisions between blobs and trees
: Creating the new blob object fails silently. The existing tree object remains unchanged.
Collisions between blobs and commits
: Creating the new blob object fails silently. The existing commit object remains unchanged.
Collisions between blobs and tags
: Creating the new blob object fails silently. The existing tag object remains unchanged.
Collisions between trees and blobs
: Creating the new tree object fails silently. The existing blob object remains unchanged. Trying to commit this results in an error message complaining that the object is not a valid tree object.
Collisions between two trees
: Creating the new tree object fails silently. The existing tree object remains unchanged. Trying to commit this, git commits the old tree again. This effectively means a rollback to the commit associated with the old tree while keeping the commit history.
Collisions between trees and commits
: Creating the new tree object fails silently. The existing commit object remains unchanged. Trying to commit this results in an error message complaining that the object is not a valid tree object.
Collisions between trees and tags
: Creating the new tree object fails silently. The existing tag object remains unchanged. Trying to commit this results in an error message complaining that the object is not a valid tree object.
Collisions between commits and blobs
: Creating the new commit object fails. The existing blob object remains unchanged. Git attempts to check out the new commit (that wasn't created) and fails with the error message `fatal: cannot update ref 'refs/heads/main': trying to write non-commit object $HASH to branch 'refs/heads/main'`.
Collisions between commits and trees
: Creating the new commit object fails. The existing tree object remains unchanged. Git attempts to check out the new commit (that wasn't created) and fails with the error message `fatal: cannot update ref 'refs/heads/main': trying to write non-commit object $HASH to branch 'refs/heads/main'`.
Collisions between two commits
: Creating the new commit object fails silently. The existing commit remains unchanged and is checked out.
Collisions between commits and tags
: Creating the new commit object fails. The existing tag object remains unchanged. Git attempts to check out the new commit (that wasn't created) and fails with the error message `fatal: cannot update ref 'refs/heads/main': trying to write non-commit object $HASH to branch 'refs/heads/main'`.
Collisions between tags and blobs
: A new file is created under `.git/refs/tags` pointing at the hash, but the creation of the new tag object under `.git/objects` fails silently. The existing blob object remains unchanged.
Collisions between tags and trees
: A new file is created under `.git/refs/tags` pointing at the hash, but the creation of the new tag object under `.git/objects` fails silently. The existing tree object remains unchanged. The tag is displayed by `git tag -l`, but attempts at checking it out fail with the error message `fatal: Cannot switch branch to a non-commit`.
Collisions between tags and commits
: A new file is created under `.git/refs/tags` pointing at the hash, but the creation of this new tag object under `.git/objects` fails silently. The existing commit object remains unchanged. The tag is displayed by `git tag -l`, but is interpreted as a lightweight tag for the colliding commit.
Collisions between two tags
: A new file is created under `.git/refs/tags` pointing at the hash, but the creation of this new tag object under `.git/objects` fails silently. The existing tag object remains unchanged. Since the new tag reference object points at the old tag object, the new tag will be an alias for the old tag, with the same message and object reference.
Git does not overwrite existing objects.
Creating a new object with a hash that's already associated with another object always fails silently.
However, depending on the object type, git shows some interesting behavior.
If the object type fits, git will just continue with it as if nothing is wrong, e.g. it will commit an old tree or check out an old commit.
Git only responds with an error if the object types are incompatible, e.g. if the collision causes a reference to a tree to point at a tag object instead.
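
The observed behavior can be summarized with a toy model. The Python sketch below is not how git is implemented; the class `ToyObjectStore` and its methods are invented for this illustration. It only mimics the two rules from the experiment: existing objects are never overwritten, and an error only shows up once an object of the wrong type is read back.

```python
import hashlib


class ToyObjectStore:
    """Toy content-addressed store with a 1-byte object ID (not git's code)."""

    def __init__(self):
        self._objects = {}  # object ID -> (object type, body)

    @staticmethod
    def object_id(object_type: str, body: bytes) -> str:
        header = f"{object_type} {len(body)}".encode("ascii") + b"\0"
        # Keep only the first byte of the SHA-1 to provoke collisions.
        return hashlib.sha1(header + body).digest()[:1].hex()

    def write(self, object_type: str, body: bytes) -> str:
        oid = self.object_id(object_type, body)
        # Never overwrite: if the ID is taken, the new object is silently dropped.
        self._objects.setdefault(oid, (object_type, body))
        return oid

    def read(self, oid: str, expected_type: str) -> bytes:
        object_type, body = self._objects[oid]
        if object_type != expected_type:
            # Only a type mismatch is noticed, and only when the object is used.
            raise TypeError(f"object {oid} is a {object_type}, not a {expected_type}")
        return body


if __name__ == "__main__":
    store = ToyObjectStore()
    first_id = store.write("blob", b"version 0")
    n = 1
    # Search for a second blob that ends up with the same ID.
    while ToyObjectStore.object_id("blob", f"version {n}".encode()) != first_id:
        n += 1
    store.write("blob", f"version {n}".encode())
    # The store still returns the old content for the shared ID.
    print(first_id, store.read(first_id, "blob"))
```

If the second object had been a commit instead of a blob, `read(first_id, "commit")` would raise an error, similar to the messages about non-commit or invalid tree objects in the list above.
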
Hash collisions, intentional or not, can become a problem for git.
To mitigate this risk, git includes a mechanism that detects SHA-1 collision attacks and reacts by hashing the suspected block 3 times, extending SHA-1 from 80 to 240 steps in these cases.
This ensures that different hashes are generated in these cases.[^4]
Further, git does not only support SHA-1.
It supports SHA-256, too.
Unlike SHA-1, SHA-256 is considered cryptographically secure.
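
As a rough illustration of the difference, the following Python sketch computes a blob ID with either hash function. It assumes that SHA-256 repositories keep the same `blob $length\0` header and only swap the hash function. The `algorithm` parameter is a name chosen for this example, not something taken from git.

```python
import hashlib


def git_blob_id(content: bytes, algorithm: str = "sha1") -> str:
    """Blob ID under the assumption that only the hash function is swapped."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.new(algorithm, header + content).hexdigest()


if __name__ == "__main__":
    data = b"test content\n"
    print("sha1:  ", git_blob_id(data))            # 40 hex characters
    print("sha256:", git_blob_id(data, "sha256"))  # 64 hex characters
```

The longer identifiers are the most visible difference for users of such a repository.
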
Actual SHA-1 collisions are very unlikely to occur as a coincidence.
The collisions in this experiment could only be observed after the code was modified to limit the effective hash size to 1 byte.
Otherwise, it would not have been possible to create collisions with the available resources.
However, SHAttered shows that it is feasible to intentionally cause such a collision.[°Whether or not attackers with the resources required to do this are part of your threat model is up to you to decide.]
In case of a collision, git shows some interesting behavior that may not be noticed immediately.
This may leave the git repository in an unintended state.
Further, attackers could modify files and commits by replacing objects with specifically crafted objects that produce the same hash.
This risk is mitigated by the use of the collision attack detection mechanism.
Even though the risk is already mitigated, this example shows why SHA-1 should not be used for cryptographic hashing, and it was a good opportunity to learn more about git itself.
If you want to try it out yourself, you can have a look at the code with my modification [here](https://git.undefinedbehavior.de/undef/git-commit-vandalism "git commit vandalism - undefined git server").
[^1]: M. Stevens, E. Bursztein, P. Karpman, A. Albertini and Y. Markov (2017, February) "The first collision for full SHA-1", [https://shattered.io/static/shattered.pdf](https://shattered.io/static/shattered.pdf "The first collision for full SHA-1").
[^2]: L. Torvalds (2007, May) "Tech Talk: Linus Torvalds on git", [https://www.youtube.com/watch?v=4XpnKHJAok8&t=56m20s](https://www.youtube.com/watch?v=4XpnKHJAok8&t=56m20s "Tech Talk: Linus Torvalds on git").
[^3]: S. Chacon, B. Straub et al. (2014) "10.2 Git Internals - Git Objects" in *Pro Git*, [https://git-scm.com/book/en/v2/Git-Internals-Git-Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects "Git Internals - Git Objects").
[^4]: M. Stevens, D. Shumow (2017) "sha1dc/sha1.h", [https://github.com/git/git/blob/master/sha1dc/sha1.h](https://github.com/git/git/blob/master/sha1dc/sha1.h "sha1dc/sha1.h").