The Git project recently released Git v2.47.0. Let's look at a few notable highlights from this release, which includes contributions from GitLab's Git team and the wider Git community.
New global configuration options
If you have been following recent Git releases, you are probably familiar with the new "reftable" reference backend that became available with Git version 2.45. Check out our Beginner's guide to the Git reftable format to learn more. Previously, in order to initialize a repository with the "reftable" format, the --ref-format option needed to be passed to git-init(1):
$ git init --ref-format reftable
With the 2.47 release, Git now has the init.defaultRefFormat configuration option, which tells Git which reference backend to use when initializing a repository. This can be used to override the default "files" backend and begin using the "reftable" backend. To configure, execute the following:
$ git config set --global init.defaultRefFormat reftable
As some of you may know, the object hash format used by Git repositories is also configurable. By default, repositories are initialized to use the SHA-1 object format. An alternative is the SHA-256 format, which is more secure and future-proof. You can read more about this in one of our previous blog posts on SHA-256 support in Gitaly. A SHA-256 repository can be created by passing the --object-format option to git-init(1):
$ git init --object-format sha256
In this Git release, another configuration option, init.defaultObjectFormat, has been added. This option tells Git which object format to use by default when initializing a repository. To configure, execute the following:
$ git config set --global init.defaultObjectFormat sha256
Note that SHA-256 repositories are not interoperable with SHA-1 repositories, and not all forges support hosting SHA-256 repositories. GitLab recently announced experimental support for SHA-256 repositories if you want to try it out.
These options provide a useful mechanism to begin using these repository features without having to consciously think about it every time you initialize a new repository.
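One way to confirm that the new defaults take effect is to initialize a throwaway repository and ask Git which formats it picked. The following is a minimal sketch, assuming Git 2.47 or later. The helper name show_new_repo_formats is hypothetical, and the GIT_CONFIG_GLOBAL environment variable is used to point Git at a scratch configuration file so that your real global configuration is left untouched.

```shell
# Hypothetical helper: create a scratch repository using the new defaults
# and print the reference format and object format Git chose for it.
# The function body runs in a subshell so the exported variable does not leak.
show_new_repo_formats() (
    scratch=$(mktemp -d) || exit 1

    # Point Git at a throwaway "global" config so ~/.gitconfig is untouched.
    GIT_CONFIG_GLOBAL="$scratch/gitconfig"
    export GIT_CONFIG_GLOBAL

    git config set --global init.defaultRefFormat reftable &&
    git config set --global init.defaultObjectFormat sha256 &&
    git init --quiet "$scratch/repo" &&
    git -C "$scratch/repo" rev-parse --show-ref-format --show-object-format
    status=$?

    rm -rf "$scratch"
    exit "$status"
)
```

On a Git version that supports both options, calling show_new_repo_formats should print the reference format followed by the object format of the scratch repository.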
This project was led by Patrick Steinhardt.
New subcommand for git-refs(1)
In the previous Git release, the git-refs(1) command was introduced to provide low-level access to references in a repository, along with the "migrate" subcommand to convert between reference backends. This release adds a new "verify" subcommand, which allows the user to check the reference database for consistency.
To verify the consistency of a repository, we often execute git-fsck(1). Notably, though, this command does not explicitly verify the reference database of the repository. With the introduction of the "reftable" reference format, which is a binary format and thus harder to inspect manually, it is now even more important that tooling be established to fill this gap. Let's set up a repository with an invalid reference to demonstrate:
# The "files" backend is used so we can easily create an invalid reference.
$ git init --ref-format files
$ git commit --allow-empty -m "init"
# A lone '@' is not a valid reference name.
$ cp .git/refs/heads/main .git/refs/heads/@
$ git refs verify
error: refs/heads/@: badRefName: invalid refname format
We can see the invalid reference was detected and an error message printed to the user. While this tooling is not something the end user will likely run, it is particularly useful on the server side to ensure repositories remain consistent. Eventually, the goal is to integrate this command as part of git-fsck(1) to provide a unified way to execute repository consistency checks.
This project was led by Jialuo She as part of the Google Summer of Code. To learn more, you can read Jialuo's GSoC report.
Ongoing reftables work
This release also includes fixes for some bugs found in the "reftable" backend. One of these bugs is particularly interesting and revolves around how table compaction was being performed.
As you may recall, the reftable backend consists of a series of tables containing the state of all the references in the repository. Each atomic set of reference changes results in a new table being written and recorded in the "tables.list" file. To reduce the number of tables present, after each reference update, the tables are compacted to follow a geometric sequence by file size. After the tables are compacted, the "tables.list" file is updated to reflect the new on-disk state of the reftables.
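The idea of keeping table sizes in a geometric sequence can be illustrated with a small model. The sketch below is a simplification and not Git's compaction code: the function name compact_sizes is hypothetical, the factor of 2 is an assumption, and the table sizes are passed ordered from newest to oldest. A table is merged into its newer neighbours whenever it is less than twice their combined size, so each remaining table ends up at least twice as large as the next newer one.

```shell
# Toy model (hypothetical, not Git's code): given table sizes ordered from
# newest to oldest, print the sizes that remain after merging adjacent
# tables until every table is at least twice the size of the next newer one.
compact_sizes() {
    [ "$#" -gt 0 ] || return 0

    acc=$1          # size of the (possibly merged) table being built
    shift
    out=

    for older in "$@"; do
        if [ "$older" -lt $((acc * 2)) ]; then
            # The older table is too small relative to the newer data:
            # fold it into the table being built.
            acc=$((acc + older))
        else
            # The geometric property holds here: keep the table as is
            # and continue with the older one.
            out="$out$acc "
            acc=$older
        fi
    done

    echo "$out$acc"
}
```

For instance, compact_sizes 1 1 2 4 100 merges the four small tables into a single table of size 8 and leaves the large one alone, while a stack that is already geometric, such as 1 2 4 8, is left unchanged.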
By design, concurrent table writes and compaction are allowed. Synchronization at certain points is controlled through the use of lock files. For example, when compaction starts, the "tables.list" file is initially locked so that it can be read consistently and the tables requiring compaction can also be locked. Since the actual table compaction can take a while, the lock on "tables.list" is then released, allowing concurrent writes to proceed. This is safe because concurrent writers know that they must not modify the now-locked tables which are about to be compacted. When the newly compacted tables have finished being written, the "tables.list" file is locked again, and this time it is updated to reflect the new table state.
There is a problem though: What happens if a concurrent reference update writes a new table to the "tables.list" in the middle of table compaction after the initial lock was released, but before the new list file was written? If this race were to occur, the compacting process would not know about the new table and consequently rewrite the "tables.list" file without the new table. This effectively drops the concurrent update and could result in references not being added, updated, or removed as expected.
Luckily, the fix for this problem is rather straightforward. When the compacting process acquires the lock to write to the "tables.list" file, it must first check whether any updates to the file have occurred and, if so, reload it. Doing so ensures any concurrent table updates are also reflected appropriately. For more information on this fix, check out the corresponding mailing-list thread.
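The reload-under-lock step can be sketched with a toy model. This is not Git's implementation: all function and file names other than "tables.list" are hypothetical, a directory created with mkdir stands in for the lock file, and the compacted table is represented by a fixed name. The point it illustrates is that the final rewrite is based on the list as it exists when the lock is re-acquired, not on the snapshot taken before compaction.

```shell
# Toy model of the fix (hypothetical names, not Git's code). A "stack" is a
# directory containing a tables.list file that names one table per line.

lock_list()   { mkdir "$1/tables.list.lock" 2>/dev/null; }  # atomic stand-in for a lock file
unlock_list() { rmdir "$1/tables.list.lock"; }

# Phase 1: under the lock, record which tables will be compacted, then
# release the lock so concurrent writers can append new tables.
begin_compaction() {
    lock_list "$1" || return 1
    cp "$1/tables.list" "$1/snapshot"
    unlock_list "$1"
}

# Phase 2: re-acquire the lock, reload tables.list, and keep every table
# that was added after the snapshot instead of overwriting the list blindly.
finish_compaction() {
    lock_list "$1" || return 1
    {
        echo "compacted.ref"
        # Lines in the current list that are absent from the snapshot,
        # i.e. tables written concurrently during compaction.
        grep -v -x -F -f "$1/snapshot" "$1/tables.list" || true
    } > "$1/tables.list.new"
    mv "$1/tables.list.new" "$1/tables.list"
    rm -f "$1/snapshot"
    unlock_list "$1"
}
```

If a writer appends a table between begin_compaction and finish_compaction, that table still appears in the rewritten list after the compacted one. Rewriting the list from the snapshot alone would have dropped it, which is the race described above.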
This project was led by Patrick Steinhardt.
Fixes for git-maintenance(1)
As a repository grows, it is important that it is properly maintained. By default, Git executes git-maintenance(1) after certain operations to keep the repository healthy. To avoid performing unnecessary maintenance, the --auto option is specified, which uses defined heuristics to determine whether maintenance tasks should be run. The command can be configured to perform various maintenance tasks, but by default, it simply executes git-gc(1) in the background and allows the user to carry on with their business.
This works as expected until maintenance is configured to perform non-default maintenance tasks. When this happens, the configured maintenance tasks are performed in the foreground and the initial maintenance process doesn't exit until all tasks complete. Only the "gc" task detaches into the background as expected. It turns out this was because git-gc(1), when run with --auto, was accidentally detaching itself, and other maintenance tasks had no means to do so. This had the potential to slow down certain Git commands, as auto-maintenance had to run to completion before they could exit.
This release addresses this issue by teaching git-maintenance(1) the --detach option, which allows the whole git-maintenance(1) process to run in the background instead of individual tasks. The auto-maintenance performed by Git was also updated to use this new option. For more information on this fix, check out the mailing-list thread.
A little earlier it was mentioned that the auto-maintenance uses a set of heuristics to determine whether or not certain maintenance operations should be performed. Unfortunately for the "files" reference backend, when git-pack-refs(1) executes with the --auto option, there is no such heuristic and loose references are unconditionally packed into a "packed-refs" file. For repositories with many references, rewriting the "packed-refs" file can be quite time-consuming.
This release also introduces a heuristic that decides whether it should pack loose references in the "files" backend. This heuristic takes into account the size of the existing "packed-refs" file and the number of loose references present in the repository. The larger the "packed-refs" file gets, the higher the threshold for the number of loose references before reference packing occurs. This effectively makes reference packing in the "files" backend less aggressive while still keeping the repository in a maintained state. Check out the mailing-list thread for more info.
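To make the shape of such a heuristic concrete, here is a small model in which the loose-reference threshold grows slowly as the "packed-refs" file gets larger. Everything specific in it is an illustrative assumption rather than something taken from Git: the function names, the estimate of roughly 100 bytes per packed reference, the logarithmic scaling, and the minimum threshold of 16.

```shell
# Illustrative model only: names and constants are assumptions, not Git's.

# Print the number of loose references required before packing, given the
# size of the existing packed-refs file in bytes.
loose_ref_threshold() {
    est_refs=$(($1 / 100))      # assumed ~100 bytes per packed reference

    # Integer log2 of the estimated number of packed references.
    log2=0
    while [ "$est_refs" -gt 1 ]; do
        est_refs=$((est_refs / 2))
        log2=$((log2 + 1))
    done

    limit=$((log2 * 5))         # threshold grows with log2 of the estimate
    if [ "$limit" -lt 16 ]; then
        limit=16                # assumed floor for small repositories
    fi
    echo "$limit"
}

# Succeed if packing should happen.
# Arguments: packed-refs size in bytes, number of loose references.
should_pack_loose_refs() {
    [ "$2" -ge "$(loose_ref_threshold "$1")" ]
}
```

Under these assumed constants, a repository with no "packed-refs" file would pack once 16 loose references exist, whereas one with a 100 MB "packed-refs" file would wait for 95, so the expensive rewrite of a large file happens less often.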
This project was led by Patrick Steinhardt.
Code refactoring and maintainability improvements
In addition to functional changes, there is also work being done to refactor and clean up the code. These improvements are also valuable because they help move the project closer toward the longstanding goal of libifying its internal components. To read more, here is a recent update thread regarding libification.
One area of improvement has been around resolving memory leaks. The Git project has quite a few memory leaks. For the most part, these leaks don't cause much trouble because a Git process usually only runs for a short amount of time and the system cleans up afterward, but in the context of libification they become something that should be addressed. Tests in the project can be compiled with a leak sanitizer to detect leaks, but due to the presence of existing leaks, it is difficult to validate and enforce that new changes do not introduce new leaks. There has been an ongoing effort to fix all memory leaks surfaced by existing tests in the project. Leak-free tests are subsequently marked with TEST_PASSES_SANITIZE_LEAK=true to indicate that they are expected to be free of leaks going forward. Prior to this release, the project had 223 test files containing memory leaks. This release whittles that number down to just 60.
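As a rough way to track this kind of progress in a checkout of the Git source tree, one could count the test scripts that do not yet carry the marker. The helper below is a hypothetical sketch, assuming test scripts live under t/ and are named like tNNNN-*.sh. It only looks for the literal marker text, so its result may not match exactly how the figures above were counted.

```shell
# Hypothetical helper: count test scripts in a Git source checkout that do
# not contain the leak-free marker. Takes the checkout path as its argument.
count_unmarked_tests() {
    # grep -L lists the files that contain no match for the pattern;
    # tr strips the padding some wc implementations add.
    grep -L 'TEST_PASSES_SANITIZE_LEAK=true' "$1"/t/t[0-9]*.sh | wc -l | tr -d ' '
}
```

Running count_unmarked_tests with the path to a checkout prints a single number, which can be compared across releases.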
Another ongoing effort has been to reduce the use of global variables throughout the project. One such notorious global variable is the_repository, which contains the state of the repository being operated on and is referenced all over the project. This release comes with a number of patches that remove uses of the_repository in favor of directly passing the value where needed. Subsystems in the Git project that still depend on the_repository have USE_THE_REPOSITORY_VARIABLE defined, allowing the global to be used. With this release, the refs, config, and path subsystems no longer rely on it.
This project was led by Patrick Steinhardt with the help of John Cai and Jeff King.
Read more
This blog post highlighted just a few of the contributions made by GitLab and the wider Git community for this latest release. You can learn more about these from the official release announcement of the Git project. Also, check out our previous Git release blog posts to see past highlights of contributions from GitLab team members.