
How To Recover 20GB of Disk Space Back From Git


TL;DR — Don’t be a packrat. But if you must, use gc/repack/prune regularly.

I love open source software for myriad reasons. Some are ideological, some are academic, and some are practical. One of the most beneficial practical aspects of open source software is that you can easily introspect what would otherwise be a black box.

When you’ve identified a business need, have searched for and installed software, and subsequently received some unhelpful and undocumented error message while configuring said software, your path forward is governed by whether or not you have source access. You’ve already searched Stack Exchange, mailing lists, etc. for solutions and tried several, all of which failed with different, but still maddeningly opaque errors.

With closed software, you’re left to call customer support, and proceed to bash your head against the nearest keyboard(s)†. With open software, however, you can inspect the code, find the source of the error, and actually understand some of the context around the issue.

Searching for code is faster with find, grep, ack, or my new favorite, ag (the_silver_searcher), than with Google (DuckDuckGo for me). Once you’ve found the file you’re looking for, browsing around the codebase is faster and easier in Atom or Vim than clicking through a web UI like GitHub, GitLab, or Bitbucket. Go figure - local operations are faster than network operations. :trollface: For this reason, I clone just about every project I use under ~/repo/vendor — and a bunch that I don’t use directly but find helpful or interesting for whatever reason.
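
For concreteness, that workflow looks something like the sketch below. The repository URL and the search pattern are placeholders, not anything from a real project:

mkdir -p ~/repo/vendor
git clone https://example.com/some/project.git ~/repo/vendor/project  # hypothetical URL

# Grep the local checkout instead of clicking around a web UI.
ag 'maddeningly opaque error' ~/repo/vendor/project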


To ensure this code is up-to-date and relevant when I need it, I run a cron job nightly that pulls down the latest from upstream. Something like:

#!/bin/bash
for repo in ~/repo/vendor/*; do
  # Update.
  git -C "$repo" fetch --all --quiet
  git -C "$repo" reset --hard --quiet
  git -C "$repo" merge --quiet
  # Dependencies.
  git -C "$repo" submodule update --checkout
done
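
Wiring that script into cron is a single crontab entry. Something like the following, assuming the script is saved as ~/bin/update-vendor.sh (adjust the path and the 3 AM schedule to taste):

# m h dom mon dow  command
0 3 * * * $HOME/bin/update-vendor.sh >> $HOME/update-vendor.log 2>&1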

Today I noticed disk usage approaching 60% on my primary workstation (running 2x Intel 730 240GB 6Gb/s SSDs in RAID0), and I wasn’t surprised to find that the glut of third-party code I keep was consuming 65GB all by itself.
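
Checking the damage is quick if you want to do the same on your own machine:

df -h /               # overall filesystem usage
du -sh ~/repo/vendor  # how much the vendored clones are consuming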

Turns out that git, for all its superlatives, can get pretty gummed up under certain circumstances (such as cloning a large project with a long and complex history, like linux, chromium, or gcc).

I decided to test a worst-case scenario to see whether adding any routine maintenance would yield a benefit. My local clone of the chromium repository (from https://chromium.googlesource.com/chromium/src.git) was 20GB alone, so I did a quick search of the man pages and wound up running git gc --aggressive --prune=all.

Three hours later… pregnant pause… yes, three hours later, the whole time pegging all 12 logical cores of my Intel i7-5930K and utilizing upwards of 70% of my 64GB DDR4 RAM, it was finished. The result was a 13.5GB reduction in space usage, down to a mere 6.5GB. Did I mention it took over three hours?
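
If you want to reproduce that measurement on your own clone, git count-objects gives tidy before-and-after numbers (the ~/repo/vendor/chromium path below is just an assumed layout; adjust it for wherever your clone actually lives):

git -C ~/repo/vendor/chromium count-objects -vH  # note size and size-pack before
git -C ~/repo/vendor/chromium gc --aggressive --prune=all
git -C ~/repo/vendor/chromium count-objects -vH  # compare afterward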

It didn’t take long to learn that gc --aggressive is rarely the right answer. See http://stackoverflow.com/questions/28720151/git-gc-aggressive-vs-git-repack for a full discussion of why that is, and how you might proceed differently in your case. I wound up adding a few extra maintenance tasks to the end of my nightly update script:

#!/bin/bash
for repo in ~/repo/vendor/*; do
  # Update.
  git -C "$repo" fetch --all --quiet
  git -C "$repo" reset --hard --quiet
  git -C "$repo" merge --quiet
  # Dependencies.
  git -C "$repo" submodule update --checkout
  # Maintenance.
  git -C "$repo" gc
  git -C "$repo" repack -Ad
  git -C "$repo" prune
done
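
If you’re worried about what that maintenance pass will cost on a big clone, you can time it against a single repository before letting cron loose on everything (the path here is assumed, for illustration):

time git -C ~/repo/vendor/project gc
time git -C ~/repo/vendor/project repack -Ad
time git -C ~/repo/vendor/project prune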

I’ll probably run git gc --aggressive --prune=all as I clone new repositories down (just to be sure they’ve been correctly packed since we don’t know what GitHub, et al. do behind the scenes), but other than that, it’s a waste of time.
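
In practice that just means a one-time pass right after cloning; roughly (the URL is a placeholder):

git clone https://example.com/some/project.git ~/repo/vendor/project
git -C ~/repo/vendor/project gc --aggressive --prune=all  # once, at clone time only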

I saved about 19GB running these tasks over everything under ~/repo/vendor, and another 1GB or so doing the same thing to all the other code I have locally, for about a 20GB total disk savings.

All told, rm -r is the fastest way to recover disk space, but if you’re a packrat like me, git gc/repack/prune is second best.

If you know of similar tricks for optimizing Mercurial, Subversion, or CVS repositories, I’d love to hear them. Most everything I have is in git, but there are some notable disk-hungry exceptions (top of mind are webkit, which uses Subversion and sits at about 15GB, and mozilla-central, which uses Mercurial and eats about 3.8GB).


† The probability of getting an error, and subsequently being unable to make progress against it without outside help because of Software Documentation Insufficiency Disorder (SDID) while using proprietary systems is approximately 73.2%, according to my research experience.