A while back I accidentally committed a large number of Java artifacts (.war, .jar and .class) to my GitHub repo. This bloated the repo to about 100 MB. I didn’t notice until many commits and branch merges later.
Fortunately, there is a lot of information out there about this, so after trawling through StackOverflow, GitHub and the Git documentation (thanks everyone!) I finally managed to put the following script together:
#!/bin/bash
echo "Starting size (size-pack shows size in KiB)"
git count-objects -v
echo "Removing history for *.war, *.jar, *.class files"
# Quote the patterns so git itself (not the shell) matches them in every directory of each commit
git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch -- "*.war"' --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch -- "*.jar"' --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch -- "*.class"' --prune-empty --tag-name-filter cat -- --all
echo "Purging refs and garbage collection"
# Purge the backups
rm -Rf .git/refs/original
# Force reflog to expire now (not in the default 30 days)
git reflog expire --expire=now --all
# Prune
git gc --prune=now
# Aggressive garbage collection
git gc --aggressive --prune=now
echo
echo "Ending size (size-pack shows new size in KiB)"
git count-objects -v
# Can't do this in the script - it needs a human to be sure
echo
echo "Now use this command to force the changes into your remote repo (origin)"
echo
echo git push --all origin --force
This worked perfectly locally: my 100 MB repo dropped to about 2 MB. I then used the
git push --all origin --force
command to overwrite all branches in the GitHub repo with my local changes. All went well. To check everything, I deleted my local repo and cloned from GitHub. The clone should have been 2 MB, but it was 100 MB again.
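For comparing the local repo with a fresh clone, it can help to read the size-pack figure directly and to list the largest blobs still reachable from any ref. The following is only a minimal sketch and not part of the original script; the function names pack_size_kib and largest_blobs are made up for illustration, and both functions assume they are run from inside a Git working copy.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not from the original script.

# Print the size of all packs in the current repo, in KiB,
# as reported by the size-pack line of `git count-objects -v`.
pack_size_kib() {
    git count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# List the N largest blobs reachable from any ref (default 10),
# one per line as "<size in bytes> <path>", largest first.
largest_blobs() {
    local limit="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -rn |
        head -n "$limit"
}
```

Running largest_blobs in both the purged local repo and the fresh clone should show whether the .war, .jar and .class blobs are still present in the history that was cloned.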
So, after all that rambling, where have I gone wrong? How can I force GitHub to use my local repo with its purged history?
Edits for further information
The GitHub repo can’t be deleted since it has a lot of additional information attached to it (issues, wiki, watches etc). Running this script against an empty scratch repo works fine: the cloned repo is 2 MB.
The question remains why it doesn’t work with the main repo.
It was all because of a fork
It turns out that if someone forks your repo on GitHub, their fork retains links and references to the objects within it. Consequently, your purge won’t take effect unless everyone who holds a fork also runs the script on their repo.
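One way to check whether a repo has forks before attempting a purge is to query the GitHub REST API, whose repository response includes a forks_count field. This is a rough sketch only, assuming curl is installed and the repo is public; OWNER and REPO are placeholders and the function names are hypothetical. Parsing JSON with grep is crude, but it avoids extra dependencies.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative.

# Read a repository JSON document on stdin and print the first
# forks_count value found in it.
forks_count_from_json() {
    grep -o '"forks_count":[[:space:]]*[0-9][0-9]*' |
        head -n 1 |
        grep -o '[0-9][0-9]*$'
}

# Fetch the repository description for owner $1 and repo $2
# and print its fork count. Needs network access.
github_forks_count() {
    curl -fsS "https://api.github.com/repos/$1/$2" | forks_count_from_json
}

# Example (placeholders, not a real repo):
#   github_forks_count OWNER REPO
```

A result greater than zero suggests that other repos may still reference the old history, which matches the behaviour described above.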