How can I rewrite our commit history to ensure certain keywords never appear?
Background: we have three tiers of repositories:
- Local – our development environments.
- Internal – our team’s private GH repository
- Client – Production / end client. All of our real names, emails, etc. must never make it here.
I’ve already found that git-filter-branch can help rewrite history to strip out names, using something like this…
git filter-branch -f --env-filter "GIT_AUTHOR_NAME='safeusername'; GIT_AUTHOR_EMAIL='safe@email.com'; GIT_COMMITTER_NAME='safeusername'; GIT_COMMITTER_EMAIL='safe@email.com';" HEAD
This appears to work great. When I push to the final remote, none of our names are present. However, on some merges, I don’t want any branch names or other comments to leak out by accident.
Additionally, I want our actual emails and usernames to continue to be configured so our internal project management system works and is transparent.
How can I ensure a list of keywords or names never appear in commit messages? Also, any other approaches to solving this problem?
Thanks!
Okay, so the general flow you want for doing something like this is:
So first: the magic. You’ll want to use
git filter-branch --commit-filter 'my-commit-filter-script "$@"'
The filter is a shell snippet that is run in place of git commit-tree, with commit-tree’s arguments available as "$@" and the commit message on stdin. So the script should change the names and emails via the appropriate environment variables, run whatever filtering you need on the message, and pipe the result along to the commit-tree invocation that’d have been run normally.
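As a rough illustration of such a commit filter, here is a minimal sketch written as a small library of shell functions. The file name commit-filter-lib.sh, the function names, and the example names and emails in the sed patterns are all hypothetical; only the safeusername / safe@email.com identity comes from the question.

```shell
# commit-filter-lib.sh (hypothetical file name)

# Map a private name or email to its public replacement.
# This version collapses every identity to the single one from the question;
# a per-person lookup table could go here instead.
sanitize_name()  { echo "safeusername"; }
sanitize_email() { echo "safe@email.com"; }

# Rewrite the commit message read from stdin.
# These patterns are placeholders; a real list would be project-specific.
sanitize_message() {
    sed -e 's/Alice Example/dev1/g' \
        -e 's/alice@internal\.example/dev1@example.com/g'
}

# Entry point: receives the arguments filter-branch would have passed to
# git commit-tree, with the original commit message on stdin.
commit_filter() {
    GIT_AUTHOR_NAME=$(sanitize_name "$GIT_AUTHOR_NAME")
    GIT_AUTHOR_EMAIL=$(sanitize_email "$GIT_AUTHOR_EMAIL")
    GIT_COMMITTER_NAME=$(sanitize_name "$GIT_COMMITTER_NAME")
    GIT_COMMITTER_EMAIL=$(sanitize_email "$GIT_COMMITTER_EMAIL")
    export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
    # Filter the message, then create the commit exactly as filter-branch would.
    sanitize_message | git commit-tree "$@"
}
```

It could then be invoked along the lines of git filter-branch -f --commit-filter '. /absolute/path/to/commit-filter-lib.sh; commit_filter "$@"' HEAD. An absolute path is safest, since filter-branch runs its filters from a temporary directory. Run it on a copy of the repository first.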
The name/email step (call it sanitize) is meant to be a function/script that does some private->public mapping of names/emails; if all you want to do is change them all to a single name, then that bit is really easy. The message filter, presumably a sed command, might be something a little fancier, which for example reads a table of transformations. That bit is up to you, depending on the complexity of the sanitization you need to do.
If you trust your commit message filtering, then you’re done at this point. If you want to validate, you can do it manually, or you can independently search for the “dangerous” strings. For example, if you have a file dangerous-strings.txt, you could do
git log --pretty="%an %ae %cn %ce%n%B" [branches] | grep -f dangerous-strings.txt
(The log command prints author/committer name/email followed by the commit message.)
Then publish as normal: push, presumably.
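The validation step described above could be wrapped so that it fails loudly before anything is pushed. This is only a sketch: the function names scan_for_dangerous and check_history are made up for illustration, and it assumes dangerous-strings.txt holds one literal string per line.

```shell
# scan_for_dangerous FILE
# Read text on stdin and print every line containing one of the strings
# listed in FILE (matched literally, ignoring case).
# Returns 0 if the text is clean, 1 if anything matched.
# Caution: a blank line in FILE matches every input line.
scan_for_dangerous() {
    if grep -i -F -f "$1"; then
        echo "dangerous strings found" >&2
        return 1
    fi
    return 0
}

# check_history [REVISIONS...]
# Scan author/committer names, emails, and full commit messages of the
# given revisions (default: everything reachable from HEAD).
check_history() {
    git log --pretty='%an %ae %cn %ce%n%B' "$@" |
        scan_for_dangerous dangerous-strings.txt
}
```

For example, check_history --all && git push client master would refuse to push if any listed string survived the rewrite (the remote name client and the branch master are placeholders).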
Finally, a few alternate suggestions, perhaps for future readers with different requirements:
- Instead of rewriting commits, make new commits. The message could just be quick versioning information (including the SHA1 of the internal commit it represents), or it could include a shortlog of the commits being introduced (just the subjects). You could do this by keeping a publishing branch and using git merge --squash [--log], or by committing fresh in a separate repo, after copying things in.
- Keep your repo in a form that doesn’t need the transformation. This seems to be impossible for the OP, but if your situation is different, keep it simple. Less risky, less work.
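The "make new commits" alternative could look something like the sketch below. Note that it swaps in git commit-tree rather than git merge --squash: it snapshots the tree of an internal commit as a single new commit on a publishing branch, so no internal history or messages are carried over. The branch names master and client-release, the function names, and the message layout are assumptions for illustration only.

```shell
# publish_message VERSION INTERNAL_SHA
# Build the public commit message: version info plus the internal SHA1.
publish_message() {
    printf 'Release %s\n\nInternal commit: %s\n' "$1" "$2"
}

# publish_snapshot VERSION [INTERNAL_REF] [PUBLIC_BRANCH]
# Create one new commit on PUBLIC_BRANCH whose tree equals INTERNAL_REF's.
publish_snapshot() {
    version=$1
    internal=${2:-master}
    public=${3:-client-release}
    sha=$(git rev-parse --verify "$internal^{commit}") || return 1
    tree=$(git rev-parse --verify "$sha^{tree}") || return 1
    # Use the current tip of the public branch as parent, if it exists yet.
    if parent=$(git rev-parse --verify --quiet "refs/heads/$public"); then
        set -- -p "$parent"
    else
        set --
    fi
    # commit-tree reads the message from stdin; the identity is forced
    # to the safe one from the question.
    new=$(publish_message "$version" "$sha" |
        GIT_AUTHOR_NAME=safeusername GIT_AUTHOR_EMAIL=safe@email.com \
        GIT_COMMITTER_NAME=safeusername GIT_COMMITTER_EMAIL=safe@email.com \
        git commit-tree "$tree" "$@") || return 1
    git update-ref "refs/heads/$public" "$new"
}
```

After publish_snapshot 1.2, only the client-release branch would be pushed to the client remote. The public history then contains one commit per release, each written with the safe identity.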