I learned a few days ago that because I have a valid academic affiliation, I'm eligible for five free GitHub private repositories, and so I've been moving some of the material in private repos on another Git hosting service into GitHub on those terms. (There are advantages to not having all one's files on a single host, of course.)
But GitHub has restrictions on large files that were not in place when I first set up the earlier repos, and there were the relics of a number of very large files (databases and Python pickle output) from before I had learned to exclude those things via .gitignore. So I had to edit the commit history in order to be able to push to GitHub for the first time. That was a lot of fun, and much less trouble than I had feared it would be.
To remove <local_file_path> I used, for the first time, the following incantation from the GitHub help site:
git gc
git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch <local_file_path>' \
--prune-empty --tag-name-filter cat -- --all
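filter-branch is used for rewriting Git revision history through the application of filters. Here the index filter runs git rm --cached on every commit, so the file disappears from each commit's index without being checked out; --prune-empty drops any commit left empty by the removal, --tag-name-filter cat keeps existing tag names but points them at the rewritten commits, and -- --all applies the rewrite to every ref.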
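When several paths have to be purged, the same command can be assembled programmatically. What follows is only a sketch: the helper name build_filter_branch_command and the example paths are hypothetical, and the function does nothing more than build the argument list for the command shown above (git rm accepts several paths at once), leaving it to the caller to run.

```python
import shlex


def build_filter_branch_command(paths):
    """Return the argv list for a filter-branch run that drops `paths`.

    Hypothetical helper: it only assembles the command shown above and
    does not run it.
    """
    if not paths:
        raise ValueError('at least one path is required')
    # Quote each path so that spaces survive the shell that
    # filter-branch uses to evaluate the index filter.
    quoted = ' '.join(shlex.quote(p) for p in paths)
    index_filter = 'git rm --cached --ignore-unmatch ' + quoted
    return ['git', 'filter-branch', '--force',
            '--index-filter', index_filter,
            '--prune-empty', '--tag-name-filter', 'cat',
            '--', '--all']


if __name__ == '__main__':
    # Example paths are made up for illustration.
    cmd = build_filter_branch_command(['data/old.db', 'my cache.pickle'])
    print(' '.join(shlex.quote(part) for part in cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) from the top of the working tree; printing it first, as here, makes it easy to inspect before rewriting any history.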
The main trouble was that I didn't know what large files were hidden in my commit history and would have to be removed, so with each attempt to push I got a series of messages like this:
remote: warning: File <file_path1> is 91.54 MB; this is larger than GitHub's recommended maximum file size of 50 MB
and
remote: error: File <file_path2> is 136.25 MB; this exceeds GitHub's file size limit of 100 MB
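Since each rejected push names the offending files, the paths and sizes can be collected from the output rather than copied by hand. A minimal sketch, assuming the messages have exactly the wording quoted above; the function name parse_push_messages is made up for illustration.

```python
import re

# Matches the two message forms quoted above; assumes that exact wording.
MESSAGE_RE = re.compile(
    r'^remote: (?P<level>warning|error): File (?P<path>.+) is '
    r'(?P<size>\d+(?:\.\d+)?) MB; ')


def parse_push_messages(text):
    """Return (level, path, size_in_MB) tuples found in push output."""
    found = []
    for line in text.splitlines():
        match = MESSAGE_RE.match(line.strip())
        if match:
            found.append((match.group('level'),
                          match.group('path'),
                          float(match.group('size'))))
    return found
```

Lines that do not match the pattern are simply skipped, so the whole captured output of a failed push can be passed in as one string.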
In order to find files larger than the limit, I used the following Python 3 script, which checks every commit reachable from HEAD:
#! /usr/bin/env python3
# find_large_file_py3.py
# Adapted from http://stackoverflow.com/a/10099633/621762 (accessed 20140823).
"""Report all files in Git commit history over a given size."""

import os
import sys

if sys.version_info[0] < 3:
    sys.stdout.write('Python 3 required; exiting.\n')
    sys.exit(1)


def main(argv):
    if len(argv) != 2:
        size_limit = 100000000  # Current GitHub maximum file size.
    else:
        size_limit = int(argv[1])
    commits = os.popen('git rev-list HEAD').read().split()
    files = set()
    for commit in commits:
        tree_list = os.popen('git ls-tree -rl {}'.
                             format(commit)).read().split('\n')
        for item in tree_list:
            if item:
                data, path = item.split('\t', 1)
                # Fields: mode, type, object name, size. The third field is
                # the blob's hash, not a commit, so it gets its own name
                # instead of overwriting the loop variable.
                _, _, blob, size = data.split()
                if size == '-':  # Not a blob (e.g., a submodule entry).
                    continue
                size = int(size)
                if size > size_limit:
                    files.add((size, blob, path))
    # Tuples sort numerically by size, so the largest file comes first.
    for size, blob, path in sorted(files, reverse=True):
        print('size: {} blob: {}\npath: {}'.format(size, blob, path),
              end='\n\n')


if __name__ == '__main__':
    main(sys.argv)
After that, I was able to remove them all from the commit history and push to GitHub without difficulty.
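On a repository with a long history, listing the tree of every commit can be slow. A different technique (not the one I used above) is to ask Git for every reachable object once, with git rev-list --objects --all, and pipe that list through git cat-file --batch-check to get the sizes. The sketch below assumes Python 3.7 or later (for capture_output) and is run from inside the repository; the function names are made up for illustration.

```python
import subprocess


def parse_batch_check(output, size_limit=100000000):
    """Return (size, path) pairs for blobs larger than size_limit bytes."""
    large = set()
    for line in output.splitlines():
        # Each line reads "<type> <object name> <size> <path>"; the path
        # is empty for commits and may itself contain spaces.
        parts = line.split(' ', 3)
        if (len(parts) == 4 and parts[0] == 'blob'
                and int(parts[2]) > size_limit):
            large.add((int(parts[2]), parts[3]))
    return sorted(large, reverse=True)


def find_large_blobs(size_limit=100000000):
    """Scan every object reachable from any ref in the current repository."""
    objects = subprocess.run(
        ['git', 'rev-list', '--objects', '--all'],
        check=True, capture_output=True, text=True).stdout
    sizes = subprocess.run(
        ['git', 'cat-file',
         '--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)'],
        input=objects, check=True, capture_output=True, text=True).stdout
    return parse_batch_check(sizes, size_limit)
```

Calling find_large_blobs() with no argument uses the same 100000000-byte default as the script above; because it passes --all, it also sees files that survive only on other branches or tags, which matches the scope of the filter-branch command.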
Since not all my Git commits end up on GitHub, moving one large repository to GitHub from elsewhere made my past year's commit tally jump by some 74 old commits.