Trimming a Git Repo before Moving It to GitHub

verbiage overflow — Sat 23 August 2014

I learned a few days ago that because I have a valid academic affiliation, I'm eligible for five free GitHub private repositories, and so I've been moving some of the material in private repos on another Git hosting service into GitHub on those terms. (There are advantages to not having all one's files on a single host, of course.)

But GitHub enforces restrictions on large files that were not in place when I first set up the earlier repos, and those repos contained the relics of a number of very large files (databases and Python pickle output) from before I had learned to exclude such things via .gitignore. So I had to edit the commit history before I could push to GitHub. That was a lot of fun, and much less trouble than I had feared it would be. To remove <local_file_path> I used the following incantation from the GitHub help site:

git gc

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch <local_file_path>' \
--prune-empty --tag-name-filter cat -- --all

filter-branch rewrites Git revision history by applying the given filters to every commit; here, the index filter deletes the file from each commit's index without ever checking the tree out, and --prune-empty discards any commits that become empty as a result.
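The effect is easy to see in a throwaway repository. The sketch below (file names like big.db are made up for the demo, and FILTER_BRANCH_SQUELCH_WARNING just silences the advisory notice newer Git versions print) commits a file, deletes it from the working tree, and then strips it from history entirely:

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email demo@example.com
git config user.name demo

echo readme > README
head -c 5000 /dev/zero > big.db          # stand-in for a large database file
git add . && git commit -qm 'add README and big.db'
git rm -q big.db && git commit -qm 'drop big.db from the working tree'

# big.db is gone from the tip, but its blob still lives in history:
git log --oneline -- big.db

FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
    'git rm --cached --ignore-unmatch big.db' \
    --prune-empty --tag-name-filter cat -- --all

# After the rewrite, no commit on the branch touches big.db at all:
git log --oneline -- big.db
```

Note that a normal `git rm` only removes the file going forward; the second `git log` coming back empty is what shows the history itself was rewritten.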

The main trouble was that I didn't know which large files were hidden in my commit history and would have to be removed, so each attempt to push produced a series of messages like these:

remote: warning: File <file_path1> is 91.54 MB; this is larger than GitHub's recommended maximum file size of 50 MB

and

remote: error: File <file_path2> is 136.25 MB; this exceeds GitHub's file size limit of 100 MB

In order to find the files larger than the limit, I used the following Python 3 script:

#!/usr/bin/env python3
# find_large_file_py3.py
# Adapted from http://stackoverflow.com/a/10099633/621762 (accessed 20140823).

"""Report all files in Git commit history over a given size."""

import sys
if sys.version_info[0] < 3:
    sys.stdout.write('Python 3 required; exiting.\n')
    sys.exit(1)
import os

def main(argv):
    if len(argv) != 2:
        size_limit = 100000000  # Current GitHub maximum file size, in bytes.
    else:
        size_limit = int(argv[1])
    commits = os.popen('git rev-list HEAD').read().split()
    found = set()
    for commit in commits:
        # -r recurses into subtrees; -l appends each blob's size.
        tree_list = os.popen('git ls-tree -rl {}'.
                format(commit)).read().split('\n')
        for item in tree_list:
            if item:
                data, path = item.split('\t')
                # Fields before the tab: mode, type, blob SHA-1, size.
                _, _, sha, size = data.split()
                if size == '-':  # Submodule entries have no size.
                    continue
                size = int(size)
                if size > size_limit:
                    found.add((size, sha, path))
    # Sort numerically by size, largest first.
    for size, sha, path in sorted(found, reverse=True):
        print('size: {} blob: {}\npath: {}'.format(size, sha, path),
                end='\n\n')

if __name__ == '__main__':
    main(sys.argv)
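For what it's worth, git can produce the same inventory on its own: rev-list can enumerate every object reachable from any ref, and cat-file can report each object's size. Here is a minimal sketch of that approach, demonstrated in a throwaway repo with a 1000-byte threshold standing in for GitHub's 100 MB limit (the file names are made up for the demo):

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email demo@example.com
git config user.name demo

head -c 5000 /dev/zero > big.bin         # 5000-byte "large" file
echo small > small.txt
git add . && git commit -qm 'demo commit'

# List every blob in history larger than the threshold, with its path:
git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob" && $3 > 1000 {print $3, $4}' |
    sort -rn
```

Unlike the per-commit ls-tree loop above, this walks each object only once, so it stays fast even on repositories with long histories.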

After that, I was able to remove them all from the commit history and push to GitHub without difficulty.

Since not all my Git commits end up on GitHub, moving one large repository to GitHub from elsewhere made my past year's commit tally jump by some 74 old commits.
