I found myself wanting to get a picture of the history of a project at work yesterday – just general stuff like getting a feel for the growth of the codebase over time, the commit patterns of the developers during key phases etc, etc. I knew I could get a lot of information out of Git, but I’d never really done it before. Cue one late night hackathon!

I started out with the basic "git log" command and, after some experimentation, ended up with this combination of options:

git log --pretty=format:"%ad, %an" --no-merges --date=short --numstat

Taking these in turn:

--pretty=format:"%ad, %an"

This outputs the commit date and the author name.

--no-merges

This filters out the merge commits, which would otherwise result in some double counting.

--date=short

This ensures that the dates are output in a short, simple format yyyy-mm-dd.

--numstat

Outputs, for each file changed by the commit, the number of lines added and the number of lines deleted, tab-separated.

So the output of this looks something like the following:

2015-03-23, Alan Gibson
34	0	app/database/migrations/2015_03_23_065812_add_town_to_bookings.php
1	0	app/models/Booking.php
9	0	app/views/admin_bookings/create.blade.php
9	0	app/views/admin_bookings/edit.blade.php
3	1	app/views/admin_bookings/export.blade.php
3	0	app/views/admin_bookings/show.blade.php
3	0	app/views/bookings/confirmation.blade.php
4	0	app/views/bookings/details.blade.php

2015-03-12, Alan Gibson
2	2	app/views/admin_bookings/export.blade.php
1	1	app/views/admin_bookings/index.blade.php

So far so good, but I don’t need the line-by-line breakdown of the files; I just want a summary of the lines affected. Enter awk!

It’s been an awfully long time since I worked in a proper Unix development environment and I have to confess I couldn’t remember anything about awk. Thank goodness for Google.

The key to awk is that it splits text into records and fields. In my case a “record” spans multiple lines, but the exact number of lines varies with each commit. So I ended up with the following script:

BEGIN {
    # Blank lines separate records; newlines separate fields within a record
    RS="";
    FS="\n";
}

{
    files=0
    added=0
    deleted=0

    # $1 is the "date, author" header line; fields 2..NF are the
    # numstat lines in the form: added<TAB>deleted<TAB>filename
    for(i=2; i<=NF; i++) {
        split($i, a, " ");   # a single-space separator splits on any whitespace, tabs included
        files += 1;
        added += a[1];
        deleted += a[2];
    }

    print $1 ", " files ", " added ", " deleted ", " added-deleted
}

Actually this is a slightly simplified version as I wanted to exclude certain generated files from my statistics … but this is the general gist of it. By setting RS="" we’re telling awk to use blank lines as the record separator. Then FS="\n" says treat each line within the record as a field. After that I basically loop through the files modified by each commit and count up the files, insertions and deletions before outputting my summary on a single line per commit.
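Putting the pieces together: save the awk program to a file (I’ll call it summary.awk – the name is arbitrary) and pipe the git log output straight through it into a CSV file, something like this:

```shell
# summary.awk holds the awk program above; both filenames are my own choice.
git log --pretty=format:"%ad, %an" --no-merges --date=short --numstat \
  | awk -f summary.awk > commits.csv
```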

The output is something like this:

2014-04-18, Alan Gibson, 2, 3, 1, 2
2014-04-17, Alan Gibson, 3, 156, 82, 74
2014-04-16, Alan Gibson, 1, 15, 2, 13
2014-04-16, Alan Gibson, 7, 5, 5, 0
2014-04-16, Colin Orr, 1, 3, 4, -1
2014-04-16, Colin Orr, 5, 56, 25, 31
2014-04-14, Alan Gibson, 2, 15, 5, 10
2014-04-14, Colin Orr, 3, 40, 25, 15
2014-04-08, Alan Gibson, 1, 2, 2, 0
2014-04-08, Alan Gibson, 1, 1, 1, 0

It’s no accident that this looks like CSV – I pipe the output into a CSV file and then use MS Excel to analyse the data, produce some graphs etc. Stuff like this:

Git Commits

And this:

Git Growth

Obviously this is just some sample data from a side-project I did with my good friend Colin Orr last year, but I’m sure you get the idea well enough. With Excel I can carve up the data in lots of different ways, using the dates to group data by sprint or release, using the authors to check up on the productivity of individuals in my team … only kidding of course!
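If Excel isn’t to hand, a quick total per author can be pulled straight out of the CSV with another small awk one-liner – a sketch, assuming the column layout above (date, author, files, added, deleted, net) and a file named commits.csv:

```shell
# Sum the added-lines column (field 4) per author (field 2).
# -F', ' splits each CSV line on a comma followed by a space.
awk -F', ' '{added[$2] += $4} END {for (a in added) print a ", " added[a]}' commits.csv
```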

I’m sure there are much better ways to achieve the same results – I’m definitely not an awk expert! But it was interesting all the same and maybe it will give someone some ideas for doing something similar or better.