Git 201: Creating a “fixup” commit

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

As I work on a branch, adding commit after commit, I’ll sometimes find that there’s a piece of work I forgot to do which really should have been part of some prior commit.  The first thing to do—before fixing anything—is to save the work I’m doing.  This might mean using a work-in-progress commit as described previously, or simply using the stash.  From there, I have a few options, but in this post I’m going to focus on using the “fixup” command within git’s interactive rebase tool.

The essence of the process is to make a new commit, and then use git to combine it with the commit I want to fix.  The first step is to make my fix, and commit it.  The message here really doesn’t matter since it’s going to get replaced anyway.

> git add -A
> git commit -m "fix errors"

This will add a new commit to the end of the branch, but that’s not actually what I wanted.  Now I need to squash it into the older commit I want to fix.  To do this, I’m going to use an “interactive rebase” command:

> git rebase -i master

This is telling git that I want to edit the history of commits back to where my branch diverged from master (if you originally created your branch from somewhere else, you’ll want to specify that instead).  In response to this request, git is going to create a temporary file on disk somewhere and open up my editor (the same one used for commit messages) with that file loaded.  It will wind up looking something like this:

pick 7e70c43 Add Chicken Tikka Masala
pick e8cc090 Remove low-carb flag from BBQ ribs
pick 9ade3d6 Fix spelling error in BBQ ribs
pick b857991 fix errors

# Rebase 1222f97..b857991 onto 1222f97 (4 command(s))
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to
# bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

These are all the commits I’ve made since I cut my branch, ordered from oldest (at the top) to newest (at the bottom).  The commented-out part gives instructions for what you can do in this “interactive” portion of the rebase.  Assuming the fix is for the Chicken Tikka Masala recipe, I’d edit the file to look like this:

pick 7e70c43 Add Chicken Tikka Masala
fixup b857991 fix errors
pick e8cc090 Remove low-carb flag from BBQ ribs
pick 9ade3d6 Fix spelling error in BBQ ribs

When I save the file and quit my editor, git is going to rebuild the branch from scratch according to these instructions.  The first line tells git to start by applying commit 7e70c43 just as before.  The next line tells git to meld my fix-up commit, b857991, into it, replacing the original commit with one which combines the two.  The last two lines tell git to create two new commits which result in the same end state as the old commits, e8cc090 and 9ade3d6.

As a bit of an aside…  Why does git have to create new commits for the last two commands?  Remember that commits are immutable once created, and that part of the data which makes up the commit is the parent commit it was created from.  Since I’ve asked git to replace the parent commit, it will now need to create new commits for everything which follows on that same branch since each one now has to have a new parent: all the way back to the commit I replaced.

At the end of all this, if we were to inspect the log, we’d see:

> git log --reverse --oneline master..

6a829bc3 Add Chicken Tikka Masala
29dd3231 Remove low-carb flag from BBQ ribs
0efc5692 Fix spelling error in BBQ ribs

In essence, everything appears to be the same, except that the original commit now includes my fix.  Looking a little closer, though, you can see that each commit has a different hash.  The first one is different because I changed its contents by folding in my fix, and since commits are immutable, that meant creating a new commit.  The other two had to be re-created because their parent commit disappeared: new commits were needed to preserve the effect of e8cc090 and 9ade3d6, starting from my fixed-up commit as the new parent.
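
As an aside, newer versions of git can automate most of this workflow.  If your git supports it, the --fixup option on git commit creates a commit whose message begins with “fixup!”, and passing --autosquash to the interactive rebase will move that commit into place and mark it as a fixup for you.  A sketch of that variation, using the same example commit:

> git add -A
> git commit --fixup 7e70c43
> git rebase -i --autosquash master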

✧✧✧

There is one caveat I have to warn you about when using this method.  Any time you rebase a branch, you’ve changed the commit history of the branch.  That is, you’ve thrown out a whole bunch of commits and replaced them with a completely new set.  This is going to mean loads of difficulties and possibly lost work if you’re sharing that same branch with other people, so only use this technique on your own private working branches!

Git 201: Keeping a clean commit history

This series is going to get very technical.  If you’re not already familiar with git, you may want to start over here instead.

✧✧✧

Most people I’ve worked with treat git somewhat like they might treat an armed bomb with a faulty count-down timer.  They generally want to stay away from it, and if forced to interact with it, they only do exactly what some expert told them to do.  What I hope to accomplish here is to convince you that git is really more like an editor for the history of your code, and not a dark god requiring specific incantations to keep it from eating your code.

As I’m thinking of this as a Git 201 series, I’m going to assume you are familiar with a whole bunch of 101-level terms and concepts: repository, index, commit, branch, merge, remote, origin, stash.  If you aren’t, scroll back up and follow this link to some 101-level content.  Come on back when you’re comfortable with all of those.  I’m also going to assume you’ve already read my last post about creating a solid commit.

As you might expect, there’s a lot to be said on this topic, so I’m going to break this up into multiple commits… err posts.  To make things easy, I’m going to keep updating the list at the end of this post as I add new ones in the series.  So, feel free to jump around between each of them, or just read them straight through (each will link to the next post, and back here).

✧✧✧

Git 201: Using a work-in-progress commit

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

My work day doesn’t naturally divide itself cleanly into commits.  I’m often interrupted to work on something new when I’m half-way done with my current task, or I just need to head home in the evening without quite having finished that new feature.  This post talks about one technique I use in those situations.

To rewind a bit, though: when I’m starting a new piece of work (whether a new feature, bug fix, etc.), I first check out the source branch (almost always master), and run:

> git pull --rebase

This grabs the latest changes from the origin without adding an unnecessary merge commit.  I then create a new branch whose name includes my username along with a brief reminder of what the change is for:

> git checkout -b aminer/performance-tuning

At this point, I’m ready to start working.  So, I’ll crack open my favorite editor and make my changes.  As I’m going along, I’ll often find that I need to stop mid-thought and move on to a different task (e.g., fix something else, go home for the day, etc.).  It may also be that I’ve arrived at a point where I’ve just figured out something tricky, but I’m not through with the overall change yet.  So, I’ll create a work-in-progress commit:

> git add -A
> git commit -n -m wip

This is really more of a “checkpoint” than an actual commit since I very likely haven’t fully finished anything, probably don’t have tests, etc.  The -n (short for --no-verify) tells git to skip the pre-commit and commit-msg hooks, which are usually what run tests and linters.  That’s fine for now because this isn’t a “finished” commit, and I’m going to get rid of it later.  For the time being, though, I keep on working.  If I reach another checkpoint before I’m ready to make a real commit, I’ll just add on to the checkpoint commit:

> git add -A
> git commit -n --amend

The first command pulls all my new work into the index while the second removes the prior work-in-progress commit, and replaces it with a new one which contains both the already-committed changes along with my new ones.

When I’ve actually finished a complete unit of work, I remove the work-in-progress commit:

> git reset HEAD~1

This will take all the changes which were in the prior commit, and add them back to my working directory as though they were there all along.  The commit history on my branch looks just as though the commit had never been there.

What makes a good commit?

It is distressingly common for me to see commits in a git repository which look something like:

commit 6a69053d2419b311cc38c9fdbbbac75b5a62c4fa
Author: dev-dude <dev-dude@somco.com>
Date: Sun Mar 29 14:05:46 2015 +0200

    fix missing commas

Such a commit will often be half a complete thought (if even that much), have no tests, and a message which explains exactly nothing about why the change was made, or what its intended consequences are.  We can do a lot better.

✧✧✧

A good commit is much like a good sentence: a well-formed, grammatically correct, complete expression of a single thought.  In the case of a commit, this means that it should:

  • make exactly one cohesive change to the code
  • include tests to verify the change
  • have a title which clearly—but briefly—indicates what was done
  • have a message which describes why it was done

 

Cohesion

The most important virtue of a well-formed commit is good cohesion, or: doing only one thing.  This could be a refactoring in preparation for some new feature, it could be adding that feature, or it could be fixing a bug.  The actual size of the commit doesn’t matter so terribly much (though smaller is generally better), but that it clearly represents a single change does.

As a short-cut for deciding whether your change is cohesive or not, consider how easily you could come up with the title for the commit.  Can you, in 70 characters or less, state clearly what your change does?  If not, it’s probably not very cohesive.  If you find yourself tempted to make a list as your commit title, you definitely don’t have good cohesion.

In such cases, stop and ask yourself: “what is the smallest thing I could possibly do to this codebase which would be a clear improvement?”.  In a single pull request, it’s not at all uncommon for me to have more than one of these kinds of commits:

  • clear up some stylistic problems with old code
  • refactor various parts of the code I’m about to touch
  • add a new class which will be part of a new feature I’m about to add
  • tie a bunch of prior commits together in a way which makes the new feature visible to the user
  • update documentation

Of course, depending upon the exact nature of the work, I may not need all of those, but it will be very common for me to have at least a few of them.

 

Tests

In my opinion, tests deserve as much care as any other code, and it’s quite possible to have too many or too few.  That’s a topic for another time.  For now, I just want to point out that any new commit should include any appropriate tests related to that new code.  Of course, it’s possible that it’s appropriate not to add new tests (e.g., for a refactoring), but if a commit doesn’t have tests, it should only ever be because you considered it, and decided there weren’t any new tests which were needed.

 

Commit Title

Git actually does have a proper format for the commit message; it’s not just arbitrary.  Remember that git is a command-line tool, and is often used in environments where space can be limited.  To that end, there’s a definite format.

In this format, the first line is considered a title, and it should be a present-tense, imperative sentence stating what changed with this commit (e.g., “Add the FizzyWizzBanger class”).  This should be less than 70 characters because git will only show that many in the various “one-line” versions of its commit logs.  Ideally, it will be sufficiently differentiated from other commit messages so that you can readily figure out which one of several commits it is, even out of context.

 

Commit Message

The question of “what” the commit changes should really be summarized by the title, and made obvious from the code.  In the commit message, you should focus on why you made this change.  Perhaps it fixes a certain bug.  If so, go ahead and explain—in a sentence or two—why this change fixes that particular bug.  It doesn’t hurt to mention the bug number here either.

Don’t be afraid to use simple formatting here either.  A good many Git-related tools support markdown formatting, so feel free to use whatever you like.  Just be sure to leave an empty line after the title since that’s what git will be expecting for its longer-form messages, as will other git-related tools.
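
Putting the title and message together, a well-formed commit message might look something like this (re-using the hypothetical FizzyWizzBanger example from above; the bug number and details are invented purely for illustration):

Add the FizzyWizzBanger class

Order totals were occasionally off by a cent because the old code summed
prices as floating-point dollars (bug #123).  FizzyWizzBanger recomputes
the totals using integer cents instead, which eliminates the rounding
error.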

✧✧✧

Building up your codebase using a series of well-formed commits has a number of major benefits.  

  • other developers will be able to understand what changed and why much more easily with small, focused commits with clear messages
  • it creates a commit log which reads well, whether using the full log format or just the “one-line” version
  • it is much easier to test small changes
  • not allowing any commits without proper tests means not playing catch-up with a massive and under-tested codebase later on
  • tracking down problems later is much easier if you can isolate the problem to a single commit, and that commit isn’t massive
  • merging branches is much easier (and git is much better at avoiding conflicts altogether) if you work in smaller increments

So, before you start typing… stop to think about what small, specific improvement you’re going to make to the codebase next, and make it as perfect and tidy as you can before moving along to the next one.

Canvas vs. SVG

When I began my start-up, I knew the product was going to focus heavily on drawing high-quality graphs, so I spent quite a while looking at the various options.

 

Server Side Rendering

The first option, and the one that’s been around the longest, is to do all the rendering on the server and then send the resulting images back to the client.  This is essentially what Google Maps did originally (though they’ve added more and more client-rendered elements over the years).  To do this, you’ll need some kind of image rendering environment on the server (e.g., Java2D), and, of course, a server capable of serving up the images thus produced.

I decided to skip this option because it adds latency and dramatically cuts down on interactivity and the ability to animate transitions.  Both were very important to me, so this solution was clearly not a good fit.

 

Canvas

The second option was to use HTML5’s Canvas element.  This allows you to specify a region of your webpage as a freely drawn canvas which allows all the typical 2D drawing operations.  The result is a raster image which is drawn completely using JavaScript operations.

While Canvas has the advantage of giving a lot more control on the client side, you’re still stuck with a raster image which needs to be redrawn from scratch for each change.  You also lose the structure of the scene (since it’s all reduced to pixels), and therefore lose the functionality provided by the DOM (e.g., CSS and event handlers).

 

SVG

The final option I considered (and therefore the one I chose) was using Scalable Vector Graphics (SVG).  SVG is a mark-up language, much like HTML, which is specifically designed to represent vector graphics.  You get pretty much all the same primitives as with Canvas, except these are all represented as elements in the DOM, and remain accessible via JavaScript even after everything is rendered.  In particular, both CSS and JavaScript can be applied to the elements of an SVG document, making it much more suitable for the kind of interactive graphics I had in mind.

As an added incentive for using SVG, there is an excellent library called Data Driven Documents (D3) which provides an amazing resource for manipulating SVG documents with interactivity, animations, and a lot more.  More than anything else, the existence of D3 convinced me to use SVG for my custom graphics needs.
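
To give a feel for why that combination is so appealing, here’s a minimal sketch in CoffeeScript (assuming D3’s classic select/data/enter API and a page which has already loaded d3; the data and sizes are made up) which draws one SVG circle per data point and leaves each one styleable with ordinary CSS:

data = [10, 20, 30]

svg = d3.select('body').append('svg')
    .attr('width', 300)
    .attr('height', 100)

svg.selectAll('circle')
    .data(data)
    .enter().append('circle')
    .attr('class', 'data-point')      # ordinary CSS rules can style these
    .attr('cx', (d, i)-> 60 * i + 50)
    .attr('cy', 50)
    .attr('r', (d)-> d)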

Streams in Node.js, part 2: Object Streams

In my first post on streams, I discussed the Readable, Writable and Transform classes and how you override them to create your own sources, sinks and filters.

However, where Node.js streams diverge from more classical models (e.g., pipelines in the shell) is in object streams.  Each of the three stream classes can work with objects (instead of buffers of bytes) by setting the objectMode option to true in the options argument passed to the parent class constructor.  From that point on, the stream deals in individual objects (instead of chunks of bytes) as the medium of the stream (there’s a small sketch of this right after the list below).

This has a few direct consequences:

  1. Readable objects are expected to call push once per object, and each argument is treated as a new element in the stream.
  2. Writable objects will receive a single object at a time as the first argument to their _write methods, and the method will be called once for each object in the stream.
  3. Transform objects see both of these changes, since they act as both a source and a sink.
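
Here’s a minimal sketch of what that looks like in practice (the class and its minTotal threshold are invented for illustration): a Transform running in object mode which only passes along orders above a minimum total.

{Transform} = require 'stream'

# A hypothetical object-mode filter: drop any order below a minimum total.
class BigOrders extends Transform
    constructor: (minTotal)->
        super objectMode: true
        @minTotal = minTotal

    _transform: (order, encoding, callback)->
        # `order` arrives as a plain JavaScript object, not a Buffer.
        @push order if order.total >= @minTotal
        callback()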

 

Application: Tax Calculations

At first glance, it may not be obvious why object streams are so useful.  Let me provide a few examples to show why.  For the first example, consider performing tax calculations for a meal in a restaurant.  There are a number of different steps, and the outcome for each step often depends upon the results of another.  The whole thing can get very complex.  Object streams can be used to break things down into manageable pieces.

Let’s simplify a bit and say the steps are:

  1. Apply item-level discounts (e.g., mark the price of a free dessert as $0)
  2. Compute the tax for each item
  3. Compute the subtotal by summing the price of each item
  4. Compute the tax total by summing the tax of each item
  5. Apply check-level discounts (e.g., a 10% discount for poor service)
  6. Add any automatic gratuity for a large party
  7. Compute the grand total by summing the subtotal, tax total, and auto-gratuity

Of course, bear in mind that I’m actually leaving out a lot of detail and subtlety here, but I’m sure you get the idea.

You could, of course, write all this in a single big function, but that would be some pretty complicated code, easy to get wrong, and hard to test.  Instead, let’s consider how you might do the same thing with object streams.

First, let’s say we have a Readable which knows how to read orders from a database.  Its constructor is given a connection object of some kind, and the order ID.  The _read method, of course, uses these to build an object which represents the order in memory.  This object is then given as an argument to the push method.

Next, let’s say each of the calculation steps above is separated into its own Transform object.  Each one will receive the object created by the Readable, and will modify it by adding on the extra data it’s responsible for.  So, for example, the second transform might look for an items array on the object, and then loop through it, adding a taxTotal property with the appropriate computed value to each item.  It would then call its own push method, passing along the primary object for the next Transform.

After having passed from one Transform to the next, the order object created by the Readable would wind up with all the proper computations having been tacked on, piece-by-piece, by each object.  Finally, the object would be passed to a Writable subclass which would store all the new data back into the database.

Now that each step is nicely isolated with a very clear and simple interface (i.e., pass an object, get one back), it’s very easy to test each part of the calculation in isolation, or to add in new steps as needed.
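
As a concrete sketch of just one of those steps (the class name, tax rate, and wiring are invented for illustration, and real tax logic would be far more involved), the item-tax Transform might look something like this:

{Transform} = require 'stream'

# Hypothetical step 2: compute the tax for each item on the order.
class ItemTaxCalculator extends Transform
    constructor: (taxRate = 0.08)->
        super objectMode: true
        @taxRate = taxRate

    _transform: (order, encoding, callback)->
        for item in order.items
            item.taxTotal = item.price * @taxRate
        @push order
        callback()

# Wiring the hypothetical pipeline together might then look like:
#   orderReader
#       .pipe(new ItemDiscounter())
#       .pipe(new ItemTaxCalculator())
#       .pipe(new SubtotalCalculator())
#       .pipe(orderWriter)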

Streams in Node.js, part 1: Basic concepts

When I started with Node.js, I came to it with the context of a lot of different programming environments, from Objective-C to C# to Bash.  Each of these has a notion of processing a large data set by operating on little bits at a time, and I expected to find something similar in Node.  However, given Node’s way of embracing the asynchronous, I’d expected it to be something quite different.

What I found was actually more straight-forward than I’d expected.  In a typical stream metaphor, you have sources which produce data, filters which modify data, and sinks which consume data.  In Node.js, these are represented by three classes from the stream module: Readable, Transform and Writable.  Each of them is very simple to override to create your own, and the result is a very nicely factored set of classes.

Overriding Readable

As the “source” part of the stream metaphor, Readable subclasses are expected to provide data.  Any Readable can have data pushed into it manually by calling the push method.  The addition of new data immediately triggers the appropriate events, which make the data trickle downstream to any listeners.

When making your own Readable, you override the pseudo-hidden _read(size) function.  This is called by the machinery of the stream module whenever it determines that more data is needed from your class.  You then do whatever it is you have to do to get the data, and end by calling the push method to make it available to the underlying stream machinery.

You don’t have to worry about pushing too much data (multiple calls to push are handled gracefully), and when you’re done, you just push null to end the stream.

Here’s a simple Readable (in CoffeeScript) which returns words from a given sentence:

{Readable} = require 'stream'

class Source extends Readable
    constructor: (sentence)->
        super()
        @words = sentence.split ' '
        @index = 0
    _read: ->
        if @index < @words.length
            # Push the next word and advance so we don't repeat ourselves.
            @push @words[@index]
            @index += 1
        else
            # Pushing null signals the end of the stream.
            @push null

Overriding Writable

The Writable provides the “sink” part of the stream metaphor.  To create one of your own, you only need to override the _write(chunk, encoding, callback) method.  The chunk argument is the data itself (typically a Buffer with some bytes in it).  The encoding argument tells you the encoding of the bytes in the chunk argument if it was translated from a String.  Finally, you are expected to call callback when you’re finished (with an error if something went wrong).

Overriding Writable is about as easy as it gets.  Your _write method will be called whenever new data arrives, and you just need to deal with it as you like.  The only slight complexity is that, depending upon how you set up the stream, you may get a Buffer, a String, or a plain JavaScript object, and you may need to be ready to deal with multiple input types.  Here’s a simple example which accepts any type of data and writes it to the console:

{Writable} = require 'stream'

class Sink extends Writable

    _write: (chunk, encoding, callback)->
        # Reduce whatever arrives to a string we can print.
        if Buffer.isBuffer chunk
            text = chunk.toString()
        else if typeof(chunk) is 'string'
            text = chunk
        else
            text = chunk.toString()

        console.log text
        callback()

Overriding Transform

A Transform fits between a source and a sink, and allows you to transform the data in any way you like.  For example, you might have a stream of binary data flowing through a Transform which compresses the data, or you might have a text stream flowing through a Transform which capitalizes all the letters.

Transforms don’t actually have to output data each time they receive data, however.  So, you could have a Transform which breaks up an incoming binary stream into lines of text by buffering enough raw data until a full line is received, and only at that point emitting the string as a result.  In fact, you could even have a Transform which merely counts the lines, and only emits a single integer when the end of the stream is reached.

Fortunately, creating your own Transform is nearly the same as writing a class which implements both Readable and Writable.  In this case, however, instead of overriding the _write(chunk, encoding, callback) method, you override the _transform(chunk, encoding, callback) method.  And, instead of overriding the _read method to gather data in preparation for calling push, you simply call push from within your _transform method.

Here’s a small example of a transform which capitalizes letters:

{Transform} = require 'stream'

class Capitalizer extends Transform
    _transform: (chunk, encoding, callback)->
        # Chunks arrive as Buffers by default, so convert to text first.
        text = chunk.toString()
        text = text.toUpperCase()
        @push text
        callback()
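
Tying the three examples together, a quick usage sketch (the sentence is arbitrary) looks like this:

source = new Source 'the quick brown fox'
source.pipe(new Capitalizer()).pipe(new Sink())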

✧✧✧

All this is very interesting, but hardly unique to the Node.js platform. Where things get really interesting is when you start dealing with Object streams. I’ll talk more about those in a future post.

Gotchas: Searching across Redis keys

A teammate introduced me to Redis at a prior job, and ever since, I’ve been impressed with it. For those not familiar with it, it’s a NoSQL, in-memory database which stores a variety of data types, from simple strings all the way to lists, sorted sets, and hashes, and which nevertheless has a solid replication and back-up story.

In any case, as you walk through the tutorial, you notice the convention is for keys to be separated by colons, for example: foo:bar:baz. You also see that nearly every command expects to be given a key to work with. If you want to set and then retrieve a value, you might use:

> SET foo:bar:baz "some value to store"
OK
> GET foo:bar:baz
"some value to store"

Great. At this point, you might want to fetch or delete all of the values pertaining to the whole “bar” sub-tree. Perhaps, you’d expect it to work something like this:

> GET foo:bar:*
> DEL foo:bar:*

Well… too bad; it doesn’t work like that. The only commands which accept a pattern like this (KEYS, and its cursor-based cousin SCAN) return only the keys which match: not the data. Without getting into too much detail, there are legitimate performance reasons for not supporting it, but it was something of a surprise to me to find out.
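
For example (assuming only the key set above exists), KEYS will happily list the matching keys, but that’s all you get:

> KEYS foo:bar:*
1) "foo:bar:baz"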

However, all is not lost. Redis does support a Hash data structure which lets you store and access multiple fields under a single key. Along with this data structure come commands to manipulate both individual fields and the entire hash at once.
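
A sketch of what that looks like (re-using the key from the earlier example): store the value as a field of a foo:bar hash, and you can then fetch or delete the whole “sub-tree” with a single command.

> HSET foo:bar baz "some value to store"
(integer) 1
> HGETALL foo:bar
1) "baz"
2) "some value to store"
> DEL foo:bar
(integer) 1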

Redis: Storing time-series data in a sorted set

Along with a regular Set data type, Redis also has a Sorted Set data type. Like a regular set, you can add unique items to the set using the ZADD command and fetch them back again using various other commands.

The main difference with a sorted set is that when you add an item, you must also provide a “score” which is used to determine the item’s order in the set. The score can be any number (Redis stores scores as double-precision floats), and can be completely unrelated to both the key and the value being stored.

It is important to note that a sorted set is still a set! If you add identical values multiple times, the item will only appear once using the last score assigned. For this reason, you can’t store arbitrary data in a set unless you can guarantee each value is unique (i.e., it is either a unique identifier or is a data blob containing a unique identifier).

In my case, I wanted to store time-series data in a sorted set, but, since I can be pretty sure that the values won’t be unique, I can’t use a plain sorted set (i.e., if the value on Monday is 10, and the value on Wednesday is 10, then the Monday value gets clobbered since a set only stores unique values). However, I did figure out a handy work-around using Redis’s built-in scripting capabilities.

Why use a sorted set?

The very handy thing about sorted sets is that you can use a single command to fetch all the values within a certain time frame. Using ZADD, you store a given value in a key with a score which represents its sort order. For time-series data, the key can be anything you like, the score should be the time of the event (as an integer), and the value should contain the data.

Once the data is loaded, the ZRANGEBYSCORE command can fetch a sorted list of values whose scores fall within a certain range.

But what about uniqueness?

The problem is that a sorted set is still a set, and it won’t keep duplicate values (even if they have different scores). Let’s take an example. Suppose you’re measuring the temperature outside your home each day. Let’s further suppose that on Monday, the temperature was 52°. Your command would be:

ZADD temp:house 1421308800000 52

On Tuesday, perhaps it was a bit warmer:

ZADD temp:house 1421395200000 65

But, on Wednesday, it got colder again:

ZADD temp:house 1421481600000 52

Now, we’ve got problems. Since uniqueness only pertains to the value, regardless of the score, you just changed the timestamp of 52° to Wednesday, thus deleting the data recorded on Monday.
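
You can see the damage with a quick range query (a sketch of the redis-cli output): only two members remain, and 52 now carries Wednesday’s timestamp.

> ZRANGEBYSCORE temp:house -inf +inf WITHSCORES
1) "65"
2) "1421395200000"
3) "52"
4) "1421481600000"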

How to create uniqueness

To work around this issue of uniqueness, I decided to add the timestamp to the data as well as the score. One could certainly do this on the client side (e.g., by encoding both temperature and timestamp in JSON), but I wanted to be able to access that data from within Redis’s scripting environment for other reasons.

It turns out, the Redis scripting language was also the key to encoding the data as I needed. To start, I used the SCRIPT LOAD command to store a Lua script:

local value = cmsgpack.pack({ARGV[2], ARGV[1]})
redis.call('zadd', KEYS[1], ARGV[1], value)
return redis.status_reply('ok')

The result of the command is to load the script and return an identifier you can later use to call it (which happens to be the SHA of the script). The script itself uses the built-in cmsgpack library to create a very efficient binary-encoded representation of the two pieces of data: the value and its timestamp.
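
Loading it from redis-cli might look like this (with the script collapsed onto one line; the returned SHA is shown as the same <sha> placeholder used below):

> SCRIPT LOAD "local value = cmsgpack.pack({ARGV[2], ARGV[1]}); redis.call('zadd', KEYS[1], ARGV[1], value); return redis.status_reply('ok')"
"<sha>"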

Now, instead of just using ZADD to store my values, I invoke my script with the EVALSHA command:

EVALSHA <sha> 1 temp:house 1421481600000 52

Now, instead of storing 52° as my value, I’ve got a binary record which combines 52 and 1421481600000, and which is therefore unique to that moment in time.

Getting the data back again

In my case, I wanted to use the data by performing calculations on it in another Lua script. Without getting into too much detail, I basically just used the SCRIPT LOAD and EVALSHA commands to execute a Lua script containing:

for k, v in ipairs(redis.call('zrangebyscore', KEYS[1], ARGV[1], ARGV[2])) do
    local value = tonumber(cmsgpack.unpack(v)[1])
    -- ...the actual calculations on `value` go here...
end

In this example, ARGV represents the arguments passed to the EVALSHA command, KEYS is the list of keys passed to the command, redis.call is a function used to run regular Redis commands from within the Lua script, and cmsgpack.unpack takes the stored value and returns the array of fields it contains.
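
To make that concrete, a complete (if hypothetical) version of such a script might compute the average temperature over a time range; the averaging logic here is invented for illustration:

local sum = 0
local count = 0
for k, v in ipairs(redis.call('zrangebyscore', KEYS[1], ARGV[1], ARGV[2])) do
    -- Each stored value is a msgpack-encoded {temperature, timestamp} pair.
    sum = sum + tonumber(cmsgpack.unpack(v)[1])
    count = count + 1
end
if count == 0 then
    return 0
end
-- Return a string so Redis doesn't truncate the float to an integer.
return tostring(sum / count)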

From there, my Lua script goes to work!