What makes a good commit?

It is distressingly common for me to see commits in a git repository which look something like:

commit 6a69053d2419b311cc38c9fdbbbac75b5a62c4fa
Author: dev-dude <dev-dude@somco.com>
Date: Sun Mar 29 14:05:46 2015 +0200

    fix missing commas

Such a commit will often be half a complete thought (if even that much), have no tests, and carry a message which explains exactly nothing about why the change was made, or what its intended consequences are.  We can do a lot better.


A good commit is much like a good sentence: a well-formed, grammatically correct, complete expression of a single thought.  In the case of a commit, this means that it should:

  • make exactly one cohesive change to the code
  • include tests to verify the change
  • have a title which clearly—but briefly—indicates what was done
  • have a message which describes why it was done



The most important virtue of a well-formed commit is good cohesion, or: doing only one thing.  This could be a refactoring in preparation for some new feature to be added, it could be adding that feature, it could be fixing a bug.  The actual size of the commit doesn’t matter so terribly much (though smaller is generally better), but that it clearly represents a single change does.

As a short-cut for deciding whether your change is cohesive or not, consider how easily you could come up with the title for the commit.  Can you, in fewer than 70 characters, state clearly what your change does?  If not, it’s probably not very cohesive.  If you find yourself tempted to make a list as your commit title, you definitely don’t have good cohesion.

In such cases, stop and ask yourself: “what is the smallest thing I could possibly do to this codebase which would be a clear improvement?”.  In a single pull request, it’s not at all uncommon for me to have more than one of these kinds of commits:

  • clear up some stylistic problems with old code
  • refactor various parts of the code I’m about to touch
  • add a new class which will be part of a new feature I’m about to add
  • tie a bunch of prior commits together in a way which makes the new feature visible to the user
  • update documentation

Of course, depending upon the exact nature of the work, I may not need all of those, but it will be very common for me to have at least a few of them.



In my opinion, tests deserve as much care as any other code, and it’s quite possible to have too many or too few.  That’s a topic for another time.  For now, I just want to point out that any new commit should include any appropriate tests related to that new code.  Of course, it’s possible that it’s appropriate not to add new tests (e.g., for a refactoring), but if a commit doesn’t have tests, it should only ever be because you considered it, and decided there weren’t any new tests which were needed.


Commit Title

Git actually does have a proper format for the commit message; it’s not just arbitrary.  Remember that git is a command-line tool, and is often used in environments where space can be limited.  To that end, there’s a definite format.

In this format, the first line is considered a title, and it should be a present-tense, imperative sentence stating what changed with this commit (e.g., “Add the FizzyWizzBanger class”).  This should be less than 70 characters so that it fits on a single terminal line in the various “one-line” versions of the commit log; git itself will print a longer title in full, but it tends to wrap, or to be truncated by other git-related tools.  Ideally, it will be sufficiently differentiated from other commit messages so that you can readily figure out which one of several commits it is, even out of context.


Commit Message

The question of “what” the commit changes should really be summarized by the title, and made obvious from the code.  In the commit message, you should focus on why you made this change.  Perhaps it fixes a certain bug.  If so, go ahead and explain—in a sentence or two—why this change fixes that particular bug.  It doesn’t hurt to mention the bug number here either.

Don’t be afraid to use simple formatting here either.  A good many Git-related tools support markdown formatting, so feel free to use whatever you like.  Just be sure to leave an empty line after the title since that’s what git will be expecting for its longer-form messages, as will other git-related tools.
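To make the format concrete, here is a minimal sketch (in JavaScript, for Node.js) of the kind of check one might run from a commit hook.  The checkCommitMessage function and the exact rules it applies are assumptions made for illustration; git itself enforces none of this.

```javascript
// Hypothetical helper: checks a commit message against the conventions above.
// Returns a list of problems; an empty list means the message looks well-formed.
function checkCommitMessage(message) {
  const problems = [];
  const lines = message.split('\n');
  const title = lines[0];

  if (title.trim().length === 0) {
    problems.push('title is empty');
  }
  if (title.length >= 70) {
    problems.push('title should be less than 70 characters');
  }
  // git and related tools expect an empty line between the title and the body.
  if (lines.length > 1 && lines[1].trim() !== '') {
    problems.push('second line should be empty');
  }
  // The body is where the "why" goes, so flag messages which have none.
  const body = lines.slice(2).join('\n').trim();
  if (body.length === 0) {
    problems.push('no message body explaining why the change was made');
  }
  return problems;
}

const good = [
  'Add the FizzyWizzBanger class',
  '',
  'Explain here, in a sentence or two, why the change was made.',
].join('\n');

console.log(checkCommitMessage(good));
console.log(checkCommitMessage('fix missing commas'));
```

The first call reports no problems, while the second flags the missing explanation.  No script can judge whether the explanation is any good, of course; that part is still up to you.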


Building up your codebase using a series of well-formed commits has a number of major benefits:

  • other developers will be able to understand what changed and why much more easily with small, focused commits with clear messages
  • it creates a commit log which reads well, whether using the full log format or just the “one-line” version
  • it is much easier to test small changes
  • not allowing any commits without proper tests means not playing catch-up with a massive and under-tested codebase later on
  • tracking down problems later is much easier if you can isolate the problem to a single commit, and that commit isn’t massive
  • merging branches is much easier (and git is much better at avoiding conflicts altogether) if you work in smaller increments

So, before you start typing… stop to think about what small, specific improvement you’re going to make to the codebase next, and make it as perfect and tidy as you can before moving along to the next one.

Canvas vs. SVG

When I began my start-up, I knew the product was going to focus heavily on drawing high-quality graphs, so I spent quite a while looking at the various options.


Server Side Rendering

The first option, and the one that’s been around the longest, is to do all the rendering on the server and then send the resulting images back to the client.  This is essentially what Google Maps did (though they’ve added more and more client-rendered elements over the years).  To do this, you’ll need some kind of image rendering environment on the server (e.g., Java2D), and, of course, a server capable of serving up the images thus produced.

I decided to skip this option because it adds latency and dramatically cuts down on interactivity and the ability to animate transitions.  Both were very important to me, so this solution was clearly not a good fit.



The second option was to use HTML5’s Canvas element.  This allows you to specify a region of your webpage as a freely drawn canvas which allows all the typical 2D drawing operations.  The result is a raster image which is drawn completely using JavaScript operations.

While Canvas has the advantage of giving you a lot more control on the client side, you’re still stuck with a raster image which needs to be drawn from scratch for each change.  You also lose the structure of the scene (since it’s all reduced to pixels), and therefore lose the functionality provided in the DOM (e.g., CSS and event handlers).



The final option I considered (and therefore the one I chose) was using Scalable Vector Graphics (SVG).  SVG is a mark-up language, much like HTML, which is specifically designed to represent vector graphics.  You get pretty much all the same primitives as with Canvas, except these are all represented as elements in the DOM, and remain accessible via JavaScript even after everything is rendered.  In particular, both CSS and JavaScript can be applied to the elements of an SVG document, making it much more suitable for the kind of interactive graphics I had in mind.
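To illustrate the point about elements, here is a small sketch in JavaScript which builds the mark-up for a bar chart.  The barChartSvg function, the "bar" class name and the sizing defaults are all made up for this example; it is not the code from my product, just a picture of what an SVG document looks like.

```javascript
// Builds an SVG bar chart as mark-up. Every bar is its own <rect> element,
// so it can be styled with CSS and given event handlers once it is in a page.
function barChartSvg(values, barWidth = 20, height = 100) {
  const max = Math.max(...values);
  const bars = values.map((value, i) => {
    const barHeight = Math.round((value / max) * height);
    const x = i * barWidth;
    const y = height - barHeight; // SVG's y axis points downwards
    return `  <rect class="bar" x="${x}" y="${y}" width="${barWidth - 2}" height="${barHeight}"/>`;
  });
  const width = values.length * barWidth;
  return [
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">`,
    ...bars,
    '</svg>',
  ].join('\n');
}

console.log(barChartSvg([3, 7, 5]));
```

Once that mark-up is in a page, a CSS rule such as .bar:hover { fill: orange; } applies to each bar individually.  With Canvas, the same bars would just be pixels, and you would have to work out for yourself which one the mouse was over.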

As an added incentive for using SVG, there is an excellent library called Data Driven Documents (D3) which provides an amazing resource for manipulating SVG documents with interactivity, animations, and a lot more.  More than anything else, the existence of D3 decided me on using SVG for my custom graphic needs.

Streams in Node.js, part 2: Object Streams

In my first post on streams, I discussed the Readable, Writable and Transform classes and how you override them to create your own sources, sinks and filters.

However, Object streams are where Node.js streams diverge from more classical models (e.g., from the shell).  Each of the three types of stream objects can work with objects (instead of buffers of bytes) by setting objectMode to true in the options argument passed to the parent class constructor.  From that point on, the stream will deal with individual objects (instead of groups of bytes) as the medium of the stream.

This has a few direct consequences:

  1. Readable objects are expected to call push once per object, and each argument is treated as a new element in the stream.
  2. Writable objects will receive a single object at a time as the first argument to their _write methods, and the method will be called once for each object in the stream.
  3. Transform objects have the same changes as both of the other two objects.
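As a minimal sketch of the first two points (in plain JavaScript rather than CoffeeScript), here is an object-mode source piped into an object-mode sink.  The OrderSource and Collector names are invented for the example.

```javascript
const { Readable, Writable } = require('stream');

// A source which emits plain JavaScript objects instead of bytes.
class OrderSource extends Readable {
  constructor(orders) {
    super({ objectMode: true });
    this.orders = orders.slice();
  }
  _read() {
    // One push per object; pushing null marks the end of the stream.
    this.push(this.orders.length > 0 ? this.orders.shift() : null);
  }
}

// A sink which receives exactly one object per _write call.
class Collector extends Writable {
  constructor() {
    super({ objectMode: true });
    this.received = [];
  }
  _write(order, encoding, callback) {
    this.received.push(order);
    callback();
  }
}

const collector = new Collector();
new OrderSource([{ id: 1 }, { id: 2 }])
  .pipe(collector)
  .on('finish', () => console.log(collector.received));
```

The encoding argument is still passed to _write in object mode, but it carries nothing useful there and can be ignored.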


Application: Tax Calculations

At first glance, it may not be obvious why object streams are so useful.  Let me provide a few examples to show why.  For the first example, consider performing tax calculations for a meal in a restaurant.  There are a number of different steps, and the outcome for each step often depends upon the results of another.  The whole thing can get very complex.  Object streams can be used to break things down into manageable pieces.

Let’s simplify a bit and say the steps are:

  1. Apply item-level discounts (e.g., mark the price of a free dessert as $0)
  2. Compute the tax for each item
  3. Compute the subtotal by summing the price of each item
  4. Compute the tax total by summing the tax of each item
  5. Apply check-level discounts (e.g., a 10% discount for poor service)
  6. Add any automatic gratuity for a large party
  7. Compute the grand total by summing the subtotal, tax total, and auto-gratuity

Of course, bear in mind that I’m actually leaving out a lot of detail and subtlety here, but I’m sure you get the idea.

You could, of course, write all this in a single big function, but that would be some pretty complicated code, easy to get wrong, and hard to test.  Instead, let’s consider how you might do the same thing with object streams.

First, let’s say we have a Readable which knows how to read orders from a database.  Its constructor is given a connection object of some kind, and the order ID.  The _read method, of course, uses these to build an object which represents the order in memory.  This object is then given as an argument to the push method.

Next, let’s say each of the calculation steps above is separated into its own Transform object.  Each one will receive the object created by the Readable, and will modify it by adding on the extra data it’s responsible for.  So, for example, the second transform might look for an items array on the object, and then loop through it adding a taxTotal property with the appropriate computed value for each item.  It would then call its own push method, passing along the primary object for the next Transform.

After having passed from one Transform to the next, the order object created by the Readable would wind up with all the proper computations having been tacked on, piece-by-piece, by each object.  Finally, the object would be passed to a Writable subclass which would store all the new data back into the database.

Now that each step is nicely isolated with a very clear and simple interface (i.e., pass an object, get one back), it’s very easy to test each part of the calculation in isolation, or to add in new steps as needed.
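Here is a rough sketch of a few of those steps in plain JavaScript.  Everything specific in it is an assumption for the sake of the example: the step helper, the flat 8% tax rate, prices held as whole cents, and the property names (tax on each item; subtotal, taxTotal and grandTotal on the order).  Readable.from stands in for the database-backed Readable, and the sink just hands the finished order back rather than storing it.

```javascript
const { Readable, Transform, Writable, pipeline } = require('stream');

// Hypothetical helper: wraps a function which modifies an order in an
// object-mode Transform that passes the order along to the next step.
function step(modify) {
  return new Transform({
    objectMode: true,
    transform(order, encoding, callback) {
      modify(order);
      callback(null, order); // same as this.push(order) followed by callback()
    },
  });
}

const TAX_PERCENT = 8; // assumed flat rate, purely for illustration

const computeItemTax = () => step((order) => {
  for (const item of order.items) {
    item.tax = Math.round((item.price * TAX_PERCENT) / 100); // prices in cents
  }
});
const computeSubtotal = () => step((order) => {
  order.subtotal = order.items.reduce((sum, item) => sum + item.price, 0);
});
const computeTaxTotal = () => step((order) => {
  order.taxTotal = order.items.reduce((sum, item) => sum + item.tax, 0);
});
const computeGrandTotal = () => step((order) => {
  order.grandTotal = order.subtotal + order.taxTotal;
});

// Runs one order through every step and resolves with the finished order.
function calculate(order) {
  return new Promise((resolve, reject) => {
    let result;
    const sink = new Writable({
      objectMode: true,
      write(finished, encoding, callback) {
        result = finished; // a real sink would store this in the database
        callback();
      },
    });
    pipeline(
      Readable.from([order]), // stands in for the database-backed Readable
      computeItemTax(),
      computeSubtotal(),
      computeTaxTotal(),
      computeGrandTotal(),
      sink,
      (err) => (err ? reject(err) : resolve(result))
    );
  });
}

calculate({ items: [{ price: 1000 }, { price: 250 }] })
  .then((order) => console.log(order));
```

Because each step is just a function wrapped in a Transform, testing one of them means piping a single hand-built order through it and inspecting what comes out.  Adding a discount or gratuity step means slotting one more line into the pipeline call.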

Streams in Node.js, part 1: Basic concepts

When I started with Node.js, I came to it with the context of a lot of different programming environments, from Objective-C to C# to Bash.  Each of these has a notion of processing large data sets by operating on little bits at a time, and I expected to find something similar in Node.  However, given Node’s way of embracing the asynchronous, I’d expected it to be something quite different.

What I found was actually more straight-forward than I’d expected.  In a typical stream metaphor, you have sources which produce data, filters which modify data, and sinks which consume data.  In Node.js, these are represented by three classes from the stream module: Readable, Transform and Writable.  Each of them is very simple to override to create your own, and the result is a very nicely factored set of classes.

Overriding Readable

As the “source” part of the stream metaphor, Readable subclasses are expected to provide data.  Any Readable can have data pushed into it manually by calling the push method.  The addition of new data immediately triggers the appropriate events which makes the data trickle downstream to any listeners.

When making your own Readable, you override the pseudo-hidden _read(size) function.  This is called by the machinery of the stream module whenever it determines that more data is needed from your class.  You then do whatever it is that you have to do to get the data and end by calling the push method to make it available to the underlying stream machinery.

You don’t have to worry about pushing too much data (multiple calls to push are handled gracefully), and when you’re done, you just push null to end the stream.

Here’s a simple Readable (in CoffeeScript) which returns words from a given sentence:

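{Readable} = require 'stream'

class Source extends Readable
    constructor: (sentence)->
        super()
        @words = sentence.split ' '
        @index = 0
    _read: ->
        if @index < @words.length
            # Push one word per call, then advance to the next one
            @push @words[@index++]
        else
            # No words left: pushing null ends the stream
            @push null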

Overriding Writable

The Writable provides the “sink” part of the stream metaphor.  To create one of your own, you only need to override the _write(chunk, encoding, callback) method.  The chunk argument is the data itself (typically a Buffer with some bytes in it).  The encoding argument tells you the encoding of the bytes in the chunk argument if it was translated from a String.  Finally, you are expected to call callback when you’re finished (with an error if something went wrong).

Overriding Writable is about as easy as it gets.  Your _write method will be called whenever new data arrives, and you just need to deal with it as you like.  The only slight complexity is that, depending upon how you set up the stream, you may get either a Buffer, a String, or a plain JavaScript object, and you may need to be ready to deal with multiple input types.  Here’s a simple example which accepts any type of data and writes it to the console:

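{Writable} = require 'stream'

class Sink extends Writable

    _write: (chunk, encoding, callback)->
        if Buffer.isBuffer chunk
            # For a Buffer, encoding is just 'buffer', so decode as UTF-8
            text = chunk.toString()
        else if typeof(chunk) is 'string'
            text = chunk
        else
            text = chunk.toString()

        console.log text
        # Tell the stream machinery this chunk has been handled
        callback()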

Overriding Transform

A Transform fits between a source and a sink, and allows you to transform the data in any way you like.  For example, you might have a stream of binary data flowing through a Transform which compresses the data, or you might have a text stream flowing through a Transform which capitalizes all the letters.

Transforms don’t actually have to output data each time they receive data, however.  So, you could have a Transform which breaks up an incoming binary stream into lines of text by buffering enough raw data until a full line is received, and only at that point, emitting the string as a result.  In fact, you could even have a Transform which merely counts the lines, and only emits a single integer when the end of the stream is reached.
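A line counter of that sort might look something like the following sketch (in plain JavaScript).  It leans on one piece of the stream module not covered in this post: the _flush method, which a Transform may implement and which is called once all the incoming data has been consumed.  The LineCounter name is invented for the example.

```javascript
const { Readable, Transform } = require('stream');

// Counts newline characters as data flows through, emitting nothing until
// the end of the stream, when the total is pushed as a single value.
class LineCounter extends Transform {
  constructor() {
    // Bytes come in, but a single number goes out.
    super({ readableObjectMode: true });
    this.count = 0;
  }
  _transform(chunk, encoding, callback) {
    for (const byte of chunk) {
      if (byte === 0x0a) this.count += 1; // 0x0a is '\n'
    }
    callback(); // note: no push here
  }
  _flush(callback) {
    // Called once, after the last chunk has been transformed.
    this.push(this.count);
    callback();
  }
}

Readable.from([Buffer.from('one\ntw'), Buffer.from('o\nthree\n')])
  .pipe(new LineCounter())
  .on('data', (count) => console.log(count));
```

Note that the line “two” is split across two chunks.  Counting newlines doesn’t care, but a Transform which emitted whole lines would have to hold on to the partial “tw” until the rest arrived.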

Fortunately, creating your own Transform is nearly the same as writing a class which implements both Readable and Writable.  However, in this case instead of overriding the _write(chunk, encoding, callback) method, you override the _transform(chunk, encoding, callback) method.  And, instead of overriding the _read method to gather data in preparation for calling push, you simply call push from within your _transform method.

Here’s a small example of a transform which capitalizes letters:

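{Transform} = require 'stream'

class Capitalizer extends Transform
    _transform: (chunk, encoding, callback)->
        # chunk is normally a Buffer here; decode it as UTF-8 text
        text = chunk.toString()
        text = text.toUpperCase()
        @push text
        # Signal that this chunk has been fully processed
        callback()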


All this is very interesting, but hardly unique to the Node.js platform. Where things get really interesting is when you start dealing with Object streams. I’ll talk more about those in a future post.

Gotchas: Searching across Redis keys

A teammate introduced me to Redis at a prior job, and ever since, I’ve been impressed with it. For those not familiar with it, it’s a NoSQL, in-memory database which stores a variety of data types from simple strings all the way to sorted sets and hashes, and which nevertheless has a solid replication and back-up story.

In any case, as you walk through the tutorial, you notice the convention is for keys to be separated by colons, for example: foo:bar:baz. You also see that nearly every command expects to be given a key to work with. If you want to set and then retrieve a value, you might use:

> SET foo:bar:baz "some value to store"
> GET foo:bar:baz
"some value to store"

Great. At this point, you might want to fetch or delete all of the values pertaining to the whole “bar” sub-tree. Perhaps, you’d expect it to work something like this:

> GET foo:bar:*
> DEL foo:bar:*

Well… too bad; it doesn’t work like that. The only commands which accept a wildcard are KEYS and SCAN (with its MATCH option), and they only return the keys which match the given pattern, not the data. Without getting into too much detail, there are legitimate performance reasons for this, but it was something of a surprise to me to find out.

However, all is not lost. Redis does support a Hash data structure which allows accessing some or all properties related to a specific key. Along with this data structure come commands to manipulate both individual properties and the entire set.

Redis: Storing time-series data in a sorted set

Along with a regular Set data type, Redis also has a Sorted Set data type. Like a regular set, you can add unique items to the set using the ZADD command and fetch them back again using various other commands.

The main difference with a sorted set is that when you add an item, you must also provide a “score” which is used to determine the item’s order in the set. The score can be any number (Redis stores it as a double-precision floating-point value), and can be completely unrelated to both the key and value being stored.

It is important to note that a sorted set is still a set! If you add identical values multiple times, the item will only appear once using the last score assigned. For this reason, you can’t store arbitrary data in a set unless you can guarantee each value is unique (i.e., it is either a unique identifier or is a data blob containing a unique identifier).

In my case, I wanted to store time-series data in a sorted set, but, since I can be pretty sure that the values won’t be unique, I can’t use a plain sorted set (i.e., if the value on Monday is 10, and the value on Wednesday is 10, then the Monday value gets clobbered since a set only stores unique values). However, I did figure out a handy work-around using Redis’s built-in scripting capabilities.

Why use a sorted set?

The very handy thing about sorted sets is that you can use a single command to fetch all the values within a certain time frame. Using ZADD, you store a given value in a key with a score which represents its sort order. For time-series data, the key can be anything you like, the score should be the time of the event (as an integer), and the value should contain the data.

Once the data is loaded, the ZRANGEBYSCORE command can fetch a sorted list of values whose scores fall within a certain range.

But what about uniqueness?

The problem is that a sorted set is still a set, and it won’t store non-unique values (even if they have different scores). Let’s take an example. Suppose you’re measuring the temperature outside your home each day. Let’s further suppose that on Monday, the temperature was 52°. Your command would be:

ZADD temp:house 1421308800000 52

On Tuesday, perhaps it was a bit warmer:

ZADD temp:house 1421395200000 65

But, on Wednesday, it got colder again:

ZADD temp:house 1421481600000 52

Now, we’ve got problems. Since uniqueness only pertains to the value, regardless of the score, you just changed the timestamp of 52° to Wednesday, thus deleting the data recorded on Monday.

How to create uniqueness

To work around this issue of uniqueness, I decided to add the timestamp to the data as well as the score. One could certainly do this on the client side (e.g., by encoding both temperature and timestamp in JSON), but I wanted to be able to access that data from within Redis’s scripting environment for other reasons.

It turns out, the Redis scripting language was also the key to encoding the data as I needed. To start, I used the SCRIPT LOAD command to store a Lua script:

local value = cmsgpack.pack({ARGV[2], ARGV[1]})
redis.call('zadd', KEYS[1], ARGV[1], value)
return redis.status_reply('ok')

The result of the command is to load the script and return an identifier you can later use to call it (which happens to be the SHA of the script). This uses the cmsgpack built-in library to create a very efficient binary-encoded representation of the two pieces of data.

Now, instead of just using ZADD to store my values, I invoke my script with the EVALSHA command:

EVALSHA <sha> 1 temp:house 1421481600000 52

Now, instead of storing 52° as my value, I’ve got a binary record which combines 52 and 1421481600000, and therefore is unique to that moment in time.
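If you want to see the difference without a Redis server to hand, the following toy model (in JavaScript) mimics just the uniqueness rule. To be clear, ToySortedSet is something invented for this illustration and not a Redis client, and JSON stands in for cmsgpack; only the idea of folding the timestamp into the stored value carries over.

```javascript
// A toy model of a sorted set: a Map from member to score.
// It is NOT a Redis client; it only mimics the uniqueness rule.
class ToySortedSet {
  constructor() {
    this.scores = new Map();
  }
  // Like ZADD: adding an existing member just replaces its score.
  add(score, member) {
    this.scores.set(member, score);
  }
  // Like ZRANGEBYSCORE: members with min <= score <= max, ordered by score.
  rangeByScore(min, max) {
    return [...this.scores.entries()]
      .filter(([, score]) => score >= min && score <= max)
      .sort((a, b) => a[1] - b[1])
      .map(([member]) => member);
  }
}

const readings = [
  [1421308800000, 52], // Monday
  [1421395200000, 65], // Tuesday
  [1421481600000, 52], // Wednesday
];

const plain = new ToySortedSet();
const unique = new ToySortedSet();
for (const [timestamp, temperature] of readings) {
  plain.add(timestamp, String(temperature));
  // JSON stands in for cmsgpack here; the idea is the same.
  unique.add(timestamp, JSON.stringify([temperature, timestamp]));
}

// Only two members survive: Monday's 52 was re-scored to Wednesday.
console.log(plain.rangeByScore(-Infinity, Infinity));
// All three readings survive, in time order.
console.log(unique.rangeByScore(-Infinity, Infinity).map((m) => JSON.parse(m)[0]));
```

The plain set comes back with just two members, while the set with combined values returns all three temperatures in time order.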

Getting the data back again

In my case, I wanted to use the data by performing calculations on it in another Lua script. Without getting into too much detail, I basically just used the SCRIPT LOAD and EVALSHA commands to execute a Lua script containing:

for k, v in ipairs(redis.call('zrangebyscore', KEYS[1], ARGV[1], ARGV[2])) do
    local value = tonumber(cmsgpack.unpack(v)[1])
    -- ... the calculations using value go here ...
end

In this example, ARGV represents the arguments passed to the EVALSHA command, KEYS is the list of keys passed to the command, redis.call is a function used to run regular Redis commands from within the Lua script, and cmsgpack.unpack takes the value and returns an array of the fields it contains.

From there, my Lua script goes to work!