Git 201: Safely Using Rebase

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

The rebase tool in git is extremely powerful, and therefore also rather dangerous. In fact, I’ve known engineers (usually those new to git) who won’t touch it at all. I hope this will convince you that you really can use it safely, and it’s actually very beneficial.

What is rebase?

To start, I’ll briefly touch on what a rebase actually is: a way to rewrite the history of a branch. Let’s assume you’re starting out with a standard working branch from master:
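Sketched as text, the history looks something like this:

...---B---D---F        (master)
       \
        C---E---G      (your branch)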

At this point, let’s say you want to update your branch to contain the new commits on master (i.e., D and F), but you don’t want to create a merge commit in the middle of your work. You could rebase instead:

git rebase master

This command will rewrite your current branch as though it had originally been created starting from F (the tip of master). In order to do that, though, it will need to re-create each of the commits on your branch (C, E, and G). Remember that a commit is a set of changes applied to some prior commit. In our example, C is the changes applied to B, E contains changes applied to C, and G contains changes applied to E. Rebasing means changing things around so that C has F as a parent instead of B.

The problem is that git can’t just change C’s parent because there’s no guarantee that the changes represented by C will result in the same codebase when applied to F instead of B. It might be that you’d wind up with some completely different code if you did that. So, git needs to figure out what result C creates, and then figure out what changes to apply to F in order to create the same result. That will yield a completely new commit which we’ll call CC. Since E was based upon C, which has been replaced, git will need to create a new commit using the same process, which we’ll call EE. And, since E has been replaced, that means we’ll need to replace G with GG. Once all of the commits have been created, git moves the branch pointer to point at the newest commit:
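After the rebase, the picture looks something like this:

...---B---D---F                    (master)
               \
                CC---EE---GG       (your branch)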

While all this seems complicated, it’s all hidden inside of git, and you don’t really have to deal with any of it.  In the end, using rebase instead of merge just means changing a single command, and your commit history is simpler because it appears as though you created your branch from the right place all along.  If you’d like a much fuller tutorial with loads of depth, I’d recommend you head over here.

Rebasing across repos

If you’re working on a branch which only exists locally, then rebasing is pretty straightforward to work with. It’s really when you’re working across multiple clones of a repo (e.g., your local clone, and the one up on GitHub) that things become a little more complicated.

Let’s say you’ve been working on a branch for a while, and somewhere along the way, you pushed the branch back to the origin (e.g., GitHub). Later on, though, you decide you want to rebase to pick up some changes from master. That leaves you in the state we see in this diagram:
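Sketched as text, the two copies of the branch look something like this:

on origin:    ...---B---C---E---G
local copy:   ...---B---D---F---CC---EE---GG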

If you were to pull right now, git would freak out just a bit. Your local version of the branch seems to have three new commits on it (CC, EE, and GG) while it’s missing three others (C, E, and G). Then, when git checks for merge conflicts, there are all sorts of things which seem to conflict (C conflicts with CC, E conflicts with EE, etc.). It’s a complete mess.

So, the normal thing to do here is to force git to push your local version of the branch back to origin:

git push -f

This is telling git to disregard any weirdness between the local version of the branch and origin’s version of the branch, and just make origin look like your local copy. If you’re the only one making changes to the branch, this works just fine. The origin gets your new branch, and you can move right along. But… what if you aren’t the only one making changes?

Where rebasing goes wrong

Imagine if someone else noticed your branch, and decided to help you out by fixing a bug. They clone the repository, checkout your branch, add a commit, and push. Now the origin has all your changes as well as the bug fix from your friend. Except, in the meantime, you decided to rebase. That would mean you’re in a situation like this:
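In text form, the situation is now something like this:

on origin:    ...---B---C---E---G---H           (including your friend's fix)
local copy:   ...---B---D---F---CC---EE---GG    (your rebased version)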

Now you’re stuck. If you pull in order to get commit H, you’re going to have all sorts of nasty conflicts. However, if you force push your branch back to origin (to avoid the conflicts), you’re going to lose commit H since you’re telling git to disregard the version of the branch on origin. And, if your friend neglected to tell you about the bug fix, you might do exactly that and never even realize.

Solution 1: Communicate

The best way to fix the problem is to avoid it in the first place. Communicate clearly with your teammates that this branch is a working branch, and that they shouldn’t push commits onto it. It’s a good idea for teams to adopt clear conventions that make this kind of mistake hard to make (e.g., any branch starting with a username should only be changed by that user, while branches with a “shared”, “team”, or similar prefix are expected to have multiple contributors).

If you can’t be sure you’re the only one working on a branch, the next best thing is to talk with anyone who might be working on it before starting the rebase. Say that you’re going to rebase it, and what they should expect. If anyone speaks up that they’re working on changes to that branch, then you know to hold off.

Once everyone has pushed up any outstanding changes, pull down the latest version of the branch, rebase, and then push everything back up as soon as possible. That looks like this:

git checkout mybranch
git pull
git rebase master
git push -f

Once you’ve finished, you’ll want to tell the other people working on the branch that they need to get the fresh version of the branch for themselves. That looks like:

git checkout master      # step off the branch you're about to replace
git branch -D mybranch   # delete the stale local copy
git checkout mybranch    # check it out again, picking up the rebased version from origin

Solution 2: Restart the rebase

If you find yourself having just rebased and only then learning there are upstream changes you’re missing, the simplest way out is to ditch your rebase. Go back, pull down the changes from the origin, and start over (after referring to solution 1). That would look something like this:

git checkout master
git branch -D mybranch
git checkout mybranch
git rebase master
git push -f

This will switch back to master (1), so that you can delete your local copy of the branch (2), and then grab the most recent version from the origin (3). Now, you can re-apply your rebase (4), and then push up the rebased branch before anyone else has a chance to mess things up again (5).

Solution 3: Start a new branch

If you find that you want to rebase right away, and don’t want to wait to coordinate with others who might be sharing your branch, a good plan is to isolate yourself from the potentially shared branch first, and then do your rebase.

git checkout mybranch
git checkout -b mybranch-2

At this point, you’ve got a brand new branch which only exists on your local machine, so no one else could possibly have done anything to it. That means you can go ahead and rebase all you like. When you push the branch back up to origin (e.g., GitHub), it will be the first time that particular branch has been pushed.

Of course, if someone else has added a commit to the old branch, it will still be stuck over there, and not on your new branch. If you want to get their commit onto your new branch, use git’s cherry-pick feature:

git cherry-pick <hash>

This will create a new commit on your branch which will have the exact same effect on your branch as it did on the old one. Once you’ve rescued any errant commits, you can delete the old branch and continue from the new one.
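Putting it all together, the whole of solution 3 might look something like this (the branch names follow the earlier examples, and the hash placeholder stands in for your friend’s commit):

git checkout mybranch
git checkout -b mybranch-2          # isolate yourself on a fresh, local-only branch
git rebase master
git cherry-pick <hash>              # rescue your friend's commit from the old branch
git push -u origin mybranch-2       # first push of the new branch, so no force needed
git push origin --delete mybranch   # optionally retire the old, shared branch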

✧✧✧

I hope this makes rebasing less scary, and helps you get a sense of when you’d use it and when not. And, of course, should things go wrong, I hope this gives you a good sense of how to recover.

Two last bits of advice… First, before rebasing, create a new branch from the head of the branch you’re going to rebase. That way, should things go completely wrong, you can just delete the rebased branch, and use your backup. And, finally, if you’re in the middle of a rebase which seems to be going a little nuts, you can always bail out by using:

git rebase --abort
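As a concrete sketch of that backup-branch advice (the branch names are just examples):

git checkout mybranch
git branch mybranch-backup    # snapshot of the branch before rebasing
git rebase master
# ...and if the result is hopelessly tangled:
git checkout mybranch-backup
git branch -D mybranch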

So, feel free to experiment!

Improving your estimates

Estimating most projects is necessarily an imprecise exercise, and that imprecision creeps in from many different sources. The goal of this post is to share some tools I’ve learned for reducing those sources of error. Not all of these tools will apply to every project, though, so use this more as a reminder of things to consider when estimating, rather than a strict checklist of things you must do for every project. As always, you are the expert doing the estimating, so it is up to your own best judgement.

Break things into small pieces

When estimating, error is generally reduced by dividing tasks into more and smaller pieces of work. As the tasks get smaller, several beneficial things result:

  • Smaller tasks are generally better understood, and it is easier to compare the task to one of known duration (e.g., some prior piece of work).
  • The absolute error on a smaller task is generally smaller than on a larger one. That is, if you’re off by 50% on an 8 hour task, you’re off by 4 hours. If you’re off by 50% on an 8 day task, you’re off by 4 days.
  • You’re more likely to forget to account for some part of work in a longer task than a shorter one.

As a general rule, it’s a good idea to break a project down into tasks of less than 2 days duration, but your project may be different. Pick a standard which makes sense for the size of project and level of accuracy you need.

Count what can be counted

When estimating a large project, it is often the case that the project is made up of many similar parts. Perhaps it’s an activity which is repeated a number of times, or perhaps there’s some symmetry to the overall structure of the thing being created. Either way, try to figure out if there’s something you already know which is countable, and then try to work out how much time each one requires. You may even be able to time yourself doing one of those repeated items so your estimate is that much more accurate.

Establish a range

When estimating individual tasks (i.e., those which can’t be further subdivided), it is often beneficial to start out by figuring out the range of possible durations. Start by asking yourself: “If everything went perfectly, what is the shortest time I could imagine this taking?” Then, turn it around: “If everything went completely pear-shaped, what is the shortest duration I’d still be willing to bet my life on?” This gives you a best/worst-case range. Now, with all the ways it could go wrong in mind, make a guess about how long you really think it will take.

Get a second opinion

It’s often helpful to get multiple people to estimate the same project, but you can lose a lot of the value in doing so if the different people influence each other prematurely. To avoid that, consider using planning poker. With this technique, each estimator comes up with their own estimate without revealing it to the others. Then, once everyone is finished, they all compare estimates.

Naturally, there are going to be some differences from one person to the next. When these are small, taking an average of all the estimates is fine. However, when the differences are large, it’s often a sign that there’s some disagreement about the scope of the project, what work is required to complete it, or the risks involved in doing so. At this point, it’s good for everyone to talk about how they arrived at their own estimates, and then do another round of private estimates. The tendency is for the numbers to converge pretty rapidly with only a few rounds.

Perform a reality check

Oftentimes, one is asked to estimate a project which is at least similar to a project one has already completed. However, when coming up with a quick estimate, it’s easy to just trust your intuition about how long things will take rather than really examining what you know about specific past projects to see what you can learn. Here’s a set of questions you can ask yourself to try to dredge up that knowledge:

  • The last time you did this, how long was it from when you started to when you actually moved on to another project?
  • What is the riskiest part of this project? What is the worst-case scenario for how long that might take?
  • The last time you did this, what parts took longer than expected?
  • The last time you did this, what did you forget to include in your estimate?
  • How many times have you done this before? How much “learning time” will you need this time around?
  • Do you already have all the tools you need to start? Do you already know how to use them all?

There are loads of other questions you might ask yourself along these lines, and the really good ones will be those which force you to remember why that similar project you’re thinking of was harder / took longer / was more expensive than you expected it to be.

Create an estimation checklist

If you are planning to do a lot of estimating, it can be immensely helpful to cultivate an estimation checklist. This is a list of all the “parts” of the projects you’ve done before. Naturally, this will vary considerably from one kind of project to the next, and not every item in the checklist will apply to every new project, but a checklist can be immensely valuable in helping you not forget things. In my personal experience, I’ve seen more projects be late from the things which were never in the plan than from things which took longer than expected.

✧✧✧

Estimation is super hard, and there’s really no getting around that. You’re always going to have some error bars around your estimates, and, depending upon the part of the project you’re estimating, perhaps considerably large ones. Fortunately, a lot of people have been thinking about this for a long while, and there are a lot of tricks you can use, and a lot of books on the subject you can read, if you’d like to get better. Here’s one I found particularly useful which describes a lot of what I’ve just talked about, and more:


Software Estimation: Demystifying the Black Art

Git 201: Creating a “fixup” commit

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

As I work on a branch, adding commit after commit, I’ll sometimes find that there’s a piece of work I forgot to do which really should have been part of some prior commit.  The first thing to do—before fixing anything—is to save the work I’m doing.  This might mean using a work-in-progress commit as described previously, or simply using the stash.  From there, I have a few options, but in this post I’m going to focus on using the “fixup” command within git’s interactive rebase tool.

The essence of the process is to make a new commit, and then use git to combine it with the commit I want to fix.  The first step is to make my fix, and commit it.  The message here really doesn’t matter since it’s going to get replaced anyway.

> git add -A
> git commit -m "fix errors"

This will add a new commit to the end of the branch, but that’s not actually what I wanted.  Now I need to squash it into the older commit I want to fix.  To do this, I’m going to use an “interactive rebase” command:

> git rebase -i master

This is telling git that I want to edit the history of commits back to where my branch diverged from master (if you originally created your branch from somewhere else, you’ll want to specify that instead).  In response to this request, git is going to create a temporary file on disk somewhere and open up my editor (the same one used for commit messages) with that file loaded.  It will wind up looking something like this:

pick 7e70c43 Add Chicken Tikka Masala
pick e8cc090 Remove low-carb flag from BBQ ribs
pick 9ade3d6 Fix spelling error in BBQ ribs
pick b857991 fix errors

# Rebase 1222f97..b857991 onto 1222f97 (4 TODO item(s))
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to
# bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

All of these commits are the ones I’ve made since I cut my branch, ordered from oldest (at the top) to newest (at the bottom).  The commented-out parts give the instructions for what you can do in this “interactive” portion of the rebase.  Assuming that the fix is for the Chicken Tikka Masala recipe, I’d want to edit the file to look like this:

pick 7e70c43 Add Chicken Tikka Masala
fixup b857991 fix errors
pick e8cc090 Remove low-carb flag from BBQ ribs
pick 9ade3d6 Fix spelling error in BBQ ribs

When I save the file and quit my editor, git is going to rebuild the branch from scratch according to these instructions.  The first line tells git to simply keep commit 7e70c43 as-is.  The next line tells git to remove the prior commit, and replace it with one which is the combination of the prior commit and my fix-up commit, b857991.  The other two commands tell git to create two new commits which result in the same end state as each of the old commits, e8cc090 and 9ade3d6.

As a bit of an aside…  Why does git have to create new commits for the last two commands?  Remember that commits are immutable once created, and that part of the data which makes up a commit is the parent commit it was created from.  Since I’ve asked git to replace the parent commit, it will now need to create new commits for everything which follows the replaced commit on the branch, since each one now has to have a new parent.

At the end of all this, if we were to inspect the log, we’d see:

> git log --oneline --reverse master..

6a829bc3 Add Chicken Tikka Masala
29dd3231 Remove low-carb flag from BBQ ribs
0efc5692 Fix spelling error in BBQ ribs

In essence, everything appears to be the same, except that the original commit includes my fix.  However, looking a little closer, you can see that each commit has a different hash.  The first one is different because I modified the “diff” portion of the commit (i.e., I added in my fix) and therefore a new commit was needed (commits are immutable, so adding the fix required making a new one).  The other two needed to be re-created because their parent commit disappeared, and therefore new commits were needed to preserve the effect of those two commits, starting from my fixed-up commit as the new parent.

✧✧✧

There is one caveat I have to warn you about when using this method.  Any time you rebase a branch, you’ve changed the commit history of the branch.  That is, you’ve thrown out a whole bunch of commits and replaced them with a completely new set.  This is going to mean loads of difficulties and possibly lost work if you’re sharing that same branch with other people, so only use this technique on your own private working branches!

Git 201: Keeping a clean commit history

This series is going to get very technical.  If you’re not already familiar with git, you may want to start over here instead.

✧✧✧

Most people I’ve worked with treat git somewhat like they might treat an armed bomb with a faulty count-down timer.  They generally want to stay away from it, and if forced to interact with it, they only do exactly what some expert told them to do.  What I hope to accomplish here is to convince you that git is really more like an editor for the history of your code, and not a dark god requiring specific incantations to keep it from eating your code.

As I’m thinking of this as a Git 201 series, I’m going to assume you are familiar with a whole bunch of 101-level terms and concepts: repository, index, commit, branch, merge, remote, origin, stash.  If you aren’t, follow this link to some 101-level content.  Come on back when you’re comfortable with all that.  I’m also going to assume you’ve already read my last post about creating a solid commit.

As you might expect, there’s a lot to be said on this topic, so I’m going to break this up into multiple commits… err posts.  To make things easy, I’m going to keep updating the list at the end of this post as I add new ones in the series.  So, feel free to jump around between each of them, or just read them straight through (each will link to the next post, and back here).

✧✧✧

Git 201: Using a work-in-progress commit

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

My work day doesn’t naturally divide itself cleanly into commits.  I’m often interrupted to work on something new when I’m half-way done with my current task, or I just need to head home in the evening without quite having finished that new feature.  This post talks about one technique I use in those situations.

To rewind a bit, though: when I’m starting a new piece of work (whether a new feature, bug fix, etc.), I first check out the source branch (almost always master), and run:

> git pull --rebase

This grabs the latest changes from the origin without adding an unnecessary merge commit. I then create myself a new branch which shows my username along with a brief reminder of what the change is for:

> git checkout -b aminer/performance-tuning

At this point, I’m ready to start working.  So, I’ll crack open my favorite editor and make my changes.  As I’m going along, I’ll often find that I need to stop mid-thought and move on to a different task (e.g., fix something else, go home for the day, etc.).  It may also be that I’ve arrived at a point where I’ve just figured out something tricky, but I’m not through with the overall change yet.  So, I’ll create a work-in-progress commit:

> git add -A
> git commit -n -m wip

This is really more of a “checkpoint” than an actual commit since I very likely haven’t fully finished anything, probably don’t have tests, etc.  The -n here tells git to skip the pre-commit and commit-msg hooks (which is typically where tests and linters run).  This is fine for now because this isn’t a “finished” commit, and I’m going to get rid of it later.  For the time being, though, I keep on working.  If I reach another checkpoint before I’m ready to make a real commit, I’ll just add on to the checkpoint commit:

> git add -A
> git commit -n --amend

The first command pulls all my new work into the index while the second removes the prior work-in-progress commit, and replaces it with a new one which contains both the already-committed changes along with my new ones.

When I’ve actually finished a complete unit of work, I remove the work-in-progress commit:

> git reset HEAD~1

This will take all the changes which were in the prior commit, and add them back to my working directory as though they were there all along.  The commit history on my branch looks just as though the commit had never been there.

What makes a good commit?

It is distressingly common for me to see commits in a git repository which look something like:

commit 6a69053d2419b311cc38c9fdbbbac75b5a62c4fa
Author: dev-dude <dev-dude@somco.com>
Date: Sun Mar 29 14:05:46 2015 +0200

    fix missing commas

Such a commit will often be half a complete thought (if even that much), have no tests, and a message which explains exactly nothing about why the change was made, or what its intended consequences are.  We can do a lot better.

✧✧✧

A good commit is much like a good sentence: a well-formed, grammatically correct, complete expression of a single thought.  In the case of a commit, this means that it should:

  • make exactly one cohesive change to the code
  • include tests to verify the change
  • have a title which clearly—but briefly—indicates what was done
  • have a message which describes why it was done

 

Cohesion

The most important virtue of a well-formed commit is good cohesion, or: doing only one thing.  This could be a refactoring in preparation for some new feature to be added, it could be adding that feature, it could be fixing a bug.  The actual size of the commit doesn’t matter so terribly much (though smaller is generally better), but that it clearly represents a single change does.

As a shortcut for deciding whether your change is cohesive or not, consider how easily you could come up with the title for the commit.  Can you, in 70 characters or less, state clearly what your change does?  If not, it’s probably not very cohesive.  If you find yourself tempted to make a list as your commit title, you definitely don’t have good cohesion.

In such cases, stop and ask yourself: “what is the smallest thing I could possibly do to this codebase which would be a clear improvement?”.  In a single pull request, it’s not at all uncommon for me to have more than one of these kinds of commits:

  • clear up some stylistic problems with old code
  • refactor various parts of the code I’m about to touch
  • add a new class which will be part of a new feature I’m about to add
  • tie a bunch of prior commits together in a way which makes the new feature visible to the user
  • update documentation

Of course, depending upon the exact nature of the work, I may not need all of those, but it will be very common for me to have at least a few of them.

 

Tests

In my opinion, tests deserve as much care as any other code, and it’s quite possible to have too many or too few.  That’s a topic for another time.  For now, I just want to point out that any new commit should include any appropriate tests related to that new code.  Of course, it’s possible that it’s appropriate not to add new tests (e.g., for a refactoring), but if a commit doesn’t have tests, it should only ever be because you considered it, and decided there weren’t any new tests which were needed.

 

Commit Title

Git actually does have a proper format for the commit message; it’s not just arbitrary.  Remember that git is a command-line tool, and is often used in environments where space can be limited.  To that end, there’s a definite format.

In this format, the first line is considered a title, and it should be a present-tense, imperative sentence stating what changed with this commit (e.g., “Add the FizzyWizzBanger class”).  This should be less than 70 characters because git will only show that many in the various “one-line” versions of its commit logs.  Ideally, it will be sufficiently differentiated from other commit messages so that you can readily figure out which one of several commits it is, even out of context.

 

Commit Message

The question of “what” the commit changes should really be summarized by the title, and made obvious from the code.  In the commit message, you should focus on why you made this change.  Perhaps it fixes a certain bug.  If so, go ahead and explain—in a sentence or two—why this change fixes that particular bug.  It doesn’t hurt to mention the bug number here either.

Don’t be afraid to use simple formatting here either.  A good many Git-related tools support markdown formatting, so feel free to use whatever you like.  Just be sure to leave an empty line after the title since that’s what git will be expecting for its longer-form messages, as will other git-related tools.
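Putting the title and message together, a well-formed commit might look something like this (the change and bug number here are invented purely for illustration):

Add retry logic to the payment client

Transient network errors were causing checkout requests to fail outright
(bug #4521).  Retrying idempotent requests up to three times smooths over
these blips without any user-visible delay.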

✧✧✧

Building up your codebase using a series of well-formed commits has a number of major benefits.  

  • other developers will be able to understand what changed and why much more easily with small, focused commits with clear messages
  • it creates a commit log which reads well, whether using the full log format or just the “one-line” version
  • it is much easier to test small changes
  • not allowing any commits without proper tests means not playing catch-up with a massive and under-tested codebase later on
  • tracking down problems later is much easier if you can isolate the problem to a single commit, and that commit isn’t massive
  • merging branches is much easier (and git is much better at avoiding conflicts altogether) if you work in smaller increments

So, before you start typing… stop to think about what small, specific improvement you’re going to make to the codebase next, and make it as perfect and tidy as you can before moving along to the next one.

Canvas vs. SVG

When I began my start-up, I knew the product was going to focus heavily on drawing high-quality graphs, so I spent quite a while looking at the various options.

 

Server Side Rendering

The first option, and the one that’s been around the longest, is to do all the rendering on the server and then send the resulting images back to the client.  This is essentially what Google Maps did (though they’ve added more and more client-rendered elements over the years).  To do this, you’ll need some kind of image rendering environment on the server (e.g., Java2D), and, of course, a server capable of serving up the images thus produced.

I decided to skip this option because it adds latency and dramatically cuts down on interactivity and the ability to animate transitions.  Both were very important to me, so this solution was clearly not a good fit.

 

Canvas

The second option was to use HTML5’s Canvas element.  This allows you to specify a region of your webpage as a freely drawn canvas which allows all the typical 2D drawing operations.  The result is a raster image which is drawn completely using JavaScript operations.

While Canvas has the advantage of giving a lot more control on the client side, you’re still stuck with a raster image which needs to be drawn from scratch for each change.  You also lose the structure of the scene (since it’s all reduced to pixels), and therefore lose the functionality provided by the DOM (e.g., CSS and event handlers).
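As a tiny illustration (plain HTML and JavaScript; the element id is arbitrary), drawing a circle on a Canvas is an immediate-mode operation, and once drawn it’s just pixels:

<canvas id="chart" width="100" height="100"></canvas>
<script>
  // Draw a filled circle; from here on, the browser only knows about pixels.
  var ctx = document.getElementById('chart').getContext('2d');
  ctx.fillStyle = 'steelblue';
  ctx.beginPath();
  ctx.arc(50, 50, 20, 0, 2 * Math.PI);
  ctx.fill();
</script>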

 

SVG

The final option I considered (and the one I ultimately chose) was using Scalable Vector Graphics (SVG).  SVG is a mark-up language, much like HTML, which is specifically designed to represent vector graphics.  You get pretty much all the same primitives as with Canvas, except these are all represented as elements in the DOM, and remain accessible via JavaScript even after everything is rendered.  In particular, both CSS and JavaScript can be applied to the elements of an SVG document, making it much more suitable for the kind of interactive graphics I had in mind.
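The same circle in SVG (again, the id and styles are just for illustration) remains a live DOM element, so ordinary CSS rules and event handlers keep working after it’s rendered:

<svg width="100" height="100">
  <circle id="point" cx="50" cy="50" r="20"/>
</svg>

<style>
  /* Regular CSS applies to SVG elements just as it does to HTML */
  #point { fill: steelblue; }
  #point:hover { fill: orange; }
</style>

<script>
  // ...and so do regular DOM APIs
  document.getElementById('point')
    .addEventListener('click', function () { console.log('circle clicked'); });
</script>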

As an added incentive for using SVG, there is an excellent library called Data Driven Documents (D3) which is an amazing resource for manipulating SVG documents with interactivity, animations, and a lot more.  More than anything else, the existence of D3 is what convinced me to use SVG for my custom graphics needs.

Streams in Node.js, part 2: Object Streams

In my first post on streams, I discussed the Readable, Writable and Transform classes and how you override them to create your own sources, sinks and filters.

However, where Node.js streams diverge from more classical models (e.g., the shell) is in object streams.  Each of the three types of stream objects can work with objects (instead of buffers of bytes) by setting the objectMode parameter to true in the options argument passed to the parent class constructor.  From that point on, the stream will deal with individual objects (instead of groups of bytes) as the medium of the stream.

This has a few direct consequences:

  1. Readable objects are expected to call push once per object, and each argument is treated as a new element in the stream.
  2. Writable objects will receive a single object at a time as the first argument to their _write methods, and the method will be called once for each object in the stream.
  3. Transform objects see both of the above changes, since they both receive and push individual objects rather than bytes.

 

Application: Tax Calculations

At first glance, it may not be obvious why object streams are so useful.  Let me provide a few examples to show why.  For the first example, consider performing tax calculations for a meal in a restaurant.  There are a number of different steps, and the outcome for each step often depends upon the results of another.  The whole thing can get very complex.  Object streams can be used to break things down into manageable pieces.

Let’s simplify a bit and say the steps are:

  1. Apply item-level discounts (e.g., mark the price of a free dessert as $0)
  2. Compute the tax for each item
  3. Compute the subtotal by summing the price of each item
  4. Compute the tax total by summing the tax of each item
  5. Apply check-level discounts (e.g., a 10% discount for poor service)
  6. Add any automatic gratuity for a large party
  7. Compute the grand total by summing the subtotal, tax total, and auto-gratuity

Of course, bear in mind that I’m actually leaving out a lot of detail and subtlety here, but I’m sure you get the idea.

You could, of course, write all this in a single big function, but that would be some pretty complicated code, easy to get wrong, and hard to test.  Instead, let’s consider how you might do the same thing with object streams.

First, let’s say we have a Readable which knows how to read orders from a database.  Its constructor is given a connection object of some kind, and the order ID.  The _read method, of course, uses these to build an object which represents the order in memory.  This object is then given as an argument to the push method.

Next, let’s say each of the calculation steps above is separated into its own Transform object.  Each one will receive the object created by the Readable, and will modify it by adding on the extra data it’s responsible for.  So, for example, the second transform might look for an items array on the object, and then loop through it adding a taxTotal property with the appropriate computed value for each item.  It would then call its own push method, passing along the primary object for the next Transform.

After having passed from one Transform to the next, the order object created by the Readable would wind up with all the proper computations having been tacked on, piece-by-piece, by each object.  Finally, the object would be passed to a Writable subclass which would store all the new data back into the database.

Now that each step is nicely isolated with a very clear and simple interface (i.e., pass an object, get one back), it’s very easy to test each part of the calculation in isolation, or to add in new steps as needed.
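Here’s a rough sketch of how that pipeline might look in CoffeeScript.  The class names, the database API (loadOrder/saveOrder), and the flat 8% tax rate are all made up for illustration, only the item-tax step from the list above is shown as a Transform, and db and orderId are assumed to already be in scope:

{Readable, Transform, Writable} = require 'stream'

# Reads a single order from the database and pushes it as one object.
class OrderReader extends Readable
    constructor: (db, orderId)->
        super objectMode: true
        @db = db
        @orderId = orderId
    _read: ->
        return if @started          # only load the order once
        @started = true
        @db.loadOrder @orderId, (err, order)=>
            return @emit 'error', err if err
            @push order             # the whole order is a single element in the stream
            @push null              # ...and it's also the last one

# Step 2 from the list above: compute the tax for each item.
class ItemTaxes extends Transform
    constructor: (rate)->
        super objectMode: true
        @rate = rate
    _transform: (order, encoding, callback)->
        item.taxTotal = item.price * @rate for item in order.items
        @push order
        callback()

# Writes the fully calculated order back to the database.
class OrderWriter extends Writable
    constructor: (db)->
        super objectMode: true
        @db = db
    _write: (order, encoding, callback)->
        @db.saveOrder order, callback

# Wire the pieces together; each step stays small and easy to test in isolation.
new OrderReader(db, orderId)
    .pipe(new ItemTaxes(0.08))
    .pipe(new OrderWriter(db))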

Streams in Node.js, part 1: Basic concepts

When I started with Node.js, I came to it with the context of a lot of different programming environments, from Objective C to C# to Bash.  Each of these has a notion of processing a large data set by operating on little bits at a time, and I expected to find something similar in Node.  However, given Node’s way of embracing the asynchronous, I’d expected it to be something quite different.

What I found was actually more straightforward than I’d expected.  In a typical stream metaphor, you have sources which produce data, filters which modify data, and sinks which consume data.  In Node.js, these are represented by three classes from the stream module: Readable, Transform and Writable.  Each of them is very simple to override to create your own, and the result is a very nicely factored set of classes.

Overriding Readable

As the “source” part of the stream metaphor, Readable subclasses are expected to provide data.  Any Readable can have data pushed into it manually by calling the push method.  The addition of new data immediately triggers the appropriate events which makes the data trickle downstream to any listeners.

When making your own Readable, you override the pseudo-hidden _read(size) function.  This is called by the machinery of the stream module whenever it determines that more data is needed from your class.  You then do whatever it is that you have to do to get the data and end by calling the push method to make it available to the underlying stream machinery.

You don’t have to worry about pushing too much data (multiple calls to push are handled gracefully), and when you’re done, you just push null to end the stream.

Here’s a simple Readable (in CoffeeScript) which returns words from a given sentence:

{Readable} = require 'stream'

class Source extends Readable
    constructor: (sentence)->
        super()                      # initialize the underlying Readable machinery
        @words = sentence.split ' '
        @index = 0
    _read: ->
        if @index < @words.length
            @push @words[@index++]   # push the next word and advance
        else
            @push null               # no more words: end the stream

Overriding Writable

The Writable provides the “sink” part of the stream metaphor.  To create one of your own, you only need to override the _write(chunk, encoding, callback) method.  The chunk argument is the data itself (typically a Buffer with some bytes in it).  The encoding argument tells you the encoding of the bytes in the chunk argument if it was translated from a String.  Finally, you are expected to call callback when you’re finished (with an error if something went wrong).

Overriding Writable is about as easy as it gets.  Your _write method will be called whenever new data arrives, and you just need to deal with it as you like.  The only slight complexity is that, depending upon how you set up the stream, you may get a Buffer, a String, or a plain JavaScript object, and you may need to be ready to deal with multiple input types.  Here’s a simple example which accepts any type of data and writes it to the console:

{Writable} = require 'stream'

class Sink extends Writable

    _write: (chunk, encoding, callback)->
        if Buffer.isBuffer chunk
            text = chunk.toString()      # decode the Buffer (utf-8 by default)
        else if typeof(chunk) is 'string'
            text = chunk
        else
            text = chunk.toString()      # fall back to the object's own toString

        console.log text
        callback()

Overriding Transform

A Transform fits between a source and a sink, and allows you to transform the data in any way you like.  For example, you might have a stream of binary data flowing through a Transform which compresses the data, or you might have a text stream flowing through a Transform which capitalizes all the letters.

Transforms don’t actually have to output data each time they receive data, however.  So, you could have a Transform which breaks up an incoming binary stream into lines of text by buffering enough raw data until a full line is received, and only at that point emitting the string as a result.  In fact, you could even have a Transform which merely counts the lines, and only emits a single integer when the end of the stream is reached.

Fortunately, creating your own Transform is nearly the same as writing a class which implements both Readable and Writable.  However, in this case instead of overriding the _write(chunk, encoding, callback) method, you override the _transform(chunk, encoding, callback) method.  And, instead of overriding the _read method to gather data in preparation for calling push, you simply call push from within your _transform method.

Here’s a small example of a transform which capitalizes letters:

{Transform} = require 'stream'

class Capitalizer extends Transform
    _transform: (chunk, encoding, callback)->
        text = chunk.toString()      # decode the Buffer (utf-8 by default)
        text = text.toUpperCase()
        @push text
        callback()

✧✧✧

All this is very interesting, but hardly unique to the Node.js platform. Where things get really interesting is when you start dealing with Object streams. I’ll talk more about those in a future post.

Gotchas: Searching across Redis keys

A teammate introduced me to Redis at a prior job, and ever since, I’ve been impressed with it. For those not familiar with it, it’s a NoSQL, in-memory database which stores a variety of data types, from simple strings all the way to sorted lists and hashes, and which nevertheless has a solid replication and back-up story.

In any case, as you walk through the tutorial, you notice the convention is for keys to be separated by colons, for example: foo:bar:baz. You also see that nearly every command expects to be given a key to work with. If you want to set and then retrieve a value, you might use:

> SET foo:bar:baz "some value to store"
OK
> GET foo:bar:baz
"some value to store"

Great. At this point, you might want to fetch or delete all of the values pertaining to the whole “bar” sub-tree. Perhaps you’d expect it to work something like this:

> GET foo:bar:*
> DEL foo:bar:*

Well… too bad; it doesn’t work like that. The only command which accepts a wildcard is the KEYS command, and it only returns the keys which match the given pattern: not the data. Without getting into too much detail, there are legitimate performance reasons for not supporting this, but it was something of a surprise to me to find out.

However, all is not lost. Redis does support a Hash data structure which allows accessing some or all properties related to a specific key. Along with this data structure come commands to manipulate both individual properties and the entire set.
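For instance (reusing the example key from above, and adding a second, made-up field), you could store the “bar” values as fields of a hash under a single key, and then read or delete them all together:

> HSET foo:bar baz "some value to store"
(integer) 1
> HSET foo:bar qux "another value"
(integer) 1
> HGETALL foo:bar
1) "baz"
2) "some value to store"
3) "qux"
4) "another value"
> DEL foo:bar
(integer) 1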