Organizing Members of a Class

In all the coding books I’ve read, I don’t recall any of them specifically talking about how to organize the different members of a class.  However, in all the code I’ve read, the lack of any standard or order to how members are organized has been the single biggest obstacle to feeling at home with a new codebase.

The reason this matters so much has to do with our brain’s limited ability to keep track of a lot of things at once.  If we can only keep a few things in our head at a time, then it becomes hugely important for the readers of our code that we provide a hierarchy which doesn’t demand they understand more than a half-dozen or so sections at a time.  Moreover, it’s super important that these be immediately recognizable, and that they be in a consistent order so that they can be easily found.

Using Dividers to Create Hierarchy

The way I create an immediately recognizable sense of order is to create visual markers which clearly mark each section of a file. So, for example, let’s consider just the bare framework of a module in Python which contains two classes:

from some_package import a_module


############################################################

KNIGHTS_SAY = "Ni"
EXPECT_SPANISH_INQUISITION = False


############################################################

class TheMainObject(object):

    def __init__(self):
        pass  # do constructor stuff

    # Properties #################################

    @property
    def alpha(self):
        return self._alpha

    @alpha.setter
    def alpha(self, value):
        self._alpha = value

    # Public Methods #############################

    def calculate(self):
        pass  # calculate a value here

    # Private Methods ############################

    def _private_calculation(self):
        pass  # do some private calculation

############################################################

class SomeHelperObject(object):

    pass  # etc, ...

The first section (lines 1–3) contains all the imports. One of my favorite things about Python is that this gives you a complete listing of external dependencies, and an exact listing of where they are all coming from. Someone reading the module can, in an instant, scan this very first section to see everything that’s coming in from the outside.

Next, I put in a full-width marker line with no label. As this indicates the top level of your hierarchy, it should be the full line width your team allows. That way, it’s immediately clear that you’re moving from one top-level part of the hierarchy to another when you see one of these lines.

The next section contains a few constants. These are clearly to be used across all the major sections of this module, and possibly even outside it.

Next, we have the class initialization section for the first class (lines 11–16). This is where the actual class declaration lives, along with any constructors. The semantic meaning of this section, while not explicitly labeled, is pretty clear: we’re setting up the class here.

Next, we have a subdivider labeled “Properties” followed by all of this class’s property declarations. If we had multiple properties here, they would each be listed in alphabetical order (more on this later). It’s important to note that this is a sub-divider, and is therefore shorter than the primary dividers since it represents the second level of the hierarchy. It’s important that it be enough shorter (15–20%) than the primary dividers that it be easy to see the difference even at a quick glance.

Next are the public methods, followed by private methods. As additional members of the second layer of the hierarchy, they use the same dividers. This pretty much covers your bases for most languages.

Sequencing and Sorting

With the visual hierarchy in place, the next thing I consider is the ordering of each element of the hierarchy. Since code is read far more than it is modified, my general approach here revolves around attempting to make the code as easy to read as possible. To that end, it’s important to keep the various levels of the hierarchy in as predictable and clear an order as possible.

For the top level of the hierarchy, we’re talking about sections for defining classes, constants, imports, and (depending upon your language) free-standing functions. Many languages require you to list imports first, and that ordering has the added advantage of making it clear what this code depends upon, so it’s useful to keep them up front. Next, I list the constants which are going to be used throughout this file. Next comes the primary class in the file (if there is one). Finally come the sections for helper classes. In most cases, the part of the file you can readily see without scrolling down contains the imports, constants, and the start of the primary class section. So, in a single glance, the reader of the code can easily see what the primary purpose of this file is and what its dependencies are.

In the second level of the hierarchy, at least for a class, we’re mostly looking at various groupings of methods. I arrange these sections in order from most public to least. The reasoning is that a reader of the code is most likely to want to use the class, and then to inherit from it, and finally to actually change its implementation. This ordering addresses those various needs in order.

The third (and final) layer of the hierarchy consists of the members themselves. In 99.99% of cases, I simply sort these in alphabetical order. It is extremely tempting to try to order these in some “logical” order, but I have found that almost always leads to chaos. The reason is pretty simple. At first, the “logic” is pretty clear. The original author knows what the rationale is, and puts things in that order. However, when the next author comes along, perhaps the logic of the ordering isn’t super clear. Maybe the next author is a little lazy. Perhaps they need to add a member which doesn’t fit into the logical scheme the other members follow. Inevitably, though, members start to be added wherever it “feels” like they seem to fit, or merely at the end of the list. Sooner or later, the only organizational scheme a new author can discern is: random.

Alphabetizing the list of members solves all these problems. It’s obvious from very brief inspection what the scheme is. That makes it obvious where to find an existing member, and where new ones should be added as there is only one correct place for each member. Moreover, it helps keep code diffs very easy to read and understand as there is a lot less movement of code from one change to another. Finally, it makes it extremely easy to tell whether a certain member is present or not (e.g., whether a certain method has been overridden).
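Because there is only one correct place for each member, the rule is also easy to check mechanically. Here is a minimal sketch of such a check using Python’s standard ast module. The function name, the grouping rules (properties, public methods, and private methods, with special methods like __init__ ignored), and the Recipe class are all my own assumptions for illustration, not part of any existing tool.

```python
import ast


def _is_property(func):
    """True for @property functions and their @x.setter/getter/deleter partners."""
    for decorator in func.decorator_list:
        if isinstance(decorator, ast.Name) and decorator.id == "property":
            return True
        if isinstance(decorator, ast.Attribute) and decorator.attr in (
            "setter", "getter", "deleter"
        ):
            return True
    return False


def member_order_problems(source):
    """Return (class_name, group, names) for every group which is not alphabetized."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        groups = {"properties": [], "public": [], "private": []}
        for item in node.body:
            if not isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            if item.name.startswith("__") and item.name.endswith("__"):
                continue  # special methods (e.g. __init__) live in the setup section
            if _is_property(item):
                groups["properties"].append(item.name)
            elif item.name.startswith("_"):
                groups["private"].append(item.name)
            else:
                groups["public"].append(item.name)
        for group, names in groups.items():
            if names != sorted(names):
                problems.append((node.name, group, names))
    return problems


EXAMPLE = '''
class Recipe(object):

    def __init__(self):
        self._steps = []

    # Public Methods #############################

    def scale(self, factor):
        pass

    def add_step(self, step):
        pass

    # Private Methods ############################

    def _normalize(self):
        pass
'''

print(member_order_problems(EXAMPLE))
```

Here the public methods of Recipe are reported because scale appears before add_step. A check like this could run alongside a linter, though whether that is worth automating is a judgment call for your team.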

✧✧✧

I have worked in at least a dozen different languages from PowerPC Assembly to C++ to Python to Go. In each one, I apply the principles of establishing visual hierarchy and creating an objectively correct ordering of elements in the file to ensure that the ultimate goal of making the code obvious and readable is satisfied. It doesn’t matter what the language is, whether it’s Object-Oriented, or what other conventions the language encourages. I find applying this framework makes my code immediately recognizable and earns praise from other developers for being among the most orderly and rational code they’ve ever seen.

✧✧✧

Want to see these techniques in practice? Check out some of my code at github.com/andrewminer.

The Right Size for a Method

One occasionally hears of some programming zealot who swears up and down that methods should be kept to n lines or fewer. The actual value may vary, but the arbitrary nature of the claim remains. However, there is a kernel of truth there, and it has to do with preserving unit economy.

We all know that the longer a method is, the more we have to keep in our minds to understand it. There are likely to be more local variables, more conditional statements, more exceptions caught and thrown, and more side-effects of all those lines of code. Furthermore, the problem grows faster and faster as there are more lines of code since they all potentially can have an impact on one another. Keeping methods short has a disproportionately large benefit.

Of course, claiming that there’s some absolute “correct” number is clearly nonsensical. The same number of lines in C, Lisp, Java, Assembler, or Ruby will accomplish radically different things, and even how one legitimately counts lines will change dramatically. What does not change, though, is the need for the reader (and author) of the code to understand it as a whole. To this end, one should strive to keep the number of discrete tasks a method accomplishes to within the range of what people generally can remember at once: between one and six.

Each task within a method may have several lines of code of its own; how many tends to vary widely. Consider the process of reading a series of rows from a database. There may be a task to establish a database connection, another to create the query, another to read the values, and perhaps one more to close everything down. Each of these may be composed of anywhere from one to many lines of code.

Tasks may even have subtasks. Consider the example of building a login dialog. At some point, there is likely to be some code which creates a variety of controls, and places them on the screen (e.g. an image control for the company logo, a text field to capture the user name, etc). In the method which does this, one may consider the process of creating the components a single task which has a number of subtasks: one for each component.

In both cases, the important consideration is how organizing the method into tasks and subtasks helps preserve unit economy. By creating tasks which have strong cohesion (i.e. you can name what that group of code does) and loose coupling (i.e. you can actually separate that group of lines from the others), you give the reader ready-made abstractions within your method. In the first example, the reader can readily tell that there’s a section for setting up the connection, and be able to mentally file that away as one unit without the need to remember each line of code in it. In the latter example, the reader can categorize the series of UI element creation subtasks as a single “build all the UI components” task, and again be able to abstract the entire thing away under a single unit. Even if there are a dozen or more individual components, it still can be considered a single task, that is, a single mental unit.

This ability to create abstractions within a single method is why there is no absolute “right” size for a method. Since grouping like things into tasks and subtasks preserves the reader’s (and author’s) unit economy, it is quite possible to have a method which is very long in absolute terms, and still quite comprehensible. It also implies that a fairly short method can be “too long” if it fails to provide this kind of mental structure. The proper length will always be determined by the number of units (tasks) which one has to keep in mind, and the complexity of how those tasks are interrelated.

Using Exceptions Robustly

Before Object-Oriented programming (OOP), error conditions were commonly reported via a return code, an OS signal, or even by setting a global variable. One of the most useful notions introduced by OOP is that of an ‘exception’, because exceptions drastically reduce the mental load of handling error cases.

In the old style of error handling, any time a function was called which could result in an error, the programmer had to remember to check whether an error occurred. If he forgot, the program would probably fail in some mysterious way at some point later down the line. There was no mechanism built into the language to aid the programmer in managing all the possible error conditions. This often meant that error handling was forgotten.

The situation is much improved with Exceptions, primarily because they offer a fool-proof way to ensure that error handling code is invoked (even if it is code for reporting an unhandled exception). This both makes it unnecessary to remember to check for errors and increases the cohesion of the error handling code (i.e. it can be gathered into “catch” blocks instead of mixed in with the logic of the function). Both of these help preserve the unit economy of the author and reader of the code.

Unfortunately, despite being such a useful innovation, exceptions are often abused. We’ve all seen situations where one must catch three different exceptions and do the same thing for each. We’ve all seen situations where only a single exception is thrown no matter what goes wrong, and it doesn’t tell us anything about the problem. Both ends of the spectrum reflect a failure to use exceptions with the end user of the code in mind.

When throwing an exception, one should always keep two questions in mind: “Who is going to catch this?” and “What will they want to do with it?”. With this in mind, here are a number of best practices I’ve seen:

Each library should have a superclass for its exceptions.

Very frequently, users of a library aren’t going to be interested in what specific problem occurred within the library; all they’re going to want to know is that the library either did or didn’t do its job. In the latter case, they will want the process of error handling to be as simple as possible. Having all exceptions in the library inherit from the same superclass makes that much easier.

Create a new subclass for each distinct outcome.

Most often, exception subclasses are created for each distinct problem which can arise. This makes a lot of sense to the author, but it usually doesn’t match what the user of the code needs. Instead of creating an exception subclass for each problem, create one for each possible solution. This may mean having exceptions to represent: permanent errors, temporary errors, errors in input, etc. Try to consider what possible users of the component will want to do with the exception, not what the problem originally was.
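As a rough sketch of these first two practices together, here is a small exception hierarchy for a hypothetical recipe_box library. Every name in it is invented for illustration. The subclasses are named for what the caller can do about the failure, and the retry helper shows how that pays off: it only needs to know about one outcome.

```python
import time


class RecipeBoxError(Exception):
    """Superclass for every error raised by the hypothetical recipe_box library."""


class InvalidInputError(RecipeBoxError):
    """The caller must fix the request; retrying it unchanged will not help."""


class PermanentError(RecipeBoxError):
    """Nothing the caller can do will help; give up and report the failure."""


class TemporaryError(RecipeBoxError):
    """The same request may succeed if it is tried again later."""


def fetch_with_retry(fetch, attempts=3, delay_seconds=0.0):
    """Call fetch(), retrying only when the failure is temporary."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TemporaryError:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            time.sleep(delay_seconds)


calls = []


def flaky_fetch():
    """Stand-in for a real request which fails twice and then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise TemporaryError("server busy")
    return "Chicken Tikka Masala"


result = fetch_with_retry(flaky_fetch)
print(result, "after", len(calls), "calls")
```

A caller who doesn’t care about the details can still write a single “except RecipeBoxError” and be done, while InvalidInputError and PermanentError pass straight through the retry loop untouched.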

Remember that exceptions can hold data.

In most languages, exceptions are full-fledged classes, and your subclasses can extend them like any other parent class. This means that you can add your own data to them. Whether it is an error code for the specific problem, the name of the resource which was missing, or a localization key for the error message, including specific data in the exception object itself often is an invaluable means for communicating data which would otherwise be inaccessible from the ‘catch’ block where the exception is handled.
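Here is a minimal sketch of what that can look like in Python. The exception class, its fields, and the load_template function are hypothetical; the point is only that the handler can read structured data off the exception instead of parsing a message string.

```python
class MissingResourceError(Exception):
    """Raised when a named resource cannot be found (hypothetical example)."""

    def __init__(self, resource_name, error_code, message_key):
        super().__init__(f"missing resource: {resource_name}")
        self.error_code = error_code
        self.message_key = message_key
        self.resource_name = resource_name


def load_template(name):
    # stand-in for a real lookup which fails
    raise MissingResourceError(
        name, error_code=404, message_key="errors.template.missing"
    )


try:
    load_template("welcome_email")
except MissingResourceError as error:
    # everything the handler needs travels with the exception itself
    summary = (error.resource_name, error.error_code, error.message_key)
    print(summary)
```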

Exceptions should be self-describing in logs.

In most applications, when an exception is finally caught (i.e., not to be re-thrown or wrapped in another exception), it should be logged. The output produced should be as descriptive as possible, including:

  • a plain-English description of what happened
  • the state of any relevant variables in play
  • a full stack trace of where the error occurred
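One way to get all three with Python’s standard logging module is sketched below. The exception class, logger name, and file name are made up for illustration. The exception builds its own description (including the relevant state), and logger.exception appends the stack trace when called from inside an except block.

```python
import io
import logging


class ImportFailedError(Exception):
    """Hypothetical error which describes itself, including relevant state."""

    def __init__(self, path, line_number, reason):
        self.line_number = line_number
        self.path = path
        self.reason = reason
        super().__init__(
            f"could not import recipes: {reason} "
            f"(path={path!r}, line_number={line_number})"
        )


def import_recipes(path):
    # stand-in for a real import which fails part way through
    raise ImportFailedError(path, line_number=12, reason="unexpected end of file")


# Capture the log in memory so the example can show what was written.
log_output = io.StringIO()
logger = logging.getLogger("recipe_import")
logger.addHandler(logging.StreamHandler(log_output))
logger.propagate = False

try:
    import_recipes("recipes.csv")
except ImportFailedError:
    # logs at ERROR level and appends the full stack trace
    logger.exception("recipe import failed")

print(log_output.getvalue())
```

In a real application the handler would write to a file or a log service rather than an in-memory buffer; the buffer is only there to keep the sketch self-contained.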

Git 201: Safely Using Rebase

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

The rebase tool in git is extremely powerful, and therefore also rather dangerous. In fact, I’ve known engineers (usually those new to git) who won’t touch it at all. I hope this will convince you that you really can use it safely, and it’s actually very beneficial.

What is rebase?

To start, I’ll very briefly touch on what a rebase actually is: a way to rewrite the history of a branch. Let’s assume you’re starting out with a standard working branch from master:

At this point, let’s say you want to update your branch to contain the new commits on master (i.e., D and F), but you don’t want to create a merge commit in the middle of your work. You could rebase instead:

git rebase master

This command will rewrite your current branch as though it had originally been created starting from F (the tip of master). In order to do that, though, it will need to re-create each of the commits on your branch (C, E, and G). Remember that a commit is the difference applied to some prior commit. In our example, C is the changes applied to B, E contains changes applied to C, and G contains changes applied to E. Rebasing means that we need to change things around so that C actually has F as a parent instead of B.

The problem is that git can’t just change C’s parent because there’s no guarantee that the changes represented by C will result in the same codebase when applied to F instead of B. It might be that you’d wind up with some completely different code if you did that. So, git needs to figure out what result C creates, and then figure out what changes to apply to F in order to create the same result. That will yield a completely new commit which we’ll call CC. Since E was based upon C, which has been replaced, git will need to create a new commit using the same process, which we’ll call EE. And, since E has been replaced, that means we’ll need to replace G with GG. Once all of the commits have been created, git moves the branch pointer to the newest commit:

While all this seems complicated, it’s all hidden inside of git, and you don’t really have to deal with any of it.  In the end, using rebase instead of merge just means changing a single command, and your commit history is simpler because it appears as though you created your branch from the right place all along.  If you’d like a much fuller tutorial with loads of depth, I’d recommend you head over here.
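If it helps to see the idea in code, here is a toy model in Python. It is a heavy simplification and not how git is implemented: a “commit” here is just a label plus a parent, and its id is a hash of the two, whereas real git also hashes the file tree, author, dates, and message, and has to re-apply actual diffs. The model does show why every commit after the fork point must be re-created rather than edited.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """A simplified, immutable commit: a change applied on top of a parent."""
    change: str
    parent: Optional["Commit"] = None

    @property
    def commit_id(self):
        # the parent's id is part of this commit's identity
        parent_id = self.parent.commit_id if self.parent else ""
        return hashlib.sha1((parent_id + self.change).encode()).hexdigest()


def history(tip, stop):
    """List the commits after `stop` up to and including `tip`, oldest first."""
    commits = []
    while tip is not None and tip != stop:
        commits.append(tip)
        tip = tip.parent
    return list(reversed(commits))


def rebase(branch_tip, fork_point, new_base):
    """Re-create every commit after fork_point on top of new_base."""
    tip = new_base
    for commit in history(branch_tip, fork_point):
        tip = Commit(commit.change, parent=tip)  # a brand new commit
    return tip


# master: A - B - D - F        branch: B - C - E - G
b = Commit("B", Commit("A"))
f = Commit("F", Commit("D", b))
g = Commit("G", Commit("E", Commit("C", b)))

gg = rebase(g, fork_point=b, new_base=f)
old = history(g, b)
new = history(gg, f)
for before, after in zip(old, new):
    print(before.change, before.commit_id[:7], "->", after.commit_id[:7])
```

Each line of output pairs an original commit with its replacement (C with CC, and so on). The changes are the same, but the ids all differ because each new commit has a different parent.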

Rebasing across repos

If you’re working on a branch which only exists locally, then rebasing is pretty straight-forward to work with. It’s really when you’re working across multiple clones of a repo (e.g., your local clone, and the one up on GitHub) that things become a little more complicated.

Let’s say you’ve been working on a branch for a while, and somewhere along the way, you pushed the branch back to the origin (e.g., GitHub). Later on, though, you decide you want to rebase to pick up some changes from master. That leaves you in the state we see in this diagram:

If you were to pull right now, git would freak out just a bit. Your local version of the branch seems to have three new commits on it (CC, EE, and GG) while it’s missing three others (C, E, and G). Then, when git checks for merge conflicts, there are all sorts of things which seem to conflict (C conflicts with CC, E conflicts with EE, etc.). It’s a complete mess.

So, the normal thing to do here is to force git to push your local version of the branch back to origin:

git push -f

This is telling git to disregard any weirdness between the local version of the branch and origin’s version of the branch, and just make origin look like your local copy. If you’re the only one making changes to the branch, this works just fine. The origin gets your new branch, and you can move right along. But… what if you aren’t the only one making changes?

Where rebasing goes wrong

Imagine if someone else noticed your branch, and decided to help you out by fixing a bug. They clone the repository, checkout your branch, add a commit, and push. Now the origin has all your changes as well as the bug fix from your friend. Except, in the meantime, you decided to rebase. That would mean you’re in a situation like this:

Now you’re stuck. If you pull in order to get commit H, you’re going to have all sorts of nasty conflicts. However, if you force push your branch back to origin (to avoid the conflicts), you’re going to lose commit H since you’re telling git to disregard the version of the branch on origin. And, if your friend neglected to tell you about the bug fix, you might do exactly that and never even realize.

Solution 1: Communicate

The best way to fix the problem is to avoid it in the first place. Communicate clearly with your teammates that this branch is a working branch, and that they shouldn’t push commits onto it. It’s a good idea for teams to adopt some clear conventions around this to make this kind of mistake hard to make (e.g., any branch starting with a username should only be changed by that user, branches with “shared”, “team” or some other prefix are expected to have multiple contributors).

If you can’t be sure you’re the only one working on a branch, the next best thing is, before starting the rebase, to talk with anyone who might be working with the branch. Say that you’re going to rebase it, and what they should expect. If anyone speaks up that they’re working on changes to that branch, then you know to hold off.

Once everyone has pushed up any outstanding changes, pull down the latest version of the branch, rebase, and then push everything back up as soon as possible. That looks like this:

git checkout mybranch
git pull
git rebase master
git push -f

Once you’ve finished, you’ll want to tell the other people working on the branch that they need to get the fresh version of the branch for themselves. That looks like:

git checkout master
git branch -D mybranch
git fetch
git checkout mybranch

Solution 2: Restart the rebase

If you find yourself having just rebased and only then learn there are upstream changes you’re missing, the simplest way out of this difficulty is to simply ditch your rebase. Go back, and pull down the changes from the origin, and start over (after referring to solution 1). That would look something like this:

git checkout master
git branch -D mybranch
git fetch
git checkout mybranch
git rebase master
git push -f

This will switch back to master (1), so that you can delete your local copy of the branch (2), fetch the latest state of the origin (3), and then check out the most recent version of the branch (4). Now, you can re-apply your rebase (5), and then push up the rebased branch before anyone else has a chance to mess things up again (6).

Solution 3: Start a new branch

If you find that you want to rebase right away, and don’t want to wait to coordinate with others who might be sharing your branch, a good plan is to isolate yourself from the potentially shared branch first, and then do your rebase.

git checkout mybranch
git checkout -b mybranch-2

At this point, you’ve got a brand new branch which only exists on your local machine, so no one else could possibly have done anything to it. That means you can go ahead and rebase all you like. When you push the branch back up to origin (e.g., GitHub), it will be the first time that particular branch has been pushed.

Of course, if someone else has added a commit to the old branch, it will still be stuck over there, and not on your new branch. If you want to get their commit on your new branch, use git’s cherry-pick feature:

git cherry-pick <hash>

This will create a new commit on your branch which will have the exact same effect on your branch as it did on the old one. Once you’ve rescued any errant commits, you can delete the old branch and continue from the new one.

✧✧✧

I hope this makes rebasing less scary, and helps you get a sense of when you’d use it and when not. And, of course, should things go wrong, I hope this gives you a good sense of how to recover.

Two last bits of advice… First, before rebasing, create a new branch from the head of the branch you’re going to rebase. That way, should things go completely wrong, you can just delete the rebased branch, and use your backup. And, finally, if you’re in the middle of a rebase which seems to be going a little nuts, you can always bail out by using:

git rebase --abort

So, feel free to experiment!

Improving your estimates

Estimating most projects is necessarily an imprecise exercise. The goal of this post is to share some tools I’ve learned for reducing the most common sources of error. Not all of those tools will apply to every project, though, so use this more as a reminder of things to consider when estimating, rather than a strict checklist of things you must do for every project. As always, you are the expert doing the estimating, so it is up to your own best judgement.

Break things into small pieces

When estimating, error is generally reduced by dividing tasks into more and smaller pieces of work. As the tasks get smaller, several beneficial things result:

  • Smaller tasks are generally better understood, and it is easier to compare the task to one of known duration (e.g., some prior piece of work).
  • The error on a smaller task is generally smaller than the error on a larger task. That is, if you’re off by 50% on an 8 hour task, you’re off by 4 hours. If you’re off by 50% on an 8 day task, you’re off by 4 days.
  • You’re more likely to forget to account for some part of work in a longer task than a shorter one.

As a general rule, it’s a good idea to break a project down into tasks of less than 2 days duration, but your project may be different. Pick a standard which makes sense for the size of project and level of accuracy you need.

Count what can be counted

When estimating a large project, it is often the case that it is made up of many similar parts. Perhaps it’s an activity which is repeated a number of times, or perhaps there’s some symmetry to the overall structure of the thing being created. Whichever way, try to figure out if there’s something you already know which is countable, and then try to work out how much time each one requires. You may even be able to time yourself doing one of those repeated items so your estimate is that much more accurate.

Establish a range

When estimating individual tasks (i.e., those which can’t be further subdivided), it is often beneficial to start out by figuring out the range of possible durations. Start by asking yourself: “If everything went perfectly, what is the shortest time I could imagine this taking?” Then, turn it around: “If everything went completely pear-shaped, what is the shortest duration I’d be willing to bet my life on?” This gives you a best/worst-case scenario. Now, with all the ways it could go wrong in mind, make a guess about how long you really think it will take.
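If you keep your estimates in a spreadsheet or a script, recording all three numbers for each task is straightforward. Here is a minimal sketch in Python; the class, the task names, and the hours are all invented for illustration. Note that simply adding up the columns gives the outer bounds for the project, and it is rare for every task to hit its worst case at once, so treat the worst-case total as a ceiling rather than a prediction.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    """Best-case, expected, and worst-case durations for one task, in hours."""
    name: str
    best: float
    expected: float
    worst: float

    def __post_init__(self):
        # catch estimates which were entered in the wrong order
        if not self.best <= self.expected <= self.worst:
            raise ValueError(f"{self.name}: need best <= expected <= worst")


def project_range(tasks):
    """Sum each column to get the range for the project as a whole."""
    return (
        sum(task.best for task in tasks),
        sum(task.expected for task in tasks),
        sum(task.worst for task in tasks),
    )


tasks = [
    TaskEstimate("build login form", best=2, expected=4, worst=10),
    TaskEstimate("wire up authentication", best=3, expected=6, worst=16),
    TaskEstimate("write tests", best=2, expected=3, worst=6),
]
print(project_range(tasks))  # (7, 13, 32)
```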

Get a second opinion

It’s often helpful to get multiple people to estimate the same project, but you can lose a lot of the value in doing so if the different people influence each other prematurely. To avoid that, consider using planning poker. With this technique, each estimator comes up with their own estimate without revealing it to the others. Then, once everyone is finished, they all compare estimates.

Naturally, there are going to be some differences from one person to the next. When these are small, taking an average of all the estimates is fine. However, when the differences are large, it’s often a sign that there’s some disagreement about the scope of the project, what work is required to complete it, or the risks involved in doing so. At this point, it’s good for everyone to talk about how they arrived at their own estimates, and then do another round of private estimates. The tendency is for the numbers to converge pretty rapidly with only a few rounds.
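The “average when close, talk when far apart” rule can be sketched in a few lines of Python. The 25% threshold below is purely my assumption for the example (pick whatever your team considers “small”), and the names and numbers are made up.

```python
from statistics import mean


def reconcile(estimates, max_spread=0.25):
    """Average the estimates when they roughly agree; otherwise return None.

    max_spread is an assumed threshold: the allowed gap between the highest
    and lowest estimate, as a fraction of the average.
    """
    values = list(estimates.values())
    average = mean(values)
    spread = (max(values) - min(values)) / average
    if spread <= max_spread:
        return average
    return None  # too far apart: talk it through, then estimate again


round_one = {"Ana": 3, "Ben": 8, "Chidi": 4}  # estimates in days
round_two = {"Ana": 5, "Ben": 6, "Chidi": 5}

print(reconcile(round_one))
print(reconcile(round_two))
```

In the first round the estimates range from 3 to 8 days, so the function returns None to signal that a conversation is needed. In the second round they have converged, and the average is returned.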

Perform a reality check

Oftentimes, one is asked to estimate a project which is at least similar to a project one has already completed. However, when coming up with a quick estimate, it’s easy to just trust to one’s intuition about how long things will take rather than really examining specific knowledge of particular past projects to see what you can learn. Here’s a set of questions you can ask yourself to try to dredge up that knowledge:

  • The last time you did this, how long was it from when you started to when you actually moved on to another project?
  • What is the riskiest part of this project? What is the worst-case scenario for how long that might take?
  • The last time you did this, what parts took longer than expected?
  • The last time you did this, what did you forget to include in your estimate?
  • How many times have you done this before? How much “learning time” will you need this time around?
  • Do you already have all the tools you need to start? Do you already know how to use them all?

There are loads of other questions you might ask yourself along these lines, and the really good ones will be those which force you to remember why that similar project you’re thinking of was harder / took longer / was more expensive than you expected it to be.

Create an estimation checklist

If you are planning to do a lot of estimating, it can be immensely helpful to cultivate an estimation checklist. This is a list of all the “parts” of the projects you’ve done before. Naturally, this will vary considerably from one kind of project to the next, and not every item in the checklist will apply to every new project, but they can be immensely valuable in helping you not forget things. In my personal experience, I’ve seen more projects be late from the things which were never in the plan, than from things which took longer than expected.

✧✧✧

Estimation is super hard, and there’s really no getting around that. You’re always going to have some error bars around your estimates, and, depending upon the part of the project you’re estimating, perhaps some considerably large ones. Fortunately, a lot of people have been thinking about this for a long while, and there are a lot of tricks you can use, and a lot of books on the subject you can read, if you’d like to get better. Here’s one I found particularly useful which describes a lot of what I’ve just talked about, and more:


Software Estimation: Demystifying the Black Art

Git 201: Creating a “fixup” commit

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

As I work on a branch, adding commit after commit, I’ll sometimes find that there’s a piece of work I forgot to do which really should have been part of some prior commit.  The first thing to do—before fixing anything—is to save the work I’m doing.  This might mean using a work-in-progress commit as described previously, or simply using the stash.  From there, I have a few options, but in this post I’m going to focus on using the “fixup” command within git’s interactive rebase tool.

The essence of the process is to make a new commit, and then use git to combine it with the commit I want to fix.  The first step is to make my fix, and commit it.  The message here really doesn’t matter since it’s going to get replaced anyway.

> git add -A
> git commit -m "fix errors"

This will add a new commit to the end of the branch, but that’s not actually what I wanted.  Now I need to squash it into the older commit I want to fix.  To do this, I’m going to use an “interactive rebase” command:

> git rebase -i master

This is telling git that I want to edit the history of commits back to where my branch diverged from master (if you originally created your branch from somewhere else, you’ll want to specify that instead).  In response to this request, git is going to create a temporary file on disk somewhere and open up my editor (the same one used for commit messages) with that file loaded.  It will wind up looking something like this:

pick 7e70c43 Add Chicken Tikka Masala
pick e8cc090 Remove low-carb flag from BBQ ribs
pick 9ade3d6 Fix spelling error in BBQ ribs
pick b857991 fix errors

# Rebase 1222f97..b857991 onto 1222f97 (       5 TODO item(s))
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to
# bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

These are all the commits I’ve made since I cut my branch, ordered from the oldest (at the top) to the newest (at the bottom).  The commented-out parts give the instructions for what you can do in this “interactive” portion of the rebase.  Assuming that the fix is for the Chicken Tikka Masala recipe, I’d want to edit the file to look like this:

pick 7e70c43 Add Chicken Tikka Masala
fixup b857991 fix errors
pick e8cc090 Remove low-carb flag from BBQ ribs
pick 9ade3d6 Fix spelling error in BBQ ribs

When I save the file and quit my editor, git is going to rebuild the branch from scratch according to these instructions.  The first line tells git to simply keep commit 7e70c43 as-is.  The next line tells git to remove the prior commit, and replace it with one which is the combination of the prior commit and my fix-up commit, b857991.  The other two commands tell git to create two new commits which result in the same end state as each of the old commits, e8cc090 and 9ade3d6.
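The edit I made to the todo file by hand amounts to a small list manipulation, which I can sketch in Python.  This is only an illustration of the edit, not something git provides: the function name move_as_fixup is hypothetical, and it assumes every entry looks like "pick <hash> <title>".

```python
def move_as_fixup(todo_lines, fix_hash, target_hash):
    """Return a new todo list with the fix commit folded in below its target.

    Hypothetical helper: each entry is assumed to look like
    "pick <hash> <title>", as in the rebase todo file.
    """
    fix_line = None
    others = []
    for line in todo_lines:
        if line.split()[1] == fix_hash:
            fix_line = line
        else:
            others.append(line)
    if fix_line is None:
        raise ValueError(f"no todo entry for {fix_hash}")

    # Swap the leading "pick" for "fixup", keeping the hash and title.
    fixup_line = "fixup " + fix_line.split(" ", 1)[1]

    result = []
    for line in others:
        result.append(line)
        if line.split()[1] == target_hash:
            result.append(fixup_line)
    if fixup_line not in result:
        # Dropping a line from the todo list would lose that commit.
        raise ValueError(f"no todo entry for {target_hash}")
    return result


todo = [
    "pick 7e70c43 Add Chicken Tikka Masala",
    "pick e8cc090 Remove low-carb flag from BBQ ribs",
    "pick 9ade3d6 Fix spelling error in BBQ ribs",
    "pick b857991 fix errors",
]

edited = move_as_fixup(todo, fix_hash="b857991", target_hash="7e70c43")
print("\n".join(edited))
```

The two things that matter are the ones the sketch makes explicit: the fix-up line has to sit directly below the commit it repairs, and no line may go missing along the way.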

As a bit of an aside…  Why does git have to create new commits for the last two commands?  Remember that commits are immutable once created, and that part of the data which makes up the commit is the parent commit it was created from.  Since I’ve asked git to replace the parent commit, it will now need to create new commits for everything which follows on that same branch since each one now has to have a new parent: all the way back to the commit I replaced.

At the end of all this, if we were to inspect the log, we’d see:

> git log --oneline master..

6a829bc3 Add Chicken Tikka Masala
29dd3231 Remove low-carb flag from BBQ ribs
0efc5692 Fix spelling error in BBQ ribs

In essence, everything appears to be the same, except that the original commit includes my fix.  However, looking a little closer, you can see that each commit has a different hash.  The first one is different because I modified the “diff” portion of the commit (i.e., I added in my fix) and therefore a new commit was needed (commits are immutable, so adding the fix required making a new one).  The other two needed to be re-created because their parent commit disappeared, and therefore new commits were needed to preserve the effect of those two commits, starting from my fixed-up commit as the new parent.
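To see why a change to one commit ripples through everything after it, here is a toy model in Python.  It is only a sketch of the idea that a commit’s identity covers its parent: make_commit and build_branch are made-up names, the diffs are placeholder strings, and real git hashes a different, richer object format.

```python
import hashlib


def make_commit(parent_id, message, diff):
    """Return a toy commit id derived from the parent id, message, and diff.

    Illustration only: real git hashes a different object format.
    """
    payload = "\n".join([parent_id or "", message, diff]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:7]


def build_branch(base_id, commits):
    """Chain (message, diff) pairs onto base_id and return the list of ids."""
    ids = []
    parent = base_id
    for message, diff in commits:
        parent = make_commit(parent, message, diff)
        ids.append(parent)
    return ids


original = build_branch("1222f97", [
    ("Add Chicken Tikka Masala", "+ masala recipe"),
    ("Remove low-carb flag from BBQ ribs", "- low-carb flag"),
    ("Fix spelling error in BBQ ribs", "~ spelling"),
])

# Same messages, but the first commit now also carries the fix.
fixed_up = build_branch("1222f97", [
    ("Add Chicken Tikka Masala", "+ masala recipe\n+ my fix"),
    ("Remove low-carb flag from BBQ ribs", "- low-carb flag"),
    ("Fix spelling error in BBQ ribs", "~ spelling"),
])

for old, new in zip(original, fixed_up):
    print(old, "->", new)
```

Only the first commit’s content changed, yet every id in the second list differs from its counterpart, because each later commit is hashed together with the id of its new parent.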

✧✧✧

There is one caveat I have to warn you about when using this method.  Any time you rebase a branch, you’ve changed the commit history of the branch.  That is, you’ve thrown out a whole bunch of commits and replaced them with a completely new set.  This is going to mean loads of difficulties and possibly lost work if you’re sharing that same branch with other people, so only use this technique on your own private working branches!

Git 201: Keeping a clean commit history

This series is going to get very technical.  If you’re not already familiar with git, you may want to start over here instead.

✧✧✧

Most people I’ve worked with treat git somewhat like they might treat an armed bomb with a faulty count-down timer.  They generally want to stay away from it, and if forced to interact with it, they only do exactly what some expert told them to do.  What I hope to accomplish here is to convince you that git is really more like an editor for the history of your code, and not a dark god requiring specific incantations to keep it from eating your code.

As I’m thinking of this as a Git 201 series, I’m going to assume you are familiar with a whole bunch of 101-level terms and concepts: repository, index, commit, branch, merge, remote, origin, stash.  If you aren’t, follow this link to some 101-level content.  Come on back when you’re comfortable with all that.  I’m also going to assume you’ve already read my last post about creating a solid commit.

As you might expect, there’s a lot to be said on this topic, so I’m going to break this up into multiple commits… err posts.  To make things easy, I’m going to keep updating the list at the end of this post as I add new ones in the series.  So, feel free to jump around between each of them, or just read them straight through (each will link to the next post, and back here).

✧✧✧

Git 201: Using a work-in-progress commit

This post is part of a Git 201 series on keeping your commit history clean.  The series assumes some prior knowledge of Git, so you may want to start here if you’re new to git.

✧✧✧

My work day doesn’t naturally divide itself cleanly into commits.  I’m often interrupted to work on something new when I’m half-way done with my current task, or I just need to head home in the evening without quite having finished that new feature.  This post talks about one technique I use in those situations.

To rewind a bit though, when I’m starting a new piece of work (whether a new feature, bug fix, etc.), I first check out the source branch (almost always master), and run:

> git pull --rebase

This grabs the latest changes from the origin without adding an unnecessary merge commit. I then create myself a new branch which shows my username along with a brief reminder of what the change is for:

> git checkout -b aminer/performance-tuning

At this point, I’m ready to start working.  So, I’ll crack open my favorite editor and make my changes.  As I’m going along, I’ll often find that I need to stop mid-thought and move on to a different task (e.g., fix something else, go home for the day, etc.).  It may also be that I’ve arrived at a point where I’ve just figured out something tricky, but I’m not through with the overall change yet.  So, I’ll create a work-in-progress commit:

> git add -A
> git commit -n -m wip

This is really more of a “checkpoint” than an actual commit since I very likely haven’t fully finished anything, probably don’t have tests, etc.  The -n here (short for --no-verify) tells git to skip the pre-commit and commit-msg hooks, which is where tests or linters would normally run.  This is fine for now because this isn’t a “finished” commit, and I’m going to get rid of it later.  For the time-being, though, I keep on working.  If I reach another checkpoint before I’m ready to make a real commit, I’ll just add on to the checkpoint commit:

> git add -A
> git commit -n --amend

The first command pulls all my new work into the index while the second removes the prior work-in-progress commit, and replaces it with a new one which contains both the already-committed changes along with my new ones.

When I’ve actually finished a complete unit of work, I remove the work-in-progress commit:

> git reset HEAD~1

This will take all the changes which were in the prior commit, and add them back to my working directory as though they were there all along.  The commit history on my branch looks just as though the commit had never been there.

What makes a good commit?

It is distressingly common for me to see commits in a git repository which look something like:

commit 6a69053d2419b311cc38c9fdbbbac75b5a62c4fa
Author: dev-dude <dev-dude@somco.com>
Date: Sun Mar 29 14:05:46 2015 +0200

    fix missing commas

Such a commit will often be half a complete thought (if even that much), have no tests, and a message which explains exactly nothing about why the change was made, or what its intended consequences are.  We can do a lot better.

✧✧✧

A good commit is much like a good sentence: a well-formed, grammatically correct, complete expression of a single thought.  In the case of a commit, this means that it should:

  • make exactly one cohesive change to the code
  • include tests to verify the change
  • have a title which clearly—but briefly—indicates what was done
  • have a message which describes why it was done

 

Cohesion

The most important virtue of a well-formed commit is good cohesion, or: doing only one thing.  This could be a refactoring in preparation for some new feature to be added, it could be adding that feature, it could be fixing a bug.  The actual size of the commit doesn’t matter so terribly much (though smaller is generally better), but that it clearly represents a single change does.

As a short-cut for deciding whether your change is cohesive or not, consider how easily you could come up with the title for the commit.  Can you, in 70 characters or less, state clearly what your change does?  If not, it’s probably not very cohesive.  If you find yourself tempted to make a list as your commit title, you definitely don’t have good cohesion.

In such cases, stop and ask yourself: “what is the smallest thing I could possibly do to this codebase which would be a clear improvement?”.  In a single pull request, it’s not at all uncommon for me to have more than one of these kinds of commits:

  • clear up some stylistic problems with old code
  • refactor various parts of the code I’m about to touch
  • add a new class which will be part of a new feature I’m about to add
  • tie a bunch of prior commits together in a way which makes the new feature visible to the user
  • update documentation

Of course, depending upon the exact nature of the work, I may not need all of those, but it will be very common for me to have at least a few of them.

 

Tests

In my opinion, tests deserve as much care as any other code, and it’s quite possible to have too many or too few.  That’s a topic for another time.  For now, I just want to point out that any new commit should include any appropriate tests related to that new code.  Of course, it’s possible that it’s appropriate not to add new tests (e.g., for a refactoring), but if a commit doesn’t have tests, it should only ever be because you considered it, and decided there weren’t any new tests which were needed.

 

Commit Title

Git actually does have a proper format for the commit message; it’s not just arbitrary.  Remember that git is a command-line tool, and is often used in environments where space can be limited.  To that end, there’s a definite format.

In this format, the first line is considered a title, and it should be a present-tense, imperative sentence stating what changed with this commit (e.g., “Add the FizzyWizzBanger class”).  This should be less than 70 characters so that it fits on a single line in the various “one-line” versions of the commit log, where longer titles tend to get wrapped or cut off.  Ideally, it will be sufficiently differentiated from other commit messages so that you can readily figure out which one of several commits it is, even out of context.

 

Commit Message

The question of “what” the commit changes should really be summarized by the title, and made obvious from the code.  In the commit message, you should focus on why you made this change.  Perhaps it fixes a certain bug.  If so, go ahead and explain—in a sentence or two—why this change fixes that particular bug.  It doesn’t hurt to mention the bug number here either.

Don’t be afraid to use simple formatting here either.  A good many Git-related tools support markdown formatting, so feel free to use whatever you like.  Just be sure to leave an empty line after the title since that’s what git will be expecting for its longer-form messages, as will other git-related tools.
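The two mechanical rules above (a title of 70 characters or less, and an empty line after it) are easy to check automatically.  Here is a minimal sketch in Python; check_commit_message is a name I made up for illustration, not part of git, and it deliberately checks nothing beyond those two rules.

```python
def check_commit_message(message, max_title_length=70):
    """Return a list of problems found in a commit message.

    Hypothetical helper: it only checks the title length and the
    empty line which should follow the title.
    """
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing title"]

    problems = []
    title = lines[0]
    if len(title) > max_title_length:
        problems.append(
            f"title is {len(title)} characters; limit is {max_title_length}"
        )
    # A longer-form message must be separated from the title by an empty line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be empty")
    return problems


good = "Add the FizzyWizzBanger class\n\nThe importer needs it to parse legacy files."
bad = "Add the FizzyWizzBanger class\nand also fix some other stuff"

print(check_commit_message(good))
print(check_commit_message(bad))
```

Whether the title is a clear, imperative sentence, and whether the message explains why, still takes a human to judge.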

✧✧✧

Building up your codebase using a series of well-formed commits has a number of major benefits.  

  • other developers will be able to understand what changed and why much more easily with small, focused commits with clear messages
  • it creates a commit log which reads well, whether using the full log format or just the “one-line” version
  • it is much easier to test small changes
  • not allowing any commits without proper tests means not playing catch-up with a massive and under-tested codebase later on
  • tracking down problems later is much easier if you can isolate the problem to a single commit, and that commit isn’t massive
  • merging branches is much easier (and git is much better at avoiding conflicts altogether) if you work in smaller increments

So, before you start typing… stop to think about what small, specific improvement you’re going to make to the codebase next, and make it as perfect and tidy as you can before moving along to the next one.

Canvas vs. SVG

When I began my start-up, I knew the product was going to focus heavily on drawing high-quality graphs, so I spent quite a while looking at the various options.

 

Server Side Rendering

The first option, and the one that’s been around the longest, is to do all the rendering on the server and then send the resulting images back to the client.  This is essentially what Google Maps did (though they’ve added more and more client-rendered elements over the years).  To do this, you’ll need some kind of image rendering environment on the server (e.g., Java2D), and, of course, a server capable of serving up the images thus produced.

I decided to skip this option because it adds latency and dramatically cuts down on interactivity and the ability to animate transitions.  Both were very important to me, so this solution was clearly not a good fit.

 

Canvas

The second option was to use HTML5’s Canvas element.  This allows you to specify a region of your webpage as a freely drawn canvas which allows all the typical 2D drawing operations.  The result is a raster image which is drawn completely using JavaScript operations.

While Canvas has the advantages of giving a lot more control on the client-side, you’re still stuck with a raster image which needs to be drawn from scratch for each change.  You also lose the structure of the scene (since it’s all reduced to pixels), and therefore lose the functionality provided in the DOM (e.g., CSS and event handlers).

 

SVG

The final option I considered (and the one I chose) was using Scalable Vector Graphics (SVG).  SVG is a mark-up language, much like HTML, which is specifically designed to represent vector graphics.  You get pretty much all the same primitives as with Canvas, except these are all represented as elements in the DOM, and remain accessible via JavaScript even after everything is rendered.  In particular, both CSS and JavaScript can be applied to the elements of an SVG document, making it much more suitable for the kind of interactive graphics I had in mind.
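To show what “keeping the structure” means, here is a small sketch which builds an SVG document in Python and then reaches back into it.  In a browser this would be done with JavaScript against the DOM; Python’s xml.etree.ElementTree is only a stand-in here, and build_bar_chart, the ids, and the class names are all made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_bar_chart(values, bar_width=20, height=100):
    """Build a tiny SVG bar chart with one <rect> element per value."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(bar_width * len(values)),
        "height": str(height),
    })
    for index, value in enumerate(values):
        ET.SubElement(svg, "rect", {
            "id": f"bar-{index}",
            "class": "bar",
            "x": str(index * bar_width),
            "y": str(height - value),
            "width": str(bar_width - 2),
            "height": str(value),
        })
    return svg


chart = build_bar_chart([30, 80, 55])

# Each bar is still an individual element which can be found and changed.
second_bar = chart.find("rect[@id='bar-1']")
second_bar.set("class", "bar highlighted")

print(ET.tostring(chart, encoding="unicode"))
```

With Canvas, the equivalent of that second bar would just be a patch of pixels; to restyle it, I would have to redraw the scene myself.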

As an added incentive for using SVG, there is an excellent library called Data-Driven Documents (D3) which provides an amazing resource for manipulating SVG documents with interactivity, animations, and a lot more.  More than anything else, the existence of D3 is what settled my decision to use SVG for my custom graphic needs.