Points: conclusions and hypothesis

As anyone with time to spare will know, I’ve recently spent a lot of time thinking and writing about story points. This was in response to Vasco Duarte’s “Story Points Considered Harmful” blog from a month or two back. For completeness here are the links:

Some conclusions I draw from this:

  • There is far more work to do on Abstract/Story Points than we, as a community, have done to date
  • There are many more nuances to the assignment of points, the breakdown of work and the management of the outcomes than I previously realised
  • I must go and see what Mike Cohn actually says about Story Points before I say any more about his approach
  • Stable teams are crucial – but then I’ve been saying this all along

Given all this I’d like to pose a few questions and hypotheses of my own.

Q1: When does the correlation between story points and number of cards become stable?

Hypothesis: I would expect a team new to “Agile”, stories and points to start off with erratic point scores and number of stories complete per sprint. Thus I would not expect the correlation to be stable. As a team settles down I would expect points to become stable, then stories completed and thus establish a correlation.

Q2: Is there any serious research into story points out there?

In the same vein as my recent post “Agile: Where’s the Evidence?” it would be interesting to know if anyone has examined the use and accuracy of story points. Again, I should seclude myself in an academic library and review the data. But again, I have to find time.

More problematically – OK, another hypothesis – I suspect that some of the reasons why story points work, which I listed in part 3 of these posts, will make it very difficult to determine whether they are accurate, because the thing story points are measuring will change.

Points 4 of 4 – Breakdown

This entry directly continues from three earlier ones:

Duarte’s analysis, and my response, have got me thinking. And I think it would be useful – to me at least, maybe to some readers! – to explain why I think story points, or rather the “Abstract Points” that I prefer, are still useful, and why I advise teams to break down Blues – stories, possibly User Stories.

Why do I advise teams to break down Blues/Stories?

My background is as a C++ programmer; I worked on financial, telecom, and other systems. A business story, a Blue, would frequently be bigger than a developer could manage in an iteration – particularly if you have a legacy system. Thus I would break Blues down to Whites. (See Blue-White-Red (PDF) if you want to know more about this approach.)

Blues mean something to the business, Whites mean something to developers. I think this situation still holds for many developers in many environments. This has several advantages:

  • Whites are smaller pieces of work, they flow through a system more easily. Progress can be seen, tasks tracked, velocity calculated.
  • “The Business”, aka the Product Owner/Manager/BA, is not always good at delivering small stories; breaking a Blue down gives the developer a chance.
  • On some teams the business have been beaten up by development to request really small stories. However these stories lack business value. Because Blues are going to be broken down they can be large enough to have value, even if that means they can’t be completed in one iteration/sprint.
  • (Yes, you heard that right.) Whites are completed during the iteration; when all the Whites, or the essential ones, are completed then the Blue is complete.
  • Breaking Blues down to Whites is as much a design exercise as it is an estimation and scheduling one. This allows teams to engage in design and create a shared understanding.
  • Breaking Blues down to Whites frequently reveals functionality or assumptions about the Blue requirement which can be removed or postponed.
  • Having the Product Manager/Owner/BA in the room during this break down allows for requirements elaboration and knowledge mining.
  • Work can be rolled from one iteration to the next. I’m very relaxed about carry-over work and I think for a new team it’s almost unavoidable. However, doing it this way allows some points to be counted and illustrates what is happening.

(Of course the break down does create some problems: a Blue is only done when all its Whites are done, or some are done and the others cancelled, which means tracking becomes more complex. It might also break the Lean idea of “single piece flow”, but I’m not sure.)

Next, why, given Duarte’s analysis, do I still advise teams to estimate their work?

  • I have seen the breakdown-and-estimate approach work. As detailed previously, in one case it allowed a team to forecast to the day.
  • Estimating work allows teams, and individual team members, to raise a warning when work is poorly understood, ill-defined or involves a lot of risk. For example, a team estimating with planning poker will normally settle on an “average size” of task, e.g. 3 or 5 points. When they suddenly assign 13 or 20 points to a task something is wrong.
  • And just in case the warning is ignored, the team – the people at the code face, the people doing the work – have a control mechanism. No matter how much the business or a manager bullies a team, it can still assign a high point score.
  • Equally, when differences in estimation appear they are a trigger for discussion, for learning, for understanding. This is desirable.
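That warning mechanism can be made mechanical. A sketch, with invented numbers and an arbitrary threshold, flags any estimate far above the team’s typical size:

```python
from statistics import median

def flag_outliers(estimates, factor=3):
    """Return estimates suspiciously large relative to the team's median."""
    typical = median(estimates)
    return [e for e in estimates if e > factor * typical]

# One card pointed at 20 in a team whose usual size is 3 or 5.
sprint_estimates = [3, 5, 5, 3, 8, 5, 20, 3]
print(flag_outliers(sprint_estimates))  # → [20]
```

The factor of 3 is an assumption; each team would calibrate its own threshold, and the flag is a prompt for conversation, not an automatic rejection.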

Finally, there is one more reason why I will continue to advise teams to point their work, and it’s one I don’t normally admit to but, well Vasco, you win…

Placebo effect.

Managers, particularly trained project managers, find it alien not to estimate. Actually, they are not alone; I’ve seen plenty of developers and testers who think the Kanban craze of not estimating work is nutty. Going through the rituals of pointing and planning poker provides at least the appearance of doing “the right thing.”

Asking these folk to go cold turkey on planning and estimation is tough.

Likewise, asking them to give up Gantt charts can be tough, so we offer them burn down charts. To be honest I find intra-sprint burn-down charts useless. Even the efficacy of pan-iteration burn-down charts surprised me at first. I now see they can be very useful and recommend their use. (Intellectually I prefer Cumulative Flow Diagrams but they are more difficult to get your head around and more difficult for the casual viewer to understand.)

A mature team is, almost by definition, beyond needing placebos. In a mature team I would expect the business to be requesting small stories which do represent value and do fit within an iteration. Thus I would expect Duarte’s analysis to hold up and a mature team might well decide to go without points and use cards.

However, for a team at the beginning of its Agile journey I don’t expect these conditions to hold.

Finally, for this instalment, I’ve started to wonder about Blue-White-Red again. I’ve long regarded Blue-White-Red as a Scrum/XP hybrid – closer to XP than Scrum if I’m honest. While I’ve been asked to write more about it in the past, I never have. Over time I have refined my thinking about it. I’m now wondering if Blue-White-Red is actually rather more different than I’ve ever appreciated.

Maybe someone who has used Blue-White-Red can answer that one.

Story points 3 of 4 – An example

This entry continues from two earlier ones:

I’ve got some (abstract) points data of my own. Not as much as Duarte’s, but some. One team in particular is interesting. The development manager said a few months ago “We can deliver to the day.” But actually, when you look at the data the velocity looks quite variable. What’s going on?

Well, two things at least. First, when you average the data out it is nowhere near as variable – that’s what averages do. I’m reminded of the old economists’ warning: “Do not pay too much attention to one month’s figures [GDP/GNP/Inflation/etc.]. Look at the trend.”

So yes, velocity iteration to iteration changes but over a longer period it is meaningful.

Second, this is a team I regard as stable. I learned a long time ago that if you don’t have a stable team your velocity data is meaningless. The velocity is delivered by the team members, if you change the team you can’t get a meaningful velocity.

While I regard this team as stable, when I looked closely – looked at the data and dredged my memory – this was not a stable team. One member retired, one member joined, the team was joined by another person to tackle a specific sub-set of work, the team adopted TDD a few months after moving to iterations, and later still they tried pair-programming.

Somewhere along the line the hardware team joined the iterations, added to the velocity, then, after a while, left. It didn’t work as well as hoped. When you look at the data you can see this: alone, the software team can have a standard deviation as low as 3.6 on an average velocity of 62 (over 5 iterations); with the hardware team added that rises to over 17 on a velocity of 60.
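The stability comparison here is just the mean and standard deviation of the two velocity series. A sketch with illustrative numbers (not the actual figures behind this team’s data):

```python
from statistics import mean, stdev

# Illustrative velocities over five iterations.
software_only = [62, 58, 65, 61, 64]   # tight cluster: settled team
with_hardware = [60, 38, 82, 45, 75]   # similar average, wild swings

for name, series in [("software only", software_only),
                     ("with hardware", with_hardware)]:
    print(f"{name}: mean={mean(series):.1f}, stdev={stdev(series):.1f}")
```

With these invented numbers the second series has roughly seven times the standard deviation of the first on almost the same average – the shape of the effect described above.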

(Velocity falling after team expansion is a phenomenon I’ve seen in two other data sets I’ve got. Brooks’ Law doesn’t completely explain this; other factors are at work which I will discuss another day – i.e. when I understand them more fully!)

In other words, the team wasn’t stable. In fact, given all that change I’m surprised velocity was as stable as it was!

I think a third factor was at work. Once a team has put a point score on a card – say they point it at 5 – there is a mild incentive to finish the card in something that feels like 5 points. Not the strong commitment of Scrum mythology; more a pride in one’s own skills and, perhaps, a desire to score points at the end of the iteration.

Fourth: it’s not just development estimates that are helping the team hit dates. Armed with this data, scope can be fine-tuned and teams can take decisions on when to do refactorings and so on.

Fifth: once a team has velocity data and can forecast dates it can negotiate on features and deliveries. This echoes Duarte’s story but is more fine-grained. Of course this won’t help if the end customers/users/clients/stakeholders aren’t prepared to engage.
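The forecast that makes this negotiation possible is simple arithmetic: remaining points divided by average velocity gives sprints, and sprints give a date. A sketch in which every number, date and sprint length is invented:

```python
import math
from datetime import date, timedelta
from statistics import mean

recent_velocities = [62, 58, 65, 61, 64]  # last five iterations (invented)
remaining_points = 430                    # pointed backlog still to do
sprint_length = timedelta(weeks=2)
next_sprint_start = date(2011, 6, 6)      # arbitrary example date

# Forecast on the average, never on a single sprint's figure.
avg_velocity = mean(recent_velocities)
sprints_needed = math.ceil(remaining_points / avg_velocity)
forecast_finish = next_sprint_start + sprints_needed * sprint_length
```

If a range is preferable to a single date, replace the mean with the best and worst recent velocities to bracket the forecast.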

Given all this I believe abstract points and graphs are helpful, not harmful.

Perhaps one day I’ll be able to publish this data. It’s just one team but it shows that velocity and point scoring can work.

Story points 2 of 4 – Duarte's arguments

This blog entry follows directly from the previous Story Points – Journey’s Start.

Key to Duarte’s argument, and something I didn’t originally appreciate from his initial Tweets, is this: he is not saying story points are rubbish, forget about them. What he is saying is: it is simpler and equally accurate to just count the stories as atomic items. This is equivalent to saying “All stories are 1 point” and being done with it.

For a mature team with a good relationship with its stakeholders I could see this working well. However, for less mature teams (who have difficulty agreeing among themselves) or a team with bully-boy stakeholders (or bully boy anyone else for that matter) then I think being able to put a higher point score on the card serves as a useful warning mechanism.

Duarte says in his blog “the best predictor of the future is your past performance!” On this I couldn’t agree more. He then poses three questions – and answers – which I think are worth reviewing.

Q1: Is there sufficient difference between what Story Points and ’number of items’ measure to say that they don’t measure the same thing?

Here he finds there is a close correlation. I’m not surprised here, in fact I would expect this to be the case. Teams are encouraged to write small stories, in fact Scrum almost mandates this because work should be completely done at the end of a sprint. In effect there is an upper bound placed on the size of a story.

Actually, I’m not so keen on this rule. I allow work to be carried from iteration to iteration but I only allow points to be scored when the work is done. Thus I encourage stories to be completed in an iteration but I don’t mandate it. One of the exercises I do with teams on my courses actually sets out to illustrate this point.

At the very least I would expect teams to settle on an “average story size” implicitly. Notice also that the correlation applies whether all stories are of size 1 or of size 2, 3 or any other number. It’s a correlation between two series of numbers.

However, given all this Duarte has a point: if your stories are clustered around an average size then you might as well count the stories.

Q2: Which one of the two metrics is more stable? And what does that mean?

Duarte’s analysis says that both stories and story points have similar standard deviation. Thus they are of similar stability. Since these two are closely correlated this isn’t a surprise. In fact, given the correlation, it would be a surprise if one was notably more stable.

Q3: Are both metrics close enough so that measuring one (number of items) is equivalent to measuring the other (Story Points)?

Duarte’s data suggests the two metrics measure the same thing – again, if they are closely correlated then this is exactly what you would expect. You can write out the equation:
        Story Points ~= (Correlation Coefficient) x (Number of Stories)

(The ~= is supposed to mean approximately equal.)

With this out of the way Duarte moves on to consider Mike Cohn’s claims for story points.

Claim 1: The use of Story points allows us to change our mind whenever we have new information about a story
Duarte says: Story Points offer no advantage over just simply counting the number of items left to be Done.

I agree here. I’ve long encouraged teams to move away from story-pointing work in the distant future. Yes, I encourage them to story point some stories in the backlog – say a few months’ work – and story point tasks for the next iteration. But for stuff that is “out there” or has just arisen my advice is usually: just assign it your average story point value.

In other words, assume your average story point value is your correlation coefficient. When work gets close, estimate it traditionally – you might find the value changes. When it gets really close, break it down into tasks.

Claim 2: The use of Story points works for both epics and smaller stories
Duarte says: there is no significant added information by classifying a story in a 100 SP category

Again I agree. To be honest I’m not a fan of Epics and while some of the teams I work with use them I often encourage teams to dump them. To me an epic is just a collection of stories around a theme.

Actually, what Cohn and Duarte are saying are not at odds here. Cohn doesn’t (seem to) make any additional claims. It’s just a scaling question.

Claim 3: The use of Story points doesn’t take a lot of time        
Duarte says: In fact, as anybody that has tried a nontrivial project knows it can take days of work to estimate the initial backlog for a reasonable size project.

Here I have issues with both Cohn and Duarte.
If you are estimating stories then it does take time. Fast as it is, even planning poker takes time. However there is also a lot of design and requirements discussion going on in that activity. Therefore I don’t see this as a problem. In fact I see it as an important learning exercise.

True, on a non-trivial project it will take time to estimate a large backlog. But a) that is valuable learning and b) I won’t try. I’d either estimate it in chunks or I’d apply an average estimate to work which wasn’t going to happen anytime soon – see claim #2.

I deliberately delay estimation as long as possible to allow more information to arrive and because work will change. It might be changed out of all recognition or it might go away completely.

Claim 4: The use of Story points provides useful information about our progress and the work remaining
Duarte says: This claim holds true if, and only if you have estimated all of your stories in the Backlog and go through the same process for each new story added to the Backlog.

Again I agree. However I doubt the usefulness of the concept of “work remaining”. It’s only work remaining if you think you have a lump of work to do. In my experience work is always negotiable. It’s just that people don’t want to negotiate until they accept that they won’t get everything.

One of my clients has gone through the very expensive exercise of estimating all the work they might do. Earlier this year they realised they had 3000 points to do by Christmas. They also realised they had capacity to do less than 1000. This brought home the fact that they couldn’t do everything – something many people on the project had long known or suspected. The company are still working through this issue, but at least they are having the discussion now, in March and April, not September and October.

Claim 5: The use of Story points is tolerant of imprecision in the estimates
Duarte says: there’s no data [in Cohn’s book] to justify the belief that Story Points do this better than merely counting the number of Stories Done. In fact, we can argue that counting the number of stories is even more tolerant of imprecisions

Again I agree. But then, if there is a high correlation between story points and stories then this is self-evident. And again, as I said before: we need to work with aggregates and averages.

Claim 6: Story points can be used to plan releases
Duarte says: Fair enough. On the other hand we can use any estimation technique to do this, so how would Story Points [be better than counting the number of stories]

Again, agreement, and with correlation its self evident.

Duarte goes on to give a worked example in which a project does not achieve the desired velocity and gets cancelled. His story makes no use of story points, simply stories. To be honest, I’m missing something here. True, the stories in his story have no points, but I don’t see where that makes a difference. What he describes is exactly the way I would play the scenario, although I would have story points in the mix.

His conclusion: “Don’t estimate the size of a story further than this: when doing Backlog Grooming or Sprint Planning just ask: can this Story be completed in a Sprint by one person? If not, break the story down!”

This is interesting because while Duarte is working at the story level this pretty closely models the way I advise teams to work. I always tell teams:

  • “I’d like a story to be small, to fit in one iteration but that isn’t always the way.”
  • “In my experience stories need to be broken down, both so they can get done but also so you get flow and as part of a design exercise.”

I then have teams break down stories – which I call Blues – into (developer) tasks – Whites – following my Blue-White-Red process from a few years back. I then put points on Whites – at this point you probably start to see why I prefer the term Abstract Points to Story Points, because Whites aren’t stories.

When a team have a feel for points I will have them put points – the same units, just bigger – on Blues. For Blues that won’t be done for a while I’m happy to assign averages, and for Blues that are further out I’m happy to leave them unpointed.

So, thank you for reading my analysis here and staying all the way to the end.

My conclusion? I think I agree with Duarte but I don’t agree with him. I think I actually disagree with Mike Cohn but I’d have to go back and look at what he says himself.

I think the way I estimate with teams – the way I used to do it when I ran teams and the way I teach clients to estimate – is more different than perhaps I appreciated. My method grew from my interpretation of Kent Beck’s XP planning game and velocity. However, I now think my approach has drifted from this. The result of my experience, seeing what works and what doesn’t, has refined my approach.

The approach I’ve ended up with has similarities with what Duarte describes but is also different.

Despite many authors’ attempts to describe Scrum/XP planning and estimating I still find myriad minor variations. Some improve things, some not. Until this moment I’ve always believed my approach only differed in minor ways. Now I’m thinking….

And what of Duarte’s harmful claim? Actually, although he uses the word in the title I don’t see any discussion of the harm story points cause.

Story points might be pointless, but do they do any harm? I don’t really see it. They may be a waste of time, and there might be more effective ways of doing the same thing, but that’s not the same as harmful. Story points might mislead, but there is little evidence here so I’ll hold judgement on that one.

To be continued….

Story points considered harmful? 1 of 4 – Journey's start

A few months ago Vasco Duarte, with a little help from Joseph Pelrine, started a discussion entitled “Story Points considered harmful.” They, or at least Vasco, have given this as a conference keynote and blogged about it.

(Warning: this is the first of 4 blog entries and it’s quite a long post. Plus I think there are two appendix blog entries to follow up.)

Now I’ll admit, when I first heard this argument I thought “Well I can guess where they are coming from” – I have sympathy with the argument, I’ve always considered story points as suspect myself. I also thought “But I don’t think they are right”. Specifically I know a team who use story points and claim to be able to schedule delivery “to the day.”

I’ve collected some data of my own from teams I’ve worked with, am working with or at least in contact with and done a bit of amateur analysis. I’m also taking time to go over Vasco and Joseph’s arguments.

This is a big topic and it’s going to take me a while to get to the bottom of the data, what I think, and the pro and con arguments. So please forgive me – this is going to take a few, possibly long, blog articles to go through.

Let’s get a couple of things on the table to start with.

Firstly, I don’t believe story points are a Scrum technique. Yes, they have been subsumed into Common Scrum, but I believe they originated with Extreme Programming. Where they originated isn’t really important, because I believe story point estimation and tasking (i.e. estimating work to do with story points and then scheduling a number of story points) is at odds with Scrum Commitment.

I’ve blogged about this before, in Two Ways to Fill an Iteration, so I won’t repeat myself. Suffice to say, in my book commitment and story pointing are alternatives.

By the way, the name Story Points comes from Mike Cohn, I prefer to call them Abstract Points, and I’ve heard others call them “Nebulous Units of Time”.

Second, I don’t believe story points can ever be stable if you don’t have a stable team. If you remove people from a team I expect it to slow down, if you add people to the team I expect it to slow down too – at least in the short term. In the longer term you might increase capacity but frankly I don’t know how long that will take.

A few years ago I worked with one team whose story/abstract points appeared to be random. When I adjusted for changes in team staffing the average was constant. That said, I don’t expect points to remain constant; they might do for a while but I expect them to fluctuate at the very least.

It goes without saying that I expect sprint/iteration length to be stable too.

Which brings us to the third thing: points, stories and projections only work at the average and aggregate level. They are a good predictor over several sprints but they offer no guarantees for the next sprint/iteration.

Finally, for now, something I do agree with Duarte on: we can’t estimate. By “we” I mean humans. Last year I devoted several blog entries to the subject, Humans Can’t Estimate.