Healthcare Errors Are More Like Frogger Than Swiss Cheese

(Click the Soundcloud app above to hear David read this article.)

By:  David Kashmer (@DavidKashmer, LinkedIn here.)

All models are wrong, but some are useful.

–George E. P. Box

Remember when you first heard of the Swiss-Cheese model of medical error?  It sounds so good, right?  When a bunch of items line up, we get a defect.  It’s satisfying.  We all share some knowledge of Swiss Cheese–it’s a common image.

That’s what makes it so attractive–and, of course, the Swiss Cheese model is a much better mental model than what came before, which was either a looser notion of a bunch of items lining up to make an error or, worse yet, a profound emphasis on how someone screwed up and how that produced a bad outcome.

Models supplant each other over time.  The sun-goes-around-the-Earth (geocentric) model was supplanted by the sun-at-the-center (heliocentric) model–thank you, Copernicus and Kepler!

Now, we can do better with our model of medical error and defect, because medical errors really don’t follow a Swiss Cheese model.  So let’s get a better one and develop our shared model together.

In fact, medical errors are more like Frogger.  Now that we have more Millennials in the workplace than ever (who seem to be much more comfortable as digital natives than I am as a Gen Xer), we can use a more refined idea of medical error that will resonate with the group who staff our hospitals.  Here’s how medical errors are more like Frogger than Swiss Cheese:


(1) In Swiss Cheese, the holes stay still.  That’s not how it is with medical errors.  In fact, each layer of a system that a patient passes through has a probability of having an issue.  Some layers have a lower probability, and some a higher one.  Concepts like Rolled Throughput Yield reflect this and are much more akin to how things actually work than the illusion that we have fixed holes…thinking of holes gives the illusion that, if only we could identify and plug those holes, life would be perfect!

In Frogger, there are gaps in each line of cars that pass by.  We need to get the frog to pass through each line safely and oh, by the way, the holes are moving and sometimes not in the same place.  That kind of probabilistic thinking is much more akin to medical errors:  each line of traffic has an associated chance of squishing us before we get to the other side.  The trick is, of course, we can influence and modify the frequency and size of the holes…well, sometimes anyway.  Can’t do that with Swiss Cheese for sure and, with Frogger, we can select Fast or Slower modes.  (In real life, we have a lot more options.  Sometimes I even label the lines of traffic as the 6M‘s.)
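The layered-probability idea above can be sketched in a few lines. Here’s a minimal Python sketch of Rolled Throughput Yield, where the layer names and per-layer pass probabilities are made-up illustration values, not real figures:

```python
# Rolled Throughput Yield (RTY): the chance a patient passes through
# EVERY layer of the system defect-free is the product of the per-layer
# first-pass yields -- like a frog clearing every lane of traffic.
# Layer names and probabilities below are invented for illustration.

def rolled_throughput_yield(yields):
    """Multiply per-layer first-pass yields into one end-to-end probability."""
    rty = 1.0
    for y in yields:
        rty *= y
    return rty

layers = {
    "triage": 0.99,
    "imaging": 0.97,
    "operating room": 0.995,
    "floor care": 0.96,
    "discharge": 0.98,
}

rty = rolled_throughput_yield(layers.values())
print(f"End-to-end yield: {rty:.3f}")  # each layer looks good; the product is lower
```

Even when every single layer performs above 96%, the end-to-end yield drops toward 90%: the frog has to survive every lane, not just most of them.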


(2) In the Swiss Cheese Model, we imagine a block of cheese sort of sitting there.  There’s no inherent urgency in cheese (unless you’re severely lactose intolerant or have a milk allergy I guess).  It’s sort of a static model that doesn’t do much to indicate safety.

But ahhh, Frogger, well there’s a model that makes it obvious.  If you don’t maneuver the frog carefully that’s it–you’re a goner.  Of course, we have the advantage of engineering our systems to control the flow of traffic and both the size and presence of gaps.  We basically have a cheat code.  And, whether your cheat code is Lean, Lean Six Sigma, Six Sigma, Baldrige Excellence Framework, ISO, Lean Startup, or some combination…you have the ultimate ability unlike almost any Frogger player to change the game to your patient’s advantage.  Of course, unlike Frogger, your patient only gets one chance to make it through unscathed–that’s very different than the video game world and, although another patient will be coming through the system soon, we’ll never get another chance to help the current patient have a perfect experience.

All of that is highlighted by Frogger and is not made obvious by a piece of cheese.


(3) In Frogger, the frog starts anywhere.  Meaning, not only does the traffic move, but the frog starts anywhere along the bottom of the screen.  In Frogger we can control that position, but in real life patients enter the system in certain positions we cannot easily control and that, for the purposes of their hospital course anyway, cannot be changed.  It may be their 80 pack-year history of smoking, their morbid obesity, or their advanced age.  However, the Frogger model recognizes the importance of initial position (which, unlike real life, we can control more easily) while the Swiss Cheese model doesn’t seem to make clear where we start.  Luckily, in real life, I’ve had the great experience of helping systems “cheat” by modifying the initial position…you may not be able to change initial patient comorbid conditions, but you can sometimes set patients up with a better initial position for the system you’re trying to improve.


Like you, I hear about the Swiss Cheese model a lot.  And, don’t get me wrong, it’s much better than what came before.  Now, however, in order to recognize the importance of probability, motion, initial position, devising a safe path through traffic, and a host of other considerations, let’s opt for a model that recognizes uncertainty, probability, and safety.  With more Millennials than ever in the workplace (even though Frogger predates them!) we have digital natives with whom game imagery is much more likely to resonate than a static piece of cheese.

Use Frogger next time you explain medical error because it embodies how to avoid medical errors MUCH better than cheese.



Dr. David Kashmer, a trauma and acute care surgeon, is a Fellow of the American College of Surgeons and is a nationally known healthcare expert. He serves as a member of the Board of Reviewers for the Malcolm Baldrige National Quality Award. In addition to his Medical Doctor degree from MCP Hahnemann University, now Drexel University College of Medicine, he holds an MBA degree from George Washington University. He also earned a Lean Six Sigma Master Black Belt Certification from Villanova University. Kashmer contributes to The Healthcare Quality, where the focus is on quality improvement and value in surgery and healthcare.

To learn more about the application of quality improvement tools like Lean Six Sigma in healthcare and Dr. David Kashmer, visit



Fine Time At The Podcast

Thanks to Vivienne and the team at The Healthcare Quality Podcast.  I had a great time learning about the specifics of podcasting and appreciated the help to muddle through the talk!

Look forward to working with you all in the future…I wonder how many times I used the word “share” at the beginning of the cast!

I count 4 times in 30 seconds…how many times do you think I over-used that word?  “Share” your thoughts anytime!

How You Measure The Surgical Checklist Determines What You Find

By:  DMKashmer MD MBA MBB FACS (@DavidKashmer)


Have you ever wondered how a measurement system affects your conclusions? There are several ways we’ve mentioned that the type of data you choose affects a great deal about your quality improvement project. In this entry, let’s talk more about how your setup for measuring a certain quality endpoint determines, in part, what you find…and more importantly, perhaps, how you respond.


The Type Of Data You Collect Affects What You Can Learn


Remember, previously, we discussed discrete versus continuous data. Discrete data, we mentioned, is data that is categorical, such as yes/no, go/stop, black/white, or red/yellow/green. This type of data has some advantages including that it can be rapid to collect. However, we also described that discrete data comes with several drawbacks.


First, discrete data often requires a much larger sample size to demonstrate significant change. Look here. Remember the simplified equation for the discrete data sample size:

n = p (1 - p) (2 / delta)^2

where p = the probability of some event, and delta is the smallest change you would like to be able to detect.


So, let’s pretend we wanted to detect a 10% (or greater) improvement in some feature of our program, which is currently performing at a rate of 40% of such-and-such. We would need sample size:  (0.40)(0.60)(2/0.10)^2, or 96 samples.
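As a sketch of the arithmetic above (using the same p = 0.40 and delta = 0.10 from the worked example):

```python
# Simplified discrete-data sample size: n = p * (1 - p) * (2 / delta)^2
def discrete_sample_size(p, delta):
    """p: current event probability; delta: smallest change we want to detect."""
    return p * (1 - p) * (2 / delta) ** 2

n = discrete_sample_size(p=0.40, delta=0.10)
print(round(n))  # 96, matching the worked example
```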


Continuous Data Require A Smaller Sample Size


Continuous data, by contrast, requires a much smaller sample size to show meaningful change. Look at the simplified continuous data sample size equation here:


(2 [standard deviation] / delta)^2


This is an important distinction between discrete and continuous data and, in part, can play a large role in what conclusions we draw from our quality improvement project.  Let’s investigate with an example.
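For contrast, here’s a sketch of the continuous formula; the standard deviation and delta values are illustrative assumptions, not figures from the text:

```python
# Simplified continuous-data sample size: n = (2 * sd / delta)^2
def continuous_sample_size(sd, delta):
    """sd: standard deviation of the measure; delta: smallest change to detect."""
    return (2 * sd / delta) ** 2

# Suppose checklist completion is scored 0-100, with sd = 15 points,
# and we want to detect a 10-point shift (assumed values):
n = continuous_sample_size(sd=15, delta=10)
print(round(n))  # 9 -- far fewer samples than the 96 required with discrete data
```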


A Cautionary Fairy Tale


Once upon a time there was a Department of Surgery that wanted to improve its usage of a surgical checklist. The team believed this would help keep patients safe in their surgical system. The team decided to use discrete data.


If a checklist was missing any element at all (and there were lots) it was called “not adequate”.  If it was complete from head to toe, 100%, then it would count as “adequate”.  The team collected data on its current performance and found that only 40% of checklists were adequate.  The team’s target goal was 100%.


Using the discrete data formula, the team set up a sample that (at best) would allow them to detect only changes of 10% or larger. That was going to require a sample size of 96 per the simplified discrete data formula above.


The team made interesting changes to their system. For example, they made changes so that the surgeon would need to be present on check-in for the patient, and they made other changes to patient flow that they felt would result in improved checklist compliance.


Weeks later, the team recollected its data to discover how much things had improved. Experientially, the team saw many more checklists being utilized and there was significantly more participation. Much more of the checklist was being completed, per observations, each time.  The team felt that there was going to be significant improvement in the checklists and was excited to re-collect the data. Unfortunately, when the team used their numbers in statistical testing, there was no significant improvement in checklist utilization. Why was that?


This resulted because the team had utilized discrete data. Anything other than complete checklist utilization counted in the “not adequate” bin and so was counted against them. So, even if checklists were much more complete than they ever had been (and that seemed to be so), anything less than perfection would still count against the percentage of complete (“adequate”) checklists. Because they used discrete data in that way, they were unable to demonstrate significant improvement based on their numbers. They were disappointed even though, in fact, they had actually made great strides.


What options did the team have?  Why, they could have developed a continuous data endpoint on checklist completion.  How?  Look here.  This would have required a smaller sample size and may have shown meaningful improvement more easily.


A Take-Home Message


So remember:  discrete data can limit your ability to demonstrate meaningful change in several important ways. Continuous data, by contrast, can allow teams like the checklist team above to demonstrate significant improvement even if checklists are still not quite 100% complete. For your next quality improvement project, make sure you choose carefully whether you want discrete data endpoints or continuous data endpoints, and recognize how your choice can greatly impact your ability to draw meaningful conclusions as well as your chance of celebrating meaningful change.

How To Use Trauma Triage As A Test



You may have learned about concepts like sensitivity and specificity regarding diagnostic tests. These useful concepts can be applied to many ways we make diagnoses in Medicine. In this write up, let’s briefly review the concepts of sensitivity and specificity as they relate to the identification of injured trauma patients.


Sensitivity is “PID”


First consider sensitivity. Sensitivity is the chance that a test is positive in disease. An easy way to remember sensitivity is as “PID”. PID, here, does not stand for Pelvic Inflammatory Disease. It stands for the chance that a test is “positive in disease”. A triage system can be looked at in terms of the probability that it is positive in disease. By this, I mean you can evaluate a triage system based on how sensitive it is in identifying significantly injured (ISS greater than or equal to 16) patients. After all, isn’t that the point of triage?


If we consider a critically injured trauma patient to have an injury severity score of 16 or greater, the sensitivity of the triage system can be described as the probability that a full trauma team activation occurs for a trauma patient who has an ISS of 16 or greater. This would indicate the sensitivity of the trauma triage system as a whole for significantly injured patients. Of course, as with any methodology, this one suffers from the fact that we use injury severity score to determine whether the trauma patient is critically injured. Injury severity score is, of course, retrospective.  This is always challenging because triage decisions are made in anterograde fashion with limited information and there’s always a certain signal to noise ratio.  But hey–you gotta start somewhere.  At least the ISS measure gives us that starting point.


Specificity Is “NIH”


Next, consider the specificity of a trauma system. Specificity can be remembered as NIH. NIH, here, does not stand for National Institutes of Health (or, worse yet, “Not Invented Here”). Instead, it stands for “negative in health”. The specificity of a trauma triage system may be regarded as the probability that it does NOT give a full team activation in patients who are NOT significantly injured. Again, if we consider that patients with an ISS of less than 16 are not critically injured, we would care about the probability that the trauma activation is not called in patients who have an ISS less than 16.


And Now To The Interesting Stuff…


The concepts of sensitivity and specificity can allow us to do some interesting things. We can create both the odds ratio positive and the odds ratio negative for patients coming into the triage system. This would allow us to determine the probability that a patient entering the system with a certain pre-test probability of injury is classified appropriately as a trauma activation. We could also determine the probability that they will not be classified as a trauma activation. Remember, the odds ratio positive is defined as the sensitivity of a test over the quantity one minus its specificity (end quantity). That looks like:


OR+ = SN / (1-SP) where OR+ is the odds ratio positive, SN is sensitivity, and SP is specificity.


The odds ratio negative is defined as quantity one minus the sensitivity (end quantity) divided by the specificity.  Here’s that one:


OR – = (1 – SN) / SP, where OR – = odds ratio negative, SN = sensitivity, and SP = specificity


Therefore, the more specific the test, the smaller the odds ratio negative, and the smaller the resultant odds of a patient being critically injured when the system does not activate. This is one useful conceptual way to make anterograde decisions or to conceptualize trauma triage.


Using Our Numbers To Make Decisions

Let’s pretend, for example, a trauma triage system has a 25% sensitivity for labeling severely injured trauma patients, and has a 98% specificity for labeling patients who do NOT have significant traumatic injuries as people who should NOT have activations.  Therefore, the odds ratios:


OR+ = 0.25 / (1 – 0.98), or 12.5


OR – = (1 – 0.25) / (0.98), or 0.76
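The two ratios above are easy to compute; here’s a sketch using the example’s sensitivity (0.25) and specificity (0.98). (At full precision the OR- comes out to about 0.765; the 0.76 above is just rounded.)

```python
def odds_ratio_positive(sn, sp):
    """OR+ = SN / (1 - SP)."""
    return sn / (1 - sp)

def odds_ratio_negative(sn, sp):
    """OR- = (1 - SN) / SP."""
    return (1 - sn) / sp

sn, sp = 0.25, 0.98
print(round(odds_ratio_positive(sn, sp), 1))  # 12.5
print(round(odds_ratio_negative(sn, sp), 3))  # 0.765
```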


Now let’s pretend EMS brings a patient, and the person determining whether to activate the system hears a lot of story and is unsure about whether to activate.  “Eh,” they say “it’s a 50 / 50 chance. I could go either way about whether to activate.” Let’s use our triage-as-a-test to see what happens in the scenario as we make anterograde decisions:


First, we assign a 50% probability that the patient has significant injuries.  After all, in this example, the person deciding whether to activate thinks it could “go either way.”  This 50% is sometimes called the pre-test probability.


To use the odds ratios, we need to convert the probability into odds.  Probability is converted to odds by the formula:


p / (1-p) = odds, where p = probability.


Here, that’s 0.50 / (1-0.5), or 1.  So that’s the “pre-test odds”.  (The odds something exists before we run the test.)


Next, let’s pretend that the triage system / info / criteria / whatever-we-use says “Don’t activate!”  So we use the OR- to modify the pre-test odds:


(1)(OR-) = post-test odds; here, that’s (1)(0.76), or post-test odds of 0.76.


Last, we convert the post-test odds back to probability.  Odds are converted back to probability by the formula:


o / (o+1) = probability


So, here, that’s (0.76 / 1.76) or 43%!


So, wait a minute!  The patient had a 50% chance (in the triage person’s mind) of being significantly injured before the triage test / criteria were used.  Now, after the triage system said “Don’t activate!” that patient still has a 43% chance of being significantly injured.  Is that a useful triage system?  Probably not, because it doesn’t have the ability to change our minds…especially about patients who may be significantly injured!
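The whole pre-test-to-post-test calculation walked through above can be sketched end-to-end:

```python
# Probability <-> odds conversions, then apply OR- for a "Don't activate!" result.
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (o + 1)

pretest_prob = 0.50                      # "could go either way"
or_negative = (1 - 0.25) / 0.98          # from sensitivity 0.25, specificity 0.98

pretest_odds = prob_to_odds(pretest_prob)     # 1.0
posttest_odds = pretest_odds * or_negative    # about 0.765
posttest_prob = odds_to_prob(posttest_odds)

print(f"{posttest_prob:.0%}")  # 43% -- the "negative" triage barely moved the needle
```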


What lets us do that math?  It’s called Bayes’ Theorem, and it introduces how to use test information to modify probabilities.  Conditional probability is an interesting topic–especially when it comes to triage.


Next, let’s discuss a few more interesting ways to measure our triage system.


One interesting way to measure a triage system is how often critically injured patients are not activated as full team traumas. This is one measure of “undertriage”. To use this measure, we would review the total number of trauma patients who did not have a full trauma team activation and who were significantly injured with an ISS of 16 or greater, and we would divide by the total number of patients who came through the system with an ISS of 16 or greater. This would demonstrate what probability, given critical injury, we had of failing to identify the patient properly. Said differently, this answers the question, “What proportion of significantly injured patients were not identified by our trauma system?”
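That undertriage measure is simple division; here’s a sketch with a few made-up patient records (not real data):

```python
# Undertriage: among patients with ISS >= 16, the fraction who did NOT
# get a full trauma team activation.  Records below are invented.
def undertriage_rate(patients):
    severe = [p for p in patients if p["iss"] >= 16]
    missed = [p for p in severe if not p["full_activation"]]
    return len(missed) / len(severe)

patients = [
    {"iss": 25, "full_activation": True},
    {"iss": 18, "full_activation": False},  # severely injured but missed
    {"iss": 9,  "full_activation": False},  # not severe; doesn't count here
    {"iss": 30, "full_activation": True},
]

print(round(undertriage_rate(patients), 2))  # 0.33 -- 1 of 3 severe patients missed
```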


Another interesting method is the matrix method, or Cribari grid. The grid answers the question “What percentage of patients, among those who were not full trauma team activations, were severely injured?”


Both methods point to the same ultimate issue: a type of error called undercontrolling. This type two statistical error reflects the risk of NOT adjusting, or under-recognizing something, even when, in fact, that thing exists. The concept of undercontrolling can be challenging to completely understand. Consider the picture below.  It always helps me.

Therefore, when we talk about undertriage in systems, it is important to determine which measure we are going to adopt and how we are going to measure ourselves going forward. In new trauma systems, or minimally developed trauma systems, there are sometimes challenges with application of the Cribari grid. Trauma registries and other trauma data sources may not fully and adequately capture all trauma patients as some may be shunted to medical services or there may be other difficulties.  It’s important to work with physician colleagues and the entire team to learn what the triage measures mean and how different ones apply.




For more information regarding concepts in triage including type two errors and thoughts on trauma triage, look here.




Featured On The Minitab Blog



Disclaimer:  I’m not affiliated with Minitab in any way…except for the fact that I find their product very useful on a daily basis!


Our friends at the Minitab blog just posted part one of a two part series involving how to create (and validate) a new measurement tool.  Look here for the post on this useful technique from the Minitab blog, and for more coverage on this important skill look here.

You’re A Programmed Coincidence Machine, And You Can Do Better

By:  DM Kashmer MD MBA FACS (@DavidKashmer)


A Few What If Scenarios

Take a minute to answer these questions.  I’m really interested in what you think.  Nothing tricky here, just some interrogatives about what we commonly experience.  Picture each situation in your mind and see where they go…


(1) Lightning strikes a dried out log during a storm, what happens next?


(2) It rains for a half an hour and the ground is soaked.  Fortunately, the sun came out while it was raining and stayed out when the rain stopped.  You look up at the sky on this sunny day just after the rain, and you expect to see what?


Ok, so what about that first situation?  A flash of lightning violently strikes a log and you watch expecting to see what?  A fire?  How about the rain storm on a fairly sunny day?  You look up at the sky and expect to see what?  A rainbow?


Guess what…usually when lightning strikes a log there’s no fire.  And in most situations where it rains, yet it’s fairly sunny, there’s no rainbow.  Why do you intuit things that aren’t really going to happen?  (I do it too.) Why is our mental simulator WAY off?


The Mental Simulator We Have Is Way Off

Here’s why:  we’ve evolved as a programmed coincidence machine.  Look here.  Or here.  (Please note:  I did NOT offer Dawkins’ argument to agree with his conclusions about the Divine…I just offer some of that work to highlight how the idea that we seek order in randomness is very common.)  Oh, and I didn’t make up that catchphrase “programmed coincidence machine”, yet it nicely captures the idea.  It is evolutionarily adaptive, so the line of reasoning goes, to notice “Hey, that makes a fire!” or “Wow, look at that unusual thing…”.  Noticing special cases is programmed into us.


Well, guess what…that leads us to lousy decisions about everything from investing to what makes us happy.  (Check out how our mental simulator fools us with respect to happiness here.) Strange, huh?  And counter-intuitive.  I file findings like this away with truisms like the Dunning-Kruger effect.


We Don’t Notice The True Message of the System

The bottom line is we don’t notice the full robustness of the situation, with all of its variation, central tendency, and beauty in the system.  We are easily distracted by special cases which don’t embody the full message of the system.  You see this all the time!


For example, what happens in the field of Surgery when a case goes wrong?  Well, it garners attention.  Sometimes we even react to the spectacular cases where the spotlight has shined, and we miss the message and robustness of the system.  We overcorrect or, worse yet, under-recognize.  Often, in classic Process Improvement systems in Healthcare, we don’t know if this latest attention-grabbing headline / case is an issue, an outlier, or exactly where it falls.  So we react (because we care) and disrupt a system with inappropriate corrections that actually induce MORE variation in outcomes.


Advanced Quality Tools Offer Some Protection

That’s why I work with these tools, and why I like to describe them.  Understanding Type 1 and Type 2 errors, working with data to represent the complete picture of a system’s variation, and knowing rigorously whether we are improving, worsening, or staying the same with respect to our performance are key to understanding whether and how to make course corrections.


I recommend using some rigorous tools to understand your team’s true performance, or else you may fall victim to the spectacular…yet distracting.


Disagree?  Have a story about being led astray by intuition?  Let me know below.


Without Data, You Just Have An Opinion

By:  David M. Kashmer MD MBA MBB (@DavidKashmer)


Do you agree with the thought that Six Sigma is 80% people and 20% math?  Whether or not you do, it’s important to realize that the 20% of the process which is math is VERY important.  As we discussed in other posts, the virtues of basing decisions on good data rather than your gut, social pressure, or other whims can’t be overstated.  As usual, we’re not saying that “feelings” and soft skills are unimportant; in fact, they’re very important.  Just as data alone isn’t enough (but is a key ingredient in consistent improvement) so too are feelings/intuition not enough when applied on their own.  Here, let’s explore an example of what good data analysis looks like–after all, without the engine of good data analysis, the quality improvement machine can’t run.


Starts With Good Data Collection

If the situation of your quality improvement project is not set up properly–well, let’s just say it’s unlikely to succeed.  We’ve discussed, here, the importance of selecting what data you will collect.  We’ve referenced how to set up a data collection plan (once over lightly) including sample size and types of endpoints.


It’s possible that the importance of setting things up properly can be overstated–but I think it’s very unlikely.  The key to the rest of the analysis we will discuss is that we have a good sample of appropriate size that collects data on the entire process we need to represent.  Yes, colleagues, that means data from the times it’s tougher to collect as well such as nights and weekends.


Requires A Clear Understanding Of What The Data Can (and Can’t) Say

The ball gets dropped, on this point, a lot.  In an earlier entry, we’ve described the importance of knowing whether, for example, your continuous data are normally distributed.  Does it make a difference?  No, it makes perhaps the difference when you go to apply a tool or hypothesis test to your data.  Look here.


Other important considerations come from knowing the limits of your data.  Were the samples representative of the system at which you’re looking?  Is the sample size adequate to detect the size of the change for which you’re looking?


You need to know what voices the data have and which they lack.


Nowadays, Often Requires Some Software

I’m sure there’s some value to learning how to perform many of the classic statistical tests by hand…but performing a multiple regression by hand?  Probably not a great use of time.  In the modern day, excellent software packages exist that can assist you in performing the tool application.


WARNING:  remember the phrase garbage in, garbage out.  (GIGO as it is termed.) These software packages are in no way a substitute for training and understanding of the tools being used.  Some attempt to guide you through the quality process; however, I haven’t seen one yet that protects you completely from poor analysis.  Also, remember, once the tool you are using spits out a nice table, test statistic, or whatever it may show:  you need to be able to review it and make sure it’s accurate and meaningful.  Easily said and not always easily done.


Two of the common, useful packages I’ve seen are SigmaXL and Minitab (with its quality suite).  SigmaXL is an Excel plug-in that makes data analysis directly from your Excel very straightforward.


Means You Need To Select The Correct Tool

We explored, here, the different tools and how they apply to your data.  (There’s a very handy reference sheet at the bottom of that entry.) If you’ve done the rest of the setup appropriately, you can select a tool to investigate the item on which you want to drill down.  Selecting the correct tool is very straightforward if the data setup and collection are done properly, because it’s almost as if you’ve reverse engineered the data collection from what it will take to satisfy modern statistical tools.  You’ve made the question and data collection which started all of this into a form that has meaning and can be answered in a rigorous fashion by common tools.


Allows A Common Understanding Of Your Situation Beyond What You “Feel”

This is my favorite part about data analysis:  sometimes it really yields magic.  For example, consider a trauma program where everything feels fine.  It’s pretty routine, in fact, that staff feel like the quality outcomes are pretty good.  (I’ve been in that position myself.) Why do we see this so commonly?  In part, it’s because services typically perform at a level of quality that yields one defect per every thousand opportunities.  Feels pretty good, right?  I mean, that’s a whole lot of doing things right before we encounter something that didn’t go as planned.


The trouble with this lull-to-sleep level of defects is that it is totally unacceptable where people’s lives are at stake.  Consider, for example, that if aviation accepted the 1 defect / 1000 opportunities level of performance, we would have roughly one plane crash each day at O’Hare airport.  Probably not ok.


Another common situation seen in trauma programs concerns timing.  For instance, whatever processes are in place may work really well from 8AM until 5PM when the hospital swells with subspecialists and other staff–but what about at night?  What about on weekends?  (In fact, trauma is sometimes called a disease of nights and weekends.) Any data taken from the process in order to demonstrate performance MUST include data from those key times.  Truly most quality improvement projects in Trauma and Acute Care Surgery must focus on both nights and weekends.


So here again we have the tension between how we feel about a process and what our data demonstrate.  The utility of the data?  It gives us a joint, non-pejorative view of our performance and spurs us toward improvement.  It makes us look ourselves squarely in the eye, as a team, and decide what we want to do to improve, or it tells us we’re doing just fine.  It puts a fine point on things.


Last, good data has the power to change our minds.  Consider a program that has always felt things are “pretty good” but has data that say otherwise.  The fact that data exist gives the possibility that the program may seek to improve, and may recover from its PGS (Pretty Good Syndrome).  In other words, part of the magic of data is that it has the power, where appropriate, to change our minds about our performance.  Maybe it shows us how we perform at night–maybe it shows us something different than we thought.  It may even tell us we’re doing a good job.


At The End Of The Day, Your Gut Is Not Enough

Issues with using your “gut” or feelings alone to make decisions include such classic problems as the fundamental attribution error, post-facto bias, and plain old mis-attribution.  It was da Vinci, if I recall, who said that “The greatest deception men suffer is from their own opinions.”  We have tools, now, to disabuse ourselves of opinion based on our experience only–let’s use them and show we’ve advanced beyond the Renaissance.  So now we come to one of the “battle cries” of Six Sigma:  without data, you just have an opinion.  Opinions are easy and everyone has one–now, in high stakes situations, let’s show some effort and work to make actual improvement.


My Data Are Non-normal…Now What?



So you are progressing through your quality improvement project and you are passing through the steps of DMAIC or some similar steps.  You finally have some good, continuous data and you would like to work on analyzing it.


You look at your data to find whether these data are normally distributed.  You likely performed the Anderson-Darling test, as described here, or some similar test to find out whether your data are normally distributed.  Oh no!  You have found that your data are non-normal.  Now what?  Beneath we discuss some of the treatments and options for non-normal data sets.


One of the frequent issues with quality improvement projects and data analysis is that people often assume their data are normally distributed when they are not.  They then go on and use statistical tests which require data that are normally distributed.  (Uh oh.)  Conclusions ensue which may or may not be justified.  After all, non-normal data sets do not allow us to utilize the familiar, comfortable statistical tests that we employ routinely.  For this reason, let’s talk about how to tell whether your data are normally distributed.


First, we review our continuous data as a histogram.  Sometimes, the histogram may look like the normal distribution to our eyes and intuition.  We call this the “eyeball test”.  Unfortunately, the eyeball test is not always accurate.  There is an explicit test, called the Anderson-Darling test, which asks whether our data deviate significantly from the normal distribution.


Incidentally, the normal distribution does not mean that all is right with the world.  Plenty of systems are known to display distributions other than the normal distribution–and they are meant to do so.  Having the normal distribution does not mean everything is OK–it’s just that we routinely see the normal distribution in nature and so call it, well, normal.  We will get to more on this later.


For now, you have reviewed your data with the eyeball test and you think they are normally distributed.  Now what?  We utilize the Anderson-Darling test to compare our data set to the normal distribution.  If the p value associated with the Anderson-Darling test statistic is GREATER than 0.05, our data do NOT deviate significantly from a normal distribution.  In other words, we can say that we have normally distributed data.  For more information with regard to the Anderson-Darling test, and its application to your data, look here.
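If you’d rather check this in code than in Minitab, here is a minimal sketch using SciPy’s version of the Anderson-Darling test.  The “door-to-CT time” data are simulated, not real measurements, and note one difference from Minitab: SciPy reports the test statistic against critical values rather than a p value.

```python
# A hedged sketch: simulated "door-to-CT time" data, not real measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
door_to_ct_minutes = rng.normal(loc=25, scale=5, size=100)  # simulated times

result = stats.anderson(door_to_ct_minutes, dist="norm")

# SciPy returns critical values at the 15%, 10%, 5%, 2.5%, and 1% levels.
# If the statistic is BELOW the 5% critical value, the data do not deviate
# significantly from the normal distribution (akin to p > 0.05 in Minitab).
crit_5pct = result.critical_values[2]
if result.statistic < crit_5pct:
    print("Data look normally distributed at the 5% level.")
else:
    print("Data deviate from the normal distribution.")
```

The same comparison works on your own continuous data set in place of the simulated one.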


So now we know whether our data are or are not normally distributed.  Next, let’s pretend that our Anderson-Darling test gave us a p value of less than 0.05 and we were forced to say that our data are not normally distributed.  There are plenty of systems in which data are not normally distributed.  Some of these include time until metal fatigue / failure and other similar systems.  Time until failure, for example, often displays a Weibull distribution.  This is just one of the many other named distributions we find in addition to the normal (aka Gaussian) distribution.  Simply because a system does not follow the normal distribution does not mean the system is wrong or somehow irrevocably broken.


Many systems, however, should follow the normal distribution.  When they do not follow it and are highly skewed in some manner, the system may be very broken.  If the normal distribution is not followed and there is not some other clear distribution, we may say that there is a big problem with one of the six causes of special variation as described here.  When data are normally distributed we routinely say the system is displaying common cause variation, and all of the causes for variation are in balance and contributing expected amounts.  Next, let’s talk about where to go from here.


When we have a non-normal data set, one option is to perform a distribution fitting.  This asks the question “If we don’t have the normal distribution, which distribution do we have?”  This is where we ask Minitab, SigmaXL, or a similar program to fit our data against known distributions and to tell us whether our distribution deviates from these other distributions.  Eventually, we may find that one particular distribution fits our data.  This is good: we now know the expected type of system for our data.  If we have non-normal data and we fit a distribution to our data, the question then becomes what we can do as far as statistical testing goes.  How can we say whether we made improvement after intervening in the system?  One of the things we can do is to use statistical tests which are not contingent on having normally distributed data.  These are infrequently used and include Mood’s median test, the Levene test, and the Kruskal-Wallis test (or KW, because that one’s not easy to say).  I have a list of tools and statistical tests used for both normal and non-normal data sets at the bottom of the blog entry here.
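As a sketch of what those tests look like in practice, here is how SciPy exposes them.  The before/after samples are invented “minutes until antibiotics” values, purely for illustration, and note that SciPy calls Mood’s median test `median_test`.

```python
from scipy import stats

# Invented "minutes until antibiotics" samples, before and after an intervention.
before = [34, 41, 29, 55, 60, 38, 47, 52, 44, 63]
after = [28, 31, 25, 40, 36, 30, 33, 38, 27, 35]

# Kruskal-Wallis: compares central tendency without assuming normality.
kw_stat, kw_p = stats.kruskal(before, after)

# Mood's median test: do the samples share a common median?
med_stat, med_p, grand_median, table = stats.median_test(before, after)

# Levene test: compares variances without requiring normal data.
lev_stat, lev_p = stats.levene(before, after)

print(f"Kruskal-Wallis p={kw_p:.3f}, Mood's median p={med_p:.3f}, Levene p={lev_p:.3f}")
```

As usual, a p value below 0.05 on one of these tests suggests a significant difference between the groups.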


So, to conclude this portion, one option for working with non-normal data sets is to perform distribution fitting and then to utilize statistical tests which do not rely on the assumption of having a normal data set.


The next option when you are faced with a non-normal data set is to transform the data so that they become normally distributed.  For example, pretend that you are measuring time for some process in your hospital.  Let’s say you have used the Anderson-Darling test and discovered that time is not normally distributed in your system.  As mentioned, you could perform distribution fitting and use non-normal data tools.  Another option is to transform the data so that they become normal.   Transforming does not mean that you have faked, or doctored, the data.  It means that you raise the variable, here time, to some power.  This can be any power, including 1/2, 2, 3, and every number in between and beyond.  It can also be a negative power, such as -2.  So you raise your time variable to different powers until the data set becomes normally distributed.  A software package like Minitab or SigmaXL will test each candidate power, called a lambda value, and will find the lambda at which your data become normally distributed according to the Anderson-Darling test.  Let’s pretend in this situation that time^2 is normally distributed according to the Anderson-Darling test.
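Here is a sketch of what the software is doing under the hood: SciPy’s `boxcox` searches the lambda values for you and returns the transform that looks most normal.  The skewed “process time” data here are simulated, not from a real hospital.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
process_minutes = rng.lognormal(mean=3.0, sigma=0.5, size=200)  # skewed, positive

# boxcox tries a range of lambda (power) values and returns the transformed
# data along with the lambda that makes them most normal; note that Box-Cox
# requires strictly positive data.
transformed, best_lambda = stats.boxcox(process_minutes)

print(f"Best lambda: {best_lambda:.2f}")
```

The `transformed` array is what you would then carry forward into your usual normal-data tests.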


This brings up a philosophic question.  We can easily feel what it means to manage the variable time.  What, however, does it mean to manage time raised to the second power?  These are questions that Six Sigma practitioners and clinical staff may ask, and, again, they are more philosophic in nature.  Next, we use our transformed data set, and remember that if we transformed the data from before we intervened in the system, we must transform the data from after we intervene to the same power.  This allows us to compare apples to apples.  Then we can utilize the routine, familiar statistical tests on this transformed data set.  We can use t tests, ANOVA, and the other tests that we typically enjoy to analyze the central tendency of the data and the data’s dispersion / variance.


This, then, represents the second option for how to deal with data that are not normally distributed:  transform the data set and utilize our routine tests.  For examples of tests that do require normal data, see the tool time Excel spreadsheet from Villanova University at the bottom of our blog entry as mentioned above.  In conclusion, working with non-normal data sets can be challenging.  We have presented the two classic options: distribution fitting followed by statistical tests that are not contingent on normality, and transforming the data with a power transform (such as the Box-Cox transform) and then utilizing the transformed data with our routine tools that require normally distributed data.


Questions, comments or thoughts about utilizing non-normal data in your quality improvement project? Leave us your thoughts below.

Use Continuous Data (!)




For the purposes of quality improvement projects, I prefer continuous to discrete data.  Here, let’s discuss the importance of classifying data as discrete or continuous and the influence this can have over your quality improvement project.  For those of you who want to skip to the headlines: continuous data is preferable to discrete data for your quality improvement project because you can do a lot more with a lot less of it.


First, let’s define continuous versus discrete data.  Continuous data is data that is infinitely divisible–data you can keep dividing ad infinitum.  Time is a good example: one hour can be divided into two groups of thirty minutes, minutes can be divided into seconds, and seconds can continue to be divided on down the line.  Contrast this with discrete data:  discrete data is data which is, in short, not continuous.  (Revolutionary definition, I know.) Things like percentages, levels, and colors come in divided packets and so can be called discrete.


Now that we have our definitions sorted, let’s talk about why discrete data can be so challenging.  First, when we go to sample directly from a system, discrete data often demand a larger sample size.  Consider our simple sample size equation for how big a sample of discrete data we need to detect a change:


(p)(1-p) (2 / delta)^2.


This sample size equation for discrete data has several important consequences.  First, consider the terms.  Here, p is the probability of a certain event occurring.  This is for percentage-type data where we have a yes or no, go or stop, etc.  The delta is the smallest change we want to be able to detect with our sample.


The 2 in the equation comes from the (approximate) z-score at the 95% level of confidence.  We round up from the true value of z (about 1.96) to 2 because that gives us a whole-number sample slightly larger than what’s required rather than a sample with a fraction in it.  (How do you have 29.2 of a patient, for example?) Rounding up is important, too, because rounding down would yield a sample that is slightly too small.


In truth, there are many other factors in sampling besides mere sample size.  However, notice what happens when we work through this sample size equation for discrete data.  Let’s say we have an event that has a 5% probability of occurring. This would be fairly typical for many things in medicine, such as wound infections in contaminated wounds.  Working through the sample size equation, in order to detect a 2% change in that percentage, we have 0.05 x 0.95 x (2 / 0.02)^2.  This gives us approximately 475 samples required in order to detect a smallest possible change of 2%.  In other words, we need a fairly large sample to see a reasonable change.  We can’t detect a change of 1% with that sample size, so if we see 4.8% as the new percentage after interventions to change wound infections…well, by force of our sample size, we can’t really say whether anything has changed.
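The arithmetic above is easy to wrap in a few lines of Python (the function name is just for illustration):

```python
import math

def discrete_sample_size(p: float, delta: float) -> int:
    """n = p(1 - p)(2 / delta)^2, rounded up to a whole patient.

    The round() guards against floating point fuzz before we take the ceiling.
    """
    return math.ceil(round(p * (1 - p) * (2 / delta) ** 2, 10))

# The wound infection example from the text: p = 5%, smallest change = 2%.
print(discrete_sample_size(0.05, 0.02))  # 475
```

Note how quickly the required sample grows as the detectable change shrinks: halving delta quadruples the sample size.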


One more thing:  don’t know the probability of an event because you’ve never measured it before?  Well, make your best guess.  Many of us use 0.5 as the p if we really have no idea.  Some sample size calculation is better than none, and you can always revise the p as you start to collect data from a system and you get a sense of what the p actually is.


Now let’s consider continuous data.  For continuous data, sample size required to detect some delta at 95% level of confidence can be represented as



( [2][historic standard deviation of the data] / delta)^2.


When we plug numbers into this simplified sample size equation we see very quickly that we have much smaller samples of data required to show significant change.  This is one of the main reasons why I prefer continuous to discrete data.  Smaller sample sizes can show meaningful change.  However, for many of the different end points you will be collecting in your quality project, you will need both.  Remember, as with the discrete data equation, you set the delta as the smallest change you want to be able to find with your data collection project.


Interesting trick:  if you don’t know the historic standard deviation of your data (or you don’t have one), take the highest value of your continuous data and subtract the lowest.  Then divide what you get by 3.  Voilà…an estimate of the historic standard deviation.
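Both the continuous-data formula and the range/3 trick are easy to sketch in Python (the function names and the numbers plugged in are illustrative only):

```python
import math

def continuous_sample_size(historic_sd: float, delta: float) -> int:
    """n = ((2 * historic_sd) / delta)^2, rounded up to a whole sample."""
    return math.ceil(round((2 * historic_sd / delta) ** 2, 10))

def estimate_historic_sd(data: list) -> float:
    """The trick from the text: (highest value - lowest value) / 3."""
    return (max(data) - min(data)) / 3

# e.g. times with a historic sd of 10 minutes, detecting a 5 minute change:
print(continuous_sample_size(10, 5))  # 16
```

Sixteen samples versus 475 for the discrete example above–that contrast is the whole argument for continuous data in one line.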


Another reason why continuous data is preferable to discrete data is the number of powerful tools it unlocks.  Continuous data allows us to use many other quality tools such as Cpk, data power transforms, and useful hypothesis testing. This can be more challenging with discrete data.  One of the best ways we have seen to represent discrete data is the Pareto diagram.  For more information regarding the Pareto diagram, visit here.


Other than the Pareto diagram and a few other useful tools, discrete data presents us with more challenges for analysis.  Yes, there are statistical tests, such as the chi-squared proportions test, that can determine statistical significance.  However, continuous data plainly opens up a wider array of options for us.


Having continuous data often allows us to make better visual representations and gives our team a view of the robustness of the process along with the current level of variation in the process.  This can be more challenging with discrete data endpoints.


In conclusion, I like continuous data more than discrete data and I use it wherever I can in a project.  Continuous data endpoints often allow better visualization of variation in a process.  They also require smaller sample sizes and unlock a fuller cabinet of tools which we can use to demonstrate our current level of performance.  In your next healthcare quality improvement project, be sure to use continuous data points where possible and life will be easier!


Disagree?  Do you like discrete data better or argue “proper data for proper questions”?  Let us know!




These Two Tools Are More Powerful Together






Using two quality improvement tools together can be more powerful than using one alone. One great example is the use of the fishbone diagram and multiple regression as a highly complementary combination.  In this entry, let’s explore how these two tools, together, can give powerful insight and decision-making direction to your system.


You may have heard of a fishbone, or Ishikawa diagram, previously. This diagram highlights the multiple causes for special cause variation.  From previous blog entries, recall that special cause variation may be loosely defined as variation above and beyond the normal variation seen in a system.  These categories are often used in root cause analysis in hospitals.  See Figure 1.


Figure 1:  Fishbone (Ishikawa) diagram example


As you also may recall from previous discussions, there are six categories of special cause variation. These are sometimes called the “6 Ms” or “5 Ms and one P”. They are Man, Materials, Machine, Method, Mother Nature and Management (the 6 Ms).  We can replace the word “man” with the word “people” to obtain the 5 Ms and one P version of the mnemonic device.  In any event, the point is that an Ishikawa diagram is a powerful tool for demonstrating the root cause of different defects.


Although fishbone diagrams are intuitively satisfying, they can also be very frustrating.  For example, once a team has met and has created a fishbone diagram, well…now what?  Other than opinion, there really is no data to demonstrate that what the team THINKS is associated with the defect / outcome variable is actually associated with that outcome.  In other words, the Ishikawa represents the team’s opinions and intuitions.  But is it actionable?  In other words, can we take action based on the diagram and expect tangible improvements?  Who knows.  This is what’s challenging about fishbones:  we feel good about them, yet can we ever regard them as more than just a team’s opinion about a system?


Using another tool alongside the fishbone makes for greater insight and more actionable data.  We can more rigorously demonstrate that the outcome / variable / defect is directly and significantly related to those elements of the fishbone about which we have hypothesized with the group.  For this reason, we typically advocate taking that fishbone diagram and utilizing it to frame a multiple regression.  Here’s how.


We do this in several steps.  First, we label each portion of the fishbone as “controllable” or “noise”.  Said differently, we try to get a sense of which factors we have control over and which we don’t.  For example, we cannot control the weather.  If sunny weather is significantly related to the number of patients on the trauma service, well, so it is and we can’t change it.  Weather is not controllable by us.  When we perform our multiple regression, we do so with all identified factors labeled as controllable or not.  Each is embodied in the multiple regression model.  Then, depending on how well the model fits the data, we may decide to see what happens if the elements that are beyond our control are removed from the model such that only the controllable elements are used.  Let me explain this interesting technique in greater detail.


Pretend we create the fishbone diagram in a meeting with stakeholders. This lets us know, intuitively, what factors are related to different measures. We sometimes talk about the fishbone as a hunt for Y=f(x) where Y is the outcome we’re considering and it represents a function of underlying x’s. The candidate underlying x’s (which may or may not be significantly associated with Y) are identified with the fishbone diagram.  Next, we try to identify which fishbone elements are ones for which we have useful data already.  We may have rigorous data from some source that we believe. Also, we may need to collect data on our system. Therefore, it bears saying that we take specific time to try to identify those x’s about which we have data.  We then establish a data collection plan. Remember, all the data for the model should be over a similar time period.  That is, we can’t have data from one time period and mix it with another time period to predict a Y value or outcome at some other time.  In performing all this, we label the candidate x’s as controllable or noise (non-controllable).


Next, we seek to create a multiple regression model with Minitab or some other program. There are lots of ways to do this, and some of the specifics are ideas we routinely teach to Lean Six Sigma practitioners or clients. These include the use of dummy variables for data questions that are yes/no, such as “was it sunny or not?” (You can use 0 as no and 1 as yes in your model.) Next, we perform the regression and try to account for confounding if we think two or more x’s are clearly related.  (We will describe this more in a later blog entry on confounding.) Finally, when we review the multiple regression output, we look for an r^2 value of greater than 0.80. This indicates that more than 80% of the variability in our outcome data, or our Y, is explained by the x’s that are in the model. We prefer higher r^2 and r^2 adjusted values; r^2 adjusted is a more stringent measure based on the specifics of your data, and we like both to be high.
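For readers without Minitab, here is a minimal sketch of the same idea using only NumPy, with invented monthly data: divert hours regressed on staffed ICU beds and a sunny/not-sunny dummy variable.  A full package such as Minitab would also report p values for each x; this sketch shows just the fit and the r^2.

```python
import numpy as np

# Invented monthly data, purely for illustration.
icu_beds = np.array([10, 12, 8, 9, 14, 11, 7, 13], dtype=float)    # staffed ICU beds
sunny = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)            # dummy: 1 = sunny
divert_hours = np.array([30, 18, 42, 38, 10, 22, 48, 14], dtype=float)

# Design matrix: an intercept column plus the two x's.
X = np.column_stack([np.ones(len(icu_beds)), icu_beds, sunny])
coef, *_ = np.linalg.lstsq(X, divert_hours, rcond=None)

# r^2: the fraction of the variability in Y explained by the model.
predicted = X @ coef
ss_res = np.sum((divert_hours - predicted) ** 2)
ss_tot = np.sum((divert_hours - divert_hours.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"r^2 = {r_squared:.2f}")
```

Dropping the `sunny` column from the design matrix and refitting is exactly the “remove the noise factors and run the model again” step described below.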


Next we look at the p values associated with each of the x’s to determine whether any of the x’s affect the Y in a statistically significant manner.  As a final and interesting step we remove those factors that we cannot control and run the model again so as to determine what portion of the outcome is in our control.  We ask the question “What portion of the variability in the outcome data is in our control per choices we can make?”


So, at the end of the day, the Ishikawa / fishbone diagram and the multiple regression are powerful tools that complement each other well.


Next, let me highlight an example of a multiple regression analysis, in combination with a fishbone, and its application to the real world of healthcare:


A trauma center had issues with perceived excess time on “diversion”, or that time in which patients are not being accepted and so are diverted to other centers. The center had more than 200 hours of diversion over a brief time period.  For that reason, the administration was floating multiple reasons why this was occurring.  Clearly, diversion could impact quality of care for injured patients (because they would need to travel further to reach another center) and could represent lost revenue.


Candidate reasons included an idea that the emergency room physicians (names changed in the figure below) were just not talented enough to avoid the situation.  Other reasons included the weather, and still others included lack of availability of regular hospital floor beds.  The system was at a loss for where to start, and it was challenging for everyone to be on the same page with respect to what to do next with this complex issue.


For this reason, the trauma and acute care surgery team performed an Ishikawa diagram with relevant stakeholders and combined this with the technique of multiple regression to allow for sophisticated analysis and decision making.  See Figure 2.


Figure 2:  Multiple regression output


Variables utilized included the emergency room provider who was working when the diversion occurred (as they had been impugned previously), the day of the week, the weather, and the availability of intensive care unit beds, to name just a sample.  The final regression result gave an r^2 value less than 0.80 and, interestingly, the only variable which reached significance was the presence or absence of ICU beds.  How do we interpret this?  The variables included in the model explain less than 80% of the variation in the amount of time the hospital was in a state of diversion (“on divert”) for the month.  However, we can say that the availability of ICU beds is significantly associated with whether the hospital was “on divert”.  Fewer ICU beds were associated with increased time on divert.  This gave the system a starting point to correct the issue.


Just as important was what the model did NOT show.  The diversion issue was NOT associated significantly with the emergency room doctor.  Again, as we’ve found before, data can help foster positive relationships.  Here, it disabused the staff and the rest of administration of the idea that the emergency room providers were somehow responsible for (or associated with) the diversion issue.


The ICU was expanded in terms of available nursing staff, which allowed more staffed beds and made the ICU more available to accept patients. The issue was staffed beds, and so the hospital realized that hiring more nursing staff was one needed intervention.  This led to a recruitment and retention push and, shortly thereafter, an increase in the number of staffed beds.  The diversion challenge resolved immediately once the additional staff was available.


In conclusion, you can see how the fishbone diagram, when combined with multiple regression, is a very powerful technique to determine which issues underlie the seemingly complex choices we make on a daily basis.  In the example above, a trauma center utilized these powerful techniques together to resolve a difficult problem. At the end of the day, consider utilizing a fishbone diagram in conjunction with a multiple regression to help make complex decisions in our data-intensive world.


Thoughts, questions, or feedback regarding your use of multiple regression or fishbone diagram techniques? We would love to hear from you.