Some of the things I learned as a junior surgical resident were oversimplified. One of these is the idea that a p value less than 0.01 is “good”. In this entry we discuss when it is appropriate for a p value to be greater than 0.01, and those times when it’s even favorable to have one greater than 0.05. We invite you to take some time with this blog entry, as it covers some of the most interesting facts we have found about hypothesis testing and statistics.
In Lean and Six Sigma, much of what we do is take existing statistical tools and apply them to business scenarios so that we have a more rigorous approach to process improvement. Although we call the process Six Sigma or Lean depending on the toolset we are using, in fact, both are pathways to set up a sampling plan, capture data, and rigorously test those data to determine whether we are doing better or worse after certain system changes–and we get this done with people, as a people sport, in a complex organization. Personally, I have found that the value in using data to tell us how we are doing is that it disabuses us of instances where we think we are doing well and we are not. It also focuses our team on team factors and system factors, which are, in fact, responsible for most defects. Using data prevents us from becoming defensive or angry at ourselves and our colleagues. That said, there are some interesting facts about hypothesis testing that many of us knew nothing about as surgical residents. In particular, consider the idea of the p value.
Did you know, for example, that you actually set certain characteristics of your hypothesis testing when you design your experiment or data collection? For example, when you are designing a project or experiment, you need to decide at what level you will set your alpha. (This relates to p values in just a moment.) The alpha is the risk of making a type 1 error. For more information about type 1 errors, please visit our earlier blog entry about type 1 and type 2 errors here. For now, let’s leave it at saying the alpha risk is the risk of tampering with a system that is ok; that is, alpha is the risk of thinking there is an effect or change when in fact there is no legitimate effect or change. So, when we set up an experiment or data collection, we set the alpha risk inherent in our hypothesis testing. Of course, there are certain conventions in medical literature that determine what alpha level we accept.
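Because alpha is simply the long-run false-positive rate, a quick simulation can make it concrete. Here is a minimal sketch in Python (the group sizes, means, and the 10% alpha are illustrative assumptions, not numbers from this post): when there is truly no change, roughly alpha of our “experiments” will still flag a difference.

```python
import numpy as np
from scipy import stats

# Simulate many experiments where the null hypothesis is TRUE by
# construction (pre and post data come from the SAME process), and count
# how often we would wrongly declare a difference at alpha = 0.10.
rng = np.random.default_rng(42)
alpha = 0.10
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    before = rng.normal(loc=50, scale=5, size=30)  # "pre-change" data
    after = rng.normal(loc=50, scale=5, size=30)   # "post-change" data, same process
    _, p = stats.ttest_ind(before, after)
    if p < alpha:
        false_positives += 1

# The observed false-positive rate should hover near the alpha we chose.
print(f"False-positive rate: {false_positives / n_experiments:.3f}")
```

Running this, the observed rate lands near 0.10 — exactly the tampering risk we agreed to accept when we set alpha.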
Be careful, by the way, because alpha is used in other fields too. For example, in investing, alpha is the amount of return on your mutual fund investment that you will get IN EXCESS of the risk inherent in investing in the mutual fund. In that context, alpha is a great thing. There are even investor blogs out there that focus on how to find and get this extra return above and beyond the level of risk you take by investing. If you’re ever interested, visit seekingalpha.com.
Anyhow, let’s pretend here that we are willing to accept a 10% risk of concluding that there is some change or difference in our post-changes-we-made state when in fact there is no actual difference (10% alpha). In most cases the difference we see could vary in either direction: our values post changes could be either higher or lower than they were pre changes. For this reason, it is customary to use what is called a two tailed p value. The alpha risk is split between the two tails of the distribution (ie, the values post changes are higher, or lower, than by chance alone), so with a 10% alpha split as 5% in each tail, we would say that if the two tailed p value is greater than 0.10 we conclude there is no significant difference in our data between the pre and post changes we made to a system.
The take home is that we decide, before we collect data (to keep the ethics of it clean), how we will test these data to conclude whether there is a change or difference between the two states. We determine what we will count as a statistically significant change based on the conditions we set: what alpha risk is too high to be acceptable in our estimation?
Sometimes, if we have reason to suspect the data can vary in only one direction (such as prior evidence indicating an effect in only one direction, or some other factor), we may use a one tailed p value. A one tailed p value simply means that all of our alpha risk is lumped in one tail of the distribution. In either case, we should decide how we will test our data before we collect them. Of course, in real life, sometimes data already exist that are high quality (clear operating definition, etc.) and we need to analyze them for some project.
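The one-tailed versus two-tailed distinction is easy to see in code. Below is a sketch using scipy’s two-sample t test; the operating-time numbers are invented for illustration, and the `alternative` keyword requires a reasonably recent scipy. When the effect falls in the expected direction, the one-tailed p value is half the two-tailed one — the alpha risk is no longer split.

```python
import numpy as np
from scipy import stats

# Hypothetical example: operating times (minutes) before and after a
# process change that we expect could only REDUCE the times.
rng = np.random.default_rng(7)
before = rng.normal(loc=120, scale=15, size=35)
after = rng.normal(loc=105, scale=15, size=35)

# Two-tailed: alpha risk split between both directions.
_, p_two = stats.ttest_ind(before, after, alternative="two-sided")

# One-tailed: all alpha risk lumped in one tail ("before" greater than
# "after", i.e. the change only shortens times).
_, p_one = stats.ttest_ind(before, after, alternative="greater")

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

This is also why a one-tailed test must be justified before data collection: choosing it after peeking at the data quietly doubles your effective alpha.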
Next, let’s build up to when it’s good to have a p > 0.05. After all, that was the teaser for this entry. This brings us to some other interesting facts about data collection and the sampling methods by which we do it. In Lean and Six Sigma, we tend to classify data as either discrete or continuous. Discrete data are, for example, yes or no data: they can take only certain defined categories, such as red / yellow / blue, yes / no, or black / white / grey. Continuous data, by contrast, are infinitely divisible. One way I have heard continuous data described, which I use when I teach, is that continuous data can be divided in half forever and still make sense. An hour can be divided into two groups of 30 minutes, minutes can be divided into seconds, and seconds can continue to be divided. This infinitely divisible type of data is continuous and makes a continuous curve when plotted. In Lean and Six Sigma we attempt to utilize continuous data whenever possible. Why? The answer makes for some interesting facts about sampling.
First, did you know that we need much smaller samples of continuous data in order to be able to demonstrate statistically significant changes? Consider a boiled down sampling equation for continuous data versus discrete data. A sampling equation for continuous data is (2s/delta)^2, where s is the historic standard deviation of the data and delta is the smallest change you want to be able to detect. The 2 is the z score at the 95% level of confidence (1.96, rounded up). For now, just remember that this is a generic, conservative sampling equation for continuous data.
Now let’s look at a sampling equation for discrete data: p(1-p)(2/delta)^2. Let’s plug in what it would take to detect a 10% difference in discrete data. Using p = 50% for the probability of yes or no, we find 0.5 × 0.5 × (2/0.10)^2 = 100 data points to detect that fairly small change. For continuous data, using similar methodology, we need much smaller samples; for reasonably small deltas this may be only 35 data points or so. Again, this is why Lean and Six Sigma utilize continuous data whenever possible. So, now, we focus on some sampling methodology issues and the nature of what a p value is.
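The two boiled-down sampling equations above are easy to turn into a small calculator. The discrete example (p = 50%, delta = 10%) is from the text; the continuous example numbers (a standard deviation of 6 minutes, detecting a 2-minute shift) are illustrative assumptions of ours:

```python
# Boiled-down sampling equations from the text.
# Continuous: n = (2*s / delta)**2
# Discrete:   n = p*(1-p) * (2/delta)**2

def n_continuous(s, delta):
    """Sample size to detect a change of `delta` given historic std dev `s`."""
    return (2 * s / delta) ** 2

def n_discrete(p, delta):
    """Sample size to detect a `delta` shift in a proportion near `p`."""
    return p * (1 - p) * (2 / delta) ** 2

# Detecting a 10% shift in a yes/no proportion near 50%:
print(n_discrete(0.5, 0.10))   # 0.25 * 400 = 100 data points

# Detecting a 2-minute shift in a time with a historic standard
# deviation of 6 minutes (illustrative numbers, not from the text):
print(n_continuous(6, 2))      # (12/2)**2 = 36 data points
```

About 36 continuous data points versus 100 discrete ones for comparably small changes — the arithmetic itself explains why we chase continuous data.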
Next, consider the nature of statistical testing and some things that you may not have learned in school. For example, did you know that underlying most of the common statistical tests is the assumption that the data involved are normally distributed? Normally distributed means the data, plotted as a histogram, follow a Gaussian curve. However, in the real world of business, manufacturing, and healthcare, it is often not the case that data are actually distributed normally. Sometimes data may be plotted and look normally distributed when in fact they are not. This invalidates some of the assumptions behind common statistical tests; in other words, we can’t use a t test on data that are not normally distributed. Student’s t test, for example, assumes that the data are normally distributed. What can we do in this situation?
First, we can rigorously test our data to determine if they are normally distributed. There is a named test, the Anderson-Darling test, that compares our data’s distribution to the normal distribution. If the p value for the Anderson-Darling test is greater than 0.05, our data do not deviate significantly from the normal distribution. In other words, if the Anderson-Darling test statistic’s accompanying p value is greater than 0.05, we conclude that our data are consistent with a normal distribution and we can use the common statistical tests that are known and loved by general surgery residents (and beyond) everywhere. However, if the Anderson-Darling test indicates that our data are not normally distributed (that is, the p value is less than 0.05), we must look for alternative ways to test our data. This was very interesting to me when I first learned it. In other words, a p value greater than 0.05 can be good, especially if we are looking to demonstrate that our data are normal so that we can go on and use hypothesis tests which require normally distributed data. Here are some screen captures that highlight Anderson-Darling. Note that, in Fig 1., the data DON’T appear to be normally distributed by the “eyeball test” (the “eyeball test” is when we just look at the data and go with our gut). Yet, in fact, the data ARE normally distributed and p > 0.05. Figure 2 highlights how a data distribution follows the expected frequencies of the normal distribution.
Figure 1: A histogram with its associated Anderson-Darling test statistic and p value > 0.05. Here, p > 0.05 means these data do NOT deviate from the normal distribution…and that’s a good thing if you want to use hypothesis tests that assume your data are normally distributed.
Figure 2: These data follow the expected frequencies associated with the normal distribution. The small plot in Figure 2 demonstrates the frequencies of data in the distribution versus those of the normal distribution.
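If you want to try the Anderson-Darling test yourself, here is a minimal sketch using scipy (the data are simulated for illustration). One wrinkle worth knowing: scipy’s `anderson` reports the test statistic alongside critical values at several significance levels rather than a single p value, so “statistic below the 5% critical value” plays the same role as “p > 0.05” in the discussion above.

```python
import numpy as np
from scipy import stats

# Simulated, genuinely normal data (seeded for reproducibility).
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=10, size=200)

# scipy's Anderson-Darling test for normality.
result = stats.anderson(data, dist="norm")
print(f"A-D statistic: {result.statistic:.3f}")
for crit, sig in zip(result.critical_values, result.significance_level):
    print(f"  {sig:>5}%: critical value {crit:.3f}")

# If the statistic is BELOW the 5% critical value, we fail to reject
# normality -- the same conclusion as an Anderson-Darling p value > 0.05.
idx_5pct = list(result.significance_level).index(5.0)
looks_normal = result.statistic < result.critical_values[idx_5pct]
print("Consistent with a normal distribution:", looks_normal)
```

Statistical packages such as Minitab report the same test with an explicit p value, which is the form shown in the figures above.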
As with most things, the message that a p value less than 0.01 is good and one greater than 0.01 is bad is a vast oversimplification. However, it is probably useful as we teach statistics to general surgery residents and beyond.
So, now that you have a methodology for determining whether your data are or are not normally distributed, let’s progress to what to do next–especially when you find that your data are NOT normally distributed and you wonder where to go. In general, there are two options when we have continuous data sets that are NOT normally distributed. One is to transform these data sets with what is called a power transformation. There are many different power transformations, including the Box-Cox transformation and the Johnson transformation, to name a few.
The power transforms take the raw, non-normally distributed data and raise them to different powers: the 1/2 power (taking the square root), the second power, the third power, the fourth power, and so on. The power that brings the data closest to the normal distribution is identified, the data are replotted as transformed data to that power, and then the Anderson-Darling test (or a similar test) is performed on the transformed data to determine whether they are now normally distributed.
Often the power transformations will allow the data to become normally distributed. This brings up an interesting point: pretend we are looking at a system where time is the focus. The data are not normally distributed, and we perform a power transform which demonstrates that time squared is a normally distributed variable. Interestingly, this raises a philosophical management question: what does it mean to manage time squared instead of time? These and other interesting questions arise when we use power transforms, and their use is somewhat controversial for that reason. Sometimes it is challenging to know whether the transformed variables have meaning for management.
However, on the bright side, if we have successfully “Box-Cox-ed” or otherwise power-transformed the data to normality, we can now use the common statistical tests. Remember: if the initial data set is transformed, subsequent data must be transformed to the same power. We have to compare apples to apples.
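Here is a brief sketch of the Box-Cox transform in scipy; the skewed “time” data are simulated for illustration. scipy searches over powers (lambdas) and returns the transformed data together with the optimal lambda, and — per the apples-to-apples rule above — any later data set must be transformed with that same lambda:

```python
import numpy as np
from scipy import stats

# Skewed, non-normal data -- e.g. lognormal "time" measurements.
rng = np.random.default_rng(1)
times = rng.lognormal(mean=3.0, sigma=0.5, size=200)

# Box-Cox searches over power transforms and returns the transformed
# data plus the optimal lambda (the power it settled on). Lognormal
# data should push lambda toward 0, which corresponds to the log transform.
transformed, best_lambda = stats.boxcox(times)
print(f"Optimal lambda: {best_lambda:.3f}")

# Subsequent data compared against this set must use the SAME lambda:
new_times = rng.lognormal(mean=3.0, sigma=0.5, size=50)
new_transformed = stats.boxcox(new_times, lmbda=best_lambda)
```

Note that Box-Cox requires strictly positive data, which is one reason alternatives such as the Johnson transformation exist.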
The next option for dealing with a non-normal data set is to utilize statistical tests which do not require normally distributed input. These include such rarely used tests as the Levene test, which compares variability (variance) between groups without assuming normality, and the so-called KW or Kruskal-Wallis test, a rank-based alternative to one-way ANOVA for comparing groups. Another test, the Mood’s median test, compares the median values of non-normal data sets. So, again, we have several options for how to address non-normal data sets. Usually, as we teach the Lean and Six Sigma process, we reserve teaching about how to deal with non-normal data for at least a black belt level of understanding.
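All three of those tests are available in scipy, so a quick sketch can show what each one asks of the data. The three groups of skewed “cycle time” data below are hypothetical, invented for illustration:

```python
import numpy as np
from scipy import stats

# Three hypothetical groups of skewed (non-normal) cycle times.
rng = np.random.default_rng(3)
g1 = rng.lognormal(mean=3.0, sigma=0.4, size=40)
g2 = rng.lognormal(mean=3.0, sigma=0.4, size=40)
g3 = rng.lognormal(mean=3.3, sigma=0.4, size=40)  # shifted group

# Levene: are the group variances equal? (No normality assumption.)
lev_stat, lev_p = stats.levene(g1, g2, g3)

# Kruskal-Wallis: rank-based alternative to one-way ANOVA --
# do the groups differ in location?
kw_stat, kw_p = stats.kruskal(g1, g2, g3)

# Mood's median test: do the groups share a common median?
med_stat, med_p, grand_median, table = stats.median_test(g1, g2, g3)

print(f"Levene p = {lev_p:.4f}")
print(f"Kruskal-Wallis p = {kw_p:.4f}")
print(f"Mood's median p = {med_p:.4f}")
```

Each test answers a slightly different question (variances versus location versus medians), which is why choosing among them is reserved for the black belt level.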
At the end of the day, this blog post explores some interesting consequences of the choices we make with respect to data and some interesting facts about hypothesis testing. There is much more choice involved than I ever understood as a general surgical resident. Eventually, working through the Lean and Six Sigma courses (and finally the master black belt course) taught me about the importance of how we manage data and, in fact, ourselves. The more than 10 projects in which I have participated have really highlighted these facts about data and reinforced textbook learning.
An interesting take home message is that a p value less than 0.01 does not mean all is right with the world, just as a p value greater than 0.05 is not necessarily bad. After all, tests like the Anderson-Darling test tell us when our data are normally distributed and when we can continue using the more comfortable hypothesis tests that assume normally distributed data. In this blog post, we described some of the interesting ways to deal with data that are not normally distributed so as to improve our understanding and conclusions based on continuous data sets. Whenever possible, we favor continuous data because they require a smaller sample size to support meaningful conclusions. However, as with all sampling, we have to be sure that our continuous data sample adequately represents the system we are attempting to characterize.
Our team hopes you enjoyed this review of some interesting statistics related to the nature and complexity of p values. As always, we invite your input as statisticians or mathematicians, especially if you have special expertise or interest in these topics. None of us, as Lean or Six Sigma practitioners, claims to be a statistician or mathematician. However, the Lean and Six Sigma process is extremely valuable in applying classic statistical tools to business decision-making. In our experience, this approach to data-driven decision making has yielded vast improvements in how we practice in business systems compared to models based on opinion or personal experience.
As a parting gift, please enjoy (and use!) the file below to help you select the right tool to analyze your data. This tool, taken from Villanova’s Master Black Belt Course, helps me a great deal on a weekly basis. No viruses or spam from me involved, I promise!