golfunfiltered
Johnny Unhappy Pro
Fair warning: this is gonna be long and nerdy.
Having the opportunity to interview and meet many of the biggest names in golf equipment has been extremely exciting for me over the years. Learning about the thought processes that go into the design of a new club or golf ball is fascinating, and the money that's invested to help us enjoy the game can be staggering.
That's why I think it is SO important to understand that when those products are tested by consumers, whether via a blog, a club demo or a custom fitting, a fundamental understanding of what it means to "improve" goes a long way.
When I'm not littering the internet with my golf thoughts, I work in process improvement during the day. A big part of that deals with statistics, specifically the concept of proving something through hypothesis testing and statistical significance. This can only be done through the establishment of a sound measurement system, used in a properly designed experiment, which includes appropriate sample sizes and testing methods.
For example, let's say you go to your local demo day where a company rep has a Trackman set up on the range. You grab the newest driver from a bouquet of options and step up to the tee. You've got your current driver as well to compare to the new one.
How many golf balls should you hit with each driver before you can safely say you've got the appropriate sample size? Furthermore, how do you analyze the comparison data to determine if one driver truly outperforms the other?
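There's a standard back-of-the-envelope answer to the sample-size question for comparing two means. Here's a rough sketch in Python, using the usual normal-approximation formula with a 0.05 significance level and 80% power. The shot-to-shot standard deviation (10 yards) and the smallest difference worth detecting (5 yards) are illustrative numbers I've picked, not anything from an actual Trackman session:

```python
import math

def drives_per_club(sd, diff, z_alpha=1.96, z_beta=0.8416):
    """Approximate drives needed PER CLUB to detect a true difference
    of `diff` yards in mean carry, given shot-to-shot standard
    deviation `sd`, at alpha = 0.05 (two-sided) and 80% power.
    Uses the standard two-sample normal-approximation formula."""
    return math.ceil(2 * ((z_alpha + z_beta) * sd / diff) ** 2)

# If your carry distance varies by about 10 yards shot to shot and you
# want to reliably detect a 5-yard difference between drivers:
print(drives_per_club(sd=10, diff=5))  # 63 drives per driver
```

Sixty-three drives with each club, before fatigue sets in, is why a handful of swings at a demo day tells you far less than it feels like it does.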
More often than not, most of us will focus on carry distance and shot dispersion. Those are real metrics we can gather from Trackman, and that data is usually reported back to us as an average.
But there's a problem with averages. They are awfully touchy when it comes to outliers, especially when our sample size is low (<30). I contend this is the worst way to compare two products against each other, preferring to use the median (50% of data above, 50% of data below) of the data set, but that's controversial and I digress.
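To see how touchy the average really is, here's a quick sketch using Python's built-in statistics module. The carry distances are made up, but the shape is familiar: a tight cluster of good swings plus one cold-topped mishit.

```python
from statistics import mean, median

# Hypothetical carry distances (yards): eight solid swings and one mishit.
carries = [248, 251, 246, 253, 249, 250, 247, 252, 190]

print(round(mean(carries), 1))  # 242.9 -- dragged down by the one 190
print(median(carries))          # 249   -- unmoved by the outlier
```

One bad swing pulls the average roughly six yards below where you actually hit the club, while the median stays put. That's the whole argument for the median in small samples.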
Once you have your data sets ready, the next best thing is to test for statistical significance. This brings up the concepts of a null hypothesis and p-value.
At a high level, the null hypothesis suggests that there is no difference between the two samples being compared. It's the assumption you go into every experiment with, so as not to bias yourself toward one sample or the other. What you are testing, then, is whether you should reject or fail to reject the null hypothesis. To do that, you need to find the p-value.
A p-value is "the level of marginal significance within a statistical hypothesis test representing the probability of the occurrence of a given event." In other words, it's the probability of seeing a difference at least as large as the one in front of you purely by chance, assuming the null hypothesis is true. In many industries the standard cutoff is 0.05 (stricter fields, like healthcare, often use 0.01). A p-value below 0.05 means results that extreme would show up less than 5% of the time by chance alone, so we reject the null hypothesis and can say with some level of confidence there is a real difference between the two samples.
So, to go back to our example, we should go into the driver test with the null hypothesis that there is no difference between the two drivers. With an appropriate sample of drives from each club, we can then collect the data and compare the two means (carry distance, for example). This process -- called hypothesis testing -- will produce a p-value. If the p-value is less than 0.05, we can conclude there is a statistically significant difference between the two drivers being tested.
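If you want to run that test yourself, you don't even need fancy software. A permutation test does the job with nothing but the Python standard library: pool all the drives together, shuffle them between the two "clubs" thousands of times, and ask how often random shuffling produces a gap in means as big as the one you observed. That frequency is your p-value. The carry numbers below are invented for illustration:

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_iter=10_000, seed=1):
    """Two-sided permutation test for a difference in means.
    Returns the fraction of random relabelings whose mean gap
    is at least as large as the observed one (the p-value)."""
    random.seed(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        random.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter

# Hypothetical carry distances (yards) from eight drives per club:
current_driver = [245, 250, 248, 252, 246, 249, 251, 247]  # mean 248.5
new_driver     = [251, 255, 249, 257, 252, 254, 250, 253]  # mean 252.625

p = permutation_p_value(current_driver, new_driver)
print(p)  # small p -> reject the null; the ~4-yard gap is unlikely to be luck
```

With these made-up numbers the p-value comes out well under 0.05, so we'd reject the null hypothesis. Swap the tight clusters for noisier, overlapping data and the p-value climbs, which is exactly the point: eight drives only settles the question when the clubs are genuinely far apart.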
But all of that is boring and confusing, right?
Instead, what many choose to do is simply hit a bunch of drives and make a determination on which club is better based on what we see in a very limited amount of time. Besides, who has the time to do all of that stuff?
We can be better than that.
I welcome your thoughts and/or questions below.