Equipment Tests, Statistical Significance, and You

golfunfiltered

Fair warning: this is gonna be long and nerdy.

Having the opportunity to interview and meet many of the biggest names in golf equipment has been extremely exciting for me over the years. Learning about the thought processes that go into the design of a new club or golf ball is fascinating, and the money that's invested to help us enjoy the game can be staggering.

That's why I think it is SO important that when those products are tested by consumers, whether via a blog, club demo or custom fitting, we go in with a fundamental understanding of what it means to "improve."

When I'm not littering the internet with my golf thoughts, I work in process improvement during the day. A big part of that deals with statistics, specifically the concept of proving something through hypothesis testing and statistical significance. This can only be done through the establishment of a sound measurement system, used in a properly designed experiment, which includes appropriate sample sizes and testing methods.

For example, let's say you go to your local demo day where a company rep has a Trackman set up on the range. You grab the newest driver from a bouquet of options and step up to the tee. You've got your current driver as well to compare to the new one.

How many golf balls should you hit with each driver before you can safely say you've got the appropriate sample size? Furthermore, how do you analyze the comparison data to determine if one driver truly outperforms the other?

More often than not, most of us will focus on carry distance and shot dispersion. Those are real metrics we can gather from Trackman, and that data is usually given to us as averages.

But there's a problem with averages. They are awfully touchy when it comes to outliers, especially when our sample size is low (<30). I contend this is the worst way to compare two products against each other, preferring to use the median (50% of data above, 50% of data below) of the data set, but that's controversial and I digress.
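For the curious, here's a quick Python illustration of that touchiness, with made-up carry numbers: one topped drive drags the mean down 12.5 yards while the median moves half a yard.

```python
# Made-up example: one mishit out of six drives shifts the mean by
# 12.5 yards but the median by only 0.5 yards.
from statistics import mean, median

drives = [250, 252, 248, 251, 249]   # five solid drives (carry, yards)
with_mishit = drives + [175]         # ...plus one topped shot

print(mean(drives), median(drives))            # 250    250
print(mean(with_mishit), median(with_mishit))  # 237.5  249.5
```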

Once you have your data sets ready, the next best thing is to test for statistical significance. This brings up the concepts of a null hypothesis and p-value.

At a high level, the null hypothesis suggests that there is no difference between the two samples being compared. It's an assumption you go into every experiment with so as not to have any bias toward one sample or the other. What you are testing, then, is whether you should reject or fail to reject the null hypothesis. To do that, you need to find the p-value.

A p-value is "the level of marginal significance within a statistical hypothesis test representing the probability of the occurrence of a given event." In other words, it's the probability of seeing results like yours purely by chance if the null hypothesis is true. For many industries the standard threshold is 0.05 (in healthcare it is 0.01), meaning we accept at most a 5% chance that the results are due to chance alone. A p-value below that threshold means we reject the null hypothesis and can say with some level of certainty that there is a difference between the two samples.

So, to go back to our example, we should go into the driver test with the null hypothesis that there is no difference between the two drivers. With appropriate sample sizes of drives from each club, we can then collect the data and compare the two means (for example, carry distance numbers). This process -- called hypothesis testing -- will result in a p-value. If the p-value is less than 0.05, we can suggest there is a statistically significant difference between the two drivers being tested.
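To make that concrete, here's a minimal sketch of the whole process in Python, using a Welch's two-sample t-test from SciPy; the carry numbers are invented purely for illustration.

```python
# Minimal sketch: compare carry distances from two drivers with a
# two-sample (Welch's) t-test. All numbers are made up.
from scipy import stats

current_driver = [248, 252, 245, 251, 249, 247, 253, 250, 246, 252]  # yards
new_driver     = [251, 255, 249, 256, 252, 250, 257, 253, 251, 254]

t_stat, p_value = stats.ttest_ind(new_driver, current_driver, equal_var=False)

if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject the null -- the drivers appear to differ")
else:
    print(f"p = {p_value:.3f}: fail to reject the null -- no proven difference")
```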

But all of that is boring and confusing, right?

Instead, what many choose to do is simply hit a bunch of drives and make a determination on which club is better based on what we see in a very limited amount of time. Besides, who has the time to do all of that stuff?

We can be better than that.

I welcome your thoughts and/or questions below.
 
Instead, what many choose to do is simply hit a bunch of drives and make a determination on which club is better based on what we see in a very limited amount of time. Besides, who has the time to do all of that stuff?

Speaking to the bolded, I also find these 2 driver purchasing tactics funny:

1. Only demoing 1 brand due to their "brand loyalty" for whatever reason
2. Not demoing any and blind buying

and then claiming that it is the "best driver for you"

I'd personally love to hit 25-30 with each and let the numbers speak for themselves. But like you said, who has the time for that? (Not to mention my driver swing is trash.)
 
Or you could just buy one of each and play whichever one you think will be best that day?

Probably not the best approach but it’s been my approach the past couple years, lol.

Forced myself to play the Cobra this weekend, although nothing’s been able to knock out my old R15 430 permanently. Something about it just allows me to hit it well. Not very forgiving compared to most newer ones, but damn is it long if I do my job.
 
Your example is simple: a new driver vs. an old driver. My last demo day had me trying two heads and 5 shafts vs. my old driver. Much more complicated. Is this still one null hypothesis, or is it ten null hypotheses, with each head/shaft combo being compared to the old driver, or would one have to use a different set of data? (Needless to say, I didn't find a significant difference during the 30 minutes of testing, so I kept my old driver, which was fitted last year.)
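One textbook answer: treat it as ten separate null hypotheses, one per head/shaft combo against the old driver, and tighten the significance threshold to account for running ten tests (e.g. a Bonferroni correction). A sketch with hypothetical names and carry numbers:

```python
# Sketch: each head/shaft combo vs. the gamer with Welch's t-test,
# using a Bonferroni-corrected alpha for 10 planned comparisons.
# Every name and number here is hypothetical.
from scipy import stats

gamer = [248, 252, 245, 251, 249, 247, 253, 250]  # old driver, carry in yards

combos = {
    "Head A / Shaft 1": [251, 255, 249, 256, 252, 250, 257, 253],
    "Head A / Shaft 2": [247, 250, 244, 252, 248, 246, 251, 249],
    # ...eight more combos in the full 2-head x 5-shaft test
}

alpha = 0.05 / 10  # Bonferroni: split the 5% error budget across 10 tests

for name, carries in combos.items():
    _, p = stats.ttest_ind(carries, gamer, equal_var=False)
    print(f"{name}: p = {p:.3f} -> {'significant' if p < alpha else 'not proven'}")
```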
 
Speaking to the bolded, I also find these 2 driver purchasing tactics funny:

1. Only demoing 1 brand due to their "brand loyalty" for whatever reason

Brand loyalty can have a place if, over a significant time, a given OEM's product has proven to be as good or better fit for the individual in past comparisons.
 
Brand loyalty can have a place if, over a significant time, a given OEM's product has proven to be as good or better fit for the individual in past comparisons.
But how would one know if they aren't trying the full gamut of options available? A good or better fit would be impossible to know then.
 
Brand loyalty can have a place if, over a significant time, a given OEM's product has proven to be as good or better fit for the individual in past comparisons.

What? So you mean to tell me that if I played 3 different TaylorMade drivers over 6 years, and they were all "good" for me (whether or not they were compared to any others), then I can assume the next TaylorMade driver will work for me?

Is the opposite true? Since Callaway couldn't make a decent driver for several years (think ~2008-2012), and they all sucked for me, does that mean I can count on Callaway drivers never being good for me?

~Rock
 
I guess I fall into the brand-loyal camp. Having tried and owned several brands over the last few years, with Titleist, Srixon and Cobra being the three others I have owned recently, Callaway has been the one I come back to, and frankly I don't have a really good reason except that I like the company.

Could there be a few more yards out there with a TM or Honma or some other JDM driver? Maybe, but for now it is easier on the brain to just run with what I like. I could wake up tomorrow and decide I want all exotic JDM stuff and all bets are off, but it would take a really bad month of play for that thought to set in.
 
What? So you mean to tell me that if I played 3 different TaylorMade drivers over 6 years, and they were all "good" for me (whether or not they were compared to any others), then I can assume the next TaylorMade driver will work for me?

Is the opposite true? Since Callaway couldn't make a decent driver for several years (think ~2008-2012), and they all sucked for me, does that mean I can count on Callaway drivers never being good for me?

~Rock

It's all relative. To be truly brand agnostic is a tough thing for most golfers (myself included, not afraid to admit it). Which is another reason why running a true experiment for the purpose of choosing a "winner" is pointless.
 
A few thoughts:

1) It's probably not practical to hit a statistically significant number of shots for all combos. That's where feel, ball-flight observations and spin trends help narrow things down.

2) Someone should post the number of shots required to detect a minimum driver distance difference of, say, 2, 4 or 6 yards on a nominal 250-yard shot. Would be interesting to know... it might be that for 5-6 yard differences you only need 8-10 shots to be reasonably confident. Of course, everyone's standard deviation is different and that plays into it too. Sounds like we need a Google spreadsheet :angel:

3) Humans are emotional and make emotional decisions based on more than science alone, that part won't go away.
 
It's all relative. To be truly brand agnostic is a tough thing for most golfers (myself included, not afraid to admit it). Which is another reason why running a true experiment for the purpose of choosing a "winner" is pointless.

It's been a long day for me, so forgive me. But I just want to be sure I understand. You post about a better, or at least statistically better way to perform equipment tests. And now you are saying it doesn't matter because the golfing world is full of fanboys?

~Rock
 
Most of the online reviews couch their statements about clubs, but present them in a way that could be confusing to many people. If a player hits 10 shots with driver A and then a week later hits 10 shots with driver B and shows the same chart, that comparison is in no way fair and no conclusions should be drawn from it.
 
I think the number of sample sets is more important than sample size. Every day brings new hope and new experiences to this game.
I am not a "great" golfer. 5% better or worse on a full shot is not my problem; I'm not at the point where a mishit is the difference between being yards from the hole and being down the fairway.
Even if that 5% meant one less club for my approach shot, if I need a large sample size to determine what counts as an improvement, then it doesn't matter.

My bias and my confidence likely do more for my game than optimization. They are tied together (and I do not want to short myself on anything), but I don't have to have a good reason to like a golf club. Sometimes look or sound is enough.

In my situation the biggest improvement I can make in my game is playing more. Score isn't everything, but it is one of the important parts.
 
I'm curious as to what choices consumers have except to try their current driver against some "new" option with a very limited sample set. Who has the time or resources to go much further? When factoring in the "human" element, the testing data is subject to question in my opinion. I have my good swing days and my bad swing days. Which day will be the one when I do my sample? Most of us golf nerds look for more information but, in the end, probably make an emotional decision based on the data from a few (at best) testing days. It's certainly an imperfect method!
 
Very interesting topic, and a broad one at that. I feel it's a little unrealistic for everyday golfers to go to such depths before making a purchase. At the end of the day it's a hobby for 99% of us, and some of us like having shiny new things; the product-testing phase is part of the fun.

Now, heading on a bit of a sidebar here, I appreciate any and all tests & comparisons that are created within the industry. Most of them have valid info. The most difficult part of this is understanding that we all have our swing idiosyncrasies. So, to take a recent ball study as an example, it provides some interesting data, and helps as a comparison, but it only completely applies when looking at driver swings at that speed, with that path, at that attack angle, and using that driver. Then all of the valid questions you bring up about statistical significance add more uncertainty to the test.

I appreciate the effort. Data is actually fun to look at. The part that bothers me, and it actually bothers me GREATLY, is that definitive statements are being made based off it. And that just won't match up to a lot of people's first hand reports.
 
Very interesting topic, and a broad one at that. I feel it's a little unrealistic for everyday golfers to go to such depths before making a purchase. At the end of the day it's a hobby for 99% of us, and some of us like having shiny new things; the product-testing phase is part of the fun.

Now, heading on a bit of a sidebar here, I appreciate any and all tests & comparisons that are created within the industry. Most of them have valid info. The most difficult part of this is understanding that we all have our swing idiosyncrasies. So, to take a recent ball study as an example, it provides some interesting data, and helps as a comparison, but it only completely applies when looking at driver swings at that speed, with that path, at that attack angle, and using that driver. Then all of the valid questions you bring up about statistical significance add more uncertainty to the test.

I appreciate the effort. Data is actually fun to look at. The part that bothers me, and it actually bothers me GREATLY, is that definitive statements are being made based off it. And that just won't match up to a lot of people's first hand reports.


Wait... so you're saying that just because you and the robot hit Billy Bob's Ball Bombing driver great, I shouldn't just assume it's going to be perfect for me?

That’s just wacky.


The truth of it is, new toys are fun. What works for me, may not work for anyone else, or may not work for a robot. The data resulting from any type of testing has some merit, but that doesn’t mean that’s the only thing that has merit.

As to the idea that having brand loyalty is somehow "wrong": that's just asinine. Sure, maybe I leave a few yards on the table by not trying a brand, or by trying only one brand, but if I like that brand and I'm more confident with it, who cares?

I don’t believe that for the majority of us, we can perform enough testing to get a statistically accurate representation that will definitively say what club is best for us. So, why not just find something that gives us good results, and that we like? Sure we should all be open minded and try as much stuff as we can, but we live in the real world, bias, ease of access, and a million other factors play into our choices.

Robot testing is for the OEMs, to ensure they're making good equipment that performs at their desired specs. It has precisely zero bearing on what any of us should play.
 
i’m surprised no one has touched on fitness yet. as a whole we are not very fit. asking us to make a statistically significant number of quality, materially similar swings is not realistic.

it also does not factor in the impact that feel can have on performance. if it doesn’t feel right, i’m not likely to give it my best. why waste time on the combo i don’t like just for the sake of the p value?

i’m of the opinion that most products perform well. even very well. so if a test shows significant outliers, a retest is warranted. or f the test and buy what makes you happy.




 
I'm curious as to what choices consumers have except to try their current driver against some "new" option with a very limited sample set. Who has the time or resources to go much further? When factoring in the "human" element, the testing data is subject to question in my opinion. I have my good swing days and my bad swing days. Which day will be the one when I do my sample? Most of us golf nerds look for more information but, in the end, probably make an emotional decision based on the data from a few (at best) testing days. It's certainly an imperfect method!
I would argue that bad swing days, at the amateur level, are much more statistically important.

If one driver gives you more distance with bad swings than the other, shouldn't you choose that driver over the one with better distance from the rare "on the screws" shot?

Then again, for most of us, ego will raise its head and the one absolutely "nutted" shot will become the "normal".
 
Brand loyalty can have a place if, over a significant time, a given OEM's product has proven to be as good or better fit for the individual in past comparisons.

Not trying to single you out, but you had an experience that extremely few people will ever have. I can understand why you have brand loyalty. But expecting past performance to guarantee future results is not a smart strategy IMO. For example, I'm sure there were a number of people a few years ago that bought a TaylorMade SLDR driver because it was the upgrade over the prior TM driver, and in retrospect that wasn't a good thing for many people.
 
Equipment Tests, Statistical Significance, and You

So the OP was about statistical significance; let's break that down for a second, shall we? To answer the question of how many shots are needed to detect/prove a difference, let's look at the golfer's variability compared to the size of the yardage difference you are trying to detect.

First, let's take a distance example: a baseline driver carry of 260 yards and a standard deviation of 7 yards. (Standard deviations of 5-8+ yards are reasonable numbers based on my testing experience, depending on your threshold for calling a poor strike non-representative.) If I want to detect a 5-yard gain, I'll need to hit 31 shots apiece (at the statistical significance shown below).

[screenshot: sample-size calculator -- SD 7 yards, 5-yard difference, 31 shots per club]


Next, let's look at what you can reasonably detect with 10 shots. (Incidentally, for any sample size below 10 this calculator returns a popup that says "not enough sample size for reliable result.") If I look only at the good strikes and am having a good swing day, my standard deviation on carry might be more like 5 yards. The result below shows you could detect a 6-yard distance difference at best from 10 shots at that variation level.

[screenshot: sample-size calculator -- 10 shots, SD 5 yards, ~6-yard detectable difference]


Finally, during my last driver purchase I was looking at shaft differences that showed 2-3 yards of carry difference at most across several shafts in both stiff and X-flex. To detect a difference of 2 yards of carry, with a 5-yard golfer standard deviation, you would need a whopping 99 shots. And I made my decision on 15-20 at most with the winner :angel:

[screenshot: sample-size calculator -- SD 5 yards, 2-yard difference, 99 shots per club]


I think these numbers show that it is a combination of both fitting data and subjective results that will always carry the day to achieve the best combo of what performs, looks, and feels good to you and the way you swing a club and play the game. By the time you get down to really fine differences the golfer variability far outweighs the equipment differences. Ouch.

EDIT: This example is valid for a low handicap golfer, but if you carry it shorter and are more variable, your detection thresholds will change. Feel free to play with the calculator in the link.

EDIT - EDIT: Another takeaway message is don't go into a fitting expecting to detect 2-3 yard differences. For all except the most consistent golfers, it ain't happening. 5-10 yard trends, now we are looking more like a sure bet. Besides, club fitting is about more than just 2-3 yards of distance or fractional amounts of launch/spin/whatever. It's the sum of ALL of those numbers and feels.
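The sample sizes above can be reproduced in Python, assuming the calculator runs a two-sided two-sample t-test at alpha = 0.05 and 80% power (typical defaults for this kind of tool; the assumptions are inferred, not taken from the calculator itself):

```python
# Power-analysis sketch: shots per driver needed to detect a given carry
# difference, assuming a two-sided two-sample t-test, alpha = 0.05,
# 80% power. Assumptions inferred, not confirmed by the calculator.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

for sd, diff in [(7, 5), (5, 2)]:
    d = diff / sd  # Cohen's d: yardage difference in units of shot-to-shot SD
    n = power.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"SD {sd} yd, detect {diff} yd: ~{n:.0f} shots per driver")

# Prints roughly 32 and 100 shots per driver, within a shot or two of the
# 31 and 99 in the screenshots above.
```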
 
This is why I think getting rid of bad shots is OK when comparing equipment. Theoretically that should scrap some of the variability that comes with our swings. And if you see a bunch of bad shots, well, that’s probably not the setup for you. I think, if we’re talking statistical significance and variability, as long as you’re looking at 1, maybe 2 metrics - say, ball speed with a driver, you don’t need that many samples. In biology, we’ll do experiments in triplicate and can make conclusions from those.

I think a lot of people can get dialed in with a good segment of the equipment, so it’s probably more important for ruling out equipment rather than making final decisions, though.
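One simple way to implement the "drop the obvious mishits" idea, with the caveat that the 2-standard-deviation cutoff below is an arbitrary choice rather than a rule:

```python
# Sketch: discard shots more than k standard deviations from the median
# before comparing clubs. The cutoff k = 2 is arbitrary.
from statistics import median, stdev

def trim_mishits(shots, k=2.0):
    """Keep only shots within k standard deviations of the median."""
    mid, sd = median(shots), stdev(shots)
    return [s for s in shots if abs(s - mid) <= k * sd]

drives = [250, 252, 248, 251, 175, 249]  # one topped shot in the set
print(trim_mishits(drives))              # [250, 252, 248, 251, 249]
```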
 
This is why I think getting rid of bad shots is OK when comparing equipment. Theoretically that should scrap some of the variability that comes with our swings. And if you see a bunch of bad shots, well, that’s probably not the setup for you. I think, if we’re talking statistical significance and variability, as long as you’re looking at 1, maybe 2 metrics - say, ball speed with a driver, you don’t need that many samples. In biology, we’ll do experiments in triplicate and can make conclusions from those.

I think a lot of people can get dialed in with a good segment of the equipment, so it’s probably more important for ruling out equipment rather than making final decisions, though.

Agreed 100%. And if two things are that close on the numbers, then other performance attributes and subjective qualities balance the equation. Human biology and feedback systems are more sophisticated than the computers, systems, and models we've created based on (observational) science. It's a combination of both science and gut feel.
 
It's been a long day for me, so forgive me. But I just want to be sure I understand. You post about a better, or at least statistically better way to perform equipment tests. And now you are saying it doesn't matter because the golfing world is full of fanboys?

~Rock

Said another way, I don’t understand reviews that would name a winner. Especially when it’s very difficult to find a statistically significant top performer, and because most people will have a brand bias one way or another.


 
I think brand loyalty doesn't need to be a bad word. I have 3 brands that I always go to first, and I have 2 brands that I couldn't care less about. I've just always assumed everyone operates similarly and don't begrudge them for their likes and dislikes. I believe everyone has brand biases one way or another so the only people to be wary of are the ones who insist they have zero.

Personally, I try to be careful not to come online and crap all over the 2 I don't like because there's no sense spreading more negativity in a world gone mad IMO.
 
Good topic.

Using a Trackman to measure the results of my swings is a little like using a micrometer to measure the results of cutting with an axe. But I still do it.

However, even axe swingers can see differences. When I bought my current driver, I hit about 90 balls, including about 25 each for the three finalists. There were differences, not so much in distance as in dispersion, and the ProTracer I used gave a measurement called "consistency" expressed in percent, probably some massaging of the standard deviation of straight and long. It came down to Shafts A and B. A was on average a little longer and a little straighter, but I bought B because it was more consistent - I could more consistently play for a distance and a shot pattern with B. The longest shaft, demonstrably so, also felt the best but sprayed the ball everywhere. Its consistency score was 0%.

I believe brands do matter, because sometimes "believing is seeing." I also believe brands do change over time, at least relative to a swing.

Looking back on my most recent purchases -- driver, fairways, irons, and wedges -- I hit at least 80 balls each time, and I spent extra time with the one I felt I was going to choose, AND I hit again on a second day for the irons and driver. With those kinds of numbers, any trends will emerge, even with an inconsistent swing. And I might add I have been happier with these purchases.
 