How many shots do you need to fire? How do you know if a result was just random luck? Have you given up on testing because it's too hard to make an improvement? All of these questions have answers.

Last week I built a statistics calculator, but there were no instructions. This article concludes the series on statistics with an explanation of my iterative load development philosophy, and how to practically apply simple statistics to make sense of your test results. If you haven't read my previous articles, they will be very helpful to read first.

Most of the time, we shoot groups of 3 to 5 shots each. Intuitively, we know that one group doesn't mean much, so we look at the big picture to understand if there is a conclusion to be drawn. We observe many groups at the same time, and try to get a feeling for whether they seem 'good' or 'bad'.

We can't put a number on a feeling. Our brains are wired to make simple, emotional, yes or no decisions. In a split second, we evaluate what we see, form an opinion, and then resist changing it. This is great for human survival (and makes for lively forum threads), but not so useful for tuning a rifle. To get the most from your efforts, you must train yourself to think objectively.

## Process of elimination

The key to finding a good load before you burn out your barrel (or your patience) is iterative testing. Try something, make a statistical conclusion from that test, and move on to try something else. If you can measure powder and seat bullets at the range, you can work through the entire process in one day. The cost of portable reloading gear will pay for itself almost immediately.

The first step is to identify what doesn't work. If you fire 5 shots and the group is big, you can pretty confidently rule out this load. The ES, SD, and your intuition would all agree - better steer clear of this one.

Why? It's only 5 shots! The reasoning is that the relation between a sample and its population is asymmetric.
It's more likely that a bad load will produce a small group than it is for a good load to produce a large group. We exploit this and it's why iteration works. The 90% confidence interval for a 5-shot group runs from -35% to +137% around the measured value. This diagram shows how a large 5-shot group is very likely to be bad, so you can rule it out. However, a small 5-shot group could be good or average. It's not so easy to prove a load as it is to disprove it.

We must speak in terms of probabilities. No test result is definite. If you have one large 5-shot group, it is within the realm of possibility that those 5 shots were extremely unlucky and the best load for your rifle is hidden within it. However, statistically, the chance of that is low, and you would have a greater return on investment trying something else.

If you've tried everything you can think of, and all you have is a paper target full of big groups, then it's time to take a second look at everything else. Are you shooting well? Is your scope tightly mounted? Is there something wrong with your barrel? Try a different bullet / primer / powder. Go back to the basics. It only takes one problem, and it's better to find it sooner than later.

To be clear, this process is not contrary to ladder testing, OCW, or any other load development technique. All data is just data, and it should be analyzed statistically. The objective with iterative testing is to only keep testing a load that has not already been ruled out. This is a simple matter of efficiency.

Go ahead and fire some groups, and use the ES to judge them. Try whatever comes to mind. There's nothing wrong with this. You are just looking for promising leads. When you think you've found something good, fire a couple more groups of the same load and see if your fortune repeats itself. Then proceed with verification.

## Measuring a result with confidence

Once you happen across a load that appears promising, don't stop there. This is a mistake we make all too often.
If you don't consider statistical confidence, you are setting yourself up for a disaster, or worse, a year's worth of mediocre performance that could so easily be avoided. Trust me... I speak from experience on both counts, and it's a hard lesson.

This is where we break out the statistics. The objective is to put a number on the chance that this load will work again in the future. For that, we calculate an SD and a confidence interval, which you can read about in my previous article. The more shots in the sample, the smaller the confidence interval, and the more likely the sample SD is to match the true SD of the load.

SD vs ES is an age-old debate. You may be surprised at my position on this. I'll use whatever method gets the job done. ES, SD, and confidence intervals are tools in the toolbox and value comes from knowing which to use when. Here's a rule of thumb:

- Use extreme spread to rule out a bad load with 5 shots.
- Use standard deviation to prove a good load with confidence.
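
To make the idea of a confidence interval on an SD concrete, here is a minimal Python sketch. It assumes the shots are normally distributed and uses the textbook chi-square interval for a standard deviation; the function names are my own, and I am not claiming this is how the statistics calculator works internally, although it does reproduce the -35% / +137% figure quoted earlier for 5 shots at 90% confidence.

```python
import math

def chi2_cdf(x, k):
    """Chi-square CDF with k degrees of freedom, computed from the series
    for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(z) - z - math.lgamma(a)))

def chi2_quantile(p, k):
    """Invert the CDF by bisection (slow but simple)."""
    lo, hi = 0.0, float(k)
    while chi2_cdf(hi, k) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sd_confidence_interval(sample_sd, n_shots, confidence=0.90):
    """Two-sided interval for the true SD, given a sample SD from n_shots
    (n_shots must be at least 2). Assumes normally distributed data."""
    df = n_shots - 1
    tail = (1.0 - confidence) / 2.0
    lower = sample_sd * math.sqrt(df / chi2_quantile(1.0 - tail, df))
    upper = sample_sd * math.sqrt(df / chi2_quantile(tail, df))
    return lower, upper

if __name__ == "__main__":
    # Show how the interval tightens as the shot count grows.
    for n in (5, 10, 20, 50):
        lo, hi = sd_confidence_interval(1.0, n)
        print(f"{n:3d} shots: {100 * (lo - 1):+.0f}% to {100 * (hi - 1):+.0f}%")
```

Running it for a few shot counts shows the point of the rule of thumb: the interval is lopsided and very wide at 5 shots, and it only narrows slowly as you add more.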

How many shots do you need? Well, you need to choose a confidence level that you are happy with. If you choose 85% confidence, then you are accepting that there's an 85% chance that the true performance of the rifle is within the interval we are about to calculate, and a 15% chance that it falls outside. It could be worse, or it could be better. The confidence level you are comfortable with is a personal trade-off between accepting some risk that your results are not accurate vs. investing more time and money to keep testing.

## Measuring group size

For group dispersion, we are interested in measuring the distance of each and every shot from the natural center of the load. You can do this at home, with a ruler. The natural center is not your point of aim, or the center of each group, but the center of an imagined overlay group that includes all shots at that load. The distances from this point to each shot, in MOA, are your data points. Maybe someday there'll be an app for that (wink wink).

In practice, I don't recommend measuring each and every bullet hole of every group you fire to the millimeter. This is what I did with Damon Cali's data, because I had 35 essentially random groups to analyze and I could not comprehend it otherwise. If you test iteratively, you will narrow in on a good load, and you may not need such a thorough process at the verification stage.

What I do recommend is visualizing overlay groups. As I demonstrated through this experience, an overlay group allows you to view all the data without the bias of which shots were in which group. If you can construct these quickly, it can save you from heading in the wrong direction. Understand that groups are normally distributed, and that you should expect a tight cluster in the center. Good overlays will look a lot better than bad overlays, and it's easier to tell the difference by eye with 20+ shots.
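
If you do record shot positions, the overlay measurement described above can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: each shot is stored as an (x, y) offset in MOA from the point of aim, every group was fired at the same aim point with the same zero, and the function names are hypothetical.

```python
import math

def overlay(groups):
    """Merge several groups of (x, y) shot offsets into one overlay group."""
    return [shot for group in groups for shot in group]

def natural_center(shots):
    """Mean point of impact of the overlay group (not the point of aim)."""
    n = len(shots)
    cx = sum(x for x, _ in shots) / n
    cy = sum(y for _, y in shots) / n
    return cx, cy

def radial_distances(shots):
    """Distance of every shot from the natural center: the data points."""
    cx, cy = natural_center(shots)
    return [math.hypot(x - cx, y - cy) for x, y in shots]

if __name__ == "__main__":
    # Made-up coordinates, purely to show the call pattern.
    groups = [
        [(0.10, 0.25), (-0.05, 0.30), (0.20, 0.10)],
        [(0.00, 0.35), (0.15, 0.20), (-0.10, 0.15)],
    ]
    shots = overlay(groups)
    print("center:", natural_center(shots))
    print("radii:", [round(r, 3) for r in radial_distances(shots)])
```

The design choice worth noting is that the center is computed once from all shots together, so no group gets re-centered on its own lucky or unlucky cluster.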
With an overlay group, you can estimate the SD as 1/4 of the extreme spread, as long as the group appears to be roughly circular and normally distributed. If you have a flier way outside the group, this would skew the ES more so than the SD, so it should be weighted less. While crude, this will allow you to calculate confidence intervals.

Once you have an SD and a number of shots, even if it's rough, you have everything you need to calculate a confidence interval. This interval, at your given confidence level, tells you the range of SD that this load is actually within. The true performance is a single number; you just can't know it exactly. If you fired more shots, you would estimate it more accurately. How many shots you need depends on how small you would like that confidence interval to be.

## Measuring velocity variation

Proper statistical measurement is much more important for shot velocity. At 600 yards and beyond, velocity variation dominates, and reducing your velocity SD will have a significant impact on your performance. A tight group is nice to have, but at long range, it pales in comparison to your velocity variation.

Minor changes in your load can have an impact on your velocity SD. Every time I go to the range, I record my velocities, and I may notice the SD has been creeping up over the past few months. This prompts me to think about what may be different now - whether it's a new lot of powder, the cases are getting old, or a temperature shift in the seasons. Managing your SD requires maintenance, and it's time well spent.

The Two-Box Chrono was designed specifically for this purpose. It reduces random error to insignificant levels. Less error means a lower SD, and a smaller confidence interval. You'll notice more consistent SD measurements day to day, detect changes in your SD sooner, and be able to observe the effect of changing something with fewer shots fired (as I will show below).
To calculate the SD of a sample, simply plug the numbers from the chronograph into the calculator. The sample SD is the actual SD of this group, while the confidence interval is the likely range of SD of your load itself (given only this sample as input). If you have more data, you can shrink your confidence interval and get a better estimate of your true load SD.

The most important takeaway is to understand that the SD has a confidence interval in the first place. Just because you fire 20 shots and measure an SD doesn't mean you will get the same SD next time.

## Getting more data for free

Suppose you fire ten 5-shot groups, all at different powder charges and seating depths. Three of the groups are good, so you repeat them. Now you have 65 shots on paper, but only 5 or 10 shots of any one scenario to analyze.

With only 10 shots at a given load, you can't really draw a statistical conclusion, because the confidence intervals are too large. However, you have fired 65 shots, and if they could all be considered, that would be plenty of data to give you some confidence. The problem is they are not equal. However, if we apply some assumptions, we can go ahead and group them, and take advantage of the combined data.

As I mentioned in an earlier post, I operate with a working assumption that seating depth will not affect velocity SD, and powder charge will not affect group size, within reasonable limits. It's just a theory, but it hasn't let me down yet.

With this assumption, you can combine all the data at one powder charge, regardless of seating depth, into a measurement of velocity SD. You can also consider all the groups at one seating depth as the same group, and overlay them visually. Remember that the assumption is just a theory, but go ahead and take advantage of the free statistics if you think it helps you iterate towards the best load.

This working model also allows you to focus your efforts.
If your groups are large, try changing seating depth, not powder charge. If your velocity SD is large, try changing powder charge, not seating depth. If you try 10 different powder charges all at the same seating depth, you may end up with 10 bad groups and all that data is obsolete as soon as you realize seating depth was the problem. This is what happened to Damon Cali, and also what happened to me.

There's another trick to combining data. As you increase powder charge, velocity will increase predictably. For me, it's about 50 fps / grain. If you fire groups at slightly different powder charges, you can combine them into a single sample as long as you adjust the data accordingly. For example, to combine a group fired at 44.0 grains with one fired at 44.2, I might add 5 fps to the first group and subtract 5 fps from the second (each group is 0.1 grain away from 44.1, and 0.1 grain is worth about 5 fps). This would provide twice as many data points representing 44.1, and allow higher confidence in that measurement.

## Making statistical improvements

Up to this point, we have focused on measuring individual samples. Measuring improvements, on the other hand, is about comparing two samples. We have two samples, and we need to know if they are statistically different. More precisely, we need to know how likely it is that the two samples came from different populations.

If you fire two groups, they will be different. Always. The question is how different. Are they different enough? Would that difference be repeatable? Is it enough to warrant a change in the load?

With statistical testing, we can ask questions like:

- Is 45.0 grains better than 44.0?
- Is 30 jump better than 10 jump?
- Should I neck size or full length size?
- Does this primer produce a smaller SD than that primer?
- Is this new lot of powder hotter than the old one?

To compare the averages of two samples, we use the T-test. To compare the variation between two samples, we use the F-test. These are magical formulas that are quite complicated, and I only understand enough to build a calculator that gives the right answer.

To compare two samples, you need two things:

- Lots of data in each sample.
- A large relative difference between them.
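
For readers who want to see what an F-test on two SDs looks like, here is a rough Python sketch. It is not the code behind the calculator. It assumes normally distributed data, puts the larger variance in the numerator, and reports confidence as one minus the two-sided p-value. The helper names (`f_test_confidence`, `shots_needed`) are hypothetical, and the incomplete beta function is evaluated with a standard continued-fraction method.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-30
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_test_confidence(sd_a, n_a, sd_b, n_b):
    """Confidence (1 - two-sided p) that two sample SDs reflect different true SDs."""
    if sd_a < sd_b:  # keep the larger variance in the numerator
        sd_a, n_a, sd_b, n_b = sd_b, n_b, sd_a, n_a
    f = (sd_a / sd_b) ** 2
    d1, d2 = n_a - 1, n_b - 1
    upper_tail = reg_inc_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))
    return 1.0 - min(1.0, 2.0 * upper_tail)

def shots_needed(sd_a, sd_b, confidence=0.90, max_shots=1000):
    """Smallest equal group size at which the two SDs differ at this confidence."""
    for n in range(2, max_shots + 1):
        if f_test_confidence(sd_a, n, sd_b, n) >= confidence:
            return n
    return None

if __name__ == "__main__":
    print(f"10 shots each, SD 6 vs 8: {f_test_confidence(6, 10, 8, 10):.1%}")
    print("shots per group for 90%:", shots_needed(6, 8, 0.90))
```

Because the required shot count sits right at a threshold, an implementation like this may differ by a shot or so from other calculators depending on rounding. The useful part is the trend: small relative differences need a lot of shots.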

You need a lot of data. Beg, borrow, and steal data from other groups. For example, if you shoot a ladder at 10 different charges, you may only have 5 shots at each charge, and you can make no statistical comparisons considering each group as independent. However, if you combine data, you may be able to say that everything from 44 to 45 is better than from 45 to 46.

Here's an example. Suppose I fire 10-shot groups at 44 and 45 grains. I measure an SD of 6 for one and 8 for the other. Does this represent a significant improvement?

Answer: No. There's only a 59.6% chance these groups are from different populations, so we've learned very little. Two random 10-shot groups with SDs of 6 and 8 would occur fairly often even from the same load. The confidence intervals overlap. We need more data.

Now we can ask the question, how many shots do we need to fire to prove such a difference with 90% confidence?

Answer: 35 shots for each group. If you relax your confidence level to 75%, you can get away with 18 shots per group, but you are gambling. It's a question of return on investment. How much is more confidence worth to you? You might get lucky, or you might find yourself back at square one next week.

## Using a chronograph to improve velocity SD

Suppose I've been shooting all year and my elevation is pretty good, but maybe it could be better. Maybe with another day at the range, I could tweak the powder charge, or try small primers. My SD has been around 7 all year, so as a goal, I hope to measure a 16% improvement in SD, from 7 to 6. I'd like to have 90% confidence in the result. After all, I plan to shoot about 800 rounds of this load in July and August alone (3 provincial matches plus the Nationals and Worlds for F-class). It might take a full day to perform this test, so I'll aim to only have to do this once.

Is it worth my time to try? We can predict what it would require to find this result, even before going to the range.
Let's make sure we bring enough ammo so that it's even possible to achieve this. Otherwise it's a doomed exercise. So the question to ask the stats calculator: if I fire two groups, and one measures an SD of 6 and the other 7, how many shots must those groups have for me to be 90% confident the difference is real?

The answer: 116 shots. From each group. Well, that's not realistic. It kind of puts things in perspective when you look at it that way. I don't really feel like shooting 232 shots just on a hunch that I can make an improvement.

So therein lies the problem facing many long range shooters. You find a decent load quickly, but making an improvement is statistically very difficult. You can try to get lucky again, but it's a cycle of trial and error.

Now let's consider the chronograph. Random error increases the SD of both groups, decreasing the relative difference between them, and therefore requiring more data to make a comparison. With a more precise chrono, maybe we can ease the pain.

Let's say I was using a chronograph with an inherent SD of 3.5 fps, which is a reasonable number based on Applied Ballistics' testing of the Magnetospeed, Chrony, and others. Most readings would be within +/- 7 fps of the true velocity. The random error SD of 3.5 actually means my observed SD of 6 would have to come from a load with a true SD of 4.87. The ammo is always a little better than the chronograph says it is, because the chronograph can only add error, never remove it. It would also increase a true SD of 6.06 to 7. That means, to observe the same 16% improvement at the chrono (7 to 6), my ammo would actually have to improve by 24% (6.06 to 4.87)!

If I had a perfect chrono with no random error, I would need to fire only 59 shots of each group to see the 16% improvement I am hoping for, not 116. That's still a lot, but it's half the shooting.

Now let's say we used the Two-Box Chrono with an error SD of 0.5 fps. That same load with true performance of 4.87 would be observed as 4.90, and a load at 6.06 is observed as 6.08.
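
The true-versus-observed figures above are consistent with adding the ammunition's SD and the chronograph's error SD in quadrature, which is what you get if the two sources of variation are independent. That independence is my assumption here, and the function names are just for illustration:

```python
import math

def true_sd(observed_sd, chrono_sd):
    """SD of the ammunition alone, after removing independent chronograph error."""
    if observed_sd < chrono_sd:
        raise ValueError("observed SD cannot be smaller than the chronograph error SD")
    return math.sqrt(observed_sd ** 2 - chrono_sd ** 2)

def observed_sd(ammo_sd, chrono_sd):
    """SD the chronograph would report for ammunition with the given true SD."""
    return math.hypot(ammo_sd, chrono_sd)

if __name__ == "__main__":
    good = true_sd(6.0, 3.5)   # load that reads 6 on a 3.5 fps chronograph
    poor = true_sd(7.0, 3.5)   # load that reads 7 on the same chronograph
    print(f"true SDs: {good:.2f} and {poor:.2f}")
    print(f"seen with 0.5 fps error: {observed_sd(good, 0.5):.2f} and {observed_sd(poor, 0.5):.2f}")
```

Because the errors add as squares, a chronograph error that is small compared with the load's SD nearly vanishes from the result, while one of similar size inflates it noticeably.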
The number of shots required with the Two-Box Chrono would be 60. Just one more shot than with a perfect chronograph.

It's hard to make an improvement in a good load. The relative differences are small and a lot of shooting is required. Fine tuning is possible, but only with a good chronograph and an understanding of the statistics that are controlling your fate.

Play with the calculator to get a feel for what difference in SD with what number of shots will give you a positive result at the confidence level you are comfortable with. You'll see there are realistic scenarios that produce 50% confidence or less, where you'd be better off flipping a coin.

## In conclusion...

When I learned how to use the F-test I knew it was the key to making sense of the madness, to avoid going to the range and coming home with a non-result. It completely changed my perspective on how to approach and plan tests.

Any test where variance is measured and compared, where confidence is not considered, could be very misleading. About 99.9% of the information you find online falls into this category. Now you know how to make sense of whether it's really meaningful.

I used to load 10 or 20 shots of two, three or four different scenarios to compare. The results seemed significant at the time, and I thought I was learning things about reloading that I couldn't find answers to elsewhere. I've punched thousands of holes in paper and filled two books with test results. Now I know that all I was doing was learning how little I knew about what I was doing.

Now, I go to the range with a plan that has a reasonable chance of success. I test iteratively, looking for quick clear answers most of the time, but knowing that easy answers are just suggestions. I never say anything for sure until I'm ready to test it properly. I limit my extensive testing to when I have a very specific goal in mind.

Keep it simple. I only care about my group size and SD for one rifle, for one summer, with one powder, one bullet, and one primer.
My reloading procedure has been basically fixed for 2 years. I focus my energy on charge and jump and keep everything else constant. Otherwise... the madness will return.

Wind reading and shooting strategy are also important. They are something I focus on at different times. Load development is homework. The more you put into it, the easier it is to follow the wind (because the rifle is more accurate), the easier it is to learn the flags, and of course the fewer points you lose to elevation.

Please feel free to post any questions in the comments. Good shooting!
1 Comment
Ben Winget
7/3/2017 08:35:00 am
Check out the on target software; it makes it very easy to measure your groups and gives you an average to center measurement, along with the group size, horizontal and vertical measurements.