Recently the Kickstarter for “Honest Dice | Precision Machined Metal Dice You Can Trust” ended, having successfully funded for over half a million dollars. The Kickstarter was sent to me in the final few hours, so while I initially had some strong opinions about the content and the dice, there wasn’t time to do the proper review, analysis, and commentary that was warranted.
Having had some time since to digest and do a fair amount of math, I approach this as a case study, similar to the ones my business and stat teachers were fond of handing out: What analysis was done? Was it the right analysis? Was it done correctly? A case study is, at its core, a roleplaying exercise. You take the role of a pretend consultant who is handed a situation in progress. What do you do?
There’s a lot to unpack here so this article is part 1 of a 3 part series.
I wanted to write this article for two reasons:
- First, reviewing what was done and suggesting areas for improvement provides a resource for those doing this type of analysis in the future. The internet is full of resources explaining how to do a chi-square Goodness of Fit test. Less common are resources explaining how to do a test of homogeneity. Rarer still are resources that explain why you would do one test or the other, and how to present and express the results.
- Second, this is a half million dollar Kickstarter for game accessories, which puts it in a small subset of Kickstarters. Kickstarter’s own stats break funding categories into “$100k to under $1 million” and “$1 million plus.” At the time of this writing, there have only ever been 4,268 gaming Kickstarters that raised $100k or more, putting Honest Dice in a small and impactful group. It only makes sense to review its claims and analysis carefully.
Before I start the actual statistics case study, there are a few points I’d like to make that fall outside the statistics:
Disclaimer: I have not bought or received these dice, nor have I ever seen or touched them in person. I have not collected my own rolls, and cannot truly vouch for them. However, I’m inclined to take the evidence that has been presented at face value. You’ll see later, during the review of the statistics, why I think their collected data is likely to be legitimate, and because of this, I feel pretty comfortable saying that these seem to be high quality, attractive, highly durable dice. They’re also expensive, even more expensive than you might expect for what they are. As such, if you like the look of them, if you like their unique features, and if you’re comfortable dropping the kind of cash the creator is asking for them, I have no reason to tell you not to buy them. While the Kickstarter is already over, late pledges are still open. Also, the creator has a web store where you can buy many of their existing dice.
There were several non-stat issues I had with claims that were made in the Kickstarter. Most were inconsequential enough to leave out here, but one in particular struck me as disingenuous enough that I felt I had to address it. At the start of their statistical analysis they make this statement: “I decided not to name the brands of the other dice. While I’m in favor of complete transparency, I’m also not trying to throw shade at other dice companies.” That’s admirable. The part I feel is disingenuous is that they immediately and directly quote marketing slogans from the competitor in question, which is effectively the same thing as naming the company outright. One cannot have it both ways. A company can name their competitor during a comparison or not, but framing the choice not to name them as a moral position makes the immediate violation of that stance a moral failing as well.
Finally, I’d like to briefly touch on the mathematical balancing of the numbers on the die faces that is a major feature of these dice. I think this is a very interesting topic and a similarly interesting topological problem. The section of the Kickstarter where they discuss how they determined their optimal arrangement is fascinating. In addition, I feel that this is something that could potentially enhance the fairness of some dice, and that the particular arrangement of faces they came up with, along with the process of solving for optimal arrangements, is something they should patent immediately if that is indeed a thing that can be patented. However, I do feel I have to point out that if the manufacturing standards to which the dice are held successfully create a die with probabilities very close to the ideal distribution, then rolling the die will result in a random distribution of faces very close to the ideal distribution as well. If the distribution of rolled faces is very close to the ideal distribution, then which number is engraved on which face doesn’t matter. So this ideal arrangement of numbers doesn’t do much to help the accuracy of an already highly accurate die. On the other hand, this arrangement of faces might ironically be highly useful to a company making lower-quality, lower-accuracy dice.
So to start our case study, let’s look at what one ideally would do for a project like this:
- Consider Purpose: The first thing you would do is decide what you want to accomplish with your testing. Do you want to test your own dice for table fairness? Do you want to make an aggressive test of your dice to really assess their accuracy? Do you want to be really certain about your results or are you okay with some wiggle room? Do you want to test your dice against a competitor’s dice? How much time and effort do you want to put into this? All these things are going to impact decisions you make later.
- Design/Choose Tests: Once you have a good idea of what you want to do, it’s time to start designing your analysis. You’ll need to decide on how you’re collecting data, how much data you’re collecting and what tests you’re going to perform. Typically at this point you’re also choosing your threshold for statistical significance. It’s probably easiest to choose these elements in this order:
- Tests/Questions: Every test is based on a pair of hypotheses. One of these, called H0 or the null hypothesis, is a default assumption and always includes an equal sign. The math of the test assumes this hypothesis is true, and the result of the test is the likelihood that one would observe a sample at least as unusual as the data that was collected, if that hypothesis were true. If we see a really rare event, then we have evidence our assumption was wrong and we reject H0. The typical tests for dice are:
- Chi-Square Goodness of Fit Test: This is a test of the hypothesis that the distribution of the die you are testing is equal to the ideal distribution for a die of that type. So, for example, for a D20, H0 is “The probability of rolling each face of the die is 1/20” and observing a rare event rejects that hypothesis.
- Chi-Square Test of Homogeneity: This is a test that the distributions of two or more dice (or other phenomena) are the same. So for a relevant example, you would use this test to check if the distribution of a precision machined metal D4 is the same as the distribution of a low impact plastic cast tumble painted D4. Your H0 would be “The probability distributions for all of the dice in this test are the same.” and observing a rare event rejects that hypothesis. (A code sketch of both this test and the Goodness of Fit test follows this list.)
Why don’t we just test each die with the Goodness of Fit test and see which one tests better? Well, imagine a situation where your high quality expensive precision D4 has a random string of unusual rolls. Maybe you roll a bunch more 4s than you would expect. At the same time your low quality cheap inaccurate D4 has a string of really good rolls. If you look at those samples together it might look like your low quality die was better than your high quality die. But really, all you did was observe two rare samples back to back. That happens from time to time, but two separate goodness of fit tests have no way of determining if that might be the case. The test of homogeneity, on the other hand, takes into account the variance of the dice to help prevent these incorrect assessments. Remember that when you perform a test of your dice you’re only selecting one possible set of rolls of that size, and the possible sets range from very likely to very unlikely. The only way to know which set you have observed is to observe multiple samples or observe large enough samples. A single small to medium sized sample is an indication of what your true value may look like but isn’t always consistent.
Note that the test of homogeneity only tests “H0: All of these dice are distributed exactly the same.” and rejecting that hypothesis only concludes that there is evidence that at least one of them is different from the others. Unless you are only testing two dice against one another, the test makes no claim about which ones differ from each other or the magnitude of those differences. Those follow-up questions require additional tests of their own, and you should have those tests planned and accounted for before you run the first test. If the original test comparing all of the dice against each other finds no significance, there’s no need to actually run any of the planned follow-up tests, since there’s no evidence of individual differences to find if the first test isn’t significant. We’ll see an example of this later.
- Chi-Square Test of Independence: Sometimes you’ll find explanations of this test alongside the Test of Homogeneity. If you’re saying to yourself “Huh. The statement ‘These dice all have homogeneous distributions’ and the statement ‘the distribution of these die rolls is independent of which die they came from’ sure sound exactly the same.” congrats. That’s an astute observation. These tests are really poorly named. In general you almost never use the test of independence for die testing, but here are a few facts that will help make up your mind:
- Mathematically, the tests of Homogeneity and Independence are identical, so if you’ve accidentally done the wrong test, no big deal. You’ve also done the right one.
- The Test of Homogeneity tests if two different samples have the same distribution. Since they’re two different samples, they don’t have to have the same sample size. You can roll your first die 500 times and your second die 5000 times. It’ll make any attempt to do power calculations (the math to see how often you detect differences if they exist) weird but you can do it. Which means you can in theory trawl the internet picking up random D20 data sets from various manufacturers and mush them into one giant test of homogeneity.
- The Test of Independence, on the other hand, tests multiple factors collected from a single sample. For example you might want to test if suit and number in a deck of cards are independent, or if the temperature at which dice are rolled impacts their distribution, or some other combination of factors. Unlike homogeneity, with this test every observation generates a measurement of each factor of interest. Thus it would be really hard to have different sample sizes for different factors.
- Other Tests: Those are the most common tests that are performed in dice evaluation but others might be used if you’re trying to answer questions other than “How well does my die match the ideal distribution?” or “Is one of these dice better than another?” In fact, you can make up your own tests as long as you can define a hypothesis and then are able to figure out the math for: Assuming the hypothesis is true, what is the probability we observe a sample as or more rare than the sample we actually observed?
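To make the two main tests concrete, here is a minimal sketch of both in Python using scipy. The roll counts are numbers I made up for illustration, not data from any real die:

```python
# Goodness of Fit and Homogeneity on hypothetical d20 tallies.
from scipy.stats import chisquare, chi2_contingency

# Tallied counts of faces 1-20 from 1,000 hypothetical rolls of die A.
die_a = [52, 48, 51, 47, 50, 49, 53, 46, 50, 51,
         49, 48, 52, 50, 47, 53, 49, 51, 50, 54]

# Goodness of Fit: H0 is "every face has probability 1/20".
# With no expected counts given, chisquare assumes a uniform distribution.
stat, p = chisquare(die_a)
print(f"Goodness of fit: chi2={stat:.2f}, p={p:.3f}")

# Homogeneity: H0 is "both dice share the same distribution". Each die's
# tally is one row of the table, and the rows don't need equal totals.
# (Since the math is identical, this same call performs the test of
# independence discussed above.)
die_b = [60, 45, 55, 40, 52, 48, 58, 42, 50, 50,
         47, 53, 61, 39, 49, 51, 44, 56, 46, 54]
stat, p, df, expected = chi2_contingency([die_a, die_b])
print(f"Homogeneity: chi2={stat:.2f}, df={df}, p={p:.3f}")
```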
- Threshold for Significance: A test is said to be “significant” if, assuming H0 is true, our collected data would be too rare. What is “too rare”? Well, that’s up to you. You can set this threshold wherever you want, but there are some factors to consider:
- Pick Before You Test: You should run no risk of your collected data influencing where you place your threshold for significance. If your test results show an observed rarity of 2%, do you want to wait till that moment to decide if 5% or 1% is your threshold for too rare?
- Use One Threshold: Once you’ve chosen a threshold, that should be the threshold for your entire set of tests unless there’s a good reason to change from test to test (we’ll discuss that shortly). Without excellent justification, changing thresholds, especially if your data is borderline and you change in such a way to find results that are favorable to you, smacks of dishonesty and an attempt to manipulate results in your favor.
- Use Common Standards: The standard thresholds are .1 (10%), .05 (5%), or .01 (1%). It’s possible to use other thresholds, nothing’s stopping you. However, larger thresholds come with larger chances of making an error (we’ll discuss that in a moment) and start detecting common results as “too rare”. Remember that rolling a 1 on a d10, for example, has a 10% chance of occurring. Does it really make sense to say that an event more common than a standard d10 roll is so rare that observing it must mean your assumptions are wrong? On the other end of the spectrum, there are applications where thresholds smaller than .01 are appropriate. These are commonly seen in applications where lives or large sums of money may be on the line, such as medical research or insurance calculations. In general, die testing doesn’t warrant this level of accuracy unless you have extraordinary amounts of money riding on them. I haven’t even heard of casinos testing their dice to levels of accuracy tighter than .01, but it wouldn’t surprise me if they did.
- Consider Family-Wise Error Rate: Let’s say you decide your threshold for significance in a test of goodness of fit is .05, which is the same as 1 in 20. That means that if you assume your die has the ideal distribution and your set of rolls is as rare as or rarer than 1 in 20, you’ll say the results are significant and reject that the die is fair. Now let’s say you have 100 identical, ideally distributed dice and you test each of them, rejecting each die for which you see a sample with 1 in 20 rarity. How many of those 100 tests are you going to reject, on average? You’d expect to see an event of 1 in 20 rarity about one 20th of the time, so in our theoretical 100 tests of fair dice, you would expect to reject 5 of them even though they follow the expected distribution. These are called Type 1 errors. When you choose your threshold of significance, another way of thinking about it is that your threshold is the proportion of the time you’re willing to make these Type 1 errors. But if you’re performing more than one test at a given threshold then, as in our example above, your chance of making at least one Type 1 error increases over your desired threshold as the number of tests increases. This increased chance of error is called your Family-Wise Error Rate and you can control for it in a number of ways. One such way is mathematically calculating a new threshold for individual tests that results in your overall Family-Wise Error Rate being equal to your original desired threshold. Another is using step by step procedures to reject individual tests and fail to reject others based on p-values and other factors. Discussion of individual methods is beyond our scope here other than to say: be careful when deciding if you can assume individual tests are independent or not. Explanation of a handful of procedures can be found here.
To determine if you need to worry about Family-Wise Error Rate, ask yourself if any of your tests are related groups of tests and if increasing the number of tests will increase the chance of accidental error.
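That first method, calculating a new per-test threshold, is easy enough to do yourself. A small sketch of the two most common versions, where the family-wise rate of .05 and the count of 7 tests are assumptions chosen purely for the example:

```python
# Per-test significance thresholds that keep the family-wise error rate at .05.
alpha_family = 0.05
m = 7  # number of related tests, e.g. one goodness of fit test per die in a set

# Bonferroni: safe even when the tests are dependent, slightly conservative.
alpha_bonferroni = alpha_family / m

# Sidak: exact when the tests are independent of one another.
alpha_sidak = 1 - (1 - alpha_family) ** (1 / m)

print(f"Bonferroni per-test alpha: {alpha_bonferroni:.5f}")  # 0.00714
print(f"Sidak per-test alpha:      {alpha_sidak:.5f}")       # 0.00730
```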
- Sample Size: Now that you know what kind of tests you’ll be performing and at what level of significance, you can finally determine the size of the sample you’ll need to generate in order to properly run your tests. Remember that the more data you have, the smaller the differences you can reliably detect. Unfortunately, most of the tests you’ll use in die testing are chi-square tests, and determining necessary sample sizes for chi-square tests can’t reasonably be done without software. If you’re confident in your R or Python abilities, several packages exist for those (a Python sketch follows this list). However, a stand-alone product exists for those who don’t want to bother installing a programming language and code editor and figuring out how to code a solution themselves. G*Power has good reviews online, is a simple stand-alone product, and will handle what we need. If you choose to use G*Power, once you download it and start it up, you’ll need to choose Test Family=X2, Statistical Test=Goodness of Fit Tests: Contingency Tables, and Type of Power Analysis=A Priori: Compute Required Sample Size – Given alpha, power, and effect size. Then simply enter the requested information and click calculate.
- Effect Size: For Chi-Square tests standard effect sizes are .1=small, .3=medium, and .5=large. A smaller effect size means that you’ll be able to detect smaller differences between the distributions you’re comparing to one another.
- Alpha Err Prob: This is your threshold for significance discussed above. Remember that if you’re dealing with Family-Wise Error, even if you’re planning on using step procedures to determine significance, you’ll still need an actual alpha here. Depending on whether your tests can be said to be independent or not, the easiest options are the Bonferroni or Sidak procedures.
- Power (1-Beta err prob): Power is the probability that you’ll be able to detect the difference of interest if it actually exists. Sampling involves a lot of random noise so even if the difference you’re looking for exists some samples you generate will make it look like it doesn’t. The higher your Power, the more samples will detect differences correctly, but also the bigger those samples will have to be. Standards include .8 (detects differences 80% of the time if they exist) and .95 (detects them 95% of the time if they exist.)
- Df: This is the degrees of freedom of your test. For Goodness of fit tests it’s the number of faces your die has -1. For tests of homogeneity it’s (number of faces-1)(number of dice you’re comparing-1). So for example a goodness of fit test on a d20 has 19 degrees of freedom, while comparing a set of 4 different d20s together has 19*3=57 degrees of freedom.
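If you’d rather do this calculation in Python than in G*Power, here’s a minimal sketch using the statsmodels package; the specific inputs (small effect size, .05 alpha, .8 power, a d20) are just example choices:

```python
# A priori sample size for a chi-square goodness of fit test on a d20.
from statsmodels.stats.power import GofChisquarePower

# effect_size .1 = small, alpha = threshold for significance, power = .8,
# n_bins = number of faces, so df = n_bins - 1 = 19 as described above.
analysis = GofChisquarePower()
n = analysis.solve_power(effect_size=0.1, n_bins=20, alpha=0.05, power=0.8)
print(f"Required rolls: {n:.0f}")
```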
- Collect Data: Now that you know how much data you need and what you’re going to do with it, it’s finally time to start collecting data. This order is important because you neither want too little nor too much data. Too little and you have insufficient power to detect deviations. Too much and you’re going to detect really tiny differences. That sounds great until all your dice test as not fair because you collected 15,000 rolls each. When you collect data, in general it’s easiest to simply keep a running tally of the count of each face rolled. Some people like to keep an actual record of every roll that was made. That’s fine too, although the first thing you will do with a list of rolls is tabulate it into a count of each face (a short sketch of that tallying follows the list below). Here are two potential approaches you could take with data collection. Either way is fine. You’re just measuring slightly different things:
- In one approach you could attempt to gather data in the environment that best describes your use case. Sit at your dining room table, maybe scatter some books, papers, and soda cans around as obstacles and get rolling. The idea here is that your data will do a good job representing the performance that the average home user sees.
- In the other, you can attempt to gather data in an ideal environment: attempting to optimize for bouncing, well lit for accurate reading of results, and using as close to the same force and angle for each throw as possible for consistency. This approach is an attempt to capture as close to the true distribution of faces of the die as possible. The fact that the end user will never experience these idealized results is irrelevant.
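Whichever environment you choose, the tabulation step is trivial. A minimal sketch, with a handful of made-up rolls standing in for a real d20 record:

```python
# Tabulate a recorded list of rolls into per-face counts,
# the form the chi-square tests above expect.
from collections import Counter

rolls = [17, 3, 20, 3, 11, 17, 8, 1, 20, 14]  # hypothetical d20 rolls
tally = Counter(rolls)

# Order the counts by face, filling in zero for any face never rolled.
counts = [tally.get(face, 0) for face in range(1, 21)]
print(counts)
```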
- Analysis: Surprisingly, there’s not much to say here. You have your planned tests. You have your parameters. You have your collected data. The analysis is just math. The math is simple enough for most chi-square tests that you can literally do it with pencil and paper (and if you’re like me, you’d find that fun but would also make dozens of simple errors and then have to hunt them down). In practice, this is almost exclusively done via one form of software or another. The end result of analysis is almost always a p-value and a statement about the original claim for each test, something like: “p-value=.003. Evidence exists to reject the claim that all the dice tested share the same distribution.” We use p-values for most test results because they are familiar to many people, they are fairly simple to understand, and they are standardized such that completely different tests, even different types of tests, can be compared easily. P-values, by the way, literally mean “If our H0 hypothesis is true, this is the probability of observing data as or more rare than we did.” For the final statement, keep in mind that actually proving something with statistics is almost impossible. Disproving things is sometimes just as hard. Instead, most statistical results make use of phrases like “Evidence exists in support of the claim that…” or “Data suggest that…”
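To show just how simple that pencil and paper math is, here’s the goodness of fit statistic computed step by step from the same hypothetical d20 counts used in the earlier sketch:

```python
# The chi-square statistic is the sum of (observed - expected)^2 / expected.
from scipy.stats import chi2

observed = [52, 48, 51, 47, 50, 49, 53, 46, 50, 51,
            49, 48, 52, 50, 47, 53, 49, 51, 50, 54]  # hypothetical d20 tallies
expected = sum(observed) / len(observed)  # 50 rolls per face if the die is fair

stat = sum((obs - expected) ** 2 / expected for obs in observed)

# The p-value is the upper tail of the chi-square distribution with df = 19.
p = chi2.sf(stat, df=len(observed) - 1)
print(f"chi2={stat:.2f}, p={p:.3f}")
```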
- Presentation: Sharing your results is a straightforward process. You simply walk through this same process and explain what you did.
- Discuss what you were attempting to test. This part can be in plain language, explaining the problem or question.
- Explain your analysis plan including tests, threshold for significance, and sample size. You may want to discuss why you designed your plan the way you did. This is also a good place to discuss common statistics concepts and definitions if you feel your audience may not be familiar with them.
- If you did anything fancy for data collection, you will probably want to share it. You may want to share graphs of simple descriptive statistics of your data. It’s also a good practice to provide your actual data set if there’s no good reason not to.
- Go over each test you performed and their results. As mentioned earlier, if possible, these should mostly be just p-values and statements of what those p-values say about your initial claims.
- At the end you may wish to close with an overview especially if your results show an important trend or pattern. Even if your results aren’t particularly impressive, a conclusion like “Results are inconclusive, further testing with larger sample sizes is needed.” can be useful as well as suggestions for follow up tests that could be performed.
So that’s the general approach for testing dice. This isn’t comprehensive. There are quite a few bits I glossed over but for our case study it’s a sufficient framework to work with.
Next week is part 2: Review of the Honest Dice Analysis.
The week after that is part 3: Suggested Analysis.
This post is brought to you by our wonderful patron Bob Quek, supporting us since September 2016! Thanks for helping us keep the stew fires going!