User:Yamagawa/Drop Rates

Drop Rates
I use a method I'll call the average of averages. I'm fairly sure I saw it on a wiki somewhere; apologies if the term doesn't match what's in the books. I don't have them, and I remember more *how* it works than *what* it's called.

Basically, we gather a bunch of different samples, each with its own rate. Each sample can be considered an average ("my ten samples of ecto drops averaged 20% ectos; my next sample averaged 0% ectos"). We take all those averages and average them together. A fair amount of precision can be lost, however, depending on how you average them together.

This breaks down into two parts: bucketing the data, and averaging the buckets.

Bucketing the data
The process calls for n buckets of n drops.

This tries to maximize the raw 'information' carried through from the drop table. The first step is to take all the rows of data, find the total number of measurements, and get the square root of that. Say I have 100 drops; the square root is 10, so that's my N. I want to lump the data together into N samples of N size, or 10 samples of 10 drops. If I make the sample size larger, I get fewer buckets, at a cost of information. If I make the sample size smaller, I get more total buckets, but information is again lost in the smaller sample size.

A square-root effect applies to sample size: a sample of 4 may be considered to carry 2 units of data, a sample of 9 carries 3 units, a sample of 16 carries 4 units, and a sample of 25 carries 5 units. In the process used here, each bucket is considered a sample, and the collection of buckets is another sample. By keeping both of those sizes as low as possible, we maximize the transfer of data into the final result.

To aid in this process, I'd suggest encouraging users to target sample sizes in the 10-25 drop range, where possible.

The buckets should consist of the raw counts (not percentages), plus a sample-size column.
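The bucketing step above can be sketched in a few lines of Python. The drop list here is purely hypothetical, and this assumes the total happens to be a perfect square, as in the 100-drop example:

```python
import math

# Hypothetical list of per-drop outcomes: 1 = ecto dropped, 0 = no ecto.
drops = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0] * 10  # 100 recorded drops

total = len(drops)          # total number of measurements: 100
n = math.isqrt(total)       # target: n buckets of n drops -> 10

# Lump the raw outcomes into n buckets of n drops each.
buckets = [drops[i:i + n] for i in range(0, total, n)]

# Each bucket row: raw count of the item, plus a sample-size column.
rows = [(sum(b), len(b)) for b in buckets]
print(rows[0])  # (3, 10): 3 ectos out of 10 drops in this bucket
```

With a total that isn't a perfect square, the last bucket simply comes up short, which is why the later steps carry a per-bucket weight.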

Averaging the Buckets
Unless you are very, very lucky, the buckets are not all the same size. This is annoying, but not fatal to the process. Different bucket sizes give each bucket a different weight, and standard spreadsheet statistics functions assume each row of data has the same weight. This forces us to do extra work for the average and standard deviation.

Extending the buckets
Going back to the previous section on bucketing the data, it's mentioned that a sample of 4 carries 2 units of data, while a sample of 25 carries only 5 (21 more drops, for 3 more units). We need to treat each row according to this weight. So: add a sum-of-drops column (if missing) to total the drops in each bucket. Add an n^0.5 column carrying the square root of the sum of drops in the bucket; if the bucket has 25 drops, this will show 5. Add an averages row, and get an average for your n^0.5 column. Finally, add a 'Bucket Weight' column, containing the row's n^0.5 value divided by the average n^0.5 value. This is a measure of how significant the row is relative to the other rows, and it's needed because the standard statistical functions built into spreadsheet programs work on the basis that samples have equal weight. In this instance, they do not.
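As a sketch of those columns, using a made-up set of buckets (the counts and sizes below are illustrative only):

```python
import math

# Hypothetical buckets: (count of the item, total drops in the bucket).
rows = [(3, 10), (1, 9), (5, 16), (2, 10)]

# n^0.5 column: square root of each bucket's sum of drops.
root_n = [math.sqrt(size) for _, size in rows]

# Averages row: the average of the n^0.5 column.
avg_root = sum(root_n) / len(root_n)

# Bucket Weight column: the row's n^0.5 divided by the average n^0.5.
weights = [r / avg_root for r in root_n]

# The weights average out to 1.0, so larger buckets count for more
# and smaller buckets count for less.
print(round(sum(weights) / len(weights), 6))
```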

The Adjusted Table
Now we need to make a drop-rates table that is adjusted for the weight of the rows. Create a new sheet and copy the column headings from the first table to it. You can omit the columns we added in the last section; we start with just the drops themselves. The basic formula is simple multiplication: for each bucket, multiply the observed drop rate by the row's Bucket Weight (that's three parts: count divided by row total, times weight). Copy and paste is your friend here; most spreadsheet programs will update the formula as you copy it from cell to cell, though you may need to sort out how the Bucket Weight value is referenced as you move from column to column.

The Results - part 1: Average
Sort out the average drop rates. Because we already folded the weight into each row, a standard average function on each column will work for this.

This gives you an adjusted average drop rate. These are normally expressed as percentages, so format the values accordingly.
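The adjusted table and its average can be sketched together, continuing with the same hypothetical buckets as before:

```python
import math

# Hypothetical buckets: (item count, bucket size).
rows = [(3, 10), (1, 9), (5, 16), (2, 10)]

root_n = [math.sqrt(size) for _, size in rows]
avg_root = sum(root_n) / len(root_n)
weights = [r / avg_root for r in root_n]

# Adjusted table: observed rate (count / row total) times Bucket Weight.
adjusted = [(count / size) * w for (count, size), w in zip(rows, weights)]

# Because the weights are already folded in, a plain average works here.
adjusted_average = sum(adjusted) / len(adjusted)
print(f"{adjusted_average:.1%}")
```

Spot-checking against the plain unweighted mean of the four raw rates is a quick way to confirm nothing went off the rails; the two should be close.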

Do take a moment to spot-check your work. Do your averages look anything like the averages from the raw data? They do? Good. We have more work to do; we aren't done yet.

We still need to sort out 'how good are these numbers', i.e. what the margin of error is. If we just wanted averages, I'd have skipped this whole complicated process and used the base average, as that's not far from what we get here, but we need this adjusted number for what's to come.

Standard Deviation
Averages are good, but we want an error margin too. And error margin has everything to do with standard deviation, but as previously stated, our samples are of unequal sizes, so standard table functions will not work for this. In fact, while I was able to find several variations on how to calculate a deviation value for an equal-weight set of data, my research found none for weighted buckets like these, so I more or less created this by modifying the standard method. Any errors are mine, and someone good with statistics might want to double-check it, as I've already had to fix an error from a year ago (errors here only affect the error bars, and my error will have made the bars too wide).

Create a Deviation table, parallel to our first table. Standard deviation is commonly based on the square of the distance of each value from the average. We will adjust this value according to the weight of each bucket. See http://en.wikipedia.org/wiki/Standard_deviation#With_sample_standard_deviation

Each cell in it will be: (Adjusted Average - Bucket Rate)^2 * Weight. The formula in my spreadsheet program looks like:
=((Weighted Table :: A$44-Raw Data :: A2/Raw Data :: $AC2)^2)*Raw Data :: $AE2
(If, unlike me, you are using Excel, a copy/paste of the above formula will probably fail.)

Now create a few 'totals' rows in the Deviation table. The first is 'Sum', a simple sum for each column: =SUM('5 Merc '). The next is 'Average', the Sum divided by the number of buckets minus 1 (yes, N - 1; see the wikipedia link above). This is not a true average, just a name approximating the value: =B44/(COUNT('5 Merc ')-1). Then add a 'Deviation' row: =B45^0.5. You can do them all as one single calculation, but I favor doing things in small steps, which makes it harder for a missed paren or such to leave an error in the detail. The deviation is measured against drop rates, so go ahead and format the deviation as a percentage.
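The deviation table and its totals rows reduce to a few lines of code. This continues the same hypothetical buckets, with the weights and adjusted average recomputed so the sketch stands alone:

```python
import math

# Hypothetical buckets: (item count, bucket size), weighted as before.
rows = [(3, 10), (1, 9), (5, 16), (2, 10)]
root_n = [math.sqrt(size) for _, size in rows]
avg_root = sum(root_n) / len(root_n)
weights = [r / avg_root for r in root_n]

rates = [count / size for count, size in rows]
adjusted_average = sum(r * w for r, w in zip(rates, weights)) / len(rows)

# Deviation table cells: (Adjusted Average - Bucket Rate)^2 * Weight.
dev_cells = [(adjusted_average - r) ** 2 * w for r, w in zip(rates, weights)]

total = sum(dev_cells)              # the 'Sum' row
variance = total / (len(rows) - 1)  # the 'Average' row (note the N - 1)
deviation = math.sqrt(variance)     # the 'Deviation' row
print(f"{deviation:.1%}")
```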

Error Margin
Still on the deviation table, add another totals-like row and name it '95% Margin'. Take the deviation, divide it by the square root of the number of buckets, and multiply that result by 1.960 (the real figure has a few more digits in it, but we aren't performing rocket science here): =B46/(COUNT('5 Merc ')^0.5)*1.96. Actual values will, 95% of the time, fall within plus or minus that number. If you want the 99% error margin, multiply instead by 2.576. See http://en.wikipedia.org/wiki/Standard_deviation#Rules_for_normally_distributed_data for values used to determine other error margins.
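The margin row is a one-liner. The deviation and bucket count below are the (hypothetical) values from the earlier sketch:

```python
import math

deviation = 0.0934  # the 'Deviation' row from the previous step
buckets = 4         # number of buckets

# 95% margin: deviation / sqrt(number of buckets) * 1.96.
margin95 = deviation / math.sqrt(buckets) * 1.96
# 99% margin: same thing, with 2.576 in place of 1.96.
margin99 = deviation / math.sqrt(buckets) * 2.576
print(f"95%: +/- {margin95:.1%}, 99%: +/- {margin99:.1%}")
```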

The Drop That Wasn't
You've got your drop data: put it in buckets, built out the tables, but for some reason one of the columns in the table is all 0's. Either you were lied to and rabbits don't drop long swords, or it's such a rare occurrence that it just hasn't happened yet. Even if it hasn't happened yet, we can figure some sort of upper bound on how often it may actually happen. For example, if you've had 3 drops recorded and none were long swords, odds are reasonably good the drop rate is not over 98%. You may have hit the 2% miss rate once or twice in a row, but 3 times is hard to believe. Sure, it *could* happen, but it probably hasn't; odds are, the actual drop rate is lower. If we get another 7 drops, for a total of 10, and still haven't seen that long sword, odds are the drop rate is under 50%. If we have 100 drops... you see where this is going? The neat thing is, it's not just hand-waving; it fits well into math, which will be added shortly after I review it.

I'm running short on time, so I've not reviewed the table I used last time for it. I *think* I'm working it correctly, but if someone can, please double-check the final figures. For a 95% confidence interval:

Basically, I'm figuring the approximate drop rate which at N drops, will give 0 drops 5% of the time. The above table should give people a ballpark idea of how many drops total are needed to set the maximum drop rate as 'Very Low', for various definitions of 'Very Low'.
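One way to put that reasoning into code (this is my reading of the approach described above, not the original table, so double-check it): if the true drop rate is p, the chance of seeing zero copies in N drops is (1 - p)^N. Solving (1 - p)^N = 0.05 for p gives the highest rate that would still produce zero drops 5% of the time:

```python
# Highest drop rate consistent (at the given confidence) with seeing
# zero copies of the item in n recorded drops: solve (1 - p)^n = 1 - c.
def max_rate_for_zero_drops(n, confidence=0.95):
    return 1 - (1 - confidence) ** (1 / n)

for n in (3, 10, 100):
    print(n, f"{max_rate_for_zero_drops(n):.1%}")
```

Under this formula, 3 drops caps the rate around 63%, 10 drops around 26%, and 100 drops around 3%, which matches the intuition in the paragraph above: the more empty drops you record, the lower the ceiling falls.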