This is one potential trouble spot with big data that I don’t think most people recognize or consider. Many big data initiatives today involve that special combination of new data applied to a new problem. This makes it critical to validate your assumptions and the influence they have on the results of the analysis. If your results aren’t stable across the range of reasonable assumptions, then you have a problem.
I recall many years ago when I was first building predictive models that included TV advertising data. The data was at an extremely high level to begin with and on top of that we had to make many assumptions about the data as we prepared it for our models. For example, what decay rate would we use for the advertising impressions? How would we reconcile any differences in projected impressions from different sources?
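To make the decay-rate assumption concrete, here is a minimal sketch of the kind of "adstock" transformation commonly used for advertising impressions. The function, variable names, and numbers are all illustrative, not taken from the original analysis: each period carries forward a fraction (`decay`) of the previous period's accumulated advertising effect.

```python
def adstock(impressions, decay):
    """Transform raw per-period impressions into decayed (adstocked) values.

    `decay` is the assumed fraction of last period's carried-over
    advertising effect that persists into the current period.
    """
    carried = 0.0
    out = []
    for x in impressions:
        carried = x + decay * carried
        out.append(carried)
    return out

# Illustrative weekly impressions (in thousands)
weekly = [100, 0, 0, 50, 0]
print(adstock(weekly, 0.5))  # [100.0, 50.0, 25.0, 62.5, 31.25]
```

The choice of `decay` is exactly the kind of assumption the rest of this article is about: the model never sees the raw impressions, only the transformed series, so the assumed rate shapes everything downstream.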
The guidance I was given at the time, which I believe most people still follow today, is that if a model’s parameter estimates come out significant and the model explains a good bit of the variance, then you have a good model and you can use it. However, I stumbled upon a huge problem with that approach.
One day I had what would have been considered a good model under the above rules. However, for some reason, I decided to see what would happen if I changed my assumptions about the decay rate and a few other points and re-ran the analysis. I was astonished to see that I still had significant parameters that in total explained a lot of the variance. However, the new parameter estimates were different from my original ones by more than the margin of error!
In effect, my assumptions did more to determine my results than did the model itself. The team and I did more extensive testing to finalize assumptions we all agreed were the best possible for the advertising data. However, I am still uncomfortable today with the idea that assumptions can in many cases do more to determine your answer than the analysis that uses those assumptions.
Be Sure To Test Your Assumptions
I recommend that you make a point to test the impact that your assumptions have on your results even if a new analysis looks great at first. If you find that minor changes in your assumptions have a substantive impact on your results, then you should go through a much more detailed process of validating your assumptions. This is especially true if changing assumptions leads to results that will actually point to different decisions. With big data, this extra work may be necessary frequently because you are often breaking new ground where assumptions haven’t stood the test of time and application.
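A sensitivity check like the one described above can be as simple as refitting the same model under each candidate assumption and comparing the estimates. The sketch below is a toy version, assuming a single-predictor regression of sales on adstocked TV impressions; the data, decay values, and helper functions are all made up for illustration.

```python
def adstock(impressions, decay):
    """Carry forward a fraction `decay` of prior advertising effect."""
    carried, out = 0.0, []
    for x in impressions:
        carried = x + decay * carried
        out.append(carried)
    return out

def slope(x, y):
    """Ordinary least squares slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

# Illustrative weekly TV impressions and sales
tv = [120, 80, 0, 60, 90, 0, 40, 100]
sales = [15, 14, 9, 11, 13, 8, 10, 14]

# Refit under several assumed decay rates and compare the coefficient
for decay in (0.3, 0.5, 0.7):
    print(f"decay={decay}: slope={slope(adstock(tv, decay), sales):.4f}")
```

If the coefficient (and the decision it supports) shifts materially as `decay` moves across its plausible range, that is the signal to go back and validate the assumption more rigorously before trusting the model.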
Of course, there is always the possibility that your assumptions are wrong. You may also not be able to prove what the best assumptions are. For an example of this, look at the impact on a retirement portfolio of changes in the average compound rate of return earned over time. There is no way to know what the actual rate of return will be, but you are wise to use a rate toward the lower end of what you think is reasonable, to be safe. By understanding how the rate-of-return assumption impacts the ending value of your savings, you are able to choose assumptions that best fit your mindset, risk tolerance, and needs.
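The retirement example is easy to quantify. The sketch below, with made-up contribution amounts and horizons, shows how sensitive the ending balance is to the assumed average annual return:

```python
def future_value(annual_contribution, rate, years):
    """Ending value of end-of-year contributions compounded at `rate`."""
    total = 0.0
    for _ in range(years):
        total = total * (1 + rate) + annual_contribution
    return total

# Same savings plan, different return assumptions
for rate in (0.04, 0.06, 0.08):
    print(f"{rate:.0%}: {future_value(10_000, rate, 30):,.0f}")
```

A few percentage points of assumed return roughly double the projected ending balance over thirty years, which is why the conservative end of the plausible range is the safer planning assumption.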
Following the approach I have outlined here won’t remove all your risk. But it will certainly ensure that you better understand the risks you are exposed to. In a situation where one set of reasonable assumptions produces a result that says “go” and another says “no go”, I suggest that you make everyone aware of the issue and then have a candid discussion about the implications of choosing one set of assumptions over the other as you determine the best way to proceed. This leads to a more informed decision, which is what you should always strive for with any analysis.
Originally published by the International Institute for Analytics