(A)analytics 101: Sampling Makes Science

Abstract

Just as you vet the business rules and data sources behind a business report, you must also question the sampling used in an advanced analytics project. You should know three terms:

  • Random Sampling – the case where you know the population and can select randomly from it.
  • Convenience Sampling – the situation where you take what you can get from the population and must then check that the group you reached is representative of it.
  • Random Assignment – when you randomly assign people to either a test or control group.

As a business sponsor or stakeholder, always evaluate the sampling plan before starting a project. You should also pause the project and gather a second opinion if you have any concerns. Just like reporting, if the methods are suspect, then so are the results. Hold your team accountable on sampling design; it will ensure the highest quality of results.

Article

When approaching science, always remember the mantra: sampling makes science. Since the scientific method requires random sampling, if you don’t have random sampling, you don’t have science. It’s that simple.

In the workplace, most leaders are sensitive to the influence business rules and source systems have on their reports. “Whose definition of revenue did you use here?” “Where are you pulling this line item from?” “Does this reconcile with finance’s billing report?” Indeed, most leaders know the risks they face – having been burned once before – when making data-driven decisions without vetting the business rules and origins of the data first.

As you move into advanced analytics, a similar level of discipline is required. It’s safe to think of it as the same exercise: you are ensuring that the methods behind the numbers are valid. And just like reporting, if the methods are suspect, then so are the results.

In science, sampling and assignment go hand in hand. Essentially, a reliable and valid experiment will have both random sampling and random assignment into groups. A marketing team may want to determine the effectiveness of two different offers, for example. To draw accurate conclusions from the study, the team will want to randomly sample prospects and then randomly assign them to test and control groups. As a leader on the hook for recommending either a 25% discount or a $15-off promotional offer, and for the financial results that follow, you are responsible for asking, “What is the sampling plan?”

Please know that sampling is a deep subject, and it isn’t your responsibility to know it all. However, you do need to know the difference between random and convenience sampling.

  • Random Sampling – this is used when the population under study is fully known and you randomly sample within it. Drawing conclusions from your 1,000,000 customers may be done by randomly sampling 50,000 of them for a model building exercise, for instance.
  • Convenience Sampling – when you can’t sample from the entire population but instead take what you can get, you are sampling by convenience. Customer feedback, for example, is a form of convenience sampling, since you hear only from the customers who are readily available and willing to share. Convenience sampling can easily produce misleading results, however, so be very, very careful that the convenience sample is representative of the group you want to study.
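
To make the random sampling example concrete, here is a minimal Python sketch of drawing 50,000 customers at random from a known population of 1,000,000. The function name draw_random_sample, the sequential customer ids, and the fixed seed are assumptions made for illustration; they are not part of any particular system.

```python
import random


def draw_random_sample(customer_ids, sample_size, seed=None):
    """Return sample_size ids drawn at random, without replacement.

    customer_ids must be the full, known population (any sequence).
    The seed is optional; fixing it makes the draw reproducible for audit.
    """
    rng = random.Random(seed)  # dedicated generator, independent of global state
    return rng.sample(customer_ids, sample_size)


# Hypothetical population: customer ids 1 through 1,000,000.
population = range(1, 1_000_001)
sample = draw_random_sample(population, 50_000, seed=42)
```

Recording the seed alongside the sampling plan lets a reviewer reproduce exactly which customers were selected.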

In most cases, businesses are not doing true science because they cannot randomly sample from the population of interest. The analyst worth his or her salt will be sensitive to this; it is a positive sign if they are uncomfortable with the convenience samples they are working with. As a manager, it is your responsibility to validate that the samples are, in fact, representative of the group under study.

A final term you need to know is random assignment. This is at the heart of the scientific method, and without it, as with random sampling, you simply don’t have science.

Whenever you are testing two or more groups against one another, make sure that the subjects under study are randomly assigned to each treatment group. Often, because of system constraints, someone will recommend the use of alternating assignment (e.g., “A”, “B”, “A”, …) or assignment based on a system id characteristic (e.g., even-numbered customer ids are tagged “A” and odd-numbered ones are tagged “B”). Take the high road and push for random assignment from a pseudo-random number generator. It is absolutely mind-blowing how correlated timing and system ids can be with other variables. Random assignment is a significant control mechanism for guarding against confounding variables. When your analysts fight for it, fight with them. It will make all the difference in the quality of your results.
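
As a rough illustration of random assignment from a pseudo-random number generator, the Python sketch below shuffles the subjects and then deals them into groups. The function name assign_groups, the “A” and “B” labels, and the seed are hypothetical choices for this example, not a prescribed implementation.

```python
import random


def assign_groups(subject_ids, groups=("A", "B"), seed=None):
    """Randomly assign each subject to one of the groups.

    Subjects are shuffled with a pseudo-random number generator and then
    dealt out in turn, so group sizes differ by at most one. Because the
    order is random, assignment does not depend on id parity or timing.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {sid: groups[i % len(groups)] for i, sid in enumerate(shuffled)}


# Hypothetical study: 1,000 sampled prospects split into two offers.
assignment = assign_groups(range(1, 1001), seed=7)
```

Shuffling before dealing keeps the groups balanced in size, while the group a given prospect lands in is decided by the generator rather than by the order or numbering of the records.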