Joe and Devine are having a casual conversation in a coffee shop. While Devine orders his usual espresso, Joe orders a new item on the menu, the memory booster mocha.

Devine: Do you think this “memory booster” will work?

Joe: Apparently, their success rate is 75%, and they advertise that they are better than chance.

Devine: Hmm. So we can consider you as a subject of the experiment and test their claim.

Joe: Like a hypothesis test?

Devine: Yes. We can collect the data from all the subjects who participated in the test and verify, statistically, if their 75% is sufficiently different from the random chance of 50%.

Joe: Is there a formal way to prove this?

Devine: Once we collect the data, we can compute the probability of the data under the assumption of a null hypothesis, and if this probability is less than a certain threshold, we can say with some confidence that the data is incompatible with the null hypothesis. We can reject the null hypothesis. You must be familiar with “proof by contradiction” from your classes in logic.

Joe: Null hypothesis? How do we establish that? Moreover, there will be a lot of sampling variability, and depending on what sample we get, the results may be different. How can there be a complete contradiction?

Devine: That is a good point, Joe. It is possible to get samples that will show no memory-boosting effect, in which case we cannot contradict the null hypothesis. Since we are basing our decisions on the probability calculated from the sample data given a null hypothesis, we should say we are proving by low-probability. It is possible that we err in our decision 😉

Joe: There seem to be several concepts here that I may have to understand carefully. Can we dissect them and take it sip by sip?

Devine: Absolutely. Let’s go over the essential elements of hypothesis tests today and then, in the following weeks, we can dig deeper. I will introduce you to some new terms today, but we will learn about their details in later lessons. The hypothesis testing concepts are vast. While we may only look at the surface, we will emphasize the philosophical underpinnings that will give you the required arsenal to go forward.

Joe: 😎 😎 😎

Devine: Let’s start with a simple classification of the various types of hypothesis tests: one-sample tests and two or more sample tests.

A one-sample hypothesis is a statement about the parameter of the population; or, it is a statement about the probability distribution of a random variable.

Our discussion today is on whether or not a certain proportion of subjects taking the memory-boosting mocha improve their memory. The test is to see if this proportion is significantly different from 50%. We are verifying whether the parameter (proportion, p) is equal to or different from 50%. So it is a one-sample hypothesis test.

The value that we compare the parameter against can be based on experience or knowledge of the process, on some theory, or on some design considerations or obligations. If it is based on experience or prior knowledge of the process, then we are verifying whether or not the parameter has changed. If it is based on some theory, then we are testing the theory. Our coffee example falls under this category. We know that random chance means a 50% probability of improving (or not) the memory. So we test the proportion against this model: p = 0.5. If the parameter is compared against a value based on some design consideration or obligation, then we are testing for compliance.

Sometimes, we have to test one sample against another sample. For example, people who take the memory-boosting test in New York City may be compared with people taking the test in San Francisco. This type of test is a two or multiple sample hypothesis test where we determine whether a random variable differs in its parameter among the two or more groups.

Joe: So, that is one-sample tests or two-sample tests.

Devine: Yes. Now, for any of these two types, we can further classify them into parametric tests or nonparametric tests.

If we assume that the data has a particular probability distribution, the test can be developed based on this probability distribution. These are called parametric tests.

If a probability distribution is appropriate for the data, then, the information contained in the data can be summarized using the parameters of this distribution; like the mean, standard deviation, proportion, etc. The hypothesis test can be designed using these parameters. The entire process becomes very efficient since we already know the mathematical formulations. In our case, since we are testing for proportion, we can assume a binomial distribution to derive the probabilities.

Joe: What if the data does not follow the distribution that we assume?

Devine: This is possible. If we make incorrect assumptions regarding the probability distributions, the parameters that we use to summarize the data are, at best, a poor representation of the data, which can lead to incorrect conclusions.

Joe: So I believe the nonparametric tests are an alternative to this.

Devine: That is correct. There are hypothesis tests that do not require the assumption that the data follow a particular probability distribution. Do you recall the bootstrap where we used the data to approximate the probability distribution function of the population?

Joe: Yes, I remember that. We did not have to make any assumption for deriving the confidence intervals.

Devine: Exactly. These types of tests are called nonparametric hypothesis tests. Information is efficiently extracted from the data without summarizing them into their statistics or parameters.

Here, I prepared a simple chart to show these classifications.

Joe: Is there a systematic process for the hypothesis test? Are there steps that I can follow?

Devine: Of course. We can follow these five steps for any hypothesis test. Let’s use our memory-booster test as a case in point as we elaborate on these steps.

1. Choose the appropriate test; one-sample or two-sample and parametric or nonparametric.
2. Establish the null and alternate hypothesis.
3. Decide on an acceptable rate of error or rejection rate ($\alpha$).
4. Compute the test statistic and its corresponding p-value from the observed data.
5. Make the decision; reject the null hypothesis if the p-value is less than the acceptable rate of error, $\alpha$.

Joe: Awesome. We discussed the choice of the test: one-sample or two-sample, parametric or nonparametric. The choice between a parametric and a nonparametric test should be based on the expected distribution of the data.

Devine: Yes, if we are comfortable with the assumption of a probability distribution for the data, a parametric test may be used. If there is little information about the prior process, then it is beneficial to use the nonparametric tests. Nonparametric tests are also especially appropriate for small data sets.

As I already told you, we can assume a binomial distribution for the data on the number of people showing signs of improvement after taking the memory-boosting mocha.

Suppose ten people take the test. The probabilities can then be derived from a binomial distribution with n = 10 and p = 0.5. The null distribution, i.e., what may happen by chance, is a binomial distribution with n = 10 and p = 0.5, and we can check how far out on this distribution our observed proportion is.
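As a minimal sketch, the null distribution described here can be tabulated with the standard library alone. The helper name `binomial_pmf` and the variable names are illustrative choices, not something from the lesson.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p0 = 10, 0.5  # ten subjects, 50% chance of "improvement" under the null

# Null distribution: probability of each possible count 0..10 by chance alone
null_distribution = [binomial_pmf(k, n, p0) for k in range(n + 1)]

for k, prob in enumerate(null_distribution):
    print(f"P(X = {k:2d}) = {prob:.4f}")
```

The counts near 5 carry most of the probability, which is why an observed count far out in the right tail is hard to attribute to chance.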

Joe: What about the alternate hypothesis?

Devine: If the null hypothesis is that the memory-booster has no impact, we would expect, on average, a 50% probability of success, i.e., around 5 out of 10 people will see the effect purely by chance. Now, the coffee shop claims that their new product is effective, beyond random possibility. We call this claim the alternate hypothesis.

The null hypothesis ($H_{0}$) is that p = 0.5.

The alternate hypothesis ($H_{A}$) is that p > 0.5.

The null hypothesis is usually denoted as $H_{0}$, and the alternate hypothesis is denoted as $H_{A}$.

The null hypothesis ($H_{0}$) is what is assumed to be true before any evidence from data. It is usually the null situation that has to be disproved otherwise. Null has the meaning of “no effect,” or “of no consequence.”

$H_{0}$ is identified with the hypothesis of no change from the current belief.

The alternate hypothesis ($H_{A}$) is the situation that is anticipated to be true if the data (empirical evidence) shows that the null hypothesis ($H_{0}$) is unlikely.

The alternate hypothesis can be of two types, the one-sided alternative or the two-sided alternative.

The two-sided alternative can be considered when evidence in either direction (values larger than or smaller than the accepted level) would cause the rejection of the null hypothesis. The one-sided alternative is considered when the departures in one direction (either less than or greater than) are sufficient to reject $H_{0}$.

Our test is a one-sided alternative hypothesis test. The proportion of people who would benefit from the memory-booster coffee is greater than the proportion who would claim benefit randomly.
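To make the difference between the two kinds of alternatives concrete, here is a rough sketch that computes both tail probabilities for the same count. The count of 8 is purely hypothetical, and the two-sided rule used here (add up all outcomes at least as far from the expected count as the observed one) is one common convention, assumed for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail_p_value(observed, n, p0):
    """One-sided (greater-than) p-value: P(X >= observed) under the null."""
    return sum(binomial_pmf(k, n, p0) for k in range(observed, n + 1))


def two_sided_p_value(observed, n, p0):
    """Two-sided p-value: outcomes at least as far from n*p0 as the observed count."""
    center = n * p0
    distance = abs(observed - center)
    return sum(
        binomial_pmf(k, n, p0)
        for k in range(n + 1)
        if abs(k - center) >= distance
    )


n, p0 = 10, 0.5
hypothetical_count = 8  # assumed for illustration only, not data from the lesson

print(f"one-sided p-value: {upper_tail_p_value(hypothetical_count, n, p0):.4f}")
print(f"two-sided p-value: {two_sided_p_value(hypothetical_count, n, p0):.4f}")
```

Because the null distribution with p = 0.5 is symmetric, the two-sided value is twice the one-sided value in this setting.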

It is usually the case that the null hypothesis is the favored claim. The onus of proof is on the alternative, i.e., we will continue to believe in $H_{0}$, the status quo, unless the experimental evidence strongly contradicts it; proof by low-probability.

Joe: Understood. In step 3, I see there are some new terms, the acceptable rate of error, rejection rate $\alpha$. What is this?

Devine: Think about the possible outcomes of your hypothesis test.

Joe: We will either reject the null hypothesis or accept the null hypothesis.

Devine: Right. Let’s say we either reject the null hypothesis or fail to reject the null hypothesis if the data is inconclusive. Now, would your decision always be correct?

Joe: Not necessarily?

Devine: Let’s say the memory-booster claim is false, and we know that for sure. But if the people who took the test claim that their memory improved, we would reject the null hypothesis in favor of the alternate. However, we know that the coffee should not have any effect. We know $H_{0}$ is true, but, based on the sample, we had to reject it. We committed an error. This kind of error is called a Type I error. Let’s call the probability of this error the rejection rate, $\alpha$. There is a certain probability that this will happen, and we select this rejection rate. Assume $\alpha = 0.05$.

A 5% rejection rate implies that we are rejecting the null hypothesis 5% of the time when in fact $H_{0}$ is true.

Now, in reality, we will not know whether or not $H_{0}$ is true. The choice of $\alpha$ is the risk we take of rejecting the truth. If we choose $\alpha = 0.05$, a 5% rejection rate, we accept rejecting a true null hypothesis 5% of the time.

In hypothesis tests, it is common practice to set $\alpha$ at 5%. However, $\alpha$ can also be chosen to have a higher or lower rejection rate.

Suppose $\alpha = 0.01$; we will only reject a true null hypothesis 1% of the time. There needs to be greater proof to reject the null. If you want to save yourself that extra dollar, you would like to see greater proof, a lower rejection rate. The coffee shop would perhaps like to choose a larger $\alpha$. They want to reject the null hypothesis more often, so they can show value in their new product.
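The following sketch, under the assumption that we use the one-sided binomial test with n = 10 and reject when the p-value is less than the rejection rate, finds the smallest count that leads to rejection and the Type I error rate that actually results. The function names are illustrative. Because the counts are discrete, the actual rate can sit below the chosen rejection rate rather than exactly on it.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def critical_value(n, p0, alpha):
    """Smallest count c whose p-value P(X >= c | p0) is below alpha.

    Returns n + 1 when no attainable count leads to rejection.
    """
    for c in range(n + 2):
        if upper_tail(c, n, p0) < alpha:
            return c


n, p0 = 10, 0.5
for alpha in (0.01, 0.05):
    c = critical_value(n, p0, alpha)
    actual_type_one_rate = upper_tail(c, n, p0)
    print(
        f"alpha = {alpha:.2f}: reject when X >= {c}, "
        f"actual Type I error rate = {actual_type_one_rate:.4f}"
    )
```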

Joe: I think I understand. But some things are still not evident.

Devine: Don’t worry. We will get to the bottom of it as we do more and more hypothesis tests. There is another kind of error, the second type, Type II. It is the probability of not rejecting the null hypothesis when it is false. For example, suppose the coffee does boost the memory, but a sample of people did not show that effect, we would fail to reject the null hypothesis. In this case, we would have committed a Type II error.

Type II error is also called the lack of power in the test.

Some attention to these two Types shows that Type I and Type II errors are inversely related.

If Type I error is high, i.e., if we choose a high $\alpha$, then Type II error will be low. Alternately, if we want a low $\alpha$ value, then Type II error will be high.
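A small sketch can illustrate this inverse relationship. It assumes, purely for illustration, that the true success probability equals the advertised 75%, that ten people take the test, and that we compare three rejection rates (the 10% level is an added example). All names are hypothetical.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def critical_value(n, p0, alpha):
    """Smallest count c whose p-value P(X >= c | p0) is below alpha."""
    for c in range(n + 2):
        if upper_tail(c, n, p0) < alpha:
            return c


n, p0 = 10, 0.5
p_true = 0.75  # assumption: the advertised success rate is the true one


def type_two_error(alpha):
    """Probability of failing to reject the null when p_true holds."""
    c = critical_value(n, p0, alpha)
    power = upper_tail(c, n, p_true)  # chance of landing in the rejection region
    return 1 - power


for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f} -> Type II error = {type_two_error(alpha):.4f}")
```

With so few subjects the Type II error stays large at every level shown, but it shrinks as the rejection rate is raised.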

Joe: 😐 😐 😐

Devine: I promise. These things will be evident as we discuss more. Let me show all these possibilities in a table.

Joe: Two more steps. What are the test statistic and the p-value?

Devine: The test statistic summarizes the information in the data. For example, suppose that out of ten people who took the test, 9 reported a positive effect. We would take nine as the test statistic and compute $P(X \geq 9)$ as the p-value. In a binomial null distribution with n = 10 and p = 0.5, what is the probability of getting a value as large as or larger than 9? If the value has a sufficiently low probability, we cannot say that it may occur by chance.

If this statistic, 9, is not significantly different from what is expected in the null hypothesis, then $H_{0}$ cannot be rejected.

The p-value is the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one computed from the data. It is the evidence, or lack thereof, against the null hypothesis. The smaller the p-value, the less likely the observed statistic is under the null hypothesis, and the stronger the evidence for rejecting the null.

Here, I computed the probabilities from the binomial distribution, and I am showing it as a null distribution. $P(X \geq 9)$, the p-value, is shaded. Its value is 0.0107.
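For readers who want to check the number, a short sketch of the same calculation follows; the variable names are illustrative.

```python
from math import comb

n, p0, observed = 10, 0.5, 9

# p-value: probability of 9 or more successes out of 10 under the null
p_value = sum(
    comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(observed, n + 1)
)

print(f"p-value = {p_value:.4f}")  # 11/1024, which rounds to 0.0107
```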

Joe: I see. Decision time. If I select a rejection rate of 5%, since the p-value is less than 5%, I have to reject the null hypothesis. If I picked an $\alpha$ value of 1%, I cannot reject the null hypothesis. At the 1% rejection rate, 9 out of 10 is not strong enough evidence for rejection. We need much stronger proof.

Devine: Excellent. What we went through now is the procedure for any hypothesis test. Over the next few weeks, we will undertake several examples that will need a step-by-step hypothesis test to understand the evidence and make decisions. We will also learn the concepts of Type I and Type II errors at length. Till then, here is a summary of the steps.

1. Choose the appropriate test; one-sample or two-sample and parametric or nonparametric.
2. Establish the null and alternate hypothesis.
3. Decide on an acceptable rate of error or rejection rate ($\alpha$).
4. Compute the test statistic and its corresponding p-value from the observed data.
5. Make the decision; reject the null hypothesis if the p-value is less than the acceptable rate of error, $\alpha$.

And remember,

The null hypothesis is never “accepted,” or proven to be true. It is assumed to be true until proven otherwise and is “not rejected” when there is insufficient evidence to do so.
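The five steps can be strung together in a short sketch for the memory-booster example. The function `one_sample_proportion_test` and its parameters are hypothetical names chosen for this illustration; it covers only the one-sided, one-sample binomial case discussed in this lesson.

```python
from math import comb


def one_sample_proportion_test(successes, n, p0=0.5, alpha=0.05):
    """One-sided exact binomial test of H0: p = p0 against HA: p > p0.

    Steps 1-3 (test choice, hypotheses, alpha) are fixed by the arguments;
    steps 4 and 5 are carried out below.
    """
    # Step 4: the test statistic is the success count; the p-value is P(X >= successes)
    p_value = sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(successes, n + 1)
    )
    # Step 5: reject only if the p-value is below the chosen rejection rate
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return p_value, decision


for alpha in (0.05, 0.01):
    p_value, decision = one_sample_proportion_test(9, 10, alpha=alpha)
    print(f"alpha = {alpha:.2f}: p-value = {p_value:.4f} -> {decision}")
```

The wording of the returned decision is deliberate: the function never reports that the null hypothesis is accepted, only that it is rejected or not rejected.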

You can follow me on Twitter @realDevineni for updates on new lessons.