What is Chi Square and when to use it

Chi Square is a widely used tool for checking association between variables. It is explained here with very simple examples so that the concept is easy to follow. Chi Square is used to check the effect of a factor on an output, and also to check the goodness of fit of various distributions.

Important to note: Chi Square is used when both X and Y are discrete (categorical) data types. The Chi Square statistic should be calculated only on counts. If the data are in percentage form, they should first be converted to counts. Another assumption is that the observations are drawn independently.
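As a small illustration of the conversion to counts, here is a minimal Python sketch. The percentages, the sample size, and the helper name percentages_to_counts are all hypothetical; the point is only that the sample size must be known before percentages can be turned back into counts.

```python
def percentages_to_counts(percentages, sample_size):
    """Convert category percentages back to whole-number counts."""
    return [round(p / 100 * sample_size) for p in percentages]

# Illustrative values only (not taken from any real data set).
percentages = [45.0, 30.0, 25.0]
sample_size = 200  # must be known; percentages alone are not enough

counts = percentages_to_counts(percentages, sample_size)
print(counts)
```

The Chi Square statistic would then be calculated on counts, never on the percentages themselves.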

Purpose I

Chi Square is used for the following two purposes:

1. To test a hypothesis about several proportions (contingency table): Chi Square is used to test the significance of the observed association in a cross tabulation. The null hypothesis is that there is no association between the variables. The test is conducted by computing the cell frequencies that would be expected if no association were present between the variables, given the row and column totals.

Chi-Square Test of Independence

Ho: A factor has no effect on the output

Ha: A factor has an effect on the output
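The expected cell frequencies mentioned above can be sketched in Python as follows. Under the null hypothesis of no association, each expected count is (row total x column total) / grand total. The function name expected_counts and the observed table are assumptions made for illustration only.

```python
def expected_counts(observed):
    """Expected cell counts under no association, from row and column totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical 2 x 2 cross tabulation (rows: factor levels, columns: outcomes).
observed = [[30, 20],
            [20, 30]]

expected = expected_counts(observed)
print(expected)
```

The further the observed counts are from these expected counts, the stronger the evidence against Ho.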

Purpose II

2. Chi-Square Goodness-of-Fit Test

Chi Square can also be used to determine whether a certain model fits the observed data. These tests are conducted by calculating the significance of the sample's deviation from the assumed theoretical (expected) distribution. This can be performed on cross tabulations as well as on frequencies (one-way tabulation). The calculation of the Chi Square statistic and the determination of its significance are the same as for Purpose I.

Ho: The hypothesized distribution is a good fit of the data

Ha: The hypothesized distribution is not a good fit of the data
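A goodness-of-fit check on a one-way tabulation might look like the sketch below. It assumes a hypothetical experiment of 120 die rolls tested against a uniform distribution; the counts are invented for illustration. The critical value 11.070 is the standard table value for 5 degrees of freedom (6 categories minus 1) at the 0.05 significance level.

```python
def goodness_of_fit_statistic(observed, expected_proportions):
    """Chi Square statistic for observed counts against hypothesized proportions."""
    total = sum(observed)
    statistic = 0.0
    for o, p in zip(observed, expected_proportions):
        e = total * p  # expected count for this category
        statistic += (o - e) ** 2 / e
    return statistic

# Hypothetical counts for faces 1 to 6 over 120 rolls.
observed = [25, 17, 15, 23, 24, 16]
fair_die = [1 / 6] * 6

statistic = goodness_of_fit_statistic(observed, fair_die)
df = len(observed) - 1
critical_value = 11.070  # Chi Square table, df = 5, significance level 0.05

reject_h0 = statistic > critical_value
print(statistic, df, reject_h0)
```

With these made-up counts the statistic stays below the critical value, so Ho (the die is fair) would not be rejected.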

Chi Square example (Contingency table)

To test a hypothesis about several proportions (contingency table)

It is often necessary to compare proportions representing various process conditions. For example, machines may be compared on their ability to produce precise parts, or inspectors may be evaluated on their ability to identify defective parts. This application of Chi Square is called contingency table analysis, or row and column analysis.

The procedure is as follows:

1. Take one subgroup from each of the various processes and determine the observed frequencies (O) for the various conditions being compared.

2. Calculate, for each condition, the expected frequencies (E) under the assumption that no differences exist among the processes.

3. Compare the observed and expected frequencies. The following calculation is made for each condition:

$$\frac{(O - E)^2}{E}$$

4. Total the results over all the process conditions:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

This is the most “famous” Chi-Square statistic.

5. A critical value is determined. The degrees of freedom are calculated as (R-1)(C-1): the number of rows minus 1, times the number of columns minus 1.

6. A comparison between the test statistic and the critical value shows whether a significant difference exists (at the selected confidence level). If the test statistic exceeds the critical value, the null hypothesis is rejected.
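The six steps above can be sketched end to end in Python. The example assumes three hypothetical machines, each with a subgroup of 100 parts classified as good or defective; the counts and the function name chi_square_test are invented for illustration. The critical values are standard table values at the 0.05 significance level.

```python
# Chi Square critical values at the 0.05 significance level, keyed by
# degrees of freedom (standard table values for small df).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def chi_square_test(observed):
    """Return the Chi Square statistic and degrees of freedom for an R x C table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Step 2: expected frequency if no differences exist among processes.
            e = row_totals[i] * col_totals[j] / grand_total
            # Steps 3 and 4: (O - E)^2 / E, totalled over all cells.
            statistic += (o - e) ** 2 / e

    # Step 5: degrees of freedom, (R - 1)(C - 1).
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Step 1: hypothetical observed counts, rows = machines, columns = [good, defective].
observed = [[90, 10],
            [85, 15],
            [80, 20]]

statistic, df = chi_square_test(observed)

# Step 6: compare the test statistic with the critical value.
significant = statistic > CRITICAL_05[df]
print(round(statistic, 3), df, significant)
```

For these made-up counts the statistic is below the critical value for 2 degrees of freedom, so the data would not show a significant difference among the machines.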

Comparing proportions (Contingency tables)

Suppose we are analyzing the performance of the German soccer team in Germany and overseas during the last 2 years. We look at the performance data and come up with the following figures:

The data has two classifications

This table is called a 2 x 2 contingency table (2 rows, 2 columns).

We are comparing two proportions here, i.e. victories in Germany and victories overseas.

Let’s hypothesize that the proportion of victories is the same in home conditions and abroad.
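To make this hypothesis concrete, here is a minimal Python sketch of the test on a 2 x 2 table. The match counts below are invented purely for illustration and are not the figures from the performance data. For a 2 x 2 table there is 1 degree of freedom, and in that special case the p-value can be obtained from the standard library as erfc(sqrt(statistic / 2)). No continuity correction is applied in this sketch.

```python
import math

def chi_square_2x2(table):
    """Chi Square statistic and p-value (df = 1) for a 2 x 2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (table[i][j] - e) ** 2 / e

    # With 1 degree of freedom, P(Chi Square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = [Germany, Overseas], columns = [won, not won].
matches = [[20, 10],
           [12, 18]]

statistic, p_value = chi_square_2x2(matches)
print(round(statistic, 3), p_value < 0.05)
```

A p-value below the chosen significance level (0.05 here) would lead us to reject the hypothesis that the proportion of victories is the same at home and abroad.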