Statistical inference about means and proportions with two populations seems to be one of the most commonly used applications in the field of analytics – comparing campaign response rates between 2 groups of customers, pre and post campaign sales, membership renewal rates, etc.

Call it chance or whatever, but whenever these kinds of tasks come up, I hear people talking only about t-tests. No issues as long as you want to compare means, that is, when your target variable is a continuous value. But why do people talk about the t-test when they want to compare ratios or proportions? Whatever happened to the chi-square test or the Z-test for a difference in proportions?

I did a bit of research on the net, a bit of calculation using pen and paper [very good exercise for the brain in this age of calculators and spreadsheets 🙂 ], read a very good article by Gerard E. Dallal, and I found the answers.

Going back to our introductory class in statistics, let’s check out the formulae for the t-tests.

**1. Assuming that the population variances are equal, T = (X_{1} – X_{2})/sqrt(Sp^{2}(1/n_{1} + 1/n_{2})) ……….Equation 1**

where

X_{1}, X_{2} = means of samples 1 and 2

n_{1}, n_{2} = sizes of samples 1 and 2

Sp^{2} = pooled variance = [((n_{1} – 1)S_{1}^{2} + (n_{2} – 1)S_{2}^{2})/(n_{1} + n_{2} – 2)]

**2. Assuming that the population variances are not equal, T = (X_{1} – X_{2})/sqrt(S_{1}^{2}/n_{1} + S_{2}^{2}/n_{2}) ……….Equation 2**
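To make Equations 1 and 2 concrete, here is a minimal Python sketch that computes both statistics from raw samples using only the standard library. The function names (`pooled_t`, `unpooled_t`) and the sample values are made up for illustration; they are not from any particular package or dataset.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Equation 1: t statistic using the pooled variance Sp^2."""
    n1, n2 = len(a), len(b)
    s1_sq, s2_sq = variance(a), variance(b)  # sample variances (n - 1 divisor)
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))


def unpooled_t(a, b):
    """Equation 2: t statistic using each sample's own variance."""
    n1, n2 = len(a), len(b)
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / n1 + variance(b) / n2)


# Hypothetical pre- and post-campaign sales figures (illustrative only)
before = [12.1, 9.8, 11.4, 10.9, 13.2, 10.0]
after = [13.0, 11.9, 12.7, 14.1, 12.2, 13.5, 12.9, 14.4]

t_pooled = pooled_t(before, after)
t_unpooled = unpooled_t(before, after)
print(f"Equation 1 (pooled):   T = {t_pooled:.4f}")
print(f"Equation 2 (unpooled): T = {t_unpooled:.4f}")
```

When the two samples have the same size, the two denominators coincide and both equations give the same statistic; they differ only when the sample sizes (and variances) differ.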

We have also been taught that the test statistic Z is used to test the difference between two population proportions based on the difference between the two sample proportions (P_{1} – P_{2}).

And the formula for the Z statistic is given by

**Z = (P_{1} – P_{2})/sqrt(P(1 – P)(1/n_{1} + 1/n_{2})) ……….Equation 3**

where

P_{1}, P_{2} = proportions of success (or target category) in samples 1 and 2

S_{1}^{2}, S_{2}^{2} = variances of samples 1 and 2 (as used in Equations 1 and 2)

n_{1}, n_{2} = size of samples 1 and 2

P = pooled estimate of the proportion of successes = (X_{1} + X_{2}) / (n_{1} + n_{2})

X_{1}, X_{2} = number of successes (or target category) in samples 1 and 2 (here X denotes counts, not the sample means of Equations 1 and 2)
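Equation 3 can be sketched in a few lines of Python. The helper names `two_proportion_z` and `two_sided_p` are my own, and the campaign counts (60 responders out of 500 versus 45 out of 500) are invented for illustration. The two-sided p-value comes from the standard normal distribution via `math.erfc`.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Equation 3: Z statistic with the pooled proportion P."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled estimate of the proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def two_sided_p(z):
    """Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical response counts for two customer groups
z = two_proportion_z(60, 500, 45, 500)
p_value = two_sided_p(z)
print(f"Z = {z:.4f}, two-sided p-value = {p_value:.4f}")
```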

The test statistic Z (Equation 3) is equivalent to the chi-square test of homogeneity of proportions applied to the 2×2 table of group by outcome.

But how different are proportions from means? The proportion having the desired outcome is the number of individuals/observations with the outcome divided by the total number of individuals/observations. Suppose we create a variable that equals 1 if the subject has the outcome and 0 if not. The proportion of individuals/observations with the outcome is the mean of this variable, because the sum of these 0s and 1s is the number of individuals/observations with the outcome.

Let’s suppose there are m 1s and (n – m) 0s among the n observations. Then X_{Mean} (= P) = m/n, and the deviation (X_{i} – X_{Mean}) is equal to (1 – m/n) for m observations and (0 – m/n) for the other (n – m) observations. Summing the squared deviations, the final result is

∑(X_{i} – X_{Mean})^{2} = m(1 – m/n)^{2} + (n – m)(0 – m/n)^{2}

= m(1 – 2m/n + m^{2}/n^{2}) + (n – m)m^{2}/n^{2}

= m – 2(m^{2}/n) + (m^{3}/n^{2}) + (m^{2}/n) – (m^{3}/n^{2})

= m – (m^{2}/n)

= m(1 – m/n)

= nP(1 – P)

So, variance = ∑(X_{i} – X_{Mean})^{2}/n = P(1-P)
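A quick numeric check of this result, as a small sketch with arbitrary values m = 3 and n = 10: the variance of a 0/1 variable (dividing by n, which is what `statistics.pvariance` does) should come out as P(1 – P).

```python
from statistics import mean, pvariance

m, n = 3, 10                       # arbitrary illustrative counts
x = [1] * m + [0] * (n - m)        # m ones and (n - m) zeros

p = mean(x)                                   # proportion = mean of the 0/1 variable
sum_sq = sum((xi - p) ** 2 for xi in x)       # sum of squared deviations
var = pvariance(x)                            # divides by n, not n - 1

print(f"P = {p}, sum of squares = {sum_sq:.4f}, n*P*(1-P) = {n * p * (1 - p):.4f}")
print(f"variance = {var:.4f}, P*(1-P) = {p * (1 - p):.4f}")
```

With the n – 1 divisor used by the usual sample variance, the value is nP(1 – P)/(n – 1) instead, which is practically the same for large n.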

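Substituting this into Equation 3 (for the Z statistic), we get

(P_{1} – P_{2})/sqrt(Variance/n_{1} + Variance/n_{2}), where Variance = P(1 – P). This has the same structure as the t statistics above: with a single pooled variance it mirrors Equation 1, and if each sample's own variance P_{i}(1 – P_{i}) is used instead, it is not so different from Equation 2 (the formula for the “equal variances not assumed” version of the t-test).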

As long as the sample size is relatively large, the distributional assumptions are met, and the response is binomial – the t test and the z test will give p-values that are very close to one another.
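To see how close the two statistics get, here is a rough sketch that codes a binary response as 0/1, runs the "equal variances not assumed" t formula (Equation 2) on it, and compares the result with the pooled Z statistic (Equation 3). The counts are hypothetical, and `binary_sample` is just a helper I made up to build the 0/1 data. Only the test statistics are compared here; with roughly a thousand observations the t and normal reference distributions are nearly indistinguishable, so the p-values follow suit.

```python
import math
from statistics import mean, variance


def binary_sample(successes, n):
    """Build a 0/1 sample with the given number of successes."""
    return [1] * successes + [0] * (n - successes)


# Hypothetical campaign data: 60/500 responders vs 45/500 responders
g1 = binary_sample(60, 500)
g2 = binary_sample(45, 500)
n1, n2 = len(g1), len(g2)
p1, p2 = mean(g1), mean(g2)

# Equation 2 applied to the 0/1 data (sample variances use the n - 1 divisor)
t_stat = (p1 - p2) / math.sqrt(variance(g1) / n1 + variance(g2) / n2)

# Equation 3 with the pooled proportion
p = (sum(g1) + sum(g2)) / (n1 + n2)
z_stat = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

print(f"t = {t_stat:.4f}, z = {z_stat:.4f}, difference = {abs(t_stat - z_stat):.5f}")
```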

And in the case where we have only two categories, the z test and the chi-square test turn out to be exactly equivalent, though the chi-square is by nature a two-tailed test. The chi-square distribution for 1 df is just the square of the z distribution.
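The equivalence is easy to check numerically. The sketch below computes the Pearson chi-square statistic for the 2×2 table (without any continuity correction, which is an assumption on my part; some software applies one by default) and compares it with the square of the pooled Z statistic. The function names and counts are illustrative only.

```python
import math


def pooled_z(x1, n1, x2, n2):
    """Pooled two-proportion Z statistic."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se


def pearson_chi_square(x1, n1, x2, n2):
    """Pearson chi-square for the 2x2 table, no continuity correction."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: 60/500 vs 45/500 successes
z = pooled_z(60, 500, 45, 500)
chi2 = pearson_chi_square(60, 500, 45, 500)
print(f"z^2 = {z ** 2:.6f}, chi-square = {chi2:.6f}")
```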

The various tests and their assumptions as listed in Wikipedia are given below:

**1. Two-sample pooled t-test, equal variances**

(Normal populations or n_{1} + n_{2} > 40) and independent observations and σ_{1} = σ_{2} and (σ_{1} and σ_{2} unknown)

**2. Two-sample unpooled t-test, unequal variances**

(Normal populations or n_{1} + n_{2} > 40) and independent observations and σ_{1} ≠ σ_{2} and (σ_{1} and σ_{2} unknown)

**3. Two-proportion z-test, equal variances**

n_{1} p_{1} > 5 and n_{1}(1 − p_{1}) > 5 and n_{2} p_{2} > 5 and n_{2}(1 − p_{2}) > 5 and independent observations

**4. Two-proportion z-test, unequal variances**

n_{1} p_{1} > 5 and n_{1}(1 − p_{1}) > 5 and n_{2} p_{2} > 5 and n_{2}(1 − p_{2}) > 5 and independent observations
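The sample-size conditions for the two-proportion z-test are simple enough to check in code before running the test. This is only a sketch: the function name `z_test_conditions_met` is hypothetical, and it cannot check independence of observations, which depends on how the data were collected. Since n·p is just the number of successes and n·(1 − p) the number of failures, the check works directly on the counts.

```python
def z_test_conditions_met(x1, n1, x2, n2, threshold=5):
    """Check n*p > threshold and n*(1 - p) > threshold in both samples.

    x1, x2 are success counts; n1, n2 are sample sizes.
    n*p equals the success count and n*(1 - p) the failure count.
    """
    counts = [x1, n1 - x1, x2, n2 - x2]
    return all(c > threshold for c in counts)


# Hypothetical examples
print(z_test_conditions_met(60, 500, 45, 500))  # large counts in every cell
print(z_test_conditions_met(2, 20, 10, 200))    # only 2 successes in sample 1
```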