Permutation Test for Two Proportions
Test Proportions
Introduction
We saw from Allen Downey's diagram that there is only one test! We will now use this philosophy to develop a technique that lets us mechanize several statistical models in the same way, with nearly identical code.
We will use two R packages, mosaic and the relatively new infer, to develop our intuition for what are called permutation-based statistical tests.
Testing for Two or More Proportions
Let us try a dataset with Qualitative / Categorical data. This is the
General Social Survey (GSS) dataset, in which people with different
levels of Education state their opinion on the Death Penalty. We
want to know whether these two Categorical variables are associated, i.e.
can opinions in favour of the Death Penalty be explained by the
Education level?
Since both variables are Categorical, we need to tabulate counts
and then run a chi-square test. In the test, we will
permute the Education variable to see how often shuffled data produce
a statistic as large as the one we observed.
##
## categorical variables:
## name class levels n missing
## 1 Region factor 7 2765 0
## 2 Gender factor 2 2765 0
## 3 Race factor 3 2765 0
## 4 Education factor 5 2760 5
## 5 Marital factor 5 2765 0
## 6 Religion factor 13 2746 19
## 7 Happy factor 3 1369 1396
## 8 Income factor 24 1875 890
## 9 PolParty factor 8 2729 36
## 10 Politics factor 7 1331 1434
## 11 Marijuana factor 2 851 1914
## 12 DeathPenalty factor 2 1308 1457
## 13 OwnGun factor 3 924 1841
## 14 GunLaw factor 2 916 1849
## 15 SpendMilitary factor 3 1324 1441
## 16 SpendEduc factor 3 1343 1422
## 17 SpendEnv factor 3 1322 1443
## 18 SpendSci factor 3 1266 1499
## 19 Pres00 factor 5 1749 1016
## 20 Postlife factor 2 1211 1554
## distribution
## 1 North Central (24.7%) ...
## 2 Female (55.6%), Male (44.4%)
## 3 White (79.1%), Black (14.8%) ...
## 4 HS (53.8%), Bachelors (16.1%) ...
## 5 Married (45.9%), Never Married (25.6%) ...
## 6 Protestant (53.2%), Catholic (24.5%) ...
## 7 Pretty happy (57.3%) ...
## 8 40000-49999 (9.1%) ...
## 9 Ind (19.3%), Not Str Dem (18.9%) ...
## 10 Moderate (39.2%), Conservative (15.8%) ...
## 11 Not legal (64%), Legal (36%)
## 12 Favor (68.7%), Oppose (31.3%)
## 13 No (65.5%), Yes (33.5%) ...
## 14 Favor (80.5%), Oppose (19.5%)
## 15 About right (46.5%) ...
## 16 Too little (73.9%) ...
## 17 Too little (60%) ...
## 18 About right (49.7%) ...
## 19 Bush (50.6%), Gore (44.7%) ...
## 20 Yes (80.5%), No (19.5%)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 ID integer 1 692 1383 2074 2765 1383 798.3311 2765 0
Note that all variables except ID are Categorical! Education has five levels:
## Education n
## 1 Left HS 400
## 2 HS 1485
## 3 Jr Col 202
## 4 Bachelors 443
## 5 Graduate 230
## 6 <NA> 5
## DeathPenalty n
## 1 Favor 899
## 2 Oppose 409
## 3 <NA> 1457
Let us drop the NA entries in Education and DeathPenalty, and set up a table for the chi-square test.
## [1] 1307 2
## # A tibble: 10 × 5
## Education DeathPenalty count edu_count edu_prop
## <fct> <fct> <int> <int> <dbl>
## 1 Bachelors Favor 135 206 0.655
## 2 Bachelors Oppose 71 206 0.345
## 3 Graduate Favor 64 114 0.561
## 4 Graduate Oppose 50 114 0.439
## 5 Jr Col Favor 71 87 0.816
## 6 Jr Col Oppose 16 87 0.184
## 7 HS Favor 511 711 0.719
## 8 HS Oppose 200 711 0.281
## 9 Left HS Favor 117 189 0.619
## 10 Left HS Oppose 72 189 0.381
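The `edu_count` and `edu_prop` columns above are the per-Education totals and the share of each opinion within its Education level. The following is a minimal base R sketch of that calculation. It assumes we start from the cell counts printed in the table rather than from the raw survey rows, and the data frame name `counts` is only illustrative.

```r
# Cell counts copied from the table above (NA rows already dropped).
counts <- data.frame(
  Education    = rep(c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS"), each = 2),
  DeathPenalty = rep(c("Favor", "Oppose"), times = 5),
  count        = c(135, 71, 64, 50, 71, 16, 511, 200, 117, 72)
)

# Total respondents per Education level, repeated on every row of that level.
counts$edu_count <- ave(counts$count, counts$Education, FUN = sum)

# Share of each opinion within its Education level, rounded as in the table.
counts$edu_prop <- round(counts$count / counts$edu_count, 3)

print(counts)
```

Comparing `edu_prop` for the Favor rows across Education levels gives a first, informal look at the differences the chi-square test will assess.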
Table Plots
We can plot a heatmap-like mosaic chart for this table, using either ggplot or ggmosaic.
Observed Statistic: the X^2 metric
When several proportions are involved, the $X^2$ test is used.
Let us now perform the base chi-square test. We need a table of counts, and then the chisq test:
## Education
## DeathPenalty Left HS HS Jr Col Bachelors Graduate
## Favor 117 511 71 135 64
## Oppose 72 200 16 71 50
## X.squared
## 23.45093
##
## Pearson's Chi-squared test
##
## data: tally(DeathPenalty ~ Education, data = gss2002)
## X-squared = 23.451, df = 4, p-value = 0.0001029
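To see where the statistic reported above comes from, it can help to compute it by hand. Pearson's statistic sums, over all cells, the squared gap between the observed count $O_{ij}$ and the count $E_{ij}$ expected under independence:

$$
X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{N}
$$

The base R sketch below rebuilds the table from the counts printed above (the object names are illustrative, not the original code) and cross-checks the result against `chisq.test()`.

```r
# Observed counts, copied from the DeathPenalty by Education table above.
observed <- matrix(
  c(117, 511, 71, 135, 64,
     72, 200, 16,  71, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    DeathPenalty = c("Favor", "Oppose"),
    Education    = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate")
  )
)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's X^2 statistic, its degrees of freedom, and the upper-tail p-value.
x_squared <- sum((observed - expected)^2 / expected)
df        <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value   <- pchisq(x_squared, df = df, lower.tail = FALSE)

# Cross-check against the built-in test on the same table.
builtin <- chisq.test(observed)

print(c(x_squared = x_squared, df = df, p_value = p_value))
```

The hand-computed value should agree with the `X-squared = 23.451, df = 4` line in the output above.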
What would our Hypotheses be?
$$
\begin{aligned}
H_0 &: \text{Education does not affect opinion on the Death Penalty}\\
H_a &: \text{Education affects opinion on the Death Penalty}
\end{aligned}
$$
We should now repeat the test many times, shuffling Education each time:
## X.squared df p.value method alternative
## X-squared...1 1.1268272 4 0.8899928 Pearson's Chi-squared test NA
## X-squared...2 6.1348278 4 0.1893030 Pearson's Chi-squared test NA
## X-squared...3 10.1774328 4 0.0375426 Pearson's Chi-squared test NA
## X-squared...4 3.9776503 4 0.4090390 Pearson's Chi-squared test NA
## X-squared...5 3.9712187 4 0.4099150 Pearson's Chi-squared test NA
## X-squared...6 0.3383714 4 0.9872044 Pearson's Chi-squared test NA
## data .row
## X-squared...1 tally(DeathPenalty ~ shuffle(Education), data = gss2002) 1
## X-squared...2 tally(DeathPenalty ~ shuffle(Education), data = gss2002) 1
## X-squared...3 tally(DeathPenalty ~ shuffle(Education), data = gss2002) 1
## X-squared...4 tally(DeathPenalty ~ shuffle(Education), data = gss2002) 1
## X-squared...5 tally(DeathPenalty ~ shuffle(Education), data = gss2002) 1
## X-squared...6 tally(DeathPenalty ~ shuffle(Education), data = gss2002) 1
## .index
## X-squared...1 1
## X-squared...2 2
## X-squared...3 3
## X-squared...4 4
## X-squared...5 5
## X-squared...6 6
## prop_TRUE
## 0.00019998
The p-value is well below our threshold of $0.05$, so we reject $H_0$ and
conclude that Education has a significant effect on DeathPenalty opinion!
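The whole permutation procedure can also be sketched in base R, without mosaic. This is an illustrative version under stated assumptions, not the code that produced the output above. It rebuilds one row per respondent from the published cell counts, uses `sample()` in place of `shuffle()`, runs 1000 shuffles (an arbitrary choice), and uses the add-one p-value $(k + 1)/(n + 1)$. The helper name `x2_stat` is hypothetical.

```r
set.seed(2002)  # arbitrary seed, only so the sketch is reproducible

# Rebuild one row per respondent from the published cell counts.
cells <- expand.grid(
  DeathPenalty = c("Favor", "Oppose"),
  Education    = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate"),
  stringsAsFactors = FALSE
)
cells$n <- c(117, 72, 511, 200, 71, 16, 135, 71, 64, 50)
respondents <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                     c("DeathPenalty", "Education")]

# Hypothetical helper: Pearson's X^2 for a pair of categorical vectors.
x2_stat <- function(opinion, education) {
  unname(chisq.test(table(opinion, education))$statistic)
}

# Statistic on the data as observed.
observed_x2 <- x2_stat(respondents$DeathPenalty, respondents$Education)

# Null distribution: shuffle Education, which breaks any link with DeathPenalty.
n_perm  <- 1000
null_x2 <- replicate(
  n_perm,
  x2_stat(respondents$DeathPenalty, sample(respondents$Education))
)

# Add-one p-value: share of shuffles at least as extreme as the observed value.
p_value <- (sum(null_x2 >= observed_x2) + 1) / (n_perm + 1)

print(c(observed_x2 = observed_x2, p_value = p_value))
```

Because the p-value is estimated from a finite number of shuffles, it cannot fall below $1/(n + 1)$. Raising `n_perm` gives a finer estimate at the cost of run time.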