using computer simulation. Based on examples from the infer package. Code for Quiz 13.
Read it in and assign it to hr
Note: col_types = "fddfff" defines the column types factor-double-double-factor-factor-factor
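The compact col_types string can be tried on a small file. This is a minimal sketch, assuming readr's read_csv is the reader; the three-row file, the name hr_demo, and the column order (gender, age, hours, evaluation, salary, status) are hypothetical stand-ins, not the quiz data.

```r
library(readr)

# Hypothetical three-row file standing in for the real data subset;
# the column order below is an assumption.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "gender,age,hours,evaluation,salary,status",
  "female,31.2,49.6,good,level 3,ok",
  "male,45.0,39.2,fair,level 1,fired",
  "female,28.7,63.2,very good,level 5,promoted"
), demo_path)

# One letter per column, left to right: f = factor, d = double
hr_demo <- read_csv(demo_path, col_types = "fddfff")

# Check the class that each column was read as
sapply(hr_demo, class)
```

Spelling the types out this way means the categorical columns arrive as factors, so no conversion step is needed before summarizing or grouping.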
Name | hr |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 4 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
gender | 0 | 1 | FALSE | 2 | fem: 253, mal: 247 |
evaluation | 0 | 1 | FALSE | 4 | bad: 148, fai: 138, goo: 122, ver: 92 |
salary | 0 | 1 | FALSE | 6 | lev: 98, lev: 87, lev: 87, lev: 86 |
status | 0 | 1 | FALSE | 3 | fir: 196, pro: 172, ok: 132 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 39.41 | 11.33 | 20 | 29.9 | 39.35 | 49.1 | 59.9 | ▇▇▇▇▆ |
hours | 0 | 1 | 49.68 | 13.24 | 35 | 38.2 | 45.50 | 58.8 | 79.9 | ▇▃▃▂▂ |
The mean hours worked per week is: 49.7
Response: hours (numeric)
# A tibble: 500 x 1
hours
<dbl>
1 49.6
2 39.2
3 63.2
4 42.2
5 54.7
6 54.3
7 37.3
8 45.6
9 35.1
10 53
# … with 490 more rows
hypothesize that the average hours worked is 48
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500 x 1
hours
<dbl>
1 49.6
2 39.2
3 63.2
4 42.2
5 54.7
6 54.3
7 37.3
8 45.6
9 35.1
10 53
# … with 490 more rows
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500,000 x 2
# Groups: replicate [1,000]
replicate hours
<int> <dbl>
1 1 49.9
2 1 37.0
3 1 33.4
4 1 45.9
5 1 50.4
6 1 36.6
7 1 52.7
8 1 49.6
9 1 52.7
10 1 35.7
# … with 499,990 more rows
The output has 500,000 rows
calculate the distribution of statistics from the generated data
Assign the output to null_t_distribution
Display null_t_distribution
# A tibble: 1,000 x 2
replicate stat
* <int> <dbl>
1 1 1.96
2 2 0.532
3 3 -1.04
4 4 -0.00975
5 5 1.32
6 6 0.177
7 7 0.550
8 8 0.517
9 9 0.492
10 10 -0.821
# … with 990 more rows
null_t_distribution has 1000 t-stats
calculate the statistic from your observed data
Assign the output to observed_t_statistic
Display observed_t_statistic
# A tibble: 1 x 1
stat
<dbl>
1 2.83
# A tibble: 1 x 1
p_value
<dbl>
1 0.008
shade_p_value on the simulated null distribution
Is the p-value < 0.05? (yes)
Does your analysis support the null hypothesis that the true mean number of hours worked was 48? (no)
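The one-sample steps above (specify, hypothesize, generate, calculate, get the p-value, shade) can be sketched as one infer pipeline. This is a sketch under assumptions: hr_demo is simulated data rather than the quiz file, the bootstrap resampling type and the two-sided direction are assumed, and the native pipe needs R 4.1 or later.

```r
library(infer)

set.seed(13)

# Hypothetical stand-in for hr: 500 simulated weekly-hours values (not the quiz data)
hr_demo <- data.frame(hours = rnorm(500, mean = 49.7, sd = 13.2))

# Simulate the null distribution of t statistics under H0: mean hours = 48
null_t_distribution <- hr_demo |>
  specify(response = hours) |>
  hypothesize(null = "point", mu = 48) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "t")

# t statistic computed from the observed (here: simulated) sample
observed_t_statistic <- hr_demo |>
  specify(response = hours) |>
  hypothesize(null = "point", mu = 48) |>
  calculate(stat = "t")

# Share of simulated statistics at least as extreme as the observed one
p_value <- null_t_distribution |>
  get_p_value(obs_stat = observed_t_statistic, direction = "two-sided")

# Plot the null distribution and shade the region counted in the p-value
null_t_distribution |>
  visualize() +
  shade_p_value(obs_stat = observed_t_statistic, direction = "two-sided")
```

A p-value below 0.05 means the observed t statistic falls far out in the tails of the simulated null distribution, which is why the null hypothesis is rejected in that case.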
hr_2_tidy.csv is the name of your data subset
Read it in and assign it to hr_2
Note: col_types = "fddfff" defines the column types factor-double-double-factor-factor-factor
use skim to summarize the data in hr_2 by gender
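A grouped summary of this kind is usually produced by grouping before calling skim. This is a minimal sketch, assuming dplyr and skimr are installed; hr_2_demo is a hypothetical simulated data frame, not hr_2.

```r
library(dplyr)
library(skimr)

set.seed(13)

# Hypothetical stand-in for hr_2 with only the two columns needed here
hr_2_demo <- data.frame(
  gender = factor(rep(c("female", "male"), each = 20)),
  hours  = rnorm(40, mean = 49.4, sd = 13)
)

# Grouping first makes skim report one row per variable per gender
skim_by_gender <- hr_2_demo |>
  group_by(gender) |>
  skim()

skim_by_gender
```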
Name | Piped data |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 2 |
________________________ | |
Group variables | gender |
Variable type: factor
skim_variable | gender | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
evaluation | male | 0 | 1 | FALSE | 4 | bad: 79, fai: 68, goo: 61, ver: 48 |
evaluation | female | 0 | 1 | FALSE | 4 | bad: 75, fai: 74, ver: 48, goo: 47 |
salary | male | 0 | 1 | FALSE | 6 | lev: 49, lev: 48, lev: 48, lev: 44 |
salary | female | 0 | 1 | FALSE | 6 | lev: 47, lev: 46, lev: 41, lev: 39 |
status | male | 0 | 1 | FALSE | 3 | fir: 93, pro: 90, ok: 73 |
status | female | 0 | 1 | FALSE | 3 | fir: 101, pro: 89, ok: 54 |
Variable type: numeric
skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
age | male | 0 | 1 | 38.63 | 11.57 | 20.3 | 28.50 | 37.85 | 49.52 | 59.6 | ▇▇▆▆▆ |
age | female | 0 | 1 | 41.14 | 11.43 | 20.3 | 31.30 | 41.60 | 50.90 | 59.9 | ▆▅▇▇▇ |
hours | male | 0 | 1 | 49.30 | 13.24 | 35.0 | 37.35 | 46.00 | 59.23 | 79.9 | ▇▃▂▂▂ |
hours | female | 0 | 1 | 49.49 | 13.08 | 35.0 | 37.68 | 45.05 | 58.73 | 78.4 | ▇▃▃▂▂ |
Females worked an average of 49.5 hours per week
Males worked an average of 49.3 hours per week
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 500 x 2
hours gender
<dbl> <fct>
1 78.1 male
2 35.1 female
3 36.9 female
4 38.5 male
5 36.1 male
6 78.1 female
7 76 female
8 35.6 female
9 35.6 male
10 56.8 male
# … with 490 more rows
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
hours gender
<dbl> <fct>
1 78.1 male
2 35.1 female
3 36.9 female
4 38.5 male
5 36.1 male
6 78.1 female
7 76 female
8 35.6 female
9 35.6 male
10 56.8 male
# … with 490 more rows
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups: replicate [1,000]
hours gender replicate
<dbl> <fct> <int>
1 60.8 male 1
2 36.4 female 1
3 62.6 female 1
4 61.9 male 1
5 48.2 male 1
6 35.2 female 1
7 64.1 female 1
8 42.9 female 1
9 47.2 male 1
10 65.8 male 1
# … with 499,990 more rows
The output has 500,000 rows
calculate the distribution of statistics from the generated data
Assign the output to null_distribution_2_sample_permute
Display null_distribution_2_sample_permute
# A tibble: 1,000 x 2
replicate stat
* <int> <dbl>
1 1 1.30
2 2 1.56
3 3 1.46
4 4 0.884
5 5 0.145
6 6 -0.684
7 7 -1.20
8 8 -0.378
9 9 0.642
10 10 0.288
# … with 990 more rows
null_distribution_2_sample_permute has 1000 t-stats
visualize the simulated null distribution
calculate the statistic from your observed data
Assign the output to observed_t_2_sample_stat
Display observed_t_2_sample_stat
# A tibble: 1 x 1
stat
<dbl>
1 0.160
# A tibble: 1 x 1
p_value
<dbl>
1 0.862
Is the p-value < 0.05? (no)
Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same? (yes)
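The two-sample steps can be sketched the same way, with a permutation null instead of a bootstrap. This is a sketch under assumptions: hr_2_demo is simulated, and the order c("female", "male") and the two-sided direction are assumed because the source does not show them.

```r
library(infer)

set.seed(13)

# Hypothetical stand-in for hr_2: hours drawn independently of gender
hr_2_demo <- data.frame(
  hours  = rnorm(500, mean = 49.4, sd = 13),
  gender = factor(sample(c("female", "male"), 500, replace = TRUE))
)

# Shuffle gender labels 1000 times and compute a two-sample t statistic each time
null_distribution_2_sample_permute <- hr_2_demo |>
  specify(response = hours, explanatory = gender) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "t", order = c("female", "male"))

# t statistic for the unshuffled data (female mean minus male mean)
observed_t_2_sample_stat <- hr_2_demo |>
  specify(response = hours, explanatory = gender) |>
  calculate(stat = "t", order = c("female", "male"))

# Share of permuted statistics at least as extreme as the observed one
p_value_2_sample <- null_distribution_2_sample_permute |>
  get_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")

p_value_2_sample
```

A large p-value means the observed difference is typical of what shuffled labels produce, so the data give no reason to reject equal means.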
hr_1_tidy.csv is the name of your data subset
Read it in and assign it to hr_anova
Q: Is the average number of hours worked the same for all three statuses (fired, ok, and promoted)?
use skim to summarize the data in hr_anova by status
Name | Piped data |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 2 |
________________________ | |
Group variables | status |
Variable type: factor
skim_variable | status | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
gender | fired | 0 | 1 | FALSE | 2 | fem: 96, mal: 89 |
gender | ok | 0 | 1 | FALSE | 2 | fem: 77, mal: 76 |
gender | promoted | 0 | 1 | FALSE | 2 | fem: 87, mal: 75 |
evaluation | fired | 0 | 1 | FALSE | 4 | bad: 65, fai: 63, goo: 31, ver: 26 |
evaluation | ok | 0 | 1 | FALSE | 4 | bad: 69, fai: 59, goo: 15, ver: 10 |
evaluation | promoted | 0 | 1 | FALSE | 4 | ver: 63, goo: 60, fai: 20, bad: 19 |
salary | fired | 0 | 1 | FALSE | 6 | lev: 41, lev: 37, lev: 32, lev: 32 |
salary | ok | 0 | 1 | FALSE | 6 | lev: 40, lev: 37, lev: 29, lev: 23 |
salary | promoted | 0 | 1 | FALSE | 6 | lev: 37, lev: 35, lev: 29, lev: 23 |
Variable type: numeric
skim_variable | status | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
age | fired | 0 | 1 | 38.64 | 11.43 | 20.2 | 28.30 | 38.30 | 47.60 | 59.6 | ▇▇▇▅▆ |
age | ok | 0 | 1 | 41.34 | 12.11 | 20.3 | 31.00 | 42.10 | 51.70 | 59.9 | ▆▆▆▆▇ |
age | promoted | 0 | 1 | 42.13 | 10.98 | 21.0 | 33.40 | 42.95 | 50.98 | 59.9 | ▆▅▆▇▇ |
hours | fired | 0 | 1 | 41.67 | 7.88 | 35.0 | 36.10 | 38.90 | 43.90 | 75.5 | ▇▂▁▁▁ |
hours | ok | 0 | 1 | 48.05 | 11.65 | 35.0 | 37.70 | 45.60 | 56.10 | 78.2 | ▇▃▃▂▁ |
hours | promoted | 0 | 1 | 59.27 | 12.90 | 35.0 | 51.12 | 60.10 | 70.15 | 79.7 | ▆▅▇▇▇ |
Employees that were fired worked an average of 41.7 hours per week
Employees that were ok worked an average of 48.0 hours per week
Employees that were promoted worked an average of 59.3 hours per week
Use geom_boxplot to plot distributions of hours worked by status
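A minimal sketch of the boxplot step, assuming ggplot2; hr_anova_demo is a hypothetical simulated data frame whose group means only loosely mimic the summary above.

```r
library(ggplot2)

set.seed(13)

# Hypothetical stand-in for hr_anova: 50 simulated employees per status
hr_anova_demo <- data.frame(
  status = factor(rep(c("fired", "ok", "promoted"), each = 50)),
  hours  = c(rnorm(50, 42, 8), rnorm(50, 48, 12), rnorm(50, 59, 13))
)

# One box per status level, hours worked on the y axis
hours_by_status_plot <- ggplot(hr_anova_demo, aes(x = status, y = hours)) +
  geom_boxplot()

hours_by_status_plot
```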
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 500 x 2
hours status
<dbl> <fct>
1 36.5 fired
2 55.8 ok
3 35 fired
4 52 promoted
5 35.1 ok
6 36.3 ok
7 40.1 promoted
8 42.7 fired
9 66.6 promoted
10 35.5 ok
# … with 490 more rows
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
hours status
<dbl> <fct>
1 36.5 fired
2 55.8 ok
3 35 fired
4 52 promoted
5 35.1 ok
6 36.3 ok
7 40.1 promoted
8 42.7 fired
9 66.6 promoted
10 35.5 ok
# … with 490 more rows
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups: replicate [1,000]
hours status replicate
<dbl> <fct> <int>
1 46.2 fired 1
2 65.1 ok 1
3 40 fired 1
4 48 promoted 1
5 56.4 ok 1
6 40.5 ok 1
7 39.6 promoted 1
8 59.7 fired 1
9 56.7 promoted 1
10 35.2 ok 1
# … with 499,990 more rows
The output has 500,000 rows
calculate the distribution of statistics from the generated data
Assign the output to null_distribution_anova
Display null_distribution_anova
# A tibble: 1,000 x 2
replicate stat
* <int> <dbl>
1 1 0.365
2 2 2.30
3 3 0.166
4 4 2.00
5 5 0.496
6 6 0.0308
7 7 1.18
8 8 0.394
9 9 0.0437
10 10 1.23
# … with 990 more rows
null_distribution_anova has 1000 F-stats
calculate the statistic from your observed data
Assign the output to observed_f_sample_stat
Display observed_f_sample_stat
# A tibble: 1 x 1
stat
<dbl>
1 115.
# A tibble: 1 x 1
p_value
<dbl>
1 0
Is the p-value < 0.05? (yes)
Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were "fired", "ok" and "promoted" were the same? (no)
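The ANOVA steps follow the same pattern with an F statistic. This is a sketch under assumptions: hr_anova_demo is simulated, and direction = "greater" is assumed because only large F values count as evidence against equal means.

```r
library(infer)

set.seed(13)

# Hypothetical stand-in for hr_anova with clearly different group means
hr_anova_demo <- data.frame(
  status = factor(rep(c("fired", "ok", "promoted"), each = 50)),
  hours  = c(rnorm(50, 42, 8), rnorm(50, 48, 12), rnorm(50, 59, 13))
)

# Shuffle status labels 1000 times and compute an F statistic each time
null_distribution_anova <- hr_anova_demo |>
  specify(response = hours, explanatory = status) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "F")

# F statistic for the unshuffled data
observed_f_sample_stat <- hr_anova_demo |>
  specify(response = hours, explanatory = status) |>
  calculate(stat = "F")

# Share of permuted F statistics at least as large as the observed one
p_value_anova <- null_distribution_anova |>
  get_p_value(obs_stat = observed_f_sample_stat, direction = "greater")

p_value_anova
```

A reported p-value of 0 means that none of the 1,000 permuted F statistics reached the observed value, so it is better read as "below 1/1000" than as exactly zero.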