Final Exam Review Guide
Format of Exam Questions
- Short answer, multiple choice, short essay, problems, interpret R output.
Symbols
- xi: Value of the ith observation in a univariate dataset
- μ: Population mean
- σ: Population standard deviation
- σ²: Population variance
- x̄: Sample mean
- SD: Sample standard deviation, divide by n
- sx or SD+: Sample standard deviation, divide by n - 1
- Q0, Q1, Q2, Q3, Q4: Sample quartiles
- zx, zx̄, zS: z-scores
- xi: Value of ith independent variable of bivariate dataset
- yi: Value of ith dependent variable of bivariate dataset
- r: Correlation of x and y in a bivariate dataset
- a: Slope of true regression line
- â: Slope of estimated regression line
- b: Intercept of true regression line
- b̂: Intercept of estimated regression line
- RMSE: Root mean square error for a regression model
- H0: The null hypothesis
- H1: The alternative hypothesis
- α: Level of a z-test or t-test; common levels are α = 0.05 and α = 0.01
Formulas
- Interquartile Range: IQR = Q3 - Q1
- Inner fences for boxplot: Q1 - 1.5 × IQR; Q3 + 1.5 × IQR
- Outer fences for boxplot: Q1 - 3.0 × IQR; Q3 + 3.0 × IQR
- z-score for individual observations: z = (x - x̄) / SD+
- Standard error of the average: SEave = SD+ / √n
- z-score for sample average: z = (x̄ - μ) / SEave
- Ideal measurement model: xi = μ + ei
- Linear regression model: yi = axi + b + ei
- Estimated linear regression model: ŷi - ȳ = (r SDy / SDx)(xi - x̄)
- Root mean squared error for regression: RMSE = SDy √(1 - r²)
- Addition Rule: if A and B are disjoint events, P(A ∪ B) = P(A) + P(B).
- Multiplication Rule: if A and B are independent events, P(A ∩ B) = P(A)P(B)
- Probability of at Least One Success in n independent Bernoulli trials: 1 - (1 - p)^n
- Expected Value of a random variable: E(x) = x1P(x1) + ... + xmP(xm)
- Theoretical Variance of a random variable: Var(x) = (x1 - E(x))² P(x1) + ... + (xm - E(x))² P(xm)
- Theoretical SD of a Random Variable: SD(x) = √Var(x)
- Expected Value of Bernoulli Random Variable: E(X) = p
- Variance of Bernoulli Random Variable: Var(X) = p(1 - p)
- Expected Value of a Sum S = x1 + ... + xn: E(S) = nE(x1)
- Variance of a Sum, where the random variables x1, ... , xn
are independent: Var(S) = nVar(x1)
- Standard Error of a Sum, where the random variables x1, ... , xn
are independent: σS = √n σx
- Standard Error of an Average, where the random variables x1, ... , xn
are independent: σave = σx / √n
- Test Statistic for a z-test: z = (x̄ - μ) / SEave,
where SEave = SD+ / √n
- Number of ways of choosing k items from n items: nCk = n! / (k! (n - k)!)
- Probability of k out of n successes, where P(success) = p: nCk p^k (1 - p)^(n-k) (see the R sketch after this list)
- Expected Value of Binomial Random Variable: E(X) = np
- Variance of Binomial Random Variable: Var(X) = np(1 - p)
- Test Statistic for a t-test: t = (x̄ - μ) / SEave,
where SEave = SD+ / √n
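The probability formulas above can be checked numerically in R. Below is a minimal sketch, assuming made-up values of n, k, and p (all hypothetical):

    # Hypothetical values: k = 3 successes out of n = 10 trials, P(success) = 0.4
    n <- 10; k <- 3; p <- 0.4

    # Number of ways of choosing k items from n items: nCk = n!/(k!(n - k)!)
    choose(n, k)                        # same as factorial(n) / (factorial(k) * factorial(n - k))

    # Probability of k out of n successes: nCk p^k (1 - p)^(n - k)
    choose(n, k) * p^k * (1 - p)^(n - k)
    dbinom(k, size = n, prob = p)       # should agree with the line above

    # Probability of at least one success: 1 - (1 - p)^n
    1 - (1 - p)^n

    # Expected value and variance of a binomial random variable: np and np(1 - p)
    n * p
    n * p * (1 - p)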
Persons
- Blaise Pascal, Abraham de Moivre, Jakob Bernoulli, Karl Gauss, Ronald Fisher,
John Tukey, Alexandr Lyapunov, William Gosset (Student)
Definitions
- Controlled experiment, double blind, randomization, observational
study, lurking variables (also called confounding factors),
univariate dataset, histogram, density histogram, bin, mean,
median, variance, parsimonious, stem-plot, boxplot, normal plot, mild outliers,
extreme outliers, normal histogram, ideal measurement model, bias,
center, spread, plot of xi vs. i (unbiased, biased, homoscedastic,
heteroscedastic, standard normal curve), critical point, inflection point,
standard error of the mean, normal scores, normal plot, bivariate dataset,
bivariate normal, correlation, R-squared value, causation, regression line,
residual plot (residuals vs. predicted values, unbiased, biased, homoscedastic,
heteroscedastic), root mean square error, probability, ways to obtain
probabilities (theoretical, frequentist, subjective), fair bet, mutually
exclusive events, the addition rule, independent events, the multiplication
rule, n factorial (n!), 0! = 1, binomial formula, random variable (RV),
probability distribution, expected value of an RV, theoretical SD of an RV, expected value and theoretical SD of a sum, sample mean, SD+, expected
value, and SE for average in ideal measurement model, expected value and SD for
Bernoulli random variables, law of averages (law of large numbers = LLN),
central limit theorem = CLT, normal approximation of the binomial distribution, confidence interval for p,
confidence interval for μ, the 5 steps to a test of hypothesis, null hypothesis,
alternative hypothesis, test statistic, one-sample z-test for average, z-test
for a sum, one-sample t-test, 95% confidence intervals using z and t tables,
p-value, paired sample z- and t-tests, importance vs. significance.
Know How To
- Find the proportion of observations within a given interval of
a normal histogram using the standard normal table.
- Write down and/or discuss the ideal measurement model.
- Compute the standard error of the average and a 95% confidence
interval for the true measurement in the ideal measurement model.
- Draw the boxplot or interpret an R boxplot. Use it to detect outliers.
- Given an x-value, x̄, and SDx, compute the z-score.
- Use the normal table to determine the proportion of observations in
a bin of the form [a, b], (-∞, a], or [a, ∞).
- Given a number p between 1 and 100, find the pth percentile using
a normal table: work backwards by looking up the proportion p/100
in the body of the table to find the corresponding z-score, then use
x = μ + z σ, if necessary.
- Use R to compute areas under the normal curve with pnorm.
- Use R to compute percentiles of the normal curve with qnorm.
- Use R to generate normally distributed random outcomes with rnorm (see the R sketch after this list).
- Find normal scores by hand and using qqnorm.
- Interpret a normal plot (normal, skewed to the left or right,
thin tails, thick tails).
- Compute a regression equation using the formula
ŷ - ȳ = (r SDy / SDx)(x - x̄)
- Given a regression equation and an x value, find the predicted y value.
- Assess whether a regression model is adequate using residual and
normal plots.
- Interpret a residual plot (unbiased, biased, homoscedastic, heteroscedastic).
- Compute the root mean squared error using this formula:
RMSE = SDy √(1 - r²)
- Calculate probabilities using the addition rule, the multiplication rule,
and the binomial formula.
- Use the formula 1 - (1 - p)^n to calculate the
probability of at least one success out of n Bernoulli trials.
- Compute the expected value and SE for the sum of random variables.
- Compute confidence intervals for the sum of Bernoulli random variables.
- Compute the confidence interval for μ in the ideal measurement model.
- Perform the 5 steps of a test of hypothesis, either by hand or using R output:
- Write down the null and alternative hypotheses for a z-test or t-test.
- Compute the test statistic, assuming that the null hypothesis is true.
- Write down a 95% confidence interval: [-1.96, 1.96] for a
z-test. For simplicity, sometimes we use [-2, 2] for a 95% confidence interval.
Use the standard normal table to get confidence intervals at other levels.
- Decide whether to accept or reject the null hypothesis.
- Compute the p-value. Accept the null hypothesis if p ≥ 0.05 and
reject the null hypothesis if p < 0.05.
- Compute confidence intervals using the normal table and using the
t-table. Look at the bottom of the t-table to get confidence intervals
easily.
- Perform these tests of hypothesis by hand: one sample z-test for μ,
one sample z-test for S, paired-sample z-test.
- Perform these tests by hand and/or using R output: one sample t-test,
paired-sample t-test.
- Use R to obtain the p-value for a z-test or t-test.
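The R-based items above can be practiced together. Here is a minimal sketch, assuming a small made-up sample and a hypothesized value μ = 10 (both hypothetical):

    # Hypothetical sample of n = 8 measurements
    x <- c(9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4)
    n <- length(x)

    # Areas and percentiles under the standard normal curve
    pnorm(1.96)                  # proportion of the area below z = 1.96
    pnorm(1) - pnorm(-1)         # proportion in the bin [-1, 1]
    qnorm(0.975)                 # z-score with 97.5% of the area below it
    rnorm(5)                     # five random draws from the standard normal
    qqnorm(x)                    # normal plot of the sample

    # Standard error of the average and a 95% confidence interval for mu
    se <- sd(x) / sqrt(n)        # sd() divides by n - 1, i.e. SD+
    mean(x) + c(-1.96, 1.96) * se

    # Test statistic and p-value for a one-sample z-test of H0: mu = 10
    z <- (mean(x) - 10) / se
    2 * pnorm(-abs(z))           # two-sided p-value

    # One-sample t-test of H0: mu = 10 (test statistic, p-value, and CI from R)
    t.test(x, mu = 10)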
R Functions
c sum mean sd cor plot boxplot
dnorm pnorm qnorm qqnorm read.csv
data.frame lm summary resid predict
dbinom pbinom qbinom rbinom t.test
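As one way to see several of these functions in context, here is a minimal regression sketch on a small made-up bivariate dataset (in practice the data frame might come from read.csv; the values below are hypothetical):

    # Hypothetical bivariate data
    dat <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                      y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2))

    mean(dat$x); sd(dat$x)            # sample mean and SD+ of x
    cor(dat$x, dat$y)                 # correlation r
    plot(dat$x, dat$y)                # scatterplot of y vs. x
    boxplot(dat$y)                    # boxplot of y

    fit <- lm(y ~ x, data = dat)      # estimated regression line
    summary(fit)                      # slope, intercept, R-squared
    plot(predict(fit), resid(fit))    # residual plot: residuals vs. predicted values
    predict(fit, data.frame(x = 3.5)) # predicted y for a new x value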
Explain
- Be able to explain in terms that someone not very familiar with
statistics will understand:
- What do the sample mean and SD tell you about a dataset?
- What does a histogram tell you, and what must you watch out for
if the bin widths are not all equal?
- What is the ideal measurement model?
- What is correlation and how does it relate to causation?
- Why is correlation not always the same as causation?
- What is a regression equation? What is required to have a good
linear regression model?
- How are probabilities determined?
- What is the Law of Large Numbers (Law of Averages) and how is it commonly misstated?
- Why is the Central Limit Theorem important in statistics?
- What does it mean for a result to be statistically significant?
Why is statistical significance not the same as importance?
- Explain what a test of hypothesis is and some things to watch out for
when using one.
- What is a paired-sample t-test?