- What do the terms inner fence and outer fence mean for a boxplot?
Ans: The inner fences are the locations 1.5 IQRs below Q1 and 1.5 IQRs above Q3.
The outer fences are the locations 3.0 IQRs below Q1 and 3.0 IQRs above Q3.
Points outside of the outer fences are called extreme outliers;
points between an inner fence and an outer fence are called mild outliers.
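The fence definitions above can be sketched in R as follows. The sample values are made up for illustration, and R's default quartile rule (type 7 in `quantile`) is assumed; a textbook that computes quartiles differently may give slightly different fences.

```r
# Hypothetical sample; the values are invented for this illustration.
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 25, 60)

q1  <- unname(quantile(x, 0.25))   # first quartile (R default, type 7)
q3  <- unname(quantile(x, 0.75))   # third quartile
iqr <- q3 - q1                     # interquartile range

# Inner fences: 1.5 IQRs beyond the quartiles; outer fences: 3.0 IQRs.
inner <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
outer <- c(lower = q1 - 3.0 * iqr, upper = q3 + 3.0 * iqr)

outside_inner <- x < inner["lower"] | x > inner["upper"]
outside_outer <- x < outer["lower"] | x > outer["upper"]

extreme <- x[outside_outer]                    # beyond an outer fence
mild    <- x[outside_inner & !outside_outer]   # between inner and outer fence

print(inner)
print(outer)
print(list(mild = mild, extreme = extreme))
```

With this sample, 25 falls between the upper inner and upper outer fences (a mild outlier) and 60 falls beyond the upper outer fence (an extreme outlier).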
- What do these expressions mean?
Ans: They mean that the residuals are unbiased and homoscedastic, respectively.
- What are some other names for a regression equation?
Ans: Least squares estimator and, for a simple linear regression equation,
the line of averages.
- Compute the normal scores for a dataset when n = 9.
Ans: Look up the areas
0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
inside the normal table to find these corresponding z-scores:
-1.28, -0.84, -0.52, -0.25, 0.00, 0.25, 0.52, 0.84, 1.28
You can also accomplish this with these R and SAS scripts:
* R Script:
print(qnorm((1:9) / 10))  # z-scores at areas 0.1, 0.2, ..., 0.9
* SAS Script:
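data normal_scores;
  do i = 1 to 9;
    z = quantile("normal", i / 10);  /* z-score at area i / 10 */
    output;
  end;
run;
proc print data=normal_scores;  /* display the nine normal scores */
run;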
- What is the hat matrix? What does it have to do with influence points?
Ans:
- What does AIC mean?
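Ans: It is Akaike's Information Criterion, defined by
AIC = -2(log-likelihood) + 2p
where p is the number of estimated parameters, or, for a least squares regression
model (up to an additive constant that depends only on n),
AIC = n log(SSE / n) + 2p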
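As a rough check of the two forms, the sketch below computes both for a simple linear regression. The built-in `cars` data set is only a convenient stand-in and is not part of these notes. Note that R's `logLik` for an `lm` fit also counts the error variance as an estimated parameter, so the parameter count in the first form is one larger than the number of regression coefficients.

```r
# Illustration on R's built-in `cars` data (an assumption, not from the notes).
fit <- lm(dist ~ speed, data = cars)

n   <- nrow(cars)
p   <- length(coef(fit))        # regression coefficients, including intercept
sse <- sum(residuals(fit)^2)    # sum of squared errors

# Second form: n log(SSE / n) + 2p
aic_sse <- n * log(sse / n) + 2 * p

# First form: -2(log-likelihood) + 2(number of estimated parameters)
ll     <- logLik(fit)
aic_ll <- -2 * as.numeric(ll) + 2 * attr(ll, "df")

# Compare with R's own functions for each form.
print(c(manual = aic_sse, extractAIC = extractAIC(fit)[2]))
print(c(manual = aic_ll,  AIC        = AIC(fit)))
```

The two forms give different numbers, but for models fit to the same n observations the difference does not depend on the model, so either form ranks candidate models the same way.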
- What is a quadratic regression model?
Ans: It is a model that includes a second-degree term $x_i^2$ in the model.
- For the Pendul Example, why is a square root transform better than adding a
quadratic term?
Ans: Because the predicted values of the quadratic model form a parabola, which would eventually
descend after increasing. A pendulum does not do this: the period of the pendulum continues to
increase as its length increases. This is still true when using the square root transformation.
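The behavior described in this answer can be sketched with simulated data. The Pendul data set itself is not reproduced here; the lengths, the noise level, and the use of the physical law T = 2π sqrt(L / g) to generate periods are all assumptions made for illustration.

```r
# Simulated pendulum data (not the Pendul data set from the example).
set.seed(1)
g      <- 9.81                            # gravitational acceleration, m/s^2
len    <- seq(0.1, 2.0, by = 0.1)         # pendulum lengths in metres
period <- 2 * pi * sqrt(len / g) + rnorm(length(len), sd = 0.01)

quad_fit <- lm(period ~ len + I(len^2))   # model with a quadratic term
sqrt_fit <- lm(period ~ sqrt(len))        # square root transform of length

# The fitted parabola opens downward, so it turns over at its vertex.
b      <- coef(quad_fit)
vertex <- unname(-b["len"] / (2 * b["I(len^2)"]))

# Predictions at and beyond the observed range of lengths.
new_len <- data.frame(len = c(2, 5, 10, 20))
pred <- cbind(new_len,
              quadratic = predict(quad_fit, new_len),
              sqrt      = predict(sqrt_fit, new_len))
print(vertex)
print(pred)
```

Beyond the vertex the quadratic predictions fall as length grows, while the square root model's predictions keep increasing.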
- For the BodyBrain Example, use the log-log model to find the predicted brain weight if the
body weight is 100 kg. Here is the predicted value for log_brain:
log_brain = 2.268 + 0.536 * log(100) = 4.73
Ans: Find the predicted value using the log-log model:
log(brain) = 2.268 + 0.536*log(100) = 4.73
Then take the exponential to find the actual brain weight: brain = exp(log(brain)) = exp(4.73) = 113.
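The same back-transformation can be done in R, where `log` is the natural logarithm. The coefficients are the ones quoted above; the variable names are chosen here for illustration.

```r
# Coefficients of the log-log model quoted in the question.
b0 <- 2.268
b1 <- 0.536
body_kg <- 100

log_brain <- b0 + b1 * log(body_kg)   # predicted log brain weight (natural log)
brain     <- exp(log_brain)           # back-transform to the original scale

print(round(c(log_brain = log_brain, brain = brain), 2))
```

Carrying the unrounded value of log_brain through the exponential gives about 114 rather than 113; the small difference comes from rounding the intermediate value to 4.73 before exponentiating.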
- What is the Bernoulli distribution?
Use R to generate 100 random Bernoulli outcomes with p = 0.5. Ans:
rbinom(n=100, size=1, prob=0.5)
[1] 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 0 1
[26] 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 1 1 0 1 1 1 0
[51] 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1
[76] 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0 0 0 1 1 0 1 1
- A maximum likelihood estimator for a parameter is the value of the parameter that
maximizes the probability (or the log of the probability) that the obtained sample occurs. The Bernoulli probability
function is
$$\mathrm{Prob}(x) = p^{x} (1-p)^{1-x}, \quad \text{where } x = 0 \text{ or } 1.$$
The joint density for a sample of n independent Bernoulli random variables is
$$\mathrm{Prob}(x_1, \ldots, x_n) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}$$
Find the value of p that maximizes log Prob(x1, ... , xn)
by differentiating the following expression with respect to p, setting the result equal to zero, and solving for p:
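$$
\begin{aligned}
\log \mathrm{Prob}(x_1, \ldots, x_n)
  &= \log \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} \\
  &= \sum_{i=1}^{n} \log\left( p^{x_i} (1-p)^{1-x_i} \right) \\
  &= \sum_{i=1}^{n} x_i \log p + \sum_{i=1}^{n} (1 - x_i) \log(1-p) \\
  &= (\log p) \sum_{i=1}^{n} x_i + \log(1 - p) \sum_{i=1}^{n} (1 - x_i)
\end{aligned}
$$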
Ans:
Recall that the derivative of log(p) with respect to p is 1 / p. Also, let
S be the number of successes, so that the sum of the x values is S and the sum of the
(1 - x) values is n - S. We set the derivative of the preceding
expression with respect to p to zero:
$$\frac{1}{p} \sum_{i=1}^{n} x_i + \frac{1}{1 - p}\,(-1) \sum_{i=1}^{n} (1 - x_i) = 0$$
(1 / p) S - (1 / (1 - p)) (n - S) = 0
(1 / p) S = (1 / (1 - p)) (n - S)
(1 - p) S = p (n - S)
S - pS = pn - pS
S = pn
p = S / n,
which is the usual way that we estimate the true probability of success p.
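As a numerical sanity check of p = S / n, the sketch below maximizes the Bernoulli log-likelihood directly with `optimize` and compares the result with the closed form. The simulated sample, the seed, and the search interval are arbitrary choices for illustration.

```r
# Simulated Bernoulli sample (arbitrary seed and p, for illustration only).
set.seed(42)
x <- rbinom(n = 100, size = 1, prob = 0.5)

# Log-likelihood: (log p) * sum(x_i) + log(1 - p) * sum(1 - x_i)
loglik <- function(p) sum(x) * log(p) + sum(1 - x) * log(1 - p)

p_closed  <- sum(x) / length(x)     # closed-form MLE, S / n
p_numeric <- optimize(loglik, interval = c(0.001, 0.999),
                      maximum = TRUE)$maximum

print(c(closed_form = p_closed, numeric = p_numeric))
```

The two values should agree up to the tolerance of the numerical optimizer.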
- How does generalized regression differ from ordinary regression?
Ans: Ordinary regression assumes that each $y_i$ is a random variable with distribution $N(E(y_i), \sigma^2)$.
With generalized regression, each dependent variable value is determined by a distribution that might not be normal
(for example, Bernoulli or Poisson). It also uses a link function, which is the identity function for
ordinary regression.
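A minimal sketch of this relationship in R, assuming the built-in `cars` data for the normal case and simulated data for the Bernoulli case (neither is from these notes): `glm` with a normal family and identity link should reproduce the `lm` fit, and changing the family and link gives logistic regression.

```r
# 1. Ordinary regression as a special case: normal response, identity link.
ols     <- lm(dist ~ speed, data = cars)
glm_ols <- glm(dist ~ speed, family = gaussian(link = "identity"),
               data = cars)
same_fit <- all.equal(coef(ols), coef(glm_ols))
print(same_fit)

# 2. Bernoulli response with the logit link (logistic regression).
#    The data are simulated with an arbitrary intercept and slope.
set.seed(7)
x <- seq(-2, 2, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(0.5 + 1.5 * x))
logistic <- glm(y ~ x, family = binomial(link = "logit"))
print(coef(logistic))
```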
- What is a link function?
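Ans: It is a function that ties the independent variables to the dependent variable: it maps the
expected value of the dependent variable to the linear combination of the independent variables.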
Logistic regression uses the logit function: logit(s) = log(s / (1 - s))
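The logit and its inverse can be written directly in R. The helper names `logit` and `inv_logit` are chosen here for illustration; base R provides the same functions as `qlogis` and `plogis`.

```r
# Hypothetical helper names; base R equivalents are qlogis() and plogis().
logit     <- function(s) log(s / (1 - s))        # probability -> log-odds
inv_logit <- function(eta) 1 / (1 + exp(-eta))   # log-odds -> probability

s   <- c(0.1, 0.5, 0.9)
eta <- logit(s)

print(eta)
print(all.equal(eta, qlogis(s)))         # agrees with base R's logit
print(all.equal(inv_logit(eta), s))      # inverse recovers the probabilities
```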