CSC 423 -- 7/17/17

Course Documents

 

Collect Micrometer Paper Thickness Measurements

 

Course Topics

 

Review Questions

  1. Give examples of continuous variables and categorical variables.
    Ans: A continuous variable can take on any value in a range (subject to roundoff error). Examples are height, stock price, voltage, IQ, and paper thickness.
     
    A categorical variable (also called a nominal or discrete variable) is often not a number. A categorical variable can only take on a small number of discrete values. Examples of categorical variables are gender, employment status, dog breed, country of origin.
     
    A third type of variable is an ordinal variable, which is a compromise between continuous and categorical. Its values can be ordered but are not continuous. Examples are year in school (freshman, sophomore, junior, senior), military rank, and letter grade for a course.
     
  2. Explain the difference between a population and a sample.
    Ans: A population is the entire set of entities from which inferences are to be drawn. Examples of populations are adult women in the U.S., insurance companies in California, heights of fifth grade boys in California, SAT scores in Illinois, lifespans of Golden Retriever dogs, and fish in Lake Michigan. A sample is a small subset of the population and can be used to perform a statistical test. For valid statistical inferences, the sample must be a simple random sample, which means that every entity in the population is equally likely to be included in the sample.
     
    With the advent of big data, the distinction between a population and a sample is blurred; it is now not uncommon for a company to collect data for an entire population (for example, the customers who shop at Walmart). However, even if data for an entire population is collected, it is still common to extract a random sample from this data when studying some aspect of the entire population would be too time-consuming or costly.
     
  3. What is the name of each of these Greek letters? What do they represent in statistics?
     
      α    β    ε    μ    ρ    σ    θ    χ

    Ans:
      α: alpha, is the size of a statistical test, that is, the probability of a Type I error (rejecting the null hypothesis when it is true). The most commonly used value for α is 0.05.
      β: beta, represents parameters of a regression equation.
      ε: epsilon, represents the random errors or residuals in a regression model.
      μ: mu, is the population mean. If the normal distribution is used, μ is the center of the normal density.
      ρ: rho, is the population correlation.
      σ: sigma, is the population standard deviation. σ^2 is the population variance.
      θ: theta, is an unspecified parameter in a probability density. Used for theoretical discussions and definitions.
      χ: chi, denotes the χ^2 (chi-squared) distribution.

  4. What is a random variable?
    Ans: The formal definition of a random variable X is a function from the sample space Ω into the set of real numbers:
     
      X: Ω → R

    I prefer a more informal definition: a random variable is a process for choosing a random number.
     
  5. Rewrite the expression x_1 + x_2 + ... + x_n using summation notation.
    Ans: Σ_{i=1}^{n} x_i
     
  6. Write the definitions of the sample mean x̄ and the sample variance s_x^2 in summation notation.
    Ans: x̄ = (1/n) Σ_{i=1}^{n} x_i   and   s_x^2 = [1/(n-1)] Σ_{i=1}^{n} (x_i - x̄)^2
    Notes: in the definition of s^2, the sum of squared deviations is divided by n-1 instead of n so that the sample variance s^2 is an unbiased estimator of the population variance σ^2:
     
      E(s^2) = σ^2

     
  7. Find the calculus derivative of each of these expressions:
     
      7x^2     4x     19     4(3 - 5x)^2

    Ans: Use the formulas (d/dx)(x^n) = n x^(n-1) and (d/dx)(cy) = c (d/dx)(y).
    (d/dx)(7x^2) = 7(d/dx)(x^2) = 7(2x^1) = 14x.
    (d/dx)(4x) = 4(d/dx)(x^1) = 4(1x^0) = 4.
    (d/dx)(19) = 19(d/dx)(x^0) = 19(0x^(-1)) = 0.
    To compute the last derivative, we need the chain rule: dy/dx = (dy/du)(du/dx).
    Let y = 4(3 - 5x)^2 and u = 3 - 5x.
    (d/dx)(4(3 - 5x)^2) = (dy/du)(du/dx) = (d/du)(4u^2) (d/dx)(3 - 5x) = 8u(-5) = 8(3 - 5x)(-5) = -40(3 - 5x).
     
  8. (Needed for Project 4) Compute the partial derivatives ∂y/∂x_1 and ∂y/∂x_2 for this expression:
     
      y = 3x_1^2 + 5x_1x_2 - x_2^2 + 7x_1 + 8x_2 + 35

    Ans: To compute the partial derivative ∂y/∂x_1, x_1 is the variable and x_2 is treated as a constant. For ∂y/∂x_2 the roles are reversed.
    ∂y/∂x_1 = 3(2x_1) + 5(1x_2) - 0 + 7(1) + 8(0) + 0 = 6x_1 + 5x_2 + 7
    ∂y/∂x_2 = 3(0) + 5x_1(1) - 2x_2 + 7(0) + 8(1) + 0 = 5x_1 - 2x_2 + 8
     
  9. Sketch the graphs of these functions:
     
      y = x^2      y = √x      y = e^x    y = ln x

    Here are these graphs drawn by R.
     
  10. Define these terms:
     
      discrete random variable     continuous random variable     continuous probability density     normal distribution

    Ans:
    Discrete Random Variable: A random variable that can take on only a finite or countably infinite set of numeric values.
    Continuous Random Variable: A random variable that can take on any value in an interval, possibly infinite like this: (-∞, ∞).
    Continuous Probability Density: A nonnegative function that is used to obtain probabilities for continuous random variables. To find P(a ≤ X ≤ b), compute the area between a and b under the probability density. If the density has a closed-form antiderivative, use calculus integration to find the area under the curve. If it does not (as with the normal density), statistical tables or software such as SAS or R are needed to compute this area. See the Random Variable Properties and Technical Details for more explanation.
    Normal Distribution: The well-known bell-shaped curve. It has the probability density
    φ(x) = (1 / (σ√(2π))) exp(-(x - μ)^2 / (2σ^2))
     
  11. What is the Central Limit Theorem (CLT) and why is it important in statistics?
    Ans: The CLT states that even if n independent observations from a population do not have a normal distribution, the sample mean of the observations is approximately normally distributed if n is large. Usually we consider n to be large if n > 30, but this depends on the original distribution. If the observations are Bernoulli, with p close to 0 or 1, then n must be substantially larger than 30 for the CLT to apply.
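
One way to see the Central Limit Theorem at work is a small simulation. The sketch below is not part of the course materials: it uses only Python's standard library, and the function names `sample_means` and `skewness`, the seed, and the choices n = 30 and p = 0.5 or 0.02 are illustrative assumptions. It draws many samples of Bernoulli observations, computes each sample mean, and reports the skewness of those means, which should be near 0 when the normal approximation is reasonable.

```python
import random
import statistics


def sample_means(n, reps, p=0.5, seed=0):
    """Return `reps` sample means, each computed from n Bernoulli(p) draws."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


def skewness(values):
    """Sample skewness: near 0 for a symmetric, roughly normal shape."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Compare a symmetric case (p = 0.5) with a lopsided one (p = 0.02) at n = 30.
for p in (0.5, 0.02):
    means = sample_means(n=30, reps=5000, p=p, seed=1)
    print(p, round(statistics.fmean(means), 4), round(skewness(means), 3))
```

With p = 0.5 the skewness of the sample means should come out close to 0. With p = 0.02 it should stay clearly positive at n = 30, which is the case where n must be substantially larger than 30 for the CLT to apply.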

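The summation definitions of the sample mean and sample variance from the review questions translate directly into code. The following is a minimal sketch in Python (standard library only). The data values, the seed, and the helper names `sample_mean` and `sample_variance` are made up for illustration. The second half is a rough simulation, assuming normal data with σ = 2, of why dividing by n-1 rather than n gives an unbiased estimate of σ^2.

```python
import random
import statistics


def sample_mean(xs):
    """(1/n) * sum of the x_i."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """[1/(n-1)] * sum of squared deviations from the sample mean."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # made-up measurements
print(sample_mean(data), sample_variance(data))
print(statistics.variance(data))  # library version, also divides by n-1

# Simulation: average s^2 over many small samples from a normal
# population with sigma = 2, so the population variance is 4.
rng = random.Random(0)
sigma, n, reps = 2.0, 5, 20000
s2_values = [
    sample_variance([rng.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(reps)
]
avg_s2 = statistics.fmean(s2_values)
print(round(avg_s2, 2))                # should be close to 4
print(round(avg_s2 * (n - 1) / n, 2))  # dividing by n instead: close to 3.2
```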
 
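Derivatives like the ones in the review questions can be checked numerically. The sketch below is an illustration, not part of the course materials: the helper name `derivative`, the step size h, and the evaluation points are arbitrary assumptions. It compares central-difference estimates with the closed forms obtained from the power rule, the chain rule, and partial differentiation.

```python
def derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def y(x1, x2):
    """The two-variable expression from the partial-derivative question."""
    return 3 * x1**2 + 5 * x1 * x2 - x2**2 + 7 * x1 + 8 * x2 + 35


# Chain-rule example: d/dx 4(3 - 5x)^2 = -40(3 - 5x)
x = 2.0
print(derivative(lambda t: 4 * (3 - 5 * t) ** 2, x), -40 * (3 - 5 * x))

# Partial derivatives: vary one argument and hold the other fixed.
x1, x2 = 1.5, -2.0
print(derivative(lambda t: y(t, x2), x1), 6 * x1 + 5 * x2 + 7)
print(derivative(lambda t: y(x1, t), x2), 5 * x1 - 2 * x2 + 8)
```

Each printed pair should agree to several decimal places. A mismatch in a pair is a quick signal that a term was differentiated incorrectly.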

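The idea that P(a ≤ X ≤ b) is the area under the probability density between a and b can be illustrated for the normal distribution. This is a sketch in Python rather than SAS or R. The helper names `normal_pdf` and `area`, the midpoint rule, and the values μ = 100 and σ = 15 are assumptions chosen for illustration. It approximates the area numerically and compares it with the value from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: (1 / (sigma*sqrt(2*pi))) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))


def area(f, a, b, steps=10000):
    """Midpoint-rule approximation of the area under f between a and b."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


mu, sigma = 100.0, 15.0  # illustrative values only
p_numeric = area(lambda x: normal_pdf(x, mu, sigma), 85.0, 115.0)

dist = NormalDist(mu, sigma)
p_library = dist.cdf(115.0) - dist.cdf(85.0)

# Probability of falling within one standard deviation of the mean.
print(round(p_numeric, 4), round(p_library, 4))
```

Both numbers should be about 0.6827, the familiar probability of landing within one standard deviation of the mean.
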
Confidence Intervals for the Population Mean (μ)

 

The Standard Normal Density

 

Introduction to SAS and R

 

The NIST Example

 

Normal Plots

 

The Boxplot

 

Projects

 

Tests of Hypothesis

These last three sections on hypothesis tests will be discussed next time on Wednesday, July 19.

 

Some Terminology

 

The Five Steps of a Hypothesis Test

  1. State the null hypothesis H0 and the alternative hypothesis H1.
     
  2. Compute the test statistic T.
     
  3. Compute a (1-α)100% confidence interval I for the test statistic T, where α is the level of the hypothesis test.
     
  4. If T ∈ I, accept H0 (and reject H1); if T ∉ I, reject H0 (and accept H1).
     
  5. Compute the p-value for T. In most cases p must be computed using statistical software.
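
The five steps above can be sketched in code. The example below is only an illustration under stated assumptions: it is written in Python rather than SAS or R, it uses a two-sided large-sample z-test (the standard normal in place of the t distribution, because the normal is available in the standard library), and the function name `z_test` and the data are made up.

```python
import math
from statistics import NormalDist, fmean, stdev


def z_test(xs, mu0, alpha=0.05):
    """Two-sided large-sample z-test of H0: mu = mu0 against H1: mu != mu0."""
    n = len(xs)
    # Step 2: test statistic.
    z = (fmean(xs) - mu0) / (stdev(xs) / math.sqrt(n))
    # Step 3: (1 - alpha)100% interval for the test statistic when H0 is true.
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    interval = (-crit, crit)
    # Step 4: accept H0 if the statistic falls inside the interval.
    accept_h0 = interval[0] <= z <= interval[1]
    # Step 5: two-sided p-value.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, interval, accept_h0, p_value


# Step 1: H0: mu = 2 versus H1: mu != 2, with made-up data (n = 40).
xs = [1.0, 2.0, 3.0, 4.0, 5.0] * 8
z, interval, accept_h0, p_value = z_test(xs, mu0=2.0)
print(round(z, 3), interval, accept_h0, p_value)
```

The decision in step 4 and the p-value in step 5 agree: the statistic falls outside the interval exactly when the p-value is below α.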

 

T-tests of Hypothesis for One or Two Groups