CSC 423 -- 7/24/17
Review Exercises
- What does a 99% confidence interval for the population mean tell you? (Also
see Problems 2 and 6.) Find a 99% confidence interval for these IQ scores: 103, 110, 104, 106.
Ans:
No matter how many times you construct a confidence interval for the population
mean by choosing a random sample, the population mean μ is fixed. It is the
confidence interval that will change from sample to sample, and if the
population is normally distributed, the 99% confidence interval will contain the
true population mean μ 99% of the time.
To construct the 99% confidence interval for the true mean μ,
you need the sample mean x̄ and the standard error of the mean SEmean
(computed from the sample standard deviation and the sample size). Because n <
30, we use the t-distribution instead of the normal distribution for
constructing the confidence interval. (My advice is to always use the
t-distribution, even when n ≥ 30.)
x̄ = 105.75, sx = 3.096, SEmean = sx / √n = 3.096 / √4 = 1.548
Look up the 99% critical value (alpha = 0.01, which puts 0.005 in each tail) in the t-table in these Statistical Tables using n - 1 = 3 degrees of freedom: (-5.841, 5.841). Then
-5.841 ≤ t ≤ 5.841
-5.841 ≤ (x̄ - μ) / SEmean ≤ 5.841
-5.841 ≤ (105.75 - μ) / 1.548 ≤ 5.841
-9.04 ≤ 105.75 - μ ≤ 9.04
-114.79 ≤ -μ ≤ -96.71
96.71 ≤ μ ≤ 114.79,
so the 99% confidence interval for μ is (96.71, 114.79).
You can also obtain the t-table value from R like this:
val = qt(0.995, 3)
and from SAS in a data step like this:
val = quantile("t", 0.995, 3);
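Here is a small R sketch that reproduces the hand calculation above, using the same four IQ scores:
# R sketch: 99% confidence interval from the summary statistics
scores = c(103, 110, 104, 106)
n = length(scores)
xbar = mean(scores)                        # 105.75
se = sd(scores) / sqrt(n)                  # 1.548
tcrit = qt(0.995, n - 1)                   # 5.841
c(xbar - tcrit * se, xbar + tcrit * se)    # 96.70916 114.79084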
- Use SAS and R to obtain a 99% confidence interval for the population mean of
the dataset in Exercise 1. Ans:
* SAS script;
* The @@ at the end of the input statement holds the data line so that
  additional observations can be read from it. Without the @@, only the
  first observation would be read.;
data iq_scores;
input score @@;
datalines;
103 110 104 106
;
proc means data=iq_scores clm alpha=0.01;
run;
SAS Output:
Lower 99% Upper 99%
CL for Mean CL for Mean
------------------------
96.7091604 114.7908396
------------------------
# R script
x = c(103, 110, 104, 106)
t.test(x, NULL, conf.level=0.99)
R output:
One Sample t-test
data: x
t = 68.321, df = 3, p-value = 6.91e-06
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
96.70916 114.79084
sample estimates:
mean of x
105.75
- Which statistics can the SAS proc means compute? Which R functions compute these
same statistics?
Ans: Recall that the SAS proc means statement without any options computes the simple descriptive statistics
sample size, mean, standard deviation, minimum, and maximum. However,
proc means can also compute many other statistics if
requested as options. Here are the proc means options:
SAS proc means option | Meaning | R Function |
n | Sample size | length |
mean | Sample mean | mean |
std | Sample standard deviation | sd |
stderr | Standard error of the mean | |
min | Minimum value of the sample | min |
max | Maximum value of the sample | max |
skewness | Measures the skewness of the sample; a positive value means skewed to the right, a negative value means skewed to the left | skewness, moments package |
kurtosis | Measures the thickness or thinness of the tails of the sample; a positive value means thick tails relative to a normal distribution, a negative value means thin tails relative to a normal distribution | kurtosis, moments package |
lclm | Lower confidence limit for the mean | t.test |
uclm | Upper confidence limit for the mean | t.test |
clm | Both lclm and uclm | t.test |
p1 p5 p10 p25 p50 p75 p90 p95 p99 | These percentiles | quantile |
q1 | 25th percentile | quantile |
q3 | 75th percentile | quantile |
qrange | Interquartile range (q3 - q1) | IQR |
t | Test statistic for the one-sample t-test with H0: μ = 0 | t.test |
prt | Two-sided p-value for the one-sample t-test with H0: μ = 0 | t.test |
To obtain confidence intervals (clm, lclm, and/or uclm) other than 95%, specify
the desired value with the alpha= option (for example, alpha=0.01 gives 99% limits).
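Here is a small R sketch of several of these counterparts, using the four IQ scores from the earlier exercise:
# R sketch: R counterparts of some proc means statistics
x = c(103, 110, 104, 106)
length(x); mean(x); sd(x); min(x); max(x)     # n, mean, std, min, max
sd(x) / sqrt(length(x))                       # stderr
quantile(x, probs = c(0.25, 0.50, 0.75))      # q1, p50, q3
IQR(x)                                        # qrange
t.test(x, conf.level = 0.99)$conf.int         # lclm and uclm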
- What are the assumptions for a one-sample z-test? For the one-sample t-test?
- What is the definition of the p-value for a statistical test?
- To test whether eating fish increases intelligence, a researcher selects a random sample of
four persons and
puts them on a fish-rich diet for one year. At the end of the year, she
gives each subject an intelligence test. Here are the results:
- Test the null hypothesis that eating fish does not make a difference using a t-test. Ans:
Step 1: State the null and alternative hypotheses: H0: μ = 100, H1: μ ≠ 100.
Step 2: Compute the test statistic: t = (x̄ - μ0) / SEmean = (105.75 - 100) / (3.096 / √4) = 3.715.
Step 3: Find the 95% critical values from the t-table with n - 1 = 4 - 1 = 3 degrees of freedom; the acceptance region is I = [-3.182, 3.182].
Step 4: 3.715 ∉ [-3.182, 3.182], so reject H0.
Step 5: Use SAS or R to find the p-value: 0.034. Since p < 0.05, this confirms that H0 should be rejected.
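Here is a small R sketch of Step 5, assuming the test results are the same four IQ scores used earlier (103, 110, 104, 106), which match the sample mean and standard error above:
# R sketch: one-sample t-test of H0: mu = 100
x = c(103, 110, 104, 106)     # assumed to be the test results
t.test(x, mu = 100)           # t = 3.7145, df = 3, p-value = 0.034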
- Suggest a way to improve the design of the experiment.
Ans: (1) choose a larger sample size, (2) use a treatment group that eats fish and a control group that does not, then use an independent-sample t-test.
- Explain the difference between the paired-sample t-test and the independent two-sample t-test.
Ans: With the paired-sample t-test there is a natural pairing between the observations in Group A and the observations
in Group B. This pairing may reduce the variability and increase the chances of rejecting the null hypothesis or of obtaining a low p-value.
- For each of these scenarios, which t-test would you use? Ans:
- Independent
- Paired, pair each subject in the study with his or her twin;
- You could assign whole houses to paint brands (independent) or randomly choose exterior walls of each house to paint (paired)
- Paired, pair up the men and the women and let both subjects in the pair drive the same car
- Paired, let each tester in the study evaluate both websites
- Paired, pair up the paper measurements by measurer.
- To see how R can get confused when trying to read from a UTF-8 data file:
notice the ï.. prefix in the header of the resulting data frame. R is confused by the
UTF-8 byte order mark (BOM) at the beginning of the data file.
There are two ways to correct this problem:
- Open the data file with Notepad and select Save As. Before saving the file,
change the encoding at the bottom to ANSI.
- Create a new datafile in the c:/datasets folder. Then copy and paste the
contents of the file from the browser into the new datafile.
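Alternatively, the BOM can be handled from within R when the file is read; here is a sketch (the file name below is only a placeholder):
# R sketch: read a UTF-8 file that starts with a byte order mark
# "c:/datasets/scores.csv" is a hypothetical file name
df = read.csv("c:/datasets/scores.csv", fileEncoding = "UTF-8-BOM")
names(df)   # the first column name no longer starts with ï..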
Some Arguments for R Plots
col | Sets the color of the plotting symbols. |
cex | Sets the relative size of the plotting symbols. |
main | Sets the title for a plot. |
pch | Sets the plotting character for a plot. Legal values are keyboard characters in quotes or integers from 0 to 25, not in quotes. |
xlab | Sets the label for the x-axis. |
xlim | Sets the minimum and maximum values for the x-axis. |
ylab | Sets the label for the y-axis. |
ylim | Sets the minimum and maximum values for the y-axis. |
For example:
x = rnorm(30)
y = rnorm(30)
plot(x, y,
main="Plot of 30 Bivariate Normal Random Values",
xlab="Independent Variable", ylab="Dependent Variable",
pch="*", cex=2, xlim=c(-5,5), ylim=c(-5,5), col="red")
To see what the plotting symbols from 1 to 20 look like, use this R script
x = 1:20
plot(x, x, pch=x)
Examples of Correlation
- The news is filled with examples of correlations and associations:
Drinking a glass of red wine per day may decrease your chances of
a heart attack.
Taking one aspirin per day may decrease your chances of stroke or
of a heart attack.
Eating lots of certain kinds of fish may improve your health and
make you smarter.
Driving slower reduces your chances of getting killed in a traffic
accident.
Taller people tend to weigh more.
Pregnant women that smoke tend to have low birthweight babies.
Animals with large brains tend to be more intelligent.
The more you study for an exam, the higher the score you are likely
to receive.
Covariance and Correlation
- Details about covariance and correlation
- R Simulation: generate two sets of simulated random IQ scores (mu = 100, sigma = 15) with five observations each:
{x1, x2, x3, x4, x5},
{y1, y2, y3, y4, y5}
Compute the correlation and plot the x- and y-values. You can use
this R script.
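Here is a minimal sketch of such a script (the actual linked script may differ):
# R sketch: five pairs of independent simulated IQ scores
x = rnorm(5, mean = 100, sd = 15)
y = rnorm(5, mean = 100, sd = 15)
cor(x, y)                 # sample correlation (varies from run to run)
plot(x, y, main = "Five Pairs of Simulated IQ Scores")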
When I try it I obtain this graph with this output:
[1] -0.8234398
This shows that with a small number of observations, it is easy to obtain a high correlation (in this case negative) just by chance.
Here is a SAS script that does the same thing, with the generated
graph and output.
Of course the points generated by SAS are different than the points generated by R because they are random.
Look at the BearsCorr Example.
See the TireWear Example.
Simple Linear Regression
To obtain the simple linear regression equation y = ax + b using SAS:
proc reg;
model force = disp;
Both SAS and R obtain this simple linear regression equation for the
SpringReg Example:
force = 50.5 * disp - 0.4
An alternative to the simple linear regression is regression through the origin, where the regression line is forced to pass through the origin (0, 0).
To obtain the regression through the origin equation y = ax using SAS:
proc reg;
model force = disp / noint;
To obtain the regression through the origin equation y = ax using R:
noint_model = lm(force ~ disp + 0)
Both SAS and R obtain this regression through the origin equation for the
SpringReg Example:
force = 50.06667 * disp (no intercept)
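Here is a sketch of the corresponding R calls, assuming disp and force are numeric vectors (or data-frame columns) from the SpringReg example:
# R sketch: simple linear regression and regression through the origin
slr_model   = lm(force ~ disp)       # force = a*disp + b
noint_model = lm(force ~ disp + 0)   # force = a*disp (no intercept)
coef(slr_model)
coef(noint_model)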
BearsReg Example
Residual Analysis
Project 2
Least Squares Estimators
- Linear Regression: Three Regression Models
- Derive the LSE for the Horizontal Line Regression Model
yi = μ + εi. Ans:
Minimize S(μ) = Σ (yi - μ)² over μ. Setting dS/dμ = -2 Σ (yi - μ) = 0 gives μ^ = ȳ, the sample mean (the sums run from i = 1 to n).
- Derive the LSE for the Regression through the Origin Model
yi = a xi + εi. Ans:
yi^ = a^ xi, where a^ = (Σ xi yi) / (Σ xi²), with both sums running from i = 1 to n.
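Here is a small R sketch that checks this formula against lm(), using arbitrary illustrative data (not from the notes):
# R sketch: closed-form LSE for regression through the origin vs. lm()
x = c(1, 2, 3, 4, 5)               # arbitrary illustrative data
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
a_hat = sum(x * y) / sum(x^2)      # closed-form estimate
coef(lm(y ~ x + 0))                # lm() gives the same value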
- Derivations of the LSE for Four Regression
Models (See this document for technical details)
- In class, we only discuss the derivations of the Horizontal Line Regression and the Regression through the Origin
models.
- Here is the simple linear regression equation expressed in terms of the sample means, standard deviations,
and sample correlation:
y^ - ȳ = (r sy / sx) (x - x̄)
- Practice Problems:
- Use SAS or R to compute x̄, ȳ, sx, sy, and r for
the following bivariate dataset:
Then use these statistics to compute the simple straight line regression equation for this dataset.
Ans: Here are the R and SAS scripts. The computed statistics are
x̄ = 2.5, ȳ = 2.5, sx = 1.290994, sy = 1.290994, rxy = 0.8.
Then substitute these values into the simple linear regression equation:
y^ - ȳ = (rxy sy / sx) (x - x̄)
y^ - 2.5 = (0.8 * 1.290994 / 1.290994)(x - 2.5)
y^ - 2.5 = 0.8 * x - 0.8 * 2.5
y^ = 0.8x + 0.5
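Here is the same arithmetic as a tiny R sketch:
# R sketch: slope and intercept from the summary statistics
xbar = 2.5; ybar = 2.5; sx = 1.290994; sy = 1.290994; r = 0.8
b = r * sy / sx        # slope = 0.8
a = ybar - b * xbar    # intercept = 0.5
c(intercept = a, slope = b)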
- At the University of Southern North Dakota law school, the average LSAT score of incoming students is 162 with
a standard deviation of 6. The average first year score is 68 with a standard deviation of 10. The
correlation of the LSAT scores with the first year scores is 0.6. Compute the regression equation for
predicting first year score from LSAT score. What is the predicted first year score for someone with
LSAT score 168? Ans:
y^ - ȳ = (rxy sy / sx) (x - x̄)
y^ - 68 = (0.6 * 10 / 6)(x - 162)
y^ - 68 = x - 162
y^ = x - 94
Substitute x = 168 into the regression equation: y^ = 168 - 94 = 74.
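Here is the same calculation as a short R sketch:
# R sketch: predicted first year score for LSAT = 168
xbar = 162; ybar = 68; sx = 6; sy = 10; r = 0.6
b = r * sy / sx        # slope = 1
a = ybar - b * xbar    # intercept = -94
a + b * 168            # predicted first year score = 74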