CSC 423 -- Project 1
Use the NIST and BloodPressure Examples to help you write your SAS or R code.
Look at the Project Submission Guidelines
before submitting any project.
This project is worth 75 points in total; the raw score will be multiplied by 100/75 to obtain the scaled score.
Part A. Univariate Data Analysis (30 pts.)
The measurements in the Paper dataset in paper1.txt and paper2.txt
consist of the paper thicknesses of two colors of paper (white and yellow) that we
measured in class with the micrometer. Use this data to answer the following
questions. Include source code comments to explain what you are doing.
- Read in the data from paper1.txt and print it to verify that everything was input correctly.
- Read in the data from paper2.txt and print it to verify that everything was input correctly.
- Obtain these univariate statistics separately by color for the paper thicknesses: sample mean, sample standard deviation, sample median, sample IQR, and these percentiles:
5, 10, 25, 75, 90, 95. You can use SAS proc means or proc univariate to
compute these statistics. Don't compute them by hand.
- Find 95% confidence intervals for the true thickness for each color of paper
separately. Show your hand calculations and also show the relevant SAS or R output
to verify your calculations.
- Create three histograms for thicknesses from the combined colors of paper:
- Create a histogram using the default setting for the number of bins
Run your SAS or R code first without the code for the other two histograms to see what bins
are obtained with the default setting.
- Create a histogram with more bins than the default. In SAS, you can do this
with an option on the histogram statement of proc univariate. For example:
histogram / endpoints = (11 to 25 by 1);
Here 11 is a value that is less than all of the data values, 25 is a value that is
greater than all of the data values, and 1 is the width of the histogram bins.
(You will use different numbers that make sense for your histograms.)
In R, you can set the
number of bin boundaries (breaks) like this:
hist(x, breaks=seq(11, 25, 1))
- Create a histogram with fewer bins than the default, using the same option as for the histogram with more bins.
- Create side-by-side boxplots of the thicknesses for the colors white and yellow.
Discuss what the boxplots tell you. Are there any outliers? If you are using
SAS, use the paper2.txt dataset and sort the dataset by color before plotting
the boxplots.
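As a rough illustration of the Part A steps in R, the sketch below uses a small made-up data frame in place of paper2.txt. The column names thickness and color, the data values, the bin boundaries, and the helper names stats_by_color and ci_by_hand are all assumptions for illustration; they are not the class data or required code. It computes the summary statistics by color, a t-based 95% confidence interval the way a hand calculation would, and then draws the three histograms and the side-by-side boxplots.

```r
# Hypothetical long-format data standing in for paper2.txt (values are made up).
paper <- data.frame(
  thickness = c(14.2, 15.1, 13.8, 16.0, 15.4, 14.7,
                17.9, 19.2, 18.1, 20.3, 19.0, 18.6),
  color = rep(c("white", "yellow"), each = 6)
)
print(paper)  # verify that the data were entered correctly

# Summary statistics for one group of thicknesses (hypothetical helper).
stats_by_color <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x),
    quantile(x, probs = c(0.05, 0.10, 0.25, 0.75, 0.90, 0.95)))
}
summary_table <- sapply(split(paper$thickness, paper$color), stats_by_color)
print(summary_table)

# 95% confidence interval for the mean, mirroring the hand calculation:
# mean +/- t critical value * s / sqrt(n), with n - 1 degrees of freedom.
ci_by_hand <- function(x, level = 0.95) {
  n <- length(x)
  tcrit <- qt(1 - (1 - level) / 2, df = n - 1)
  half_width <- tcrit * sd(x) / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}
white <- paper$thickness[paper$color == "white"]
print(ci_by_hand(white))
print(t.test(white)$conf.int)  # should agree with the hand calculation

# Three histograms of the combined thicknesses: default, more, and fewer bins.
h_default <- hist(paper$thickness, main = "Default bins")
h_more    <- hist(paper$thickness, breaks = seq(11, 25, by = 1), main = "More bins")
h_fewer   <- hist(paper$thickness, breaks = seq(11, 25, by = 7), main = "Fewer bins")

# Side-by-side boxplots of thickness by color.
boxplot(thickness ~ color, data = paper, ylab = "Thickness")
```

R's quantile function and the SAS procedures may use different default percentile definitions, so small discrepancies between the two sets of output are possible.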
Part B. One-sample t-test (20 pts.)
To investigate the load on its network, a technician records the number of concurrent
users at fifty locations (in thousands of people):
17.2 22.1 18.5 17.2 18.6 14.8 21.7 15.8 16.3 22.8
24.1 13.3 16.2 17.5 19.0 23.9 14.8 22.2 21.7 20.7
13.5 15.8 13.1 16.1 21.9 23.9 19.3 12.0 19.9 19.4
15.4 16.7 19.5 16.2 16.9 17.1 20.2 13.4 19.8 17.7
19.7 18.7 17.6 15.9 15.2 17.1 15.0 18.8 21.6 11.9
- Create a SAS or R dataset containing the number of concurrent users at each location.
For example, call your variable nusers. If you are using SAS and copy the preceding data lines
verbatim, don't forget the trailing @@ in the input statement. Use this R statement to read data from
within the script:
nusers = scan( ) # c(scan( )) also works
# datalines go here with a blank line to
# terminate the data
print(nusers)
- Create and interpret the normal plot for nusers.
- Compute a 95% confidence interval for the mean of nusers.
Show your hand calculations with the relevant SAS or R output. Don't use
the standard normal interval [-1.96, 1.96]; use the t-distribution interval
obtained from the t-table for 50 - 1 = 49 degrees of freedom. You can check
your answer with SAS using proc means or proc ttest. You can check your answer
with R using the t.test function.
- Show the five steps of the one-sample t-test at the α = 0.05 level to
test whether nusers has changed in the past month. Usage data from last month
shows an average of 17.2 thousand concurrent users. You don't need to perform
hand calculations for steps 1, 2, and 5; just copy the values from the SAS or R
output. For Step 3, find the 95% confidence interval using the t-table.
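The Part B quantities can be sketched in R as follows. This is only one possible layout, not the required solution: the data are entered with c() instead of scan() so the script runs non-interactively, and the variable names n, xbar, s, tcrit, ci, mu0, tstat, pval, and check are assumptions chosen for illustration. The code does not try to map onto the course's five test steps; it only computes the pieces that a hand calculation with the t-table would use and compares them with t.test.

```r
# Concurrent users (in thousands) at the fifty locations listed above.
nusers <- c(17.2, 22.1, 18.5, 17.2, 18.6, 14.8, 21.7, 15.8, 16.3, 22.8,
            24.1, 13.3, 16.2, 17.5, 19.0, 23.9, 14.8, 22.2, 21.7, 20.7,
            13.5, 15.8, 13.1, 16.1, 21.9, 23.9, 19.3, 12.0, 19.9, 19.4,
            15.4, 16.7, 19.5, 16.2, 16.9, 17.1, 20.2, 13.4, 19.8, 17.7,
            19.7, 18.7, 17.6, 15.9, 15.2, 17.1, 15.0, 18.8, 21.6, 11.9)
print(nusers)

# Normal (Q-Q) plot with a reference line.
qqnorm(nusers)
qqline(nusers)

# Pieces of the hand calculation.
n     <- length(nusers)
xbar  <- mean(nusers)
s     <- sd(nusers)
tcrit <- qt(0.975, df = n - 1)  # plays the role of the t-table lookup for 49 df

# 95% confidence interval for the mean: xbar +/- tcrit * s / sqrt(n).
ci <- xbar + c(-1, 1) * tcrit * s / sqrt(n)
print(ci)

# One-sample t statistic against last month's average of 17.2 thousand users.
mu0   <- 17.2
tstat <- (xbar - mu0) / (s / sqrt(n))
pval  <- 2 * pt(-abs(tstat), df = n - 1)  # two-sided p-value
print(c(t = tstat, critical = tcrit, p = pval))

# Check against R's built-in test.
check <- t.test(nusers, mu = mu0)
print(check)
```

Comparing ci, tstat, and pval with the corresponding entries of check is a quick way to confirm that the hand calculation and the software output agree.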
Part C. Two-sample t-tests (25 pts.)
Use the data in paper1.txt and/or paper2.txt
to answer the following questions.
- If you are using SAS, create labels for each variable thickness and color. If you are using R, add print statements in your source code to explain what your output means.
- Create normal plots of the thicknesses separately for the paper colors white
and yellow.
Interpret these normal plots.
- Type out the five steps of a 0.05-level paired-sample t-test to test the null
hypothesis that there is no difference between the paper thicknesses in
paper1.txt. Show relevant SAS or R output in your report. You will need to obtain
the confidence interval for the test statistic from the t-table.
- Type out the five steps of a 0.05-level independent two-sample t-test to test the
null hypothesis that there is no difference between the paper thicknesses for
colors white and yellow. Show your output and discuss what it
means. Use paper2.txt sorted by color for SAS but paper1.txt for R.
You will need to obtain the confidence interval for the test statistic from the
t-table.
- Is the paired-sample or the independent two-sample t-test more
appropriate to decide if the true thickness of a sheet of paper is different for
color white or yellow? Explain your answer. How do the p-values compare for the two
t-tests? Is this what you would expect?
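For the Part C tests, a minimal R sketch might look like the following. It assumes a wide layout with one column per color, which is a guess at how paper1.txt could be arranged; the column names white and yellow and all data values are made up for illustration. The sketch shows how to request each test and where the t-table critical values come from, and it leaves the interpretation and the choice between the tests to the report.

```r
# Hypothetical wide-format data standing in for paper1.txt (values are made up).
paper1 <- data.frame(
  white  = c(14.2, 15.1, 13.8, 16.0, 15.4, 14.7),
  yellow = c(17.9, 19.2, 18.1, 20.3, 19.0, 18.6)
)

# Normal (Q-Q) plots, one per color.
qqnorm(paper1$white, main = "White")
qqline(paper1$white)
qqnorm(paper1$yellow, main = "Yellow")
qqline(paper1$yellow)

# Paired-sample t-test: works on the row-by-row differences.
paired <- t.test(paper1$white, paper1$yellow, paired = TRUE)
print(paired)

# The same statistic computed from the differences, as in a hand calculation.
d <- paper1$white - paper1$yellow
t_paired <- mean(d) / (sd(d) / sqrt(length(d)))
crit_paired <- qt(0.975, df = length(d) - 1)  # t-table value, n - 1 df

# Independent two-sample t-test with a pooled variance estimate.
# Note that t.test defaults to the unequal-variance (Welch) test unless
# var.equal = TRUE is given.
indep <- t.test(paper1$white, paper1$yellow, var.equal = TRUE)
print(indep)
crit_indep <- qt(0.975, df = nrow(paper1) + nrow(paper1) - 2)  # n1 + n2 - 2 df

# Side-by-side comparison of the two p-values.
print(c(paired = paired$p.value, independent = indep$p.value))
```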