5/8/17 Notes

CSC 433 -- 5/8/17

Review Questions

What does the attach function do?
Ans: It allows the columns of a dataframe to be used as unqualified variables (it "attaches" the columns to the work space). For example:
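A minimal sketch of attach and detach. The kids dataframe is a hypothetical example made up for illustration; it is not from the course materials.

```r
# Small hypothetical dataframe used only for illustration.
kids <- data.frame(name = c("Ann", "Bob", "Cal"), age = c(7, 9, 12))

attach(kids)            # columns are now visible as plain variables
mean_age <- mean(age)   # same result as mean(kids$age)
detach(kids)            # remove the dataframe from the search path

# After detach, "kids" should no longer appear on the search path.
attached_after <- "kids" %in% search()
```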
Use the detach function to detach the columns of the dataframe from the workspace. Always detach a dataframe when you are finished with it. Some programmers don't like the attach function. They say it makes R scripts harder to maintain.
Write a script that produces a plot showing the plotting symbols corresponding to 1, 2, ... , 19.
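One possible answer, offered as a sketch rather than the official solution. It relies on the pch argument of points being vectorized; the axis limits and label placement are arbitrary choices.

```r
# Empty plotting region wide enough for symbols 1, ..., 19.
plot(c(0, 20), c(0, 2), type = "n",
     xlab = "pch value", ylab = "", yaxt = "n")

symbols_to_show <- 1:19

# Draw symbol k at x = k; pch is vectorized, so no loop is needed.
points(symbols_to_show, rep(1, 19), pch = symbols_to_show, cex = 1.5)

# Label each symbol with its pch number.
text(symbols_to_show, rep(0.5, 19), labels = symbols_to_show, cex = 0.7)
```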

Show a plot that contains lines plotted with lty=1, ... , 6. Ans: Use this script:

# Set up an empty plot region first.
plot(0:7, 0:7, type="n")

# Draw one horizontal line for each line type 1, ..., 6.
for (y in 1:6) {
  lines(c(1, 6), c(y, y), lty=y)
}

Give an example of an R nonatomic datatype.
Ans: list, dataframe.
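What are attributes and how do you determine what they are for an R object?
Ans: Attributes are named pieces of data that can be added to an object, such as names, dim, and class. Use the attributes function to list all attributes of an object, or the attr function to get or set a single attribute. For example: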
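A short sketch of inspecting and setting attributes. The attribute name "units" is an arbitrary name chosen for this illustration.

```r
m <- matrix(1:6, nrow = 2)

attrs <- attributes(m)      # a list; for a plain matrix it holds only dim
attr(m, "units") <- "cm"    # add a custom attribute ("units" is made up here)
u <- attr(m, "units")       # read it back
```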
List four ways of subsetting a vector, matrix, or data frame. Ans:
1. Use positive indices to tell which rows or columns to keep:
  v[c(2, 4, 5)]
  A[3, c(1, 3)]
2. Use negative indices to tell which rows or columns to omit:
  v[c(-1, -3)]
  A[-(1:10), -6]
3. Use a logical vector, where TRUE indicates which items to keep and FALSE indicates which items to omit:
  v[c(T, F, F, T, T)]
4. Use the $ operator to indicate by name which column of a data frame to keep, for example:
  kids$age
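The four subsetting styles can be tried on small objects. The objects v, A, and kids below are hypothetical stand-ins, sized so that each expression is valid.

```r
v <- c(10, 20, 30, 40, 50)
A <- matrix(1:12, nrow = 3)   # 3 x 4 matrix, filled by column
kids <- data.frame(name = c("Ann", "Bob"), age = c(7, 9))

keep  <- v[c(2, 4, 5)]                          # positive indices
omit  <- v[c(-1, -3)]                           # negative indices
logic <- v[c(TRUE, FALSE, FALSE, TRUE, TRUE)]   # logical vector
entry <- A[3, c(1, 3)]                          # row 3, columns 1 and 3
ages  <- kids$age                               # column selected by name
```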

How do you delete a component from a list?
Ans: Set that component to NULL:

> car <- list(make="Volkswagen", 
+     model="Jetta", year=2011)
> car
$make
[1] "Volkswagen"

$model
[1] "Jetta"

$year
[1] 2011

> car$model <- NULL
> car
$make
[1] "Volkswagen"

$year
[1] 2011

How can you give names to the components of a vector?
Ans: Set the names attribute:

> temps <- c(31, 45, 68)
> names(temps) <- c("low", "ave", "hi")
> temps
low ave  hi 
 31  45  68 
> as.vector(temps)
[1] 31 45 68

What is an R replacement function?
Ans: It is a function whose call appears on the left side of the assignment operator <- and that changes its argument, as in names(x) <- value. The effect resembles return by reference in C++. For example:
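A minimal sketch of a user-defined replacement function. The name second<- is hypothetical; R requires that a replacement function be named with a trailing <- and that its last parameter be called value.

```r
# Replacement function that overwrites the second element of a vector.
`second<-` <- function(x, value) {
  x[2] <- value
  x              # return the modified object
}

v <- c(10, 20, 30)
second(v) <- 99  # R evaluates this as v <- `second<-`(v, value = 99)
```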
What are some commonly used arguments of the plot function? Ans:
x, y, main, sub, xlab, ylab, xlim, ylim, col, pty

Write a function named outliers that inputs a numeric dataframe with one column and returns the original dataframe with an added character column that indicates whether the data value in that row is an extreme outlier ("*"), mild outlier ("O"), or not an outlier (""). Ans:

outliers <- function(df) {
   # Extract the single numeric column as a plain vector.
   data <- as.vector(df[ , 1])
   n <- length(data)
   q1 <- quantile(data, 0.25)
   q3 <- quantile(data, 0.75)
   iqr <- q3 - q1
   # Inner fences are 1.5 IQRs from the quartiles; outer fences are 3 IQRs.
   lower_outer_fence <- q1 - 3.0 * iqr
   lower_inner_fence <- q1 - 1.5 * iqr
   upper_inner_fence <- q3 + 1.5 * iqr
   upper_outer_fence <- q3 + 3.0 * iqr
   # Start with "" (not an outlier), then mark mild and extreme outliers.
   symbol <- rep("", n)
   symbol[data < lower_inner_fence | 
          data > upper_inner_fence] <- "O"
   symbol[data < lower_outer_fence | 
          data > upper_outer_fence] <- "*"
   # Add the symbols as a character column named outlier.
   df$outlier <- symbol
   return (df)
}

x <- data.frame(values=c(2, 7, 4, 3, 2, 6, 60, 99))
out <- outliers(x)
out

Quiz 6

Take Quiz 6 in groups of 2 or 3.

Functions that Input or Return Matrices

The cbind and rbind functions return matrices.

The cbind function inputs vectors and/or matrices of the same height and binds them together by columns to produce a matrix of the same height as the inputs. For example:

u <- c(2, 4, 6, 4, 1)
v <- c(5, 0, -1, 2, 3)
w <- matrix(c(2, 2, 6, -3, 0, 
  1, 1, 3, 0, 2), 5, 2, byrow=T)
M <- cbind(u, v, w)
print(M)

# Output:
     u  v 
[1,] 2  5 2  2
[2,] 4  0 6 -3
[3,] 6 -1 0  1
[4,] 4  2 1  3
[5,] 1  3 0  2

The rbind function binds together vectors and/or matrices with the same width by rows to form a matrix with the same width as the inputs. For example:

u <- c(2, 4, 6, 4)
v <- matrix(c(2, 2, 6, -3,
  1, 1, 3, 0), 2, 4, byrow=T)
w <- c(5, 0, -1, 2)
M <- rbind(u, v, w)
print(M)

# Output:
  [,1] [,2] [,3] [,4]
u    2    4    6    4
     2    2    6   -3
     1    1    3    0
w    5    0   -1    2

The apply function applies an R function to the rows or columns of a matrix. For example:

A <- matrix(1:9, 3, 3, byrow=T)
print(A)

# Apply sum function to rows of matrix. 
apply(A, 1, sum)

# Apply sum function to columns of matrix. 
apply(A, 2, sum)

# Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

[1]  6 15 24
[1] 12 15 18

R String Handling Functions

Here are some useful R String Handling Functions. You can bring this document to the final exam.
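The referenced document is not reproduced in these notes. As a rough illustration only, here are a few commonly used base R string functions; this selection is an assumption and may not match the functions listed in that document.

```r
s <- "Hello, World"

n_chars <- nchar(s)                  # number of characters
sub_str <- substr(s, 1, 5)           # characters 1 through 5
upper   <- toupper(s)                # convert to upper case
joined  <- paste("CSC", 433)         # concatenate with a space separator
parts   <- strsplit(s, ", ")[[1]]    # split on comma followed by space
swapped <- sub("World", "R", s)      # replace the first match
```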

Project 5b

Discuss Project 5b

More about Factors

A factor is the R way to represent values of a categorical variable.
Use the cut function to convert numeric values to factor values.
Use the R table function to summarize factor data.
The table function is used in the same situations as the SAS proc freq.
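A small sketch of cut followed by table. The ages, bin boundaries, and labels are made-up values chosen for illustration.

```r
ages <- c(3, 15, 22, 37, 41, 68, 70, 12)

# Intervals are (0, 17], (17, 64], (64, Inf] because cut uses right = TRUE by default.
age_group <- cut(ages, breaks = c(0, 17, 64, Inf),
                 labels = c("child", "adult", "senior"))

# Count how many values fall into each level of the factor.
counts <- table(age_group)
print(counts)
```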

Here is an example that uses the table function. Eight eligible voters are interviewed; each is asked these questions: "What is your gender?" and "Which candidate, A or B, do you prefer?"

# Create data frame
election <- data.frame(
  gender=c("F", "M", "M", "F", "M", "M", "F", "F"),
  cand=c("A", "B", "B", "B", "B", "A", "A", "A"))
election
# Output:
  gender cand
1      F    A
2      M    B
3      M    B
4      F    B
5      M    B
6      M    A
7      F    A
8      F    A

table(election)
# Output:
      cand
gender A B
     F 3 1
     M 1 3

table(election$gender)
# Output:
F M 
4 4 

table(election$cand)
# Output:
A B 
4 4

Project 4b

Look at the FixedWidthFields and Aggregate examples.
Discuss Project 4b.

Project 5

The system.time function.
Look at the MySquare and MyConcat Examples.
See the Project 5 Description.
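A hedged sketch of how system.time might be used to compare two ways of squaring the integers 1 to n. This is not the MySquare or MyConcat example; the variable names and the value of n are assumptions.

```r
n <- 1e5

# Time a loop that fills a vector one element at a time.
t_loop <- system.time({
  squares <- numeric(0)
  for (i in 1:n) squares[i] <- i^2
})

# Time the vectorized equivalent.
t_vec <- system.time(squares_vec <- (1:n)^2)

# Each result holds user, system, and elapsed times in seconds.
print(t_loop["elapsed"])
print(t_vec["elapsed"])
```

The timings vary from machine to machine, so only the relative difference is meaningful.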

R Graphics

Graphics Practice Problems:
1. Add a least squares line to the scatterplot of the father's and son's heights in the Pearson Dataset. Ans:
2. Create plots of the builtin ChickWeight dataset where diet=1 and Chick < 20. Connect the weights of each chick with line segments. Ans:
3. Using the autoSales.txt, create these plots for the amount of sales by day of week:
  Also create stacked and side-by-side bar charts for sales by day of week and vehicle type.
4. Create a histogram of the faithful dataset (eruption times of the Old Faithful geyser). Check the effect of using the Sturges, Scott, and FD algorithms for setting the number of bins. Ans:
  When the number of breaks is given, it is only a suggestion.
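One way to approach Problem 4, given as a sketch rather than the official answer. It passes the rule name to the breaks argument of hist and records how many bins each rule actually produced.

```r
eruptions <- faithful$eruptions

bins <- integer(0)
par(mfrow = c(1, 3))      # three histograms side by side
for (rule in c("Sturges", "Scott", "FD")) {
  h <- hist(eruptions, breaks = rule, main = rule,
            xlab = "Eruption time (min)")
  bins[rule] <- length(h$counts)   # bins actually drawn
}
par(mfrow = c(1, 1))      # restore the default layout

# Number of classes suggested by the Sturges rule before hist adjusts it.
n_sturges <- nclass.Sturges(eruptions)
print(bins)
```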

Brushing

Brushing means clicking on or mousing over points on a scatterplot to identify the point.
Consider the Cars Example. Add this line after the plot statement:
Now click on points in the scatterplot for which you want row labels.
Press ESC when you are finished.
See the BodyBrain Example.
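A sketch of brushing with the base R identify function, which is one standard way to label clicked points. It uses the builtin cars dataset as an assumption; the course's Cars Example may use different data, and the line it adds after the plot statement may differ from this one.

```r
# Scatterplot of the builtin cars dataset.
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")

# identify needs an interactive graphics device, so skip it in batch mode.
if (interactive()) {
  # Click near points to label them with their row names.
  picked <- identify(cars$speed, cars$dist, labels = rownames(cars))
}
```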

Reading Data from a Web Page

In addition to reading data from files, R can also read data from a connection, which can be a connection to the internet or to data on a local or remote machine.
The readHTMLTable function reads HTML tables from a web page specified by URL. The XML package is required.
See this document to see how to install an R package:

Here is a script that reads from this URL:

http://www.basketball-reference.com/contracts/CHI.html

# Load XML package.
library(XML)

# Define URL of web page that contains tables.
url <- "http://www.basketball-reference.com/contracts/CHI.html"

# Read tables on web page.  There are two tables.
tables <- readHTMLTable(url)

# Pick out desired table.
salary.table <- tables$payroll

Computing Sample Quantiles with R

References:
1. Help Page for R quantile function:
  https://stat.ethz.ch/R-manual/R-patched/library/stats/html/quantile.html
2. Quantile Calculations in R:
  http://tolstoy.newcastle.edu.au/R/e17/help/att-1067/Quartiles_in_R.pdf
3. Hyndman and Fan, Sample Quantiles in Statistical Packages, The American Statistician, Nov. 1996, Vol. 50, No. 4, pp. 361-365.
  https://www.amherst.edu/media/view/129116/original/Sample+Quantiles.pdf
Naively speaking, the pth quantile is the data value Q(p) such that a fraction p of the data points have values less than or equal to Q(p).
For example, if the population has a precisely standard normal distribution, the 0.95 quantile is approximately z = 1.645. This can be seen from a standard normal table or using R:
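A short check with the base R functions qnorm and pnorm:

```r
z <- qnorm(0.95)   # 0.95 quantile of the standard normal distribution
p <- pnorm(z)      # fraction of the distribution at or below z
print(z)
```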
However, if we wish to compute quantiles from a finite sample, the situation is not so simple. Let's use the (sorted) artificial sample x, where x is defined as
x <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
These are the first ten prime numbers. This example and the examples QuantileDefs1 (R) and QuantileDefs2 (SAS) were inspired by Reference 1. These examples show the results of the various R methods for computing quantiles.
What is the 0.25 quantile of the data defined by x? The fraction 0.2 of the data is ≤ 3, whereas 0.3 of the data is ≤ 5; there is no value q = Q(0.25) such that exactly 0.25 of the data is ≤ q.
Here are three immediate solutions:
1. Choose Q(0.25) = 3.
2. Choose Q(0.25) = 5.
3. Interpolate. Since 0.25 is halfway between 0.2 and 0.3, Q(0.25) is chosen halfway between x₂ = 3 and x₃ = 5: Q(0.25) = 4
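The fractions and the interpolated value can be checked directly. This sketch assumes x holds the first ten primes, as the text states, and uses approx for the linear interpolation.

```r
x <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)   # first ten primes

frac_le_3 <- mean(x <= 3)   # fraction of the data at or below 3
frac_le_5 <- mean(x <= 5)   # fraction of the data at or below 5

# Linear interpolation between (0.2, 3) and (0.3, 5), evaluated at p = 0.25.
q_interp <- approx(c(0.2, 0.3), c(3, 5), xout = 0.25)$y
```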
A quick and dirty way to compute the first quartile (Q1 = Q(0.25)) and the third quartile (Q3 = Q(0.75)) is to use Tukey's Hinges method: if the sample size n is even, Tukey's method says that Q1 is the median of the bottom half of the data and Q3 is the median of the top half. For example, for Dataset 1:
2, 3, 5, 7, 11, 13
The bottom half of the data is 2, 3, 5, whose median is 3, so Q1 = 3. The top half of the data is 7, 11, 13, so Q3 = median of top half = 11. If n is odd, as is the case with Dataset 2:
2, 3, 5, 7, 11, 13, 17
Tukey's Hinges method says to count the middle element 7 in both halves of the data. The bottom half is 2, 3, 5, 7, whose median is 4, so Q1 = 4. The top half is 7, 11, 13, 17, whose median is 12, so Q3 = 12.

A related method for computing Q1 and Q3 is recommended by Moore and McCabe. If n is odd as in Dataset 2, they omit the middle observation 7 to obtain the bottom half as 2, 3, 5 and the top half as 11, 13, 17. Thus, Q1 = median of bottom half = 3; Q3 = median of top half = 13.
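Both methods can be sketched in R. The base function fivenum returns Tukey's hinges as its second and fourth values. The helper mm_quartiles is a hypothetical function written for this illustration, and the sketch assumes that Dataset 1 is the first six primes and Dataset 2 the first seven.

```r
d1 <- c(2, 3, 5, 7, 11, 13)        # even n
d2 <- c(2, 3, 5, 7, 11, 13, 17)    # odd n

# fivenum returns min, lower hinge, median, upper hinge, max.
tukey1 <- fivenum(d1)[c(2, 4)]
tukey2 <- fivenum(d2)[c(2, 4)]

# Hypothetical helper: medians of the two halves, leaving out the
# middle observation when n is odd (the Moore and McCabe rule).
mm_quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)
  lower <- x[1:half]
  upper <- x[(n - half + 1):n]
  c(Q1 = median(lower), Q3 = median(upper))
}

mm1 <- mm_quartiles(d1)
mm2 <- mm_quartiles(d2)
```

For even n the two methods agree, since there is no middle observation to include or omit.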
In fact, there are additional ways that are used to compute percentiles. See Reference 3. SAS currently has 5 percentile definitions available; R has 9.
Here are the detailed quantile calculations for each of the nine R methods for the dataset x defined in Line 1, earlier in this section.
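The detailed calculations are not reproduced in these notes. As a rough substitute, this sketch asks the quantile function for Q(0.25) under each of its nine type settings, assuming x is the first ten primes as stated earlier.

```r
x <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)   # first ten primes

# Q(0.25) under each of the nine quantile definitions (type = 1, ..., 9).
q25 <- sapply(1:9, function(t) quantile(x, 0.25, type = t, names = FALSE))
names(q25) <- paste0("type", 1:9)
print(q25)
```

Type 7 is the default used by quantile when no type is given.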