CSC 433 -- 5/8/17
Review Questions
- What does the attach function do?
Ans: It allows the columns of a dataframe to be used as unqualified variables
(it "attaches" the dataframe to R's search path).
For example:
> kids
name gender age
1 Alice F 11
> name
Error: object 'name' not found
> attach(kids)
> name
[1] Alice
Levels: Alice
Use the detach function to detach the columns of the
dataframe from the workspace. Always detach a dataframe when you are finished
with it. Some programmers don't like the attach function.
They say it makes R scripts harder to maintain.
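The attach/detach cycle described above can be sketched as follows. The kids dataframe is rebuilt by hand here (stringsAsFactors=FALSE is an assumption, so name is a character string rather than a factor), and with() is shown as one common alternative that never attaches anything.

```r
kids <- data.frame(name = "Alice", gender = "F", age = 11,
                   stringsAsFactors = FALSE)

attach(kids)                      # columns are now visible by bare name
age_while_attached <- age
on_path_before <- "kids" %in% search()

detach(kids)                      # take kids off the search path again
on_path_after <- "kids" %in% search()

# Alternative: evaluate one expression inside the dataframe, no attach needed.
age_next_year <- with(kids, age + 1)
```

Because with() limits the unqualified names to a single expression, there is nothing to forget to detach afterwards.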
- Write a script that produces a plot showing the plotting symbols corresponding
to 1, 2, ... , 19.
x <- 1:19
plot(x, x, pch=x)
- Show a plot that contains lines plotted with lty=1, ... , 6. Ans: Use this script:
# set up plot region first.
plot(0:7, 0:7, type="n")
for (y in 1:6) {
    lines(c(1, 6), c(y, y), lty=y)
}
- Give an example of an R nonatomic datatype.
Ans: list, dataframe.
- What are attributes and how do you determine what they are for
an R object?
Ans: Attributes are pieces of data that can be added to an object.
Use the attributes function to list them. For example:
> v <- c(1, 3, 2, 4)
> attributes(v) <- list(author="sdj")
> print(attributes(v))
$author
[1] "sdj"
> print(v)
[1] 1 3 2 4
attr(,"author")
[1] "sdj"
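The attr function reads or sets one attribute at a time. Here is a short sketch along the lines of the example above; the units attribute is made up purely for illustration.

```r
v <- c(1, 3, 2, 4)
attr(v, "author") <- "sdj"        # set a single attribute
attr(v, "units")  <- "cm"         # hypothetical second attribute

author  <- attr(v, "author")      # read one attribute back
present <- names(attributes(v))   # names of every attribute on v
```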
- List four ways of subsetting a vector, matrix, or data frame. Ans:
- Use positive indices to tell which rows or columns to keep:
v[c(2, 4, 5)] A[3, c(1, 3)]
- Use negative indices to tell which rows or columns to omit:
v[c(-1, -3)] A[-(1:10), -6]
- Use a logical vector, where TRUE indicates which items to keep and
FALSE indicates which items to omit:
v[c(T, F, F, T, T)]
- Use the $ operator to indicate by name which column to keep, for example
kids$age.
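The four subsetting styles can be tried side by side. This is only a sketch: v, A, and the two-row kids dataframe are made-up data.

```r
v <- c(10, 20, 30, 40, 50)
A <- matrix(1:12, nrow = 3)                 # 3 x 4 matrix, filled by column
kids <- data.frame(name = c("Alice", "Bob"),
                   age = c(11, 9))          # made-up rows

keep_pos <- v[c(2, 4, 5)]                   # positive indices: 20 40 50
keep_neg <- v[c(-1, -3)]                    # negative indices: 20 40 50
keep_lgl <- v[c(TRUE, FALSE, FALSE, TRUE, TRUE)]   # logical: 10 40 50
row3     <- A[3, c(1, 3)]                   # row 3, columns 1 and 3: 3 9
ages     <- kids$age                        # column by name: 11 9
```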
- How do you delete a component from a list?
Ans: Set that component to NULL:
> car <- list(make="Volkswagen",
+ model="Jetta", year=2011)
> car
$make
[1] "Volkswagen"
$model
[1] "Jetta"
$year
[1] 2011
> car$model <- NULL
> car
$make
[1] "Volkswagen"
$year
[1] 2011
- How can you give names to the components of a vector?
Ans: Set the names attribute:
> temps <- c(31, 45, 68)
> names(temps) <- c("low", "ave", "hi")
> temps
low ave hi
31 45 68
> as.vector(temps)
[1] 31 45 68
- What is an R replacement function?
Ans: It is a function that can be called on the left side of the
assignment operator <- to change the argument.
This is similar to a C++ function that returns by reference. For example:
> v <- c(1, 3, 2, 4)
> length(v) <- 6
> v
[1] 1 3 2 4 NA NA
> length(v) <- 3
> v
[1] 1 3 2
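You can also write your own replacement function. The sketch below defines a hypothetical function named second<- (not part of R) that replaces the second element of a vector. The function name must end in <- and its last argument must be named value.

```r
# Hypothetical user-defined replacement function.
"second<-" <- function(x, value) {
    x[2] <- value
    x                    # return the modified copy
}

v <- c(1, 3, 2, 4)
second(v) <- 99          # evaluated as v <- `second<-`(v, value = 99)
```

Unlike a true C++ reference, R builds a modified copy and assigns it back to v, so the effect looks the same but no aliasing is involved.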
- What are some commonly used arguments of the plot function? Ans:
x, y, main, sub, xlab, ylab, xlim, ylim, col, pty
- Write a function named outliers that inputs a numeric dataframe with one column
and returns the original dataframe with an added character column that indicates whether
the data value in that row is an extreme outlier ("*"),
mild outlier ("O"), or not an outlier (""). Ans:
outliers <- function(df) {
data <- as.vector(df[ , 1])
n <- length(data)
q1 <- quantile(data, 0.25)
q3 <- quantile(data, 0.75)
iqr <- q3 - q1
lower_outer_fence <- q1 - 3.0 * iqr
lower_inner_fence <- q1 - 1.5 * iqr
upper_inner_fence <- q3 + 1.5 * iqr
upper_outer_fence <- q3 + 3.0 * iqr
symbol <- rep("", n)
symbol[data < lower_inner_fence |
data > upper_inner_fence] <- "O"
symbol[data < lower_outer_fence |
data > upper_outer_fence] <- "*"
extra_col <- as.data.frame(symbol)
names(extra_col) <- "outlier"
return (cbind(df, extra_col))
}
x <- data.frame(values=c(2, 7, 4, 3, 2, 6, 60, 99))
out <- outliers(x)
out
Quiz 6
- Take Quiz 6 in groups of 2 or 3.
Functions that Input or Return Matrices
- The cbind and rbind functions return matrices.
- The cbind function inputs vectors and/or matrices of the same height
and binds them together by columns to produce a matrix of the same height as the inputs. For example:
u <- c(2, 4, 6, 4, 1)
v <- c(5, 0, -1, 2, 3)
w <- matrix(c(2, 2, 6, -3, 0,
1, 1, 3, 0, 2), 5, 2, byrow=T)
M <- cbind(u, v, w)
print(M)
# Output:
     u  v
[1,] 2  5 2  2
[2,] 4  0 6 -3
[3,] 6 -1 0  1
[4,] 4  2 1  3
[5,] 1  3 0  2
The rbind function binds together vectors and/or
matrices with the same width by
rows to form a matrix with the same width as the inputs. For example:
u <- c(2, 4, 6, 4)
v <- matrix(c(2, 2, 6, -3,
1, 1, 3, 0), 2, 4, byrow=T)
w <- c(5, 0, -1, 2)
M <- rbind(u, v, w)
print(M)
# Output:
  [,1] [,2] [,3] [,4]
u    2    4    6    4
     2    2    6   -3
     1    1    3    0
w    5    0   -1    2
The apply function applies an R function to the rows or columns of a matrix. For example:
A <- matrix(1:9, 3, 3, byrow=T)
print(A)
# Apply sum function to rows of matrix.
apply(A, 1, sum)
# Apply sum function to columns of matrix.
apply(A, 2, sum)
# Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[1]  6 15 24
[1] 12 15 18
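The function passed to apply does not have to be a builtin. As a sketch, an anonymous function can compute the range (max minus min) of each row or column of the same matrix A:

```r
A <- matrix(1:9, 3, 3, byrow = TRUE)

# Range of each row (margin 1) and of each column (margin 2).
row_ranges <- apply(A, 1, function(r) max(r) - min(r))    # 2 2 2
col_ranges <- apply(A, 2, function(cc) max(cc) - min(cc)) # 6 6 6

# rowSums gives the same totals as apply(A, 1, sum).
same_rows <- all(rowSums(A) == apply(A, 1, sum))
```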
R String Handling Functions
Project 5b
- A factor is the R way to represent values of a categorical variable.
- Use the cut function to convert numeric values to factor values.
- Use the R table function to summarize factor data.
- The table function is used in the same situations as the SAS proc freq.
- Here is an example that uses the table function. Eight eligible voters
are interviewed; each is asked these questions: "What is your gender?" and
"Which candidate, A or B, do you prefer?"
# Create data frame
election <- data.frame(
gender=c("F", "M", "M", "F", "M", "M", "F", "F"),
cand=c("A", "B", "B", "B", "B", "A", "A", "A"))
election
# Output:
gender cand
1 F A
2 M B
3 M B
4 F B
5 M B
6 M A
7 F A
8 F A
table(election)
# Output:
      cand
gender A B
     F 3 1
     M 1 3
table(election$gender)
# Output:
F M
4 4
table(election$cand)
# Output:
A B
4 4
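The cut function mentioned above can feed table directly. This is a minimal sketch with made-up ages; the break points and the labels "child" and "teen" are assumptions chosen for illustration.

```r
ages <- c(11, 9, 15, 17, 12, 14)        # made-up ages

# By default the intervals are closed on the right: (0,12] and (12,18].
group <- cut(ages, breaks = c(0, 12, 18),
             labels = c("child", "teen"))

counts <- table(group)                  # child: 3, teen: 3
```

Note that the age 12 falls in the "child" group because the default intervals include their right endpoint.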
Project 4b
Project 5
- Graphics Practice Problems:
- Add a least squares line to the scatterplot of the father's and son's heights in the
Pearson Dataset. Ans:
x <- pearson$fathers_height
y <- pearson$sons_height
plot(x, y)
model <- lm(y ~ x)
abline(model)
Create a plot of the builtin ChickWeight dataset where
diet=1 and Chick < 10 (the first 107 rows). Connect the
weights of each chick with line segments. Ans:
chicks <- ChickWeight[1:107, -4]
plot(NULL, NULL, xlim=c(0, 21), ylim=c(0, 300),
xlab="Time", ylab="Weight")
for(chick in 1:9) {
x <- chicks[chicks$Chick==as.character(chick), 2]
y <- chicks[chicks$Chick==as.character(chick), 1]
lines(x, y, pch=as.character(chick), type="b")
}
Using the autoSales.txt, create these plots for the amount of sales by day of week:
Also create stacked and side-by-side bar charts for sales by day of week and vehicle type.
Create a histogram of the waiting times in the faithful dataset
(eruptions of the Old Faithful geyser). Check the effect of using the
Sturges, Scott, and FD algorithms for setting the number of bins. Ans:
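x <- faithful$waiting
# Named algorithms for choosing the number of bins.
hist(x, breaks="Sturges")
hist(x, breaks="Scott")
hist(x, breaks="FD")
# Explicit break points.
hist(x, breaks=seq(40, 100, 10))
# A single number is a requested bin count.
hist(x, breaks=2)
hist(x, breaks=60)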
When the number of breaks is given, it is only a suggestion.
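One way to see this is to inspect the histogram object rather than the picture. In this sketch plot=FALSE suppresses drawing. R places the breakpoints at "pretty" values, so the bin count it uses may differ from the number requested.

```r
x <- faithful$waiting

# plot = FALSE returns the histogram object without drawing it.
h <- hist(x, breaks = 2, plot = FALSE)

requested <- 2
actual    <- length(h$breaks) - 1    # number of bins R actually used
bin_edges <- h$breaks                # the breakpoints R chose
```

Comparing requested with actual for several values of breaks shows how loosely the suggestion is followed.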
Brushing
Now click on points in the scatterplot for which you want row labels.
See the BodyBrain Example.
Press ESC when you are finished.
Reading Data from a Web Page
Computing Sample Quantiles with R
However, if we wish to compute quantiles from a finite sample, the situation
is not so simple. Let's use the (sorted) artificial sample x, where x is defined as
x <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29) # Line 1
These are the first ten prime numbers. This example and the examples
QuantileDefs1 (R) and
QuantileDefs2 (SAS) were inspired by Reference 1.
These examples show the results of the various R methods for computing
quantiles.
What is the 0.25 quantile of the data defined by x? The fraction 0.2 of the data is ≤ 3,
whereas 0.3 of the data is ≤ 5; there is no value q = Q(0.25) such that exactly 0.25 of the data is ≤ q.
Here are three immediate solutions:
- Choose Q(0.25) = 3.
- Choose Q(0.25) = 5.
- Interpolate. Since 0.25 is halfway between 0.2 and 0.3, Q(0.25) is chosen
halfway between x2 = 3 and x3 = 5: Q(0.25) = 4
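R's quantile function takes a type argument (1 to 9) that selects the definition. For this sample and p = 0.25, types 3, 1, and 4 happen to reproduce the three choices listed above. Treat the sketch as an illustration for this sample only, not as a general description of those types. It assumes x holds the first ten primes.

```r
x <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)   # assumed: first ten primes

q_lower  <- quantile(x, 0.25, type = 3, names = FALSE)  # picks x[2] = 3
q_upper  <- quantile(x, 0.25, type = 1, names = FALSE)  # picks x[3] = 5
q_interp <- quantile(x, 0.25, type = 4, names = FALSE)  # halfway: 4
```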
A quick and dirty way to compute the first quartile (Q1 = Q(0.25)) and the third quartile (Q3 = Q(0.75))
is to use Tukey's Hinges method: if the sample size n is even, Tukey's method says that Q1 is the
median of the bottom half of the data and Q3 is the median of the top half. For example, for the dataset
2 3 5 7 11 13 # Dataset 1
The bottom half of the data is 2, 3, 5, whose median is 3, so Q1 = 3. The top half of the data is 7, 11, 13, so
Q3 = median of top half = 11. If n is odd, as is the case with this data set:
2 3 5 7 11 13 17 # Dataset 2
Tukey's Hinges method says to count the middle element 7 in both halves of the data.
The bottom half is 2, 3, 5, 7, whose median is (3 + 5) / 2 = 4 = Q1; the
top half is 7, 11, 13, 17, whose median is (11 + 13) / 2 = 12 = Q3.
A related method for computing Q1 and Q3 is recommended by Moore and McCabe. If n is odd as in Dataset 2,
they omit the middle observation 7 to obtain the bottom half as 2, 3, 5 and the
top half as 11, 13, 17. Thus, Q1 = median of bottom half = 3; Q3 = median of top half = 13.
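Both rules can be checked in R. The builtin fivenum function returns Tukey's hinges as its second and fourth values. The helper mm_quartiles below is a hypothetical function written for this sketch (it is not part of R) that follows the Moore and McCabe rule as described above.

```r
d1 <- c(2, 3, 5, 7, 11, 13)         # Dataset 1 (n even)
d2 <- c(2, 3, 5, 7, 11, 13, 17)     # Dataset 2 (n odd)

# fivenum returns min, lower hinge, median, upper hinge, max.
hinges_d1 <- fivenum(d1)[c(2, 4)]   # 3 11
hinges_d2 <- fivenum(d2)[c(2, 4)]   # 4 12

# Hypothetical helper: when n is odd, the middle observation
# is left out of both halves.
mm_quartiles <- function(x) {
    x <- sort(x)
    n <- length(x)
    half <- n %/% 2
    c(Q1 = median(x[1:half]),
      Q3 = median(x[(n - half + 1):n]))
}

mm_d2 <- mm_quartiles(d2)           # Q1 = 3, Q3 = 13
```

For an even sample size the two rules agree, since there is no middle observation to include or omit.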
In fact, there are additional ways that are used to compute percentiles. See Reference 3.
SAS currently has 5 percentile definitions available; R has 9.
Here are the detailed
quantile calculations for each of the nine R methods for the dataset x
defined in Line 1, earlier in this section.