Introduction to R
History of S and R
- Predecessors of R are S (sometimes called Old S), New S, and S-Plus.
- S was created by Rick Becker, John Chambers, Douglas Dunn, Paul Tukey, and
Graham Wilkinson at Bell Labs in 1975. It was inspired by the computer languages Scheme
and APL, but the syntax of S is most similar to C.
- Early versions of S were written in Fortran.
- In 1980, the first version of S was distributed outside of Bell Labs.
- By 1988, many changes had been made to S. In that year, Becker, Chambers, and Wilks published the book
The New S Language, which described these changes.
- S-Plus was a commercial implementation of S, first produced in 1988 by a Seattle startup company,
Statistical Sciences, Inc., which was started by Douglas Martin of the University of Washington.
- R is a free, open-source implementation of S (distributed under the GNU General Public License, not placed in the public domain),
first developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand,
in the early 1990s. The name R was chosen partly for the first initials of the developers and partly
as a play on S.
- R is very good for interactive analysis and also as a programming language for customizing and extending statistical
procedures. R is not as good as SAS for producing well laid-out reports that present the results of statistical analyses.
- Many persons and groups have contributed R packages.
- The current number of R users is estimated to be about 100,000.
Installing R
- Download R from
or from one of the sites on
- Double click on the installation file (for example, R-2.15.2-win32.exe) to install R.
- The R software will be installed in a new folder under c:\Program Files\R.
- The actual R program is located in a version-specific subfolder, for example
c:\Program Files\R\R-2.15.2\bin\i386\RGui.exe for 32-bit and
c:\Program Files\R\R-2.15.2\bin\x64\RGui.exe for 64-bit.
Invoking R
- To invoke the R software in the computer labs, select Start >> All Programs >>
Statistics >> R >> R 2.14.0 (or the current version).
This will bring up a multiple document interface (MDI) window titled RGui with a child window
with the title R Console.
- To invoke R from your home computer, double click on the R icon, which is on the desktop.
- To run R commands, type them in the R Console Window.
- Additional windows may be displayed in the RGui Window, for example graphics
windows or the R Editor.
- When showing typed R commands, the > symbol at the beginning of a line is
the R prompt.
- R commands can also be run from the R Editor, which is invoked by File >> New Script.
- To execute lines within the R Editor, select the lines of the script to be executed, then right click
>> Run line or selection, or type Control-R.
Some Useful R Commands
Vectors and Objects
- Everything in R is an object. The simplest objects
are atomic elements, which cannot be decomposed into smaller items.
- The datatype of an atomic element is its mode. Here are
the four modes
for atomic elements:
Mode | Meaning | Examples | Restrictive |
logical | True or False | TRUE, FALSE, T, F | Most |
numeric | Floating Point Numbers | 3.1416, -34.0, 5.97e24 | 2nd Most |
complex | Complex Numbers | 3.43+6.92i, 1+0i, 0+1i | 2nd Least |
character | Character Strings | "elephant", "abc\n", "345" | Least |
- Atomic elements do not exist in isolation. They are the components of
vectors. A scalar element is actually a vector of length 1.
- A vector consists of an arbitrary number of atomic elements, all of the same mode.
- Here is a list of the atomic modes, from most restrictive to least restrictive:
logical numeric complex character
- Use the c function to combine atomic elements or vectors into a vector.
The assignment operator = assigns the vector produced by the expression
on the right-hand side to the variable
on the left-hand side. Some examples:
> values = c(33, 5, 429, 37)
> animals = c("dog", "cat", "mouse")
> flags = c(TRUE, FALSE, TRUE, TRUE, FALSE)
- R provides three different assignment operators. These three statements mean the same thing:
> x = 318
> x <- 318
> 318 -> x
- The left arrow representation of the assignment operator is the preferred version:
> x <- 318
To display the value of a variable or expression interactively, type the variable or expression at
the R command prompt. The value will appear in the R Console window:
> values
[1] 33 5 429 37
> animals
[1] "dog" "cat" "mouse"
> flags
[1] TRUE FALSE TRUE TRUE FALSE
Any R object has two intrinsic attributes: mode and length. Use the mode and
length functions to
obtain the values of these attributes:
> mode(values)
[1] "numeric"
> mode(animals)
[1] "character"
> mode(flags)
[1] "logical"
> length(values)
[1] 4
> length(animals)
[1] 3
> length(flags)
[1] 5
The functions with prefix is. can also be used to check the mode:
> is.numeric(345)
[1] TRUE
> is.character(345)
[1] FALSE
> is.logical(345)
[1] FALSE
> is.complex(345)
[1] FALSE
Although atomic elements of more than one mode can be used when creating a vector,
the elements will be coerced into the least restrictive mode, so that the resulting vector
contains elements that are all of the same mode.
For example:
> c(T, 34, "dog")
[1] "TRUE" "34" "dog"
Use a function with an as. prefix to coerce an
atomic element into a different mode:
as.character as.complex as.logical as.numeric
For example:
> as.numeric("1198")
[1] 1198
> as.complex(43)
[1] 43+0i
> as.logical("FALSE")
[1] FALSE
> as.logical(13)
[1] TRUE
Roughly speaking, the terms class (obtained with the class function) and type (obtained with the typeof function)
are synonyms for mode.
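The three functions agree for many objects, but not for all of them. A minimal sketch of where they differ (the variable names n, d, and m are illustrative only):

```r
# An integer vector: mode reports "numeric", typeof and class report "integer".
n <- 5L                      # the L suffix creates an integer
mode(n)                      # "numeric"
typeof(n)                    # "integer"
class(n)                     # "integer"

# A floating point vector: typeof reports the storage type "double".
d <- 5.0
typeof(d)                    # "double"
class(d)                     # "numeric"

# A matrix keeps the mode of its elements, but its class describes its shape.
m <- matrix(1:4, 2, 2)
mode(m)                      # "numeric"
class(m)                     # includes "matrix" (and also "array" in R 4.0 and later)
```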
Repetition and Sequences
- Use the rep function to repeat a value in a vector:
> r = rep(45, 5)
> r
[1] 45 45 45 45 45
- m:n creates a sequence from m to
n, inclusive:
> s = 1:6
> s
[1] 1 2 3 4 5 6
> r = 9:6
> r
[1] 9 8 7 6
- Use the seq function to create a more general sequence:
> q = seq(from=3, to=5.5, by=0.5)
> q
[1] 3.0 3.5 4.0 4.5 5.0 5.5
> p = seq(from=900, to=600, by=-100)
> p
[1] 900 800 700 600
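Both rep and seq accept further arguments. A short sketch of three commonly used ones (times, each, and length.out); the input values are arbitrary:

```r
# times repeats the whole vector
r1 <- rep(c(1, 2), times = 3)              # 1 2 1 2 1 2
# each repeats every element before moving on to the next one
r2 <- rep(c(1, 2), each = 3)               # 1 1 1 2 2 2
# length.out fixes the number of points instead of the step size
s1 <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```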
Operators
- A precedence table of R operators, listed from highest to lowest precedence:
Operators | Description | Precedence |
( ) { } | Function call and grouping | 1 (High) |
:: ::: | Variable in namespace | 2 |
$ | Component extraction | 3 |
[ ] [[ ]] | Indexing | 4 |
^ | Exponentiation (right to left) | 5 |
+ - | Unary plus and minus | 6 |
: | Sequence | 7 |
%any% %/% %% | Special operator, integer division, mod | 8 |
* / | Multiplication, division | 9 |
+ - | Addition and subtraction | 10 |
< > <= >= != == | Comparison | 11 |
! | Logical not | 12 |
& && | Logical and | 13 |
| || | Logical or | 14 |
~ | As in formulas | 15 |
-> | Assignment (rightward) | 16 |
<- | Assignment | 17 |
= | Assignment | 18 |
? | Help (unary and binary) | 19 (Low) |
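A few expressions whose values depend on precedence can make the table concrete. This is a small sketch with arbitrary numbers; parentheses always override the default grouping:

```r
p1 <- -2^2          # -4: ^ binds tighter than unary minus
p2 <- (-2)^2        # 4
p3 <- 2^3^2         # 512: ^ groups right to left, so this is 2^(3^2)
p4 <- -1:3          # -1 0 1 2 3: unary minus binds tighter than :
p5 <- 1:3 * 2       # 2 4 6: : binds tighter than *
p6 <- 1:(3 * 2)     # 1 2 3 4 5 6
p7 <- 7 %/% 2 * 2   # 6: %/% binds tighter than *
p8 <- !TRUE | TRUE  # TRUE: ! binds tighter than |
```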
Special Values
NULL The null object. Returned by expressions and functions whose return value is undefined.
See the MaxDouble Example.
Matrices
- A matrix is a two-dimensional array of atomic elements.
An n×m matrix is a matrix with n rows and m columns.
- Use the matrix command to format a vector into rows and columns.
This example creates a matrix with 3 rows and 4 columns. This requires
a vector of length 12.
> v = c(4, 2, 5, 9, 4, 7, -1, 6, 0, 3, 2, 7)
> m = matrix(v, 3, 4, byrow=TRUE)
> m
[,1] [,2] [,3] [,4]
[1,] 4 2 5 9
[2,] 4 7 -1 6
[3,] 0 3 2 7
- A table of frequently used matrix operators:
Operation | Operator or Function |
Matrix Addition | + |
Scalar Multiplication | * |
Matrix Multiplication | %*% |
Transpose | t |
Identity | diag(nrow=n) |
Inverse | solve |
- See the Matrix Example, which reads data from a file into a matrix.
- An array is an extension of the matrix data structure to possibly more than two dimensions.
- Look at the Array Example.
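The operators in the table above can be tried on a small matrix. A minimal sketch, assuming an arbitrary invertible 2×2 matrix A (the names A, I, and A.inv are illustrative):

```r
A <- matrix(c(2, 1, 1, 3), 2, 2, byrow = TRUE)
I <- diag(nrow = 2)          # 2 x 2 identity matrix
A.inv <- solve(A)            # inverse of A

A + I                        # elementwise matrix addition
3 * A                        # scalar multiplication
t(A)                         # transpose (A is symmetric, so t(A) equals A)
A %*% A.inv                  # matrix product; equals I up to rounding error
all.equal(A %*% A.inv, I)    # TRUE
```

Note that A * A.inv would multiply element by element; %*% is needed for the matrix product.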
Subsetting Vectors and Matrices
- The subset operator is [ ].
- Subsetting is also called slicing.
- The three ways to select a subset of a vector:
- Use a vector of positive indices
> a = c("apple", "orange", "pear", "grape")
> u = c(1, 4)
> a[u]
[1] "apple" "grape"
The values with indices 1 and 4 are selected from a.
- Use a vector of negative indices
> a = c("apple", "orange", "pear", "grape", "watermelon")
> v = c(-2, -4, -5)
> a[v]
[1] "apple" "pear"
The values with indices 2, 4, and 5 are omitted from a.
- Use a vector of logical values
> a = c("apple", "orange", "pear", "grape", "watermelon")
> w = c(F, T, F, T, T)
> a[w]
[1] "orange" "grape" "watermelon"
Only the values corresponding to TRUE, which have indices 2, 4, and 5, are
selected.
- The subset operator for a matrix takes two inputs:
m[5:10, -4]
Rows 5 through 10 are selected and column 4 is omitted from the matrix m.
- If a slice of width 1 is selected from a matrix, a vector will be produced, as in this example:
> M = matrix(1:9, 3, 3, byrow=T)
> M
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> M[,3]
[1] 3 6 9
- To preserve the matrix structure of this slice, include the drop=F argument:
> M[,3, drop=F]
[,1]
[1,] 3
[2,] 6
[3,] 9
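Logical index vectors are usually produced by a comparison rather than typed by hand. A short sketch with made-up data, which also shows the two-input matrix form on a matrix small enough to print:

```r
a <- c(12, 7, 30, 4, 18)
big <- a[a > 10]            # 12 30 18: the comparison yields a logical vector
idx <- which(a > 10)        # 1 3 5: the indices where the condition is TRUE

M <- matrix(1:12, 3, 4, byrow = TRUE)
sub1 <- M[2:3, -4]          # rows 2 and 3, every column except column 4
sub2 <- M[M[, 1] > 1, ]     # rows whose first-column entry exceeds 1
```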
Attributes
- R objects can have properties attached to them called attributes.
- Common attributes are:
Attribute | Description |
class | The class of the object |
comment | A comment on or description of the object |
dim | The dimensions of a matrix or array |
dimnames | Names associated with each dimension of an object |
names | The names associated with an object, often row or list item names |
row.names | The row names of a matrix or data frame |
tsp | Start time, end time, and frequency of a time series data object |
levels | Levels of a factor |
- To get the attributes associated with an object, use the attributes
function.
- See the Attributes Example
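A minimal sketch of inspecting and setting attributes; the objects m and x and the comment text are illustrative only:

```r
m <- matrix(1:6, 2, 3)
attributes(m)                     # a list with one component: dim = 2 3
dimnames(m) <- list(c("r1", "r2"), c("a", "b", "c"))
attr.names <- names(attributes(m))   # "dim" "dimnames"

x <- c(first = 10, second = 20)
attr(x, "names")                  # "first" "second"
comment(x) <- "illustrative vector"
attr(x, "comment")                # "illustrative vector"
```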
Lists and Data Frames
- To create an R data structure with named components, create a list like this:
> scores = list(name=c("Jason", "Ginger", "Mary"),
+ midterm=c(89, 93, 90))
> scores
$name
[1] "Jason" "Ginger" "Mary"
$midterm
[1] 89 93 90
> scores$name
[1] "Jason" "Ginger" "Mary"
> scores$midterm
[1] 89 93 90
- An alternative to a list is a data frame, which looks more
like a conventional dataset:
name midterm
1 Jason 89
2 Ginger 93
3 Mary 90
- To create the preceding data frame:
> scores = data.frame(name=c("Jason", "Ginger", "Mary"),
+ midterm=c(89, 93, 90))
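Once the data frame exists, its columns can be reached like list components and its rows can be filtered like matrix rows. A brief sketch using the same scores data (the names col1, col2, and high are illustrative):

```r
scores <- data.frame(name = c("Jason", "Ginger", "Mary"),
                     midterm = c(89, 93, 90))
nrow(scores)                          # 3
names(scores)                         # "name" "midterm"
col1 <- scores$midterm                # column by name: 89 93 90
col2 <- scores[["midterm"]]           # the same column, list-style
high <- scores[scores$midterm > 89, ] # the rows for Ginger and Mary
mean(scores$midterm)                  # 90.66667
```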
File Input
- To read values, either from the keyboard, or from an input file, use the scan function:
> # Enter a blank line to terminate input from keyboard:
> x = scan( )
1: 4 2 5 4 6
6:
Read 5 items.
> x
[1] 4 2 5 4 6
> setwd("c:/r-input")
> getwd( )
[1] "c:/r-input"
> # Default mode of input is numeric:
> x = scan("input.txt")
> x
[1] 45 65 45 4
> # Set input mode to character:
> y = scan("states.txt", what="character")
> y
[1] "Illinois" "Indiana" "Iowa" "Nebraska"
- To load data from a file in table format, use the read.table function. The current working
directory contains the file scores.txt with these contents:
name midterm
Jason 89
Ginger 93
Mary 90
These statements read the table data from a file and load them into a data
frame:
> scores = read.table("scores.txt", header=TRUE)
> scores
name midterm
1 Jason 89
2 Ginger 93
3 Mary 90
> scores$name
[1] "Jason" "Ginger" "Mary"
> scores$midterm
[1] 89 93 90
- If header=TRUE is missing, the variable names can still be read from the first row if (1) row numbers are included, and
(2) the number of names is one less than the number of fields
in the rest of the rows. If the input file looks like this
name midterm
1 Jason 89
2 Ginger 93
3 Mary 90
Read the file like this:
> scores = read.table("scores.txt")
- See the Kids Examples.
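The effect of header=TRUE can be checked without preparing a file by hand. This sketch writes the example table to a temporary file first; the temporary file and the names path and raw are assumptions made for illustration:

```r
# Write the example table to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("name midterm", "Jason 89", "Ginger 93", "Mary 90"), path)

scores <- read.table(path, header = TRUE)
str(scores)                # 3 observations of 2 variables
scores$midterm             # 89 93 90

# Without header = TRUE the first line is read as data here, because every
# line has the same number of fields.
raw <- read.table(path)
names(raw)                 # "V1" "V2"
nrow(raw)                  # 4
unlink(path)
```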
Running R Statements from a Script
The mean of the vector 3.54.
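A script is a plain text file of R statements, and the source function runs it. A minimal sketch, assuming a hypothetical two-line script written to a temporary file (the file and the variable names are illustrative, not taken from the course examples):

```r
# Write a two-line script to a temporary file (a stand-in for a real script file).
script <- tempfile(fileext = ".R")
writeLines(c("x <- c(2, 4, 6)",
             "x.mean <- mean(x)"), script)

# source() reads the file and evaluates its statements in order.
source(script)
x.mean                     # 4

# echo = TRUE also prints each statement as it is run.
source(script, echo = TRUE)
unlink(script)
```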
Redirecting R Output to Disk
- To start sending R output to a file instead of the screen:
> sink("output.txt")
- When finished sending output to the file, type
> sink( )
to resume sending output to the screen.
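A short sketch of a complete sink round trip, using a temporary file in place of output.txt so that nothing is left behind (the names out and captured are illustrative):

```r
out <- tempfile(fileext = ".txt")
sink(out)                          # start diverting output to the file
print(summary(c(1, 2, 3, 4)))
cat("done\n")
sink()                             # restore output to the screen
captured <- readLines(out)         # the lines that were written to the file
captured
unlink(out)
```

Only ordinary output is diverted this way; by default, messages and warnings still go to the screen.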
Some R Functions
- Help Function and Operator
help ?
For help with symbols, place them in backquotes: ?`%%`
Vector Construction
c length
Stat Functions
mean sd var cor
median quantile
Problem: Design a simulation to illustrate how the central limit theorem
works with uniform random numbers. Ans: Recall that the Central Limit Theorem states that
even if random samples are drawn from a non-normal distribution, the means of those samples
will have an approximately normal distribution if the sample size is large enough (rule of thumb: n > 30).
Here is the script:
m <- NULL
for (i in 1:100) {
  x <- runif(50)
  m <- c(m, mean(x))
}
# Normal plot of the means:
qqnorm(m)
# Normal plot of last random sample:
qqnorm(x)
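The same simulation can be written without an explicit loop. This is a sketch of one alternative using replicate; the seed value is an arbitrary choice made here for reproducibility and is not part of the script above:

```r
set.seed(1)                               # arbitrary seed
means <- replicate(100, mean(runif(50)))  # 100 sample means, each from n = 50 draws
length(means)                             # 100
mean(means)                               # close to 0.5, the mean of Uniform(0, 1)
sd(means)                                 # close to sqrt(1/12) / sqrt(50), about 0.041
```

Passing means to qqnorm gives the same kind of normal plot as qqnorm(m) in the script.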
Math Utility Functions
abs ceiling floor length
prod round sign sqrt sum
Trig Functions
acos asin atan atan2
cos sin tan
Problem: Use the sin function to evaluate sin(π).
Also evaluate
sin(10^3 π)
sin(10^6 π)
sin(10^9 π)
sin(10^12 π)
sin(10^15 π)
What is your conclusion? Ans:
> sin(pi)
[1] 1.224647e-16
> sin(10^15 * pi)
[1] -0.2362051
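The remaining values in the problem can be computed in one call. A sketch (the names k and value are illustrative):

```r
k <- c(0, 3, 6, 9, 12, 15)
value <- sin(10^k * pi)            # sin is vectorized over its argument
data.frame(k = k, value = value)   # one row per power of 10
```

Mathematically, every value is 0. The computed values move away from 0 as the exponent grows because pi is stored with only about 16 significant digits, so the absolute error in 10^k * pi grows in proportion to 10^k.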
Exponential and Logarithms:
exp expm1
log log10 log2 log1p
Character:
nchar paste substr toupper
tolower chartr strtrim strsplit
Matrix Operations:
+ - %*% t diag solve det
Apply a Function to margins of Matrix, Array, or to List Elements:
apply lapply sapply
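A minimal sketch of the three functions on small made-up inputs (the names M, parts, rows, cols, and avg are illustrative):

```r
M <- matrix(1:6, 2, 3)             # columns are (1,2), (3,4), (5,6)
rows <- apply(M, 1, sum)           # margin 1 = rows: 9 12
cols <- apply(M, 2, sum)           # margin 2 = columns: 3 7 11

parts <- list(a = 1:3, b = c(10, 20))
lapply(parts, mean)                # a list: $a is 2, $b is 15
avg <- sapply(parts, mean)         # a named numeric vector: a = 2, b = 15
```

lapply always returns a list; sapply simplifies the result to a vector or matrix when it can.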
Bind Together Matrices
cbind: bind together matrices by columns
rbind: bind together matrices by rows, for example:
> print(A)
[,1] [,2]
[1,] 1 1
[2,] 1 1
> print(B)
[,1] [,2]
[1,] 2 2
[2,] 2 2
> cbind(A, B)
[,1] [,2] [,3] [,4]
[1,] 1 1 2 2
[2,] 1 1 2 2
> rbind(A, B)
[,1] [,2]
[1,] 1 1
[2,] 1 1
[3,] 2 2
[4,] 2 2
R Utility Functions
ls (list all currently defined R objects)
rm (delete an R object)
setwd (set working directory)
getwd (get working directory)
Execution Time of Function:
system.time computes the execution time of an R function call. For example:
> system.time(rnorm(1000))
user system elapsed
0.00 0.00 0.01
> system.time(rnorm(100000))
user system elapsed
0.01 0.00 0.02
> system.time(rnorm(10000000))
user system elapsed
1.15 0.04 1.19
rnorm(n) generates a vector of n independent standard normal random variables.
Standard normal means mean=0 and standard deviation=1.
Distributions and Densities
pnorm punif pbinom ppois pexp
dnorm dunif dbinom dpois dexp
rnorm runif rbinom rpois rexp
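The prefix determines what each function computes: p gives the cumulative distribution function, d the density (or probability mass) function, and r random draws. A brief sketch with arbitrary arguments:

```r
dnorm(0)                          # standard normal density at 0: 0.3989423
pnorm(1.96)                       # P(Z <= 1.96): 0.9750021
dbinom(2, size = 4, prob = 0.5)   # P(X = 2) for Binomial(4, 0.5): 0.375
punif(0.3)                        # P(U <= 0.3) for Uniform(0, 1): 0.3
set.seed(42)                      # arbitrary seed, for reproducibility
draws <- rexp(3, rate = 2)        # three random draws from Exponential(rate = 2)
```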
User Defined Functions
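A function is defined by assigning a function expression to a name. For example, a square function with an explicit return statement:
square <- function(x) {
  y <- x^2
  return(y)
}
x is the formal parameter; y is the return value.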
The square function can also be written like this without an explicit return statement:
square <- function(x) {
  y <- x^2
  y
}
# or
square <- function(x) {
  x^2
}
In the first version, y is the return value because it is the last expression evaluated in the function.
Here is an invocation of square:
> square(13)
[1] 169
13 is the actual parameter.
User Defined Replacement Functions
The function second<- has replaced the vector v (1 2
3 4) with the new vector (1 5 3 4). The function second<- is defined as
`second<-` <- function(v, value) {
  v[2] <- value
  return(v)
}
Note that back quotes are needed to define the second<- function.
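A replacement function is invoked by placing the function call on the left-hand side of an assignment. A minimal sketch that repeats the definition so that it runs on its own:

```r
`second<-` <- function(v, value) {
  v[2] <- value            # overwrite the second element
  v                        # return the modified vector
}

v <- 1:4
second(v) <- 5             # R evaluates this as v <- `second<-`(v, value = 5)
v                          # 1 5 3 4
```

The last parameter of a replacement function must be named value, because R passes the right-hand side of the assignment under that name.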
User Defined Operators
Again, note that quotes (double quotes or back quotes) are needed to define %.%.
The dot product operator can be tested like this:
> a <- 1:4
> b <- 1:4
> a %.% b
[1] 30
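A minimal sketch of one way such an operator could be defined, assuming that %.% is meant to return the sum of the elementwise products of two vectors of equal length:

```r
# Any name of the form %name% can be defined as a binary operator.
"%.%" <- function(a, b) {
  sum(a * b)               # sum of elementwise products
}

d1 <- 1:4 %.% 1:4                  # 1 + 4 + 9 + 16 = 30
d2 <- c(1, 2, 3) %.% c(4, 5, 6)    # 4 + 10 + 18 = 32
```

Because : has higher precedence than %any% operators, 1:4 %.% 1:4 groups as (1:4) %.% (1:4).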
Factors
To record all possible levels of the factor, not just the
ones that happen to be in days, use the levels parameter:
> days.of.week <- c('Sun', 'Mon', 'Tue', 'Wed',
+ 'Thu', 'Fri', 'Sat')
> days <- factor(c('Wed', 'Thu', 'Sat', 'Mon'),
+ levels=days.of.week)
> days
[1] Wed Thu Sat Mon
Levels: Sun Mon Tue Wed Thu Fri Sat
Factors are actually represented internally by numbers:
> as.numeric(days)
[1] 4 5 7 2
A factor can be used in R statistical analysis functions or plotted
like this:
> time <- c(65, 68, 66, 31)
> plot(days, time)
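Because the factor records all seven levels, summaries include the days that do not occur in the data. A short sketch using the same days factor (the names counts and labels are illustrative):

```r
days.of.week <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
days <- factor(c('Wed', 'Thu', 'Sat', 'Mon'), levels = days.of.week)

nlevels(days)                      # 7
counts <- table(days)              # counts per level, with zeros for unused levels
counts[["Sun"]]                    # 0
counts[["Wed"]]                    # 1
labels <- levels(days)[as.integer(days)]   # labels rebuilt from the integer codes
labels                             # "Wed" "Thu" "Sat" "Mon"
```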
Google R Style Guide
- Indentation: Indent by two spaces, do not use tabs.
- Spacing: Single space between tokens. Do not space before a comma; space after a comma. Do not space between function
name and parenthesis for arguments.
- Blocks: Do not put the opening brace on a line by itself. Do put the closing brace on a line by itself. Indent the contents
of a block by two spaces.
- Semicolons: Omit semicolons at the ends of lines when they are optional.
- Naming: Name objects with lowercase words, separated by
periods, for example num.customers. Opinion is split on how to name functions.
Option 1: name functions with lowercase letters, except use uppercase
letters for the first letter of a new word, for example closeAllConnections
(lower camel casing).
Option 2: name functions with lowercase letters, separating words with underscores, for example
close_all_connections (underscore notation).
- Assignment: Use <- for the assignment operator, not = or
->.
- Reference:
google-styleguide.googlecode.com/svn/trunk/google-r-style.html