R Intro

Introduction to R

History of S and R

Predecessors of R are S (sometimes called Old S), New S, and S-Plus.
S was created by Rick Becker, John Chambers, Douglas Dunn, Paul Tukey, and Graham Wilkinson at Bell Labs in 1975. It was inspired by the computer languages Scheme and APL, but the syntax of S is most similar to C.
Early versions of S were written in Fortran.
In 1980, the first version of S was distributed outside of Bell Labs.
By 1988, many changes had been made to S. In that year, Becker, Chambers, and Wilks published the book The New S Language, which described these changes.
S-Plus was a commercial implementation of S that was first produced by a Seattle startup company in 1988. This company, Statistical Sciences, Inc., was started by Douglas Martin of the University of Washington.
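R is a free, open-source implementation of S, distributed under the GNU General Public License. It was first developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in the early 1990s, and the R Core Team that now maintains it was formed in 1997. The name R was chosen partly for the first names of the developers and partly as a play on S.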
R is very good for interactive analysis and also as a programming language for customizing and extending statistical procedures. R is not as good as SAS for producing well laid out reports that present the results of a statistical analysis.
Many persons and groups have contributed R packages.
The current number of R users is estimated to be about 100,000.

Installing R

Download R from

cran.us.r-project.org/bin/windows/base/

or from one of the sites on

http://cran.r-project.org/mirrors.html
Double click on the installation file, for example R-2.15.2-win32.exe, to install R.
The R software will be installed in a new folder under c:\Program Files\R.
The actual R program is located in the subfolder for the installed version. For version 2.14.2, for example, it is at c:\Program Files\R\R-2.14.2\bin\i386\RGui.exe for 32-bit and at c:\Program Files\R\R-2.14.2\bin\x64\RGui.exe for 64-bit.

Invoking R

To invoke the R software in the computer labs, select Start >> All Programs >> Statistics >> R >> R 2.14.0 (or the current version). This will bring up a multiple document interface (MDI) window titled RGui with a child window with the title R Console.
To invoke R from your home computer, double click on the R icon, which is on the desktop.
To run R commands, type them in the R Console Window.
Additional windows may be displayed in the RGui Window, for example graphics windows or the R Editor.
When showing typed R commands, the > symbol at the beginning of a line is the R prompt.
R commands can also be run from the R Editor, which is invoked by File >> New Script.
To execute lines within the R Editor, select the lines of the script to be executed, then right click >> Run line or selection, or type Control-R.

Some Useful R Commands

To obtain help on a specific R command, enter help and the command for which you need help in parentheses. For example:
```
> help(hist)
```
A browser window will open with the requested information.
You may want to read data from a file or send output, such as graphs, to an output file. Although a full path name can be used to specify the input or output file, it is useful to specify the current working directory so that a relative path can be used. To set and get the current working directory:
```
> setwd("c:/rdatasets")
> getwd( )
[1] "c:/rdatasets"
```
To exit or quit from R:
```
> quit( )
```
or
```
> q( )
```

Vectors and Objects

Everything in R is an object. The simplest objects are atomic elements, which cannot be decomposed into smaller items.

The datatype of an atomic element is its mode. Here are the four modes for atomic elements:

Mode	Meaning	Examples	Restrictive
logical	True or False	TRUE, FALSE, T, F	Most
numeric	Floating Point Numbers	3.1416, -34.0, 5.97e24	2nd Most
complex	Complex Numbers	3.43+6.92i, 1+0i, 0+1i	2nd Least
character	Character Strings	"elephant", "abc\n", "345"	Least

Atomic elements do not exist in isolation. They are the components of vectors. A scalar element is actually a vector of length 1.
A vector consists of an arbitrary number of atomic elements, all of the same mode.
From most restrictive to least restrictive, the atomic modes are logical, numeric, complex, and character.
Use the c function to combine atomic elements or vectors into a vector. The assignment operator = assigns the vector determined by the expression on the right hand side to the variable on the left hand side. Some examples:
```
> values = c(33, 5, 429, 37)
> animals = c("dog", "cat", "mouse")
> flags = c(TRUE, FALSE, TRUE, TRUE, FALSE)
```
R provides three different assignment operators. These three expressions mean the same thing:
```
x = 318
x <- 318
318 -> x
```
The left arrow representation of the assignment operator is the preferred version.
To display the value of a variable or expression interactively, type the variable or expression at the R command prompt. The value will appear in the R Console window:
```
> values
[1]  33   5 429  37
> animals
[1] "dog"  "cat"  "mouse"
> flags
[1]  TRUE FALSE  TRUE  TRUE FALSE
```

Any R object has two attributes, mode and length. Use the mode and length functions to obtain the values of these attributes:

> mode(values)
[1] "numeric"
> mode(animals)
[1] "character"
> mode(flags)
[1] "logical"
> length(values)
[1] 4
> length(animals)
[1] 3
> length(flags)
[1] 5

The functions with prefix is. can also be used to check the mode:

> is.numeric(345)
[1] TRUE
> is.character(345)
[1] FALSE
> is.logical(345)
[1] FALSE
> is.complex(345)
[1] FALSE

Although atomic elements of more than one mode can be used when creating a vector, the elements will be coerced into the least restrictive mode, so that the resulting vector contains elements that are all of the same mode. For example:
```
> c(T, 34, "dog")
[1] "TRUE"  "34"  "dog"
```

Use a function with an as. prefix to coerce an atomic element into a different mode:

as.character  as.complex  as.logical  as.numeric

For example:

> as.numeric("1198")
[1] 1198
> as.complex(43)
[1] 43+0i
> as.logical("FALSE")
[1] FALSE
> as.logical(13)
[1] TRUE

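Roughly speaking, the terms class (obtained with the class function) and type (obtained with the typeof function) are close relatives of mode, but they are not exact synonyms. For the value 345, for example, mode and class both return "numeric", while typeof returns "double".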
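The relationship between mode, class, and type can be checked directly. The following is a minimal sketch; the variable names x, i, and s are chosen here only for illustration.

```r
x <- 345
mode(x)     # "numeric"
class(x)    # "numeric"
typeof(x)   # "double"

i <- 1:3
mode(i)     # "numeric"
class(i)    # "integer"
typeof(i)   # "integer"

s <- "dog"
mode(s)     # "character"
typeof(s)   # "character"
```

For simple vectors the three functions mostly agree; the differences show up in how numbers are stored (double versus integer).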

Repetition and Sequences

Use the rep function to repeat a value in a vector:
```
> r = rep(45, 5)
> r
[1]  45  45  45  45  45
```

m:n creates a sequence from m to n, inclusive:

> s = 1:6
> s
[1]  1  2  3  4  5  6
> r = 9:6
> r
[1]  9  8  7  6

Use the seq function to create a more general sequence:

> q = seq(from=3, to=5.5, by=0.5)
> q
[1]  3.0  3.5  4.0  4.5  5.0  5.5
> p = seq(from=900, to=600, by=-100)
> p
[1]  900  800  700  600

Operators

The following table lists the R operators in order of precedence, from highest to lowest.

Operators	Description	Precedence
( ) { }	Function call and grouping	1 (High)
:: :::	Variable in namespace	2
$	Component extraction	3
[ ] [[ ]]	Indexing	4
^	Exponentiation (right to left)	5
+ -	Unary plus and minus	6
:	Sequence	7
%any% %/% %%	Special operator, integer division, mod	8
* /	Multiplication, division	9
+ -	Addition and subtraction	10
< > <= >= != ==	Comparison	11
!	Logical not	12
& &&	Logical and	13
| ||	Logical or	14
~	As in formulas	15
->	Assignment	16
<-	Assignment	17
=	Assignment	18
?	Help (unary and binary)	19
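
A few expressions show why the order in the table matters. This is a small sketch; the variable n is introduced here only for illustration.

```r
-2^2          # -4: ^ binds tighter than unary minus, so this is -(2^2)
(-2)^2        # 4

n <- 5
1:n - 1       # 0 1 2 3 4: the sequence 1:n is built first, then 1 is subtracted
1:(n - 1)     # 1 2 3 4
2 * 1:3       # 2 4 6: the sequence is built before the multiplication
```

When in doubt, add parentheses; they cost nothing and make the intent explicit.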

Special Values

Some special values that are used in R:

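Inf: infinity, for example the result of 1 / 0. Its negative, -Inf, is also available.

NaN: "not a number", the result of undefined operations such as

0 / 0   Inf / Inf   Inf - Inf   Inf * 0

NULL: the null object, which represents an empty or undefined value.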

See the MaxDouble Example.
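
The special values can be produced and tested at the console. The sketch below uses only base R; .Machine$double.xmax is the largest finite double on the current machine.

```r
1 / 0                 # Inf
-1 / 0                # -Inf
0 / 0                 # NaN
Inf - Inf             # NaN
is.nan(0 / 0)         # TRUE
is.finite(1 / 0)      # FALSE

# Exceeding the largest representable double overflows to Inf.
.Machine$double.xmax * 10   # Inf

# NULL is an empty object of length zero.
length(NULL)          # 0
is.null(NULL)         # TRUE
```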

Matrices

A matrix is a two-dimensional array of atomic elements. An n×m matrix is a matrix with n rows and m columns.

Use the matrix function to arrange a vector into rows and columns. This example creates a matrix with 3 rows and 4 columns, which requires a vector of length 12.

> v = c(4, 2, 5, 9, 4, 7, -1, 6, 0, 3, 2, 7)
> m = matrix(v, 3, 4, byrow=TRUE)
> m
     [,1] [,2] [,3] [,4]
[1,]    4    2    5    9
[2,]    4    7   -1    6
[3,]    0    3    2    7

A table of frequently used matrix operators:

Operation	Operator or Function
Matrix Addition	+
Scalar Multiplication	*
Matrix Multiplication	%*%
Transpose	t
Identity	diag(nrow=n)
Inverse	solve

See the Matrix Example, which reads data from a file into a matrix.
An array is an extension of the matrix data structure to possibly more than two dimensions.
Look at the Array Example.
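
The matrix operators listed above can be tried on a small example. This is a sketch only; the matrices A and B and the name A.inv are made up for illustration.

```r
A <- matrix(c(2, 1, 1, 3), 2, 2)   # filled column by column
B <- diag(nrow = 2)                # 2 x 2 identity matrix

A + B               # element-wise matrix addition
3 * A               # scalar multiplication
A %*% B             # matrix multiplication; equals A because B is the identity
t(A)                # transpose (A is symmetric, so t(A) equals A)
A.inv <- solve(A)   # inverse of A
A.inv %*% A         # the identity matrix, up to rounding error
```

Note that * between two matrices multiplies element by element; %*% is needed for the matrix product.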

Subsetting Vectors and Matrices

The subset operator is [ ].
Subsetting is also called slicing.
The three ways to select a subset of a vector:
1. Use a vector of positive indices
```
> a = c("apple", "orange", "pear", "grape")
> u = c(1, 4)
> a[u]
[1]  "apple" "grape"
```
  The values with indices 1 and 4 are selected from a.
2. Use a vector of negative indices
```
> a = c("apple", "orange", "pear", "grape", "watermelon")
> v = c(-2, -4, -5)
> a[v]
[1]  "apple" "pear"
```
  The values with indices 2, 4, and 5 are omitted from a.
3. Use a vector of logical values
```
> a = c("apple", "orange", "pear", "grape", "watermelon")
> w = c(F, T, F, T, T)
> a[w]
[1]  "orange"  "grape"  "watermelon"
```
  Only the values corresponding to TRUE, which have indices 2, 4, and 5, are selected.
The subset operator for a matrix takes two inputs:
```
m[5:10, -4]
```
Rows 5 through 10 are selected and column 4 is omitted from the matrix m.
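
To see the effect of this expression, it helps to apply it to a concrete matrix. The sketch below assumes a hypothetical 10 x 5 matrix; the name s for the result is also made up.

```r
# A hypothetical 10 x 5 matrix holding the values 1 to 50, filled by row.
m <- matrix(1:50, 10, 5, byrow = TRUE)

s <- m[5:10, -4]   # rows 5 through 10, every column except the 4th
dim(s)             # 6 4
s[1, ]             # 21 22 23 25: row 5 of m without its 4th entry
```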

If a slice of width 1 is selected from a matrix, a vector will be produced, as in this example:

> M = matrix(1:9, 3, 3, byrow=T)
> M
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> M[,3]
[1] 3  6  9

To preserve the matrix structure of this slice, include the drop=F argument:
```
> M[,3, drop=F]
     [,1]
[1,]    3
[2,]    6
[3,]    9
```

Attributes

R objects can have properties attached to them called attributes.

Common attributes are:

Attribute	Description
class	The class of the object
comment	A comment on or description of the object
dim	The dimensions of a matrix or array
dimnames	Names associated with each dimension of an object
names	The names associated with an object, often row or list item names
row.names	The row names of a matrix or data frame
tsp	Start time, end time, and frequency of a time series object
levels	Levels of a factor

To get the attributes associated with an object, use the attributes function.
See the Attributes Example
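
As a minimal sketch of how attributes are inspected and set, the following uses a small made-up matrix m and a named vector v.

```r
m <- matrix(1:6, 2, 3)
attributes(m)            # a list with one component, $dim, equal to 2 3

# Attributes can also be set; here dimnames labels the rows and columns.
dimnames(m) <- list(c("r1", "r2"), c("a", "b", "c"))
names(attributes(m))     # now includes "dim" and "dimnames"

v <- c(first = 10, second = 20)
attributes(v)$names      # "first" "second"
attr(m, "dim")           # 2 3
```

The attr function gets or sets one attribute by name, while attributes returns all of them as a list.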

Lists and Data Frames

To create an R data structure with named components, create a list like this:

> scores = list(name=c("Jason", "Ginger", "Mary"), 
+ midterm=c(89, 93, 90))
> scores
$name
[1]  "Jason"  "Ginger"  "Mary"

$midterm
[1] 89 93 90

> scores$name
[1]  "Jason"  "Ginger"  "Mary"
> scores$midterm
[1] 89 93 90

An alternative to a list is a data frame, which looks more like a conventional dataset:
```
  name   midterm
1 Jason  89
2 Ginger 93
3 Mary   90 
```

To create the preceding data frame:

> scores = data.frame(name=c("Jason", "Ginger", "Mary"), 
+ midterm=c(89, 93, 90))

File Input

To read values, either from the keyboard, or from an input file, use the scan function:

> # Enter a blank line to terminate input from keyboard:
> x = scan( ) 
1: 4 2 5 4 6
6:
Read 5 items
> x
[1] 4 2 5 4 6
> setwd("c:/r-input")
> getwd( )
[1] "c:/r-input"
> # Default mode of input is numeric:
> x = scan("input.txt")  
> x
[1] 45 65 45  4
> # Set input mode to character:
> y = scan("states.txt", what="character") 
> y
[1] "Illinois" "Indiana" "Iowa" "Nebraska"

To load data from a file in table format, use the read.table function. The current working directory contains the file scores.txt with these contents:
```
 name   midterm
Jason  89
Ginger 93
Mary   90
```
These statements read the table data from a file and load them into a data frame:
```
> scores = read.table("scores.txt", header=TRUE)
> scores
    name midterm
1  Jason      89
2 Ginger      93
3   Mary      90
> scores$name
[1] "Jason"  "Ginger" "Mary"
> scores$midterm
[1] 89 93 90
```
If header=TRUE is missing, the variable names can still be read from the first row if (1) row numbers are included, and (2) the number of names is one less than the number of fields in the rest of the rows. Suppose the input file looks like this:
```
  name   midterm
1 Jason  89
2 Ginger 93
3 Mary   90
```
Read the file like this:
```
> scores = read.table("scores.txt")
```
See the Kids Examples.
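
The read.table call with header=TRUE can be tried without preparing a file by hand. The sketch below writes the same three rows to a temporary file first; the use of tempfile and the variable name path are assumptions made for this illustration, not part of the original example.

```r
# Write a small table to a temporary file so the example does not depend on
# any existing file; tempfile() is used here in place of scores.txt.
path <- tempfile(fileext = ".txt")
writeLines(c("name midterm",
             "Jason 89",
             "Ginger 93",
             "Mary 90"), path)

scores <- read.table(path, header = TRUE)
nrow(scores)           # 3
names(scores)          # "name" "midterm"
mean(scores$midterm)   # 90.66667

unlink(path)           # remove the temporary file
```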

Running R Statements from a Script

To run R statements from a script file:
```
> source("script.r")
```
Caution: entering an expression in a script file will not automatically display the value of the expression. Use the print function to display the value. For example:
```
> print(mean(x))
```
An alternative to print is the cat command. Using print includes the vector index at the beginning of each output line and quote marks for character data. Using cat suppresses vector indices and quote marks for character data. Also, new line characters (\n) must be explicitly included with cat.

For example:

> cat("The mean of the vector x is ", mean(x), ".\n", sep="")
The mean of the vector x is 3.54.

Redirecting R Output to Disk

To start sending R output to a file instead of the screen:
```
> sink("output.txt")
```
When finished sending output to the file, type
```
> sink( )
```
to resume sending output to the screen.
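
A complete round trip with sink might look like the sketch below. It writes to a temporary file rather than output.txt so that nothing is left behind; the names out and captured are made up for illustration.

```r
out <- tempfile(fileext = ".txt")   # hypothetical output file

sink(out)            # start diverting output to the file
print(1:3)
cat("done\n")
sink()               # resume sending output to the screen

captured <- readLines(out)
captured             # "[1] 1 2 3" "done"
unlink(out)          # remove the temporary file
```

Forgetting the final sink( ) call is a common mistake: output keeps going to the file and the console appears to print nothing.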

Some R Functions

Help Function and Operator
For help with symbols, place them in backquotes: ?`%%`
Vector Construction
Stat Functions
Problem: Design a simulation to illustrate how the central limit theorem works with uniform random numbers. Ans: Recall that the Central Limit Theorem states that even if random samples are drawn from a non-normal distribution, the means of the random samples will have an approximately normal distribution if the sample size is large enough (rule of thumb: n > 30). Here is the script:
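One possible version of such a script is sketched below. It is not necessarily the script originally given; the seed, the number of samples, the sample size, and all variable names are assumptions chosen for illustration.

```r
set.seed(1)            # arbitrary seed, only to make the run repeatable
n.samples <- 10000     # number of samples to draw (an assumed value)
n <- 40                # size of each sample

# Each row is one sample of n uniform(0, 1) random numbers.
samples <- matrix(runif(n.samples * n), nrow = n.samples, ncol = n)
sample.means <- rowMeans(samples)

# Theory: the means are centered near 1/2 with standard deviation
# sqrt(1/12) / sqrt(n), about 0.0456 for n = 40.
mean(sample.means)
sd(sample.means)

# In an interactive session, a histogram of the means should look
# roughly bell shaped:
# hist(sample.means)
```

Although each individual value is uniformly distributed, the histogram of the sample means should be close to a normal curve centered at 0.5.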

Math Utility Functions

abs  ceiling  floor  length

prod  round  sign  sqrt  sum

Trig Functions
Problem: Use the sin function to evaluate sin π. Also evaluate
What is your conclusion? Ans:
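A sketch of the first part of this problem follows; the second expression is not reproduced here. The likely conclusion is that floating point arithmetic gives a result that is very close to, but not exactly, zero.

```r
sin(pi)                        # a tiny nonzero number, about 1.22e-16
sin(pi) == 0                   # FALSE
isTRUE(all.equal(sin(pi), 0))  # TRUE: equal within numerical tolerance
```

For this reason, floating point results are usually compared with all.equal rather than with ==.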
Exponential and Logarithms:

Character:

nchar  paste  substr  toupper  

tolower  chartr  strtrim  strsplit

Matrix Operations:
Apply a Function to margins of Matrix, Array, or to List Elements:
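The functions usually meant under this heading are apply, lapply, and sapply. The following sketch uses a made-up matrix M and a made-up list named scores.

```r
M <- matrix(1:6, 2, 3)        # 2 rows, 3 columns, filled by column

apply(M, 1, sum)              # row sums: 9 12
apply(M, 2, sum)              # column sums: 3 7 11

scores <- list(midterm = c(89, 93, 90), final = c(80, 95, 85))
lapply(scores, mean)          # a list holding the two means
sapply(scores, mean)          # the same means as a named vector
```

In apply, the second argument selects the margin: 1 for rows and 2 for columns.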

Bind Together Matrices

cbind

rbind

> A = matrix(1, 2, 2)
> B = matrix(2, 2, 2)
> print(A)
     [,1] [,2]
[1,]    1    1
[2,]    1    1
> print(B)
     [,1] [,2]
[1,]    2    2
[2,]    2    2
> cbind(A, B)
     [,1] [,2] [,3] [,4]
[1,]    1    1    2    2
[2,]    1    1    2    2
> rbind(A, B)
     [,1] [,2]
[1,]    1    1
[2,]    1    1
[3,]    2    2
[4,]    2    2

R Utility Functions
Execution Time of Function:

Distributions and Densities

pnorm  punif  pbinom  ppois  pexp

dnorm  dunif  dbinom  dpois  dexp

rnorm  runif  rbinom  rpois  rexp
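
The prefixes follow a pattern: p gives the cumulative probability, d the density (or probability mass), and r random draws. A few sample calls are sketched below; the argument values and the seed are arbitrary choices for illustration.

```r
pnorm(1.96)                      # P(Z <= 1.96), about 0.975
dnorm(0)                         # standard normal density at 0, about 0.3989
punif(0.25)                      # P(U <= 0.25) for uniform(0, 1): 0.25
dbinom(2, size = 4, prob = 0.5)  # P(X = 2) for binomial(4, 0.5): 0.375
ppois(0, lambda = 1)             # P(X <= 0) = exp(-1), about 0.3679

set.seed(42)                     # arbitrary seed so the draws are repeatable
r <- rexp(5, rate = 2)           # five random draws from exponential(rate 2)
length(r)                        # 5
```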

User Defined Functions

Create the user defined function square like this:

square <- function(x) {
   y <- x^2
   return(y)
}

x is the formal parameter; y is the return value
The square function can also be written like this without an explicit return statement:
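A minimal sketch of that shorter form, reconstructed from the definition above:

```r
square <- function(x) {
  y <- x^2
  y
}
```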
y is the return value because it is the last line executed in the function.
Here is an invocation of square:
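A sketch of such a call, with the definition repeated so that the snippet runs on its own:

```r
square <- function(x) {
  y <- x^2
  return(y)
}

square(13)   # 169
```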
13 is the actual parameter.

User Defined Replacement Functions

A replacement function looks as if it modifies the argument of a function. For example:
The function second<- has replaced the vector v (1 2 3 4) with the new vector (1 5 3 4). The function second<- is defined as
Note that back quotes are needed to define the second<- function.
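A sketch of how such a replacement function might be defined and used follows. The body of the function is an assumption that matches the behavior described above; R does require the last argument of a replacement function to be named value.

```r
# The function name contains <-, so it must be quoted when it is defined.
`second<-` <- function(x, value) {
  x[2] <- value   # replace the second element
  x               # return the modified vector
}

v <- c(1, 2, 3, 4)
second(v) <- 5    # R evaluates this as v <- `second<-`(v, value = 5)
v                 # 1 5 3 4
```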

User Defined Operators

A user defined operator is delimited with % symbols.
Here is the definition of the vector dot product:
Again, note that quotes are needed to define %.%; either back quotes or double quotes will work.
The dot product operator can be tested like this:
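A sketch of one way to define and test the operator follows. The body shown is an assumption based on the usual definition of the dot product, and the test vectors u and w are made up.

```r
"%.%" <- function(a, b) {
  sum(a * b)      # sum of the element-wise products
}

u <- c(1, 2, 3)
w <- c(4, 5, 6)
u %.% w           # 32
```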

Factors

A factor is the R representation for categorical data.

For example:

> days <- factor(c('Wed', 'Thu', 'Sat', 'Mon'))
> days
[1] Wed Thu Sat Mon
Levels: Mon Sat Thu Wed

To record all possible levels of the factor, not just the ones that happen to be in days, use the levels parameter:

> days.of.week <- c('Sun', 'Mon', 'Tue', 'Wed',
+  'Thu', 'Fri', 'Sat')
> days <- factor(c('Wed', 'Thu', 'Sat', 'Mon'), 
+  levels=days.of.week)
> days
[1] Wed Thu Sat Mon
Levels: Sun Mon Tue Wed Thu Fri Sat

Factors are actually represented internally by numbers:
A factor can be used in R statistical analysis functions or plotted like this:
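Both points can be sketched with the days factor defined above. The definitions are repeated so that the snippet runs on its own, and the plot call is left as a comment so that it also runs without a graphics device; the name counts is made up.

```r
days.of.week <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
days <- factor(c('Wed', 'Thu', 'Sat', 'Mon'), levels = days.of.week)

as.integer(days)                 # 4 5 7 2: each value's position in the levels
levels(days)[as.integer(days)]   # "Wed" "Thu" "Sat" "Mon"

counts <- table(days)   # frequency of each level, including unused ones
counts
# plot(days) draws these counts as a bar chart
```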

Google R Style Guide

Indentation: Indent by two spaces, do not use tabs.

Spacing: Single space between tokens. Do not space before a comma; space after a comma. Do not space between function name and parenthesis for arguments.

Blocks: Do not put the opening brace on a line by itself. Do put the closing brace on a line by itself. Indent the contents of a block by two spaces.

Semicolons: Omit semicolons at the ends of lines when they are optional.

Naming: Name objects with lowercase words, separated by periods, for example num.customers. Opinion is split on how to name functions. Option 1: name functions with lowercase letters, except use uppercase letters for the first letter of a new word, for example closeAllConnections (lower camel casing). Option 2: name functions with lowercase letters, separating words with underscores, for example close_all_connections (underscore notation).

Assignment: Use <- for the assignment operator, not = or ->.

Reference: google-styleguide.googlecode.com/svn/trunk/google-r-style.html