This tutorial continues on from Introduction to R - Part 1.
Files needed:
- SPSS_data1.sav
- TMS_data.txt (right click and choose "Save link as...")
Make sure your working directory in R is set to the folder containing these files.
Download the script for this tutorial:
It is possible to extend the functionality of R through the use of packages.
There are 2 ways to install a new package: you can either use the menu (Tools > Install Packages) or you can use the install.packages()
function. Luckily, the package we need is already installed, so we can skip this step.
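If you are ever unsure whether a package is installed, one common pattern is to check first and install only when needed. The snippet below is a minimal sketch of that idea using the foreign package; the variable names pkg and needs_install are just illustrative.

```r
# Install 'foreign' only if it is not already available on this machine
pkg <- "foreign"

# requireNamespace() checks for a package without attaching it
needs_install <- !requireNamespace(pkg, quietly = TRUE)

if (needs_install) {
  install.packages(pkg)
}
```

If the package is already present, nothing is downloaded and the script simply carries on.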
Once you’ve installed a package you need to load it into the R workspace. You can do this by using the library()
or require()
functions.
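The two functions behave differently when a package is missing: library() stops with an error, while require() returns FALSE and carries on. The sketch below illustrates this with a deliberately made-up package name (notARealPackage123 is assumed not to exist on your system).

```r
# A package name assumed not to exist, used only for illustration
fake_pkg <- "notARealPackage123"

# require() returns FALSE (normally with a warning) when the package is missing
loaded <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)
print(loaded)

# library() raises an error instead, which we catch here so the script continues
result <- tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(result)
```

For interactive scripts like this tutorial, library() is usually the safer choice because a missing package is reported immediately.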
To import the SPSS data file we need a package called foreign. You can load it by running either of the following commands:
# Either
library(foreign)
# Or
require(foreign)
This will make all of the functions from the package available to you. In this case we’re interested in the read.spss()
function, which loads the SPSS data into R.
dat2 <- read.spss("SPSS_data1.sav") # The file name must match the downloaded file exactly
str(dat2)
You’ll see from the output of str() that the SPSS data we just imported is stored as a list. This is fine for some data, but we want this variable to be a data frame.
You can convert it using the data.frame()
function:
dat2<-data.frame(dat2)
str(dat2) # Check to make sure the structure type has changed
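As an alternative, read.spss() has a to.data.frame argument that does the conversion at import time. The sketch below assumes the file from the list at the top of this tutorial is in your working directory; the name dat2_df is only illustrative, and the file check is there so the snippet fails gracefully.

```r
library(foreign)

spss_file <- "SPSS_data1.sav"  # file name as listed at the top of the tutorial

if (file.exists(spss_file)) {
  # to.data.frame = TRUE returns a data frame instead of a list
  dat2_df <- read.spss(spss_file, to.data.frame = TRUE)
  str(dat2_df)
} else {
  message("File not found; check your working directory with getwd()")
}
```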
Once it’s converted you can start looking at the data. Let’s start by checking that it looks reasonable.
Use head()
and tail()
to look at the first or last few rows of data, and names()
to look at the names of the columns in the dataset. You can also use View()
to open up the full dataset in the script pane.
head(dat2) # Look at the first few rows
tail(dat2) # Look at the last few rows
names(dat2) # Check the column names
View(dat2) # Open dataset in script pane
That looks good, although the column names could be a bit more informative. Let’s rename the HEIGHT column to HeightCM so we know the units of measurement.
You can do this by assigning the new value to the 2nd position of names(dat2)
, like so:
names(dat2)[2] <- "HeightCM"
names(dat2)
Actually, all of the column names could be improved, so let’s rename them all:
names(dat2)[1:4]<-c("AgeYrs","HeightCM","ShoeSizeUK","HairColour")
names(dat2)
Hopefully, when you look at these examples you will see that we are calling a function on dat2, which uses the round brackets (), and also selecting data from a specific position, using the square brackets [].
The colon is used to indicate a sequence, so 1:4 is the same as 1, 2, 3, 4.
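Renaming by position works, but it silently renames the wrong column if the column order ever changes. A minimal sketch of renaming by name instead is shown below; the toy data frame and its AGE column are made up for illustration.

```r
# Toy data frame standing in for dat2 (values and the AGE column are invented)
toy <- data.frame(AGE = c(21, 34, 28), HEIGHT = c(170, 182, 165))

# Rename the column called "HEIGHT", wherever it happens to sit
names(toy)[names(toy) == "HEIGHT"] <- "HeightCM"

names(toy)
```

The comparison names(toy) == "HEIGHT" picks out the matching position, so the assignment only touches that one column.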
Let’s dive a little deeper into exploring data frames.
Start by making sure that the TMS_data.txt file is loaded into the workspace. Just in case you don’t remember, the code for that is:
dat1 <- read.delim("TMS_data.txt")
Let’s do some of the same checks we did on the last dataset:
str(dat1) # Show the structure of the variable, is it a dataframe?
names(dat1) # Show the names of the columns, are they understandable?
head(dat1) # Show the first few lines of data, do they look sensible?
dim(dat1) # Show the dimensions of the data (Rows, Columns)
You can also explore specific variables in the data but to do this we need to revisit indexing.
Previously we have seen that you can select a specific element in a variable by using square brackets to indicate its position.
temp <- c(1, 3, 5, 6, 17, 8)
temp[5]
## [1] 17
But a data frame is more complex than a vector. There are several ways to index in data frames. You can still use square brackets but you include 2 numbers: the first indicates the Row and the 2nd the Column.
dat1[2,6] # Shows the element in Row 2, Column 6
dat1[456,9] # Shows the element in Row 456, Column 9
# You can select multiple sequential elements using the colon symbol:
dat1[1:5,9] # Shows rows 1 to 5 in column 9
dat1[5,1:3] # Shows columns 1 to 3 in row 5
dat1[1:3,3:5] # Shows columns 3 to 5 in rows 1 to 3
# You can also leave one of the numbers out and R will show all of the rows or columns:
dat1[50,] # Shows all columns in Row 50
dat1[,6] # Shows all rows in column 6
dat1[1:5,] # Shows all columns for rows 1 to 5
If you have a little difficulty remembering the order for indexing, "Roman Catholic" works as a simple mnemonic: Rows come first, then Columns.
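If you want to experiment without the TMS data, the same indexing rules can be tried on a small data frame. The sketch below uses made-up values; toy and the other variable names are illustrative only.

```r
# Toy data frame with 3 rows and 3 columns (made-up values)
toy <- data.frame(
  id    = 1:3,
  score = c(10, 20, 30),
  group = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

single  <- toy[2, 3]      # row 2, column 3
one_row <- toy[2, ]       # all columns of row 2 (still a data frame)
one_col <- toy[, 2]       # all rows of column 2 (a plain vector)
block   <- toy[1:2, 2:3]  # rows 1 to 2, columns 2 to 3

dim(block)                # dimensions of the selected block (rows, columns)
```

Note that selecting a single column returns a vector, while selecting a row keeps the data frame structure.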
Because a data frame has column names, you can also use them to indicate which columns from the data you’d like to select:
# Either using square brackets:
dat1[45,"RT"] # Show Row 45 in the column called RT
dat1[,"Axes"] # Show all rows in the column called "Axes"
# Or by using the $ symbol:
dat1$Hemisphere # Show all the rows in the column called "Hemisphere"
# You can even combine the two ways:
dat1$Congruence[57] # Show row number 57 in the column called "Congruence"
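The bracket and $ forms return the same values, which is easy to confirm on a small example. The sketch below uses a toy data frame whose values are made up; only the column names RT and Congruence are borrowed from dat1.

```r
# Toy data frame with made-up values
toy <- data.frame(
  RT         = c(0.45, 0.52, 0.48),
  Congruence = c("Cong", "Incong", "Cong"),
  stringsAsFactors = FALSE
)

by_bracket <- toy[, "RT"]  # column selected by name with square brackets
by_dollar  <- toy$RT       # the same column selected with $

identical(by_bracket, by_dollar)  # both forms give the same vector

second_rt <- toy$RT[2]            # second row of the RT column
both_cols <- toy[, c("RT", "Congruence")]  # several columns by name
```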
The mean can be calculated using the mean()
function. If you’re using it on a data frame, be sure to select the column you’d like to run the function on:
mean(dat1) # This returns NA with a warning, because dat1 is a whole data frame
mean(dat1$RT) # This will calculate the mean RT
Some data will have missing entries (usually indicated by NA), and this can confuse some functions:
mean(dat1$Twitches)
## [1] NA
In order to deal with this mean()
has an optional argument called na.rm. If you set this to TRUE, the missing values are ignored.
mean(dat1$Twitches, na.rm=TRUE)
## [1] 2.226583
The standard deviation can be calculated using the sd()
function:
sd(dat1$RT)
# As before, if a variable has missing values then 'na.rm' must be set to TRUE
sd(dat1$Twitches, na.rm = TRUE)
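To see exactly what na.rm does, it helps to try it on a tiny vector where the answer can be checked by hand. The values below are made up for illustration.

```r
x <- c(2, 4, NA, 6)  # made-up values with one missing entry

mean(x)                          # NA, because of the missing value

mean_x <- mean(x, na.rm = TRUE)  # mean of 2, 4 and 6
sd_x   <- sd(x, na.rm = TRUE)    # standard deviation of 2, 4 and 6

n_missing <- sum(is.na(x))       # count how many values were dropped
```

Counting the missing values with is.na() is a useful habit, since na.rm = TRUE drops them silently.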
All of your favorite descriptive statistics are available in R, including:
- Median: median()
- Mode: R has no built-in function for the statistical mode; mode() returns the storage type of an object instead
- Interquartile range: IQR()
- Variance: var()
You can also use the summary()
function to calculate a number of these simultaneously.
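Base R does not provide a function for the statistical mode, so if you need one you have to write it yourself. The helper below, stat_mode(), is a hypothetical name and only a minimal sketch built on table(); the shoe sizes are made up.

```r
# Hypothetical helper: return the most frequent value(s) of a vector
stat_mode <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # value(s) with the highest count
}

shoe_sizes <- c(5, 7, 7, 8, 9, 7, 5)  # made-up data

stat_mode(shoe_sizes)  # most frequent value, returned as a character string
mode(shoe_sizes)       # "numeric": the storage mode, not the statistical mode
summary(shoe_sizes)    # minimum, quartiles, median, mean and maximum
```

Because the result comes from the names of a table, it is returned as text; wrap it in as.numeric() if you need a number.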
Right, let’s get to some proper statistics!
For this section we’re going to look at the data in dat1 to see if there is a difference in reaction time between congruent and incongruent stimuli. That calls for a paired-samples t-test.
You can run a t-test using the t.test()
function, but how do we do that without the hassle of splitting dat1 into the 2 conditions?
Introducing the tilde: ~. The tilde is used to build formulas for statistical tests.
A simple rule of thumb is that the dependent variable should be placed on the left side of the tilde and the independent variable(s) should be placed on the right side of the tilde. In our analysis RT is the dependent variable and Congruence is the independent variable, so we should use the formula: RT ~ Congruence.
You’ll see this pop up more frequently when using more complex statistical tests, so it’s good to get your head around it now.
But if we just execute t.test(RT ~ Congruence)
R will give us an error. It can’t find ‘RT’ or ‘Congruence’, so we need to tell it where to find them by using the optional argument ‘data =’.
t.test(RT ~ Congruence, data = dat1)
##
## Welch Two Sample t-test
##
## data: RT by Congruence
## t = -5.2736, df = 9581.3, p-value = 1.367e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03221094 -0.01475402
## sample estimates:
## mean in group Cong mean in group Incong
## 0.4944810 0.5179635
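To convince yourself that the formula really does the splitting for you, you can compare it with the manual approach. The sketch below uses simulated data rather than the TMS dataset, so its numbers will not match the output above; sim and the other variable names are illustrative.

```r
set.seed(1)  # simulated data, not the TMS dataset

sim <- data.frame(
  RT = c(rnorm(20, mean = 0.49, sd = 0.05),
         rnorm(20, mean = 0.52, sd = 0.05)),
  Congruence = rep(c("Cong", "Incong"), each = 20)
)

# Formula interface: R splits RT by Congruence for us
res_formula <- t.test(RT ~ Congruence, data = sim)

# Manual split: pull out each condition ourselves
cong   <- sim$RT[sim$Congruence == "Cong"]
incong <- sim$RT[sim$Congruence == "Incong"]
res_manual <- t.test(cong, incong)

# Both approaches give the same t statistic
all.equal(unname(res_formula$statistic), unname(res_manual$statistic))
```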
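Note that by default t.test() ran a Welch two-sample t-test, which treats the two groups as independent and does not assume equal variances. That is not going to be ideal for every situation, so let’s look at some of the optional arguments that can change what t.test()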
does by default:
So let’s assume that we have equal variances:
t.test(RT ~ Congruence, data = dat1, var.equal = TRUE)
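One way to see the effect of these arguments is to compare the degrees of freedom each version reports. The sketch below uses simulated vectors rather than dat1, and treats them as paired measurements purely for illustration; group_a and group_b are made-up names.

```r
set.seed(2)  # simulated data, not the TMS dataset

group_a <- rnorm(15, mean = 0.50, sd = 0.05)
group_b <- rnorm(15, mean = 0.53, sd = 0.08)

welch   <- t.test(group_a, group_b)                    # default: var.equal = FALSE
student <- t.test(group_a, group_b, var.equal = TRUE)  # pooled variance
paired  <- t.test(group_a, group_b, paired = TRUE)     # treats values as matched pairs

student$parameter  # df = 15 + 15 - 2
welch$parameter    # Welch-adjusted df, never larger than the pooled df
paired$parameter   # df = number of pairs - 1
```

The paired version only makes sense when each value in one vector genuinely corresponds to a value in the other, for example two conditions measured on the same participant.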