View on GitHub

rminr

Research Methods in R

Quick reference guide

Andy Wills

About this guide

This is not intended to be a stand-alone guide. It will not make much sense unless you’ve completed the worksheets. However, once you have done the worksheets, it becomes a handy quick reference guide to all the main commands you have learned about.

Getting help

If you want to do almost anything in R, you can just google “something in R”. For example, want to produce a graph different to the ones you’ve been taught in class? Search “graphs in R”. One of the first results will be the R Graph Gallery, with lots of examples you can adapt. Similarly, if the output you’re seeing includes the word ‘error’, copy that output into google and search for it. There’s a huge, friendly, community of R users out there, so it’s likely someone has already put the answer to your problem online.

Note: ‘Google’ here means any internet search engine. For example, you could also use DuckDuckGo, which works just as well as Google, but respects your privacy, too.


Begin

library(tidyverse) - Load a package before you use it.

In the sections below, a word in bold after the heading indicates the R package you need to load. So, for example, in the Load section, you need to load the tidyverse package.


Power calculations

Package: pwr

pwr.t.test(n = ?, type = ?, power = ?, d = ?, alternative = ?, sig.level = ?)

Replace all but one the question marks with the appropriate options, and remove the thing you want to calculate. For example:

pwr.t.test(type = "paired", power = .8, d = .5, alternative="two.sided", sig.level = .05)

will tell you how many participants you need to run for 80% power assuming an effect size of d = .5, a within-subjects design (paired), a nondirectional (two.sided) hypothesis, and a signifiance level of .05. Another option for type is two.sample (between-subjects), and another option for alternative is less (one-tailed test).


Listing files in a directory

The command:

list.files("rawdata", "csv", full.names=TRUE)

lists all the files in the directory rawdata than contain csv in their filename. The option full.names=TRUE returns the full file name, including the path e.g. rawdata/1.csv rather than 1.csv.


Load

Package: tidyverse

dframe <- read_csv("file.csv") - Read data from a CSV file into a data frame called dframe.


Renaming columns

Rename the columns of your data frame like this:

colnames(dframe) <- c("col1", "col2", "col3")


Combining data frames

The command:

alldat <- bind_rows(dat, dat2)

takes the data frame dat, pastes the datframe dat2 at the bottom of it, and puts the combined data into alldat.


Pivot

To make a long-format data file wider, use

dframe2 <- dframe %>% pivot_wider(names_from = colX, values_from = colY)

Replace colX with the column in dframe that contains the names of the columns you want to create in dframe2. Replace colY with the column in dframe that contains the values you want to appear in those columns.


Mutate

To create a new column in dframe called colZ, which is calculated by taking the values in colX and subtracting the values in colY, use:

dframe <- dframe %>% mutate(colZ = colX - colY)


Filter

Package: tidyverse

fdata <- dframe %>% filter(expression)

Replace dframe with the name of your data frame. Replace expression with some instructions that tell R what data you want to keep (see below). The filtered data will be put in a data frame called fdata.

Examples

Command Meaning Example expression
== Equal to filter(sex == "male")
!= NOT equal to filter(job != "nopay")
> Greater than filter(income > 0)
< Less than filter(income < 5000)
& AND filter(education == "grade-school" & sex == "male")
| OR filter(education == "master" | education == "doctor")

Select

Package: tidyverse

sdata <- dframe %>% select(expression)

Replace dframe with the name of your data frame. Replace expression with some instructions that tell R which columns of the data frame you want to keep (see below). The selected data will be put in a data frame called sdata.

Examples

select(rating1, rating2) - Select the two columns called rating1 and rating2.


Summarise

Package: tidyverse

dframe %>% summarise(mean(DV))

Replace dframe with the name of your data frame, and DV with the name of the column you want to look at (e.g. income).

To calculate for each group in your data, use group_by(IV).

dframe %>% group_by(IV) %>% summarise(mean(DV))

Replace IV with the name of the column that says which group each participant is in (e.g. sex).

You can get multiple summaries at once like this:

dframe %>% summarise(ingroup = mean(ingroup), outgroup = mean(outgroup))

Summary commands

Replace mean with one of these to get a different summary:

Command Meaning
median median
sd standard deviation
max maximum value
min minimum value
IQR inter-quartile range

Missing data

If the column you want to summarise has some missing data (shown as NA), you will need to tell R to ignore the missing data. For example: mean(DV, na.rm = TRUE)

Alternatively, you can remove the missing data from the data frame:

dframe <- dframe %>% drop_na()


Tabulate

Replace dframe with the name of your data frame. Replace IV with the name of your independent variable.

Frequency tables

table(dframe$IV) - Count number of rows in your data frame for each level of your IV. Example: table(dframe$gender) gives the number of rows (often the number of participants) of each gender.

Contingency tables

table(dframe$IV1, dframe$IV2) - Count number of rows in your data frame for each combination of the independent variables IV1 and IV2.


Plot

Package: tidyverse

Replace dframe with the name of your data frame, and DV with the name of the column of data you want to plot.

Histogram

dframe %>% ggplot(aes(DV)) + geom_histogram(binwidth=X)

Replace X with how wide you want your bars to be. For examples, if you want different bars for 0-9, 10-19, 20-29, etc., then binwidth=10.

Scaled density plot

dframe %>% ggplot(aes(DV, colour=factor(IV))) + geom_density(aes(y=..scaled..), adjust = 1)

Replace IV with the name of the column that contains your grouping variable (e.g. sex). Increase the value of adjust to get a smoother plot.

Violin plot

A violin plot is a density plot rotated through 90 degrees and mirrored to make it symmetrical:

dframe %>% ggplot(aes(x=DV, y=IV)) + geom_violin()

Scatterplot

dframe %>% ggplot(aes(x = var1, y = var2)) + geom_point()

var1 and var2 are the names of the columns in your data frame containing the x- and y- values of your points.

Add a vertical line

Add a vertical line to your plot using:

geom_vline(xintercept = 4.35, colour = "black")

Replace 4.35 with the position on the x-axis you want the vertical line to cross, and black with the colour you want the line to be.

Bar graph

grpmeans %>% ggplot(aes(x = IV, y = DV)) + geom_col()

Replace grpmeans with a data frame that contains one row for each bar you want to plot. Replace IV with the column that contains the x-axis labels for those bars. Replace DV with the column that contains the values you want to plot.

If you want to have more than one bar graph on the same axes, you can use a second IV to do this, by using it to set the fill colour

grpmeans %>% ggplot(aes(x = IV1, y = DV, fill = IV2)) + geom_col(position = "dodge")

Set the colours of bars in a bar plot using:

scale_fill_manual(values=c("yellow","black"))

List the available colours using colours()

Line graph

These work much the same way as bar graphs. For example, for a graph of dots connected by lines, where one IV is on the x-axis and the other is shown by the colour of the line, use:

grpmeans %>%
    ggplot(aes(x = IV1, y = DV, group = IV2)) +
    geom_line(aes(colour=IV2)) +
    geom_point(aes(colour=IV2))

This is a fairly standard plot for illustrating the results of two-factor experiments, particularly where the interaction between factors is of interest (an interaction is illustrated by the lines not being parallel).

Axis labels

Add labels to your x-axis and y-axis with these commands:

xlab('Text to appear as the x-axis label')

ylab('Text to appear as the y-axis label')

Axis limits

Limit the range of the x-axis or y-axis with these commands:

xlim(0, 10) - Limit the x-axis to the range 0-10.

ylim(-5, 100) - Limit the x-axis to the range -5 to +100.

APA style

  1. Load the APA style from my website:

source("http://www.willslab.org.uk/rminr/theme-apa.R")

  1. Add +theme_APA() to your command e.g.

grpmeans %>% ggplot(aes(x = IV, y = DV)) + geom_col() + theme_APA()

Mosaic plot

mosaicplot(cont) - Visualize a contingency table. Replace cont with the contigency table of your data e.g. cont <- table(dframe$IV1, dframe$IV2)


Analyse

Replace dframe with the name of your data frame. Replace DV with the name of the column containing the data you want to look at (e.g. income). Replace IV with the name of the column containing your grouping variable (e.g. sex). For correlation and similar, replace var1 and var2 with your two variables.

Note: All the commands below expect you to have exactly two levels in your grouping variable (e.g. male, female). If you have more than two (e.g. year9, year10, year11) you will have to filter your data until it has just two levels (e.g. year9, year11).

Effect size

Package: effsize

cohen.d(dframe$DV ~ dframe$IV)

Correlation

Calculate

cor(dframe$var1, dframe$var2) - Standard (Pearson) correlation

cor(dframe$var1, dframe$var2, method="spearman") - Rank (Spearman) correlation

cor(dframe$var1, dframe$var2, method="kendall") - Kendall’s tau correlation

Bayesian test

Package: BayesFactor

correlationBF(dframe$var1, dframe$var2)

Traditional test

cor.test(dframe$var1, dframe$var2)

cor.test(dframe$var1, dframe$var2, alternative = "greater") - One-tailed test (correlation is positive)

cor.test(dframe$var1, dframe$var2, alternative = "less") - One-tailed test (correlation is negative)

Between-subjects t-test

Bayesian

Package: BayesFactor

ttestBF(formula = DV ~ IV, data = data.frame(dframe))

Traditional

t.test(dframe$DV ~ dframe$IV)

One-factor ANOVA

Bayesian

Package: BayesFactor

anovaBF(formula = DV ~ IV + subj,
        data = data.frame(dframe),
        whichRandom = "subj")

Replace dframe with the name of your data frame. Replace DV with the column in that data frame that contains the dependent variable. Replace IV with the column that contains the independent variable. Make sure you have a column called subj that contains the participant IDs. Before running the above command, ensure IV and subj are factors:

dframe$IV <- factor(dframe$IV)
dframe$subj <- factor(dframe$subj)

Two-factor ANOVA

Two-factor ANOVA is basically the same command as a one-factor ANOVA:

Bayesian

Package: BayesFactor

bf <- anovaBF(formula = DV ~ IV1*IV2 + subj,
              data = data.frame(dframe), whichRandom = "subj")
bf
bf[4] / bf[3]

The command bf gives the main effects on the first two lines - [1] and [2]. In order to calculate the BF for the interaction, you divide the fourth line by the third line - bf[4] / bf[3].

Inter-rater reliablity

Package: irr

agree(dframe) - Percentage agreement between raters, one rater per column in dframe

kappa2(dframe) - Cohen’s kappa measure of inter-rater reliability, one rater per column in dframe.

Chi-square test

These commands use a contingency table, generated as described above, e.g. cont <- table(dframe$IV1, dframe$IV2)

Bayesian

Package: BayesFactor

contingencyTableBF(cont, fixedMargin = "rows", sampleType = "indepMulti")

In the above command, the assumption is that the rows of the contingency table is what the experimenter manipulated, and the columns are what was measured. For example, in a study of cross-cultural friendship styles, friendship style is what is measured, and so should be on the columns. If that is not the case for your data, see the more on relationships worksheet.

Traditional

chisq.test(cont)


This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.