Repetitive Tasks and Functional Programming

HES 505 Fall 2022: Session 4

Matt Williamson

Objectives

Describe the basic components of functions
Introduce the apply and map family of functions
Practice designing functions for repetitive tasks

What are functions?

A specific class of R object (can call function inside functions)

rg <- paste("The range of mpg is", sum(mean(mtcars$mpg), sd(mtcars$mpg)), "-", sum(mean(mtcars$mpg), -sd(mtcars$mpg)))
rg

[1] "The range of mpg is 26.1175730520891 - 14.0636769479109"

A self-contained (i.e., modular) piece of code that performs a specific task
Allows powerful customization and extension of R

Why use functions?

Copy-and-paste and repetitive typing are prone to errors
Evocative names and modular code make your analysis more tractable
Update in one place!

If you are copy-and-pasting more than 2x, consider a function!

Designing Functions

Getting started

Sketch out the steps in the algorithm (pseudocode!)
Develop working code for each step
Anonymize

do_something <- function(arg1, arg2, arg3){
  intermediate_process <- manipulate(arg1,arg2, arg3)
  clean_output <- cleanup(intermediate_process)
  return(clean_output)
}

Structure of functions: Names

What will your function do?
Short, but clear!
Avoid using reserved words or functions that already exist
Use snake_case

something <- function(...){
}

Not Great

do_something_ultraspecific <- function(...){
}

Better

do_something <- function(...){
}

Pretty good

Structure of functions: Arguments

Provide the data that the function will work on
Provide other arguments that control the details of the computation (often with defaults)
Called by name or position (names should be descriptive)

nums <- rnorm(n = 1000, mean=2, sd=1.5)

Same As

nums <- rnorm(1000, 2, 1.5)

Structure of functions: Body

The body of the function appears between the {}
This is where the function does its work

# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)
mean_ci(x)

[1] 0.4328836 0.5415704

mean_ci(x, conf = 0.99)

[1] 0.4158077 0.5586463

Structure of functions: Return

Default is to return the last argument evaluated
Can use return() to return an earlier value
Can use list to return multiple values
A note on the Environment

mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  ci <- mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
  myresults <- list(alpha = alpha, ci = ci, se = se)
  return(myresults)
}

ci_result <- mean_ci(x)

Structure of functions: Return

str(ci_result)

List of 3
 $ alpha: num 0.05
 $ ci   : num [1:2] 0.433 0.542
 $ se   : num 0.0277

Repetitive Tasks

Iteration

Another tool for reducing code duplication
Iteration for when you need to repeat the same task on different columns or datasets
Imperative iteration uses loops (for and while)
Functional iteration combines functions with the apply family to break computational challenges into independent pieces.

Loops

Use counters (for) or conditionals (while) to repeat a set of tasks
3 key components
- Output - before you can loop, you need a place to store the results
- Sequence - defines what you are looping over
- Body - defines what the code is actually doing

Loops

library(tidyverse)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output

[1] -0.5016472 -0.3887528  0.2770325  0.2178513

#> [1] -0.24576245 -0.28730721 -0.05669771  0.14426335

The `apply` family

Vectorized functions that eliminate explicit for loops
Differ by the class they work on and the output they return
apply, lapply are most common; extensions for parallel processing (e.g., parallel::mclapply)

The `apply` family

apply for vectors and data frames
Args: X for the data, MARGIN how will the function be applied, (1=rows, 2=columns), FUN for your function, ... for other arguments to the function

apply(mtcars, 2, mean)

       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500

The `apply` family

lapply for lists (either input or output)
Args: X for the data, FUN for your function, ... for other arguments to the function

data <- list(item1 = 1:4, 
             item2 = rnorm(10), 
             item3 = rnorm(20, 1), 
             item4 = rnorm(100, 5))

# get the mean of each list item 
lapply(data, mean)

$item1
[1] 2.5

$item2
[1] -0.3173127

$item3
[1] 1.294093

$item4
[1] 5.088717

The `map` family

Similar to apply, but more consistent input/output
All take a vector for input
Difference is based on the output you expect
Integrates with tidyverse

The `map` family

map(): output is a list
map_int(): output is an integer vector
map_lgl(): output is a logical vector
map_dbl(): output is a double vector
map_chr(): output is a character vector
map_df(), map_dfr(), map_dfc(): output is a dataframe (r and c specify how to combine the data)

Some parting thoughts

Transparency vs. speed
Testing
Moving forward

Repetitive Tasks and Functional Programming

Objectives

What are functions?

Why use functions?

Designing Functions

Getting started

Structure of functions: Names

Structure of functions: Arguments

Structure of functions: Body

Structure of functions: Return

Structure of functions: Return

Repetitive Tasks

Iteration

Loops

Loops

The apply family

The apply family

The apply family

The map family

The map family

Some parting thoughts

Back to our example

The `apply` family

The `apply` family

The `apply` family

The `map` family

The `map` family