unique drops duplicates so that each element in the result is unique (only appears once):
x = c(2, 1, 1, 2, 1)
unique(x)
# 2 1
Values are returned in the order they first appeared.
duplicated tags each duplicated element:
duplicated(x)
# FALSE FALSE TRUE TRUE TRUE
anyDuplicated(x) > 0L is a quick way of checking whether a vector contains any duplicates.
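As a small sketch of how these functions relate to each other (the fromLast argument and the subsetting idiom are additions that the text above does not cover), using the same x:

```r
x <- c(2, 1, 1, 2, 1)

# anyDuplicated() returns the index of the first duplicated element, or 0 if there is none
first_dup <- anyDuplicated(x)
has_dups  <- first_dup > 0L

# Keeping the elements that are not flagged reproduces unique(x)
kept <- x[!duplicated(x)]

# fromLast = TRUE scans from the end, so the last occurrence of each value is the one not flagged
from_last <- duplicated(x, fromLast = TRUE)
```

Subsetting with !duplicated() is handy when the same filter has to be applied to another object of the same length, for example the rows of a data frame.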
Section 34.5: Measuring set overlaps / Venn diagrams for vectors
To count how many elements of two sets overlap, one could write a custom function:
xtab_set <- function(A, B){
    both <- union(A, B)
    inA <- both %in% A
    inB <- both %in% B
    return(table(inA, inB))
}

A = 1:20
B = 10:30
xtab_set(A, B)
#        inB
# inA     FALSE TRUE
#   FALSE     0   10
#   TRUE      9   11
A Venn diagram, offered by various packages, can be used to visualize overlap counts across multiple sets.
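Without any plotting package, the three region counts of a two-set Venn diagram can be sketched with base R set functions. The name venn_counts is only an illustrative choice:

```r
A <- 1:20
B <- 10:30

# Sizes of the three regions of a two-set Venn diagram
venn_counts <- c(
  only_A = length(setdiff(A, B)),
  both   = length(intersect(A, B)),
  only_B = length(setdiff(B, A))
)
venn_counts
```

These are the same three non-zero cells that the cross-tabulation of inA and inB reports.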
Chapter 35: tidyverse
Section 35.1: tidyverse: an overview
What is tidyverse?
tidyverse is a fast and elegant way to turn basic R into an enhanced tool, redesigned by Hadley Wickham and RStudio. The development of all packages included in tidyverse follows the principles of The tidy tools manifesto. But first, let the authors describe their masterpiece:
The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.
The best place to learn about all the packages in the tidyverse and how they fit together is R for Data Science. Expect to hear more about the tidyverse in the coming months as I work on improved package websites, making citation easier, and providing a common home for discussions about data analysis with the tidyverse.
(source)
How to use it?
Just as with ordinary R packages, you need to install and load the package.
install.packages("tidyverse")
library("tidyverse")
The difference is that a single command installs or loads a couple of dozen packages. As a bonus, one may rest assured that all the installed/loaded packages are of compatible versions.
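A cautious way to script this, sketched under the assumption that installation should only happen on request, is to check for the package first. The variable has_tidyverse is an illustrative name:

```r
# TRUE if the tidyverse meta-package is already installed
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (!has_tidyverse) {
  # install.packages("tidyverse")   # uncomment to install (needs internet access)
  message("tidyverse is not installed")
} else {
  library(tidyverse)
  # tidyverse_packages() lists the packages that belong to the tidyverse
  print(tidyverse_packages())
}
```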
What are those packages?
The commonly known and widely used packages:
ggplot2: advanced data visualisation
dplyr: fast (Rcpp) and coherent approach to data manipulation
tidyr: tools for data tidying
readr: for data import.
purrr: makes your pure functions purr by completing R's functional programming tools with important features from other languages, in the style of the JS packages underscore.js, lodash and lazy.js.
tibble: a modern re-imagining of data frames.
magrittr: piping to make code more readable
Packages for manipulating specific data formats:
hms: easily read times
stringr: provides a cohesive set of functions designed to make working with strings as easy as possible
lubridate: advanced date/times manipulations
forcats: advanced work with factors.
DBI: defines a common interface between R and database management systems (DBMS)
haven: easily import SPSS, SAS and Stata files
httr: the aim of httr is to provide a wrapper for the curl package, customised to the demands of modern web APIs
jsonlite: a fast JSON parser and generator optimized for statistical data and the web
readxl: read .xls and .xlsx files without need for dependency packages
rvest: helps you scrape information from web pages
xml2: for XML
And modelling:
modelr: provides functions that help you create elegant pipelines when modelling
broom: easily extract the models into tidy data
Finally, tidyverse suggests the use of:
knitr: the amazing general-purpose literate programming engine, with lightweight APIs designed to give users full control of the output without heavy coding work.
rmarkdown: RStudio's package for reproducible programming.
Section 35.2: Creating tbl_df’s
A tbl_df (pronounced tibble diff) is a variation of a data frame that is often used in tidyverse packages. It is implemented in the tibble package.
Use the as_tibble function (called as_data_frame in older versions of tibble, where it is now deprecated) to turn a data frame into a tbl_df:
library(tibble)
mtcars_tbl <- as_tibble(mtcars)
One of the most notable differences between data.frames and tbl_dfs is how they print:
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
The printed output includes a summary of the dimensions of the table (32 x 11).
It includes the type of each column (dbl).
It prints a limited number of rows. (To change this use options(tibble.print_max = [number])).
Many functions in the dplyr package work naturally with tbl_dfs, such as group_by().
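Printing is not the only difference. As a hedged sketch (it assumes the tibble package is installed and uses as_tibble(), the current name of the conversion function), the class vector and the behaviour of single-column subsetting can be inspected like this:

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  mtcars_tbl <- tibble::as_tibble(mtcars)

  # A tbl_df is still a data frame, with extra classes in front
  tbl_classes <- class(mtcars_tbl)
  print(tbl_classes)

  # Print only the first three rows
  print(mtcars_tbl, n = 3)

  # Selecting one column with [ keeps a tibble, whereas a data.frame drops to a vector
  tbl_keeps_frame <- is.data.frame(mtcars_tbl[, 1])
  df_keeps_frame  <- is.data.frame(mtcars[, 1])
}
```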
Chapter 36: Rcpp
Section 36.1: Extending Rcpp with Plugins
Within C++, one can set different compilation flags using:
// [[Rcpp::plugins(name)]]
List of the built-in plugins:
// built-in C++11 plugin
// [[Rcpp::plugins(cpp11)]]

// built-in C++11 plugin for older g++ compiler
// [[Rcpp::plugins(cpp0x)]]

// built-in C++14 plugin for C++14 standard
// [[Rcpp::plugins(cpp14)]]

// built-in C++1y plugin for C++14 and C++17 standard under development
// [[Rcpp::plugins(cpp1y)]]

// built-in OpenMP plugin
// [[Rcpp::plugins(openmp)]]
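A minimal sketch of a plugin in use, driven from R: the function name sumAuto and the variable src are hypothetical, and the example assumes that Rcpp and a working C++ toolchain are available (otherwise compiled is simply FALSE).

```r
# C++ source held in an R string; the plugin attribute enables C++11 features
src <- '
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
int sumAuto(Rcpp::IntegerVector x) {
  int total = 0;
  for (auto v : x) total += v;   // range-based for loop is a C++11 feature
  return total;
}
'

# Compile and export sumAuto() into the global environment; record whether it worked
compiled <- tryCatch({
  Rcpp::sourceCpp(code = src)
  TRUE
}, error = function(e) FALSE)

if (compiled) print(sumAuto(1:5))
```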
Section 36.2: Inline Code Compile
Rcpp features two functions that enable code compilation inline and exportation directly into R: cppFunction() and evalCpp(). A third function, sourceCpp(), reads C++ code from a separate file, though it can also be used much like cppFunction().
Below is an example of compiling a C++ function within R. Note the use of "" to surround the source.
# Note - This is R code.
# cppFunction in Rcpp allows for rapid testing.
require(Rcpp)

# Creates a function that multiplies each element in a vector
# Returns the modified vector.
cppFunction("
NumericVector exfun(NumericVector x, int i){
    x = x*i;
    return x;
}")

# Calling function in R
exfun(1:5, 3)
To quickly understand a C++ expression use:
# Use evalCpp to evaluate C++ expressions
evalCpp("std::numeric_limits<double>::max()")
## [1] 1.797693e+308
Section 36.3: Rcpp Attributes
Rcpp Attributes make the process of working with R and C++ straightforward. Attributes take the form:
// [[Rcpp::attribute]]
The use of attributes is typically associated with:
// [[Rcpp::export]]
that is placed directly above a declared function header when reading in a C++ file via sourceCpp(). Below is an example of an external C++ file that uses attributes.
// Add code below into C++ file Rcpp_example.cpp

#include <Rcpp.h>
using namespace Rcpp;

// Place the export tag right above function declaration.
// [[Rcpp::export]]
double muRcpp(NumericVector x){

    int n = x.size(); // Size of vector
    double sum = 0;   // Sum value

    // For loop, note cpp index shift to 0
    for(int i = 0; i < n; i++){
        // Shorthand for sum = sum + x[i]
        sum += x[i];
    }

    return sum/n; // Obtain and return the Mean
}

// Place dependent functions above call or
// declare the function definition with:
double muRcpp(NumericVector x);

// [[Rcpp::export]]
double varRcpp(NumericVector x, bool bias = true){

    // Calculate the mean using C++ function
    double mean = muRcpp(x);
    double sum = 0;
    int n = x.size();

    for(int i = 0; i < n; i++){
        sum += pow(x[i] - mean, 2.0); // Square
    }

    return sum/(n-bias); // Return variance
}
To use this external C++ file within R, we do the following:
require(Rcpp)

# Compile File
sourceCpp("path/to/file/Rcpp_example.cpp")

# Make some sample data
x = 1:5

all.equal(muRcpp(x), mean(x))
## TRUE

all.equal(varRcpp(x), var(x))
## TRUE
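To see what the two C++ functions compute, the same arithmetic can be sketched in plain R. The variable names are illustrative; the point is that the bool bias is converted to 0 or 1 in the expression n - bias, so the default divides by n - 1.

```r
x <- 1:5
n <- length(x)

mu <- sum(x) / n                     # what muRcpp() computes
ss <- sum((x - mu)^2)                # sum of squared deviations from the mean

var_sample     <- ss / (n - TRUE)    # bias = true: TRUE counts as 1, so the divisor is n - 1
var_population <- ss / (n - FALSE)   # bias = false: the divisor is n
```

With the default, the result agrees with var(), which also divides by n - 1.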
Section 36.4: Specifying Additional Build Dependencies
To use additional packages within the Rcpp ecosystem, the correct header file may not be Rcpp.h but Rcpp<PACKAGE>.h (as e.g. for RcppArmadillo). The header typically needs to be included, and the dependency is then stated with:
// [[Rcpp::depends(Rcpp<PACKAGE>)]]
Examples:
// Use the RcppArmadillo package
// Requires different header file from Rcpp.h
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// Use the RcppEigen package
// Requires different header file from Rcpp.h
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
Chapter 37: Random Numbers Generator
Section 37.1: Random permutations
To generate a random permutation of 5 numbers:
sample(5)
# [1] 4 5 3 1 2
To generate a random permutation of any vector:
sample(10:15)
# [1] 11 15 12 10 14 13
One could also use randperm(a, k) from the package pracma:
# Generates one random permutation of k of the elements a, if a is a vector,
# or of 1:a if a is a single integer.
# a: integer or numeric vector of some length n.
# k: integer, no larger than a or length(a).

# Examples
library(pracma)
randperm(1:10, 3)
[1] 3 7 9

randperm(10, 10)
[1]  4  5 10  8  2  7  6  9  3  1

randperm(seq(2, 10, by=2))
[1]  6  4 10  2  8
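The same results can be obtained with base R alone. The sketch below also shows a well-known pitfall of sample(): a single number n is treated as 1:n. The seed and the variable names are arbitrary choices for illustration.

```r
set.seed(42)                       # only so that the sketch is repeatable

perm  <- sample(10:15)             # permutation of the whole vector
three <- sample(1:10, 3)           # 3 of the 10 elements, without replacement

# Pitfall: sample(10) permutes 1:10, it does not return the single value 10.
# Indexing with sample.int() always permutes the elements of x itself.
x <- 10
safe <- x[sample.int(length(x))]
```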
Section 37.2: Generating random numbers using various density functions
Below are examples of generating 5 random numbers using various probability distributions.
Uniform distribution between 0 and 10
runif(5, min=0, max=10)
[1] 2.1724399 8.9209930 6.1969249 9.3303321 2.4054102

Normal distribution with 0 mean and standard deviation of 1
rnorm(5, mean=0, sd=1)
[1] -0.97414402 -0.85722281 -0.08555494 -0.37444299 1.20032409

Binomial distribution with 10 trials and success probability of 0.5
rbinom(5, size=10, prob=0.5)
[1] 4 3 5 2 3

Geometric distribution with 0.2 success probability
rgeom(5, prob=0.2)
[1] 14 8 11 1 3

Hypergeometric distribution with 3 white balls, 10 black balls and 5 draws
rhyper(5, m=3, n=10, k=5)
[1] 2 0 1 1 1

Negative Binomial distribution with a target of 10 successes and success probability of 0.8 (the values are the numbers of failures observed before the 10th success)
rnbinom(5, size=10, prob=0.8)
[1] 3 1 3 4 2

Poisson distribution with mean and variance (lambda) of 2
rpois(5, lambda=2)
[1] 2 1 2 3 4

Exponential distribution with the rate of 1.5
rexp(5, rate=1.5)
[1] 1.8993303 0.4799358 0.5578280 1.5630711 0.6228000

Logistic distribution with 0 location and scale of 1
rlogis(5, location=0, scale=1)
[1] 0.9498992 -1.0287433 -0.4192311 0.7028510 -1.2095458

Chi-squared distribution with 15 degrees of freedom
rchisq(5, df=15)
[1] 14.89209 19.36947 10.27745 19.48376 23.32898

Beta distribution with shape parameters a=1 and b=0.5
rbeta(5, shape1=1, shape2=0.5)
[1] 0.1670306 0.5321586 0.9869520 0.9548993 0.9999737

Gamma distribution with shape parameter of 3 and scale=0.5
rgamma(5, shape=3, scale=0.5)
[1] 2.2445984 0.7934152 3.2366673 2.2897537 0.8573059

Cauchy distribution with 0 location and scale of 1
rcauchy(5, location=0, scale=1)
[1] -0.01285116 -0.38918446 8.71016696 10.60293284 -0.68017185

Log-normal distribution with 0 mean and standard deviation of 1 (on log scale)
rlnorm(5, meanlog=0, sdlog=1)
[1] 0.8725009 2.9433779 0.3329107 2.5976206 2.8171894

Weibull distribution with shape parameter of 0.5 and scale of 1
rweibull(5, shape=0.5, scale=1)
[1] 0.337599112 1.307774557 7.233985075 5.840429942 0.005751181

Wilcoxon distribution with 10 observations in the first sample and 20 in second.
rwilcox(5, 10, 20)
[1] 111 88 93 100 124

Multinomial distribution with 5 objects distributed over 3 categories with probabilities 0.1, 0.1 and 0.8 (each column is one draw)
rmultinom(5, size=5, prob=c(0.1,0.1,0.8))
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    1    1    0
[2,]    2    0    1    1    0
[3,]    3    5    3    3    5
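Two quick sanity checks on these generators, as a sketch with arbitrary seeds and sample sizes: the gamma distribution can be parameterised by scale or by rate (rate = 1 / scale), and a large sample should have moments close to the theoretical ones.

```r
# Same seed, same distribution, two parameterisations
set.seed(1)
g_scale <- rgamma(5, shape = 3, scale = 0.5)
set.seed(1)
g_rate  <- rgamma(5, shape = 3, rate = 2)    # rate = 1 / scale
same_draws <- isTRUE(all.equal(g_scale, g_rate))

# For a Poisson distribution both the mean and the variance equal lambda
set.seed(1)
draws <- rpois(1e5, lambda = 2)
moments <- c(mean = mean(draws), var = var(draws))
moments
```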
Section 37.3: Random number generator's reproducibility
When someone is expected to reproduce R code that has random elements in it, the set.seed() function becomes very handy. For example, these two lines will almost always produce different output (because that is the whole point of random number generators):
> sample(1:10,5)
[1]  6  9  2  7 10
> sample(1:10,5)
[1]  7  6  1  2 10
These two will also produce different outputs:
> rnorm(5)
[1] 0.4874291 0.7383247 0.5757814 -0.3053884 1.5117812
> rnorm(5)
[1] 0.38984324 -0.62124058 -2.21469989 1.12493092 -0.04493361
However, if we set the seed to something identical in both cases (most people use 1 for simplicity), we get two identical samples:
> set.seed(1)
> sample(letters,2)
[1] "g" "j"
> set.seed(1)
> sample(letters,2)
[1] "g" "j"
and same with, say, rexp() draws:
> set.seed(1)
> rexp(5)
[1] 0.7551818 1.1816428 0.1457067 0.1397953 0.4360686
> set.seed(1)
> rexp(5)
[1] 0.7551818 1.1816428 0.1457067 0.1397953 0.4360686
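A seed fixes the whole stream of numbers that follows, not just the next call. The sketch below (seed and variable names are arbitrary, and it assumes R's default generators) checks that two consecutive draws of three values match a single draw of six values from the same seed:

```r
set.seed(123)
first_three <- rnorm(3)
next_three  <- rnorm(3)    # continues the same stream

set.seed(123)
all_six <- rnorm(6)        # restarts the stream from the same state

same_stream <- identical(c(first_three, next_three), all_six)
same_stream
```

This is why set.seed() is usually called once at the top of a script rather than before every random call.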
Chapter 38: Parallel processing
Section 38.1: Parallel processing with parallel package
The base package parallel allows parallel computation through forking, sockets, and random-number generation.
Detect the number of cores present on the localhost:
parallel::detectCores(all.tests = FALSE, logical = TRUE)
Create a cluster of the cores on the localhost:
parallelCluster <- parallel::makeCluster(parallel::detectCores())
First, a function appropriate for parallelization must be created. Consider the mtcars dataset. A regression on mpg could be improved by creating a separate regression model for each level of cyl.
data <- mtcars
yfactor <- 'cyl'
zlevels <- sort(unique(data[[yfactor]]))
datay <- data[,1]
dataz <- data[,2]
datax <- data[,3:11]

fitmodel <- function(zlevel, datax, datay, dataz) {
    glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel])
}
Create a function that can loop through all the possible iterations of zlevels. This is still in serial, but is an important step as it determines the exact process that will be parallelized.
fitmodel <- function(zlevel, datax, datay, dataz) {
    glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel])
}

for (zlevel in zlevels) {
    print("*****")
    print(zlevel)
    print(fitmodel(zlevel, datax, datay, dataz))
}
Curry this function:
worker <- function(zlevel) {
    fitmodel(zlevel,datax, datay, dataz)
}
Worker processes started by parallel cannot see the global environment of the master session. Luckily, each function call creates a local environment, and a function defined there is sent to the workers together with that environment. Creating a wrapper function therefore allows for parallelization. The function to be applied also needs to be placed within that environment.
wrapper <- function(datax, datay, dataz) {
    # force evaluation of the arguments so that their values exist in this environment
    force(datax)
    force(datay)
    force(dataz)
    # these variables are now in an environment accessible by parallel function

    # function to be applied also in the environment
    fitmodel <- function(zlevel, datax, datay, dataz) {
        glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel])
    }

    # calling in this environment iterating over single parameter zlevel
    worker <- function(zlevel) {
        fitmodel(zlevel,datax, datay, dataz)
    }
    return(worker)
}
Now create a cluster and run the wrapper function.
parallelcluster <- parallel::makeCluster(parallel::detectCores())
models <- parallel::parLapply(parallelcluster, zlevels,
                              wrapper(datax, datay, dataz))
Always stop the cluster when finished.
parallel::stopCluster(parallelcluster)
The parallel package includes the entire apply() family, prefixed with par.
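A compact variant of the same idea, as a sketch only: instead of a wrapper closure, the data can be handed to the workers through the extra arguments of parLapply(). The helper name fit_one, the simpler lm() model and the two-worker cluster are assumptions made for illustration.

```r
library(parallel)

# Hypothetical helper: fits mpg ~ wt for one level of cyl
fit_one <- function(zlevel, data) {
  coef(lm(mpg ~ wt, data = data[data$cyl == zlevel, ]))
}

zlevels <- sort(unique(mtcars$cyl))

cl <- makeCluster(2)                                 # two local worker processes
models <- tryCatch(
  parLapply(cl, zlevels, fit_one, data = mtcars),    # 'data' is shipped to each worker
  finally = stopCluster(cl)                          # stop the cluster even if a fit fails
)
names(models) <- zlevels
models
```

Passing the data explicitly keeps the worker function free of references to the calling session, at the cost of sending the whole data set to every worker.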
Section 38.2: Parallel processing with foreach package
The foreach package brings the power of parallel processing to R. But before you can use multi-core CPUs you have to register a multi-core cluster. The doSNOW package is one possibility.
A simple use of the foreach loop is to calculate the sum of the square root and the square of all numbers from 1 to 100000.
library(foreach)
library(doSNOW)

cl <- makeCluster(5, type = "SOCK")
registerDoSNOW(cl)

f <- foreach(i = 1:100000, .combine = c, .inorder = F) %dopar% {
    k <- i ** 2 + sqrt(i)
    k
}
The structure of the output of foreach is controlled by the .combine argument. The default output structure is a list. In the code above, c is used to return a vector instead. Note that a calculation function (or operator) such as "+" may also be used to perform a calculation and return a further processed object.
It is important to mention that the result of each foreach iteration is the value of its last expression. Thus, in this example k will be added to the result.
Parameter    Details
.combine     Combine function. Determines how the results of the loop are combined. Possible values are c, cbind, rbind, "+", "*"...
.inorder     If TRUE, the result is ordered according to the order of the iteration variable (here i). If FALSE, the result is not ordered. This can have positive effects on computation time.
.packages    For functions which are provided by any package except base, e.g. MASS or randomForest, you have to name these packages, as in c("MASS", "randomForest").
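The effect of .combine can be tried without any cluster by using the sequential operator %do% from the same package. This is a sketch on five values only; it assumes the foreach package is installed, and the variable names are illustrative.

```r
if (requireNamespace("foreach", quietly = TRUE)) {
  library(foreach)

  # Default: one list element per iteration
  as_list   <- foreach(i = 1:5) %do% { i^2 + sqrt(i) }

  # .combine = c: results are concatenated into a vector
  as_vector <- foreach(i = 1:5, .combine = c) %do% { i^2 + sqrt(i) }

  # .combine = "+": results are added up as they arrive
  as_sum    <- foreach(i = 1:5, .combine = "+") %do% { i^2 + sqrt(i) }
}
```

Replacing %do% with %dopar% runs the same loops on whatever parallel backend has been registered.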
Section 38.3: Random Number Generation
A major problem with parallelization is the use of seeds for the RNG. A random number stream advances with the number of draws made since either the start of the session or the most recent set.seed(). Since parallel processes arise from the same function, they can use the same seed, possibly causing identical results! In that case the calls on the different cores merely duplicate one another, providing no advantage.
A set of seeds must be generated and sent to each parallel process. This is automatically done in some packages (parallel, snow, etc.), but must be explicitly addressed in others.
s <- seed
for (i in 1:numofcores) {
    s <- nextRNGStream(s)
    # send s to worker i as .Random.seed
}
Seeds can also be set for reproducibility.
clusterSetRNGStream(cl = parallelcluster, iseed)
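The seed loop can be made concrete on a single machine. The sketch below is an illustration under assumptions: the seed 2024, numofcores = 3 and the helper draw_from() are arbitrary, and the "workers" are simulated by installing each stream locally instead of sending it to another process.

```r
library(parallel)

old_kind <- RNGkind("L'Ecuyer-CMRG")    # nextRNGStream() needs this generator
set.seed(2024)

numofcores <- 3
s <- .Random.seed
seeds <- vector("list", numofcores)
for (i in seq_len(numofcores)) {
  s <- nextRNGStream(s)                 # jump ahead to the start of the next stream
  seeds[[i]] <- s                       # this is what worker i would receive
}

# Hypothetical helper: install a stream as .Random.seed, then draw from it
draw_from <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(2)
}

w1       <- draw_from(seeds[[1]])
w2       <- draw_from(seeds[[2]])
w1_again <- draw_from(seeds[[1]])       # same stream, same numbers

do.call(RNGkind, as.list(old_kind))     # restore the previous generator settings
```

Each stream gives its own sequence, and reusing a stream reproduces its numbers, which is the behaviour wanted from parallel workers.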
Section 38.4: mcparallelDo
The mcparallelDo package allows for the evaluation of R code asynchronously on Unix-alike (e.g. Linux and MacOSX) operating systems. The underlying philosophy of the package is aligned with the needs of exploratory data analysis rather than coding. For coding asynchrony, consider the future package.
Example
Create data
data(ToothGrowth)
Trigger mcparallelDo to perform analysis on a fork
mcparallelDo({glm(len ~ supp * dose, data=ToothGrowth)},"interactionPredictorModel")
Do other things, e.g.
binaryPredictorModel <- glm(len ~ supp, data=ToothGrowth)
gaussianPredictorModel <- glm(len ~ dose, data=ToothGrowth)
The result from mcparallelDo returns in your targetEnvironment, e.g. .GlobalEnv, when it is complete with a message (by default)
summary(interactionPredictorModel)
Other Examples
# Example of not returning a value until we return to the top level
for (i in 1:10) {
    if (i == 1) {
        mcparallelDo({2+2}, targetValue = "output")
    }
    if (exists("output")) print(i)
}

# Example of getting a value without returning to the top level
for (i in 1:10) {
    if (i == 1) {
        mcparallelDo({2+2}, targetValue = "output")
    }
    mcparallelDoCheck()
    if (exists("output")) print(i)
}
Chapter 39: Subsetting
Given an R object, we may require separate analysis for one or more parts of the data contained in it. The process of obtaining these parts of the data from a given object is called subsetting.
Section 39.1: Data frames
Subsetting a data frame into a smaller data frame can be accomplished the same as subsetting a list.
> df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
> df3
## x y
## 1 1 a
## 2 2 b
## 3 3 c
> df3[1] # Subset a variable by number
## x
## 1 1
## 2 2
## 3 3
> df3["x"] # Subset a variable by name
## x
## 1 1
## 2 2
## 3 3
> is.data.frame(df3[1])
## TRUE
> is.list(df3[1])
## TRUE
Subsetting a dataframe into a column vector can be accomplished using double brackets [[ ]] or the dollar sign operator $.
> df3[[2]] # Subset a variable by number using [[ ]]
## [1] "a" "b" "c"
> df3[["y"]] # Subset a variable by name using [[ ]]
## [1] "a" "b" "c"
> df3$x # Subset a variable by name using $
## [1] 1 2 3
> typeof(df3$x)
## "integer"
> is.vector(df3$x)
## TRUE
Subsetting a data frame as a two-dimensional matrix can be accomplished using i and j terms.
> df3[1, 2] # Subset row and column by number
## [1] "a"
> df3[1, "y"] # Subset row by number and column by name
## [1] "a"
> df3[2, ] # Subset entire row by number
## x y
## 2 2 b
> df3[ , 1] # Subset all first variables
## [1] 1 2 3
> df3[ , 1, drop = FALSE]
## x
## 1 1
## 2 2
## 3 3
Note: Subsetting by j (column) alone simplifies to the variable's own type, but subsetting by i alone returns a data.frame, as the different variables may have different types and classes. Setting the drop parameter to FALSE keeps the data frame.
> is.vector(df3[, 2])
## TRUE
> is.data.frame(df3[2, ])
## TRUE
> is.data.frame(df3[, 2, drop = FALSE])
## TRUE
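The i term is most often a logical condition rather than a row number. A short sketch with the same df3 (the variable names are illustrative):

```r
df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Rows where x is greater than 1, all columns: still a data frame
rows <- df3[df3$x > 1, ]

# Same rows, one column: simplified to a character vector
vals <- df3[df3$x > 1, "y"]

# drop = FALSE keeps the one-column result as a data frame
one_col <- df3[df3$x > 1, "y", drop = FALSE]
```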
Section 39.2: Atomic vectors
Atomic vectors (which excludes lists and expressions, which are also vectors) are subset using the [ operator:
# create an example vector
v1 <- c("a", "b", "c", "d")

# select the third element
v1[3]
## [1] "c"
The [ operator can also take a vector as the argument. For example, to select the first and third elements:
v1 <- c("a", "b", "c", "d")
v1[c(1, 3)]
## [1] "a" "c"
Sometimes we need to omit a particular value from the vector. This can be achieved by putting a negative sign (-) before the index of that value. For example, to omit the first value from v1, use v1[-1]. This extends to more than one value in a straightforward way, for example v1[-c(1,3)].
> v1[-1]
[1] "b" "c" "d"
> v1[-c(1,3)]
[1] "b" "d"
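A few further cases of negative indexing, sketched with the same v1. The variable names are illustrative, and the error is caught only so that the example runs to the end:

```r
v1 <- c("a", "b", "c", "d")

# Drop the last element, whatever the length of the vector
no_last <- v1[-length(v1)]

# Drop a range: the parentheses matter, since -1:2 would mean c(-1, 0, 1, 2)
no_first_two <- v1[-(1:2)]

# Positive and negative indices cannot be mixed in one call
mixed <- tryCatch(v1[c(-1, 2)], error = function(e) "error")
```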
On some occasions, especially when the vector is long, we would like to know the index of a particular value, if it exists:
> v1=="c"
[1] FALSE FALSE TRUE FALSE
> which(v1=="c")
[1] 3
If the atomic vector has names (a names attribute), it can be subset using a character vector of names:
v <- 1:3
names(v) <- c("one", "two", "three")

v
##   one   two three
##     1     2     3

v["two"]
## two
##   2
The [[ operator can also be used to index atomic vectors, with the differences that it accepts only an indexing vector of length one and strips any names present:
v[[c(1, 2)]]
## Error in v[[c(1, 2)]] :
## attempt to select more than one element in vectorIndex

v[["two"]]
## [1] 2
Vectors can also be subset using a logical vector. In contrast to subsetting with numeric and character vectors, the logical vector used to subset should be as long as the vector whose elements are extracted. If a logical vector y is used to subset x, i.e. x[y], and length(y) < length(x), then y will be recycled to match length(x):
v[c(TRUE, FALSE, TRUE)]
## one three
## 1 3
v[c(FALSE, TRUE)] # recycled to 'c(FALSE, TRUE, FALSE)'
## two
## 2
v[TRUE] # recycled to 'c(TRUE, TRUE, TRUE)'
## one two three
## 1 2 3
v[FALSE] # handy to discard elements but save the vector's type and basic structure
## named integer(0)
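In practice the logical vector usually comes from a comparison. The sketch below (variable names are illustrative) also shows how a missing value in the condition carries through to the result, and how which() avoids that:

```r
v <- c(one = 1L, two = 2L, three = 3L)
big <- v[v > 1]                  # keeps the elements named "two" and "three"

x <- c(5, NA, 7)
with_na    <- x[x > 5]           # the NA in the condition yields an NA in the result
without_na <- x[which(x > 5)]    # which() returns only the positions that are TRUE
```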
Section 39.3: Matrices
For each dimension of an object, the [ operator takes one argument. Vectors have one dimension and take one argument. Matrices and data frames have two dimensions and take two arguments, given as [i, j] where i is the row and j is the column. Indexing starts at 1.
## a sample matrix