cols_chosen <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
mtcars[order(gear, cyl), lapply(.SD, mean), by = .(gear, cyl), .SDcols = cols_chosen]
# gear cyl mpg disp hp drat wt qsec
#1: 3 4 21.500 120.1000 97.0000 3.700000 2.465000 20.0100
#2: 3 6 19.750 241.5000 107.5000 2.920000 3.337500 19.8300
#3: 3 8 15.050 357.6167 194.1667 3.120833 4.104083 17.1425
#4: 4 4 26.925 102.6250 76.0000 4.110000 2.378125 19.6125
#5: 4 6 19.750 163.8000 116.5000 3.910000 3.093750 17.6700
#6: 5 4 28.200 107.7000 102.0000 4.100000 1.826500 16.8000
#7: 5 6 19.700 145.0000 175.0000 3.620000 2.770000 15.5000
#8: 5 8 15.400 326.0000 299.5000 3.880000 3.370000 14.5500
Maybe we don't want to calculate the mean by groups. To calculate the mean for all the cars in the dataset, we don't specify the by variable.
mtcars[ , lapply(.SD, mean), .SDcols = cols_chosen]
# mpg disp hp drat wt qsec
#1: 20.09062 230.7219 146.6875 3.596563 3.21725 17.84875 Note:
It is not necessary to define cols_chosen beforehand. .SDcols can directly take column names
.SDcols can also directly take a vector of columnnumbers. In the above example this would be mtcars[ , lapply(.SD, mean), .SDcols = c(1,3:7)]
.N
.N is shorthand for the number of rows in a group.
iris[, .(count=.N), by=Species]
# Species count
#1: setosa 50
#2: versicolor 50
#3: virginica 50
Remove columns by setting to NULL: mtcars[, mpg_sq := NULL]
Add multiple columns by using the := operator's multivariate format:
mtcars[, `:=`(mpg_sq = mpg^2, wt_sqrt = sqrt(wt))]
# or
mtcars[, c("mpg_sq", "wt_sqrt") := .(mpg^2, sqrt(wt))]
If the columns are dependent and must be defined in sequence, one way is:
mtcars[, c("mpg_sq", "mpg2_hp") := .(temp1 <- mpg^2, temp1/hp)]
The .() syntax is used when the right-hand side of LHS := RHS is a list of columns.
For dynamically-determined column names, use parentheses:
vn = "mpg_sq"
mtcars[, (vn) := mpg^2]
Columns can also be modified with set, though this is rarely necessary:
set(mtcars, j = "hp_over_wt", v = mtcars$hp/mtcars$wt) Editing subsets of columns
Use the i argument to subset to rows "where" edits should be made:
mtcars[1:3, newvar := "Hello"]
# or
set(mtcars, j = "newvar", i = 1:3, v = "Hello")
As in a data.frame, we can subset using row numbers or logical tests. It is also possible to use a "join" in i, but that more complicated task is covered in another example.
Editing column attributes
Functions that edit attributes, such as levels<- or names<-, actually replace an object with a modified copy. Even if only used on one column in a data.table, the entire object is copied and replaced.
To modify an object without copies, use setnames to change the column names of a data.table or data.frame and setattr to change an attribute for any object.
# Print a message to the console whenever the data.table is copied tracemem(mtcars)
mtcars[, cyl2 := factor(cyl)]
# Neither of these statements copy the data.table setnames(mtcars, old = "cyl2", new = "cyl_fac")
setattr(mtcars$cyl_fac, "levels", c("four", "six", "eight"))
# Each of these statements copies the data.table names(mtcars)[names(mtcars) == "cyl_fac"] <- "cf"
levels(mtcars$cf) <- c("IV", "VI", "VIII")
affects the object in all environments.
# This function also changes the levels in the global environment
edit_levels <- function(x) setattr(x, "levels", c("low", "med", "high")) edit_levels(mtcars$cyl_factor)
Section 23.4: Writing code compatible with both data.frame and data.table
Differences in subsetting syntax
A data.table is one of several two-dimensional data structures available in R, besides data.frame, matrix and (2D) array. All of these classes use a very similar but not identical syntax for subsetting, the A[rows, cols] schema.
Consider the following data stored in a matrix, a data.frame and a data.table:
ma <- matrix(rnorm(12), nrow=4, dimnames=list(letters[1:4], c('X', 'Y', 'Z'))) df <- as.data.frame(ma)
dt <- as.data.table(ma)
ma[2:3] #---> returns the 2nd and 3rd items, as if 'ma' were a vector (because it is!) df[2:3] #---> returns the 2nd and 3rd columns
dt[2:3] #---> returns the 2nd and 3rd rows!
If you want to be sure of what will be returned, it is better to be explicit.
To get specific rows, just add a comma after the range:
ma[2:3, ] # \
df[2:3, ] # }---> returns the 2nd and 3rd rows dt[2:3, ] # /
But, if you want to subset columns, some cases are interpreted differently. All three can be subset the same way with integer or character indices not stored in a variable.
ma[, 2:3] # \ df[, 2:3] # \
dt[, 2:3] # }---> returns the 2nd and 3rd columns ma[, c("Y", "Z")] # /
df[, c("Y", "Z")] # / dt[, c("Y", "Z")] # /
However, they differ for unquoted variable names mycols <- 2:3
ma[, mycols] # \
df[, mycols] # }---> returns the 2nd and 3rd columns dt[, mycols, with = FALSE] # /
dt[, mycols] # ---> Raises an error
In the last case, mycols is evaluated as the name of a column. Because dt cannot find a column named mycols, an error is raised.
Note: For versions of the data.table package priorto 1.9.8, this behavior was slightly different. Anything in the column index would have been evaluated using dt as an environment. So both dt[, 2:3] and dt[, mycols] would
return the vector 2:3. No error would be raised for the second case, because the variable mycols does exist in the parent environment.
Strategies for maintaining compatibility with data.frame and data.table
There are many reasons to write code that is guaranteed to work with data.frame and data.table. Maybe you are forced to use data.frame, or you may need to share some code that you don't know how will be used. So, there are some main strategies for achieving this, in order of convenience:
Use syntax that behaves the same for both classes.
1.
Use a common function that does the same thing as the shortest syntax.
2.
Force data.table to behave as data.frame (ex.: call the specific method print.data.frame).
3.
Treat them as list, which they ultimately are.
4.
Convert the table to a data.frame before doing anything (bad idea if it is a huge table).
5.
Convert the table to data.table, if dependencies are not a concern.
6.
Subset rows. Its simple, just use the [, ] selector, with the comma:
A[1:10, ]
A[A$var > 17, ] # A[var > 17, ] just works for data.table
Subset columns. If you want a single column, use the $ or the [[ ]] selector:
A$var
colname <- 'var' A[[colname]]
A[[1]]
If you want a uniform way to grab more than one column, it's necessary to appeal a bit:
B <- `[.data.frame`(A, 2:4)
# We can give it a better name select <- `[.data.frame`
B <- select(A, 2:4)
C <- select(A, c('foo', 'bar'))
Subset 'indexed' rows. While data.frame has row.names, data.table has its unique key feature. The best thing is to avoid row.names entirely and take advantage of the existing optimizations in the case of data.table when possible.
B <- A[A$var != 0, ]
# or...
B <- with(A, A[var != 0, ]) # data.table will silently index A by var before subsetting stuff <- c('a', 'c', 'f')
C <- A[match(stuff, A$name), ] # really worse than: setkey(A); A[stuff, ]
Get a 1-column table, get a row as a vector. These are easy with what we have seen until now:
B <- select(A, 2) #---> a table with just the second column
C <- unlist(A[1, ]) #---> the first row as a vector (coerced if necessary)
Section 23.5: Setting keys in data.table
Yes, you need to SETKEY pre 1.9.6
In the past (pre 1.9.6), your data.table was sped up by setting columns as keys to the table, particularly for large tables. [See intro vignette page 5 of September 2015 version, where speed of search was 544 times better.] You may find older code making use of this setting keys with 'setkey' or setting a 'key=' column when setting up the table.
library(data.table) DT <- data.table(
x = letters[1:5], y = 5:1,
z = (1:5) > 3 )
#> DT
# x y z
#1: a 5 FALSE
#2: b 4 FALSE
#3: c 3 FALSE
#4: d 2 TRUE
#5: e 1 TRUE
Set your key with the setkey command. You can have a key with multiple columns.
setkey(DT, y)
Check your table's key in tables() tables()
> tables()
NAME NROW NCOL MB COLS KEY [1,] DT 5 3 1 x,y,z y Total: 1MB
Note this will re-sort your data.
#> DT
# x y z
#1: e 1 TRUE
#2: d 2 TRUE
#3: c 3 FALSE
#4: b 4 FALSE
#5: a 5 FALSE Now it is unnecessary
Prior to v1.9.6 you had to have set a key for certain operations especially joining tables. The developers of
data.table have sped up and introduced a "on=" feature that can replace the dependency on keys. See SO answer here for a detailed discussion.
In Jan 2017, the developers have written a vignette around secondary indices which explains the "on" syntax and allows for other columns to be identified for fast indexing.
Creating secondary indices?
In a manner similar to key, you can setindex(DT, key.col) or setindexv(DT, "key.col.string"), where DT is your data.table. Remove all indices with setindex(DT, NULL).
See your secondary indices with indices(DT). Why secondary indices?
This does not sort the table (unlike key), but does allow for quick indexing using the "on" syntax. Note there can be only one key, but you can use multiple secondary indices, which saves having to rekey and resort the table. This will speed up your subsetting when changing the columns you want to subset on.
Recall, in example above y was the key for table DT:
DT
# x y z
# 1: e 1 TRUE
# 2: d 2 TRUE
# 3: c 3 FALSE
# 4: b 4 FALSE
# 5: a 5 FALSE
# Let us set x as index setindex(DT, x)
# Use indices to see what has been set indices(DT)
# [1] "x"
# fast subset using index and not keyed column DT["c", on ="x"]
#x y z
#1: c 3 FALSE
# old way would have been rekeying DT from y to x, doing subset and
# perhaps keying back to y (now we save two sorts)
# This is a toy example above but would have been more valuable with big data sets
Chapter 24: Pivot and unpivot with data.table
Parameter Details
id.vars tell melt which columns to retain
variable.name tell melt what to call the column with category labels
value.name tell melt what to call the column that has values associated with category labels value.var tell dcast where to find the values to cast in columns
formula tell dcast which columns to retain to form a unique record identifier (LHS) and which one holds the category labels (RHS)
fun.aggregate specify the function to use when the casting operation generates a list of values in each cell