Syntax for accessing rows and columns: [, [[, and $
This topic covers the most common syntax to access specific rows and columns of a data frame. These are:

Like a matrix with single brackets data[rows, columns]
    Using row and column numbers
    Using column (and row) names
Like a list:
    With single brackets data[columns] to get a data frame
    With double brackets data[[one_column]] to get a vector
    With $ for a single column data$column_name
We will use the built-in mtcars data frame to illustrate.
Like a matrix: data[rows, columns]
With numeric indexes
Using the built-in data frame mtcars, we can extract rows and columns using [] brackets with a comma included.
Indices before the comma are rows:
# get the first row
mtcars[1, ]
# get the first five rows
mtcars[1:5, ]
Similarly, after the comma are columns:
# get the first column
mtcars[, 1]
# get the first, third and fifth columns:
mtcars[, c(1, 3, 5)]
As shown above, if either rows or columns are left blank, all will be selected. mtcars[1, ] indicates the first row with all the columns.
So far, this is identical to how rows and columns of matrices are accessed. With data.frames, most of the time it is preferable to use a column name rather than a column index. This is done by using a character string with the column name instead of a number with the column position:
# get the mpg column
mtcars[, "mpg"]
# get the mpg, cyl, and disp columns
mtcars[, c("mpg", "cyl", "disp")]
Though less common, row names can also be used:
mtcars["Mazda RX4", ]
Rows and columns together
The row and column arguments can be used together:
# first four rows of the mpg column
mtcars[1:4, "mpg"]
# 2nd and 5th row of the mpg, cyl, and disp columns
mtcars[c(2, 5), c("mpg", "cyl", "disp")]
A warning about dimensions:
When using these methods, if you extract multiple columns, you will get a data frame back. However, if you extract a single column, you will get a vector, not a data frame under the default options.
## multiple columns returns a data frame
class(mtcars[, c("mpg", "cyl")])
# [1] "data.frame"
## single column returns a vector
class(mtcars[, "mpg"])
# [1] "numeric"
There are two ways around this. One is to treat the data frame as a list (see below), the other is to add a drop = FALSE argument. This tells R to not "drop the unused dimensions":
class(mtcars[, "mpg", drop = FALSE])
# [1] "data.frame"
Note that matrices work the same way - by default a single column or row will be a vector, but if you specify drop = FALSE you can keep it as a one-column or one-row matrix.
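The matrix behaviour described above can be checked with a small example. This is a minimal sketch; the matrix m and its column names a and b are made up for illustration.

```r
# a small 3 x 2 matrix with named columns (names are illustrative)
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# default: a single column is simplified to a vector, so dim() is NULL
one_col_vec <- m[, "a"]
is.null(dim(one_col_vec))
# [1] TRUE

# drop = FALSE keeps the result as a 3 x 1 matrix
one_col_mat <- m[, "a", drop = FALSE]
dim(one_col_mat)
# [1] 3 1

# the same applies to a single row
dim(m[1, , drop = FALSE])
# [1] 1 2
```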
Like a list
Data frames are essentially lists, i.e., they are a list of column vectors (that all must have the same length). Lists can be subset using single brackets [ for a sub-list, or double brackets [[ for a single element.
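A quick way to see the list nature of a data frame is to apply list functions to it. The sketch below uses mtcars; the variable name col_classes is only illustrative.

```r
# a data frame is a list under the hood
is.list(mtcars)
# [1] TRUE

# its length is the number of columns, not the number of rows
length(mtcars) == ncol(mtcars)
# [1] TRUE

# list-style functions therefore operate column by column
col_classes <- sapply(mtcars, class)
col_classes[["mpg"]]
# [1] "numeric"
```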
With single brackets data[columns]
When you use single brackets and no commas, you will get columns back because data frames are lists of columns.
mtcars["mpg"]
mtcars[c("mpg", "cyl", "disp")]
my_columns <- c("mpg", "cyl", "hp")
mtcars[my_columns]
Single brackets like a list vs. single brackets like a matrix
The difference between data[columns] and data[, columns] is that when treating the data.frame as a list (no comma in the brackets) the object returned will be a data.frame. If you use a comma to treat the data.frame like a matrix then selecting a single column will return a vector but selecting multiple columns will return a data.frame.
## When selecting a single column
## like a list will return a data frame
class(mtcars["mpg"])
# [1] "data.frame"
## like a matrix will return a vector
class(mtcars[, "mpg"])
# [1] "numeric"
With double brackets data[[one_column]]
To extract a single column as a vector when treating your data.frame as a list, you can use double brackets [[. This will only work for a single column at a time.
# extract a single column by name as a vector
mtcars[["mpg"]]
# extract a single column by name as a data frame (as above)
mtcars["mpg"]
Using $ to access columns
A single column can be extracted using the magical shortcut $ without using a quoted column name:
# get the column "mpg"
mtcars$mpg
Columns accessed by $ will always be vectors, not data frames.
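As a recap of the single-column forms covered so far, the following sketch compares what each one returns for the mpg column of mtcars.

```r
# matrix style, single column: simplified to a vector
class(mtcars[, "mpg"])
# [1] "numeric"

# list style with single brackets: stays a data frame
class(mtcars["mpg"])
# [1] "data.frame"

# double brackets and $ both return the column itself
class(mtcars[["mpg"]])
# [1] "numeric"
identical(mtcars$mpg, mtcars[["mpg"]])
# [1] TRUE
```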
Drawbacks of $ for accessing columns
The $ can be a convenient shortcut, especially if you are working in an environment (such as RStudio) that will auto-complete the column name in this case. However, $ has drawbacks as well: it uses non-standard evaluation to avoid the need for quotes, which means it will not work if your column name is stored in a variable.
my_column <- "mpg"
# the below will not work: it looks for a column literally named "my_column" and returns NULL
mtcars$my_column
# but these will work
mtcars[, my_column] # vector
mtcars[my_column] # one-column data frame
mtcars[[my_column]] # vector
Due to these concerns, $ is best used in interactive R sessions when your column names are constant. For programmatic use, for example in writing a generalizable function that will be used on different data sets with different column names, $ should be avoided.
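To illustrate the programmatic case, here is a minimal sketch of a function that takes the column name as an argument. The helper column_mean() is hypothetical and not part of R; it only shows why [[ is the safer choice inside functions.

```r
# hypothetical helper: mean of a column whose name is passed as a string
column_mean <- function(data, column) {
  # [[ evaluates `column`, so the name can come from a variable
  mean(data[[column]])
}

# the same function works on different data sets and column names
column_mean(mtcars, "mpg")
column_mean(airquality, "Temp")
```

Writing data$column inside the function body would instead look for a column literally named "column".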
Also note that $ uses partial matching of names by default when extracting from recursive objects (except environments):
# gives you the values of the "mpg" column,
# as "mpg" is the only column name starting with "m"
mtcars$m
# gives you NULL,
# as "mtcars" has more than one column whose name starts with "d"
mtcars$d
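For comparison, double brackets match names exactly unless told otherwise. The sketch below assumes the unmodified built-in mtcars, in which mpg is the only column starting with "m".

```r
# $ partially matches: "m" is a unique prefix of "mpg"
identical(mtcars$m, mtcars$mpg)
# [1] TRUE

# [[ matches exactly by default, so the same prefix finds nothing
is.null(mtcars[["m"]])
# [1] TRUE

# partial matching with [[ has to be requested explicitly
identical(mtcars[["m", exact = FALSE]], mtcars$mpg)
# [1] TRUE
```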
Advanced indexing: negative and logical indices
Whenever we have the option to use numbers for an index, we can also use negative numbers to omit certain indices or a boolean (logical) vector to indicate exactly which items to keep.
Negative indices omit elements
mtcars[1, ] # first row
mtcars[ -1, ] # everything but the first row
mtcars[-(1:10), ] # everything except the first 10 rows
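Negative indices work for columns in the same way. The sketch below also shows one possible way to omit columns by name, since negative indices need numbers; the variable keep is only illustrative.

```r
# drop the first column, keep all rows
ncol(mtcars[, -1]) == ncol(mtcars) - 1
# [1] TRUE

# drop the first two columns at once
dropped_by_number <- mtcars[, -c(1, 2)]

# to omit columns by name, select the names that remain with setdiff()
keep <- setdiff(names(mtcars), names(mtcars)[1:2])
dropped_by_name <- mtcars[, keep]

identical(names(dropped_by_number), names(dropped_by_name))
# [1] TRUE
```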
Logical vectors indicate specific elements to keep
We can use a condition such as < to generate a logical vector, and extract only the rows that meet the condition:
# logical vector indicating TRUE when a row has mpg less than 15
# FALSE when a row has mpg >= 15
test <- mtcars$mpg < 15
# extract these rows from the data frame
mtcars[test, ]
We can also bypass the step of saving the intermediate variable:
# extract all columns for rows where the value of cyl is 4.
mtcars[mtcars$cyl == 4, ]
# extract the cyl, mpg, and hp columns where the value of cyl is 4
mtcars[mtcars$cyl == 4, c("cyl", "mpg", "hp")]
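Logical vectors can also be combined before they are used as an index. This is a minimal sketch using the element-wise operators &, | and !; the thresholds and variable names are arbitrary and chosen only for illustration.

```r
# & requires both conditions to be TRUE for a row to be kept
four_cyl_strong <- mtcars[mtcars$cyl == 4 & mtcars$hp > 100, ]

# | keeps a row when at least one condition is TRUE
extreme_mpg <- mtcars[mtcars$mpg < 15 | mtcars$mpg > 30, ]

# ! negates a logical vector: every row where cyl is not 4
not_four <- mtcars[!(mtcars$cyl == 4), ]
nrow(not_four) + nrow(mtcars[mtcars$cyl == 4, ]) == nrow(mtcars)
# [1] TRUE
```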
Section 18.3: Convenience functions to manipulate data.frames
Some convenience functions to manipulate data.frames are subset(), transform(), with() and within().
subset
The subset() function allows you to subset a data.frame in a more convenient way (subset also works with other classes):
subset(mtcars, subset = cyl == 6, select = c("mpg", "hp"))
#                 mpg  hp
# Mazda RX4      21.0 110
# Mazda RX4 Wag  21.0 110
# Hornet 4 Drive 21.4 110
# Valiant        18.1 105
# Merc 280       19.2 123
# Merc 280C      17.8 123
# Ferrari Dino   19.7 175
In the code above we are asking only for the rows in which cyl == 6 and for the columns mpg and hp. You could achieve the same result using [] with the following code:
mtcars[mtcars$cyl == 6, c("mpg", "hp")]
transform
The transform() function is a convenience function to change columns inside a data.frame. For instance, the following code adds another column named mpg2 with the result of mpg^2 to the mtcars data.frame:
mtcars <- transform(mtcars, mpg2 = mpg^2)
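transform() returns a modified copy, so the result can also be stored under a new name to leave the original untouched. The sketch below adds two columns at once and compares the result with the equivalent $ assignments; the names cars2, cars3 and hp_per_wt are illustrative only.

```r
# add two columns in one call, storing the result under a new name
cars2 <- transform(mtcars, mpg2 = mpg^2, hp_per_wt = hp / wt)

# the same columns added one by one with $ assignment
cars3 <- mtcars
cars3$mpg2 <- cars3$mpg^2
cars3$hp_per_wt <- cars3$hp / cars3$wt

all.equal(cars2, cars3)
# [1] TRUE
```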
with and within
Both with() and within() let you evaluate expressions inside the data.frame environment, allowing a somewhat cleaner syntax and saving you the use of some $ or [].
For example, if you want to create, change and/or remove multiple columns in the airquality data.frame:
aq <- within(airquality, {
    lOzone <- log(Ozone) # creates new column
    Month <- factor(month.abb[Month]) # changes Month column
    cTemp <- round((Temp - 32) * 5/9, 1) # creates new column
    S.cT <- Solar.R / cTemp # creates new column
    rm(Day, Temp) # removes columns
})
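The example above uses within(), which returns a modified copy of the data frame. For contrast, here is a minimal sketch of with(), which only returns the value of the expression; the variable names are illustrative.

```r
# with() evaluates an expression using the columns as variables
# and returns the result; the data frame itself is not changed
mean_mpg_4cyl <- with(mtcars, mean(mpg[cyl == 4]))

# the equivalent using $
identical(mean_mpg_4cyl, mean(mtcars$mpg[mtcars$cyl == 4]))
# [1] TRUE

# within() returns a modified copy of the data frame instead
aq2 <- within(airquality, cTemp <- round((Temp - 32) * 5/9, 1))
"cTemp" %in% names(aq2)
# [1] TRUE
"cTemp" %in% names(airquality)
# [1] FALSE
```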