data.table
data.table
?The data.table
package provides yet another alternative to data frames.
Similarly to tibbles
the data.table
class extends standard data frames.
data.table
?The data.table
package provides yet another alternative to data frames.
Similarly to tibbles
the data.table
class extends standard data frames.
library(data.table)DT <- data.table(a = rnorm(10), b = rnorm(10)) class(DT)
## [1] "data.table" "data.frame"
data.table
?The data.table
package provides yet another alternative to data frames.
Similarly to tibbles
the data.table
class extends standard data frames.
library(data.table)DT <- data.table(a = rnorm(10), b = rnorm(10)) class(DT)
## [1] "data.table" "data.frame"
Compared to tibbles
which provide only some convenience functionality over data frames,
data.tables
are more like tibbles
with integrated dplyr
.
data.table
?The data.table
package provides yet another alternative to data frames.
Similarly to tibbles
the data.table
class extends standard data frames.
library(data.table)DT <- data.table(a = rnorm(10), b = rnorm(10)) class(DT)
## [1] "data.table" "data.frame"
Compared to tibbles
which provide only some convenience functionality over data frames,
data.tables
are more like tibbles
with integrated dplyr
.
data.table
A data frame can be coerced to a data.table
with as.data.table()
.
library(data.table)df <- data.frame(a = rnorm(10), b = rnorm(10)) DT <- as.data.table(df)class(DT)
## [1] "data.table" "data.frame"
data.table
A data frame can be coerced to a data.table
with as.data.table()
.
library(data.table)df <- data.frame(a = rnorm(10), b = rnorm(10)) DT <- as.data.table(df)class(DT)
## [1] "data.table" "data.frame"
However, the preferred way is to use setDT()
.
setDT(df)class(df)
## [1] "data.table" "data.frame"
data.table
A data frame can be coerced to a data.table
with as.data.table()
.
library(data.table)df <- data.frame(a = rnorm(10), b = rnorm(10)) DT <- as.data.table(df)class(DT)
## [1] "data.table" "data.frame"
However, the preferred way is to use setDT()
.
setDT(df)class(df)
## [1] "data.table" "data.frame"
library(lobstr)X <- data.frame(a = rnorm(10), b = rnorm(10)) obj_addr(X$a)X <- as.data.table(X)obj_addr(X$a)
## [1] "0x2252dac0"## [1] "0x2255c560"
library(lobstr)X <- data.frame(a = rnorm(10), b = rnorm(10)) obj_addr(X$a)X <- as.data.table(X)obj_addr(X$a)
## [1] "0x2252dac0"## [1] "0x2255c560"
X <- data.frame(a = rnorm(10), b = rnorm(10)) obj_addr(X$a)setDT(X)obj_addr(X$a)
## [1] "0x225ba570"## [1] "0x225ba570"
data.table
Compared to base R
and the tidyverse
the advantages of data.table
are
data.table
can be used if the other options already exceeded their limits especially in terms of memory.
data.table
Compared to base R
and the tidyverse
the advantages of data.table
are
data.table
can be used if the other options already exceeded their limits especially in terms of memory.
The main functionality can be compared to dplyr
filter()
and select()
)mutate()
)group_by()
and summarise()
)*_join()
)With dplyr
we use for each operation a single purpose function.
With data.table
(almost) all instructions are written in []
.
With dplyr
we use for each operation a single purpose function.
With data.table
(almost) all instructions are written in []
.
The general form of data.table
syntax is:
DT[ i, j, by ] # + extra arguments | | | | | -------> grouped by what? | -------> what to do? ---> on which rows?
With dplyr
we use for each operation a single purpose function.
With data.table
(almost) all instructions are written in []
.
The general form of data.table
syntax is:
DT[ i, j, by ] # + extra arguments | | | | | -------> grouped by what? | -------> what to do? ---> on which rows?
Preparing a data.table:
library(ggplot2)# setDT connot be used on data sets coming with packages DT <- as.data.table(diamonds)
i
: Filtering rowsi
is used for filtering rows. Note that no $
is needed. i
: Filtering rowsi
is used for filtering rows. Note that no $
is needed. DT[cut== "Fair" | cut == "Good"]
## carat cut color clarity depth table price x y z## 1: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31## 2: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75## 3: 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49## 4: 0.30 Good J SI1 64.0 55 339 4.25 4.28 2.73## 5: 0.30 Good J SI1 63.4 54 351 4.23 4.29 2.70## --- ## 6512: 0.80 Good G VS2 64.2 58 2753 5.84 5.81 3.74## 6513: 0.84 Good I VS1 63.7 59 2753 5.94 5.90 3.77## 6514: 0.74 Good D SI1 63.1 59 2753 5.71 5.74 3.61## 6515: 0.79 Good F SI1 58.1 59 2756 6.06 6.13 3.54## 6516: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
When setting the key
attribute a data.table
is sorted in memory using a radix-sort. This can be used for very fast lookup.
When setting the key
attribute a data.table
is sorted in memory using a radix-sort. This can be used for very fast lookup.
# one key columnsetkey(DT,cut)DT["Fair"]# multiple key columnssetkey(DT,cut,color)DT[list("Fair", "J")]
When setting the key
attribute a data.table
is sorted in memory using a radix-sort. This can be used for very fast lookup.
# one key columnsetkey(DT,cut)DT["Fair"]# multiple key columnssetkey(DT,cut,color)DT[list("Fair", "J")]
.(...)
is a convenient alias for list(...)
. When setting the key
attribute a data.table
is sorted in memory using a radix-sort. This can be used for very fast lookup.
# one key columnsetkey(DT,cut)DT["Fair"]# multiple key columnssetkey(DT,cut,color)DT[list("Fair", "J")]
DT[.("Fair", "J")]
A list or a data.table
can be passed to i
to run a lookup. (This is actually a join as we will see later.)
# Lookup table: LU <- data.table(color = c("E", "I") , clarity = c("SI2", "VS2"))
A list or a data.table
can be passed to i
to run a lookup. (This is actually a join as we will see later.)
# Lookup table: LU <- data.table(color = c("E", "I") , clarity = c("SI2", "VS2"))
For every row in LU, look up corresponding rows in DT, using the variables specified in "on".
DT[LU, on = c("color", "clarity")]
## carat cut color clarity depth table price x y z## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43## 2: 0.20 Premium E SI2 60.2 62 345 3.79 3.75 2.27## 3: 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52## 4: 0.74 Ideal E SI2 62.2 56 2761 5.80 5.84 3.62## 5: 1.01 Fair E SI2 67.4 60 2797 6.19 6.05 4.13## --- ## 2878: 0.79 Ideal I VS2 63.0 58 2712 5.85 5.89 3.70## 2879: 0.81 Very Good I VS2 64.0 60 2714 5.84 5.88 3.75## 2880: 0.82 Premium I VS2 62.2 59 2716 6.03 5.97 3.73## 2881: 0.73 Ideal I VS2 61.3 56 2756 5.80 5.84 3.57## 2882: 0.73 Ideal I VS2 61.6 55 2756 5.82 5.84 3.59
j
can be used for simple subsetting in the following ways:
DT[ , price]
j
can be used for simple subsetting in the following ways:
Returning a single vector.
DT[ , price]
Returning a data.table
with only one column.
DT[ , .(price)]
j
can be used for simple subsetting in the following ways:
Returning a single vector.
DT[ , price]
Returning a data.table
with only one column.
DT[ , .(price)]
Returning data.table
with multiple columns.
DT[ , .(price, depth)]DT[ , 1:2]DT[ , c("price", "depth")]
cols <- c("price", "depth")DT[ , ..cols] # orDT[ , cols, with = FALSE]
DT[ ,price_eur := price/0.91]
Add a variable.
DT[ ,price_eur := price/0.91]
Add multiple variables
# Here, we need to quote at the left hand side: DT[ ,c("x","y") := .(log(price), price - mean(price))]
Add a variable.
DT[ ,price_eur := price/0.91]
Add multiple variables
# Here, we need to quote at the left hand side: DT[ ,c("x","y") := .(log(price), price - mean(price))]
Remove one variable.
DT[ , price_eur := NULL]
Add a variable.
DT[ ,price_eur := price/0.91]
Add multiple variables
# Here, we need to quote at the left hand side: DT[ ,c("x","y") := .(log(price), price - mean(price))]
Remove one variable.
DT[ , price_eur := NULL]
Remove multiple variables.
DT[ , c("x", "price_eur") := NULL]
DT[price > 2500, price := 2500]
Simple update
DT[price > 2500, price := 2500]
Updating multiple columns.
DT[price > 2500, c("price","depth") := .(log(price), sqrt(depth))]
j
can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()
).
j
can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()
).
DT[ , mean(price) / sd(depth)]
## [1] 1264.789
The results are simply returned as a value which can be assigned to some variable.
j
can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()
).
DT[ , mean(price) / sd(depth)]
## [1] 1264.789
The results are simply returned as a value which can be assigned to some variable.
Several expressions can be used by putting them in a list. If a list is used, the result is returned as a data.table
.
DT[ , .(mean_price = mean(price), sd_depth = sd(depth))]
## mean_price sd_depth## 1: 1811.963 1.432621
The by
argument allows operations by group (compare to dplyr::group_by()
).
The by
argument allows operations by group (compare to dplyr::group_by()
).
DT[ , .(mean_price = mean(price)), by = clarity]
The by
argument allows operations by group (compare to dplyr::group_by()
).
Grouping by a single variable
DT[ , .(mean_price = mean(price)), by = clarity]
Grouping by multiple variables.
DT[ , .(mean_price = mean(price)), by = .(clarity, color)]
The by
argument allows operations by group (compare to dplyr::group_by()
).
Grouping by a single variable
DT[ , .(mean_price = mean(price)), by = clarity]
Grouping by multiple variables.
DT[ , .(mean_price = mean(price)), by = .(clarity, color)]
Grouping by expressions.
DT[ , mean(carat), by = list("Is price larger than 2300?" = price > 2300)]
keyby
Additionally to grouping keyby
runs setkey()
on the by columns → results are sorted.
keyby
Additionally to grouping keyby
runs setkey()
on the by columns → results are sorted.
# only the first and list five rows are printedDT[ , mean(price), keyby = .(clarity, color)]
## clarity color V1## 1: I1 D 2113.190## 2: I1 E 2149.941## 3: I1 F 2011.469## 4: I1 G 2052.260## 5: I1 H 2293.321## --- ## 52: IF F 1381.714## 53: IF G 1422.398## 54: IF H 1411.729## 55: IF I 1243.769## 56: IF J 1349.686
Similar to dplyr::across()
together with the predicate functions of the dplyr
package we can aggregate multiple columns concisely using the special symbol .SD
.
If nothing else is specified, .SD
corresponds to all columns in the data.table
.
We can use this together with lapply
to aggregate all columns using the same function.
Similar to dplyr::across()
together with the predicate functions of the dplyr
package we can aggregate multiple columns concisely using the special symbol .SD
.
If nothing else is specified, .SD
corresponds to all columns in the data.table
.
We can use this together with lapply
to aggregate all columns using the same function.
numeric_DT <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))numeric_DT[ ,lapply(.SD, mean)]
## a b c## 1: 0.5251576 -0.1371821 -0.1494699
Similar to dplyr::across()
together with the predicate functions of the dplyr
package we can aggregate multiple columns concisely using the special symbol .SD
.
If nothing else is specified, .SD
corresponds to all columns in the data.table
.
We can use this together with lapply
to aggregate all columns using the same function.
numeric_DT <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))numeric_DT[ ,lapply(.SD, mean)]
## a b c## 1: 0.5251576 -0.1371821 -0.1494699
We can however change .SD
to only contain a subset of all columns using .SDcols
.
num_cols <- names(DT)[sapply(DT, is.numeric)] DT[ , lapply(.SD, mean), .SDcols = num_cols ]
## carat depth table price x y z## 1: 0.7979397 61.7494 57.45718 1811.963 7.786768 -3.96194e-13 3.538734
library(dplyr)members <- as.data.table(band_members)instruments <- as.data.table(band_instruments)
library(dplyr)members <- as.data.table(band_members)instruments <- as.data.table(band_instruments)
members
is the left table )instruments[members, on = "name"]
## name plays band## 1: Mick <NA> Stones## 2: John guitar Beatles## 3: Paul bass Beatles
library(dplyr)members <- as.data.table(band_members)instruments <- as.data.table(band_instruments)
Left join (members
is the left table )
instruments[members, on = "name"]
## name plays band## 1: Mick <NA> Stones## 2: John guitar Beatles## 3: Paul bass Beatles
Left join by reference
members[instruments, band := band, on = "name"]
instruments[members, on = "name", nomatch = 0]
## name plays band## 1: John guitar Beatles## 2: Paul bass Beatles
Inner join
instruments[members, on = "name", nomatch = 0]
## name plays band## 1: John guitar Beatles## 2: Paul bass Beatles
A full join cannot be achieved with standard data.table
syntax. Use data.table::merge()
.
merge(members, instruments, all = TRUE)
## name band plays## 1: John Beatles guitar## 2: Keith <NA> guitar## 3: Mick Stones <NA>## 4: Paul Beatles bass
merge(members, instruments, all = TRUE)[is.na(plays)]
## name band plays## 1: Mick Stones <NA>
Join type | DT | data.table::merge() | dplyr |
---|---|---|---|
INNER | X[Y, nomatch=0] | merge(X, Y, all=FALSE) | inner_join(X,Y) |
LEFT | X[Y] | merge(X, Y, all.x=TRUE) | left_join(X,Y) |
RIGHT | Y[X] | merge(X, Y, all.y=TRUE) | right_join(X,Y) |
FULL | - | merge(X, Y, all=TRUE) | full_join(X,Y) |
ANTI | - | merge(X, Y, all=TRUE)[is.na(...)] | anti_join(X,Y) |
DT <- data.table(t = c(1.2, 2.8, 4.5, 10), value = c(3, 8, 4, 2), id_DT = 1:4) LU <- data.table(id_LU = 1:3, t = c(3.3, 1.0, 9.0)) DT[LU, on = "t", roll = TRUE]
## t value id_DT id_LU## 1: 3.3 8 2 1## 2: 1.0 NA NA 2## 3: 9.0 4 3 3
DT <- data.table(t = c(1.2, 2.8, 4.5, 10), value = c(3, 8, 4, 2), id_DT = 1:4) LU <- data.table(id_LU = 1:3, t = c(3.3, 1.0, 9.0)) DT[LU, on = "t", roll = TRUE]
## t value id_DT id_LU## 1: 3.3 8 2 1## 2: 1.0 NA NA 2## 3: 9.0 4 3 3
t
in DT
which is smaller than t
in LU
.data.table
packagefread()
, fwrite()
: very fast data reading and writingrbindlist()
: Creates a data.table
from a list of data.tables
rowid():
Creates groupwise row ids.fsetdiff(DT1,DT2)
, funion(DT1,DT2)
, ...: Fast set operations for data.tables
uniqueN(cut)
: Number of unique elementsshift()
: For lead and lagdplyr
using data.table
. read.csv()
and fread()
. Use the
function system.time()
to find out how long it takes. data.table
?The data.table
package provides yet another alternative to data frames.
Similarly to tibbles
the data.table
class extends standard data frames.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |