+ - 0:00:00
Notes for current slide
Notes for next slide

Advanced R for Econometricians

data.table

Martin C. Arnold, Jens Klenke

1 / 24

What is a data.table?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

2 / 24

What is a data.table?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

library(data.table)
DT <- data.table(a = rnorm(10), b = rnorm(10))
class(DT)
## [1] "data.table" "data.frame"
2 / 24

What is a data.table?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

library(data.table)
DT <- data.table(a = rnorm(10), b = rnorm(10))
class(DT)
## [1] "data.table" "data.frame"

Compared to tibbles which provide only some convenience functionality over data frames, data.tables are more like tibbles with integrated dplyr.

2 / 24

What is a data.table?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

library(data.table)
DT <- data.table(a = rnorm(10), b = rnorm(10))
class(DT)
## [1] "data.table" "data.frame"

Compared to tibbles which provide only some convenience functionality over data frames, data.tables are more like tibbles with integrated dplyr.

2 / 24

Creating a data.table

A data frame can be coerced to a data.table with as.data.table().

library(data.table)
df <- data.frame(a = rnorm(10), b = rnorm(10))
DT <- as.data.table(df)
class(DT)
## [1] "data.table" "data.frame"
3 / 24

Creating a data.table

A data frame can be coerced to a data.table with as.data.table().

library(data.table)
df <- data.frame(a = rnorm(10), b = rnorm(10))
DT <- as.data.table(df)
class(DT)
## [1] "data.table" "data.frame"

However, the preferred way is to use setDT().

setDT(df)
class(df)
## [1] "data.table" "data.frame"
3 / 24

Creating a data.table

A data frame can be coerced to a data.table with as.data.table().

library(data.table)
df <- data.frame(a = rnorm(10), b = rnorm(10))
DT <- as.data.table(df)
class(DT)
## [1] "data.table" "data.frame"

However, the preferred way is to use setDT().

setDT(df)
class(df)
## [1] "data.table" "data.frame"

What is the difference?
3 / 24

Modify by Reference

library(lobstr)
X <- data.frame(a = rnorm(10), b = rnorm(10))
obj_addr(X$a)
X <- as.data.table(X)
obj_addr(X$a)
## [1] "0x2252dac0"
## [1] "0x2255c560"
4 / 24

Modify by Reference

library(lobstr)
X <- data.frame(a = rnorm(10), b = rnorm(10))
obj_addr(X$a)
X <- as.data.table(X)
obj_addr(X$a)
## [1] "0x2252dac0"
## [1] "0x2255c560"
X <- data.frame(a = rnorm(10), b = rnorm(10))
obj_addr(X$a)
setDT(X)
obj_addr(X$a)
## [1] "0x225ba570"
## [1] "0x225ba570"
4 / 24

Why to use data.table

Compared to base R and the tidyverse the advantages of data.table are

  • speed
  • memory efficiency
  • compact syntax (which can be also a downside).

data.table can be used if the other options already exceeded their limits especially in terms of memory.

5 / 24

Why to use data.table

Compared to base R and the tidyverse the advantages of data.table are

  • speed
  • memory efficiency
  • compact syntax (which can be also a downside).

data.table can be used if the other options already exceeded their limits especially in terms of memory.

The main functionality can be compared to dplyr

  • subsetting (filter() and select())
  • updating (mutate())
  • grouping and summarising (group_by() and summarise())
  • merging (*_join())
5 / 24

Syntax

With dplyr we use for each operation a single purpose function. With data.table (almost) all instructions are written in [].

6 / 24

Syntax

With dplyr we use for each operation a single purpose function. With data.table (almost) all instructions are written in [].

The general form of data.table syntax is:

DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
6 / 24

Syntax

With dplyr we use for each operation a single purpose function. With data.table (almost) all instructions are written in [].

The general form of data.table syntax is:

DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?

Preparing a data.table:

library(ggplot2)
# setDT connot be used on data sets coming with packages
DT <- as.data.table(diamonds)
6 / 24

i: Filtering rows

  • i is used for filtering rows. Note that no $ is needed.
7 / 24

i: Filtering rows

  • i is used for filtering rows. Note that no $ is needed.
    DT[cut== "Fair" | cut == "Good"]
    ## carat cut color clarity depth table price x y z
    ## 1: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
    ## 2: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
    ## 3: 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
    ## 4: 0.30 Good J SI1 64.0 55 339 4.25 4.28 2.73
    ## 5: 0.30 Good J SI1 63.4 54 351 4.23 4.29 2.70
    ## ---
    ## 6512: 0.80 Good G VS2 64.2 58 2753 5.84 5.81 3.74
    ## 6513: 0.84 Good I VS1 63.7 59 2753 5.94 5.90 3.77
    ## 6514: 0.74 Good D SI1 63.1 59 2753 5.71 5.74 3.61
    ## 6515: 0.79 Good F SI1 58.1 59 2756 6.06 6.13 3.54
    ## 6516: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
7 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

8 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

# one key column
setkey(DT,cut)
DT["Fair"]
# multiple key columns
setkey(DT,cut,color)
DT[list("Fair", "J")]

8 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

# one key column
setkey(DT,cut)
DT["Fair"]
# multiple key columns
setkey(DT,cut,color)
DT[list("Fair", "J")]

.(...) is a convenient alias for list(...).
8 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

# one key column
setkey(DT,cut)
DT["Fair"]
# multiple key columns
setkey(DT,cut,color)
DT[list("Fair", "J")]

.(...) is a convenient alias for list(...).

DT[.("Fair", "J")]
8 / 24

Lookup

A list or a data.table can be passed to i to run a lookup. (This is actually a join as we will see later.)

# Lookup table:
LU <- data.table(color = c("E", "I") , clarity = c("SI2", "VS2"))
9 / 24

Lookup

A list or a data.table can be passed to i to run a lookup. (This is actually a join as we will see later.)

# Lookup table:
LU <- data.table(color = c("E", "I") , clarity = c("SI2", "VS2"))

For every row in LU, look up corresponding rows in DT, using the variables specified in "on".

DT[LU, on = c("color", "clarity")]
## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.20 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 3: 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
## 4: 0.74 Ideal E SI2 62.2 56 2761 5.80 5.84 3.62
## 5: 1.01 Fair E SI2 67.4 60 2797 6.19 6.05 4.13
## ---
## 2878: 0.79 Ideal I VS2 63.0 58 2712 5.85 5.89 3.70
## 2879: 0.81 Very Good I VS2 64.0 60 2714 5.84 5.88 3.75
## 2880: 0.82 Premium I VS2 62.2 59 2716 6.03 5.97 3.73
## 2881: 0.73 Ideal I VS2 61.3 56 2756 5.80 5.84 3.57
## 2882: 0.73 Ideal I VS2 61.6 55 2756 5.82 5.84 3.59
9 / 24

j: Subsetting columns

j can be used for simple subsetting in the following ways:

  • Returning a single vector.
    DT[ , price]
10 / 24

j: Subsetting columns

j can be used for simple subsetting in the following ways:

  • Returning a single vector.

    DT[ , price]
  • Returning a data.table with only one column.

    DT[ , .(price)]
10 / 24

j: Subsetting columns

j can be used for simple subsetting in the following ways:

  • Returning a single vector.

    DT[ , price]
  • Returning a data.table with only one column.

    DT[ , .(price)]
  • Returning data.table with multiple columns.

    DT[ , .(price, depth)]
    DT[ , 1:2]
    DT[ , c("price", "depth")]
10 / 24

j: Subsetting columns

  • If the column names are stored in a variable use
11 / 24

j: Subsetting columns

  • If the column names are stored in a variable use
cols <- c("price", "depth")
DT[ , ..cols] # or
DT[ , cols, with = FALSE]
11 / 24

j: Add and Remove Columns by Reference

  • Add a variable.
    DT[ ,price_eur := price/0.91]
12 / 24

j: Add and Remove Columns by Reference

  • Add a variable.

    DT[ ,price_eur := price/0.91]
  • Add multiple variables

    # Here, we need to quote at the left hand side:
    DT[ ,c("x","y") := .(log(price), price - mean(price))]
12 / 24

j: Add and Remove Columns by Reference

  • Add a variable.

    DT[ ,price_eur := price/0.91]
  • Add multiple variables

    # Here, we need to quote at the left hand side:
    DT[ ,c("x","y") := .(log(price), price - mean(price))]
  • Remove one variable.

    DT[ , price_eur := NULL]
12 / 24

j: Add and Remove Columns by Reference

  • Add a variable.

    DT[ ,price_eur := price/0.91]
  • Add multiple variables

    # Here, we need to quote at the left hand side:
    DT[ ,c("x","y") := .(log(price), price - mean(price))]
  • Remove one variable.

    DT[ , price_eur := NULL]
  • Remove multiple variables.

    DT[ , c("x", "price_eur") := NULL]
12 / 24

Update Columns by Reference

  • Simple update
    DT[price > 2500, price := 2500]
13 / 24

Update Columns by Reference

  • Simple update

    DT[price > 2500, price := 2500]

  • Updating multiple columns.

    DT[price > 2500, c("price","depth") := .(log(price), sqrt(depth))]
13 / 24

j: Evaluate Expressions

j can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()).

14 / 24

j: Evaluate Expressions

j can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()).

DT[ , mean(price) / sd(depth)]
## [1] 1264.789

The results are simply returned as a value which can be assigned to some variable.

14 / 24

j: Evaluate Expressions

j can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()).

DT[ , mean(price) / sd(depth)]
## [1] 1264.789

The results are simply returned as a value which can be assigned to some variable.

Several expressions can be used by putting them in a list. If a list is used, the result is returned as a data.table.

DT[ , .(mean_price = mean(price), sd_depth = sd(depth))]
## mean_price sd_depth
## 1: 1811.963 1.432621
14 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

15 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

  • Grouping by a single variable
    DT[ , .(mean_price = mean(price)), by = clarity]
15 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

  • Grouping by a single variable

    DT[ , .(mean_price = mean(price)), by = clarity]
  • Grouping by multiple variables.

    DT[ , .(mean_price = mean(price)), by = .(clarity, color)]
15 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

  • Grouping by a single variable

    DT[ , .(mean_price = mean(price)), by = clarity]
  • Grouping by multiple variables.

    DT[ , .(mean_price = mean(price)), by = .(clarity, color)]
  • Grouping by expressions.

    DT[ , mean(carat), by = list("Is price larger than 2300?" = price > 2300)]
15 / 24

keyby

Additionally to grouping keyby runs setkey() on the by columns → results are sorted.

16 / 24

keyby

Additionally to grouping keyby runs setkey() on the by columns → results are sorted.

# only the first and list five rows are printed
DT[ , mean(price), keyby = .(clarity, color)]
## clarity color V1
## 1: I1 D 2113.190
## 2: I1 E 2149.941
## 3: I1 F 2011.469
## 4: I1 G 2052.260
## 5: I1 H 2293.321
## ---
## 52: IF F 1381.714
## 53: IF G 1422.398
## 54: IF H 1411.729
## 55: IF I 1243.769
## 56: IF J 1349.686
16 / 24

.SD

Similar to dplyr::across() together with the predicate functions of the dplyr package we can aggregate multiple columns concisely using the special symbol .SD.

If nothing else is specified, .SD corresponds to all columns in the data.table. We can use this together with lapply to aggregate all columns using the same function.

17 / 24

.SD

Similar to dplyr::across() together with the predicate functions of the dplyr package we can aggregate multiple columns concisely using the special symbol .SD.

If nothing else is specified, .SD corresponds to all columns in the data.table. We can use this together with lapply to aggregate all columns using the same function.

numeric_DT <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))
numeric_DT[ ,lapply(.SD, mean)]
## a b c
## 1: 0.5251576 -0.1371821 -0.1494699
17 / 24

.SD

Similar to dplyr::across() together with the predicate functions of the dplyr package we can aggregate multiple columns concisely using the special symbol .SD.

If nothing else is specified, .SD corresponds to all columns in the data.table. We can use this together with lapply to aggregate all columns using the same function.

numeric_DT <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))
numeric_DT[ ,lapply(.SD, mean)]
## a b c
## 1: 0.5251576 -0.1371821 -0.1494699

We can however change .SD to only contain a subset of all columns using .SDcols.

num_cols <- names(DT)[sapply(DT, is.numeric)]
DT[ , lapply(.SD, mean), .SDcols = num_cols ]
## carat depth table price x y z
## 1: 0.7979397 61.7494 57.45718 1811.963 7.786768 -3.96194e-13 3.538734
17 / 24

Joins

library(dplyr)
members <- as.data.table(band_members)
instruments <- as.data.table(band_instruments)
18 / 24

Joins

library(dplyr)
members <- as.data.table(band_members)
instruments <- as.data.table(band_instruments)
  • Left join (members is the left table )
    instruments[members, on = "name"]
    ## name plays band
    ## 1: Mick <NA> Stones
    ## 2: John guitar Beatles
    ## 3: Paul bass Beatles
18 / 24

Joins

library(dplyr)
members <- as.data.table(band_members)
instruments <- as.data.table(band_instruments)
  • Left join (members is the left table )

    instruments[members, on = "name"]
    ## name plays band
    ## 1: Mick <NA> Stones
    ## 2: John guitar Beatles
    ## 3: Paul bass Beatles
  • Left join by reference

    members[instruments, band := band, on = "name"]
18 / 24

Joins

  • Inner join
    instruments[members, on = "name", nomatch = 0]
    ## name plays band
    ## 1: John guitar Beatles
    ## 2: Paul bass Beatles
19 / 24

Joins

  • Inner join

    instruments[members, on = "name", nomatch = 0]
    ## name plays band
    ## 1: John guitar Beatles
    ## 2: Paul bass Beatles
  • A full join cannot be achieved with standard data.table syntax. Use data.table::merge().

    merge(members, instruments, all = TRUE)
    ## name band plays
    ## 1: John Beatles guitar
    ## 2: Keith <NA> guitar
    ## 3: Mick Stones <NA>
    ## 4: Paul Beatles bass
19 / 24

Joins

  • Anti-Join
    merge(members, instruments, all = TRUE)[is.na(plays)]
    ## name band plays
    ## 1: Mick Stones <NA>
20 / 24

Joins: Overview


Join type DT data.table::merge() dplyr
INNER X[Y, nomatch=0] merge(X, Y, all=FALSE) inner_join(X,Y)
LEFT X[Y] merge(X, Y, all.x=TRUE) left_join(X,Y)
RIGHT Y[X] merge(X, Y, all.y=TRUE) right_join(X,Y)
FULL - merge(X, Y, all=TRUE) full_join(X,Y)
ANTI - merge(X, Y, all=TRUE)[is.na(...)] anti_join(X,Y)
21 / 24

Rolling Join

DT <- data.table(t = c(1.2, 2.8, 4.5, 10), value = c(3, 8, 4, 2), id_DT = 1:4)
LU <- data.table(id_LU = 1:3, t = c(3.3, 1.0, 9.0))
DT[LU, on = "t", roll = TRUE]
## t value id_DT id_LU
## 1: 3.3 8 2 1
## 2: 1.0 NA NA 2
## 3: 9.0 4 3 3
22 / 24

Rolling Join

DT <- data.table(t = c(1.2, 2.8, 4.5, 10), value = c(3, 8, 4, 2), id_DT = 1:4)
LU <- data.table(id_LU = 1:3, t = c(3.3, 1.0, 9.0))
DT[LU, on = "t", roll = TRUE]
## t value id_DT id_LU
## 1: 3.3 8 2 1
## 2: 1.0 NA NA 2
## 3: 9.0 4 3 3
  • Looks for closest value of t in DT which is smaller than t in LU.
22 / 24

Other Useful Functions from the data.table package

  • fread(), fwrite(): very fast data reading and writing
  • rbindlist(): Creates a data.table from a list of data.tables
  • rowid(): Creates groupwise row ids.
  • fsetdiff(DT1,DT2), funion(DT1,DT2), ...: Fast set operations for data.tables
  • uniqueN(cut): Number of unique elements
  • shift(): For lead and lag
23 / 24

Exercises

  1. Repeat the exercises from the chapter on dplyr using data.table.
  2. Compare the speed of reading in the Boston crime dataset with read.csv() and fread(). Use the function system.time() to find out how long it takes.
24 / 24

What is a data.table?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

2 / 24
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow