Advanced R for Econometricians

Advanced R for Econometriciansdata.tableMartin C. Arnold, Jens Klenke1 / 24

What is a `data.table`?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

2 / 24

What is a `data.table`?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

library(data.table)
DT <- data.table(a = rnorm(10), b = rnorm(10)) 
class(DT)

## [1] "data.table" "data.frame"

2 / 24

What is a `data.table`?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

library(data.table)
DT <- data.table(a = rnorm(10), b = rnorm(10)) 
class(DT)

## [1] "data.table" "data.frame"

Compared to tibbles which provide only some convenience functionality over data frames, data.tables are more like tibbles with integrated dplyr.

2 / 24

What is a `data.table`?

The data.table package provides yet another alternative to data frames. Similarly to tibbles the data.table class extends standard data frames.

library(data.table)
DT <- data.table(a = rnorm(10), b = rnorm(10)) 
class(DT)

## [1] "data.table" "data.frame"

Compared to tibbles which provide only some convenience functionality over data frames, data.tables are more like tibbles with integrated dplyr.

Some Interesting Links

2 / 24

Creating a `data.table`

A data frame can be coerced to a data.table with as.data.table().

library(data.table)
df <- data.frame(a = rnorm(10), b = rnorm(10)) 
DT <- as.data.table(df)
class(DT)

## [1] "data.table" "data.frame"

3 / 24

Creating a `data.table`

A data frame can be coerced to a data.table with as.data.table().

library(data.table)
df <- data.frame(a = rnorm(10), b = rnorm(10)) 
DT <- as.data.table(df)
class(DT)

## [1] "data.table" "data.frame"

However, the preferred way is to use setDT().

setDT(df)
class(df)

## [1] "data.table" "data.frame"

3 / 24

Creating a `data.table`

A data frame can be coerced to a data.table with as.data.table().

library(data.table)
df <- data.frame(a = rnorm(10), b = rnorm(10)) 
DT <- as.data.table(df)
class(DT)

## [1] "data.table" "data.frame"

However, the preferred way is to use setDT().

setDT(df)
class(df)

## [1] "data.table" "data.frame"

What is the difference?

3 / 24

Modify by Referencelibrary(lobstr)
X <- data.frame(a = rnorm(10), b = rnorm(10)) 
obj_addr(X$a)
X <- as.data.table(X)
obj_addr(X$a)

## [1] "0x2252dac0"
## [1] "0x2255c560"
4 / 24

Modify by Referencelibrary(lobstr)
X <- data.frame(a = rnorm(10), b = rnorm(10)) 
obj_addr(X$a)
X <- as.data.table(X)
obj_addr(X$a)

## [1] "0x2252dac0"
## [1] "0x2255c560"
X <- data.frame(a = rnorm(10), b = rnorm(10)) 
obj_addr(X$a)
setDT(X)
obj_addr(X$a)

## [1] "0x225ba570"
## [1] "0x225ba570"
4 / 24

Why to use `data.table`

Compared to base R and the tidyverse the advantages of data.table are

speed
memory efficiency
compact syntax (which can be also a downside).

data.table can be used if the other options already exceeded their limits especially in terms of memory.

5 / 24

Why to use `data.table`

Compared to base R and the tidyverse the advantages of data.table are

speed
memory efficiency
compact syntax (which can be also a downside).

data.table can be used if the other options already exceeded their limits especially in terms of memory.

The main functionality can be compared to dplyr

subsetting (filter() and select())
updating (mutate())
grouping and summarising (group_by() and summarise())
merging (*_join())

5 / 24

Syntax

With dplyr we use for each operation a single purpose function. With data.table (almost) all instructions are written in [].

6 / 24

Syntax

With dplyr we use for each operation a single purpose function. With data.table (almost) all instructions are written in [].

The general form of data.table syntax is:

    DT[ i,  j,  by ] # + extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?

6 / 24

Syntax

With dplyr we use for each operation a single purpose function. With data.table (almost) all instructions are written in [].

The general form of data.table syntax is:

    DT[ i,  j,  by ] # + extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?

Preparing a data.table:

library(ggplot2)
# setDT connot be used on data sets coming with packages 
DT <- as.data.table(diamonds)

6 / 24

i: Filtering rowsi is used for filtering rows. Note that no $ is needed. 
7 / 24

`i`: Filtering rows

i is used for filtering rows. Note that no $ is needed.

DT[cut== "Fair" | cut == "Good"]

##       carat  cut color clarity depth table price    x    y    z
##    1:  0.23 Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##    2:  0.31 Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    3:  0.22 Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
##    4:  0.30 Good     J     SI1  64.0    55   339 4.25 4.28 2.73
##    5:  0.30 Good     J     SI1  63.4    54   351 4.23 4.29 2.70
##   ---                                                          
## 6512:  0.80 Good     G     VS2  64.2    58  2753 5.84 5.81 3.74
## 6513:  0.84 Good     I     VS1  63.7    59  2753 5.94 5.90 3.77
## 6514:  0.74 Good     D     SI1  63.1    59  2753 5.71 5.74 3.61
## 6515:  0.79 Good     F     SI1  58.1    59  2756 6.06 6.13 3.54
## 6516:  0.72 Good     D     SI1  63.1    55  2757 5.69 5.75 3.61

7 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

8 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

# one key column
setkey(DT,cut)
DT["Fair"]
# multiple key columns
setkey(DT,cut,color)
DT[list("Fair", "J")]

8 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

# one key column
setkey(DT,cut)
DT["Fair"]
# multiple key columns
setkey(DT,cut,color)
DT[list("Fair", "J")]

.(...) is a convenient alias for list(...).

8 / 24

i: keys

When setting the key attribute a data.table is sorted in memory using a radix-sort. This can be used for very fast lookup.

# one key column
setkey(DT,cut)
DT["Fair"]
# multiple key columns
setkey(DT,cut,color)
DT[list("Fair", "J")]

. (...)

is a convenient alias for

l i s t (...)

DT[.("Fair", "J")]

8 / 24

Lookup

A list or a data.table can be passed to i to run a lookup. (This is actually a join as we will see later.)

# Lookup table: 
LU <- data.table(color = c("E", "I") , clarity = c("SI2", "VS2"))

9 / 24

Lookup

A list or a data.table can be passed to i to run a lookup. (This is actually a join as we will see later.)

# Lookup table: 
LU <- data.table(color = c("E", "I") , clarity = c("SI2", "VS2"))

For every row in LU, look up corresponding rows in DT, using the variables specified in "on".

DT[LU, on = c("color", "clarity")]

##       carat       cut color clarity depth table price    x    y    z
##    1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##    2:  0.20   Premium     E     SI2  60.2    62   345 3.79 3.75 2.27
##    3:  0.86      Fair     E     SI2  55.1    69  2757 6.45 6.33 3.52
##    4:  0.74     Ideal     E     SI2  62.2    56  2761 5.80 5.84 3.62
##    5:  1.01      Fair     E     SI2  67.4    60  2797 6.19 6.05 4.13
##   ---                                                               
## 2878:  0.79     Ideal     I     VS2  63.0    58  2712 5.85 5.89 3.70
## 2879:  0.81 Very Good     I     VS2  64.0    60  2714 5.84 5.88 3.75
## 2880:  0.82   Premium     I     VS2  62.2    59  2716 6.03 5.97 3.73
## 2881:  0.73     Ideal     I     VS2  61.3    56  2756 5.80 5.84 3.57
## 2882:  0.73     Ideal     I     VS2  61.6    55  2756 5.82 5.84 3.59

9 / 24

j: Subsetting columns

j can be used for simple subsetting in the following ways:

Returning a single vector.
```
DT[ , price]
```

10 / 24

j: Subsetting columns

j can be used for simple subsetting in the following ways:

Returning a single vector.
```
DT[ , price]
```
Returning a data.table with only one column.
```
DT[ , .(price)]
```

10 / 24

j: Subsetting columns

j can be used for simple subsetting in the following ways:

Returning a single vector.
```
DT[ , price]
```
Returning a data.table with only one column.
```
DT[ , .(price)]
```

Returning data.table with multiple columns.

DT[ , .(price, depth)]
DT[ , 1:2]
DT[ , c("price", "depth")]

10 / 24

j: Subsetting columnsIf the column names are stored in a variable use
11 / 24

j: Subsetting columnsIf the column names are stored in a variable use
cols <- c("price", "depth")
DT[ , ..cols]                # or
DT[ , cols, with = FALSE]

11 / 24

j: Add and Remove Columns by Reference

Add a variable.
```
DT[ ,price_eur := price/0.91]
```

12 / 24

j: Add and Remove Columns by Reference

Add a variable.
```
DT[ ,price_eur := price/0.91]
```

Add multiple variables

# Here, we need to quote at the left hand side: 
DT[ ,c("x","y") := .(log(price), price - mean(price))]

12 / 24

j: Add and Remove Columns by Reference

Add a variable.
```
DT[ ,price_eur := price/0.91]
```

Add multiple variables

# Here, we need to quote at the left hand side: 
DT[ ,c("x","y") := .(log(price), price - mean(price))]

Remove one variable.
```
DT[ , price_eur := NULL]
```

12 / 24

j: Add and Remove Columns by Reference

Add a variable.
```
DT[ ,price_eur := price/0.91]
```

Add multiple variables

# Here, we need to quote at the left hand side: 
DT[ ,c("x","y") := .(log(price), price - mean(price))]

Remove one variable.
```
DT[ , price_eur := NULL]
```
Remove multiple variables.
```
DT[ , c("x", "price_eur") := NULL]
```

12 / 24

Update Columns by Reference

Simple update
```
DT[price > 2500, price := 2500]
```

13 / 24

Update Columns by Reference

Simple update
```
DT[price > 2500, price := 2500]
```

Updating multiple columns.

DT[price > 2500, c("price","depth") := .(log(price), sqrt(depth))]

13 / 24

j: Evaluate Expressions

j can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()).

14 / 24

j: Evaluate Expressions

j can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()).

DT[ , mean(price) / sd(depth)]

## [1] 1264.789

The results are simply returned as a value which can be assigned to some variable.

14 / 24

j: Evaluate Expressions

j can be used to evaluate expressions. Columns can be accessed as if they were in the global environment (compare to dplyr::summarise()).

DT[ , mean(price) / sd(depth)]

## [1] 1264.789

The results are simply returned as a value which can be assigned to some variable.

Several expressions can be used by putting them in a list. If a list is used, the result is returned as a data.table.

DT[ , .(mean_price = mean(price), sd_depth = sd(depth))]

##    mean_price sd_depth
## 1:   1811.963 1.432621

14 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

15 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

Grouping by a single variable

DT[ , .(mean_price = mean(price)), by = clarity]

15 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

Grouping by a single variable

DT[ , .(mean_price = mean(price)), by = clarity]

Grouping by multiple variables.

DT[ , .(mean_price = mean(price)), by = .(clarity, color)]

15 / 24

by: Grouping

The by argument allows operations by group (compare to dplyr::group_by()).

Grouping by a single variable

DT[ , .(mean_price = mean(price)), by = clarity]

Grouping by multiple variables.

DT[ , .(mean_price = mean(price)), by = .(clarity, color)]

Grouping by expressions.

DT[ , mean(carat), by = list("Is price larger than 2300?" = price > 2300)]

15 / 24

`keyby`

Additionally to grouping keyby runs setkey() on the by columns → results are sorted.

16 / 24

`keyby`

Additionally to grouping keyby runs setkey() on the by columns → results are sorted.

# only the first and list five rows are printed
DT[ , mean(price), keyby = .(clarity, color)]

##     clarity color       V1
##  1:      I1     D 2113.190
##  2:      I1     E 2149.941
##  3:      I1     F 2011.469
##  4:      I1     G 2052.260
##  5:      I1     H 2293.321
## ---                       
## 52:      IF     F 1381.714
## 53:      IF     G 1422.398
## 54:      IF     H 1411.729
## 55:      IF     I 1243.769
## 56:      IF     J 1349.686

16 / 24

.SD

Similar to dplyr::across() together with the predicate functions of the dplyr package we can aggregate multiple columns concisely using the special symbol .SD.

If nothing else is specified, .SD corresponds to all columns in the data.table. We can use this together with lapply to aggregate all columns using the same function.

17 / 24

.SD

Similar to dplyr::across() together with the predicate functions of the dplyr package we can aggregate multiple columns concisely using the special symbol .SD.

If nothing else is specified, .SD corresponds to all columns in the data.table. We can use this together with lapply to aggregate all columns using the same function.

numeric_DT <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))
numeric_DT[ ,lapply(.SD, mean)]

##            a          b          c
## 1: 0.5251576 -0.1371821 -0.1494699

17 / 24

.SD

Similar to dplyr::across() together with the predicate functions of the dplyr package we can aggregate multiple columns concisely using the special symbol .SD.

If nothing else is specified, .SD corresponds to all columns in the data.table. We can use this together with lapply to aggregate all columns using the same function.

numeric_DT <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))
numeric_DT[ ,lapply(.SD, mean)]

##            a          b          c
## 1: 0.5251576 -0.1371821 -0.1494699

We can however change .SD to only contain a subset of all columns using .SDcols.

num_cols <- names(DT)[sapply(DT, is.numeric)] 
DT[ , lapply(.SD, mean), .SDcols = num_cols ]

##        carat   depth    table    price        x            y        z
## 1: 0.7979397 61.7494 57.45718 1811.963 7.786768 -3.96194e-13 3.538734

17 / 24

Joins

library(dplyr)
members <- as.data.table(band_members)
instruments <- as.data.table(band_instruments)

18 / 24

Joins

library(dplyr)
members <- as.data.table(band_members)
instruments <- as.data.table(band_instruments)

Left join (members is the left table )

instruments[members, on = "name"]

##    name  plays    band
## 1: Mick   <NA>  Stones
## 2: John guitar Beatles
## 3: Paul   bass Beatles

18 / 24

Joins

library(dplyr)
members <- as.data.table(band_members)
instruments <- as.data.table(band_instruments)

Left join (members is the left table )

instruments[members, on = "name"]

##    name  plays    band
## 1: Mick   <NA>  Stones
## 2: John guitar Beatles
## 3: Paul   bass Beatles

Left join by reference

members[instruments, band := band, on = "name"]

18 / 24

Joins

Inner join

instruments[members, on = "name", nomatch = 0]

##    name  plays    band
## 1: John guitar Beatles
## 2: Paul   bass Beatles

19 / 24

Joins

Inner join

instruments[members, on = "name", nomatch = 0]

##    name  plays    band
## 1: John guitar Beatles
## 2: Paul   bass Beatles

A full join cannot be achieved with standard data.table syntax. Use data.table::merge().

merge(members, instruments, all = TRUE)

##     name    band  plays
## 1:  John Beatles guitar
## 2: Keith    <NA> guitar
## 3:  Mick  Stones   <NA>
## 4:  Paul Beatles   bass

19 / 24

Joins

Anti-Join

merge(members, instruments, all = TRUE)[is.na(plays)]

##    name   band plays
## 1: Mick Stones  <NA>

20 / 24

Joins: Overview

Join type	DT	data.table::merge()	dplyr
INNER	X[Y, nomatch=0]	merge(X, Y, all=FALSE)	inner_join(X,Y)
LEFT	X[Y]	merge(X, Y, all.x=TRUE)	left_join(X,Y)
RIGHT	Y[X]	merge(X, Y, all.y=TRUE)	right_join(X,Y)
FULL	-	merge(X, Y, all=TRUE)	full_join(X,Y)
ANTI	-	merge(X, Y, all=TRUE)[is.na(...)]	anti_join(X,Y)

21 / 24

Rolling JoinDT <- data.table(t = c(1.2, 2.8, 4.5, 10), value = c(3, 8, 4, 2), id_DT = 1:4) 
LU <- data.table(id_LU = 1:3, t = c(3.3, 1.0, 9.0)) 
DT[LU, on = "t", roll = TRUE]

##      t value id_DT id_LU
## 1: 3.3     8     2     1
## 2: 1.0    NA    NA     2
## 3: 9.0     4     3     3
22 / 24

Rolling JoinDT <- data.table(t = c(1.2, 2.8, 4.5, 10), value = c(3, 8, 4, 2), id_DT = 1:4) 
LU <- data.table(id_LU = 1:3, t = c(3.3, 1.0, 9.0)) 
DT[LU, on = "t", roll = TRUE]

##      t value id_DT id_LU
## 1: 3.3     8     2     1
## 2: 1.0    NA    NA     2
## 3: 9.0     4     3     3
Looks for closest value of t in DT which is smaller than t in LU.
22 / 24

Other Useful Functions from the data.table packagefread(), fwrite(): very fast data reading and writing
rbindlist(): Creates a data.table from a list of data.tables
rowid(): Creates groupwise row ids.
fsetdiff(DT1,DT2), funion(DT1,DT2), ...: Fast set operations for data.tables
uniqueN(cut): Number of unique elements
shift(): For lead and lag
23 / 24

ExercisesRepeat the exercises from the chapter on dplyr using data.table. 
Compare the speed of reading in the Boston crime dataset with read.csv() and fread(). Use the 
function system.time() to find out how long it takes. 
24 / 24

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Advanced R for Econometricians

data.table

Martin C. Arnold, Jens Klenke

What is a data.table?

What is a data.table?

What is a data.table?

What is a data.table?

Some Interesting Links

Creating a data.table

Creating a data.table

Creating a data.table

Modify by Reference

Modify by Reference

Why to use data.table

Why to use data.table

Syntax

Syntax

Syntax

i: Filtering rows

i: Filtering rows

i: keys

i: keys

i: keys

i: keys

Lookup

Lookup

j: Subsetting columns

j: Subsetting columns

j: Subsetting columns

j: Subsetting columns

j: Subsetting columns

j: Add and Remove Columns by Reference

j: Add and Remove Columns by Reference

j: Add and Remove Columns by Reference

j: Add and Remove Columns by Reference

Update Columns by Reference

Update Columns by Reference

j: Evaluate Expressions

j: Evaluate Expressions

j: Evaluate Expressions

by: Grouping

by: Grouping

by: Grouping

by: Grouping

keyby

keyby

.SD

.SD

.SD

Joins

Joins

Joins

Joins

Joins

Joins

Joins: Overview

Rolling Join

Rolling Join

Other Useful Functions from the data.table package

Exercises

What is a data.table?

Help

`data.table`

What is a `data.table`?

What is a `data.table`?

What is a `data.table`?

What is a `data.table`?

Creating a `data.table`

Creating a `data.table`

Creating a `data.table`

Why to use `data.table`

Why to use `data.table`

`i`: Filtering rows

`i`: Filtering rows

`keyby`

`keyby`

Other Useful Functions from the `data.table` package

What is a `data.table`?