Introduction

deltaR provides an R interface to Delta Lake, the open-source storage layer that brings ACID transactions to data lakes. Built on the high-performance delta-rs Rust library, deltaR enables you to read and write Delta tables directly from R with minimal overhead.

What is Delta Lake?

Delta Lake is an open-source storage framework that enables building a lakehouse architecture. Key features include:

  • ACID Transactions: Ensures data integrity even with concurrent reads and writes
  • Time Travel: Access and restore previous versions of your data
  • Schema Enforcement: Prevents bad data from being written
  • Schema Evolution: Allows schema changes over time
  • Scalable Metadata: Handles petabyte-scale tables with billions of files

Installation

Prerequisites

deltaR requires the Rust toolchain to compile from source:

# On macOS/Linux, install Rust via rustup:
# curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# On Windows, download and run rustup-init.exe from https://rustup.rs/

Install deltaR

# Install from GitHub (requires the remotes package: install.packages("remotes"))
remotes::install_github("ixpantia/deltaR")

Writing Data to Delta Tables

The primary function for writing data is write_deltalake(). It accepts data frames, Arrow tables, or any object that can be converted to an Arrow stream.

Creating a New Table

# Create sample data
sales_data <- data.frame(
  order_id = 1:100,
  customer_id = sample(1:20, 100, replace = TRUE),
  product = sample(c("Widget", "Gadget", "Gizmo"), 100, replace = TRUE),
  quantity = sample(1:10, 100, replace = TRUE),
  price = round(runif(100, 10, 500), 2),
  order_date = as.Date("2024-01-01") + sample(0:364, 100, replace = TRUE)
)

# Write to a Delta table
write_deltalake(sales_data, "path/to/sales_table")

Write Modes

deltaR supports four write modes:

# error (default): Fail if the table already exists
write_deltalake(sales_data, "path/to/table", mode = "error")

# append: Add new data to an existing table
new_sales <- data.frame(
  order_id = 101:110,
  customer_id = sample(1:20, 10, replace = TRUE),
  product = sample(c("Widget", "Gadget", "Gizmo"), 10, replace = TRUE),
  quantity = sample(1:10, 10, replace = TRUE),
  price = round(runif(10, 10, 500), 2),
  order_date = as.Date("2025-01-01") + sample(0:30, 10, replace = TRUE)
)
write_deltalake(new_sales, "path/to/sales_table", mode = "append")

# overwrite: Replace all data in the table
write_deltalake(sales_data, "path/to/sales_table", mode = "overwrite")

# ignore: Do nothing if the table already exists
write_deltalake(sales_data, "path/to/table", mode = "ignore")
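The four modes differ only in how they react to a table that already exists. That decision can be summarized in a small base R sketch. Note that resolve_write_action() is a hypothetical helper written for illustration, not a deltaR function, and the behavior for a missing table is an assumption.

```r
# Hypothetical helper (not part of deltaR): maps a write mode and whether the
# table already exists to the action described for that mode.
resolve_write_action <- function(mode = c("error", "append", "overwrite", "ignore"),
                                 table_exists) {
  mode <- match.arg(mode)
  if (!table_exists) {
    # Assumption: every mode creates the table when it does not exist yet.
    return("create")
  }
  switch(mode,
    error     = stop("table already exists", call. = FALSE),
    append    = "append",
    overwrite = "overwrite",
    ignore    = "skip"
  )
}

resolve_write_action("append", table_exists = TRUE)
resolve_write_action("ignore", table_exists = TRUE)
resolve_write_action("overwrite", table_exists = FALSE)
```

Only the "error" mode raises a condition; "ignore" returns quietly without touching the existing data.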

Partitioned Tables

Partitioning organizes data into directories based on column values, which can speed up queries that filter on those columns:

# Partition by date and product
write_deltalake(
  sales_data,
  "path/to/partitioned_sales",
  partition_by = c("order_date", "product")
)

When you query a partitioned table with filters on partition columns, Delta Lake can skip reading irrelevant partitions entirely.
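To make the directory layout concrete, the sketch below computes the Hive-style column=value directory each row would fall into. It assumes the conventional Hive-style layout used by Delta Lake writers and does not write any files. partition_dir() is a hypothetical helper, not part of deltaR.

```r
# Hypothetical helper: build the Hive-style directory (column=value/...) that
# each row would belong to. Illustrates the layout only; nothing is written.
partition_dir <- function(data, partition_by) {
  parts <- lapply(partition_by, function(col) {
    paste0(col, "=", as.character(data[[col]]))
  })
  do.call(file.path, parts)
}

orders <- data.frame(
  order_id = 1:4,
  product = c("Widget", "Gadget", "Widget", "Gizmo"),
  order_date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"))
)

dirs <- partition_dir(orders, c("order_date", "product"))
unique(dirs)

# A filter on a partition column only needs the matching directories:
unique(dirs[orders$product == "Widget"])
```

Every distinct combination of partition values becomes its own directory, which is why the number of distinct values matters when choosing partition columns.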

Controlling File Size

For large datasets, you can control the target size of output files:

# Target files of about 128 MB (128 * 1024^2 bytes); large_dataset stands in for your own data
write_deltalake(
  large_dataset,
  "path/to/table",
  target_file_size = 128 * 1024 * 1024
)
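To get a feel for what a given target means, the number of output files can be estimated from the expected on-disk size. This is only a back-of-the-envelope calculation: the 10 GiB figure is made up, and real file sizes depend on compression and partitioning.

```r
# Rough estimate only; the 10 GiB output size is a made-up example value.
target_file_size <- 128 * 1024^2   # 128 * 1024^2 bytes, as in the call above
estimated_bytes  <- 10 * 1024^3    # assume roughly 10 GiB of Parquet output

estimated_files <- ceiling(estimated_bytes / target_file_size)
estimated_files
```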

Reading Delta Tables

Opening a Table

Use delta_table() to open an existing Delta table:

# Open a Delta table
dt <- delta_table("path/to/sales_table")

# Get basic information
table_version(dt)

Table Metadata

# View the schema
get_schema(dt)

# Get table metadata (name, description, etc.)
get_metadata(dt)

# List partition columns
partition_columns(dt)

# List all files in the table
get_files(dt)

Reading Data

deltaR delegates reading to other libraries like arrow, polars, or duckdb. Use get_files() to get the list of Parquet files and pass them to your preferred library:

# Get the list of Parquet files in the Delta table
files <- get_files(dt)

# Open lazily with arrow (returns a Dataset; nothing is read yet)
library(arrow)
ds <- open_dataset(files)

# Materialize as a data.frame (for small data)
df <- ds |> collect()

# Filter and select with dplyr before collecting
library(dplyr)

high_value_orders <- open_dataset(files) |>
  filter(price > 200) |>
  select(order_id, customer_id, price) |>
  collect()

# Alternative: query the same Arrow dataset with duckdb
library(duckdb)
con <- dbConnect(duckdb())
duckdb_register_arrow(con, "sales", ds)
result <- dbGetQuery(con, "SELECT * FROM sales WHERE price > 200")
dbDisconnect(con, shutdown = TRUE)

Time Travel

One of Delta Lake’s most powerful features is the ability to access historical versions of your data.

Viewing History

# View the commit history
hist <- history(dt)
print(hist)

# Limit the number of history entries
recent_hist <- history(dt, limit = 5)

Loading Previous Versions

# Load a specific version
load_version(dt, version = 2)

# Read data from that version
files <- get_files(dt)
old_data <- arrow::open_dataset(files) |> collect()

# Load data as of a specific timestamp
load_datetime(dt, datetime = "2024-06-15T10:30:00Z")
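Conceptually, loading by timestamp resolves to the latest version committed at or before the requested time. The sketch below models that lookup on a made-up commit log. The column names version and timestamp are assumptions and may not match what history() returns, and version_as_of() is a hypothetical helper.

```r
# Made-up commit log for illustration; the column names are assumptions.
commits <- data.frame(
  version = 0:3,
  timestamp = as.POSIXct(
    c("2024-06-01 08:00:00", "2024-06-10 12:00:00",
      "2024-06-15 09:00:00", "2024-06-20 17:30:00"),
    tz = "UTC"
  )
)

# Hypothetical helper: latest version committed at or before `datetime`.
version_as_of <- function(commits, datetime) {
  eligible <- commits$version[commits$timestamp <= datetime]
  if (length(eligible) == 0) {
    stop("no version exists at or before the requested time", call. = FALSE)
  }
  max(eligible)
}

target <- as.POSIXct("2024-06-15 10:30:00", tz = "UTC")
version_as_of(commits, target)
```

With this log, the requested time falls between the commits of versions 2 and 3, so version 2 is selected.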

Use Cases for Time Travel

  • Auditing: Review data at any point in time
  • Debugging: Compare current data with previous versions
  • Recovery: Restore accidentally deleted or modified data
  • Reproducibility: Run analyses on historical snapshots

Schema Evolution

Delta Lake supports schema evolution, allowing you to add new columns or change the schema over time.

Adding New Columns

# Original data
df1 <- data.frame(id = 1:5, name = letters[1:5])
write_deltalake(df1, "path/to/evolving_table")

# New data with an additional column
df2 <- data.frame(
  id = 6:10,
  name = letters[6:10],
  score = runif(5)
)

# Use schema_mode = "merge" to add the new column
write_deltalake(
  df2,
  "path/to/evolving_table",
  mode = "append",
  schema_mode = "merge"
)
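After a merge, the table schema is the union of the old and new columns, and rows written before a column existed read back with missing values for it. The base R sketch below mimics that outcome on plain data frames. bind_with_merged_schema() is a hypothetical helper and says nothing about how deltaR implements the merge.

```r
# Hypothetical helper: row-bind two data frames after aligning both to the
# union of their columns, filling columns a frame lacks with NA.
bind_with_merged_schema <- function(old, new) {
  all_cols <- union(names(old), names(new))
  align <- function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA
    }
    df[all_cols]
  }
  rbind(align(old), align(new))
}

old_rows <- data.frame(id = 1:2, name = c("a", "b"))
new_rows <- data.frame(id = 3:4, name = c("c", "d"), score = c(0.5, 0.9))

merged <- bind_with_merged_schema(old_rows, new_rows)
merged
```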

Overwriting the Schema

# Completely replace the schema
new_structure <- data.frame(
  user_id = 1:5,
  email = paste0(letters[1:5], "@example.com")
)

write_deltalake(
  new_structure,
  "path/to/table",
  mode = "overwrite",
  schema_mode = "overwrite"
)

Table Maintenance

Vacuum

Over time, Delta tables accumulate old files from previous versions. The vacuum() function removes files that are no longer needed:

dt <- delta_table("path/to/sales_table")

# Dry run - see what would be deleted
files_to_delete <- vacuum(dt, retention_hours = 168, dry_run = TRUE)
print(files_to_delete)

# Actually delete old files (default retention is 7 days = 168 hours)
vacuum(dt, retention_hours = 168, dry_run = FALSE)

Warning: Vacuuming removes the ability to time travel to versions older than the retention period. Choose your retention period carefully based on your needs.
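The retention period translates into a cutoff timestamp: only files that stopped being referenced before the cutoff are old enough to delete. The sketch below works through that arithmetic on a made-up file list. The data frame and its column names are hypothetical and do not reflect the output of vacuum().

```r
# Made-up list of files no longer referenced by the current table version.
now <- as.POSIXct("2024-07-01 00:00:00", tz = "UTC")
unreferenced <- data.frame(
  path = c("part-0001.parquet", "part-0002.parquet", "part-0003.parquet"),
  removed_at = as.POSIXct(
    c("2024-06-10 00:00:00", "2024-06-25 00:00:00", "2024-06-30 12:00:00"),
    tz = "UTC"
  )
)

retention_hours <- 168
cutoff <- now - as.difftime(retention_hours, units = "hours")

# Only files removed before the cutoff are eligible for deletion.
eligible <- as.character(unreferenced$path[unreferenced$removed_at < cutoff])
eligible
```

Here the cutoff is seven days before the reference time, so only the file removed on 2024-06-10 qualifies. The two more recent files stay on disk and remain available for time travel.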

Creating Empty Tables

You can create an empty Delta table with a predefined schema:

# Define schema using nanoarrow
schema <- nanoarrow::na_struct(list(
  id = nanoarrow::na_int64(),
  name = nanoarrow::na_string(),
  value = nanoarrow::na_double(),
  created_at = nanoarrow::na_timestamp("us", timezone = "UTC")
))

# Create the table
create_deltalake(
  "path/to/new_table",
  schema,
  name = "my_table",
  description = "A table for storing important data"
)

Checking if a Path is a Delta Table

# Check if a path contains a Delta table
is_delta_table_path("path/to/sales_table")

is_delta_table_path("path/to/regular_folder")
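A Delta table is recognizable by the _delta_log directory at its root, which holds the transaction log. The sketch below is a simplified base R approximation of such a check. has_delta_log() is a hypothetical helper, and the real function may validate more than the presence of this directory.

```r
# Hypothetical helper: a simplified check that only looks for the
# `_delta_log` directory; the real function may validate more than this.
has_delta_log <- function(path) {
  dir.exists(file.path(path, "_delta_log"))
}

# Demonstrate on throwaway directories under tempdir()
table_dir <- file.path(tempdir(), "demo_delta_table")
plain_dir <- file.path(tempdir(), "demo_regular_folder")
dir.create(file.path(table_dir, "_delta_log"), recursive = TRUE, showWarnings = FALSE)
dir.create(plain_dir, showWarnings = FALSE)

has_delta_log(table_dir)
has_delta_log(plain_dir)
```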

Best Practices

1. Choose Appropriate Partition Columns

  • Partition on columns frequently used in filters
  • Avoid high-cardinality columns (too many unique values)
  • Consider date-based partitioning for time-series data
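One quick way to screen candidate columns is to count their distinct values before writing. The sketch below uses a small made-up data frame. What counts as too many distinct values depends on the size of the table, so treat the comparison at the end as an illustration rather than a rule.

```r
# Count distinct values per candidate partition column (made-up data).
orders <- data.frame(
  order_id = 1:6,
  product = c("Widget", "Gadget", "Widget", "Gizmo", "Gadget", "Widget"),
  order_month = c("2024-01", "2024-01", "2024-02", "2024-02", "2024-03", "2024-03")
)

cardinality <- vapply(orders, function(x) length(unique(x)), integer(1))
cardinality

# Columns with as many distinct values as rows (such as order_id) would create
# one partition per row and are poor partition keys.
names(cardinality)[cardinality == nrow(orders)]
```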

2. Use Arrow for Large Data

# Instead of loading everything into memory, use Arrow for lazy evaluation:
files <- get_files(dt)

# Use Arrow for filtering and aggregation:
result <- arrow::open_dataset(files) |>
  dplyr::filter(order_date >= as.Date("2024-06-01")) |>
  dplyr::summarise(total_sales = sum(price)) |>
  dplyr::collect()

3. Regular Maintenance

  • Run vacuum() periodically to clean up old files
  • Monitor table size and file count
  • Consider compaction for tables with many small files
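File count and size can be monitored with base R alone by scanning the table directory. The sketch below assumes a table stored on the local filesystem. It counts every Parquet file on disk, including files that are no longer referenced by the current version. summarise_parquet_files() is a hypothetical helper, not part of deltaR.

```r
# Hypothetical helper: summarise the Parquet files found under a directory.
summarise_parquet_files <- function(path) {
  files <- list.files(path, pattern = "\\.parquet$", recursive = TRUE, full.names = TRUE)
  sizes <- file.size(files)
  data.frame(
    n_files = length(files),
    total_bytes = sum(sizes),
    mean_bytes = if (length(files) > 0) mean(sizes) else NA_real_
  )
}

# Demonstrate on a throwaway directory containing two dummy files
demo_dir <- file.path(tempdir(), "demo_file_stats")
dir.create(demo_dir, showWarnings = FALSE)
writeBin(raw(100), file.path(demo_dir, "a.parquet"))
writeBin(raw(300), file.path(demo_dir, "b.parquet"))

stats <- summarise_parquet_files(demo_dir)
stats
```

A large file count combined with a small mean size suggests that the table has accumulated many small files.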

4. Use Meaningful Table Names and Descriptions

write_deltalake(
  sales_data,
  "path/to/sales_table",
  name = "daily_sales",
  description = "Daily sales transactions from all stores"
)

Acknowledgments

deltaR is built on the excellent delta-rs Rust library. We are grateful to the delta-rs maintainers and the broader Delta Lake community for their work.