Introduction
deltaR provides an R interface to Delta Lake, the open-source storage layer that brings ACID transactions to data lakes. Built on the high-performance delta-rs Rust library, deltaR enables you to read and write Delta tables directly from R with minimal overhead.
What is Delta Lake?
Delta Lake is an open-source storage framework that enables building a lakehouse architecture. Key features include:
- ACID Transactions: Ensures data integrity even with concurrent reads and writes
- Time Travel: Access and restore previous versions of your data
- Schema Enforcement: Prevents bad data from being written
- Schema Evolution: Allows schema changes over time
- Scalable Metadata: Handles petabyte-scale tables with billions of files
Installation
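A typical installation might look like the following sketch; whether deltaR is available on CRAN or only from a development repository is an assumption to verify against the package's homepage.
# Install from CRAN (assuming deltaR is published there)
install.packages("deltaR")
# Or install a development version from GitHub
# ("owner/deltaR" is a placeholder - substitute the actual repository)
# remotes::install_github("owner/deltaR")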
Writing Data to Delta Tables
The primary function for writing data is write_deltalake(). It accepts data frames, Arrow tables, or any object that can be converted to an Arrow stream.
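Because Arrow objects are accepted directly, you can skip the data.frame step entirely. A minimal sketch using the arrow package (the table contents and path are illustrative):
# Build an Arrow table in memory and write it as a Delta table
library(arrow)
events <- arrow_table(
  event_id = 1:3,
  kind = c("click", "view", "click")
)
write_deltalake(events, "path/to/events_table")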
Creating a New Table
# Create sample data
sales_data <- data.frame(
order_id = 1:100,
customer_id = sample(1:20, 100, replace = TRUE),
product = sample(c("Widget", "Gadget", "Gizmo"), 100, replace = TRUE),
quantity = sample(1:10, 100, replace = TRUE),
price = round(runif(100, 10, 500), 2),
order_date = as.Date("2024-01-01") + sample(0:364, 100, replace = TRUE)
)
# Write to a Delta table
write_deltalake(sales_data, "path/to/sales_table")
Write Modes
deltaR supports four write modes:
# error (default): Fail if the table already exists
write_deltalake(sales_data, "path/to/table", mode = "error")
# append: Add new data to an existing table
new_sales <- data.frame(
order_id = 101:110,
customer_id = sample(1:20, 10, replace = TRUE),
product = sample(c("Widget", "Gadget", "Gizmo"), 10, replace = TRUE),
quantity = sample(1:10, 10, replace = TRUE),
price = round(runif(10, 10, 500), 2),
order_date = as.Date("2025-01-01") + sample(0:30, 10, replace = TRUE)
)
write_deltalake(new_sales, "path/to/sales_table", mode = "append")
# overwrite: Replace all data in the table
write_deltalake(sales_data, "path/to/sales_table", mode = "overwrite")
# ignore: Do nothing if the table already exists
write_deltalake(sales_data, "path/to/table", mode = "ignore")
Partitioned Tables
Partitioning improves query performance by organizing data into directories based on column values:
# Partition by date and product
write_deltalake(
sales_data,
"path/to/partitioned_sales",
partition_by = c("order_date", "product")
)
When you query a partitioned table with filters on partition columns, Delta Lake can skip reading irrelevant partitions entirely.
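Delta tables lay partitions out in hive-style directories (for example, product=Widget/), so one simple way to exploit this from R is to subset the file list before reading. This is only a sketch; note that partition values live in the directory names, not inside the Parquet files themselves:
# Open the partitioned table and list its files
dt <- delta_table("path/to/partitioned_sales")
files <- get_files(dt)
# Keep only files from the "Widget" partition before reading
widget_files <- files[grepl("product=Widget", files, fixed = TRUE)]
widget_data <- arrow::open_dataset(widget_files) |> dplyr::collect()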
Controlling File Size
For large datasets, you can control the target size of output files:
# Target 128 MB files
write_deltalake(
large_dataset,
"path/to/table",
target_file_size = 128 * 1024 * 1024
)
Reading Delta Tables
Opening a Table
Use delta_table() to open an existing Delta table:
# Open a Delta table
dt <- delta_table("path/to/sales_table")
# Get basic information
table_version(dt)
Table Metadata
# View the schema
get_schema(dt)
# Get table metadata (name, description, etc.)
get_metadata(dt)
# List partition columns
partition_columns(dt)
# List all files in the table
get_files(dt)
Reading Data
deltaR delegates reading to other libraries such as arrow, polars, or duckdb. Use get_files() to get the list of Parquet files and pass them to your preferred library:
# Get the list of Parquet files in the Delta table
files <- get_files(dt)
# Open with arrow (open_dataset() returns a lazy Dataset; the name
# sales_dataset avoids masking arrow's own arrow_table() function)
library(arrow)
sales_dataset <- open_dataset(files)
# Materialize as a data.frame (for small data)
df <- sales_dataset |> collect()
# Read with dplyr for filtering
library(dplyr)
high_value_orders <- open_dataset(files) |>
filter(price > 200) |>
select(order_id, customer_id, price) |>
collect()
# Alternative: Read with duckdb
library(duckdb)
con <- dbConnect(duckdb())
duckdb_register_arrow(con, "sales", sales_dataset)
result <- dbGetQuery(con, "SELECT * FROM sales WHERE price > 200")
dbDisconnect(con)
Time Travel
One of Delta Lake’s most powerful features is the ability to access historical versions of your data.
Loading Previous Versions
# Load a specific version
load_version(dt, version = 2)
# Read data from that version
files <- get_files(dt)
old_data <- arrow::open_dataset(files) |> collect()
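# To return to the latest version afterwards, reopen the table
# (a simple approach that uses only the functions shown in this guide)
dt <- delta_table("path/to/sales_table")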
# Load data as of a specific timestamp
load_datetime(dt, datetime = "2024-06-15T10:30:00Z")
Schema Evolution
Delta Lake supports schema evolution, allowing you to add new columns or change the schema over time.
Adding New Columns
# Original data
df1 <- data.frame(id = 1:5, name = letters[1:5])
write_deltalake(df1, "path/to/evolving_table")
# New data with an additional column
df2 <- data.frame(
id = 6:10,
name = letters[6:10],
score = runif(5)
)
# Use schema_mode = "merge" to add the new column
write_deltalake(
df2,
"path/to/evolving_table",
mode = "append",
schema_mode = "merge"
)
Overwriting the Schema
# Completely replace the schema
new_structure <- data.frame(
user_id = 1:5,
email = paste0(letters[1:5], "@example.com")
)
write_deltalake(
new_structure,
"path/to/table",
mode = "overwrite",
schema_mode = "overwrite"
)
Table Maintenance
Vacuum
Over time, Delta tables accumulate old files from previous versions. The vacuum() function removes files that are no longer needed:
dt <- delta_table("path/to/sales_table")
# Dry run - see what would be deleted
files_to_delete <- vacuum(dt, retention_hours = 168, dry_run = TRUE)
print(files_to_delete)
# Actually delete old files (default retention is 7 days = 168 hours)
vacuum(dt, retention_hours = 168, dry_run = FALSE)
Warning: Vacuuming removes the ability to time travel to versions older than the retention period. Choose your retention period carefully based on your needs.
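Since retention is expressed in hours, it can help to derive it from the time-travel window you want to preserve. For example, to keep 30 days of history:
# 30 days of history = 30 * 24 = 720 hours; dry run first to preview
vacuum(dt, retention_hours = 30 * 24, dry_run = TRUE)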
Creating Empty Tables
You can create an empty Delta table with a predefined schema:
# Define schema using nanoarrow
schema <- nanoarrow::na_struct(list(
id = nanoarrow::na_int64(),
name = nanoarrow::na_string(),
value = nanoarrow::na_double(),
created_at = nanoarrow::na_timestamp("us", timezone = "UTC")
))
# Create the table
create_deltalake(
"path/to/new_table",
schema,
name = "my_table",
description = "A table for storing important data"
)
Checking if a Path is a Delta Table
# Check if a path contains a Delta table
is_delta_table_path("path/to/sales_table")
is_delta_table_path("path/to/regular_folder")
Best Practices
1. Choose Appropriate Partition Columns
- Partition on columns frequently used in filters
- Avoid high-cardinality columns (too many unique values)
- Consider date-based partitioning for time-series data
2. Regular Maintenance
- Run vacuum() periodically to clean up old files
- Monitor table size and file count
- Consider compaction for tables with many small files
3. Use Meaningful Table Names and Descriptions
write_deltalake(
sales_data,
"path/to/sales_table",
name = "daily_sales",
description = "Daily sales transactions from all stores"
)
Next Steps
- Read the Cloud Storage Guide to learn about using deltaR with S3, GCS, and Azure
- Explore the Function Reference for detailed API documentation
- Visit the Delta Lake documentation for more about the Delta Lake format
Acknowledgments
deltaR is built on the excellent delta-rs Rust library. We are grateful to the delta-rs maintainers and the broader Delta Lake community for their work.