Rust Data Processing with Polars
- What Makes Rust Different
- Variables and Mutability
- Basic Data Types
- Strings: Two Kinds
- Functions
- Control Flow
- Ownership: The Big Idea
- Structs: Custom Data Types
- Enums and Pattern Matching
- Option and Result: No Nulls, No Silent Errors
- The ? Operator: Error Handling Shorthand
- Common Collections
- Cargo: Rust's Build Tool & Package Manager
- To watch for:
- What Polars Is
- The Two Core Types
- Everything Returns a Result
- Reading and Writing Data
- Schemas: The Contract
- Selecting and Filtering
- Joins: Combining Tables
- Eager vs Lazy: The Big Distinction
- Common Operations Cheat Sheet
- Why Polars (vs pandas / Spark)
- How It Connects to the Rust Basics
- Demo
- A Deliberately Imperfect CSV
- Eager Read With Inferred Schema (the easy, dangerous path)
- Explicit Schema (the reliability lesson)
- This Is the Rust + Polars Point
- Error Handling as a First-Class Concern
- Lock It Down With a Test
- Handling Nulls on Purpose (not by accident)
- Section Takeaways
What Makes Rust Different
- Compiled and fast — compiles to native machine code, no runtime/GC
- Memory safe — the compiler prevents whole classes of bugs (null pointer errors, data races) before your program runs
- Strongly, statically typed — every value has a type known at compile time; the compiler catches mismatches early
Variables and Mutability
Variables are immutable by default. You opt into mutability with mut.
let x = 5; // immutable -- cannot be reassigned
let mut y = 10; // mutable
y = 20; // OK because of `mut`
// x = 6; // COMPILE ERROR: cannot assign twice to `x`
const MAX: u32 = 100_000; // constant: always immutable, type required
This default flips the usual expectation: you say up front what is allowed to change, which makes code easier to reason about.
Basic Data Types
Scalar types
- Integers: i32, i64, u32, u64… (i = signed, u = unsigned; number = bits). i32 is the default.
- Floats: f64 (default), f32
- Boolean: bool -> true / false
- Character: char -> a single Unicode character, in single quotes
let count: i64 = 42;
let price: f64 = 19.99;
let is_ready: bool = true;
let letter: char = 'A';
Compound types
- Tuple: fixed-size group of mixed types
- Array: fixed-size, all same type
let person: (i32, f64, char) = (30, 5.9, 'M');
let height = person.1; // access by index -> 5.9
let nums: [i32; 3] = [1, 2, 3]; // array of 3 i32s
let first = nums[0]; // -> 1
Strings: Two Kinds
- &str: a "string slice", usually a fixed/borrowed string literal
- String: an owned, growable string you can modify
let literal: &str = "hello"; // fixed text
let mut owned: String = String::from("hello");
owned.push_str(", world"); // can grow because it's owned
Functions
- Declared with fn
- Parameter types are required; return type comes after ->
- The last expression (no semicolon) is the return value
fn add(a: i32, b: i32) -> i32 {
a + b // no semicolon = this is the return value
}
fn greet(name: &str) { // no `->` means it returns nothing
println!("Hello, {name}!");
}
fn main() {
let sum = add(2, 3); // every program starts at main()
println!("Sum: {sum}");
greet("Aziz");
}
Note: println! is a macro (the ! gives it away), not a function.
Control Flow
if / else (it's an expression!)
let n = 7;
if n % 2 == 0 {
println!("even");
} else {
println!("odd");
}
// Because `if` returns a value, you can assign with it:
let label = if n > 5 { "big" } else { "small" };
Loops
// loop: runs forever until you `break`
let mut i = 0;
loop {
if i >= 3 { break; }
i += 1;
}
// while
let mut c = 3;
while c > 0 {
println!("{c}");
c -= 1;
}
// for: the most common -- iterate over a range or collection
for k in 0..3 { // 0, 1, 2 (end-exclusive)
println!("k = {k}");
}
Ownership: The Big Idea
Rust's headline feature. Three rules:
- Each value has one owner
- There's only one owner at a time
- When the owner goes out of scope, the value is cleaned up
let s1 = String::from("hi");
let s2 = s1; // ownership MOVES to s2
// println!("{s1}"); // ERROR: s1 no longer valid
// To let another function use a value WITHOUT taking ownership,
// you *borrow* it with & (a reference):
fn length(s: &String) -> usize {
s.len() // reads s, doesn't own it
}
let word = String::from("rust");
let n = length(&word); // lend it; `word` still usable after
This is what lets Rust guarantee memory safety with no garbage collector. It's the part that takes the most getting used to.
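The borrow above is read-only. A minimal sketch of the other half, the mutable borrow (&mut), might look like this. The helper add_suffix and the variable names are made up for illustration.

```rust
// Hypothetical helper: borrows a String mutably and edits it in place.
fn add_suffix(s: &mut String) {
    s.push_str("acean");
}

fn main() {
    let mut word = String::from("rust"); // must be `mut` to be lent mutably
    add_suffix(&mut word);               // lend it for writing
    println!("{word}");                  // `word` still owns the text here
}
```

The caller keeps ownership throughout; the function only gets temporary write access.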
Structs: Custom Data Types
struct Order {
id: i64,
amount: f64,
shipped: bool,
}
let o = Order { id: 1, amount: 42.5, shipped: true };
println!("Order {} costs {}", o.id, o.amount);
Enums and Pattern Matching
Enums let a value be one of several variants; match handles each.
enum Status {
Pending,
Shipped,
Cancelled,
}
let s = Status::Shipped;
match s {
Status::Pending => println!("waiting"),
Status::Shipped => println!("on the way"),
Status::Cancelled => println!("nope"),
}
match must be exhaustive: handle every case or the code won't
compile. Another way the compiler stops you forgetting things.
Option and Result: No Nulls, No Silent Errors
Rust has no null. Instead:
- Option: a value that's either Some(x) or None
- Result: either Ok(x) or Err(e) (this is the basis of all the error handling in the Polars examples)
fn divide(a: f64, b: f64) -> Option<f64> {
if b == 0.0 { None } else { Some(a / b) }
}
match divide(10.0, 2.0) {
Some(result) => println!("Got {result}"),
None => println!("Can't divide by zero"),
}
The ? Operator: Error Handling Shorthand
On a Result, ? means
"give me the value, or return the error from this function."
use std::num::ParseIntError;
fn parse_and_double(text: &str) -> Result<i32, ParseIntError> {
let n = text.parse::<i32>()?; // if parse fails, return the Err
Ok(n * 2) // otherwise keep going
}
This is why read_orders(...)? reads cleanly: the ? quietly
propagates any failure instead of forcing a big match block.
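To see both paths through the ?, here is a small usage sketch. parse_and_double is repeated from above so the block compiles on its own; the inputs "21" and "abc" are arbitrary examples.

```rust
use std::num::ParseIntError;

// Same function as above, repeated so this block is self-contained.
fn parse_and_double(text: &str) -> Result<i32, ParseIntError> {
    let n = text.parse::<i32>()?;
    Ok(n * 2)
}

fn main() {
    // Valid input: `?` yields the parsed number and the function continues.
    match parse_and_double("21") {
        Ok(v) => println!("ok: {v}"),
        Err(e) => println!("error: {e}"),
    }
    // Invalid input: `?` returns the Err early, so `Ok(n * 2)` never runs.
    match parse_and_double("abc") {
        Ok(v) => println!("ok: {v}"),
        Err(e) => println!("error: {e}"),
    }
}
```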
Common Collections
- Vec: growable list (like a Python list)
- HashMap: key/value map (like a Python dict)
let mut v: Vec<i32> = Vec::new();
v.push(1);
v.push(2);
for item in &v { println!("{item}"); }
use std::collections::HashMap;
let mut scores = HashMap::new();
scores.insert("alice", 10);
scores.insert("bob", 7);
Cargo: Rust's Build Tool & Package Manager
The essentials:
cargo new my_project # create a new project
cargo build # compile
cargo run # compile + run
cargo test # run tests
cargo add polars # add a dependency to Cargo.toml
Dependencies (called "crates") are declared in Cargo.toml and pulled
from crates.io.
To watch for:
- Ownership / borrowing: the & and mut dance. Expect to fight it early; it clicks with practice.
- Two string types (String vs &str): convert with .to_string() or String::from(...).
- Immutable by default: forgetting mut is the most common early error.
- The compiler is your friend: Rust's error messages are unusually good. Read them; they often tell you the exact fix.
- Macros vs functions: println!, vec!, df! end in ! and behave a little differently from normal functions.
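The String / &str conversions mentioned in the list can be sketched as follows. The helper shout is hypothetical; it only illustrates that a function which just reads text can take &str and accept either kind of string.

```rust
// Hypothetical helper that only reads text, so it takes &str.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    let literal: &str = "hello";

    // &str -> String: both calls allocate an owned copy.
    let a: String = literal.to_string();
    let b: String = String::from(literal);

    // String -> &str: borrow it; no copy is made.
    let view: &str = a.as_str();
    println!("{view} {b}");

    // A &String coerces to &str at the call site.
    println!("{}", shout(&a));
}
```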
What Polars Is
- A DataFrame library for working with tabular data (rows and columns) — think spreadsheets or database tables, in code
- Written in Rust, built on Apache Arrow (a columnar memory format)
- Columnar: stores data by column, not by row — which is why column operations and analytics are fast
- Multithreaded by default: uses all your CPU cores without you asking
- Available from Rust directly, and from Python via bindings
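To make "stores data by column" concrete, here is a plain-Rust sketch of the two layouts. It is an illustration only, not how Polars or Arrow are implemented; Order, OrderColumns, and both functions are hypothetical names.

```rust
// Row-oriented: one struct per row.
struct Order {
    id: i64,
    amount: f64,
}

// Column-oriented: one Vec per column, values of a column stored side by side.
struct OrderColumns {
    id: Vec<i64>,
    amount: Vec<f64>,
}

fn total_from_rows(rows: &[Order]) -> f64 {
    // Walks every whole row just to reach each amount.
    rows.iter().map(|r| r.amount).sum()
}

fn total_from_columns(cols: &OrderColumns) -> f64 {
    // Reads one contiguous Vec<f64> and nothing else.
    cols.amount.iter().sum()
}

fn main() {
    let rows = vec![
        Order { id: 1, amount: 42.5 },
        Order { id: 2, amount: 17.0 },
    ];
    let cols = OrderColumns { id: vec![1, 2], amount: vec![42.5, 17.0] };
    println!("first id: {} / {}", rows[0].id, cols.id[0]);
    println!("totals: {} / {}", total_from_rows(&rows), total_from_columns(&cols));
}
```

Both functions compute the same total; the columnar version touches only the data the question is about.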
The Two Core Types
- Series: a single column of data, all the same type
- DataFrame: a collection of Series; the table itself
use polars::prelude::*;
// A Series is one named column.
let s = Series::new("amount".into(), &[42.5, 17.0, 9.99]);
// A DataFrame is built from columns. The df! macro is the easy way.
let df = df!(
"order_id" => &[1, 2, 3],
"amount" => &[42.5, 17.0, 9.99],
)?;
println!("{df}");
Note df! ends in !, so it's a macro, like println! and vec!.
Everything Returns a Result
Almost every Polars operation can fail (bad types, missing columns,
bad files), so it returns PolarsResult. That's why you see ?
everywhere in the workshop — it propagates errors instead of letting
them pass silently.
fn build() -> PolarsResult<DataFrame> {
let df = df!("a" => &[1, 2, 3])?; // ? unwraps or returns the error
Ok(df)
}
This ties straight back to Rust's Result and ?: bad data becomes
an error you must handle, not a silent NaN.
Reading and Writing Data
The four formats from the agenda:
// CSV in
let mut df = CsvReadOptions::default() // `mut` because the Parquet writer below takes `&mut df`
.with_has_header(true)
.try_into_reader_with_file_path(Some("orders.csv".into()))?
.finish()?;
// Parquet out
let mut file = std::fs::File::create("orders.parquet")?;
ParquetWriter::new(&mut file).finish(&mut df)?;
// Parquet in
let mut f = std::fs::File::open("orders.parquet")?;
let df = ParquetReader::new(&mut f).finish()?;
Key idea: Parquet stores the schema and types inside the file, so reading it back needs no guessing. CSV is plain text, so its column types must either be inferred or be given as an explicit schema.
Schemas: The Contract
A Schema declares each column's name and type up front. Give one to
a reader and bad data fails loudly instead of corrupting a column.
let mut schema = Schema::default();
schema.with_column("order_id".into(), DataType::Int64);
schema.with_column("amount".into(), DataType::Float64);
Common DataTypes: Int64, Float64, String, Boolean, Date.
Selecting and Filtering
You describe operations with expressions — col(...) refers to a
column, and you chain transformations.
let result = df
.clone()
.lazy()
.filter(col("status").eq(lit("shipped"))) // keep matching rows
.select([col("order_id"), col("amount")]) // pick columns
.collect()?; // run it
- col("x"): refer to column x
- lit("shipped"): a literal value to compare against
- .eq, .gt, .lt: comparison operators on expressions
Joins: Combining Tables
Match rows from two DataFrames on a shared key.
let joined = orders.join(
&customers,
["customer_id"], // key in left table
["customer_id"], // key in right table
JoinArgs::new(JoinType::Inner), // Inner / Left / Anti / ...
None,
)?;
Join types worth knowing:
- Inner: only rows that match in both
- Left: all left rows, nulls where no match
- Anti: left rows with no match (great as a data-quality check)
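The three join types can be pinned down with a plain-Rust sketch using a HashMap lookup. This illustrates the semantics only and is not how Polars executes joins. It assumes each customer_id appears at most once on the right side; the function names and sample data are hypothetical.

```rust
use std::collections::HashMap;

// Orders are (order_id, customer_id) pairs; customers map customer_id -> name.

// Inner: keep only orders whose customer exists.
fn inner_join(
    orders: &[(i64, i64)],
    customers: &HashMap<i64, &'static str>,
) -> Vec<(i64, &'static str)> {
    orders
        .iter()
        .filter_map(|(oid, cid)| customers.get(cid).map(|name| (*oid, *name)))
        .collect()
}

// Left: keep every order; a missing customer becomes None (a null).
fn left_join(
    orders: &[(i64, i64)],
    customers: &HashMap<i64, &'static str>,
) -> Vec<(i64, Option<&'static str>)> {
    orders
        .iter()
        .map(|(oid, cid)| (*oid, customers.get(cid).copied()))
        .collect()
}

// Anti: keep only the orders with NO matching customer.
fn anti_join(orders: &[(i64, i64)], customers: &HashMap<i64, &'static str>) -> Vec<i64> {
    orders
        .iter()
        .filter(|(_, cid)| !customers.contains_key(cid))
        .map(|(oid, _)| *oid)
        .collect()
}

fn main() {
    let orders: [(i64, i64); 3] = [(1, 100), (2, 101), (3, 999)];
    let customers: HashMap<i64, &'static str> =
        HashMap::from([(100, "alice"), (101, "bob")]);
    println!("inner: {:?}", inner_join(&orders, &customers));
    println!("left:  {:?}", left_join(&orders, &customers));
    println!("anti:  {:?}", anti_join(&orders, &customers));
}
```

The anti join surfaces order 3, whose customer does not exist, which is why it works as a data-quality check.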
Eager vs Lazy: The Big Distinction
- Eager: each operation runs immediately (DataFrame). Simple, good for small data and exploration.
- Lazy: you build a query plan, and nothing runs until .collect(). Polars then optimizes the whole plan (pushing filters down, reading only needed columns).
// Lazy: scan_* and .lazy() return a LazyFrame -- a plan, not data yet.
let plan = LazyCsvReader::new(PlPath::new("orders.csv"))
.with_has_header(true)
.finish()?
.filter(col("status").eq(lit("shipped")))
.select([col("order_id"), col("amount")]);
println!("{}", plan.clone().explain(true)?); // inspect the plan
let df = plan.collect()?; // NOW it runs
explain(true) returns the optimized plan as text, so you can see what the
engine decided to do before spending any compute.
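Rust's own iterators follow the same "nothing runs until collect" rule, which makes for a small stdlib-only analogy. This is only an analogy: a LazyFrame also optimizes the plan, which iterators do not. The function lazy_demo and its call counter are hypothetical.

```rust
use std::cell::Cell;

// Returns (filter calls before collect, filter calls after collect, kept values).
fn lazy_demo(amounts: &[f64]) -> (u32, u32, Vec<f64>) {
    let calls = Cell::new(0u32); // counts how often the filter closure runs

    // Build the "plan": the closure is stored, not executed.
    let plan = amounts.iter().copied().filter(|a| {
        calls.set(calls.get() + 1);
        *a > 10.0
    });
    let before = calls.get();

    // Like .collect() on a LazyFrame, this is where the work happens.
    let kept: Vec<f64> = plan.collect();
    let after = calls.get();

    (before, after, kept)
}

fn main() {
    let (before, after, kept) = lazy_demo(&[42.5, 17.0, 9.99]);
    println!("filter calls before collect: {before}");
    println!("filter calls after collect:  {after}");
    println!("kept: {kept:?}");
}
```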
Common Operations Cheat Sheet
df.height(); // number of rows
df.width(); // number of columns
df.column("amount")?; // get one column by name
df.head(Some(5)); // first 5 rows
df.get_column_names(); // column names
df.column("amount")?.dtype();// the column's data type
Why Polars (vs pandas / Spark)
- vs pandas: typically faster on multi-core machines (multithreaded, lazy query optimization) and usually lighter on memory; types are stricter (fewer silent surprises)
- vs Spark — no cluster needed for single-machine workloads; many "we need Spark" jobs are really "pandas was too slow on one box"
- Polars gives you performance plus correctness without distributed-systems overhead
How It Connects to the Rust Basics
- PolarsResult and ? = Rust's Result + ? operator
- &customers in a join = borrowing (reading without taking ownership)
- &mut df when writing Parquet = a mutable borrow
- df! = the ! macro syntax (col(...) and lit(...) are ordinary functions)
- Schemas and DataType = Rust's "everything has a known type" idea, applied to table columns
Demo
A Deliberately Imperfect CSV
Use a file with a mixed-type column, a null, and a bad row:
order_id,customer_id,amount,status
1,100,42.50,shipped
2,101,,pending
3,102,17.00,shipped
4,bad_id,9.99,shipped
Row 4 has a non-numeric customer_id. In a loose pipeline this
becomes a silent NaN or an object column. We want it to be loud.
Eager Read With Inferred Schema (the easy, dangerous path)
use polars::prelude::*;
fn main() -> PolarsResult<()> {
let df = CsvReadOptions::default()
.with_has_header(true)
.try_into_reader_with_file_path(Some("orders.csv".into()))?
.finish()?;
println!("{df}");
Ok(())
}
This works, but inference looked at a sample and guessed the types. On a different file, or with more rows, the guess can change. Inference is convenient and data-dependent: the same code can yield different column types on different inputs. That combination is what bites you in production.
Explicit Schema (the reliability lesson)
Stop guessing. State the contract:
use polars::prelude::*;
use std::sync::Arc;
fn read_orders(path: &str) -> PolarsResult<DataFrame> {
let mut schema = Schema::default();
schema.with_column("order_id".into(), DataType::Int64);
schema.with_column("customer_id".into(), DataType::Int64);
schema.with_column("amount".into(), DataType::Float64);
schema.with_column("status".into(), DataType::String);
CsvReadOptions::default()
.with_has_header(true)
.with_schema(Some(Arc::new(schema)))
.try_into_reader_with_file_path(Some(path.into()))?
.finish()
}
Now customer_id is declared Int64. The bad row (bad_id) can no
longer slip through as text: Polars returns an Err, not a quietly
corrupted column. The failure happens at read time, with a clear
cause, instead of three transformations later.
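What "fail at read time" means can be shown without Polars. The sketch below parses the demo rows into a typed struct and rejects the whole input on the first bad field. It is a hand-rolled illustration, not Polars' CSV reader: Order, parse_order, and parse_orders are hypothetical, there is no quoting or escaping support, and the header line is assumed to be stripped already.

```rust
#[derive(Debug, PartialEq)]
struct Order {
    order_id: i64,
    customer_id: i64,
    amount: Option<f64>, // an empty field is treated as a null
    status: String,
}

// Parse one comma-separated line against the declared types.
fn parse_order(line: &str) -> Result<Order, String> {
    let f: Vec<&str> = line.split(',').collect();
    if f.len() != 4 {
        return Err(format!("expected 4 fields, got {}", f.len()));
    }
    let order_id = f[0]
        .parse::<i64>()
        .map_err(|e| format!("order_id {:?}: {e}", f[0]))?;
    let customer_id = f[1]
        .parse::<i64>()
        .map_err(|e| format!("customer_id {:?}: {e}", f[1]))?;
    let amount = if f[2].is_empty() {
        None
    } else {
        Some(f[2].parse::<f64>().map_err(|e| format!("amount {:?}: {e}", f[2]))?)
    };
    Ok(Order { order_id, customer_id, amount, status: f[3].to_string() })
}

// All-or-nothing: the first bad row fails the whole read.
fn parse_orders(body: &str) -> Result<Vec<Order>, String> {
    body.lines().map(parse_order).collect()
}

fn main() {
    let csv = "1,100,42.50,shipped\n2,101,,pending\n3,102,17.00,shipped\n4,bad_id,9.99,shipped";
    match parse_orders(csv) {
        Ok(rows) => println!("loaded {} rows", rows.len()),
        Err(e) => eprintln!("rejected: {e}"),
    }
}
```

Collecting into Result<Vec<Order>, String> stops at the first Err, so row 4 rejects the file instead of producing a partly loaded table.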
This Is the Rust + Polars Point
- The schema is code — it is versioned, reviewed, and tested like any other contract
- finish() returns PolarsResult. There is no way to ignore a parse failure by accident: the ? forces you to handle it or propagate it
- Compare to a dynamically typed pipeline where a bad parse becomes NaN and flows downstream silently. Here, the type system and the error type make silence impossible.
Error Handling as a First-Class Concern
Show both behaviors so the audience feels the difference:
fn main() {
match read_orders("orders.csv") {
Ok(df) => println!("Loaded {} rows\n{df}", df.height()),
Err(e) => eprintln!("CSV failed its contract: {e}"),
}
}
In a pipeline, Err means the job stops here, loudly, with a
message, not at 3am, forty million rows in.
Lock It Down With a Test
The reliability theme made concrete — a test that asserts the contract, so a malformed upstream file fails in CI, not prod:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn schema_is_enforced() {
let df = read_orders("tests/data/orders_good.csv").unwrap();
assert_eq!(df.height(), 3);
assert_eq!(
df.column("amount").unwrap().dtype(),
&DataType::Float64
);
}
#[test]
fn bad_types_are_rejected() {
// The file with `bad_id` must NOT load silently.
assert!(read_orders("tests/data/orders_bad.csv").is_err());
}
}
bad_types_are_rejected is the whole philosophy in one test: we
assert that bad data fails. Most pipelines never write that test
because in their stack, bad data does not fail; it spreads.
Handling Nulls on Purpose (not by accident)
The empty amount on row 2 is a real null. Decide what it means
instead of letting a guess decide:
use polars::prelude::*;
fn parse_options() -> CsvParseOptions {
CsvParseOptions::default()
.with_null_values(Some(NullValues::AllColumns(
vec!["".into(), "NA".into(), "null".into()].into(),
)))
}
Operational clarity: nulls are a documented decision in the code, not an artifact of whatever the parser felt like doing.
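The same decision can be spelled out in plain Rust to show what it buys you downstream. This is a stdlib-only sketch, not Polars code: NULL_TOKENS mirrors the list passed to with_null_values above, and parse_amount is a hypothetical helper.

```rust
use std::num::ParseFloatError;

// The same null tokens as in parse_options() above.
const NULL_TOKENS: [&str; 3] = ["", "NA", "null"];

// A null token becomes None; anything else must parse as a number.
fn parse_amount(field: &str) -> Result<Option<f64>, ParseFloatError> {
    if NULL_TOKENS.contains(&field) {
        Ok(None)
    } else {
        field.parse::<f64>().map(Some)
    }
}

fn main() -> Result<(), ParseFloatError> {
    let fields = ["42.50", "", "17.00", "NA"];
    let mut amounts = Vec::new();
    for f in fields {
        amounts.push(parse_amount(f)?);
    }
    // The decision is explicit: missing amounts are counted, not summed as zero.
    let missing = amounts.iter().filter(|a| a.is_none()).count();
    let total: f64 = amounts.iter().flatten().sum();
    println!("total = {total}, missing = {missing}");
    Ok(())
}
```

Because the value is an Option<f64>, every later step has to say what it does with None, which is the "on purpose" part.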
Section Takeaways
- CSV is untyped and unsafe by default — treat every read as a boundary that must be validated
- Explicit schemas turn "hope it parses" into "it parses or it errors" — determinism over convenience
- PolarsResult makes failure explicit: you cannot reach the DataFrame without handling or propagating the error
- One test (bad_types_are_rejected) demonstrates the entire reliability thesis
- Rust + Polars matters here not because it is faster, but because it makes silent data corruption structurally hard

