Rust and Data Processing with Polars

A quick introduction to Rust basics together with data processing using Polars

Rust Data Processing with Polars

What Makes Rust Different

  • Compiled and fast: compiles to native machine code, with no garbage collector and only a minimal runtime
  • Memory safe: the compiler prevents whole classes of bugs (null pointer errors, data races) before your program runs
  • Strongly, statically typed: every value has a type known at compile time; the compiler catches mismatches early

Variables and Mutability

Variables are immutable by default. You opt into mutability with mut.

let x = 5;          // immutable -- cannot be reassigned
let mut y = 10;     // mutable
y = 20;             // OK because of `mut`
// x = 6;           // COMPILE ERROR: cannot assign twice to `x`

const MAX: u32 = 100_000;  // constant: always immutable, type required

This default flips the usual expectation: you say up front what is allowed to change, which makes code easier to reason about.

Basic Data Types

Scalar types

  • Integers: i32, i64, u32, u64 … (i = signed, u = unsigned; number = bits). i32 is the default.
  • Floats: f64 (default), f32
  • Boolean: bool -> true / false
  • Character: char -> a single Unicode character, in single quotes

let count: i64 = 42;
let price: f64 = 19.99;
let is_ready: bool = true;
let letter: char = 'A';

Compound types

  • Tuple: fixed-size group of mixed types
  • Array: fixed-size, all same type

let person: (i32, f64, char) = (30, 5.9, 'M');
let height = person.1;          // access by index -> 5.9

let nums: [i32; 3] = [1, 2, 3]; // array of 3 i32s
let first = nums[0];            // -> 1

Strings: Two Kinds

  • &str: a "string slice", a borrowed view of string data; string literals have this type
  • String: an owned, growable string you can modify

let literal: &str = "hello";          // fixed text
let mut owned: String = String::from("hello");
owned.push_str(", world");            // can grow because it's owned
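
As a rough sketch of how the two types work together: a function that takes &str accepts both a literal and a borrowed String. The helper name shout is made up for this example.

```rust
// `shout` is a hypothetical helper used only for illustration.
// Taking `&str` lets callers pass a literal or a borrowed `String`.
fn shout(text: &str) -> String {
    let mut out = text.to_uppercase(); // builds a new owned String
    out.push('!');
    out
}

fn main() {
    let literal: &str = "hello";
    let owned: String = String::from("world");

    println!("{}", shout(literal)); // a &str is passed directly
    println!("{}", shout(&owned));  // a &String coerces to &str

    // Converting &str -> String, then appending borrowed pieces.
    let joined = literal.to_string() + ", " + &owned;
    println!("{joined}");
}
```

Accepting &str in parameters is usually the more flexible choice, since it does not force the caller to allocate or give up a String.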

Functions

  • Declared with fn
  • Parameter types are required; return type comes after ->
  • The last expression (no semicolon) is the return value

fn add(a: i32, b: i32) -> i32 {
    a + b          // no semicolon = this is the return value
}

fn greet(name: &str) {   // no `->` means it returns the unit type `()`
    println!("Hello, {name}!");
}

fn main() {                  // every program starts at main()
    let sum = add(2, 3);
    println!("Sum: {sum}");
    greet("Aziz");
}

Note: println! is a macro (the ! gives it away), not a function.

Control Flow

if / else (it's an expression!)

let n = 7;
if n % 2 == 0 {
    println!("even");
} else {
    println!("odd");
}

// Because `if` returns a value, you can assign with it:
let label = if n > 5 { "big" } else { "small" };

Loops

// loop: runs forever until you `break`
let mut i = 0;
loop {
    if i >= 3 { break; }
    i += 1;
}

// while
let mut c = 3;
while c > 0 {
    println!("{c}");
    c -= 1;
}

// for: the most common -- iterate over a range or collection
for k in 0..3 {            // 0, 1, 2  (end-exclusive)
    println!("k = {k}");
}
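
The range example covers half of what for does. Below is a small sketch of the other half, iterating over a collection, plus a loop that hands back a value through break. Both helper names are hypothetical.

```rust
// Hypothetical helper: sums a slice with a `for` loop over a collection.
fn total(amounts: &[f64]) -> f64 {
    let mut sum = 0.0;
    for a in amounts {
        // `a` is a `&f64` borrowed from the slice
        sum += a;
    }
    sum
}

// Hypothetical helper: `loop` is an expression too; `break` can carry a value.
fn first_power_of_two_above(limit: u32) -> u32 {
    let mut p = 1;
    loop {
        if p > limit {
            break p; // this value becomes the result of the whole `loop`
        }
        p *= 2;
    }
}

fn main() {
    println!("{}", total(&[42.5, 17.0, 9.5]));
    println!("{}", first_power_of_two_above(100));
}
```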

Ownership: The Big Idea

Rust's headline feature. Three rules:

  1. Each value has one owner
  2. There's only one owner at a time
  3. When the owner goes out of scope, the value is cleaned up

let s1 = String::from("hi");
let s2 = s1;              // ownership MOVES to s2
// println!("{s1}");      // ERROR: s1 no longer valid

// To let another function use a value WITHOUT taking ownership,
// you *borrow* it with & (a reference):
fn length(s: &String) -> usize {
    s.len()              // reads s, doesn't own it
}
let word = String::from("rust");
let n = length(&word);   // lend it; `word` still usable after

This is what lets Rust guarantee memory safety with no garbage collector. It's the part that takes the most getting used to.
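
The example above shows a move and a shared borrow. A minimal sketch of the remaining cases, a mutable borrow and an explicit clone, might look like this (add_suffix and consume are illustrative names, not standard functions):

```rust
// Hypothetical helpers for illustration.
fn add_suffix(s: &mut String) {
    // mutable borrow: may modify the value, does not own it
    s.push_str("acean");
}

fn consume(s: String) -> usize {
    // takes ownership: the caller can no longer use the value
    s.len()
}

fn main() {
    let mut word = String::from("rust");
    add_suffix(&mut word); // lend it mutably
    println!("{word}");    // still ours after the call

    let copy = word.clone(); // explicit deep copy
    let n = consume(word);   // `word` is moved into `consume`
    // println!("{word}");   // ERROR: value used after move
    println!("{copy} has {n} bytes"); // the clone is unaffected
}
```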

Structs: Custom Data Types

struct Order {
    id: i64,
    amount: f64,
    shipped: bool,
}

let o = Order { id: 1, amount: 42.5, shipped: true };
println!("Order {} costs {}", o.id, o.amount);

Enums and Pattern Matching

Enums let a value be one of several variants; match handles each.

enum Status {
    Pending,
    Shipped,
    Cancelled,
}

let s = Status::Shipped;
match s {
    Status::Pending   => println!("waiting"),
    Status::Shipped   => println!("on the way"),
    Status::Cancelled => println!("nope"),
}

match must be exhaustive: handle every case or the code won't compile. It is another way the compiler stops you from forgetting things.
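
To make the exhaustiveness rule concrete, here is a small sketch that uses match as an expression. The helpers is_open and is_final are hypothetical; the second one shows how a `_` wildcard trades away the compiler's check.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Shipped,
    Cancelled,
}

// Hypothetical helper: every variant is named, so adding a new variant to
// `Status` later makes this function fail to compile until it is handled.
fn is_open(s: Status) -> bool {
    match s {
        Status::Pending | Status::Shipped => true,
        Status::Cancelled => false,
    }
}

// Hypothetical helper: the `_` arm satisfies exhaustiveness, but a new
// variant would silently fall into it.
fn is_final(s: Status) -> bool {
    match s {
        Status::Cancelled => true,
        _ => false,
    }
}

fn main() {
    for s in [Status::Pending, Status::Shipped, Status::Cancelled] {
        println!("{s:?}: open = {}, final = {}", is_open(s), is_final(s));
    }
}
```

Naming every variant is the stricter style; reach for `_` only when "everything else" really is the intended meaning.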

Option and Result: No Nulls, No Silent Errors

Rust has no null. Instead:

  • Option: a value that's either Some(x) or None
  • Result: either Ok(x) or Err(e) (this is the basis of all the error handling in the Polars examples)

fn divide(a: f64, b: f64) -> Option<f64> {
    if b == 0.0 { None } else { Some(a / b) }
}

match divide(10.0, 2.0) {
    Some(result) => println!("Got {result}"),
    None         => println!("Can't divide by zero"),
}
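
A full match is not the only way to work with an Option. The sketch below reuses the divide function from above and shows three standard-library alternatives: if let, unwrap_or, and map.

```rust
fn divide(a: f64, b: f64) -> Option<f64> {
    if b == 0.0 { None } else { Some(a / b) }
}

fn main() {
    // `if let` handles only the `Some` case.
    if let Some(q) = divide(9.0, 3.0) {
        println!("quotient = {q}");
    }

    // `unwrap_or` supplies a fallback instead of panicking on `None`.
    let safe = divide(1.0, 0.0).unwrap_or(0.0);

    // `map` transforms the inner value and leaves `None` untouched.
    let doubled = divide(10.0, 4.0).map(|q| q * 2.0);

    println!("{safe} {doubled:?}");
}
```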

The ? Operator: Error Handling Shorthand

On a Result, ? means "give me the value, or return the error from this function."

use std::num::ParseIntError;

fn parse_and_double(text: &str) -> Result<i32, ParseIntError> {
    let n = text.parse::<i32>()?;  // if parse fails, return the Err
    Ok(n * 2)                      // otherwise keep going
}

This is why read_orders(...)? reads cleanly: the ? quietly propagates any failure instead of forcing a big match block.
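
A slightly larger sketch shows the payoff when several steps can fail. The helper sum_fields is hypothetical; each ? returns early on the first error, and the caller decides what to do with it.

```rust
use std::num::ParseIntError;

// Hypothetical helper: two fallible steps, each guarded by `?`.
fn sum_fields(a: &str, b: &str) -> Result<i32, ParseIntError> {
    let x = a.trim().parse::<i32>()?; // returns early with Err on failure
    let y = b.trim().parse::<i32>()?;
    Ok(x + y)
}

fn main() {
    for (a, b) in [("40", " 2"), ("40", "bad_id")] {
        match sum_fields(a, b) {
            Ok(n) => println!("sum = {n}"),
            Err(e) => eprintln!("bad input: {e}"),
        }
    }
}
```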

Common Collections

  • Vec: growable list (like a Python list)
  • HashMap: key/value map (like a Python dict)

let mut v: Vec<i32> = Vec::new();
v.push(1);
v.push(2);
for item in &v { println!("{item}"); }

use std::collections::HashMap;
let mut scores = HashMap::new();
scores.insert("alice", 10);
scores.insert("bob", 7);
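
Reading values back ties the collections to Option: HashMap::get returns an Option because the key may be missing. A short sketch, with count_statuses as a made-up helper:

```rust
use std::collections::HashMap;

// Hypothetical helper: counts how often each status appears.
fn count_statuses(statuses: &[&str]) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for s in statuses {
        // `entry` inserts 0 the first time a key is seen, then we increment.
        *counts.entry(s.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_statuses(&["shipped", "pending", "shipped"]);
    // `get` returns an Option: the key may be absent.
    println!("{:?}", counts.get("shipped"));
    println!("{:?}", counts.get("cancelled"));
}
```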

Cargo: Rust's Build Tool & Package Manager

The essentials:

cargo new my_project   # create a new project
cargo build            # compile
cargo run              # compile + run
cargo test             # run tests
cargo add polars       # add a dependency to Cargo.toml

Dependencies (called "crates") are declared in Cargo.toml and pulled from crates.io.

To watch for:

  • Ownership / borrowing: the & and mut dance. Expect to fight it early; it clicks with practice.
  • Two string types (String vs &str): convert with .to_string() or String::from(...).
  • Immutable by default: forgetting mut is the most common early error.
  • The compiler is your friend: Rust's error messages are unusually good. Read them; they often tell you the exact fix.
  • Macros vs functions: println!, vec!, df! end in ! and behave a little differently from normal functions.

What Polars Is

  • A DataFrame library for working with tabular data (rows and columns): think spreadsheets or database tables, in code
  • Written in Rust, built on Apache Arrow (a columnar memory format)
  • Columnar: stores data by column, not by row, which is why column operations and analytics are fast
  • Multithreaded by default: uses all your CPU cores without you asking
  • Available from Rust directly, and from Python via bindings
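
To picture what "columnar" means, here is a toy model in plain Rust. It is only an illustration of the two layouts, not how Polars or Arrow are implemented, and all type and function names are made up.

```rust
// Row-oriented: one struct per row; a row's fields sit together in memory.
struct OrderRow {
    order_id: i64,
    amount: f64,
}

// Column-oriented: one contiguous Vec per column.
struct OrderColumns {
    order_id: Vec<i64>,
    amount: Vec<f64>,
}

// Summing one field means stepping over every full row.
fn total_from_rows(rows: &[OrderRow]) -> f64 {
    rows.iter().map(|r| r.amount).sum()
}

// Summing one column scans a single contiguous buffer.
fn total_from_columns(cols: &OrderColumns) -> f64 {
    cols.amount.iter().sum()
}

fn main() {
    let rows = vec![
        OrderRow { order_id: 1, amount: 42.5 },
        OrderRow { order_id: 2, amount: 17.0 },
    ];
    let cols = OrderColumns {
        order_id: vec![1, 2],
        amount: vec![42.5, 17.0],
    };
    println!("first id: {}, ids stored: {}", rows[0].order_id, cols.order_id.len());
    println!("{} {}", total_from_rows(&rows), total_from_columns(&cols));
}
```

Both functions return the same total; the difference is how much unrelated data the CPU has to walk past to get there.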

The Two Core Types

  • Series: a single column of data, all the same type
  • DataFrame: a collection of Series; the table itself

use polars::prelude::*;

// A Series is one named column.
let s = Series::new("amount".into(), &[42.5, 17.0, 9.99]);

// A DataFrame is built from columns. The df! macro is the easy way.
let df = df!(
    "order_id" => &[1, 2, 3],
    "amount"   => &[42.5, 17.0, 9.99],
)?;
println!("{df}");

Note that df! ends in !, so it is a macro, like println! and vec!.

Everything Returns a Result

Many Polars operations can fail (bad types, missing columns, bad files), so they return PolarsResult. That's why you see ? everywhere in the workshop: it propagates errors instead of letting them pass silently.

fn build() -> PolarsResult<DataFrame> {
    let df = df!("a" => &[1, 2, 3])?;   // ? unwraps or returns the error
    Ok(df)
}

This ties straight back to Rust's Result and ?: bad data becomes an error you must handle, not a silent NaN.

Reading and Writing Data

Of the four formats from the agenda, CSV and Parquet are shown here:

// CSV in
let mut df = CsvReadOptions::default()   // `mut` because the Parquet writer below takes `&mut df`
    .with_has_header(true)
    .try_into_reader_with_file_path(Some("orders.csv".into()))?
    .finish()?;

// Parquet out
let mut file = std::fs::File::create("orders.parquet")?;
ParquetWriter::new(&mut file).finish(&mut df)?;

// Parquet in
let mut f = std::fs::File::open("orders.parquet")?;
let df = ParquetReader::new(&mut f).finish()?;

Key idea: Parquet stores the schema and types inside the file, so reading it back needs no guessing. CSV is plain text, so its column types must be inferred or supplied through an explicit schema.

Schemas: The Contract

A Schema declares each column's name and type up front. Give one to a reader and bad data fails loudly instead of corrupting a column.

let mut schema = Schema::default();
schema.with_column("order_id".into(), DataType::Int64);
schema.with_column("amount".into(), DataType::Float64);

Common DataTypes: Int64, Float64, String, Boolean, Date.

Selecting and Filtering

You describe operations with expressions: col(...) refers to a column, and you chain transformations.

let result = df
    .clone()
    .lazy()
    .filter(col("status").eq(lit("shipped")))   // keep matching rows
    .select([col("order_id"), col("amount")])    // pick columns
    .collect()?;                                  // run it

  • col("x"): refer to column x
  • lit("shipped"): a literal value to compare against
  • .eq, .gt, .lt: comparison operators on expressions

Joins: Combining Tables

Match rows from two DataFrames on a shared key.

let joined = orders.join(
    &customers,
    ["customer_id"],                 // key in left table
    ["customer_id"],                 // key in right table
    JoinArgs::new(JoinType::Inner),  // Inner / Left / Anti / ...
    None,
)?;

Join types worth knowing:

  • Inner: only rows that match in both
  • Left: all left rows, nulls where no match
  • Anti: left rows with no match (great as a data-quality check)
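
To see what inner and anti joins compute, here is a toy version in plain Rust using a HashMap lookup. It is a sketch of the semantics only, not of Polars' join implementation; it assumes customer_id is unique in the customers table, and the function names are hypothetical.

```rust
use std::collections::HashMap;

// Toy tables: orders are (order_id, customer_id); customers are (customer_id, name).
// Assumption: each customer_id appears at most once in `customers`.

// Inner join: keep only orders whose customer_id exists on the right.
fn inner_join(orders: &[(i64, i64)], customers: &[(i64, &str)]) -> Vec<(i64, String)> {
    let by_id: HashMap<i64, &str> = customers.iter().copied().collect();
    orders
        .iter()
        .filter_map(|(order_id, cust)| {
            by_id.get(cust).map(|name| (*order_id, name.to_string()))
        })
        .collect()
}

// Anti join: keep only orders whose customer_id has NO match on the right.
fn anti_join(orders: &[(i64, i64)], customers: &[(i64, &str)]) -> Vec<i64> {
    let by_id: HashMap<i64, &str> = customers.iter().copied().collect();
    orders
        .iter()
        .filter(|(_, cust)| !by_id.contains_key(cust))
        .map(|(order_id, _)| *order_id)
        .collect()
}

fn main() {
    let orders = [(1, 100), (2, 101), (3, 999)];
    let customers = [(100, "Ada"), (101, "Linus")];
    println!("inner: {:?}", inner_join(&orders, &customers));
    println!("anti:  {:?}", anti_join(&orders, &customers));
}
```

The anti join result is the list of orphaned orders, which is why it works well as a data-quality check: an empty result means every order points at a known customer.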

Eager vs Lazy: The Big Distinction

  • Eager: each operation runs immediately (DataFrame). Simple, good for small data and exploration.
  • Lazy: you build a query plan, and nothing runs until .collect(). Polars then optimizes the whole plan (pushing filters down, reading only needed columns).

// Lazy: scan_* and .lazy() return a LazyFrame -- a plan, not data yet.
let plan = LazyCsvReader::new(PlPath::new("orders.csv"))
    .with_has_header(true)
    .finish()?
    .filter(col("status").eq(lit("shipped")))
    .select([col("order_id"), col("amount")]);

println!("{}", plan.clone().explain(true)?);  // inspect the plan
let df = plan.collect()?;                       // NOW it runs

explain(true) prints the optimized plan, so you can see what the engine decided to do before spending any compute.
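
Rust's own iterators offer a handy analogy for "nothing runs until .collect()". The sketch below is plain Rust, not Polars: it counts how many elements have been touched before and after the chain is consumed. Unlike a LazyFrame, an iterator chain is not rewritten by an optimizer, so treat this as an analogy for deferred execution only. The helper name is made up.

```rust
use std::cell::Cell;

// Hypothetical helper: returns (elements touched before collecting,
// elements touched after collecting, the filtered values).
fn run_lazily(amounts: &[f64]) -> (u32, u32, Vec<f64>) {
    let touched = Cell::new(0u32);

    // Building the chain runs nothing: iterator adapters are lazy.
    let plan = amounts
        .iter()
        .inspect(|_| touched.set(touched.get() + 1))
        .filter(|a| **a > 10.0);
    let before = touched.get();

    // Consuming the chain is the moment the work happens.
    let result: Vec<f64> = plan.copied().collect();
    (before, touched.get(), result)
}

fn main() {
    let (before, after, big) = run_lazily(&[42.5, 17.0, 9.99]);
    println!("touched before collect: {before}");
    println!("touched after collect:  {after}");
    println!("{big:?}");
}
```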

Common Operations Cheat Sheet

df.height();                 // number of rows
df.width();                  // number of columns
df.column("amount")?;        // get a column by name
df.head(Some(5));            // first 5 rows
df.get_column_names();       // column names
df.column("amount")?.dtype();// the column's data type

Why Polars (vs pandas / Spark)

  • vs pandas: typically faster thanks to multithreading and lazy query optimization, usually with lower memory use; types are stricter (fewer silent surprises)
  • vs Spark: no cluster needed for single-machine workloads; many "we need Spark" jobs are really "pandas was too slow on one box"
  • Polars gives you performance plus correctness without distributed-systems overhead

How It Connects to the Rust Basics

  • PolarsResult and ? = Rust's Result + ? operator
  • &customers in a join = borrowing (reading without taking ownership)
  • &mut df when writing Parquet = a mutable borrow
  • df! (like println! and vec!) = the ! macro syntax; col(...) and lit(...) are ordinary functions
  • Schemas and DataType = Rust's "everything has a known type" idea, applied to table columns

Demo

A Deliberately Imperfect CSV

Use a file with a mixed-type column, a null, and a bad row:

order_id,customer_id,amount,status
1,100,42.50,shipped
2,101,,pending
3,102,17.00,shipped
4,bad_id,9.99,shipped

Row 4 has a non-numeric customer_id. In a loose pipeline this becomes a silent NaN or an object column. We want it to be loud.

Eager Read With Inferred Schema (the easy, dangerous path)

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = CsvReadOptions::default()
        .with_has_header(true)
        .try_into_reader_with_file_path(Some("orders.csv".into()))?
        .finish()?;

    println!("{df}");
    Ok(())
}

This works, but inference looked at a sample and guessed the types. On a different file, or with more rows, the guess can change. Inference is convenient but data-dependent: the same code can yield different column types on different inputs, and that combination is what bites you in production.
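
A deliberately simplified model shows why a sample-based guess is fragile. This is not Polars' inference algorithm; Guess and infer are hypothetical names, and the rule here is just "look at the first few values".

```rust
#[derive(Debug, PartialEq)]
enum Guess {
    Int,
    Float,
    Text,
}

// Hypothetical, simplified inference: inspect only the first `sample` values.
fn infer(values: &[&str], sample: usize) -> Guess {
    let seen = &values[..sample.min(values.len())];
    if seen.iter().all(|v| v.parse::<i64>().is_ok()) {
        Guess::Int
    } else if seen.iter().all(|v| v.parse::<f64>().is_ok()) {
        Guess::Float
    } else {
        Guess::Text
    }
}

fn main() {
    // The customer_id column from the demo file.
    let customer_id = ["100", "101", "102", "bad_id"];
    println!("{:?}", infer(&customer_id, 3)); // the sample misses the bad row
    println!("{:?}", infer(&customer_id, 4)); // the sample includes it
}
```

Same column, same code, two different answers depending on how much of the file the guess was based on.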

Explicit Schema (the reliability lesson)

Stop guessing. State the contract:

use polars::prelude::*;
use std::sync::Arc;

fn read_orders(path: &str) -> PolarsResult<DataFrame> {
    let mut schema = Schema::default();
    schema.with_column("order_id".into(), DataType::Int64);
    schema.with_column("customer_id".into(), DataType::Int64);
    schema.with_column("amount".into(), DataType::Float64);
    schema.with_column("status".into(), DataType::String);

    CsvReadOptions::default()
        .with_has_header(true)
        .with_schema(Some(Arc::new(schema)))
        .try_into_reader_with_file_path(Some(path.into()))?
        .finish()
}

Now customer_id is declared Int64. The bad row (bad_id) can no longer slip through as text: Polars returns an Err, not a quietly corrupted column. The failure happens at read time, with a clear cause, instead of three transformations later.
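
The same idea can be sketched without Polars at all: declare the types, parse each field against them, and return an Err on the first mismatch. This is a hand-rolled illustration (no quoting or escaping support), and Order and parse_order are hypothetical names.

```rust
// The declared field types play the role of the schema.
#[derive(Debug, PartialEq)]
struct Order {
    order_id: i64,
    customer_id: i64,
    amount: Option<f64>, // an empty field is treated as null
    status: String,
}

// Hypothetical helper: parses one CSV line into a typed Order or fails loudly.
fn parse_order(line: &str) -> Result<Order, String> {
    let f: Vec<&str> = line.split(',').collect();
    if f.len() != 4 {
        return Err(format!("expected 4 fields, got {}", f.len()));
    }
    let order_id = f[0]
        .parse::<i64>()
        .map_err(|e| format!("order_id {:?}: {e}", f[0]))?;
    let customer_id = f[1]
        .parse::<i64>()
        .map_err(|e| format!("customer_id {:?}: {e}", f[1]))?;
    let amount = if f[2].is_empty() {
        None
    } else {
        Some(f[2].parse::<f64>().map_err(|e| format!("amount {:?}: {e}", f[2]))?)
    };
    Ok(Order { order_id, customer_id, amount, status: f[3].to_string() })
}

fn main() {
    for line in ["1,100,42.50,shipped", "2,101,,pending", "4,bad_id,9.99,shipped"] {
        match parse_order(line) {
            Ok(o) => println!("{o:?}"),
            Err(e) => eprintln!("rejected: {e}"),
        }
    }
}
```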

This Is the Rust + Polars Point

  • The schema is code: it is versioned, reviewed, and tested like any other contract
  • finish() returns PolarsResult. There is no way to ignore a parse failure by accident: to get at the DataFrame you must handle the Result or propagate it with ?
  • Compare to a dynamically typed pipeline where a bad parse becomes NaN and flows downstream silently. Here, the type system and the error type make silence impossible.

Error Handling as a First-Class Concern

Show both behaviors so the audience feels the difference:

fn main() {
    match read_orders("orders.csv") {
        Ok(df) => println!("Loaded {} rows\n{df}", df.height()),
        Err(e) => eprintln!("CSV failed its contract: {e}"),
    }
}

In a pipeline, Err means the job stops here, loudly, with a message, rather than at 3am, forty million rows in.

Lock It Down With a Test

The reliability theme made concrete. These tests assert the contract, so a malformed upstream file fails in CI, not in production:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn schema_is_enforced() {
        let df = read_orders("tests/data/orders_good.csv").unwrap();
        assert_eq!(df.height(), 3);
        assert_eq!(
            df.column("amount").unwrap().dtype(),
            &DataType::Float64
        );
    }

    #[test]
    fn bad_types_are_rejected() {
        // The file with `bad_id` must NOT load silently.
        assert!(read_orders("tests/data/orders_bad.csv").is_err());
    }
}

bad_types_are_rejected is the whole philosophy in one test: we assert that bad data fails. Most pipelines never write that test because in their stack, bad data does not fail; it spreads.

Handling Nulls on Purpose (not by accident)

The empty amount on row 2 is a real null. Decide what it means instead of letting a guess decide:

use polars::prelude::*;

fn parse_options() -> CsvParseOptions {
    CsvParseOptions::default()
        .with_null_values(Some(NullValues::AllColumns(
            vec!["".into(), "NA".into(), "null".into()].into(),
        )))
}

Operational clarity: nulls are a documented decision in the code, not an artifact of whatever the parser felt like doing.
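
The decision itself can be written down as a three-way outcome: a declared null token, a valid number, or an error. The sketch below is plain Rust rather than Polars, with NULL_TOKENS and parse_amount as hypothetical names that mirror the tokens configured above.

```rust
use std::num::ParseFloatError;

// Hypothetical constant mirroring the null tokens configured above.
const NULL_TOKENS: [&str; 3] = ["", "NA", "null"];

// Hypothetical helper:
//   Ok(None)    -> the field is a declared null token
//   Ok(Some(x)) -> the field is a valid number
//   Err(_)      -> the field is neither, so it is rejected
fn parse_amount(field: &str) -> Result<Option<f64>, ParseFloatError> {
    if NULL_TOKENS.contains(&field) {
        return Ok(None);
    }
    field.parse::<f64>().map(Some)
}

fn main() {
    for field in ["42.50", "", "NA", "n/a"] {
        println!("{field:?} -> {:?}", parse_amount(field));
    }
}
```

Anything outside the declared token list, such as "n/a" here, is treated as bad data rather than quietly becoming a null.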

Section Takeaways

  • CSV is untyped and unsafe by default: treat every read as a boundary that must be validated
  • Explicit schemas turn "hope it parses" into "it parses or it errors": determinism over convenience
  • PolarsResult makes failure hard to ignore: you cannot reach the DataFrame without handling or propagating the error, and an unused Result triggers a compiler warning
  • One test (bad_types_are_rejected) demonstrates the entire reliability thesis
  • Rust + Polars matters here not because it is faster, but because it makes silent data corruption structurally hard