Ubuntu TechHive
Rust and Data Processing with Polars

A quick introduction to Rust basics and to data processing with Polars

Rust Data Processing with Polars

What Makes Rust Different

  • Compiled and fast — compiles to native machine code, no runtime/GC
  • Memory safe — the compiler prevents whole classes of bugs (null pointer errors, data races) before your program runs
  • Strongly, statically typed — every value has a type known at compile time; the compiler catches mismatches early

Variables and Mutability

Variables are immutable by default. You opt into mutability with mut.

let x = 5;          // immutable -- cannot be reassigned
let mut y = 10;     // mutable
y = 20;             // OK because of `mut`
// x = 6;           // COMPILE ERROR: cannot assign twice to immutable variable `x`

const MAX: u32 = 100_000;  // constant: always immutable, type required

This default flips the usual expectation: you say up front what is allowed to change, which makes code easier to reason about.

Basic Data Types

Scalar types

  • Integers: i32, i64, u32, u64 … (i = signed, u = unsigned; number = bits). i32 is the default.
  • Floats: f64 (default), f32
  • Boolean: bool -> true / false
  • Character: char -> a single Unicode character, in single quotes

let count: i64 = 42;
let price: f64 = 19.99;
let is_ready: bool = true;
let letter: char = 'A';

Compound types

  • Tuple: fixed-size group of mixed types
  • Array: fixed-size, all same type

let person: (i32, f64, char) = (30, 5.9, 'M');
let height = person.1;          // access by index -> 5.9

let nums: [i32; 3] = [1, 2, 3]; // array of 3 i32s
let first = nums[0];            // -> 1

Strings: Two Kinds

  • &str: a "string slice", a borrowed view into text stored elsewhere; string literals have this type
  • String: an owned, growable string you can modify

let literal: &str = "hello";          // fixed text
let mut owned: String = String::from("hello");
owned.push_str(", world");            // can grow because it's owned

Functions

  • Declared with fn
  • Parameter types are required; return type comes after ->
  • The last expression (no semicolon) is the return value

fn add(a: i32, b: i32) -> i32 {
    a + b          // no semicolon = this is the return value
}

fn greet(name: &str) {   // no `->` means it returns nothing
    println!("Hello, {name}!");
}

fn main() {                  // every program starts at main()
    let sum = add(2, 3);
    println!("Sum: {sum}");
    greet("Aziz");
}

Note: println! is a macro (the ! gives it away), not a function.

Control Flow

if / else (it's an expression!)

let n = 7;
if n % 2 == 0 {
    println!("even");
} else {
    println!("odd");
}

// Because `if` returns a value, you can assign with it:
let label = if n > 5 { "big" } else { "small" };

Loops

// loop: runs forever until you `break`
let mut i = 0;
loop {
    if i >= 3 { break; }
    i += 1;
}

// while
let mut c = 3;
while c > 0 {
    println!("{c}");
    c -= 1;
}

// for: the most common -- iterate over a range or collection
for k in 0..3 {            // 0, 1, 2  (end-exclusive)
    println!("k = {k}");
}

Ownership: The Big Idea

Rust's headline feature. Three rules:

  1. Each value has an owner
  2. There can be only one owner at a time
  3. When the owner goes out of scope, the value is cleaned up (dropped)

let s1 = String::from("hi");
let s2 = s1;              // ownership MOVES to s2
// println!("{s1}");      // ERROR: s1 no longer valid

// To let another function use a value WITHOUT taking ownership,
// you *borrow* it with & (a reference):
fn length(s: &String) -> usize {
    s.len()              // reads s, doesn't own it
}
let word = String::from("rust");
let n = length(&word);   // lend it; `word` still usable after

This is what lets Rust guarantee memory safety with no garbage collector. It's the part that takes the most getting used to.
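
The borrow shown above is read-only. As a minimal sketch of the two other options, here is a mutable borrow (&mut) next to an explicit clone; add_suffix is a made-up helper for illustration:

```rust
// Hypothetical helper: borrows the String mutably and appends to it.
fn add_suffix(s: &mut String, suffix: &str) {
    s.push_str(suffix);
}

fn main() {
    let mut word = String::from("rust");
    add_suffix(&mut word, "acean"); // lend it mutably; `word` is still ours afterwards
    println!("{word}");

    let copy = word.clone(); // explicit deep copy: two independent owners
    let moved = word;        // ownership moves; `word` is unusable from here on
    println!("{copy} {moved}");
}
```

Only one mutable borrow can be active at a time, which is how the compiler rules out data races on the value.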

Structs: Custom Data Types

struct Order {
    id: i64,
    amount: f64,
    shipped: bool,
}

let o = Order { id: 1, amount: 42.5, shipped: true };
println!("Order {} costs {}", o.id, o.amount);

Enums and Pattern Matching

Enums let a value be one of several variants; match handles each.

enum Status {
    Pending,
    Shipped,
    Cancelled,
}

let s = Status::Shipped;
match s {
    Status::Pending   => println!("waiting"),
    Status::Shipped   => println!("on the way"),
    Status::Cancelled => println!("nope"),
}

match must be exhaustive — handle every case or the code won't compile. Another way the compiler stops you forgetting things.
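
Variants can also carry data, and match is an expression just like if. A small sketch, assuming hypothetical tracking and reason payloads that are not part of the Status enum above:

```rust
// Hypothetical payloads, for illustration only.
enum Status {
    Pending,
    Shipped { tracking: String },
    Cancelled { reason: String },
}

// `match` is an expression: the chosen arm becomes the return value.
fn describe(status: &Status) -> String {
    match status {
        Status::Pending => String::from("waiting"),
        Status::Shipped { tracking } => format!("on the way ({tracking})"),
        Status::Cancelled { reason } => format!("cancelled: {reason}"),
    }
}

fn main() {
    let shipped = Status::Shipped { tracking: String::from("TRK-1") };
    let cancelled = Status::Cancelled { reason: String::from("out of stock") };
    println!("{}", describe(&Status::Pending));
    println!("{}", describe(&shipped));
    println!("{}", describe(&cancelled));
}
```

Removing any arm from describe makes the compiler reject the function, which is the exhaustiveness check in action.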

Option and Result: No Nulls, No Silent Errors

Rust has no null. Instead:

  • Option<T>: a value that is either Some(x) or None
  • Result<T, E>: either Ok(x) or Err(e); all the error handling in the Polars examples builds on it

fn divide(a: f64, b: f64) -> Option<f64> {
    if b == 0.0 { None } else { Some(a / b) }
}

match divide(10.0, 2.0) {
    Some(result) => println!("Got {result}"),
    None         => println!("Can't divide by zero"),
}

The ? Operator: Error Handling Shorthand

On a Result, ? means "give me the value, or return the error from this function."

use std::num::ParseIntError;

fn parse_and_double(text: &str) -> Result<i32, ParseIntError> {
    let n = text.parse::<i32>()?;  // if parse fails, return the Err
    Ok(n * 2)                      // otherwise keep going
}

This is why a call like read_orders(...)? (from the demo later in this article) reads cleanly: the ? propagates any failure to the caller instead of forcing a big match block.
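
To see the propagation across more than one call, here is a minimal standard-library sketch; total and average are made-up names, and the inputs are sample strings:

```rust
use std::num::ParseFloatError;

// Parse every field and add it up; the first bad field aborts with its error.
fn total(fields: &[&str]) -> Result<f64, ParseFloatError> {
    let mut sum = 0.0;
    for field in fields {
        sum += field.trim().parse::<f64>()?;
    }
    Ok(sum)
}

// A caller one level up: `?` passes total's error along unchanged.
fn average(fields: &[&str]) -> Result<f64, ParseFloatError> {
    let sum = total(fields)?;
    Ok(sum / fields.len() as f64)
}

fn main() {
    for input in [vec!["1.0", "3.0"], vec!["1.0", "oops"]] {
        match average(&input) {
            Ok(avg) => println!("average = {avg}"),
            Err(e) => eprintln!("bad input: {e}"),
        }
    }
}
```

Only main contains a match; every function in between stays a straight line of happy-path code.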

Common Collections

  • Vec: growable list (like a Python list)
  • HashMap: key/value map (like a Python dict)

let mut v: Vec<i32> = Vec::new();
v.push(1);
v.push(2);
for item in &v { println!("{item}"); }

use std::collections::HashMap;
let mut scores = HashMap::new();
scores.insert("alice", 10);
scores.insert("bob", 7);
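
A common use of HashMap in data work is a group-by style aggregation. The sketch below sums amounts per status with the entry API; totals_by_status and the sample rows are illustrative, not taken from the workshop code:

```rust
use std::collections::HashMap;

// Sum amounts per status over (status, amount) pairs.
fn totals_by_status(rows: &[(&str, f64)]) -> HashMap<String, f64> {
    let mut totals = HashMap::new();
    for &(status, amount) in rows {
        // entry(): insert 0.0 the first time a key is seen, then add to it
        *totals.entry(status.to_string()).or_insert(0.0) += amount;
    }
    totals
}

fn main() {
    let rows = [("shipped", 42.5), ("pending", 1.0), ("shipped", 17.0)];
    let totals = totals_by_status(&rows);
    println!("shipped: {:?}", totals.get("shipped"));
    println!("pending: {:?}", totals.get("pending"));
}
```

Note that get returns an Option, so a missing key is a None you have to handle rather than a crash or a silent default.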

Cargo: Rust's Build Tool & Package Manager

The essentials:

cargo new my_project   # create a new project
cargo build            # compile
cargo run              # compile + run
cargo test             # run tests
cargo add polars       # add a dependency to Cargo.toml

Dependencies (called "crates") are declared in Cargo.toml and pulled from crates.io.

What to watch for:

  • Ownership / borrowing — the & and mut dance. Expect to fight it early; it clicks with practice.
  • Two string types (String vs &str) — convert with .to_string() or String::from(...).
  • Immutable by default — forgetting mut is the most common early error.
  • The compiler is your friend — Rust's error messages are unusually good. Read them; they often tell you the exact fix.
  • Macros vs functions: println!, vec!, df! end in ! and behave a little differently from normal functions.
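
The String and &str conversions mentioned in the list above, as a minimal sketch (shout is a made-up helper):

```rust
// Takes a borrowed slice, so it accepts literals as well as &String.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    let literal: &str = "polars";
    let owned: String = literal.to_string(); // &str -> String (allocates)
    let also_owned = String::from(literal);  // same conversion, different spelling
    let view: &str = owned.as_str();         // String -> &str (no copy)

    println!("{}", shout(literal));
    println!("{}", shout(&owned)); // &String coerces to &str automatically
    println!("{}", shout(view));
    println!("{also_owned}");
}
```

Taking &str in function parameters is usually the more flexible choice, since callers can pass either kind of string.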

What Polars Is

  • A DataFrame library for working with tabular data (rows and columns) — think spreadsheets or database tables, in code
  • Written in Rust, built on Apache Arrow (a columnar memory format)
  • Columnar: stores data by column, not by row — which is why column operations and analytics are fast
  • Multithreaded by default: uses all your CPU cores without you asking
  • Available from Rust directly, and from Python via bindings

The Two Core Types

  • Series: a single column of data, all the same type
  • DataFrame: a collection of Series; the table itself

use polars::prelude::*;

// A Series is one named column.
let s = Series::new("amount".into(), &[42.5, 17.0, 9.99]);

// A DataFrame is built from columns. The df! macro is the easy way.
let df = df!(
    "order_id" => &[1, 2, 3],
    "amount"   => &[42.5, 17.0, 9.99],
)?;
println!("{df}");

Note df! ends in ! — it's a macro, like println! and vec!.

Everything Returns a Result

Almost every Polars operation can fail (bad types, missing columns, bad files), so it returns PolarsResult. That's why you see ? everywhere in the workshop — it propagates errors instead of letting them pass silently.

fn build() -> PolarsResult<DataFrame> {
    let df = df!("a" => &[1, 2, 3])?;   // ? unwraps or returns the error
    Ok(df)
}

This ties straight back to Rust's Result and ?: bad data becomes an error you must handle, not a silent NaN.

Reading and Writing Data

The agenda lists four formats; the snippets below cover CSV and Parquet:

// CSV in (`mut` because ParquetWriter::finish below takes `&mut df`)
let mut df = CsvReadOptions::default()
    .with_has_header(true)
    .try_into_reader_with_file_path(Some("orders.csv".into()))?
    .finish()?;

// Parquet out
let mut file = std::fs::File::create("orders.parquet")?;
ParquetWriter::new(&mut file).finish(&mut df)?;

// Parquet in
let mut f = std::fs::File::open("orders.parquet")?;
let df = ParquetReader::new(&mut f).finish()?;

Key idea: Parquet stores the schema and types inside the file, so reading it back needs no guessing. CSV is plain text, so its column types must be inferred or declared with an explicit schema.

Schemas: The Contract

A Schema declares each column's name and type up front. Give one to a reader and bad data fails loudly instead of corrupting a column.

let mut schema = Schema::default();
schema.with_column("order_id".into(), DataType::Int64);
schema.with_column("amount".into(), DataType::Float64);

Common DataTypes: Int64, Float64, String, Boolean, Date.

Selecting and Filtering

You describe operations with expressions: col(...) refers to a column, and you chain transformations onto it.

let result = df
    .clone()
    .lazy()
    .filter(col("status").eq(lit("shipped")))   // keep matching rows
    .select([col("order_id"), col("amount")])    // pick columns
    .collect()?;                                  // run it

  • col("x"): refer to column x
  • lit("shipped"): a literal value to compare against
  • .eq, .gt, .lt: comparison methods on expressions

Joins: Combining Tables

Match rows from two DataFrames on a shared key.

let joined = orders.join(
    &customers,
    ["customer_id"],                 // key in left table
    ["customer_id"],                 // key in right table
    JoinArgs::new(JoinType::Inner),  // Inner / Left / Anti / ...
    None,
)?;

Join types worth knowing:

  • Inner — only rows that match in both
  • Left — all left rows, nulls where no match
  • Anti — left rows with no match (great as a data-quality check)
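
What an anti join computes can be sketched with the standard library alone. This is not the Polars API, just the underlying idea on made-up customer IDs:

```rust
use std::collections::HashSet;

// Anti-join sketch: keep the left keys that have no match on the right.
fn anti_join(order_customer_ids: &[i64], known_customers: &[i64]) -> Vec<i64> {
    let known: HashSet<i64> = known_customers.iter().copied().collect();
    order_customer_ids
        .iter()
        .copied()
        .filter(|id| !known.contains(id))
        .collect()
}

fn main() {
    let orphans = anti_join(&[100, 101, 999], &[100, 101, 102]);
    // Any ID left over here is an order pointing at a customer we do not know.
    println!("orders with unknown customers: {orphans:?}");
}
```

An empty result is the data-quality check passing: every order refers to a known customer.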

Eager vs Lazy: The Big Distinction

  • Eager: each operation runs immediately (DataFrame). Simple, good for small data and exploration.
  • Lazy: you build a query plan, and nothing runs until .collect(). Polars then optimizes the whole plan (pushing filters down, reading only needed columns).

// Lazy: scan_* and .lazy() return a LazyFrame -- a plan, not data yet.
let plan = LazyCsvReader::new(PlPath::new("orders.csv"))
    .with_has_header(true)
    .finish()?
    .filter(col("status").eq(lit("shipped")))
    .select([col("order_id"), col("amount")]);

println!("{}", plan.clone().explain(true)?);  // inspect the plan
let df = plan.collect()?;                       // NOW it runs

explain(true) prints the optimized plan — you can see what the engine decided to do before spending any compute.
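
Rust's own iterators offer a loose analogy for "nothing runs until .collect()". The sketch below uses only the standard library and counts how many rows the filter has looked at; lazy_demo is a made-up name, and unlike Polars, plain iterators do not optimize the plan:

```rust
use std::cell::Cell;

// Returns (rows evaluated before collect, rows evaluated after, result).
fn lazy_demo() -> (usize, usize, Vec<f64>) {
    let amounts = vec![42.5, 17.0, 9.99];
    let evaluated = Cell::new(0usize);

    // Build the "plan": adaptors are lazy, so the closures have not run yet.
    let plan = amounts
        .iter()
        .filter(|a| {
            evaluated.set(evaluated.get() + 1);
            **a > 10.0
        })
        .map(|a| a * 2.0);

    let before = evaluated.get();
    let result: Vec<f64> = plan.collect(); // NOW the work happens
    let after = evaluated.get();
    (before, after, result)
}

fn main() {
    let (before, after, result) = lazy_demo();
    println!("evaluated before collect: {before}");
    println!("evaluated after collect: {after}");
    println!("result: {result:?}");
}
```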

Common Operations Cheat Sheet

df.height();                 // number of rows
df.width();                  // number of columns
df.column("amount")?;        // get a column by name (a Series in older releases, a Column in newer ones)
df.head(Some(5));            // first 5 rows
df.get_column_names();       // column names
df.column("amount")?.dtype();// the column's data type

Why Polars (vs pandas / Spark)

  • vs pandas — much faster, multithreaded, lazy optimization, far better memory behavior; types are stricter (fewer silent surprises)
  • vs Spark — no cluster needed for single-machine workloads; many "we need Spark" jobs are really "pandas was too slow on one box"
  • Polars gives you performance plus correctness without distributed-systems overhead

How It Connects to the Rust Basics

  • PolarsResult and ? = Rust's Result + ? operator
  • &customers in a join = borrowing (reading without taking ownership)
  • &mut df when writing Parquet = a mutable borrow
  • df! and println! = the ! macro syntax (col(...) and lit(...) are ordinary functions)
  • Schemas and DataType = Rust's "everything has a known type" idea, applied to table columns

Demo

A Deliberately Imperfect CSV

Use a file with a mixed-type column, a null, and a bad row:

order_id,customer_id,amount,status
1,100,42.50,shipped
2,101,,pending
3,102,17.00,shipped
4,bad_id,9.99,shipped

Row 4 has a non-numeric customer_id. In a loose pipeline this becomes a silent NaN or an object column. We want it to be loud.

Eager Read With Inferred Schema (the easy, dangerous path)

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = CsvReadOptions::default()
        .with_has_header(true)
        .try_into_reader_with_file_path(Some("orders.csv".into()))?
        .finish()?;

    println!("{df}");
    Ok(())
}

This works, but inference looked at a sample and guessed the types. On a different file, or one with more rows, the guess can change. Inference is convenient but data-dependent: the same code can yield different column types on different inputs, and that combination is what bites you in production.
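
Why the guess can change is easy to show with a toy model. The function below is not Polars' inference algorithm, only a simplified stand-in that looks at the first sample_size values of a column:

```rust
// Toy model of sample-based type inference (NOT Polars' actual algorithm).
fn infer_type(values: &[&str], sample_size: usize) -> &'static str {
    let sample = &values[..sample_size.min(values.len())];
    if sample.iter().all(|v| v.parse::<i64>().is_ok()) {
        "Int64"
    } else if sample.iter().all(|v| v.parse::<f64>().is_ok()) {
        "Float64"
    } else {
        "String"
    }
}

fn main() {
    // The customer_id column from the demo file above.
    let customer_id = ["100", "101", "102", "bad_id"];
    println!("sample of 3: {}", infer_type(&customer_id, 3));
    println!("sample of 4: {}", infer_type(&customer_id, 4));
}
```

The same column comes out as Int64 or String depending only on how far the sample reaches, which is exactly the instability an explicit schema removes.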

Explicit Schema (the reliability lesson)

Stop guessing. State the contract:

use polars::prelude::*;
use std::sync::Arc;

fn read_orders(path: &str) -> PolarsResult<DataFrame> {
    let mut schema = Schema::default();
    schema.with_column("order_id".into(), DataType::Int64);
    schema.with_column("customer_id".into(), DataType::Int64);
    schema.with_column("amount".into(), DataType::Float64);
    schema.with_column("status".into(), DataType::String);

    CsvReadOptions::default()
        .with_has_header(true)
        .with_schema(Some(Arc::new(schema)))
        .try_into_reader_with_file_path(Some(path.into()))?
        .finish()
}

Now customer_id is declared Int64. The bad row (bad_id) can no longer slip through as text — Polars returns an Err, not a quietly corrupted column. The failure happens at read time, with a clear cause, instead of three transformations later.
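
The idea of "parse against a declared contract or fail with a cause" can be sketched without Polars at all. Everything below (ColType, SCHEMA, validate, the error wording) is made up for illustration and treats empty fields as nulls; it is not how Polars implements schema enforcement:

```rust
// Declared column types for this sketch (a stand-in for Polars' DataType).
#[derive(Debug)]
enum ColType {
    Int,
    Float,
    Text,
}

const SCHEMA: [(&str, ColType); 4] = [
    ("order_id", ColType::Int),
    ("customer_id", ColType::Int),
    ("amount", ColType::Float),
    ("status", ColType::Text),
];

// The demo file from earlier in this section.
const ORDERS: &str = "order_id,customer_id,amount,status\n\
1,100,42.50,shipped\n\
2,101,,pending\n\
3,102,17.00,shipped\n\
4,bad_id,9.99,shipped\n";

// Check every field of every data row; return the row count or the first violation.
fn validate(csv: &str, schema: &[(&str, ColType)]) -> Result<usize, String> {
    let mut rows = 0;
    for (i, line) in csv.lines().enumerate().skip(1) {
        // skip(1) drops the header; i + 1 is the 1-based line number
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() != schema.len() {
            return Err(format!("line {}: expected {} fields, got {}", i + 1, schema.len(), fields.len()));
        }
        for (field, (name, ty)) in fields.iter().zip(schema) {
            let ok = field.is_empty() // empty = null, accepted for any type
                || match ty {
                    ColType::Int => field.parse::<i64>().is_ok(),
                    ColType::Float => field.parse::<f64>().is_ok(),
                    ColType::Text => true,
                };
            if !ok {
                return Err(format!("line {}: column {name}: cannot parse {field:?} as {ty:?}", i + 1));
            }
        }
        rows += 1;
    }
    Ok(rows)
}

fn main() {
    match validate(ORDERS, &SCHEMA) {
        Ok(n) => println!("{n} rows satisfy the schema"),
        Err(e) => eprintln!("contract violated: {e}"),
    }
}
```

The error names the line and the column, so the failure points at its cause instead of surfacing later as a confusing downstream symptom.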

This Is the Rust + Polars Point

  • The schema is code — it is versioned, reviewed, and tested like any other contract
  • finish() returns PolarsResult. There is no way to ignore a parse failure by accident — the ? forces you to handle it or propagate it
  • Compare to a dynamically typed pipeline where a bad parse becomes NaN and flows downstream silently. Here, the type system and the error type make silence impossible.

Error Handling as a First-Class Concern

Show both behaviors so the audience feels the difference:

fn main() {
    match read_orders("orders.csv") {
        Ok(df) => println!("Loaded {} rows\n{df}", df.height()),
        Err(e) => eprintln!("CSV failed its contract: {e}"),
    }
}

In a pipeline, Err means the job stops here, loudly, with a message — not at 3am, forty million rows in.

Lock It Down With a Test

The reliability theme made concrete — a test that asserts the contract, so a malformed upstream file fails in CI, not prod:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn schema_is_enforced() {
        let df = read_orders("tests/data/orders_good.csv").unwrap();
        assert_eq!(df.height(), 3);
        assert_eq!(
            df.column("amount").unwrap().dtype(),
            &DataType::Float64
        );
    }

    #[test]
    fn bad_types_are_rejected() {
        // The file with `bad_id` must NOT load silently.
        assert!(read_orders("tests/data/orders_bad.csv").is_err());
    }
}

bad_types_are_rejected is the whole philosophy in one test: we assert that bad data fails. Most pipelines never write that test because in their stack, bad data does not fail — it spreads.

Handling Nulls on Purpose (not by accident)

The empty amount on row 2 is a real null. Decide what it means instead of letting a guess decide:

use polars::prelude::*;

fn parse_options() -> CsvParseOptions {
    CsvParseOptions::default()
        .with_null_values(Some(NullValues::AllColumns(
            vec!["".into(), "NA".into(), "null".into()].into(),
        )))
}

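Operational clarity: nulls are a documented decision in the code, not an artifact of whatever the parser felt like doing. For the decision to take effect, pass the result to the reader with CsvReadOptions::with_parse_options.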
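
The same decision can be expressed in plain Rust types, which makes the three possible outcomes explicit. A minimal sketch, with a made-up parse_amount helper and a token list that mirrors the one configured above:

```rust
use std::num::ParseFloatError;

// Tokens this sketch treats as null.
const NULL_TOKENS: [&str; 3] = ["", "NA", "null"];

// Ok(None) = documented null, Ok(Some(x)) = a value, Err = genuinely bad data.
fn parse_amount(field: &str) -> Result<Option<f64>, ParseFloatError> {
    if NULL_TOKENS.contains(&field) {
        return Ok(None);
    }
    field.parse::<f64>().map(Some)
}

fn main() {
    for field in ["17.00", "", "NA", "abc"] {
        match parse_amount(field) {
            Ok(Some(x)) => println!("{field:?} -> value {x}"),
            Ok(None) => println!("{field:?} -> null"),
            Err(e) => println!("{field:?} -> error: {e}"),
        }
    }
}
```

A missing value and a malformed value end up in different branches, so neither can be mistaken for the other.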

Section Takeaways

  • CSV is untyped and unsafe by default — treat every read as a boundary that must be validated
  • Explicit schemas turn "hope it parses" into "it parses or it errors" — determinism over convenience
  • PolarsResult makes failure hard to ignore: you cannot get at the DataFrame without handling the Err, and an unused Result triggers a compiler warning
  • One test (bad_types_are_rejected) demonstrates the entire reliability thesis
  • Rust + Polars matters here not because it is faster, but because it makes silent data corruption structurally hard