Ubuntu TechHive
Rust and Data Processing with Polars

A quick introduction to Rust fundamentals, along with data processing using Polars

#+author: Aziz Sereme
#+date: <2026-06-13 sat>
#+title: Rust Data Processing with Polars

** What Makes Rust Different

  • Compiled and fast — compiles to native machine code, no runtime/GC
  • Memory safe — the compiler prevents whole classes of bugs (null
    pointer errors, data races) before your program runs
  • Strongly, statically typed — every value has a type known at
    compile time; the compiler catches mismatches early

** Variables and Mutability
Variables are immutable by default. You opt into mutability with =mut=.
#+begin_src rust
let x = 5; // immutable -- cannot be reassigned
let mut y = 10; // mutable
y = 20; // OK because of mut
// x = 6; // COMPILE ERROR: cannot assign twice to x

const MAX: u32 = 100_000; // constant: always immutable, type required
#+end_src
This default flips the usual expectation: you say up front what is
allowed to change, which makes code easier to reason about.

** Basic Data Types
*** Scalar types
- Integers: =i32=, =i64=, =u32=, =u64= ... (i = signed, u = unsigned;
number = bits). =i32= is the default.
- Floats: =f64= (default), =f32=
- Boolean: =bool= -> =true= / =false=
- Character: =char= -> a single Unicode character, in single quotes
#+begin_src rust
let count: i64 = 42;
let price: f64 = 19.99;
let is_ready: bool = true;
let letter: char = 'A';
#+end_src

*** Compound types
- Tuple: fixed-size group of mixed types
- Array: fixed-size, all same type
#+begin_src rust
let person: (i32, f64, char) = (30, 5.9, 'M');
let height = person.1; // access by index -> 5.9

let nums: [i32; 3] = [1, 2, 3]; // array of 3 i32s
let first = nums[0]; // -> 1
#+end_src

** Strings: Two Kinds

  • =&str= — a "string slice", usually a fixed/borrowed string literal
  • =String= — an owned, growable string you can modify
    #+begin_src rust
    let literal: &str = "hello"; // fixed text
    let mut owned: String = String::from("hello");
    owned.push_str(", world"); // can grow because it's owned
    #+end_src
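A minimal sketch of how the two string types interact. The function name =shout= is made up for illustration; the point is that a parameter of type =&str= accepts both a literal and a borrowed =String=.

#+begin_src rust
// Takes a borrowed &str, so it accepts literals and borrowed Strings alike.
fn shout(text: &str) -> String {
    // to_uppercase() allocates and returns a new owned String.
    let mut out = text.to_uppercase();
    out.push('!');
    out
}
#+end_src

Calling =shout("hello")= passes a literal directly, while =shout(&owned)= lends an existing =String= without giving it up.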

** Functions

  • Declared with =fn=
  • Parameter types are required; return type comes after =->=
  • The last expression (no semicolon) is the return value
    #+begin_src rust
    fn add(a: i32, b: i32) -> i32 {
        a + b // no semicolon = this is the return value
    }

    fn greet(name: &str) { // no -> means it returns nothing
        println!("Hello, {name}!");
    }

    fn main() { // every program starts at main()
        let sum = add(2, 3);
        println!("Sum: {sum}");
        greet("Aziz");
    }
    #+end_src
Note: =println!= is a macro (the =!= gives it away), not a function.

** Control Flow
*** if / else (it's an expression!)
#+begin_src rust
let n = 7;
if n % 2 == 0 {
    println!("even");
} else {
    println!("odd");
}

// Because `if` returns a value, you can assign with it:
let label = if n > 5 { "big" } else { "small" };
#+end_src

*** Loops
#+begin_src rust
// loop: runs forever until you break
let mut i = 0;
loop {
    if i >= 3 { break; }
    i += 1;
}

// while
let mut c = 3;
while c > 0 {
    println!("{c}");
    c -= 1;
}

// for: the most common -- iterate over a range or collection
for k in 0..3 {            // 0, 1, 2  (end-exclusive)
    println!("k = {k}");
}
#+end_src
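The =for= loop above iterates over a range; iterating over a collection looks the same. A small sketch, with =sum_of_evens= as a hypothetical helper name:

#+begin_src rust
// Sums the even numbers in a slice by iterating over the collection directly.
fn sum_of_evens(values: &[i32]) -> i32 {
    let mut total = 0;
    for v in values {
        // v is a &i32 borrowed from the slice
        if v % 2 == 0 {
            total += v;
        }
    }
    total
}
#+end_src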

** Ownership: The Big Idea
Rust's headline feature. Three rules:

  1. Each value has an owner
  2. There's only one owner at a time
  3. When the owner goes out of scope, the value is cleaned up
    #+begin_src rust
    let s1 = String::from("hi");
    let s2 = s1; // ownership MOVES to s2
    // println!("{s1}"); // ERROR: s1 no longer valid

    // To let another function use a value WITHOUT taking ownership,
    // you borrow it with & (a reference):
    fn length(s: &String) -> usize {
        s.len() // reads s, doesn't own it
    }
    let word = String::from("rust");
    let n = length(&word); // lend it; word still usable after
    #+end_src
This is what lets Rust guarantee memory safety with no garbage
collector. It's the part that takes the most getting used to.
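The same rules cover the other two cases you meet early on: a mutable borrow (=&mut=) and an explicit copy (=.clone()=). The sketch below is illustrative only; =append=, =consume= and =ownership_demo= are made-up names.

#+begin_src rust
// Borrows the String mutably: the caller keeps ownership but lets us change it.
fn append(s: &mut String, suffix: &str) {
    s.push_str(suffix);
}

// Takes ownership: the caller can no longer use the value it passed in.
fn consume(s: String) -> usize {
    s.len()
}

fn ownership_demo() -> (String, usize) {
    let mut word = String::from("rust");
    append(&mut word, "acean"); // mutable borrow; word is still ours
    let copy = word.clone();    // explicit deep copy
    let n = consume(copy);      // copy is moved away; word is untouched
    (word, n)
}
#+end_src

After =consume(copy)=, using =copy= again would be a compile error, while =word= stays valid because only a clone was moved.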

** Structs: Custom Data Types
#+begin_src rust
struct Order {
    id: i64,
    amount: f64,
    shipped: bool,
}

let o = Order { id: 1, amount: 42.5, shipped: true };
println!("Order {} costs {}", o.id, o.amount);
#+end_src
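Structs combine naturally with borrowing and loops. As a rough sketch (the helper =shipped_total= is hypothetical), a function can borrow a slice of orders and read their fields without taking ownership:

#+begin_src rust
struct Order {
    id: i64,
    amount: f64,
    shipped: bool,
}

// Borrows a slice of orders and adds up the amounts of the shipped ones.
fn shipped_total(orders: &[Order]) -> f64 {
    let mut total = 0.0;
    for o in orders {
        if o.shipped {
            total += o.amount;
        }
    }
    total
}
#+end_src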

** Enums and Pattern Matching
Enums let a value be one of several variants; =match= handles each.
#+begin_src rust
enum Status {
    Pending,
    Shipped,
    Cancelled,
}

let s = Status::Shipped;
match s {
    Status::Pending => println!("waiting"),
    Status::Shipped => println!("on the way"),
    Status::Cancelled => println!("nope"),
}
#+end_src
=match= must be exhaustive — handle every case or the code won't
compile. Another way the compiler stops you forgetting things.
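Like =if=, =match= is an expression, so it can produce a value instead of printing. A minimal sketch reusing the same enum (=describe= is a made-up function name):

#+begin_src rust
enum Status {
    Pending,
    Shipped,
    Cancelled,
}

// Each arm yields a value; the match as a whole is the function's return value.
fn describe(s: &Status) -> &'static str {
    match s {
        Status::Pending => "waiting",
        Status::Shipped => "on the way",
        Status::Cancelled => "nope",
    }
}
#+end_src

Deleting any one arm turns this into a compile error, which is the exhaustiveness check at work.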

** Option and Result: No Nulls, No Silent Errors
Rust has no =null=. Instead:

  • =Option<T>=: a value that's either =Some(x)= or =None=
  • =Result<T, E>=: either =Ok(x)= or =Err(e)= (this is the basis of
    all the error handling in the Polars examples)
    #+begin_src rust
    fn divide(a: f64, b: f64) -> Option<f64> {
        if b == 0.0 { None } else { Some(a / b) }
    }

    match divide(10.0, 2.0) {
        Some(result) => println!("Got {result}"),
        None => println!("Can't divide by zero"),
    }
    #+end_src
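The example above covers =Option=; the =Result= counterpart looks almost the same but carries a reason for the failure. A sketch, assuming a plain =String= is good enough as the error type (real code usually defines a dedicated one), with =divide_checked= as a hypothetical name:

#+begin_src rust
// Same division as divide, but the failure carries a message instead of a bare None.
fn divide_checked(a: f64, b: f64) -> Result<f64, String> {
    if b == 0.0 {
        Err(String::from("division by zero"))
    } else {
        Ok(a / b)
    }
}
#+end_src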

** The ? Operator: Error Handling Shorthand
On a =Result=, =?= means
"give me the value, or return the error from this function."
#+begin_src rust
use std::num::ParseIntError;

fn parse_and_double(text: &str) -> Result<i32, ParseIntError> {
    let n = text.parse::<i32>()?; // if parse fails, return the Err
    Ok(n * 2) // otherwise keep going
}
#+end_src
This is why =read_orders(...)?= reads cleanly: the =?= quietly
propagates any failure instead of forcing a big match block.
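The payoff grows with the number of fallible steps. In this sketch (=parse_and_add= is a made-up name) two parses are chained, and the first failure short-circuits the function:

#+begin_src rust
use std::num::ParseIntError;

// Each ? either yields the parsed number or returns the error to the caller.
fn parse_and_add(a: &str, b: &str) -> Result<i32, ParseIntError> {
    let x = a.trim().parse::<i32>()?;
    let y = b.trim().parse::<i32>()?;
    Ok(x + y)
}
#+end_src

Without =?=, each of those two lines would need its own =match= with an early =return Err(e)=.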

** Common Collections

  • =Vec<T>=: growable list (like a Python list)
  • =HashMap<K, V>=: key/value map (like a Python dict)
    #+begin_src rust
    let mut v: Vec<i32> = Vec::new();
    v.push(1);
    v.push(2);
    for item in &v { println!("{item}"); }

    use std::collections::HashMap;
    let mut scores = HashMap::new();
    scores.insert("alice", 10);
    scores.insert("bob", 7);
    #+end_src
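Reading from a =HashMap= ties back to =Option=: =get= returns =Some(&value)= or =None=. The sketch below counts occurrences with the standard =entry= API; =count_statuses= is a hypothetical helper.

#+begin_src rust
use std::collections::HashMap;

// Counts how often each status appears, inserting 0 the first time a key is seen.
fn count_statuses(statuses: &[&str]) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for s in statuses {
        *counts.entry(s.to_string()).or_insert(0) += 1;
    }
    counts
}
#+end_src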

** Cargo: Rust's Build Tool & Package Manager
The essentials:
#+begin_src bash
cargo new my_project # create a new project
cargo build # compile
cargo run # compile + run
cargo test # run tests
cargo add polars # add a dependency to Cargo.toml
#+end_src
Dependencies (called "crates") are declared in =Cargo.toml= and pulled
from crates.io.

** To watch for:

  • Ownership / borrowing — the =&= and =mut= dance. Expect to fight
    it early; it clicks with practice.
  • Two string types (=String= vs =&str=) — convert with
    =.to_string()= or =String::from(...)=.
  • Immutable by default — forgetting =mut= is the most common early
    error.
  • The compiler is your friend — Rust's error messages are unusually
    good. Read them; they often tell you the exact fix.
  • Macros vs functions — =println!=, =vec!=, =df!= end in =!= and
    behave a little differently from normal functions.

** What Polars Is

  • A DataFrame library for working with tabular data (rows and
    columns) — think spreadsheets or database tables, in code
  • Written in Rust, built on Apache Arrow (a columnar memory format)
  • Columnar: stores data by column, not by row — which is why
    column operations and analytics are fast
  • Multithreaded by default: uses all your CPU cores without you
    asking
  • Available from Rust directly, and from Python via bindings

** The Two Core Types

  • =Series= — a single column of data, all the same type
  • =DataFrame= — a collection of Series; the table itself
    #+begin_src rust
    use polars::prelude::*;

    // A Series is one named column.
    let s = Series::new("amount".into(), &[42.5, 17.0, 9.99]);

    // A DataFrame is built from columns. The df! macro is the easy way.
    let df = df!(
        "order_id" => &[1, 2, 3],
        "amount" => &[42.5, 17.0, 9.99],
    )?;
    println!("{df}");
    #+end_src
Note =df!= ends in =!= — it's a macro, like =println!= and =vec!=.

** Everything Returns a Result
Almost every Polars operation can fail (bad types, missing columns,
bad files), so it returns =PolarsResult<T>=. That's why you see =?=
everywhere in the workshop — it propagates errors instead of letting
them pass silently.
#+begin_src rust
fn build() -> PolarsResult<DataFrame> {
    let df = df!("a" => &[1, 2, 3])?; // ? unwraps or returns the error
    Ok(df)
}
#+end_src
This ties straight back to Rust's =Result= and =?=: bad data becomes
an error you must handle, not a silent =NaN=.

** Reading and Writing Data
The agenda covers four formats; CSV and Parquet are shown here:
#+begin_src rust
// CSV in
let mut df = CsvReadOptions::default()
    .with_has_header(true)
    .try_into_reader_with_file_path(Some("orders.csv".into()))?
    .finish()?;

// Parquet out (finish takes &mut DataFrame, hence `let mut df` above)
let mut file = std::fs::File::create("orders.parquet")?;
ParquetWriter::new(&mut file).finish(&mut df)?;

// Parquet in
let mut f = std::fs::File::open("orders.parquet")?;
let df = ParquetReader::new(&mut f).finish()?;
#+end_src
Key idea: Parquet stores the schema and types inside the file, so
reading it back needs no guessing. CSV is text and must be inferred or
given an explicit schema.

** Schemas: The Contract
A =Schema= declares each column's name and type up front. Give one to
a reader and bad data fails loudly instead of corrupting a column.
#+begin_src rust
let mut schema = Schema::default();
schema.with_column("order_id".into(), DataType::Int64);
schema.with_column("amount".into(), DataType::Float64);
#+end_src
Common =DataType=s: =Int64=, =Float64=, =String=, =Boolean=, =Date=.

** Selecting and Filtering
You describe operations with expressions — =col(...)= refers to a
column, and you chain transformations.
#+begin_src rust
let result = df
    .clone()
    .lazy()
    .filter(col("status").eq(lit("shipped"))) // keep matching rows
    .select([col("order_id"), col("amount")]) // pick columns
    .collect()?; // run it
#+end_src

  • =col("x")= — refer to column x
  • =lit("shipped")= — a literal value to compare against
  • =.eq=, =.gt=, =.lt= — comparison operators on expressions

** Joins: Combining Tables
Match rows from two DataFrames on a shared key.
#+begin_src rust
let joined = orders.join(
    &customers,
    ["customer_id"], // key in left table
    ["customer_id"], // key in right table
    JoinArgs::new(JoinType::Inner), // Inner / Left / Anti / ...
    None,
)?;
#+end_src
Join types worth knowing:

  • =Inner= — only rows that match in both
  • =Left= — all left rows, nulls where no match
  • =Anti= — left rows with no match (great as a data-quality check)

** Eager vs Lazy: The Big Distinction

  • Eager — each operation runs immediately (=DataFrame=). Simple,
    good for small data and exploration.
  • Lazy — you build a query plan, and nothing runs until
    =.collect()=. Polars then optimizes the whole plan (pushing
    filters down, reading only needed columns).
    #+begin_src rust
    // Lazy: scan_* and .lazy() return a LazyFrame -- a plan, not data yet.
    let plan = LazyCsvReader::new(PlPath::new("orders.csv"))
        .with_has_header(true)
        .finish()?
        .filter(col("status").eq(lit("shipped")))
        .select([col("order_id"), col("amount")]);

    println!("{}", plan.clone().explain(true)?); // inspect the plan
    let df = plan.collect()?; // NOW it runs
    #+end_src
=explain(true)= prints the optimized plan — you can see what the
engine decided to do before spending any compute.
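If the lazy model feels unfamiliar, Rust's own iterators behave in a similar way: adapters such as =filter= and =map= only describe work, and nothing happens until =collect=. The sketch below is an analogy using the standard library only, not Polars code, and =shipped_amounts= is a made-up name. Unlike Polars, plain iterators do not reorder or optimize the chain.

#+begin_src rust
// Builds a chain of lazy adapters over (status, amount) rows, then runs it.
fn shipped_amounts(rows: &[(&str, f64)]) -> Vec<f64> {
    let plan = rows
        .iter()
        .filter(|row| row.0 == "shipped") // nothing runs yet
        .map(|row| row.1);                // still nothing
    plan.collect()                        // the chain executes here
}
#+end_src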

** Common Operations Cheat Sheet
#+begin_src rust
df.height(); // number of rows
df.width(); // number of columns
df.column("amount")?; // get a column by name
df.head(Some(5)); // first 5 rows
df.get_column_names(); // column names
df.column("amount")?.dtype();// the column's data type
#+end_src

** Why Polars (vs pandas / Spark)

  • vs pandas: often much faster, multithreaded, lazy optimization,
    usually better memory behavior; types are stricter (fewer silent
    surprises)
  • vs Spark: no cluster needed for single-machine workloads; many
    "we need Spark" jobs are really "pandas was too slow on one box"
  • Polars gives you performance plus correctness without
    distributed-systems overhead

** How It Connects to the Rust Basics

  • =PolarsResult= and =?= = Rust's =Result= + =?= operator
  • =&customers= in a join = borrowing (reading without taking
    ownership)
  • =&mut df= when writing Parquet = a mutable borrow
  • =df!= and other macros ending in =!= = Rust's macro syntax
  • Schemas and =DataType= = Rust's "everything has a known type" idea,
    applied to table columns

** Demo

** A Deliberately Imperfect CSV
Use a file with a mixed-type column, a null, and a bad row:
#+begin_src text
order_id,customer_id,amount,status
1,100,42.50,shipped
2,101,,pending
3,102,17.00,shipped
4,bad_id,9.99,shipped
#+end_src
Row 4 has a non-numeric =customer_id=. In a loose pipeline this
becomes a silent NaN or an object column. We want it to be loud.

** Eager Read With Inferred Schema (the easy, dangerous path)
#+begin_src rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = CsvReadOptions::default()
        .with_has_header(true)
        .try_into_reader_with_file_path(Some("orders.csv".into()))?
        .finish()?;

    println!("{df}");
    Ok(())
}
#+end_src
This works, but inference looked at a sample and guessed the
types. On a different file, or more rows, the guess can change.
Inference is convenient but data-dependent; that combination is
what bites you in production.

** Explicit Schema (the reliability lesson)
Stop guessing. State the contract:
#+begin_src rust
use polars::prelude::*;
use std::sync::Arc;

fn read_orders(path: &str) -> PolarsResult<DataFrame> {
    let mut schema = Schema::default();
    schema.with_column("order_id".into(), DataType::Int64);
    schema.with_column("customer_id".into(), DataType::Int64);
    schema.with_column("amount".into(), DataType::Float64);
    schema.with_column("status".into(), DataType::String);

    CsvReadOptions::default()
        .with_has_header(true)
        .with_schema(Some(Arc::new(schema)))
        .try_into_reader_with_file_path(Some(path.into()))?
        .finish()
}
#+end_src
Now =customer_id= is declared =Int64=. The bad row (=bad_id=) can no
longer slip through as text — Polars returns an =Err=, not a quietly
corrupted column. The failure happens at read time, with a clear
cause, instead of three transformations later.

** This Is the Rust + Polars Point

  • The schema is code — it is versioned, reviewed, and tested like
    any other contract
  • =finish()= returns =PolarsResult<DataFrame>=. There is no way to
    ignore a parse failure by accident: to get at the data you must
    handle the =Err= or propagate it with =?=
  • Compare to a dynamically typed pipeline where a bad parse becomes
    =NaN= and flows downstream silently. Here, the type system and
    the error type make silence impossible.
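The same idea can be shown without Polars at all. The sketch below uses only the standard library and hypothetical names (=OrderKey=, =parse_key=); it parses the first two fields of a CSV line against a fixed "schema" of two integers, so a value like =bad_id= becomes an =Err= rather than a placeholder. It illustrates the principle only, not how Polars parses files.

#+begin_src rust
use std::num::ParseIntError;

#[derive(Debug, PartialEq)]
struct OrderKey {
    order_id: i64,
    customer_id: i64,
}

// Parses the first two comma-separated fields as i64.
// A non-numeric field yields Err instead of a placeholder value.
fn parse_key(line: &str) -> Result<OrderKey, ParseIntError> {
    let mut fields = line.split(',');
    // A missing field is treated as "", which fails to parse and is reported as Err.
    let order_id = fields.next().unwrap_or("").parse::<i64>()?;
    let customer_id = fields.next().unwrap_or("").parse::<i64>()?;
    Ok(OrderKey { order_id, customer_id })
}
#+end_src

Fed row 4 of the sample file, =parse_key= returns an =Err= at the boundary, which is the behavior the explicit Polars schema gives you for the whole file.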

** Error Handling as a First-Class Concern
Show both behaviors so the audience feels the difference:
#+begin_src rust
fn main() {
    match read_orders("orders.csv") {
        Ok(df) => println!("Loaded {} rows\n{df}", df.height()),
        Err(e) => eprintln!("CSV failed its contract: {e}"),
    }
}
#+end_src
In a pipeline, =Err= means the job stops here, loudly, with a
message — not at 3am, forty million rows in.

** Lock It Down With a Test
The reliability theme made concrete — a test that asserts the
contract, so a malformed upstream file fails in CI, not prod:
#+begin_src rust
#[cfg(test)]
mod tests {
   use super::*;

   #[test]
   fn schema_is_enforced() {
       let df = read_orders("tests/data/orders_good.csv").unwrap();
       assert_eq!(df.height(), 3);
       assert_eq!(
           df.column("amount").unwrap().dtype(),
           &DataType::Float64
       );
   }

   #[test]
   fn bad_types_are_rejected() {
       // The file with `bad_id` must NOT load silently.
       assert!(read_orders("tests/data/orders_bad.csv").is_err());
   }

}
#+end_src
=bad_types_are_rejected= is the whole philosophy in one test: we
assert that bad data fails. Most pipelines never write that test
because in their stack, bad data does not fail — it spreads.

** Handling Nulls on Purpose (not by accident)
The empty =amount= on row 2 is a real null. Decide what it means
instead of letting a guess decide:
#+begin_src rust
use polars::prelude::*;

fn parse_options() -> CsvParseOptions {
    CsvParseOptions::default()
        .with_null_values(Some(NullValues::AllColumns(
            vec!["".into(), "NA".into(), "null".into()].into(),
        )))
}
#+end_src
Operational clarity: nulls are a documented decision in the code,
not an artifact of whatever the parser felt like doing.
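To make the three possible outcomes explicit, here is a standard-library-only sketch of the same decision for a single field. The names =NULL_TOKENS= and =parse_amount= are hypothetical, and this is not what Polars does internally; it only shows that "null", "value" and "malformed" can be kept as three distinct cases.

#+begin_src rust
use std::num::ParseFloatError;

// Tokens treated as null; the same three strings passed to with_null_values above.
const NULL_TOKENS: [&str; 3] = ["", "NA", "null"];

// Ok(None) = documented null, Ok(Some(x)) = a value, Err = malformed input.
fn parse_amount(field: &str) -> Result<Option<f64>, ParseFloatError> {
    if NULL_TOKENS.contains(&field) {
        return Ok(None);
    }
    field.parse::<f64>().map(Some)
}
#+end_src

The return type =Result<Option<f64>, _>= keeps "missing" and "broken" apart, so downstream code cannot confuse the two.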

** Section Takeaways

  • CSV is untyped and unsafe by default — treat every read as a
    boundary that must be validated
  • Explicit schemas turn "hope it parses" into "it parses or it
    errors" — determinism over convenience
  • =PolarsResult= means you cannot use the data without first dealing with the failure case
  • One test (=bad_types_are_rejected=) demonstrates the entire
    reliability thesis
  • Rust + Polars matters here not because it is faster, but because
    it makes silent data corruption structurally hard