Rust Data Pipelines: From Files to a Clean Database and a Web Dashboard
Introduction
We are building a small environmental data pipeline. Raw water-quality monitoring files arrive as CSV. Our Rust tool validates them, drops bad records, fills safe gaps, stores the trusted measurements, and powers a dashboard.
Data Pipeline
About the Dataset
This dataset [1] contains raw water-quality monitoring data from Cork Harbour, Moy Killala, and 15 other coastal locations in Ireland. The raw extracted dataset has more than 1.27 million entries, and the repository also includes a transformed (pivoted) version with 29,159 rows across 11 water-quality parameters. The files are CSV, so they are easy to use for the "files → clean database → dashboard" flow.
Tools and Libraries
We use Rust [3] to implement our data pipeline, building on Polars [2].
DataFrame
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{df, error::PolarsError, frame::DataFrame};

fn main() -> Result<(), PolarsError> {
    // Build a small in-memory DataFrame; each key is a column name.
    let df: DataFrame = df!(
        "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate" => [
            NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
            NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
            NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
            NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
        ],
        "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
        "height" => [1.56, 1.77, 1.65, 1.75], // (m)
    )?;

    println!("Data:");
    println!("{df}");

    // `head(Some(2))` returns the first two rows as a new DataFrame.
    let head = df.head(Some(2));
    println!("Head:");
    println!("{head}");
    Ok(())
}

Data:
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
Head:
shape: (2, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└──────────────┴────────────┴────────┴────────┘
Selecting Columns
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
    df,
    error::PolarsError,
    frame::DataFrame,
    prelude::{IntoLazy, col},
};

fn main() -> Result<(), PolarsError> {
    let df: DataFrame = df!(
        "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate" => [
            NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
            NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
            NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
            NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
        ],
        "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
        "height" => [1.56, 1.77, 1.65, 1.75], // (m)
    )?;

    // `select` keeps only the listed expressions as output columns.
    let result = df
        .clone()
        .lazy()
        .select([
            col("name"),
            col("birthdate").dt().year().alias("birth_year"),
            (col("weight") / col("height").pow(2)).alias("bmi"),
        ])
        .collect()?;
    println!("Column selection:");
    println!("{result}");
    Ok(())
}

Column selection:
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘
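As a quick sanity check on the `bmi` column, the same arithmetic (weight in kilograms divided by height in metres squared) can be reproduced in plain Rust without Polars. This is only a sketch: the `bmi` function is a hypothetical helper introduced for illustration and is not part of the pipeline.

```rust
/// Body-mass index: weight (kg) divided by height (m) squared.
/// Hypothetical helper mirroring `col("weight") / col("height").pow(2)`.
fn bmi(weight_kg: f64, height_m: f64) -> f64 {
    weight_kg / height_m.powi(2)
}

fn main() {
    // Same (name, weight, height) values as in the DataFrame example.
    let people = [
        ("Alice Archer", 57.9, 1.56),
        ("Ben Brown", 72.5, 1.77),
        ("Chloe Cooper", 54.6, 1.65),
        ("Daniel Donovan", 83.1, 1.75),
    ];
    for (name, weight, height) in people {
        // Print with six decimals so the values can be compared with the table.
        println!("{name}: {:.6}", bmi(weight, height));
    }
}
```

The printed values should agree with the `bmi` column in the table up to rounding.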
Adding Columns
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
    df,
    error::PolarsError,
    frame::DataFrame,
    prelude::{IntoLazy, col},
};

fn main() -> Result<(), PolarsError> {
    let df: DataFrame = df!(
        "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate" => [
            NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
            NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
            NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
            NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
        ],
        "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
        "height" => [1.56, 1.77, 1.65, 1.75], // (m)
    )?;

    // Unlike `select`, `with_columns` keeps the existing columns and appends the new ones.
    let result = df
        .clone()
        .lazy()
        .with_columns([
            col("birthdate").dt().year().alias("birth_year"),
            (col("weight") / col("height").pow(2)).alias("bmi"),
        ])
        .collect()?;
    println!("With added columns:");
    println!("{result}");
    Ok(())
}

With added columns:
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘
Expression Expansion
`lit` stands for "literal" and is part of the lazy expression API provided by the `lazy` feature of Polars [2].
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
    df,
    error::PolarsError,
    frame::DataFrame,
    prelude::{IntoLazy, RoundMode, col, cols, lit},
};

fn main() -> Result<(), PolarsError> {
    let df: DataFrame = df!(
        "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate" => [
            NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
            NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
            NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
            NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
        ],
        "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
        "height" => [1.56, 1.77, 1.65, 1.75], // (m)
    )?;

    // One expression is expanded over both columns: each is scaled by 0.95,
    // rounded to two decimals, and renamed with a "-5%" suffix.
    let result = df
        .clone()
        .lazy()
        .select([
            col("name"),
            (cols(["weight", "height"]).as_expr() * lit(0.95))
                .round(2, RoundMode::default())
                .name()
                .suffix("-5%"),
        ])
        .collect()?;
    println!("Transform:");
    println!("{result}");
    Ok(())
}

Transform:
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 51.87     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘
Filtering Rows
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
    df,
    error::PolarsError,
    frame::DataFrame,
    prelude::{ClosedInterval, IntoLazy, col, lit},
};

fn main() -> Result<(), PolarsError> {
    let df: DataFrame = df!(
        "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate" => [
            NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
            NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
            NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
            NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
        ],
        "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
        "height" => [1.56, 1.77, 1.65, 1.75], // (m)
    )?;

    // Simple filter: keep people born before 1990.
    let result = df
        .clone()
        .lazy()
        .filter(col("birthdate").dt().year().lt(lit(1990)))
        .collect()?;
    println!("With row filtering:");
    println!("{result}");

    // Combined filter: birthdate inside an inclusive range AND height above 1.7 m.
    let result = df
        .clone()
        .lazy()
        .filter(
            col("birthdate")
                .is_between(
                    lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                    lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                    ClosedInterval::Both,
                )
                .and(col("height").gt(lit(1.7))),
        )
        .collect()?;
    println!("With complex row filtering:");
    println!("{result}");
    Ok(())
}

With row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
With complex row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
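`ClosedInterval::Both` makes both bounds of `is_between` inclusive. The plain-Rust sketch below mimics the combined filter with `(year, month, day)` tuples, which compare field by field and therefore in chronological order. The `Date` alias and the `born_between` function are hypothetical names for illustration, not Polars APIs.

```rust
/// A calendar date as (year, month, day); tuple comparison is lexicographic,
/// so it matches chronological order.
type Date = (i32, u32, u32);

/// Range check that includes both endpoints, analogous to `ClosedInterval::Both`.
fn born_between(date: Date, start: Date, end: Date) -> bool {
    (start..=end).contains(&date)
}

fn main() {
    // Same names, birthdates, and heights as in the DataFrame example.
    let people = [
        ("Alice Archer", (1997, 1, 10), 1.56),
        ("Ben Brown", (1985, 2, 15), 1.77),
        ("Chloe Cooper", (1997, 3, 22), 1.65),
        ("Daniel Donovan", (1997, 4, 30), 1.75),
    ];
    let start = (1982, 12, 31);
    let end = (1996, 1, 1);
    for (name, birthdate, height) in people {
        // Both conditions must hold, like `.and(...)` in the Polars filter.
        if born_between(birthdate, start, end) && height > 1.7 {
            println!("{name}");
        }
    }
}
```

A birthdate of exactly 1982-12-31 or 1996-01-01 would pass this check, which is the difference between a closed interval and a strict `<` / `>` comparison.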
Grouping (group by)
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
    df,
    error::PolarsError,
    frame::DataFrame,
    prelude::{IntoLazy, RoundMode, col, len, lit},
};

fn main() -> Result<(), PolarsError> {
    let df: DataFrame = df!(
        "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate" => [
            NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
            NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
            NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
            NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
        ],
        "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
        "height" => [1.56, 1.77, 1.65, 1.75], // (m)
    )?;

    // Integer division by 10 followed by multiplication by 10 maps a year to its decade.
    let result = df
        .clone()
        .lazy()
        .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
        .agg([len()])
        .collect()?;
    println!("Grouping by birth decade:");
    println!("{result}");

    // Several aggregations per group: count, mean weight, and maximum height.
    let result = df
        .clone()
        .lazy()
        .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
        .agg([
            len().alias("sample_size"),
            col("weight")
                .mean()
                .round(2, RoundMode::default())
                .alias("avg_weight"),
            col("height").max().alias("tallest"),
        ])
        .collect()?;
    println!("Grouping by derived features:");
    println!("{result}");
    Ok(())
}

Grouping by birth decade:
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 3   │
│ 1980   ┆ 1   │
└────────┴─────┘
Grouping by derived features:
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1980   ┆ 1           ┆ 72.5       ┆ 1.77    │
│ 1990   ┆ 3           ┆ 65.2       ┆ 1.75    │
└────────┴─────────────┴────────────┴─────────┘
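The decade key relies on integer division: `1997 / 10` is `199`, and multiplying back by 10 gives `1990`. A minimal plain-Rust sketch of the same grouping follows; `decade` and `summarize` are hypothetical helper names, not Polars functions. It uses a `BTreeMap`, which keeps keys sorted, whereas the two Polars outputs list the decades in different orders because a plain `group_by` does not guarantee row order.

```rust
use std::collections::BTreeMap;

/// Integer division drops the last digit; multiplying back yields the decade.
fn decade(year: i32) -> i32 {
    year / 10 * 10
}

/// Per-decade sample size and mean weight from (birth_year, weight_kg) pairs.
/// Hypothetical helper for illustration only.
fn summarize(people: &[(i32, f64)]) -> BTreeMap<i32, (usize, f64)> {
    let mut acc: BTreeMap<i32, (usize, f64)> = BTreeMap::new();
    for &(year, weight) in people {
        // Accumulate the count and the running weight sum per decade.
        let entry = acc.entry(decade(year)).or_insert((0, 0.0));
        entry.0 += 1;
        entry.1 += weight;
    }
    // Turn each weight sum into a mean.
    for (count, sum) in acc.values_mut() {
        *sum /= *count as f64;
    }
    acc
}

fn main() {
    // Birth years and weights from the DataFrame example.
    let people = [(1997, 57.9), (1985, 72.5), (1997, 54.6), (1997, 83.1)];
    for (decade, (sample_size, avg_weight)) in summarize(&people) {
        println!("{decade}: n={sample_size}, avg_weight={avg_weight:.2}");
    }
}
```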
Data Analysis
When we receive a new dataset, the goal is not to draw charts or run models right away. The first goal is to understand whether the data can be trusted.
Inspect the raw data:
Download the data, load it with Polars [2], and then print the head.
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
    error::PolarsError,
    prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};

fn main() -> Result<(), PolarsError> {
    let df_csv = CsvReadOptions::default()
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some(
            "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
        ))?
        .finish()?;
    println!("{df_csv}");
    Ok(())
}

rust-script failed with exit code 1
[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))
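This error is typical of sample-based schema inference: the sampled rows of `Alkalinity-total (as CaCO3)` all look like integers, so the column is typed `i64`, and a later value such as `50.5` no longer parses. The toy sketch below reproduces the effect in plain Rust under a simplified three-way type guess. `Kind` and `infer_kind` are hypothetical names, and Polars' real inference is more elaborate than this.

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    Int,
    Float,
    Text,
}

/// Guess a column type from the first `sample` values (`None` = all values).
/// Simplified stand-in for schema inference, for illustration only.
fn infer_kind(values: &[&str], sample: Option<usize>) -> Kind {
    let n = sample.unwrap_or(values.len()).min(values.len());
    let head = &values[..n];
    if head.iter().all(|v| v.parse::<i64>().is_ok()) {
        Kind::Int
    } else if head.iter().all(|v| v.parse::<f64>().is_ok()) {
        Kind::Float
    } else {
        Kind::Text
    }
}

fn main() {
    let column = ["14", "17", "18", "50.5"];
    // Sampling only the first three rows picks Int, so "50.5" fails later.
    let guessed = infer_kind(&column, Some(3));
    let failure = column.iter().find(|v| v.parse::<i64>().is_err());
    println!("sampled guess: {guessed:?}, first unparsable value: {failure:?}");
    // Looking at every value picks Float, which fits the whole column.
    println!("full scan guess: {:?}", infer_kind(&column, None));
}
```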
Polars [2] is not inferring the types of some columns correctly, because by default it infers the schema from only the first 100 rows. Let's have it scan the whole file instead by passing `None` as the inference length.
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
    error::PolarsError,
    prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};

fn main() -> Result<(), PolarsError> {
    let df_csv = CsvReadOptions::default()
        .with_has_header(true)
        .with_infer_schema_length(None)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some(
            "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
        ))?
        .finish()?;
    println!("{df_csv}");
    Ok(())
}

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
Now let's have Polars [4] infer the proper column types from the first 10,000 rows.
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
    error::PolarsError,
    prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};

fn main() -> Result<(), PolarsError> {
    let df_csv = CsvReadOptions::default()
        .with_has_header(true)
        .with_infer_schema_length(Some(10_000))
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some(
            "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
        ))?
        .finish()?;
    println!("{df_csv}");
    Ok(())
}

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
// Imports for the calls shown here. `inspect_raw_data` is a helper that is
// not shown in this section and needs its own imports.
use polars::prelude::{CsvParseOptions, CsvReadOptions, PolarsResult, SerReader};

fn main() -> PolarsResult<()> {
    let df = CsvReadOptions::default()
        .with_has_header(true)
        // Discovery step: sample up to 10,000 rows to infer types, because we do not know the columns yet.
        .with_infer_schema_length(Some(10_000))
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some(
            "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
        ))?
        .finish()?;
    inspect_raw_data(df.clone())?;
    Ok(())
}

rows: 29159
columns: 14
columns and types:
WaterbodyName: String (text or mixed)
Years: Int64 (number)
SampleDate: String (text or mixed)
Alkalinity-total (as CaCO3): Float64 (number)
Ammonia-Total (as N): Float64 (number)
BOD - 5 days (Total): Float64 (number)
Chloride: Float64 (number)
Conductivity @25°C: Float64 (number)
Dissolved Oxygen: Float64 (number)
ortho-Phosphate (as P) - unspecified: Float64 (number)
pH: Float64 (number)
Temperature: Float64 (number)
Total Hardness (as CaCO3): Float64 (number)
True Colour: Float64 (number)
one raw row:
shape: (1, 14)
┌───────────────┬───────┬────────────┬─────────────────────┬───┬─────┬─────────────┬────────────────────┬─────────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ Alkalinity-total    ┆ … ┆ pH  ┆ Temperature ┆ Total Hardness (as ┆ True Colour │
│ ---           ┆ ---   ┆ ---        ┆ (as CaCO3)          ┆   ┆ --- ┆ ---         ┆ CaCO3)             ┆ ---         │
│ str           ┆ i64   ┆ str        ┆ ---                 ┆   ┆ f64 ┆ f64         ┆ ---                ┆ f64         │
│               ┆       ┆            ┆ f64                 ┆   ┆     ┆             ┆ f64                ┆             │
╞═══════════════╪═══════╪════════════╪═════════════════════╪═══╪═════╪═════════════╪════════════════════╪═════════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ 314.0               ┆ … ┆ 7.8 ┆ 10.4        ┆ 370.0              ┆ 24.0        │
└───────────────┴───────┴────────────┴─────────────────────┴───┴─────┴─────────────┴────────────────────┴─────────────┘
location/date columns: ["WaterbodyName", "Years", "SampleDate"]
measurement columns: ["Alkalinity-total (as CaCO3)", "Ammonia-Total (as N)", "BOD - 5 days (Total)", "Chloride", "Conductivity @25°C", "Dissolved Oxygen", "ortho-Phosphate (as P) - unspecified", "pH", "Temperature", "Total Hardness (as CaCO3)", "True Colour"]
long water-quality shape:
shape: (10, 7)
┌───────────────┬───────┬────────────┬─────────────────────────────┬───────────────────┬──────────────────┬──────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column               ┆ measurement_value ┆ parameter        ┆ unit     │
│ ---           ┆ ---   ┆ ---        ┆ ---                         ┆ ---               ┆ ---              ┆ ---      │
│ str           ┆ i64   ┆ str        ┆ str                         ┆ f64               ┆ str              ┆ str      │
╞═══════════════╪═══════╪════════════╪═════════════════════════════╪═══════════════════╪══════════════════╪══════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ Alkalinity-total (as CaCO3) ┆ 314.0             ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 14.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 17.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 8.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 9.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 10.0              ┆ Alkalinity-total ┆ as CaCO3 │
└───────────────┴───────┴────────────┴─────────────────────────────┴───────────────────┴──────────────────┴──────────┘
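The long table above turns each measurement column into rows and splits a header such as `Alkalinity-total (as CaCO3)` into a `parameter` and a `unit`. Since `inspect_raw_data` is not shown, the plain-Rust sketch below is only one possible way to do this, assuming the unit is whatever sits in a trailing pair of parentheses. `split_parameter_unit` and `to_long` are hypothetical names, and the rule would also treat `(Total)` in `BOD - 5 days (Total)` as a unit, which may not match the author's logic.

```rust
/// Split a header like "Alkalinity-total (as CaCO3)" into
/// ("Alkalinity-total", Some("as CaCO3")). Headers without a trailing
/// parenthesised part are returned unchanged, with no unit.
fn split_parameter_unit(header: &str) -> (&str, Option<&str>) {
    let trimmed = header.trim();
    if let (Some(open), true) = (trimmed.rfind('('), trimmed.ends_with(')')) {
        let parameter = trimmed[..open].trim_end();
        let unit = &trimmed[open + 1..trimmed.len() - 1];
        return (parameter, Some(unit));
    }
    (trimmed, None)
}

/// Reshape one wide row into long records: (waterbody, source_column, value).
fn to_long<'a>(
    waterbody: &'a str,
    headers: &[&'a str],
    values: &[f64],
) -> Vec<(&'a str, &'a str, f64)> {
    headers
        .iter()
        .zip(values)
        .map(|(header, value)| (waterbody, *header, *value))
        .collect()
}

fn main() {
    // Two measurement columns and values taken from the first Allua row.
    let headers = ["Alkalinity-total (as CaCO3)", "pH"];
    let values = [14.0, 7.42];
    for (waterbody, source_column, value) in to_long("Allua", &headers, &values) {
        let (parameter, unit) = split_parameter_unit(source_column);
        println!("{waterbody} | {source_column} | {value} | {parameter} | {unit:?}");
    }
}
```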

