Rust Data Pipeline: From Files to a Cleaned Database and a Web Dashboard

Introduction
Data Pipeline
Footnotes

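Introduction

We are building a small environmental data pipeline. Raw water quality monitoring files arrive as CSV. Our Rust tool validates them, cleans out bad records, fills gaps where it is safe to do so, stores the trusted measurements, and feeds a dashboard.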

Data Pipeline

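About the dataset

The dataset¹ contains *raw water quality monitoring data* from Cork Harbour, Moy Killala, and 15 other coastal sites in Ireland. The raw extract holds more than *1.27 million entries*, and the repository also includes a transformed (pivoted) version with *29,159 rows* covering *11 water quality parameters*. The files are CSV, which makes them a good fit for a "file → cleaned database → dashboard" flow.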

Tools and libraries

We implement the data pipeline in Rust² with Polars³.

DataFrame

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{df, error::PolarsError, frame::DataFrame};

  fn main() -> Result<(), PolarsError> {
      // Build a small example DataFrame from in-memory columns.
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;
      println!("Data:");
      println!("{df}");

      // Keep only the first two rows.
      let head = df.head(Some(2));
      println!("Head:");
      println!("{head}");

      Ok(())
  }

Data:
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
Head:
shape: (2, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└──────────────┴────────────┴────────┴────────┘

Selecting columns

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };

  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // Keep one column as-is and derive two new ones from the others.
      let result = df
          .lazy()
          .select([
              col("name"),
              col("birthdate").dt().year().alias("birth_year"),
              (col("weight") / col("height").pow(2)).alias("bmi"),
          ])
          .collect()?;
      println!("Column selection:");
      println!("{result}");

      Ok(())
  }

Column selection:
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘
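
As a sanity check on the bmi column, the arithmetic can be reproduced without Polars. The sketch below is plain Rust using only the standard library; the bmi helper is a hypothetical name introduced here for illustration and is not part of the Polars API.

```rust
/// Body mass index: weight in kilograms divided by height in metres squared.
/// Hypothetical helper for illustration; it mirrors the expression
/// `col("weight") / col("height").pow(2)` for a single row.
fn bmi(weight_kg: f64, height_m: f64) -> f64 {
    weight_kg / height_m.powi(2)
}

fn main() {
    // The same four people as in the example DataFrame.
    let people = [
        ("Alice Archer", 57.9, 1.56),
        ("Ben Brown", 72.5, 1.77),
        ("Chloe Cooper", 54.6, 1.65),
        ("Daniel Donovan", 83.1, 1.75),
    ];
    for (name, weight, height) in people {
        println!("{name}: {:.6}", bmi(weight, height));
    }
}
```

The printed values should agree with the bmi column in the table above, which is a quick way to confirm that the expression does what we expect.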

Adding columns

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };

  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // Unlike select, with_columns keeps the existing columns and appends the new ones.
      let result = df
          .lazy()
          .with_columns([
              col("birthdate").dt().year().alias("birth_year"),
              (col("weight") / col("height").pow(2)).alias("bmi"),
          ])
          .collect()?;
      println!("With added columns:");
      println!("{result}");

      Ok(())
  }

With added columns:
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘

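Expression expansion

lit stands for literal; it is part of the lazy expression API that Polars³ exposes under its lazy feature. In the example below it wraps the constant 0.95 so that it can be multiplied with column expressions. cols selects both weight and height, and the expression that follows is applied to each of them in turn.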

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, RoundMode, col, cols, lit},
  };

  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // One expression, expanded over both selected columns:
      // scale by 0.95, round to two decimals, and suffix the column name.
      let result = df
          .lazy()
          .select([
              col("name"),
              (cols(["weight", "height"]).as_expr() * lit(0.95))
                  .round(2, RoundMode::default())
                  .name()
                  .suffix("-5%"),
          ])
          .collect()?;
      println!("Transform:");
      println!("{result}");

      Ok(())
  }

Transform:
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 51.87     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘
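
Conceptually, expression expansion applies one expression to every selected column and renames each result. The std-only sketch below imitates that by hand for a single row. The expand helper is hypothetical, and it rounds with f64::round, which may not match Polars' RoundMode on exact ties, so treat it as an illustration of the shape of the operation rather than of Polars internals.

```rust
/// Apply `value * factor`, rounded to two decimals, to each named column
/// and append `suffix` to the column name. Hypothetical helper that only
/// imitates the shape of the Polars expression for one row of data.
fn expand(columns: &[(&str, f64)], factor: f64, suffix: &str) -> Vec<(String, f64)> {
    columns
        .iter()
        .map(|(name, value)| {
            let scaled = (value * factor * 100.0).round() / 100.0;
            (format!("{name}{suffix}"), scaled)
        })
        .collect()
}

fn main() {
    // Chloe Cooper's weight and height from the example DataFrame.
    let row = [("weight", 54.6), ("height", 1.65)];
    for (name, value) in expand(&row, 0.95, "-5%") {
        println!("{name} = {value}");
    }
}
```

Writing the transformation once and letting it fan out over the selected columns is what keeps the Polars version short: adding a third column only means adding its name to the selection.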

Filtering rows

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{ClosedInterval, IntoLazy, col, lit},
  };

  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // Keep only people born before 1990.
      let result = df
          .clone()
          .lazy()
          .filter(col("birthdate").dt().year().lt(lit(1990)))
          .collect()?;
      println!("With row filtering:");
      println!("{result}");

      // Combine two conditions: a birthdate inside an inclusive range
      // and a height above 1.7 m.
      let result = df
          .lazy()
          .filter(
              col("birthdate")
                  .is_between(
                      lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                      lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                      ClosedInterval::Both,
                  )
                  .and(col("height").gt(lit(1.7))),
          )
          .collect()?;
      println!("With complex row filtering:");
      println!("{result}");

      Ok(())
  }

With row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
With complex row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
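
ClosedInterval::Both means that both bounds of the range are included. The std-only sketch below spells out the same predicate in plain Rust so the boundary behaviour is easy to see. Dates are modelled as (year, month, day) tuples, and is_between_inclusive is a hypothetical helper introduced here, not a Polars function.

```rust
/// A calendar date as (year, month, day). Tuples compare field by field,
/// which matches chronological order for valid dates.
type Date = (i32, u32, u32);

/// Inclusive on both ends, like `ClosedInterval::Both` in the filter above.
fn is_between_inclusive(value: Date, lower: Date, upper: Date) -> bool {
    (lower..=upper).contains(&value)
}

fn main() {
    let people: [(&str, Date, f64); 4] = [
        ("Alice Archer", (1997, 1, 10), 1.56),
        ("Ben Brown", (1985, 2, 15), 1.77),
        ("Chloe Cooper", (1997, 3, 22), 1.65),
        ("Daniel Donovan", (1997, 4, 30), 1.75),
    ];
    let lower: Date = (1982, 12, 31);
    let upper: Date = (1996, 1, 1);

    // Same two conditions as the Polars filter: date range AND height > 1.7.
    for (name, birthdate, height) in people {
        if is_between_inclusive(birthdate, lower, upper) && height > 1.7 {
            println!("{name}");
        }
    }
}
```

A birthdate exactly on 1996-01-01 would still pass the range check here, which is the behaviour to keep in mind when choosing between closed and open intervals.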

Group by

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, RoundMode, col, len, lit},
  };

  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // Integer division by 10 followed by multiplication by 10
      // truncates the birth year to its decade.
      let result = df
          .clone()
          .lazy()
          .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
          .agg([len()])
          .collect()?;
      println!("Grouping by birth decade:");
      println!("{result}");

      // Same grouping key, with several aggregations per group.
      let result = df
          .lazy()
          .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
          .agg([
              len().alias("sample_size"),
              col("weight")
                  .mean()
                  .round(2, RoundMode::default())
                  .alias("avg_weight"),
              col("height").max().alias("tallest"),
          ])
          .collect()?;
      println!("Grouping by derived features:");
      println!("{result}");

      Ok(())
  }

Grouping by birth decade:
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 3   │
│ 1980   ┆ 1   │
└────────┴─────┘
Grouping by derived features:
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1990   ┆ 3           ┆ 65.2       ┆ 1.75    │
│ 1980   ┆ 1           ┆ 72.5       ┆ 1.77    │
└────────┴─────────────┴────────────┴─────────┘
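
The grouping key relies on integer division: dividing a year by 10 drops the last digit, and multiplying by 10 puts a zero back. A minimal std-only sketch of the same bucketing, with hypothetical helper names decade and count_by_decade, looks like this:

```rust
use std::collections::BTreeMap;

/// Truncate a year to its decade using integer division (1997 becomes 1990).
fn decade(year: i32) -> i32 {
    year / 10 * 10
}

/// Count how many of the given years fall into each decade.
fn count_by_decade(years: &[i32]) -> BTreeMap<i32, usize> {
    let mut counts = BTreeMap::new();
    for &year in years {
        *counts.entry(decade(year)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Birth years of the four people in the example DataFrame.
    let birth_years = [1997, 1985, 1997, 1997];
    for (bucket, count) in count_by_decade(&birth_years) {
        println!("{bucket}: {count}");
    }
}
```

One difference worth noting: the BTreeMap iterates in ascending key order, whereas the Polars output above lists 1990 before 1980.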

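Data analysis

When we receive a new dataset, the goal is not to build charts or run models right away. The first goal is to find out whether the data can be trusted. The full analysis is on GitHub.

1. Inspect the raw data

Download the data, load it with Polars³, and print the first rows.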

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

rust-script failed with exit code 1

[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))

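Polars³ guessed the wrong *type* for some columns. By default it infers the schema from only the first *100 rows*, so a column whose early values all look like integers is read as i64, and a later value such as 50.5 then fails to parse. Passing None to with_infer_schema_length makes it scan the whole file instead.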

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(None)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
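
To see why a limited sample can pick the wrong type, here is a std-only sketch of prefix-based type inference. It is not how Polars implements inference; Kind, classify, and infer_kind are hypothetical names, and the cells are made up to mirror the situation in the error message (integers first, a decimal value later).

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    Int,
    Float,
    Text,
}

/// Classify one raw CSV cell by trying the narrowest type first.
fn classify(cell: &str) -> Kind {
    if cell.parse::<i64>().is_ok() {
        Kind::Int
    } else if cell.parse::<f64>().is_ok() {
        Kind::Float
    } else {
        Kind::Text
    }
}

/// Infer a column type from at most `limit` leading cells
/// (`None` means look at every cell). The widest kind seen wins.
fn infer_kind(cells: &[&str], limit: Option<usize>) -> Kind {
    let sample = match limit {
        Some(n) => &cells[..n.min(cells.len())],
        None => cells,
    };
    let mut kind = Kind::Int;
    for cell in sample {
        match classify(cell) {
            Kind::Text => return Kind::Text,
            Kind::Float => kind = Kind::Float,
            Kind::Int => {}
        }
    }
    kind
}

fn main() {
    // Made-up cells: integers first, a decimal value only later on.
    let column = ["314", "14", "17", "18", "50.5"];
    println!("first 4 cells: {:?}", infer_kind(&column, Some(4)));
    println!("all cells:     {:?}", infer_kind(&column, None));
}
```

Looking at more rows costs time on large files, so the sample size is a trade-off between load speed and the risk of a parse failure further down the file.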

Now let's have Polars³ infer the correct column *types* from the first *10,000 rows*.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

With loading sorted out, the same reader is handed to the inspect_raw_data helper from the project's quality_flow module:

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::inspect_raw_data;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: sample up to 10,000 rows because we do not know the column types yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 1. Inspect the raw data
      inspect_raw_data(df)?;

      Ok(())
  }

  cargo run --bin inspect_raw_data

============================================================
1. Inspect the raw data
============================================================

raw dataset size:
This confirms how many rows and columns were loaded from the CSV.
shape: (2, 2)
┌─────────┬───────┐
│ metric  ┆ value │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ rows    ┆ 29159 │
│ columns ┆ 14    │
└─────────┴───────┘

inferred schema:
This shows each column name and the type Polars inferred from the file.
shape: (14, 3)
┌─────────────────────────────────┬───────────────┬───────────────┐
│ column                          ┆ inferred_type ┆ storage_kind  │
│ ---                             ┆ ---           ┆ ---           │
│ str                             ┆ str           ┆ str           │
╞═════════════════════════════════╪═══════════════╪═══════════════╡
│ WaterbodyName                   ┆ String        ┆ text or mixed │
│ Years                           ┆ Int64         ┆ number        │
│ SampleDate                      ┆ String        ┆ text or mixed │
│ Alkalinity-total (as CaCO3)     ┆ Float64       ┆ number        │
│ Ammonia-Total (as N)            ┆ Float64       ┆ number        │
│ …                               ┆ …             ┆ …             │
│ ortho-Phosphate (as P) - unspe… ┆ Float64       ┆ number        │
│ pH                              ┆ Float64       ┆ number        │
│ Temperature                     ┆ Float64       ┆ number        │
│ Total Hardness (as CaCO3)       ┆ Float64       ┆ number        │
│ True Colour                     ┆ Float64       ┆ number        │
└─────────────────────────────────┴───────────────┴───────────────┘

raw row sample:
This shows one original wide record before any reshaping.
shape: (1, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬─────┬──────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH  ┆ Temperature  ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ --- ┆ ---          ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64 ┆ f64          ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆     ┆              ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆     ┆              ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪═════╪══════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8 ┆ 10.4         ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆     ┆              ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴─────┴──────────────┴─────────────┴────────┘

first-pass column roles:
This separates location/date columns from measurement columns.
shape: (14, 2)
┌─────────────────────────────────┬─────────────┐
│ column                          ┆ role        │
│ ---                             ┆ ---         │
│ str                             ┆ str         │
╞═════════════════════════════════╪═════════════╡
│ WaterbodyName                   ┆ location    │
│ Years                           ┆ date        │
│ SampleDate                      ┆ date        │
│ Alkalinity-total (as CaCO3)     ┆ measurement │
│ Ammonia-Total (as N)            ┆ measurement │
│ …                               ┆ …           │
│ ortho-Phosphate (as P) - unspe… ┆ measurement │
│ pH                              ┆ measurement │
│ Temperature                     ┆ measurement │
│ Total Hardness (as CaCO3)       ┆ measurement │
│ True Colour                     ┆ measurement │
└─────────────────────────────────┴─────────────┘

long-form sample:
This previews the wide measurements as parameter/value rows.
shape: (5, 7)
┌───────────────┬───────┬────────────┬────────────────┬────────────────┬────────────────┬──────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column  ┆ measurement_va ┆ parameter      ┆ unit     │
│ ---           ┆ ---   ┆ ---        ┆ ---            ┆ lue            ┆ ---            ┆ ---      │
│ str           ┆ i64   ┆ str        ┆ str            ┆ ---            ┆ str            ┆ str      │
│               ┆       ┆            ┆                ┆ f64            ┆                ┆          │
╞═══════════════╪═══════╪════════════╪════════════════╪════════════════╪════════════════╪══════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ Alkalinity-tot ┆ 314.0          ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 14.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 17.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 18.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-tot ┆ 19.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
└───────────────┴───────┴────────────┴────────────────┴────────────────┴────────────────┴──────────┘
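
The long-form sample splits each wide column header into a parameter and a unit. The project's actual code lives in quality_flow and is not shown here, so the following is only a std-only sketch of how such a split and reshape could work. split_parameter_unit, to_long, and LongRow are hypothetical names, and the sketch assumes the unit always sits in trailing parentheses.

```rust
/// Split a wide column header such as "Alkalinity-total (as CaCO3)" into a
/// parameter name and an optional unit taken from the trailing parentheses.
/// Hypothetical helper; headers without a trailing ")" get no unit.
fn split_parameter_unit(header: &str) -> (String, Option<String>) {
    let trimmed = header.trim();
    if let Some(stripped) = trimmed.strip_suffix(')') {
        if let Some(open) = stripped.rfind('(') {
            let parameter = stripped[..open].trim().to_string();
            let unit = stripped[open + 1..].trim().to_string();
            return (parameter, Some(unit));
        }
    }
    (trimmed.to_string(), None)
}

/// One long-form record: (source column, parameter, unit, value).
type LongRow = (String, String, Option<String>, f64);

/// Turn one wide row (header/value pairs) into one long-form row per measurement.
fn to_long(wide: &[(&str, f64)]) -> Vec<LongRow> {
    wide.iter()
        .map(|(header, value)| {
            let (parameter, unit) = split_parameter_unit(header);
            (header.to_string(), parameter, unit, *value)
        })
        .collect()
}

fn main() {
    // A few measurements from the ABBEYTOWN_010 row shown above.
    let wide = [
        ("Alkalinity-total (as CaCO3)", 314.0),
        ("pH", 7.8),
        ("Temperature", 10.4),
        ("Total Hardness (as CaCO3)", 370.0),
    ];
    for (source, parameter, unit, value) in to_long(&wide) {
        println!("{source} | {parameter} | {unit:?} | {value}");
    }
}
```

Keeping the original header in a source_column field, as the sample above does, makes it possible to trace every long-form value back to the wide file.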

2. Profile the data

This gives us a first overview of the dataset before we make any decisions.

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::profile_the_data;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: sample up to 10,000 rows because we do not know the column types yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 2. Profile the data
      profile_the_data(df)?;

      Ok(())
  }

  cargo run --bin profile_the_data

============================================================
2. Profile the data
============================================================

profile scope:
This repeats the dataset size before summarizing each important column.
shape: (2, 2)
┌─────────┬───────┐
│ metric  ┆ value │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ rows    ┆ 29159 │
│ columns ┆ 14    │
└─────────┴───────┘

date coverage:
This combines Years and SampleDate into a usable month-level date range.
shape: (5, 2)
┌───────────────────────────┬────────────┐
│ metric                    ┆ value      │
│ ---                       ┆ ---        │
│ str                       ┆ str        │
╞═══════════════════════════╪════════════╡
│ earliest_date             ┆ 2007-01-01 │
│ latest_date               ┆ 2023-04-01 │
│ invalid_dates             ┆ 0          │
│ missing_dates             ┆ 0          │
│ gaps_over_time_gt_31_days ┆ 0          │
└───────────────────────────┴────────────┘

column profile:
This gives missing counts, distinct counts, numeric ranges, averages, and notes.
shape: (14, 9)
┌──────────────┬──────────┬─────────┬─────────┬───┬─────────┬─────────┬──────────────┬─────────────┐
│ column       ┆ role     ┆ type    ┆ missing ┆ … ┆ minimum ┆ maximum ┆ average      ┆ notes       │
│ ---          ┆ ---      ┆ ---     ┆ ---     ┆   ┆ ---     ┆ ---     ┆ ---          ┆ ---         │
│ str          ┆ str      ┆ str     ┆ i64     ┆   ┆ str     ┆ str     ┆ str          ┆ str         │
╞══════════════╪══════════╪═════════╪═════════╪═══╪═════════╪═════════╪══════════════╪═════════════╡
│ WaterbodyNam ┆ location ┆ String  ┆ 0       ┆ … ┆         ┆         ┆              ┆ unique      │
│ e            ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ locations:  │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ 160         │
│ Years        ┆ date     ┆ Int64   ┆ 0       ┆ … ┆ 2007    ┆ 2023    ┆ Float64(2014 ┆ included in │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ .78253712404 ┆ combined    │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 4)           ┆ date cove…  │
│ SampleDate   ┆ date     ┆ String  ┆ 0       ┆ … ┆         ┆         ┆              ┆ review full │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ category    │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ list; inc…  │
│ Alkalinity-t ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 442     ┆ Float64(139. ┆             │
│ otal (as     ┆          ┆         ┆         ┆   ┆         ┆         ┆ 858347851435 ┆             │
│ CaCO3)       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 2)           ┆             │
│ Ammonia-Tota ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 40      ┆ Float64(0.06 ┆             │
│ l (as N)     ┆          ┆         ┆         ┆   ┆         ┆         ┆ 357266127096 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 262)         ┆             │
│ …            ┆ …        ┆ …       ┆ …       ┆ … ┆ …       ┆ …       ┆ …            ┆ …           │
│ ortho-Phosph ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ -0.004  ┆ 70      ┆ Float64(0.06 ┆ negative    │
│ ate (as P) - ┆          ┆         ┆         ┆   ┆         ┆         ┆ 878934462773 ┆ value found │
│ unspe…       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 074)         ┆ (-0.004)    │
│ pH           ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 4.7     ┆ 9.8     ┆ Float64(7.55 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 205686066051 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 8)           ┆             │
│ Temperature  ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0.6     ┆ 637     ┆ Float64(10.8 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 505031036729 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 74)          ┆             │
│ Total        ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 642     ┆ Float64(159. ┆             │
│ Hardness (as ┆          ┆         ┆         ┆   ┆         ┆         ┆ 092110326142 ┆             │
│ CaCO3)       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 9)           ┆             │
│ True Colour  ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 953     ┆ Float64(58.1 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 374635618505 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 45)          ┆             │
└──────────────┴──────────┴─────────┴─────────┴───┴─────────┴─────────┴──────────────┴─────────────┘

text/category profile:
This summarizes unique text values and possible spelling variations.
shape: (2, 5)
┌───────────────┬──────────────┬───────────────┬──────────────────────┬────────────────────────────┐
│ column        ┆ empty_values ┆ unique_values ┆ sample_unique_values ┆ possible_spelling_variatio │
│ ---           ┆ ---          ┆ ---           ┆ ---                  ┆ ns                         │
│ str           ┆ i64          ┆ i64           ┆ str                  ┆ ---                        │
│               ┆              ┆               ┆                      ┆ str                        │
╞═══════════════╪══════════════╪═══════════════╪══════════════════════╪════════════════════════════╡
│ WaterbodyName ┆ 0            ┆ 160           ┆ ABBEYTOWN_010,       ┆                            │
│               ┆              ┆               ┆ ASKANAGAP STREA…     ┆                            │
│ SampleDate    ┆ 0            ┆ 12            ┆ Apr, Aug, Dec, Feb,  ┆                            │
│               ┆              ┆               ┆ Jan, Jul, …          ┆                            │
└───────────────┴──────────────┴───────────────┴──────────────────────┴────────────────────────────┘

3. Identify data quality problems

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::identify_data_quality_problems;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 3. Identify data quality problems
      identify_data_quality_problems(df)?;

      Ok(())
  }

  cargo run --bin identify_data_quality_problems

============================================================
3. Identify data quality problems
============================================================

data quality summary:
This is the high-level checklist of problems that could make analysis unreliable.
shape: (13, 4)
┌─────────────────────────────────┬────────┬────────┬─────────────────────────────────┐
│ check                           ┆ count  ┆ status ┆ note                            │
│ ---                             ┆ ---    ┆ ---    ┆ ---                             │
│ str                             ┆ i64    ┆ str    ┆ str                             │
╞═════════════════════════════════╪════════╪════════╪═════════════════════════════════╡
│ measurement columns checked     ┆ 11     ┆ info   ┆ wide measurement columns becom… │
│ missing values                  ┆ 0      ┆ ok     ┆ null values across raw columns  │
│ duplicate rows                  ┆ 14478  ┆ review ┆ exact raw-row duplicates        │
│ numeric values stored as text   ┆ 0      ┆ ok     ┆ string columns whose values ar… │
│ invalid date values             ┆ 0      ┆ ok     ┆ date-like values that failed p… │
│ …                               ┆ …      ┆ …      ┆ …                               │
│ pH outside 0-14                 ┆ 0      ┆ ok     ┆ domain rule for pH              │
│ negative concentration-like me… ┆ 2      ┆ review ┆ negative values outside pH and… │
│ outlier values                  ┆ 12652  ┆ review ┆ IQR outliers across 8 columns   │
│ duplicate location/date/parame… ┆ 204237 ┆ review ┆ same location, date, and param… │
│ large gaps in time series       ┆ 5395   ┆ review ┆ location time periods with mis… │
└─────────────────────────────────┴────────┴────────┴─────────────────────────────────┘

data quality details:
This gives the columns and counts behind the summary checks.
shape: (9, 4)
┌──────────────────────────────┬─────────────────────────────┬───────┬─────────────────────────────┐
│ problem                      ┆ column                      ┆ count ┆ note                        │
│ ---                          ┆ ---                         ┆ ---   ┆ ---                         │
│ str                          ┆ str                         ┆ i64   ┆ str                         │
╞══════════════════════════════╪═════════════════════════════╪═══════╪═════════════════════════════╡
│ outliers                     ┆ Chloride                    ┆ 1861  ┆ outside [5.999999999999998, │
│                              ┆                             ┆       ┆ 31…                         │
│ outliers                     ┆ Conductivity @25°C          ┆ 20    ┆ outside [-179, 909] by IQR  │
│                              ┆                             ┆       ┆ rul…                        │
│ outliers                     ┆ Dissolved Oxygen            ┆ 2314  ┆ outside                     │
│                              ┆                             ┆       ┆ [10.999999999999993, 1…     │
│ outliers                     ┆ ortho-Phosphate (as P) -    ┆ 6229  ┆ outside                     │
│                              ┆ unspe…                      ┆       ┆ [0.009500000000000005,…     │
│ negative concentration-like  ┆ ortho-Phosphate (as P) -    ┆ 2     ┆ negative value outside pH   │
│ me…                          ┆ unspe…                      ┆       ┆ and …                       │
│ outliers                     ┆ pH                          ┆ 406   ┆ outside [6, 9.2] by IQR     │
│                              ┆                             ┆       ┆ rule                        │
│ outliers                     ┆ Temperature                 ┆ 134   ┆ outside                     │
│                              ┆                             ┆       ┆ [0.8499999999999988, 2…     │
│ outliers                     ┆ Total Hardness (as CaCO3)   ┆ 3     ┆ outside [-178, 486] by IQR  │
│                              ┆                             ┆       ┆ rul…                        │
│ outliers                     ┆ True Colour                 ┆ 1685  ┆ outside [-48.5, 147.5] by   │
│                              ┆                             ┆       ┆ IQR …                       │
└──────────────────────────────┴─────────────────────────────┴───────┴─────────────────────────────┘

principle:
This is the rule that guides the cleaning decision in the next step.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Bad input should not quietly b… │
└─────────────────────────────────┘
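
The details table above reports bounds such as `[6, 9.2]` "by IQR rule". The sketch below shows the usual form of that rule (Tukey fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) in plain Rust. The function names, the 1.5 multiplier, and the linear-interpolation quantile are assumptions made for illustration; the `data_pipeline` crate may compute its quartiles differently, and the readings in `main` are made up.

```rust
/// Linear-interpolation quantile over an already sorted slice (q in 0.0..=1.0).
/// This is one common quantile definition; other libraries may use another.
fn quantile(sorted: &[f64], q: f64) -> Option<f64> {
    if sorted.is_empty() {
        return None;
    }
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}

/// Tukey fences: values outside [q1 - k * iqr, q3 + k * iqr] count as outliers.
fn iqr_fences(values: &[f64], k: f64) -> Option<(f64, f64)> {
    // Ignore NaN and infinite values so that sorting is well defined.
    let mut sorted: Vec<f64> = values.iter().copied().filter(|v| v.is_finite()).collect();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile(&sorted, 0.25)?;
    let q3 = quantile(&sorted, 0.75)?;
    let iqr = q3 - q1;
    Some((q1 - k * iqr, q3 + k * iqr))
}

/// Count the values that fall outside the fences.
fn count_outliers(values: &[f64], k: f64) -> usize {
    match iqr_fences(values, k) {
        Some((lo, hi)) => values.iter().filter(|v| **v < lo || **v > hi).count(),
        None => 0,
    }
}

fn main() {
    // Made-up pH-like readings, not taken from the dataset.
    let readings = [7.2, 7.4, 7.5, 7.6, 7.8, 8.0, 12.9];
    let (lo, hi) = iqr_fences(&readings, 1.5).expect("non-empty input");
    println!("fences: [{lo:.3}, {hi:.3}]");
    println!("outliers: {}", count_outliers(&readings, 1.5));
}
```

Because the fences come from the data itself, they can fall below zero (as in `[-179, 909]` for conductivity). This is one reason the summary marks outliers as `review` rather than rejecting them outright.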

4. Clean and normalize the data

  use data_pipeline::quality_flow::clean_and_normalize_the_data;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 4. Clean and normalize the data
      clean_and_normalize_the_data(df)?;

      Ok(())
  }

  cargo run --bin clean_and_normalize_the_data

============================================================
4. Clean and normalize the data
============================================================

cleaning summary:
This shows how many normalized rows were kept, rejected, or deduplicated.
shape: (3, 2)
┌──────────────────────────┬────────┐
│ metric                   ┆ value  │
│ ---                      ┆ ---    │
│ str                      ┆ i64    │
╞══════════════════════════╪════════╡
│ cleaned_rows             ┆ 148372 │
│ invalid_rows             ┆ 2      │
│ exact_duplicates_removed ┆ 172375 │
└──────────────────────────┴────────┘

cleaned sample:
This is the normalized long-form data that is easier to query and visualize.
shape: (5, 10)
┌────────────┬─────────────┬─────────────┬──────┬───┬────────────┬────────────┬───────┬────────────┐
│ source_row ┆ location    ┆ sample_date ┆ year ┆ … ┆ parameter_ ┆ unit       ┆ value ┆ source_col │
│ ---        ┆ ---         ┆ ---         ┆ ---  ┆   ┆ code       ┆ ---        ┆ ---   ┆ umn        │
│ i64        ┆ str         ┆ str         ┆ i32  ┆   ┆ ---        ┆ str        ┆ f64   ┆ ---        │
│            ┆             ┆             ┆      ┆   ┆ str        ┆            ┆       ┆ str        │
╞════════════╪═════════════╪═════════════╪══════╪═══╪════════════╪════════════╪═══════╪════════════╡
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ ALKALINITY ┆ as CaCO3   ┆ 314.0 ┆ Alkalinity │
│            ┆ 10          ┆             ┆      ┆   ┆ -TOTAL     ┆            ┆       ┆ -total (as │
│            ┆             ┆             ┆      ┆   ┆            ┆            ┆       ┆ CaCO3)     │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ AMMONIA-TO ┆ as N       ┆ 0.033 ┆ Ammonia-To │
│            ┆ 10          ┆             ┆      ┆   ┆ TAL        ┆            ┆       ┆ tal (as N) │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ BOD_-_5_DA ┆ Total      ┆ 1.2   ┆ BOD - 5    │
│            ┆ 10          ┆             ┆      ┆   ┆ YS         ┆            ┆       ┆ days       │
│            ┆             ┆             ┆      ┆   ┆            ┆            ┆       ┆ (Total)    │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ CHLORIDE   ┆ not_encode ┆ 27.3  ┆ Chloride   │
│            ┆ 10          ┆             ┆      ┆   ┆            ┆ d          ┆       ┆            │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ CONDUCTIVI ┆ @25°C      ┆ 711.0 ┆ Conductivi │
│            ┆ 10          ┆             ┆      ┆   ┆ TY         ┆            ┆       ┆ ty @25°C   │
└────────────┴─────────────┴─────────────┴──────┴───┴────────────┴────────────┴───────┴────────────┘

invalid rows sample:
These rows were separated so bad input does not become trusted data.
shape: (2, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ invalid_reason         │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
│ 15723      ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘
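
In the cleaned sample above, each wide column header (for example `Alkalinity-total (as CaCO3)`) appears split into a `parameter_code` and a `unit`, with `not_encoded` where the header carries no unit. The sketch below is one way such a split could be written. The function name, the struct, and the parsing rules are assumptions inferred from the visible sample rows, not the actual `data_pipeline` implementation.

```rust
/// One parsed column header: display name, machine-friendly code, unit hint.
#[derive(Debug, PartialEq)]
struct ParameterInfo {
    parameter: String,
    parameter_code: String,
    unit: String,
}

/// Split a raw column header into parameter name, code, and unit.
/// Assumed rules: text in parentheses is the unit; otherwise text from " @"
/// onward is the unit; otherwise the unit is "not_encoded".
fn parse_source_column(header: &str) -> ParameterInfo {
    let header = header.trim();
    let (name, unit) = match (header.find('('), header.rfind(')')) {
        (Some(open), Some(close)) if open < close => {
            (&header[..open], &header[open + 1..close])
        }
        _ => match header.find(" @") {
            Some(at) => (&header[..at], &header[at + 1..]),
            None => (header, "not_encoded"),
        },
    };
    let parameter = name.trim().to_string();
    // Upper-case the name and replace spaces so the code is easy to filter on.
    let parameter_code = parameter.to_uppercase().replace(' ', "_");
    ParameterInfo {
        parameter,
        parameter_code,
        unit: unit.trim().to_string(),
    }
}

fn main() {
    for header in [
        "Alkalinity-total (as CaCO3)",
        "BOD - 5 days (Total)",
        "Chloride",
        "Conductivity @25°C",
    ] {
        println!("{:?}", parse_source_column(header));
    }
}
```

Keeping the original header in a `source_column` field, as the cleaned sample does, means the parsing rule can be revised later without losing track of where each value came from.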

5. Handle missing values carefully

  use data_pipeline::quality_flow::handle_missing_values_carefully;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 5. Handle missing values carefully
      handle_missing_values_carefully(df)?;

      Ok(())
  }

  cargo run --bin handle_missing_values_carefully

============================================================
5. Handle missing values carefully
============================================================

missing-value decision summary:
This separates invalid data, missing data, gap candidates, and flagged observed values.
shape: (7, 3)
┌────────────────────────────────┬───────┬─────────────────────────────────┐
│ case                           ┆ count ┆ decision                        │
│ ---                            ┆ ---   ┆ ---                             │
│ str                            ┆ i64   ┆ str                             │
╞════════════════════════════════╪═══════╪═════════════════════════════════╡
│ invalid or impossible rows     ┆ 2     ┆ quarantine                      │
│ missing critical fields        ┆ 0     ┆ reject row                      │
│ missing measurement values     ┆ 0     ┆ keep NULL unless safe to estim… │
│ small time-series gaps         ┆ 45165 ┆ candidate for interpolation af… │
│ large time-series gaps         ┆ 14179 ┆ keep missing                    │
│ suspicious but possible values ┆ 10677 ┆ keep observed value with quali… │
│ values filled automatically    ┆ 0     ┆ none; filling is not automatic  │
└────────────────────────────────┴───────┴─────────────────────────────────┘

quarantined row sample:
These rows are not filled because a critical field or measurement value is missing or invalid.
shape: (2, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ decision               │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
│ 15723      ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘

time-series gap examples:
These are observed gaps; small gaps may be interpolated only after review.
shape: (20, 6)
┌──────────┬──────────────────┬────────────┬────────────┬────────────────┬──────────────────────┐
│ location ┆ parameter_code   ┆ from_date  ┆ to_date    ┆ missing_months ┆ decision             │
│ ---      ┆ ---              ┆ ---        ┆ ---        ┆ ---            ┆ ---                  │
│ str      ┆ str              ┆ str        ┆ str        ┆ i64            ┆ str                  │
╞══════════╪══════════════════╪════════════╪════════════╪════════════════╪══════════════════════╡
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2007-09-01 ┆ 2008-01-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2008-12-01 ┆ 2009-04-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-04-01 ┆ 2009-06-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-06-01 ┆ 2009-08-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-08-01 ┆ 2009-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ …        ┆ …                ┆ …          ┆ …          ┆ …              ┆ …                    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-06-01 ┆ 2009-08-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-08-01 ┆ 2009-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-10-01 ┆ 2010-03-01 ┆ 4              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2010-03-01 ┆ 2010-07-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2010-08-01 ┆ 2010-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
└──────────┴──────────────────┴────────────┴────────────┴────────────────┴──────────────────────┘

quality-flagged sample:
These observed values are kept, but marked because they need caution.
shape: (10, 6)
┌──────────┬─────────────┬─────────────────┬───────┬───────────────────────┬───────────────────────┐
│ location ┆ sample_date ┆ parameter_code  ┆ value ┆ quality_flag          ┆ missing_decision      │
│ ---      ┆ ---         ┆ ---             ┆ ---   ┆ ---                   ┆ ---                   │
│ str      ┆ str         ┆ str             ┆ f64   ┆ str                   ┆ str                   │
╞══════════╪═════════════╪═════════════════╪═══════╪═══════════════════════╪═══════════════════════╡
│ ALLUA    ┆ 2007-09-01  ┆ AMMONIA-TOTAL   ┆ 0.066 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.069 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ ORTHO-PHOSPHATE ┆ 0.005 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.068 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.067 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.133 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.111 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.113 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-03-01  ┆ AMMONIA-TOTAL   ┆ 0.04  ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-03-01  ┆ ORTHO-PHOSPHATE ┆ 0.005 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
└──────────┴─────────────┴─────────────────┴───────┴───────────────────────┴───────────────────────┘

principle:
This is the rule for deciding whether a missing value should be filled.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Filling data is a decision, no… │
└─────────────────────────────────┘

handled data summary:
This confirms the row counts after applying the missing-value decisions.
shape: (3, 2)
┌────────────────────┬────────┐
│ metric             ┆ value  │
│ ---                ┆ ---    │
│ str                ┆ i64    │
╞════════════════════╪════════╡
│ handled_rows       ┆ 148372 │
│ quarantined_rows   ┆ 2      │
│ duplicates_removed ┆ 172375 │
└────────────────────┴────────┘
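
The gap examples above pair a `from_date` and a `to_date` with a `missing_months` count and a decision. The sketch below reproduces that arithmetic: the months strictly between two observed months are counted, then compared against a cut-off. The type names and the `max_small_gap` parameter are assumptions. The output only shows a gap of 1 treated as small and gaps of 3 or more treated as large, so where a gap of 2 falls is not visible, and the cut-off is left configurable.

```rust
/// Calendar month as (year, month 1..=12).
type YearMonth = (i32, u32);

#[derive(Debug, PartialEq, Clone, Copy)]
enum GapDecision {
    NoGap,
    CandidateForInterpolation,
    KeepMissing,
}

/// Map a calendar month to a running month number so months can be subtracted.
fn month_index((year, month): YearMonth) -> i64 {
    i64::from(year) * 12 + i64::from(month) - 1
}

/// Months with no observation strictly between two observed months.
fn missing_months(from: YearMonth, to: YearMonth) -> i64 {
    (month_index(to) - month_index(from) - 1).max(0)
}

/// Classify a gap; `max_small_gap` is an assumed, configurable cut-off.
fn classify_gap(from: YearMonth, to: YearMonth, max_small_gap: i64) -> GapDecision {
    match missing_months(from, to) {
        0 => GapDecision::NoGap,
        n if n <= max_small_gap => GapDecision::CandidateForInterpolation,
        _ => GapDecision::KeepMissing,
    }
}

fn main() {
    // Date pairs taken from the gap examples table.
    let pairs: [(YearMonth, YearMonth); 3] = [
        ((2007, 9), (2008, 1)),
        ((2009, 4), (2009, 6)),
        ((2009, 10), (2010, 3)),
    ];
    for (from, to) in pairs {
        println!(
            "{:?} -> {:?}: {} missing, {:?}",
            from,
            to,
            missing_months(from, to),
            classify_gap(from, to, 1)
        );
    }
}
```

The function only labels a gap. It never produces a value, which matches the summary row stating that no values were filled automatically.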

6. Validate before storing

  use data_pipeline::quality_flow::validate_before_storing;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 6. Validate before storing
      validate_before_storing(df)?;

      Ok(())
  }

  cargo run --bin validate_before_storing

============================================================
6. Validate before storing
============================================================

validation rule summary:
This shows the storage rules, how many records were checked, and what failed.
shape: (7, 4)
┌─────────────────────────────────┬─────────┬────────┬─────────────────────────────────┐
│ rule                            ┆ checked ┆ failed ┆ action                          │
│ ---                             ┆ ---     ┆ ---    ┆ ---                             │
│ str                             ┆ i64     ┆ i64    ┆ str                             │
╞═════════════════════════════════╪═════════╪════════╪═════════════════════════════════╡
│ every measurement has a locati… ┆ 320749  ┆ 0      ┆ reject missing locations        │
│ every measurement has a parame… ┆ 320749  ┆ 0      ┆ reject missing parameters       │
│ every measurement has a date    ┆ 320749  ┆ 0      ┆ reject missing or invalid date… │
│ value is a valid number or exp… ┆ 320749  ┆ 0      ┆ store numeric values; store mi… │
│ known parameter respects range  ┆ 320749  ┆ 20     ┆ reject impossible values for k… │
│ exact duplicate records handle… ┆ 320729  ┆ 172369 ┆ remove exact duplicates         │
│ repeated measurements handled … ┆ 148360  ┆ 31854  ┆ keep with source_row so repeat… │
└─────────────────────────────────┴─────────┴────────┴─────────────────────────────────┘

records rejected before storage:
These rows failed validation and should not be inserted into trusted tables.
shape: (10, 6)
┌────────────┬──────────────┬──────────┬───────────────────────┬───────────┬───────────────────────┐
│ source_row ┆ location     ┆ raw_date ┆ source_column         ┆ raw_value ┆ rule_failed           │
│ ---        ┆ ---          ┆ ---      ┆ ---                   ┆ ---       ┆ ---                   │
│ i64        ┆ str          ┆ str      ┆ str                   ┆ str       ┆ str                   │
╞════════════╪══════════════╪══════════╪═══════════════════════╪═══════════╪═══════════════════════╡
│ 111        ┆ ASKANAGAP    ┆ Jan      ┆ ortho-Phosphate (as   ┆ -0.004    ┆ known parameter range │
│            ┆ STREAM_010   ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 2003       ┆ CAMCOR_020   ┆ Feb      ┆ Temperature           ┆ 58.0      ┆ known parameter range │
│            ┆              ┆          ┆                       ┆           ┆ failed (…             │
│ 4813       ┆ DARGLE_030   ┆ Jan      ┆ ortho-Phosphate (as   ┆ 42.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4815       ┆ DARGLE_030   ┆ Feb      ┆ ortho-Phosphate (as   ┆ 22.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4857       ┆ DARGLE_030   ┆ Jul      ┆ ortho-Phosphate (as   ┆ 70.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4873       ┆ DARGLE_030   ┆ May      ┆ ortho-Phosphate (as   ┆ 29.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4893       ┆ DARGLE_030   ┆ Mar      ┆ ortho-Phosphate (as   ┆ 26.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4903       ┆ DARGLE_030   ┆ Sep      ┆ ortho-Phosphate (as   ┆ 25.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 6096       ┆ GLENCREE_010 ┆ Feb      ┆ ortho-Phosphate (as   ┆ 27.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 6117       ┆ GLENCREE_010 ┆ Jul      ┆ ortho-Phosphate (as   ┆ 27.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
└────────────┴──────────────┴──────────┴───────────────────────┴───────────┴───────────────────────┘

duplicate handling sample:
These exact duplicates are handled deliberately before storage.
shape: (10, 6)
┌────────────┬──────────┬─────────────┬──────────────────┬───────┬─────────────────────────────────┐
│ source_row ┆ location ┆ sample_date ┆ parameter_code   ┆ value ┆ action                          │
│ ---        ┆ ---      ┆ ---         ┆ ---              ┆ ---   ┆ ---                             │
│ i64        ┆ str      ┆ str         ┆ str              ┆ f64   ┆ str                             │
╞════════════╪══════════╪═════════════╪══════════════════╪═══════╪═════════════════════════════════╡
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ AMMONIA-TOTAL    ┆ 0.033 ┆ skip exact duplicate before st… │
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ AMMONIA-TOTAL    ┆ 0.033 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ TEMPERATURE      ┆ 17.8  ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ ALKALINITY-TOTAL ┆ 19.0  ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
└────────────┴──────────┴─────────────┴──────────────────┴───────┴─────────────────────────────────┘

trusted records sample:
These records passed validation and are shaped for database insertion.
shape: (5, 8)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬───────┬────────────┐
│ source_row ┆ location   ┆ sample_dat ┆ parameter  ┆ parameter_ ┆ unit       ┆ value ┆ source_col │
│ ---        ┆ ---        ┆ e          ┆ ---        ┆ code       ┆ ---        ┆ ---   ┆ umn        │
│ i64        ┆ str        ┆ ---        ┆ str        ┆ ---        ┆ str        ┆ f64   ┆ ---        │
│            ┆            ┆ str        ┆            ┆ str        ┆            ┆       ┆ str        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪════════════╡
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Alkalinity ┆ ALKALINITY ┆ as CaCO3   ┆ 314.0 ┆ Alkalinity │
│            ┆ 010        ┆            ┆ -total     ┆ -TOTAL     ┆            ┆       ┆ -total (as │
│            ┆            ┆            ┆            ┆            ┆            ┆       ┆ CaCO3)     │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Ammonia-To ┆ AMMONIA-TO ┆ as N       ┆ 0.033 ┆ Ammonia-To │
│            ┆ 010        ┆            ┆ tal        ┆ TAL        ┆            ┆       ┆ tal (as N) │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ BOD - 5    ┆ BOD_-_5_DA ┆ Total      ┆ 1.2   ┆ BOD - 5    │
│            ┆ 010        ┆            ┆ days       ┆ YS         ┆            ┆       ┆ days       │
│            ┆            ┆            ┆            ┆            ┆            ┆       ┆ (Total)    │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Chloride   ┆ CHLORIDE   ┆ not_encode ┆ 27.3  ┆ Chloride   │
│            ┆ 010        ┆            ┆            ┆            ┆ d          ┆       ┆            │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Conductivi ┆ CONDUCTIVI ┆ @25°C      ┆ 711.0 ┆ Conductivi │
│            ┆ 010        ┆            ┆ ty         ┆ TY         ┆            ┆       ┆ ty @25°C   │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴───────┴────────────┘

storage readiness summary:
This is the final count of clean records, rejected records, NULLs, and handled duplicates.
shape: (6, 2)
┌─────────────────────────────────┬────────┐
│ metric                          ┆ value  │
│ ---                             ┆ ---    │
│ str                             ┆ i64    │
╞═════════════════════════════════╪════════╡
│ raw_measurement_rows            ┆ 320749 │
│ trusted_records_ready_to_store  ┆ 148360 │
│ records_rejected                ┆ 20     │
│ explicit_null_values            ┆ 0      │
│ exact_duplicates_removed        ┆ 172369 │
│ repeated_measurements_kept_wit… ┆ 31854  │
└─────────────────────────────────┴────────┘

principle:
This is the rule for deciding what is safe to store.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ The database should store clea… │
└─────────────────────────────────┘
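
In the validation rule summary, the `checked` column shrinks from 320749 to 320729 to 148360. This suggests that each rule only sees the records that survived the previous one (320749 - 20 = 320729 and 320729 - 172369 = 148360). The sketch below shows that staged shape with a range check, exact-duplicate removal, and a count of repeated measurements. The struct, the function names, the temperature bounds, and the definition of "repeated" are all assumptions for illustration; only the pH range of 0 to 14 comes from the output above.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug)]
struct Record {
    source_row: i64,
    location: String,
    sample_date: String,
    parameter_code: String,
    value: f64,
}

/// Plausibility bounds per parameter. The pH bounds follow the "pH outside
/// 0-14" check; the temperature bounds are hypothetical.
fn allowed_range(parameter_code: &str) -> Option<(f64, f64)> {
    match parameter_code {
        "PH" => Some((0.0, 14.0)),
        "TEMPERATURE" => Some((-5.0, 40.0)),
        _ => None,
    }
}

#[derive(Debug, Default)]
struct Outcome {
    trusted: Vec<Record>,
    rejected: Vec<(Record, String)>,
    exact_duplicates: usize,
    repeated_measurements: usize,
}

fn validate(records: Vec<Record>) -> Outcome {
    let mut out = Outcome::default();
    let mut seen = HashSet::new();
    let mut per_key: HashMap<(String, String, String), usize> = HashMap::new();

    for rec in records {
        // Stage 1: reject values outside a known range (NaN is rejected too).
        if let Some((lo, hi)) = allowed_range(&rec.parameter_code) {
            if !(lo..=hi).contains(&rec.value) {
                out.rejected.push((rec, "outside allowed range".to_string()));
                continue;
            }
        }
        // Stage 2: skip records equal in location, date, parameter, and value
        // (values are compared bit for bit).
        let key = (
            rec.location.clone(),
            rec.sample_date.clone(),
            rec.parameter_code.clone(),
        );
        if !seen.insert((key.clone(), rec.value.to_bits())) {
            out.exact_duplicates += 1;
            continue;
        }
        // Stage 3: keep the record and remember how often its key occurs.
        *per_key.entry(key).or_insert(0) += 1;
        out.trusted.push(rec);
    }
    // Assumed definition: trusted records whose key occurs more than once.
    out.repeated_measurements = per_key.values().filter(|n| **n > 1).sum();
    out
}

fn rec(source_row: i64, location: &str, date: &str, code: &str, value: f64) -> Record {
    Record {
        source_row,
        location: location.to_string(),
        sample_date: date.to_string(),
        parameter_code: code.to_string(),
        value,
    }
}

fn main() {
    // Made-up records, not taken from the dataset.
    let out = validate(vec![
        rec(1, "SITE_A", "2020-01-01", "PH", 7.1),
        rec(2, "SITE_A", "2020-01-01", "PH", 7.1),
        rec(3, "SITE_A", "2020-01-01", "PH", 7.4),
        rec(4, "SITE_A", "2020-01-01", "TEMPERATURE", 58.0),
    ]);
    for r in &out.trusted {
        println!("trusted source_row {}", r.source_row);
    }
    println!(
        "rejected: {}, exact duplicates: {}, repeated: {}",
        out.rejected.len(),
        out.exact_duplicates,
        out.repeated_measurements
    );
}
```

Running the range check first means an impossible value is rejected with a reason rather than silently disappearing as a duplicate of another impossible value.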

7. Store clean data with structure

  use data_pipeline::quality_flow::store_clean_data_with_structure;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 7. Store clean data with structure
      store_clean_data_with_structure(df)?;

      Ok(())
  }

  cargo run --bin store_clean_data_with_structure

============================================================
7. Store clean data with structure
============================================================

database file:
This is the SQLite file written for the API project.
shape: (1, 2)
┌─────────────┬─────────────────────────────────┐
│ item        ┆ value                           │
│ ---         ┆ ---                             │
│ str         ┆ str                             │
╞═════════════╪═════════════════════════════════╡
│ sqlite_file ┆ /Users/chiefkemist/Documents/n… │
└─────────────┴─────────────────────────────────┘

structured schema:
The cleaned data is stored across small tables instead of one giant messy table.
shape: (5, 2)
┌────────────────┬─────────────────────────────────┐
│ table          ┆ purpose                         │
│ ---            ┆ ---                             │
│ str            ┆ str                             │
╞════════════════╪═════════════════════════════════╡
│ locations      ┆ one row per normalized locatio… │
│ parameters     ┆ one row per normalized paramet… │
│ measurements   ┆ trusted observed measurements   │
│ ingestion_runs ┆ source file, import time, coun… │
│ rejected_rows  ┆ rows that failed validation or… │
└────────────────┴─────────────────────────────────┘
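To make the idea of small related tables concrete, here is a minimal in-memory sketch of the normalization step. Each location and parameter name is stored once and measurements refer to them by id. The `Store` and `Measurement` types, their fields, and the id scheme are assumptions for illustration only. The real pipeline writes to SQLite, and its actual column names are not shown in the output above.

```rust
use std::collections::BTreeMap;

/// One accepted measurement, pointing at its location and parameter by id.
/// Field names are hypothetical; they mirror the sample output, not the real schema.
#[derive(Debug, PartialEq)]
struct Measurement {
    source_row: u64,
    location_id: usize,
    parameter_id: usize,
    sample_date: String,
    value: f64,
}

/// In-memory stand-in for the locations, parameters, and measurements tables.
#[derive(Default)]
struct Store {
    locations: BTreeMap<String, usize>,
    parameters: BTreeMap<String, usize>,
    measurements: Vec<Measurement>,
}

/// Return the existing id for `name`, or assign the next free id (starting at 1).
fn intern(table: &mut BTreeMap<String, usize>, name: &str) -> usize {
    let next_id = table.len() + 1;
    *table.entry(name.to_string()).or_insert(next_id)
}

impl Store {
    /// Insert one measurement, creating lookup rows only when they are new.
    fn insert(&mut self, source_row: u64, location: &str, date: &str, parameter: &str, value: f64) {
        let location_id = intern(&mut self.locations, location);
        let parameter_id = intern(&mut self.parameters, parameter);
        self.measurements.push(Measurement {
            source_row,
            location_id,
            parameter_id,
            sample_date: date.to_string(),
            value,
        });
    }
}

/// Three rows taken from the stored measurement sample.
fn sample_store() -> Store {
    let mut store = Store::default();
    store.insert(1, "ABBEYTOWN_010", "2023-02-01", "ALKALINITY-TOTAL", 314.0);
    store.insert(1, "ABBEYTOWN_010", "2023-02-01", "AMMONIA-TOTAL", 0.033);
    store.insert(1, "ABBEYTOWN_010", "2023-02-01", "CONDUCTIVITY", 711.0);
    store
}

fn main() {
    let store = sample_store();
    println!("locations: {}, parameters: {}", store.locations.len(), store.parameters.len());
    for m in &store.measurements {
        println!(
            "row {} on {}: location #{} parameter #{} = {}",
            m.source_row, m.sample_date, m.location_id, m.parameter_id, m.value
        );
    }
}
```

The three measurements share a single location entry, which is why a normalized layout needs far fewer rows in `locations` than in `measurements`.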

ingestion run summary:
This records where the data came from and what happened during import.
shape: (6, 2)
┌──────────────────────────┬────────┐
│ metric                   ┆ value  │
│ ---                      ┆ ---    │
│ str                      ┆ i64    │
╞══════════════════════════╪════════╡
│ ingestion_run_id         ┆ 1      │
│ raw_rows                 ┆ 29159  │
│ raw_measurement_rows     ┆ 320749 │
│ accepted_measurements    ┆ 148360 │
│ rejected_rows            ┆ 20     │
│ exact_duplicates_removed ┆ 172369 │
└──────────────────────────┴────────┘

database table counts:
These counts are read back from SQLite after the write finishes.
shape: (5, 2)
┌────────────────┬────────┐
│ table          ┆ rows   │
│ ---            ┆ ---    │
│ str            ┆ i64    │
╞════════════════╪════════╡
│ ingestion_runs ┆ 1      │
│ locations      ┆ 160    │
│ parameters     ┆ 11     │
│ measurements   ┆ 148360 │
│ rejected_rows  ┆ 20     │
└────────────────┴────────┘
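The relationship between `raw_rows` and `raw_measurement_rows` is worth spelling out: 29159 wide rows × 11 parameters = 320749 long rows. That is consistent with each wide CSV row being unpivoted into one row per parameter column, although this reading is our inference from the counts rather than something the output states. The sketch below shows such an unpivot on a single row. The `WideRow` type is hypothetical, and the parameter codes stand in for the real CSV column names.

```rust
/// One wide row: a location plus one value per parameter column.
/// Hypothetical shape; the real CSV has more columns than shown here.
struct WideRow<'a> {
    source_row: u64,
    location: &'a str,
    values: Vec<(&'a str, f64)>,
}

/// Turn one wide row into one (source_row, location, parameter, value) row per column.
fn unpivot<'a>(row: &WideRow<'a>) -> Vec<(u64, &'a str, &'a str, f64)> {
    row.values
        .iter()
        .map(|&(parameter, value)| (row.source_row, row.location, parameter, value))
        .collect()
}

/// Number of long rows expected if every wide row yields one row per parameter.
fn expected_long_rows(wide_rows: u64, parameter_columns: u64) -> u64 {
    wide_rows * parameter_columns
}

/// The five values shown in the stored measurement sample for source row 1.
fn sample_wide_row() -> WideRow<'static> {
    WideRow {
        source_row: 1,
        location: "ABBEYTOWN_010",
        values: vec![
            ("ALKALINITY-TOTAL", 314.0),
            ("AMMONIA-TOTAL", 0.033),
            ("BOD_-_5_DAYS", 1.2),
            ("CHLORIDE", 27.3),
            ("CONDUCTIVITY", 711.0),
        ],
    }
}

fn main() {
    for (source_row, location, parameter, value) in unpivot(&sample_wide_row()) {
        println!("{source_row} {location} {parameter} {value}");
    }
    // 29159 wide rows and 11 parameters, as reported in the tables above.
    println!("expected long rows: {}", expected_long_rows(29_159, 11));
}
```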

stored measurement sample:
These accepted rows are stored in the measurements table with foreign keys.
shape: (5, 6)
┌────────────┬───────────────┬─────────────┬──────────────────┬─────────────┬───────┐
│ source_row ┆ location      ┆ sample_date ┆ parameter_code   ┆ unit        ┆ value │
│ ---        ┆ ---           ┆ ---         ┆ ---              ┆ ---         ┆ ---   │
│ i64        ┆ str           ┆ str         ┆ str              ┆ str         ┆ f64   │
╞════════════╪═══════════════╪═════════════╪══════════════════╪═════════════╪═══════╡
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ ALKALINITY-TOTAL ┆ as CaCO3    ┆ 314.0 │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ AMMONIA-TOTAL    ┆ as N        ┆ 0.033 │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ BOD_-_5_DAYS     ┆ Total       ┆ 1.2   │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ CHLORIDE         ┆ not_encoded ┆ 27.3  │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ CONDUCTIVITY     ┆ @25°C       ┆ 711.0 │
└────────────┴───────────────┴─────────────┴──────────────────┴─────────────┴───────┘

rejected row sample:
These failed rows are stored separately for traceability.
shape: (5, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ rejection_reason       │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ known parameter range  │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 2003       ┆ CAMCOR_020 ┆ Feb      ┆ Temperature            ┆ 58.0      ┆ known parameter range  │
│            ┆            ┆          ┆                        ┆           ┆ failed (…              │
│ 4813       ┆ DARGLE_030 ┆ Jan      ┆ ortho-Phosphate (as P) ┆ 42.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 4815       ┆ DARGLE_030 ┆ Feb      ┆ ortho-Phosphate (as P) ┆ 22.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 4857       ┆ DARGLE_030 ┆ Jul      ┆ ortho-Phosphate (as P) ┆ 70.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘
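The rejected rows above all failed a known parameter range check. A minimal sketch of such a check follows. The bounds, the short parameter keys, and the error wording are hypothetical values chosen so that the sample rows would be rejected. The limits the pipeline really uses are not shown in the output.

```rust
/// Plausible range for a parameter, if we have one.
/// These bounds are illustrative assumptions, not the pipeline's real limits.
fn known_range(parameter: &str) -> Option<(f64, f64)> {
    match parameter {
        "Temperature" => Some((0.0, 40.0)),
        "ortho-Phosphate" => Some((0.0, 10.0)),
        _ => None,
    }
}

/// Accept a value when it lies inside the known range, or when no range is known.
/// NaN never lies inside a range, so it is rejected for parameters with a known range.
fn validate(parameter: &str, value: f64) -> Result<f64, String> {
    match known_range(parameter) {
        Some((min, max)) if !(min..=max).contains(&value) => Err(format!(
            "value {value} outside known range {min}..={max} for {parameter}"
        )),
        _ => Ok(value),
    }
}

fn main() {
    // Values taken from the rejected row sample, plus one ordinary reading.
    let candidates = [
        ("ortho-Phosphate", -0.004),
        ("Temperature", 58.0),
        ("ortho-Phosphate", 42.0),
        ("Temperature", 12.5),
    ];
    for (parameter, value) in candidates {
        match validate(parameter, value) {
            Ok(v) => println!("accepted {parameter} = {v}"),
            Err(reason) => println!("rejected: {reason}"),
        }
    }
}
```

Returning the reason as data rather than discarding the row is what allows a `rejected_rows` table to carry a `rejection_reason` column for later review.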

principle:
This is the reason for storing accepted rows, rejected rows, and ingestion metadata.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ The database is part of the da… │
└─────────────────────────────────┘

8. Visualize after cleaning

  use data_pipeline::quality_flow::visualize_after_cleaning;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 8. Visualize after cleaning
      visualize_after_cleaning(df)?;

      Ok(())
  }

  cargo run --bin visualize_after_cleaning

============================================================
8. Visualize after cleaning
============================================================

dashboard handoff:
The dashboard reads the cleaned SQLite database produced by the storage step.
shape: (5, 2)
┌────────────────────┬─────────────────────────────────┐
│ item               ┆ value                           │
│ ---                ┆ ---                             │
│ str                ┆ str                             │
╞════════════════════╪═════════════════════════════════╡
│ raw_rows_available ┆ 29159                           │
│ sqlite_file        ┆ /Users/chiefkemist/Documents/n… │
│ dashboard_page     ┆ http://localhost:3434/data_viz  │
│ summary_json       ┆ http://localhost:3434/api/dash… │
│ timeseries_json    ┆ http://localhost:3434/api/dash… │
└────────────────────┴─────────────────────────────────┘

dashboard views:
These views turn the cleaned records into visual checks for patterns, gaps, and problems.
shape: (9, 2)
┌─────────────────────────────────┬─────────────────────────────────┐
│ view                            ┆ source                          │
│ ---                             ┆ ---                             │
│ str                             ┆ str                             │
╞═════════════════════════════════╪═════════════════════════════════╡
│ pH over time by location        ┆ measurements joined with locat… │
│ temperature over time           ┆ measurements joined with locat… │
│ dissolved oxygen over time      ┆ measurements joined with locat… │
│ ammonia spikes by location      ┆ measurements joined with locat… │
│ missing-data heatmap            ┆ measurement coverage by locati… │
│ outlier count by parameter      ┆ rejected_rows grouped by sourc… │
│ data completeness by location   ┆ measurements grouped by locati… │
│ before/after cleaning summary   ┆ ingestion_runs accepted and re… │
│ water-quality score by locatio… ┆ aggregated pH, dissolved oxyge… │
└─────────────────────────────────┴─────────────────────────────────┘
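As an illustration of how one of these views could be computed, the sketch below counts how many distinct parameters each location has at least one measurement for. This is only one possible definition of "data completeness by location" and is an assumption on our part. The function name and the input pairs are invented for the example, and the real dashboard queries SQLite rather than an in-memory slice.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Number of parameters in the dataset, as reported by the parameters table.
const PARAMETER_COUNT: usize = 11;

/// Count distinct parameters with at least one measurement, per location.
/// Input is a list of (location, parameter_code) pairs.
fn parameters_covered<'a>(measurements: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, usize> {
    let mut seen: BTreeMap<&'a str, BTreeSet<&'a str>> = BTreeMap::new();
    for &(location, parameter) in measurements {
        seen.entry(location).or_default().insert(parameter);
    }
    seen.into_iter()
        .map(|(location, parameters)| (location, parameters.len()))
        .collect()
}

/// Made-up (location, parameter) pairs; repeated pairs must count only once.
fn sample_measurements() -> Vec<(&'static str, &'static str)> {
    vec![
        ("ABBEYTOWN_010", "CONDUCTIVITY"),
        ("ABBEYTOWN_010", "CONDUCTIVITY"),
        ("ABBEYTOWN_010", "CHLORIDE"),
        ("CAMCOR_020", "CHLORIDE"),
    ]
}

fn main() {
    for (location, covered) in parameters_covered(&sample_measurements()) {
        println!("{location}: {covered} of {PARAMETER_COUNT} parameters");
    }
}
```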

principle:
Visualization is the final check that the pipeline produced useful data.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Understand, clean, validate, s… │
└─────────────────────────────────┘

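Footnotes

¹ Water Quality Monitoring Dataset (Ireland)

² Rust: A language empowering everyone to build reliable and efficient software.

³ Polars: DataFrames for the new era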

Ubuntu TechHive
