हम एक छोटा पर्यावरणीय डेटा पाइपलाइन बना रहे हैं। कच्ची जल-गुणवत्ता निगरानी फाइलें CSV के रूप में आती हैं।

रस्ट डेटा पाइपलाइन्स: फाइलों से लेकर साफ़ डेटाबेस और वेब डैशबोर्ड तक

परिचय
डेटा पाइपलाइन
फुटनोट्स

परिचय

हम एक छोटा पर्यावरणीय डेटा पाइपलाइन बना रहे हैं। कच्ची जल-गुणवत्ता निगरानी फाइलें CSV के रूप में आती हैं। हमारा रस्ट टूल उन्हें मान्य करता है, खराब रिकॉर्ड को हटाता है, सुरक्षित अंतराल को भरता है, विश्वसनीय मापों को संग्रहीत करता है, और एक डैशबोर्ड को संचालित करता है।

डेटा पाइपलाइन

उपयोग किए गए डेटासेट के बारे में

डेटासेट¹ में कॉर्क हार्बर, मोय किलाला और आयरलैंड के 15 अन्य तटीय स्थानों से कच्चा जल-गुणवत्ता निगरानी डेटा शामिल है। कच्चे निष्कर्षण डेटासेट में 1.27 मिलियन से अधिक प्रविष्टियाँ हैं, और रिपॉजिटरी में 11 जल-गुणवत्ता मापदंडों में 29,159 पंक्तियों वाला एक रूपांतरित/पिवट संस्करण भी शामिल है। फाइलें CSV हैं, इसलिए वे “फाइलें → साफ़ डेटाबेस → डैशबोर्ड” प्रवाह के लिए उपयोग करने में आसान हैं।

उपकरण और लाइब्रेरी

हम Polars² का लाभ उठाकर अपनी डेटा पाइपलाइन को लागू करने के लिए रस्ट³ का उपयोग करते हैं।

डेटाफ्रेम

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();
        println!("Data:");
        print!("{df}\n");

        let head = df.head(Some(2));
        println!("Head:");
        print!("{head}\n");

      Ok(())
  }

Data:
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
Head:
shape: (2, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└──────────────┴────────────┴────────┴────────┘

कॉलम चुनना

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("Column selection:");
        print!("{result}\n");

      Ok(())
  }

Column selection:
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘

कॉलम जोड़ना

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{LazyFrame, IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .with_columns([
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("With added colums:");
        print!("{result}\n");

      Ok(())
  }

With added colums:
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘

एक्सप्रेशन विस्तार

lit का अर्थ लिटरल है और यह Polars² की लेज़ी विशेषता के लेज़ी एक्सप्रेशन API का हिस्सा है।

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, cols, lit, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                (cols(["weight", "height"]).as_expr() * lit(0.95))
                    .round(2, RoundMode::default())
                    .name()
                    .suffix("-5%"),
            ])
            .collect()?;
        println!("Transform:");
        print!("{result}\n");

      Ok(())
  }

Transform:
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 51.87     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘

पंक्तियों को फ़िल्टर करना

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{IntoLazy, col, lit, ClosedInterval},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .filter(col("birthdate").dt().year().lt(lit(1990)))
            .collect()?;
        println!("With row filtering:");
        print!("{result}\n");

        let result = df
              .clone()
              .lazy()
              .filter(
                  col("birthdate")
                      .is_between(
                          lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                          lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                          ClosedInterval::Both,
                      )
                      .and(col("height").gt(lit(1.7))),
              )
              .collect()?;
        println!("With complex row filtering:");
        print!("{result}\n");

      Ok(())
  }

With row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
With complex row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘

ग्रुपिंग करना

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, lit, len, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([len()])
            .collect()?;
        println!("Grouping by birth decade:");
        print!("{result}\n");

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([
                len().alias("sample_size"),
                col("weight")
                    .mean()
                    .round(2, RoundMode::default())
                    .alias("avg_weight"),
                col("height").max().alias("tallest"),
            ])
            .collect()?;
        println!("Grouping by derived features:");
        println!("{result}");

      Ok(())
  }

Grouping by birth decade:
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 3   │
│ 1980   ┆ 1   │
└────────┴─────┘
Grouping by derived features:
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1990   ┆ 3           ┆ 65.2       ┆ 1.75    │
│ 1980   ┆ 1           ┆ 72.5       ┆ 1.77    │
└────────┴─────────────┴────────────┴─────────┘

डेटा विश्लेषण

जब हमें एक नया डेटासेट प्राप्त होता है, तो लक्ष्य तुरंत चार्ट बनाना या मॉडल चलाना नहीं होता है। पहला लक्ष्य यह समझना है कि क्या डेटा पर भरोसा किया जा सकता है। पूर्ण विश्लेषण github पर है।

1. कच्चे डेटा का निरीक्षण करें:

डेटा डाउनलोड करें, इसे Polars² के साथ लोड करें और फिर हेड प्रिंट करें

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

rust-script failed with exit code 1

[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))

Polars² कुछ कॉलम के प्रकार का सही अनुमान नहीं लगा रहा है। आइए इसे डिफ़ॉल्ट रूप से 100 पंक्तियों से अनुमान लगाने दें।

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(None)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

आइए अब Polars⁴ को 10000 पंक्तियों से कॉलम के उचित प्रकार का अनुमान लगाने दें

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::inspect_raw_data;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 1. Inspect the raw data
      inspect_raw_data(df)?;

      Ok(())
  }

  cargo run --bin inspect_raw_data

============================================================
1. Inspect the raw data
============================================================

raw dataset size:
This confirms how many rows and columns were loaded from the CSV.
shape: (2, 2)
┌─────────┬───────┐
│ metric  ┆ value │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ rows    ┆ 29159 │
│ columns ┆ 14    │
└─────────┴───────┘

inferred schema:
This shows each column name and the type Polars inferred from the file.
shape: (14, 3)
┌─────────────────────────────────┬───────────────┬───────────────┐
│ column                          ┆ inferred_type ┆ storage_kind  │
│ ---                             ┆ ---           ┆ ---           │
│ str                             ┆ str           ┆ str           │
╞═════════════════════════════════╪═══════════════╪═══════════════╡
│ WaterbodyName                   ┆ String        ┆ text or mixed │
│ Years                           ┆ Int64         ┆ number        │
│ SampleDate                      ┆ String        ┆ text or mixed │
│ Alkalinity-total (as CaCO3)     ┆ Float64       ┆ number        │
│ Ammonia-Total (as N)            ┆ Float64       ┆ number        │
│ …                               ┆ …             ┆ …             │
│ ortho-Phosphate (as P) - unspe… ┆ Float64       ┆ number        │
│ pH                              ┆ Float64       ┆ number        │
│ Temperature                     ┆ Float64       ┆ number        │
│ Total Hardness (as CaCO3)       ┆ Float64       ┆ number        │
│ True Colour                     ┆ Float64       ┆ number        │
└─────────────────────────────────┴───────────────┴───────────────┘

raw row sample:
This shows one original wide record before any reshaping.
shape: (1, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬─────┬──────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH  ┆ Temperature  ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ --- ┆ ---          ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64 ┆ f64          ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆     ┆              ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆     ┆              ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪═════╪══════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8 ┆ 10.4         ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆     ┆              ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴─────┴──────────────┴─────────────┴────────┘

first-pass column roles:
This separates location/date columns from measurement columns.
shape: (14, 2)
┌─────────────────────────────────┬─────────────┐
│ column                          ┆ role        │
│ ---                             ┆ ---         │
│ str                             ┆ str         │
╞═════════════════════════════════╪═════════════╡
│ WaterbodyName                   ┆ location    │
│ Years                           ┆ date        │
│ SampleDate                      ┆ date        │
│ Alkalinity-total (as CaCO3)     ┆ measurement │
│ Ammonia-Total (as N)            ┆ measurement │
│ …                               ┆ …           │
│ ortho-Phosphate (as P) - unspe… ┆ measurement │
│ pH                              ┆ measurement │
│ Temperature                     ┆ measurement │
│ Total Hardness (as CaCO3)       ┆ measurement │
│ True Colour                     ┆ measurement │
└─────────────────────────────────┴─────────────┘

long-form sample:
This previews the wide measurements as parameter/value rows.
shape: (5, 7)
┌───────────────┬───────┬────────────┬────────────────┬────────────────┬────────────────┬──────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column  ┆ measurement_va ┆ parameter      ┆ unit     │
│ ---           ┆ ---   ┆ ---        ┆ ---            ┆ lue            ┆ ---            ┆ ---      │
│ str           ┆ i64   ┆ str        ┆ str            ┆ ---            ┆ str            ┆ str      │
│               ┆       ┆            ┆                ┆ f64            ┆                ┆          │
╞═══════════════╪═══════╪════════════╪════════════════╪════════════════╪════════════════╪══════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ Alkalinity-tot ┆ 314.0          ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 14.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 17.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 18.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-tot ┆ 19.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
└───────────────┴───────┴────────────┴────────────────┴────────────────┴────────────────┴──────────┘

2. डेटा को प्रोफाइल करें

यह हमें निर्णय लेने से पहले डेटासेट की पहली तस्वीर देता है।

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::profile_the_data;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 2. Profile the data
      profile_the_data(df)?;

      Ok(())
  }

  cargo run --bin profile_the_data

============================================================
2. Profile the data
============================================================

profile scope:
This repeats the dataset size before summarizing each important column.
shape: (2, 2)
┌─────────┬───────┐
│ metric  ┆ value │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ rows    ┆ 29159 │
│ columns ┆ 14    │
└─────────┴───────┘

date coverage:
This combines Years and SampleDate into a usable month-level date range.
shape: (5, 2)
┌───────────────────────────┬────────────┐
│ metric                    ┆ value      │
│ ---                       ┆ ---        │
│ str                       ┆ str        │
╞═══════════════════════════╪════════════╡
│ earliest_date             ┆ 2007-01-01 │
│ latest_date               ┆ 2023-04-01 │
│ invalid_dates             ┆ 0          │
│ missing_dates             ┆ 0          │
│ gaps_over_time_gt_31_days ┆ 0          │
└───────────────────────────┴────────────┘

column profile:
This gives missing counts, distinct counts, numeric ranges, averages, and notes.
shape: (14, 9)
┌──────────────┬──────────┬─────────┬─────────┬───┬─────────┬─────────┬──────────────┬─────────────┐
│ column       ┆ role     ┆ type    ┆ missing ┆ … ┆ minimum ┆ maximum ┆ average      ┆ notes       │
│ ---          ┆ ---      ┆ ---     ┆ ---     ┆   ┆ ---     ┆ ---     ┆ ---          ┆ ---         │
│ str          ┆ str      ┆ str     ┆ i64     ┆   ┆ str     ┆ str     ┆ str          ┆ str         │
╞══════════════╪══════════╪═════════╪═════════╪═══╪═════════╪═════════╪══════════════╪═════════════╡
│ WaterbodyNam ┆ location ┆ String  ┆ 0       ┆ … ┆         ┆         ┆              ┆ unique      │
│ e            ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ locations:  │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ 160         │
│ Years        ┆ date     ┆ Int64   ┆ 0       ┆ … ┆ 2007    ┆ 2023    ┆ Float64(2014 ┆ included in │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ .78253712404 ┆ combined    │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 4)           ┆ date cove…  │
│ SampleDate   ┆ date     ┆ String  ┆ 0       ┆ … ┆         ┆         ┆              ┆ review full │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ category    │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ list; inc…  │
│ Alkalinity-t ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 442     ┆ Float64(139. ┆             │
│ otal (as     ┆          ┆         ┆         ┆   ┆         ┆         ┆ 858347851435 ┆             │
│ CaCO3)       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 2)           ┆             │
│ Ammonia-Tota ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 40      ┆ Float64(0.06 ┆             │
│ l (as N)     ┆          ┆         ┆         ┆   ┆         ┆         ┆ 357266127096 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 262)         ┆             │
│ …            ┆ …        ┆ …       ┆ …       ┆ … ┆ …       ┆ …       ┆ …            ┆ …           │
│ ortho-Phosph ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ -0.004  ┆ 70      ┆ Float64(0.06 ┆ negative    │
│ ate (as P) - ┆          ┆         ┆         ┆   ┆         ┆         ┆ 878934462773 ┆ value found │
│ unspe…       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 074)         ┆ (-0.004)    │
│ pH           ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 4.7     ┆ 9.8     ┆ Float64(7.55 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 205686066051 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 8)           ┆             │
│ Temperature  ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0.6     ┆ 637     ┆ Float64(10.8 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 505031036729 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 74)          ┆             │
│ Total        ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 642     ┆ Float64(159. ┆             │
│ Hardness (as ┆          ┆         ┆         ┆   ┆         ┆         ┆ 092110326142 ┆             │
│ CaCO3)       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 9)           ┆             │
│ True Colour  ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 953     ┆ Float64(58.1 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 374635618505 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 45)          ┆             │
└──────────────┴──────────┴─────────┴─────────┴───┴─────────┴─────────┴──────────────┴─────────────┘

text/category profile:
This summarizes unique text values and possible spelling variations.
shape: (2, 5)
┌───────────────┬──────────────┬───────────────┬──────────────────────┬────────────────────────────┐
│ column        ┆ empty_values ┆ unique_values ┆ sample_unique_values ┆ possible_spelling_variatio │
│ ---           ┆ ---          ┆ ---           ┆ ---                  ┆ ns                         │
│ str           ┆ i64          ┆ i64           ┆ str                  ┆ ---                        │
│               ┆              ┆               ┆                      ┆ str                        │
╞═══════════════╪══════════════╪═══════════════╪══════════════════════╪════════════════════════════╡
│ WaterbodyName ┆ 0            ┆ 160           ┆ ABBEYTOWN_010,       ┆                            │
│               ┆              ┆               ┆ ASKANAGAP STREA…     ┆                            │
│ SampleDate    ┆ 0            ┆ 12            ┆ Apr, Aug, Dec, Feb,  ┆                            │
│               ┆              ┆               ┆ Jan, Jul, …          ┆                            │
└───────────────┴──────────────┴───────────────┴──────────────────────┴────────────────────────────┘

3. डेटा गुणवत्ता समस्याओं की पहचान करें

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::identify_data_quality_problems;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 3. Identify data quality problems
      identify_data_quality_problems(df)?;

      Ok(())
  }

  cargo run --bin identify_data_quality_problems

============================================================
3. Identify data quality problems
============================================================

data quality summary:
This is the high-level checklist of problems that could make analysis unreliable.
shape: (13, 4)
┌─────────────────────────────────┬────────┬────────┬─────────────────────────────────┐
│ check                           ┆ count  ┆ status ┆ note                            │
│ ---                             ┆ ---    ┆ ---    ┆ ---                             │
│ str                             ┆ i64    ┆ str    ┆ str                             │
╞═════════════════════════════════╪════════╪════════╪═════════════════════════════════╡
│ measurement columns checked     ┆ 11     ┆ info   ┆ wide measurement columns becom… │
│ missing values                  ┆ 0      ┆ ok     ┆ null values across raw columns  │
│ duplicate rows                  ┆ 14478  ┆ review ┆ exact raw-row duplicates        │
│ numeric values stored as text   ┆ 0      ┆ ok     ┆ string columns whose values ar… │
│ invalid date values             ┆ 0      ┆ ok     ┆ date-like values that failed p… │
│ …                               ┆ …      ┆ …      ┆ …                               │
│ pH outside 0-14                 ┆ 0      ┆ ok     ┆ domain rule for pH              │
│ negative concentration-like me… ┆ 2      ┆ review ┆ negative values outside pH and… │
│ outlier values                  ┆ 12652  ┆ review ┆ IQR outliers across 8 columns   │
│ duplicate location/date/parame… ┆ 204237 ┆ review ┆ same location, date, and param… │
│ large gaps in time series       ┆ 5395   ┆ review ┆ location time periods with mis… │
└─────────────────────────────────┴────────┴────────┴─────────────────────────────────┘

data quality details:
This gives the columns and counts behind the summary checks.
shape: (9, 4)
┌──────────────────────────────┬─────────────────────────────┬───────┬─────────────────────────────┐
│ problem                      ┆ column                      ┆ count ┆ note                        │
│ ---                          ┆ ---                         ┆ ---   ┆ ---                         │
│ str                          ┆ str                         ┆ i64   ┆ str                         │
╞══════════════════════════════╪═════════════════════════════╪═══════╪═════════════════════════════╡
│ outliers                     ┆ Chloride                    ┆ 1861  ┆ outside [5.999999999999998, │
│                              ┆                             ┆       ┆ 31…                         │
│ outliers                     ┆ Conductivity @25°C          ┆ 20    ┆ outside [-179, 909] by IQR  │
│                              ┆                             ┆       ┆ rul…                        │
│ outliers                     ┆ Dissolved Oxygen            ┆ 2314  ┆ outside                     │
│                              ┆                             ┆       ┆ [10.999999999999993, 1…     │
│ outliers                     ┆ ortho-Phosphate (as P) -    ┆ 6229  ┆ outside                     │
│                              ┆ unspe…                      ┆       ┆ [0.009500000000000005,…     │
│ negative concentration-like  ┆ ortho-Phosphate (as P) -    ┆ 2     ┆ negative value outside pH   │
│ me…                          ┆ unspe…                      ┆       ┆ and …                       │
│ outliers                     ┆ pH                          ┆ 406   ┆ outside [6, 9.2] by IQR     │
│                              ┆                             ┆       ┆ rule                        │
│ outliers                     ┆ Temperature                 ┆ 134   ┆ outside                     │
│                              ┆                             ┆       ┆ [0.8499999999999988, 2…     │
│ outliers                     ┆ Total Hardness (as CaCO3)   ┆ 3     ┆ outside [-178, 486] by IQR  │
│                              ┆                             ┆       ┆ rul…                        │
│ outliers                     ┆ True Colour                 ┆ 1685  ┆ outside [-48.5, 147.5] by   │
│                              ┆                             ┆       ┆ IQR …                       │
└──────────────────────────────┴─────────────────────────────┴───────┴─────────────────────────────┘

principle:
This is the rule that guides the cleaning decision in the next step.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Bad input should not quietly b… │
└─────────────────────────────────┘

4. डेटा को साफ़ और सामान्य करें

  use data_pipeline::quality_flow::clean_and_normalize_the_data;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 4. Clean and normalize the data
      clean_and_normalize_the_data(df)?;

      Ok(())
  }

  cargo run --bin clean_and_normalize_the_data

============================================================
4. Clean and normalize the data
============================================================

cleaning summary:
This shows how many normalized rows were kept, rejected, or deduplicated.
shape: (3, 2)
┌──────────────────────────┬────────┐
│ metric                   ┆ value  │
│ ---                      ┆ ---    │
│ str                      ┆ i64    │
╞══════════════════════════╪════════╡
│ cleaned_rows             ┆ 148372 │
│ invalid_rows             ┆ 2      │
│ exact_duplicates_removed ┆ 172375 │
└──────────────────────────┴────────┘

cleaned sample:
This is the normalized long-form data that is easier to query and visualize.
shape: (5, 10)
┌────────────┬─────────────┬─────────────┬──────┬───┬────────────┬────────────┬───────┬────────────┐
│ source_row ┆ location    ┆ sample_date ┆ year ┆ … ┆ parameter_ ┆ unit       ┆ value ┆ source_col │
│ ---        ┆ ---         ┆ ---         ┆ ---  ┆   ┆ code       ┆ ---        ┆ ---   ┆ umn        │
│ i64        ┆ str         ┆ str         ┆ i32  ┆   ┆ ---        ┆ str        ┆ f64   ┆ ---        │
│            ┆             ┆             ┆      ┆   ┆ str        ┆            ┆       ┆ str        │
╞════════════╪═════════════╪═════════════╪══════╪═══╪════════════╪════════════╪═══════╪════════════╡
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ ALKALINITY ┆ as CaCO3   ┆ 314.0 ┆ Alkalinity │
│            ┆ 10          ┆             ┆      ┆   ┆ -TOTAL     ┆            ┆       ┆ -total (as │
│            ┆             ┆             ┆      ┆   ┆            ┆            ┆       ┆ CaCO3)     │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ AMMONIA-TO ┆ as N       ┆ 0.033 ┆ Ammonia-To │
│            ┆ 10          ┆             ┆      ┆   ┆ TAL        ┆            ┆       ┆ tal (as N) │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ BOD_-_5_DA ┆ Total      ┆ 1.2   ┆ BOD - 5    │
│            ┆ 10          ┆             ┆      ┆   ┆ YS         ┆            ┆       ┆ days       │
│            ┆             ┆             ┆      ┆   ┆            ┆            ┆       ┆ (Total)    │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ CHLORIDE   ┆ not_encode ┆ 27.3  ┆ Chloride   │
│            ┆ 10          ┆             ┆      ┆   ┆            ┆ d          ┆       ┆            │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ CONDUCTIVI ┆ @25°C      ┆ 711.0 ┆ Conductivi │
│            ┆ 10          ┆             ┆      ┆   ┆ TY         ┆            ┆       ┆ ty @25°C   │
└────────────┴─────────────┴─────────────┴──────┴───┴────────────┴────────────┴───────┴────────────┘

invalid rows sample:
These rows were separated so bad input does not become trusted data.
shape: (2, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ invalid_reason         │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
│ 15723      ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘

5. गायब मानों को सावधानी से संभालें

  use data_pipeline::quality_flow::handle_missing_values_carefully;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 5. Handle missing values carefully
      handle_missing_values_carefully(df)?;

      Ok(())
  }

  cargo run --bin handle_missing_values_carefully

============================================================
5. Handle missing values carefully
============================================================

missing-value decision summary:
This separates invalid data, missing data, gap candidates, and flagged observed values.
shape: (7, 3)
┌────────────────────────────────┬───────┬─────────────────────────────────┐
│ case                           ┆ count ┆ decision                        │
│ ---                            ┆ ---   ┆ ---                             │
│ str                            ┆ i64   ┆ str                             │
╞════════════════════════════════╪═══════╪═════════════════════════════════╡
│ invalid or impossible rows     ┆ 2     ┆ quarantine                      │
│ missing critical fields        ┆ 0     ┆ reject row                      │
│ missing measurement values     ┆ 0     ┆ keep NULL unless safe to estim… │
│ small time-series gaps         ┆ 45165 ┆ candidate for interpolation af… │
│ large time-series gaps         ┆ 14179 ┆ keep missing                    │
│ suspicious but possible values ┆ 10677 ┆ keep observed value with quali… │
│ values filled automatically    ┆ 0     ┆ none; filling is not automatic  │
└────────────────────────────────┴───────┴─────────────────────────────────┘

quarantined row sample:
These rows are not filled because a critical field or measurement value is missing or invalid.
shape: (2, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ decision               │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
│ 15723      ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘

time-series gap examples:
These are observed gaps; small gaps may be interpolated only after review.
shape: (20, 6)
┌──────────┬──────────────────┬────────────┬────────────┬────────────────┬──────────────────────┐
│ location ┆ parameter_code   ┆ from_date  ┆ to_date    ┆ missing_months ┆ decision             │
│ ---      ┆ ---              ┆ ---        ┆ ---        ┆ ---            ┆ ---                  │
│ str      ┆ str              ┆ str        ┆ str        ┆ i64            ┆ str                  │
╞══════════╪══════════════════╪════════════╪════════════╪════════════════╪══════════════════════╡
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2007-09-01 ┆ 2008-01-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2008-12-01 ┆ 2009-04-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-04-01 ┆ 2009-06-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-06-01 ┆ 2009-08-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-08-01 ┆ 2009-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ …        ┆ …                ┆ …          ┆ …          ┆ …              ┆ …                    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-06-01 ┆ 2009-08-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-08-01 ┆ 2009-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-10-01 ┆ 2010-03-01 ┆ 4              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2010-03-01 ┆ 2010-07-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2010-08-01 ┆ 2010-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
└──────────┴──────────────────┴────────────┴────────────┴────────────────┴──────────────────────┘

quality-flagged sample:
These observed values are kept, but marked because they need caution.
shape: (10, 6)
┌──────────┬─────────────┬─────────────────┬───────┬───────────────────────┬───────────────────────┐
│ location ┆ sample_date ┆ parameter_code  ┆ value ┆ quality_flag          ┆ missing_decision      │
│ ---      ┆ ---         ┆ ---             ┆ ---   ┆ ---                   ┆ ---                   │
│ str      ┆ str         ┆ str             ┆ f64   ┆ str                   ┆ str                   │
╞══════════╪═════════════╪═════════════════╪═══════╪═══════════════════════╪═══════════════════════╡
│ ALLUA    ┆ 2007-09-01  ┆ AMMONIA-TOTAL   ┆ 0.066 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.069 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ ORTHO-PHOSPHATE ┆ 0.005 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.068 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.067 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.133 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.111 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.113 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-03-01  ┆ AMMONIA-TOTAL   ┆ 0.04  ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-03-01  ┆ ORTHO-PHOSPHATE ┆ 0.005 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
└──────────┴─────────────┴─────────────────┴───────┴───────────────────────┴───────────────────────┘

principle:
This is the rule for deciding whether a missing value should be filled.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Filling data is a decision, no… │
└─────────────────────────────────┘

handled data summary:
This confirms the row counts after applying the missing-value decisions.
shape: (3, 2)
┌────────────────────┬────────┐
│ metric             ┆ value  │
│ ---                ┆ ---    │
│ str                ┆ i64    │
╞════════════════════╪════════╡
│ handled_rows       ┆ 148372 │
│ quarantined_rows   ┆ 2      │
│ duplicates_removed ┆ 172375 │
└────────────────────┴────────┘

6. संग्रहीत करने से पहले मान्य करें

  use data_pipeline::quality_flow::validate_before_storing;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 6. Validate before storing
      validate_before_storing(df)?;

      Ok(())
  }

  cargo run --bin validate_before_storing

============================================================
6. Validate before storing
============================================================

validation rule summary:
This shows the storage rules, how many records were checked, and what failed.
shape: (7, 4)
┌─────────────────────────────────┬─────────┬────────┬─────────────────────────────────┐
│ rule                            ┆ checked ┆ failed ┆ action                          │
│ ---                             ┆ ---     ┆ ---    ┆ ---                             │
│ str                             ┆ i64     ┆ i64    ┆ str                             │
╞═════════════════════════════════╪═════════╪════════╪═════════════════════════════════╡
│ every measurement has a locati… ┆ 320749  ┆ 0      ┆ reject missing locations        │
│ every measurement has a parame… ┆ 320749  ┆ 0      ┆ reject missing parameters       │
│ every measurement has a date    ┆ 320749  ┆ 0      ┆ reject missing or invalid date… │
│ value is a valid number or exp… ┆ 320749  ┆ 0      ┆ store numeric values; store mi… │
│ known parameter respects range  ┆ 320749  ┆ 20     ┆ reject impossible values for k… │
│ exact duplicate records handle… ┆ 320729  ┆ 172369 ┆ remove exact duplicates         │
│ repeated measurements handled … ┆ 148360  ┆ 31854  ┆ keep with source_row so repeat… │
└─────────────────────────────────┴─────────┴────────┴─────────────────────────────────┘

records rejected before storage:
These rows failed validation and should not be inserted into trusted tables.
shape: (10, 6)
┌────────────┬──────────────┬──────────┬───────────────────────┬───────────┬───────────────────────┐
│ source_row ┆ location     ┆ raw_date ┆ source_column         ┆ raw_value ┆ rule_failed           │
│ ---        ┆ ---          ┆ ---      ┆ ---                   ┆ ---       ┆ ---                   │
│ i64        ┆ str          ┆ str      ┆ str                   ┆ str       ┆ str                   │
╞════════════╪══════════════╪══════════╪═══════════════════════╪═══════════╪═══════════════════════╡
│ 111        ┆ ASKANAGAP    ┆ Jan      ┆ ortho-Phosphate (as   ┆ -0.004    ┆ known parameter range │
│            ┆ STREAM_010   ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 2003       ┆ CAMCOR_020   ┆ Feb      ┆ Temperature           ┆ 58.0      ┆ known parameter range │
│            ┆              ┆          ┆                       ┆           ┆ failed (…             │
│ 4813       ┆ DARGLE_030   ┆ Jan      ┆ ortho-Phosphate (as   ┆ 42.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4815       ┆ DARGLE_030   ┆ Feb      ┆ ortho-Phosphate (as   ┆ 22.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4857       ┆ DARGLE_030   ┆ Jul      ┆ ortho-Phosphate (as   ┆ 70.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4873       ┆ DARGLE_030   ┆ May      ┆ ortho-Phosphate (as   ┆ 29.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4893       ┆ DARGLE_030   ┆ Mar      ┆ ortho-Phosphate (as   ┆ 26.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4903       ┆ DARGLE_030   ┆ Sep      ┆ ortho-Phosphate (as   ┆ 25.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 6096       ┆ GLENCREE_010 ┆ Feb      ┆ ortho-Phosphate (as   ┆ 27.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 6117       ┆ GLENCREE_010 ┆ Jul      ┆ ortho-Phosphate (as   ┆ 27.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
└────────────┴──────────────┴──────────┴───────────────────────┴───────────┴───────────────────────┘

duplicate handling sample:
These exact duplicates are handled deliberately before storage.
shape: (10, 6)
┌────────────┬──────────┬─────────────┬──────────────────┬───────┬─────────────────────────────────┐
│ source_row ┆ location ┆ sample_date ┆ parameter_code   ┆ value ┆ action                          │
│ ---        ┆ ---      ┆ ---         ┆ ---              ┆ ---   ┆ ---                             │
│ i64        ┆ str      ┆ str         ┆ str              ┆ f64   ┆ str                             │
╞════════════╪══════════╪═════════════╪══════════════════╪═══════╪═════════════════════════════════╡
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ AMMONIA-TOTAL    ┆ 0.033 ┆ skip exact duplicate before st… │
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ AMMONIA-TOTAL    ┆ 0.033 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ TEMPERATURE      ┆ 17.8  ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ ALKALINITY-TOTAL ┆ 19.0  ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
└────────────┴──────────┴─────────────┴──────────────────┴───────┴─────────────────────────────────┘

trusted records sample:
These records passed validation and are shaped for database insertion.
shape: (5, 8)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬───────┬────────────┐
│ source_row ┆ location   ┆ sample_dat ┆ parameter  ┆ parameter_ ┆ unit       ┆ value ┆ source_col │
│ ---        ┆ ---        ┆ e          ┆ ---        ┆ code       ┆ ---        ┆ ---   ┆ umn        │
│ i64        ┆ str        ┆ ---        ┆ str        ┆ ---        ┆ str        ┆ f64   ┆ ---        │
│            ┆            ┆ str        ┆            ┆ str        ┆            ┆       ┆ str        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪════════════╡
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Alkalinity ┆ ALKALINITY ┆ as CaCO3   ┆ 314.0 ┆ Alkalinity │
│            ┆ 010        ┆            ┆ -total     ┆ -TOTAL     ┆            ┆       ┆ -total (as │
│            ┆            ┆            ┆            ┆            ┆            ┆       ┆ CaCO3)     │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Ammonia-To ┆ AMMONIA-TO ┆ as N       ┆ 0.033 ┆ Ammonia-To │
│            ┆ 010        ┆            ┆ tal        ┆ TAL        ┆            ┆       ┆ tal (as N) │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ BOD - 5    ┆ BOD_-_5_DA ┆ Total      ┆ 1.2   ┆ BOD - 5    │
│            ┆ 010        ┆            ┆ days       ┆ YS         ┆            ┆       ┆ days       │
│            ┆            ┆            ┆            ┆            ┆            ┆       ┆ (Total)    │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Chloride   ┆ CHLORIDE   ┆ not_encode ┆ 27.3  ┆ Chloride   │
│            ┆ 010        ┆            ┆            ┆            ┆ d          ┆       ┆            │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Conductivi ┆ CONDUCTIVI ┆ @25°C      ┆ 711.0 ┆ Conductivi │
│            ┆ 010        ┆            ┆ ty         ┆ TY         ┆            ┆       ┆ ty @25°C   │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴───────┴────────────┘

storage readiness summary:
This is the final count of clean records, rejected records, NULLs, and handled duplicates.
shape: (6, 2)
┌─────────────────────────────────┬────────┐
│ metric                          ┆ value  │
│ ---                             ┆ ---    │
│ str                             ┆ i64    │
╞═════════════════════════════════╪════════╡
│ raw_measurement_rows            ┆ 320749 │
│ trusted_records_ready_to_store  ┆ 148360 │
│ records_rejected                ┆ 20     │
│ explicit_null_values            ┆ 0      │
│ exact_duplicates_removed        ┆ 172369 │
│ repeated_measurements_kept_wit… ┆ 31854  │
└─────────────────────────────────┴────────┘

principle:
This is the rule for deciding what is safe to store.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ The database should store clea… │
└─────────────────────────────────┘

7. संरचना के साथ साफ़ डेटा संग्रहीत करें

  use data_pipeline::quality_flow::store_clean_data_with_structure;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 7. Store clean data with structure
      store_clean_data_with_structure(df)?;

      Ok(())
  }

  cargo run --bin store_clean_data_with_structure

============================================================
7. Store clean data with structure
============================================================

database file:
This is the SQLite file written for the API project.
shape: (1, 2)
┌─────────────┬─────────────────────────────────┐
│ item        ┆ value                           │
│ ---         ┆ ---                             │
│ str         ┆ str                             │
╞═════════════╪═════════════════════════════════╡
│ sqlite_file ┆ /Users/chiefkemist/Documents/n… │
└─────────────┴─────────────────────────────────┘

structured schema:
The cleaned data is stored across small tables instead of one giant messy table.
shape: (5, 2)
┌────────────────┬─────────────────────────────────┐
│ table          ┆ purpose                         │
│ ---            ┆ ---                             │
│ str            ┆ str                             │
╞════════════════╪═════════════════════════════════╡
│ locations      ┆ one row per normalized locatio… │
│ parameters     ┆ one row per normalized paramet… │
│ measurements   ┆ trusted observed measurements   │
│ ingestion_runs ┆ source file, import time, coun… │
│ rejected_rows  ┆ rows that failed validation or… │
└────────────────┴─────────────────────────────────┘

ingestion run summary:
This records where the data came from and what happened during import.
shape: (6, 2)
┌──────────────────────────┬────────┐
│ metric                   ┆ value  │
│ ---                      ┆ ---    │
│ str                      ┆ i64    │
╞══════════════════════════╪════════╡
│ ingestion_run_id         ┆ 1      │
│ raw_rows                 ┆ 29159  │
│ raw_measurement_rows     ┆ 320749 │
│ accepted_measurements    ┆ 148360 │
│ rejected_rows            ┆ 20     │
│ exact_duplicates_removed ┆ 172369 │
└──────────────────────────┴────────┘

database table counts:
These counts are read back from SQLite after the write finishes.
shape: (5, 2)
┌────────────────┬────────┐
│ table          ┆ rows   │
│ ---            ┆ ---    │
│ str            ┆ i64    │
╞════════════════╪════════╡
│ ingestion_runs ┆ 1      │
│ locations      ┆ 160    │
│ parameters     ┆ 11     │
│ measurements   ┆ 148360 │
│ rejected_rows  ┆ 20     │
└────────────────┴────────┘

stored measurement sample:
These accepted rows are stored in the measurements table with foreign keys.
shape: (5, 6)
┌────────────┬───────────────┬─────────────┬──────────────────┬─────────────┬───────┐
│ source_row ┆ location      ┆ sample_date ┆ parameter_code   ┆ unit        ┆ value │
│ ---        ┆ ---           ┆ ---         ┆ ---              ┆ ---         ┆ ---   │
│ i64        ┆ str           ┆ str         ┆ str              ┆ str         ┆ f64   │
╞════════════╪═══════════════╪═════════════╪══════════════════╪═════════════╪═══════╡
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ ALKALINITY-TOTAL ┆ as CaCO3    ┆ 314.0 │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ AMMONIA-TOTAL    ┆ as N        ┆ 0.033 │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ BOD_-_5_DAYS     ┆ Total       ┆ 1.2   │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ CHLORIDE         ┆ not_encoded ┆ 27.3  │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ CONDUCTIVITY     ┆ @25°C       ┆ 711.0 │
└────────────┴───────────────┴─────────────┴──────────────────┴─────────────┴───────┘

rejected row sample:
These failed rows are stored separately for traceability.
shape: (5, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ rejection_reason       │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ known parameter range  │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 2003       ┆ CAMCOR_020 ┆ Feb      ┆ Temperature            ┆ 58.0      ┆ known parameter range  │
│            ┆            ┆          ┆                        ┆           ┆ failed (…              │
│ 4813       ┆ DARGLE_030 ┆ Jan      ┆ ortho-Phosphate (as P) ┆ 42.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 4815       ┆ DARGLE_030 ┆ Feb      ┆ ortho-Phosphate (as P) ┆ 22.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 4857       ┆ DARGLE_030 ┆ Jul      ┆ ortho-Phosphate (as P) ┆ 70.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘

principle:
This is the reason for storing accepted rows, rejected rows, and ingestion metadata.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ The database is part of the da… │
└─────────────────────────────────┘

8. साफ़ करने के बाद विज़ुअलाइज़ करें

  use data_pipeline::quality_flow::visualize_after_cleaning;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 8. Visualize after cleaning
      visualize_after_cleaning(df)?;

      Ok(())
  }

  cargo run --bin visualize_after_cleaning

============================================================
8. Visualize after cleaning
============================================================

dashboard handoff:
The dashboard reads the cleaned SQLite database produced by the storage step.
shape: (5, 2)
┌────────────────────┬─────────────────────────────────┐
│ item               ┆ value                           │
│ ---                ┆ ---                             │
│ str                ┆ str                             │
╞════════════════════╪═════════════════════════════════╡
│ raw_rows_available ┆ 29159                           │
│ sqlite_file        ┆ /Users/chiefkemist/Documents/n… │
│ dashboard_page     ┆ http://localhost:3434/data_viz  │
│ summary_json       ┆ http://localhost:3434/api/dash… │
│ timeseries_json    ┆ http://localhost:3434/api/dash… │
└────────────────────┴─────────────────────────────────┘

dashboard views:
These views turn the cleaned records into visual checks for patterns, gaps, and problems.
shape: (9, 2)
┌─────────────────────────────────┬─────────────────────────────────┐
│ view                            ┆ source                          │
│ ---                             ┆ ---                             │
│ str                             ┆ str                             │
╞═════════════════════════════════╪═════════════════════════════════╡
│ pH over time by location        ┆ measurements joined with locat… │
│ temperature over time           ┆ measurements joined with locat… │
│ dissolved oxygen over time      ┆ measurements joined with locat… │
│ ammonia spikes by location      ┆ measurements joined with locat… │
│ missing-data heatmap            ┆ measurement coverage by locati… │
│ outlier count by parameter      ┆ rejected_rows grouped by sourc… │
│ data completeness by location   ┆ measurements grouped by locati… │
│ before/after cleaning summary   ┆ ingestion_runs accepted and re… │
│ water-quality score by locatio… ┆ aggregated pH, dissolved oxyge… │
└─────────────────────────────────┴─────────────────────────────────┘

principle:
Visualization is the final check that the pipeline produced useful data.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Understand, clean, validate, s… │
└─────────────────────────────────┘

फुटनोट्स

जल गुणवत्ता निगरानी डेटासेट (आयरलैंड)

Polars: नए युग के लिए डेटाफ्रेम

रस्ट: विश्वसनीय और कुशल सॉफ्टवेयर बनाने के लिए सभी को सशक्त बनाने वाली एक भाषा।

Ubuntu TechHive

Rust Data Pipelines: From Files to Clean Databases and Web Dashboards

रस्ट डेटा पाइपलाइन्स: फाइलों से लेकर साफ़ डेटाबेस और वेब डैशबोर्ड तक

परिचय

डेटा पाइपलाइन

उपयोग किए गए डेटासेट के बारे में

उपकरण और लाइब्रेरी

डेटाफ्रेम

कॉलम चुनना

कॉलम जोड़ना

एक्सप्रेशन विस्तार

पंक्तियों को फ़िल्टर करना

ग्रुपिंग करना

डेटा विश्लेषण

1. कच्चे डेटा का निरीक्षण करें:

2. डेटा को प्रोफाइल करें

3. डेटा गुणवत्ता समस्याओं की पहचान करें

4. डेटा को साफ़ और सामान्य करें

5. गायब मानों को सावधानी से संभालें

6. संग्रहीत करने से पहले मान्य करें

7. संरचना के साथ साफ़ डेटा संग्रहीत करें

8. साफ़ करने के बाद विज़ुअलाइज़ करें

फुटनोट्स

सभी लेख

रस्ट और पोलर्स के साथ डेटा प्रोसेसिंग

रस्ट डेटा पाइपलाइन्स: फाइलों से लेकर साफ़ डेटाबेस और वेब डैशबोर्ड तक

परिचय

डेटा पाइपलाइन

उपयोग किए गए डेटासेट के बारे में

उपकरण और लाइब्रेरी

डेटाफ्रेम

कॉलम चुनना

कॉलम जोड़ना

एक्सप्रेशन विस्तार

पंक्तियों को फ़िल्टर करना

ग्रुपिंग करना

डेटा विश्लेषण

1. कच्चे डेटा का निरीक्षण करें:

2. डेटा को प्रोफाइल करें

3. डेटा गुणवत्ता समस्याओं की पहचान करें

4. डेटा को साफ़ और सामान्य करें

5. गायब मानों को सावधानी से संभालें

6. संग्रहीत करने से पहले मान्य करें

7. संरचना के साथ साफ़ डेटा संग्रहीत करें

8. साफ़ करने के बाद विज़ुअलाइज़ करें

फुटनोट्स

टैग

सभी लेख

रस्ट और पोलर्स के साथ डेटा प्रोसेसिंग