Pipelines de datos en Rust: De archivos a bases de datos limpias y paneles web

Introducción
Canalización de datos
Notas al pie

Introducción

Estamos construyendo una pequeña canalización de datos ambientales. Los archivos de monitoreo de calidad del agua en bruto llegan en formato CSV. Nuestra herramienta en Rust los valida, limpia los registros erróneos, rellena los vacíos seguros, almacena las mediciones confiables y alimenta un panel de control.

Canalización de datos

Sobre el conjunto de datos utilizado

El conjunto de datos¹ contiene datos de monitoreo de calidad del agua en bruto de Cork Harbour, Moy Killala y otras 15 ubicaciones costeras en Irlanda. El conjunto de datos extraído en bruto tiene más de 1.27 millones de entradas, y el repositorio también incluye una versión transformada/pivotada con 29,159 filas a través de 11 parámetros de calidad del agua. Los archivos son CSV, por lo que son fáciles de usar para el flujo “archivos → base de datos limpia → panel de control”.

Herramientas y bibliotecas

Usamos Rust² para implementar nuestra canalización de datos aprovechando Polars³.

DataFrame

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();
        println!("Data:");
        print!("{df}\n");

        let head = df.head(Some(2));
        println!("Head:");
        print!("{head}\n");

      Ok(())
  }

Data:
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
Head:
shape: (2, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└──────────────┴────────────┴────────┴────────┘

Selección de columnas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("Column selection:");
        print!("{result}\n");

      Ok(())
  }

Column selection:
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘

Adición de columnas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{LazyFrame, IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .with_columns([
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("With added colums:");
        print!("{result}\n");

      Ok(())
  }

With added colums:
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘

Expansión de expresiones

lit significa literal y es parte de la API de expresiones lazy de la característica lazy de Polars³.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, cols, lit, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                (cols(["weight", "height"]).as_expr() * lit(0.95))
                    .round(2, RoundMode::default())
                    .name()
                    .suffix("-5%"),
            ])
            .collect()?;
        println!("Transform:");
        print!("{result}\n");

      Ok(())
  }

Transform:
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 51.87     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘

Filtrado de filas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{IntoLazy, col, lit, ClosedInterval},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .filter(col("birthdate").dt().year().lt(lit(1990)))
            .collect()?;
        println!("With row filtering:");
        print!("{result}\n");

        let result = df
              .clone()
              .lazy()
              .filter(
                  col("birthdate")
                      .is_between(
                          lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                          lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                          ClosedInterval::Both,
                      )
                      .and(col("height").gt(lit(1.7))),
              )
              .collect()?;
        println!("With complex row filtering:");
        print!("{result}\n");

      Ok(())
  }

With row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
With complex row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘

Agrupación por

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, lit, len, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([len()])
            .collect()?;
        println!("Grouping by birth decade:");
        print!("{result}\n");

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([
                len().alias("sample_size"),
                col("weight")
                    .mean()
                    .round(2, RoundMode::default())
                    .alias("avg_weight"),
                col("height").max().alias("tallest"),
            ])
            .collect()?;
        println!("Grouping by derived features:");
        println!("{result}");

      Ok(())
  }

Grouping by birth decade:
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 3   │
│ 1980   ┆ 1   │
└────────┴─────┘
Grouping by derived features:
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1990   ┆ 3           ┆ 65.2       ┆ 1.75    │
│ 1980   ┆ 1           ┆ 72.5       ┆ 1.77    │
└────────┴─────────────┴────────────┴─────────┘

Análisis de datos

Cuando recibimos un nuevo conjunto de datos, el objetivo no es construir gráficos o ejecutar modelos de inmediato. El primer objetivo es entender si se puede confiar en los datos. El análisis completo está en github.

1. Inspeccionar los datos en bruto:

Descargue los datos, cárguelos con Polars³ y luego imprima el encabezado

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

rust-script failed with exit code 1

[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))

Polars³ no está adivinando el tipo de algunas de las columnas correctamente. Permitamos que adivine a partir de 100 filas por defecto.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(None)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

Dejemos que Polars⁴ infiera los tipos correctos de las columnas ahora a partir de 10000 filas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::inspect_raw_data;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 1. Inspect the raw data
      inspect_raw_data(df)?;

      Ok(())
  }

  cargo run --bin inspect_raw_data

============================================================
1. Inspect the raw data
============================================================

raw dataset size:
This confirms how many rows and columns were loaded from the CSV.
shape: (2, 2)
┌─────────┬───────┐
│ metric  ┆ value │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ rows    ┆ 29159 │
│ columns ┆ 14    │
└─────────┴───────┘

inferred schema:
This shows each column name and the type Polars inferred from the file.
shape: (14, 3)
┌─────────────────────────────────┬───────────────┬───────────────┐
│ column                          ┆ inferred_type ┆ storage_kind  │
│ ---                             ┆ ---           ┆ ---           │
│ str                             ┆ str           ┆ str           │
╞═════════════════════════════════╪═══════════════╪═══════════════╡
│ WaterbodyName                   ┆ String        ┆ text or mixed │
│ Years                           ┆ Int64         ┆ number        │
│ SampleDate                      ┆ String        ┆ text or mixed │
│ Alkalinity-total (as CaCO3)     ┆ Float64       ┆ number        │
│ Ammonia-Total (as N)            ┆ Float64       ┆ number        │
│ …                               ┆ …             ┆ …             │
│ ortho-Phosphate (as P) - unspe… ┆ Float64       ┆ number        │
│ pH                              ┆ Float64       ┆ number        │
│ Temperature                     ┆ Float64       ┆ number        │
│ Total Hardness (as CaCO3)       ┆ Float64       ┆ number        │
│ True Colour                     ┆ Float64       ┆ number        │
└─────────────────────────────────┴───────────────┴───────────────┘

raw row sample:
This shows one original wide record before any reshaping.
shape: (1, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬─────┬──────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH  ┆ Temperature  ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ --- ┆ ---          ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64 ┆ f64          ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆     ┆              ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆     ┆              ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪═════╪══════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8 ┆ 10.4         ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆     ┆              ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴─────┴──────────────┴─────────────┴────────┘

first-pass column roles:
This separates location/date columns from measurement columns.
shape: (14, 2)
┌─────────────────────────────────┬─────────────┐
│ column                          ┆ role        │
│ ---                             ┆ ---         │
│ str                             ┆ str         │
╞═════════════════════════════════╪═════════════╡
│ WaterbodyName                   ┆ location    │
│ Years                           ┆ date        │
│ SampleDate                      ┆ date        │
│ Alkalinity-total (as CaCO3)     ┆ measurement │
│ Ammonia-Total (as N)            ┆ measurement │
│ …                               ┆ …           │
│ ortho-Phosphate (as P) - unspe… ┆ measurement │
│ pH                              ┆ measurement │
│ Temperature                     ┆ measurement │
│ Total Hardness (as CaCO3)       ┆ measurement │
│ True Colour                     ┆ measurement │
└─────────────────────────────────┴─────────────┘

long-form sample:
This previews the wide measurements as parameter/value rows.
shape: (5, 7)
┌───────────────┬───────┬────────────┬────────────────┬────────────────┬────────────────┬──────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column  ┆ measurement_va ┆ parameter      ┆ unit     │
│ ---           ┆ ---   ┆ ---        ┆ ---            ┆ lue            ┆ ---            ┆ ---      │
│ str           ┆ i64   ┆ str        ┆ str            ┆ ---            ┆ str            ┆ str      │
│               ┆       ┆            ┆                ┆ f64            ┆                ┆          │
╞═══════════════╪═══════╪════════════╪════════════════╪════════════════╪════════════════╪══════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ Alkalinity-tot ┆ 314.0          ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 14.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 17.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-tot ┆ 18.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-tot ┆ 19.0           ┆ Alkalinity-tot ┆ as CaCO3 │
│               ┆       ┆            ┆ al (as CaCO3)  ┆                ┆ al             ┆          │
└───────────────┴───────┴────────────┴────────────────┴────────────────┴────────────────┴──────────┘

2. Perfilar los datos

Esto nos da una primera imagen del conjunto de datos antes de tomar decisiones.

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::profile_the_data;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 2. Profile the data
      profile_the_data(df)?;

      Ok(())
  }

  cargo run --bin profile_the_data

============================================================
2. Profile the data
============================================================

profile scope:
This repeats the dataset size before summarizing each important column.
shape: (2, 2)
┌─────────┬───────┐
│ metric  ┆ value │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ rows    ┆ 29159 │
│ columns ┆ 14    │
└─────────┴───────┘

date coverage:
This combines Years and SampleDate into a usable month-level date range.
shape: (5, 2)
┌───────────────────────────┬────────────┐
│ metric                    ┆ value      │
│ ---                       ┆ ---        │
│ str                       ┆ str        │
╞═══════════════════════════╪════════════╡
│ earliest_date             ┆ 2007-01-01 │
│ latest_date               ┆ 2023-04-01 │
│ invalid_dates             ┆ 0          │
│ missing_dates             ┆ 0          │
│ gaps_over_time_gt_31_days ┆ 0          │
└───────────────────────────┴────────────┘

column profile:
This gives missing counts, distinct counts, numeric ranges, averages, and notes.
shape: (14, 9)
┌──────────────┬──────────┬─────────┬─────────┬───┬─────────┬─────────┬──────────────┬─────────────┐
│ column       ┆ role     ┆ type    ┆ missing ┆ … ┆ minimum ┆ maximum ┆ average      ┆ notes       │
│ ---          ┆ ---      ┆ ---     ┆ ---     ┆   ┆ ---     ┆ ---     ┆ ---          ┆ ---         │
│ str          ┆ str      ┆ str     ┆ i64     ┆   ┆ str     ┆ str     ┆ str          ┆ str         │
╞══════════════╪══════════╪═════════╪═════════╪═══╪═════════╪═════════╪══════════════╪═════════════╡
│ WaterbodyNam ┆ location ┆ String  ┆ 0       ┆ … ┆         ┆         ┆              ┆ unique      │
│ e            ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ locations:  │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ 160         │
│ Years        ┆ date     ┆ Int64   ┆ 0       ┆ … ┆ 2007    ┆ 2023    ┆ Float64(2014 ┆ included in │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ .78253712404 ┆ combined    │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 4)           ┆ date cove…  │
│ SampleDate   ┆ date     ┆ String  ┆ 0       ┆ … ┆         ┆         ┆              ┆ review full │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ category    │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆              ┆ list; inc…  │
│ Alkalinity-t ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 442     ┆ Float64(139. ┆             │
│ otal (as     ┆          ┆         ┆         ┆   ┆         ┆         ┆ 858347851435 ┆             │
│ CaCO3)       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 2)           ┆             │
│ Ammonia-Tota ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 40      ┆ Float64(0.06 ┆             │
│ l (as N)     ┆          ┆         ┆         ┆   ┆         ┆         ┆ 357266127096 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 262)         ┆             │
│ …            ┆ …        ┆ …       ┆ …       ┆ … ┆ …       ┆ …       ┆ …            ┆ …           │
│ ortho-Phosph ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ -0.004  ┆ 70      ┆ Float64(0.06 ┆ negative    │
│ ate (as P) - ┆          ┆         ┆         ┆   ┆         ┆         ┆ 878934462773 ┆ value found │
│ unspe…       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 074)         ┆ (-0.004)    │
│ pH           ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 4.7     ┆ 9.8     ┆ Float64(7.55 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 205686066051 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 8)           ┆             │
│ Temperature  ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0.6     ┆ 637     ┆ Float64(10.8 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 505031036729 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 74)          ┆             │
│ Total        ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 642     ┆ Float64(159. ┆             │
│ Hardness (as ┆          ┆         ┆         ┆   ┆         ┆         ┆ 092110326142 ┆             │
│ CaCO3)       ┆          ┆         ┆         ┆   ┆         ┆         ┆ 9)           ┆             │
│ True Colour  ┆ numeric  ┆ Float64 ┆ 0       ┆ … ┆ 0       ┆ 953     ┆ Float64(58.1 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 374635618505 ┆             │
│              ┆          ┆         ┆         ┆   ┆         ┆         ┆ 45)          ┆             │
└──────────────┴──────────┴─────────┴─────────┴───┴─────────┴─────────┴──────────────┴─────────────┘

text/category profile:
This summarizes unique text values and possible spelling variations.
shape: (2, 5)
┌───────────────┬──────────────┬───────────────┬──────────────────────┬────────────────────────────┐
│ column        ┆ empty_values ┆ unique_values ┆ sample_unique_values ┆ possible_spelling_variatio │
│ ---           ┆ ---          ┆ ---           ┆ ---                  ┆ ns                         │
│ str           ┆ i64          ┆ i64           ┆ str                  ┆ ---                        │
│               ┆              ┆               ┆                      ┆ str                        │
╞═══════════════╪══════════════╪═══════════════╪══════════════════════╪════════════════════════════╡
│ WaterbodyName ┆ 0            ┆ 160           ┆ ABBEYTOWN_010,       ┆                            │
│               ┆              ┆               ┆ ASKANAGAP STREA…     ┆                            │
│ SampleDate    ┆ 0            ┆ 12            ┆ Apr, Aug, Dec, Feb,  ┆                            │
│               ┆              ┆               ┆ Jan, Jul, …          ┆                            │
└───────────────┴──────────────┴───────────────┴──────────────────────┴────────────────────────────┘

3. Identificar problemas de calidad de datos

  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  use data_pipeline::quality_flow::identify_data_quality_problems;

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 3. Identify data quality problems
      identify_data_quality_problems(df)?;

      Ok(())
  }

  cargo run --bin identify_data_quality_problems

============================================================
3. Identify data quality problems
============================================================

data quality summary:
This is the high-level checklist of problems that could make analysis unreliable.
shape: (13, 4)
┌─────────────────────────────────┬────────┬────────┬─────────────────────────────────┐
│ check                           ┆ count  ┆ status ┆ note                            │
│ ---                             ┆ ---    ┆ ---    ┆ ---                             │
│ str                             ┆ i64    ┆ str    ┆ str                             │
╞═════════════════════════════════╪════════╪════════╪═════════════════════════════════╡
│ measurement columns checked     ┆ 11     ┆ info   ┆ wide measurement columns becom… │
│ missing values                  ┆ 0      ┆ ok     ┆ null values across raw columns  │
│ duplicate rows                  ┆ 14478  ┆ review ┆ exact raw-row duplicates        │
│ numeric values stored as text   ┆ 0      ┆ ok     ┆ string columns whose values ar… │
│ invalid date values             ┆ 0      ┆ ok     ┆ date-like values that failed p… │
│ …                               ┆ …      ┆ …      ┆ …                               │
│ pH outside 0-14                 ┆ 0      ┆ ok     ┆ domain rule for pH              │
│ negative concentration-like me… ┆ 2      ┆ review ┆ negative values outside pH and… │
│ outlier values                  ┆ 12652  ┆ review ┆ IQR outliers across 8 columns   │
│ duplicate location/date/parame… ┆ 204237 ┆ review ┆ same location, date, and param… │
│ large gaps in time series       ┆ 5395   ┆ review ┆ location time periods with mis… │
└─────────────────────────────────┴────────┴────────┴─────────────────────────────────┘

data quality details:
This gives the columns and counts behind the summary checks.
shape: (9, 4)
┌──────────────────────────────┬─────────────────────────────┬───────┬─────────────────────────────┐
│ problem                      ┆ column                      ┆ count ┆ note                        │
│ ---                          ┆ ---                         ┆ ---   ┆ ---                         │
│ str                          ┆ str                         ┆ i64   ┆ str                         │
╞══════════════════════════════╪═════════════════════════════╪═══════╪═════════════════════════════╡
│ outliers                     ┆ Chloride                    ┆ 1861  ┆ outside [5.999999999999998, │
│                              ┆                             ┆       ┆ 31…                         │
│ outliers                     ┆ Conductivity @25°C          ┆ 20    ┆ outside [-179, 909] by IQR  │
│                              ┆                             ┆       ┆ rul…                        │
│ outliers                     ┆ Dissolved Oxygen            ┆ 2314  ┆ outside                     │
│                              ┆                             ┆       ┆ [10.999999999999993, 1…     │
│ outliers                     ┆ ortho-Phosphate (as P) -    ┆ 6229  ┆ outside                     │
│                              ┆ unspe…                      ┆       ┆ [0.009500000000000005,…     │
│ negative concentration-like  ┆ ortho-Phosphate (as P) -    ┆ 2     ┆ negative value outside pH   │
│ me…                          ┆ unspe…                      ┆       ┆ and …                       │
│ outliers                     ┆ pH                          ┆ 406   ┆ outside [6, 9.2] by IQR     │
│                              ┆                             ┆       ┆ rule                        │
│ outliers                     ┆ Temperature                 ┆ 134   ┆ outside                     │
│                              ┆                             ┆       ┆ [0.8499999999999988, 2…     │
│ outliers                     ┆ Total Hardness (as CaCO3)   ┆ 3     ┆ outside [-178, 486] by IQR  │
│                              ┆                             ┆       ┆ rul…                        │
│ outliers                     ┆ True Colour                 ┆ 1685  ┆ outside [-48.5, 147.5] by   │
│                              ┆                             ┆       ┆ IQR …                       │
└──────────────────────────────┴─────────────────────────────┴───────┴─────────────────────────────┘

principle:
This is the rule that guides the cleaning decision in the next step.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Bad input should not quietly b… │
└─────────────────────────────────┘

4. Limpiar y normalizar los datos

  use data_pipeline::quality_flow::clean_and_normalize_the_data;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 4. Clean and normalize the data
      clean_and_normalize_the_data(df)?;

      Ok(())
  }

  cargo run --bin clean_and_normalize_the_data

============================================================
4. Clean and normalize the data
============================================================

cleaning summary:
This shows how many normalized rows were kept, rejected, or deduplicated.
shape: (3, 2)
┌──────────────────────────┬────────┐
│ metric                   ┆ value  │
│ ---                      ┆ ---    │
│ str                      ┆ i64    │
╞══════════════════════════╪════════╡
│ cleaned_rows             ┆ 148372 │
│ invalid_rows             ┆ 2      │
│ exact_duplicates_removed ┆ 172375 │
└──────────────────────────┴────────┘

cleaned sample:
This is the normalized long-form data that is easier to query and visualize.
shape: (5, 10)
┌────────────┬─────────────┬─────────────┬──────┬───┬────────────┬────────────┬───────┬────────────┐
│ source_row ┆ location    ┆ sample_date ┆ year ┆ … ┆ parameter_ ┆ unit       ┆ value ┆ source_col │
│ ---        ┆ ---         ┆ ---         ┆ ---  ┆   ┆ code       ┆ ---        ┆ ---   ┆ umn        │
│ i64        ┆ str         ┆ str         ┆ i32  ┆   ┆ ---        ┆ str        ┆ f64   ┆ ---        │
│            ┆             ┆             ┆      ┆   ┆ str        ┆            ┆       ┆ str        │
╞════════════╪═════════════╪═════════════╪══════╪═══╪════════════╪════════════╪═══════╪════════════╡
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ ALKALINITY ┆ as CaCO3   ┆ 314.0 ┆ Alkalinity │
│            ┆ 10          ┆             ┆      ┆   ┆ -TOTAL     ┆            ┆       ┆ -total (as │
│            ┆             ┆             ┆      ┆   ┆            ┆            ┆       ┆ CaCO3)     │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ AMMONIA-TO ┆ as N       ┆ 0.033 ┆ Ammonia-To │
│            ┆ 10          ┆             ┆      ┆   ┆ TAL        ┆            ┆       ┆ tal (as N) │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ BOD_-_5_DA ┆ Total      ┆ 1.2   ┆ BOD - 5    │
│            ┆ 10          ┆             ┆      ┆   ┆ YS         ┆            ┆       ┆ days       │
│            ┆             ┆             ┆      ┆   ┆            ┆            ┆       ┆ (Total)    │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ CHLORIDE   ┆ not_encode ┆ 27.3  ┆ Chloride   │
│            ┆ 10          ┆             ┆      ┆   ┆            ┆ d          ┆       ┆            │
│ 1          ┆ ABBEYTOWN_0 ┆ 2023-02-01  ┆ 2023 ┆ … ┆ CONDUCTIVI ┆ @25°C      ┆ 711.0 ┆ Conductivi │
│            ┆ 10          ┆             ┆      ┆   ┆ TY         ┆            ┆       ┆ ty @25°C   │
└────────────┴─────────────┴─────────────┴──────┴───┴────────────┴────────────┴───────┴────────────┘

invalid rows sample:
These rows were separated so bad input does not become trusted data.
shape: (2, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ invalid_reason         │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
│ 15723      ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘

5. Manejar los valores faltantes con cuidado

  use data_pipeline::quality_flow::handle_missing_values_carefully;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 5. Handle missing values carefully
      handle_missing_values_carefully(df)?;

      Ok(())
  }

  cargo run --bin handle_missing_values_carefully

============================================================
5. Handle missing values carefully
============================================================

missing-value decision summary:
This separates invalid data, missing data, gap candidates, and flagged observed values.
shape: (7, 3)
┌────────────────────────────────┬───────┬─────────────────────────────────┐
│ case                           ┆ count ┆ decision                        │
│ ---                            ┆ ---   ┆ ---                             │
│ str                            ┆ i64   ┆ str                             │
╞════════════════════════════════╪═══════╪═════════════════════════════════╡
│ invalid or impossible rows     ┆ 2     ┆ quarantine                      │
│ missing critical fields        ┆ 0     ┆ reject row                      │
│ missing measurement values     ┆ 0     ┆ keep NULL unless safe to estim… │
│ small time-series gaps         ┆ 45165 ┆ candidate for interpolation af… │
│ large time-series gaps         ┆ 14179 ┆ keep missing                    │
│ suspicious but possible values ┆ 10677 ┆ keep observed value with quali… │
│ values filled automatically    ┆ 0     ┆ none; filling is not automatic  │
└────────────────────────────────┴───────┴─────────────────────────────────┘

quarantined row sample:
These rows are not filled because a critical field or measurement value is missing or invalid.
shape: (2, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ decision               │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
│ 15723      ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ negative               │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ concentration-like me… │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘

time-series gap examples:
These are observed gaps; small gaps may be interpolated only after review.
shape: (20, 6)
┌──────────┬──────────────────┬────────────┬────────────┬────────────────┬──────────────────────┐
│ location ┆ parameter_code   ┆ from_date  ┆ to_date    ┆ missing_months ┆ decision             │
│ ---      ┆ ---              ┆ ---        ┆ ---        ┆ ---            ┆ ---                  │
│ str      ┆ str              ┆ str        ┆ str        ┆ i64            ┆ str                  │
╞══════════╪══════════════════╪════════════╪════════════╪════════════════╪══════════════════════╡
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2007-09-01 ┆ 2008-01-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2008-12-01 ┆ 2009-04-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-04-01 ┆ 2009-06-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-06-01 ┆ 2009-08-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ ALKALINITY-TOTAL ┆ 2009-08-01 ┆ 2009-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ …        ┆ …                ┆ …          ┆ …          ┆ …              ┆ …                    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-06-01 ┆ 2009-08-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-08-01 ┆ 2009-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2009-10-01 ┆ 2010-03-01 ┆ 4              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2010-03-01 ┆ 2010-07-01 ┆ 3              ┆ keep missing; gap is │
│          ┆                  ┆            ┆            ┆                ┆ too large            │
│ ALLUA    ┆ AMMONIA-TOTAL    ┆ 2010-08-01 ┆ 2010-10-01 ┆ 1              ┆ candidate for        │
│          ┆                  ┆            ┆            ┆                ┆ interpolation af…    │
└──────────┴──────────────────┴────────────┴────────────┴────────────────┴──────────────────────┘

quality-flagged sample:
These observed values are kept, but marked because they need caution.
shape: (10, 6)
┌──────────┬─────────────┬─────────────────┬───────┬───────────────────────┬───────────────────────┐
│ location ┆ sample_date ┆ parameter_code  ┆ value ┆ quality_flag          ┆ missing_decision      │
│ ---      ┆ ---         ┆ ---             ┆ ---   ┆ ---                   ┆ ---                   │
│ str      ┆ str         ┆ str             ┆ f64   ┆ str                   ┆ str                   │
╞══════════╪═════════════╪═════════════════╪═══════╪═══════════════════════╪═══════════════════════╡
│ ALLUA    ┆ 2007-09-01  ┆ AMMONIA-TOTAL   ┆ 0.066 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.069 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ ORTHO-PHOSPHATE ┆ 0.005 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.068 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-01-01  ┆ AMMONIA-TOTAL   ┆ 0.067 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.133 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.111 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-02-01  ┆ AMMONIA-TOTAL   ┆ 0.113 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-03-01  ┆ AMMONIA-TOTAL   ┆ 0.04  ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
│ ALLUA    ┆ 2008-03-01  ┆ ORTHO-PHOSPHATE ┆ 0.005 ┆ suspicious_possible_o ┆ keep observed value   │
│          ┆             ┆                 ┆       ┆ utlier                ┆ with quali…           │
└──────────┴─────────────┴─────────────────┴───────┴───────────────────────┴───────────────────────┘

principle:
This is the rule for deciding whether a missing value should be filled.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Filling data is a decision, no… │
└─────────────────────────────────┘

handled data summary:
This confirms the row counts after applying the missing-value decisions.
shape: (3, 2)
┌────────────────────┬────────┐
│ metric             ┆ value  │
│ ---                ┆ ---    │
│ str                ┆ i64    │
╞════════════════════╪════════╡
│ handled_rows       ┆ 148372 │
│ quarantined_rows   ┆ 2      │
│ duplicates_removed ┆ 172375 │
└────────────────────┴────────┘

6. Validar antes de almacenar

  use data_pipeline::quality_flow::validate_before_storing;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 6. Validate before storing
      validate_before_storing(df)?;

      Ok(())
  }

  cargo run --bin validate_before_storing

============================================================
6. Validate before storing
============================================================

validation rule summary:
This shows the storage rules, how many records were checked, and what failed.
shape: (7, 4)
┌─────────────────────────────────┬─────────┬────────┬─────────────────────────────────┐
│ rule                            ┆ checked ┆ failed ┆ action                          │
│ ---                             ┆ ---     ┆ ---    ┆ ---                             │
│ str                             ┆ i64     ┆ i64    ┆ str                             │
╞═════════════════════════════════╪═════════╪════════╪═════════════════════════════════╡
│ every measurement has a locati… ┆ 320749  ┆ 0      ┆ reject missing locations        │
│ every measurement has a parame… ┆ 320749  ┆ 0      ┆ reject missing parameters       │
│ every measurement has a date    ┆ 320749  ┆ 0      ┆ reject missing or invalid date… │
│ value is a valid number or exp… ┆ 320749  ┆ 0      ┆ store numeric values; store mi… │
│ known parameter respects range  ┆ 320749  ┆ 20     ┆ reject impossible values for k… │
│ exact duplicate records handle… ┆ 320729  ┆ 172369 ┆ remove exact duplicates         │
│ repeated measurements handled … ┆ 148360  ┆ 31854  ┆ keep with source_row so repeat… │
└─────────────────────────────────┴─────────┴────────┴─────────────────────────────────┘

records rejected before storage:
These rows failed validation and should not be inserted into trusted tables.
shape: (10, 6)
┌────────────┬──────────────┬──────────┬───────────────────────┬───────────┬───────────────────────┐
│ source_row ┆ location     ┆ raw_date ┆ source_column         ┆ raw_value ┆ rule_failed           │
│ ---        ┆ ---          ┆ ---      ┆ ---                   ┆ ---       ┆ ---                   │
│ i64        ┆ str          ┆ str      ┆ str                   ┆ str       ┆ str                   │
╞════════════╪══════════════╪══════════╪═══════════════════════╪═══════════╪═══════════════════════╡
│ 111        ┆ ASKANAGAP    ┆ Jan      ┆ ortho-Phosphate (as   ┆ -0.004    ┆ known parameter range │
│            ┆ STREAM_010   ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 2003       ┆ CAMCOR_020   ┆ Feb      ┆ Temperature           ┆ 58.0      ┆ known parameter range │
│            ┆              ┆          ┆                       ┆           ┆ failed (…             │
│ 4813       ┆ DARGLE_030   ┆ Jan      ┆ ortho-Phosphate (as   ┆ 42.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4815       ┆ DARGLE_030   ┆ Feb      ┆ ortho-Phosphate (as   ┆ 22.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4857       ┆ DARGLE_030   ┆ Jul      ┆ ortho-Phosphate (as   ┆ 70.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4873       ┆ DARGLE_030   ┆ May      ┆ ortho-Phosphate (as   ┆ 29.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4893       ┆ DARGLE_030   ┆ Mar      ┆ ortho-Phosphate (as   ┆ 26.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 4903       ┆ DARGLE_030   ┆ Sep      ┆ ortho-Phosphate (as   ┆ 25.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 6096       ┆ GLENCREE_010 ┆ Feb      ┆ ortho-Phosphate (as   ┆ 27.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
│ 6117       ┆ GLENCREE_010 ┆ Jul      ┆ ortho-Phosphate (as   ┆ 27.0      ┆ known parameter range │
│            ┆              ┆          ┆ P) - unspe…           ┆           ┆ failed (…             │
└────────────┴──────────────┴──────────┴───────────────────────┴───────────┴───────────────────────┘

duplicate handling sample:
These exact duplicates are handled deliberately before storage.
shape: (10, 6)
┌────────────┬──────────┬─────────────┬──────────────────┬───────┬─────────────────────────────────┐
│ source_row ┆ location ┆ sample_date ┆ parameter_code   ┆ value ┆ action                          │
│ ---        ┆ ---      ┆ ---         ┆ ---              ┆ ---   ┆ ---                             │
│ i64        ┆ str      ┆ str         ┆ str              ┆ f64   ┆ str                             │
╞════════════╪══════════╪═════════════╪══════════════════╪═══════╪═════════════════════════════════╡
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ AMMONIA-TOTAL    ┆ 0.033 ┆ skip exact duplicate before st… │
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 3          ┆ ALLUA    ┆ 2007-08-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ AMMONIA-TOTAL    ┆ 0.033 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
│ 4          ┆ ALLUA    ┆ 2007-08-01  ┆ TEMPERATURE      ┆ 17.8  ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ ALKALINITY-TOTAL ┆ 19.0  ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ BOD_-_5_DAYS     ┆ 1.2   ┆ skip exact duplicate before st… │
│ 6          ┆ ALLUA    ┆ 2007-09-01  ┆ ORTHO-PHOSPHATE  ┆ 0.019 ┆ skip exact duplicate before st… │
└────────────┴──────────┴─────────────┴──────────────────┴───────┴─────────────────────────────────┘

trusted records sample:
These records passed validation and are shaped for database insertion.
shape: (5, 8)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬───────┬────────────┐
│ source_row ┆ location   ┆ sample_dat ┆ parameter  ┆ parameter_ ┆ unit       ┆ value ┆ source_col │
│ ---        ┆ ---        ┆ e          ┆ ---        ┆ code       ┆ ---        ┆ ---   ┆ umn        │
│ i64        ┆ str        ┆ ---        ┆ str        ┆ ---        ┆ str        ┆ f64   ┆ ---        │
│            ┆            ┆ str        ┆            ┆ str        ┆            ┆       ┆ str        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪════════════╡
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Alkalinity ┆ ALKALINITY ┆ as CaCO3   ┆ 314.0 ┆ Alkalinity │
│            ┆ 010        ┆            ┆ -total     ┆ -TOTAL     ┆            ┆       ┆ -total (as │
│            ┆            ┆            ┆            ┆            ┆            ┆       ┆ CaCO3)     │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Ammonia-To ┆ AMMONIA-TO ┆ as N       ┆ 0.033 ┆ Ammonia-To │
│            ┆ 010        ┆            ┆ tal        ┆ TAL        ┆            ┆       ┆ tal (as N) │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ BOD - 5    ┆ BOD_-_5_DA ┆ Total      ┆ 1.2   ┆ BOD - 5    │
│            ┆ 010        ┆            ┆ days       ┆ YS         ┆            ┆       ┆ days       │
│            ┆            ┆            ┆            ┆            ┆            ┆       ┆ (Total)    │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Chloride   ┆ CHLORIDE   ┆ not_encode ┆ 27.3  ┆ Chloride   │
│            ┆ 010        ┆            ┆            ┆            ┆ d          ┆       ┆            │
│ 1          ┆ ABBEYTOWN_ ┆ 2023-02-01 ┆ Conductivi ┆ CONDUCTIVI ┆ @25°C      ┆ 711.0 ┆ Conductivi │
│            ┆ 010        ┆            ┆ ty         ┆ TY         ┆            ┆       ┆ ty @25°C   │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴───────┴────────────┘

storage readiness summary:
This is the final count of clean records, rejected records, NULLs, and handled duplicates.
shape: (6, 2)
┌─────────────────────────────────┬────────┐
│ metric                          ┆ value  │
│ ---                             ┆ ---    │
│ str                             ┆ i64    │
╞═════════════════════════════════╪════════╡
│ raw_measurement_rows            ┆ 320749 │
│ trusted_records_ready_to_store  ┆ 148360 │
│ records_rejected                ┆ 20     │
│ explicit_null_values            ┆ 0      │
│ exact_duplicates_removed        ┆ 172369 │
│ repeated_measurements_kept_wit… ┆ 31854  │
└─────────────────────────────────┴────────┘

principle:
This is the rule for deciding what is safe to store.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ The database should store clea… │
└─────────────────────────────────┘

7. Almacenar datos limpios con estructura

  use data_pipeline::quality_flow::store_clean_data_with_structure;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 7. Store clean data with structure
      store_clean_data_with_structure(df)?;

      Ok(())
  }

  cargo run --bin store_clean_data_with_structure

============================================================
7. Store clean data with structure
============================================================

database file:
This is the SQLite file written for the API project.
shape: (1, 2)
┌─────────────┬─────────────────────────────────┐
│ item        ┆ value                           │
│ ---         ┆ ---                             │
│ str         ┆ str                             │
╞═════════════╪═════════════════════════════════╡
│ sqlite_file ┆ /Users/chiefkemist/Documents/n… │
└─────────────┴─────────────────────────────────┘

structured schema:
The cleaned data is stored across small tables instead of one giant messy table.
shape: (5, 2)
┌────────────────┬─────────────────────────────────┐
│ table          ┆ purpose                         │
│ ---            ┆ ---                             │
│ str            ┆ str                             │
╞════════════════╪═════════════════════════════════╡
│ locations      ┆ one row per normalized locatio… │
│ parameters     ┆ one row per normalized paramet… │
│ measurements   ┆ trusted observed measurements   │
│ ingestion_runs ┆ source file, import time, coun… │
│ rejected_rows  ┆ rows that failed validation or… │
└────────────────┴─────────────────────────────────┘

ingestion run summary:
This records where the data came from and what happened during import.
shape: (6, 2)
┌──────────────────────────┬────────┐
│ metric                   ┆ value  │
│ ---                      ┆ ---    │
│ str                      ┆ i64    │
╞══════════════════════════╪════════╡
│ ingestion_run_id         ┆ 1      │
│ raw_rows                 ┆ 29159  │
│ raw_measurement_rows     ┆ 320749 │
│ accepted_measurements    ┆ 148360 │
│ rejected_rows            ┆ 20     │
│ exact_duplicates_removed ┆ 172369 │
└──────────────────────────┴────────┘

database table counts:
These counts are read back from SQLite after the write finishes.
shape: (5, 2)
┌────────────────┬────────┐
│ table          ┆ rows   │
│ ---            ┆ ---    │
│ str            ┆ i64    │
╞════════════════╪════════╡
│ ingestion_runs ┆ 1      │
│ locations      ┆ 160    │
│ parameters     ┆ 11     │
│ measurements   ┆ 148360 │
│ rejected_rows  ┆ 20     │
└────────────────┴────────┘

stored measurement sample:
These accepted rows are stored in the measurements table with foreign keys.
shape: (5, 6)
┌────────────┬───────────────┬─────────────┬──────────────────┬─────────────┬───────┐
│ source_row ┆ location      ┆ sample_date ┆ parameter_code   ┆ unit        ┆ value │
│ ---        ┆ ---           ┆ ---         ┆ ---              ┆ ---         ┆ ---   │
│ i64        ┆ str           ┆ str         ┆ str              ┆ str         ┆ f64   │
╞════════════╪═══════════════╪═════════════╪══════════════════╪═════════════╪═══════╡
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ ALKALINITY-TOTAL ┆ as CaCO3    ┆ 314.0 │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ AMMONIA-TOTAL    ┆ as N        ┆ 0.033 │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ BOD_-_5_DAYS     ┆ Total       ┆ 1.2   │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ CHLORIDE         ┆ not_encoded ┆ 27.3  │
│ 1          ┆ ABBEYTOWN_010 ┆ 2023-02-01  ┆ CONDUCTIVITY     ┆ @25°C       ┆ 711.0 │
└────────────┴───────────────┴─────────────┴──────────────────┴─────────────┴───────┘

rejected row sample:
These failed rows are stored separately for traceability.
shape: (5, 6)
┌────────────┬────────────┬──────────┬────────────────────────┬───────────┬────────────────────────┐
│ source_row ┆ location   ┆ raw_date ┆ source_column          ┆ raw_value ┆ rejection_reason       │
│ ---        ┆ ---        ┆ ---      ┆ ---                    ┆ ---       ┆ ---                    │
│ i64        ┆ str        ┆ str      ┆ str                    ┆ str       ┆ str                    │
╞════════════╪════════════╪══════════╪════════════════════════╪═══════════╪════════════════════════╡
│ 111        ┆ ASKANAGAP  ┆ Jan      ┆ ortho-Phosphate (as P) ┆ -0.004    ┆ known parameter range  │
│            ┆ STREAM_010 ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 2003       ┆ CAMCOR_020 ┆ Feb      ┆ Temperature            ┆ 58.0      ┆ known parameter range  │
│            ┆            ┆          ┆                        ┆           ┆ failed (…              │
│ 4813       ┆ DARGLE_030 ┆ Jan      ┆ ortho-Phosphate (as P) ┆ 42.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 4815       ┆ DARGLE_030 ┆ Feb      ┆ ortho-Phosphate (as P) ┆ 22.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
│ 4857       ┆ DARGLE_030 ┆ Jul      ┆ ortho-Phosphate (as P) ┆ 70.0      ┆ known parameter range  │
│            ┆            ┆          ┆ - unspe…               ┆           ┆ failed (…              │
└────────────┴────────────┴──────────┴────────────────────────┴───────────┴────────────────────────┘

principle:
This is the reason for storing accepted rows, rejected rows, and ingestion metadata.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ The database is part of the da… │
└─────────────────────────────────┘

8. Visualizar después de limpiar

  use data_pipeline::quality_flow::visualize_after_cleaning;
  use polars::{
      error::PolarsResult,
      io::{
          SerReader,
          csv::read::{CsvParseOptions, CsvReadOptions},
      },
  };

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: scan the file because we do not know columns yet.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      // 8. Visualize after cleaning
      visualize_after_cleaning(df)?;

      Ok(())
  }

  cargo run --bin visualize_after_cleaning

============================================================
8. Visualize after cleaning
============================================================

dashboard handoff:
The dashboard reads the cleaned SQLite database produced by the storage step.
shape: (5, 2)
┌────────────────────┬─────────────────────────────────┐
│ item               ┆ value                           │
│ ---                ┆ ---                             │
│ str                ┆ str                             │
╞════════════════════╪═════════════════════════════════╡
│ raw_rows_available ┆ 29159                           │
│ sqlite_file        ┆ /Users/chiefkemist/Documents/n… │
│ dashboard_page     ┆ http://localhost:3434/data_viz  │
│ summary_json       ┆ http://localhost:3434/api/dash… │
│ timeseries_json    ┆ http://localhost:3434/api/dash… │
└────────────────────┴─────────────────────────────────┘

dashboard views:
These views turn the cleaned records into visual checks for patterns, gaps, and problems.
shape: (9, 2)
┌─────────────────────────────────┬─────────────────────────────────┐
│ view                            ┆ source                          │
│ ---                             ┆ ---                             │
│ str                             ┆ str                             │
╞═════════════════════════════════╪═════════════════════════════════╡
│ pH over time by location        ┆ measurements joined with locat… │
│ temperature over time           ┆ measurements joined with locat… │
│ dissolved oxygen over time      ┆ measurements joined with locat… │
│ ammonia spikes by location      ┆ measurements joined with locat… │
│ missing-data heatmap            ┆ measurement coverage by locati… │
│ outlier count by parameter      ┆ rejected_rows grouped by sourc… │
│ data completeness by location   ┆ measurements grouped by locati… │
│ before/after cleaning summary   ┆ ingestion_runs accepted and re… │
│ water-quality score by locatio… ┆ aggregated pH, dissolved oxyge… │
└─────────────────────────────────┴─────────────────────────────────┘

principle:
Visualization is the final check that the pipeline produced useful data.
shape: (1, 1)
┌─────────────────────────────────┐
│ principle                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Understand, clean, validate, s… │
└─────────────────────────────────┘

Notas al pie

Conjunto de datos de monitoreo de calidad del agua (Irlanda)

Rust: Un lenguaje que empodera a todos para construir software confiable y eficiente.

Polars: DataFrames para la nueva era

Ubuntu TechHive

Rust Data Pipelines: From Files to Clean Databases and Web Dashboards

Pipelines de datos en Rust: De archivos a bases de datos limpias y paneles web

Introducción

Canalización de datos

Sobre el conjunto de datos utilizado

Herramientas y bibliotecas

DataFrame

Selección de columnas

Adición de columnas

Expansión de expresiones

Filtrado de filas

Agrupación por

Análisis de datos

1. Inspeccionar los datos en bruto:

2. Perfilar los datos

3. Identificar problemas de calidad de datos

4. Limpiar y normalizar los datos

5. Manejar los valores faltantes con cuidado

6. Validar antes de almacenar

7. Almacenar datos limpios con estructura

8. Visualizar después de limpiar

Notas al pie

Todos los artículos

Rust and Data Processing with Polars

Pipelines de datos en Rust: De archivos a bases de datos limpias y paneles web

Introducción

Canalización de datos

Sobre el conjunto de datos utilizado

Herramientas y bibliotecas

DataFrame

Selección de columnas

Adición de columnas

Expansión de expresiones

Filtrado de filas

Agrupación por

Análisis de datos

1. Inspeccionar los datos en bruto:

2. Perfilar los datos

3. Identificar problemas de calidad de datos

4. Limpiar y normalizar los datos

5. Manejar los valores faltantes con cuidado

6. Validar antes de almacenar

7. Almacenar datos limpios con estructura

8. Visualizar después de limpiar

Notas al pie

Etiquetas

Todos los artículos

Rust and Data Processing with Polars