Pipelines de données en Rust : des fichiers aux bases de données propres et tableaux de bord web

Introduction
Pipeline de données
Notes de bas de page

Introduction

Nous construisons un petit pipeline de données environnementales. Les fichiers bruts de surveillance de la qualité de l'eau arrivent au format CSV. Notre outil en Rust les valide, nettoie les enregistrements erronés, comble les lacunes de manière sécurisée, stocke les mesures fiables et alimente un tableau de bord.

Pipeline de données

À propos du jeu de données utilisé

Le jeu de données¹ contient des données brutes de surveillance de la qualité de l'eau provenant de Cork Harbour, Moy Killala et 15 autres sites côtiers en Irlande. Le jeu de données brut extrait compte plus de 1,27 million d'entrées, et le dépôt inclut également une version transformée/pivotée avec 29 159 lignes réparties sur 11 paramètres de qualité de l'eau. Les fichiers sont au format CSV, ils sont donc faciles à utiliser pour le flux « fichiers → base de données propre → tableau de bord ».

Outils et bibliothèques

Nous utilisons Rust² pour implémenter notre pipeline de données en tirant parti de Polars³.

DataFrame

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();
        println!("Data:");
        print!("{df}\n");

        let head = df.head(Some(2));
        println!("Head:");
        print!("{head}\n");

      Ok(())
  }

Data:
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
Head:
shape: (2, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└──────────────┴────────────┴────────┴────────┘

Sélection de colonnes

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("Column selection:");
        print!("{result}\n");

      Ok(())
  }

Column selection:
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘

Ajout de colonnes

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{LazyFrame, IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .with_columns([
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("With added colums:");
        print!("{result}\n");

      Ok(())
  }

With added colums:
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘

Expansion d'expression

lit signifie littéral et fait partie de l'API d'expression paresseuse de la fonctionnalité lazy de Polars³.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, cols, lit, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                (cols(["weight", "height"]).as_expr() * lit(0.95))
                    .round(2, RoundMode::default())
                    .name()
                    .suffix("-5%"),
            ])
            .collect()?;
        println!("Transform:");
        print!("{result}\n");

      Ok(())
  }

Transform:
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 51.87     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘

Filtrage de lignes

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{IntoLazy, col, lit, ClosedInterval},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .filter(col("birthdate").dt().year().lt(lit(1990)))
            .collect()?;
        println!("With row filtering:");
        print!("{result}\n");

        let result = df
              .clone()
              .lazy()
              .filter(
                  col("birthdate")
                      .is_between(
                          lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                          lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                          ClosedInterval::Both,
                      )
                      .and(col("height").gt(lit(1.7))),
              )
              .collect()?;
        println!("With complex row filtering:");
        print!("{result}\n");

      Ok(())
  }

With row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
With complex row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘

Regroupement (Group by)

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, lit, len, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([len()])
            .collect()?;
        println!("Grouping by birth decade:");
        print!("{result}\n");

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([
                len().alias("sample_size"),
                col("weight")
                    .mean()
                    .round(2, RoundMode::default())
                    .alias("avg_weight"),
                col("height").max().alias("tallest"),
            ])
            .collect()?;
        println!("Grouping by derived features:");
        println!("{result}");

      Ok(())
  }

Grouping by birth decade:
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 3   │
│ 1980   ┆ 1   │
└────────┴─────┘
Grouping by derived features:
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1980   ┆ 1           ┆ 72.5       ┆ 1.77    │
│ 1990   ┆ 3           ┆ 65.2       ┆ 1.75    │
└────────┴─────────────┴────────────┴─────────┘

Analyse de données

Lorsque nous recevons un nouveau jeu de données, l'objectif n'est pas de construire immédiatement des graphiques ou d'exécuter des modèles. Le premier objectif est de comprendre si les données sont fiables.

Inspecter les données brutes :

Téléchargez les données, chargez-les avec Polars³ puis affichez l'en-tête

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

rust-script failed with exit code 1

[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))

Polars³ ne devine pas correctement le type de certaines colonnes. Laissons-le deviner à partir de 100 lignes par défaut.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(None)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

Laissons maintenant Polars⁴ déduire les types appropriés des colonnes à partir de 10 000 lignes

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

// les imports vont ici

fn main() -> PolarsResult<()> {
    let df = CsvReadOptions::default()
        .with_has_header(true)
        // Étape de découverte : scannez le fichier car nous ne connaissons pas encore les colonnes.
        .with_infer_schema_length(Some(10_000))
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some(
            "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
        ))?
        .finish()?;

    inspect_raw_data(df.clone())?;

    Ok(())
}

lignes : 29159
colonnes : 14

colonnes et types :
WaterbodyName : String (texte ou mixte)
Years : Int64 (nombre)
SampleDate : String (texte ou mixte)
Alkalinity-total (as CaCO3) : Float64 (nombre)
Ammonia-Total (as N) : Float64 (nombre)
BOD - 5 days (Total) : Float64 (nombre)
Chloride : Float64 (nombre)
Conductivity @25°C : Float64 (nombre)
Dissolved Oxygen : Float64 (nombre)
ortho-Phosphate (as P) - unspecified : Float64 (nombre)
pH : Float64 (nombre)
Temperature : Float64 (nombre)
Total Hardness (as CaCO3) : Float64 (nombre)
True Colour : Float64 (nombre)

une ligne brute :
shape: (1, 14)
┌───────────────┬───────┬────────────┬─────────────────────┬───┬─────┬─────────────┬────────────────────┬─────────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ Alkalinity-total    ┆ … ┆ pH  ┆ Temperature ┆ Total Hardness (as ┆ True Colour │
│ ---           ┆ ---   ┆ ---        ┆ (as CaCO3)          ┆   ┆ --- ┆ ---         ┆ CaCO3)             ┆ ---         │
│ str           ┆ i64   ┆ str        ┆ ---                 ┆   ┆ f64 ┆ f64         ┆ ---                ┆ f64         │
│               ┆       ┆            ┆ f64                 ┆   ┆     ┆             ┆ f64                ┆             │
╞═══════════════╪═══════╪════════════╪═════════════════════╪═══╪═════╪═════════════╪════════════════════╪═════════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ 314.0               ┆ … ┆ 7.8 ┆ 10.4        ┆ 370.0              ┆ 24.0        │
└───────────────┴───────┴────────────┴─────────────────────┴───┴─────┴─────────────┴────────────────────┴─────────────┘

colonnes emplacement/date : ["WaterbodyName", "Years", "SampleDate"]
colonnes de mesure : ["Alkalinity-total (as CaCO3)", "Ammonia-Total (as N)", "BOD - 5 days (Total)", "Chloride", "Conductivity @25°C", "Dissolved Oxygen", "ortho-Phosphate (as P) - unspecified", "pH", "Temperature", "Total Hardness (as CaCO3)", "True Colour"]

forme longue de la qualité de l'eau :
shape: (10, 7)
┌───────────────┬───────┬────────────┬─────────────────────────────┬───────────────────┬──────────────────┬──────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column               ┆ measurement_value ┆ parameter        ┆ unit     │
│ ---           ┆ ---   ┆ ---        ┆ ---                         ┆ ---               ┆ ---              ┆ ---      │
│ str           ┆ i64   ┆ str        ┆ str                         ┆ f64               ┆ str              ┆ str      │
╞═══════════════╪═══════╪════════════╪═════════════════════════════╪═══════════════════╪══════════════════╪══════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ Alkalinity-total (as CaCO3) ┆ 314.0             ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 14.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 17.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 8.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 9.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 10.0              ┆ Alkalinity-total ┆ as CaCO3 │
└───────────────┴───────┴────────────┴─────────────────────────────┴───────────────────┴──────────────────┴──────────┘

Profiler les données

Notes de bas de page

Jeu de données de surveillance de la qualité de l'eau (Irlande)

Rust : un langage permettant à chacun de créer des logiciels fiables et efficaces.

Polars : DataFrames pour la nouvelle ère

Ubuntu TechHive

Rust Data Pipelines: From Files to Clean Databases and Web Dashboards

Pipelines de données en Rust : des fichiers aux bases de données propres et tableaux de bord web

Introduction

Pipeline de données

À propos du jeu de données utilisé

Outils et bibliothèques

DataFrame

Sélection de colonnes

Ajout de colonnes

Expansion d'expression

Filtrage de lignes

Regroupement (Group by)

Analyse de données

Inspecter les données brutes :

Profiler les données

Notes de bas de page

Tous les articles

Rust and Data Processing with Polars

Pipelines de données en Rust : des fichiers aux bases de données propres et tableaux de bord web

Introduction

Pipeline de données

À propos du jeu de données utilisé

Outils et bibliothèques

DataFrame

Sélection de colonnes

Ajout de colonnes

Expansion d'expression

Filtrage de lignes

Regroupement (Group by)

Analyse de données

Inspecter les données brutes :

Profiler les données

Notes de bas de page

Étiquettes

Tous les articles

Rust and Data Processing with Polars