Rust Data Pipelines: From Files to Clean Databases and Web Dashboards

Introduction
Data Pipeline
Footnotes

Introduction

We are building a small environmental data pipeline. Raw water-quality monitoring files arrive as CSV. Our Rust tool validates them, cleans bad records, fills gaps where it is safe to do so, stores the trusted measurements, and feeds a dashboard.

Data Pipeline

About the dataset

The dataset¹ contains raw water-quality monitoring data from Cork Harbour, Moy Killala, and 15 other coastal locations in Ireland. The raw extracted dataset has more than 1.27 million entries, and the repository also includes a transformed (pivoted) version with 29,159 rows covering 11 water-quality parameters. The files are CSV, so they fit the "files → clean database → dashboard" flow well.

Tools and libraries

We implement our data pipeline in Rust², using Polars³.

DataFrame

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{df, error::PolarsError, frame::DataFrame};

  fn main() -> Result<(), PolarsError> {
      // Build a small in-memory DataFrame; each key becomes a typed column.
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;
      println!("Data:");
      println!("{df}");

      // head(Some(2)) returns the first two rows as a new DataFrame.
      let head = df.head(Some(2));
      println!("Head:");
      println!("{head}");

      Ok(())
  }

Data:
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
Head:
shape: (2, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└──────────────┴────────────┴────────┴────────┘

Selecting columns

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // lazy() consumes the DataFrame, so clone it to keep df usable afterwards.
      // select keeps only the listed expressions as output columns.
      let result = df
          .clone()
          .lazy()
          .select([
              col("name"),
              col("birthdate").dt().year().alias("birth_year"),
              (col("weight") / col("height").pow(2)).alias("bmi"),
          ])
          .collect()?;
      println!("Column selection:");
      println!("{result}");

      Ok(())
  }

Column selection:
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘
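
As a sanity check on the bmi column, the same arithmetic can be reproduced without Polars. This is a minimal plain-Rust sketch (the bmi helper is a hypothetical name, not part of the article's code) of what the expression col("weight") / col("height").pow(2) computes for each row.

```rust
/// Body-mass index: weight in kilograms divided by height in metres squared.
fn bmi(weight_kg: f64, height_m: f64) -> f64 {
    weight_kg / height_m.powi(2)
}

fn main() {
    // Same weights (kg) and heights (m) as the DataFrame above.
    let people = [
        ("Alice Archer", 57.9, 1.56),
        ("Ben Brown", 72.5, 1.77),
        ("Chloe Cooper", 54.6, 1.65),
        ("Daniel Donovan", 83.1, 1.75),
    ];
    for (name, weight, height) in people {
        // Six decimals, to compare against the bmi column printed by Polars.
        println!("{name}: {:.6}", bmi(weight, height));
    }
}
```

The printed values should agree with the bmi column above to the displayed precision.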

Adding columns

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };

  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // Unlike select, with_columns keeps the existing columns and appends the new ones.
      let result = df
          .clone()
          .lazy()
          .with_columns([
              col("birthdate").dt().year().alias("birth_year"),
              (col("weight") / col("height").pow(2)).alias("bmi"),
          ])
          .collect()?;
      println!("With added columns:");
      println!("{result}");

      Ok(())
  }

With added columns:
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘

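Expression expansion

lit stands for literal; it is part of the expression API that Polars³ provides under its lazy feature. In the example below, cols selects two columns at once, so a single expression is applied to both, and lit(0.95) wraps the constant multiplier.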

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, cols, lit, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // One expression, expanded over both selected columns:
      // scale by 0.95, round to two decimals, and suffix the column name.
      let result = df
          .clone()
          .lazy()
          .select([
              col("name"),
              (cols(["weight", "height"]).as_expr() * lit(0.95))
                  .round(2, RoundMode::default())
                  .name()
                  .suffix("-5%"),
          ])
          .collect()?;
      println!("Transform:");
      println!("{result}");

      Ok(())
  }

Transform:
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 51.87     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘
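
Conceptually, expression expansion applies one expression to each selected column and derives the output names from the input names. The sketch below imitates that idea in plain Rust. It is an illustration only, not how Polars implements expansion, and expand and round2 are hypothetical helpers. Because round2 uses simple scaling, a value that sits exactly on a rounding boundary may differ from the Polars output in the last digit.

```rust
/// Round to two decimal places (simple scaling; adequate for this illustration).
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}

/// Apply `f` to every value of every selected column and
/// append `suffix` to each column name.
fn expand(
    columns: &[(&str, Vec<f64>)],
    suffix: &str,
    f: impl Fn(f64) -> f64,
) -> Vec<(String, Vec<f64>)> {
    columns
        .iter()
        .map(|(name, values)| {
            let new_values: Vec<f64> = values.iter().map(|&v| f(v)).collect();
            (format!("{name}{suffix}"), new_values)
        })
        .collect()
}

fn main() {
    let selected = [
        ("weight", vec![57.9, 72.5, 54.6, 83.1]),
        ("height", vec![1.56, 1.77, 1.65, 1.75]),
    ];
    // One closure, applied to both columns.
    let reduced = expand(&selected, "-5%", |v| round2(v * 0.95));
    for (name, values) in &reduced {
        println!("{name}: {values:?}");
    }
}
```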

Filtering rows

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{IntoLazy, col, lit, ClosedInterval},
  };


  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // Keep only the people born before 1990.
      let result = df
          .clone()
          .lazy()
          .filter(col("birthdate").dt().year().lt(lit(1990)))
          .collect()?;
      println!("With row filtering:");
      println!("{result}");

      // Combine conditions: birthdate inside an inclusive range AND height above 1.7 m.
      let result = df
          .clone()
          .lazy()
          .filter(
              col("birthdate")
                  .is_between(
                      lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                      lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                      ClosedInterval::Both,
                  )
                  .and(col("height").gt(lit(1.7))),
          )
          .collect()?;
      println!("With complex row filtering:");
      println!("{result}");

      Ok(())
  }

With row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
With complex row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
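
The second filter combines an inclusive date range with a height threshold. A minimal plain-Rust sketch of the same predicate follows, assuming dates are stored as (year, month, day) tuples so that tuple ordering matches chronological ordering. keep_row is a hypothetical helper, not part of the article's code.

```rust
/// (year, month, day); tuples compare field by field, so ordering is chronological.
type Date = (i32, u32, u32);

/// Mirrors is_between(lower, upper, ClosedInterval::Both).and(height > 1.7).
fn keep_row(birthdate: Date, height_m: f64) -> bool {
    let lower: Date = (1982, 12, 31);
    let upper: Date = (1996, 1, 1);
    // ..= makes both bounds inclusive, like ClosedInterval::Both.
    (lower..=upper).contains(&birthdate) && height_m > 1.7
}

fn main() {
    let people = [
        ("Alice Archer", (1997, 1, 10), 1.56),
        ("Ben Brown", (1985, 2, 15), 1.77),
        ("Chloe Cooper", (1997, 3, 22), 1.65),
        ("Daniel Donovan", (1997, 4, 30), 1.75),
    ];
    for (name, birthdate, height) in people {
        if keep_row(birthdate, height) {
            println!("{name}");
        }
    }
}
```

Daniel Donovan is taller than 1.7 m but was born outside the range, so he is dropped: both conditions must hold.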

Grouping

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, lit, len, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let df: DataFrame = df!(
          "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
          "birthdate" => [
              NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
              NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
              NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
              NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
          ],
          "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
          "height" => [1.56, 1.77, 1.65, 1.75], // (m)
      )?;

      // Integer-divide the birth year by 10 and multiply back to get the decade,
      // then count the rows in each group.
      let result = df
          .clone()
          .lazy()
          .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
          .agg([len()])
          .collect()?;
      println!("Grouping by birth decade:");
      println!("{result}");

      // Same grouping, with several aggregations per group.
      let result = df
          .clone()
          .lazy()
          .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
          .agg([
              len().alias("sample_size"),
              col("weight")
                  .mean()
                  .round(2, RoundMode::default())
                  .alias("avg_weight"),
              col("height").max().alias("tallest"),
          ])
          .collect()?;
      println!("Grouping by derived features:");
      println!("{result}");

      Ok(())
  }

Grouping by birth decade:
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 3   │
│ 1980   ┆ 1   │
└────────┴─────┘
Grouping by derived features:
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1980   ┆ 1           ┆ 72.5       ┆ 1.77    │
│ 1990   ┆ 3           ┆ 65.2       ┆ 1.75    │
└────────┴─────────────┴────────────┴─────────┘
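
The grouping key relies on integer division: 1997 / 10 is 199, and 199 * 10 is 1990. The sketch below reproduces the decade bucketing and the three aggregations in plain Rust as an illustration of the logic only; decade and summarize are hypothetical helpers. It uses a BTreeMap, so its groups come out sorted by decade, whereas the two Polars outputs above list the groups in different orders.

```rust
use std::collections::BTreeMap;

/// Integer division drops the last digit; multiplying by 10 restores the scale.
fn decade(year: i32) -> i32 {
    year / 10 * 10
}

/// Per-decade (sample_size, avg_weight, tallest) from (birth_year, weight, height) rows.
fn summarize(rows: &[(i32, f64, f64)]) -> BTreeMap<i32, (usize, f64, f64)> {
    // Accumulate the count, the weight sum, and the maximum height per decade.
    let mut acc: BTreeMap<i32, (usize, f64, f64)> = BTreeMap::new();
    for &(birth_year, weight, height) in rows {
        let entry = acc.entry(decade(birth_year)).or_insert((0, 0.0, f64::MIN));
        entry.0 += 1;
        entry.1 += weight;
        entry.2 = entry.2.max(height);
    }
    // Turn each weight sum into a mean.
    acc.into_iter()
        .map(|(d, (n, sum, tallest))| (d, (n, sum / n as f64, tallest)))
        .collect()
}

fn main() {
    let rows = [
        (1997, 57.9, 1.56),
        (1985, 72.5, 1.77),
        (1997, 54.6, 1.65),
        (1997, 83.1, 1.75),
    ];
    for (d, (n, avg_weight, tallest)) in summarize(&rows) {
        println!("{d}: n={n}, avg_weight={avg_weight:.2}, tallest={tallest}");
    }
}
```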

Data Analysis

When we receive a new dataset, the goal is not to build charts or run models right away. The first goal is to find out whether the data can be trusted.

Inspect the raw data:

Download the data, load it with Polars³, and print it.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

rust-script failed with exit code 1

[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))
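
The error comes from schema inference looking at only a prefix of the file. The toy model below is a plain-Rust illustration of that failure mode, not Polars' actual inference code; Dtype, infer_dtype, and the sample values are hypothetical.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Dtype {
    Int,
    Float,
    Text,
}

/// Infer a column type from at most `limit` leading values
/// (None means: look at every value).
fn infer_dtype(values: &[&str], limit: Option<usize>) -> Dtype {
    let n = limit.map_or(values.len(), |l| l.min(values.len()));
    let sample = &values[..n];
    if sample.iter().all(|v| v.parse::<i64>().is_ok()) {
        Dtype::Int
    } else if sample.iter().all(|v| v.parse::<f64>().is_ok()) {
        Dtype::Float
    } else {
        Dtype::Text
    }
}

fn main() {
    // Integers first, a decimal later: the same shape as the failing column.
    let column = ["314", "14", "17", "18", "50.5"];

    // A short prefix sees only integers, so the later "50.5" cannot be parsed.
    let guessed = infer_dtype(&column, Some(4));
    println!("prefix of 4 -> {guessed:?}");
    println!("can parse 50.5 as i64? {}", "50.5".parse::<i64>().is_ok());

    // Looking at every value finds the decimal and picks a float type.
    println!("whole column -> {:?}", infer_dtype(&column, None));
}
```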

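Polars³ is guessing the type of some columns incorrectly. By default it infers the schema from only the first 100 rows, so a column whose early values all look like integers is typed i64 and then fails on a later value such as 50.5. Passing None to with_infer_schema_length makes it scan the whole file before deciding.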

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(None)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

Scanning the whole file works, but a bounded sample is enough here. Now let's have Polars³ infer the column types from the first 10,000 rows instead; the resulting schema is the same.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }

shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

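With the types settled, we hand the DataFrame to an inspection helper, inspect_raw_data (its body is not shown here), which prints the row and column counts, the column types, one raw row, and a preview of the data in long format.

  use polars::prelude::{CsvParseOptions, CsvReadOptions, PolarsResult, SerReader};

  // inspect_raw_data is the article's inspection helper; its definition is not shown here.

  fn main() -> PolarsResult<()> {
      let df = CsvReadOptions::default()
          .with_has_header(true)
          // Discovery step: we do not know the column types yet,
          // so sample up to 10,000 rows for schema inference.
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;

      inspect_raw_data(df)?;

      Ok(())
  }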

rows: 29159
columns: 14

columns and types:
WaterbodyName: String (text or mixed)
Years: Int64 (number)
SampleDate: String (text or mixed)
Alkalinity-total (as CaCO3): Float64 (number)
Ammonia-Total (as N): Float64 (number)
BOD - 5 days (Total): Float64 (number)
Chloride: Float64 (number)
Conductivity @25°C: Float64 (number)
Dissolved Oxygen: Float64 (number)
ortho-Phosphate (as P) - unspecified: Float64 (number)
pH: Float64 (number)
Temperature: Float64 (number)
Total Hardness (as CaCO3): Float64 (number)
True Colour: Float64 (number)

one raw row:
shape: (1, 14)
┌───────────────┬───────┬────────────┬─────────────────────┬───┬─────┬─────────────┬────────────────────┬─────────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ Alkalinity-total    ┆ … ┆ pH  ┆ Temperature ┆ Total Hardness (as ┆ True Colour │
│ ---           ┆ ---   ┆ ---        ┆ (as CaCO3)          ┆   ┆ --- ┆ ---         ┆ CaCO3)             ┆ ---         │
│ str           ┆ i64   ┆ str        ┆ ---                 ┆   ┆ f64 ┆ f64         ┆ ---                ┆ f64         │
│               ┆       ┆            ┆ f64                 ┆   ┆     ┆             ┆ f64                ┆             │
╞═══════════════╪═══════╪════════════╪═════════════════════╪═══╪═════╪═════════════╪════════════════════╪═════════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ 314.0               ┆ … ┆ 7.8 ┆ 10.4        ┆ 370.0              ┆ 24.0        │
└───────────────┴───────┴────────────┴─────────────────────┴───┴─────┴─────────────┴────────────────────┴─────────────┘

location/date columns: ["WaterbodyName", "Years", "SampleDate"]
measurement columns: ["Alkalinity-total (as CaCO3)", "Ammonia-Total (as N)", "BOD - 5 days (Total)", "Chloride", "Conductivity @25°C", "Dissolved Oxygen", "ortho-Phosphate (as P) - unspecified", "pH", "Temperature", "Total Hardness (as CaCO3)", "True Colour"]

long water-quality shape:
shape: (10, 7)
┌───────────────┬───────┬────────────┬─────────────────────────────┬───────────────────┬──────────────────┬──────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column               ┆ measurement_value ┆ parameter        ┆ unit     │
│ ---           ┆ ---   ┆ ---        ┆ ---                         ┆ ---               ┆ ---              ┆ ---      │
│ str           ┆ i64   ┆ str        ┆ str                         ┆ f64               ┆ str              ┆ str      │
╞═══════════════╪═══════╪════════════╪═════════════════════════════╪═══════════════════╪══════════════════╪══════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ Alkalinity-total (as CaCO3) ┆ 314.0             ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 14.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 17.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 8.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 9.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 10.0              ┆ Alkalinity-total ┆ as CaCO3 │
└───────────────┴───────┴────────────┴─────────────────────────────┴───────────────────┴──────────────────┴──────────┘
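
The long shape above has one row per measurement, with the original column name split into a parameter and a unit. The helper that produces it is not shown, so the following is only a plain-Rust sketch of one way to do the reshape. split_parameter_unit and to_long are hypothetical names, and two rules are assumptions: a trailing parenthesised suffix is treated as the unit (inferred from the Alkalinity-total rows), and missing measurements are skipped.

```rust
/// Split "Name (unit)" into ("Name", Some("unit")). Names without a trailing
/// parenthesised suffix are returned unchanged with no unit (an assumption).
fn split_parameter_unit(column: &str) -> (String, Option<String>) {
    if let (Some(open), true) = (column.rfind('('), column.ends_with(')')) {
        let parameter = column[..open].trim_end().to_string();
        let unit = column[open + 1..column.len() - 1].to_string();
        return (parameter, Some(unit));
    }
    (column.to_string(), None)
}

/// One long-format record: (waterbody, source_column, value, parameter, unit).
type LongRow = (String, String, f64, String, Option<String>);

/// Turn one wide row into one long row per non-missing measurement.
fn to_long(waterbody: &str, measurements: &[(&str, Option<f64>)]) -> Vec<LongRow> {
    measurements
        .iter()
        .filter_map(|&(column, value)| {
            let value = value?; // skip missing measurements (an assumption)
            let (parameter, unit) = split_parameter_unit(column);
            Some((waterbody.to_string(), column.to_string(), value, parameter, unit))
        })
        .collect()
}

fn main() {
    // Values taken from the first raw row shown above.
    let wide = [
        ("Alkalinity-total (as CaCO3)", Some(314.0)),
        ("pH", Some(7.8)),
        ("Temperature", Some(10.4)),
    ];
    for row in to_long("ABBEYTOWN_010", &wide) {
        println!("{row:?}");
    }
}
```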

Profile the data

Footnotes

¹ Water Quality Monitoring Dataset (Ireland)

² Rust: A language empowering everyone to build reliable and efficient software.

³ Polars: DataFrames for the new era

Ubuntu TechHive
