Ubuntu TechHive
rust-data-pipelines-from-files-to-clean-databases-and-web-dashboards.md
Rust Data Pipelines: From Files to Clean Databases and Web Dashboards
article.detalhe

Rust Data Pipelines: From Files to Clean Databases and Web Dashboards

reading.progresso 15 min de leitura

Estamos construindo um pequeno pipeline de dados ambientais. Arquivos brutos de monitoramento da qualidade da água chegam em formato CSV.

Pipelines de Dados em Rust: De Arquivos a Bancos de Dados Limpos e Dashboards Web

Introdução

Estamos construindo um pequeno pipeline de dados ambientais. Arquivos brutos de monitoramento da qualidade da água chegam em formato CSV. Nossa ferramenta em Rust os valida, limpa registros incorretos, preenche lacunas seguras, armazena medições confiáveis e alimenta um dashboard.

Pipeline de Dados

Sobre o conjunto de dados utilizado

O conjunto de dados1 contém dados brutos de monitoramento da qualidade da água de Cork Harbour, Moy Killala e outros 15 locais costeiros na Irlanda. O conjunto de dados extraído bruto possui mais de 1,27 milhão de entradas, e o repositório também inclui uma versão transformada/dinamizada com 29.159 linhas em 11 parâmetros de qualidade da água. Os arquivos estão em CSV, portanto são fáceis de usar para o fluxo “arquivos → banco de dados limpo → dashboard”.

Ferramentas e bibliotecas

Usamos Rust2 para implementar nosso Pipeline de Dados aproveitando o Polars3.

DataFrame

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();
        println!("Data:");
        print!("{df}\n");

        let head = df.head(Some(2));
        println!("Head:");
        print!("{head}\n");

      Ok(())
  }
Data:
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
Head:
shape: (2, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└──────────────┴────────────┴────────┴────────┘

Selecionando colunas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("Column selection:");
        print!("{result}\n");

      Ok(())
  }
Column selection:
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘

Adicionando colunas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{LazyFrame, IntoLazy, col},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .with_columns([
                col("birthdate").dt().year().alias("birth_year"),
                (col("weight") / col("height").pow(2)).alias("bmi"),
            ])
            .collect()?;
        println!("With added colums:");
        print!("{result}\n");

      Ok(())
  }
With added colums:
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1997-03-22 ┆ 54.6   ┆ 1.65   ┆ 1997       ┆ 20.055096 │
│ Daniel Donovan ┆ 1997-04-30 ┆ 83.1   ┆ 1.75   ┆ 1997       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘

Expansão de expressão

lit significa literal e é parte da API de expressão lazy do recurso lazy do Polars3.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, cols, lit, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .select([
                col("name"),
                (cols(["weight", "height"]).as_expr() * lit(0.95))
                    .round(2, RoundMode::default())
                    .name()
                    .suffix("-5%"),
            ])
            .collect()?;
        println!("Transform:");
        print!("{result}\n");

      Ok(())
  }
Transform:
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 51.87     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘

Filtrando linhas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::{DataFrame},
      prelude::{IntoLazy, col, lit, ClosedInterval},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .filter(col("birthdate").dt().year().lt(lit(1990)))
            .collect()?;
        println!("With row filtering:");
        print!("{result}\n");

        let result = df
              .clone()
              .lazy()
              .filter(
                  col("birthdate")
                      .is_between(
                          lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                          lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                          ClosedInterval::Both,
                      )
                      .and(col("height").gt(lit(1.7))),
              )
              .collect()?;
        println!("With complex row filtering:");
        print!("{result}\n");

      Ok(())
  }
With row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
With complex row filtering:
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘

Agrupando por

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
  //! ```

  use chrono::NaiveDate;
  use polars::{
      df,
      error::PolarsError,
      frame::DataFrame,
      prelude::{IntoLazy, col, lit, len, RoundMode},
  };


  fn main() -> Result<(), PolarsError> {
      let mut df: DataFrame = df!(
            "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
            "birthdate" => [
                NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
                NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
                NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
                NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
            ],
            "weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
            "height" => [1.56, 1.77, 1.65, 1.75], // (m)
        )
        .unwrap();

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([len()])
            .collect()?;
        println!("Grouping by birth decade:");
        print!("{result}\n");

        let result = df
            .clone()
            .lazy()
            .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
            .agg([
                len().alias("sample_size"),
                col("weight")
                    .mean()
                    .round(2, RoundMode::default())
                    .alias("avg_weight"),
                col("height").max().alias("tallest"),
            ])
            .collect()?;
        println!("Grouping by derived features:");
        println!("{result}");

      Ok(())
  }
Grouping by birth decade:
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 3   │
│ 1980   ┆ 1   │
└────────┴─────┘
Grouping by derived features:
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1980   ┆ 1           ┆ 72.5       ┆ 1.77    │
│ 1990   ┆ 3           ┆ 65.2       ┆ 1.75    │
└────────┴─────────────┴────────────┴─────────┘

Análise de Dados

Quando recebemos um novo conjunto de dados, o objetivo não é construir gráficos ou executar modelos imediatamente. O primeiro objetivo é entender se os dados são confiáveis.

Inspecionar os dados brutos:

Baixe os dados, carregue-os com Polars3 e imprima o cabeçalho

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }
rust-script failed with exit code 1

[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))

O Polars3 não está adivinhando o tipo de algumas colunas corretamente. Vamos permitir que ele adivinhe a partir de 100 linhas por padrão.

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(None)
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }
shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘

Vamos fazer com que o Polars4 infira os tipos adequados das colunas agora a partir de 10000 linhas

  //! ```cargo
  //! [dependencies]
  //! chrono = "0.4.45"
  //! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
  //! ```

  use polars::{
      error::PolarsError,
      prelude::{CsvParseOptions, CsvReadOptions, SerReader},
  };

  fn main() -> Result<(), PolarsError> {
      let df_csv = CsvReadOptions::default()
          .with_has_header(true)
          .with_infer_schema_length(Some(10_000))
          .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
          .try_into_reader_with_file_path(Some(
              "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
          ))?
          .finish()?;
      println!("{df_csv}");
      Ok(())
  }
shape: (29_159, 14)
┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐
│ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH   ┆ Temperature ┆ Total       ┆ True   │
│ e            ┆ ---   ┆ ---        ┆ otal (as     ┆   ┆ ---  ┆ ---         ┆ Hardness    ┆ Colour │
│ ---          ┆ i64   ┆ str        ┆ CaCO3)       ┆   ┆ f64  ┆ f64         ┆ (as CaCO3)  ┆ ---    │
│ str          ┆       ┆            ┆ ---          ┆   ┆      ┆             ┆ ---         ┆ f64    │
│              ┆       ┆            ┆ f64          ┆   ┆      ┆             ┆ f64         ┆        │
╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡
│ ABBEYTOWN_01 ┆ 2023  ┆ Feb        ┆ 314.0        ┆ … ┆ 7.8  ┆ 10.4        ┆ 370.0       ┆ 24.0   │
│ 0            ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ Allua        ┆ 2007  ┆ Aug        ┆ 14.0         ┆ … ┆ 7.42 ┆ 17.8        ┆ 13.4        ┆ 35.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 17.0         ┆ … ┆ 7.67 ┆ 18.1        ┆ 15.8        ┆ 29.0   │
│ Allua        ┆ 2007  ┆ Aug        ┆ 18.0         ┆ … ┆ 7.63 ┆ 17.8        ┆ 15.9        ┆ 31.0   │
│ Allua        ┆ 2007  ┆ Sep        ┆ 19.0         ┆ … ┆ 7.33 ┆ 20.1        ┆ 15.4        ┆ 23.0   │
│ …            ┆ …     ┆ …          ┆ …            ┆ … ┆ …    ┆ …           ┆ …           ┆ …      │
│ SULLANE_060  ┆ 2022  ┆ Sep        ┆ 31.0         ┆ … ┆ 7.1  ┆ 14.9        ┆ 45.0        ┆ 27.0   │
│ SULLANE_060  ┆ 2022  ┆ Nov        ┆ 22.0         ┆ … ┆ 6.9  ┆ 12.3        ┆ 34.0        ┆ 58.0   │
│ SULLANE_060  ┆ 2023  ┆ Mar        ┆ 36.0         ┆ … ┆ 7.2  ┆ 7.1         ┆ 44.0        ┆ 20.0   │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 81.0         ┆ … ┆ 7.4  ┆ 8.6         ┆ 120.0       ┆ 9.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ TWO POT      ┆ 2023  ┆ Feb        ┆ 82.0         ┆ … ┆ 7.8  ┆ 8.1         ┆ 121.0       ┆ 5.0    │
│ (Cork        ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
│ City)_010    ┆       ┆            ┆              ┆   ┆      ┆             ┆             ┆        │
└──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
// imports go here

fn main() -> PolarsResult<()> {
    let df = CsvReadOptions::default()
        .with_has_header(true)
        // Discovery step: scan the file because we do not know columns yet.
        .with_infer_schema_length(Some(10_000))
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some(
            "data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
        ))?
        .finish()?;

    inspect_raw_data(df.clone())?;

    Ok(())
}
rows: 29159
columns: 14

columns and types:
WaterbodyName: String (text or mixed)
Years: Int64 (number)
SampleDate: String (text or mixed)
Alkalinity-total (as CaCO3): Float64 (number)
Ammonia-Total (as N): Float64 (number)
BOD - 5 days (Total): Float64 (number)
Chloride: Float64 (number)
Conductivity @25°C: Float64 (number)
Dissolved Oxygen: Float64 (number)
ortho-Phosphate (as P) - unspecified: Float64 (number)
pH: Float64 (number)
Temperature: Float64 (number)
Total Hardness (as CaCO3): Float64 (number)
True Colour: Float64 (number)

one raw row:
shape: (1, 14)
┌───────────────┬───────┬────────────┬─────────────────────┬───┬─────┬─────────────┬────────────────────┬─────────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ Alkalinity-total    ┆ … ┆ pH  ┆ Temperature ┆ Total Hardness (as ┆ True Colour │
│ ---           ┆ ---   ┆ ---        ┆ (as CaCO3)          ┆   ┆ --- ┆ ---         ┆ CaCO3)             ┆ ---         │
│ str           ┆ i64   ┆ str        ┆ ---                 ┆   ┆ f64 ┆ f64         ┆ ---                ┆ f64         │
│               ┆       ┆            ┆ f64                 ┆   ┆     ┆             ┆ f64                ┆             │
╞═══════════════╪═══════╪════════════╪═════════════════════╪═══╪═════╪═════════════╪════════════════════╪═════════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ 314.0               ┆ … ┆ 7.8 ┆ 10.4        ┆ 370.0              ┆ 24.0        │
└───────────────┴───────┴────────────┴─────────────────────┴───┴─────┴─────────────┴────────────────────┴─────────────┘

location/date columns: ["WaterbodyName", "Years", "SampleDate"]
measurement columns: ["Alkalinity-total (as CaCO3)", "Ammonia-Total (as N)", "BOD - 5 days (Total)", "Chloride", "Conductivity @25°C", "Dissolved Oxygen", "ortho-Phosphate (as P) - unspecified", "pH", "Temperature", "Total Hardness (as CaCO3)", "True Colour"]

long water-quality shape:
shape: (10, 7)
┌───────────────┬───────┬────────────┬─────────────────────────────┬───────────────────┬──────────────────┬──────────┐
│ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column               ┆ measurement_value ┆ parameter        ┆ unit     │
│ ---           ┆ ---   ┆ ---        ┆ ---                         ┆ ---               ┆ ---              ┆ ---      │
│ str           ┆ i64   ┆ str        ┆ str                         ┆ f64               ┆ str              ┆ str      │
╞═══════════════╪═══════╪════════════╪═════════════════════════════╪═══════════════════╪══════════════════╪══════════╡
│ ABBEYTOWN_010 ┆ 2023  ┆ Feb        ┆ Alkalinity-total (as CaCO3) ┆ 314.0             ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 14.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 17.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Aug        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 19.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2007  ┆ Sep        ┆ Alkalinity-total (as CaCO3) ┆ 18.0              ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 8.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 9.0               ┆ Alkalinity-total ┆ as CaCO3 │
│ Allua         ┆ 2008  ┆ Jan        ┆ Alkalinity-total (as CaCO3) ┆ 10.0              ┆ Alkalinity-total ┆ as CaCO3 │
└───────────────┴───────┴────────────┴─────────────────────────────┴───────────────────┴──────────────────┴──────────┘

Perfilar os dados

Notas de rodapé