Pipelines de données en Rust : des fichiers aux bases de données propres et tableaux de bord web
Introduction
Nous construisons un petit pipeline de données environnementales. Les fichiers bruts de surveillance de la qualité de l'eau arrivent au format CSV. Notre outil en Rust les valide, nettoie les enregistrements erronés, comble les lacunes de manière sécurisée, stocke les mesures fiables et alimente un tableau de bord.
Pipeline de données
À propos du jeu de données utilisé
Le jeu de données1 contient des données brutes de surveillance de la qualité de l'eau provenant de Cork Harbour, Moy Killala et 15 autres sites côtiers en Irlande. Le jeu de données brut extrait compte plus de 1,27 million d'entrées, et le dépôt inclut également une version transformée/pivotée avec 29 159 lignes réparties sur 11 paramètres de qualité de l'eau. Les fichiers sont au format CSV, ils sont donc faciles à utiliser pour le flux « fichiers → base de données propre → tableau de bord ».
Outils et bibliothèques
Nous utilisons Rust2 pour implémenter notre pipeline de données en tirant parti de Polars3.
DataFrame
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
println!("Data:");
print!("{df}\n");
let head = df.head(Some(2));
println!("Head:");
print!("{head}\n");
Ok(())
}Data: shape: (4, 4) ┌────────────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞════════════════╪════════════╪════════╪════════╡ │ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ │ Chloe Cooper ┆ 1997-03-22 ┆ 54.6 ┆ 1.65 │ │ Daniel Donovan ┆ 1997-04-30 ┆ 83.1 ┆ 1.75 │ └────────────────┴────────────┴────────┴────────┘ Head: shape: (2, 4) ┌──────────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞══════════════╪════════════╪════════╪════════╡ │ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ └──────────────┴────────────┴────────┴────────┘
Sélection de colonnes
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.select([
col("name"),
col("birthdate").dt().year().alias("birth_year"),
(col("weight") / col("height").pow(2)).alias("bmi"),
])
.collect()?;
println!("Column selection:");
print!("{result}\n");
Ok(())
}Column selection: shape: (4, 3) ┌────────────────┬────────────┬───────────┐ │ name ┆ birth_year ┆ bmi │ │ --- ┆ --- ┆ --- │ │ str ┆ i32 ┆ f64 │ ╞════════════════╪════════════╪═══════════╡ │ Alice Archer ┆ 1997 ┆ 23.791913 │ │ Ben Brown ┆ 1985 ┆ 23.141498 │ │ Chloe Cooper ┆ 1997 ┆ 20.055096 │ │ Daniel Donovan ┆ 1997 ┆ 27.134694 │ └────────────────┴────────────┴───────────┘
Ajout de colonnes
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::{DataFrame},
prelude::{LazyFrame, IntoLazy, col},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.with_columns([
col("birthdate").dt().year().alias("birth_year"),
(col("weight") / col("height").pow(2)).alias("bmi"),
])
.collect()?;
println!("With added colums:");
print!("{result}\n");
Ok(())
}With added colums: shape: (4, 6) ┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐ │ name ┆ birthdate ┆ weight ┆ height ┆ birth_year ┆ bmi │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 ┆ i32 ┆ f64 │ ╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡ │ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 ┆ 1997 ┆ 23.791913 │ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 ┆ 1985 ┆ 23.141498 │ │ Chloe Cooper ┆ 1997-03-22 ┆ 54.6 ┆ 1.65 ┆ 1997 ┆ 20.055096 │ │ Daniel Donovan ┆ 1997-04-30 ┆ 83.1 ┆ 1.75 ┆ 1997 ┆ 27.134694 │ └────────────────┴────────────┴────────┴────────┴────────────┴───────────┘
Expansion d'expression
lit signifie littéral et fait partie de l'API d'expression paresseuse de la fonctionnalité lazy de Polars3.
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col, cols, lit, RoundMode},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.select([
col("name"),
(cols(["weight", "height"]).as_expr() * lit(0.95))
.round(2, RoundMode::default())
.name()
.suffix("-5%"),
])
.collect()?;
println!("Transform:");
print!("{result}\n");
Ok(())
}Transform: shape: (4, 3) ┌────────────────┬───────────┬───────────┐ │ name ┆ weight-5% ┆ height-5% │ │ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 │ ╞════════════════╪═══════════╪═══════════╡ │ Alice Archer ┆ 55.0 ┆ 1.48 │ │ Ben Brown ┆ 68.88 ┆ 1.68 │ │ Chloe Cooper ┆ 51.87 ┆ 1.57 │ │ Daniel Donovan ┆ 78.94 ┆ 1.66 │ └────────────────┴───────────┴───────────┘
Filtrage de lignes
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::{DataFrame},
prelude::{IntoLazy, col, lit, ClosedInterval},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.filter(col("birthdate").dt().year().lt(lit(1990)))
.collect()?;
println!("With row filtering:");
print!("{result}\n");
let result = df
.clone()
.lazy()
.filter(
col("birthdate")
.is_between(
lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
ClosedInterval::Both,
)
.and(col("height").gt(lit(1.7))),
)
.collect()?;
println!("With complex row filtering:");
print!("{result}\n");
Ok(())
}With row filtering: shape: (1, 4) ┌───────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞═══════════╪════════════╪════════╪════════╡ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ └───────────┴────────────┴────────┴────────┘ With complex row filtering: shape: (1, 4) ┌───────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞═══════════╪════════════╪════════╪════════╡ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ └───────────┴────────────┴────────┴────────┘
Regroupement (Group by)
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col, lit, len, RoundMode},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
.agg([len()])
.collect()?;
println!("Grouping by birth decade:");
print!("{result}\n");
let result = df
.clone()
.lazy()
.group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
.agg([
len().alias("sample_size"),
col("weight")
.mean()
.round(2, RoundMode::default())
.alias("avg_weight"),
col("height").max().alias("tallest"),
])
.collect()?;
println!("Grouping by derived features:");
println!("{result}");
Ok(())
}Grouping by birth decade: shape: (2, 2) ┌────────┬─────┐ │ decade ┆ len │ │ --- ┆ --- │ │ i32 ┆ u32 │ ╞════════╪═════╡ │ 1990 ┆ 3 │ │ 1980 ┆ 1 │ └────────┴─────┘ Grouping by derived features: shape: (2, 4) ┌────────┬─────────────┬────────────┬─────────┐ │ decade ┆ sample_size ┆ avg_weight ┆ tallest │ │ --- ┆ --- ┆ --- ┆ --- │ │ i32 ┆ u32 ┆ f64 ┆ f64 │ ╞════════╪═════════════╪════════════╪═════════╡ │ 1980 ┆ 1 ┆ 72.5 ┆ 1.77 │ │ 1990 ┆ 3 ┆ 65.2 ┆ 1.75 │ └────────┴─────────────┴────────────┴─────────┘
Analyse de données
Lorsque nous recevons un nouveau jeu de données, l'objectif n'est pas de construire immédiatement des graphiques ou d'exécuter des modèles. Le premier objectif est de comprendre si les données sont fiables.
Inspecter les données brutes :
Téléchargez les données, chargez-les avec Polars3 puis affichez l'en-tête
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
error::PolarsError,
prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};
fn main() -> Result<(), PolarsError> {
let df_csv = CsvReadOptions::default()
.with_has_header(true)
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
println!("{df_csv}");
Ok(())
}
rust-script failed with exit code 1
[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))
Polars3 ne devine pas correctement le type de certaines colonnes. Laissons-le deviner à partir de 100 lignes par défaut.
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
error::PolarsError,
prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};
fn main() -> Result<(), PolarsError> {
let df_csv = CsvReadOptions::default()
.with_has_header(true)
.with_infer_schema_length(None)
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
println!("{df_csv}");
Ok(())
}shape: (29_159, 14) ┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐ │ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH ┆ Temperature ┆ Total ┆ True │ │ e ┆ --- ┆ --- ┆ otal (as ┆ ┆ --- ┆ --- ┆ Hardness ┆ Colour │ │ --- ┆ i64 ┆ str ┆ CaCO3) ┆ ┆ f64 ┆ f64 ┆ (as CaCO3) ┆ --- │ │ str ┆ ┆ ┆ --- ┆ ┆ ┆ ┆ --- ┆ f64 │ │ ┆ ┆ ┆ f64 ┆ ┆ ┆ ┆ f64 ┆ │ ╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡ │ ABBEYTOWN_01 ┆ 2023 ┆ Feb ┆ 314.0 ┆ … ┆ 7.8 ┆ 10.4 ┆ 370.0 ┆ 24.0 │ │ 0 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ Allua ┆ 2007 ┆ Aug ┆ 14.0 ┆ … ┆ 7.42 ┆ 17.8 ┆ 13.4 ┆ 35.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 17.0 ┆ … ┆ 7.67 ┆ 18.1 ┆ 15.8 ┆ 29.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 18.0 ┆ … ┆ 7.63 ┆ 17.8 ┆ 15.9 ┆ 31.0 │ │ Allua ┆ 2007 ┆ Sep ┆ 19.0 ┆ … ┆ 7.33 ┆ 20.1 ┆ 15.4 ┆ 23.0 │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ SULLANE_060 ┆ 2022 ┆ Sep ┆ 31.0 ┆ … ┆ 7.1 ┆ 14.9 ┆ 45.0 ┆ 27.0 │ │ SULLANE_060 ┆ 2022 ┆ Nov ┆ 22.0 ┆ … ┆ 6.9 ┆ 12.3 ┆ 34.0 ┆ 58.0 │ │ SULLANE_060 ┆ 2023 ┆ Mar ┆ 36.0 ┆ … ┆ 7.2 ┆ 7.1 ┆ 44.0 ┆ 20.0 │ │ TWO POT ┆ 2023 ┆ Feb ┆ 81.0 ┆ … ┆ 7.4 ┆ 8.6 ┆ 120.0 ┆ 9.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ TWO POT ┆ 2023 ┆ Feb ┆ 82.0 ┆ … ┆ 7.8 ┆ 8.1 ┆ 121.0 ┆ 5.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ └──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
Laissons maintenant Polars4 déduire les types appropriés des colonnes à partir de 10 000 lignes
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
error::PolarsError,
prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};
fn main() -> Result<(), PolarsError> {
let df_csv = CsvReadOptions::default()
.with_has_header(true)
.with_infer_schema_length(Some(10_000))
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
println!("{df_csv}");
Ok(())
}shape: (29_159, 14) ┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐ │ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH ┆ Temperature ┆ Total ┆ True │ │ e ┆ --- ┆ --- ┆ otal (as ┆ ┆ --- ┆ --- ┆ Hardness ┆ Colour │ │ --- ┆ i64 ┆ str ┆ CaCO3) ┆ ┆ f64 ┆ f64 ┆ (as CaCO3) ┆ --- │ │ str ┆ ┆ ┆ --- ┆ ┆ ┆ ┆ --- ┆ f64 │ │ ┆ ┆ ┆ f64 ┆ ┆ ┆ ┆ f64 ┆ │ ╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡ │ ABBEYTOWN_01 ┆ 2023 ┆ Feb ┆ 314.0 ┆ … ┆ 7.8 ┆ 10.4 ┆ 370.0 ┆ 24.0 │ │ 0 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ Allua ┆ 2007 ┆ Aug ┆ 14.0 ┆ … ┆ 7.42 ┆ 17.8 ┆ 13.4 ┆ 35.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 17.0 ┆ … ┆ 7.67 ┆ 18.1 ┆ 15.8 ┆ 29.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 18.0 ┆ … ┆ 7.63 ┆ 17.8 ┆ 15.9 ┆ 31.0 │ │ Allua ┆ 2007 ┆ Sep ┆ 19.0 ┆ … ┆ 7.33 ┆ 20.1 ┆ 15.4 ┆ 23.0 │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ SULLANE_060 ┆ 2022 ┆ Sep ┆ 31.0 ┆ … ┆ 7.1 ┆ 14.9 ┆ 45.0 ┆ 27.0 │ │ SULLANE_060 ┆ 2022 ┆ Nov ┆ 22.0 ┆ … ┆ 6.9 ┆ 12.3 ┆ 34.0 ┆ 58.0 │ │ SULLANE_060 ┆ 2023 ┆ Mar ┆ 36.0 ┆ … ┆ 7.2 ┆ 7.1 ┆ 44.0 ┆ 20.0 │ │ TWO POT ┆ 2023 ┆ Feb ┆ 81.0 ┆ … ┆ 7.4 ┆ 8.6 ┆ 120.0 ┆ 9.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ TWO POT ┆ 2023 ┆ Feb ┆ 82.0 ┆ … ┆ 7.8 ┆ 8.1 ┆ 121.0 ┆ 5.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ └──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
// les imports vont ici
fn main() -> PolarsResult<()> {
let df = CsvReadOptions::default()
.with_has_header(true)
// Étape de découverte : scannez le fichier car nous ne connaissons pas encore les colonnes.
.with_infer_schema_length(Some(10_000))
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
inspect_raw_data(df.clone())?;
Ok(())
}lignes : 29159 colonnes : 14 colonnes et types : WaterbodyName : String (texte ou mixte) Years : Int64 (nombre) SampleDate : String (texte ou mixte) Alkalinity-total (as CaCO3) : Float64 (nombre) Ammonia-Total (as N) : Float64 (nombre) BOD - 5 days (Total) : Float64 (nombre) Chloride : Float64 (nombre) Conductivity @25°C : Float64 (nombre) Dissolved Oxygen : Float64 (nombre) ortho-Phosphate (as P) - unspecified : Float64 (nombre) pH : Float64 (nombre) Temperature : Float64 (nombre) Total Hardness (as CaCO3) : Float64 (nombre) True Colour : Float64 (nombre) une ligne brute : shape: (1, 14) ┌───────────────┬───────┬────────────┬─────────────────────┬───┬─────┬─────────────┬────────────────────┬─────────────┐ │ WaterbodyName ┆ Years ┆ SampleDate ┆ Alkalinity-total ┆ … ┆ pH ┆ Temperature ┆ Total Hardness (as ┆ True Colour │ │ --- ┆ --- ┆ --- ┆ (as CaCO3) ┆ ┆ --- ┆ --- ┆ CaCO3) ┆ --- │ │ str ┆ i64 ┆ str ┆ --- ┆ ┆ f64 ┆ f64 ┆ --- ┆ f64 │ │ ┆ ┆ ┆ f64 ┆ ┆ ┆ ┆ f64 ┆ │ ╞═══════════════╪═══════╪════════════╪═════════════════════╪═══╪═════╪═════════════╪════════════════════╪═════════════╡ │ ABBEYTOWN_010 ┆ 2023 ┆ Feb ┆ 314.0 ┆ … ┆ 7.8 ┆ 10.4 ┆ 370.0 ┆ 24.0 │ └───────────────┴───────┴────────────┴─────────────────────┴───┴─────┴─────────────┴────────────────────┴─────────────┘ colonnes emplacement/date : ["WaterbodyName", "Years", "SampleDate"] colonnes de mesure : ["Alkalinity-total (as CaCO3)", "Ammonia-Total (as N)", "BOD - 5 days (Total)", "Chloride", "Conductivity @25°C", "Dissolved Oxygen", "ortho-Phosphate (as P) - unspecified", "pH", "Temperature", "Total Hardness (as CaCO3)", "True Colour"] forme longue de la qualité de l'eau : shape: (10, 7) ┌───────────────┬───────┬────────────┬─────────────────────────────┬───────────────────┬──────────────────┬──────────┐ │ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column ┆ measurement_value ┆ parameter ┆ unit │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ str ┆ f64 ┆ str ┆ str │ ╞═══════════════╪═══════╪════════════╪═════════════════════════════╪═══════════════════╪══════════════════╪══════════╡ │ ABBEYTOWN_010 ┆ 2023 ┆ Feb ┆ Alkalinity-total (as CaCO3) ┆ 314.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Aug ┆ Alkalinity-total (as CaCO3) ┆ 14.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Aug ┆ Alkalinity-total (as CaCO3) ┆ 17.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Aug ┆ Alkalinity-total (as CaCO3) ┆ 18.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Sep ┆ Alkalinity-total (as CaCO3) ┆ 19.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Sep ┆ Alkalinity-total (as CaCO3) ┆ 19.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Sep ┆ Alkalinity-total (as CaCO3) ┆ 18.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2008 ┆ Jan ┆ Alkalinity-total (as CaCO3) ┆ 8.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2008 ┆ Jan ┆ Alkalinity-total (as CaCO3) ┆ 9.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2008 ┆ Jan ┆ Alkalinity-total (as CaCO3) ┆ 10.0 ┆ Alkalinity-total ┆ as CaCO3 │ └───────────────┴───────┴────────────┴─────────────────────────────┴───────────────────┴──────────────────┴──────────┘

