Rust 数据流水线:从文件到整洁数据库及 Web 仪表盘
简介
我们正在构建一个小型的环境数据流水线。 原始的水质监测文件以 CSV 格式提供。 我们的 Rust 工具负责验证这些文件、清理错误记录、填充安全间隙、存储可信测量数据,并为仪表盘提供支持。
数据流水线
关于所使用的数据集
该数据集1包含来自爱尔兰科克港 (Cork Harbour)、莫伊基拉拉 (Moy Killala) 及其他 15 个沿海地点的*原始水质监测数据*。 提取出的原始数据集拥有超过 *127 万条条目*, 存储库还包含一个转换/透视后的版本,涵盖 *11 个水质参数*,共 *29,159 行*。 这些文件为 CSV 格式,因此非常适合“文件 → 清洁数据库 → 仪表盘”的工作流。
工具与库
我们使用 Rust2 并利用 Polars3 来实现我们的数据流水线。
DataFrame
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
println!("Data:");
print!("{df}\n");
let head = df.head(Some(2));
println!("Head:");
print!("{head}\n");
Ok(())
}Data: shape: (4, 4) ┌────────────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞════════════════╪════════════╪════════╪════════╡ │ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ │ Chloe Cooper ┆ 1997-03-22 ┆ 54.6 ┆ 1.65 │ │ Daniel Donovan ┆ 1997-04-30 ┆ 83.1 ┆ 1.75 │ └────────────────┴────────────┴────────┴────────┘ Head: shape: (2, 4) ┌──────────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞══════════════╪════════════╪════════╪════════╡ │ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ └──────────────┴────────────┴────────┴────────┘
选择列
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.select([
col("name"),
col("birthdate").dt().year().alias("birth_year"),
(col("weight") / col("height").pow(2)).alias("bmi"),
])
.collect()?;
println!("Column selection:");
print!("{result}\n");
Ok(())
}Column selection: shape: (4, 3) ┌────────────────┬────────────┬───────────┐ │ name ┆ birth_year ┆ bmi │ │ --- ┆ --- ┆ --- │ │ str ┆ i32 ┆ f64 │ ╞════════════════╪════════════╪═══════════╡ │ Alice Archer ┆ 1997 ┆ 23.791913 │ │ Ben Brown ┆ 1985 ┆ 23.141498 │ │ Chloe Cooper ┆ 1997 ┆ 20.055096 │ │ Daniel Donovan ┆ 1997 ┆ 27.134694 │ └────────────────┴────────────┴───────────┘
添加列
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::{DataFrame},
prelude::{LazyFrame, IntoLazy, col},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.with_columns([
col("birthdate").dt().year().alias("birth_year"),
(col("weight") / col("height").pow(2)).alias("bmi"),
])
.collect()?;
println!("With added colums:");
print!("{result}\n");
Ok(())
}With added colums: shape: (4, 6) ┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐ │ name ┆ birthdate ┆ weight ┆ height ┆ birth_year ┆ bmi │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 ┆ i32 ┆ f64 │ ╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡ │ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 ┆ 1997 ┆ 23.791913 │ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 ┆ 1985 ┆ 23.141498 │ │ Chloe Cooper ┆ 1997-03-22 ┆ 54.6 ┆ 1.65 ┆ 1997 ┆ 20.055096 │ │ Daniel Donovan ┆ 1997-04-30 ┆ 83.1 ┆ 1.75 ┆ 1997 ┆ 27.134694 │ └────────────────┴────────────┴────────┴────────┴────────────┴───────────┘
表达式扩展
lit 代表字面量 (literal),它是 Polars3 惰性特性中 惰性表达式 API 的一部分。
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col, cols, lit, RoundMode},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.select([
col("name"),
(cols(["weight", "height"]).as_expr() * lit(0.95))
.round(2, RoundMode::default())
.name()
.suffix("-5%"),
])
.collect()?;
println!("Transform:");
print!("{result}\n");
Ok(())
}Transform: shape: (4, 3) ┌────────────────┬───────────┬───────────┐ │ name ┆ weight-5% ┆ height-5% │ │ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 │ ╞════════════════╪═══════════╪═══════════╡ │ Alice Archer ┆ 55.0 ┆ 1.48 │ │ Ben Brown ┆ 68.88 ┆ 1.68 │ │ Chloe Cooper ┆ 51.87 ┆ 1.57 │ │ Daniel Donovan ┆ 78.94 ┆ 1.66 │ └────────────────┴───────────┴───────────┘
过滤行
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "is_between", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::{DataFrame},
prelude::{IntoLazy, col, lit, ClosedInterval},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.filter(col("birthdate").dt().year().lt(lit(1990)))
.collect()?;
println!("With row filtering:");
print!("{result}\n");
let result = df
.clone()
.lazy()
.filter(
col("birthdate")
.is_between(
lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
ClosedInterval::Both,
)
.and(col("height").gt(lit(1.7))),
)
.collect()?;
println!("With complex row filtering:");
print!("{result}\n");
Ok(())
}With row filtering: shape: (1, 4) ┌───────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞═══════════╪════════════╪════════╪════════╡ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ └───────────┴────────────┴────────┴────────┘ With complex row filtering: shape: (1, 4) ┌───────────┬────────────┬────────┬────────┐ │ name ┆ birthdate ┆ weight ┆ height │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ f64 │ ╞═══════════╪════════════╪════════╪════════╡ │ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │ └───────────┴────────────┴────────┴────────┘
分组 (Grouping by)
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql"] }
//! ```
use chrono::NaiveDate;
use polars::{
df,
error::PolarsError,
frame::DataFrame,
prelude::{IntoLazy, col, lit, len, RoundMode},
};
fn main() -> Result<(), PolarsError> {
let mut df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1997, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1997, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 54.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
let result = df
.clone()
.lazy()
.group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
.agg([len()])
.collect()?;
println!("Grouping by birth decade:");
print!("{result}\n");
let result = df
.clone()
.lazy()
.group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
.agg([
len().alias("sample_size"),
col("weight")
.mean()
.round(2, RoundMode::default())
.alias("avg_weight"),
col("height").max().alias("tallest"),
])
.collect()?;
println!("Grouping by derived features:");
println!("{result}");
Ok(())
}Grouping by birth decade: shape: (2, 2) ┌────────┬─────┐ │ decade ┆ len │ │ --- ┆ --- │ │ i32 ┆ u32 │ ╞════════╪═════╡ │ 1990 ┆ 3 │ │ 1980 ┆ 1 │ └────────┴─────┘ Grouping by derived features: shape: (2, 4) ┌────────┬─────────────┬────────────┬─────────┐ │ decade ┆ sample_size ┆ avg_weight ┆ tallest │ │ --- ┆ --- ┆ --- ┆ --- │ │ i32 ┆ u32 ┆ f64 ┆ f64 │ ╞════════╪═════════════╪════════════╪═════════╡ │ 1980 ┆ 1 ┆ 72.5 ┆ 1.77 │ │ 1990 ┆ 3 ┆ 65.2 ┆ 1.75 │ └────────┴─────────────┴────────────┴─────────┘
数据分析
当我们收到一个新的数据集时,目标并不是立即构建图表或运行模型。首要目标是了解数据是否值得信赖。
检查原始数据:
下载数据,使用 Polars3 加载它,然后打印头部数据。
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
error::PolarsError,
prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};
fn main() -> Result<(), PolarsError> {
let df_csv = CsvReadOptions::default()
.with_has_header(true)
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
println!("{df_csv}");
Ok(())
}
rust-script failed with exit code 1
[stderr]
Error: ComputeError(ErrString("could not parse `50.5` as dtype `i64` at column 'Alkalinity-total (as CaCO3)' (column number 4)\n\nThe current offset in the file is 7606 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `schema_overrides` argument\n- setting `ignore_errors` to `True`,\n- adding `50.5` to the `null_values` list.\n\nOriginal error: ```invalid primitive value found during CSV parsing```"))
Polars3 没有正确猜测某些列的*类型*。让我们默认让它从 *100 行*中进行猜测。
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
error::PolarsError,
prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};
fn main() -> Result<(), PolarsError> {
let df_csv = CsvReadOptions::default()
.with_has_header(true)
.with_infer_schema_length(None)
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
println!("{df_csv}");
Ok(())
}shape: (29_159, 14) ┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐ │ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH ┆ Temperature ┆ Total ┆ True │ │ e ┆ --- ┆ --- ┆ otal (as ┆ ┆ --- ┆ --- ┆ Hardness ┆ Colour │ │ --- ┆ i64 ┆ str ┆ CaCO3) ┆ ┆ f64 ┆ f64 ┆ (as CaCO3) ┆ --- │ │ str ┆ ┆ ┆ --- ┆ ┆ ┆ ┆ --- ┆ f64 │ │ ┆ ┆ ┆ f64 ┆ ┆ ┆ ┆ f64 ┆ │ ╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡ │ ABBEYTOWN_01 ┆ 2023 ┆ Feb ┆ 314.0 ┆ … ┆ 7.8 ┆ 10.4 ┆ 370.0 ┆ 24.0 │ │ 0 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ Allua ┆ 2007 ┆ Aug ┆ 14.0 ┆ … ┆ 7.42 ┆ 17.8 ┆ 13.4 ┆ 35.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 17.0 ┆ … ┆ 7.67 ┆ 18.1 ┆ 15.8 ┆ 29.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 18.0 ┆ … ┆ 7.63 ┆ 17.8 ┆ 15.9 ┆ 31.0 │ │ Allua ┆ 2007 ┆ Sep ┆ 19.0 ┆ … ┆ 7.33 ┆ 20.1 ┆ 15.4 ┆ 23.0 │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ SULLANE_060 ┆ 2022 ┆ Sep ┆ 31.0 ┆ … ┆ 7.1 ┆ 14.9 ┆ 45.0 ┆ 27.0 │ │ SULLANE_060 ┆ 2022 ┆ Nov ┆ 22.0 ┆ … ┆ 6.9 ┆ 12.3 ┆ 34.0 ┆ 58.0 │ │ SULLANE_060 ┆ 2023 ┆ Mar ┆ 36.0 ┆ … ┆ 7.2 ┆ 7.1 ┆ 44.0 ┆ 20.0 │ │ TWO POT ┆ 2023 ┆ Feb ┆ 81.0 ┆ … ┆ 7.4 ┆ 8.6 ┆ 120.0 ┆ 9.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ TWO POT ┆ 2023 ┆ Feb ┆ 82.0 ┆ … ┆ 7.8 ┆ 8.1 ┆ 121.0 ┆ 5.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ └──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
现在,让我们让 Polars4 从 *10000 行*中推断出各列正确的*类型*。
//! ```cargo
//! [dependencies]
//! chrono = "0.4.45"
//! polars = { version = "0.54.4", features = ["lazy", "temporal", "sql", "csv"] }
//! ```
use polars::{
error::PolarsError,
prelude::{CsvParseOptions, CsvReadOptions, SerReader},
};
fn main() -> Result<(), PolarsError> {
let df_csv = CsvReadOptions::default()
.with_has_header(true)
.with_infer_schema_length(Some(10_000))
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
println!("{df_csv}");
Ok(())
}shape: (29_159, 14) ┌──────────────┬───────┬────────────┬──────────────┬───┬──────┬─────────────┬─────────────┬────────┐ │ WaterbodyNam ┆ Years ┆ SampleDate ┆ Alkalinity-t ┆ … ┆ pH ┆ Temperature ┆ Total ┆ True │ │ e ┆ --- ┆ --- ┆ otal (as ┆ ┆ --- ┆ --- ┆ Hardness ┆ Colour │ │ --- ┆ i64 ┆ str ┆ CaCO3) ┆ ┆ f64 ┆ f64 ┆ (as CaCO3) ┆ --- │ │ str ┆ ┆ ┆ --- ┆ ┆ ┆ ┆ --- ┆ f64 │ │ ┆ ┆ ┆ f64 ┆ ┆ ┆ ┆ f64 ┆ │ ╞══════════════╪═══════╪════════════╪══════════════╪═══╪══════╪═════════════╪═════════════╪════════╡ │ ABBEYTOWN_01 ┆ 2023 ┆ Feb ┆ 314.0 ┆ … ┆ 7.8 ┆ 10.4 ┆ 370.0 ┆ 24.0 │ │ 0 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ Allua ┆ 2007 ┆ Aug ┆ 14.0 ┆ … ┆ 7.42 ┆ 17.8 ┆ 13.4 ┆ 35.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 17.0 ┆ … ┆ 7.67 ┆ 18.1 ┆ 15.8 ┆ 29.0 │ │ Allua ┆ 2007 ┆ Aug ┆ 18.0 ┆ … ┆ 7.63 ┆ 17.8 ┆ 15.9 ┆ 31.0 │ │ Allua ┆ 2007 ┆ Sep ┆ 19.0 ┆ … ┆ 7.33 ┆ 20.1 ┆ 15.4 ┆ 23.0 │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ SULLANE_060 ┆ 2022 ┆ Sep ┆ 31.0 ┆ … ┆ 7.1 ┆ 14.9 ┆ 45.0 ┆ 27.0 │ │ SULLANE_060 ┆ 2022 ┆ Nov ┆ 22.0 ┆ … ┆ 6.9 ┆ 12.3 ┆ 34.0 ┆ 58.0 │ │ SULLANE_060 ┆ 2023 ┆ Mar ┆ 36.0 ┆ … ┆ 7.2 ┆ 7.1 ┆ 44.0 ┆ 20.0 │ │ TWO POT ┆ 2023 ┆ Feb ┆ 81.0 ┆ … ┆ 7.4 ┆ 8.6 ┆ 120.0 ┆ 9.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ TWO POT ┆ 2023 ┆ Feb ┆ 82.0 ┆ … ┆ 7.8 ┆ 8.1 ┆ 121.0 ┆ 5.0 │ │ (Cork ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ │ City)_010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ └──────────────┴───────┴────────────┴──────────────┴───┴──────┴─────────────┴─────────────┴────────┘
// 导入放在这里
fn main() -> PolarsResult<()> {
let df = CsvReadOptions::default()
.with_has_header(true)
// 发现步骤:扫描文件,因为我们还不知道列名。
.with_infer_schema_length(Some(10_000))
.with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
.try_into_reader_with_file_path(Some(
"data/Water Quality Monitoring Dataset_ Ireland.csv".into(),
))?
.finish()?;
inspect_raw_data(df.clone())?;
Ok(())
}行数: 29159 列数: 14 列及类型: WaterbodyName: String (文本或混合) Years: Int64 (数字) SampleDate: String (文本或混合) Alkalinity-total (as CaCO3): Float64 (数字) Ammonia-Total (as N): Float64 (数字) BOD - 5 days (Total): Float64 (数字) Chloride: Float64 (数字) Conductivity @25°C: Float64 (数字) Dissolved Oxygen: Float64 (数字) ortho-Phosphate (as P) - unspecified: Float64 (数字) pH: Float64 (数字) Temperature: Float64 (数字) Total Hardness (as CaCO3): Float64 (数字) True Colour: Float64 (数字) 一行原始数据: shape: (1, 14) ┌───────────────┬───────┬────────────┬─────────────────────┬───┬─────┬─────────────┬────────────────────┬─────────────┐ │ WaterbodyName ┆ Years ┆ SampleDate ┆ Alkalinity-total ┆ … ┆ pH ┆ Temperature ┆ Total Hardness (as ┆ True Colour │ │ --- ┆ --- ┆ --- ┆ (as CaCO3) ┆ ┆ --- ┆ --- ┆ CaCO3) ┆ --- │ │ str ┆ i64 ┆ str ┆ --- ┆ ┆ f64 ┆ f64 ┆ --- ┆ f64 │ │ ┆ ┆ ┆ f64 ┆ ┆ ┆ ┆ f64 ┆ │ ╞═══════════════╪═══════╪════════════╪═════════════════════╪═══╪═════╪═════════════╪════════════════════╪═════════════╡ │ ABBEYTOWN_010 ┆ 2023 ┆ Feb ┆ 314.0 ┆ … ┆ 7.8 ┆ 10.4 ┆ 370.0 ┆ 24.0 │ └───────────────┴───────┴────────────┴─────────────────────┴───┴─────┴─────────────┴────────────────────┴─────────────┘ 位置/日期列: ["WaterbodyName", "Years", "SampleDate"] 测量列: ["Alkalinity-total (as CaCO3)", "Ammonia-Total (as N)", "BOD - 5 days (Total)", "Chloride", "Conductivity @25°C", "Dissolved Oxygen", "ortho-Phosphate (as P) - unspecified", "pH", "Temperature", "Total Hardness (as CaCO3)", "True Colour"] 长格式水质数据形状: shape: (10, 7) ┌───────────────┬───────┬────────────┬─────────────────────────────┬───────────────────┬──────────────────┬──────────┐ │ WaterbodyName ┆ Years ┆ SampleDate ┆ source_column ┆ measurement_value ┆ parameter ┆ unit │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ str ┆ f64 ┆ str ┆ str │ ╞═══════════════╪═══════╪════════════╪═════════════════════════════╪═══════════════════╪══════════════════╪══════════╡ │ ABBEYTOWN_010 ┆ 2023 ┆ Feb ┆ Alkalinity-total (as CaCO3) ┆ 314.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Aug ┆ Alkalinity-total (as CaCO3) ┆ 14.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Aug ┆ Alkalinity-total (as CaCO3) ┆ 17.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Aug ┆ Alkalinity-total (as CaCO3) ┆ 18.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Sep ┆ Alkalinity-total (as CaCO3) ┆ 19.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Sep ┆ Alkalinity-total (as CaCO3) ┆ 19.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2007 ┆ Sep ┆ Alkalinity-total (as CaCO3) ┆ 18.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2008 ┆ Jan ┆ Alkalinity-total (as CaCO3) ┆ 8.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2008 ┆ Jan ┆ Alkalinity-total (as CaCO3) ┆ 9.0 ┆ Alkalinity-total ┆ as CaCO3 │ │ Allua ┆ 2008 ┆ Jan ┆ Alkalinity-total (as CaCO3) ┆ 10.0 ┆ Alkalinity-total ┆ as CaCO3 │ └───────────────┴───────┴────────────┴─────────────────────────────┴───────────────────┴──────────────────┴──────────┘

