Datasets¶
A Dataset bundles together a station reference, a variable descriptor, and
the complete time series pulled from the IDEAM Aquarius WebPortal. Once you
have a Station object, producing a Dataset is a single
expression.
What is a Dataset?¶
from colombia_hydrodata import Client
client = Client()
station = client.fetch_station("29037020")
dataset = station["CAUDAL@HIS_Q_MEDIA_D"]
dataset is a Dataset instance, a plain dataclass with three attributes:
| Attribute | Type | Description |
|---|---|---|
| dataset.station | Station | The station the data comes from |
| dataset.variable | Variable | Descriptor for the measured variable |
| dataset.data | pd.DataFrame | Full time-series observations |
Printing a dataset with print(dataset) gives a compact one-line summary.
Accessing metadata¶
dataset.station¶
dataset.station is the full Station object, so every station attribute is
reachable directly from the dataset:
print(dataset.station.name) # CALAMAR
print(dataset.station.department) # BOLIVAR
print(dataset.station.location) # Location: altitude=8.00 [-74.915; 10.243]
dataset.variable¶
dataset.variable is a frozen Variable dataclass with three fields:
var = dataset.variable
print(var.param) # CAUDAL
print(var.label) # HIS_Q_MEDIA_D
print(var.id) # numeric Aquarius dataset ID
print(var) # CAUDAL@HIS_Q_MEDIA_D
| Field | Description |
|---|---|
| var.param | Parameter family, for example CAUDAL, NIVEL, PRECIPITACION |
| var.label | Series code that identifies aggregation and sensor, for example HIS_Q_MEDIA_D |
| var.id | Numeric Aquarius dataset identifier used to fetch the raw data |
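The behaviour described above can be mimicked with a small stand-in class. This is only a sketch and not the library's implementation: VariableSketch and parse_key are hypothetical names, the id value is a placeholder, and the "PARAM@LABEL" string form is inferred from the printed output shown earlier.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class VariableSketch:
    """Hypothetical stand-in for the library's Variable dataclass."""

    param: str  # parameter family, e.g. "CAUDAL"
    label: str  # series code, e.g. "HIS_Q_MEDIA_D"
    id: int     # placeholder for the numeric Aquarius dataset ID

    def __str__(self) -> str:
        # Same PARAM@LABEL shape used as the key in station[...]
        return f"{self.param}@{self.label}"


def parse_key(key: str) -> tuple[str, str]:
    """Split a 'PARAM@LABEL' key into its two parts (hypothetical helper)."""
    param, _, label = key.partition("@")
    return param, label


# id=0 is a placeholder, not a real Aquarius identifier.
var = VariableSketch(*parse_key("CAUDAL@HIS_Q_MEDIA_D"), id=0)
print(var)  # CAUDAL@HIS_Q_MEDIA_D

try:
    var.param = "NIVEL"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("Variable fields are read-only")
```

Because the class is frozen, instances are hashable and safe to use as dictionary keys, which is convenient when several datasets are keyed by their variable.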
The data DataFrame¶
dataset.data is a pandas.DataFrame with exactly two columns:
| Column | dtype | Description |
|---|---|---|
| timestamp | datetime64[ns] | Date and time of the observation |
| value | float64 | Measured value in the variable's native unit |
No unit column
The unit, such as m3/s for streamflow, m for gauge level, or mm for
rainfall, is encoded in the variable key rather than stored in a separate
column. Use dataset.variable.param and dataset.variable.label to
identify what you are working with.
Viewing the data¶
print(dataset.data.head())
            timestamp    value
0 2000-01-01 05:00:00  1240.80
1 2000-01-02 05:00:00  1179.00
2 2000-01-03 05:00:00  1143.40
3 2000-01-04 05:00:00  1113.60
4 2000-01-05 05:00:00  1066.60
print(dataset.data.tail())
               timestamp    value
8761 2023-12-27 05:00:00  1834.10
8762 2023-12-28 05:00:00  1801.50
8763 2023-12-29 05:00:00  1768.20
8764 2023-12-30 05:00:00  1740.80
8765 2023-12-31 05:00:00  1693.40
Shape and memory¶
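A minimal sketch of checking the size and memory footprint of the table. It uses a synthetic stand-in for dataset.data (same two columns, made-up values), so the byte counts it prints are illustrative rather than what a real dataset would report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dataset.data: one row per day, two columns.
timestamps = pd.date_range("2000-01-01 05:00", "2023-12-31 05:00", freq="D")
data = pd.DataFrame({
    "timestamp": timestamps,
    "value": np.linspace(300.0, 5800.0, len(timestamps)),  # made-up values
})

print(data.shape)   # (8766, 2)
print(data.dtypes)  # dtype of each column

# Per-column memory footprint in bytes, then the total in kibibytes.
memory = data.memory_usage(deep=True)
print(memory)
print(f"{memory.sum() / 1024:.1f} KiB")
```

With a real dataset, replace data with dataset.data; the calls are plain pandas and need no conversion.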
Summary statistics¶
print(dataset.data["value"].describe())
count    8766.000000
mean     1834.512800
std       812.341500
min       312.400000
25%      1148.200000
50%      1681.900000
75%      2412.600000
max      5821.300000
Name: value, dtype: float64
Working with the DataFrame¶
Because dataset.data is a standard pandas DataFrame, every familiar
operation works without any conversion.
Set the timestamp as the index¶
Most time-series operations become more ergonomic with a DatetimeIndex:
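As a sketch of the idea, the snippet below moves the timestamp column into the index and then looks up a day by its date string. The DataFrame is a small synthetic stand-in for dataset.data, not real station data.

```python
import pandas as pd

# Small synthetic stand-in for dataset.data.
data = pd.DataFrame({
    "timestamp": pd.date_range("2000-01-01 05:00", periods=5, freq="D"),
    "value": [1240.8, 1179.0, 1143.4, 1113.6, 1066.6],
})

# set_index returns a new DataFrame; the original keeps its integer index.
df = data.set_index("timestamp")

print(type(df.index).__name__)  # DatetimeIndex
print(df.loc["2000-01-03"])     # partial-string lookup selects that day's rows
```

The original DataFrame is left unchanged, so the two-column layout is still available for code that expects it.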
Filter by date range¶
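Two common ways to select a date window are sketched here on a synthetic stand-in for dataset.data with made-up values: a boolean mask on the timestamp column, and label slicing once the timestamp is the index.

```python
import pandas as pd

# Synthetic stand-in for dataset.data (made-up values).
data = pd.DataFrame({
    "timestamp": pd.date_range("2010-12-30 05:00", periods=6, freq="D"),
    "value": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})

# Option 1: boolean mask on the timestamp column (end bound exclusive here).
mask = (data["timestamp"] >= "2011-01-01") & (data["timestamp"] < "2011-01-03")
by_mask = data[mask]

# Option 2: label slicing on a DatetimeIndex (both ends inclusive).
by_slice = data.set_index("timestamp").loc["2011-01-01":"2011-01-02"]

print(by_mask)
print(by_slice)
```

Both selections return the same two days. The mask keeps the integer index and the timestamp column, while the slice returns a frame indexed by timestamp.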
Drop missing values¶
The Aquarius series may contain gaps represented as NaN:
clean = dataset.data.dropna(subset=["value"])
print(f"Removed {len(dataset.data) - len(clean)} missing rows")
Resample to monthly means¶
df = dataset.data.set_index("timestamp")
monthly = df["value"].resample("ME").mean()  # "ME" = month end; pandas older than 2.2 uses "M"
print(monthly.tail(6))
timestamp
2023-07-31    2914.8
2023-08-31    3201.4
2023-09-30    3542.1
2023-10-31    3128.7
2023-11-30    2487.3
2023-12-31    1923.6
Freq: ME, Name: value, dtype: float64
Identify annual extremes¶
df = dataset.data.set_index("timestamp")
print("Highest daily discharge:")
print(df["value"].idxmax(), df["value"].max(), "m3/s")
print("Lowest daily discharge:")
print(df["value"].idxmin(), df["value"].min(), "m3/s")
Highest daily discharge:
2011-11-08 05:00:00 5821.3 m3/s
Lowest daily discharge:
2016-03-14 05:00:00 312.4 m3/s
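The snippet above reports the extremes of the whole record. To get one maximum and one minimum per calendar year, a groupby on the index year is one option. The sketch below uses a tiny synthetic frame with made-up values in place of dataset.data.

```python
import pandas as pd

# Synthetic stand-in spanning two calendar years (made-up values).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2010-03-01 05:00", "2010-11-15 05:00",
        "2011-02-10 05:00", "2011-11-08 05:00",
    ]),
    "value": [400.0, 3000.0, 350.0, 5800.0],
}).set_index("timestamp")

# Group by calendar year and collect each extreme value with its date.
by_year = df["value"].groupby(df.index.year)
annual = pd.DataFrame({
    "max": by_year.max(),
    "max_date": by_year.idxmax(),
    "min": by_year.min(),
    "min_date": by_year.idxmin(),
})
annual.index.name = "year"
print(annual)
```

Grouping by calendar year is an assumption; a hydrological year with a different start month would need a shifted grouping key.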
Comparing multiple variables¶
Fetch more than one variable from the same station and align them into a single DataFrame for side-by-side comparison:
ds_mean = station["NIVEL@NV_MEDIA_D"]
ds_max = station["NIVEL@NV_MAX_D"]
ds_min = station["NIVEL@NV_MIN_D"]
gauge = (
    ds_mean.data.set_index("timestamp").rename(columns={"value": "mean"})
    .join(ds_max.data.set_index("timestamp").rename(columns={"value": "max"}))
    .join(ds_min.data.set_index("timestamp").rename(columns={"value": "min"}))
)
print(gauge.head(3))
                     mean   max   min
timestamp
2000-01-01 05:00:00  8.42  8.91  8.03
2000-01-02 05:00:00  8.17  8.63  7.82
2000-01-03 05:00:00  7.98  8.44  7.61
Exporting the data¶
Include station metadata in the filename
A small helper keeps exported files self-describing:
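One possible shape for such a helper is sketched below. export_csv is a hypothetical function, not part of the library, and it takes the station code and variable key as plain strings. The DataFrame is a synthetic stand-in for dataset.data, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_csv(data: pd.DataFrame, station_id: str, variable_key: str,
               directory: str = ".") -> Path:
    """Write data to '<station_id>_<variable_key>.csv' (hypothetical helper)."""
    # '@' is legal in most filenames but awkward in shells, so swap it out.
    safe_key = variable_key.replace("@", "_")
    path = Path(directory) / f"{station_id}_{safe_key}.csv"
    data.to_csv(path, index=False)
    return path


# Synthetic stand-in for dataset.data.
data = pd.DataFrame({
    "timestamp": pd.date_range("2000-01-01 05:00", periods=3, freq="D"),
    "value": [1240.8, 1179.0, 1143.4],
})

with tempfile.TemporaryDirectory() as tmp:
    path = export_csv(data, "29037020", "CAUDAL@HIS_Q_MEDIA_D", directory=tmp)
    print(path.name)  # 29037020_CAUDAL_HIS_Q_MEDIA_D.csv
    # Read the file back to confirm the round trip.
    restored = pd.read_csv(path, parse_dates=["timestamp"])
```

With a real dataset, pass dataset.data and str(dataset.variable), along with the station code used in client.fetch_station.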
Plotting with dataset.plot¶
Every dataset exposes a plotting helper through the plot property, so you
can go straight from a fetched time series to common visual diagnostics.
from colombia_hydrodata import Client
import matplotlib.pyplot as plt
client = Client()
station = client.fetch_station("29037020")
dataset = (
    station["NIVEL@NV_MEDIA_D"]
    .sight_level(-0.367)
    .rescale(1 / 100)
    .interpolate()
    .deconstruction()
)
Quick plots¶
dataset.plot.time_series()
dataset.plot.histogram(column_name="detrended", bins=40)
dataset.plot.monthly_data_series(column_name="detrended")
plt.show()
time_series_analysis()¶
Use time_series_analysis() to generate a standard four-panel diagnostic view
of the decomposed series.
You can also request the stacked layout explicitly:
fig, axs = dataset.plot.time_series_analysis(
    layout="classic",
    figsize=(10, 8),
    tight_layout=True,
)
plt.show()
daily_series_analysis()¶
Use daily_series_analysis() to combine the annual cycle envelope with a
histogram, and optionally highlight specific years.
fig, axs = dataset.plot.daily_series_analysis(
    years=[2024, 2025],
    figsize=(10, 4),
    tight_layout=True,
)
axs[0].legend()
plt.show()
This is especially useful for comparing a recent year against the historical annual range.
What's next?¶
Learn how to discover and fetch stations using bounding boxes and Shapely polygons.