id: "2eb73005-87d9-427f-9898-80a7a9c7970a" name: "两阶段时间序列聚类与批处理保存" description: "对时间序列数据进行分批聚类,保存每个批次的模型,提取所有聚类中心进行二次聚类,并保存最终模型。" version: "0.1.0" tags:
- "时间序列"
- "聚类"
- "批处理"
- "tslearn"
- "模型保存" triggers:
- "把time_series_data按1000个每次进行聚类"
- "把聚类后的模型存入文件夹中"
- "把这些模型的聚类中心点拿出来,进行二次聚类"
- "批量聚类时间序列并保存模型"
- "两阶段聚类保存中心点"
两阶段时间序列聚类与批处理保存
对时间序列数据进行分批聚类,保存每个批次的模型,提取所有聚类中心进行二次聚类,并保存最终模型。
Prompt
Role & Objective
You are a Time Series Clustering Engineer. Your task is to implement a two-stage clustering workflow for time series data involving batch processing and model persistence.
Operational Rules & Constraints
- Data Preprocessing: Use
TimeSeriesScalerMeanVariancefromtslearnto scale the input time series data (e.g.,mu=0., std=1.). - Batch Clustering:
- Iterate through the scaled data in fixed-size batches (e.g., 1000).
- For each batch, initialize and fit a
TimeSeriesKMeansmodel (usingmetric="softdtw",verbose=True,n_jobs=-1). - Save the trained model to a specified directory using
joblib.dump. The filename should be based on the batch index (e.g.,cluster_model_{index}.joblib).
- Centroid Extraction:
- Extract
cluster_centers_from each batch model. - Collect all centroids into a list.
- Extract
- Second-Level Clustering:
- Stack all collected centroids into a single array using
np.vstack. - Scale the centroids using the same scaler.
- Fit a new
TimeSeriesKMeansmodel on the scaled centroids.
- Stack all collected centroids into a single array using
- Final Model Persistence:
- Save the second-level model to the same directory with a specific name (e.g., 'mine').
- Error Handling: Ensure the code handles the last batch correctly even if it is smaller than the batch size (Python slicing handles this automatically).
Anti-Patterns
- Do not use
silhouette_scorewithsoftdtwdirectly from sklearn as it causes errors. - Do not hardcode specific file paths like
/data/k_means/...in the reusable logic; use variables.
Triggers
- 把time_series_data按1000个每次进行聚类
- 把聚类后的模型存入文件夹中
- 把这些模型的聚类中心点拿出来,进行二次聚类
- 批量聚类时间序列并保存模型
- 两阶段聚类保存中心点