diff --git a/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md b/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md
index 722a30b52..f884c3d75 100644
--- a/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md
+++ b/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md
@@ -5388,3 +5388,89 @@ Output Series:
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function takes a **single input time series**, splits it into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, and clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric input series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are dropped (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|---------|---------|--------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window has `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to `kmeans` if omitted. |
+| `norm` | Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; must be in `(0, 1]`. |
+
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns windows by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via SVD on the cluster matrix.
+- **medoidshape**: Performs a coarse clustering, then greedily selects `k` representative subsequences; `sample_rate` controls how many candidates are sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of points = number of full windows, `⌊n/l⌋`. The timestamp of each point is the **time of the first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of points = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are emitted in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** (default `norm` = `true`), each output row is the cluster id for one window; timestamps are the window start times. Resulting labels: **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
\ No newline at end of file
diff --git a/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md b/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md
index 5289bde10..5c510ce75 100644
--- a/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md
+++ b/src/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md
@@ -5447,3 +5447,89 @@ Output Series:
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function takes a **single input time series**, splits it into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, and clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric input series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are dropped (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|---------|---------|--------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window has `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to `kmeans` if omitted. |
+| `norm` | Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; must be in `(0, 1]`. |
+
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns windows by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via SVD on the cluster matrix.
+- **medoidshape**: Performs a coarse clustering, then greedily selects `k` representative subsequences; `sample_rate` controls how many candidates are sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of points = number of full windows, `⌊n/l⌋`. The timestamp of each point is the **time of the first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of points = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are emitted in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** (default `norm` = `true`), each output row is the cluster id for one window; timestamps are the window start times. Resulting labels: **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
\ No newline at end of file
diff --git a/src/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md b/src/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md
index d49ecd847..aba685324 100644
--- a/src/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md
+++ b/src/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md
@@ -5387,3 +5387,89 @@ Output Series:
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### 9.4 Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function takes a **single input time series**, splits it into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, and clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric input series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are dropped (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|---------|---------|--------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window has `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to `kmeans` if omitted. |
+| `norm` | Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; must be in `(0, 1]`. |
+
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns windows by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via SVD on the cluster matrix.
+- **medoidshape**: Performs a coarse clustering, then greedily selects `k` representative subsequences; `sample_rate` controls how many candidates are sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of points = number of full windows, `⌊n/l⌋`. The timestamp of each point is the **time of the first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of points = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are emitted in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** (default `norm` = `true`), each output row is the cluster id for one window; timestamps are the window start times. Resulting labels: **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
\ No newline at end of file
diff --git a/src/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md b/src/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md
index d317569c2..42d7e2392 100644
--- a/src/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md
+++ b/src/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md
@@ -5446,3 +5446,89 @@ Output Series:
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### 9.4 Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function takes a **single input time series**, splits it into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, and clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric input series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are dropped (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|---------|---------|--------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window has `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to `kmeans` if omitted. |
+| `norm` | Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; must be in `(0, 1]`. |
+
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns windows by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via SVD on the cluster matrix.
+- **medoidshape**: Performs a coarse clustering, then greedily selects `k` representative subsequences; `sample_rate` controls how many candidates are sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of points = number of full windows, `⌊n/l⌋`. The timestamp of each point is the **time of the first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of points = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are emitted in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** (default `norm` = `true`), each output row is the cluster id for one window; timestamps are the window start times. Resulting labels: **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
\ No newline at end of file
diff --git a/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md b/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md
index bc7d45871..75d480a09 100644
--- a/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md
+++ b/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_apache.md
@@ -5491,3 +5491,89 @@ select rm(s0, s1,"tb"="3","vb"="2") from root.test.d0
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function splits a **single input time series** into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, then clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric time series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are **dropped** (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|------|--------|------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window contains `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to k-means if omitted. |
+| `norm` | Whether to Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum number of iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; the value must be in `(0, 1]`. |
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns clusters by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via **SVD** of the cluster matrix.
+- **medoidshape**: Performs a coarse clustering first, then greedily selects `k` representative subsequences; `sample_rate` controls the number of candidates sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of rows = number of full windows, `⌊n/l⌋`. The timestamp of each row is the time of the **first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of rows = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are output in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows of length 3: `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** and the default **`norm` = `true`**, each row is the cluster id for one window, and the timestamp is the start of that window. The resulting labels are **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
+
diff --git a/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md b/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md
index cdf8428e0..c8fca680f 100644
--- a/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md
+++ b/src/zh/UserGuide/V1.3.x/SQL-Manual/UDF-Libraries_timecho.md
@@ -5477,3 +5477,89 @@ select rm(s0, s1,"tb"="3","vb"="2") from root.test.d0
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function splits a **single input time series** into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, then clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric time series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are **dropped** (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|------|--------|------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window contains `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to k-means if omitted. |
+| `norm` | Whether to Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum number of iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; the value must be in `(0, 1]`. |
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns clusters by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via **SVD** of the cluster matrix.
+- **medoidshape**: Performs a coarse clustering first, then greedily selects `k` representative subsequences; `sample_rate` controls the number of candidates sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of rows = number of full windows, `⌊n/l⌋`. The timestamp of each row is the time of the **first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of rows = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are output in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows of length 3: `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** and the default **`norm` = `true`**, each row is the cluster id for one window, and the timestamp is the start of that window. The resulting labels are **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
+
diff --git a/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md b/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md
index d33ad35f7..1d2d4f6a8 100644
--- a/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md
+++ b/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_apache.md
@@ -5490,3 +5490,89 @@ select rm(s0, s1,"tb"="3","vb"="2") from root.test.d0
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### 9.4 Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function splits a **single input time series** into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, then clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric time series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are **dropped** (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|------|--------|------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window contains `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to k-means if omitted. |
+| `norm` | Whether to Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum number of iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; the value must be in `(0, 1]`. |
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns clusters by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via **SVD** of the cluster matrix.
+- **medoidshape**: Performs a coarse clustering first, then greedily selects `k` representative subsequences; `sample_rate` controls the number of candidates sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of rows = number of full windows, `⌊n/l⌋`. The timestamp of each row is the time of the **first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of rows = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are output in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows of length 3: `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** and the default **`norm` = `true`**, each row is the cluster id for one window, and the timestamp is the start of that window. The resulting labels are **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
+
diff --git a/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md b/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md
index a1283b5b3..1203ad258 100644
--- a/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md
+++ b/src/zh/UserGuide/latest/SQL-Manual/UDF-Libraries_timecho.md
@@ -5478,3 +5478,89 @@ select rm(s0, s1,"tb"="3","vb"="2") from root.test.d0
 +-----------------------------+-----------------------------------------------------+
 ```
 
+### 9.4 Cluster
+
+#### Registration statement
+
+```sql
+create function cluster as 'org.apache.iotdb.library.dlearn.UDTFCluster'
+```
+
+#### Usage
+
+This function splits a **single input time series** into **non-overlapping** contiguous subsequences (windows) of fixed length `l`, then clusters those subsequences into `k` groups.
+
+**Name:** Cluster
+
+**Input Series:** Supports only a single numeric time series of type INT32 / INT64 / FLOAT / DOUBLE. Points are read in time order; trailing samples that do not fill a full window are **dropped** (only `⌊n/l⌋` windows are used, where `n` is the number of valid points).
+
+**Parameters:**
+
+| Name | Meaning | Default | Notes |
+|------|------|--------|------|
+| `l` | Subsequence (window) length | (required) | Positive integer; each window contains `l` consecutive samples. |
+| `k` | Number of clusters | (required) | Integer ≥ 2. |
+| `method` | Clustering algorithm | `kmeans` | One of `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to k-means if omitted. |
+| `norm` | Whether to Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
+| `maxiter` | Maximum number of iterations | `200` | Positive integer. |
+| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: the `k` centroid vectors concatenated in cluster order. |
+| `sample_rate` | Greedy sampling rate | `0.3` | Used only when **`method` = `medoidshape`**; the value must be in `(0, 1]`. |
+
+**`method` details:**
+
+- **kmeans**: k-means in Euclidean space (optionally after per-window normalization).
+- **kshape**: Assigns clusters by shape-based distance (SBD, derived from normalized cross-correlation, NCC); centroids are updated via **SVD** of the cluster matrix.
+- **medoidshape**: Performs a coarse clustering first, then greedily selects `k` representative subsequences; `sample_rate` controls the number of candidates sampled in each round.
+
+**Output Series:** Controlled by `output`:
+
+- **`output` = `label` (default):** One output series of type **INT32**. Number of rows = number of full windows, `⌊n/l⌋`. The timestamp of each row is the time of the **first sample** in that window; the value is the cluster id, **0 … k−1**.
+- **`output` = `centroid`:** One output series of type **DOUBLE**. Number of rows = **`k × l`**: for clusters **0 → k−1**, the `l` components of each centroid are output in order (concatenated). Timestamps are `0, 1, 2, …` (placeholders only, with no physical time meaning).
+
+**Note:**
+
+- The valid point count must satisfy `n ≥ l`, and the window count must satisfy `⌊n/l⌋ ≥ k`.
+
+#### Examples
+
+##### KShape: window length 3, k = 2
+
+Nine samples `{1,2,3,10,20,30,1,5,1}` form three non-overlapping windows of length 3: `{1,2,3}`, `{10,20,30}`, `{1,5,1}`. With **`method` = `kshape`** and the default **`norm` = `true`**, each row is the cluster id for one window, and the timestamp is the start of that window. The resulting labels are **0, 0, 1**.
+
+Input Series:
+
+```
++-----------------------------+---------------+
+|                         Time|root.test.d0.s0|
++-----------------------------+---------------+
+|2020-01-01T00:00:01.000+08:00|            1.0|
+|2020-01-01T00:00:02.000+08:00|            2.0|
+|2020-01-01T00:00:03.000+08:00|            3.0|
+|2020-01-01T00:00:04.000+08:00|           10.0|
+|2020-01-01T00:00:05.000+08:00|           20.0|
+|2020-01-01T00:00:06.000+08:00|           30.0|
+|2020-01-01T00:00:07.000+08:00|            1.0|
+|2020-01-01T00:00:08.000+08:00|            5.0|
+|2020-01-01T00:00:09.000+08:00|            1.0|
++-----------------------------+---------------+
+```
+
+SQL for query:
+
+```sql
+select cluster(s0, "l"="3", "k"="2", "method"="kshape", "output"="label")
+from root.test.d0
+```
+
+Output Series:
+
+```
++-----------------------------+----------------------------------------------------------------------------+
+|                         Time|cluster(root.test.d0.s0,"l"="3","k"="2","method"="kshape","output"="label")|
++-----------------------------+----------------------------------------------------------------------------+
+|2020-01-01T00:00:01.000+08:00|                                                                          0|
+|2020-01-01T00:00:04.000+08:00|                                                                          0|
+|2020-01-01T00:00:07.000+08:00|                                                                          1|
++-----------------------------+----------------------------------------------------------------------------+
+```
+
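As a rough illustration of the windowing and per-window normalization that the Cluster documentation above describes, the following Python sketch splits the example series into non-overlapping windows and z-normalizes each one. It is not the UDF's implementation: the helper names `split_windows` and `z_normalize`, the use of the population standard deviation, and the handling of constant windows are all assumptions made for illustration.

```python
import math


def split_windows(times, values, l):
    """Split a series into non-overlapping windows of length l.

    Trailing samples that do not fill a full window are dropped, so only
    floor(n / l) windows are returned. Each window is paired with the
    timestamp of its first sample (hypothetical helper, not the UDF's API).
    """
    if l <= 0:
        raise ValueError("l must be a positive integer")
    count = len(values) // l  # floor(n / l) full windows
    starts = [times[i * l] for i in range(count)]
    windows = [values[i * l:(i + 1) * l] for i in range(count)]
    return starts, windows


def z_normalize(window):
    """Z-score normalize one window.

    Assumption: population standard deviation; a constant window maps to zeros.
    """
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


# The nine samples from the documented example, with timestamps 1..9 (seconds).
times = list(range(1, 10))
values = [1, 2, 3, 10, 20, 30, 1, 5, 1]

starts, windows = split_windows(times, values, 3)
normalized = [z_normalize(w) for w in windows]

print(starts)   # timestamps of the first sample of each window
print(windows)  # the three full windows
```

Under these assumptions, `{1,2,3}` and `{10,20,30}` normalize to the same vector, which is consistent with those two windows sharing a label in the documented example.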