https://calcite.apache.org/docs/stream.html
Calcite’s SQL is an extension to standard SQL, not another ‘SQL-like’ language. The distinction is important, for several reasons:
- Streaming SQL is easy to learn for anyone who knows regular SQL.
- The semantics are clear, because we aim to produce the same results on a stream as if the same data were in a table.
- You can write queries that combine streams and tables (or the history of a stream, which is basically an in-memory table).
- Lots of existing tools can generate standard SQL.
If you don’t use the STREAM
keyword, you are back in regular standard SQL.
只是對於標准sql的擴展,StreamingSQL只是多個Stream關鍵詞
An example schema
Our streaming SQL examples use the following schema:
Orders (rowtime, productId, orderId, units)
- a stream and a tableProducts (rowtime, productId, name)
- a tableShipments (rowtime, orderId)
- a stream
以簡單的訂單,商品,發貨為例子
可以看到這里可以同時處理,流式表和靜態表;對於order即是流式表也是靜態表,意思是實時數據在流式表中,而歷史數據在靜態表中
簡單的加上STREAM就可以對流式表Orders進行查詢,結果是unbounded的;如果不帶STREAM就是對靜態表進行查詢,結果是bounded
Windows支持
- tumbling window (GROUP BY)
- hopping window (multi GROUP BY)
- sliding window (window functions)
- cascading window (window functions)
Tumbling windows
在sql中,所謂window就是對於時間的group
比如下面的例子,以小時為時間窗口
How did Calcite know that the 10:00:00 sub-totals were complete at 11:00:00, so that it could emit them? It knows that rowtime
is increasing, and it knows that CEIL(rowtime TO HOUR)
is also increasing. So, once it has seen a row at or after 11:00:00, it will never see a row that will contribute to a 10:00:00 total.
A column or expression that is increasing or decreasing is said to bemonotonic.
有個問題是如何知道11點之前的數據都已經到,這個取決於rowtime是單調遞增的
所以對於group by時,必須要有一個column是單調遞增的,monotonic
If column or expression has values that are slightly out of order, and the stream has a mechanism (such as punctuation or watermarks) to declare that a particular value will never be seen again, then the column or expression is said to be quasi-monotonic.
當然rowtime可能不是嚴格單調的,所以我們可以用watermark來限定一個時間段,在這個時間范圍上是單調的;這樣稱為quasi-monotonic,擬單調
更優雅些,我們可以使用TUMBLE關鍵字
上面的例子是30分鍾的時間窗口,但是非整點,而是有12分鍾的偏移,alignment time
Hopping windows
Hopping windows are a generalization of tumbling windows that allow data to be kept in a window for a longer than the emit interval.
其實就是滑動窗口
以1小時為滑動,3小時為窗口大小
HAVING
聚合后的過濾
Sliding windows
非groupby方式的sliding windows,
Standard SQL features so-called “analytic functions” that can be used in the SELECT
clause.
Unlike GROUP BY
, these do not collapse records. For each record that goes in, one record comes out. But the aggregate function is based on a window of many rows.
SELECT STREAM rowtime, productId, units, SUM(units) OVER (ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) unitsLastHour FROM Orders;
這個和groupby的區別在於,窗口觸發的時機
對於groupby,時間整點觸發,會將窗口里records計算成一個值
而OVER,是record by record,每來一條record都會觸發一次計算,上面的例子是,對每條record都會觸發一次前一個小時的sum
這里更加復雜,
先聲明Window product,表示order by rowtime,partition by productId
再基於product,OVER生成7天和10分鍾的AVG(units)
子查詢
The previous HAVING
query can be expressed using a WHERE
clause on a sub-query:
having也可以實現成子查詢的形式
Since then, SQL has become a mathematically closed language, which means that any operation you can perform on a table can also perform on a query.
The closure property of SQL is extremely powerful. Not only does it render HAVING
obsolete (or, at least, reduce it to syntactic sugar), it makes views possible:
sql具有閉合特性,即任何可以在table上執行的操作,也同樣可以在query上執行,因為query的結果也是一個關系表
所以上面通過create view創建子查詢
Many people find that nested queries and views are even more useful on streams than they are on relations.
Streaming queries are pipelines of operators all running continuously, and often those pipelines get quite long. Nested queries and views help to express and manage those pipelines.
嵌套查詢對於Streaming非常有用,因為流其實就是一組operators的pipelines;以嵌套查詢或view的方式去表示會很方便
And, by the way, a WITH
clause can accomplish the same as a sub-query or a view:
With關鍵詞,用於實現子查詢或view
Sorting
Joining streams to tables
A stream-to-table join is straightforward if the contents of the table are not changing.
這個很直接,但有個問題是,靜態表是會變化的,當數據record流過來時,我們需要和record發生時靜態表做join,但如果靜態表已經變化了,我們只能取到最新值
要解決這個問題,我們需要為靜態表,創建版本表,保存每個時間的版本
One way to implement this is to have a table that keeps every version with a start and end effective date, ProductVersions
in the following example:
當前會從productVersion里面,根據record rowtime找出包含這個時間的版本
Joining streams to streams
DML
It’s not only queries that make sense against streams; it also makes sense to run DML statements (INSERT
, UPDATE
, DELETE
, and also their rarer cousins UPSERT
and REPLACE
) against streams.
DML is useful because it allows you do materialize streams or tables based on streams, and therefore save effort when values are used often.