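Preface: As a programmer, I keep hearing new technical buzzwords: big data, cloud computing, real-time processing, stream processing, in-memory computing... But what do people actually mean when they use these trendy terms? I happened to come across a good post on the subject, so here is a summary of the difference between real-time processing and stream processing.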
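Main text: To talk about real-time processing, we first have to mention real-time systems (Real-time System). A real-time system is one that responds to requests within strict time limits. For example, if a system can strictly guarantee that it processes a NASDAQ stock quote arriving from the network within 10 milliseconds, then it counts as a real-time system, regardless of whether it achieves this in software or in hardware, or by what design.
Simple as that sounds, such systems are very hard to build in the real world, especially in software. Your process can be preempted by other processes at any time, so the CPU scheduler cannot guarantee it the time and resources needed to finish responding within a strict deadline. This is why various real-time operating system kernels exist. Real-world examples of real-time systems that come to mind are high-precision software systems such as military missile control systems and the space shuttle.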
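So what is real-time processing (Real-time Processing or Computing)? It is similar to a real-time system, but the software industry does not seem to have a clear definition of the word "real-time". For example, many people talk about real-time trading, which really just reflects that market data changes constantly and decisions are often made within milliseconds. An example of soft real-time (Soft Real-time): Amazon requires every software subsystem to either return a result or fail immediately within 100-200 milliseconds for 99% of the requests it handles.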
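To make the soft real-time idea concrete, here is a minimal sketch that checks whether a batch of measured latencies meets a "99% within 200 ms" target. The function name `meets_soft_deadline`, its parameters, and the sample numbers are assumptions for illustration only; they do not describe how Amazon actually measures or enforces its requirement.

```python
def meets_soft_deadline(latencies_ms, deadline_ms=200.0, required_fraction=0.99):
    """Return True if at least `required_fraction` of requests met the deadline.

    Hypothetical helper: a soft real-time target is a probability,
    not an absolute bound, so a few slow requests are tolerated.
    """
    if not latencies_ms:
        return True  # no requests, nothing violated
    on_time = sum(1 for t in latencies_ms if t <= deadline_ms)
    return on_time / len(latencies_ms) >= required_fraction


# Made-up measurements: 99 fast requests and 1 slow one out of 100.
mostly_fast = [120.0] * 99 + [950.0]
# 98 fast requests and 2 slow ones out of 100.
too_many_slow = [120.0] * 98 + [950.0, 950.0]

print(meets_soft_deadline(mostly_fast))    # True
print(meets_soft_deadline(too_many_slow))  # False
```

A hard real-time system would correspond to `required_fraction=1.0`: a single late response counts as a failure.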
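Having covered real-time processing, let's look at stream processing (Stream Processing). As the name suggests, stream processing means that as data flows continuously through the system, the system computes on it continuously.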
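As a rough illustration of "computing continuously as data flows through", the sketch below uses a Python generator that emits an updated result after every input item instead of waiting for the whole data set. The name `running_word_counts` and the sample input are made up for this example.

```python
def running_word_counts(lines):
    """Yield the cumulative word count after each incoming line.

    `lines` can be any iterable, including an unbounded one:
    a result is produced as soon as each item arrives.
    """
    total = 0
    for line in lines:
        total += len(line.split())
        yield total


# A finite stand-in for an endless stream of text.
stream = iter(["to be or", "not to be"])
print(list(running_word_counts(stream)))  # [3, 6]
```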
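So stream processing has no strict time limit. It may take a while for data to go from entering the system to producing a result. The only constraint on stream processing is that, over the long run, the system's output rate should be greater than or at least equal to its input rate. Otherwise data would pile up inside the system (where else would it go?). In that case, whether the processing happens in memory, on flash, or on disk, the space will run out sooner or later. It is like an avalanche: the system gets slower and slower while the backlog keeps growing.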
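The rate constraint can be seen in a small simulation. This is a simplified sketch assuming a single processing stage, a constant arrival rate, and a constant capacity per tick; `simulate_backlog` and the numbers are hypothetical.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the number of queued items after each tick."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                # new data flows in
        backlog -= min(backlog, capacity_per_tick)  # process what we can
        history.append(backlog)
    return history


# Output rate >= input rate: the queue stays empty.
print(simulate_backlog(10, 12, 5))  # [0, 0, 0, 0, 0]
# Output rate < input rate: the backlog grows without bound.
print(simulate_backlog(10, 8, 5))   # [2, 4, 6, 8, 10]
```

In the second run the backlog grows by 2 items every tick, so any finite storage is eventually exhausted.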
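So we can say that Storm is a framework for building stream processing systems.
If our code can guarantee a bounded processing time at every Bolt node in a Storm Topology, then we have in effect used Storm to build a (soft) real-time system. As an aside, Spark is primarily an in-memory computing framework, but with the addition of the Spark Streaming subproject it can split a data stream into pieces and convert them into RDDs for subsequent computation, so it also supports stream processing (before that, Spark only took a fixed chunk of data as input).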
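The idea of cutting a stream into small batches can be sketched in plain Python. This is not Spark's API: `micro_batches`, the `(timestamp, value)` event format, and the interval are assumptions chosen for illustration, and the sketch simply skips intervals that contain no events.

```python
from itertools import groupby


def micro_batches(events, interval_s):
    """Group (timestamp_s, value) events into fixed-length time windows.

    Assumes events arrive sorted by timestamp. Yields
    (window_start_s, [values]) for every non-empty window.
    """
    def window_index(event):
        return int(event[0] // interval_s)

    for window, group in groupby(events, key=window_index):
        yield window * interval_s, [value for _, value in group]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
for start, batch in micro_batches(events, 1.0):
    print(start, batch)
# 0.0 ['a', 'b']
# 1.0 ['c']
# 3.0 ['d']
```

Each yielded batch is a fixed, finite chunk of data, so it can be handed to the same kind of computation that would otherwise run on a static data set.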
Original: What's the difference between real-time processing and stream processing?
“Usually, a system is called a real time system if it has tight deadlines within which a result is guaranteed. For example, you can consider your TV to be a real time processing system: given an analog or digital input, within say 1ms, a corresponding phosphor dot will light up on the screen. In the context of software systems, a system is usually called a real time system if it has responses that are guaranteed within hard "real-world" time deadlines. For example, a system that guarantees the processing of a NASDAQ stock quote coming in from the network within 10 ms would be considered a real time processing system: whether this is achieved by using a software architecture that utilizes continuous (stream) processing or one shot processing in hardware is immaterial. The fact that there is a reasonably small real-world guaranteed deadline for the processing makes it a real time system.
“In practice though real time systems are extremely hard to implement using common software systems. For example, the vanilla linux kernel isn't a real time kernel: certain operations such as process scheduling, network packet processing etc. are implemented using algorithms that don't guarantee a hard time limit. eg. If your process is preempted from CPU resources by a higher priority process, the scheduler may not give your process the CPU resources it needs to guarantee a response in the given deadline (depending on the scheduling algorithm). The same thing applies to network packets. There are, of course, flavors of the kernel available that provide real time scheduling guarantees for processes etc. (QNX [1] comes to mind) Software systems in this area usually go for a flavor of real time processing called soft real time computing where the deadline is not an absolute but a probability. For example, Amazon requires all the software subcomponents on its page to provide a result or fail within 100-200ms for 99% of all requests. This gives it a soft real time guarantee that a page will render within a given time limit.
“Stream processing on the other hand refers to a method of continuous computation that happens as data is flowing through the system. There are no compulsory time limitations in stream processing. For example, a system that simply output the count of words present in a Tweet for 99.9% of the tweets it encountered but output the complete works of Shakespeare for the remaining 0.1% of tweets is a valid stream processing system. There is no fixed time deadline on the output of the system when an input is received: the data is processed as it comes in and sometimes data might be awaiting processing. The only constraint on such a stream processing system is that its long term output rate should be faster or at least equal to the long term data input rate (otherwise the storage requirements of the system grow without bound). Additionally, it must have enough memory to store queued inputs should it be stuck while processing any item in the input stream.
“Given this context, I'm sure it's easy to figure out that Storm is a stream processing system. You can use Storm to develop a (soft) real time system if you can place guarantees on the processing duration for all inputs at every stage of the topology.”