Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy



This paper appeared in the SIGCOMM 2018 Workshop on Mobile Edge Communications (MECOMM).



Authors: En Li, Zhi Zhou, and Xu Chen (School of Data and Computer Science, Sun Yat-sen University).

ABSTRACT

As the backbone technology of machine learning, deep neural networks (DNNs) have quickly ascended to the spotlight. Running DNNs on resource-constrained mobile devices is, however, by no means trivial, since it incurs high performance and energy overhead. Offloading DNNs to the cloud for execution, meanwhile, suffers from unpredictable performance due to long, uncontrolled wide-area network latency. To address these challenges, in this paper we propose Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Edgent pursues two design knobs: (1) DNN partitioning, which adaptively partitions DNN computation between device and edge in order to leverage hybrid computation resources in proximity for real-time DNN inference; (2) DNN right-sizing, which accelerates DNN inference through early-exit at a proper intermediate DNN layer to further reduce the computation latency. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent's effectiveness in enabling on-demand low-latency edge intelligence.


1 INTRODUCTION & RELATED WORK

As the backbone technology supporting modern intelligent mobile applications, Deep Neural Networks (DNNs) represent the most commonly adopted machine learning technique and have become increasingly popular. Due to DNNs' ability to perform highly accurate and reliable inference tasks, they have witnessed successful applications in a broad spectrum of domains from computer vision [14] to speech recognition [12] and natural language processing [16]. However, as DNN-based applications typically require a tremendous amount of computation, they cannot be well supported by today's mobile devices with reasonable latency and energy consumption.


In response to the excessive resource demand of DNNs, the traditional wisdom resorts to the powerful cloud datacenter for training and evaluating DNNs. Input data generated by mobile devices is sent to the cloud for processing, and the results are sent back to the mobile devices after inference. However, with such a cloud-centric approach, large amounts of data (e.g., images and videos) are uploaded to the remote cloud via a long wide-area network data transmission, resulting in high end-to-end latency and energy consumption on the mobile devices. To alleviate the latency and energy bottlenecks of the cloud-centric approach, a better solution is to exploit the emerging edge computing paradigm. Specifically, by pushing cloud capabilities from the network core to the network edges (e.g., base stations and WiFi access points) in close proximity to devices, edge computing enables low-latency and energy-efficient DNN inference.


While recognizing the benefits of edge-based DNN inference, our empirical study reveals that its performance is highly sensitive to the available bandwidth between the edge server and the mobile device. Specifically, as the bandwidth drops from 1Mbps to 50Kbps, the latency of edge-based DNN inference climbs from 0.123s to 2.317s and becomes on par with the latency of local processing on the device. Considering the volatile network bandwidth in realistic environments (e.g., due to user mobility and bandwidth contention among various apps), a natural question is whether we can further improve the performance (i.e., latency) of edge-based DNN execution, especially for mission-critical applications such as VR/AR games and robotics [13].


To answer the above question in the positive, in this paper we propose Edgent, a deep learning model co-inference framework with device-edge synergy. Towards low-latency edge intelligence (as an initial exploration, we focus on execution latency and leave energy consumption to future work), Edgent pursues two design knobs. The first is DNN partitioning, which adaptively partitions DNN computation between the mobile device and the edge server based on the available bandwidth, thus taking advantage of the processing power of the edge server while reducing the data transfer delay. It is worth noting, however, that the latency after DNN partitioning is still constrained by the part of the model that runs on the device. Therefore, Edgent further combines DNN partitioning with DNN right-sizing, which accelerates DNN inference through early-exit at an intermediate DNN layer. Needless to say, early-exit naturally gives rise to a latency-accuracy tradeoff (i.e., early-exit harms the accuracy of the inference). To address this challenge, Edgent jointly optimizes DNN partitioning and right-sizing in an on-demand manner. That is, for mission-critical applications that typically have a predefined deadline, Edgent maximizes the accuracy without violating the deadline. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent's effectiveness in enabling on-demand low-latency edge intelligence.


While the topic of edge intelligence has begun to garner much attention recently, our study is different from and complementary to existing pilot efforts. On one hand, for fast and low-power DNN inference on the mobile device side, various approaches exemplified by DNN compression and DNN architecture optimization have been proposed [3–5, 7, 9]. Different from these works, we take a scale-out approach to unleash the benefits of collaborative edge intelligence between the edge and mobile devices, and thus to mitigate the performance and energy bottlenecks of the end devices. On the other hand, though the idea of DNN partitioning between the cloud and end devices is not new [6], realistic measurements show that DNN partitioning alone is not enough to satisfy the stringent timeliness requirements of mission-critical applications. Therefore, we further apply DNN right-sizing to speed up DNN inference.


2 BACKGROUND & MOTIVATION

In this section, we first give a primer on DNN, then analyse the inefficiency of edge- or device-based DNN execution, and finally illustrate the benefits of DNN partitioning and right-sizing with device-edge synergy towards low-latency edge intelligence.


2.1 A Primer on DNN

DNN represents the core machine learning technique for a broad spectrum of intelligent applications spanning computer vision, automatic speech recognition and natural language processing. As illustrated in Fig. 1, computer vision applications use DNNs to extract features from an input image and classify the image into one of the pre-defined categories. A typical DNN model is organized as a directed graph of interconnected layers, each comprising a number of nodes. Each node is a neuron that applies certain operations to its input and generates an output. The input layer is fed with raw data, while the output layer determines the category of the data. The process of passing forward from the input layer to the output layer is called model inference. For a typical DNN containing tens of layers and hundreds of nodes per layer, the number of parameters can easily reach the scale of millions; thus, DNN inference is computationally intensive. Note that in this paper we focus on DNN inference, since DNN training is generally delay-tolerant and is typically conducted offline using powerful cloud resources.


Figure 1: A 4-layer DNN for computer vision.
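To make layer-by-layer inference concrete, the following minimal sketch (ours, not from the paper; the layer sizes and random weights are purely illustrative) passes a flattened image through a small fully-connected network and returns a category:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical weights for a tiny fully-connected DNN; the sizes are
# illustrative, not those of any model in the paper.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((3072, 256)),   # input layer (e.g., a 32x32x3 image, flattened)
          rng.standard_normal((256, 128)),    # hidden layer
          rng.standard_normal((128, 10))]     # output layer (10 categories)

def infer(x):
    # Forward pass: each layer transforms the output of the previous one.
    for W in layers[:-1]:
        x = relu(x @ W)
    logits = x @ layers[-1]
    return int(np.argmax(logits))             # index of the predicted category

print(infer(rng.standard_normal(3072)))
```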

2.2 Inefficiency of Device- or Edge-based DNN Inference

Currently, the status quo of mobile DNN inference is either direct execution on the mobile device or offloading to the cloud/edge server for execution. Unfortunately, both approaches may suffer from poor performance (i.e., long end-to-end latency), making it hard to satisfy real-time intelligent mobile applications (e.g., AR/VR mobile gaming and intelligent robots) [2]. As an illustration, we take a Raspberry Pi tiny computer and a desktop PC to emulate the mobile device and the edge server respectively, running the classical AlexNet [1] DNN model for image recognition over the Cifar-10 dataset [8]. Fig. 2 plots the breakdown of the end-to-end latency of the different approaches under varying bandwidth between the edge and the mobile device. It clearly shows that it takes more than 2s to execute the model on the resource-limited Raspberry Pi. Moreover, the performance of the edge-based execution approach is dominated by the input data transmission time (the edge server computation time stays at ∼10ms) and is thus highly sensitive to the available bandwidth. Specifically, as the available bandwidth drops from 1Mbps to 50Kbps, the end-to-end latency climbs from 0.123s to 2.317s. Considering the network bandwidth scarcity in practice (e.g., due to network resource contention among users and apps) and the computing resource limitations of mobile devices, both device- and edge-based approaches struggle to support the many emerging real-time intelligent mobile applications with stringent latency requirements.


Figure 2: AlexNet runtime.

2.3 Enabling Edge Intelligence with DNN Partitioning and Right-Sizing

DNN Partitioning: For a better understanding of the performance bottleneck of DNN execution, we further break down the runtime (on Raspberry Pi) and the output data size of each layer in Fig. 3. Interestingly, the runtime and output data size of different layers exhibit great heterogeneity, and layers with a long runtime do not necessarily have a large output data size. An intuitive idea is therefore DNN partitioning, i.e., partitioning the DNN into two parts and offloading the computationally intensive one to the server at a low transmission overhead, thus reducing the end-to-end latency. For illustration, we choose the second local response normalization layer (i.e., lrn_2) in Fig. 3 as the partition point and offload the layers before the partition point to the edge server while running the remaining layers on the device. By partitioning the DNN between device and edge, we can harness hybrid computation resources in proximity for low-latency DNN inference, as sketched after Fig. 3 below.


Figure 3: Layer runtime of AlexNet on the Raspberry Pi.
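The mechanics of partitioned co-inference can be sketched as follows (our illustration; run_layers and send_to_device are hypothetical placeholders for the framework's layer execution and the real network transfer):

```python
def run_layers(layers, x):
    """Execute a consecutive slice of layers on the local processor."""
    for layer in layers:
        x = layer(x)
    return x

def co_inference(layers, x, partition_point, send_to_device):
    """One co-inference pass for a given partition decision.

    As in the paper's setup, the edge server executes the layers before
    the partition point and the mobile device executes the rest;
    `send_to_device` is a hypothetical stand-in for the real transfer
    of the intermediate tensor over the network.
    """
    intermediate = run_layers(layers[:partition_point], x)     # on the edge server
    intermediate = send_to_device(intermediate)                # only this crosses the network
    return run_layers(layers[partition_point:], intermediate)  # on the mobile device
```

Boundary cases (running entirely on one side, where no transfer is needed) are omitted here for brevity; they are handled explicitly in the optimization of Sec. 3.3.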

DNN Right-Sizing: While DNN partitioning greatly reduces the latency by blending the computing power of the edge server and the mobile device, we should note that the latency of the optimal DNN partition is still constrained by the runtime of the layers left on the device. For further latency reduction, DNN right-sizing can be combined with DNN partitioning. DNN right-sizing promises to accelerate model inference through an early-exit mechanism: by training a DNN model with multiple exit points, each of a different size, we can choose a DNN with a small size tailored to the application demand, alleviating the computing burden on the device and thus reducing the total latency. Fig. 4 illustrates a branchy AlexNet with five exit points. The early-exit mechanism is currently supported by the open-source framework BranchyNet [15]. Intuitively, DNN right-sizing further reduces the amount of computation required by DNN inference tasks.


Figure 4: An example of the early-exit mechanism for DNN right-sizing.
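In code, selecting a sub-model by exit point might look like the sketch below (ours; layers, exit_branches and the classifier callables are hypothetical stand-ins for a BranchyNet-style model):

```python
def run_submodel(layers, exit_branches, exit_point, x):
    """Run the sub-model that terminates at the chosen exit point.

    `layers` is the shared backbone of the branchy model and
    `exit_branches` maps an exit index to (backbone_depth, classifier);
    both are hypothetical stand-ins for a BranchyNet-style network.
    """
    depth, classifier = exit_branches[exit_point]
    for layer in layers[:depth]:   # earlier exits traverse fewer layers...
        x = layer(x)
    return classifier(x)           # ...trading accuracy for lower latency
```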

Problem Description: Obviously, DNN right-sizing incurs a latency-accuracy tradeoff: while early-exit reduces the computing time on the device side, it also deteriorates the accuracy of the DNN inference. Considering that some applications (e.g., VR/AR games) have stringent deadline requirements but can tolerate a moderate accuracy loss, we strike a balance between latency and accuracy in an on-demand manner. Particularly, given a predefined and stringent latency goal, we maximize the accuracy without violating the deadline requirement. More specifically, the problem to be addressed in this paper can be summarized as: given a predefined latency requirement, how to jointly optimize the decisions of DNN partitioning and right-sizing, in order to maximize DNN inference accuracy.
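Stated compactly (our formulation, using the notation later defined in Sec. 3.3: M exit points, N_i layers in the i-th sub-model, end-to-end latency A_{i,p} under partition point p; the deadline T and sub-model accuracy acc(i) are symbols we introduce here):

```latex
\max_{i \in \{1,\dots,M\},\; p \in \{1,\dots,N_i\}} \; \mathrm{acc}(i)
\qquad \text{subject to} \qquad A_{i,p} \le T
```

Since acc(i) increases with i, it suffices to scan the exit points from the largest model downwards and, for each, check whether any partition point meets the deadline; this is exactly what Algorithm 1 (Sec. 3.3) does.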


3 FRAMEWORK

We now outline the initial design of Edgent, a framework that automatically and intelligently selects the best partition point and exit point of a DNN model to maximize the accuracy while satisfying the execution latency requirement.


3.1 System Overview

Fig. 5 shows the overview of Edgent. Edgent consists of three stages: the offline training stage, the online optimization stage, and the co-inference stage.


Figure 5: Edgent overview.

At the offline training stage, Edgent performs two initializations: (1) profiling the mobile device and the edge server to generate regression-based performance prediction models (Sec. 3.2) for the different types of DNN layers (e.g., convolution, pooling, etc.); (2) using BranchyNet to train DNN models with various exit points, thus enabling early-exit. Note that the performance profiling is infrastructure-dependent, while the DNN training is application-dependent. Thus, for a given set of infrastructures (i.e., mobile devices and edge servers) and applications, the two initializations only need to be done once, in an offline manner.
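A minimal timing harness for the profiling step might look like this (our sketch; layer_fn and make_input are hypothetical hooks into the actual framework, Chainer in our prototype):

```python
import time

def profile_layer(layer_fn, make_input, runs=50):
    """Estimate the average execution latency (ms) of one DNN layer.

    `layer_fn` runs a single layer and `make_input` builds an input of
    the desired size; both are hypothetical placeholders for hooks into
    the deep learning framework.
    """
    x = make_input()
    layer_fn(x)                              # warm-up, excludes one-off setup cost
    start = time.perf_counter()
    for _ in range(runs):
        layer_fn(x)
    return (time.perf_counter() - start) / runs * 1000.0
```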


At the online optimization stage, the DNN optimizer selects the best partition point and early-exit point of the DNN to maximize the accuracy while providing a performance guarantee on the end-to-end latency, based on three inputs: (1) the profiled layer-latency prediction models and the BranchyNet-trained DNN models of various sizes; (2) the observed available bandwidth between the mobile device and the edge server; (3) the pre-defined latency requirement. The optimization algorithm is detailed in Sec. 3.3.


At the co-inference stage, according to the partition and early-exit plan, the edge server executes the layers before the partition point and the mobile device runs the remaining layers.


3.2 Layer Latency Prediction

When estimating the runtime of a DNN, Edgent models the per-layer latency rather than modeling at the granularity of the whole DNN. This greatly reduces the profiling overhead, since there are only a very limited number of layer classes. Through experiments, we observe that the latency of each layer type is determined by various independent variables (e.g., input data size, output data size), which are summarized in Table 1. We also observe that the DNN model loading time has an obvious impact on the overall runtime; therefore, we further take the DNN model size as an input parameter to predict the model loading time. Based on the above inputs for each layer, we establish a regression model to predict the latency of each layer from its profile. The final regression models of some typical layers are shown in Table 2 (sizes are in bytes and latencies in ms).
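As a sketch of how such a per-layer-type regression model can be fitted (ours, with made-up sample values; the paper's actual fitted models are those listed in Table 2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical profiling samples for one layer type: each row holds the
# independent variables of Table 1 (here, input and output data size in
# bytes); the latencies are made-up illustrative values in ms.
X = np.array([[32_768, 65_536],
              [65_536, 131_072],
              [131_072, 262_144]], dtype=float)
y = np.array([4.1, 8.3, 16.9])

model = LinearRegression().fit(X, y)   # latency ~ a*in_size + b*out_size + c

def predict_layer_latency(in_bytes, out_bytes):
    """Predict the latency (ms) of one layer from its profile."""
    return float(model.predict([[in_bytes, out_bytes]])[0])
```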


3.3 Joint Optimization on DNN Partition and DNN Right-Sizing

At the online optimization stage, the DNN optimizer receives the latency requirement from the mobile device, and then searches for the optimal exit point and partition point of the trained branchy model. The whole process is given in Algorithm 1. For a branchy model with M exit points, we denote by N_i the number of layers of the i-th exit point; a larger exit point index i corresponds to a more accurate inference model of larger size. We use the above-mentioned regression models to predict ED_j, the runtime of the j-th layer when it runs on the device, and ES_j, its runtime when it runs on the server. D_p denotes the output data size of the p-th layer. Under a specific bandwidth B and with input data of size Input, the whole runtime A_{i,p} when the p-th layer is the partition point of the model with the i-th exit point is

A_{i,p} = \sum_{j=1}^{p-1} ES_j + \sum_{j=p}^{N_i} ED_j + \frac{Input}{B} + \frac{D_{p-1}}{B}.

When p = 1, the model runs only on the device, so ES_p = 0, D_{p-1}/B = 0 and Input/B = 0; when p = N_i, the model runs only on the server, so ED_p = 0 and D_{p-1}/B = 0. In this way, we can find the partition point with the smallest latency for the model of the i-th exit point. Since model partitioning does not affect the inference accuracy, we can then sequentially try the DNN inference models with different exit points (i.e., with different accuracy) and find the one of the largest size that satisfies the latency requirement. Note that since the regression models for layer latency prediction are trained beforehand, Algorithm 1 mainly involves linear search operations and can be executed very fast (no more than 1ms in our experiments).
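A straightforward rendering of this search in Python (our sketch, not the authors' code; ED, ES and D are assumed precomputed from the regression models, and the boundary conventions follow the text above):

```python
def best_plan(M, N, ED, ES, D, input_size, B, deadline):
    """Jointly select the exit point and partition point (sketch of
    Algorithm 1).  ED[i][j] / ES[i][j]: predicted runtime (ms) of layer
    j of the i-th sub-model on the device / server; D[i][j]: output
    size (bytes) of layer j; B: bandwidth (bytes/ms); all 0-indexed."""
    for i in range(M - 1, -1, -1):             # try the most accurate sub-model first
        best_latency, best_p = float("inf"), None
        for p in range(1, N[i] + 1):           # candidate partition points (1-indexed)
            server = sum(ES[i][j] for j in range(p - 1))         # layers 1..p-1 on the edge
            device = sum(ED[i][j] for j in range(p - 1, N[i]))   # layers p..N_i on the device
            if p == 1:                 # boundary: device-only, nothing transmitted
                latency = device
            elif p == N[i]:            # boundary: server-only by the paper's convention
                latency = server + input_size / B
            else:
                latency = server + device + input_size / B + D[i][p - 2] / B
            if latency < best_latency:
                best_latency, best_p = latency, p
        if best_latency <= deadline:           # partitioning preserves accuracy, so the
            return i, best_p                   # first feasible (largest) model is optimal
    return None                                # no exit/partition pair meets the deadline
```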


4 EVALUATION

We now present our preliminary implementation and evaluation results.


4.1 Prototype

We have implemented a simple prototype of Edgent to verify the feasibility and efficacy of our idea. To this end, we take a desktop PC to emulate the edge server; it is equipped with a quad-core Intel processor at 3.4 GHz and 8 GB of RAM, and runs Ubuntu. We further use a Raspberry Pi 3 tiny computer to act as the mobile device; the Raspberry Pi 3 has a quad-core ARM processor at 1.2 GHz and 1 GB of RAM. The available bandwidth between the edge server and the mobile device is controlled by the WonderShaper [10] tool. As the deep learning framework, we choose Chainer [11], which can well support branchy DNN structures.


For the branchy model, based on the standard AlexNet model, we train a branchy AlexNet for image recognition over the large-scale Cifar-10 dataset [8]. The branchy AlexNet has five exit points, as shown in Fig. 4 (Sec. 2), each corresponding to a sub-model of the branchy AlexNet. Note that in Fig. 4 we only draw the convolutional and fully-connected layers and omit the other layers for ease of illustration. The five sub-models have 12, 16, 19, 20 and 22 layers, respectively.


For the regression-based latency prediction model of each layer, the independent variables are shown in Table 1, and the obtained regression models are shown in Table 2.


4.2 Results

We deploy the branchy model on the edge server and the mobile device to evaluate the performance of Edgent. Specifically, since both the pre-defined latency requirement and the available bandwidth play vital roles in Edgent's optimization logic, we evaluate the performance of Edgent under various latency requirements and available bandwidths.


We first investigate the effect of the bandwidth by fixing the latency requirement at 1000ms and varying the bandwidth from 50kbps to 1.5Mbps. Fig. 6(a) shows the best partition point and exit point under different bandwidths. While the best partition point may fluctuate, we can see that the best exit point gets higher as the bandwidth increases, meaning that higher bandwidth leads to higher accuracy. Fig. 6(b) shows that as the bandwidth increases, the model runtime first drops substantially and then ascends suddenly. This is reasonable, since the accuracy improves while the latency still stays within the requirement as the bandwidth increases from 1.2Mbps to 2Mbps. It also shows that our regression-based latency prediction approach can well estimate the actual DNN model runtime. We further fix the bandwidth at 500kbps and vary the latency requirement from 100ms to 1000ms. Fig. 6(c) shows the best partition point and exit point under different latency requirements. As expected, the best exit point gets higher as the latency requirement increases, meaning that a looser latency goal leaves more room for accuracy improvement.


Figure 6: Results under different bandwidths and latency requirements.

Fig. 7 shows the model accuracy of different inference methods under different latency requirements; the accuracy is plotted as negative if the inference cannot satisfy the latency requirement. The network bandwidth is set to 400kbps. As seen in Fig. 7, at a very low latency requirement (100ms), none of the four methods can satisfy it. As the latency requirement increases, Edgent starts to produce feasible inference earlier than the other methods, at the 200ms to 300ms requirements, by using a small model with moderate inference accuracy to meet the deadline. The accuracy of the model selected by Edgent gets higher as the latency requirement relaxes.


Figure 7: Accuracy comparison under different latency requirements.

