Introduction to Nomad
Nomad is a tool for managing a cluster of machines and running applications on them.
Nomad's key features:
- Docker support: a Nomad job can use the docker driver to deploy applications onto the cluster.
- Single binary: on Linux, Nomad is just one binary and needs no other coordination services; it combines the resource manager and the scheduler into a single system.
- Multi-datacenter: jobs can be scheduled across datacenters.
- Distributed and highly available, with multiple drivers for running jobs (Docker, VMs, Java) and support for multiple operating systems (Linux, Windows, BSD, OSX).
Installing Nomad
Normally you would first install Vagrant and use it with a local VirtualBox to build a test environment. In my case, however, the local Windows 7 machine was missing some components and Vagrant could not be installed or used.
So I learned on a Linux virtual machine directly. The environment here is Ubuntu 16.04 with Docker version 17.09.0-ce.
- Download the Nomad binary; pick the package that matches your system.
# wget https://releases.hashicorp.com/nomad/0.7.0/nomad_0.7.0_linux_amd64.zip
- Unzip the package and place the nomad binary under /usr/local/bin.
# unzip -o nomad_0.7.0_linux_amd64.zip -d /usr/local/bin/
# cd /usr/local/bin
# chmod +x nomad
Type nomad in a terminal; if you see the nomad usage help, the installation succeeded.
Getting Started with Nomad
For simplicity, we run the Nomad agent in development mode. Dev mode quickly starts both a server and a client, which is convenient for testing and learning Nomad.
# nomad agent -dev
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: true
Log Level: DEBUG
Region: global (DC: dc1)
Server: true
==> Nomad agent started! Log data will stream in below:
[INFO] serf: EventMemberJoin: nomad.global 127.0.0.1
[INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
[INFO] client: using alloc directory /tmp/NomadClient599911093
[INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state
[INFO] nomad: adding server nomad.global (Addr: 127.0.0.1:4647) (DC: dc1)
[WARN] fingerprint.network: Ethtool not found, checking /sys/net speed file
[WARN] raft: Heartbeat timeout reached, starting election
[INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
[DEBUG] raft: Votes needed: 1
[DEBUG] raft: Vote granted. Tally: 1
[INFO] raft: Election won. Tally: 1
[INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
[INFO] raft: Disabling EnableSingleNode (bootstrap)
[DEBUG] raft: Node 127.0.0.1:4647 updated peer set (2): [127.0.0.1:4647]
[INFO] nomad: cluster leadership acquired
[DEBUG] client: applied fingerprints [arch cpu host memory storage network]
[DEBUG] client: available drivers [docker exec java]
[DEBUG] client: node registration complete
[DEBUG] client: updated allocations at index 1 (0 allocs)
[DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
[DEBUG] client: state updated to ready
The terminal output shows both Server: true and Client: true, meaning the agent is running as server and client at the same time.
Cluster Nodes
# nomad node-status
ID DC Name Class Drain Status
fb533fd8 dc1 yc-jumpbox <none> false ready
The output shows our node's ID (a randomly generated UUID), its datacenter, node name, node class, drain mode, and current status. We can see that our node is in the ready state.
# nomad server-members
Name Address Port Status Leader Protocol Build Datacenter Region
yc-jumpbox.global 10.30.0.52 4648 alive true 2 0.7.0 dc1 global
The output shows our own server, the address it is running on, its health, some version information, and its datacenter and region.
Stopping the Nomad Agent
You can interrupt the agent with Ctrl-C. By default, all signals cause the agent to shut down forcefully.
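If you would rather have Ctrl-C trigger a graceful leave, the agent configuration file exposes options for this; a small sketch (option names taken from the Nomad agent configuration reference):

```hcl
# In the agent config file: make an interrupt (Ctrl-C)
# perform a graceful leave instead of a forceful shutdown.
leave_on_interrupt = true

# Do the same for SIGTERM.
leave_on_terminate = true
```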
Nomad Jobs
Jobs are the main object we interact with when using Nomad.
An Example Job
Change into your working directory and run the nomad init command. It generates an example.nomad file in the current directory, a sample Nomad job specification.
# cd /tmp
# nomad init
Example job file written to example.nomad
To run the job, we use the nomad run command.
# nomad run example.nomad
==> Monitoring evaluation "13ebb66d"
Evaluation triggered by job "example"
Allocation "883269bf" created: node "e42d6f19", group "cache"
Evaluation within deployment: "b0a84e74"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "13ebb66d" finished with status "complete"
To check the job's status, we use the nomad status command.
# nomad status example
ID = example
Name = example
Submit Date = 12/05/17 10:58:40 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 1 0 0 0
Latest Deployment
ID = b0a84e74
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy
cache 1 1 1 0
Allocations
ID Node ID Task Group Version Desired Status Created At
883269bf e42d6f19 cache 0 run running 12/05/17 10:58:40 UTC
To inspect the job's allocations, we use the nomad alloc-status command.
# nomad alloc-status 883269bf
ID = 883269bf
Eval ID = 13ebb66d
Name = example.cache[0]
Node ID = e42d6f19
Job ID = example
Job Version = 0
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 12/05/17 10:58:49 UTC
Deployment ID = b0a84e74
Deployment Health = healthy
Task "redis" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
8/500 MHz 6.3 MiB/256 MiB 300 MiB 0 db: 127.0.0.1:22672
Task Events:
Started At = 12/05/17 10:58:49 UTC
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
10/31/17 22:58:49 UTC Started Task started by client
10/31/17 22:58:40 UTC Driver Downloading image redis:3.2
10/31/17 22:58:40 UTC Task Setup Building Task Directory
10/31/17 22:58:40 UTC Received Task received by client
To read a job's logs, we use the nomad logs command. Note the arguments after logs: an allocation ID and a task name. The allocation ID is shown by nomad status example, and the task name is defined in the example.nomad job file.
# nomad logs 883269bf redis
_._
_.-``__ ''-._
_.-`` `. `_. ''-._ Redis 3.2.1 (00000000/0) 64 bit
.-`` .-```. ```\/ _.,_ ''-._
( ' , .-` | `, ) Running in standalone mode
|`-._`-...-` __...-.``-._|'` _.-'| Port: 6379
| `-._ `._ / _.-' | PID: 1
`-._ `-._ `-./ _.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' | http://redis.io
`-._ `-._`-.__.-'_.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' |
`-._ `-._`-.__.-'_.-' _.-'
`-._ `-.__.-' _.-'
`-._ _.-'
`-.__.-'
...
Modifying a Job
# vim example.nomad
In the file, find count = 1 and change it to count = 3.
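After the edit, the relevant part of example.nomad reads:

```hcl
group "cache" {
  count = 3 # was 1: the scheduler will create two more allocations

  # (the rest of the group, including task "redis", is unchanged)
}
```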
After making the change, run the nomad plan example.nomad command:
# nomad plan example.nomad
+/- Job: "example"
+/- Task Group: "cache" (2 create, 1 in-place update)
+/- Count: "1" => "3" (forces create)
Task: "redis"
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 7
To submit the job with version verification run:
nomad run -check-index 7 example.nomad
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
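For the curious, what -check-index does can also be expressed against Nomad's HTTP API; a minimal Python sketch of just the request body (field names as in the jobs API; the job body here is a placeholder and no request is actually sent):

```python
import json

def register_payload(job, modify_index):
    """Build the body for PUT /v1/jobs with version verification.

    EnforceIndex tells the server to reject the registration unless
    the current JobModifyIndex still equals the value `nomad plan`
    reported.
    """
    return json.dumps({
        "Job": job,
        "EnforceIndex": True,
        "JobModifyIndex": modify_index,
    })

# Placeholder job body, not the full example.nomad
payload = register_payload({"ID": "example"}, 7)
print(json.loads(payload)["JobModifyIndex"])  # → 7
```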
Use the run command printed by the plan output to update the job.
# nomad run -check-index 7 example.nomad
==> Monitoring evaluation "93d16471"
Evaluation triggered by job "example"
Evaluation within deployment: "0d06e1b6"
Allocation "3249e320" created: node "e42d6f19", group "cache"
Allocation "453b210f" created: node "e42d6f19", group "cache"
Allocation "883269bf" modified: node "e42d6f19", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "93d16471" finished with status "complete"
To stop the job, we use the nomad stop command. Afterwards, nomad status shows the job's status as dead (stopped).
# nomad stop example
==> Monitoring evaluation "6d4cd6ca"
Evaluation triggered by job "example"
Evaluation within deployment: "f4047b3a"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "6d4cd6ca" finished with status "complete"
Building a Simple Nomad Cluster
A Nomad cluster has two parts: servers and clients. Every region needs at least one server, and a cluster of 3 or 5 servers is recommended. The Nomad client is a very lightweight process that registers the host, performs heartbeats, and runs the tasks the servers assign to it. An agent must run on every node in the cluster so the servers can place work on those machines.
Starting the Server
The first step is to create a configuration file for the server. Either download the file from GitHub, or paste the following into a file named server.hcl:
vim server.hcl
# Increase log verbosity
log_level = "DEBUG"
# Set up the datacenter
datacenter = "dc1"
# Setup data dir
data_dir = "/tmp/server1"
# Enable the server
server {
  enabled = true
  # Self-elect, should be 3 or 5 for production
  bootstrap_expect = 1
}
This is a fairly minimal server configuration file, but it is enough to start the agent in server-only mode and have it elect itself as leader. The main change for production would be to run multiple servers and set bootstrap_expect to match.
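For example, a production server stanza following that advice would look like (a sketch; each of the three servers would run with the same value):

```hcl
server {
  enabled = true
  # Wait for three servers before electing a leader
  bootstrap_expect = 3
}
```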
Once the file is created, start the agent in a new terminal tab:
$ sudo nomad agent -config server.hcl
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: false
Log Level: DEBUG
Region: global (DC: dc1)
Server: true
Version: 0.6.0
==> Nomad agent started! Log data will stream in below:
[INFO] serf: EventMemberJoin: nomad.global 127.0.0.1
[INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
[INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state
[INFO] nomad: adding server nomad.global (Addr: 127.0.0.1:4647) (DC: dc1)
[WARN] raft: Heartbeat timeout reached, starting election
[INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
[DEBUG] raft: Votes needed: 1
[DEBUG] raft: Vote granted. Tally: 1
[INFO] raft: Election won. Tally: 1
[INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
[INFO] nomad: cluster leadership acquired
[INFO] raft: Disabling EnableSingleNode (bootstrap)
[DEBUG] raft: Node 127.0.0.1:4647 updated peer set (2): [127.0.0.1:4647]
We can see that client mode is disabled and we are running purely as a server. This server will manage state and make scheduling decisions, but will not run any tasks itself. Now we need some agents to run the tasks!
Starting the Clients
As with the server, we first have to configure the clients. Either download the client1 and client2 configurations from GitHub, or paste the following into client1.hcl:
# Increase log verbosity
log_level = "DEBUG"
# Setup data dir
data_dir = "/tmp/client1"
# Enable the client
client {
  enabled = true
  # For demo assume we are talking to server1. For production,
  # this should be like "nomad.service.consul:4647" and a system
  # like Consul used for service discovery.
  servers = ["127.0.0.1:4647"]
}
# Modify our port to avoid a collision with server1
ports {
  http = 5656
}
Copy that file to client2.hcl, change data_dir to "/tmp/client2", and change the port to 5657. Once both client1.hcl and client2.hcl exist, open a terminal tab for each and start the agents:
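Applying those instructions, client2.hcl comes out as:

```hcl
# Increase log verbosity
log_level = "DEBUG"
# Setup data dir
data_dir = "/tmp/client2"
# Enable the client
client {
  enabled = true
  servers = ["127.0.0.1:4647"]
}
# Modify our port to avoid a collision with server1 and client1
ports {
  http = 5657
}
```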
# sudo nomad agent -config client1.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: true
Log Level: DEBUG
Region: global (DC: dc1)
Server: false
Version: 0.6.0
==> Nomad agent started! Log data will stream in below:
[DEBUG] client: applied fingerprints [host memory storage arch cpu]
[DEBUG] client: available drivers [docker exec]
[DEBUG] client: node registration complete
...
In the output we can see that the agent is running in client-only mode. This agent will be available to run tasks but will not participate in managing the cluster or making scheduling decisions.
Using the node-status command, we should now see both nodes in the ready state:
# nomad node-status
ID Datacenter Name Class Drain Status
fca62612 dc1 nomad <none> false ready
c887deef dc1 nomad <none> false ready
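For scripting, this tabular output can be parsed by hand; a minimal Python sketch (it assumes, as in the sample above, that no column value contains spaces):

```python
# Turn whitespace-separated CLI table output into a list of dicts.
def parse_table(output):
    lines = [l for l in output.strip().splitlines() if l.strip()]
    headers = lines[0].split()
    return [dict(zip(headers, row.split())) for row in lines[1:]]

sample = """\
ID        Datacenter  Name   Class   Drain  Status
fca62612  dc1         nomad  <none>  false  ready
c887deef  dc1         nomad  <none>  false  ready
"""

ready = [n["ID"] for n in parse_table(sample) if n["Status"] == "ready"]
print(ready)  # → ['fca62612', 'c887deef']
```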
We now have a simple three-node cluster running. The only difference between this demo and a full production cluster is that we are running a single server instead of three or five.
Submitting a Job
Now that we have a simple cluster, we can use it to schedule a job. We should still have the example.nomad job file from earlier; verify that count is still set to 3.
Then submit the job with the run command:
# nomad init
# nomad run example.nomad
==> Monitoring evaluation "8e0a7cf9"
Evaluation triggered by job "example"
Evaluation within deployment: "0917b771"
Allocation "501154ac" created: node "c887deef", group "cache"
Allocation "7e2b3900" created: node "fca62612", group "cache"
Allocation "9c66fcaf" created: node "c887deef", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "8e0a7cf9" finished with status "complete"
We can see in the output that the scheduler placed two of the tasks on one of the client nodes and the remaining task on the other.
We can verify this again with the status command:
# nomad status example
ID = example
Name = example
Submit Date = 07/26/17 16:34:58 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 3 0 0 0
Latest Deployment
ID = fc49bd6c
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy
cache 3 3 0 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
501154ac 8e0a7cf9 c887deef cache run running 08/08/16 21:03:19 CDT
7e2b3900 8e0a7cf9 fca62612 cache run running 08/08/16 21:03:19 CDT
9c66fcaf 8e0a7cf9 c887deef cache run running 08/08/16 21:03:19 CDT
We can see that all of our tasks have been allocated and are running. Once we are satisfied with the job, we can remove it with nomad stop.
Using the Nomad UI
Nomad 0.7 ships with an integrated web UI; before 0.7 there was no well-developed official UI, so I also found an excellent community UI on GitHub: https://github.com/jippi/hashi-ui. To each their own, but while using both I found hashi-ui very good: it shows a lot of detail, and the official UI does not have as many features yet.
Official UI
- Clone the nomad project from GitHub; the UI sources live at https://github.com/hashicorp/nomad/tree/master/ui
- Read the README carefully and install Node.js, Yarn, Ember CLI, and PhantomJS in your local environment.
- Install the UI dependencies:
# cd ui/
# yarn
- After installation, run ember serve --proxy http://10.30.0.52:4646 (replace 10.30.0.52 with your external IP and 4646 with your configured port), then view the UI in a browser.
Common Issues
- The service binds to 127.0.0.1 and cannot be reached from outside?
When starting the nomad agent, specify the bind address and network interface on the command line. For example:
# nomad agent -config server.hcl -bind=0.0.0.0
# nomad agent -config client1.hcl -network-interface=ens160
- When running a service with Docker, the container maps a random port on the host?
Per the official documentation, Docker maps a random port by default; if you want a static port, define it in the job file.
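The -bind and -network-interface flags from the first question can also be set in the agent configuration files instead of on the command line; a sketch (option names per the Nomad agent configuration reference; the interface name is an example):

```hcl
# server.hcl / client1.hcl: listen on all interfaces instead of 127.0.0.1
bind_addr = "0.0.0.0"

# client1.hcl only: which interface to fingerprint addresses from
client {
  enabled = true
  network_interface = "ens160"
}
```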
Simple Job Files
- hello world
# cat hello.nomad
job "hello1" {
  datacenters = ["dc1"] # datacenters to run in
  group "hello2" { # group name
    task "hello3" { # usually named after the service it runs
      driver = "docker" # use the docker driver
      config {
        image = "hashicorp/http-echo" # image for the service
        args = [ # arguments passed to the container at runtime
          "-listen", ":5678",
          "-text", "hello world",
        ]
      }
      resources { # resources for the service
        network {
          mbits = 10 # reserve 10 Mbit/s of bandwidth
          port "http" {
            static = 5678 # use a static port
          }
        }
      }
    }
  }
}
- Running Redmine. Since I have not yet figured out how to start a group of services with Nomad the way docker-compose does, MySQL has to be started separately beforehand.
# cat redmine-example.nomad
job "redmine" {
  region = "global" # region
  datacenters = ["dc1"] # datacenters
  type = "service" # scheduler type: a long-running service (the other types are batch and system)
  update {
    max_parallel = 1 # number of allocations updated at the same time
    min_healthy_time = "10s" # minimum time an allocation must stay healthy before it is marked healthy
    healthy_deadline = "3m" # deadline for becoming healthy, after which the allocation is marked unhealthy
    auto_revert = false # whether the job automatically reverts to the last stable version when a deployment fails
    canary = 0 # number of canary allocations to create on an update; the old version keeps running until the canaries are confirmed healthy, then they are promoted and the old version is replaced
  }
  group "redmine" {
    count = 1 # number of instances to run
    restart {
      attempts = 10 # number of restarts allowed within the interval
      interval = "5m" # sliding window for counting restarts; what happens when a task fails more than attempts times within it is controlled by mode
      delay = "25s" # time to wait before restarting the task
      mode = "delay" # tell the scheduler to delay the next restart until the interval allows it again
    }
    ephemeral_disk { # ephemeral disk, in MB
      size = 300
    }
    task "redmine" {
      driver = "docker"
      env { # environment variables
        REDMINE_DB_MYSQL = "10.30.0.52"
        REDMINE_DB_PORT = "3306"
        REDMINE_DB_PASSWORD = "passwd"
        REDMINE_DB_USER = "root"
        REDMINE_DB_NAME = "redmine"
      }
      config {
        image = "redmine:yc"
        port_map { # map the container port
          re = 3000
        }
      }
      logs {
        max_files = 10 # maximum number of log files
        max_file_size = 15 # size of a single log file, in MB
      }
      resources { # limit the service's CPU, memory, and network
        cpu = 500 # 500 MHz
        memory = 256 # 256 MB
        network {
          mbits = 10
          port "re" {} # the mapped port configured above
        }
      }
      service {
        name = "global-redmine-check" # health check service
        tags = ["global", "redmine"]
        port = "re"
        check {
          name = "alive"
          type = "tcp"
          interval = "10s"
          timeout = "2s"
        }
      }
    }
  }
}
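On the docker-compose point above: Nomad can co-locate containers by declaring several task stanzas inside one group, since all tasks of a group are placed on the same node. A rough, untested sketch of what a MySQL task next to the redmine task might look like (image and values are placeholders):

```hcl
group "redmine" {
  task "mysql" {
    driver = "docker"
    env {
      MYSQL_ROOT_PASSWORD = "passwd"
      MYSQL_DATABASE = "redmine"
    }
    config {
      image = "mysql:5.7"
      port_map {
        db = 3306
      }
    }
    resources {
      cpu = 500
      memory = 512
      network {
        mbits = 10
        port "db" {}
      }
    }
  }

  # task "redmine" { ... } as in the job file above
}
```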