# Handling a Downed Vertica Node: A Case Study


This walkthrough covers four steps:

  1. Check the database version and the state of each node
  2. Starting the downed node the usual way fails
  3. Dig into the downed node's logs
  4. Pinpoint the problem and fix it

# 1. Check the database version and the state of each node

```
dbadmin=> select version();
              version
------------------------------------
 Vertica Analytic Database v6.1.3-7
(1 row)

dbadmin=> select node_name, node_id, node_state, node_address from nodes;
     node_name      |      node_id      | node_state | node_address
--------------------+-------------------+------------+---------------
 v_xxxxxxx_node0001 | 45035996273704980 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0002 | 45035996273719008 | DOWN       | 192.168.xx.xx
 v_xxxxxxx_node0003 | 45035996273719012 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0004 | 45035996273719016 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0005 | 45035996273719020 | UP         | 192.168.xx.xx
(5 rows)
```
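For a quick check without opening vsql, admintools can report the same state from the shell. A minimal sketch, assuming you run it as the dbadmin OS user:

```bash
# Show the state of every node as admintools sees it.
/opt/vertica/bin/admintools -t view_cluster

# Restrict the view to one database (xxxxxxx stands in for the real name):
/opt/vertica/bin/admintools -t view_cluster -d xxxxxxx
```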


# 2. Starting the downed node the usual way fails
[Starting the downed node the usual way](http://www.cnblogs.com/jyzhao/p/3855601.html) failed: admintools dropped straight back to the main menu, and the attempt reported this error:

```
*** Restarting hosts for database xxxxxxx ***
restart host 192.168.xx.xx with catalog v_xxxxxxx_node0002_catalog and data v_xxxxxxx_node0002_data
issuing multi-node restart
Spread does not seem to be running on 192.168.xx.xx. The database will not be started on this host.

The following host(s) are not available: 192.168.xx.xx.
You should get them running first. Operation can not be completed.
result of multi-node restart: K-safe parameters not met.
Restart Hosts result: K-safe parameters not met.
```
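The menu-driven restart can also be issued non-interactively. A sketch of the CLI form, assuming the 6.x-era admintools tool names (verify with `admintools --help` on your version):

```bash
# Restart only the downed node: -s takes its host, -d the database name.
/opt/vertica/bin/admintools -t restart_node -s 192.168.xx.xx -d xxxxxxx
```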


# 3. Dig into the downed node's logs
In /opt/vertica/log/adminTools-dbadmin.log, this error stands out:

```
Apr 16 10:55:23 Error code 1 []
Apr 16 10:56:19 dbadmin@192.168.xx.xx: /opt/vertica/bin/vertica --status -D /Vertica/xxxxxxx/v_xxxxxxx_node0001_catalog
Apr 16 10:56:19 Error code 1 ['vertica process is not running']
Apr 16 10:56:19 dbadmin@192.168.xx.xx: ps -aef | grep /opt/vertica/bin/vertica | grep "-D /Vertica/xxxxxxx/v_xxxxxxx_node0001_catalog" | grep -v "ps -aef"
```
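To pull failures like these out of a busy log quickly, a plain grep is enough:

```bash
# Show the 20 most recent error lines from the admintools log.
grep -i 'error' /opt/vertica/log/adminTools-dbadmin.log | tail -n 20
```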


# 4. Pinpoint the problem and fix it
This all but confirms that the spread daemon on the downed node is not running.
So how do we start it?
On Linux, spread is managed as a service:

```
/etc/init.d/spreadd status
/etc/init.d/spreadd start
/etc/init.d/spreadd stop
```
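On RHEL-era systems it is also worth confirming the service is registered to start at boot; a sketch, assuming the stock spreadd init script that the Vertica installer lays down:

```bash
# List the runlevels in which spreadd starts automatically.
chkconfig --list spreadd
```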

## 4.1 The spread service status ##

```
[root@Vertica02 log]# /etc/init.d/spreadd status
spread dead but pid file exists
Try using 'spreadd stop' to clear state
```

On a healthy node, by contrast, the spread service runs normally:

```
[root@Vertica01 ~]# /etc/init.d/spreadd status
spread (pid 19256) is running...
```
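The "dead but pid file exists" message means the daemon died without cleaning up its pid file. The file's location depends on the install, so the sketch below simply searches the likely places before retrying:

```bash
# Locate any leftover spread pid file (path varies by install).
find /var/run /opt/vertica -name '*spread*' -name '*.pid' 2>/dev/null
```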


## 4.2 Try to start the spread service ##

```
[root@Vertica02 log]# /etc/init.d/spreadd start
Starting spread daemon: [FAILED]
```

Following the hint in the status output, try a stop, then a restart:

```
[root@Vertica02 log]# /etc/init.d/spreadd stop
Stopping spread daemon: [FAILED]

[root@Vertica02 log]# /etc/init.d/spreadd help
Usage: /etc/init.d/spreadd {start|stop|status|restart|condrestart}

[root@Vertica02 log]# /etc/init.d/spreadd restart
Stopping spread daemon: [FAILED]
Starting spread daemon: spread (pid 53230) is running...
                                                        [  OK  ]
[root@Vertica02 log]# /etc/init.d/spreadd status
spread (pid 53230) is running...
```


## 4.3 Verify the spread daemon is running ##

```
[root@Vertica02 log]# ps -ef|grep spread|grep -v grep
spread   53230     1  0 09:43 ?  00:00:00 /opt/vertica/spread/sbin/spread -n N192168062089 -c /opt/vertica/config/vspread.conf
```
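Beyond the process listing, you can check that spread is bound to its port; 4803 is the Vertica default, so adjust if /opt/vertica/config/vspread.conf says otherwise:

```bash
# Confirm the spread daemon is listening on its port (default 4803).
netstat -anp 2>/dev/null | grep 4803
```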

With spread running again, retry [starting the downed node the usual way](http://www.cnblogs.com/jyzhao/p/3855601.html).

Confirm the downed node is now RECOVERING:

```
dbadmin=> select node_name, node_id, node_state, node_address from nodes;
     node_name      |      node_id      | node_state | node_address
--------------------+-------------------+------------+---------------
 v_xxxxxxx_node0001 | 45035996273704980 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0002 | 45035996273719008 | RECOVERING | 192.168.xx.xx
 v_xxxxxxx_node0003 | 45035996273719012 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0004 | 45035996273719016 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0005 | 45035996273719020 | UP         | 192.168.xx.xx
(5 rows)
```

Once the node's state changes from RECOVERING to UP, the recovery is complete. Progress can also be watched from a system table, as sketched below.
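A sketch of such a progress check via vsql; it assumes this release exposes v_monitor.recovery_status with columns along these lines (column sets vary by version, so adjust to what `\d v_monitor.recovery_status` reports):

```bash
# Poll recovery progress every 30 seconds (run as dbadmin).
while true; do
  vsql -c "select node_name, recover_epoch, recovery_phase, is_running
           from v_monitor.recovery_status;"
  sleep 30
done
```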

## 4.4 Switch to the second recovery approach ##
Unfortunately, the first (standard) approach did not succeed: recovery ran overnight, more than 10 hours, without completing.
To estimate how long it should take, we used dstat to watch the downed node's network receive rate and the growth rate of its data directory.
At the observed average of roughly 100 MB/s, copying the full 1.3 TB should take only about 4 hours (a quick check below).
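A sketch of that monitoring and the arithmetic behind the estimate:

```bash
# Sample network and disk throughput every 5 seconds (-n net, -d disk).
dstat -nd 5

# 1.3 TB copied at ~100 MB/s works out to roughly 3.8 hours.
awk 'BEGIN { printf "%.1f hours\n", 1.3 * 1024 * 1024 / 100 / 3600 }'
```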
So we switched to the second approach: wipe every file on the downed node and recover it from scratch. My earlier write-up only outlined the idea, so here is a brief record of the process.
### 1. Stop the RECOVERING node. ###
If a normal stop fails, kill the process; both actions are available in admintools (a CLI sketch follows).
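A sketch of the CLI form; stop_node matches the 6.x-era tool names (they shifted in later releases), and the forced kill lives in the interactive menu:

```bash
# Clean stop of the single recovering node.
/opt/vertica/bin/admintools -t stop_node -s 192.168.xx.xx
# If it refuses to stop: admintools Advanced Menu ->
# "Kill Vertica Process on Host" for that node.
```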

```
dbadmin=> select node_name, node_id, node_state, node_address from nodes;
     node_name      |      node_id      | node_state | node_address
--------------------+-------------------+------------+---------------
 v_xxxxxxx_node0001 | 45035996273704980 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0002 | 45035996273719008 | DOWN       | 192.168.xx.xx
 v_xxxxxxx_node0003 | 45035996273719012 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0004 | 45035996273719016 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0005 | 45035996273719020 | UP         | 192.168.xx.xx
(5 rows)
```

### 2. On the downed node, mv the original Vertica directory to xxxxxxx_old, then delete it in the background (so recovery can start sooner). ###
`nohup rm -rf /Vertica/xxxxxxx_old &`
### 3. Recreate the directories (mind ownership and permissions) and copy vertica.conf into the catalog directory. ###
`mkdir -p /Vertica/xxxxxxx/v_xxxxxxx_node0002_catalog && mkdir -p /Vertica/xxxxxxx/v_xxxxxxx_node0002_data`
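A sketch of the permission fix and the config copy. It assumes the usual dbadmin:verticadba owner pair created by the Vertica installer, and that vertica.conf can be taken verbatim from a surviving node (node0001 here); verify both against a healthy node first:

```bash
# Hand the recreated tree back to the database owner.
chown -R dbadmin:verticadba /Vertica/xxxxxxx

# Copy vertica.conf from node0001's catalog into the new catalog directory
# (assumes the file carries no node-specific entries).
scp dbadmin@192.168.xx.xx:/Vertica/xxxxxxx/v_xxxxxxx_node0001_catalog/vertica.conf \
    /Vertica/xxxxxxx/v_xxxxxxx_node0002_catalog/
```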

### 4. From node 1, start the downed node with admintools; it enters recovery. ###

```
*** Restarting hosts for database xxxxxxx ***
restart host 192.168.xx.xx with catalog v_xxxxxxx_node0002_catalog and data v_xxxxxxx_node0002_data
issuing multi-node restart
Node Status: v_xxxxxxx_node0002: (DOWN)
Node Status: v_xxxxxxx_node0002: (DOWN)
Node Status: v_xxxxxxx_node0002: (INITIALIZING)
Node Status: v_xxxxxxx_node0002: (RECOVERING)
Node Status: v_xxxxxxx_node0002: (RECOVERING)
Node Status: v_xxxxxxx_node0002: (RECOVERING)
Node Status: v_xxxxxxx_node0002: (RECOVERING)
Node Status: v_xxxxxxx_node0002: (RECOVERING)
Node Status: v_xxxxxxx_node0002: (RECOVERING)
Node Status: v_xxxxxxx_node0002: (RECOVERING)
Nodes UP: v_xxxxxxx_node0001, v_xxxxxxx_node0003, v_xxxxxxx_node0005, v_xxxxxxx_node0004
Nodes DOWN: v_xxxxxxx_node0002 (may be still initializing).
result of multi-node restart: 7
Restart Hosts result: 7
Vertica Analytic Database 6.1.3-7 Administration Tools
```

### 5. Watch the recovery state. ###

```
dbadmin=> select node_name, node_id, node_state, node_address from nodes;
     node_name      |      node_id      | node_state | node_address
--------------------+-------------------+------------+---------------
 v_xxxxxxx_node0001 | 45035996273704980 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0002 | 45035996273719008 | RECOVERING | 192.168.xx.xx
 v_xxxxxxx_node0003 | 45035996273719012 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0004 | 45035996273719016 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0005 | 45035996273719020 | UP         | 192.168.xx.xx
(5 rows)
```

As before, once the node's state changes from RECOVERING to UP, the recovery is complete.

Then another hiccup: of the 1.3 TB to restore on this node, progress stalled at about 1.2 TB.

```
$ df -h /Vertica/
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/vg_vertica02-LogVol00
                                   3.6T  1.2T  2.3T  34% /Vertica
```


At the same time, dstat showed that the network copy traffic had dropped to almost nothing.
The cause: a data-loading job was still running against the database during recovery. We stopped the loader and restarted the recovery.
Given the volume involved, we also dropped the historical partitions of some large tables before recovering, to shorten the copy; after that, the recovery completed successfully.
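A sketch of the partition cleanup, using the DROP_PARTITION function this Vertica generation ships (schema.big_table and the partition key '2014' are hypothetical placeholders):

```bash
# Drop one historical partition of a large table before recovering.
vsql -c "SELECT DROP_PARTITION('schema.big_table', '2014');"
```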

```
dbadmin=> select node_name, node_id, node_state, node_address from nodes;
     node_name      |      node_id      | node_state | node_address
--------------------+-------------------+------------+---------------
 v_xxxxxxx_node0001 | 45035996273704980 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0002 | 45035996273719008 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0003 | 45035996273719012 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0004 | 45035996273719016 | UP         | 192.168.xx.xx
 v_xxxxxxx_node0005 | 45035996273719020 | UP         | 192.168.xx.xx
(5 rows)
```

