Fixing the Greenplum startup failure "Error occurred: non-zero rc: 1"


One day the developers reported that the cluster in the test environment failed to start.

The error output was as follows:

[gpadmin@hadoop-test2:/root]
$ gpstart
20181205:16:42:23:005451 gpstart:hadoop-test2:gpadmin-[INFO]:-Starting gpstart with args: 
20181205:16:42:23:005451 gpstart:hadoop-test2:gpadmin-[INFO]:-Gathering information and validating the environment...
20181205:16:42:23:005451 gpstart:hadoop-test2:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 5.0.0 build dev'
20181205:16:42:23:005451 gpstart:hadoop-test2:gpadmin-[INFO]:-Greenplum Catalog Version: '301705051'
20181205:16:42:24:005451 gpstart:hadoop-test2:gpadmin-[INFO]:-Starting Master instance in admin mode
20181205:16:52:24:005451 gpstart:hadoop-test2:gpadmin-[CRITICAL]:-Failed to start Master instance in admin mode
20181205:16:52:24:005451 gpstart:hadoop-test2:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /home/gpadmin/gpdata/gpmaster/gpseg-1 -l /home/gpadmin/gpdata/gpmaster/gpseg-1/pg_log/startup.log
-w -t 600 -o " -p 2346 --gp_dbid=1 --gp_num_contents_in_cluster=0 --silent-mode=true -i -M master --gp_contentid=-1 -x 0 -c gp_role=utility " start
' rc=1, stdout='waiting for server to start...................................................................................................................................
...........................................................................................................................................................................
...........................................................................................................................................................................
.................................................................................................................................. stopped waiting
', stderr='could not change directory to "/root"
pg_ctl: could not start server
Examine the log output.

The startup log shows the following:

vim /home/gpadmin/gpdata/gpmaster/gpseg-1/pg_log/startup.log
2018-12-05 08:42:24.067241 GMT,,,p5464,th-829482944,,,,0,,,seg-1,,,,,"WARNING","01000","""work_mem"": setting is deprecated, and may be removed in a future release.",,,,,,,,"set_config_option","guc.c",4666,
2018-12-05 08:42:24.067612 GMT,,,p5464,th-829482944,,,,0,,,seg-1,,,,,"WARNING","01000","""work_mem"": setting is deprecated, and may be removed in a future release.",,,,,,,,"set_config_option","guc.c",4666,
2018-12-05 08:42:24.083813 GMT,,,p5465,th-829482944,,,,0,,,seg-1,,,,,"LOG","00000","removing all temporary files",,,,,,,,"RemovePgTempFiles","fd.c",2046,
2018-12-05 08:42:24.098673 GMT,,,p5465,th-829482944,,,,0,,,seg-1,,,,,"FATAL","XX000","could not create shared memory segment: Invalid argument (pg_shmem.c:183)","Failed system call was shmget(key=2346001, size=177586016, 03600).","This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter.  You can either reduce the request size or reconfigure the kernel with larger SHMMAX.  To reduce the request size (currently 177586016 bytes), reduce PostgreSQL's shared_buffers parameter (currently 4000) and/or its max_connections parameter (currently 253).
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.",,,,,,"InternalIpcMemoryCreate","pg_shmem.c",183,1
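
The decisive entry in this log is the single FATAL line, which is easy to miss among the warnings. As a minimal sketch, the severe entries can be filtered out instead of reading the whole file in an editor. The helper name show_fatal is hypothetical, and the pattern assumes the quoted severity field seen in the log above:

```shell
#!/bin/sh
# Hypothetical helper: print only the FATAL and PANIC entries of a
# Greenplum CSV-style log, where the severity appears as a quoted field.
show_fatal() {
    grep -E '"(FATAL|PANIC)"' "$1"
}

# Path taken from the gpstart output in this post.
log=/home/gpadmin/gpdata/gpmaster/gpseg-1/pg_log/startup.log
if [ -r "$log" ]; then
    show_fatal "$log"
fi
```

Note that multi-line messages, such as the hint text following the FATAL entry here, continue on the next lines of the file, so the matched line may need to be read in context.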

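In essence, the log says that the shared memory request (177586016 bytes) exceeded the kernel's SHMMAX limit: the kernel parameter shmmax set in /etc/sysctl.conf is too small, so startup fails.

The configuration in /etc/sysctl.conf was: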

kernel.shmmax = 20000000
kernel.shmmni = 4096
kernel.shmall = 40000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1025 65535
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.overcommit_memory = 2
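
The mismatch can be checked with simple arithmetic: the log reports a request of 177586016 bytes, while kernel.shmmax above allows only 20000000 bytes. A small sketch of that comparison follows; the function name shm_request_fits is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed when a shared memory request (in bytes)
# does not exceed the given SHMMAX value (in bytes).
shm_request_fits() {
    request_bytes=$1
    shmmax_bytes=$2
    [ "$request_bytes" -le "$shmmax_bytes" ]
}

# Request size from the FATAL log entry, limit from /etc/sysctl.conf above.
if shm_request_fits 177586016 20000000; then
    echo "request fits"
else
    echo "request exceeds shmmax"
fi
```

On a live host, the second argument could be read from the running kernel with `sysctl -n kernel.shmmax` rather than typed in by hand.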

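Comparing these values with the settings recommended by the official documentation, the parameter definitions, and the amount of data already in the cluster confirmed that they were indeed too small. I changed them to the officially recommended settings and started the cluster again.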
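
The post does not show the edited values, so the following is only a sketch of how such a change might be applied. The helper set_sysctl_key is hypothetical, and the numbers are assumed for illustration; the actual values should come from the Greenplum documentation for the installed version. The demonstration works on a scratch file, not on the real /etc/sysctl.conf:

```shell
#!/bin/sh
# Hypothetical helper: set "KEY = VALUE" in a sysctl.conf-style file,
# replacing an existing line for that key or appending a new one.
set_sysctl_key() {
    file=$1
    key=$2
    value=$3
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$file"; then
        tmp=$(mktemp)
        sed "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file" > "$tmp"
        cat "$tmp" > "$file"   # overwrite in place, keeping file permissions
        rm -f "$tmp"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Demonstration on a scratch copy holding the old values from this post.
conf=$(mktemp)
printf 'kernel.shmmax = 20000000\nkernel.shmall = 40000000\n' > "$conf"
set_sysctl_key "$conf" kernel.shmmax 500000000    # assumed value, verify in the docs
set_sysctl_key "$conf" kernel.shmall 4000000000   # assumed value, verify in the docs
cat "$conf"
rm -f "$conf"
```

On the real host the same change would be made to /etc/sysctl.conf as root, followed by `sysctl -p` to load the new settings into the running kernel before running gpstart again.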

20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-----------------------------------------------------
20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-   Successful segment starts                                            = 8
20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-   Failed segment starts                                                = 0
20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-   Skipped segment starts (segments are marked down in configuration)   = 0
20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-----------------------------------------------------
20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-Successfully started 8 of 8 segment instances 
20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-----------------------------------------------------
20181205:17:54:28:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-Starting Master instance hadoop-test2 directory /home/gpadmin/gpdata/gpmaster/gpseg-1 
20181205:17:54:29:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-Command pg_ctl reports Master hadoop-test2 instance active
20181205:17:54:29:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-No standby master configured.  skipping...
20181205:17:54:29:009711 gpstart:hadoop-test2:gpadmin-[INFO]:-Database successfully started

The cluster started successfully.

 

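Summary: when the kernel parameters that PostgreSQL relies on at startup do not match the actual requirements, startup fails. The detailed messages in the startup log point to the root cause.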

 

References:

1. Officially recommended settings: http://gpdb.docs.pivotal.io/4380/prep_os-system-params.html#topic3

2. Meaning of the kernel parameters: http://www.oicqzone.com/pc/2012091612901.html

 

