A catalog problem caused by gprecoverseg that prevented the database from starting, and how to repair it


1. Symptoms

After one run of gprecoverseg, the database could no longer accept connections. The symptoms were as follows.

The gprecoverseg log:

 

gprecoverseg:mdw-1:gpadmin-[INFO]:-Starting gprecoverseg with args: -r
gprecoverseg:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev'
gprecoverseg:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)'
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 1
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 2
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 3
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 4
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 5
gprecoverseg:mdw-1:gpadmin-[CRITICAL]:-gprecoverseg failed. (Reason='Unable to connect to database and start transaction') exiting...

 

This shows that the gprecoverseg run failed.

 

Next, an attempt to start the database with gpstart:

gpstart:mdw-1:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
..
gpstart:mdw-1:gpadmin-[INFO]:-Process results...
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-   Successful segment starts                                            = 8
gpstart:mdw-1:gpadmin-[INFO]:-   Failed segment starts                                                = 0
gpstart:mdw-1:gpadmin-[INFO]:-   Skipped segment starts (segments are marked down in configuration)   = 0
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-
gpstart:mdw-1:gpadmin-[INFO]:-Successfully started 8 of 8 segment instances
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-Starting Master instance mdw-1 directory /home/data/master/gpseg-1
gpstart:mdw-1:gpadmin-[INFO]:-Command pg_ctl reports Master mdw-1 instance active
gpstart:mdw-1:gpadmin-[WARNING]:-server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

gpstart:mdw-1:gpadmin-[INFO]:-Starting standby master
gpstart:mdw-1:gpadmin-[INFO]:-Checking if standby master is running on host: mdw-2  in directory: /home/data/master/gpseg-1
gpstart:mdw-1:gpadmin-[INFO]:-Check status of database with gpstate utility

 

At this point all segments are reported as started, but the database itself has not come up properly.

 

gpstate was then run to confirm the state:

gpstate:mdw-1:gpadmin-[INFO]:-Starting gpstate with args:
gpstate:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev'
gpstate:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)'
gpstate:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gpstate:mdw-1:gpadmin-[INFO]:-Gathering data from segments...
.
gpstate:mdw-1:gpadmin-[INFO]:-Greenplum instance status summary
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Master instance                                           = Active
gpstate:mdw-1:gpadmin-[INFO]:-   Master standby                                            = mdw-2
gpstate:mdw-1:gpadmin-[INFO]:-   Standby master state                                      = Standby host passive
gpstate:mdw-1:gpadmin-[INFO]:-   Total segment instance count from metadata                = 8
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Primary Segment Status
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Total primary segments                                    = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total primary segment valid (at master)                   = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total primary segment failures (at master)                = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes found                   = 4
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Mirror Segment Status
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Total mirror segments                                     = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total mirror segment valid (at master)                    = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total mirror segment failures (at master)                 = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes found                   = 4
gpstate:mdw-1:gpadmin-[WARNING]:-Total number mirror segments acting as primary segments   = 1                      <<<<<<<<
gpstate:mdw-1:gpadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 3
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------

 

This shows that a primary/mirror role switch has occurred.

 

Attempting to connect to the database:

psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

 

At this point the database can no longer be connected to at all.

 

2. Troubleshooting

2.1 Checking the logs

First, the master log:
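For reference, the master's CSV logs live under the master data directory seen in the gpstart output above; the gpdb-YYYY-MM-DD_HHMMSS.csv file-name pattern is the usual Greenplum convention and is assumed here.

# On the master host, list the most recent master logs
ls -lt /home/data/master/gpseg-1/pg_log/ | head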

CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 37). The new session id = 38",,,,,,,0,,"cdbgang.c",1638,
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","Failed to connect to seg0 xx.xx.xx.xx:50000(xx.xx.xx.xx是segment的ip)",,,,,,,0,,"cdbgang_async.c",174,
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con39,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 38). The new session id = 39",,,,,,,0,,"cdbgang.c",1638,
CST,,,p30743,th0,,,CST,0,con39,,seg-1,,,,,"PANIC","XX000","Unexpected internal error: Master process received signal SIGSEGV",,,,,,,0,,,,"1    0xa0ce4b postgres StandardHandlerForSigillSigsegvSigbus_OnMainThread + 0x1b2
2    0x8d376e postgres CdbProgramErrorHandler + 0xf1
3    0x7fee63a14100 libpthread.so.0 <symbol not found> + 0x63a14100
4    0xaaa718 postgres <symbol not found> + 0xaaa718
5    0xaaa3e4 postgres CdbDispatchCommand + 0x35
6    0xb75719 postgres <symbol not found> + 0xb75719
7    0xb755d4 postgres <symbol not found> + 0xb755d4
8    0xb7524b postgres <symbol not found> + 0xb7524b
9    0xb71c51 postgres <symbol not found> + 0xb71c51
10   0xb720ae postgres initTM + 0xe2
11   0xb78864 postgres cdb_setup + 0x4b
12   0xa1b6e8 postgres InitPostgres + 0x9fa
"
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","server process (PID 30743) was terminated by signal 11: Segmentation fault",,,,,,,0,,"postmaster.c",5604,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",5284,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","seqserver process (PID 28900) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","ftsprobe process (PID 28901) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","sweeper process (PID 28902) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","BeginResetOfPostmasterAfterChildrenAreShutDown: counter 0",,,,,,,0,,"postmaster.c",2171,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","gp_session_id high-water mark is 39",,,,,,,0,,"postmaster.c",2197,

 

This line shows that one segment could not be reached:

"Failed to connect to seg0 xx.xx.xx.xx:50000(xx.xx.xx.xx是segment的ip)",,,,,,,0,,"cdbgang_async.c",174,

Logging in to the host where this segment lives and checking the segment log showed no relevant entries at that point in time.
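For reference, the segment's own CSV logs sit under its data directory, which the gpstate -s output below reports as /gpadmin/data/mirror/gpseg0; the file-name convention is the same assumption as for the master.

# On the segment host, list the latest segment logs for the incident window
ls -lt /gpadmin/data/mirror/gpseg0/pg_log/ | head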

 

2.2 Checking other status information

gpstate -s shows the detailed status of every instance; below is the excerpt for the problematic segment pair:

gpstate:mdw-1:gpadmin-[INFO]:-   Segment Info
gpstate:mdw-1:gpadmin-[INFO]:-      Hostname                                = sdw-2
gpstate:mdw-1:gpadmin-[INFO]:-      Address                                 = sdw-2
gpstate:mdw-1:gpadmin-[INFO]:-      Datadir                                 = /gpadmin/data/mirror/gpseg0
gpstate:mdw-1:gpadmin-[INFO]:-      Port                                    = 50000
gpstate:mdw-1:gpadmin-[INFO]:-   Mirroring Info
gpstate:mdw-1:gpadmin-[INFO]:-      Current role                            = Primary
gpstate:mdw-1:gpadmin-[INFO]:-      Preferred role                          = Mirror
gpstate:mdw-1:gpadmin-[INFO]:-      Mirror status                           = Resynchronizing
gpstate:mdw-1:gpadmin-[INFO]:-   Change Tracking Info
gpstate:mdw-1:gpadmin-[INFO]:-      Change tracking data size               = 32.1 kB
gpstate:mdw-1:gpadmin-[INFO]:-   Resynchronization Info
gpstate:mdw-1:gpadmin-[INFO]:-      Resynchronization mode                  = Incremental
gpstate:mdw-1:gpadmin-[INFO]:-      Data synchronized                       = 0 bytes
gpstate:mdw-1:gpadmin-[INFO]:-      Estimated total data to synchronize     = Not Available
gpstate:mdw-1:gpadmin-[INFO]:-      Estimated resync progress with mirror   = Not Available
gpstate:mdw-1:gpadmin-[INFO]:-      Total resync objects                    = 0
gpstate:mdw-1:gpadmin-[INFO]:-      Objects to resync                       = 0
gpstate:mdw-1:gpadmin-[INFO]:-      Estimated resync end time               = Not Available
gpstate:mdw-1:gpadmin-[INFO]:-   Status
gpstate:mdw-1:gpadmin-[INFO]:-      PID                                     = 26649
gpstate:mdw-1:gpadmin-[INFO]:-      Configuration reports status as         = Up
gpstate:mdw-1:gpadmin-[INFO]:-      Database status                         = Stopping
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Segment Info
gpstate:mdw-1:gpadmin-[INFO]:-      Hostname                                = sdw-1
gpstate:mdw-1:gpadmin-[INFO]:-      Address                                 = sdw-1
gpstate:mdw-1:gpadmin-[INFO]:-      Datadir                                 = /gpadmin/data/primary/gpseg0
gpstate:mdw-1:gpadmin-[INFO]:-      Port                                    = 40000
gpstate:mdw-1:gpadmin-[INFO]:-   Mirroring Info
gpstate:mdw-1:gpadmin-[INFO]:-      Current role                            = Mirror
gpstate:mdw-1:gpadmin-[INFO]:-      Preferred role                          = Primary
gpstate:mdw-1:gpadmin-[INFO]:-      Mirror status                           = Resynchronizing
gpstate:mdw-1:gpadmin-[INFO]:-   Status
gpstate:mdw-1:gpadmin-[INFO]:-      PID                                     = 30458
gpstate:mdw-1:gpadmin-[INFO]:-      Configuration reports status as         = Up
gpstate:mdw-1:gpadmin-[WARNING]:-   Segment status                          = Down                           <<<<<<<<

 

This shows the pair went through change tracking and is now stuck resynchronizing: the acting primary's database status is Stopping, and on sdw-1 the instance on port 40000 (currently acting as the mirror) is Down, which contradicts the Up status recorded in the configuration.

Since the database could not accept normal connections at this point, the master was brought up in master-only (utility) mode so the system catalog could be inspected directly; the relevant rows are shown below.
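A minimal sketch of how to get such a session, assuming Greenplum 4.3 defaults: gpstart -m brings up only the master, and the PGOPTIONS setting below puts psql into utility mode so that no segment connections are attempted.

# Start only the master, then connect to it in utility mode
gpstart -m
PGOPTIONS='-c gp_session_role=utility' psql postgres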

 

postgres=# select * from gp_segment_configuration ;
 dbid | content | role | preferred_role | mode | status | port  |   hostname    |    address    | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+---------------+---------------+------------------+------------
(healthy segments omitted)
    2 |       0 | m    | p              | r    | u      | 40000 | sdw-1 | sdw-1 |            41000 |
    6 |       0 | p    | m              | r    | u      | 50000 | sdw-2 | sdw-2 |            51000 |

The catalog marks this segment pair as resynchronizing (mode = 'r'), yet both rows have status = 'u' (up) even though one instance is actually down; the catalog entries are clearly wrong. (For reference: mode is 's' = synchronized, 'r' = resynchronizing, 'c' = change tracking; status is 'u' = up, 'd' = down.)

3. Solution

Putting all of the above together, the picture is:

  • The database cannot start properly and cannot be connected to.
  • With no working connection, a full recovery via gprecoverseg -F is not an option.
  • The catalog information is wrong: at startup the database concludes from the catalog that the segments are healthy and tries to resynchronize them, but one segment is actually down, so the synchronization can never complete and the database fails to start.
  • gpcheckcat is the tool for catalog errors; it produces a report with repair suggestions, but since the database cannot start it cannot be run either (a usage note follows after this list).
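For completeness: once the cluster is back up after the repair below, gpcheckcat can be pointed at each database to verify the catalogs. A minimal sketch, assuming gpcheckcat is on the PATH and the master listens on the default port:

# Verify the system catalogs of the postgres database once the cluster is up again
gpcheckcat postgres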

So the metadata has to be repaired by hand so that the Greenplum database can start. First, some background on how Greenplum tracks primary/mirror state:

Greenplum's mechanism for probing segment health is called FTS (Fault Tolerance Service). Consider the failover case in which the primary is alive and the mirror has died: the master probes the primary successfully and learns from the state it reports that the mirror is gone (when a mirror dies, the primary detects it and switches itself into change-tracking mode); the master then updates its metadata and moves on to probe the next segment. From that state, primary/mirror resynchronization proceeds automatically. Knowing this, we can edit the catalog to recreate exactly that state and let Greenplum complete the synchronization on its own.

Next, a look at Greenplum's primary/mirror synchronization mechanism:

Data goes into the primary first and is then shipped from the primary to the mirror. So when a primary/mirror switch has occurred, the data on the original primary (the current mirror) must be synchronized over to the original mirror (the current primary). With that established, the catalog table can be modified:

Change mode to 'c' for the row with dbid 2 and status to 'd' for the row with dbid 6, then restart the database; it starts normally and the data enters synchronization.
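A minimal sketch of that catalog edit, issued from the utility-mode session described earlier (Greenplum 4.3); the allow_system_table_mods session setting, which permits DML on system catalogs, is an assumption about what such a session needs:

-- Allow DML on system catalogs in this session (GP 4.3 accepts the value 'dml')
set allow_system_table_mods = 'dml';
-- Put the instance with dbid 2 into change-tracking mode
update gp_segment_configuration set mode = 'c' where dbid = 2;
-- Mark the instance with dbid 6 as down
update gp_segment_configuration set status = 'd' where dbid = 6;

After the edit, stop the master-only instance with gpstop -m and bring the full cluster up again with gpstart.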

 

4. References

http://mysql.taobao.org/monthly/2016/01/02/

http://mysql.taobao.org/monthly/2016/03/08/

 
