CentOS 7: single-node Torque deployment


1. Preparation

 

Before installing Torque, set the Linux hostname. Most servers default to localhost, and using localhost directly is not recommended.

How to change the Linux hostname: http://www.cnblogs.com/smbin/p/8488909.html
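
Before going further, it can help to confirm that the hostname is not a localhost alias and that it resolves. The following is only a minimal sketch and not part of the original procedure: the helper name is_usable_hostname is made up for illustration, and the hostnamectl command is printed as a suggestion rather than executed.

```shell
#!/bin/sh
# Return 0 if the name is usable as a Torque server name,
# 1 if it is empty or one of the usual localhost aliases.
is_usable_hostname() {
    case "$1" in
        ""|localhost|localhost.*|localhost4|localhost4.*|localhost6|localhost6.*) return 1 ;;
        *) return 0 ;;
    esac
}

current="$(uname -n)"    # same value that `hostname` prints
if is_usable_hostname "$current"; then
    echo "hostname '$current' looks usable"
else
    echo "hostname '$current' is a localhost alias; on CentOS 7 set a real one, e.g.:"
    echo "  hostnamectl set-hostname master"
fi

# The name must also resolve; getent consults /etc/hosts and DNS.
if command -v getent >/dev/null 2>&1; then
    getent hosts "$current" >/dev/null 2>&1 \
        || echo "warning: '$current' does not resolve; add it to /etc/hosts"
fi
```

The script only reports; changing the hostname and editing /etc/hosts are left to the administrator.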

 

Linux system: CentOS 7

Hostname: master

System user: root

 

Torque official download page: http://www.adaptivecomputing.com/support/download-center/torque-download/

Version used by the author: http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.2.tar.gz

 

 

2. Installing and configuring Torque

 

First create a torque directory under /opt, download the archive into it, and extract it.

[root@mastar ]# cd /opt
[root@mastar ]# mkdir torque
[root@mastar ]# cd torque
[root@mastar torque]# wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.2.tar.gz
...... (download output omitted)
[root@mastar torque]# tar -zxvf torque-6.1.2.tar.gz
...... (extraction output omitted)
[root@mastar torque]# cd torque-6.1.2/
[root@mastar torque-6.1.2]#
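
The same download steps can be scripted. The sketch below derives the archive and source directory names from the URL with shell parameter expansion. RUN_DOWNLOAD is a hypothetical switch invented for this example; unless it is set to 1, the script only prints the names and changes nothing.

```shell
#!/bin/sh
# Derive the archive and source-directory names from the URL used above.
url="http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.2.tar.gz"
archive="${url##*/}"           # strip everything up to the last "/"
srcdir="${archive%.tar.gz}"    # strip the ".tar.gz" suffix

echo "archive: $archive"
echo "source directory: $srcdir"

# RUN_DOWNLOAD is a hypothetical switch for this sketch; without it the
# script only prints the names and touches nothing.
if [ "${RUN_DOWNLOAD:-0}" = "1" ]; then
    mkdir -p /opt/torque
    cd /opt/torque || exit 1
    wget "$url" || exit 1
    tar -zxf "$archive" || exit 1
    cd "$srcdir" || exit 1
fi
```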

 

 

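Install the build dependencies, then configure, build, and install Torque. The --with-default-server option ties PBS to the host; master is the hostname.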

[root@master torque-6.1.2]# yum install libxml2-devel openssl-devel gcc gcc-c++ boost-devel libtool-y          //the space before -y is missing here; the intended ending is "libtool -y"
Loaded plugins: fastestmirror, langpacks
base                                        | 3.6 kB  00:00:00
extras                                      | 3.4 kB  00:00:00
mysql-connectors-community                  | 2.5 kB  00:00:00
mysql-tools-community                       | 2.5 kB  00:00:00
mysql56-community                           | 2.5 kB  00:00:00
updates                                     | 3.4 kB  00:00:00
Determining fastest mirrors
 * base: mirrors.cn99.com
 * extras: mirrors.tuna.tsinghua.edu.cn
 * updates: mirrors.tuna.tsinghua.edu.cn
Package libxml2-devel-2.9.1-6.el7_2.3.x86_64 already installed and latest version
Package 1:openssl-devel-1.0.2k-8.el7.x86_64 already installed and latest version
Package gcc-4.8.5-16.el7_4.1.x86_64 already installed and latest version
Package gcc-c++-4.8.5-16.el7_4.1.x86_64 already installed and latest version
Package boost-devel-1.53.0-27.el7.x86_64 already installed and latest version
No package libtool-y available.
Nothing to do
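
The yum output above ends with "No package libtool-y available" because the space before -y was missing, so libtool itself was never requested. As a rough check, the sketch below lists which build dependencies are still missing. The function name missing_packages and the PKG_CHECK variable are illustrative assumptions; rpm -q is the query that actually runs by default.

```shell
#!/bin/sh
# Print each package from "$@" for which the checker command fails.
# PKG_CHECK defaults to "rpm -q", which exits non-zero for a package
# that is not installed.
missing_packages() {
    for pkg in "$@"; do
        ${PKG_CHECK:-rpm -q} "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

deps="libxml2-devel openssl-devel gcc gcc-c++ boost-devel libtool"

if command -v rpm >/dev/null 2>&1; then
    # $deps is intentionally unquoted so it splits into one word per package.
    missing="$(missing_packages $deps | tr '\n' ' ')"
    if [ -n "$missing" ]; then
        echo "install with: yum install -y $missing"
    else
        echo "all build dependencies are installed"
    fi
else
    echo "rpm not found; this check only applies to RPM-based systems"
fi
```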
[root@master torque-6.1.2]# ./configure --prefix=/usr/local/torque --with-scp --with-default-server=master
...... (configure output omitted)
Building components: server=yes mom=yes clients=yes
                     gui=no drmaa=no pam=no
PBS Machine type    : linux
Remote copy         : /bin/scp -rpB
PBS home            : /var/spool/torque
Default server      : master

Unix Domain sockets : 
Linux cpusets       : no
Tcl                 : disabled
Tk                  : disabled
Authentication      : trqauthd

configure: WARNING: This compilation has strict compiler options enabled that cause
the build to fail if any compiler warnings are emitted.  If this build fails
because of a harmless warning, please report the problem to torqueusers@supercluster.org
and run configure again without --enable-gcc-warnings.

Ready for 'make'.
[root@master torque-6.1.2]# make
...... (build output omitted)
[root@master torque-6.1.2]# make install
...... (install output omitted)
[root@master torque-6.1.2]# make packages
  Building packages from /opt/torque/torque-6.1.2/tpackages
  rm -rf /opt/torque/torque-6.1.2/tpackages
  mkdir /opt/torque/torque-6.1.2/tpackages
  Building ./torque-package-server-linux-x86_64.sh ...
  libtool: install: warning: remember to run `libtool --finish /usr/local/torque/lib'          //so run this afterwards: libtool --finish /usr/local/torque/lib
  Building ./torque-package-mom-linux-x86_64.sh ...
  libtool: install: warning: remember to run `libtool --finish /usr/local/torque/lib'
  Building ./torque-package-clients-linux-x86_64.sh ...
  libtool: install: warning: remember to run `libtool --finish /usr/local/torque/lib'
  Building ./torque-package-devel-linux-x86_64.sh ...
  libtool: install: warning: remember to run `libtool --finish /usr/local/torque/lib'
  Building ./torque-package-doc-linux-x86_64.sh ...
  Done.

  The package files are self-extracting packages that can be copied
  and executed on your production machines. Use --help for options.
  [root@master torque-6.1.2]# libtool --finish /usr/local/torque/lib
  libtool: finish: PATH="/usr/lib/jvm/java-1.7.0-openjdk/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/torque/bin:/usr/local/torque/sbin:/root/bin:/sbin" ldconfig -n /usr/local/torque/lib
  ----------------------------------------------------------------------
  Libraries have been installed in:
  /usr/local/torque/lib

  If you ever happen to want to link against installed libraries
  in a given directory, LIBDIR, you must either use libtool, and
  specify the full pathname of the library, or use the `-LLIBDIR'
  flag during linking and do at least one of the following:
  - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
  during execution
  - add LIBDIR to the `LD_RUN_PATH' environment variable
  during linking
  - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
  - have your system administrator add LIBDIR to `/etc/ld.so.conf'

  See any operating system documentation about shared libraries for
  more information, such as the ld(1) and ld.so(8) manual pages.
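
The libtool message lists several ways to make the library directory known to the dynamic linker. One option on CentOS 7 is a drop-in file under /etc/ld.so.conf.d followed by ldconfig. The sketch below is an assumption about how that could be done, not a step from the original article: the function name register_libdir and the file name torque.conf are illustrative, and it demonstrates the logic on a scratch directory so nothing on the system changes.

```shell
#!/bin/sh
# Write <libdir> into <confdir>/torque.conf unless it is already listed there.
register_libdir() {
    libdir="$1"
    confdir="$2"
    conf="$confdir/torque.conf"
    if [ -f "$conf" ] && grep -qxF "$libdir" "$conf"; then
        return 0    # already registered, nothing to do
    fi
    printf '%s\n' "$libdir" >> "$conf"
}

# Demonstration against a scratch directory, so nothing on the system changes.
scratch="$(mktemp -d)"
register_libdir /usr/local/torque/lib "$scratch"
register_libdir /usr/local/torque/lib "$scratch"   # second call is a no-op
cat "$scratch/torque.conf"

# On the real host (as root) the equivalent would be:
#   register_libdir /usr/local/torque/lib /etc/ld.so.conf.d && ldconfig
```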

 

 

 

Register the services: pbs_server, pbs_sched, pbs_mom, trqauthd

[root@master torque-6.1.2]# cp contrib/init.d/{pbs_{server,sched,mom},trqauthd} /etc/init.d/
[root@master torque-6.1.2]# for i in pbs_server pbs_sched pbs_mom trqauthd; do chkconfig --add $i; chkconfig $i on; done      //if prompted y/n, enter y to continue

 

 

Set the Torque environment variables

[root@master torque-6.1.2]# TORQUE=/usr/local/torque
[root@master torque-6.1.2]# echo "TORQUE=$TORQUE" >> /etc/profile
[root@master torque-6.1.2]# echo "export PATH=\$PATH:$TORQUE/bin:$TORQUE/sbin" >> /etc/profile
[root@master torque-6.1.2]# source /etc/profile
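
Appending to /etc/profile with echo adds the same lines again every time the commands are rerun. A possible alternative, sketched below, writes a dedicated snippet and overwrites it on each run. The target name /etc/profile.d/torque.sh is an assumption (CentOS 7 sources *.sh files from that directory at login), and write_torque_profile is a made-up helper; the demonstration uses a scratch file.

```shell
#!/bin/sh
# Write a profile snippet that exports TORQUE and extends PATH.
# Overwriting (">") keeps the file identical no matter how often this runs.
write_torque_profile() {
    prefix="$1"
    target="$2"
    cat > "$target" <<EOF
export TORQUE=$prefix
export PATH=\$PATH:\$TORQUE/bin:\$TORQUE/sbin
EOF
}

# Demonstration with a scratch file instead of /etc/profile.d/torque.sh.
snippet="$(mktemp)"
write_torque_profile /usr/local/torque "$snippet"
cat "$snippet"

# Load it into the current shell and show the resulting PATH entries.
. "$snippet"
echo "$PATH" | tr ':' '\n' | grep torque
```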

 

 

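Run torque.setup as root. It reports that the hostname the server points to does not match the current hostname. I found no fix during the installation itself; a fix found after the installation is described at the end of this article.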

[root@master torque-6.1.2]# ./torque.setup root          //try to initialize as root; fails because pbs_server is already running
initializing TORQUE (admin: root)
pbs_server already running... run 'qterm' to stop pbs_server and rerun          //run qterm to stop the server
[root@master torque-6.1.2]# qterm                        //the server points to a hostname that differs from the current one, so qterm cannot stop it
Can not resolve name for server mastar. (rc = -2 - )
Cannot resolve specified server host 'mastar'.
qterm: could not connect to server '' (15010) Access from host not allowed, or unknown host
[root@master mom_priv]# ps -e | grep pbs          //find the process and stop it with kill -9 instead
30505 ?        00:00:00 pbs_server
[root@master mom_priv]# kill -9 30505
[root@master mom_priv]# ps -e | grep pbs
[root@master torque-6.1.2]# ./torque.setup root        //still fails after the kill: the server name and the current hostname differ. The configure command above was checked and is correct:
                           //'./configure --prefix=/usr/local/torque --with-scp --with-default-server=master'. No fix found; I suspect stale cached state on the system.
initializing TORQUE (admin: root)              //for now the only workaround is to edit /etc/hosts

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y
Can not resolve name for server mastar. (rc = -2 - )
Cannot resolve specified server host 'mastar'.
qmgr: cannot connect to server  (errno=15010) Access from host not allowed, or unknown host
ERROR: cannot set root@master in operators list
Can not resolve name for server mastar. (rc = -2 - )
Cannot resolve specified server host 'mastar'.
qterm: could not connect to server '' (15010) Access from host not allowed, or unknown host
[root@master torque-6.1.2]# vi /etc/hosts            //edit /etc/hosts
10.131.101.142 master
10.131.101.142 mastar        //add this line
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

[root@master torque-6.1.2]# ./torque.setup root            //succeeds this time
initializing TORQUE (admin: root)

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y          //enter y
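
The workaround edits /etc/hosts by hand so that the stale name mastar resolves. The sketch below shows one way to add such an entry only when the name is not already mapped. ensure_hosts_entry is a made-up helper; the address and names come from the transcript, and the demonstration runs on a scratch file rather than the real /etc/hosts.

```shell
#!/bin/sh
# Append "<ip> <name>" to the hosts file unless some line already maps <name>.
ensure_hosts_entry() {
    file="$1"; ip="$2"; name="$3"
    # Match the name as a whole whitespace-delimited field after the address;
    # lines that start with "#" are skipped.
    if awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Demonstration on a scratch copy rather than the real /etc/hosts.
hosts="$(mktemp)"
printf '127.0.0.1 localhost localhost.localdomain\n10.131.101.142 master\n' > "$hosts"
ensure_hosts_entry "$hosts" 10.131.101.142 mastar
ensure_hosts_entry "$hosts" 10.131.101.142 mastar   # no duplicate is added
cat "$hosts"
```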

 

 

Start the pbs_server, pbs_sched, pbs_mom, and trqauthd services

[root@master torque-6.1.2]# qterm          //shut down the running pbs_server
[root@master torque-6.1.2]# for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i start; done
Starting pbs_server (via systemctl):                       [  OK  ]
Starting pbs_sched (via systemctl):                        [  OK  ]
Starting pbs_mom (via systemctl):                          [  OK  ]
Starting trqauthd (via systemctl):                         [  OK  ]

 

 

 

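Specify the compute node

Add the compute node "master" and set its number of CPUs.

Check the number of CPUs with the "lscpu" or "nproc" command.

[root@master torque-6.1.2]# vi /var/spool/torque/server_priv/nodes
master np=8          //add this line; no spaces around the equals sign; master is the hostname
[root@master torque-6.1.2]# vi /var/spool/torque/mom_priv/config
pbsserver master        //add these two lines; master is the hostname
logevent 255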
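
The nodes entry can also be generated instead of typed. The sketch below prints a line in the same "hostname np=N" form for the current machine. The helper names nodes_line and cpu_count are illustrative, and the output should be reviewed before it is copied into server_priv/nodes, since np does not have to equal the detected CPU count.

```shell
#!/bin/sh
# Build one line for server_priv/nodes: "<hostname> np=<cores>".
nodes_line() {
    printf '%s np=%s\n' "$1" "$2"
}

# Count online CPUs: nproc where available, getconf as a fallback.
cpu_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN
    fi
}

# Print the line for this machine; review it before copying it into
# /var/spool/torque/server_priv/nodes.
nodes_line "$(uname -n)" "$(cpu_count)"
```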

 

 

Check the PBS processes and restart the services

[root@master torque-6.1.2]# ps -e | grep pbs
11188 ?        00:00:00 pbs_sched
11215 ?        00:00:00 pbs_mom
29683 ?        00:00:00 pbs_server
[root@master torque-6.1.2]# for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i restart; done
Restarting pbs_server (via systemctl):                     [  OK  ]
Restarting pbs_sched (via systemctl):                      [  OK  ]
Restarting pbs_mom (via systemctl):                        [  OK  ]
Restarting trqauthd (via systemctl):                       [  OK  ]
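
Note that "ps -e | grep pbs" cannot show trqauthd, because its name does not contain "pbs". A small sketch that reports on all four daemons by exact process name follows; daemon_state is a hypothetical helper, and it treats a missing pgrep the same as a daemon that is not running.

```shell
#!/bin/sh
# Print "<name>: running" or "<name>: not running" for one daemon.
# pgrep -x matches the process name exactly.
daemon_state() {
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: not running"
    fi
}

for d in pbs_server pbs_sched pbs_mom trqauthd; do
    daemon_state "$d"
done
```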

 

 

Create the default queue

[root@master torque-6.1.2]# qmgr -c 'create queue master'
[root@master torque-6.1.2]# qmgr -c 'set queue master queue_type = execution'
[root@master torque-6.1.2]# qmgr -c 'set queue master started = true'
[root@master torque-6.1.2]# qmgr -c 'set queue master enabled = true'
[root@master torque-6.1.2]# qmgr -c 'set queue master resources_default.walltime = 240:00:00'
[root@master torque-6.1.2]# qmgr -c 'set queue master resources_default.nodes = 1'
[root@master torque-6.1.2]# qmgr -c 'set server default_queue = master'
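
The seven qmgr calls differ only in the queue name and the walltime, so they can be generated from two parameters. The sketch below only prints the directives; queue_directives is a made-up helper, not a Torque command.

```shell
#!/bin/sh
# Print the qmgr directives for an execution queue with the given
# name and default walltime, then make it the server's default queue.
queue_directives() {
    queue="$1"
    walltime="$2"
    cat <<EOF
create queue $queue
set queue $queue queue_type = execution
set queue $queue started = true
set queue $queue enabled = true
set queue $queue resources_default.walltime = $walltime
set queue $queue resources_default.nodes = 1
set server default_queue = $queue
EOF
}

queue_directives master 240:00:00
```

Since qmgr reads directives from standard input when no -c option is given, the output could be piped into it as root on the server (queue_directives master 240:00:00 | qmgr), and qmgr -c 'print server' shows the resulting configuration.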

 

 

Submit a test job:

[root@master torque-6.1.2]# qnodes      //check the state of the compute node
master
     state = free
     power_state = Running
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux master 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64,sessions=3154 3489 41105 41699,nsessions=4,nusers=3,idletime=3198,
    totmem=94868512kb,availmem=92195284kb,physmem=32367652kb,ncpus=56,loadave=0.85,gres=,netload=4005925534,state=free,varattr= ,cpuclock=Fixed,macaddr=68:cc:6e:c3:cf:87,version=6.1.2,rectime=1519980694,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003

[root@master torque-6.1.2]# su master        //switch user: this "master" is a user account name, not the hostname
[master@master torque-6.1.2]$ echo sleep 10 | qsub
0.master
[master@master torque-6.1.2]$ qstat        //check the job status
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.master                  STDIN            master                 0 R master
[master@master torque-6.1.2]$ qstat -a -n      //check the job status and the cores each job occupies

master:
                                                                                      Req'd     Req'd        Elap
Job ID                  Username    Queue    Jobname          SessID   NDS    TSK    Memory      Time S      Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
0.master                master      master   STDIN             12470     1      1        -- 240:00:00 C        --
   master/0
[master@master torque-6.1.2]$
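
Piping a one-line command into qsub is enough for a smoke test. For anything longer, a job script with #PBS directives is the usual form. The sketch below writes such a script to a temporary file and submits it only if qsub is installed. The job name sleep_test and the one-minute walltime are arbitrary choices for illustration; the queue name master is the one created earlier.

```shell
#!/bin/sh
# Write a small job script; the #PBS lines are read by qsub as options.
job="$(mktemp)"
cat > "$job" <<'EOF'
#!/bin/sh
#PBS -N sleep_test
#PBS -q master
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00
# PBS_O_WORKDIR is set by Torque to the directory qsub was run from.
cd "${PBS_O_WORKDIR:-.}" || exit 1
echo "running on $(uname -n)"
sleep 10
EOF

echo "job script written to $job"

# Submit only if qsub is installed; like the transcript, run this as a
# non-root user.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job"
else
    echo "qsub not found; skipping submission"
fi
```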

 

 

 

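Fix for the hostname mismatch:

I never found the root cause, but I suspect an earlier Torque installation was not removed cleanly, so stale state from the "Create the default queue" step was still present.

After Torque has been installed successfully, stop it: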

[root@master torque-6.1.2]# for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i stop; done        //stop the services (same loop, with start changed to stop)
Stopping pbs_server (via systemctl):                       [  OK  ]
Stopping pbs_sched (via systemctl):                        [  OK  ]
Stopping pbs_mom (via systemctl):                          [  OK  ]
Stopping trqauthd (via systemctl):                         [  OK  ]
[root@master torque-6.1.2]# ./torque.setup root        //rerun this step
hostname: master
Currently no servers active. Default server will be listed as active server. Error  15133
Active server name: master  pbs_server port is: 15001
trqauthd daemonized - port /tmp/trqauthd-unix
trqauthd successfully started
initializing TORQUE (admin: root)

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y          //enter y
[root@master torque-6.1.2]# vi /var/spool/torque/server_priv/nodes
master np=8           //no spaces around the equals sign
[root@master torque-6.1.2]# qterm          //shut down the server before restarting pbs_server, pbs_sched, pbs_mom, and trqauthd
[root@master torque-6.1.2]# for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i start; done        //start the services again
Starting pbs_server (via systemctl):                       [  OK  ]
Starting pbs_sched (via systemctl):                        [  OK  ]
Starting pbs_mom (via systemctl):                          [  OK  ]
Starting trqauthd (via systemctl):                         [  OK  ]

[root@master torque-6.1.2]# qnodes          //check the status; fails because trqauthd is not running
socket_connect_unix failed: 15137
qnodes: cannot connect to server master, error=15137 (could not connect to trqauthd)
[root@master torque-6.1.2]# for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i restart; done        //restart the services
Restarting pbs_server (via systemctl):                     [  OK  ]
Restarting pbs_sched (via systemctl):                      [  OK  ]
Restarting pbs_mom (via systemctl):                        [  OK  ]
Restarting trqauthd (via systemctl):                       [  OK  ]


[root@master torque-6.1.2]# qnodes      //check the status again; succeeds
master
     state = free
     power_state = Running
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux master 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64,sessions=3154 3489 10903 41105 41699,nsessions=5,nusers=4,idletime=5287,totmem=94868512kb,
    availmem=92236268kb,physmem=32367652kb,ncpus=56,loadave=0.01,gres=,netload=8920006882,state=free,varattr= ,cpuclock=Fixed,macaddr=68:cc:6e:c3:cf:87,version=6.1.2,rectime=1519982783,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003
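
In the transcript the first qnodes call failed with error 15137 and only succeeded after the services were restarted. If the check is scripted, a small retry loop avoids failing on the first attempt. The sketch below is one way to write it; retry is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Run a command up to <attempts> times, sleeping <delay> seconds between
# tries. Returns 0 on the first success, 1 if every attempt fails.
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        [ "$n" -lt "$attempts" ] && sleep "$delay"
        n=$((n + 1))
    done
    return 1
}

# Poll the node status a few times if the Torque client tools are installed.
if command -v qnodes >/dev/null 2>&1; then
    retry 5 2 qnodes || echo "qnodes still failing; check that trqauthd is running"
else
    echo "qnodes not found; nothing to check"
fi
```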

 

