Basic description
Kerberos is built on the Needham–Schroeder protocol. It relies on a "trusted third party", termed the Key Distribution Center (KDC), which consists of two logically separate parts: an Authentication Server (AS) and a Ticket Granting Server (TGS). Kerberos works on the basis of "tickets", which serve to prove the identity of users.
The KDC maintains a database of secret keys; every network entity, whether client or server, shares a secret key known only to itself and to the KDC. Knowledge of this key serves to prove an entity's identity. For communication between two entities, the KDC generates a session key that is used to encrypt the messages they exchange.
The protocol
The security of the protocol relies mainly on the participants keeping their clocks loosely synchronized and on short-lived assertions of authenticity called Kerberos tickets. Below is a simplified description of the protocol, using the following abbreviations:
- AS (Authentication Server) = authentication server
- TGT (Ticket Granting Ticket) = ticket-granting ticket, a ticket for obtaining tickets
- TGS (Ticket Granting Server) = ticket-granting server
- SS (Service Server) = service server
In the network protocol stack, Kerberos belongs to the presentation layer.
Put simply, the client first uses its shared secret key to obtain a credential from the authentication server; it then uses that credential, rather than the shared key, to communicate with the SS.
Detailed flow
First, the user logs in with a program on the client machine (the user's own machine):
- The user enters a user ID and password on the client.
- The client program runs a one-way function (usually a hash) to turn the password into a key; this is the client's (user's) "user key" (K_client). The trusted AS has already obtained the same key through some secure channel.
Next, client authentication (the client obtains a ticket-granting ticket (TGT) from the AS):
- The client sends one message to the AS (note: the user sends neither the key (K_client) nor the password to the AS):
  - a cleartext message containing the user ID, e.g. "user Sunny wants to request services" (Sunny is the user ID).
- The AS checks that the user ID is valid and returns two messages:
  - Message A: the "client/TGS session key" (K_TGS-session), encrypted with the user key (K_client). (This session key is used for the subsequent communication (session) between the client and the TGS.)
  - Message B: the "ticket-granting ticket" (TGT), encrypted with the TGS key (K_TGS). (The TGT contains: the client/TGS session key (K_TGS-session), the user ID, the user's network address, and the TGT validity period.)
- The client decrypts A with its own key (K_client) and obtains the client/TGS session key (K_TGS-session). (Note: the client cannot decrypt message B, because B is encrypted with the TGS key (K_TGS).)
Then, service authorization (the client obtains a ticket (T) from the TGS):
- The client sends the following two messages to the TGS:
  - Message c: message B (the TGT encrypted with K_TGS), together with the service ID of the requested service (note: not the user ID).
  - Message d: an "authenticator" (containing the user ID and a timestamp), encrypted with the client/TGS session key (K_TGS-session).
- The TGS decrypts B inside c with its own key (K_TGS) to recover the TGT, and thus the client/TGS session key (K_TGS-session) issued by the AS. It then decrypts d with that session key to obtain the user ID (authentication), and returns two messages:
  - Message E: the "client/server ticket" (T), encrypted with the server key (K_SS). (T contains: the client/SS session key (K_SS-session), the user ID, the user's network address, and the validity period of T.)
  - Message F: the "client/SS session key" (K_SS-session), encrypted with the client/TGS session key (K_TGS-session).
- The client decrypts F with the client/TGS session key (K_TGS-session) and obtains the client/SS session key (K_SS-session). (Note: the client cannot decrypt message E, because E is encrypted with the SS key (K_SS).)
Finally, the service request (the client obtains the service from the SS):
- The client sends two messages to the SS:
  - Message e: message E.
  - Message g: a new "authenticator" (containing the user ID and a timestamp), encrypted with the client/server session key (K_SS-session).
- The SS decrypts e (that is, E) with its own key (K_SS) to recover T, and thus the client/server session key (K_SS-session) issued by the TGS. It then decrypts g with that session key to obtain the user ID (authentication), and returns one message (a confirmation that the identity is genuine and the server is willing to provide the service):
  - Message H: a "new timestamp" (the timestamp sent by the client, plus 1), encrypted with the client/server session key (K_SS-session).
- The client decrypts H with the client/server session key (K_SS-session) and obtains the new timestamp.
- If the timestamp has been updated correctly, the client can trust the server and sends its service request to the SS.
- The SS provides the service.
The exchange above can also be observed with standard MIT Kerberos client tools, as sketched below.
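In this sketch, the realm EXAMPLE.COM, the user principal sunny@EXAMPLE.COM, and the service principal host/server.example.com@EXAMPLE.COM are hypothetical placeholders, and the klist output is abbreviated and illustrative only.
# AS exchange: kinit proves knowledge of the password-derived key (K_client)
# and stores the TGT (messages A and B) in the credentials cache.
$ kinit sunny@EXAMPLE.COM
Password for sunny@EXAMPLE.COM:
# The cache now holds only the TGT issued by the AS.
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: sunny@EXAMPLE.COM
Valid starting     Expires            Service principal
01/04/11 13:19:31  01/04/11 23:19:31  krbtgt/EXAMPLE.COM@EXAMPLE.COM
# TGS exchange: kvno presents the TGT and an authenticator to the TGS and
# caches the returned service ticket (messages c through F).
$ kvno host/server.example.com@EXAMPLE.COM
host/server.example.com@EXAMPLE.COM: kvno = 2
# The cache now also contains a ticket for the service principal.
$ klist
...
01/04/11 13:20:05  01/04/11 23:19:31  host/server.example.com@EXAMPLE.COM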
Drawbacks
- Single point of failure: Kerberos requires the central server to be continuously available. While the Kerberos server is down, no one can authenticate to any server. This can be mitigated by running multiple Kerberos servers and by providing fallback authentication mechanisms.
- Kerberos requires the clocks of the participating hosts to be synchronized. Tickets have a validity period, so if a host's clock is not synchronized with the Kerberos server's clock, authentication fails. The default configuration requires that the clocks differ by no more than 10 minutes. In practice, a Network Time Protocol daemon is usually used to keep host clocks synchronized (see the example after this list).
- The administration protocol is not standardized and differs between server implementations. Password changes are described in RFC 3244.
- Because the secret keys of all users are stored on the central server, a compromise of that server compromises every user's key.
- A compromised client compromises its users' passwords.
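As a rough illustration of the clock-synchronization requirement, the offset between a host and its time source can be checked before troubleshooting authentication failures; this is only a sketch, and pool.ntp.org stands in for whatever NTP server the site actually uses.
# Query an NTP server without adjusting the clock; the reported offset should
# stay well under the 10-minute Kerberos default (600 seconds).
$ ntpdate -q pool.ntp.org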
Common problems
Reference: https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Security-Guide/cdh4sg_topic_17.html
Problem 1: Running any Hadoop command fails after enabling security.
Description:
A user must have a valid Kerberos ticket in order to interact with a secure Hadoop cluster. Running any Hadoop command (such as hadoop fs -ls) will fail if you do not have a valid Kerberos ticket in your credentials cache. If you do not have a valid ticket, you will receive an error such as:
11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Solution:
You can examine the Kerberos tickets currently in your credentials cache by running the klist command. You can obtain a ticket by running the kinit command and either specifying a keytab file containing credentials, or entering the password for your principal.
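For example, a session might look like the following. This is only a sketch: the principal atm@YOUR-REALM.COM mirrors the examples later in this section, and the keytab path is a hypothetical placeholder.
# Show the tickets currently in the credentials cache, if any.
$ klist
# Obtain a ticket interactively by entering the principal's password...
$ kinit atm@YOUR-REALM.COM
# ...or non-interactively from a keytab file containing the credentials.
$ kinit -kt /path/to/atm.keytab atm@YOUR-REALM.COM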
Problem 2: Java is unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher.
Description:
If you are running MIT Kerberos 1.8.1 or higher, the following error will occur when you attempt to interact with the Hadoop cluster, even after successfully obtaining a Kerberos ticket using kinit:
11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Because of a change [1] in the format in which MIT Kerberos writes its credentials cache, there is a bug [2] in Oracle JDK 6 Update 26 and earlier that causes Java to be unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher. Kerberos 1.8.1 is the default in Ubuntu Lucid and later releases and in Debian Squeeze and later releases. (On RHEL and CentOS, an older version of MIT Kerberos, which does not have this issue, is the default.)
Footnotes: [1] MIT Kerberos change: http://krbdev.mit.edu/rt/Ticket/Display.html?id=6206 [2] Report of bug in Oracle JDK 6 Update 26 and earlier: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6979329
Solution:
If you encounter this problem, you can work around it by running kinit -R after running kinit initially to obtain credentials. Doing so will cause the ticket to be renewed, and the credentials cache rewritten in a format which Java can read. To illustrate this:
$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1000)
$ hadoop fs -ls
11/01/04 13:15:51 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ kinit
Password for atm@YOUR-REALM.COM:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: atm@YOUR-REALM.COM
Valid starting Expires Service principal
01/04/11 13:19:31 01/04/11 23:19:31 krbtgt/YOUR-REALM.COM@YOUR-REALM.COM
renew until 01/05/11 13:19:30
$ hadoop fs -ls
11/01/04 13:15:59 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ kinit -R
$ hadoop fs -ls
Found 6 items
drwx------ - atm atm 0 2011-01-02 16:16 /user/atm/.staging
Note:
This workaround for Problem 2 requires the initial ticket to be renewable. Note that whether or not you can obtain renewable tickets is dependent upon a KDC-wide setting, as well as a per-principal setting for both the principal in question and the Ticket Granting Ticket (TGT) service principal for the realm. A non-renewable ticket will have the same values for its "valid starting" and "renew until" times. If the initial ticket is not renewable, the following error message is displayed when attempting to renew the ticket:
kinit: Ticket expired while renewing credentials
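Whether renewable tickets can be issued can be checked before applying the workaround. The commands below are only a sketch; the realm YOUR-REALM.COM and the principal atm follow the examples above, and the kadmin.local access they assume is only available on the KDC host.
# Request a ticket with an explicit renewable lifetime (here 7 days).
$ kinit -r 7d atm@YOUR-REALM.COM
# A "renew until" time later than "valid starting" indicates a renewable ticket.
$ klist
# On the KDC, the maximum renewable life is a per-principal attribute; it must be
# non-zero for both the user principal and the realm's TGT service principal.
$ kadmin.local -q "getprinc krbtgt/YOUR-REALM.COM@YOUR-REALM.COM"
$ kadmin.local -q "getprinc atm@YOUR-REALM.COM"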
Problem 3: java.io.IOException: Incorrect permission
Description:
An error such as the following example is displayed if the user running one of the Hadoop daemons has a umask of 0002, instead of 0022:
java.io.IOException: Incorrect permission for
/var/folders/B3/B3d2vCm4F+mmWzVPB89W6E+++TI/-Tmp-/tmpYTil84/dfs/data/data1,
expected: rwxr-xr-x, while actual: rwxrwxr-x
at org.apache.hadoop.util.DiskChecker.checkPermission(DiskChecker.java:107)
at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:144)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:160)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1484)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1432)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1408)
at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:418)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:279)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:203)
at org.apache.hadoop.test.MiniHadoopClusterManager.start(MiniHadoopClusterManager.java:152)
at org.apache.hadoop.test.MiniHadoopClusterManager.run(MiniHadoopClusterManager.java:129)
at org.apache.hadoop.test.MiniHadoopClusterManager.main(MiniHadoopClusterManager.java:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:83)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Solution:
Make sure that the umask for hdfs and mapred is 0022.
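A quick way to confirm the setting is to check the umask in the environment the daemons actually run under. This is only a sketch; where the umask is configured (the user's shell profile, the daemon's environment file, or the init script) varies by distribution and is an assumption here.
# Check the umask seen by the hdfs and mapred users; both should print 0022.
$ su -s /bin/bash -c umask hdfs
$ su -s /bin/bash -c umask mapred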
Problem 4: A cluster fails to run jobs after security is enabled.
Description:
A cluster that was previously configured to not use security may fail to run jobs for certain users on certain TaskTrackers (MRv1) or NodeManagers (YARN) after security is enabled:
1. A cluster is at some point in time configured without security enabled.
2. A user X runs some jobs on the cluster, which creates a local user directory on each TaskTracker or NodeManager.
3. Security is enabled on the cluster.
4. User X tries to run jobs on the cluster, and the local user directory on (potentially a subset of) the TaskTrackers or NodeManagers is owned by the wrong user or has overly-permissive permissions.
The bug is that after step 2, the local user directory on the TaskTracker or NodeManager should be cleaned up, but isn't.
If you're encountering this problem, you may see errors in the TaskTracker or NodeManager logs. The following example is for a TaskTracker on MRv1:
10/11/03 01:29:55 INFO mapred.JobClient: Task Id : attempt_201011021321_0004_m_000011_0, Status : FAILED
Error initializing attempt_201011021321_0004_m_000011_0:
java.io.IOException: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:212)
at org.apache.hadoop.mapred.LinuxTaskController.initializeUser(LinuxTaskController.java:442)
at org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeUserDirs(Localizer.java:272)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:963)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2209)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2174)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:250)
at org.apache.hadoop.util.Shell.run(Shell.java:177)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:370)
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:203)
... 5 more
Solution:
Delete the mapred.local.dir or yarn.nodemanager.local-dirs directories for that user across the cluster.
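A hedged sketch of the cleanup, to be run on each affected TaskTracker or NodeManager: the directories /data/1/mapred/local and /data/1/yarn/nm and the user name userX are placeholders, and the real paths must be taken from mapred.local.dir or yarn.nodemanager.local-dirs in the cluster's configuration.
# MRv1: remove the user's directory under each configured mapred.local.dir.
$ rm -rf /data/1/mapred/local/taskTracker/userX
# YARN: remove the user's cache under each configured yarn.nodemanager.local-dirs.
$ rm -rf /data/1/yarn/nm/usercache/userX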
Problem 5: The NameNode does not start and KrbException Messages (906) and (31) are displayed.
Description:
When you attempt to start the NameNode, a login failure occurs. This failure prevents the NameNode from starting and the following KrbException messages are displayed:
Caused by: KrbException: Integrity check on decrypted field failed (31) - PREAUTH_FAILED
and
Caused by: KrbException: Identifier doesn't match expected value (906)
Note:
These KrbException error messages are displayed only if you enable debugging output. See Appendix D - Enabling Debugging Output for the Sun Kerberos Classes.
Solution:
Although there are several possible problems that can cause these two KrbException error messages to display, here are some actions you can take to solve the most likely problems:
- If you are using CentOS/Red Hat Enterprise Linux 5.6 or later, or Ubuntu, which use AES-256 encryption by default for tickets, you must install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File on all cluster and Hadoop user machines. For information about how to verify the type of encryption used in your cluster, see Step 3: If you are Using AES-256 Encryption, install the JCE Policy File. Alternatively, you can change your kdc.conf or krb5.conf to not use AES-256 by removing aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file. Note that after changing the kdc.conf file, you'll need to restart both the KDC and the kadmin server for those changes to take effect. You may also need to recreate or change the password of the relevant principals, including potentially the TGT principal (krbtgt/REALM@REALM).
- Recreate the hdfs keytab file and mapred keytab file using the -norandkey option in the xst command (for details, see Step 4: Create and Deploy the Kerberos Principals and Keytab Files).
kadmin.local: xst -norandkey -k hdfs.keytab hdfs/fully.qualified.domain.name HTTP/fully.qualified.domain.name
kadmin.local: xst -norandkey -k mapred.keytab mapred/fully.qualified.domain.name HTTP/fully.qualified.domain.name
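After regenerating the keytabs, they can be checked before redeployment. This is only a sketch; hdfs.keytab and the fully.qualified.domain.name placeholder follow the commands above.
# List the principals and key versions stored in the regenerated keytab.
$ klist -k -t hdfs.keytab
# Confirm that the keytab can actually be used to obtain a TGT.
$ kinit -kt hdfs.keytab hdfs/fully.qualified.domain.name
$ klist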
Problem 6: The NameNode starts but clients cannot connect to it and error message contains enctype code 18.
Description:
The NameNode keytab file does not have an AES256 entry, but client tickets do contain an AES256 entry. The NameNode starts but clients cannot connect to it. The error message doesn't refer to "AES256", but does contain an enctype code "18".
Solution:
Make sure the "Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File" is installed or remove aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file. For more information, see the first suggested solution above for Problem 5.
For more information about the Kerberos encryption types, see http://www.iana.org/assignments/kerberos-parameters/kerberos-parameters.xml.
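One way to see whether the two sides agree is to compare the encryption types in the NameNode's keytab with those in the client's tickets. A minimal sketch, assuming the hdfs.keytab file from the earlier steps:
# Show the encryption type of each entry in the NameNode's keytab;
# enctype code 18 corresponds to aes256-cts-hmac-sha1-96.
$ klist -e -k -t hdfs.keytab
# Show the encryption types of the tickets in the client's credentials cache.
$ klist -e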
Problem 9: After you enable cross-realm trust, you can run Hadoop commands in the local realm but not in the remote realm.
Description:
After you enable cross-realm trust, authenticating as a principal in the local realm will allow you to successfully run Hadoop commands, but authenticating as a principal in the remote realm will not allow you to run Hadoop commands. The most common cause of this problem is that the principals in the two realms either don't have the same encryption type, or the cross-realm principals in the two realms don't have the same password. This issue manifests itself because you are able to get Ticket Granting Tickets (TGTs) from both the local and remote realms, but you are unable to get a service ticket to allow the principals in the local and remote realms to communicate with each other.
Solution:
On the local MIT KDC server host, type the following command in the kadmin.local or kadmin shell to add the cross-realm krbtgt principal:
kadmin: addprinc -e "<enc_type_list>" krbtgt/YOUR-LOCAL-REALM.COMPANY.COM@AD-REALM.COMPANY.COM
where the <enc_type_list> parameter specifies the types of encryption this cross-realm krbtgt principal will support: AES, DES, or RC4 encryption. You can specify multiple encryption types in the command above; what matters is that at least one of them corresponds to the encryption type found in the tickets granted by the KDC in the remote realm. For example:
kadmin: addprinc -e "aes256-cts:normal rc4-hmac:normal des3-hmac-sha1:normal" krbtgt/YOUR-LOCAL-REALM.COMPANY.COM@AD-REALM.COMPANY.COM
Problem 11: Users are unable to obtain credentials when running Hadoop jobs or commands.
Description:
This error occurs because the ticket message is too large for the default UDP protocol. An error message similar to the following may be displayed:
13/01/15 17:44:48 DEBUG ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential.
(63) - No service creds)]
Solution:
Force Kerberos to use TCP instead of UDP by adding the following parameter to libdefaults in the krb5.conf file on the client(s) where the problem is occurring.
[libdefaults]
udp_preference_limit = 1
More Info About the udp_preference_limit Property
When sending a message to the KDC, the library will try TCP before UDP if the size of the ticket message is larger than the value of the udp_preference_limit property. If the ticket message is smaller than the udp_preference_limit setting, then UDP will be tried before TCP. Regardless of the size, both protocols will be tried if the first attempt fails.
Problem 12: "Request is a replay" exceptions in the logs.
Description:
Symptom: The following exception shows up in the logs for one or more of the Hadoop daemons:
2013-02-28 22:49:03,152 INFO ipc.Server (Server.java:doRead(571)) - IPC Server listener on 8020: readAndProcess threw exception javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism l
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))]
at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:159)
at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1040)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1213)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:566)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:363)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:741)
at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:323)
at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:267)
at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:137)
... 4 more
Caused by: KrbException: Request is a replay (34)
at sun.security.krb5.KrbApReq.authenticate(KrbApReq.java:300)
at sun.security.krb5.KrbApReq.<init>(KrbApReq.java:134)
at sun.security.jgss.krb5.InitSecContextToken.<init>(InitSecContextToken.java:79)
at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:724)
... 7 more
In addition, this problem can manifest itself as performance issues for all clients in the cluster, including dropped connections, timeouts attempting to make RPC calls, and so on.
Likely causes:
- Multiple services in the cluster are using the same Kerberos principal. All secure clients that run on multiple machines should use a unique Kerberos principal for each machine. For example, rather than connecting as a service principal myservice@EXAMPLE.COM, services should have per-host principals such as myservice/host123.example.com@EXAMPLE.COM (see the sketch after this list).
- Clocks not in sync: all hosts should run NTP so that clocks are kept in sync between clients and servers.
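A hedged sketch of creating and exporting such a per-host principal on an MIT KDC; myservice, host123.example.com, and EXAMPLE.COM come from the example above, and the keytab file name is a placeholder.
kadmin.local: addprinc -randkey myservice/host123.example.com@EXAMPLE.COM
kadmin.local: xst -k myservice-host123.keytab myservice/host123.example.com@EXAMPLE.COM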