I have 10 HDFS datanodes, hdfs-w[01-04],[06-08] and hdfs-hadoop[01-03], and 3 namenodes, master[01-03].
Kerberos is in use.
From my Windows laptop I use the Kerberos Windows client to get a ticket which has a 10-hour TTL. Without this I cannot get to the namenode UI or the datanode UIs.
The namenode UI runs on port 9870; the datanode UIs run on port 1006.
IMPORTANT: datanodes hdfs-w01-08 are ALL THE SAME: same hardware, same datacenter, same switch, same amount of RAM, same OS (kernel updated to the same version) and updates, same Java version, same Hadoop version, same partitions, same users. All of this was delivered using Ansible, so the nodes are consistent.
The hardware for the other 3 nodes is newer but the OS/software/services configuration is the same as the older kit.
Communication with the KDC (Linux) is not hampered by firewall or network rules. The authorisation server is Active Directory. The Kerberos setup has principals for every node for each of the installed services.
E.g.:
hdfs/[email protected]
yarn/[email protected]
HTTP/[email protected]
The kerberos configuration (krb5.conf) on all nodes is the same.
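The same "is it really identical?" check can be applied to the Hadoop side. Below is a minimal Python sketch, assuming copies of `hdfs-site.xml` (or `core-site.xml`) from a working and a non-working node have been fetched to one machine. The function names are placeholders, and the property names in the comments are the standard HDFS web authentication keys, which may or may not be the ones set in this cluster. Note that a principal written as `HTTP/_HOST@REALM` is expanded per node using the host name the node resolves for itself, so byte-identical files can still yield different effective principals.

```python
import sys
import xml.etree.ElementTree as ET


def load_properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> in a Hadoop *-site.xml."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(a: dict, b: dict) -> dict:
    """Return {name: (value_in_a, value_in_b)} for properties that differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (file names are hypothetical): python diff_site.py w01.xml w07.xml
    with open(sys.argv[1], encoding="utf-8") as fh:
        left = load_properties(fh.read())
    with open(sys.argv[2], encoding="utf-8") as fh:
        right = load_properties(fh.read())
    # Keys worth a close look include dfs.web.authentication.kerberos.principal
    # and dfs.web.authentication.kerberos.keytab (assumed relevant here).
    for key, (lval, rval) in diff_properties(left, right).items():
        print(f"{key}: {lval!r} != {rval!r}")
```

An empty output means the two files define the same properties with the same values; it says nothing about what `_HOST` expands to on each node.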
Here is my problem: for nodes 01-06 I cannot get the datanode UI (hxxp://hdfs-w01:1006/datanode.html); a 401 Unauthorized error is always returned. For the other 2 nodes (w07/w08) the datanode UI always displays without error.
I then tried capturing the traffic with Wireshark on the laptop, as I figured the traffic would always pass through it.
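One way to compare the nodes before any ticket is involved is to request the page without credentials and look at the status code and the `WWW-Authenticate` header each node returns. The Python sketch below does that with the standard library only. It assumes the UI is served over plain HTTP on port 1006 at `/datanode.html`, as in the URL above; `summarise` and `probe` are illustrative names. A SPNEGO-protected endpoint would normally be expected to answer an anonymous request with 401 and a `Negotiate` challenge, so a node that answers differently from its peers would stand out.

```python
import sys
import urllib.error
import urllib.request


def summarise(status: int, headers: dict) -> str:
    """Describe a response by its status and WWW-Authenticate header."""
    lowered = {key.lower(): value for key, value in headers.items()}
    challenge = lowered.get("www-authenticate", "")
    if status == 401 and challenge.lower().startswith("negotiate"):
        return "401 with Negotiate challenge"
    if status == 401:
        return "401 without Negotiate challenge"
    return f"HTTP {status}"


def probe(url: str, timeout: float = 5.0) -> str:
    """Fetch url anonymously and summarise the answer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return summarise(resp.status, dict(resp.headers.items()))
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions but still carry headers.
        return summarise(err.code, dict(err.headers.items()))
    except OSError as err:
        return f"connection failed: {err}"


if __name__ == "__main__":
    # Usage: python probe.py hdfs-w01 hdfs-w07
    for host in sys.argv[1:]:
        print(host, "->", probe(f"http://{host}:1006/datanode.html"))
```

If every node returns the same anonymous answer, the difference is more likely to lie in the ticket exchange that follows than in whether authentication is switched on.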
For a working node I noticed:
Cookie: hadoop.auth="u=my.name&[email protected]&t=kerberos&e=1773943196854&s=ENspZ6w86VOJH09irEEJSNCXoca2O4OWe2qbvqJIUpQ="\r\n
For a NON-working node:
Set-Cookie: hadoop.auth=;Path=/;HttpOnly\r\n
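The two captures can be decoded with a short Python sketch. It assumes the cookie layout visible in the working capture (`u`, `p`, `t`, `e` and `s` fields joined by `&`) and that `e` is an expiry time in milliseconds since the Unix epoch, which is an inference from its magnitude rather than something confirmed here. The function name `parse_hadoop_auth` and the principal `my.name@EXAMPLE.COM` are placeholders. An empty `hadoop.auth=` value, as in the non-working capture, is reported as "no token"; that would be consistent with the server rejecting the authentication attempt and clearing the cookie, although the capture alone does not prove it.

```python
from datetime import datetime, timezone


def parse_hadoop_auth(header_value: str) -> dict:
    """Split a hadoop.auth cookie into its fields.

    header_value is the text after 'Cookie: ' or 'Set-Cookie: '.
    Returns {} when the cookie carries no token.
    """
    # Cookie attributes such as Path or HttpOnly follow the first ';'.
    first = header_value.split(";", 1)[0].strip()
    name, _, value = first.partition("=")
    if name != "hadoop.auth":
        raise ValueError("not a hadoop.auth cookie")
    value = value.strip('"')
    if not value:
        return {}
    fields = {}
    for part in value.split("&"):
        # partition keeps any '=' padding inside the signature intact.
        key, _, val = part.partition("=")
        fields[key] = val
    if "e" in fields:
        # Assumption: 'e' is milliseconds since the Unix epoch.
        fields["expires"] = datetime.fromtimestamp(
            int(fields["e"]) / 1000, tz=timezone.utc
        )
    return fields


if __name__ == "__main__":
    working = ('hadoop.auth="u=my.name&p=my.name@EXAMPLE.COM&t=kerberos'
               '&e=1773943196854&s=ENspZ6w86VOJH09irEEJSNCXoca2O4OWe2qbvqJIUpQ="')
    failing = "hadoop.auth=;Path=/;HttpOnly"
    print(parse_hadoop_auth(working))
    print(parse_hadoop_auth(failing) or "no token")
```

In the working case the browser is replaying a signed token on later requests; in the failing case no token was ever issued, so the interesting part of the exchange is the request that preceded the empty `Set-Cookie`.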
So, could this be the problem? What might this signify to the enlightened techies out there?
The obvious explanation is that the TGT is not yielding the correct (or any) service ticket, or that the service host is refusing the ticket it is offered.
The local keytab and KDC principals for these hosts must be OK, as the underlying Hadoop (HDFS/YARN/etc.) is working. I have even tried a new keytab but the result is the same.
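One thing that can differ between otherwise identical hosts is name resolution. A Kerberos client typically requests a ticket for `HTTP/<host>`, where `<host>` comes from the URL and may be canonicalised through DNS first, so a forward/reverse mismatch on some nodes could lead to a ticket for a principal the service does not hold. The Python sketch below compares forward and reverse lookups per host. It is only a diagnostic aid under that assumption: `check_name` is an illustrative name, and the resolver arguments exist so the logic can be exercised without a network.

```python
import socket
import sys


def check_name(host, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve host to an IP and back, and report whether the names agree."""
    ip = forward(host)
    reverse_name = reverse(ip)[0]  # (hostname, aliases, addresses)
    # Compare short names only, so 'hdfs-w01' matches 'hdfs-w01.example.com'.
    same = reverse_name.split(".")[0].lower() == host.split(".")[0].lower()
    return {"host": host, "ip": ip, "reverse": reverse_name, "consistent": same}


if __name__ == "__main__":
    # Usage: python check_dns.py hdfs-w01 hdfs-w07
    for name in sys.argv[1:]:
        try:
            print(check_name(name))
        except OSError as err:  # gaierror and herror both derive from OSError
            print(name, "lookup failed:", err)
```

Running it from the laptop and from one of the nodes for both a failing and a working datanode would show whether the two groups resolve differently.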
A tcpdump on a working and non-working node (port 1006) delivered nothing.
I am mystified as to why some would fail when everything is set up in the same way with the same parameters. I did try looking for the web server Hadoop may be using, but I cannot see any processes or running settings to indicate what that may be, so I have no logs for anything there. I do not know enough about building/using this to determine if it is something like Apache/Jetty/Node.js.