I previously wrote about the problems with version 5.20.x and said that we were looking forward to seeing what 6.1 would provide. Well, we downloaded version 6.1 and it is a significantly different product, right down to the new name: IBM Systems Director. For starters, it no longer uses a thick client for administration; it uses your web browser. The whole look-and-feel has been drastically altered and I am not sure this is a good thing. Just getting to the web page is difficult enough, as you have to remember a URL that looks like http://director:8421/ibm/console. If you use Active Directory (AD) groups for login, it will work, but if you enter a valid AD username and password for an account that is not a valid user of IBM Director, the server will crash. No word from IBM on why this happens, but one of the Java executables runs out of memory when it does. A server restart is required to recover.
The initial “Welcome” page, if it comes up, shows you a selection of plug-ins and their “readiness”. There is also a link for 5.20.x users in the upper right corner. That link gives some generic answers and information, but not the level of technical assistance that an experienced administrator can really use. Most of it is obvious as you play with and try out the new user interface. What is not obvious are things like the “Hardware Status” that used to show you what sensors were available from a level-2 agent (now called a Common Agent) talking through the IPMI interface in the operating system to the BMC controller in the IBM hardware.
As I said, there are inexplicable times when the Welcome page will not show up. We have been unable to determine when or why this occurs, but it is far too frequent to ignore. Instead of the Welcome page, users get the infamous ATKCOR037E error page. The error implies that you do not have permission to see the Welcome page; it generally goes away if you reboot the server.
Installation
We tried to install the Linux version of the product and it worked well enough. However, we quickly learned that the Linux version can only talk to the Apache Derby, DB2 or Oracle database engines and not Microsoft SQL Server. Evidently, this is due to missing administrative interfaces and command-line tools for that engine. So, we decided to install on Windows Server 2008 instead. To be honest, I don’t know why they offer a Basic installation option, as it is pretty useless for most users. You will need to select Custom if you are using any database other than Derby, and the database choice seems difficult to change later, although they say it can be done. You are required to set some passwords for the Agent Manager Configuration. Unfortunately, they don’t tell you what these usernames and passwords are for or why the Agent Manager needs them, but you had better write them down, as they are critical when creating common agent servers.
Turns out that the common agent service (CAS) is how discovery is going to happen with the new common agents, but they don’t tell you that up front. When you deploy a new agent, you will have to modify the deployment code to take into account the passwords that were set during the server installation. This is a big change from 5.20.x in how servers register with Director. In previous versions, there was a DSA certificate generated by the server and you could push this file into the agent installations for authentication and registration with Director. While 6.1 kept that compatibility with older agents, it is not used in the new common agents. These agents need to register with the Tivoli service on Director. I assume this is part of the integration direction between Director and Tivoli, but I am not sure. Anyway, when you install the agent, there is a “response file” that is embedded in the deployment. You must override this file with an edited version where you set the name of your agent manager (probably the director server) and the password you set for the agent manager. For example:
AMHostname=director.mydomain.com
AMPassword=my_agent_manager_password
Then, you install the agent with a command line override to tell it to use your response file instead of the default one. However, we have been unable to get the Windows version of this command line to run properly and we generally run the whole GUI version and type in the options.
Linux = dir6.1_commonagent_linux.sh -r diragent.rsp
Windows = IBMSystemsDirectorAgentSetup.exe installationtype rsp="diragent.rsp"
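Editing the response file by hand on every machine gets old quickly, so a small helper along these lines could produce the edited copy. This is only a sketch: the key names AMHostname and AMPassword come from the example above, but the treatment of '#' lines as comments and the contents of the default file are assumptions, since the real response file has more settings than are shown here.

```python
def override_settings(text, overrides):
    """Return response-file text with the given key=value pairs overridden.

    Existing keys are replaced in place; keys not found are appended.
    Lines starting with '#' are treated as comments (an assumption about
    the file format, not something taken from IBM documentation).
    """
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Anything not already present in the file is added at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical stand-in for the default response file shipped with the agent.
    default_text = "# hypothetical default response file\nAMHostname=localhost\n"
    edited = override_settings(default_text, {
        "AMHostname": "director.mydomain.com",
        "AMPassword": "my_agent_manager_password",
    })
    print(edited, end="")
```

The output would be saved as diragent.rsp and passed to the installer with the -r option shown above. Keeping the untouched lines in place matters because the installer presumably still needs the other settings in the file.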
The other nuisance is that they do not have the default ports pre-entered for the databases. For example, you have to know that SQL Server is using port 1433 during the installation which I find inexcusable in any software install. They know the defaults and should fill them in. It is probably very rare for any enterprise to change these port values.
Agents and Inventory
If you remember, in 5.20.x the UI would show all of the agents, and if a platform was part of a server that included a level-2 agent, the platform would show under the agent as a subitem. That was changed in 6.1 and adds to the confusion that this product is rife with. All agents show as individual items and there is no visible correlation between agents and platforms. You have to drill around a bit to get the information about these relationships. It can be done, but it is not obvious or presented in a way that is easy for administrators. To get this info, you will have to click on the Discovery Manager, which gives you a chart of your servers.
From here, you can click on a number of options, but the interesting one for us is the “Full access systems” at the bottom of second section called “Access and Authentication”. This will present you with a list of the systems which have been discovered and have registered with the CAS.
Clicking on the first item brings you to another tab (more on tabs later) which shows you some information about the server. From here, clicking on the “System” folder opens it to show “Operating System” and “Server”. Finally, clicking on “Server” shows that information in the details to the right. Here we can see the platform under the common agent service. However, by this time, you have collected a number of tabs in the UI and have probably forgotten why you drilled all the way down here and what to do next.
Discovery
This part of the application is even worse than 5.20.x ever was. Even after installing the agents with the customized command lines, we have found that it only works “most” of the time. We have seen agents running on identical hardware and operating systems react differently. Sometimes it installs as expected, which is “Full Access Systems” as I talked about above. Other times, you will see them show up in the “Discovery Manager” as either “No agent systems” or “No access systems”. It is impossible to get any real details from the installation log files, even if you set the “DebugInstall=1” in the response file. If you attempt a reinstallation or two, most of the time it will finally register correctly. We have found on some Linux machines that the install will fail when trying to install ISDCommonAgent-6.1-1 for unknown reasons since it can be extracted and installed by hand.
To uninstall the agent on Linux, there is /opt/ibm/director/bin/diruninstall, but note that it will not even come close to actually uninstalling all of Director. It leaves behind /opt/ibm/*, /etc/ibm/* and /var/log/dirinst.log, and it leaves the TIVguid-1.3.0-0 rpm installed. You should remove all of this to complete a full uninstall. Then attempt a new installation to see if it can register and be discovered correctly by the Director Server through the CAS.
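Because the leftovers are easy to miss, a dry-run helper like the one below could list what a full cleanup would have to remove. It is a sketch under assumptions: the paths and rpm name are the ones listed above, the function names are made up, and nothing is executed. Be careful with /opt/ibm and /etc/ibm if other IBM software lives on the same machine.

```python
import glob
import shlex

# Leftovers named in the text above.
LEFTOVER_PATTERNS = ["/opt/ibm/*", "/etc/ibm/*", "/var/log/dirinst.log"]
LEFTOVER_RPM = "TIVguid-1.3.0-0"


def find_leftovers(patterns=LEFTOVER_PATTERNS):
    """Expand the patterns against the local filesystem."""
    found = []
    for pattern in patterns:
        found.extend(glob.glob(pattern))
    return sorted(found)


def cleanup_plan(paths, rpm_installed):
    """Build the commands a full cleanup would run; nothing is executed."""
    plan = [f"rm -rf {shlex.quote(path)}" for path in paths]
    if rpm_installed:
        plan.append(f"rpm -e {LEFTOVER_RPM}")
    return plan


if __name__ == "__main__":
    # rpm_installed is hard-coded here; check it with `rpm -q` on a real host.
    for command in cleanup_plan(find_leftovers(), rpm_installed=True):
        print(command)
```

Printing the plan instead of running it leaves a chance to review the list before anything is deleted.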
IBM Technical Support
In Linux, we have seen a number of issues that have generated a number of PMRs with IBM technical support. Yes, we purchased software support from IBM in order to help us get through the mess of problems and issues that plague the product. All of these issues are still open as we try to get this product to work in our environment. To open a PMR with IBM is not trivial. First, you have to call their 800 number and wait for at least 1-2 hours before talking with someone to enter the PMR. What if you have more than one issue to report? Well, that is gonna suck, because they will only take one issue per phone call. Yeah — you heard me right, they will not accept multiple issues per call no matter how much you scream and argue. So, to enter four tickets into their system took a full business day of waiting on hold to get them entered.
What do you get after that? Well, the usual junk about installing correctly, checking the BIOS, BMC, firmware, network cards, etc. When we tell them we have identical machines with the same levels of firmware, operating system and Director software, they just keep on going as if we don’t know what we are talking about. It is frustrating and difficult, but we need to get this level of administration on our servers that Director is supposed to provide.
Issues We Have Encountered
Most of our Linux machines are running RedHat 4 Update 6 and as such have the OpenIPMI interface already integrated. IBM Director common agent uses this to communicate with the BMC and report back to the Director server. However, we have seen the kipmi0 daemon consume 100% of the CPU and take control of our systems. The only way that we have been able to correct the problem is to use the MPCLI command-line interface on the Director server to execute a “restartmp” command. That restarts the BMC and releases whatever was holding up the daemon. Unfortunately, this can occur at any time and our servers are running batch jobs 7×24 and cannot tolerate this behavior. We have built cron jobs to do this automatically every hour or two just to make sure these systems run as cleanly as possible. IBM has suggested that this is a Broadcom network driver issue and we are investigating.
The next hurdle we encountered is that the cimserver daemon goes to 100% CPU utilization. Again, this happens randomly and so far IBM has been unable to tell us why it occurs. The obvious fix is to kill and restart the cimserver daemon, and we have built jobs to do this as well. However, this is not an acceptable way to proceed long term.
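For anyone wanting to detect the two runaway daemons described above (kipmi0 and cimserver) instead of restarting things blindly on a timer, the check could look roughly like this. It is a sketch, not the cron job we actually run: it parses the output of `ps -eo pid,pcpu,comm`, the 90% threshold is an arbitrary assumption, and it only reports offenders. Note that ps reports CPU use averaged over the life of the process, so a short spike in a long-running daemon may not show up; sampling with top would be more responsive.

```python
def runaway_processes(ps_output, names=("kipmi0", "cimserver"), threshold=90.0):
    """Return (pid, name, cpu) for watched processes at or above the threshold.

    ps_output is the text produced by `ps -eo pid,pcpu,comm`, including
    its header line.
    """
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, pcpu, comm = fields
        comm = comm.strip()
        if comm in names and float(pcpu) >= threshold:
            hits.append((int(pid), comm, float(pcpu)))
    return hits


if __name__ == "__main__":
    # Made-up sample; on a real host, capture the output of
    # `ps -eo pid,pcpu,comm` and pass it in instead.
    sample = (
        "  PID %CPU COMMAND\n"
        "   12 99.9 kipmi0\n"
        "  345  0.3 cimserver\n"
        "  678 97.0 java\n"
    )
    for pid, name, cpu in runaway_processes(sample):
        print(f"{name} (pid {pid}) is at {cpu}% CPU")
```

What to do with a hit is left out on purpose: kipmi0 is a kernel thread and cannot simply be killed, which is why we fall back on restarting the BMC.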
We also run a lot of VMware ESX 3.5 servers in our environment and IBM says that their software will run on ESX. While it installs there as well as on any Linux (and that is not saying much), the first thing we noticed is a large number of entries from a service called cimprovagt in the /var/log/messages file, which grows rapidly as a result. It is not clear whether these entries indicate errors or warnings, or are just informational.
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 140
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 183
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 184
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 100
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 100
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 99
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 99
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 98
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 98
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 141
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 142
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 142
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 181
Feb 19 12:34:46 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 168
At first we thought that this was related to ESX 3.5 since it uses OpenIPMI to show the “Health Status” in the Virtual Infrastructure Client under the Configuration tab. We tried to open a ticket with VMware, but as soon as they heard we were using LS-21 blades in an IBM BladeCenter they rejected the ticket and told us to talk with IBM. That led us into a finger-pointing match, as IBM immediately said that cimprovagt was part of ESX and that we had to lodge the ticket with VMware. Eventually, we saw an issue report from VMware that was not directly linked to IBM but looked very similar. So, we decided to upgrade all of our ESX servers from 3.5 Update 2 to 3.5 Update 4. This did change the output from within ESX, which we are now pursuing with IBM. Turns out the possible culprit is an identification problem with a RAID controller. I say “identification problem” because we do not have any RAID controllers in our BladeCenter: we use iSCSI SAN disks for our guests and the base OS is loaded on a single local SAS hard disk. The new output on all of our blades shows the following:
Apr 19 01:02:00 blade00-4 HostRaidInd[2343]: Cannot parse software release date for controller 0
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 179
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 178
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 177
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 176
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 186
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 154
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 211
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 210
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 155
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 181
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 140
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 183
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 184
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 100
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 100
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 99
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 99
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 98
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 98
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 141
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 142
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 142
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 181
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 168
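With this much noise in /var/log/messages, it helps to boil the entries down to which sensors are affected and how often before handing anything to support. The following is a rough sketch that matches only the message format seen in the excerpts above; the function name and the grouping are my own choices.

```python
import re
from collections import Counter

# Matches both the cimprovagt and the HostRaidInd[pid] variants shown above.
SENSOR_RE = re.compile(
    r"(?P<source>\w+)(?:\[\d+\])?: Attempting to read the state of a "
    r"(?P<kind>discrete|threshold) sensor with no reading: (?P<sensor>\d+)"
)


def summarize(lines):
    """Count messages per (source, sensor kind, sensor number)."""
    counts = Counter()
    for line in lines:
        match = SENSOR_RE.search(line)
        if match:
            key = (match["source"], match["kind"], int(match["sensor"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state "
        "of a discrete sensor with no reading: 100",
        "Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the "
        "state of a threshold sensor with no reading: 179",
    ]
    for (source, kind, sensor), count in sorted(summarize(sample).items()):
        print(f"{source} {kind} sensor {sensor}: {count}")
```

Feeding it the whole messages file, one line at a time, gives a short table instead of thousands of near-identical lines.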
Next up is powering off and shutting down Linux boxes when the IBM Systems Director Agent is installed. After the installation, I attempted to shut down one of the Linux servers. I went to the console and entered the “shutdown” command, but nothing happened. A bit perplexed, I then issued the “poweroff” command, but still nothing happened. There was no command I could issue that would reboot or shut down the system. We have asked IBM about this and posted on their developer forums. So far we have not received any confirmation or resolution. The only answer we could come up with was to use IBM Director itself. Using the power options from the common agent, we were able to restart and power down the server.
SNMP Discovery
It appears that SNMP discovery has significantly changed in 6.1 of IBM Systems Director. In previous versions, such as 5.20.x, the SNMP community could be configured and a seed device such as a router could be specified. Depending on the LAN/WAN configuration, that was enough to discover most, if not all of the network assuming that Director did not crash during the process (it almost always did).
In the older versions, SNMP discovery would query the ipNetToMediaNetAddress table of each node it encountered. This was a great way to discover a network while supplying only a few seed systems. It was not necessary to scan entire IP address ranges, since the ipNetToMediaNetAddress table supplied the exact IP addresses to try.
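The idea behind seed-based discovery is a plain breadth-first walk over address tables. The sketch below shows the general technique, not Director's actual code: fetch_neighbors is a hypothetical callable standing in for an SNMP walk of the ipNetToMediaNetAddress column, and the addresses in the example are made up.

```python
from collections import deque

# OID of ipNetToMediaNetAddress in the standard IP-MIB; a real
# fetch_neighbors implementation would walk this column on each device.
IP_NET_TO_MEDIA_NET_ADDRESS = "1.3.6.1.2.1.4.22.1.3"


def discover(seeds, fetch_neighbors):
    """Breadth-first discovery starting from a few seed devices.

    fetch_neighbors(ip) is a hypothetical callable that returns the IP
    addresses found in that device's ipNetToMediaNetAddress column, or an
    empty list if the device does not answer SNMP.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for neighbor in fetch_neighbors(node):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


if __name__ == "__main__":
    # Made-up address tables standing in for real SNMP walks.
    tables = {
        "10.0.0.1": ["10.0.0.2", "10.0.1.1"],
        "10.0.1.1": ["10.0.1.20", "10.0.0.1"],
    }
    found = discover(["10.0.0.1"], lambda ip: tables.get(ip, []))
    print(sorted(found))
```

Only addresses that some device has actually seen are ever probed, which is why a single router as a seed could be enough to reach most of a network.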
However, in 6.1 it looks like this does not occur. When you create an entry in the “Advanced System Discovery” for SNMP, the community strings set there apply only to the devices discovered in that profile, that is, to whatever was entered as the IP range or single IP address. We use sparsely populated VLANs and subnets, which makes this a difficult and very time-consuming approach. Having 15 subnets with a /24 netmask means that I will be probing almost 4000 IP addresses, with less than 30% of those in use. This is a tremendous waste of time and resources that was not required by the old discovery mechanism.
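The arithmetic behind that number is easy to check. In this sketch the 10.0.x.0/24 addressing plan is hypothetical; only the count of 15 subnets, the /24 mask and the 30% figure come from the text.

```python
import ipaddress


def probe_count(networks):
    """Usable host addresses a full range scan would touch.

    Subtracts the network and broadcast address of each subnet, which is
    correct for ordinary subnets such as a /24.
    """
    return sum(net.num_addresses - 2 for net in networks)


def wasted_probes(total, percent_in_use):
    """Probes that hit unused addresses, using integer arithmetic."""
    return total - (total * percent_in_use) // 100


# Hypothetical addressing plan: 15 subnets, each a /24.
subnets = [ipaddress.ip_network(f"10.0.{i}.0/24") for i in range(15)]
total = probe_count(subnets)  # 15 * 254 = 3810
print(total, wasted_probes(total, 30))
```

That is 3810 addresses to probe, and with at most 30% of them in use, at least 2667 of those probes go to addresses where nothing will ever answer.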
What does seem to be the same is that any device that sends Director an SNMP trap is automatically added to the list of discovered entities. However, Director attempts communication using the default “public”/“private” SNMP communities, which in most cases will fail because no administrator will use these, for security reasons. It should be possible to set a default community, but I cannot find such a setting anywhere, nor can I locate any documentation about it.
Performance Problems with BMC?
From our research, it seems that the BMC is not responding quickly enough to requests from the common agent and/or other users of the interface. Quite frequently we see messages in /var/log/messages that say:
Feb 25 13:05:47 gd-ds12-01 kernel: ioctl32(java:12813): Unknown cmd fd(201) cmd(00008938){00} arg(dd45bfe8) on socket:[62763481]
Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(205) cmd(00008938){00} arg(de948fe8) on socket:[62763640]
Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(206) cmd(00008938){00} arg(de948fe8) on socket:[62763689]
Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(206) cmd(00008938){00} arg(de948fe8) on socket:[62763740]
Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(207) cmd(00008938){00} arg(de948fe8) on socket:[62763806]
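which at first looked to us like probing for a floppy disk or other device that is not present. However, the messages refer to a socket, and command 0x8938 is SIOCGIFCOUNT in the Linux socket ioctl list, so this is more likely a 32-bit Java process issuing a network ioctl that the 64-bit kernel’s compatibility layer does not translate. Either way, I would assume that one failure would be enough and it is unnecessary to continually output this information. We also see the following items on the system console: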
IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 0 cmd 0
IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 0 cmd 0
IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 5 cmd 2d
which looks like there is a miscommunication between the OpenIPMI interface and the BMC. I can only assume that this communication error is a lack of responsiveness on the part of the BMC, but I am not positive.
Overall
The need for something like IBM Director is very real and grows with virtualization and consolidation within the DataCenter. We all have fewer resources to maintain and manage these systems, which are growing in complexity. We have to worry more about environmental issues such as heat dissipation and power consumption while maintaining availability and redundancy of service. Products like IBM Systems Director are a great fit, but they have to work and work well.
Our experiences have been echoed by many others, as is evident in the IBM Director forums on IBM’s site as well as in other blogs and articles. It seems that IBM does not put forth the effort to really test this product to make sure of its quality, and I am puzzled about that. Yes, I know it is a free tool for everyone who purchases an IBM server, but that is also the value-add that it is supposed to bring. Even if you purchase a service contract as we did and talk with IBM about integration with Tivoli, there is no mention of making sure the IBM Systems Director foundation is solid. There is no way a rational DataCenter manager can justify spending the large amounts of time currently required to try to bring IBM Systems Director online and keep it operational. As far as we can tell, this is simply not even possible. If there were legitimate alternatives to this product, we would be pursuing them instead. But even with a low-level replacement, what would the overall systems management direction be? If it is Tivoli, then not having IBM Systems Director below it seems counter-productive, as we would not gain from the integration. I don’t know of other tools out there that use a common agent approach, integrate with IPMI and handle various platforms, so we will continue our pursuit of this system and update this article as we progress.