Over the past year, we have been struggling with the deployment of IBM Director and continue to discuss whether it makes any sense to keep trying to get it working or to abandon it completely. Our company standardized on the IBM server platforms over four years ago, using the X-series systems and BladeCenters, all of which come with a free copy of IBM Director. We started with 5.20.1 and then upgraded to 5.20.2 (with update 1, then update 2 and update 3) and now we are on 5.20.3. Overall, our results have been inconsistent, unreliable and downright frustrating. I have seen blogs and posts that say Director 6.0 is coming out, and some of them even point at the Director download site to get a copy. However, I have been unable to confirm with IBM that it is available yet.
For us, the reasons for wanting to use a management system such as Director were:
- Hardware status from servers. We have had outages in our server room due to heat, power and server failures. The more information we can obtain and work with, the better. For example, we wanted to get information on fans, memory, disks, etc.
- Remote power on/off. With the Intelligent Platform Management Interface (IPMI), access to low-level features is available regardless of operating system status. IBM uses the Baseboard Management Controller (BMC) to provide this feature. With this interface, which runs off of the primary NIC, administrators have access to the physical machine, including cold and hot power cycles, as long as power is being provided to the server.
- Event action plans. Given a circumstance or set of circumstances, the product can take a set of actions to prevent the need for an IT staff member to get a 2:00am phone call. If it cannot resolve the issue, at least the obvious things have already been tried.
Our IBM Platforms
The major reason that Director came into our thoughts as a management tool was that it came for free with every server we bought. With over 100 IBM servers on-site, it was a compelling argument to try and see if it could be used. We had the following mix of servers:
- IBM e325 servers
- IBM e326 servers
- IBM x345 servers
- IBM x346 servers
- IBM x336 servers
- IBM x3650 servers
- IBM x3755 servers
- IBM x3455 servers
- IBM BladeCenter H with LS-21 blades
In all cases, we had decided to just use the built-in BMC that shares an interface with the first or primary NIC on the servers. While advanced cards are available from IBM (such as RSA and RSA II) for management, we decided to stay with the built-in features until we were certain about our strategy moving forward.
Our first task was to get every server on the latest and greatest BIOS, diagnostics, BMC firmware and network controller firmware. This was a real challenge as some of these devices are a bit old and their configuration is not done through the BIOS interface. For example, the e325 servers have a DOS utility called LANCFG.EXE which comes with the DOS disk image for the BMC firmware. Here again is an inconsistency, since LANCFG allows an SNMP trap destination to be set, which the BIOS-configured systems do not.
Using the BMC required that we reconfigure our network a bit since the BMC interface requires an IP address that is in the same network as the server interface sharing the NIC. Our strategy has been to use adjacent IP addresses for ease of identification and management. For example, if the IP address for the server is 10.1.1.11/24, then the BMC was configured to be at 10.1.1.12/24.
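The adjacent-address convention is easy to express in a few lines. This is only a minimal sketch of the rule as described above, using Python's standard ipaddress module; the function name bmc_address is made up for illustration and is not part of any IBM tool.

```python
import ipaddress


def bmc_address(server_cidr):
    """Return the address one above the server's, in the same subnet.

    Hypothetical helper: it only encodes the "next address up" convention
    described in the text and does not talk to any hardware.
    """
    server = ipaddress.ip_interface(server_cidr)
    network = server.network
    candidate = server.ip + 1
    # The BMC must sit in the same network as the NIC it shares, and the
    # next address up must not be the subnet's broadcast address.
    if candidate not in network or candidate == network.broadcast_address:
        raise ValueError(f"no adjacent host address after {server_cidr}")
    return ipaddress.ip_interface(f"{candidate}/{network.prefixlen}")


print(bmc_address("10.1.1.11/24"))
```

A server sitting on the last usable address of its subnet has no adjacent address, so the sketch raises an error instead of silently leaving the network.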
Some of these devices do not have floppy drives, so we had to either use CD-ROM images or attach USB floppy drives to get these items loaded on to the boxes. Speaking of floppies, IBM ships their images as .IMG files. We spent some time trying to find a program to build a floppy based on that format and finally stumbled across EMT4WIN which is an excellent tool.
Next, we built a dedicated IBM Director server. This is an e325 server with two 2.4GHz Opteron processors and 4GB of RAM running Microsoft Windows 2003 R2 Standard Server with SP2. While it may be overkill, we wanted to give the server plenty of resources to keep it from becoming a bottleneck regardless of the situation. Installation was quick and easy, and the server was up within a very short period of time.
Discovery [VLANs, SNMP, Level-2, Physical Platforms]
We have quite a few VLANs in our environment and needed Director to find all of the servers in our enterprise. Clearly, broadcast or unicast was not going to be the right way to go, so we looked into multicast and relay. After about a week of researching and playing with multicast, we decided that this was not the correct approach for us. It was sending out too much junk and seemed ripe for exploitation within the environment. Actually, our network administrator was supportive of the idea, but when we got to talking about opening multicast across routers and such, it became a hairball no one wanted to tackle. We are currently discussing blocking multicast protocols to avoid networking issues and problems (but that is another article). Anyway, suffice it to say that relay was our answer. We selected a node in each VLAN to act as our relay host to enable discovery.
HP Openview is used for network administration and since Director also wants SNMP traps, all of our servers were reconfigured to add the Director server as a trap destination.
When the first discovery occurred, the first things we saw were some SNMP hosts. Mostly routers and switches, but soon, servers started to show up. Reading more about SNMP discovery showed what was happening. The SNMP discovery was walking the SNMP ipNetToMediaNetAddress table of each node that it encountered. This is actually a great way to discover a network and it was working flawlessly until it abruptly stopped. At some point, it just stopped processing nodes and would not discover another. Using a network sniffer, we observed the discovery process and once it stopped, the product would no longer issue any SNMP commands, no matter how often we told the system to run the discovery. Only after a reboot of the Director server would SNMP discovery occur again. Even then, it would stop again after a certain amount of discovery. Our conclusion was that some sort of memory or buffer overflow was occurring, since some of our nodes were large Cisco 6509 switches that had a very large number of entries in the ipNetToMediaNetAddress table. We contacted IBM technical support and they were unable to offer any ideas or suggestions, and this continues not to work correctly in any version.
One thing that will surprise most administrators is that a server which supports SNMP and has a Level-2 Director agent installed will show up in the IBM Director Console twice. While the Level-2 agent supports SNMP, it does not reconcile or eliminate the SNMP discovered item when looking at the “All Managed Objects” list of entities. We asked IBM about this and it is the expected result since administrators can look at servers in a number of ways. While it makes sense, I would like a different way of showing the systems in that view as it gets confusing and cluttered.
If a server has a management interface (such as BMC), then it will show up in two areas as well, but they will be associated with each other in the “All Managed Objects” view. Next to the server (level-2 agent) icon, there is a small box (see below) that indicates that this is a container object as well.
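To make the idea of walking each node's ipNetToMediaNetAddress table concrete, here is a small sketch of that style of discovery as a breadth-first search. It is not Director's implementation, which I have never seen. The ARP_TABLES dictionary is invented sample data standing in for real SNMP queries, and the discover function and its fetch_arp_table parameter are names I made up for the example.

```python
from collections import deque

# Invented sample data: each node mapped to the IP addresses listed in its
# ipNetToMediaNetAddress column. A real tool would fetch these over SNMP.
ARP_TABLES = {
    "10.1.1.1": ["10.1.1.11", "10.1.2.1"],
    "10.1.2.1": ["10.1.2.21", "10.1.1.1"],
    "10.1.1.11": ["10.1.1.1"],
    "10.1.2.21": [],
}


def discover(seed, fetch_arp_table):
    """Breadth-first walk starting from one seed node.

    Every address learned from a node's table is queued as the next node
    to query, and the set of seen addresses prevents revisiting a node.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbour in fetch_arp_table(node):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return sorted(seen)


found = discover("10.1.1.1", lambda ip: ARP_TABLES.get(ip, []))
print(found)
```

Starting from a single router, the walk reaches every node that appears in any table along the way, which is why a core switch with a huge table makes the work queue grow so quickly.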
When you click on the box, it will expand to show you that there is a physical platform contained within the agent.
Note that you can get some very useful information from these items, such as the server model and serial number as well as the IP addresses, FQDNs and MAC addresses (latter two not shown).
Operating System Configurations
While we mentioned IPMI, the level-2 agent needs a mechanism within the operating system to talk to the BMC. Depending on the platform and the operating system, it can be difficult to come up with the combinations that really work. After reading many different blogs and writings, it is clear to us that a lot of administrators have simply overlooked or ignored all of these features and functions and use level-2 agents to manage operating system features and functions such as services, disk space and such. But in our environment, it is all-or-nothing because it either provides all required aspects or another product will replace it.
For Windows 2003 Server on the IBM platforms, we have to install the device drivers for the BMC and then a mid-tier layer to provide an IPMI interface that the level-2 agent can communicate with.
For Redhat Linux, the problem is a bit different, depending on the version. We started with Redhat AS 3.0 Update 2. For this system, there are drivers and mid-layer tiers that are customized for the platform. Not all of the software is available from IBM, so we had to gather everything we needed from different sites. Having tried every combination of things we could think of to get these built, it was clear that it was taking way too much time. We knew that RHEL 3.0 Update 9 and greater used OpenIPMI and that IBM was a contributing member to that initiative. Rather than continue to use RHEL 3.0, we decided to move up to RHEL 4.0 Update 6 once we confirmed that all of our applications were compatible with that operating system.
For communications with the Director server, you need to unlock communications. While this can be done through the console, it is easier to copy the server’s public key on to each server.
So Far, So Good — Until Now
Having reconfigured every server and bounced them all, we just told Director to discover the entire enterprise and waited for the results. When they came in, the first reaction was enthusiastic, with the exception of SNMP, which I already talked about. All of the systems showed on the console as expected. Or had they? Closer observation showed that some machines did not have an associated physical platform as expected. This was not limited to a certain platform, operating system, VLAN or other boundary. We rechecked everything we had done and confirmed that the exact same software and configurations had been applied, but the results did not match.
We contacted our reseller to get first-level assistance in debugging what we had done. They were of very little help and the next step was to use our IBM Direct Sales contacts to see if they would assist. We went down this road since IBM Director came with the servers and we considered this part of support for them, but IBM did not agree. They got us in contact with IBM partner companies that were willing to assist and then sell us consulting services and such. We played along for a while, telling them that to get our business they would have to prove that they were capable of providing the level of technical service we required. In every case, these partners simply failed. They did not understand how SNMP discovery worked, could not explain the inconsistencies in the level-2 discovery and overall had no technical skills worth purchasing. They even questioned our use of level-2 agents and wanted to cut back our implementation to make it simpler. For whom? Of course for them! Sorry, but we were not interested in making it easy for them at our expense. Finally, I agreed to bite the bullet and purchase IBM software support for Director.
The IBM technical support people that I talked to actually knew their stuff. They were professional, direct and had concrete ideas and suggestions for us. I ended up feeling sorry for one or two of them as they were stuck supporting a product that clearly does not work and is not documented very well from an infrastructure/installation/troubleshooting perspective. That is shocking in that this product is aimed squarely at solving those types of items for sophisticated technical users.
Problem #1 – Solving issues on Microsoft Windows 2000 and 2003 Server
Anyway, the first thing I learned is that on Microsoft Windows, the order in which the components are installed is critical. Screw that up and just start over. We were told that the drivers must be installed first, then the mid-layer software and then the level-2 agent. After each step the server must be rebooted even if you are not prompted to do so. If you violate this procedure in any way, you may or may not get it working properly. If your installation is not consistent, then uninstall everything, delete all files and directories associated with Director and clean the registry and start again. Once we did all of that, we had a consistent install on Windows.
Problem #2 – Solving Issues on Redhat Linux 4.0
Fixing Linux was nowhere near as simple as Windows. Since Redhat already comes with OpenIPMI, we thought we had this one licked. All you have to do is run the agent install, copy the server certificate and reboot. Simple. Nope. It just sounds simple, but the truth is that this was the most trying thing we had to do. Not even IBM technical support could get us 100% consistent without a lot of trial and error and real frustrations.
We started with 5.20.2 and had problems getting physical platforms on some machines, and on others the list of items on the hardware status page was incomplete or not showing common names.
As you can see on the left, the physical memory items 3 and 4 are missing from one of the servers along with all of the environmental entities. In the right image, one of the environmental sensors is being reported as “Sensor 49” instead of its common name “Fan Sensor 6”. Using the MPCLI on the Director server, we checked the results being returned from the BMC. In each case, we saw the correct items and values being returned. Why then, does the level-2 agent not have the correct results and return them to the console? This is the question that we posed to IBM and they could not give us any straight or knowledgeable answers.
Their first suggestion was to upgrade to 5.20.2 Update 1, which had then been recently released. Perhaps it had some bug fixes we needed. Easy enough, we thought. Just install the update over the existing version and see what happens. The results were horrible. Even more inconsistencies came up, even on machines that had been reporting as expected. The answer from IBM: uninstall and reinstall everything again. Okay. We did that and things got somewhat better, but still not 100% accurate. We tried bouncing OpenIPMI and the Director agents while at the same time removing the discovered items from the Director console. Eventually, we were able to get most machines to report correctly and started building a knowledge base on how to get things working. Eventually, Update 2 arrived and we tried again. Still no luck on a consistent method for getting all of the hardware status values. When 5.20.3 finally arrived, we tried again and now we are waiting for 6.0 to hopefully get this under control, but I have my doubts.
What about VMware ESX 3.02 and/or 3.5?
One would assume that VMware would react the same as RedHat 4.0 had done. That would be true for 3.02, but not for the later updates to 3.5. As administrators know, the latest updates to 3.5 have added a feature called Health Status which uses the OpenIPMI software to communicate this information.
Our question was whether or not this does or could interfere with the IBM Director level-2 agent and its use of the OpenIPMI software. We could not find an answer to that question, so we continued with our trial-and-error approach. What we came up with is less than optimal, but it has consistently worked on every ESX 3.5 server we have configured and installed.
First, just install the 5.20.3 agent as instructed by the IBM documentation. Once that is installed, copy over the public key for the server to the /opt/ibm/director/data directory. Remove any physical platforms or other objects for the server that might already exist in the Director console and reboot the server. Once the server is back up and running, run a discovery. Do not attempt to interrogate that server until the level-2 agent and physical platforms have been fully discovered. This can be determined when the green question mark disappears from the object icon on the console. Right-click the level-2 agent and bring up the Hardware Status window. Now the fun begins!!
If the Hardware status is empty or incomplete, take the following steps:
- stop and then reinstall the agent. Do not attempt an uninstall.
- remove the agent and physical platform from the Director console
- reboot the ESX server
- once discovery is complete, recheck the Hardware Status results.
- repeat until the Hardware Status results are correct (may take 1-4 tries, but always comes out in the end)
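The steps above amount to a bounded retry loop, and it can help to see it written down that way. This is only a sketch: the two callables are placeholders for the manual work (checking the Hardware Status window, and the reinstall, remove, reboot and rediscover cycle), the function name is made up, and the default cap of four tries is simply the range we happened to observe, not a documented limit.

```python
def repair_until_complete(status_is_complete, repair_cycle, max_tries=4):
    """Run the repair cycle until Hardware Status is complete.

    status_is_complete: placeholder callable, True when the Hardware Status
        window shows everything expected.
    repair_cycle: placeholder callable for one pass of reinstall agent,
        remove console objects, reboot, and wait for discovery.
    Returns the number of repair cycles that were needed.
    """
    tries = 0
    while not status_is_complete():
        if tries >= max_tries:
            raise RuntimeError(
                f"Hardware Status still incomplete after {tries} tries"
            )
        repair_cycle()
        tries += 1
    return tries


# Simulated server that reports correctly after two repair cycles.
state = {"cycles": 0}


def fake_cycle():
    state["cycles"] += 1


def fake_check():
    return state["cycles"] >= 2


print(repair_until_complete(fake_check, fake_cycle))
```

Capping the loop matters in practice: a server that never comes good should be escalated, not rebooted forever.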
Conclusion
At this point, we appear to have a consistent, working environment. But given our results, we just don’t know if it will stay that way and for how long. Will a reboot lose information? When we have an issue arise, will it perform correctly? Will the next version of the software work and if so, will it work with my current platforms or will everything need to be upgraded? Do I have to purchase RSA cards to get what I want?
As you can see, we have a lot of questions and I have not even begun to talk about the real features, functions and capabilities within IBM Director. This is the simple baseline stuff upon which to build a real management environment. But if the foundation is built in quicksand, then it will crumble and fall and at a time when it is most needed. IBM has been asking me to discuss building a house of Tivoli on top of this stuff and I simply cannot take that step given the current situation. I have repeatedly told them that I cannot and will not move forward until I have a base built on bedrock. However, I am not sure this will ever get me there. Perhaps we need to investigate OpenView or Unicenter. Time will tell, but for now IBM Director is off to a really bad start in our environment.