IBM Director 5.20.x – worth using or not?

January 11, 2009 · 16 comments

Over the past year, we have been struggling with the deployment of IBM Director and continue to debate whether it makes any sense to keep trying to get it working or to abandon it completely.  Our company standardized on the IBM server platforms over four years ago, using the X-series systems and BladeCenters, all of which come with a free copy of IBM Director.  We started with 5.20.1, then upgraded to 5.20.2 (with update 1, then update 2 and update 3), and now we are on 5.20.3.  Overall, our results have been inconsistent, unreliable and downright frustrating.  I have seen blogs and posts that say Director 6.0 is coming out, and some of them even point at the Director download site to get a copy.  However, I am unable to confirm that it is available yet from IBM.

For us, the reasons behind wanting to use a management system such as Director were:

  • Hardware status from servers.  We have had outages in our server room due to heat, power and server failures.  The more information we can obtain and work with, the better.  For example, we wanted to get information on fans, memory, disks, etc.
  • Remote power on/off.  With the Intelligent Platform Management Interface (IPMI), access to low-level features is available regardless of operating system status.  IBM uses the Baseboard Management Controller (BMC) to provide this feature.  With this interface, which shares the primary NIC, administrators have access to the physical machine, including cold and hot power cycling, as long as power is being provided to the server.
  • Event action plans.  Given a circumstance or set of circumstances, the product can take a set of actions to prevent the need for an IT staff member to get a 2:00am phone call.  If it cannot resolve the issue, at least the obvious things have already been tried.
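
To make the remote power bullet above concrete, here is a minimal sketch of how a power command could be sent to a BMC over the network. It assumes the open-source ipmitool utility is installed, which is not part of IBM Director and is not something we used in this deployment. The address 10.1.1.12 and the user name "admin" are placeholders. The code only builds the command line and does not execute it.

```python
import shlex

# Power actions accepted by `ipmitool chassis power`.
POWER_ACTIONS = {"status", "on", "off", "cycle", "reset", "soft"}


def ipmi_power_command(bmc_host, user, action, interface="lan"):
    """Build an ipmitool argv list for a chassis power action.

    The password is not placed on the command line; the -E flag tells
    ipmitool to read it from the IPMI_PASSWORD environment variable.
    """
    if action not in POWER_ACTIONS:
        raise ValueError(f"unsupported power action: {action!r}")
    return [
        "ipmitool",
        "-I", interface,   # "lan" for IPMI 1.5 BMCs, "lanplus" for IPMI 2.0
        "-H", bmc_host,    # the BMC's own IP address, not the server's
        "-U", user,        # placeholder account name
        "-E",
        "chassis", "power", action,
    ]


# Placeholder host and user, for illustration only.
cmd = ipmi_power_command("10.1.1.12", "admin", "status")
print(shlex.join(cmd))
```

To actually run it, the list can be handed to subprocess.run with IPMI_PASSWORD set in the environment. Whether a given BMC answers on the "lan" or "lanplus" interface depends on its firmware.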


Our IBM Platforms

The major reason that Director came into our thoughts as a management tool was that it came for free with every server we bought.  With over 100 IBM servers on-site, it was a compelling argument to try it and see if it could be used.  We had the following mix of servers:

  1. IBM e325 servers
  2. IBM e326 servers
  3. IBM x345 servers
  4. IBM x346 servers
  5. IBM x336 servers
  6. IBM x3650 servers
  7. IBM x3755 servers
  8. IBM x3455 servers
  9. IBM BladeCenter H with LS-21 blades

In all cases, we had decided to just use the built-in BMC, which shares a network interface with the first or primary NIC on the servers.  While advanced cards are available from IBM (such as RSA and RSA II) for management, we decided to stay with the built-in features until we were certain about our strategy moving forward.

Our first task was to get every server on the latest and greatest BIOS, Diagnostics, BMC firmware and network controller firmware.  This was a real challenge, as some of these devices are a bit old and their configuration is not done through the BIOS interface.  For example, the e325 servers have a DOS utility called LANCFG.EXE which comes with the DOS disk image for the BMC firmware.  Here again is an inconsistency: LANCFG allows an SNMP trap destination to be set, which the BIOS-configured systems do not.

Using the BMC required that we reconfigure our network a bit, since the BMC interface requires an IP address that is in the same network as the server interface sharing the NIC.  Our strategy has been to use adjacent IP addresses for ease of identification and management.  For example, if the IP address for the server is 10.1.1.11/24, then the BMC was configured to be at 10.1.1.12/24.
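
The adjacent-address convention is easy to check mechanically. The sketch below uses Python's standard ipaddress module. The function name bmc_address_for is made up for this example, and the "server address plus one" rule is simply our convention, not anything the BMC requires.

```python
import ipaddress


def bmc_address_for(server_cidr):
    """Return the BMC interface (address/prefix) adjacent to a server address.

    Raises ValueError if the next address falls outside the server's
    subnet or lands on its broadcast address.
    """
    server = ipaddress.ip_interface(server_cidr)
    candidate = server.ip + 1
    network = server.network
    if candidate not in network or candidate == network.broadcast_address:
        raise ValueError(f"no usable adjacent address after {server.ip}")
    return ipaddress.ip_interface(f"{candidate}/{network.prefixlen}")


print(bmc_address_for("10.1.1.11/24"))  # 10.1.1.12/24
```

A server sitting at the top of its subnet (10.1.1.254/24, for example) has no usable neighbor, so the function raises an error instead of quietly handing back the broadcast address.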

Some of these devices do not have floppy drives, so we had to either use CD-ROM images or attach USB floppy drives to get these items loaded on to the boxes.  Speaking of floppies, IBM ships their images as .IMG files.  We spent some time trying to find a program to build a floppy based on that format and finally stumbled across EMT4WIN which is an excellent tool.

Next, we built a dedicated IBM Director server.  This is an e325 server with two 2.4GHz Opteron processors and 4GB of RAM running Microsoft Windows 2003 R2 Standard Server with SP2.  While it may be overkill, we wanted to give the server plenty of resources to keep it from becoming a bottleneck regardless of the situation.  Installation was quick and easy, and the server was up within a very short period of time.

Discovery [VLANs, SNMP, Level-2, Physical Platforms]

We have quite a few VLANs in our environment and needed Director to find all of the servers in our enterprise.  Clearly, broadcast or unicast was not going to be the right way to go, so we looked into multicast and relay.  After about a week of researching and playing with multicast, we decided that this was not the correct approach for us.  It was sending out too much junk and seemed ripe for exploitation within the environment.  Our network administrator was supportive of the idea at first, but once we got to talking about opening multicast across routers and such, it became a hairball no one wanted to tackle.  We are currently discussing blocking multicast protocols to avoid networking issues and problems (but that is another article).  Anyway, suffice it to say that relay was our answer.  We selected a node in each VLAN to act as our relay host to enable discovery.
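
Choosing one relay per VLAN can be sketched as a simple grouping exercise. This example assumes each VLAN maps to exactly one IP subnet and picks the lowest address in each subnet as the relay. Both of those are assumptions made for illustration. We chose our relay hosts by hand, and the host list here is made up.

```python
import ipaddress
from collections import defaultdict


def pick_relays(host_cidrs):
    """Group hosts by subnet and pick the lowest address in each as relay."""
    by_network = defaultdict(list)
    for cidr in host_cidrs:
        iface = ipaddress.ip_interface(cidr)
        by_network[iface.network].append(iface.ip)
    return {str(net): str(min(ips)) for net, ips in by_network.items()}


# Made-up host list covering two subnets.
hosts = ["10.1.1.11/24", "10.1.1.21/24", "10.1.2.15/24", "10.1.2.14/24"]
for network, relay in sorted(pick_relays(hosts).items()):
    print(f"{network}: relay host {relay}")
```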

HP OpenView is used for network administration, and since Director also wants SNMP traps, all of our servers were reconfigured to add the Director server as a trap destination.

When the first discovery ran, the first things we started to see were some SNMP hosts.  Mostly routers and switches, but soon servers started to show up.  Reading more about SNMP discovery showed what was happening.  The SNMP discovery was walking the SNMP ipNetToMediaNetAddress table of each node that it encountered.  This is actually a great way to discover a network, and it was working flawlessly until it abruptly stopped.  At some point, it simply quit processing nodes and would not discover another.  Using a network sniffer, we observed the discovery process, and once it stopped, the product would no longer issue any SNMP commands, no matter how often we told the system to run the discovery.  Only after a reboot of the Director server would SNMP discovery occur again, and even then it would halt again after a certain amount of discovery.  Our conclusion was that some sort of memory or buffer overflow was occurring, since some of our nodes were large Cisco 6509 switches that had a very large number of entries in the ipNetToMediaNetAddress table.  We contacted IBM technical support and they were unable to offer any ideas or suggestions, and this continues not to work correctly in any version.
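
The discovery approach described above amounts to a breadth-first crawl: read one node's address table, queue every address not seen before, and repeat. The sketch below shows only the idea and is not Director's implementation. The fetch_neighbors callable is a stand-in for a real SNMP walk of ipNetToMediaNetAddress, the max_nodes cap is an assumption added for this example, and the table data is made up.

```python
from collections import deque


def discover(seed, fetch_neighbors, max_nodes=10_000):
    """Breadth-first discovery over per-node neighbor tables.

    fetch_neighbors(address) stands in for an SNMP walk of the node's
    ipNetToMediaNetAddress entries and returns the addresses found there.
    It should return an empty list for nodes that do not answer SNMP.
    max_nodes is a soft cap: no further nodes are expanded once reached.
    """
    seen = {seed}
    queue = deque([seed])
    while queue and len(seen) < max_nodes:
        node = queue.popleft()
        for neighbor in fetch_neighbors(node):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Toy data in place of real SNMP responses.
TABLES = {
    "10.1.1.1": ["10.1.1.11", "10.1.2.1"],
    "10.1.2.1": ["10.1.2.14", "10.1.1.1"],
}
found = discover("10.1.1.1", lambda addr: TABLES.get(addr, []))
print(sorted(found))
```

An explicit cap like max_nodes is one way a crawler can stay bounded when it meets a switch with a very large table. Whether anything similar exists inside Director is unknown to us.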

One thing that will surprise most administrators is that a server which supports SNMP and has a Level-2 Director agent installed will show up in the IBM Director Console twice.  While the Level-2 agent supports SNMP, it does not reconcile or eliminate the SNMP-discovered item when looking at the “All Managed Objects” list of entities.  We asked IBM about this and were told it is the expected result, since administrators can look at servers in a number of ways.  While that makes sense, I would like a different way of showing the systems in that view, as it gets confusing and cluttered.

If a server has a management interface (such as BMC), then it will show up in two areas as well, but they will be associated with each other in the “All Managed Objects” view.  Next to the server (level-2 agent) icon, there is a small box (see below) that indicates that this is a container object as well.

[Image: Server container in Director]

[Image: Expanded Director Level-2 agent]

When you click on the box, it will expand to show you that there is a physical platform contained within the agent. 

Note that you can get some very useful information from these items, such as the server model and serial number as well as the IP addresses, FQDNs and MAC addresses (latter two not shown).

Operating System Configurations

While we mentioned IPMI, the level-2 agent needs a mechanism within the operating system to talk to the BMC.  Depending on the platform and the operating system, it can be difficult to come up with the combinations that really work.  After reading many different blogs and writings, it is clear to us that a lot of administrators have simply overlooked or ignored all of these features and functions and use level-2 agents to manage operating system features and functions such as services, disk space and such.  But in our environment, it is all-or-nothing because it either provides all required aspects or another product will replace it.

For Windows 2003 Server on the IBM platforms, we have to install the device drivers for the BMC and then a mid-tier layer to provide an IPMI interface that the level-2 agent can communicate with.

For Redhat Linux, the problem is a bit different, depending on the version.  We started with Redhat AS 3.0 Update 2.  For this system, there are drivers and mid-layer tiers that are customized for the platform.  Not all of the software is available from IBM, so we had to gather everything we needed from different sites.  Having tried every combination of things we could think of to get these built, it was clear that it was taking way too much time.  We knew that RHEL 3.0 Update 9 and greater used OpenIPMI and that IBM was a contributing member to that initiative.  Rather than continue to use RHEL 3.0, we decided to move up to RHEL 4.0 Update 6 once we confirmed that all of our applications were compatible with that operating system.

For communications with the Director server, you need to unlock communications on each managed system.  While this can be done through the console, it is easier to copy the Director server’s public key onto each managed server.

So Far, So Good — Until Now

Having reconfigured every server and bounced them all, we just told Director to discover the entire enterprise and waited for the results.  When they came in, the first reaction was enthusiastic, with the exception of SNMP, which I already talked about.  All of the systems showed on the console as expected.  Or did they?  Closer observation showed that some machines did not have an associated physical platform as expected.  This was not limited to a certain platform, operating system, VLAN or other boundary.  We rechecked everything we had done and confirmed that the exact same software and configurations had been applied, but the results did not match.

We contacted our reseller to get first-level assistance in debugging what we had done.  They were of very little help, and the next step was to use our IBM Direct Sales contacts to see if they would assist.  We went down this road since IBM Director came with the servers and we considered this part of support for them.  IBM did not agree.  They got us in contact with IBM partner companies that were willing to assist and then sell us consulting services and such.  We played along for a while, telling them that to get our business they would have to prove that they were capable of providing the level of technical service we required.  In every case, these partners simply failed.  They did not understand how SNMP discovery worked, could not explain the inconsistencies in the level-2 discovery and overall had no technical skills worth purchasing.  They even questioned our use of level-2 agents and wanted to cut back our implementation to make it simpler.  For whom?  Of course for them!  Sorry, but we were not interested in making it easy for them at our expense.  Finally, I agreed to bite the bullet and purchase IBM software support for Director.

The IBM technical support people that I talked to actually knew their stuff.  They were professional, direct and had concrete ideas and suggestions for us.  I ended up feeling sorry for one or two of them, as they were stuck supporting a product that clearly does not work and is not documented very well from an infrastructure/installation/troubleshooting perspective.  That is shocking, in that this product is aimed squarely at solving those types of problems for sophisticated technical users.

Problem #1 – Solving issues on Microsoft Windows 2000 and 2003 Server

Anyway, the first thing I learned is that on Microsoft Windows, the order in which the components are installed is critical.  Screw that up and just start over.  We were told that the drivers must be installed first, then the mid-layer software and then the level-2 agent.  After each step the server must be rebooted even if you are not prompted to do so.  If you violate this procedure in any way, you may or may not get it working properly.  If your installation is not consistent, then uninstall everything, delete all files and directories associated with Director and clean the registry and start again.  Once we did all of that, we had a consistent install on Windows.

Problem #2 – Solving Issues on Redhat Linux 4.0

Fixing Linux was nowhere near as simple as Windows.  Since Redhat already comes with OpenIPMI, we thought we had this one licked.  All you have to do is run the agent install, copy the server certificate and reboot.  Simple.  Nope.  It just sounds simple, but the truth is that this was the most trying thing we had to do.  Not even IBM technical support could get us 100% consistent without a lot of trial and error and real frustrations.

We started with 5.20.2 and had problems getting physical platforms on some machines, while on others the list of items on the hardware status page was incomplete or did not show the items' common names.

[Images: Health Status Missing Items (left), Health Status Missing Common Name (right)]

As you can see on the left, the physical memory items 3 and 4 are missing from one of the servers, along with all of the environmental entities.  In the right image, one of the environmental sensors is being reported as “Sensor 49” instead of its common name “Fan Sensor 6”.  Using the MPCLI on the Director server, we checked the results being returned from the BMC.  In each case, we saw the correct items and values being returned.  Why then, does the level-2 agent not have the correct results and return them to the console?  This is the question that we posed to IBM, and they could not give us any straight or knowledgeable answers.
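
The comparison we did by eye, BMC output against what the console showed, can be expressed as a small diff. This sketch assumes both inventories have already been collected into dictionaries keyed by sensor number. The sample data is made up to mirror the symptoms above and is not real MPCLI or Director output.

```python
def compare_inventories(bmc_sensors, agent_sensors):
    """Compare sensor names reported by the BMC with those shown by the agent.

    Both arguments map a sensor number to its display name.
    Returns (missing, renamed): sensors the agent dropped, and sensors
    whose agent-side name differs from the BMC's name.
    """
    missing = {n: name for n, name in bmc_sensors.items() if n not in agent_sensors}
    renamed = {
        n: (bmc_sensors[n], agent_sensors[n])
        for n in bmc_sensors.keys() & agent_sensors.keys()
        if bmc_sensors[n] != agent_sensors[n]
    }
    return missing, renamed


# Made-up sample data modeled on the symptoms described above.
bmc = {3: "Memory 3", 4: "Memory 4", 49: "Fan Sensor 6", 50: "Fan Sensor 7"}
agent = {49: "Sensor 49", 50: "Fan Sensor 7"}
missing, renamed = compare_inventories(bmc, agent)
print("missing from agent:", missing)
print("name mismatch:", renamed)
```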

Their first suggestion was to upgrade to 5.20.2 Update 1, which had then been recently released.  Perhaps it had some bug fixes we needed.  Easy enough, we thought.  Just install the update over the existing version and see what happens.  The results were horrible.  Even more inconsistencies came up, even on machines that had been reporting as expected.  The answer from IBM: uninstall and reinstall everything again.  Okay.  We did that and things got somewhat better, but still not 100% accurate.  We tried bouncing OpenIPMI and the Director agents while at the same time removing the discovered items from the Director console.  Eventually, we were able to get most machines to report correctly and started building a knowledge base on how to get things working.  In time, Update 2 arrived and we tried again.  Still no luck on a consistent method for getting all of the hardware status values.  When 5.20.3 finally arrived, we tried again, and now we are waiting for 6.0 to hopefully get this under control, but I have my doubts.

What about VMware ESX 3.0.2 and/or 3.5?

One would assume that VMware would behave the same as Redhat 4.0 had done.  That would be true for 3.0.2, but not for the later updates to 3.5.  As administrators know, the latest updates to 3.5 have added a feature called Health Status which uses the OpenIPMI software to communicate this information.

[Image: ESX Health Status]

Our question was whether or not this does or could interfere with the IBM Director level-2 agent and its use of the OpenIPMI software.  We could not find an answer to that question, so we continued with our trial-and-error approach.  What we came up with is less than optimal, but it has consistently worked on every ESX 3.5 server we have configured and installed.

First, just install the 5.20.3 agent as instructed by the IBM documentation.  Once that is installed, copy over the public key for the server to the /opt/ibm/director/data directory.  Remove any physical platforms or other objects for the server that might already exist in the Director console and reboot the server.  Once the server is back up and running, run a discovery.  Do not attempt to interrogate that server until the level-2 agent and physical platforms have been fully discovered.  This can be determined when the green question mark disappears from the object icon on the console.  Right-click the level-2 agent and bring up the Hardware Status window.  Now the fun begins!!

If the Hardware Status is empty or incomplete, take the following steps:

  1. Stop and then reinstall the agent.  Do not attempt an uninstall.
  2. Remove the agent and physical platform from the Director console.
  3. Reboot the ESX server.
  4. Once discovery is complete, recheck the Hardware Status results.
  5. Repeat until the Hardware Status results are correct (may take 1-4 tries, but always comes out in the end).
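
The steps above are a bounded retry loop. The sketch below captures only that control flow. Every callable passed in is hypothetical and would have to be supplied by whoever automates this, since none of them corresponds to a real IBM Director or VMware API. The limit of four tries comes from our own experience noted in step 5.

```python
def repair_hardware_status(reinstall_agent, remove_from_console, reboot,
                           wait_for_discovery, status_is_complete,
                           max_tries=4):
    """Repeat the reinstall/remove/reboot cycle until Hardware Status is complete.

    Every argument except max_tries is a caller-supplied callable; none
    of them corresponds to a real IBM Director or VMware API.
    Returns the number of cycles run (0 if the status was already fine).
    """
    if status_is_complete():
        return 0
    for attempt in range(1, max_tries + 1):
        reinstall_agent()        # step 1: stop and reinstall, no uninstall
        remove_from_console()    # step 2: drop agent and physical platform
        reboot()                 # step 3: reboot the ESX server
        wait_for_discovery()     # step 4: wait until discovery finishes
        if status_is_complete():
            return attempt
    raise RuntimeError(f"Hardware Status still incomplete after {max_tries} tries")


# Stubbed run: the status check fails twice, then succeeds.
checks = iter([False, False, True])
noop = lambda: None
print(repair_hardware_status(noop, noop, noop, noop, lambda: next(checks)))
```

Raising an error after the last try, instead of looping forever, keeps a stuck server from tying up the process indefinitely.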

Conclusion

At this point, we appear to have a consistent, working environment.  But given our results, we just don’t know if it will stay that way and for how long.  Will a reboot lose information?  When we have an issue arise, will it perform correctly?  Will the next version of the software work and if so, will it work with my current platforms or will everything need to be upgraded?  Do I have to purchase RSA cards to get what I want?

As you can see, we have a lot of questions and I have not even begun to talk about the real features, functions and capabilities within IBM Director.  This is the simple baseline stuff upon which to build a real management environment.  But if the foundation is built in quicksand, then it will crumble and fall and at a time when it is most needed.  IBM has been asking me to discuss building a house of Tivoli on top of this stuff and I simply cannot take that step given the current situation.  I have repeatedly told them that I cannot and will not move forward until I have a base built on bedrock.  However, I am not sure this will ever get me there.  Perhaps we need to investigate OpenView or Unicenter.  Time will tell, but for now IBM Director is off to a really bad start in our environment.

Article by Steve Van Domelen

2 Pingbacks/Trackbacks

  • Rob

    I just came across your article. You have accurately articulated a lot of our frustrations with IBM Director. We started with version 4.3 a long time ago. Migrations and upgrades have been a disaster, but in the end we normally get a stable environment that does provide effective monitoring.

    I was involved extensively with the 6.1 beta. All I can tell you is that you will still have the same upgrade/migration issues that go along with every release. I’m sure some new problems will come up. We got so frustrated with 6.1 that we decided to wait for a while until it stabilizes.

    I had a question for you though…we had been running our Director server on Windows 2003. We seem to have a reasonable rate of discovery of the physical platforms although there were issues, as you’ve outlined.

    We moved our Director server to CentOS 5.2 (redhat 5.2) and now there are no physical platforms for any server, and all hardware status features are broken. Do you have any idea based on your redhat experience what the issue may be?

    Rob

  • steve

    Rob – seems we share similar experiences with the exception that you normally got a stable environment which I seem unable to do.

    Your work on 6.1 is very interesting as we are in the process of upgrading to that version on Windows 2008 server. We tried using Redhat Linux 5, but it will not connect to a Microsoft SQL Server – only Oracle or Apache Derby. Care to share more on your frustrations with 6.1? I have a TWiki site and perhaps we could collaborate together and with others? Let me know your interest.

    I am surprised by the lack of physical platforms you are having with CentOS 5.2. I assume you modified /etc/redhat-release to get this installed since IBM will not support CentOS. However, the physical platform is discovered by communication with the level-2 agent. That agent talks to the IPMI interface on the operating system (drivers in Windows boxes and OpenIPMI on Linux). On Linux, you will see information in the /var/log/messages file when the agent is started.

  • Rob

    Stable environment means that IBM Director runs at least a week without the service dying. Not exactly my “ideal” but it does seem to work. We get good monitoring from it so I’m not going to complain since it comes for free.

    On 6.1 I posted to the forums about a few of my experiences (see http://www.ibm.com/developerworks/forums/thread.jspa?threadID=248043). I found out there is no migration tool available yet. Plus, I’ve tested it recently on CentOS and I can’t even get inventory collected on the Director Server. I sent a bunch of logs off to the development team…nothing. I’m definitely waiting a while on this one. I’d be happy to comment on a site although I really think 6.1 is not ready for prime time for those seriously using Director (there are a lot who play with it!).

    The physical platform issue has me completely baffled. I verified that OpenIPMI was installed and checked the installation logs…no errors. I also see ipmi starting up without issue too. The really odd thing is that there are no physical platforms listed at all. It is just like a problem they had under SUSE which was fixed by running a cimsubscribe routine. There is no such application in 5.20.3. I’ve been doing much of this work on my own. I think it is time to get the support contract in place for our Director Server and get someone on the support team to take a look at it. Honestly, with CentOS we’ve found much more stability of the Director Server and performance is MUCH better, so we want to stick with it.

    Rob

  • Ed in NYC

    So who owns this product within IBM? Someone must be the product manager who cares that this product is frustrating the crap out of a lot of system engineers/managers whose job it is to keep the various IBM servers up and running on a daily basis.

    Just annoying – my 6.1 just broke last night and now I’m thinking about re-installing the whole mess once again…

  • Ken Mukai

    I have found Nagios extremely effective and easy to setup and maintain. I am using a IBM x345 with a single CPU and 4 GB of memory to monitor over 2500 objects and 450+ hosts. Also running Cacti on this same system. We are using IBM Director to run a few specific EAP’s.

  • In IE6 the site's layout drifts a bit. Check the markup.

  • Pingback: IBM Director 6.1 — still no joy | Just A Word (or two) From Steve()

  • jivetolkein

    Nagios is an excellent service monitor, but for hardware management you’d be reduced to rolling your own queries to SNMP/WMI by proxy etc. I’ve run both, but it’s fish and fowl IMHO.

    All I’d like Director to do is reliably and consistently check the hardware status and give PFAs when available. In all its incarnations it’s been really bad at this – my experience mirrors virtually all the above. If I build a server, using ServerGuide and my own bits and pieces we get OK results – it works, but there are still foibles that frustrate and annoy (hardware status – show me the status even if there isn’t a problem!!). But I’m trying to roll in servers of indeterminate build all across EMEA, the only thing in common is an IBM badge, and I’ve little to no faith it’s accurately checking them :-/

  • Ben

    What drivers did IBM exactly state needed to be installed on a client machine, in order for it to correctly report hardware errors?

    • steve

      Ben,
      There are a lot of drivers to obtain from IBM depending on the OS and the box. For Redhat Linux 3 prior to Update 9, you will have to get the IPMI drivers and midlayer code from the IBM website for your server model. If you are running RHEL 4 or 5, then it will use OpenIPMI.

      If you are using Windows 2003, then there are drivers for IPMI and midlayer as well. You must install the drivers first according to IBM technical support.

      In all cases, you must also be running the Broadcom drivers from IBM support website ONLY. They have made modifications for support of the BMC which shares the NIC. It makes a big difference.

      • Ben

        Thank you for the reply, Steve,
        Now when you say “midlayer”, which driver are you referring to? The IBM website does not seem to use the term “midlayer” for any of their drivers. And for me, this pertains to Windows 2003.

        Thanks,

        • steve

          Sorry for the confusion. I rechecked the IBM website and they are called the mapping layer and not the midlayer. As an example, suppose you have an IBM x336 server. You would go to the IBM website and in the download area select “OSA IPMI” in the “Refine results” drop-down box and click “Go”. Here you will see IBM Mapping Layer for OSA for W2K3 as well as the OSA IPMI device drivers. Install the device drivers first and the mapping layer software second.

          Hope that helps.

  • Bob

    Jivetolkein you don’t speak jive you speak the truth.

    Our company has been dealing with director for over 7 years. Not fully understanding the complexity of the install was our first mistake. How the subsequent upgrades were conducted without making sure all those running processes were not killed was our second, and then having the RSA cards installed on newer servers without configuring them was our third.

    For us the agent services that run never fail, they just don’t work. I can’t say they don’t work most of the time because I don’t know, and neither does the agent. The number 1 hardware alert that we would like to see is RAID failures. If you use ADAPTEC SCSI drivers you need to install the Director Extension software; if you have LSI you don’t. We found this out late in the game too. Oh, by the way, we have about 1200 IBM servers.

    I also should mention the memory leaks from running IBM processes. IBM always indicates that the problem is fixed in the next release. Kind of a shell game, isn’t it?

    To me the bottom line is that I wouldn’t mind so much dealing with the install/deployment issues if at the end of the day each deployment would, on a regular basis, send out a message that each monitored hardware component is working. Since I can no longer rely on faith we will probably be going in a different direction.

  • Pingback: Anonymous()

  • Director is complete crap. The agent comes with its own outdated JRE. Then when you look at the various shell scripts, you almost die of laughter; it’s so bad, it looks like a DOS .BAT programmer from 1981 was just defrosted. I think it’s the whole division at IBM responsible for management software that’s completely incompetent. Sometimes they almost make the megafailures at EMC look good (not that much though).

    They consistently deliver broken software. Their automagic update system, “UpdateXpress”, is a rotten Rube Goldberg contraption. It cannot run without an executable /tmp. It cannot run as a non-privileged user even if you’re just trying to download files. It wants you to install Firefox and all dependencies on your server. It contains its own outdated JRE, evidently. It writes temporary files in /var/log/IBM_something. Frankly, the only humane thing to do at this point is to take all the people responsible for these software products, line them against a wall and fire.

    • Steve

      Boy, are you having a bad day 🙂 While I feel your pain and misery, I must admit that your comment made me laugh out loud. You are echoing the exact feelings my senior architect and I have been saying for the past few years.
