IBM Systems Director 6.1 — still no joy

April 19, 2009 · 26 comments

I previously wrote about the problems and issues with version 5.20.x and that we were looking forward to seeing what 6.1 was going to provide.  Well, we downloaded version 6.1 and it is a significantly different product; even the name has changed, to IBM Systems Director.  For starters, it no longer uses a thick client for administration, but uses your web browser.  The whole look-and-feel has been drastically altered and I am not sure this is a good thing.  Just going to the web page is difficult enough, as you have to remember a URL that looks like http://director:8421/ibm/console.  If you use Active Directory (AD) groups for login, it will work, but if you enter a valid AD username/password that is not a valid user of IBM Director, then it will crash the server.  No word from IBM on why this happens, but one of the Java executables runs out of memory as a result, and a server restart is required to recover.

The initial “Welcome” page, if it comes up, shows you a selection of plug-ins and their “readiness”.  There is also a link for 5.20.x users in the upper right corner.  That link gives some generic answers and information, but not the level of technical assistance that an experienced administrator can really use.  Most of it is obvious as you play with and try out the new user interface.  What is not obvious is where to find things like the “Hardware Status” that used to show you what sensors were available from a level-2 agent (now called a Common Agent) talking through the IPMI interface in the operating system to the BMC in the IBM hardware.

[Screenshot: Welcome page]

As I said, there are inexplicable times when the Welcome page will not show up.  We have been unable to determine when or why this will occur, but it is far too frequent to ignore.  Instead of the Welcome page, users will get the infamous ATKCOR037E error page.  The error implies that you do not have permission to see the Welcome page; it generally goes away if you reboot the server.

[Screenshot: ATKCOR037E error page]

Installation

We tried to install the Linux version of the product and it worked well enough.  However, we quickly learned that the Linux version can only talk to the Apache Derby, DB2 or Oracle database engines and not Microsoft SQL Server.  Evidently, this is due to missing administrative interfaces and command-line tools for that engine.  So, we decided to install this on Windows 2008 Server.  To be honest, I don’t know why they offer a Basic installation option as that is pretty useless for most users.  You will need to select custom if you are using any database other than Derby and it seems a difficult path to change later, although they say it can be done.  You are required to set some passwords for the Agent Manager Configuration.  Unfortunately, they don’t tell you what or why there are these usernames and passwords required for the Agent Manager, but you better write them down as they are critical when creating common agent servers.

Turns out that the common agent service (CAS) is how discovery is going to happen with the new common agents, but they don’t tell you that up front.  When you deploy a new agent, you will have to modify the deployment code to take into account the passwords that were set during the server installation.  This is a big change from 5.20.x in how servers register with Director.  In previous versions, there was a DSA certificate generated by the server and you could push this file into the agent installations for authentication and registration with Director.  While 6.1 kept that compatibility with older agents, it is not used in the new common agents.  These agents need to register with the Tivoli service on Director.  I assume this is part of the integration direction between Director and Tivoli, but I am not sure.  Anyway, when you install the agent, there is a “response file” that is embedded in the deployment.  You must override this file with an edited version where you set the name of your agent manager (probably the director server) and the password you set for the agent manager.  For example:

AMHostname=director.mydomain.com
AMPassword=my_agent_manager_password

Then, you install the agent with a command line override to tell it to use your response file instead of the default one.  However, we have been unable to get the Windows version of this command line to run properly and we generally run the whole GUI version and type in the options.

Linux = dir6.1_commonagent_linux.sh -r diragent.rsp
Windows = IBMSystemsDirectorAgentSetup.exe installationtype rsp="diragent.rsp"
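The response file edit described above could be scripted rather than done by hand.  The following is a minimal sketch, not IBM's tooling: it assumes the response file is plain KEY=value text and that lines starting with # or ; are comments (both assumptions), and the function name apply_overrides is made up for illustration.  Only the AMHostname and AMPassword keys come from the example above.

```python
def apply_overrides(template_text, overrides):
    """Return response-file text with the given KEY=value pairs set.

    Keys already present in the template are rewritten in place;
    keys that are missing are appended at the end.
    """
    remaining = dict(overrides)
    out = []
    for line in template_text.splitlines():
        stripped = line.lstrip()
        # Assumption: '#' and ';' start comment lines in the response file.
        is_comment = stripped.startswith(("#", ";"))
        key = line.split("=", 1)[0].strip()
        if "=" in line and not is_comment and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append any keys the template did not contain.
    for key, value in remaining.items():
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical template content; the real embedded file will differ.
    template = "AMHostname=\nAMPassword=\n"
    print(apply_overrides(template, {
        "AMHostname": "director.mydomain.com",
        "AMPassword": "my_agent_manager_password",
    }), end="")
```

The output would be saved as diragent.rsp and passed to the installer with the Linux command line quoted above.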

The other nuisance is that they do not have the default ports pre-entered for the databases.  For example, during the installation you have to know that SQL Server uses port 1433, which I find inexcusable in any software install.  They know the defaults and should fill them in.  It is probably very rare for any enterprise to change these port values.

Agents and Inventory

If you remember in 5.20.x, the UI would show all of the agents and if a platform was part of a server that included a level-2 agent, then the platform would show under the agent as a subitem.  That was changed in 6.1 and adds to the confusion that this product is rife with.  All agents show as individual items and there is no correlation between agents and platforms.  You have to drill around a bit to get the information about these relationships.  It can be done, but it is not obvious or presented in an easy way for administrators.  To get this info, you will have to click on the Discovery Manager and get a chart of your servers as shown below.

[Screenshot: Discovery Manager]

From here, you can click on a number of options, but the interesting one for us is “Full access systems” at the bottom of the second section, called “Access and Authentication”.  This will present you with a list of the systems which have been discovered and have registered with the CAS.

[Screenshot: Full access systems]

Clicking on the first item brings you to another tab (more on tabs later) which shows you some information about the server.  From here, clicking on the “System” folder opens it to show you “Operating System” and “Server”.  Finally, clicking on “Server” shows you that information in the details to the right.  Here we can see the platform under the common agent service.  However, by this time, you have collected a number of tabs in the UI and probably forgotten why you drilled all the way down here and what to do next.

[Screenshot: Full access system properties]

Discovery

This part of the application is even worse than 5.20.x ever was.  Even after installing the agents with the customized command lines, we have found that it only works “most” of the time.  We have seen agents running on identical hardware and operating systems react differently.  Sometimes it installs as expected, which is “Full Access Systems” as I talked about above.  Other times, you will see them show up in the “Discovery Manager” as either “No agent systems” or “No access systems”.  It is impossible to get any real details from the installation log files, even if you set the “DebugInstall=1” in the response file.  If you attempt a reinstallation or two, most of the time it will finally register correctly.  We have found on some Linux machines that the install will fail when trying to install ISDCommonAgent-6.1-1 for unknown reasons since it can be extracted and installed by hand.

To uninstall the agent, there will be a /opt/ibm/director/bin/diruninstall (Linux) available, but note that it will not even come close to actually uninstalling all of Director.  It will leave /opt/ibm/*, /etc/ibm/* and /var/log/dirinst.log, and it will leave the TIVguid-1.3.0-0 rpm installed.  You should remove all of this to complete a full uninstall.  Then attempt a new installation to see if it can register and be discovered correctly by the Director Server through the CAS.
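A quick way to see what diruninstall left behind is to check for the leftovers listed above before reinstalling.  This is only a sketch: the paths and the TIVguid package name come from the paragraph above, while the function names are made up for illustration.  It reports and does not delete anything, since /opt/ibm may also hold other IBM software.

```python
import os
import subprocess

# Leftovers named in the article, relative to the filesystem root.
LEFTOVER_PATHS = ("opt/ibm", "etc/ibm", "var/log/dirinst.log")
LEFTOVER_RPM = "TIVguid"


def leftover_paths(root="/", exists=os.path.exists):
    """Return the leftover paths that still exist under root."""
    candidates = [os.path.join(root, rel) for rel in LEFTOVER_PATHS]
    return [path for path in candidates if exists(path)]


def rpm_installed(name=LEFTOVER_RPM):
    """Return True if 'rpm -q NAME' reports the package as installed."""
    try:
        result = subprocess.run(
            ["rpm", "-q", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # no rpm binary on this machine
    return result.returncode == 0


if __name__ == "__main__":
    for path in leftover_paths():
        print("still present:", path)
    if rpm_installed():
        print("still installed:", LEFTOVER_RPM)
```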

IBM Technical Support

In Linux, we have seen a number of issues that have generated a number of PMRs with IBM technical support.  Yes, we purchased software support from IBM in order to help us get through the mess of problems and issues that plague the product.  All of these issues are still open as we try to get this product to work in our environment.  Opening a PMR with IBM is not trivial.  First, you have to call their 800 number and wait for at least 1-2 hours before talking with someone to enter the PMR.  What if you have more than one issue to report?  Well, that is gonna suck, because they will only take one issue per phone call.  Yeah — you heard me right, they will not accept multiple issues per call no matter how much you scream and argue.  So, entering four tickets into their system took a full business day of waiting on hold.

What do you get after that?  Well, the usual junk about installing correctly, checking the BIOS, BMC, firmware, network cards, etc.  When we tell them we have identical machines with the same levels of firmware, operating system and Director software, they just keep on going as if we don’t know what we are talking about.  It is frustrating and difficult, but we need to get this level of administration on our servers that Director is supposed to provide.

Issues We Have Encountered

Most of our Linux machines are running RedHat 4 Update 6 and as such have the OpenIPMI interface already integrated.  IBM Director common agent uses this to communicate with the BMC and report back to the Director server.  However, we have seen the kipmi0 daemon consume 100% of the CPU and take control of our systems.  The only way that we have been able to correct the problem is to use the MPCLI command-line interface on the Director server to execute a “restartmp” command.  That restarts the BMC and releases whatever was holding up the daemon.  Unfortunately, this can occur at any time and our servers are running batch jobs 7×24 and cannot tolerate this behavior.  We have built cron jobs to do this automatically every hour or two just to make sure these systems run as cleanly as possible.  IBM has suggested that this is a Broadcom network driver issue and we are investigating.

The next hurdle we encountered is that the cimserver daemon goes to 100% CPU utilization.  Again, this happens randomly and so far IBM has been unable to tell us why this occurs.  The obvious fix is to kill and restart the cimserver daemon and we have built jobs to do this action as well.  However, this is not an acceptable way to proceed long term.
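The detection half of such a watchdog job might look like the sketch below.  It is not the job we actually run; it assumes the output format of "ps -eo pid,pcpu,comm" on Linux, and the 90% threshold and function names are arbitrary choices for illustration.  Note that the pcpu column reported by ps is averaged over the life of the process, so a production job may prefer sampling over a short interval.  What to do with a hit (restart the daemon, or issue restartmp for kipmi0) is left out.

```python
import subprocess


def runaway_pids(ps_output, names=("cimserver", "kipmi0"), threshold=90.0):
    """Return (pid, command, cpu) for watched commands at or above threshold.

    ps_output is the text produced by: ps -eo pid,pcpu,comm
    """
    hits = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, pcpu, comm = parts
        try:
            usage = float(pcpu)
        except ValueError:
            continue  # header line ("PID %CPU COMMAND")
        comm = comm.strip()
        if comm in names and usage >= threshold:
            hits.append((int(pid), comm, usage))
    return hits


def current_ps_output():
    """Capture the process table in the format runaway_pids expects."""
    return subprocess.check_output(["ps", "-eo", "pid,pcpu,comm"], text=True)


if __name__ == "__main__":
    for pid, comm, usage in runaway_pids(current_ps_output()):
        print(f"{comm} (pid {pid}) is at {usage}% CPU")
```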

We also run a lot of VMware ESX 3.5 servers in our environment and IBM says that their software will run on ESX.  While it installs here as well as on any Linux (and that is not saying much), the first thing we noticed is that a large number of entries are written to the /var/log/messages file and it grows rapidly.  It is not clear whether these entries, which come from a service called cimprovagt, indicate errors, warnings, or are just informational.

Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 140
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 183
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 184
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 100
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 100
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 99
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 99
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 98
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 98
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 141
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 142
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 142
Feb 19 12:34:45 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 181
Feb 19 12:34:46 blade00-10 cimprovagt: Attempting to read the state of a discrete sensor with no reading: 168

At first we thought that this was related to ESX 3.5 since it uses OpenIPMI to show the “Health Status” in the Virtual Infrastructure Client under the Configuration tab.  We tried to open a ticket with VMware, but as soon as they heard we were using LS-21 blades in an IBM BladeCenter they rejected the ticket and told us to talk with IBM.  That led us into a finger-pointing match, as IBM immediately said that cimprovagt was part of ESX and that we had to lodge the ticket with VMware.  Eventually, we saw a report from VMware regarding this type of issue that was not directly linked to IBM, but looked very similar.  So, we decided to upgrade all of our ESX servers from 3.5 Update 2 to 3.5 Update 4.  This did in fact change the output from within ESX, which we are now pursuing with IBM.  It turns out the possible culprit is a misidentified RAID controller.  I say misidentified because we do not have any RAID controllers in our BladeCenter: we use iSCSI SAN disks for our guests and the base OS is loaded on a single local SAS hard disk.  The new output on all of our blades shows the following:

Apr 19 01:02:00 blade00-4 HostRaidInd[2343]: Cannot parse software release date for controller 0
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 179
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 178
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 177
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 176
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 186
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 154
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 211
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 210
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a threshold sensor with no reading: 155
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 181
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 140
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 183
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 184
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 100
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 100
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 99
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 99
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 98
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 98
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 141
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 142
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 142
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 181
Apr 19 01:03:04 blade00-4 HostRaidInd[2343]: Attempting to read the state of a discrete sensor with no reading: 168
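To get a feel for how much of /var/log/messages this noise accounts for, the entries can be tallied by host, process and sensor number.  The sketch below assumes the syslog layout seen in the two excerpts above; the regular expressions and the function name summarize are illustrative and may need adjusting for other message formats.

```python
import re
from collections import Counter

# "Mon DD HH:MM:SS host process[pid]: message" as seen in the excerpts.
LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+([^\s:\[]+)(?:\[\d+\])?:\s+(.*)$"
)
SENSOR = re.compile(r"state of a (\w+) sensor with no reading: (\d+)$")


def summarize(lines):
    """Count log lines per (host, process) and per (sensor kind, number)."""
    per_source = Counter()
    per_sensor = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        host, process, message = match.groups()
        per_source[(host, process)] += 1
        sensor = SENSOR.search(message)
        if sensor:
            per_sensor[(sensor.group(1), int(sensor.group(2)))] += 1
    return per_source, per_sensor


if __name__ == "__main__":
    with open("/var/log/messages") as log:
        sources, sensors = summarize(log)
    for key, count in sources.most_common(5):
        print(count, key)
    for key, count in sensors.most_common(10):
        print(count, key)
```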

 

Next up is power off and shutdown of Linux boxes when the IBM Systems Director Agent is installed.  After the installation, I attempted to shut down one of the Linux servers.  I went to the console and entered the “shutdown” command but nothing happened.  A bit perplexed, I then issued the “poweroff” command, but still nothing happened.  There was no command I could issue that would reboot or shut down the system.  We have asked IBM about this and logged entries on their developer forums.  So far we have not received any confirmation or resolution.  The only answer we could come up with was to use IBM Director itself.  Using the power options from the common agent, we were able to restart and power down the server.

SNMP Discovery

It appears that SNMP discovery has significantly changed in 6.1 of IBM Systems Director. In previous versions, such as 5.20.x, the SNMP community could be configured and a seed device such as a router could be specified. Depending on the LAN/WAN configuration, that was enough to discover most, if not all of the network assuming that Director did not crash during the process (it almost always did).

In the older versions, the SNMP discovery would query the ipNetToMediaNetAddress entries (the ipNetToMediaTable, essentially the ARP cache) of each node it encountered. This was a great way to discover a network while supplying only a few seed systems. It was not necessary to scan entire IP address ranges, since those entries listed the exact IP addresses to try.
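The idea behind that seed-based approach can be sketched as a breadth-first walk.  This is an illustration of the general technique, not Director's actual implementation: the SNMP query is abstracted into a caller-supplied function (arp_entries_of, a hypothetical name) that returns the IP addresses a node reports, and the addresses in the example are made up.

```python
from collections import deque


def seed_discovery(seeds, arp_entries_of):
    """Walk outward from the seed devices, following the IP addresses
    each discovered node reports, and return every address found."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for ip in arp_entries_of(node):
            if ip not in seen:
                seen.add(ip)
                queue.append(ip)
    return seen


if __name__ == "__main__":
    # Made-up data standing in for SNMP responses from each device.
    tables = {
        "10.0.0.1": ["10.0.0.2", "10.0.1.1"],
        "10.0.1.1": ["10.0.1.7"],
    }
    found = seed_discovery(["10.0.0.1"], lambda ip: tables.get(ip, []))
    print(sorted(found))
```

Only addresses that some device has actually seen are ever contacted, which is why a handful of seeds was enough on a sparsely populated network.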

However, in 6.1 it looks like this does not occur. When you create an entry in the “Advanced System Discovery” for SNMP, the community strings apply only to the devices discovered in the IP range or single IP address given in that profile.  We use sparsely populated VLANs and subnets, which makes this a difficult and very time-consuming approach.  Having 15 subnets with a /24 netmask means that I will be probing almost 4000 IP addresses, with less than 30% of those in use.  This is a tremendous waste of time and resources that was not required with the old mechanism of discovery.
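The arithmetic behind that estimate is easy to check with Python's ipaddress module.  The subnet addresses below are placeholders, not our real ranges; only the count (15) and the /24 prefix come from the paragraph above.

```python
import ipaddress


def usable_hosts(cidrs):
    """Total number of usable host addresses across the given subnets."""
    return sum(
        sum(1 for _ in ipaddress.ip_network(cidr).hosts()) for cidr in cidrs
    )


# Placeholder ranges: fifteen /24 subnets.
subnets = [f"10.0.{i}.0/24" for i in range(15)]
total = usable_hosts(subnets)

# With less than 30% of addresses in use, this is the upper bound on
# probes that can reach a live device.
in_use_ceiling = total * 30 // 100

print("addresses to probe:", total)
print("at most in use:", in_use_ceiling)
print("at least wasted probes:", total - in_use_ceiling)
```

Each /24 has 254 usable host addresses, so the range scan covers 3810 addresses in total.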

What does seem to be the same is that any device that sends Director an SNMP trap is automatically added to the list of discovered entities. However, it attempts communication with the default “public”/“private” SNMP communities, which in most cases will fail because no administrator will use these, for security reasons.  It should be possible to set a default, but I cannot find such a setting anywhere, nor can I locate any documentation about it.

Performance Problems with BMC?

From our research, it seems that the BMC is not responding quickly enough to requests from the common agent and/or other users of the interface.  Quite frequently we see messages in /var/log/messages that say:

        Feb 25 13:05:47 gd-ds12-01 kernel: ioctl32(java:12813): Unknown cmd fd(201) cmd(00008938){00} arg(dd45bfe8) on socket:[62763481]
        Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(205) cmd(00008938){00} arg(de948fe8) on socket:[62763640]
        Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(206) cmd(00008938){00} arg(de948fe8) on socket:[62763689]
        Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(206) cmd(00008938){00} arg(de948fe8) on socket:[62763740]
        Feb 25 13:05:48 gd-ds12-01 kernel: ioctl32(java:12791): Unknown cmd fd(207) cmd(00008938){00} arg(de948fe8) on socket:[62763806]

which seems to indicate that it is probing for a floppy disk or other device which is not present.  I would assume that one failure would be enough and it is unnecessary to continually output this information.  We also see the following items on the system console:

        IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 0 cmd 0
        IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 0 cmd 0
        IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 5 cmd 2d

which looks like there is a miscommunication between the OpenIPMI interface and the BMC.  I can only assume that this communication error is a lack of responsiveness on the part of the BMC, but I am not positive.
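When handing these console messages to support, it helps to know which mismatches occur and how often.  The sketch below tallies them; it assumes the netfn and cmd values are printed in hexadecimal (values such as b and 2d suggest so), and the function name tally_mismatches is made up for illustration.

```python
import re
from collections import Counter

MISMATCH = re.compile(
    r"expected netfn ([0-9a-f]+) cmd ([0-9a-f]+), "
    r"got netfn ([0-9a-f]+) cmd ([0-9a-f]+)"
)


def tally_mismatches(lines):
    """Count (expected netfn, expected cmd, got netfn, got cmd) tuples."""
    counts = Counter()
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            # Assumption: the kernel prints these fields in hex.
            counts[tuple(int(value, 16) for value in match.groups())] += 1
    return counts


if __name__ == "__main__":
    console = [
        "IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 0 cmd 0",
        "IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 0 cmd 0",
        "IPMI message handler: BMC returned incorrect response, expected netfn b cmd 40, got netfn 5 cmd 2d",
    ]
    for key, count in tally_mismatches(console).items():
        print(count, [hex(value) for value in key])
```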

Overall

The need for something like IBM Director is very real and grows with virtualization and consolidation within the data center.  We all have fewer resources to maintain and manage these systems, which are growing in complexity.  We have to worry more about environmental issues such as heat dissipation and power consumption while maintaining availability and redundancy of service.  A product like IBM Systems Director is a great fit, but it has to work and work well.

Our experiences have been echoed by many others, as is evident in the IBM Director forums on their sites as well as in other blogs and articles.  It seems that IBM does not put forth the effort to really test this product to make sure of its quality and I am puzzled about that.  Yes, I know it is a free tool to everyone who purchases an IBM server, but that is also the value-add that it is supposed to bring.  Even if you purchase a service contract as we did and talk with IBM about integration with Tivoli, there is no mention about making sure the IBM Systems Director foundation is solid.  There is no way a rational data center manager can justify spending the large amounts of time currently required to try to bring IBM Systems Director on line and keep it operational.  As far as we can tell, this is simply not even possible.  If there were legitimate alternatives to this product then we would be pursuing them instead of this one.  But even with a low-level replacement, what would the overall systems management direction be?  If it is Tivoli, then not having IBM Systems Director below it seems counter-productive as we don’t gain from the integration.  I don’t know what other tools out there are using a common agent approach that integrates with IPMI and handles various platforms, so we will continue our pursuit of this system and update this article as we progress.

Article by Steve Van Domelen

  • jivetolKEIN

    I’m afraid all I can offer/add is – I feel your pain.

    IBM are our global partner, and we get world class pricing from them. But I’d still buy HP if I had the choice simply down to Director.

    Insight Manager – PSP the server, reboot, discover. Job Done.
    Director – err, well, if you find me a single instruction set that works in all circumstances, I’d be mighty glad to hear it.

  • Nik Conwell

    Steve – Excellent post. I’m happy(?) to see that we are not the only one having similar issues with 6.1. (We’re using CentOS 4.7).

    IBM isn’t too helpful since it’s not really RHEL. The forums are OK but they don’t get enough activity to be really useful.

    We’re seeing the looping cimserverd as well. Starts looping at about 2am on random days.

    Our main problem is we’re not able to see anything related to the BIOS, which was the compelling reason to go with Director in the first place. We want to be able to see/manage BIOS and other hardware updates across a whole bunch of boxes without having to run around with CDs.

    • steve

      Nik,
      We have submitted a large number of PMR’s against this product and one of my guys is spending a lot of time with IBM on this.

      One thing I can tell you is that they have a hotfix for the runaway cimserverd problem. I will see if I can post it here.

      • Nik Conwell

        Thanks, I’m definitely interested. We’ve engineered an automatic restart work-around, but it would be much better if cimserverd would not just get into a cpu sucking loop.

        • steve

          Nik,
          I have contacted IBM and they have requested that I do not make the patch available. I know I have a lot of readers from IBM and I am reluctant to damage the relationship even though it could benefit their customers and my readers.

          On the bright side, we have determined that the patch in fact does not work. While it looked promising, it failed after about a week or so. Contacting IBM once again, they have issued some new patches that we are going to try. If they work, I will again request permission to make them available on this site.

  • Ed

    Same here – thanks for the detailed article on 6.1 – having come from HP Insight Mgr, and stumbled along with earlier versions of IBM Director – I feel everyone’s pain.

    Anyone have a creative solution for dumping the event logs within the RSA and then clearing them via IBM Director task?

    Thanks

  • Jason

    Have you considered other management packages for IBM? Particularly for a linux environment I’ve been fond of IBM’s xCAT (http://xcat.sourceforge.net/). It’s what they use on most of their Top500 systems. It has a bit of a learning curve, but once used to it, it generally works more like I would expect a *nix-oriented tool to work (relatively low footprint, strong CLI, access to tinker with source, etc). For some reason, IBM doesn’t talk about it much in their marketing, but it is a pretty good tool that has satisfied me after exasperation with Director.

    You mention virtualization; their site talks some about it, so that may address things further. You might want to contact their mailing list and try it out as an alternative and write an article about your experience.

  • jeremy

    I have been attempting to use Director since its 3.x days and it has never worked well.

    I have a neat problem though. When I do a level 0 scan on a blade server that has been discovered over DCOM, the blade goes into Comm Err status in the chassis. The OS stays up, but in order to get access to the blade through the chassis I have to re-seat the blade.

    I think I use it now just for the unbelievable IT comedy it provides.

    • steve

      Jeremy,
      You are right about the IT comedy with this product. We have so many PMR’s submitted to IBM, I don’t know if it will ever be a viable tool. At this point, it seems I want it to work more than IBM does. From the discussions with support, I will not be surprised if they pull support for all of the BMC NICs in the older IBM products.

      They have also informed us that they don’t support VMware 3.5 Update 4 which we have installed. I am sure this has to do with its integration with IPMI and the two products stepping on each other.

      The laughter continues….

  • jivetolKEIN

    I’m seriously considering rolling all the IBMs into Insight, and compiling MIBs there instead.. there are a few docs littered around the net which make it look viable if you just want hardware status and monitoring (which I do).

    If indeed there are IBM employees reading this thread, please please please just make it work.. Dell can do it, HP can do it…

  • RaNma

    I’m a 5.20 user with 140 AIX systems that were hard to authenticate or discover in the old version. I installed a 6.1 server on a brand new AIX system (a dedicated p520 with 2 Power6 CPUs and 4 GB of RAM): SD61 ate all my memory. Having fewer than 10 LPARs in a lab is quite fun, but managing a whole prod infrastructure with 240+ objects in the Apache Derby DB is quite a pain, and one day the DB got corrupted when the server killed SD61 for running out of memory during an inventory of the systems. I decided to reinstall it (6.1.2.1) on a SLES10.3/ESX3.5 system with 2 CPUs and 6 GB of RAM. Again, after discovery the whole SD61 became slow and unresponsive and messed with authentication. It needs ssh with PermitRootLogin set to yes, which is inconceivable in a production system. If you successfully authenticate a host with this setting and then revert root login to “no”, it returns “unknown” for status. Sometimes it overloads my HMC and I have to reboot it (a CR-4 with 7.3.4SP2 crying out: “too many open files”).
    Need I say that discovering an HACMP cluster is just a joke?

    Well… I feel like I have been a beta-tester for IBM for more than a year. I asked for prices on AEM, VMControl and TPM, which may be useful in my environment, but I think that I’m losing time with this tool when I was trying to gain it on management tasks instead! Maybe I’ll pay a junior sysadmin to maintain our old dsh/scp scripts and the PKI db that were used until now… less sexy than a webapp but without that crappy WebSphere Light and all that Java processing.
    (PS: Sun xVMopsCenter is the same kind of shit… we bailed it out for a while).

    • Fred

      Any Updates on this product? I have seen a 6.1.2 release recently does it fix any of the above issues? I have a fairly small staff so we are looking for any insight before we start our testing or go with another product… Thanks in advance

      • Fred,
        I must admit that we have exhausted our ability to continue to work with this product. Try as we might, it just does not work reliably and I have lost all confidence.

        On the bright side, we have been playing with a couple of Open Source products. One is GWOS (http://www.groundworkopensource.com) and the other is Zenoss (http://www.zenoss.com). The first one is a GUI wrapper around Nagios (http://www.nagios.org) and the latter appears to be a combination of tools like Cacti, Zope, Twisted and a few others.

        So far, I am leaning heavily toward Zenoss. I like the UI and it is easier to configure and use. On the other hand, Nagios/GWOS can do a lot of things, but it is very difficult to learn and work with.

        I am only using the Open Source version of Zenoss, but they do have an enterprise version which can get expensive, but might be worth it. I will write up my findings in a future entry.

        • uguessedit

          This thread sounds very familiar; we too have had major issues moving from 5.2.x to 6.1.x, most of which have been covered well by Steve! Had the hardest time trying to install ver 6 on 2 separate hosts, 1 Windows 2008 server for the Director daemons, the other for the SQL database on a non-standard port (it's an Information Protection thing). The lack of the cert made it painful to gain access to all my hosts. The instability of the product makes it hard to justify using in our infrastructure. However, we have no money for monitoring so this is one direction. We are also looking into Zenoss & Nagios and I agree, the enterprise (supported) version of Zenoss gets quite expensive! We have a handful of critical processes being “watched” over by some home-grown PowerShell scripts set up as either Windows Services or simply running periodically from Windows Scheduler! (real 'enterprise', I know)

        • kevinncarpenter

          Thanks for the links to the other products. I just spent the better part of a month playing with 6.1.2.2. It had unpredictable horrible performance problems when running in a VM, but stabilized when we put it on a physical machine (quad core, 12gb, W2k8). Bumping the Java memory up helped a lot too.

          Still LOTS of problems. Although I can push agents to ESX 3.5 systems once I open up root, they apparently don't install. The only trace is a /etc/ibm/director directory – nothing at all in /opt.

          What really burns us is the static nature of the database. Rename a server and the only way to get Director to update itself is to manually remove the server record and rediscover it. AutoDiscovery worked reasonably well, but we would have to drop the database and recreate it every month to have something useful – and doing that loses all your configuration data.

          • Paul Cruiser

            Systems Director 6.2 is out now, but it seems like all IBM did was add new hardware support. My org is experiencing all of the same problems as everyone else here, even with 6.2. It's gotten to the point where my boss is ordering HP servers for everything that isn't a blade server. For blade servers, we purchased a new Cisco UCS system. I'm not sold on the “v1.0” Cisco UCS, but HP's Systems Insight Manager is a different story. SIM is proving just how bad Systems Director really is.

  • Sone IBMer

    I can tell you that even as an internal IBM user, Systems Director 6.1 is incomprehensible and often just plain unusable. We will soon be approaching our 1-year mark on this project. We have a Director 5.20 system that has been our regular management system, but as IBM has seen fit to abandon a somewhat functional version (5.20) in favor of an unusable Eclipse-based version (6.1), we're going to have to switch over sooner or later.

    It's the same case with Xcat. Version 1.3 was usable and functional. But as with everything else in the IBM stable, everything *HAS* to move to Eclipse. So now we have to try and make the Eclipse-based Xcat 2.x do the same things that 1.3 did with ease. On top of that there are major chunks of functionality from 1.3 that they STILL haven't implemented in 2.3, well beyond their abandonment of support for the 1.3 series.

    I don't even dare leave my name/email here.

    • dkvello

      If only they moved it to Eclipse 🙂
      I guess you mean they're moving everything to the WebSphere (WAS) runtime.

      I can't believe how they managed to go from a fairly comprehensive and manageable (patch, upgrade, daily USE!!!) version 5.2 to the “Everything-has-to-look-like-our-standard-WAS-apps-no-matter-what” 6.1 version.

      Going “WebUI” only certainly hasn't made the product better.

  • Jon

    Not sure why people still use the 1800 number to open PMRs. With the SSR tool, you can open and track everything in one spot, and yes, you can open multiple PMRs in a very short time. Not to mention, no VRU, no working with India, no getting routed to the wrong area.

    https://www-946.ibm.com/support/servicerequest/

  • Systems Admin

    I’m beginning to think that IBM has no place writing software of any kind– ever. Whatever their internal processes are, those processes must be completely misaligned with even the most basic of best practices for software engineering.

    We’ve been using 5.x for a few years and it has always proven to be non-intuitive, temperamental, and a bit kludgy. Additionally, the documentation (Redbooks) is bloated and unclear.

    We had hoped that 6.x would be a better product with better documentation and better support. What on Earth were we thinking? We’ve installed 6.1, 6.1.1, 6.1.2, 6.1.2.1, 6.1.2.2, and now finally 6.2. All of these new versions are badly designed, buggy, and the documentation is still bloated and vague. To top it all off, the tech support is terrible (it’s obvious that the phone technicians are level-one at best, and sometimes haven’t even used the most recent versions of the product).

    Perhaps the biggest success of software houses like IBM is that they’ve convinced the world IT community that it’s normal for software to have this many shortcomings, and it’s OK to expect we IT professionals to help them develop, troubleshoot, and debug their software in our production environments. I can’t believe how much money, time, and effort my company and colleagues have wasted doing just that with Director 6.x, and that we still don’t have a stable implementation.

    • > I’m beginning to think that IBM has no place writing software of any kind– ever.

      This is also unfortunately true of EMC and Dell. The abominations those companies put forth are staggering.

      • Steve

        Could not agree more with that comment. I have abandoned all hope on IBM Software and Director. RIP.

  • David

    I’ve got a good solid 100+ hours with ISD and I’ve got some fundamental gripes with it. We’ve got thousands of RHEL4/5 xseries scattered across the globe and with budget cuts I was soooo hoping to deploy ISD, hire a couple jr. admins and turn it over to them to handle the hardware firmware/driver and OS patching duties. I’ve fought and fought with ISD, I’ve worked closely with support…but from what I can tell, ISD just isn’t there yet, which totally bums me out.  We’re on 6.2.2 I believe.

    My big ticket gripes are almost entirely focused on ISD’s pitiful support for IPMI on the BMC and IMM, from the most rudimentary features (SOL via RMCP+) to the more advanced/integrated capabilities like any type of ASU or MegaCLI support via the BMC (full BIOS access and remote mounts via RDCLI).

    But the show stopper starts with the inherent and debilitating disconnect between manual IBM UXSP firmware/driver updates and RHEL RHN updates. What a mess….a mess that I was fully counting on ISD to resolve. But not only did it not, it ADDS an entirely new layer of complexity to the mess that not even BigFix, oops, I mean IBM Endpoint Manager can tame.

    I really enjoyed (sadly) hearing everyone’s experiences and was very impressed with the article…it feels better to know that we’re not alone.

    The best thing I think IBM could do at this point would be to redirect all the money being spent on documentation into more ISD developers…as many have said, ISD’s documentation is insanely obese, it’s nauseating and it’s the very last place I’d look…the proposition of using it as a resource simply terrifies me.  I doubt the combination of the space shuttle, the stealth bomber/fighter and the F16 instruction/repair manuals has as much documentation as ISD. 

    SOOOO much uncapitalized potential.

    • DAvid

      BTW, when I say IPMI support I mean entirely through the BMC/IMM.  IMHO IBM should have baked the ISD cake to fully utilize the BMC/IMM interfaces.  And once the cake was baked and perfected only then begin exploring OS agents as icing.  For me that would give IBM the true/clearcut value added reality that MGT could sink their teeth into for the LONG TERM.  Based on my ISD findings, a staggering number of PR’s have been halted for rewrite, PR’s that IBM nearly had in the bag. 

    • Joe

      As an IBMer who worked on ISD 6.1 and 6.2, I can provide some insight. One of the reasons that the ISD documentation is so bloated and vague is that the IBM ISD developers do not write the ISD documentation.  The ISD documentation is designed and written by IBM technical writers who do not have a technical background. They are generic technical writers who comb through the ISD developers' design documents and then “fill in the blanks” with the bloated nonsense that we have come to loathe. After all, there is a dept of about 15-20 ISD technical writers who need to justify their full-time status of working 40 hours per week. If the developers have time at the end of the ISD release cycle, they do a high-level review of the documentation. I have seen first-hand that most of the ISD documentation gets published without sufficient technical review or usability review. However, you can’t really blame the ISD developers, who are overworked and are worried about their jobs being off-shored. Similarly, can you blame the 15-20 ISD technical writers who have to justify their 40 hours per week worth of work, or else see their jobs off-shored too? The real problem is systemic in the IBM culture. I agree with the previous comments about IBM getting out of the system software business.
