Monitoring System Health and Availability, and Logging: Part 2

Continuing yesterday’s discussion of monitoring and logging, I want to work through a few specific cases to illustrate how conditions can be sensed in an organized way, and especially how errors of various types can be detected. To that end, let’s start with a slightly simplified situation as shown here, where we have a single main process running on each machine, and the monitoring, communication, and logging are all built into each process.

As always, functionality on the local machine is straightforward. In this simplified case, things are either working or they aren’t, and that should be reflected in whatever logs are or are not written. Sensing the state and history of the remote machine is what’s interesting.

First, let’s imagine a remote system that supports only one type of communication. That channel must carry messages for four functions: the normal operations the system carries out, remote system administration (if appropriate), reporting on current status, and reporting on historical data. The remote system must be able to interpret each incoming message so it can send an appropriate response. Most importantly, it has to be able to tell which of the four kinds of messages it is receiving. Let’s look at each function in turn.

  • Normal Operations: The incoming message must include enough information to support the desired operation. Messages will include commands, operating parameters, switches, data records, and so on.
  • Remote System Administration: Not much of this will typically happen. If the remote machine is complex and has a full, independent operating system, then it is likely to be administered manually at the machine or through a standard remote interface, different from the operational interface we’re thinking about now. Admin commands using this channel are likely to be few and simple: stop, start, reboot, reset communications, and the like. I include this mostly for completeness.
  • Report Current State: This is mostly a way to query and report on the current state of the system. The incoming command is likely to be quite simple. The response will be only as complex as needed to describe the running status of the system and its components. In the case of a single running process as shown here, there might not be much to report. It could be as simple as “Ping!” “Yup!” That said, the standard query may also include possible alarm conditions, process counts, current operating parameters for dashboards, and more.
  • Report Historical Data: This could involve reporting a summary of events that have been logged since the last scan, over a defined time period, or meeting specified criteria. The reply might be lengthy and involve multiple send operations, or may involve sending one or more files back in their entirety.
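To make the four message kinds concrete, here is a minimal sketch of how a remote process might tell them apart and route each one. Everything in it is an assumption for illustration: the JSON encoding, the `kind` field, the names in `KINDS`, the admin command names, and the shape of the `state` dictionary are all hypothetical and not taken from any particular system.

```python
import json

# Hypothetical message kinds, one per function described above.
KINDS = ("operation", "admin", "status", "history")
# Hypothetical names for the few simple admin commands expected on this channel.
ADMIN_COMMANDS = ("stop", "start", "reboot", "reset_comms")


def handle_message(raw, state):
    """Decode one incoming message and route it by its 'kind' field."""
    try:
        msg = json.loads(raw)
    except (TypeError, ValueError):
        # The message could not be unpacked at all.
        return {"ok": False, "error": "unpack_failed"}
    if not isinstance(msg, dict) or msg.get("kind") not in KINDS:
        return {"ok": False, "error": "unknown_kind"}

    kind = msg["kind"]
    if kind == "operation":
        # Normal operations: a command plus parameters or data records.
        state["log"].append(("operation", msg.get("command")))
        return {"ok": True, "result": "accepted"}
    if kind == "admin":
        if msg.get("command") not in ADMIN_COMMANDS:
            return {"ok": False, "error": "invalid_admin_command"}
        state["log"].append(("admin", msg["command"]))
        return {"ok": True, "result": "accepted"}
    if kind == "status":
        # Current state: running status plus any active alarm conditions.
        return {"ok": True, "status": state["status"],
                "alarms": list(state["alarms"])}
    # kind == "history": return events logged from a given position onward.
    since = msg.get("since", 0)
    if not isinstance(since, int) or since < 0:
        return {"ok": False, "error": "invalid_since"}
    return {"ok": True, "events": state["log"][since:]}
```

A status query in this sketch would be `handle_message('{"kind": "status"}', state)`, where `state` holds `status`, `alarms`, and `log` entries. Note that the error replies correspond to the unpacking and validation failures that the local side needs to be able to receive and record.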

Some setups may involve a separate communication channel using a different protocol and supporting different functions. Some of this was covered above and yesterday.

Now let’s look at what can be sensed on the local and remote systems in some kind of logical order:

| Condition | Current State Sensed | Historical State Sensed |
| --- | --- | --- |
| No problems, everything working | Normal status returned, no errors reported | Normal logs written, no errors reported |
| Local process not running | Current status not accessible or not changing | Local log file has no entries for time period or file missing |
| Local process detectable program error | Error condition reported | Error condition logged |
| Error writing to disk | Error condition detected and reported | Local log file has no entries for time period or file corrupted or missing |
| Error packing outgoing message | Can be detected and reported normally | Can be detected and logged normally |
| Error connecting to remote system (not found / wrong address / can’t be reached, incorrect authentication, etc.) | Error from remote system received and reported (if it is running, else timeout on no connection made) | Error from remote system received and logged (if it is running, else timeout on no connection made) |
| Error sending to remote system (connection physically or logically down) | Error from remote system received and reported (if it is running, else timeout on no connection made) | Error from remote system received and logged (if it is running, else timeout on no connection made) |
| Remote system fails to receive message | Request timeout reported | Request timeout logged |
| Remote system error unpacking message | Error from remote system received and reported | Error from remote system received and logged |
| Remote system error validating message values | Error from remote system received and reported | Error from remote system received and logged |
| Remote system error packing reply values | Error from remote system received and reported (if sends relevant error) | Error from remote system received and logged (if sends relevant error) |
| Remote system error connecting | Assume this is obviated once message channel open | Assume this is obviated once message channel open |
| Remote system error sending reply | Request timeout reported | Request timeout logged |
| Remote system detectable program error | Error from remote system received and reported | Error from remote system received and logged |
| Remote system error writing to disk | Error from remote system received and reported (if sends relevant error) | Entries missing in remote log or remote file corrupted or missing |
| Remote system not running (OS or host running) | Error from remote system received and reported if sent by host/OS, otherwise timeout | Error from remote system received and logged if sent by host/OS, otherwise timeout |
| Remote system not running (entire system down) | Report unable to connect or timeout on reply | Log unable to connect or timeout on reply |
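Many of the rows above collapse into a handful of outcomes as seen from the local side: a normal reply, an error reply from the remote system, a failure to connect or send, or a timeout. The following is a rough sketch of that classification, with each outcome both returned (the current state) and appended to a log (the historical state). The `send` callable, the outcome names, and the `ok`/`error` reply fields are hypothetical; only the exception types are Python built-ins.

```python
import socket
import time


def poll_remote(send, request, log, timeout=2.0):
    """Send one request, classify the outcome, log it, and return it.

    `send` is a hypothetical transport callable taking (request, timeout)
    that returns a reply dict or raises an exception on failure.
    """
    try:
        reply = send(request, timeout)
    except (TimeoutError, socket.timeout):
        # The remote never received the message, or its reply never arrived.
        outcome, detail = "timeout", None
    except OSError as exc:
        # Could not connect or send: refused, unreachable, connection reset.
        # This must come after the timeout clause, since TimeoutError is
        # itself a subclass of OSError.
        outcome, detail = "connect_or_send_error", str(exc)
    else:
        if isinstance(reply, dict) and reply.get("ok"):
            outcome, detail = "ok", None
        elif isinstance(reply, dict):
            # The remote system ran, detected a problem, and said so.
            outcome, detail = "remote_error", reply.get("error")
        else:
            outcome, detail = "remote_error", "malformed_reply"

    # Historical state: every poll result is recorded with a timestamp.
    log.append((time.time(), outcome, detail))
    # Current state: the caller reports this value.
    return outcome
```

Passing the transport in as a function keeps the classification logic separate from any one protocol, so the same code can sit in front of a raw socket, an HTTP client, or a test stub.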

Returning to yesterday’s more complex case, if the remote system supports several independent processes and a separate monitoring process, then there are a couple of other cases to consider.

| Condition | Current State Sensed | Historical State Sensed |
| --- | --- | --- |
| Remote monitor working, individual remote processes generating errors or not running | Normal status returned, relevant process errors reported | Normal status returned, relevant process errors logged |
| Remote monitor not running, separate from communications | Normal status returned, report monitor counter or heartbeat not updating | Normal status returned, log monitor counter or heartbeat not updating |
| Remote monitor not running, with embedded communications | Error from remote system received and reported if sent by host/OS, otherwise timeout reported | Error from remote system received and logged if sent by host/OS, otherwise timeout logged |
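The second case, where communications still work but the monitor has stopped, depends on noticing that a counter or heartbeat has stopped advancing. One possible way to sense that is sketched below. The class name, the `max_stale_polls` threshold, and the returned labels are hypothetical, and the sketch assumes the monitor's counter changes between any two polls while it is healthy.

```python
class HeartbeatWatcher:
    """Flags a remote monitor whose counter stops changing between polls."""

    def __init__(self, max_stale_polls=3):
        # How many consecutive unchanged readings are tolerated (assumed value).
        self.max_stale_polls = max_stale_polls
        self.last_value = None
        self.stale_polls = 0

    def observe(self, counter):
        """Record one polled counter value and return the monitor's state."""
        if counter != self.last_value:
            # The counter moved, so the monitor is alive; reset the tally.
            self.last_value = counter
            self.stale_polls = 0
        else:
            self.stale_polls += 1
        if self.stale_polls >= self.max_stale_polls:
            return "monitor_stalled"
        return "monitor_ok"
```

Tolerating a few unchanged readings before raising the condition avoids false alarms when the local poll interval is shorter than the remote monitor's update interval.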

A further complication arises if the remote system is actually a cluster of individual machines supporting a single, scalable process. It is assumed in this case that the cluster management mechanism and its interface allow interactions to proceed the same way as if the system were running on a single machine. Alternatively, the cluster manager will be capable of reporting specialized error messages conveying appropriate information.
