Monitoring System Health and Availability, and Logging: Part 2

Continuing yesterday’s discussion of monitoring and logging, I want to work through a few specific cases to illustrate how conditions can be sensed in an organized way, and especially how errors of various types can be detected. To that end, let’s start with a slightly simplified situation as shown here, where a single main process runs on each machine, and the monitoring, communication, and logging are all built into each process.


As always, functionality on the local machine is straightforward. In this simplified case, things are either working or they aren’t, and that should be reflected in whatever logs are or are not written. Sensing the state and history of the remote machine is what’s interesting.
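As a rough sketch of sensing local state from the logs, the snippet below classifies a log file by its last-modified time. The function name `log_state`, the three state labels, and the idea of using the modification time as a proxy for "entries written recently" are all my own assumptions for illustration. A real check might parse timestamps out of the entries instead.

```python
import os


def log_state(path: str, now: float, max_age: float) -> str:
    """Classify a log file as 'missing', 'stale', or 'current'.

    `now` and `max_age` are in seconds; `now` is passed in explicitly so
    the check is easy to test. All names here are illustrative.
    """
    try:
        modified = os.path.getmtime(path)
    except OSError:
        # File missing or unreadable: the process may never have written it.
        return "missing"
    # A file that exists but has not changed recently suggests the
    # process is not running or is failing to write.
    return "current" if now - modified <= max_age else "stale"
```

A caller would typically pass `time.time()` as `now` and pick `max_age` a little longer than the interval at which the process is expected to write.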

First, let’s imagine a remote system that supports only one type of communication. That channel must be able to support messaging for many functions, including whatever operations it’s carrying out, remote system administration (if appropriate), reporting on current status, and reporting on historical data. The remote system must be able to interpret the incoming messages so it can reply with an appropriate response. Most importantly, it has to be able to sense which of the four kinds of messages it’s receiving. Let’s look at each function in turn.

Normal Operations: The incoming message must include enough information to support the desired operation. Messages will include commands, operating parameters, switches, data records, and so on.
Remote System Administration: Not much of this will typically happen. If the remote machine is complex and has a full, independent operating system, then it is likely to be administered manually at the machine or through a standard remote interface, different from the operational interface we’re thinking about now. Admin commands using this channel are likely to be few and simple, such as stop, start, reboot, and reset communications. I include this mostly for completeness.
Report Current State: This is mostly a way to query and report on the current state of the system. The incoming command is likely to be quite simple. The response will be only as complex as needed to describe the running status of the system and its components. In the case of a single running process as shown here, there might not be much to report. It could be as simple as “Ping!” “Yup!” That said, the standard query may also include possible alarm conditions, process counts, current operating parameters for dashboards, and more.
Report Historical Data: This could involve reporting a summary of events that have been logged since the last scan, over a defined time period, or meeting specified criteria. The reply might be lengthy and involve multiple send operations, or may involve sending one or more files back in their entirety.
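To make the four message kinds concrete, here is a minimal sketch of how a remote process might sense which kind it has received and route it. It assumes JSON messages with a `kind` field, which is purely my assumption since the post doesn’t commit to a wire format. The field names, the `reset_comms` command name, and the placeholder replies are likewise hypothetical.

```python
import json
from enum import Enum


class MessageKind(Enum):
    OPERATION = "operation"
    ADMIN = "admin"
    STATUS = "status"
    HISTORY = "history"


# Hypothetical names for the few simple admin commands described above.
ADMIN_COMMANDS = {"stop", "start", "reboot", "reset_comms"}


def handle_message(raw: str) -> dict:
    """Decode one incoming message and route it by its declared kind."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Corresponds to an "error unpacking message" condition.
        return {"ok": False, "error": f"unpack failed: {exc.msg}"}
    if not isinstance(message, dict):
        return {"ok": False, "error": "unpack failed: expected an object"}
    try:
        kind = MessageKind(message.get("kind"))
    except ValueError:
        return {"ok": False, "error": f"unknown message kind: {message.get('kind')!r}"}

    if kind is MessageKind.STATUS:
        # The "Ping!" / "Yup!" case; a real reply could carry much more.
        return {"ok": True, "kind": kind.value, "status": "running"}
    if kind is MessageKind.ADMIN:
        command = message.get("command")
        if command not in ADMIN_COMMANDS:
            return {"ok": False, "error": f"unsupported admin command: {command!r}"}
        return {"ok": True, "kind": kind.value, "accepted": command}
    if kind is MessageKind.HISTORY:
        # Placeholder: a real system would read its logs here.
        return {"ok": True, "kind": kind.value, "since": message.get("since"), "events": []}
    # Normal operations: placeholder result only.
    return {"ok": True, "kind": kind.value, "result": None}
```

Every path returns a reply, including the failure paths, so the sender can tell a message the remote system rejected from one it never received.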

Some setups may involve a separate communication channel using a different protocol and supporting different functions. Some of this was covered above and in yesterday’s post.


Now let’s look at what can be sensed on the local and remote systems in some kind of logical order:

Condition	Current State Sensed	Historical State Sensed
No problems, everything working	Normal status returned, no errors reported	Normal logs written, no errors reported
Local process not running	Current status not accessible or not changing	Local log file has no entries for time period or file missing
Local process detectable program error	Error condition reported	Error condition logged
Error writing to disk	Error condition detected and reported	Local log file has no entries for time period or file corrupted or missing
Error packing outgoing message	Can be detected and reported normally	Can be detected and logged normally
Error connecting to remote system (not found / wrong address / can’t be reached, incorrect authentication, etc.)	Error from remote system received and reported (if it is running, else timeout on no connection made)	Error from remote system received and logged (if it is running, else timeout on no connection made)
Error sending to remote system (connection physically or logically down)	Error from remote system received and reported (if it is running, else timeout on no connection made)	Error from remote system received and logged (if it is running, else timeout on no connection made)
Remote system fails to receive message	Request timeout reported	Request timeout logged
Remote system error unpacking message	Error from remote system received and reported	Error from remote system received and logged
Remote system error validating message values	Error from remote system received and reported	Error from remote system received and logged
Remote system error packing reply values	Error from remote system received and reported (if sends relevant error)	Error from remote system received and logged (if sends relevant error)
Remote system error connecting	Assume this is obviated once message channel open	Assume this is obviated once message channel open
Remote system error sending reply	Request timeout reported	Request timeout logged
Remote system detectable program error	Error from remote system received and reported	Error from remote system received and logged
Remote system error writing to disk	Error from remote system received and reported (if sends relevant error)	Entries missing in remote log or remote file corrupted or missing
Remote system not running (OS or host running)	Error from remote system received and reported if sent by host/OS, otherwise timeout	Error from remote system received and logged if sent by host/OS, otherwise timeout
Remote system not running (entire system down)	Report unable to connect or timeout on reply	Log unable to connect or timeout on reply
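The remote rows of the table boil down to a handful of outcomes the local side can actually distinguish: a normal reply, an error reply, a refusal from the remote host or OS, a failure to reach the machine at all, and a timeout. The sketch below classifies one request into those buckets and logs the result. The function name `sense_remote`, the outcome labels, and the convention that replies are dicts with an `ok` flag are all assumptions of mine. The `send_request` callable stands in for whatever transport is really in use.

```python
import logging
import socket
from typing import Callable

log = logging.getLogger("remote_sense")


def sense_remote(send_request: Callable[[], dict]) -> str:
    """Run one request and classify the outcome (labels are illustrative)."""
    try:
        reply = send_request()
    except (socket.timeout, TimeoutError):
        # No reply in time: message lost, reply lost, or remote down.
        outcome = "timeout"
    except ConnectionRefusedError:
        # The host/OS answered, but nothing is listening: the remote
        # process is likely not running while the machine is up.
        outcome = "refused_by_host"
    except OSError:
        # Wrong address, unreachable network, connection dropped, etc.
        outcome = "unable_to_connect"
    else:
        # A reply arrived; the remote system may still be reporting an error.
        outcome = "normal" if reply.get("ok", False) else "remote_error"

    if outcome == "normal":
        log.info("remote status: %s", outcome)
    else:
        log.error("remote status: %s", outcome)
    return outcome
```

The timeout case is caught before the general `OSError` case because timeouts are themselves a kind of `OSError` in Python. Note that a timeout alone cannot say whether the request or the reply was lost, which is why several rows in the table collapse to the same sensed state.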

Returning to yesterday’s more complex case, if the remote system supports several independent processes and a separate monitoring process, then there are a couple of other cases to consider.

Condition	Current State Sensed	Historical State Sensed
Remote monitor working, individual remote processes generating errors or not running	Normal status returned, relevant process errors reported	Normal status returned, relevant process errors logged
Remote monitor not running, separate from communications	Normal status returned, report monitor counter or heartbeat not updating	Normal status returned, log monitor counter or heartbeat not updating
Remote monitor not running, with embedded communications	Error from remote system received and reported if sent by host/OS, otherwise timeout reported	Error from remote system received and logged if sent by host/OS, otherwise timeout logged
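For the case where the monitor is separate from communications, the tell is a counter or heartbeat that stops changing even though status replies keep arriving. Here is a minimal sketch of detecting that, assuming the status reply carries an integer counter the monitor increments. The class name `HeartbeatWatcher` and the `max_silence` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatcher:
    """Tracks a remote counter and flags it when it stops changing."""

    max_silence: float  # seconds the counter may stay unchanged
    last_counter: Optional[int] = None
    last_change: Optional[float] = None

    def observe(self, counter: int, now: float) -> bool:
        """Record one reading; return True if the monitor looks alive."""
        if self.last_counter is None or counter != self.last_counter:
            # First reading, or the counter moved: the monitor is updating.
            self.last_counter = counter
            self.last_change = now
            return True
        # Counter unchanged: alive only if it changed recently enough.
        return (now - self.last_change) <= self.max_silence
```

For example, with `max_silence=5.0`, readings of the same counter value at 3 and 7 seconds still count as alive at 7 (four seconds unchanged), but a third identical reading at 9 seconds would be flagged as not updating.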

A further complication arises if the remote system is actually a cluster of individual machines supporting a single, scalable process. It is assumed in this case that the cluster management mechanism and its interface allow interactions to proceed the same way as if the system were running on a single machine. Alternatively, the cluster manager will be capable of reporting specialized error messages conveying appropriate information.
