{"id":2125,"date":"2018-08-09T22:51:19","date_gmt":"2018-08-10T03:51:19","guid":{"rendered":"https:\/\/rpchurchill.com\/wordpress\/?p=2125"},"modified":"2019-04-30T18:41:28","modified_gmt":"2019-04-30T23:41:28","slug":"monitoring-system-health-and-availability-and-logging-part-2","status":"publish","type":"post","link":"https:\/\/rpchurchill.com\/wordpress\/posts\/2018\/08\/09\/monitoring-system-health-and-availability-and-logging-part-2\/","title":{"rendered":"Monitoring System Health and Availability, and Logging: Part 2"},"content":{"rendered":"<p>Continuing <a>yesterday&#8217;s<\/a> discussion of monitoring and logging, I wanted to work through a few specific cases and try to illustrate how things can be sensed in an organized way, and especially how errors of various types can be detected. To that end, let&#8217;s start with a slightly simplified situation as shown here, where we have a single main process running on each machine, and the monitoring, communication, and logging are all built in to each process.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.rpchurchill.com\/images\/articles\/20180809_Monitor-Log-Diagram_var_01.png\" width=\"400px\" title=\"right click and view in higher resolution\" \/><\/p>\n<p>As always, functionality on the local machine is straightforward. In this simplified case, things are either working or they aren&#8217;t, and that should be reflected in whatever logs are or are not written. Sensing the state and history of the remote machine is what&#8217;s interesting.<\/p>\n<p>First, let&#8217;s imagine a remote system that supports only one type of communication. That channel must be able to support messaging for many functions, including whatever operations it&#8217;s carrying out, remote system administration (if appropriate), reporting on current status, and reporting on historical data. The remote system must be able to interpret the incoming messages so it can reply with an appropriate response. 
Most importantly, it has to be able to sense which of the four kinds of messages it&#8217;s receiving. Let&#8217;s look at each function in turn.<\/p>\n<ul>\n<li><strong>Normal Operations:<\/strong> The incoming message must include enough information to support the desired operation. Messages will include commands, operating parameters, switches, data records, and so on.<\/li>\n<li><strong>Remote System Administration:<\/strong> Not much of this will typically happen. If the remote machine is complex and has a full, independent operating system, then it is likely to be administered manually at the machine or through a standard remote interface, different from the operational interface we&#8217;re thinking about now. Admin commands using this channel are likely to be few and simple, such as stop, start, reboot, and reset communications. I include this mostly for completeness.<\/li>\n<li><strong>Report Current State:<\/strong> This is mostly a way to query and report on the current state of the system. The incoming command is likely to be quite simple. The response will be only as complex as needed to describe the running status of the system and its components. In the case of a single running process as shown here, there might not be much to report. It could be as simple as &#8220;Ping!&#8221; &#8220;Yup!&#8221; That said, the standard query may also include possible alarm conditions, process counts, current operating parameters for dashboards, and more.<\/li>\n<li><strong>Report Historical Data:<\/strong> This could involve reporting a summary of events that have been logged since the last scan, over a defined time period, or meeting specified criteria. The reply might be lengthy and involve multiple send operations, or may involve sending one or more files back in their entirety.<\/li>\n<\/ul>\n<p>Some setups may involve a separate communication channel using a different protocol and supporting different functions. 
Some of this was covered above and yesterday.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.rpchurchill.com\/images\/articles\/20180809_Monitor-Log-Diagram_var_02.png\" width=\"400px\" title=\"right click and view in higher resolution\" \/><\/p>\n<p>Now let&#8217;s look at what can be sensed on the local and remote systems in some kind of logical order:<\/p>\n<table style=\"border: 1px solid white; border-collapse: collapse;\">\n<tbody>\n<tr>\n<th style=\"border: 1px solid white;\">Condition<\/th>\n<th style=\"border: 1px solid white;\">Current State Sensed<\/th>\n<th style=\"border: 1px solid white;\">Historical State Sensed<\/th>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">No problems, everything working<\/td>\n<td style=\"border: 1px solid white;\">Normal status returned, no errors reported<\/td>\n<td style=\"border: 1px solid white;\">Normal logs written, no errors reported<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Local process not running<\/td>\n<td style=\"border: 1px solid white;\">Current status not accessible or not changing<\/td>\n<td style=\"border: 1px solid white;\">Local log file has no entries for time period or file missing<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Local process detectable program error<\/td>\n<td style=\"border: 1px solid white;\">Error condition reported<\/td>\n<td style=\"border: 1px solid white;\">Error condition logged<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Error writing to disk<\/td>\n<td style=\"border: 1px solid white;\">Error condition detected and reported<\/td>\n<td style=\"border: 1px solid white;\">Local log file has no entries for time period or file corrupted or missing<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Error packing outgoing message<\/td>\n<td style=\"border: 1px solid white;\">Can be detected and reported normally<\/td>\n<td style=\"border: 1px solid white;\">Can be detected and logged 
normally<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Error connecting to remote system (not found \/ wrong address \/ can&#8217;t be reached, incorrect authentication, etc.)<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported (if it is running, else timeout on no connection made)<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and logged (if it is running, else timeout on no connection made)<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Error sending to remote system (connection physically or logically down)<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported (if it is running, else timeout on no connection made)<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and logged (if it is running, else timeout on no connection made)<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system fails to receive message<\/td>\n<td style=\"border: 1px solid white;\">Request timeout reported<\/td>\n<td style=\"border: 1px solid white;\">Request timeout logged<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system error unpacking message<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and logged<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system error validating message values<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and logged<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system error packing reply values<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported (if sends relevant error)<\/td>\n<td style=\"border: 1px solid white;\">Error from 
remote system received and logged (if sends relevant error)<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system error connecting<\/td>\n<td style=\"border: 1px solid white;\">Assume this is obviated once message channel open<\/td>\n<td style=\"border: 1px solid white;\">Assume this is obviated once message channel open<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system error sending reply<\/td>\n<td style=\"border: 1px solid white;\">Request timeout reported<\/td>\n<td style=\"border: 1px solid white;\">Request timeout logged<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system detectable program error<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and logged<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system error writing to disk<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported (if sends relevant error)<\/td>\n<td style=\"border: 1px solid white;\">Entries missing in remote log or remote file corrupted or missing<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system not running (OS or host running)<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported if sent by host\/OS, otherwise timeout<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and logged if sent by host\/OS, otherwise timeout<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote system not running (entire system down)<\/td>\n<td style=\"border: 1px solid white;\">Report unable to connect or timeout on reply<\/td>\n<td style=\"border: 1px solid white;\">Log unable to connect or timeout on reply<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Returning to yesterday&#8217;s more complex case, if the remote system supports several independent processes and a 
separate monitoring process, then there are a couple of other cases to consider.<\/p>\n<table style=\"border: 1px solid white; border-collapse: collapse;\">\n<tbody>\n<tr>\n<th style=\"border: 1px solid white;\">Condition<\/th>\n<th style=\"border: 1px solid white;\">Current State Sensed<\/th>\n<th style=\"border: 1px solid white;\">Historical State Sensed<\/th>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote monitor working, individual remote processes generating errors or not running<\/td>\n<td style=\"border: 1px solid white;\">Normal status returned, relevant process errors reported<\/td>\n<td style=\"border: 1px solid white;\">Normal status returned, relevant process errors logged<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote monitor not running, separate from communications<\/td>\n<td style=\"border: 1px solid white;\">Normal status returned, report monitor counter or heartbeat not updating<\/td>\n<td style=\"border: 1px solid white;\">Normal status returned, log monitor counter or heartbeat not updating<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid white;\">Remote monitor not running, with embedded communications<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and reported if sent by host\/OS, otherwise timeout reported<\/td>\n<td style=\"border: 1px solid white;\">Error from remote system received and logged if sent by host\/OS, otherwise timeout logged<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A further complication arises if the remote system is actually a cluster of individual machines supporting a single, scalable process.  It is assumed in this case that the cluster management mechanism and its interface allow interactions to proceed the same way as if the system were running on a single machine.  
Alternatively, the cluster manager will be capable of reporting specialized error messages conveying appropriate information.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Continuing yesterday&#8217;s discussion of monitoring and logging, I wanted to work through a few specific cases and try to illustrate how things can be sensed in an organized way, and especially how errors of various types can be detected. To &hellip; <a href=\"https:\/\/rpchurchill.com\/wordpress\/posts\/2018\/08\/09\/monitoring-system-health-and-availability-and-logging-part-2\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[216,88,215,228,211,91],"_links":{"self":[{"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/posts\/2125"}],"collection":[{"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/comments?post=2125"}],"version-history":[{"count":5,"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/posts\/2125\/revisions"}],"predecessor-version":[{"id":2132,"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/posts\/2125\/revisions\/2132"}],"wp:attachment":[{"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/media?parent=2125"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/categories?post=2125"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rpchurchill.com\/wordpress\/wp-json\/wp\/v2\/tags?post=2125"}],"curies":[{"name":"wp","href":"https:\/\/api.w.o
rg\/{rel}","templated":true}]}}