Monitoring System Health and Availability, and Logging: Part 4

One more piece of context must be added to the discussion I’ve written up here, here, and here, and that is the place of these operations in the 7-layer OSI communications model.

The image above is copied from and linked to the appropriate Wikipedia page. The clarification I’m making is that, in general, all of the operations I’m describing with respect to monitoring and logging take place strictly at layer 7, the application layer. This is the layer of the communication process that application programmers deal with in most cases, particularly when working with higher-level protocols like HTTP running over TCP/IP, even if that code writes specific information into message headers along with the message bodies.

Some applications will work with communications at lower levels. For example, I’ve worked with serial communications in real-time C++ code triggered by hardware interrupts, where the operations at layers 5 and 6 were handled in the application but the operations at layers 1 through 4 were handled by default in hardware and firmware; routing wasn’t a consideration because serial is just a point-to-point link. Even in that case, the monitoring and logging actions are performed (philosophically) at the application layer (layer 7).

Finally, it’s also possible to monitor the configuration of certain items at lower levels. Examples are ports, URLs, IP addresses, security certificates, authorization credentials, machine names, software version numbers (including languages, databases, and operating systems), and other items that may affect the success or failure of communications. Misconfiguration of these items is likely to result in a complete inability to communicate (e.g., incorrect network settings) or strange side effects (e.g., incorrect language versions, especially for virtual machines supporting Java, JavaScript/Node, PHP, and the like).


Monitoring System Health and Availability, and Logging: Part 3

Now that we’ve described how to sense error states and something about how to record logging information on systems running multiple processes, we’ll go into some deeper aspects of these activities. We’ll first discuss storing information so errors can be recovered from and reconstructed. We’ll then discuss errors from nested calls in multi-process systems.

Recovering From Errors Without Losing Information and Events

We’ve described how communications work between processes on a single machine and across multiple machines. If an original message is not successfully sent or received for any reason, the operation of the receiving or downstream process will be compromised. If no other related events occur in the downstream process, then the downstream action will not be carried out. If the message is passed in support of some other downstream action that does occur, however, then the downstream action will be carried out with missing information (which might, for example, require the use of manually entered or default values in place of what wasn’t received). An interesting instance of the latter case is manufacturing systems, where a physical workpiece may move from one station to another while the informational messages are not forwarded along with it. This may mean that the workpiece in the downstream station will have to be processed without identifying information, processing instructions, and so on.

There are a few ways to overcome this situation:

  • Multiple retries: This involves re-sending the message from an upstream process to a downstream process until an acknowledgment of successful receipt (and possibly completion) is returned to the upstream process. This operation fails when the upstream process itself fails. It may also be limited if new messages must be sent from an upstream process to a downstream process before the previous message is successfully sent.
  • Queueing requests: This involves storing the messages sent downstream so repeated attempts can be made to get them all handled. Storing in a volatile queue (e.g., in memory) may fail if the upstream process fails. Storing in a non-volatile queue (e.g., on disk) is more robust. The use of queues may also be limited if the order in which messages are sent is important, though including timestamp information may overcome those limits.
  • Pushing vs. Pulling: The upstream process can queue and/or retry sending the messages downstream until they all get handled. The downstream system can also fetch current and missed messages from the upstream system. It’s up to the pushing or pulling system to keep track of which actions or messages have been successfully handled and which must still be dealt with. (A minimal sketch of a durable queue with retries follows this list.)
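To make the queueing idea concrete, here is a minimal TypeScript sketch of a disk-backed outbound queue with retries. The file layout, message shape, and the `send` callback are assumptions for illustration, not a prescription for any particular system.

```typescript
// Minimal sketch of a durable (disk-backed) outbound queue with retries.
// Names and the file format here are illustrative assumptions, not a standard.
import * as fs from "fs";

interface OutboundMessage {
  id: string;       // unique message id, used to deduplicate downstream
  sentAt: string;   // ISO timestamp, useful if ordering matters
  body: unknown;    // application payload
}

class DurableQueue {
  constructor(private readonly path: string) {}

  // Append the message to non-volatile storage before any send attempt,
  // so a crash of the upstream process does not lose it.
  enqueue(msg: OutboundMessage): void {
    fs.appendFileSync(this.path, JSON.stringify(msg) + "\n");
  }

  // Read back everything still on disk (a real implementation would also
  // track which entries have already been acknowledged and skip them).
  pending(): OutboundMessage[] {
    if (!fs.existsSync(this.path)) return [];
    return fs
      .readFileSync(this.path, "utf8")
      .split("\n")
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as OutboundMessage);
  }
}

// Retry each pending message until the downstream system acknowledges it,
// up to a maximum number of attempts. `send` stands in for whatever
// transport the system actually uses.
async function drainQueue(
  queue: DurableQueue,
  send: (msg: OutboundMessage) => Promise<boolean>, // true = acknowledged
  maxAttempts = 5
): Promise<OutboundMessage[]> {
  const stillUnsent: OutboundMessage[] = [];
  for (const msg of queue.pending()) {
    let delivered = false;
    for (let attempt = 1; attempt <= maxAttempts && !delivered; attempt++) {
      try {
        delivered = await send(msg);
      } catch {
        // Connection or timeout error: fall through and retry.
      }
    }
    if (!delivered) stillUnsent.push(msg); // left for the next sweep
  }
  return stillUnsent;
}
```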

There may be special considerations depending on the nature of the system being designed. Some systems are send-only by nature. This may be a function of the communication protocol itself or just a facet of the system’s functional design.

In time-sensitive systems some information or actions may “age out.” This means they may no longer be usable in any meaningful way while events are happening, but keeping the information around may be useful for reconstructing events after the fact. This may be done by hand or in some automated way by functions that continually sweep for unprocessed items that can be correlated with known events.

For example, an upstream process may forward a message to a downstream process in conjunction with a physical workpiece. The message is normally received by the downstream system ahead of the physical workpiece so that it may be associated with the workpiece when it is received by the downstream system. If the message isn’t received before the physical piece, the downstream process may assign a temporary ID and default processing values to the piece. If the downstream process receives the associated message while the physical piece is still being worked on in the downstream process, then it can be properly associated and the correct instructions have a better chance of being carried out. The operating logs of the downstream process can also be edited or appended as needed. If the downstream process receives the associated message after the physical piece has left, then all the downstream system can do is log the information, and possibly pass it down to the next downstream process, in hopes that it will eventually catch up with the physical piece.

Another interesting case arises when the communication (or message- or control-passing) process fails on the return trip, after all of the desired actions were successfully completed downstream. Those downstream actions might include permanent side effects like the writing of database entries and the construction of complex, volatile data structures. The queuing/retry mechanisms have to be smart enough to ensure that the desired operations aren’t repeated if they have actually been completed.

A system will ideally be robust enough to ensure that no data or events ever get lost, and that they are all handled exactly the right number of times without duplication. Database systems that adhere to the ACID model have these qualities.
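Because a reply can be lost after the downstream work has already completed, retried messages also need to be safe to receive more than once. One common approach is to give each message a unique ID and have the receiver record which IDs it has already handled. Below is a minimal sketch of that idea; the names are invented, and the in-memory set stands in for what would really need to be persistent storage (ideally with the ACID guarantees mentioned above).

```typescript
// Minimal sketch of idempotent message handling on the receiving side.
// The in-memory Set is for illustration only; persisting processed IDs
// (e.g., in a database table) is what makes this robust across restarts.
interface IncomingMessage {
  id: string;       // unique id assigned by the sender
  body: unknown;
}

class IdempotentReceiver {
  private readonly processed = new Set<string>();

  constructor(private readonly handle: (body: unknown) => void) {}

  // Returns an acknowledgment either way, so the sender stops retrying,
  // but performs the side effects only once per message id.
  receive(msg: IncomingMessage): { acknowledged: true; duplicate: boolean } {
    if (this.processed.has(msg.id)) {
      return { acknowledged: true, duplicate: true };
    }
    this.handle(msg.body);        // perform the downstream work
    this.processed.add(msg.id);   // record completion only after success
    return { acknowledged: true, duplicate: false };
  }
}
```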

Properly Dealing With Nested Errors

Systems that pass messages or control through nested levels of functionality and then receive responses in return need a messaging mechanism that clearly indicates what went right or wrong. More to the point, since the happy-path functionality is most likely to work reasonably well, particular care must be taken to communicate a complete contextual description of any errors encountered.

Consider the following function:

A properly constructed function would return detailed errors from every identifiable point of failure, and a general error for failures it cannot identify. (This drawing should also be expanded to include reporting on calculation and other internal errors.) This generally isn’t difficult in the inline code over which the function has control, but what happens if control is passed to a nested function of some kind? And what if that function is every bit as complex as the calling function? In that case the error that is returned should include information identifying the point of failure in the calling function, with a unique error code and/or text description (how verbose you want to be depends on the system), and within that should be embedded the same information returned from the called function. Doing this gives a form of stack trace for errors (this level of detail generally isn’t needed for successfully traversed happy paths) and a very, very clear understanding of what went wrong, where, and why. If the relevant processes can each perform their own logging they should also do so, particularly if the different bits of functionality reside on different machines, as would be the case in a microservices architecture, but scanning error logs across different systems can be problematic. Being able to record errors at a higher level makes understanding process flows a little more tractable and could save a whole lot of forensic reconstruction of the crime.
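One way to get that “stack trace for errors” is to have each layer wrap whatever error it receives from the call below it with its own code and description before returning. A minimal TypeScript sketch, with invented error codes and function names purely for illustration:

```typescript
// Minimal sketch of returning nested error context instead of a bare failure.
// Error codes and messages here are invented for illustration.
interface OpError {
  code: string;      // unique code identifying the point of failure
  message: string;   // human-readable description
  cause?: OpError;   // error returned by the nested call, if any
}

type Result<T> = { ok: true; value: T } | { ok: false; error: OpError };

function parseRecord(raw: string): Result<number> {
  const value = Number(raw);
  if (Number.isNaN(value)) {
    return { ok: false, error: { code: "PARSE-001", message: `not a number: "${raw}"` } };
  }
  return { ok: true, value };
}

function processRequest(raw: string): Result<number> {
  const parsed = parseRecord(raw);
  if (!parsed.ok) {
    // Wrap the nested error with this function's own context.
    return {
      ok: false,
      error: { code: "REQ-010", message: "failed while parsing input", cause: parsed.error },
    };
  }
  return { ok: true, value: parsed.value * 2 };
}

// A failed call yields a chain like:
// REQ-010: failed while parsing input <- PARSE-001: not a number: "abc"
function describe(e: OpError): string {
  return e.cause ? `${e.code}: ${e.message} <- ${describe(e.cause)}` : `${e.code}: ${e.message}`;
}
```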

Another form of nesting is internal operations that deal with multiple items, whether in arrays or in some other kind of grouped structure. This is especially important if separate, complex, brittle, nested operations are to be performed on each, and where each can either complete or fail to complete with completely different return conditions (and associated error codes and messages). In this case the calling function should return information describing the outcome of processing for each element (especially those that returned errors), so only those items can be queued and/or retried as needed. This can get very complicated if that group of items is processed at several different steps in the calling function, and different items can return different errors not only within a single operation, but across multiple operations. That said, once an item generates an error at an early step, it probably shouldn’t continue to be part of the group being processed at a later step. It should instead be queued and retried at some level of nested functionality.
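A minimal sketch of the per-item reporting idea, again with invented names; the point is that the caller gets one outcome per element and can queue only the failures for retry:

```typescript
// Minimal sketch of processing a group of items and reporting per-item outcomes,
// so only the failed items need to be queued or retried. Names are illustrative.
interface ItemOutcome<T> {
  itemId: string;
  ok: boolean;
  value?: T;
  errorCode?: string;   // which step failed, for later retry or reporting
}

function processBatch(
  items: { id: string; payload: string }[],
  step: (payload: string) => string   // one of possibly several processing steps
): ItemOutcome<string>[] {
  return items.map((item) => {
    try {
      return { itemId: item.id, ok: true, value: step(item.payload) };
    } catch {
      // An item that fails here is reported individually and should be
      // excluded from later steps, not silently dropped from the group.
      return { itemId: item.id, ok: false, errorCode: "STEP-001" };
    }
  });
}

// Only the failures go back into the retry queue; successes continue on.
const outcomes = processBatch(
  [{ id: "A", payload: "1" }, { id: "B", payload: "oops" }],
  (p) => {
    if (Number.isNaN(Number(p))) throw new Error("bad payload");
    return `processed:${p}`;
  }
);
const toRetry = outcomes.filter((o) => !o.ok);
console.log(toRetry); // [{ itemId: "B", ok: false, errorCode: "STEP-001" }]
```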

Further Considerations

One more way to clear things up is to break larger functions down into smaller ones where possible. There are arguments for keeping a series of operations in a single function if they make sense logically and in the context of the framework being used, but there are arguments for clarity, simplicity, separation of concerns, modularity, and understandability as well. Whatever choice you make, know that you’re making it and do it on purpose.

If it feels like we’re imposing a lot of overhead to do error checking, monitoring, reporting, and so on, consider the kind of system we might be building. In a tightly controlled simulation system used for analysis, where calculation speed is the most important consideration, the level of monitoring and so on can be greatly reduced if it is known that the system is properly configured. In a production business or manufacturing system, however, the main considerations are going to be robustness, security, and customer service. Execution speed is far less likely to be the overriding consideration. In that case the effort taken to avoid loss of data and events is the main goal of the system’s operation.


Monitoring System Health and Availability, and Logging: Part 2

Continuing yesterday’s discussion of monitoring and logging, I wanted to work through a few specific cases and try to illustrate how things can be sensed in an organized way, and especially how errors of various types can be detected. To that end, let’s start with a slightly simplified situation as shown here, where we have a single main process running on each machine, and the monitoring, communication, and logging are all built into each process.

As always, functionality on the local machine is straightforward. In this simplified case, things are either working or they aren’t, and that should be reflected in whatever logs are or are not written. Sensing the state and history of the remote machine is what’s interesting.

First, let’s imagine a remote system that supports only one type of communication. That channel must be able to support messaging for many functions, including whatever operations it’s carrying out, remote system administration (if appropriate), reporting on current status, and reporting on historical data. The remote system must be able to interpret the incoming messages so it can reply with an appropriate response. Most importantly it has to be able to sense which of the four kinds of messages it’s receiving. Let’s look at each function in turn.

  • Normal Operations: The incoming message must include enough information to support the desired operation. Messages will include commands, operating parameters, switches, data records, and so on.
  • Remote System Administration: Not much of this will typically happen. If the remote machine is complex and has a full, independent operating system, then it is likely to be administered manually at the machine or through a standard remote interface, different than the operational interface we’re thinking about now. Admin commands using this channel are likely to be few and simple, and include commands like stop, start, reboot, reset communications, and simple things like that. I include this mostly for completeness.
  • Report Current State: This is mostly a way to query and report on the current state of the system. The incoming command is likely to be quite simple. The response will be only as complex as needed to describe the running status of the system and its components. In the case of a single running process as shown here, there might not be much to report. It could be as simple as “Ping!” “Yup!” That said, the standard query may also include possible alarm conditions, process counts, current operating parameters for dashboards, and more.
  • Report Historical Data: This could involve reporting a summary of events that have been logged since the last scan, over a defined time period, or meeting specified criteria. The reply might be lengthy and involve multiple send operations, or may involve sending one or more files back in their entirety.

Some setups may involve a separate communication channel using a different protocol and supporting different functions. Some of this was covered above and yesterday.

Now let’s look at what can be sensed on the local and remote systems in some kind of logical order:

Condition | Current State Sensed | Historical State Sensed
No problems, everything working | Normal status returned, no errors reported | Normal logs written, no errors reported
Local process not running | Current status not accessible or not changing | Local log file has no entries for time period or file missing
Local process detectable program error | Error condition reported | Error condition logged
Error writing to disk | Error condition detected and reported | Local log file has no entries for time period or file corrupted or missing
Error packing outgoing message | Can be detected and reported normally | Can be detected and logged normally
Error connecting to remote system (not found / wrong address / can’t be reached, incorrect authentication, etc.) | Error from remote system received and reported (if it is running, else timeout on no connection made) | Error from remote system received and logged (if it is running, else timeout on no connection made)
Error sending to remote system (connection physically or logically down) | Error from remote system received and reported (if it is running, else timeout on no connection made) | Error from remote system received and logged (if it is running, else timeout on no connection made)
Remote system fails to receive message | Request timeout reported | Request timeout logged
Remote system error unpacking message | Error from remote system received and reported | Error from remote system received and logged
Remote system error validating message values | Error from remote system received and reported | Error from remote system received and logged
Remote system error packing reply values | Error from remote system received and reported (if it sends relevant error) | Error from remote system received and logged (if it sends relevant error)
Remote system error connecting | Assume this is obviated once message channel open | Assume this is obviated once message channel open
Remote system error sending reply | Request timeout reported | Request timeout logged
Remote system detectable program error | Error from remote system received and reported | Error from remote system received and logged
Remote system error writing to disk | Error from remote system received and reported (if it sends relevant error) | Entries missing in remote log or remote file corrupted or missing
Remote system not running (OS or host running) | Error from remote system received and reported if sent by host/OS, otherwise timeout | Error from remote system received and logged if sent by host/OS, otherwise timeout
Remote system not running (entire system down) | Report unable to connect or timeout on reply | Log unable to connect or timeout on reply

Returning to yesterday’s more complex case, if the remote system supports several independent processes and a separate monitoring process, then there are a couple of other cases to consider.

Condition | Current State Sensed | Historical State Sensed
Remote monitor working, individual remote processes generating errors or not running | Normal status returned, relevant process errors reported | Normal status returned, relevant process errors logged
Remote monitor not running, separate from communications | Normal status returned, report monitor counter or heartbeat not updating | Normal status returned, log monitor counter or heartbeat not updating
Remote monitor not running, with embedded communications | Error from remote system received and reported if sent by host/OS, otherwise timeout reported | Error from remote system received and logged if sent by host/OS, otherwise timeout logged

A further complication arises if the remote system is actually a cluster of individual machines supporting a single, scalable process. It is assumed in this case that the cluster management mechanism and its interface allow interactions to proceed the same way as if the system was running on a single machine. Alternatively, the cluster manager will be capable of reporting specialized error messages conveying appropriate information.


Monitoring System Health and Availability, and Logging: Part 1

Ongoing Monitoring, or, What’s Happening Now?

Any system with multiple processes or machines linked by communication channels must address real-time and communication issues. One of the important issues in managing such a system is monitoring it to ensure all of the processes are running and talking to each other. This is a critical (and often overlooked) aspect of system design and operation. This article describes a number of approaches and considerations. Chief among these is maintaining good coverage while keeping the overhead as low as possible. That is, do no less than you need to do, but no more.

First let me start with a system diagram I’ve probably been overfond of sharing. It describes a Level 2, model-predictive, supervisory control system of a type I implemented many times for gas-fired reheat furnaces in steel mills all over North America and Asia.

The blue boxes represent the different running processes that made up the Level 2 system, most of which communicated via a shared memory area reserved by the Load Map program. The system needed a way to monitor the health of all the relevant processes, so each of the processes continually updated a counter in shared memory, and the Program Monitor process scanned them all at appropriate intervals. We referred to the counters, and the sensing of them, as the system’s heartbeat, and sometimes even represented the status with a colored heart somewhere on the UI. If any counter failed to update, the Program Monitor flagged an alarm and attempted to restart the offending process. Alarm states were displayed on the Level 2 UI (part of the Level 2 Model process) and also in the Program Monitor process, and were logged to disk at regular intervals. There were some additional subtleties but that was the gist of it. The communication method was straightforward, the logic was dead simple, and the mechanism did the job.
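The counter-scanning idea is simple enough to sketch. The version below uses an in-process map in place of the shared memory area, and the alarm and restart callbacks are placeholders for whatever the real system provides; it is meant only to show the shape of the heartbeat check, not the original implementation.

```typescript
// Minimal sketch of a heartbeat monitor: each process bumps a counter, and the
// monitor flags any counter that stops changing. The Map stands in for the
// shared memory area; raiseAlarm and restartProcess are placeholders.
const heartbeats = new Map<string, number>();

function beat(processName: string): void {
  heartbeats.set(processName, (heartbeats.get(processName) ?? 0) + 1);
}

function makeMonitor(
  raiseAlarm: (name: string) => void,
  restartProcess: (name: string) => void
) {
  const lastSeen = new Map<string, number>();
  return function scan(): void {
    for (const [name, counter] of heartbeats) {
      if (lastSeen.get(name) === counter) {
        raiseAlarm(name);       // counter did not advance since the last scan
        restartProcess(name);   // attempt recovery, as the Program Monitor did
      }
      lastSeen.set(name, counter);
    }
  };
}

// Example wiring: processes call beat() on their own cycle, and the monitor
// scans on a slower interval chosen so a healthy process always beats at
// least once between scans.
const scan = makeMonitor(
  (n) => console.error(`ALARM: ${n} heartbeat stalled`),
  (n) => console.log(`restarting ${n}`)
);
setInterval(() => beat("Model"), 1000);
setInterval(scan, 5000);
```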

This was effective, but limited.

This system was great for monitoring processes running on a single machine and regularly logging the status updates on a local hard disk. Such a system should log the times when all processes start and stop, if the stoppage can be sensed; otherwise the stoppage time should be inferable from the periodically updated logs.

This kind of system can also log errors encountered for other operational conditions, e.g., temperatures, flows, or pressures too high or low, programs taking too long to run (model slipping real-time), and so on, as well as failures trying to connect with, read from, and write to external systems over network links. External systems, particularly the main HMI (Level 1 system) provided by our company as the main control and historical interface to the industrial furnaces, needed to be able to monitor and log the health and status of the Level 2 system as well.

If all the communications between Level 1 and Level 2 are working, the Level 1 system can display and log all reported status items from the Level 2 system. If the communications are working but one or more processes aren’t working on the Level 2 side, the Level 1 system might be able to report and log that something is out of whack with Level 2, but it might not be able to trust the details, since they may not be getting calculated or reported properly. Please refer to yesterday’s discussion of documenting and implementing inter-process communications to get a feel for what might be involved.

The point is that one needs to make sure the correct information is captured, logged, and displayed, for the right people, in the right context.

If a system is small and all of the interfaces are close together (the Level 1 HMI computer and Level 2 computer often sat on the same desk or console, and if that wasn’t the case the Level 2 computer was almost always in an adjacent or very nearby room) then it’s easy to review anything you might need. This gets a lot more difficult if a system is larger, more dispersed, and more complicated. In that case you want to arrange for centralized logging and monitoring of as many statuses and operations as possible.

Let’s look at a general case of what might happen in one local computer and one connected, remote computer, with respect to monitoring and logging. Consider the following diagram:

Note that the details here are notional. They are intended to represent a general collection of common functions, though their specific arrangement and configuration may vary widely. I drew this model in a way that was inspired by a specific system I built (many times), but other combinations are possible. For example, one machine might support only a single process with an embedded communication channel. One machine might support numerous processes that each include their own embedded communication channel. The monitoring functions may operate independently or as part of a functional process. Imagine a single server providing an HTTP interface (a kind of web server), supporting a single function or service, where the monitoring function is embedded in the communication channel itself. One may also imagine a virtualized service running across multiple machines with a single, logical interface.

Starting with the local system, as long as the system is running, and as long as the monitor process is running (this refers to a general concept and not a specific implementation), the system should be able to generate reliable status information in real-time. If the local UI is available a person will be able to read the information directly. If the disk or persistent media are available the system can log information locally, and the log can then be read and reviewed locally.

The more interesting aspect of monitoring system health and availability involves communicating with remote systems. Let’s start by looking at the communication process. A message must be sent from the local system to a remote system to initiate an action or receive a response. Sending a message involves the following steps:

  1. Pack: Message headers and bodies must be populated with values, depending on the nature of the physical connection and information protocol. Some types of communications, like serial RS-232 or RS-485, might not have headers while routed IP communications definitely will. Some or all of the population of the header is performed automatically but the payload or body of the message must always be populated explicitly by the application software. Message bodies may be structured to adhere to user-defined protocols within industry-standard protocols, with the PUP and PHP standards defined by American Auto-Matrix for serial communication with its HVAC control products serving as an example. HTTP, XML, and JSON are other examples of standard protocols-within-protocols.
  2. Connect: This involves establishing a communications channel with the remote system. This isn’t germane to hard-wired communications like serial, but it definitely is for routed network communications. Opening a channel may involve a series of steps involving identifying the correct machine name, IP address, and communications port, and then providing explicit authentication credentials (i.e., login and password). Failure to open a channel may be sensed by receipt of an error message. Failure of the remote system to respond at all is generally sensed by timing out.
  3. Send: This is the process of forwarding the actual message to the communications channel if it is not part of the Connect step just described. Individual messages are sometimes combined with embedded access request information because they are meant to function as standalone events with one sent request and one received reply. In other cases the Connect step sets up a channel over which an ongoing two-way conversation is conducted. The communications subsystems may report errors, or communications may cease altogether, which again is sensed by timing out.
  4. Receive: This is the process of physically receiving the information from the communications link. The protocol handling system generally takes care of this, so the user-written software only has to process the fully received message. The drivers and subsystems generally handle the accretion of data off the wire.
  5. Unpack: The receiving system has to parse the received message to break it down into its relevant parts. The process for doing so depends on the protocol, message structure, and implementation language.
  6. Validate: The receiving system can also check the received message components to ensure that the values are in range or otherwise appropriate. This can be done at the level of the business operation or at the level of verifying the correctness of the communication itself. An example of verifying correct transmission of serial information is a CRC check, where the sending process calculates a value for the packet to be sent and embeds it in the packet. The receiving system then performs the same calculation, and if it generates the same value it proceeds on the assumption that the received packet is probably correct. (A minimal sketch of the pack-and-validate idea follows this list.)
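Here is the promised sketch of the pack-and-validate idea in TypeScript. A simple additive checksum stands in for a real CRC, and the JSON framing and field names are assumptions for illustration rather than any particular protocol.

```typescript
// Minimal sketch of packing a message with a checksum and validating it on
// receipt. A simple additive checksum stands in for a real CRC; the JSON
// framing and field names are illustrative assumptions, not a protocol.
interface Packet {
  header: { length: number; checksum: number };
  body: string;
}

function checksum(text: string): number {
  let sum = 0;
  for (let i = 0; i < text.length; i++) {
    sum = (sum + text.charCodeAt(i)) & 0xffff; // keep the value in 16 bits
  }
  return sum;
}

// Pack: populate the header fields explicitly alongside the payload.
function pack(body: string): Packet {
  return { header: { length: body.length, checksum: checksum(body) }, body };
}

// Unpack and validate: recompute the checksum and compare before trusting the
// contents; range and business-level validation would follow this step.
function unpack(raw: string): { ok: boolean; body?: string; error?: string } {
  let packet: Packet;
  try {
    packet = JSON.parse(raw) as Packet;
  } catch {
    return { ok: false, error: "unparseable packet" };
  }
  if (!packet || typeof packet.body !== "string" || !packet.header) {
    return { ok: false, error: "malformed packet" };
  }
  if (packet.body.length !== packet.header.length) {
    return { ok: false, error: "length mismatch" };
  }
  if (checksum(packet.body) !== packet.header.checksum) {
    return { ok: false, error: "checksum mismatch" };
  }
  return { ok: true, body: packet.body };
}

// Round trip: send JSON.stringify(pack(...)) and call unpack() on receipt.
const wire = JSON.stringify(pack("zoneTemperature=1250"));
console.log(unpack(wire)); // { ok: true, body: "zoneTemperature=1250" }
```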

I’ve drawn the reply process from the remote system to the local system as a mirror of the sending process, but in truth it’s probably simpler, because the connect process is mostly obviated. All of the connections and permissions should have been worked through as part of the local-to-remote connection process.

If the communications channels are working properly we can then discuss monitoring of remote systems. Real-time or current status values can be obtained by request from a remote system, based on what processes that machine or system or service is intended to support. As discussed above, this can be done via a query of a specialized monitoring subsystem or via the standard service interface that supports many kinds of queries.

In one example of a system I wrote, the Level 2 system communicated with the Level 1 system by writing a file to a local RAM disk that the Level 1 system would read, and reading a file from that RAM disk that the Level 1 system would write. The read-write processes were mutex-locked using separate status files. The file written by the Level 2 system was written by the Model process and included information compiled by the Program Monitor process. The Level 1 system knew the Level 2 system was operating if the counter in the file was being continually updated. It knew the Level 2 system had no alarm states if the Program Monitor process was working and seeing all the process counters update. It knew the Level 2 system was available to assume control if it was running, there were no alarms, and the model ready flag was set. The Level 1 system could read that information directly in a way that was appropriate for that method of communication. In other applications we used FTP, DECMessageQ, direct IP, shared memory, and database query communications with external systems. The same considerations apply for each.

An HTTP or any other interface might support a function that causes a status response to be sent, instead of whatever other response is normally requested. Status information might be obtained from remote systems using entirely different forms of communication. The ways to monitor status of remote systems are practically endless, but a few guidelines should be followed.

The overhead of monitoring the status of remote systems should be kept as light as possible. In general, but especially if there are multiple remote systems, a minimum number of queries should be made to request current statuses. If those results are to be made visible to a large number of people (or processes), they should be captured in an intermediate source that can be pinged much more often, and even automatically updated. For example, rather than creating a web page each instance of which continuously pings all available systems, there should be a single process that continuously pings all the systems and then updates a web dashboard. All external users would then link to the dashboard, which can be configured to automatically refresh at intervals. That keeps the network load on the working systems down as much as possible.
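The “one poller, many readers” arrangement might look something like the following sketch, where a single loop queries each system and every viewer reads only the cached snapshot. The endpoint list, polling interval, and status shape are invented for illustration (and `fetch` assumes a modern Node or browser runtime).

```typescript
// Minimal sketch of a single status poller feeding a shared dashboard cache,
// so many viewers do not each ping the working systems directly.
// Endpoints, intervals, and the status shape are illustrative assumptions.
interface SystemStatus {
  name: string;
  reachable: boolean;
  detail?: string;
  checkedAt: string;
}

const endpoints = [
  { name: "Level2-FurnaceA", url: "http://level2-a.example/status" },
  { name: "Level2-FurnaceB", url: "http://level2-b.example/status" },
];

// The cache is what the dashboard page reads and auto-refreshes from.
const dashboardCache = new Map<string, SystemStatus>();

async function pollOnce(): Promise<void> {
  for (const ep of endpoints) {
    const checkedAt = new Date().toISOString();
    try {
      const res = await fetch(ep.url, { signal: AbortSignal.timeout(2000) });
      dashboardCache.set(ep.name, {
        name: ep.name,
        reachable: res.ok,
        detail: res.ok ? await res.text() : `HTTP ${res.status}`,
        checkedAt,
      });
    } catch (err) {
      // A timeout or connection failure is itself useful status information.
      dashboardCache.set(ep.name, {
        name: ep.name,
        reachable: false,
        detail: String(err),
        checkedAt,
      });
    }
  }
}

// One process polls on an interval; every viewer reads dashboardCache instead
// of hitting the working systems themselves.
setInterval(pollOnce, 30_000);
```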

Logging, or, What Happened Yesterday?

So far we’ve mostly discussed keeping an eye on the current operating state of local and remote systems. We’ve also briefly touched on logging to persistent media on a local machine.

There are many ways to log data on a local machine. They mostly involve appending increments of data to a log file in some form that can be reviewed. In FORTRAN and C++ I tended to write records of binary data, with multiple variant records for header data to keep the record size as small as possible. That’s less effective in languages like JavaScript (and Node and their variants), so the files tend to be larger as they are written out as text in YAML, XML, or a similar format. It is also possible to log information to a database.

A few considerations for writing out log files are:

  • A means must be available to read and review them in appropriate formats.
  • Utilities can (and should) be written to sort, characterize, and compile statistical results from them, if appropriate.
  • The logs should have consistent naming conventions (for files, directories, and database tables). I’m very fond of using date and time formats that sort alphanumerically (e.g., LogType_YYYYMMDD_HHMMSS; a minimal sketch of this convention follows this list).
  • Care should be taken to ensure logs don’t fill up or otherwise size out, which can result in the loss of valuable information. More on this below.
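Here is the promised sketch of the sortable naming convention, combined with a simple append operation. The directory layout and the JSON-lines entry format are assumptions for illustration.

```typescript
// Minimal sketch of an append-only log with alphanumerically sortable names
// (LogType_YYYYMMDD_HHMMSS). The directory layout and the JSON-lines entry
// format are illustrative assumptions.
import * as fs from "fs";
import * as path from "path";

function sortableStamp(d: Date = new Date()): string {
  const p = (n: number) => String(n).padStart(2, "0");
  return (
    `${d.getFullYear()}${p(d.getMonth() + 1)}${p(d.getDate())}` +
    `_${p(d.getHours())}${p(d.getMinutes())}${p(d.getSeconds())}`
  );
}

function logFileName(logType: string, startedAt: Date): string {
  return `${logType}_${sortableStamp(startedAt)}.log`; // e.g. Alarms_20240131_143005.log
}

function appendLogEntry(dir: string, logType: string, startedAt: Date, entry: object): void {
  fs.mkdirSync(dir, { recursive: true });
  const line = JSON.stringify({ t: new Date().toISOString(), ...entry }) + "\n";
  fs.appendFileSync(path.join(dir, logFileName(logType, startedAt)), line);
}

// Usage: files for the same log type sort chronologically by name alone.
const runStart = new Date();
appendLogEntry("./logs", "Alarms", runStart, { level: "warn", msg: "flow low" });
```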

Logging events to a local system is straightforward, but accessing logs on remote systems is another challenge. Doing so may require that the normal communication channel for that system give access to the log or that a different method of access must be implemented. This may be complicated based on the volume of information involved.

It gets even more interesting if you want to log information that captures activities across multiple systems and machines. Doing this requires that all processes write to a centralized data repository of some kind. This can impose a significant overhead, so care must be taken to ensure that too much data isn’t written into the logging stream.

Here are a few ways to minimize or otherwise control the amount of data that gets logged:

  • Write no more information to the logs than is needed.
  • If very large entries are to be made, but the same information is likely to be repeated often (e.g., a full stack trace on a major failure condition), find a way to log a message saying something along the lines of, “same condition as last time,” and only do the full dump at intervals. That is, try not to repeatedly store verbose information.
  • Implement flexible levels of logging. These can include verbose, intermediate, terse, or none. Make the settings apply to any individual system or machine or to the system in its entirety.
  • Sometimes you want to be able to track the progress of items as they move through a system. In this case, the relevant data packages can include a flag that controls whether or not a logging operation is carried out for that item. That way you can set the flag for a test item or a specific class of items of interest, but not log events for every item or transaction. Flags can be added temporarily to data structures or buried in existing data structures using tricks like masking a high, unused bit on one structure member. (See the sketch after this list.)
  • If logs are in danger of overflowing, make sure they can roll over to a secondary or other succeeding log (e.g., YYYYMMDD_A, YYYYMMDD_B, YYYYMMDD_C, and so on), rather than just failing, stopping, or overwriting existing logs.
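As referenced in the list above, here is a minimal sketch combining flexible logging levels with a per-item trace flag. The level names and flag field are illustrative assumptions.

```typescript
// Minimal sketch of log-level gating plus a per-item trace flag, so verbose
// logging can be turned on for one item of interest without logging every
// transaction. Level names and the flag field are illustrative.
type LogLevel = "none" | "terse" | "intermediate" | "verbose";
const levelRank: Record<LogLevel, number> = { none: 0, terse: 1, intermediate: 2, verbose: 3 };

let systemLevel: LogLevel = "terse"; // could also be set per machine or per process

interface TrackedItem {
  id: string;
  trace?: boolean; // set on a test item or a class of items of interest
  [key: string]: unknown;
}

function logEvent(minLevel: LogLevel, message: string, item?: TrackedItem): void {
  const traced = item?.trace === true;
  // Emit if the system level is high enough, or if this particular item
  // has been flagged for tracing regardless of the global setting.
  if (levelRank[systemLevel] >= levelRank[minLevel] || traced) {
    console.log(`${new Date().toISOString()} [${minLevel}] ${message}`, item?.id ?? "");
  }
}

// Usage: only the flagged item produces verbose output at the "terse" setting.
logEvent("verbose", "entered station 3", { id: "piece-1001" });              // suppressed
logEvent("verbose", "entered station 3", { id: "piece-1002", trace: true }); // logged
```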

Well, that’s a fairly lengthy discussion. What would you add or change?


Designing and Documenting Inter-Process Communication Links

I’ve touched on the subject of inter-process communication previously here and here, but now that I’m charged with doing a lot of this formally I wanted to discuss the process in more detail. I’m sharing this in outline form to keep the verbiage down and level of hierarchical organization up.

The following discussion attempts to outline everything that reasonably could be documented about an inter-process communication link. As a practical matter a shorthand will be worked out over time, especially for elements that are repeated often (e.g., a standard way of performing YAML- or JSON-formatted transfers). The methods associated with the standard practices should be documented somewhere, and then referenced when they are used for an individually documented connection. (A minimal sketch of such a record, captured as a data structure, follows the outline below.)

  • General: Overall context description
    • Message Name: Name / Title of message
    • Description: Text description of message
    • Architecture Diagram: graphical representation that shows context and connections
    • Reason / Conditions: Context of why message is sent and conditions under which it happens; note whether operation is to send or receive
  • Source: Sending Entity
    • Sending System: Name, identifiers, addresses
    • Sending Process / Module / Procedure / Function: Specific process initiating communication
    • System Owner: Developer/Owner name, department/organization, email, phone, etc.
  • Message Creation Procedure: Describe how message gets created (e.g., populated from scratch, modified from another message, etc.)
    • Load Procedure: How elements are populated, ranged, formatted, etc.
    • Verification Procedure: How package is reviewed for correctness (might be analogous to CRC check, if applicable)
  • Payload: Message Content
    • Message Class: General type of message with respect to protocol and hardware (e.g., JSON, XML, IP packet, File Transfer, etc.)
    • Grammar: Structure of message, meaning of YAML/XML tags, header layout, and so on (e.g., YAML, XML, binary structure, text file, etc.)
    • Data Items: List of data items, variable names, types, sizes, and meanings (note language/platform dependency: C++ has hard data types and rules for data packing in structures and objects, JavaScript is potentially more flexible by platform)
      • Acceptable range of values: Important for certain data types in terms of size, content, and values
      • Header: Message header information (if accessible / controlled / modified by sender)
      • Body: Active payload
    • Message Size: Appropriate to messages that have a fixed structure and size (i.e., not applicable to flexibly formatted XML messages)
  • Transfer Process: Receiving/Accessed Entity
    • Access Method: Means of logging in to or connecting with destination system (if applicable)
    • Permissions: passwords, usernames, access codes, lock/mutex methods
    • Reason: why destination system is accessed
  • Destination: Receiving Entity
    • Receiving System: Name, identifiers, addresses
    • Receiving Process / Module / Procedure / Function: Specific process receiving the communication
    • System Owner: Developer/Owner name, department/organization, email, phone, etc.
  • Methodology: How it’s done
    • Procedure Description: List of steps followed
      • Default Steps (happy path): Basic steps (e.g., connect, open, lock, transfer/send, verify, unlock, close, disconnect)
      • Fail / Retry Procedures: What happens when connect/disconnect, open/close, lock/unlock, read/write operations fail in terms of detection (error codes), retry, error logging, and complete failure
      • Error Checking / Verification: How procedure determines whether different operations were performed correctly
        • Pack / Load / Unpack / Unload: check values/ranges, resulting structure format, etc.
        • Send / Receive: check connection / communication status
    • Side Effects / Logs: Describe logging activity and other side effects
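Here is the sketch promised above: the outline captured as a typed record so that every documented link fills in the same fields. The field names simply mirror the outline and are not a formal schema.

```typescript
// Minimal sketch of the outline above captured as a typed record, so every
// documented inter-process link fills in the same fields. The field names
// mirror the outline; this is not a formal schema.
interface Endpoint {
  system: string;                  // name, identifiers, addresses
  processOrFunction: string;       // specific process involved
  owner: { name: string; org: string; contact: string };
}

interface InterfaceSpec {
  general: { messageName: string; description: string; reasonAndConditions: string };
  source: Endpoint;
  creation: { loadProcedure: string; verificationProcedure: string };
  payload: {
    messageClass: string;          // e.g. JSON, XML, IP packet, file transfer
    grammar: string;               // structure, tag meanings, header layout
    dataItems: { name: string; type: string; meaning: string; acceptableRange?: string }[];
    messageSizeBytes?: number;     // only for fixed-structure messages
  };
  transfer: { accessMethod: string; permissions: string; reason: string };
  destination: Endpoint;
  methodology: {
    happyPathSteps: string[];      // e.g. connect, open, lock, send, verify, unlock, close
    failRetryProcedures: string;
    errorChecking: string;
    sideEffectsAndLogs: string;
  };
}
```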

This information is diagrammed below. (I need to add some arrows to show the steps taken but they can be inferred for now.)

The small diagram below shows the full context of one interface, which as a practical matter is a two-way communication where each direction could be fully documented as described above.


A Simple Discrete-Event Simulation: Part 91

I started thinking about the design of processes that require resources and realized there are a lot of ways I could go with this, so I thought I’d back up and think it through in writing.

Let’s begin with a process that requires resources in a scenario where there is a single pool from which any number of identical resources can always be drawn and are available to be used by the process immediately. In this case an entity arrives in the process, the required resources are requested and arrive immediately, and the process can therefore begin immediately. This involves calling the advance function with the process duration as a parameter, which will place a new event in the future events queue. During this time the process component will be in an in-process state. Assuming the process only handles one entity at a time, it will concurrently be in a closed state, meaning no other entities will be able to enter. When the forked process event is processed (i.e., it reaches the front of the future events queue), the resources immediately return to their designated pool, and the process enters a state where it begins trying to forward the entity to a subsequent component.

Now let’s start adding complications. I don’t think it’s possible to address them in a formal logical order because the considerations that arise are intertwined, so I’ll just work through them in the best way I can.

One complication is that the resources may take a finite amount of time to arrive once they are requested. There are two ways to handle this. The transfer delay time can be added to the process time (assuming it can be calculated directly; if the movement of resources is modeled in detail then it is governed by that model) and the transfer and process can be handled as a single event. Alternatively, the transfer delay and process can be handled as sequential, chained events. The bookkeeping functions can record the data in any way that makes sense. When the process is complete, assuming the time for the resources to return to their pool is similarly finite and non-zero, the entity can move directly to its next state or operation and the resources can be processed separately.

The next thing to work out is whether the resources physically return to a central pool to be dispatched in answer to a subsequent request, or whether the resource pool is logical in nature and the resources can go directly to service the next request(s), if any are outstanding.

The pool of resources may have been drawn down so it does not have the number of resources requested (there should be no possibility that a process will request more resources than the maximum quantity the pool is defined to hold). If the entity in the process component must be processed in arrival order before anything else can happen (imagine a pure flow-through model of the type demonstrated to this point), then it will just place the request and stay in place until the resources are obtained. At this point it will have to enter a separate wait state, which means the function for processing the arrival of the entity in the process would have to terminate. The action of starting the clock on the process would have to be kicked off separately once the requested resources are received.

If the entity to be processed does not hold up any other activities (i.e., an entity in a flow-through model that can go to a holding area or secondary process, or an entity like an aircraft that sits in place and requires numerous services, which themselves can be modeled as queuing up to be worked off) then the requests can go into a queue independent of the entity. Any kind of logic can be applied, as long as it is carefully documented and followed.

The next idea to address is the mechanism for determining when the required number of resources become available if they were not available at the time of the request. The brute force method is to place a test event in the current events queue that checks the count of items in the pool after every discrete event item is processed. This has to be done when actions and side-effects and variable values might change unpredictably, but since the conditions under which it makes sense to perform the checks are known, things can be done more efficiently. In this case it means that a list of queued requests can be scanned and serviced whenever resources return to the pool.
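A minimal sketch of that mechanism: requests that cannot be satisfied go into a pending list, and the list is re-scanned whenever resources are returned to the pool. The class and callback names are assumptions for illustration, and transfer delays and prioritization are ignored.

```typescript
// Minimal sketch of a resource pool that queues requests it cannot satisfy
// and re-scans the queue whenever resources are returned. Names are
// illustrative; transfer delays and priorities are ignored here.
interface ResourceRequest {
  quantity: number;
  onGranted: () => void;   // e.g., start the process clock for the waiting entity
}

class ResourcePool {
  private pending: ResourceRequest[] = [];

  constructor(private available: number) {}

  request(quantity: number, onGranted: () => void): void {
    const req = { quantity, onGranted };
    if (!this.tryGrant(req)) this.pending.push(req); // wait until resources return
  }

  release(quantity: number): void {
    this.available += quantity;
    // Scan pending requests in order; stop at the first one that still cannot
    // be satisfied (other policies, like opportunistic out-of-order servicing,
    // could be substituted here).
    while (this.pending.length > 0 && this.tryGrant(this.pending[0])) {
      this.pending.shift();
    }
  }

  private tryGrant(req: ResourceRequest): boolean {
    if (req.quantity > this.available) return false;
    this.available -= req.quantity;
    req.onGranted();
    return true;
  }
}

// Usage: a process requests 4 mechanics; a later request for 6 waits until
// release() returns enough to the pool, then its callback fires.
const mechanics = new ResourcePool(6);
mechanics.request(4, () => console.log("service A started"));
mechanics.request(6, () => console.log("service B started")); // queued
mechanics.release(4); // service A done: 6 available again, service B starts
```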

Resources can represent many things. Parts needed to complete repair or assembly actions may be continually provided by some kind of supply logic. Replacement parts involved in repair actions may be drawn from local shelf stock, drawn from remote stock (shipped in after a long delay), returned from a refurbishment process, or cannibalized from another assembly (e.g., an aircraft waiting on multiple parts and services). Removed parts from repair actions may be discarded or sent to local or remote refurbishment processes. Workers may be available on a schedule that varies by time, illness, and breaks for meals, rest, vacations, training, and so on.

A major distinction between different kinds of resources is whether or not they are consumed during the model run. In a manufacturing assembly process new parts continually enter the system to be affixed to manufactured goods. In a repair process the number of mechanics might be fixed. A parts model in which most damaged parts are successfully refurbished but some are occasionally discarded or new parts are occasionally acquired might have elements of persistence and of flow-through and consumption.

Another obvious thing to consider is the presence of many different resource pools. Still another is the potential to continuously adjust the order in which resource requests are processed to take advantage of opportunities where they can be more fully utilized.

Yet another is the idea that one process may issue several requests for resources in succession, or even simultaneously. For example, mechanics performing a periodic service on an aircraft may need to complete several different tasks as part of the one inspection event, and the tasks themselves may require different numbers of mechanics. Should some of the mechanics go from the pool and work off inspection tasks one after another, never returning to the pool, or should they return to the pool after every task, so requests to service other processes may be interleaved with the inspection tasks? Similarly, should all of the requests be issued at once or should they be issued sequentially in time as each previous task is completed? If the transfer time between the pool and the task is zero then it matters less, but if the transfer times are finite and non-zero then these considerations get more complicated.

The final consideration is how the resources themselves are modeled. The behavior of the moving entities in our examples so far has been driven entirely by the logic built into the components, so that’s how all resources will be modeled as well. The processing may depend on the characteristics of the resources, like it is with other moving entities.

If you can think of any other considerations I’d love to hear about them.


A Simple Discrete-Event Simulation: Part 90

This evening I attended the DC IIBA Meetup on the subject of process mapping, which is obviously an area in which I have some experience. Since I’ve reached a natural break in writing about pure business analysis subjects I’m going to revisit the discrete-event simulation framework, with my first goal being to add a few demos that illustrate some variations that are possible. To that end I’ve been reviewing the existing code and the to-do list and see that I need to add the following sub-item to the list of abstract processes to support:

  • Include ability to represent entities that remain in the system and only change state during the simulation run, rather than entities that merely pass through the system from an entry to an exit. These should be referred to as resident entities. The items might change color or include one or more text labels reflecting their current state (since multiple characteristics may drive several state characteristics in parallel, with multiple state variables making up synthetic state variables of other kinds), or move among locations (components) to represent state (assuming resident items).
  • Include the ability to easily replicate groups of components representing the state of items that remain within the simulation, if that method is adopted. Alternatively, make the resident entities move through a common set of state-representation components.

These capabilities support modeling of items like aircraft in operation and maintenance simulations. They remain within the system at all times while parts may enter and leave the system.

The item defining pools of resources was originally meant to represent staff that performed certain functions, though they could also represent specialized equipment. An example of how this works is to imagine pools of mechanics that are available to service aircraft. This can get quite complex, but let’s work from a simple example to something more complex. Aircraft require a continuing series of scheduled and unscheduled maintenance activities over time, and each one requires a certain number of mechanics for a certain duration. If a group of aircraft simultaneously require service actions that require more mechanics than are available, then the service requests will have to be queued and answered in some kind of order. That is, mechanics will complete a service on one aircraft, return to the pool, and when enough mechanics are back in the pool to fulfill the next service request, they are dispatched to do so.

The next complication is to consider the queue of service events and identify opportunities to service the requests out of order, to take advantage of opportunities to complete services requiring fewer resources (e.g., mechanics and time) while waiting for resources to become available to service a larger request. Imagine a group of four aircraft and six mechanics. One aircraft requires a service involving four mechanics for three hours. Another aircraft then requires a service involving six mechanics for two hours. The second service cannot begin until the first one is completed and the mechanics return to the pool. Now imagine a third service request involving two mechanics for thirty minutes. We might check to see if this one can be completed while the first service is still underway, so the mechanics can be used closer to their full capacity. (I would call this an opportunity service after the idea of an opportunity move the Level 1 engineers worked out for transfer of slabs in two-line tunnel furnaces at Bricmont.) The calculations involved can get quite complicated but you can see where this is going.

Another complication beyond that is that there are several groups of mechanics to service different parts of the aircraft, each subject to the same limitations. By itself this isn’t a problem but there are cases where more resources are required to complete an action than there are resources of the needed type available. That is, there might be a service that requires seven airframe mechanics in a situation where only five are assigned. In these cases the system pulls mechanics from other specialty pools to provide the necessary help. Sometimes the main work is performed and supervised by the relevant specialists while other workers just stand around and hold things and move things and run tools under direction. In that case there has to be logic that determines which supplementary pools to pull from.

Another concept to add is the idea of consuming supplies or parts. This requires the ability to logically merge entities. In an assembly process we might have a main assembly moving along an assembly line and having parts added to it from pools of parts from other processes and locations. Chaining pools of parts together can model a supply chain, and various sizes of those pools can be investigated to determine their optimum size and configuration. Pools of parts can be treated like queues that might be reloaded in bulk rather than one entity at a time. Additionally, we might consider adding parts to larger assemblies based on custom configurations. This is true of automobiles being made with different options and military aircraft being fitted with different equipment in order to perform different missions. In other cases we might want to split elements off from a moving entity, as we do in the bag object, which can be thought of as a parking lot where vehicles park and their occupants get out and go through their own process. These ideas suggest two more items to add to the to-do list.

  • Include ability to merge entities together. (This is somewhat similar to an AND gateway, a diamond with a plus symbol, in BPMN.)
  • Include a flow-through pool component that acts like a queue component but which can potentially be loaded in bulk. The object may also send a request for replenishment when the supply declines to a specified level. There should be signals and warnings based on definable quantity thresholds.

I’ve added the new items to the project page.


Domain Knowledge Acquisition

I’ve touched on this subject briefly before, but I wanted to describe it more directly. The simplest definition of domain knowledge, for me, is: what you need to know to create and apply effective solutions in a specific situation. Acquisition, of course, is the process of gaining that knowledge.

If that seems vague it’s because we have to understand what “situation” we’re actually working in when crafting a solution. When I implemented FileNet-based document imaging solutions for business process re-engineering efforts I might have been working across several competencies at the same time, but what did I actually have to know to do the job?

  • Process Mapping: This comes in two parts. The first involves understanding a process through discovery and data collection. You have to be thorough. You have to be pleasant and engaging to get people to tell you everything you need to know. You have to recognize the important information when you hear it. The second is creating the requisite documentation using some method, which these days more or less always involves knowing how to use one or more complicated software programs. You may or may not need to know how to do that yourself, but someone on your team does.
  • Business Case Analysis: This could also be called Cost-Benefit Analysis. It involves calculating the difference between the benefits of a proposed change and the cost of implementing it. If you can do these calculations in terms of money it makes things easier, but other factors may be involved. If the costs and benefits are readily identifiable it takes less expertise to run the numbers, but more complex assessments may take additional experience and even entrepreneurial judgment.
  • Document Analysis: For a process that involves many thousands of pieces of paper a day it’s important to be able to classify them and identify the relevant information on them. Our customer had developed an ingenious manual filing system that collated the arriving documents into custom-made folders that clasped the received documents in three easy-to-thumb-through collections so they were able to give us guidance on the sorting system they used. Without this guidance we would have had to do a lot more work to understand the documents and create our own sorting criteria.
  • Document Imaging Systems: The FileNet system included hardware components for scanning documents and then storing the images (using a robotic system of removable optical disk cartridges that automatically optimized the storage locations of the cartridges to minimize the travel time of the robotic mechanism based on the frequency of access of the items stored on them), and a WorkForce software component that ran on a network of PCs connected to a central server that contained the database information on each perspective customer and each employee, as well as links to the related documents.
  • Computer Programming: General: Implementers had to come to this work already knowing how to program computers. The FileNet WorkFlow language was obscure enough that it was never the first language anyone learned.
  • Computer Programming: FileNet WorkFlow language: When I used it the WorkFlow language was a loose combination of Pascal and BASIC with special functions to interface with the dedicated scanning and storage hardware, document and item queues, and mainframe access terminals. FileNet was in the process of creating a version called Visual WorkFlow that allowed the programmers to graphically define parts of the flow of documents, information, and work items within the system, but I never saw that in action.
  • User Interface Design: This is a whole field on its own, with a lot of specialized knowledge. By the time I worked on this I had been programming user interfaces graphically and in text for a number of years. I had even written my own text-based, mouse-driven windowing framework several years before. The basics aren’t that difficult to learn and can be picked up in a couple of weeks of initial training.
  • Database Schema Design: This is also a field that can get as complex as you’d like, but the schemas for most applications are pretty simple; it’s what you do with the data that makes it work. The FileNet system used an Oracle back end, but the WorkFlow programmers didn’t need to know anything about Oracle per se to make use of its capabilities.
  • Mainframe Data Interfacing: There are many ways to get computers to talk to each other but this one was unique–it used a process called screen scraping.
  • Automated Microsoft Office Document Creation: Some Microsoft (and other) products included the ability to be manipulated by other programs using an interface protocol called Dynamic Data Exchange (DDE). Using this capability required an understanding of the controlled product, content management principles, and programming in general.
  • Medical Records Processing: This is a potentially huge field by itself. It includes considerations like security, procedure coding, and the use of specific medical language. If something isn’t understood then intelligent questions have to be asked. Our customer had a process for this, where questions were written up for certain files and routed to specialists who would call the physicians for clarification. (Since the people making the calls were not trained doctors themselves, the physicians they called would sometimes get highly annoyed.)
  • Insurance: Disability: The insurance field covers many specific areas, and disability is just one type of insurable risk.
  • Insurance: Underwriting: The decision to grant insurance coverage to a potential customer, and the rates to charge, are the result of a complex process of evaluation and calculation. The cost of underwriting and servicing a new pool also has to cover the cost of evaluating every candidate pool, whether or not coverage is ultimately granted.
  • Insurance: Actuarial Calculations: A major component of providing insurance is the ability to evaluate the occurrence rate of specified outcomes across large populations. The training required to do this work is extensive, and requires up to ten individual certification exams and a heavy mathematical background.
  • Insurance: Investment Portfolio Management: Insurance companies must manage a lot of capital as it’s being held against the need to pay benefits over time. My father was a fixed income institutional investor for decades. That specialty has a lot of overlap with what I do and requires a similar personality type.
  • Insurance: Claims Processing: This involves several components including communication, submission and reply processes, bank transfers, tax considerations, procedure and condition coding, and so on.

It should be clear that knowledge of some of these subject matter areas was required to automate the paper-handling aspects of the re-engineering process, especially as that project was handled in two phases. In the first phase we primarily used the skills of process mapping (with light data collection) and business case analysis, and to a lesser extent our experience with document analysis and document imaging systems. Once we won the right to do the implementation the detailed computer skills came into play. The work might have touched on insurance-specific knowledge when supporting user processes (like the scoring of individual files and pools thereof) during the implementation phase, where screens had to be designed and calculations supported that allowed the underwriting department to perform its evaluations. In those cases the investigators and implementers can learn what they need to know from the subject matter experts they’re working with. The important thing to recognize is that for the most part the process mapping we did involved processes that anyone could analyze and understand. It required exactly zero special knowledge or training to understand how pieces of paper were received, collated, moved, and reviewed, so the real domain we were operating in wasn’t insurance, per se, it was document processing, which was our area of expertise.

This is relevant to me when recruiters and companies ask for specific domain knowledge. I think they often view the problem too narrowly in terms of industries. Some skills, like discovery, data collection, process mapping, and process improvement (in terms of the arrangement of steps), translate to any situation, as I believe I have more than demonstrated. Other skills, like actuarial science, thermodynamics (and calculus), and specific elements of computing, are more specialized and are far more difficult to learn from subject matter experts on the fly. Many situations fall in the middle of this spectrum. For example, I knew nothing about border operations when I started analyzing and simulating them, but I had all the general skills I needed to understand and analyze them. Did I get better as I gained more experience and had seen dozens and dozens of ports? Sure, but that didn’t mean I couldn’t hit the ground running.

Domain knowledge is acquired in two major ways. One is through specific skills training and industry experience. The other is through being able to learn from subject matter experts on the ground. If the business or process analysts are sharp, communicate well, and can establish good rapport with such SMEs, they can implement solutions under the guidance of the SMEs with the proper feedback and direction. I discuss this feedback here. The effort is always cooperative, and learning how to learn from and engage with your customers is critical.

Posted in Tools and methods | Tagged , , | Leave a comment

Advanced Manufacturing Process Integration Ideas

I’ve seen a lot of different things over the years and they cannot help but suggest possibilities for ways things can be done in a more streamlined, integrated way. Of particular interest to me have been the many novel ways of combining and integrating ideas to enhance the modern manufacturing process. What’s odd is that many of these ideas are in use in different places, but not in all places, and there isn’t any place that uses all of them.

Let me walk through the ideas as they built up in chronological order.

My first engineering job was as a process engineer in the paper industry, where I used my knowledge of the many characteristics of wood pulp and its processing to produce drawings and control systems for new plants, perform heat and material balances to help size equipment to achieve the desired production and quality targets, and perform operational audits to see whether our installed systems were meeting their contractual targets and to suggest ways to improve the quality of existing systems (I won a prize at an industry workshop for the best suggestions for improving a simulated process). I started writing utilities to automate the many calculations I had to do and used my first simulation package, a product called GEMS.
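
To give a flavor of what those heat and material balances look like, here is a minimal sketch of a mixing-point balance in Python. The streams, temperatures, and specific heats are numbers I’m inventing purely for illustration; a real balance covers the whole process and many more effects.

```python
# Minimal sketch of a mixing-point heat and material balance of the kind used
# to size equipment in pulp and paper work. Stream values are hypothetical.

def mix_streams(streams):
    """Combine streams and return total mass flow and mixed temperature.

    Each stream is (mass_flow_kg_s, temperature_C, specific_heat_kJ_kgK).
    Assumes no heat loss and no phase change at the mixing point.
    """
    total_mass = sum(m for m, _, _ in streams)
    total_enthalpy = sum(m * cp * t for m, t, cp in streams)
    total_mcp = sum(m * cp for m, _, cp in streams)
    return total_mass, total_enthalpy / total_mcp

if __name__ == "__main__":
    # Hypothetical example: dilute hot pulp stock with cooler white water.
    stock = (12.0, 75.0, 4.0)        # kg/s, deg C, kJ/(kg*K)
    white_water = (30.0, 55.0, 4.18)
    mass, temp = mix_streams([stock, white_water])
    print(f"Combined flow: {mass:.1f} kg/s at {temp:.1f} C")
```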

I traveled to a number of sites in far-flung parts of the U.S. and Canada (from Vancouver Island to Newfoundland to Mississippi) and met some of the control system engineers. I didn’t quite know what they did but they seemed to spend a lot more time on site than I did. I also met traveling support engineers from many other vendors. The whole experience gave me a feel for large-scale manufacturing processes, automated ways of doing things, and how to integrate and commission new systems.

My next job involved writing thermo-hydraulic models for nuclear power plant simulators. I used a Westinghouse tool called the Interactive Model Builder (IMB), which automated the process of constructing fluid models that always remained in the liquid state. Any parts of a system that didn’t stay liquid had to be modeled using hand-written code. I was assigned increasingly complex models and, since time on the simulator was scarce and the software tools were unwieldy, I ended up building my own PC-based simulation test bed. I followed that up by writing my own system for cataloging all the components for each model, calculating the necessary constants, and automatically writing out the requisite variables and governing equations.
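
As a rough illustration of that cataloging-and-code-generation idea (not a reconstruction of the actual IMB tooling or of my own system), here is a sketch that takes a small catalog of hypothetical pipe components, computes a lumped conductance constant for each, and emits variable declarations and governing equations as text. The component names and the emitted syntax are invented.

```python
# Hypothetical reconstruction of the idea of cataloging flow components and
# auto-writing the variables and governing equations for each one. This is
# not the Westinghouse IMB tool; names and the emitted syntax are invented.

import math

PIPE_CATALOG = [
    # name, diameter_m, length_m, friction_factor
    ("FW_HDR_01", 0.30, 25.0, 0.02),
    ("FW_HDR_02", 0.25, 40.0, 0.02),
]

def conductance(diameter, length, friction):
    """Crude turbulent-flow conductance constant for w = C * sqrt(dP)."""
    area = math.pi * diameter ** 2 / 4.0
    resistance = friction * length / diameter      # lumped Darcy-style K
    return area * math.sqrt(2.0 * 1000.0 / max(resistance, 1e-6))  # rho folded in

def emit_model(catalog):
    lines = []
    for name, d, l, f in catalog:
        c = conductance(d, l, f)
        lines.append(f"REAL W_{name}, DP_{name}")   # generated variable declarations
        lines.append(f"W_{name} = {c:.3f} * SQRT(ABS(DP_{name})) * SIGN(DP_{name})")
    return "\n".join(lines)

if __name__ == "__main__":
    print(emit_model(PIPE_CATALOG))
```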

I envisioned expanding the above systems to automatically generate code, and actually started working on it, but that contract ended. Ideally it would have included a CAD-like interface for making the drawings. Right-clicking on the drawing elements would allow the designer to set the properties of the elements, the initial conditions, special equations, and so on. The same interface would display the drawing with live state information while a simulation was running, and the elements could still be clicked to inspect the real-time values in more detail. From a magazine ad for a Spice simulation package I got the idea that a real-time graph could also be attached to trace the value of chosen variables over time, right on the diagram.

My next job was supposed to be for a competitor of Westinghouse in the simulator space. I went to train in Maryland for a week and then drove down to a plant in Huntsville, AL, where I and a few other folks were informed that it was all a big mistake and we should go back home. I spent the rest of that summer updating the heating recipe system for a steel furnace control system.

My next job was with a contracting company that some of my friends at Westinghouse had worked for. Their group did business process reengineering and document imaging using FileNet (which used to be an independent company but is now owned by IBM). While doing a project in Manhattan I attended an industry trade show at the Javits Center. During that day all of the exhibits gave me the idea that a CAD-based GUI could easily integrate document functions like the FileNet system did. Instead of clicking on system elements to define and control a simulation, why not also click on elements to retrieve and review equipment and system documentation and event logs? Types of documentation could include operator and technical manuals, sales and support and vendor information, historical reports, and so on. Maintenance, repair, and consumption logs could be integrated as well. I’d learn a lot more about that at my next job.

Bricmont’s main business was building turnkey steel reheat furnace systems. There were specialists who took care of the steel, refractory, foundations, hydraulics, and so on, and when it came to the control systems the Level 1 guys did the main work: PLCs and PC-based HMI systems running Wonderware and the like. As a Level 2 guy I wrote model-predictive thermodynamic and material handling systems for real-time supervisory control. Here I finally learned why the control system guys spent so much time in plants (and kept right on working at the hotel, sometimes for weeks on end). I once completed a 50,000-line system in three months, with the last two months consisting of continuous 80-hour weeks. I also learned many different communication protocols, data logging and retrieval, and the flow of material and data through an entire plant.

Model-predictive control involves simulating a system into the future to see if the current control settings will yield the desired results. This method is used when the control variable(s) cannot be measured directly (like the temperature inside a large piece of steel), and when the system has to react intelligently to changes in the pace of production and other events.
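
Here is a minimal sketch of what that look-ahead logic amounts to, using a deliberately crude one-lump heating model and invented numbers; it is not an actual furnace model, just the shape of the predict-then-adjust loop.

```python
# Minimal sketch of model-predictive control logic: simulate the piece forward
# under the current setpoint and nudge the setpoint if the predicted result
# misses the target. The one-lump thermal model and all numbers are
# hypothetical simplifications, not an actual furnace model.

def predict_discharge_temp(piece_temp, furnace_temp, minutes_remaining,
                           k_per_min=0.02):
    """First-order lumped heating: piece temperature relaxes toward furnace."""
    t = piece_temp
    for _ in range(int(minutes_remaining)):
        t += k_per_min * (furnace_temp - t)
    return t

def adjust_setpoint(piece_temp, setpoint, minutes_remaining,
                    target=1200.0, tolerance=10.0, step=5.0):
    """Raise or lower the furnace setpoint based on the predicted outcome."""
    predicted = predict_discharge_temp(piece_temp, setpoint, minutes_remaining)
    if predicted < target - tolerance:
        return setpoint + step      # running cold: heat harder
    if predicted > target + tolerance:
        return setpoint - step      # running hot: back off
    return setpoint                 # prediction is within tolerance

if __name__ == "__main__":
    sp, piece = 1250.0, 900.0
    for minutes_left in range(60, 0, -5):              # re-plan every 5 minutes
        sp = adjust_setpoint(piece, sp, minutes_left)
        piece = predict_discharge_temp(piece, sp, 5)   # advance 5 real minutes
        print(f"{minutes_left:3d} min left -> setpoint {sp:.0f} C, piece {piece:.0f} C")
```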

The architecture of my Level 2 systems, which I wrote myself for PC-based systems and with a comms guy on DEC-based systems, looked roughly like this, with some variations.

This fit into the wider plant system that looked roughly like this:

The Level 1 layer was closest to the hardware and provided most of the control and operator interaction. The Level 2 layer incorporated supervisory controls that involved heavy calculation. Control room operators typically interacted with Level 2 systems through the Level 1 interface, but other engineers could monitor the current and historical data directly through the Level 2 interface. Operational data about every batch and individual workpiece was passed from machine to machine, and higher-level operating and historical data was passed to the Level 3 system, which optimized floor operations and integrated with what they called the Level 4 layer, which incorporated the plant’s ordering, shipping, scheduling, payroll, and other functions. This gave me a deep understanding of what can be accomplished across a thoughtfully designed, plant-wide, integrated system.
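
As a sketch of the kind of per-piece information that travels with the material and the kind of summary that gets pushed upward, here is a simple example; the field names and grades are hypothetical, and real systems carry far more detail.

```python
# Sketch of the kind of per-piece record a Level 2 system might keep and the
# summary it might pass up to Level 3. Field names are hypothetical; real
# systems carry far more detail (chemistry, heating practice, alarms, etc.).

from dataclasses import dataclass, field
from statistics import mean
from typing import List

@dataclass
class PieceRecord:
    piece_id: str
    grade: str
    charge_temp_c: float
    discharge_temp_c: float = 0.0
    zone_temps_c: List[float] = field(default_factory=list)  # samples per zone pass

@dataclass
class Level3Summary:
    pieces_processed: int
    avg_discharge_temp_c: float
    out_of_spec: int

def summarize(pieces: List[PieceRecord], target_c: float, tol_c: float) -> Level3Summary:
    discharges = [p.discharge_temp_c for p in pieces]
    bad = sum(1 for t in discharges if abs(t - target_c) > tol_c)
    return Level3Summary(len(pieces), mean(discharges), bad)

if __name__ == "__main__":
    batch = [
        PieceRecord("SLAB-001", "A36", 25.0, discharge_temp_c=1198.0),
        PieceRecord("SLAB-002", "A36", 25.0, discharge_temp_c=1215.0),
    ]
    print(summarize(batch, target_c=1200.0, tol_c=10.0))
```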

I also maintained a standardized control system for induction melting furnaces that were installed all over the world. That work taught me how to internationalize user interfaces in a modular way, and to build systems that were ridiculously flexible and customizable. The systems were configured with a simple text file in a specified format. While I was maintaining this system I helped a separate team design and build a replacement software system using better programming tools and a modern user interface.
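
The actual file format isn’t reproduced here, but the following hypothetical key=value layout illustrates how a plain text file can drive both configuration values and internationalized UI strings.

```python
# Hypothetical illustration of driving configuration and UI language strings
# from a plain text file. The real format used on the induction furnace
# systems isn't reproduced here; this key=value layout is invented.

SAMPLE_CONFIG = """\
language=de
furnace.count=2
furnace.1.capacity_kg=5000
label.start_melt.en=Start Melt
label.start_melt.de=Schmelze starten
"""

def parse_config(text):
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip()
    return cfg

def label(cfg, name):
    """Look up a UI label in the configured language, falling back to English."""
    lang = cfg.get("language", "en")
    return cfg.get(f"label.{name}.{lang}", cfg.get(f"label.{name}.en", name))

if __name__ == "__main__":
    cfg = parse_config(SAMPLE_CONFIG)
    print(label(cfg, "start_melt"))   # -> "Schmelze starten"
```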

My next job involved lower-level building controls and consisted mostly of coding and fixing bugs. It was a different take on things I’d already seen. I suppose the main thing I learned there was the deeper nature of communications protocols and networking.

My next job, at Regal Decision Systems, gave me a ton of ideas. I built a tool on my own for modeling the operation of medical offices. I learned discrete-event simulation while doing this, where up to that point I had been doing continuous simulations only. I also included the ability to perform economic calculations so the monetary effects of proposed changes could be measured. The company developed other modeling tools for airports, building evacuations, and land border crossings, some of which I helped design, some of which I documented, and many of which I used for multiple analyses. I visited and modeled dozens of border facilities all over the U.S., Canada, and Mexico, and the company built custom modeling tools for each country’s ports of entry. They all incorporated a CAD-based interface and I learned a ton about performing discovery and data collection. Some of the models used path-based movement and some of them used grid-based movement. The mapping techniques are a little different but the underlying simulation techniques are not.

The path-based model included a lot of queueing behaviors, scheduling, buffer effects, and more. All of the discrete-event models included Monte Carlo effects and were run over multiple iterations to generate a range of possible outcomes. This taught me how to evaluate systems in terms of probability. For example, a system might be designed to ensure no more than a ten-minute wait time on eighty-five percent of the days of operation.
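
Here is a small sketch of how a criterion like that gets evaluated: run many replications of a simple queueing day, then check what fraction of simulated days stay under the wait-time limit. The arrival and service parameters are invented for illustration.

```python
# Sketch of evaluating a design criterion like "no more than a ten-minute wait
# on 85% of operating days" from Monte Carlo replications of a single-server
# queue. Arrival and service parameters are hypothetical.

import random

def simulate_day(arrival_rate_per_min, mean_service_min, minutes=480):
    """Return the worst wait (minutes) seen during one simulated day."""
    t, server_free_at, worst_wait = 0.0, 0.0, 0.0
    while t < minutes:
        t += random.expovariate(arrival_rate_per_min)   # next arrival
        wait = max(0.0, server_free_at - t)
        worst_wait = max(worst_wait, wait)
        server_free_at = max(server_free_at, t) + random.expovariate(1.0 / mean_service_min)
    return worst_wait

def meets_criterion(days=1000, limit_min=10.0, required_fraction=0.85):
    good_days = sum(
        1 for _ in range(days)
        if simulate_day(arrival_rate_per_min=0.5, mean_service_min=1.5) <= limit_min
    )
    return good_days / days >= required_fraction

if __name__ == "__main__":
    random.seed(42)
    print("Criterion met:", meets_criterion())
```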

I learned a lot about agent-based behaviors when I specified the parameters for building evacuation simulations. I designed the user interfaces to control all the settings for them and I implemented communications links to initialize and control them. Some of the systems supported the movement of human-in-the-loop agents that moved in the virtual environment (providing a threat or mitigating it and providing security and crowd control) among the automated evacuees.

At my next job I reverse-engineered a huge agency staffing system that calculated the staffing needs based on activity counts at over 500 locations across the U.S. and a few remote airports. I also participated in a large, independent VV&A of a complex, three-part, deterministic simulation designed to manage entire fleets of aircraft through their full life cycle.

The main thing I did, though, was support a simulation that performed tradespace analyses of the operation and logistic support of groups of aircraft. The model considered flight schedules and operating requirements, the state and mission capability of each aircraft, reliability engineering and the accompanying statistics and life cycles, the availability of specialized support staff and other resources, scheduled and unscheduled maintenance, repair-or-replace decisions, stock quantities and supply lines, and cannibalization. The vendor of the programming language in which the simulation was implemented stated that this was the largest and most complex model ever executed using that language (GPSS/H; the control wrapper and UI were written in C#). I learned a lot about handling huge data sets, conditioning noisy historical input data, Poisson functions and probabilistic arrival events, readiness analysis, and more.
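
As a small example of the arrival-event side of that work, here is a sketch that conditions some hypothetical (and deliberately noisy) historical hourly counts into a rate and then generates Poisson arrivals from it.

```python
# Sketch of turning historical arrival counts into a Poisson arrival process
# for a simulation: estimate an hourly rate, then sample exponential
# interarrival times. The historical counts shown are hypothetical.

import random

HISTORICAL_HOURLY_COUNTS = [4, 7, 5, 0, 6, 5, 38, 6, 4]   # 38 looks like bad data

def conditioned_rate(counts, outlier_factor=3.0):
    """Drop gross outliers, then return the mean arrivals per hour."""
    baseline = sorted(counts)[len(counts) // 2]             # median as baseline
    kept = [c for c in counts if c <= outlier_factor * max(baseline, 1)]
    return sum(kept) / len(kept)

def generate_arrivals(rate_per_hour, hours):
    """Yield arrival times (in hours) for a Poisson process of the given rate."""
    t = 0.0
    while True:
        t += random.expovariate(rate_per_hour)
        if t > hours:
            return
        yield t

if __name__ == "__main__":
    random.seed(7)
    rate = conditioned_rate(HISTORICAL_HOURLY_COUNTS)
    arrivals = list(generate_arrivals(rate, hours=2))
    print(f"Rate ~{rate:.1f}/hr, first arrivals: {[round(a, 2) for a in arrivals[:5]]}")
```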

More recently I’ve been exploring more ideas about how all these concepts could be combined in advanced manufacturing systems as follows.

Integration Ideas

  • A common user interface could be used to define system operation and layout. It can be made available through a web-based interface to appropriate personnel throughout an organization. It can support access to the following operations and information:
    • Real-time, interactive operator control
    • Real-time and historical operating data
    • Real-time and historical video feeds
    • Simulation for operations research and operator training
    • Operation and repair manuals and instructions (document, video, etc.)
    • Maintenance events and parts and consumption history
    • Vendor and service information
    • Operational notes and discoveries from analyses and process-improvement efforts
    • Notification of scheduled maintenance
    • Real-time data about condition-based maintenance
    • Management and worker dashboards on many subjects
  • Information about the items processed during production can also be captured, as opposed to information about the production equipment. In the steel industry my systems recorded the heating history and profile of every piece of steel as well as the operation of the furnace. In the paper industry, plants adopted this kind of tracking later (it’s more difficult to do because the measurements involve a huge amount of manual lab work), but they treat the knowledge like gold and hold it in high secrecy. Depending on the volume of items processed the amounts of data could get very large.
    • Most six sigma data is based on characteristics of the entities processed. This can be collated with equipment operating data to determine root causes of undesirable variations. The collection of this data can be automated where possible, and integrated intelligently if the data must be collected manually.
  • Model-predictive simulation could be used to optimize production schedules.
  • Historical outage and repair data can be used to optimize buffer sizes and plan production restarts (see the sketch after this list). It can also be used to plan maintenance activities and provision resources.
  • Look-ahead planning can be used to work around disruptions to just-in-time deliveries and manage shelf quantities.
  • All systems can be linked using appropriate, industry-standard communications protocols and as much data can be stored as makes sense.
  • Field failures of numbered and tracked manufactured items (individual items or batches) can be analyzed in terms of their detailed production history. This can help with risk analysis, legal defense, and warranty planning.
  • Condition-based maintenance can reduce downtime by extending the time between scheduled inspections and also preventing outages due to non-graceful failures of equipment.
  • Detailed simulations can improve the layout and operation of new production facilities in the design stage. Good 3D renderings can allow analysis and support situational awareness that isn’t possible through other means.
  • Economic calculations can support cost-benefit analyses of proposed options and changes.
  • Full life cycle analysis can aid in long-term planning, vendor relationships, and so on.
  • Augmented reality can improve the speed and quality of maintenance procedures.
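
To make the buffer-sizing bullet above a bit more concrete, here is a sketch that picks a buffer size covering a chosen percentage of historical outage durations; the outage data and consumption rate are invented for illustration.

```python
# Sketch of using historical outage durations to size an inter-stage buffer,
# as mentioned in the outage/repair bullet above. The goal: pick a buffer (in
# minutes of downstream production) that covers some percentage of historical
# outages. Outage durations and the consumption rate are hypothetical.

OUTAGE_MINUTES = [3, 5, 4, 12, 7, 45, 6, 9, 5, 22, 4, 8, 15, 6, 90, 5]

def buffer_for_coverage(outages, coverage=0.95):
    """Smallest buffer (minutes of production) covering `coverage` of outages."""
    ranked = sorted(outages)
    index = min(len(ranked) - 1, int(round(coverage * (len(ranked) - 1))))
    return ranked[index]

if __name__ == "__main__":
    minutes = buffer_for_coverage(OUTAGE_MINUTES, coverage=0.95)
    rate_per_min = 2.5                          # hypothetical downstream consumption
    print(f"Buffer: ~{minutes} min of production = {minutes * rate_per_min:.0f} pieces")
```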

The automotive and aircraft industries already do many of these things, but they make and maintain higher-value products in large volumes. The trick is to bring these techniques to smaller production facilities producing lower-value products in smaller volumes. The earlier these techniques are considered for inclusion, the more readily and cheaply they can be leveraged.

Posted in Tools and methods | Tagged , , | Leave a comment

Follow-up: A Way To Use Jira for Everything?

Today I continue yesterday’s discussion of ways to implement an RTM and my framework using Jira, except this time I want to talk about using Jira for every phase of an effort and not just for implementation and testing.

The figure below depicts my approach for reviewing the findings and outputs of each project phase with the customer before moving on, and for going back to previous phases to add concepts that might have been missed in support of new items identified in the current phase, so that complete two-way linking of all items is always maintained. Given that approach, I might ask why all items in every phase might not be handled in Jira as action items. There are a couple of possibilities.

Spawn New Issues from Post-Functions

An example of this was given in the course, where completed implementation items automatically spawned test items in a separate project. This is accomplished by specifying a post-function that creates a new issue during a transition of the existing issue, usually a transition to a Done state, though it doesn’t have to be. Indeed, it might be better if it isn’t: instead, it might be a good idea to let the spawned issue or issues run to their own completion and then have them activate a transition to Done for the initial issue in some way.

One problem with this approach is that Jira doesn’t seem to include a create-new-issue post-function natively, although this capability appears to be available as an add-on from multiple vendors. An additional complication is spawning multiple follow-on issues in the next phase from a single existing issue. A final question is where to create the newly spawned issues. Do we put them in a new project? Do we create them in a separate swim lane? Do we create them in the same Jira project at the same level but with a different phase label?
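
As a rough illustration of what an external automation could do in lieu of a marketplace add-on, here is a sketch that uses Jira’s standard REST endpoints to create a follow-on issue and link it back to the originating one. The base URL, credentials, project keys, labels, and link type name are placeholders, and you’d want to verify the details against your own instance before relying on any of it.

```python
# Sketch of an external automation that spawns a follow-on test issue when an
# implementation issue is done, using Jira's REST API (v2 issue and issueLink
# endpoints). The base URL, credentials, project keys, and link type name are
# placeholders; adjust to your instance, and verify the link type exists.

import requests

JIRA = "https://jira.example.com"                 # placeholder
AUTH = ("automation-user", "api-token-or-pass")   # placeholder credentials

def spawn_follow_on(done_issue_key, target_project="TEST", phase_label="phase-test"):
    # Create the follow-on issue in the next phase's project.
    create = requests.post(
        f"{JIRA}/rest/api/2/issue",
        auth=AUTH,
        json={"fields": {
            "project": {"key": target_project},
            "summary": f"Test work spawned from {done_issue_key}",
            "issuetype": {"name": "Task"},
            "labels": [phase_label],
        }},
    )
    create.raise_for_status()
    new_key = create.json()["key"]

    # Link it back to the originating issue so two-way traceability holds.
    link = requests.post(
        f"{JIRA}/rest/api/2/issueLink",
        auth=AUTH,
        json={"type": {"name": "Relates"},
              "inwardIssue": {"key": done_issue_key},
              "outwardIssue": {"key": new_key}},
    )
    link.raise_for_status()
    return new_key

if __name__ == "__main__":
    print(spawn_follow_on("IMPL-123"))
```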

Another approach is to add an ending transition for each item where it is sent to a specific team member (or subgroup) that is charged with creating the required follow-on issues manually. That requires a high degree of auditing by management personnel (ScrumMaster, Project Manager, or Transition Specialist).

It’s also possible to create sub-tasks, though I don’t think those flow across different phases very well.

Where Do We Put Newly Created Issues?

As asked above, where do we put issues from each phase of an effort? My framework defines seven possible phases, though that can be modified somewhat (for example, the Conceptual Model Phase could be split into a Discovery phase and a Data Collection Phase, while the Acceptance phase might be omitted).

If we put all of the items into a single project on a single board we’d have to identify items in each phase using some kind of label and then use a filter to view only the items we desire in each phase. This is a lot of work, and the filtered views don’t provide much awareness across the whole project; that said, trying to show everything in a whole project at once might get pretty crazy.
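
For what it’s worth, the label-and-filter approach boils down to JQL queries like the one in this sketch; the project key, label names, and connection details are placeholders.

```python
# Sketch of pulling just one phase's items out of a single shared project by
# filtering on a phase label with JQL. Project key, label name, base URL, and
# credentials are placeholders.

import requests

JIRA = "https://jira.example.com"
AUTH = ("automation-user", "api-token-or-pass")

def issues_in_phase(project_key="DOC", phase_label="phase-conceptual"):
    response = requests.get(
        f"{JIRA}/rest/api/2/search",
        auth=AUTH,
        params={
            "jql": f'project = {project_key} AND labels = "{phase_label}" ORDER BY created',
            "fields": "summary,status",
        },
    )
    response.raise_for_status()
    return [(i["key"], i["fields"]["summary"]) for i in response.json()["issues"]]

if __name__ == "__main__":
    for key, summary in issues_in_phase():
        print(key, summary)
```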

In short, I think it’s probably simpler to manage items up through the design stage in separate-but-linked documents, with items possibly referenced in an overarching table of some kind, and handle the implementation and test operations as individual items within Jira workflows. This is how we operated on a large IV&V project I worked on for quite a while. The team managed an overarching document that included several intended use items while referring to separate documents for the requirements and conceptual model, along with a boatload of other materials. A separate development team tracked their own activities through implementation and test using one or more external work item management tools. Artifacts from those were referenced in the overarching tabular document.

When To Keep Things Simple

The foregoing discussion applies to efforts of a certain size and formality. Smaller efforts can be run on a more ad hoc basis using whatever techniques you’d like, including Scrum. In this case the discovery and conceptual phases can be folded into the requirements and implementation phases, and requirements and To Do items can be dumped right into Jira and processed from there. Managing a full RTM imposes a nontrivial cost, so you should be sure to incur that formal overhead only as it’s needed. It’s still a good idea to write out formal requirements, discoveries, and As-Is/To-Be diagrams and documents if possible.

Posted in Management, Tools and methods | Tagged , , , , | Leave a comment