BA Webinar Series 02: Engagement Phases In Detail

Today I gave this webinar for the Tampa IIBA Lunch and Learn Series. The slides are here.

BA Webinar Series 01: Requirements Traceability Matrices

Today I gave this webinar for the Tampa IIBA Lunch and Learn Series. The slides are here.

Richard Frederick Lunch & Learn Webinar Series for Tampa IIBA

I got to know Richard Frederick when he gave a series of twenty lunchtime webinars for the Tampa chapter of the IIBA. In the interest of making his excellent material available as widely as possible, I am sharing, with his permission, links to all of the relevant YouTube recordings and PDF files. They are inspiring me to create my own series, which will begin shortly.

Mr. Frederick provides highly effective training to corporate clients in the areas of Agile processes, Business Analysis, Data Analysis, and more. I recommend that you engage his services at your earliest convenience!

Week 1 August 5, 2020
Dashboards and Robots: Data Mining and Machine Learning
video PDF

Week 2 August 14, 2020
Structured Language Requirements: Structured English and Structured Query Language (SQL)
video PDF

Week 3 August 21, 2020
Requirements and Testing: Understanding the 4 Quadrants of Testing
video PDF

Week 4 August 28, 2020
Natural Language Processing: Semantics, Morphology, Syntax, Linguistics
video PDF

Week 5 September 4, 2020
Semantics of Business Vocabulary and Rules: Dictionaries and Rule Books
video PDF

Week 6 September 11, 2020
INCOSE: Rules for Writing Requirements
video PDF

Week 7 September 18, 2020
IEEE EARS: The Easy Approach to Requirements Syntax
video PDF

Week 8 September 25, 2020
Waterfall and Agile Assembly Methods
video PDF

Week 9 October 2, 2020
Do More with Less: The 2 Steps to Efficiency
video PDF

Week 10 October 9, 2020
The Ten “V’s” of Big Data: Understanding Data Risk
video PDF

Week 11 October 16, 2020
Five Steps to Automation: Getting Organized
video PDF

Week 12 October 23, 2020
The Numbers Game: Non-Financial & Financial Accounting
video PDF

Week 13 October 30, 2020
Data Transformation: Understanding Extract Transform Load (ETL)
video PDF

Week 14 November 6, 2020
Project Estimating: Twenty-Six Sprints
video PDF

Week 15 November 13, 2020
Agile Portfolio Management
video PDF

Week 16 November 20, 2020
Online Transaction and Analytical Processing
video PDF

Week 17 December 4, 2020
Earned Value EZ
video PDF

Week 18 December 11, 2020
Demand Management and the Efficient Frontier
video PDF

Week 19 December 18, 2020
Machine Learning EZ: Unsupervised & Supervised Learning
video PDF

Week 20 January 8, 2021
Data Storytelling EZ: Graphs, Charts, and Diagrams
video PDF

Added IIBA-CBDA Certification Today

Today I sat for and passed the Certification in Business Data Analytics (CBDA) exam issued by the International Institute of Business Analysis (IIBA). It is the newest of the five active certifications I maintain. I sought this certification, as I have previously, simply to communicate the level of skill I’ve developed over the course of my career.

I signed up for the exam yesterday morning, read the 192-page Guide to Business Data Analytics the IIBA provides as a PDF (behind its paywall), and sat for it this evening. I mostly counted on my experience to achieve good scores in the six knowledge areas, and that proved to be enough. The knowledge areas are:

Identify the Research Questions – 20% (Domain 1)
Source Data – 15% (Domain 2)
Analyze Data – 16% (Domain 3)
Interpret and Report Results – 20% (Domain 4)
Use Results to Influence Business Decision Making – 20% (Domain 5)
Guide Organization-level Strategy for Business Analytics – 9% (Domain 6)

Since I’ve been working with large volumes of data in many contexts over a long period of time, most of the general questions were tractable. Where I felt weakest was on questions about the myriad specific diagram types that are mentioned in the guide but not discussed in detail. For future reference I’ve listed the specific tools and diagrams I found mentioned in the guide. Many of these are common and straightforward, but it would have been a better idea to review some of them in more detail, particularly those that are actually discussed near the end of the PDF, if only briefly (e.g., with an example accompanied by one or two short paragraphs).

Activity Diagram
Architectural Diagram
Area Chart
Autocorrelation Plots (Box-Jenkins method)
Bar Chart
Box Plot
Box and Whisker Plot
Box-Cox Plots / Normal Transformation
Bullet Chart
Business Model Canvas
Butterfly Chart
Chord
Contour Plot
Correlation Map
Cumulative Error Plot
Logical Data Flow Diagram
Physical Data Flow Diagram
Decision Tree
Density Plot
Entity Relationship Diagram
Error Residue Graph
Flow Diagram
Funnel Chart
Fusion Chart
Heat Map
Histogram
Lag Plot
Line Chart
Log Plot
Lognormal Plot
Map Plot
Onion Diagram
Organizational Chart
Pair Plot
Pie Chart
Principal Component Analysis (PCA) Plot
Probability Distribution Graph
Receiver Operating Characteristics (ROC) Curve
Residual Plot
Run-Sequence Plot
Scatter Plot
Sequence Diagram
Silhouette Plot
Spider Chart
Stakeholder Matrix
Sunburst Chart
Venn Diagram
Waterfall
Weibull Plot

No one person uses all of them or could reasonably be expected to know them, so maybe that’s the rationale for the super low passing threshold, which may only have been 40%.

The exam involved 75 questions in 120 minutes. There was little specific mention of Artificial Intelligence or Machine Learning techniques, except as generalities.

The interesting thing about this one is that it has to be renewed every year (instead of every three), but you only need 20 CDUs (instead of the 60 that are typical for some of my other certs), so the idea is clearly that you stay engaged with the material to keep up with ongoing developments.

I don’t know how useful this will be. As of the end of November only 117 people had earned it. (Edit: 159 as of 3/17/21).

Post University CIS Advisory Board Meeting 2020

Today I participated in an annual advisory board for the CIS department at Post University in Waterbury, CT, as I have been privileged to do since 2016. I’ve written up my suggestions here and here when they’ve been extensive and unique enough, and I felt that my observations this year merited another write-up.

I’ve been impressed with the efforts the Post U CIS department has taken to solicit and, more importantly, act on the feedback they receive. They have absolutely added courses and redirected things per inputs from the board meetings. One of the most interesting aspects of Post’s educational offerings is that the majority of students take the majority of their education online. This is particularly interesting given the ongoing economics of higher education, which sees the value proposition rapidly declining for many schools as prices rise and the value received often falls. It is also interesting as the industry landscape seems to change more quickly every year (to the point where it was suggested that the board meet twice a year instead of just once), there are more online educational offerings than ever, and online information exchange and interaction is absolutely vital given the current troubles raised by COVID-19 and other current events.

One of the first subjects that came up involved certification. Most of the conversation centered on helping students achieve certifications in specific technologies (e.g., AWS, especially since Amazon seems to be looking for partnerships with technical schools). My certifications, of course, are all in more abstract, management and process-related areas (business analysis, which the board has emphasized in the last couple of years, project management, which was discussed a lot this year, Agile/Scrum, and process improvement / Lean Six Sigma). It’s an interesting dichotomy, because specific technologies come and go (though some obviously last longer than others), while the basic concepts of analysis, integration, collaboration, and management tend to be more evergreen (though new ideas come and go, they are often repackagings of old truths). The difference is the greater scope, scale, and complexity of the projects taken on and the systems deployed over time.

I noted that it’s easier for students to complete certifications in specific technologies than it probably is to earn meaningful ones in the abstract subjects, many of which require significant working experience to appreciate on the soft side and which can have specific requirements for hours in a role on the hard side. You have to document 5000 hours in a project management role before you can even sit for the PMP, for example. The IIBA’s certifications for business analysis, by contrast, offer tiers requiring 0, 2000, and 5000 hours. At the least I’d like to make students aware of the classes of abstract certifications that exist, and also repeat my recommendation that students complete team projects that demonstrate and give experience in analysis, collaboration, planning, and management concepts.

The university participants shared a diagram that attempted to place a lot of different ideas in context, like basic computing skills, data, and so on. Security seemed important and intertwined with all areas. Different aspects of communication and organization were shown in an outer, bounding frame, and the end result of the process was the creation of business value, which was shown at the top of the bounding frame.

One can always think of different ways to draw diagrams, but it seemed to me that all of the standalone concepts shown in the middle of the diagram could as easily have been represented as a classic Venn diagram with a high degree of overlap of all the circles. Each area incorporates its own unique skills (the non-overlapping parts), but they have to be integrated to provide value, and that integration (the overlaps) has to happen in an iterative, empathetic, and inclusive way.

Another thought I had was that education is becoming more granular and modular and less monolithic. Many companies in the tech space are relaxing their requirements for degrees in favor of candidates who can demonstrate specific skills and the ability to apply them. I started with a BS in Mechanical Engineering (which included probably as many or more computing courses as anyone who graduated with an engineering degree at that time) but have completed numerous online courses in the past few years. This is in addition to classes and study required to earn my certifications. The difference between practitioners, and between educational institutions, is in the amount of underlying theory and integration they understand and apply. No one expects Post University to operate at the depth of an MIT or Stanford (and I’m guessing Post is more affordable and provides a much better value proposition for most students), but some emphasis should be placed on where everything comes from and how it all fits together. Emphasizing a certain amount of theory and integration will give Post’s programs and students a distinct competitive advantage.

As I mentioned above regarding management, the problems addressed in computing are also evolving through long cycles. One example is that computing used to be very centralized with monolithic machines running one process at a time. That model evolved into highly distributed time-sharing systems. Then came desktop computing and a return to standalone operation, and then came networking and the internet. This is also happening in standalone devices that do physical things like take pictures or operate vehicle systems, but vehicles increasingly integrate disparate sensors and actuators and the oncoming Internet of Things will only take that farther. Students and new graduates entering the workforce should have some appreciation that they’re walking into a conversation that’s been going on a long time.

I may have suggested this before, but it might be a good idea to include instruction on different kinds of computing architectures. I know you do some of this, especially regarding cloud computing as an example, but I’d like to suggest specific instruction on microservices, devops, release chains, and so on, if you aren’t already. A large number of organizations seem to be working on similar models. Giving instruction on specific testing and deployment tools like Postman and Jenkins (among many, many possibilities) may also be a good idea.

A lot of hiring managers want to hire the hot new things who know the latest tricks but, as the group discussed, the best ones understand the larger picture and can continually grow and adapt. A problem is that specific, objective skills are easy to assess and understand, and the soft, integrative, contextual skills are much more difficult to assess, understand, and sell. (I’ve had a huge problem with this. I also note that, despite claims that there are zillions of unfilled jobs, the hiring process is often massively broken. Check almost any online conversation on LinkedIn and the like, but that’s a separate conversation.)

In order to keep students engaged and to offer them ongoing support and interaction that allows them to share what they’ve learned, you might want to create some sort of discussion and news forum for students and alumni. You might also solicit articles and links to subjects that might be of interest. This might be reinventing the wheel, it might not be practical for such a large number of participants, and it always bears the risk of people getting into (seemingly inevitable) fights about politics. In any case, the best students and workers will always continue to study and adapt.

Another subject addressed during the discussions was that of attracting non-traditional students to the program and to the profession (and keeping them). There are many reasons why the field doesn’t appeal to everyone, but emphasizing the integrated and collaborative nature of solving complex problems, rather than just the hard-tech, gee-wiz-ardry of it all, might help some. There’s a need and a niche for the lone genius in a remote garret someplace, but most people aren’t set up to work like that and they generally aren’t as effective as they could be. I can tell you from experience that the group dynamic is what will make or break any project or organization. (That said, groups can also unfairly — and incorrectly — ignore and steamroll individuals, so there has to be a balance.) There is also a long, not sufficiently well understood history of different kinds of people working in computing, which a few inspiring — and maddening — recent movies have highlighted. You might be able to leverage those stories. I’m guessing you already do some of this, but that’s what I have to add.

Thanks very much to Post University and the other board members for inviting me to participate once again.

Iteration and Feedback: The Key to Making Projects and Teams Work

The Project Management Body of Knowledge (PMBOK) teaches that most project failures are caused by poor team dynamics. That may be true, but that’s just a specific case of a larger problem. The more foundational idea is that trouble arises when people don’t communicate sufficiently well or often, and that is true whether the communications are collegial or contentious. Everyone can be the best of friends, but if they’re sitting in their own rooms doing their own things they aren’t going to accomplish much as a team.

The whole concept of Agile and its specific techniques of Scrum, Kanban, SAFe, and their variants and hybrids, as well as related frameworks for managing complexity, is that there has to be an organized way for people to talk to each other so they can reach common and correct understanding of what is, what should be, and how to get from one to the other. My framework for doing business analysis works through six major phases in ways that are adaptable to a given effort, organization, and management environment. All the elements from each phase are tracked and kept in sync by using a Requirements Traceability Matrix.

The diagram above was created to emphasize the need to get people together to talk to each other, figure out what they need, what’s possible, and how they can work together. In most situations they don’t do this just once; they do it iteratively, incorporating feedback and making corrections until everyone is in agreement (subject to real limitations of time, resources, and availability) at the end of each phase. The circles in the diagram represent the iterative cycles of planning, performing, review, feedback, and correction for each phase. The links forward represent moving to the next phase when the iterations for the current phase have succeeded. The links backward represent recognition that something was missed that needs to be added in a previous phase. Adding the missing elements involves iterations in the previous phase and then a return to the ongoing iterations in the current phase. (Ha ha, if done well, this all flows more naturally than it sounds like in writing here.)
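
To make the crosslinking idea concrete, here is a minimal sketch of how RTM entries might be represented and checked. The phase names follow my framework, but the item IDs, fields, and example content are illustrative assumptions rather than a prescribed format.

```javascript
// A minimal sketch of how RTM entries might be cross-linked across phases.
// The item IDs and record shape are illustrative assumptions.
const rtm = [
  { id: "IU-1",  phase: "intended use",     text: "Model border-crossing throughput", forward: ["CM-1"], backward: [] },
  { id: "CM-1",  phase: "conceptual model", text: "As-Is description of crossing operations", forward: ["REQ-1"], backward: ["IU-1"] },
  { id: "REQ-1", phase: "requirements",     text: "System shall report hourly throughput", forward: ["DES-1"], backward: ["CM-1"] },
  { id: "DES-1", phase: "design",           text: "Throughput report module", forward: ["IMP-1"], backward: ["REQ-1"] },
  { id: "IMP-1", phase: "implementation",   text: "reportThroughput() in reporting service", forward: ["TST-1"], backward: ["DES-1"] },
  { id: "TST-1", phase: "test",             text: "Verify hourly throughput report", forward: [], backward: ["IMP-1"] },
];

// Every item except those in the first and last phases should link both ways.
// This check flags orphans that indicate something was missed in a phase.
function findOrphans(entries) {
  return entries.filter(e =>
    (e.phase !== "intended use" && e.backward.length === 0) ||
    (e.phase !== "test" && e.forward.length === 0));
}

console.log(findOrphans(rtm)); // [] when the matrix is fully linked
```

Whether the matrix lives in a spreadsheet or a tracking tool, the same bidirectional checks apply.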

Since I’ve served in so many different roles in my career, often in the capacity of a vendor or consultant serving larger organizations by providing specialized, highly technical products and services, I’ve had the chance to meet, work with, learn from, share with, and help a lot of different kinds of people in a lot of different environments. They all have a part to play and their needs, contributions, thoughts, and opinions are all valuable.

Get them all talking to each other. My way isn’t the only way (though it’s a good one), but make sure people are talking in some way.

Cross-Browser Compatibility: My Website Animation, Part 3

Following up on the issue I discussed previously here and here, I finally bit the bullet and straightened out the problems caused in the landing page animation by a certain behavior of Android web browsers. After periodically digging around for a solution that addressed the problem of font boosting directly, and finding a lot of expressions of frustration about it where it’s discussed, I finally decided to just brute force it.

That means test for the browser type and then, if it’s Android, adjust the rendering code a bit to get the desired result. It turned out to be pretty straightforward.

Testing for the browser type is really simple, especially if you only have to test for one.
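
For illustration, a check along these lines (user-agent sniffing, which is the brute-force part) might look like this; the compensation factor and helper name are made up for the example, not the actual values from my site.

```javascript
// Sketch of detecting Android and branching the animation setup.
// The size adjustment and function name are illustrative assumptions.
const isAndroid = /Android/i.test(navigator.userAgent);

function addTextElement(text, baseSizePx) {
  // Android's font boosting inflates small text, so use a compensated size
  // worked out by trial and error for each element.
  const sizePx = isAndroid ? baseSizePx * 0.85 : baseSizePx;
  const el = document.createElement("div");
  el.textContent = text;
  el.style.fontSize = sizePx + "px";
  document.body.appendChild(el);
  return el;
}
```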

After that I just needed to write different versions of the initiation calls for the moving text elements of the animation sequence. The graphic elements didn’t need to be changed in this way. I had been a bit inconsistent when I defined the font sizes in the original animation, so I had to do a bit of trial and error to get the Android-specific items the right size, but all 35 of them got done. I think it only took a couple of hours one evening while I watched horror movies hosted by Joe Bob Briggs. Who says programming can’t be fun?

As browser compatibility issues go, this one was simple. If there isn’t a clean solution to whatever issue is causing you problems, then do what you need to do. It may feel like a hack but we don’t always get perfection. A little elbow grease on your part makes life a lot better and more consistent for your users, and usually the cost in terms of complexity and bandwidth is minimal. Your users come first, especially as the number of users grows.

I’m sure there are far weirder incompatibilities out there. Which ones have you seen?

Documentation

I’ve written a lot of different kinds of documentation in my career, and I list and describe them here. Some types of documentation are formal, as in manuals written for various phases of a project, while others are more ad hoc, like punch lists used in place of formal tracking tools.

To be thorough I’m going to include some document types that may not usually be thought of as documentation per se, though they are still relevant types of documents, and they can include a lot of technical material.

Individual documents can (and should!) be generated for most of the phases of my business analysis framework, depending on the complexity and scope of the effort.

Sometimes the documentation is in the form of a word processing file and sometimes it is spread over many locations in a content management system like Confluence, WordPress, or some kind of Wiki page. In Agile projects a lot of the documentation, at least during the build phase of a project, will be hosted piecemeal in a distributed software system like Jira or Rally. There may be a lot of overlap, but every situation is unique, so let’s just jump on in.

  • User Manual: This one is kind of obvious. However, a system can have many different types of users, so individual manuals (or at least sections) can be written for users in every end user role as well as installers, maintainers, user managers, and so on. I often tried to include a story walkthrough in my user manuals. Rather than just describing what every field and control on every screen of every program did and meant, and how every input and output artifact was to be processed, conditioned, and interpreted, I provided a narrative describing how a user would exercise most of the features in a logical order.
  • Sales Proposal / Project Bid: These might not seem like normal kinds of documentation but these packages can be large, complex, and highly technical. I’ve prepared process and instrumentation drawings, heat and material balances, test descriptions, process guides, personnel information, budgets, company teaming agreements, schedules, and other materials for inclusion in sales and bid packages.
  • Technical Manual: These documents can include details on almost anything, but they are usually meant to contain information of a highly technical nature. Descriptions of collected data, assumptions, equations, references, techniques, code, and other specifications are just some examples of what can be included in documents with this name. You could buy the Technical Reference Manual for the original IBM PC, which contained some detailed specs but was mostly famous for providing the complete BIOS listing in assembly language. I’ve written technical manuals that described how all the data was collected or otherwise derived, how it was conditioned and processed, and then how it was used in complex systems. I sometimes managed and serially updated these documents over a period of years. In one case I built a system to automate the calculation of initial values, internal coefficients, listing of source materials, and output of governing equations (with equation numbers), variable listings, and value tables.
  • Procedural Manual: This document describes how activities that support an effort are to be carried out. For example, I wrote a Data Collection Manual to describe how my company’s data collectors were supposed to collect, process, and incorporate data, from discovery and data collection trips to dozens of customer sites, as well as what and how much data needed to be collected. This kind of manual could then be delivered to the customer so they could use it to maintain and update the delivered system on their own. This was for a program whose purpose was to build a tool to generate models for every land border crossing for Canada or Mexico, and then collect data for the individual models.
  • Policy Guide / Business Rules: These lists of directives tend to describe organizational process requirements in terms of how certain decisions are made, the time periods allowed to complete certain activities, duties certain people must perform, and things that cannot be done. Policies may be imposed by outside agencies in the form of customer requirements or government laws and regulations.
  • Data Dictionary: These documents can describe all relevant data items, their provenance, their methods of collection, their meaning, their acceptable (ranges of) values, and so on. I have always included this information in other documents.
  • Glossary of Terms: This document describes the specialized terms of art for a project, system, practice, organization, or profession. I think I may have contributed to a standalone document of this type for one project, but I have often included this kind of information in other documents.
  • Intended Use Statement: I originally saw this described in a Navy specification for performing full-scope VV&A on simulations. Its purpose was to describe how a simulation was to be used to support a particular type of planning and scheduling analysis. I’ve appropriated this language to describe the purpose of a project involving a system that is to be automated, constructed, simulated, reengineered, upgraded, or otherwise improved.
  • Conceptual Model: This document describes what’s happening in a system or process that is going to be automated, constructed, simulated, reengineered, upgraded, or otherwise improved. This type of document is often referred to as describing the As-Is state of a system.
  • Requirements Document: This document can describe several things. One is what the system needs to be able to do at the end of the project. These items are referred to as functional requirements. Another thing that can be described is the qualities a system needs to have, usually in the form of -ilities (e.g., reliability, flexibility, maintainability, understandability, usability, configurability, modularity, etc.), and those are referred to as non-functional requirements. Procedural requirements, which are requirements for how the project, effort, or engagement should be carried out, are probably best considered a subclass of non-functional requirements. I’ve written and managed requirements in multiple contexts and dealt with requirements written by others in multiple contexts as well. This type of document is often referred to as describing the To-Be state of a system. In particular I think of requirements as an abstract description of the To-Be state.
  • Design Document: This document may describe the components, layout, configuration, operations, tools, interfaces, and methods that will comprise the solution. It can define what the solution will be and how it is to be implemented. There are as many ways to create design documents as there are projects to be completed. The details don’t always matter if everyone understands what is being proposed. There should be a balance between detail and flexibility depending on the type, scale, and complexity of the effort. A manufacturing line including specific equipment meant to provide a specific, contractually defined level of production should be specified up front and should probably be implemented using a Waterfall methodology. A software system whose details must be partially defined on the fly should likely be specified piecemeal using an Agile strategy. I’ve produced many items that were intended to serve as design documents and they were presented at varying levels of abstraction. A lot of communication in this area is taken for granted. Be sure to review and seek approval from the customer before proceeding. I think of this type of document as representing a more concrete description of the To-Be state of a system.
  • Implementation Document: This item could be a plan for how the implementation is to be accomplished (this was also discussed as possibly being part of the design document). This could include instructions for ongoing governance (e.g., change control procedures, communication and permissions, and so on). It could also be a record of how the implementation has been carried out. It can be in the form of an updatable text document or in the form of a data system of some kind, along the lines of a Jira, Rally, or VersionOne.
  • Test Document: This item can describe test procedures and also progress attained, in which case it can also function as a checklist. I’ve seen some pretty comprehensive test plans in my career. A good formal test plan should exercise and verify every possible function and capability of a system. All appropriate artifacts and operations should be validated as well, but that often involves expert judgment and that can be difficult or impossible to automate.
  • Accreditation Documents: These documents could include plans for how accreditation is to be carried out (plan) and how it was carried out and the resulting recommendation (report). These artifacts can bookend similar documents describing how V&V is to be and was completed. One possible description of how this could work is here.
  • Requirements Traceability Matrix: This artifact serves as a unified checklist that crosslinks every item in every project phase (intended use, conceptual model, requirements, design, implementation, and test and acceptance in my framework) both forward to the next phase and back to the previous phase. Every element should be linked in both directions (except for the initial and final phases). If all items in all phases are accepted as complete by the customer then work may be regarded as finished for the entire project’s identified and agreed-upon intended uses and requirements. I’ve only used this formally on one large, very formal, two-year project where my company was part of a team acting as an independent agent for the V&V of a deterministic simulation tool the Navy was adopting to manage large fleets of its aircraft. Since I learned about this and earned my CBAP certification I’ve become a huge fan; I advocate its use in every effort of sufficient scale to warrant it, and will use it informally on smaller efforts.
  • Statement of Findings: This document may be generated to relay the findings of any kind of investigation one may conduct. This is intended to apply more to troubleshooting or forensic investigations than normal discovery efforts. I’ve prepared a few such reports for different situations over the years.
  • Status Report: These can be issued at various intervals and include varying amounts of information. Their contents may or may not be formally defined. They are usually targeted to defined distribution lists (which sometimes consist of just one manager or customer representative). I’ve written a ton of periodic and ad hoc status reports over the years. Automated tracking systems of various types can generate custom-formatted periodic and ad hoc reports of activity as well.
  • Field Report: These documents describe discoveries and accomplishments at remote locations. In my experience these have almost always been customer sites. There can be a large overlap between these and other report types in this list.
  • Recommendations: These items come in many forms and can be produced as the result of a formal investigation or audit (I wrote a couple of these for customers in the paper industry as a young process engineer) or as the result of an informal observation (I’ve proposed numerous internal process improvements to companies I’ve worked for, with this being one recent example).
  • Punch List: These are usually informal to-do lists of tasks to be accomplished. They are usually generated and worked off near the end of projects where formal management techniques and tracking tools aren’t being used. They can be for strictly internal use or reviewed at intervals with the customer. I saw a lot of these when I visited turnkey pulping lines my company was installing for customers in the paper industry and I reviewed them in morning meetings with some customers in the steel industry, for whom I was building and installing supervisory furnace control systems.
  • Specifications: These are often standard descriptions of individual pieces of physical equipment, but they can also describe systems of physical components or software.
  • Request for Proposal / Request for Quote: Organizations in some fields (especially government and heavy industry) will advertise that they need to have specific work performed or items provided and potential vendors will respond with proposals describing how they will provide the requested goods and services. The issuing organizations may provide guidelines for how the response package must be formatted, and these specifications can be extremely long and complex. I’ve contributed artifacts to many bid packages in private industry and have worked on some insanely complex and formal packages for the federal government and the City of New York.
  • User Stories: User stories are one type of artifact that can be created to define requirements in Agile and Scrum projects. They can be written from the point of view of a human user, a piece of hardware, an automated process, or an external system. They often (but not always) take the form, “As a (type of user), I want to (perform some action), so I can (achieve some outcome).” On some projects user stories are the main or even only way that requirements are specified and tracked.
  • Contracts: Contracts can describe what is to be delivered and also the process by which the work is to be completed. It is generally appropriate to use a Waterfall methodology in cases where final deliverables are described up front in great detail and Agile methodologies where final deliverables are specified in a more open-ended way. Waterfall and Agile techniques need not be viewed as totally different and opposed. I feel they should be seen as usable on a hybrid, continuum basis, as I describe here. I’ve had to understand, manage to, and fulfill the terms of customer contracts of varying complexity for most of my career. I’ve even helped to draft sections of contracts related to my role and areas of expertise.

So that’s what I’ve done over the years. What kinds of documentation have you written that I didn’t include?

Understanding and Monitoring Microservices Across Five Levels

Did I know anything about microservices (or DevOps, or…) when I landed at Universal recently? No, I did not. Did that stop me from figuring it out in short order? Nope. Did that stop me from being able to reverse-engineer their code stacks in multiple languages from day one? Of course not. Did that stop me from seeing things nobody else in the entire organization saw and designing a solution that would have greatly streamlined their operations and saved them thousands of hours of effort? Not a chance. Could I see what mistakes they had made in the original management and construction of the system and would I have done it differently? You betcha.

While there are definitely specific elements of tradecraft that have to be learned to make optimal use of microservices, and there would always be more for me to learn, the basics aren’t any different from what I’ve already been doing for decades. It didn’t take long before the combination of learning how their system was laid out, seeing the effort it took to round up the information needed to understand problems as they came up in the working system, and seeing a clever monitoring utility someone had created made it obvious they needed a capability that would help them monitor and understand the status of their entire system. Like I said, it didn’t require a lot of work; it was something I could just “see.” I suggested an expanded version of the monitoring tool they’d created, one that would let anyone see how every part of the system worked at every level.

Now don’t get me wrong, there were a lot of smart and capable and dedicated people there, and they understood a lot of the architectural considerations. So what I “saw” wasn’t totally new in detail, but it was certainly new in terms of its scope, in that it would unify and leverage a lot of already existing tools and techniques. As I dug into what made the system tick I saw that concepts and capabilities broke down into five different layers, but first some background.

Each column in the figure above represents a single microservices environment. The rightmost column shows a production environment, the live system of record that actually supports business operations. It represents the internal microservices and the external systems that are part of the same architectural ecosystem. The external items might be provided by a third party as standard or customized capabilities. They can be monitored and interacted with, but the degree to which they can be controlled and modified might be limited. The external systems may not be present in every environment. Environments may share access to external systems for testing, or may not have access at all, in which case interactions have to be stubbed out or otherwise simulated or handled.

The other columns represent other environments that are likely to be present in a web-based system. (Note that microservices can use any communication protocol, not just the HTTP used by web-facing systems.) The other environments are part of a DevOps pipeline used to develop and test new and modified capabilities. Such pipelines may have many more steps than what I’ve shown here, but new code is ideally entered into the development environment and then advanced rightward from environment to environment as it passes different kinds of tests and verifications. There may be an environment dedicated to entering and testing small hotfixes that can be advanced directly to the production environment with careful governance.

The basic structure of a microservice is shown above. I’ve written about monitoring processes here, and this especially makes sense in a complicated microservices environment. I’ve also written about determining what information needs to be represented and what calculations need to be performed before working out detailed technical instantiations here. A microservice can be thought of as a set of software capabilities that perform defined and logically related functions, usually in a distributed system architecture, in a way that ideally doesn’t require chaining across to other microservices (like many design goals or rules of thumb, this need not be absolute; just know what you’re trying to do and why; or, you’ve got to know the rules before you can break them!).
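
For readers who haven’t built one, here is a bare-bones sketch of the shape of a microservice, assuming Node with Express; the routes, port, and payloads are invented for illustration.

```javascript
// Bare-bones microservice sketch (Node + Express assumed; routes invented).
// It groups logically related functions behind its own endpoints and exposes
// a lightweight status endpoint that a monitoring tool can poll.
const express = require("express");
const app = express();
app.use(express.json());

// Logically related business functions live together in one service...
app.get("/orders/:id", (req, res) => {
  res.json({ id: req.params.id, state: "open" }); // stubbed lookup
});
app.post("/orders", (req, res) => {
  res.status(201).json({ id: "new-id", received: req.body }); // stubbed create
});

// ...and a status endpoint supports monitoring without touching business data.
app.get("/status", (req, res) => {
  res.json({ service: "orders", running: true, version: "1.0.0" });
});

app.listen(3000, () => console.log("orders service listening on 3000"));
```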

I’ve listed them here in the raw form in which I discovered them. I never got to work with anyone to review and implement this material in detail (something came up every time we scheduled a meeting), so I’ve found inconsistencies as I worked on this article. I cleaned some things up, eliminated some duplications, and made them a bit more organized and rational down below. A more detailed write-up follows.

Layer 1 – Most Concrete
Hardware / Local Environment

  • number of machines (if cluster)
  • machine name(s)
  • cores / memory / disk used/avail
  • OS / version
  • application version (Node, SQL, document store)
  • running?
  • POC / responsible party
  • functions/endpoints on each machine/cluster

Layer 2

Gateway Management (Network / Exposure / Permissioning)

  • IP address
  • port (different ports per service or endpoint on same machine?)
  • URL (per function/endpoint)
  • certificate
  • inside or outside of firewall
  • auth method
  • credentials
  • available?
  • POC / responsible party
  • logging level set
  • permissions by user role

Layer 3

QA / Testing

  • tests being run in a given environment
  • dependencies for each path under investigation (microservices, external systems, mocks/stubs)
  • schedule of current / future activities
  • QA POC / responsible party
  • Test Incident POC / responsible party
  • release train status
  • CI – linting, automatically on checkin
  • CI – unit test
  • CI – code coverage, 20% to start (or even 1%), increase w/each build
  • CI – SonarQube Analysis
  • CI – SonarQube Quality Gate
  • CI – VeraCode scan
  • CI – compile/build
  • CI – deploy to Hockey App for Mobile Apps / Deploy Windows
  • CD – build->from CI to CD, publish to repository server (CodeStation)
  • CD – pull from CodeStation -> change variables in Udeploy file for that environment
  • CD – deploy to target server for app and environment
  • CD – deploy configured data connection, automatically pulled from github
  • CD – automatic smoke test
  • CD – automatic regression tests
  • CD – send deploy data to Kibana (deployment events only)
  • CD – post status to slack in DevOps channel
  • CD – roll back if not successful
  • performance test CD Ubuild, Udeploy (in own perf env?)

Layer 4

Code / Logic / Functionality

  • code complete
  • compiled
  • passes tests (unit, others?)
  • logging logic / depth
  • Udeploy configured? (how much does this overlap with other items in other areas?)
  • test data available (also in QA area or network/environment area?)
  • messaging / interface contracts maintained
  • code decomposed so all events queue-able, storable, retryable so all eventually report individual statuses without loss
  • invocation code correct
  • version / branch in git
  • POC / responsible party
  • function / microservice architect / POC
  • endpoints / message formats / Swaggers
  • timing and timeout information
  • UI rules / 508 compliance
  • calls / is-called-by –> dependencies

Layer 5 – Most Abstract

Documentation / Support / Meta-

  • management permissions / approvals
  • documentation requirements
  • code standards reviews / requirements
  • defect logged / updated / assigned
  • user story created / assigned
  • POC / responsible party
  • routing of to-do items (business process automation)
  • documentation, institutional memory
  • links to related defects
  • links to discussion / interactive / Slack pages
  • introduction and training information (Help)
  • date & description history of all mods (even config stuff)
  • business case / requirement for change

That’s my stream-of-consciousness take.

Before we go into detail let’s talk about what a monitoring utility might look like. The default screen of the standalone web app that served as my inspiration for this idea showed a basic running/not-running status for every service in every environment and looked something like this. It was straight HTML 5 with some JavaScript and Angular and was responsive to changes in the screen size. There were a couple of different views you could choose to see some additional status information for every service in an environment or for a single service across all environments. It gathered most of its status information by repeatedly sending a status request to the gateway of each service in each environment. Two problems with this approach were that the original utility pinged the services repeatedly without much or any delay, and multiple people could (and did) run the app simultaneously, which hammered the network and the services harder than was desirable. These issues could be addressed by slowing down the scan rate and hosting the monitoring utility on a single website that users could load to see the results of the scans. That would still require a certain volume of refreshes across the network (and there are ways to minimize even those), but queries of the actual service endpoints would absolutely be minimized.
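
A centralized, rate-limited scanner along those lines might look something like this sketch (Node 18+ assumed for the built-in fetch); the environments, URLs, and timing are invented, and a real version would also serve the cached results to the browser page.

```javascript
// Sketch of a single, rate-limited status scanner (Node 18+ assumed; URLs invented).
// One instance polls each service's status endpoint on a slow cycle and caches
// the results, so browsers read the cache instead of hammering the gateways.
const environments = {
  dev:  ["http://dev.example.internal/orders/status", "http://dev.example.internal/billing/status"],
  prod: ["http://prod.example.internal/orders/status", "http://prod.example.internal/billing/status"],
};

const statusCache = {}; // { "<env>|<url>": { running, checkedAt } }

async function scanOnce() {
  for (const [env, urls] of Object.entries(environments)) {
    for (const url of urls) {
      let running = false;
      try {
        const res = await fetch(url, { signal: AbortSignal.timeout(5000) });
        running = res.ok;
      } catch (err) {
        running = false; // unreachable or timed out counts as not running
      }
      statusCache[`${env}|${url}`] = { running, checkedAt: new Date().toISOString() };
    }
  }
}

// Poll on a deliberately slow cycle instead of a tight loop.
setInterval(scanOnce, 60 * 1000);
```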

It was real simple. Green text meant the service was running and red text meant it wasn’t. Additional formatting or symbology could be added for users who are colorblind.
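
For example, a small rendering helper could pair the color with a symbol so the state doesn’t depend on hue alone; this is just a sketch.

```javascript
// Sketch of rendering one status cell so state isn't conveyed by color alone.
function renderStatus(cell, running) {
  cell.style.color = running ? "green" : "red";
  cell.textContent = (running ? "\u2713 " : "\u2717 ") + (running ? "running" : "down");
}
```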

Now let’s break this down. Starting from the most concrete layer of information…

Layer 1 – Most Concrete
Hardware / Local Environment

This information describes the actual hardware a given piece of functionality is hosted on. Of course, this only matters if you are in control of the hardware. If you’re hosting the functionality on someone else’s cloud service then there may be other things to monitor, but it won’t be details you can see. Another consideration is whether a given collection of functionality requires more resources than a single host machine has, in which case the hosting has to be shared across multiple machines, with all the synchronization and overhead that implies.

When I was looking at this information (and finding that the information in lists posted in various places didn’t match) it jumped out at me that there were different versions of operating systems on different machines, so that’s something that should be displayable. If the web app had a bit more intelligence, it could produce reports on machines (and services and environments) that were running each OS version. There might be good reasons to support varied operating systems and versions, and you could include logic that identified differences against policy baselines for different machines for different reasons. The point is that you could set up the displays to include any information and statuses you wanted. This would provide at-a-glance insights into exactly what’s going on at all times.
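
As a sketch of the kind of report I mean, grouping machine records by OS version is trivial once the inventory data is in one place; the record shape and values here are assumptions for illustration.

```javascript
// Sketch of the kind of report the display could generate: machines grouped
// by OS version. The record shape and values are illustrative assumptions.
const machines = [
  { name: "svc-host-01", env: "dev",  os: "RHEL 7.9" },
  { name: "svc-host-02", env: "qa",   os: "RHEL 8.4" },
  { name: "svc-host-03", env: "prod", os: "RHEL 8.4" },
];

function groupByOs(records) {
  return records.reduce((report, m) => {
    (report[m.os] = report[m.os] || []).push(`${m.name} (${m.env})`);
    return report;
  }, {});
}

console.log(groupByOs(machines));
// { "RHEL 7.9": ["svc-host-01 (dev)"], "RHEL 8.4": ["svc-host-02 (qa)", "svc-host-03 (prod)"] }
```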

Moreover, since this facility could be used by a wide variety of workers in the organization, this central repository could serve as the ground truth documentation for the organization. Rather than trying to organize and link to web pages and Confluence pages and SharePoint documents and manually update text documents and spreadsheets on individual machines, the information could all be in one place that everyone could find. And even if separate documents needed to be maintained by hand, such a unified access method could provide a single, trusted pointer to the correct version of the correct document(s). This particular layer of information might not be particularly useful for everyone, but we’ll see when we talk about the other layers that having multiple eyes on things allows anomalies to be spotted and rectified much more quickly. If all related information is readily available and organized in this way, and if people understood how to access it, then the amount of effort spent finding the relevant people and information when things went wrong would be reduced to an absolute minimum. In a large organization the amount of time and effort that could be saved would be spectacular. Different levels of permissions for viewing, maintenance, and reporting operations could also be included as part of a larger management and security policy.

I talked about pinging the different microservices, above. That’s a form of ensuring that things are running at the application level, or the top level of the seven-layer OSI model. The status of machines can be independently monitored using other utilities and protocols. In this case they might operate at a much lower level. If the unified interface I’m discussing isn’t made to perform such monitoring directly, it could at least provide a trusted link to or description of where to access and how to use the appropriate utility. I was exposed to a whole bunch of different tools at Universal, and I know I didn’t learn them all or even learn what they all were, but such a consistent, unified interface would tell people where things were and greatly streamline awareness, training, onboarding, and overall organizational effectiveness.

  • number of machines (if cluster)
  • machine name(s)
  • cores / memory / disk used/avail
  • OS / version
  • application version (Node, SQL, document store)
  • running (machine level)
  • POC / responsible party
  • functions/endpoints on each machine/cluster

Last but not least, the information tracked and displayed could include meta-information about who was responsible for doing certain things and who to contact at various times of day and days of week. Some information would be available and displayed on a machine-by-machine basis, some based on the environment, some based on services individually or in groups, and some on an organizational basis. The diagram below shows some information for every service for every environment, but other information only for each environment. Even more general information, such as that applying to the entire system, could be displayed in the top margin.

Layer 2

Gateway Management (Network / Exposure / Permissioning)

The next layer is a little less concrete and slightly more abstract and configurable, and it involves the network configuration germane to each machine and each service. Where the information in the hardware layer wasn’t likely to change much (and might be almost completely obviated when working with third-party, virtualized hosting), this information can change more readily, and it is certainly a critical part of making this kind of system work.

The information itself is fairly understandable, but what gives it power in context is how it links downward to the hardware layer and links upward to the deployment and configuration layer. That is, every service in every environment at any given time is described by this kind of five-layer stack of hierarchical information. If multiple endpoints are run on a shared machine then the cross-links and displays should make that clear.
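
A single service-in-one-environment record might tie the five layers together like this sketch; every field value is invented, and the point is only the hierarchical linkage from hardware up to meta-information.

```javascript
// Sketch of one service-in-one-environment record tying the five layers
// together. All field values are invented for illustration.
const serviceStack = {
  service: "orders",
  environment: "qa",
  layer1_hardware: { machine: "svc-host-02", cores: 8, memoryGb: 32, os: "RHEL 8.4" },
  layer2_gateway:  { ip: "10.0.12.34", port: 8443, url: "https://qa.example.internal/orders", insideFirewall: true },
  layer3_pipeline: { build: "2020.11.3", ciPassed: true, deployedAt: "2020-11-06T14:00:00Z" },
  layer4_code:     { repo: "git@github.example.internal:team/orders.git", branch: "release/2020.11", swagger: "/orders/swagger.json" },
  layer5_meta:     { poc: "Jane Doe", docsUrl: "https://confluence.example.internal/orders", approvals: ["release-mgr"] },
};
```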

  • IP address
  • port (different ports per service or endpoint on same machine?)
  • URL (per function/endpoint)
  • SSL certificate (type, provider, expiration date)
  • inside or outside of firewall
  • auth method
  • credentials
  • available?
  • POC / responsible party
  • logging level set
  • permissions by user role

There are a ton of tools that can be used to access, configure, manage, report on, and document this information. Those functions could all be folded into a single tool, though that might be expensive, time-consuming, and brittle. As described above, however, links can be provided to the proper tools and various sorts of live information.

Layer 3

QA / Testing

Getting still more abstract, we could also display the status of all service items as they are created, modified, tested, and moved through the pipeline toward live deployment. The figure below shows how the status of different build packages could be shown in the formatted display we’ve been discussing. Depending on what is shown and how it’s highlighted, it would be easy to see the progress of the builds through the system, when each code package started running, the status of various tests, and so on. You could highlight different builds in different colors and include extra symbols to show how they fit in sequence.

  • tests being run in a given environment
  • dependencies for each path under investigation (microservices, external systems, mocks/stubs)
  • schedule of current / future activities
  • QA POC / responsible party
  • Test Incident POC / responsible party
  • release train status
  • CI – linting, automatically on checkin
  • CI – unit test
  • CI – code coverage, 20% to start (or even 1%), increase w/each build
  • CI – SonarQube Analysis
  • CI – SonarQube Quality Gate
  • CI – VeraCode scan
  • CI – compile/build
  • CI – deploy to Hockey App for Mobile Apps / Deploy Windows
  • CD – build->from CI to CD, publish to repository server (CodeStation)
  • CD – pull from CodeStation -> change variables in Udeploy file for that environment
  • CD – deploy to target server for app and environment
  • CD – deploy configured data connection, automatically pulled from github
  • CD – automatic smoke test
  • CD – automatic regression tests
  • CD – send deploy data to Kibana (deployment events only)
  • CD – post status to slack in DevOps channel
  • CD – roll back if not successful
  • performance test CD Ubuild, Udeploy (in own perf env?)

A ton of other information could be displayed in different formats, as shown below. The DevOps pipeline view shows the status of tests passed for each build in an environment (and could include every environment; I didn’t make complete example displays, for brevity, but it would also be possible to customize each display as shown). These statuses might be obtained via webhooks to the information generated by the various automated testing tools. As much or as little information could be displayed as desired, but it’s easy to see how individual status items can be readily apparent at a glance. Naturally, the system could be set up to only show adverse conditions (tests that failed, items not running, permissions not granted, packages that haven’t moved after a specified time period, and so on).
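
A sketch of how those webhook events might be folded into the status board follows; the route and payload fields are invented, since each CI/CD tool has its own format that would need mapping.

```javascript
// Sketch of a webhook receiver that folds CI/CD events into the status board.
// The route and payload shape are invented; real tools each have their own
// formats that would need mapping.
const express = require("express");
const app = express();
app.use(express.json());

const buildStatus = {}; // { "<service>|<env>": { [step]: "passed" | "failed" } }

app.post("/hooks/ci", (req, res) => {
  const { service, environment, step, result } = req.body; // assumed fields
  const key = `${service}|${environment}`;
  buildStatus[key] = buildStatus[key] || {};
  buildStatus[key][step] = result;
  res.sendStatus(204);
});

// The display reads this endpoint to highlight failed steps at a glance.
app.get("/hooks/status", (req, res) => res.json(buildStatus));

app.listen(4000);
```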

A Dependency Status display could show which services are dependent on other services (if there are any). This gives the release manager insight into what permissions can be given to advance or rebuild individual packages if it’s clear there won’t be any interactions. It also shows what functional tests can’t be supported if required services aren’t running. If unexpected problems are encountered in functional testing, it might be an indication that the messaging contracts between services need to be examined, or that something along those lines was missed. In the figure below, Service 2 (in red) requires that Services 3, 5, and 9 must be running. Similarly, Service 7 (in blue) requires that Services 1, 5, and 8 are running. If any of the required services aren’t running then meaningful integration tests cannot be run. Inverting that, the test manager could also see at a glance which services could be stopped for any reason. If Services 2 and 7 both needed to be running, for example, then Services 4 and 6 could be stopped without interfering with anything else. There are doubtless more interesting, comprehensive, and efficient ways to do this, but the point is merely to illustrate the idea. This example does not show dependencies for external services, but those could be shown as well.
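
The dependency reasoning is simple enough to sketch directly, using the example services from the figure; the dependency map below is an assumption that mirrors the text.

```javascript
// Sketch of the dependency reasoning described above. The map mirrors the
// example in the text: Service 2 needs 3, 5, 9; Service 7 needs 1, 5, 8.
const dependsOn = {
  2: [3, 5, 9],
  7: [1, 5, 8],
};

// Everything a set of services needs in order to run meaningful tests.
function requiredServices(servicesUnderTest) {
  const required = new Set(servicesUnderTest);
  for (const s of servicesUnderTest) {
    for (const dep of dependsOn[s] || []) required.add(dep);
  }
  return required;
}

// Which services could be stopped without interfering with those tests.
function stoppableServices(allServices, servicesUnderTest) {
  const required = requiredServices(servicesUnderTest);
  return allServices.filter(s => !required.has(s));
}

const all = [1, 2, 3, 4, 5, 6, 7, 8, 9];
console.log(stoppableServices(all, [2, 7])); // [4, 6], matching the example
```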

The Gateway Information display shows how endpoints and access methods are linked to hardware items. The Complete Requirements Stack display could show everything from the hardware on up to the management communications, documentation, and permissioning.

The specific tools used by any organization are likely to be different, and in fact are very likely to vary within an organization at a given time (Linux server function written in Node or Java, iOS or Android app, website or Angular app, or almost any other possibility), and will certainly vary over time as different tools and languages and methodologies come and go. The ones shown in these examples are notional, and different operations may be performed in different orders.

Layer 4

Code / Logic / Functionality

The code, logic, and style are always my main concern and the area most closely matching my experience. I interacted with people doing all these things, but I naturally paid closest attention to what was going on in this area.

  • code complete: flag showing whether code is considered complete by developer and submitted for processing.
  • compiled: flag showing whether code package is compiled. This might only apply to the dev environment if only compiled packages are handled in higher environments.
  • passes tests (unit, others?): flag(s) showing status of local (pre-submittal) tests passed
  • logging logic / depth: settings controlling level of logging of code for this environment. These settings might vary by environment. Ideally, all activity at all levels would be logged in the production environment, so complete auditing is possible to ensure no transactions ever fail to complete or leave an explanation of why they did not. Conversely, more information might be logged for test scenarios than for production. It all depends on local needs.
  • Udeploy configured: Flag for whether proper deployment chain is configured for this code package and any dependencies. (How much does this overlap with other items in other areas?)
  • test data available: Every code package needs to have a set of test data to run against. Sometimes this will be native to automated local tests and at other times it will involve data that must be provided by or available in connected capabilities through a local (database) or remote (microservice, external system, web page, or mobile app) communications interface. The difficulty can vary based on the nature of the data. Testing credit card transactions so all possible outcomes are exercised is a bit of a hassle, for example. It’s possible that certain capabilities won’t be tested in every environment (which might be redundant in any case), so what gets done and what’s required in each environment in the pipeline needs to be made clear.
  • messaging / interface contracts maintained: There should be ways to automatically test the content and structure of messages passed between different functional capabilities (see the sketch after this list). This can be done structurally (as in the use of JavaScript’s ability to query an object to determine what’s in it to see if it matches what’s expected, something other languages can’t always do) or by contents (as in the case of binary messages (see story 1) that could be tested based on having unique data items in every field). Either way, if the structure of any message is changed, someone has to verify that all code and systems that use that message type are brought up to date. Different versions of messages may have to be supported over time as well, which makes things even more complicated.
  • code decomposed so all events are queue-able, storable, and retryable, so all eventually report individual statuses without loss and complete consistency of side effects is maintained: People too often tend to write real-time systems with the assumption that everything always “just works.” Care must be taken to queue any operations that fail so they can be retried until they pass or, if they are allowed to fail legitimately, to ensure that all events are backed out, logged, and otherwise rationalized (a sketch of this pattern also appears after this list). The nature of the operation needs to be considered. Customer-facing functions need to happen quickly or be abandoned or otherwise routed around, while in-house supply and reporting items could potentially proceed with longer delays.
  • invocation code correct: I can’t remember what I was thinking with this one, but it could have to do with the way the calling or initiating mechanism works, which is similar to the messaging / interface item, above.
  • version / branch in git: A link to the relevant code repository (it doesn't have to be git) should be available and maintained. If multiple versions of the same code package are referenced by different environments, users, and so on, then the repository needs to be able to handle all of them and maintain separate links to them. The pipeline promotion process should also be able to trigger automatic migrations of code packages in the relevant code repositories. The point is that it should be easy to access the code in the quickest possible way for review, revision, or whatever. A user shouldn't have to go digging for it.
  • POC / responsible party: How to get in touch with the right person depending on the situation. This could include information about who did the design, who did the actual coding (original or modification), who the relevant supervisor or liaison is, who the product owner is, or a host of other possibilities.
  • function / microservice architect / POC: This is just a continuation of the item above, pointing to the party responsible for an entire code and functionality package.
  • endpoints / message formats / Swaggers: This is related to the items about messaging and interface contracts but identifies the communication points directly. There are a lot of ways to exercise communication endpoints, so all of them could be referenced or linked to in some way. Swagger (now OpenAPI) definitions can be used to automatically generate web-based interfaces that allow you to manually (or automatically, if you do it right) populate and send messages to, and receive and display messages from, API endpoints. A tool called Postman does something similar. There are a decent number of such tools around, along with ways to spoof endpoint behaviors internally, and I've hand-written test harnesses of my own. The bottom line is that more automation is better, but the testing tools used should be described, integrated, or at least pointed to. Links to documentation, written and video tutorials, and source websites are all useful.
  • timing and timeout information: The timing of retries and abandonments should be parameterized and readily visible and controllable. Policies can be established governing rules for similar operations across the entire stack, or at least in the microservices. Documentation describing the process and policies should also be pointed to.
  • UI rules / 508 compliance: This won't apply to back-end microservices but would apply to front-end and mobile code and interfaces. It could also apply to user interfaces for internal code.
  • calls / is-called-by -> dependencies: My idea is that this is defined at the level of the microservice, in order to generate the dependency display described above. This might not be so important if microservices are completely isolated from each other, as some designers opine that microservices should never call each other, but connections with third-party external systems would still be important.
  • running (microservice/application level): indication of whether the service or code is running on its host machine. This is determined using an API call.
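
Here is a minimal sketch of the structural contract check mentioned in the "messaging / interface contracts maintained" item. The contract format (a map of field names to expected types) and the OrderCreated example are assumptions for illustration; real projects would more likely use JSON Schema, OpenAPI, or a consumer-driven contract testing tool.

```typescript
// Notional message contract: field names mapped to the typeof result we expect.
type FieldType = "string" | "number" | "boolean" | "object";

interface MessageContract {
  name: string;
  fields: Record<string, FieldType>;
}

const orderCreatedContract: MessageContract = {
  name: "OrderCreated",
  fields: { orderId: "string", customerId: "string", total: "number" },
};

// Returns a list of violations: missing fields, wrong types, unexpected fields.
function checkContract(contract: MessageContract, message: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(contract.fields)) {
    if (!(field in message)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof message[field] !== expected) {
      problems.push(`field ${field}: expected ${expected}, got ${typeof message[field]}`);
    }
  }
  for (const field of Object.keys(message)) {
    if (!(field in contract.fields)) {
      problems.push(`unexpected field: ${field}`);
    }
  }
  return problems;
}

// Example: an upstream change that renamed "total" to "amount" is caught immediately.
console.log(checkContract(orderCreatedContract, { orderId: "A-100", customerId: "C-7", amount: 12.5 }));
// -> ["missing field: total", "unexpected field: amount"]
```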

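And here is a minimal sketch of the queue-store-retry idea from the "code decomposed" item: every operation either eventually succeeds or is explicitly failed and logged, so nothing silently disappears. The in-memory arrays, the retry limit, and the console logging are all simplifying assumptions; a real system would persist the queue and add backoff delays between attempts.

```typescript
// Notional queued operation: retried until it succeeds or exhausts its attempts.
interface QueuedOperation {
  id: string;
  attempts: number;
  maxAttempts: number;
  run: () => Promise<void>;
}

const pending: QueuedOperation[] = [];
const failedPermanently: QueuedOperation[] = [];

async function processQueue(): Promise<void> {
  while (pending.length > 0) {
    const op = pending.shift()!;
    try {
      await op.run();
      console.log(`operation ${op.id} completed after ${op.attempts + 1} attempt(s)`);
    } catch (err) {
      op.attempts += 1;
      if (op.attempts < op.maxAttempts) {
        pending.push(op);             // re-queue rather than lose the event
      } else {
        failedPermanently.push(op);   // give up, but record and explain the failure
        console.error(`operation ${op.id} failed permanently: ${String(err)}`);
      }
    }
  }
}
```
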
Level 5 – Most Abstract

Documentation / Support / Meta-

While the lower levels are more about ongoing and real-time status, this level is about governance and design. Some of the functions here, particularly the stories, defects, tracking, and messaging, are often handled by enterprise-type systems like JIRA and Rally, but they could be handled by a custom system. What's most important here is to manage links to documentation in whatever style the organization deems appropriate. Definition-of-done calculations could be performed here in such a way that builds, deployments, or advancements to higher environments can only be performed if all the required elements are updated and verified. Required items could include documentation (written or updated, and approved), management permission given, links to the Requirements Traceability Matrix updated (especially to the relevant requirements), announcements made by the environment or release manager or other parties, and so on.
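
As a quick sketch of how such a definition-of-done gate could be automated, the function below only allows a promotion when every required element has been updated and verified, and reports exactly which items are still missing. The checklist fields are notional; each organization would define its own.

```typescript
// Notional definition-of-done checklist for promoting a package to a higher environment.
interface DoneChecklist {
  documentationApproved: boolean;
  managementApproved: boolean;
  rtmLinksUpdated: boolean;    // Requirements Traceability Matrix links
  announcementsMade: boolean;
}

function promotionAllowed(checklist: DoneChecklist): { allowed: boolean; missing: string[] } {
  const missing = Object.entries(checklist)
    .filter(([, done]) => !done)
    .map(([item]) => item);
  return { allowed: missing.length === 0, missing };
}

// Example: the build/deploy tooling refuses to advance the package and says why.
console.log(promotionAllowed({
  documentationApproved: true,
  managementApproved: true,
  rtmLinksUpdated: false,
  announcementsMade: true,
}));
// -> { allowed: false, missing: ["rtmLinksUpdated"] }
```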

Links to discussion pages are also important, since they can point directly to the relevant forums (on Slack, for example). This makes a lot more of the conversational and institutional memory available and accessible over time. Remember, if you can't find it, it ain't worth much.

  • management permissions / approvals
  • documentation requirements
  • code standards reviews / requirements
  • defect logged / updated / assigned
  • user story created / assigned
  • POC / responsible party
  • routing of to-do items (business process automation)
  • documentation, institutional memory
  • links to related defects
  • links to discussion / interactive / Slack pages
  • introduction and training information (Help)
  • date & description history of all mods (even config stuff)
  • business case / requirement for change: linked forward and backward as part of the Requirements Traceability Matrix.

Visually, you could think of what I've described as looking like the figure below. Some of the information is maintained in some kind of live data repository, from which the web-based, user-level-access-controlled, responsive interface is generated, while other information is accessed by following links to external systems and capabilities.

And that’s just the information for a single service in a single environment. The entire infrastructure (minus the external systems) contains a similar stack for every service in every environment.

There’s a ton more I could write about one-to-many and many-to-one mapping in certain situations (e.g., when multiple services run on a single machine in a given environment, or when a single service is hosted on multiple machines in a given environment, usually production, to handle high traffic volumes), and about how links and displays could be filtered as packages move through the pipeline towards live deployment (by date or build in some way), but this post is more than long enough as it is. Don’t you agree?

Posted in Tools and methods | Tagged , , , | Leave a comment

A Few Interesting Talks on Architecture and Microservices

These talks were recommended by Chase O’Neill, a very talented colleague from my time at Universal, and they proved to be highly interesting. I’ve thought about things like this throughout my career, and they’re obviously more germane given where things are currently heading, so I’m sharing them here.

https://www.youtube.com/watch?v=KPtLbSEFe6c

https://www.youtube.com/watch?v=STKCRSUsyP0

https://www.youtube.com/watch?v=MrV0DqTqpFU

Let me know if you found them as interesting as I did.

Posted in Uncategorized | Leave a comment