 
          | Industries 
 | 
 | 
 
 | 
 
        My friend Richard Frederick gave an excellent series of twenty talks on data from August 2020 through January 2021, and I highly encourage you to review them at your convenience.
I've done this for a long time and I learned a lot from him.
The videos and PDFs from his talks may be found here.
 Intended Use
  Intended Use Conceptual Model (As-Is State)
  Conceptual Model (As-Is State) Data Sources, Collection, and Conditioning
  Data Sources, Collection, and Conditioning Requirements (To-Be State: Abstract)
  Requirements (To-Be State: Abstract) Design (To-Be State: Detailed)
  Design (To-Be State: Detailed) Implementation
  Implementation Test
  Test 
        Agile is Dead (in a rigorously formal sense)
 
         
         
        Link to detailed discussion.
 
         
         
        Link to detailed discussion.
The engagement is what we do to effect a change that serves customers.
 
          The system is what we analyze and either build or change to serve customers.
 
          The solution is the change we make to serve customers.
 
        
       
        Links to detailed discussions.
Some basic descriptive information can be collected during discovery, but the more detailed information is gleaned during data collection.
Examples of data that describe systems and entities include:
Think adjectives and adverbs as opposed to nouns and verbs.
This includes current and historical information about how the process is running.
Current information can be used for ongoing control, feedback, resource management (phone lights, grocery checkouts), identification of potential problems, safety, training, standards of performance, and so on.
Historical information can be used to show contractual terms are being met, for research, and to identify changes over time, in both the process and the entities processed (seasonal chip energy, port arrivals).
 
   
         
        Decisions may be driven on the basis of changeable parameters. Numeric thresholds are often used to this purpose, but modularly defined procedural rules can also be a form of data.
Many such rules will be identified in the conceptual model and requirements phases, but implementation SMEs might identify more in the design and implementation phases.
Link to detailed discussion.
Consideration of ultimate outputs really happens through every phase.
Not all systems output data as such. Data may only be used to run the process. Be mindful of the scope and scale of the current engagement.
Some forms of output are data but aren't really data in this sense. For example, a movie service may output a huge volume of data as audio/video streams, but for this discussion we're only concerned with the descriptive and operating metadata (e.g., viewing rates, bandwidth limitations, network problems, permission foul-ups, video quality, etc.)
Organizations may provide different forms of output for different interests and at different scales (production data, pricing data, financial results, regulatory reporting, marketing, customer service follow-up, etc.).
Whether a data items is considered operating data or output data may only be a matter of the scale at which you're thinking and the purposes the items serve, but it's good to have this distinction in mind to help you see things from more angles.
Every process and component can involve any and all classes of data.
 
      |   | Identify... 
 | 
Discovery is a qualitative process. It identifies nouns (things) and verbs (actions, transformations, decisions, calculations).
| It's how you go from this... | ...to this. | |
|   |   |   | 
Discovery comes first, so you know what data you need to collect.
Elicitation is discovering the customer's needs. Discovery is about mapping the customer's existing process — or — it's what you design by working back from the desired outcomes.
Link to detailed discussion.
Data Collection is a quantitative process. It identifies adjectives (colors, dimensions, rates, times, volumes, capacities, materials, properties).
| It's how you go from this... | ...to this. | |
|   |   |   | 
Think about what you'd need to know about a car in the context of traffic, parking, service (at a gas station or border crossing), repair, insurance, design, safety, manufacturing, marketing, finance, or anything else.
|   | 
 | 
It doesn't happen by accident!
|   | 
 | 
Link to detailed discussion.
Link to detailed discussion.
There are many forms of this (rather chicken-and-egg) equation, but this one is fairly common:
n >= (z ⋅ σ / MOE)2
where:
n = minimum sample size (should typically be at least 30)
z = z-score (e.g., 1.96 for 95% confidence interval)
σ = sample standard deviation
MOE = measure of effectiveness (e.g., difference between sample and population means in units of whatever you're measuring)
Document all procedures and assumptions!
Link to detailed discussion.
Complete data may not be available, and for good reason. Keeping records is sometimes rightly judged to be less important than accomplishing other tasks.
Here are some options* for dealing with missing data (from Udemy course R Programming: Advanced Analytics In R For Data Science by Kirill Eremenko):
* These mostly apply to individual values in larger data records.
Document all procedures and assumptions!
| Wreckage of Mars Polar Lander: Why To Make Sure Your Units Match | Data from different sources may need to be regularized so they all have the same units, formats, and so on. (This is a big part of ETL efforts.) Sanity checks should be performed for internal consistency (e.g., a month's worth of hourly totals should match the total reported for the month). Conversely, analysts should be aware that seasonality and similar effects mean subsets of larger collections of data may vary over time. Data items should be reviewed to see if reporting methods or formats have changed over time. Data sources should be documented for points of contact, frequency of issue, permissions and sign-offs, procedures for obtaining alternate data, and so on. I have been known to embed reference information in various systems, especially in systems that automated the production of coefficients, documentation, code, and simulations. | 
...as data is becoming more voluminous, and as what can be done with it is more powerful and valuable.
 
      This presentation and other information can be found at my website:
E-mail: bob@rpchurchill.com
LinkedIn: linkedin.com/in/robertpchurchill