Explicitly defining assumptions and requirements is the gate to adding anything
The first version of Layer 1 selected signals based on what I thought failure looked like: missing values, constants, duplicates. Not wrong, just incomplete in a way I couldn't see yet.
Context
- Layer 1 (deployed and working) is the part of the diagnostic system that detects structural data issues.
- These are issues that break downstream tasks regardless of model choice.
- It extracts information from the data in the form of signals and uses them to make decisions and detect problems.
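To make "signals" concrete, here is a minimal sketch of the kind of integrity signals the first version computed. The function name and exact signal definitions are my own illustration, not the actual Layer 1 code:

```python
import pandas as pd

def integrity_signals(df: pd.DataFrame) -> dict:
    """Crude structural signals: missing values, constant columns, duplicate rows.
    Illustrative only; not the real Layer 1 implementation."""
    n_rows = len(df)
    return {
        # Fraction of all cells that are missing
        "missing_ratio": df.isna().mean().mean(),
        # Columns that carry no information (a single unique non-null value)
        "constant_cols": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
        # Fraction of rows that are exact duplicates of an earlier row
        "duplicate_row_ratio": df.duplicated().mean() if n_rows else 0.0,
    }

df = pd.DataFrame({"a": [1, 1, 1], "b": [1.0, None, 3.0]})
print(integrity_signals(df))
```

Each signal is just a cheap, model-agnostic statistic; the decision logic lives elsewhere.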
What Happened?
- To build Layer 1, I started with the obvious stuff: missing data, duplicates, mixed types, etc. It worked fine, but I had a hunch it might not be enough.
- Soon enough this was confirmed with AI: the current version of Layer 1 (L1) is not nearly enough. It only detects data integrity issues (is the data trustworthy at all?).
- But it completely missed other areas. If the target is missing or garbage, or the data is pure noise, it still passes. So in the next version I added these failure modes:
- target sanity (is the target usable?)
- sample adequacy (are there effective independent samples to learn from?)
  - feature–target relationship (does the target behave consistently for similar features?)
NOTE: failure mode = a specific way in which the data is broken. It's the root cause.
- After adding these, I think all the failure modes (areas where the system can fail) are covered. I call these failure modes "Dimensions" (just a fancy term).
- These Dimensions each represent a class of problems. To actually detect the problems, I need multiple signals, each covering a different part of its respective failure mode.
- e.g. sample dependency and feature variance for sample adequacy; target degeneracy and target entropy for target sanity.
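A sketch of what a couple of these signals might look like: target entropy and target degeneracy for target sanity. The formulas are standard (Shannon entropy), but the function names and thresholds are my own illustration:

```python
import math
from collections import Counter

def target_entropy(y: list) -> float:
    """Shannon entropy (bits) of a classification target.
    Near 0 means one class dominates and there is little to learn."""
    counts = Counter(y)
    n = len(y)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def is_target_degenerate(y: list) -> bool:
    """Target sanity signal: a single-valued target is unlearnable."""
    return len(set(y)) <= 1

print(target_entropy(["a", "a", "b", "b"]))   # balanced binary target -> 1.0 bit
print(is_target_degenerate(["a", "a", "a"]))  # True: only one class present
```

Multiple cheap signals like these, taken together, characterize one Dimension.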
What I Found
- When adding anything to a system or codebase, it should be grounded in the core principles of what the system should do and how it should work.
- Your system should be architecturally flexible enough that when you encounter an entirely new class of failure mode, you don't have to redesign everything; you just append it.
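One way to get that flexibility is to treat each Dimension as a pluggable set of signal checks, so a new failure mode is just one more registration. This is a sketch of the idea, not the actual architecture; all names here are hypothetical:

```python
from typing import Callable, Dict, List

# Registry: dimension name -> list of signal functions.
# Each signal returns (signal_name, passed?).
REGISTRY: Dict[str, List[Callable]] = {}

def signal(dimension: str):
    """Decorator that registers a signal check under a failure-mode Dimension."""
    def wrap(fn):
        REGISTRY.setdefault(dimension, []).append(fn)
        return fn
    return wrap

@signal("data_integrity")
def no_all_missing_columns(data: dict):
    # A column where every value is None carries no information
    return "no_all_missing_columns", all(
        any(v is not None for v in col) for col in data.values()
    )

@signal("target_sanity")
def target_not_constant(data: dict):
    return "target_not_constant", len(set(data.get("target", []))) > 1

def run_all(data: dict) -> dict:
    """Adding a new Dimension later means registering new signals; this loop never changes."""
    return {dim: [fn(data) for fn in checks] for dim, checks in REGISTRY.items()}

data = {"feature": [1, 2, 3], "target": [0, 1, 0]}
print(run_all(data))
```

The point of the pattern: the runner is closed for modification, and each Dimension is open for extension.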
Constraints
- It assumes the target is present and works for supervised ML tasks (for now), so tabular data.
- The info from the signals is then processed, and the final decision is based on crude, rule-based, universally accepted heuristics. These are rigorously tested but context-agnostic.
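A sketch of what "crude, rule-based heuristics" could look like in practice. The signal names and thresholds below are illustrative placeholders, not the system's actual values:

```python
def decide(signals: dict) -> str:
    """Context-agnostic rules mapping signal values to a verdict.
    Thresholds are illustrative, not the real ones."""
    if signals["missing_ratio"] > 0.5:        # more than half the cells missing
        return "fail: too much missing data"
    if signals["duplicate_row_ratio"] > 0.3:  # heavy duplication inflates sample size
        return "fail: excessive duplicate rows"
    if signals["target_entropy"] == 0.0:      # degenerate target, nothing to learn
        return "fail: target is constant"
    return "pass"

print(decide({"missing_ratio": 0.1, "duplicate_row_ratio": 0.0, "target_entropy": 1.0}))
```

Being context-agnostic is exactly the trade-off the constraint describes: the same thresholds apply to every dataset, whether or not they fit its domain.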