Sample-size rules
mureo’s anomaly detector does not alert on every metric shift. It refuses to fire below a set of sample-size thresholds, because below those thresholds a single atypical day of traffic can move a rate metric enough to look like a genuine change.
This page documents the thresholds, the reasoning, and the operator override surface.
The thresholds
| Metric | Minimum sample | Alert direction |
|---|---|---|
| CPA (cost per acquisition) | 30 conversions per day | Spike (up) |
| CTR (click-through rate) | 1000 impressions per day | Drop (down) |
| Zero spend | None | Flag on a previously-spending campaign |
Below the minimum, the detector does not return an anomaly. The
metric is surfaced under a monitor flag in the /daily-check
report, without a recommended action.
At or above the minimum, the detector applies a severity tier (see below).
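The gate described above can be sketched as a single predicate. This is a minimal illustration, not mureo's implementation: the function name, the metric keys, and the inclusive comparison (a sample exactly at the minimum clears the gate) are assumptions made for the example.

```python
# Default floors taken from the table above.
# All names here are hypothetical, not mureo's API.
MIN_CONVERSIONS = 30
MIN_IMPRESSIONS = 1000


def clears_sample_gate(
    metric: str,
    conversions: int = 0,
    impressions: int = 0,
    min_conversions: int = MIN_CONVERSIONS,
    min_impressions: int = MIN_IMPRESSIONS,
) -> bool:
    """Return True when the day's sample is large enough to evaluate the metric."""
    if metric == "cpa":
        # Assumption: a count exactly at the floor clears the gate.
        return conversions >= min_conversions
    if metric == "ctr":
        return impressions >= min_impressions
    if metric == "zero_spend":
        # Zero spend has no sample gate.
        return True
    raise ValueError(f"unknown metric: {metric}")
```

A metric that fails this check would be surfaced under the monitor flag rather than scored.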
Why these numbers
CPA: 30 conversions
CPA is a ratio of dollars to conversions. At low conversion counts, a single atypical transaction moves the ratio enough to produce a false signal:
- At 5 conversions, one outlier shifts CPA by roughly 20%.
- At 10 conversions, one outlier shifts CPA by roughly 10%.
- At 30 conversions, one outlier shifts CPA by roughly 3%.
Thirty is the point at which a single-day CPA reading no longer needs a story. Below that, “yesterday was weird” has to be the operator’s default assumption, not “something is wrong.”
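One way to reproduce the rough figures above is to hold spend fixed and ask how far CPA moves when a single conversion is lost or gained. That model of an "outlier" is an assumption made for this sketch, and the function name is hypothetical.

```python
from typing import Tuple


def cpa_shift_from_one_conversion(conversions: int) -> Tuple[float, float]:
    """Relative CPA change when one conversion is lost or gained at fixed spend."""
    if conversions < 2:
        raise ValueError("need at least 2 conversions")
    # CPA = spend / conversions, so spend cancels out of the relative change.
    lost = conversions / (conversions - 1) - 1    # one fewer conversion: CPA rises
    gained = conversions / (conversions + 1) - 1  # one more conversion: CPA falls
    return lost, gained


for n in (5, 10, 30):
    lost, gained = cpa_shift_from_one_conversion(n)
    print(f"{n:>2} conversions: {lost:+.1%} / {gained:+.1%}")
```

Under this model the shifts are +25.0% / -16.7% at 5 conversions, +11.1% / -9.1% at 10, and +3.4% / -3.2% at 30, which bracket the roughly 20%, 10%, and 3% quoted above.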
Thirty is the floor used by the mureo-learning skill because it is
conservative for most consumer and B2B accounts. A high-frequency ad
account (large DTC) may set its local gate higher. A low-frequency
account (enterprise B2B) may set it lower, accepting more noise in
exchange for faster detection of real shifts.
CTR: 1000 impressions
CTR is a ratio of clicks to impressions. Impressions are plentiful compared to conversions. An active campaign usually has thousands of impressions per day. The 1000 floor addresses a different failure mode: low-delivery days where the impression mix is dominated by a single audience segment, ad slot, or device tier, making the CTR number reflect the mix rather than creative fit.
Below 1000 daily impressions, CTR reads are suppressed because they are more likely a delivery artifact than a creative-quality signal. At 1000+, the mix evens out enough for CTR to mean what it usually means.
Zero spend: no minimum
Zero spend is an absolute, not a rate. If the campaign spent a non-zero amount yesterday and zero today, that is a signal regardless of sample size: something structural (budget cap, paused ad group, billing issue) has changed. The detector emits this as CRITICAL without any sample gate.
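Because the zero-spend rule has no sample gate, it reduces to a comparison of two spend values. A minimal sketch, with a hypothetical function name and the assumption that "previously spending" means non-zero spend on the prior day:

```python
from typing import Optional


def zero_spend_flag(previous_spend: float, current_spend: float) -> Optional[str]:
    """Return "CRITICAL" when a previously-spending campaign drops to zero spend."""
    if previous_spend > 0 and current_spend == 0:
        return "CRITICAL"
    # A campaign that was already at zero, or is still spending, is not flagged here.
    return None
```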
Severity tiers (for metrics that clear the gate)
Once the sample gate is cleared, the detector assigns one of two tiers:
| Metric | HIGH | CRITICAL |
|---|---|---|
| CPA spike | ≥ 1.5× baseline | ≥ 2.0× baseline |
| CTR drop | ≤ 0.5× baseline | ≤ 0.3× baseline |
Two tiers, not five. A finer gradation would imply a precision the baseline math does not have.
- HIGH — investigate before the next daily check; likely structural (bid change, new competitor, landing page break).
- CRITICAL — pause-worthy without explanation; budget is actively burning against something that stopped working.
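The tier table above maps directly onto a ratio against the baseline. The following is a sketch under stated assumptions: the function name and metric keys are hypothetical, and the metric is assumed to have already cleared the sample gate and to have a positive baseline.

```python
from typing import Optional


def severity(metric: str, value: float, baseline: float) -> Optional[str]:
    """Classify a gated metric as HIGH, CRITICAL, or None using the fixed multipliers."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = value / baseline
    if metric == "cpa":
        # CPA alerts on spikes: check the stricter tier first.
        if ratio >= 2.0:
            return "CRITICAL"
        if ratio >= 1.5:
            return "HIGH"
        return None
    if metric == "ctr":
        # CTR alerts on drops: check the stricter tier first.
        if ratio <= 0.3:
            return "CRITICAL"
        if ratio <= 0.5:
            return "HIGH"
        return None
    raise ValueError(f"unknown metric: {metric}")
```

For example, a CPA of 80 against a baseline of 50 is a 1.6× ratio and lands in HIGH, while a CPA of 100 against the same baseline reaches 2.0× and lands in CRITICAL.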
Baseline construction
The comparison baseline is the median of the same metric over a recent window of snapshots for the same campaign. Median, not mean, because one bad day should not move the reference.
Baseline window and inclusion rules:
- Window: last 14 daily snapshots for the same campaign.
- Excluded: snapshots tagged with `known-promotion`, `manual-intervention`, or `post-rollback` (operator-flagged as not representative).
- Minimum window size: 7 snapshots. Below this, no baseline is built; the metric is surfaced as `building-baseline` rather than evaluated.
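The window and inclusion rules above can be sketched with the standard library's `statistics.median`. The snapshot shape (a dict with `value` and `tags`) and the function name are hypothetical. The sketch also assumes the 7-snapshot minimum is counted after excluded snapshots are dropped, which the rules above do not state explicitly.

```python
from statistics import median
from typing import Dict, List, Optional

EXCLUDED_TAGS = {"known-promotion", "manual-intervention", "post-rollback"}
WINDOW = 14        # last 14 daily snapshots
MIN_SNAPSHOTS = 7  # below this, no baseline is built


def build_baseline(snapshots: List[Dict]) -> Optional[float]:
    """Median of usable values in the window, or None while the baseline is building.

    `snapshots` is ordered oldest to newest; each item looks like
    {"value": 42.0, "tags": {"known-promotion"}} (hypothetical shape).
    """
    recent = snapshots[-WINDOW:]
    usable = [
        s["value"]
        for s in recent
        if not (EXCLUDED_TAGS & set(s.get("tags", ())))
    ]
    if len(usable) < MIN_SNAPSHOTS:
        return None  # caller would surface the metric as building-baseline
    return median(usable)
```

With thirteen days at 50 and one day at 500, the median stays at 50, whereas a mean would be pulled up to about 82.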
Operator overrides
Two supported overrides:
- Per-invocation override: the `analysis.anomalies.check` MCP tool accepts optional `min_conversions` and `min_impressions` parameters. Passing smaller values lowers the gate for that run only. Useful for niche accounts where 30 conversions is an unreasonable floor.
- Strategy-level override: setting `anomaly.sample_size_override` in `STRATEGY.md` changes the defaults for every workflow run against that account. The detector surfaces the override in every alert so downstream reviewers can see it was in effect.
Overrides adjust the sample-size floor only. They do not change the severity tier thresholds (1.5× / 2.0× / 0.5× / 0.3×). Those are fixed because adjusting them would change the semantics of HIGH and CRITICAL across operators.
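The sketch below shows one way the two override layers could combine into the floors used for a run. It is illustrative only: the function name and the dict representation of the strategy-level setting are hypothetical, and the precedence (per-invocation values win over strategy-level values, which win over defaults) is an assumption inferred from "for that run only", not documented behavior.

```python
from typing import Dict, Optional, Tuple

DEFAULT_FLOORS = {"min_conversions": 30, "min_impressions": 1000}


def effective_floors(
    strategy_override: Optional[Dict[str, int]] = None,
    min_conversions: Optional[int] = None,
    min_impressions: Optional[int] = None,
) -> Tuple[Dict[str, int], bool]:
    """Return the floors for this run and whether any override is in effect."""
    floors = dict(DEFAULT_FLOORS)
    # Strategy-level values replace the defaults for the account.
    if strategy_override:
        floors.update({k: v for k, v in strategy_override.items() if k in floors})
    # Per-invocation values apply on top, for this run only (assumed precedence).
    if min_conversions is not None:
        floors["min_conversions"] = min_conversions
    if min_impressions is not None:
        floors["min_impressions"] = min_impressions
    # The flag lets an alert record that non-default floors were used.
    overridden = floors != DEFAULT_FLOORS
    return floors, overridden
```

Only the sample-size floors pass through this function; the severity multipliers stay fixed, as described above.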
When the rules are wrong
The rules are tuned for the median account. They are wrong, and should be overridden, in at least these cases:
- Known promotional pulse. A 48-hour flash sale that doubles CPA in hour two is the promotion working (high CPC auction, high volume), not a spike. Tag the snapshot `known-promotion` or teach the framework with `/learn`.
- Attribution lag. View-through, app-install, and offline conversion imports arrive 1–7 days late. Same-day CPA reads inflate because the numerator (spend) is real but the denominator (conversions) is partial. Apply the lookback-window suppression at the tool-invocation layer.
- Sample-gate boundary. A genuinely low-frequency account (enterprise B2B, high LTV, 5–10 conversions per day) needs a lower floor. Supply `min_conversions=10` or similar and accept more noise in exchange for any signal at all.
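The attribution-lag case is easiest to see with numbers. The figures below are invented for illustration (they are not measurements from any account), and the helper name is hypothetical. The sketch shows how a partial conversion count inflates a same-day CPA reading.

```python
def cpa(spend: float, conversions: int) -> float:
    """Cost per acquisition: spend divided by conversions."""
    if conversions <= 0:
        raise ValueError("conversions must be positive")
    return spend / conversions


spend = 1000.0               # spend is fully reported on the day
same_day = cpa(spend, 25)    # only 25 conversions have arrived so far
settled = cpa(spend, 40)     # 40 conversions once late imports land
inflation = same_day / settled

print(f"same-day CPA {same_day:.2f}, settled CPA {settled:.2f}, ratio {inflation:.2f}x")
```

Here the same-day reading is 40.00 against a settled 25.00, a 1.6× ratio. Measured against a baseline near the settled value, that would cross the HIGH threshold even though nothing about the campaign changed.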
Source
- `mureo/analysis/anomaly_detector.py`: implementation
- `mureo-learning` skill: the statistical-thinking rule set the thresholds derive from
- Blog: How AI agents misdiagnose CPA spikes, a narrative introduction to the same material
Versioning
Thresholds are pinned to the OSS release. mureo 0.9.21 (current at
time of writing) uses the values on this page. The diagnostic
knowledge base may retune the defaults as evidence accumulates; when
that happens, the release CHANGELOG will reference this page and
the change will be announced on the blog.