OEE Accuracy
How Rela-ai filters structural biases (dead sensors, shared PLCs, fabricated downtime) so OEE reflects real operation — not a cosmetic number.
The OEE a dashboard shows is only useful if it reflects reality. An audit surfaced several places where the traditional computation diverges from real operation: downtime fabricated by heuristics, dead sensors producing a fake 0%, shared PLCs inflating metrics via double-counting, broken calibration hidden behind a 100% cap. This page documents each one and how it was closed.
What is it for?
- Know whether the OEE number you're looking at is trustworthy or contaminated by wrong config.
- Audit the calculation point by point against the Nakajima standard.
- Spot dead sensors or shared PLCs silently inflating or deflating the number.
How it works
Each bias source is closed by an explicit filter and a response-JSON flag. If the calculation is affected, the response says so: status=stale_source, downtime_estimation=heuristic, performance_capped=true. The operator sees the number AND knows whether to trust it.
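A dashboard consuming these flags might gate how it renders the number. The following is a minimal sketch, assuming the response is a plain dict using the flag names documented on this page; the `trust_warnings` helper is hypothetical and not part of Rela-ai.

```python
def trust_warnings(response: dict) -> list[str]:
    """Return reasons to distrust an OEE response (empty list means clean)."""
    warnings = []
    if response.get("status") == "stale_source":
        warnings.append("count source is stale; OEE was not computed")
    if response.get("downtime_estimation") == "heuristic":
        warnings.append("downtime is estimated, not measured")
    if response.get("performance_capped"):
        warnings.append("raw performance exceeds 100%; check calibration")
    return warnings
```

An empty list means none of the documented flags fired; any entry is a reason to show a warning badge next to the number.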
Executive summary
Verifiable OEE: every deviation from the Nakajima standard is covered by an explicit filter and a response-side flag. The operator sees reality — including when reality is contaminated by bad config.
| Before | Now |
|---|---|
| Hardcoded 5 min per downtime event (a 2h stop reported as 5 min) | Real duration from metadata.duration_seconds + downtime_estimation flag |
| Dead sensor produced a fake 0% OEE | Staleness filter + status=stale_source |
| PLC shared across lines was double-counted | Asset-ID scope in the $match |
| Performance > 100% silently capped | performance_pct_raw + performance_capped flag |
| Buffer/store-and-forward misaligned periods | metadata.timestamp over created_at |
| Orphan configs (asset/source deleted) | 404 validation on configure |
| Threshold changes rewrote history silently | Fire-and-forget audit trail |
| Shifts and PM tanked availability | shift_pattern + subtract_scheduled_maintenance |
| Trend truncated on a transient error | Robust loop + calendar-day windows |
| No Pareto or version stamp | downtime_breakdown + config_version |
Closed findings
Critical (H)
OEE-H1 — Real downtime duration
The legacy computation charged a flat 5 min per downtime event regardless of its real length. A 2-hour stop reported as a single event counted as 5 min; a 10-second stop with 12 rebounds counted as 60 min. The resulting Availability bore no reliable relation to reality.
The resolver now follows a priority order:
- Sum of metadata.duration_seconds (first-class, measured).
- Sum of metadata.duration_minutes (convenience shortcut).
- Legacy 5 min × event_count heuristic (backward compat, flagged).
The response carries downtime_estimation with measured, heuristic or none so the dashboard can warn when the number is approximate.
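The priority order can be sketched as follows. This is an illustration only: the `resolve_downtime` name is hypothetical, and three behaviours are assumptions not stated above (events lacking a duration field are ignored once any event carries one, `duration_minutes` is labelled `measured`, and `none` means there were no downtime events).

```python
def resolve_downtime(events: list[dict]) -> tuple[float, str]:
    """Return (downtime_minutes, downtime_estimation) using the priority order."""
    if not events:
        return 0.0, "none"  # assumption: "none" means no downtime events
    metas = [e.get("metadata") or {} for e in events]

    # 1. Measured duration in seconds takes precedence.
    seconds = [m["duration_seconds"] for m in metas if "duration_seconds" in m]
    if seconds:
        return sum(seconds) / 60.0, "measured"

    # 2. Convenience shortcut: duration already expressed in minutes.
    minutes = [m["duration_minutes"] for m in metas if "duration_minutes" in m]
    if minutes:
        return float(sum(minutes)), "measured"

    # 3. Legacy fallback: flat 5 minutes per event, flagged as a heuristic.
    return 5.0 * len(events), "heuristic"
```

With this sketch a single 2-hour stop carrying `duration_seconds: 7200` resolves to 120 measured minutes, while 12 events with no duration still fall back to the flagged 60-minute heuristic.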
{
"downtime_minutes": 90.0,
"downtime_estimation": "measured"
}
OEE-H2 — Staleness guard on count source
A count source flagged stale by the sensor_watchdog no longer feeds the calculation. Instead the response short-circuits with status="stale_source" and a message pointing to the watchdog. Prevents a fake 0% OEE when the sensor died but the line is still producing.
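A guard of this kind might look like the sketch below. It assumes the source document exposes a `status` field with the values `connected` or `stale`; that field name and the `guard_count_source` helper are hypothetical.

```python
from typing import Optional


def guard_count_source(source: dict) -> Optional[dict]:
    """Return a short-circuit response when the source is stale, else None."""
    # Assumption: the watchdog writes "stale" into a field named "status".
    if source.get("status") == "stale":
        return {
            "status": "stale_source",
            "message": "Count source is stale; OEE not computed.",
        }
    return None  # source is healthy; the caller continues with the calculation
```

Returning early with an explicit status, rather than computing on zero counts, is what keeps a dead sensor from showing up as 0% OEE.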
{
"status": "stale_source",
"message": "Count source is stale; OEE not computed."
}
OEE-H3 — Asset-id scoping
When two assets share the same count_source_id (typical: two lines behind the same PLC), every event was counted once per asset — Performance and Quality symmetrically inflated, downtime doubled. The $match now includes an $or accepting events with asset_id at root, inside metadata, or absent (legacy single-asset).
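As a rough illustration, the scoped filter could be built like this. The event field `source_id` and the helper name are assumptions, and the third branch reads "absent" as "no asset_id at either location".

```python
def build_asset_match(asset_id: str, count_source_id: str) -> dict:
    """Build a $match stage body scoped to one asset behind a shared source."""
    return {
        "source_id": count_source_id,  # hypothetical field name on the event
        "$or": [
            {"asset_id": asset_id},           # asset_id at the document root
            {"metadata.asset_id": asset_id},  # asset_id nested in metadata
            {                                 # legacy single-asset events
                "asset_id": {"$exists": False},
                "metadata.asset_id": {"$exists": False},
            },
        ],
    }
```

The resulting dict would be passed as the body of a `$match` stage, for example `[{"$match": build_asset_match("line-01", "plc-7")}]`, where both identifiers are made-up values.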
OEE-H4 — Raw performance + performance_capped
A performance above 100% is not a glitch; it is a strong signal of broken calibration (a misconfigured ideal cycle time, or double-counting). Legacy code silently capped it at 100%. The response now exposes both values:
{
"performance_pct": 100.0,
"performance_pct_raw": 173.6,
"performance_capped": true
}
The OEE KPI still uses the capped value (Nakajima defines 0-100), but the dashboard can render a "calibration check" badge when raw > 100.
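A minimal sketch of the raw and capped values, assuming the standard definition of Performance as (ideal cycle time × total count) / operating time; the function name and rounding to one decimal are illustrative choices, not confirmed behaviour.

```python
def performance(ideal_cycle_time_seconds: float, total_count: int,
                operating_minutes: float) -> dict:
    """Compute raw and capped performance, flagging when the cap was applied."""
    if operating_minutes <= 0:
        raw = 0.0  # assumption: no operating time yields 0 rather than an error
    else:
        operating_seconds = operating_minutes * 60.0
        raw = 100.0 * ideal_cycle_time_seconds * total_count / operating_seconds
    return {
        "performance_pct": round(min(raw, 100.0), 1),  # value used by the KPI
        "performance_pct_raw": round(raw, 1),          # uncapped, for diagnosis
        "performance_capped": raw > 100.0,
    }
```

For example, 1000 parts at an ideal cycle time of 2.0 s need 2000 s of ideal time; if only 20 min (1200 s) of operating time were recorded, the raw value is about 166.7% and the cap flag is set.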
Medium (M)
OEE-M1 — Real timestamp over created_at
$match prefers metadata.timestamp (ISO) when the event carries one; created_at is the fallback. Corrects the buffer/store-and-forward bias: a reading that fired at 14:00 but was re-ingested at 14:45 now lands in the 14:00 bucket, not 14:45.
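The same preference, expressed on the Python side rather than inside the aggregation, might look like this. The helper name is hypothetical, and treating a timezone-less ISO string as UTC is an assumption.

```python
from datetime import datetime, timezone


def effective_timestamp(event: dict) -> datetime:
    """Prefer the device-side metadata.timestamp; fall back to created_at."""
    raw = (event.get("metadata") or {}).get("timestamp")
    if raw:
        try:
            ts = datetime.fromisoformat(raw)
        except (TypeError, ValueError):
            ts = None  # unparseable value: fall through to created_at
        if ts is not None:
            # Assumption: a naive timestamp is interpreted as UTC.
            return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)
    return event["created_at"]
```

A reading stamped 14:00 on the device but stored at 14:45 resolves to 14:00 here, so it is bucketed by when it happened rather than when it arrived.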
OEE-M2 — shift_pattern per day
New config field:
{
"shift_pattern": {
"hours_by_weekday": {"1": 8, "2": 8, "3": 8, "4": 8, "5": 8}
}
}
ISO weekday: 1=Monday, 7=Sunday. A plant running Mon-Fri 8h and dark on weekends sees planned=0 min on Saturday, not 0% OEE against a phantom 480-min plan.
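Resolving planned minutes for a given day could be sketched as below. The helper is hypothetical; falling back to `planned_production_hours` when no `shift_pattern` is configured is an assumption.

```python
from datetime import date


def planned_minutes(day: date, config: dict) -> float:
    """Planned production minutes for one calendar day."""
    pattern = config.get("shift_pattern")
    if pattern:
        # Keys are ISO weekdays as strings: "1" = Monday ... "7" = Sunday.
        # A weekday missing from the map means the plant is dark that day.
        hours = pattern.get("hours_by_weekday", {}).get(str(day.isoweekday()), 0)
    else:
        # Assumption: without a pattern, the flat daily plan applies.
        hours = config.get("planned_production_hours", 0)
    return hours * 60.0
```

With the Mon-Fri pattern shown above, Friday 2026-04-17 yields 480 planned minutes and Saturday 2026-04-18 yields 0.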
OEE-M3 — Scheduled PM subtracts from planned
New flag subtract_scheduled_maintenance: true (default). _maintenance_plans with next_due_at inside the period reduce planned_minutes instead of counting as downtime. A 4h PM inside an 8h shift no longer reports Availability=50% (as if the PM were a failure); planned becomes 240 min and Availability stays 100% for the remaining 4h of real production. Opt-out available for tenants whose internal convention keeps PM in the downtime bucket.
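The effect of the flag on Availability can be illustrated with a small sketch. The function and parameter names are hypothetical, and returning `None` when nothing is planned is an assumption.

```python
def availability(planned_minutes: float, downtime_minutes: float,
                 pm_minutes: float, subtract_scheduled_maintenance: bool = True) -> dict:
    """Availability with scheduled PM either removed from the plan or kept as downtime."""
    if subtract_scheduled_maintenance:
        planned = max(planned_minutes - pm_minutes, 0.0)  # PM shrinks the plan
        downtime = downtime_minutes
    else:
        planned = planned_minutes
        downtime = downtime_minutes + pm_minutes          # PM counts as downtime
    if planned <= 0:
        # Assumption: no planned time means availability is undefined.
        return {"planned_minutes": 0.0, "availability_pct": None}
    operating = max(planned - downtime, 0.0)
    return {
        "planned_minutes": planned,
        "availability_pct": round(100.0 * operating / planned, 1),
    }
```

For the 4h PM inside an 8h shift, the default gives 240 planned minutes at 100% availability, while opting out gives 480 planned minutes at 50%.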
OEE-M4 — Validate asset_id + count_source_id
POST /configure returns 404 when asset_id doesn't exist (neither as _assets._id ObjectId nor asset_code) or when count_source_id isn't present in _machine_event_sources. Prevents orphan configs producing a silent 0% OEE when the asset or source has been deleted.
OEE-M5 — Config mutation audit trail
Any change to the 6 regulated fields (planned_production_hours, ideal_cycle_time_seconds, count_source_id, count_metric_field, reject_metric_field, downtime_event_type) writes an entry to _audit_trail with actor, timestamp, previous and new snapshot. An auditor can answer "who moved ideal_cycle_time_seconds from 2.5 to 3.0 on March 3rd".
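An audit entry of this shape might be assembled as follows. Only the six field names and the `oee_config_updated` action come from this page; the helper name, the `changed_fields` key and the exact entry layout are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional

REGULATED_FIELDS = (
    "planned_production_hours",
    "ideal_cycle_time_seconds",
    "count_source_id",
    "count_metric_field",
    "reject_metric_field",
    "downtime_event_type",
)


def build_audit_entry(previous: dict, new: dict, actor: str) -> Optional[dict]:
    """Return an audit entry if any regulated field changed, else None."""
    changed = [f for f in REGULATED_FIELDS if previous.get(f) != new.get(f)]
    if not changed:
        return None  # nothing regulated moved, nothing to record
    return {
        "action": "oee_config_updated",
        "actor": actor,
        "timestamp": datetime.now(timezone.utc),
        "changed_fields": changed,  # hypothetical convenience key
        "previous": {f: previous.get(f) for f in REGULATED_FIELDS},
        "new": {f: new.get(f) for f in REGULATED_FIELDS},
    }
```

Storing both snapshots, rather than only the diff, lets an auditor reconstruct the full regulated config on either side of the change.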
OEE-M6 — Robust trend against transient errors
A transient DB error (e.g. a timeout on day 3 of a 7-day trend) no longer truncates the result. Each day computes in its own try/except; failures return {"date": "...", "status": "error"} as a placeholder and the loop continues. Only a 404 (not-configured) is terminal, because that's a config error, not transient.
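The per-day isolation can be sketched like this. `NotConfiguredError` stands in for the 404 case and `compute_day` for the per-day calculation; both names are hypothetical.

```python
from datetime import date, timedelta
from typing import Callable


class NotConfiguredError(Exception):
    """Stand-in for the terminal 404 (asset has no OEE config)."""


def compute_trend(compute_day: Callable[[date], dict], end_day: date,
                  days: int = 7) -> list[dict]:
    """Compute one result per day, oldest first, surviving transient failures."""
    results = []
    for offset in range(days - 1, -1, -1):
        day = end_day - timedelta(days=offset)
        try:
            results.append(compute_day(day))
        except NotConfiguredError:
            raise  # config error: retrying other days cannot help
        except Exception:
            # Transient failure: keep a placeholder and move on to the next day.
            results.append({"date": day.isoformat(), "status": "error"})
    return results
```

Because the placeholder keeps the date, the chart can render a gap for the failed day instead of silently shifting or dropping the rest of the series.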
OEE-M7 — Calendar-day windows
Trend windows are anchored at UTC midnight (00:00 to 24:00) instead of rolling 24h anchored to call time. The label "2026-04-17" now matches exactly the window it represents.
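A UTC-anchored window for a given label could be derived as in this sketch; the `day_window` helper is a hypothetical name.

```python
from datetime import date, datetime, time, timedelta, timezone


def day_window(day: date) -> tuple[datetime, datetime]:
    """Return the half-open [start, end) UTC window for one calendar day."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)
```

For `date(2026, 4, 17)` this gives a start of `2026-04-17T00:00:00+00:00` and an end of `2026-04-18T00:00:00+00:00`, regardless of when the call is made.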
Low (L)
OEE-L2 — Downtime Pareto breakdown
New downtime_breakdown list on the response, grouped by event_type and sorted descending by minutes:
{
"downtime_breakdown": [
{"event_type": "STOP_COMPRESSOR", "event_count": 2, "minutes": 40.0},
{"event_type": "STOP_CHANGEOVER", "event_count": 3, "minutes": 20.0}
]
}
Answers "which stop type cost us the most time?" without a second query.
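The grouping behind such a breakdown might be sketched as follows, assuming every downtime event carries `event_type` and a measured `metadata.duration_seconds`; the helper name and the rounding are illustrative.

```python
from collections import defaultdict


def downtime_breakdown(events: list[dict]) -> list[dict]:
    """Group downtime events by type and sort by total minutes, largest first."""
    totals = defaultdict(lambda: {"event_count": 0, "minutes": 0.0})
    for event in events:
        bucket = totals[event["event_type"]]
        bucket["event_count"] += 1
        bucket["minutes"] += event["metadata"]["duration_seconds"] / 60.0
    rows = [
        {"event_type": etype, "event_count": b["event_count"],
         "minutes": round(b["minutes"], 1)}
        for etype, b in totals.items()
    ]
    return sorted(rows, key=lambda row: row["minutes"], reverse=True)
```

Sorting by minutes rather than by event count is what makes this a Pareto view: a single long stop outranks many short ones.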
OEE-L3 — config_version in the response
Each configure_oee call increments config_version in MongoDB via $inc, and each calculate_oee stamps the active config_version in the response. A historical OEE value is thus anchored to the threshold set that produced it, so the audit question "which config produced that 87%?" is trivially resolvable.
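The update document and the response stamp could look roughly like this. Both helpers are hypothetical; the sketch only builds plain dicts and does not talk to MongoDB.

```python
def build_config_update(fields: dict) -> dict:
    """Build an update document that sets fields and bumps the version by one."""
    # config_version must not appear in $set, since $inc already targets it.
    to_set = {k: v for k, v in fields.items() if k != "config_version"}
    return {"$set": to_set, "$inc": {"config_version": 1}}


def stamp_version(response: dict, config: dict) -> dict:
    """Return a copy of the response carrying the config_version that produced it."""
    return {**response, "config_version": config.get("config_version")}
```

Keeping the increment inside the same update as the field changes means the version cannot drift from the values it describes.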
Full response shape
{
"asset_id": "line-01",
"oee_pct": 78.3,
"availability_pct": 93.8,
"performance_pct": 85.0,
"performance_pct_raw": 85.0,
"performance_capped": false,
"quality_pct": 98.2,
"total_count": 4800,
"good_count": 4714,
"reject_count": 86,
"planned_minutes": 480.0,
"operating_minutes": 450.0,
"downtime_minutes": 30.0,
"downtime_estimation": "measured",
"downtime_breakdown": [
{"event_type": "STOP_COMPRESSOR", "event_count": 2, "minutes": 22.0},
{"event_type": "STOP_CHANGEOVER", "event_count": 1, "minutes": 8.0}
],
"scheduled_maintenance_minutes": 0.0,
"config_version": 7,
"period_start": "2026-04-17T00:00:00+00:00",
"period_end": "2026-04-18T00:00:00+00:00"
}
MongoDB collections touched
| Collection | Use |
|---|---|
| _oee_configs | Per-asset config with monotonic config_version. |
| _machine_events | Single source of events (production + downtime) filtered by asset_id + real timestamp. |
| _machine_event_sources | connected/stale status read by the OEE-H2 guard. |
| _assets | asset_id validation on configure (dual lookup: ObjectId or asset_code). |
| _maintenance_plans | Scheduled windows reducing planned_minutes (OEE-M3). |
| _audit_trail | Fire-and-forget entries with action=oee_config_updated. |
Key benefits
- Decision-making on real data, not heuristics.
- Zero fake 0% OEE from dead sensors.
- Zero double-counting across lines sharing a PLC.
- Visible calibration: performance_pct_raw > 100 triggers review.
- Stable trend: labels and windows aligned, transient errors don't truncate.
- Audit-ready: every threshold mutation leaves a trail, every historical KPI carries its config_version.