
Energy and Utilities Predictive Maintenance: Building an AI-Ready Data Pipeline
A practical playbook for preparing SCADA data, equipment logs, and maintenance records for predictive maintenance AI in energy and utilities. Covers data pipeline stages, weather correlation, and on-premise architecture for critical infrastructure.
Unplanned transformer failures cost utilities between $1M and $10M per incident when you factor in emergency repairs, regulatory penalties, and lost revenue. Predictive maintenance AI can catch degradation patterns weeks before failure — but only if the data pipeline feeding those models is built correctly.
The challenge is not the AI model itself. It is the upstream data preparation: cleaning decades of inconsistent SCADA readings, normalizing maintenance logs written by different crews in different formats, and correlating equipment sensor data with weather patterns that affect failure rates.
This playbook covers the end-to-end data pipeline for energy and utilities predictive maintenance AI, from raw data sources through AI-ready output.
Data Sources in the Energy Sector
Predictive maintenance in energy and utilities draws from five primary data categories, each with distinct formats and quality challenges.
| Data Source | Typical Format | Volume | Quality Challenge |
|---|---|---|---|
| SCADA telemetry | Time-series CSV, OPC-UA exports | 10-50 GB/month per substation | Missing readings, sensor drift, timestamp misalignment |
| Maintenance logs | Free-text, PDF work orders, spreadsheets | 500 MB - 5 GB/year per facility | Inconsistent terminology, handwritten entries, duplicate records |
| Equipment registries | Relational DB exports, Excel | 50-200 MB per utility | Outdated records, inconsistent asset IDs across systems |
| Weather data | CSV, API responses (NOAA, ECMWF) | 1-2 GB/year per service territory | Spatial resolution gaps, missing stations |
| Inspection reports | PDF, Word documents, images | 2-10 GB/year per facility | Unstructured narrative, embedded images, inconsistent grading |
The first step in any pipeline is mapping these sources to a unified ingest strategy.
Pipeline Architecture: Six Stages
The data pipeline follows six stages, each producing observable intermediate outputs that energy engineers can validate before data moves downstream.
Stage 1: Ingest
Raw data arrives in mixed formats. SCADA exports come as CSV time-series, maintenance logs as PDFs and Word documents, equipment registries as database exports, and inspection reports as scanned PDFs with embedded images.
In Ertas, the ingest stage uses format-specific parser nodes: PDF Parser for inspection reports and work orders, Excel/CSV Parser for SCADA exports and equipment registries, Word Parser for narrative maintenance logs, and Image Parser for scanned documents. Each parser extracts structured content while preserving metadata about the source file, timestamp, and originating system.
Key consideration: SCADA data often arrives in OPC-UA historian exports. Convert these to flat CSV before ingestion, preserving the original timestamp precision (typically millisecond or sub-millisecond).
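A minimal sketch of that conversion, assuming the historian export has already been read into `(tag, ISO-8601 timestamp, value)` tuples — the function name and row shape are illustrative, not an Ertas or OPC-UA API:

```python
import csv
import io
from datetime import datetime, timezone

def historian_rows_to_csv(rows):
    """Flatten (tag, iso_timestamp, value) historian rows into CSV text,
    normalizing timestamps to UTC while preserving sub-second precision."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["tag", "timestamp_utc", "value"])
    for tag, ts, value in rows:
        # Re-emit in ISO-8601 so downstream stages agree on one format;
        # isoformat() keeps microseconds whenever they are present.
        dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
        writer.writerow([tag, dt.isoformat(), value])
    return buf.getvalue()

sample = [
    ("T-4420.OilTemp", "2024-07-01T12:00:00.123+00:00", 67.4),
    ("T-4420.OilTemp", "2024-07-01T12:00:05.123+00:00", 67.6),
]
csv_text = historian_rows_to_csv(sample)
```

Keeping millisecond precision at this stage matters because later deduplication and alignment steps rely on exact timestamp comparisons.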
Stage 2: Clean
Energy sector data has specific cleaning requirements that generic tools miss.
Deduplication across systems. Maintenance events often appear in both the CMMS (computerized maintenance management system) and the SCADA alarm log. A transformer oil temperature alert and the resulting work order describe the same event but in completely different formats. The Deduplicator node identifies these cross-system duplicates using configurable matching rules — timestamp proximity plus asset ID overlap.
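The matching rule can be sketched in a few lines; the 30-minute window and the event dictionaries below are illustrative assumptions, not the Deduplicator node's actual defaults:

```python
from datetime import datetime, timedelta

def find_cross_system_duplicates(cmms_events, scada_alarms, window_minutes=30):
    """Pair CMMS work orders with SCADA alarms that reference the same asset
    within a configurable time window. Each event is a dict with
    'id', 'asset_id', and 'timestamp' (a datetime)."""
    window = timedelta(minutes=window_minutes)
    pairs = []
    for wo in cmms_events:
        for alarm in scada_alarms:
            if (wo["asset_id"] == alarm["asset_id"]
                    and abs(wo["timestamp"] - alarm["timestamp"]) <= window):
                pairs.append((wo["id"], alarm["id"]))
    return pairs

cmms = [{"id": "WO-101", "asset_id": "T-4420",
         "timestamp": datetime(2024, 7, 1, 12, 15)}]
scada = [{"id": "ALM-9", "asset_id": "T-4420",
          "timestamp": datetime(2024, 7, 1, 12, 5)},
         {"id": "ALM-10", "asset_id": "T-0001",
          "timestamp": datetime(2024, 7, 1, 12, 10)}]
dupes = find_cross_system_duplicates(cmms, scada)
# → [("WO-101", "ALM-9")]
```

At utility scale you would index alarms by asset ID first rather than scan all pairs, but the matching logic stays the same.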
Sensor drift correction. SCADA readings drift over time as sensors age. The Anomaly Detector node flags readings that deviate from expected ranges based on historical baselines, allowing engineers to mark them for exclusion or manual correction before they contaminate training data.
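One common baseline check is a trailing-window z-score; this sketch uses an assumed window of 20 readings and a 3-sigma threshold, which a real deployment would tune per sensor type:

```python
import statistics

def flag_drifting_readings(readings, window=20, z_threshold=3.0):
    """Flag readings whose deviation from the trailing-window mean exceeds
    z_threshold standard deviations; returns indices for human review."""
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev > 0 and abs(readings[i] - mean) / stdev > z_threshold:
            flagged.append(i)
    return flagged

# Stable oil-temperature trace with one implausible spike at index 25
trace = [65.0 + 0.1 * (i % 5) for i in range(30)]
trace[25] = 95.0
```

Note the asymmetry this models: flagged readings are surfaced to engineers, not auto-corrected, which matches the review-before-exclusion workflow described above.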
Terminology normalization. Maintenance crews use inconsistent language: "xfmr," "transformer," "TX," and "power transformer" all refer to the same equipment class. The Format Normalizer node applies domain-specific mappings to standardize terminology across all text fields.
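A minimal sketch of such a mapping layer — the synonym table here is a tiny illustrative sample, where a real deployment would load a utility-maintained glossary:

```python
import re

# Assumed domain mapping; not an Ertas-shipped vocabulary.
EQUIPMENT_SYNONYMS = {
    r"\b(xfmr|tx|power transformer)\b": "transformer",
    r"\b(ckt bkr|cb)\b": "circuit breaker",
}

def normalize_terms(text):
    """Rewrite known equipment abbreviations to a canonical term,
    case-insensitively, leaving the rest of the text untouched."""
    out = text.lower()
    for pattern, canonical in EQUIPMENT_SYNONYMS.items():
        out = re.sub(pattern, canonical, out)
    return out

normalize_terms("Replaced XFMR bushing; CB inspected")
# → "replaced transformer bushing; circuit breaker inspected"
```

Word-boundary anchors (`\b`) matter here: without them, "tx" would match inside unrelated words like "texture" or asset IDs.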
Stage 3: Transform
This stage converts cleaned data into structures suitable for predictive maintenance models.
Time-series alignment. SCADA data, weather data, and maintenance events operate on different time scales. Sensor readings arrive every 5 seconds, weather data every hour, and maintenance events are irregular. The pipeline must align these to a common time window — typically hourly or daily aggregations — with appropriate statistical summaries (mean, max, min, standard deviation for continuous readings; count and recency for event data).
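For continuous readings, the hourly aggregation can be sketched with a simple bucketing pass (a sketch of the alignment logic, not an Ertas node implementation):

```python
from collections import defaultdict
from datetime import datetime
import statistics

def hourly_aggregate(readings):
    """Aggregate (timestamp, value) sensor readings into hourly buckets with
    the summary statistics named above: mean, max, min, stdev."""
    buckets = defaultdict(list)
    for ts, value in readings:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    summary = {}
    for hour, values in buckets.items():
        summary[hour] = {
            "mean": statistics.fmean(values),
            "max": max(values),
            "min": min(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary

readings = [
    (datetime(2024, 7, 1, 12, 0, 5), 67.0),
    (datetime(2024, 7, 1, 12, 30, 0), 69.0),
    (datetime(2024, 7, 1, 13, 1, 0), 70.0),
]
agg = hourly_aggregate(readings)
```

Irregular event data (work orders, alarms) would instead be aggregated as counts and days-since-last-event per window, as the feature table below reflects.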
Feature engineering for failure prediction. The most effective predictive maintenance features combine multiple data streams:
| Feature | Data Sources | Calculation |
|---|---|---|
| Temperature rate of change | SCADA thermal sensors | Rolling 24h slope of oil/winding temperature |
| Load-adjusted thermal index | SCADA load + temperature | Temperature deviation from expected value given current load |
| Maintenance recency score | Work orders, CMMS | Days since last preventive maintenance, weighted by maintenance type |
| Weather stress factor | Weather API, SCADA load | Composite of ambient temperature, humidity, and concurrent load level |
| Dissolved gas trend | Lab reports (PDF) | Rate of change in key dissolved gas concentrations over trailing 6 months |
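To make one of these concrete: the temperature rate-of-change feature reduces to a least-squares slope over a trailing window of hourly readings. A minimal sketch, assuming 24 equally spaced hourly values:

```python
def trailing_slope(values):
    """Least-squares slope of equally spaced values (units per step).
    Applied to 24 hourly oil-temperature readings this yields the
    'temperature rate of change' feature in degrees per hour."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# A transformer warming steadily by 0.5 degrees per hour over 24 hours
temps = [65.0 + 0.5 * h for h in range(24)]
trailing_slope(temps)
```

A least-squares slope is preferred over a simple last-minus-first difference because it is far less sensitive to a single noisy reading at either end of the window.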
In Ertas, the RAG Chunker and Train/Val/Test Splitter nodes handle the transformation from aligned time-series into training-ready datasets, with configurable split ratios that respect temporal ordering (no future data leaking into training sets).
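The temporal-ordering constraint is worth spelling out, since random shuffling is the default in most ML tooling and silently leaks future data. A sketch of a chronology-respecting split (the 70/15/15 ratios are illustrative, not the node's defaults):

```python
def temporal_split(examples, train_frac=0.7, val_frac=0.15):
    """Split chronologically ordered examples so all training data precedes
    validation, and all validation precedes test — no future leakage."""
    examples = sorted(examples, key=lambda e: e["timestamp"])
    n = len(examples)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return examples[:train_end], examples[train_end:val_end], examples[val_end:]

data = [{"timestamp": t, "features": [t]} for t in range(100)]
train, val, test = temporal_split(data)
```

For fleet-wide models you may additionally want to split by asset, so the model is evaluated on transformers it never saw during training.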
Stage 4: Quality Scoring
Before data reaches a model, every record passes through quality validation.
The Quality Scorer node assigns a confidence score to each training example based on completeness (are all expected features present?), consistency (do correlated features align logically?), and freshness (how recent is the underlying data?). Records below a configurable threshold are flagged for human review rather than silently dropped — critical in safety-relevant applications, where discarding data without review could mask real failure patterns.
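A toy version of that three-factor score — the weights, the thermal-index consistency rule, and the one-year freshness horizon are all illustrative assumptions, not Ertas defaults:

```python
from datetime import datetime

def quality_score(record, expected_features, now, max_age_days=365):
    """Score a training record on completeness, consistency, and freshness.
    Weights and thresholds here are illustrative only."""
    features = record["features"]
    completeness = (sum(1 for f in expected_features if f in features)
                    / len(expected_features))
    # Assumed consistency rule: a high load-adjusted thermal index while
    # measured load is near zero suggests a sensor or join error.
    consistent = not (features.get("thermal_index", 0.0) > 0.8
                      and features.get("load_pct", 100.0) < 5.0)
    age_days = (now - record["as_of"]).days
    freshness = max(0.0, 1.0 - age_days / max_age_days)
    return 0.4 * completeness + 0.3 * (1.0 if consistent else 0.0) + 0.3 * freshness

record = {
    "as_of": datetime(2024, 1, 1),
    "features": {"thermal_index": 0.2, "load_pct": 80.0},
}
score = quality_score(record, ["thermal_index", "load_pct"],
                      now=datetime(2024, 1, 1))
# complete, consistent, and fresh → score near 1.0
```

Records scoring below the configured threshold would be routed to a review queue rather than deleted, preserving the audit trail.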
Stage 5: Export
The pipeline produces AI-ready outputs in formats consumed by downstream ML frameworks.
| Output Format | Use Case | Ertas Node |
|---|---|---|
| JSONL | Fine-tuning predictive models | JSONL Exporter |
| CSV | Statistical analysis, legacy ML tools | CSV Exporter |
| Vector embeddings | Similarity search across maintenance records | RAG Exporter |
For predictive maintenance, the primary output is typically JSONL containing feature vectors with labeled outcomes (failure/no-failure within a prediction window). The secondary output is a RAG-ready knowledge base of maintenance records that field engineers can query in natural language.
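The JSONL shape is simple: one training example per line. A sketch of the serialization, where the field names (`asset_id`, `features`, `label`) are an assumed schema rather than the JSONL Exporter's fixed output:

```python
import json

def to_jsonl(examples):
    """Serialize labeled feature vectors as JSONL: one example per line,
    with the failure/no-failure outcome encoded as a 0/1 label."""
    lines = []
    for ex in examples:
        lines.append(json.dumps({
            "asset_id": ex["asset_id"],
            "features": ex["features"],
            "label": int(ex["failed_within_window"]),
        }, sort_keys=True))
    return "\n".join(lines)

examples = [
    {"asset_id": "T-4420", "features": {"temp_slope_24h": 0.5},
     "failed_within_window": True},
    {"asset_id": "T-0001", "features": {"temp_slope_24h": 0.0},
     "failed_within_window": False},
]
jsonl = to_jsonl(examples)
```

Sorting keys makes the output deterministic, which keeps diffs between pipeline runs auditable.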
Stage 6: Serve (RAG for Field Engineers)
Beyond training data preparation, Ertas enables a complete RAG pipeline for maintenance knowledge retrieval.
The indexing pipeline processes historical maintenance records: File Import, PDF Parser, PII Redactor (removing personnel names from work orders), RAG Chunker, Embedding, and Vector Store Writer. The retrieval pipeline — API Endpoint, Query Embedder, Vector Search, Context Assembler, API Response — deploys as a tool-callable endpoint that field AI assistants can query with questions like "What was the resolution for transformer T-4420 oil leak in 2024?"
This keeps institutional maintenance knowledge accessible and searchable without exposing raw work orders to cloud services.
Weather Correlation: The Multiplier
Weather is the single largest external factor in equipment failure rates. Heat waves stress transformers, ice storms damage lines, and humidity accelerates corrosion. But correlating weather data with equipment data requires careful spatial and temporal alignment.
Spatial matching. Weather stations rarely co-locate with substations. The pipeline must map each asset to its nearest weather stations (typically 2-3) and interpolate readings based on distance weighting. This mapping is defined once in the equipment registry and applied automatically during transformation.
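Distance weighting is usually done with inverse-distance weighting (IDW) over great-circle distances. A self-contained sketch, assuming stations are given as `((lat, lon), reading)` pairs:

```python
import math

def idw_interpolate(asset_latlon, stations, power=2):
    """Inverse-distance-weighted reading for an asset from its nearest
    stations, each given as ((lat, lon), reading). Distances use the
    haversine great-circle formula in kilometers."""
    def haversine_km(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2)
             * math.sin((lon2 - lon1) / 2) ** 2)
        return 6371.0 * 2 * math.asin(math.sqrt(h))

    num = den = 0.0
    for latlon, reading in stations:
        d = haversine_km(asset_latlon, latlon)
        if d < 1e-6:  # station effectively co-located with the asset
            return reading
        w = 1.0 / d ** power
        num += w * reading
        den += w
    return num / den

# Substation equidistant from two stations reporting 30 C and 34 C
temp = idw_interpolate((40.0, -105.0),
                       [((40.1, -105.0), 30.0), ((39.9, -105.0), 34.0)])
# equal distances → simple average, 32.0
```

The station-to-asset mapping itself should live in the equipment registry, as noted above, so the interpolation runs the same way on every pipeline execution.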
Temporal alignment. Weather effects on equipment are not instantaneous. A heat wave that starts Monday may not cause measurable transformer stress until Wednesday. The pipeline should generate lagged features (1-day, 3-day, 7-day trailing weather statistics) alongside point-in-time readings.
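The lagged features reduce to trailing-window statistics over the daily weather series. A minimal sketch using trailing means (a real pipeline would also emit max, min, and degree-day style aggregates):

```python
def trailing_stats(daily_values, day_index, lags=(1, 3, 7)):
    """Trailing mean of daily weather values over 1-, 3-, and 7-day windows
    ending at day_index, mirroring the lagged features described above."""
    feats = {}
    for lag in lags:
        window = daily_values[max(0, day_index - lag + 1):day_index + 1]
        feats[f"mean_{lag}d"] = sum(window) / len(window)
    return feats

# Daily max temperatures during a ramping heat wave
temps = [30, 31, 33, 36, 38, 40, 41, 42]
feats = trailing_stats(temps, day_index=7)
# feats["mean_1d"] == 42.0; feats["mean_3d"] == 41.0
```

The 1/3/7-day window choice is the assumption stated in the text above; the right lags depend on the equipment class and should be validated against known failure events.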
On-Premise Architecture for Critical Infrastructure
Energy utilities classify their operational technology (OT) networks as critical infrastructure. Data from SCADA systems and grid operations cannot traverse the public internet. This makes on-premise data preparation a hard requirement, not a preference.
Ertas runs as a native desktop application — no Docker containers, no cloud dependencies, no network exposure. It deploys directly onto utility engineering workstations within the OT network perimeter. Pipeline execution stays entirely local, and every processing step generates an observable log entry that utility compliance teams can audit.
For utilities operating under NERC CIP (Critical Infrastructure Protection) standards, this architecture satisfies:
- CIP-004: Access management through OS-level authentication on the workstation
- CIP-007: System security management with no listening ports or network services
- CIP-011: Information protection through local-only processing with no data egress
Implementation Checklist
Before starting your first predictive maintenance data pipeline:
- Inventory all data sources — SCADA historians, CMMS exports, weather feeds, inspection report archives
- Map asset identifiers across systems (many utilities have 3-5 different ID schemes for the same equipment)
- Define your prediction target (failure within 30 days, 90 days, or degradation classification)
- Establish temporal boundaries — how far back does reliable data go, and what is the minimum history needed per asset
- Identify subject matter experts who can validate pipeline outputs against known failure events
- Select a pilot scope — one substation or one equipment class — before scaling to the full fleet
Getting Started
The gap between raw utility data and AI-ready training sets is where most predictive maintenance projects stall. Not because the AI is hard, but because the data preparation is manual, fragile, and invisible.
Ertas Data Suite replaces that fragmented process with a visual pipeline where every transformation is observable, every step is logged, and the entire workflow runs on-premise within your OT network. Build the pipeline once for your pilot substation, then replicate it across your fleet with confidence that the same cleaning, normalization, and quality rules apply consistently.
Your transformers are already generating the data. The question is whether you can prepare it fast enough to act before the next failure.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.