
Energy and Utilities Predictive Maintenance: Building an AI-Ready Data Pipeline
A practical playbook for preparing SCADA data, equipment logs, and maintenance records for predictive maintenance AI in energy and utilities. Covers data pipeline stages, weather correlation, and on-premise architecture for critical infrastructure.
Unplanned transformer failures cost utilities between $1M and $10M per incident when you factor in emergency repairs, regulatory penalties, and lost revenue. Predictive maintenance AI can catch degradation patterns weeks before failure — but only if the data pipeline feeding those models is built correctly.
The challenge is not the AI model itself. It is the upstream data preparation: cleaning decades of inconsistent SCADA readings, normalizing maintenance logs written by different crews in different formats, and correlating equipment sensor data with weather patterns that affect failure rates.
This playbook covers the end-to-end data pipeline for energy and utilities predictive maintenance AI, from raw data sources through AI-ready output.
Data Sources in the Energy Sector
Predictive maintenance in energy and utilities draws from five primary data categories, each with distinct formats and quality challenges.
| Data Source | Typical Format | Volume | Quality Challenge |
|---|---|---|---|
| SCADA telemetry | Time-series CSV, OPC-UA exports | 10-50 GB/month per substation | Missing readings, sensor drift, timestamp misalignment |
| Maintenance logs | Free-text, PDF work orders, spreadsheets | 500 MB - 5 GB/year per facility | Inconsistent terminology, handwritten entries, duplicate records |
| Equipment registries | Relational DB exports, Excel | 50-200 MB per utility | Outdated records, inconsistent asset IDs across systems |
| Weather data | CSV, API responses (NOAA, ECMWF) | 1-2 GB/year per service territory | Spatial resolution gaps, missing stations |
| Inspection reports | PDF, Word documents, images | 2-10 GB/year per facility | Unstructured narrative, embedded images, inconsistent grading |
The first step in any pipeline is mapping these sources to a unified ingest strategy.
Pipeline Architecture: Six Stages
The data pipeline follows six stages, each producing observable intermediate outputs that energy engineers can validate before data moves downstream.
Stage 1: Ingest
Raw data arrives in mixed formats. SCADA exports come as CSV time-series, maintenance logs as PDFs and Word documents, equipment registries as database exports, and inspection reports as scanned PDFs with embedded images.
In Ertas, the ingest stage uses format-specific parser nodes: PDF Parser for inspection reports and work orders, Excel/CSV Parser for SCADA exports and equipment registries, Word Parser for narrative maintenance logs, and Image Parser for scanned documents. Each parser extracts structured content while preserving metadata about the source file, timestamp, and originating system.
Key consideration: SCADA data often arrives in OPC-UA historian exports. Convert these to flat CSV before ingestion, preserving the original timestamp precision (typically millisecond or sub-millisecond).
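A minimal sketch of that conversion, assuming the historian export has already been read into `(tag, ISO-8601 timestamp, value)` tuples — the function name and row shape are illustrative, not an Ertas or OPC-UA API:

```python
import csv
import io
from datetime import datetime, timezone

def historian_rows_to_csv(rows):
    """Flatten (tag, iso_timestamp, value) historian rows into CSV text,
    normalizing timestamps to UTC while preserving sub-second precision."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["tag", "timestamp_utc", "value"])
    for tag, ts, value in rows:
        # Re-emit in ISO-8601 so downstream stages agree on one format;
        # isoformat() keeps microseconds whenever they are present.
        dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
        writer.writerow([tag, dt.isoformat(), value])
    return buf.getvalue()

sample = [
    ("T-4420.OilTemp", "2024-07-01T12:00:00.123+00:00", 67.4),
    ("T-4420.OilTemp", "2024-07-01T12:00:05.123+00:00", 67.6),
]
csv_text = historian_rows_to_csv(sample)
```

Keeping millisecond precision at this stage matters because later deduplication and alignment steps rely on exact timestamp comparisons.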
Stage 2: Clean
Energy sector data has specific cleaning requirements that generic tools miss.
Deduplication across systems. Maintenance events often appear in both the CMMS (computerized maintenance management system) and the SCADA alarm log. A transformer oil temperature alert and the resulting work order describe the same event but in completely different formats. The Deduplicator node identifies these cross-system duplicates using configurable matching rules — timestamp proximity plus asset ID overlap.
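The matching rule can be sketched in a few lines; the 30-minute window and the event dictionaries below are illustrative assumptions, not the Deduplicator node's actual defaults:

```python
from datetime import datetime, timedelta

def find_cross_system_duplicates(cmms_events, scada_alarms, window_minutes=30):
    """Pair CMMS work orders with SCADA alarms that reference the same asset
    within a configurable time window. Each event is a dict with
    'id', 'asset_id', and 'timestamp' (a datetime)."""
    window = timedelta(minutes=window_minutes)
    pairs = []
    for wo in cmms_events:
        for alarm in scada_alarms:
            if (wo["asset_id"] == alarm["asset_id"]
                    and abs(wo["timestamp"] - alarm["timestamp"]) <= window):
                pairs.append((wo["id"], alarm["id"]))
    return pairs

cmms = [{"id": "WO-101", "asset_id": "T-4420",
         "timestamp": datetime(2024, 7, 1, 12, 15)}]
scada = [{"id": "ALM-9", "asset_id": "T-4420",
          "timestamp": datetime(2024, 7, 1, 12, 5)},
         {"id": "ALM-10", "asset_id": "T-0001",
          "timestamp": datetime(2024, 7, 1, 12, 10)}]
dupes = find_cross_system_duplicates(cmms, scada)
# → [("WO-101", "ALM-9")]
```

At utility scale you would index alarms by asset ID first rather than scan all pairs, but the matching logic stays the same.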
Sensor drift correction. SCADA readings drift over time as sensors age. The Anomaly Detector node flags readings that deviate from expected ranges based on historical baselines, allowing engineers to mark them for exclusion or manual correction before they contaminate training data.
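One common baseline check is a trailing-window z-score; this sketch uses an assumed window of 20 readings and a 3-sigma threshold, which a real deployment would tune per sensor type:

```python
import statistics

def flag_drifting_readings(readings, window=20, z_threshold=3.0):
    """Flag readings whose deviation from the trailing-window mean exceeds
    z_threshold standard deviations; returns indices for human review."""
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev > 0 and abs(readings[i] - mean) / stdev > z_threshold:
            flagged.append(i)
    return flagged

# Stable oil-temperature trace with one implausible spike at index 25
trace = [65.0 + 0.1 * (i % 5) for i in range(30)]
trace[25] = 95.0
```

Note the asymmetry this models: flagged readings are surfaced to engineers, not auto-corrected, which matches the review-before-exclusion workflow described above.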
Terminology normalization. Maintenance crews use inconsistent language: "xfmr," "transformer," "TX," and "power transformer" all refer to the same equipment class. The Format Normalizer node applies domain-specific mappings to standardize terminology across all text fields.
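A minimal sketch of such a mapping layer — the synonym table here is a tiny illustrative sample, where a real deployment would load a utility-maintained glossary:

```python
import re

# Assumed domain mapping; not an Ertas-shipped vocabulary.
EQUIPMENT_SYNONYMS = {
    r"\b(xfmr|tx|power transformer)\b": "transformer",
    r"\b(ckt bkr|cb)\b": "circuit breaker",
}

def normalize_terms(text):
    """Rewrite known equipment abbreviations to a canonical term,
    case-insensitively, leaving the rest of the text untouched."""
    out = text.lower()
    for pattern, canonical in EQUIPMENT_SYNONYMS.items():
        out = re.sub(pattern, canonical, out)
    return out

normalize_terms("Replaced XFMR bushing; CB inspected")
# → "replaced transformer bushing; circuit breaker inspected"
```

Word-boundary anchors (`\b`) matter here: without them, "tx" would match inside unrelated words like "texture" or asset IDs.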
Stage 3: Transform
This stage converts cleaned data into structures suitable for predictive maintenance models.
Time-series alignment. SCADA data, weather data, and maintenance events operate on different time scales. Sensor readings arrive every 5 seconds, weather data every hour, and maintenance events are irregular. The pipeline must align these to a common time window — typically hourly or daily aggregations — with appropriate statistical summaries (mean, max, min, standard deviation for continuous readings; count and recency for event data).
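For continuous readings, the hourly aggregation can be sketched with a simple bucketing pass (a sketch of the alignment logic, not an Ertas node implementation):

```python
from collections import defaultdict
from datetime import datetime
import statistics

def hourly_aggregate(readings):
    """Aggregate (timestamp, value) sensor readings into hourly buckets with
    the summary statistics named above: mean, max, min, stdev."""
    buckets = defaultdict(list)
    for ts, value in readings:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    summary = {}
    for hour, values in buckets.items():
        summary[hour] = {
            "mean": statistics.fmean(values),
            "max": max(values),
            "min": min(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary

readings = [
    (datetime(2024, 7, 1, 12, 0, 5), 67.0),
    (datetime(2024, 7, 1, 12, 30, 0), 69.0),
    (datetime(2024, 7, 1, 13, 1, 0), 70.0),
]
agg = hourly_aggregate(readings)
```

Irregular event data (work orders, alarms) would instead be aggregated as counts and days-since-last-event per window, as the feature table below reflects.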
Feature engineering for failure prediction. The most effective predictive maintenance features combine multiple data streams:
| Feature | Data Sources | Calculation |
|---|---|---|
| Temperature rate of change | SCADA thermal sensors | Rolling 24h slope of oil/winding temperature |
| Load-adjusted thermal index | SCADA load + temperature | Temperature deviation from expected value given current load |
| Maintenance recency score | Work orders, CMMS | Days since last preventive maintenance, weighted by maintenance type |
| Weather stress factor | Weather API, SCADA load | Composite of ambient temperature, humidity, and concurrent load level |
| Dissolved gas trend | Lab reports (PDF) | Rate of change in key dissolved gas concentrations over trailing 6 months |
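To make one of these concrete: the temperature rate-of-change feature reduces to a least-squares slope over a trailing window of hourly readings. A minimal sketch, assuming 24 equally spaced hourly values:

```python
def trailing_slope(values):
    """Least-squares slope of equally spaced values (units per step).
    Applied to 24 hourly oil-temperature readings this yields the
    'temperature rate of change' feature in degrees per hour."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# A transformer warming steadily by 0.5 degrees per hour over 24 hours
temps = [65.0 + 0.5 * h for h in range(24)]
trailing_slope(temps)
```

A least-squares slope is preferred over a simple last-minus-first difference because it is far less sensitive to a single noisy reading at either end of the window.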
In Ertas, the RAG Chunker and Train/Val/Test Splitter nodes handle the transformation from aligned time-series into training-ready datasets, with configurable split ratios that respect temporal ordering (no future data leaking into training sets).
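The temporal-ordering constraint is worth spelling out, since random shuffling is the default in most ML tooling and silently leaks future data. A sketch of a chronology-respecting split (the 70/15/15 ratios are illustrative, not the node's defaults):

```python
def temporal_split(examples, train_frac=0.7, val_frac=0.15):
    """Split chronologically ordered examples so all training data precedes
    validation, and all validation precedes test — no future leakage."""
    examples = sorted(examples, key=lambda e: e["timestamp"])
    n = len(examples)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return examples[:train_end], examples[train_end:val_end], examples[val_end:]

data = [{"timestamp": t, "features": [t]} for t in range(100)]
train, val, test = temporal_split(data)
```

For fleet-wide models you may additionally want to split by asset, so the model is evaluated on transformers it never saw during training.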
Stage 4: Quality Scoring
Before data reaches a model, every record passes through quality validation.
The Quality Scorer node assigns a confidence score to each training example based on completeness (are all expected features present?), consistency (do correlated features align logically?), and freshness (how recent is the underlying data?). Records below a configurable threshold are flagged for human review rather than silently dropped — critical in safety-relevant applications, where discarding data without review could mask real failure patterns.
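A toy version of that three-factor score — the weights, the thermal-index consistency rule, and the one-year freshness horizon are all illustrative assumptions, not Ertas defaults:

```python
from datetime import datetime

def quality_score(record, expected_features, now, max_age_days=365):
    """Score a training record on completeness, consistency, and freshness.
    Weights and thresholds here are illustrative only."""
    features = record["features"]
    completeness = (sum(1 for f in expected_features if f in features)
                    / len(expected_features))
    # Assumed consistency rule: a high load-adjusted thermal index while
    # measured load is near zero suggests a sensor or join error.
    consistent = not (features.get("thermal_index", 0.0) > 0.8
                      and features.get("load_pct", 100.0) < 5.0)
    age_days = (now - record["as_of"]).days
    freshness = max(0.0, 1.0 - age_days / max_age_days)
    return 0.4 * completeness + 0.3 * (1.0 if consistent else 0.0) + 0.3 * freshness

record = {
    "as_of": datetime(2024, 1, 1),
    "features": {"thermal_index": 0.2, "load_pct": 80.0},
}
score = quality_score(record, ["thermal_index", "load_pct"],
                      now=datetime(2024, 1, 1))
# complete, consistent, and fresh → score near 1.0
```

Records scoring below the configured threshold would be routed to a review queue rather than deleted, preserving the audit trail.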
Stage 5: Export
The pipeline produces AI-ready outputs in formats consumed by downstream ML frameworks.
| Output Format | Use Case | Ertas Node |
|---|---|---|
| JSONL | Fine-tuning predictive models | JSONL Exporter |
| CSV | Statistical analysis, legacy ML tools | CSV Exporter |
| Vector embeddings | Similarity search across maintenance records | RAG Exporter |
For predictive maintenance, the primary output is typically JSONL containing feature vectors with labeled outcomes (failure/no-failure within a prediction window). The secondary output is a RAG-ready knowledge base of maintenance records that field engineers can query in natural language.
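The JSONL shape is simple: one training example per line. A sketch of the serialization, where the field names (`asset_id`, `features`, `label`) are an assumed schema rather than the JSONL Exporter's fixed output:

```python
import json

def to_jsonl(examples):
    """Serialize labeled feature vectors as JSONL: one example per line,
    with the failure/no-failure outcome encoded as a 0/1 label."""
    lines = []
    for ex in examples:
        lines.append(json.dumps({
            "asset_id": ex["asset_id"],
            "features": ex["features"],
            "label": int(ex["failed_within_window"]),
        }, sort_keys=True))
    return "\n".join(lines)

examples = [
    {"asset_id": "T-4420", "features": {"temp_slope_24h": 0.5},
     "failed_within_window": True},
    {"asset_id": "T-0001", "features": {"temp_slope_24h": 0.0},
     "failed_within_window": False},
]
jsonl = to_jsonl(examples)
```

Sorting keys makes the output deterministic, which keeps diffs between pipeline runs auditable.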
Stage 6: Serve (RAG for Field Engineers)
Beyond training data preparation, Ertas enables a complete RAG pipeline for maintenance knowledge retrieval.
The indexing pipeline processes historical maintenance records: File Import, PDF Parser, PII Redactor (removing personnel names from work orders), RAG Chunker, Embedding, and Vector Store Writer. The retrieval pipeline — API Endpoint, Query Embedder, Vector Search, Context Assembler, API Response — deploys as a tool-callable endpoint that field AI assistants can query with questions like "What was the resolution for transformer T-4420 oil leak in 2024?"
This keeps institutional maintenance knowledge accessible and searchable without exposing raw work orders to cloud services.
Weather Correlation: The Multiplier
Weather is the single largest external factor in equipment failure rates. Heat waves stress transformers, ice storms damage lines, and humidity accelerates corrosion. But correlating weather data with equipment data requires careful spatial and temporal alignment.
Spatial matching. Weather stations rarely co-locate with substations. The pipeline must map each asset to its nearest weather stations (typically 2-3) and interpolate readings based on distance weighting. This mapping is defined once in the equipment registry and applied automatically during transformation.
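Distance weighting is usually done with inverse-distance weighting (IDW) over great-circle distances. A self-contained sketch, assuming stations are given as `((lat, lon), reading)` pairs:

```python
import math

def idw_interpolate(asset_latlon, stations, power=2):
    """Inverse-distance-weighted reading for an asset from its nearest
    stations, each given as ((lat, lon), reading). Distances use the
    haversine great-circle formula in kilometers."""
    def haversine_km(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2)
             * math.sin((lon2 - lon1) / 2) ** 2)
        return 6371.0 * 2 * math.asin(math.sqrt(h))

    num = den = 0.0
    for latlon, reading in stations:
        d = haversine_km(asset_latlon, latlon)
        if d < 1e-6:  # station effectively co-located with the asset
            return reading
        w = 1.0 / d ** power
        num += w * reading
        den += w
    return num / den

# Substation equidistant from two stations reporting 30 C and 34 C
temp = idw_interpolate((40.0, -105.0),
                       [((40.1, -105.0), 30.0), ((39.9, -105.0), 34.0)])
# equal distances → simple average, 32.0
```

The station-to-asset mapping itself should live in the equipment registry, as noted above, so the interpolation runs the same way on every pipeline execution.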
Temporal alignment. Weather effects on equipment are not instantaneous. A heat wave that starts Monday may not cause measurable transformer stress until Wednesday. The pipeline should generate lagged features (1-day, 3-day, 7-day trailing weather statistics) alongside point-in-time readings.
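The lagged features reduce to trailing-window statistics over the daily weather series. A minimal sketch using trailing means (a real pipeline would also emit max, min, and degree-day style aggregates):

```python
def trailing_stats(daily_values, day_index, lags=(1, 3, 7)):
    """Trailing mean of daily weather values over 1-, 3-, and 7-day windows
    ending at day_index, mirroring the lagged features described above."""
    feats = {}
    for lag in lags:
        window = daily_values[max(0, day_index - lag + 1):day_index + 1]
        feats[f"mean_{lag}d"] = sum(window) / len(window)
    return feats

# Daily max temperatures during a ramping heat wave
temps = [30, 31, 33, 36, 38, 40, 41, 42]
feats = trailing_stats(temps, day_index=7)
# feats["mean_1d"] == 42.0; feats["mean_3d"] == 41.0
```

The 1/3/7-day window choice is the assumption stated in the text above; the right lags depend on the equipment class and should be validated against known failure events.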
On-Premise Architecture for Critical Infrastructure
Energy utilities classify their operational technology (OT) networks as critical infrastructure. Data from SCADA systems and grid operations cannot traverse the public internet. This makes on-premise data preparation a hard requirement, not a preference.
Ertas runs as a native desktop application — no Docker containers, no cloud dependencies, no network exposure. It deploys directly onto utility engineering workstations within the OT network perimeter. Pipeline execution stays entirely local, and every processing step generates an observable log entry that utility compliance teams can audit.
For utilities operating under NERC CIP (Critical Infrastructure Protection) standards, this architecture satisfies:
- CIP-004: Access management through OS-level authentication on the workstation
- CIP-007: System security management with no listening ports or network services
- CIP-011: Information protection through local-only processing with no data egress
Implementation Checklist
Before starting your first predictive maintenance data pipeline:
- Inventory all data sources — SCADA historians, CMMS exports, weather feeds, inspection report archives
- Map asset identifiers across systems (many utilities have 3-5 different ID schemes for the same equipment)
- Define your prediction target (failure within 30 days, 90 days, or degradation classification)
- Establish temporal boundaries — how far back does reliable data go, and what is the minimum history needed per asset
- Identify subject matter experts who can validate pipeline outputs against known failure events
- Select a pilot scope — one substation or one equipment class — before scaling to the full fleet
Getting Started
The gap between raw utility data and AI-ready training sets is where most predictive maintenance projects stall. Not because the AI is hard, but because the data preparation is manual, fragile, and invisible.
Ertas Data Suite replaces that fragmented process with a visual pipeline where every transformation is observable, every step is logged, and the entire workflow runs on-premise within your OT network. Build the pipeline once for your pilot substation, then replicate it across your fleet with confidence that the same cleaning, normalization, and quality rules apply consistently.
Your transformers are already generating the data. The question is whether you can prepare it fast enough to act before the next failure.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.