    Predictive Maintenance AI: Preparing Sensor + Document Data On-Premise
predictive-maintenance · manufacturing · sensor-data · data-preparation · on-premise · segment:enterprise


    How to prepare predictive maintenance training data by combining sensor time-series, maintenance logs, and failure reports — on-premise in air-gapped manufacturing environments.

Ertas Team

    Predictive maintenance promises to replace scheduled maintenance with condition-based maintenance — intervening only when equipment actually shows signs of degradation, not on arbitrary calendar intervals. The AI models that enable this need training data that combines sensor readings (vibration, temperature, pressure, current) with maintenance records (what failed, why, and what was done about it).

    Preparing this training data is harder than it sounds. Sensor data and maintenance logs live in different systems, use different formats, and are owned by different teams. Bringing them together into a unified training dataset — on-premise, in an air-gapped manufacturing environment — requires a deliberate data preparation pipeline.

    The Two Data Streams

    Sensor Time-Series Data

    Manufacturing equipment generates continuous sensor readings:

    • Vibration sensors: Acceleration, velocity, displacement — primary indicators of bearing and rotating machinery health
    • Temperature sensors: Bearing temperatures, motor winding temperatures, process temperatures
    • Pressure sensors: Hydraulic pressure, pneumatic pressure, coolant pressure
    • Electrical sensors: Motor current, voltage, power factor — indicators of electrical and mechanical load
    • Flow sensors: Coolant flow, lubricant flow, process material flow

    This data is typically stored in a historian (OSIsoft PI, Aveva, InfluxDB) at sampling rates from once per second to hundreds of times per second. The volume is substantial — a single machine with 20 sensors sampling at 1 Hz generates 1.7 million data points per day.

    Maintenance Records

    Maintenance logs capture what happened to the equipment:

    • Work orders: Structured records of planned and unplanned maintenance activities
    • Technician notes: Free-text descriptions of symptoms, observations, and actions taken
    • Failure reports: Root cause analyses linking failures to contributing factors
    • Parts replacement records: What was replaced, when, and why
    • Equipment manuals: Manufacturer maintenance procedures and failure mode descriptions

    This data is typically stored in a CMMS (Computerized Maintenance Management System) like SAP PM, Maximo, or Fiix — and the free-text fields are where the real intelligence lives.

    The Data Preparation Challenge

    Aligning Time-Series with Events

    The core challenge: connecting sensor patterns to maintenance outcomes. A vibration spike on March 15 needs to be linked to the bearing replacement on March 17 to create a labeled training example.

    This alignment requires:

    • Timestamp synchronization between historian and CMMS (which often use different time zones or clock sources)
    • Event windowing: Defining the time window before a failure that constitutes the "pre-failure" pattern (hours? days? weeks?)
    • Normal vs. degrading: Labeling which sensor windows represent normal operation vs. progressive degradation
    • Multiple failure modes: The same equipment can fail in different ways, each with different sensor signatures
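The event-windowing idea above can be sketched in a few lines. This is a minimal illustration, not a production aligner: the 48-hour pre-failure horizon and the label names are assumptions chosen for the example, and real pipelines must first normalize historian and CMMS timestamps to one clock (here, both sides are timezone-aware UTC).

```python
from datetime import datetime, timedelta, timezone

# Illustrative horizon: sensor windows ending within 48 hours of a
# maintenance event are labeled "pre-failure". Tune per failure mode.
PRE_FAILURE_HORIZON = timedelta(hours=48)

def label_window(window_end, failure_times, horizon=PRE_FAILURE_HORIZON):
    """Label a sensor window by its distance to the next failure event.

    Both `window_end` and `failure_times` must be timezone-aware so that
    historian and CMMS clocks are compared on the same basis.
    """
    for t in failure_times:
        if window_end <= t <= window_end + horizon:
            return "pre-failure"
    return "normal"

# Bearing replacement logged in the CMMS on March 17.
failures = [datetime(2024, 3, 17, 9, 0, tzinfo=timezone.utc)]

w1 = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)  # 45 h before the event
w2 = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)  # well outside the horizon

print(label_window(w1, failures))  # pre-failure
print(label_window(w2, failures))  # normal
```

Choosing the horizon is itself a domain decision: bearing degradation may show up days ahead, while electrical faults can appear only hours before failure.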

    Extracting Intelligence from Maintenance Logs

    Technician notes contain critical information in unstructured form:

    "Checked motor on Line 3 press. Unusual vibration noted during operation. Replaced upper bearing assembly. Found significant wear on inner race. Possible contamination from coolant leak last month."

    From this note, a trained maintenance professional extracts:

    • Failure mode: Bearing wear
    • Root cause: Coolant contamination
    • Equipment: Line 3 press motor
    • Component: Upper bearing assembly
    • Severity: Significant (required replacement)

    An ML engineer without maintenance experience would miss the causal chain between the coolant leak and the bearing failure. This is why domain expert labeling is essential.
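A first automated pass over notes like the one above is often a keyword lookup before expert review. The sketch below is a deliberately simple, assumption-laden illustration: the pattern tables are invented for this example, and a real pipeline would use a curated failure-mode taxonomy maintained by reliability engineers.

```python
import re

# Illustrative failure-mode patterns; a production taxonomy would be
# far larger and curated by domain experts, not hard-coded.
FAILURE_MODES = {
    "bearing wear": [r"bearing.*wear", r"wear.*(inner|outer) race"],
    "coolant contamination": [r"contamination.*coolant", r"coolant leak"],
}

def extract_failure_modes(note):
    """Return the failure modes whose patterns appear in a free-text note."""
    text = note.lower()
    found = []
    for mode, patterns in FAILURE_MODES.items():
        if any(re.search(p, text) for p in patterns):
            found.append(mode)
    return found

note = ("Checked motor on Line 3 press. Unusual vibration noted during "
        "operation. Replaced upper bearing assembly. Found significant wear "
        "on inner race. Possible contamination from coolant leak last month.")

print(extract_failure_modes(note))  # ['bearing wear', 'coolant contamination']
```

Note what the lookup still misses: it flags both terms but cannot infer that the coolant leak *caused* the bearing wear. That causal link is exactly what the domain expert supplies during labeling.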

    Handling Class Imbalance

Equipment failures are (hopefully) rare events. In a healthy manufacturing operation, 95-99% of sensor readings represent normal operation; the failure patterns that predictive maintenance needs to detect live in the remaining 1-5%.

    Training data preparation must address this:

    • Oversampling failure windows
    • Synthetic data generation for rare failure modes
    • Careful windowing to maximize the use of degradation data (the gradual decline before failure is more useful than the failure moment itself)
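Simple random oversampling of the minority class can be sketched as follows. The 1:1 target ratio is an illustrative assumption — in practice the ratio is tuned per model and per failure mode, and techniques like SMOTE or window-level augmentation may be preferred.

```python
import random

def oversample(windows, labels, minority="pre-failure", seed=0):
    """Duplicate minority-class windows (with replacement) until the
    classes are balanced 1:1. A sketch, not a recommendation."""
    rng = random.Random(seed)
    minority_idx = [i for i, l in enumerate(labels) if l == minority]
    majority_idx = [i for i, l in enumerate(labels) if l != minority]
    extra = [rng.choice(minority_idx)
             for _ in range(len(majority_idx) - len(minority_idx))]
    all_idx = majority_idx + minority_idx + extra
    return [windows[i] for i in all_idx], [labels[i] for i in all_idx]

# 9 normal windows, 1 pre-failure window — a typical imbalance in miniature.
windows = list(range(10))
labels = ["normal"] * 9 + ["pre-failure"]

_, balanced = oversample(windows, labels)
print(balanced.count("normal"), balanced.count("pre-failure"))  # 9 9
```

Duplicated windows must never be split across the train/test boundary, or the evaluation will leak and overstate model performance.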

    The Pipeline

    Step 1: Sensor Data Export and Cleaning

    • Export from historian for relevant time ranges (typically 6-24 months per equipment unit)
    • Resample to consistent intervals if sensors have different sampling rates
    • Handle missing data (sensor dropouts, historian gaps)
    • Remove outliers caused by sensor malfunctions (not equipment issues)
    • Normalize across different sensor types and scales
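On a toy series, Step 1 might look like the pandas sketch below. The column name, the fixed outlier threshold, and the 1-minute target interval are all illustrative assumptions; real thresholds come from sensor specifications, not hard-coded constants.

```python
import numpy as np
import pandas as pd

# Toy 20-second vibration series with one dropout (NaN) and one
# sensor-malfunction spike (60.0) that is not a real equipment event.
idx = pd.date_range("2024-03-15 00:00", periods=10, freq="20s")
raw = pd.DataFrame(
    {"vibration_mm_s": [1.0, 1.1, np.nan, 1.2, 1.3, 1.2, 60.0, 1.4, 1.3, 1.2]},
    index=idx,
)

# Drop readings outside the sensor's plausible range (illustrative threshold).
clean = raw.where(raw <= 10)

# Resample to a consistent 1-minute interval; bridge short gaps only.
resampled = clean.resample("1min").mean().interpolate(limit=2)

# Z-score normalize so sensors with different scales are comparable.
normalized = (resampled - resampled.mean()) / resampled.std()

print(resampled)
```

The `limit=2` on interpolation is the important guardrail: long historian outages should stay missing and be excluded from training windows, not papered over with interpolated values.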

    Step 2: Maintenance Record Processing

    • Export work orders and technician notes from CMMS
    • Parse free-text fields for failure mode, root cause, component, and severity
    • Standardize terminology (same failure described differently by different technicians)
    • Map maintenance events to equipment identifiers that match sensor data
    • Build a timeline of maintenance events per equipment unit
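Terminology standardization is usually a curated synonym table rather than clever code. The mapping below is a minimal sketch with invented variants; in practice the table is built and maintained by reliability engineers as new spellings surface in the notes.

```python
# Illustrative synonym table: many technician spellings, one canonical term.
CANONICAL = {
    "bearing wear": {"bearing wear", "worn bearing", "bearing worn out",
                     "brg wear"},
    "seal leak": {"seal leak", "leaking seal", "seal failure"},
}

def standardize(term):
    """Map a technician's phrasing onto its canonical failure term.
    Unknown terms pass through unchanged for later expert review."""
    t = term.strip().lower()
    for canonical, variants in CANONICAL.items():
        if t in variants:
            return canonical
    return t

print(standardize("Brg wear"))       # bearing wear
print(standardize("motor burnout"))  # motor burnout (unmapped, flagged for review)
```

Letting unmapped terms pass through (rather than discarding them) keeps the review queue honest: every new spelling is a candidate for the taxonomy, not silently lost data.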

    Step 3: Data Fusion

    • Align sensor time-series with maintenance event timelines
    • Create labeled windows: "normal" (no maintenance event following), "pre-failure" (maintenance event within N days), "post-maintenance" (recently serviced)
    • Attach maintenance context to sensor windows (failure mode, root cause)
    • Create feature vectors combining sensor statistics (mean, std, peak, RMS, frequency features) with equipment metadata
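The sensor statistics named above can be computed per labeled window as in this sketch. The feature names are illustrative; frequency-domain features (FFT band energies) would be added the same way.

```python
import numpy as np

def window_features(samples):
    """Summary statistics for one sensor window: mean, standard
    deviation, peak absolute value, and RMS."""
    x = np.asarray(samples, dtype=float)
    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "peak": float(np.abs(x).max()),
        "rms": float(np.sqrt(np.mean(x ** 2))),
    }

# A symmetric toy window: zero mean, unit RMS.
feats = window_features([1.0, -1.0, 1.0, -1.0])
print(feats)  # {'mean': 0.0, 'std': 1.0, 'peak': 1.0, 'rms': 1.0}
```

One feature dictionary per window per sensor, concatenated with equipment metadata, yields the feature vectors that feed traditional ML models in Step 5.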

    Step 4: Labeling and Validation

    • Maintenance engineers validate the alignment between sensor patterns and failure events
    • Domain experts review edge cases: Was this really a failure, or scheduled maintenance? Was the sensor reading genuine, or a measurement artifact?
    • Label remaining useful life (RUL) where equipment records support it

    Step 5: Export

    • Structured datasets for time-series classification models
    • Feature matrices for traditional ML models (Random Forest, XGBoost)
    • Sequence data for LSTM/Transformer-based models
    • Documentation of sensor-to-failure mappings for model interpretability
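One common shape for the sequence-model export is an `.npz` of labeled windows plus a JSON sidecar documenting what the sensors and labels mean. File names and the sidecar schema here are illustrative assumptions, not a fixed format.

```python
import json
import numpy as np

# Toy export: 4 labeled windows of 60 time steps across 3 sensors.
windows = np.random.default_rng(0).normal(size=(4, 60, 3))
labels = np.array(["normal", "normal", "pre-failure", "normal"])

np.savez("train_windows.npz", windows=windows, labels=labels)

# Sidecar documenting the sensor-to-failure mapping for interpretability.
with open("train_windows.meta.json", "w") as f:
    json.dump({
        "sensors": ["vibration", "temperature", "current"],
        "label_horizon_hours": 48,
        "label_classes": ["normal", "pre-failure"],
    }, f, indent=2)

loaded = np.load("train_windows.npz")
print(loaded["windows"].shape)  # (4, 60, 3)
```

The sidecar matters more than it looks: six months later, the mapping from array column to physical sensor is exactly the documentation model interpretability depends on.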

    Why This Must Happen On-Premise

    Predictive maintenance data preparation has three hard requirements for on-premise processing:

    1. OT network isolation: Sensor data lives on the operational technology network, which is typically air-gapped from the IT network and the internet
    2. Trade secret protection: Equipment configurations, process parameters, and failure patterns are competitive intelligence
    3. Data volume: Months of high-frequency sensor data from hundreds of machines is too large for practical cloud transfer

    The data preparation tool must work within these constraints — fully offline, on local infrastructure, with no cloud dependencies.

    Ertas Data Suite is designed for exactly this environment: a native desktop application that runs air-gapped, processes both structured sensor data and unstructured maintenance logs, and exports in formats suitable for predictive maintenance models. The interface is accessible to maintenance engineers and reliability professionals who understand the equipment — not just data scientists who understand the algorithms.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
