    Preparing Sensor and IoT Time-Series Data for AI Training Pipelines
    Tags: sensor-data · time-series · iot · manufacturing · predictive-maintenance · data-pipeline · on-premise


    A practical guide to building AI training pipelines for sensor and IoT time-series data — covering windowing strategies, normalization methods, anomaly labeling, and train/test splitting for vibration, temperature, pressure, and acoustic sensor types.

    Ertas Team

    Industrial IoT deployments now generate terabytes of sensor data daily. Vibration monitors on rotating equipment, temperature probes in process lines, pressure transducers in hydraulic systems, and acoustic emission sensors on structural components all produce continuous time-series streams. The AI models that consume this data — for predictive maintenance, anomaly detection, and process optimization — can only perform as well as the data preparation pipeline that feeds them.

    The gap between raw sensor data and model-ready training sets is substantial. Raw sensor streams contain gaps from communication failures, drift from calibration decay, noise from electromagnetic interference, and timestamps from unsynchronized clocks. Turning this into clean, windowed, labeled, properly split training data requires a systematic pipeline that handles each sensor type's specific characteristics.

    Pipeline Architecture by Sensor Type

    Different sensor types produce fundamentally different data characteristics. A one-size-fits-all preprocessing pipeline will either over-process simple signals or under-process complex ones. The following table maps each common sensor type to its pipeline requirements:

| Sensor Type | Sampling Rate | Signal Characteristics | Key Preprocessing Steps | Common AI Tasks |
|---|---|---|---|---|
| Vibration (accelerometer) | 1-50 kHz | High-frequency, periodic with harmonics, amplitude modulated by load | Band-pass filtering, FFT feature extraction, envelope analysis, windowing at rotation period multiples | Bearing fault detection, imbalance classification, gear mesh analysis |
| Temperature (thermocouple/RTD) | 0.1-10 Hz | Low-frequency, slow drift, step changes during process transitions | Outlier removal, interpolation for missing readings, rate-of-change calculation, thermal lag compensation | Overheating prediction, process deviation detection, thermal runaway early warning |
| Pressure (transducer) | 10-1000 Hz | Medium-frequency, cyclic in hydraulic systems, step functions in batch processes | Spike removal, moving average smoothing, cycle segmentation, pressure-flow correlation | Leak detection, pump degradation, valve failure prediction |
| Acoustic (microphone/AE sensor) | 10-200 kHz | Very high-frequency, broadband with event-driven bursts | High-pass filtering, spectrogram generation, event detection and segmentation, background noise subtraction | Crack propagation, tool wear, bearing fault (early stage) |

    Stage 1: Ingestion and Timestamp Alignment

    Sensor data arrives in formats ranging from industrial protocols (OPC UA, MQTT, Modbus) to flat CSV exports from historians to proprietary binary formats from data acquisition systems. The ingestion stage must normalize all sources into a consistent time-indexed format.

    Timestamp alignment is the most underestimated preprocessing step. In multi-sensor systems, each sensor may have its own clock. A vibration sensor sampling at 10 kHz and a temperature sensor sampling at 1 Hz need to be aligned to a common time base before any cross-sensor features can be computed.

| Alignment Challenge | Cause | Solution |
|---|---|---|
| Clock drift | Sensor clocks diverge over time (typical: 1-10 ppm) | Resample to common time base using NTP-synced reference timestamps |
| Missing timestamps | Communication dropout, buffer overflow | Interpolation for short gaps (under 5x sample period); gap marking for longer gaps |
| Irregular sampling | Event-triggered sensors, network jitter | Resample to uniform interval using linear or cubic interpolation |
| Timezone inconsistencies | Sensors configured in different timezones or UTC offsets | Normalize all timestamps to UTC before any processing |
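As a minimal sketch of the resampling approach, the snippet below aligns two sensors with mismatched clocks and rates onto a common 1 Hz UTC grid using pandas. The sensor values, timestamps, and the `align` helper are illustrative, not part of any specific product pipeline:

```python
import pandas as pd

# Two sensors on different clocks: temperature at ~1 Hz, pressure at
# irregular intervals. Both are resampled onto a common 1-second UTC grid.
temp = pd.Series(
    [20.0, 20.1, 20.3, 20.2],
    index=pd.to_datetime(
        ["2024-01-01 00:00:00.1", "2024-01-01 00:00:01.2",
         "2024-01-01 00:00:02.0", "2024-01-01 00:00:03.1"], utc=True),
)
pressure = pd.Series(
    [5.0, 5.2, 5.1],
    index=pd.to_datetime(
        ["2024-01-01 00:00:00.5", "2024-01-01 00:00:02.4",
         "2024-01-01 00:00:03.0"], utc=True),
)

# Common time base anchored at the earliest whole second.
common = pd.date_range(temp.index[0].floor("s"), periods=4, freq="1s")

def align(series, grid):
    """Time-interpolate onto a uniform grid; nearest-value fill at the edges."""
    merged = series.reindex(series.index.union(grid))
    merged = merged.interpolate(method="time")
    merged = merged.bfill().ffill()  # edge samples take the nearest reading
    return merged.reindex(grid)

aligned = pd.DataFrame({"temp": align(temp, common),
                        "pressure": align(pressure, common)})
```

In a production pipeline the same pattern extends to any number of sensors; long gaps should be marked rather than interpolated, per the table above.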

    Ertas Data Suite handles CSV and Excel-based sensor data exports through its parser nodes, with the Format Normalizer node standardizing timestamp formats and the Anomaly Detector flagging gaps and irregularities before downstream processing.

    Stage 2: Cleaning and Noise Reduction

    Raw sensor data contains noise from multiple sources, and the appropriate cleaning strategy depends on the signal-to-noise characteristics of each sensor type.

    Common noise sources and remediation:

| Noise Source | Affected Sensors | Identification Method | Remediation |
|---|---|---|---|
| Electromagnetic interference (EMI) | Vibration, acoustic | Fixed-frequency spikes in FFT (50/60 Hz and harmonics) | Notch filter at power line frequency |
| Sensor saturation | All types | Flat-line at sensor maximum or minimum | Flag and exclude saturated windows from training data |
| Calibration drift | Temperature, pressure | Gradual baseline shift over weeks/months | Baseline correction using known reference points |
| Communication artifacts | All digital sensors | Repeated identical values, sudden jumps to zero | Median filter for isolated spikes; gap-fill for repeated values |
| Environmental transients | Acoustic, vibration | High-amplitude, short-duration bursts unrelated to equipment | Event detection with duration threshold filtering |

    The cleaning stage must preserve real anomalies while removing noise. This is the central tension in sensor data preparation: aggressive filtering removes noise but may also remove the early-stage fault signatures that predictive maintenance models need to detect. The general principle is to apply minimal filtering during cleaning, then let the model architecture handle remaining noise through its own learned representations.
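The EMI remediation from the table can be sketched with SciPy's `iirnotch`. The synthetic signal, sampling rate, and Q value below are illustrative assumptions; the point is that a narrow notch at the mains frequency removes the interference while leaving a nearby fault-related component intact:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 10_000  # Hz, assumed vibration sampling rate
t = np.arange(0, 1.0, 1 / fs)

# Synthetic signal: a 120 Hz bearing-related component plus 50 Hz mains EMI.
signal = 1.0 * np.sin(2 * np.pi * 120 * t) + 0.8 * np.sin(2 * np.pi * 50 * t)

# Narrow notch at the power-line frequency; Q controls the notch width
# (bandwidth ~ w0/Q, here under 2 Hz, so 120 Hz is untouched).
b, a = iirnotch(w0=50.0, Q=30.0, fs=fs)
cleaned = filtfilt(b, a, signal)  # zero-phase filtering preserves timing

# Single-sided spectrum for inspection (1 Hz resolution on a 1 s record).
spectrum = np.abs(np.fft.rfft(cleaned)) / len(t)
freqs = np.fft.rfftfreq(len(t), 1 / fs)
```

`filtfilt` applies the filter forward and backward, which doubles the attenuation and avoids the phase shift a single pass would introduce into downstream envelope or FFT features.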

    Stage 3: Windowing Strategies

    Time-series models do not consume raw streams directly. Data must be segmented into windows (fixed-length subsequences) that become individual training examples. Window design directly affects what the model can learn.

| Windowing Parameter | Decision Factors | Typical Values |
|---|---|---|
| Window length | Must capture at least 2-3 complete cycles of the lowest-frequency pattern of interest | Vibration: 1-10 seconds; Temperature: 5-60 minutes; Pressure: 1-30 seconds; Acoustic: 0.1-1 seconds |
| Overlap | Higher overlap produces more training examples but increases redundancy and data leakage risk | 50% overlap is standard; 75% for small datasets; 0% for test sets |
| Stride | Inverse of overlap; controls how far the window advances each step | Half the window length for 50% overlap |

    Critical rule for train/test splitting with overlapping windows: Overlapping windows must never span the train/test boundary. If window N is in the training set and window N+1 (which overlaps with N) is in the test set, the model has seen test data during training. Always split by time first, then window within each split.
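The split-first, window-second rule can be made concrete with a short NumPy sketch. The stream, window length, and boundary below are toy values chosen for illustration:

```python
import numpy as np

def make_windows(x, window, stride):
    """Segment a 1-D signal into fixed-length windows (one per row)."""
    n = (len(x) - window) // stride + 1
    return np.stack([x[i * stride : i * stride + window] for i in range(n)])

# Toy stream of 1,000 samples. Split by TIME first, then window within each
# split, so no window can span the train/test boundary.
x = np.arange(1000, dtype=float)
boundary = 700
train_raw, test_raw = x[:boundary], x[boundary:]

window, stride = 100, 50                       # 50% overlap for training
train = make_windows(train_raw, window, stride)
test = make_windows(test_raw, window, window)  # 0% overlap for the test set

# Every training window ends before the boundary; every test window
# starts at or after it, so no sample appears on both sides.
assert train[:, -1].max() < boundary
assert test[:, 0].min() >= boundary
```

Windowing the full stream first and then splitting the windows randomly would violate both assertions, which is exactly the leakage the rule above prevents.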

    Window-Level Feature Engineering

    For many sensor applications, raw windowed time-series data is augmented or replaced by engineered features computed per window:

| Feature Category | Examples | Use Case |
|---|---|---|
| Statistical | Mean, variance, skewness, kurtosis, RMS, crest factor | General health monitoring, anomaly detection |
| Frequency domain | Dominant frequency, spectral centroid, band energy ratios | Vibration analysis, rotating equipment diagnostics |
| Time-frequency | Wavelet coefficients, STFT spectrogram bins | Non-stationary signals, transient event detection |
| Cross-sensor | Correlation between sensors, phase difference, coherence | Multi-sensor fusion, system-level anomaly detection |

    The choice between feeding raw windows versus engineered features depends on the model architecture. Deep learning models (CNNs, LSTMs, Transformers) can learn features from raw data given sufficient training examples (typically 10,000+ windows per class). Classical ML models (Random Forest, XGBoost) require engineered features but work well with smaller datasets (500-2,000 windows per class).
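A minimal sketch of the statistical feature row from the table, computed per window with plain NumPy (the random windows stand in for real windowed sensor data; the feature set and its ordering are illustrative):

```python
import numpy as np

def window_features(w):
    """Compute statistical features for each row of a (n_windows, n_samples) array."""
    mean = w.mean(axis=1)
    std = w.std(axis=1)
    rms = np.sqrt(np.mean(w**2, axis=1))
    peak = np.max(np.abs(w), axis=1)
    # Skewness and kurtosis computed directly to avoid extra dependencies.
    z = (w - mean[:, None]) / std[:, None]
    skew = np.mean(z**3, axis=1)
    kurt = np.mean(z**4, axis=1)
    crest = peak / rms  # crest factor rises as impulsive bearing faults develop
    return np.column_stack([mean, std, rms, crest, skew, kurt])

rng = np.random.default_rng(0)
windows = rng.normal(size=(500, 1024))  # 500 windows of 1,024 samples each
features = window_features(windows)     # shape (500, 6), one row per window
```

The resulting (windows × features) matrix is exactly what classical models such as Random Forest or XGBoost consume, which is why they work at the 500-2,000 window scale where deep models cannot yet learn features from raw data.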

    Stage 4: Anomaly Labeling

    Labeling sensor data for supervised anomaly detection is fundamentally different from labeling images or text. Anomalies are rare, often ambiguous, and the boundary between "normal degradation" and "anomalous behavior" is domain-specific.

    Labeling approaches by data availability:

| Approach | Data Requirement | Label Quality | Best For |
|---|---|---|---|
| Run-to-failure | Complete degradation histories with known failure times | High — failure time anchors labels | Equipment with planned replacements or documented failures |
| Expert annotation | Domain expert reviews time-series windows and assigns labels | Medium to high — depends on expert consistency | One-off anomalies, process deviations, novel failure modes |
| Maintenance log correlation | Match sensor windows to maintenance work orders by timestamp | Medium — logs may have imprecise timing | Retrospective labeling of historical data |
| Semi-supervised | Large unlabeled normal dataset + small set of confirmed anomalies | Variable — depends on normal data quality | When labeled anomalies are very scarce (fewer than 50 examples) |

    For predictive maintenance specifically, the labeling window matters enormously. A bearing that fails at time T shows degradation signatures starting days or weeks before failure. Labels should not be binary (normal/fault) but should indicate the remaining useful life (RUL) or degradation stage:

    • Normal — no detectable degradation
    • Early degradation — subtle signature changes visible in frequency domain
    • Advanced degradation — clear deviation from baseline in time domain
    • Imminent failure — pronounced anomaly across multiple features
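For run-to-failure data, these stages can be assigned mechanically from time-to-failure. The cut-offs below (one day, one week, one month) are hypothetical placeholders; real thresholds are equipment- and domain-specific and usually set by a reliability engineer:

```python
# Hypothetical stage cut-offs in hours before failure, for a run-to-failure
# history where the failure time T is known. Real thresholds vary by asset.
STAGES = ["imminent_failure", "advanced_degradation",
          "early_degradation", "normal"]
CUTOFFS_H = [24, 24 * 7, 24 * 30]  # <1 day, <1 week, <1 month, else normal

def degradation_stage(hours_to_failure):
    """Map a window's time-to-failure onto a degradation-stage label."""
    for cutoff, stage in zip(CUTOFFS_H, STAGES):
        if hours_to_failure < cutoff:
            return stage
    return STAGES[-1]

# Label four example windows at 2 h, ~4 days, ~3 weeks, and ~3 months out.
labels = [degradation_stage(h) for h in [2, 100, 500, 2000]]
```

The same function generalizes to continuous RUL regression by returning `hours_to_failure` directly instead of a stage name.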

    Stage 5: Normalization and Scaling

    Sensor data spans wildly different scales. Vibration acceleration values might range from -50 to +50 g, while temperature readings range from 20 to 200 degrees Celsius. Without normalization, models will weight high-magnitude features disproportionately.

| Normalization Method | Formula | When to Use |
|---|---|---|
| Z-score (standardization) | (x - mean) / std | Default for most sensor types; preserves distribution shape |
| Min-max scaling | (x - min) / (max - min) | When bounded range is known; output in 0 to 1 range |
| Robust scaling | (x - median) / IQR | When outliers are present and should not dominate statistics |
| Per-sensor normalization | Compute statistics per individual sensor | When sensors of the same type have different baselines due to mounting or calibration |

    Normalization must be computed on the training set only and then applied to validation and test sets using the training set statistics. Computing normalization statistics on the full dataset before splitting introduces data leakage.
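A short NumPy sketch of leak-free z-score normalization (the synthetic arrays stand in for already-split sensor features; the deliberate mean shift in the test data mimics drift after the split point):

```python
import numpy as np

rng = np.random.default_rng(1)
# Chronological split already done: earlier data trains, later data tests.
# The test period is deliberately shifted to mimic post-split drift.
train = rng.normal(loc=50.0, scale=5.0, size=(800, 3))
test = rng.normal(loc=52.0, scale=5.0, size=(200, 3))

# Fit z-score statistics on the TRAINING set only...
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# ...then apply the same statistics to both splits. The scaled test set is
# allowed to have a non-zero mean; forcing it to zero would leak test-set
# information into the preprocessing.
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma
```

Persisting `mu` and `sigma` alongside the model is also what makes inference-time scaling reproducible in production.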

    Stage 6: Train/Test Splitting for Time Series

    Standard random splitting is invalid for time-series data. Future data must never leak into the training set. Time-series splitting requires temporal ordering:

| Split Strategy | How It Works | When to Use |
|---|---|---|
| Chronological split | First 70% of time for train, next 15% for validation, last 15% for test | Single continuous deployment, sufficient data volume |
| Walk-forward split | Train on months 1-6, test on month 7; train on months 1-7, test on month 8; average results | When evaluating model stability over time |
| Group-based split | Split by equipment unit — train on units 1-8, test on units 9-10 | When evaluating generalization to unseen equipment |

    Never use random splitting for time-series sensor data. The autocorrelation in sensor signals means random splits create train/test overlap that inflates accuracy metrics by 10-30%.
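The walk-forward strategy from the table can be sketched as an expanding-window index generator. The fold count and "month" granularity below are illustrative; real pipelines index by timestamp rather than sample position:

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds, test_size):
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    splits = []
    for k in range(n_folds):
        test_end = n_samples - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        splits.append((np.arange(0, test_start),
                       np.arange(test_start, test_end)))
    return splits

# 12 "months" of data, 3 folds, each testing on one unseen month:
# train 1-9 / test 10, train 1-10 / test 11, train 1-11 / test 12.
splits = walk_forward_splits(n_samples=12, n_folds=3, test_size=1)
for train_idx, test_idx in splits:
    # Every training sample strictly precedes every test sample.
    assert train_idx.max() < test_idx.min()
```

Averaging the per-fold metrics gives the stability estimate the table describes, and the in-loop assertion is the temporal-ordering guarantee that random K-fold cross-validation cannot provide.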

    On-Premise Pipeline Requirements

    Industrial sensor data carries operational intelligence that manufacturers treat as trade secrets. Vibration signatures reveal equipment condition, process parameters, and production capacity. Temperature profiles expose proprietary process recipes. Acoustic signatures can indicate production volumes and equipment configurations.

    Sending this data to cloud-based ML platforms is a non-starter for most manufacturers. Beyond IP concerns, factory networks are often air-gapped from the internet by design, and bandwidth limitations make uploading terabytes of high-frequency sensor data impractical.

    Ertas Data Suite addresses this directly as a native desktop application that processes sensor data entirely on-premise. The visual pipeline canvas makes each preprocessing step observable — quality engineers can see exactly how raw sensor data is cleaned, windowed, normalized, and split before it reaches the model. The Anomaly Detector node flags data quality issues early in the pipeline, and the Quality Scorer node quantifies dataset fitness before export.

    Key Takeaways

    Sensor data preparation for AI is not a single problem — it is a sequence of domain-specific decisions about filtering, windowing, labeling, normalization, and splitting. Each sensor type requires different preprocessing parameters, and getting any stage wrong propagates errors into model performance.

    The teams that build reliable predictive maintenance and anomaly detection models invest heavily in observable, reproducible data pipelines. The teams that struggle in production are typically the ones that scripted ad-hoc preprocessing with no logging, no quality checks, and no reproducibility. The pipeline is the foundation.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
