Tags: supply-chain, demand-forecasting, data-pipeline, enterprise, on-premise, ai-training, time-series

    Data Preparation for Supply Chain Demand Forecasting AI

    A practical guide to building data pipelines for supply chain demand forecasting AI — covering data source mapping, quality requirements by forecasting horizon, feature engineering, and on-premise deployment for enterprise supply chains.

Ertas Team

    McKinsey estimates that AI-driven demand forecasting can reduce supply chain errors by 30-50% and cut lost sales from stockouts by up to 65%. Yet most enterprise demand forecasting initiatives stall during data preparation, not during model development. The reason is structural: demand forecasting requires integrating data from systems that were never designed to talk to each other — ERP, WMS, POS, CRM, and external data providers — into a single clean, temporally aligned dataset.

    The forecasting model itself is often the simplest part of the project. Getting historical sales, inventory positions, supplier lead times, promotional calendars, and macroeconomic signals into a consistent, quality-validated training set is where 60-80% of project effort concentrates. This guide covers how to build that data pipeline systematically.

    Data Source Mapping for Demand Forecasting

    Demand forecasting models consume data from multiple enterprise systems. Each source has distinct formats, update frequencies, and quality characteristics.

| Data Source | System of Origin | Typical Format | Update Frequency | Key Fields |
|---|---|---|---|---|
| Historical sales/orders | ERP (SAP, Oracle, NetSuite) | CSV/Excel export, API extract | Daily to weekly | SKU, quantity, date, channel, customer segment, price |
| Inventory positions | WMS (Manhattan, Blue Yonder) | CSV/Excel export | Daily | SKU, location, on-hand quantity, in-transit, reserved |
| Supplier lead times | Procurement/SRM system | Excel, manual tracking sheets | Monthly to quarterly | Supplier, SKU/category, quoted lead time, actual lead time history |
| Promotional calendar | Marketing/trade promotion system | Excel, shared calendars | Monthly | Promotion type, start/end dates, affected SKUs, discount depth |
| Point-of-sale data | POS/retail system | CSV, EDI 852 | Daily to weekly | Store, SKU, units sold, price, returns |
| Pricing history | ERP or pricing engine | CSV/Excel | Per price change event | SKU, effective date, list price, net price, currency |
| Weather data | Third-party API (NOAA, Weather Company) | JSON/CSV | Daily | Region, temperature, precipitation, severe weather alerts |
| Macroeconomic indicators | Government statistics, data vendors | CSV | Monthly to quarterly | CPI, GDP growth, consumer confidence, unemployment rate |
| Competitor pricing | Web scraping or third-party feeds | JSON/CSV | Daily to weekly | Competitor, product category, price point, availability |

Not every forecasting model needs every source. Which sources to include depends on the forecasting horizon and the business context. A short-term replenishment forecast for a grocery retailer needs daily POS data and promotional calendars; a long-term capacity planning forecast for a manufacturer needs macroeconomic indicators and supplier lead-time trends.

    Quality Requirements by Forecasting Horizon

    Different forecasting horizons have different data quality tolerances. A strategic forecast for annual capacity planning can tolerate some data imprecision. An operational forecast driving daily replenishment orders cannot.

| Forecasting Horizon | Time Range | Primary Data Sources | Quality Requirements | Acceptable Granularity |
|---|---|---|---|---|
| Operational | 1-14 days | POS data, current inventory, weather, promotional calendar | Very high: under 1% missing values, daily granularity, sub-24hr data latency | Daily by SKU by location |
| Tactical | 2-12 weeks | Historical sales, inventory, promotions, supplier lead times | High: under 3% missing values, weekly granularity, weekly refresh acceptable | Weekly by SKU by region |
| Strategic | 3-18 months | Historical sales trends, macroeconomic indicators, competitive landscape | Moderate: under 5% missing values, monthly granularity, monthly refresh | Monthly by product category by market |
| Long-range planning | 1-5 years | Annual trends, market research, demographic shifts, technology adoption curves | Tolerant: under 10% missing values, quarterly/annual granularity | Quarterly by product family by segment |
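These tolerances translate directly into automated quality gates. A minimal pandas sketch, assuming a demand DataFrame; the horizon keys and the column name are illustrative, and the limits mirror the table:

```python
import pandas as pd

# Maximum tolerated missing-value share per horizon, mirroring the
# table above (illustrative horizon keys).
MISSING_LIMITS = {
    "operational": 0.01,
    "tactical": 0.03,
    "strategic": 0.05,
    "long_range": 0.10,
}

def passes_quality_gate(df: pd.DataFrame, horizon: str, column: str = "demand") -> bool:
    """Return True if the share of missing values in `column` is within
    the tolerance for the given forecasting horizon."""
    missing_share = df[column].isna().mean()
    limit = MISSING_LIMITS[horizon]
    if missing_share > limit:
        print(f"FAIL {horizon}: {missing_share:.1%} missing exceeds the {limit:.0%} limit")
        return False
    return True
```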

    Data Freshness Requirements

    The value of demand forecasting data decays at different rates depending on the source:

| Data Source | Freshness Requirement | Impact of Stale Data |
|---|---|---|
| POS / sales orders | Same-day to next-day | Replenishment orders based on yesterday's demand miss trend shifts |
| Inventory positions | Same-day | Overstocking or stockouts from incorrect available-to-promise |
| Promotional calendar | 2-4 weeks lead time | Forecast misses demand spikes from unaccounted promotions |
| Supplier lead times | Monthly update | Safety stock calculations based on outdated lead times |
| Weather | 1-3 day forecast window | Fails to capture weather-driven demand (seasonal goods, HVAC, beverages) |
| Macroeconomic | Monthly to quarterly | Strategic forecasts miss economic turning points |
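Staleness checks are easy to automate: compare each source's newest record timestamp against its freshness SLA. A minimal sketch with illustrative source names; the SLA values follow the table:

```python
from datetime import datetime, timedelta

# Freshness SLA per source, expressed as maximum record age,
# mirroring the table above (illustrative source names).
FRESHNESS_SLA = {
    "pos_sales": timedelta(days=1),
    "inventory_positions": timedelta(days=1),
    "supplier_lead_times": timedelta(days=31),
    "macroeconomic": timedelta(days=92),
}

def stale_sources(latest_record_at: dict, now: datetime = None) -> list:
    """Return every source whose newest record is older than its SLA.
    `latest_record_at` maps source name to the timestamp of its newest record."""
    now = now or datetime.now()
    return [
        source
        for source, sla in FRESHNESS_SLA.items()
        if now - latest_record_at[source] > sla
    ]
```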

    Pipeline Stages for Demand Forecasting Data

    Stage 1: Ingestion and Schema Validation

    The ingestion stage pulls data from source systems and validates that each extract conforms to expected schemas. Schema drift — when a source system changes column names, date formats, or adds/removes fields during an upgrade — is a common failure mode in production forecasting pipelines.

| Validation Check | What It Catches | Action on Failure |
|---|---|---|
| Column presence | Missing expected fields after system upgrade | Reject file, alert data engineering |
| Data type validation | Strings in numeric fields, dates in wrong format | Reject affected rows, log for correction |
| Row count threshold | Unexpectedly empty or truncated extracts | Reject if count falls below 80% of historical average |
| Date range validation | Gaps or future-dated records in historical data | Flag gaps for interpolation, reject future dates |
| Currency/unit consistency | Mixed currencies or unit-of-measure changes mid-file | Normalize to base currency/unit using conversion tables |
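As a rough illustration of these checks, the sketch below validates a sales extract with pandas. The expected schema, field names, and return convention are assumptions for this example, not any specific product's API:

```python
import pandas as pd

# Expected schema for the sales extract (illustrative field names).
EXPECTED_COLUMNS = ["sku", "quantity", "order_date"]

def validate_extract(df: pd.DataFrame, historical_avg_rows: int) -> list:
    """Run the ingestion checks from the table above; return failure reasons."""
    failures = []
    # Column presence: catches renamed or dropped fields after an upgrade.
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]  # later checks need these fields
    # Data type validation: strings in numeric fields, unparseable dates.
    if pd.to_numeric(df["quantity"], errors="coerce").isna().any():
        failures.append("non-numeric or missing values in 'quantity'")
    dates = pd.to_datetime(df["order_date"], errors="coerce")
    if dates.isna().any():
        failures.append("unparseable values in 'order_date'")
    # Row count threshold: reject truncated extracts (80% rule from the table).
    if len(df) < 0.8 * historical_avg_rows:
        failures.append(f"row count {len(df)} below 80% of historical average")
    # Date range validation: no future-dated records in historical data.
    if (dates > pd.Timestamp.now()).any():
        failures.append("future-dated records present")
    return failures
```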

    Ertas Data Suite handles multi-format ingestion through dedicated parser nodes for CSV, Excel, PDF, and other common enterprise export formats. The Format Normalizer node standardizes date formats, currency representations, and unit-of-measure encoding across sources. Each validation step is visible on the pipeline canvas, making it immediately clear when a source file deviates from expectations.

    Stage 2: Deduplication and Entity Resolution

    Enterprise systems frequently produce duplicate records through integration errors, batch re-runs, and multi-system order capture. A single customer order might appear in both the ERP and the POS system with slightly different fields.

    Entity resolution across systems is equally important. The same product might be identified as SKU "A1234" in the ERP, "1234-A" in the WMS, and "Product A Regular 12oz" in the POS system. Without a unified product master, the forecasting model treats these as three different products with separate demand histories.

    Key deduplication and resolution tasks:

    • Order deduplication: Match on order ID, date, and amount to remove duplicates from multi-system extracts
    • Product harmonization: Map all product identifiers to a single canonical SKU using a cross-reference table (see the sketch after this list)
    • Location harmonization: Map warehouse codes, store numbers, and region identifiers to a consistent hierarchy
    • Customer deduplication: Match customer records across CRM and order systems (relevant for B2B forecasting)
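A minimal pandas sketch of product harmonization and order deduplication, using the identifiers from the example above; the order column names are assumptions:

```python
import pandas as pd

# Cross-reference table mapping each system-specific identifier to one
# canonical SKU, using the identifiers from the example above.
xref = pd.DataFrame({
    "source_id": ["A1234", "1234-A", "Product A Regular 12oz"],
    "canonical_sku": ["A1234", "A1234", "A1234"],
})

def harmonize_and_dedupe(orders: pd.DataFrame) -> pd.DataFrame:
    """Map product identifiers to canonical SKUs, then drop duplicate
    orders. Assumes columns 'product_id', 'order_id', 'order_date',
    and 'amount' (illustrative names)."""
    out = orders.merge(xref, left_on="product_id", right_on="source_id", how="left")
    # Order deduplication: match on order ID, date, and amount.
    return out.drop_duplicates(subset=["order_id", "order_date", "amount"])
```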

    Stage 3: Missing Value Treatment

    Missing data in demand forecasting requires domain-aware imputation, not generic statistical filling. A zero-sales day might mean no demand (product is seasonal) or might mean the product was out of stock (demand existed but was not captured). The imputation strategy must distinguish between these scenarios.

| Missing Data Scenario | Detection Method | Imputation Strategy |
|---|---|---|
| True zero demand | Product in stock, no sales recorded | Record as zero; do not impute |
| Stockout (censored demand) | Inventory position at zero on the date in question | Impute using demand from comparable in-stock periods |
| System downtime | All products show zero sales for a location/date | Impute using prior-period same-day-of-week averages |
| New product (no history) | Product launch date is after the training window start | Use analogous product demand, adjust by launch curve assumptions |
| Discontinued product | Product end-of-life date is within the training window | Truncate history at EOL; do not impute post-EOL zeros |

    Stockout-induced censored demand is the most critical scenario to handle correctly. If the model trains on observed sales that include stockout zeros, it learns that demand drops to zero periodically — and will forecast low demand precisely when the product is most likely to stock out again, creating a self-reinforcing cycle.
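One way to implement the stockout distinction is sketched below for a single SKU, assuming a daily frame with illustrative column names; stockout days are imputed from same-day-of-week averages of in-stock days, per the table:

```python
import pandas as pd

def impute_censored_demand(df: pd.DataFrame) -> pd.DataFrame:
    """Separate true zero demand from stockout-censored zeros for one SKU.

    Assumes a daily DatetimeIndex with 'units_sold' and 'on_hand' columns
    (illustrative names). Stockout days are imputed from the same-day-of-week
    average of in-stock days; true zeros are kept as zeros.
    """
    df = df.copy()
    stockout = (df["on_hand"] == 0) & (df["units_sold"] == 0)
    # Baseline demand per day of week, computed only from in-stock days.
    in_stock = df.loc[~stockout, "units_sold"]
    dow_avg = in_stock.groupby(in_stock.index.dayofweek).mean()
    dow = pd.Series(df.index.dayofweek, index=df.index)
    df["demand"] = df["units_sold"].where(~stockout, dow.map(dow_avg).fillna(0.0))
    df["stockout_flag"] = stockout.astype(int)
    return df
```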

    Stage 4: Feature Engineering

    Raw historical data needs transformation into features that capture demand patterns. Feature engineering for demand forecasting falls into several categories:

| Feature Category | Examples | Engineering Method |
|---|---|---|
| Lag features | Sales 1 day ago, 7 days ago, 28 days ago, 364 days ago | Time-shifted values from historical sales |
| Rolling statistics | 7-day moving average, 28-day rolling std dev, 13-week trend | Window-based aggregations |
| Calendar features | Day of week, month, quarter, holiday flag, pre/post-holiday | Date decomposition + holiday calendar lookup |
| Promotional features | Active promotion flag, discount depth, days since last promo, days until next promo | Join with promotional calendar |
| Price features | Current price, price change flag, price relative to 90-day average | Join with pricing history, compute derived metrics |
| External signals | Temperature forecast, CPI change, competitor price index | Join by date and region from external sources |
| Inventory features | Days of supply, stockout flag, weeks of cover | Compute from inventory positions and demand rate |
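A minimal pandas sketch of the first three categories (lags, rolling statistics, calendar features), assuming a daily per-SKU demand series; rolling windows are shifted by one day so each feature uses only data available at the forecast origin:

```python
import pandas as pd

def build_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Lag, rolling, and calendar features for one SKU's daily demand.

    Assumes a DataFrame with a DatetimeIndex and a 'demand' column
    (illustrative names).
    """
    out = daily.copy()
    # Lag features: time-shifted demand (yesterday, last week, ~last year).
    for lag in (1, 7, 28, 364):
        out[f"demand_lag_{lag}"] = out["demand"].shift(lag)
    # Rolling statistics over trailing windows, shifted to avoid leakage.
    out["demand_ma_7"] = out["demand"].shift(1).rolling(7).mean()
    out["demand_std_28"] = out["demand"].shift(1).rolling(28).std()
    # Calendar features from date decomposition.
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    out["quarter"] = out.index.quarter
    return out
```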

    Stage 5: Temporal Alignment and Aggregation

    Different data sources operate on different time grains. POS data is daily. Macroeconomic data is monthly. Supplier lead times update quarterly. The pipeline must aggregate or disaggregate all sources to the target forecasting granularity.

    Aggregation rules by data type:

| Data Type | Aggregation Method | Disaggregation Method |
|---|---|---|
| Sales/demand volume | Sum to higher grain | Distribute proportionally using historical daily patterns |
| Prices | Average (or end-of-period) to higher grain | Forward-fill last known price to daily grain |
| Inventory | End-of-period snapshot to higher grain | Linear interpolation between snapshots |
| Binary flags (promo, holiday) | Any-true to higher grain | Apply flag to all days within the original event period |
| Weather | Average for temperature; sum for precipitation | Daily values are already the native grain |
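In pandas, these per-type rules map directly onto a resample with per-column aggregation. A sketch for the daily-to-weekly case, with illustrative column names:

```python
import pandas as pd

def to_weekly(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a daily frame to the weekly grain using the per-type
    rules from the table above. Assumes a DatetimeIndex and columns
    'units_sold', 'price', 'on_hand', 'promo_flag' (illustrative names)."""
    return daily.resample("W").agg({
        "units_sold": "sum",   # demand volume: sum to higher grain
        "price": "mean",       # prices: average over the period
        "on_hand": "last",     # inventory: end-of-period snapshot
        "promo_flag": "max",   # binary flag: any-true within the week
    })
```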

    Stage 6: Validation and Export

    Before the training dataset reaches the model, comprehensive validation catches issues that would otherwise surface as unexplained forecast errors.

| Validation Rule | What It Catches | Threshold |
|---|---|---|
| Demand stationarity test | Structural breaks or level shifts requiring separate model treatment | ADF test p-value; flag if non-stationary without differencing |
| Feature correlation | Multicollinear features that destabilize model coefficients | Drop one of any pair with correlation above 0.95 |
| Target leakage | Features that contain future information | Verify all features use only data available at forecast origin |
| Class balance (for classification) | Extreme imbalance in demand categories (if classifying demand levels) | Flag if minority class falls below 5% |
| Temporal completeness | Gaps in the time series after all processing | Zero tolerance for gaps in operational forecasting data |

The final stage exports the dataset in the target model format, typically CSV or Parquet for tabular models, with train/validation/test splits based on time (never random) to prevent temporal leakage.
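A minimal sketch of such a time-based split, holding out the most recent weeks for validation and test; the window lengths are illustrative:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, val_weeks: int = 8, test_weeks: int = 8):
    """Split a time-indexed frame chronologically: the oldest data trains,
    the most recent weeks validate and test. Never split randomly, since
    random splits leak future information into training."""
    df = df.sort_index()
    test_start = df.index.max() - pd.Timedelta(weeks=test_weeks)
    val_start = test_start - pd.Timedelta(weeks=val_weeks)
    train = df.loc[df.index < val_start]
    val = df.loc[(df.index >= val_start) & (df.index < test_start)]
    test = df.loc[df.index >= test_start]
    return train, val, test
```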

    Why On-Premise Matters for Supply Chain Data

    Supply chain data is competitively sensitive. Historical demand by SKU reveals product performance. Supplier lead times expose procurement relationships. Pricing history shows margin structure. Inventory positions indicate operational efficiency. This data, in aggregate, provides a comprehensive view of business operations that companies protect rigorously.

    Beyond confidentiality, many enterprises have data governance policies that prohibit sending transactional data to external cloud services without extensive security review. For global supply chains, data residency requirements may restrict where data can be processed geographically.

    Ertas Data Suite operates entirely on-premise as a native desktop application. Supply chain data never leaves the enterprise network. Every transformation node logs its operations, producing an audit trail that satisfies internal data governance reviews and external compliance requirements. The visual pipeline canvas gives supply chain analysts — who understand the business logic but may not write Python — direct visibility into how their data is being prepared for AI models.

    Key Takeaways

    Demand forecasting AI is a data integration problem before it is a modeling problem. The pipeline must handle multi-source ingestion, entity resolution, domain-aware imputation (especially for censored demand during stockouts), temporal alignment, and rigorous validation against leakage and stationarity assumptions.

    Teams that build observable, reproducible data pipelines ship forecasting models that improve over time as data quality improves. Teams that duct-tape data preparation together with scripts spend their time debugging forecast errors that trace back to data issues they cannot see.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
