
Data Preparation for Supply Chain Demand Forecasting AI
A practical guide to building data pipelines for supply chain demand forecasting AI — covering data source mapping, quality requirements by forecasting horizon, feature engineering, and on-premise deployment for enterprise supply chains.
McKinsey estimates that AI-driven demand forecasting can reduce forecasting errors in supply chain networks by 30-50% and cut lost sales from stockouts by up to 65%. Yet most enterprise demand forecasting initiatives stall during data preparation, not during model development. The reason is structural: demand forecasting requires integrating data from systems that were never designed to talk to each other — ERP, WMS, POS, CRM, and external data providers — into a single clean, temporally aligned dataset.
The forecasting model itself is often the simplest part of the project. Getting historical sales, inventory positions, supplier lead times, promotional calendars, and macroeconomic signals into a consistent, quality-validated training set is where 60-80% of project effort concentrates. This guide covers how to build that data pipeline systematically.
Data Source Mapping for Demand Forecasting
Demand forecasting models consume data from multiple enterprise systems. Each source has distinct formats, update frequencies, and quality characteristics.
| Data Source | System of Origin | Typical Format | Update Frequency | Key Fields |
|---|---|---|---|---|
| Historical sales/orders | ERP (SAP, Oracle, NetSuite) | CSV/Excel export, API extract | Daily to weekly | SKU, quantity, date, channel, customer segment, price |
| Inventory positions | WMS (Manhattan, Blue Yonder) | CSV/Excel export | Daily | SKU, location, on-hand quantity, in-transit, reserved |
| Supplier lead times | Procurement/SRM system | Excel, manual tracking sheets | Monthly to quarterly | Supplier, SKU/category, quoted lead time, actual lead time history |
| Promotional calendar | Marketing/trade promotion system | Excel, shared calendars | Monthly | Promotion type, start/end dates, affected SKUs, discount depth |
| Point-of-sale data | POS/retail system | CSV, EDI 852 | Daily to weekly | Store, SKU, units sold, price, returns |
| Pricing history | ERP or pricing engine | CSV/Excel | Per price change event | SKU, effective date, list price, net price, currency |
| Weather data | Third-party API (NOAA, Weather Company) | JSON/CSV | Daily | Region, temperature, precipitation, severe weather alerts |
| Macroeconomic indicators | Government statistics, data vendors | CSV | Monthly to quarterly | CPI, GDP growth, consumer confidence, unemployment rate |
| Competitor pricing | Web scraping or third-party feeds | JSON/CSV | Daily to weekly | Competitor, product category, price point, availability |
Not every forecasting model needs every source. Which sources matter depends on the forecasting horizon and the business context. A short-term replenishment forecast for a grocery retailer needs daily POS data and promotional calendars. A long-term capacity planning forecast for a manufacturer needs macroeconomic indicators and supplier lead time trends.
Quality Requirements by Forecasting Horizon
Different forecasting horizons have different data quality tolerances. A strategic forecast for annual capacity planning can tolerate some data imprecision. An operational forecast driving daily replenishment orders cannot.
| Forecasting Horizon | Time Range | Primary Data Sources | Quality Requirements | Acceptable Granularity |
|---|---|---|---|---|
| Operational | 1-14 days | POS data, current inventory, weather, promotional calendar | Very high: under 1% missing values, daily granularity, sub-24hr data latency | Daily by SKU by location |
| Tactical | 2-12 weeks | Historical sales, inventory, promotions, supplier lead times | High: under 3% missing values, weekly granularity, weekly refresh acceptable | Weekly by SKU by region |
| Strategic | 3-18 months | Historical sales trends, macroeconomic indicators, competitive landscape | Moderate: under 5% missing values, monthly granularity, monthly refresh | Monthly by product category by market |
| Long-range planning | 1-5 years | Annual trends, market research, demographic shifts, technology adoption curves | Tolerant: under 10% missing values, quarterly/annual granularity | Quarterly by product family by segment |
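These tolerances can be encoded as configuration that a validation step checks each extract against before training. A minimal sketch in Python, with thresholds mirroring the table above; the key names, latency values, and the simple missing-value gate are illustrative assumptions rather than a prescribed format:

```python
# Illustrative sketch: horizon-specific quality tolerances as a config the pipeline
# can check extracts against. Key names and the gate logic are assumptions.
QUALITY_REQUIREMENTS = {
    "operational": {"max_missing_pct": 1.0,  "granularity": "D", "max_latency_hours": 24},
    "tactical":    {"max_missing_pct": 3.0,  "granularity": "W", "max_latency_hours": 7 * 24},
    "strategic":   {"max_missing_pct": 5.0,  "granularity": "M", "max_latency_hours": 30 * 24},
    "long_range":  {"max_missing_pct": 10.0, "granularity": "Q", "max_latency_hours": 90 * 24},
}

def passes_quality_gate(df, horizon: str) -> bool:
    # Share of missing cells across the frame, compared to the horizon's tolerance
    missing_pct = 100 * float(df.isna().mean().mean())
    return missing_pct <= QUALITY_REQUIREMENTS[horizon]["max_missing_pct"]
```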
Data Freshness Requirements
The value of demand forecasting data decays at different rates depending on the source:
| Data Source | Freshness Requirement | Impact of Stale Data |
|---|---|---|
| POS / sales orders | Same-day to next-day | Replenishment orders based on yesterday's demand miss trend shifts |
| Inventory positions | Same-day | Overstocking or stockouts from incorrect available-to-promise |
| Promotional calendar | 2-4 weeks lead time | Forecast misses demand spikes from unaccounted promotions |
| Supplier lead times | Monthly update | Safety stock calculations based on outdated lead times |
| Weather | 1-3 day forecast window | Fails to capture weather-driven demand (seasonal goods, HVAC, beverages) |
| Macroeconomic | Monthly to quarterly | Strategic forecasts miss economic turning points |
Pipeline Stages for Demand Forecasting Data
Stage 1: Ingestion and Schema Validation
The ingestion stage pulls data from source systems and validates that each extract conforms to expected schemas. Schema drift — when a source system changes column names, date formats, or adds/removes fields during an upgrade — is a common failure mode in production forecasting pipelines.
| Validation Check | What It Catches | Action on Failure |
|---|---|---|
| Column presence | Missing expected fields after system upgrade | Reject file, alert data engineering |
| Data type validation | Strings in numeric fields, dates in wrong format | Reject affected rows, log for correction |
| Row count threshold | Unexpectedly empty or truncated extracts | Reject if count falls below 80% of historical average |
| Date range validation | Gaps or future-dated records in historical data | Flag gaps for interpolation, reject future dates |
| Currency/unit consistency | Mixed currencies or unit-of-measure changes mid-file | Normalize to base currency/unit using conversion tables |
Ertas Data Suite handles multi-format ingestion through dedicated parser nodes for CSV, Excel, PDF, and other common enterprise export formats. The Format Normalizer node standardizes date formats, currency representations, and unit-of-measure encoding across sources. Each validation step is visible on the pipeline canvas, making it immediately clear when a source file deviates from expectations.
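For teams wiring up ingestion in code rather than on a canvas, the same checks translate into a few lines of validation logic. Below is a minimal sketch for a daily sales extract; the column names, the 80% row-count threshold, and the quarantine file path are illustrative assumptions:

```python
# Minimal sketch of ingestion-time schema validation for a daily sales extract.
# Column names, thresholds, and file paths are illustrative assumptions.
import pandas as pd

EXPECTED_COLUMNS = {"sku", "order_date", "quantity", "net_price", "channel"}
MIN_ROW_FRACTION = 0.8  # reject extracts below 80% of the historical average row count

def validate_sales_extract(path: str, historical_avg_rows: int) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Column presence: catches schema drift after a source-system upgrade
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Rejecting {path}: missing columns {missing}")

    # Row count threshold: catches truncated or empty extracts
    if len(df) < MIN_ROW_FRACTION * historical_avg_rows:
        raise ValueError(f"Rejecting {path}: {len(df)} rows, expected ~{historical_avg_rows}")

    # Data type validation: coerce, then quarantine failing rows for later correction
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    bad_rows = df[df["order_date"].isna() | df["quantity"].isna()]
    bad_rows.to_csv("quarantined_rows.csv", index=False)
    df = df.drop(bad_rows.index)

    # Date range validation: drop future-dated records from historical data
    return df[df["order_date"] <= pd.Timestamp.today().normalize()]
```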
Stage 2: Deduplication and Entity Resolution
Enterprise systems frequently produce duplicate records through integration errors, batch re-runs, and multi-system order capture. A single customer order might appear in both the ERP and the POS system with slightly different fields.
Entity resolution across systems is equally important. The same product might be identified as SKU "A1234" in the ERP, "1234-A" in the WMS, and "Product A Regular 12oz" in the POS system. Without a unified product master, the forecasting model treats these as three different products with separate demand histories.
Key deduplication and resolution tasks:
- Order deduplication: Match on order ID, date, and amount to remove duplicates from multi-system extracts
- Product harmonization: Map all product identifiers to a single canonical SKU using a cross-reference table
- Location harmonization: Map warehouse codes, store numbers, and region identifiers to a consistent hierarchy
- Customer deduplication: Match customer records across CRM and order systems (relevant for B2B forecasting)
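A minimal sketch of what order deduplication and product harmonization against a cross-reference table might look like; the tiny xref table, the column names, and the output path are illustrative assumptions:

```python
# Minimal sketch of order deduplication and product harmonization.
# The cross-reference table, column names, and file path are illustrative assumptions.
import pandas as pd

# Cross-reference table mapping each system's product identifier to one canonical SKU
xref = pd.DataFrame({
    "source_system": ["ERP", "WMS", "POS"],
    "source_id":     ["A1234", "1234-A", "Product A Regular 12oz"],
    "canonical_sku": ["A1234", "A1234", "A1234"],
})

def harmonize_and_dedupe(orders: pd.DataFrame) -> pd.DataFrame:
    # Product harmonization: map every source identifier onto the canonical SKU
    merged = orders.merge(
        xref,
        left_on=["source_system", "product_id"],
        right_on=["source_system", "source_id"],
        how="left",
    )

    # Identifiers with no mapping go back to the product-master owners, not into the model
    unmapped = merged["canonical_sku"].isna()
    if unmapped.any():
        merged[unmapped].to_csv("unmapped_products.csv", index=False)
        merged = merged[~unmapped]

    # Order deduplication: the same order captured by two systems matches on ID, date, amount
    return merged.drop_duplicates(subset=["order_id", "order_date", "order_amount"])
```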
Stage 3: Missing Value Treatment
Missing data in demand forecasting requires domain-aware imputation, not generic statistical filling. A zero-sales day might mean no demand (product is seasonal) or might mean the product was out of stock (demand existed but was not captured). The imputation strategy must distinguish between these scenarios.
| Missing Data Scenario | Detection Method | Imputation Strategy |
|---|---|---|
| True zero demand | Product in stock, no sales recorded | Record as zero — do not impute |
| Stockout (censored demand) | Inventory position at zero on the date in question | Impute using demand from comparable in-stock periods |
| System downtime | All products show zero sales for a location/date | Impute using prior-period same-day-of-week averages |
| New product (no history) | Product launch date is after the training window start | Use analogous product demand, adjust by launch curve assumptions |
| Discontinued product | Product end-of-life date is within the training window | Truncate history at EOL; do not impute post-EOL zeros |
Stockout-induced censored demand is the most critical scenario to handle correctly. If the model trains on observed sales that include stockout zeros, it learns that demand drops to zero periodically — and will forecast low demand precisely when the product is most likely to stock out again, creating a self-reinforcing cycle.
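One way to handle this, sketched below under the assumption of a daily per-SKU frame with `date`, `units_sold`, and `on_hand` columns, is to flag stockout days explicitly and impute them from comparable in-stock periods, here using same-day-of-week averages:

```python
# Minimal sketch of stockout-aware demand imputation for one SKU/location series.
# Column names and the day-of-week averaging rule are illustrative assumptions.
import pandas as pd

def impute_censored_demand(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("date").copy()
    df["day_of_week"] = pd.to_datetime(df["date"]).dt.dayofweek

    # Censored demand: zero sales recorded while inventory was at zero
    df["stockout"] = (df["units_sold"] == 0) & (df["on_hand"] == 0)

    # Reference demand: average sales on in-stock days, by day of week
    dow_mean = df.loc[~df["stockout"]].groupby("day_of_week")["units_sold"].mean()

    # True zeros (in stock, no sales) stay zero; only stockout days are imputed
    df["demand"] = df["units_sold"].where(~df["stockout"], df["day_of_week"].map(dow_mean))
    return df
```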
Stage 4: Feature Engineering
Raw historical data needs transformation into features that capture demand patterns. Feature engineering for demand forecasting falls into several categories:
| Feature Category | Examples | Engineering Method |
|---|---|---|
| Lag features | Sales 1 day ago, 7 days ago, 28 days ago, 364 days ago | Time-shifted values from historical sales |
| Rolling statistics | 7-day moving average, 28-day rolling std dev, 13-week trend | Window-based aggregations |
| Calendar features | Day of week, month, quarter, holiday flag, pre/post-holiday | Date decomposition + holiday calendar lookup |
| Promotional features | Active promotion flag, discount depth, days since last promo, days until next promo | Join with promotional calendar |
| Price features | Current price, price change flag, price relative to 90-day average | Join with pricing history, compute derived metrics |
| External signals | Temperature forecast, CPI change, competitor price index | Join by date and region from external sources |
| Inventory features | Days of supply, stockout flag, weeks of cover | Compute from inventory positions and demand rate |
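A minimal sketch of the lag, rolling, calendar, and price features from the table, computed for a single SKU series; the column names, window lengths, and holiday list are illustrative assumptions:

```python
# Minimal sketch of lag, rolling, calendar, and price features for one SKU series.
# Column names, window lengths, and the holiday list are illustrative assumptions.
import pandas as pd

def engineer_features(df: pd.DataFrame, holidays: list) -> pd.DataFrame:
    df = df.sort_values("date").copy()
    df["date"] = pd.to_datetime(df["date"])

    # Lag features: demand 1, 7, 28, and 364 days ago
    for lag in (1, 7, 28, 364):
        df[f"demand_lag_{lag}"] = df["demand"].shift(lag)

    # Rolling statistics over trailing windows, shifted so the current day never leaks in
    df["demand_ma_7"] = df["demand"].shift(1).rolling(7).mean()
    df["demand_std_28"] = df["demand"].shift(1).rolling(28).std()

    # Calendar features
    df["day_of_week"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    df["is_holiday"] = df["date"].isin(pd.to_datetime(holidays)).astype(int)

    # Price features: change flag and price relative to the trailing 90-day average
    df["price_change"] = (df["net_price"] != df["net_price"].shift(1)).astype(int)
    df["price_vs_90d_avg"] = df["net_price"] / df["net_price"].shift(1).rolling(90).mean()
    return df
```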
Stage 5: Temporal Alignment and Aggregation
Different data sources operate on different time grains. POS data is daily. Macroeconomic data is monthly. Supplier lead times update quarterly. The pipeline must aggregate or disaggregate all sources to the target forecasting granularity.
Aggregation rules by data type:
| Data Type | Aggregation Method | Disaggregation Method |
|---|---|---|
| Sales/demand volume | Sum to higher grain | Distribute proportionally using historical daily patterns |
| Prices | Average (or end-of-period) to higher grain | Forward-fill last known price to daily grain |
| Inventory | End-of-period snapshot to higher grain | Linear interpolation between snapshots |
| Binary flags (promo, holiday) | Any-true to higher grain | Apply flag to all days within the original event period |
| Weather | Average for temperature; sum for precipitation | Daily values are already the native grain |
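A minimal sketch of aligning a daily sales series and a monthly CPI series to a weekly forecasting grain using the rules above; the column names, the single-SKU assumption, and the Monday week anchor are illustrative:

```python
# Minimal sketch of aligning a daily sales series and a monthly CPI series to a weekly grain.
# Assumes one SKU/location per frame and a 'date' column in each; names are illustrative.
import pandas as pd

def align_to_weekly(sales_daily: pd.DataFrame, cpi_monthly: pd.DataFrame) -> pd.DataFrame:
    sales_daily = sales_daily.assign(date=pd.to_datetime(sales_daily["date"]))
    cpi_monthly = cpi_monthly.assign(date=pd.to_datetime(cpi_monthly["date"]))

    # Aggregate up: sum demand volume, any-true for promo flags, average prices
    weekly = (
        sales_daily.set_index("date")
        .resample("W-MON")
        .agg({"demand": "sum", "promo_flag": "max", "net_price": "mean"})
    )

    # Disaggregate down: forward-fill the last known monthly CPI onto the weekly grain
    cpi_weekly = cpi_monthly.set_index("date")["cpi"].resample("W-MON").ffill()

    return weekly.join(cpi_weekly, how="left")
```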
Stage 6: Validation and Export
Before the training dataset reaches the model, comprehensive validation catches issues that would otherwise surface as unexplained forecast errors.
| Validation Rule | What It Catches | Threshold |
|---|---|---|
| Demand stationarity test | Structural breaks or level shifts requiring separate model treatment | ADF test p-value; flag if non-stationary without differencing |
| Feature correlation | Multicollinear features that destabilize model coefficients | Drop one of any pair with correlation above 0.95 |
| Target leakage | Features that contain future information | Verify all features use only data available at forecast origin |
| Class balance (for classification) | Extreme imbalance in demand categories (if classifying demand levels) | Flag if minority class falls below 5% |
| Temporal completeness | Gaps in the time series after all processing | Zero tolerance for gaps in operational forecasting data |
Export the validated dataset in the target model format, typically CSV or Parquet for tabular models, with train/validation/test splits based on time (never random) to prevent temporal leakage.
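A minimal sketch of the time-based split and the correlation check from the validation table, assuming a datetime `date` column and numeric feature columns; the cut-off dates and the 0.95 threshold are illustrative:

```python
# Minimal sketch of a time-based train/validation/test split and a collinearity check.
# Assumes a datetime 'date' column; cut-off dates and the 0.95 threshold are illustrative.
import numpy as np
import pandas as pd

def temporal_split(df: pd.DataFrame, val_start: str, test_start: str):
    # Splits are by date, never random, so no future information reaches training
    df = df.sort_values("date")
    train = df[df["date"] < pd.Timestamp(val_start)]
    val = df[(df["date"] >= pd.Timestamp(val_start)) & (df["date"] < pd.Timestamp(test_start))]
    test = df[df["date"] >= pd.Timestamp(test_start)]
    return train, val, test

def collinear_features(train: pd.DataFrame, threshold: float = 0.95) -> list:
    # Flag one feature from each pair whose absolute correlation exceeds the threshold
    corr = train.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Usage sketch (dates and file names are placeholders):
# train, val, test = temporal_split(features, "2024-01-01", "2024-07-01")
# features = features.drop(columns=collinear_features(train))
# train.to_parquet("train.parquet")
```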
Why On-Premise Matters for Supply Chain Data
Supply chain data is competitively sensitive. Historical demand by SKU reveals product performance. Supplier lead times expose procurement relationships. Pricing history shows margin structure. Inventory positions indicate operational efficiency. This data, in aggregate, provides a comprehensive view of business operations that companies protect rigorously.
Beyond confidentiality, many enterprises have data governance policies that prohibit sending transactional data to external cloud services without extensive security review. For global supply chains, data residency requirements may restrict where data can be processed geographically.
Ertas Data Suite operates entirely on-premise as a native desktop application. Supply chain data never leaves the enterprise network. Every transformation node logs its operations, producing an audit trail that satisfies internal data governance reviews and external compliance requirements. The visual pipeline canvas gives supply chain analysts — who understand the business logic but may not write Python — direct visibility into how their data is being prepared for AI models.
Key Takeaways
Demand forecasting AI is a data integration problem before it is a modeling problem. The pipeline must handle multi-source ingestion, entity resolution, domain-aware imputation (especially for censored demand during stockouts), temporal alignment, and rigorous validation against leakage and stationarity assumptions.
Teams that build observable, reproducible data pipelines ship forecasting models that improve over time as data quality improves. Teams that duct-tape data preparation together with scripts spend their time debugging forecast errors that trace back to data issues they cannot see.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.