
Data Preparation for Supply Chain Demand Forecasting AI
A practical guide to building data pipelines for supply chain demand forecasting AI — covering data source mapping, quality requirements by forecasting horizon, feature engineering, and on-premise deployment for enterprise supply chains.
McKinsey estimates that AI-driven demand forecasting can reduce forecasting errors in supply chain networks by 30-50% and cut lost sales from stockouts by up to 65%. Yet most enterprise demand forecasting initiatives stall during data preparation, not during model development. The reason is structural: demand forecasting requires integrating data from systems that were never designed to talk to each other — ERP, WMS, POS, CRM, and external data providers — into a single clean, temporally aligned dataset.
The forecasting model itself is often the simplest part of the project. Getting historical sales, inventory positions, supplier lead times, promotional calendars, and macroeconomic signals into a consistent, quality-validated training set is where 60-80% of project effort concentrates. This guide covers how to build that data pipeline systematically.
Data Source Mapping for Demand Forecasting
Demand forecasting models consume data from multiple enterprise systems. Each source has distinct formats, update frequencies, and quality characteristics.
| Data Source | System of Origin | Typical Format | Update Frequency | Key Fields |
|---|---|---|---|---|
| Historical sales/orders | ERP (SAP, Oracle, NetSuite) | CSV/Excel export, API extract | Daily to weekly | SKU, quantity, date, channel, customer segment, price |
| Inventory positions | WMS (Manhattan, Blue Yonder) | CSV/Excel export | Daily | SKU, location, on-hand quantity, in-transit, reserved |
| Supplier lead times | Procurement/SRM system | Excel, manual tracking sheets | Monthly to quarterly | Supplier, SKU/category, quoted lead time, actual lead time history |
| Promotional calendar | Marketing/trade promotion system | Excel, shared calendars | Monthly | Promotion type, start/end dates, affected SKUs, discount depth |
| Point-of-sale data | POS/retail system | CSV, EDI 852 | Daily to weekly | Store, SKU, units sold, price, returns |
| Pricing history | ERP or pricing engine | CSV/Excel | Per price change event | SKU, effective date, list price, net price, currency |
| Weather data | Third-party API (NOAA, Weather Company) | JSON/CSV | Daily | Region, temperature, precipitation, severe weather alerts |
| Macroeconomic indicators | Government statistics, data vendors | CSV | Monthly to quarterly | CPI, GDP growth, consumer confidence, unemployment rate |
| Competitor pricing | Web scraping or third-party feeds | JSON/CSV | Daily to weekly | Competitor, product category, price point, availability |
Not every forecasting model needs every source. Which sources matter depends on the forecasting horizon and the business context. A short-term replenishment forecast for a grocery retailer needs daily POS data and promotional calendars. A long-term capacity planning forecast for a manufacturer needs macroeconomic indicators and supplier lead time trends.
Quality Requirements by Forecasting Horizon
Different forecasting horizons have different data quality tolerances. A strategic forecast for annual capacity planning can tolerate some data imprecision. An operational forecast driving daily replenishment orders cannot.
| Forecasting Horizon | Time Range | Primary Data Sources | Quality Requirements | Acceptable Granularity |
|---|---|---|---|---|
| Operational | 1-14 days | POS data, current inventory, weather, promotional calendar | Very high: under 1% missing values, daily granularity, sub-24hr data latency | Daily by SKU by location |
| Tactical | 2-12 weeks | Historical sales, inventory, promotions, supplier lead times | High: under 3% missing values, weekly granularity, weekly refresh acceptable | Weekly by SKU by region |
| Strategic | 3-18 months | Historical sales trends, macroeconomic indicators, competitive landscape | Moderate: under 5% missing values, monthly granularity, monthly refresh | Monthly by product category by market |
| Long-range planning | 1-5 years | Annual trends, market research, demographic shifts, technology adoption curves | Tolerant: under 10% missing values, quarterly/annual granularity | Quarterly by product family by segment |
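These tolerances can be encoded as configuration that a validation step checks each extract against before training. A minimal sketch in Python, with thresholds mirroring the table above; the key names, latency values, and the simple missing-value gate are illustrative assumptions rather than a prescribed format:

```python
# Illustrative sketch: horizon-specific quality tolerances as a config the pipeline
# can check extracts against. Key names and the gate logic are assumptions.
QUALITY_REQUIREMENTS = {
    "operational": {"max_missing_pct": 1.0,  "granularity": "D", "max_latency_hours": 24},
    "tactical":    {"max_missing_pct": 3.0,  "granularity": "W", "max_latency_hours": 7 * 24},
    "strategic":   {"max_missing_pct": 5.0,  "granularity": "M", "max_latency_hours": 30 * 24},
    "long_range":  {"max_missing_pct": 10.0, "granularity": "Q", "max_latency_hours": 90 * 24},
}

def passes_quality_gate(df, horizon: str) -> bool:
    # Share of missing cells across the frame, compared to the horizon's tolerance
    missing_pct = 100 * float(df.isna().mean().mean())
    return missing_pct <= QUALITY_REQUIREMENTS[horizon]["max_missing_pct"]
```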
Data Freshness Requirements
The value of demand forecasting data decays at different rates depending on the source:
| Data Source | Freshness Requirement | Impact of Stale Data |
|---|---|---|
| POS / sales orders | Same-day to next-day | Replenishment orders based on yesterday's demand miss trend shifts |
| Inventory positions | Same-day | Overstocking or stockouts from incorrect available-to-promise |
| Promotional calendar | 2-4 weeks lead time | Forecast misses demand spikes from unaccounted promotions |
| Supplier lead times | Monthly update | Safety stock calculations based on outdated lead times |
| Weather | 1-3 day forecast window | Fails to capture weather-driven demand (seasonal goods, HVAC, beverages) |
| Macroeconomic | Monthly to quarterly | Strategic forecasts miss economic turning points |
Pipeline Stages for Demand Forecasting Data
Stage 1: Ingestion and Schema Validation
The ingestion stage pulls data from source systems and validates that each extract conforms to expected schemas. Schema drift — when a source system changes column names, date formats, or adds/removes fields during an upgrade — is a common failure mode in production forecasting pipelines.
| Validation Check | What It Catches | Action on Failure |
|---|---|---|
| Column presence | Missing expected fields after system upgrade | Reject file, alert data engineering |
| Data type validation | Strings in numeric fields, dates in wrong format | Reject affected rows, log for correction |
| Row count threshold | Unexpectedly empty or truncated extracts | Reject if count falls below 80% of historical average |
| Date range validation | Gaps or future-dated records in historical data | Flag gaps for interpolation, reject future dates |
| Currency/unit consistency | Mixed currencies or unit-of-measure changes mid-file | Normalize to base currency/unit using conversion tables |
Ertas Data Suite handles multi-format ingestion through dedicated parser nodes for CSV, Excel, PDF, and other common enterprise export formats. The Format Normalizer node standardizes date formats, currency representations, and unit-of-measure encoding across sources. Each validation step is visible on the pipeline canvas, making it immediately clear when a source file deviates from expectations.
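For teams wiring up ingestion in code rather than on a canvas, the same checks translate into a few lines of validation logic. Below is a minimal sketch for a daily sales extract; the column names, the 80% row-count threshold, and the quarantine file path are illustrative assumptions:

```python
# Minimal sketch of ingestion-time schema validation for a daily sales extract.
# Column names, thresholds, and file paths are illustrative assumptions.
import pandas as pd

EXPECTED_COLUMNS = {"sku", "order_date", "quantity", "net_price", "channel"}
MIN_ROW_FRACTION = 0.8  # reject extracts below 80% of the historical average row count

def validate_sales_extract(path: str, historical_avg_rows: int) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Column presence: catches schema drift after a source-system upgrade
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Rejecting {path}: missing columns {missing}")

    # Row count threshold: catches truncated or empty extracts
    if len(df) < MIN_ROW_FRACTION * historical_avg_rows:
        raise ValueError(f"Rejecting {path}: {len(df)} rows, expected ~{historical_avg_rows}")

    # Data type validation: coerce, then quarantine failing rows for later correction
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    bad_rows = df[df["order_date"].isna() | df["quantity"].isna()]
    bad_rows.to_csv("quarantined_rows.csv", index=False)
    df = df.drop(bad_rows.index)

    # Date range validation: drop future-dated records from historical data
    return df[df["order_date"] <= pd.Timestamp.today().normalize()]
```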
Stage 2: Deduplication and Entity Resolution
Enterprise systems frequently produce duplicate records through integration errors, batch re-runs, and multi-system order capture. A single customer order might appear in both the ERP and the POS system with slightly different fields.
Entity resolution across systems is equally important. The same product might be identified as SKU "A1234" in the ERP, "1234-A" in the WMS, and "Product A Regular 12oz" in the POS system. Without a unified product master, the forecasting model treats these as three different products with separate demand histories.
Key deduplication and resolution tasks:
- Order deduplication: Match on order ID, date, and amount to remove duplicates from multi-system extracts
- Product harmonization: Map all product identifiers to a single canonical SKU using a cross-reference table
- Location harmonization: Map warehouse codes, store numbers, and region identifiers to a consistent hierarchy
- Customer deduplication: Match customer records across CRM and order systems (relevant for B2B forecasting)
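A minimal sketch of what order deduplication and product harmonization against a cross-reference table might look like; the tiny xref table, the column names, and the output path are illustrative assumptions:

```python
# Minimal sketch of order deduplication and product harmonization.
# The cross-reference table, column names, and file path are illustrative assumptions.
import pandas as pd

# Cross-reference table mapping each system's product identifier to one canonical SKU
xref = pd.DataFrame({
    "source_system": ["ERP", "WMS", "POS"],
    "source_id":     ["A1234", "1234-A", "Product A Regular 12oz"],
    "canonical_sku": ["A1234", "A1234", "A1234"],
})

def harmonize_and_dedupe(orders: pd.DataFrame) -> pd.DataFrame:
    # Product harmonization: map every source identifier onto the canonical SKU
    merged = orders.merge(
        xref,
        left_on=["source_system", "product_id"],
        right_on=["source_system", "source_id"],
        how="left",
    )

    # Identifiers with no mapping go back to the product-master owners, not into the model
    unmapped = merged["canonical_sku"].isna()
    if unmapped.any():
        merged[unmapped].to_csv("unmapped_products.csv", index=False)
        merged = merged[~unmapped]

    # Order deduplication: the same order captured by two systems matches on ID, date, amount
    return merged.drop_duplicates(subset=["order_id", "order_date", "order_amount"])
```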
Stage 3: Missing Value Treatment
Missing data in demand forecasting requires domain-aware imputation, not generic statistical filling. A zero-sales day might mean no demand (product is seasonal) or might mean the product was out of stock (demand existed but was not captured). The imputation strategy must distinguish between these scenarios.
| Missing Data Scenario | Detection Method | Imputation Strategy |
|---|---|---|
| True zero demand | Product in stock, no sales recorded | Record as zero — do not impute |
| Stockout (censored demand) | Inventory position at zero on the date in question | Impute using demand from comparable in-stock periods |
| System downtime | All products show zero sales for a location/date | Impute using prior-period same-day-of-week averages |
| New product (no history) | Product launch date is after the training window start | Use analogous product demand, adjust by launch curve assumptions |
| Discontinued product | Product end-of-life date is within the training window | Truncate history at EOL; do not impute post-EOL zeros |
Stockout-induced censored demand is the most critical scenario to handle correctly. If the model trains on observed sales that include stockout zeros, it learns that demand drops to zero periodically — and will forecast low demand precisely when the product is most likely to stock out again, creating a self-reinforcing cycle.
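One way to handle this, sketched below under the assumption of a daily per-SKU frame with `date`, `units_sold`, and `on_hand` columns, is to flag stockout days explicitly and impute them from comparable in-stock periods, here using same-day-of-week averages:

```python
# Minimal sketch of stockout-aware demand imputation for one SKU/location series.
# Column names and the day-of-week averaging rule are illustrative assumptions.
import pandas as pd

def impute_censored_demand(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("date").copy()
    df["day_of_week"] = pd.to_datetime(df["date"]).dt.dayofweek

    # Censored demand: zero sales recorded while inventory was at zero
    df["stockout"] = (df["units_sold"] == 0) & (df["on_hand"] == 0)

    # Reference demand: average sales on in-stock days, by day of week
    dow_mean = df.loc[~df["stockout"]].groupby("day_of_week")["units_sold"].mean()

    # True zeros (in stock, no sales) stay zero; only stockout days are imputed
    df["demand"] = df["units_sold"].where(~df["stockout"], df["day_of_week"].map(dow_mean))
    return df
```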
Stage 4: Feature Engineering
Raw historical data needs transformation into features that capture demand patterns. Feature engineering for demand forecasting falls into several categories:
| Feature Category | Examples | Engineering Method |
|---|---|---|
| Lag features | Sales 1 day ago, 7 days ago, 28 days ago, 364 days ago | Time-shifted values from historical sales |
| Rolling statistics | 7-day moving average, 28-day rolling std dev, 13-week trend | Window-based aggregations |
| Calendar features | Day of week, month, quarter, holiday flag, pre/post-holiday | Date decomposition + holiday calendar lookup |
| Promotional features | Active promotion flag, discount depth, days since last promo, days until next promo | Join with promotional calendar |
| Price features | Current price, price change flag, price relative to 90-day average | Join with pricing history, compute derived metrics |
| External signals | Temperature forecast, CPI change, competitor price index | Join by date and region from external sources |
| Inventory features | Days of supply, stockout flag, weeks of cover | Compute from inventory positions and demand rate |
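A minimal sketch of the lag, rolling, calendar, and price features from the table, computed for a single SKU series; the column names, window lengths, and holiday list are illustrative assumptions:

```python
# Minimal sketch of lag, rolling, calendar, and price features for one SKU series.
# Column names, window lengths, and the holiday list are illustrative assumptions.
import pandas as pd

def engineer_features(df: pd.DataFrame, holidays: list) -> pd.DataFrame:
    df = df.sort_values("date").copy()
    df["date"] = pd.to_datetime(df["date"])

    # Lag features: demand 1, 7, 28, and 364 days ago
    for lag in (1, 7, 28, 364):
        df[f"demand_lag_{lag}"] = df["demand"].shift(lag)

    # Rolling statistics over trailing windows, shifted so the current day never leaks in
    df["demand_ma_7"] = df["demand"].shift(1).rolling(7).mean()
    df["demand_std_28"] = df["demand"].shift(1).rolling(28).std()

    # Calendar features
    df["day_of_week"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    df["is_holiday"] = df["date"].isin(pd.to_datetime(holidays)).astype(int)

    # Price features: change flag and price relative to the trailing 90-day average
    df["price_change"] = (df["net_price"] != df["net_price"].shift(1)).astype(int)
    df["price_vs_90d_avg"] = df["net_price"] / df["net_price"].shift(1).rolling(90).mean()
    return df
```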
Stage 5: Temporal Alignment and Aggregation
Different data sources operate on different time grains. POS data is daily. Macroeconomic data is monthly. Supplier lead times update quarterly. The pipeline must aggregate or disaggregate all sources to the target forecasting granularity.
Aggregation rules by data type:
| Data Type | Aggregation Method | Disaggregation Method |
|---|---|---|
| Sales/demand volume | Sum to higher grain | Distribute proportionally using historical daily patterns |
| Prices | Average (or end-of-period) to higher grain | Forward-fill last known price to daily grain |
| Inventory | End-of-period snapshot to higher grain | Linear interpolation between snapshots |
| Binary flags (promo, holiday) | Any-true to higher grain | Apply flag to all days within the original event period |
| Weather | Average for temperature; sum for precipitation | Daily values are already the native grain |
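A minimal sketch of aligning a daily sales series and a monthly CPI series to a weekly forecasting grain using the rules above; the column names, the single-SKU assumption, and the Monday week anchor are illustrative:

```python
# Minimal sketch of aligning a daily sales series and a monthly CPI series to a weekly grain.
# Assumes one SKU/location per frame and a 'date' column in each; names are illustrative.
import pandas as pd

def align_to_weekly(sales_daily: pd.DataFrame, cpi_monthly: pd.DataFrame) -> pd.DataFrame:
    sales_daily = sales_daily.assign(date=pd.to_datetime(sales_daily["date"]))
    cpi_monthly = cpi_monthly.assign(date=pd.to_datetime(cpi_monthly["date"]))

    # Aggregate up: sum demand volume, any-true for promo flags, average prices
    weekly = (
        sales_daily.set_index("date")
        .resample("W-MON")
        .agg({"demand": "sum", "promo_flag": "max", "net_price": "mean"})
    )

    # Disaggregate down: forward-fill the last known monthly CPI onto the weekly grain
    cpi_weekly = cpi_monthly.set_index("date")["cpi"].resample("W-MON").ffill()

    return weekly.join(cpi_weekly, how="left")
```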
Stage 6: Validation and Export
Before the training dataset reaches the model, comprehensive validation catches issues that would otherwise surface as unexplained forecast errors.
| Validation Rule | What It Catches | Threshold |
|---|---|---|
| Demand stationarity test | Structural breaks or level shifts requiring separate model treatment | ADF test p-value; flag if non-stationary without differencing |
| Feature correlation | Multicollinear features that destabilize model coefficients | Drop one of any pair with correlation above 0.95 |
| Target leakage | Features that contain future information | Verify all features use only data available at forecast origin |
| Class balance (for classification) | Extreme imbalance in demand categories (if classifying demand levels) | Flag if minority class falls below 5% |
| Temporal completeness | Gaps in the time series after all processing | Zero tolerance for gaps in operational forecasting data |
Export the validated dataset in the target model format, typically CSV or Parquet for tabular models, with train/validation/test splits based on time (never random) to prevent temporal leakage.
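A minimal sketch of the time-based split and the correlation check from the validation table, assuming a datetime `date` column and numeric feature columns; the cut-off dates and the 0.95 threshold are illustrative:

```python
# Minimal sketch of a time-based train/validation/test split and a collinearity check.
# Assumes a datetime 'date' column; cut-off dates and the 0.95 threshold are illustrative.
import numpy as np
import pandas as pd

def temporal_split(df: pd.DataFrame, val_start: str, test_start: str):
    # Splits are by date, never random, so no future information reaches training
    df = df.sort_values("date")
    train = df[df["date"] < pd.Timestamp(val_start)]
    val = df[(df["date"] >= pd.Timestamp(val_start)) & (df["date"] < pd.Timestamp(test_start))]
    test = df[df["date"] >= pd.Timestamp(test_start)]
    return train, val, test

def collinear_features(train: pd.DataFrame, threshold: float = 0.95) -> list:
    # Flag one feature from each pair whose absolute correlation exceeds the threshold
    corr = train.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Usage sketch (dates and file names are placeholders):
# train, val, test = temporal_split(features, "2024-01-01", "2024-07-01")
# features = features.drop(columns=collinear_features(train))
# train.to_parquet("train.parquet")
```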
Why On-Premise Matters for Supply Chain Data
Supply chain data is competitively sensitive. Historical demand by SKU reveals product performance. Supplier lead times expose procurement relationships. Pricing history shows margin structure. Inventory positions indicate operational efficiency. This data, in aggregate, provides a comprehensive view of business operations that companies protect rigorously.
Beyond confidentiality, many enterprises have data governance policies that prohibit sending transactional data to external cloud services without extensive security review. For global supply chains, data residency requirements may restrict where data can be processed geographically.
Ertas Data Suite operates entirely on-premise as a native desktop application. Supply chain data never leaves the enterprise network. Every transformation node logs its operations, producing an audit trail that satisfies internal data governance reviews and external compliance requirements. The visual pipeline canvas gives supply chain analysts — who understand the business logic but may not write Python — direct visibility into how their data is being prepared for AI models.
Key Takeaways
Demand forecasting AI is a data integration problem before it is a modeling problem. The pipeline must handle multi-source ingestion, entity resolution, domain-aware imputation (especially for censored demand during stockouts), temporal alignment, and rigorous validation against leakage and stationarity assumptions.
Teams that build observable, reproducible data pipelines ship forecasting models that improve over time as data quality improves. Teams that duct-tape data preparation together with scripts spend their time debugging forecast errors that trace back to data issues they cannot see.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.