
From Shadow AI to Sanctioned AI: The Enterprise Migration Playbook
The complete journey from 'employees are using ChatGPT with company data' to 'we have sanctioned, auditable, on-premise AI tools.' A phased playbook with timelines, resource estimates, and ROI calculations.
Your employees are using ChatGPT with company data. You know it. They know you know it. And yet the problem persists because knowing about shadow AI and fixing it are two different things.
This is the migration playbook. It covers the complete journey from uncontrolled external AI usage to sanctioned, auditable, on-premise AI tools — broken into five phases with specific timelines, resource requirements, and decision points. It's designed for organizations that have moved past the "should we do something?" stage and into the "how do we actually do this?" stage.
The total timeline is 24–36 weeks. The total cost ranges from $50,000 to $200,000 depending on scale. The alternative — doing nothing — costs an average of $19.5 million per year in shadow AI-related insider risk losses.
The Migration Timeline at a Glance
| Phase | Timeline | Focus | Key Deliverable |
|---|---|---|---|
| Phase 1: Discovery | Weeks 1–4 | Audit, quantify, prioritize | Shadow AI assessment report |
| Phase 2: Quick Wins | Weeks 5–12 | Deploy basic internal alternative | Internal AI chatbot live |
| Phase 3: Data Foundation | Weeks 9–24 | Build data preparation pipeline | Enterprise data ready for AI |
| Phase 4: Custom Models | Weeks 17–32 | Fine-tune domain-specific models | Production custom models |
| Phase 5: Governance | Week 12+ (ongoing) | Monitoring, policy, audits | Mature AI governance program |
Note: the phases overlap by design. Data preparation starts while the Phase 2 chatbot is still rolling out, governance starts long before the custom models ship, and you don't wait for perfect data to begin fine-tuning.
Phase 1: Discovery (Weeks 1–4)
Objective: Understand what shadow AI exists in your organization, quantify the risk, and identify the highest-value use cases that employees are solving with external tools.
Resources: 1 security analyst, 1 IT operations lead, executive sponsor. Budget: $5,000–$15,000 (monitoring tools and analysis time).
Step 1.1: Audit Current Usage
Deploy network monitoring to identify traffic to known AI tool domains. The major targets:
- LLM providers: openai.com, anthropic.com, gemini.google.com, chat.mistral.ai
- Code assistants: copilot.github.com, cursor.sh, codeium.com
- Embedded AI features: notion.so/ai, docs.google.com (Gemini features), bing.com/chat
- AI aggregators: poe.com, huggingface.co, together.ai
Most enterprise firewalls and proxy servers can generate domain-level traffic reports without new tooling. You're looking for four signals (a log-parsing sketch follows the list):
- Number of unique employees accessing AI tools
- Volume of data transmitted (outbound request sizes)
- Frequency of usage (daily, weekly, one-time)
- Departments with highest usage
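If your proxy only produces raw logs, a few lines of scripting get you these numbers. A minimal sketch, assuming a CSV export with user, domain, bytes_out, and date columns (adjust the field names to whatever your firewall or secure web gateway actually emits):

```python
# Aggregate AI-tool traffic from a hypothetical proxy log export (CSV).
# Column names are assumptions about your export format -- adjust as needed.
import csv
from collections import Counter, defaultdict

AI_DOMAINS = {"openai.com", "anthropic.com", "gemini.google.com",
              "chat.mistral.ai", "poe.com", "huggingface.co"}

users = set()
bytes_out = Counter()           # outbound volume per AI domain
days_active = defaultdict(set)  # usage frequency per user

with open("proxy_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        domain = row["domain"].lower()
        if any(domain.endswith(d) for d in AI_DOMAINS):
            users.add(row["user"])
            bytes_out[domain] += int(row["bytes_out"])
            days_active[row["user"]].add(row["date"])

print(f"Unique employees using AI tools: {len(users)}")
print(f"Top domains by outbound volume: {bytes_out.most_common(5)}")
if days_active:
    avg_days = sum(len(d) for d in days_active.values()) / len(days_active)
    print(f"Average active days per user: {avg_days:.1f}")
```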
Step 1.2: Quantify Risk
Translate usage data into risk metrics:
| Risk Factor | How to Measure | Benchmark |
|---|---|---|
| Exposure breadth | % of employees using external AI | Industry average: 77% |
| Data sensitivity | Sample outbound prompts for PII/PHI/IP | Industry average: 1.6% contain policy violations |
| Account visibility | % using corporate vs. personal accounts | Industry average: 82% personal accounts |
| Tool diversity | Number of distinct AI tools in use | Typical enterprise: 15–40 distinct tools |
| Volume | Average prompts per user per day | Typical: 8–12 per active user |
Calculate your estimated annual violation count: (active users) × (prompts/day) × (1.6%) × (220 working days). For a 1,000-person company with 60% active AI users: 600 × 10 × 0.016 × 220 = 21,120 estimated annual violations.
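The same formula is easy to script for per-department sensitivity analysis. A minimal sketch (the function name and defaults are ours, not a standard tool):

```python
# Reproduces the worked violation estimate above; vary the inputs per department.
def estimated_annual_violations(active_users: int,
                                prompts_per_day: float,
                                violation_rate: float = 0.016,
                                working_days: int = 220) -> float:
    """(active users) x (prompts/day) x (violation rate) x (working days)."""
    return active_users * prompts_per_day * violation_rate * working_days

print(estimated_annual_violations(600, 10))  # -> 21120.0, matching the example above
```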
Step 1.3: Identify High-Value Use Cases
This is the most important discovery step and the one most organizations skip. Survey employees (anonymously) to understand what they're using AI for:
- What tasks do you use AI tools for?
- How much time does it save you per week?
- What data do you typically provide to the AI tool?
- If we provided an internal AI tool, what capabilities would it need?
Common findings:
| Use Case | Typical Departments | Time Saved | Data Sensitivity |
|---|---|---|---|
| Writing and editing | All | 3–5 hrs/week | Low to Medium |
| Code generation and debugging | Engineering | 5–10 hrs/week | High |
| Document summarization | Legal, Finance, Ops | 2–4 hrs/week | High |
| Data analysis | Finance, Marketing, Ops | 3–6 hrs/week | Medium to High |
| Research | All | 2–3 hrs/week | Low |
| Email drafting | All | 1–2 hrs/week | Low to Medium |
The discovery phase output is a Shadow AI Assessment Report that includes: usage metrics, risk quantification, top use cases by department, and a prioritized list of capabilities the internal platform must support.
Phase 2: Quick Wins (Weeks 5–12)
Objective: Deploy a basic internal AI chatbot that gives employees an immediate alternative to external tools. This reduces data leakage while you build the full solution.
Resources: 1 ML/DevOps engineer, 1 system administrator. Budget: $10,000–$30,000 (hardware + setup).
Step 2.1: Deploy Ollama + Open WebUI
The fastest path to a functional internal AI chatbot:
Hardware requirements (minimum):
- 1 server with an NVIDIA GPU (RTX 4090 for small teams, A100 for 100+ users)
- 32GB+ RAM, 500GB+ SSD storage
- Internal network connectivity (no internet access required for inference)
Software stack:
- Ollama for model serving
- Open WebUI for the user interface
- NGINX for load balancing (if multiple GPUs)
- LDAP/SSO integration for authentication
Model selection for Phase 2:
| Model | Size | Good For | Limitations |
|---|---|---|---|
| Llama 3.3 70B | 40GB VRAM | General tasks, writing, analysis | Slower on consumer GPUs |
| Qwen 2.5 32B | 20GB VRAM | Code, multilingual, analysis | Less conversational polish |
| Mistral Small 24B | 14GB VRAM | Fast general usage | Less capable on complex reasoning |
| DeepSeek-R1 Distill 32B | 20GB VRAM | Reasoning, math, analysis | Slower (chain-of-thought) |
Start with one general-purpose model (Llama 3.3 or Qwen 2.5) and expand based on employee feedback.
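Once a model is pulled, a one-prompt smoke test confirms the stack is serving before you announce anything. A sketch against Ollama's local REST API, assuming the default port and that llama3.3 has already been pulled (verify the endpoint schema against your installed Ollama version):

```python
# One-prompt smoke test against Ollama's documented /api/chat endpoint.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.3",
        "messages": [{"role": "user",
                      "content": "Summarize the benefits of an internal AI platform in one sentence."}],
        "stream": False,  # return a single JSON response instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```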
Step 2.2: Announce and Migrate
The announcement matters as much as the technology. Frame it as:
- "We're providing a better tool" — not "we're blocking the tools you like"
- Demonstrate that the internal tool handles the top use cases identified in Phase 1
- Provide migration guides: "If you were using ChatGPT for X, here's how to do X with the internal tool"
- Offer drop-in training sessions by department
Step 2.3: Measure Adoption
Track weekly:
- Internal platform: unique users, prompts per day, satisfaction scores
- External AI: traffic volume (should be declining)
- Support tickets: what's missing, what's not working
Expected result by end of Phase 2: 40–60% reduction in external AI tool usage. The remaining usage will be for capabilities the basic platform doesn't yet provide (code assistance, document upload, specialized tasks). That's what Phases 3 and 4 address.
Phase 3: Data Foundation (Weeks 9–24)
Objective: Build the data preparation pipeline that transforms your enterprise knowledge into training data for custom AI models. This is the foundation for Phase 4.
Resources: 1–2 data engineers, 1 domain expert (part-time per department). Budget: $15,000–$50,000 (tooling + engineering time).
Why This Phase Exists
Generic open-source models (deployed in Phase 2) are good at general tasks. They know nothing about your specific products, processes, terminology, customers, or domain. For an internal AI platform to outperform ChatGPT for your employees, it needs to know your business.
That knowledge comes from your enterprise data. But enterprise data is messy: scattered across file shares, wikis, Slack channels, email archives, databases, and document management systems. Before you can fine-tune a model, you need to extract, clean, structure, and validate that data.
Step 3.1: Identify Data Sources
Map the knowledge repositories across your organization:
| Source Type | Examples | Typical Volume | Extraction Complexity |
|---|---|---|---|
| Documents | PDFs, Word files, presentations | 10K–1M files | Medium (OCR, layout parsing) |
| Knowledge bases | Confluence, Notion, SharePoint | 1K–100K pages | Low (API extraction) |
| Communications | Slack, Teams, email archives | High volume | High (noise filtering, privacy) |
| Databases | CRM, ERP, ticketing systems | Structured data | Low (SQL queries) |
| Code repositories | Git repos, documentation | Varies | Low (file system access) |
| Specialized systems | EMR (healthcare), case management (legal) | Varies | High (proprietary formats) |
Step 3.2: Extract and Process
Build an extraction pipeline that handles your specific data sources. The typical pipeline:
- Extract: Pull raw content from source systems. For documents, this means OCR and layout parsing (tools like Docling, Unstructured.io, or Apache Tika). For structured data, this means SQL queries and API calls.
- Clean: Remove duplicates, boilerplate, headers/footers, navigation elements, and other noise. For communications data, filter out small talk, social messages, and non-work content.
- Chunk: Break documents into semantically meaningful chunks (paragraphs, sections, Q&A pairs). Chunk size depends on the intended use: RAG retrieval works best with 200–500 token chunks; fine-tuning works best with complete conversation or document examples. (A chunking sketch follows this list.)
- Structure: Convert cleaned content into training format. For fine-tuning: instruction/response pairs. For RAG: indexed document chunks with metadata.
- Validate: Human review of a sample (5–10%) to verify quality, accuracy, and absence of sensitive data that shouldn't be in the training set.
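As referenced in the Chunk step, here is a deliberately naive chunker that approximates token counts with word counts. It is illustrative only; production pipelines should use a real tokenizer and the structure-aware output of tools like Docling or Unstructured:

```python
# Naive overlapping chunker for RAG-sized chunks (~200-500 tokens).
# Word count stands in for token count; swap in a real tokenizer for production.
def chunk_text(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    words = text.split()
    step = max_words - overlap  # slide forward, keeping some context overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + max_words]
        if chunk:
            chunks.append(" ".join(chunk))
    return chunks

# "extracted_policy.txt" is a placeholder for output of the Extract/Clean steps.
doc = open("extracted_policy.txt").read()
for i, c in enumerate(chunk_text(doc)):
    print(i, len(c.split()), c[:60])
```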
Step 3.3: Build the Data Quality Pipeline
Data quality is the single biggest determinant of fine-tuned model performance. Bad data in, bad model out. Establish quality checks:
- Accuracy: Is the information in the training data correct and current?
- Relevance: Does the data represent the knowledge employees actually need?
- Completeness: Are there gaps in topic coverage?
- Consistency: Does the data contain contradictions?
- Privacy: Has all PII/PHI been removed or appropriately handled?
Budget 40–60% of Phase 3 time for data quality. Teams that rush data preparation to get to fine-tuning consistently end up with worse models than teams that invest more time in data quality and less in model tuning.
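For the privacy check in particular, even a crude automated gate catches obvious leaks before human review. An illustrative sketch (real deployments should use a dedicated detector such as Microsoft Presidio; these regexes are intentionally simple and will miss plenty):

```python
# Toy privacy gate: flag training examples containing obvious PII patterns.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pii_flags(text: str) -> list[str]:
    """Return the names of all PII patterns found in a training example."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(pii_flags(sample))  # ['email', 'us_ssn']
```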
Step 3.4: Establish an Ongoing Data Pipeline
Enterprise knowledge changes constantly. New products launch, procedures update, regulations change. The data pipeline must be continuous, not one-time:
- Scheduled extraction from source systems (weekly or monthly)
- Automated quality checks
- Human review queue for flagged content
- Version control for training datasets
- Documentation of data lineage (where each piece of training data came from)
Phase 4: Custom Models (Weeks 17–32)
Objective: Fine-tune domain-specific models on your enterprise data that outperform generic models for your specific use cases.
Resources: 1 ML engineer, domain experts (part-time). Budget: $15,000–$75,000 (compute + tooling).
Why Fine-Tune?
A common question: "Why not just use RAG (retrieval-augmented generation) with the base model and skip fine-tuning?"
RAG and fine-tuning solve different problems:
| Capability | RAG Alone | Fine-Tuning Alone | Combined |
|---|---|---|---|
| Access to current information | Yes | No (static at training time) | Yes |
| Domain-specific terminology and style | Partially | Yes | Yes |
| Following organizational processes | No | Yes | Yes |
| Reducing hallucination on domain topics | Partially | Yes | Yes |
| Handling novel questions | Yes | Limited to training distribution | Yes |
The best enterprise AI systems use both: fine-tuned models for domain expertise and style, plus RAG for current information and source citations. Phase 4 covers fine-tuning; RAG can be layered on during or after this phase.
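To make the combination concrete, here is a minimal RAG sketch layered on the Phase 2 Ollama deployment: embed your Phase 3 chunks, retrieve the best match, and prepend it to the prompt. Endpoint names follow Ollama's documented API (verify against your installed version), and an embedding model such as nomic-embed-text must be pulled first:

```python
# Minimal RAG loop over a fine-tuned or base model served by Ollama.
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}, timeout=60)
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Placeholder chunks standing in for the indexed Phase 3 output.
chunks = ["Travel expenses over $500 require VP approval.",
          "Quarterly reviews happen in the first week of each quarter."]
index = [(c, embed(c)) for c in chunks]  # embed once, reuse per query

question = "Who approves a $900 flight?"
q_vec = embed(question)
best = max(index, key=lambda item: cosine(q_vec, item[1]))[0]  # top-1 retrieval

r = requests.post(f"{OLLAMA}/api/chat", json={
    "model": "llama3.3",
    "messages": [{"role": "user",
                  "content": f"Answer using this context:\n{best}\n\nQuestion: {question}"}],
    "stream": False,
}, timeout=120)
print(r.json()["message"]["content"])
```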
Step 4.1: Select Base Models for Fine-Tuning
Choose base models based on your primary use cases:
| Use Case | Recommended Base Model | Fine-Tuning Method | Typical Training Time |
|---|---|---|---|
| General enterprise assistant | Llama 3.3 70B | QLoRA | 4–8 hours on 1× A100 |
| Code assistance | Qwen 2.5 Coder 32B | QLoRA | 2–4 hours on 1× A100 |
| Document analysis | Llama 3.1 8B | Full fine-tune or LoRA | 1–2 hours on 1× A100 |
| Specialized domain (legal, medical) | Llama 3.3 70B | QLoRA | 4–8 hours on 1× A100 |
QLoRA (Quantized Low-Rank Adaptation) is the standard method for enterprise fine-tuning: it requires less GPU memory than full fine-tuning while achieving comparable results for most use cases.
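A minimal QLoRA sketch using the Hugging Face stack (transformers, peft, trl). Argument names shift between library versions, and the dataset path and hyperparameters here are placeholders, so treat this as a starting shape rather than a pinned recipe:

```python
# Minimal QLoRA sketch with Hugging Face transformers + peft + trl.
# "train.jsonl" is a placeholder for the Phase 3 output, one example per line
# in a format trl accepts (e.g. a "text" or "messages" field).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # gated: requires license acceptance

bnb = BitsAndBytesConfig(              # 4-bit quantization: the "Q" in QLoRA
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(                     # low-rank adapters: the "LoRA"
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="ft-out", num_train_epochs=2,
                   per_device_train_batch_size=1, gradient_accumulation_steps=8),
)
trainer.train()
trainer.save_model("ft-out/final")     # adapter weights to merge or serve
```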
Step 4.2: Fine-Tune and Evaluate
The fine-tuning cycle:
- Prepare training data: Convert Phase 3 outputs into the model's expected format (typically instruction/response pairs in JSONL)
- Configure training: Set hyperparameters (learning rate, epochs, LoRA rank). Start with established defaults and adjust based on evaluation results.
- Train: Run the fine-tuning job. Monitor loss curves for convergence.
- Evaluate: Test the fine-tuned model against a held-out evaluation set. Measure:
  - Accuracy: Does the model give correct, domain-appropriate answers?
  - Style: Does it match your organization's tone and terminology?
  - Safety: Does it refuse to provide information it shouldn't?
  - Comparison: Side-by-side evaluation against the base model and against ChatGPT/Claude for the same prompts
- Iterate: If evaluation results are below target, diagnose the issue (usually data quality) and retrain.
Step 4.3: Deploy to Production
Replace or augment the Phase 2 base models with fine-tuned models:
- Deploy fine-tuned models alongside base models (let employees choose)
- A/B test: route 50% of requests to the fine-tuned model and measure satisfaction (a routing sketch follows this list)
- Collect feedback: thumbs up/down on responses, with optional written feedback
- Plan for regular retraining (quarterly or when significant new data is available)
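For the A/B split, deterministic hashing keeps each user in a stable bucket without storing assignments anywhere. A sketch (the experiment name and bucket labels are placeholders):

```python
# Deterministic 50/50 assignment: the same user always lands in the same
# bucket, so satisfaction scores can be compared cleanly across arms.
import hashlib

def ab_bucket(user_id: str, experiment: str = "ft-vs-base") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "fine-tuned" if int(digest, 16) % 2 == 0 else "base"

print(ab_bucket("jdoe"))  # stable across calls and across servers
```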
Expected result by end of Phase 4: Internal AI platform handles 80–90% of the use cases employees were previously solving with external tools, with equal or better quality for domain-specific tasks. External AI tool usage drops to under 10% — mostly edge cases and personal preference for general research.
Phase 5: Governance (Week 12+, Ongoing)
Objective: Establish the monitoring, policy, and audit infrastructure for long-term AI governance.
Resources: 1 security analyst (part-time), AI Governance Committee (quarterly meetings). Budget: $5,000–$30,000/year (tooling + committee time).
Phase 5 starts during Phase 2 (not after Phase 4) because governance can't wait for the full platform to be ready.
Step 5.1: Policy Framework
Deploy an AI acceptable use policy that covers:
- Approved tools and the process for requesting new ones
- Data classification for AI usage (what data can go where)
- Acceptable use guidelines per department
- Monitoring and enforcement provisions
- Incident response procedures
- Training requirements
See our Shadow AI Policy Template for Regulated Industries for a complete, adaptable policy document.
Step 5.2: Monitoring Infrastructure
Deploy monitoring that provides the following (a toy anomaly-detection sketch comes after the list):
- Usage dashboards: Who's using the internal platform, for what, and how often
- External AI detection: Continued monitoring of traffic to external AI tools
- Data leakage detection: Automated scanning for PII, PHI, and classified data in prompts
- Anomaly detection: Unusual usage patterns (volume spikes, off-hours access, bulk data submission)
- Audit logs: Complete record of all AI interactions for compliance and incident investigation
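Anomaly detection doesn't have to start with heavyweight UEBA tooling. A toy sketch that flags users whose daily prompt volume spikes against their own baseline (the data shape and z-score threshold are assumptions, not a product recommendation):

```python
# Flag users whose latest daily prompt count is a volume-spike outlier
# relative to their own recent history.
import statistics

def volume_anomalies(daily_counts: dict[str, list[int]],
                     z_threshold: float = 3.0) -> list[str]:
    """daily_counts maps user -> prompts per day; last entry is 'today'."""
    flagged = []
    for user, counts in daily_counts.items():
        if len(counts) < 8:
            continue  # need a baseline week before judging anyone
        history, today = counts[:-1], counts[-1]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0  # avoid divide-by-zero
        if (today - mean) / stdev > z_threshold:
            flagged.append(user)
    return flagged

print(volume_anomalies({"alice": [12, 9, 11, 10, 13, 9, 10, 95]}))  # ['alice']
```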
Step 5.3: Regular Audits
| Audit | Frequency | Focus |
|---|---|---|
| Usage compliance | Monthly | Are employees using approved tools? Is external usage declining? |
| Data classification adherence | Quarterly | Are prompts consistent with data tier policies? |
| Model performance | Quarterly | Are fine-tuned models meeting accuracy and quality targets? |
| Policy effectiveness | Semi-annually | Is the policy being followed? What needs updating? |
| Regulatory alignment | Semi-annually | Have regulations changed? Does the policy need updating? |
Step 5.4: Continuous Improvement
The internal AI platform must keep pace with external tools. If ChatGPT releases a capability your employees need and your internal platform doesn't have it, external usage will increase. Budget for:
- Monthly model updates (new base model releases, retraining on fresh data)
- Quarterly capability additions (new features, new use cases, new departments)
- Annual infrastructure scaling (more GPUs, more storage, better performance)
The ROI Case
The numbers that justify this investment:
Cost of Doing Nothing
| Risk Category | Estimated Annual Cost |
|---|---|
| Shadow AI insider risk losses (industry average) | $19.5M |
| Regulatory fines (per incident, GDPR) | $100K–$20M |
| Data breach investigation and response | $500K–$5M per incident |
| IP theft and competitive exposure | Unquantifiable but significant |
| Conservative estimate (one moderate incident/year) | $2M–$10M |
Even if you discount the $19.5M industry average as inflated for your organization and assume just one moderate data leakage incident per year, the exposure is $2–10 million annually.
Cost of the Migration
| Phase | Budget Range | Typical |
|---|---|---|
| Phase 1: Discovery | $5K–$15K | $10K |
| Phase 2: Quick Wins | $10K–$30K | $20K |
| Phase 3: Data Foundation | $15K–$50K | $30K |
| Phase 4: Custom Models | $15K–$75K | $40K |
| Phase 5: Governance (annual) | $5K–$30K/yr | $15K/yr |
| Total (Year 1) | $50K–$200K | $115K |
| Ongoing (Year 2+) | $20K–$80K/yr | $45K/yr |
The Math
At the conservative end: $2M in annual risk exposure versus $115K in migration cost. That's a 17:1 ratio.
At the industry average: $19.5M in risk exposure versus $115K. That's a 170:1 ratio.
Even accounting for the fact that migration doesn't eliminate 100% of risk — it reduces it by an estimated 80–95% — the ROI is overwhelming at any reasonable assumption.
And this doesn't count the productivity gains. If your internal AI platform saves each knowledge worker 3 hours per week (a conservative estimate based on Phase 1 discovery data), and your average fully-loaded cost per knowledge worker is $80/hour, that's:
- 500 knowledge workers × 3 hours × $80 × 48 weeks = $5.76 million in annual productivity gains
The internal AI platform doesn't just reduce risk. It generates measurable productivity value that external shadow AI was already partially delivering — just without the audit trail, the data protection, or the organizational control.
Common Objections and Responses
"We can't afford it." You can't afford not to. One data leakage incident involving customer PII costs more than the entire migration. And the productivity gains alone typically offset the investment within 6 months.
"Our employees won't use an internal tool." They will if it's good enough. The key is Phase 2 — deploy quickly, get feedback, iterate. Employees prefer sanctioned tools when those tools meet their needs.
"Open-source models aren't as good as GPT-4." For general knowledge, that's partially true. For your specific domain, a fine-tuned 70B model outperforms GPT-4 because it knows your business. This is the whole point of Phase 4.
"We don't have ML expertise in-house." Phases 1 and 2 require IT operations skills, not ML expertise. Phases 3 and 4 require some ML knowledge, which can be provided by a platform vendor, a consultant, or one new hire. The skills needed are increasingly common.
"This timeline is too long." Phase 2 delivers value in 8 weeks. You don't need to wait for Phase 4 to see results. The phased approach means you're reducing risk from week 5 onward.
"What about just getting ChatGPT Enterprise?" ChatGPT Enterprise addresses some concerns (data not used for training, SOC 2 compliance, SSO). It doesn't address data residency, custom model training, offline availability, or full audit control. For lightly regulated industries, it may be sufficient. For healthcare, financial services, legal, defense, and other regulated environments, on-premise deployment remains necessary.
Getting Started
The first step is Phase 1: Discovery. You need to know the size and shape of the problem before you can solve it. Most organizations are surprised by what they find — both the scale of shadow AI usage and the genuine productivity value employees are getting from it.
Don't approach this as a crackdown. Approach it as a migration. Your employees have already demonstrated that AI tools make them more productive. Your job is to give them better tools — tools that are faster, more capable for your domain, and don't leak company data to third parties.
The playbook works because it aligns incentives. Employees get better AI tools. Security gets visibility and control. Leadership gets reduced risk and measurable productivity gains. Nobody has to lose for this to work.
Start with discovery. Deploy a quick win. Build the foundation. Train the models. Govern the system. Twenty-four to thirty-six weeks from now, the shadow AI problem isn't a problem anymore. It's a competitive advantage.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.