Staff-level interview prep: customer discovery, rapid prototyping, production deployment, and field engineering.
It is 6:47 AM and your phone buzzes. A Slack message from the VP of Engineering at FinanceCorps, your largest enterprise customer: "Our fraud detection pipeline missed $2.3M in chargebacks last night. Your model was supposed to catch these. I need someone on-site by noon." You are the someone. Not a support engineer reading from a runbook. Not a sales rep promising a fix. You are the Forward Deployed Engineer — the person who understands the product deeply enough to diagnose the issue, the customer's infrastructure well enough to reproduce it, and the business context well enough to explain the impact in dollars.
By 10 AM you are in their war room. By 11 AM you have identified the root cause: a schema migration last Tuesday added a new transaction type that your model's feature extractor doesn't recognize, so it defaults to "low risk." By 1 PM you have a hotfix deployed to their staging environment. By 3 PM it is in production. By 4 PM you are on a call with your product team explaining why the feature extractor needs to handle unknown transaction types gracefully, not just the 47 types it was trained on.
This is the Forward Deployed Engineer (FDE). The term was popularized by Palantir, but the role exists at every company that sells complex technical products to enterprise customers: Databricks, Snowflake, Stripe, Figma's enterprise team, Anduril, Scale AI, OpenAI, Anthropic, Cohere, and every AI startup selling APIs. The FDE is the bridge between the product and the customer. Not a generalist. A specialist in the boundary between what your product does and what the customer needs it to do.
Enterprise software fails in deployment, not in demos. The product works perfectly on your test data, in your cloud, with your assumptions. Then a Fortune 500 customer tries to run it on their 15-year-old Oracle database behind a firewall that blocks outbound HTTPS, with data that has 40% null values in fields your model assumes are always populated, governed by compliance rules that require all data to stay in the EU.
No amount of documentation bridges this gap. You need a human who can sit in the customer's environment and make the product work there, not in the idealized world of your test suite.
| Dimension | Software Engineer | Solutions Engineer | Forward Deployed Engineer |
|---|---|---|---|
| Builds for | The product (all customers) | Pre-sales demos | One customer's production system |
| Code ships to | Main product repo | Demo environments | Customer's infrastructure |
| Success metric | Feature adoption | Deal closed | Customer achieves production value |
| Failure mode | Shipped a bug | Lost the deal | Customer churns after 6 months |
| Time horizon | Quarters | Weeks | Days to months (per engagement) |
Staff-level FDE interviews test you across five dimensions. Each chapter maps to one or more:
| Dimension | What they ask | Chapters |
|---|---|---|
| CONCEPT | "Explain the FDE lifecycle on a whiteboard" | 0, 1, 10 |
| DESIGN | "Design a deployment pipeline for an air-gapped customer" | 3, 4, 5 |
| CODE | "Write a health-check script for a customer's deployment" | 2, 4, 6 |
| DEBUG | "Your model's accuracy dropped 15% at a customer site. Walk me through your investigation." | 6, 9 |
| FRONTIER | "How will AI-native deployment change the FDE role?" | 7, 8, 11 |
You walk into a meeting room at a logistics company. On one side of the table: a VP of Operations, a data scientist, an IT director, and a procurement manager. They all want different things. The VP wants "AI that predicts delays." The data scientist wants "a real-time feature store." The IT director wants "something that runs on our existing Kubernetes cluster without new cloud spend." The procurement manager wants "a fixed-price contract." Your job is to leave this room with a technical spec that satisfies all four — or, more often, to identify which stakeholder's need is actually the business-critical one and build for that.
This is Customer Problem Discovery — the art of extracting the real problem from the stated problem. Customers almost never tell you what they actually need. They tell you what they think the solution is. "We need a recommendation engine" actually means "our customers can't find products and we're losing 12% of cart value." The FDE's first skill is translating business pain into technical requirements.
Every discovery conversation follows the same structure. Memorize this — interviewers will ask you to walk through a customer discovery scenario.
Different stakeholders reveal different information. A staff FDE knows how to extract signal from each:
| Stakeholder | What they tell you | What they hide (unintentionally) | The question that unlocks truth |
|---|---|---|---|
| VP/Executive | Strategic vision, budget | Actual data quality, technical debt | "Can you show me the current dashboard/report you use to make this decision?" |
| Data Scientist | Model requirements, feature wishlist | Infrastructure limitations, data pipeline reliability | "Walk me through what happens when your model retrains — from trigger to production." |
| IT/Infra | Security requirements, network topology | Actual deployment velocity, change management friction | "How long does it take to get a new service into production from first PR?" |
| End User | Daily pain points, workarounds | What they've already tried and abandoned | "Show me how you do this task right now, step by step." |
Here is a worked example. A healthcare company says: "We need AI to predict patient readmissions." After discovery, you produce this translation:
```python
# Discovery output: Requirements Document (simplified)
requirements = {
    # Business requirement → Technical requirement
    "predict_readmission": {
        "input": "EHR data (HL7 FHIR R4), 48 features per patient",
        "output": "risk_score (0-1), top_3_risk_factors, confidence_interval",
        "latency": "<200ms per prediction (nurse workflow requirement)",
        "accuracy": "AUC-ROC > 0.82 (current manual process is ~0.65)",
        "volume": "~3000 predictions/day across 12 hospitals",
    },
    "constraints": {
        "compliance": "HIPAA: no PHI leaves customer VPC",
        "infra": "AWS GovCloud, EKS 1.27, no GPU instances approved yet",
        "data_quality": "32% of records have missing diagnosis codes",
        "timeline": "POC in 4 weeks, production in 12 weeks",
        "budget": "$180K first year including compute",
    },
    "success_criteria": {
        "primary": "Reduce 30-day readmission rate by 15% (saves ~$4.2M/year)",
        "secondary": "Nurse adoption rate > 60% within 3 months",
    },
}

# The key insight: "predict readmissions" became a specific
# technical spec with latency, volume, accuracy, and compliance
# requirements. Without discovery, you'd build the wrong thing.
```
Failure mode: Requirement Drift. You agree on a spec in Week 1. By Week 4, the customer has added 12 new requirements that weren't in the original scope. Each one is "small." Together, they've doubled the project. The fix: write a one-page scope document after every discovery meeting and get explicit sign-off. When new requirements appear, point to the doc and say: "Happy to add this. It extends the timeline by 2 weeks. Should we reprioritize?"
Failure mode: Building for the wrong stakeholder. The VP wants dashboards. The data scientist wants model accuracy. You build beautiful dashboards. Six months later, the data scientist has replaced your product with a Jupyter notebook because the model accuracy wasn't good enough. The fix: identify who has veto power and who measures success. Build for them first.
The next generation of FDE tooling uses LLMs to accelerate discovery. Record the stakeholder interview (with consent), transcribe it, and use an LLM to extract structured requirements, flag contradictions between stakeholders, and generate a draft scope document. The FDE still validates everything — but the turnaround from meeting to spec drops from 2 days to 2 hours. Companies like Dovetail and Grain are building the infrastructure; FDEs at Palantir and Databricks are early adopters.
You have 72 hours. The customer's executive review is Monday morning. If you can show a working prototype that processes their actual data and produces real results, the deal closes. If you show slides, they'll "circle back next quarter" — which means never. This is the FDE's superpower: building something real, fast, with the customer's own data.
Rapid prototyping is not hacking. It is the disciplined art of building the minimum artifact that proves the core value proposition works with the customer's data, in the customer's environment, under the customer's constraints. Everything else is deferred. Not ignored — explicitly deferred with a documented plan for how it gets built later.
Every prototype lives on a fidelity ladder. Choosing the wrong rung wastes time. Choosing too high burns days on polish nobody asked for. Choosing too low fails to convince.
| Level | Artifact | Time | When to use | Example |
|---|---|---|---|---|
| 0 | Napkin sketch | 15 min | Clarifying requirements in a meeting | Data flow diagram on a whiteboard |
| 1 | Script + CLI output | 2-4 hours | Proving data can be ingested and transformed | Python script that parses their CSV and outputs feature vectors |
| 2 | Notebook with charts | 1-2 days | Showing model viability on their data | Jupyter notebook with ROC curves on their historical data |
| 3 | Deployed API + minimal UI | 3-5 days | Executive demo, POC sign-off | FastAPI endpoint + Streamlit dashboard processing live data |
| 4 | Production-hardened service | 4-12 weeks | Customer deployment | Containerized service with monitoring, auth, and SLAs |
A Level 3 prototype is the sweet spot for most FDE engagements. It is real enough to process actual customer data and impressive enough for an executive demo, but lean enough to build in under a week.
Here is the actual code structure for a Level 3 prototype. Every FDE should be able to produce this in their sleep:
```python
# prototype/main.py: the 72-hour MVP structure
import time

import pandas as pd
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

# extract_features, model, and explain_prediction are customer-specific
# pieces assumed to be defined elsewhere in the prototype.

app = FastAPI(title="CustomerCo Fraud Detection POC")


class PredictionResult(BaseModel):
    transaction_id: str
    risk_score: float        # 0-1
    risk_factors: list[str]  # top 3 reasons
    latency_ms: float        # show them you care about perf


@app.post("/predict")
async def predict(txn: dict) -> PredictionResult:
    start = time.monotonic()
    features = extract_features(txn)        # Their schema, not yours
    score = model.predict_proba(features)   # Pre-trained on their historical data;
                                            # assumed to return a single probability
    factors = explain_prediction(features)  # SHAP or simple feature importance
    elapsed = (time.monotonic() - start) * 1000
    return PredictionResult(
        transaction_id=txn["id"],
        risk_score=float(score),
        risk_factors=factors[:3],
        latency_ms=round(elapsed, 1),
    )


@app.post("/batch")
async def batch_predict(file: UploadFile):
    # Process their CSV in one shot for the demo (use pd.read_parquet for Parquet)
    df = pd.read_csv(file.file)
    # Reuse the single-transaction path so batch and online results agree
    results = [await predict(row.to_dict()) for _, row in df.iterrows()]
    return {"predictions": results, "total": len(results)}
```
The hardest judgment call an FDE makes: when does quick-and-dirty code become a liability? Here is the decision framework:
| Signal | Hack it | Architect it |
|---|---|---|
| Will this code run in production? | No — it's a demo | Yes — customer will deploy this |
| Will another engineer maintain it? | No — you own it | Yes — customer or teammate inherits |
| Does it handle customer data? | Sample data only | Real PII/financial data |
| Can you rewrite it in <1 day? | Yes — small blast radius | No — too entangled to rewrite |
Failure mode: Data format mismatch. You built the prototype with the CSV sample they sent last week. The live data is JSON with nested arrays, different column names, and timestamps in three different formats (ISO, Unix epoch, and "MM/DD/YYYY" with 2-digit years). The fix: always write a data adapter layer as the first thing. Never let the model see raw customer data directly.
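```python
# The adapter pattern: your most reused FDE code
from typing import Callable, Optional


class DataAdapter:
    def __init__(
        self,
        schema_map: dict,
        defaults: Optional[dict] = None,
        transforms: Optional[dict[str, Callable]] = None,
    ):
        self.schema_map = schema_map        # {"their_col": "our_col"}
        self.defaults = defaults or {}      # {"our_col": value used when theirs is null}
        self.transforms = transforms or {}  # {"our_col": callable applied to their value}

    def adapt(self, raw: dict) -> dict:
        result = {}
        for their_key, our_key in self.schema_map.items():
            val = raw.get(their_key)
            if val is None:
                # Handle nulls: fall back to the configured default (None if unset)
                result[our_key] = self.defaults.get(our_key)
            else:
                # Apply the per-field transform, or pass the value through unchanged
                transform = self.transforms.get(our_key, lambda v: v)
                result[our_key] = transform(val)
        return result


# Usage: one config change per customer, not code changes
adapter = DataAdapter({
    "txn_amt": "amount",
    "txn_ts": "timestamp",  # Their format → our ISO
    "cust_id": "user_id",
})
```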
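The three timestamp formats named in the failure mode above are a natural job for a per-field transform in the adapter. The following is a minimal sketch, not a complete parser: `normalize_timestamp` is a hypothetical helper name, and treating timestamps without a timezone as UTC is an assumption you would confirm with the customer.

```python
from datetime import datetime, timezone


def normalize_timestamp(value) -> str:
    """Return an ISO 8601 UTC string for Unix-epoch, MM/DD/YY, or ISO input."""
    # Unix epoch delivered as a number
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()

    text = str(value).strip()
    # Unix epoch delivered as a string of digits
    if text.isdigit():
        return datetime.fromtimestamp(int(text), tz=timezone.utc).isoformat()

    try:
        # US-style date with a 2-digit year, e.g. "03/15/24"
        parsed = datetime.strptime(text, "%m/%d/%y")
    except ValueError:
        # Anything else must be ISO 8601; this raises ValueError if it is not
        parsed = datetime.fromisoformat(text)

    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()
```

Unparseable values raise `ValueError` rather than being silently defaulted, which is usually the safer choice when the result feeds a model.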
At AI companies, the 72-hour prototype looks different. Instead of training a model on customer data, you're wiring up an LLM API with the customer's context:
```python
# AI company FDE prototype: RAG pipeline in 4 hours
from anthropic import AsyncAnthropic
from fastapi import FastAPI

# Async client so the request handler does not block the event loop
client = AsyncAnthropic()
app = FastAPI(title="CustomerCo Support Agent POC")


@app.post("/query")
async def query(question: str, customer_docs: list[str]):
    # Stuff their docs into context: quick and dirty RAG
    context = "\n---\n".join(customer_docs[:20])
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=f"Answer using ONLY these docs:\n{context}",
        messages=[{"role": "user", "content": question}],
    )
    return {
        "answer": response.content[0].text,
        "tokens_used": response.usage.input_tokens + response.usage.output_tokens,
        # estimate_cost is a helper assumed to be defined elsewhere
        "cost_usd": estimate_cost(response.usage),  # Show them unit economics
    }
```
The killer move: show cost-per-query alongside accuracy. When the VP sees "$0.003 per support ticket resolved" next to their current "$8.50 per human agent ticket," the deal closes itself. AI company FDEs must always quantify cost per unit of value, not just accuracy.
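The cost-per-query figure is simple arithmetic over token counts. The sketch below shows one way a helper like `estimate_cost` could work; the per-token prices are placeholders chosen for illustration, not a vendor's actual price sheet, and `Usage` is a stand-in for whatever usage object the API returns.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. Replace with the
# vendor's current published pricing before showing numbers to a customer.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


@dataclass
class Usage:
    """Stand-in for an API usage object with token counts."""
    input_tokens: int
    output_tokens: int


def estimate_cost(usage) -> float:
    """Cost of one request in USD, given input and output token counts."""
    return (
        usage.input_tokens * INPUT_USD_PER_MTOK
        + usage.output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def cost_ratio(human_cost_usd: float, ai_cost_usd: float) -> float:
    """How many times cheaper the AI path is per unit of work."""
    return human_cost_usd / ai_cost_usd
```

Under these placeholder prices, a request with 700 input tokens and 60 output tokens costs about $0.003, which against an $8.50 human-handled ticket is a ratio in the thousands.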
FDEs at Scale AI, Palantir, and AI API companies are using code-generation LLMs to accelerate prototyping. You describe the customer's data schema and desired output, and the LLM generates the adapter layer, feature engineering pipeline, and API scaffolding. The FDE validates, adjusts, and adds business logic. What used to take 3 days now takes 3 hours. The FDE's value shifts from typing code to understanding the customer's problem deeply enough to prompt correctly.
The prototype worked. The customer saw their data flowing through your system and said "Yes, this is what we need." Now comes the hard part: turning that prototype into a production system that runs in their environment, not yours. The architecture you choose determines whether this deployment takes 4 weeks or 4 months, whether it scales to 10x their current volume, and whether it survives the first on-call incident.
An FDE's architecture is fundamentally different from a product engineer's architecture. A product engineer designs for all customers. You design for this customer. That's not a limitation — it's a superpower. You know their exact data volume, latency requirements, infrastructure, compliance rules, and team skill level. You can make specific tradeoffs that a generic product never could.
Before writing any architecture document, fill in this canvas. It forces you to address every dimension that matters in a customer deployment:
| Dimension | Question | Example (FinanceCorps) |
|---|---|---|
| Data Ingress | How does customer data enter the system? | Kafka topic, 50K events/min, Avro schema |
| Processing | Batch or streaming? Latency SLA? | Streaming, <500ms end-to-end |
| Storage | Where do results live? Retention? | Customer's PostgreSQL, 90-day retention |
| Compute | CPU/GPU? Customer's cluster or dedicated? | CPU-only (no GPU approved), 3 EKS nodes |
| Auth | How does the system authenticate? | OIDC via customer's Okta, service accounts for internal |
| Compliance | Data residency, encryption, audit logging? | SOC2, all data encrypted at rest (AES-256), audit log to Splunk |
| Observability | How do you monitor from outside? | Prometheus metrics exported, Datadog agent, PagerDuty alerts |
| Rollback | How do you undo a bad deployment? | Blue/green with automatic rollback on error rate > 5% |
Most FDE deployments follow one of three patterns. Know all three — your architecture interview will ask you to choose and justify:
```python
# Pattern 1: Sidecar (your code runs alongside their app)
# Best for: adding capabilities to existing services
architecture_sidecar = {
    "deployment": "K8s sidecar container in customer's pod",
    "data_flow": "localhost:8080 → your sidecar → their DB",
    "pros": ["Minimal network changes", "Shares pod lifecycle"],
    "cons": ["Resource contention", "Coupled deployment"],
}

# Pattern 2: Standalone Service (your code runs independently)
# Best for: new capabilities that don't fit existing services
architecture_standalone = {
    "deployment": "Dedicated K8s namespace or VM",
    "data_flow": "customer app → REST/gRPC → your service → their DB",
    "pros": ["Independent scaling", "Clean failure domain"],
    "cons": ["Network hop latency", "More infra to manage"],
}

# Pattern 3: Embedded Library (your code is a package they import)
# Best for: edge/offline, air-gapped, or latency-critical
architecture_embedded = {
    "deployment": "pip install your-sdk, imported in their code",
    "data_flow": "in-process function call, no network",
    "pros": ["Zero latency", "Works offline/air-gapped"],
    "cons": ["Version coupling", "No independent updates"],
}
```
Constraints from discovery: SOC2, no GPU, 50K events/min, <500ms latency, 3 EKS nodes, team of 2 DevOps.
Architecture decision: Standalone Service (Pattern 2). Why: (1) fraud detection needs independent scaling — transaction volume spikes at month-end, (2) clean failure domain means a bug in fraud detection doesn't crash their payment service, (3) their DevOps team can manage a K8s Deployment with 3 replicas — it's a pattern they already know.
Rejected alternatives: Sidecar was considered but rejected because their payment pods are at 85% memory utilization already. Embedded library was rejected because they need to update the model monthly without redeploying their payment service.
Failure mode: The "works on my cluster" problem. Your architecture works perfectly on your test cluster with 16-core nodes and 64GB RAM. The customer's cluster has 4-core nodes with 8GB RAM and a pod memory limit of 2GB. Your model alone needs 1.8GB. The fix: always ask for the customer's resource quotas during discovery, and design to fit within 60% of their limits (leaving headroom for spikes).
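The 60% headroom rule can be checked with a few lines during discovery. This is an illustrative sketch; the function names are hypothetical and the 0.60 default simply encodes the rule of thumb stated above.

```python
def max_footprint_gb(limit_gb: float, max_fraction: float = 0.60) -> float:
    """Largest memory footprint that still leaves headroom for spikes."""
    return limit_gb * max_fraction


def fits_with_headroom(
    required_gb: float, limit_gb: float, max_fraction: float = 0.60
) -> bool:
    """True if the service fits within the allowed fraction of the pod limit."""
    return required_gb <= max_footprint_gb(limit_gb, max_fraction)
```

For the example above, a 1.8GB model against a 2GB pod limit fails the check: the budget under the 60% rule is about 1.2GB, so the model would need to shrink (quantization, a smaller architecture) or the customer would need to raise the limit.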
Leading FDE teams at Palantir (Apollo) and Databricks maintain deployment templates — parameterized Terraform/Pulumi modules that encode the three architecture patterns. The FDE fills in customer-specific values (VPC IDs, node sizes, compliance flags) and generates a complete deployment config in minutes. The frontier is using LLMs to generate these configs from natural-language descriptions of the customer's environment.
Your architecture is beautiful on the whiteboard. Now you need to connect it to the customer's world. Their world is a sprawling ecosystem of legacy systems, custom APIs, proprietary data formats, authentication flows that were designed in 2011, and documentation that was last updated in 2019. Integration is where FDE work gets real — and where most projects stall.
Integration is not "calling an API." It is understanding the customer's entire data lifecycle: where data is created, how it flows through their systems, what transformations happen along the way, who has access, and what happens when something in that chain breaks. An FDE who can map a customer's data flow end-to-end in 2 hours is worth more than one who can build a perfect ML model in 2 weeks.
Every customer has a different authentication story. Here are the four you'll encounter, ordered by frequency:
```python
import httpx

# Auth Pattern 1: OAuth2 / OIDC (most common in cloud-native)
# Your service gets a client_id + client_secret, exchanges for access token
async def get_token(client_id: str, client_secret: str, token_url: str) -> str:
    # Context manager closes the HTTP connection when the request is done
    async with httpx.AsyncClient() as http:
        resp = await http.post(token_url, data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": "read:transactions",
        })
        resp.raise_for_status()  # Surface 4xx/5xx instead of a KeyError below
        return resp.json()["access_token"]  # Expires in 3600s typically

# Auth Pattern 2: mTLS (common in financial services)
# Both sides present certificates. No tokens. The cert IS the credential.
client = httpx.AsyncClient(
    cert=("/certs/client.pem", "/certs/client.key"),
    verify="/certs/customer-ca.pem",  # Customer's CA bundle
)

# Auth Pattern 3: API Key + IP Allowlist (legacy but common)
# Static key in header, requests must come from allowed IPs
headers = {"X-Api-Key": "sk_live_..."}

# Auth Pattern 4: SAML + Service Account (enterprise SSO)
# The customer's IdP issues SAML assertions for your service account
# Typically used when your service needs to act as a "user" in their system
```
Customer data pipelines are rarely clean. Here is what a real integration looks like at a mid-size retailer:
The worked numbers: Oracle POS generates ~2M transactions/day. Informatica ETL processes them in 47 minutes. Snowflake query to extract 48 features for one customer takes ~200ms. Your inference takes ~50ms on CPU. End-to-end latency for a prediction: 6-18 hours (dominated by ETL staleness) + 250ms (query + inference). The customer wanted "real-time." You now have to explain why "real-time" with their current infrastructure means "within 18 hours" and what it would cost to make it truly real-time (CDC + Kafka + streaming inference = 3 months and $200K in infrastructure changes).
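The arithmetic behind "within 18 hours" is worth making explicit, because it shows the customer where the latency actually lives. The sketch below uses only the figures quoted above; the function name and return shape are illustrative.

```python
MS_PER_HOUR = 3_600_000


def end_to_end_latency(
    etl_staleness_hours: float, query_ms: float, inference_ms: float
) -> dict:
    """Split prediction latency into data staleness and serving time."""
    serving_ms = query_ms + inference_ms
    staleness_ms = etl_staleness_hours * MS_PER_HOUR
    total_ms = staleness_ms + serving_ms
    return {
        "serving_ms": serving_ms,
        "total_hours": total_ms / MS_PER_HOUR,
        # Fraction of the total that comes from stale data, not your service
        "staleness_share": staleness_ms / total_ms,
    }


best_case = end_to_end_latency(6, query_ms=200, inference_ms=50)
worst_case = end_to_end_latency(18, query_ms=200, inference_ms=50)
```

In both cases the serving path is 250ms and data staleness accounts for well over 99.99% of the total, so optimizing the query or the model does nothing for "real-time" until the ETL schedule changes.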
```python
# API versioning for FDE deployments
# Golden rule: never break a customer's integration
# Strategy: URL path versioning with graceful degradation
# (score and factors stand in for the model outputs computed in each handler)

@app.post("/v1/predict")
async def predict_v1(txn: dict):
    # Original schema: flat dict with 12 fields
    return {"risk_score": score}


@app.post("/v2/predict")
async def predict_v2(txn: dict):
    # New schema: nested dict with explanation
    return {"risk_score": score, "explanation": factors, "model_version": "2.1.0"}

# CRITICAL: v1 still works. Customer A is on v1, Customer B is on v2.
# You maintain both until Customer A migrates (which takes 3 months
# because their integration was built by a contractor who left).
```
At AI API companies, integration means helping the customer wire up your model API into their existing product. The patterns are different from traditional software integration:
| Pattern | When | FDE's Job | Common Pitfall |
|---|---|---|---|
| Direct API call | Simple Q&A, classification | Prompt template, error handling, retry logic | No streaming → 10s blank screen |
| RAG pipeline | Customer has proprietary docs | Chunking strategy, embedding model, vector store setup | Wrong chunk size → irrelevant retrieval |
| Agent with tools | Multi-step workflows | Tool definitions, guardrails, state management | Infinite loops, hallucinated tool calls |
| Fine-tuned model | Domain-specific language/format | Training data curation, eval pipeline, A/B rollout | Overfitting to training distribution |
| Batch processing | Document processing at scale | Batching strategy, cost projection, error handling | Rate limits, no progress tracking |
```python
# AI FDE: streaming integration with fallback
import anthropic

# Async client: required for `async with` / `async for` below
client = anthropic.AsyncAnthropic()


async def customer_query(question: str, docs: list[str]):
    # build_rag_prompt and retry_with_backoff are helpers assumed to be
    # defined elsewhere in the integration.
    try:
        async with client.messages.stream(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=build_rag_prompt(docs),
            messages=[{"role": "user", "content": question}],
        ) as stream:
            async for text in stream.text_stream:
                yield text  # Stream to customer's UI
    except anthropic.RateLimitError:
        # Fallback: queue and retry, don't drop the request
        yield "Processing your request..."
        result = await retry_with_backoff(question, docs)
        yield result
```
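The fallback path above depends on a retry helper that is not shown. One common way to implement it is exponential backoff with jitter; the sketch below is an assumption about how such a helper might look, not the original's implementation. Its signature takes a zero-argument coroutine function rather than `(question, docs)`, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import asyncio
import random


async def retry_with_backoff(
    operation,
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    retry_on: tuple = (Exception,),
    sleep=asyncio.sleep,
):
    """Await operation() until it succeeds or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Exponential backoff capped at max_delay_s, with full jitter
            # so that many clients do not retry in lockstep
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            await sleep(random.uniform(0, ceiling))
```

In a real integration you would narrow `retry_on` to the errors that are safe to retry (rate limits, timeouts) so that bad requests fail fast instead of being retried.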
Failure mode: The Silent Schema Change. The customer's upstream team adds a column to their transaction table. Your feature extractor doesn't know about it — no error, it just ignores it. But the new column contains a critical signal (e.g., "is_international_transaction") that now makes 15% of your feature vectors incomplete. Your model's accuracy silently degrades from 0.87 AUC to 0.71 AUC over 2 weeks. Nobody notices until the customer's fraud losses spike.
The fix: schema validation on ingestion. Every time your service reads customer data, validate the schema against a registered contract. If new columns appear, log a warning and notify the FDE. If expected columns disappear, halt and alert.
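The validation rule described above (warn on new columns, halt on missing ones) can be sketched in a few lines. The names `validate_schema` and `SchemaContractError` are hypothetical, and a plain set of column names stands in for whatever contract registry you actually use.

```python
import logging

logger = logging.getLogger("ingestion")


class SchemaContractError(Exception):
    """Raised when a column the feature extractor depends on is missing."""


def validate_schema(record: dict, expected_columns: set) -> set:
    """Check one record against the registered contract.

    Returns the set of unexpected new columns (after logging a warning).
    Raises SchemaContractError if any expected column is absent.
    """
    actual = set(record)
    missing = expected_columns - actual
    if missing:
        # Expected columns disappeared: halt rather than score bad features
        raise SchemaContractError(f"Expected columns missing: {sorted(missing)}")
    unexpected = actual - expected_columns
    if unexpected:
        # New columns appeared: keep serving, but tell the FDE
        logger.warning("New columns in customer data: %s", sorted(unexpected))
    return unexpected
```

A caller would typically route the returned set into a notification (ticket, chat alert) so that a column like `is_international_transaction` gets reviewed within days instead of being discovered after accuracy has already degraded.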
Companies like Airbyte, Fivetran, and dbt are building universal connector layers that abstract away the pain of integrating with every customer's unique data stack. The frontier for FDEs is composing these connectors into customer-specific pipelines using declarative configs rather than custom code. Palantir's Foundry does this with "transforms" that chain connectors. The FDE's role shifts from writing integration code to configuring and debugging connector pipelines.
The prototype worked. The architecture is approved. The integration is tested. Now you deploy to production, and it is their production, not yours. Their environment has constraints you've never seen in a textbook: firewall rules that block Docker Hub, container registries that only accept signed images, deployment windows limited to Sundays between 2 and 6 AM, and a change management board that requires 2 weeks' notice for any production change.
FDE deployment is fundamentally different from product deployment because you don't control the infrastructure. You are a guest in someone else's house, and they have rules.
| Environment | Characteristics | FDE Impact | Real Example |
|---|---|---|---|
| Cloud VPC | Customer's AWS/GCP/Azure, internet access, managed services available | Closest to "normal" — pull images from ECR, use managed DBs | Most SaaS companies, fintech startups |
| On-Prem | Customer's data center, may have internet, custom hardware | Must pre-package all dependencies, no pulling from internet during deploy | Banks, hospitals, government |
| Air-Gapped | No internet connectivity at all, physical media transfer | Everything shipped on USB/DVD. No telemetry, no remote debugging, no updates without physical access. | Defense, intelligence, critical infrastructure |
| Hybrid | Some components in cloud, sensitive data on-prem | Split architecture: inference on-prem, training/analytics in cloud with anonymized data | Healthcare systems, financial institutions |
Air-gapped deployment is the ultimate FDE test. Here is the actual process for deploying to a classified environment:
```bash
# Step 1: Build the deployment bundle (on YOUR machine, with internet)
# Every single dependency must be included. No pip install at deploy time.
docker save your-service:v2.1.0 | gzip > service.tar.gz
docker save postgres:15 | gzip > postgres.tar.gz

# Bundle all Python wheels for offline install
pip download -r requirements.txt -d ./wheels/

# Step 2: Generate checksums for integrity verification
# (done before bundling so checksums.sha256 exists when tar runs)
sha256sum service.tar.gz postgres.tar.gz > checksums.sha256

# Bundle images, wheels, Helm charts, configs, scripts
tar czf deploy-bundle-v2.1.0.tar.gz \
    service.tar.gz postgres.tar.gz wheels/ \
    helm/ configs/ scripts/ checksums.sha256
sha256sum deploy-bundle-v2.1.0.tar.gz > manifest.sha256

# Step 3: Transfer via approved media (USB, burned DVD, etc.)

# Step 4: On-site, verify checksums, load images, deploy
sha256sum -c manifest.sha256
tar xzf deploy-bundle-v2.1.0.tar.gz
sha256sum -c checksums.sha256
docker load < service.tar.gz
docker load < postgres.tar.gz
helm upgrade --install your-service ./helm/ -f configs/customer.yaml
```
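Some air-gapped hosts lack the tools you expect, so it can help to carry the integrity check inside your own deployment script. The sketch below verifies a manifest in the two-column format that `sha256sum` writes, using only the Python standard library; `sha256_of` and `verify_manifest` are illustrative names, and file names are assumed to be relative to the manifest's directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large image tarballs do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path) -> list[str]:
    """Return the names of files that are missing or whose hash differs."""
    manifest = Path(manifest_path)
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes '*' in binary mode
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```

An empty list means every file listed in the manifest is present and intact; anything else should stop the deployment before images are loaded.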
```python
# deploy.py: FDE deployment script (every FDE has a version of this)
import sys

# ping, get_disk_free_gb, secret_exists, get_pods, and customer_config are
# environment-specific and assumed to be defined elsewhere in this script.


def preflight_checks(config: dict) -> list[str]:
    """Run before any deployment. Returns list of failures."""
    failures = []

    # Check: can we reach the customer's container registry?
    if not ping(config["registry_url"]):
        failures.append("Cannot reach container registry")

    # Check: do we have enough disk space? (burned us at MedTech)
    free_gb = get_disk_free_gb(config["deploy_path"])
    if free_gb < config["min_disk_gb"]:
        failures.append(f"Need {config['min_disk_gb']}GB, have {free_gb}GB")

    # Check: are required secrets present?
    for secret in config["required_secrets"]:
        if not secret_exists(secret):
            failures.append(f"Missing secret: {secret}")

    # Check: is the target namespace healthy?
    pods = get_pods(config["namespace"])
    unhealthy = [p for p in pods if p.status != "Running"]
    if unhealthy:
        failures.append(f"{len(unhealthy)} unhealthy pods in namespace")

    return failures


# Usage: NEVER deploy if preflight fails
failures = preflight_checks(customer_config)
if failures:
    print("PREFLIGHT FAILED:")
    for f in failures:
        print(f"  ✗ {f}")
    sys.exit(1)
```
Compliance is not optional and it shapes every deployment decision:
| Compliance | Key requirement | Deployment impact |
|---|---|---|
| SOC2 | Audit trail for all access and changes | Every deployment generates an audit log entry. All SSH sessions recorded. |
| HIPAA | PHI stays within approved boundaries | No data leaves the customer's VPC. Logs must be scrubbed of PHI before export. |
| GDPR | Data residency, right to deletion | Deploy in EU region. Implement data deletion endpoint. Log what data was processed. |
| FedRAMP | Government-approved cloud configurations | Only deploy to FedRAMP-authorized cloud regions. FIPS 140-2 encryption. |
| PCI-DSS | Cardholder data protection | Network segmentation, encryption in transit, no storing card numbers in logs. |
Failure mode: The Dependency Surprise. Your service starts fine in staging. In production, it crashes on startup with "cannot connect to database." Why? Staging uses a local PostgreSQL. Production uses a PostgreSQL behind a connection pooler (PgBouncer) that doesn't support prepared statements. Your ORM uses prepared statements by default. The fix: always test against a production-equivalent database setup, not a simplified staging version.
Argo CD and Flux are enabling GitOps workflows where the customer's deployment is defined in a git repo. The FDE opens a PR to change the deployment config, the customer reviews and merges, and Argo CD automatically deploys. This creates an audit trail, enables rollback via git revert, and gives the customer visibility into every change. Palantir's Apollo system is the gold standard here — it manages deployments across thousands of customer environments from a single control plane.
It is 2 AM. Your phone buzzes. The on-call alert: "FinanceCorps fraud detection latency p99 is 12 seconds (SLA: 500ms)." You open your laptop. You cannot SSH into their servers. You cannot access their Grafana. You cannot tail their logs. You are debugging a production system you've never directly touched, through a 3-inch window of exported metrics and whatever the customer's on-call engineer can paste into Slack.
This is the FDE debugging experience. You are a surgeon operating through a mail slot. Your diagnostic tools are limited, your access is restricted, and the pressure is immense because every minute the system is slow, the customer is losing money and trust.
When you can't access the customer's environment directly, you need a structured approach:
Every FDE carries a mental toolkit of diagnostic commands that work in restricted environments:
```bash
# Network diagnostics (when you suspect connectivity issues)
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" -o /dev/null -s http://endpoint

# Memory/CPU diagnostics (when you suspect resource exhaustion)
kubectl top pods -n fraud-detection --sort-by=memory
kubectl describe pod fraud-svc-abc123 | grep -A5 "Limits\|Requests\|Last State"

# Log analysis (when you need to find the needle)
kubectl logs deployment/fraud-svc --since=1h | grep -c ERROR  # error rate
kubectl logs deployment/fraud-svc --since=1h | grep ERROR | sort | uniq -c | sort -rn | head  # top errors

# Connection pool diagnostics (very common FDE issue)
kubectl exec fraud-svc-abc123 -- ss -s  # socket statistics
kubectl exec fraud-svc-abc123 -- cat /proc/net/sockstat  # socket counts
```
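When the customer can only paste raw log lines into Slack, the "top errors" pipeline above can be reproduced offline. The following is a minimal sketch, assuming each line looks like `<timestamp> <LEVEL> <message>`; the log format, the `top_errors` name, and the digit-masking rule are illustrative assumptions, not part of any real service.

```python
import re
from collections import Counter


def top_errors(log_text: str, limit: int = 10) -> list[tuple[str, int]]:
    """Count ERROR lines by message, most frequent first."""
    counts = Counter()
    for line in log_text.splitlines():
        if " ERROR " not in line:
            continue
        # Keep only the text after the log level.
        message = line.split(" ERROR ", 1)[1].strip()
        # Mask digit runs so "txn 8842" and "txn 8843" group together.
        counts[re.sub(r"\d+", "N", message)] += 1
    return counts.most_common(limit)


# Hypothetical log excerpt pasted by the customer's on-call engineer.
sample = """\
2024-05-01T01:47:02Z ERROR timeout acquiring DB connection after 12000ms
2024-05-01T01:47:03Z INFO scored txn 8841
2024-05-01T01:47:04Z ERROR timeout acquiring DB connection after 12000ms
2024-05-01T01:47:05Z ERROR malformed txn 8842
"""

for message, count in top_errors(sample):
    print(f"{count:5d}  {message}")
```

Masking the digits is a deliberate choice: without it, every transaction ID produces its own "unique" error and the dominant failure is hidden in the tail.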
Let's trace through the FinanceCorps incident step by step:
Symptom: p99 latency jumped from 450ms to 12s at 01:47 UTC.
Hypothesis 1: Database connection pool exhaustion. The most common cause of sudden latency spikes. Ask: "Run `SELECT count(*) FROM pg_stat_activity WHERE state = 'active';`" Response: "247 active connections." Pool size is 50. Something is leaking connections.
Root cause investigation: At 01:45 UTC, the customer deployed a new version of their transaction ingestion service. The new version opens a database connection per request but doesn't close it in the error path. When a malformed transaction arrives (which happens ~100/minute), the connection leaks. After 2 minutes, the pool is exhausted. Your service can't get a connection, so it waits until the 12-second timeout expires.
```python
# The bug (in THEIR code, not yours):
def process_transaction(txn):
    conn = pool.getconn()
    try:
        # process...
        if txn.amount < 0:
            # LEAK in their new version: without the finally block below,
            # this error path never returns the connection to the pool
            raise ValueError("negative amount")
        conn.execute(...)
    finally:
        # BUG: this finally block was deleted in their refactor.
        # It is shown here as it should be.
        pool.putconn(conn)

# The FDE's mitigation (while they fix their code):
# 1. Add connection timeout to YOUR service's DB config
# 2. Add a circuit breaker: if 5 consecutive DB timeouts, return cached result
# 3. Add connection pool monitoring to your health check endpoint
```
Key FDE skill: The bug was in the customer's code, not yours. But you still have to diagnose it, explain it, and propose a mitigation on your side. You can't say "fix your code" and go back to sleep. You need to make your system resilient to their failures.
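Mitigation 2 from the list above (fall back to a cached result after 5 consecutive DB timeouts) can be sketched as a small circuit breaker. This is an illustrative sketch only: the `DbCircuitBreaker` class, the 30-second cooldown, and the use of `TimeoutError` as the failure signal are assumptions, and a real service would wire this into its own DB client.

```python
import time


class DbCircuitBreaker:
    """Serve a fallback after repeated DB timeouts instead of waiting each time."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s  # assumed value, tune per deployment
        self.consecutive_failures = 0
        self.opened_at = None  # monotonic time when the breaker opened

    def call(self, query, fallback):
        """Run query(); use fallback() when the breaker is open or query times out."""
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()  # open: skip the DB entirely
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = query()
        except TimeoutError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.consecutive_failures = 0  # any success closes the breaker
        return result


def flaky_query():
    raise TimeoutError("pool exhausted")


breaker = DbCircuitBreaker(failure_threshold=5, cooldown_s=30.0)
results = [breaker.call(flaky_query, lambda: "cached-score") for _ in range(6)]
print(results[-1], breaker.consecutive_failures)
```

The sixth call never touches the database: once the breaker is open, requests return the cached score immediately rather than each waiting out the 12-second timeout.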
When the customer says "the AI is giving bad answers," the debugging tree is different from traditional software:
Failure mode: The Blame Game. You identify the bug in the customer's code. You tell them "your code has a connection leak." They get defensive. Their engineering lead says "our code hasn't changed" (it has — you can see the deployment timestamp). Now you're in a political situation, not a technical one. The fix: never say "your code is broken." Say: "I've identified that the connection pool is exhausted. Here's the timeline that correlates with the deployment at 01:45. Let's look at the connection handling in the new version together." Together is the key word.
The next generation of FDE tooling embeds observability into the deployed service itself. Instead of asking customers to run kubectl commands, your service exports a diagnostic bundle on request — a JSON blob with the last 1000 log lines, current resource usage, connection pool state, and recent latency histograms. The FDE runs `curl https://customer-endpoint/debug/bundle` and gets everything they need. Companies building this: Honeycomb, Chronosphere, and Palantir's internal tooling.
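A diagnostic bundle of the kind described above could be assembled roughly as follows. This is a sketch under stated assumptions: the field names, the three histogram buckets, and the function names are hypothetical, and resource usage is left out to keep the example portable. Only operational data goes into the bundle, never customer records.

```python
import json
import time
from collections import deque

# Ring buffer: once 1000 lines are stored, the oldest line is dropped.
LOG_BUFFER = deque(maxlen=1000)


def record_log(line: str) -> None:
    LOG_BUFFER.append(line)


def latency_histogram(latencies_ms: list[float]) -> dict:
    """Bucket recent request latencies (bucket edges are illustrative)."""
    buckets = {"<100ms": 0, "100-500ms": 0, ">500ms": 0}
    for ms in latencies_ms:
        if ms < 100:
            buckets["<100ms"] += 1
        elif ms <= 500:
            buckets["100-500ms"] += 1
        else:
            buckets[">500ms"] += 1
    return buckets


def build_debug_bundle(pool_in_use: int, pool_size: int,
                       latencies_ms: list[float]) -> str:
    """Serialize recent logs, pool state, and latency buckets as JSON."""
    return json.dumps({
        "generated_at_unix": time.time(),
        "logs": list(LOG_BUFFER),
        "connection_pool": {"in_use": pool_in_use, "size": pool_size},
        "latency_histogram": latency_histogram(latencies_ms),
    })


for i in range(1500):
    record_log(f"line {i}")

bundle = json.loads(build_debug_bundle(50, 50, [20.0, 450.0, 12000.0]))
print(len(bundle["logs"]), bundle["connection_pool"], bundle["latency_histogram"])
```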
A customer system is experiencing issues. Use the diagnostic buttons to gather information and identify the root cause. Each diagnostic reveals a clue.
You are in a boardroom. On one side: your company's sales lead, who needs this deal to hit their quarterly number. On the other side: the customer's CTO, VP of Engineering, CISO, and procurement lead. The CTO wants to know if your technology actually works. The CISO wants to know if it's secure. Procurement wants to know the total cost of ownership. The sales lead wants you to say "yes" to everything. Your job is to be technically honest while helping close the deal.
This is the tightrope of technical sales support. The FDE is the only person in the room who is both deeply technical and customer-facing. You bridge the gap between what the product can do today, what it will be able to do in 6 months, and what the customer needs. Overpromise and you'll spend the next year building features that should have been in the product. Underpromise and you lose the deal to a competitor who lies better.
A POC is the most powerful sales tool an FDE has. It converts "we believe this could work" into "we've proven it works with your data." Here is the structure:
```python
# POC Plan Document (what you send the customer after the first meeting)
poc_plan = {
    "objective": "Demonstrate 20% improvement in fraud detection vs. current rules engine",
    "duration": "3 weeks",
    "data_required": [
        "6 months historical transactions (anonymized OK)",
        "Labeled fraud/non-fraud outcomes",
        "Current rules engine's predictions for comparison",
    ],
    "success_criteria": {
        "primary": "AUC-ROC > 0.85 (their current: 0.72)",
        "secondary": "Latency < 200ms at p99",
        "tertiary": "False positive rate < 5% (their current: 18%)",
    },
    "deliverables": [
        "Working API endpoint processing their data",
        "A/B comparison report: our model vs. their rules engine",
        "ROI calculation: $ saved per year",
    ],
    "what_we_need": [
        "VPN access to staging environment",
        "Weekly 30-min sync with their data team",
        "Named technical contact for data questions",
    ],
}
```
VPs don't care about AUC-ROC. They care about dollars. An FDE must translate technical metrics into business impact. Here is the formula that closes enterprise deals:
Worked example for FinanceCorps fraud detection:
| Item | Value | Source |
|---|---|---|
| Annual fraud losses | $14.2M | Customer's finance team |
| Current detection rate | 62% | Their rules engine metrics |
| Our detection rate (POC) | 84% | POC results on their data |
| Improvement | 22 percentage points | 84% - 62% |
| Additional fraud caught | $3.1M/year | $14.2M × 0.22 |
| Our annual cost | $480K | License + compute + support |
| Net ROI | $2.6M/year (5.4x return) | $3.1M - $480K |
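The arithmetic in the table can be captured in a few lines so the numbers update live when the customer corrects an input. This is a sketch; the `fraud_roi` function and its parameter names are hypothetical, and like the table it assumes recovered dollars scale linearly with the detection rate.

```python
def fraud_roi(annual_losses: float, current_rate: float,
              new_rate: float, annual_cost: float) -> dict:
    """Translate a detection-rate improvement into annual dollars."""
    additional_caught = annual_losses * (new_rate - current_rate)
    net = additional_caught - annual_cost
    return {
        "additional_caught": additional_caught,
        "net_roi": net,
        "return_multiple": net / annual_cost,
    }


# Inputs from the FinanceCorps table above.
roi = fraud_roi(annual_losses=14_200_000, current_rate=0.62,
                new_rate=0.84, annual_cost=480_000)
print(f"Additional fraud caught: ${roi['additional_caught']:,.0f}/year")
print(f"Net ROI: ${roi['net_roi']:,.0f}/year")
print(f"Return multiple: {roi['return_multiple']:.1f}x")
```

The table rounds intermediate values ($3.1M, then $2.6M), so its 5.4x differs slightly from the roughly 5.5x this function produces from unrounded inputs. Decide which convention you use before the customer's finance team asks.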
AI company ROI math is different. You're replacing human labor, not catching fraud:
| Item | Value | Source |
|---|---|---|
| Support tickets / month | 45,000 | Customer's Zendesk |
| Current cost per ticket (human) | $8.50 | Their ops team |
| AI resolution rate (POC) | 68% | Your eval on their 500 sample tickets |
| Tickets resolved by AI | 30,600/mo | 45K × 0.68 |
| Monthly savings (labor) | $260,100 | 30,600 × $8.50 |
| Monthly AI cost (API + infra) | $4,200 | API queries alone: 30,600 × $0.003/query × 1.5 (retrieval, retries, eval) ≈ $138; the remainder is infra |
| Net monthly savings | $255,900 (62x return) | $260,100 − $4,200 |
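The labor-replacement version of the math, with one extra number worth having in the room: the resolution rate at which the deal breaks even. This is a sketch; the function and key names are hypothetical, and the monthly AI cost is taken as a given input rather than derived from per-query pricing.

```python
def support_automation_roi(tickets_per_month: int, cost_per_ticket: float,
                           ai_resolution_rate: float,
                           monthly_ai_cost: float) -> dict:
    """Monthly savings from resolving a share of tickets without a human."""
    resolved = round(tickets_per_month * ai_resolution_rate)
    labor_savings = resolved * cost_per_ticket
    total_human_cost = tickets_per_month * cost_per_ticket
    return {
        "tickets_resolved": resolved,
        "labor_savings": labor_savings,
        "net_savings": labor_savings - monthly_ai_cost,
        "savings_to_cost": labor_savings / monthly_ai_cost,
        # Resolution rate at which labor savings equal the AI cost.
        "break_even_rate": monthly_ai_cost / total_human_cost,
    }


# Inputs from the support-ticket table above.
roi = support_automation_roi(tickets_per_month=45_000, cost_per_ticket=8.50,
                             ai_resolution_rate=0.68, monthly_ai_cost=4_200)
print(f"Net monthly savings: ${roi['net_savings']:,.0f}")
print(f"Savings / cost: {roi['savings_to_cost']:.0f}x")
print(f"Break-even resolution rate: {roi['break_even_rate']:.1%}")
```

The break-even rate matters because the 68% figure comes from a 500-ticket sample. Even if production resolution turns out far lower, the customer can see how much margin the estimate has.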
Every technical sales meeting has objections. Here are the five most common and how a staff FDE handles them:
| Objection | Bad response | Staff FDE response |
|---|---|---|
| "Our current system is good enough" | "Ours is better" | "Let's measure. Can I get 30 days of your predictions vs. actual outcomes? I'll run a head-to-head comparison with specific dollar amounts." |
| "We tried ML/AI before and it didn't work" | "Our AI is different" | "What specifically failed? Was it accuracy, hallucination, latency, or integration? Each has a different fix. Let me show you the eval results on your data." |
| "It's too expensive" | "We can discount" | "At your volume, the all-in cost is about $4,200/month. The tickets the AI resolved in the POC cost your human team about $260K/month to handle. That's roughly a 62x return." |
| "What about hallucinations?" | "Our model doesn't hallucinate" | "Every model can hallucinate. Here's our three-layer defense: (1) RAG grounds it in your docs, (2) citations let users verify, (3) confidence scoring routes low-confidence answers to humans. Here's the eval showing 2.1% hallucination rate on your data, down from 12% before RAG tuning." |
| "What about data privacy?" | "We're SOC2 compliant" | "Your data never leaves your VPC. We offer on-prem deployment, or API calls with zero data retention. Here's our DPA. Here's the architecture showing data flow. Happy to do a security review with your CISO." |
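The third layer of the hallucination defense in the table (route low-confidence answers to humans) is simple to show on a whiteboard. A minimal sketch follows, assuming a hypothetical `Answer` type, a confidence score between 0 and 1 from some upstream scorer, and an illustrative 0.8 threshold that would be tuned on the customer's eval set.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0, from a hypothetical upstream scorer
    citations: list[str] = field(default_factory=list)


def route(answer: Answer, threshold: float = 0.8) -> str:
    """Return "auto_reply" or "human_review" for a generated answer."""
    if not answer.citations:
        # No supporting document: never send automatically.
        return "human_review"
    if answer.confidence < threshold:
        return "human_review"
    return "auto_reply"


grounded = Answer("Refunds post within 5 business days.", 0.93, ["kb/refunds.md"])
unsure = Answer("Your plan may include phone support.", 0.41, ["kb/plans.md"])
uncited = Answer("Yes, that is covered.", 0.97)

for a in (grounded, unsure, uncited):
    print(route(a), "|", a.text)
```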
Failure mode: The Overpromise. The sales lead asks: "Can we do real-time predictions in 3 weeks?" You know it's 8 weeks minimum because the customer's data pipeline is batch-only. But saying "no" in front of the customer feels like killing the deal. So you say "we'll try." Three weeks later, you're 5 weeks from done, the customer is frustrated, and the sales lead blames you. The fix: never say "we'll try" to a timeline question. Say: "Real-time predictions require streaming infrastructure. The customer's current pipeline is batch. Here's my proposal: we deliver batch predictions in 3 weeks, then a streaming upgrade in an additional 5 weeks. The batch version alone delivers 70% of the value."
Some companies (Snowflake, Databricks) are investing in self-serve experiences that reduce the need for FDEs in the sales cycle. Free trials, interactive playgrounds, pre-built integrations. But for enterprise deals above $500K, the FDE is still essential because the complexity of integration exceeds what self-serve can handle. The frontier is using FDE insights to improve the self-serve experience: every integration pain point an FDE encounters gets filed as a product improvement request.
Adjust the customer's metrics to calculate ROI. Watch the deal viability change in real-time. Green = strong deal, red = walk away.
A VP of Product leans forward in her chair. Your demo is running on the projector. Real customer data is flowing through your system. Risk scores are appearing in real-time. Then a transaction comes through that your model flags as 99.7% fraud probability. The VP turns to her team: "That's the Acme Corp chargeback from last month. We lost $47K on that one." Silence. Then: "When can we start?"
That moment is the product of demo engineering — the disciplined practice of building demos that don't just show features, but tell a story. A story where the customer sees their own pain reflected in your solution, and the resolution feels inevitable.
A great demo has three acts, like any good story:
| Act | Purpose | Duration | What you show |
|---|---|---|---|
| 1: The Problem | Make them feel the pain | 3 min | Their current workflow, the manual steps, the errors, the cost |
| 2: The Solution | Show the magic moment | 5 min | Your system processing their actual data with real results |
| 3: The Future | Plant the vision | 2 min | What becomes possible once the system is in production (new capabilities, savings, insights) |
```python
# demo/setup.py: Pre-demo checklist (every FDE runs this)
import time

import requests


def pre_demo_check(demo_url: str) -> dict:
    checks = {}

    # 1. Is the service healthy?
    r = requests.get(f"{demo_url}/health", timeout=5)
    checks["service_healthy"] = r.status_code == 200

    # 2. Is the data loaded?
    r = requests.get(f"{demo_url}/stats", timeout=5)
    checks["data_loaded"] = r.json()["record_count"] > 0

    # 3. Is latency acceptable?
    start = time.monotonic()
    r = requests.post(f"{demo_url}/predict", json={"test": True}, timeout=5)
    latency = (time.monotonic() - start) * 1000  # milliseconds
    checks["latency_ok"] = latency < 500

    # 4. Is the demo data set with "wow" examples?
    # Pre-load 3 transactions that your model catches but their system misses.
    # verify_wow_examples is a project-specific helper (not shown here).
    checks["wow_examples_ready"] = verify_wow_examples(demo_url)

    return checks

# Run this 30 minutes before every demo. EVERY. TIME.
# The one time you skip it is the time the service is down.
```
Demos fail. The WiFi drops. The service crashes. The data doesn't load. A staff FDE has a plan for each scenario:
| Failure | Recovery | What to say |
|---|---|---|
| Service is down | Switch to pre-recorded video backup | "Let me show you the recording from our rehearsal. Same data, same results. I'll do a live walkthrough when we're back up." |
| Latency is high | Narrate while waiting | "This is running against your full dataset. In production, we cache the feature store so this would be 50ms, not 3 seconds." |
| Wrong result | Explain the why | "Interesting — this transaction has unusual features. Let me show you why the model scored it this way. This is actually a great example of explainability." |
| Total crash | Whiteboard fallback | "Let me draw the architecture and walk you through the data flow. I'll send a working demo link within 2 hours." |
AI demos have a unique advantage and a unique risk. The advantage: the output is visible — the customer can read the answer and judge it immediately. The risk: one hallucination in front of the CTO and trust evaporates.
The best demos aren't just functional — they tell a story. Here's the technique: identify 3-5 transactions from the customer's data where your system provides dramatically different results than their current approach. These are your "wow moments." Structure the demo so each wow moment builds on the previous one:
Wow 1: A straightforward fraud case your model catches at 99% confidence. The customer's system also caught this one. "We agree with your current system here. Good baseline."
Wow 2: A subtle fraud case your model catches at 87% confidence. The customer's system missed it. "This is the $47K Acme chargeback from last month. Our model flags it here, here, and here."
Wow 3: A legitimate transaction your model correctly passes at 3% risk. The customer's system flagged it as fraud (false positive). "Your team spent 20 minutes investigating this. Our model knew it was legitimate because of the spending pattern analysis."
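Finding these three moments by hand in six months of transactions is slow; a short script over the POC results can shortlist them. This is a sketch, assuming each record carries hypothetical fields `is_fraud` (the labeled outcome), `model_score` (your model's risk score), and `current_flagged` (whether their existing system flagged it), with an illustrative 0.5 decision threshold.

```python
def pick_wow_moments(transactions: list[dict], threshold: float = 0.5) -> dict:
    """Pick the strongest example for each of the three demo beats."""
    buckets = {
        "baseline_agreement": [],      # Wow 1: both systems catch the fraud
        "missed_by_current": [],       # Wow 2: only the model catches it
        "false_positive_avoided": [],  # Wow 3: model passes a legit txn they flagged
    }
    for t in transactions:
        model_flags = t["model_score"] >= threshold
        if t["is_fraud"] and model_flags and t["current_flagged"]:
            buckets["baseline_agreement"].append(t)
        elif t["is_fraud"] and model_flags and not t["current_flagged"]:
            buckets["missed_by_current"].append(t)
        elif not t["is_fraud"] and not model_flags and t["current_flagged"]:
            buckets["false_positive_avoided"].append(t)

    picks = {}
    for name, items in buckets.items():
        if not items:
            continue
        if name == "false_positive_avoided":
            # Most convincing pass is the lowest risk score.
            picks[name] = min(items, key=lambda t: t["model_score"])
        else:
            # Most convincing catch is the highest risk score.
            picks[name] = max(items, key=lambda t: t["model_score"])
    return picks


# Hypothetical POC output.
poc_results = [
    {"id": "t1", "is_fraud": True, "model_score": 0.99, "current_flagged": True},
    {"id": "t2", "is_fraud": True, "model_score": 0.87, "current_flagged": False},
    {"id": "t3", "is_fraud": False, "model_score": 0.03, "current_flagged": True},
    {"id": "t4", "is_fraud": False, "model_score": 0.10, "current_flagged": True},
    {"id": "t5", "is_fraud": True, "model_score": 0.60, "current_flagged": False},
]
for beat, txn in pick_wow_moments(poc_results).items():
    print(beat, txn["id"], txn["model_score"])
```

Treat the output as a shortlist, not the script for the demo. You still need to check each pick by hand so you can explain why the model scored it that way.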
Failure mode: The Feature Request Demo. Mid-demo, the CTO asks: "Can it also detect account takeover?" Your model doesn't do this. The sales lead looks at you expectantly. You say "yes" because the room's energy is high. Now you've committed to a feature that's 3 months of work. The fix: have a pre-agreed list of "what we show" and "what we don't show" with your sales lead before the meeting. When asked about an unplanned feature, say: "Great question. That's on our roadmap. Today's demo focuses on transaction fraud. I'd love to discuss account takeover in a follow-up meeting where I can show you our approach specifically."
Replit, Codespaces, and similar platforms are enabling "try before you buy" experiences where the customer can run your product against their data in an isolated cloud environment without any installation. The FDE's role evolves from "run the demo for them" to "configure the demo environment so they can explore on their own." Companies like Navattic and Walnut are building the infrastructure specifically for interactive product demos.
Run a live demo. Click Next Act to progress through the three-act structure. Click Break Something to simulate a failure mid-demo and practice recovery.
The customer's Slack channel lights up: "ALL FRAUD DETECTION DOWN. ZERO PREDICTIONS RETURNING. EVERY TRANSACTION PASSING THROUGH UNSCORED." This is a severity 1 incident. Every second your system is down, fraudulent transactions are flowing through unchecked. The customer's fraud losses are accruing at approximately $4,700 per hour (based on their historical rate). You are the FDE. You own this until it's resolved.
On-site incident response is where the FDE role is most different from a standard software engineer. You are not debugging in the comfort of your IDE with full access. You are in a war room with the customer's engineers, their management is watching, and every minute someone asks "when will this be fixed?" Your ability to stay calm, systematic, and communicative under this pressure is the single most important FDE skill.
The hardest part of incident response is not the debugging — it's the communication. Here's the format for incident updates:
```markdown
# Incident Update Template (send to customer every 15 min)
**Status:** Investigating / Mitigating / Resolved
**Impact:** Fraud scoring unavailable for all transactions
**Duration:** 23 minutes
**Current action:** Rolling back to previous version (v2.0.3)
**Next update:** 15 minutes or when status changes

# What NOT to write:
# "We think it might be a database issue but we're not sure"
# "Bob is looking into it"
# "Should be fixed soon"
```
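Under pressure it helps to generate the update rather than type it, so the duration is always right and the status is always one of the three allowed values. This is a sketch; the function name, the fixed status list, and the example timestamps are assumptions for illustration.

```python
from datetime import datetime, timezone

VALID_STATUSES = ("Investigating", "Mitigating", "Resolved")


def render_incident_update(status: str, impact: str, started_at: datetime,
                           now: datetime, current_action: str,
                           next_update_min: int = 15) -> str:
    """Fill in the incident update template; reject free-form statuses."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}")
    minutes = int((now - started_at).total_seconds() // 60)
    return "\n".join([
        f"**Status:** {status}",
        f"**Impact:** {impact}",
        f"**Duration:** {minutes} minutes",
        f"**Current action:** {current_action}",
        f"**Next update:** {next_update_min} minutes or when status changes",
    ])


# Illustrative timestamps: incident began 14:23 UTC, update sent 14:46 UTC.
update = render_incident_update(
    status="Mitigating",
    impact="Fraud scoring unavailable for all transactions",
    started_at=datetime(2024, 5, 1, 14, 23, tzinfo=timezone.utc),
    now=datetime(2024, 5, 1, 14, 46, tzinfo=timezone.utc),
    current_action="Rolling back to previous version (v2.0.3)",
)
print(update)
```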
A good postmortem builds trust. A bad postmortem destroys it. Here is the structure:
```python
# Blameless postmortem structure
postmortem = {
    "incident_id": "INC-2024-0342",
    "severity": "SEV1",
    "duration": "47 minutes (14:23 - 15:10 UTC)",
    "impact": "Zero fraud predictions served. ~$3,700 in estimated unscored fraud.",
    "timeline": [
        "14:23 - Monitoring alert fires: prediction count drops to 0",
        "14:26 - FDE acknowledges, begins investigation",
        "14:31 - Identified: OOM kill on inference pods after model update",
        "14:35 - Mitigation: rollback to previous model version",
        "14:42 - Service restored. Predictions resuming.",
        "15:10 - All backlogged transactions scored. Incident closed.",
    ],
    "root_cause": "New model version (v3.1) requires 2.4GB RAM. Pod limit is 2GB. "
                  "OOM killer terminated all inference pods simultaneously.",
    "fix": "Increase pod memory limit to 3GB. Add pre-deployment memory profiling.",
    "prevention": [
        "Add memory consumption test to model CI/CD pipeline",
        "Implement canary deployment: new model serves 5% traffic first",
        "Add OOM prediction alert (warning at 80% memory utilization)",
    ],
}
```
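The first prevention item (a memory consumption test in the model CI/CD pipeline) can be as small as the gate below. This is a sketch under assumptions: the function name and the three-state result are hypothetical, sizes are in decimal megabytes, and the measured requirement would come from whatever profiling step the pipeline already runs.

```python
def check_memory_budget(required_mb: int, limit_mb: int, warn_pct: int = 80) -> str:
    """Gate a model release on its measured memory need vs. the pod limit.

    Returns "fail" if the model cannot fit, "warn" if it fits but sits at or
    above the warning utilization, and "ok" otherwise. Integer arithmetic
    avoids floating-point surprises exactly at the threshold.
    """
    if required_mb > limit_mb:
        return "fail"
    if required_mb * 100 >= limit_mb * warn_pct:
        return "warn"
    return "ok"


# Numbers from the postmortem: model v3.1 needs 2.4GB.
print(check_memory_budget(2400, 2000))  # original 2GB limit
print(check_memory_budget(2400, 3000))  # proposed 3GB limit
print(check_memory_budget(2400, 4000))  # a roomier limit, for comparison
```

Run against the postmortem's own numbers, the proposed 3GB limit puts the new model at exactly 80% utilization, which is the level the third prevention item would alert on. That is worth raising before the fix ships.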
Failure mode: The Cascading Escalation. The incident starts with your service. But during investigation, you discover the root cause is in the customer's infrastructure (their load balancer is misconfigured). Now you need to tell the customer that their own system caused the outage of your service. Do this wrong and you've destroyed the relationship. Do this right: "We've identified that the traffic pattern changed at 14:20 — a 10x spike that exceeded the load balancer's connection limit. Let's review the LB configuration together to ensure it can handle peak traffic. I'll also add rate limiting on our side so we degrade gracefully if this happens again."
PagerDuty's AIOps, Datadog's Watchdog, and custom LLM-based tools are beginning to automate the first 10 minutes of incident response: correlating alerts, suggesting root causes from historical incidents, and drafting initial communications. The FDE's role evolves from "diagnose from scratch" to "validate the AI's hypothesis and manage the human side." But the customer-facing communication will remain human for a long time — trust isn't delegated to chatbots.
A SEV1 incident is in progress. Click actions in the correct order. The timer shows elapsed time and customer trust level decreases the longer you take.
You are CC'd on an email thread with 14 people. The customer's VP of Engineering wants the deployment done in 2 weeks. Your company's PM says the feature isn't on the roadmap until Q3. The sales lead says the deal depends on it. Legal says the customer's MSA needs an amendment for on-prem deployment. The customer's data team says they can't provide the training data until after their quarterly freeze. And you're the one who has to make all of these people happy — or at least aligned.
Stakeholder management is the unglamorous backbone of FDE work. Technical skill gets you in the door. Stakeholder management determines whether the project succeeds. Most FDE projects fail not because of technical issues, but because of misaligned expectations between stakeholders who each see a different part of the elephant.
| Stakeholder | What they want | What they fear | How you align them |
|---|---|---|---|
| Customer CTO | Technical excellence, innovation | Vendor lock-in, security breaches | Architecture reviews, security docs, roadmap transparency |
| Customer VP Eng | On-time delivery, team enablement | Disruption to existing systems, scope creep | Weekly status updates, clear scope docs, migration plans |
| Your PM | Product-market fit, feature adoption | Custom work that doesn't generalize | Frame custom work as feature requests with N-customer potential |
| Your Sales Lead | Deal closure, expansion revenue | Delays that kill the deal, technical "no"s | Honest timelines, phased delivery, technical alternatives to "no" |
| Customer Legal | Contract compliance, risk minimization | Data breaches, liability | Security architecture docs, compliance certifications, SLA definitions |
| Your Engineering Lead | Clean architecture, no tech debt | FDE hacks becoming permanent features | Clearly marked FDE code with migration plan to product code |
Knowing when and how to escalate is a staff-level skill. Here's the framework:
```python
# Escalation decision tree
# Checked from most to least severe so that a technical issue with a large
# revenue impact is not handled as a routine fix.
def should_escalate(issue: dict) -> dict:
    # Level 3: Executive escalation
    if issue["type"] == "relationship_risk" or issue.get("revenue_impact", 0) > 100000:
        return {"action": "exec_escalation", "to": "VP_Engineering",
                "with": "1-page brief: impact, options, recommendation, timeline"}

    # Level 2: Escalate to leadership
    if issue["type"] in ("scope_change", "timeline_risk"):
        return {"action": "escalate", "to": "engineering_director",
                "with": "written proposal with 3 options and recommendation"}

    # Level 1: Escalate to your tech lead
    if issue["type"] == "technical" and issue["resolution_hours"] >= 4:
        return {"action": "escalate", "to": "tech_lead",
                "notify": ["customer_lead", "your_pm"]}

    # Level 0: Handle yourself
    if issue["type"] == "technical":
        return {"action": "resolve", "notify": ["customer_lead"]}

    # Anything else is unclassified: fail loudly instead of returning None
    raise ValueError(f"unknown issue type: {issue['type']!r}")

# The golden rule: never escalate without a recommendation.
# "We have a problem" is not an escalation.
# "We have a problem. Here are 3 options. I recommend option B because..." IS.
```
The worked example: the customer signed a contract for fraud detection. During implementation, they ask for: (1) account takeover detection, (2) a real-time dashboard, (3) integration with their Salesforce instance, (4) weekly model retraining. Each request is "small." Together, they've tripled the project scope.
The FDE response: "I love that you're thinking about these extensions. Let me organize them by priority and effort. Items 1 and 4 are significant engineering work that would extend the timeline by 6-8 weeks. Items 2 and 3 are moderate and could fit in a Phase 2 after the initial deployment. My recommendation: we deliver fraud detection on schedule, then scope Phase 2 based on the results. If fraud detection saves you $2.6M, the business case for Phase 2 writes itself."
Failure mode: The Invisible Stakeholder. You've aligned the CTO, VP Eng, and data team. The project is going well. Then in Week 6, the CISO — who wasn't in any of your meetings — discovers your service is running in their network and blocks it pending a security review. The project halts for 3 weeks. The fix: during discovery, always ask: "Who else needs to approve this deployment? Who hasn't been in these meetings but will have an opinion?" Map the full approval chain, including security, compliance, and change management.
Tools like Linear, Asana, and Notion AI are automating project tracking and status communication. The FDE's role evolves from manually writing status updates to reviewing AI-generated summaries, correcting nuances, and focusing on relationship management. But the human judgment of "this stakeholder is worried and needs a phone call, not an email" remains irreducibly human.
Each node represents a stakeholder. Green = aligned, yellow = concerned, red = blocking. Click stakeholders to address their concerns and watch alignment propagate.
This is the showcase. Everything you've learned in chapters 0-10 comes together in a single interactive simulation of the complete FDE lifecycle. You will take a customer from initial call through discovery, prototyping, deployment, and support — making the tradeoff decisions that define the role.
The simulation tracks five metrics across the engagement: Customer Trust (how much the customer believes in you), Technical Debt (shortcuts accumulating), Scope Completion (features delivered vs. promised), Timeline Adherence (on schedule?), and Product Feedback (insights sent back to your product team). A successful FDE engagement maximizes trust and completion while minimizing debt.
Navigate the full FDE lifecycle. Adjust tradeoffs, trigger events, and watch metrics evolve across all phases.
High scope + high timeline pressure + low debt tolerance is the impossible triangle. Something gives. In real FDE work, what usually gives is either the timeline (you miss the deadline) or the debt tolerance (you ship hacks that haunt you for months). The simulation makes this tradeoff visible.
Events model the chaos of real FDE work. A scope change mid-project. A SEV1 incident that eats 3 days. A key stakeholder leaving the customer's company. A compliance audit that freezes deployments. Each event tests a different skill from this lesson.
Customer Trust is the most important metric. It goes up when you communicate proactively, deliver on time, and handle incidents well. It goes down when you miss deadlines, overpromise, or go silent during problems. A project can have bugs and delays but still succeed if trust is high. A technically perfect project fails if the customer doesn't trust you.
You've learned the full FDE skillset: discovery, prototyping, architecture, integration, deployment, debugging, sales support, demos, incident response, and stakeholder management. Now let's arm you for the interview itself. This chapter is your cheat sheet — the questions they'll ask, the frameworks to use, and the resources to study.
| Type | Example | What they test | Framework to use |
|---|---|---|---|
| System Design | "Design a deployment pipeline for 50 enterprise customers" | Architecture, scalability, tradeoffs | Requirements → Architecture Canvas → Tradeoff matrix → Monitoring |
| Customer Scenario | "The customer says your model is wrong. Walk me through your response." | Communication, debugging, empathy | Listen → Gather data → Hypothesis → Test → Explain → Prevent |
| Debugging | "Latency spiked at a customer site. You have no direct access. Go." | Systematic debugging, remote diagnosis | Symptoms → Hypotheses → Targeted diagnostics → Root cause → Fix → Postmortem |
| Coding | "Write a health check script for a customer deployment" | Practical engineering, customer awareness | Write clean code that handles edge cases and logs actionable output |
| Behavioral | "Tell me about a time you disagreed with a customer's technical decision." | Communication, judgment, relationship management | STAR format: Situation → Task → Action → Result |
```python
# Question: "Design a system to manage 50 customer deployments"
# This is the most common FDE system design question.

# Step 1: Clarify requirements
requirements = {
    "customers": 50,
    "environments": "mix of cloud VPC, on-prem, 3 air-gapped",
    "update_frequency": "weekly model updates, monthly service updates",
    "monitoring": "centralized dashboard, per-customer health",
    "rollback": "must rollback any customer in < 5 minutes",
}

# Step 2: Architecture
# Control plane (your cloud) → manages deployments
# Data plane (customer sites) → runs inference
# Key insight: decouple control from data for security
architecture = {
    "control_plane": {
        "deployment_manager": "GitOps (ArgoCD) with per-customer overlays",
        "config_store": "Encrypted customer configs in Vault",
        "monitoring": "Prometheus federation from customer agents",
        "artifact_registry": "Signed Docker images with SBOM",
    },
    "data_plane": {
        "agent": "Lightweight agent that pulls updates from control plane",
        "runtime": "Customer's K8s or Docker, model + service containers",
        "telemetry": "Metrics only (no customer data) sent to control plane",
    },
}
```
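Interviewers often follow up on "per-customer overlays": how does one base config serve 50 customers without 50 forks? The idea can be shown as a recursive merge, where each customer file states only what differs. This is a simplified stand-in for what overlay tooling does, not its actual merge algorithm; the config keys, registry hosts, and `apply_overlay` name are all hypothetical.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return base with overlay values merged in, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared defaults for every customer (hypothetical values).
base = {
    "image": "registry.example.com/fraud-svc:2.0.3",
    "replicas": 3,
    "resources": {"memory": "3Gi", "cpu": "2"},
    "telemetry": {"enabled": True, "mode": "push"},
}

# An air-gapped customer overrides only what differs.
air_gapped = {
    "image": "registry.internal.example/fraud-svc:2.0.3",
    "telemetry": {"enabled": False},
}

config = apply_overlay(base, air_gapped)
print(config["image"], config["telemetry"], config["resources"])
```

Because the overlay holds only the differences, a weekly model update is a one-line change to the base, and a five-minute rollback is a revert of that line.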
Here are the five debugging scenarios that appear in every FDE interview. Practice talking through each one out loud:
| Scenario | Likely root cause | First diagnostic |
|---|---|---|
| Latency spike (p99 up, p50 normal) | Connection pool exhaustion or GC pauses | Check DB connection count and pod memory/GC logs |
| Accuracy degradation (slow over weeks) | Data drift — input distribution changed | Compare feature distributions: training data vs. last 7 days |
| Complete outage (zero responses) | OOM kill, certificate expiry, or DNS failure | Check pod status (kubectl get pods) and recent events |
| Intermittent errors (5% failure rate) | One replica is unhealthy, or one data source is flaky | Check per-pod error rates to isolate the bad replica |
| Results are wrong but service is healthy | Wrong model version deployed, or feature pipeline change | Check model version hash and feature pipeline output samples |
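For the accuracy-degradation scenario, "compare feature distributions" needs a concrete number you can read out in the interview. One common choice is the Population Stability Index (PSI). The sketch below assumes equal-width bins over the training range and a small floor for empty bins; the function name and the bin count are illustrative, and the 0.2 cutoff mentioned afterwards is a commonly quoted rule of thumb rather than a guarantee.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor at a small epsilon so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic feature values: training data vs. a shifted "last 7 days" sample.
training = [i / 100 for i in range(1000)]
last_7_days = [v + 3.0 for v in training]

print(f"PSI, same distribution:    {psi(training, training):.3f}")
print(f"PSI, shifted distribution: {psi(training, last_7_days):.3f}")
```

A PSI above roughly 0.2 is often treated as a significant shift worth investigating. Compute it per feature, because drift in a single input can be invisible in aggregate accuracy for weeks.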
| Resource | Why |
|---|---|
| Designing Data-Intensive Applications (Kleppmann) | The bible of distributed systems. Every FDE must read chapters 1-9. |
| The Phoenix Project (Kim et al.) | Understand how IT operations and delivery work in enterprise. Helps you empathize with customers. |
| Staff Engineer (Larson) | What staff-level means: technical direction, cross-team alignment, organizational leverage. |
| Google SRE Book (free online) | Incident response, monitoring, SLOs. Essential for the on-call dimension of FDE work. |
| Forward Deployed (Palantir blog) | First-hand accounts from FDEs at the company that popularized the role. |
Track your readiness across the five interview dimensions. Click each dimension to test yourself with a random question. Your confidence score updates based on practice.
The FDE role is evolving. As AI tools automate more of the integration and deployment work, the human dimension — customer empathy, stakeholder management, creative problem-solving — becomes more valuable, not less. The best FDEs in 2027 will be the ones who use AI to handle the routine while they focus on the relationship and the strategy.
Go build things that matter for people who need them. That's the job.
"What I cannot create, I do not understand." — Richard Feynman