Guide
The complete guide to writing AI step prompts that actually work
The AI step is the most powerful component in Flowpath — and the most underused. A well-crafted prompt is the difference between an agent that works reliably and one you have to babysit. Here's everything we've learned from thousands of AI step configurations.
10 min read

There's a common pattern we see with new Flowpath users. They add an AI step, type something like "analyze this lead and tell me if it's a good fit," watch it work perfectly on the first three test runs, and then deploy it to production. Two weeks later they notice the outputs have been inconsistent — sometimes a number, sometimes a sentence, sometimes a JSON object — and downstream steps have been failing silently because the data format they expected wasn't what the AI returned.
This guide is about preventing that. It's built on patterns we've observed across thousands of AI step configurations on Flowpath, and it covers everything from basic output formatting to advanced prompt engineering techniques for complex classification tasks.
Principle 1 — Specify output format with surgical precision
This is the single most important thing you can do to make an AI step reliable. GPT-4o is a language model — its natural output is prose. If you want structured data, you have to ask for it explicitly and describe the format in detail.
Bad: "Score this lead from 0 to 100."
Good: "Return only a single integer between 0 and 100. No explanation, no additional text, no punctuation. Only the integer."
Bad: "Extract the key information from this invoice."
Good: "Extract the following fields from the invoice and return them as a JSON object with exactly these keys: invoice_number (string), invoice_date (ISO 8601 date string), total_amount (number, no currency symbol), vendor_name (string), line_items (array of objects with keys: description, quantity, unit_price, total). If any field cannot be found, return null for that key."
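The "Good" version in each pair leaves far less room for interpretation. The model knows exactly what to return and in what format, so your downstream steps receive consistent, parseable data far more often. No prompt guarantees the format on every single run, though, so it is still worth validating the output before a downstream step relies on it.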
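A downstream step can guard against format drift by validating the reply before using it. The following is a minimal sketch, not part of Flowpath itself: the function names and the INVOICE_KEYS table are hypothetical, and the checks simply mirror the two "Good" prompts above.

```python
import json

# Hypothetical schema mirroring the invoice prompt above: key -> allowed types.
INVOICE_KEYS = {
    "invoice_number": str,
    "invoice_date": str,
    "total_amount": (int, float),
    "vendor_name": str,
    "line_items": list,
}


def parse_score(raw: str) -> int:
    """Accept only a bare integer from 0 to 100; raise ValueError otherwise."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"expected a bare integer, got {raw!r}")
    score = int(text)
    if score > 100:
        raise ValueError(f"score out of range: {score}")
    return score


def parse_invoice(raw: str) -> dict:
    """Parse the reply and check it has exactly the expected keys and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on prose
    if not isinstance(data, dict) or set(data) != set(INVOICE_KEYS):
        raise ValueError("reply does not have exactly the expected keys")
    for key, expected in INVOICE_KEYS.items():
        value = data[key]
        if value is None:
            continue  # null is allowed for any field the model could not find
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{key} has unexpected type {type(value).__name__}")
    return data
```

A reply such as "Score: 87" or a JSON object wrapped in a sentence raises ValueError here, which turns a silent downstream failure into a visible one.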
Principle 2 — Be context-rich
The quality of an AI step's output depends heavily on the quality of its input. Use every piece of enriched data available to you; don't just pass a company name and hope the model infers the rest.
A lead scoring prompt that passes only {{company_name}} tends to produce generic, unreliable scores. A prompt that passes {{company_name}}, {{industry}}, {{employee_count}}, {{funding_stage}}, {{job_title}}, {{tech_stack}} and {{annual_revenue_estimate}} gives the model enough signal to produce scores that are meaningfully differentiated and more consistent across runs.
The same principle applies to classification tasks. If you're classifying a support ticket, pass the full ticket body, the customer's account tier, their tenure in months, and their historical ticket count. The model will use all of it — and the output will be significantly better for it.
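To preview what a context-rich prompt looks like once the variables are filled in, a small local renderer can help. This is only a sketch: LEAD_TEMPLATE and render_prompt are hypothetical, and the renderer is a stand-in for illustration, not a description of how Flowpath substitutes {{...}} variables internally.

```python
import re

# Matches {{field_name}} placeholders, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Hypothetical lead scoring template using the seven fields named above.
LEAD_TEMPLATE = """Score this lead for fit.
Company: {{company_name}}
Industry: {{industry}}
Employees: {{employee_count}}
Funding stage: {{funding_stage}}
Contact title: {{job_title}}
Tech stack: {{tech_stack}}
Estimated annual revenue: {{annual_revenue_estimate}}
Return only a single integer between 0 and 100."""


def render_prompt(template: str, fields: dict) -> str:
    """Fill each placeholder from `fields`.

    A placeholder with no matching key raises KeyError (likely a typo in the
    template); a key whose value is None is rendered as the literal "null".
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in fields:
            raise KeyError(f"template references unknown field: {name}")
        value = fields[name]
        return "null" if value is None else str(value)

    return PLACEHOLDER.sub(substitute, template)
```

Labeling each value ("Industry:", "Employees:") rather than passing a bare list keeps the context readable for both the model and whoever reviews the run logs later.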
Principle 3 — Always handle missing data explicitly
Clearbit enrichment fails on roughly 8% of contacts. API calls time out. Form fields get left blank. Your AI step will eventually receive incomplete data, and if your prompt doesn't tell the model how to handle it, the behavior will be unpredictable.
Always add a fallback instruction to any prompt that depends on enrichment data: "If any of the following fields are null or empty, treat them as unknown and adjust your output accordingly. Do not fail or return an error — make your best assessment with the available information and indicate which fields were missing by appending a 'missing_fields' array to your JSON output."
This won't completely compensate for missing data, but it will produce a usable output rather than a failed step — and the missing_fields array lets you identify data quality problems over time.
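Tracking data quality over time can be as simple as tallying the missing_fields arrays from past runs. The sketch below assumes you can export raw step outputs as a list of strings; the function name and that export format are assumptions, not a Flowpath API.

```python
import json
from collections import Counter


def tally_missing_fields(raw_outputs: list) -> Counter:
    """Count how often each field name appears in 'missing_fields' across runs."""
    counts = Counter()
    for raw in raw_outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip outputs that are not valid JSON
        if not isinstance(data, dict):
            continue
        missing = data.get("missing_fields") or []
        if isinstance(missing, list):
            counts.update(name for name in missing if isinstance(name, str))
    return counts
```

Calling most_common() on the result lists the fields that are absent most often, which points at the enrichment source or form field to fix first.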
Principle 4 — Use chain-of-thought for complex decisions
For simple classification (qualify/disqualify, positive/negative, category A/B/C), direct prompting works well. For complex multi-factor decisions — lead scoring with six variables, content moderation with nuanced edge cases, anomaly detection in financial data — chain-of-thought prompting produces significantly better results.
Chain-of-thought means asking the model to reason through the problem before producing the final output. Structure it like this: "First, evaluate each of the following criteria separately and assign a sub-score for each: [criteria list]. Then sum the sub-scores to produce a total. Return a JSON object with the individual criteria scores and the total."
This approach has two benefits: better accuracy (the model is less likely to make lazy generalizations when forced to reason step by step) and better debuggability (you can see exactly how the model scored each criterion when reviewing run logs).
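The structure above also makes the output checkable: if the model returns sub-scores and a total, a downstream step can confirm that they add up. Here is a minimal sketch under stated assumptions. The key names "criteria_scores" and "total" are hypothetical choices for the JSON object, and both functions are illustrative helpers rather than Flowpath features.

```python
import json


def build_scoring_prompt(criteria: list) -> str:
    """Fill the criteria list into the chain-of-thought instruction."""
    criteria_list = ", ".join(criteria)
    return (
        "First, evaluate each of the following criteria separately and assign "
        f"a sub-score for each: {criteria_list}. Then sum the sub-scores to "
        "produce a total. Return a JSON object with two keys: "
        '"criteria_scores" (an object mapping each criterion to its integer '
        'sub-score) and "total" (the integer sum).'
    )


def check_scored_output(raw: str, criteria: list) -> dict:
    """Verify every criterion was scored and the total equals the sum."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    scores = data.get("criteria_scores")
    if not isinstance(scores, dict) or set(scores) != set(criteria):
        raise ValueError("sub-scores do not match the criteria list")
    for name, value in scores.items():
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"sub-score for {name} is not an integer")
    if data.get("total") != sum(scores.values()):
        raise ValueError("total does not equal the sum of sub-scores")
    return data
```

A mismatch between the total and the sub-scores is a useful signal in its own right: it usually means the model skipped the per-criterion reasoning the prompt asked for.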
Principle 5 — Set temperature intentionally
Temperature controls how deterministic or creative the model's outputs are. For scoring, classification and data extraction tasks, where consistency matters more than creativity, set temperature to 0. The same input will then produce the same output on nearly every run. Small variations are still possible even at 0, so keep your output format instructions explicit.
For tasks where some variation is acceptable or desirable — drafting outreach emails, generating content summaries, producing personalized messages — a temperature of 0.3 to 0.7 gives the model creative latitude while keeping outputs grounded and relevant.
In Flowpath, temperature is set in the AI step configuration panel. It defaults to 0.2 — a sensible middle ground, but worth adjusting deliberately for your specific use case.
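To see how consistent a step actually is at its configured temperature, one option is to replay the same input several times and measure how often the outputs agree. The helper below is a hypothetical sketch that works on a list of outputs you have already collected; it does not call Flowpath or any model.

```python
from collections import Counter


def consistency_rate(outputs: list) -> float:
    """Share of runs that returned the most common output (1.0 = all agree)."""
    if not outputs:
        raise ValueError("need at least one output")
    # Ignore leading/trailing whitespace so "72" and "72\n" count as the same.
    normalized = [output.strip() for output in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)
```

For example, four replays that return "72", "72", "72" and "70" give a rate of 0.75. For a scoring or extraction step, a rate well below 1.0 suggests lowering the temperature or tightening the output format instructions.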




