Engineering
How we built Flowpath's AI suggestion engine, and what we learned
The first version was a simple GPT-4o call with raw workflow JSON, and its suggestions were inconsistent. Three major rebuilds later, users accept 62% of suggestions, up from 31% at launch. Here's everything we learned.

When we first shipped AI step suggestions in March 2025, I was genuinely excited about it. The demo worked beautifully — you'd build a two-step workflow, the suggestion engine would see that you had a Typeform trigger and a Clearbit enrichment step, and it would correctly suggest an AI scoring step as the next logical action. We shipped it, our customers loved it, and our internal metrics showed a 31% suggestion acceptance rate.
Then we looked more carefully at the 69% rejection rate and realized we had a problem.
The first version: what went wrong
Version one sent the raw workflow JSON to GPT-4o with a prompt that said something like: "Here is a workflow configuration. Suggest the next best step. Respond with a JSON object containing step_type, integration_name and configuration_hints."
The workflow JSON for even a simple three-step workflow is verbose — it includes step IDs, position coordinates on the canvas, authentication token references, field mapping configurations and metadata that's entirely irrelevant to the question of what step should come next. We were burning tokens on noise and depriving the model of signal.
The suggestions that came back were often directionally correct but too generic. The model would suggest "add a notification step" when what the customer needed was "add a Slack message to #sales-alerts with these specific fields." It knew the category of step but not the specific configuration.
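To make the version-one response contract concrete, here is a minimal sketch of parsing and validating the JSON object the prompt asked for. Only the three key names (step_type, integration_name, configuration_hints) come from the prompt quoted above; the Suggestion class, the parse_suggestion helper and the shape of configuration_hints are assumptions for illustration, not Flowpath's actual code.

```python
import json
from dataclasses import dataclass
from typing import Any

# The three keys named in the version-one prompt.
REQUIRED_KEYS = ("step_type", "integration_name", "configuration_hints")


@dataclass
class Suggestion:
    step_type: str
    integration_name: str
    configuration_hints: Any  # shape is not specified in the post (assumption)


def parse_suggestion(raw: str) -> Suggestion:
    """Parse a model response and reject anything that breaks the contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return Suggestion(
        step_type=data["step_type"],
        integration_name=data["integration_name"],
        configuration_hints=data["configuration_hints"],
    )
```

A check like this catches malformed responses, but it cannot catch the problem described above: a response can satisfy the contract and still be too generic to be useful.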
Rebuild one: semantic preprocessing
The first major rebuild introduced a preprocessing layer between the raw workflow state and the API call. Instead of sending the workflow JSON, we now convert the workflow into a structured natural language description before it goes to the model.
A three-step workflow that previously sent 2,400 tokens of raw JSON now sends 180 tokens of semantic description: "The user is building a lead qualification workflow. Step 1: Trigger — new form submission received from Typeform. Step 2: Action — contact enriched with company data using Clearbit. The user is now adding step 3."
This change alone reduced our overall prompt token count by 40% and made suggestions measurably more relevant. The model now understood the semantic purpose of each step rather than its technical configuration.
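The preprocessing idea can be sketched in a few lines. This is an illustration under stated assumptions, not the production layer: the field names (purpose, kind, summary, and the noise fields id, position, auth_ref) are hypothetical, since the post does not show the real workflow schema.

```python
from typing import Any


def describe_workflow(workflow: dict[str, Any]) -> str:
    """Turn a verbose workflow dict into a short natural-language description.

    Only the fields that say what each step does are read. Everything else
    (IDs, canvas positions, auth references) is simply never touched, so it
    never reaches the prompt.
    """
    steps = workflow["steps"]
    lines = [f"The user is building a {workflow['purpose']} workflow."]
    for number, step in enumerate(steps, start=1):
        kind = step["kind"].capitalize()
        lines.append(f"Step {number} ({kind}): {step['summary']}.")
    lines.append(f"The user is now adding step {len(steps) + 1}.")
    return " ".join(lines)


# Hypothetical input: the schema below is an assumption for illustration.
example_workflow = {
    "purpose": "lead qualification",
    "steps": [
        {
            "id": "a1",
            "position": {"x": 120, "y": 80},
            "auth_ref": "tok_1",
            "kind": "trigger",
            "summary": "new form submission received from Typeform",
        },
        {
            "id": "b2",
            "position": {"x": 120, "y": 240},
            "auth_ref": "tok_2",
            "kind": "action",
            "summary": "contact enriched with company data using Clearbit",
        },
    ],
}

description = describe_workflow(example_workflow)
```

The design choice here is an allow-list rather than a block-list: new noisy fields added to the workflow schema later stay out of the prompt by default.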
Rebuild two: integration-aware context
The second rebuild addressed the specificity problem. The model was suggesting "add a notification step" because it didn't know which notification tools the user had connected.
We added a connected integrations context to every suggestion request — a compact list of the user's authenticated integrations and the primary use case for each. "Connected tools: Slack (messaging), HubSpot (CRM), Gmail (email), Stripe (payments)." With this context, the model could suggest "send a Slack message" rather than "add a notification step" — and in our testing, it started suggesting the correct Slack channel name 43% of the time based on the workflow context alone.
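Building that context line is simple string assembly. The sketch below assumes the connected integrations arrive as (name, primary use case) pairs; the function name and the input shape are hypothetical.

```python
def integrations_context(integrations: list[tuple[str, str]]) -> str:
    """Render the user's connected tools as one compact prompt line."""
    if not integrations:
        return "Connected tools: none."
    parts = ", ".join(f"{name} ({use_case})" for name, use_case in integrations)
    return f"Connected tools: {parts}."


context_line = integrations_context(
    [
        ("Slack", "messaging"),
        ("HubSpot", "CRM"),
        ("Gmail", "email"),
        ("Stripe", "payments"),
    ]
)
```

Stating the empty case explicitly ("none") is a deliberate choice in this sketch, so the model is told the user has no connected tools rather than left to guess.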
We also added plan-aware filtering at this stage. The model would sometimes suggest features — parallel branching, custom API steps — that weren't available on the user's current plan. We added a post-processing filter that checks each suggestion against the user's plan features and suppresses unavailable options before the suggestion reaches the UI.
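A post-processing filter of this kind might look like the sketch below. The plan-to-feature mapping is invented for illustration: the post names parallel branching and custom API steps as gated features and mentions Growth and Enterprise plans, but it does not say which plan includes what, and the "starter" plan and "required_feature" field are assumptions.

```python
# Hypothetical plan matrix; the real mapping is not given in the post.
PLAN_FEATURES: dict[str, set[str]] = {
    "starter": {"basic_step"},
    "growth": {"basic_step", "parallel_branching"},
    "enterprise": {"basic_step", "parallel_branching", "custom_api_step"},
}


def filter_by_plan(suggestions: list[dict], plan: str) -> list[dict]:
    """Drop suggestions whose required feature is not on the user's plan.

    Unknown plans get an empty feature set, so the filter fails closed and
    shows nothing rather than something the user cannot use.
    """
    allowed = PLAN_FEATURES.get(plan, set())
    return [s for s in suggestions if s["required_feature"] in allowed]


candidates = [
    {"title": "Send a Slack message", "required_feature": "basic_step"},
    {"title": "Branch in parallel", "required_feature": "parallel_branching"},
    {"title": "Call a custom API", "required_feature": "custom_api_step"},
]

visible_on_starter = filter_by_plan(candidates, "starter")
visible_on_growth = filter_by_plan(candidates, "growth")
```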
Rebuild three: streaming and latency
By this point, suggestion quality was good. But the user experience had a problem: a 2.8-second wait for the suggestion to appear. When you're in a creative flow state — rapidly building a workflow — a 2.8-second blocking wait breaks your concentration and makes the feature feel slow and frustrating.
The solution was a streaming architecture. Instead of waiting for the full response before showing anything, we now render the suggestion word by word as the model generates it. The first word appears in under 400 milliseconds, and the full suggestion is typically complete within 1.5 seconds. Perceived latency, measured as the wait before anything appears, dropped by over 80% (from 2.8 seconds to under 400 milliseconds). Our session data also shows users now spend more time in the workflow builder per session, which we treat as a proxy for engagement with the suggestion feature.
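The rendering side of a streaming design can be sketched without any real model API. In the example below, fake_model_stream is a stand-in for a streaming model response and on_update is a stand-in for a UI callback; both are hypothetical. The function also records time to first chunk, the number that drives perceived latency.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def fake_model_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Stand-in for a streaming response: yields one word at a time."""
    for word in text.split():
        if delay_s:
            time.sleep(delay_s)
        yield word


def render_streaming(
    chunks: Iterable[str],
    on_update: Callable[[str], None],
) -> tuple[str, Optional[float]]:
    """Push the partial suggestion to the UI after every chunk.

    Returns the final text and the seconds elapsed before the first chunk
    arrived (None if the stream was empty).
    """
    start = time.monotonic()
    first_chunk_after: Optional[float] = None
    words: list[str] = []
    for chunk in chunks:
        if first_chunk_after is None:
            first_chunk_after = time.monotonic() - start
        words.append(chunk)
        on_update(" ".join(words))
    return " ".join(words), first_chunk_after


updates: list[str] = []
final_text, first_chunk_after = render_streaming(
    fake_model_stream("Send a Slack message"),
    updates.append,
)
```

Total generation time is unchanged by this pattern. What changes is how long the user stares at nothing: going from 2.8 seconds to 0.4 seconds is a reduction of (2.8 - 0.4) / 2.8, or roughly 86%.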
Where we are now
After three rebuilds, our suggestion acceptance rate sits at 62%, up from 31% at launch. The remaining 38% rejection rate is something we're actively working on. Our analysis of rejected suggestions shows three main categories: suggestions for steps the user had already decided to add (the feature pre-empted the user's intention), suggestions that are correct in category but wrong in specific integration (Gmail when the user uses Outlook), and suggestions that are technically reasonable but not what the user had in mind.
The next major improvement is personalization. We're building a per-customer suggestion model that learns from each user's workflow history — what steps they accept, what they reject, what integration combinations they favor. Early results from our internal testing show acceptance rate climbing to 74% with personalization enabled. We expect to ship this to Growth and Enterprise customers in Q2.
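The post does not describe how the per-customer model works, so the following is only a toy illustration of the general idea of learning from accept and reject history: candidate integrations are re-ranked by a smoothed per-user acceptance rate. The AcceptanceHistory class and the smoothing scheme are assumptions, not the approach being shipped.

```python
class AcceptanceHistory:
    """Tracks, per integration, how often one user accepted a suggestion."""

    def __init__(self) -> None:
        self._shown: dict[str, int] = {}
        self._accepted: dict[str, int] = {}

    def record(self, integration: str, accepted: bool) -> None:
        self._shown[integration] = self._shown.get(integration, 0) + 1
        if accepted:
            self._accepted[integration] = self._accepted.get(integration, 0) + 1

    def score(self, integration: str) -> float:
        # Add-one (Laplace) smoothing: an integration with no history scores
        # 0.5, so it ranks between the user's favorites and their rejects.
        accepted = self._accepted.get(integration, 0)
        shown = self._shown.get(integration, 0)
        return (accepted + 1) / (shown + 2)

    def rerank(self, candidates: list[str]) -> list[str]:
        """Order candidates from most to least likely to be accepted."""
        return sorted(candidates, key=self.score, reverse=True)


history = AcceptanceHistory()
history.record("Outlook", accepted=True)
history.record("Outlook", accepted=True)
history.record("Gmail", accepted=False)
history.record("Gmail", accepted=False)

ranked = history.rerank(["Gmail", "Outlook", "Slack"])
```

Even a signal this simple would address the Gmail-versus-Outlook rejection category: a user who keeps accepting Outlook steps stops seeing Gmail ranked first.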



