Case Study: Voice AI Sales Agent for Appointment Setting
Executive Summary
For high-volume consumer service providers, top-of-funnel lead leakage represents a significant revenue drain. In this engagement, Seven Labs designed, built, and deployed a human-like, 24/7 conversational voice AI agent for a multi-location Home Services Group. The core objective was to eliminate lead drop-off caused by missed after-hours calls, weekend delays, and slow manual follow-ups on web-based forms.
Over a tight five-week delivery roadmap, we engineered a voice routing and intelligence pipeline that handles both inbound customer calls and automated outbound web-lead callbacks. By integrating real-time Speech-to-Text (STT), advanced Large Language Model (LLM) reasoning, and low-latency Text-to-Speech (TTS) synthesis, the voice assistant qualifies leads dynamically and books appointments directly into the client’s scheduling system. The deployment achieved 24/7/365 call coverage, drove a 61% increase in booked appointments within the first 60 days, and reduced inbound sales staffing overhead by 70% while maintaining instant lead response times.
Business Problem
The client, a prominent regional Home Services Group, was losing substantial revenue to unworked or slowly processed inbound opportunities. The company operates in a highly competitive market where customer purchase intent decays rapidly; if a prospect's call is not answered immediately, they typically hang up and call a competitor.
The client faced three primary operational and financial bottlenecks:
- After-Hours Lead Leakage: The internal sales team operated on a strict 9:00 AM to 6:00 PM shift, five days a week. Approximately 35% of all inbound call volume occurred during weekday evenings, early mornings, and weekends. These calls were routed to voicemail systems. Historical CRM audit data revealed that 70% of prospects who reached voicemail hung up without leaving a message, resulting in an estimated $340,000 in annual lost revenue.
- Slow Response to Web Leads: Prospects filling out booking requests on the website waited an average of 4.5 hours for a callback from a human agent. By the time contact was established, the customer had frequently booked with another provider, driving up the company's customer acquisition cost (CAC).
- High Operational Staffing Costs: Scaling a human call center to cover nights, weekends, and holidays was financially unfeasible. The marginal cost of staffing a 24/7 internal team outweighed the margins on the recovered after-hours leads, creating a structural bottleneck that prevented business expansion.
To break this bottleneck, the client required a solution capable of qualifying leads on four dimensions (service type, property details, urgency, and scheduling availability) and booking confirmed appointments directly into HubSpot CRM.
Technical Challenges
Creating a conversational voice agent that behaves like a human sales representative requires solving several non-trivial engineering and systems challenges:
1. The 1.5-Second Round-Trip Latency Boundary
Human conversations rely on subtle, sub-second timing. The threshold at which a conversation feels unnatural or laggy is approximately 1.5 seconds of round-trip latency (the delay between the user finishing a sentence and the voice agent starting its reply). Standard voice AI architectures, which pipe audio through separate API calls for STT, LLM inference, and TTS, often incur 3.5 to 5.0 seconds of latency due to HTTP overhead, network hops, and blocking generation cycles. Breaking the 1.5-second barrier required streaming audio over persistent WebSockets, parallelizing audio processing, and utilizing highly optimized, regionalized edge infrastructure.
2. Graceful Interruption (Barge-In) Handling
In real-world conversations, humans interrupt. If a user interrupts the voice agent mid-sentence, the system must immediately cease audio playback, discard the remaining queued audio packets, capture the new speech input, update the conversational state, and generate an appropriate response. If barge-in is misconfigured or delayed, the agent will continue speaking over the user, destroying the illusion of human presence and causing severe conversational collisions.
3. Dynamic Multi-Tenant Calendar State Syncing
The voice agent must perform live database queries to identify open appointment windows across multiple locations and technician schedules. Fetching availability, presenting options to the user, holding a slot temporarily during the call, and writing a confirmed appointment back to HubSpot CRM and Google Calendar must happen in real time without introducing locks or race conditions that could lead to double-bookings.
4. Noise Debouncing and Speech Detection
Home service customers frequently call from noisy environments such as construction sites, moving vehicles, or rooms with loud HVAC units. The agent's Voice Activity Detection (VAD) system must accurately distinguish between actual customer speech and transient background noise to prevent false barge-ins.
Solution Architecture
The solution architecture replaces standard HTTP API chaining with a low-latency, WebSocket-based media routing pipeline.
                 +----------------------+
                 |    Customer Phone    |
                 +----------------------+
                            /\
                            ||
                     SIP Trunk (PSTN)
                            ||
                            \/
                 +----------------------+
                 | Twilio Phone Gateway |
                 +----------------------+
                            ||
                     Audio Streaming
                       (SIP/WebRTC)
                            ||
                            \/
                 +----------------------+
                 |    Vapi AI Engine    |
                 +----------------------+
                     /              \
        WebSockets                      WebSockets
             /                              \
           \/                                \/
+----------------------+          +----------------------+
|     Deepgram STT     |          |    ElevenLabs TTS    |
|   (Nova-2 Engine)    |          | (Turbo v2 Streaming) |
+----------------------+          +----------------------+
             \                              /
         Text Stream                  Audio Stream
                    \                /
                     \/            \/
              +----------------------------+
              |   Vapi Orchestrator LLM    |
              |       (GPT-4o Agent)       |
              +----------------------------+
                            ||
                     JSON Tool Calls
                      (HTTPS / JWT)
                            ||
                            \/
              +----------------------------+
              |   Seven Labs Middleware    |
              |  (Node.js Webhook Server)  |
              +----------------------------+
                  /         |          \
          HTTPS           HTTPS           HTTPS
           /                |                 \
        \/                  \/                 \/
+-----------------+  +---------------+  +---------------+
|   HubSpot CRM   |  | Google Cal API|  | Slack Alerts  |
| (Contacts/Deals)|  | (Availability)|  |  (Hot Leads)  |
+-----------------+  +---------------+  +---------------+
Component Breakdown
- Twilio Voice Gateway: Manages inbound call routing, SIP trunking, and outbound calling triggers. It routes live telephone audio directly to the Vapi orchestrator using low-latency WebRTC streams.
- Vapi AI Engine (Orchestrator): Functions as the central bridge, coordinating the ingestion of raw audio, routing text streams to the LLM, and piping synthesis chunks back to the telephony gateway.
- Deepgram Nova-2 (STT): Translates spoken audio to text in real time. We configured it to stream over WebSockets with interim (partial) transcripts and word-level timestamps, delivering transcripts with sub-100ms latency.
- ElevenLabs Turbo v2 (TTS): Generates natural, context-aware audio responses. It streams raw PCM audio packets back to Vapi in 20ms chunks, avoiding the latency penalty of waiting for an entire sentence to compile.
- OpenAI GPT-4o Agent: Serves as the cognitive engine. It parses the transcript, maintains the conversational state machine, decides on the next speech action, and triggers external tools when the customer requests calendar checks or bookings.
- Seven Labs Middleware (Node.js/TypeScript): Hosted on AWS ECS behind an Application Load Balancer. It validates webhook payloads via cryptographic signatures, translates agent tool requests into optimized database queries, handles API rate-limiting, and writes data to HubSpot CRM and the Google Calendar API.
Technology Stack
The technology stack was selected to maximize reliability, maintain sub-second processing overhead, and integrate cleanly with the client's commercial software suite:
- Orchestration Layer: Vapi AI was deployed to handle raw WebRTC media streaming and session lifecycle management. This bypassed the need to build a custom FreeSWITCH or Asterisk media server, saving months of infrastructure development.
- Large Language Model: gpt-4o (deployed via OpenAI's low-latency API endpoints). It provides the optimal balance of reasoning speed, strict system-prompt adherence, and multi-parameter tool calling.
- Speech-to-Text: Deepgram Nova-2. Selected for its superior accuracy with regional accents, low latency, and robust handling of background noise.
- Text-to-Speech: ElevenLabs Turbo v2, utilizing a custom-cloned voice trained on the client's highest-converting sales representative to ensure brand consistency and local accent alignment.
- Backend API & Middleware: Node.js (v20) running TypeScript, Express, and pnpm. This lightweight, asynchronous runtime handles concurrent webhook requests with negligible event loop lag.
- CRM Platform: HubSpot CRM, accessed via the official HubSpot Node SDK with OAuth2 credential rotation.
- Scheduling Layer: Google Calendar API, integrated via Google APIs Client Library, utilizing service account authentication.
- Monitoring & Logs: Winston logger with transport to AWS CloudWatch, combined with Datadog for API performance tracking and call failure alerting.
Implementation Process
We executed this project over a structured 5-week sprint cycle, progressing systematically from voice design to full production deployment:
Week 1: Prompt Engineering and Voice Cloning
We initiated the project by ingesting 40 hours of high-performing sales recordings provided by the Home Services Group. Using ElevenLabs' voice cloning tool, we created a custom, high-fidelity synthetic voice that matched the warmth and cadence of their top representative.
Simultaneously, we drafted the system prompt for GPT-4o. The prompt was structured using a strict hierarchical state machine, ensuring the agent remains focused on the primary goal (booking an appointment) while gracefully collecting the required qualification data.
# Conversational Flow States:
1. GREET: Welcome caller, state name and company, ask how we can help.
2. IDENTIFY_SERVICE: Determine which service (HVAC, Plumbing, Electrical) is required.
3. SCRUB_PROPERTY: Confirm location, zip code, and ownership status.
4. ASSESS_URGENCY: Classify call as emergency (same-day) or standard schedule.
5. BOOK_CALENDAR: Check calendar slots, present options, and book appointment.
6. WRAP_UP: Summarize appointment time, confirm contact details, say goodbye.
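The flow above can also be mirrored in the middleware so that logs and tests share the same vocabulary as the prompt. The following is a minimal sketch, assuming a strictly linear progression; the state names come from the outline above, while `FLOW_ORDER`, `nextState`, and `remainingStates` are hypothetical helpers introduced for illustration.

```typescript
// Hypothetical sketch: state names come from the prompt outline above;
// the linear transition table is an assumption made for illustration.
type FlowState =
  | "GREET"
  | "IDENTIFY_SERVICE"
  | "SCRUB_PROPERTY"
  | "ASSESS_URGENCY"
  | "BOOK_CALENDAR"
  | "WRAP_UP";

const FLOW_ORDER: FlowState[] = [
  "GREET",
  "IDENTIFY_SERVICE",
  "SCRUB_PROPERTY",
  "ASSESS_URGENCY",
  "BOOK_CALENDAR",
  "WRAP_UP",
];

// Returns the next state, or null once the call has reached WRAP_UP.
function nextState(current: FlowState): FlowState | null {
  const index = FLOW_ORDER.indexOf(current);
  return index < FLOW_ORDER.length - 1 ? FLOW_ORDER[index + 1] : null;
}

// Lists the states still ahead of the caller, e.g. for logging.
function remainingStates(current: FlowState): FlowState[] {
  const path: FlowState[] = [];
  let state: FlowState | null = nextState(current);
  while (state !== null) {
    path.push(state);
    state = nextState(state);
  }
  return path;
}
```

A real conversation can loop back (for example, when a caller changes the requested service), so a production table would likely allow more transitions than this linear one.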
Week 2: Middleware Development and Tool Integration
We developed the TypeScript middleware server to bridge the voice agent with external APIs. To prevent the LLM from hallucinating date/time logic, we implemented three specialized tool definitions: get_available_slots, reserve_slot, and create_hubspot_booking.
Here is an example of the schema definition for get_available_slots registered within Vapi:
{
  "name": "get_available_slots",
  "description": "Queries the scheduling system for open appointment times based on service type and location.",
  "parameters": {
    "type": "object",
    "properties": {
      "serviceType": {
        "type": "string",
        "enum": ["hvac", "plumbing", "electrical"],
        "description": "The trade branch requested."
      },
      "zipCode": {
        "type": "string",
        "description": "The 5-digit USPS zip code of the property."
      },
      "urgency": {
        "type": "string",
        "enum": ["emergency", "standard"],
        "description": "Whether the caller needs immediate assistance."
      }
    },
    "required": ["serviceType", "zipCode", "urgency"]
  }
}
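Because tool arguments are produced by the LLM, the middleware should treat them as untrusted input. The sketch below shows one way to validate them before any calendar query runs. The field names and enum values mirror the schema above; `parseGetAvailableSlotsArgs` and its return shape are assumptions for illustration, not the production handler.

```typescript
// Assumed shapes: field names and enums mirror the schema above; the
// validator itself is illustrative, not the production middleware code.
type ServiceType = "hvac" | "plumbing" | "electrical";
type Urgency = "emergency" | "standard";

interface GetAvailableSlotsArgs {
  serviceType: ServiceType;
  zipCode: string;
  urgency: Urgency;
}

type ParseResult =
  | { ok: true; args: GetAvailableSlotsArgs }
  | { ok: false; errors: string[] };

const SERVICE_TYPES: readonly string[] = ["hvac", "plumbing", "electrical"];
const URGENCIES: readonly string[] = ["emergency", "standard"];

// Validates untrusted tool-call arguments and returns either the typed
// arguments or a list of human-readable problems.
function parseGetAvailableSlotsArgs(raw: unknown): ParseResult {
  const errors: string[] = [];
  const input = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
  const { serviceType, zipCode, urgency } = input;

  if (typeof serviceType !== "string" || !SERVICE_TYPES.includes(serviceType)) {
    errors.push("serviceType must be one of hvac, plumbing, electrical");
  }
  if (typeof zipCode !== "string" || !/^\d{5}$/.test(zipCode)) {
    errors.push("zipCode must be a 5-digit string");
  }
  if (typeof urgency !== "string" || !URGENCIES.includes(urgency)) {
    errors.push("urgency must be emergency or standard");
  }
  if (errors.length > 0) return { ok: false, errors };

  return {
    ok: true,
    args: {
      serviceType: serviceType as ServiceType,
      zipCode: zipCode as string,
      urgency: urgency as Urgency,
    },
  };
}
```

Returning the error list to the agent, rather than throwing, lets the LLM re-ask the caller for the missing detail instead of stalling the call.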
Week 3: Latency Tuning and Telephony Routing
We set up SIP trunks in Twilio and routed them to Vapi. During this phase, we discovered a 400ms latency penalty caused by a regional mismatch: the Twilio gateway was located in us-east-1 (Virginia), while our Vapi workspace default was routed through us-west-2 (Oregon). We reconfigured the network topology so that the middleware and orchestrator ran in the same AWS region as the telephony gateway, cutting round-trip delay to under 1.2 seconds.
Week 4: Barge-In Calibration and Stress Testing
We conducted rigorous testing to fine-tune conversational parameters. We simulated over 1,500 test calls with varying background noises (crying babies, street traffic, and wind). We adjusted the VAD (Voice Activity Detection) threshold to prevent false cut-offs. The barge-in activation delay was calibrated to 300ms of continuous speech detection; if a noise lasted less than 300ms, the agent ignored it and continued speaking. If it exceeded 300ms, the audio playback engine instantly stopped streaming, preventing speech overlap.
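The debounce rule described above can be sketched as a small counter over VAD frames. This is illustrative only: the `VadFrame` shape and the 20ms frame size are assumptions, while the 300ms window is the value quoted in the text.

```typescript
// Illustrative only: the frame shape is an assumption; the 300 ms window
// is the calibrated value quoted in the text.
interface VadFrame {
  isSpeech: boolean; // VAD verdict for this frame
  durationMs: number; // e.g. 20 ms per audio frame
}

class BargeInDetector {
  private continuousSpeechMs = 0;
  private readonly debounceMs: number;

  constructor(debounceMs: number = 300) {
    this.debounceMs = debounceMs;
  }

  // Feed one frame; returns true when playback should stop.
  push(frame: VadFrame): boolean {
    if (!frame.isSpeech) {
      this.continuousSpeechMs = 0; // any gap resets the window
      return false;
    }
    this.continuousSpeechMs += frame.durationMs;
    return this.continuousSpeechMs >= this.debounceMs;
  }
}
```

In this sketch a single non-speech frame resets the counter; a production detector might tolerate brief gaps so that natural pauses inside a word do not restart the window.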
Week 5: Pilot Run and Full Production Release
We launched a pilot phase, routing 10% of inbound after-hours traffic to the Voice AI agent. We monitored live transcripts and checked for integration failures in HubSpot. After 4 days of flawless operation and positive customer feedback, we scaled the system to handle 100% of after-hours and weekend inbound calls, alongside automated web-lead callbacks.
Security Considerations
Because the voice agent handles Personally Identifiable Information (PII) such as phone numbers, physical addresses, and scheduling notes, we built the security framework around three pillars:
- Credential Isolation and KMS Vaulting: All database passwords, API tokens for HubSpot, Google Calendar, and Twilio credentials are encrypted at rest using AWS Key Management Service (KMS). The middleware loads these keys into environment memory at runtime; no secrets are ever stored in plaintext in version control or system logs.
- Webhook Verification and JWT Authentication: Every webhook payload sent from Vapi to our Node.js middleware must contain a valid JSON Web Token (JWT) in the Authorization header. The middleware verifies this token using a shared secret key, ensuring that malicious actors cannot spoof call events and inject fake bookings or scrape data.
- Data Sanitization and Transient Storage: The middleware server operates statelessly. No customer call audio or transcript text is stored on local server disks. Transcripts and recordings are streamed directly to HubSpot and then purged from the middleware's volatile RAM. We configured Vapi to delete local recordings after 7 days, aligning with the company's data retention policies. Furthermore, we designed an automated sanitization script that replaces credit card numbers or Social Security Numbers with [REDACTED] tokens should a customer voluntarily state them during the call.
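As an illustration of the webhook check in the second pillar, the sketch below verifies a shared-secret JWT using only Node's built-in crypto module. It assumes the token is signed with HS256 and may carry an `exp` claim; neither detail is stated in the text, and `verifyHs256Jwt` and `signHs256Jwt` are hypothetical helper names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: tokens are HS256-signed with the shared secret.
function base64UrlToBuffer(value: string): Buffer {
  return Buffer.from(value, "base64url");
}

// Returns the decoded payload when the token is valid, otherwise null.
function verifyHs256Jwt(
  token: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000)
): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [headerB64, payloadB64, signatureB64] = parts;
  try {
    const header = JSON.parse(base64UrlToBuffer(headerB64).toString("utf8"));
    if (header.alg !== "HS256") return null; // refuse "none" and other algorithms

    const expected = createHmac("sha256", secret)
      .update(`${headerB64}.${payloadB64}`)
      .digest();
    const actual = base64UrlToBuffer(signatureB64);
    // Constant-time comparison avoids leaking how many bytes matched.
    if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) {
      return null;
    }

    const payload = JSON.parse(base64UrlToBuffer(payloadB64).toString("utf8"));
    if (typeof payload.exp === "number" && payload.exp <= nowSeconds) return null;
    return payload;
  } catch {
    return null; // malformed base64 or JSON
  }
}

// Helper for demos and tests: signs a payload the same way.
function signHs256Jwt(payload: Record<string, unknown>, secret: string): string {
  const encode = (obj: unknown): string =>
    Buffer.from(JSON.stringify(obj)).toString("base64url");
  const signingInput = `${encode({ alg: "HS256", typ: "JWT" })}.${encode(payload)}`;
  const signature = createHmac("sha256", secret).update(signingInput).digest("base64url");
  return `${signingInput}.${signature}`;
}
```

In practice a maintained JWT library is usually preferable to hand-rolled verification; the sketch is only meant to show which checks matter (algorithm pinning, constant-time signature comparison, expiry).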
For organizations requiring more restrictive configurations, Seven Labs provides zero-trust network design, detailed in our blog on zero-trust network saas.
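The transcript sanitization step mentioned under Data Sanitization can be approximated with pattern matching. The sketch below is an assumption about how such a script might look: the regular expressions and the `redactSensitive` name are illustrative, and pattern-based redaction can both over-match (for example, ZIP+4 codes resemble nine-digit SSNs) and miss digits that the transcript spells out as words.

```typescript
// Illustrative patterns, not the production script.
// Card numbers: 13 to 19 digits, optionally separated by spaces or dashes.
const CARD_PATTERN = /\b\d(?:[ -]?\d){12,18}\b/g;
// US Social Security Numbers: 3-2-4 digit groups, separators optional.
const SSN_PATTERN = /\b\d{3}[- ]?\d{2}[- ]?\d{4}\b/g;

// Replaces card-like and SSN-like digit runs with a [REDACTED] token.
// Cards are handled first so a long digit run is not partially matched
// by the shorter SSN pattern.
function redactSensitive(transcript: string): string {
  return transcript
    .replace(CARD_PATTERN, "[REDACTED]")
    .replace(SSN_PATTERN, "[REDACTED]");
}
```

Ten-digit phone numbers in the common 3-3-4 grouping are left intact by these patterns, which matters here because the phone number is needed for the booking.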
Performance Optimizations
To ensure the voice AI remains responsive and stable under high call volumes, we implemented several performance-tuning steps:
1. Pre-Fetching and Local Caching
Querying Google Calendar and HubSpot CRM for technician availability during a live call can take up to 1.8 seconds, which stalls the conversation. To avoid this, the middleware runs a background worker that refreshes availability for the next 48 hours every 15 seconds and stores the result in a local Redis cache. When the agent calls get_available_slots, the middleware returns the cached slots in under 50ms. The slot is locked against the live Google Calendar API only once the user chooses a specific time, which prevents double-booking while keeping the conversation fast.
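The pre-fetch pattern can be sketched as a cache with a background refresh loop. To keep the example runnable without external services, an in-memory field stands in for Redis; that substitution, the `Slot` shape, the 30-second staleness limit, and all names below are assumptions. The 15-second interval and 48-hour horizon are the values from the text.

```typescript
// In-memory stand-in for the Redis cache (an assumption so the sketch
// runs standalone). All names are hypothetical.
interface Slot {
  technicianId: string;
  startIso: string; // ISO 8601 start time
}

type SlotFetcher = (horizonHours: number) => Promise<Slot[]>;

class AvailabilityCache {
  private slots: Slot[] = [];
  private refreshedAtMs: number | null = null;
  private readonly fetchSlots: SlotFetcher;
  private readonly maxAgeMs: number;

  constructor(fetchSlots: SlotFetcher, maxAgeMs: number = 30_000) {
    this.fetchSlots = fetchSlots;
    this.maxAgeMs = maxAgeMs;
  }

  // Called by the background worker; this is the only slow path.
  async refresh(nowMs: number = Date.now()): Promise<void> {
    this.slots = await this.fetchSlots(48); // 48-hour horizon from the text
    this.refreshedAtMs = nowMs;
  }

  // Called on the hot path by get_available_slots; never touches the network.
  // Returns null when there is no snapshot or it is too old to trust.
  read(nowMs: number = Date.now()): Slot[] | null {
    if (this.refreshedAtMs === null) return null;
    return nowMs - this.refreshedAtMs <= this.maxAgeMs ? this.slots : null;
  }
}

// Starts the periodic refresh (15 s in the text); returns a stop function.
function startRefreshLoop(cache: AvailabilityCache, intervalMs: number = 15_000): () => void {
  const timer = setInterval(() => {
    cache.refresh().catch((err) => console.error("availability refresh failed", err));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Returning null for a stale snapshot lets the caller fall back to a live query rather than offering slots that may already be taken.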
2. Stream-Based LLM Parsing
Instead of waiting for the LLM to complete its entire response before sending it to the TTS synthesizer, we configured the orchestrator to stream tokens. The ElevenLabs engine begins generating audio the moment the first phrase (e.g., "Sure, I can check...") is returned by the LLM. This technique hides the inference latency of the model behind the audio playback, shaving another 600ms off the user's perceived waiting time.
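One way to picture the streaming hand-off is a small buffer that releases text to the synthesizer at phrase boundaries instead of at the end of the response. The `PhraseChunker` class below is a hypothetical sketch, and its boundary rule (punctuation followed by whitespace) is an assumption rather than the orchestrator's actual behavior.

```typescript
// Hypothetical helper: buffers streamed LLM tokens and releases a chunk as
// soon as a phrase boundary appears, so synthesis can begin early.
class PhraseChunker {
  private buffer = "";

  // Returns zero or more complete phrases ready for the TTS stream.
  push(token: string): string[] {
    this.buffer += token;
    const phrases: string[] = [];
    // A phrase ends at punctuation followed by whitespace.
    const boundary = /[,.!?;:]\s+/g;
    let consumed = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + 1; // keep the punctuation, drop the space
      phrases.push(this.buffer.slice(consumed, end).trim());
      consumed = match.index + match[0].length;
    }
    this.buffer = this.buffer.slice(consumed);
    return phrases;
  }

  // Call once the LLM stream ends to emit whatever is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Requiring whitespace after the punctuation keeps times such as "10:00" and decimal amounts in one piece, at the cost of holding the final phrase until `flush` is called.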
3. Serverless Auto-Scaling
The Node.js middleware server is deployed on AWS ECS using Fargate. We configured autoscaling triggers based on concurrent active WebSockets. If inbound call volume spikes (e.g., during a storm or local power outage when HVAC/plumbing demand surges), ECS automatically scales the container count, ensuring that up to 200 concurrent calls can be processed without CPU throttling or memory starvation.
This scaling strategy aligns with our engineering philosophies highlighted in ai-infrastructure-engineering-beyond-chatbots.
Results & Outcomes
Within two months of launching the Voice AI system, the client achieved substantial improvements in operational metrics and customer response times:
- 24/7/365 Call Coverage: Standardized call intake captures all after-hours leads, eliminating night and weekend leaks.
- +61% Booked Appointments: Appointment booking grew from 340 to 547 appointments monthly.
- 70% Staffing Cost Reduction: Lowered inbound call-center labor overhead from $28,000 to $8,400 per month.
- Instant Lead Response: Collapsed web-to-call response intervals to under 15 seconds.
By integrating the Voice AI agent with their digital advertising channels, the client realized a complete return on their implementation investment within the first 30 days of production operation.
Lessons Learned
Deploying a voice agent in a high-stakes, consumer-facing environment provided three valuable engineering insights:
- The Criticality of Debounce and VAD Calibration: Standard VAD models are designed for quiet environments. In home services, callers are often driving or working on site. We learned that using a static speech-detection threshold is insufficient. We had to implement a dynamic noise-floor estimator that adjusts the VAD sensitivity in real time based on the first 500ms of call connection audio.
- Graceful Failure States: If an API call to the calendar system fails mid-conversation, the agent must not hang up or freeze. We designed a fallback state called SCHEDULING_FALLBACK. If the middleware returns an error, the agent seamlessly transitions to: "I'm having a brief issue saving that slot right now, but I have your details. I'll pass this directly to our dispatch supervisor to lock in your morning window and text you confirmation. Does that work?" This maintains customer satisfaction and retains the lead.
- Structured Contextual Prompts Beat Free-Form Chat: Early iterations allowed the LLM too much conversational freedom, leading to overly verbose explanations. We restructured the prompt to enforce strict constraints, such as "Limit responses to 25 words or fewer unless presenting calendar options." This constraint kept the conversation moving briskly and prevented token buildup that added latency.
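The noise-floor idea from the first lesson can be sketched as follows. The 500ms calibration window comes from the text; everything else is an assumption made for illustration, including the use of per-frame RMS energy, the 20ms frame size, the median as the floor estimate, and the margin factor of 3.

```typescript
// Sketch under assumptions: input is per-frame RMS energy on a linear
// scale, and speech must exceed the measured floor by a fixed margin.
class NoiseFloorEstimator {
  private samples: number[] = [];
  private elapsedMs = 0;
  private readonly calibrationMs: number;
  private readonly marginFactor: number;

  constructor(calibrationMs: number = 500, marginFactor: number = 3) {
    this.calibrationMs = calibrationMs;
    this.marginFactor = marginFactor;
  }

  // Feed frames from call connect; frames after the window are ignored.
  observe(rmsEnergy: number, frameMs: number = 20): void {
    if (this.elapsedMs >= this.calibrationMs) return;
    this.samples.push(rmsEnergy);
    this.elapsedMs += frameMs;
  }

  get calibrated(): boolean {
    return this.elapsedMs >= this.calibrationMs;
  }

  // The median is used so one loud transient does not skew the floor.
  get noiseFloor(): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  // Energy level a frame must exceed to count as speech.
  get speechThreshold(): number {
    return this.noiseFloor * this.marginFactor;
  }

  isSpeech(rmsEnergy: number): boolean {
    return this.calibrated && rmsEnergy > this.speechThreshold;
  }
}
```

A caller on a loud job site therefore gets a higher threshold than a caller in a quiet kitchen, without any per-call manual tuning.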
For further reading on designing robust conversational agents and avoiding deployment failures, check out Seven Labs' insights in why-automation-roi-is-flawed and our guide on multi-agent-orchestration.
Frequently Asked Questions (FAQs)
1. How does the voice agent handle customer barge-ins without generating audio collisions?
When the user speaks while the agent is playing audio, the Voice Activity Detection (VAD) engine on the telephony gateway detects the incoming audio stream. If the incoming speech energy exceeds the noise threshold for more than 300ms (the debounce window), a WebRTC signal is sent to the playback buffer. The buffer immediately drops all queued PCM audio packets and stops the speaker output.
Simultaneously, the STT engine opens a new transcription window, and the orchestrator sends a cancel request to the active LLM generation process. The system then processes the user's interruption in isolation, updating the state machine based on the new context.
2. How does the system handle concurrent availability queries to avoid double-bookings?
To prevent race conditions where two concurrent callers try to book the exact same slot, the system employs a two-phase reserve-then-confirm flow. When a customer expresses interest in a slot (e.g., "Tomorrow at 10:00 AM"), the middleware calls the reserve_slot tool. This tool writes a temporary, 5-minute block on the technician's calendar in HubSpot CRM, marked as "Pending AI Booking."
If the customer completes the call and confirms, the system updates the event status to "Confirmed" and creates the associated HubSpot deal. If the customer hangs up or rejects the slot, the 5-minute block expires and is automatically pruned by a cron job, restoring the slot to the public pool.
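The hold, confirm, and prune steps can be sketched with an in-memory ledger. The real system writes holds to the calendar; the `Map` below is an assumption so the example runs standalone, and `SlotLedger` and its method names are hypothetical. The 5-minute hold is the value from the text.

```typescript
// In-memory sketch of the reserve-then-confirm flow. All names are
// hypothetical; the Map stands in for the calendar backend.
type HoldStatus = "pending" | "confirmed";

interface Hold {
  callId: string;
  status: HoldStatus;
  expiresAtMs: number;
}

class SlotLedger {
  private holds = new Map<string, Hold>();
  private readonly holdMs: number;

  constructor(holdMs: number = 5 * 60 * 1000) {
    this.holdMs = holdMs;
  }

  // Phase 1: place a temporary hold. Fails if a live hold already exists.
  reserve(slotKey: string, callId: string, nowMs: number): boolean {
    const existing = this.holds.get(slotKey);
    if (existing && (existing.status === "confirmed" || existing.expiresAtMs > nowMs)) {
      return false;
    }
    this.holds.set(slotKey, { callId, status: "pending", expiresAtMs: nowMs + this.holdMs });
    return true;
  }

  // Phase 2: confirm only if this caller still owns an unexpired hold.
  confirm(slotKey: string, callId: string, nowMs: number): boolean {
    const hold = this.holds.get(slotKey);
    if (!hold || hold.callId !== callId || hold.status !== "pending" || hold.expiresAtMs <= nowMs) {
      return false;
    }
    hold.status = "confirmed";
    return true;
  }

  // Cron-style cleanup: drops expired pending holds, returns how many were freed.
  pruneExpired(nowMs: number): number {
    let freed = 0;
    this.holds.forEach((hold, key) => {
      if (hold.status === "pending" && hold.expiresAtMs <= nowMs) {
        this.holds.delete(key);
        freed += 1;
      }
    });
    return freed;
  }
}
```

Within a single Node.js process the check-and-set in `reserve` cannot interleave with another call; across multiple middleware containers the same guarantee would need an atomic operation in the shared store.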
3. What voice synthesis parameters were tuned to minimize latency with ElevenLabs?
To achieve sub-200ms audio synthesis, we bypassed the standard ElevenLabs HTTP POST endpoints and utilized their raw WebSocket API. We specified the eleven_turbo_v2 model, which prioritizes streaming speed over maximum vocal expressiveness. We locked the output format to pcm_24000 (24kHz, 16-bit linear PCM) to avoid the overhead of MP3 compression and decompression.
We also disabled the style_exaggeration slider and set the stability parameter to 0.45. This ensured the voice generated rapidly without risking pronunciation stability during long numeric strings (such as phone numbers or dates).
4. How does the voice agent handle noisy calls or customers who speak with regional slang?
Our Speech-to-Text pipeline uses Deepgram Nova-2, which is pre-trained on high-variance telephony audio and custom-tuned with vocabulary maps specific to the home services industry (e.g., "HVAC," "furnace," "compressor," "boiler"). When a word's confidence score drops below 60% due to line noise or background interference, the middleware passes the raw transcript along with a warning flag.
The LLM is prompted to ask a clarifying question in a natural manner rather than guessing, using conversational logic such as: "I caught that you're having trouble with your heating, but the connection cut out slightly. Did you say it was your furnace or your hot water boiler?"
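The warning flag described above can be derived from per-word confidence scores. The sketch below assumes a simple word list with confidences in the range 0 to 1; the `TranscribedWord` shape and `flagLowConfidence` name are hypothetical, while the 60% threshold is the value from the text.

```typescript
// Assumed input shape: one entry per word with a confidence in [0, 1].
interface TranscribedWord {
  word: string;
  confidence: number;
}

interface FlaggedTranscript {
  text: string;
  lowConfidence: boolean; // the warning flag passed to the LLM
  uncertainWords: string[]; // candidates for a clarifying question
}

// Builds the transcript text and flags it when any word falls below
// the confidence threshold (60% in the text).
function flagLowConfidence(words: TranscribedWord[], threshold: number = 0.6): FlaggedTranscript {
  const uncertainWords = words
    .filter((w) => w.confidence < threshold)
    .map((w) => w.word);
  return {
    text: words.map((w) => w.word).join(" "),
    lowConfidence: uncertainWords.length > 0,
    uncertainWords,
  };
}
```

Passing the uncertain words along with the flag gives the LLM enough context to ask about the specific term it may have misheard rather than repeating the whole question.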
5. How does the system handle webhook failures or transient database locks during a call wrap-up?
If the middleware encounters a transient database lock or API timeout while finalizing a booking, it retries the request using an exponential backoff policy with a maximum of three attempts over 5 seconds. If all retries fail, the middleware writes the raw call payload and transcript directly to a RabbitMQ dead-letter queue.
An automated alerting system triggers an emergency Slack notification and SMS message to the on-call dispatcher. The dispatcher is presented with the transcript, customer phone number, and requested slot, allowing them to manually confirm the booking with a single click, ensuring zero lead loss.
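The retry-then-dead-letter behavior can be sketched as a small wrapper. The delay schedule here (immediate, then 1s, then 2s) is one assumption that fits "three attempts over 5 seconds"; the `onExhausted` callback stands in for the RabbitMQ publish and alerting, and all names are hypothetical.

```typescript
// Delay before each attempt: the first runs immediately, later ones
// double the wait each time.
function backoffDelaysMs(maxAttempts: number, baseMs: number): number[] {
  return Array.from({ length: maxAttempts }, (_, i) => (i === 0 ? 0 : baseMs * 2 ** (i - 1)));
}

interface RetryOptions {
  maxAttempts?: number;
  baseMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Runs the operation with exponential backoff. When every attempt fails,
// hands the last error to onExhausted (e.g. a dead-letter publish) and
// resolves to null instead of throwing.
async function withRetry<T>(
  operation: () => Promise<T>,
  onExhausted: (lastError: unknown) => Promise<void>,
  options: RetryOptions = {}
): Promise<T | null> {
  const { maxAttempts = 3, baseMs = 1000 } = options;
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (const delay of backoffDelaysMs(maxAttempts, baseMs)) {
    if (delay > 0) await sleep(delay);
    try {
      return await operation();
    } catch (err) {
      lastError = err; // treated as transient: fall through to the next attempt
    }
  }
  await onExhausted(lastError);
  return null;
}
```

This sketch retries on every error; a production version would likely retry only on errors known to be transient (timeouts, lock conflicts) and fail fast on validation errors.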