Function calling is the feature that turns a voice agent from a chatbot with audio into an actual worker. Without it, the agent can talk about looking up your account; with it, the agent can actually do it. The basic idea is simple, but the implementation has quirks worth understanding before you ship.

TL;DR

Function calling lets the LLM emit structured requests to call your code (lookup CRM, book appointment, transfer call).
Three things matter most: clear function names, clear descriptions, and tight parameter schemas.
For voice, latency matters — long-running functions need a "let me check" bridge.
Cap timeouts. Always cap timeouts. A function call that hangs for 5 seconds breaks the conversation.

How it works

Modern LLMs accept a list of available functions alongside the prompt. The model can then either reply with text or emit a structured function call:

{
  "name": "lookup_caller",
  "arguments": { "phone_number": "+14155550199" }
}

Your orchestration layer intercepts that call, runs the actual code (a database query, an API call), and returns the result to the model. The model continues with that new context.
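
The interception step can be sketched roughly as follows. This is a minimal illustration, not any particular vendor's API: the REGISTRY dict, the register decorator, the dispatch helper, and the fake CRM record are all assumptions made up for the example.

```python
import json
from typing import Any, Callable

# Hypothetical registry mapping function names to Python callables.
REGISTRY: dict[str, Callable[..., Any]] = {}

def register(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Expose a Python function to the model under its own name."""
    REGISTRY[fn.__name__] = fn
    return fn

@register
def lookup_caller(phone_number: str) -> dict:
    # Stand-in for a real CRM or database query (made-up record).
    fake_crm = {"+14155550199": {"name": "Example Caller", "account_status": "active"}}
    return fake_crm.get(phone_number, {"status": "error", "message": "caller not found"})

def dispatch(call: dict) -> str:
    """Run the function the model asked for; return a JSON string to feed back to it."""
    fn = REGISTRY.get(call["name"])
    if fn is None:
        result = {"status": "error", "message": f"unknown function: {call['name']}"}
    else:
        result = fn(**call["arguments"])
    return json.dumps(result)

# The structured call emitted by the model, as in the example above.
model_output = {"name": "lookup_caller", "arguments": {"phone_number": "+14155550199"}}
print(dispatch(model_output))
```

The JSON string returned by dispatch is what gets appended to the conversation so the model can continue with the new context.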

For voice, the typical flow:

1. Caller asks something that requires data lookup.
2. LLM emits a function call.
3. Your code executes the function.
4. Result flows back to the LLM.
5. LLM generates a reply with the result baked in.
6. TTS speaks the reply.

Steps 2–5 happen between the caller's turn and the agent's reply. Latency budget is tight.

Designing function names and descriptions

The names and descriptions are what the LLM uses to decide when to call each function. Take them seriously.

Bad:

{ "name": "lookup", "description": "look something up" }

Good:

{
  "name": "lookup_caller_by_phone",
  "description": "Look up a customer record using their phone number. Returns name, account status, and recent order history. Call this whenever the agent needs to identify the caller or fetch their account data."
}

The good version tells the model when to call the function, what it returns, and why you'd want to.

Rules of thumb:

Function names: verb_noun_modifier. lookup_account_by_email, not getAccount.
Descriptions: 2–4 sentences. Include when to call AND when not to call.
If you have many similar functions, explicitly differentiate them in the descriptions.
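
These rules of thumb are easy to check mechanically. The sketch below is one possible lint, assuming function specs are plain dicts with "name" and "description" keys; lint_function_spec and its rough sentence counter are hypothetical helpers, and the check only covers naming style and description length, not whether the description says when not to call.

```python
import re

# snake_case with at least two parts, e.g. lookup_account_by_email
SNAKE_CASE = re.compile(r"^[a-z]+(_[a-z0-9]+)+$")

def lint_function_spec(spec: dict) -> list[str]:
    """Return a list of style problems for one function spec (empty if none)."""
    problems = []
    name = spec.get("name", "")
    if not SNAKE_CASE.match(name):
        problems.append(f"{name!r}: use verb_noun_modifier in snake_case")
    description = spec.get("description", "")
    # Rough sentence count: split on ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", description) if s.strip()]
    if not 2 <= len(sentences) <= 4:
        problems.append(f"{name!r}: description has {len(sentences)} sentence(s), aim for 2-4")
    return problems

bad = {"name": "lookup", "description": "look something up"}
good = {
    "name": "lookup_caller_by_phone",
    "description": (
        "Look up a customer record using their phone number. "
        "Returns name, account status, and recent order history. "
        "Call this whenever the agent needs to identify the caller or fetch their account data."
    ),
}
print(lint_function_spec(bad))
print(lint_function_spec(good))
```

Running a check like this over every spec before deploying catches vague one-word names early, though it is no substitute for reading the descriptions yourself.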

Designing parameter schemas

Use proper JSON schema with types and descriptions:

{
  "type": "object",
  "properties": {
    "phone_number": {
      "type": "string",
      "description": "E.164 format phone number, e.g. +14155550199"
    },
    "include_history": {
      "type": "boolean",
      "description": "Whether to include the caller's last 10 orders"
    }
  },
  "required": ["phone_number"]
}

Make required fields explicit. Use enums where a field has a fixed set of values (for example, "enum": ["pending", "completed", "cancelled"] on a status field). Be explicit about formats (E.164, ISO 8601 dates, etc.).
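
A schema only helps if the orchestration layer actually checks the model's arguments against it. The sketch below is a deliberately simplified checker covering only required fields, basic types, and enums; validate_arguments is a hypothetical helper, not a full JSON Schema implementation, and in production a dedicated JSON Schema validation library is the more likely choice.

```python
from typing import Any

# Mapping from JSON schema type names to Python types (subset only).
JSON_TYPES = {"string": str, "boolean": bool, "integer": int,
              "number": (int, float), "object": dict, "array": list}

def _matches(value: Any, type_name: Any) -> bool:
    expected = JSON_TYPES.get(type_name)
    if expected is None:
        return True  # unknown or absent type: not checked in this sketch
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False  # bool is a subclass of int in Python; treat it as a mismatch
    return isinstance(value, expected)

def validate_arguments(schema: dict, args: dict) -> list[str]:
    """Return a list of problems with the model's arguments (empty if valid)."""
    errors = [f"missing required field: {f}" for f in schema.get("required", []) if f not in args]
    properties = schema.get("properties", {})
    for field, value in args.items():
        rule = properties.get(field)
        if rule is None:
            errors.append(f"unexpected field: {field}")
            continue
        if not _matches(value, rule.get("type")):
            errors.append(f"{field}: expected {rule.get('type')}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: must be one of {rule['enum']}")
    return errors

schema = {
    "type": "object",
    "properties": {
        "phone_number": {"type": "string"},
        "include_history": {"type": "boolean"},
    },
    "required": ["phone_number"],
}
print(validate_arguments(schema, {"phone_number": "+14155550199", "include_history": True}))
print(validate_arguments(schema, {"include_history": "yes"}))
```

Returning the error list to the model as a structured error gives it a chance to retry with corrected arguments.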

The latency problem

Function calls take time. A typical breakdown:

Network round-trip to your API: 50–200ms
Database query or third-party API: 100–800ms
Network round-trip back: 50–200ms

Total: 200ms–1.2 seconds. That's added to the LLM's response latency, which is added to the caller's perceived wait.

Two mitigations:

1. Cap timeouts. Every function should have a hard timeout (typically 1.5–3 seconds). If it doesn't return, the agent says "I'm having trouble looking that up — let me try again" or escalates.

2. Bridge with chitchat. When the LLM calls a function it knows might be slow, your prompt should tell it to say something first: "Let me check on that."

The bridge pattern is implemented in the prompt:

When you call a function that may take more than 1.5 seconds
(like get_appointment_history or sync_external_system),
first say something to the caller like "let me look that up"
or "one moment" before making the call.
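
The timeout cap from the first mitigation lives in code rather than in the prompt. Here is a minimal sketch using asyncio, assuming your functions are async; call_with_timeout is a hypothetical helper, the 2.0-second default is just one value inside the 1.5–3 second range above, and the short sleeps only simulate a slow backend.

```python
import asyncio

async def call_with_timeout(fn, args: dict, timeout_s: float = 2.0) -> dict:
    """Await fn(**args), but give up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(fn(**args), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Structured error so the model can apologize, retry, or escalate.
        return {"status": "error", "message": f"timed out after {timeout_s}s"}

async def get_appointment_history(caller_id: str) -> dict:
    await asyncio.sleep(0.2)  # stand-in for a slow backend call
    return {"status": "ok", "appointments": []}

# Too tight a cap: the call is cancelled and a structured error comes back.
print(asyncio.run(call_with_timeout(get_appointment_history, {"caller_id": "c-1"}, timeout_s=0.05)))
# Enough headroom: the normal result comes back.
print(asyncio.run(call_with_timeout(get_appointment_history, {"caller_id": "c-1"}, timeout_s=1.0)))
```

Because the timeout returns a structured error instead of raising, the conversation keeps moving even when the backend does not.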

When to make a function call vs answer from memory

Common bug: the LLM "remembers" something and answers from that instead of looking it up. For static info (your hours, your return policy), this is fine. For dynamic info (current order status, today's availability), this is dangerous.

The fix is in the prompt. Be explicit:

Always call get_order_status before answering questions about
the caller's order. Do not rely on prior conversation context
for order status — orders change in real time.

An explicit always-look-it-up rule like this is the most underused move in production prompts.

Function-call reliability

In practice, three things go wrong:

1. The model picks the wrong function. Mitigation: clearer descriptions; fewer overlapping functions; explicit examples in the prompt.

2. The model fills the wrong arguments. Mitigation: tighter schemas; explicit format examples; require fields the model can't easily fudge.

3. The model calls a function it shouldn't. Mitigation: explicit "do not call X if Y" rules; guardrails that intercept and reject inappropriate calls.

On well-designed function schemas, major hosted LLMs typically pick the right function and arguments 95%+ of the time. If your measured rate is below that, your schemas need work.
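
A guardrail that intercepts and rejects inappropriate calls can be as simple as a rule table consulted before dispatch. The sketch below assumes the orchestrator keeps a per-call state dict; the RULES table, the check_call helper, and the caller_verified flag are all hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

# Each rule: (function name, predicate that is True when the call must be blocked, reason).
Rule = tuple[str, Callable[[dict], bool], str]

RULES: list[Rule] = [
    # Hypothetical rule: do not book anything until the caller has been verified.
    ("book_appointment", lambda state: not state.get("caller_verified"), "caller is not verified yet"),
]

def check_call(call: dict, state: dict) -> Optional[dict]:
    """Return a structured rejection if a rule blocks this call, otherwise None."""
    for name, blocked_when, reason in RULES:
        if call["name"] == name and blocked_when(state):
            return {"status": "rejected", "message": f"{name} not allowed: {reason}"}
    return None

call = {"name": "book_appointment", "arguments": {"slot": "2030-01-15T10:00", "caller_id": "c-1"}}
print(check_call(call, {"caller_verified": False}))
print(check_call(call, {"caller_verified": True}))
```

Feeding the rejection back to the model as a function result lets it explain the situation to the caller instead of failing silently.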

Real-world function patterns

A few shapes that recur across most production agents:

Lookup function. get_X_by_Y(...) returns structured data. Usually fast.

Mutation function. book_appointment(slot, caller_id) → confirmation. Slower; needs idempotency.

External API call. send_sms_followup(phone, message) → status. May fail; needs retry logic.

Transfer function. transfer_to_human(reason, context) → handoff. Should always succeed; if it doesn't, that's an emergency.

Search function. search_knowledge(query) → list of matching docs. Often slow; cache results.

Testing function calling

The eval workflow:

1. Pick 50 representative call transcripts.
2. For each, list the functions the agent should have called.
3. Replay through your current prompt; record what the agent actually called.
4. Score: did it call the right function? Did it fill the right arguments?

Run this before every prompt change. It catches regressions in function reliability that human grading often misses.
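
The scoring step can be sketched as below, assuming each case records the expected and actual calls as lists of {"name", "arguments"} dicts. The score_case and score_suite helpers and the two sample cases are hypothetical; replaying transcripts through your prompt is left out because it depends on your stack.

```python
def score_case(expected: list[dict], actual: list[dict]) -> dict:
    """Compare the calls the agent should have made with the calls it made."""
    right_function = [c["name"] for c in expected] == [c["name"] for c in actual]
    right_arguments = right_function and all(
        e["arguments"] == a["arguments"] for e, a in zip(expected, actual)
    )
    return {"right_function": right_function, "right_arguments": right_arguments}

def score_suite(cases: list[dict]) -> dict:
    """Aggregate per-case scores into accuracy figures for the whole eval set."""
    if not cases:
        raise ValueError("no cases to score")
    results = [score_case(c["expected"], c["actual"]) for c in cases]
    n = len(results)
    return {
        "cases": n,
        "function_accuracy": sum(r["right_function"] for r in results) / n,
        "argument_accuracy": sum(r["right_arguments"] for r in results) / n,
    }

cases = [
    {   # right function, right arguments
        "expected": [{"name": "get_order_status", "arguments": {"order_id": "A1"}}],
        "actual": [{"name": "get_order_status", "arguments": {"order_id": "A1"}}],
    },
    {   # right function, wrong arguments
        "expected": [{"name": "get_order_status", "arguments": {"order_id": "B2"}}],
        "actual": [{"name": "get_order_status", "arguments": {"order_id": "B3"}}],
    },
]
print(score_suite(cases))
```

Tracking function accuracy and argument accuracy separately tells you whether a regression came from the descriptions or from the schemas.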

FAQ

What's the difference between function calling and tool use? In practice they're used interchangeably. Some APIs and docs say "function calling" and others say "tool use"; both describe the model emitting a structured request that your code executes.

Should I use one big function or many small ones? Many small ones. The model picks more reliably when each function has a clear single purpose.

Can the LLM call multiple functions in parallel? Some models support this; most production agents don't take advantage of it because serial is easier to reason about.

How do I handle function errors? Return a structured error to the LLM, for example {"status": "error", "message": "caller not found"}. The model can then decide how to phrase it to the caller.

What about function calling cost? Function-calling overhead is small (~10% extra tokens per call). The bigger cost driver is whatever your function actually does.

Function Calling for Voice Agents: A Practical Guide

