[00:00] (0.00s)
Imagine talking to an AI that listens in
[00:02] (2.56s)
real time, pauses when you interrupt,
[00:04] (4.56s)
pulls in live data, and even performs
[00:06] (6.88s)
tasks on your behalf. Hey everyone, in
[00:09] (9.52s)
this video, I'm going to show you how to
[00:11] (11.28s)
set up voice agents in what I think is
[00:13] (13.68s)
the easiest way possible, and not just
[00:15] (15.76s)
basic voice agents, but also how you can
[00:18] (18.40s)
integrate MCP servers with them to give
[00:21] (21.12s)
those agents access to external
[00:23] (23.12s)
knowledge. I came across someone on X
[00:25] (25.12s)
who built and open-sourced a voice agent
[00:27] (27.60s)
and full credit goes to him because I
[00:29] (29.92s)
used that base to implement MCPs on top
[00:32] (32.48s)
of it and take it even further. I really
[00:34] (34.96s)
hope you enjoy the video. Let's get into
[00:37] (37.04s)
it. First, let us look at how to install
[00:39] (39.68s)
this and set up the voice agent in the
[00:42] (42.08s)
simplest way possible. The GitHub
[00:44] (44.08s)
repository is linked in the description
[00:45] (45.92s)
below. The first step is to open your
[00:48] (48.08s)
terminal and type git clone followed by
[00:50] (50.48s)
the repository URL.
[00:54] (54.08s)
Once that is done, go into the folder
[00:56] (56.32s)
you just cloned. Next, open the project
[00:58] (58.80s)
in Cursor. Inside Cursor, press
[01:01] (61.64s)
Command+Shift+P to bring up the command
[01:03] (63.84s)
palette. Then type create Python
[01:06] (66.00s)
environment and select the option that
[01:08] (68.16s)
appears. Choose any available
[01:09] (69.84s)
interpreter and this will create your
[01:11] (71.76s)
Python environment. This project does
[01:14] (74.08s)
not use pip as the package manager.
[01:16] (76.64s)
Instead, it uses uv, so you will need to
[01:19] (79.60s)
run a specific command. Simply copy and
[01:22] (82.40s)
paste it into your terminal and this
[01:24] (84.32s)
will install the dependencies. After
[01:26] (86.40s)
that, you need to set up a few
[01:28] (88.16s)
environment variables and I will show
[01:30] (90.16s)
you exactly where to get each one. These
[01:32] (92.48s)
are the API keys you need to set up. The
[01:35] (95.04s)
OpenAI key is simple and easy to get.
[01:37] (97.52s)
And for the Cartesia API key, all you
[01:40] (100.00s)
have to do is sign up on the Cartesia
[01:42] (102.00s)
site and you will find it in your
[01:43] (103.76s)
account dashboard. For LiveKit, I will
[01:45] (105.92s)
guide you through the process since it
[01:47] (107.92s)
can be a little confusing at first.
[01:49] (109.76s)
Inside the LiveKit dashboard, go to the
[01:52] (112.16s)
settings section and click on keys. Once
[01:54] (114.88s)
you're there, open your API key and you
[01:57] (117.44s)
will see the URL that needs to be
[01:59] (119.12s)
copied. This is the server URL you will
[02:01] (121.36s)
paste into your environment file. Just
[02:03] (123.44s)
below it, you will find the API key and
[02:06] (126.08s)
if you click on reveal secret, you will
[02:08] (128.24s)
get your API key secret as well. You
[02:10] (130.48s)
will need to copy all three values, the
[02:12] (132.64s)
server URL, the API key and the secret
[02:15] (135.68s)
and paste them into your environment
[02:17] (137.52s)
configuration. Now go ahead and create
[02:19] (139.44s)
a .env file in the project directory. Paste
[02:22] (142.40s)
the entire block of variables into the
[02:24] (144.48s)
file and insert your API keys in the
[02:27] (147.04s)
correct places.
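
For reference, that block of variables usually looks something like the sample below. The exact variable names here are an assumption on my part, based on the standard LiveKit, OpenAI, and Cartesia setups, so match them against the repository's README or example env file.

```
OPENAI_API_KEY=sk-...
CARTESIA_API_KEY=your-cartesia-api-key
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-livekit-api-key
LIVEKIT_API_SECRET=your-livekit-api-secret
```
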
[02:29] (149.36s)
Once that is complete, open your terminal and paste the command
[02:31] (151.28s)
that installs the dependencies. After
[02:33] (153.52s)
that, use the next command to run the
[02:35] (155.68s)
agent. Just copy it, paste it into your
[02:38] (158.32s)
terminal and start the
[02:41] (161.80s)
process. When the agent is running, it
[02:44] (164.56s)
will give you a link. Copy that link and
[02:47] (167.04s)
open it in your
[02:50] (170.76s)
browser. Before using LiveKit, you also
[02:53] (173.84s)
need to make sure you have created a
[02:55] (175.52s)
project. If you have not done that
[02:57] (177.28s)
already, go back to your LiveKit
[02:59] (179.28s)
dashboard, create a project, and then
[03:01] (181.68s)
select the same project where you
[03:03] (183.20s)
entered your API details. Click on it
[03:07] (187.16s)
and connect. As soon as everything is
[03:09] (189.28s)
connected, you should hear the voice
[03:10] (190.72s)
agent greet you.
[03:12] (192.64s)
Hey there. How can I help you today?
[03:15] (195.76s)
Right now, I have the microphone turned
[03:17] (197.68s)
off, but it was working earlier and
[03:19] (199.68s)
immediately greeted us. Let's test it
[03:21] (201.60s)
again now. Hi there. You are being
[03:23] (203.76s)
featured in a video right now. I am
[03:25] (205.68s)
demoing you and showing how easy it is
[03:27] (207.84s)
to set up the voice agent. Awesome. Just
[03:30] (210.80s)
make sure to tell them I'm not a robot
[03:32] (212.32s)
with a secret plan to take over the
[03:34] (214.08s)
world. Just here to help. As you can
[03:37] (217.28s)
see, it is working exactly as expected
[03:39] (219.92s)
and the setup process has been really
[03:41] (221.92s)
smooth thanks to the tools this project
[03:44] (224.16s)
is built with. Let me give you some
[03:45] (225.76s)
insight into how the code works. This is
[03:48] (228.16s)
the main file that powers the voice
[03:50] (230.08s)
agent. At the top, you will see a
[03:52] (232.32s)
component called VAD, which stands for
[03:55] (235.12s)
voice activity detection. This allows
[03:57] (237.36s)
the assistant to pause automatically
[03:59] (239.20s)
when you start speaking, so you can
[04:01] (241.12s)
interrupt it naturally. It makes the
[04:03] (243.20s)
interaction feel smooth and responsive
[04:05] (245.68s)
just like the real-time assistants you
[04:07] (247.68s)
see from OpenAI. The language model
[04:10] (250.00s)
powering this assistant is GPT-4o mini,
[04:13] (253.12s)
but you can switch it out for a more
[04:14] (254.80s)
powerful model if you prefer longer or
[04:17] (257.20s)
more detailed responses. That said, the
[04:19] (259.84s)
current setup works well for most use
[04:21] (261.92s)
cases. One of the most important parts
[04:23] (263.92s)
of the agent is the system prompt. This
[04:26] (266.24s)
defines how the assistant behaves during
[04:28] (268.16s)
a conversation. In this case, we are
[04:30] (270.24s)
telling it to act like a witty assistant
[04:32] (272.16s)
that responds with short, clear answers.
[04:34] (274.80s)
We also avoid using hard-to-pronounce
[04:36] (276.72s)
words or emojis since those can cause
[04:39] (279.12s)
problems for the text-to-speech engine.
[04:41] (281.04s)
Speaking of voice, this agent uses
[04:42] (282.88s)
Cartesia's Sonic preview model to
[04:44] (284.88s)
generate speech. There is a voice ID
[04:46] (286.96s)
that you can customize and the API is
[04:49] (289.36s)
both affordable and flexible. For
[04:51] (291.12s)
example, ElevenLabs restricts how much you
[04:53] (293.76s)
can use their API without a paid plan,
[04:56] (296.16s)
but Cartesia gives you 20,000 free
[04:58] (298.56s)
credits. With what I had left, I was
[05:00] (300.72s)
able to generate about 25 minutes of
[05:03] (303.04s)
audio since each second costs 15
[05:05] (305.52s)
credits. That is more than enough to try
[05:07] (307.44s)
out integrations like the WhatsApp MCP
[05:09] (309.84s)
agent or any other setup you are
[05:11] (311.60s)
experimenting with. If you need more
[05:13] (313.28s)
credits, Cartesia also offers upgrade
[05:15] (315.60s)
plans that are still cheaper than what
[05:17] (317.36s)
ElevenLabs provides. Overall, it is a solid
[05:20] (320.48s)
and cost-effective alternative. Back in
[05:22] (322.56s)
the code, there is a greeting message
[05:24] (324.40s)
that the agent uses when it first
[05:26] (326.24s)
starts. You can change this by simply
[05:28] (328.32s)
editing the text directly in the file.
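
As a rough sketch, the wiring in that main file typically looks like the snippet below, following the standard LiveKit Agents voice pipeline pattern. The class names, plugin arguments, and voice ID are assumptions based on the public LiveKit examples rather than the repository's exact code, so treat it as a map of the pieces, not a drop-in replacement.

```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, openai, silero


async def entrypoint(ctx: JobContext):
    # System prompt: witty, short, clear answers, no emojis or hard-to-pronounce
    # words, because everything the LLM writes is read aloud by the TTS engine.
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a witty voice assistant. Keep answers short and clear. "
            "Avoid emojis and hard-to-pronounce words."
        ),
    )

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),                    # voice activity detection, so you can interrupt naturally
        stt=openai.STT(),                         # Whisper-based speech-to-text
        llm=openai.LLM(model="gpt-4o-mini"),      # swap in a larger model for longer answers
        tts=cartesia.TTS(voice="your-voice-id"),  # Cartesia Sonic voice; the voice ID is customizable
        chat_ctx=initial_ctx,
    )
    agent.start(ctx.room)

    # The greeting spoken as soon as someone connects; edit the text to change it.
    await agent.say("Hey there! How can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
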
[05:30] (330.32s)
The system prompt is where you can shape
[05:32] (332.24s)
the personality and behavior of the
[05:34] (334.24s)
assistant. If you want it to speak like
[05:36] (336.16s)
a certain character, you can add that
[05:38] (338.24s)
into the prompt. And if the model does
[05:40] (340.40s)
not recognize the character name, just
[05:42] (342.64s)
include a short description to guide it.
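
For instance, a persona-flavored prompt might look like this; the character and wording are just a made-up illustration of the pattern, not something from the repository.

```python
# Hypothetical persona prompt: name the character, add a short description in case
# the model does not recognize the name, and keep the TTS-friendly constraints.
SYSTEM_PROMPT = (
    "You speak like Captain Haddock, a gruff but warm-hearted old sea captain: "
    "blunt, a little dramatic, fond of nautical expressions. "
    "Keep answers short and clear, and avoid emojis and hard-to-pronounce words, "
    "since your replies are read aloud by a text-to-speech engine."
)
```
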
[05:44] (344.64s)
The level of customization here is
[05:46] (346.48s)
impressive and gives you a lot of
[05:48] (348.16s)
control over how the agent interacts. If
[05:50] (350.80s)
you're enjoying the video, I'd really
[05:52] (352.72s)
appreciate it if you could subscribe to
[05:54] (354.56s)
the channel. We're aiming to reach
[05:56] (356.40s)
25,000 subscribers by the end of this
[05:58] (358.80s)
month, and your support genuinely helps.
[06:01] (361.28s)
We share videos like this three times a
[06:03] (363.44s)
week, so there is always something new
[06:05] (365.44s)
and useful for you to explore. If you
[06:07] (367.68s)
have been watching the channel
[06:08] (368.96s)
regularly, you already know how much I
[06:11] (371.20s)
like working with the mcp-use library. It
[06:14] (374.00s)
is the same tool I used to build the
[06:15] (375.84s)
WhatsApp voice agent. We are running a
[06:18] (378.00s)
WhatsApp MCP server locally on the
[06:20] (380.48s)
system. This gives the language model
[06:22] (382.64s)
access to your WhatsApp messages and
[06:24] (384.88s)
even allows it to send replies. It is a
[06:27] (387.36s)
very solid and reliable tool. Now let us
[06:29] (389.60s)
break down what actually happens. The
[06:31] (391.60s)
mcp-use library takes the MCP server,
[06:34] (394.88s)
sends in your transcribed response, and
[06:37] (397.12s)
then uses the agent defined in the
[06:39] (399.04s)
library to automatically select the
[06:41] (401.36s)
correct tool and return the result. It
[06:43] (403.84s)
does not matter how many tools are
[06:45] (405.44s)
registered inside the MCP. Any one of
[06:47] (407.76s)
them can be used as a callable function
[06:49] (409.68s)
inside your code.
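
In code, that mcp-use flow is roughly the pattern below. The server command, path, and model are placeholders I made up for illustration; the repository's own MCP configuration will differ, so copy the real values from its config file and from the WhatsApp MCP server's README.

```python
import asyncio

from langchain_openai import ChatOpenAI
from mcp_use import MCPAgent, MCPClient


async def ask_whatsapp(transcribed_text: str) -> str:
    # Register the MCP server the agent is allowed to use (placeholder command and path).
    config = {
        "mcpServers": {
            "whatsapp": {
                "command": "python",
                "args": ["path/to/whatsapp-mcp-server/main.py"],
            }
        }
    }
    client = MCPClient.from_dict(config)

    # MCPAgent inspects the server's tools, picks the right one for the request,
    # calls it, and returns the final answer as plain text.
    agent = MCPAgent(llm=ChatOpenAI(model="gpt-4o-mini"), client=client, max_steps=15)
    try:
        return await agent.run(transcribed_text)
    finally:
        await client.close_all_sessions()


if __name__ == "__main__":
    print(asyncio.run(ask_whatsapp("Do I have any new WhatsApp messages?")))
```
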
[06:51] (411.68s)
What I did was integrate the WhatsApp MCP directly into
[06:54] (414.24s)
the voice agent. Here is how the flow
[06:56] (416.48s)
works. When I speak, the voice is
[06:58] (418.88s)
transcribed first, but instead of
[07:00] (420.64s)
sending that straight to the language
[07:02] (422.32s)
model, it goes to the WhatsApp MCP
[07:04] (424.80s)
server. The server processes the input
[07:07] (427.28s)
and sends back a result. Because of how
[07:09] (429.60s)
this open-source agent is built, the
[07:11] (431.84s)
whole thing feels fast. I speak, it
[07:14] (434.16s)
processes for about 2 to 3 seconds, and
[07:16] (436.56s)
then I hear the voice response almost
[07:18] (438.64s)
immediately. That is what is happening
[07:20] (440.56s)
in the background and honestly, it is
[07:23] (443.04s)
pretty cool. I also highly recommend
[07:24] (444.80s)
that you watch the WhatsApp MCP setup
[07:27] (447.28s)
video along with the one where I show
[07:29] (449.68s)
how MCP is actually used. Both of them
[07:32] (452.64s)
are quick and will give you all the
[07:34] (454.64s)
context you need to understand and set
[07:37] (457.12s)
things up on your own. I will link them
[07:39] (459.12s)
in the description below so you can
[07:40] (460.72s)
check them out easily. I just muted the
[07:42] (462.80s)
microphone and checked for new messages,
[07:44] (464.80s)
but there were none. So, I'm going to
[07:46] (466.56s)
send a quick message to my mother
[07:48] (468.08s)
letting her know that I built this
[07:49] (469.52s)
WhatsApp voice agent. Hi, could you
[07:51] (471.52s)
please send a message to my mom telling
[07:53] (473.12s)
her that I built this WhatsApp MCP voice
[07:55] (475.52s)
agent? Your message has been sent to
[07:57] (477.68s)
your mom letting her know that you
[08:00] (480.08s)
created the WhatsApp MCP voice agent.
[08:02] (482.72s)
Now, let me show you the message. As you
[08:04] (484.64s)
can see, it was sent successfully. I
[08:06] (486.72s)
won't show the entire chat for privacy
[08:08] (488.64s)
reasons, but there is a small
[08:10] (490.40s)
transcription error in the message.
[08:12] (492.40s)
That's a minor glitch with OpenAI's
[08:14] (494.32s)
Whisper model. It still performs really
[08:16] (496.64s)
well overall and there is not much more
[08:18] (498.80s)
that can be done in this case. The agent
[08:21] (501.04s)
is still running in the background, but
[08:22] (502.80s)
I'll go ahead and close it. Now, inside
[08:24] (504.72s)
the MCP configuration, you can plug in
[08:26] (506.96s)
any other agent that fetches data or
[08:28] (508.96s)
performs specific tasks. For example, if
[08:31] (511.28s)
you want to build an Airbnb voice agent,
[08:33] (513.76s)
you can easily do that by hooking it
[08:35] (515.68s)
into the same setup and letting it fetch
[08:38] (518.00s)
listings or information for you. The
[08:40] (520.16s)
same applies to the Brave Search MCP. If
[08:42] (522.88s)
you want to search something on the
[08:44] (524.32s)
internet, just ask and it will read the
[08:46] (526.96s)
results back to you.
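
Concretely, swapping servers is mostly a configuration change, something like the sketch below. The package names are the commonly used community MCP servers, but treat the commands, arguments, and keys as assumptions and copy the real ones from each server's README.

```python
# Hypothetical mcp-use style config with extra servers wired in alongside WhatsApp.
config = {
    "mcpServers": {
        "airbnb": {
            "command": "npx",
            "args": ["-y", "@openbnb/mcp-server-airbnb"],
        },
        "brave-search": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-brave-search"],
            "env": {"BRAVE_API_KEY": "your-brave-api-key"},
        },
    }
}
# client = MCPClient.from_dict(config)  # then hand the client to MCPAgent as before
```
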
[08:49] (529.28s)
This is exactly the kind of voice-first workflow everything
[08:51] (531.36s)
is moving toward. There are still a few
[08:53] (533.44s)
issues with interruption handling that
[08:55] (535.36s)
I'm actively working on. I'll upload the
[08:57] (537.68s)
updated version to a GitHub repository.
[09:00] (540.16s)
If I get time, I will continue improving
[09:02] (542.56s)
it. But even if I don't, it will remain
[09:05] (545.20s)
open source and you are welcome to clone
[09:07] (547.44s)
it and build on top of it. You can drag
[09:09] (549.60s)
any MCP into this setup. All you need to
[09:12] (552.16s)
do is update the MCP configuration and
[09:14] (554.64s)
it will work with whatever agent you
[09:16] (556.48s)
want to use. That brings us to the end
[09:18] (558.32s)
of this video. If you'd like to support
[09:20] (560.24s)
the channel and help us keep making
[09:22] (562.00s)
tutorials like this, you can do so by
[09:24] (564.40s)
using the super thanks button below. As
[09:26] (566.72s)
always, thank you for watching and I'll
[09:29] (569.04s)
see you in the next one.