[00:00] (0.00s)
Imagine talking to an AI that listens in
[00:02] (2.56s)
real time, pauses when you interrupt,
[00:04] (4.56s)
pulls in live data, and even performs
[00:06] (6.88s)
tasks on your behalf. Hey everyone, in
[00:09] (9.52s)
this video, I'm going to show you how to
[00:11] (11.28s)
set up voice agents in what I think is
[00:13] (13.68s)
the easiest way possible, and not just
[00:15] (15.76s)
basic voice agents, but also how you can
[00:18] (18.40s)
integrate MCP servers with them to give
[00:21] (21.12s)
those agents access to external
[00:23] (23.12s)
knowledge. I came across someone on X
[00:25] (25.12s)
who built and open-sourced a voice agent
[00:27] (27.60s)
and full credit goes to him because I
[00:29] (29.92s)
used that base to implement MCPs on top
[00:32] (32.48s)
of it and take it even further. I really
[00:34] (34.96s)
hope you enjoy the video. Let's get into
[00:37] (37.04s)
it. First, let us look at how to install
[00:39] (39.68s)
this and set up the voice agent in the
[00:42] (42.08s)
simplest way possible. The GitHub
[00:44] (44.08s)
repository is linked in the description
[00:45] (45.92s)
below. The first step is to open your
[00:48] (48.08s)
terminal and type git clone followed by
[00:50] (50.48s)
the repository URL.
[00:54] (54.08s)
Once that is done, go into the folder
[00:56] (56.32s)
you just cloned. Next, open the project
[00:58] (58.80s)
in Cursor. Inside Cursor, press
[01:01] (61.64s)
Command+Shift+P to bring up the command
[01:03] (63.84s)
palette. Then type create Python
[01:06] (66.00s)
environment and select the option that
[01:08] (68.16s)
appears. Choose any available
[01:09] (69.84s)
interpreter and this will create your
[01:11] (71.76s)
Python environment. This project does
[01:14] (74.08s)
not use pip as the package manager.
[01:16] (76.64s)
Instead, it uses uv, so you will need to
[01:19] (79.60s)
run a specific command. Simply copy and
[01:22] (82.40s)
paste it into your terminal and this
[01:24] (84.32s)
will install the dependencies. After
[01:26] (86.40s)
that, you need to set up a few
[01:28] (88.16s)
environment variables and I will show
[01:30] (90.16s)
you exactly where to get each one. These
[01:32] (92.48s)
are the API keys you need to set up. The
[01:35] (95.04s)
OpenAI key is simple and easy to get.
[01:37] (97.52s)
And for the Cartesia API key, all you
[01:40] (100.00s)
have to do is sign up on the Cartesia
[01:42] (102.00s)
site and you will find it in your
[01:43] (103.76s)
account dashboard. For LiveKit, I will
[01:45] (105.92s)
guide you through the process since it
[01:47] (107.92s)
can be a little confusing at first.
[01:49] (109.76s)
Inside the LiveKit dashboard, go to the
[01:52] (112.16s)
settings section and click on keys. Once
[01:54] (114.88s)
you're there, open your API key and you
[01:57] (117.44s)
will see the URL that needs to be
[01:59] (119.12s)
copied. This is the server URL you will
[02:01] (121.36s)
paste into your environment file. Just
[02:03] (123.44s)
below it, you will find the API key and
[02:06] (126.08s)
if you click on reveal secret, you will
[02:08] (128.24s)
get your API key secret as well. You
[02:10] (130.48s)
will need to copy all three values, the
[02:12] (132.64s)
server URL, the API key and the secret
[02:15] (135.68s)
and paste them into your environment
[02:17] (137.52s)
configuration. Now go ahead and create
[02:19] (139.44s)
a .env file in the project directory. Paste
[02:22] (142.40s)
the entire block of variables into the
[02:24] (144.48s)
file and insert your API keys in the
[02:27] (147.04s)
correct places.
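
For reference, that block of variables usually looks something like the sample below. The exact variable names here are an assumption on my part, based on the standard LiveKit, OpenAI, and Cartesia setups, so match them against the repository's README or example env file.

```
OPENAI_API_KEY=sk-...
CARTESIA_API_KEY=your-cartesia-api-key
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-livekit-api-key
LIVEKIT_API_SECRET=your-livekit-api-secret
```
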
[02:29] (149.36s)
Once that is complete, open your terminal and paste the command
[02:31] (151.28s)
that installs the dependencies. After
[02:33] (153.52s)
that, use the next command to run the
[02:35] (155.68s)
agent. Just copy it, paste it into your
[02:38] (158.32s)
terminal and start the
[02:41] (161.80s)
process. When the agent is running, it
[02:44] (164.56s)
will give you a link. Copy that link and
[02:47] (167.04s)
open it in your
[02:50] (170.76s)
browser. Before using LiveKit, you also
[02:53] (173.84s)
need to make sure you have created a
[02:55] (175.52s)
project. If you have not done that
[02:57] (177.28s)
already, go back to your LiveKit
[02:59] (179.28s)
dashboard, create a project, and then
[03:01] (181.68s)
select the same project where you
[03:03] (183.20s)
entered your API details. Click on it
[03:07] (187.16s)
and connect. As soon as everything is
[03:09] (189.28s)
connected, you should hear the voice
[03:10] (190.72s)
agent greet you.
[03:12] (192.64s)
Hey there. How can I help you today?
[03:15] (195.76s)
Right now, I have the microphone turned
[03:17] (197.68s)
off, but it was working earlier and
[03:19] (199.68s)
immediately greeted us. Let's test it
[03:21] (201.60s)
again now. Hi there. You are being
[03:23] (203.76s)
featured in a video right now. I am
[03:25] (205.68s)
demoing you and showing how easy it is
[03:27] (207.84s)
to set up the voice agent. Awesome. Just
[03:30] (210.80s)
make sure to tell them I'm not a robot
[03:32] (212.32s)
with a secret plan to take over the
[03:34] (214.08s)
world. Just here to help. As you can
[03:37] (217.28s)
see, it is working exactly as expected
[03:39] (219.92s)
and the setup process has been really
[03:41] (221.92s)
smooth thanks to the tools this project
[03:44] (224.16s)
is built with. Let me give you some
[03:45] (225.76s)
insight into how the code works. This is
[03:48] (228.16s)
the main file that powers the voice
[03:50] (230.08s)
agent. At the top, you will see a
[03:52] (232.32s)
component called VAD, which stands for
[03:55] (235.12s)
voice activity detection. This allows
[03:57] (237.36s)
the assistant to pause automatically
[03:59] (239.20s)
when you start speaking, so you can
[04:01] (241.12s)
interrupt it naturally. It makes the
[04:03] (243.20s)
interaction feel smooth and responsive
[04:05] (245.68s)
just like the real-time assistants you
[04:07] (247.68s)
see from OpenAI. The language model
[04:10] (250.00s)
powering this assistant is GPT-4o mini,
[04:13] (253.12s)
but you can switch it out for a more
[04:14] (254.80s)
powerful model if you prefer longer or
[04:17] (257.20s)
more detailed responses. That said, the
[04:19] (259.84s)
current setup works well for most use
[04:21] (261.92s)
cases. One of the most important parts
[04:23] (263.92s)
of the agent is the system prompt. This
[04:26] (266.24s)
defines how the assistant behaves during
[04:28] (268.16s)
a conversation. In this case, we are
[04:30] (270.24s)
telling it to act like a witty assistant
[04:32] (272.16s)
that responds with short, clear answers.
[04:34] (274.80s)
We also avoid using hard-to-pronounce
[04:36] (276.72s)
words or emojis since those can cause
[04:39] (279.12s)
problems for the text-to-speech engine.
[04:41] (281.04s)
Speaking of voice, this agent uses
[04:42] (282.88s)
Cartesia's Sonic preview model to
[04:44] (284.88s)
generate speech. There is a voice ID
[04:46] (286.96s)
that you can customize and the API is
[04:49] (289.36s)
both affordable and flexible. For
[04:51] (291.12s)
example, ElevenLabs restricts how much you
[04:53] (293.76s)
can use their API without a paid plan,
[04:56] (296.16s)
but Cartesia gives you 20,000 free
[04:58] (298.56s)
credits. With what I had left, I was
[05:00] (300.72s)
able to generate about 25 minutes of
[05:03] (303.04s)
audio since each second costs 15
[05:05] (305.52s)
credits. That is more than enough to try
[05:07] (307.44s)
out integrations like the WhatsApp MCP
[05:09] (309.84s)
agent or any other setup you are
[05:11] (311.60s)
experimenting with. If you need more
[05:13] (313.28s)
credits, Cartesia also offers upgrade
[05:15] (315.60s)
plans that are still cheaper than what
[05:17] (317.36s)
ElevenLabs provides. Overall, it is a solid
[05:20] (320.48s)
and cost-effective alternative. Back in
[05:22] (322.56s)
the code, there is a greeting message
[05:24] (324.40s)
that the agent uses when it first
[05:26] (326.24s)
starts. You can change this by simply
[05:28] (328.32s)
editing the text directly in the file.
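
As a rough sketch, the wiring in that main file typically looks like the snippet below, following the standard LiveKit Agents voice pipeline pattern. The class names, plugin arguments, and voice ID are assumptions based on the public LiveKit examples rather than the repository's exact code, so treat it as a map of the pieces, not a drop-in replacement.

```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, openai, silero


async def entrypoint(ctx: JobContext):
    # System prompt: witty, short, clear answers, no emojis or hard-to-pronounce
    # words, because everything the LLM writes is read aloud by the TTS engine.
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a witty voice assistant. Keep answers short and clear. "
            "Avoid emojis and hard-to-pronounce words."
        ),
    )

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),                    # voice activity detection, so you can interrupt naturally
        stt=openai.STT(),                         # Whisper-based speech-to-text
        llm=openai.LLM(model="gpt-4o-mini"),      # swap in a larger model for longer answers
        tts=cartesia.TTS(voice="your-voice-id"),  # Cartesia Sonic voice; the voice ID is customizable
        chat_ctx=initial_ctx,
    )
    agent.start(ctx.room)

    # The greeting spoken as soon as someone connects; edit the text to change it.
    await agent.say("Hey there! How can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
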
[05:30] (330.32s)
The system prompt is where you can shape
[05:32] (332.24s)
the personality and behavior of the
[05:34] (334.24s)
assistant. If you want it to speak like
[05:36] (336.16s)
a certain character, you can add that
[05:38] (338.24s)
into the prompt. And if the model does
[05:40] (340.40s)
not recognize the character name, just
[05:42] (342.64s)
include a short description to guide it.
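
For instance, a persona-flavored prompt might look like this; the character and wording are just a made-up illustration of the pattern, not something from the repository.

```python
# Hypothetical persona prompt: name the character, add a short description in case
# the model does not recognize the name, and keep the TTS-friendly constraints.
SYSTEM_PROMPT = (
    "You speak like Captain Haddock, a gruff but warm-hearted old sea captain: "
    "blunt, a little dramatic, fond of nautical expressions. "
    "Keep answers short and clear, and avoid emojis and hard-to-pronounce words, "
    "since your replies are read aloud by a text-to-speech engine."
)
```
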
[05:44] (344.64s)
The level of customization here is
[05:46] (346.48s)
impressive and gives you a lot of
[05:48] (348.16s)
control over how the agent interacts. If
[05:50] (350.80s)
you're enjoying the video, I'd really
[05:52] (352.72s)
appreciate it if you could subscribe to
[05:54] (354.56s)
the channel. We're aiming to reach
[05:56] (356.40s)
25,000 subscribers by the end of this
[05:58] (358.80s)
month, and your support genuinely helps.
[06:01] (361.28s)
We share videos like this three times a
[06:03] (363.44s)
week, so there is always something new
[06:05] (365.44s)
and useful for you to explore. If you
[06:07] (367.68s)
have been watching the channel
[06:08] (368.96s)
regularly, you already know how much I
[06:11] (371.20s)
like working with the mcp-use library. It
[06:14] (374.00s)
is the same tool I used to build the
[06:15] (375.84s)
WhatsApp voice agent. We are running a
[06:18] (378.00s)
WhatsApp MCP server locally on the
[06:20] (380.48s)
system. This gives the language model
[06:22] (382.64s)
access to your WhatsApp messages and
[06:24] (384.88s)
even allows it to send replies. It is a
[06:27] (387.36s)
very solid and reliable tool. Now let us
[06:29] (389.60s)
break down what actually happens. The
[06:31] (391.60s)
mcp-use library takes the MCP server,
[06:34] (394.88s)
sends in your transcribed response, and
[06:37] (397.12s)
then uses the agent defined in the
[06:39] (399.04s)
library to automatically select the
[06:41] (401.36s)
correct tool and return the result. It
[06:43] (403.84s)
does not matter how many tools are
[06:45] (405.44s)
registered inside the MCP. Any one of
[06:47] (407.76s)
them can be used as a callable function
[06:49] (409.68s)
inside your code.
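
In code, that mcp-use flow is roughly the pattern below. The server command, path, and model are placeholders I made up for illustration; the repository's own MCP configuration will differ, so copy the real values from its config file and from the WhatsApp MCP server's README.

```python
import asyncio

from langchain_openai import ChatOpenAI
from mcp_use import MCPAgent, MCPClient


async def ask_whatsapp(transcribed_text: str) -> str:
    # Register the MCP server the agent is allowed to use (placeholder command and path).
    config = {
        "mcpServers": {
            "whatsapp": {
                "command": "python",
                "args": ["path/to/whatsapp-mcp-server/main.py"],
            }
        }
    }
    client = MCPClient.from_dict(config)

    # MCPAgent inspects the server's tools, picks the right one for the request,
    # calls it, and returns the final answer as plain text.
    agent = MCPAgent(llm=ChatOpenAI(model="gpt-4o-mini"), client=client, max_steps=15)
    try:
        return await agent.run(transcribed_text)
    finally:
        await client.close_all_sessions()


if __name__ == "__main__":
    print(asyncio.run(ask_whatsapp("Do I have any new WhatsApp messages?")))
```
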
[06:51] (411.68s)
What I did was integrate the WhatsApp MCP directly into
[06:54] (414.24s)
the voice agent. Here is how the flow
[06:56] (416.48s)
works. When I speak, the voice is
[06:58] (418.88s)
transcribed first, but instead of
[07:00] (420.64s)
sending that straight to the language
[07:02] (422.32s)
model, it goes to the WhatsApp MCP
[07:04] (424.80s)
server. The server processes the input
[07:07] (427.28s)
and sends back a result. Because of how
[07:09] (429.60s)
this open-source agent is built, the
[07:11] (431.84s)
whole thing feels fast. I speak, it
[07:14] (434.16s)
processes for about 2 to 3 seconds, and
[07:16] (436.56s)
then I hear the voice response almost
[07:18] (438.64s)
immediately. That is what is happening
[07:20] (440.56s)
in the background and honestly, it is
[07:23] (443.04s)
pretty cool. I also highly recommend
[07:24] (444.80s)
that you watch the WhatsApp MCP setup
[07:27] (447.28s)
video along with the one where I show
[07:29] (449.68s)
how MCP is actually used. Both of them
[07:32] (452.64s)
are quick and will give you all the
[07:34] (454.64s)
context you need to understand and set
[07:37] (457.12s)
things up on your own. I will link them
[07:39] (459.12s)
in the description below so you can
[07:40] (460.72s)
check them out easily. I just muted the
[07:42] (462.80s)
microphone and checked for new messages,
[07:44] (464.80s)
but there were none. So, I'm going to
[07:46] (466.56s)
send a quick message to my mother
[07:48] (468.08s)
letting her know that I built this
[07:49] (469.52s)
WhatsApp voice agent. Hi, could you
[07:51] (471.52s)
please send a message to my mom telling
[07:53] (473.12s)
her that I built this WhatsApp MCP voice
[07:55] (475.52s)
agent? Your message has been sent to
[07:57] (477.68s)
your mom letting her know that you
[08:00] (480.08s)
created the WhatsApp MCP voice agent.
[08:02] (482.72s)
Now, let me show you the message. As you
[08:04] (484.64s)
can see, it was sent successfully. I
[08:06] (486.72s)
won't show the entire chat for privacy
[08:08] (488.64s)
reasons, but there is a small
[08:10] (490.40s)
transcription error in the message.
[08:12] (492.40s)
That's a minor glitch with OpenAI's
[08:14] (494.32s)
Whisper model. It still performs really
[08:16] (496.64s)
well overall and there is not much more
[08:18] (498.80s)
that can be done in this case. The agent
[08:21] (501.04s)
is still running in the background, but
[08:22] (502.80s)
I'll go ahead and close it. Now, inside
[08:24] (504.72s)
the MCP configuration, you can plug in
[08:26] (506.96s)
any other agent that fetches data or
[08:28] (508.96s)
performs specific tasks. For example, if
[08:31] (511.28s)
you want to build an Airbnb voice agent,
[08:33] (513.76s)
you can easily do that by hooking it
[08:35] (515.68s)
into the same setup and letting it fetch
[08:38] (518.00s)
listings or information for you. The
[08:40] (520.16s)
same applies to the Brave Search MCP. If
[08:42] (522.88s)
you want to search something on the
[08:44] (524.32s)
internet, just ask and it will read the
[08:46] (526.96s)
results back to you.
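
Concretely, swapping servers is mostly a configuration change, something like the sketch below. The package names are the commonly used community MCP servers, but treat the commands, arguments, and keys as assumptions and copy the real ones from each server's README.

```python
# Hypothetical mcp-use style config with extra servers wired in alongside WhatsApp.
config = {
    "mcpServers": {
        "airbnb": {
            "command": "npx",
            "args": ["-y", "@openbnb/mcp-server-airbnb"],
        },
        "brave-search": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-brave-search"],
            "env": {"BRAVE_API_KEY": "your-brave-api-key"},
        },
    }
}
# client = MCPClient.from_dict(config)  # then hand the client to MCPAgent as before
```
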
[08:49] (529.28s)
This is exactly the kind of voice-first workflow everything
[08:51] (531.36s)
is moving toward. There are still a few
[08:53] (533.44s)
issues with interruption handling that
[08:55] (535.36s)
I'm actively working on. I'll upload the
[08:57] (537.68s)
updated version to a GitHub repository.
[09:00] (540.16s)
If I get time, I will continue improving
[09:02] (542.56s)
it. But even if I don't, it will remain
[09:05] (545.20s)
open source and you are welcome to clone
[09:07] (547.44s)
it and build on top of it. You can drag
[09:09] (549.60s)
any MCP into this setup. All you need to
[09:12] (552.16s)
do is update the MCP configuration and
[09:14] (554.64s)
it will work with whatever agent you
[09:16] (556.48s)
want to use. That brings us to the end
[09:18] (558.32s)
of this video. If you'd like to support
[09:20] (560.24s)
the channel and help us keep making
[09:22] (562.00s)
tutorials like this, you can do so by
[09:24] (564.40s)
using the super thanks button below. As
[09:26] (566.72s)
always, thank you for watching and I'll
[09:29] (569.04s)
see you in the next one.