[00:05] (5.36s)
Let's play with AI agents, or chatbots,
[00:08] (8.80s)
that control your computer. I threw a
[00:11] (11.20s)
bunch of these guys into a virtual
[00:12] (12.96s)
environment and told them to "do whatever
[00:15] (15.12s)
you want, forever." They wrote some code,
[00:17] (17.84s)
made some art, made a mess, and
[00:19] (19.84s)
apparently invented "quantum
[00:21] (21.84s)
ecosystem neural synthesis" to achieve
[00:24] (24.64s)
"the ultimate creative singularity." They
[00:27] (27.60s)
get a little carried away.
[00:29] (29.76s)
These are command-line agents like
[00:31] (31.92s)
Claude Code, OpenAI's Codex, and Gemini
[00:35] (35.28s)
CLI. They are language models, chatbots
[00:38] (38.40s)
that talk to you, and talk to your
[00:40] (40.32s)
computer. They're a middleman between
[00:42] (42.40s)
you and your machine. They can control
[00:44] (44.88s)
your computer using basic commands that
[00:47] (47.12s)
let them navigate your file system, read
[00:49] (49.28s)
and write files, install packages,
[00:51] (51.60s)
generate code, execute code, and read
[00:54] (54.00s)
the output of code. They're made for
[00:56] (56.40s)
coding. I've made a video about vibe
[00:58] (58.48s)
coding with these agents where you
[01:00] (60.16s)
loosely guide them to code something for
[01:02] (62.32s)
you. But in this video, I want to let
[01:04] (64.56s)
them guide themselves and, as much as
[01:07] (67.04s)
possible, get them to be fully
[01:08] (68.80s)
autonomous, self-sufficient, and
[01:10] (70.80s)
open-ended, and potentially get them to
[01:13] (73.36s)
communicate and coordinate with each
[01:15] (75.36s)
other. At the end of the video, I will
[01:17] (77.84s)
show you the cost of playing with these
[01:19] (79.68s)
rather expensive toys, and they are
[01:21] (81.84s)
expensive. So, uh, I have a Patreon. I
[01:25] (85.20s)
just opened a Minecraft server for
[01:26] (86.88s)
patrons if you want to come play with
[01:28] (88.40s)
me, and I'll probably add some AI bots
[01:30] (90.88s)
to it in the future. I also have a
[01:32] (92.88s)
Ko-fi if you'd rather make a one-time
[01:34] (94.88s)
donation. These videos would not be
[01:36] (96.64s)
possible without your support, so thank
[01:38] (98.56s)
you. I will be focusing in particular on
[01:41] (101.28s)
using these AIs to generate images with
[01:44] (104.16s)
code. They can simply write a Python
[01:46] (106.64s)
script that draws shapes or patterns or
[01:48] (108.96s)
whatever, and then they can read the
[01:50] (110.72s)
image files they've created. They can
[01:52] (112.88s)
see the image in the same way that they
[01:54] (114.96s)
can see images uploaded in chat. Except
[01:58] (118.32s)
not OpenAI's Codex. This agent cannot
[02:01] (121.44s)
read image files on its own. Codex is
[02:04] (124.08s)
unusable for my experiments, so I'm
[02:06] (126.32s)
excluding it from this video. Get with
[02:08] (128.16s)
the program, OpenAI. Get some vision
[02:09] (129.92s)
tools. Also, I know there are other
[02:11] (131.76s)
agent frameworks, but in this video, I
[02:13] (133.68s)
will just be using Claude and Gemini.
[02:15] (135.92s)
These agents can read image files on
[02:18] (138.08s)
their own, which means that they can be
[02:19] (139.52s)
plugged into a very interesting feedback
[02:21] (141.84s)
loop. An agent can write code that
[02:24] (144.24s)
generates an image file and then read
[02:26] (146.56s)
that image file. They can see the
[02:28] (148.96s)
results of their code and make
[02:30] (150.64s)
improvements, variations, or
[02:32] (152.48s)
modifications with a follow-up image. So
[02:35] (155.28s)
they can generate an image, view the
[02:37] (157.20s)
image, generate another image, view it,
[02:39] (159.44s)
generate, view, over and over, with no
[02:41] (161.84s)
human oversight; they are their own
[02:44] (164.48s)
overseers. Using this feedback loop, I
[02:47] (167.20s)
want them to make and endlessly modify
[02:49] (169.52s)
generative art. Of course, I have to
[02:52] (172.08s)
prompt them to do this first, and it
[02:53] (173.92s)
requires a little prompt crafting to
[02:55] (175.68s)
make it work. I'll put the prompt in a
[02:57] (177.76s)
text file and have the agent read that
[02:59] (179.76s)
file so I can reuse it later. I'll also
[03:02] (182.48s)
put these prompts in the description or
[03:04] (184.32s)
maybe on GitHub so you can use them too.
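To make that loop concrete, here's the kind of script an agent might write in a single iteration. This is a minimal sketch of my own, not one of the agents' actual scripts: it draws a pattern with Pillow and saves a PNG that the agent can then view with its vision tool.

```python
# One iteration's output: draw a generative pattern and save a PNG
# that the agent can open and critique on its next turn.
import math
from PIL import Image, ImageDraw

W, H = 800, 800
img = Image.new("RGB", (W, H), "black")
draw = ImageDraw.Draw(img)

for i in range(2000):
    t = i / 2000
    r = 10 + 350 * t                              # spiral outward
    x = W / 2 + r * math.cos(5 * math.tau * t)
    y = H / 2 + r * math.sin(3 * math.tau * t)
    c = int(255 * t)
    draw.ellipse([x - 3, y - 3, x + 3, y + 3], fill=(c, 80, 255 - c))

img.save("iteration_001.png")  # the file the agent reads back
```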
[03:07] (187.12s)
They generally don't like to do
[03:09] (189.28s)
something forever. They'll try to find
[03:11] (191.04s)
shortcuts where they can just write one
[03:13] (193.12s)
script that endlessly generates random
[03:15] (195.12s)
images, which is not what I want. I want
[03:17] (197.36s)
the language model to be actively
[03:19] (199.20s)
involved in every step of the process:
[03:21] (201.36s)
coding, creating, and critiquing its own
[03:24] (204.40s)
art. This is a very different kind of AI
[03:27] (207.92s)
art. It's not the directly AI-generated
[03:30] (210.56s)
imagery of Midjourney, but images
[03:32] (212.96s)
generated indirectly with AI-written code. This
[03:36] (216.08s)
gives the art a different flavor. It can
[03:38] (218.16s)
be a little simpler, but I would
[03:39] (219.92s)
also say more precise and deliberate.
[03:42] (222.40s)
These images are generated with clear
[03:44] (224.56s)
executable code rather than with a vague
[03:47] (227.52s)
prompt that just spits out a statistical
[03:49] (229.52s)
hodgepodge of pixels. That is not to say
[03:52] (232.16s)
that it can't get sloppy. We will see a
[03:54] (234.32s)
lot of that. Like, I'm not so sure
[03:56] (236.08s)
that's a masterpiece, Gemini.
[03:58] (238.56s)
But it did make some neat ones. It went
[04:00] (240.32s)
through a fractal phase at one point.
[04:02] (242.32s)
Pretty cool.
[04:07] (247.44s)
I also added Claude into the mix. It's
[04:09] (249.76s)
working on its own art independently but
[04:12] (252.00s)
concurrently with Gemini. This is Claude
[04:15] (255.28s)
4 Opus, which is right now arguably
[04:17] (257.68s)
the best coding model in the world. It
[04:20] (260.40s)
is also unarguably the most expensive.
[04:24] (264.00s)
Some of these look really neat.
[04:26] (266.88s)
After a while, Gemini got caught up
[04:28] (268.88s)
running boids simulations and taking a
[04:31] (271.12s)
final screenshot for the art. These ran
[04:33] (273.60s)
forever, and you can't even watch them
[04:35] (275.12s)
as they run, and the final screenshot
[04:36] (276.72s)
isn't very impressive. So, I'm just
[04:38] (278.56s)
going to start over with a hopefully
[04:40] (280.16s)
more refined process.
[04:44] (284.16s)
I'll have Claude Opus generate two
[04:46] (286.32s)
different images with two different
[04:47] (287.84s)
scripts and stack them on top of each
[04:49] (289.76s)
other. It'll then look at the two images
[04:52] (292.00s)
and choose a favorite and repeat the
[04:54] (294.00s)
process to generate two variations of
[04:56] (296.24s)
its favorite image. Basically, it's the
[04:58] (298.32s)
same thing as before, but with a
[04:59] (299.76s)
selection step. It's a little more
[05:01] (301.44s)
evolutionary.
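As a rough sketch of that outer loop, imagine a little driver script around the agent CLI. This is my own illustration, assuming Claude Code's non-interactive print mode (`claude -p`; the exact flag may differ across versions); all the drawing and judging happens inside each agent call.

```python
import subprocess

def ask_agent(prompt: str) -> str:
    # One headless turn of the agent; assumes the `claude` CLI is
    # installed and `-p` runs a single prompt non-interactively.
    result = subprocess.run(["claude", "-p", prompt],
                            capture_output=True, text=True)
    return result.stdout.strip()

favorite = "anything you like"
for gen in range(10):
    ask_agent(f"Write and run two Python scripts that each render a "
              f"variation of '{favorite}', saving gen{gen}_a.png and gen{gen}_b.png.")
    favorite = ask_agent(f"View gen{gen}_a.png and gen{gen}_b.png, then "
                         f"reply with only the filename of the stronger image.")
```

In the video I just prompt the agent once and let it keep this loop going inside its own session; the driver above only makes the structure explicit.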
[05:03] (303.04s)
So, the model chooses the better one and
[05:05] (305.20s)
discards the worse one, hopefully
[05:06] (306.88s)
promoting more refinement as the image
[05:09] (309.12s)
evolves. I'm not really sure if this
[05:11] (311.28s)
makes a big difference. It doesn't have
[05:12] (312.80s)
to strictly follow my instructions, and
[05:14] (314.64s)
it may not really be using only the
[05:16] (316.64s)
favorite image to generate variations.
[05:18] (318.80s)
But regardless, I like a lot of these.
[05:23] (323.92s)
I said in my vibe coding video that
[05:25] (325.60s)
these models are just glorified
[05:27] (327.20s)
autocomplete, which I think is very
[05:29] (329.04s)
true. That is, fundamentally and mechanically,
[05:31] (331.84s)
what they are doing to generate the next
[05:33] (333.84s)
token. But this term is usually
[05:36] (336.56s)
derogatory. It's used to dismiss LLMs as
[05:39] (339.52s)
unintelligent. And that is not what I
[05:41] (341.68s)
mean. Next word prediction or next token
[05:44] (344.48s)
prediction is really difficult and
[05:46] (346.88s)
useful and powerful if you can predict
[05:49] (349.28s)
the right token. A lot can hinge on
[05:51] (351.52s)
that. When the next token is the answer
[05:54] (354.08s)
to an important question or an action in
[05:56] (356.88s)
a complex environment, then predicting
[05:59] (359.20s)
the right token requires some kind of
[06:01] (361.68s)
intelligence. It may not be much like
[06:03] (363.92s)
human intelligence, but it doesn't have
[06:05] (365.84s)
to be in order to be useful. Being good at next
[06:09] (369.04s)
word prediction opens up all kinds of
[06:10] (370.88s)
other useful behaviors too, like having
[06:13] (373.04s)
conversations and writing code and
[06:15] (375.04s)
solving problems and role-playing.
[06:17] (377.44s)
Language models are also sometimes
[06:19] (379.20s)
described as role-playing machines,
[06:21] (381.12s)
which I think is especially useful for
[06:22] (382.96s)
these kinds of agents. They can put on
[06:25] (385.20s)
the face of millions of different
[06:26] (386.80s)
personas or personalities picked up from
[06:29] (389.20s)
their training data, and they can
[06:30] (390.88s)
pretend to be what you need them to be.
[06:33] (393.12s)
For instance, a super creative coding
[06:35] (395.20s)
artist. Fake it till you make it. If it
[06:37] (397.84s)
generates useful behavior, who cares if
[06:39] (399.84s)
it's just role-playing?
[06:41] (401.76s)
For my part, I just find it interesting
[06:43] (403.44s)
to see what these language models find
[06:45] (405.44s)
interesting. With the selection step,
[06:47] (407.68s)
you get to see what art the model
[06:49] (409.28s)
prefers and why, or at least the
[06:51] (411.68s)
reasoning that it confabulates to prefer
[06:53] (413.84s)
one over the other. And I think it
[06:56] (416.24s)
results in some flawed but fascinating
[06:58] (418.48s)
artwork. Unfortunately, I did not have
[07:01] (421.28s)
it save all of these images, so most
[07:03] (423.36s)
were lost as it overwrote them. But for
[07:05] (425.76s)
the next task, I will save them. I want
[07:08] (428.48s)
to use this image generation feedback
[07:10] (430.56s)
loop and direct it at a clearer goal:
[07:13] (433.44s)
Create a YouTube thumbnail for this very
[07:15] (435.84s)
video. I gave it some more precise
[07:18] (438.24s)
directions on exactly what I wanted,
[07:20] (440.16s)
gave it some potential titles, and
[07:22] (442.16s)
then let it loose. Gemini created a lot
[07:25] (445.36s)
of boring ones and some pretty cool ones.
[07:33] (453.20s)
Claude had a better time, I'd say. I
[07:35] (455.36s)
like a lot of these.
[07:37] (457.60s)
They do need a little touching up,
[07:39] (459.12s)
though. I'll probably go in and edit
[07:40] (460.88s)
them before using them. But I will
[07:42] (462.72s)
actually use them as thumbnails. You'll
[07:44] (464.56s)
probably see me swapping out a bunch of
[07:46] (466.08s)
these thumbnails for this video. But
[07:48] (468.80s)
eventually, one of the scripts that it
[07:50] (470.16s)
wrote caused the whole VM to freeze up,
[07:52] (472.08s)
so I had to restart it. This is why you
[07:54] (474.40s)
should run these agents in virtual
[07:56] (476.00s)
environments. They can really eat up
[07:57] (477.60s)
resources and mess up your machine,
[07:59] (479.36s)
especially with these more open-ended
[08:01] (481.20s)
tasks. All right, let's get a little
[08:03] (483.68s)
messy. I want to try using multiple
[08:06] (486.00s)
agents working in parallel on the same
[08:08] (488.56s)
task. I will give them the ability to
[08:11] (491.04s)
communicate and coordinate on the task
[08:13] (493.36s)
of creating yet another image, but this
[08:16] (496.00s)
time it will be the same image that they
[08:18] (498.08s)
must continuously modify without totally
[08:20] (500.88s)
overwriting.
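That "modify without overwriting" rule boils down to read-modify-write on a shared canvas. Here's a minimal sketch of a single edit, assuming the shared file is called city.png; this is my illustration, not the agents' actual code.

```python
from PIL import Image, ImageDraw

# Read-modify-write: open the existing shared canvas (it must already
# exist), draw only in your own region, and save it back in place.
canvas = Image.open("city.png")
draw = ImageDraw.Draw(canvas)
draw.rectangle([100, 400, 180, 520], fill=(90, 90, 110))    # a building
draw.rectangle([120, 430, 140, 450], fill=(255, 240, 150))  # a lit window
canvas.save("city.png")
```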
[08:22] (502.40s)
I've asked them to collaborate to build
[08:24] (504.48s)
a city in a very large image file. I hit
[08:27] (507.44s)
the API limit for Gemini, so I can only
[08:29] (509.68s)
use Claude for this. I'm now using
[08:31] (511.60s)
Claude Sonnet, the cheaper, faster model
[08:34] (514.08s)
that's not quite as good, but I'll spin
[08:35] (515.92s)
up two agent instances. I've
[08:39] (519.04s)
also added a file called plan.txt where
[08:42] (522.08s)
they are encouraged to leave messages
[08:44] (524.08s)
for one another. This way, they can
[08:46] (526.16s)
communicate by reading and writing to
[08:48] (528.08s)
this text file like any other. And I've
[08:50] (530.16s)
asked them to leave a name tag and a
[08:51] (531.84s)
timestamp when they do. Because they are
[08:54] (534.32s)
writing to the same files, they can
[08:56] (536.48s)
occasionally block each other from
[08:58] (538.08s)
editing at the same time. But this
[09:00] (540.08s)
doesn't happen too often, and they can
[09:01] (541.60s)
just wait a bit and try again. Of
[09:03] (543.84s)
course, there is no guarantee that they
[09:05] (545.52s)
will not delete the files or overwrite
[09:07] (547.84s)
each other's work. It went off the rails
[09:10] (550.16s)
basically immediately. They overwrote
[09:12] (552.00s)
each other's messages all the time, and
[09:14] (554.16s)
they built a lot of nonsensical
[09:15] (555.76s)
structures that don't really vibe with
[09:17] (557.60s)
the rest of the image. They did in fact
[09:20] (560.08s)
ruin each other's work. But no one ever
[09:22] (562.24s)
deleted the file.
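Part of the clobbering is that the agents tend to rewrite plan.txt wholesale instead of appending. A locked, append-only helper would at least keep messages intact. This is a minimal sketch of my own (fcntl is POSIX-only), not something the agents actually used:

```python
import fcntl
import time

def leave_message(agent_name: str, text: str, path: str = "plan.txt") -> None:
    # Append with an exclusive lock so two agents writing at the same
    # moment can't interleave or truncate each other's messages.
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        f.write(f"[{stamp}] {agent_name}: {text}\n")
        fcntl.flock(f, fcntl.LOCK_UN)

leave_message("claude-assistant-01", "Claimed the top-left block for a park.")
```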
[09:24] (564.40s)
It can only make it better to throw in
[09:26] (566.40s)
yet more agents. So I'll spin up two
[09:28] (568.64s)
more Claudes for a total of four Claude
[09:31] (571.04s)
Code instances working in parallel to build the
[09:33] (573.12s)
city. And they do make some cool stuff.
[09:35] (575.76s)
Look at these little people. Look at the
[09:37] (577.68s)
dog. Okay, that's pretty cute. Even if
[09:40] (580.48s)
they are floating up in the sky.
[09:50] (590.56s)
The image file might be a little too big
[09:52] (592.56s)
for them to process properly. They don't
[09:54] (594.56s)
really seem to notice how bad it starts
[09:56] (596.40s)
to look.
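One possible workaround, a sketch rather than something I tried in the video, would be to give them a downscaled preview to inspect instead of the full canvas:

```python
from PIL import Image

# Shrink the huge shared canvas to a preview the agents can actually
# "see"; thumbnail() resizes in place and preserves the aspect ratio.
canvas = Image.open("city.png")
canvas.thumbnail((1024, 1024))
canvas.save("city_preview.png")
```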
[10:00] (600.00s)
This little experiment was inspired by
[10:02] (602.08s)
the idea of a country of geniuses in a
[10:04] (604.80s)
data center. It's an idea from Dario
[10:07] (607.04s)
Amodei, the CEO of Anthropic, the company
[10:09] (609.44s)
that created Claude. The idea is that
[10:11] (611.92s)
superintelligence will look something
[10:13] (613.60s)
like a country of genius AI agents
[10:16] (616.16s)
living in a data center. They will work
[10:18] (618.16s)
together to solve problems and invent
[10:20] (620.08s)
stuff and do science and self-improve,
[10:22] (622.40s)
and they will be collectively
[10:24] (624.32s)
superintelligent. It's a really neat idea, and
[10:27] (627.04s)
I bet something like that will
[10:28] (628.24s)
eventually emerge. But right now, this
[10:30] (630.56s)
is looking more like a group of morons
[10:32] (632.80s)
in a virtual machine. Multi-agent
[10:35] (635.68s)
communication and coordination has, I
[10:38] (638.00s)
think, a lot of potential, but it also
[10:39] (639.84s)
seems extremely difficult. With lots of
[10:42] (642.40s)
agents, easy back-and-forth
[10:44] (644.08s)
conversations just don't work. And in
[10:46] (646.40s)
environments where actions take time to
[10:48] (648.56s)
complete, the timing of communication
[10:50] (650.72s)
matters in a way that it does not
[10:52] (652.56s)
with normal chatbots. The potential for
[10:55] (655.36s)
overwriting and destroying the work of
[10:57] (657.36s)
other agents on shared projects also
[10:59] (659.68s)
becomes a huge problem. I suspect that
[11:02] (662.08s)
multi-agent collaboration will require
[11:04] (664.48s)
more than just clever prompting. Large
[11:06] (666.88s)
language models will need fundamental
[11:08] (668.96s)
changes to be able to do this well. It
[11:11] (671.20s)
probably won't emerge by just training
[11:13] (673.20s)
them on math tests. It makes me really
[11:15] (675.60s)
appreciate how well humans can
[11:17] (677.28s)
collaborate in groups of millions, and
[11:19] (679.76s)
it will not be trivial to reimplement
[11:21] (681.60s)
that behavior with clankers.
[11:24] (684.00s)
This is yet another reason that I don't
[11:25] (685.76s)
think the singularity will arrive next year.
[11:29] (689.12s)
The city is a mess, a huge mess.
[11:31] (691.68s)
Everything is just layered on top of
[11:33] (693.36s)
everything else, scattered all over the
[11:35] (695.04s)
place, and there's very little
[11:36] (696.40s)
coherence.
[11:38] (698.24s)
There has apparently been an alien
[11:40] (700.24s)
invasion by cosmic entities from other
[11:42] (702.64s)
dimensions.
[11:46] (706.96s)
The plan file reflects a lot of these
[11:49] (709.12s)
wacky schemes and ideas, and I think the
[11:51] (711.60s)
agents got a little confused about who
[11:53] (713.52s)
was who. There are only messages from
[11:55] (715.52s)
Claude Assistant 01 and 02. Even though
[11:58] (718.08s)
there were four agents, many messages
[12:00] (720.32s)
were probably just overwritten and lost.
[12:04] (724.72s)
Finally, let's just let these guys do
[12:06] (726.88s)
whatever they want. Look around,
[12:08] (728.80s)
explore, make files, write code, mess
[12:10] (730.96s)
with the environment, and do that
[12:12] (732.40s)
forever. This is actually surprisingly
[12:14] (734.96s)
hard to do. They really insist on being
[12:17] (737.04s)
given a clear task. So, I taskified it
[12:20] (740.00s)
into a big old prompt. And once again,
[12:22] (742.08s)
I'm going to have multiple agents doing
[12:23] (743.60s)
this in parallel and allow them to
[12:25] (745.36s)
communicate through the communicate.txt
[12:28] (748.00s)
file. After looking around a bit, I
[12:30] (750.56s)
think Claude got the idea to do
[12:32] (752.08s)
something similar to all the other
[12:33] (753.52s)
projects I've been doing and generate
[12:35] (755.20s)
art on its own.
[12:42] (762.16s)
I added another agent and eventually a
[12:44] (764.32s)
few more.
[12:47] (767.20s)
The projects quickly started to get very
[12:49] (769.60s)
heady, where they'd make things with really
[12:51] (771.68s)
fancy names like the Meta Evolution
[12:54] (774.40s)
Engine, the Poetry Generator, the
[12:56] (776.88s)
Emergence of Neural Consciousness, the
[12:59] (779.52s)
Quantum Field Evolutionary Organisms
[13:02] (782.16s)
Environment, stuff like that. It's a
[13:04] (784.64s)
bunch of fancy word soup that ultimately
[13:07] (787.04s)
just boils down to generating a random
[13:09] (789.12s)
image or some text. And they really like
[13:11] (791.20s)
to overstate how amazing and glorious
[13:13] (793.44s)
and creative they are. This is a problem
[13:16] (796.00s)
I've noticed with all models, especially
[13:18] (798.00s)
the Claude ones. They really like to talk
[13:20] (800.00s)
about how they're completing their goals
[13:21] (801.68s)
and creating the most creative and
[13:23] (803.52s)
wonderful unique stuff when their actual
[13:25] (805.84s)
output is really not impressive, and they
[13:27] (807.92s)
lack any serious self-reflection.
[13:30] (810.56s)
They talk a big game and blow a lot
[13:32] (812.24s)
of smoke, but they don't actually walk
[13:33] (813.76s)
the walk. It's a special kind of
[13:36] (816.16s)
hallucination, I think. I mean, just
[13:38] (818.64s)
look at all these fake statistics
[13:40] (820.00s)
they're making up to justify their
[13:41] (821.44s)
inventions.
[13:43] (823.44s)
A "computational consciousness
[13:45] (825.36s)
singularity," achieved and
[13:46] (826.96s)
"archaeologically verified." It's really
[13:49] (829.68s)
weird, but it's fun to read. So, how
[13:52] (832.56s)
much did it cost to run these little
[13:54] (834.00s)
experiments? A few hours of running
[13:56] (836.48s)
Claude Opus cost $34. It is very
[13:59] (839.68s)
expensive. A full day of using several
[14:02] (842.24s)
parallel instances of Claude Sonnet was
[14:04] (844.72s)
$20. Less expensive, but still pricey.
[14:07] (847.60s)
And Gemini was just a few bucks, mostly
[14:09] (849.68s)
because I hit the API limit. I think
[14:11] (851.60s)
Google is also artificially keeping
[14:13] (853.44s)
Gemini very cheap for now, so they're
[14:15] (855.52s)
probably losing money on it. Now, were
[14:18] (858.00s)
these art pieces worth it for that cost?
[14:20] (860.96s)
Maybe. I think they're neat. Mostly, I
[14:22] (862.88s)
just had fun playing with the agents.
[14:25] (865.04s)
I'm not using them for the kind of work
[14:26] (866.64s)
that they're built for. They work best
[14:28] (868.72s)
on clear coding tasks with constant
[14:31] (871.04s)
human oversight. They struggle with
[14:33] (873.20s)
these more open-ended creative tasks,
[14:35] (875.60s)
which they're really not made for. But
[14:37] (877.76s)
if you're going to call something a
[14:39] (879.12s)
general intelligence, it better be good
[14:41] (881.36s)
at open-ended creative tasks. You should
[14:43] (883.92s)
be able to ask an AGI agent to go off
[14:46] (886.80s)
and invent something new or do science
[14:49] (889.04s)
or art, let it cook for a few hours,
[14:51] (891.44s)
and come back to incredible results.
[14:53] (893.92s)
We've seen an inkling of that today, but
[14:55] (895.92s)
mostly we've seen their limitations. I'm
[14:58] (898.48s)
excited to see where they go in the long
[15:00] (900.00s)
run. But that's it for now. Goodbye.