[00:00] (0.08s)
there's something huge in AI that I'm
[00:02] (2.12s)
shocked more people aren't talking about
[00:04] (4.36s)
and I'll cut straight to it combining
[00:06] (6.56s)
web scraping with AI I think is a
[00:10] (10.72s)
massive potential way to create apps
[00:13] (13.48s)
that weren't possible before and compete
[00:16] (16.52s)
with more established players with much
[00:19] (19.44s)
bigger databases and also build value
[00:22] (22.52s)
from scratch from the web and data
[00:25] (25.12s)
transformations let's talk first about
[00:27] (27.28s)
web scraping a little bit then I'll get
[00:29] (29.40s)
into how to do it correctly not getting
[00:31] (31.96s)
blocked doing it at scale thousands tens
[00:34] (34.76s)
of thousands of requests What
[00:37] (37.04s)
specifically to use and then I will show
[00:39] (39.72s)
you a few example apps I built in just
[00:42] (42.28s)
about one hour each that I think already
[00:44] (44.56s)
have potential to turn into like a B2B
[00:47] (47.72s)
SaaS maybe just as a feature and you
[00:49] (49.68s)
could probably just call them scripts at
[00:51] (51.08s)
this point they're not formalized with a
[00:53] (53.40s)
database and so on so web scraping is
[00:55] (55.48s)
just a way to get data from the internet
[00:58] (58.24s)
but traditionally there have been two
[00:59] (59.60s)
big problems with scraping number one is
[01:02] (62.20s)
scrapers are very brittle they break
[01:05] (65.24s)
often when websites change which as we
[01:07] (67.64s)
know they do all the time and then the
[01:09] (69.32s)
other issue is if you have like multiple
[01:11] (71.16s)
websites you want to scrape let's say
[01:12] (72.72s)
you have a database of 100 companies
[01:15] (75.88s)
and you want to get the same data from
[01:18] (78.12s)
each website like what is their pricing
[01:20] (80.64s)
what is their headline on their site
[01:22] (82.72s)
their logo every page or every site's
[01:25] (85.80s)
HTML is different so how do you actually
[01:28] (88.60s)
deal with that it's a little bit
[01:30] (90.56s)
difficult when it's not standardized so
[01:34] (94.00s)
AI really solves these two issues
[01:36] (96.36s)
because you can feed an LLM unstructured
[01:39] (99.68s)
data which is just any text input and
[01:42] (102.80s)
most people aren't doing this but it can
[01:44] (104.16s)
give you a structured output like JSON
[01:46] (106.72s)
that could be a row in your database for
[01:48] (108.80s)
example and you can use this to build
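The unstructured-in, structured-out pattern just described can be sketched as a prompt builder plus a strict parser. This is a minimal sketch, not the video's code — the field names and prompt wording are my assumptions, and the actual LLM call is stubbed with a hand-written reply:

```javascript
// Sketch of the unstructured-to-structured pattern: ask the model for
// strict JSON with known keys, then validate the reply into a row.
// FIELDS and the prompt wording are assumptions for illustration.

const FIELDS = ["name", "headline", "pricing"];

// Build a prompt that asks the model for strict JSON with known keys.
function buildExtractionPrompt(rawText) {
  return (
    `Extract the following fields from the text below and reply with ` +
    `ONLY a JSON object with keys ${FIELDS.join(", ")}.\n\n` +
    rawText
  );
}

// Parse the model's reply into a row, failing loudly on missing keys.
function parseRow(modelReply) {
  const row = JSON.parse(modelReply);
  for (const f of FIELDS) {
    if (!(f in row)) throw new Error(`missing field: ${f}`);
  }
  return row;
}

// Example with a hand-written "model reply" standing in for the API:
const fakeReply = '{"name":"Acme","headline":"Ship faster","pricing":"$49/mo"}';
console.log(parseRow(fakeReply).name); // Acme
```

Each validated row can then go straight into a database table whose columns match FIELDS.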
[01:51] (111.72s)
entire apps like directories you can
[01:54] (114.00s)
enrich current data like if you have a
[01:56] (116.08s)
database of email leads you can go to
[01:58] (118.96s)
LinkedIn and find more info about people
[02:00] (120.80s)
then save that info or you can build
[02:03] (123.24s)
this as a service just build a huge
[02:05] (125.40s)
database from scraping data and sell
[02:07] (127.04s)
access to the API to companies that's
[02:09] (129.64s)
kind of the high level overview why it's
[02:11] (131.96s)
valuable because data is super valuable
[02:14] (134.48s)
in general let's talk about scraping
[02:17] (137.88s)
actually how do you do it because it's
[02:20] (140.52s)
not super difficult if you know a little
[02:22] (142.56s)
bit of coding and we'll get into that
[02:24] (144.40s)
now so here's an easy way to understand
[02:26] (146.20s)
scraping I call it the levels of
[02:27] (147.60s)
scraping there's three let's start with
[02:29] (149.60s)
level one which is just making a request
[02:32] (152.40s)
in your code to the URL this is just
[02:35] (155.32s)
going to return the markup or HTML of
[02:37] (157.08s)
the page which is not a great option
[02:39] (159.16s)
because first a lot of sites need
[02:40] (160.48s)
JavaScript to even render the content
[02:42] (162.68s)
and second you're not going to have any
[02:43] (163.84s)
page interactions so you're not going to
[02:45] (165.56s)
be able to traverse scroll click on
[02:47] (167.72s)
anything option number two it is a lot
[02:50] (170.36s)
better it's headless browsing and in
[02:52] (172.44s)
fact this is the bread and butter of
[02:54] (174.20s)
your scraping you run a library like
[02:56] (176.08s)
Puppeteer in JavaScript or Selenium in
[02:58] (178.36s)
Python and basically your code is
[03:00] (180.60s)
loading a browser environment and it's
[03:02] (182.88s)
able to do everything you can do
[03:04] (184.36s)
normally in a browser which is
[03:06] (186.00s)
incredible you can click take screenshots
[03:08] (188.12s)
scroll and all the JavaScript will run
[03:10] (190.72s)
which is great but there's one problem
[03:13] (193.20s)
and it is that servers are smart so if
[03:15] (195.64s)
they see a lot of traffic coming from
[03:17] (197.64s)
your IP address your server IP address
[03:19] (199.88s)
or even if they detect this is an IP
[03:22] (202.24s)
address not of a person but of a data
[03:24] (204.72s)
center your request can easily get
[03:27] (207.00s)
blocked and that is where proxies come in
[03:30] (210.36s)
proxies give you a different IP for
[03:32] (212.20s)
every request and you can actually get
[03:34] (214.72s)
residential real IP addresses so there's
[03:38] (218.12s)
no way to tell that your scraper is a
[03:40] (220.56s)
bot it looks exactly like a real user
[03:42] (222.96s)
you can think of a proxy a bit like a
[03:45] (225.28s)
VPN it is going in between your headless
[03:48] (228.76s)
browser and the requests it makes and in
[03:52] (232.72s)
this way you can do parallel requests a
[03:55] (235.40s)
lot of requests back to back and you
[03:57] (237.52s)
don't have to worry about having
[03:59] (239.36s)
problems
[04:00] (240.28s)
with for example Instagram if you're
[04:02] (242.28s)
scraping different pages but the big
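Wiring a proxy into a headless browser typically means a launch flag plus per-page authentication. A rough sketch, assuming an HTTP gateway with username/password auth — the placeholder host and port are not real credentials, your provider's dashboard has the actual values:

```javascript
// Rough sketch of wiring a residential proxy into a headless browser.
// The gateway host/port below are placeholders, not a real endpoint.

function proxyLaunchArgs({ host, port }) {
  // Chromium routes all traffic through this flag.
  return [`--proxy-server=http://${host}:${port}`];
}

function proxyCredentials({ user, pass }) {
  // Shape expected by page.authenticate() in Puppeteer.
  return { username: user, password: pass };
}

const args = proxyLaunchArgs({ host: "gw.example-proxy.com", port: 823 });
console.log(args[0]); // --proxy-server=http://gw.example-proxy.com:823

// With Puppeteer (not run here), it would look roughly like:
//   const browser = await puppeteer.launch({ args });
//   const page = await browser.newPage();
//   await page.authenticate(proxyCredentials({ user: "u", pass: "p" }));
```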
[04:04] (244.76s)
question is how do you get a proxy and
[04:06] (246.56s)
most importantly how do you get one with
[04:08] (248.12s)
residential IPs well shout out to
[04:10] (250.88s)
DataImpulse for sponsoring this video it's a
[04:13] (253.60s)
great product that with only three lines
[04:15] (255.60s)
of code allows you to run a proxy and it
[04:19] (259.20s)
super easily integrates with Puppeteer
[04:21] (261.44s)
Selenium etc it's super affordable 10
[04:24] (264.72s)
times cheaper than using a scraping
[04:26] (266.56s)
service and you can set the locations of
[04:29] (269.48s)
your IP addresses and similar I just
[04:31] (271.60s)
want to show you the difference between
[04:33] (273.36s)
scraping with a service like Apify
[04:35] (275.88s)
compared to writing your own scraper and
[04:37] (277.84s)
using a proxy and the cost differences
[04:40] (280.20s)
are actually quite substantial so in
[04:42] (282.96s)
this specific row here that you can see
[04:46] (286.80s)
right here you can see that I scraped a
[04:48] (288.76s)
single profile with 15 reels and this
[04:51] (291.92s)
cost me basically 3 and a half cents to do so
[04:55] (295.88s)
that doesn't seem like a lot but if you
[04:57] (297.36s)
see all the requests I'm doing and I'm
[04:59] (299.84s)
not even at a huge scale with my app you
[05:02] (302.28s)
can imagine this would get quite
[05:04] (304.36s)
expensive let's compare this to
[05:06] (306.96s)
DataImpulse where in this request let's see
[05:10] (310.00s)
I have a bunch here that cost me nothing
[05:11] (311.92s)
but this one where I scraped a full
[05:13] (313.84s)
profile real thumbnails and similar it
[05:17] (317.20s)
was 4 megabytes and that cost me 0.4 cents
[05:21] (321.48s)
so actually 10 times less to do my own
[05:25] (325.00s)
scraper okay so before I show you the
[05:27] (327.48s)
first app that I've built let's just
[05:29] (329.96s)
look and if we go to my plan we can see
[05:32] (332.56s)
all my credentials are right here and I
[05:34] (334.92s)
can easily get some starter code with the
[05:37] (337.88s)
documentation or tutorials and if I'm
[05:40] (340.04s)
using like Puppeteer it'll give me a
[05:41] (341.80s)
full Puppeteer example I can start with
[05:44] (344.12s)
which is what I did for these uh mini
[05:46] (346.24s)
apps you can set your specific countries
[05:49] (349.40s)
sites can be different depending where
[05:50] (350.68s)
you're visiting them from or be blocked
[05:53] (353.20s)
so that can be quite important and then
[05:55] (355.44s)
scrolling down you can configure it
[05:58] (358.16s)
further and also get more proxies
[06:00] (360.80s)
if you need them so with all that said
[06:02] (362.84s)
let's jump over to the code this is the
[06:06] (366.48s)
app that's scraping Instagram profiles
[06:09] (369.20s)
and it's getting the stats from all the
[06:11] (371.12s)
reels every day so we can have kind of a
[06:13] (373.32s)
time series view of how a given profile
[06:16] (376.44s)
is changing over time how many views are
[06:18] (378.28s)
they getting or if you want to look at a
[06:20] (380.00s)
specific post that you collaborated on
[06:22] (382.28s)
for example you can see how that post is
[06:25] (385.40s)
doing so let's just run through things
[06:27] (387.88s)
like pretty quick and I'll do it in
[06:29] (389.32s)
blocks here we're just setting up our
[06:32] (392.20s)
proxy chain which is basically our loop
[06:35] (395.48s)
of proxies we're going to go through
[06:37] (397.04s)
with our DataImpulse credentials and
[06:38] (398.76s)
these are basically copied from the
[06:40] (400.40s)
documentation going down I can do
[06:42] (402.84s)
multiple usernames and then basically
[06:45] (405.56s)
here we've got Tech with Tim's Instagram
[06:47] (407.72s)
and then we are mapping those into an
[06:49] (409.44s)
array of URLs going down we're here
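The username-to-URL mapping described here is a one-liner; the /reels/ path reflects Instagram's public URL structure at the time of recording and may change:

```javascript
// Map a list of usernames to their public reels-tab URLs.
// The /reels/ path is an assumption based on Instagram's current
// URL structure and could change.

function reelsUrls(usernames) {
  return usernames.map((u) => `https://www.instagram.com/${u}/reels/`);
}

console.log(reelsUrls(["techwithtim"]));
// [ 'https://www.instagram.com/techwithtim/reels/' ]
```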
[06:52] (412.96s)
launching our scraper Puppeteer in
[06:55] (415.52s)
headless mode false for development
[06:57] (417.80s)
that'll just show what the scraper is
[06:59] (419.76s)
doing so we can see it debug it but in
[07:02] (422.68s)
production you turn this to headless
[07:04] (424.72s)
true and then just scraping code here
[07:06] (426.92s)
long story short we're opening the page
[07:08] (428.92s)
waiting for it to load waiting for
[07:10] (430.88s)
specific elements on the page because it
[07:12] (432.68s)
can load in pieces and then we are
[07:15] (435.68s)
selecting the whole header let me uh
[07:18] (438.04s)
show you what that looks like on
[07:19] (439.88s)
Instagram and how I like kind of
[07:22] (442.00s)
determine which element to select so if
[07:25] (445.52s)
I go to console here I can just do the
[07:28] (448.28s)
selector on this element and we can see
[07:30] (450.76s)
it is the header type element so of
[07:34] (454.76s)
course you can feed in the whole page
[07:36] (456.92s)
markup but I think header is pretty
[07:39] (459.80s)
reliably still going to be there of
[07:41] (461.52s)
course it can still break but this is
[07:43] (463.88s)
like a container element rather than a
[07:45] (465.76s)
class so I'm feeling more confident
[07:47] (467.88s)
about that so back over here we're
[07:50] (470.56s)
selecting that entire header and then
[07:52] (472.84s)
saving header content in this variable
[07:56] (476.04s)
all the HTML and then we are looking at
[07:58] (478.64s)
the reels one at a time because I'm
[08:00] (480.84s)
actually going to the reels page I
[08:02] (482.92s)
don't have reels personally but there's
[08:04] (484.68s)
a tab here so if we go to like Tech with
[08:07] (487.20s)
Tim and then this is a standardized sort
[08:10] (490.08s)
of URL structure username/reels it's
[08:14] (494.12s)
going to this page which you'll see when
[08:16] (496.12s)
we run the scraper then is pulling all
[08:17] (497.92s)
these stats so we can see right here
[08:19] (499.40s)
we're displaying likes comments and
[08:21] (501.72s)
views so that'll be saved we're actually
[08:24] (504.60s)
explicitly selecting each reel container
[08:27] (507.84s)
with the URL that it links to so
[08:30] (510.24s)
there's unique URLs on each one of these
[08:31] (511.88s)
cards but I mainly wanted to show you
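Selecting each reel card by the unique URL it links to could be approximated like this on a raw HTML string — a simplified stand-in, since the real scraper does this inside the browser with Puppeteer selectors:

```javascript
// Pull the unique reel URLs out of page markup. A regex stand-in for
// the in-browser selector approach described in the video; the href
// shapes in the sample are assumptions about the markup.

function extractReelLinks(html) {
  const matches = html.match(/href="([^"]*\/reel\/[^"]+)"/g) || [];
  // Deduplicate and strip the href=" wrapper.
  return [...new Set(matches.map((m) => m.slice(6, -1)))];
}

const sample =
  '<a href="/techwithtim/reel/abc123/"></a><a href="/techwithtim/reel/abc123/"></a>';
console.log(extractReelLinks(sample)); // [ '/techwithtim/reel/abc123/' ]
```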
[08:33] (513.60s)
this down here analyze headers so we're
[08:36] (516.44s)
running this code I have in a different
[08:37] (517.80s)
file analyze header which is our API
[08:40] (520.44s)
call to OpenAI and here's my prompt I'm
[08:43] (523.56s)
just saying here's some HTML please give
[08:46] (526.16s)
it to me back in this structure
[08:47] (527.68s)
followers following link and bio and
[08:50] (530.80s)
then with the response we are doing a
[08:54] (534.12s)
little bit of code on it because I
[08:55] (535.76s)
noticed often with OpenAI they return
[08:58] (538.08s)
like markdown-formatted JSON so with
[09:00] (540.32s)
the three backticks and then the word JSON
[09:02] (542.80s)
so we're just replacing that with empty
[09:04] (544.36s)
string and then even if it fails we're
[09:07] (547.12s)
just running the prompt again one time
[09:08] (548.88s)
and then it can still fail after that in
[09:11] (551.28s)
which case you'd want to have like a
[09:12] (552.28s)
fallback strategy for this so that is
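The two hardening tricks just mentioned — stripping the markdown fences around the JSON and retrying the prompt once — can be sketched as a small wrapper. The model call is passed in as a function so the sketch runs without an API key:

```javascript
// Response hardening: strip the ```json fences the model sometimes
// wraps around output, and retry the call once before giving up.

function stripJsonFences(text) {
  // Turns "```json\n{...}\n```" into "{...}".
  return text.replace(/```json/g, "").replace(/```/g, "").trim();
}

async function parseWithRetry(callModel) {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      return JSON.parse(stripJsonFences(await callModel()));
    } catch (err) {
      if (attempt === 1) throw err; // caller needs a fallback strategy
    }
  }
}

// Demo with a stub that fails once, then returns fenced JSON:
let calls = 0;
const stub = async () => (++calls === 1 ? "garbage" : '```json\n{"ok":true}\n```');
parseWithRetry(stub).then((r) => console.log(r.ok)); // true
```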
[09:15] (555.76s)
the long and short of the code let's
[09:17] (557.44s)
actually run this and hopefully it works
[09:20] (560.48s)
so we're just running node on our main
[09:22] (562.20s)
JS and we're running in headless false
[09:24] (564.84s)
so we can see the page come up and there
[09:26] (566.72s)
it is and we can see that everything
[09:29] (569.40s)
printed so here we got followers 23k
[09:33] (573.88s)
following 212 bio and then the link and
[09:37] (577.96s)
then of course we have stats on each
[09:39] (579.44s)
reel and the URL so of course there's a
[09:41] (581.72s)
lot more we can do with this we can
[09:43] (583.00s)
download videos we can monitor for
[09:45] (585.68s)
changes like when did they post a new
[09:47] (587.24s)
reel and then of course we can just do
[09:48] (588.96s)
this time series statistics tracking as
[09:51] (591.76s)
like a business insight all right next
[09:53] (593.48s)
one I'll run through this one really
[09:54] (594.84s)
fast I promise because the setup is
[09:56] (596.96s)
pretty similar we are comparing
[09:58] (598.68s)
screenshots
[09:59] (599.84s)
every day on a given website that we
[10:01] (601.92s)
feed in we can have 100 a thousand of
[10:04] (604.20s)
these websites and just visit them once
[10:06] (606.12s)
every day take a screenshot compare it
[10:07] (607.88s)
to the screenshot from yesterday and
[10:09] (609.72s)
then AI will tell us did the website
[10:11] (611.84s)
change if so what changed then you can
[10:13] (613.88s)
make it prompt more specific tell me if
[10:15] (615.92s)
the price changed tell me if the
[10:18] (618.04s)
headline changed etc okay so running
[10:20] (620.92s)
through the file we have this code that
[10:23] (623.68s)
is saving it's generating a file name
[10:26] (626.72s)
for each URL then we have our standard
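A filename-per-URL helper along these lines keeps each site's screenshot name stable between runs so today's image can be diffed against yesterday's (the exact naming scheme here is my own sketch, not the video's code):

```javascript
// Derive a stable, filesystem-safe screenshot name from a URL so the
// same site always maps to the same file between daily runs.

function screenshotFilename(url) {
  return (
    url
      .replace(/^https?:\/\//, "") // drop the scheme
      .replace(/[^a-zA-Z0-9]+/g, "_") // everything unsafe becomes _
      .replace(/_+$/, "") + ".png"
  );
}

console.log(screenshotFilename("https://example.com/pricing"));
// example_com_pricing.png
```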
[10:29] (629.44s)
proxy setup starting proxy and then we
[10:32] (632.36s)
are reading that local file to see if it
[10:34] (634.44s)
exists if not we just take the
[10:36] (636.08s)
screenshot we do the first comparison
[10:37] (637.60s)
tomorrow we're launching Puppeteer again
[10:40] (640.36s)
headless false for example purposes and
[10:44] (644.36s)
then waiting for the page to load taking
[10:46] (646.32s)
screenshots super easy with this method
[10:48] (648.56s)
in Puppeteer and importantly we're
[10:50] (650.28s)
saving it into memory so we can feed it
[10:52] (652.12s)
into OpenAI through the API now closing
[10:55] (655.44s)
the browser closing the proxy and then
[10:57] (657.88s)
we're running this compare images code
[10:59] (659.88s)
which I have written here just tell me
[11:02] (662.20s)
what's the difference between these two
[11:03] (663.24s)
images it's a starter prompt that can be
[11:05] (665.44s)
further modified feeding in the two
[11:07] (667.40s)
images in base64 which is just a string
[11:10] (670.88s)
format we'll get back a response and
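Feeding two base64 screenshots to the model means building a vision-style chat message. This sketch follows OpenAI's image_url content format for chat completions; the prompt text is a placeholder:

```javascript
// Build a two-image comparison message in the shape the OpenAI chat
// API expects for vision input: base64 PNGs embedded as data URLs.

function buildCompareMessage(prompt, pngBufferA, pngBufferB) {
  const toDataUrl = (buf) =>
    `data:image/png;base64,${buf.toString("base64")}`;
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "image_url", image_url: { url: toDataUrl(pngBufferA) } },
      { type: "image_url", image_url: { url: toDataUrl(pngBufferB) } },
    ],
  };
}

const msg = buildCompareMessage(
  "What changed between these screenshots?",
  Buffer.from("fake-yesterday"),
  Buffer.from("fake-today")
);
console.log(msg.content.length); // 3
```

In the real app the buffers come from Puppeteer's page.screenshot, and msg is passed in the messages array of the chat completions request.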
[11:13] (673.00s)
then we have some checks here for
[11:14] (674.88s)
varying response types we can modify
[11:17] (677.04s)
that further for more reliability and
[11:20] (680.24s)
then yeah just returning the result and
[11:21] (681.96s)
printing it are we printing it yeah
[11:23] (683.84s)
we're printing okay let's run this one
[11:26] (686.40s)
and see if it works so here's the fremo
[11:28] (688.28s)
website didn't find an image so we saved
[11:31] (691.12s)
it now let's run it again to see the
[11:33] (693.56s)
comparison it's going to be the same but
[11:35] (695.72s)
let's just see what happens changes
[11:37] (697.36s)
false save the new file so we're all set
[11:40] (700.48s)
and once again can run this daily
[11:43] (703.84s)
thousands of companies monitor for
[11:45] (705.64s)
changes could be a cool app this kernel
[11:48] (708.32s)
could be modified to something even more
[11:50] (710.52s)
interesting powerful etc okay guys that
[11:53] (713.12s)
is what I wanted to share in this video
[11:54] (714.92s)
If you haven't had your AI home run app
[11:57] (717.52s)
yet hopefully this can give you some
[11:58] (718.64s)
inspiration
[11:59] (719.84s)
maybe to add to what you're working on
[12:01] (721.12s)
maybe just to do a side project but me
[12:03] (723.88s)
personally I think this is really cool
[12:05] (725.40s)
got to use the proxy if you're doing
[12:06] (726.80s)
things seriously and DataImpulse I
[12:09] (729.64s)
actually do use them as well as you know
[12:12] (732.00s)
working with them on this video so I
[12:13] (733.92s)
hope you saw how easy it was and yeah
[12:17] (737.00s)
just a couple lines of code to get set
[12:19] (739.24s)
up less than a cent per full page load
[12:22] (742.48s)
so pretty good last thing I want to say
[12:24] (744.20s)
and if you're sharp maybe you caught on
[12:26] (746.48s)
to this so when you're feeding a lot of
[12:28] (748.84s)
requests into AI you're probably
[12:31] (751.60s)
thinking wow that's got to get pretty
[12:33] (753.56s)
expensive if you're actually doing
[12:35] (755.56s)
Enterprise scraping scale to create like
[12:38] (758.20s)
a million row database and that's very
[12:40] (760.48s)
true but I have a very interesting video
[12:43] (763.24s)
in the works for you running models on
[12:46] (766.36s)
your laptop locally so even if you have
[12:48] (768.84s)
a production app as long as your
[12:50] (770.88s)
computer is on you know at a specified
[12:53] (773.80s)
time then you can run things locally on
[12:56] (776.72s)
your machine and save hundreds thousands
[12:58] (778.44s)
of dollars on OpenAI billing and I
[13:00] (780.64s)
think this is an amazing use case for a
[13:02] (782.40s)
local LLM so I'm going to show you how
[13:04] (784.08s)
to do that in the next video I hope
[13:06] (786.64s)
you'll stick around and maybe subscribe
[13:08] (788.28s)
if you made it to the end and catch you
[13:10] (790.28s)
guys in the next one