[00:00] (0.08s)
there's something huge in AI that I'm
[00:02] (2.12s)
shocked more people aren't talking about
[00:04] (4.36s)
and I'll cut straight to it combining
[00:06] (6.56s)
web scraping with AI I think is a
[00:10] (10.72s)
massive potential way to create apps
[00:13] (13.48s)
that weren't possible before and compete
[00:16] (16.52s)
with more established players with much
[00:19] (19.44s)
bigger databases and also build value
[00:22] (22.52s)
from scratch from the web and data
[00:25] (25.12s)
transformations let's talk first about
[00:27] (27.28s)
web scraping a little bit then I'll get
[00:29] (29.40s)
into how to do it correctly not getting
[00:31] (31.96s)
blocked doing it at scale thousands tens
[00:34] (34.76s)
of thousands of requests What
[00:37] (37.04s)
specifically to use and then I will show
[00:39] (39.72s)
you a few example apps I built in just
[00:42] (42.28s)
about one hour each that I think already
[00:44] (44.56s)
have potential to turn into like a B2B
[00:47] (47.72s)
SaaS maybe just as a feature and you
[00:49] (49.68s)
could probably just call them scripts at
[00:51] (51.08s)
this point they're not formalized with a
[00:53] (53.40s)
database and so on so web scraping is
[00:55] (55.48s)
just a way to get data from the internet
[00:58] (58.24s)
but traditionally there have been two
[00:59] (59.60s)
big problems with scraping number one is
[01:02] (62.20s)
scrapers are very brittle they break
[01:05] (65.24s)
often when websites change which as we
[01:07] (67.64s)
know they do all the time and then the
[01:09] (69.32s)
other issue is if you have like multiple
[01:11] (71.16s)
websites you want to scrape let's say
[01:12] (72.72s)
you have a database of 100 companies
[01:15] (75.88s)
and you want to get the same data from
[01:18] (78.12s)
each website like what is their pricing
[01:20] (80.64s)
what is their headline on their site
[01:22] (82.72s)
their logo every page or every site's
[01:25] (85.80s)
HTML is different so how do you actually
[01:28] (88.60s)
deal with that it's a little bit
[01:30] (90.56s)
difficult when it's not standardized so
[01:34] (94.00s)
AI really solves these two issues
[01:36] (96.36s)
because you can feed an LLM unstructured
[01:39] (99.68s)
data which is just any text input and
[01:42] (102.80s)
most people aren't doing this but it can
[01:44] (104.16s)
give you a structured output like JSON
[01:46] (106.72s)
that could be a row in your database for
[01:48] (108.80s)
example and you can use this to build
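The unstructured-in, structured-out pattern just described can be sketched as a prompt builder plus a strict parser. This is a minimal sketch, not the video's code — the field names and prompt wording are my assumptions, and the actual LLM call is stubbed with a hand-written reply:

```javascript
// Sketch of the unstructured-to-structured pattern: ask the model for
// strict JSON with known keys, then validate the reply into a row.
// FIELDS and the prompt wording are assumptions for illustration.

const FIELDS = ["name", "headline", "pricing"];

// Build a prompt that asks the model for strict JSON with known keys.
function buildExtractionPrompt(rawText) {
  return (
    `Extract the following fields from the text below and reply with ` +
    `ONLY a JSON object with keys ${FIELDS.join(", ")}.\n\n` +
    rawText
  );
}

// Parse the model's reply into a row, failing loudly on missing keys.
function parseRow(modelReply) {
  const row = JSON.parse(modelReply);
  for (const f of FIELDS) {
    if (!(f in row)) throw new Error(`missing field: ${f}`);
  }
  return row;
}

// Example with a hand-written "model reply" standing in for the API:
const fakeReply = '{"name":"Acme","headline":"Ship faster","pricing":"$49/mo"}';
console.log(parseRow(fakeReply).name); // Acme
```

Each validated row can then go straight into a database table whose columns match FIELDS.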
[01:51] (111.72s)
entire apps like directories you can
[01:54] (114.00s)
enrich current data like if you have a
[01:56] (116.08s)
database of email leads you can go to
[01:58] (118.96s)
LinkedIn and find more info about people
[02:00] (120.80s)
then save that info or you can build
[02:03] (123.24s)
this as a service just build a huge
[02:05] (125.40s)
database from scraping data and sell
[02:07] (127.04s)
access to the API to companies that's
[02:09] (129.64s)
kind of the high level overview why it's
[02:11] (131.96s)
valuable because data is super valuable
[02:14] (134.48s)
in general let's talk about scraping
[02:17] (137.88s)
actually how do you do it because it's
[02:20] (140.52s)
not super difficult if you know a little
[02:22] (142.56s)
bit of coding and we'll get into that
[02:24] (144.40s)
now so here's an easy way to understand
[02:26] (146.20s)
scraping I call it the levels of
[02:27] (147.60s)
scraping there's three let's start with
[02:29] (149.60s)
level one which is just making a request
[02:32] (152.40s)
in your code to the URL this is just
[02:35] (155.32s)
going to return the markup or HTML of
[02:37] (157.08s)
the page which is not a great option
[02:39] (159.16s)
because first a lot of sites need
[02:40] (160.48s)
JavaScript to even render the content
[02:42] (162.68s)
and second you're not going to have any
[02:43] (163.84s)
page interactions so you're not going to
[02:45] (165.56s)
be able to traverse scroll click on
[02:47] (167.72s)
anything option number two it is a lot
[02:50] (170.36s)
better it's headless browsing and in
[02:52] (172.44s)
fact this is the bread and butter of
[02:54] (174.20s)
your scraping you run a library like
[02:56] (176.08s)
Puppeteer in JavaScript or Selenium in
[02:58] (178.36s)
Python and basically your code is
[03:00] (180.60s)
loading a browser environment and it's
[03:02] (182.88s)
able to do everything you can do
[03:04] (184.36s)
normally in a browser which is
[03:06] (186.00s)
incredible you can click take screenshots
[03:08] (188.12s)
scroll and all the JavaScript will run
[03:10] (190.72s)
which is great but there's one problem
[03:13] (193.20s)
and it is that servers are smart so if
[03:15] (195.64s)
they see a lot of traffic coming from
[03:17] (197.64s)
your IP address your server IP address
[03:19] (199.88s)
or even if they detect this is an IP
[03:22] (202.24s)
address not of a person but of a data
[03:24] (204.72s)
center your request can easily get
[03:27] (207.00s)
blocked and that is where proxies come in
[03:30] (210.36s)
proxies give you a different IP for
[03:32] (212.20s)
every request and you can actually get
[03:34] (214.72s)
residential real IP addresses so there's
[03:38] (218.12s)
no way to tell that your scraper is a
[03:40] (220.56s)
bot it looks exactly like a real user
[03:42] (222.96s)
you can think of a proxy a bit like a
[03:45] (225.28s)
VPN it is going in between your headless
[03:48] (228.76s)
browser and the requests it makes and in
[03:52] (232.72s)
this way you can do parallel requests a
[03:55] (235.40s)
lot of requests back to back and you
[03:57] (237.52s)
don't have to worry about having
[03:59] (239.36s)
problems
[04:00] (240.28s)
with for example Instagram if you're
[04:02] (242.28s)
scraping different pages but the big
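Wiring a proxy into a headless browser typically means a launch flag plus per-page authentication. A rough sketch, assuming an HTTP gateway with username/password auth — the placeholder host and port are not real credentials, your provider's dashboard has the actual values:

```javascript
// Rough sketch of wiring a residential proxy into a headless browser.
// The gateway host/port below are placeholders, not a real endpoint.

function proxyLaunchArgs({ host, port }) {
  // Chromium routes all traffic through this flag.
  return [`--proxy-server=http://${host}:${port}`];
}

function proxyCredentials({ user, pass }) {
  // Shape expected by page.authenticate() in Puppeteer.
  return { username: user, password: pass };
}

const args = proxyLaunchArgs({ host: "gw.example-proxy.com", port: 823 });
console.log(args[0]); // --proxy-server=http://gw.example-proxy.com:823

// With Puppeteer (not run here), it would look roughly like:
//   const browser = await puppeteer.launch({ args });
//   const page = await browser.newPage();
//   await page.authenticate(proxyCredentials({ user: "u", pass: "p" }));
```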
[04:04] (244.76s)
question is how do you get a proxy and
[04:06] (246.56s)
most importantly how do you get one with
[04:08] (248.12s)
residential IPs well shout out to
[04:10] (250.88s)
DataImpulse for sponsoring this video it's a
[04:13] (253.60s)
great product that with only three lines
[04:15] (255.60s)
of code allows you to run a proxy and it
[04:19] (259.20s)
super easily integrates with Puppeteer
[04:21] (261.44s)
Selenium etc it's super affordable 10
[04:24] (264.72s)
times cheaper than using a scraping
[04:26] (266.56s)
service and you can set the locations of
[04:29] (269.48s)
your IP addresses and similar I just
[04:31] (271.60s)
want to show you the difference between
[04:33] (273.36s)
scraping with a service like Apify
[04:35] (275.88s)
compared to writing your own scraper and
[04:37] (277.84s)
using a proxy and the cost differences
[04:40] (280.20s)
are actually quite substantial so in
[04:42] (282.96s)
this specific row here that you can see
[04:46] (286.80s)
right here you can see that I scraped a
[04:48] (288.76s)
single profile with 15 reels and this
[04:51] (291.92s)
cost me basically 3 and a half cents to do so
[04:55] (295.88s)
that doesn't seem like a lot but if you
[04:57] (297.36s)
see all the requests I'm doing and I'm
[04:59] (299.84s)
not even at a huge scale with my app you
[05:02] (302.28s)
can imagine this would get quite
[05:04] (304.36s)
expensive let's compare this to
[05:06] (306.96s)
DataImpulse where in this request let's see
[05:10] (310.00s)
I have a bunch here that cost me nothing
[05:11] (311.92s)
but this one where I scraped a full
[05:13] (313.84s)
profile real thumbnails and similar it
[05:17] (317.20s)
was 4 megabytes and that cost me 0.4 cents
[05:21] (321.48s)
so actually 10 times less to do my own
[05:25] (325.00s)
scraper okay so before I show you the
[05:27] (327.48s)
first app that I've built let's just
[05:29] (329.96s)
look and if we go to my plan we can see
[05:32] (332.56s)
all my credentials are right here and I
[05:34] (334.92s)
can easily get some starter code with the
[05:37] (337.88s)
documentation or tutorials and if I'm
[05:40] (340.04s)
using like Puppeteer it'll give me a
[05:41] (341.80s)
full Puppeteer example I can start with
[05:44] (344.12s)
which is what I did for these uh mini
[05:46] (346.24s)
apps you can set your specific countries
[05:49] (349.40s)
sites can be different depending where
[05:50] (350.68s)
you're visiting them from or be blocked
[05:53] (353.20s)
so that can be quite important and then
[05:55] (355.44s)
scrolling down you can configure it
[05:58] (358.16s)
further and also get more proxies
[06:00] (360.80s)
if you need them so with all that said
[06:02] (362.84s)
let's jump over to the code this is the
[06:06] (366.48s)
app that's scraping Instagram profiles
[06:09] (369.20s)
and it's getting the stats from all the
[06:11] (371.12s)
reels every day so we can have kind of a
[06:13] (373.32s)
time series view of how a given profile
[06:16] (376.44s)
is changing over time how many views are
[06:18] (378.28s)
they getting or if you want to look at a
[06:20] (380.00s)
specific post that you collaborated on
[06:22] (382.28s)
for example you can see how that post is
[06:25] (385.40s)
doing so let's just run through things
[06:27] (387.88s)
like pretty quick and I'll do it in
[06:29] (389.32s)
blocks here we're just setting up our
[06:32] (392.20s)
proxy chain which is basically our loop
[06:35] (395.48s)
of proxies we're going to go through
[06:37] (397.04s)
with our DataImpulse credentials and
[06:38] (398.76s)
these are basically copied from the
[06:40] (400.40s)
documentation going down I can do
[06:42] (402.84s)
multiple usernames and then basically
[06:45] (405.56s)
here we've got Tech with Tim's Instagram
[06:47] (407.72s)
and then we are mapping those into an
[06:49] (409.44s)
array of URLs going down we're here
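The username-to-URL mapping described here is a one-liner; the /reels/ path reflects Instagram's public URL structure at the time of recording and may change:

```javascript
// Map a list of usernames to their public reels-tab URLs.
// The /reels/ path is an assumption based on Instagram's current
// URL structure and could change.

function reelsUrls(usernames) {
  return usernames.map((u) => `https://www.instagram.com/${u}/reels/`);
}

console.log(reelsUrls(["techwithtim"]));
// [ 'https://www.instagram.com/techwithtim/reels/' ]
```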
[06:52] (412.96s)
launching our scraper Puppeteer in
[06:55] (415.52s)
headless mode false for development
[06:57] (417.80s)
that'll just show what the scraper is
[06:59] (419.76s)
doing so we can see it debug it but in
[07:02] (422.68s)
production you turn this to headless
[07:04] (424.72s)
true and then just scraping code here
[07:06] (426.92s)
long story short we're opening the page
[07:08] (428.92s)
waiting for it to load waiting for
[07:10] (430.88s)
specific elements on the page because it
[07:12] (432.68s)
can load in pieces and then we are
[07:15] (435.68s)
selecting the whole header let me uh
[07:18] (438.04s)
show you what that looks like on
[07:19] (439.88s)
Instagram and how I like kind of
[07:22] (442.00s)
determine which element to select so if
[07:25] (445.52s)
I go to console here I can just do the
[07:28] (448.28s)
selector on this element and we can see
[07:30] (450.76s)
it is the header type element so of
[07:34] (454.76s)
course you can feed in the whole page
[07:36] (456.92s)
markup but I think header is pretty
[07:39] (459.80s)
reliably still going to be there of
[07:41] (461.52s)
course it can still break but this is
[07:43] (463.88s)
like a container element rather than a
[07:45] (465.76s)
class so I'm feeling more confident
[07:47] (467.88s)
about that so back over here we're
[07:50] (470.56s)
selecting that entire header and then
[07:52] (472.84s)
saving header content in this variable
[07:56] (476.04s)
all the HTML and then we are looking at
[07:58] (478.64s)
the reels one at a time because I'm
[08:00] (480.84s)
actually going to the reels page I
[08:02] (482.92s)
don't have reels personally but there's
[08:04] (484.68s)
a tab here so if we go to like Tech with
[08:07] (487.20s)
Tim and then this is a standardized sort
[08:10] (490.08s)
of URL structure username/reels it's
[08:14] (494.12s)
going to this page which you'll see when
[08:16] (496.12s)
we run the scraper then is pulling all
[08:17] (497.92s)
these stats so we can see right here
[08:19] (499.40s)
we're displaying likes comments and
[08:21] (501.72s)
views so that'll be saved we're actually
[08:24] (504.60s)
explicitly selecting each reel container
[08:27] (507.84s)
with the URL that it links to so
[08:30] (510.24s)
there's unique URLs on each one of these
[08:31] (511.88s)
cards but I mainly wanted to show you
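Selecting each reel card by the unique URL it links to could be approximated like this on a raw HTML string — a simplified stand-in, since the real scraper does this inside the browser with Puppeteer selectors:

```javascript
// Pull the unique reel URLs out of page markup. A regex stand-in for
// the in-browser selector approach described in the video; the href
// shapes in the sample are assumptions about the markup.

function extractReelLinks(html) {
  const matches = html.match(/href="([^"]*\/reel\/[^"]+)"/g) || [];
  // Deduplicate and strip the href=" wrapper.
  return [...new Set(matches.map((m) => m.slice(6, -1)))];
}

const sample =
  '<a href="/techwithtim/reel/abc123/"></a><a href="/techwithtim/reel/abc123/"></a>';
console.log(extractReelLinks(sample)); // [ '/techwithtim/reel/abc123/' ]
```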
[08:33] (513.60s)
this down here analyze headers so we're
[08:36] (516.44s)
running this code I have in a different
[08:37] (517.80s)
file analyze header which is our API
[08:40] (520.44s)
call to OpenAI and here's my prompt I'm
[08:43] (523.56s)
just saying here's some HTML please give
[08:46] (526.16s)
it to me back in this structure
[08:47] (527.68s)
followers following link and bio and
[08:50] (530.80s)
then with the response we are doing a
[08:54] (534.12s)
little bit of code on it because I
[08:55] (535.76s)
noticed often with OpenAI they return
[08:58] (538.08s)
like markdown-formatted JSON so with
[09:00] (540.32s)
the three backticks and then the word JSON
[09:02] (542.80s)
so we're just replacing that with empty
[09:04] (544.36s)
string and then even if it fails we're
[09:07] (547.12s)
just running the prompt again one time
[09:08] (548.88s)
and then it can still fail after that in
[09:11] (551.28s)
which case you'd want to have like a
[09:12] (552.28s)
fallback strategy for this so that is
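The two hardening tricks just mentioned — stripping the markdown fences around the JSON and retrying the prompt once — can be sketched as a small wrapper. The model call is passed in as a function so the sketch runs without an API key:

```javascript
// Response hardening: strip the ```json fences the model sometimes
// wraps around output, and retry the call once before giving up.

function stripJsonFences(text) {
  // Turns "```json\n{...}\n```" into "{...}".
  return text.replace(/```json/g, "").replace(/```/g, "").trim();
}

async function parseWithRetry(callModel) {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      return JSON.parse(stripJsonFences(await callModel()));
    } catch (err) {
      if (attempt === 1) throw err; // caller needs a fallback strategy
    }
  }
}

// Demo with a stub that fails once, then returns fenced JSON:
let calls = 0;
const stub = async () => (++calls === 1 ? "garbage" : '```json\n{"ok":true}\n```');
parseWithRetry(stub).then((r) => console.log(r.ok)); // true
```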
[09:15] (555.76s)
the long and short of the code let's
[09:17] (557.44s)
actually run this and hopefully it works
[09:20] (560.48s)
so we're just running node on our main
[09:22] (562.20s)
JS and we're running in headless false
[09:24] (564.84s)
so we can see the page come up and there
[09:26] (566.72s)
it is and we can see that everything
[09:29] (569.40s)
printed so here we got followers 23k
[09:33] (573.88s)
following 212 bio and then the link and
[09:37] (577.96s)
then of course we have stats on each
[09:39] (579.44s)
reel and the URL so of course there's a
[09:41] (581.72s)
lot more we can do with this we can
[09:43] (583.00s)
download videos we can monitor for
[09:45] (585.68s)
changes like when did they post a new
[09:47] (587.24s)
reel and then of course we can just do
[09:48] (588.96s)
this time series statistics tracking as
[09:51] (591.76s)
like a business insight all right next
[09:53] (593.48s)
one I'll run through this one really
[09:54] (594.84s)
fast I promise because the setup is
[09:56] (596.96s)
pretty similar we are comparing
[09:58] (598.68s)
screenshots
[09:59] (599.84s)
every day on a given website that we
[10:01] (601.92s)
feed in we can have 100 a thousand of
[10:04] (604.20s)
these websites and just visit them once
[10:06] (606.12s)
every day take a screenshot compare it
[10:07] (607.88s)
to the screenshot from yesterday and
[10:09] (609.72s)
then AI will tell us did the website
[10:11] (611.84s)
change if so what changed then you can
[10:13] (613.88s)
make it prompt more specific tell me if
[10:15] (615.92s)
the price changed tell me if the
[10:18] (618.04s)
headline changed etc okay so running
[10:20] (620.92s)
through the file we have this code that
[10:23] (623.68s)
is saving it's generating a file name
[10:26] (626.72s)
for each URL then we have our standard
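A filename-per-URL helper along these lines keeps each site's screenshot name stable between runs so today's image can be diffed against yesterday's (the exact naming scheme here is my own sketch, not the video's code):

```javascript
// Derive a stable, filesystem-safe screenshot name from a URL so the
// same site always maps to the same file between daily runs.

function screenshotFilename(url) {
  return (
    url
      .replace(/^https?:\/\//, "") // drop the scheme
      .replace(/[^a-zA-Z0-9]+/g, "_") // everything unsafe becomes _
      .replace(/_+$/, "") + ".png"
  );
}

console.log(screenshotFilename("https://example.com/pricing"));
// example_com_pricing.png
```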
[10:29] (629.44s)
proxy setup starting proxy and then we
[10:32] (632.36s)
are reading that local file to see if it
[10:34] (634.44s)
exists if not we just take the
[10:36] (636.08s)
screenshot we do the first comparison
[10:37] (637.60s)
tomorrow we're launching Puppeteer again
[10:40] (640.36s)
headless false for example purposes and
[10:44] (644.36s)
then waiting for the page to load taking
[10:46] (646.32s)
screenshots super easy with this method
[10:48] (648.56s)
in Puppeteer and importantly we're
[10:50] (650.28s)
saving it into memory so we can feed it
[10:52] (652.12s)
into OpenAI through the API now closing
[10:55] (655.44s)
the browser closing the proxy and then
[10:57] (657.88s)
we're running this compare images code
[10:59] (659.88s)
which I have written here just tell me
[11:02] (662.20s)
what's the difference between these two
[11:03] (663.24s)
images it's a starter prompt that can be
[11:05] (665.44s)
further modified feeding in the two
[11:07] (667.40s)
images in base64 which is just a string
[11:10] (670.88s)
format we'll get back a response and
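Feeding two base64 screenshots to the model means building a vision-style chat message. This sketch follows OpenAI's image_url content format for chat completions; the prompt text is a placeholder:

```javascript
// Build a two-image comparison message in the shape the OpenAI chat
// API expects for vision input: base64 PNGs embedded as data URLs.

function buildCompareMessage(prompt, pngBufferA, pngBufferB) {
  const toDataUrl = (buf) =>
    `data:image/png;base64,${buf.toString("base64")}`;
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "image_url", image_url: { url: toDataUrl(pngBufferA) } },
      { type: "image_url", image_url: { url: toDataUrl(pngBufferB) } },
    ],
  };
}

const msg = buildCompareMessage(
  "What changed between these screenshots?",
  Buffer.from("fake-yesterday"),
  Buffer.from("fake-today")
);
console.log(msg.content.length); // 3
```

In the real app the buffers come from Puppeteer's page.screenshot, and msg is passed in the messages array of the chat completions request.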
[11:13] (673.00s)
then we have some checks here for
[11:14] (674.88s)
varying response types we can modify
[11:17] (677.04s)
that further for more reliability and
[11:20] (680.24s)
then yeah just returning the result and
[11:21] (681.96s)
printing it are we printing it yeah
[11:23] (683.84s)
we're printing okay let's run this one
[11:26] (686.40s)
and see if it works so here's the fremo
[11:28] (688.28s)
website didn't find an image so we saved
[11:31] (691.12s)
it now let's run it again to see the
[11:33] (693.56s)
comparison it's going to be the same but
[11:35] (695.72s)
let's just see what happens changes
[11:37] (697.36s)
false save the new file so we're all set
[11:40] (700.48s)
and once again can run this daily
[11:43] (703.84s)
thousands of companies monitor for
[11:45] (705.64s)
changes could be a cool app this kernel
[11:48] (708.32s)
could be modified to something even more
[11:50] (710.52s)
interesting powerful etc okay guys that
[11:53] (713.12s)
is what I wanted to share in this video
[11:54] (714.92s)
If you haven't had your AI home run app
[11:57] (717.52s)
yet hopefully this can give you some
[11:58] (718.64s)
inspiration
[11:59] (719.84s)
maybe to add to what you're working on
[12:01] (721.12s)
maybe just to do a side project but me
[12:03] (723.88s)
personally I think this is really cool
[12:05] (725.40s)
got to use the proxy if you're doing
[12:06] (726.80s)
things seriously and DataImpulse I
[12:09] (729.64s)
actually do use them as well as you know
[12:12] (732.00s)
working with them on this video so I
[12:13] (733.92s)
hope you saw how easy it was and yeah
[12:17] (737.00s)
just a couple lines of code to get set
[12:19] (739.24s)
up less than a cent per full page load
[12:22] (742.48s)
so pretty good last thing I want to say
[12:24] (744.20s)
and if you're sharp maybe you caught on
[12:26] (746.48s)
to this so when you're feeding a lot of
[12:28] (748.84s)
requests into AI you're probably
[12:31] (751.60s)
thinking wow that's got to get pretty
[12:33] (753.56s)
expensive if you're actually doing
[12:35] (755.56s)
Enterprise scraping scale to create like
[12:38] (758.20s)
a million row database and that's very
[12:40] (760.48s)
true but I have a very interesting video
[12:43] (763.24s)
in the works for you running models on
[12:46] (766.36s)
your laptop locally so even if you have
[12:48] (768.84s)
a production app as long as your
[12:50] (770.88s)
computer is on you know at a specified
[12:53] (773.80s)
time then you can run things locally on
[12:56] (776.72s)
your machine and save hundreds thousands
[12:58] (778.44s)
of dollars on OpenAI billing and I
[13:00] (780.64s)
think this is an amazing use case for a
[13:02] (782.40s)
local LLM so I'm going to show you how
[13:04] (784.08s)
to do that in the next video I hope
[13:06] (786.64s)
you'll stick around and maybe subscribe
[13:08] (788.28s)
if you made it to the end and catch you
[13:10] (790.28s)
guys in the next one