YouTube Deep Summary



How to Build Powerful Web Scrapers with AI - 3 Steps

Aaron Jack • 2025-01-15 • 13:12 minutes • YouTube

🤖 AI-Generated Summary:

Unlocking the Power of AI and Web Scraping: Build Smarter Apps with Scalable Data Extraction

In the rapidly evolving world of AI and data-driven applications, there's an exciting opportunity that many developers and entrepreneurs are overlooking: combining web scraping with AI. This powerful combo opens doors to creating innovative apps, competing with established players, and building valuable datasets from scratch by leveraging the vast resources available on the web.

In this post, we'll explore why this approach is a game-changer, how to do it right at scale, and practical examples of apps you can build quickly using web scraping powered by AI.


Why Combine Web Scraping with AI?

Web scraping, the technique of extracting data from websites, has been around for a while. But it comes with two major challenges:

  1. Brittle Scrapers: Websites frequently change their structure, causing scrapers to break.
  2. Non-Standardized Data: Different websites present data in vastly different ways, making it tricky to collect consistent, structured information.

This is where AI, particularly large language models (LLMs), shines. You can feed these models unstructured data, such as raw HTML or text, and have them output clean, structured formats like JSON. This structured data can then feed directly into your databases or applications.
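As a rough illustration of that idea, here is a minimal Node.js sketch, assuming the official openai npm package and an API key in the environment; the model name, prompt, and htmlToJson helper are illustrative placeholders, not code from the video:

```javascript
// Minimal sketch: turn raw, unstructured HTML into structured JSON with an LLM.
// Assumes `npm install openai` and OPENAI_API_KEY set in the environment.
import OpenAI from "openai";

const openai = new OpenAI();

async function htmlToJson(rawHtml) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [
      {
        role: "user",
        content:
          "Extract the company name, headline, and pricing from this HTML. " +
          "Reply with JSON only, no markdown fences.\n\n" + rawHtml,
      },
    ],
  });

  // Models sometimes wrap the answer in Markdown code fences; strip them before parsing.
  const text = response.choices[0].message.content
    .replace(/```json|```/g, "")
    .trim();
  return JSON.parse(text); // e.g. { name, headline, pricing } becomes a row in your database
}
```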

The possibilities are enormous:

  • Build directories or databases by scraping and normalizing data from many sources.
  • Enrich existing datasets (e.g., augmenting email lead lists with LinkedIn profiles).
  • Create APIs that serve valuable, scraped data to B2B clients.

Scraping at Scale: The Technical Approach

Levels of Scraping

  1. Basic HTTP Requests: Fetch raw HTML with simple requests. This is limited because many modern sites rely on JavaScript to render content.
  2. Headless Browsers: Tools like Puppeteer (JavaScript) or Selenium (Python) simulate full browser environments, allowing you to interact with pages (scroll, click, wait for JS to load).
  3. Scaling with Proxies: Websites detect and block scrapers when too many requests come from a single IP or from known data-center IP ranges. To avoid this, use residential proxies, real IP addresses belonging to actual users, to mask your scraper (a minimal Puppeteer sketch follows this list).
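Here is a minimal sketch of level 2, assuming Puppeteer installed via npm; the commented-out proxy flag shows where level 3 plugs in (this is an illustration, not the video's code):

```javascript
// Minimal sketch of level 2: a headless browser that renders JavaScript before scraping.
// Assumes `npm install puppeteer`.
import puppeteer from "puppeteer";

async function fetchRenderedHtml(url) {
  const browser = await puppeteer.launch({
    headless: true, // set to false while debugging to watch what the scraper does
    // args: ["--proxy-server=http://your-proxy-gateway:port"], // level 3: route via a proxy
  });

  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" }); // wait for JS-heavy pages to settle
  const html = await page.content(); // the fully rendered HTML, ready to hand to an LLM

  await browser.close();
  return html;
}
```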

Why Residential Proxies?

Residential proxies make your scraping requests appear as if they're coming from genuine users, drastically reducing the chance of being blocked. You can rotate proxies to distribute requests and scrape thousands or even tens of thousands of pages reliably.

Recommended Proxy Service: Data Impulse

Data Impulse stands out as an affordable, easy-to-integrate residential proxy service. With just a few lines of code, you can set up proxies that work seamlessly with Puppeteer or Selenium. It's roughly ten times cheaper than using a scraping service and offers features like location selection.
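The video's setup routes Puppeteer through the proxy via a small "proxy chain"; one common way to wire an authenticated residential proxy into Puppeteer is the proxy-chain npm package, sketched below with placeholder credentials (an approximation of the idea, not the author's exact code):

```javascript
// Sketch: wrapping an authenticated residential proxy so Puppeteer can use it.
// Assumes `npm install puppeteer proxy-chain`; host, port, and credentials are placeholders.
import puppeteer from "puppeteer";
import proxyChain from "proxy-chain";

async function launchWithProxy() {
  // Converts user:pass@host:port into a local, unauthenticated proxy URL Puppeteer accepts.
  const proxyUrl = await proxyChain.anonymizeProxy(
    "http://PROXY_USER:PROXY_PASS@gw.example-proxy.com:823"
  );

  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxyUrl}`],
  });
  return { browser, proxyUrl };
}

// When finished: await browser.close(); await proxyChain.closeAnonymizedProxy(proxyUrl, true);
```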


Real-World Mini Apps Built in Under an Hour

Here are two example apps showcasing how combining web scraping, proxies, and AI can quickly create valuable tools:

1. Instagram Profile & Reels Analytics

  • What it does: Scrapes Instagram profiles, including stats on reels (likes, comments, views).
  • How it works (see the sketch after this list):
      • Uses Puppeteer with residential proxies to load Instagram pages like a real user.
      • Scrapes HTML content from the profile header and the reels tab.
      • Sends the raw HTML to an AI model, which returns structured data (followers, bio, reel stats).
  • Use cases: Track social media growth over time, monitor posts, analyze influencer engagement.
  • Scalability: Can be expanded to track multiple profiles and generate time-series analytics.
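A condensed sketch of that flow (profile header only, with the reels step omitted for brevity) might look like the following; the selector, model name, and prompt are approximations of what the video describes, not the author's exact code:

```javascript
// Sketch of the profile flow: load the page through a proxy, grab the <header> HTML,
// then ask the LLM to return it as structured JSON.
import puppeteer from "puppeteer";
import OpenAI from "openai";

const openai = new OpenAI();

async function scrapeProfile(username, proxyUrl) {
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxyUrl}`], // residential proxy from your provider
  });
  const page = await browser.newPage();

  await page.goto(`https://www.instagram.com/${username}/`, {
    waitUntil: "networkidle2",
  });
  await page.waitForSelector("header"); // a container element is more stable than class names
  const headerHtml = await page.$eval("header", (el) => el.outerHTML);
  await browser.close();

  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [
      {
        role: "user",
        content:
          "From this HTML, return JSON with followers, following, link, and bio:\n" +
          headerHtml,
      },
    ],
  });
  return JSON.parse(
    res.choices[0].message.content.replace(/```json|```/g, "").trim()
  );
}
```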

2. Website Change Monitoring via Screenshots

  • What it does: Takes daily screenshots of specified websites and compares them to detect changes.
  • How it works (see the sketch after this list):
      • Puppeteer visits each site via proxies and captures a screenshot.
      • Images are saved and compared day-to-day.
      • AI analyzes the screenshot differences and describes what changed (e.g., price, headline).
  • Use cases: Monitor competitor websites, track pricing changes, detect UI updates.
  • Scalability: Can run checks for hundreds or thousands of sites daily.
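A condensed sketch of that loop is below, assuming Puppeteer plus a vision-capable OpenAI model; file paths, model name, and prompt are placeholders rather than the video's exact code:

```javascript
// Sketch: take today's screenshot, compare it with yesterday's using a vision model,
// then store today's image as the new baseline.
import { promises as fs } from "fs";
import puppeteer from "puppeteer";
import OpenAI from "openai";

const openai = new OpenAI();

async function checkSite(url, baselinePath) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const today = Buffer.from(await page.screenshot()); // kept in memory for the API call
  await browser.close();

  let previous;
  try {
    previous = await fs.readFile(baselinePath);
  } catch {
    await fs.writeFile(baselinePath, today); // first run: nothing to compare against yet
    return "Baseline saved; comparison starts tomorrow.";
  }

  const asDataUrl = (buf) => `data:image/png;base64,${buf.toString("base64")}`;
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder vision-capable model
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Did anything change between these two screenshots? Describe what." },
          { type: "image_url", image_url: { url: asDataUrl(previous) } },
          { type: "image_url", image_url: { url: asDataUrl(today) } },
        ],
      },
    ],
  });

  await fs.writeFile(baselinePath, today); // today's screenshot becomes tomorrow's baseline
  return res.choices[0].message.content;
}
```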

Cost Efficiency: Building Your Own Scraper vs. Using Services

A scraping service might charge several cents per request, which adds up quickly at scale. In the video's example, scraping a single Instagram profile with 15 reels through a scraping service cost about 3.5 cents.

In contrast, using your own scraper with residential proxies like Data Impulse can reduce costs by a factor of 10 or more (roughly 0.4 cents for the same profile in the video's example), making it viable to scrape huge amounts of data cost-effectively.


Key Takeaways for Building Your AI + Scraping App

  • Use headless browsers (Puppeteer/Selenium) to handle modern JS-heavy websites.
  • Incorporate residential proxies to avoid IP blocks and scale your scraping efforts.
  • Leverage AI to parse unstructured HTML into structured, usable data formats.
  • Start small with scripts or mini apps, then iterate towards a full SaaS product.
  • Explore use cases like data enrichment, competitor monitoring, or social media analytics.

What's Next? Running AI Models Locally for Cost Savings

You might be wondering: won't feeding thousands or millions of scraped data points into AI models get expensive? The good news is that running AI models locally on your own hardware is becoming practical.
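As a rough idea of what that can look like, here is a minimal sketch that swaps the hosted API call for a model served locally by Ollama; the endpoint, model name, and prompt are assumptions about one possible setup, not necessarily what the upcoming video will cover:

```javascript
// Sketch: the same HTML-to-JSON step, but against a local model.
// Assumes Ollama is installed, `ollama pull llama3` has been run, and the default
// local endpoint is up; Node 18+ provides fetch natively.
async function htmlToJsonLocal(rawHtml) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3", // whichever model you have pulled locally
      prompt:
        "Extract followers, following, link, and bio from this HTML. Reply with JSON only.\n" +
        rawHtml,
      stream: false,
    }),
  });
  const data = await res.json();
  return JSON.parse(data.response.replace(/```json|```/g, "").trim());
}
```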

The author teases an upcoming video on how to run local LLMs for data processing, which can save hundreds or thousands of dollars in API costsโ€”perfect for production-scale scraping and AI workflows.


Final Thoughts

Combining web scraping with AI is a powerful, underutilized strategy to build apps that extract, transform, and monetize web data. By using the right toolsโ€”headless browsers, residential proxies, and AI parsingโ€”you can create scalable, resilient scrapers that open new business opportunities.

If you're looking for inspiration or a starting point for your next project, this approach is definitely worth exploring.


Interested in learning more? Stay tuned for upcoming tutorials on running AI models locally and deeper dives into scraping best practices. Don't forget to subscribe and join the conversation!


Happy scraping and building smarter apps!


📝 Transcript (338 entries):

there's something huge in AI that I'm shocked more people aren't talking about and I'll cut straight to it combining web scraping with AI I think is a massive potential way to create apps that weren't possible before and compete with players with much bigger databases more established and also build value from scratch from the web and data Transformations let's talk first about web scraping a little bit then I'll get into how to do it correctly not getting blocked doing it at scale thousands tens of thousands of requests What specifically to use and then I will show you a few example apps I built in just about one hour each that I think already have potential to turn into like a B2B SAS maybe just as a feature and you could probably just call them scripts at this point they're not formalized with a database and so on so web scripting is just a way to get data from the internet but traditionally there have been two big problems with scraping number one is scrapers are very brittle they break often when websites change which as we know they do all the time and then the other issue is if you have like multiple websites you want to scrape let's say you have a database of a 100 companies and you want to get the same data from each website like what is their pricing what is their headline on their site their logo every page or every site's HTML is different so how do you actually deal with that it's a a little bit difficult when it's not standardized so AI really solves these two issues because you can feed an llm unstructured data which is just any text input and most people aren't doing this but it can give you a structured output like Json that could be a row in your database for example and you can use this to build entire apps like directories you can enrich current data like if you have a database of email leads you can go to LinkedIn and find more info about people then save that info or you can build this as a service just build a huge database from scraping data and sell access to the API to companies that's kind of the high level overview why it's valuable because data is super valuable in general let's talk about scraping actually how do you do it because it's not super difficult if you know a little bit of coding and we'll get into that now so here's an easy way to understand scraping I call it the levels of scraping there's three let's start with level one which is just making a request in your code to the URL this is just going to return the markup or HTML of the page which is not a great option because first a lot of sites need JavaScript to even render the content and second you're not going to have any page interactions so you're not going to be able to Traverse scroll click on anything option number two it is a lot better it's headless browsing and in fact this is the bread and butter of your scraping you run a library like Puppeteer in JavaScript selenium and Python and basically your code is loading a browser environment and it's able to do everything you can do normally in a browser which is incredible you can click take screenshot scroll and all the JavaScript will run which is great but there's one problem and it is that servers are smart so if they see a lot of traffic coming from your IP address your server IP address or even if they detect this is an IP address not of a person but of a data center your request can easily get blocked and that is where proxies come in proxies give you a different IP for every request and you can actually get residential real IP addresses 
so there's no way to tell that your scraper is a bot it looks exactly like a real user you can think of a proxy a bit like a VPN it is going in between your headless browser and the requests it makes and in this way you can do parallel requests a lot of requests back to back and you don't have to worry about having problems with for example Instagram if you're scraping different pages but the big question is how do you get a proxy and most importantly how do you get one with residential IPS well shout out to data impulse for sponsoring this video it's a great product that with only three lines of code allows you to run a proxy and it super easily integrates with Puppeteer selenium Etc it's super affordable 10 times cheaper than using a scraping service and you can set the locations of your IP addresses and similar I just want to show you the difference between scraping with a service like apify compared to writing your own scraper and using a proxy and the cost differences are actually quite substantial so in this specific row here that you can see right here you can see that I scraped a single profile with 15 reels and this cost me basically 3 and2 cents to do so that doesn't seem like a lot but if you see all the requests I'm doing and I'm not even at a huge scale with my app you can imagine this would get quite quite expensive let's compare this to a data impulse where in this request let's see I have a bunch here that cost me nothing but this one where I scraped a full profile real thumbnails and similar it was 4 megabytes and that cost me4 cents so actually 10 times less to do my own scraper okay so before I show you the first app that I've built let's just look and if we go to my plan we can see all my credentials are right here and I can easily get some starter code with a documentation or tutorials and if I'm using like Puppeteer it'll give me a full Puppeteer example I can start with which is what I did for these uh mini apps you can set your specific countries sites can be different depending where you're visiting them from or be blocked so that can be quite important and then scrolling down you can configure it further and also get more more proxies if you need them so with all that said let's jump over to the code this is the app that's scraping Instagram profiles and it's getting the stats from all the reals every day so we can have kind of a Time series view of how a given profile is changing over time how many views are they getting or if you want to look at a specific post that you collaborated on for example you can see how that post is doing so let's just run through things like pretty quick and I'll do it in Block box here we're just setting up our proxy chain which is our basically Loop of proxies we're going to go through with our data impulse credentials and these are basically copied from the documentation going down I can do multiple usernames and then basically here we've got Tech with Tim's Instagram and then we are mapping those into an array of URLs going down we're here launching our scraper puppeter in headless mode false for development that'll just show what the scraper is doing so we can see it debug it but in production you turn this to headless true and then just scraping code here long story short we're opening the page waiting for it to load waiting for a specific elements on the page because it can load in pieces and then we are selecting the whole header let me uh show you what that looks like on Instagram and how I like kind of determine which 
element to select so if I go to console here I can just do the selector on this element and we can see it is the header type element so of course you can feed in the whole page markup but I think header is pretty reliably still going to be there of course it can still break but this is like a container element rather than a class so I'm feeling more confident about that so back over here we're selecting that entire header and then saving header content in this variable all the HTML and then we are looking at the reels one at time because I'm actually going to the reals page I I don't have reals personally but there's a tab here so if we go to like Tech with Tim and then this is a standardized sort of URL structure username slre it's going to this page which you'll see when we run the scraper then is pulling all these stats so we can see right here we're displaying likes comments and Views so that'll be saved we're actually explicitly selecting each real container with the the URL that it links to so there's unique URLs on each one of these cards but I mainly wanted to show you this down here analyze headers so we're running this code I have an a different file analyze header which is our API call to open Ai and here's my prompt I'm just saying here's from HTML please give it to me back in this structure followers following link and bio and then with the response we are doing a little bit of code on it because I noticed often with open AI they return like the markdown format Json so with the three back ticks and then word Json so we're just replacing that with empty string and then even if it fails we're just running the prompt again one time and then it can still fail after that in which case you'd want to have like a fallback strategy for this so that is the long and short of the code let's actually run this and hopefully it works so we're just running node on our main JS and we're running in headless false so we can see the page come up and there it is and we can see that everything printed so here we got followers 23k following 212 bio and then the link and then of course we have stats on each reel and the URL so of course there's a lot more we can do with this we can download videos we can monitor for changes like when did they post a new reel and then of course we can just do this time series statistics tracking as like a business Insight all right next one I'll run through this one really fast I promise because the setup is pretty similar we are comparing screenshots every day on a given website that we feed in we can have 100 a thousand of these websites and just visit them once every day take a screenshot compare it to the screenshot from yesterday and then AI will tell us did the website change if so what changed then you can make it prompt more specific tell me if the price changed tell me if the headline changed Etc okay so running through the file we have this code that is saving it's generating a file name for each URL then we have our standard proxy setup starting proxy and then we are reading that local file to see if it exists if not we just take the screenshot we do the first comparison tomorrow we're launching Puppeteer again headless false for example purposes and then waiting for the page to load taking screenshots super easy with this method in Puppeteer and importantly we're saving it into memory so we can feed it into open AI through the API now closing the browser closing the proxy and then we're running this compare images code which I have written here just tell 
me what's the difference between these two images it's a starter prompt that can be further modified feeding in the two images in base 64 that is just string format we'll get back to response and then we have some checks here for varying response types we can modify that further for more reliability and then yeah just returning the result and printing it are we printing it yeah we're printing okay let's run this one and see if it works so here's the fremo website didn't find an image so we saved it now let's run it again to see the comparison it's going to be the same but let's just see what happens changes false save the new file so we're all set and once again can run this daily thousands of companies monitor for changes could be a cool app this kernel could be modified to something even more interesting powerful Etc okay guys that is what I wanted to share in this video If you haven't had your AI home run app yet hopefully this can give you some inspiration maybe to add to what you're working on maybe just to do a side project but me personally I think this is really cool got to use the proxy if you're doing things seriously and the data impulse I actually do use them as well as you know working with them on this video so I hope you saw how easy uh it was and yeah just a couple lines of code to get set up less than a cent per full page load so pretty good last thing I want to say and if you're sharp maybe you caught on to this so when you're feeding a lot of requests into AI you're probably thinking wow that's got to get pretty expensive if you're actually doing Enterprise scraping scale to create like a million row database and that's very true but I have a very interesting video in the works for you running models on your laptop locally so even if you have a production app as long as your computer is on you know add a specified time then you can run things locally on your machine and save hundreds thousands of dollars on open AI billing and I think this is an amazing use case for a local llm so I'm going to show you how to do that in the next video I hope you'll stick around and maybe subscribe if you made it to the end and catch you guys in the next one