LLMs are amazing for web scraping

Reid JS
6 min read · Jul 12, 2024

Most modern websites are generated by WordPress, React, Vue, or other web frameworks. The generated HTML, CSS, and JS is often long and hard to read. For example:

<div class="css-qv5uo4 e1ppw5w20"><div class="css-15rwwo"><div class="css-y7g724"><div><figure class="container-margin css-hurk9l"><div class="container-size-large device-mobile"><section data-testid="inline-interactive" id="prism-styln-extreme-heat-homepageMenu-1653619843461" data-id="100000008370435" data-source-id="100000008370435" class="css-l08pwh interactive-content interactive-size-medium"><div class="css-17ih8de interactive-body" data-sourceid="100000008370435" id="embed-id-100000008370435">
<script>window.registerInteractive && window.registerInteractive("100000008370435");</script>
<nav aria-labelledby="styln-extreme-heat-hp-menu" class="css-1rglxg6"><p id="styln-extreme-heat-hp-menu"><span class="css-14sajpe">Extreme Heat</span></p><ul class="css-11xr2bo"><li class="css-wc84xp"><a href="https://www.nytimes.com/interactive/2022/us/heat-wave-map-tracker.html">U.S.&nbsp;Tracker</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/interactive/2023/world/global-heat-map-tracker.html">Global Tracker</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/2024/05/24/well/live/excessive-heat-wave-weather-health.html">How to Stay Safe</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/article/heat-wave-cause.html">Heat Waves, Explained</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/interactive/2023/08/10/well/live/heat-body-dehydration-health.html">Heat’s Physical Toll</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/2024/06/19/well/mind/heat-affect-brain-emotions.html">Effect on the Brain</a></li></ul><script>"use strict";(function(){// prevent menus from overflowing into right rail on desktop
function a(){var a=document.querySelector("nav[aria-labelledby=\"styln-extreme-heat-hp-menu\"]"),b=document.querySelectorAll("[data-hierarchy=\"zone\"]"),c=Array.from(b).filter(function(b){return b.contains(a)})[0];if(c&&a.clientWidth>c.clientWidth)// remove the last item from the menu if it is overflowing
{var d=a.querySelectorAll("li"),e=d[d.length-1];e.remove()}}// only check for overflow if the user is on desktop or tablet
// remove links that match ones already shown in the live band
var b=document.querySelector("#hp-live-band-list");if(b){var c=Array.from(b.querySelectorAll("a")).map(function(b){return b.href}),d=document.querySelectorAll("nav[aria-labelledby=\"styln-extreme-heat-hp-menu\"] li");d.forEach(function(a){var b=a.querySelector("a"),d=b.pathname;c.forEach(function(b){b.includes(d)&&a.remove()})})}if(!window.matchMedia("(pointer: coarse) and (max-width: 739px)").matches){a();var e;window.addEventListener("resize",function(){clearTimeout(e),e=setTimeout(a,250)})}})();</script></nav></div></section></div></figure></div></div></div><section class="story-wrapper css-zirthl"><a class="css-9mylee" href="https://www.nytimes.com/2024/07/12/us/politics/heat-deaths-hospitals.html" data-uri="nyt://article/c701662a-4d6b-596f-800d-4d34f90b4472" aria-hidden="false"><div><div class="css-plkyuz"><p class="indicate-hover css-1rxo31z">Heat-Related Emergencies Are Soaring in the U.S. Can Hospitals Keep Up?</p></div><p class="summary-class css-1e32k8w">Medical providers and public health experts worry that the health care system is poorly equipped to handle the influx.</p><p class="css-1swc7oy" data-ttr="1">7 min read</p></div><figure class="container-margin css-1oz0cgh"><div class="css-dfo63t"><picture class="css-hdqqnp"><source class="css-hdqqnp" media="screen and (min-width: 1px)" srcset="https://static01.nyt.com/images/2024/07/11/multimedia/00dc-heat1-vwhl/00dc-heat1-vwhl-smallSquare252-v2.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale 1x, https://static01.nyt.com/images/2024/07/11/multimedia/00dc-heat1-vwhl/00dc-heat1-vwhl-square640-v3.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale 2x"><img class="css-yf0ys" alt="A nurse in a hospital room wearing scrubs and gloves standing over a person on a gurney wearing medical devices over their abdomen and leg." 
loading="lazy" src="https://static01.nyt.com/images/2024/07/11/multimedia/00dc-heat1-vwhl/00dc-heat1-vwhl-square640-v3.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"><noscript></noscript></picture></div><figcaption class="css-1p5yz2j" aria-hidden="true"><span class="css-1yg7u8y">Ramsay de Give for The New York Times</span></figcaption></figure></a></section><hr aria-hidden="true" class="css-1lb1e5i"><section class="story-wrapper css-zirthl"><a class="css-9mylee" href="https://www.nytimes.com/2024/07/12/us/us-heat-wave-weather-forecast.html" data-uri="nyt://article/ccd7a788-de11-5fbf-9dfe-c865c8ea5149" aria-hidden="false"><div><div class="css-xdandi"><p class="indicate-hover css-91bpc3">The U.S. heat wave is stretching into another day and is starting to move east.</p></div><p class="css-1poc66f" data-ttr="1">2 min read</p></div></a></section><hr aria-hidden="true" class="css-1wwadrm"><section><section class="story-wrapper"><figure class="css-gnck2o"><div data-hp-full-width="true" class="css-sxmela full-width-span-vw container-margin-none"><div class="container-size-large device-mobile"><section data-testid="inline-interactive" id="hp-embed-dangerous-heat-tracker-us" data-id="100000008448983" data-source-id="100000008448983" class="css-l08pwh interactive-content interactive-size-medium"><div class="css-17ih8de interactive-body" data-sourceid="100000008448983" id="embed-id-100000008448983">
... lots of stuff here
<script>
NYTG.watch('https://static01.nytimes.com/newsgraphics/2022-07-12-extreme-heat/574199b49b5a24756bc665a9d6fa2b1269db41b9/_assets/build/js/main.js');
NYTG.load('https://static01.nytimes.com/newsgraphics/2022-07-12-extreme-heat/574199b49b5a24756bc665a9d6fa2b1269db41b9/_assets/build/js/freebird.js');
</script>
</div></section></div></div></figure></section></section></div>

The summary of a news article on the front page of NYTimes.com.

<h2 class="be la lb lc ld hy le lf lg lh id li if ig lj lk ij ll lm ln lo lp lq lr ls lt lu fq fs ft fv fx bj">Deploying and configuring a DigitalOcean Droplet to host server side rendered web applications.</h2>

The title of a Medium article.

<div jsname="RNNXgb" class="RNNXgb"><div class="SDkEP"><div jsname="uFMOof" class="iblpc"><style>.CcAdNb{margin:auto}.QCzoEc{margin-top:3px;color:#9aa0a6}</style><div class="CcAdNb"><span class="QCzoEc z1asCe MZy1Rb" style="height:20px;line-height:20px;width:20px"><svg focusable="false" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M15.5 14h-.79l-.28-.27A6.471 6.471 0 0 0 16 9.5 6.5 6.5 0 1 0 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5 4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z"></path></svg></span></div></div><div jscontroller="vZr2rb" jsname="gLFyf" class="a4bIc" data-hpmde="false" data-mnr="10" jsaction="h5M12e;input:d3sQLd;blur:jI3wzf"><style>.gLFyf,.ql1tMb,.YacQv{line-height:34px;font-size:16px;flex:100%;}textarea.gLFyf,.ql1tMb,.YacQv{font-family:Roboto,Arial,sans-serif;line-height:22px;border-bottom:8px solid transparent;padding-top:11px;overflow-x:hidden}textarea.gLFyf{}.sbfc textarea.gLFyf{white-space:pre-line;overflow-y:auto}.gLFyf{resize:none;background-color:transparent;border:none;margin:0;padding:0;color:#e8eaed;word-wrap:break-word;outline:none;display:flex;-webkit-tap-highlight-color:transparent}.a4bIc{display:flex;flex-wrap:wrap;flex:1;}.YacQv{color:transparent;white-space:pre;position:absolute;pointer-events:none}.YacQv span{text-decoration:#f28b82 dotted underline}</style><div jsname="vdLsw" class="YacQv"></div><textarea class="gLFyf" aria-controls="Alh6id" aria-owns="Alh6id" autofocus="" title="Search" value="" jsaction="paste:puy29d;" aria-label="Search" aria-autocomplete="both" aria-expanded="false" aria-haspopup="false" autocapitalize="off" autocomplete="off" autocorrect="off" id="APjFqb" maxlength="2048" name="q" role="combobox" rows="1" spellcheck="false" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8Q39UDCA4"></textarea></div><div class="dRYYxd"><style>.dRYYxd{display:flex;flex:0 0 auto;align-items:stretch;flex-direction:row;height:44px}</style> 
<style>.BKRPef{background:transparent;align-items:center;flex:1 0 auto;flex-direction:row;display:flex;cursor:pointer}.vOY7J{background:transparent;border:0;align-items:center;flex:1 0 auto;cursor:pointer;display:none;height:100%;line-height:44px;outline:none;padding:0 12px}.M2vV3{display:flex}.ExCKkf{height:100%;color:var(--IXoxUe);vertical-align:middle;outline:none}</style> <style>.BKRPef{padding-right:4px}.ACRAdd{border-left:1px solid #5f6368;height:65%}.ACRAdd{display:none}.ACRAdd.M2vV3{display:block}</style> <div jscontroller="PymCCe" jsname="RP0xob" class="BKRPef"> <div class="vOY7J" tabindex="0" jsname="pkjasb" aria-label=" Clear" role="button" jsaction="AVsnlb;rcuQ6b:npT2md" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8Q05YFCA8">  <span jsname="itVqKe" class="ExCKkf z1asCe rzyADb"><svg focusable="false" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 6.41L17.59 5 12 10.59 6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 12 13.41 17.59 19 19 17.59 13.41 12z"></path></svg></span>   </div> <span jsname="s1VaRe" class="ACRAdd"></span> </div> <style>.XDyW0e{flex:1 0 auto;display:flex;cursor:pointer;align-items:center;border:0;background:transparent;outline:none;padding:0 8px;width:24px;line-height:44px}.goxjub{height:24px;width:24px;vertical-align:middle}</style><div jscontroller="unV4T" jsname="F7uqIe" class="XDyW0e" aria-label="Search by voice" role="button" tabindex="0" jsaction="h5M12e;rcuQ6b:npT2md" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8Qvs8DCBA"><svg class="goxjub" focusable="false" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path fill="#4285f4" d="m12 15c1.66 0 3-1.31 3-2.97v-7.02c0-1.66-1.34-3.01-3-3.01s-3 1.34-3 3.01v7.02c0 1.66 1.34 2.97 3 2.97z"></path><path fill="#34a853" d="m11 18.08h2v3.92h-2z"></path><path fill="#fbbc04" d="m7.05 16.87c-1.27-1.33-2.05-2.83-2.05-4.87h2c0 1.45 0.56 2.42 1.47 3.38v0.32l-1.15 1.18z"></path><path fill="#ea4335" d="m12 16.93a4.97 5.25 0 0 1 -3.54 -1.55l-1.41 1.49c1.26 1.34 3.02 2.13 4.95 2.13 3.87 
0 6.99-2.92 6.99-7h-1.99c0 2.92-2.24 4.93-5 4.93z"></path></svg></div><style>.nDcEnd{flex:1 0 auto;display:flex;cursor:pointer;align-items:center;border:0;background:transparent;outline:none;padding:0 8px;width:24px;line-height:44px}.Gdd5U{height:24px;width:24px;vertical-align:middle}</style><div jscontroller="lpsUAf" jsname="R5mgy" class="nDcEnd" data-base-lens-url="https://lens.google.com" data-image-processor-enabled="true" data-is-images-mode="false" data-preferred-mime-type="image/jpeg" data-propagated-experiment-ids="" aria-label="Search by image" role="button" tabindex="0" jsaction="rcuQ6b:npT2md;h5M12e" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8QhqEICBE"><svg class="Gdd5U" focusable="false" viewBox="0 0 192 192" xmlns="http://www.w3.org/2000/svg"><rect fill="none" height="192" width="192"></rect><g><circle fill="#34a853" cx="144.07" cy="144" r="16"></circle><circle fill="#4285f4" cx="96.07" cy="104" r="24"></circle><path fill="#ea4335" d="M24,135.2c0,18.11,14.69,32.8,32.8,32.8H96v-16l-40.1-0.1c-8.8,0-15.9-8.19-15.9-17.9v-18H24V135.2z"></path><path fill="#fbbc04" d="M168,72.8c0-18.11-14.69-32.8-32.8-32.8H116l20,16c8.8,0,16,8.29,16,18v30h16V72.8z"></path><path fill="#4285f4" d="M112,24l-32,0L68,40H56.8C38.69,40,24,54.69,24,72.8V92h16V74c0-9.71,7.2-18,16-18h80L112,24z"></path></g></svg></div></div></div></div>

The Google search bar.

Sometimes I need to filter useful information from websites, but traditional tools like BeautifulSoup, Cheerio, and Playwright feel slow to use. Lately, I have found Large Language Models (LLMs) like ChatGPT, Claude, and Mistral to be fantastic for filtering information from websites, for these reasons:

  1. You do not need to install any sketchy browser extensions
  2. Easy to download information behind authentication walls
  3. You do not need to read the HTML markup
  4. Low risk of triggering anti-bot measures
  5. Easy to ask the LLM to produce a different structure or format
  6. You can ask the LLM questions about the data

The first part of web scraping is usually filtering information out of a long list of data and mapping it to a certain structure. Here are a couple of examples of how to do that:

Example 1: Extracting local price information from Zillow

  1. Copy the element containing the data you are interested in (for example, from your browser's developer tools), and paste it into the prompt.
  2. Give the LLM a clear directive:

Create a JSON array of every home with the price, size, type, and link
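If you want to repeat this outside a chat window, the two steps above can be sketched in Python. This is only a rough sketch: `build_prompt` and `parse_json_array` are hypothetical helper names, and it deliberately leaves out the call to the LLM itself, since that depends on which provider you use. It assumes the reply contains exactly one JSON array, possibly surrounded by extra prose.

```python
import json


def build_prompt(directive: str, html: str) -> str:
    """Combine the directive and the copied element into one prompt."""
    return f"{directive}\n\nHTML:\n{html}"


def parse_json_array(reply: str) -> list:
    """Pull a JSON array out of an LLM reply.

    Chat models often wrap JSON in extra prose, so this takes the span
    from the first '[' to the last ']' instead of parsing the whole reply.
    Assumption: the reply contains exactly one JSON array.
    """
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    data = json.loads(reply[start : end + 1])
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    return data
```

You would send the string from `build_prompt` to the model of your choice, then pass its text reply to `parse_json_array`. A `ValueError` or `json.JSONDecodeError` here is a sign the model did not follow the directive.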

Example 2: Download all of someone’s tweets

  1. Go to their page and scroll down to the bottom so all their tweets load.
  2. Select the containing div and copy the element.
  3. The HTML was quite long, so I had to save it to a file and load that in instead of pasting it.
  4. Give the LLM the directive:

Create a JSON array, with each tweet as an object with the content and date
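When the copied HTML is too long, one option (not something I did in the example above) is to shrink it before handing it to the LLM. Here is a minimal sketch using only Python's standard library: it drops everything inside `<script>`, `<style>`, `<svg>`, and `<noscript>`, and keeps the visible text and link targets. The names `TextExtractor` and `visible_text` are made up for illustration. Note that it throws away all other attributes, so if the data you want lives in an attribute (a timestamp, for instance), you would need to keep that attribute too.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and link targets, skipping non-content tags."""

    SKIP = {"script", "style", "svg", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.chunks.append(f"[link: {href}]")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the visible text of an HTML fragment, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For the kind of markup shown at the top of this article, most of the characters are class names, inline styles, and scripts, so the output is usually far shorter than the input.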

Conclusion

I hope you found this article helpful. A few things to remember:

  1. LLMs often make mistakes, so never rely on the output being correct without checking it.
  2. Make sure you have permission to download the data you are after.
  3. LLMs are great at handling text, but still struggle to parse images, audio, and video.
  4. LLMs only respond to the context you give them.
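For the first point, a cheap sanity check is to confirm that every value the LLM returned actually appears in the text you gave it. The sketch below is one way to do that; `check_records` is a hypothetical helper, and it assumes the output is a list of JSON objects and that the model copied values verbatim. If you asked the model to reformat values (dates, prices), a mismatch only means "look at this by hand", not "this is wrong".

```python
def check_records(records, source_text, required_keys):
    """Return a list of human-readable problems found in the LLM output."""
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not an object")
            continue
        for key in required_keys:
            value = record.get(key)
            if value in (None, ""):
                problems.append(f"record {i}: missing '{key}'")
            elif str(value) not in source_text:
                # The value may be invented, or just reformatted by the model.
                problems.append(f"record {i}: '{key}' value not found in source")
    return problems
```

An empty list means every required field was present and traceable to the source; anything else is a list of records worth checking manually.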

Good luck!
