Most modern websites are generated by WordPress, React, Vue, or other web frameworks. The generated HTML, CSS, and JS is often long and hard to read. For example:
<div class="css-qv5uo4 e1ppw5w20"><div class="css-15rwwo"><div class="css-y7g724"><div><figure class="container-margin css-hurk9l"><div class="container-size-large device-mobile"><section data-testid="inline-interactive" id="prism-styln-extreme-heat-homepageMenu-1653619843461" data-id="100000008370435" data-source-id="100000008370435" class="css-l08pwh interactive-content interactive-size-medium"><div class="css-17ih8de interactive-body" data-sourceid="100000008370435" id="embed-id-100000008370435">
<script>window.registerInteractive && window.registerInteractive("100000008370435");</script>
<nav aria-labelledby="styln-extreme-heat-hp-menu" class="css-1rglxg6"><p id="styln-extreme-heat-hp-menu"><span class="css-14sajpe">Extreme Heat</span></p><ul class="css-11xr2bo"><li class="css-wc84xp"><a href="https://www.nytimes.com/interactive/2022/us/heat-wave-map-tracker.html">U.S. Tracker</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/interactive/2023/world/global-heat-map-tracker.html">Global Tracker</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/2024/05/24/well/live/excessive-heat-wave-weather-health.html">How to Stay Safe</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/article/heat-wave-cause.html">Heat Waves, Explained</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/interactive/2023/08/10/well/live/heat-body-dehydration-health.html">Heat’s Physical Toll</a></li><li class="css-wc84xp"><a href="https://www.nytimes.com/2024/06/19/well/mind/heat-affect-brain-emotions.html">Effect on the Brain</a></li></ul><script>"use strict";(function(){// prevent menus from overflowing into right rail on desktop
function a(){var a=document.querySelector("nav[aria-labelledby=\"styln-extreme-heat-hp-menu\"]"),b=document.querySelectorAll("[data-hierarchy=\"zone\"]"),c=Array.from(b).filter(function(b){return b.contains(a)})[0];if(c&&a.clientWidth>c.clientWidth)// remove the last item from the menu if it is overflowing
{var d=a.querySelectorAll("li"),e=d[d.length-1];e.remove()}}// only check for overflow if the user is on desktop or tablet
// remove links that match ones already shown in the live band
var b=document.querySelector("#hp-live-band-list");if(b){var c=Array.from(b.querySelectorAll("a")).map(function(b){return b.href}),d=document.querySelectorAll("nav[aria-labelledby=\"styln-extreme-heat-hp-menu\"] li");d.forEach(function(a){var b=a.querySelector("a"),d=b.pathname;c.forEach(function(b){b.includes(d)&&a.remove()})})}if(!window.matchMedia("(pointer: coarse) and (max-width: 739px)").matches){a();var e;window.addEventListener("resize",function(){clearTimeout(e),e=setTimeout(a,250)})}})();</script></nav></div></section></div></figure></div></div></div><section class="story-wrapper css-zirthl"><a class="css-9mylee" href="https://www.nytimes.com/2024/07/12/us/politics/heat-deaths-hospitals.html" data-uri="nyt://article/c701662a-4d6b-596f-800d-4d34f90b4472" aria-hidden="false"><div><div class="css-plkyuz"><p class="indicate-hover css-1rxo31z">Heat-Related Emergencies Are Soaring in the U.S. Can Hospitals Keep Up?</p></div><p class="summary-class css-1e32k8w">Medical providers and public health experts worry that the health care system is poorly equipped to handle the influx.</p><p class="css-1swc7oy" data-ttr="1">7 min read</p></div><figure class="container-margin css-1oz0cgh"><div class="css-dfo63t"><picture class="css-hdqqnp"><source class="css-hdqqnp" media="screen and (min-width: 1px)" srcset="https://static01.nyt.com/images/2024/07/11/multimedia/00dc-heat1-vwhl/00dc-heat1-vwhl-smallSquare252-v2.jpg?format=pjpg&quality=75&auto=webp&disable=upscale 1x, https://static01.nyt.com/images/2024/07/11/multimedia/00dc-heat1-vwhl/00dc-heat1-vwhl-square640-v3.jpg?format=pjpg&quality=75&auto=webp&disable=upscale 2x"><img class="css-yf0ys" alt="A nurse in a hospital room wearing scrubs and gloves standing over a person on a gurney wearing medical devices over their abdomen and leg." 
loading="lazy" src="https://static01.nyt.com/images/2024/07/11/multimedia/00dc-heat1-vwhl/00dc-heat1-vwhl-square640-v3.jpg?format=pjpg&quality=75&auto=webp&disable=upscale"><noscript></noscript></picture></div><figcaption class="css-1p5yz2j" aria-hidden="true"><span class="css-1yg7u8y">Ramsay de Give for The New York Times</span></figcaption></figure></a></section><hr aria-hidden="true" class="css-1lb1e5i"><section class="story-wrapper css-zirthl"><a class="css-9mylee" href="https://www.nytimes.com/2024/07/12/us/us-heat-wave-weather-forecast.html" data-uri="nyt://article/ccd7a788-de11-5fbf-9dfe-c865c8ea5149" aria-hidden="false"><div><div class="css-xdandi"><p class="indicate-hover css-91bpc3">The U.S. heat wave is stretching into another day and is starting to move east.</p></div><p class="css-1poc66f" data-ttr="1">2 min read</p></div></a></section><hr aria-hidden="true" class="css-1wwadrm"><section><section class="story-wrapper"><figure class="css-gnck2o"><div data-hp-full-width="true" class="css-sxmela full-width-span-vw container-margin-none"><div class="container-size-large device-mobile"><section data-testid="inline-interactive" id="hp-embed-dangerous-heat-tracker-us" data-id="100000008448983" data-source-id="100000008448983" class="css-l08pwh interactive-content interactive-size-medium"><div class="css-17ih8de interactive-body" data-sourceid="100000008448983" id="embed-id-100000008448983">
... lots of stuff here
<script>
NYTG.watch('https://static01.nytimes.com/newsgraphics/2022-07-12-extreme-heat/574199b49b5a24756bc665a9d6fa2b1269db41b9/_assets/build/js/main.js');
NYTG.load('https://static01.nytimes.com/newsgraphics/2022-07-12-extreme-heat/574199b49b5a24756bc665a9d6fa2b1269db41b9/_assets/build/js/freebird.js');
</script>
</div></section></div></div></figure></section></section></div>
The summary of a news article on the front page of NYTimes.com.
<h2 class="be la lb lc ld hy le lf lg lh id li if ig lj lk ij ll lm ln lo lp lq lr ls lt lu fq fs ft fv fx bj">Deploying and configuring a DigitalOcean Droplet to host server side rendered web applications.</h2>
The title of a Medium article.
<div jsname="RNNXgb" class="RNNXgb"><div class="SDkEP"><div jsname="uFMOof" class="iblpc"><style>.CcAdNb{margin:auto}.QCzoEc{margin-top:3px;color:#9aa0a6}</style><div class="CcAdNb"><span class="QCzoEc z1asCe MZy1Rb" style="height:20px;line-height:20px;width:20px"><svg focusable="false" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M15.5 14h-.79l-.28-.27A6.471 6.471 0 0 0 16 9.5 6.5 6.5 0 1 0 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5 4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z"></path></svg></span></div></div><div jscontroller="vZr2rb" jsname="gLFyf" class="a4bIc" data-hpmde="false" data-mnr="10" jsaction="h5M12e;input:d3sQLd;blur:jI3wzf"><style>.gLFyf,.ql1tMb,.YacQv{line-height:34px;font-size:16px;flex:100%;}textarea.gLFyf,.ql1tMb,.YacQv{font-family:Roboto,Arial,sans-serif;line-height:22px;border-bottom:8px solid transparent;padding-top:11px;overflow-x:hidden}textarea.gLFyf{}.sbfc textarea.gLFyf{white-space:pre-line;overflow-y:auto}.gLFyf{resize:none;background-color:transparent;border:none;margin:0;padding:0;color:#e8eaed;word-wrap:break-word;outline:none;display:flex;-webkit-tap-highlight-color:transparent}.a4bIc{display:flex;flex-wrap:wrap;flex:1;}.YacQv{color:transparent;white-space:pre;position:absolute;pointer-events:none}.YacQv span{text-decoration:#f28b82 dotted underline}</style><div jsname="vdLsw" class="YacQv"></div><textarea class="gLFyf" aria-controls="Alh6id" aria-owns="Alh6id" autofocus="" title="Search" value="" jsaction="paste:puy29d;" aria-label="Search" aria-autocomplete="both" aria-expanded="false" aria-haspopup="false" autocapitalize="off" autocomplete="off" autocorrect="off" id="APjFqb" maxlength="2048" name="q" role="combobox" rows="1" spellcheck="false" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8Q39UDCA4"></textarea></div><div class="dRYYxd"><style>.dRYYxd{display:flex;flex:0 0 auto;align-items:stretch;flex-direction:row;height:44px}</style> 
<style>.BKRPef{background:transparent;align-items:center;flex:1 0 auto;flex-direction:row;display:flex;cursor:pointer}.vOY7J{background:transparent;border:0;align-items:center;flex:1 0 auto;cursor:pointer;display:none;height:100%;line-height:44px;outline:none;padding:0 12px}.M2vV3{display:flex}.ExCKkf{height:100%;color:var(--IXoxUe);vertical-align:middle;outline:none}</style> <style>.BKRPef{padding-right:4px}.ACRAdd{border-left:1px solid #5f6368;height:65%}.ACRAdd{display:none}.ACRAdd.M2vV3{display:block}</style> <div jscontroller="PymCCe" jsname="RP0xob" class="BKRPef"> <div class="vOY7J" tabindex="0" jsname="pkjasb" aria-label=" Clear" role="button" jsaction="AVsnlb;rcuQ6b:npT2md" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8Q05YFCA8"> <span jsname="itVqKe" class="ExCKkf z1asCe rzyADb"><svg focusable="false" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 6.41L17.59 5 12 10.59 6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 12 13.41 17.59 19 19 17.59 13.41 12z"></path></svg></span> </div> <span jsname="s1VaRe" class="ACRAdd"></span> </div> <style>.XDyW0e{flex:1 0 auto;display:flex;cursor:pointer;align-items:center;border:0;background:transparent;outline:none;padding:0 8px;width:24px;line-height:44px}.goxjub{height:24px;width:24px;vertical-align:middle}</style><div jscontroller="unV4T" jsname="F7uqIe" class="XDyW0e" aria-label="Search by voice" role="button" tabindex="0" jsaction="h5M12e;rcuQ6b:npT2md" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8Qvs8DCBA"><svg class="goxjub" focusable="false" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path fill="#4285f4" d="m12 15c1.66 0 3-1.31 3-2.97v-7.02c0-1.66-1.34-3.01-3-3.01s-3 1.34-3 3.01v7.02c0 1.66 1.34 2.97 3 2.97z"></path><path fill="#34a853" d="m11 18.08h2v3.92h-2z"></path><path fill="#fbbc04" d="m7.05 16.87c-1.27-1.33-2.05-2.83-2.05-4.87h2c0 1.45 0.56 2.42 1.47 3.38v0.32l-1.15 1.18z"></path><path fill="#ea4335" d="m12 16.93a4.97 5.25 0 0 1 -3.54 -1.55l-1.41 1.49c1.26 1.34 3.02 2.13 4.95 2.13 3.87 0 
6.99-2.92 6.99-7h-1.99c0 2.92-2.24 4.93-5 4.93z"></path></svg></div><style>.nDcEnd{flex:1 0 auto;display:flex;cursor:pointer;align-items:center;border:0;background:transparent;outline:none;padding:0 8px;width:24px;line-height:44px}.Gdd5U{height:24px;width:24px;vertical-align:middle}</style><div jscontroller="lpsUAf" jsname="R5mgy" class="nDcEnd" data-base-lens-url="https://lens.google.com" data-image-processor-enabled="true" data-is-images-mode="false" data-preferred-mime-type="image/jpeg" data-propagated-experiment-ids="" aria-label="Search by image" role="button" tabindex="0" jsaction="rcuQ6b:npT2md;h5M12e" data-ved="0ahUKEwje6qjZiKKHAxU7le4BHSlyCU8QhqEICBE"><svg class="Gdd5U" focusable="false" viewBox="0 0 192 192" xmlns="http://www.w3.org/2000/svg"><rect fill="none" height="192" width="192"></rect><g><circle fill="#34a853" cx="144.07" cy="144" r="16"></circle><circle fill="#4285f4" cx="96.07" cy="104" r="24"></circle><path fill="#ea4335" d="M24,135.2c0,18.11,14.69,32.8,32.8,32.8H96v-16l-40.1-0.1c-8.8,0-15.9-8.19-15.9-17.9v-18H24V135.2z"></path><path fill="#fbbc04" d="M168,72.8c0-18.11-14.69-32.8-32.8-32.8H116l20,16c8.8,0,16,8.29,16,18v30h16V72.8z"></path><path fill="#4285f4" d="M112,24l-32,0L68,40H56.8C38.69,40,24,54.69,24,72.8V92h16V74c0-9.71,7.2-18,16-18h80L112,24z"></path></g></svg></div></div></div></div>
The Google search bar.
Sometimes I need to filter useful information from websites, but traditional tools like Beautiful Soup, Cheerio, and Playwright feel slow to work with. Lately, I have found Large Language Models (LLMs) like ChatGPT, Claude, and Mistral to be fantastic for filtering information from websites, for these reasons:
- You do not need to install any sketchy browser extensions
- Easy to download information behind authentication walls
- You do not need to read the HTML markup
- Low risk of triggering anti-bot measures
- Easy to ask the LLM to produce a different structure or format
- You can ask the LLM questions about the data
The first part of web scraping is usually filtering information from a long list of data and mapping it to a certain structure. Here are a couple of examples of how to do that:
Example 1: Extracting local price information from Zillow
1. Using the browser's developer tools, copy the element with the data you are interested in, and paste it into the prompt.
2. Give the LLM a clear directive, for example:
Create a JSON array of every home with the price, size, type, and link
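The two steps above can also be scripted around whichever LLM you use. The sketch below is only an illustration and makes no call to any real LLM API: `build_prompt`, `parse_json_array`, and `incomplete_homes` are hypothetical helper names, and the bracket search in `parse_json_array` is a rough heuristic that assumes the reply contains exactly one JSON array.

```python
import json

# The directive from the example above, reused verbatim.
DIRECTIVE = "Create a JSON array of every home with the price, size, type, and link"
# Fields the directive asks for (hypothetical key names; the LLM may choose others).
REQUIRED_KEYS = {"price", "size", "type", "link"}


def build_prompt(element_html: str, directive: str = DIRECTIVE) -> str:
    """Put the directive first, then the copied element's HTML."""
    return f"{directive}\n\nHTML:\n{element_html}"


def parse_json_array(reply: str) -> list:
    """Pull the JSON array out of a reply that may wrap it in prose or code fences."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in the reply")
    return json.loads(reply[start:end + 1])


def incomplete_homes(homes: list) -> list:
    """Return the entries that are missing any of the requested fields."""
    return [h for h in homes if not (isinstance(h, dict) and REQUIRED_KEYS.issubset(h))]
```

Paste the result of `build_prompt(copied_html)` into the chat, then run the reply through `parse_json_array` and `incomplete_homes` to see which entries need a second look.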
Example 2: Downloading all of someone’s tweets
1. Go to their page and scroll down to the bottom so that all their tweets load.
2. Select the containing div and copy the element.
3. The HTML was quite long, so I had to save it to a file and load that file in, together with this directive:
Create a JSON array, with each tweet as an object with the content and date
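When the copied HTML is too long, one option (not what I did above, just a possible sketch) is to shrink it before handing it to the LLM. The example below uses Python's standard `html.parser` to drop scripts, styles, and SVG icons, and to strip every attribute except the ones likely to carry data. The `Slimmer` class, the `slim_html` function, and the choice of `href` and `datetime` as attributes worth keeping are all assumptions for illustration.

```python
import html
import re
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "svg", "noscript"}  # content the LLM does not need
KEEP_ATTRS = {"href", "datetime"}                   # assumed to hold links and dates


class Slimmer(HTMLParser):
    """Re-emit the HTML without skipped tags and without most attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{html.escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            # Collapse runs of whitespace and re-escape text so the output stays valid HTML.
            self.out.append(html.escape(re.sub(r"\s+", " ", data), quote=False))


def slim_html(source: str) -> str:
    parser = Slimmer()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

For a saved page, something like `slim_html(Path("tweets.html").read_text(encoding="utf-8"))` would apply it, where `tweets.html` is a hypothetical file name. Class names such as `css-wc84xp` disappear entirely, which is usually where most of the length comes from.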
Conclusion
I hope you found this article helpful. A few things to remember:
- LLMs often make mistakes, so never rely on the output being correct without checking it.
- Make sure you have permission to download the data you are after.
- LLMs are great at handling text, but still struggle to parse images, audio, and video.
- LLMs only respond to the context you give them.
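One cheap way to act on the first reminder is to check that values the LLM returned actually appear in the HTML you gave it. The sketch below is a minimal spot check, not a full validation: `find_unsupported` is a hypothetical helper, and it assumes the values were copied verbatim rather than reformatted by the LLM.

```python
import html


def find_unsupported(items, source_html, fields=("link",)):
    """Return (item, field) pairs whose value does not appear verbatim in the source."""
    # Unescape entities so that "&amp;" in the markup matches "&" in the output.
    source = html.unescape(source_html)
    flagged = []
    for item in items:
        for field in fields:
            value = str(item.get(field, ""))
            if not value or value not in source:
                flagged.append((item, field))
    return flagged
```

An empty result does not prove the output is right, but anything it flags is worth checking by hand.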
Good luck!