88 points by web_scraper 6 months ago | 14 comments
scrapinglibraryuser 6 months ago
I've been using this new web scraping library for dynamic webpages, and I have to say it has been a game changer. It makes handling dynamic websites with non-static content a breeze.
javascriptguru 6 months ago
That's really good to hear! I've been struggling with web scraping recently, as more websites have started to rely on JavaScript for dynamic content. How does this library handle pages with JS-rendered content?
scrapinglibraryuser 6 months ago
Great question! The library uses WebDriver to drive a real browser, which executes the page's JavaScript, so you get back the fully rendered, interactive DOM, just as an actual user would see the webpage.
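To make the idea concrete, here is a minimal sketch using plain Selenium rather than the library itself (I'm not showing its API here). The function names `rendered_html` and `make_headless_chrome` are made up for illustration, and I'm assuming Selenium 4 with a local Chrome install. It loads a page, waits until the browser reports the document as complete, and returns the DOM after scripts have run:

```python
import time


def rendered_html(driver, url, timeout=10.0, poll=0.25):
    """Load `url` in a WebDriver session and return the DOM after JS has run.

    `driver` is any object with Selenium's get(), execute_script() and
    page_source, so a real browser session works here.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # document.readyState is a standard DOM property; "complete" means
        # the load event has fired. Pages that fetch data afterwards need a
        # page-specific check instead of this one.
        if driver.execute_script("return document.readyState") == "complete":
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} did not finish loading in {timeout}s")
        time.sleep(poll)


def make_headless_chrome():
    """Hypothetical helper: assumes Selenium 4 and a local Chrome install."""
    # Imported lazily so the rest of the module works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)
```

Typical use would be `driver = make_headless_chrome()`, then `html = rendered_html(driver, url)`, with `driver.quit()` in a `finally` block so the browser always shuts down.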
javascriptguru 6 months ago
Mind-blowing. I'm well aware of the challenges of getting at dynamic AJAX/JS-rendered content. I will definitely check it out. Thank you for your insights! :)
scrapinglibraryuser 6 months ago
WebDriver can be quite performant, as it processes multiple requests asynchronously; moreover, with clever use of the library, you can parallelize the heavy tasks in your architecture. This can actually perform better than serial scraping in some scenarios.
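As a rough sketch of the parallel part, this is how I'd fan URLs out over a thread pool with only the standard library. `scrape_all` and the `fetch` callable are names I made up for illustration, not part of the library. One assumption worth stating: Selenium documents WebDriver as not thread-safe, so `fetch` should create and quit its own browser session per call (or per worker) rather than share one.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a thread pool.

    Returns (results, errors): two dicts keyed by URL, so one failing page
    does not abort the whole batch.
    """
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labelled.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Keep `max_workers` modest: every worker holds a full browser in memory, so the speedup over serial scraping depends on how much RAM and CPU you can spare.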
webscrapingjunky 6 months ago
I have been using this library for the past few days, and it's really amazing how easy it makes getting dynamic content. Before, I had to jump through flaming hoops to deal with this type of content. Big thumbs up to the developers!
surprisedprogrammer 6 months ago
This looks fascinating; I have heard of WebDriver before, but didn't think it could be used for web scraping purposes. How fast is it, compared to traditional methods of scraping?
javascriptguru 6 months ago
@surprisedprogrammer In addition, since it takes care of the JS rendering itself, you can potentially reuse most of the request logic from your previous scraping attempts, which should make the migration fairly seamless.
undercoverdev 6 months ago
This all sounds appealing! Thanks for sharing, and I am glad it exists. I've been doing some web scraping and ended up getting blocked and banned. I want to make sure I steer clear of those issues with this library. Does anyone have experience using it in a way that reduces the chances of being blocked?
webdriverpro 6 months ago
When you say 'rotating proxies' @undercoverdev and @scrapinglibraryuser, are you referring to a collection of public proxies or some established proxy network? I've tried using 'random' public free proxies to mimic user locations, but frankly speaking, it was frustrating. I'd appreciate some insights on best practices!
seniorwebdeveloper 6 months ago
I have been involved in web scraping for the past 15 years, and I strongly concur with everyone praising this new library. Things have taken a turn for the better, as libraries such as this one have drastically reduced the complexity of dealing with dynamic web pages. Thank you for bringing this forward!
scrapinglibraryuser 6 months ago
Thanks for your kind words, seniorwebdeveloper! @undercoverdev It loads and reads the page much like a person would, so if your scraper already draws little attention, you will have similar chances with this library. Using tools like rotating proxies would definitely help further.
undercoverdev 6 months ago
Thanks for the response! I'm aware of the headache around 'free' proxies and don't wish to attempt that. I'm more interested in learning about established proxy networks and how to use them to avoid potential blocking/bans ...
scrapinglibraryuser 6 months ago
@undercoverdev Check out ScrapingBee, ScrapingHub and Crawling Proxies. They offer good client libraries for routing requests through their networks safely. The library works with them after only a few lines of code changes and is smooth to use. Happy scraping ;)
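For the rotation itself, a generic round-robin sketch looks something like this. It is not any provider's client library: `ProxyRotator` is a made-up helper, and the proxy addresses you pass in are whatever endpoints your network gives you. The one real external piece is Chrome's `--proxy-server` command-line switch.

```python
import itertools


class ProxyRotator:
    """Hypothetical helper that hands out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        # itertools.cycle repeats the list forever, in order.
        self._cycle = itertools.cycle(list(proxies))

    def next_proxy(self):
        """Return the next proxy address, wrapping around at the end."""
        return next(self._cycle)

    def chrome_args(self):
        """Return Chrome arguments that route a new session via the next proxy."""
        return [f"--proxy-server={self.next_proxy()}"]
```

You would pass each value from `chrome_args()` to `add_argument()` on the browser options before starting a session, so each new session goes out through a different proxy. As far as I know, proxies that need a username and password take extra setup beyond this flag, which I haven't shown.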