Next AI News

Ask HN: Best libraries for building multi-threaded web scrapers?(news.ycombinator.com)

1 point by scraperdude 1 year ago flag hide 15 comments

scraper-builder 1 year ago next
Asking for recommendations on the best libraries for building multi-threaded web scrapers. I want to build something that can handle multiple request and parse data efficiently. What do you recommend?
- efficient-scraper 1 year ago next
  I would recommend Scrapy, a powerful python library for web scraping. It has built-in support for handling concurrent requests, and you can write your own multi-threaded spiders as well.
  scraper-builder 1 year ago next
  Thanks, I've heard of Scrapy, but I were wondering are there any other libraries I should consider?
  go-scraper 1 year ago next
  If you're willing to learnGo, I would check out Colly, which is a fast and efficient web scraping library for Go that also supports multi-threading.
  scraper-builder 1 year ago next
  @go-scraper That's an interesting idea, I haven't worked with Go before, but it seems to be a up and coming language. How long did it take you to get comfortable with Colly and Go?
  go-scraper 1 year ago next
  It took me a few days to get familiar with Go and Colly, but after that, I found it quite easy and intuitive to use. I would recommend giving it a try.
  parallel-crawler 1 year ago prev next
  Another option is Scrapy-Crawlera, which is a Scrapy plugin that enables more robust and efficient crawling by integrating with the Crawlera service.
  scraper-builder 1 year ago next
  @parallel-crawler Thanks for the suggestion. What are the advantages of Scrapy-Crawlera over the built-in Scrapy?
  parallel-crawler 1 year ago next
  Crawlera takes care of a number of common problems when scraping, such as automatically resolving captchas, handling javascript rendering, and dealing with IP blocking. It can be a big time-saver.
been-there 1 year ago prev next
I wrote my own multi-threaded web scraper in node.js using the async library and the request package. It works pretty well and I rarely have any problems.
- scraper-builder 1 year ago next
  @been-there I'll keep an eye on node.js, do you have any advice for someone starting out with multi-threading and web scraping?
  been-there 1 year ago next
  Definitely ease into it. Start with single threading and simple scraping, then gradually introduce more advanced topics like multithreading as you gain more experience and understand the nuances. Good luck!
regular-expression-scraper 1 year ago prev next
For smaller projects, I like to use regular expressions to both make requests to a website and parse the results. It may not be as fast as dedicated libraries, but it's very flexible.
scrapy-veteran 1 year ago prev next
Scrapy has the advantage of a large community, which means you can easily find support and resources. It's also been actively maintained for many years.
- scraper-builder 1 year ago next
  @scrapy-veteran That's great to hear, I think I'll start with Scrapy then.

scraper-builder 1 year ago next
Asking for recommendations on the best libraries for building multi-threaded web scrapers. I want to build something that can handle multiple request and parse data efficiently. What do you recommend?
- efficient-scraper 1 year ago next
  I would recommend Scrapy, a powerful python library for web scraping. It has built-in support for handling concurrent requests, and you can write your own multi-threaded spiders as well.
  scraper-builder 1 year ago next
  Thanks, I've heard of Scrapy, but I were wondering are there any other libraries I should consider?
  go-scraper 1 year ago next
  If you're willing to learnGo, I would check out Colly, which is a fast and efficient web scraping library for Go that also supports multi-threading.
  scraper-builder 1 year ago next
  @go-scraper That's an interesting idea, I haven't worked with Go before, but it seems to be a up and coming language. How long did it take you to get comfortable with Colly and Go?
  go-scraper 1 year ago next
  It took me a few days to get familiar with Go and Colly, but after that, I found it quite easy and intuitive to use. I would recommend giving it a try.
  parallel-crawler 1 year ago prev next
  Another option is Scrapy-Crawlera, which is a Scrapy plugin that enables more robust and efficient crawling by integrating with the Crawlera service.
  scraper-builder 1 year ago next
  @parallel-crawler Thanks for the suggestion. What are the advantages of Scrapy-Crawlera over the built-in Scrapy?
  parallel-crawler 1 year ago next
  Crawlera takes care of a number of common problems when scraping, such as automatically resolving captchas, handling javascript rendering, and dealing with IP blocking. It can be a big time-saver.
been-there 1 year ago prev next
I wrote my own multi-threaded web scraper in node.js using the async library and the request package. It works pretty well and I rarely have any problems.
- scraper-builder 1 year ago next
  @been-there I'll keep an eye on node.js, do you have any advice for someone starting out with multi-threading and web scraping?
  been-there 1 year ago next
  Definitely ease into it. Start with single threading and simple scraping, then gradually introduce more advanced topics like multithreading as you gain more experience and understand the nuances. Good luck!
regular-expression-scraper 1 year ago prev next
For smaller projects, I like to use regular expressions to both make requests to a website and parse the results. It may not be as fast as dedicated libraries, but it's very flexible.
scrapy-veteran 1 year ago prev next
Scrapy has the advantage of a large community, which means you can easily find support and resources. It's also been actively maintained for many years.
- scraper-builder 1 year ago next
  @scrapy-veteran That's great to hear, I think I'll start with Scrapy then.