89 points by scraping_rust 4 months ago | 10 comments
john_doe 4 months ago
Great work! I've been looking for a Rust library for web scraping. Will definitely give it a try. Thank you for open sourcing it.
hdv 4 months ago
Just started learning Rust and this looks perfect for a small personal project I've been planning. Looking forward to playing around with it!
rust_beginner 4 months ago
I've been using Rust for a short time and would like to learn more about web scraping. Are there any resources you'd recommend for scraping in Rust?
original_poster 4 months ago
Hello! I learned mostly through a series of personal projects, but these resources could be helpful:
1. Scraping with Rust: <https://www.sushishop.pl/2016/10/16/scraping-with-rust.html>
2. Learn how to parse HTML with `cssselect` and `selectors`: <http://altsidemurphy.com/posts/rustparser/>
3. Scraping Yelp: <https://www.jbenet.com/2014/01/13/yarping-scraping-yelp-with-洛谷.html>
Hope those help! Let me know how you fare with the library.
programmer_cat 4 months ago
Those resources will help! I've noticed a lot of servers block web scraping requests. How do you handle this with your library?
original_poster 4 months ago
Fair question! Mostly, it comes down to being careful with user-agent strings, avoiding rapid-fire requests to the same domain, and retrying failed requests. You can never truly eliminate your footprint, since there are paid services that block bots when they detect specific scraping behavior. However, you can use a different user-agent string for each request and add time delays so the traffic looks more like a normal browser. This is what I did in the library, and it usually worked well enough. But it's a cat-and-mouse game, and no solution will be ironclad.
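For anyone who wants to see the shape of those three ideas (user-agent rotation, per-domain delays, retries), here is a minimal std-only sketch. It is not the library's actual code: `UserAgentRotator`, `DomainThrottle`, and `retry_with_backoff` are hypothetical names made up for illustration, the round-robin rotation and exponential backoff are my assumptions about one reasonable way to do it, and the HTTP request itself is left to whatever client you use.

```rust
use std::collections::HashMap;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: cycles through a fixed list of user-agent strings,
/// handing out the next one for each request (round-robin).
pub struct UserAgentRotator {
    agents: Vec<String>,
    next: usize,
}

impl UserAgentRotator {
    pub fn new(agents: Vec<String>) -> Self {
        assert!(!agents.is_empty(), "need at least one user agent");
        Self { agents, next: 0 }
    }

    /// Returns the next user-agent string and advances the cursor.
    pub fn next_agent(&mut self) -> &str {
        let idx = self.next;
        self.next = (self.next + 1) % self.agents.len();
        &self.agents[idx]
    }
}

/// Hypothetical helper: remembers when each domain was last hit and
/// enforces a minimum gap between requests to the same domain.
pub struct DomainThrottle {
    min_gap: Duration,
    last_hit: HashMap<String, Instant>,
}

impl DomainThrottle {
    pub fn new(min_gap: Duration) -> Self {
        Self { min_gap, last_hit: HashMap::new() }
    }

    /// How long the caller should still wait before hitting `domain` at `now`.
    pub fn wait_needed(&self, domain: &str, now: Instant) -> Duration {
        match self.last_hit.get(domain) {
            Some(&last) => self
                .min_gap
                .saturating_sub(now.saturating_duration_since(last)),
            None => Duration::ZERO,
        }
    }

    /// Sleeps if the domain was hit too recently, then records this hit.
    pub fn wait_and_record(&mut self, domain: &str) {
        let wait = self.wait_needed(domain, Instant::now());
        if !wait.is_zero() {
            thread::sleep(wait);
        }
        self.last_hit.insert(domain.to_string(), Instant::now());
    }
}

/// Runs `op` up to `max_attempts` times (at least once), sleeping
/// `base_delay * 2^attempt` between failures. `op` receives the
/// zero-based attempt number. Returns the first success or the last error.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(base_delay * 2u32.pow(attempt));
                attempt += 1;
            }
        }
    }
}
```

In use, you would call `wait_and_record` with the target domain before each request, pass `next_agent()` as the `User-Agent` header, and wrap the request in `retry_with_backoff`. Keeping the three pieces separate means each can be tuned or swapped out independently, for example replacing the round-robin with a random pick.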
js_enthusiast 4 months ago
Interesting project! Why did you choose Rust for this over JavaScript? I'd imagine there's a larger ecosystem in JavaScript for these types of libraries.
original_poster 4 months ago
@js_enthusiast Originally, I started the project as a way to learn Rust, after reading many favorable comments about its low-level control and strong typing. I wanted to challenge myself and see if I could write a decent library. The ecosystem isn't huge, I agree, but once I reached a certain point, I thought, 'why not make it open source?' Maybe it could inspire others.
web_scraper 4 months ago
If you're looking for JavaScript libraries, I would suggest Cheerio or Puppeteer.
original_poster 4 months ago
Yes, I know Cheerio quite well, and Puppeteer is a fantastic option as well. I use Puppeteer often when I need a headless browser. I appreciate the suggestions! Thanks.