210 points by data_scraper 5 months ago flag hide 10 comments
user1 5 months ago next
Nice work! I've been looking for a tool like this to help me find new job listings. What APIs or libraries did you use for the web scraping?
creator 5 months ago next
I mainly used the requests and BeautifulSoup libraries in Python. I made HTTP requests to the job boards' websites and then parsed the HTML to extract the relevant information.
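The approach described — fetch a page, then parse the HTML with BeautifulSoup — can be sketched roughly like this. The HTML snippet, class names, and URL below are invented for illustration; any real job board's markup will differ:

```python
# Sketch of the requests + BeautifulSoup approach described above.
# A real run would fetch the page first, e.g.:
#   html = requests.get("https://example-board.com/jobs").text
from bs4 import BeautifulSoup

html = """
<ul class="jobs">
  <li class="job"><a href="/jobs/1">Data Engineer</a><span class="loc">Berlin</span></li>
  <li class="job"><a href="/jobs/2">ML Engineer</a><span class="loc">Remote</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
listings = [
    {
        "title": li.a.get_text(strip=True),
        "url": li.a["href"],
        "location": li.select_one(".loc").get_text(strip=True),
    }
    for li in soup.select("li.job")
]
```

Selecting on a stable CSS class (here the hypothetical `li.job`) tends to survive cosmetic page changes better than positional indexing.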
user2 5 months ago prev next
Awesome, I've used those libraries before too. Did you run into any issues with websites blocking your IP due to excessive requests?
creator 5 months ago next
Yes, I did run into that issue a few times. To get around it, I added some random sleep times between requests and also rotated my IP using a VPN. It's not a perfect solution but it helped reduce the number of blocked requests.
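A minimal sketch of that throttling pattern: sleep a random interval between requests and rotate through a pool of outbound addresses. The proxy list and the `fetch` callable are placeholders (the creator used a VPN rather than an explicit proxy pool):

```python
import itertools
import random
import time

# Hypothetical proxy pool standing in for the VPN rotation mentioned above.
PROXIES = itertools.cycle(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])

def polite_fetch(url, fetch, min_delay=1.0, max_delay=4.0):
    """Sleep a random interval, pick the next proxy, then delegate the request."""
    time.sleep(random.uniform(min_delay, max_delay))
    proxy = next(PROXIES)
    return fetch(url, proxy)

# `fetch` would normally wrap requests.get(url, proxies={"http": proxy});
# a stub keeps the sketch runnable offline:
result = polite_fetch(
    "https://example.com/jobs",
    lambda url, proxy: (url, proxy),
    min_delay=0.0,
    max_delay=0.01,
)
```

Randomizing the delay (rather than a fixed interval) makes the request pattern less obviously machine-generated.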
user3 5 months ago prev next
What methods did you use to determine the URLs for the job listings?
creator 5 months ago next
I used a combination of static URLs for the job boards' listing pages and dynamically discovered ones. For the dynamic URLs, I scraped the links out of the pages and then used regular expressions to match and extract the URL parameters for filtering criteria such as job type and location.
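The regex step might look something like this. The links and the parameter names (`type`, `location`) are assumptions for illustration, not the actual job boards' query strings:

```python
import re

# Links as they might be scraped from a listings page (made up for this sketch).
links = [
    "https://example-board.com/search?type=fulltime&location=berlin&page=2",
    "https://example-board.com/search?type=contract&location=remote",
]

# Match only the filtering parameters we care about; ignore paging, etc.
PARAM_RE = re.compile(r"[?&](type|location)=([^&#]+)")

def filter_params(url):
    """Extract the filtering criteria (job type, location) from a listing URL."""
    return dict(PARAM_RE.findall(url))

params = [filter_params(u) for u in links]
```

For well-formed URLs, `urllib.parse.parse_qs` is a sturdier alternative to a hand-rolled regex, at the cost of also returning parameters you don't care about.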
user4 5 months ago prev next
Thanks for sharing your solution. I'm currently working on a similar project and I want to make sure I'm not missing out on any key considerations. How did you store the extracted data?
creator 5 months ago next
I used a combination of a relational database and JSON files: the extracted data went into a MySQL database with a simple schema, and the metadata and configuration lived in JSON files. Using both let me balance flexibility, scalability, and performance.
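A rough sketch of that split, with sqlite3 standing in for MySQL so it runs without a database server; the schema columns and config keys are assumptions, not the creator's actual schema:

```python
import json
import sqlite3

# sqlite3 stands in for the MySQL database mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE jobs (
           id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           company TEXT,
           location TEXT,
           url TEXT UNIQUE
       )"""
)
conn.execute(
    "INSERT INTO jobs (title, company, location, url) VALUES (?, ?, ?, ?)",
    ("Data Engineer", "Acme", "Berlin", "https://example.com/jobs/1"),
)

# Scraper configuration kept as JSON alongside the database
# (would normally be written to a config.json file).
config = {"boards": ["https://example-board.com"], "delay_seconds": 2}
config_json = json.dumps(config)

row = conn.execute("SELECT title, location FROM jobs").fetchone()
```

The `UNIQUE` constraint on `url` is a cheap way to deduplicate listings that appear across multiple scrape runs.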
user5 5 months ago prev next
Are there any other considerations you want to share before I start creating my own web scraper?
creator 5 months ago next
Yes, a few more things to consider: obey the websites' terms of use and robots.txt rules, use caching to reduce the load on the servers and your own bandwidth, and add error handling and logging so your scraper stays reliable and recovers from failures. Good luck with your project!