187 points by scrapingninja 5 months ago | 14 comments
scraperjohn 5 months ago
Optimizing Web Scraping Techniques with Machine Learning | I've been optimizing my web scraping tasks with ML and have seen a significant improvement in results. This post offers an in-depth analysis of the process I followed.
wizcode 5 months ago
Great post. I've been looking for ways to improve my web scraping, and this is really helpful. Which ML models did you use, exactly?
scraperjohn 5 months ago
Hey @wizcode, I used a Random Forest for feature selection and a Support Vector Machine as the classifier. That gave me better data and cut scraping time by 30%.
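A minimal sketch of that kind of setup, assuming scikit-learn is installed. The data is synthetic and stands in for features extracted from scraped pages; the feature counts, hyperparameters, and the choice of `SelectFromModel` for the selection step are illustrative assumptions, not the values or code from the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for page-derived features (assumption: 20 features, 5 useful).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=0, class_sep=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipeline = make_pipeline(
    # Keep only features whose forest importance exceeds the mean importance.
    SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0)),
    # SVMs are sensitive to feature scale, so standardize before classifying.
    StandardScaler(),
    SVC(kernel="rbf"),
)
pipeline.fit(X_train, y_train)

n_selected = int(pipeline.named_steps["selectfrommodel"].get_support().sum())
accuracy = pipeline.score(X_test, y_test)
print(f"kept {n_selected} of {X.shape[1]} features, test accuracy {accuracy:.2f}")
```

Putting the selector inside the pipeline means feature selection is fit on the training split only, which avoids leaking test data into the selection step.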
mlcodegirl 5 months ago
Sounds interesting! Have you tried using deep learning models like LSTM for this task? I believe they could yield better results.
scraping_newbie 5 months ago
I'm new to web scraping and was wondering if anyone could help me understand the best practices for using ML in web scraping tasks.
scrapeyoda 5 months ago
I would recommend starting with baseline models like logistic regression or decision trees for your web scraping task. Then once you have a good understanding of how those models work, you can explore more complex models like deep learning.
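To make the baseline idea concrete, here is a minimal sketch of logistic regression trained with plain gradient descent, using only the standard library. The task (telling price-like text nodes from other text), the two hand-picked features, and the toy training strings are all made up for illustration.

```python
import math

def features(text):
    """Two hypothetical features for a scraped text node."""
    digits = sum(ch.isdigit() for ch in text)
    return [digits / max(len(text), 1),      # share of digit characters
            1.0 if "$" in text else 0.0]     # contains a dollar sign

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b

def predict(text, w, b):
    """Return True when the model scores the text as price-like."""
    x = features(text)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5

# Toy, made-up training data: 1 = price-like, 0 = other text.
texts = ["$19.99", "$5", "$1,250.00", "$42.50",
         "Add to cart", "Free shipping on orders", "Customer reviews", "About us"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = train([features(t) for t in texts], labels)

print(predict("$7.25", w, b), predict("Contact support", w, b))
```

A model this small is easy to inspect: the learned weights show directly which feature drives the decision, which is the main reason to start with a baseline before moving to anything deeper.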
programmingprincess 5 months ago
If you're new to web scraping with ML, check out the Scrapy framework and the scikit-learn library. They're a great starting point for any web scraping task.
scraperrick 5 months ago
I would recommend looking into Active Learning models as well. I've used them in my web scraping tasks to reduce the manual labeling of data by up to 50%.
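As a rough illustration of how active learning cuts down labeling, here is a minimal sketch of pool-based uncertainty sampling using only the standard library. The one-dimensional feature, the nearest-centroid model, the `oracle` function standing in for a human annotator, and the budget of 10 queries are all assumptions chosen to keep the example short.

```python
import random

def fit_centroids(labeled):
    """Nearest-centroid model: the mean feature value of each class."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    return sum(neg) / len(neg), sum(pos) / len(pos)

def margin(x, centroids):
    """Small margin means the model is unsure which class x belongs to."""
    neg_c, pos_c = centroids
    return abs(abs(x - neg_c) - abs(x - pos_c))

def predict(x, centroids):
    neg_c, pos_c = centroids
    return 1 if abs(x - pos_c) < abs(x - neg_c) else 0

def oracle(x):
    """Stand-in for a human annotator (hypothetical ground truth)."""
    return 1 if x > 0.6 else 0

random.seed(0)
pool = [random.random() for _ in range(200)]  # made-up feature values

# Seed set: the two extreme items, which gives one example of each class.
lo, hi = min(pool), max(pool)
labeled = [(lo, oracle(lo)), (hi, oracle(hi))]
unlabeled = [x for x in pool if x not in (lo, hi)]

for _ in range(10):  # labeling budget: 10 queries to the annotator
    centroids = fit_centroids(labeled)
    # Ask for a label only on the item the current model is least sure about.
    query = min(unlabeled, key=lambda x: margin(x, centroids))
    unlabeled.remove(query)
    labeled.append((query, oracle(query)))

centroids = fit_centroids(labeled)
accuracy = sum(predict(x, centroids) == oracle(x) for x in pool) / len(pool)
print(f"labeled {len(labeled)} of {len(pool)} items, pool accuracy {accuracy:.2f}")
```

The loop spends its labeling budget near the decision boundary, where each new label moves the model the most, instead of labeling items the model already classifies with a wide margin.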
codeamazon 5 months ago
Active Learning sounds very interesting and I'm planning to give it a try in my scraping tasks. Thanks for the recommendation!
scraperqueen 5 months ago
@codeamazon, I've found active learning to be a game changer for my web scraping tasks. It significantly reduced the time and resources I spent on manual data labeling. Good luck with your implementation!
neural_nerd 5 months ago
I've had success with LSTM networks and word embeddings for web scraping tasks like this. Here's a link to a blog post I wrote about it: www.example.com/webscraping_lstm
dataman_jim 5 months ago
I've been playing around with using a combination of XGBoost and Named Entity Recognition (NER) for web scraping tasks, and it's yielding some good results.
codejedi 5 months ago
XGBoost and NER is an interesting combination. I'll have to check that out. Do you have any links to resources to help me get started?
rstools 5 months ago
@codejedi, I'm not sure if @dataman_jim has provided any links, but here is a good resource to get started with XGBoost and NER: www.example.com/xgboost_ner