320 points by ml_enthusiast 6 months ago flag hide 31 comments
mlfanatic 6 months ago next
Fantastic article! I've been looking for resources on using machine learning for text analysis, specifically email classification. This is a great starting point.
datamaster42 6 months ago next
Glad you found it helpful. Check out the tutorial on using NLP techniques for email classification. It goes into more depth on the subject.
mlfanatic 6 months ago next
I'll have to check out that tutorial. I'm still a beginner with NLP. @bob123 I've thought about classifying by priority and type, but I'm worried about false positives. Any recommendations?
datamaster42 6 months ago prev next
You can reduce false positives by cleaning up the training data and by moving the decision threshold, which trades some recall for higher precision. @bob123 I recently read about using DNNs for email type classification. The reported results were strong.
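A minimal sketch of the precision/recall trade-off, assuming the classifier outputs a score per email. The scores and labels are made up for illustration, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the positive class at a given score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up classifier scores (probability of "urgent") and true labels.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

# Raising the threshold removes false positives but starts missing real ones.
for threshold in (0.5, 0.7, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, moving the threshold from 0.5 to 0.7 removes the one false positive at the cost of missing one true positive. The threshold should be picked on a validation set rather than on the training data.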
bob123 6 months ago prev next
Text classification can be used for so much more than just spam detection. Have you tried classifying emails by priority or type? Ex. marketing, urgent, etc.
bob123 6 months ago next
DNNs do work well. I suggest combining them with a traditional approach, such as Bayesian filters, for a hybrid solution. That gives you the best of both worlds.
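One possible way to sketch the hybrid idea: a tiny multinomial Naive Bayes filter whose probability is averaged with a neural network's score. Everything here is an assumption for illustration: `TinyNaiveBayes` and `blend` are hypothetical names, the training emails are invented, and `p_neural` is a placeholder number rather than the output of a real DNN.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = set(w for c in self.classes for w in self.word_counts[c])
        return self

    def predict_proba(self, doc):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[c][word] + 1
                    score += math.log(count / (total_words + len(self.vocab)))
            log_scores[c] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}

def blend(p_bayes, p_neural, weight=0.5):
    """Weighted average of two probability estimates for the same class."""
    return weight * p_bayes + (1 - weight) * p_neural

docs = ["win a free prize now", "free offer click now",
        "meeting agenda for monday", "please review the attached report"]
labels = ["marketing", "marketing", "work", "work"]
nb = TinyNaiveBayes().fit(docs, labels)

p_bayes = nb.predict_proba("free prize offer")["marketing"]
p_neural = 0.70  # placeholder standing in for a DNN's output, not a real model
print(round(blend(p_bayes, p_neural), 3))
```

A weighted average is only the simplest way to combine the two; the weight is something you would tune on held-out data.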
hackingnerd 6 months ago prev next
I agree with @bob123. Hybrid solutions can achieve the highest accuracy. It's all about finding the right balance between the different techniques.
sally222 6 months ago next
I've been working on an email classification project with a hybrid model. It's been a huge challenge, but it's definitely achieving higher accuracy than a single approach.
hackingnerd 6 months ago next
Glad to hear it. Feel free to reach out if you need any code review or suggestions. Collaboration is key to improving our skills.
mlfanatic 6 months ago prev next
I really appreciate the collaboration on this topic. Have any of you explored using ML for filtering unsolicited marketing emails specifically?
datamaster42 6 months ago next
@MLFanatic Definitely. I recently used a neural network to filter out marketing emails with great success. The results were comparable to using pre-built filters, but with more customization options.
mlfanatic 6 months ago next
@DataMaster42 That's interesting. I've been playing around with LSTM models and they seem to perform better for longer text. What type of model did you use?
datamaster42 6 months ago next
I actually used a simple feedforward neural network with basic hyperparameters. There's a lot more room for improvement, but it worked well as a baseline. I'd recommend starting with that and then tweaking it based on your specific use case.
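For anyone wanting to see what such a baseline looks like, here is a rough pure-Python sketch of a feedforward network with one hidden layer on bag-of-words features. It is not the model described above: `TinyFeedforward`, `featurize`, the four toy emails, and the hyperparameters are all assumptions for illustration, and a real project would more likely use a deep learning framework.

```python
import math
import random

def featurize(text, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyFeedforward:
    """One tanh hidden layer, sigmoid output, trained with plain SGD."""

    def __init__(self, n_inputs, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def train_step(self, x, y, lr=0.1):
        hidden, out = self.forward(x)
        d_out = out - y  # cross-entropy gradient at the output pre-activation
        for j, h in enumerate(hidden):
            d_hidden = d_out * self.w2[j] * (1.0 - h * h)  # backprop through tanh
            self.w2[j] -= lr * d_out * h
            self.b1[j] -= lr * d_hidden
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d_hidden * xi
        self.b2 -= lr * d_out

    def loss(self, xs, ys):
        """Mean binary cross-entropy over a dataset."""
        eps = 1e-12
        total = 0.0
        for x, y in zip(xs, ys):
            p = self.forward(x)[1]
            total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        return total / len(xs)

emails = ["huge sale buy now", "discount sale ends today",
          "project status update", "lunch meeting tomorrow"]
targets = [1, 1, 0, 0]  # 1 = marketing, 0 = not marketing
vocab = sorted({w for e in emails for w in e.split()})
xs = [featurize(e, vocab) for e in emails]

net = TinyFeedforward(n_inputs=len(vocab))
loss_before = net.loss(xs, targets)
for _ in range(300):
    for x, y in zip(xs, targets):
        net.train_step(x, y)
loss_after = net.loss(xs, targets)
print(f"loss before={loss_before:.3f} after={loss_after:.3f}")
```

Four training emails only show that the loop runs and the loss goes down. Any real evaluation needs a held-out test set.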
mlfanatic 6 months ago prev next
@DataMaster42 That sounds like a good plan. I'm going to give it a try. Thank you for the advice!
sally222 6 months ago prev next
Ooh, I'm working on a similar project and using a text representation algo called Word2Vec. It's been working out pretty well so far.
mlfanatic 6 months ago next
@sally222 That's cool. I'm familiar with it, but I haven't tried using it yet. Can you tell me what kind of results you have seen so far?
sally222 6 months ago next
@MLFanatic I've been working on it for a few weeks now and the results are promising. I'm using a pre-trained Word2Vec model and it's been effective in capturing the meaning and context of the words in the email text. I'm planning to publish a blog post on it soon, so stay tuned!
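A common way to turn word vectors into a single email feature vector is to average them. The sketch below assumes that approach, which may or may not match the project described above. `toy_vectors` is an invented 3-dimensional stand-in for a real pre-trained Word2Vec model, and `average_embedding` is a hypothetical helper.

```python
def average_embedding(tokens, vectors, dim):
    """Mean of the vectors for known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Toy 3-dimensional vectors standing in for a real pre-trained model.
toy_vectors = {
    "invoice": [0.9, 0.1, 0.0],
    "payment": [0.7, 0.3, 0.0],
    "party":   [0.0, 0.2, 0.8],
}

tokens = "payment for invoice attached".split()
email_vector = average_embedding(tokens, toy_vectors, dim=3)
print(email_vector)
```

Words missing from the model ("for" and "attached" here) are simply skipped. Averaging discards word order, which is one reason sequence models such as LSTMs are often tried next.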
hackingnerd 6 months ago prev next
@sally222 That's awesome! Looking forward to reading your blog post. Good luck with the project!
bob123 6 months ago prev next
I'm curious, how did you approach handling unstructured data in emails? Particularly with regard to things like variable formatting, indentation, and line breaks.
datamaster42 6 months ago next
I faced similar challenges. My solution was to standardize the formatting and remove any irrelevant information, such as signatures, before feeding the data into the model. Depending on the complexity, you could also explore using a pre-processing library like NLTK for text cleanup.
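A minimal sketch of that kind of cleanup using only the standard library. The signature markers are assumptions and would need adjusting for real mail; `clean_email_body` is a hypothetical helper name.

```python
import re

# Assumed markers for where a signature starts; real mail will need more.
SIGNATURE_MARKERS = ("--", "Best regards", "Sent from my")

def clean_email_body(body):
    """Drop quoted replies and a trailing signature, then normalize whitespace."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):  # quoted text from an earlier message
            continue
        if any(stripped.startswith(m) for m in SIGNATURE_MARKERS):
            break  # treat everything from here on as signature
        kept.append(stripped)
    # Collapse tabs, repeated spaces, and line breaks into single spaces.
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

raw = ("Hi team,\n\n   The   report is\tattached.\n"
       "> earlier reply\n\nBest regards,\nAlice\n555-0100")
print(clean_email_body(raw))  # prints: Hi team, The report is attached.
```

Cutting at the first marker is a blunt heuristic: a body line that happens to start with "--" would truncate the message, so it is worth spot-checking the output on a sample of real emails.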
codefiro 6 months ago next
I've been using NLTK for some of my pre-processing and it's pretty great for removing stop words and punctuation. It can also help with stemming and lemmatization.
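To show the shape of that step without requiring NLTK's data downloads, here is a standard-library stand-in. The stop word list is a tiny hand-written sample, not NLTK's; in practice `nltk.corpus.stopwords` would replace it, and stemming or lemmatization would be an extra pass over the tokens.

```python
import string

# Tiny illustrative list; a real list (e.g. NLTK's) is much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("The invoice is attached, and payment is due in 30 days!"))
# prints: ['invoice', 'attached', 'payment', 'due', '30', 'days']
```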
mlfanatic 6 months ago next
@codefiro Yes, I agree. NLTK is an awesome library for text pre-processing. I've been using it for a while now. Did you try using any deep learning techniques for text classification?
codefiro 6 months ago next
@MLFanatic Yes, I recently tried using a combination of Word2Vec and LSTM for a project. It actually worked out quite well. The LSTM layer was able to effectively capture the information in the Word2Vec vectors for text classification.
sally222 6 months ago prev next
How do you approach the class imbalance problem in your email classification? There are usually many more non-spam emails than spam emails.
datamaster42 6 months ago next
I encountered this problem as well. I used a combination of random oversampling of the minority class and random undersampling of the majority class. Alternatively, you can also try using techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples for the minority class.
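A rough sketch of combined random over- and undersampling with the standard library. `rebalance` and `target_per_class` are hypothetical names and the data is invented. SMOTE is not shown here; it needs numeric feature vectors and is available in the imbalanced-learn package.

```python
import random

def rebalance(samples, labels, target_per_class, seed=0):
    """Randomly over- or undersample each class to target_per_class items."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            # Majority class: undersample without repeats.
            chosen = rng.sample(items, target_per_class)
        else:
            # Minority class: keep everything, then add random duplicates.
            chosen = items + rng.choices(items, k=target_per_class - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels

samples = [f"ham {i}" for i in range(8)] + ["spam 0", "spam 1"]
labels = ["ham"] * 8 + ["spam"] * 2
new_samples, new_labels = rebalance(samples, labels, target_per_class=5)
print(new_labels.count("ham"), new_labels.count("spam"))  # prints: 5 5
```

Resampling should be applied to the training split only. If duplicated emails leak into the test split, the reported accuracy will be inflated.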
mlfanatic 6 months ago next
@DataMaster42 That's very useful. I will definitely try those techniques in my next project. Thank you for your input.
hackingnerd 6 months ago prev next
@bob123 As for unstructured data, I've been trying out a few ways to handle variable formatting. One is to write a custom pre-processing function that parses the data and removes any unnecessary information before tokenization. Another is to normalize values and indentation to a consistent notation, which makes the information easier to extract. Have you found any other solutions that work better for this issue?
bob123 6 months ago next
@hackingnerd Those are some great ways to handle unstructured data. I've also been considering using a Named Entity Recognition (NER) model to extract specific information, such as the sender, recipient, and date, before classifying the email by priority and type. Have you tried integrating this approach into your project yet?
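Worth noting that when the raw message is available, sender, recipient, and date sit in structured headers and can be read without a model. The sketch below is header parsing with Python's standard `email` package, not NER; NER would still be needed for names, organizations, or dates mentioned in the body. The sample message and `extract_header_features` are invented for illustration.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Invented sample message in standard header/body form.
raw = (
    "From: Alice Example <alice@example.com>\n"
    "To: bob@example.com\n"
    "Date: Mon, 01 Jan 2024 09:30:00 +0000\n"
    "Subject: Q1 planning\n"
    "\n"
    "Can we meet on Friday?\n"
)

def extract_header_features(raw_message):
    """Pull sender, recipient, date, and subject out of the message headers."""
    msg = message_from_string(raw_message)
    return {
        "sender": parseaddr(msg["From"])[1],      # address part only
        "recipient": parseaddr(msg["To"])[1],
        "date": parsedate_to_datetime(msg["Date"]),
        "subject": msg["Subject"],
    }

features = extract_header_features(raw)
print(features["sender"], features["date"].isoformat())
```

Fields like these can be fed to the priority/type classifier as extra features alongside the text.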
hackingnerd 6 months ago next
@bob123 That's a great idea. I actually haven't tried integrating NER into my project yet, but it seems like it would provide valuable information for email classification. Do you have any recommendations for good NER libraries or APIs to use for this type of task?
bob123 6 months ago next
@hackingnerd I've had good results with the Stanford Named Entity Recognizer (NER) and the spaCy library. Stanford NER is a powerful Java-based tool that can detect a wide range of entities, but it has a steeper learning curve than spaCy. spaCy is a popular Python library that's known for its speed and versatility. Both are excellent options for extracting named entities from text data.
sally222 6 months ago prev next
Has anyone tried using BERT embeddings for email classification? I'm wondering if they would work better than Word2Vec for this type of problem.