445 points by mlprotect 1 year ago | 12 comments
johnsmith 1 year ago
Fascinating article! I've been looking into ML techniques for fraud detection too. What libraries and models did you use for your implementation?
ml_engineer 1 year ago
We used scikit-learn and XGBoost for our ML model, focusing mainly on decision trees and gradient boosting. In our experience, tree-based models tend to hold up better for fraud detection as the data gets more complex.
sarahdoe 1 year ago
How did you handle imbalanced datasets? I've had quite a bit of trouble with that in my own fraud detection explorations.
ml_engineer 1 year ago
Great question! We used random oversampling and SMOTE, which generates synthetic samples for the minority (fraud) class. It worked pretty well to level the playing field between the two classes.
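As a rough illustration of the two techniques named above: random oversampling duplicates existing minority rows, while SMOTE-style sampling interpolates between a minority row and one of its nearest minority neighbors. This is a from-scratch sketch on made-up toy data, not the pipeline described in this thread. The function names and the (amount, hour) features are hypothetical, and in practice a library such as imbalanced-learn would normally do this work.

```python
import random

def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority rows until there are target_size rows."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbors (Euclidean distance)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared distance to the base row.
        neighbors = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

rng = random.Random(0)
# Hypothetical toy data: (amount, hour_of_day) for the fraud class only.
fraud_rows = [(120.0, 2.0), (130.0, 3.0), (500.0, 23.0), (480.0, 22.0)]

oversampled = random_oversample(fraud_rows, target_size=10, rng=rng)
synthetic = smote_like(fraud_rows, n_new=6, k=2, rng=rng)
```

Whichever variant is used, resampling is usually applied to the training split only, so that evaluation still reflects the real class balance.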
code_monkey 1 year ago
@johnsmith @ml_engineer What was your training time like? I've found some models to be resource hogs while training.
ml_engineer 1 year ago
Yeah, the training time for some models could indeed be lengthy. We reduced it using distributed computing techniques with Dask. It parallelized our calculations nicely.
alex_coding 1 year ago
@johnsmith I'm trying to implement a similar ML system. Any tips on finding trusted datasets for testing?
johnsmith 1 year ago
I recommend checking out Kaggle and the UCI Machine Learning Repository. You can find many datasets related to financial transactions and fraud detection there.
codergirl 1 year ago
How did you address the challenge of transaction velocity in your model?
ml_engineer 1 year ago
We took time features into account, using day of the week, hour, minute, and second to compare the behavior of fraudulent transactions against legitimate ones.
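To make the time features concrete, here is a minimal sketch that splits a timestamp into the four fields mentioned above. The `time_features` helper and the sample timestamp are hypothetical, and the encoding the commenter actually used is not stated in the thread.

```python
from datetime import datetime

def time_features(timestamp):
    """Split a transaction timestamp into simple calendar features."""
    return {
        "day_of_week": timestamp.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": timestamp.hour,
        "minute": timestamp.minute,
        "second": timestamp.second,
    }

# Hypothetical transaction timestamp, for illustration only.
features = time_features(datetime(2024, 3, 15, 23, 41, 7))
```

Each transaction then carries these values as extra model inputs alongside fields such as the amount.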
alvin_acoder 1 year ago
What about false positives? Those could frustrate legitimate users.
ml_engineer 1 year ago
Yes, false positives are a challenge indeed. We maintain a feedback loop with users and monitor the rate closely. We also adjust our confidence thresholds based on the ratio of false positives to actual fraud detections.
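One way such a threshold adjustment could work, sketched under assumptions: given model scores and confirmed outcomes from the feedback loop, pick the lowest threshold whose ratio of false positives to true fraud detections stays under a target. The function names, the candidate thresholds, the 0.5 target, and the toy data are all hypothetical.

```python
def fp_to_tp_ratio(scores, labels, threshold):
    """Return false positives per true positive at a given score threshold."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_pos = sum(flagged)
    false_pos = len(flagged) - true_pos
    return false_pos / true_pos if true_pos else float("inf")

def pick_threshold(scores, labels, candidates, max_ratio):
    """Pick the lowest candidate threshold whose FP:TP ratio is within max_ratio."""
    for threshold in sorted(candidates):
        if fp_to_tp_ratio(scores, labels, threshold) <= max_ratio:
            return threshold
    return None  # no candidate meets the target

# Hypothetical model scores and confirmed outcomes (1 = fraud, 0 = legitimate).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

threshold = pick_threshold(scores, labels, candidates=[0.3, 0.5, 0.7, 0.9], max_ratio=0.5)
```

Choosing the lowest qualifying threshold keeps as many fraud cases flagged as possible while still respecting the false-positive target.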