Strategies for Scaling Machine Learning Pipelines in Production
34 points by ml_engineer 7 months ago | 21 comments
johnsmith 7 months ago next
I've been noticing some challenges with scaling our ML pipelines, and I'm curious what strategies others are using. We're struggling to handle increasing data sizes without sacrificing model performance or accuracy.
mlkiller 7 months ago next
Bottlenecks in ML pipelines typically come from either data processing or model computation. We've addressed data preprocessing using Dask, a parallel computing library, with great results.
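A minimal sketch of the kind of preprocessing we push through Dask (the path and column names are just placeholders):

```python
import dask.dataframe as dd

# Read a directory of CSVs lazily; Dask splits them into partitions
# that workers process in parallel.
df = dd.read_csv("s3://my-bucket/events/*.csv")  # placeholder path

# Typical cleanup: drop incomplete rows, derive a feature, downcast types.
df = df.dropna(subset=["user_id", "timestamp"])
df["hour"] = dd.to_datetime(df["timestamp"]).dt.hour
df["amount"] = df["amount"].astype("float32")

# Nothing has executed yet -- compute() triggers the parallel run.
user_means = df.groupby("user_id")["amount"].mean().compute()
```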
parallegirl 7 months ago next
Dask is really powerful. How have you partitioned data and managed Dask workers to scale up effectively?
mlkiller 7 months ago next
We partition our data by feature and rely on dynamic task scheduling. We also use a modified hill-climbing heuristic to manage workers more efficiently.
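Roughly the vanilla starting point (our worker-management tweaks sit on top of something like the `adapt()` call below; cluster sizes, paths, and column names are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# In production this would be a Kubernetes/YARN-backed cluster rather than
# LocalCluster; adapt() is Dask's built-in dynamic worker scaling.
cluster = LocalCluster(n_workers=2)
cluster.adapt(minimum=2, maximum=16)
client = Client(cluster)

df = dd.read_parquet("s3://my-bucket/features/")  # placeholder path

# One way to keep feature groups together: repartition on a feature key so
# per-feature transforms stay local to a partition and avoid shuffles.
df = df.set_index("feature_group")
df = df.map_partitions(
    lambda part: part.assign(scaled=part["value"] / part["value"].abs().max())
)

result = df.compute()
```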
cloudguru 7 months ago prev next
Implementing error handling and model retraining on the fly has been essential in our production environment. We use AWS SageMaker, but what tools do you recommend for failure detection and model retraining?
mlkiller 7 months ago next
At our shop, we developed a custom solution we fondly call 'OLIVER' (Online Learning with Incremental Ready-to-learn) for failure detection and on-the-fly model retraining.
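I can't share OLIVER itself, but the underlying pattern is the usual one: watch a rolling metric and feed new batches to an incremental learner as they arrive. A toy scikit-learn version, with a made-up threshold and window size:

```python
from collections import deque

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()           # any estimator with partial_fit works
recent_hits = deque(maxlen=500)   # rolling window of per-example correctness
ACC_THRESHOLD = 0.85              # illustrative value, not a tuned one

def handle_batch(X, y):
    """Score a mini-batch, flag degradation, then update the model online."""
    if hasattr(model, "classes_"):  # skip scoring before the first fit
        recent_hits.extend(model.predict(X) == y)
        rolling_acc = np.mean(recent_hits)
        if rolling_acc < ACC_THRESHOLD:
            print(f"rolling accuracy {rolling_acc:.2f} below threshold, updating model")
    # partial_fit updates the model in place -- no full retrain needed
    model.partial_fit(X, y, classes=np.array([0, 1]))
```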
daskdev 7 months ago prev next
Dask-based ML pipelines have certainly made strides in the community. Have you looked into integrating it with Kubeflow to expand orchestration capabilities?
mlkiller 7 months ago next
We've considered Kubeflow. Any notable experiences to share?
bigdatajoe 7 months ago prev next
One tip I can give is to keep track of which steps in the pipeline are most expensive and parallelize these parts. Using something like Apache Spark or Databricks can provide huge benefits.
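The "keep track" part doesn't need fancy tooling to start with; even crude per-stage timing will show you where the money goes (stage bodies below are stand-ins):

```python
import time
from contextlib import contextmanager

stage_timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock time spent in one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage] = stage_timings.get(stage, 0.0) + time.perf_counter() - start

# Example usage inside a pipeline run:
with timed("load"):
    rows = list(range(1_000_000))      # stand-in for real data loading
with timed("featurize"):
    feats = [r * 2 for r in rows]      # stand-in for feature engineering

print(sorted(stage_timings.items(), key=lambda kv: -kv[1]))  # most expensive first
```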
sparkmaster2022 7 months ago next
Apache Spark is fantastic for distributed data processing at scale. We use it for training in addition to data pre-processing. Caching intermediate results also helped, but is memory-intensive.
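For anyone who hasn't used it, the caching piece is just `persist()`/`cache()` on the intermediate DataFrame (paths and columns are placeholders):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-prep").getOrCreate()

raw = spark.read.parquet("s3://my-bucket/events/")  # placeholder path
features = raw.filter(raw.amount > 0).groupBy("user_id").count()

# Cache the intermediate result so both downstream jobs reuse it instead of
# recomputing the filter + aggregation. MEMORY_AND_DISK spills to disk when
# memory runs out, which is the memory trade-off mentioned above.
features.persist(StorageLevel.MEMORY_AND_DISK)

features.write.parquet("s3://my-bucket/features/train/")
print(features.count())
```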
doctordistributed 7 months ago next
Adding to your point, using Apache Kafka for a real-time streaming solution can allow Spark to continuously process new data while the model trains. This decouples model training and inference stages.
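The Spark side of that is Structured Streaming reading from Kafka (needs the spark-sql-kafka connector on the classpath; broker, topic, and paths below are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-scoring").getOrCreate()

# Continuously pull new events from Kafka; the batch training job reads the
# same data from long-term storage, so training and inference stay decoupled.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
)

parsed = events.select(col("value").cast("string").alias("payload"))

# Write micro-batches out; a scoring job or feature store can consume these.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "s3://my-bucket/streamed/")            # placeholder path
    .option("checkpointLocation", "s3://my-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```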
bigdatajoe 7 months ago next
Great tip, we adopted that approach too, and it substantially reduced our model training times.