45 points by mlengineer 7 months ago | 19 comments
user1 7 months ago
Great topic! I'm curious about strategies for scaling data storage.
user2 7 months ago
We've had success with distributed file systems like HDFS and cloud storage on GCS.
user7 7 months ago
How do you manage permissions and access controls on GCS?
user3 7 months ago
We store data in a database with a secondary indexing system.
user8 7 months ago
We use DB replication to spread read/write load across nodes and to keep backup copies.
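For readers unfamiliar with the pattern, here is a minimal sketch of the read/write split that replication commonly enables. It assumes a single-primary setup, which may not match the setup described above. The class name `ReplicatedRouter` and the host names are hypothetical, and the "anything that is not a SELECT is a write" rule is a deliberate simplification.

```python
import itertools


class ReplicatedRouter:
    """Route writes to the primary and spread reads across replicas.

    Hypothetical helper for illustration; assumes single-primary replication.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary for reads if no replicas are configured.
        self._read_cycle = itertools.cycle(replicas or [primary])

    def route(self, statement):
        # Simplification: only statements starting with SELECT count as reads.
        words = statement.lstrip().split(None, 1)
        is_read = bool(words) and words[0].upper() == "SELECT"
        # Reads rotate round-robin over replicas; everything else hits the primary.
        return next(self._read_cycle) if is_read else self.primary


# Hypothetical host names, used only to show the routing decision.
router = ReplicatedRouter("db-primary", ["db-replica-1", "db-replica-2"])
```

With asynchronous replication, replicas can lag behind the primary, so reads that must see a just-completed write are usually sent to the primary as well.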
user4 7 months ago
What are the most common challenges when scaling an ML platform?
user5 7 months ago
Managing dependencies is tough, especially with multiple ML frameworks. Also, keeping track of experiments is crucial.
user11 7 months ago
We use a combination of Git repositories and a custom system to manage dependencies and versioning.
user6 7 months ago
Data quality and feature engineering can cause issues as well.
user9 7 months ago
Containerization has been very helpful for us in scaling ML workloads.
user10 7 months ago
Container orchestration platforms like Kubernetes have been a game changer.
user12 7 months ago
Can you mention some tools to help with ML experiment tracking?
user13 7 months ago
MLflow, Weights & Biases, and TensorBoard are popular tools for this purpose.
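To make concrete what these tools keep track of, here is a stdlib-only sketch of the core idea: each run stores its parameters and metrics so runs can be compared later. This is not the API of MLflow, Weights & Biases, or TensorBoard. The functions `log_run` and `best_run` and the JSON-lines file layout are hypothetical and chosen only for illustration.

```python
import json
import time
import uuid
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment run (params + metrics) as a JSON line."""
    record = {
        "run_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["run_id"]


def best_run(log_path, metric, higher_is_better=True):
    """Return the logged run with the best value for `metric`, or None."""
    with Path(log_path).open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    # Ignore runs that never reported this metric.
    runs = [r for r in runs if metric in r["metrics"]]
    if not runs:
        return None
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])
```

A call such as `log_run("runs.jsonl", {"lr": 0.01}, {"accuracy": 0.88})` records one run, and `best_run("runs.jsonl", "accuracy")` retrieves the strongest one. The dedicated tools add what a flat file lacks, such as a UI, artifact storage, and multi-user access.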
user14 7 months ago
Thanks for the info! How big is your team, and how do you handle cross-functional communication?
user15 7 months ago
Our team is around 30 people, and we use a mix of async communication and weekly meetings.
user16 7 months ago
Scaling ML infrastructure also depends on an organization's data and model governance strategy.
user17 7 months ago
Right, things like MLOps, DataOps, and data lineage are important to consider.
user18 7 months ago
Any suggestions for cloud-agnostic solutions for ML infrastructure?
user19 7 months ago
Kubeflow is a platform that can be deployed on multiple clouds or on-premises.