45 points by mlengineer 1 year ago | 19 comments
user1 1 year ago
Great topic! I'm curious about strategies for scaling data storage.
user2 1 year ago
We've had success with distributed file systems like HDFS and cloud storage on GCS.
user7 1 year ago
How do you manage permissions and access controls on GCS?
user3 1 year ago
We store data in a database with a secondary indexing system.
user8 1 year ago
We use DB replication to distribute read/write load and for backups.
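For anyone wondering what that looks like in application code, here is a minimal sketch. It assumes a single primary that takes all writes and a set of read replicas; the class name and the host names are hypothetical, and a real setup would usually get this from a driver or proxy rather than hand-rolled routing.

```python
import itertools


class ReplicatedRouter:
    """Pick a database target: writes go to the primary, reads rotate over replicas.

    Hypothetical helper for illustration; it only chooses a target name and
    does not open any connections.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._read_cycle = itertools.cycle(replicas or [primary])

    def target_for(self, sql):
        words = sql.split()
        # Treat only plain SELECT statements as reads; everything else
        # (INSERT, UPDATE, DELETE, DDL, empty input) goes to the primary.
        if words and words[0].upper() == "SELECT":
            return next(self._read_cycle)
        return self.primary


router = ReplicatedRouter("primary-db", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM features"))
print(router.target_for("INSERT INTO features VALUES (1)"))
```

One caveat: with asynchronous replication a replica can lag behind the primary, so a read issued right after a write may need to go to the primary as well.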
user4 1 year ago
What are the most common challenges when scaling an ML platform?
user5 1 year ago
Managing dependencies is tough, especially with multiple ML frameworks. Also, keeping track of experiments is crucial.
user11 1 year ago
We use a combination of Git repositories and a custom system to manage dependencies and versioning.
user6 1 year ago
Data quality and feature engineering can cause issues as well.
user9 1 year ago
Containerization has been very helpful for us in scaling ML workloads.
user10 1 year ago
Container orchestration platforms like Kubernetes have been a game changer.
user12 1 year ago
Can you mention some tools to help with ML experiment tracking?
user13 1 year ago
MLflow, Weights & Biases, and TensorBoard are popular tools for this purpose.
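If it helps to see what these tools record at their core, here is a minimal stdlib-only sketch. It is not the API of any of the tools named above; the function names and the JSON Lines file layout are assumptions made for illustration. Each run appends its parameters and metrics to a log, and the log can later be queried for the best run.

```python
import json
import time
import uuid
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment run (params and metrics) as a JSON line; return its id."""
    record = {
        "run_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["run_id"]


def best_run(log_path, metric, higher_is_better=True):
    """Return the logged run with the best value of `metric`, or None if none has it."""
    with Path(log_path).open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    runs = [r for r in runs if metric in r["metrics"]]
    if not runs:
        return None
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])
```

The dedicated tools add things this sketch leaves out, such as artifact storage, dashboards, and multi-user access.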
user14 1 year ago
Thanks for the info! How big is your team, and how do you handle cross-functional communication?
user15 1 year ago
Our team is around 30 people, and we use a mix of async communication and weekly meetings.
user16 1 year ago
Scaling ML infrastructure also depends on an organization's data and model governance strategy.
user17 1 year ago
Right, things like MLOps, DataOps, and data lineage are important to consider.
user18 1 year ago
Any suggestions for cloud-agnostic solutions for ML infrastructure?
user19 1 year ago
Kubeflow is a platform that can be deployed on multiple clouds or on-premises.