1 point by ml_engineer 6 months ago | 17 comments
kubernetesuser 6 months ago
Just set up distributed TensorFlow training on Kubernetes with GPU scheduling! Has been such a game changer in terms of speed and resource allocation.
nvidiauser 6 months ago
@kubernetesuser nice! Can you share more details about your GPU scheduling setup? We're looking to do something similar.
kubernetesuser 6 months ago
@nvidiauser of course! We're using Kubeflow to manage the distributed training, with TensorFlow as the core ML library. For GPU scheduling, we set up a custom Kubernetes scheduler that accounts for GPU availability and resource requirements. It was a fairly involved process, but well worth it in the end.
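The TensorFlow side is mostly just picking a distribution strategy; the Kubeflow TFJob operator injects the TF_CONFIG cluster spec into each replica for you. Rough sketch of what a worker looks like (the model here is a placeholder, not our actual one):

    import json
    import os

    import tensorflow as tf

    # Kubeflow's TFJob sets TF_CONFIG on each replica; the strategy
    # reads it to discover its peer workers.
    print(json.loads(os.environ.get("TF_CONFIG", "{}")))

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        # Placeholder model -- swap in your real architecture.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    # model.fit(train_dataset, epochs=10)  # tf.data shards across workers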
kubernetesuser 6 months ago
@tensorflowuser definitely! We've seen about a 5x increase in training speed compared to running everything on a single machine without GPUs. Given that we're working with fairly large datasets, this has been a huge efficiency boost.
tensorflowuser 6 months ago
@kubernetesuser great to hear. We're trying to do something similar on AWS using SageMaker, and are hoping for similar performance gains. Did you run into any issues with compatibility between TensorFlow and Kubernetes?
tensorflowuser 6 months ago
@kubernetesuser that's helpful, thanks. We've been running into some version compatibility issues between TensorFlow and Kubernetes, but the TensorFlow team has been great about responding to issues and providing updates.
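For anyone else chasing this down: the first thing we check inside a container is whether TF can see the GPUs at all, since a CUDA/driver mismatch usually shows up right there:

    import tensorflow as tf

    # An empty list here usually means the container's CUDA/cuDNN build
    # doesn't match the node's driver -- a version issue, not a k8s one.
    print("TF version:", tf.__version__)
    print("GPUs visible:", tf.config.list_physical_devices("GPU"))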
tensorflowuser 6 months ago
@awsuser yes, we've considered using those. We've also looked at the AWS SageMaker TensorFlow containers, but we're still in the early stages of our setup so we haven't made a final decision yet.
tensorflowuser 6 months ago
@awsuser thanks for the tip. We ended up going with the AWS SageMaker TensorFlow containers in the end, since we're already pretty deep into the SageMaker ecosystem. But those TensorFlow AMIs look pretty useful for other use cases.
tensorflowuser 6 months ago
@kubernetesuser that's a great setup! Do you have any performance metrics to share? Specifically, I'm curious how much faster training is now that it's distributed and using GPUs.
googleclouduser 6 months ago
@kubernetesuser impressive! We're currently looking into using GKE for distributed TensorFlow training with GPUs. Any tips or lessons learned from your experience?
kubernetesuser 6 months ago
@googleclouduser one tip I would give is to make sure you have enough resources allocated to both your Kubernetes nodes and your GPUs. We initially ran into some performance issues due to resource contention, which went away once we increased resource limits. In terms of compatibility, we didn't run into any major issues, but your mileage may vary depending on your specific setup.
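Concretely, the fix was just being explicit about requests and limits on the training containers. Something along these lines via the Kubernetes Python client (the numbers are illustrative -- tune them for your workload):

    from kubernetes import client

    # Explicit requests/limits avoid the CPU/memory contention we hit.
    # GPUs can only be set under limits (Kubernetes implies the request).
    resources = client.V1ResourceRequirements(
        requests={"cpu": "8", "memory": "32Gi"},
        limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "2"},
    )

    container = client.V1Container(
        name="tf-worker",  # illustrative name
        image="tensorflow/tensorflow:latest-gpu",
        resources=resources,
    )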
kubeflowuser 6 months ago
To build on what @kubernetesuser was saying earlier, Kubeflow's TFJob operator works really well with TensorFlow on top of Kubernetes' GPU scheduling (via the NVIDIA device plugin). It takes care of the pesky details around wiring up replicas and resource allocation, and makes it easy to spin up clusters and start training jobs.
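A minimal TFJob looks something like this -- define it as a dict and submit it through the CustomObjectsApi (the image, namespace, and training script path are placeholders):

    from kubernetes import client, config

    # Minimal TFJob: 2 workers, 1 GPU each. The container must be
    # named "tensorflow" so the operator can find it.
    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": "dist-training-demo"},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": 2,
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": "tensorflow/tensorflow:latest-gpu",
                                "command": ["python", "/opt/train.py"],
                                "resources": {"limits": {"nvidia.com/gpu": "1"}},
                            }]
                        }
                    },
                }
            }
        },
    }

    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1",
        namespace="default", plural="tfjobs", body=tfjob,
    )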
kubeflowuser 6 months ago
@dataengineer thanks for the support! Open-source tools like Kubeflow and Kubernetes have been a game changer for the data engineering community, and we're excited to see more and more people getting involved in distributed ML and AI applications.
awsuser 6 months ago
@tensorflowuser have you looked into the AWS Deep Learning AMIs? They come pre-loaded with TensorFlow and all its dependencies, so they might save you some time and effort in your setup.
azureuser 6 months ago
@all I'm curious, has anyone tried Azure Machine Learning for distributed TensorFlow training with GPUs? If so, how was the experience?
azureuser 6 months ago
We actually ended up using Azure's Data Science Virtual Machine (DSVM) for our TensorFlow training. It comes with TensorFlow and the GPU drivers pre-installed, and we've been pretty happy with it so far. It's definitely easier to set up than building everything out on Kubernetes, but YMMV depending on your specific use case.
dataengineer 6 months ago
This is exciting stuff! A few years ago, setting up distributed TensorFlow training with GPUs required significant technical expertise and infrastructure. The fact that Kubernetes and other tools have made this process more accessible is a big deal for the broader data engineering community.