150 points by mlopsmagic 6 months ago | 12 comments
johnsmith 6 months ago
Interesting post! I've been working on a similar project and I'm curious, how did you handle feature engineering and preprocessing in your pipeline? Did you use built-in TensorFlow libraries or a third-party library like scikit-learn?
hackerx 6 months ago
Great question! We mainly used TensorFlow libraries, specifically the tf.data API, to handle feature engineering and preprocessing. It made the integration with the rest of the pipeline much smoother. But I'd love to hear how you approached it in your project!
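To give a flavor, here's a minimal tf.data sketch; the CSV path, feature names, and transforms are made up for illustration rather than taken from our actual pipeline:

    import tensorflow as tf

    def preprocess(features, label):
        # Normalize a numeric field and hash a categorical one into an id.
        features["age"] = (tf.cast(features["age"], tf.float32) - 38.0) / 12.0
        city = features.pop("city")
        features["city_id"] = tf.strings.to_hash_bucket_fast(city, 1000)
        return features, label

    ds = tf.data.experimental.make_csv_dataset(
        "data/train.csv", batch_size=256, label_name="clicked", num_epochs=1)
    ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with training

Keeping the transforms inside the tf.data graph is what made integration smooth: the same preprocessing runs in training and can be exported with the model.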
secureninja 6 months ago
This is a great write-up on production-scale ML engineering with TensorFlow and Kubernetes. I'm curious how you designed the monitoring and logging system to detect failures and errors. Did you implement custom checks or use existing tools?
autoscalr 6 months ago
We used a combination of existing tools and custom checks. Prometheus handles metrics, Grafana handles visualization, and error logs go to Stackdriver. On top of that, we implemented custom health checks via Kubernetes liveness and readiness probes.
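For the custom-metrics side, the pattern looks roughly like this (metric names and the dummy model are placeholders, not our real server):

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    PREDICTIONS = Counter("predictions_total", "Prediction requests served")
    LATENCY = Histogram("prediction_latency_seconds", "Inference latency")

    def model(request):
        # Stand-in for real inference.
        return {"score": 0.5}

    def predict(request):
        start = time.time()
        result = model(request)
        PREDICTIONS.inc()
        LATENCY.observe(time.time() - start)
        return result

    # Prometheus scrapes the pod on :8000/metrics.
    start_http_server(8000)

The liveness and readiness probes just hit a lightweight health endpoint on the serving container; if it stops responding, Kubernetes restarts the pod or takes it out of rotation.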
mlopsenthusiast 6 months ago
Fantastic work! I'm just getting started with building scalable ML pipelines and I find that understanding Kubernetes and containerization is a bit of a challenge. Any recommendations for resources or tutorials that cover these topics specifically in the context of ML pipelines?
tensorguru 6 months ago
Glad to hear this was helpful! I recommend starting with the official TensorFlow and Kubernetes documentation to get the fundamentals down. There are also great community tutorials on Medium and YouTube that cover ML pipelines with TensorFlow and Kubernetes. I'd also suggest looking into a Coursera or Udemy course that specifically covers ML on Kubernetes.
cloudmaestro 6 months ago
Looks like a lot of effort went into creating this pipeline! What kind of infrastructure did you use? Was it on-prem or cloud-based?
alinium 6 months ago
We used a cloud-based infrastructure for our pipeline. Specifically, we used Google Cloud Platform for compute and storage, and Google Kubernetes Engine for container orchestration. This let us scale up and down as needed and kept model serving highly available.
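As a hypothetical sketch of the scaling piece, here's how you'd attach a CPU-based HorizontalPodAutoscaler to a serving Deployment with the official kubernetes Python client; the names, namespace, and thresholds are illustrative, not ours:

    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() inside a pod

    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="model-server-hpa"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="model-server"),
            min_replicas=2,
            max_replicas=10,
            target_cpu_utilization_percentage=70,
        ),
    )
    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
        namespace="serving", body=hpa)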
deeplearner 6 months ago
Very cool! How did you handle the distribution of TensorFlow jobs at scale? I've heard that this can be a challenge when working with TensorFlow and Kubernetes.
k8sexpert 6 months ago
To distribute TensorFlow jobs at scale, we used Kubeflow's TFJob operator, a third-party Kubernetes controller for TensorFlow training. It simplifies creating the Kubernetes resources for a distributed job, and its Python SDK gives you a simple interface for launching jobs and managing their lifecycle.
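On the TensorFlow side, the training code barely changes: the operator sets the TF_CONFIG environment variable on each pod, and the distribution strategy picks it up. A toy sketch (the model and data here are synthetic placeholders):

    import tensorflow as tf

    # Reads TF_CONFIG (set by the TFJob operator) to discover the cluster.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    # Every worker runs the same script; gradients are synced across replicas.
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))).batch(64)
    model.fit(dataset, epochs=3)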
mlproduser 6 months ago
Incredible work automating the ML pipeline with TensorFlow and Kubernetes! I'm curious how you managed model versioning and how it fit into the pipeline's continuous integration and delivery process.
tensorflowlover 6 months ago
Thanks! We used a combination of TensorFlow Model Analysis (TFMA) and Kubeflow to manage model versioning in the continuous delivery process. TFMA let us track the performance of each model version, and Kubeflow's rollout strategy gave us easy control over which version was deployed to production.
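The gating idea, stripped down, looks like this. Note this is a plain-Keras stand-in for what TFMA does for us; the paths, metric, and promotion threshold are made up:

    import tensorflow as tf

    def auc_on_holdout(model_path, dataset):
        model = tf.keras.models.load_model(model_path)
        model.compile(loss="binary_crossentropy",
                      metrics=[tf.keras.metrics.AUC(name="auc")])
        return model.evaluate(dataset, return_dict=True)["auc"]

    # Synthetic holdout set as a placeholder for the real eval data.
    holdout = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([512, 16]),
         tf.cast(tf.random.uniform([512, 1]) > 0.5, tf.float32))).batch(64)

    if (auc_on_holdout("models/candidate", holdout)
            > auc_on_holdout("models/prod", holdout) + 0.001):
        print("promote candidate")  # e.g. trigger the Kubeflow rollout

Only candidates that beat the production version on the holdout set get rolled out, which keeps a bad training run from ever reaching serving.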