234 points by datasciencejen 6 months ago | 13 comments
ml_enthusiast 6 months ago
Fascinating article! Real-time ML data processing techniques have been one of my main interests lately. I've been working on a similar project using Dataflow and BigQuery. Have you tried combining your current pipeline with these services?
realtime_ml 6 months ago
We did try using Dataflow and BigQuery for some parts of the pipeline, but encountered some latency issues with spiky data. Any suggestions for handling real-time spiky data?
streaming_ninja 6 months ago
I find Apache NiFi helpful for absorbing spiky traffic and for streaming high-volume, real-time data flows. It's worth a look for those latency issues, especially if you're dealing with non-Python environments.
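One way to picture how a bounded queue absorbs a spike: accept events until a threshold is reached, push back on the producer after that, and drain at a steady rate. The sketch below is a minimal, self-contained Python illustration of that idea, not NiFi's implementation; `BoundedBuffer`, `simulate`, and all parameter names are hypothetical.

```python
from collections import deque


class BoundedBuffer:
    """Hypothetical stand-in for a queue with a back-pressure threshold."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._items = deque()

    def offer(self, item):
        """Accept the item if there is room; return False to signal back pressure."""
        if len(self._items) >= self.max_items:
            return False
        self._items.append(item)
        return True

    def drain(self, batch_size):
        """Remove and return up to batch_size items in arrival order."""
        batch = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch

    def __len__(self):
        return len(self._items)


def simulate(arrivals_per_tick, max_items, batch_size):
    """Feed bursts into the buffer and drain one fixed-size batch per tick.

    Returns (batch sizes processed per tick, number of events pushed back).
    In a real system the producer would retry or slow down on push-back;
    here the pushed-back events are only counted.
    """
    buf = BoundedBuffer(max_items)
    processed, pushed_back = [], 0
    next_id = 0
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):
            if not buf.offer(next_id):
                pushed_back += 1
            next_id += 1
        processed.append(len(buf.drain(batch_size)))
    # Keep draining after the input stops so nothing is left in the buffer.
    while len(buf):
        processed.append(len(buf.drain(batch_size)))
    return processed, pushed_back


if __name__ == "__main__":
    # A burst of 10 events followed by quiet ticks.
    print(simulate([10, 0, 0, 1], max_items=8, batch_size=3))
```

The downstream stage never sees more than `batch_size` events per tick, so the spike shows up as a short queueing delay rather than as a load surge on the model or the sink.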
tensorstar 6 months ago
Great research on real-time techniques! I'm curious how you plan to bring these techniques to the edge, so that the ML computations run on the IoT devices themselves?
edge_defender 6 months ago
This is definitely an area we are interested in. For edge implementation, we plan to leverage TensorFlow Lite, Core ML, and the other on-device machine learning toolkits that Apple and Google provide. That should let us adapt the streaming computations to a range of edge devices.
mlopsmaster 6 months ago
Your techniques could certainly help improve some of our MLOps efforts. We've been utilizing Apache Airflow, Kubeflow and AWS Pipeline Manager for ETL and ML pipelines. How do they compare to your proposed techniques?
realtime_ml 6 months ago
We have used AWS Pipeline Manager in the past but have seen room for improvement in terms of customizability and integration with ML-specific tools. These techniques offer better flexibility and integration options.
quant_guru 6 months ago
I'm impressed with the results and the variety of test datasets used. I'd be curious to see how these perform on high-dimensional real-time financial data, like stock prices and multi-stream data. Have you attempted such datasets and experiments?
financial_data_scientist 6 months ago
We initially planned to test with financial datasets. However, due to time and data limitations, the team was unable to include those tests. But that is a fantastic idea! Explorations of high-dimensional financial datasets will be essential for future work.
zquest 6 months ago
I have a question regarding the implementation of distributed real-time ML pipelines. How do you handle horizontal scaling of both the data and the machine resources?
realtime_ml 6 months ago
For distributed ML data processing, we focus on using Kubernetes to manage and optimize the distribution of resources. We also use some open-source tools and projects that help with scalability for specific steps and models.
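To make the horizontal-scaling part concrete, here is a rough sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. Whether this pipeline actually relies on the HPA is an assumption; the function name and the default bounds below are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the HPA scaling rule (names and defaults are illustrative).

    current_metric and target_metric are in the same unit, for example
    average CPU utilization per pod or queued events per pod.
    """
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 200 queued events each, target of 100 per pod.
    print(desired_replicas(4, 200, 100))
```

With a per-pod metric such as queue depth, a traffic spike raises the ratio and the replica count follows it, up to `max_replicas`.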
jupyter_genius 6 months ago
That sounds interesting. What are some of the open-source Kubernetes-based tools and projects you mentioned?
realtime_ml 6 months ago
Among the open-source options, we have Kubeflow, KServe, and Open Data Hub. These focus primarily on distributed ML workloads, supporting popular deep learning frameworks and model serving. They cater to a variety of roles, from researchers to DevOps professionals.