134 points by dataengineer 6 months ago | 12 comments
data_engineer42 6 months ago
Fantastic post! I've been looking for a comprehensive guide on real-time data pipelines using Apache Kafka and Flink. I like how you explained the architectural components and the use case. Great job!
system_design_nerd 6 months ago
@data_engineer42 thank you for the kind words! I'm happy to know the article was helpful to you. I enjoyed writing it and sharing my knowledge with the community. Cheers!
distributed_systems_enthusiast 6 months ago
I've been working with similar tech in my latest project. We chose Kinesis instead of Kafka for handling heavy loads, and we're quite happy with the results. I wonder how the two compare in a real-time data pipeline scenario. Does anyone have experience with this?
kafka_advocate 6 months ago
@distributed_systems_enthusiast in my experience, Kafka scales better, especially if you need to handle huge amounts of data. However, Kinesis is easier to set up and has a more user-friendly interface. In the end, it depends on your project's requirements and constraints.
jvm_freak 6 months ago
I really like the Flink examples. They motivated me to dive deeper into the project. Do you know where I can find more practical use cases and examples for Flink?
flink_insider 6 months ago
@jvm_freak there are a few resources available:
1. Flink's documentation (https://ci.apache.org/projects/flink/flink-docs-stable/)
2. Flink community examples (https://github.com/apache/flink-training/tree/master/exercises)
3. Flink in Action book (https://www.manning.com/books/flink-in-action)
big_data_noob 6 months ago
How do Chapter 7's performance scenario and benchmarks compare to Spark Streaming? I'm looking for a new project, and I'd love to contribute benchmarks on a similar setup.
knock_knock 6 months ago
@big_data_noob That's awesome! I don't have benchmarks against Spark Streaming, but I'm considering doing something similar. I'll make sure to reach out and see if we can collaborate on this. The chapter you mention covers the design of stateful stream processing using Flink's KeyedProcessFunction.
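For anyone unfamiliar with keyed state, here is a minimal plain-Python sketch of the idea: each key gets its own isolated state, and every incoming element reads and updates only the state for its key. This is not the Flink API and not the chapter's code; `KeyState` and `process_keyed_stream` are hypothetical names, and the running mean is just an assumed example computation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass
class KeyState:
    """Per-key state (hypothetical; loosely analogous to keyed state in a stream processor)."""
    count: int = 0
    total: float = 0.0


def process_keyed_stream(
    events: Iterable[Tuple[str, float]],
) -> Iterator[Tuple[str, int, float]]:
    """Yield (key, count, running_mean) after each event, using only that key's state."""
    state: Dict[str, KeyState] = defaultdict(KeyState)
    for key, value in events:
        key_state = state[key]  # state is scoped to the current key only
        key_state.count += 1
        key_state.total += value
        yield key, key_state.count, key_state.total / key_state.count


if __name__ == "__main__":
    sample = [("sensor-a", 10.0), ("sensor-b", 4.0), ("sensor-a", 20.0)]
    for output in process_keyed_stream(sample):
        print(output)
```

In a real Flink job the state would live in a managed state backend and be checkpointed for fault tolerance; the dictionary here only stands in for that.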
python_data_pipeline_developer 6 months ago
It's been a long time since I last touched Java or Scala. I'm considering an alternative, like the DataStream API in Python, for a similar project. Any feedback or resources to share?
rpc_programmer 6 months ago
@python_data_pipeline_developer Flink's DataStream API for Python is a great choice! Flink recently started supporting Python officially. You can take a look at the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/) and the examples (https://github.com/apache/flink/tree/master/flink-examples/flink-examples-streaming/src/main/python).
decentralized_by_default 6 months ago
Any good recommendations for decentralized real-time data pipelines using similar tech?
data_streams_freak 6 months ago
@decentralized_by_default Have you checked out Apache Storm and Heron? They're more decentralized compared to Kafka and Flink, especially in a distributed, peer-to-peer environment. Storm and Heron also offer similar functionality in the real-time data pipeline space, so they might be worth considering.