1 point by data_engineer 11 months ago 20 comments
johnsmith 11 months ago next
Great topic! I'd recommend using Apache Kafka as your real-time data streaming platform. It has excellent scalability, durability and fault-tolerance features.
clarkegrant 11 months ago next
I agree with you, though RocksDB might be worth considering alongside Kafka if you need faster local data storage. What do you think?
richiewong 11 months ago prev next
The Kafka ecosystem includes the KSQL engine (now ksqlDB), which lets you do real-time data processing on Kafka topics with SQL-like queries. It's worth checking out.
virginia 11 months ago prev next
Thanks for sharing, John! I'm new to streaming and would like to learn more. Any recommended resources?
johnsmith 11 months ago next
Virginia, there are many great tutorials and resources for learning Kafka, from the basics to advanced topics. Here are a few: Kafka Tutorials: <https://kafka-tutorials.confluent.io> and Kafka Best Practices: <https://www.confluent.io/resources/kafka-best-practices/>
userabc 11 months ago prev next
What about using AWS Kinesis or Google Cloud Dataflow for real-time data processing? Anyone have experience using them?
jennydoe 11 months ago next
Yes, actually, we use Google Cloud Dataflow in-house and have been pretty happy with its performance so far. I can provide more details if you'd like.
ednelson 11 months ago prev next
Here's a list of best practices I've picked up working with real-time data streams: 1. Don't lose data: when a consumer has no committed offset, start from the earliest offset. 2. Use compacted topics for reference data. 3. Use upserts for real-time stream updates. 4. Always manage consumers sensibly using Kafka's consumer groups.
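To make points 2 and 3 concrete, here is a minimal sketch of upsert semantics over a keyed log. It does not talk to Kafka; it is plain Python that mimics what a compacted topic converges to. The `compact` function, the sample keys, and the use of `None` as a tombstone are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def compact(log: List[Tuple[str, Optional[dict]]]) -> Dict[str, dict]:
    """Replay a keyed log and keep only the latest value per key."""
    table: Dict[str, dict] = {}
    for key, value in log:
        if value is None:
            # Tombstone (assumed convention): a None value deletes the key.
            table.pop(key, None)
        else:
            # Upsert: insert the key if new, overwrite it otherwise.
            table[key] = value
    return table


# Hypothetical reference-data events, oldest first.
events = [
    ("user-1", {"plan": "free"}),
    ("user-2", {"plan": "pro"}),
    ("user-1", {"plan": "pro"}),   # overwrites the earlier user-1 record
    ("user-2", None),              # removes user-2
]
state = compact(events)
print(state)  # {'user-1': {'plan': 'pro'}}
```

The point of the sketch is that a consumer replaying the log from the earliest offset ends up with the same table regardless of how many intermediate updates each key received.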
alizaharak 11 months ago next
Nice, Ed! Solid tips. I'd love to see a more complete list. Any recommended resources on upserts specifically for real-time data streams?
robertcol 11 months ago prev next
Ed, these are good practices indeed! When handling real-time data, what is the optimal amount of time to wait before triggering a new event? Are there any guidelines?
ednelson 11 months ago next
Robert, it's subjective and depends on your business needs. However, I recommend a threshold between 100 ms and 1 s as a general guideline.
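One way to picture such a threshold is a small time-based batcher. This is only a sketch under assumptions: the `TimedBatcher` class, its `max_wait_s` parameter, and the injectable `clock` are hypothetical names, and the check runs only when an event arrives, so a real system would also need a timer to flush an idle buffer.

```python
import time
from typing import Callable, List, Optional


class TimedBatcher:
    """Buffer events and flush once max_wait_s has passed since the first buffered event."""

    def __init__(self, max_wait_s: float, clock: Callable[[], float] = time.monotonic):
        self.max_wait_s = max_wait_s
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.buffer: List[str] = []
        self.first_seen: Optional[float] = None

    def add(self, event: str) -> Optional[List[str]]:
        """Add an event; return the flushed batch if the threshold elapsed, else None."""
        now = self.clock()
        if self.first_seen is None:
            self.first_seen = now
        self.buffer.append(event)
        if now - self.first_seen >= self.max_wait_s:
            batch, self.buffer, self.first_seen = self.buffer, [], None
            return batch
        return None


# Fake clock readings in seconds, so the example is deterministic.
ticks = iter([0.0, 0.2, 0.6])
batcher = TimedBatcher(max_wait_s=0.5, clock=lambda: next(ticks))
print(batcher.add("a"))  # None
print(batcher.add("b"))  # None
print(batcher.add("c"))  # ['a', 'b', 'c']
```

A shorter `max_wait_s` lowers latency at the cost of smaller, more frequent batches; a longer one does the opposite.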
jakeparker 11 months ago prev next
Beware of data skew when processing real-time data. Handling fan-out and balancing load can become tricky at high throughput and large data volumes.
johnsmith 11 months ago next
Jake, you're right! I've had success pipelining with PySpark to mitigate this problem in real-time data processing jobs.
sarahj 11 months ago prev next
You can also use data partitioning and pipeline parallelization to balance database load and avoid data skew.
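As a rough illustration of partitioning against skew, the sketch below hashes keys to partitions and then "salts" a hot key so its events spread over several partitions. Everything here is an assumption for illustration: `partition_for` uses SHA-256 as a stand-in for whatever partitioner your platform uses, and `NUM_PARTITIONS`, `SALT_BUCKETS`, and the `hot-customer` key are made-up values.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def salted_key(key: str, seq: int, salt_buckets: int) -> str:
    """Append a rotating salt so one hot key becomes up to salt_buckets sub-keys."""
    return f"{key}#{seq % salt_buckets}"


NUM_PARTITIONS = 8  # illustrative value
SALT_BUCKETS = 4    # illustrative value

hot_events = ["hot-customer"] * 1000

# Without salting, every event for the hot key lands on one partition.
unsalted = Counter(partition_for(k, NUM_PARTITIONS) for k in hot_events)

# With salting, the same events spread over at most SALT_BUCKETS partitions.
salted = Counter(
    partition_for(salted_key(k, i, SALT_BUCKETS), NUM_PARTITIONS)
    for i, k in enumerate(hot_events)
)

print(len(unsalted))    # 1
print(sorted(salted.items()))
```

The trade-off is that results for a salted key arrive as partial aggregates, so a second step has to strip the salt and merge them.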
peterkim 11 months ago prev next
To answer the original question: using Kafka with the KSQL engine for real-time data processing, and storing the stream data in MyRocks (the RocksDB-based storage engine for MySQL), is a good combination.
virginia 11 months ago next
Thank you, Peter! I've heard of MyRocks before but never used it. How does it perform compared to regular Kafka storage?
peterkim 11 months ago next
Virginia, MyRocks really shines when it comes to storage efficiency for write-heavy operations, making it an excellent choice for real-time data storage.
stancy 11 months ago prev next
I think pre-aggregating data can help improve the performance of real-time data streaming. Curious to know the community's thoughts about it.
scott 11 months ago next
Yes, definitely. Pre-aggregating with DataSketches or Druid for online analytics can make queries much faster.
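For anyone unsure what pre-aggregation looks like in practice, here is a minimal sketch that rolls raw events up into fixed time windows before they are queried. It is a plain exact rollup, not a DataSketches sketch or a Druid ingestion spec; the `pre_aggregate` function, the 60-second window, and the sample events are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pre_aggregate(
    events: Iterable[Tuple[float, str, float]], window_s: int = 60
) -> Dict[Tuple[int, str], Dict[str, float]]:
    """Roll raw (timestamp, key, value) events up into per-window count and sum."""
    rollup: Dict[Tuple[int, str], Dict[str, float]] = defaultdict(
        lambda: {"count": 0, "sum": 0.0}
    )
    for ts, key, value in events:
        # Align each timestamp to the start of its tumbling window.
        window_start = int(ts // window_s) * window_s
        bucket = rollup[(window_start, key)]
        bucket["count"] += 1
        bucket["sum"] += value
    return dict(rollup)


# Hypothetical raw events: (seconds since start, metric name, value).
raw = [
    (0.5, "checkout", 20.0),
    (30.0, "checkout", 5.0),
    (61.0, "checkout", 7.5),
    (62.0, "search", 1.0),
]
for (window_start, key), agg in sorted(pre_aggregate(raw).items()):
    print(window_start, key, agg)
```

Queries then read one row per window and key instead of scanning every raw event, which is where the speedup comes from.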
samia 11 months ago prev next
We've been pre-aggregating for about a year, and it has made a big difference for our real-time data streaming services.