Next AI News

Ask HN: Seeking Advice on Best Practices for Real-time Data Stream Processing (hackernews.com)

1 point by data_engineer 1 year ago flag hide 20 comments

johnsmith 1 year ago next
Great topic! I'd recommend using Apache Kafka as your real-time data streaming platform. It has excellent scalability, durability and fault-tolerance features.
- clarkegrant 1 year ago next
  I agree with you. For fast local state storage alongside Kafka, RocksDB (an embedded key-value store) might be a good complement rather than an alternative. What do you think?
- richiewong 1 year ago prev next
  ksqlDB (formerly KSQL), a SQL engine from Confluent built on Kafka Streams, lets you process Kafka topics in real time using SQL. It's worth checking out.
virginia 1 year ago prev next
Thanks for sharing, John! I'm new to streaming and would like to learn more. Any recommended resources?
- johnsmith 1 year ago next
  Virginia, there are many great tutorials and resources for learning Kafka, from the basics to advanced topics. Here are a few: Kafka Tutorials: <https://kafka-tutorials.confluent.io> and Kafka Best Practices: <https://www.confluent.io/resources/kafka-best-practices/>
userabc 1 year ago prev next
What about using AWS Kinesis or Google Cloud Dataflow for real-time data processing? Anyone have experience using them?
- jennydoe 1 year ago next
  Yes, actually, we use Google Cloud Dataflow in-house and have been pretty happy with its performance so far. I can provide more details if you'd like.
ednelson 1 year ago prev next
Here's a list of best practices I've picked up working with real-time data streams:
1. Don't lose data: have new consumers start from the earliest offset (auto.offset.reset=earliest).
2. Use compacted topics for reference data.
3. Use upserts for real-time stream updates.
4. Manage consumers sensibly using Kafka's consumer groups.
- alizaharak 1 year ago next
  Nice, Ed! Solid tips. I'd love to see a more complete list. Any recommended resources on upserts specifically for real-time data streams?
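For readers wondering what upserts on a compacted topic amount to, here is a minimal sketch in plain Python. It is not Kafka client code: the `apply_upserts` helper and the sample `changelog` are hypothetical, and the only assumption carried over from Kafka is that a record with a null value (a tombstone) deletes its key.

```python
from typing import Dict, Iterable, Optional, Tuple

# (key, value); a value of None plays the role of a tombstone.
Record = Tuple[str, Optional[dict]]


def apply_upserts(records: Iterable[Record]) -> Dict[str, dict]:
    """Fold a keyed changelog into its latest state, one row per key."""
    state: Dict[str, dict] = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)  # tombstone: drop the key if present
        else:
            state[key] = value    # upsert: insert or overwrite
    return state


changelog = [
    ("user-1", {"plan": "free"}),
    ("user-2", {"plan": "pro"}),
    ("user-1", {"plan": "pro"}),  # later record overwrites the earlier one
    ("user-2", None),             # tombstone removes user-2
]

print(apply_upserts(changelog))  # {'user-1': {'plan': 'pro'}}
```

The same fold is what a consumer rebuilding reference data from a compacted topic would do: replay from the earliest offset and keep only the last value per key.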
- robertcol 1 year ago prev next
  Ed, these are good practices indeed! When handling real-time data, what is the optimal amount of time to wait before triggering a new event? Are there any guidelines?
  ednelson 1 year ago next
  Robert, it's subjective and depends on your business needs. However, I recommend a threshold between 100 ms and 1 s as a general guideline.
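One way to read the 100 ms to 1 s guideline is as a micro-batch trigger: collect events and emit a batch once the oldest one has waited long enough. The sketch below is an illustration only; `micro_batches`, `max_wait_s`, and `max_size` are hypothetical names, and the flush is driven by event timestamps, so a real system would also need a timer to flush when no new event arrives.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event], max_wait_s: float = 0.5,
                  max_size: int = 100) -> Iterator[List[str]]:
    """Group events into batches bounded by wait time and batch size."""
    batch: List[str] = []
    batch_start: Optional[float] = None
    for ts, payload in events:
        # Flush if the first event in the batch has waited long enough.
        if batch and ts - batch_start >= max_wait_s:
            yield batch
            batch = []
        if not batch:
            batch_start = ts
        batch.append(payload)
        # Flush early if the batch is full.
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit whatever is left at end of input


events = [(0.0, "a"), (0.2, "b"), (0.6, "c"), (0.7, "d"), (1.5, "e")]
print(list(micro_batches(events, max_wait_s=0.5)))
# [['a', 'b'], ['c', 'd'], ['e']]
```

A shorter `max_wait_s` lowers latency at the cost of more, smaller batches; a longer one does the opposite.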
jakeparker 1 year ago prev next
Beware of data skew when processing real-time data. Fan-out handling and load balancing can become tricky at high throughput and large data volumes.
- johnsmith 1 year ago next
  Jake, you're right! I've had success pipelining with PySpark to mitigate this problem in real-time data processing jobs.
- sarahj 1 year ago prev next
  You can also partition the data and parallelize pipelines to balance load and avoid data skew.
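A common partitioning trick for skew is key salting: append a rotating suffix to a hot key so its events spread over several partitions. The thread does not say which technique the commenters used, so treat this as one possible sketch. `partition_for`, `salted_key`, and the constants are hypothetical, and the CRC32 partitioner is a stand-in, not Kafka's default partitioner.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioner (CRC32 here, chosen only for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, seq: int, num_salts: int) -> str:
    """Append a rotating salt so one hot key maps to several partition keys."""
    return f"{key}#{seq % num_salts}"


NUM_PARTITIONS = 4
NUM_SALTS = 8
events = ["hot"] * 1000  # a single hot key dominates the stream

unsalted = Counter(partition_for(k, NUM_PARTITIONS) for k in events)
salted = Counter(
    partition_for(salted_key(k, i, NUM_SALTS), NUM_PARTITIONS)
    for i, k in enumerate(events)
)

print("unsalted:", dict(unsalted))  # every event lands on one partition
print("salted:  ", dict(salted))    # events spread over several partitions
```

The trade-off is that any per-key aggregate now arrives as partial results per salted key, so a second step has to strip the salt and merge them.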
peterkim 1 year ago prev next
To answer the original question: using Kafka with ksqlDB (formerly KSQL) for real-time processing, and storing the stream data in MyRocks, the RocksDB-based storage engine for MySQL, is a good combination.
- virginia 1 year ago next
  Thank you, Peter! I've heard of MyRocks before but never used it. How does it perform compared to regular Kafka storage?
  peterkim 1 year ago next
  Virginia, MyRocks really shines when it comes to storage efficiency for write-heavy operations, making it an excellent choice for real-time data storage.
stancy 1 year ago prev next
I think pre-aggregating data can help improve the performance of real-time data streaming. Curious to know the community's thoughts about it.
- scott 1 year ago next
  Yes, definitely. Pre-aggregating with Apache DataSketches or Druid for online analytics can make queries much faster.
- samia 1 year ago prev next
  We've been pre-aggregating for about a year, and it has made a big difference for our real-time data streaming services.
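To make pre-aggregation concrete, here is a minimal tumbling-window rollup in plain Python. It does not use the DataSketches or Druid APIs mentioned above; `pre_aggregate`, `window_s`, and the sample events are hypothetical and only illustrate the idea of storing one row per window and key instead of one row per event.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str, float]  # (epoch seconds, key, value)


def pre_aggregate(events: Iterable[Event],
                  window_s: int = 60) -> Dict[Tuple[int, str], dict]:
    """Roll raw events up into per-window, per-key count and sum."""
    rollup = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for ts, key, value in events:
        window_start = ts - (ts % window_s)  # align to the window boundary
        cell = rollup[(window_start, key)]
        cell["count"] += 1
        cell["sum"] += value
    return dict(rollup)


events = [(0, "a", 1.0), (30, "a", 2.0), (61, "a", 4.0), (45, "b", 8.0)]
for (window_start, key), cell in sorted(pre_aggregate(events).items()):
    print(window_start, key, cell)
# 0 a {'count': 2, 'sum': 3.0}
# 0 b {'count': 1, 'sum': 8.0}
# 60 a {'count': 1, 'sum': 4.0}
```

Queries then read the small rollup instead of scanning raw events, which is where the speedup the commenters describe would come from.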
