26 points by bigdatainc 2 years ago | 29 comments
dataengyc18 2 years ago next
Hey HN, we're the team behind the Scala-based distributed data processing system at a major e-commerce company (YC S18). We're hiring Data Engineers to join our ranks!
fnord456 2 years ago next
Wow, sounds exciting! Can you share more about the tech stack and how it's being used at your company?
fnord456 2 years ago next
Impressive! I'm assuming you have a petabyte-scale data warehousing solution as well?
dataengyc18 2 years ago next
Yes. Our warehouse sits on Hadoop HDFS for storage, with Hive on top for SQL querying and Spark for machine learning and data processing.
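Roughly, the Spark-on-Hive side looks like this. A simplified sketch, and the table and column names here are made up:

    import org.apache.spark.sql.SparkSession

    object WarehouseQuery {
      def main(args: Array[String]): Unit = {
        // Hive support lets Spark read tables registered in the Hive metastore
        val spark = SparkSession.builder()
          .appName("warehouse-query")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical table: daily order totals straight out of the warehouse
        val orders = spark.sql(
          "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date")

        orders.show()
        spark.stop()
      }
    }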
dataengyc18 2 years ago prev next
Of course, we're using Scala for the processing engine, combined with Spark and Akka for streaming and clustering. Our system processes terabytes of data every day, and it's a key part of our e-commerce platform.
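For a flavor of the streaming side, here's a minimal Akka Streams pipeline (Akka 2.6 style; the integers are just stand-ins for real events):

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    object StreamFlavor extends App {
      implicit val system: ActorSystem = ActorSystem("pipeline")
      import system.dispatcher

      // A backpressured pipeline: parse, filter, aggregate
      Source(1 to 1000)                  // stand-in for an event feed
        .map(n => n * 2)                 // stand-in for parsing/enrichment
        .filter(_ % 3 == 0)              // drop events we don't care about
        .runWith(Sink.fold(0L)(_ + _))   // aggregate downstream
        .foreach(total => println(s"total = $total"))
    }

The nice part is that backpressure propagates end to end, so a slow sink throttles the source instead of blowing up memory.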
hadoopfan654 2 years ago prev next
I've been following the developments in the Scala ecosystem, and it's really impressed me. Good choice!
dataengyc18 2 years ago next
Thanks! Scala has been a great fit for us, and we're excited to see its continued growth in the data engineering space.
akka432 2 years ago prev next
Akka is an awesome tool for building reactive systems. I'm curious how you're using it at scale for your data processing system.
dataengyc18 2 years ago next
We use Akka along with Spark for building our reactive data processing pipeline. Akka provides us with a robust and fault-tolerant system for handling real-time streams of data, which is critical for our business.
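To make the fault-tolerance point concrete, stream-level supervision looks roughly like this (toy example; real deciders are more involved):

    import akka.actor.ActorSystem
    import akka.stream.{ActorAttributes, Supervision}
    import akka.stream.scaladsl.{Sink, Source}

    object ResilientStream extends App {
      implicit val system: ActorSystem = ActorSystem("resilient")

      // Resume on bad records instead of tearing the whole stream down
      val decider: Supervision.Decider = {
        case _: NumberFormatException => Supervision.Resume
        case _                        => Supervision.Stop
      }

      Source(List("1", "2", "oops", "4"))  // stand-in for a raw event feed
        .map(_.toInt)                      // "oops" throws; the stream resumes
        .withAttributes(ActorAttributes.supervisionStrategy(decider))
        .runWith(Sink.foreach(println))    // prints 1, 2, 4
    }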
broker567 2 years ago prev next
I've used Akka for building low-latency trading systems and it's been a game-changer. How do you deal with data consistency across the cluster?
dataengyc18 2 years ago next
We use Apache Zookeeper for managing and coordinating our data processing cluster, which helps us ensure data consistency across the cluster.
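For the curious, the coordination piece via the Apache Curator client looks roughly like this (sketch; the connect string and lock path are illustrative):

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.framework.recipes.locks.InterProcessMutex
    import org.apache.curator.retry.ExponentialBackoffRetry

    object ZkCoordination extends App {
      // Connect to the ZooKeeper ensemble with retries
      val client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",
        new ExponentialBackoffRetry(1000, 3))
      client.start()

      // Distributed lock guarding a critical section across the cluster
      val lock = new InterProcessMutex(client, "/locks/partition-42")
      lock.acquire()
      try {
        // ... work that must not run concurrently on two nodes ...
      } finally {
        lock.release()
        client.close()
      }
    }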
streamingguru900 2 years ago prev next
Streaming data processing is a hot topic these days. Tell us more about how you're handling stream processing with Spark.
dataengyc18 2 years ago next
We use Spark Streaming for handling real-time data processing, and it's integrated with our Akka and Scala stack. We're able to handle millions of events per second with sub-second latency.
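Here's a stripped-down counting job to give a feel for the shape of it (the socket source is purely for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object ClickCounter {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("click-counter")
        // Micro-batches every second; real jobs tune this interval carefully
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: one event per line over a socket
        val events = ssc.socketTextStream("localhost", 9999)

        // Count events per type within each batch
        events.map(e => (e, 1L))
          .reduceByKey(_ + _)
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }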
scalaenthusiast789 2 years ago prev next
That's very cool. I'm a big fan of Scala and functional programming. What are some of the functional programming concepts you're using in your data pipeline?
dataengyc18 2 years ago next
We use a lot of functional programming techniques and libraries in our data pipeline, such as Scalaz and Cats. They help us write more robust and composable code.
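A small example of what we mean by composable: error-accumulating validation with Cats (the record and checks are invented for illustration):

    import cats.data.ValidatedNec
    import cats.implicits._

    object RecordValidation {
      type Result[A] = ValidatedNec[String, A]

      final case class Event(userId: String, amount: Double)

      def nonEmpty(s: String): Result[String] =
        if (s.nonEmpty) s.validNec else "empty userId".invalidNec

      def nonNegative(d: Double): Result[Double] =
        if (d >= 0) d.validNec else s"negative amount: $d".invalidNec

      // Applicative composition: all errors are accumulated, not just the first
      def validate(userId: String, amount: Double): Result[Event] =
        (nonEmpty(userId), nonNegative(amount)).mapN(Event.apply)

      def main(args: Array[String]): Unit = {
        println(validate("u123", 9.99)) // Valid(Event(u123,9.99))
        println(validate("", -1.0))     // Invalid: both errors reported together
      }
    }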
functionalfan123 2 years ago prev next
I've been looking for a new challenge and this sounds really interesting, do you have any positions open for functional programmers?
dataengyc18 2 years ago next
Yes, we have several positions open for functional programmers. If you have experience with Scala, Akka, Spark, and functional programming, we'd love to talk to you.
bigdatachampion456 2 years ago prev next
This is a great achievement. What kinds of data engineering problems are you solving with a Scala-based system at an e-commerce giant?
dataengyc18 2 years ago next
We solve a variety of data engineering problems, such as data ingestion, data transformation, data enrichment, near-real-time analytics, and machine learning. We use Scala to build a scalable, high-performance distributed data processing system.
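A toy version of the ingestion step, to make that concrete (paths and columns are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date}

    object Ingest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ingest").getOrCreate()

        // Ingest raw JSON events, normalize, and land them as Parquet
        spark.read.json("hdfs:///raw/events")
          .filter(col("user_id").isNotNull)             // basic cleansing
          .withColumn("event_date", to_date(col("ts"))) // derive partition key
          .write
          .partitionBy("event_date")
          .mode("append")
          .parquet("hdfs:///clean/events")

        spark.stop()
      }
    }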
scalaninja321 2 years ago prev next
Impressive! Are you using any specific Scala frameworks or libraries? Also, what's your approach towards testing and quality assurance?
dataengyc18 2 years ago next
We use several Scala frameworks and libraries such as Akka, Play, and Finatra. Our testing strategy includes unit testing, integration testing, and end-to-end testing. We use tools like ScalaTest, Specs2 and ScalaCheck for testing. For quality assurance, we follow best practices such as code reviews, continuous integration, and automated deployment.
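To give a flavor of the ScalaCheck side, here's a property test for a toy codec, where one invariant does the work of a pile of example-based tests:

    import org.scalacheck.Prop.forAll
    import org.scalacheck.Properties

    // Property-based test: decoding an encoded record should round-trip
    object CsvCodecSpec extends Properties("CsvCodec") {
      // Hypothetical codec under test
      def encode(fields: List[String]): String = fields.mkString(",")
      def decode(line: String): List[String]   = line.split(",", -1).toList

      property("round-trip") = forAll { (fields: List[String]) =>
        // Restrict to comma-free fields, since this toy codec doesn't escape
        val safe = fields.map(_.replace(",", ""))
        safe.isEmpty || decode(encode(safe)) == safe
      }
    }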
distributeddatalover888 2 years ago prev next
How do you ensure fault-tolerance and data consistency in a distributed environment? Also, what's your approach towards data governance?
dataengyc18 2 years ago next
For fault-tolerance, we lean on Apache Spark and Apache Zookeeper: Spark's RDDs are fault-tolerant by design (lost partitions are recomputed from lineage), while Zookeeper handles coordination and configuration. For data consistency, we use transactions and pessimistic locking. We also have a strong data governance program in place that defines policies, roles, and responsibilities for data management and usage.
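One concrete piece of that: RDD checkpointing truncates long lineage chains so recovery stays cheap. A sketch (the checkpoint path is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object CheckpointDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo"))
        // Checkpointed RDDs are persisted to reliable storage, cutting lineage
        sc.setCheckpointDir("hdfs:///checkpoints/demo")

        val data    = sc.parallelize(1 to 1000000)
        val derived = data.map(_ * 2).filter(_ % 3 == 0)
        derived.checkpoint() // materialized on the next action
        println(derived.count())
        sc.stop()
      }
    }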
sparkfan777 2 years ago prev next
What's your approach towards scaling the system and how do you ensure high performance? Also, how do you handle failures and error scenarios?
dataengyc18 2 years ago next
We use Apache Spark's cluster computing capabilities and distributed data processing features to scale the system. For high performance, we optimize our Spark jobs using techniques such as partitioning, caching, and broadcasting. In terms of failures and error handling, we use Spark's resilience capabilities and a combination of log analysis, alerting, and monitoring tools.
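In sketch form, those three techniques look like this (paths, key names, and partition counts are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object TuningTricks {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("tuning").getOrCreate()
        import spark.implicits._

        val events = spark.read.parquet("hdfs:///events")
        val users  = spark.read.parquet("hdfs:///dim/users")

        val tuned = events
          .repartition(200, $"user_id") // partition by the join/aggregation key
          .cache()                      // reused twice below, so keep it warm

        // Small dimension: ship it to every executor instead of shuffling
        val joined = tuned.join(broadcast(users), "user_id")

        println(joined.count())
        println(tuned.groupBy($"user_id").count().count())
        spark.stop()
      }
    }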
machinelearningguru222 2 years ago prev next
What's your approach towards building and deploying AI/ML models in your system? Are you using any specific Scala ML libraries or frameworks?
dataengyc18 2 years ago next
We use Apache Spark MLlib and scikit-learn for building, training, and deploying ML models in our system. We also leverage Scala-based libraries such as Smile and Breeze for statistical computing and optimization. We follow best practices such as data versioning, model versioning, and experiment tracking for building robust and scalable ML pipelines.
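A bare-bones MLlib pipeline, to show the shape (columns, paths, and the versioned save location are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.sql.SparkSession

    object ChurnModel {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("churn-model").getOrCreate()
        val training = spark.read.parquet("hdfs:///features/churn")

        // Assemble raw columns into the feature vector MLlib expects
        val indexer   = new StringIndexer().setInputCol("plan").setOutputCol("plan_idx")
        val assembler = new VectorAssembler()
          .setInputCols(Array("plan_idx", "visits", "spend"))
          .setOutputCol("features")
        val lr = new LogisticRegression().setLabelCol("churned")

        // A Pipeline makes the whole preprocess-then-fit sequence one artifact
        val model = new Pipeline()
          .setStages(Array(indexer, assembler, lr))
          .fit(training)

        // Model versioning via the save path
        model.write.overwrite().save("hdfs:///models/churn/v1")
        spark.stop()
      }
    }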
dataopsleader333 2 years ago prev next
How do you manage and monitor the system? What's your approach towards DevOps, CI/CD, and automation?
dataengyc18 2 years ago next
We use a variety of tools for managing and monitoring the system, such as Kubernetes, Prometheus, and Grafana. We have a strong DevOps and CI/CD culture, and we follow best practices such as automation, testing, and version control. We also use Spinnaker for continuous deployment and a GitOps workflow for managing our infrastructure as code.