Research Lab @ Maxaur
Hadoop vs. Spark

We compared Hadoop and Spark, and here are our findings:

  • Hadoop MapReduce writes its intermediate results to disk between stages and stores its output in HDFS, whereas Spark keeps as much as it can in RAM and spills to disk only when needed. This means Spark can be much faster, by up to 100 times for workloads that fit in memory, though that speed takes a lot of memory.
  • Hadoop's MapReduce API is burdensome and easy to get wrong, whereas Spark has a natural functional API that's comparatively easy to understand. Spark is also written in Scala and supports it natively, and Scala is a far better language than Java for implementing the kinds of transformations Spark supports.
  • Spark supports any Hadoop Input/Output Format, so you can leverage existing Hadoop connectors to get to your data.
  • Hadoop is much more mature, and there are more tools written on top of it. The Spark ecosystem is evolving rapidly (check out Databricks), but there's definitely more built on Hadoop than on Spark.
  • All the major Hadoop distributions now ship or support Spark.
  • Spark is easier to configure and run than vanilla Hadoop, although the various Hadoop distributions make Hadoop itself simpler to set up.
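
To make the API point above concrete, here is a rough sketch of a word count written two ways in plain Python. It uses neither Hadoop nor Spark: the LocalDataset class and every function name below are hypothetical stand-ins we made up for illustration, loosely modeled on the explicit map/shuffle/reduce phases of MapReduce and on the chained flatMap/map/reduceByKey style of Spark. Everything runs in memory on one machine, so it says nothing about performance.

```python
from collections import defaultdict
from itertools import chain

lines = ["spark and hadoop", "hadoop on disk", "spark in memory"]


# --- MapReduce style: the three phases are written out by hand ---

def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group all values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse each key's values into a single count.
    return key, sum(values)


def mapreduce_word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# --- Functional style: one chain of transformations ---

class LocalDataset:
    """Hypothetical in-memory stand-in for a distributed dataset."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        return LocalDataset(chain.from_iterable(fn(x) for x in self.items))

    def map(self, fn):
        return LocalDataset(fn(x) for x in self.items)

    def reduce_by_key(self, fn):
        # Fold values that share a key using the supplied binary function.
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return LocalDataset(acc.items())

    def collect(self):
        return self.items


def functional_word_count(lines):
    return dict(
        LocalDataset(lines)
        .flat_map(str.split)
        .map(lambda word: (word, 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )


if __name__ == "__main__":
    print(mapreduce_word_count(lines))
    print(functional_word_count(lines))
```

Both functions return the same counts. The difference is in how much plumbing you write yourself: in the first version the grouping step is your problem, while in the second the pipeline reads as a single expression.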

Conclusion: If you require fast, in-memory computation using the latest technology that is ultimately likely to replace Hadoop, then choose Spark. If stability and a mature, proven ecosystem are more important to you, go with Hadoop. You can also use both, as they can co-exist well.

1820 Michael Faraday Drive, Suite 19, Reston VA 20190 • 703-582-7215 • Copyright © 1999-2023