Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm. Developers can write massively parallelized operators without having to worry about work distribution and fault tolerance. However, a challenge with MapReduce is the sequential, multi-step process it takes to run a job: with each step, MapReduce reads data from the cluster, performs its operations, and writes the results back to HDFS. Because each step requires a disk read and write, MapReduce jobs are slower due to the latency of disk I/O. Spark was created to address these limitations by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations.
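To make the contrast concrete, here is a minimal sketch (not from the original post) of a Spark job in Scala that caches an intermediate dataset in memory and reuses it across two parallel operations. The session settings, dataset, and object name are illustrative assumptions, not a definitive setup.

```scala
import org.apache.spark.sql.SparkSession

object InMemoryReuseSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session; on a real cluster the master URL would differ.
    val spark = SparkSession.builder()
      .appName("in-memory-reuse-sketch")
      .master("local[*]")
      .getOrCreate()

    // Small in-memory dataset standing in for a real HDFS input.
    val numbers = spark.sparkContext.parallelize(1 to 1000000)

    // cache() keeps the computed partitions in executor memory, so the two
    // actions below reuse the same in-memory data instead of re-reading and
    // re-writing to disk between steps, as a multi-step MapReduce job would.
    val squares = numbers.map(n => n.toLong * n).cache()

    val total = squares.reduce(_ + _) // first parallel operation
    val count = squares.count()       // second operation reuses the cached data

    println(s"sum of squares = $total, count = $count")

    spark.stop()
  }
}
```

The key point is the call to cache(): without it, each action would recompute (and, in a MapReduce-style pipeline, re-read from disk) the intermediate result.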