
Hadoop vs Spark

Hadoop was designed to provide reliable storage and processing for big data, that is, data that a single machine cannot store, or cannot process within the required time.

HDFS provides highly reliable file storage on a cluster of ordinary PCs. By keeping multiple replicas of each block, it copes with the failure of servers or disks.
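As a rough illustration (my own sketch, not from the original article), the replication factor of a file can be inspected and changed with the standard Hadoop FileSystem API; the path below is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()              // picks up core-site.xml / hdfs-site.xml from the classpath
val fs   = FileSystem.get(conf)
val file = new Path("/data/example.txt")    // hypothetical file

// How many copies of each block does HDFS currently keep for this file?
println(fs.getFileStatus(file).getReplication)

// Ask the NameNode to keep 3 copies of every block of this file.
fs.setReplication(file, 3.toShort)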

MapReduce is a programming model that provides the simple abstractions Mapper and Reducer. On an unreliable cluster of dozens to hundreds of ordinary PCs it can process large data sets concurrently and in a distributed way, while hiding the computational details of concurrency, distribution (such as inter-machine communication), and failure recovery.
The Mapper and Reducer abstractions also allow a wide variety of complex data processing to be decomposed into basic elements. Complex data processing can thus be broken down into a directed acyclic graph (DAG) of multiple Jobs (each containing one Mapper and one Reducer); by running each Mapper and Reducer on the Hadoop cluster, the result is obtained.

For a concrete example, see WordCount (WordCount – Hadoop Wiki), which uses MapReduce to count how often each word appears in a text file. If you are not yet familiar with MapReduce, getting some understanding of it through this example will be helpful for the discussion below.
In MapReduce, Shuffle is a very important process. It is precisely because Shuffle is invisible that developers writing data processing on MapReduce can be completely unaware of the underlying distribution and concurrency.
Broadly, Shuffle refers to the whole process between Map and Reduce.
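If you would rather not look up the Hadoop code, here is a minimal conceptual sketch in plain Scala (my own model, not the real Hadoop API) of what the Mapper, the Shuffle, and the Reducer do in WordCount:

// Conceptual model of MapReduce WordCount, not the Hadoop API:
// the Mapper emits (word, 1) pairs, the Shuffle groups values by key,
// and the Reducer sums the values for each key.
def mapper(line: String): Seq[(String, Int)] =
  line.split(" ").filter(_.nonEmpty).map(word => (word, 1)).toSeq

def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

def reducer(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

val lines   = Seq("hello world", "hello hadoop")
val mapped  = lines.flatMap(mapper)                            // Map phase
val grouped = shuffle(mapped)                                  // Shuffle phase
val result  = grouped.map { case (w, cs) => reducer(w, cs) }   // Reduce phase
result.foreach(println)                                        // counts for hello, world, hadoop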

Limitations and shortcomings of Hadoop: MapReduce has the following limitations, which make it relatively hard to use.

The level of abstraction is low; everything has to be written by hand as code, which makes it hard to get started.

There are only two operations, Map and Reduce, so expressiveness is lacking.

A Job has only the two phases Map and Reduce; complex computations require a large number of Jobs, and the dependencies between Jobs have to be managed by the developers themselves.

Processing logic is hidden in the details of the code; there is no overall view of the logic.

Intermediate results are also placed in the HDFS file system.

A ReduceTask can only start after all MapTasks have completed.

Latency is high; it is only suitable for batch data processing and falls short for interactive and real-time data processing.

Performance for iterative data processing is relatively poor.

For example, implementing a join of two tables with MapReduce is a very tricky process, as shown in the figure below (source: Real World Hadoop):

[Figure: a two-table join implemented with MapReduce (source: Real World Hadoop)]

Therefore, after Hadoop was introduced, many technologies appeared that improve on these limitations, such as Pig, Cascading, JAQL, Oozie, Tez, Spark, and the like.

Apache Spark

Apache Spark is an emerging big data processing engine. Its main feature is providing a distributed memory abstraction over the cluster to support applications that need a working set.

This abstraction is the RDD (Resilient Distributed Dataset). An RDD is an immutable, partitioned collection of records, and RDDs are Spark's programming model. Spark provides two kinds of operations on RDDs: transformations and actions. A transformation defines a new RDD and includes map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues, and so on; an action returns a result and includes collect, reduce, count, save, and lookupKey.

Spark's API is very easy to use; a Spark WordCount example is shown below:

 

 

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

 

In Spark, all RDD transformations are lazily evaluated. A transformation on an RDD produces a new RDD, whose data depends on the data of the original RDD, and each RDD may contain multiple partitions. A program therefore actually constructs a directed acyclic graph (DAG) of multiple interdependent RDDs, and by performing an action on an RDD, this DAG is submitted to Spark for execution as a Job.
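A minimal sketch of this laziness (the file path is hypothetical): declaring the transformations only builds the DAG; nothing is read or computed until an action such as count is called.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))

// Transformations only describe new RDDs; no data is read or processed here.
val lines  = sc.textFile("hdfs://namenode/path/to/input")   // hypothetical path
val errors = lines.filter(_.contains("ERROR"))
val pairs  = errors.map(line => (line.split(" ")(0), 1))

// The action triggers execution of the whole DAG built above.
println(pairs.count())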

 

For example, the WordCount program above generates the following DAG:

scala> counts.toDebugString
res0: String =
MapPartitionsRDD[7] at reduceByKey at <console>:14 (1 partitions)
  ShuffledRDD[6] at reduceByKey at <console>:14 (1 partitions)
    MapPartitionsRDD[5] at reduceByKey at <console>:14 (1 partitions)
      MappedRDD[4] at map at <console>:14 (1 partitions)
        FlatMappedRDD[3] at flatMap at <console>:14 (1 partitions)
          MappedRDD[1] at textFile at <console>:12 (1 partitions)
            HadoopRDD[0] at textFile at <console>:12 (1 partitions)

 

Spark schedules the Job based on this directed acyclic graph: it determines the stages (Stage), partitions (Partition), pipelining (Pipeline), tasks (Task), and caching (Cache), performs optimizations, and runs the Job on the Spark cluster. Dependencies between RDDs are divided into wide dependencies (depending on multiple partitions) and narrow dependencies (depending on only one partition); when determining stages, the DAG is split at wide dependencies, and tasks are generated per partition.
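As a rough illustration (my own example, with a hypothetical input path): flatMap and map are narrow dependencies and are pipelined into one stage, while reduceByKey is a wide dependency that introduces a shuffle and therefore a new stage.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("DepsDemo").setMaster("local[*]"))

val pairs = sc.textFile("hdfs://namenode/input")   // hypothetical path
  .flatMap(_.split(" "))                           // narrow dependency
  .map(word => (word, 1))                          // narrow dependency: pipelined into the same stage

val counts = pairs.reduceByKey(_ + _)              // wide dependency: shuffle, new stage

// toDebugString shows the shuffle boundary between the two stages.
println(counts.toDebugString)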

 

[Figure: Spark job scheduling, stage division by wide and narrow dependencies]

 

Spark supports fault recovery in two different ways: Lineage, which uses the lineage of the data to re-execute the preceding computation, and Checkpoint, which stores the data set in persistent storage.
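A minimal sketch of the two mechanisms (the paths are hypothetical): lineage is used by Spark automatically when a partition is lost, while checkpointing has to be requested explicitly.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RecoveryDemo").setMaster("local[*]"))

// Lineage: Spark remembers this chain of transformations, so a lost
// partition of `cleaned` can be recomputed from the input file.
val cleaned = sc.textFile("hdfs://namenode/input")   // hypothetical path
  .filter(_.nonEmpty)
  .map(_.toLowerCase)

// Checkpoint: materialize the data set to persistent storage so that
// recovery no longer has to replay the lineage from the beginning.
sc.setCheckpointDir("hdfs://namenode/checkpoints")   // hypothetical path
cleaned.checkpoint()
cleaned.count()   // an action forces the checkpoint to actually be written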

 

Spark provides better support for iterative data processing: the data for each iteration can be kept in memory instead of being written to files.
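For example, a minimal sketch of an iterative computation (my own example, hypothetical input path) that keeps the working data set in memory with cache(), so that the iterations below do not re-read the input from HDFS:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("IterativeDemo").setMaster("local[*]"))

// Cache the points so the 10 iterations read them from memory, not from HDFS.
val points = sc.textFile("hdfs://namenode/points")   // hypothetical path
  .map(_.split(",").map(_.toDouble))
  .cache()

var estimate = 0.0
for (_ <- 1 to 10) {
  // Each iteration scans the cached data set again and refines the estimate.
  estimate = 0.5 * (estimate + points.map(_.sum).mean())
}
println(estimate)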

 

Spark's performance is much improved over Hadoop. In October 2014, Spark completed a Sort Benchmark test in the Daytona Gray category, with the sort done entirely on disk; the comparison with Hadoop's earlier results is shown in the table below:

 

 

[Table: Daytona GraySort benchmark results, Hadoop vs Spark]

As can be seen from the table, to sort 100 TB of data (one trillion records), Spark used only 1/10 of the computing resources that Hadoop used and took only 1/3 of the time.

 

Spark's advantage is not only the performance improvement. The Spark framework provides a unified data processing platform for batch processing (Spark Core), interactive queries (Spark SQL), stream processing (Spark Streaming), machine learning (MLlib), and graph computation (GraphX), which is a great advantage compared with using Hadoop.

[Figure: the unified Spark stack: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX]

As Databricks puts it, the goal is One Stack To Rule Them All.

 

In particular, there are scenarios where you need to do some ETL work, then train a machine learning model, and finally run some queries. If you use Spark, you can express these three parts of the logic in a single program as one large directed acyclic graph (DAG), and Spark can optimize the large DAG as a whole.

 

For example, the following program:

 

val points = sqlContext.sql("SELECT latitude, longitude FROM historic_tweets")

val model = KMeans.train(points, 10)

sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

(Example source: http://www.slideshare.net/Hadoop_Summit/building-a-unified-data-pipeline-in-apache-spark)

 

The first line of this program uses Spark SQL to query some points; the second line uses MLlib's K-means algorithm to train a model on those points; the third line uses Spark Streaming to process the stream of messages, applying the trained model.

 

Lambda Architecture

 

The Lambda Architecture is a reference model for big data processing platforms, as shown below:

 

[Figure: the Lambda Architecture: Batch Layer, Speed Layer, Serving Layer]

It contains three layers: the Batch Layer, the Speed Layer, and the Serving Layer. Since the Batch Layer and the Speed Layer share the same data processing logic, using Hadoop as the Batch Layer and Storm as the Speed Layer means maintaining code for two different technologies.

 

Spark can serve as an integrated solution for the Lambda Architecture, roughly as follows:

 

Batch Layer: HDFS + Spark Core. Incremental real-time data is appended to HDFS, and Spark Core batch-processes the full data set to generate views over the full data.

 

Speed Layer: Spark Streaming processes the incremental real-time data to generate low-latency real-time views of the data.

 

Serving Layer: HDFS + Spark SQL (and perhaps BlinkDB). It stores the views output by the Batch Layer and the Speed Layer, merges the batch views with the real-time views, and provides low-latency ad hoc query capabilities.
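As a minimal sketch of the Speed Layer part under these assumptions (the socket source and the output path are hypothetical; the Batch and Serving Layers would read the same HDFS directories), Spark Streaming can compute a simple real-time view per micro-batch and append it to HDFS:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SpeedLayerDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))    // 5-second micro-batches

// Incremental real-time data (hypothetical socket source).
val events = ssc.socketTextStream("localhost", 9999)

// Low-latency real-time view: counts per key for each micro-batch,
// written out for the Serving Layer to merge with the batch views.
events.map(line => (line.split(",")(0), 1))
  .reduceByKey(_ + _)
  .saveAsTextFiles("hdfs://namenode/views/realtime")  // hypothetical path prefix

ssc.start()
ssc.awaitTermination()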

 

Summary

 

If MapReduce is the generally recognized low-level abstraction for distributed data processing, analogous to the AND, OR, and NAND gates of logic circuits, then Spark's RDD is a high-level abstraction for distributed big data processing, analogous to the encoders and decoders of logic circuits.

 

An RDD is a distributed data collection (Collection). Any operation on this collection can be as simple and straightforward as operating on an in-memory collection in functional programming, yet under the hood the collection operations are broken down into a series of Tasks and sent to a cluster of dozens or hundreds of servers. Apache Flink, a recently launched big data processing framework, also uses data sets (DataSet) and operations on them as its programming model.

 

The directed acyclic graph (DAG) formed by the RDDs is handed to Spark's scheduler, which generates a physical plan, optimizes it, and then executes it on the cluster. Spark also offers an execution engine similar to MapReduce that uses memory rather than disk wherever possible, achieving better execution performance.

 

So which of Hadoop's problems does Spark solve?

 

The level of abstraction is low; everything has to be written by hand as code, which makes it hard to get started.

 

=> Based on the RDD abstraction, the code for the actual data processing logic is very short.

 

There are only two operations, Map and Reduce, so expressiveness is lacking.

 

=> Spark provides many transformations and actions; many basic operations, such as join and groupBy, are already implemented as RDD transformations and actions.
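For instance (a small illustrative sketch with made-up data; `orders` and `customers` are hypothetical pair RDDs keyed by customer id):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("JoinDemo").setMaster("local[*]"))

val orders    = sc.parallelize(Seq((1, "book"), (2, "pen"), (1, "lamp")))
val customers = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))

// A join that would take a whole handwritten MapReduce job is one transformation here.
val joined  = orders.join(customers)     // RDD[(Int, (String, String))]
val grouped = orders.groupByKey()        // RDD[(Int, Iterable[String])]

joined.collect().foreach(println)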

 

A Job has only the two phases Map and Reduce; complex computations require a large number of Jobs, and the dependencies between Jobs have to be managed by the developers themselves.

 

=> A single Job can contain multiple RDD transformations, and the scheduler can generate multiple stages (Stage); moreover, if the partitioning does not change across several map operations on an RDD, they can be placed in the same Task.

 

Processing logic is hidden in the details of the code; there is no overall view of the logic.

 

=> In Scala, through anonymous functions and higher-order functions, RDD transformations support a fluent API that gives a holistic view of the processing logic. The code does not contain the implementation details of the operations, so the logic is clearer.

 

Intermediate results are also placed in the HDFS file system.

 

=> Intermediate results are kept in memory; what does not fit in memory is written to local disk, not to HDFS.
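A small illustration (my own snippet, hypothetical input path) of requesting this behavior explicitly with a storage level:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("PersistDemo").setMaster("local[*]"))

// Keep the intermediate result in memory; partitions that do not fit are
// spilled to the executor's local disk rather than to HDFS.
val intermediate = sc.textFile("hdfs://namenode/input")   // hypothetical path
  .map(line => (line.split(",")(0), line))
  .persist(StorageLevel.MEMORY_AND_DISK)

println(intermediate.count())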

 

A ReduceTask can only start after all MapTasks have completed.

 

=> Transformations on the same partition are pipelined into a single Task; transformations that change the partitioning require a Shuffle and are divided into different Stages, and a Stage can only start after the preceding Stage has completed.

 

Latency is high; it is only suitable for batch data processing and falls short for interactive and real-time data processing.

 

=> Spark provides Discretized Streams for processing streaming data, splitting the stream into small batches.

 

Performance for iterative data processing is relatively poor.

 

=> Caching data in memory improves the performance of iterative computations.

 

Therefore, it is the trend of technological development that Hadoop MapReduce will be replaced by a new generation of big data processing platforms, and among the next-generation platforms, Spark is currently the most widely recognized and supported.
