Apache Spark

Apache Spark 1.5.2 has been released. It is a maintenance release with stability fixes in several areas, mainly the DataFrame API, Spark Streaming, PySpark, R, Spark SQL and MLlib.

 

Apache Spark is an open-source cluster computing environment similar to Hadoop, but there are some differences between the two that make Spark perform better on certain workloads: Spark keeps distributed datasets in memory, and in addition to supporting interactive queries it can also optimize iterative workloads.

Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so you can operate on a distributed dataset as easily as on a local collection of Scala objects, as the short example below illustrates.
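As a rough illustration (assuming a running spark-shell, where the context sc is already defined; the numbers are made up), the same collection-style operators work on a local Scala collection and on a distributed dataset:

val localSquares = List(1, 2, 3, 4).map(n => n * n)        // plain Scala collection, runs locally
val rdd          = sc.parallelize(List(1, 2, 3, 4))        // distributed dataset (RDD)
val rddSquares   = rdd.map(n => n * n).collect()           // same operator, executed on the cluster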

Although Spark was created to support iterative jobs over distributed datasets, it is in fact complementary to Hadoop and can run on top of the Hadoop file system; this is made possible through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analysis applications.

 

Spark (http://spark-project.org) was developed in the UC Berkeley AMPLab to make data analytics fast, and it is open source. Spark is built for in-memory cluster computing, whereas Hadoop MapReduce is disk-based: a job can load data into memory and query it repeatedly, much faster than with Hadoop MapReduce. For programmers, Spark provides APIs in both Scala and Java. Spark was developed with a focus on two kinds of applications where keeping data in memory helps:
  • Iterative Algorithms, which are common in machine learning.
  • Interactive data mining.
Abstractions Provided by Spark
The main abstraction Spark provides is a Resilient Distributed Dataset (RDD).
RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
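A minimal sketch of those three points, assuming a spark-shell session (the data and partition count are invented for illustration):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs       = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))    // control partitioning / data placement
partitioned.persist(StorageLevel.MEMORY_ONLY)                  // explicitly persist the intermediate result in memory
val summed      = partitioned.reduceByKey(_ + _)               // one of the many operators available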
A second abstraction in Spark is shared variables that can be used in parallel operations. Spark supports two types of shared variables, both illustrated in the sketch after this list:
  • broadcast variables
  • accumulators
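A small sketch of both, again assuming a spark-shell session (the lookup table and the "bad records" counter are invented for illustration):

val lookup     = sc.broadcast(Map("a" -> 1, "b" -> 2))     // broadcast variable: read-only copy shipped to each node
val badRecords = sc.accumulator(0, "bad records")          // accumulator: workers only add to it, the driver reads it

val resolved = sc.parallelize(Seq("a", "b", "c")).map { k =>
  if (!lookup.value.contains(k)) badRecords += 1
  lookup.value.getOrElse(k, -1)
}.collect()

println(badRecords.value)   // 1, because "c" is missing from the lookup table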
Driver Program
At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster.
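In the spark-shell the driver is the shell itself and sc is created for you; for a packaged application the driver looks roughly like this (a sketch, with names and paths chosen only for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")   // spark-submit supplies the master URL
    val sc   = new SparkContext(conf)

    val lines = sc.textFile("file:///root/word.txt")     // the sample file used later in this post
    println("line count = " + lines.count())

    sc.stop()
  }
}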
Operations on RDDs
Spark exposes RDDs through a language-integrated API. RDDs support two types of operations:
  •  Transformations, which create a new dataset from an existing one.
  • Actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new distributed dataset representing the results. On the other hand, reduce is an action that aggregates all the elements of the dataset using some function and returns the final result to the driver program.
More examples of Transformation operations are filter(func), flatMap(func), distinct([numTasks]) and reduceByKey(func, [numTasks]).
More examples of Action operations are collect(), count(), first(), saveAsTextFile(path) and saveAsSequenceFile(path).
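Putting those together, a short spark-shell sketch (values invented for illustration) of how transformations stay lazy until an action is called:

val nums    = sc.parallelize(1 to 10)

// Transformations: lazy, each returns a new RDD and nothing runs yet.
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Actions: trigger the computation and return a result to the driver.
evens.count()          // 5
evens.first()          // 4
evens.reduce(_ + _)    // 60
evens.collect()        // Array(4, 8, 12, 16, 20)

// A pair-RDD transformation followed by a save action:
val counts = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("file:///tmp/word-counts")   // the output directory must not already exist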

CentOS 7: install and configure Spark 1.5.2 in standalone mode
[root@clusterserver1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.20 clusterserver1.rmohan.com clusterserver1
192.168.1.21 clusterserver2.rmohan.com clusterserver2

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u65-b17/jdk-8u65-linux-x64.tar.gz"
tar -zxvf jdk-8u65-linux-x64.tar.gz
mkdir /usr/java
mv jdk1.8.0_65 /usr/java/

cd /usr/java/jdk1.8.0_65/
[root@cluster1 java]# ln -s /usr/java/jdk1.8.0_65/bin/java /usr/bin/java
[root@cluster1 java]# alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_65/bin/java 2

alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_65/bin/java 2
alternatives --config java

[root@cluster1 java]# alternatives --config java

There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/java/jdk1.8.0_65/bin/java

Enter to keep the current selection[+], or type selection number: 1
[root@cluster1 java]#

alternatives --install /usr/bin/jar jar /usr/java/jdk1.8.0_65/bin/jar 2
alternatives --install /usr/bin/javac javac /usr/java/jdk1.8.0_65/bin/javac 2
alternatives --set jar /usr/java/jdk1.8.0_65/bin/jar
alternatives --set javac /usr/java/jdk1.8.0_65/bin/javac
vi /etc/profile.d/java.sh

export JAVA_HOME=/usr/java/jdk1.8.0_65
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin

wget http://www.apache.org/dyn/closer.lua/spark/spark-1.5.2/spark-1.5.2.tgz

gunzip -c spark-1.5.2.tgz | tar xvf -

wget http://mirror.nus.edu.sg/apache/spark/spark-1.5.2/spark-1.5.2-bin-hadoop1-scala2.11.tgz
gunzip -c spark-1.5.2-bin-hadoop1-scala2.11.tgz | tar xvf -

Download Scala
http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz?_ga=1.97307478.816346610.1449891008

mkdir /usr/hadoop
mv spark-1.5.2 /usr/hadoop/
mv scala-2.11.7 /usr/hadoop/
mv spark-1.5.2-bin-hadoop1-scala2.11 /usr/hadoop/

vi /etc/profile.d/scala.sh
#SCALA VARIABLES START
export SCALA_HOME=/usr/hadoop/scala-2.11.7
export PATH=$PATH:$SCALA_HOME/bin
#SCALA VARIABLES END

#SPARK VARIABLES START
export SPARK_HOME=/usr/hadoop/spark-1.5.2-bin-hadoop1-scala2.11
export PATH=$PATH:$SPARK_HOME/bin
#SPARK VARIABLES END

These Spark runtime settings normally go into $SPARK_HOME/conf/spark-env.sh (the standalone master listens on port 7077 by default):

export SPARK_MASTER_IP=localhost
export SPARK_WORKER_MEMORY=1024m
export MASTER=spark://localhost:7077

[root@clusterserver1 spark-1.5.2-bin-hadoop1-scala2.11]# scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
You have new mail in /var/spool/mail/root
[root@clusterserver1 spark-1.5.2-bin-hadoop1-scala2.11]#

[root@clusterserver1 sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/hadoop/spark-1.5.2-bin-hadoop1-scala2.11/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-clusterserver1.rmohan.com.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
root@localhost's password:
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/hadoop/spark-1.5.2-bin-hadoop1-scala2.11/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-clusterserver1.rmohan.com.out
[root@clusterserver1 sbin]#

[root@clusterserver1 bin]# spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.security.Groups).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
15/12/13 08:20:19 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/12/13 08:20:22 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:23 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/13 08:20:30 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/13 08:20:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/12/13 08:20:31 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:32 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:38 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/13 08:20:38 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/13 08:20:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
SQL context available as sqlContext.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 1+2
res0: Int = 3

scala>

Create a sample input file, /root/word.txt:
hello world
hello hadoop
pls say hello

scala> val readFile = sc.textFile("file:///root/word.txt")
readFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at textFile at <console>:24

scala> readFile.count()
15/12/13 08:36:25 WARN LoadSnappy: Snappy native library not loaded
res2: Long = 3
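
Continuing in the same shell, a classic word count over the same file might look like this (a sketch; the shell's echoed RDD ids will differ):

val counts = readFile
  .flatMap(line => line.split("\\s+"))   // split each line into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)
// with the three-line word.txt above this prints, in some order:
// (hello,3) (world,1) (hadoop,1) (pls,1) (say,1)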

