Apache Spark 1.5.2 is a maintenance release that includes stability fixes in several areas, mainly the DataFrame API, Spark Streaming, PySpark, SparkR, Spark SQL, and MLlib.
Apache Spark is an open source cluster computing framework similar to Hadoop, but with some useful differences that make it perform better on certain workloads: Spark keeps distributed datasets in memory, supports interactive queries, and optimizes iterative workloads.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so a distributed dataset can be manipulated as easily as a local collection of Scala objects.
Although Spark was created to support iterative jobs on distributed datasets, it is complementary to Hadoop and can run on top of the Hadoop file system; this is made possible through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analysis applications. It is particularly well suited for:
- Iterative Algorithms, which are common in machine learning.
- Interactive data mining.
Spark's core abstraction is the Resilient Distributed Dataset (RDD). RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
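A minimal sketch of explicit in-memory persistence, assuming a running spark-shell where the SparkContext is available as sc (the data and variable names are illustrative only):

val data = sc.parallelize(1 to 100000)
data.cache() // mark the RDD to be kept in memory (applied lazily, on the first action)
val evens = data.filter(_ % 2 == 0).count() // first action computes and caches the RDD
val total = data.reduce(_ + _) // later actions reuse the cached copy instead of recomputing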
Spark also provides two kinds of shared variables (a short sketch follows the list):
- broadcast variables
- accumulators
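A hedged sketch of both shared-variable types against the spark-shell SparkContext sc (values and variable names are illustrative; sc.accumulator is the Spark 1.x API):

val lookup = sc.broadcast(Map("hello" -> 1, "hadoop" -> 2)) // read-only value shipped once to each executor
val mapped = sc.parallelize(Seq("hello", "hadoop", "spark")).map(w => lookup.value.getOrElse(w, 0)).collect()

val matches = sc.accumulator(0) // write-only counter that tasks can only add to
sc.parallelize(1 to 10).foreach(x => if (x % 3 == 0) matches += 1)
println(matches.value) // read the total back on the driver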
RDDs support two kinds of operations (a short sketch follows the examples below):
- Transformations, which create a new dataset from an existing one.
- Actions, which return a value to the driver program after running a computation on the dataset.
Examples of actions include saveAsTextFile(path) and saveAsSequenceFile(path).
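A short sketch of the lazy/eager split, again assuming the spark-shell SparkContext sc (the output path below is a placeholder):

val nums = sc.parallelize(1 to 5)
val squares = nums.map(x => x * x) // transformation: builds a new RDD lazily, nothing runs yet
val result = squares.collect() // action: triggers the computation and returns results to the driver
squares.saveAsTextFile("/tmp/squares-out") // action: writes the dataset out (placeholder path)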
CentOS 7: install and configure Spark 1.5.2 in Standalone mode
[root@clusterserver1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.20 clusterserver1.rmohan.com clusterserver1
192.168.1.21 clusterserver2.rmohan.com clusterserver2
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u65-b17/jdk-8u65-linux-x64.tar.gz"
tar -zxvf jdk-8u65-linux-x64.tar.gz
mkdir /usr/java
mv jdk1.8.0_65 /usr/java/
cd /usr/java/jdk1.8.0_65/
[root@cluster1 java]# ln -s /usr/java/jdk1.8.0_65/bin/java /usr/bin/java
[root@cluster1 java]# alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_65/bin/java 2
[root@cluster1 java]# alternatives --config java
There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/java/jdk1.8.0_65/bin/java

Enter to keep the current selection[+], or type selection number: 1
[root@cluster1 java]#
alternatives --install /usr/bin/jar jar /usr/java/jdk1.8.0_65/bin/jar 2
alternatives --install /usr/bin/javac javac /usr/java/jdk1.8.0_65/bin/javac 2
alternatives --set jar /usr/java/jdk1.8.0_65/bin/jar
alternatives --set javac /usr/java/jdk1.8.0_65/bin/javac
vi /etc/profile.d/java.sh
export JAVA_HOME=/usr/java/jdk1.8.0_65
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.5.2/spark-1.5.2.tgz
gunzip -c spark-1.5.2.tgz | tar xvf -
wget http://mirror.nus.edu.sg/apache/spark/spark-1.5.2/spark-1.5.2-bin-hadoop1-scala2.11.tgz
gunzip -c spark-1.5.2-bin-hadoop1-scala2.11.tgz | tar xvf -
Download Scala
http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz?_ga=1.97307478.816346610.1449891008
mkdir /usr/hadoop
mv spark-1.5.2 /usr/hadoop/
mv scala-2.11.7 /usr/hadoop/
mv spark-1.5.2-bin-hadoop1-scala2.11 /usr/hadoop/
vi /etc/profile.d/scala.sh
#SCALA VARIABLES START
export SCALA_HOME=/usr/hadoop/scala-2.11.7
export PATH=$PATH:$SCALA_HOME/bin
#SCALA VARIABLES END
#SPARK VARIABLES START
export SPARK_HOME=/usr/hadoop/spark-1.5.2-bin-hadoop1-scala2.11
export PATH=$PATH:$SPARK_HOME/bin
#SPARK VARIABLES END
export SPARK_MASTER_IP=localhost
export SPARK_WORKER_MEMORY=1024m
export MASTER=spark://localhost:7077
[root@clusterserver1 spark-1.5.2-bin-hadoop1-scala2.11]# scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
[root@clusterserver1 spark-1.5.2-bin-hadoop1-scala2.11]#
[root@clusterserver1 sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/hadoop/spark-1.5.2-bin-hadoop1-scala2.11/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-clusterserver1.rmohan.com.out
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
root@localhost's password:
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/hadoop/spark-1.5.2-bin-hadoop1-scala2.11/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-clusterserver1.rmohan.com.out
[root@clusterserver1 sbin]#
[root@clusterserver1 bin]# spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.security.Groups).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
15/12/13 08:20:19 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/12/13 08:20:22 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:23 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/13 08:20:30 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/13 08:20:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/13 08:20:31 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:32 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/13 08:20:38 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/13 08:20:38 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/13 08:20:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SQL context available as sqlContext.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
scala> 1+2
res0: Int = 3
scala>
Create a test file /root/word.txt with the following contents:
hello world
hello hadoop
pls say hello
scala> val readFile = sc.textFile("file:///root/word.txt")
readFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at textFile at <console>:24
scala> readFile.count()
15/12/13 08:36:25 WARN LoadSnappy: Snappy native library not loaded
res2: Long = 3
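Building on the same file, a hedged sketch of a simple word count with standard RDD operations (variable names are illustrative):

val words = readFile.flatMap(line => line.split(" ")) // transformation: split each line into words
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // transformation: count occurrences per word
counts.collect().foreach(println) // action: bring the (word, count) pairs back and print them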