Installing Spark 2.x on Cloudera Distribution 5.xx

Apache Spark is one of the most admired open-source projects in the Apache Software Foundation. Thanks to its feature set, it is now considered one of the key technologies in big data analytics projects. Spark has evolved considerably since its inception, adding support for the R language, a broad set of machine learning algorithms, and real-time processing with sub-second latency.

The Cloudera QuickStart VM (5.xx) is a single-node cluster that ships with Spark 1.x as the default version. Since Spark 2.x adds significant functionality, the default Spark 1.x needs to be overridden with Spark 2.x. One way to achieve this is via Cloudera Manager, but running Cloudera Manager on the QuickStart VM is time-consuming and requires a lot of resources.

Instead, we install Spark 2.x as a separate package on the Cloudera QuickStart VM, so that the VM's performance does not degrade after installation.
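To confirm what the VM ships with before changing anything, you can print the default Spark version; this assumes the stock CDH Spark client is on the PATH, as it normally is on the QuickStart VM:

$ spark-submit --version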

Below are the steps to achieve this:

Step 1: Remove JDK 1.7, since Spark 2.2.0 requires JDK 1.8
$ sudo rm -rf /usr/java
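If you want to see which JDK the VM currently has before deleting it (purely informational; the exact package names on the QuickStart VM may differ):

$ java -version
$ rpm -qa | grep -i jdk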
Step 2: Download and install the JDK 1.8 RPM

https://www.dropbox.com/s/as5sjhy09kwsznv/jdk-8u131-linux-x64.rpm?dl=0

$ sudo rpm -ivh jdk-8u131-linux-x64.rpm
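To confirm the RPM landed where Step 4 expects it (the path below is the RPM's default install location, matching the JAVA_HOME used later):

$ ls /usr/java/
$ /usr/java/jdk1.8.0_131/bin/java -version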
Step 3: Install maven 3.3.9
$ cd /usr/local
$ sudo wget https://apache.mirrors.tds.net/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
$ sudo tar -xvf apache-maven-3.3.9-bin.tar.gz
$ sudo ln -s apache-maven-3.3.9  maven
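A quick check that the archive unpacked and the symlink points at the right directory (optional; nothing later depends on it):

$ ls -l /usr/local/maven
$ ls /usr/local/maven/bin/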

Step 4: Open /etc/profile and add the following environment variables –

$ sudo gedit /etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_131
export M2_HOME=/usr/local/maven
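If you also want java and mvn available without typing full paths, the added block in /etc/profile can look like the sketch below; the PATH line is an extra convenience, not something the original steps require:

export JAVA_HOME=/usr/java/jdk1.8.0_131
export M2_HOME=/usr/local/maven
export PATH=$PATH:$JAVA_HOME/bin:$M2_HOME/bin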
Step 5: Reboot the system
$ sudo reboot

After the reboot, check the Java and Maven versions:

$ java -version
$ mvn -version
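To confirm /etc/profile was picked up after the reboot, you can also print the two variables:

$ echo $JAVA_HOME
$ echo $M2_HOME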
Step 6: Download Spark 2.2.0

https://drive.google.com/file/d/16gf_0xyHq8YYuWYYMeHIaf-XrORfbXfF/view

$ sudo mkdir -p /opt/spark2
$ sudo tar -xvf spark-2.2.0.tgz -C /opt/spark2/
$ sudo chmod -R 777 /opt/spark2

Step 7: Run the Spark shell

$ cd /opt/spark2/spark-2.2.0
$ ./bin/spark-shell
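For a quick smoke test that does not require the interactive shell, the Spark distribution also ships a run-example script; SparkPi is one of the bundled examples and the trailing 10 is just the number of partitions:

$ ./bin/run-example SparkPi 10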

Install Jupyter Notebook

Download the Python 2.7 version of Anaconda

https://www.anaconda.com/distribution/

Note: The JDK is required before installation, so first check the Java version with java -version.
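The Anaconda download is a self-extracting shell script; a minimal sketch of running it (the exact file name depends on the Anaconda2 build you downloaded, so treat the pattern below as a placeholder):

$ cd ~/Downloads
$ bash Anaconda2-*-Linux-x86_64.sh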

Extract Spark 2.2
$ cd ~/Downloads/

Extract spark-2.2.0.tar.gz

$ sudo tar -xvf spark-2.2.0.tar.gz -C /opt
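To confirm the archive extracted where the next step expects it:

$ ls /opt/spark-2.2.0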
Verify Installation

Run:

$ /opt/spark-2.2.0/bin/pyspark
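If you prefer a non-interactive check, the same bin directory also contains spark-submit, which prints the version banner and exits:

$ /opt/spark-2.2.0/bin/spark-submit --version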
Link Spark with Jupyter
$ cd ~/anaconda2/bin/
$ source activate
$ sudo gedit ~/.bashrc
export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip 127.0.0.1 --port 3333 --no-mathjax'
$ source ~/.bashrc
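For reference, a sketch of what the added ~/.bashrc block can look like once the Spark and Anaconda paths are included; the SPARK_HOME and PATH lines are assumptions added for convenience, and only the two PYSPARK_* exports are strictly needed for this setup:

export SPARK_HOME=/opt/spark-2.2.0
export PATH=$PATH:$SPARK_HOME/bin:$HOME/anaconda2/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip 127.0.0.1 --port 3333 --no-mathjax'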
Run Jupyter Notebook
$ /opt/spark-2.2.0/bin/pyspark
Output:

[I 22:56:14.197 NotebookApp] JupyterLab beta preview extension loaded from /home/datacouch/anaconda3/lib/python3.6/site-packages/jupyterlab
[I 22:56:14.197 NotebookApp] JupyterLab application directory is /home/datacouch/anaconda3/share/jupyter/lab
[I 22:56:14.202 NotebookApp] Serving notebooks from local directory: /home/datacouch/notebook
[I 22:56:14.202 NotebookApp] 0 active kernels
[I 22:56:14.202 NotebookApp] The Jupyter Notebook is running at:
[I 22:56:14.202 NotebookApp] https://127.0.0.1:3333/?token=35fd37e10f310e84fca36bcf0f0400c167e24c77e62862b2
[I 22:56:14.202 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 22:56:14.203 NotebookApp]

Copy/paste this URL into your browser when you connect for the first time, to login with a token:

   https://127.0.0.1:3333/?token=35fd37e10f310e84fca36bcf0f0400c167e24c77e62862b2
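If the browser does not launch automatically, you can confirm from another terminal that the notebook server is listening (jupyter notebook list is part of the standard Jupyter CLI; this assumes the Anaconda environment activated earlier is still on the PATH):

$ jupyter notebook list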
