
Installing Spark 2.x on Cloudera Distribution 5.xx
Apache Spark is one of the most popular open-source projects at the Apache Software Foundation. Thanks to its feature set, it is now considered a key technology in big-data analytics. Spark has evolved considerably since its inception, adding features such as support for the R language, a broad set of machine-learning algorithms, and real-time processing with sub-second latency.

The Cloudera QuickStart VM (5.x) is a single-node cluster that ships with Spark 1.x as its default version. Since Spark 2.x offers additional features, it is desirable to override the default Spark 1.x with Spark 2.x. One way to achieve this is via Cloudera Manager, but running Cloudera Manager on the QuickStart VM is time-consuming and requires a lot of resources.
Instead, we install Spark 2.x as a separate package on the Cloudera QuickStart VM, so that VM performance does not degrade after installation.
Follow the steps below to achieve this:
Step 1: Remove the JDK 1.7 installation, since Spark 2.2.0 requires JDK 1.8
$ sudo rm -rf /usr/java
Step 2: Download and install the JDK 1.8 RPM
https://www.dropbox.com/s/as5sjhy09kwsznv/jdk-8u131-linux-x64.rpm?dl=0
$ sudo rpm -ivh <jdk>
Step 3: Install Maven 3.3.9
$ cd /usr/local
$ sudo wget https://apache.mirrors.tds.net/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
$ sudo tar -xvf apache-maven-3.3.9-bin.tar.gz
$ sudo ln -s apache-maven-3.3.9 maven
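The `ln -s` step gives the versioned Maven directory a stable name, so a later upgrade only requires repointing one symlink rather than editing every path that references it. A minimal, self-contained sketch of that pattern, using a throwaway directory under /tmp (all names here are illustrative, not your real install):

```shell
# Illustrative only: mimic the versioned-dir + stable-symlink layout in /tmp
demo=/tmp/maven-link-demo
rm -rf "$demo" && mkdir -p "$demo/apache-maven-3.3.9"
cd "$demo"
# -sfn creates (or atomically replaces) the symlink, like `ln -s` in the step above
ln -sfn apache-maven-3.3.9 maven
readlink maven
```

When Maven 3.5.x comes out, only the `ln -sfn` target changes; M2_HOME can stay pointed at the `maven` symlink.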
Step 4: Edit /etc/profile (as root) and add the following environment variables
$ sudo gedit /etc/profile

export JAVA_HOME=/usr/java/jdk1.8.0_131
export M2_HOME=/usr/local/maven
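The profile above defines JAVA_HOME and M2_HOME, but the new `java` and `mvn` binaries are only found if their bin directories are also on PATH. A hedged addition (assuming the same /etc/profile and the two variables just defined):

```shell
# Assumed addition to /etc/profile so the JDK 1.8 and Maven binaries
# shadow any older versions earlier on the PATH
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$PATH
```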
Step 5: Reboot the system
$ sudo reboot
Check the Java and Maven versions:
$ java -version
$ mvn -version
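If you script these steps, a small guard can fail fast when the wrong JDK is still on the PATH. The sketch below parses a sample version line; on the VM you would feed it the real output of `java -version 2>&1 | head -n 1` instead of the hard-coded string:

```shell
# Sample line; on the VM, obtain it with: java -version 2>&1 | head -n 1
ver_line='java version "1.8.0_131"'
case "$ver_line" in
  *\"1.8.*) echo "JDK 1.8 detected" ;;
  *)        echo "unexpected JDK: $ver_line" ;;
esac
```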
Step 6: Download Spark 2.2.0
https://drive.google.com/file/d/16gf_0xyHq8YYuWYYMeHIaf-XrORfbXfF/view
$ sudo tar -xvf <spark-2.2.0.tgz> -C /opt/
$ sudo chmod -R 777 /opt/spark2
$ cd /opt/spark2/spark-2.2.0
Step 7: Run the Spark shell
$ ./bin/spark-shell
Install Jupyter Notebook
Download the Python 2.7 (Anaconda2) distribution

Note: The installation below requires a JDK, so check `java -version` before proceeding.
Extract spark-2.2.0.tar.gz
$ cd ~/Download/
$ sudo tar -xvf spark-2.2.0.tar.gz -C /opt
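The `-C /opt` flag tells tar to change into the target directory before extracting, so the archive's top-level spark-2.2.0/ folder lands directly under /opt. A self-contained illustration using a throwaway archive (all /tmp paths are hypothetical, built only for the demo):

```shell
# Build a tiny archive that mimics the spark-2.2.0/ top-level layout
src=/tmp/tar-demo-src; dst=/tmp/tar-demo-opt
rm -rf "$src" "$dst" /tmp/tar-demo.tar && mkdir -p "$src/spark-2.2.0/bin" "$dst"
echo '#!/bin/sh' > "$src/spark-2.2.0/bin/pyspark"
tar -cf /tmp/tar-demo.tar -C "$src" spark-2.2.0
# -C extracts relative to $dst, just as `-C /opt` places spark-2.2.0 under /opt
tar -xf /tmp/tar-demo.tar -C "$dst"
ls "$dst/spark-2.2.0/bin"
```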
Verify the installation by running:
$ /opt/spark-2.2.0/bin/pyspark
Link Spark with Jupyter
~$ cd anaconda2/bin/
~$ source activate
~$ sudo gedit ~/.bashrc
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip 127.0.0.1 --port 3333 --no-mathjax'
$ source ~/.bashrc
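`source` (or its short form `.`) runs the file in the current shell, so the exports become visible immediately without opening a new terminal. A minimal, self-contained illustration using a throwaway file rather than your real ~/.bashrc:

```shell
# Write a demo rc file containing the same kind of export used above
cat > /tmp/demo_rc <<'EOF'
export PYSPARK_DRIVER_PYTHON=jupyter
EOF
# Sourcing affects the *current* shell; running the file as a script would not
. /tmp/demo_rc
echo "$PYSPARK_DRIVER_PYTHON"
```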
Run Jupyter Notebook
$ /opt/spark-2.2.0/bin/pyspark
Output:
[I 22:56:14.197 NotebookApp] JupyterLab beta preview extension loaded from /home/datacouch/anaconda3/lib/python3.6/site-packages/jupyterlab
[I 22:56:14.197 NotebookApp] JupyterLab application directory is /home/datacouch/anaconda3/share/jupyter/lab
[I 22:56:14.202 NotebookApp] Serving notebooks from local directory: /home/datacouch/notebook
[I 22:56:14.202 NotebookApp] 0 active kernels
[I 22:56:14.202 NotebookApp] The Jupyter Notebook is running at:
[I 22:56:14.202 NotebookApp] https://127.0.0.1:3333/?token=35fd37e10f310e84fca36bcf0f0400c167e24c77e62862b2
[I 22:56:14.202 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 22:56:14.203 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
https://127.0.0.1:3333/?token=35fd37e10f310e84fca36bcf0f0400c167e24c77e62862b2