Big Data Processing using Google Dataproc

Introduction

At present, about 2.5 quintillion bytes (2,500 petabytes) of data is produced by humans every day (Source: Social Media Today). Processing this quantity of data is a headache, and Big Data processing is the painkiller. There are various tools and technologies available for it, such as Hadoop, Spark, Hive, and many more, but installing, configuring, and managing them all on-premises is a strenuous task for Data Engineers and Architects. Google Dataproc makes this whole process much easier by letting you work with Hadoop and Spark frameworks on the cloud.

What will you Learn?

  1. What is Google Dataproc?
  2. Why use Dataproc?
  3. Getting started with Dataproc
  4. Running Spark, Hive Commands, JupyterLab, and HBase

What is Google Dataproc?

Dataproc is a managed Hadoop and Spark service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning. It handles cluster creation and management for you, so you can spend your time on the jobs and queries themselves instead of worrying about the cost of administration.

Why use Dataproc?

Dataproc has a number of advantages, such as:

  1. Low Cost: Cloud Dataproc charges 1 cent per vCPU in your cluster per hour, and preemptible instances are available at an even lower price. It charges you only for what you actually use, with minute-by-minute billing and a low, ten-minute-minimum billing period.
  2. Super Fast: With an IaaS provider or on-premises hardware, creating a Hadoop or Spark cluster can take a long time. Dataproc creates clusters in 90 seconds or less on average.
  3. Simple and Familiar: There is no need to learn new tools, technologies, or APIs to use Cloud Dataproc, which makes it easy to move existing projects into Cloud Dataproc without redevelopment. Spark, Hadoop, Pig, and Hive are frequently updated, so you can be productive faster.
  4. Integrated: Cloud Dataproc can be easily integrated with various other Google Cloud services such as BigQuery, Google Cloud Storage, Google Cloud Bigtable, Google Cloud Logging, and Google Cloud Monitoring, so you have more than just a Spark or Hadoop cluster: you have a complete data platform.

Getting started with Dataproc

Google offers free trial credits to get started with Google Cloud. Create a trial account and log in to the Google Cloud Platform console with your account. Then open the Navigation menu and select Dataproc > Clusters.

Then click on the Create Cluster button.

After that, give the cluster a name; for this tutorial we name it sample.

Under the Location tab, select the desired region, ideally matching the region where your data is stored (regional or multi-regional). By default, us-central1 is selected; for this tutorial we choose asia-southeast1.

Under the Cluster tab, select the Standard (1 master, N workers) option to create a multi-node cluster. We will configure the master and worker nodes shortly.

For this tutorial we use the image with Hadoop 3.1 and Spark 3, the latest versions of both technologies at the time of writing.

Now scroll down a bit and select the Enable Component Gateway option. Also select the Jupyter Notebook, Zookeeper, and HBase components.

Then go to the Configure Nodes tab. For the master node, change the machine type to n1-standard-2 (2 vCPU, 7.5 GB memory) and set the primary disk size to 50 GB.

Under Worker Nodes, likewise set the machine type to n1-standard-2 (2 vCPU, 7.5 GB memory) and the primary disk size to 50 GB.

Note: The Number of Nodes property lets you configure how many worker nodes the cluster should have.

You can adjust the security settings if you wish. Finally, click the Create button.

After some time your cluster will be ready.
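If you prefer to script the setup instead of clicking through the console, the google-cloud-dataproc Python client can create a similar cluster. The snippet below is only a minimal sketch: the project ID is a placeholder, the region and machine sizes mirror the choices above, and extras such as the image version and the optional components (Jupyter, ZooKeeper, HBase) would still need to be added to the config.

	# Minimal sketch: create a Dataproc cluster with the Python client
	# (pip install google-cloud-dataproc). "my-project-id" is a placeholder.
	from google.cloud import dataproc_v1 as dataproc

	project_id = "my-project-id"
	region = "asia-southeast1"

	cluster_client = dataproc.ClusterControllerClient(
	    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
	)

	cluster = {
	    "project_id": project_id,
	    "cluster_name": "sample",
	    "config": {
	        "master_config": {
	            "num_instances": 1,
	            "machine_type_uri": "n1-standard-2",
	            "disk_config": {"boot_disk_size_gb": 50},
	        },
	        "worker_config": {
	            "num_instances": 2,
	            "machine_type_uri": "n1-standard-2",
	            "disk_config": {"boot_disk_size_gb": 50},
	        },
	    },
	}

	# create_cluster returns a long-running operation; result() blocks until done
	operation = cluster_client.create_cluster(
	    request={"project_id": project_id, "region": region, "cluster": cluster}
	)
	print("Cluster created:", operation.result().cluster_name)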

Running Spark, Hive Commands, JupyterLab and HBase

Spark Jobs:

To run Spark jobs, first go to the VM Instances tab inside the Dataproc cluster and click SSH next to the master node to open a terminal.

Now type spark-shell to start the Spark shell, which uses the Scala language.

Now let's count the lines of a sample PySpark script (hello-world.py) that is already available in a public Cloud Storage bucket. Run the following command:

				
					scala> sc.textFile("gs://dataproc-examples" + "/pyspark/hello-world/hello-world.py").count
				
			

You can ignore the warnings.

Your output should be 7, the number of lines in the file.

Quit the Spark shell by typing :q

Similarly, you can write your own Spark code and run it on the cloud.
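For example, here is a minimal sketch of a standalone PySpark word count; the file name wordcount.py and the reuse of the public sample file above are just illustrative choices.

	# wordcount.py - minimal PySpark word count (illustrative sketch)
	from pyspark.sql import SparkSession

	spark = SparkSession.builder.appName("wordcount").getOrCreate()

	# Read the same public sample file used above
	lines = spark.sparkContext.textFile(
	    "gs://dataproc-examples/pyspark/hello-world/hello-world.py"
	)

	# Split lines into words, then count occurrences of each word
	counts = (
	    lines.flatMap(lambda line: line.split())
	         .map(lambda word: (word, 1))
	         .reduceByKey(lambda a, b: a + b)
	)

	for word, count in counts.collect():
	    print(word, count)

	spark.stop()

You could also submit such a script without opening an SSH session, for example with gcloud dataproc jobs submit pyspark wordcount.py --cluster=sample --region=asia-southeast1.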

Hive Job:

Now let's run the Hive job. In Dataproc, Hive uses Tez as its execution engine by default; you can change it to MapReduce if you want to.

Start the Hive shell by typing hive in the same SSH terminal.

Now create a table named employees with id (integer), name (string), and salary (integer) as its schema.

				
					hive> create table employees (id integer, name string, salary integer);
				
			


Now let’s populate it with some data:

				
	hive> insert into employees values (1,'sam',3200), (2,'haylie',3000), (3,'samir',4000), (4,'john',5000);
				
			

Let’s check our table:

				
					hive> select * from employees;
				
			
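You do not have to SSH into the cluster to run Hive queries; the same statements can also be submitted as a Dataproc job. The snippet below is a minimal sketch using the google-cloud-dataproc Python client, with the project ID as a placeholder and the cluster name and region taken from earlier in this tutorial.

	# Minimal sketch: submit the SELECT above as a Dataproc Hive job
	# instead of running it in an SSH session. "my-project-id" is a placeholder.
	from google.cloud import dataproc_v1 as dataproc

	project_id = "my-project-id"
	region = "asia-southeast1"

	job_client = dataproc.JobControllerClient(
	    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
	)

	job = {
	    "placement": {"cluster_name": "sample"},
	    "hive_job": {"query_list": {"queries": ["SELECT * FROM employees"]}},
	}

	# Submit as a long-running operation and wait for the job to finish
	operation = job_client.submit_job_as_operation(
	    request={"project_id": project_id, "region": region, "job": job}
	)
	finished_job = operation.result()
	print("Job finished with state:", finished_job.status.state.name)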

JupyterLab UI:

To open the JupyterLab UI, go to the Web Interfaces tab inside the cluster page and click the JupyterLab link.

You will be redirected to the JupyterLab UI; if not, a link will be provided to take you there.

You have successfully opened the JupyterLab UI.

From here you can work directly with PySpark, Python 3, R, or the Spylon kernel.

Click the GCS option on the left side of the page so that you work against Google Cloud Storage. Without doing so, you may face an Invalid Response: 403 error.

Now select the PySpark option under the Notebook section. Here we will read the table from Hive and apply some transformations.

Let's check the Spark version by typing spark in a cell and running it with Shift + Enter.

Now list the tables present in Hive by running spark.catalog.listTables()

Read the employees table into a DataFrame and show it, then filter it by name:

	raw_df = spark.table("employees")
	raw_df.show()

	filtered_df = raw_df.filter(raw_df.name == "sam")
	filtered_df.show()
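You can chain further transformations on the same DataFrame. As a small illustrative example, here is an aggregation over the DataFrame read above:

	from pyspark.sql import functions as F

	# Average salary across all employees
	raw_df.agg(F.avg("salary").alias("avg_salary")).show()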

HBase:

Now let's work on HBase. You can start the HBase shell by typing the hbase shell command in the SSH terminal.

Let's create a table named facebook with profile and pics as its column families.

				
	hbase> create 'facebook', 'profile', 'pics'
				
			

Now let's populate the facebook table with some data. We insert a value into the name column of the profile column family, using 1 as the row key.

				
	hbase> put 'facebook', '1', 'profile:name', 'Bhavuk'
				
			
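You can read the row back with get 'facebook', '1' or view the whole table with scan 'facebook'.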

Conclusion

Google Dataproc is a powerful managed cluster option for running Hadoop and Spark applications, and it makes it easy to migrate Big Data processing from on-premises infrastructure to Google Cloud. It provides a Big Data framework that includes Hadoop, Spark, Hive, Pig, and more out of the box. Besides being easy to use, fast, and powerful, Dataproc integrates with various other Google Cloud components such as Airflow (Cloud Composer), Google Cloud Storage, BigQuery, and Bigtable, so you can build a highly available Big Data pipeline for managing streaming data effectively and efficiently.
