Our Latest Blog
Introduction to Apache Sqoop
Sqoop is a tool that transfers bulk data from a relational database to Hadoop and vice versa. For better performance and optimal system utilization it performs parallel data transfer and load balancing among the nodes. It can read and write data from and to Oracle, Teradata, Netezza, MySQL, Postgres, and HSQLDB. While importing data into HDFS it can save the data in different formats, e.g. ORC, Avro, or Parquet.
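As a minimal sketch of such an import, the command below pulls a hypothetical MySQL table into HDFS as Parquet using parallel mappers; the host, database, table, and paths are placeholders, not taken from this post:

# Minimal Sqoop import sketch (hypothetical connection details and paths):
# pulls the "orders" table from MySQL into HDFS as Parquet using 4 parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales_db \
  --username sqoop_user \
  --password-file /user/sqoop/.db_password \
  --table orders \
  --target-dir /data/raw/orders \
  --as-parquetfile \
  --num-mappers 4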
Hadoop Vs Traditional Data Processing Solutions
This post has been moved to https://www.technologyintrend.com/2019/07/hadoop-vs-rdbms.html. Sorry for the inconvenience.
Why Big Data Analytics had become a Buzzword Today
Big Data Analytics has become a buzzword today. Be it insurance, banking, e-commerce, or anything else, everyone is inclined towards learning or implementing Big Data.
Analytics, at its core, is the quantification of insights into the small phenomena happening in real life, developed by properly working through huge amounts of data with complex and often cumbersome algorithms.
But why is Data Analytics such a hot topic in the market?
1. Data spin-off is wealthier than the data itself
Data is a magic wand, a tool that can conjure the perfect strategy to venture on by identifying the patterns and relationships present inside it.
2. Quantification of insights for business growth
Business is all about taking the right decision and, more importantly, taking it at the right time. The quantification of insights derived from data by identifying patterns and relationships can prove reliable in solving problems and taking decisions.
3. Selling the insights
Companies nowadays sell insights to vendors who actually need the information to drive their marketing strategy or to reach out to more and more of their targeted audience.
If you are new to this technical jargon, several questions may arise in your mind:
- Why Big Data when we already have well-tested and reliable traditional RDBMS solutions?
- Why didn't we have to deal with Big Data 5-10 years back? Where did all this data come from all of a sudden?
- Didn't we have e-commerce websites selling products online and generating huge amounts of data 10 years back? For what good are they migrating to Big Data?
I will try to answer the above questions one by one.
Why didn't we have to deal with Big Data 5-10 years back?
The answer to this consists of the following two important points:
- The lack of an easy, cost-effective, and fast processing engine to analyze terabytes of data of different varieties. We had solutions like Netezza and Teradata, but they are very costly and difficult to scale up.
- Since the advent of globalization, the amount of data being generated has increased tremendously, and it will keep growing at a tremendous speed.
The world really is producing data at an accelerating rate every year, and the growth in globalization is a major reason behind this surge. Data is doubling roughly every two years; a statistic from EMC projected that data would grow from 4.4 zettabytes in 2013 to 44 zettabytes in 2020.
The insights and predictions made through analytics are never 100% accurate. The precision of a prediction is directly proportional to:
- the type and amount of data, and
- the rationality of the algorithm applied.
So the more data you process, the more precise your insights will be. Hadoop came along as a cost-effective, fault-tolerant processing solution in which industry can invest to achieve a throughput where the investment is far less than the economic value of the insights gained by processing and analyzing the data.
--By Sanjeev Krishna
Basic Programming guide to begin with Apache Spark
This post has been moved to https://www.technologyintrend.com/2019/07/basic-programming-guide-to-begin-with-apache-spark.html. Sorry for the inconvenience.
How and where to practise Big Data and Apache Spark Programs?
The content has been moved to https://www.technologyintrend.com/2019/03/platform-to-practice-Big-Data-Apache-Spark.html. Sorry for the inconvenience.
Deployment modes and job submission in Apache Spark
There are various ways of submitting an application in Spark. In addition to the client and cluster modes of execution, there is also a local mode of submitting a Spark job. We must understand these modes of execution before we start running our jobs. Before we jump into it, we need to recall a few important things learnt in the previous lesson; see Introduction to Apache Spark to know more.
Spark is a scheduling, monitoring, and distribution engine, i.e. Spark is not only a processing engine, it can also act as a resource manager for the jobs submitted to it. Spark can run by itself (standalone) using its own cluster manager, and it can also run on top of other cluster/resource managers.
How Spark supports different Cluster Managers and Why?
This is made possible by the SparkContext object, which lives in the main driver program of a Spark application. The SparkContext object can connect to several types of cluster managers, enabling Spark to run on top of other cluster-manager frameworks like YARN or Mesos. It is this object that coordinates the independently executing parallel processes of the cluster.
(Cluster overview diagram: https://spark.apache.org/docs/latest/cluster-overview.html)
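In practice this means the same driver code can be pointed at different cluster managers just by changing the master URL; a minimal sketch (the host name and port below are placeholders, not from this post):

# The SparkContext connects to whichever cluster manager the master URL names.
./bin/spark-shell --master local[4]                   # local pseudo-cluster with 4 threads
./bin/spark-shell --master spark://master-host:7077   # Spark standalone cluster manager
./bin/spark-shell --master yarn                       # YARN (reads cluster info from HADOOP_CONF_DIR)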
A Spark installation can be launched in three different ways:
1. Local (pseudo-cluster mode)
2. Standalone (cluster with Spark's default cluster manager)
3. On top of another cluster manager (cluster with YARN, Mesos, or Kubernetes as the cluster manager)
Local:
Local mode is a pseudo-cluster mode generally used for testing and demonstration. In local mode, all the execution components run on a single node.
Standalone:
In standalone mode, the default cluster manager provided in the official distribution of Apache Spark is used for resource and cluster management of Spark jobs. It has a standalone Master for resource management and standalone Workers for the tasks.
Please do not get confused here: standalone mode doesn't mean a single-node Spark deployment. It is also a cluster deployment of Spark; the only thing to understand is that in standalone mode the cluster is managed by Spark itself.
On top of other cluster managers:
Apache Spark can also run on other cluster managers like YARN, Mesos, or Kubernetes. However, the most widely used cluster manager for Spark in industry is YARN, because of its good compatibility with HDFS and the other benefits it brings, like data locality.
The command used to submit a Spark job is the same for standalone and other cluster modes.
Scala:

spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other Spark properties and options
  <application-jar> \
  [application-arguments]

PySpark:

spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other Spark properties and options
  --py-files <python-modules-jars> \
  my_application.py \
  [application-arguments]

Table 1: Spark-submit command in Scala and Python
For Python applications, in place of a JAR we simply pass our .py file as the <application-jar>, and add Python dependencies such as modules, .zip, .egg, or .py files with --py-files.
Click to see the other Spark properties and options.
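For example, a hypothetical PySpark job whose helper modules are packaged in a zip file might be submitted like this (the file and path names are placeholders):

# Submit a PySpark application together with its packaged Python dependencies.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files deps.zip \
  my_application.py /data/input /data/output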
How to submit a Spark job on a standalone cluster vs a cluster managed by other cluster managers?
The answer to the above question is very simple: use the "--master" option shown in the above spark-submit command and pass the master URL of the cluster, e.g.
Mode       | Value of "--master"
Standalone | --master spark://HOST:PORT
Mesos      | --master mesos://HOST:PORT
YARN       | --master yarn
Local      | --master local[*]  (* = number of threads)

Table 2: Spark-submit "--master" for different Spark deployment modes
When you submit a job in Spark, the application jar (the code you have written for the job) is distributed to all worker nodes, along with any additional jar files if mentioned.
We have talked enough about the cluster deployment mode; now we need to understand the application "--deploy-mode". The deployment modes discussed so far are cluster deployment modes and are different from the "--deploy-mode" mentioned in the spark-submit command (Table 1). --deploy-mode is the application (or driver) deploy mode, which tells Spark how to run the job in the cluster (and, as already mentioned, the cluster can be standalone, YARN, or Mesos). For an application (Spark job) running on a cluster there are two --deploy-modes: one is client and the other is cluster mode.
Spark application deploy modes:
Cluster: the driver runs inside the cluster. In this case the Resource Manager or Master decides on which node the driver will run.
Client: the driver runs on the machine from which the job is submitted.
Now the question arises -
"How to submit a job in Cluster or Client mode and which is better?"
How to submit:
In the above spark-submit command, just pass "--deploy-mode client" for client mode and "--deploy-mode cluster" for cluster mode.
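For instance, a hypothetical job could be submitted in either mode by changing just that one flag:

# Client mode: the driver runs on the machine that submits the job.
spark-submit --master yarn --deploy-mode client my_application.py

# Cluster mode: the driver runs on a node chosen by the cluster manager.
spark-submit --master yarn --deploy-mode cluster my_application.py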
Which one is better, client or cluster mode?
Unlike cluster mode, in client mode the job will fail if the client machine is disconnected. Client mode is good if you want to work with Spark interactively; it is also the choice if you don't want to consume any resources from your cluster for the driver daemon. When dealing with a huge data set and calling actions on RDDs or DataFrames, you need to make sure you have sufficient resources available on the client. So it's not that cluster mode or client mode is better than the other; you can choose either deploy mode for your application, depending on what suits your requirement.
Client | Cluster
Driver runs on the machine where the job is submitted. | Driver runs inside the cluster; the Resource Manager or Master decides on which node the driver will run.
The job fails if the driver machine is disconnected. | After submitting the job, the client can disconnect.
Can be used to work with Spark interactively; performing an action on an RDD or DataFrame (like count) and capturing the result in logs becomes easy. | Cannot be used to work with Spark in an interactive manner.
Jars can be accessed from the client machine. | Since the driver runs on a different machine than the client, jars present on the local machine won't work; they must be made available to all nodes, either by placing them on each node or by passing them with --jars or --py-files during spark-submit.
On YARN: the Spark driver does not run on the YARN cluster; only the executors run inside the YARN cluster. | On YARN: the Spark driver and the executors both run on the YARN cluster.
On YARN: the local directory used by the driver is spark.local.dir, and for the executors it is the YARN config yarn.nodemanager.local-dirs. | On YARN: the local directories used by both the Spark driver and the executors are the local directories configured for YARN (yarn.nodemanager.local-dirs).

Table 3: Spark Client vs Cluster mode
Here are some examples of submitting a Spark job in different modes:
Local

Scala:
./bin/spark-submit \
  --class main_class \
  --master local[8] \
  /path/to/examples.jar

PySpark:
./bin/spark-submit \
  --master local[8] \
  my_job.py

Spark Standalone

Scala:
./bin/spark-submit \
  --class main_class \
  --master spark://<ip-address>:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 10G \
  --total-executor-cores 100 \
  /path/to/examples.jar

PySpark:
./bin/spark-submit \
  --master spark://<ip-address>:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 10G \
  --total-executor-cores 100 \
  --py-files <python-modules-jars> \
  my_job.py

YARN cluster mode

Scala (use --deploy-mode client for client mode):
./bin/spark-submit \
  --class main_class \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10G \
  --num-executors 50 \
  /path/to/examples.jar

PySpark:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10G \
  --num-executors 50 \
  --py-files <python-modules-jars> \
  my_job.py

Table 4: Spark-submit examples for different modes
To learn more on Spark, click here. Let us know your views or feedback on Facebook or Twitter @BigDataDiscuss.