Saturday 6 April 2019

Apache Spark Interview Questions

    Top 50 Spark interview questions
---------------------------------------------------------

Q:-How do you connect to Hadoop in your project?
ans:-We connect to the Hadoop cluster through an edge node.

Q:-What is Spark?
ans:-Spark is an open-source cluster computing framework.
It supports both real-time data processing and batch processing,
whereas Hadoop MapReduce supports batch processing only.

-------------------------------------------------------------------------------------
Q:- Why Spark? Why not Hadoop?
Ans:-There are several reasons:
1-The main reason: Hadoop MapReduce can do batch processing only, but Spark can do both real-time data processing and
batch processing.

2-The number of lines of code you write in Spark is smaller compared to Hadoop.

3-Hadoop is written in Java but Spark is written in Scala.
Scala is a concise language that runs on the JVM, which helps keep Spark programs short.


--------------------------------------------------------------------------------------

Q:-What are the features of Spark?
ans:-
1- Real-time stream processing:- Spark offers real-time processing of data, whereas
Hadoop MapReduce could only handle and process data that is already present;
it does not support real-time data processing, which is why Spark came into the picture.

2- Dynamic in nature:- it is easy to develop parallel applications.

3- In-memory computation in Spark:- there is no need to fetch data from disk every time.

4- Reusability:- the same Spark code can be reused across batch and stream processing.

5- Fault tolerance:- Apache Spark provides fault tolerance through its core abstraction, the RDD.
Spark RDDs are able to recover from the failure of any worker node in the cluster.

6- Lazy evaluation:- lazy evaluation means that execution will not start until an action is triggered.
Transformations are lazy in nature: when we call an operation on an RDD, it does not execute immediately
(see the sketch after this list).
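
A minimal sketch of lazy evaluation in Scala, assuming an existing SparkContext sc and a hypothetical orders.txt file:

val rdd = sc.textFile("orders.txt")                        // nothing is executed yet
val errors = rdd.filter(line => line.contains("ERROR"))    // still lazy, only the lineage is recorded
val n = errors.count()                                     // count() is an action, so execution starts here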

------------------------------------------------------------------------------------------------------
Q:-Advantages of Spark?
ans:- 1-Increased manageability
2- Saves computation and increases speed
3- Reduces complexity
4- Optimization.
----------------------------------------------------------------------------------------------------
Q:-What are the components of the Spark ecosystem?
ans:-Spark Core
Spark Streaming
Spark SQL
GraphX
MLlib

----------------------------------------------------------------------------------------------------
Q:- Lifecycle of a Spark program?
ans:-
Lifecycle:-
1-Load data on the cluster
2-Create an RDD (once you create an RDD you have to transform it)
3-Do transformations
4-Perform actions
5-Create a DataFrame
6-Perform queries on the DataFrame.


In other words:

1- The initial part is loading the data;
you can also load streaming data.

2- Once the data is loaded we need to transform it;
for transformations we use map, flatMap, filter, etc.

3- Once the transformations are complete we have to perform an action on the transformed data
(see the sketch below).
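
A minimal sketch of this lifecycle in Scala, assuming a SparkSession named spark, its SparkContext sc, and a hypothetical sales.txt file of "product,amount" lines:

// 1-2: load data from the cluster into an RDD
val raw = sc.textFile("hdfs:///data/sales.txt")
// 3: transformations (lazy)
val amounts = raw.map(_.split(","))
                 .filter(_.length == 2)
                 .map(cols => (cols(0), cols(1).toDouble))
// 4: an action triggers the actual execution
val total = amounts.values.sum()
// 5-6: create a DataFrame and query it
import spark.implicits._
val df = amounts.toDF("product", "amount")
df.createOrReplaceTempView("sales")
spark.sql("SELECT product, SUM(amount) FROM sales GROUP BY product").show()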
----------------------------------------------------------------------------------------------

Lifecycle of Spark program
-------------------------------------------------------------------------------------------------
Q:-Explain the lifecycle of a Spark application with the standalone resource manager.
Ans:-
1-The user submits a Spark application using the spark-submit command.

2-spark-submit launches the driver program on the same node (client mode) or on the cluster (cluster mode)
and invokes the main method specified by the user.

3-The driver program contacts the cluster manager to ask for resources to launch executor JVMs,
based on the configuration parameters supplied.

4-The cluster manager launches executor JVMs on worker nodes.

5-The driver process scans through the user application.
Based on the RDD actions and transformations in the program, Spark creates an operator graph.

6-When an action (such as collect) is called, the graph is submitted to the DAG scheduler.
The DAG scheduler divides the operator graph into stages.

7-A stage comprises tasks based on partitions of the input data. The DAG scheduler pipelines operators together where possible (for example, consecutive map operations can run in a single stage).


In other words:-
1- Spark has something called a driver.
The driver is like a master; it is the one which gives commands to everyone else.

2- Inside the driver we have something called the SparkContext; it is just like a Spring context.

3- The SparkContext controls the worker nodes.
Inside the worker nodes we have executors; these executors are what actually execute the tasks.

4- So the driver acts as a master. The driver instructs the workers to execute tasks;
that is done via the SparkContext, which instructs the workers to execute tasks on each node.

5- If a particular task fails, the SparkContext rebuilds it and sends it to a worker again.

--------------------------------------------------------------------------------------------------

Q:- Are there any benefits of Apache Spark over Hadoop MapReduce?
Ans:- Spark has the ability to perform data processing up to 100 times faster than MapReduce.
Spark also has in-memory processing and built-in libraries to perform multiple kinds of workloads together,
like batch processing, streaming, interactive processing, etc.


Q:-Define Apache Spark Core and how it is useful for a Scala developer.
Spark Core is used for memory management, job monitoring, fault tolerance,
job scheduling and interaction with storage systems. The RDD is the key abstraction in Spark Core and is designed to tolerate faults.
An RDD is a collection of distributed objects available across multiple nodes
that are generally manipulated in parallel.


Q:-How many cluster managers are supported in Apache Spark?
Ans:-
Three cluster managers are supported:
Standalone
Mesos
YARN

Q:- What are the ways to launch Spark over YARN?
Ans:- When running on YARN, Spark executors run as YARN containers.
Spark supports two modes for running on YARN:

1- Cluster mode:- useful in a production environment;
launched with spark-submit or spark-shell using --master yarn-cluster.

2- Client mode:- useful for development purposes;
launched with spark-submit or spark-shell using --master yarn-client.


Q:- While submitting a Spark job, what are the properties you are going to submit?
ans:


Q:-What is SparkContext in Apache Spark?
ans:-SparkContext is the entry point of Spark functionality.
The most important step of any Spark driver application is to generate the SparkContext.
It allows your Spark application to access the Spark cluster with the help of a resource manager.
The resource manager can be one of these three: Spark Standalone, YARN, or Apache Mesos.

----------------------------------------------------------------------------------------------------

Q:- Who creates the SparkContext?
ans:-The SparkContext is created by the driver program; it is the entry gate of Apache Spark functionality.
Generating the SparkContext is the most important step of any Spark driver application,
as it allows the Spark application to access the Spark cluster with the help of a resource manager.

-----------------------------------------------------------------------------------------------------

Q:How do you create the SparkContext?
ans:-If you want to create a SparkContext, a SparkConf should be created first.
The SparkConf holds the configuration parameters that our Spark driver application will pass to the SparkContext.

or

In short, it specifies how to access the Spark cluster. After the creation of a SparkContext object,
we can invoke functions such as textFile, sequenceFile, parallelize, etc.
The different contexts in which it can run are local, yarn-client, a Mesos URL and a Spark standalone URL.
Once the SparkContext is created, it can be used to create RDDs, broadcast variables
and accumulators, to access Spark services and to run jobs.
All these things can be carried out until the SparkContext is stopped (see the sketch below).
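
A minimal sketch in Scala; the application name and master URL below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MySparkApp")        // placeholder application name
  .setMaster("local[*]")           // could also be a YARN, Mesos or standalone master URL
val sc = new SparkContext(conf)

// once the context exists we can create RDDs, broadcast variables, accumulators, etc.
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())

sc.stop()                          // nothing can be run on this context after stop()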

------------------------------------------------------------------------------------------------------
Q:-What are Stages in Spark?
Ans:-A stage is nothing but a step in the physical execution plan; it is a physical unit of the execution plan.
In other words, a stage is a set of parallel tasks,
i.e. one task per partition. Basically,
each job gets divided into smaller sets of tasks, and each such set is a stage.

Types of Spark stages:
Basically, stages in Apache Spark fall into two categories:

a. ShuffleMapStage in Spark

b. ResultStage in Spark

-----------------------------------------------------------------------------------------------------
Q:What is a Spark Executor?
ans:-Executors in Spark are worker processes launched on the worker nodes of the cluster.
They are in charge of running the individual tasks of a given Spark job.
Executors also provide in-memory storage for Spark RDDs.

--------------------------------------------------------------------------------------------------------
Q:- Can we run Apache Spark without Hadoop?
ans:-Yes, Apache Spark can run without Hadoop:

standalone, or in the cloud. Spark doesn't need a Hadoop cluster to work.
Spark can read and then process data from other file systems as well; HDFS is just one of the file systems that Spark supports.

Spark is meant for distributed computing. In that setup the data is distributed across the machines, and
Hadoop's distributed file system HDFS is commonly used to store data that does not fit in memory.


-------------------------------------------------------------------------------------------------------
Q:-What are the different running modes of Apache Spark?
Ans:-Apache Spark can be run in the following three modes:

(1) Local mode
(2) Standalone mode
(3) Cluster mode

----------------------------------------------------------------------------------------------------------
Q:-What are the roles and responsibilities of worker nodes in an Apache Spark cluster?
Is a worker node in Spark the same as a slave node?

Ans:-A worker node is a node which runs the application code in the cluster.
The worker node is the slave node: the master node assigns work, and the worker nodes actually perform the assigned tasks.
Worker nodes process the data stored on the node
and report their resources to the master. Based on resource availability, the master schedules tasks.

-----------------------------------------------------------------------------------------------------------

Q:-What are the features of a DataFrame in Spark?
List out the characteristics of a DataFrame in Apache Spark.

Ans:-DataFrames are distributed collections of data. In a DataFrame, data is organized into named columns.
It is conceptually similar to a table in a relational database.

Out of the box, DataFrames support reading data from the most popular formats, including JSON files,
Parquet files and Hive tables. They can also read from distributed file systems (HDFS), local file systems,
cloud storage (S3) and external relational database systems through JDBC (see the sketch below).
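
A small sketch of reading DataFrames from a few of these sources, assuming a SparkSession named spark and hypothetical paths and connection details:

val jsonDF    = spark.read.json("hdfs:///data/people.json")         // JSON files
val parquetDF = spark.read.parquet("s3a://my-bucket/events/")       // Parquet on cloud storage
val jdbcDF    = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")              // placeholder JDBC URL
  .option("dbtable", "customers")
  .option("user", "spark")
  .option("password", "secret")
  .load()

jsonDF.printSchema()
jsonDF.select("name").show()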
----------------------------------------------------------------------------------------------------

Q:-What are the different methods to run Spark over Apache Hadoop?
Ans:-
1. Local standalone mode — everything (Spark, driver, workers, etc.) runs on the same machine locally.
Generally used for testing and developing the logic of a Spark application.

2. YARN client — the driver program runs on the client machine that submits the job,
and the workers (data nodes) are separate.

3. YARN cluster — the driver program runs on one of the data nodes in the cluster and
the workers are separate. This is the most advisable mode for a production platform.

4. Mesos:

Mesos is used in large-scale production deployments.


-------------------------------------------------------------------------------------------------
Q:-What is the Parquet file format? Where should the Parquet format be used? How do you convert data to the Parquet format?
Ans:-
Parquet is a columnar data representation that is one of the best choices
for storing large amounts of data for the long run for analytics purposes.
Spark can perform both read and write operations on Parquet files.
Parquet is a columnar data storage format.

Parquet was created to make the benefits of compressed,
efficient columnar data representation available to any project,
regardless of the choice of data processing framework, data model or programming language.

Parquet is a format which can be processed by a number of different systems: Spark SQL,
Impala, Hive, Pig, etc. It doesn't lock you into a particular programming language,
since the format is defined using Thrift, which supports a number of programming languages.
For example, Impala is written in
C++ whereas Hive is written in Java, yet they can easily interoperate on the same Parquet data
(a conversion sketch follows below).
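
A minimal sketch of converting data to Parquet in Spark, assuming a SparkSession named spark and hypothetical HDFS paths:

// read data in some other format (CSV here) ...
val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")

// ... and write it out as Parquet
df.write.mode("overwrite").parquet("hdfs:///data/output_parquet")

// reading the Parquet data back is just as simple
val parquetDF = spark.read.parquet("hdfs:///data/output_parquet")
parquetDF.show()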


----------------------------------------------------------------------------------------------------

Q:- What is an RDD?
ans:-
RDD stands for Resilient Distributed Dataset.
An RDD is an immutable, fault-tolerant collection of distributed objects processed in parallel.
      or

RDDs are immutable (can't be modified once created) and fault tolerant; Distributed because
they are distributed across the cluster, and Dataset because they hold data.

         or

An RDD is a collection of distributed objects available across multiple nodes
that are generally manipulated in parallel.



So why RDD? Apache Spark lets you treat your input files almost like any other variable,
 which you cannot do in Hadoop MapReduce.
RDDs are automatically distributed across the network by means of Partitions.



Partitions

RDDs are divided into smaller chunks called partitions, and when you execute an action,
a task is launched per partition. So the more partitions there are, the more parallelism you get.
Spark automatically decides the number of partitions that an RDD has to be divided into,
but you can also specify the number of partitions when creating an RDD.
These partitions of an RDD are distributed across all the nodes in the network (see the sketch below).
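
A short sketch of inspecting and changing the number of partitions, assuming an existing SparkContext sc:

val rdd = sc.textFile("/some_file", 3)    // ask for 3 partitions up front
println(rdd.getNumPartitions)             // how many partitions the RDD actually has

val more  = rdd.repartition(10)           // increase parallelism (causes a shuffle)
val fewer = rdd.coalesce(1)               // reduce the number of partitions, avoiding a full shuffle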

Creating an RDD
Creating an RDD is easy: it can be created either from an external file or
by parallelizing collections in your driver. For example,

val rdd = sc.textFile("/some_file",3)
val lines = sc.parallelize(List("this is","an example"))

The first line creates an RDD from an external file,
and the second line creates an RDD from a list of Strings.

Note that the argument '3' in the call to sc.textFile() specifies the number of partitions
that have to be created. If you don't want to specify the number of partitions,
then you can simply call sc.textFile("some_file").

Actions/Transformations
There are two types of operations that you can perform on an RDD: transformations and actions.
A transformation applies some function to an RDD and creates a new RDD;
it does not modify the RDD that you apply the function to (remember that RDDs are resilient/immutable).
Also, the new RDD keeps a pointer to its parent RDD.


When you call a transformation, Spark does not execute it immediately; instead it creates a lineage.
A lineage keeps track of all the transformations that have to be applied on that RDD,
including where it has to read the data from. For example, consider the code below:


val rdd = sc.textFile("spam.txt")
val filtered = rdd.filter(line => line.contains("money"))
filtered.count()

sc.textFile() and rdd.filter() do not get executed immediately;
they will only get executed once you call an action on the RDD - here filtered.count().
An action is used to either save the result to some location or to display it.
You can also print the RDD lineage information
by using the command filtered.toDebugString (filtered is the RDD here).

RDDs can also be thought of as a set of instructions that have to be executed,
the first instruction being the load instruction.

Caching
You can cache an RDD in memory by calling rdd.cache(). When you cache an RDD,
its partitions are loaded into the memory of the nodes that hold it.


Caching can improve the performance of your application to a great extent.
In the previous section you saw that when an action is performed on an RDD,
Spark executes its entire lineage.
Now imagine you are going to perform an action multiple times on the same RDD
which has a long lineage; this will cause an increase in execution time.
Caching stores the computed result of the RDD in memory, thereby
eliminating the need to recompute it every time. You can think of caching as
breaking the lineage, but
Spark does remember the lineage so that the RDD can be recomputed in case of a node failure (see the sketch below).
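
A short sketch of caching in Scala, reusing the filtered RDD from the example above:

val rdd = sc.textFile("spam.txt")
val filtered = rdd.filter(line => line.contains("money"))

filtered.cache()    // marks the RDD for caching; nothing is stored yet
filtered.count()    // the first action computes the lineage and caches the partitions
filtered.count()    // the second action reads straight from memory, no recomputation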

         
                         or

Resilient Distributed Datasets (RDDs)
RDDs are the main logical data unit in Spark. They are a distributed collection of objects,
which are stored in memory or on the disks of different machines of a cluster.
A single RDD can be divided into multiple logical partitions so that
these partitions can be stored and processed on different machines of a cluster.

RDDs are immutable (read-only) in nature. You cannot change an original RDD,
but you can create new RDDs by performing coarse-grained operations, like transformations, on an existing RDD.

-------------------------------------------------------------------------------------------------------
Q:-Why do we need RDDs in Spark?
Ans:-The key motivations behind the concept of the RDD are:

Iterative algorithms.
Interactive data mining tools.
DSM (Distributed Shared Memory)

The main challenge in designing RDDs is defining a program interface that provides fault tolerance efficiently.
To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory,
based on coarse-grained transformations rather than fine-grained updates to shared state.


------------------------------------------------------------------------------------------------------
Q:- Features of RDD

Ans:-Partitioning
In-memory Computation
Immutability
Lazy Evaluation

Distributed
Resilient

-----------------------------------------------------------------------------------------------------

Q:-There are two basic operations which can be performed on RDDs. What are they?
Ans:-
Transformations
Actions

Transformations: These are functions which accept existing RDDs as input and output one or more new RDDs.
The data in the existing RDDs does not change, as RDDs are immutable.
Some of the transformation operations are shown in the table given below:

Functions          Description
map()              Returns a new RDD by applying the function to each data element
filter()           Returns a new RDD formed by selecting those elements of the source on which the function returns true
reduceByKey()      Used to aggregate the values of a key using a function
groupByKey()       Used to convert (key, value) pairs to (key, <iterable value>) pairs
union()            Returns a new RDD that contains all elements of the source RDD and of its argument
intersection()     Returns a new RDD that contains the intersection of the elements in the datasets

Transformations are lazy: they are only executed when an action is called.
Every time a transformation is applied, a new RDD is created.

Actions: Actions in Spark are functions which return the end result of RDD computations.
An action uses the lineage graph to load the data into the RDD in the required order.
After all transformations are done, actions return the final result to the Spark driver.
Actions are operations which produce non-RDD values. Some of the common actions used in Spark are:

Functions            Description
count()              Gets the number of data elements in the RDD
collect()            Gets all the data elements of the RDD as an array
reduce()             Aggregates the data elements of the RDD by taking two arguments and returning one
take(n)              Fetches the first n elements of the RDD
foreach(operation)   Executes the operation for each data element in the RDD
first()              Retrieves the first data element of the RDD
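
A small sketch combining a few of these transformations and actions, assuming an existing SparkContext sc:

val words = sc.parallelize(List("spark", "hadoop", "spark", "hive"))

// transformations - lazy, each one returns a new RDD
val pairs     = words.map(word => (word, 1))
val counts    = pairs.reduceByKey(_ + _)
val sparkOnly = counts.filter { case (word, _) => word == "spark" }

// actions - trigger execution and return values to the driver
println(counts.count())            // number of distinct words
counts.collect().foreach(println)  // all (word, count) pairs as a local array
println(sparkOnly.first())         // (spark,2)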

---------------------------------------------------------------------------------------------------
Q:-How many ways are there to create an RDD?
Ans:-Creating an RDD
An RDD can be created in three ways:

1-By loading an external dataset
You can load an external file into an RDD.
The types of files you can load are CSV, txt, JSON, etc.
(See the sketch after this list for an example of loading a text file into an RDD.)

2-By parallelizing a collection of objects
When Spark's parallelize method is applied to a group of elements, a new distributed dataset is created.
This is called an RDD.

3-By performing transformations on existing RDDs
One or more new RDDs can be created by performing transformations on existing RDDs.
The sketch below also shows how a map() function can be used.
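
A minimal sketch of all three ways in Scala, assuming an existing SparkContext sc and a hypothetical input.txt:

// 1. by loading an external dataset (a text file here)
val fromFile = sc.textFile("input.txt")

// 2. by parallelizing a collection of objects in the driver
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 3. by transforming an existing RDD (map here)
val squared = fromCollection.map(x => x * x)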


----------------------------------------------------------------------------------------------------
Q:-What is the map transformation operation in Apache Spark?
What is the need for the map transformation?
What processing can be done in map in Spark? Explain with an example.

Ans:-map is a transformation applied to each element in an RDD, and it provides a new RDD as a result.
In a map transformation, user-defined business logic is applied to all the elements in the RDD.
It is similar to flatMap, but unlike flatMap, which can produce 0, 1 or many outputs per input element,
map produces exactly one output per input element.
The map operation transforms an RDD of length N into another RDD of length N.

A——->a
B——->b
C——->c
Map Operation

A map transformation will not shuffle data from one partition to many; it keeps the operation narrow (see the sketch below).
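
A minimal sketch of a map transformation in Scala, assuming an existing SparkContext sc; each input element produces exactly one output element:

val names = sc.parallelize(List("alice", "bob", "carol"))
val upper = names.map(_.toUpperCase)         // one output per input: ALICE, BOB, CAROL
upper.collect().foreach(println)

// contrast with flatMap, which may produce 0, 1 or many outputs per input
val lines  = sc.parallelize(List("hello world", "spark"))
val tokens = lines.flatMap(_.split(" "))     // 3 outputs from 2 inputs
tokens.collect().foreach(println)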

Q:-Explain the flatMap() transformation in Apache Spark.
ans:-









---------------------------------------------------------------------------------------------

Q:- Difference between DataFrame and Dataset?
ans:-











-----------------------------------------------------------------------------------------------
Q:-What is the difference between map and flatMap?
ans:-











------------------------------------------------------------------------------------------------------
Q:- What is the difference between groupByKey and reduceByKey?
Ans:-











--------------------------------------------------------------------------------------------------------
Q:-How is an RDD fault tolerant?
Ans:-










-----------------------------------------------------------------------------------------------------------
Q:- Where do we need mapPartitions?
Ans:-










--------------------------------------------------------------------------------------------------------
Q:-How to reduce the number of partitions?
Ans:-








-------------------------------------------------------------------------------------------------------
Q:-How do we read different file formats?
Ans:-







--------------------------------------------------------------------------------------------------------
Q:-What are the types of joins in Spark?
Ans:-









-------------------------------------------------------------------------------------------------------
Q:-If a Hive job is taking too much time, what steps are you going to take?
Ans:-








-----------------------------------------------------------------------------------------------------
Q:-How to optimize the query?
Ans:-







------------------------------------------------------------------------------------------------------
Q:-How do we do unit testing?

Ans:-





-----------------------------------------------------------------------------------------------------
Q:-If any partition is corrupt, how do you handle it?
Ans:-








-----------------------------------------------------------------------------------------------------
Q:- Difference between DAG and lineage?
Ans:-










-----------------------------------------------------------------------------------------------------
Q:-How to increase the number of partitions?
Ans:-









---------------------------------------------------------------------------------------------------
Q:-If I am applying aggregation after grouping, which RDD operation should we use?
Ans:-







----------------------------------------------------------------------------------------------------
Q:-Difference between order by and sort by?
Ans:-












------------------------------------------------------------------------------------------------------
======================================================================================================

    Hive Interview Questions
=======================================================================================================

Q:-What is Hive?








-----------------------------------------------------------------------------------------------------
Q:-Where to use hive?
Ans:-







-------------------------------------------------------------------------------------------------------
Q:- What are the features of Hive?
Ans:-







--------------------------------------------------------------------------------------------------------
Q:-What are table-generating functions in Hive?
Ans:-









--------------------------------------------------------------------------------------------------------
Q:-How to delete some data in Hive?
Ans:-








--------------------------------------------------------------------------------------------------------
Q:-What are the types of tables in Hive?
Ans:-









---------------------------------------------------------------------------------------------------------
Q:-Suppose we drop an internal table; what will happen?
Ans:-








--------------------------------------------------------------------------------------------------------
Q:-Suppose we drop an external table; what will happen?








---------------------------------------------------------------------------------------------------------
Q:-What are partitions? Where have you implemented them in your project?
Ans:-









--------------------------------------------------------------------------------------------------------
Q:-How many types of partitioning are there in Hive?
Ans:-









--------------------------------------------------------------------------------------------------------
Q:-Which type of partitioning should we go for?
Ans:-








-------------------------------------------------------------------------------------------------------
Q:-Why do we use bucketing?
Ans:-









-------------------------------------------------------------------------------------------------------
Q:-What are the file formats in Hive?
Ans:-











======================================================================================================



