Abstract – A massive volume of data (measured in exabytes or zettabytes) is
called Big Data. Quantifying such a large volume of data and storing it
electronically is not easy. The Hadoop system is used to process these huge
datasets, and the MapReduce program is used to gather the data according to a
request. To achieve greater performance, big data requires proper scheduling.
Scheduling techniques are used to assign jobs to available resources so as to
minimize starvation and maximize resource utilization. Performance can be
increased by implementing deadline constraints on jobs. The goal of this
research is to study and evaluate various scheduling algorithms for better performance.
Index Terms – Big Data, MapReduce, Hadoop, Job Scheduling Algorithms.
The term big data [1] has become very popular in the Information Technology sector.
Big data refers to a broad range of datasets that are hard to manage with
conventional applications. Big data can be applied in finance and
business, banking, online and onsite purchasing, healthcare, astronomy,
oceanography, engineering, and many other fields. These datasets are very
complex and are growing exponentially day by day in very large volume. As data
increases in volume, variety and velocity, processing it becomes more complex.
Correlating, linking, matching and transforming such big data is a complex
process. Big data, being a developing field, has many research problems and
challenges to address. The
major research problems in big data are the following: 1) Handling data volume, 2)
Analysis of big data, 3) Privacy of data, 4) Storage of huge volume of data, 5)
Data visualization, 6) Job scheduling in big data, 7) Fault tolerance.
1) Handling data volume [1] [2]: The large volume of data coming from different
fields of science such as biology, astronomy, meteorology, etc. makes its
processing very difficult for scientists.
2) Analysis of big data: It is difficult to analyze big data because of the
heterogeneity and incompleteness of the data, which can be in different
formats, varieties and structures [3].
3) Privacy of data in the context of big data [3]: There is public fear regarding
the inappropriate use of personal data, particularly through linking of data
from multiple sources. Managing privacy is both a technical and a sociological
problem.
4) Storage of huge volume of data [1] [3]: This is the problem of how to recognize
and efficiently store important information extracted from unstructured data.
5) Data visualization [1]: Data processing techniques should be efficient enough
to enable real time visualization.
6) Job scheduling in big data [4]: This problem focuses on efficient scheduling
of jobs in a distributed environment.
7) Fault tolerance [5]: This is another issue in the Hadoop framework. In Hadoop,
the NameNode is a single point of failure. Replication of blocks is one of the
fault tolerance techniques used by Hadoop. Fault tolerance techniques must be
efficient enough to handle failures in a distributed environment.
MapReduce [6] provides an ideal framework for
processing such large datasets using parallel and distributed programming.
MapReduce operations depend on two functions, Map and Reduce. Both functions
are written according to the user's needs. The Map function takes an input pair
and generates a set of intermediate key/value pairs. The MapReduce library
collects all the intermediate values associated with the same intermediate
key and passes them to the Reduce function for further processing. The
Reduce function receives an intermediate key together with its set of
values and merges those values to form a smaller set of values.
Figure 1 shows the overall MapReduce process.
Figure 1: Overall MapReduce Word Count Process.
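The word count process can be illustrated with a minimal single-machine sketch in Python. It only imitates the Map, grouping and Reduce steps described above; the function names (map_fn, reduce_fn, run_mapreduce) and the sample documents are hypothetical and are not part of the Hadoop API.

```python
from collections import defaultdict

def map_fn(_doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: merge all values of one intermediate key into a single count."""
    return word, sum(counts)

def run_mapreduce(documents):
    # Grouping step: collect intermediate values by their intermediate key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Each key and its grouped values are handed to the Reduce function.
    return dict(reduce_fn(key, values) for key, values in groups.items())

word_counts = run_mapreduce({"d1": "big data big jobs", "d2": "data jobs data"})
```

In a real cluster the Map and Reduce calls run in parallel on different nodes; here they run sequentially so that the data flow is easy to follow.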
Scheduling decisions are taken by the master node, which is called the Job
Tracker, while the worker nodes, which are called Task Trackers, execute the
tasks.
A Hadoop cluster includes a single master node and multiple slave nodes. Figure 2
shows the Hadoop architecture. The single master node consists of a Job Tracker,
Task Tracker, Name Node and Data Node.
The function of the Job Tracker is to manage the Task Trackers and track resource
availability. The Job Tracker is the node that controls the job execution
process. It assigns MapReduce tasks to specific nodes in the
cluster. Clients submit jobs to the Job Tracker. When the work is completed,
the Job Tracker updates its status. Client applications can ask the Job Tracker
for this status information.
The Task Tracker follows the orders of the Job Tracker and periodically updates
the Job Tracker with its status. Task Trackers run tasks and send reports to the
Job Tracker, which keeps a complete record of each job. Every Task Tracker is
configured with a set of slots, which indicates the number of tasks that it can accept.
The name node maintains two tables: one maps blocks to block locations, and
the other records which blocks are stored on which data node. Whenever a
data node reports disk corruption of a particular block, the first table
is updated, and whenever a data node is detected to be dead due to a network
or node failure, both tables are updated. The tables are updated only on
failure of nodes. The update does not depend on any neighbor blocks
or block locations to identify its destination. Each block is kept separate
with its job nodes and respective allocated process.
The node that stores data in the Hadoop system is known as the data node. All
data nodes send a heartbeat message to the name node every three seconds to
indicate that they are alive. If the name node does not receive a heartbeat from
a particular data node for ten minutes, it considers that data node to be dead
or out of service and assigns the work to some other data node. The data nodes
periodically update the name node with their block information.
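The interplay between heartbeats and the name node's tables can be sketched as follows. This is a simplified illustration, not HDFS code: the NameNode class, its method names and the layout of the two tables are assumptions made for the example, and only the ten-minute timeout is taken from the text.

```python
DEAD_AFTER_S = 10 * 60  # silence longer than ten minutes means the node is dead

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}  # data node -> time of its last heartbeat
        self.block_to_nodes = {}  # first table: block -> data nodes holding it
        self.node_to_blocks = {}  # second table: data node -> blocks it holds

    def heartbeat(self, node, now, blocks=()):
        """Record a heartbeat and any blocks the data node reports."""
        self.last_heartbeat[node] = now
        for block in blocks:
            self.block_to_nodes.setdefault(block, set()).add(node)
            self.node_to_blocks.setdefault(node, set()).add(block)

    def report_corrupt_block(self, node, block):
        """Disk corruption of one block: only the first table is updated."""
        self.block_to_nodes.get(block, set()).discard(node)

    def remove_dead_nodes(self, now):
        """Drop nodes that were silent too long: both tables are updated."""
        dead = [n for n, t in self.last_heartbeat.items() if now - t > DEAD_AFTER_S]
        for node in dead:
            for block in self.node_to_blocks.pop(node, set()):
                self.block_to_nodes[block].discard(node)
            del self.last_heartbeat[node]
        return dead

nn = NameNode()
nn.heartbeat("dn1", now=0, blocks=["b1", "b2"])
nn.heartbeat("dn2", now=0, blocks=["b1"])
nn.heartbeat("dn2", now=600)          # dn2 keeps reporting, dn1 stays silent
dead = nn.remove_dead_nodes(now=601)
```

After the last call, block b2 has no live replica left, which is the situation in which the name node would have to pick another data node.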
JOB SCHEDULING IN BIG DATA
The default scheduling algorithm is based on FIFO, where jobs were executed in
the order of their submission. Later, the ability to set the priority of
a job was added. Facebook and Yahoo contributed significant work in
developing schedulers, i.e. the Fair Scheduler [8] and the Capacity Scheduler [9]
respectively, which were subsequently released to the Hadoop community. This
section describes various job scheduling algorithms in big data.
A. Default FIFO Scheduling
The default Hadoop scheduler operates using a FIFO queue. After a job is divided
into independent tasks, they are loaded into the queue and allotted to free
slots as they become available on Task Tracker nodes. Although there is support
for assigning priorities to jobs, this is not turned on by default. Typically
each job would use the complete cluster, so jobs had to wait for their
turn. Even though a shared cluster offers great potential for
offering large resources to many users, the problem of sharing resources
fairly between users requires a better scheduler. Production jobs need to complete in a timely manner.
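The FIFO behaviour can be sketched in a few lines of Python. The function name fifo_schedule and the job names are hypothetical; priorities are ignored, which matches the default configuration described above.

```python
from collections import deque

def fifo_schedule(jobs, free_slots):
    """Assign queued tasks to free slots strictly in job submission order.

    jobs: list of (job_name, number_of_tasks) in submission order.
    Returns the (job_name, task_index) pairs started in this round.
    """
    queue = deque()
    for name, n_tasks in jobs:
        # Each job is split into independent tasks that join the queue.
        queue.extend((name, i) for i in range(n_tasks))
    started = []
    while queue and free_slots > 0:
        started.append(queue.popleft())
        free_slots -= 1
    return started

started = fifo_schedule([("long_job", 4), ("short_job", 1)], free_slots=3)
```

With three free slots, all of them go to long_job and short_job has to wait, which is the weakness that motivates the schedulers in the following subsections.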
B. Fair Scheduling
The Fair Scheduler [8] was developed at Facebook to manage access to their Hadoop
cluster and subsequently released to the Hadoop community. The Fair Scheduler
aims to give each user a fair share of the cluster capacity over
time. Users may assign jobs to pools, with every pool allocated a guaranteed
minimum number of Map and Reduce slots. Free slots in idle pools may
be allocated to other pools, while excess capacity within a pool is shared among
jobs. The Fair Scheduler supports preemption, so if a pool has not received
its fair share for a certain period of time, then the scheduler will
kill tasks in pools running over capacity in order to give the slots to the
pool running under capacity. In addition, administrators may enforce
priority settings on certain pools. Tasks are therefore scheduled in an
interleaved fashion, based on their priority within their pool, and the
cluster capacity and usage of their pool. As jobs have their tasks
assigned to Task Tracker slots for computation, the scheduler tracks the
deficit between the amount of compute time actually used and the ideal fair
share for that job. Over time, this has the effect of ensuring that jobs
receive roughly equal amounts of resources. Shorter jobs are assigned enough
resources to finish quickly. At the same time, longer jobs are guaranteed not to
be starved of resources.
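The deficit tracking can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the Fair Scheduler's implementation: pools, priorities and preemption are left out, all tasks take one time step, and a free slot is given to the job with the largest deficit. The function name fair_schedule is hypothetical.

```python
def fair_schedule(pending_tasks, slots, steps):
    """Simulate slot assignment by largest deficit.

    pending_tasks: dict job -> number of unit-length tasks still to run.
    Returns dict job -> number of slot-steps the job received.
    """
    pending = dict(pending_tasks)
    deficit = {job: 0.0 for job in pending}
    received = {job: 0 for job in pending}
    for _ in range(steps):
        active = [job for job in pending if pending[job] > 0]
        if not active:
            break
        for job in active:
            # The ideal fair share of this step accrues to every active job.
            deficit[job] += slots / len(active)
        for _ in range(slots):
            candidates = [job for job in active if pending[job] > 0]
            if not candidates:
                break
            job = max(candidates, key=lambda j: deficit[j])
            deficit[job] -= 1      # one slot-step of compute time was used
            pending[job] -= 1
            received[job] += 1
    return received

received = fair_schedule({"long": 100, "short": 2}, slots=2, steps=3)
```

In this run the short job gets one slot in each of the first two steps and finishes, after which the long job receives both slots.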
C. Capacity Scheduling
The Capacity Scheduler [10], initially developed at Yahoo, addresses a usage
scenario where the number of users is large and there is a need to ensure a fair
allocation of computation resources between users.
The Capacity Scheduler
allocates jobs, based on the submitting user, to queues with configurable
numbers of Map and Reduce slots. Queues that hold jobs are given their
configured capacity, while free capacity in a queue is shared among other
queues. Within a queue, scheduling operates on a modified priority queue
basis with specific user limits, with priorities adjusted based
on the time a job was submitted and the priority setting allocated to that
user and class of job. When a Task Tracker slot becomes free,
the queue with the lowest load is chosen, from which the oldest remaining job
is chosen. A task is then scheduled from that job. This has the effect of
enforcing capacity sharing among users, rather than among jobs, as
was the case in the Fair Scheduler.
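The selection step for a free slot might look like the sketch below. It assumes that the load of a queue is the fraction of its configured capacity currently in use; that definition, the function name pick_next_job and the queue layout are assumptions made for illustration, and user limits and priorities are omitted.

```python
def pick_next_job(queues):
    """Choose the queue with the lowest load, then its oldest waiting job.

    queues: dict name -> {"capacity": configured slots,
                          "running": slots in use,
                          "jobs": list of (submit_time, job_name)}.
    Returns (queue_name, job_name), or None if no queue has a waiting job.
    """
    waiting = {name: q for name, q in queues.items() if q["jobs"]}
    if not waiting:
        return None
    # Load is assumed here to be running slots divided by capacity.
    name = min(waiting, key=lambda n: waiting[n]["running"] / waiting[n]["capacity"])
    _submit_time, job = min(waiting[name]["jobs"])  # oldest submission first
    return name, job

queues = {
    "research": {"capacity": 10, "running": 8, "jobs": [(5, "r1")]},
    "production": {"capacity": 30, "running": 12, "jobs": [(9, "p2"), (3, "p1")]},
}
choice = pick_next_job(queues)
```

Here the production queue is less loaded (12 of 30 slots against 8 of 10), so its oldest job p1 supplies the next task.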
D. Dynamic Proportional Scheduling
As claimed by Sandholm and Lai [12], Dynamic Proportional scheduling provides more
job sharing and prioritization, which results in an increased share of cluster
resources and more differentiation in the service levels of different jobs. This
algorithm improves response time for multi-user Hadoop environments.
E. Resource-Aware Adaptive Scheduling (RAS)
To increase resource utilization among machines while monitoring the completion
time of processes, RAS was proposed by Polo et al. [13] for MapReduce.
Zhao et al. [14] provide a
task scheduling algorithm based on resource attribute selection (RAS). It
determines the resources of a node by sending a group of test tasks to an
execution node before a task is scheduled, and then chooses the optimal node to
execute the task according to its resource needs and the fitness between the
resource node and the task, using historical task information if available.
F. MapReduce task scheduling with deadline
constraints (MTSD) algorithm
According to Tang et al. [15], the scheduling algorithm sets two deadlines:
map-deadline and reduce-deadline. The reduce-deadline is simply the user's job
deadline. Pop et al. [16] present a classical approach for aperiodic task
scheduling by considering a scheduling system with different queues for
periodic and aperiodic tasks and with the deadline as the main constraint, and
develop a method to estimate the number of resources required to schedule a
set of aperiodic tasks by considering both execution
and data transfer costs. Based on a mathematical model, and by using different
simulation scenarios, the authors proved the
following statements: (1) various sources of independent aperiodic tasks can
be approximated by a single one; (2) when the number of estimated
resources exceeds a data center's capacity, task migration between
different regional centers is the appropriate solution with
respect to the global deadline; and (3) in a heterogeneous data center, a
higher number of resources is needed for the same request with respect to the
deadline constraints. For MapReduce, Wang and Li [17] detailed task
scheduling for distributed data centers on heterogeneous networks through
adaptive heartbeats, job deadlines and data locality. Job deadlines are
divided according to the maximum data volume of tasks. With the considered
constraints, task scheduling is formulated as an assignment problem in each
heartbeat, in which adaptive heartbeats are calculated from the processing times
of tasks, jobs are sequenced in terms of the divided deadlines, and tasks
are scheduled by the Hungarian algorithm. On the basis of data transfer
and processing times, the most suitable data center for all mapped jobs is
determined in the reduce phase.
G. Delay Scheduling
The objective is to address the conflict between locality and fairness. When a
node requests a task, if the head-of-line job cannot launch a
local task, the scheduler skips that job and looks at later jobs. If a job has
been skipped for long enough, it is allowed to launch non-local tasks, to
avoid starvation. Delay scheduling temporarily relaxes fairness to obtain higher
locality by allowing jobs to wait for scheduling on a node with local data. Song
et al. [18] offer a game theory based technique to solve scheduling problems by
separating the Hadoop scheduling issue into two levels: job level and task level.
For job level scheduling, they use a bid model to guarantee fairness
and reduce the average waiting time. For task level scheduling, they transform
the scheduling problem into an assignment problem and use the Hungarian method
to optimize it. Wan et al. [19] provide a multi-job scheduling algorithm for
MapReduce based on game theory that deals with the competition for resources
between many jobs.
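The skipping rule of delay scheduling can be sketched as below. This is an illustrative simplification: the function name delay_schedule, the job dictionaries and the max_skips threshold are hypothetical, and the job list is assumed to be already sorted in fairness order.

```python
def delay_schedule(jobs, node, max_skips):
    """Pick a task for a free slot on `node`.

    jobs: list of dicts in fairness order, each with "name", "skips" and
          "tasks" (a list of sets naming the nodes that hold each task's data).
    Returns (job_name, is_local), or None if every job was skipped.
    """
    for job in jobs:
        if not job["tasks"]:
            continue
        local = next((t for t in job["tasks"] if node in t), None)
        if local is not None:
            job["tasks"].remove(local)
            job["skips"] = 0              # a local launch resets the counter
            return job["name"], True
        if job["skips"] >= max_skips:
            job["tasks"].pop(0)           # skipped long enough: run non-locally
            return job["name"], False
        job["skips"] += 1                 # skip this job and look at later jobs
    return None

jobs = [
    {"name": "A", "skips": 0, "tasks": [{"n2"}]},
    {"name": "B", "skips": 0, "tasks": [{"n1"}, {"n1"}]},
]
first = delay_schedule(jobs, "n1", max_skips=1)
second = delay_schedule(jobs, "n1", max_skips=1)
```

On the first request job A has no data on n1 and is skipped in favour of a local task of job B; on the second request A has reached the skip limit and is allowed to run non-locally.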
H. Multi Objective Scheduling
The authors of [20] explain a scheduling algorithm named MOMTH that considers
objective functions related to resources and users at the same time,
with constraints such as deadline and budget.
The execution model assumes that all MapReduce jobs are independent, that
there is no node failure before or during the scheduling computation, and that
the scheduling decision is taken solely based on the present data. Bian et al.
present a scheduling strategy based on fault tolerance. According to this
scheduling strategy, the cluster finds the speed of the present nodes and creates
backups of the intermediate MapReduce data on a high performance cache server,
since the data created by a node could go wrong shortly. Hence the cluster can
rapidly resume execution at the previous level. If several nodes go
wrong, the reduce nodes read the Map output from the cache server, or from
both the cache and the node, and the cluster keeps its high performance.
I. Hybrid Multistage Heuristic Scheduling (HMHS)
The authors of [21] elaborate a heuristic scheduling algorithm named HMHS that
attempts to solve the scheduling problem by splitting it into two sub-problems:
sequencing and dispatching. For sequencing, they use a heuristic based on Pri
(the modified Johnson's algorithm). For dispatching, they recommend two
heuristics, Min-Min and Dynamic Min-Min.
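For the dispatching step, the classic Min-Min heuristic can be sketched as follows. This shows the general textbook form of Min-Min rather than the exact variant used in HMHS; the function name min_min and the execution-time matrix are hypothetical.

```python
def min_min(exec_time):
    """Min-Min dispatching.

    exec_time[t][m] is the run time of task t on machine m.
    Returns (assignment dict task -> machine, makespan).
    """
    n_machines = len(exec_time[0])
    ready = [0] * n_machines                 # time at which each machine is free
    unassigned = set(range(len(exec_time)))
    assignment = {}
    while unassigned:
        # Among all unassigned tasks, commit the one whose earliest possible
        # completion time (over all machines) is the smallest.
        finish, task, machine = min(
            (ready[m] + exec_time[t][m], t, m)
            for t in unassigned
            for m in range(n_machines)
        )
        assignment[task] = machine
        ready[machine] = finish
        unassigned.remove(task)
    return assignment, max(ready)

assignment, makespan = min_min([[4, 6], [2, 3], [5, 1]])
```

In this example task 2 is placed first on machine 1, after which tasks 1 and 0 both go to machine 0, which then determines the makespan.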
TABLE I: COMPARISON OF VARIOUS JOB SCHEDULING ALGORITHMS IN BIG DATA

Default FIFO Scheduling [22]
  Key features: Schedules jobs based on their priorities in first-in first-out order.
  Advantages: 1. The cost of the entire cluster scheduling process is low. 2. Simple to implement and efficient.
  Disadvantages: 1. Designed only for a single type of job. 2. Low performance when running multiple types of jobs. 3. Poor response times for short jobs compared to large jobs.

Fair Scheduling [8]
  Key features: Distributes compute resources equally among the users/jobs in the system.
  Advantages: 1. Less complex. 2. Works well when both small and large 3. It can provide fast response times for small jobs mixed with larger jobs.
  Disadvantages: 1. Does not consider the job weight of

Capacity Scheduling
  Key features: Maximizes resource utilization and throughput in a multi-tenant cluster environment.
  Advantages: 1. Ensures guaranteed access with the potential to reuse unused capacity and prioritize jobs within queues over
  Disadvantages: 1. The most complex among three

Dynamic Proportional Scheduling [12]
  Key features: Planned for data intensive workloads and tries to maintain data locality during job execution.
  Advantages: 1. It is a fast and flexible scheduler. 2. It improves response time for multi-user Hadoop environments.
  Disadvantages: If the system eventually crashes, then all unfinished low priority processes are lost.

Resource-Aware Adaptive Scheduling (RAS)
  Key features: Dynamic Free Slot Advertisement. Free
  Advantages: It improves job performance.
  Disadvantages: Only takes action on appropriate slow

MapReduce task scheduling with deadline constraints (MTSD)
  Key features: Achieves nearly full overlap via the novel idea of including reduce in the overlap.
  Advantages: 1. It reduces computation time. 2. It improves performance for the important class of shuffle-heavy Map Reductions.
  Disadvantages: Works better with small clusters only.

Delay Scheduling
  Key features: Addresses the conflict between locality and fairness.
  Advantages: 1. Simplicity of scheduling

Multi Objective Scheduling [20]
  Key features: The execution model assumes that all the MapReduce jobs are independent, that there is no node failure before or during the scheduling computation, and that the scheduling decision is taken only based on the present data.
  Advantages: It keeps performance high.
  Disadvantages: Execution time is too large.

Hybrid Multistage Heuristic Scheduling (HMHS)
  Key features: Johnson's algorithm and the Min-Min and Dynamic Min-Min algorithms are used.
  Advantages: Achieves not only a high data locality rate but also high cluster utilization.
  Disadvantages: It does not ensure reliability.
This paper provides a classification of Hadoop schedulers based on different
parameters such as time, priority, resources, etc. It discusses how various
task scheduling algorithms help in achieving better results in a Hadoop cluster.
Furthermore, this paper also discusses the advantages and disadvantages of
various task scheduling algorithms. The comparison shows that each scheduling
algorithm has some advantages and disadvantages, so all the algorithms are
important in job scheduling.
This paper gives an overall idea of the different job scheduling algorithms in
big data and compares most of the properties of various task scheduling
algorithms. Individual scheduling techniques that are used to improve data
locality, efficiency, makespan, fairness and performance are elaborated and
discussed. However, the scheduling technique is an open area for researchers to