Resource Management for Data Intensive Tasks on Grids

Distributed systems such as Grids aim to enable the sharing, selection, and aggregation of a wide variety of resources that are geographically distributed and often owned by different organizations. These resources collaborate to perform complex tasks. Without efficient resource management, the benefits of a Grid system cannot be realized, especially for large-scale computational and data intensive tasks. In a Grid, a resource management system is responsible for managing the available resources for a given task to be performed. This thesis proposes an effective resource management system called BiLeG, which can be used for performing resource intensive tasks in a Grid computing environment. It focuses on the problem of allocating resources for a group of resource intensive tasks of a particular type, termed Processable Bulk Data Transfer (PBDT) tasks. A PBDT task involves the transfer of a very large volume of data that has to be processed in some way before it can be used at a remote set of sink nodes. In BiLeG, the resource management system is bifurcated into separate upper and lower decision making levels, and separate responsibilities are assigned to each level. The upper decision making level of BiLeG, called the Task Resource Pool Selector (TRPS), is concerned with the selection of a resource-pool for the given task. The lower decision making level, called the Resource Allocator (RA), is responsible for allocating resources out of the resource-pool chosen by TRPS. At TRPS, a policy determines the way the resource-pool is chosen for each of the tasks. At RA, an algorithm is deployed which determines the allocation of the resources from the resource-pool for the task selected by TRPS. Note that the resource allocation problem considered in this research focuses on achieving a good balance between multiple performance metrics.
Although existing research has addressed various issues in resource allocation, none of the existing works has dealt with resource allocation for PBDT tasks addressing multiple performance metrics that this research focuses on.


Introduction
The ubiquitous Internet as well as the availability of powerful computers and high-speed network technologies as low-cost commodity components are changing the way computing is carried out. It has become more feasible to use widely distributed computers for solving large-scale problems, which often cannot be dealt with effectively using a single existing powerful supercomputer. In terms of computations and data requirements, these problems are often resource intensive due to their size and complexity. They may also involve the use of a variety of heterogeneous resources that are not usually available in a single location. This led to the emergence of what is known as Grid computing. Grid computing enables sharing of heterogeneous distributed resources across different administrative and geographical boundaries [3]. By sharing these distributed resources, many complex distributed tasks can be performed in a cost effective way. The way the resources are allocated to tasks is of pivotal importance for achieving satisfactory system performance [4]. To perform efficiently, the resource allocation algorithm has to take into account many factors, such as the system and workload conditions, the type of the task to be performed and the requirements of the end user.
To devise more efficient allocation algorithms, it may be useful to classify the given tasks into predefined types based on similarities in their predicted resource needs or workflows. This classification of tasks into various types makes it possible to customize the allocation algorithm for a particular group of similar tasks. This chapter presents an effective resource management middleware developed for a type of resource-intensive tasks classified as Processable Bulk Data Transfer (PBDT) tasks. The common trait among PBDT tasks is the transfer of a very large amount of data which has to be processed in some way before it can be delivered from a source node to a set of designated sink nodes (Ahmad, I. & Majumdar, S., 2008). Typically, these tasks can be broken down into parallel sub-tasks, called jobs. Various multimedia and High Energy Physics (HEP) applications can be classified as PBDT tasks. The processing operation involved in these tasks may be as simple as applying a compression algorithm to a raw video file in a multimedia application, or as complex as isolating information about particles pertaining to certain wavelengths in HEP experiments [22] [25]. Performing PBDT tasks requires both computing power and large bandwidths for data transmission. To perform such resource-intensive tasks, research has been conducted in recent years on devising effective resource management techniques. The problem of optimally scheduling these sub-tasks is a well-known NP-complete problem [12]. To tackle it, various heuristics-based algorithms that can generate near-optimal solutions to optimization problems in polynomial time have been devised. In this chapter a Bilevel Grid Resource Management System, abbreviated as BiLeG, is presented, in which the decision-making module is divided into two separate sub-modules. The upper level decision-making module is called the Task & Resource Pool Selector (TRPS).
It selects a task from the given bag-of-tasks for which resources are to be assigned and chooses a partition of the available resources for this task (called the resource-pool of the task), which is typically a subset of all the resources available. The lower level decision-making module is called the Resource Allocator (RA), which uses an assignment algorithm to decide how the resources from the chosen resource-pool are allocated to the jobs in a given task. Various algorithms can be used at RA, whereas various policies can be deployed at TRPS. A particular combination of a TRPS policy and an RA scheduling algorithm deployed at a time is called an allocation-plan, which determines the resource allocation for each task in the given bag-of-tasks. The following notation is used in this paper to write an allocation-plan: <TRPS Policy, RA-Algorithm>. Investigating the choice of the most appropriate allocation-plan under a specific set of workload and system conditions is the focus of this chapter.
The main contributions of this paper are summarized below.
1. It proposes the ATSRA algorithm and two extensions based on a constraint relaxation technique. Based on simulation, it analyses the performance of the proposed algorithms for different numbers of available Grid nodes.
2. The experimental results capture the trade-off between accuracy in resource allocation and scheduling overhead, both of which affect the overall system performance. The chapter discusses the circumstances under which the proposed original algorithm or its extensions should be used.
The rest of the paper is organized as follows. In Section 2, different approaches to resource allocation of tasks on Grids are presented. In Section 3, PBDT tasks are described. In Section 4, the problem being solved is defined and an overview of the proposed system is presented. In Section 5, TRPS resource selection policies are described. In Section 6, the concept of Architectural Templates is described. In Section 7, a Linear Programming (LP) based algorithm that can be used to perform PBDT tasks is described along with its extensions. In Section 8, experimental results are presented. Finally, in Section 9, the chapter is concluded.

Approaches to resource allocation of tasks on grids
Researchers have taken various approaches to the resource allocation of tasks on Grids. These approaches can be divided into three broad categories.

1. Traditional Schedulers and Resource Brokers
2. Policy based Resource Allocation
3. Workflow based Resource Allocation
Each of these approaches is discussed in one of the following subsections.

Traditional schedulers and resource brokers
One of the traditional approaches is to use a Grid resource broker which selects suitable resources by interacting with various middleware services. Venugopal describes such a Grid resource broker that discovers computational and data resources running diverse middleware through distributed discovery services [12]. However, it provides no mechanism for breaking a given task into parallel jobs for processing.
YarKhan and Dongarra [22] have also performed scheduling experiments in a Grid environment using simulated annealing. To evaluate the schedules generated by the simulated annealing algorithm they use a Performance Model, a function specifically created to predict the execution time of the program. Generating such a Performance Model requires detailed analysis of the program to be scheduled.
Another effort worth mentioning is the Grid Application Development Software (GrADS) Project [2]. At the heart of the GrADS architecture is an enhanced execution environment which continually adapts the application to changes in the Grid resources, with the goal of maintaining overall performance at the highest possible level. A number of resource allocation algorithms can be used in GrADS to schedule a given bag-of-tasks in Grid environments. Due to the NP-complete nature of the resource allocation problem, the majority of proposed solutions are heuristic algorithms [14] [18] [20].

Policy based resource allocation
For resource allocation in Grids, some researchers have also proposed policy based resource allocation techniques. Sander et al. [12] propose a policy based architecture for QoS configuration for systems that comprise different administrative domains in a Grid. They focus on making decisions when users attempt to make reservations for network bandwidth across several administrative network domains that are controlled by a bandwidth broker. The bandwidth broker acts as an allocator and establishes an end-to-end signalling process that chooses the most efficient path based on the available bandwidth. The work presented in [13] is concerned with data transmission costs only, whereas the research presented in this chapter needs to consider both the computation and the communication costs associated with PBDT tasks. Verma et al. [19] have also proposed a technique in which resource allocation is performed based on a predefined policy. However, in their work the resource allocation is not based on any performance measure.

Workflow based resource allocation
Many recent efforts have focused on the scheduling of workflows in Grids. [16] presents a QoS-based workflow management system and a scheduling algorithm that match workflow applications with resources using event condition action rules. Pandey and Buyya have worked on scheduling scientific workflows using various approaches in the context of their GridBus workflow management effort [11] [23]. [23] has developed an architecture to specify and to schedule workflows under resource allocation constraints. Also, many of the data Grid projects that support distributed processing of remote data have proposed workflow scheduling [11] [21].

Processable Bulk Data Transfer (PBDT) tasks
PBDT tasks require bulk transfer of processed data. Such data transfers are typical in multimedia systems and HEP experiments. For example, in [1], 650 MB of data was transferred on average from a source to a set of sink nodes. The high communication and computing times in PBDT tasks effectively amortize the overhead of the LP-based algorithm used for optimization of the system performance. A PBDT task has the following three characteristics.
1. The task involves a large data transfer, and the data has to be processed in some way before it can be used at the sink nodes. The large amount of data involved in a PBDT task differentiates it from compute intensive tasks, where the data usually consists only of the parameters of the remote functions invoked. This implies that the data communication costs cannot be ignored while scheduling a PBDT task.
2. The cost of data processing is proportional to the length of the raw data file.
3. The unprocessed raw file is such that it can either be processed as a whole or be divided into multiple partitions. If divided into partitions, each partition can be processed independently. The resultant processed partitions can later be combined to generate the required processed file.
Consider a source file F of size L. F can be partitioned into k disjoint partitions with data sizes {L_1, L_2, ..., L_k}, such that

$$L = \sum_{i=1}^{k} L_i \qquad (1)$$

Then for a PBDT task, the length of the required processed file is given by

$$\sum_{i=1}^{k} \varepsilon_i L_i$$

where ε_i is a processing factor, which is the ratio of the size of the processed partition to that of the original partition.
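The relationship between the raw partitions and the processed file can be illustrated with a small calculation. The following is a minimal sketch; the function name and the example sizes and processing factors are hypothetical and are chosen only to resemble a video compression task.

```python
def processed_length(partition_sizes, factors):
    """Size of the processed file: the sum of (processing factor * partition size)."""
    if len(partition_sizes) != len(factors):
        raise ValueError("one processing factor is needed per partition")
    return sum(eps * size for size, eps in zip(partition_sizes, factors))


# Hypothetical example: a 10240 MB raw file split into four equal partitions,
# each of which shrinks to 6.35% of its original size when processed.
partitions = [2560.0] * 4
factors = [0.0635] * 4

raw_length = sum(partitions)                           # L, the sum of all L_i
final_length = processed_length(partitions, factors)   # about 650 MB
```

Because each partition carries its own factor, the same function also covers the case where partitions compress unequally.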
PBDT tasks are becoming increasingly important. They are used in various multimedia, high-energy physics and medical applications. The following subsections describe two practical examples of PBDT tasks.

Particle physics data grids
Particle Physics Data Grids (PPDG) is a collaboratory project concerned with providing next-generation infrastructure for high-energy and nuclear physics experiments. One of the important requirements of PPDG is to deal with the enormous amount of data that is created during high-energy physics experiments and that must be analyzed by large groups of specialists. Data storage, replication, job scheduling, resource management and security components of the Grid must be integrated for use by the physics collaborators. Processing these tasks requires huge computing capabilities and fast communication capabilities. Grid computing is used for processing PPDG tasks, which can be classified as PBDT tasks.

Multimedia encoding
Multimedia encoding is required for applying a specific codec to a video [27]. Conventional methods use a single system for the conversion. The compression of the raw captured video data into an MPEG-1 or MPEG-2 data stream can take an enormous amount of time, which increases with higher quality conversions. Depending on the quality level of the video capture, a typical one-hour tape can produce over 10 GB of video data, which needs to be compressed to approximately 650 MB to fit on a VideoCD. The compression stage is CPU intensive, since it matches all parts of adjacent video frames looking for similar sub-pictures, and then creates an MPEG data stream encoding the frames. At higher quality levels, more data is initially captured and enhanced algorithms, which consume more time, are used. The compression process can take a day or more, depending on the quality level and the speed of the system being used. For commercial DVD quality, conversions are typically done by a service company that has developed higher quality conversion algorithms, which may take a considerable amount of time to execute. Grid technology is ideal for improving the process of video conversion.

Overall system architecture
In this research we have focused on the problem of allocating resources for a given bag of PBDT tasks. The bag-of-tasks consists of a set of independent PBDT tasks all of which must be executed successfully. The Grid system consists of n nodes. Collectively, these n nodes are represented by a set Δ. Each individual PBDT task in the given bag-of-tasks may be divided into a number of sub-tasks called jobs which can be executed in parallel, independent of each other. As discussed, PBDT tasks are resource-intensive tasks that use a large amount of computing resources and communication bandwidth. Usually, if a node starts processing a PBDT task, pre-emption of this task is counter-productive as it wastes the effort of transferring the raw-data file to the concerned node. Also, due to excessive demand of computing power, a node is assumed to handle the processing of only one PBDT task at a time. In this research we have made the following two assumptions regarding the running of constituent jobs of a task on Grid nodes.
1. Once a job starts executing on a Grid node, it cannot be pre-empted.
2. Only one job can be executed on a Grid node at a time.
For the cost analysis of the proposed architecture, we have measured cost by the time (in seconds) spent in performing a particular communication or computation job. We have chosen one megabyte as the unit of data. When a particular node i accesses data in node j, the communication cost of transporting a data unit from node i to node j is designated by d(i,j). It is assumed that the communication costs are metrics, meaning that they are non-negative, symmetric and satisfy the triangle inequality. The computing cost of a node m is represented by Cp_m, which is the cost of processing a unit of data at that node. The set of all the nodes in the system is represented by Δ. To represent the computing costs, a vector of |Δ| dimensions denoted by [Cp] is used, which holds the values of the computing costs of all the nodes in the system. A matrix [Cc] of dimensions |Δ| × |Δ| denotes the values of the communication costs between all the nodes in the system. The objective of this research is to assign resources to tasks in such a manner that the total cost of executing the given bag-of-tasks is minimized, where the total cost is defined as the total time spent by a task at all the resources it has used during its execution. The total cost indicates the total resource usage for executing a task, and hence the minimization of the total cost is a system objective.
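The cost model above can be sketched in a few lines. This is an illustrative sketch only: the node names and cost values are hypothetical, the helper names are not taken from BiLeG, and `is_metric` checks the standard definition of a metric (non-negativity, symmetry and the triangle inequality).

```python
def is_metric(d, nodes):
    """Check that the per-MB communication costs d[i][j] behave like a metric."""
    for i in nodes:
        for j in nodes:
            if d[i][j] < 0 or d[i][j] != d[j][i]:
                return False
            for k in nodes:
                if d[i][k] > d[i][j] + d[j][k]:
                    return False
    return True


def transfer_cost(size_mb, i, j, d):
    """Seconds needed to move size_mb megabytes from node i to node j."""
    return size_mb * d[i][j]


def processing_cost(size_mb, m, cp):
    """Seconds needed to process size_mb megabytes at node m."""
    return size_mb * cp[m]


# Hypothetical three-node system: [Cp] as a dict, [Cc] as a nested dict.
nodes = ["n1", "n2", "n3"]
cp = {"n1": 0.20, "n2": 0.10, "n3": 0.15}
cc = {
    "n1": {"n1": 0.00, "n2": 0.02, "n3": 0.05},
    "n2": {"n1": 0.02, "n2": 0.00, "n3": 0.04},
    "n3": {"n1": 0.05, "n2": 0.04, "n3": 0.00},
}

# Total cost of shipping a 500 MB file from n1 to n2 and processing it there.
total = transfer_cost(500, "n1", "n2", cc) + processing_cost(500, "n2", cp)
```

Both costs are linear in the data size, which matches the assumption that the cost of data processing is proportional to the length of the raw data file.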

Grid Computing -Technology and Applications, Widespread Coverage and New Horizons 54
The BiLeG resource management system consists of two decision-making modules: a lower level decision-making module called the Resource Allocator (RA) and a higher level decision-making module called the Task Resource Pool Selector (TRPS). TRPS selects a task T_i from the given bag of PBDT tasks and allocates it a resource-pool, which is a subset of all resources available. The resource-pool of a particular task T_i is represented by Ґ_i, where Ґ_i ⊆ Δ. RA allocates resources for a particular task T_i, chosen from its associated resource-pool Ґ_i.
Each PBDT task consists of an unprocessed raw-data file, information about the processing operation that is required to be performed on the raw-data file and a set of sink nodes where the processed file is to be delivered. The source node groups the submitted tasks into a bag-of-tasks (Fig. 1, Step-1) and initiates the processing by sending an "initiate" signal to the TRPS (Fig. 1, Step-2). TRPS determines how many Grid resources are reserved (Fig. 1, Step-3) by interacting with the Grid Computing Environment. This set of reserved nodes is represented by Δ. TRPS determines a resource-pool Ґ_i for each of the tasks T_i. Not all the reserved Grid nodes are available or visible to an individual task T_i in the bag-of-tasks, T. Typically, each task has a different resource-pool selected by TRPS according to the TRPS policy used.
For an individual task, using all the resources of the resource-pool may not be the best option for its most efficient execution. A TRPS resource selection policy is deployed at TRPS and determines the way in which TRPS chooses a resource-pool for each individual task. The policy uses the existing system state and resource availability in making its decision.

Fig. 1. BiLeG Architecture
From the resource-pool Ґ_i allocated by TRPS to T_i, the lower level decision-making module (RA) chooses a set of resources that are used to perform T_i. This set of resources is denoted by ω_i. For different systems, different resource allocation algorithms may be best suited at RA. The remaining resources (Ґ_i − ω_i) are returned to TRPS. Based on the resources chosen by the algorithm, RA divides a particular task into various jobs. RA specifies the details of the jobs in a file which is called the workflow of the task T_i. The BiLeG architecture includes a software component called the workflow engine, which is implemented as a service and is responsible for running all the constituent jobs associated with a particular task.
A combination of a TRPS policy and an RA algorithm is called an Allocation Plan (AP) and is represented by AP{<Policy>,<Algorithm>}. This paper explores the factors that determine the choice of the most efficient allocation plan for a given bag-of-tasks.
Note that the visibility of RA for a particular task is limited to its resource-pool. RA is myopic in nature and is not concerned with the overall system performance optimization. The objective of RA is to optimize the performance for a particular task only. TRPS is concerned with global system performance and has the responsibility to choose an appropriate resource-pool for each of the tasks and pass it on to RA. RA assigns a set of resources from the resource-pool passed to it by TRPS.
In the BiLeG architecture, by dividing the overall system into two independent decision-making modules and by assigning separate responsibilities to each of them, we divide the problem of scheduling the tasks in the given bag-of-tasks into three different sub-problems:
1. Determination of the task execution order at TRPS.
2. Selection of the resource-pool.
3. Resource allocation for each constituent job in the given bag-of-tasks at RA.
These three sub-problems may be solved by three independent algorithms. The division into three independent sub-problems makes the architecture customizable. It also provides finer-grained control over the resource allocation for the given bag-of-tasks and helps improve the stated optimization objective.
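The division of responsibilities between the two levels can be sketched as a pair of pluggable functions. This is a minimal sketch under stated assumptions: the class `AllocationPlan`, its method `assign`, and the toy policy and algorithm used in the example are hypothetical and are not part of BiLeG. The sketch only shows that RA sees nothing beyond the resource-pool handed down by TRPS and that unused nodes are returned.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

Nodes = FrozenSet[str]


@dataclass(frozen=True)
class AllocationPlan:
    """A TRPS policy paired with an RA algorithm (illustrative names only)."""
    policy_name: str
    algorithm_name: str
    select_pool: Callable[[str, Nodes], Nodes]   # upper level (TRPS)
    allocate: Callable[[str, Nodes], Nodes]      # lower level (RA)

    def __str__(self) -> str:
        return f"AP{{<{self.policy_name}>,<{self.algorithm_name}>}}"

    def assign(self, task: str, reserved: Nodes) -> Tuple[Nodes, Nodes]:
        """Return (nodes used by the task, nodes handed back to TRPS)."""
        pool = self.select_pool(task, reserved)
        if not pool <= reserved:
            raise ValueError("the resource-pool must be a subset of the reserved nodes")
        used = self.allocate(task, pool)
        if not used <= pool:
            raise ValueError("RA may only use nodes from the resource-pool")
        return used, pool - used


# Toy stand-ins: the policy offers every reserved node, and the algorithm
# simply takes the first two nodes in name order (not an ATSRA algorithm).
plan = AllocationPlan(
    "SRP_SP", "first-two",
    select_pool=lambda task, reserved: reserved,
    allocate=lambda task, pool: frozenset(sorted(pool)[:2]),
)
used, returned = plan.assign("T1", frozenset({"n1", "n2", "n3"}))
```

Swapping either callable changes one level without touching the other, which is the customizability the three-way decomposition is meant to provide.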

TRPS resource selection policy
A TRPS resource selection policy is used at the upper decision making module to select the resource-pool for each task. It can be either static or dynamic in nature. A TRPS policy is said to be static if the mapping between tasks and their corresponding resource-pools is established before the system starts executing tasks, and it is dynamic if these mappings are established during runtime according to the current availability of the resources. Two static TRPS policies considered in this paper are presented in this section. Dynamic TRPS policies are discussed in [6].

Static Resource-Pool-Single Partition (SRP_SP) policy
In SRP_SP, the TRPS algorithm has two phases, a mapping phase and an execution phase.
The mapping phase (Fig. 2) is performed before the execution of the first task. Each task in T has a given resource-pool Δ. Thus, for each task T_i in T, Ґ_i = Δ. In the mapping phase, a mapping between each task and the most appropriate set of resources it needs is determined. To create this mapping, TRPS iteratively calls the algorithm at RA for each task in T.
In the execution phase, TRPS iterates through the tasks in T in order, starting with the first, and selects each task for which the complete set of resources it needs is available. All the tasks for which every resource allocated by RA is available start executing. All the tasks for which some of the resources allocated by RA are not yet available wait in a queue. Once a task is complete, it is removed from T. The resources released by the task are added to the free resource set, and the queue of waiting tasks is checked again in First-In-First-Out (FIFO) order to see whether the resource demand of any of these tasks can be satisfied. If all the resources of a particular task are now available, it starts execution; the next task in the waiting queue is then checked, and so on. When T = {}, all tasks have been assigned resources.
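The two phases of SRP_SP can be sketched as a small event-driven simulation. This is a sketch under stated assumptions, not the authors' implementation: the function `srp_sp`, the fixed per-task durations and the lookup-table allocator standing in for RA are all hypothetical.

```python
import heapq


def srp_sp(tasks, all_nodes, allocate, duration):
    """Simulate SRP_SP and return the start time of every task."""
    # Mapping phase: RA is called once per task with the full pool.
    mapping = {t: frozenset(allocate(t, all_nodes)) for t in tasks}

    free = set(all_nodes)
    waiting = list(tasks)      # FIFO queue of tasks not yet started
    running = []               # min-heap of (finish_time, task)
    clock = 0.0
    start_times = {}

    # Execution phase.
    while waiting or running:
        still_waiting = []
        for t in waiting:                      # scan the queue in FIFO order
            if mapping[t] <= free:             # complete resource set is free
                free -= mapping[t]
                start_times[t] = clock
                heapq.heappush(running, (clock + duration[t], t))
            else:
                still_waiting.append(t)
        waiting = still_waiting
        if not running:
            raise ValueError("a task was mapped to nodes outside the pool")
        clock, finished = heapq.heappop(running)   # advance to next completion
        free |= mapping[finished]                  # release its resources
    return start_times


# Hypothetical bag of three tasks with a fixed mapping and fixed durations.
fixed = {"T1": {"n1", "n2"}, "T2": {"n2", "n3"}, "T3": {"n4"}}
starts = srp_sp(
    ["T1", "T2", "T3"],
    {"n1", "n2", "n3", "n4"},
    allocate=lambda task, pool: fixed[task],
    duration={"T1": 10.0, "T2": 5.0, "T3": 3.0},
)
```

In this example T2 has to wait for T1 to release n2 even though n3 is idle from the start, which illustrates the contention that a static single-partition mapping can cause.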

Static Resource-Pool-Single Partition with Backfilling (SRP_SP+BF) policy
SRP_SP+BF is an improvement of SRP_SP. A drawback of SRP_SP is that the performance of the system may deteriorate due to two factors. First, there is contention for resources, as each task has to wait until the complete set of resources it was assigned during the mapping phase becomes available. Second, there may be unused resources that are not utilized at all, as it is possible that some resources do not become part of the mapping of any task. Thus, at a particular instant there may be resources that are available but are not being utilized while tasks are waiting in a queue (as the complete resource-pools associated with the tasks waiting in the queue are not available). The mapping phase of SRP_SP+BF is similar to that of SRP_SP. In the execution phase, SRP_SP+BF starts just like SRP_SP. Once all the tasks for which the resources are available have started to execute, SRP_SP+BF tries to take advantage of the unused resources by combining them into a single resource-pool. This resource-pool is given to the first task that is waiting in the queue and is passed to the Resource Allocator for the resource assignment. This process is called backfilling. Backfilling is repeated until there is no unused resource in the system.
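A single backfilling pass can be sketched as follows. The function `backfill` and the toy allocator are hypothetical. The sketch also assumes that the pass stops as soon as RA cannot place the task at the head of the queue on the leftover nodes, a stopping rule the text does not spell out.

```python
def backfill(waiting, free, allocate):
    """Offer the currently unused nodes to waiting tasks in FIFO order.

    allocate(task, pool) is assumed to return the subset of pool that RA
    would use for the task, or an empty set if the pool is insufficient.
    Returns (tasks started with their nodes, remaining queue, unused nodes).
    """
    free = set(free)
    queue = list(waiting)
    started = {}
    while queue and free:
        task = queue[0]
        chosen = set(allocate(task, frozenset(free)))
        if not chosen or not chosen <= free:
            break                      # RA cannot place the head of the queue
        started[task] = chosen
        free -= chosen
        queue.pop(0)
    return started, queue, free


# Hypothetical allocator: each task needs a fixed number of nodes and takes
# the first ones in name order.
need = {"T2": 2, "T4": 2}


def toy_allocate(task, pool):
    nodes = sorted(pool)
    return set(nodes[:need[task]]) if len(nodes) >= need[task] else set()


started, queue, unused = backfill(["T2", "T4"], {"n3", "n4", "n5"}, toy_allocate)
```

Here T2 is re-mapped onto two idle nodes instead of waiting for its original mapping, while T4 stays queued because only one node is left.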

Architectural templates for RA
This section presents the concept of Architectural Templates that are used by the resource allocation algorithm deployed at the RA. In addition to deciding how to decompose a task into its constituent jobs, an Architectural Template divides the available resources into different entities and assigns each of them a specialized role. We have identified the following five roles that can be assigned to the entities:
1. Source: A single Grid node where the raw data file of T_i is located.
2. Sink: A set of Grid nodes where the processed file is to be delivered.
3. Compute-Farm: A set of Grid nodes dedicated to processing the raw data file in parallel.
4. Data-Farm: A set of Grid nodes used for replicating the data.
5. Egress Node: A node where files are combined after being processed at the compute-farm.
Note that a particular node may be part of a different entity at different times. For example a resource may be best utilized in a compute-farm for processing a particular job at one time, thus being a part of the compute-farm entity. But the same node may be used more effectively in a data-farm for processing a job in another task at another time; thus being a part of a data-farm entity. For each type of a task a set of appropriate templates is given as an input to the Resource Allocator. In this paper we have assumed the same set of templates described later can be used for every task in the bag-of-tasks. Thus, an Architectural Template specifies the structure of the suggested functional domains in which available resources are to be divided. This section briefly discusses a set of templates suitable for PBDT tasks.

2-Tier Architectural Templates
In 2-Tier Templates, only the source and the sink nodes are used for both processing and data transfer. There are two different types of 2-Tier Templates: 2-Tier-a and 2-Tier-b. In a 2-Tier-a Template, the source node is used for data processing. Fig. 3 (a) shows the process if the 2-Tier-a architecture is used in a system. TRPS co-ordinates with the Task RA (1) and gives it a PBDT task and a resource pool (which is the set of all available nodes for this task). Task RA sends an acknowledgment signal back to TRPS (2). The Task Workflow Engine, deployed at Lower Level, L1, signals the source node to start the processing of data at the source node (3_1). The raw data file is processed at the source node (3_2) and is delivered to each of the sink nodes (3_31 to 3_3k). After the transfer of processed data is completed, each of the k sink nodes sends an acknowledgment to the Task Workflow Engine to indicate that the processed file has reached the sink node (shown by (4_1) to (4_k)). Once all the k sink nodes have sent completion signals to RA, RA sends the signal to TRPS to indicate that the task has been completed (5).

The 2-Tier-b Architectural Template (shown in Fig. 3 (b)) is similar to 2-Tier-a. The important difference is that instead of being done at the source node, the data processing job is done at each of the sink nodes (3_31 to 3_3k in Fig. 3 (b)).

4-tier Architectural templates
In a 4-tier Architectural Template, the resource pool of the given task (representing the set of available resources for the given task selected by TRPS) is grouped into two domains: a compute-farm and a data-farm. Both the compute-farm and the data-farm have a specific role (see Fig. 4). The role of the compute-farm is to process the data. Once all the data is processed, it is combined at the Egress node. The role of the data-farm is to replicate this processed data at chosen nodes to optimize its transfer to the sink nodes. Initially, TRPS co-ordinates with the RA and gives it a PBDT task and a corresponding resource pool (1). After running the resource allocation algorithm, RA generates the workflow of the given task T_i, which indicates which of the nodes from the provided resource pool will be used. RA returns ω_i to TRPS, indicating which of the resources are planned to be used for the execution of task T_i (2). The Task Workflow Engine initiates the process (3). Once processing of the data is completed at the compute-farm nodes, the processed partitions are transferred to a special node called the Egress Node, where they are combined to produce the required processed file. The Egress Node sends a signal to the Task Workflow Engine (4) to indicate the completion of this stage.
The responsibility of the Egress node is to make sure that all the partitions of the raw data file associated with T_i have been successfully processed. Even if a small portion of data goes missing or is corrupted due to an unforeseen error, the resultant processed file formed by combining the constituent processed partitions may become invalid. In practical environments, catching such errors at an early stage is often desirable, as the Task Workflow Engine can then re-initiate the processing of the faulty data partition only. If the Egress node is not present, the system is not able to catch such errors at an early stage, and an error in the processing of one of the partitions invalidates the resultant processed file. In this case the only way to recover is to restart the whole process from scratch, which would be a considerable waste of both time and resources.
From the Egress node, the processed data is transferred to the data nodes chosen by the algorithm in the workflow. From there it is delivered to each of the k sink nodes. Once the processed data is delivered to all sink nodes, the Task Workflow Engine is notified (5_1 to 5_k) which, in turn, notifies the TRPS (6) to indicate the completion of the task. Note that in the compute-farm, partitions of raw data files are transferred, whereas in the data-farm, complete processed files (not partitions) are transferred and replicated. The 3-tier Architectural Template (having a compute-farm, but no data-farm or Egress node) is not discussed in this paper. If a data-farm is not required, a 3-tier Architectural Template can be used instead of a 4-tier Template.

RA Algorithms
In this section, ATSRA algorithms are described which enable RA to assign resources to T_i from within the resource pool Ґ_i allocated to it by TRPS. ATSRA algorithms are based on Linear Programming (LP), which is a popular technique for solving optimization problems [12] [13]. It models an optimization problem as a set of linear expressions composed of input parameters and output parameters. The LP solver starts by creating a problem instance of the model by assigning values to the input parameters [16] [17]. The problem instance is then subjected to an objective function, which is also required to be a linear expression. The values of the output variables, which collectively represent the optimal solution, are determined for the best value of the objective function. Based on this approach, three algorithms are presented in this section which can be deployed at RA. Each of the ATSRA algorithms has the following two stages. First, the cost associated with each of the Architectural Templates is calculated and the template having the minimum total cost is chosen. The Architectural Template Selection phase starts with calculating the costs associated with the simplest of the templates. For the PBDT tasks described in this paper, it starts with the 2-tier architectures.
If L is the size of the raw data file in MB, n src is the source node, n i is the i th sink node, ξ is the processing factor associated with the given task T i , and Cp src is the CPU processing cost per data unit at the source node, then the total cost of performing the given PBDT task using the 2-Tier-a architecture is given by Equation (3). If Cp k is the cost of processing per data unit at the k th sink node, then the total cost of performing a PBDT task using the 2-Tier-b architecture is given by Equation (4).

For 4-tier cost calculations, the cost function is formulated as a mixed integer/linear programming (MILP) problem, which is an LP problem with the additional restriction that certain variables must take integer values. MILP problems are generally harder to solve than LP problems [11]. If n src is the source node, n j is the j th sink node, n egress is the egress node and p is the number of partitions of the raw data file (as mentioned in its metadata), then the cost for 4-tier Architectural Templates can be formulated in terms of the following variables. x i is a binary variable which is 1 if a particular node n i is assigned to the compute-farm and 0 otherwise. Similarly, y i is a node assignment binary variable for the data-farm: it is 1 if a node is used for replication in the data-farm and 0 otherwise. Variable w ij is the fraction of the processed file that a sink gets from a particular node in the data-farm. Note that we are considering PBDT fixed-par tasks with equal partitions, and each of these partitions at the compute-farm has a length of L/p.
The feasibility of a particular assignment is determined by the following constraints.
The first constraint specifies that the parts of the file transferred from the data-farm to a particular sink node must add up to the full length of the file. The second constraint specifies that the number of nodes used in the compute-farm must be equal to the number of partitions of the raw data file of the given task. The third and the fourth constraints ensure that both x i and y i are binary variables. The fifth constraint makes sure that the solution proposed by the algorithm has a non-zero value of w ij only if y i >0. For example, consider a certain node n 3 and a particular sink node s 7 . Then w 37 (which represents the portion of the total processed file that s 7 gets from n 3 ) should have a non-zero value only if y 3 =1 (that is, n 3 is being used as a node in the data-farm). The last constraint prevents negative values for w ij .
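The six constraints described above can be checked mechanically for a candidate assignment. The following Python sketch is only an illustration of the prose description, not the authors' formulation: the function name `is_feasible`, the tolerance `tol`, and the normalization of the "full length of the file" to a fraction of 1 (since w ij is defined as a fraction) are assumptions made here.

```python
from typing import Sequence


def is_feasible(x: Sequence[int], y: Sequence[int],
                w: Sequence[Sequence[float]], p: int,
                tol: float = 1e-9) -> bool:
    """Check one candidate assignment against the six constraints in the text.

    x[i], y[i]: compute-farm / data-farm indicators for node i.
    w[i][j]:    fraction of the processed file that sink j gets from node i
                (assumption: fractions are normalized so a full file is 1.0).
    p:          number of partitions of the raw data file.
    """
    n = len(x)
    k = len(w[0]) if n else 0
    # Constraints 3 and 4: x_i and y_i are binary.
    if any(v not in (0, 1) for v in x) or any(v not in (0, 1) for v in y):
        return False
    # Constraint 2: exactly p nodes are used in the compute-farm.
    if sum(x) != p:
        return False
    for i in range(n):
        for j in range(k):
            # Last constraint: no negative fractions.
            if w[i][j] < -tol:
                return False
            # Constraint 5: w_ij may be non-zero only if node i is in the data-farm.
            if w[i][j] > tol and y[i] == 0:
                return False
    # Constraint 1: each sink receives the whole processed file.
    return all(abs(sum(w[i][j] for i in range(n)) - 1.0) <= tol
               for j in range(k))
```

For instance, with three nodes and two sinks, `is_feasible([1, 1, 0], [0, 1, 1], [[0, 0], [0.5, 1.0], [0.5, 0.0]], p=2)` is accepted, while the same w with y = [0, 0, 1] is rejected because the second node ships data without being in the data-farm.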
Let C̃ i denote the total cost of sending a unit of data to a node in the compute-farm, processing it and sending it to the egress node. From Equation (5) and Equation (6), cost 4-tier is then obtained as the minimum, over all feasible assignments, of a sum of three terms: a compute-farm term in L and C̃ i , followed by a single-sum term and a double-sum term (Equation (7)). The inputs of the ATSRA algorithms are the computing cost vector [C p ] and the communication cost matrix [C c ]. The output is the solution matrix, which represents the values of the solution variables, i.e. x i and y i for all i = 1 to n, and w ij for all i = 1 to n and j = 1 to k.

It is important to note that nothing prevents a node from being part of both the compute-farm and the data-farm. For example, if the solution matrix has x 3 =1 and y 3 =1, then n 3 is used in both the compute-farm and the data-farm. Once the costs associated with all the Architectural Templates have been calculated, the minimum of them is chosen. If cost min = cost 4-tier , then resources are allocated according to the values of the variables in the solution matrix.

ATSRA SSR algorithm
For a small number of nodes, the ATSRA org algorithm performs well. But as the number of nodes increases, the time taken by the algorithm to run becomes considerable. ATSRA SSR is proposed to improve performance for a large number of nodes. It is based on finding a lower bound for the cost minimization problem formulated in Equation (7). The basic idea is to replace a "difficult" minimization problem by a simpler minimization problem whose optimal value is at least as small as cost 4-tier .
For the "relaxed" problem to have this property, there are two possibilities.
1. Enlarge the set of feasible solutions over which one optimizes. If the set of feasible solutions is represented by P, this means finding a set P′ such that P ⊆ P′.
2. Replace the objective function of Equation (7) by a function that has the same or a smaller value everywhere.
For ATSRA SSR , we have chosen the first approach. We formulate a new optimization problem, called the relaxation of the original problem, using the same objective function but a larger feasible region P′ that includes P as a subset. Because P′ contains P, any solution which belongs to P also belongs to P′. The resulting relaxed cost is denoted by cost 4-tier-relaxed . To enlarge the set of feasible solutions, the constraint relaxation technique is used. In ATSRA SSR , the constraint relaxation technique is used at the Architectural Template selection stage only. Once an Architectural Template has been chosen, the exact LP formulation is used for resource allocation. The algorithm is thus named ATSRA Single Stage Relaxation (SSR), or ATSRA SSR , as the constraint relaxation is applied only to the first stage of the algorithm. ATSRA SSR starts by calculating the costs associated with the 2-tier Architectural Templates 2a and 2b, using Equations (3) and (4). The minimum of these two is called cost min . For 4-tier Architectural Templates, instead of calculating the exact cost 4-tier , cost 4-tier-relaxed is calculated. For constraint relaxation, the fifth constraint (i.e. the one that allows a non-zero w ij only if y i >0) is dropped.
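The lower-bound property of constraint relaxation can be illustrated on a toy problem. The Python sketch below is not the 4-tier formulation of this paper: the objective, the cost values `open_cost` and `ship_cost`, the single sink, and the coarse grid of fractions are all hypothetical and chosen only so that the problem can be solved by brute force. It keeps or drops a linking constraint of the same kind as the fifth constraint and compares the two optima.

```python
from itertools import product


def solve(open_cost, ship_cost, enforce_link=True):
    """Brute-force minimum of sum(open_cost[i]*y[i]) + sum(ship_cost[i]*w[i]).

    y[i] is in {0, 1}; w[i] is in {0, 0.5, 1}; the w[i] must sum to 1.
    enforce_link=True keeps the constraint w[i] <= y[i] (exact problem);
    enforce_link=False drops it (relaxed problem, larger feasible set).
    """
    n = len(open_cost)
    best = float("inf")
    for y in product((0, 1), repeat=n):
        for w in product((0.0, 0.5, 1.0), repeat=n):
            if sum(w) != 1.0:
                continue  # the single sink must receive the whole file
            if enforce_link and any(wi > yi for wi, yi in zip(w, y)):
                continue  # data shipped from a node that is not selected
            cost = (sum(c * yi for c, yi in zip(open_cost, y))
                    + sum(c * wi for c, wi in zip(ship_cost, w)))
            best = min(best, cost)
    return best


# Hypothetical cost values, for illustration only.
open_cost = [4.0, 3.0, 5.0]
ship_cost = [2.0, 6.0, 1.0]
exact = solve(open_cost, ship_cost, enforce_link=True)
relaxed = solve(open_cost, ship_cost, enforce_link=False)
assert relaxed <= exact  # the relaxation can only lower the optimum
```

In this toy instance the relaxed optimum is reached by shipping data from a node whose selection variable is 0, so it gives a bound on the cost rather than an allocation that could be deployed.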

ATSRA BSR algorithm
In the ATSRA BSR algorithm, we apply relaxation of the constraints at both the Architectural Template selection and resource allocation stages.
A summary of ATSRA BSR is as follows:
Summary of ATSRA BSR algorithm
1 initialize
2 calculate cost 2-Tier-a and cost 2-Tier-b
3 cost min = min(cost 2-Tier-a , cost 2-Tier-b )
4 calculate cost 4-Tier-relaxed
5 If cost 4-Tier-relaxed < cost min
6 cost min = cost 4-Tier-relaxed
7 choose Architectural Template associated with cost min
8 Allocate resources for Architectural Template associated with cost min
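The template selection steps of the summary (steps 2 to 7) can be sketched in Python as follows. This is a minimal illustration that assumes the three costs have already been computed elsewhere; the function name `select_template` and the template labels are hypothetical, and the allocation step (step 8) is not shown.

```python
def select_template(cost_2a, cost_2b, cost_4_relaxed):
    """Return (template, cost_min) following steps 2 to 7 of the summary.

    The three arguments are the already-computed costs of the 2-Tier-a,
    2-Tier-b and relaxed 4-Tier templates (assumed inputs).
    """
    # Step 3: start from the cheaper of the two 2-tier templates.
    cost_min, template = min((cost_2a, "2-Tier-a"), (cost_2b, "2-Tier-b"))
    # Steps 5 and 6: switch to 4-tier only if it is strictly cheaper.
    if cost_4_relaxed < cost_min:
        cost_min, template = cost_4_relaxed, "4-Tier"
    # Step 7: the chosen template is the one associated with cost_min.
    return template, cost_min
```

With hypothetical costs, `select_template(10.0, 8.0, 9.0)` keeps the 2-Tier-b template, whereas `select_template(10.0, 8.0, 7.5)` switches to the 4-Tier template.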
The important thing to note is that in ATSRA BSR , the constraint relaxation technique is used at the Architectural Template selection stage and the relaxed solution matrix is also used for the actual resource allocation. For relaxation, constraints 3 and 4 (the binary restrictions on x i and y i ) are replaced by relaxed versions. Note that the constraint relaxed in ATSRA SSR produces an invalid solution matrix. By dropping the fifth constraint (the one linking w ij to y i ), the variable w ij can be assigned a non-zero value even if the corresponding data-farm node is not selected (y i =0). Thus the resultant solution matrix cannot be used in resource allocation. But in ATSRA SSR , this relaxation is used only for the selection of the Architectural Template, and if 4-tier is chosen then the exact LP formulation is used for the actual resource allocation. For ATSRA BSR , we have chosen constraints for relaxation that do not produce an invalid solution matrix. Thus the same resultant solution matrix is used for resource allocation as well.
Note that as we move from ATSRA org to ATSRA SSR and then to ATSRA BSR , the following considerations apply.
1. The time complexity of the algorithm is reduced.
2. The imprecision in resource allocation increases.
The decrease in algorithm running time is the benefit of using this relaxation.

Experimental results
This paper uses the following performance metrics.
Makespan total (t ms-total ): The time in seconds required for completing all the tasks in the given bag-of-tasks, T.

To analyze the performance of the proposed RA algorithms, a simulation-based investigation is performed. The performance metrics described earlier were captured at the end of each simulation. Each experiment was repeated a sufficient number of times to produce a confidence interval of ±5% for each performance metric at a confidence level of 95%.

The workload chosen for these experiments is a bag-of-tasks consisting of 32 PBDT fixed-par tasks. Each of these tasks models the encoding of a raw multimedia file which is to be processed and delivered to a set of sink nodes. The choice of the raw data file is based on a typical animation movie described in [2]. The size of the raw data file of each of the tasks in the given bag-of-tasks is an important workload parameter. A detailed study of the characteristics of similar real-world tasks was carried out. The true representative probability distribution of the sizes of the raw or unprocessed data files used in similar tasks has been a subject of discussion in the research community over the years. Researchers seem to be split between characterizing it with a Pareto distribution and with a Log-normal distribution. After careful analysis, the Pareto distribution appears to be a better representative of PBDT multimedia workloads and is thus used.

Another important parameter for the workload is the value of p, which is the number of partitions into which a raw data file can be divided. This value is included in the metadata of the raw data file of each of the given tasks. The value of p depends on the structure of the raw data file and the type of processing required for it. For example, if a raw data file of multimedia animation contains 20 sub-sequences, each of which has to be processed as a single partition, then this task has a p of 20.
The number of partitions (referred to as sub-sequences in the description of the movie rendering project presented in [1]) for each raw file varies from 1 to 30. We have used a uniform distribution [1-30] for modeling the number of partitions in each raw multimedia file. The mean size of the raw data files is fixed at 650 MB.
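A workload of this kind can be generated with a few lines of Python. The sketch below is only an assumed illustration of the description above, not the simulator used in the experiments: the function name `make_workload`, the Pareto shape parameter `alpha` (which the text does not state) and the fixed seed are hypothetical. For a Pareto distribution with shape alpha > 1 and scale x_m, the mean is alpha·x_m/(alpha − 1), so the scale is chosen to give the stated mean of 650 MB.

```python
import random


def make_workload(num_tasks=32, mean_mb=650.0, alpha=2.5, max_p=30, seed=0):
    """Generate a bag-of-tasks with Pareto file sizes and uniform partition counts.

    alpha is an assumed shape parameter (must be > 1 for a finite mean).
    """
    rng = random.Random(seed)
    # Scale x_m such that the Pareto mean alpha * x_m / (alpha - 1) equals mean_mb.
    x_m = mean_mb * (alpha - 1.0) / alpha
    return [
        {
            # paretovariate(alpha) samples a Pareto variate with scale 1.
            "size_mb": x_m * rng.paretovariate(alpha),
            # Number of partitions p, uniform on the integers 1..max_p.
            "partitions": rng.randint(1, max_p),
        }
        for _ in range(num_tasks)
    ]
```

Each generated task carries its raw file size in MB and its partition count p; with the default arguments every file size is at least the scale value of 390 MB, and individual sizes can be much larger because of the heavy tail.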
For the performance analysis of the proposed algorithms, the total number of nodes in the Grid system is increased. The number of Grid nodes is directly related to the time complexity of the deployed algorithm. All other parameters related to workload and system conditions are kept constant. Fig. 5 shows the performance of the three RA algorithms (ATSRA org , ATSRA SSR and ATSRA BSR ) with SRP sp deployed at the TRPS. Fig. 5(a) shows the time taken by each of the algorithms to run. It can be seen that for a small number of nodes, there is not much difference in the scheduling time taken by these three algorithms. The time taken by the ATSRA org algorithm rises sharply as the number of nodes is increased beyond 32. It can be observed that for ATSRA BSR , t sch does not rise sharply.

Fig. 5(b) shows the value of Makespan non-scheduling (t ms-nonSch ) for each of the proposed algorithms as the number of nodes is increased. It is clear that ATSRA org has the lowest value of t ms-nonSch for all values of n. This is expected because, by using all constraints associated with the LP formulation, we are allocating the resources with the highest precision, and this allocation is expected to be efficient. As constraint relaxation is applied at stage one of the algorithm in ATSRA SSR , t ms-nonSch increases. It increases further for ATSRA BSR , in which constraint relaxation is applied at both stages of the algorithm. Fig. 5(d) shows the total cost (t cost ) for each of the three algorithms. ATSRA org has the lowest t cost as we are allocating the resources with the highest precision. This is followed by ATSRA SSR and ATSRA BSR . The overall makespan time, t ms-total , shown in Fig. 5(c) includes both the scheduling time and the execution time for the bag-of-tasks. It captures the tradeoff between t ms-nonSch and t sch presented in Fig. 5(b) and Fig. 5(a) respectively.
For a very small number of nodes, the scheduling overhead for ATSRA org is small and its t ms-nonSch is the lowest, and as a result the best t ms-total is achieved. For a large number of nodes, the scheduling overhead for ATSRA org is very high, the benefit of using a better resource allocation is offset by the overhead, and its performance deteriorates. ATSRA BSR , which exhibits the smallest scheduling overhead for a large number of nodes (see Fig. 5(a)), demonstrates the best t ms-total (see Fig. 5(c)). It is interesting to see that ATSRA SSR produces the best t ms-total for a range of intermediate values of the number of Grid nodes. The accuracy of resource allocation for ATSRA SSR lies between that achieved with ATSRA org and ATSRA BSR . For a small number of nodes, t sch of ATSRA SSR is comparable to that of ATSRA org , whereas the t ms-nonSch achieved by ATSRA SSR is inferior to that achieved by ATSRA org . Thus, if the number of nodes is small, the t ms-total of ATSRA SSR is inferior to that of ATSRA org .
For a large number of nodes, although ATSRA SSR gives rise to a lower execution time than ATSRA BSR , the advantage is offset by the much lower scheduling overhead of ATSRA BSR . The net effect is that the t ms-total achieved by ATSRA SSR is inferior to that of ATSRA BSR for a large number of nodes.

Fig. 6 shows the performance of the ATSRA algorithms when SRP sp +BF is deployed at TRPS. As in the case of Fig. 5(c), the best t ms-total is achieved by ATSRA BSR for larger numbers of nodes, whereas ATSRA org demonstrates the best performance for a lower number of nodes. ATSRA SSR demonstrates a slightly higher t ms-total than ATSRA org when the number of Grid nodes is small. Although the total makespan achieved by it is better than that of ATSRA org at a higher number of nodes, it is higher than that achieved by ATSRA BSR . The relative performances of the three algorithms captured in Fig. 6(a), Fig. 6(b) and Fig. 6(d) are the same as those displayed in Fig. 5(a), Fig. 5(b) and Fig. 5(d) respectively. ATSRA org demonstrates the best t ms-nonSch and t cost , followed by ATSRA SSR and ATSRA BSR , whereas the smallest scheduling overhead is achieved with ATSRA BSR and ATSRA org demonstrates the highest scheduling overhead. The rationale for such behavior has been provided in the discussion presented earlier for Fig. 5(a), Fig. 5(b) and Fig. 5(d). Note that although the shapes of the graphs in Fig. 5(a) and Fig. 6(a) are similar, the value of t sch for a given number of nodes in Fig. 6(a) is higher than the value of t sch for the same number of nodes in Fig. 5(a). This is because SRP sp +BF uses backfilling, which increases scheduling overheads. While the relative performance of ATSRA org , ATSRA SSR and ATSRA BSR remains almost the same, this additional scheduling overhead has shifted the graphs upwards in Fig. 6(a) as compared to Fig. 5(a).
For the ATSRA org and ATSRA SSR algorithms and any given number of nodes, the t ms-nonSch achieved with SRP sp +BF is observed to be smaller than that achieved with SRP sp (see Fig. 5(b) and Fig. 6(b)). This demonstrates the effectiveness of backfilling, which can increase the concurrency of task execution. Except for the case in which the number of Grid nodes is 128, a similar behavior is observed with ATSRA BSR .
Comparing the t ms-total achieved with SRP sp (Fig. 5(c)) and SRP sp +BF (Fig. 6(c)), we observe that for any given ATSRA algorithm, the total makespan achieved by SRP sp +BF is superior to that achieved by SRP sp when the number of nodes is small. For a higher number of nodes, SRP sp +BF demonstrates an inferior performance. This is because, at a smaller number of nodes, concurrent execution of tasks may be severely limited with SRP sp , as many tasks may not be able to get all their resources at the same time. With the use of backfilling this problem is alleviated, as RA is run for each waiting task with the set of unused resources as the resource pool. However, this problem with task concurrency is not as severe at a higher number of nodes. Thus SRP sp +BF, which re-runs RA multiple times and incurs a higher scheduling overhead, demonstrates an inferior performance, as the potential performance benefit due to backfilling is offset by the overhead.

Summary and conclusion
In this chapter, by using BiLeG an allocation-plan is devised which reflects the overall resource allocation strategy comprising two parts; a policy used at the higher decision making module, TRPS, which has the responsibility to select a resource-pool for each of the tasks; and a resource allocation algorithm used at the lower decision making module, RA, which actually assigns resources from the resource-pool selected by TRPS for a particular PBDT task. Three RA algorithms and six TRPS policies have been proposed in this chapter forming different allocation-plans. The suitability of various allocation-plans under different sets of system and workload parameters has been explored.
A detailed study of the various trade-offs implicit in the use of different allocation-plans is the focal point of this chapter. The most suitable allocation-plan depends not only on various workload and system parameters but also on the user requirements and the hardware available. From the performance perspective, various trade-offs exist among different allocation-plans, and understanding these trade-offs in depth is the focus of the experiments conducted in this chapter.
For the choice of an appropriate allocation-plan, two of the important considerations that came out of these experimental results are the size of the Grid and the performance metric chosen for optimization. Generally, from the results of the experiments conducted in this chapter, it can be concluded that if an allocation-plan tries to minimize one of the performance metrics, it tends to yield higher values of the other performance metrics. For example, <SRP sp ,ATSRA org > always gives the lowest value of t cost but it also yields one of the highest values for t ms-WOH , especially for a large number of nodes. At RA, the tradeoffs associated with reducing the accuracy of the ATSRA algorithm by relaxing some of the constraints in the LP formulation have been studied. The combination of the proposed RA algorithms and TRPS policies gives rise to various allocation-plans. These allocation-plans can be used under a wide variety of system and workload parameters to maximize the use of available resources according to a pre-determined optimization objective.
Although the research summarized in this chapter has focused primarily on the Grid systems, the proposed BiLeG architecture can also be used in a Cloud Computing environment. Cloud Computing environments are often classified as public and private Cloud environments [3]. The private Cloud environment is better suited for the BiLeG architecture; as a private Cloud environment uses a dedicated computing infrastructure that provides hosted services to a limited number of users behind a firewall and can, thus, more easily incorporate mechanisms to accurately predict the computing and communication costs.
The algorithms presented in this chapter are based on a dedicated resource environment. To adapt the BiLeG architecture to shared environments, more research is required. For example, in order to use it in a shared resource environment, mechanisms to accurately predict the unit communication and processing times need to be incorporated in the BiLeG architecture. Also, in a shared environment, investigating the impact of various types of communication models, such as many-to-one and one-to-many forms, is an important direction for future research.