Nowadays Grid technology [Foster & Kesselman, 1999] provides us with simultaneous and effective use of distributed computational and informational resources. Three main types of this technological phenomenon are known as resource discovery grids, computational grids, and data grids. In data grids, distributed heterogeneous database systems play important role to provide users access information easily without knowing the resource position. Due to heterogeneity property, communication among subsystems has to be handled well considering different network structure, operating system, and DBMS. In such systems we are also interested in efficient search and retrieval mechanism to speedup traditional relational database queries.
Distributed systems can be taught of as a partnership among independent cooperating centralized systems. Upon this idea number of large scale applications has been investigated during past decades among which distributed information retrieval (DIR) systems gained more popularity due to its high demand. The goal of DIR is to provide a single search interface that provides access to the available databases involving resource descriptions building for each database, choosing which databases to search for particular information, and merging retrieved results into a single result list [Si & Callan, 2003].
A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. This resource distribution improves performance, reliability, availability and modularity that are inherent in distributed systems. As with traditional centralized databases, distributed database systems (DDBS) must provide an efficient user interface that hides all of the underlying data distribution details of the DDB from the users. The use of a relational query allows the user to specify a description of the data that is required without having to know where the data is physically located [Li & Victor, 1981].
Data retrieval from different sites in a DDB is known as distributed query processing (DQP). For example, the following query accesses data from the local database as well as the remote sales database. The first table (EMP) found in site1 and the second table (DEPT) found in site2:
So a distributed query is one that selects data from databases located at multiple sites in a network and distributed processing performs computations on multiple CPUs to achieve a single result. Query processing is much more difficult in distributed environment than in centralized environment because a large number of parameters affect the performance of distributed queries, relations may be fragmented and/or replicated, and considering many sites to access, query response time may become very high [Li & Victor, 1981]It is quite evident that the performance of a DDBS is critically dependant upon the ability of the query optimization algorithm to derive efficient query processing strategies. DDBMS query optimization algorithms attempts to reduce the quantity of data transferred. Minimizing the quantity of data transferred is a desirable optimization criterion. The distributed query optimization has several problems relate to the cost model, larger set of queries, optimization cost, and optimization interval.
The goal of DQP is to execute such queries as efficiently as possible in order to minimize the response time that users must wait for answers or the time application programs are delayed. And to minimize the total communication costs associated with a query, to improved throughput via parallel processing, sharing of data and equipment, and modular expansion of data management capacity. In addition, when redundant data is maintained, one also achieves increased data reliability and improved response time.
In this work we propose a multi-agent architecture for distributed query processing. The structure of the paper is as follows. Section II describes an overview of query optimization process. An investigation on related works is declared in section III, our proposed approach in section IV, and simulation results in V. We finally concluded the work in the section VI.
2. Query Optimization Process
In a relational database all information can be found in a series of tables. A query therefore consists of operations on tables. The most common queries are Select-Project-Join queries. In this paper, we will focus on the join-ordering problem since permutations of the join order have the most important effect on performance of relational queries [Özsu & Valduriez, 1999]. The query optimization process shown in Figure 1, consists of getting a query on n relations and generating the best Query Execution Plan (QEP).
For a given query, the search space can be defined as the set of equivalent operator trees that can be produced using transformation rules. The example bellow illustrates 3 equivalent join trees, which are obtained by exploiting the associative property of binary operators. Join tree (c) which starts with a Cartesian product may have a much higher cost than other join trees.
Regarding different search spaces, there would be different shape of the join tree. In a linear tree, at least one operand of each operand node is a base relation. However, a bushy tree might have operators whose both operands are intermediate operators. In a distributed environment, bushy trees are useful in exhibiting parallelism [Özsu & Valduriez, 1999].
Considering new large scale database applications such as deductive database systems and bioinformatics, it is necessary to be able to deal with larger size queries. The search complexity constantly increases and makes higher demand for better algorithms than our traditional relational database queries.
3. Related Works
Three most common types of algorithms for join-ordering optimization are deterministic, Genetic and randomized algorithms [Zalenay, 2005].
Deterministic algorithm, also known as exhaustive search dynamic programming algorithm, produces optimal left-deep processing trees with the big disadvantage of having an exponential running time. This means that for queries with more than 10-15 joins, the running time and space complexity explodes [Zalenay, 2005]. Due to the very large time and space complexity of this algorithm for plan enumeration, iterative dynamic programming approach was proposed which produces reasonable plans with reasonable running time for most network topologies. However, its complexity is not much more than classical DP algorithm.
Genetic and randomized algorithms [Ioannidis & Kang, 2003; Steinbrunn et al., 1997] on the other hand do not generally produce an optimal access plan. But in exchange they are superior to dynamic programming in terms of running time. Experiments have shown that it is possible to reach very similar results with both genetic and randomized algorithms depending on the chosen parameters. Still, the genetic algorithm has in some cases proved to be slightly superior to randomized algorithms.
Layers of distributed query optimization have been depicted in Figure 4.
There are number of Query Execution Plan for DDB such as: row blocking, multi-cast optimization, multi-threaded execution, joins with horizontal partitioning, Semi Joins, and Top n queries [Kossman, 2000; Selinger et al., 1998; Stocker et al., 2001]. In this paper we propose a novel agent based QEP generator for heterogeneous distributed data base systems.
3.1 Distributed Cost Model
An optimizer cost model includes cost functions to predict the cost of operators, and formulas to evaluate the sizes of results. Cost functions can be expressed with respect to either the total time, or the response time [Ceri & Pelagatti, 1984; Özsu & Valduriez, 1999]. The total time is the sum of all times and the response time is the elapsed time from the initiation to the completion of the query. The total time (TT) is computed as bellow, where TCPU is the time of a CPU instruction, TI/O the time of a disk I/O, TMSG the fixed time of initiating and receiving a message, and TTR the time it takes to transmit a data unit from one site to another.
When the response time of the query is the objective function of the optimizer, parallel local processing and parallel communications must also be considered. This response time (RT) is calculated as bellow:
In parallel transferring, response time is minimized by increasing the degree of parallel execution. This does not imply that the total time is also minimized. On contrary, it can increase the total time, for example by having more parallel local processing (often includes synchronization overhead) and transmissions. Minimizing the total time implies that the utilization of the resources improves, thus increasing the system throughput. In practice, a compromise between the total and response times is desired.
3.2 Database Statistics
The main factor affecting the performance is the size of the intermediate relations that are produced during the execution. When a subsequent operation is located at a different site, the intermediate relation must be transmitted over the network. It is of prime interest to estimate the size of the intermediate results in order to minimize the size of data transfers. The estimation is based on statistical information about the base relations and formulas to predict the cardinalities of the results of the relational operations.
Cartesian product: The cardinality of the Cartesian product of R and S is as (1):
In other words, the Cartesian product R * S contains nr * ns tuples; each tuple occupies sr + ss bytes. If R S = , then R ⋈ S is the same as R * S. If R S is a key for R, then a tuple of s will join with at most one tuple from R, therefore, the number of tuples in R ⋈ S is no greater than the number of tuples in S. If R S in S is a foreign key in S referencing R, then the number of tuples in R ⋈ S is exactly the same as the number of tuples in s. The case for R S being a foreign key referencing S is symmetric.
As discussed earlier, ordering joins is an important aspect of centralized query optimization. This matter in a distributed context is even more important since joins between fragments may increase the communication time. Two main approaches exist to order joins in fragment queries:
1) Direct optimization of the ordering of joins (e.g. in the Distributed INGRES algorithm).
2) Replacement of joins by combination of semi-joins in order to minimize communication costs.
Let R and S are relations stored at different sites and query R ⋈ S be the join operator. The obvious choice is to send the smaller relation to the site of the larger one.
Since the join operations may reduce or increase the size of intermediate results, estimating the size of joint results is mandatory, but difficult. Consider the following query expressed in relate.
algorithm: PROJ ⋈PNO EMP ⋈ENO ASG whose join graph is below:
1. EMP ASG ; EMP’ = EMP ⋈ ASG PROJ ; EMP’ ⋈ PROJ
2. ASG EMP ; EMP’ = EMP ⋈ ASG PROJ ; EMP’ ⋈ PROJ
3. ASG PROJ ; ASG’ = ASG ⋈ PROJ EMP ; ASG’ ⋈ EMP
4. PROJ ASG ; PROJ’ = PROJ ⋈ ASG EMP ; PROJ’ ⋈ EMP
To select one of these programs, the following sizes must be known or predicted: size(EMP), size(ASG), size(PROJ), size(EMP ⋈ ASG), size(ASG ⋈ PROJ). Furthermore, if it is the response time that is being considered, the optimization must take into account the fact that transfers can be done in parallel with strategy 5. An alternative to enumerating all the solutions is to use heuristics that consider only the size of the operand relations by assuming, for example, that the cardinality of the resulting join is the product of cardinalities. In this case, relations are ordered by increasing sizes and the order of execution is given by this ordering and the join graph. For instance, the order (EMP, ASG, PROJ) could use strategy 1, while the order (PROJ, ASG, EMP) could use strategy 4.
3.3 Multi Agent Systems
Multi-agent systems (MASs) as an emerging sub-field of artificial intelligence concern with interaction of agents to solve a common problem [Wooldridge, 2002]. This paradigm has become more and more important in many aspects of computer science by introducing the issues of distributed intelligence and interaction. They represent a new way of analyzing, designing, and implementing complex software systems.
In multi-agent systems, communication is the basis for interactions and social organizations which enables the agents to cooperate and coordinate their actions. A number of communication languages have been developed for inter-agent communication, in which the most widely used ones are KIF (Knowledge Interchange Format) [Genesereth & Fikes, 1992], KQML (Knowledge Query and Manipulation Language) [Finin et al., 1994], and ACL (Agent Communication Language) [Labrou et al., 1999]. KQML uses KIF to express the content of a message based on the first-order logic. KIF is a language intended primarily to express the content part of KQML messages. ACL is another communication standard emerging in competition with KQML since 1995. Nowadays, XML (Extensible Markup Language) started to show its performance as a language to encode the messages exchanged between the agents, in particular in agent-based e-commerce to support the next generation of Internet commerce [Korzyk, 2000].
4. Proposed System Architecture
Although the problem of distributed query processing in heterogeneous systems has been investigated before [Huang et al., 1981], a good solution to practical query optimization has not been studies well. To meet so we proposed new multi-agent system architecture based on Java Agent DEvelopment (JADE) framework [Bellifemine et al., 2006]. JADE is a software development framework aimed at developing multi-agent systems and applications in which agents communicate using FIPA  - Agent Communication Language (ACL) messages and live in containers which may be distributed to several different machines. The Agent Management System (AMS) is the agent who exerts supervisory control over access to and use of the Agent Platform. Only one AMS will exist in a single platform.
Each agent must register with an AMS in order to get a valid AID. The Directory Facilitator (DF) is the agent who provides the default yellow page service in the platform. The Message Transport System, also called Agent Communication Channel (ACC), is the software component controlling all the exchange of messages within the platform, including messages to/from remote platforms.
JADE is capable of linking Web services and agents together to enable semantic web applications. A Web service can be published as a JADE agent service and an agent service can be symmetrically published as a Web service endpoint. Invoking a Web service is just like invoking a normal agent service. Web services’ clients can also search for and invoke agent services hosted within JADE containers.
The Web Services Integration Gateway (WSIG) [JADE board, 2005] uses a Gateway agent to control the gateway from within a JADE container. Interaction among agents on different platforms is achieved through the Agent Communication Channel. Whenever a JADE agent sends a message and the receiver lives on a different agent platform, a Message Transport Protocol (MTP) is used to implement lower level message delivery procedures [Cortese et al., 2002]. Currently there are two main MTPs to support this inter-platform agent communication - CORBA IIOP-based and HTTP-based MTP.
Considering large-scale applications over separated networks, agent communications has to be handled behind firewalls and Network Address Translators (NATs), however, the current JADE MTP do not allow agent communication through firewalls and NATs. Fortunately, the firewall/NAT issue can be solved by using the current JXTA implementation for agent communication [Liu et al., 2006].
JXTA is a set of open protocols for P2P networking. These protocols enable developers to build and deploy P2P applications through a unified medium [Gradecki, 2002]. Obviously, JXTA is a suitable architecture for implementing MTP-s for JADE and consequently JADE agent communication within different networks can be facilitated by incorporating JXTA technology into JADE [Liu et al., 2006].
In this work, a query is submitted by user to the Query Distributor Agent and then it will be distributed using the submitted terms in the available ontology. Our proposed architecture uses different types of agents, each having its own characteristics. The proposed system architecture is shown in Figure 5.
Query Distributor Agent (QDA): After receiving the user query, the QDA sends sub-queries to responsible local optimizer agents. The QDA can also create search agents if needed.
Local Optimizer Agents (LOAs): These agents apply a Genetic algorithm based sub-query optimization which will be discussed later, and return a result table size to the Global Optimizer Agent.
Global Optimizer Agent (GOA): This agent has the responsibility to find best join order via network. To do so, GOA receives result table size information from LOAs and again using an evolutionary method finds a semi-optimal join order, however, this time the genetic algorithm fitness function is based on minimizing communication rate among different sites. The final result would be then send to LOAs to perform operation and deliver result to the GOA to show them on the screen.
4.1 Genetic Algorithm based Optimization
The basic concept of the genetic algorithms were originally developed by Holland [Holland, 1975] and later revised by Goldberg [Goldberg, 1989]. Goldberg showed that genetic algorithms are independent of any assumption about the search space and are based on the mechanism of natural genetics. The first step to model this problem as a Genetic Algorithm problem, is determining the chromosome, Genetic A operators, and fitness function.
In order to encode different access plans, we considered ordered list scheme to represent each processing tree as an ordered list. For instance (((((R1 ⋈ R5) ⋈ R3) ⋈ R2) ⋈ R4) is encoded as “15324”. This converts the join order to the well-known traveling salesman problem (TSP). Possible query plans are encoded as integer strings. Each string represents the join order from one relation of the query to the next. This mechanism is also used within the PostgreSQL optimizer [PostgreSQL]. There may be other ways to encode processing trees especially bushy trees; however, we have used the above explained ordered list encoding method for implementation simplicity and faster run.
Genetic Algorithm operations
For the crossover, one point in the selected chromosome would be selected along with a corresponding point in another chromosome and then the tails would be exchanged. Mutation processes causes some bits to invert and produces some new information. The only problem of mutation is that it may cause some useful information to be corrupted. Therefore we used elitism which means the best individual will go forward to the next generation without undergoing any change to keep the best information.
Defining fitness function is one of the most important steps in designing a Genetic Algorithm based method, which can guide the search toward the best solution. In our simulation we used a simple random cost estimator as defined bellow in (3), where random(x) returns an integer between 0 and x.
After calculating the fitness function value for each parent chromosome the algorithm will generate N children. The lower a parent chromosome's fitness function value is the higher probability it has to contribute one or more offspring in the next generation. After performing operations, some chromosomes might not satisfy the fitness and as a result the algorithm discards this process and gets M (M N) children chromosomes. The algorithm then selects N chromosomes with the lower fitness value from the M + N chromosomes (M children and N parents) to be parent of the next generations. This process would repeat until a certain number of generations are processed, after which the best chromosome is chosen. Figure 6 shows our genetic algorithm based approach.
Experiments were done on a PC Pentium 4 CPU 2 GHz and 1GB RAM. The evolutionary process accomplished in less than a second and seen in Figure 7 a sample query for 20 joins is converged to a near optimal fitness almost after 100 generations. Table 1 shows the parameters value we set for our implementation.
|Number of generations||100|
There is no doubt that dynamic programming method always gives us optimal solution, however, since the time and space complexity of the genetic algorithm base optimization is much less, it is not a practical approach for high amount of nested joins. Figure 8 shows TGA / TDP which is rate of average run time for 10 queries, where TGA is genetic algorithm based optimization average run time and TDP is dynamic programming based optimization average run time for the same query. Results as expected shows this rate is always less than 1 which means the genetic algorithm approach has a less overhead in contrast with the DP method.
A computational complexity analysis in table 2 shows the superiority of the evolutionary method in comparison with the dynamic programming approach. The time and space complexity in the evolutionary method is liner with respect to G as the number of generations, and N as the number of chromosomes in the population, while it is exponential in the DP method which is not efficient in handling more than 10 joins.
An evolutionary query optimization mechanism in distributed heterogeneous systems has boon proposed using multi-agent architecture and genetic algorithm approach. Although deterministic dynamic programming algorithm produces optimal left-deep processing trees, it has the big disadvantage of having an exponential running time. Genetic and randomized algorithms on the other hand do not generally produce an optimal access plan. But in exchange they are superior to dynamic programming in terms of running time. Our practical framework uses hybrid JADE-JXTA framework which allows agent communication through firewalls and NATs in heterogeneous networks.