Selection of indexes and materialized vs. knapsack.
For the super-excellence applications used to control the water level in rivers, temperature handles a very large volume of information and does not stop constantly changing. These spatio-temporal data collected by a network of sensors form a set of thematic, integrated, non-volatile and historical data organized to help decision-making. Usually this process is performed with temporal, spatial and spatiotemporal queries. This in turn increases the execution time of the query load. In the literatures, several techniques have been identified such as materialized views (MV), indexes, fragmentation, scheduling, and buffer management. These techniques do not consider the update of the request load and the modification at the database level. In this chapter, we propose an optimal dynamic selection solution based on indexes and VMs. the solution is optimal when it meets the entire workload with a reasonable response time. The proposed approach supports modification at the database level and at the workload level to ensure the validity of the optimal solution for this the knapsack algorithm was used.
- wireless sensor network
- optimized structure
- NP-complete problem
- materialized view
- multiple selection problem
A sensor network used to record physical conditions of the environment such as temperature, rainfall, pollution, humidity, wind, etc. These data are sent to a database server which will be processed later.
All this data collected by the sensors will be recorded in a database which is in turn queried by client applications, such as the supervisor, the security agent, or a third-party application.
This database will be queried by complex queries which require resources.
To decrease the response time, it is necessary to use optimization techniques such as materialized views (MV), indexes, fragmentation, and the caching system. All these techniques are proven in the case of relational databases.
A view is a virtual table representing the result of a query on the basis. As the name suggests and unlike a standard view, in a materialized view the data is duplicated.
The index placed on a table will provide quick access to records, depending on the value of one or more fields. In addition, allows to simplify and accelerate the operations of search, sorting, joining or aggregation.
In this work, an approach proposed for the multiple selection of indexes and materialized views with the knapsack algorithm. The work presents an improvement of another approach based on the greedy algorithm. The rest of this work is organized like this: The first section deals with optimization techniques; the problem of multiple selection of indexes and materialized views will be presented in the second section. The contribution to the dynamic workload will be mentioned in Section 4. In Section 5, a discussion of our approach for the case where the database is dynamic will be described. Finally, a discussion on the experiment and the contribution will be discussed.
2. Optimization techniques
The use of optimization techniques is based on two approaches. The first is the sequential use of techniques such as indexes and fragmentations which have depleted physical structure, but the second is the simultaneous use of techniques which have similar physical structure such as materialized views and indexes.
In , authors proposed three approaches which are MVFirst, INDFirst and Joint enumeration. But the major drawback of this approach is the sequential and isolating use of these techniques, which does not make it possible to benefit from the advantage of the interactivity between these optimization structures.
The authors say that this last alternative is the best . Bellatrech et al.  improve part of the multiple selection problem with storage space management.
The authors use two spies to manage the space shared between two structures, index and VM. If the optimal configuration needs more indexes, then the spy associated with the index will take up space from the VM spy and vice versa.
In  use the drop algorithm for the selection simulation of indexes and MV.
These works only deal with the case where the load of requests is static and does not change over time. In , authors proposed an approach to dynamically select materialized views. The approach is based on the PRQ predictor to predict the next request and materialize its corresponding views using the conditional probability. This approach uses a cost model based on cloud costs. This work shows a tremendous improvement in terms of cost, execution time, and processing, but the authors only used one optimization framework which is view materialized. Another dynamic approach called Dynamat refreshes the configuration of materialized views if their size exceeds the space allocated for it . Several criteria are used, for example, delete rarely used views. A hybrid approach jointly exploits a static set of persistent views used by multiple request and maintenance sequences, and another dynamic set of aggregated and smaller sizes accessible and replaceable on the fly . However, these approaches focus only on the refresh performance of materialized views and not on query workload. In addition, do not use other optimization structures. Karkad et al.  applied the buffer management and scheduling technique to three optimization structures (index, materialized views, and fragmentation). This approach requires caching, planning, and is not dynamic.
In our approach, the simultaneity between materialized and indexed views used to benefit from the interaction between these structures. In addition, a mathematical modulization of the problem has been proposed based on the backpack algorithm which proves their performance compared to the greedy algorithm used by N. Maiz et al. .
3. Materialized view and index selection problem
Simultaneous selection of indexes and materialized views is an NP-Hard type problem which gives several optimal solutions .
The Knapsack problem, also noted KP, is an optimization problem. Presents a situation which cannot support more than a certain weight, with all or part of a given set of objects each having a weight and a value . Items put in the backpack must maximize the total value and not exceed the maximum weight The problem is formalized as follows:
On the other hand, the problem of selecting indexes and materialized views (PSIMV) consists in finding a set of indexes and materialized views constituting the final configuration to optimize the workload requests. This optimization can be in run time and storage space. The Workload requests, index and MV are presented as follows:
presented the Workload queries. This set composed by m queries. is the set of n indexes and presented the k materialized views.
is the size allowed by the administrator to store indexes and MV. Then it is necessary to find a configuration without violating the following constraints:
Minimize the cost of Workload, i.e.
The size of the configuration does not exceed
The problem of selecting indexes and materialized views is adopted by the genetic algorithm. The starting population is the set of candidate indexes and MVs. The objective function to optimize is the cost of the workload. The next section shows the analogy between the problem of selecting indexes and materialized views and the knapsack algorithm.
3.1 Selection problem with index and MV vs. knapsack algorithm
In this work, we present the correspondence between the problem of the knapsack and that of the multiple selection of indexes and materialized views (Table 1).
|Knapsack problem||Problem with selecting indexes and materialized views|
|Objects||The total set of Indexes and materialized views|
|Weight||The point shows the size of each object and the required execution time.|
|Profit||This is the profile to be won if these objects are used. Shows the gain in execution time and storage|
|Size||The number of bits needed to store the objects that form the optimal solution|
|Object set||The final configuration of indexes and materialized views|
3.2 Cost model
Typically, the number of indexes and candidate VMs is greater since the input load is significant. The creation of all these indexes and MVs is not possible due to the constraint on the allocated storage space. To solve the problem, we use a cost model which allows us to keep only the most advantageous indexes and MVs. This model estimates the space in bytes occupied by indexes and VMs, the data access costs and the maintenance cost in terms of number of inputs and outputs.
Indexes and MVs are the objects in this optimization system. Cost of an object is the sum of storage size, access data cost both these indexes and MVs and maintenance cost.
The benefit provided by an object is the difference the cost of the Workload before adding the object and after the addition of this object . the following equation calculates the profit:
To add this object to the configuration list, we followed this equation
If , There is a benefit, so add the object to the configuration, else not add to the configuration.
This query system can be static or dynamic. When the Workload and the data stored in the database are invariable in this case, it is a static system which will be discussed in Section 4. On other hand, if the Workload and the database are modifiable, then it is a dynamic system which will be discussed in Section 5.
4. Statistic workload
A request load is the set of requests that have arrived and are waiting for their turn to be executed. This section will discuss the case where the database does not change, and the requests have arrived successively in random order. Bellatrach et al.  proposed a static approach that does not support the changing of Workload. Authors apply the greedy algorithm which does not necessarily provide an optimal solution. In , authors show that dynamic programming and more optimal than the Greedy algorithm.
The Knapsack algorithm is an example of dynamic algorithms used for optimization problems. In the proposed approach, Artificial Intelligence uses this algorithm only in the learning phase and afterwards a model will be created to predict the final solution and avoid the execution of this algorithm on each new request.
An optimal configuration is the set of materialized views and indexes which extend to the workload in a reasonable time with the minimum of resources.
The following algorithm takes as input the list of indexes and Materialized views to create an optimal configuration on condition that this configuration extends to the entire load of requests and does not exceed the authorized storage size.
S is the size of the disk space allocated to store MVs and indexes, it is fixed by administrator.
Objective function Profit () calculates for each index or MV. It is the difference in cost between the workload run time with or without this object (Index or MV). If this object improves the system, it will be added to the entire configuration. At the end of this algorithm, the final configuration is made up of a set of indexes and MVs which represent the optimal solution. This technique considers the similarity between the two optimization structures index and MV.
These iterations will be repeated until there is no improvement in the Profit () function, or until all indexes and VMs have been selected, or until the limit storage space is exceeded.
Changes at the database level or in the workload require a new configuration to revert to the new Workload. Then you must rebuild new indexes and VMs. This operation is very time consuming.
In Dynamat  the authors have removed the least used VMs to free space for new creations. In this approach the authors limit themselves to use only the MV optimization structure.
To solve this inconsistency problem, the authors find three strategies. The first one is that all views are updated regularly at each time interval . The second one is that all views are updated at the end of each transaction  and the last strategy is that the changes are propagated in a delayed manner. I.e. a VM is updated only when it is used by a request.
Our approach combines the two structures (Index and Materialized views) to benefit from the structural affinity between these two optimization techniques.
In a real-time survival system, query processing is important. To ensure optimal validation of the solution after the change in workload and database, two artificial intelligence techniques are used.
The arrival of requests is random and varied depending on the context and in this case. On the other hand, the database can be modified during the execution of the queries. In this part we will study the two cases.
We used artificial intelligence to create materialized views for the dynamic processing of the workload and to make requests as visible as possible. With automatic learning, we proposed an algorithm that allows to search for the logical link between the query load and the optimal configuration, then and after the learning phase will predict the final solution (Minimum configuration).
We started with a remodeling phase. Each request is presented by a factor which presents the list of attributes used. on the other side a matrix which presents all the possible solutions which are prepared in advance.
The Workload is formed by queries, i.e., . A query is composed by attributes, where , and each query has the following form: . The activation function used in this work is presented as follows:
The workload can be presented in form of matrix as follow:
A final solution is a set of structures such as Index and MV that guarantees the response to the entire query load with minimal cost. Based on, attributes, we can find views and the final solution has a view between 1 and .
In order to verify if this materialized view is included in the solution or not, the function having the following form should be used
The maximum number of final solutions is , where is the number of attributes in database tables.
The final solutions are presented as follows
The references of the final solutions are stored in a vector with the following form
Figure 1 shows the three layers of our modeling and the steps to create candidate solutions. First step is the extraction of the attributes used in all the tables of the database, then create a vector containing all the possible materialized views, i.e. the possible combinations with the attributes . A materialized view contains at least one attribute and at most all attributes. The number of VMs is
Then the candidate solutions, which presents all the possible combinations of the VMs. The maximum number of solutions is .
To apply the automatic learning, To apply machine learning, you have to start with the learning phase, this phase the algorithm will build a logical link between the attributes and the final solutions. The duration of this phase is set by the administrator (Figure 2).
The algorithm is composed of two phases: The first phase is used for training. However, the second is used to predict materialized views.
Figure 3 shows the architecture of our approach. The system administrator sets the period for learning the model. If this phase is in progress, each time a new request arrives the system will use the knapsack algorithm to find the right configuration and at the same time prepare the neural network model. At the end of the learning phase the system will use this module provided in the first phase to predict a new optimal configuration for each arrival query.
The final FS solution is the optimal configuration that extends to the workload with a reasonable execution time. With this approach, a logic established between the requests and the final solutions to avoid recalculating each time.
In this experiment uses a workload containing 5 queries numbered from 1 to 5 and a database of 4 attribute differences that make 15 materialized views and 32,768 final solutions (Table 2).
Between 09:21 am and 9:47 am the requests arrive randomly. At the start the Workload contains only the query Q5 and for this workload the final solution is 4523 on the other hand the predicted final solution is 25,531 which is our predicted solution is different from the real solution.
To test the approach, an implementation of the algorithm was carried out with Python 3 on a laptop computer equipped with a Windows 10 operating system, 64 bits and 8 GB of RAM. The experimental results are discussed in the following figure.
First, each query is executed with the greedy algorithm to see the final solution as shown by the blue dots in Figure 4. In the second step our algorithm will be compared with the first to see if possible, to predict the final solution (orange curve) without wasting the time to recalculate the configuration each time a request arrives.
This figure clearly shows that after a learning phase, the algorithm manages to predict the final solution and consequently a great gain in the execution time and the resources used.
5. Dynamic database
This section discusses the case where the database is dynamic, during the execution of the queries, an update on the data is in progress. Updating all optimization structures is very expensive, so it is a good idea to update only the affected optimization structures.
For this, two binary tables are proposed and stored in the database (Table 3). The Matrix IT[i, t] stores the link between the indexes and the tables of the database. If index number 3 is used by table number 5, then IT[3, 5] = 1 otherwise equal to 0. Likewise, for the Matrix VT[v, t] which presents the materialized views linked to the tables. For example, if the materialized view number 5 (MV5) is linked with Table 4 then VT[5,4] = 1 otherwise equal to 0.
To understand, here is the following example: either Table T1 used by the indexes I1, I2, I4 and MVs V2, V4. Table T3 used by indexes I2, I4 and MVs V1, V4, so each time the database is updated, it is wise to modify only the structure concerned (index or MV). Each time the database tables are updated, a trigger searches for the index or Materialize view affected by this change. More details below (Figure 5).
The trigger is an integrated solution in all DBMS. It is a program that launches a series of tasks with each change in the database. It identifies the objects to be modified in the configuration. At each update operation (insertion, update, or deletion) the trigger does the same operation on the object concerned (Index or MV). For example, if a new row is inserted in the Table Ti, the trigger inserts the same row in the index and the VM linked by the table Ti. After each iteration, if the size of the configuration exceeds S or if the solution has become non-optimal, Algorithm 1 must be restarted.
This architecture guarantees that all the indexes and MVs form the optimal configuration even after updating the Workload.
Algorithm 2 is still running, the
In this work, a similarity between the problem of selecting indexes and materialized views with the Knapsack algorithm was proposed. The contributions are: The first level, the use of the backpack algorithm to present this problem as well as a mathematical modeling, then the use of machine learning to reduce the execution time of the workload. For this, two tables were used to ensure that the optimal configuration remains reliable even after updating the database. To validate this approach, an algorithm developed in python.
this work is the result of two works published as part of a Tunisian-South African research project (Grant Numbers: 113340, 120106) funded by the Ministry of Higher Education and Scientific Research and the National Research Foundation of South Africa presented by Prof Qing-Guo WANG.
ORDONEZ-ANTE, Leandro, VAN SEGHBROECK, Gregory, WAUTERS, Tim, et al. a workload-driven approach for view selection in large dimensional datasets. Journal of Network and Systems Management, 2020, p. 1–26.
AOUICHE, K. Automatic selection of indexes in data warehouses. Research Report, Laboratory ERIC Lumière Lyon2University, 2005.
BAHACHE, Anwar Nour Eddine. A Metaheuristic Based Approach for Solving the Index Selection Problem in Data Warehouses. Doctoral thesis. In: FACULTY Mathematics and Computer Science DEPARTMENT of Computer Science. 2018
LETRACHE, Khadija, EL BEGGAR, Omar, et RAMDANI, Mohammed. OLAP cube partitioning based on association rules method. Applied Intelligence. 2019; 49(2):420-434
TOUMI, Lyazid et UGUR, Ahmet. Static and incremental dynamic approaches for multi-objective bitmap join indexes selection in data warehouses. The Journal of Supercomputing. 2020:1-26
AL ETAWI, Namer Ali et ABUROMMAN. Fatima Thaher. 0/1 KNAPSACK PROBLEM: GREEDY VS. DYNAMIC-PROGRAMMING. International Journal of Advanced Engineering and Management Research Vol. 2020; 02:5
GARAIN, Nilkantha, CHATTOPADHYAY, Samiran, MAHAPATRA, Gautam, et al. Design and Implementation of an Improved Data Warehouse on Clinical Data. In: International Conference on Computational Intelligence, Communications, and Business Analytics. Springer, Singapore, 2018. p. 278–290.
KERKAD, Amira, BELLATRECHE, Ladjel, RICHARD, Pascal, et al. A query beehive algorithm for data warehouse buffer management and query scheduling. International Journal of Data Warehousing and Mining (IJDWM), 2014, vol. 10, no 3, p. 34-58.
Harinarayan V, Rajaraman A, Ullman a J. Implementing data cubes efficiently. Proceedings of the ACM SIGMOD International Conference on Management of Data. June 1996:205
COMER D. The difficulty of optimum index selection. ACM Transactions on Database Systems (TODS). 1978:440-445
J. A. Blakeley, P. L. (June 1986). Efficiently updating materialized views. Proceedings of the ACM SIGMOD International Conference on Management of Data.