Data Warehouse Technology and Application in Data Centre Design for E-government

E-government allows governments to serve citizens in a more timely, effective, and cost-efficient manner. E-government services focus on four main customers: citizens, the business community, government employees, and government agencies. Accordingly, E-government can provide four types of services: Government-to-Citizen (G2C), Government-to-Business (G2B), Government-to-Employee (G2E), and Government-to-Government (G2G). E-government systems have become popular all over the world because E-government helps to disseminate information. Further, it aids in the collection of information that helps decision makers serve citizens more effectively, and it allows government agencies to centralize decision making. China started its first E-government program in the late 1980s, when governments at both the central and local levels built office automation (OA) systems and established intranets. Starting in the 1990s, the Central Government of China formally launched five Golden Projects (the Golden Bridge Project, Golden Customs Project, Golden Card Project, Golden Tax Project, and Government Online Project) aimed at building E-government in China. After realizing the five Golden Projects, the Chinese government set ambitious visions for the implementation of E-government: quicken the pace of change in government functions to suit the requirements of the reform, opening-up, and modernization policies; improve the performance of government operations; introduce new government measures in a scientific manner and more effective mechanisms to monitor economic activities; place a greater emphasis on central coordination and transparency of government work; carry out administrative functions in accordance with law; and provide better service for the public. To meet these ambitious visions, the Central Government of China has planned to establish four governance information resource databases in the next five years: a population basic information database, a judicial entity basic information database, a natural resource and geography basic information database, and a macroeconomic database. A data warehouse has been defined as a collection of data in support of management decisions which is subject oriented, integrated, nonvolatile, and time variant. The data warehouse, as a collection of database and data management technologies, emerged in the early 1990s. It is now more generally seen as a strategy to bring heterogeneous

data together under a common conceptual and technical umbrella and to make the data available for new operational or decision support applications. Three intrinsic features of a data warehouse are data integration, data completeness, and decision-making support. Management of government flows from top to bottom, but the collection of government data flows from bottom to top. To support the construction of the four databases, building a data centre has become an important project for E-government. Nanhai city of Guangdong province is leading the way in E-government in China. This chapter puts forward a data centre design based on data warehouse technology for Nanhai city. The solution not only considers providing data for the four higher-level databases, but also solves the integration, sharing, and exchange of data among the various departments of Nanhai city; in particular, it devises an application database based on data warehouse technology to better utilize accumulated data. The remainder of this chapter is organized as follows. Section 2 provides an introduction to the basic technology of data warehouses so that the reader can clearly understand the data centre design. In the following sections, the data centre design for Nanhai city is depicted. In Section 3, we summarize the requirement analysis of the data centre. In Section 4, we give the architecture of the data centre based on data warehouse technology. Section 5 gives some pivotal techniques for realizing the data centre. Finally, Section 6 gives the conclusion.

Data warehouse technology

Definition of Data Warehouse
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management decisions. The goal of using a data warehouse is to allow businesses and organizations to make strategic decisions. According to this definition, a data warehouse generally has the following four characteristics.
(a) Subject-oriented. The main objective of a data warehouse is to facilitate the decision processes of an organization, and within any organization data naturally concentrates around subject areas, so information gathered in the data warehouse is aimed at a specific subject rather than at the functions of the organization.
(b) Integrated. The data collected within the data warehouse can come from different tables, databases, or even servers, but is combined into one unit that is relevant and logical for the convenience of making strategic decisions.
(c) Non-volatile. Being a snapshot of operational data at a given time, the data in the data warehouse should be stable. Data is usually added to the warehouse, but it should rarely be deleted.
(d) Time-variant. All the data within the data warehouse can be associated with a given period of time.
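To make the non-volatile and time-variant properties concrete, the following minimal sketch (Python with SQLite, using hypothetical table and column names) shows an append-only snapshot table in which every row is stamped with the period it describes:

```python
import sqlite3
from datetime import date

# Hypothetical population snapshot table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE population_snapshot (
        snapshot_date TEXT NOT NULL,   -- time-variant: every row is stamped
        district      TEXT NOT NULL,
        population    INTEGER NOT NULL,
        PRIMARY KEY (snapshot_date, district)
    )
""")

# Non-volatile: each load appends a new dated snapshot; earlier rows are
# never updated or deleted, so history remains queryable.
conn.execute("INSERT INTO population_snapshot VALUES (?, ?, ?)",
             (date(2008, 1, 1).isoformat(), "Guicheng", 210000))
conn.execute("INSERT INTO population_snapshot VALUES (?, ?, ?)",
             (date(2009, 1, 1).isoformat(), "Guicheng", 216500))

# Time-variant: data can be retrieved for a given period of time.
for row in conn.execute(
        "SELECT * FROM population_snapshot WHERE snapshot_date <= '2008-12-31'"):
    print(row)
```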

Architecture of Data Warehouse
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Many data warehouse architectures exist in the literature. According to the study of Thilini Ariyachandra and Hugh J. Watson, data warehouses follow five architectures: (1) independent data marts, (2) data mart bus architecture with linked dimensional data marts, (3) hub-and-spoke, (4) centralized data warehouse (no dependent data marts), and (5) federated. Other architectures tend to be variations on these five.
(1) Independent data marts. In this architecture, organizational units commonly develop their own data marts, which are independent of the other marts. These marts typically have inconsistent data definitions and use different dimensions and measures, so it is difficult to analyze data across all the marts. Figure 1 shows the architecture for independent data marts.

Fig. 1. The independent data marts architecture (source systems → staging area → independent data marts with atomic/summarized data → end-user access and applications)

(2) Data mart bus architecture with linked dimensional data marts. A business requirements analysis for a specific business process is the foundation for this architecture. The first mart is built for a single business process using dimensions and measures that will be reused with other marts. Additional marts are developed using these dimensions and measures, which results in logically integrated marts and an enterprise view of the data. Atomic and summarized data are maintained in the marts and are organized in a star schema to provide a dimensional view of the data. This architecture is illustrated in Figure 2.

Fig. 2. The data mart bus architecture with linked dimensional data marts (source systems → staging area → dimensionalized data marts linked by conformed dimensions, atomic/summarized data → end-user access and applications)

(3) Hub-and-spoke. An extensive enterprise-level analysis of data requirements provides the basis for this architecture. Attention is also focused on building a scalable and maintainable infrastructure. Using the enterprise view of the data, the architecture is developed in an iterative manner, subject area by subject area. Atomic-level data is maintained in the warehouse in third normal form. Dependent data marts are created that source data from the warehouse. The dependent data marts may be developed for departmental, functional-area, or special purposes and may have normalized, de-normalized, or summarized/atomic dimensional data structures based on user needs. Most users query the dependent data marts. Figure 3 shows this architecture.

(4) Centralized data warehouse (no dependent data marts). This architecture is similar to the hub-and-spoke architecture except that there are no dependent data marts. The warehouse contains atomic-level data, some summarized data, and logical dimensional views of the data. Queries and applications access data from both the relational data and the dimensional views. This architecture is typically a logical rather than a physical implementation of the hub-and-spoke architecture. This architecture is shown in Figure 4.

Fig. 4. The centralized data warehouse architecture (source systems → staging area → normalized relational warehouse with atomic/some summarized data → end-user access and applications)

(5) Federated. This architecture is advocated as a practical solution for firms that have a pre-existing, complex decision support environment and do not want to rebuild. Based on business requirements, data is accessed from existing data warehouses, data marts, and legacy systems. The data is either logically or physically integrated using shared keys, global metadata, distributed queries, or other methods. This architecture is shown in Figure 5.
Fig. 5. The federated architecture (existing data warehouses, data marts, and legacy systems → logical/physical integration of common data elements → end-user access and applications)

Building of data warehouse
At first glance, building a data warehouse is simply a matter of extracting the operational data and entering it into the warehouse (Inmon, 2007), which sounds simple. In practice, however, creating a data warehouse is more than that: extracting the data, analyzing it, and presenting the results is a complex problem. The major steps of building a data warehouse are described below.
(a) Data collection. The very first step in building a data warehouse is to identify the data sources: you need to figure out what data are required in your data warehouse. The data in the warehouse usually come from a number of source systems. Most of the data have originally been stored in transactional databases, while external data may be stored in spreadsheets or personal databases. In some cases, source data may be collected automatically. If new data are required, a suitable system may need to be built to collect them; otherwise, only minor changes should be needed to existing systems. In all cases, the owners of the source data are responsible for maintaining its quality, and this may require substantial effort.
(b) Transformation & cleansing. This can be the most time-consuming part, in which you grab the data from the various data sources and store it in a staging database. In this process, data are usually restructured to optimize subsequent use for querying, reporting, and analysis. This is often done in stages, in a data staging area. These data feeds need to be run on a regular basis to keep the data warehouse up to date. To minimize disruption to other systems and to warehouse users, this often has to be completed within a tight overnight time window. This stage is hard and time-consuming, and is usually done with the help of ETL tools.
(c) Aggregation & analysis. Selected data are taken from the central warehouse using query tools and processed to produce useful results. Often, the most frequently accessed data are first summarized and stored in data marts to improve response times. Additional performance measures are typically derived at the same time. Analytic applications may also be developed to help users get useful information.
(d) Presentation. Presentation means displaying results for end users, usually in the form of reports. Several different report types are normally needed to suit different types of user. The results might appear as text, tables, or charts and could be viewed on-line, printed, published on a web server, or distributed by email. A compressed sketch of these four steps is given below.
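The following is an illustrative sketch only, not a production design: the "source system" is an inline CSV extract, and the staging area, warehouse, and mart all live in one in-memory SQLite database; all field names are hypothetical.

```python
import csv
import io
import sqlite3

# The "source system": an inline CSV extract with one dirty record.
source_extract = io.StringIO(
    "dept,amount\n"
    "education ,1200.50\n"
    "Police Management,830.00\n"
    "education ,not_a_number\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (dept TEXT, amount REAL)")
conn.execute("CREATE TABLE mart_dept_totals (dept TEXT PRIMARY KEY, total REAL)")

# (a) Collection and (b) transformation & cleansing: read the source feed,
# skip malformed rows, normalize department names, and load into staging.
for rec in csv.DictReader(source_extract):
    try:
        amount = float(rec["amount"])
    except ValueError:
        continue                        # cleanse: drop bad records
    conn.execute("INSERT INTO staging VALUES (?, ?)",
                 (rec["dept"].strip().title(), amount))

# (c) Aggregation & analysis: summarize frequently accessed data into a mart.
conn.execute("INSERT INTO mart_dept_totals "
             "SELECT dept, SUM(amount) FROM staging GROUP BY dept")

# (d) Presentation: a simple textual report for end users.
for dept, total in conn.execute(
        "SELECT dept, total FROM mart_dept_totals ORDER BY total DESC"):
    print(f"{dept:20s} {total:10.2f}")
```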

Tools of data warehouse
Developing a data warehouse is a very complex and time-consuming task. In the process of building a data warehouse, many tools can help developers. According to purpose, data warehouse tools can be divided into ETL tools, OLAP tools, report tools, data mining tools, and database management systems (DBMS). These tools are introduced below; much of the material is taken from the products' official websites.
(a) ETL tools. IBM WebSphere DataStage delivers three key capabilities necessary for success in enterprise data integration: comprehensive connectivity to easily and quickly access any source or target system; advanced development and maintenance tools, which speed implementation and simplify administration; and a scalable platform that can easily handle today's massive volumes of corporate data. WebSphere DataStage is an industry-leading data integration and transformation product that provides advanced development and maintenance capabilities for virtually unsurpassed levels of productivity.
Teradata Parallel Transporter was designed for increased functionality and customer ease of use for faster, easier, and deeper integration. Its capabilities include: simplified data transfer between one Teradata Database and another; the ability to load dozens of files using a single script, which makes development and maintenance of the data warehouse easier; distribution of workloads across CPUs on the load server, which eliminates bottlenecks in the data load process; and an open database connectivity (ODBC) operator that reads from the ODBC driver, which can pull data from any database (for example, DB2 or Oracle), giving access to myriad data sources via open standards.
SAS ETL Studio is a visual design tool for building, implementing, and managing ETL and data integration processes from source to destination, regardless of data sources or platforms. In-depth data transformations are provided to efficiently meet enterprise data integration requirements and support business and analytic intelligence. It also improves quality management by providing impact analysis of potential changes made throughout the data management life cycle.
SQL Server Integration Services (SSIS) is a component of Microsoft SQL Server, available only in the "Standard" and "Enterprise" editions. SSIS provides a platform to build data integration and workflow applications. The primary use for SSIS is data warehousing, as the product features a fast and flexible tool for data extraction, transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server databases, update multidimensional cube data, and perform other functions.
Informatica PowerCenter is a single, unified enterprise data integration platform for accessing, discovering, and integrating data from virtually any business system, in any format, and delivering that data throughout the enterprise at any speed.
(b) OLAP tools. Oracle Discoverer is an ad hoc query tool that operates in a data warehouse environment. Discoverer's desktop reporting features provide users the ability to generate financial reports and analyze financial information easily and quickly. Discoverer has three main components: the User Edition (which provides the graphical interface used day to day to generate reports and analyze data), the Administration Edition (the tool used to design and present the hierarchy of financial data), and the End User Layer (which shields users from the complexity of the relational database from which Discoverer retrieves data).
SQL Server Analysis Services (SSAS) is part of the Microsoft SQL Server platform. It adds OLAP and data mining capabilities to SQL Server databases. The OLAP engine supports MOLAP, ROLAP, and HOLAP storage modes for data. Analysis Services supports the XML for Analysis standard as the underlying communication protocol. The cube data can be accessed using MDX queries, and data-mining-specific functionality is exposed via the DMX query language. Analysis Services includes various algorithms for use in data mining: decision trees, clustering, Naive Bayes, time series analysis, sequence clustering, linear and logistic regression, and neural networks.
DB2 OLAP Server is a strategic member of IBM's Business Intelligence family. It integrates the powerful OLAP engine and application programming interfaces of Hyperion Essbase with IBM's DB2 family of relational databases. It provides a fast path to turn warehouse data into business insight, and it is built for e-business, with tools to help deploy Web-based analytical applications quickly.
SAS OLAP Server is a multidimensional data store designed from the outset to provide quick access to pre-summarized data generated from vast amounts of detailed data. Decision makers need quick access to summarized data so they can make timely decisions based on knowledge instead of gut feelings.
BusinessObjects OLAP Intelligence is an On-Line Analytical Processing (OLAP) application for analyzing business data. Previously known as Crystal Analysis Professional, it is part of BusinessObjects Crystal Decisions.
(c) Report tools. Cognos ReportNet is an all-in-one reporting product that lets users create, modify, and distribute any report: invoices, statements, and weekly sales and inventory reports, to list a few. Users can standardize all enterprise reports with one product and architecture, not simply with one vendor.
Crystal Reports Server is a software product of Business Objects. It provides a complete report management solution that enables IT professionals to securely share, schedule, and deliver interactive reports over the Web, in e-mail, and in Microsoft Office documents. It empowers business users to view, print, and share more compelling reports.
Oracle Reports is Oracle's high-fidelity enterprise reporting tool. It enables businesses to give immediate access to information to all levels within and outside of the organization.
(d) Data mining tools. Oracle Data Miner is the graphical user interface for Oracle Data Mining that helps data analysts mine their Oracle data to find valuable hidden information, patterns, and new insights. Data analysts can mine data with Oracle Data Miner's easy-to-use wizards that guide them through the data preparation, data mining, model evaluation, and model scoring process.
Oracle Data Miner can automatically generate the code needed to transform the data mining steps into an integrated data mining/BI application.
SAS Enterprise Miner streamlines the data mining process to create highly accurate predictive and descriptive models based on analysis of vast amounts of data from across the enterprise. Enterprise Miner supports the entire data mining process with a broad set of tools, for example, association, clustering, decision-tree, neural network, and classical regression technology.
Teradata Warehouse Miner provides an array of data profiling and mining functions, ranging from data exploration and transformation to analytic model development and deployment, that are performed directly in the Teradata Database. Teradata Warehouse Miner allows you to analyze detailed data without data movement, streamlining the data mining process.
(e) Database management systems. Database management systems (DBMS) can be divided into two categories: desktop databases and server databases. Generally speaking, data warehouse applications adopt server databases, such as Microsoft SQL Server, Oracle, and IBM DB2, which offer organizations the ability to manage large amounts of data efficiently and in a manner that enables many users to access and update the data simultaneously.
Microsoft SQL Server is a relational database server produced by Microsoft. Its primary query languages are T-SQL and ANSI SQL.
Oracle Database (commonly referred to as Oracle) is a relational database management system (RDBMS) produced and marketed by Oracle Corporation. Its primary query languages are PL/SQL and ANSI SQL.
DB2 is a relational database management system (RDBMS) designed to make the storage and analysis of data easier. All access to and manipulation of data in DB2 is accomplished via SQL.

Sybase IQ is a highly optimized analytics server designed specifically to deliver faster results for mission-critical business intelligence, data warehouse, and reporting solutions on standard hardware and operating systems. It works with diverse data, including unstructured data, and diverse data sources to deliver unsurpassed query performance at the best price/performance available.
Teradata Database is a relational database server produced by Teradata. It is relational and ANSI SQL compatible, and offers full query parallelism and balanced performance.

Requirement analysis of data centre
Nanhai city of Guangdong province is a state-level pilot city for informatization, and consequently Nanhai is listed among the pilot cities for the Model Project of China's E-government Application. Nanhai city has successfully planned and developed a series of E-government projects, covering village management, finance decision-making, education, irrigation management, soil management, police management, and more, which have vigorously promoted informatization in fields such as government, rural areas, education, and culture. With many systems in use in every department, data exchange among departments, data sharing among departments, and integrated data applications are becoming a major demand for greatly improving management efficiency and service standards. The relations between data providers and users for population data and for judicial entity data are depicted in Figure 6. To enable data sharing and exchange among departments, establishing a data centre is the prime method. The main requirements of the data centre are summarized as follows: realizing data sharing and exchange among the various departments; supplying data collection, processing, and loading from the data sources to the data centre; achieving centralized storage of government information resources in support of the higher-level construction of the four databases; offering integrated application services for government, enterprises, and citizens; and offering integrated management of population information.

Architecture of data centre based on data warehouse technology
The architecture of the data centre based on a data warehouse is presented in Figure 8. The data centre is composed of six main components: the data share and exchange platform, the kernel database, the support application platform, the application database, the data centre management platform, and the data centre security platform. The very essence of the data warehouse is flexible and unpredictable access to data; thus, the ability to access data quickly and easily is required. If data is not efficiently indexed and users cannot access it rapidly, the data warehouse will not succeed. In addition, the data in the data warehouse needs to be monitorable at will: the cost and complexity of monitoring must be low enough that a monitoring program can be run whenever necessary. The data warehouse also needs to be able both to receive data from and pass data to the various departments of Nanhai city.

Data Share and Exchange Platform
The data share and exchange platform is made up of ETL (extraction, transformation, and load) components and a data share agent. ETL is the data provider for the kernel database: it collects data from the data sources and loads it into the kernel database. The data share agent dispenses shared data stored in the kernel database to the various departments that need it.
Fig. 8. Architecture of Data Centre

Kernel Database
The kernel database of the data centre is composed of four parts: the population basic information database, the judicial entity basic information database, the natural resource and geography basic information database, and the macroeconomic database. It is organized according to user requirements and is maintained by the administrators of the data centre.

Support Application Platform
The support application platform is a secondary development tool possessed by the data centre, whose main function is to establish special databases by extracting data directly from the kernel database according to application requirements. A minimal sketch of this kind of extraction is given below.
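As an illustration only, the sketch below derives a hypothetical special database (a school-age population summary) from a kernel population table; all table and column names are invented for the example.

```python
import sqlite3

kernel = sqlite3.connect(":memory:")       # stands in for the kernel database
kernel.execute("CREATE TABLE population (citizen_id TEXT, age INTEGER, district TEXT)")
kernel.executemany("INSERT INTO population VALUES (?, ?, ?)",
                   [("P001", 9, "Guicheng"), ("P002", 41, "Guicheng"),
                    ("P003", 12, "Dali")])

app = sqlite3.connect(":memory:")          # the derived special database
app.execute("CREATE TABLE school_age_population (district TEXT, persons INTEGER)")

# Extract only the subset this application needs from the kernel database.
rows = kernel.execute("SELECT district, COUNT(*) FROM population "
                      "WHERE age BETWEEN 6 AND 15 GROUP BY district").fetchall()
app.executemany("INSERT INTO school_age_population VALUES (?, ?)", rows)

for row in app.execute("SELECT * FROM school_age_population"):
    print(row)   # e.g. ('Dali', 1), ('Guicheng', 1)
```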

Application Database
The application database may be treated as a data mart and is composed of three parts: the public service database, the decision support database, and the special application database. Data in the special application database is provided by the kernel database and the source databases according to application requirements. The application database is organized according to the requirements of decision-making and is maintained both by the administrators of the data centre and by the departments.

Data Centre Management Platform
Its main function is to manage and control the data centre, including share management, exchange management, run management, log management, authorization management, backup management, and recovery management.

Data Centre Security Platform
To deal with alarming and unpredictable security threats, the data centre must take security into account. The data centre security platform is the foundation on which the other platforms rest; it monitors and protects the data centre.

ETL Technique
The process of extracting data from data sources and bringing it into the kernel database is commonly called ETL, which stands for extraction, transformation, and loading. During extraction, the desired data has to be identified and extracted from the data sources. Very often, it is not possible to identify the specific subset of interest, so more data than necessary has to be extracted, and the identification of the relevant data is done at a later point in time. The size of the extracted data varies from hundreds of kilobytes to hundreds of gigabytes, depending on the source system and the organization's situation. Just as the size of the extraction may vary widely, so may its frequency: the interval may range from hours to minutes, down to near real-time. After extracting (and transporting) the data, the most challenging and time-consuming parts of ETL follow: transformation and loading into the target system. This may include applying complex filters; validating the incoming data against information already existing in the target database tables; comparing new data to existing data in the warehouse to determine whether the new data needs to be inserted or updated; and computing aggregations and other derived data based on the new data. Generally, there are three ETL approaches: transform-then-load, load-then-transform, and transform-while-load. The sketch below illustrates the insert-or-update decision during loading.
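The following minimal sketch shows the transform-then-load variant with the insert-or-update comparison described above; the table, fields, and sample records are hypothetical stand-ins for real source extracts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE kernel_population (
        citizen_id TEXT PRIMARY KEY,
        name       TEXT NOT NULL,
        district   TEXT NOT NULL
    )
""")

# Pretend output of the extraction step; one record is invalid on purpose.
extracted = [
    {"citizen_id": "P001", "name": "Li Wei",    "district": "guicheng"},
    {"citizen_id": "P002", "name": "Chen Fang", "district": " Dali "},
    {"citizen_id": "",     "name": "bad row",   "district": "?"},
]

for rec in extracted:
    # Transformation: filter out invalid records and normalize fields.
    if not rec["citizen_id"]:
        continue
    district = rec["district"].strip().title()
    # Loading: compare with existing warehouse data, then insert or update.
    exists = conn.execute(
        "SELECT 1 FROM kernel_population WHERE citizen_id = ?",
        (rec["citizen_id"],)).fetchone()
    if exists:
        conn.execute("UPDATE kernel_population SET name = ?, district = ? "
                     "WHERE citizen_id = ?",
                     (rec["name"], district, rec["citizen_id"]))
    else:
        conn.execute("INSERT INTO kernel_population VALUES (?, ?, ?)",
                     (rec["citizen_id"], rec["name"], district))

print(conn.execute("SELECT * FROM kernel_population").fetchall())
```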

Data Storage Technique
The data storage technique is important and complicated for realizing the goals of the data centre. To support a future decision support system, a star schema is adopted to build the application databases. Every application database is composed of several fact tables and a set of dimension tables. A fact table contains a list of all measures and points to the key value of the lowest level of each dimension; each of these measurements is taken at the intersection of all dimensions. Dimensions are qualifiers that give meaning to measures. They organize the data based on the what, when, and where components of an organizational question. Dimensions are stored in dimension tables made up of dimensional elements and attributes. Each dimension is composed of related items or elements and forms a hierarchy, in which each element represents a different level of summarization. Choosing the appropriate fact measures for the grain of the fact table depends on the organization and the analysis purposes. For example, the star schema for the contract data is constructed as shown in Figure 9; a sketch in the same spirit follows.
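Since the actual schema of Figure 9 is not reproduced in the text, the following sketch shows a star schema in the same spirit, with assumed contract measures and date/department/contractor dimensions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: hierarchies of related elements (what, when, where).
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        day      TEXT, month TEXT, year INTEGER   -- summarization levels
    );
    CREATE TABLE dim_department (
        dept_key  INTEGER PRIMARY KEY,
        dept_name TEXT, division TEXT
    );
    CREATE TABLE dim_contractor (
        contractor_key  INTEGER PRIMARY KEY,
        contractor_name TEXT, sector TEXT
    );

    -- Fact table: measures at the intersection of all dimensions, pointing
    -- to the lowest-level key of each dimension.
    CREATE TABLE fact_contract (
        date_key        INTEGER REFERENCES dim_date(date_key),
        dept_key        INTEGER REFERENCES dim_department(dept_key),
        contractor_key  INTEGER REFERENCES dim_contractor(contractor_key),
        contract_amount REAL,                     -- measure
        contract_count  INTEGER                   -- measure
    );
""")

# A typical dimensional query: roll contract amounts up to year and division.
cur = conn.execute("""
    SELECT d.year, p.division, SUM(f.contract_amount)
    FROM fact_contract f
    JOIN dim_date d       ON f.date_key = d.date_key
    JOIN dim_department p ON f.dept_key = p.dept_key
    GROUP BY d.year, p.division
""")
```

The design choice is the usual one for star schemas: the fact table stays narrow (foreign keys plus measures), while each dimension table carries the descriptive attributes and summarization levels used to roll measures up.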

Data Share and Exchange Technique
Data exchange is the problem of finding an instance of a target schema, given an instance of a source schema and a specification of the relationship between the source and the target. Such a target instance should correctly represent information from the source instance under the constraints imposed by the target schema, and should allow one to evaluate queries on the target instance in a way that is semantically consistent with the source data. XML stands for eXtensible Markup Language. XML was released in the late 1990s and has been widely adopted. The XML standard was created by the World Wide Web Consortium to provide an easy-to-use and standardized way to store self-describing data. The main benefit of XML is that you can take data from one platform, convert it into XML, and then share that XML with other platforms; each receiving platform can then convert the XML into the structure it normally uses, so two potentially very different platforms can communicate. XML is therefore adopted to realize data sharing and exchange. Data stored in a data source is converted into an XML file and sent to the data share and exchange platform; on receiving the XML file, the platform converts the XML back into the structured data of the kernel database. Data stored in the kernel database is sent to application departments in the same way. In this manner, data sharing and exchange are realized easily, as the sketch below illustrates.
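A minimal end-to-end sketch of this exchange, with hypothetical element and table names, is given below: a source database serializes rows to XML, and the receiving side parses the XML back into the kernel database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Source side: a departmental database with rows to share.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE population (citizen_id TEXT, name TEXT, district TEXT)")
src.execute("INSERT INTO population VALUES ('P001', 'Li Wei', 'Guicheng')")

# Convert relational rows into a self-describing XML document.
root = ET.Element("population_data", source="civil_affairs_bureau")
for cid, name, district in src.execute("SELECT * FROM population"):
    ET.SubElement(root, "citizen",
                  citizen_id=cid, name=name, district=district)
payload = ET.tostring(root, encoding="unicode")   # sent to the exchange platform

# Data centre side: parse the XML and load it into the kernel database.
kernel = sqlite3.connect(":memory:")
kernel.execute("CREATE TABLE kernel_population "
               "(citizen_id TEXT, name TEXT, district TEXT)")
for elem in ET.fromstring(payload).iter("citizen"):
    kernel.execute("INSERT INTO kernel_population VALUES (?, ?, ?)",
                   (elem.get("citizen_id"), elem.get("name"),
                    elem.get("district")))
```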

Conclusion
In this chapter, the definition, architecture, building process, and tools of data warehouses have been introduced. The Nanhai city E-government project then served as an example to illustrate data centre design based on a data warehouse. The relations between providers and users of data in the various departments of government were given in detail. On the basis of this data analysis, the architecture of a data centre based on data warehouse technology was presented. In the proposed method, XML technology is used for data sharing and exchange. The proposed method has already been successfully used in the development of the data centre system for Nanhai city.