Table 1. Satellite data centers, data rates, and volumes.
The incredible increase in the volume of data accompanying recent technological developments has made analysis processes that use traditional approaches more difficult for many organizations. In particular, applications that involve big data requiring timely processing, such as satellite imagery, sensor data, bank operations, web servers, and social networks, require efficient mechanisms for collecting, storing, processing, and analyzing these data. At this point, big data analytics, which encompasses data mining, machine learning, statistics, and similar techniques, helps organizations manage their data end to end. In this chapter, we introduce a novel high-performance computing system on a geo-distributed private cloud for remote sensing applications, which takes advantage of the network topology, exploits the utilization and workloads of CPU, storage, and memory resources in a distributed fashion, and optimizes resource allocation to realize big data analytics efficiently.
- big data analytics
- high-performance computing
- real-time analytics
- remote sensing
- distributed computing
- geo-distributed cloud
- resource allocation
The extreme increase in the amount of data produced daily by many organizations reveals big challenges in data storage and in extracting information from the data in a timely manner [1, 2, 3]. Many sensors built with today's technology are used in observation centers and on the Earth to create continuous streams of data. Real-time and near-real-time geospatial data must be analyzed quickly in order to provide decision support in time-critical applications. The development of efficient computing techniques for obtaining information from remote sensing (RS) big data is critical for Earth science [6, 7]. In particular, recent developments in remote sensing technologies have brought a tremendous increase in remote sensor data. The amount of RS data collected by a single satellite data center has increased dramatically and has reached several terabytes per day. This is because new camera technologies give sensors high resolution and a large number of bands. Thus, RS data, having reached such high dimensions, is treated as "big data."
Large image files, which consist of both voluminous pixels and multiple spectral bands (multispectral/hyperspectral), are difficult to read and store in memory. Besides this, the data mining algorithms that extract information from satellite and remote sensing data involve high computational complexity. With these features, remote sensor applications are both data intensive [10, 11] and compute intensive. Moreover, the computational complexity of many data mining algorithms is superlinear in the number of samples and the size of each sample. Hence, optimizing the algorithms alone is not enough to obtain better performance when those two variables continue to increase.
When talking about big data analysis, the main difficulties in computer architecture are CPU intensiveness and slow input/output (I/O) operations. According to Moore's Law, CPU and drive performance doubles every 18 months. On the other hand, when the trends in I/O interfaces are examined, their improvement is close to the improvement in network speed but still behind it. Although I/O interfaces operate at the same speed as the processor bus, I/O lags behind because peripheral hardware cards operate at low speeds. PCI cards are being used to increase I/O performance; thus, I/O performance increases by about 33% annually. In spite of these improvements in traditional approaches, when collected data increases exponentially, the use of relatively slow data processing techniques makes real-time analysis particularly difficult. For this reason, the most important factor determining the performance of analytical processes is the inherently limited nature of hardware resources. Therefore, modern technologies and high-performance computing (HPC) techniques with parallel processing capabilities, such as multi-core programming, GPU programming, field programmable gate arrays (FPGA), cluster computing, and cloud computing, are needed to perform analysis on large volumes of data, from which extracting information is complex and time-consuming [14, 15, 16].
HPC systems have recently attracted more attention in remote sensing applications because of the great deal of data involved [6, 7]. HPC integrates several computing environments and programming techniques to solve large-scale problems in the remote sensing domain. Many remote sensing applications, such as environmental studies, military applications, and tracking and monitoring of hazards, require real-time or near-real-time processing capabilities so that urgent intervention can take place in time when necessary. HPC systems such as multi- or many-integrated core, GPU/GPGPU, FPGA, cluster, and cloud have become inevitable to meet this requirement. Technologies such as multi-core, GPU, and FPGA meet parallel computing needs particularly for onboard image processing. In such technologies, the developed algorithms run on a single node, albeit with multiple cores, so they can only scale vertically through hardware upgrades (for example, more CPUs, better graphics cards, or more memory). When distributed computing is considered, the most common choice is commercial off-the-shelf (COTS) computer equipment organized as a cluster [6, 7]. In this approach, a cluster is created from a number of computers that work together as a team. These parallel systems, installed with a large number of CPUs, give good results for both real-time and near-real-time applications that use remote sensing and data streams, but they are expensive, and their scalability cannot exceed a certain capacity. Although most parallel systems are inherently homogeneous, the recent trend in HPC systems is the use of heterogeneous computing resources, where the heterogeneity is generally the result of technological advancement over time. With increasing heterogeneity, cloud computing has emerged as a technology aimed at facilitating heterogeneous and distributed computing platforms.
Cloud computing is an important choice for the efficient distribution and management of big datasets that cannot be stored in a single commodity computer's memory.
Not only does the data volume increase, but the difficulty of indexing, searching, and transferring the data also grows exponentially with this data explosion [19, 20]. Effective data storage, management, distribution, visualization, and especially multi-modal processing in real- and near-real-time applications remain open issues for RS.
2. Big data
Big data is characterized by many studies in terms of the 3 Vs: volume, velocity, and variety [22, 23]. Volume is the most important big data attribute and expresses the size of the dataset. Velocity indicates the rate at which big data is produced; an increase in the production rate also creates the need for faster processing of the data. Variety refers to the diversity of the sources of data.
Given the variety of the data, the majority of the data obtained is unstructured or semi-structured. Velocity requirements vary according to the application area. In general, velocity is addressed in terms of processing within a specific time interval, such as batch processing, near-real-time requirements, continuous input-output (real-time) requirements, and stream processing requirements. Batch processing is applied to improve data and analysis processes, live analytics requires continuous and real-time analysis, and critical applications require immediate intervention based on the analysis of incoming data streams.
2.1. Remote sensing data
Remote sensing (RS) is defined as the ability to measure the properties of a surface or object from a distance. RS data are obtained from various data acquisition technologies (lidar, hyperspectral cameras, etc.) mounted on airplanes and unmanned aerial vehicles. With the recent developments in RS technologies, the amount, production rate, and diversity of remote sensor data have increased exponentially (Table 1). Thus, the data received from remote sensors are being treated as RS "big data."
Diversity and multidimensionality are the greatest factors in the complexity of RS big data. RS data are used across the geosciences, covering environmental monitoring, underground, atmospheric, hydrological, and oceanographic content. Due to such a different and wide range of application areas, the diversity of RS data has increased greatly. There are approximately 7000 different types of RS datasets in NASA archives, as far as is known. Numerous satellites and sensors have emerged with ever higher spatial, temporal, and even spectral resolution. As remote sensing data continue to grow in volume and complexity, a new architecture has become a necessity for existing algorithms and data management.
2.2. Remote sensing algorithms
The processing of RS data has an important role in Earth observation systems. The longest RS workflow starts with data acquisition and ends with a thematic application (Figure 1). Within each step of the workflow, there can be processes that run sequentially or concurrently. An RS application typically consists of the following stages:
Data Acquisition/Collection: Images obtained from satellite or hyperspectral cameras are captured on the downlink channel. With high spatial and temporal resolution, data volume is increasing.
Data Transmission: The obtained data stream is sent simultaneously to the satellite center. High bandwidth for the data transfer is critical for applications requiring real-time processing. In practice, transmitting the data in real time is impossible because the data growth due to satellite and camera technologies outpaces the data transmission rate.
Preprocessing: Preprocessing related to image quality and geographic accuracy, such as decomposition, radiometric correction, and geometric correction, is performed.
Value-added Operations: Value-added operations such as fine-tuning, orthorectification, fusion, and mosaic are performed. It is possible to say that the mosaic process has a longer processing time than the others.
Information Extraction/Thematic Application: In this step, classification, attribute extraction, and quantitative deductions are obtained from images that have undergone preprocessing and value-added operations, for example, extracting information such as leaf area index, surface temperature, and snow-covered area from RS images.
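The stages above can be sketched end to end as a tiny sequential pipeline; every stage function and the 2x2 "image" below are hypothetical placeholders, not a real RS library:

```python
# A minimal sketch of the RS processing workflow described above.
# Each stage function is a hypothetical placeholder, not a real RS library API.

def acquire():
    # Stage 1: data acquisition - here a tiny 2x2 single-band "image"
    return [[10, 20], [30, 40]]

def transmit(image):
    # Stage 2: transmission to the satellite center (identity here)
    return image

def preprocess(image):
    # Stage 3: e.g. radiometric correction - subtract a constant offset
    return [[px - 5 for px in row] for row in image]

def value_added(image):
    # Stage 4: e.g. mosaicking/fusion - here just pass through
    return image

def extract_information(image):
    # Stage 5: thematic extraction - e.g. mean pixel value as a crude index
    pixels = [px for row in image for px in row]
    return sum(pixels) / len(pixels)

def run_workflow():
    image = acquire()
    for stage in (transmit, preprocess, value_added):
        image = stage(image)
    return extract_information(image)

print(run_workflow())  # mean of [5, 15, 25, 35] = 20.0
```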
2.3. Data access patterns in remote sensing algorithms
The data access patterns in remote sensing algorithms vary according to the characteristics of the algorithms. There are four different access patterns for RS data. Depending on the size of the RS data, the unavoidable I/O load and irregular data access patterns make traditional cluster-based parallel I/O systems inapplicable.
“Sequential row access pattern” is used by algorithms that require pixel-based processing, such as radiometric correction and the support vector machine (SVM) classifier. When such algorithms are implemented in parallel, each processor needs multiple logically consecutive image rows.
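In parallel implementations of this pattern, rows are typically divided into contiguous chunks, one per processor. A minimal sketch of such a row partitioner (the function name and sizes are illustrative) might look like:

```python
# Sketch: assigning contiguous image rows to processors, as in the
# sequential row access pattern. Row and processor counts are examples.

def partition_rows(n_rows, n_procs):
    """Return (start, stop) row ranges so each processor gets a
    contiguous block of rows, with any remainder spread evenly."""
    base, extra = divmod(n_rows, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        count = base + (1 if p < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

print(partition_rows(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```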
“Rectangular block access pattern” is used by algorithms such as convolution filters and resampling, which require neighbor-based processing. Non-contiguous I/O patterns arise in these algorithms and are not efficiently supported by ordinary parallel file systems.
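A common way to serve this pattern is to read each block together with a halo of neighboring pixels so the kernel never crosses a partition boundary. The following sketch (with made-up block and image sizes) computes the region a node would request:

```python
# Sketch: a rectangular block access pattern for neighbor-based processing.
# Each block is extended by a halo of ghost rows/columns so a convolution
# kernel can read its neighbors; the sizes here are illustrative.

def block_with_halo(block_row, block_col, block_size, halo, height, width):
    """Return the (row0, row1, col0, col1) region a node must read for
    its block, clipped to the image bounds."""
    r0 = max(block_row * block_size - halo, 0)
    r1 = min((block_row + 1) * block_size + halo, height)
    c0 = max(block_col * block_size - halo, 0)
    c1 = min((block_col + 1) * block_size + halo, width)
    return (r0, r1, c0, c1)

# Interior block of a 1024x1024 image, 256-pixel blocks, 3x3 kernel (halo=1)
print(block_with_halo(1, 1, 256, 1, 1024, 1024))  # (255, 513, 255, 513)
```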
“Cross-file access pattern” is used by algorithms such as fusion and the normalized difference vegetation index (NDVI) that require inter-band calculations. The data consist of small, non-contiguous fragments in hundreds of image files. Thus, in this type of access, a large number of read/write operations take place, which is time-consuming.
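The inter-band nature of such computations can be illustrated with NDVI itself, which combines the red and near-infrared bands pixel by pixel; the tiny band arrays below are invented for illustration:

```python
# Sketch of a cross-file (inter-band) computation: NDVI combines the red
# and near-infrared bands pixel by pixel. The band values are made up.

def ndvi(red_band, nir_band):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel."""
    return [
        [(nir - red) / (nir + red) for red, nir in zip(red_row, nir_row)]
        for red_row, nir_row in zip(red_band, nir_band)
    ]

red = [[0.25, 0.5]]
nir = [[0.75, 0.5]]
print(ndvi(red, nir))  # [[0.5, 0.0]]
```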
“Irregular access pattern” is used by algorithms such as the fast Fourier transform (FFT), image distortion, and information extraction, which require scattered access or the entire image. These algorithms use diagonal and polygonal access patterns as irregular access. In these patterns, parts of different sizes can reside on different nodes, even though they are small, non-contiguous pieces of data. In addition to the I/O difficulties, the problem of identifying the irregular data areas also arises.
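The cost of scattered access, and the common mitigation of coalescing nearby offsets into larger reads, can be sketched as follows; the offsets and gap threshold are illustrative:

```python
# Sketch: an irregular (scattered) access pattern - a diagonal read touches
# one pixel per row, so naive access issues many tiny reads. Coalescing
# adjacent offsets into ranges is one common mitigation.

def diagonal_offsets(n, width):
    """Flat offsets of the main diagonal of an n x width image."""
    return [r * width + r for r in range(n)]

def coalesce(offsets, max_gap=0):
    """Merge sorted offsets into (start, stop) ranges when the gap between
    consecutive offsets is at most max_gap."""
    ranges = []
    for off in offsets:
        if ranges and off - ranges[-1][1] <= max_gap:
            ranges[-1][1] = off + 1
        else:
            ranges.append([off, off + 1])
    return [tuple(r) for r in ranges]

offs = diagonal_offsets(4, 8)      # [0, 9, 18, 27]
print(coalesce(offs))              # four separate single-pixel reads
print(coalesce(offs, max_gap=8))   # merged into one larger read
```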
3. Real-time big data architecture
Big data architects often need a distributed system structure for data analysis, which requires data storage. S. Tehranian et al. proposed an architectural model that provides performance, reliability, and scalability, consisting of candidate hardware and software for the distributed real-time processing of satellite data at ground stations. The resulting prototype system was implemented using the open-source adaptive communication environment (ACE) framework and C++ and tested on a cluster; the real-time system achieved significant performance without sacrificing reliability and high availability. Structures and mechanisms for parallel programming are needed so that RS data can be analyzed in a distributed manner. In this context, Y. Ma et al. proposed a generic parallel programming framework for RS applications on high-performance clusters. The proposed mechanism has programming templates that provide both distributed and generic parallel programming frameworks for RS algorithms. magHD, proposed for the storage, analysis, and visualization of multidimensional data, combines Hadoop and MapReduce technologies with various indexing techniques for use on clusters. Some systems contain all of the different steps of ingestion, filtering, load balancing, processing, combining, and interpreting. In the related work, a real-time approach to continuous feature extraction and detection, aimed at finding rivers, land, and rails in satellite images, was proposed using the Hadoop ecosystem and MapReduce.
At a minimum, the components that a big data architecture should have in order to perform real-time analysis are as follows: a user interface, a distributed file system, a distributed database, and high-performance computing (Figure 2).
A distributed file system is a virtual file system that allows distributed data to be stored on multiple computer clusters. In clusters that may consist of heterogeneous nodes, it provides applications with a common interface for accessing the data, isolated from the operating and file systems.
A distributed database consists of separate databases on each node/server of a multi-computer cluster connected by a computer network. The distributed database management system allows the distribution and management of data across the servers.
High-performance computing systems are technologies that provide an infrastructure with a computing environment fast enough for the parallel analysis of big data. It is crucial for the system to respond to the user within a reasonable time.
The user interface is a component that basically allows the user to interact with the server-based application through a visual interface: query, load, delete, update, request analysis, and define workflows. The user interface is not investigated as a major concern in this chapter.
3.1.1. Distributed file system
SSD and PCM devices, which are being used instead of HDDs for data storage, are still far from the I/O performance required for big data. Enterprise storage architectures, such as DAS, NAS, and SAN, have drawbacks and limitations when used as distributed systems.
A distributed file system (DFS) is a virtual file system that spans multiple nodes on the cloud. It abstracts away the heterogeneity of data nodes in different centers. Thus, the distributed file system provides a common interface through which applications access data on heterogeneous nodes that individually use different operating systems and file systems. There are some capabilities that a distributed file system should generally provide. Location transparency means that the application can access data as if it were held locally without actually holding it. Access transparency is a common interface for access to data, independent of the operating system and the file system. Fault tolerance is the ability to keep replicas of data on more than one node so that, in the event of an error, the data is preserved on the nodes holding the replicas. Scalability means that the number of nodes the file system runs on can be increased to the required amount when needed, without the system going down.
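The fault-tolerance capability can be illustrated with a toy replica placement policy; the node names and replication factor below are illustrative, not those of any particular DFS:

```python
# Sketch: a simple replica placement policy giving the fault tolerance
# described above - each data block is stored on several distinct nodes.
# The node list and replication factor are made up for illustration.

def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes for a block, rotating the start
    node by block id to spread the load across the cluster."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the replication factor")
    start = block_id % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(replication)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
print(place_replicas(0, nodes))  # ['node-a', 'node-b', 'node-c']
print(place_replicas(3, nodes))  # ['node-d', 'node-a', 'node-b']
```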
The Hadoop distributed file system (HDFS) is an open-source distributed file system released under an Apache license (Figure 3). Beyond the common capabilities, HDFS is designed especially for big datasets and high availability. It is also platform independent, as it is implemented in Java. Applications access files via the HDFS API, so file access is isolated from the local file systems. Compared to other distributed file systems (iRODS, Lustre), its design is stated to differ in performance, and HDFS is the only DFS with automatic load balancing. At the same time, its platform independence and MapReduce support make it easy to use on many systems, which is why it is the preferred choice.
The Google file system (GFS) is a proprietary distributed file system developed by Google for its own use. It was developed to meet the need for a scalable distributed file system in big data-intensive applications. It is designed to enable reliable, efficient, and fault-tolerant use of data across many thousands of drives and machines with thousands of simultaneous users.
Storage systems used in cloud environments, such as Amazon Simple Storage Service (S3), Nirvanix Cloud Storage, OpenStack Swift, and Windows Azure Blob storage, do not fully meet the scalability and replication needs of cloud applications or the concurrency and performance requirements of analysis applications.
The general parallel file system (GPFS) is a high-performance clustered file system developed by IBM. GPFS can be built on shared drives or on shared-nothing distributed parallel nodes. Since it fully supports the POSIX file system interface, it removes the need to learn the new API sets introduced by other storage systems. On the other hand, HDFS and GFS are not completely POSIX compliant and require new API definitions to provide analysis solutions in the cloud. In the study conducted by Schmuck et al., it is stated that GPFS matches the file-reading performance of HDFS by means of a meta-block concept. A meta block is a set of consecutive data blocks located on the same disk. In the proposed approach, a trade-off is made between different block sizes.
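The meta-block idea can be illustrated by contrasting plain round-robin striping with grouping consecutive blocks on one disk; the block and disk counts below are made up:

```python
# Sketch of the meta-block idea: instead of striping every block round-robin
# across disks, groups of consecutive blocks (a meta block) stay on one disk,
# which favors large sequential reads. The sizes are illustrative.

def assign_disks(n_blocks, n_disks, meta_block=1):
    """Map each block index to a disk; meta_block consecutive blocks
    share a disk before moving on to the next disk (round-robin)."""
    return [(b // meta_block) % n_disks for b in range(n_blocks)]

print(assign_disks(8, 2, meta_block=1))  # plain striping: [0, 1, 0, 1, 0, 1, 0, 1]
print(assign_disks(8, 2, meta_block=4))  # meta blocks:    [0, 0, 0, 0, 1, 1, 1, 1]
```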
A comparison of the parallel virtual file system (PVFS) with HDFS shows that PVFS does not provide a significant improvement in terms of completion time and throughput.
The Rasdaman database supports large multidimensional arrays that conventional databases cannot handle and can naturally store large remote sensing data. The architecture of Rasdaman is based on an array partitioning process called "tiling." Rasdaman's parallel server architecture provides a scalable and distributed environment for efficiently processing large numbers of concurrent user requests, making it possible to serve distributed datasets over the web. To retrieve and process a dataset from Rasdaman, data retrieval queries must be run in the query language defined by the Open Geospatial Consortium's (OGC) web coverage processing service (WCPS) standard. The PetaScope component, developed as a Java servlet and used at this point, provides queries for multidimensional data retrieval, retrieval filtering, and processing by implementing the OGC standard interfaces. It also adds support for geographic and temporal coordinate systems.
Depending on the size of the RS data in remote sensing applications, the unavoidable I/O load and irregular data access patterns make traditional cluster-based parallel I/O systems inapplicable. In the study conducted by L. Wang et al., an RS-file-based parallel file system for remote sensing applications was proposed and implemented using the OrangeFS file system. By providing an application-specific data placement policy, efficiency is achieved for different data access patterns. The performance of the proposed system improves by an average of 20%.
3.1.2. Distributed database
The classical approaches used in managing structured data employ a schema for data storage and a relational database for retrieving data. Existing database management tools have proved inadequate for processing large volumes of data that grow rapidly and become complex. Data warehouse and data mart approaches have gained popularity in systems with more than one kind of structured data. The data warehouse is used to store and analyze data and report the results to the user. The data mart approach improves data access and analysis on top of the data warehouse. The enterprise data warehouse (EDW), which is favored by large organizations, allows data processing and analysis capabilities to be used on a very large, unified enterprise database. Some cloud providers offer solutions that scale to a petabyte of data and more with an EDW. For example, Amazon Redshift uses a massively parallel processing (MPP) architecture consisting of a large number of processors for high-performance querying, together with columnar storage and data compression. In addition, the amount of I/O required by queries is reduced by using locally attached storage and zone maps.
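The zone-map idea mentioned above can be sketched as follows: a (min, max) summary per storage block lets a range query skip blocks that cannot contain matching rows. The values and predicate are invented for illustration:

```python
# Sketch of a zone map: keep min/max per storage block so queries can
# skip blocks that cannot contain matching rows, reducing I/O.

def build_zone_map(blocks):
    """One (min, max) summary per block of values."""
    return [(min(b), max(b)) for b in blocks]

def blocks_to_scan(zone_map, lo, hi):
    """Indices of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(zone_map)
            if bmax >= lo and bmin <= hi]

blocks = [[1, 3, 7], [8, 12, 15], [20, 22, 25]]
zmap = build_zone_map(blocks)
# Query: values between 10 and 21 - only blocks 1 and 2 can match
print(blocks_to_scan(zmap, 10, 21))  # [1, 2]
```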
For storing and managing unstructured or non-relational data, the NoSQL approach is divided into two independent parts: data storage and management. With its key-value storage model, NoSQL's focus in storage is scalability and high performance. In the management part, data management tasks can be performed at the application layer through lower-level access mechanisms. The most important features of NoSQL databases are the ability to change the data structure quickly thanks to schema freedom and the flexibility of storing structured data heterogeneously without rewriting it. The most popular NoSQL database is Cassandra, which was first used by Facebook and published as open source in 2008. There are also NoSQL implementations such as SimpleDB, Google BigTable, MongoDB, and Voldemort. Companies such as Twitter, LinkedIn, and Netflix have also benefited from NoSQL capabilities.
According to the method proposed by L. Wang et al. for the management problem of conventional remote sensing data, the image data are divided into blocks based on the GeoSOT global discrete grid system, and the data blocks are stored in HBase. In this method, the data are first recorded in the MetaDataInfo table, with the satellite-sensor acquisition time used as the row ID. The DataGridBlock table keeps the MetaDataInfo row ID together with the geographic coordinates in its row ID. HBase tables ensure that blocks that are geographically close appear in adjacent rows, since rows are stored in ascending order of their keys. When a spatial query arrives, the GeoSOT codes are first calculated, and the DataGridBlock table is filtered by these codes. In addition, a distributed processing method that uses the MapReduce model to handle the image data blocks is designed. When MapReduce starts a job, it splits the table into regions by their bounds, each region containing a set of image data blocks. The map function then processes each data block from its region and sends the results to the reduce function.
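The spirit of this row-key scheme can be sketched with a simplified grid code standing in for GeoSOT (the real encoding is considerably more elaborate; the names and cell size here are assumptions):

```python
# Sketch: spatially prefixed row keys, as in the HBase scheme above.
# A hypothetical 10-degree grid code stands in for GeoSOT. Row keys sort
# lexicographically, so spatially close blocks land in adjacent rows.

def grid_code(lat, lon, cell=10):
    # Hypothetical stand-in encoding: zero-padded cell indices
    return f"{int((lat + 90) // cell):02d}{int((lon + 180) // cell):02d}"

def row_key(lat, lon, satellite, sensor, acquired):
    return f"{grid_code(lat, lon)}_{satellite}_{sensor}_{acquired}"

keys = sorted([
    row_key(41.0, 29.0, "SAT1", "MSI", "20180301"),
    row_key(41.5, 29.5, "SAT1", "MSI", "20180302"),
    row_key(-33.9, 18.4, "SAT1", "MSI", "20180301"),
])
# The two nearby acquisitions share the "1320" prefix and sort together
print(keys)
```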
The analysis of ultra-big databases has attracted many researchers' interest, as traditional databases are inefficient for storing and analyzing large digital data. Apache HBase, a NoSQL distributed database built on HDFS, is one of the results of this research. A study by M.N. Vora evaluated a hybrid approach in which HDFS retains non-textual data such as images while HBase retains the records referring to them. This hybrid architecture makes it possible to search and retrieve data faster.
3.2. HPC systems
When data analysis is considered, the most important difficulty is scalability with the volume of data. In recent years, researchers have focused more on accelerating their analysis algorithms. However, the amount of data grows much faster than CPU speed. This has led processors to a position of supporting parallel computing with multiple cores. For real-time applications, timeliness comes first. Thus, many difficulties arise not only in hardware development but also in the development of software architectures. The most important trend at this point is distributed computing improvements using cloud computing technology.
The technologies used in remote sensing applications have difficulties in delivering, processing, and responding in time. R. Patrick and J. Karpjoo surveyed web technologies, grid computing, data mining, and parallel computation on remote sensing data. The size of the data volume, the data formats, and the download times are the general difficulties. With a combination of these technologies, the processing time of some applications can be reduced enough for decisions to be made in a timely manner. Although it is desirable to process remote sensing data as soon as possible, fully automatic real-time processing does not yet seem possible.
3.2.1. Onboard architecture
3.2.1.1. Multi-core processor
The multi-core processor is an integrated circuit with two or more processor cores to process multiple tasks efficiently. After processor frequencies reached their limits due to heating problems, the processing capability of new CPUs has continued to increase by multiplying the number of cores. Recent CPUs can handle 8–12 simultaneous threads. To benefit from multi-core CPUs, the problem should be divided into partitions that can be processed simultaneously. Multi-core programming is needed to achieve this benefit, and applications usually need to be rewritten accordingly. Multi-core programming is the implementation of algorithms using a multi-core processor on a single computer to improve performance. APIs and standards such as OpenMP and MPI are used to implement algorithms that run simultaneously on multi-core processors.
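As a sketch of the divide-and-process idea, Python's standard multiprocessing module can spread row-wise work over cores (OpenMP and MPI play the analogous role in compiled codes); the image and worker count are illustrative:

```python
# Sketch of multi-core parallelism with the Python standard library: image
# rows are split across worker processes, each computing a partial sum.
from multiprocessing import Pool

def row_sum(row):
    # Work done by one worker on one partition (a single image row)
    return sum(row)

def parallel_total(image, workers=4):
    # Map the rows over a pool of processes, then combine the partials
    with Pool(processes=workers) as pool:
        return sum(pool.map(row_sum, image))

if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(parallel_total(image))  # 45
```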
3.2.1.2. Graphic processing units
The GPU is a specialized circuit originally dedicated to graphics processing. After its brilliant rise in manipulating computer graphics and image processing in recent years, this technology has become widely used for developing parallel algorithms on RS image data [38, 39, 40]. Complex RS algorithms must be rewritten to benefit from GPU parallelism by using thousands of simultaneous threads.
3.2.1.3. Field programmable gate arrays
Some onboard remote sensing data processing scenarios require components that can operate with low weight and low power, especially in systems using air vehicles such as unmanned aerial vehicles and satellites. While these components reduce the amount of payload, they can produce real- or near-real-time analysis results at the same time as the data is being obtained from the sensor. For this purpose, programmable hardware devices such as FPGAs can be used [41, 42]. FPGAs are digital integrated circuits that consist of an array of programmable logic blocks and reconfigurable interconnects that allow the blocks to be connected flexibly. However, they create the need to learn FPGA programming and a new set of APIs.
3.2.2. Distributed architecture
One of the most used approaches when considering hardware-based improvements is a solution based on commercial off-the-shelf (COTS) computers. In this approach, a cluster is created from a number of computers that work together as a team. These parallel systems, installed with a large number of CPUs, give good results for both real-time and near-real-time applications using remote sensors and data streams, but they are expensive, and their scalability cannot exceed a certain capacity.
Cavallaro et al. have addressed the classification of land cover types over an image-based dataset as a concrete big data problem in their work. In the scope of the study, PiSVM, an implementation based on LibSVM, was used for classification. The PiSVM code is scalable and stable despite the I/O limits. When PiSVM runs in parallel, MPI is used for communication across multiple nodes. For the parallel analysis, the JUDGE cluster at the Jülich Supercomputing Centre in Germany was used. The training time was reduced significantly with parallel PiSVM compared to the serial MATLAB run, while the accuracy of the SVM remained the same as in serial operation (97%).
Cloud computing is one of the most powerful big data techniques. The ability to provide flexible processing, memory, and storage by virtualizing the computing resources of physical computers has made the supercomputing concept more affordable and easily accessible. The cloud concept, which provides a multi-computer infrastructure for data management and analysis, brings great benefits in terms of scalability, usability, fault tolerance, and performance. Especially considering critical applications that need to extract information in near real time from the data that next-generation remote sensors can produce, it is very important to use cloud computing technologies for high-performance computing. In addition, the cloud computing infrastructure can create an efficient platform for storing big data as well as for performing the analysis. Thus, with this technology, the requirements for expensive computing hardware such as cluster systems, dedicated space, and software can be eliminated.
The Hadoop ecosystem has emerged as one of the most successful infrastructures for cloud and big data analysis [23, 45]. The platform brings together several tools for various purposes, with two major services: HDFS, a distributed file system, and MapReduce, a high-performance parallel data processing engine. Apache Hadoop is an open-source implementation of the MapReduce model. This model allows big datasets to be processed concurrently on multiple computers. Remote sensing applications using the MapReduce model have become a research topic for improving the performance of the analysis process in the face of the exponential data growth driven by the latest developments in sensor technologies [15, 16]. The processing services on the cloud access data via the distributed file system. To reduce data movement, it is reasonable to process the data on the computer where the data is stored.
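The MapReduce model itself can be sketched in a few lines of plain Python, independent of Hadoop; the land-cover labels are invented example data:

```python
# A minimal MapReduce sketch in plain Python: map emits (key, value) pairs,
# pairs are grouped by key (shuffle), and reduce aggregates each group.
# Here it counts land-cover labels in classified pixels.
from collections import defaultdict

def map_phase(records, mapper):
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(values) for key, values in groups.items()}

# Mapper: each classified pixel emits (label, 1); reducer sums the counts
pixels = ["water", "forest", "water", "urban", "water"]
counts = reduce_phase(
    shuffle(map_phase(pixels, lambda p: [(p, 1)])),
    sum,
)
print(counts)  # {'water': 3, 'forest': 1, 'urban': 1}
```

In a real Hadoop job, the shuffle step and the distribution of map and reduce tasks across nodes are handled by the framework; only the mapper and reducer are user code.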
4. Performance-aware HPC
Processing big geospatial data is vital for time-critical applications such as natural disasters, climate change, and military/national security systems. The challenges are related not only to the massive data volume but also to the intrinsic complexity and high dimensionality of geospatial datasets. Hadoop and similar technologies have attracted increasing attention in the geosciences communities for handling big geospatial data. Many investigations have been carried out on adopting those technologies for processing big geospatial data, but there are very few studies on optimizing the computing resources to handle dynamic geo-processing workloads efficiently.
In existing software systems, computing is seen as the most expensive part. Now that the amount of daily collected data has grown exponentially with recent technologies, data movement has a deeper impact on performance, with a bigger cost than computing, which is cheap and massively parallel. At this point, new high-performance systems need to adapt to the data-centric paradigm, and they must use data locality to achieve that adaptation. Current systems ignore the incurred cost of communication and rely on hardware cache coherency to virtualize data movement. The increasing amount of data leads to more communication between the processing elements, and that situation requires support for data locality and affinity. In the upcoming model, data locality should be achieved with technologies such as tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries. A combination of the best of these technologies can help develop a comprehensive model for managing data locality on a high-performance computing system.
4.1. Data partition strategy
Domain decomposition is an important subject in high-performance modeling and numerical simulation in areas such as intelligence and military, meteorology, agriculture, urbanism, and search and rescue. It makes parallel computing possible by dividing a large computational task into smaller parts and distributing them to different computing resources. To increase performance, high-performance computing is used extensively by dividing the entire problem into multiple subdomains and distributing each one to a different computing node in a parallel fashion. Inconvenient allocation of resources induces imbalanced task loads and redundant communication among computing nodes. Hence, resource allocation is a vital part which has a deep impact on the efficiency of the parallel process. A resource allocation algorithm should minimize the total execution time by taking into consideration the communication and computing cost of each computing node, and it should reduce the total communication cost of the entire processing. In this chapter, a new data partitioning strategy is proposed to benefit from the current state of the resources on the cloud. In this new strategy, RS data is partitioned based on a performance metric which is formulated as a combination of the available network, memory, CPU, and storage resources. After the appropriate cloud site and resource nodes are found by stages 1 and 2, as described in Sections 4.2 and 4.3, the incoming data is distributed across the selected resource nodes according to the amount of available resources on that cloud. For this dividing operation, the system determines which portion of the data will be allocated to the found nodes based on a performance metric such as the heuristic function below:
Let $p_i$ denote the portion of the data assigned to node $i$, and let $t_i$ and $c_i$ denote, respectively, the transfer time needed for the data with the network throughput of node $i$ and the processing time needed for the data with the CPU frequency of node $i$. With $T = \sum_{j \in S} (t_j + c_j)$ the total sum of the transfer and processing times over the set $S$ of selected nodes, a node's share is inversely proportional to its own total time:

$$p_i = \frac{(t_i + c_i)^{-1}}{\sum_{j \in S} (t_j + c_j)^{-1}} \tag{1}$$

In the formula, the transfer time can be computed as $t_i = D/b_i$ and the processing time as $c_i = D/f_i$, where $D$ is the data size, $b_i$ is the network throughput of node $i$, and $f_i$ is the effective processing rate of node $i$.
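As an illustration, the following sketch assigns each node a share inversely proportional to its estimated transfer-plus-processing time; the throughput and processing-rate figures are hypothetical:

```python
# Hypothetical sketch of the partitioning heuristic: a node's share is
# inversely proportional to its estimated transfer + processing time.
# Throughput and processing-rate figures below are made up.

def portions(data_size, nodes):
    """nodes: list of (network throughput MB/s, effective CPU rate MB/s)."""
    times = [data_size / b + data_size / f for b, f in nodes]   # t_i + c_i
    inv = [1.0 / t for t in times]
    total = sum(inv)
    return [x / total for x in inv]                             # normalized shares

nodes = [(100.0, 200.0), (50.0, 100.0)]   # node 2 is half as fast on both counts
p = portions(1000.0, nodes)
print(p)  # node 1 receives twice the share of node 2
assert abs(sum(p) - 1.0) < 1e-9           # shares always sum to 1
```

A node that is twice as fast in both transfer and processing ends up with twice the data, which keeps the nodes finishing at roughly the same time.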
4.2. Geo-distributed cloud
A geo-distributed cloud is the application of cloud computing technologies that interconnect data and applications across multiple cloud sites located in different geographic regions. Cloud providers prefer distributed cloud systems to enable lower latency and provide better performance for cloud consumers. Recently, most large online services have become geo-distributed in response to the exponential increase in data. The most important reason for realizing services as geo-distributed is latency: geo-distributed clouds provide a point of presence near clients in order to reduce it. In this chapter, we introduce a novel resource allocation technique for managing RS big data in a geo-distributed private cloud. This new approach selects the most appropriate cloud site, namely the one with minimum latency. It also finds an efficient data layout which gives higher performance for the selected data nodes in the related cloud site. Within this context, resource management should match instantaneous application requirements with optimal CPU, storage, memory, and network resources. Putting the data on a more appropriate node with enough resources, and providing an efficient layout of the dependent data partitions on nodes with minimum network latency, should decrease the processing time of the algorithms and also minimize the time spent transferring dependent data needed by algorithms running on different nodes.
In the proposed approach, acquired RS data is stored on the geo-distributed cloud in two stages (Figure 4). In the first stage, each cloud site determines a score based on latency, bandwidth capacity, and CPU, memory, and storage workloads. At that point, each cloud site runs a multi-criteria decision-making (MCDM) process. MCDM is a subdiscipline of operations research that evaluates multiple criteria in decision-making problems. As a criterion value for a resource, the workload of the resource can be computed by dividing the used amount of the resource by its total capacity. For the CPU workload, the equation is given as follows:

$$W_i^{cpu} = \frac{V_i}{P_i}$$
Let $P_i$ be the total number of physical CPU threads in cloud site $i$ and $V_i$ be the number of virtual CPUs that are allocated to virtual machines (VMs) in the cloud site. For the storage workload, the equation is given as follows:

$$W_i^{sto} = \frac{U_i}{S_i}$$

Let $S_i$ be the total size of storage in cloud site $i$ and $U_i$ be the size of storage that is used for incoming data in the cloud site. For the memory workload, the equation is given as follows:

$$W_i^{mem} = \frac{R_i}{M_i}$$

Let $M_i$ be the total size of memory in cloud site $i$ and $R_i$ be the size of memory that is used for incoming data in the cloud site.

The time needed to transfer data between the consumer and cloud site $i$ is described by the latency $L_i$; the bandwidth between them is denoted $B_i$.
After determining the criteria for decision-making, a processing method is needed that computes numerical values from the relative importance of the criteria in order to rank the alternatives. Some of the well-known MCDM processing models are the weighted sum model (WSM), the weighted product model (WPM), and the analytic hierarchy process (AHP). If we define the solution with AHP, which is based on decomposing a complex problem into a system of hierarchies, the best alternative can be defined by the following relationship:

$$A^{*}_{AHP} = \min_{i} \sum_{j=1}^{N} a_{ij} w_j$$

where $a_{ij}$ is the relative value of alternative $i$ with respect to criterion $j$, $w_j$ is the weight of criterion $j$, $N$ is the number of criteria, and $i$ ranges over the cloud sites; since all criteria here are cost criteria, the site with the minimum score is preferred. If we write the model for the earlier five criteria:

$$A^{*}_{AHP} = \min_{i} \left( w_1 L_i + w_2 B_i^{-1} + w_3 W_i^{cpu} + w_4 W_i^{sto} + w_5 W_i^{mem} \right)$$
Although AHP is similar to WSM, it uses relative values instead of actual values. This makes it possible to use AHP in multidimensional decision-making problems, since it removes the problem of combining criteria measured in different units (similar to adding oranges and apples).
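A minimal sketch of this stage-1 site ranking, assuming equal weights and AHP-style normalization of each criterion column by its sum; the sites, criterion values, and weights are all hypothetical:

```python
# Illustrative sketch of stage 1: rank cloud sites by a weighted sum of
# *relative* criterion values (AHP-style column normalization). All
# figures and the equal weights are hypothetical; lower score = better.

def relative(column):
    # Convert raw values into relative (unit-free) values.
    total = sum(column)
    return [v / total for v in column]

def rank_sites(criteria, weights):
    """criteria[k][i]: value of cost criterion k for site i
    (latency, 1/bandwidth, CPU/storage/memory workload)."""
    norm = [relative(col) for col in criteria]
    n_sites = len(criteria[0])
    scores = [sum(w * norm[k][i] for k, w in enumerate(weights))
              for i in range(n_sites)]
    best = min(range(n_sites), key=lambda i: scores[i])
    return best, scores

latency  = [20.0, 5.0, 12.0]        # ms
inv_bw   = [1/100, 1/400, 1/200]    # 1 / (MB/s): higher bandwidth -> lower cost
cpu_load = [0.7, 0.3, 0.5]
sto_load = [0.6, 0.4, 0.8]
mem_load = [0.5, 0.2, 0.9]

best, scores = rank_sites([latency, inv_bw, cpu_load, sto_load, mem_load],
                          [0.2] * 5)
print(best)  # site index 1 wins: it is best on every criterion here
```

Because every criterion is first made relative, quantities in milliseconds, megabytes, and dimensionless ratios can be combined without the oranges-and-apples problem mentioned above.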
After the minimum-valued alternative cloud site is found, the second stage takes place: it evaluates which resources in the related cloud site should be used and finds an optimal layout in the network for the RS big data.
4.3. Resource optimization in cloud network with performance metrics
Large-scale networks and their applications lack centralized access to information. When we treat RS big data on the cloud as a large-scale network, optimization of resource allocation depends on the local observations and information of each node. At this point, control and optimization algorithms should be deployed in a distributed manner to find the optimal resource allocation in such a network. The optimization algorithm should be robust against link or node failures and should scale horizontally.
To achieve distributed optimization of the system, each node, called an agent, runs the optimization algorithm locally. In this system, which consists of multiple agents connected over a network, each agent has a local objective function and a local constraint set known only to that agent. The agents cooperatively try to decide on a global decision vector based on their objective functions and constraints (Figure 5).
The agents cooperatively optimize a global objective function denoted by $f(x)$, which is a combination of the local objective functions, that is,

$$f(x) = \sum_{i=1}^{m} f_i(x)$$

where $f_i : \mathbb{R}^n \to \mathbb{R}$ is the local objective function of agent $i$, and the decision vector $x$ is bounded by a constraint set, $x \in C$, which consists of the local constraints and any global constraints that may be imposed by the network structure, that is,

$$C = \left( \bigcap_{i=1}^{m} C_i \right) \cap C_g$$

where $C_i$ is the local constraint set of agent $i$ and $C_g$ represents the global constraints. This model leads to the following optimization problem:

$$\min_{x} \; f(x) \quad \text{subject to} \quad x \in C \tag{9}$$

where the set $C$ is the constraint set defined above. The decision vector in Eq. (9) can be considered either as a resource vector whose components correspond to the resources allocated to each node, or as a global decision vector which is estimated by the nodes on the network using local information.
Having defined some basic notation and terminology for the optimization of a function, we continue where we left off. Once a cloud site has been chosen to receive the incoming RS data, it should decide which resources will be used for it. Each node in the cloud site evaluates the network latency and bandwidth to the other nodes for optimal network communication. In addition, the current amounts of CPU, memory, and storage are also taken into account for finding the best-fitting solution to store and process the RS data. Hence, each node solves the following formula:
where indicates that the jth node would be in the data-receiving group together with node i or not, is latency between i and j, is bandwidth between i and j, is hop count between i and j, and , , and are CPU, storage, and memory resources in node j. Each node should solve to minimize decision vector with AHP model in a decentralized manner. The individual optimization problem is a mix integer program for each node. Lagrangian relaxation is a heuristic method that stands for solving mix integer problems with decomposing constraints. The idea of the method is to decompose constraints which complicated the problem by adding them to the objective function with the associated vector μ called the Lagrange multiplier. After applying it to our problem, we derive the dual problem of Eq. (9) for an efficient solution by adding complicated constraints 1 and 2.
This yields the dual problem

$$\max_{\lambda \ge 0,\, \mu \ge 0} \; q(\lambda, \mu), \qquad q(\lambda, \mu) = \min_{x} \; \sum_{j} U_j(x_{ij}) + \lambda^{T} g_1(x) + \mu^{T} g_2(x) \tag{11}$$

where $U_j$ is the utilization function for node $j$, $g_1$ and $g_2$ are the relaxed constraints, and $\lambda$ and $\mu$ are the dual variables.
After obtaining the above optimization problem, it is separable in the variables and decomposes into subproblems for each node $i$. Thus, each node only needs to solve the one-dimensional optimization problem Eq. (11). This optimization problem consists of its own utility function and the Lagrangian multipliers available to node $i$. Generally, the subgradient method is used to solve the obtained dual problem because of the simplicity of its computations per iteration. First-order methods such as the subgradient method have a slower convergence rate when high accuracy is required, but they are very effective in large-scale multi-agent optimization problems where the aim is to find near-optimal approximate solutions.
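To make the subgradient idea concrete, here is a toy projected-subgradient ascent on the dual of a one-dimensional problem (minimize $x^2$ subject to $x \ge 1$), where the inner minimization has a closed form. This illustrates the method itself, not the chapter's actual dual:

```python
# Toy illustration of the (projected) subgradient method on a dual:
#   primal: min x^2   s.t. 1 - x <= 0          (optimum x* = 1)
#   dual:   max_{lam >= 0} q(lam) = min_x x^2 + lam * (1 - x)
# The inner minimizer is x(lam) = lam / 2, and a subgradient of q at lam
# is the constraint violation 1 - x(lam).

def dual_subgradient(steps=2000, step_size=0.01):
    lam = 0.0
    for _ in range(steps):
        x = lam / 2.0                          # inner minimization (closed form)
        g = 1.0 - x                            # subgradient of q at lam
        lam = max(0.0, lam + step_size * g)    # ascent step + projection lam >= 0
    return lam, lam / 2.0

lam, x = dual_subgradient()
print(round(lam, 3), round(x, 3))  # converges toward lam = 2, x = 1
```

Each iteration costs only one evaluation of the inner minimizer and one projection, which is exactly why the method scales to large multi-agent settings despite its slow convergence at high accuracy.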
When a near-optimal approximate solution is found, the RS data should be divided into partitions and distributed proportionally to the nodes found in the solution, according to the performance heuristic (Eq. (1)) given in Section 4.1.
As a result of technological developments, the amount of data produced daily by many organizations has increased to terabyte levels. Remotely sensed data, which is spatially and spectrally enriched and heterogeneous owing to the variety of sensing techniques, causes great difficulties in storage, transfer, and analysis with conventional methods. It has become a necessity to implement distributed approaches instead of conventional methods, which are inadequate for critical applications when real or near-real-time analysis of the relevant big data is needed. Existing distributed file systems, databases, and high-performance computing systems experience difficulties in optimizing analysis workflows and in the storage and retrieval of spatial remote sensing big data and data streams.
According to the research investigated in this chapter, existing techniques and systems cannot fully solve the problems of real and near-real-time big data analysis in remote sensing. Hadoop and similar technologies have attracted increasing attention in the geosciences community for handling big geospatial data. Many investigations have been carried out to adopt those technologies for processing big geospatial data, but there are very few studies on optimizing computing resources to handle dynamic geo-processing workloads efficiently.
In this chapter, a two-stage innovative approach has been proposed to store RS big data on a suitable cloud site and to process it with optimized resource allocation on a geo-distributed cloud. In the first stage, each cloud site determines a score based on latency, bandwidth capacity, and CPU, memory, and storage workloads with an MCDM process. After the minimum-valued alternative cloud site is found, the second stage takes place: it evaluates which resources in the related cloud site should be used and finds an optimal layout in the network for the RS big data with respect to latency, bandwidth capacity, and CPU, memory, and storage amounts. Lastly, the data should be divided into partitions based on a performance metric computed from the available network and processing resources of the selected nodes in the cloud site.
As future work, optimal replication methods will be investigated to guard against failures while transferring and processing RS data in a distributed manner. To achieve this, a performance-based approach is considered to maintain high-performance computing.