The Impact of the Data Archiving File Format on Scientific Computing and Performance of Image Processing Algorithms in MATLAB Using Large HDF5 and XML Multimodal and Hyperspectral Data Sets



Introduction
Scientists require the ability to effortlessly share and process data collected and stored on a variety of computer platforms in specialized data storage formats. Experiments often generate large amounts of raw and corrected data and metadata, which describes and characterizes the raw data. Scientific teams and groups develop many formats and tools for internal use for specialized users with particular preferences and backgrounds. Researchers need a solution for querying, accessing, and analyzing large data sets of heterogeneous data, and demand high interoperability between data and various applications (Shasharina et al., 2007; Shishedjiev et al., 2010). Debate continues regarding which data format provides the greatest transparency and produces the most reliable data exchange. Currently, Extensible Markup Language (XML) and Hierarchical Data Format 5 (HDF5) are two solutions for sharing data. XML is a simple, platform-independent, flexible markup meta-language that provides a format for storing structured data, and is a primary format for data exchange across the Internet (McGrath, 2003). XML data files use Document Type Definitions (DTDs) and XML Schemas to define the data structures and definitions, including data formatting, attributes, and descriptive information about the data. A number of applications use XML-based storage implementations, including radiation and spectral measurements, simulation data of magnetic fields in human tissues, and describing and accessing fusion and plasma physics simulations (Shasharina et al., 2007; Shishedjiev et al., 2010). HDF5 is a data model, library, and file format for storing and managing data. HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5 (HDF Group). HDF5 files provide self-documenting storage of scientific data in that the HDF5 data model provides structures that allow the file format to contain data about the file structure and descriptive information about the data contained in the file (Barkstrom, 2001). Similar to XML, numerous applications using the HDF5 storage format exist, such as fusion

Analysis of HDF5 and XML Formats
The goals of this analysis are to:
1. Determine strengths and weaknesses of using HDF5 and XML formats for typical processing techniques associated with large hyperspectral images;
2. Compare and analyze processing times on Windows and Linux 64-bit workstations for HDF5 and XML hyperspectral images; and
3. Identify areas that require additional research to help improve efficiencies associated with processing large HDF5 and XML files, such as hyperspectral images.

Methodology for Analysis of HDF5 and XML Formats
To address the analysis goals, a set of 100 files containing multimodal hyperspectral images, ranging in size from 57 MB to 191 MB and stored in HDF5 format, provided the input for the creation of HDF5 and XML dataset files as part of a preprocessing step for further analysis.
The created HDF5 and XML dataset files provided the input to a series of analysis techniques typically associated with image and signal processing. Each original HDF5 file went through a number of preprocessing steps to remove the metadata in preparation for analysis. For analysis purposes, the metadata had to be removed from the original HDF5 files so that new HDF5 and XML formatted files consisting of only raw sensor data could be created prior to performing image processing. These steps included loading the original HDF5 file structures, searching through the HDF5 groups to find the raw image data, saving the new HDF5 file, creating and populating an XML document node, and saving the XML file. Figure 1 shows the overall steps in processing the original HDF5 file, along with some critical MATLAB code associated with each of those steps. After creating the HDF5 and XML files for the raw sensor data, each file was loaded into MATLAB, converted to an array as needed, and run through a number of image processing steps. XML stores the array data as a large ASCII character string, which requires converting the character array into a numeric array before beginning any processing. Unfortunately, the arrays were too large to use MATLAB's str2num() function, so a novel custom method was developed to read each character and convert it into numbers before writing the numbers into a MATLAB array. Once stored as numeric arrays, the processing for the XML and the HDF5 files was the same; the processing steps include image adjustment, histogram calculation and descriptive statistics, filtering to remove noise, edge detection, and 2-D FFT threshold feature detection. Each of these represents a technique users may invoke when processing hyperspectral images. Table 2 provides a brief description of each technique and an example call within MATLAB. In Table 2, 'im' represents the original image and 'im2' represents a processed version of 'im'. Each row in Table 2 shows various processing operations performed on 'im' or 'im2'. Figure 2 shows the flow of the image processing techniques. The metrics used for assessing the performance of each file format include load times, process times, and memory usage statistics for each file format and machine. These metrics reveal the computational performance of processing large archived data files in MATLAB using typical image processing algorithms. MATLAB's tic and toc methods were convenient for measuring elapsed times associated with each processing step. Hyperspectral images consist of multiple segments representing different spectral bands or narrow frequency bands, with data collected for each image segment and averaged for reporting purposes. For example, an image with 62 segments would generate data for each of the 62 segments and the mean of those values, with the results described in the Results section of this paper. A typical sequence of elapsed time measurement is shown in Figure 3. In the Figure 3 example, the timing process for the image adjustment algorithm is performed for all files (i) and segments (j).
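The Figure 3 timing pattern (tic before each step, toc stored in a per-file, per-segment array such as adjustIM(i,j)) can be illustrated outside MATLAB as well. The following is a minimal stdlib Python sketch; the adjust step and the sample data are placeholders for illustration, not the chapter's actual algorithm or images:

```python
import time

def adjust(segment):
    # Placeholder for an image-adjustment step such as imadjust;
    # here we simply rescale the segment's values to [0, 1].
    lo, hi = min(segment), max(segment)
    return [(v - lo) / (hi - lo) for v in segment]

# Illustrative data: two "files", each a list of spectral segments.
files = [[[0, 50, 100], [10, 20, 30]],   # file 0: two segments
         [[5, 15, 25]]]                  # file 1: one segment

# adjust_times[i][j] holds the elapsed time for file i, segment j,
# mirroring adjustIM(i,j) = toc in the Figure 3 loop.
adjust_times = []
for segments in files:
    row = []
    for seg in segments:
        t0 = time.perf_counter()              # tic
        adjust(seg)
        row.append(time.perf_counter() - t0)  # toc
    adjust_times.append(row)

# Average across all segments, as done for reporting in the chapter.
mean_time = sum(sum(r) for r in adjust_times) / sum(len(r) for r in adjust_times)
```

The per-segment times are then averaged per image before the cross-file statistics are computed.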

% Adjust the image for better display
tic;
im2 = imadjust(im);
adjustIM(i,j) = toc;

After loading each created dataset file, both in HDF5 and XML, measuring the memory determines the average memory usage. For the Windows environment, MATLAB's memory function determines the physical memory available at that point in time. For the Linux environment, system calls to the Linux memory functions determine the physical memory available after loading the file; MATLAB does not provide a Linux memory function at this time. Figure 4 shows a typical Windows memory call.
MATLAB has a convenient built-in benchmark named "bench" that executes six different MATLAB tasks and compares the execution speed with the speed of several other computers. Table 3 shows the six different tasks.
The LU test performs a matrix factorization, which expresses a matrix as the product of two triangular matrices. One of the matrices is a permutation of a lower triangular matrix and the other an upper triangular matrix. The fast Fourier transform (FFT) test performs the discrete Fourier transform computed using an FFT algorithm. The ordinary differential equation (ODE) test solves equations using the ODE45 solver. The Sparse test converts a matrix to sparse form by removing any zero elements. Finally, the 2-D and 3-D tests measure 2-D and 3-D graphics performance, including software or hardware support for OpenGL (Open Graphics Library).
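The Sparse test's conversion, dropping zero elements so only nonzero entries are stored, can be sketched in a few lines. This stdlib Python dict-of-keys version illustrates the idea only; it is not MATLAB's actual sparse() implementation:

```python
def to_sparse(dense):
    """Convert a dense 2-D matrix to dict-of-keys sparse form by
    dropping zero elements, the same idea as the Sparse benchmark."""
    return {(r, c): v
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0}

m = [[0, 3, 0],
     [1, 0, 0],
     [0, 0, 7]]
sparse = to_sparse(m)   # only the three nonzero entries survive
```

For mostly-zero matrices this representation saves both storage and arithmetic, which is why the benchmark isolates the conversion step.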
The benchmark results in a speed comparison between the current machine and industry-available machines.

Data analysis
Data analysis included calculating descriptive statistics for each test, including mean, standard deviation, variance, and minimum and maximum values, as well as t-test analysis to determine relationships and differences in the performance measurements comparing XML and HDF5 formats on both computer systems. The t-test is one of the most commonly used statistics to determine whether two datasets are significantly different from one another (Gay & Airasian, 2003). The t-test determines if the observed variation between the two datasets is sufficiently larger than a difference expected purely by chance. For this research, the significance level (α) was set at 0.05. This value is commonly accepted and is the default value for many statistical packages that include the t-test (Gay & Airasian, 2003; SAS Institute, 2003; MathWorks, 2011).
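The unpaired (two-sample) t statistic can be computed directly from the two samples. The following is a minimal stdlib Python sketch using the pooled-variance form, whose degrees of freedom (n1 + n2 - 2) match the df = 198 reported later for the two 100-file samples; the sample data below is illustrative only:

```python
import math
from statistics import mean, variance

def unpaired_t(x, y):
    """Two-sample (unpaired) t statistic with pooled variance.
    Returns (t, degrees of freedom = n1 + n2 - 2)."""
    n1, n2 = len(x), len(y)
    # Pooled sample variance across the two groups.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    # Difference of means divided by the standard error of the difference.
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Illustrative timing samples (seconds) for two formats.
t, df = unpaired_t([1.0, 1.2, 1.1, 0.9], [1.8, 2.0, 1.9, 2.1])
```

The resulting t-value is then compared against the t distribution with the given degrees of freedom to obtain the p-value judged against α = 0.05.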
For each processing, memory, or loading algorithm, descriptive statistics computed for each hyperspectral image provide the data for the final analysis. Averaging across each segment of the multiple-segment images creates the analytical data products used in the results.
In addition to the descriptive statistics for each process, graphical plots illustrate the load times, process times, and memory usage as a function of file size for each data type and test environment. These plots provide an ability to identify differences between the XML and HDF5 data types and possible processing bottlenecks and limitations.

Results and Implications
Scientists and researchers need a reliable format for exchanging large datasets for use in computational environments (such as MATLAB). MATLAB has many advantages over conventional languages (such as FORTRAN and C++) for scientific data analysis, such as ease of use, platform independence, device-independent plotting, a graphical user interface, and the MATLAB compiler (Chapman, 2008). Previous results have shown that the HDF5 format provided faster load and process times than XML formats, and loads large amounts of data without running into memory issues (Bennett & Robertson, 2010). This research supports those findings. This section provides results and discussion of the current research. After the baseline benchmarks provide results for each machine, the analysis shows example images and descriptive statistics for each image-processing algorithm, along with tables, plots, and discussion comparing HDF5 and XML formats for each task.
Table 4 shows the average of 10 MATLAB bench time results for each of the machines for the LU, FFT, ODE, Sparse, 2D, and 3D tests. For most tests, the Windows 64-bit machine performed better (as indicated by smaller execution times) than the Linux 64-bit machine. One exception was 2D graphics, where the Linux 64-bit machine was slightly faster than the Windows machine. Based on these results, the Windows 64-bit machine should perform slightly faster for the subsequent image processing tasks. Figure 5 shows a typical image used in this analysis. This image represents one specific frequency range (spectral band) for a 460 x 256 image after adjusting the intensity for display. The average times associated with each of these steps are shown in Table 6 for the Windows 64-bit machine and Table 7 for the Linux 64-bit machine. The column labeled "Total (s)" represents the sum of each of the processing steps for the respective machines. For the current configuration of the Windows 64-bit machine, the mean preparation time per file was just over 9 s, with preparation times ranging between almost 7 and approximately 16.5 s. For the current configuration of the Linux 64-bit machine, the mean preparation time per file was almost 11 s, with times ranging between almost 9 and approximately 19.5 s. Table 8 shows the average free physical memory for each system during the preprocessing steps. Free physical memory can vary throughout a run based on system processing during the run and the amount of memory allocated to MATLAB for processing the run. For all runs during this research, the MATLAB Java heap memory was set to its maximum possible value to avoid any potential out-of-memory issues. In MATLAB version 2010b, the memory is changed by selecting File, then Preferences, then General, then Java Heap Memory, and then using the scroll bar to set the maximum value. The maximum setting for the Windows 64-bit machine was 1533 MB, while the maximum setting for the Linux 64-bit machine was 4011 MB. One trade-off with the larger Java heap memory in Linux is that less physical memory is available for the run. However, increasing the Java heap memory does allow for larger possible Java objects, which is useful when dealing with large image arrays. After the preparation steps are complete, the next step is saving the raw image data to HDF5 and XML files. The new raw image files in HDF5 and XML contain only the image dimension information and the raw image pixel values. Table 9 provides the file statistics of the raw image data in both HDF5 and XML format. In all cases, the XML files are larger than the HDF5 files. In most cases, the resulting XML file is between 2.5 and 3 times as large as the corresponding HDF5 file. This finding is consistent with other published results (Bennett & Robertson, 2010). On both the Windows and Linux machines, the total execution times for the HDF5 files were significantly less than the total execution times for the XML files. Comparing the results for the mean execution time for the Windows machine, HDF5 demonstrates excellent performance (~1.1 s) compared to XML (~1.8 s). The execution times for the Windows machine ranged between ~0.1 and ~6.2 s for the HDF5 files, compared to ~0.7 to ~6.9 s for the XML files. Similarly, comparing the results for the mean execution time for the Linux machine, HDF5 demonstrates excellent performance (~1.5 s) compared to XML (~3.1 s). The execution times for the Linux machine ranged between ~0.15 and ~9.2 s for the HDF5 files, compared to ~1.3 to ~12.3 s for the XML files. The total execution time difference for both the Windows and Linux machines is primarily due to the "load" process. Loading XML files requires far more execution time due to the larger file sizes of the created XML data files (~3 times larger when storing the raw data in XML format). Additional loading difficulties with XML files include:
1. Slowness of the serialization process of converting Unicode XML into binary memory storage (McGrath, 2003).
2. MATLAB's loading algorithm (the 'xmlread' method) uses the Document Object Model (DOM) to load XML files. DOM is memory and resource intensive, and can consume as much as 10 times the computer memory as the size of the actual XML data file (Wang et al., 2007).
3. In general, and of particular concern for users performing 32-bit processing, processing speeds associated with XML loading can be greatly diminished as virtual memory becomes insufficient compared with the size of the XML file and the computer starts to run out of memory.
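The DOM problem in point 2 stems from materializing the entire document tree in memory at once; environments offering incremental parsing can avoid it. MATLAB's xmlread provides only the DOM route, but the contrast can be sketched with Python's stdlib XML module, where iterparse visits elements one at a time:

```python
import io
import xml.etree.ElementTree as ET

# A small illustrative document: 1000 numeric <v> elements.
xml_doc = "<data>" + "".join(f"<v>{i}</v>" for i in range(1000)) + "</data>"

# DOM-style: parse the whole document into memory, then traverse it.
total_dom = sum(int(e.text) for e in ET.fromstring(xml_doc))

# Streaming: iterparse yields elements incrementally and lets us
# discard each one after use, keeping peak memory near-constant.
total_stream = 0
for event, elem in ET.iterparse(io.StringIO(xml_doc), events=("end",)):
    if elem.tag == "v":
        total_stream += int(elem.text)
        elem.clear()   # release the element's storage immediately
```

Both approaches produce the same result; only their memory footprint differs, which is the point of difficulty 2 above.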

The load times were larger for the XML files compared to the HDF5 files. This difference is most likely due to the larger XML file size. Figure 7 visually displays the load times for the XML and HDF5 files for the Linux 64-bit machine. Figure 8 shows a similar result for the Windows 64-bit machine. Corresponding increases in XML file size contribute to the large jumps observed in the XML load times around file IDs 75 and 90 (Figures 7 and 8). The arguments made earlier in the chapter (slowness of serialization of converting Unicode to binary storage and resource-intensive DOM loading methods) explain the larger loading process times compared to the more efficient loading of HDF5 binary files. HDF5 load times do not vary significantly with file size. Efficient methods of loading HDF5 binary data files into the MATLAB workspace, combined with excellent memory usage and use of computing resources, demonstrate the superior benefit of archiving data in HDF5 versus XML. HDF5 provides seamless integration of data into MATLAB without performance degradation (especially for large data sets) and is the 'de facto' standard for MATLAB data files containing workspaces over 2 GB in size (Mather & Rogers, 2007).
The load times (Figures 7 and 8) for both HDF5 and XML show similar behavior on both the Windows and Linux machines. The cross-platform behavior demonstrates the file size dependency of XML loading performance, and the lack of file size dependency of HDF5 loading performance. As expected from the benchmark testing results, the XML loading performance on the Windows machine is slightly faster than on the Linux machine. An additional processing step is required to prepare the large raw data for processing. In XML files, the raw image data is stored as ASCII characters with whitespace separators. As the image gets larger, converting from the ASCII character data to a MATLAB array can take considerable time. MATLAB has a str2num() function that works well for small arrays, but this function would not work for these large character arrays. A novel process reads each character, one at a time, parses on spaces, and then loads the values into the array, resulting in a tremendous savings (as much as two orders of magnitude) in processing time.
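The character-by-character conversion just described can be sketched as a single pass over the ASCII string. This stdlib Python version illustrates the idea only; it is not the chapter's MATLAB implementation:

```python
def parse_ascii_array(text):
    """Convert a large whitespace-separated ASCII numeric string into a
    list of numbers in one pass: accumulate digit characters until a
    separator is seen, then flush the token as a number."""
    values, digits = [], []
    for ch in text:
        if ch.isspace():
            if digits:                       # end of a token
                values.append(float("".join(digits)))
                digits.clear()
        else:
            digits.append(ch)
    if digits:                               # flush the final token
        values.append(float("".join(digits)))
    return values

pixels = parse_ascii_array("10 20 30\n40 50 60")
```

Because the string is traversed exactly once with no repeated re-parsing, the cost stays linear in the input size, which is where the reported speedup over the general-purpose conversion comes from.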
C or other software development languages may provide more efficient methods to reduce this processing restriction. However, preparing the XML data for processing is a very important process step. Additional new research and software tools may simplify and expedite the process. T-test analysis on the total image processing times confirmed that there was a significant difference between the HDF5 and XML file processing times not attributable to random chance. Specifically, HDF5 files took less processing time than XML files on the Windows 64-bit machine (t(198) = 2.27, p = .0014) and the Linux 64-bit machine (t(198) = 3.25, p = .0244). The t-value represents the difference of the mean total processing times for HDF5 and XML, respectively, divided by the standard error of the two means. The 198 represents the degrees of freedom, or sample size minus 2 for an unpaired t-test, which is appropriate for the independent groups in this analysis. The important value (p) represents the probability of the difference (t-value) being due to chance: .0014 for the Windows 64-bit machine, and .0244 for the Linux 64-bit machine. With the significance level set to .05, both cases indicate that the difference in processing times between HDF5 and XML is not by chance. These results suggest a significant difference between the total process times for HDF5 and XML files for both machines. Further t-test analysis on the individual components contributing to the total process time indicated significant differences in execution times for load, adjust, and mean calculations for the Linux 64-bit machine, and for load, adjust, and median noise filter for the Windows 64-bit machine. It seems reasonable that the load times would be different between the XML and HDF5 formats. To provide insight into the differences between the XML and HDF5 formats, Figures 9 and 10 graphically depict these findings by displaying the total processing time for the HDF5 and XML files for the Linux 64-bit and Windows 64-bit test systems. In both cases, the XML process times were significantly greater than the HDF5 process times. For each file format and test machine, the amount of calculated free physical memory usage during the image processing stage shows definite differences between the file formats. Table 13 shows the descriptive statistics of these data. Similar to the preprocessing step, setting the Java heap memory to its maximum for each run resulted in no out-of-memory errors.
For both machines, the XML files required more physical memory than the HDF5 files, as indicated by less free physical memory in Table 13.This result is consistent with XML loading requiring relatively large amounts of memory compared to the XML file size (Wang et al., 2007).

Ethics of data sharing
There is a large, complex set of concerns when openly sharing data, especially electronic data over the Internet. From a scientific viewpoint of discovery, open sharing of scientific data allows many researchers and scientists the ability to form a common understanding of data, which is essential for furthering science. However, there are many ethical concerns in the process of sharing data, particularly over the Web. For example, suppose a medical study group collects sensitive, personal medical information as part of a medical case study using public government funds. All of the data is stored (archived) on a government computer system. Many years later, another researcher wants to use the data for another study, which could help save the lives of many people. Should the second researcher be able to use the archived data for a purpose other than the intent of the original study? Many arguments come into discussion in this situation. The right to use data paid for with publicly collected funds seems reasonable; however, what about the right of human participants to privacy? What happens if a data release into the public domain harms any of the participants? Such harm may take the form of job loss or denial of life insurance. The ethics of sharing data is complex, and the ethical dilemma of sharing data is an area of study requiring much thought and discussion.

Fig. 1. Original HDF5 File Preprocessing Overview for the creation of HDF5 and XML Dataset Files.

Fig. 4. Windows Memory Measurement Example Code.

Fig. 5. Example 460 x 256 Image.
A quad chart (Figure 6) displays processed images showing some of the techniques. The first image in Figure 6, in the upper-left corner, represents the image adjusted for intensity. The image in the upper-right corner represents the image after the Wiener noise filter is applied. Next, the image in the lower-left corner represents the image after the Canny edge detection is applied. Lastly, the image in the lower-right corner represents the FFT threshold results. Recall from Figure 1 that preparing the images for processing requires several steps. The first step was to load the HDF5 structures, followed by finding and loading the HDF5 raw image data, saving the HDF5 raw image data, populating the XML docNode, and saving the XML raw image data.

Table 1. Research Equipment Descriptions.
Two different workstations running 64-bit Windows and Linux operating systems were used. The workstations were equipped with MATLAB (scientific programming language). Table 1 displays the descriptions of each of the workstations. The hyperspectral images were originally stored in HDF5 format and included several different types of metadata in the form of HDF5 Groups and Datasets. Metadata in a typical HDF5 file includes ground truth, frequency bandwidths, raw image data, TIFF (Tagged Image File Format)-formatted images, collection information, and other ancillary information, allowing researchers to understand the images and their collection parameters.
Table 5 displays the original HDF5 file size statistics for this research. The original HDF5 files contained ground truth, collection information, processed data, and spectral content, in addition to the raw image data. The computed image processing statistics use only the raw image data extracted from the HDF5 files and saved to HDF5 and XML formats.

Table 8. Free Physical Memory during HDF5 Preparation Steps.

Table 9. HDF5 and XML Raw Image File Size Statistics.
After saving the raw image data to HDF5 and XML files, each file was loaded and processed according to the steps shown previously in Figure 2. These steps include loading the file, adjusting the image, calculating image statistics, removing noise, detecting edges, and detecting features. The algorithms include two different noise removal algorithms (Median and Wiener filtering) and two different edge detection algorithms (Sobel and Canny). All of these algorithms, unmodified for this research effort, are available within the MATLAB Image Processing Toolbox. Table 10 shows the statistical results of the execution times for each of these image-processing algorithms for HDF5 and XML formats for the Windows 64-bit machine. Table 11 shows the results for the Linux 64-bit machine.
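The median noise filtering step can be illustrated with a small self-contained sketch. This stdlib Python 3 x 3 version captures the idea behind MATLAB's medfilt2 without reproducing its border handling (borders here simply keep their original values):

```python
from statistics import median

def medfilt3x3(img):
    """3 x 3 median noise filter over a 2-D list-of-lists image.
    Each interior pixel is replaced by the median of its 3 x 3
    neighborhood, which suppresses impulse (salt-and-pepper) noise."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]          # borders left unchanged
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = [img[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = median(window)
    return out

noisy = [[10, 10, 10],
         [10, 99, 10],    # impulse-noise spike
         [10, 10, 10]]
clean = medfilt3x3(noisy)  # the spike is replaced by the neighborhood median
```

Unlike a mean filter, the median is robust to a single outlier in the window, which is why it removes the spike without blurring uniform regions.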

Table 11. Linux 64-bit HDF5 and XML Image Processing Execution Times.

Table 13. Free Physical Memory during Image Processing Steps.