Document Image Processing for Hospital Information Systems

In this chapter, we introduce document image processing methods and their applications in the field of medical (and clinical) science. Although the use of electronic health record systems is gradually spreading, especially among large hospitals in Japan, and an e-health environment will be fully available in the near future [1–3], a large number of paper-based medical records are still stored in the medical record libraries of hospitals. They hold the long histories of medical examinations of each patient and are good sources for clinical research and better patient care. Because of the importance of these paper documents, some hospitals have started to computerize them as image files or PDF files in which the patient ID is the only reliable retrieval key; most hospitals, however, have kept them as they are. This is due to the high cost of computerization and the relatively low benefit of documents that can only be retrieved by patient ID. The true objective of computerizing paper records is to make them usable in clinical research, for example to extract similar cases from them. If we cannot find a practical solution to this problem, large numbers of these paper-based medical records will soon be buried in book vaults and might be discarded in the near future. We are thus confronted with the challenge of devising a system that is easy to run and can smoothly incorporate the long paper-based histories of medical records into the e-health environment.


Introduction
In an e-health environment, health records are usually handled in an XML format with appropriate tags representing the document type. Here the document type means the scope or rough meaning of the contents. A good system should therefore be able to create XML files from paper documents, with appropriate tags and keys representing the rough meaning of the contents. Fortunately, most paper-based medical records have been written on fixed forms that depend on the clinic or discipline, with diagnoses placed in one fixed frame of a sheet, progress notes in another frame, and so on, and these frames usually correspond to the document types. It would be rather easy to assign an appropriate XML tag to each frame if we could determine the form or style of the paper. If such a frame can be determined and the scope of its contents is fixed, then the document in that frame can be translated into text accurately by using dictionaries assigned to that scope. Also, as collaborative medicine spreads, many recent medical records have been typed so that they can easily be read among the team members.

Outline of our system
Figure 1 illustrates the outline of the proposed method. The images obtained from paper-based medical documents contain factors, such as noise and tilt, that degrade the accuracy of the subsequent processes. These factors are reduced (or removed) by the pre-processing described in Section 3.3, and features for determining the sheet type are then extracted. After this, each cell in the document is extracted using the cell positions and the master information. An extracted cell often contains images, e.g. schema images and sketches, as well as character strings, so these images are also extracted from the cell images. The extracted character strings are converted into text data by an OCR engine, and the obtained text data are stored in a database.
The extracted schema images are recognized by a schema recognition engine, and the recognition results, i.e. the schema name, the annotations and their positions, are also registered in the database. After this, an XML file is generated using the master information.
Figure 1. Outline of the proposed method (labels: extraction of nodes; recognition of table structure; master information; node position; types of nodes).

Pre-processing
In this study, binarization, tilt correction and noise reduction are applied to the input images as pre-processing. For binarization, Otsu's method is used [8,9]. In this method, the threshold is determined by discriminant analysis of the density histogram of the input image, so no fixed threshold has to be set for each image. For tilt correction, the LPP method is used [10]. Figure 2 illustrates the outline of LPP. The target image, i.e. the input image, is divided into n_s sub-regions and the marginal distribution of each region is obtained; here, horizontal projection histograms are used as the marginal distributions. Next, the correlation α_k between regions is calculated from P_k(j), the j-th value of the horizontal projection histogram of the k-th sub-region, over a calculation range β. These values indicate the phase misalignment between regions, which is equivalent to the ratio of the tilt. As a result, the tilt angle θ of the paper is obtained from α_m, the average of α_k, and S_w, the width of each sub-region. The detectable range is from -10 to 10 degrees. We use the LPP method only, because images tilted by more than 10 degrees do not occur in practical cases. As the final step of pre-processing, a median filter is applied to the images to reduce speckle noise and salt-and-pepper noise.
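The pre-processing steps can be illustrated with a minimal Python sketch. The Otsu part follows the standard between-class variance criterion. The tilt part is only an assumption about how the LPP idea could be realized: the chapter's own equations for α_k and θ are not reproduced here, so the sketch takes the best-matching shift between the projection histograms of adjacent strips as α_k and uses θ = arctan(α_m / S_w). The function names and the default values of `n_s` and `beta` are hypothetical.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the Otsu threshold of an 8-bit grayscale image (2-D uint8 array)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of the dark class
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean gray level
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)                  # between-class variance per threshold
    valid = denom > 0
    sigma_b[valid] = (mu[-1] * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


def binarize(gray):
    """Map dark (ink) pixels to 1 and background pixels to 0."""
    return (gray <= otsu_threshold(gray)).astype(np.uint8)


def estimate_tilt_deg(binary, n_s=8, beta=10):
    """Estimate the page tilt from phase shifts between adjacent vertical strips.

    Assumed formulation (not taken from the chapter): alpha_k is the shift in
    [-beta, beta] that maximizes the correlation of neighbouring horizontal
    projection histograms, and theta = arctan(alpha_m / S_w).
    A positive angle means that lines descend towards the right of the image.
    """
    h, w = binary.shape
    s_w = w // n_s                           # width of each sub-region
    profiles = [binary[:, k * s_w:(k + 1) * s_w].sum(axis=1).astype(float)
                for k in range(n_s)]
    shifts = []
    for p, q in zip(profiles[:-1], profiles[1:]):
        # np.roll wraps around, which is harmless when the page margins are blank.
        scores = [np.dot(p, np.roll(q, -d)) for d in range(-beta, beta + 1)]
        shifts.append(int(np.argmax(scores)) - beta)
    alpha_m = np.mean(shifts)
    return float(np.degrees(np.arctan(alpha_m / s_w)))
```

With integer shifts the angular resolution of this sketch is coarse; a practical implementation would need sub-pixel interpolation of the correlation peak.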

Sheet type recognition using node information
Generally speaking, a tabular form document has at least one table, and the form and location of the table depend heavily on the sheet type. In other words, the features of the table are key information for sheet type recognition. We therefore extract the crossover points of ruled lines, which we call "nodes", from the document, and use the positions and types of these nodes for sheet type recognition. Figure 3 shows the outline of feature extraction for sheet type recognition. As a first step, ruled lines in the input image are extracted from black pixels that form a straight line. A horizontal connected component consisting of at least n_h black pixels is regarded as a horizontal solid line. The same process is applied to extract vertical ruled lines. In this study, n_h was set experimentally to 50; a length of 50 pixels is equivalent to about 4.2 mm at an input resolution of 300 dpi. The value of n_h naturally affects the extraction accuracy of ruled lines, and in some cases parts of characters or underlines are also extracted, as shown by the circled parts in Figure 3(b). These parts may influence the subsequent processes. In the proposed method, however, the detection of crossover points removes these surplus lines, so the choice of n_h is not critical. Such surplus lines can also be removed by adjusting the value of n_h. The next step is to decide the types and positions of the nodes. Since ruled lines have some width, the region where two ruled lines cross usually forms a rectangle, and we set the node position at the center of gravity of this rectangle. From the node position, the ruled lines are then traced outward until they reach other lines. All ruled lines that fail to meet other lines are discarded. In this way, the pattern of the node is decided. Figure 3(c) shows the outline of the classification method.
Generally speaking, a table consists of nine types of crossover points, which are called "nodes" in this paper, and non-crossover points [11][12][13]. We express the table in the document using these features. In our method, ruled lines around the target node are searched first. In the case shown in Figure 3(c), a ruled line exists above the target node, so node types 1, 2 and 3 are excluded as candidates.
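The run-length criterion and the node classification can be sketched as follows. The mapping from line directions to node numbers is an assumption inferred from the text (types 1 to 3 have no line above, types 1 to 6 have a downward element, and types 2, 3, 5, 6, 8 and 9 have a left element), which suggests a 3 × 3 arrangement from the top-left corner (1) to the bottom-right corner (9). The function names are hypothetical.

```python
import numpy as np

N_H = 50  # minimum run length in pixels (about 4.2 mm at 300 dpi)


def horizontal_line_mask(binary, n_h=N_H):
    """Mark pixels that lie on horizontal runs of at least n_h black pixels."""
    mask = np.zeros_like(binary, dtype=bool)
    for y, row in enumerate(binary):
        x, width = 0, len(row)
        while x < width:
            if row[x]:
                start = x
                while x < width and row[x]:
                    x += 1
                if x - start >= n_h:
                    mask[y, start:x] = True
            else:
                x += 1
    return mask


def vertical_line_mask(binary, n_h=N_H):
    """Same criterion applied to columns by transposing the image."""
    return horizontal_line_mask(binary.T, n_h).T


# Assumed numbering, keyed by (up, down, left, right) line presence.
NODE_TYPES = {
    (False, True, False, True): 1,   # top-left corner
    (False, True, True, True): 2,    # top edge
    (False, True, True, False): 3,   # top-right corner
    (True, True, False, True): 4,    # left edge
    (True, True, True, True): 5,     # full cross
    (True, True, True, False): 6,    # right edge
    (True, False, False, True): 7,   # bottom-left corner
    (True, False, True, True): 8,    # bottom edge
    (True, False, True, False): 9,   # bottom-right corner
}


def classify_node(up, down, left, right):
    """Return the node type 1..9, or 0 for a non-crossover point."""
    key = (bool(up), bool(down), bool(left), bool(right))
    return NODE_TYPES.get(key, 0)
```

For example, a node with lines above, below and to the right but none to the left is classified as type 4, which matches the case described for Figure 3(c).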

Feature extraction
In the next step, ruled lines are also searched for on the left, right and bottom of the target node. As a result, the target node is identified as node type 4. The same process is applied to all nodes in the image. The extracted node numbers and their positions are stored in the database for sheet type recognition and cell image extraction. These features express the structure of the table, and the elements of the table can be extracted by using the node types and their positions. Figure 4 illustrates the outline of our sheet type recognition technique. We first set a ROI of n_roi × n_roi pixels at each node obtained above. We then check whether a node of the same type exists in the same ROI of a sheet in the master database, and count the successful cases. The degree of coincidence with that sheet is calculated as the ratio of the number of successes to the total number of nodes registered for the sheet in the master database. Lastly, the sheet type of the image is determined as the sheet with the highest degree of coincidence in the master database. As the master database contains all types of sheets used at Mie University Hospital, and irregular sheet types will occur very rarely, if at all, the proposed method can determine the sheet type with good accuracy.
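The degree of coincidence can be written down in a few lines of Python. This is a minimal sketch, assuming that nodes are stored as (x, y, type) tuples, that the ROI is centered on each input node, and that the master database is a dictionary from sheet name to node list; these data layouts and the value of `n_roi` are not specified in the text.

```python
def coincidence(nodes, master_nodes, n_roi):
    """Ratio of matched input nodes to the number of nodes in the master sheet."""
    if not master_nodes:
        return 0.0
    half = n_roi / 2
    hits = 0
    for x, y, node_type in nodes:
        # Success if the master sheet has a node of the same type inside the ROI.
        if any(mt == node_type and abs(mx - x) <= half and abs(my - y) <= half
               for mx, my, mt in master_nodes):
            hits += 1
    return hits / len(master_nodes)


def recognize_sheet(nodes, master_db, n_roi):
    """Return the master sheet name with the highest degree of coincidence."""
    return max(master_db, key=lambda name: coincidence(nodes, master_db[name], n_roi))
```

Because the denominator is the node count of the master sheet, a master form with many more nodes than the input scores lower even when every input node finds a match.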

Cutout of cell images using node matrix
The elements of the table, which we call "cells", are extracted using the node information. In this study, we use a matrix of node numbers called the "node matrix". Figure 5 illustrates the generation process of the node matrix. The node matrix expresses the structure of the table, so cells can be extracted from the table by using the matrix and the positions of the nodes. Figure 6 shows the outline of the cell extraction method. The node located at the top left of the document is set as the starting point of the extraction. The matrix is then scanned from the starting point from left to right until a node with a downward element, i.e. one of nodes 1 to 6 in Figure 5, appears. In this case, node 2 appears first as a node with a downward element. This node is the top-right point of the cell, and the matrix is scanned downward from this point. When a node with a left element, i.e. one of nodes 2, 3, 5, 6, 8 and 9, appears, it is regarded as the bottom-right point of the cell. The same process is repeated until the starting point appears again. In this paper, the same process is applied to all nodes in the matrix to extract all cells in the table. Since the position of each node is stored in the database, each cell can finally be cut out of the table by using this information.
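The scan over the node matrix can be sketched as follows. This is a simplified illustration that assumes rectangular cells and stops once the bottom-right node is found, rather than tracing the outline back to the starting point. The sets of nodes with downward and left elements come from the text; the set with a right element is inferred from the 3 × 3 node numbering and is an assumption, as is the use of 0 for non-crossover positions.

```python
HAS_DOWN = {1, 2, 3, 4, 5, 6}
HAS_LEFT = {2, 3, 5, 6, 8, 9}
HAS_RIGHT = {1, 2, 4, 5, 7, 8}       # inferred, not stated in the text
STARTS = HAS_DOWN & HAS_RIGHT         # nodes that can be the top-left of a cell


def extract_cells(node_matrix):
    """Return cells as (top_row, left_col, bottom_row, right_col) matrix indices."""
    rows, cols = len(node_matrix), len(node_matrix[0])
    cells = []
    for r in range(rows):
        for c in range(cols):
            if node_matrix[r][c] not in STARTS:
                continue
            # Scan right until a node with a downward element: top-right corner.
            c2 = next((j for j in range(c + 1, cols)
                       if node_matrix[r][j] in HAS_DOWN), None)
            if c2 is None:
                continue
            # Scan down until a node with a left element: bottom-right corner.
            r2 = next((i for i in range(r + 1, rows)
                       if node_matrix[i][c2] in HAS_LEFT), None)
            if r2 is None:
                continue
            cells.append((r, c, r2, c2))
    return cells
```

The pixel rectangle of each cell then follows from the stored positions of its corner nodes.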

Detection of strings and character recognition
String regions in all the cells have to be extracted in order to recognize characters and generate an XML document. The proposed method extracts these regions using the master information. In this chapter, a cell image extracted from a blank table is called a "master cell image", and one extracted from a table filled in by users is called an "inscribed cell image". Since the master cell image sometimes contains printed elements such as the title on the blank sheet, the string regions inscribed by users in each cell are extracted by subtracting the master cell image from the inscribed cell image. However, when the position of the master cell image does not match that of the inscribed cell image, these regions cannot be extracted correctly. Our method therefore first calculates the ratio of difference between the two images and uses it to determine the position for the subtraction. Here, the ratio of difference is obtained from the number of pixels whose values differ between the two images, and the string regions in the cell image are then extracted by the subtraction process. Figure 7 shows the outcome of the string extraction. The figure indicates that regions not inscribed by users are also extracted along with the inscribed string regions. These errors are caused by slight differences in tilt or input conditions between the master cell image and the inscribed cell image, and it is very difficult to eliminate such differences completely. To solve this problem, the proposed method was modified to improve the extraction accuracy. Specifically, the labeling process shown in Figure 8(a) was added. As the first step of the procedure, labeling is applied to the master cell image, and the black pixels belonging to large connected components are changed to white.
After this, the same subtraction process is performed again. Figure 8(b) shows a result of the improved method. Compared with the result in Figure 7, the characters of the master cell image are erased completely and the strings inscribed by users are extracted appropriately. The extraction accuracy of the improved method depends on that of the labeling process. In printed documents, the variations in character size and in the distance between characters are not significant, so the accuracy of the improved method is high enough for practical use. In preliminary experiments, no false extraction of string regions such as that in Figure 7 was observed.
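The position search and subtraction can be illustrated with the following minimal sketch. It assumes binary NumPy images of equal size with 1 for black pixels, and it searches a small window of integer offsets whose size (`max_shift`) is a hypothetical parameter. The labeling step that removes large connected components is not included.

```python
import numpy as np


def shift(img, dy, dx):
    """Shift a binary image by (dy, dx), filling exposed borders with 0 (white)."""
    out = np.zeros_like(img)
    h, w = img.shape
    if dy >= 0:
        src_y, dst_y = slice(0, h - dy), slice(dy, h)
    else:
        src_y, dst_y = slice(-dy, h), slice(0, h + dy)
    if dx >= 0:
        src_x, dst_x = slice(0, w - dx), slice(dx, w)
    else:
        src_x, dst_x = slice(-dx, w), slice(0, w + dx)
    out[dst_y, dst_x] = img[src_y, src_x]
    return out


def best_offset(master, inscribed, max_shift=3):
    """Return the (dy, dx) that minimizes the number of differing pixels."""
    best, best_diff = (0, 0), None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            diff = int(np.count_nonzero(shift(master, dy, dx) != inscribed))
            if best_diff is None or diff < best_diff:
                best, best_diff = (dy, dx), diff
    return best


def extract_inscribed(master, inscribed, max_shift=3):
    """Keep black pixels of the inscribed cell that are absent from the aligned master."""
    dy, dx = best_offset(master, inscribed, max_shift)
    aligned = shift(master, dy, dx)
    return np.logical_and(inscribed, np.logical_not(aligned)).astype(np.uint8)
```

An integer offset search cannot compensate for small residual tilt, which is consistent with the leftover regions reported for Figure 7.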

Features for schema detection
Generally speaking, extracted cell images consist of elements such as character strings, dotted (or broken) lines and schema images. In our method, as a first step, four features are extracted from the cell images to discriminate these elements. In this section, we focus on the shape of dotted lines and schemas. Dotted lines and schemas are assumed to have the following characteristics:
1. The circumscribed rectangle of a schema is larger than that of a single component of a dotted line or a character, and the shape of a schema is vertically (or horizontally) longer than that of a single component of a dotted line or a character.
2. Each component of a dotted line is smaller than a schema, and the components are lined up on a straight line.
To express characteristic 1, we employ the variances in the horizontal and vertical directions, S_x and S_y, and the circumscribed rectangle area A of each connected component. For characteristic 2, the number of connected components lined up on a straight line is employed. We call this feature the horizontal (or vertical) connected level L. Figure 9 illustrates the idea of the horizontal connected level. The center coordinates of each circumscribed rectangle are obtained by labeling, and the center of the target rectangle is connected to the centers of the other rectangles with straight lines. When the tilt of such a line is within ±t degrees, the two circumscribed rectangles are regarded as lying on the same straight line. In this study, the value of t was set experimentally to 0.5, because the theoretical detection accuracy of the LPP method is 0.06 degrees. Vertical dotted lines are not discriminated, because the tabular form documents used in this study do not contain them. The features for discriminating vertical dotted lines can, of course, be calculated easily by extending the above processing. Figure 10 shows the ideal distribution of the features. In this figure, the connected components of schema images have large values of S_x, S_y and A, as shown in Figure 10(a), and the components of dotted lines fall in the region with a large value of L. The character components, in contrast, appear in the region with small values of S_x, S_y and L (Figure 10(b)). It is expected that dotted lines, schemas and characters can be discriminated by applying appropriate thresholds to these features.
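The four features can be computed as in the following sketch. It assumes that each connected component is available as a list of (x, y) pixel coordinates (for example from a labeling step) and that S_x and S_y are the variances of the pixel coordinates; the chapter does not spell out these details, so both are assumptions, and the function names are hypothetical.

```python
import math


def component_features(pixels):
    """Return (S_x, S_y, A, center) for one connected component.

    pixels: list of (x, y) coordinates of the component's black pixels.
    """
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    n = len(pixels)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sum((x - mean_x) ** 2 for x in xs) / n   # horizontal variance
    s_y = sum((y - mean_y) ** 2 for y in ys) / n   # vertical variance
    area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    center = ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
    return s_x, s_y, area, center


def horizontal_connected_level(centers, index, t_deg=0.5):
    """Count other rectangle centers lying within +/- t_deg of horizontal."""
    cx, cy = centers[index]
    level = 0
    for i, (x, y) in enumerate(centers):
        if i == index:
            continue
        angle = math.degrees(math.atan2(abs(y - cy), abs(x - cx)))
        if angle <= t_deg:
            level += 1
    return level
```

Components of a dotted line get a high L because many small rectangles share nearly the same vertical coordinate, while isolated characters and schemas do not.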

Extraction of schemas from cell image
To extract schemas from a cell image, we must decide the threshold values for S_x, S_y and A. Since the objective of this section is to extract schemas from the cell, only these three thresholds are used. (The threshold for L is needed to discriminate characters from dotted lines.) The threshold values were decided by considering the shape of the histogram of each feature. As expected from the previous section, the histograms show bimodal patterns. Each histogram is built with n_c classes, where n_c is the number of classes and n_d is the number of data, and the threshold value is set at the bottom of the valley between the two peaks. With this method, all data having schema characteristics are extracted. In other words, data having the characteristics of dotted lines or characters are not extracted even when they are located inside the schema area. These should be recovered afterwards.
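Valley-based thresholding of a bimodal histogram can be sketched as below. The rule linking n_c to n_d is not recoverable from the text, so this sketch assumes Sturges' rule, n_c = 1 + ceil(log2 n_d), purely as a stand-in. The peak selection heuristic (tallest bin, then the bin maximizing count times distance from it) is also an assumption.

```python
import math


def sturges_classes(n_d):
    """Number of histogram classes by Sturges' rule (an assumed choice)."""
    return 1 + math.ceil(math.log2(n_d))


def valley_threshold(values):
    """Return a threshold at the lowest bin between the two main histogram peaks."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; no valley exists")
    n_c = sturges_classes(len(values))
    width = (hi - lo) / n_c
    counts = [0] * n_c
    for v in values:
        counts[min(int((v - lo) / width), n_c - 1)] += 1
    # First peak: tallest bin. Second peak: tall and far away from the first.
    p1 = max(range(n_c), key=lambda i: counts[i])
    p2 = max(range(n_c), key=lambda i: counts[i] * abs(i - p1))
    a, b = sorted((p1, p2))
    if b - a < 2:
        return lo + b * width            # adjacent peaks: cut at their boundary
    valley = min(range(a + 1, b), key=lambda i: counts[i])
    return lo + (valley + 0.5) * width   # center of the lowest bin in between
```

Applied separately to S_x, S_y and A, this yields one threshold per feature; components exceeding the thresholds would be treated as schema candidates.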

Extraction of schemas from schema area and recovery
In some cases several schemas are placed close together in a document. The schema area obtained in the previous section may then contain several schemas, which should be divided and extracted from the cell image separately. For this purpose, the system includes a dividing process based on the shape of a projection histogram. Figure 11(a) and (b) illustrate the outline of the dividing process. As the first step, we obtain the projection histogram of the schema area in the vertical direction. In this histogram, a run of d_0 consecutive zero-valued elements is regarded as a boundary between schemas, and the image is divided at the middle point of the run. The same processing is applied to divide the image in the horizontal direction. By this processing, the schema region is divided into several mutually independent regions. In this paper, the value of d_0 is given experimentally. Finally, the connected components inside each schema region that were classified as characters are added back to the original schema image (Figure 11(c)).
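The dividing process can be sketched with projection histograms as follows. The sketch assumes a binary NumPy image with 1 for black pixels, splits first along the x axis and then along the y axis, and ignores zero runs that touch the image border; the order of the two passes and the border handling are assumptions.

```python
import numpy as np


def split_positions(profile, d0):
    """Return cut positions at the middle of interior zero runs of length >= d0."""
    cuts = []
    n = len(profile)
    i = 0
    while i < n:
        if profile[i] == 0:
            start = i
            while i < n and profile[i] == 0:
                i += 1
            # Runs touching the border are margins, not boundaries between schemas.
            if start > 0 and i < n and i - start >= d0:
                cuts.append((start + i) // 2)
        else:
            i += 1
    return cuts


def split_schema_area(binary, d0):
    """Split a schema area into boxes (y0, y1, x0, x1) along blank gaps."""
    h, w = binary.shape
    x_cuts = [0] + split_positions(binary.sum(axis=0), d0) + [w]
    boxes = []
    for x0, x1 in zip(x_cuts[:-1], x_cuts[1:]):
        part = binary[:, x0:x1]
        y_cuts = [0] + split_positions(part.sum(axis=1), d0) + [h]
        for y0, y1 in zip(y_cuts[:-1], y_cuts[1:]):
            boxes.append((y0, y1, x0, x1))
    return boxes
```

A larger d_0 keeps loosely drawn parts of one schema together, while a smaller d_0 separates schemas that are placed close to each other.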

Schema recognition using weighted direction index histogram method
The weighted direction index histogram method (WDIHM) is a feature extraction method that is often used in handwritten character recognition systems [14][15][16]. Figure 12 illustrates the outline of this method. The method first traces the contour of the character image, and direction index histograms are generated in each sub-region using chain codes. A spatial weighting filter based on a Gaussian distribution is then applied to the obtained histograms to generate a feature vector. WDIHM is robust to local shape variations of the input character images. As its accuracy is very high compared with other character recognition algorithms, it is widely employed [17][18][19][20]. Figure 13 shows the outline of the schema image recognition method using WDIHM. For schema image recognition, we first have to build a dictionary. Many images are required for this: since the method divides the input images into sub-regions and calculates a covariance matrix of the resulting feature vectors, the dimension of the feature vector is very large. To build the dictionary, we used not only the basic schema images employed in the hospital, shown in Figure 14(a), but also additional images, e.g. rotated and shifted ones (Figure 14(b)). There are more than 120 kinds of schema images used in HIS, but in this study we selected only five typical kinds, shown in Figure 14(a), to examine the effectiveness of the proposed method. For the recognition of input schema images, we employ a discriminant function called the Modified Bayes Discriminant Function (MBDF) [14,15]. In this function, x is the n-dimensional feature vector of the input schema image, and μ_l is the average vector of schema image l in the dictionary. λ_{l,i} and φ_{l,i} are the i-th eigenvalue and eigenvector of schema image l, respectively.
The value of k_1 is determined by the number of learning samples m (1 ≤ k_1 ≤ m, n). The higher-order eigenvalues are in many cases not used, because they increase the calculation time while contributing little to recognition accuracy. In our case, however, the higher-order eigenvalues and eigenvectors are necessary to improve recognition accuracy, since the structure of the characters (or schema images) is very complex. As the absolute values of the higher-order eigenvalues are very small and their true values are difficult to obtain, λ_{k_1+1} is used as the approximation of λ_i (i = k_1 + 1, ···, n). In this study, the number of sub-regions and the value of k_1 were determined based on the literature [14,15].
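The role of k_1 and of the substituted eigenvalue λ_{k_1+1} can be illustrated with a small NumPy sketch. The chapter's own formula is not reproduced here, so the sketch assumes one common form of the modified quadratic (Bayes) discriminant function from the character recognition literature, in which the first k_1 eigenvalues are used as they are and all remaining ones are replaced by λ_{k_1+1}. The eigenvalue floor and the function names are hypothetical additions.

```python
import numpy as np


def fit_class(samples, k1):
    """Return (mean, top k1+1 eigenvalues, top k1 eigenvectors) of one class."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # ascending order
    order = np.argsort(vals)[::-1]            # reorder to descending
    vals, vecs = vals[order], vecs[:, order]
    return mu, vals[:k1 + 1], vecs[:, :k1]


def mbdf(x, mu, vals, vecs, floor=1e-6):
    """Assumed discriminant value; smaller means closer to the class."""
    k1 = vecs.shape[1]
    n = x.shape[0]
    lam = np.maximum(vals[:k1], floor)        # lambda_1 .. lambda_k1
    lam_rest = max(float(vals[k1]), floor)    # lambda_{k1+1} replaces the rest
    d = x - mu
    proj = vecs.T @ d                         # components along the eigenvectors
    main = np.sum(proj ** 2 / lam)
    residual = (d @ d - np.sum(proj ** 2)) / lam_rest
    log_det = np.sum(np.log(lam)) + (n - k1) * np.log(lam_rest)
    return float(main + residual + log_det)


def recognize(x, dictionary):
    """Return the dictionary entry with the smallest discriminant value."""
    return min(dictionary, key=lambda name: mbdf(x, *dictionary[name]))
```

Because the residual term needs only the squared norm of x − μ, the eigenvectors beyond k_1 never have to be computed or stored, which is the practical benefit of the approximation.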

Accuracy of sheet type recognition
To make our system robust against misalignment of medical records on the scanner, we introduced the ROI of n_roi × n_roi pixels in Section 3.4.2. However, if the misalignment exceeds this range for some reason, such as distortion caused by human handling or a mechanical error of the copying machine, a further improvement will be necessary. We used the following three techniques in the recognition method, and examined their accuracy and processing time using 325 sheets.

Features for schema image extraction
Figure 15 shows an example of the distribution of the features extracted from an input image. The obtained distribution was similar to the ideal one shown in Figure 10. In this experiment, we also applied the extraction method to six kinds of printed discharge summary documents [21], which contain dotted (or broken) lines and schema images. The distributions obtained for these six cases were almost the same as the ideal one. These results indicate that these elements can be separated with good accuracy by using linear discriminant functions on these features. [...] figures, but the images can easily be acquired by subtracting (b), (c) and (d) from the input image (a). To evaluate the effectiveness of the proposed method on handwritten summary documents, we also applied it to such cases. Figure 18 shows an example of the results. Figure 18(a) is a gynecology summary with some schema images. In this case, the medical records were written on a sheet with ruled lines. The result shows that each schema can be extracted even from such a handwritten summary. However, some characters were regarded as ruled lines because they were located on the original ruled lines (Figure 18(b)). In addition, some characters were extracted together with the schema (Figure 18(d)), because the obtained circumscribed rectangle contained these characters. A method to eliminate them has to be added to the current extraction method.

Recognition Result
Table 2 shows the results of schema image recognition. In this table, each row corresponds to the schema type of the input image and each column to the recognition result. The table shows that the recognition accuracy of the proposed method was more than 90%. Figure 19 shows examples of correctly recognized images. These images were recognized appropriately by the proposed method even though they contain marks, comments and lead lines for explanations. These results indicate that a dictionary with various schema images may not be necessary for recognition if the input images do not have many annotations. On the other hand, schema images with large marks or many annotations were not recognized correctly (Figure 20). Table 3 shows the values given by the discriminant function. In these cases, the large marks (or lead lines) drastically changed the contour shape of the input image, and as a result the distance between the input image and the original schema image was larger than that between the input image and the recognition result. In addition, the proposed method outputs the schema type with the smallest distance as the recognition result. Thus it is difficult to detect schema images not [...]

Generated XML documents
Characters in the extracted strings have to be recognized and converted to text data by an Optical Character Reader (OCR) engine. The main strength of our method is that, by using the master database, we can define the document type of each frame before the recognition of cell images starts, and can therefore use any OCR engine suited to that type. We found, however, that some work is necessary to create interfaces between the various OCR engines and our system. At present, we use a commercially available OCR library developed by Panasonic Solution Technology, Inc. [22]. The table structure and characters acquired by the proposed method are used to generate an XML file. In this study, an XSL file defining the table structure of the document is first generated from the acquired node matrix. The table structure is defined by table tags in the XSL. In the next step, an XML document is generated using the XSL and the converted text data corresponding to the contents of each cell. Figures 21 and 22 show examples of generated XML files. In the experiments, the table structures of all input images were recognized correctly. For documents with a schema image, the recognition results, i.e. the schema type and the annotation part, were inserted into the generated XML file (schema tag in Figure 22). In the present experiments, some characters were misrecognized. These errors may come from the OCR engine itself. To reduce such errors, it would be necessary to use an OCR engine suited to the scope of the documents analyzed.
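The final assembly step can be illustrated with Python's standard `xml.etree.ElementTree` module. This is only a sketch of how recognized cell contents might be serialized: it builds the XML directly instead of going through an XSL file, and the element names, attribute names and sample values (`document`, `cell`, `type`, `schema` and so on) are hypothetical, not the tag set used in Figures 21 and 22.

```python
import xml.etree.ElementTree as ET


def build_xml(sheet_type, cells):
    """Serialize recognized cells into an XML string.

    cells: list of dicts with keys "row", "col", "tag", "text" and an
    optional "schema" key holding the recognized schema name.
    """
    root = ET.Element("document", {"sheetType": sheet_type})
    table = ET.SubElement(root, "table")
    for cell in cells:
        element = ET.SubElement(table, "cell", {
            "row": str(cell["row"]),
            "col": str(cell["col"]),
            "type": cell["tag"],     # document type taken from the master data
        })
        element.text = cell["text"]  # OCR result for this cell
        if "schema" in cell:
            ET.SubElement(element, "schema", {"name": cell["schema"]})
    return ET.tostring(root, encoding="unicode")
```

For instance, `build_xml("discharge_summary", [{"row": 0, "col": 0, "tag": "diagnosis", "text": "appendicitis"}])` returns a single `document` element whose `table` child holds one `cell` element.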

Developed system for similar case search
As stated in the introduction, the objective of our work is to create a system that is actively used in healthcare sectors, so that a large volume of paper-based medical records can be included in the e-health environment. For this objective, it is necessary to demonstrate quickly the usability and capability of the method for clinical requirements. Although the research is ongoing, we have developed a prototype to demonstrate what can be done with this approach. We developed a system to search similar cases using Microsoft Visual C# .NET. Figure 23(a) and (b) show a photograph and a screenshot of the developed system, respectively. The system uses a wizard form with icons to improve usability. When the system is started, the wizard window appears at the top left of the root window and guides users who are not experts in information systems. The wizard window consists of components such as "Image Input", "System Configuration", "Scanning" and "Generated XML Viewer". The image input component supports various input methods. For example, document images can be input from TWAIN devices as well as from image files such as Bitmap, JPEG or PDF files. The system configuration component is designed to guide users through setting the system parameters. When the scanning process is finished, the structure (and contents) of the input document image are recognized, and an XML file is generated. It takes several tens of seconds before the XML document is generated. The generated XML file is shown in the viewer window (Figure 23(c)). After this, similar cases can be searched for among the stored documents by using keywords, as shown in Figure 23(d). Since the generated XML documents are highly compatible with relational databases, they can easily be imported into hospital information systems.
If data mining software (or systems) such as data warehouses and OLAP tools can be used, these XML documents could be used even more effectively for clinical and medical studies.

Related works
Studies on document image analysis systems have been reported in [23]-[31]. As work related to ruled line extraction, detection methods using the Hough transform are reported in [23]-[27]. In particular, in [23] and [24], complex line shapes can be extracted using a pattern-matching method and the Hough transform. In [27], the authors propose detection methods for character patterns, general curved lines, quadratic curves and circular patterns using the concept of [23] and [24], and discuss their effectiveness. These methodologies may have higher extraction accuracy than the proposed method, but they require a large amount of calculation time because the algorithms are complex. In practical situations, processing time is the most important factor in evaluating systems. Therefore, it is not realistic to employ them where a large number of documents are processed.
As for methods related to document layout and structure recognition, [28] reports a table structure recognition method based on block segmentation, and [29] tries to extract the contents of printed document images using model checking. The method of [28], however, depends on the output of commercial OCR systems. In contrast, our proposed method identifies table types, i.e. document types, using a node matrix and the positions of nodes. The node matrix can be acquired easily from the extracted ruled lines, and the lines themselves are obtained by very simple image processing techniques. The proposed method does not depend on an external image processing library. In the case of [29], only the logical structures in the documents are detected using image analysis, and the system is not designed to reuse the information. In a different field, methods to analyze cultural heritage documents are reported by Ogier et al. [30], where document analysis techniques are employed to preserve and archive cultural heritage documents. Literature [31] reports a prototypical document image analysis system for journals. Most of these studies mainly describe methodology and processing for typical business letters. According to the authors' survey, only a few articles propose document image recognition methods for medical documents, such as patient discharge summaries, with the aim of searching similar cases.
In medical fields, many novel information systems have been studied. As one example, we introduce here a new concept and system to assure lifelong readability of medical records in HIS. Figure 24 illustrates the outline of the concept, called the Document Archiving and Communication System (DACS) [32]. Since the lifespan of computer systems is usually very short compared with the period over which a patient's medical records are needed, great care is necessary in shifting from a paper-based to a computer-based society. DACS is a system that addresses this problem. Because medical science progresses rapidly, the electronic health record systems in use today will never fully mature, and the system architecture itself keeps changing. It is sometimes very difficult to retrieve data created by a previously used system. Although electronic health record systems offer utilities to retrieve any type of data in the database, they lose the ability of paper systems to let readers grasp many features at a glance. Prof. Matsumura et al. deliberately introduce a combination of these two concepts. In DACS, medical records are treated not as data but as an aggregation of documents. The medical documents generated by the electronic health record system are converted to PDF (or JPEG, TIFF, DocuWorks) and XML files. By converting the data to such files, the readability of the data is guaranteed, and the meta-data of the documents, e.g. timestamp, patient ID and document type, are used as key information for searching. These files are then delivered to the Document Archive Server of DACS, where system users can view and search the stored documents easily. The document deliverer of DACS can also deliver the generated files (and XML data) to other systems such as a data warehouse (DWH), so that the data can be used for clinical analyses and studies. DACS supports not only the data stored in HIS but also other data types, e.g. paper-based documents, data from other applications and PDF files generated by other systems. In the case of a paper-based document, the target document is scanned by an optical scanning device and converted into a PDF file. The meta-data of the document are obtained by scanning a sheet with a QR code. This sheet is generated before scanning from clinical data stored in the HIS (or from data input to DACS by hand). The generated PDF file and its meta-data are delivered by the document deliverer and stored in the database. In this way, DACS preserves the readability of medical records and supports various data types.
One of the current problems of DACS is that meta-data have to be created manually. Our method can address this problem, as much of the meta-data can be extracted automatically from the images, which would contribute to improving DACS.

Toward the future
In this chapter we introduced document image recognition, keyword extraction and automatic XML generation techniques for searching similar cases in paper-based medical documents. These techniques were developed for practical use in healthcare sectors, to help incorporate vast volumes of paper-based medical records into the e-health environment. Good usability, speed, robustness, low running cost and automated execution are the key requisites for such a system to be used in practice, and our system satisfies many of these requirements. These characteristics of our system mainly come from the use of master information, which covers almost all types of medical documents. However, many problems remain unsolved. One of the largest concerns about our system is whether similar accuracy and effectiveness can be obtained for documents without tables. As stated in Section 3.1, there are many paper-based medical documents without tables. Even in such cases, however, they are not written randomly in a free format. Since medical records are the most important documents for physicians to maintain continuity of healthcare, their format has been deliberately designed and used. It is therefore quite plausible that a medical document without tables will match one of the entries in the master information if frame lines are inserted into it. If so, it may not be so difficult to improve the algorithm for determining the best-suited sheet so that it includes mass or area information.