Node features in the augmented dependency tree.
Relation extraction refers to the method of efficient detection and identification of predefined semantic relationships within a set of entities in text documents (Zelenco, Aone, & Richardella, 2003; Zhang, Zhou, and Aiti, 2008). The importance of this method was recognized first at the Message Understanding Conference (MUC, 2001) that had been held from 1987 to 1997 under the supervision of DARPA Defense Advanced Research Projects Agency of the U.S. National Institute of Standards and Technology of the U.S.
Defense Advanced Research Projects Agency of the U.S.
National Institute of Standards and Technology of the U.S.
According to ACE, an entity in the text is a representation for naming a real object. Exemplary entities include the names of persons, locations, facilities and organizations. A sentence including these entities can express the semantic relationships in between them. For example, in the sentence “
Many relation extraction techniques have been developed in the framework of various technological workshops mentioned above. Most relation extraction methods developed so far are based on supervised learning that requires learning collections. These methods are classified into feature-based methods, semi-supervised learning methods, bootstrapping methods, and kernel-based methods (Bach & Badaskar, 2007; Choi, Jeong, Choi, and Myaeng, 2009). Feature-based methods rely on classification models for automatically specifying the category where a relevant feature vector belongs. At that, surrounding contextual features are used to identify semantic relations between the two entities in a specific sentence and represent them as a feature vector. The major drawback of the supervised learning-based methods, however, is that they require learning collections. Semi-supervised learning and bootstrapping methods, on the other hand, use a large corpora or web documents, based on reduced learning collections that are progressively expanded to overcome the above disadvantage. Kernel-based methods (Collins & Duffy, 2001), in turn, devise kernel functions that are most appropriate for relation extraction and apply them for learning in the form of a kernel set optimized for syntactic analysis and part-of-speech tagging. The kernel function itself is used for measuring the similarity between two instances, which are the main objects of machine learning. General kernel-based models will be discussed in detail in Section 3.
As one representative approach of the feature-based methods, (Kambhatla, 2004) combines various types of lexical, syntactic, and semantic features required for relation extraction by using maximum entropy model. Although it is based on the same type of composite features as that proposed by Kambhatla (2004), Zhou, Su, Zhang, and Zhang (2005) make the use of support vector machines for relation extraction that allows flexible kernel combination. Zhao and Grishman (2005) have classified all features available by that point in time in order to create individual linear kernels, and attempted relation extraction by using composite kernels made of individual linear kernels. Most feature-based methods aim at applying feature engineering algorithms for selecting optimal features for relation extraction, and application of syntactic structures was very limited.
Exemplary semi-supervised learning and bootstrapping methods are Snowball (Agichtein & Gravano, 2000) and DIPRE (Brin, 1999). They rely on a few learning collections for making the use of bootstrapping methods similar to the Yarowsky algorithm (Yarowsky, 1995) for gathering various syntactic patterns that denote relations between the two entities in a large web-based text corpus. Recent developments include KnowItAll (Etzioni, et al., 2005) and TextRunner (Yates, et al., 2007) methods for automatically collecting lexical patterns of target relations and entity pairs based on ample web resources. Although this approach does not require large learning collections, its disadvantage is that many incorrect patterns are detected through expanding pattern collections, and that only one relation can be handled at a time.
Kernel-based relation extraction methods were first attempted by Zelenco, et al. (2003). Zelenco, et al., devised contiguous subtree kernels and sparse subtree kernels for recursively measuring similarity of two parse trees in order to apply them to binary relation extraction that demonstrated relatively high performance. After that, a variety of kernel functions for relation extraction have been suggested, e.g., dependency parse trees (Culotta and Sorensen, 2004), convolution parse tree kernels (Zhang, Zhang and Su, 2006), and composite kernels (Choi et al., 2009; Zhang, Zhang, Su and Zhou, 2006), which show even better performance.
In this chapter, case analysis was carried out for kernel-based relation extraction methods, which are considered to be the most successful approach so far. Of course, some previous survey papers based on the importance and effect of the methodology have been published (Bach and Badaskar, 2007; Moncecchi, Minel and Wonsever, 2010). However, they fail to fully analyze particular functional principles or characteristics of the kernel-based relation extraction models announced so far, and just cite the contents of individual articles or describe limited analysis. Although the performance of most kernel-based relation extraction methods has been demonstrated on the basis of ACE evaluation collections, comparison and analysis of the overall performance has not been made so far.
This chapter, unlike existing case studies, makes a close analysis of operation principles and individual characteristics of five kernel-based relation extraction methods starting from Zelenco, et al. (2003) which is the source of kernel-based relation extraction studies, to the composite kernel, which is considered the most advanced kernel-based relation method (Choi, et al., 2009; Zhang, Zhang, Su, et al., 2006). The focus will be laid on the ACE collection to compare the overall performance of each method. We hope this study will contribute to further research of kernel-based relation extraction of even higher performance and to high-level general kernel studies for linguistic processing and text mining.
Section 2 outlines supervised learning-based relation extraction methods and in section 3 we discuss kernel-based machine learning. Section 4 closely analyzes five exemplary kernel-based relation extraction methods. As mentioned above, Section 5 also compares the performance of these methods to analyze advantages and disadvantages of each method. Section 6 draws a conclusion.
2. Supervised learning-based relation extraction
As discussed above, relation extraction methods are classified into three categories. The difference between feature-based and kernel-based methods is shown in the following Figure 1. With respect to machine learning procedure, these two are different from semi-supervised learning methods.
On the left of Figure 1, individual sentences that make up a learning collection have at least two entities (black square) of which the relation is manually extracted and predefined. Since most relation extraction methods studied so far work with binary relations, learning examples are modified for convenient relation extraction from the pair of entities by preprocessing the original learning collection. These modified learning examples are referred to as
The aforementioned modification is closely related to feature information used in relation extraction. Since most supervised learning-based methods use both the entity itself and the contextual information about the entity, it is important to collect contextual information efficiently for improving performance. Linguistic processing (part-of-speech tagging, base phrase recognition, syntactic analysis, etc.) for individual learning sentences in the pre-processing step contributes to making a base for effective feature selection and extraction. For example, when one sentence shown in the above Figure 1 goes through syntactic analysis, one relation instance is composed of a parse tree and the locations of entities indicated in the parse tree (Fundel, Küffner, & Zimmer, 2007; Zhang, et al., 2008; Zhou, Zhang, Ji, and Zhu, 2007). A single sentence can be represented as a feature vector or a syntactic graph (Jiang and Zhai, 2007; W. Li, Zhang, Wei, Hou, and Lu, 2008; Zhang, Zhang, and Su, 2006). The type of such relation instances depends on the relation extraction methods, and can involve various preprocessing tasks as well (D. P. T. Nguyen, Matsuo, and Ishizuka, 2007; Zhang, Zhang, and Su, 2006).
In general, feature-based relation extraction methods follow the procedures shown in the upper part of Figure 1. That is, feature collections that “
As shown in the Figure 1, kernel functions (linear, polynomial, and sigmoid) can be used in feature-based methods as well. They are, however, functions applied only to instances expressed with vectors. On the other hand, kernel-based methods are not limited in terms of the type of the instance, and thus can contain various kernel functions.
3. Overview of kernel-based machine learning methods
Most machine learning methods are carried out on a feature basis. That is, each instance to which an answer (label) is attached, is modified into a feature sequence or
In section 2, we discussed that the essence of feature-based methods is to create a feature vector that can best express individual learning examples. However, in many cases, it is not possible to express the feature vector reasonably. For example, a feature space is required for expressing syntactic information Dependency grammar relation, parse tree, etc. between words in a sentence.
Dependency grammar relation, parse tree, etc. between words in a sentence.
Kernel-based learning methods are used also for natural language processing. Linear, polynomial and Gaussian kernels are typical in simple feature vector-based machine learning. Convolutional kernel (Collins & Duffy, 2001) is used for efficient learning of structural data such as trees or graphs. The convolution kernel is a type of kernel function featuring the idea of sequence kernels (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002), tree kernels (Culotta & Sorensen, 2004; Reichartz, Korte, & Paass, 2009; Zelenco, et al., 2003; Zhang, Zhang, & Su, 2006; Zhang, et al., 2008), and graph kernels (Gartner, Flach, & Wrobel, 2003). The convolutional kernel can measure the overall similarity by defining “sub-kernels” for measuring similarity between the components of an individual entity and calculating similarity convolution among the components. For example, the sequence kernel divides the relevant sequence into subsequences for measuring similarity of two sequences to calculate overall similarity by means of similarity measurement between the subsequences. Likewise, the tree kernel divides a tree into its sub-trees to calculate similarity between them and then it calculates the convolution of these similarities.
As described above, another advantage of the kernel methods is that learning is possible as a single kernel function for input instance collections of different type. For example, (Choi et al., 2009; Zhang, Zhang, Su, et al., 2006) have demonstrated a composite kernel for which the convolutional parse tree kernel is combined with the entity kernel for high-performance relation extraction.
4. Kernel-based relation extraction
The most prominent characteristic of the relation extraction models derived so far is that linguistic analysis is used to carefully identify relation expressions and syntactic structures directly and indirectly expressed in specific sentences. In this section, five important research results are discussed and analyzed. Of course, there are many other important studies that have drawn much attention due to their high performance. Most of approaches, however, just modify or supplement the five basic methods discussed below. Therefore, this study can be an important reference for supplementing existing research results in the future or studying new mechanisms for relation extraction, by intuitively explaining the details of major studies. Firstly, tree kernel methods originally proposed by Zelenco, et al. (2003) are covered in detail. Then, the method proposed by Culotta & Sorensen (2004) is covered where the dependency tree kernel was used for the first time. Also, kernel-based relation extraction (Bunescu & Mooney, 2005) using dependency path between two entities in a specific sentence on the basis of similar dependency trees is discussed. Additionally, the subsequence kernel-based relation extraction method proposed by (Bunescu & Mooney (2006) is explained. Finally, the relation extraction models (Zhang, Zhang, Su, et al., 2006) based on the composite kernel for which various kernels are combined on the basis of the convolution parse tree kernel proposed by Collins & Duffy (2001) are covered in detail.
4.1. Tree kernel-based method (Zelenco, et al., 2003)
This study is known as the first application of kernel method to relation extraction. The parse trees derived from shallow parsing are used for measuring similarity between the sentences containing entities. In their study, the REES (Relation and Event Extraction System) is used to analyze part-of-speeches and types of individual words in a sentence, as well as the syntactic structure of the sentence in question. Here, the REES is a relation and event extraction system developed by Aone & Ramos-Santacruz (2000). The Figure 2 below shows example result from the REES.
As shown in Figure 2, when the syntactic analysis of the input sentence is completed, particular information for words of the sentence is analyzed and extracted. Four types of attribute information are attached to all words or entities other than articles and stop words. Type represents the part-of-speech or entity type of the current word. While “
As one can see, it is possible to identify and extract the relation between the two entities in a specific sentence at a given level if the REES is used. Since the system, however, was developed on the basis of rules, it has some limitations in terms of scalability and generality (Zelenco et al., 2003). Zelenco et al. (2003) have constructed tree kernels on the basis of the REES analysis result with a view of better relation extraction so as to overcome the limitations of machine learning and low performance. The kernel function defined in this study for measuring the similarity of a pair of shallow parsing trees consists of the following chain of equations. The comparison function for each individual configuration node in the trees is as follows.
In this equation,
Equation 2 represents the function for deciding whether two nodes comprise the same words or entities. (Zelenco et al., 2003) named this
In Equation 3,
The original sentence of the parse tree on the left side is “
Equation 4 is used to calculate the tree kernel between the two trees
Figure 4 shows the process of calculating the kernel function between the second-level child nodes of the two trees. Since all nodes at this level have unexpectedly the matching node type as shown in Figure 3, kernel similarity between the subsequences of each matching node as shown in the equations on the right side of Figure 3. Since matching only between subsequences of the same length is implemented as in Equation 3, non-matching subsequences are excluded from kernel calculation through conformity check among subsequences of which the length is 1, 2 and 3 respectively. The result of kernel calculation is as follows.
As Equation 6 shows, only the Equation expressed on the basis of is left, other than
Figure 5 shows the process and method of calculating the kernel function. Basically, for the tree kernel,
Zelenco, et al., (2003) divides tree kernel calculation into two types. One is the
Figure 6 shows the parsing result for “
Figure 7 shows the process of calculating
For measuring the performance of the proposed two tree kernels, (Zelenco et al., 2003) used 60% of data manually constituted as a learning collection, and carried out 10-fold cross validation. Only two relations, that is, “
Their study is generally recognized as an important contribution for devising kernels for efficient measuring of similarity between very complex tree-type structure entities to be used later for relation extraction. Since various information other than the syntactic information is still required, the method highly depends on the performance of the REES system for creating parse trees. Because the quantity of data was not enough for the test and the binary classification test for only two types of relation was carried out, scalability or generality of the proposed kernel was not analyzed in detail.
4.2. Dependency tree kernel-based method (Culotta & Sorensen, 2004)
As the ACE collection was constituted and distributed from 2004, relation extraction has been fully studied. Culotta and Sorensen (2004) proposed a kernel-based relation extraction method, which uses the dependency parse tree structure on the basis of the tree kernel proposed by Zelenco, et al., (2003) described in section 4.1. This special parse tree, called an
Figure 8 shows a sample augmented dependency tree used in this study. The root of the tree is the verb “
|Detailed POS (24)||NN, NNP|
|General POS (5)||Noun, Verb, Adjective|
|Chunking Information||NP, VP, ADJP|
|Entity Type||person, geo-political-entity|
|Entity Level||name, nominal, pronoun|
|WordNet hypernyms||social group, city|
|Relation Argument||ARG_A, ARG_B|
In Table 1, the first four features (words, part-of-speech information, and phrase information) are the information obtained from parsing, and the rest are named entity features from the ACE collection. Among them, the WordNet hypernym is the result of extracting the highest node for corresponding word from the WordNet database.
As discussed above, the tree kernel defined by (Zelenco et al., 2003) is used in this method. Since the features of each node are added, the matching function (Equaiton 1) and the similarity function (Equation 2) defined by Zelenco, et al., (2003) are accordingly modified into and applied. In detail, the features to be applied to the matching function and the features to be applied to the similarity function from among 8 features were dynamically divided to devise the following models to be applied.
For the evaluation, the initial ACE collection version (2002) released in 2003 was used. This collection defines 5 entity types and 24 types of relations. Culotta & Sorensen (2004) tested relation extraction only for the higher 5 types of relation collections (“ Binary classification for identifying possible relation between two named entities. Relation extraction for all instances with relations in the result of relation identification.
Binary classification for identifying possible relation between two named entities.
Relation extraction for all instances with relations in the result of relation identification.
4.3. Shortest path dependency tree kernel method (Bunescu & Mooney, 2005)
In section 4.2, we have discussed relation extraction using dependency parse trees for the tree kernel proposed by Zelenco, et al., (2003). Bunescu & Mooney (2005) have studied the dependency path between the two named entities in the dependency parse tree with a view to proposing the shortest possible path dependency kernel for relation extraction. There is always a dependency path between two named entities in a sentence and Bunescu & Mooney (2005) argued that the performance of relation extraction is improved by using the syntactic paths. The Figure 9 below shows the dependency graph for a sample sentence.
The red node in Figure 9 represents the named entity specified in the ACE collection. Separation of the entire dependency graph results in 10 dependency syntax pairs. It is possible to select pairs, which include named entities in the syntax pairs to construct the dependency path as shown in Figure 10.
As one can see from Figure 10, it is possible to construct the dependency path between the named entities, “
For learning, Bunescu & Mooney (2005) have extracted the shortest dependency paths, including two entities from individual training instances as shown in the Table below.
As shown in Table 2, each relation instance is expressed as a dependency path whose both ends are named entities. In terms of learning, however, it is not easy to extract sufficient features from such instances. Therefore, as discussed in section 4.2, various supplementary information is created, such as part-of-speech, entity type and WordNet synset. As a result, individual nodes which make up the dependency path comprise a plurality of information elements, and a variety of new paths are finally created as shown in Figure 11.
As shown in Figure 11, with more information available for individual nodes, new 48 dependency paths can be created through Cartesian product of the node values. Here, relation extraction is carried out by applying the dependency path kernel for calculating redundancy of the information included in each node rather than comparing all newly created paths, as shown below.
In the above Equation 8,
As shown in Figure 12, the process of comparing two dependency paths is very simple. If the length of two paths is different, the kernel function simply returns zero (0). Otherwise, the level of information redundancy is then calculated for each node with respect to two paths. Since all the corresponding values are identical in the first node (“his”, “PRP” and “Person”), the output is set to 3. As one matches in the second node, 1 is returned. By exponentiating all the calculated values, the kernel value is found to be 18.
On the basis of the same test environment as the collection used by Culotta & Sorensen (2004), two parsing systems, the CCG parser (Hockenmaier & Steedman, 2002) and the CFG parser (Collins, 1997) have been used to construct the shortest dependency path. The test included
With regard to relation extraction, the shortest dependency path information is considered to be very useful, and is highly likely to be used in various fields. However, the kernel structure is too simple. Yet another limitation is that only the paths of the same length are included in calculating similarity of two dependency paths.
4.4. Subsequence kernel-based method (Bunescu & Mooney, 2006)
The tree kernel presented by Zelenco, et al., (2003) is to compare two sibling nodes basically at the same level and uses the subsequence kernel. Bunescu & Mooney (2006) introduced the subsequence kernel and attempted relation extraction only with the base phrase analysis (chunking), without applying the syntactic structure. Since kernel input is not of a complex syntactic structure, but base phrase sequences, the assumption was that the feature space can be divided into 3 types to comprise maximum 4 words for each type of features as follows, by using the advantage of easy selection of contextual information essential for relation extraction.
In Figure 13, [FB] represents the words positioned before and between the two entities; [B] means only the word between them; and [BA], accordingly, means word collections between and after. The 3 types of feature collections can accept individual relation expressions, respectively. Furthermore, various types of supplementary word information (part-of-speech, entity type, WordNet synset, etc.) are used to expand them as in the methods described above.
Zelenco et al., (2003) described how to calculate the subsequence kernel, which will be described in detail again later. The kernel calculation function
Where, i and j represent subsequences contained in
As shown in Figure 14, each of the nodes that consist of analysis result has 8 types of lexical information (word, part-of-speech, base phrase type, entity type, etc.) The kernel value,
There are three subsequence pairs which are decided that the node of all subsequence is at least 0 by means of the homogeneity decision function
As described above, it is possible to construct the subsequence kernel function based on the subsequences of all lengths by using contextual location information and Equation 8 as described above. Figure 16 shows surrounding contextual location where named entities occur in the sentence.
In Figure 16,
The subsequence kernel
In Equation 11,
The performance evaluation in the same test environment as used in Sections 4.2 and 4.3 shows increased performance, even without complicated pre-processing, such as parsing, and without any syntactic information. In conclusion, the evaluation shows that this method is very fast in terms of learning speed and is an approach with a variety of potentials for improving performance.
4.5. Composite kernel-based method (Zhang, Zhang, Su, et al., 2006)
The release of ACE 2003 and 2004 versions contributed to full-scale study on relation extraction. In particular, the collection is characterized by even richer information for tagged entities. For example, ACE 2003 provides various entity features, e.g., entity headwords, entity type, and entity subtype for a specific named entity, and the features have been used as an important clue for determining the relation between two entities in a specific sentence. In this context, Zhang, Zhang, Su, et al., (2006) have built a composite kernel for which the convolution parse tree kernel proposed by Collins & Duffy, (2001) is combined with the entity feature kernel. Equation 11 gives the entity kernel definition.
In Equation 12,
Second, the convolution parse tree kernel expresses one parse tree as an occurrence frequency vector of a subtree as follows so as to measure the similarity between the two parse trees.
In Equation 13, #subtreei(T) represents the occurrence frequency of the i-th subtree. All parse trees are expressed with the vector as described above, and the kernel function is calculated as the inner product of two vectors as follows.
Figure 17 shows all subtrees of a specific parse tree. There are nine subtrees in the figure altogether, and each subtree is an axis of the vector, which expresses the left side parse tree. If the number of all unified parse trees that can be extracted for
As shown in Figure 17, there are two constraints for a subtree of a specific parse tree. First, the number of nodes of the subtree must be at least 2, and the subtree should comply with production rules used by syntactic parser to generate parse trees of sentences (Collins & Duffy, 2001). For example, [VP
It is necessary to investigate all subtrees in the tree T, and calculate their frequency so as to build the vector for each of parse trees. This process is quite inefficient, however. Since we need only to compute the similarity of two parse trees in the kernel-based method, we can come up with indirect kernel functions without building the subtree vector for each parse tree as with Equation 13. The following Equation 15, proposed by Collins & Duffy (2001), is used to calculate efficiently the similarity of two parse trees
In Equation 15, Ti represents a specific parse tree, and Ni represents all node collections of the parse tree. Ist(n) is the function for checking whether the node n is the root node of the specific subtree st. The most time-consuming calculation in Equation 15 falls on calculating To enhance this, Collins & Duffy (2001) came up with the following algorithm.
The function defined in (3) of Figure 18 compares the child nodes of the input node to calculate the frequency of subtrees contained in both parse trees and the product thereof, until the end conditions defined in (1) and (2) are satisfied. In this case, the decay factor , which is a variable for limiting large subtrees so as to address the issue that larger subtrees among the subtrees of the parse tree comprise another subtrees therein, can be applied repeatedly for calculating the inner product of the subtree vector.
Two kernels built as described above, that is, the entity kernel and the convolution parse tree kernel, are combined in the following two manners.
In the above equations 16-1 and 16-2, KL represents the entity kernel and K stands for the convolution parse tree kernel. Equation 16-1 shows the composite kernel being a linear combination of the two kernels, and Equation 16-2 defines the composite kernel constructed using quadratic polynomial combination.
Furthermore, Zhang, Zhang, Su, et al., (2006) proposed the method for pruning relation instance by leaving a part of the parse tree and removing the rest, so as to improve similarity measurement performance of the kernel function, and to exclude unnecessary contextual information in learning.
|Minimum Complete Tree (MCT)||Minimum complete sub-tree encompassing two entities|
|Path-enclosed Tree (PT)||Sub-tree belong to the shortest path in between two entities|
|Chunking Tree(CT)||Sub-tree generated by discarding all the internal nodes excpet nodes for base phrases and POS from PT|
|Context-sensitive PT(CPT)||Sub-tree generated by adding two additional terminal nodes outside PT|
|Context-sensitive CT(CCT)||Sub-tree generated by adding two additional terminal nodes outside CT|
|Flattened PT(FPT)||Sub-tree generated by discarding all the nodes having only one paraent and child node from PT|
|Flattened CPT(FCPT)||Sub-tree generated by discarding all the nodes having only one paraent and child node from CT|
For the evaluation, Zhang, Zhang, Su, et al., (2006) used both ACE 2003 and ACE 2004. They parsed all available relation instances with Charniak’s Parser (Charniak, 2001), and on the basis of the parsing result carried out instance conversion using the method described in Table 3. To this end, Moschitti (2004) has developed a kernel tool, while SVMLight (Joachims, 1998) was used for learning and classification.
The test shows that the composite kernel features better performance than a single syntactic kernel. The combination of quadratic polynomial type shows performance between the two kernels. This means that flat feature (entity type feature) and structural feature (syntactic feature) can be organically combined as a single kernel function. In consideration that the Path-enclosed Tree method shows the best performance among all relation instance pruning methods, it is possible to achieve the effect only with core related syntactic information, so as to estimate the relation of two entities in a specific sentence.
4.6. Other recent studies
Choi, et al., (2009) have constructed and tested a composite kernel where various lexical and contextual features are added by expanding the existing composite kernel. In addition to the syntactic feature, called flat feature, they extended the combination range of lexical feature from the entity feature to the contextual feature in order to achieve high performance. Mintz, Bills, Snow, and Jurafsky (2009) proposed a new method of using Freebase, which is a semantic database for thousands relation collections, to gather exemplary sentences for a specific relation and making relation extraction on the basis of the obtained exemplary sentences. In the test, the collection of 10,000 instances corresponding to 102 types of relations has shown the accuracy of 67.6%. In addition, T.-V. T. Nguyen, Moschitti, and Riccardi (2009) have designed a new kernel, which extends the existing convolution parse tree kernel, and Reichartz, et al., (2009) proposed a method, which extends the dependency tree kernel. As described above, most studies published so far are based on the kernels described in Sections 4.1to 4.5.
5. Comparison and analysis
In the previous section, five types of kernel-based relation extraction have been analyzed in detail. Here, we discuss the results of comparison and analysis of these methods. Section 5.1 will briefly describe the criteria for comparison and analysis of the methods. Section 5.2 compares characteristics of the methods. Section 5.3 covers performance results in detail. Section 5.4 sums up the advantages and disadvantages of each of the method.
5.1. Criteria for comparison and analysis
Generally, a large variety of criteria can be used for comparing kernel-based relation extraction methods. The following 6 criteria, however, have been selected and used in this study. First, (1) linguistic analysis and pre-processing method means the pre-processing analysis methods and types for individual instances which are composed of learning collections and evaluation collections, e.g., the type of parsing method or the parsing system used. (2) The level of linguistic analysis, which is the criterion related to the method (1), is a reference to what level the linguistic analysis will be carried out in pre-processing and analyzing instances. Exemplary levels include part-of-speech tagging, base phrase analysis, dependency parsing or full parsing. In addition, (3) the method of selecting a feature space is a reference for deciding if the substantial input of the kernel function is an entire sentence or a part thereof. Also, (4) the applied lexical and supplementary feature information means various supplementary feature information used for addressing the issue of sparse data. (5) The relation extraction method is a practical relation extraction method based on learning models already constituted. Exemplary relation extraction methods include multi-class classification at a time and a single mode method of separating instances with relations from those without relations by means of processing multiple classifications at a time or binary classification, then to carry out relation classification only for the instances with relations. (6) The manual work requirement is a reference to decide if the entire process is fully automatically carried out or manual work is required only at some step. The aforementioned 6 criteria were used to analyze the kernel-based relation extraction methods and the result of the analysis is shown in the following Table 6.
In addition, for the purpose of describing the characteristics of the kernel function, the description in section 4 will be summarized, and each structure of factors to be input of the kernel function will be described. Modification of the kernel function for optimized speed will be included in the analysis criteria and described. For performance comparison of the individual methods, types and scale of the tested collections and tested relations will be analyzed and described in detail. The following Table 4 describes the ACE collection generally used among the tested collections for relation extraction developed so far.
|# training documents||422||674||451|
|# training relation instances||6,156||9,683||5,702|
|# test documents||97||97||N/A|
|# test relation instances||1,490||1,386||N/A|
|# entity types||5||5||7|
|# major relation types||5||5||7|
|# relation sub-types||24||24||23|
As shown in Table 4, the ACE collection is generally used and can be divided into 3 types. ACE-2002, however, is not widely used because of consistency and quality problems. There are 5 to 7 types of entities, e.g., Person, Organization, Facility, Location, Geo-Political Entity, etc. For relations, all collections are structured to be at two levels, consisting of 23 to 24 types of particular relations corresponding to 5 types of Role, Part, Located, Near, and Social. As the method of constructing those collections is advanced and the quality thereof is improved, the tendency is that the scale of training instances is reduced. Although subsequent collections have already been constituted, they are not publicized according to the principle of non-disclosure, which is the policy of ACE Workshop. This should be improved for active studies.
5.2. Comparison of characteristics
Table 5 summarizes each of kernel functions with respect to the concept, before analyzing the kernel-based relation extraction method, in conformity to 6 types of comparison and analysis criteria described in 5.1.
|Compares each node which consists of two trees to be compared.|
Based on BFS, applies subsequence kernel to the child nodes located at the same level.
The decay factor is adjusted and applied to the similarity depending on the length of subsequences at the same level or on the level itself.
|Compares each element of the two paths to cumulatively calculate the common values for the elements in order to compute the similarity by multiplying all values.|
Similarity is 0 if the length of two paths is different.
|For the example of measuring similarity of two words, extracts only the subsequences which exist in both words, and expresses the two words with a vector (Φ(x)) by using the subsequences, among all of the subsequences which belong to two words and of which the length is n.|
Afterwards, obtains the inner product of the two vectors to calculate the similarity.
Generalizes and uses SSK to compare planar information of the tree kernel (sibling node).
|Finds all subtrees in the typical CFG-type syntactic tree, and establishes them as a coordinate axis to represent the parse tree as a vector (Φ(x)).|
In this case, the following constraints hold true: (1) the number of nodes must be at least 2, and (2) the subtree should comply with the CFG creation rule.
Since there can be multiple subtrees, each coordinate value can be at least 1, and similarity is calculated by obtaining the inner product of the two vectors created as such.
Table 6 shows comparison and analysis of the kernel-based relation extraction methods. As it is closer to the right side, the method is more recent one. The characteristic found in all of the above methods is that various feature information in addition to the syntactic information is used as well. Such heterogeneous information was first combined and used in a single kernel function, but is separated from the composite kernel and applied.
With respect to selecting the feature space, most of sentences or a part of the parse tree are applied other than the tree kernel. Manual work was initially required for extracting relation instances and building the parse tree. The recently developed methods, however, offer full automation. In the relation extraction methods, multi-class classification is used, in which the case with no relation is included as one relation.
features from parse trees manually
sub-tree including two entities
from the entire dependency
|Selects the shortest dependency path which starts with one entity and ends with the other one||Before Entities|
LDC mention Type
(Relation Detection and Classification)
5.3. Comparison of performance
Table 7 shows the parameter type of each kernel function and computation complexity in calculating the similarity of two inputs. Most of them show complexity of O(N2), but SPDK exceptionally demonstrates the complexity of the order of O(N) and can be considered as the most efficient kernel.
|Shallow parse trees||CSTK : O(|Ni,1|*|Ni,2|)|
SSTK : O(|Ni,1|*|Ni,2|3)
|N1| : #(1st input’s nodes in level i)
|N2| : #(2nd input’s nodes in level i)
|N1| : #(1st input’s nodes)
n : subsequence length
|N1| : #(1st input’s nodes)
|N2| : #(2nd input’s nodes)
|Full parse trees||O(|N1| *|N2|)|
|N1| : #(1st input’s nodes)
|N2| : #(2nd input’s nodes)
It should be noted that the complexity shown in Table 7 is just kernel calculation complexity. The overall complexity of relation extraction can be much higher when processing time for parsing and learning is also considered.
|(Zelenco et al., 2003)||2003||TK||200 News Articles (2-relations)|
|(Culotta & Sorensen, 2004)||2004||DTK||ACE-2002 (5-major relations)|
|(Kambhatla, 2004)||2004||ME||ACE-2003 (24-relation sub-types)|
|(Bunescu & Mooney, 2005)||2005||SPDK||ACE-2002 (5-major relations)|
|(Zhou et al., 2005)||2005||SVM||ACE-2003 (5-major relations)|
ACE-2003 (24-relation sub-types)
|(Zhao & Grishman, 2005)||2005||CK||ACE-2004 (7-major relations)|
|(Bunescu & Mooney, 2006)||2006||SK||ACE-2002 (5-major relations)|
|(Zhang, Zhang, Su, et al., 2006)||2006||CK||ACE-2003 (5-major relations)|
ACE-2003 (24-relation sub-types)
ACE-2004 (7-major relations)
ACE-2004 (23-relation sub-types)
|(Zhou et al., 2007)||2007||CK||ACE-2003 (5-major relations)|
ACE-2003 (24-relation sub-types)
ACE-2004 (7-major relations)
ACE-2004 (23-relation sub-types)
|(Jiang & Zhai, 2007)||2007||ME/SVM||ACE-2004 (7-major relations)|
|(Zelenco et al., 2003)||TK||91.6||79.5||85.0|
|(Culotta & Sorensen, 2004)||DTK||67.1||35.0||45.8|
|(Bunescu & Mooney, 2005)||SPDK||65.5||43.8||52.5|
|(Zhou et al., 2005)||SVM||77.2||60.7||68.0||63.1||49.5||55.5|
|(Zhao & Grishman, 2005)||CK||69.2||70.5||70.4|
|(Bunescu & Mooney, 2006)||SK||73.9||35.2||47.7|
|(Zhang, Zhang, Su, et al., 2006)||CK||77.3||65.6||70.9||64.9||51.2||57.2||76.1||68.4||72.1||68.6||59.3||63.6|
|(Zhou et al., 2007)||CK||80.8||68.4||74.1||65.2||54.9||59.6||82.2||70.2||75.8||70.3||62.2||66.0|
|(Jiang & Zhai, 2007)||ME/|
ACE-2002, which is the first version of ACE collection, had the issue with data consistency. In the subsequent versions this problem has been continuously addressed, and finally resolved in version ACE-2003. Starting from 52.8% achieved by Kambhatla (2004) on the basis of the performance of ACE-2003 with respect to 24 relation collections, the performance was improved up to 59.6% recently announced by Zhou, et al., (2007). Similarly, the maximum relation extraction performance for 23 particular relations on the ACE-2004 collection is currently 66%.
Although each model has different performance in differently sized relation collections, the composite kernel generally shows better results. In particular, Zhou, et al., (2007) have demonstrated high performance for all collections or relation collections based on extended models initially proposed by Zhang, Zhang, Su, et al., (2006). As described above, it is considered that various features for relation extraction, that is, the syntactic structure and the vocabulary, can be efficiently combined in a composite kernel for better performance.
Although the described research results do not represent all studies on relation extraction, there are many parts not evaluated yet although a lot of study results have been derived so far as seen in Table 9. It is necessary to carry out evaluation on the basis of various collections for comprehensive performance evaluation of a specific relation extraction model, but this is a challenge that more studies should be done. In particular, the key issue is to check whether the performance of relation extraction is achieved as high as described in the above without the characteristics of ACE collections, in that they provide supplementary information (entity type information) of a considerable scale for relation extraction.
5.4. Comparison and analysis of advantages and disadvantages of each method
In Section 5.4, advantages and disadvantages of five kernel-based relation extraction methods are discussed and outlined in Table 10.
|Applies the typical automatic sentence classification without modification. Performance can be further improved by applying various feature information.|
Relatively high speed
|A lot of effort is required for feature extraction and selection.|
Performance can be improved only through feature combination.
|Calculates particular similarity between shallow parse trees.|
Uses both structural (parenthood) and planar information (brotherhood).
Optimization for speed improvement.
|Very limited use of structural information (syntactic relations)|
Slow similarity calculation speed in spite of optimized speed
|Addressed the issue of insufficient use of structural information, which is a disadvantage of TK.|
Uses key words which are the core of relation expression in the sentence, as feature information, on the basis of the structure that the predicate node is raised to a higher status, which is a structural characteristic of a dependency tree.
|Predicates and key words in a dependency tree are emphasized only by means of decay factors (low emphasis capability)|
Slow similarity calculation speed
|Shortest Path DTK (SPTK)||Creates a path between two named entities by means of dependency relation to reduce noise not related to relation expression.|
Shows very fast computation speed because the kernel input is not trees, but paths, different from previous inputs.
Adds various types of supplementary feature information to improve the performance of similarity measurement, thanks to the simple structure of paths.
|Too simple structure of the kernel function|
Too strong constraints because the similarity is 0 if the length of two input paths is different.
|Subsequence Kernel (SK)||Very efficient because syntactic analysis information is not used.|
Adds various supplementary feature information to improve the performance.
|Can include many unnecessary features|
|Makes all of constituent subtrees of a parse tree have a feature, to perfectly|
use structure information in calculating similarity.
Optimized for improved speed
|Comparison is carried out only on the basis of sentence component information of each node (phrase info.) (kernel calculation is required on the basis of composite feature information with reference to word class, semantic info, etc.)|
As one can see in Table 10, each method has some advantages and disadvantages. A lot of efforts are required for the process of feature selection in general feature-based relation extraction. The kernel-based method does not have this disadvantage, but has various limitations instead. For example, although the shortest dependency path kernel includes a variety of potentials, it showed low performance due to the overly simple structure of the kernel function. Since the composite kernel constitutes and compares subtree features only on the basis of part-of-speech information and vocabulary information of each node, generality of similarity measurement is not high. A scheme to get over this is to use word classes or semantic information.
A scheme can be suggested for designing a new kernel in order to overcome the above shortcomings. For example, a scheme may be used for interworking various supplementary feature information (WordNet Synset, thesaurus, ontology, part-of-speech tag, thematic role information, etc.), so as to ensure general comparison between subtrees in the composite kernel. The performance can be improved by replacing the current simple linear kernel with the subsequence or another composite kernel and applying all sorts of supplementary feature information in order to address the shortcomings of the shortest path dependency kernel.
In this chapter, we analyzed kernel-based relation extraction method, which is considered the most efficient approach so far. Previous case studies did not fully covered specific operation principles of the kernel-based relation extraction models, just cited contents of individual studies or made an analysis in a limited range. This chapter, however, closely examines operation principles and individual characteristics of five kernel-based relation extraction methods, starting from the original kernel-based relation extraction studies (Zelenco et al., 2003), to composite kernel (Choi et al., 2009; Zhang, Zhang, Su, et al., 2006), which is considered the most advanced kernel-based method. The overall performance of each method was compared using ACE collections, and particular advantages and disadvantages of each method were summarized. This study will contribute to researchers’ kernel study for relation extraction of higher performance and to general kernel studies of high level for linguistic processing and text mining.
Agichtein E. Gravano L. 2000 Snowball: extracting relations from large plain-text collections
Bach N. Badaskar S. 2007 A Survey on Relation Extraction
Brin S. 1999 Extracting Patterns and Relations from the World Wide Web
Bunescu R. Mooney R. J. 2005 A Shortest Path Dependency Kernel for Relation Extraction
Bunescu R. Mooney R. J. 2006 Subsequence Kernels for Relation Extraction
Charniak E. 2001 Immediate-head Parsing for Language Models
Choi S.-P. Jeong C.-H. Choi Y.-S. Myaeng S.-H. 2009 Relation Extraction based on Extended Composite Kernel using Flat Lexical Features
Collins M. 1997 Three Generative, Lexicalised Models for Statistical Parsing
Collins M. Duffy N. 2001 Convolution Kernels for Natural Language
Cortes C. Vapnik V. 1995 Support-Vector Networks
Cristianini N. Shawe-Taylor J. 2000
Culotta A. Sorensen J. 2004 Dependency Tree Kernels for Relation Extraction
Etzioni O. Cafarella M. Downey D. Popescu A. M. Shaked T. Soderland S. Weld D. S. et al. 2005 Unsupervised named-entity extraction from the Web: An experimental study
Fellbaum C. 1998
Freund Y. Schapire R. E. 1999 Large margin classification using the perceptron algorithm
Fundel K. Küffner R. Zimmer R. 2007 RelEx-Relation extraction using dependency parse trees
Gartner T. Flach P. Wrobel S. 2003 On graph kernels: Hardness results and efficient alternatives
Hockenmaier J. Steedman M. 2002 Generative Models for Statistical Parsing with Combinatory Categorial Grammar
Jiang J. Zhai C. 2007 A Systematic Exploration of the Feature Space for Relation Extraction
Joachims T. 1998 Text Categorization with Support Vecor Machine: learning with many relevant features
Kambhatla N. 2004 Combining lexical, syntactic and semantic features with Maximum Entropy models for extracting relations
Li J. Zhang Z. Li X. Chen H. 2008 Kernel-based learning for biomedical relation extraction
Li W. Zhang P. Wei F. Hou Y. Lu Q. 2008 A novel feature-based approach to Chinese entity relation extraction
Lodhi H. Saunders C. Shawe-Taylor J. Cristianini N. Watkins C. 2002 Text classification using string kernels
Mintz M. Bills S. Snow R. Jurafsky D. 2009 Distant supervision for relation extraction without labeled data
Moncecchi G. Minel J. L. Wonsever D. 2010 A survey of kernel methods for relation extraction
Moschitti A. 2004 A Study on Convolution Kernels for Shallow Semantic Parsing
Nguyen D. P. T. Matsuo Y. Ishizuka M. 2007 Exploiting syntactic and semantic information for relation extraction from wikipedia
Nguyen T.-V. T. Moschitti A. Riccardi G. 2009 Convolution kernels on constituent, dependency and sequential structures for relation extraction
Ratnaparkhi A. 1996 A Maximum Entropy Part-Of-Speech Tagger
Reichartz F. Korte H. Paass G. 2009 Dependency tree kernels for relation extraction from natural language text
Rosenblatt F. 1958 The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Yarowsky D. 1995 Unsupervised word sense disambiguation rivaling supervised methods
Yates A. Cafarella M. Banko M. Etzioni O. Broadhead M. Soderland S. 2007 TextRunner: open information extraction on the web
Zelenco D. Aone C. Richardella A. 2003 Kernel Methods for Relation Extraction
Zhang M. Zhang J. Su J. 2006 Exploring syntactic features for relation extraction using a convolution tree kernel
Zhang M. Zhang J. Su J. Zhou G. 2006 A Composite Kernel to Extract Relations between Entities with both Flat and Structured Features
Zhang M. Zhou G. Aiti A. 2008 Exploring syntactic structured features over parse trees for relation extraction using kernel methods
Zhao S. Grishman R. 2005 Extracting Relations with Integrated Information Using Kernel Methods
Zhou G. Su J. Zhang J. Zhang M. 2005 Exploring Various Knowledge in Relation Extraction
Zhou G. Zhang M. Ji D. Zhu Q. 2007 Tree Kernel-based Relation Extraction with Context-Sensitive Structured Parse Tree Information
- Defense Advanced Research Projects Agency of the U.S.
- National Institute of Standards and Technology of the U.S.
- Dependency grammar relation, parse tree, etc. between words in a sentence.
- Binary classification for identifying possible relation between two named entities.
- Relation extraction for all instances with relations in the result of relation identification.