Open access

Text Clumping for Technical Intelligence

Written By

Alan Porter and Yi Zhang

Submitted: March 14th, 2012 Published: November 21st, 2012

DOI: 10.5772/50973

Chapter metrics overview

4,297 Chapter Downloads

View Full Metrics

1. Introduction: Concepts, Purposes, and Approaches

This development responds to a challenge. Text mining software can conveniently generate very large sets of terms or phrases. Our examples draw from use of VantagePoint (or equivalently, Thomson Data Analyzer – TDA) software [1] to analyze abstract record sets. A typical search on an ST&I topic of interest might yield, say, 5,000 records. One approach is to apply VantagePoint’s Natural Language Processing (NLP) to the titles, and also to the abstracts and/or claims. We also take advantage of available topic-rich fields such as keywords and index terms. Merging these fields could well offer on the order of 100,000 terms and phrases in one field (list). That list, unfortunately, will surely contain much noise and redundancy. The text clumping aim is to clean and consolidate such a list to provide rich, usable content information.

As described, the text field of interest can contain terms (i.e., single words or unigrams) and/or phrases (i.e., multi-word noun + modifiers term sets). Herein, we focus on such NLP phrases, typically including many single words also. Some of the algorithms pertain especially to multi-word phrases, but, in general, many steps can usefully be applied to single-word term sets. Here we focus on analyzing NLP English noun-phrases – to be called simply „phrases.“

Our larger mission is to generate effective Competitive Technical Intelligence (CTI). We want to answer basic questions of „Who is doing What, Where and When?“ In turn, that information can be used to build „innovation indicators“ that address users‘ CTI needs [2]. Typically, those users might be:

  • Information professionals (compiling most relevant information resources)

  • Researchers (seeking to learn about the nearby „research landscape“)

  • R&D managers (wanting to invest in the most promising opportunities)

  • Science, Technology and Innovation (ST&I) policy-makers (striving to advance their country’s competitiveness)

We focus on ST&I information sets, typically in the form of field-structured abstract records retrieved from topical database searches [e.g., Web of Science (WoS), Derwent World Patent Index, Factiva]. These records usually contain a mix of free text portions (e.g., abstracts) and structured text fields (e.g., keywords, publication years). The software uses an import filter to recognize fields (i.e., to know where and how to find the authors and parse their names properly) for particular source sets, such as WoS. VantagePoint can merge multiple datasets from a given source database or from different sources (with guidance on field matching and care in interpreting).

Figure 1 presents our framework for „term clumping.“ We combine established and relatively novel bibliometric and text mining techniqueswithin this framework. Itincludesa number of steps to process alarge phrase list. The top portion of the figure indicates choices to be made concerning which data resources to mine and selection criteria for the records to be analyzed. The next tier notes additional choices regarding which content-laden fields to process. The following two blocks contain the major foci of this chapter. “Text Cleanup” includes stopword and common term handling, through several steps to consolidate related terms. “Consolidateion of terms into informative topical factors” follows. Here we treat basic “inductive methods.” The elements of the Figure flagged with an asterisk (*) are addressed in depth herein.

Figure 1 also points towardinterests for future work. These include“purposive methods,” wherein our attention focuses on particular terms based on external criteria – e.g., semantic TRIZ (Theory of Inventive Problem Solving) suggests vital functions and actions indicative of technological innovative potential [3, 4]. The idea is to search the target text fields for occurrences of theory-guided terms and adjacent content.

We are also keenly interested in pursuing single word analyses via Topic Modeling (TM) methods to get at themes of the record set under study. These hold appeal in providing tools that will work well in multiple languages and character sets (e.g., Chinese). The main language dependency that we confront is the use of NLP to extract noun phrases (e.g., VantagePoint’s NLP is developed for English text).

The bottom portion of Figure 1 indicates interest in how best to engage experts in such topic identification processes. We distinguish three roles:

  • Analyst: Professionals in data retrieval and analysis, who have analytical skills in handling text, but usually don’t have domain knowledge

  • Expert: Professional researchers in the specific domain, knowledgeable over the domain, and able to describe the current status of the domain at both macro and micro levels;

  • Information & Computer Scientist: Covering a range of skills from in-depth programming, through preparation of macros, to operating software to accomplish particular text manipulations.

So defined, engagement of experts presents challenges in terms of motivation, time required, and communication of issues so that the domain experts can readily understand and respond to the analyst’s needs. Simple, intermediate stage outputs could have value in this regard.

Figure 1.

Term Clumping for Technical Intelligence

In summary, this chapter addresses how best to clean and consolidate ST&I phrase lists from abstract record sets. The target is to semi-automate this „inductive“ process (i.e., letting the data speak without predetermined identification of target terms). We aim toward semi-automation because the process should be tailorable to study needs. We are exploring a series of text manipulations to consolidate phrase lists. We are undertaking a series of experiments that vary how technical the content is, which steps are performed, in what sequence, and what statistical approaches are then used to further cluster the phrases or terms. In particular, we also vary and assess the degree of human intervention in the term clumping. That ranges from almost none, to analyst tuning, to active domain expert participation [5-7].


2. Review of Related Literatures

Given the scope of Figure 1, several research areas contribute. This chapter does not address the purposive analyses, so we won’t treat literatures on importing index terms, or on TRIZ and Technology RoadMapping (TRM) -- of great interest in suggesting high value terms for CTI analyses.

Several of the steps to be elaborated are basic. Removal of „stopwords“ needs little theoretical framing. It does pose some interesting analytical possibilities, however. For instance, Cunningham found that the most common modifiers provided analytical value in classifying British science [8]. He conceives of an inverted U shape that emphasizes analyzing moderately high frequency terms -- excluding both the very high frequency (stopwords and commonly used scientific words, that provide high recall of records, but low precision) and low frequency words (suffering from low recall due to weak coverage, but high precision). Pursuing this notion of culling common scientific words, we remove „common words.“ In our analyses we apply several stopword lists of several hundred terms (including some stemming), and a common words in academic/scientific writing thesaurus of some 48,000 terms [9]. We are interested in whether removal of these enhances or, possibly, degrades further analytical steps‘ performance (e.g., Topic Modeling).

To state the obvious -- not all texts behave the same. Language and the venue for the discourse, with its norms, affect usage and text mining. In particular, we focus on ST&I literature and patent abstracts, with outreach to business and attendant popular press coverage of topics (e.g., the Factiva database). English ST&I writing differs somewhat from „normal“ English in structure and content. For instance, scientific discourse tends to include many technical phrases that should be retained, not parsed into separate terms or part-phrases by NLP. VantagePoint’s NLP routine [1] strives to do that. It also seeks to retain chemical formulas.

A research community has built around bibliometric analyses of ST&I records over the past 60 or so years, see for instance [10-12]. DeBellis nicely summarizes many facets of the data and their analyses [13]. Our group at Georgia Tech has pursued ST&I analyses aimed especially at generating Competitive Technical Intelligence (CTI) since the 1970’s, with software development to facilitate mining of abstract records since 1993 [1, 2, 14]. We have explored ways to expedite such text analyses, c.f. [15, 16], as have others [17]. We increasingly turn toward extending such „research profiling“ to aid in Forecasting Innovation Pathways (FIP), see for example [18].

Over the years many techniques have been used to model content retrieved from ST&I text databases. Latent Semantic Indexing (LSI) [19], Principal Components Analysis (PCA), Support Vector Machines (SVM), and Topic Modeling (TM) are among the key methods that have come forth [20].

PCA is closely related to LSI. Both use Singular Value Decomposition (SVD) to transform the basic terms by documents matrix to reduce ranks (i.e., to replace a large number of terms by a relatively small number of factors, capturing as much of the information value as possible). PCA eigen-decomposes a covariance matrix, whereas LSI does so on the term-document matrix. [See wikipedia for basic statistical manipulations.]

VantagePoint uses a special variant of PCA developed to facilitate ST&I text analyses (used in the analyses reported here). This PCA routine generates a more balanced factor set than LSI (which extracts a largest variance explaining factor first; then a second that best explains remaining variance, etc.). The VantagePoint factor map routine applies a small-increment Kaiser Varimax Rotation (yielding more attractive results, but running slower, than SPSS PCA in developmental tests). Our colleague, Bob Watts of the U.S. Army, has led development of a more automated version of PCA, with an optimization routine to determine a best solution (maximizing inclusion of records with fewest factors) based on selected parameter settings -- (Principal Components Decomposition – PCD)[21] He has also empirically compared PCD (inductive) results with a deductive approach based on use of class codes [22].

We apply PCA to term sets to generate co-occurrence based principal components. Because of the familiar use of “clusters,” we also use that terminology, although other clustering approaches can yield different forms (e.g., K-means, hierarchical clustering). This PCA approach allows terms to appear in multiple factors

We use the concept, „term clumping,“ as quite general – entailing various means of text consolidation (e.g., application of thesauri, fuzzy matching, stemming) with noise removal. Bookstein, Raita, and collegues offer a somewhat more specialized, but related, interpretation pointing toward the aim of condensing terminology to better identify content-bearing words [23-25]. Term clumping addresses text (not document) „clustering.“ Any type of text clustering is based on co-occurrence of words in records (documents). Clustering, in turn, includes many variations plus additional statistical analyses with considerable commonality -- in particular, factor analysis. PCA can be considered as a basic factoring approach; indeed, we call its output principal components „factors. “Similarity among these term grouping approaches arises in that they generally aim to maximize association within clusters and minimize association among clusters. Features to keep in mind include whether terms or phrases being clustered are allowed to be included in multiple clusters or not; whether algorithms yield the same results on rerun or may change (probabilistic methods); and whether useful visualization are generated. Many further variations are available – e.g., hierarchical or non-hierarchical; building up or partitioning down; neural network based approaches (e.g., Kohonen Self-Organizing Maps), and so forth [26]. Research is actively pursuing many refinements, for many objectives, for instance [27]. Our focus is on grouping terms, but we note much complementary activity on grouping documents (based on co-occurrence with particular terms) [26], with special interest in grouping web sites, for instance [28].

Latent Semantic Indexing (LSI) or Latent Semantic Analysis, is a classical indexing method based on a Vector Space Model that introduces Singular-Value Decomposition (SVD) to uncover the underlying semantic structure in the text set. The key feature of LSI is to map those terms that occur in similar contexts into a smaller “semantic space“ and to help determine the relationships among terms (synonymy and polysemy) [17, 29, 30]. When applied on co-occurrence information for large text sources, there is no need for LSI to import domain literatures or thesauri (what we call „purposive“ or aided text clumping). There are also various extended LSI methods [31]. Researchers are combining LSI with term clumping variations in order to relate synonymous terms from massive content.For example, Maletic and Marcus combine semantic and structural information [32] and Xu et al. seek to associate genes based on text mining of abstracts [30].

Topic modeling is a suite of algorithms that automatically conforms topical themes from a collection of documents [33, 34]. This stream of research begins with Latent Dirichlet Allocation (LDA), which remains the basic algorithm. Topic modelling is an extended LSI method, that treats association probabilistically. Various topic modeling algorithms extend the basic approach, for example [35-44]. Topic modeling is being applied in many contexts – e.g., NLP extension, sentiment analysis, and topic detection.

We are pursuing topic modeling in conjunction with our text clumping development in several ways. We are experimenting to assess whether and which term clumping steps can refine term or phrase sets as input into topic modeling to enhance generation of meaningful topics. We also compare topic modeling outputs to alternative processes, especially PCA performed on clumped phrases. We additionally want to assess whether some form of text clumping can be applied after topic modeling to enhance topic interpretability.

We have also tried, but are not actively pursuing, Key Graph, a kind of visualization technique that treats the documents as a building constructed by a series of ideas and then retrieves these ideas and posts as a summary of original points on the segmentation of a graph [45-47]. Usually, Key Graph has 3 major components: (1) Foundations, which are the sub-graphs of highly associated and frequent terms; (2) Roofs, which are terms highly related to the foundations; and (3) Columns, which are keywords representing the relationships between foundations and roofs.

We are especially interested in term grouping algorithms to refine large phrase sets through a sequence of steps. These typically begin with noise removal and basic cleaning, and end with some form of clustering of the resulting phrases (e.g., PCA). „In-between“ we are applying several intermediate stage term consolidation tools. Kongthon has pursued an object oriented association rule mining approach [48], with a „concept grouping“ routine [49] and a tree-structured network algorithm that associates text parent-child and sibling relationships [50].

Courseault-Trumbach devised a routine to consolidate related phrases, particularly of different term lengths based on term commonality [51]. Webb Myers developed another routine to combine authors. The notion was that, say, we have three papers authored by X. Perhaps two of those are co-authored with Y, and one with Z; and Y and Z never appear as authors on another paper without X. In that case, the operation surmises that Y and Z are likely junior authors, and eliminates them so that further author analyses can focus on the senior authors or author team. The macro [available at] adds major co-authors into the term name. We incorporate these two routines in the present exercises.

Lastly, we consider various quality assessment approaches. Given that one generates clustered text in various forms, which are best? We look toward three approaches. First, we want to ask the target users. While appealing, this also confronts issues – e.g., our PCA output „names“ the resulting factors, whereas topic modeling does not. How can we compare these even-handedly? Second are statistical approaches that measure some form of the degree of coherence within clusters vs. among clusters [52]. Third are record assignment tests – to what extent do alternative text clumping and clustering sequences correctly distinguish mixed dataset components? Here we seek both high recall and precision.


3. Empirical Investigation:Two Case Analyses

Figure 1 arrays a wide range of possible term clumping actions. As introduced in the previous sections, we are interested in many of those, but within the scope of this chapter we focus on many of the following steps and comparisons:

Term Clumping STEPS:

  1. Fuzzy matching routines

  2. Thesauri to reduce common terms

  3. Human-aided and topic tailored cleaning

  4. Phrase consolidation macro (different lengths)

  5. Pruning of extremely high and low frequency terms

  6. Combine term networks (parent-child) macro

  7. g.TFIDF (Term Frequency Inverse Document Frequency)

  8. Term normalization vs. parent database samples

  9. PCA variations to generate high, medium, and low frequency factors

  10. Topic Modeling

  11. Quality assessment of the resulting factors – comparing expert and statistical means

We are running multiple empirical comparisons. Here we compare results on two topical datasets:

  1. ­“MOT” (for Management of Technology) – 5169 records covering abstract records of the PICMET (Portland International Conference on Management of Engineering and Technology) from 1997 through 2012.

  2. ­“DSSCs” (for Dye-Sensitized Solar Cells) – 5784 abstract records compiled from searches for 2001-2010 in WoS and in EI Compendex, merged in VantagePoint

Elsewhere, we elaborate on these analyses in various ways. Substantive interpretations of the topical MOT thrusts based on the human-selected MOT terms are examined over time and regions [55]. Comparisons of three MOT analyses -- 1) 3-tier, semi-automatic PCA extraction, 2) PCA based on human-selected MOT terms, and 3) Topic Modeling of unigrams – found notably different factors extracted. Human quality assessment did not yield a clear favorite, but the Topic Modeling results edged ahead of the different PCA’s [7]. Additional explorations of the WoS DSSC data appear in [6], comparing Topic Modeling and term clumping-to-PCA – finding quite different emphases in the extracted factors. Zhang et al. [54] track through a similar sequence of term clumping steps on the combined WoS-Compendex DSSC dataset.

Here, we focus on stepping through most of the term clumping operations for these two cases. To avoid undue complexity, we set aside data variations (e.g., stepping through for the WoS DSSC set alone), Topic Modeling comparisons, and quality assessment. As noted, we have done one version of human assessment for the MOT data [7]. We are pursuing additional quality assessments via statistical measures [52] and by comparing how well the alternative analytics are able to separate out record sets from a combination of 7 searches. We also intend to pursue Step h – term normalization based on external (e.g., entire database) frequencies. So, here we treat Steps a-g and i, not Steps h, j, or k.

Table 1 provides the stepwise tally of phrases in the merged topical fields undergoing term clumping. It is difficult to balance precision with clarity, so we hope this succeeds. The first column indicates which text analysis action was taken, coresponding to the list of steps just above.The second column shows the results of those actions applied in sequence on the MOT data. Blank cells indicate that particular action was not performed on the MOT (or DSSC) dataset. The last row notes additional human-informed analyses done on the MOT data, but not treated here (to recognize that this is a selective presentation). The third column relates the results of application of the steps to the DSSC data, but here we indicate sequence within the column, also showing the resulting term reduction. [So, the Table shows the Term Clumping Steps in the order performed on MOT; this was arbitrary. It could as well have been ordered by the list (above) or in the order done for DSSC data.]

Term Clumping StepsMOT
5169 PICMET records
5784 records (WoS+Compendex), 2001-2010
Field selectionTitle&Abstract NLP phrasesTitle&Abstract NLP phrases + keywords
Phrases with which we begin8601490980
a-1) Apply general.fuz routine76398Applied 10th, reducing 82701 to 74263
b-1) Apply stopwords thesaurus76105Applied 1st, reducing 90980 to 89576) and applied 7th, reducing 85960 to 84511
b-2) Apply common academic/scientific terms thesaurus73232Applied 2d, reducing 89576 to 89403; and applied8th, reducing 84511 to 82739
b-3) multiple tailored cleaning routines -- trash term remover.the; topic variations consolidator.the; DSSC data fuzzy matcher results.the*Applied such actions as 3d-6th steps, reducing 89403 to 85960; applied 9th, reducing 82739 to 82701;
a-2) Apply general-85cutoff-95fuzzywordmatch-1exact.fuz69677Applied 11th, reducing 74263 to 65379
d) Apply phrase consolidation macro (different lengths)68062Applied 4th, reducing 89355 to 86410
e) Prune (remove phrases appearing in only 1 record)13089Applied 12th, reducing 65379 to 23311
c-1) Apply human-aided and general.fuz routineApplied 13th, reducing 23311 to 21645
c-2) Manual noise screens (e.g., copyrights, stand-alone numbers)Applied 14th, reducing 21645 to 20172
f) Apply combine term networks (parent-child) macro10513Applied 15th, reducing 20172 to 8181
g) Apply TFIDF1999Applied 16th, reducing 8181 to 2008
i) Auto-PCA: highest frequency; 2d highest; 3d highest201, 256, 299203;214;230
PCA factors9 factors (only top tier)12 (only top tier)
c-3) Tuned phrases to 7164; reviewed 15 factors from 204 top phrases; reran to get final PCA

Table 1.

Term Clumping Stepwise Results

*a compilation of phrase variations that VantagePoint’s “List Cleanup” routine suggested combining [e.g. – various singular and plural variations; hyphenation variations; and similar phrases such as “nanostructured TiO2 films” with “nanostructured TiO2 thin films”]

Some steps are broken out in more detail – e.g., Step a -- Fuzzy matching routines – is split into use of VantagePoint’s general matching routine (a-1) and application of a variant tunedfor this term clumping (a-2). Note also that some steps appear more than once, especially for the DSSC clumping.

For Step b – application of thesauri to remove common terms – we distinguish the use of a modest size stopwords thesaurus (fewer than 300 words) as Step b-1 and the application of the 48,000 term thesaurus of common academic/scientific terms as Step b-2.

Step c -- Human-aided and topic tailored cleaning (Steps c-1, c-2 & c-3) groups a variety of „obvious“ cleaning routines. Our dilemma is whether to eliminate these, to facilitate development of semi-automated routines, or to include them, for easy improvement of the term consolidation? In the MOT term clumping reported in Table 1, we essentially avoid such cleaning. In the DSSC step-through, we include limited iterations of human-aided cleaning to see whether this makes a qualitative difference by the time the progression of steps is completed. [It does not seem to do so.]

Step d -- Phrase consolidation macro – consolidates only a modest percentage of the phrases (as applied here, reducing the phrase count by 2.3% for MOT and by 3.3% for DSSCs), but the improvements appear worthwhile. For instance, combining “Dye-Sensitized Solar Cells” with “Sensitized Solar Cells” can provide important conceptual concentration.

Step e – Pruning – is simply discarding the phrases that appear in only one record. Those would not add to co-occurrence based analyses. The challenge is to sequence pruning after consolidation so that potentially useful topical information is not discarded. Pruning is the overwhelmingly potent step in reducing the term or phrase counts. For MOT, it effects a reduction of 81%; for DSSCs, 64%.

Step f -- Combine term networks (parent-child) – appears a powerful reducer. As discussed, Webb Myers devised this macro to consolidate author sets.We apply the macro to the phrases field, showing sizable reductions for MOT (19.7%) and DSSCs (59.4%). The macro will combine major co-occurring terms in the new phrase name with a “&” between them. It also results in terms that appear in a single record being combined into a single phrase [hence, we perform the Pruning step prior to applying this macro].

Step g – TFIDF – strives to distinguish terms that provide specificity within the sample set.For example, if some form of „DSSC“ appears in nearly every DSSC record, this would not be a high-value term in distinguishing patterns within the dataset. VantagePoint offers three TFIDF routines – A) un-normalized, B) log, and C) square root. We compared these and proceed with the square root term set for DSSCs, whose 2008 terms are all included in sets A or B. Of the 2008 phrases, 1915 are in both A and B (so differences in this regard are small), with 42 in set A and 51 in set B. For the MOT data, B and C yield the same 1999 terms, whereas A yields 2052. Inspection of the distinct terms find the 78 only in sets B & C to appear more substantive than the 131 terms only in set A, so we opt for the 1999 term result.

Step h is included as a place-holder.On the one hand, Step b aims to remove generally common terms.On another, Step g favors more specific terms within the document set being analyzed. With access to full databases or general samples from sources such as WoS, one could sort toward terms or phrases that are relatively unique to the search set.We have not done that here.

At this stage, we have very large, but clumped, phrase sets. In our two cases, these consist of about 2000 phrases. Consider the illustrative „post-TFIDF“ tabulations in Table 2. We believe these offer rich analytical possibilities. For instance, we could scan their introduction and frequency over time to identify „hot“ topics in the field. Or, we could compare organizational emphases across these phrases to advance CTI interests. We might ask technical and/or business experts in the field to scan those 2000 phrases to identify particularly important or novel ones for in-depth analyses.

Steps i and j represent a major „last step“ for these sorts of term analyses. Here we explore select PCA steps; elsewhere, as noted, we pursue Topic Modeling [6, 7]. This factoring (~clustering) step reduces thousands of phrases to tens of phrases. If done accurately, this can be game-changing in terms of opening conceptual insights into topical emphases in the field under study.

VantagePoint’s PCA routine is now applied as Step i. In these cases we have tried to minimize human-aiding, but we explore that elsewhere [6, 7]. We select three top tiers of terms to be subjected to separate Principal Components Analysis. Such selection can be handled by various coverage rules – e.g., terms appearing in at least 1% of the records. In the present exercises, we set thresholds to provide approximately 200 phrases as input to each of three PCA analyses. We run the default requested number of factors to extract – this is the square root of the number of terms submitted. We review the resulting three sets of factors in terms of recall (record coverage) and determine to focus on just the top tier PCA results here. For DSSCs, the top-tier PCA yields 12 factors that cover 98% of the records, whereas the 2d tier factors cover 47% and the 3d tier only 18%.For the MOT analyses, results are comparable – the 9 top-tier factors cover 90% of the records; 2d tier, 36%; 3d tier, 17%. [We have performed additional analyses of these data, exploring various PCA factor sets, including ones in which we perform post-PCA term cleaning based on inspection of initial results., then rerun PCAFor instance, a very high frequency term might be removed, or simple relations handled by refining a thesaurus (e.g., in one result „Data Envelopment Analysis“ and its acronym, DEA, constituted a factor).

Step j is of high interest, and we are exploring several alternative approaches, as mentioned. Here, we just present the high tier set of PCA factors for face validity checks.


4. Term Clumping Case Results

Having stepped through multiple term clumping steps, what do we get? One has many reasonable choices as to which term clumping steps to apply, in what sequence. To get a feel for the gains, let’s compare sample results at four Stages:

  1. Initial phrase set

  2. After the term clumping steps up to TFIDF

  3. After TFIDF

  4. After PCA

Referring to Figure 1, the Text Cleaning stage, in general, would be carried out in preparation for nearly all further analyses. We would not anticipate aborting that processing part-way, except in special cases (e.g., as mentioned in Cunningham’s analysis of British science titles). The next stage of consolidating the cleaned and, therefore, partly consolidated phrases, is where interesting choices arrive. Based on the analyses of the MOT and DSSC data, we note the significant effect of selecting the high TFIDF terms. We thus compare the phrase sets at Stage 1 (before cleaning and clumping), Stage 2 (before filtering to the top TFIDF terms), Stage 3 (after TFIDF), and Stage 4 (after applying one of the clustering family of techniques – PCA).

Stage 1 - InitialStage 2 - Clumped
Top 10# Records# InstancesTop 10# Records#Instances
results8941177case study472931
knowledge412750technology manager241526

Table 2.

Stages 1 & 2 – Top 10MOT Phrases

Considering the MOT data first, Table 2 compares the ten most frequent terms or phrases as of Stages 1 and 2.As per Table 1, the clumping and, especially single-record term pruning, has reduced from 86014 to 10513 phrases – an 88% reduction. Table 2 lists the highest frequency terms and phrases based on record coverage. For instance, study appears in 1177 of the 5169 records (23%). The Table also shows instances, and we see that study appears more than once in some records to give a total of 1874 instances. MOT is Management of Technology. That said, the terms and phrases after clumping are somewhat more substantive. As one scans down the Stage 2 set of 10513 phrases, this is even more the case. Our sense is that a topical expert reviewing these to tag a set of themes to be analyzed (e.g., to track trends, or institutional emphases) would definitely prefer the clumped to the raw phrases.

In Tables 2-5, we show in how many of the full sample of MOT and DSSC records the particular terms appear. We also show instances (i.e., some terms appear more than once in a record). These just convey the changes in coverage resulting from the various clumping operations applied.

Table 3 shows the „Top 10“ post-TFIDF terms and phrases, based on TFIDF scores. Recall that the 1999 terms and phrases at this Stage 3 are based on an arbitrary threshold – we sought about 2000. Note that term counts are unchanged for terms present in both Stages 2 & 3. TFIDF is not clumping, but rather screening based on occurrence patterns across the 5169 records.

Stage 3 - post-TFIDF
Top 10# Records# InstancesSqRt TFIDF value
innovation technology20052732.42
case study47293131.72
technology manager24152630.54
developed country17940629.43

Table 3.

Stage 3 – Top 10 MOT Phrases based on TFIDF

Table 4 presents another 10-term sample pair for Stages 1 and 2. Here, we alphabetically sort the phrase lists and arbitrarily take the ten phrases beginning with „knowledge“ or „knowledg“ --i.e., a stem version of the term. Notice that the big consolidation is for the stemmed version of „knowledg,“ for which the record count has gone up a tiny amount (2), whereas the instance count has increased by 272. In general, the term clumping increases term frequencies and consolidates related terms pretty well (but by no means completely).

Table 5 presents the top-tier PCA analysis results. The phrases appearing here tend to be more topically specific than those seen as most frequent at Stages 2 and 3. Only two terms -- „competition“ and „knowledge“ -- happen to be approximately in common. These nine factors pass a face validity check – they seem quite coherent and potentially meaningful to study of the MOT research arena. Naming of the factors is done by VantagePoint, using an algorithm that takes into account relative term loading on the factor and term commonalities among phrases.

Stage 1 Sample#R#IStage 2 Sample#R#I
knowledge absorption ability KAA11knowledge acquisition611
knowledge access11knowledge age410
knowledge accumulated11knowledge asset45
knowledge accumulation48knowledge base1417
knowledge accumulation model12knowledge based competencies23
knowledge acquisition611knowledge based economy2128
knowledge acquisition KA11knowledge based organizational strategy24
knowledge acquisition strategies14knowledge based perspective34
knowledge across different sectors11Knowledge Based Product Models22

Table 4.

Stages 1 & 2 – 10 Sample MOT Phrases

Note: #R = # of Records; #I = # of Instances

As mentioned, we have done additional analyses of these data. In another PCA, starting with the 10513 terms (pre-TFIDF), we extracted a high frequency term set (112 terms or phrases appearing in 50-475 records). In addition we extracted a second-tier PCA based on 185 terms appearing in 25-49 records, and a third-tier PCA from 763 terms in 10-24 records. Each set was run using VantagePoint default settings for number of factors, yielding, respectively, 7, 9, and 16 factors. Of the present 9 top-tier factors, 3 show clear correspondence to either top or second-tier factors in the 10513-term based PCA; one shows partial correspondence; 5 are quite distinct. Which factor sets are better? Impressionalistically, the 9 post-TFIDF factors seem reasonable and somewhat superior, but lacking some of the specificity of (7 + 9 + 16 = 32) factors. As noted, we don’t pursue the corresponding post-TFIDF PCA second and third tier factors because their record coverage is low.

Examination of DSSC phrase sets shows generally similar progressions as term clumping proceeds.In some respects, results are even more satisfactory with that more technical terminology.In the interest of space, we don’t present tables like Tables 2-4 here.But here’s a synopsis of one fruitful topical concentration within the DSSC phrase list:

Principle Component (Factor)High Loading Phrases
Managing Supply ChainManaging Supply Chain
Supply Chain
Competing TechnologiesCompetition
Technology Capability
Global Competition
Competing Technologies
Technology RoadmapRoadmap
Technology Roadmap
Innovation ProcessInnovation Process
Innovation Activity
Open Innovation
Knowledge Manager
Knowledge Creation
New Knowledge
Share Knowledge
Project SuccessProject Manager
Project Success
Make DecisionsMake Decisions
Decision Making Process
Communication TechnologyICT
Communication Technology

Table 5.

Stage 4 – Top Tier MOT PCA Factors and Constituent Phrases

  • In the initial 90980 term list, there are 807 terms on “electron/electrons/electronic”

  • In the 8181 term list, there are 119 terms on this

  • In the 2008 term list, there are 40 terms remaining, such as “electron acceptor,” “electron diffusion,” “electron injection,” “electron transfer,” “electronic structure,” etc.

Table 6 shows the twelve top-tier DSSC PCA factors and the phrases that load highly on those factors. These results pace a face validity test in that the grouping of terms seems generally sensible. These factors appear to be reasonable candidates for thematic analysis of this solar cell research & development activity.

Principle Component (Factor)High Loading Phrases
Sol Gel ProcessSol Gel
 Sol Gel Process
 Gel-Sol Method
Polymer ElectrolyteElectrolyte
 Ionic Liquid
 Polymer Electrolyte
 Gel Electrolyte
 Electrolyte Liquid
 Ionic Conduction
 Gel Polymer Electrolyte
 Poly electrolyte
 Temperature Molten-Salt
Conduction BandElectron Injection
 Conduction Band
 Mobile Electrons
 Density Functional Theory
Coumarin DyeOrganic Dye
 Coumarin Dye
Solar EquipmentPhoto Electrochemical cell
 Efficient Conversion
 Solar Energy
 Solar Equipment
Material NanostructureMaterial Nanostructure
 Redox Reaction
Electron TransportElectron Transpot
 Back Reaction
 Semiconducting zinc compounds
Scanning Electron MicroscopyScanning Electron Microscopy
 X-ray Diffraction
 Transmission Electron Microscopy
 Electron Microscopy
 X-ray Diffraction Analysis
 X-ray Photoelectron spectroscopy
Open Circuit VoltageOpen Circuit Voltage
 Fill Factor
Electrochemical Impedance SpectroscopyElectrochemical Impedance Spectroscopy
 Electrochemical Corrosion
 Ion Exchange
 Nanotube TiO2

Table 6.

Stage 4 – Top Tier DSSC PCA Factors and Constituent Phrases


5. Discussion

Recent attention to themes like “Big Data” and “MoneyBall” draw attention to the potential in deriving usable intelligence from information resources. We have noted the potential for transformative gains, and some potential unintended consequences, of exploiting information resources [53]. Term clumping, as presented here, offers an important tool set to help move toward real improvements in identifying, tracking, and forecasting emerging technologies and their potential applications.

Desirable features in such text analytics include:

  • Transparency of actions – not black box

  • Evaluation opportunities – we see value in comparing routines on datasets to ascertain what works better; we recognize that no one sequence of operations will be ideal for all text analytics

Phrase consolidation advantages stand out in one DSSC example. Starting with some 2000 terms relating to variations of titanium dioxide (e.g., TiO2, TiO2, TiO2 film), we reduce to 4 such terms, with the “combine term networks” (Step f) particularly helpful.

We are pointing toward generation of a macro that would present the analyst with options as to which cleaning and clumping steps to run, in what order; however, we also hope to come up with a default routine that works well to consolidate topical terms and phrases for further analyses

Some future research interests have been noted in conjunction with the list of steps, of which we are actively working on Steps h, j, and k. We are particularly interested in processing unigrams, because of the potential in such approaches to work with multiple languages. On the other hand, we appreciate the value of phrases to convey thematic structure. Possibilities include processing single words, through a sequence of steps to Topic Modeling, and then trying to associate related phrases to help capture the thrust of each topic. We see potential use of clumped terms and phrases in various text analyses.To mention two relating to competitive technical intelligence (CTI) and Future-oriented Technology Analyses (FTA):

  1. ­Combining empirical with expert analyses is highly desirable in CTI and FTA – clumped phrases can be further screened to provide digestible input for expert review to point out key topics and technologies for further scrutiny

  2. ­Clumped phrases and/or PCA factors can provide appropriate level content for Technology RoadMapping (TRM) – for instance, to be located on a temporal plot.

We recognize considerable interplay among text content types as well.This poses various cleaning issues in conjunction with co-occurrence of topical terms with time periods, authors, organizations, and class codes.We look forward to exploring ways to use clumped terms and phrases to generate valuable CTI.


Key Acronyms:

  1. CTI - Competitive Technical Intelligence

  2. DSSCs - Dye-Sensitized Solar Cells [one of two topical test sets]

  3. LSI - Latent Semantic Indexing

  4. MOT - Management of Technology [the second of two topical test sets]

  5. NLP - Natural Language Processing

  6. PCA - Principal Components Analysis

  7. ST&I - Science, Technology & Innovation

  8. TM - Topic Modeling

  9. WoS - Web of Science (including Science Citation Index)


6. Acknowledgements

We acknowledge support from the US National Science Foundation (Award #1064146 – “Revealing Innovation Pathways: Hybrid Science Maps for Technology Assessment and Foresight”). The findings and observations contained in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.

We thank David J. Schoeneck for devising groundrules for a semi-automated, 3-tier PCA and Webb Myers for the macro to combine term networks.Nils Newman has contributed pivotal ideas as we build our term clumping capabilities and determine how to deploy them.


  1. 1. 20 May 2012)
  2. 2. PorterA.L.CunninghamS.W.Tech Mining: Exploiting New Technologies for Competitive AdvantageNew YorkWiley2005
  3. 3. KimY.TianY.JeongY.RyuJ.MyaengS.Automatic Discovery of Technology, Trends from Patent Text.InProceedings of the 2009 ACM symposium on Applied Computing, ACMSAC2009, 9-12 March 2009,Hawaii, USA2009
  4. 4. VerbitskyM.Semantic TRIZThe TRIZ Journal2004; Feb. 20 May 2012)
  5. 5. PorterA. L.ZhangY.NewmanN. C.Tech Mining to Identify Topical Emergence in Management of Technology.The International Conference on Innovative Methods for Innovation Management and Policy, IM2012, 23-26 May 2012. Beijing, China2012
  6. 6. NewmanN. C.PorterA. L.NewmanD.Courseault-TrumbachC.BolanS. D.Comparing Methods to Extract Technical Content for Technological Intelligence.Portland International Conference on Management of Engineering and Technology, PICMET2012, 29 July-2 August, Vancouver, Canada2012
  7. 7. PorterA. L.NewmanD.NewmanN. C.Text Mining to identify topical emergence: Case study on’Management of Technology.The 17th International Conference on Science and Technology Indicators, STI2012, 5-8 September, Montreal, Canada2012
  8. 8. CunninghamS.W.The Content Evaluation of British Scientific Research.D.Phil. Thesis, Science Policy Research UnitUniversity of SussexBrighton, United Kingdom1996
  9. 9. HaywoodS.Academic VocabularyNottingham University 26 May, 2012)
  10. 10. PriceD.S.Little science, big science and beyondNew YorkColumbia University Press1986
  11. 11. GarfieldE.MalinM.SmallH.Citation Data as Science Indicators.Y. Elkana,et al, (Eds.), The Metric of Science: The Advent of Science Indicators,New YorkWiley1978
  12. 12. Van RaanA. F. J.Advanced Bibliometric Methods to Assess Research Performance and Scientific Development: Basic Principles and Recent Practical Applications.Research Evaluation199233151166
  13. 13. De BellisN.Bibliometrics and Citation AnalysisLanham, MDThe Scarecrow Press2009
  14. 14. PorterA.L.DetampelM.J.Technology opportunity analysis.Technol. Forecast. Soc. Change199549237255
  15. 15. WattsR.J.PorterA.L.CunninghamS.W.ZhuD.TOAS intelligence mining, an analysis of NLP and computational linguistics.Lecture Notes in Computer Science19971263323334
  16. 16. ZhuD.PorterA.L.Automated extraction and visualization of information for technological intelligence and forecasting.Technol. Forecast. Soc. Change200269495506
  17. 17. LosiewiczP.OardD.W.KostoffR.N.Textual data mining to support science and technology management.Journal of Intelligent Information Systems200015299119
  18. 18. RobinsonD.K.R.HuangL.GuoY.PorterA.L.Forecasting Innovation Pathways for New and Emerging Science & Technologies.Technological Forecasting & Social Change
  19. 19. DeerwesterS.DumalsS.FurnasG.LandauerT.HarshmanR.Indexing by latent semantic analysis.Journal of the American Society for Information Science199041391407
  20. 20. FodorI.K.A survey of dimension reduction techniques.U.S. Department of Energy, Lawrence Livermore National Lab.9 May 2002. 22 May 2012)
  21. 21. WattsR. J.PorterA. L.Mining Foreign language Information Resources, Proceedings.Portland International Conference on Management of Engineering and Technology, PICMET1999, July 1999, Portland, OR, USA;1999
  22. 22. WattsR. J.PorterA. L.MinskB.Automated text mining comparison of Japanese and USA multi-robot research, data mining 2004.Fifth International Conference on Data Mining, Text Mining and their Business Applications,15-17Sep. 2004, Malaga, Spain;2004
  23. 23. BooksteinA.KleinT.RaitaT.Clumping properties of content-bearing words.Journal of the American Society for Information Science1998492102114
  24. 24. BooksteinA.T.RaitaDiscovering term occurrence structure in text.Journal of the American Society for Information Science and Technology 2000526476486
  25. 25. BooksteinA.VladimirK.T.RaitaN.JohnAdapting measures of clumping strength to assess term-term similarity.Journal of the American Society for Information Science and Technology 2003547611620
  26. 26. BerryM.W.CastellanosM.Survey of text mining II : clustering, classification, and retrieval.New York:Springer2008
  27. 27. BeilF.EsterM.XuX.Frequent term-based text clustering.Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, KDD2002 21 May 2012)
  28. 28. ScimeA.Web mining: applications and techniques.Hershey, PAIdea Group Pub2005
  29. 29. HomayouniR.HeinrichK.WeiL.BerryM.W.Gene clustering by latentsemantic indexing of MEDLINE abstracts.Bioinformatics200521104115
  30. 30. XuL.FurlotteN.LinY.HeinrichK.BerryM.W.Functional Cohesion of Gene Sets Determined byLatent Semantic Indexing of PubMedAbstracts.PLoS ONE 201164e18851
  31. 31. LandauerT.K.McNamaraD.S.DenisS.KintschW.Handbook of Latent Semantic AnalysisMahwah, NJErlbaum Associates2007
  32. 32. MaleticJ. I.MarcusA.Supporting program comprehension using semantic and structural informationProceedings of the 23rd International Conference on Software Engineering, ICSE2001
  33. 33. BleiD.NgA.JordanM.Latent Dirichlet allocation.Journal of Machine Learning Research200339931022
  34. 34. GriffithsT.SteyversM.Finding Scientific Topics.Proceedings of the National Academy of Sciences2004101 (suppl.1)52285235
  35. 35. ThomasH.Probabilistic latent semantic indexing.Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR1999
  36. 36. AndoR.K.Latent semantic space: iterative scaling improves precision of inter-document similarity measurement.Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR2000
  37. 37. LiW.McCallumA.Pachinko allocation: DAG-structured mixture models of topic correlations.Proceedings of the 23rd international conference on Machine learning, ICML2006
  38. 38. DavidM.LiW.Mc CallumA.Mixtures of hierarchical topics with Pachinko allocation.Proceedings of the 24th international conference on Machine learning, ICML 2007
  39. 39. WangX.McCallumA.Topics over time: A non-Markov continuous-time model of topical trends.Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD2006
  40. 40. DavidM. B.JohnD. H.Dynamic topic models.Proceeding Proceedings of the 23rd international conference on Machine learning, ICML 2006
  41. 41. GruberA.Rosen-ZviM.WeissY.Hidden topic Markov models March 20, 2012
  42. 42. Rosen-ZviM.GriffithsT.SteyversM.SmythP.The author-topic model for authors and documents.Proceeding of the 20th conference on Uncertainty in artificial intelligence, UAI2004
  43. 43. Mc CallumA.Corrada-EmmanuelA.WangX.The Author-Recipient-Topic Model for Topic and Role DiscoverySocial Networks: Experiments with Enron and Academic Email. March 20, 2012
  44. 44. MeiQ.XuL.WondraM.SuH.ZhaiC.Topic sentiment mixture: modeling facets and opinions in weblogs.Proceedings of the 16th international conference on World Wide Web, WWW2007
  45. 45. OhsawaY.BensonN.E.YachidaM.Keygraph: Automatic indexing by co-occurrence graph based on building construction metaphor.Proceedings of the Advances in Digital Libraries Conference, ADL1998
  46. 46. TsudaK.ThawonmasR.KeyGraph for Visualization of Discussions in Comments of a Blog Entry with Comment Scores.World Scientific and Engineering Academy and Society (WSEAS) Trans. Computers200512417941801
  47. 47. SayyadiH.HurstM.MaykovA.Event Detection and Story Tracking in Social Streams.Proceeding of 3rd Int’l AAAI Conference on Weblogs and Social Media, ICWSM09, May 17-20, 2009, San Jose, California, USA; 2009
  48. 48. KongthonA. A.Text Mining Framework for Discovering Technological Intelligence to Support Science and Technology Management, Doctoral Dissertation,Georgia Institute of Technology2004 20 May 2012)
  49. 49. KongthonA.HaruechaiyasakC.ThaiprayoonS.Constructing Term Thesaurus using Text Association Rule Mining.Proceedings of the 2008 Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology International Conference, ECTI2008.
  50. 50. KongthonA.,AngkawattanawitN.Deriving Tree-Structured Network Relations in Bibliographic Databases.Proceedings of the 10th International Conference on Asian Digital Libraries, ICADL 2007, December 10-13, Hanoi, Vietnam; 2007
  51. 51. Courseault-TrumbachC.PayneD.Identifying Synonymous Concepts in Preparation for Technology MiningJournal of Information Science3362007
  52. 52. WattsR.J.PorterA.L.R&D cluster quality measures and technology maturity.Technological Forecasting and Social Change2003708735758
  53. 53. PorterA.L.ReadW.The Information Revolution: Current and Future Consequences.Westport, CT: JAI/Ablex; 1998
  54. 54. ZhangY.PorterA. L.HuZ.An Inductive Method for “Term Clumping”: A Case Study on Dye-Sensitized Solar Cells.The International Conference on Innovative Methods for Innovation Management and Policy, IM2012, 23-26May 2012, Beijing, China; 2012
  55. 55. PorterA.L.SchoeneckD.J.AndersonT.R.PICMET Empirically: Tracking 14 Management of Technology Topics.Portland International Conference on Management of Engineering and Technology, PICMET2012, 29 July-2 August, Vancouver, Canada2012

Written By

Alan Porter and Yi Zhang

Submitted: March 14th, 2012 Published: November 21st, 2012