Open access peer-reviewed chapter - ONLINE FIRST

“Set of Strings” Framework for Big Data Modeling

By Igor Sheremet

Submitted: July 24th 2018Reviewed: March 4th 2019Published: April 27th 2019

DOI: 10.5772/intechopen.85602

Downloaded: 59

Abstract

The most complicated task for big data modeling in comparison with relational approach is its variety, being a consequence of heterogeneity of sources of data, accumulated in the integrated storage space. “Set of Strings” Framework (SSF) provides unified solution of this task by representation of database as updated finite set of facts, being strings, in which structure is defined by current metadatabase, which is also an updated set of the context-free generating rules. This chapter is dedicated to SSF formal and substantial description.

Keywords

  • big data
  • “Set of Strings” framework
  • string databases
  • context-free grammars
  • Post systems

1. Introduction

From the three “V’s,” traditionally used for description of big data (volume, variety, velocity) [1, 2, 3, 4, 5, 6, 7], variety is the most difficult for theoretical modeling. The main reason of such difficulty is heterogeneity of sources of data, accumulated in the integrated storage space. By this, data items, passing to the aforementioned storage, have different structures and formats (more or less formalized texts, multimedia, hyperlinked trees of pages, etc.), which makes practically impossible application to such data of well-known relational-originated approaches to database (DB) description, manipulation, and knowledge extraction/application [8, 9, 10, 11, 12]. This obstacle makes hardly achieved the fourth “V” (veracity), which last time is often associated with big data [13, 14, 15, 16, 17], as well as with implementation of data mining over such data storages [18, 19, 20, 21].

Such background makes necessary the alternative approach to data and knowledge modeling. This chapter contains compact consideration of the so-called “Set of Strings” Framework (SSF), developed in order to integrate on the unified theoretical basis capabilities, already used in the relational-like data representations and associated with them knowledge models, with big data immanent property—its variety.

SSF is a result of an attempt to design the aforementioned basis upon the most general representation of elementary data item, which may be stored, transported, received, processed, and visualized. Such representation is string (no matter, symbol, or bit), and SSF combines the best features of classical string-generating formal grammars, developed by Chomsky [22], with string-operating logical systems, proposed by Post [23].

The second section of this chapter is dedicated to the description of string databases (SDB), while the third, - to their interconnections with relational and non-relational DB. In the fourth section, incomplete information modeling within SSF is considered. The main content of the fifth section are the so-called word equations on context-free languages (WECFL), being key element of the SSF algorithmics.

2. “Set of strings” basic equations

The background of the SSF is representation of a database as a finite set of strings:

Wt=w1wmtV,E1

where Wtmeans DB at the discrete time moment tand Vis a set of all strings in the initial (terminal) alphabet V. Such databases will be called lower, if it is necessary to distinguish them from the other, the set of strings databases. The structure of DB elements wiWt, named facts, is determined by metadatabase (MDB), in which the current state is denoted by Dt.

Couple

Θt=<Wt,Dt>,E2

is named data storage (DS). Data storage is in the correct state, if WtWDt, where WDtis the set of all correct databases, defined by the MDB.

Access message to DS is triple:

ωt=<o,c,x>,E3

where ois the operation, which execution is the purpose of the access (insert, delete, update, query), cis the DS component (DB, MDB) which is the objective of the access, and xis the content of the access, i.e., query body, or DB elements (facts), which are inserted or deleted. For simplicity it is supposed that the answer (reply) to the access is obtained by the user at the moment t+1, next to t, and it is denoted At+1,ifc=DB,andAt+1D,ifc=MDB(both sets are finite).

A set of all possible access messages (3) is called data storage manipulation language (DSML).

SSF background is a sequential definition of four interconnected representations of DSML semantics.

Set-theoretical (S)-semantics of DSML is defined by equations on sets, which connect together input data, DB before and after access, and answer (reply) to the access.

Mathematical (M)-semantics follows aforementioned equations but is defined by some well-known and understandable mathematical constructions, being background of DSML.

Operational (O)-semantics is adequate to M-semantics but is represented by algorithms, providing execution of operations on DB.

At last, implementational (I)-semantics is also represented by algorithms, which, in general case, are much more efficient than the previous, in which the main purpose is recognition of algorithmic decidability of answer search (derivation), i.e., possibility of answer generation by finite number of steps.

Let us begin from S-semantics of the DSML segment, addressing DB, called lower, as usually, data manipulation language (DML). The equations, defining DML S-semantics, operate the following sets:

  1. Wt(database at the moment of user’s access to DB).

  2. Wt+1(database after execution of operation, i.e., at the moment t+1,when answer is acceptedbythe user).

  3. It(set, expressing user’s awareness about some fragment of problem area at the moment of access to DB).

  4. At+1(answer to the access).

Basic equations, defining DML S-semantics, are as follows:

Wt+1=WtIt,E4
At+1=Wt+1Wt,E5

for insertion (speaking more precisely, inclusion),

Wt+1=WtIt,E6
At+1=WtWt+1,E7

for deletion (exclusion),

Wt+1=Wt,E8
At+1=WtIt,E9

and for query (everywhere “” is subtraction on sets). As seen, Eqs. (4)(9) fully correspond to the sense of basic operations on DB, inherent to any DML. In Eqs. (6) and (9), set Itmay be infinite.

Example 1. Let database, containing data items from various emergency devices, be as follows:

Wt={AREA GREEN VALLEY IS IN NORMAL STATEAT15.03,
AREA BLUE LAKE IS IN NORMAL STATEAT15.05,
AREA LOWER FOREST IS SMOKEDAT15.20},E10

(due to free use of natural language in facts, it is unnecessary to comment DB content). Equation

Wt+1=WtAREA GREEN VALLEY IS SMOKEDAT15.20E11

describes insertion of data item, in which the source is device, mounted at the Green Valley, which was detected as smoked since 15.20. When at this moment t+1user accesses DB with query, in which the purpose is to get information about all smoked areas, the infinite set Itmay be as follows:

It+1={AREAAIS SMOKEDAT00.00,,
AREAAIS SMOKEDAT23.59,,
AREAAAIS SMOKEDAT00.00,,
AREAAAIS SMOKEDAT23.59,,
AREAZIS SMOKEDAT00.00,,
AREAZIS SMOKEDAT23.59,}.E12

The answer to the query is

At+2=Wt+1It+1={AREA GREEN VALLEY IS SMOKEDAT15.20,
AREA LOWER FOREST IS SMOKEDAT15.20}.E13

In expression (12), names of all areas are strings in the alphabet V=AZ09., so

It+1=AREAVSMOKEDAT0023.0059.E14

Note that definitions (4)(9) are not unique. For example, in the inclusion definition, elements of Itset, having place in the DB at moment t,may be included to the answer

At+1=WtIt,E15

as well as the answer may be defined as

At+1=FACT"WtIt"ALREADY PRESENTS IN DATABASE
FACT"Wt+1Wt"IS INCLUDED TO DATABASE.E16

So, according to Eq. (16), the answer to the access may be as follows:

At+1={FACT AREA GREEN VALLEY SMOKED
AT 15.20ALREADY PRESENTS IN DATABASE,
FACT AREA LOWER FOREST IS SMOKEDAT15.20
IS INCLUDED TO DATABASE}.E17

As may be seen, Eqs. (4)(9) are based on the closed-world interpretation, which defines that the absence of the fact in the database is equivalent to its absence in the real world (problem area).

DML operations do not touch MDB; thus Dt+1=Dt.

Let us consider DML M- and O-semantics of DML.

The background of M-semantics of the simplest DML is the representation of the MDB Dtas a set of the context-free (CF) generating rules αβ, where αis a nonterminal symbol (“nonterminal” for short) andβis a string of both nonterminal and terminal symbols. Every nonterminal symbol, from the substantial point of view, is the name of some substring of fact, entering DB; thus βrepresents the structure of α. The only nonterminal symbol α0, which does not enter any string β, is the “axiom” in the terminology of formal grammars and “fact” in the terminology of SSF. So MDB Dtunambiguously defines CF grammar

Gt=<V,Nt,α0,Dt>,E18

where

Nt=ααβDt}E19

is the set of nonterminals (“nonterminal alphabet”) of Gt.

Database Wtis named correct to metadatabase Dt, if

WtLGt,E20

i.e., facts, having place in the DB, are words of the CF language LGt. In other notation,

wWtα0Gtw,E21

where Gtis used to define that string in alphabet VNtis generated (or derived) from another one.

Example 2. Let MDB Dtbe as follows (nonterminal symbols are framed by metalinguistic brackets):

<fact>AREA<name of area>IS<state>
AT<time>,
<name of area><text>,
<state>IN NORMAL STATE,
<state>SMOKED,
<time><hours>.<minutes>,
<hours><0to1><0to9>,
<hours>2<0to3>,
<0to1>0,
<0to1>1,
<0to9>0,
<0to9>9,
<0to3>0,
<0to3>3,
<minutes><0to5><0to9>,
<0to5>0,
<0to5>5,
<text><symbol>,
<text><symbol><text>,
<symbol>A,
<symbol>Z,
<symbol>0,
<symbol>9,
<symbol>˽.

Database

Wt={AREAAWIS SMOKEDAT15.10,
AREAEIS IN NORMAL STATEAT23.59}

is correct to this MDB, unlike database

Wt=AREAATNORMAL.

Proposed application of CF grammars differs from the classical, in which the main sense is the description of a set of correct sentences of some language (most frequently, programming language). This description is created by its developers or researchers, is based on syntactic categories referred as nonterminals, and is constant through all life cycle of the language (minor changes may be done by reason of language modification or deeper understanding). In the SSF case, CF generating rules are used for description of the DB element (facts) structure, so nonterminals are more semantic than syntactic objects. From the other side, MDB is updated by DS administration and is a dynamic set, in which changes provide immediate changes of DB in order to keep it in the correct state. Such changes may be defined by the following equations, similar to Eqs. (4)(9):

Dt+1=DtItD,E22
At+1D=Dt+1Dt,E23
Wt+1=WtE24

for insertion (inclusion) of new CF rules to MDB,

Dt+1=DtItD,E25
At+1D=DtDt+1,E26
Wt+1=WtLGt+1E27

for deletion (exclusion) of CF rules, having place in MDB,

Dt+1=Dt,E28
At+1D=DtItD,E29
Wt+1=WtE30

and for query to MDB. Here ItDis similar to Itin Eqs. (4)(9), being a set of CF rules representing knowledge of DS administration about MDB. As seen, Eqs. (22) and (23) provide extension of MDB; thus

LGtLGt+1,E31

and DB remains correct, because

WtLGtLGt+1.E32

In Eqs. (25) and (26), where some part (subset) of MDB may be deleted,

LGt+1LGt,E33

so some facts wLGtmay become not satisfying condition (20) of DB correctness to MDB Dt+1, because wLGt+1. In Eqs. (25)(30), it is presumed, that Dtis also SDB, in which MDB defines structure of CF rules, which may be as Example 2.

Let us note that the notion of SDB correctness to MDB is from the substantial point of view weaker than the notion of data storage correctness, because in general case

WDt2LGt,E34

i.e., set of databases in correct storage is the subset of Boolean of LGt, while SDB correct to MDB is such that

WDt=2LGt,E35

i.e., every SDB, containing facts, being words of CF language LGt, is correct, which is not true in the reality. DS correctness is the generalization of notion of DB integrity, deeply developed inside relational approach covering the total content of database, i.e., interconnections between its different elements. There are known various tools for integrity criteria declaration and check—first of all, functional dependencies and their multiple modifications [24, 25, 26, 27, 28, 29, 30, 31, 32]. Storage correctness, being SDB analog of integrity, is considered inside SSF on the basis of augmented Post systems (APS).

Let us consider now the application of the described segment of the SSF to the representation of the most frequently used data models. We shall call such application by the term “emulation.”

3. Emulation of the known data models

We shall demonstrate how relational and non-relational databases may be represented on the described higher background. We shall consider relational data model as a full-scope example of databases with symmetric access (DBSA) [8, 9, 10, 11, 12] and a family of asymmetric access (or key-addressed) databases (KADB), which contains, among others, hypertext, page, and WWW- and Twitter-like DB [32, 33, 34, 35, 36, 37, 38, 39, 40, 41].

Let us begin from the relational model of data.

Consider relational database (RDB), in which the scheme is R1A11Am11Rk(A1kAmkk), where R1,…,Rkare the names of relations and A11,,Aji,,Amkkare the names of attributes. Every relation Riat moment t is the set of tuples

RiDji××Dmji,E36

where Djiis the domain (set of possible values of attribute Aji).

We shall define SDB <Wt,Dt>, corresponding to this RDB, as follows. We shall include to the MDB Dtrules

<fact>R1:<A11>,,<Am11>,<fact>Rk:<A1k>,,<Amkk>,E37

where “<” and “>” are the dividers (aforementioned metalinguistic brackets in the Backus-Naur notation) and Riand Ajiare the strings, being names of relations and attributes, respectively (dividers provide syntactic unambiguity).

Along with Eq. (36), MDB will include rules such that

Dji=b<Aji>b&bV,E38

i.e., these rules provide generation of sets of words in terminal alphabet V, being domains of the respective attributes. For unambiguity we assume that comma “,” does not enter values of attributes VDji.

By this, every tuple b1ibmiiof the relation Riis represented by fact

Ri:b1i,,bmiiWt.E39

Note that representation of facts in the form (37)(39) is not unique. As seen from Examples 1 and 2, tuples, entering relations, may be represented as any natural language phrases, described by the corresponding rules.

Let us consider now key-addressed databases. Their common feature is that every fact, entering KADB, includes a unique key, which is necessary to select, delete, and update this fact. These DB are associated with NoSQL family of DML [42, 43, 44, 45], which in the last years is considered as a practical alternative to SQL-like DML [8, 9, 10, 11, 12, 46, 47, 48], developed since the introduction of the relational model of data.

We shall represent KADB as set

Wt=k1=d1km=dm,E40

where symbol “=” inside angle brackets is the divider, kiV=is the key, and diVis the data, corresponding to this key (or identified by it). At every moment t, KADB must satisfy the consistency condition: KADB is consistent, if

tkV=k=VWt1,E41

i.e., set Wtwould not include two or more elements with one and the same key.

Content of access to KADB must include key k, so Itk=V,

and for inclusion It=k=d. S-semantics of insertion to KADB is as follows:

Wt+1=Wtk=d,ifk=VWt=,Wtk=Vk=dotherwise,E42

because postulation of fact k=dat moment tis equivalent to the negation of fact k=d, where dd, which was postulated at some earlier moment t<t. So in the case of KADB update is implemented by insertion, and reply to this access may be defined as follows:

At+1=k=d,ifk=VWt=,Wtk=Vk=dotherwise,E43

thus in the case of update reply contains deleted as well as included fact.

Concerning M-semantics of KADB, we may see, that every known class of such databases is identified by its own structure of keys and techniques of their extraction from the current processed fact.

The simplest approach is implemented in Twitter network, where keys, necessary for access to the descendants of the current element of the hypertext, are bounded by two dividers—“#” from the left and blank from the right.

In the Internet HTTP/WWW service, similar keys are represented as strings of symbols, visualized by other colors in comparison with the rest of the text of the current hypertext page. This is equivalent to splitting terminal alphabet Vto two subsets, first including symbols of the ordinary colors and the second symbols of the “key-representing” colors. However, HTTP/WWW hypertexts are organized in a much more complicated manner. First of all, along with the displayed pages, in which the structure is described by hypertext markup language (HTML) or its various later versions (XML et al.), there is another KADB, in which elements contain keys, being the aforementioned strings of another colors, and data are, in fact, unified resource locators (URL), providing direct network access to the subordinated pages. This access is possible, because URL contains string, providing application of the domain name service (DNS) for resolving proper IP address. In fact, HTML is no more than language for the convenient representation of CF rules, which form current metadabase of the WWW KADB.

One of the simplest versions of KADB is the so-called page databases, in which elements are strings of equal length, the first string of the page being key [38, 39]. Thus

tLGt=VlVl,E44

where lis the length of the string (in this case divider “=” is redundant). Data may be also string p:d, where “:” is the divider and prefix p before the sequence of l-symbol strings defines the name of the program, called for this sequence interpretation (e.g., visualization). In general case dmay be the string of bits, not only string of symbols of alphabet V.

Until now we discussed only S-semantics and start point of M-semantics, being representation of metadabase as a set of rules of CF grammar. Second such point in the SSF is the representation of databases with incomplete information.

4. Representation of databases with incomplete information and sentential data manipulation languages

Let Dtbe the metadabase. Then database with incomplete information (for short, DBI or, if it is necessary to underline “set of strings” DBI, then SDBI), denoted Xt, is the finite set of the so-called incomplete facts (N-facts) being sentential forms (SF) of CF grammar Gt:

Xt=x1xmSFGt,E45

where

SFGt=xa0GtxE46

is the set of all sentential forms of grammar Gt. Obviously, LGtSFGt.

Example 3. Consider MDB from Example 2 and corresponding DBI

Xt={AREA LONELY TREES IS NORMALAT12.31,
AREA LONELY TREES<state>AT12.<minutes>,
AREA<name of area>IS SMOKEDAT15.30}.

The first N-fact of the three, entering this DBI, does not contain nonterminals, so it is fact in the sense of S-semantics of DML. The second N-fact contains nonterminals <state> and <minutes>, which correspond to the uncertainty of the state of the area Lonely Trees and time moment, when this state occurs; however, the aforementioned moment enters interval from 12.01 to 12.59. The last N-fact contains information about the same area, which was detected as smoked at 15.30.

Before consideration of equations, defining M-semantics of operations on DBI, we shall introduce interpretation of relation Gtof the mutual derivability of sentential forms of context-free grammar as relation of mutual informativity of N-facts.

Let Gtbe acyclic and unambiguous CF grammar [49]. If so, Gtis the relation of partial order on the set SFGt[38, 39].

There is maximal element of the set SFGt—it is axiom α0(“fact”) because for every x SFGt,α0Gtx. For every subset XSFGt,there exists set of its upper bounds, e.g., sentential forms (“N-facts”) ySFGt, such that yGtxfor all xX, and minimal (least) upper bound, supX,such that for every other upper bound y from the mentioned set, the relation yGtsupXis true. For some XSFGt,there may exist set of lower bounds, e.g., sentential forms (“N-facts”) ySFGt,such that xGtyfor all xX, and maximal lower bound inf Xsuch that for every other lower bound y, inf XGtyis true.

Algorithms of sup and inf generation are described in detail in [38, 39].

Example 4. For DBI from Example 3, inf Xtdoes not exist, but

supXt=AREA<name of area>IS<state>AT1<0to9>.<minutes>.

At the same time,

inf{AREA LONELY TREES IS<state>AT12.30,
AREA<name of area>IS SMOKEDAT12.<minutes>}
=AREA LONELY TREES IS SMOKEDAT12.30.

Since now we shall use interpretation of Gtas of the relation of the mutual informativity of incomplete facts. According to this interpretation, xGtxmeans that N-fact xis not less informative in comparison with N-fact x (if x+Gtx, then xis more informative and is called concretization of x). (This interpretation naturally fits to A. Kolmogorov’s algorithmic theory of information basic postulates, i.e., constructive objects mutual complexity [50]).

Graphical illustration of interconnections between supxyand infxyis in Figure 1. As seen, supxyx, supxyy, xinfxy, yinfxy, infxyw, and infxyw, where wLGtand wLGt.

Figure 1.

Graphical illustration of interconnection between sup x y and inf x y .

Let us consider DBI Xt=x1xmtSFGt. We shall call such DBI nonredundant (NR), if there are no two N-facts x and xentering Xt, one of which is more informative than the other. (It is obvious that if xGtx, then there is no necessity of storing x, because all information, having place in x, presents in x. So xis a redundant N-fact, and DBI, containing such N-facts, is also redundant).

Until the contrary is declared, we shall consider only NR DBI lower. By this, when defining M-semantics of update of NR DBI, understood as inclusion of N-fact xSFGt,it is reasonable to suppose, that it contains maximally informative N-facts, which only may be acquired by the system. In this case inclusion of N-fact xXtto DBI may be defined as follows:

Xt+1=XtxyyXt&xGtyyGtx.E47

According to this definition, N-fact xinclusion to DBI Xtcauses extraction from DBI of all N-facts, which are more or less informative than x. Such logics provides maintenance of nonredundancy of the DBI. As seen, all N-facts, having place in Xtand being “compatible” with N-fact x by informativity, are eliminated from this DBI.

It is reasonable to define reply to the inclusion of N-fact x as set of N-facts, eliminated from DBI:

At+1=XtXt+1.E48

Example 5. If N-fact x=AREA GREEN VALLEY IS SMOKEDAT15.30is included to DBI Xtfrom Example 3, then

Xt+1={AREA LONELY TREES IS NORMALAT12.31,
AREA LONELY TREES<state>AT12.<minutes>,
AREA GREEN VALLEY IS SMOKEDAT15.30},

because

AREA<name of area>SMOKEDAT15.30Gt

AREA GREEN VALLEY IS SMOKEDAT15.30.

As to M-semantics of queries, there may by two different versions, which run out of the new DBI features in comparison with DB.

The first version is obvious:

At+1y=xxXt&yx,E49

where At+1yis the answer to the query with content y, and all N-facts from DBI Xt, which are no less informative than x, are included to the answer.

As seen, by this we postulate query language to databases (or DB with complete information): it is the set of sentential forms of CF grammar Gt, and in the case of SF yis the content of query ωt,

It=wyGtw&wV,E50

i.e., it is the set of facts, more informative than N-fact y, having place in the query. So combining Eqs. (9) and (49), we obtain

At+1=WtIt=wyGtw&wWt,E51

i.e., the result of access is the subset of database Wt, containing all facts, more informative than y.

The background of the second version is the interpretation of the query as an action, which aim is to check if there are such possible factswLGt,which are more informative than N-fact x, entering Xt, and N-fact y (query content):

wLGtxGtw&yGtw,E52

so, while x is not concretization of y, it is sensible to include xto the answer, because there may be facts wLGt,which are both xand yconcretizations. The set of such facts is the intersection WxWy, where

Wx=wxGtw,E53
Wy=wyGtw.

Finite representation of the aforementioned intersection is, obviously, maximal lower bound of the set xy. For this reason, answer to the query with content y may be set

A¯t+1y=xxXt&infxy},E54

and, as an alternative,

A¯¯t+1y=infxyxXt&infxy}.E55

Example 6. Consider DBI Xtfrom Example 3 and the query with content y=AREA<name of area>IS SMOKEDAT<time>. The purpose of this query is to get information about all areas smoked. According to Eqs. (49), (54), and (55)

At+1y=AREA<name of area>IS SMOKEDAT14.30,
A¯t+1y={AREA LONELY TREES<state>AT13.<minutes>,
AREA<name of area>IS SMOKEDAT14.30},
A¯¯t+1y={AREA LONELY TREES IS SMOKEDAT13.<minutes>,

AREA<name of area>IS SMOKEDAT14.30}.

Returning to DB, which DML S-semantics was defined by Eqs. (4)(9), we may now write its M-semantics equations not only for query but also for insertion and deletion. Namely, if string wis the content of the insertion access, then

Wt+1=Wtw,ifwLGt,Wtotherwise.E56

Similarly, if string y, containing terminal and nonterminal symbols of CF grammar Gt, is the content of the delete access, then

Wt+1=WtwyGtw&wLGt.E57

Data manipulation languages, described in this section, will be called sentential (SDML), because content of any access to DB/DBI, specified with the help of such DML, is a sentential form of CF grammar, which set of rules is current MDB.

Concerning KADB, capabilities of the sentential DML are compatible with the aforementioned NoSQL languages, which provide selection of DB elements, in which keys are specified in the queries.

Of course, it is not difficult to extend SDML by features, providing construction of more complicated selection criteria (including, e.g., number intervals) [38, 39]. But in comparison with SQL and similar relational languages, providing symmetric access to DB, SDML are rather poor. To achieve capabilities of the relationally complete query languages, it is necessary to extend SDML by features, providing comparison of values, having place in different facts. Such features are critically needed also for knowledge representation, extraction, and processing.

To achieve the formulated purpose, we shall use another tool, differing from CF grammars, namely, Post systems (PS), which also operate strings but, due to variables in their basic constructions (productions), have basic capabilities for aforementioned functions. The result of integration of the described “set of strings” databases with PS are augmented Post systems. The intermediate layer between SDB and APS is formed by the so-called word equations on context-free languages, considered in the next section.

5. Word equations on context-free languages

Word equation is a well-known object of discrete mathematics, defined as follows [51, 52, 53, 54, 55, 56].

Word equation is written as

s=s,E58

where sand sare the so-called terms. Term is a non-empty sequence of symbols of alphabet, which we shall call terminal, presuming it is the same set V, as higher, and variables, which universum is denoted Г. So sVГ+,sVГ+. Domain (set of values) of every variable γΓ, having place in any term, is V. Term without any variables is, obviously, word in alphabet V. At least one variable must present in WECFL or just the same in term ss(or ss):

ssVΓ+V+.E59

Set

d=γ1u1γnun,E60

where γ1,,γnare the variables, u1,,unare the strings in alphabet V, and is the divider (which is not occasionally the same as higher in the generating rules αβ, entering metadabases), is called substitution.

Term sdis the result of application of substitution dto term sand is defined as follows. If

s=u1¯γi1u2¯um¯γimu¯m+1,E61

where ui¯Vand i=1,,m+1,, then

sδ=u1¯γi1¯u2¯um¯γim¯u¯m+1,E62

where

γij¯=uij,ifγijuijdγijotherwise.E63

Definitions (62) and (63) cover general case, when some of the variables, entering term s, do not enter the substitution (60).

Substitution dis called terminal substitution to term sVΓ+V+, if

sdV+,E64

i.e., result of its application to term is word in the alphabet V. In this case, obviously,

γi1γinγiγn.E65

Terminal substitution to terms sand sis called solution of word equation (58), if

sdsd,E66

i.e., result of application of dto terms sand sis one and the same word (here ""is identity sign).

Returning to SDB and M-semantics of their DML, we may see that set of terms may be the simplest query language to SDB. If term

s=u1¯γi1u2¯um¯γimu¯m+1,E67

is query to DB Wt, then

It=u1¯ui1u2¯um¯uimu¯m+1ui1V&&uimV,E68

and

At+1=WtIt=wwWt&ui1VuimVu1¯ui1u2¯um¯uimu¯m+1=w,E69

so Eq. (69) is the definition of M-semantics of the term’s query language to SDB; as seen, wAt+1, if wWt, and word equation s=whas at least one solution.

Example 7. Consider database Wt, containing three facts:

SENSOR 1 IS AT GREEN VALLEY,

SENSOR 2 IS AT BLUE LAKE,

AREA LOWER FOREST IS SMOKED.

If query s=SENSORa, which purpose, as seen, is to select all facts with information about sensor installation, then

At+1=SENSOR1ISATGREEN VALLEY,SENSOR2ISATBLUE LAKE,

and solution of word equations

SENSOR a = SENSOR 1 IS AT GREEN VALLEY

and

SENSOR a = SENSOR 2 IS AT BLUE LAKE

are, respectively,

a1ISATGREEN VALLEY

and

a2ISATBLUE LAKE.

However, the application of the term’s query language to databases with incomplete information, containing sentential forms of CF grammar with scheme Dt, being DS metadabase, is not so simple and needs more sophisticated mathematical background.

Let Gbe CF grammar, corresponding metadabase D (lower index t for simplicity is omitted). We shall call word equation on context-free language LGcouple

<s=s,δ>,E70

where the first component s=sis the word equation in the sense (58), called here kernel, while

δ=γ1β1γlβl,E71

is the so-called suffix, which defines domains (sets of values) of variables γ1,,γl, entering terms sand s, by means of strings β1,,βl, containing terminal and nonterminal symbols of grammar G. Kernel and suffix must satisfy the so-called sentential condition

sδsδSFG,E72

i.e., strings, being the result of application of substitution δto terms sand s, must be sentential forms of grammar G. As seen, δis the generalization of substitution (60), so it will be called lower SF-substitution.

WECFL (70) may be read “s=s, where δ.”

Domain of variable γiis the set of strings in terminal alphabet V, which are generated from string βiby application of rules of grammar G. This domain is denoted as

Vγiδ=uγiβiδ&βiu&uVE73

(from here we shall use in the sense G).

Suffix δdefines set of terminal substitutions to terms sand s, denoted

δ=u1Vγ1δulVγlδγ1u1γlul.E74

As it is easy to see, direct consequence of the sentential condition (72) and definition (74) is

sdsdLG,E75

for every dδ.

If terminal substitution dis such that

sdsd,E76

it is called solution of WECFL (70). Set of solutions of WECFL (70), which is infinite in general case, is denoted Ds=sδ.

Function V may be applied to every term, so

Vsδ=sddδLG,E77
Vsδ=sddδLG.E78

Example 8. Let metadatabase be the same as in Example 2, and WECFL is

<AREA GREEN VALLEY ISa=bAT15.00,{a<state>AT<time>,bAREA<name of area>IS<state>}>.

As seen, this equation satisfies sentential condition, because

sδ=AREAGREENVALLEY IS<state>AT<time>SFGt,sδ=AREA<name of area>IS<state>AT15.00SFGt.

According to Eq. (73),

Vaδ=IN NORMAL STATEATTSMOKEDATTVbδ=AREASIS IN NORMAL STATEATAREASIS SMOKED, where Tis the set of strings, explicating time (00.00,00.01,,23.58,23.59), while S is the set of names of the monitored areas.

Terminal substitution

s={aSMOKEDAT15.00,bAREAGREENVALLEY IS SMOKED}

is the solution of the presented WECFL.

As seen, in general case the set of solutions of WECFL may be infinite, and the problem is to find finite representation of this set.

Let us consider two sentential forms xand xof unambiguous and acyclic CF grammar G. Each of them defines generated (derived) from it set of strings, being words of language LG:

Wx=wxw&wV,E79
Wy=wyw&wV.E80

And therefore SF xand xare finite representations of sets Wxand Wy, both being subsets of language LG. This obstacle serves as background for the following statement, representing necessary solution.

Statement 1 [51]. If

W=WxWx,E81

then there exists SF y such that

W=wxy&xy&yw&wV.E82

Verbally, non-empty intersection of sets Wxand Wyis the subset of language LG, in which words are generated from SFy, which itself is generated from SFxand xsimultaneously.

Example 9. Consider SF sdand sdfrom Example 8. As seen, SF

y=AREAGREENVALLEY IS<state>AT15.00

is the finite representation of intersection WsdWsd.

Thus SF y from Eq. (82) is nothing else than required finite representation of the non-empty intersection (81).

This finding is a basis for constructing the set of solutions Ds=sδ. Let us begin from the case where all variables, having place in WECFL (or, just the same, in term ss), enter it once, i.e., there is no more than one occurrence of any variable in ss.

Obviously, if

W=VsδVsδ=,E83

then WECFL (70) does not have a solution, i.e.,

Ds=sδ=,E84

and if W, then, since sδand sδare sentential forms of CF grammar G, there exists finite representation of set W, being SF y generated (derived) from sδand sδsimultaneously.

From this place it is clear, that finite representation of the set Ds=sδis set

δ¯=γ1β¯1γlβ¯l,E85

such that

W={wsδ¯=sδ¯=y&yw&wV}.E86

It is easy to verify that β¯1,,β¯lare strings, containing terminal and nonterminal symbols, and being generated from strings β1,,βl, respectively, by sδyand syy.

Set δ¯will be named unifier of WECFL (70). In accordance with Eqs. (54) and (55), we shall consider lower so-called maximal unifiers, corresponding to

y=infsδsδ,E87

where yis the maximal lower bound of the considered two-element set.

Example 10. Let metadatabase be the same as in Example 2, and WECFL is

<AREAaIS SMOKEDATt=bAT15.00,{a<name of area>,t<time>,bAREA<name of area>IS<state>}>.

As seen,

sδ=AREA<name of area>IS SMOKEDAT<time>,sδ=AREA<name of area>IS<state>AT15.00,y=infsδsδ=AREA<name of area>IS SMOKEDAT15.00,

and thus

δ¯={a<name of area>,t15.00,

bAREA<name of area>IS SMOKED}.

Now we may return to DBI and introduce the so-called term data manipulation language (TDML), being the set of the so-called augmented terms <s,d>, where sis the term and dis the SF-substitution. M-semantics of this language is similar to Eqs. (49), (54), and (55) and is obtained by replacement of SFyby couple <s,d>:

At+1s,d=xxXt&sdx,E88
A¯t+1s,d=xxXt&infsdx,E89
A¯¯t+1s,d=infsdxxXt&infsdx.E90

Moreover, from now we may use augmented terms or even their sets as N-facts. Corresponding equations, which describe M-semantics of TDML, much more useful from the practical point of view, are as follows:

At+1s,d=<s¯d¯><s¯d¯>Xt&sds¯d¯,E91
A¯t+1s,d=<s¯d¯><s¯d¯>Xt&infsds¯d¯,E92
A¯¯t+1s,d=Ds=s¯dd¯<s¯d¯>Xt.E93

As may be seen, the last definition provides the most informative reply, containing maximal unifiers of WECFL, each corresponding N-fact, entering BDI.

The concerned reader may find the detailed consideration of WECFL, DBI algorithmics (including N-facts fusion), and key theoretical issues of SDB/DBI internal organization, providing associative access to the stored data as well as their compression, in [38, 39, 40].

All the said about TDML is sufficient for consideration of already mentioned knowledge representation, called augmented Post systems, being core of the deductive capabilities of “Set of Strings” Framework. APS are described in the separate chapter of this book.

Download

chapter PDF

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Igor Sheremet (April 27th 2019). “Set of Strings” Framework for Big Data Modeling [Online First], IntechOpen, DOI: 10.5772/intechopen.85602. Available from:

chapter statistics

59total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More about us