Open access peer-reviewed chapter

“Set of Strings” Framework for Big Data Modeling

Written By

Igor Sheremet

Submitted: 24 July 2018 Reviewed: 04 March 2019 Published: 27 April 2019

DOI: 10.5772/intechopen.85602

From the Edited Volume

Introduction to Data Science and Machine Learning

Edited by Keshav Sud, Pakize Erdogmus and Seifedine Kadry

Chapter metrics overview

884 Chapter Downloads

View Full Metrics

Abstract

The most complicated task for big data modeling in comparison with relational approach is its variety, being a consequence of heterogeneity of sources of data, accumulated in the integrated storage space. “Set of Strings” Framework (SSF) provides unified solution of this task by representation of database as updated finite set of facts, being strings, in which structure is defined by current metadatabase, which is also an updated set of the context-free generating rules. This chapter is dedicated to SSF formal and substantial description.

Keywords

  • big data
  • “Set of Strings” framework
  • string databases
  • context-free grammars
  • Post systems

1. Introduction

From the three “V’s,” traditionally used for description of big data (volume, variety, velocity) [1, 2, 3, 4, 5, 6, 7], variety is the most difficult for theoretical modeling. The main reason of such difficulty is heterogeneity of sources of data, accumulated in the integrated storage space. By this, data items, passing to the aforementioned storage, have different structures and formats (more or less formalized texts, multimedia, hyperlinked trees of pages, etc.), which makes practically impossible application to such data of well-known relational-originated approaches to database (DB) description, manipulation, and knowledge extraction/application [8, 9, 10, 11, 12]. This obstacle makes hardly achieved the fourth “V” (veracity), which last time is often associated with big data [13, 14, 15, 16, 17], as well as with implementation of data mining over such data storages [18, 19, 20, 21].

Such background makes necessary the alternative approach to data and knowledge modeling. This chapter contains compact consideration of the so-called “Set of Strings” Framework (SSF), developed in order to integrate on the unified theoretical basis capabilities, already used in the relational-like data representations and associated with them knowledge models, with big data immanent property—its variety.

SSF is a result of an attempt to design the aforementioned basis upon the most general representation of elementary data item, which may be stored, transported, received, processed, and visualized. Such representation is string (no matter, symbol, or bit), and SSF combines the best features of classical string-generating formal grammars, developed by Chomsky [22], with string-operating logical systems, proposed by Post [23].

The second section of this chapter is dedicated to the description of string databases (SDB), while the third, - to their interconnections with relational and non-relational DB. In the fourth section, incomplete information modeling within SSF is considered. The main content of the fifth section are the so-called word equations on context-free languages (WECFL), being key element of the SSF algorithmics.

Advertisement

2. “Set of strings” basic equations

The background of the SSF is representation of a database as a finite set of strings:

W t = w 1 w m t V , E1

where W t means DB at the discrete time moment t and V is a set of all strings in the initial (terminal) alphabet V . Such databases will be called lower, if it is necessary to distinguish them from the other, the set of strings databases. The structure of DB elements w i W t , named facts, is determined by metadatabase (MDB), in which the current state is denoted by D t .

Couple

Θ t = < W t , D t > , E2

is named data storage (DS). Data storage is in the correct state, if W t W D t , where W D t is the set of all correct databases, defined by the MDB.

Access message to DS is triple:

ω t = < o , c , x > , E3

where o is the operation, which execution is the purpose of the access (insert, delete, update, query), c is the DS component (DB, MDB) which is the objective of the access, and x is the content of the access, i.e., query body, or DB elements (facts), which are inserted or deleted. For simplicity it is supposed that the answer (reply) to the access is obtained by the user at the moment t + 1 , next to t , and it is denoted A t + 1 , if c = DB , and A t + 1 D , if c = MDB (both sets are finite).

A set of all possible access messages (3) is called data storage manipulation language (DSML).

SSF background is a sequential definition of four interconnected representations of DSML semantics.

Set-theoretical (S)-semantics of DSML is defined by equations on sets, which connect together input data, DB before and after access, and answer (reply) to the access.

Mathematical (M)-semantics follows aforementioned equations but is defined by some well-known and understandable mathematical constructions, being background of DSML.

Operational (O)-semantics is adequate to M-semantics but is represented by algorithms, providing execution of operations on DB.

At last, implementational (I)-semantics is also represented by algorithms, which, in general case, are much more efficient than the previous, in which the main purpose is recognition of algorithmic decidability of answer search (derivation), i.e., possibility of answer generation by finite number of steps.

Let us begin from S-semantics of the DSML segment, addressing DB, called lower, as usually, data manipulation language (DML). The equations, defining DML S-semantics, operate the following sets:

  1. W t (database at the moment of user’s access to DB).

  2. W t + 1 (database after execution of operation, i.e., at the moment t + 1 , when answer is accepted by the user ).

  3. I t (set, expressing user’s awareness about some fragment of problem area at the moment of access to DB).

  4. A t + 1 (answer to the access).

Basic equations, defining DML S-semantics, are as follows:

W t + 1 = W t I t , E4
A t + 1 = W t + 1 W t , E5

for insertion (speaking more precisely, inclusion),

W t + 1 = W t I t , E6
A t + 1 = W t W t + 1 , E7

for deletion (exclusion),

W t + 1 = W t , E8
A t + 1 = W t I t , E9

and for query (everywhere “ ” is subtraction on sets). As seen, Eqs. (4)(9) fully correspond to the sense of basic operations on DB, inherent to any DML. In Eqs. (6) and (9), set I t may be infinite.

Example 1. Let database, containing data items from various emergency devices, be as follows:

W t = { AREA GREEN VALLEY IS IN NORMAL STATE AT 15.03 ,
AREA BLUE LAKE IS IN NORMAL STATE AT 15.05 ,
AREA LOWER FOREST IS SMOKED AT 15.20 } , E10

(due to free use of natural language in facts, it is unnecessary to comment DB content). Equation

W t + 1 = W t AREA GREEN VALLEY IS SMOKED AT 15.20 E11

describes insertion of data item, in which the source is device, mounted at the Green Valley, which was detected as smoked since 15.20. When at this moment t + 1 user accesses DB with query, in which the purpose is to get information about all smoked areas, the infinite set I t may be as follows:

I t + 1 = { AREA A IS SMOKED AT 00.00 , ,
AREA A IS SMOKED AT 23.59 , ,
AREA AA IS SMOKED AT 00.00 , ,
AREA AA IS SMOKED AT 23.59 , ,
AREA Z IS SMOKED AT 00.00 , ,
AREA Z IS SMOKED AT 23.59 , } . E12

The answer to the query is

A t + 2 = W t + 1 I t + 1 = { AREA GREEN VALLEY IS SMOKED AT 15.20 ,
AREA LOWER FOREST IS SMOKED AT 15.20 } . E13

In expression (12), names of all areas are strings in the alphabet V = A Z 0 9 . , so

I t + 1 = AREA V SMOKED AT 00 23 . 00 59 . E14

Note that definitions (4)(9) are not unique. For example, in the inclusion definition, elements of I t set, having place in the DB at moment t , may be included to the answer

A t + 1 = W t I t , E15

as well as the answer may be defined as

A t + 1 = FACT " W t I t " ALREADY PRESENTS IN DATABASE
FACT " W t + 1 W t " IS INCLUDED TO DATABASE . E16

So, according to Eq. (16), the answer to the access may be as follows:

A t + 1 = { FACT   AREA GREEN VALLEY SMOKED
AT   15.20 ALREADY PRESENTS IN DATABASE ,
FACT   AREA LOWER FOREST IS SMOKED AT 15.20
IS INCLUDED TO DATABASE } . E17

As may be seen, Eqs. (4)(9) are based on the closed-world interpretation, which defines that the absence of the fact in the database is equivalent to its absence in the real world (problem area).

DML operations do not touch MDB; thus D t + 1 = D t .

Let us consider DML M- and O-semantics of DML.

The background of M-semantics of the simplest DML is the representation of the MDB D t as a set of the context-free (CF) generating rules α β , where α is a nonterminal symbol (“nonterminal” for short) and β is a string of both nonterminal and terminal symbols. Every nonterminal symbol, from the substantial point of view, is the name of some substring of fact, entering DB; thus β represents the structure of α . The only nonterminal symbol α 0 , which does not enter any string β , is the “axiom” in the terminology of formal grammars and “fact” in the terminology of SSF. So MDB D t unambiguously defines CF grammar

G t = < V , N t , α 0 , D t > , E18

where

N t = α α β D t } E19

is the set of nonterminals (“nonterminal alphabet”) of G t .

Database W t is named correct to metadatabase D t , if

W t L G t , E20

i.e., facts, having place in the DB, are words of the CF language L G t . In other notation,

w W t α 0 G t w , E21

where G t is used to define that string in alphabet V N t is generated (or derived) from another one.

Example 2. Let MDB D t be as follows (nonterminal symbols are framed by metalinguistic brackets):

< fact > AREA < name of area > IS < state >
AT < time > ,
< name of area > < text > ,
< state > IN NORMAL STATE ,
< state > SMOKED ,
< time > < hours > . < minutes > ,
< hours > < 0 to 1 > < 0 to 9 > ,
< hours > 2 < 0 to 3 > ,
< 0 to 1 > 0 ,
< 0 to 1 > 1 ,
< 0 to 9 > 0 ,
< 0 to 9 > 9 ,
< 0 to 3 > 0 ,
< 0 to 3 > 3 ,
< minutes > < 0 to 5 > < 0 to 9 > ,
< 0 to 5 > 0 ,
< 0 to 5 > 5 ,
< text > < symbol > ,
< text > < symbol > < text > ,
< symbol > A ,
< symbol > Z ,
< symbol > 0 ,
< symbol > 9 ,
< symbol > ˽ .

Database

W t = { AREA AW IS SMOKED AT 15.10 ,
AREA E IS IN NORMAL STATE AT 23.59 }

is correct to this MDB, unlike database

W t = AREA AT NORMAL .

Proposed application of CF grammars differs from the classical, in which the main sense is the description of a set of correct sentences of some language (most frequently, programming language). This description is created by its developers or researchers, is based on syntactic categories referred as nonterminals, and is constant through all life cycle of the language (minor changes may be done by reason of language modification or deeper understanding). In the SSF case, CF generating rules are used for description of the DB element (facts) structure, so nonterminals are more semantic than syntactic objects. From the other side, MDB is updated by DS administration and is a dynamic set, in which changes provide immediate changes of DB in order to keep it in the correct state. Such changes may be defined by the following equations, similar to Eqs. (4)(9):

D t + 1 = D t I t D , E22
A t + 1 D = D t + 1 D t , E23
W t + 1 = W t E24

for insertion (inclusion) of new CF rules to MDB,

D t + 1 = D t I t D , E25
A t + 1 D = D t D t + 1 , E26
W t + 1 = W t L G t + 1 E27

for deletion (exclusion) of CF rules, having place in MDB,

D t + 1 = D t , E28
A t + 1 D = D t I t D , E29
W t + 1 = W t E30

and for query to MDB. Here I t D is similar to I t in Eqs. (4)(9), being a set of CF rules representing knowledge of DS administration about MDB. As seen, Eqs. (22) and (23) provide extension of MDB; thus

L G t L G t + 1 , E31

and DB remains correct, because

W t L G t L G t + 1 . E32

In Eqs. (25) and (26), where some part (subset) of MDB may be deleted,

L G t + 1 L G t , E33

so some facts w L G t may become not satisfying condition (20) of DB correctness to MDB D t + 1 , because w L G t + 1 . In Eqs. (25)(30), it is presumed, that D t is also SDB, in which MDB defines structure of CF rules, which may be as Example 2.

Let us note that the notion of SDB correctness to MDB is from the substantial point of view weaker than the notion of data storage correctness, because in general case

W D t 2 L G t , E34

i.e., set of databases in correct storage is the subset of Boolean of L G t , while SDB correct to MDB is such that

W D t = 2 L G t , E35

i.e., every SDB, containing facts, being words of CF language L G t , is correct, which is not true in the reality. DS correctness is the generalization of notion of DB integrity, deeply developed inside relational approach covering the total content of database, i.e., interconnections between its different elements. There are known various tools for integrity criteria declaration and check—first of all, functional dependencies and their multiple modifications [24, 25, 26, 27, 28, 29, 30, 31, 32]. Storage correctness, being SDB analog of integrity, is considered inside SSF on the basis of augmented Post systems (APS).

Let us consider now the application of the described segment of the SSF to the representation of the most frequently used data models. We shall call such application by the term “emulation.”

Advertisement

3. Emulation of the known data models

We shall demonstrate how relational and non-relational databases may be represented on the described higher background. We shall consider relational data model as a full-scope example of databases with symmetric access (DBSA) [8, 9, 10, 11, 12] and a family of asymmetric access (or key-addressed) databases (KADB), which contains, among others, hypertext, page, and WWW- and Twitter-like DB [32, 33, 34, 35, 36, 37, 38, 39, 40, 41].

Let us begin from the relational model of data.

Consider relational database (RDB), in which the scheme is R 1 A 1 1 A m 1 1 R k ( A 1 k A mk k ) , where R 1 ,…, R k are the names of relations and A 1 1 , , A j i , , A mk k are the names of attributes. Every relation R i at moment t is the set of tuples

R i D j i × × D mj i , E36

where D j i is the domain (set of possible values of attribute A j i ).

We shall define SDB < W t , D t > , corresponding to this RDB, as follows. We shall include to the MDB D t rules

< fact > R 1 : < A 1 1 > , , < A m 1 1 > , < fact > R k : < A 1 k > , , < A mk k > , E37

where “<” and “>” are the dividers (aforementioned metalinguistic brackets in the Backus-Naur notation) and R i and A j i are the strings, being names of relations and attributes, respectively (dividers provide syntactic unambiguity).

Along with Eq. (36), MDB will include rules such that

D j i = b < A j i > b & b V , E38

i.e., these rules provide generation of sets of words in terminal alphabet V , being domains of the respective attributes. For unambiguity we assume that comma “,” does not enter values of attributes V D j i .

By this, every tuple b 1 i b mi i of the relation R i is represented by fact

R i : b 1 i , , b mi i W t . E39

Note that representation of facts in the form (37)(39) is not unique. As seen from Examples 1 and 2, tuples, entering relations, may be represented as any natural language phrases, described by the corresponding rules.

Let us consider now key-addressed databases. Their common feature is that every fact, entering KADB, includes a unique key, which is necessary to select, delete, and update this fact. These DB are associated with NoSQL family of DML [42, 43, 44, 45], which in the last years is considered as a practical alternative to SQL-like DML [8, 9, 10, 11, 12, 46, 47, 48], developed since the introduction of the relational model of data.

We shall represent KADB as set

W t = k 1 = d 1 k m = d m , E40

where symbol “=” inside angle brackets is the divider, k i V = is the key, and d i V is the data, corresponding to this key (or identified by it). At every moment t , KADB must satisfy the consistency condition: KADB is consistent, if

t k V = k = V W t 1 , E41

i.e., set W t would not include two or more elements with one and the same key.

Content of access to KADB must include key k , so I t k = V ,

and for inclusion I t = k = d . S-semantics of insertion to KADB is as follows:

W t + 1 = W t k = d , if k = V W t = , W t k = V k = d otherwise , E42

because postulation of fact k = d at moment t is equivalent to the negation of fact k = d , where d d , which was postulated at some earlier moment t < t . So in the case of KADB update is implemented by insertion, and reply to this access may be defined as follows:

A t + 1 = k = d , if k = V W t = , W t k = V k = d otherwise , E43

thus in the case of update reply contains deleted as well as included fact.

Concerning M-semantics of KADB, we may see, that every known class of such databases is identified by its own structure of keys and techniques of their extraction from the current processed fact.

The simplest approach is implemented in Twitter network, where keys, necessary for access to the descendants of the current element of the hypertext, are bounded by two dividers—“#” from the left and blank from the right.

In the Internet HTTP/WWW service, similar keys are represented as strings of symbols, visualized by other colors in comparison with the rest of the text of the current hypertext page. This is equivalent to splitting terminal alphabet V to two subsets, first including symbols of the ordinary colors and the second symbols of the “key-representing” colors. However, HTTP/WWW hypertexts are organized in a much more complicated manner. First of all, along with the displayed pages, in which the structure is described by hypertext markup language (HTML) or its various later versions (XML et al.), there is another KADB, in which elements contain keys, being the aforementioned strings of another colors, and data are, in fact, unified resource locators (URL), providing direct network access to the subordinated pages. This access is possible, because URL contains string, providing application of the domain name service (DNS) for resolving proper IP address. In fact, HTML is no more than language for the convenient representation of CF rules, which form current metadabase of the WWW KADB.

One of the simplest versions of KADB is the so-called page databases, in which elements are strings of equal length, the first string of the page being key [38, 39]. Thus

t L G t = V l V l , E44

where l is the length of the string (in this case divider “=” is redundant). Data may be also string p : d , where “:” is the divider and prefix p before the sequence of l-symbol strings defines the name of the program, called for this sequence interpretation (e.g., visualization). In general case d may be the string of bits, not only string of symbols of alphabet V .

Until now we discussed only S-semantics and start point of M-semantics, being representation of metadabase as a set of rules of CF grammar. Second such point in the SSF is the representation of databases with incomplete information.

Advertisement

4. Representation of databases with incomplete information and sentential data manipulation languages

Let D t be the metadabase. Then database with incomplete information (for short, DBI or, if it is necessary to underline “set of strings” DBI, then SDBI), denoted X t , is the finite set of the so-called incomplete facts (N-facts) being sentential forms (SF) of CF grammar G t :

X t = x 1 x m SF G t , E45

where

SF G t = x a 0 G t x E46

is the set of all sentential forms of grammar G t . Obviously, L G t SF G t .

Example 3. Consider MDB from Example 2 and corresponding DBI

X t = { AREA LONELY TREES IS NORMAL AT 12.31 ,
AREA LONELY TREES < state > AT 12 . < minutes > ,
AREA < name of area > IS SMOKED AT 15.30 } .

The first N-fact of the three, entering this DBI, does not contain nonterminals, so it is fact in the sense of S-semantics of DML. The second N-fact contains nonterminals <state> and <minutes>, which correspond to the uncertainty of the state of the area Lonely Trees and time moment, when this state occurs; however, the aforementioned moment enters interval from 12.01 to 12.59. The last N-fact contains information about the same area, which was detected as smoked at 15.30.

Before consideration of equations, defining M-semantics of operations on DBI, we shall introduce interpretation of relation G t of the mutual derivability of sentential forms of context-free grammar as relation of mutual informativity of N-facts.

Let G t be acyclic and unambiguous CF grammar [49]. If so, G t is the relation of partial order on the set SF G t [38, 39].

There is maximal element of the set SF G t —it is axiom α 0 (“fact”) because for every x SF G t , α 0 G t x . For every subset X SF G t , there exists set of its upper bounds, e.g., sentential forms (“N-facts”) y SF G t , such that y G t x for all x X , and minimal (least) upper bound, sup X , such that for every other upper bound y from the mentioned set, the relation y G t sup X is true. For some X SF G t , there may exist set of lower bounds, e.g., sentential forms (“N-facts”) y SF G t , such that x G t y for all x X , and maximal lower bound inf   X such that for every other lower bound y, inf   X G t y is true.

Algorithms of sup and inf generation are described in detail in [38, 39].

Example 4. For DBI from Example 3, inf   X t does not exist, but

sup X t = AREA < name of area > IS < state > AT 1 < 0 to 9 > . < minutes > .

At the same time,

inf { AREA LONELY TREES IS < state > AT 12.30 ,
AREA < name of area > IS SMOKED AT 12 . < minutes > }
= AREA LONELY TREES IS SMOKED AT 12.30 .

Since now we shall use interpretation of G t as of the relation of the mutual informativity of incomplete facts. According to this interpretation, x G t x means that N-fact x is not less informative in comparison with N-fact x (if x + G t x , then x is more informative and is called concretization of x ). (This interpretation naturally fits to A. Kolmogorov’s algorithmic theory of information basic postulates, i.e., constructive objects mutual complexity [50]).

Graphical illustration of interconnections between sup x y and inf x y is in Figure 1. As seen, sup x y x , sup x y y , x inf x y , y inf x y , inf x y w , and inf x y w , where w L G t and w L G t .

Figure 1.

Graphical illustration of interconnection between sup x y and inf x y .

Let us consider DBI X t = x 1 x m t SF G t . We shall call such DBI nonredundant (NR), if there are no two N-facts x and x entering X t , one of which is more informative than the other. (It is obvious that if x G t x , then there is no necessity of storing x , because all information, having place in x , presents in x. So x is a redundant N-fact, and DBI, containing such N-facts, is also redundant).

Until the contrary is declared, we shall consider only NR DBI lower. By this, when defining M-semantics of update of NR DBI, understood as inclusion of N-fact x SF G t , it is reasonable to suppose, that it contains maximally informative N-facts, which only may be acquired by the system. In this case inclusion of N-fact x X t to DBI may be defined as follows:

X t + 1 = X t x y y X t & x G t y y G t x . E47

According to this definition, N-fact x inclusion to DBI X t causes extraction from DBI of all N-facts, which are more or less informative than x. Such logics provides maintenance of nonredundancy of the DBI. As seen, all N-facts, having place in X t and being “compatible” with N-fact x by informativity, are eliminated from this DBI.

It is reasonable to define reply to the inclusion of N-fact x as set of N-facts, eliminated from DBI:

A t + 1 = X t X t + 1 . E48

Example 5. If N-fact x = AREA GREEN VALLEY IS SMOKED AT 15.30 is included to DBI X t from Example 3, then

X t + 1 = { AREA LONELY TREES IS NORMAL AT 12.31 ,
AREA LONELY TREES < state > AT 12 . < minutes > ,
AREA GREEN VALLEY IS SMOKED AT 15.30 } ,

because

AREA < name of area > SMOKED AT 15.30 G t

AREA GREEN VALLEY IS SMOKED AT 15.30 .

As to M-semantics of queries, there may by two different versions, which run out of the new DBI features in comparison with DB.

The first version is obvious:

A t + 1 y = x x X t & y x , E49

where A t + 1 y is the answer to the query with content y , and all N-facts from DBI X t , which are no less informative than x , are included to the answer.

As seen, by this we postulate query language to databases (or DB with complete information): it is the set of sentential forms of CF grammar G t , and in the case of SF y is the content of query ω t ,

I t = w y G t w & w V , E50

i.e., it is the set of facts, more informative than N-fact y , having place in the query. So combining Eqs. (9) and (49), we obtain

A t + 1 = W t I t = w y G t w & w W t , E51

i.e., the result of access is the subset of database W t , containing all facts, more informative than y .

The background of the second version is the interpretation of the query as an action, which aim is to check if there are such possible facts w L G t , which are more informative than N-fact x, entering X t , and N-fact y (query content):

w L G t x G t w & y G t w , E52

so, while x is not concretization of y , it is sensible to include x to the answer, because there may be facts w L G t , which are both x and y concretizations. The set of such facts is the intersection W x W y , where

W x = w x G t w , E53
W y = w y G t w .

Finite representation of the aforementioned intersection is, obviously, maximal lower bound of the set x y . For this reason, answer to the query with content y may be set

A ¯ t + 1 y = x x X t & inf x y } , E54

and, as an alternative,

A ¯ ¯ t + 1 y = inf x y x X t & inf x y } . E55

Example 6. Consider DBI X t from Example 3 and the query with content y = AREA < name of area > IS SMOKED AT < time > . The purpose of this query is to get information about all areas smoked. According to Eqs. (49), (54), and (55)

A t + 1 y = AREA < name of area > IS SMOKED AT 14.30 ,
A ¯ t + 1 y = { AREA LONELY TREES < state > AT 13 . < minutes > ,
AREA < name of area > IS SMOKED AT 14.30 } ,
A ¯ ¯ t + 1 y = { AREA LONELY TREES IS SMOKED AT 13 . < minutes > ,

AREA < name of area > IS SMOKED AT 14.30 } .

Returning to DB, which DML S-semantics was defined by Eqs. (4)(9), we may now write its M-semantics equations not only for query but also for insertion and deletion. Namely, if string w is the content of the insertion access, then

W t + 1 = W t w , if w L G t , W t otherwise . E56

Similarly, if string y , containing terminal and nonterminal symbols of CF grammar G t , is the content of the delete access, then

W t + 1 = W t w y G t w & w L G t . E57

Data manipulation languages, described in this section, will be called sentential (SDML), because content of any access to DB/DBI, specified with the help of such DML, is a sentential form of CF grammar, which set of rules is current MDB.

Concerning KADB, capabilities of the sentential DML are compatible with the aforementioned NoSQL languages, which provide selection of DB elements, in which keys are specified in the queries.

Of course, it is not difficult to extend SDML by features, providing construction of more complicated selection criteria (including, e.g., number intervals) [38, 39]. But in comparison with SQL and similar relational languages, providing symmetric access to DB, SDML are rather poor. To achieve capabilities of the relationally complete query languages, it is necessary to extend SDML by features, providing comparison of values, having place in different facts. Such features are critically needed also for knowledge representation, extraction, and processing.

To achieve the formulated purpose, we shall use another tool, differing from CF grammars, namely, Post systems (PS), which also operate strings but, due to variables in their basic constructions (productions), have basic capabilities for aforementioned functions. The result of integration of the described “set of strings” databases with PS are augmented Post systems. The intermediate layer between SDB and APS is formed by the so-called word equations on context-free languages, considered in the next section.

Advertisement

5. Word equations on context-free languages

Word equation is a well-known object of discrete mathematics, defined as follows [51, 52, 53, 54, 55, 56].

Word equation is written as

s = s , E58

where s and s are the so-called terms. Term is a non-empty sequence of symbols of alphabet, which we shall call terminal, presuming it is the same set V , as higher, and variables, which universum is denoted Г. So s V Г + , s V Г + . Domain (set of values) of every variable γ Γ , having place in any term, is V . Term without any variables is, obviously, word in alphabet V . At least one variable must present in WECFL or just the same in term s s (or s s ):

s s V Γ + V + . E59

Set

d = γ 1 u 1 γ n u n , E60

where γ 1 , , γ n are the variables, u 1 , , u n are the strings in alphabet V , and is the divider (which is not occasionally the same as higher in the generating rules α β , entering metadabases), is called substitution.

Term s d is the result of application of substitution d to term s and is defined as follows. If

s = u 1 ¯ γ i 1 u 2 ¯ u m ¯ γ im u ¯ m + 1 , E61

where u i ¯ V and i = 1 , , m + 1 , , then

s δ = u 1 ¯ γ i 1 ¯ u 2 ¯ u m ¯ γ im ¯ u ¯ m + 1 , E62

where

γ ij ¯ = u ij , if γ ij u ij d γ ij otherwise . E63

Definitions (62) and (63) cover general case, when some of the variables, entering term s , do not enter the substitution (60).

Substitution d is called terminal substitution to term s V Γ + V + , if

s d V + , E64

i.e., result of its application to term is word in the alphabet V . In this case, obviously,

γ i 1 γ in γ i γ n . E65

Terminal substitution to terms s and s is called solution of word equation (58), if

s d s d , E66

i.e., result of application of d to terms s and s is one and the same word (here " " is identity sign).

Returning to SDB and M-semantics of their DML, we may see that set of terms may be the simplest query language to SDB. If term

s = u 1 ¯ γ i 1 u 2 ¯ u m ¯ γ im u ¯ m + 1 , E67

is query to DB W t , then

I t = u 1 ¯ u i 1 u 2 ¯ u m ¯ u im u ¯ m + 1 u i 1 V & & u im V , E68

and

A t + 1 = W t I t = w w W t & u i 1 V u im V u 1 ¯ u i 1 u 2 ¯ u m ¯ u im u ¯ m + 1 = w , E69

so Eq. (69) is the definition of M-semantics of the term’s query language to SDB; as seen, w A t + 1 , if w W t , and word equation s = w has at least one solution.

Example 7. Consider database W t , containing three facts:

SENSOR 1 IS AT GREEN VALLEY,

SENSOR 2 IS AT BLUE LAKE,

AREA LOWER FOREST IS SMOKED.

If query s = SENSOR a , which purpose, as seen, is to select all facts with information about sensor installation, then

A t + 1 = SENSOR 1 IS AT GREEN VALLEY , SENSOR 2 IS AT BLUE LAKE ,

and solution of word equations

SENSOR a = SENSOR 1 IS AT GREEN VALLEY

and

SENSOR a = SENSOR 2 IS AT BLUE LAKE

are, respectively,

a 1 IS AT GREEN VALLEY

and

a 2 IS AT BLUE LAKE .

However, the application of the term’s query language to databases with incomplete information, containing sentential forms of CF grammar with scheme D t , being DS metadabase, is not so simple and needs more sophisticated mathematical background.

Let G be CF grammar, corresponding metadabase D (lower index t for simplicity is omitted). We shall call word equation on context-free language L G couple

< s = s , δ > , E70

where the first component s = s is the word equation in the sense (58), called here kernel, while

δ = γ 1 β 1 γ l β l , E71

is the so-called suffix, which defines domains (sets of values) of variables γ 1 , , γ l , entering terms s and s , by means of strings β 1 , , β l , containing terminal and nonterminal symbols of grammar G . Kernel and suffix must satisfy the so-called sentential condition

s δ s δ SF G , E72

i.e., strings, being the result of application of substitution δ to terms s and s , must be sentential forms of grammar G . As seen, δ is the generalization of substitution (60), so it will be called lower SF-substitution.

WECFL (70) may be read “ s = s , where δ .”

Domain of variable γ i is the set of strings in terminal alphabet V , which are generated from string β i by application of rules of grammar G . This domain is denoted as

V γ i δ = u γ i β i δ & β i u & u V E73

(from here we shall use in the sense G ).

Suffix δ defines set of terminal substitutions to terms s and s , denoted

δ = u 1 V γ 1 δ u l V γ l δ γ 1 u 1 γ l u l . E74

As it is easy to see, direct consequence of the sentential condition (72) and definition (74) is

s d s d L G , E75

for every d δ .

If terminal substitution d is such that

s d s d , E76

it is called solution of WECFL (70). Set of solutions of WECFL (70), which is infinite in general case, is denoted D s = s δ .

Function V may be applied to every term, so

V s δ = s d d δ L G , E77
V s δ = s d d δ L G . E78

Example 8. Let metadatabase be the same as in Example 2, and WECFL is

< AREA GREEN VALLEY IS a = b AT 15.00 , { a < state > AT < time > , b AREA < name of area > IS < state > } > .

As seen, this equation satisfies sentential condition, because

s δ = AREA GREEN VALLEY IS < state > AT < time > SF G t , s δ = AREA < name of area > IS < state > AT 15.00 SF G t .

According to Eq. (73),

V a δ = IN NORMAL STATE AT T SMOKED AT T V b δ = AREA S IS IN NORMAL STATE AT AREA S IS SMOKED , where T is the set of strings, explicating time ( 00.00 , 00.01 , , 23.58 , 23.59 ), while S is the set of names of the monitored areas.

Terminal substitution

s = { a SMOKED AT 15.00 , b AREA GREEN VALLEY IS SMOKED }

is the solution of the presented WECFL.

As seen, in general case the set of solutions of WECFL may be infinite, and the problem is to find finite representation of this set.

Let us consider two sentential forms x and x of unambiguous and acyclic CF grammar G . Each of them defines generated (derived) from it set of strings, being words of language L G :

W x = w x w & w V , E79
W y = w y w & w V . E80

And therefore SF x and x are finite representations of sets W x and W y , both being subsets of language L G . This obstacle serves as background for the following statement, representing necessary solution.

Statement 1 [51]. If

W = W x W x , E81

then there exists SF y such that

W = w x y & x y & y w & w V . E82

Verbally, non-empty intersection of sets W x and W y is the subset of language L G , in which words are generated from SF y , which itself is generated from SF x and x simultaneously.

Example 9. Consider SF s d and s d from Example 8. As seen, SF

y = AREA GREEN VALLEY IS < state > AT 15.00

is the finite representation of intersection W s d W s d .

Thus SF y from Eq. (82) is nothing else than required finite representation of the non-empty intersection (81).

This finding is a basis for constructing the set of solutions D s = s δ . Let us begin from the case where all variables, having place in WECFL (or, just the same, in term s s ), enter it once, i.e., there is no more than one occurrence of any variable in s s .

Obviously, if

W = V s δ V s δ = , E83

then WECFL (70) does not have a solution, i.e.,

D s = s δ = , E84

and if W , then, since s δ and s δ are sentential forms of CF grammar G , there exists finite representation of set W , being SF y generated (derived) from s δ and s δ simultaneously.

From this place it is clear, that finite representation of the set D s = s δ is set

δ ¯ = γ 1 β ¯ 1 γ l β ¯ l , E85

such that

W = { w s δ ¯ = s δ ¯ = y & y w & w V } . E86

It is easy to verify that β ¯ 1 , , β ¯ l are strings, containing terminal and nonterminal symbols, and being generated from strings β 1 , , β l , respectively, by s δ y and s y y .

Set δ ¯ will be named unifier of WECFL (70). In accordance with Eqs. (54) and (55), we shall consider lower so-called maximal unifiers, corresponding to

y = inf s δ s δ , E87

where y is the maximal lower bound of the considered two-element set.

Example 10. Let metadatabase be the same as in Example 2, and WECFL is

< AREA a IS SMOKED AT t = b AT 15.00 , { a < name of area > , t < time > , b AREA < name of area > IS < state > } > .

As seen,

s δ = AREA < name of area > IS SMOKED AT < time > , s δ = AREA < name of area > IS < state > AT 15.00 , y = inf s δ s δ = AREA < name of area > IS SMOKED AT 15.00 ,

and thus

δ ¯ = { a < name of area > , t 15.00 ,

b AREA < name of area > IS SMOKED } .

Now we may return to DBI and introduce the so-called term data manipulation language (TDML), being the set of the so-called augmented terms < s , d > , where s is the term and d is the SF-substitution. M-semantics of this language is similar to Eqs. (49), (54), and (55) and is obtained by replacement of SF y by couple < s , d > :

A t + 1 s , d = x x X t & s d x , E88
A ¯ t + 1 s , d = x x X t & inf s d x , E89
A ¯ ¯ t + 1 s , d = inf s d x x X t & inf s d x . E90

Moreover, from now we may use augmented terms or even their sets as N-facts. Corresponding equations, which describe M-semantics of TDML, much more useful from the practical point of view, are as follows:

A t + 1 s , d = < s ¯ d ¯ > < s ¯ d ¯ > X t & s d s ¯ d ¯ , E91
A ¯ t + 1 s , d = < s ¯ d ¯ > < s ¯ d ¯ > X t & inf s d s ¯ d ¯ , E92
A ¯ ¯ t + 1 s , d = D s = s ¯ d d ¯ < s ¯ d ¯ > X t . E93

As may be seen, the last definition provides the most informative reply, containing maximal unifiers of WECFL, each corresponding N-fact, entering BDI.

The concerned reader may find the detailed consideration of WECFL, DBI algorithmics (including N-facts fusion), and key theoretical issues of SDB/DBI internal organization, providing associative access to the stored data as well as their compression, in [38, 39, 40].

All the said about TDML is sufficient for consideration of already mentioned knowledge representation, called augmented Post systems, being core of the deductive capabilities of “Set of Strings” Framework. APS are described in the separate chapter of this book.

References

  1. 1. Smolan R, Erwitt J. The Human Face of Big Data. Abingdon, UK: Marston Book Services; 2012. p. 224
  2. 2. Roberts FS. What is Big Data and how has it changed. In: Invited Talk at International Conference on Data Intensive Systems Analysis for Geohazard Studies; July 18–21, 2016; Sochi, Russia; 2016
  3. 3. Chen M, Mao S, Zhang Y, Leung VCM. Big Data: Related Technologies, Challenges, and Future Prospects. NY: Springer; 2014. p. 100. DOI: 10.1007/978-3-319-06245-7
  4. 4. Mayer-Schonberger V, Cukier R. Big Data: A Revolution that Will Transform how we Live, Work, and Think. Canada: Eamon Dolan/Houghton Mifflin Harcourt; 2013. p. 242
  5. 5. Labrinidis A, Jagadish HV. Challenges and opportunities with big data. Proceedings of the VLDB Endowment. 2012;5(12):2032-2033
  6. 6. Maheshwari A. Data Analytics Made Accessible. Seattle, WA: Amazon Digital Services; 2015. p. 156
  7. 7. Marz N, Warren J. Big Data: Principles and Best Practices of Scalable Real-time Data Systems. NY: Manning Publications; 2015. p. 328
  8. 8. Codd EF. The Relational Model for Database Management: Version 2. Boston, MA: Addison Wesley; 1990. p. 567
  9. 9. Date CJ. An Introduction to Database Systems. Pearson; 2003. p. 1034
  10. 10. Sumathi S, Esakkirajan S. Fundamentals of Relational Database Management Systems. NY: Springer; 2007. p. 776
  11. 11. Mensah K. Oracle Database Programming Using Java and Web Services. Amsterdam: Elsevier; 2006. p. 1120
  12. 12. Vaswani V. MySQL Database Usage & Administration. NY: McGraw Hill; 2009. p. 368
  13. 13. Pendyala V. Veracity of Big Data: Machine Learning and Other Approaches to Verifying Truthfulness. NY: Apress; 2018. p. 177
  14. 14. McNeill C. Veracity: The Most Important “V” of Big Data. GutCheck. January 23, 2018. Available from: www.gutcheckit.com/blog/veracity-big-data
  15. 15. Normandeau K. Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity. Inside Big Data. September 12, 2003. Available from: www.insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity
  16. 16. Walker M. Data Veracity. Data Science Central. November 28, 2012. Available from: www.datasciencecentral.com/profiles/blogs/data-veracity
  17. 17. Siewert SB. Big Data in the Cloud. Data Velocity, Volume, Variety, Veracity. Available from: www.ibm.com/develoerworks/library/bd-bigdatacloud
  18. 18. Tan P-N, Steinbach M, Kumar V. Introduction to Data Mining. London, UK: Pearson/Addison Wesley; 2005. p. 400
  19. 19. Witten IH, Frank E, Hall MA. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Burlington, MA: Morgan Kaufmann; 2011. p. 664
  20. 20. Data Mining. Project Report Online. December 27, 2017. Available from: www.projectreportonline.com/data-mining/
  21. 21. Adriaans P, Zantinge D. Data Mining. London, UK: Pearson; 2005. p. 458
  22. 22. Chomsky N. Syntactic Structures. 2nd ed. Berlin, New York: Mouton de Gruyter; 2002. p. 117
  23. 23. Post EL. Formal reductions of the general combinatorial decision problem. American Journal of Mathematics. 1943;65:197-215
  24. 24. Teeling M. What is Data Integrity? Learn How to Ensure Database Data Integrity via Checks, Tests, & Best Practices. May 14, 2012. Available from: www.veracode.com/blog/2012/05/what-is-data-integrity
  25. 25. Haloin T, Morgan T. Information Modelling and Relational Databases. Burlington, MA: Morgan Kaufmann; 2010. p. 976
  26. 26. Date C. Database Design and Relation Theory: Normal Forms and all that Jazz. Sebastopol, CA: O’Reilly Media, Inc.; 2012. p. 260
  27. 27. Silberschatz A, Korth H, Sudarshan S. Database Systems Concepts. NY: McGraw Hill; 2012
  28. 28. Garcia-Molina H, Ullman JD, Widom J. Database Systems: The Complete Book. 2nd ed. London, UK: Pearson Prentice Hall; 2009
  29. 29. Fagin R, Vardi MY. The theory of data dependencies—A survey. In: Anshel M, Gewirtz W, editors. Mathematics of Information Processing. Proceedings of Symposia in Applied Mathematics. Vol. 34. 1986. pp. 19-71
  30. 30. Demetrovics J, Libkin L, Muchnik IB. Functional dependencies in relational databases: A lattice point of view. Discrete Applied Mathematics. 1992;40(2):155-185. DOI: 10.1016/0166-218X(92)90028-9
  31. 31. Megid YA, El-Tazi N, Fahmy A. Using functional dependencies in conversion of relational databases to graph databases. In: DEXA 2018: Database and Expert Systems Applications. Lecture Notes in Computer Science. Vol. 11030; 2018. pp. 350-357
  32. 32. Wiil UK. Experience with hyperbase: A hypertext database supporting collaborative work. SIGMOD Record. 1993;22(4):19-25
  33. 33. De Bra P, Pechenitzkiy M. Dynamic and adaptive hypertext: Generic frameworks, approaches and techniques. In: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia. 2009. pp. 387-388. DOI: 10.1145/1557914.1558003
  34. 34. Bagchi A, Lahoti G. Relating web pages to enable information-gathering tasks. In: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia. 2009. pp. 109-118. DOI: 10.1145/1557914.1557935
  35. 35. Lescovec J. Human navigation in networks. In: Proceedings of the 23th ACM Conference on Hypertext and Hypermedia. 2012. pp. 143-144. DOI: 10.1145/2309996.2310020
  36. 36. Hall W. From hypertext to linked data: The ever evolving web. In: Proceedings of the 22th ACM Conference on Hypertext and Hypermedia. 2011. pp. 3-4. DOI: 10.1145/1995996.1995969
  37. 37. Hara Y, Botafogo R. Hypermedia databases: a specification and formal language. In: Proceedings of Database and Expert Systems Applications DEXA’94; Athens, Greece; September 7–9, 1994. pp. 521-530. DOI: 10.1007/3-540-58435-8-218
  38. 38. Sheremet IA. Intelligent Software Environments for Information Processing Systems. Moscow: Nauka; 1994. p. 544. (In Russian)
  39. 39. Sheremet IA. Grammatical Codings. Hannover: EANS; 2012. p. 54
  40. 40. Sheremet IA. Augmented Post Systems: The Mathematical Framework for Data and Knowledge Engineering in Network-Centric Environment. Berlin: EANS; 2013. p. 395
  41. 41. Sheremet I. Data and knowledge bases with incomplete information in a “Set of Strings” framework. International Journal of Engineering and Applied Sciences. 2016;3(3):90-103
  42. 42. McCreary D, Kelly A. Making Sense of NoSQL. A Guide for Managers and the Rest of us. NY: Manning Publications; 2013. p. 312
  43. 43. Tiwari S. Professional NoSQL. Birmingham, UK: Packt Publishing; 2011. p. 384
  44. 44. Moniruzzaman AB, Hossain SA. NoSQL Database: New Era of Databases for Big Data Analytics—Classifications, Characteristics and Comparison; 2013. arXiv: 1307.0191
  45. 45. Kessin Z. Building Web Applications with Erlang. Sebastopol, CA: O’Reilly Media, Inc.; 2012. p. 156
  46. 46. Oracle Berkeley DB. Available from: www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html
  47. 47. Redis. Available from: www.redis.io
  48. 48. Cunningham L. World’s Largest Database Runs on Postgres? Available from: https://it.toolbox.com/blogs/lewiscunningham/worlds-largest-database-runs-on-postgres-052808
  49. 49. Meduna A. Formal Languages and Computation: Models and their Applications. London, UK: CRC Press; 2014. p. 315
  50. 50. Kolmogorov AN. Three approaches to the definition of notion “quantity of information”. In: The Selectas. Vol. 3 “Information Theory and Theory of Algorithms”. Moscow: Nauka; 2005. (In Russian)
  51. 51. Sheremet IA. Word Equations on Context-Free Languages. Hannover: EANS; 2011. p. 44
  52. 52. Schultz KU, editor. Word equations and related topics. Lecture Notes in Computer Science. Vol. 572. NY: Springer Verlag; 1992. p. 256
  53. 53. Lothaire M. Algebraic Combinatorics on Words. Cambridge, UK: Cambridge University Press; 2002. p. 504
  54. 54. Makanin GS. The problem of solvability of equations in a free semigroup. Mathematics of the USSR-Sbornik. 1977;32(2):129-198
  55. 55. Makanin GS. Equations in a free group. Mathematics of the USSR-Izvestiya. 1983;21(3):483-546
  56. 56. Makanin GS. Solvability of universal and positive theories in a free group. Mathematics of the USSR-Izvestiya. 1985;25(1):75-83

Written By

Igor Sheremet

Submitted: 24 July 2018 Reviewed: 04 March 2019 Published: 27 April 2019