The most complicated task for big data modeling in comparison with relational approach is its variety, being a consequence of heterogeneity of sources of data, accumulated in the integrated storage space. “Set of Strings” Framework (SSF) provides unified solution of this task by representation of database as updated finite set of facts, being strings, in which structure is defined by current metadatabase, which is also an updated set of the context-free generating rules. This chapter is dedicated to SSF formal and substantial description.
- big data
- “Set of Strings” framework
- string databases
- context-free grammars
- Post systems
From the three “V’s,” traditionally used for description of big data (volume, variety, velocity) [1, 2, 3, 4, 5, 6, 7], variety is the most difficult for theoretical modeling. The main reason of such difficulty is heterogeneity of sources of data, accumulated in the integrated storage space. By this, data items, passing to the aforementioned storage, have different structures and formats (more or less formalized texts, multimedia, hyperlinked trees of pages, etc.), which makes practically impossible application to such data of well-known relational-originated approaches to database (DB) description, manipulation, and knowledge extraction/application [8, 9, 10, 11, 12]. This obstacle makes hardly achieved the fourth “V” (veracity), which last time is often associated with big data [13, 14, 15, 16, 17], as well as with implementation of data mining over such data storages [18, 19, 20, 21].
Such background makes necessary the alternative approach to data and knowledge modeling. This chapter contains compact consideration of the so-called “Set of Strings” Framework (SSF), developed in order to integrate on the unified theoretical basis capabilities, already used in the relational-like data representations and associated with them knowledge models, with big data immanent property—its variety.
SSF is a result of an attempt to design the aforementioned basis upon the most general representation of elementary data item, which may be stored, transported, received, processed, and visualized. Such representation is string (no matter, symbol, or bit), and SSF combines the best features of classical string-generating formal grammars, developed by Chomsky , with string-operating logical systems, proposed by Post .
The second section of this chapter is dedicated to the description of string databases (SDB), while the third, - to their interconnections with relational and non-relational DB. In the fourth section, incomplete information modeling within SSF is considered. The main content of the fifth section are the so-called word equations on context-free languages (WECFL), being key element of the SSF algorithmics.
2. “Set of strings” basic equations
The background of the SSF is representation of a database as a finite set of strings:
where means DB at the discrete time moment and is a set of all strings in the initial (terminal) alphabet . Such databases will be called lower, if it is necessary to distinguish them from the other, the
is named data storage (DS). Data storage is in the correct state, if , where is the set of all correct databases, defined by the MDB.
Access message to DS is triple:
where is the operation, which execution is the purpose of the access (insert, delete, update, query), is the DS component (DB, MDB) which is the objective of the access, and is the content of the access, i.e., query body, or DB elements (facts), which are inserted or deleted. For simplicity it is supposed that the answer (reply) to the access is obtained by the user at the moment , next to , and it is denoted (both sets are finite).
A set of all possible access messages (3) is called data storage manipulation language (DSML).
SSF background is a sequential definition of four interconnected representations of DSML semantics.
Set-theoretical (S)-semantics of DSML is defined by equations on sets, which connect together input data, DB before and after access, and answer (reply) to the access.
Mathematical (M)-semantics follows aforementioned equations but is defined by some well-known and understandable mathematical constructions, being background of DSML.
Operational (O)-semantics is adequate to M-semantics but is represented by algorithms, providing execution of operations on DB.
At last, implementational (I)-semantics is also represented by algorithms, which, in general case, are much more efficient than the previous, in which the main purpose is recognition of algorithmic decidability of answer search (derivation), i.e., possibility of answer generation by finite number of steps.
Let us begin from
(database at the moment of user’s access to DB).
(database after execution of operation, i.e., at the moment ).
(set, expressing user’s awareness about some fragment of problem area at the moment of access to DB).
(answer to the access).
Basic equations, defining DML S-semantics, are as follows:
for insertion (speaking more precisely, inclusion),
for deletion (exclusion),
(due to free use of natural language in facts, it is unnecessary to comment DB content). Equation
describes insertion of data item, in which the source is device, mounted at the Green Valley, which was detected as smoked since 15.20. When at this moment user accesses DB with query, in which the purpose is to get information about all smoked areas, the infinite set may be as follows:
The answer to the query is
In expression (12), names of all areas are strings in the alphabet , so
as well as the answer may be defined as
So, according to Eq. (16), the answer to the access may be as follows:
DML operations do not touch MDB; thus .
Let us consider DML
The background of M-semantics of the simplest DML is the representation of the MDB as a set of the context-free (CF) generating rules , where is a nonterminal symbol (“nonterminal” for short) is a string of both nonterminal and terminal symbols. Every nonterminal symbol, from the substantial point of view, is the name of some substring of fact, entering DB; thus represents the structure of . The only nonterminal symbol , which does not enter any string , is the “axiom” in the terminology of formal grammars and “fact” in the terminology of SSF. So MDB unambiguously defines CF grammar
is the set of nonterminals (“nonterminal alphabet”) of .
Database is named
i.e., facts, having place in the DB, are words of the CF language . In other notation,
where is used to define that string in alphabet is generated (or derived) from another one.
is correct to this MDB, unlike database
Proposed application of CF grammars differs from the classical, in which the main sense is the description of a set of correct sentences of some language (most frequently, programming language). This description is created by its developers or researchers, is based on syntactic categories referred as nonterminals, and is constant through all life cycle of the language (minor changes may be done by reason of language modification or deeper understanding). In the SSF case, CF generating rules are used for description of the DB element (facts) structure, so nonterminals are more semantic than syntactic objects. From the other side, MDB is updated by DS administration and is a dynamic set, in which changes provide immediate changes of DB in order to keep it in the correct state. Such changes may be defined by the following equations, similar to Eqs. (4)–(9):
for insertion (inclusion) of new CF rules to MDB,
for deletion (exclusion) of CF rules, having place in MDB,
and DB remains correct, because
so some facts may become not satisfying condition (20) of DB correctness to MDB , because . In Eqs. (25)–(30), it is presumed, that is also SDB, in which MDB defines structure of CF rules, which may be as Example 2.
Let us note that the notion of SDB correctness to MDB is from the substantial point of view weaker than the notion of
i.e., set of databases in correct storage is the subset of Boolean of , while SDB correct to MDB is such that
i.e., every SDB, containing facts, being words of CF language , is correct, which is not true in the reality. DS correctness is the generalization of notion of DB integrity, deeply developed inside relational approach covering the total content of database, i.e., interconnections between its different elements. There are known various tools for integrity criteria declaration and check—first of all, functional dependencies and their multiple modifications [24, 25, 26, 27, 28, 29, 30, 31, 32]. Storage correctness, being SDB analog of integrity, is considered inside SSF on the basis of augmented Post systems (APS).
Let us consider now the application of the described segment of the SSF to the representation of the most frequently used data models. We shall call such application by the term “emulation.”
3. Emulation of the known data models
We shall demonstrate how relational and non-relational databases may be represented on the described higher background. We shall consider relational data model as a full-scope example of databases with symmetric access (DBSA) [8, 9, 10, 11, 12] and a family of asymmetric access (or key-addressed) databases (KADB), which contains, among others, hypertext, page, and WWW- and Twitter-like DB [32, 33, 34, 35, 36, 37, 38, 39, 40, 41].
Let us begin from the
Consider relational database (RDB), in which the scheme is , where ,…,are the names of relations and are the names of attributes. Every relation at moment
where is the domain (set of possible values of attribute ).
We shall define SDB , corresponding to this RDB, as follows. We shall include to the MDB rules
where “<” and “>” are the dividers (aforementioned metalinguistic brackets in the Backus-Naur notation) and and are the strings, being names of relations and attributes, respectively (dividers provide syntactic unambiguity).
Along with Eq. (36), MDB will include rules such that
i.e., these rules provide generation of sets of words in terminal alphabet , being domains of the respective attributes. For unambiguity we assume that comma “,” does not enter values of attributes .
By this, every tuple of the relation is represented by fact
Note that representation of facts in the form (37)–(39) is not unique. As seen from Examples 1 and 2, tuples, entering relations, may be represented as any natural language phrases, described by the corresponding rules.
Let us consider now
We shall represent KADB as set
where symbol “=” inside angle brackets is the divider, is the key, and is the data, corresponding to this key (or identified by it). At every moment , KADB must satisfy the consistency condition: KADB is consistent, if
i.e., set would not include two or more elements with one and the same key.
Content of access to KADB must include key , so
and for inclusion . S-semantics of insertion to KADB is as follows:
because postulation of fact at moment is equivalent to the negation of fact , where , which was postulated at some earlier moment . So in the case of KADB update is implemented by insertion, and reply to this access may be defined as follows:
thus in the case of update reply contains deleted as well as included fact.
Concerning M-semantics of KADB, we may see, that every known class of such databases is identified by its own structure of keys and techniques of their extraction from the current processed fact.
The simplest approach is implemented in Twitter network, where keys, necessary for access to the descendants of the current element of the hypertext, are bounded by two dividers—“#” from the left and blank from the right.
In the Internet HTTP/WWW service, similar keys are represented as strings of symbols, visualized by other colors in comparison with the rest of the text of the current hypertext page. This is equivalent to splitting terminal alphabet to two subsets, first including symbols of the ordinary colors and the second symbols of the “key-representing” colors. However, HTTP/WWW hypertexts are organized in a much more complicated manner. First of all, along with the displayed pages, in which the structure is described by hypertext markup language (HTML) or its various later versions (XML et al.), there is another KADB, in which elements contain keys, being the aforementioned strings of another colors, and data are, in fact, unified resource locators (URL), providing direct network access to the subordinated pages. This access is possible, because URL contains string, providing application of the domain name service (DNS) for resolving proper IP address. In fact, HTML is no more than language for the convenient representation of CF rules, which form current metadabase of the WWW KADB.
where is the length of the string (in this case divider “=” is redundant). Data may be also string , where “:” is the divider and prefix
Until now we discussed only S-semantics and start point of M-semantics, being representation of metadabase as a set of rules of CF grammar. Second such point in the SSF is the representation of databases with incomplete information.
4. Representation of databases with incomplete information and sentential data manipulation languages
Let be the metadabase. Then database with incomplete information (for short, DBI or, if it is necessary to underline “set of strings” DBI, then SDBI), denoted , is the finite set of the so-called incomplete facts (N-facts) being sentential forms (SF) of CF grammar :
is the set of all sentential forms of grammar . Obviously,
The first N-fact of the three, entering this DBI, does not contain nonterminals, so it is fact in the sense of S-semantics of DML. The second N-fact contains nonterminals <
Before consideration of equations, defining M-semantics of operations on DBI, we shall introduce interpretation of relation of the mutual derivability of sentential forms of context-free grammar as relation of mutual informativity of N-facts.
There is maximal element of the set —it is axiom (“fact”) because for every
At the same time,
Since now we shall use interpretation of as of the relation of the
Graphical illustration of interconnections between and is in Figure 1. As seen, , , , , , and , where and .
Let us consider DBI . We shall call such DBI nonredundant (NR), if there are no two N-facts
Until the contrary is declared, we shall consider only NR DBI lower. By this, when defining M-semantics of update of NR DBI, understood as inclusion of N-fact it is reasonable to suppose, that it contains maximally informative N-facts, which only may be acquired by the system. In this case inclusion of N-fact to DBI may be defined as follows:
According to this definition, N-fact inclusion to DBI causes extraction from DBI of all N-facts, which are more or less informative than
It is reasonable to define reply to the inclusion of N-fact
As to M-semantics of queries, there may by two different versions, which run out of the new DBI features in comparison with DB.
where is the answer to the query with content , and all N-facts from DBI , which are no less informative than , are included to the answer.
As seen, by this we postulate query language to databases (or DB with complete information): it is the set of sentential forms of CF grammar , and in the case of SF is the content of query ,
i.e., the result of access is the subset of database , containing all facts, more informative than .
The background of the
Finite representation of the aforementioned intersection is, obviously, maximal lower bound of the set . For this reason, answer to the query with content
and, as an alternative,
Returning to DB, which DML S-semantics was defined by Eqs. (4)–(9), we may now write its M-semantics equations not only for query but also for insertion and deletion. Namely, if string is the content of the insertion access, then
Similarly, if string , containing terminal and nonterminal symbols of CF grammar , is the content of the delete access, then
Data manipulation languages, described in this section, will be called sentential (SDML), because content of any access to DB/DBI, specified with the help of such DML, is a sentential form of CF grammar, which set of rules is current MDB.
Concerning KADB, capabilities of the sentential DML are compatible with the aforementioned NoSQL languages, which provide selection of DB elements, in which keys are specified in the queries.
Of course, it is not difficult to extend SDML by features, providing construction of more complicated selection criteria (including, e.g., number intervals) [38, 39]. But in comparison with SQL and similar relational languages, providing symmetric access to DB, SDML are rather poor. To achieve capabilities of the relationally complete query languages, it is necessary to extend SDML by features, providing comparison of values, having place in different facts. Such features are critically needed also for knowledge representation, extraction, and processing.
To achieve the formulated purpose, we shall use another tool, differing from CF grammars, namely, Post systems (PS), which also operate strings but, due to variables in their basic constructions (productions), have basic capabilities for aforementioned functions. The result of integration of the described “set of strings” databases with PS are augmented Post systems. The intermediate layer between SDB and APS is formed by the so-called word equations on context-free languages, considered in the next section.
5. Word equations on context-free languages
Word equation is written as
where and are the so-called terms. Term is a non-empty sequence of symbols of alphabet, which we shall call terminal, presuming it is the same set , as higher, and variables, which universum is denoted Г. So .
where are the variables, are the strings in alphabet , and is the divider (which is not occasionally the same as higher in the generating rules , entering metadabases), is called
Term is the result of application of substitution to term and is defined as follows. If
where and , then
Substitution is called
i.e., result of its application to term is word in the alphabet . In this case, obviously,
Terminal substitution to terms and is called
i.e., result of application of to terms and is one and the same word (here is identity sign).
Returning to SDB and M-semantics of their DML, we may see that
is query to DB , then
so Eq. (69) is the definition of M-semantics of the term’s query language to SDB; as seen, , if , and word equation has at least one solution.
If query , which purpose, as seen, is to select all facts with information about sensor installation, then
and solution of word equations
However, the application of the term’s query language to databases with incomplete information, containing sentential forms of CF grammar with scheme , being DS metadabase, is not so simple and needs more sophisticated mathematical background.
Let be CF grammar, corresponding metadabase
where the first component is the word equation in the sense (58), called here
is the so-called suffix, which defines domains (sets of values) of variables , entering terms and , by means of strings , containing terminal and nonterminal symbols of grammar . Kernel and suffix must satisfy the so-called sentential condition
i.e., strings, being the result of application of substitution to terms and , must be sentential forms of grammar . As seen, is the generalization of substitution (60), so it will be called lower
WECFL (70) may be read “, where .”
(from here we shall use in the sense ).
for every .
If terminal substitution is such that
Function V may be applied to every term, so
As seen, this equation satisfies sentential condition, because
According to Eq. (73),
, where is the set of strings, explicating time (), while
is the solution of the presented WECFL.
As seen, in general case the set of solutions of WECFL may be infinite, and the problem is to find finite representation of this set.
Let us consider two sentential forms and of unambiguous and acyclic CF grammar . Each of them defines generated (derived) from it set of strings, being words of language :
And therefore SF and are finite representations of sets and , both being subsets of language . This obstacle serves as background for the following statement, representing necessary solution.
then there exists SF
Verbally, non-empty intersection of sets and is the subset of language , in which words are generated from , which itself is generated from and simultaneously.
is the finite representation of intersection .
This finding is a basis for constructing the set of solutions . Let us begin from the case where all variables, having place in WECFL (or, just the same, in term ), enter it once, i.e., there is no more than one occurrence of any variable in .
then WECFL (70) does not have a solution, i.e.,
and if , then, since and are sentential forms of CF grammar , there exists finite representation of set , being SF
From this place it is clear, that finite representation of the set is set
It is easy to verify that are strings, containing terminal and nonterminal symbols, and being generated from strings , respectively, by and .
where is the maximal lower bound of the considered two-element set.
Now we may return to DBI and introduce the so-called term data manipulation language (
Moreover, from now we may use augmented terms or even their sets as N-facts. Corresponding equations, which describe M-semantics of TDML, much more useful from the practical point of view, are as follows:
As may be seen, the last definition provides the most informative reply, containing maximal unifiers of WECFL, each corresponding N-fact, entering BDI.
The concerned reader may find the detailed consideration of WECFL, DBI algorithmics (including N-facts fusion), and key theoretical issues of SDB/DBI internal organization, providing associative access to the stored data as well as their compression, in [38, 39, 40].
All the said about TDML is sufficient for consideration of already mentioned knowledge representation, called augmented Post systems, being core of the deductive capabilities of “Set of Strings” Framework. APS are described in the separate chapter of this book.