“Set of Strings” Framework for Big Data Modeling

The most complicated aspect of big data modeling, compared with the relational approach, is variety, which is a consequence of the heterogeneity of the data sources accumulated in an integrated storage space. The "Set of Strings" Framework (SSF) provides a unified solution to this task by representing a database as an updatable finite set of facts. The facts are strings whose structure is defined by the current metadatabase, which is itself an updatable set of context-free generating rules. This chapter is dedicated to a formal and substantive description of SSF.


Introduction
Of the three "V's" traditionally used to describe big data (volume, variety, velocity) [1][2][3][4][5][6][7], variety is the most difficult for theoretical modeling. The main reason for this difficulty is the heterogeneity of the data sources accumulated in an integrated storage space. Data items passing to such storage have different structures and formats (more or less formalized texts, multimedia, hyperlinked trees of pages, etc.), which makes it practically impossible to apply to them the well-known relational-originated approaches to database (DB) description, manipulation, and knowledge extraction/application [8][9][10][11][12]. This obstacle makes it hard to achieve the fourth "V" (veracity), which has recently often been associated with big data [13][14][15][16][17], and hampers the implementation of data mining over such data storages [18][19][20][21].
This background makes an alternative approach to data and knowledge modeling necessary. This chapter contains a compact consideration of the so-called "Set of Strings" Framework (SSF), developed in order to integrate, on a unified theoretical basis, the capabilities already used in relational-like data representations and their associated knowledge models with the immanent property of big data: its variety.
SSF is the result of an attempt to design this basis upon the most general representation of an elementary data item that may be stored, transported, received, processed, and visualized. Such a representation is a string (of symbols or of bits, it does not matter), and SSF combines the best features of the classical string-generating formal grammars developed by Chomsky [22] with the string-operating logical systems proposed by Post [23].
The second section of this chapter is dedicated to the description of string databases (SDB), and the third to their interconnections with relational and nonrelational DB. In the fourth section, incomplete information modeling within SSF is considered. The main content of the fifth section is the so-called word equations on context-free languages (WECFL), which are a key element of the SSF algorithmics.

"Set of strings" basic equations
The foundation of SSF is the representation of a database as a finite set of strings $W_t \subset V^*$ (1), where $W_t$ denotes the DB at the discrete time moment $t$ and $V^*$ is the set of all strings in the initial (terminal) alphabet $V$. Where it is necessary to distinguish such databases from others, they will be called below "set of strings" databases (SDB). The structure of the DB elements $w_i \in W_t$, named facts, is determined by the metadatabase (MDB), whose current state is denoted by $D_t$. The couple $\langle W_t, D_t \rangle$ (2) is named the data storage (DS). The data storage is in a correct state if $W_t \in \mathcal{W}(D_t)$, where $\mathcal{W}(D_t)$ is the set of all correct databases defined by the MDB. An access message to the DS is a triple $\langle o, c, x \rangle$ (3), where $o$ is the operation whose execution is the purpose of the access (insert, delete, update, query), $c$ is the DS component (DB, MDB) that is the object of the access, and $x$ is the content of the access, i.e., the query body or the DB elements (facts) that are inserted or deleted. For simplicity it is supposed that the answer (reply) to the access is obtained by the user at the moment $t+1$, next to $t$; it is denoted $A_{t+1}$ if $c = \mathrm{DB}$ and $A^D_{t+1}$ if $c = \mathrm{MDB}$ (both sets are finite). The set of all possible access messages (3) is called the data storage manipulation language (DSML).
SSF is built on a sequential definition of four interconnected representations of DSML semantics.
Set-theoretical (S)-semantics of DSML is defined by equations on sets, which connect the input data, the DB before and after the access, and the answer (reply) to the access.
Mathematical (M)-semantics follows these equations but is defined by well-known and understandable mathematical constructions that underlie DSML.
Operational (O)-semantics is adequate to M-semantics but is represented by algorithms that execute the operations on the DB.
Finally, implementational (I)-semantics is also represented by algorithms, which in the general case are much more efficient than those of O-semantics, whose main purpose is to establish the algorithmic decidability of answer search (derivation), i.e., the possibility of generating the answer in a finite number of steps.
Let us begin with the S-semantics of the DSML segment addressing the DB, called below, as usual, the data manipulation language (DML). The equations defining DML S-semantics operate on the following sets:
1. $W_t$ (the database at the moment of the user's access to the DB).
2. $W_{t+1}$ (the database after execution of the operation, i.e., at the moment $t+1$, when the answer is accepted by the user).
3. $I_t$ (the set expressing the user's awareness of some fragment of the problem area at the moment of access to the DB).
The basic equations (4)-(9), defining DML S-semantics, cover insertion (more precisely, inclusion), deletion (exclusion), and query; everywhere "−" denotes subtraction on sets. Eqs. (4)-(9) fully correspond to the sense of the basic operations on a DB, inherent to any DML. In Eqs. (6) and (9), the set $I_t$ may be infinite.
Example 1. Consider a database containing data items from various emergency devices (because natural language is used freely in facts, the DB content needs no comment). An insertion adds a data item whose source is the device mounted at the Green Valley, which was detected as smoked since 15.20. When, at the moment $t+1$, the user accesses the DB with a query whose purpose is to get information about all smoked areas, the set $I_t$ of expression (12) is infinite, and the answer (13) to the query includes the fact AREA LOWER FOREST IS SMOKED AT 15:20. In expression (12), the names of all areas are strings in the alphabet $V = \{A, \ldots, Z, 0, \ldots, 9, :, \sqcup\}$, the last symbol being the blank.
Note that definitions (4)-(9) are not unique. For example, in the definition of inclusion, the elements of the set $I_t$ that are already present in the DB at the moment $t$ may also be included in the answer, which gives the alternative definition (16) of the answer to the access. As may be seen, Eqs. (4)-(9) are based on the closed-world interpretation, under which the absence of a fact in the database is equivalent to its absence in the real world (problem area).
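The set-theoretical reading of the three operations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's Eqs. (4)-(9): inclusion is assumed to be a union, exclusion a set difference, and a query an intersection with $I_t$, while the replies chosen for inclusion and exclusion are only one of the possible definitions. Because $I_t$ may be infinite for deletion and query, it is passed there as a membership predicate. All function names and the sample facts are hypothetical.

```python
from typing import Callable, FrozenSet, Tuple

Fact = str
DB = FrozenSet[Fact]

def insert(db: DB, facts: DB) -> Tuple[DB, DB]:
    """Inclusion, assumed as W_{t+1} = W_t | I_t; the reply lists the facts that were new."""
    return db | facts, facts - db

def delete(db: DB, aware: Callable[[Fact], bool]) -> Tuple[DB, DB]:
    """Exclusion, assumed as W_{t+1} = W_t - I_t; the reply lists the removed facts."""
    removed = frozenset(w for w in db if aware(w))
    return db - removed, removed

def query(db: DB, aware: Callable[[Fact], bool]) -> DB:
    """Query, assumed as A_{t+1} = W_t & I_t; the database itself is unchanged."""
    return frozenset(w for w in db if aware(w))

# Hypothetical database in the spirit of Example 1.
w_t = frozenset({
    "AREA LOWER FOREST IS SMOKED AT 15:20",
    "AREA LONELY TREES IS NORMAL AT 15:10",
})
w_t1, reply = insert(w_t, frozenset({"AREA GREEN VALLEY IS SMOKED AT 15:20"}))

# The infinite set I_t "all facts about smoked areas" as a predicate.
smoked = query(w_t1, lambda w: " IS SMOKED AT " in w)
```

Under the closed-world interpretation, a fact absent from `smoked` is read as an area that is not smoked.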
DML operations do not touch the MDB; thus $D_{t+1} = D_t$. Let us now consider the M- and O-semantics of DML. The foundation of the M-semantics of the simplest DML is the representation of the MDB $D_t$ as a set of context-free (CF) generating rules $\alpha \to \beta$, where $\alpha$ is a nonterminal symbol ("nonterminal" for short) and $\beta$ is a string of nonterminal and terminal symbols. From the substantive point of view, every nonterminal symbol is the name of some substring of a fact entering the DB; thus $\beta$ represents the structure of $\alpha$. The only nonterminal symbol $\alpha_0$ that does not occur in any string $\beta$ is the "axiom" in the terminology of formal grammars and the "fact" in the terminology of SSF. So the MDB $D_t$ unambiguously defines the CF grammar $G_t = \langle V, N_t, \alpha_0, D_t \rangle$, where $N_t$ is the set of nonterminals (the "nonterminal alphabet") of $G_t$. The database $W_t$ is called correct with respect to the metadatabase $D_t$ if $W_t \subseteq L(G_t)$ (20), i.e., the facts present in the DB are words of the CF language $L(G_t)$. In other notation, $W_t \subseteq \{w \in V^* \mid \alpha_0 \overset{*}{\Rightarrow}_{G_t} w\}$, where $\overset{*}{\Rightarrow}_{G_t}$ denotes that one string in the alphabet $V \cup N_t$ is generated (or derived) from another.
Example 2. Let the MDB $D_t$ consist of CF rules in which nonterminal symbols are framed by metalinguistic brackets. A database whose facts are generated by these rules is correct with respect to this MDB, unlike the database $W_t = \{\text{AREA AT NORMAL}\}$. ∎
The proposed application of CF grammars differs from the classical one, whose main purpose is to describe the set of correct sentences of some language (most frequently, a programming language). Such a description is created by the developers or researchers of the language, is based on syntactic categories referred to as nonterminals, and stays constant through the whole life cycle of the language (minor changes may be caused by language modification or deeper understanding). In the SSF case, CF generating rules describe the structure of the DB elements (facts), so nonterminals are semantic rather than syntactic objects. On the other hand, the MDB is updated by the DS administration and is a dynamic set whose changes cause immediate changes of the DB in order to keep it in a correct state.
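The correctness condition (20) can be checked mechanically when the grammar is acyclic, because an acyclic CF grammar can be unfolded into a regular expression. The sketch below is an illustration under assumptions: the rules in `MDB` are hypothetical (they only imitate the kind of facts used in the examples and are not the rules of Example 2), and `to_regex` and `is_correct` are names introduced here.

```python
import re

# Hypothetical metadatabase: nonterminal -> list of right-hand sides.
MDB = {
    "<fact>": ["AREA <name of area> IS <state> AT <time>"],
    "<name of area>": ["GREEN VALLEY", "LOWER FOREST", "LONELY TREES"],
    "<state>": ["NORMAL", "SMOKED"],
    "<time>": ["<hours>:<minutes>"],
    "<hours>": ["0<0 to 9>", "1<0 to 9>", "20", "21", "22", "23"],
    "<minutes>": ["<0 to 5><0 to 9>"],
    "<0 to 5>": list("012345"),
    "<0 to 9>": list("0123456789"),
}

NONTERMINAL = re.compile(r"<[^<>]+>")

def to_regex(form: str, rules: dict) -> str:
    """Unfold a sentential form into a regex for the words derivable from it.

    Terminates only for acyclic grammars (no nonterminal derives itself).
    """
    parts, pos = [], 0
    for m in NONTERMINAL.finditer(form):
        parts.append(re.escape(form[pos:m.start()]))          # terminal stretch
        options = "|".join(to_regex(beta, rules) for beta in rules[m.group()])
        parts.append(f"(?:{options})")                         # alternatives of the nonterminal
        pos = m.end()
    parts.append(re.escape(form[pos:]))
    return "".join(parts)

def is_correct(db, rules: dict, axiom: str = "<fact>") -> bool:
    """True if every fact of the database is a word of L(G)."""
    pattern = re.compile(to_regex(axiom, rules))
    return all(pattern.fullmatch(w) is not None for w in db)
```

With these rules, `is_correct({"AREA GREEN VALLEY IS SMOKED AT 15:20"}, MDB)` holds, while the database `{"AREA AT NORMAL"}` is rejected, in line with Example 2.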
Such changes may be defined by equations similar to Eqs. (4)-(9): for insertion (inclusion) of new CF rules into the MDB, for deletion (exclusion) of CF rules present in the MDB, and for a query to the MDB. Here $I^D_t$ is similar to $I_t$ in Eqs. (4)-(9), being a set of CF rules that represents the knowledge of the DS administration about the MDB. As seen, Eqs. (22) and (23) provide extension of the MDB; thus $D_t \subseteq D_{t+1}$, and the DB remains correct, because $L(G_t) \subseteq L(G_{t+1})$. In Eqs. (25) and (26), some part (subset) of the MDB may be deleted, so some facts $w \in L(G_t)$ may stop satisfying condition (20) of DB correctness with respect to the MDB $D_{t+1}$, because $w \notin L(G_{t+1})$. In Eqs. (25)-(30), it is presumed that $D_t$ is also an SDB, whose own MDB defines the structure of CF rules such as those of Example 2.
Let us note that the notion of SDB correctness with respect to the MDB is, from the substantive point of view, weaker than the notion of data storage correctness, because in the general case $\mathcal{W}(D_t) \subseteq 2^{L(G_t)}$, i.e., the set of databases in a correct storage is a subset of the Boolean (power set) of $L(G_t)$, while an SDB correct with respect to the MDB is any $W_t \in 2^{L(G_t)}$, i.e., every SDB whose facts are words of the CF language $L(G_t)$ would be correct, which is not true in reality. DS correctness is a generalization of the notion of DB integrity, deeply developed inside the relational approach and covering the total content of the database, i.e., the interconnections between its different elements. Various tools are known for declaring and checking integrity criteria, first of all functional dependencies and their multiple modifications [24][25][26][27][28][29][30][31][32]. Storage correctness, being the SDB analog of integrity, is considered inside SSF on the basis of augmented Post systems (APS).
Let us now consider the application of the described segment of SSF to the representation of the most frequently used data models. We shall call such application "emulation."
Let us begin with the relational model of data. Consider a relational database (RDB) whose scheme consists of relations $R_i$ with attributes named $A^i_j$. Every relation $R_i$ at the moment $t$ is a set of tuples whose $j$-th component belongs to $D^i_j$, the domain (set of possible values) of the attribute $A^i_j$. We shall define the SDB $\langle W_t, D_t \rangle$ corresponding to this RDB as follows. We shall include in the MDB $D_t$ the rules (36), where "<" and ">" are the dividers (the aforementioned metalinguistic brackets of the Backus-Naur notation) and $R_i$ and $A^i_j$ are the strings that name relations and attributes, respectively (the dividers provide syntactic unambiguity).
Along with the rules (36), the MDB will include rules that generate the sets of words in the terminal alphabet $V$ that are the domains of the respective attributes. For unambiguity we assume that the comma "," does not occur in the attribute values $v \in D^i_j$. In this way, every tuple $b_i$ of the relation $R_i$ is represented by a fact of the form (37)-(39). Note that the representation of facts in the form (37)-(39) is not unique. As seen from Examples 1 and 2, tuples entering relations may be represented as any natural language phrases described by the corresponding rules.
Let us now consider key-addressed databases (KADB). Their common feature is that every fact entering a KADB includes a unique key, which is necessary to select, delete, and update this fact. These DB are associated with the NoSQL family of DML [42][43][44][45], which in recent years has been considered a practical alternative to the SQL-like DML [8][9][10][11][12][46][47][48] developed since the introduction of the relational model of data.
We shall represent a KADB as a set of facts of the form $k_i = d_i$, where the symbol "=" inside angle brackets is the divider, $k_i \in (V - \{=\})^*$ is the key, and $d_i \in V^*$ is the data corresponding to this key (or identified by it). At every moment $t$, the KADB must satisfy the consistency condition: a KADB is consistent if $k{=}d \in W_t$ and $k{=}d' \in W_t$ imply $d = d'$, i.e., the set $W_t$ does not include two or more elements with one and the same key. The content of an access to a KADB must include the key $k$, so $I_t \subseteq \{k{=}\} \cdot V^*$, and for inclusion $I_t = \{k{=}d\}$. The S-semantics of insertion into a KADB is $W_{t+1} = (W_t - \{k{=}\} \cdot V^*) \cup \{k{=}d\}$, because postulating the fact $k{=}d$ at the moment $t$ is equivalent to negating the fact $k{=}d'$, where $d \neq d'$, which was postulated at some earlier moment $t' < t$. So in the case of a KADB, update is implemented by insertion, and the reply to this access may be defined as $A_{t+1} = (W_t \cap \{k{=}\} \cdot V^*) \cup \{k{=}d\}$; thus in the case of an update the reply contains the deleted as well as the included fact.
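The insertion-as-update behaviour of a KADB can be sketched as follows. This is a minimal illustration, assuming the facts have the form `key=data`: the database is held as a Python dictionary, which enforces the consistency condition by construction, and the reply contains the displaced fact (if any) together with the included one, as described above. The name `kadb_insert` and the sample keys are hypothetical.

```python
from typing import Dict, List, Tuple

def kadb_insert(db: Dict[str, str], key: str, data: str) -> Tuple[Dict[str, str], List[str]]:
    """Insert the fact key=data; an existing fact with the same key is displaced.

    Returns the new database and the reply: the displaced fact (if there was
    one with different data) followed by the included fact.
    """
    if "=" in key:
        raise ValueError("the divider '=' must not occur inside a key")
    reply = []
    if key in db and db[key] != data:
        reply.append(f"{key}={db[key]}")   # fact negated by the new postulation
    new_db = dict(db)                      # keep W_t intact, build W_{t+1}
    new_db[key] = data
    reply.append(f"{key}={data}")
    return new_db, reply

def as_facts(db: Dict[str, str]) -> set:
    """View the dictionary as the set of strings k=d."""
    return {f"{k}={d}" for k, d in db.items()}

w_t = {"GREEN VALLEY": "NORMAL AT 15:00"}
w_t1, reply = kadb_insert(w_t, "GREEN VALLEY", "SMOKED AT 15:20")
```

After the call, `as_facts(w_t1)` holds a single fact for the key, and `reply` lists the old and the new fact in that order.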
Concerning the M-semantics of KADB, every known class of such databases is identified by its own structure of keys and its own techniques for extracting them from the currently processed fact.
The simplest approach is implemented in the Twitter network, where the keys necessary for access to the descendants of the current element of the hypertext are bounded by two dividers: "#" on the left and a blank on the right.
In the Internet HTTP/WWW service, similar keys are represented as strings of symbols displayed in colors different from the rest of the text of the current hypertext page. This is equivalent to splitting the terminal alphabet $V$ into two subsets, the first including symbols of the ordinary colors and the second symbols of the "key-representing" colors. However, HTTP/WWW hypertexts are organized in a much more complicated manner. First of all, along with the displayed pages, whose structure is described by the hypertext markup language (HTML) or its various later versions (XML et al.), there is another KADB, whose elements contain keys, being the aforementioned strings of other colors, and data that are, in fact, uniform resource locators (URL) providing direct network access to the subordinated pages. This access is possible because a URL contains a string to which the domain name service (DNS) is applied to resolve the proper IP address. In fact, HTML is no more than a language for the convenient representation of the CF rules that form the current metadatabase of the WWW KADB.
One of the simplest versions of KADB is the so-called page database, whose elements (pages) consist of strings of equal length $l$, the first string of the page being the key [38,39] (in this case the divider "=" is redundant). The data may also be a string $p : d$, where ":" is the divider and the prefix $p$ before the sequence of $l$-symbol strings is the name of the program called to interpret this sequence (e.g., to visualize it). In the general case $d$ may be a string of bits, not only a string of symbols of the alphabet $V$.
Until now we have discussed only S-semantics and the starting point of M-semantics, which is the representation of the metadatabase as a set of rules of a CF grammar. The second such point in SSF is the representation of databases with incomplete information.

Representation of databases with incomplete information and sentential data manipulation languages
Let $D_t$ be the metadatabase. Then a database with incomplete information (DBI for short or, when it is necessary to stress its "set of strings" nature, SDBI), denoted $X_t$, is a finite set of so-called incomplete facts (N-facts), which are sentential forms (SF) of the CF grammar $G_t$: $X_t \subset SF(G_t)$, where $SF(G_t) = \{x \in (V \cup N_t)^* \mid \alpha_0 \overset{*}{\Rightarrow}_{G_t} x\}$ is the set of all sentential forms of the grammar $G_t$. Obviously, $L(G_t) \subset SF(G_t)$.
Example 3. Consider the MDB from Example 2 and a corresponding DBI containing three N-facts. The first N-fact does not contain nonterminals, so it is a fact in the sense of the S-semantics of DML. The second N-fact contains the nonterminals <state> and <minutes>, which correspond to the uncertainty of the state of the area Lonely Trees and of the time moment when this state occurs; however, this moment lies in the interval from 12.01 to 12.59. The last N-fact contains information about the same area, which was detected as smoked at 15.30. ∎
Before considering the equations that define the M-semantics of operations on a DBI, we shall introduce an interpretation of the relation $\overset{*}{\Rightarrow}_{G_t}$ of mutual derivability of sentential forms of a context-free grammar as a relation of mutual informativity of N-facts.
Let $G_t$ be an acyclic and unambiguous CF grammar [49]. Then $\overset{*}{\Rightarrow}_{G_t}$ is a partial order relation on the set $SF(G_t)$ [38,39]. The set $SF(G_t)$ has a maximal element, the axiom $\alpha_0$ ("fact"), because $\alpha_0 \overset{*}{\Rightarrow}_{G_t} x$ for every $x \in SF(G_t)$. For every subset $X \subseteq SF(G_t)$, there exists a set of its upper bounds, i.e., sentential forms (N-facts) $y \in SF(G_t)$ such that $y \overset{*}{\Rightarrow}_{G_t} x$ for all $x \in X$, and a minimal (least) upper bound $\sup X$ such that for every other upper bound $y$ from this set the relation $y \overset{*}{\Rightarrow}_{G_t} \sup X$ holds. For some $X \subseteq SF(G_t)$, there may exist a set of lower bounds, i.e., sentential forms (N-facts) $y$ such that $x \overset{*}{\Rightarrow}_{G_t} y$ for all $x \in X$, and a maximal (greatest) lower bound $\inf X$ such that for every other lower bound $y$ the relation $\inf X \overset{*}{\Rightarrow}_{G_t} y$ holds.
Algorithms of sup and inf generation are described in detail in [38,39].
Example 4. For the DBI from Example 3, $\inf X_t$ does not exist, but $\sup X_t =$ AREA <name of area> IS <state> AT 1<0 to 9>:<minutes>. ∎
A graphical illustration of the interconnections between $\sup\{x, y\}$ and $\inf\{x, y\}$ is given in Figure 1. As seen, $\sup\{x, y\}$ derives both $x$ and $y$, and each of them derives $\inf\{x, y\}$ whenever the latter exists.
We shall call a DBI nonredundant (NR) if there are no two N-facts $x$ and $x'$ entering $X_t$ one of which is more informative than the other. (It is obvious that if $x' \overset{*}{\Rightarrow}_{G_t} x$, then there is no need to store $x'$, because all the information contained in $x'$ is present in $x$. So $x'$ is a redundant N-fact, and a DBI containing such N-facts is also redundant.) Unless the contrary is declared, we shall consider only NR DBI below. When defining the M-semantics of an update of an NR DBI, understood as the inclusion of an N-fact $x \in SF(G_t)$, it is reasonable to suppose that the N-facts passed to the DBI are the most informative ones that could be acquired by the system. In this case, inclusion of the N-fact $x$ into the DBI may be defined as $X_{t+1} = (X_t - \{x' \mid x' \overset{*}{\Rightarrow}_{G_t} x \ \lor\ x \overset{*}{\Rightarrow}_{G_t} x'\}) \cup \{x\}$. According to this definition, inclusion of the N-fact $x$ into the DBI $X_t$ causes the removal from the DBI of all N-facts that are more or less informative than $x$. Such logic maintains the nonredundancy of the DBI. As seen, all N-facts present in $X_t$ and "compatible" with the N-fact $x$ by informativity are eliminated from this DBI.
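It is reasonable to define the reply to the inclusion of the N-fact $x$ as the set of N-facts eliminated from the DBI: $A_{t+1} = \{x' \in X_t \mid x' \overset{*}{\Rightarrow}_{G_t} x \ \lor\ x \overset{*}{\Rightarrow}_{G_t} x'\}$.
Example 5. Let the N-fact $x =$ AREA GREEN VALLEY IS SMOKED AT 15:30 be included into the DBI $X_t$ from Example 3; the reply then consists of the N-facts of $X_t$ comparable with $x$ by informativity. ∎
Consider now queries to a DBI. The first version of their M-semantics is obvious: $A^y_{t+1} = \{x \in X_t \mid y \overset{*}{\Rightarrow}_{G_t} x\}$ (49), where $A^y_{t+1}$ is the answer to the query with content $y$, and all N-facts from the DBI $X_t$ that are no less informative than $y$ are included in the answer.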
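The informativity relation, the nonredundant inclusion, and the first version of the query can be sketched together in Python. This is an illustration under assumptions, not the algorithms of [38,39]: the rules in `RULES` are hypothetical, right-hand sides are assumed non-empty, the grammar is assumed acyclic (otherwise the recursion may not terminate), and the names `symbols`, `derives`, `include`, and `query` are introduced here. Sentential forms are compared symbol by symbol, a symbol being either a bracketed nonterminal or a single character.

```python
import re
from functools import lru_cache

# Hypothetical metadatabase: nonterminal -> list of right-hand sides.
RULES = {
    "<fact>": ["AREA <name of area> IS <state> AT <time>"],
    "<name of area>": ["GREEN VALLEY", "LOWER FOREST", "LONELY TREES"],
    "<state>": ["NORMAL", "SMOKED"],
    "<time>": ["<hours>:<minutes>"],
    "<hours>": ["0<0 to 9>", "1<0 to 9>", "20", "21", "22", "23"],
    "<minutes>": ["<0 to 5><0 to 9>"],
    "<0 to 5>": list("012345"),
    "<0 to 9>": list("0123456789"),
}

SYMBOL = re.compile(r"<[^<>]+>|.")

def symbols(form):
    """Split a sentential form into nonterminals and single terminal characters."""
    return tuple(SYMBOL.findall(form))

def derives(y, x, rules=RULES):
    """True if y =>* x, i.e., x is no less informative than y."""
    @lru_cache(maxsize=None)
    def go(ys, xs):
        if len(ys) > len(xs):          # every symbol derives at least one symbol
            return False
        if not ys:
            return not xs
        head, rest = ys[0], ys[1:]
        if xs[0] == head and go(rest, xs[1:]):   # keep the head symbol as it is
            return True
        # otherwise rewrite the head, if it is a nonterminal, by one of its rules
        return any(go(symbols(beta) + rest, xs) for beta in rules.get(head, ()))
    return go(symbols(y), symbols(x))

def include(dbi, x, rules=RULES):
    """Nonredundant inclusion: drop every N-fact comparable with x, then add x."""
    eliminated = {z for z in dbi if derives(z, x, rules) or derives(x, z, rules)}
    return (dbi - eliminated) | {x}, eliminated

def query(dbi, y, rules=RULES):
    """First version of the query: N-facts no less informative than y."""
    return {x for x in dbi if derives(y, x, rules)}

# Hypothetical DBI, loosely modeled on the description of Example 3.
x_t = {
    "AREA LOWER FOREST IS NORMAL AT 10:00",
    "AREA LONELY TREES IS <state> AT 12:<minutes>",
    "AREA GREEN VALLEY IS <state> AT 15:30",
}
x_t1, reply = include(x_t, "AREA GREEN VALLEY IS SMOKED AT 15:30")
answer = query(x_t1, "AREA <name of area> IS SMOKED AT <time>")
```

Here the less informative N-fact about Green Valley is returned in `reply`, and the query about smoked areas selects only the N-fact in which the state SMOKED is stated explicitly.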
As seen, by this we also postulate a query language to ordinary databases (DB with complete information): it is the set of sentential forms of the CF grammar $G_t$, and when the SF $y$ is the content of the query, $I_t = \{w \in L(G_t) \mid y \overset{*}{\Rightarrow}_{G_t} w\}$, i.e., it is the set of facts more informative than the N-fact $y$ contained in the query. So, combining Eqs. (9) and (49), we obtain $A_{t+1} = \{w \in W_t \mid y \overset{*}{\Rightarrow}_{G_t} w\}$, i.e., the result of the access is the subset of the database $W_t$ containing all facts more informative than $y$.
The second version is based on interpreting the query as an action whose aim is to check whether there are possible facts $w \in L(G_t)$ that are more informative than both the N-fact $x$ entering $X_t$ and the N-fact $y$ (the query content). Even when $x$ is not a concretization of $y$, it is sensible to include $x$ in the answer, because there may be facts $w \in L(G_t)$ that are concretizations of both $x$ and $y$. The set of such facts is the intersection $W_x \cap W_y$, where $W_x = \{w \in L(G_t) \mid x \overset{*}{\Rightarrow}_{G_t} w\}$. A finite representation of this intersection is, obviously, the maximal lower bound of the set $\{x, y\}$. For this reason, the answer to the query with content $y$ may be defined by Eq. (54) or, as an alternative, by Eq. (55), both built on $\inf\{x, y\}$ for $x \in X_t$.
Example 6. Consider the DBI $X_t$ from Example 3 and the query with content $y =$ AREA <name of area> IS SMOKED AT <time>. The purpose of this query is to get information about all smoked areas; the answers follow from Eqs. (49), (54), and (55). ∎
Returning to the DB whose DML S-semantics was defined by Eqs. (4)-(9), we may now write its M-semantics equations not only for query but also for insertion and deletion. The content of an insertion access is a string $w$; similarly, the content of a delete access is a string $y$ containing terminal and nonterminal symbols of the CF grammar $G_t$. The data manipulation languages described in this section will be called sentential (SDML), because the content of any access to a DB/DBI specified with the help of such a DML is a sentential form of the CF grammar whose set of rules is the current MDB.
Concerning KADB, the capabilities of the sentential DML are comparable with those of the aforementioned NoSQL languages, which select the DB elements whose keys are specified in the queries.
Of course, it is not difficult to extend SDML with features for constructing more complicated selection criteria (including, e.g., number intervals) [38,39]. But in comparison with SQL and similar relational languages, which provide symmetric access to the DB, SDML are rather poor. To achieve the capabilities of relationally complete query languages, it is necessary to extend SDML with features for comparing values contained in different facts. Such features are also critically needed for knowledge representation, extraction, and processing.
To achieve the formulated purpose, we shall use another tool, differing from CF grammars, namely, Post systems (PS), which also operate strings but, due to variables in their basic constructions (productions), have basic capabilities for aforementioned functions. The result of integration of the described "set of strings" Terminal substitution to terms s and s 0 is called solution of word equation (58), if i.e., result of application of d to terms s and s 0 is one and the same word (here } } is identity sign).
Returning to SDB and the M-semantics of their DML, we may see that the set of terms may serve as the simplest query language to an SDB, a term being the content of a query to the DB $W_t$. However, the application of the term query language to databases with incomplete information, which contain sentential forms of the CF grammar whose scheme $D_t$ is the DS metadatabase, is not so simple and needs a more sophisticated mathematical background.
Let $G$ be the CF grammar corresponding to the metadatabase $D$ (the lower index $t$ is omitted for simplicity). We shall call a word equation on the context-free language $L(G)$ (WECFL) a couple $\langle s = s', \delta \rangle$ (70), where the first component $s = s'$ is a word equation in the sense of (58), called here the kernel, while $\delta = \{\gamma_1 \to \beta_1, \ldots, \gamma_l \to \beta_l\}$ is the so-called suffix, which defines the domains (sets of values) of the variables $\gamma_1, \ldots, \gamma_l$ entering the terms $s$ and $s'$ by means of the strings $\beta_1, \ldots, \beta_l$, containing terminal and nonterminal symbols of the grammar $G$. The kernel and the suffix must satisfy the so-called sentential condition $s[\delta] \in SF(G)$, $s'[\delta] \in SF(G)$ (72), i.e., the strings that result from applying the substitution $\delta$ to the terms $s$ and $s'$ must be sentential forms of the grammar $G$. As seen, $\delta$ is a generalization of substitution (60), so it will be called below an SF-substitution.
WECFL (70) may be read "$s = s'$, where $\delta$." The domain of the variable $\gamma_i$ is the set of strings in the terminal alphabet $V$ that are generated from the string $\beta_i$ by applying the rules of the grammar $G$. This domain is denoted $\mathcal{V}(\gamma_i) = \{u \in V^* \mid \beta_i \overset{*}{\Rightarrow} u\}$ (from here on we shall use $\overset{*}{\Rightarrow}$ in the sense of $\overset{*}{\Rightarrow}_G$).
The suffix $\delta$ defines a set of terminal substitutions into the terms $s$ and $s'$, denoted $\Sigma_\delta$ (74). As is easy to see, a direct consequence of the sentential condition (72) and definition (74) is that $s[d] \in L(G)$ and $s'[d] \in L(G)$ for every $d \in \Sigma_\delta$. If a terminal substitution $d \in \Sigma_\delta$ is such that $s[d] = s'[d]$, it is called a solution of WECFL (70). The set of solutions of WECFL (70), which is infinite in the general case, is denoted $D[s = s'; \delta]$. The function $\mathcal{V}$ may be applied to every term.
Example 8. Let the metadatabase be the same as in Example 2, and let the WECFL be <AREA GREEN VALLEY IS a = b AT 15:00, {a → <state> AT <time>, b → AREA <name of area> IS <state>}>. As seen, this equation satisfies the sentential condition, because $s[\delta] =$ AREA GREEN VALLEY IS <state> AT <time> $\in SF(G_t)$ and $s'[\delta] =$ AREA <name of area> IS <state> AT 15:00 $\in SF(G_t)$. The domains of the variables are built from $T$, the set of strings expressing time (00:00, 00:01, …, 23:58, 23:59), and $S$, the set of names of the monitored areas.
A terminal substitution such as, for instance, $d = \{a \to \text{SMOKED AT 15:00},\ b \to \text{AREA GREEN VALLEY IS SMOKED}\}$ is a solution of the presented WECFL. ∎
As seen, in the general case the set of solutions of a WECFL may be infinite, and the problem is to find a finite representation of this set.
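For small, acyclic grammars the solutions of a WECFL such as the one in Example 8 can be enumerated by brute force, which makes the definitions above concrete. The sketch below is only an illustration and is not the finite-representation technique developed in the chapter: it assumes a hypothetical set of rules `RULES`, works only when every variable domain is finite, and writes the kernel terms as Python format strings with the variables `{a}` and `{b}`. The names `words` and `solve` are introduced here.

```python
import re
from itertools import product

# Hypothetical metadatabase: nonterminal -> list of right-hand sides.
RULES = {
    "<fact>": ["AREA <name of area> IS <state> AT <time>"],
    "<name of area>": ["GREEN VALLEY", "LOWER FOREST", "LONELY TREES"],
    "<state>": ["NORMAL", "SMOKED"],
    "<time>": ["<hours>:<minutes>"],
    "<hours>": ["0<0 to 9>", "1<0 to 9>", "20", "21", "22", "23"],
    "<minutes>": ["<0 to 5><0 to 9>"],
    "<0 to 5>": list("012345"),
    "<0 to 9>": list("0123456789"),
}

NONTERMINAL = re.compile(r"<[^<>]+>")

def words(form, rules=RULES):
    """Yield every terminal string derivable from `form` (leftmost expansion).

    The enumeration is finite only for acyclic grammars.
    """
    m = NONTERMINAL.search(form)
    if m is None:
        yield form
        return
    for beta in rules[m.group()]:
        yield from words(form[:m.start()] + beta + form[m.end():], rules)

def solve(s, s_prime, suffix, rules=RULES):
    """Yield the terminal substitutions d with s[d] == s'[d]."""
    names = sorted(suffix)
    domains = [list(words(suffix[v], rules)) for v in names]   # domains of the variables
    for values in product(*domains):
        d = dict(zip(names, values))
        if s.format(**d) == s_prime.format(**d):
            yield d

# Kernel and suffix of Example 8.
solutions = list(solve(
    "AREA GREEN VALLEY IS {a}",
    "{b} AT 15:00",
    {"a": "<state> AT <time>", "b": "AREA <name of area> IS <state>"},
))
```

With these hypothetical rules the enumeration finds one solution per value of <state>; the exhaustive search is exactly what a finite representation of the solution set is meant to avoid.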
Let us consider two sentential forms $x$ and $x'$ of an unambiguous and acyclic CF grammar $G$. Each of them defines the set of strings generated (derived) from it, which are words of the language $L(G)$: $W_x = \{w \in L(G) \mid x \overset{*}{\Rightarrow} w\}$ and $W_{x'} = \{w \in L(G) \mid x' \overset{*}{\Rightarrow} w\}$. Therefore the SF $x$ and $x'$ are finite representations of the sets $W_x$ and $W_{x'}$, both being subsets of the language $L(G)$. This observation serves as the basis for the following statement, which provides the necessary solution.
Statement 1 [51]. If $W_x \cap W_{x'} \neq \emptyset$, then there exists an SF $y$ such that $x \overset{*}{\Rightarrow} y$, $x' \overset{*}{\Rightarrow} y$, and $W_x \cap W_{x'} = W_y$. Verbally, a non-empty intersection of the sets $W_x$ and $W_{x'}$ is the subset of the language $L(G)$ whose words are generated from the SF $y$, which is itself generated from both SF $x$ and $x'$. This finding is the basis for constructing the set of solutions $D[s = s'; \delta]$. Let us begin with the case where every variable of the WECFL (or, which is the same, of the term $ss'$) enters it once, i.e., there is no more than one occurrence of any variable in $ss'$.
Moreover, from now on we may use augmented terms, or even sets of them, as N-facts. The corresponding equations, which describe the M-semantics of TDML and are much more useful from the practical point of view, have several versions. As may be seen, the last of these definitions provides the most informative reply, containing the maximal unifiers of the WECFL, each corresponding to an N-fact entering the DBI.
The interested reader may find a detailed consideration of WECFL, DBI algorithmics (including N-fact fusion), and key theoretical issues of SDB/DBI internal organization, which provides associative access to the stored data as well as their compression, in [38][39][40].
What has been said about TDML is sufficient for considering the already mentioned knowledge representation, called augmented Post systems, which is the core of the deductive capabilities of the "Set of Strings" Framework. APS are described in a separate chapter of this book.