Open access peer-reviewed chapter

Data Mining and Fuzzy Data Mining Using MapReduce Algorithms

By Poli Venkata Subba Reddy

Submitted: December 4th 2019. Reviewed: March 23rd 2020. Published: January 20th 2021.

DOI: 10.5772/intechopen.92232


Abstract

Data mining is a knowledge discovery process. It has to deal with both exact and inexact information. Statistical methods deal with inexact information, but they are based on likelihood. Zadeh's fuzzy logic also deals with inexact information, but it is based on belief and is simple to use. Data mining consists of methods and classifications, which are discussed here for both exact and inexact information. Retrieval of information is important in data mining. The time and space complexity is high for big data and has to be reduced. The time complexity is reduced through the consecutive retrieval (C-R) property, and the space complexity is reduced with blackboard systems. Data mining for web data is discussed. In web data mining, the original data has to be disclosed, so fuzzy web data mining is discussed as a way to secure the data. Fuzzy web programming is also discussed. Data mining, fuzzy data mining, and web data mining are discussed through MapReduce algorithms.

Keywords

  • data mining
  • fuzzy logic
  • fuzzy data mining
  • web data mining
  • fuzzy MapReduce algorithms

1. Introduction

Data mining is an emerging area of knowledge discovery that extracts hidden and useful information from large amounts of data. Data mining methods such as association rules, clustering, and classification use advanced algorithms such as decision trees and k-means for different purposes and goals. The research fields of data mining include machine learning, deep learning, and sentiment analysis. For big data analysis, information has to be retrieved within a reasonable time. This may be achieved through the consecutive retrieval (C-R) of datasets for queries. The C-R property was first introduced by Ghosh [1] and was later extended to statistical databases. The C-R cluster property is a presorting that stores the datasets for clusters. In this chapter, the C-R property is extended to cluster analysis, and MapReduce algorithms are studied for cluster analysis. The time and space complexity is reduced through the C-R cluster property. Security of the data is one of the major issues for data analytics and data science when the original data is not to be disclosed.

Web programming has to handle incomplete information. Web intelligence is an emerging area that performs data mining to handle incomplete information. Such incomplete information is fuzzy rather than probabilistic. In this chapter, fuzzy web programming is discussed as a way to perform data mining using fuzzy logic. The fuzzy algorithmic language FUZZYALGOL is discussed for designing queries in data mining. Some examples of web programming with fuzzy data mining are given.

2. Data mining

Data mining [2, 3, 4, 5] is basically performed as a knowledge discovery process. Some of the well-known data mining methods are frequent itemset mining, association rule mining, and clustering. Data warehousing is the representation of a relational dataset in two or more dimensions. The space complexity of data mining can be reduced by storing data warehouses consecutively.

The relational dataset is a representation of data with attributes and tuples.

Definition: A relational dataset (or cluster dataset) R is defined as a collection of attributes A1, A2, …, Am and tuples t1, t2, …, tn and is represented as

R = A1 × A2 × … × Am

ti = ai1 × ai2 × … × aim are tuples, where i = 1, 2, …, n

or

R(A1, A2, …, Am), where R is a relation.

R(ti) = (ai1, ai2, …, aim) are tuples, where i = 1, 2, …, n

For instance, two sample datasets “price” and “sales” are given in Tables 1 and 2, respectively.

The lossless join of the datasets “price” and “sales” is given in Table 3.

In the following, some of the methods (frequency, association rule, and clustering) are discussed.

Consider the “purchase” relational dataset given in Table 4.

2.1 Frequency

Frequency is the number of times data occurs repeatedly.

Consider the following query:

Find the customers who purchase more than one item.

SELECT P.CNo, COUNT(*)
FROM purchase P
GROUP BY P.CNo
HAVING COUNT(*) > 1;

The output of this query is given in Table 5.

| INo | IName | Price |
|---|---|---|
| I005 | Shirt | 100 |
| I007 | Dress | 50 |
| I004 | Pants | 80 |
| I008 | Jacket | 60 |
| I009 | Skirt | 100 |

Table 1.

Sample dataset “price.”

| INo | IName | Sales |
|---|---|---|
| I005 | Shirt | 80 |
| I007 | Dress | 60 |
| I004 | Pants | 100 |
| I008 | Jacket | 50 |
| I009 | Skirt | 80 |

Table 2.

Sample dataset “sales.”

| INo | IName | Sales | Price |
|---|---|---|---|
| I005 | Shirt | 80 | 100 |
| I007 | Dress | 60 | 50 |
| I004 | Pants | 100 | 80 |
| I008 | Jacket | 50 | 60 |
| I009 | Skirt | 80 | 100 |

Table 3.

Lossless join of the price and sales datasets.
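The join of Tables 1 and 2 can be sketched in a few lines of Python. This is only an illustration, assuming that the join is a natural join on the shared attributes INo and IName; the function name `natural_join` is hypothetical and not part of the chapter.

```python
# Rows of Table 1 (price) and Table 2 (sales): (INo, IName, value).
price = [
    ("I005", "Shirt", 100),
    ("I007", "Dress", 50),
    ("I004", "Pants", 80),
    ("I008", "Jacket", 60),
    ("I009", "Skirt", 100),
]
sales = [
    ("I005", "Shirt", 80),
    ("I007", "Dress", 60),
    ("I004", "Pants", 100),
    ("I008", "Jacket", 50),
    ("I009", "Skirt", 80),
]


def natural_join(sales_rows, price_rows):
    """Join on the shared attributes (INo, IName).

    Returns rows of the form (INo, IName, Sales, Price).
    """
    price_by_key = {(ino, name): p for ino, name, p in price_rows}
    return [
        (ino, name, s, price_by_key[(ino, name)])
        for ino, name, s in sales_rows
        if (ino, name) in price_by_key
    ]


joined = natural_join(sales, price)
for row in joined:
    print(row)
```

Because every item appears in both inputs, no tuple is lost and the five rows of Table 3 are obtained.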

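| CNo | INo | IName | Price |
|---|---|---|---|
| C001 | I005 | shirt | 100 |
| C001 | I007 | Dress | 50 |
| C003 | I004 | pants | 80 |
| C002 | I007 | dress | 80 |
| C001 | I008 | Jacket | 60 |
| C002 | I005 | shirt | 100 |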

Table 4.

Sample dataset “purchase.”

| CNo | INo | COUNT |
|---|---|---|
| C001 | I005 | 2 |
| C002 | I005 | 2 |

Table 5.

Frequency.

2.2 Association rule

An association rule is a relationship among the data.

Consider the following query:

Find the customers who purchase both a shirt and a dress.

<shirt ⇔ dress>

SELECT P1.CNo, P1.INo
FROM purchase P1 JOIN purchase P2 ON P1.CNo = P2.CNo
WHERE LOWER(P1.IName) = 'shirt' AND LOWER(P2.IName) = 'dress';

The output of this query is given in Table 6.

| CNo | INo |
|---|---|
| C001 | I005 |
| C002 | I005 |

Table 6.

Association.
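The association query can also be sketched outside SQL. The following Python fragment is a minimal illustration over the rows of Table 4; the helper name `customers_buying_all` is hypothetical, and item names are compared case-insensitively because Table 4 mixes "Dress" and "dress".

```python
# Rows of Table 4 (purchase): (CNo, INo, IName, Price).
purchase = [
    ("C001", "I005", "shirt", 100),
    ("C001", "I007", "Dress", 50),
    ("C003", "I004", "pants", 80),
    ("C002", "I007", "dress", 80),
    ("C001", "I008", "Jacket", 60),
    ("C002", "I005", "shirt", 100),
]


def customers_buying_all(rows, item_names):
    """Return the customers whose purchases include every named item."""
    wanted = {name.lower() for name in item_names}
    bought = {}
    for cno, _ino, iname, _price in rows:
        bought.setdefault(cno, set()).add(iname.lower())
    return sorted(c for c, items in bought.items() if wanted <= items)


both = customers_buying_all(purchase, ["shirt", "dress"])
print(both)
```

The customers returned are the ones listed in Table 6.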

2.3 Clustering

Clustering is the grouping of related data.

Consider the following query:

Group the customers who purchase a dress and a shirt.

The output of this query is given in Table 7.

| CNo | INo | IName | Price |
|---|---|---|---|
| C001 | I007 | Dress | 50 |
| | I005 | shirt | 100 |
| C002 | I007 | dress | 80 |
| | I005 | shirt | 100 |

Table 7.

Clustering.

3. Data mining using C-R cluster property

The C-R (consecutive retrieval) property [1, 3] is the consecutive retrieval of the records of a database. Suppose R = {r1, r2, …, rn} is the dataset of records and C = {C1, C2, …, Cm} is the set of clusters.

The best type of file organization on linear storage is one in which the records pertaining to each cluster are stored in consecutive locations without redundantly storing any data of R.

If such an organization of R exists for C, then C is said to have the consecutive retrieval property, or C-R cluster property, with respect to the dataset R. The C-R cluster property is then applicable to linear storage.

The C-R cluster property is a binary relation between a cluster set and a dataset.

If a cluster in the cluster set C is relevant to a record in the dataset R, the relevancy is denoted by 1; otherwise the irrelevancy is denoted by 0. Thus, the relevancy between the cluster set C and the dataset R can be represented as an (n × m) matrix, as shown in Table 8. The matrix is called the dataset-cluster incidence matrix (CIM).

Consider the dataset for customer account given in Table 9.

The dataset given in Table 9 is reorganized by sorting on sales in descending order, as shown in Table 10.

Consider the following clusters of queries:

C1 = Find the customers whose sales are greater than or equal to 100.

C2 = Find the customers whose sales are less than 100.

C3 = Find the customers whose sales are greater than or equal to the average sales.

C4 = Find the customers whose sales are less than the average sales.

The CIM is given in Table 11.

The dataset given in Table 11 is reorganized by sorting on C1 in descending order, as shown in Table 12. Thus, C1 has the C-R cluster property.

The dataset given in Table 11 is reorganized by sorting on C2 in descending order, as shown in Table 13. Thus, C2 has the C-R cluster property.

The dataset given in Table 11 is reorganized by sorting on C3 in descending order, as shown in Table 14. Thus, C3 has the C-R cluster property.

The dataset given in Table 11 is reorganized by sorting on C4 in descending order, as shown in Table 15. Thus, C4 has the C-R cluster property.

The dataset for C1 ⋈ C2 has the C-R cluster property (Table 16).

The dataset for C3 ⋈ C4 has the C-R cluster property (Table 17).

The dataset for C1 ⋈ C3 has the C-R cluster property (Table 18).

The dataset for C2 ⋈ C4 has the C-R cluster property (Table 19).

The dataset for C2 ⋈ C3 has the C-R cluster property (Table 20).

The cluster sets {C1 ⋈ C2, C3 ⋈ C4, C1 ⋈ C3, C2 ⋈ C4, C2 ⋈ C3} have the C-R cluster property. Thus, the cluster sets have the C-R cluster property with respect to the dataset R.
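Whether a cluster has the C-R cluster property under a given storage order can be checked mechanically: its column of the incidence matrix must contain one unbroken run of 1's. The sketch below is illustrative only; it takes the incidence values of Table 11 as given, and the function names are hypothetical.

```python
# Cluster incidence matrix from Table 11: record -> (C1, C2, C3, C4).
CIM = {
    "r1": (1, 0, 1, 0),
    "r2": (0, 1, 0, 1),
    "r3": (1, 0, 1, 0),
    "r4": (0, 1, 0, 1),
    "r5": (0, 1, 1, 0),
    "r6": (1, 0, 1, 0),
    "r7": (0, 1, 0, 1),
}


def has_consecutive_ones(bits):
    """True if all 1's in the sequence form a single unbroken run."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def cr_property(order, cluster_index):
    """Check one cluster column of CIM under a storage order of the records."""
    return has_consecutive_ones([CIM[r][cluster_index] for r in order])


# Storage orders of the records as they appear in Tables 10 and 12.
table10_order = ["r1", "r6", "r3", "r5", "r4", "r7", "r2"]
table12_order = ["r1", "r3", "r6", "r2", "r4", "r5", "r7"]

all_clusters_table10 = [cr_property(table10_order, k) for k in range(4)]
c3_table12 = cr_property(table12_order, 2)
print(all_clusters_table10, c3_table12)
```

With these inputs, the order of Table 10 keeps the records of all four clusters consecutive, whereas the order of Table 12, which is sorted for C1 only, does not keep the records of C3 together.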

3.1 Design of parallel C-R cluster property

The design of parallel clusters can be studied through the C-R cluster property in two ways: through a graph theoretical approach and through a response vector approach.

The C-R cluster property between the cluster set C and the dataset R can be stated in terms of the properties of vectors. The data-cluster incidences of a cluster set C with the C-R cluster property may be represented as a response vector set V. For instance, the cluster set {C1, C2, C3, C4} has the response vector set {V1 = (1,1,1,0,0,0,0), V2 = (0,0,0,1,1,1,1), V3 = (1,1,1,1,0,0,0), V4 = (0,0,0,0,1,1,1)} (Tables 21–23).

| R | C1 | C2 | … | Cm |
|---|---|---|---|---|
| r1 | 1 | 0 | … | 1 |
| r2 | 0 | 1 | … | 0 |
| … | … | … | … | … |
| rn | 1 | 1 | … | 1 |

Table 8.

Incidence matrix.

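| R | CNo | IName | Sales |
|---|---|---|---|
| r1 | 70001 | Shirt | 150 |
| r2 | 70002 | Dress | 30 |
| r3 | 70003 | Pants | 100 |
| r4 | 60001 | Dress | 50 |
| r5 | 60002 | Jacket | 75 |
| r6 | 60003 | Shirt | 120 |
| r7 | 60004 | Dress | 40 |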

Table 9.

Storage of sales.

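| R | CNo | IName | Sales |
|---|---|---|---|
| r1 | 70001 | Shirt | 150 |
| r6 | 60003 | Shirt | 120 |
| r3 | 70003 | Pants | 100 |
| r5 | 60002 | Jacket | 75 |
| r4 | 60001 | Dress | 50 |
| r7 | 60004 | Dress | 40 |
| r2 | 70002 | Dress | 30 |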

Table 10.

Reorganizing for C-R cluster.

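| R | C1 | C2 | C3 | C4 |
|---|---|---|---|---|
| r1 | 1 | 0 | 1 | 0 |
| r2 | 0 | 1 | 0 | 1 |
| r3 | 1 | 0 | 1 | 0 |
| r4 | 0 | 1 | 0 | 1 |
| r5 | 0 | 1 | 1 | 0 |
| r6 | 1 | 0 | 1 | 0 |
| r7 | 0 | 1 | 0 | 1 |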

Table 11.

Cluster incidence matrix.

| R | C1 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r6 | 1 |
| r2 | 0 |
| r4 | 0 |
| r5 | 0 |
| r7 | 0 |

Table 12.

Sorting on C1.

| R | C2 |
|---|---|
| r1 | 0 |
| r3 | 0 |
| r6 | 0 |
| r2 | 1 |
| r4 | 1 |
| r5 | 1 |
| r7 | 1 |

Table 13.

Sorting on C2.

| R | C3 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r5 | 1 |
| r6 | 1 |
| r2 | 0 |
| r4 | 0 |
| r7 | 0 |

Table 14.

Sorting on C3.

| R | C4 |
|---|---|
| r1 | 0 |
| r3 | 0 |
| r5 | 0 |
| r6 | 0 |
| r2 | 1 |
| r4 | 1 |
| r7 | 1 |

Table 15.

Sorting on C4.

| R | C1 ⋈ C2 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r6 | 1 |
| r2 | 1 |
| r4 | 1 |
| r5 | 1 |
| r7 | 1 |

Table 16.

C1 ⋈ C2.

| R | C3 ⋈ C4 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r5 | 1 |
| r6 | 1 |
| r2 | 1 |
| r4 | 1 |
| r7 | 1 |

Table 17.

C3 ⋈ C4.

| R | C1 ⋈ C3 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r6 | 1 |
| r2 | 1 |
| r4 | 0 |
| r5 | 0 |
| r7 | 0 |

Table 18.

C1 ⋈ C3.

| R | C2 ⋈ C4 |
|---|---|
| r1 | 0 |
| r3 | 0 |
| r6 | 0 |
| r2 | 1 |
| r4 | 1 |
| r5 | 1 |
| r7 | 1 |

Table 19.

C2 ⋈ C4.

| R | C2 ∪ C3 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r6 | 1 |
| r2 | 1 |
| r4 | 1 |
| r5 | 1 |
| r7 | 1 |

Table 20.

C2 ∪ C3.

| R | C1 | C2 |
|---|---|---|
| r1 | 1 | 0 |
| r3 | 1 | 0 |
| r6 | 1 | 0 |
| r2 | 0 | 1 |
| r4 | 0 | 1 |
| r5 | 0 | 1 |
| r7 | 0 | 1 |

Table 21.

{C1, C2}.

| R | C3 | C4 |
|---|---|---|
| r1 | 1 | 0 |
| r3 | 1 | 0 |
| r6 | 1 | 0 |
| r2 | 1 | 0 |
| r4 | 0 | 1 |
| r5 | 0 | 1 |
| r7 | 0 | 1 |

Table 22.

{C3, C4}.

| R | C2 | C3 |
|---|---|---|
| r1 | 0 | 1 |
| r3 | 0 | 1 |
| r6 | 0 | 1 |
| r2 | 1 | 1 |
| r4 | 1 | 0 |
| r5 | 1 | 0 |
| r7 | 1 | 0 |

Table 23.

{C2, C3}.

For instance, the response vector of the cluster C1 is given by the column vector (1,1,1,0,0,0,0).

Suppose Ci and Cj are two clusters with response vectors Vi and Vj. If the intersection Vi ∩ Vj = Ф, then the cluster set {Ci, Cj} has the parallel cluster property. Consider the vectors V1 and V2 of C1 and C2. The intersection V1 ∩ V2 = Ф, so the cluster set {C1, C2} has the parallel cluster property. Similarly, the cluster set {C3, C4} has the parallel cluster property. The cluster set {C2, C3} does not have the parallel cluster property because V2 ∩ V3 ≠ Ф: r2 depends on both C2 and C3.
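The response vector test can be sketched as follows. The vectors are the ones listed for {C1, C2, C3, C4} in this section (with V3 taken as a seven-entry vector), and the function name `is_parallel` is a hypothetical helper introduced only for this illustration.

```python
# Response vectors of the four clusters, one entry per stored record.
V = {
    "C1": (1, 1, 1, 0, 0, 0, 0),
    "C2": (0, 0, 0, 1, 1, 1, 1),
    "C3": (1, 1, 1, 1, 0, 0, 0),
    "C4": (0, 0, 0, 0, 1, 1, 1),
}


def is_parallel(vi, vj):
    """Two clusters are parallel when no position is 1 in both vectors,
    i.e., when the intersection of the two response vectors is empty."""
    return not any(a and b for a, b in zip(vi, vj))


print(is_parallel(V["C1"], V["C2"]))
print(is_parallel(V["C3"], V["C4"]))
print(is_parallel(V["C2"], V["C3"]))
```

Clusters that pass this test share no record, so they can be retrieved independently of each other.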

3.2 Visual design for parallel cluster

The C-R cluster property can also be studied with a graphical approach, which is useful for designing parallel cluster processing (PCP).

Suppose Vi is a vertex of the RICM of C. The graph G(C) is defined by the vertices Vi, i = 1, 2, …, n, and two consecutive vertices have an edge Eij associated with the interval Ii = {Vi, Vi+1}, i = 1, …, n−1.

If G(C) has the C-R cluster property, the vertices of G(C) have consecutive 1’s or 0’s.

Consider the cluster set {C1, C2, C3, C4}. G(C1) has the vertices (1,1,1,0,0,0,0), G(C2) has the vertices (0,0,0,1,1,1,1), G(C3) has the vertices (1,1,1,1,0,0,0), and G(C4) has the vertices (0,0,0,0,1,1,1).

The parallel cluster property exists if G(Ci) ∩G(Cj)=Ф.

For instance, consider the G(C1) and G(C2). G(C1) ∩G(C2)=Ф, so that the cluster set {C1, C2} has parallel cluster property. The graphical representation is shown in Figure 1.

Figure 1.

{C1, C2}.

Similarly, the cluster set {C3, C4} has the parallel cluster property (PCP). The cluster set {C2, C3} has no PCP because G(C2) ∩ G(C3) ≠ Ф.

The graphs G(C1) and G(C2), with G(C1) ∩ G(C2) = Ф, have the consecutive cluster property.

The graphs G(C3) and G(C4), with G(C3) ∩ G(C4) = Ф, have the consecutive cluster property. The graphical representation is shown in Figure 2.

Figure 2.

{C3, C4}.

The graphs G(C2) and G(C3), with G(C2) ∩ G(C3) ≠ Ф, do not have the consecutive cluster property. The graphical representation is shown in Figure 3.

Figure 3.

{C2, C3}.

3.3 Parallel cluster design through genetic approach

Genetic algorithms (GAs) are inspired by Darwin's theory of natural selection [6]. GAs are used to learn and to optimize a problem [7]. There are four evolutionary processes:

  • Selection

  • Reproduction

  • Mutation

  • Competition

Consider the following crossover with two cuts:

Parent #1 00001111

Parent #2 11110000

Parent #1 and parent #2 match with crossover.

The C-R cluster property can be studied through a genetic approach. This study helps in designing parallel cluster processing (PCP).

Definition: The gene G of a cluster, G(C), is defined as its incidence sequence.

Suppose G(C1) is the parent and G(C2) is the child genome of the cluster incidence for C1 and C2.

Suppose G(C1) is (1,1,1,0,0,0,0) and G(C2) is (0,0,0,1,1,1,1).

The parallel cluster property may be designed using the genetic approach with the C-R cluster property.

Suppose C is a cluster set, R is a dataset, and G(C) is a genetic set.

The parallel cluster property exists if G(Ci) and G(Cj) match with crossover.

For instance,

G(C1) = 1110000

G(C2) = 0001111

G(C1) and G(C2) match with crossover.

The cluster set {C1, C2} has the parallel cluster property.

Similarly, the cluster set {C3, C4} has the parallel cluster property. The cluster set {C2, C3} has no PCP because G(C2) and G(C3) do not match with crossover.

3.4 Parallel cluster design through cluster analysis

Clustering is grouping the particular data according to their properties, and sample clusters C1 and C2 are given in Tables 24 and 25, respectively.

| R | C1 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r6 | 1 |

Table 24.

Cluster C1.

| R | C2 |
|---|---|
| r2 | 1 |
| r4 | 1 |
| r5 | 1 |
| r7 | 1 |

Table 25.

Cluster C2.

Thus, C1 and C2 have the consecutive parallel cluster property. The clusters C3 and C4 are given in Tables 26 and 27.

| R | C3 |
|---|---|
| r1 | 1 |
| r3 | 1 |
| r5 | 1 |
| r6 | 1 |

Table 26.

Cluster C3.

| R | C4 |
|---|---|
| r2 | 1 |
| r4 | 1 |
| r7 | 1 |

Table 27.

Cluster C4.

Thus, C3 and C4 have the consecutive parallel cluster property. C2 and C3 do not have the consecutive parallel cluster property because r2 is common to both.

4. Design of retrieval of cluster using blackboard system

Retrieval of clusters from a blackboard system [8] is the direct retrieval of data sources. Normally, when a query is processed, the entire database has to be brought into main memory, but in the blackboard architecture the data item is sourced directly from the blackboard structure. To retrieve the information for a query, the data item is retrieved directly from the blackboard, which contains the data item sources. A hash function may be used to store the data item set in the blackboard.

The blackboard system may be constructed with a data structure for the data item sources.

Consider the dataset account (AC-No, AC-Name, AC-Balance).

Here AC-No is the key of the dataset.

Each data item is a data source that is mapped by the hash function h(x).

These data items are stored in the blackboard structure.

When a transaction is processed, there is no need to bring the entire database into main memory. It is sufficient to retrieve the particular data item of the particular transaction from the blackboard system (Figure 4).

Figure 4.

Blackboard system.
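The idea of retrieving a data item directly through h(x) can be sketched as a small hash-addressed store. Everything concrete in this sketch is an assumption made for illustration: the slot count, the modulo hash function, and the sample account records do not come from the chapter.

```python
NUM_SLOTS = 8  # hypothetical size of the blackboard


def h(ac_no):
    """Hypothetical hash function mapping the key AC-No to a slot."""
    return ac_no % NUM_SLOTS


# The blackboard: one bucket of (AC-No, AC-Name, AC-Balance) records per slot.
blackboard = [[] for _ in range(NUM_SLOTS)]


def store(ac_no, ac_name, ac_balance):
    """Place a data item in the slot selected by h(AC-No)."""
    blackboard[h(ac_no)].append((ac_no, ac_name, ac_balance))


def retrieve(ac_no):
    """Look only in slot h(AC-No); the other slots are never scanned."""
    for record in blackboard[h(ac_no)]:
        if record[0] == ac_no:
            return record
    return None


# Illustrative records only.
store(70001, "Account A", 150)
store(60003, "Account B", 120)
print(retrieve(70001))
```

Only the bucket selected by the hash function is inspected, which is the sense in which a transaction does not need the whole dataset in main memory.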

An advantage of the blackboard architecture is that it is highly secure for blockchain transactions. Blockchain technology has no third-party interference.

5. Fuzzy data mining

Sometimes, data mining is unable to deal with incomplete databases and unable to combine data and reasoning. Fuzzy data mining [6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] combines data and reasoning by defining the data with fuzziness. The fuzzy MapReduce algorithms have two functions: mapping reads the fuzzy datasets, and reducing writes the results after the operations.

Definition: Given some universe of discourse X, a fuzzy set is defined as a pair {t, μd(t)}, where t is a tuple, d is a domain, and the membership function μd takes values on the unit interval [0,1], i.e., μd(t) ➔ [0,1], where ti Є X are tuples (Table 28).

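| R1 | d1 | d2 | … | dm | μ |
|---|---|---|---|---|---|
| t1 | a11 | a12 | … | a1m | μd(t1) |
| t2 | a21 | a22 | … | a2m | μd(t2) |
| … | … | … | … | … | … |
| tn | an1 | an2 | … | anm | μd(tn) |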

Table 28.

Fuzzy dataset.

The sale is defined intermittently with fuzziness (Tables 29–32).

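| CNo | INo | IName | Demand |
|---|---|---|---|
| C001 | I005 | shirt | 0.9 |
| C001 | I007 | Dress | 0.65 |
| C003 | I004 | pants | 0.85 |
| C002 | I007 | dress | 0.6 |
| C001 | I008 | Jacket | 0.65 |
| C002 | I005 | shirt | 0.9 |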

Table 29.

Fuzzy demand.

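| CNo | INo | IName | Negation of price |
|---|---|---|---|
| C001 | I005 | shirt | 0.3 |
| C001 | I007 | Dress | 0.5 |
| C003 | I004 | pants | 0.4 |
| C002 | I007 | dress | 0.5 |
| C001 | I008 | Jacket | 0.4 |
| C002 | I005 | shirt | 0.3 |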

Table 30.

Negation of price.

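| CNo | INo | IName | Sales U price |
|---|---|---|---|
| C001 | I005 | Shirt | 0.8 |
| C001 | I007 | Dress | 0.5 |
| C003 | I004 | Pants | 0.6 |
| C002 | I007 | Dress | 0.5 |
| C001 | I008 | Jacket | 0.6 |
| C002 | I005 | Shirt | 0.7 |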

Table 31.

Sales U price.

| INo | IName | Sales |
|---|---|---|
| I005 | Shirt | 0.8 |
| I007 | Dress | 0.5 |
| I004 | Pants | 0.6 |
| I007 | Dress | 0.5 |
| I008 | Jacket | 0.6 |

Table 32.

Items-sales.

μDemand(x) = 0.9/90 + 0.85/80 + 0.8/75 + 0.65/70

or

Fuzziness may be defined with the function

μDemand(x) = (1 + (Demand − 100)/100)^(−1), if Demand ≤ 100

μDemand(x) = 1, if Demand > 100

  1. Negation

  2. Union

Union of I005 = max{0.8, 0.7} = 0.8
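The union step can be written in a MapReduce style: the map phase emits one (item, membership) pair per tuple, and the reduce phase merges the pairs that share an item with the fuzzy union (max). The sketch below is only an illustration over the rows of Table 31; the function names `map_phase` and `reduce_phase` are hypothetical, and a real MapReduce framework would distribute these two phases across workers.

```python
from collections import defaultdict

# Rows of Table 31: (CNo, INo, IName, membership).
table31 = [
    ("C001", "I005", "Shirt", 0.8),
    ("C001", "I007", "Dress", 0.5),
    ("C003", "I004", "Pants", 0.6),
    ("C002", "I007", "Dress", 0.5),
    ("C001", "I008", "Jacket", 0.6),
    ("C002", "I005", "Shirt", 0.7),
]


def map_phase(rows):
    """Emit one (key, value) pair per tuple: key = item, value = membership."""
    for _cno, ino, iname, mu in rows:
        yield (ino, iname), mu


def reduce_phase(pairs):
    """Merge the memberships that share a key with the fuzzy union (max)."""
    grouped = defaultdict(list)
    for key, mu in pairs:
        grouped[key].append(mu)
    return {key: max(values) for key, values in grouped.items()}


items_sales = reduce_phase(map_phase(table31))
for (ino, iname), mu in items_sales.items():
    print(ino, iname, mu)
```

For item I005 the reducer receives 0.8 and 0.7 and keeps 0.8, which is the union computed above.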

The fuzzy semijoin is given by sales ⋈ items-sales, as shown in Table 33.

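| CNo | INo | IName | Sales |
|---|---|---|---|
| C001 | I005 | shirt | 0.8 |
| C001 | I007 | Dress | 0.5 |
| C003 | I004 | pants | 0.6 |
| C002 | I007 | dress | 0.5 |
| C001 | I008 | Jacket | 0.7 |
| C002 | I005 | shirt | 0.7 |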

Table 33.

Fuzzy semijoin.

The fuzzy k-means clustering algorithm (FKCA) is an optimization algorithm for fuzzy datasets (Table 34).

| CNo | INo | IName | Sales |
|---|---|---|---|
| C001 | I005⇔I007 | Shirt⇔Dress | 0.4 |
| C003 | I004 | pants | 0.6 |
| C002 | I007⇔I005 | Dress⇔shirt | 0.5 |

Table 34.

Association.

The fuzzy k-means clustering algorithm (FKCA), using FAD, is given by

    best = R
    k-means = best
    for i in range(1, n)
        for j in range(1, n)
            ti = fuzzy union(ri.R U rj.R), if ri.R = rj.R
    C reduce best
    k-means < best
    return

The fuzzy multivalued association property of data mining may be defined with a multivalued fuzzy functional dependency.

The fuzzy multivalued association (FMVD) is a multivalued dependency (MVD). The fuzzy association multivalued dependency (FAMVD) may be defined by using the Mamdani fuzzy conditional inference [3].

If EQ(t1(X), t2(X), t3(X)) then EQ(t1(Y), t2(Y)) or EQ(t2(Y), t3(Y)) or EQ(t1(Y), t3(Y))

= min{EQ(t1(Y), t2(Y)), EQ(t2(Y), t3(Y)), EQ(t1(Y), t3(Y))}

= min{min(t1(Y), t2(Y)), min(t2(Y), t3(Y)), min(t1(Y), t3(Y))}

= min(t1(Y), t2(Y), t3(Y))

The fuzzy k-means clustering algorithm (FKCA) is an optimization algorithm for fuzzy datasets (Table 35).

| CNo | INo | IName | Sales |
|---|---|---|---|
| C001 | I005⇔I007⇔I008 | Shirt⇔Dress⇔Jacket | 0.8, 0.4, 0.5 |
| C003 | I004 | Pants | 0.6 |
| C002 | I007⇔I005 | Dress⇔shirt | 0.5, 0.7 |

Table 35.

Association using AFMVD.

The fuzzy k-means clustering algorithm (FKCA), using FAMVD, is given by

    best = R
    k-means = best
    for i in range(1, n)
        for j in range(1, n)
            for k in range(1, n)
                ti = fuzzy union(ri.R U rj.R U rk.R), if ri.R = rj.R = rk.R
    C reduce best
    k-means < best
    return

For two fuzzy datasets R and S, the fuzzy k-means clustering algorithm (FKCA) is given by

    k-means = n
    for i in range(1, n)
        for j in range(1, n)
            ti = fuzzy union(ri.R U sj.S), if ri.R = sj.S
    C = best
    k-means < best
    return

For example, the sorted fuzzy sets of Table 5 are given in Table 36.

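| CNo | INo | IName | Sales ⋈ Price ⋈ Demand |
|---|---|---|---|
| C001 | I005 | Shirt | 0.8 |
| C001 | I007 | Dress | 0.5 |
| C003 | I004 | Pants | 0.6 |
| C002 | I007 | Dress | 0.5 |
| C001 | I008 | Jacket | 0.6 |
| C002 | I005 | Shirt | 0.7 |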

Table 36.

Fuzzy join.

6. Fuzzy security for data mining

Security methods such as encryption and decryption are used cryptographically. These security methods are not fully secure. The fuzzy security method is based on the mind, and others cannot decrypt it. Zadeh [16] discussed web intelligence, world knowledge, and fuzzy logic. Current programming is unable to deal with question answering that contains approximate information, for instance, “Which is the best car?” Fuzzy data mining with security is a knowledge discovery process with associated data.

Fuzzy relational databases may be defined with fuzzy set theory. Fuzzy set theory is another approach to approximate information. Security may be provided by approximate information.

Definition: Given some universe of discourse X, a relational database R1 is defined as a pair {t, d}, where t is a tuple and d is a domain (Table 37).

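| R1 | d1 | d2 | … | dm |
|---|---|---|---|---|
| t1 | a11 | a12 | … | a1m |
| t2 | a21 | a22 | … | a2m |
| … | … | … | … | … |
| tn | an1 | an2 | … | anm |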

Table 37.

Relational database.

Price = 0.4/50 + 0.5/60 + 0.7/80 + 0.8/100

The fuzzy security database of price is given in Table 38.

| INo | IName | Price |
|---|---|---|
| I005 | Benz | 0.8 |
| I007 | Suzuki | 0.4 |
| I004 | Toyota | 0.7 |
| I008 | Skoda | 0.5 |
| I009 | Benz | 0.8 |

Table 38.

Price fuzzy set.

Demand = 0.4/50+0.5/60+0.7/80+0.8/100

The fuzzy security database of demand is given in Table 39.

| INo | IName | Demand | μ |
|---|---|---|---|
| I005 | Benz | 80 | 0.7 |
| I007 | Suzuki | 60 | 0.5 |
| I004 | Toyota | 100 | 0.8 |
| I008 | Skoda | 50 | 0.4 |
| I009 | Benz | 80 | 0.7 |

Table 39.

Demand fuzzy set.
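The mapping from crisp values to membership grades can be sketched as a simple lookup. The fragment below is an illustration that represents the discrete fuzzy set Demand as a Python dictionary; the function name `fuzzify` is hypothetical.

```python
# Demand = 0.4/50 + 0.5/60 + 0.7/80 + 0.8/100, written as value -> membership.
demand_set = {50: 0.4, 60: 0.5, 80: 0.7, 100: 0.8}

# Crisp demand values of Table 39: (INo, IName, Demand).
demand = [
    ("I005", "Benz", 80),
    ("I007", "Suzuki", 60),
    ("I004", "Toyota", 100),
    ("I008", "Skoda", 50),
    ("I009", "Benz", 80),
]


def fuzzify(rows, fuzzy_set):
    """Replace each crisp value by its membership grade.

    The crisp value itself is dropped from the result, so only the
    grade would have to be disclosed.
    """
    return [(ino, name, fuzzy_set[value]) for ino, name, value in rows]


fuzzy_demand = fuzzify(demand, demand_set)
for row in fuzzy_demand:
    print(row)
```

The grades obtained this way are the μ column of Table 39.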

The lossless natural join of demand and price is union and is given in Table 40.

Table 40.

Lossless join.

The actual data has to be disclosed for analysis on the web. There is no need to disclose the data if the data is inherently defined with fuzziness.

For instance, “car with fuzziness > 0.7” may be defined as follows. The XML data may be defined as

    <CAR>
      <COMPANY>
        <NAME>Benz</NAME>
        <FUZZ>0.8</FUZZ>
      </COMPANY>
      <COMPANY>
        <NAME>Suzuki</NAME>
        <FUZZ>0.9</FUZZ>
      </COMPANY>
      <COMPANY>
        <NAME>Toyota</NAME>
        <FUZZ>0.6</FUZZ>
      </COMPANY>
      <COMPANY>
        <NAME>Skoda</NAME>
        <FUZZ>0.7</FUZZ>
      </COMPANY>
    </CAR>

An XQuery using the projection operator for cars in demand may be given as

    declare default element namespace "http://www.automoble.com/company";
    <CAR> {
      for $company in /CAR/COMPANY
      where xs:decimal($company/FUZZ) > 0.7
      return <COMPANY> { $company/NAME, $company/FUZZ } </COMPANY>
    } </CAR>
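The same selection can be sketched in Python with the standard library XML parser. This is an illustrative equivalent of the query, not the chapter's own implementation; the XML string repeats the car data in well-formed form without a namespace, and the function name `companies_above` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The car data from this section, written as well-formed XML.
CAR_XML = """
<CAR>
  <COMPANY><NAME>Benz</NAME><FUZZ>0.8</FUZZ></COMPANY>
  <COMPANY><NAME>Suzuki</NAME><FUZZ>0.9</FUZZ></COMPANY>
  <COMPANY><NAME>Toyota</NAME><FUZZ>0.6</FUZZ></COMPANY>
  <COMPANY><NAME>Skoda</NAME><FUZZ>0.7</FUZZ></COMPANY>
</CAR>
"""


def companies_above(xml_text, threshold):
    """Return (name, fuzziness) for companies whose FUZZ exceeds the threshold."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for company in root.findall("COMPANY"):
        name = company.findtext("NAME").strip()
        fuzz = float(company.findtext("FUZZ"))
        if fuzz > threshold:
            result.append((name, fuzz))
    return result


selected = companies_above(CAR_XML, 0.7)
print(selected)
```

Only the fuzziness grades are read here; no underlying crisp demand figures are needed to answer the query.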

Fuzzy reasoning may be applied to fuzzy data mining.

Consider the fuzzy database for “more demand,” decomposed into demand and price (Tables 41 and 42).

| INo | IName | Demand |
|---|---|---|
| I005 | Benz | 0.8 |
| I007 | Suzuki | 0.9 |
| I004 | Toyota | 0.6 |
| I008 | Skoda | 0.7 |
| I009 | Benz | 0.9 |

Table 41.

Demand.

| INo | IName | Price |
|---|---|---|
| I005 | Benz | 0.7 |
| I007 | Suzuki | 0.4 |
| I004 | Toyota | 0.6 |
| I008 | Skoda | 0.5 |
| I009 | Benz | 0.7 |

Table 42.

Price.

The fuzzy reasoning [14] may be performed using the Zadeh fuzzy conditional inference.

The Zadeh [14] fuzzy conditional inference is given by

if x is P1 and x is P2 … and x is Pn then x is Q = min{1, 1 − min(μP1(x), μP2(x), …, μPn(x)) + μQ(x)}

The Mamdani [7] fuzzy conditional inference is given by

if x is P1 and x is P2 … and x is Pn then x is Q = min{μP1(x), μP2(x), …, μPn(x), μQ(x)}

The Reddy [12] fuzzy conditional inference is given by

if x is P1 and x is P2 … and x is Pn then x is Q = min(μP1(x), μP2(x), …, μPn(x))

If x is Demand then x is Price

x is more Demand

------------------------------------

x is more Demand o (Demand ➔ Price)

x is more Demand o min{1, 1 − Demand + Price} (Zadeh)

x is more Demand o min{Demand, Price} (Mamdani)

x is more Demand o {Demand} (Reddy)

The inference “if x is more demand, then x is more price” is given in Tables 43 and 44.

| INo | IName | More demand |
|---|---|---|
| I005 | Benz | 0.89 |
| I007 | Suzuki | 0.95 |
| I004 | Toyota | 0.77 |
| I008 | Skoda | 0.84 |
| I009 | Benz | 0.95 |

Table 43.

More demand.

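| INo | IName | Zadeh | Mamdani | Reddy |
|---|---|---|---|---|
| I005 | Benz | 0.9 | 0.7 | 0.7 |
| I007 | Suzuki | 0.5 | 0.4 | 0.4 |
| I004 | Toyota | 1.0 | 0.6 | 0.6 |
| I008 | Skoda | 0.8 | 0.5 | 0.5 |
| I009 | Benz | 0.8 | 0.7 | 0.7 |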

Table 44.

Demand➔Price.
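The Zadeh and Mamdani columns of Table 44 can be recomputed from the demand and price grades of Tables 41 and 42. The following is a minimal sketch for the single-antecedent case; the function names are hypothetical, and results are rounded to two decimals to avoid floating-point noise.

```python
# Membership grades from Table 41 (demand) and Table 42 (price).
demand = {"I005": 0.8, "I007": 0.9, "I004": 0.6, "I008": 0.7, "I009": 0.9}
price = {"I005": 0.7, "I007": 0.4, "I004": 0.6, "I008": 0.5, "I009": 0.7}


def zadeh(p, q):
    """Zadeh conditional inference: min{1, 1 - p + q}."""
    return min(1.0, 1.0 - p + q)


def mamdani(p, q):
    """Mamdani conditional inference: min{p, q}."""
    return min(p, q)


# item -> (Zadeh grade, Mamdani grade) for the rule Demand -> Price.
table44 = {
    ino: (
        round(zadeh(demand[ino], price[ino]), 2),
        round(mamdani(demand[ino], price[ino]), 2),
    )
    for ino in demand
}
for ino, grades in table44.items():
    print(ino, grades)
```

For example, item I005 has demand 0.8 and price 0.7, so the Zadeh grade is min{1, 1 − 0.8 + 0.7} = 0.9 and the Mamdani grade is min{0.8, 0.7} = 0.7.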

The inference for price is given in Table 45.

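| INo | IName | Zadeh | Mamdani | Reddy |
|---|---|---|---|---|
| I005 | Benz | 0.89 | 0.7 | 0.7 |
| I007 | Suzuki | 0.5 | 0.4 | 0.4 |
| I004 | Toyota | 0.77 | 0.6 | 0.6 |
| I008 | Skoda | 0.8 | 0.5 | 0.5 |
| I009 | Benz | 0.8 | 0.7 | 0.7 |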

Table 45.

Inference price.
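The Zadeh column of Table 45 can be reproduced under two assumptions that the text does not state explicitly and that are inferred here from the tabulated values: the hedge “more” is taken as the square root of the demand grade (which matches Table 43 to two decimals), and the composition “o” is taken as the minimum. The function names are hypothetical.

```python
import math

# Membership grades from Table 41 (demand) and Table 42 (price).
demand = {"I005": 0.8, "I007": 0.9, "I004": 0.6, "I008": 0.7, "I009": 0.9}
price = {"I005": 0.7, "I007": 0.4, "I004": 0.6, "I008": 0.5, "I009": 0.7}


def more(mu):
    """Assumed hedge: square root of the membership grade."""
    return math.sqrt(mu)


def infer_price(mu_demand, mu_price):
    """'more Demand' composed with the Zadeh implication Demand -> Price."""
    implication = min(1.0, 1.0 - mu_demand + mu_price)
    return min(more(mu_demand), implication)  # composition assumed to be min


inferred = {
    ino: round(infer_price(demand[ino], price[ino]), 2) for ino in demand
}
for ino, mu in inferred.items():
    print(ino, mu)
```

Under these assumptions the grade for I005 is min{√0.8, 0.9} ≈ 0.89, which agrees with the first row of the table.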

So the business administrator can decide whether or not to increase the price.

7. Web intelligence and fuzzy data mining

Let C and D be fuzzy rough sets (Tables 46–51).

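| | d1 | d2 | … | dm | μ |
|---|---|---|---|---|---|
| t1 | a11 | a12 | … | a1m | μd(t1) |
| t2 | a21 | a22 | … | a2m | μd(t2) |
| … | … | … | … | … | … |
| tn | an1 | an2 | … | anm | μd(tn) |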

Table 46.

Fuzzy database.

| INo | IName | Price | μ |
|---|---|---|---|
| I005 | Shirt | 100 | 0.8 |
| I007 | Dress | 50 | 0.4 |
| I004 | Pants | 80 | 0.7 |
| I008 | Jacket | 60 | 0.5 |
| I009 | Skirt | 100 | 0.8 |

Table 47.

Price database.

Table 48.

Intersect of demand and price.

| INo | IName | Demand | μ |
|---|---|---|---|
| I005 | Shirt | 80 | 0.8 |
| I007 | Dress | 60 | 0.5 |
| I004 | Pants | 100 | 0.8 |
| I008 | Jacket | 50 | 0.5 |
| I009 | Skirt | 80 | 0.8 |

Table 49.

Lossless decomposition of demand.

| INo | IName | Price | μ |
|---|---|---|---|
| I005 | Shirt | 100 | 0.8 |
| I007 | Dress | 50 | 0.5 |
| I004 | Pants | 80 | 0.8 |
| I008 | Jacket | 60 | 0.5 |
| I009 | Skirt | 100 | 0.8 |

Table 50.

Lossless decomposition of price.

| Company | μ |
|---|---|
| IBM | 0.8 |
| Microsoft | 0.9 |
| Google | 0.75 |

Table 51.

Best software company.

The operations on type-2 fuzzy rough sets are given as

1 − C = 1 − μC(x) (Negation)

C ∨ D = max{μC(x), μD(x)} (Union)

C ∧ D = min{μC(x), μD(x)} (Intersection)
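The three operations can be sketched directly on membership grades. The example below uses the μ columns of Table 47 (price) and Table 49 (demand) as the sets C and D purely for illustration; the function names are hypothetical, and the negation is rounded to two decimals to avoid floating-point noise.

```python
# Membership grades taken from Table 47 (price) and Table 49 (demand).
price = {"I005": 0.8, "I007": 0.4, "I004": 0.7, "I008": 0.5, "I009": 0.8}
demand = {"I005": 0.8, "I007": 0.5, "I004": 0.8, "I008": 0.5, "I009": 0.8}


def negation(c):
    """1 - C: complement of each membership grade."""
    return {x: round(1 - mu, 2) for x, mu in c.items()}


def union(c, d):
    """C v D: element-wise maximum."""
    return {x: max(c[x], d[x]) for x in c}


def intersection(c, d):
    """C ^ D: element-wise minimum."""
    return {x: min(c[x], d[x]) for x in c}


print(negation(price))
print(union(price, demand))
print(intersection(price, demand))
```

For item I007, for instance, the union is max{0.4, 0.5} = 0.5 and the intersection is min{0.4, 0.5} = 0.4.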

The XML data may be defined as

    <SOFTWARE>
      <COMPANY>
        <NAME>IBM</NAME>
        <FUZZ>0.8</FUZZ>
      </COMPANY>
      <COMPANY>
        <NAME>Microsoft</NAME>
        <FUZZ>0.9</FUZZ>
      </COMPANY>
      <COMPANY>
        <NAME>Google</NAME>
        <FUZZ>0.75</FUZZ>
      </COMPANY>
    </SOFTWARE>

An XQuery using the projection operator for the best software company may be given as

    declare default element namespace "http://www.software.cm/company";
    <SOFTWARE> {
      let $max := max(/SOFTWARE/COMPANY/FUZZ)
      for $company in /SOFTWARE/COMPANY
      where $company/FUZZ = $max
      return <COMPANY> { $company/NAME, $company/FUZZ } </COMPANY>
    } </SOFTWARE>

Similarly, the following problem may be considered for web programming.

Let P be a fuzzy proposition in a question-answering system.

P = Which is the tallest buildings city?

The answer is “x is the tallest buildings city.”

For instance, the fuzzy set “most tallest buildings city” may be defined as

most tallest buildings city = 0.6/Hong Kong + 0.6/Dubai + 0.7/New York + 0.8/Taipei + 0.5/Tokyo

For the above question, the output is “tallest buildings city” = 0.8/Taipei by using projection.

The fuzzy algorithm using FUZZYALGOL is given as follows:

BEGIN

Variable most tallest buildings City = 0.6/Hong Kong + 0.6/Dubai + 0.7/New York + 0.8/Taipei + 0.5/Tokyo

most tallest buildings City = 0.8/Taipei

Return URL, fuzziness = Taipei, 0.8

END
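The projection step of the algorithm above amounts to selecting the element with the highest membership grade. A minimal Python sketch, assuming the fuzzy set is held as a dictionary and using a hypothetical helper name `project_max`:

```python
# The fuzzy set from the example, as element -> membership grade.
tallest_buildings_city = {
    "Hong Kong": 0.6,
    "Dubai": 0.6,
    "New York": 0.7,
    "Taipei": 0.8,
    "Tokyo": 0.5,
}


def project_max(fuzzy_set):
    """Return the (element, grade) pair with the highest membership grade."""
    return max(fuzzy_set.items(), key=lambda pair: pair[1])


answer = project_max(tallest_buildings_city)
print(answer)
```

If several elements shared the highest grade, `max` would return the first one encountered; a fuller implementation might return all of them.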

The problem is to find the “most pdf of type-2 in fuzzy sets.”

The fuzzy algorithm is

Go to most visited fuzzy set sites

Go to most visited fuzzy sets type-2

Go to most visited fuzzy sets type-2 pdf

The web programming gets “the most visited fuzzy sets” and puts them in order.

The web programming then gets “the most visited type-2 in fuzzy sets.”

The web programming gets “the most visited pdf in type-2.”

8. Conclusion

Data mining may have to deal with incomplete information. Bayesian theory needs exponential complexity to combine data. Defining datasets with fuzziness inherently reduces this complexity. In this chapter, fuzzy MapReduce algorithms are studied based on functional dependencies. The fuzzy k-means MapReduce algorithm is studied using fuzzy functional dependencies. Data mining and fuzzy data mining are discussed. A brief overview of the work on business intelligence is given as an example.

Most current web programming studies are unable to deal with incomplete information. In this chapter, a web intelligence system is discussed for fuzzy data mining. In addition, the fuzzy algorithmic language is discussed for designing fuzzy algorithms for data mining. Some examples are given for web intelligence and fuzzy data mining.

Acknowledgments

The author thanks the reviewer and editor for revision and review suggestions made in this work.

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Poli Venkata Subba Reddy (January 20th 2021). Data Mining and Fuzzy Data Mining Using MapReduce Algorithms, Data Mining - Methods, Applications and Systems, Derya Birant, IntechOpen, DOI: 10.5772/intechopen.92232. Available from:
