Open access peer-reviewed chapter

A Review of Several Privacy Violation Measures for Large Networks under Active Attacks

By Tanima Chatterjee, Nasim Mobasheri and Bhaskar DasGupta

Submitted: August 6th 2019Reviewed: December 19th 2019Published: January 24th 2020

DOI: 10.5772/intechopen.90909

Downloaded: 66

Abstract

It is by now a standard practice to use the concepts and terminologies of network science to analyze social networks of interconnections between people such as Facebook, Twitter and LinkedIn. The powers and implications of such social network analysis are indeed indisputable; for example, such analysis may uncover previously unknown knowledge on community-based involvements, media usages and individual engagements. However, all these benefits are not necessarily cost-free since a malicious individual could compromise privacy of users of these social networks for harmful purposes that may result in the disclosure of sensitive data that may be linked to its users. A natural way to avoid this consists of an “anonymization process” of the relevant social network. However, since such anonymization processes may not always succeed, an important research goal is to quantify and measure how much privacy a given social network can achieve. Toward this goal, some recent research works have aimed at evaluating the resistance of a social network against active privacy-violating attacks by introducing and studying a new and meaningful privacy measure for social networks. In this chapter, we review both theoretical and empirical aspects of such privacy violation measures of large networks under active attacks.

Keywords

  • social networks
  • privacy measure
  • active attacks
  • (k
  • ℓ)-anonymity
  • algorithmic complexity

1. Introduction

In recent years, social networks have become an indisputable part of people’s lives. The emergence of such networks has altered how we interact with the world. A given individual’s day-to-day activities like media consumption, job hunting and social interaction have changed, along with how businesses and other beneficial entities interact with them through marketing, advertising, and information diffusion. This has led to an unstoppable race of collecting information and interaction from social networks by researchers, governments, and business entities for various purposes. From a research point of view, social networks and their interaction mechanisms provide valuable insight in many fields of study, such as sociology, psychology, advertising, and recommendation systems. It is only natural that the information contained in these networks and the value they hold have been and will be targeted by bad actors for malicious activities. The importance of these networks and the value of information that can be retrieved from them have led social network researchers to take a closer look at methods to combat such bad actors as well as formulate network measures that can provide an insight to the privacy of these networks. In this survey, we will look at one such measure known as k-anonymity [1] and will discuss some theoretical and empirical results regarding this measure.

1.1 Overview of the paper

Given the irrefutable importance of social networks in our daily lives and the ever increasing risk of compromising valuable personal data through privacy attacks against these networks, it is preferable to know how secure a given social network is against privacy attacks. This necessitates a deeper look into the types of privacy attacks and how to cope with them. There is an extensive literature on privacy preserving computational models in variety of application areas such as multi-party communications or distributed computing settings [2, 3, 4, 5, 6]. In this chapter, we focus on a specific type of attack known as background-based active attack and one measure that reflects the resistance of any given network against such attacks. The organization of the rest of the paper is as follows:

  • In Section 2 we briefly discuss the notion of privacy in social networks and review some literature on privacy violating attacks on social networks. We also introduce the k-anonymity privacy measure and some corresponding network measurement which are the basis for this measure.

  • In Section 3 we review some basic terminologies and notations that will be used in formulation of the three problems introduced in Section 4.

  • Section 4 contains three problems that arise from theoretical investigation of the k-anonymity.

  • Section 5 contains the results of an empirical study on the resistance of real-world social networks.

  • Finally, we end this chapter with some concluding remarks in Section 6.

2. Privacy measures in social networks

We begin by discussing the mathematical structure that fit the most to represent social networks. A social network is often portrayed as a graph [7, 8] G=VEwhere V is a set of nodes representing the social members, and E is the set of edges portraying the relationship among these members. Both nodes and edges may have extra attributes, such as weights, that provide extra information about the nature of these social bonds (e.g., trust or popularity); however, throughout this survey we will consider the simplest form of graphs, namely undirected and unweighted graphs, to model our social networks.

As we discussed in the previous section, the information that the social networks provide are invaluable. Due to the very nature of many social network applications, the identity of the members or the nature of relationship between members is quite sensitive and valuable. Thus, when releasing a social network we want to remove any attributes that may help identify these kinds of sensitive data. Assuming all members and their relationships are of high sensitivity, preventing identity disclosure or link disclosure becomes an important task. One popular method to prevent such disclosures is anonymization. In an anonymization process, we publish the network without identifying the corresponding nodes or potentially identifiable attributes. Even after anonymizing the network, we will still be releasing many informative attributes encoded by the network structure; for example, attributes such as node degree, connectivity, or other similar graph properties can still help the adversaries in compromising the user privacies of a published network.

Adversaries usually rely on background knowledge to compromise the privacy of published anonymized social networks. For understanding the failure of current privacy preservation methods such as anonymization, we need to have a proper model for the adversary background knowledge. Although it’s challenging to have a comprehensive model of all possible types of adversary background knowledge, it is very useful to model the background knowledge via structural properties of networks such as node degrees, embedded subgraphs, node neighbors, etc. [9]. Backstrom et al. [10] were the first to introduce a category of attacks on anonymized social graphs. The models introduced in [10] are background-based attacks and are widely used in privacy analysis of social networks. The two main types of attacks are as follows.

  1. Passive attacks in which the adversary will not modify the network by injecting new nodes, but instead will use the structural knowledge to detect the location of a known node. In this type of attacks, the adversary can benefit from the fact that most nodes in real social networks often belong to a small uniquely identifiable subgraph [10]. An adversary can then build a coalition with members of such subgraphs and attempt to re-identify the subgraphs in the anonymized published network, thus compromising the privacy of neighboring nodes.

  2. Active attacks in which the adversary will choose an arbitrary set of target users, create new nodes and insert them into a social network in a way that they are connected to the target set and they form a distinguishable subgraph. After the anonymized version of the social network is published, the adversary can then use the subgraph as a fingerprint to re-identify the targeted users and compromise their privacy.

The authors in [10] also showed that it is possible to compromise the privacy of any social network of n nodes with high probability using only Olognattacker nodes. In a passive attack, adversary’s structural knowledge will give her/him a global view of the network depending on the global structure of the network. It could pose a high privacy risk if an adversary were to combine this global view with the local structural knowledge obtained using an active attack. As an example, consider the network in Figure 1. If we only have global structural knowledge, it is not possible to differentiate the nodes v3and v4(e.g. , same node degrees, etc.). However, controlling just one extra node in the graph, such as the node v1, provides local structural knowledge such as distances between nodes, and using the knowledge of the distance of v1from v3and v4(dv1,v3=1and dv1,v4=2) one can easily differentiate node v3from node v4.

Figure 1.

A simple graph G used in Section 2 to illustrate the high risk posed by combining knowledge gained by active and passive attacks.

There are several well-studied strategies for coping with active attacks on a social network [9, 11, 12] via addressing the anonymization process of the social network. However, in this chapter we will focus on a measure that evaluates how resistant a social network is against this type of privacy attack. Introduced by Trujillo-Rasua et al. [1], k-anonymity is a novel and, to the best of our knowledge, the only privacy measure examining the structural resistance of a given graph against active attacks. The k-anonymity is a measure based on metric representation of nodes, where k is a privacy threshold and l is the maximum number of attacker nodes that may be inserted in the network. It was shown in [1] that graphs satisfying k-anonymity can successfully deter adversaries controlling at most l nodes in the graph from re-identifying nodes with probability higher than 1k.

2.1 k-anonymity

The k-anonymity measure is based on a concept known as k-metric anti-dimension of graphs. To facilitate further discussions about the measure, we first introduce some notations and terminologies. For a simple connected graph G=VE, where V is set of nodes and E is set of edges, let distvi,vjdenote distance (i.e. , number of edges in a shortest path) between the nodes viand vj. Given and ordered set of nodes S=v1vtand a node uwe define the metric representation of u with respect to S as a vector du,S=distu,v1distu,vt. Metric representations of nodes are closely related to the concept of a resolving set of a graph. Inspired by the problem of identifying an intruder in a network and introduced separately by Slater [13] and by Harary and Melter [14], a resolving set of graph provides recognition of every pair of nodes in graph.

Definition 1 (resolving set). Given a graph G=VE, a subset SVis called a resolving set for G if, for each pair of nodes uvG, there exist a node xSsuch that distx,udistx,v. A smallest-cardinality resolving set is called the metric basis, and its cardinality is referred to as the metric dimension of G.

The concepts of metric representation and resolving set inspired the introduction of another network measure known as k-antiresolving set that will be used as the founding base for k-anonymity.

Definition 2 (k-antiresolving set). Given a graph G=VE, SVis called a k-antiresolving set of G if k is the largest integer such that, for every node vV\S, there exist at least k1nodes u1,u2,,uk1V\Swith the same metric representation with respect to S as v.

A k-antiresolving set of minimum cardinality is called a k-antiresolving basis, and its cardinality denotes the k-metric antidimension adimkGof G. Note that the k-antiresolving set may not exist for every k in a graph.

The (k,l)-anonymity measure is built upon the k-antiresolving set concept. Assume the adversary has gained control of a subset S of nodes in the graph G, where S is a k-antiresolving set for G. Then the adversary cannot uniquely re-identify any node based on the background knowledge (namely, the knowledge of metric representation of a node vwith respect to S) with probability higher than 1k. k-anonymity is formally defined as [1].

Definition 3 (k-anonymity). A graph G under active attack satisfies k-anonymity if k is the smallest positive integer so that the k-metric antidimension of G is less than or equal to l.

In the above definition, k is a parameter depicting the privacy threshold and l represents the maximum number of attacker nodes. It is safe to assume that number of attacker nodes l is significantly smaller than number of nodes present in the network as injecting attacker nodes or gaining control of existing nodes is difficult without being detected [15].

3. Basic terminologies and notations

For the exposition in the remainder of this chapter, we will need some notations and terminologies which we introduce here. Consider the (undirected unweighted) graph G in Figure 2. We will use this graph to illustrate the terminologies and notations that are introduced.

  • The metric representation of node viis denoted by dvi=distv1,vidistv2,vidistvn,vi.

    • For example, in Figure 2, dv1=0,1,2,3,3,2

  • The diameter of Gis the length of the longest shortest path and is denoted by diamG=maxvi,vjVdistvi,vj.

  • The open neighborhood of node viis a subset of all nodes directly connected to viand denoted byNbrvi=vjvivjE.

  • The metric representation of a node viwith respect to a subset such as SVis denoted by dvi,S.

    • For example, in Figure 2, dv1,v3v4=23.

  • We can expand the previous notation to reflect the metric representation of a subset of nodes VVSwith respect to Sas DV,S=dvl,SvlV.

    • For example, in Figure 2, Dv1v2,v3v4=2312. Note that the first pair (2,3) corresponds to v1and the second pair (1,2) corresponds to v2.

  • We define a partition =V1V2Vtof VVas one with the following properties:

    • i=1tVi=V, and

    • for all ij, ViVj=Ø.

  • We define a refinement '=V1V2Vof a partition , denoted by 'r, as one that can be obtained from using the following rules:

    • For every node vji=1tVi\i=1Vi, remove vjfrom the set in that contains it.

    • Optionally, for every set Vin , replace Vby a partition of V.

    • If there exists an empty set, remove it.

    • For example, in Figure 2, v1v2v3v5rv1v2v3v4v5.

  • We define an equivalence relation (and related notations) over set of same-length vectors DV\V,Vfor some ØVVas follows:

    • The set of equivalence classes, which forms a partition of DV\V,V, is denoted by V\V,V=

    • For example, in Figure 2, v1v2v6,v3v5==231223.

      • We declare two nodes vi,vjV\Vto be in the same equivalence class if dvi,Vand dvj,Vbelong to the same equivalence class in V\V,V=; thus V\V,V=also defines a partition into equivalence classes of V\V.

      • The measure of the equivalence relation is defined as μDV\V,V=defminyV\V,V=y.

      • If a set Sis a k-antiresolving set then DV\S,Sdefines a partition into equivalence classes of measure k.

    • For example, in Figure 2, μDv1v2v6,v3v5=1and v3v5is a 1-antiresolving set.

Figure 2.

An example used in Section 3 for illustrating various notations.

4. Theoretical results

To understand graph resistance against privacy attacks, one needs to study the k-anonymity in greater details. Thus, we look into some computational problems related to this measure that were formalized and investigated in [16]. This section contains three problems from [16] and the respective algorithms to solve each problem efficiently. It is important to note that k-anonymity in its basic definition sets no limitation for the adversary, which means that an adversary can take control of as many nodes as she/he can. However, in real world there are many mechanisms designed solely to prevent such attacks and thus the chances of being caught are significantly high. This notion is the motivation behind several problems with respect to measuring the k-anonymity in a graph [17].

We now state the three problems for analyzing k-anonymity. Problem 1 simply checks to find a k-antiresolving set for the largest possible value of k. Problem 2 sets a restriction for number of nodes the adversary can control and attempts to find the largest possible value of k while minimizing the number of nodes that are compromised. Problem 3 introduces a version of the problem that attempts to address the trade-off between privacy threshold and number of compromised nodes.

Problem 1 (metric antidimension (ADIM)). Find a k-antiresolving subset of nodes S that maximizes k.

Problem 1 assumes there are no limitations on the number of attacker nodes, thus finding an absolute bound for privacy violation. Note that solution to Problem 1, denoted by kopt , shows that, given no bound on number of the nodes an adversary can control, it is feasible to uniquely re-identify kopt nodes with probability 1kopt. The assumptions in Problem 1 are rarely plausible in practice; due to mechanisms present to counter such attacks, the more nodes the adversary controls, the higher the risk of being exposed. Thus, a limit on number of attacker nodes is necessary, which leads us to Problem 2.

Problem 2 (k-metric antidimension (ADIMk)). Given k, find a k-antiresolving set S such that (i) k>=kand, (ii) S is of minimum cardinality.

Problem 2 is an extension to Problem 1 that attempts to find the largest value of k while minimizing the number of attacker nodes. A solution to this problem asserts few interesting statements. For example, an adversary controlling l attacker nodes where <Loptkcannot uniquely re-identify any node in the network with a probability better than 1k. However, using enough number of nodes (Loptk) one can re-establish such possibilities.

The third problem focuses on a trade-off between number of attacker nodes and the privacy violation probability. Given two measures k-anonymity and k-anonymity where k>kand <, it is easy to observe that k-anonymity measure provides a smaller privacy violation probability but also has lower tolerance for attacker nodes. The trade-off leads us to the third problem.

Problem 3 (k=-metric antidimension (ADIM=k)) Given a positive integer k, find a k antiresolving subset of nodes S with minimum cardinality if such a subset exists.

Chatterjee et al. [16] investigated Problems 1–3 from a computational complexity perspective. The following theorems summarizes their finding on Problems 1–3. The non-trivial mathematical proofs for these theorems are unfortunately outside of the scope of this chapter; we strongly recommend readers who are interested in the proofs to read the original paper [16].

Theorem 1. [16]

  1. Both ADIM and ADIMkcan be solved in On4time.

  2. Both ADIM and ADIMkcan also be solved in On4lognktime with high probability.

Theorem 2. [16]

  1. ADIM=kis NP-Complete for any kin the range 1knεwhere 0ε<12is any arbitrary constant, even if the diameter of the input graph is 2.

  2. Assuming NP DTIMEnloglogn, there exists a universal constant δ>0such that ADIM=kdoes not admit 1δlnnapproximation for any integer kin the range 1knεfor any constant 0ε<12, even if the diameter of the input graph is 2.

  3. If k=ncfor some constant c then ADIM=kcan be solved in polynomial time.

Theorem 3. [16]

  1. ADIM=1admits 1+lnn1approximation in On3time.

  2. If Ghas at least one node of degree 1 then ADIM=1can be solved in On3time.

  3. If Gdoes not contain a cycle of 4 edges then ADIM=1can be solved in On3time.

4.1 Algorithms

The following algorithms were devised in [16] to address Problems 1–3. It is important to note that ADIMcan be solved in On5time by repeatedly solving ADIMkfor k=n1,n2,,1to find the largest obtainable value for k such that Loptk<. However, few modifications to Algorithm 1 directly result in On4solution, which is shown in Algorithm 2.

5. Empirical results

In [18], DasGupta et al. investigated the resistance of 8 real-world network against active attacks with respect to the k-anonymity. All the networks under investigation were unweighted graphs and the direction of edges (if the network was directed) was ignored during the analysis. Table 1 contains the general information regarding these networks. Results for both ADIMand ADIMkwere obtained by running Algorithm 1 on the networks, the return statements from Algorithm 1 being an exact solution to Problem 2. On the other hand, the exact solution for Problem 1 can be achieved by combining Algorithm 1 and binary search on k to find the largest value of k such that VoptkØ[18].

NameNumber of nodesNumber of edgesDescription
Zachary Karate Club [20]3478Network of friendship between 34 members of a karate club
San Juan Community [21]75144Network for visiting relations between families living in farms in San Juan Sur, Costa Rica, 1948
Jazz Musician Network [22]1982842A social network of jazz musicians
University Rovira i Virgili emails [23]113310903The network of email interchanges between members of university
Enron Email Data Set [24]10881767Enron email network
Email Eu Core [25]98624989Emails from a large European research institute
UC Irvine College Message platform [26]189659835Messages on a Facebook-like platform at UC-Irvine
Hamsterster friendships [27]178812476Friendships between users of the website

Table 1.

List of 8 social networks studied in [18].

The results for both Problem 1 and Problem 2 for the networks in Table 1 are depicted in Table 2. Results in Table 2 provide the following interesting insights with respect to resistance against privacy attacks in real-world social networks [19].

  • All networks, with the exception of ”Enron Email Data” network, will have a significant percentage of their users compromised if an adversary gains control of only one node (varying between 2.6% of users compromised in ”University Rovira i Virgili emails” network to 26.5% of users compromised in ”Zachary Karate Club” network).

  • For all networks with the exception of ”Enron Email Data” network, the minimum privacy violation probability is notably higher than 0 (varying between 0.019 for the ”UC Irvine College Message platform” network to 0.25 for the ”Hamsterster friendships” network). The value for minimum privacy violation probability in ”Hamsterster friendships” network is notably higher compare to all other networks.

  • In comparison to other networks, the ”Zachary Karate Club” and the ”San Juan Community” have higher percentage of their users compromised if subjected to a privacy attack.

Namenkoptpopt=1koptLoptkopt=Lopt=koptkoptn
Zachary Karate Club [20]3490.111126.5%
San Juan Community [21]7570.14319.3%
Jazz Musician Network [22]198120.08416.0%
University Rovira i Virgili emails [23]1133290.03512.6%
Enron Email Data Set [24]10881530.00793514.1%
Email Eu Core [25]986390.02613.4%
UC Irvine College Message platform [26]1896550.01912.9%
Hamsterster friendships [27]178840.2510.22%

Table 2.

Results for ADIM using Algorithm 1 [18].

n denotes the number of nodes in the social graph.


kopt is the largest value of k such that Voptkø.


ndepict the number of nodes, koptis the largest value of ksuch that Voptkø, and Loptkoptis minimum number of attacker nodes for corresponding k.

The exception network is the “Enron Email Data” network which due to a high value of Loptkis very resilient against an attack. An adversary needs to control at least 86% of the network to achieve a value of popt=0.007, which is not feasible in practice. This interesting observation in the “Enron Email Data” network motivated further inspections in different values of k. As shown in Table 3, Loptkin the “Enron Email Data” network does not decrease significantly until k is set to a much smaller value compare to kopt, which further emphasizes that violating the privacy of the “Enron Email Data” network is not guaranteed in practice. The authors in [18] also investigated the k-anonymity measure in synthetic networks constructed based on both Erdös-Rényi random graphs and Barabási-Albert scale-free networks. We refer the reader to the original paper for more information.

k4510204060100120153
Enron Email Data Setpk=1k0.250.20.10.050.0250.0170.010.0090.007
Loptk1334463567683842935935935

Table 3.

Loptkvalues recorded for k>1for the “Enron Email Data” network [18]. The values shown are subject to LoptkLoptk1.

6. Conclusions

Since their emergence about a decade ago, social networks have rapidly grown and infiltrated every aspect of our daily lives. With rapidly expanding reliance on their platforms, social networks like Facebook and Twitter are becoming a goldmine of personal information and user behavior data which makes the study of these networks of prime importance. The valuable information stored within these platforms makes them the target of malicious entities which try to compromise the privacy of the users which may further lead to unwanted disclosure of the sensitive attributes of the network.

In this chapter, we have reviewed a novel privacy measure that quantifies the resistance of a large social network against a privacy violating attack. We reviewed some efficient algorithms to compute this measure in social graph and revisited the privacy violation properties in 8 real-world networks. The current theoretical and empirical results for k-anonymity pave the way for further investigation of this measure, as well as addressing its shortcomings and limitations.

Acknowledgments

We thank Ismael G. Yero for useful discussions. This research was partially supported by NSF grants IIS-1160995 and IIS-1814931.

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Tanima Chatterjee, Nasim Mobasheri and Bhaskar DasGupta (January 24th 2020). A Review of Several Privacy Violation Measures for Large Networks under Active Attacks, Security and Privacy From a Legal, Ethical, and Technical Perspective, Christos Kalloniatis and Carlos Travieso-Gonzalez, IntechOpen, DOI: 10.5772/intechopen.90909. Available from:

chapter statistics

66total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Beyond Differential Privacy: Synthetic Micro-Data Generation with Deep Generative Neural Networks

By Ofer Mendelevitch and Michael D. Lesh

Related Book

First chapter

Information Systems: From the Requirements to the Integrated Solution

By José Francisco Zelasco and Judith Donayo

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us