A Review of Several Privacy Violation Measures for Large Networks under Active Attacks

It is by now a standard practice to use the concepts and terminologies of network science to analyze social networks of interconnections between people such as Facebook, Twitter and LinkedIn. The powers and implications of such social network analysis are indeed indisputable; for example, such analysis may uncover previously unknown knowledge on community-based involvements, media usages and individual engagements. However, all these benefits are not necessarily cost-free since a malicious individual could compromise privacy of users of these social networks for harmful purposes that may result in the disclosure of sensitive data that may be linked to its users. A natural way to avoid this consists of an “ anonymization process ” of the relevant social network. However, since such anonymization processes may not always succeed, an important research goal is to quantify and measure how much privacy a given social network can achieve. Toward this goal, some recent research works have aimed at evaluating the resistance of a social network against active privacy-violating attacks by introducing and studying a new and meaningful privacy measure for social networks. In this chapter, we review both theoretical and empirical aspects of such privacy violation measures of large networks under active attacks.


Introduction
In recent years, social networks have become an indisputable part of people's lives. The emergence of such networks has altered how we interact with the world. A given individual's day-to-day activities like media consumption, job hunting and social interaction have changed, along with how businesses and other beneficial entities interact with them through marketing, advertising, and information diffusion. This has led to an unstoppable race of collecting information and interaction from social networks by researchers, governments, and business entities for various purposes. From a research point of view, social networks and their interaction mechanisms provide valuable insight in many fields of study, such as sociology, psychology, advertising, and recommendation systems. It is only natural that the information contained in these networks and the value they hold have been and will be targeted by bad actors for malicious activities. The importance of these networks and the value of information that can be retrieved from them have led social network researchers to take a closer look at methods to combat such bad actors as well as formulate network measures that can provide an insight to the privacy of these networks. In this survey, we will look at one such measure known as k, ℓ ð Þanonymity [1] and will discuss some theoretical and empirical results regarding this measure.

Overview of the paper
Given the irrefutable importance of social networks in our daily lives and the ever increasing risk of compromising valuable personal data through privacy attacks against these networks, it is preferable to know how secure a given social network is against privacy attacks. This necessitates a deeper look into the types of privacy attacks and how to cope with them. There is an extensive literature on privacy preserving computational models in variety of application areas such as multi-party communications or distributed computing settings [2][3][4][5][6]. In this chapter, we focus on a specific type of attack known as background-based active attack and one measure that reflects the resistance of any given network against such attacks. The organization of the rest of the paper is as follows: • In Section 2 we briefly discuss the notion of privacy in social networks and review some literature on privacy violating attacks on social networks. We also introduce the k, ℓ ð Þ-anonymity privacy measure and some corresponding network measurement which are the basis for this measure.
• In Section 3 we review some basic terminologies and notations that will be used in formulation of the three problems introduced in Section 4.
• Section 4 contains three problems that arise from theoretical investigation of the k, ℓ ð Þ-anonymity.
• Section 5 contains the results of an empirical study on the resistance of realworld social networks.
• Finally, we end this chapter with some concluding remarks in Section 6.

Privacy measures in social networks
We begin by discussing the mathematical structure that fit the most to represent social networks. A social network is often portrayed as a graph [7,8] is a set of nodes representing the social members, and E is the set of edges portraying the relationship among these members. Both nodes and edges may have extra attributes, such as weights, that provide extra information about the nature of these social bonds (e.g., trust or popularity); however, throughout this survey we will consider the simplest form of graphs, namely undirected and unweighted graphs, to model our social networks.
As we discussed in the previous section, the information that the social networks provide are invaluable. Due to the very nature of many social network applications, the identity of the members or the nature of relationship between members is quite sensitive and valuable. Thus, when releasing a social network we want to remove any attributes that may help identify these kinds of sensitive data. Assuming all members and their relationships are of high sensitivity, preventing identity disclosure or link disclosure becomes an important task. One popular method to prevent such disclosures is anonymization. In an anonymization process, we publish the network without identifying the corresponding nodes or potentially identifiable attributes. Even after anonymizing the network, we will still be releasing many informative attributes encoded by the network structure; for example, attributes such as node degree, connectivity, or other similar graph properties can still help the adversaries in compromising the user privacies of a published network.
Adversaries usually rely on background knowledge to compromise the privacy of published anonymized social networks. For understanding the failure of current privacy preservation methods such as anonymization, we need to have a proper model for the adversary background knowledge. Although it's challenging to have a comprehensive model of all possible types of adversary background knowledge, it is very useful to model the background knowledge via structural properties of networks such as node degrees, embedded subgraphs, node neighbors, etc. [9]. Backstrom et al. [10] were the first to introduce a category of attacks on anonymized social graphs. The models introduced in [10] are background-based attacks and are widely used in privacy analysis of social networks. The two main types of attacks are as follows.
1. Passive attacks in which the adversary will not modify the network by injecting new nodes, but instead will use the structural knowledge to detect the location of a known node. In this type of attacks, the adversary can benefit from the fact that most nodes in real social networks often belong to a small uniquely identifiable subgraph [10]. An adversary can then build a coalition with members of such subgraphs and attempt to re-identify the subgraphs in the anonymized published network, thus compromising the privacy of neighboring nodes.
2. Active attacks in which the adversary will choose an arbitrary set of target users, create new nodes and insert them into a social network in a way that they are connected to the target set and they form a distinguishable subgraph. After the anonymized version of the social network is published, the adversary can then use the subgraph as a fingerprint to re-identify the targeted users and compromise their privacy.
The authors in [10] also showed that it is possible to compromise the privacy of any social network of n nodes with high probability using only O ffiffiffiffiffiffiffiffiffiffi ffi log n p À Á attacker nodes. In a passive attack, adversary's structural knowledge will give her/him a global view of the network depending on the global structure of the network. It could pose a high privacy risk if an adversary were to combine this global view with the local structural knowledge obtained using an active attack. As an example, consider the network in Figure 1. If we only have global structural knowledge, it is not possible to differentiate the nodes v 3 and v 4 (e.g. , same node degrees, etc.).
However, controlling just one extra node in the graph, such as the node v 1 , provides local structural knowledge such as distances between nodes, and using the knowledge of the distance of v 1 from v 3 and v 4 There are several well-studied strategies for coping with active attacks on a social network [9,11,12] via addressing the anonymization process of the social network. However, in this chapter we will focus on a measure that evaluates how resistant a social network is against this type of privacy attack. Introduced by Trujillo-Rasua et al. [1], k, ℓ ð Þ-anonymity is a novel and, to the best of our knowledge, the only privacy measure examining the structural resistance of a given graph against active attacks. The k, ℓ ð Þ-anonymity is a measure based on metric representation of nodes, where k is a privacy threshold and l is the maximum number of attacker nodes that may be inserted in the network. It was shown in [1] that graphs satisfying k, ℓ ð Þ-anonymity can successfully deter adversaries controlling at most l nodes in the graph from re-identifying nodes with probability higher than 1 k .

k, ℓ ð Þ-anonymity
The k, ℓ ð Þ-anonymity measure is based on a concept known as k-metric antidimension of graphs. To facilitate further discussions about the measure, we first introduce some notations and terminologies. For a simple connected graph where V is set of nodes and E is set of edges, let dist v i ,v j denote distance (i.e. , number of edges in a shortest path) between the nodes v i and v j . Given and ordered set of nodes S ¼ v 1 , … , v t f gand a node u we define the metric representation of u with respect to S as a vector d u, Metric representations of nodes are closely related to the concept of a resolving set of a graph. Inspired by the problem of identifying an intruder in a network and introduced separately by Slater [13] and by Harary and Melter [14], a resolving set of graph provides recognition of every pair of nodes in graph.
Definition 1 (resolving set). Given a graph A smallest-cardinality resolving set is called the metric basis, and its cardinality is referred to as the metric dimension of G.
The concepts of metric representation and resolving set inspired the introduction of another network measure known as k-antiresolving set that will be used as the founding base for k, ℓ ð Þ-anonymity. Definition 2 (k-antiresolving set). Given a graph G ¼ V, E ð Þ, S ⊂ V is called a k-antiresolving set of G if k is the largest integer such that, for every node v ∈ VnS , there exist at least k À 1 nodes u 1 , u 2 , … , u kÀ1 ∈ VnS with the same metric representation with respect to S as v.
A k-antiresolving set of minimum cardinality is called a k-antiresolving basis, and its cardinality denotes the k-metric antidimension adim k G ð Þ of G. Note that the k-antiresolving set may not exist for every k in a graph.
The (k,l)-anonymity measure is built upon the k-antiresolving set concept. Assume the adversary has gained control of a subset S of nodes in the graph G, where S is a k-antiresolving set for G. Then the adversary cannot uniquely reidentify any node based on the background knowledge (namely, the knowledge of metric representation of a node v with respect to S) with probability higher than 1 k . k, ℓ ð Þ-anonymity is formally defined as [1]. Figure 1.
A simple graph G used in Section 2 to illustrate the high risk posed by combining knowledge gained by active and passive attacks.
Definition 3 ( k, ℓ ð Þ-anonymity). A graph G under active attack satisfies k, ℓ ð Þanonymity if k is the smallest positive integer so that the k-metric antidimension of G is less than or equal to l.
In the above definition, k is a parameter depicting the privacy threshold and l represents the maximum number of attacker nodes. It is safe to assume that number of attacker nodes l is significantly smaller than number of nodes present in the network as injecting attacker nodes or gaining control of existing nodes is difficult without being detected [15].

Basic terminologies and notations
For the exposition in the remainder of this chapter, we will need some notations and terminologies which we introduce here. Consider the (undirected unweighted) graph G in Figure 2. We will use this graph to illustrate the terminologies and notations that are introduced.
• The metric representation of node v i is denoted by • For example, in Figure 2 • The diameter of G is the length of the longest shortest path and is denoted by • For example, in Figure 2 • The open neighborhood of node v i is a subset of all nodes directly connected to v i and denoted by • For example, in Figure 2, • The metric representation of a node v i with respect to a subset such as S ⊂ V is denoted by d v i ,ÀS .
• For example, in Figure 2 • We can expand the previous notation to reflect the metric representation of a subset of nodes V 0 ⊂ V S with respect to S as • For example, in Figure 2, g . Note that the first pair (2,3) corresponds to v 1 and the second pair (1,2) corresponds to v 2 .
• We define a partition ⊆ V as one with the following properties: , as one that can be obtained from Q using the following rules: • Optionally, for every set V ℓ in Q , replace V ℓ by a partition of V ℓ.
• If there exists an empty set, remove it.
i. For example, in Figure 2, • We define an equivalence relation (and related notations) over set of samelength vectors D VnV 0 ,ÀV 0 for some Ø ⊂ V 0 ⊂ V as follows: • The set of equivalence classes, which forms a partition of D VnV 0 ,ÀV 0 , is denoted by Q ¼ VnV 0 ,ÀV 0 i. For example, in Figure 2, • We declare two nodes v i , v j ∈ VnV 0 to be in the same equivalence class if d v i ,ÀV 0 and d v j ,ÀV 0 belong to the same equivalence class in Q ¼ VnV 0 ,ÀV 0 ; thus Q ¼ VnV 0 ,ÀV 0 also defines a partition into equivalence classes of VnV 0 .
• The measure of the equivalence relation is defined as • If a set S is a k-antiresolving set then D VnS,ÀS defines a partition into equivalence classes of measure k.

Theoretical results
To understand graph resistance against privacy attacks, one needs to study the k, ℓ ð Þ-anonymity in greater details. Thus, we look into some computational problems related to this measure that were formalized and investigated in [16]. This section contains three problems from [16] and the respective algorithms to solve each problem efficiently. It is important to note that k, ℓ ð Þ-anonymity in its basic definition sets no limitation for the adversary, which means that an adversary can take control of as many nodes as she/he can. However, in real world there are many mechanisms designed solely to prevent such attacks and thus the chances of being caught are significantly high. This notion is the motivation behind several problems with respect to measuring the k, ℓ ð Þ-anonymity in a graph. We now state the three problems for analyzing k, ℓ ð Þ-anonymity. Problem 1 simply checks to find a k-antiresolving set for the largest possible value of k. Problem 2 sets a restriction for number of nodes the adversary can control and attempts to find the largest possible value of k while minimizing the number of nodes that are compromised. Problem 3 introduces a version of the problem that attempts to address the trade-off between privacy threshold and number of compromised nodes.
Problem 1 assumes there are no limitations on the number of attacker nodes, thus finding an absolute bound for privacy violation. Note that solution to Problem 1, denoted by k opt , shows that, given no bound on number of the nodes an adversary can control, it is feasible to uniquely re-identify k opt nodes with probability 1 k opt . The assumptions in Problem 1 are rarely plausible in practice; due to mechanisms present to counter such attacks, the more nodes the adversary controls, the higher the risk of being exposed. Thus, a limit on number of attacker nodes is necessary, which leads us to Problem 2.
Problem 2 is an extension to Problem 1 that attempts to find the largest value of k while minimizing the number of attacker nodes. A solution to this problem asserts few interesting statements. For example, an adversary controlling l attacker nodes where ℓ < |L ≥ k opt | cannot uniquely re-identify any node in the network with a probability better than 1 k . However, using enough number of nodes ( ≥ |L ≥ k opt |) one can reestablish such possibilities.
The third problem focuses on a trade-off between number of attacker nodes and the privacy violation probability. Given two measures k, ℓ ð Þ-anonymity and k 0 , ℓ 0 À Á anonymity where k 0 > k and ℓ 0 < ℓ, it is easy to observe that k 0 , ℓ 0 À Á -anonymity measure provides a smaller privacy violation probability but also has lower tolerance for attacker nodes. The trade-off leads us to the third problem.
Problem 3 (k=-metric antidimension (ADIM ¼k )) Given a positive integer k, find a k antiresolving subset of nodes S with minimum cardinality if such a subset exists.
Chatterjee et al. [16] investigated Problems 1-3 from a computational complexity perspective. The following theorems summarizes their finding on Problems 1-3. The non-trivial mathematical proofs for these theorems are unfortunately outside of the scope of this chapter; we strongly recommend readers who are interested in the proofs to read the original paper [16]. Theorem 1. [16] 1. Both ADIM and ADIM ≥ k can be solved in O n 4 ð Þ time.
2. Both ADIM and ADIM ≥ k can also be solved in O n 4 log n k time with high probability.
Theorem 2. [16] 1. ADIM ¼k is NP-Complete for any k in the range 1 ≤ k ≤ n ε where 0 ≤ ε < 1 2 is any arbitrary constant, even if the diameter of the input graph is 2.
2. Assuming NP⊈DTIME n log log n À Á , there exists a universal constant δ > 0 such that ADIM ¼k does not admit 1 δ ln n À Á approximation for any integer k in the range 1 ≤ k ≤ n ε for any constant 0 ≤ ε < 1 2 , even if the diameter of the input graph is 2.
3. If k ¼ n À c for some constant c then ADIM ¼k can be solved in polynomial time.
2. If G has at least one node of degree 1 then ADIM ¼1 can be solved in O n 3 ð Þ time.
3. If G does not contain a cycle of 4 edges then ADIM ¼1 can be solved in O n 3 ð Þ time.

Algorithms
The following algorithms were devised in [16] to address Problems 1-3. It is important to note that ADIM can be solved in O n 5 ð Þ time by repeatedly solving ADIM ≥ k for k ¼ n À 1, n À 2, … , 1 to find the largest obtainable value for k such that L ≥ k opt < ∞. However, few modifications to Algorithm 1 directly result in O n 4 ð Þ solution, which is shown in Algorithm 2.

Empirical results
In [18], DasGupta et al. investigated the resistance of 8 real-world network against active attacks with respect to the k, ℓ ð Þ-anonymity. All the networks under investigation were unweighted graphs and the direction of edges (if the network was directed) was ignored during the analysis. Table 1 contains the general information regarding these networks. Results for both ADIM and ADIM ≥ k were obtained by running Algorithm 1 on the networks, the return statements from Algorithm 1 being an exact solution to Problem 2. On the other hand, the exact solution for Problem 1 can be achieved by combining Algorithm 1 and binary search on k to find the largest value of k such that V ≥ k opt 6 ¼ Ø [18].
The results for both Problem 1 and Problem 2 for the networks in Table 1 are depicted in Table 2. Results in Table 2 provide the following interesting insights with respect to resistance against privacy attacks in real-world social networks.
• All networks, with the exception of "Enron Email Data" network, will have a significant percentage of their users compromised if an adversary gains control of only one node (varying between 2.6% of users compromised in "University Rovira i Virgili emails" network to 26.5% of users compromised in "Zachary Karate Club" network). • For all networks with the exception of "Enron Email Data" network, the minimum privacy violation probability is notably higher than 0 (varying between 0.019 for the "UC Irvine College Message platform" network to 0.25 for the "Hamsterster friendships" network). The value for minimum privacy violation probability in "Hamsterster friendships" network is notably higher compare to all other networks.
• In comparison to other networks, the "Zachary Karate Club" and the "San Juan Community" have higher percentage of their users compromised if subjected to a privacy attack.
The exception network is the "Enron Email Data" network which due to a high value of L ≥ k opt is very resilient against an attack. An adversary needs to control at least 86% of the network to achieve a value of p opt ¼ 0:007, which is not feasible in practice. This interesting observation in the "Enron Email Data" network motivated further inspections in different values of k. As shown in Table 3, L ≥ k opt in the "Enron Email Data" network does not decrease significantly until k is set to a much smaller value compare to k opt , which further emphasizes that violating the privacy of the "Enron Email Data" network is not guaranteed in practice. The authors in [18] also investigated the k, ℓ ð Þ-anonymity measure in synthetic networks constructed based on both Erdös-Rényi random graphs and Barabási-Albert scale-free networks. We refer the reader to the original paper for more information. n depict the number of nodes, k opt is the largest value of k such that V ≥ k opt 6 ¼ ø, and L ≥ kopt opt is minimum number of attacker nodes for corresponding k. a n denotes the number of nodes in the social graph. b k opt is the largest value of k such that V ≥ k opt 6 ¼ ø. Table 2.
Results for ADIM using Algorithm 1 [18].  Table 3. L ≥ k opt values recorded for k > 1 for the "Enron Email Data" network [18]. The values shown are subject to L ≥ k opt 6 ¼ L ≥ kÀ1 opt .

Conclusions
Since their emergence about a decade ago, social networks have rapidly grown and infiltrated every aspect of our daily lives. With rapidly expanding reliance on their platforms, social networks like Facebook and Twitter are becoming a goldmine of personal information and user behavior data which makes the study of these networks of prime importance. The valuable information stored within these platforms makes them the target of malicious entities which try to compromise the privacy of the users which may further lead to unwanted disclosure of the sensitive attributes of the network.
In this chapter, we have reviewed a novel privacy measure that quantifies the resistance of a large social network against a privacy violating attack. We reviewed some efficient algorithms to compute this measure in social graph and revisited the privacy violation properties in 8 real-world networks. The current theoretical and empirical results for k, ℓ ð Þ-anonymity pave the way for further investigation of this measure, as well as addressing its shortcomings and limitations.