List of 8 social networks studied in .
It is by now a standard practice to use the concepts and terminologies of network science to analyze social networks of interconnections between people such as Facebook, Twitter and LinkedIn. The powers and implications of such social network analysis are indeed indisputable; for example, such analysis may uncover previously unknown knowledge on community-based involvements, media usages and individual engagements. However, all these benefits are not necessarily cost-free since a malicious individual could compromise privacy of users of these social networks for harmful purposes that may result in the disclosure of sensitive data that may be linked to its users. A natural way to avoid this consists of an “anonymization process” of the relevant social network. However, since such anonymization processes may not always succeed, an important research goal is to quantify and measure how much privacy a given social network can achieve. Toward this goal, some recent research works have aimed at evaluating the resistance of a social network against active privacy-violating attacks by introducing and studying a new and meaningful privacy measure for social networks. In this chapter, we review both theoretical and empirical aspects of such privacy violation measures of large networks under active attacks.
- social networks
- privacy measure
- active attacks
- algorithmic complexity
In recent years, social networks have become an indisputable part of people’s lives. The emergence of such networks has altered how we interact with the world. A given individual’s day-to-day activities like media consumption, job hunting and social interaction have changed, along with how businesses and other beneficial entities interact with them through marketing, advertising, and information diffusion. This has led to an unstoppable race of collecting information and interaction from social networks by researchers, governments, and business entities for various purposes. From a research point of view, social networks and their interaction mechanisms provide valuable insight in many fields of study, such as sociology, psychology, advertising, and recommendation systems. It is only natural that the information contained in these networks and the value they hold have been and will be targeted by bad actors for malicious activities. The importance of these networks and the value of information that can be retrieved from them have led social network researchers to take a closer look at methods to combat such bad actors as well as formulate network measures that can provide an insight to the privacy of these networks. In this survey, we will look at one such measure known as -anonymity  and will discuss some theoretical and empirical results regarding this measure.
1.1 Overview of the paper
Given the irrefutable importance of social networks in our daily lives and the ever increasing risk of compromising valuable personal data through privacy attacks against these networks, it is preferable to know how secure a given social network is against privacy attacks. This necessitates a deeper look into the types of privacy attacks and how to cope with them. There is an extensive literature on privacy preserving computational models in variety of application areas such as multi-party communications or distributed computing settings [2, 3, 4, 5, 6]. In this chapter, we focus on a specific type of attack known as background-based active attack and one measure that reflects the resistance of any given network against such attacks. The organization of the rest of the paper is as follows:
In Section 2 we briefly discuss the notion of privacy in social networks and review some literature on privacy violating attacks on social networks. We also introduce the -anonymity privacy measure and some corresponding network measurement which are the basis for this measure.
In Section 3 we review some basic terminologies and notations that will be used in formulation of the three problems introduced in Section 4.
Section 4 contains three problems that arise from theoretical investigation of the -anonymity.
Section 5 contains the results of an empirical study on the resistance of real-world social networks.
Finally, we end this chapter with some concluding remarks in Section 6.
2. Privacy measures in social networks
We begin by discussing the mathematical structure that fit the most to represent social networks. A social network is often portrayed as a graph [7, 8] where V is a set of nodes representing the social members, and E is the set of edges portraying the relationship among these members. Both nodes and edges may have extra attributes, such as weights, that provide extra information about the nature of these social bonds (e.g., trust or popularity); however, throughout this survey we will consider the simplest form of graphs, namely undirected and unweighted graphs, to model our social networks.
As we discussed in the previous section, the information that the social networks provide are invaluable. Due to the very nature of many social network applications, the identity of the members or the nature of relationship between members is quite sensitive and valuable. Thus, when releasing a social network we want to remove any attributes that may help identify these kinds of sensitive data. Assuming all members and their relationships are of high sensitivity, preventing identity disclosure or link disclosure becomes an important task. One popular method to prevent such disclosures is anonymization. In an anonymization process, we publish the network without identifying the corresponding nodes or potentially identifiable attributes. Even after anonymizing the network, we will still be releasing many informative attributes encoded by the network structure; for example, attributes such as node degree, connectivity, or other similar graph properties can still help the adversaries in compromising the user privacies of a published network.
Adversaries usually rely on background knowledge to compromise the privacy of published anonymized social networks. For understanding the failure of current privacy preservation methods such as anonymization, we need to have a proper model for the adversary background knowledge. Although it’s challenging to have a comprehensive model of all possible types of adversary background knowledge, it is very useful to model the background knowledge via structural properties of networks such as node degrees, embedded subgraphs, node neighbors, etc. . Backstrom et al.  were the first to introduce a category of attacks on anonymized social graphs. The models introduced in  are background-based attacks and are widely used in privacy analysis of social networks. The two main types of attacks are as follows.
Passive attacks in which the adversary will not modify the network by injecting new nodes, but instead will use the structural knowledge to detect the location of a known node. In this type of attacks, the adversary can benefit from the fact that most nodes in real social networks often belong to a small uniquely identifiable subgraph . An adversary can then build a coalition with members of such subgraphs and attempt to re-identify the subgraphs in the anonymized published network, thus compromising the privacy of neighboring nodes.
Active attacks in which the adversary will choose an arbitrary set of target users, create new nodes and insert them into a social network in a way that they are connected to the target set and they form a distinguishable subgraph. After the anonymized version of the social network is published, the adversary can then use the subgraph as a fingerprint to re-identify the targeted users and compromise their privacy.
The authors in  also showed that it is possible to compromise the privacy of any social network of n nodes with high probability using onlyattacker nodes. In a passive attack, adversary’s structural knowledge will give her/him a global view of the network depending on the global structure of the network. It could pose a high privacy risk if an adversary were to combine this global view with the local structural knowledge obtained using an active attack. As an example, consider the network in Figure 1. If we only have global structural knowledge, it is not possible to differentiate the nodes and (e.g., same node degrees, etc.). However, controlling just one extra node in the graph, such as the node , provides local structural knowledge such as distances between nodes, and using the knowledge of the distance of from and (and ) one can easily differentiate node from node .
There are several well-studied strategies for coping with active attacks on a social network [9, 11, 12] via addressing the anonymization process of the social network. However, in this chapter we will focus on a measure that evaluates how resistant a social network is against this type of privacy attack. Introduced by Trujillo-Rasua et al. , -anonymity is a novel and, to the best of our knowledge, the only privacy measure examining the structural resistance of a given graph against active attacks. The -anonymity is a measure based on metric representation of nodes, where k is a privacy threshold and l is the maximum number of attacker nodes that may be inserted in the network. It was shown in  that graphs satisfying -anonymity can successfully deter adversaries controlling at most l nodes in the graph from re-identifying nodes with probability higher than .
The -anonymity measure is based on a concept known as k-metric anti-dimension of graphs. To facilitate further discussions about the measure, we first introduce some notations and terminologies. For a simple connected graph , where V is set of nodes and E is set of edges, let denote distance (i.e., number of edges in a shortest path) between the nodes and . Given and ordered set of nodes and a node we define the metric representation of u with respect to S as a vector . Metric representations of nodes are closely related to the concept of a resolving set of a graph. Inspired by the problem of identifying an intruder in a network and introduced separately by Slater  and by Harary and Melter , a resolving set of graph provides recognition of every pair of nodes in graph.
Definition 1 (resolving set). Given a graph , a subset is called a resolving set for G if, for each pair of nodes , there exist a node such that . A smallest-cardinality resolving set is called the metric basis, and its cardinality is referred to as the metric dimension of G.
The concepts of metric representation and resolving set inspired the introduction of another network measure known as k-antiresolving set that will be used as the founding base for -anonymity.
Definition 2 (k-antiresolving set). Given a graph , is called a k-antiresolving set of G if k is the largest integer such that, for every node , there exist at least nodes with the same metric representation with respect to S as .
A k-antiresolving set of minimum cardinality is called a k-antiresolving basis, and its cardinality denotes the k-metric antidimensionof G. Note that the k-antiresolving set may not exist for every k in a graph.
The (k,l)-anonymity measure is built upon the k-antiresolving set concept. Assume the adversary has gained control of a subset S of nodes in the graph G, where S is a k-antiresolving set for G. Then the adversary cannot uniquely re-identify any node based on the background knowledge (namely, the knowledge of metric representation of a node with respect to S) with probability higher than . -anonymity is formally defined as .
Definition 3 (-anonymity). A graph G under active attack satisfies -anonymity if k is the smallest positive integer so that the k-metric antidimension of G is less than or equal to l.
In the above definition, k is a parameter depicting the privacy threshold and l represents the maximum number of attacker nodes. It is safe to assume that number of attacker nodes l is significantly smaller than number of nodes present in the network as injecting attacker nodes or gaining control of existing nodes is difficult without being detected .
3. Basic terminologies and notations
For the exposition in the remainder of this chapter, we will need some notations and terminologies which we introduce here. Consider the (undirected unweighted) graph G in Figure 2. We will use this graph to illustrate the terminologies and notations that are introduced.
The metric representation of node is denoted by .
For example, in Figure 2,
The diameter of is the length of the longest shortest path and is denoted by .
For example, in Figure 2, .
The open neighborhood of node is a subset of all nodes directly connected to and denoted by
For example, in Figure 2,
The metric representation of a node with respect to a subset such as is denoted by .
For example, in Figure 2, .
We can expand the previous notation to reflect the metric representation of a subset of nodes with respect to as .
For example, in Figure 2, . Note that the first pair (2,3) corresponds to and the second pair (1,2) corresponds to .
We define a partition of as one with the following properties:
for all , .
We define a refinement of a partition , denoted by , as one that can be obtained from using the following rules:
For every node , remove from the set in that contains it.
Optionally, for every set in , replace by a partition of .
If there exists an empty set, remove it.
For example, in Figure 2, .
We define an equivalence relation (and related notations) over set of same-length vectors for some as follows:
The set of equivalence classes, which forms a partition of , is denoted by
For example, in Figure 2, .
We declare two nodes to be in the same equivalence class if and belong to the same equivalence class in ; thus also defines a partition into equivalence classes of .
The measure of the equivalence relation is defined as .
If a set is a -antiresolving set then defines a partition into equivalence classes of measure .
For example, in Figure 2, and is a 1-antiresolving set.
4. Theoretical results
To understand graph resistance against privacy attacks, one needs to study the -anonymity in greater details. Thus, we look into some computational problems related to this measure that were formalized and investigated in . This section contains three problems from  and the respective algorithms to solve each problem efficiently. It is important to note that -anonymity in its basic definition sets no limitation for the adversary, which means that an adversary can take control of as many nodes as she/he can. However, in real world there are many mechanisms designed solely to prevent such attacks and thus the chances of being caught are significantly high. This notion is the motivation behind several problems with respect to measuring the -anonymity in a graph.
We now state the three problems for analyzing -anonymity. Problem 1 simply checks to find a k-antiresolving set for the largest possible value of k. Problem 2 sets a restriction for number of nodes the adversary can control and attempts to find the largest possible value of k while minimizing the number of nodes that are compromised. Problem 3 introduces a version of the problem that attempts to address the trade-off between privacy threshold and number of compromised nodes.
Problem 1 (metric antidimension (ADIM)). Find a k-antiresolving subset of nodes S that maximizes k.
Problem 1 assumes there are no limitations on the number of attacker nodes, thus finding an absolute bound for privacy violation. Note that solution to Problem 1, denoted by kopt, shows that, given no bound on number of the nodes an adversary can control, it is feasible to uniquely re-identify kopt nodes with probability . The assumptions in Problem 1 are rarely plausible in practice; due to mechanisms present to counter such attacks, the more nodes the adversary controls, the higher the risk of being exposed. Thus, a limit on number of attacker nodes is necessary, which leads us to Problem 2.
Problem 2 (-metric antidimension ()). Given k, find a -antiresolving set S such that (i) and, (ii) S is of minimum cardinality.
Problem 2 is an extension to Problem 1 that attempts to find the largest value of k while minimizing the number of attacker nodes. A solution to this problem asserts few interesting statements. For example, an adversary controlling l attacker nodes where cannot uniquely re-identify any node in the network with a probability better than . However, using enough number of nodes () one can re-establish such possibilities.
The third problem focuses on a trade-off between number of attacker nodes and the privacy violation probability. Given two measures -anonymity and -anonymity where and , it is easy to observe that -anonymity measure provides a smaller privacy violation probability but also has lower tolerance for attacker nodes. The trade-off leads us to the third problem.
Problem 3 (k=-metric antidimension ()) Given a positive integer k, find a k antiresolving subset of nodes S with minimum cardinality if such a subset exists.
Chatterjee et al.  investigated Problems 1–3 from a computational complexity perspective. The following theorems summarizes their finding on Problems 1–3. The non-trivial mathematical proofs for these theorems are unfortunately outside of the scope of this chapter; we strongly recommend readers who are interested in the proofs to read the original paper .
Theorem 1. 
Both ADIM and can be solved in Otime.
Both ADIM and can also be solved in Otime with high probability.
Theorem 2. 
is NP-Complete for any in the range where is any arbitrary constant, even if the diameter of the input graph is 2.
Assuming NPDTIME, there exists a universal constant such that does not admit approximation for any integer in the range for any constant , even if the diameter of the input graph is 2.
If for some constant c then can be solved in polynomial time.
Theorem 3. 
admits approximation in time.
If has at least one node of degree 1 then can be solved in time.
If does not contain a cycle of 4 edges then can be solved in time.
The following algorithms were devised in  to address Problems 1–3. It is important to note that can be solved in time by repeatedly solving for to find the largest obtainable value for k such that . However, few modifications to Algorithm 1 directly result in solution, which is shown in Algorithm 2.
5. Empirical results
In , DasGupta et al. investigated the resistance of 8 real-world network against active attacks with respect to the -anonymity. All the networks under investigation were unweighted graphs and the direction of edges (if the network was directed) was ignored during the analysis. Table 1 contains the general information regarding these networks. Results for both and were obtained by running Algorithm 1 on the networks, the return statements from Algorithm 1 being an exact solution to Problem 2. On the other hand, the exact solution for Problem 1 can be achieved by combining Algorithm 1 and binary search on k to find the largest value of k such that .
|Name||Number of nodes||Number of edges||Description|
|Zachary Karate Club ||34||78||Network of friendship between 34 members of a karate club|
|San Juan Community ||75||144||Network for visiting relations between families living in farms in San Juan Sur, Costa Rica, 1948|
|Jazz Musician Network ||198||2842||A social network of jazz musicians|
|University Rovira i Virgili emails ||1133||10903||The network of email interchanges between members of university|
|Enron Email Data Set ||1088||1767||Enron email network|
|Email Eu Core ||986||24989||Emails from a large European research institute|
|UC Irvine College Message platform ||1896||59835||Messages on a Facebook-like platform at UC-Irvine|
|Hamsterster friendships ||1788||12476||Friendships between users of the website|
The results for both Problem 1 and Problem 2 for the networks in Table 1 are depicted in Table 2. Results in Table 2 provide the following interesting insights with respect to resistance against privacy attacks in real-world social networks.
All networks, with the exception of ”Enron Email Data” network, will have a significant percentage of their users compromised if an adversary gains control of only one node (varying between 2.6% of users compromised in ”University Rovira i Virgili emails” network to 26.5% of users compromised in ”Zachary Karate Club” network).
For all networks with the exception of ”Enron Email Data” network, the minimum privacy violation probability is notably higher than 0 (varying between 0.019 for the ”UC Irvine College Message platform” network to 0.25 for the ”Hamsterster friendships” network). The value for minimum privacy violation probability in ”Hamsterster friendships” network is notably higher compare to all other networks.
In comparison to other networks, the ”Zachary Karate Club” and the ”San Juan Community” have higher percentage of their users compromised if subjected to a privacy attack.
|Zachary Karate Club ||34||9||0.111||1||26.5%|
|San Juan Community ||75||7||0.143||1||9.3%|
|Jazz Musician Network ||198||12||0.084||1||6.0%|
|University Rovira i Virgili emails ||1133||29||0.035||1||2.6%|
|Enron Email Data Set ||1088||153||0.007||935||14.1%|
|Email Eu Core ||986||39||0.026||1||3.4%|
|UC Irvine College Message platform ||1896||55||0.019||1||2.9%|
|Hamsterster friendships ||1788||4||0.25||1||0.22%|
The exception network is the “Enron Email Data” network which due to a high value of is very resilient against an attack. An adversary needs to control at least 86% of the network to achieve a value of , which is not feasible in practice. This interesting observation in the “Enron Email Data” network motivated further inspections in different values of . As shown in Table 3, in the “Enron Email Data” network does not decrease significantly until k is set to a much smaller value compare to , which further emphasizes that violating the privacy of the “Enron Email Data” network is not guaranteed in practice. The authors in  also investigated the -anonymity measure in synthetic networks constructed based on both Erdös-Rényi random graphs and Barabási-Albert scale-free networks. We refer the reader to the original paper for more information.
|Enron Email Data Set||0.25||0.2||0.1||0.05||0.025||0.017||0.01||0.009||0.007|
Since their emergence about a decade ago, social networks have rapidly grown and infiltrated every aspect of our daily lives. With rapidly expanding reliance on their platforms, social networks like Facebook and Twitter are becoming a goldmine of personal information and user behavior data which makes the study of these networks of prime importance. The valuable information stored within these platforms makes them the target of malicious entities which try to compromise the privacy of the users which may further lead to unwanted disclosure of the sensitive attributes of the network.
In this chapter, we have reviewed a novel privacy measure that quantifies the resistance of a large social network against a privacy violating attack. We reviewed some efficient algorithms to compute this measure in social graph and revisited the privacy violation properties in 8 real-world networks. The current theoretical and empirical results for -anonymity pave the way for further investigation of this measure, as well as addressing its shortcomings and limitations.
We thank Ismael G. Yero for useful discussions. This research was partially supported by NSF grants IIS-1160995 and IIS-1814931.
- These authors contributed equally.