4. Privacy
The Dutch Telecommunications Act requires the data leaving the telecom company to be anonymous. In this context, the Dutch Data Protection Authority (DPA) referred to a scientific article [2] describing a method for disclosing the trajectories of mobile phones from aggregated telephone data. This chapter discusses the method described in that article in more detail and answers the question to what degree it affects the anonymity of the data.
The starting point is the dataset that is delivered to CBS by the telecom provider. These are aggregated data containing estimated counts of the devices located in a given municipality from a given municipality of residence at a given hour (the so-called flow cube). This chapter is organised as follows:
- In section 4.1, the inverse problem is explained, as well as how it is introduced in the process. It is examined to what extent it is technically and mathematically possible, on the basis of the latest knowledge and despite all precautions, to trace the data back directly to an individual. In this section the anonymity is demonstrated mathematically. In this context, an opinion put forward by the DPA referred to a scientific article [3] [4] from China which describes a method to trace the trajectories of mobile phones from aggregated telephone data. The method cited in the article to "crack" aggregated outcomes back into traceable outcomes is fairly novel, but it also runs into this fundamental inverse problem. The article describes a complex method; extensive mathematical analysis is required to interpret it correctly. In section 4.1.3, as a supplement to the mathematical approach, these Chinese articles are discussed in more detail, including why their method cannot work against the method CBS used in the pilot.
- Section 4.2 discusses the hypothetical case in which it is assumed that the attacker (this can be someone within CBS itself, or an external hacker with access to data from, for example, a "big tech" company) has a very large amount of information (the most extreme scenario). Is it then possible to determine the location of someone's device by means of the flow cube? This concerns an advanced attack that goes beyond the scope of the scientific article from China discussed in section 4.1.
- Finally, the phenomenon of group disclosure is described in section 4.3. It concerns the possibility of identifying groups from the output data and what safeguards are taken to limit the risk of this as much as possible. Strictly speaking, this has nothing to do with the anonymity of the individual. On the contrary, monitoring groups is an important application for combating COVID-19. Nevertheless, it is important to pay attention to this as well.
4.1 Basic mathematical principles for anonymity according to the CBS method
4.1.1 The classic inverse problem
The basic mathematical principle applied by CBS for data security comes from the field of algebra.
"When a consistent system of independent linear equations over the real numbers has a smaller number of equations than unknowns, the solution set is infinitely large.
For example, if one adds together five numbers and publishes only the sum, it is impossible to retrieve the original numbers. This principle is the main reason why the data provided by the telecom providers are anonymous. Thus, mathematically speaking, no reverse calculations can be made, whichever method or technique is used.
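As a minimal numerical sketch of this principle (the values are invented and purely illustrative), consider a single published sum over five hidden numbers:

```python
import numpy as np

# One published aggregate: the sum of five hidden individual values.
published_sum = 42

# Infinitely many different inputs are consistent with the single equation
# x1 + x2 + x3 + x4 + x5 = 42. Two of them:
candidate_a = np.array([10, 5, 7, 12, 8])
candidate_b = np.array([42, 0, 0, 0, 0])

assert candidate_a.sum() == published_sum
assert candidate_b.sum() == published_sum

# With one equation and five unknowns, the published sum alone gives no way
# to decide which candidate (or any of the infinitely many others) was the
# true input: the system is under-determined.
```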
The relationship between individual objects, such as mobile phones or similar electronic devices, and a measurement such as a count of those objects is a mathematical equation. The most important and irreversible procedure in every step of the processing chain is therefore to deliberately not carry out certain counts. This irreversible destruction of information ensures that none of the partial steps can be reversed to lead back to an individual device. This applies a fortiori to the process as a whole: from signalling data to the aggregated, anonymised dataset provided by the telecom company to CBS.
4.1.2 The inverse problem that arises in the process
For a precise and complete description of the process steps, we refer to chapter 3 and the technical description [1]. In this section, a number of specific steps that are important for arriving at anonymous telephone data are discussed.
Steps 1 to 6: Processing by the telecom company (takes place within the infrastructure of the telecom company)
Every 30 days, all pseudonymised data are re-pseudonymised at the operator without a key being generated (and therefore without a key that could be stored). The double-pseudonymised data contain links between a mast and a device at the moment a contact signal or data is exchanged. The links between mast and device are then processed further. There are two reasons for this:
- Content-related: whether a link is established at a certain time between an antenna and a device depends not only on physical proximity but also on the signal strength, which is not uniform around an antenna, and on the signal processing capacity available at the antenna. Within a time interval of one hour, devices often exchange signals with several masts in random order. It is therefore statistically sounder to weight each individual mast-device link over a number of masts.
- Security-related: to ensure anonymity. Because links are divided over a number of antennas, and the fractional links are then added up per mast, the 1-to-1 relationship between an individual antenna and a device is irretrievably disrupted.
The addition per mast must always be weighted across municipalities, because the coverage area of each mast spans several municipalities. There is therefore no unique, reversible relationship between the counts per area or mast and the number of devices involved per area or mast: there are more unknowns than there are measurements, so the basic mathematical principle applies and the unknowns can no longer be determined. Information is deliberately destroyed to prevent identification. In chapter 3, the operation of the location estimation algorithm is described as follows:
- ".... The location is therefore not very precise: statistical noise is added in this step. An additional effect of this method is that numbers of devices are "dispersed" in space and thus redistributed. Also in the time dimension there is a dispersion. The model works in batches of one hour. If there is only one measurement in a particular one hour time span, this data is extrapolated over other minutes within that hour. There is no spread between the hours as a unit of time. The dispersion in space and time causes an extra noise that changes composition every hour. This means that the relationship between the counts per area and the counts per mast is a system of linear equations, with fewer equations than unknowns and therefore not uniquely invertible.
- We illustrate this with an example. Suppose we estimate probabilities of 0.5, 0.3 and 0.2 for device 1 in the hour 8:00-9:00 AM across municipalities A, B and C. We thus consider the probability of device 1 being in municipality A to be the greatest, but we leave this probability distribution as it is - we do not directly assign the device to A. Suppose that for another device 2 in the same hour we estimate a probability distribution over municipalities A, B and C of 0, 1 and 0. In other words, we think that the device was in municipality B during that entire hour. We then add up the probability distributions for both devices: 0.5, 1.3 and 0.2. So we estimate that there were 1.3 devices in municipality B."
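The fractional counting in this example can be sketched as follows (illustrative values only; this is not the production code of the process):

```python
# Estimated probability distributions over municipalities A, B and C for two
# devices in the hour 8:00-9:00, as in the example above.
device_1 = {"A": 0.5, "B": 0.3, "C": 0.2}
device_2 = {"A": 0.0, "B": 1.0, "C": 0.0}

# The figure per municipality is the sum of the fractions, not an assignment
# of whole devices to a single municipality.
counts = {m: device_1[m] + device_2[m] for m in ("A", "B", "C")}
print(counts)  # {'A': 0.5, 'B': 1.3, 'C': 0.2}

# Many different underlying probability distributions are compatible with the
# aggregate (0.5, 1.3, 0.2), so the individual distributions, let alone the
# true locations, cannot be recovered from it.
```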
In addition, it should be noted that the range of antennas varies greatly, from 5 metres to 60 kilometres, often across municipal boundaries. The angle of an antenna can also vary from 2 to 360 degrees. As prescribed, the telecom company applies the rule that if the counts in a certain area fall below the lower limit of 15 within a certain period of time, those data are suppressed and thus destroyed. The total number of devices per time interval in the final dataset that reaches CBS fluctuates greatly. After all, if a device has not connected to the network for a telephone call or text message within a given hour, there is no registration for that device for that hour, and that device is not counted in the dataset provided to CBS for that hour. The population is therefore not constant in size and certainly not in composition. The fluctuations are amplified by the suppression of small counts described above.
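The suppression rule can be sketched as follows (a simplified illustration with an invented table; the actual rule is applied within the telecom company's infrastructure as described above):

```python
# Any cell with fewer than 15 estimated devices is removed from the table
# entirely, so the information is destroyed rather than merely hidden.
THRESHOLD = 15

def suppress(flow_counts: dict) -> dict:
    """Keep only cells with at least THRESHOLD estimated devices."""
    return {cell: n for cell, n in flow_counts.items() if n >= THRESHOLD}

# Invented example: (municipality of residence, municipality of presence) -> count
hourly_counts = {("Groningen", "Haren"): 7, ("Groningen", "Assen"): 120}
print(suppress(hourly_counts))  # only the ('Groningen', 'Assen') cell survives
```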
It is important to mention this here because, for example, the paper [3] has as an essential requirement that the population of devices is constant in composition and size.
When these datasets arrive at CBS, several irreversible operations have been carried out that have destroyed information, so that even with the data and resources available to CBS itself, identification is no longer possible. The data received by CBS is therefore anonymous.
Steps 7 to 9: Statistical output
The counts made (per municipality) have a complicated and non-reversible relationship with the number of devices counted. The number of unknowns is again greater than the number of equations: the counts at municipality level. The "standard" statistical disclosure control for CBS publications is applied here once again. As an extra precaution, the figures to be published are converted to relative values. In this step, information is once again deliberately destroyed in order to make it impossible for an external party even to reconstruct counts per mast, let alone trace the movements of individuals.
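One possible reading of this extra precaution is sketched below: absolute counts are replaced by shares, so the published figures no longer reveal absolute numbers of devices. The exact normalisation used by CBS is not specified here, so the sketch is purely illustrative and the values are invented:

```python
# Invented municipality counts; only the relative values would be published.
counts = {"Amsterdam": 812_000, "Utrecht": 264_000, "Eindhoven": 187_000}

total = sum(counts.values())
relative = {town: round(100 * n / total, 1) for town, n in counts.items()}
print(relative)  # shares in percent; the absolute level is deliberately dropped
```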
4.1.3 Discussion of academic publications
Sections 4.1.1 and 4.1.2 explained how identification is rendered impossible because it would violate the aforementioned basic mathematical principle. In addition to the paper [3] (quoted in the introduction), other papers [4], [5] have been published which describe techniques aimed at disclosing information about the carrier or owner of a device based on the path of that device. There is significant academic interest in this topic, so many more papers in this field have been or will be published. It is not feasible to discuss all existing literature in this area individually.
However, in order to work, all such methods must be applied to data from a process in which the aforementioned basic mathematical principle does not hold, explicitly or implicitly. In other words, direct back-calculations that reveal individuals are not possible. These papers are mostly aimed at approximating the individual with a certain probability.
Most methods turn out not to be applicable to outcomes such as those produced by the method described above. The design of the method therefore renders these cracking methods inapplicable. In some of the published papers it is assumed that 'tracks' are available, in which the successive antennas to which a device connected are stored and processed as a whole.
The fundamental assumption behind these methods is that the 1-to-1 relationship between device and mast or area is preserved. This assumption is broken by the method used by the telecom provider in step 1 (and step 3). Therefore, the conclusions of any paper relying on such tracks do not apply to the method used by CBS, of which the MobLoc module forms an essential part. In addition, these types of methods still need to leverage additional external identifying information so that a particular track of a device can be linked to a person.
The data generated by the CBS method are of a completely different type than the data in the paper [3]. The most important assumption in the paper [3] is that exactly the same devices are present in the dataset in consecutive time intervals, and moreover that each device is connected 1-to-1 to an antenna. This 1-to-1 connection of antenna and device is prevented in step 1 of the process discussed above.
It is unclear whether the procedure described in the paper can be adapted to this method; this would require additional research. In the article a relaxation of the algorithm is applied, based on a number of assumptions, to make the calculation feasible. It is likely that a relaxation for the situation with millions of devices that should lead to an acceptable calculation time will also lead to a reduction in accuracy, resulting in more "incorrect" routes being found. Even if the computing power is available, the accuracy required for identification cannot be achieved.
The paper [3] also uses tracks, but does not take them as a starting point. Instead, counts per antenna are linked across consecutive time intervals so that tracks can be constructed. Model assumptions are used here for typical device movements (patterns); after all, many people are known to have the same patterns of movement behaviour. To solve this assignment problem, the so-called 'Hungarian algorithm' is used. There are several assumptions inherent to the method as described in that paper, as a result of which its conclusions cannot apply to the process steps followed by Statistics Netherlands. The number of devices in the paper [3] is many times smaller than in the data on which the analyses in this pilot are based (100 thousand versus 21 million), while the number of locations for which aggregates are determined in that paper is much larger than in the data (at municipal level only) that will be provided to CBS (8,600 versus 400). The difference between the number of equations to solve and the number of unknowns is therefore many times greater for the data provided to CBS than is the case in the paper; this difference amounts to an order of magnitude of 1·10^18. This makes it impossible to run this procedure with the computing power currently available to CBS. As far as we know, the procedure described in the article could only be performed on such a dataset by four supercomputers in the world (exa-scale computing).
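To make the role of the Hungarian algorithm concrete, the toy sketch below matches three devices between two consecutive hours by minimising total travel distance, using the standard linear-sum-assignment implementation in SciPy. All positions are invented; the point is that the method presupposes exact, individual positions for the same set of devices in both hours, which the aggregated, fractional counts delivered to CBS do not provide, and that the cost matrix would become astronomically large for millions of devices:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Invented exact positions of three devices in hour t and hour t+1.
locations_t  = np.array([[0.0, 0.0], [5.0, 1.0], [2.0, 8.0]])
locations_t1 = np.array([[0.2, 0.1], [2.1, 7.8], [5.1, 0.9]])

# Cost matrix: distance between every hour-t position and every hour-t+1 position.
cost = np.linalg.norm(locations_t[:, None, :] - locations_t1[None, :, :], axis=2)

# Hungarian algorithm: the assignment with minimal total cost.
rows, cols = linear_sum_assignment(cost)
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # [(0, 0), (1, 2), (2, 1)]
```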
Let us look at the following thought experiment relating to the cracking method from Tu's article [3]. For the sake of convenience, we assume that a device can only be located in one municipality during one hour 1). Consider the trajectories of two devices with the same home municipality, and swap the two municipalities in their trajectories for a certain hour 2).
In that case, the aggregated flow cubes 3) before and after the exchange are identical. This simple example already shows that multiple sets of path data can lead to the same flow cube after aggregation. However, it does not follow from this argument whether, and to what extent, the flow cube can contribute to the identification of individuals. It could be that some path combinations are more plausible than others. An attacker could add this plausibility information to the problem in order to make a better estimate of the original paths. So suppose an attacker has extra data (such as general mobility patterns, or data on a specific individual) in addition to the flow cube. To what extent is it then possible to solve parts of the equations after all?
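The thought experiment can be reproduced in a few lines (municipality labels and hours are invented; each trajectory simply maps an hour to a municipality):

```python
from collections import Counter

# Two devices with the same home municipality "A".
trajectories_1 = {
    "device_1": {8: "A", 9: "B", 10: "A"},
    "device_2": {8: "A", 9: "C", 10: "A"},
}
# The same trajectories with the hour-9 municipalities swapped.
trajectories_2 = {
    "device_1": {8: "A", 9: "C", 10: "A"},
    "device_2": {8: "A", 9: "B", 10: "A"},
}

def flow_cube(trajectories, home="A"):
    """Aggregate to counts per (municipality of residence, municipality, hour)."""
    return Counter((home, municipality, hour)
                   for path in trajectories.values()
                   for hour, municipality in path.items())

# Both sets of paths aggregate to exactly the same flow cube.
assert flow_cube(trajectories_1) == flow_cube(trajectories_2)
```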
The assumption that all devices are present in the data at all times is an idealised situation that does not occur in real datasets, such as the dataset to be provided to Statistics Netherlands. However, this assumption is crucial for the Hungarian algorithm (an optimisation method) as described in [3] to work. As soon as even one device disappears from or appears in a time interval, that algorithm cannot be applied as such. In the paper [3], therefore, devices are either omitted from the population entirely or interpolated. The academic method described in the paper [3] is therefore not applicable to the dataset that CBS processed in the pilot.
The paper [3] also indicates that extra information is needed to link a reconstructed trajectory to a recognisable individual. In the paper, a reconstructed trajectory is "unique" if it can be linked to a unique trajectory within the "ground truth" on the basis of the n top locations. It is then assumed that a unique trajectory within the "ground truth" can be linked to a unique person in the population. For this last step extra information is needed, which also has to be uniquely linked to a recognisable person.
The paper [3] also indicates that, in general, every algorithm or method that can identify a unique device has to assume that every step in the processing of the raw telecom data is uniquely invertible. The basic mathematical principle, on the contrary, ensures that each processing step is not invertible, so that disclosure is not merely unlikely or very computationally intensive, but mathematically impossible. The next question is whether an attacker with additional information can nevertheless approximate the truth. Section 4.2 discusses this in more detail by means of another, extreme attack scenario.
For the sake of completeness, several other critical points regarding this paper [3] are listed below:
- The N > 15 rule eliminates all unique routes that could otherwise quickly be composed from low counts.
- The uniqueness of a route is not defined as such in the article; it only describes the percentage of unique routes found on the basis of the Top-K (most visited) locations: "…uniqueness measures the possibility that he can uniquely distinguish victims' recovered trajectories. Therefore, we define the uniqueness as the percentage of recovered trajectories that can be uniquely distinguished by their most frequent k locations (Top-K)". Subsequently, external information is needed to link such a trajectory to an individual: "Therefore, the results indicate that the recovered trajectories are very unique and vulnerable to be reidentified with little external information." What external information is needed is not mentioned in the article, but it is plausible that the authors mean that one knows which Top-K locations belong to a person. Whether it really is the right person (is that person the only one with those Top-K locations?) is not investigated, because it is assumed that a person/device can be identified by a unique Top-K. So a person with a given Top-K of locations fits the route, but that should not be reversed into the idea that the route can be linked to a unique person, precisely because in the methods described in chapter 3 the areas are very large (municipalities, not cells). The question is therefore whether establishing this link is justified.
- In the study, cells ("area covered by base station") are used as geographical units; we use municipalities instead of cells. This makes it much more difficult to link a route to a device. The article mentions about 8,000 base stations, which means that 511,808,016,000 unique variants of a top 3 of most visited base stations are possible. In our dataset with 355 municipalities there are 44,361,510 unique variants of a top 3 of most visited municipalities per person, a factor of approximately 11,500 fewer combinations with which to link a top 3 to a device (see the short calculation after this list). In our case, this smaller number of unique top 3 variants is moreover distributed over 2 to 4 million devices instead of 100 thousand. This makes the "uniqueness" of a top 3 combination ("being different from others", as mentioned in the article) 200,000 to 400,000 times smaller.
- The "recovery accuracy" will be many times lower because not 100,000 but 4 million devices are in the dataset (see Figure 11 a for decrease accuracy with an increase in the number of devices). This makes it very likely that if a route with the correct TopK locations can be linked to a person, the route will not be correct. This does not enrich the knowledge about that person.
4.2 Attack scenario
In section 4.1 it has been shown that solving equations does not work to reveal individuals. But what if someone, an attacker or hacker, has additional sources of information at his disposal? What are the consequences then? Suppose the attacker knows the locations of all devices except one during period t (of one hour). In the periods before and after, the attacker also knows where this one device is. The purpose of the attacker is to find out where the device was during period t. The attacker knows the location of the device in t-1 and in t+1, and given the maximum travel speed of the device, a number of areas remain in which the device may have been in the meantime. The question now is to what extent the flow cube can help the attacker with this problem. Let us assume that there is more than one possible area at time t, otherwise the attacker does not need the flow cube anyway: in that case the data do not reveal anything the attacker did not already know.
Since the attacker knows the locations of all other devices, the location should be easy to find: he takes the numbers from the flow cube; subtracts the known numbers from it and if all goes well, there is one area left where a difference of one remains. However, there are a number of characteristics of the signalling data and the methodology used to produce the flow cube that make this calculation much more difficult to make than one would initially think. This has the following causes:
- Not all devices show activity. So although the locations of all devices can be traced on the basis of the source dataset, it is not known which of them were actually active during period t. To be safe, it is assumed that the device whose location is sought was active and that this is known to the attacker. As can be seen in Figure 4.1, approximately 20-25 percent of the devices are inactive in any given hour during the day. On average, 78% are active at 7:00 in the morning.
- The attacker knows where the devices are, but cannot be sure to which area the devices are assigned. Allocation of devices to areas is done via the transmission tower 4). Many cell towers provide coverage in multiple areas. Using a model, for a given transmission tower, one device is divided over all areas in which the antenna provides coverage (the fractions add up to one). The attacker also does not know to which antenna a device has been connected (if connected at all; see the previous point).
For many antennas, a connection to that antenna cannot be translated directly to a municipality. So even if the attacker has data about the actual location of each device, there is uncertainty, because of the estimation method, about the municipality in which this device will be counted. When a transmission mast serves more than one municipality, a device is counted in weighted fractions in all areas where the antenna provides coverage. For example, a value of 100 devices in the data can consist of 200 contributions of 0.5 each.
- For areas where fewer than 15 devices are estimated, the value is not present in the flow cube. So there are many areas for which the calculation is not possible anyway. The number of cells that is suppressed is highly time-dependent: at night more table cells are affected but far fewer devices, and vice versa.
Each of the above points makes it impossible for the attacker to determine the location of the device with a reasonable degree of certainty. Assume that the estimate is 0 (zero) devices in an area. Approximately 80% of the devices are observed. The number of observed devices will therefore follow a binomial distribution (with p = 0.8). This means that with 90% probability there may nevertheless be one device in that area. If more devices are estimated, the probability interval only increases. If 10 devices are observed, up to 16 devices can actually be in the area. Since the uncertainty in the number of actual devices in the area is already greater than or equal to one, determining the location of one device becomes very difficult. This example is elaborated further in Appendix 1, as the known numbers of devices do provide additional information.
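A rough sketch of this uncertainty argument, assuming that the observed count follows a binomial distribution with p = 0.8 around the true number of devices (the 5% cut-off for "implausible" is an illustrative choice, not part of the method):

```python
from scipy.stats import binom

P_ACTIVE = 0.8  # roughly 80% of devices are active (observed) in a given hour

def plausible_true_counts(observed, p=P_ACTIVE, cutoff=0.05, n_max=100):
    """True device counts n for which observing `observed` or fewer is not implausible."""
    return [n for n in range(observed, n_max + 1)
            if binom.cdf(observed, n, p) > cutoff]

print(plausible_true_counts(0))   # [0, 1]: one device may be present even if none is observed
print(plausible_true_counts(10))  # roughly 10 ... 16 true devices remain plausible
```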
The effect of point 2 varies greatly per area. In urban areas there are more antennas, so antennas less often cover a large area, while in rural areas the masts often cover larger areas. The problem also becomes more complex when many antennas overlap in coverage area. However, exact figures are not available at this time. As indicated in point 3, the figures for many areas are suppressed. Nothing is known about these areas other than that the estimated number of devices is less than 15.
4.3 Group disclosure and nuance of the N > 15 rule in step 6
The output provided by the telecom company contains a table in which each cell contains more than 15 estimated devices. It is important to understand that these are not real, direct counts; they are Bayesian estimates. So even an outcome based on one device is in fact not unique and is therefore anonymous. If an identification does occur, it is purely by chance, and disclosure is not certain. However, an incorrect disclosure that is nevertheless interpreted as correct is also undesirable; that is the main reason why a lower limit has been set.
Take, as an example, a bus of students that travels between 2:00 and 3:00 a.m. from the municipality of Groningen to another municipality. If more than 15 of these students have the same provider, are all active in the same municipality in the same hour (i.e. there are events for those devices), happen to be assigned by the geolocation algorithm to the same municipality, and there is no other movement of devices with the same municipality of origin (Groningen) to that municipality, then such a group can be recognised in the data provided to Statistics Netherlands.
That looks like disclosure, but in fact one already knew to look for this group: agenda information (e.g. a programme publication on the internet) was needed. So in this case no additional information was revealed from this data source, but there was merely a one-time recognition (or validation) of the results. It is also not possible to recognise individuals within that group. However, certain characteristics can be attributed to this group, such as the fact that it travelled via another municipality.
A higher limit N would reduce the chance of group disclosure even further, but only within the CBS infrastructure. The limit of N > 15, though, has been carefully investigated. Efforts have been made to minimise the data with respect to the applications; however, the data must remain usable. Figures 4 and 5 show that further increasing the limit in step 6 leads to outcomes that are ultimately useless. Figure 4.2 shows that the dynamics/flows of the population (green line) disappear as the threshold increases, while this information is precisely what is relevant for producing statistics.
Knowing that under very rare circumstances there may be a risk of group disclosure, various safeguards have been put in place to prevent this information from being used to identify groups of persons in the data provided to CBS. The MNO is prohibited by law from providing identifiable information about individuals, and CBS is prohibited by law from disclosing identifiable information to the public. This prohibition is also supported by organisational and technical measures. For example, access to these data is restricted at CBS: only a few CBS employees can access them.
2) In the scientific article trajectories are compiled from the data. The method used in the pilot cannot be used to compile routes, but it can be assumed that from the hourly counts a route can perhaps be compiled.
3) This refers to an origin destination table containing counts in time. For example, at noon in Amsterdam there are 300,000 Amsterdam residents and 500,000 people from surrounding municipalities. By putting the hours next to each other, one can observe the "flow" in the counts. The third dimension is then time, hence the name flow cube, an origin-destination matrix in time.
4) So although the attacker knows the locations of all other devices, subtracting the known numbers from the flow cube numbers will result in wrong (and possibly negative) numbers. The attacker would therefore also need to know which other devices were actually active.