(please read also updates at the end and the description of the April 10 version)
Here I describe what I have understood, on the basis of the documents published until yesterday, about how the proposed solution for fighting the spread of COVID-19 through privacy-preserving tracing of proximity contacts works.
In my previous post I discussed some of the non-technical issues of technical solutions based on contact tracing. I add here only one consideration that I would like to stress (and that was also stressed in many documents cited in my previous post, most notably the Algorithm Watch one). Knowing whom someone has been in contact with during the days leading up to the day they were found infected is only useful if the diagnosis arrives early enough to notify their contacts BEFORE they come into contact with other people. If an average person comes in contact with 10 others during a day (note that being in contact does not mean being close and chatting, but having been in the same office, shop, premises, means of transport, etc.), an infected person generates 100 potentially infected people by the second day, 1,000 by the third and 10,000 by the fourth. If one realizes only today that a person was infected 5 days ago, then one has to test 10,000 people TODAY (or ask 10,000 people to self-isolate) to avoid having 100,000 potentially infected people tomorrow. This is not impossible if a country is prepared, but it is not something that can be arranged in a few days or weeks. It is no coincidence that in China, where this approach has been used successfully, they have been working on it since the SARS epidemic of 2003.
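The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the simplified model used in the text: the function name potential_contacts is mine, and the model assumes that every person meets 10 new people each day with no overlap between groups, so it gives an upper bound rather than a prediction.

```python
def potential_contacts(contacts_per_day: int, day: int) -> int:
    """Upper bound on potentially infected people at a given day.

    Assumes (as in the text) that each person meets `contacts_per_day`
    new people every day and that the groups never overlap.
    """
    return contacts_per_day ** day


# Growth over the first five days with 10 contacts per day.
for day in range(1, 6):
    print(f"day {day}: {potential_contacts(10, day):,} potentially infected")
```

Each extra day of delay in the diagnosis multiplies the number of people to be tested or isolated by the daily number of contacts.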
The basic idea of
Let's now discuss the most relevant technical aspects, as I’ve understood them and assuming my understanding is correct. I’ll gladly correct any mistake I might have made. For space reasons I do not discuss all the details. The reader wishing to know more can consult their technical documentation.
Upon installation, the app generates a key, which is a 32-bit random number SK0, and registers with a central server using an ID that is different from SK0. The server uses this ID only to contact all registered apps for periodic updates about infected people. Each app renews its key every day, calculating a new 32-bit number SKt+1 by applying a predefined hash function H1 to SKt. The use of H1 ensures that, given the number used on one day, it is not possible to calculate the one used the day before. The app stores the last 14 SKt used, because 14 days are considered necessary for the infection to develop. This value can be changed if health authorities deem it appropriate.
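The daily key renewal can be sketched as a hash chain. This is a minimal illustration under my own assumptions, not the app's actual code: I use SHA-256 as a stand-in for H1 (so keys here are 32 bytes long, which is simply the size of that hash's output), and the names h1, new_initial_key and rotate are hypothetical.

```python
import hashlib
import secrets
from typing import List

KEY_BYTES = 32        # assumption: size of a SHA-256 digest
RETENTION_DAYS = 14   # value stated in the text, adjustable by health authorities


def h1(sk: bytes) -> bytes:
    """Stand-in for H1: one-way, so SKt cannot be computed from SKt+1."""
    return hashlib.sha256(sk).digest()


def new_initial_key() -> bytes:
    """Random SK0 generated when the app is installed."""
    return secrets.token_bytes(KEY_BYTES)


def rotate(keys: List[bytes]) -> List[bytes]:
    """Append the next day's key and keep only the last RETENTION_DAYS keys."""
    return (keys + [h1(keys[-1])])[-RETENTION_DAYS:]


keys = [new_initial_key()]
for _ in range(20):          # simulate 20 daily renewals
    keys = rotate(keys)
print(len(keys))             # only the most recent keys are kept
```

Anyone holding SKt can walk the chain forward to SKt+1, SKt+2 and so on, but not backward, which is the property the text attributes to H1.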
Each day the app uses SKt to generate, through a different hash function H2, a certain number of "ephemeral identifiers" (EphIDs) that are used only on that day to record that day's contacts. When two phones detect, through Bluetooth, that they are close to each other, the two apps exchange their EphIDs. Each received EphID is stored locally together with a coarse time indication (the documentation is not explicit, let's say a 4-hour time window), the strength of the Bluetooth signal received (which is an indicator of the distance between the devices) and the time spent in proximity. Again, the use of H2 does not allow one to derive from received EphIDs the SKt used to generate them, not even knowing all the EphIDs used during the same day by the same device, nor does it allow one to tell whether two EphIDs received on the same day were generated from the same SKt.
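The generation of the EphIDs and the local record kept for each contact can be sketched as follows. I assume, purely for illustration, that H2 behaves like HMAC-SHA256 keyed with SKt and applied to a counter; the number of EphIDs per day, their length, and the field names of the stored record are also my assumptions.

```python
import hashlib
import hmac
from typing import Dict, List

EPHIDS_PER_DAY = 96   # assumption: one identifier every 15 minutes
EPHID_BYTES = 16      # assumption: identifier length


def h2(sk_t: bytes, n: int = EPHIDS_PER_DAY) -> List[bytes]:
    """Stand-in for H2: derive the day's EphIDs from SKt.

    Without SKt the outputs look unrelated to each other and
    do not reveal the key that produced them.
    """
    return [
        hmac.new(sk_t, i.to_bytes(4, "big"), hashlib.sha256).digest()[:EPHID_BYTES]
        for i in range(n)
    ]


def record_contact(log: List[Dict], ephid: bytes, day: int,
                   window: int, rssi_dbm: int, duration_s: int) -> None:
    """Store a received EphID with coarse time, signal strength and duration."""
    log.append({"ephid": ephid, "day": day, "window": window,
                "rssi_dbm": rssi_dbm, "duration_s": duration_s})


# Example: phone B records one EphID broadcast by phone A.
sk_a = b"\x01" * 32
contact_log: List[Dict] = []
record_contact(contact_log, h2(sk_a)[0], day=3, window=2,
               rssi_dbm=-62, duration_s=420)
print(contact_log[0]["ephid"].hex())
```

Note that the record contains no information about who the other phone belongs to or where the contact happened.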
Every day the app erases data older than 14 days, both the SKt and the EphIDs of the phones it has been close to, together with the data associated with the received EphIDs.
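The daily clean-up can be sketched in a few lines. The data layout (a dictionary of keys indexed by day number, and a list of contact records carrying a "day" field) is a hypothetical one chosen for this example.

```python
from typing import Dict, List, Tuple

RETENTION_DAYS = 14  # value stated in the text


def prune(keys_by_day: Dict[int, bytes], contacts: List[Dict],
          today: int) -> Tuple[Dict[int, bytes], List[Dict]]:
    """Drop own keys and received records that fall outside the retention window."""
    cutoff = today - RETENTION_DAYS
    kept_keys = {d: k for d, k in keys_by_day.items() if d > cutoff}
    kept_contacts = [c for c in contacts if c["day"] > cutoff]
    return kept_keys, kept_contacts


keys = {day: bytes([day]) * 32 for day in range(21)}        # days 0..20
contacts = [{"ephid": b"x" * 16, "day": day} for day in (2, 9, 20)]
keys, contacts = prune(keys, contacts, today=20)
print(sorted(keys), [c["day"] for c in contacts])
```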
When a person is found to be infected by a medical test, he receives an authorization code from the health authority, which he can use, if he chooses to do so, to alert the central server. In that case, the app uses this authorization code to securely send the SKt used on the presumed initial day of the infection, let's say that of 14 days ago or the oldest on record. To prevent the mere occurrence of this communication from disclosing to external observers that the person is infected, it takes place as one of the communications the app sends to the server several times a day at regular intervals. They are indistinguishable from an external viewpoint, but only one of them contains the SKt. When the app communicates to the central server that the device owner is infected, it simultaneously generates a new initial key, that is, a new 32-bit random SK0 number, which will be used from that moment on in the contact tracing mechanism.
When the server receives an infection report, the SKt received is broadcast, as part of the periodic updates, to all smartphones that have the app installed. For each received SKt, corresponding to the device of an infected person, each app then locally calculates both the EphIDs of that day (using the hash function H2) and the SKt of the subsequent days up to the current day (using the hash function H1), together with their corresponding EphIDs. All the EphIDs calculated locally in this way are looked up in the stored data to check whether the device has been in contact with that of an infected person. If so, on the basis of the data (distance and duration) stored with the EphIDs, an algorithm in the app (using parameters that can be adjusted by the health authority) can calculate a risk index for the smartphone owner, who is then advised on what to do on the basis of this index.
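The local matching step can be sketched as below. As before, SHA-256 and HMAC-SHA256 are stand-ins I chose for H1 and H2, and the record fields are hypothetical. The risk index in particular is invented for the example (total seconds spent near an infected device with a strong enough signal); the real algorithm and its parameters are set by the health authority.

```python
import hashlib
import hmac
from typing import Dict, List


def h1(sk: bytes) -> bytes:
    """Stand-in for H1 (next day's key)."""
    return hashlib.sha256(sk).digest()


def h2(sk_t: bytes, n: int = 96) -> List[bytes]:
    """Stand-in for H2 (the day's EphIDs); n and the 16-byte length are assumptions."""
    return [hmac.new(sk_t, i.to_bytes(4, "big"), hashlib.sha256).digest()[:16]
            for i in range(n)]


def ephids_of_infected(sk_start: bytes, start_day: int, today: int) -> Dict[bytes, int]:
    """Re-create every EphID from the broadcast key up to today, mapped to its day."""
    derived: Dict[bytes, int] = {}
    sk = sk_start
    for day in range(start_day, today + 1):
        for ephid in h2(sk):
            derived[ephid] = day
        sk = h1(sk)
    return derived


def matching_contacts(contact_log: List[Dict], sk_start: bytes,
                      start_day: int, today: int) -> List[Dict]:
    """Return the stored records whose EphID belongs to the infected device."""
    derived = ephids_of_infected(sk_start, start_day, today)
    return [c for c in contact_log if derived.get(c["ephid"]) == c["day"]]


def risk_index(matches: List[Dict], min_rssi_dbm: int = -70) -> int:
    """Hypothetical index: seconds of exposure with a sufficiently strong signal."""
    return sum(c["duration_s"] for c in matches if c["rssi_dbm"] >= min_rssi_dbm)


# Example: the log holds one EphID of the infected device (day 11) and one unrelated.
sk_day10 = b"\x07" * 32
log = [
    {"ephid": h2(h1(sk_day10))[3], "day": 11, "rssi_dbm": -60, "duration_s": 600},
    {"ephid": b"\x00" * 16, "day": 11, "rssi_dbm": -60, "duration_s": 900},
]
matches = matching_contacts(log, sk_day10, start_day=10, today=12)
print(len(matches), risk_index(matches))
```

All the computation happens on the phone: the server never learns which devices found a match.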
Distributing the SKt of devices owned by infected people is, in my opinion, a weak point with respect to privacy. In fact, when an app receives the SKt of an infected person, by calculating through H1 the SKt of the following days and through H2 the corresponding EphIDs, it could discover the identity of that person, because it can relate different EphIDs, which are otherwise unrelated, to the same original SKt. This is not difficult to do, especially in the case of people who are met regularly or with whom, perhaps for work reasons, meetings have been recorded in one’s own agenda. The server should instead broadcast the EphIDs directly, given that the increase in transmitted data does not appear to significantly worsen overall performance.
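The alternative I am suggesting, in which the server expands each reported SKt and broadcasts only the resulting EphIDs, could look like the sketch below. It is an illustration of my proposal and not part of the protocol: the hash functions are the same stand-ins (SHA-256 for H1, HMAC-SHA256 for H2), the sizes (96 EphIDs of 16 bytes per day) are assumptions, and the function names are hypothetical.

```python
import hashlib
import hmac
import random
from typing import List, Tuple

EPHIDS_PER_DAY = 96   # assumption
EPHID_BYTES = 16      # assumption


def h1(sk: bytes) -> bytes:
    """Stand-in for H1 (next day's key)."""
    return hashlib.sha256(sk).digest()


def h2(sk_t: bytes) -> List[bytes]:
    """Stand-in for H2 (the day's EphIDs)."""
    return [hmac.new(sk_t, i.to_bytes(4, "big"), hashlib.sha256).digest()[:EPHID_BYTES]
            for i in range(EPHIDS_PER_DAY)]


def expand_report(sk_start: bytes, days: int) -> List[bytes]:
    """Server side: derive all EphIDs covered by one infection report."""
    ephids: List[bytes] = []
    sk = sk_start
    for _ in range(days):
        ephids.extend(h2(sk))
        sk = h1(sk)
    return ephids


def broadcast_payload(reports: List[Tuple[bytes, int]]) -> List[bytes]:
    """Mix the EphIDs of all reports so those of one device are not adjacent."""
    payload = [e for sk, days in reports for e in expand_report(sk, days)]
    random.SystemRandom().shuffle(payload)
    return payload


payload = broadcast_payload([(b"\x01" * 32, 14), (b"\x02" * 32, 3)])
print(len(payload), "EphIDs,", len(payload) * EPHID_BYTES, "bytes")
```

A receiving app would only need to check its stored EphIDs against this list, without being able to tell which of them come from the same device. Under the sizes assumed here, a 14-day report grows from one 32-byte key to 14 × 96 × 16 = 21,504 bytes.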
In addition, the central server stores the received SKt. This also seems to me a weak point because, even if all those smartphones have changed keys, their stored SKt could be used (in case of data breach) to relate EphIDs that are otherwise unrelated to the same device.
Another source of concern, regarding not the mechanism per se but the technologies it is based on, is the fact that the Bluetooth protocol is known not to be very robust against attackers (thanks to Francesco Palmieri for pointing it out). Adopting the described mechanism on millions of devices would mean forcing them into a vulnerable situation.
Addendum (April 6th, 17:10 CET). Michael Veale, one of the researchers of the DP-3T team (wrongly called DPT in the first edition of this post), pointed out to me on Twitter that «the DP-3T protocol is not PEPP-PT. PEPP-PT is an empty shell, to be filled. We try to fill with decentralised.» I have therefore updated the initial paragraph and the title, which now refer just to DP-3T. My analysis stays the same, since it is the documentation from DP-3T (link on the first line of this post) that I analyzed. Also, I would like to point out that the description of PEPP-PT on their site coincides with the high-level description of the DP-3T protocol and IS decentralized (see picture below).
Addendum (April 9th, 19:00 CET). DP-3T has released a new version of their white paper (which I'm currently analyzing), a clarification of the relations between DP-3T and PEPP-PT, and a set of answers to issues raised (by me and other researchers) in this FAQ. My issue about broadcasting the SKt of devices owned by infected people is (I think) covered in FAQ P6. I've written "I think" because the actual title of P6 is «Why do infected people upload a seed (which enables recreating EphIDs) instead of their individual EphIDs?»
In itself, uploading the seed does not give away the anonymity of the infected person, given that this seed was randomly generated when the app was installed. The problem I pointed out lies in the fact that the backend server broadcasts this seed instead of the derived EphIDs. The answer given in the FAQ nevertheless discusses the issue using performance arguments, which I find reasonable in principle. But given that privacy is involved here, I remain convinced that the protocol SHOULD NOT broadcast the seed but only the EphIDs.
So the title should be «Why does the backend broadcast the seeds of infected persons (which enables recreating EphIDs) instead of their individual EphIDs?». I've asked them for clarification and I'll update you on the issue.