The high-level theory is quite clear. First, users can control who they give their data to. We want to solve this by managing commissioners and the trust users place in them. Second, we want to aggregate the data in a way that protects each individual's privacy. We can achieve this by generating only insights from the data, not copies, so the raw data never leaves the user's device.
There are multiple ways to solve this technically with different benefits and drawbacks. This thread is for discussing the alternative approaches.
A project I find pretty interesting here is Flower (flower.ai), which focuses on federated ML but could prove useful for “normal” data aggregation as well. They have many examples from the research community (e.g. FedAvg). At this point I think they offer a lot that we don’t need yet but that might be interesting in the future (more complex analytics with pandas, or even training ML models with TensorFlow, PyTorch, etc.).
Another aspect is secure aggregation. In their nightly release, which according to the Flower Summit keynote will soon ship as a regular release, they implemented the SecureAggregationPlus (SecAggPlus) algorithm as a mod that can be used in all modes.
The protocol is explained in the paper “Secure Aggregation for Federated Learning in Flower” from 2020. It seems like it could be interesting for us.
They have a C++ SDK package for the client, which can be checked out in this example. I tried it myself and it works. I don’t know yet whether the SecAggPlus protocol works with the C++ SDK; I'm trying to find that out.
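To make the secure aggregation idea a bit more tangible, here is a minimal sketch of the pairwise-masking trick that this family of protocols builds on. It is not Flower's API and not the full SecAggPlus protocol: the real thing derives the masks through key agreement between clients and uses secret sharing to survive dropouts. Here a single in-process helper hands out the masks, and all names (pairwise_masks, mask_value, aggregate) are made up for illustration.

```python
import secrets

MODULUS = 2**32  # all arithmetic is done modulo this value (assumed size)


def pairwise_masks(client_ids, modulus=MODULUS):
    """Return {client_id: net_mask} such that all masks sum to 0 mod modulus.

    Assumption: in a real protocol each pair of clients would derive its
    shared value via key agreement; here we simply draw it at random.
    """
    masks = {cid: 0 for cid in client_ids}
    ids = sorted(client_ids)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = secrets.randbelow(modulus)
            masks[a] = (masks[a] + shared) % modulus  # a adds the shared value
            masks[b] = (masks[b] - shared) % modulus  # b subtracts it
    return masks


def mask_value(value, mask, modulus=MODULUS):
    """What a client would upload instead of its raw value."""
    return (value + mask) % modulus


def aggregate(masked_values, modulus=MODULUS):
    """What the server computes; the pairwise masks cancel in the sum."""
    return sum(masked_values) % modulus


values = {"a": 3, "b": 5, "c": 10}
masks = pairwise_masks(values.keys())
uploads = [mask_value(values[cid], masks[cid]) for cid in values]
total = aggregate(uploads)  # equals 3 + 5 + 10
```

The server only ever sees the uploads, which look random individually, yet their sum is the true total. This sketch breaks as soon as one client drops out, which is exactly the problem the secret-sharing part of the real protocol addresses.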
“Mobile devices [also] generally cannot establish direct communications channels with other mobile devices (relying on a server or service provider to mediate such communication) nor can they natively authenticate other mobile devices.”
I think there are two dimensions to this problem, and we should probably only worry about one at a time. I’d describe them like this:
1. Data transport
How is data (or more accurately partial aggregations) transported between clients? We’ve considered two options so far:
(1) Peer-to-peer communication: Client A sends data directly to client B. Data is only ever sent to the server once it’s aggregated enough, i.e. the entropy is high enough to achieve reasonable privacy. This has the incredible benefit that we don’t have to worry about data leaks on the server, and MITM attacks are probably also harder. Even if data stored on a server is encrypted, encryption tends to get broken in finite time (though for practical purposes not very soon after it’s been encrypted).
(2) Asynchronous end-to-end encryption: Client A gets the public key for client B and encrypts the data for them (so that client B can decrypt the data, but the server can’t, not without breaking the encryption at least). Then client A sends the encrypted data to the server, from which client B pulls it, decrypting it with their private key.
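Whichever transport is used, what travels between clients is a partial aggregation rather than raw records. A rough sketch of what such an object could look like follows, with a contributor count standing in for the "aggregated enough" check. That stand-in is my assumption (a proper entropy measure would be more involved), and PartialAggregate and its methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialAggregate:
    """Running sum plus the number of clients that contributed to it."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        """Fold one client's local value into the aggregate."""
        return PartialAggregate(self.total + value, self.count + 1)

    def merge(self, other):
        """Combine two partial aggregates from different client chains."""
        return PartialAggregate(self.total + other.total, self.count + other.count)

    def releasable(self, min_contributors):
        """Assumed privacy gate: only upload once enough clients took part."""
        return self.count >= min_contributors

    @property
    def mean(self):
        return self.total / self.count


MIN_CONTRIBUTORS = 3  # hypothetical threshold

after_a = PartialAggregate().add(4.0)  # client A contributes
after_b = after_a.add(6.0)             # client B receives it and adds its value
after_c = after_b.add(8.0)             # client C does the same

held_back = after_b.releasable(MIN_CONTRIBUTORS)  # False: only two contributors
can_send = after_c.releasable(MIN_CONTRIBUTORS)   # True: three contributors
```

Note that in a simple chain like this the second client still sees the first client's raw value (count is 1 at that point), so something like masking would be needed on top.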
I think (2) is the way to go for now: it has fewer unknowns and is easier to implement. We can look into (1) once we have (2) working, IMHO. We might need (2) anyhow as a fallback where (1) is not feasible.
(2) is quite trivial to implement, and there are solid libraries for this. It doesn’t need any framework in my mind.
2. Data aggregation
How is the data actually aggregated? How are aggregation instructions for the clients defined? Do we just send a Lua script to the client or something? Is it something declarative?
I personally have no great idea how to approach this yet. This is where I’d find an existing mechanism we can employ (like SecAgg?) super helpful. Down the road, we could always come up with our own mechanism, but for now, it could save us a lot of time if we didn’t.
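To make the "something declarative" option a bit more concrete, here is one possible shape: the server sends a small JSON instruction, and the client evaluates it against local records using a fixed whitelist of operations, so no arbitrary code runs on the device. The instruction format, field names, and run_instruction are all hypothetical, just a sketch to discuss.

```python
import json
import operator

# Whitelisted operations; anything else in an instruction raises KeyError.
FILTER_OPS = {"gte": operator.ge, "lte": operator.le, "eq": operator.eq}
AGGREGATES = {"sum": sum, "count": len, "max": max, "min": min}


def run_instruction(instruction_json, records):
    """Evaluate a declarative aggregation instruction on local records."""
    spec = json.loads(instruction_json)
    field = spec["field"]
    # Records that lack the requested field are skipped.
    values = [r[field] for r in records if field in r]
    flt = spec.get("filter")
    if flt is not None:
        keep = FILTER_OPS[flt["op"]]
        values = [v for v in values if keep(v, flt["value"])]
    return AGGREGATES[spec["aggregate"]](values)


local_records = [
    {"steps": 4000},
    {"steps": 12000},
    {"steps": 9000},
    {"heart_rate": 61},
]

instruction = json.dumps({
    "field": "steps",
    "filter": {"op": "gte", "value": 5000},
    "aggregate": "count",
})

active_days = run_instruction(instruction, local_records)  # 2
```

Compared to shipping a Lua script, this is far less flexible, but the client can check every instruction against the whitelist before running it.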