Submitted URL: https://blog.oceanprotocol.com/how-ocean-compute-to-data-relates-to-other-privacy-preserving-technology-b4e1c330483
Published in Ocean Protocol
Trent McConaghy
May 28, 2020 · 10 min read

HOW DOES OCEAN COMPUTE-TO-DATA RELATE TO OTHER PRIVACY-PRESERVING APPROACHES?

A SURVEY SPANNING FEDERATED LEARNING, HOMOMORPHIC ENCRYPTION, MULTI-PARTY COMPUTE, AND MORE

Creatures of the ocean love their privacy! Here’s an octopus hiding in sand. Can you see it? [Image: CC-BY-SA 4.0]

INTRODUCTION

At Ocean Protocol, we recently released Ocean Compute-to-Data. It helps AI practitioners access valuable, private data for more accurate AI models. Data owners get to retain privacy and control over their data.

Compute-to-Data works as follows. First, data owners approve AI algorithms to run on their data. Then, Compute-to-Data orchestrates remote computation and execution on the data to train AI models. The compute is sufficiently “aggregating” or “anonymizing” that the privacy risk is minimized. Yet it results in a model that’s useful for research or business.

This article asks: how does Ocean Compute-to-Data relate to other privacy-preserving approaches?

Here’s the quick answer: it’s complementary. Each technology has its own usage, and its own constraints.

We’ll now give a more detailed answer, in a fashion that’s approachable to a less deeply technical audience. We survey some notable privacy-preserving technologies. For each, we discuss its challenges, how those challenges are being addressed, and how the technique relates to Ocean. We do the same for Ocean Compute-to-Data. We conclude with a broader discussion of Ocean in the privacy-preserving ecosystem.

SURVEY

ENCRYPTION AND DECRYPTION

Encryption transforms data into a form that can be safely sent across an insecure channel. The receiver then uses a key to transform the data back into its original plaintext form.
Symmetric encryption is when the same key is used to encrypt and decrypt; key-exchange schemes like Diffie-Hellman let the two parties establish that shared key across an insecure channel. Asymmetric encryption has “public keys and private keys, coming in pairs. What one does, the other undoes” [Ref]. Alice encrypts a message with Bob’s public key, then sends the message across an insecure channel. Only Bob can decrypt it, with his private key.

Encryption and decryption are widely used, for applications like secure web-based payments (the “https” you see in your browser) and secure messaging (end-to-end encryption such as Signal).

Ocean Protocol uses encryption/decryption as part of its access control infrastructure.

HOMOMORPHIC ENCRYPTION (HE)

In HE, compute is performed on encrypted data. Therefore, non-trusted parties can perform compute without ever learning the contents of the data.

Challenge: HE is still too computationally intensive to be used in most applications.

Towards solving: Speed will continue to improve with time due to better algorithms, faster chips, and dedicated chips.

HE is a remarkable idea, almost like it’s out of science fiction. We look forward to when it scales enough to work in more applications, as it will be useful to have as part of the Ocean technology stack. It will combine well with Ocean’s other features like data asset management and marketplaces.

SECURE ENCLAVES / TRUSTED EXECUTION ENVIRONMENTS (TEE)

In TEE, computation is performed in special chips that can see the private data but are severely restricted in what information they can share with their host machine. Intel SGX is the most prominent hardware example.

Challenge: any security flaw found in the chips renders the chip useless for this purpose, and there is a history of this happening.

Towards solving: TEE chips have been hardening over time; today we’re approximately at the threshold of production usage.
TEEs play well with Ocean: Ocean can manage data assets, which then have computation performed on them in TEEs, and the results come back to Ocean. Relatedly, Oasis Labs leverages blockchain to manage secure enclave-based compute. There is opportunity for integration of Ocean and Oasis.

Aquariums are a bit like trusted execution environments… for dangerous sea creatures. [Image: CC0]

MULTI-PARTY COMPUTE (MPC)

In MPC, the compute task is broken into small sub-tasks; a different party performs each sub-task; and the results are merged.

Challenge: bandwidth can be a bottleneck because MPC requires a lot of communication between the parties.

Towards solving: researchers are working to reduce bandwidth needs.

MPC plays well with Ocean: Ocean for data asset management, MPC for compute. For example, here’s a prototype integration doing image classification for a healthcare use case.

The Enigma blockchain project focuses on TEEs and MPC. Therefore there are future opportunities for integration between Ocean and Enigma.

ZERO-KNOWLEDGE PROOFS (ZKPS)

In ZKPs, Alice asks Bob if Bob knows x, and Bob can provably reply without leaking x itself.

Constraints: ZKPs require interactive sessions, scale poorly, and only answer binary questions.

Towards solving: First, some use cases are perfectly ok with the constraints of ZKPs. Perhaps the most famous example in blockchain is Zcash, which offers Bitcoin-like functionality (e.g. preventing double spending), but without leaking Personally Identifiable Information (PII). Second, there is steady progress to loosen the constraints given above, especially the scaling part.

Because they require interactive sessions and give binary outputs, ZKPs are less directly applicable to Ocean on the AI side. However, we are excited about the future of ZKPs elsewhere for Ocean. As in Zcash, they could help reduce PII leakage about blockchain transactions themselves. For example, ZoKrates is a toolbox for zkSNARKs on Ethereum that can support private transactions.
Furthermore, with ZK Rollups (or their more lightweight Optimistic cousins) there is great promise for blockchain scalability in addition to privacy.

Here’s a sea moth looking to minimize its information leakage. [Image: Matt Kieffer CC-BY-SA 2.0]

SYNTHETIC DATA

In synthetic data generation, a probability density function (PDF) is computed or “learned” from the original dataset, next to the data itself. Then, millions of datapoints can be drawn from the PDF and shared. Those datapoints are naturally “anonymized”, which reduces the risk of personally identifiable information (PII) leaking.

Challenge 1: not flexible. PDF construction is essentially AI-style modeling, where the choice of algorithm is made by the provider of the synthetic data generation technology.

Challenge 2: less accurate. There’s now modeling in two layers: the PDF, and the final AI model built by the AI practitioner. Modeling error compounds. Furthermore, if the PDF is overfit, PII will leak.

Towards solving: Challenge 1 is addressed by letting the AI practitioner build the PDF themselves. Challenge 2 is addressed if the AI practitioner simply builds a single model themselves next to the data. And then, you have Ocean Compute-to-Data (!).

So Synthetic Data is a poor approach to AI modeling. However, Synthetic Data is still useful for gaining intuition on the (synthetic) data, for example through 2D or 3D scatterplot visualizations. This is what makes Synthetic Data complementary to Ocean.

FEDERATED LEARNING (FL)

In FL, a neural network is randomly initialized. Then, weight updates are computed next to the data itself in data silo #1, and sent to the neural network. This is repeated in data silo #2, #3, and so on. In the end, a neural network has been trained across many data silos, without data leaving the premises of each respective silo.

TensorFlow Federated (TFF) and OpenMined are the most prominent FL projects.
TFF orchestrates in a centralized fashion, whereas OpenMined orchestrates in a decentralized fashion. Google Federated Analytics takes a cue from FL and computes simpler aggregate values such as averages.

Challenge: in TFF-style FL, a centralized entity (e.g. Google) must perform the orchestration of compute jobs across silos. So, PII can leak to this entity.

Towards solving: OpenMined addresses this via decentralized orchestration. But its software infrastructure could use improvement to manage computation at each silo in a more secure fashion; this is where Compute-to-Data can help.

DECOUPLED HASHING (DH)

DH is less well-known than the other techniques surveyed, but it’s worth understanding.

We first review traditional Feature Hashing (FH).

FH trains an AI model as follows:
(1) On training data, create a hash for each {input variable, input value} combination.
(2) Apply a learning algorithm to learn a weight for each hash.

It runs on new / testing inputs as follows:
(1) On test data, create a hash for each {input variable, input value} combination.
(2) Run the hashes through the trained model.

Traditionally, all the steps are done on the same machine. But they don’t need to be! This is the idea of DH. DH does training step (1) and testing step (1) next to the data. The result is naturally anonymized. Training step (2) and testing step (2) can be done anywhere by anyone, without seeing any private information.

DH is pragmatic: it has minimal information leakage, scales well, and doesn’t require new leaps in technology or science. The remaining challenge is how to set up the infrastructure to separate steps (1) and (2), and to coordinate the actors on each side.

Ocean Compute-to-Data can help with infrastructure and coordination to lower barriers to using DH.
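To make the split between steps (1) and (2) concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not an implementation from Ocean or from any DH library: the function names, the bucket count, the use of SHA-256, the choice of logistic regression trained by plain SGD, and the toy records are all hypothetical.

```python
import hashlib
import math

NUM_BUCKETS = 2 ** 16  # hypothetical size of the hashed feature space


def hash_features(record):
    """Step (1), run next to the data: one bucket index per {variable, value} pair."""
    indices = []
    for variable, value in sorted(record.items()):
        token = f"{variable}={value}".encode("utf-8")
        digest = hashlib.sha256(token).digest()
        indices.append(int.from_bytes(digest[:8], "big") % NUM_BUCKETS)
    return indices


def predict(weights, indices):
    """Testing step (2): sum the weights of the hashes, squash to a probability."""
    score = sum(weights.get(i, 0.0) for i in indices)
    return 1.0 / (1.0 + math.exp(-score))


def train(hashed_rows, labels, epochs=50, learning_rate=0.5):
    """Training step (2), run anywhere: learn one weight per hash via SGD."""
    weights = {}
    for _ in range(epochs):
        for indices, label in zip(hashed_rows, labels):
            error = label - predict(weights, indices)
            for i in indices:
                weights[i] = weights.get(i, 0.0) + learning_rate * error
    return weights


# Data-owner side: the only code that touches raw records (toy, made-up data).
raw_train = [
    {"smoker": "yes", "age_band": "60+"},
    {"smoker": "yes", "age_band": "40-59"},
    {"smoker": "no", "age_band": "60+"},
    {"smoker": "no", "age_band": "18-39"},
]
train_labels = [1, 1, 0, 0]
hashed_train = [hash_features(r) for r in raw_train]
hashed_test = [hash_features({"smoker": "yes", "age_band": "18-39"})]

# Model-builder side: sees only bucket indices and labels, never raw records.
weights = train(hashed_train, train_labels)
print(predict(weights, hashed_test[0]))
```

Only `hash_features` needs to run next to the data; `train` and `predict` work purely on bucket indices, so they can run on a different machine operated by a different party. How much an unkeyed hash protects low-cardinality values in practice is outside the scope of this sketch.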
DIFFERENTIAL PRIVACY (DP)

DP “is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.” The main tactic is to add random noise to each input datapoint so that any actor reviewing statistics derived from all the datapoints can’t extract PII.

DP can enhance the privacy of other techniques. It’s crucial for synthetic data: DP is the main accepted way of generating it in a provably private way. DP has been shown to help Federated Learning, for example here. DP holds potential for Compute-to-Data contexts too.

Stonefish (trying to hide) in coral. [Image: Matt Kieffer CC-BY-SA 2.0]

COMPUTE-TO-DATA

The main idea of Compute-to-Data is to bring computation to the data, where the data stays on-premise. The compute results returned are sufficiently aggregated or anonymized that the privacy risk is minimized.

Ocean Compute-to-Data draws on a lineage of related ideas and technologies. Database researchers have explored the idea of compute next to the data since the 1970s; the modern incarnation is near-memory computing and near-data computing. As discussed, FL brings compute next to data for training AI models across many data silos (albeit with centralized orchestration). FL started to gain traction in 2015. The Fitchain project also brought compute next to data, including collaboration with Ocean in 2018. It has a commercial spinoff. Finally, an academic paper from Algorand recently proposed a technology that brings compute to data.

Ocean brings the idea of compute-to-data into its ecosystem of blockchain-based access control (platform level) and data marketplaces to buy and sell private data while preserving privacy.

It’s a long lineage of ideas and tech, all around a shared movement of regaining control of our data. We’re proud to be part of that movement.
In Ocean Compute-to-Data, data owners approve AI algorithm scripts to run on their data, then Compute-to-Data orchestrates remote computation and execution on the data to train AI models.

Challenge: there’s a risk that the script supplied leaks PII. This has two variants: (a) malicious, and (b) overfitting.

In (a), the script has special code that sends the data to the script supplier. The supplier would obfuscate this code via an easy-to-miss special import like “import sk_learn” (versus the correct version “import sklearn”). The special library wraps sklearn, but injects copying.

In (b), the model learns too much detail, so that PII can be extracted from it. An extreme example is CART tree training in which each branch only stops growing when its leaf node holds a single datapoint. Or, a neural network could overfit if it has a large number of parameters compared to its datapoints and is trained without regularization.

To solve: The Data Provider chooses what algorithms to trust. So the entity that risks having its private data exposed is the same entity that chooses which algorithms to trust. It is their choice to make, based on their risk-reward preference. For (a): they simply inspect the script. For (b): some algorithms are easy to trust, like averaging or learning a logistic regression model with linear basis functions. But for more advanced modeling, it’s a bit more of a burden. To ease that, we envision a rise of community-curated scripts with skin-in-the-game (staking) to help “harden” the most useful or promising scripts over time.

Bringing the action to where it’s secure: here’s an octopus hiding in a clam shell. [Image: arhnue CC0]

OCEAN AND THE PRIVACY-PRESERVING ECOSYSTEM

Ocean Compute-to-Data’s properties make it useful now. It’s less burdened by some of the issues that have slowed adoption of some privacy-preserving techniques.
This is not by accident: when we first started exploring how to preserve privacy in Ocean, we reviewed the approaches surveyed above, and realized that bringing compute to data was the most pragmatic choice for the near term.

But other approaches are maturing nicely. Ocean Protocol is not constrained to just compute-to-data as a privacy-preserving technique. As time goes on and other techniques mature, we envision them being used in conjunction with Ocean.

Of particular interest is FL, which is closest in spirit to Ocean Compute-to-Data, since FL also brings compute to data. In fact, FL is complementary to Ocean: FL does higher-level management across many data silos, and Ocean securely manages computation at a given silo. We’re especially excited about integrations with OpenMined FL technology.

OpenMined is interesting to Ocean more generally. It’s evolved from being a pure FL technology to become a broader toolbox of open “connective tissue” software for privacy-preserving AI technologies, alongside a large and growing community. We look forward to further interactions with the OpenMined community.

CONCLUSION

This article asked: how does Ocean Compute-to-Data relate to other privacy-preserving approaches?

We see that Ocean is complementary. Each technology has its own usage, its own constraints, and its own complementary relation to Ocean. Encryption/decryption, HE, TEE, MPC, and ZKPs sit side-by-side with Ocean. DP can enhance Compute-to-Data further. Synthetic data and FL flows are directly improved by Compute-to-Data.

ACKNOWLEDGEMENTS

Special thanks to Andrew Trask, David Holtzman, Bruce Pon, Adam Drake, and Julien Thevenard for providing feedback on this article.

FURTHER READING

OpenMined has an excellent series on privacy-preserving data science, starting with this article.

MAIN ARTICLE UPDATES

* May 31, 2020: Added section on Decoupled Hashing.
Follow Ocean Protocol via our Newsletter and Twitter; chat with us on Telegram or Discord; and build on Ocean starting at our docs.