
"Props for Machine-Learning Security"

The podcast on this paper is generated with Google's Illuminate.

Props (protected pipelines) unlock the deep web's massive data stores for ML while preserving privacy and proving data authenticity.

https://arxiv.org/abs/2410.20522

🎯 Original Problem:

The limited accessibility of data on the World Wide Web is a major bottleneck for machine-learning progress. Practitioners are hitting the limits of available training data and increasingly fall back on synthetic data, which risks models poisoning themselves. While the deep web contains an estimated 100x more data than the surface web, most of it remains inaccessible due to security and privacy concerns.

-----

🛠️ Solution in this Paper:

Props (protected pipelines) enable secure access to deep-web data while maintaining privacy and data integrity. They work through:

→ Privacy Control: Users maintain control over their data disclosure throughout the pipeline

→ Data Integrity: Props prove to consumers that the data is authentic and comes from trustworthy deep-web sources

→ Secure Data Sourcing: Data is fetched through privacy-preserving oracles backed by Trusted Execution Environments (TEEs) or zkTLS

→ Pinned Models: Specifications that prove an output Y is the result of applying a specific model M to an input X (see the sketch after this list)
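
To make these pieces concrete, here is a minimal Python sketch rather than the paper's actual construction: the oracle key, record fields, and function names are hypothetical, an HMAC stands in for a TEE or zkTLS attestation signature, and plain hash commitments stand in for the pinned-model proof that Y = M(X).

```python
# Conceptual sketch only (hypothetical names, not the paper's API):
# (1) an oracle attestation binding data to its deep-web source, and
# (2) a pinned-model record binding output Y to model M and input X via hashes.
import hashlib
import hmac
import json

ORACLE_KEY = b"demo-oracle-secret"  # stand-in for a TEE/zkTLS signing key


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def attest_source(source_url: str, payload: bytes) -> dict:
    """Oracle side: bind a payload to its source without exposing raw user data."""
    digest = sha256(payload)
    msg = f"{source_url}|{digest}".encode()
    return {
        "source": source_url,
        "payload_hash": digest,
        "signature": hmac.new(ORACLE_KEY, msg, hashlib.sha256).hexdigest(),
    }


def verify_attestation(att: dict) -> bool:
    """Consumer side: check the oracle's signature over (source, payload hash)."""
    msg = f"{att['source']}|{att['payload_hash']}".encode()
    expected = hmac.new(ORACLE_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, att["signature"])


def pinned_inference(model_fn, model_id: str, x: bytes) -> dict:
    """Record that output Y came from applying the pinned model M to input X."""
    y = model_fn(x)
    return {
        "model_hash": sha256(model_id.encode()),
        "input_hash": sha256(x),
        "output_hash": sha256(y),
        "output": y.decode(),
    }


if __name__ == "__main__":
    att = attest_source("https://deep-web.example/records/123", b"lab result: 42")
    print("attestation valid:", verify_attestation(att))
    record = pinned_inference(lambda x: x.upper(), "model-v1", b"lab result: 42")
    print(json.dumps(record, indent=2))
```

In a real prop, the signature would come from attested TEE hardware or a zkTLS proof, and the pinned-model check would be a verifiable computation rather than a bare hash record; the sketch only shows how the commitments fit together.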

-----

💡 Key Insights:

→ The deep web is estimated to be 100x larger than the surface web and contains valuable training data

→ Props enable both training and inference while maintaining privacy

→ Users can control and monetize their data contributions

→ Props limit adversarial attacks by authenticating input sources (sketched below)
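
As a rough illustration of that last point (hypothetical oracle IDs and keys, with an HMAC standing in for a real attestation signature), a data consumer could admit training examples only when they carry a verifiable attestation from a trusted oracle, which constrains where adversarial inputs can enter:

```python
# Hypothetical sketch: admit only examples with a verifiable attestation
# from an allow-listed oracle; unauthenticated submissions are dropped.
import hashlib
import hmac

TRUSTED_ORACLE_KEYS = {"oracle-A": b"demo-key-A"}  # assumed trust anchors


def is_authentic(example: dict) -> bool:
    key = TRUSTED_ORACLE_KEYS.get(example.get("oracle_id", ""))
    if key is None:
        return False
    digest = hashlib.sha256(example["data"].encode()).hexdigest()
    msg = f"{example['source']}|{digest}".encode()
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, example.get("signature", ""))


candidates = [
    {  # properly attested example from a trusted oracle
        "oracle_id": "oracle-A",
        "source": "https://bank.example/stmt/1",
        "data": "balance: 100",
        "signature": hmac.new(
            b"demo-key-A",
            b"https://bank.example/stmt/1|"
            + hashlib.sha256(b"balance: 100").hexdigest().encode(),
            hashlib.sha256,
        ).hexdigest(),
    },
    {  # unauthenticated submission from an unknown party
        "oracle_id": "unknown",
        "source": "https://attacker.example",
        "data": "poisoned sample",
        "signature": "deadbeef",
    },
]

training_set = [c for c in candidates if is_authentic(c)]
print(f"admitted {len(training_set)} of {len(candidates)} candidate examples")
```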

-----

📊 Results:

→ Demonstrated privacy-preserving access to deep-web data without requiring changes to existing web infrastructure

→ Enabled secure model training and inference on sensitive data

→ Proved effective in constraining adversarial inputs through authenticated data sourcing
