
"I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution"

The podcast below on this paper was generated with Google's Illuminate.

LLMs can attribute code authorship without extensive training data.

This paper explores using LLMs for code authorship attribution, addressing the limitations of traditional methods, which require large labeled datasets and struggle to generalize across coding styles.

-----

https://arxiv.org/abs/2501.08165

Original Problem 🤔:

→ Traditional code authorship attribution relies on supervised machine learning, which requires extensive labeled datasets.

→ These supervised models also struggle to generalize across different coding styles.

-----

Solution in this Paper 💡:

→ This paper investigates using LLMs for code authorship attribution.

→ It tests zero-shot prompting to verify whether two code samples were written by the same author (see the prompt sketch after this list).

→ It uses few-shot in-context learning to attribute authorship based on reference samples (also covered in the sketch below).

→ For large-scale attribution (many authors), the paper proposes a tournament-style approach.

→ The tournament approach splits the candidate authors into smaller groups and runs multiple rounds of LLM-based authorship comparisons, advancing the winner of each group (see the second sketch below).
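
For concreteness, here is a minimal Python sketch of the two prompting setups, assuming an OpenAI-style chat API. The prompt wording, the "gpt-4o" model name, and the answer parsing are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of zero-shot verification and few-shot attribution,
# assuming an OpenAI-style chat API. Prompts and parsing are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def verify_same_author(code_a: str, code_b: str, model: str = "gpt-4o") -> bool:
    """Zero-shot verification: are two code samples by the same author?"""
    prompt = (
        "Do the following two code samples appear to be written by the same "
        "author? Judge coding style only (naming, formatting, idioms), not "
        "functionality. Answer with exactly 'yes' or 'no'.\n\n"
        f"Sample A:\n{code_a}\n\nSample B:\n{code_b}"
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")


def attribute_author(query: str, references: dict[str, str],
                     model: str = "gpt-4o") -> str:
    """Few-shot attribution: pick the likeliest author from reference samples."""
    shots = "\n\n".join(f"Author {name}:\n{code}"
                        for name, code in references.items())
    prompt = (
        "Below are reference code samples, one per author, followed by a query "
        "sample. Based on coding style alone, which author most likely wrote "
        "the query? Answer with the author name only.\n\n"
        f"{shots}\n\nQuery:\n{query}"
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content.strip()
```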
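
And a hedged sketch of the tournament-style scaling idea, reusing attribute_author from above. The group size of 10 and the fallback handling are illustrative choices, not the paper's parameters.

```python
# Tournament sketch: split candidate authors into small groups, attribute
# within each group, and advance the winners round by round until one remains.
def tournament_attribution(query: str, references: dict[str, str],
                           group_size: int = 10) -> str:
    candidates = list(references)
    while len(candidates) > 1:
        winners = []
        for i in range(0, len(candidates), group_size):
            group = candidates[i:i + group_size]
            if len(group) == 1:
                winners.append(group[0])  # a bye: advances unopposed
                continue
            winner = attribute_author(
                query, {name: references[name] for name in group}
            )
            # Fall back to the first candidate if the model's answer
            # doesn't match any author in the group.
            winners.append(winner if winner in group else group[0])
        candidates = winners
    return candidates[0]
```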

-----

Key Insights from this Paper 🔎:

→ Careful prompt engineering is crucial for LLM performance in this task.

→ LLMs are more robust to adversarial attacks than traditional ML models.

→ LLMs can generalize across programming languages.

-----

Results ✨:

→ Achieves up to a 0.78 MCC score in zero-shot verification (see the MCC note after this list).

→ Achieves up to 88.5% accuracy in few-shot attribution.

→ The tournament approach reaches 65% Top-1 accuracy on a C++ dataset with 500 authors and 68.7% on a Java dataset with 686 authors, using only one reference sample per author.
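
For context on the verification metric: MCC (Matthews correlation coefficient) scores binary predictions on a -1 to +1 scale and is robust to class imbalance, so 0.78 indicates strong agreement. A quick check with scikit-learn on toy labels (illustrative only):

```python
# MCC on toy verification labels (1 = same author, 0 = different authors).
# The labels are illustrative, not the paper's data.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # ~0.71 on this toy example
```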
