Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

Authors: Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer


To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI’s text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.
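The retrieval defense described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a bag-of-words cosine similarity as a stand-in for the semantic retriever the authors use, and the function names, threshold value, and database format are hypothetical.

```python
# Sketch of the retrieval defense: flag a candidate text as AI-generated if
# it matches any sequence the API provider previously generated, within a
# similarity threshold. Bag-of-words cosine is a toy stand-in for the
# semantic similarity search described in the paper; names are hypothetical.
from collections import Counter
import math


def vectorize(text: str) -> Counter:
    """Map text to a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_ai_generated(candidate: str, api_database: list[str],
                    threshold: float = 0.75) -> bool:
    """Return True if any previously generated sequence in the provider's
    database matches `candidate` at or above `threshold` similarity."""
    cand_vec = vectorize(candidate)
    return any(cosine(cand_vec, vectorize(gen)) >= threshold
               for gen in api_database)
```

Because paraphrasing largely preserves meaning, a paraphrased generation still retrieves its source sequence, which is why this defense is robust where token-level detectors fail. A production system would replace the linear scan with an approximate nearest-neighbor index over dense embeddings.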


