Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

Authors: Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer

To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI’s text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.

Paper link: http://arxiv.org/pdf/2303.13408v1

More CS papers: http://cspaper.cn/
