0 Abstract

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web.
By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback.
To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users.
Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences.
This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
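
The rejection-sampling step mentioned above can be illustrated with a minimal best-of-n sketch: draw several candidate answers from the policy, score each with the reward model, and keep the highest-scoring one. This is only an illustration under stated assumptions, not the authors' implementation. The names `best_of_n`, `sample_answer`, and `reward_model` are hypothetical, and the policy and reward model are passed in as plain Python callables rather than real language models.

```python
from typing import Callable, List, Tuple


def best_of_n(
    question: str,
    sample_answer: Callable[[str], str],        # hypothetical stand-in for the behavior-cloned policy
    reward_model: Callable[[str, str], float],  # hypothetical stand-in for the learned reward model
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate answers and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # Draw n independent candidates from the policy.
    candidates: List[str] = [sample_answer(question) for _ in range(n)]

    # Score every candidate against the same question.
    scores: List[float] = [reward_model(question, answer) for answer in candidates]

    # Keep the candidate with the highest predicted preference score.
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```

Note that this procedure needs no further training of the policy: it spends extra inference-time compute (n samples per question) to select an answer the reward model predicts humans would prefer.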

1 Introduction

A rising challenge in NLP is long-form question-answering (LFQA), in which a paragraph-length answer is generated in response to an open-ended question. LFQA systems have the potential to become one of the main ways people learn about the world, but currently lag behind human performance [Krishna et al., 2021]. Existing work tends to focus on two core components of the task, information retrieval and synthesis. In this work we leverage existing solutions to these components: we outsource document retrieval to the Microsoft Bing Web Search API, and utilize unsupervised pre-training to achieve high-quality synthesis by fine-tuning GPT-3 [Brown et al., 2020]. Instead of trying to improve these ingredients, we focus on combining them using more faithful training objectives. Following Stiennon et al. [2020], we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive with humans.
We make two key contributions:

We create a text-based web-browsing environment that a fine-tuned language model can interact with. This allows us to improve both retrieval and synthesis in an end-to-end fashion using general methods such as imitation learning and reinforcement learning.
We generate answers with references: passages extracted by the model from web pages while browsing. This is crucial for allowing labelers to judge the factual accuracy of answers, without engaging in a difficult and subjective process of independent research.
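
To make the two contributions above more concrete, here is a toy sketch of a text-based browsing environment that records quoted passages as references. Everything in it is an assumption made for illustration: the class `ToyBrowserEnv`, the command names `search`, `quote`, and `end`, and the in-memory `pages` dictionary (standing in for a real search API) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Reference:
    title: str    # page the passage was quoted from
    passage: str  # exact text extracted while browsing


@dataclass
class ToyBrowserEnv:
    pages: Dict[str, str]  # hypothetical corpus: page title -> page text
    question: str
    current_title: str = ""
    references: List[Reference] = field(default_factory=list)
    done: bool = False

    def observe(self) -> str:
        """Render the environment state as plain text, as a language model would see it."""
        page = self.pages.get(self.current_title, "(no page open)")
        quotes = "\n".join(
            f"[{i + 1}] {ref.title}: {ref.passage}"
            for i, ref in enumerate(self.references)
        )
        return (
            f"Question: {self.question}\n"
            f"Page: {self.current_title}\n{page}\n"
            f"Quotes:\n{quotes}"
        )

    def step(self, command: str) -> str:
        """Apply one text command and return the next text observation."""
        if self.done:
            raise RuntimeError("episode has ended")
        verb, _, arg = command.partition(" ")
        if verb == "search":
            # Open the first page whose text contains the query (case-insensitive).
            hits = [t for t, text in self.pages.items() if arg and arg.lower() in text.lower()]
            self.current_title = hits[0] if hits else ""
        elif verb == "quote":
            # Only passages that literally appear on the open page can become references.
            page = self.pages.get(self.current_title, "")
            if arg and arg in page:
                self.references.append(Reference(self.current_title, arg))
        elif verb == "end":
            self.done = True
        else:
            raise ValueError(f"unknown command: {verb}")
        return self.observe()
```

Because both observations and actions are plain strings, the same interface can be driven by a human demonstrator or by a fine-tuned language model, and the collected `references` give a labeler specific passages to check an answer against.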

Reference