
How Do We Communicate Effectively with LLMs?

Unless you've completely tuned out social media and the news, you're unlikely to have missed the excitement around large language models (LLMs).

The evolution of LLMs. Image taken from [1] (Source). Even as I add this image, the current pace of LLM development has already rendered it outdated.

LLMs have become ubiquitous, with new models released almost daily. They are also increasingly accessible to the public, even under limited compute, thanks to a thriving open-source community that has played a crucial role in reducing memory requirements and developing efficient fine-tuning methods for LLMs.

One of the most exciting use cases of LLMs is their remarkable ability to perform tasks they were not explicitly trained on, given only a task description and, optionally, a few examples. You can now get a capable LLM to generate a story in the style of your favorite author, summarize long emails into concise ones, and develop innovative marketing campaigns just by describing the task to the model, without fine-tuning it. But how do you best communicate your requirements to an LLM? This is where prompting comes in.

Table of Contents:

  • What is Prompting?
  • Why is Prompting Important?
  • Exploring Different Prompting Strategies
  • How Do We Implement These Techniques?
    • 4.1. Zero-Shot Prompting Llama 2 7B-chat
    • 4.2. Few-Shot Prompting Llama 2 7B-chat
    • 4.3. What Happens If We Don't Adhere to the Chat Template?
    • 4.4. Prompting Llama 2 7B-chat with CoT Prompting
    • 4.5. CoT Failure Modes for Llama 2
    • 4.6. Zero-Shot Prompting GPT-3.5
    • 4.7. Few-Shot Prompting GPT-3.5
    • 4.8. CoT Prompting GPT-3.5
  • Conclusion and Takeaways
  • Reproducibility
  • References

What is Prompting?

Prompting, or prompt engineering, is the technique of designing inputs, or prompts, to guide AI models, particularly those in natural language processing and image generation, toward specific, desired outputs. Prompting involves structuring your requirements into an input format that effectively communicates the desired outcome to the model, thereby obtaining the intended output.

Large language models (LLMs) exhibit the ability to learn in-context [2][3]. This means these models can understand and perform a wide variety of tasks based solely on task descriptions and examples supplied through the prompt, without requiring dedicated fine-tuning for each new task. Prompting matters enormously here, because it is the primary interface between the user and the model for leveraging this capability. A well-defined prompt helps pin down the nature of the task and what is expected of the LLM, as well as how the output should be provided to the user in a usable form.

You might be tempted to think that prompting an LLM shouldn't be that hard; after all, it's just describing what you need from the model in natural language, right? In practice, it isn't that simple. You'll find that different LLMs have different strengths. Some may comply better with your desired output format, while others may need more detailed instructions. The task you want the LLM to perform may be complex, requiring detailed and precise instructions. Designing an appropriate prompt therefore often takes a lot of experimentation and benchmarking.

Why is Prompting Important?

In practice, LLMs are sensitive to how their input is structured and presented. We can analyze this along several axes to get a better picture:

1. Adhering to the prompt format:

LLMs typically accept user input through specific prompt formats. These formats are usually established when a model is instruction-tuned or optimized for chat use cases [4][5]. At a high level, most prompt formats consist of an instruction and an input. The instruction describes the task the model should perform, while the input contains the text on which the task should be performed. Take the Alpaca instruction format as an example (taken from https://github.com/tatsu-lab/stanford_alpaca):


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:

Assuming the model was instruction-tuned with such a template, it is expected to perform best when users prompt it in the same format.
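
As a quick illustration, here is a minimal sketch of filling this template programmatically; the ALPACA_TEMPLATE constant and the example instruction/input values are made up for illustration:

# A minimal sketch of filling the Alpaca template programmatically.
# The instruction/input values below are hypothetical examples.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(
    instruction="Summarize the text below in one sentence.",
    input="Large language models can perform new tasks from natural language descriptions alone...",
)
print(prompt)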

2. Specifying the output format for parseability:

After providing a prompt to the model, you'll want to extract the content you need from the model's output. Ideally, these outputs should be in a format that can be easily parsed programmatically. Depending on the task, such as text classification, this might involve applying regular expressions (regex) to sift through the LLM's output. In contrast, for tasks that require more fine-grained data, such as named entity recognition (NER), you might prefer a format like JSON for the output.

However, the more you work with LLMs, the faster you learn that getting parseable outputs can be challenging. LLMs often struggle to deliver output in exactly the format the user asked for. While strategies like few-shot prompting can significantly mitigate this, getting consistent, programmatically parseable outputs from an LLM requires careful experimentation and tuning.
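
To make this concrete, here is a small sketch of two such parsers; the function names and the expected output shapes are illustrative assumptions rather than part of any particular library:

import json
import re

def parse_sentiment(output):
    """Extract a sentiment label from free-form model output using regex."""
    match = re.search(r"\b(positive|negative|neutral)\b", output.lower())
    return match.group(1) if match else ""

def parse_entities(output):
    """Extract a JSON list of entities, tolerating surrounding prose."""
    match = re.search(r"\[.*\]", output, re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

print(parse_sentiment("The review is clearly Positive overall."))  # -> positive
print(parse_entities('Entities: [{"entity": "aspirin", "type": "DRUG"}]'))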

3. Prompting for optimal performance:

LLMs are highly sensitive to how a task is described. A poorly crafted prompt, or one that leaves too much room for interpretation, can lead to subpar performance. Imagine explaining a task to a person: the clearer and more detailed your explanation, the better their understanding. There is, however, no magic formula for arriving at the ideal prompt. It takes careful experimentation and evaluation of different prompts to select the best-performing one.

Exploring Different Prompting Strategies

Hopefully you're convinced by now that prompting deserves to be taken seriously. If prompting is a toolkit, what tools can we reach for?

Zero-shot prompting:

Zero-shot prompting [2][3] involves instructing an LLM to perform a task described only in the prompt, without providing any examples. The term "zero-shot" indicates that the model must rely entirely on the task description in the prompt, since it receives no task-specific demonstrations.

An overview of Zero-Shot Prompting.

In many cases, zero-shot prompting is sufficient for instructing an LLM to perform the desired task. However, it can fall short if your task is too ambiguous, open-ended, or vague. Suppose you want an LLM to rate an answer on a scale of 1 to 5 (a bare-bones version of such a prompt is sketched after this list). While the model can perform this task with zero-shot prompting, two issues can arise:

  1. The LLM may not have an objective understanding of what each number on the rating scale means. If the task description lacks nuance, it may struggle to decide when to give an answer a 3 versus a 4.
  2. The LLM may have its own notion of what scores 1 through 5 mean, which may conflict with your personal rubric. You might prioritize factual accuracy when rating, while the model might evaluate answers based on how well they are written.
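
For instance, a hypothetical zero-shot rating prompt might look like this, leaving both questions above open:

Rate the following answer to the question on a scale of 1 to 5.
Your response should be in the format "Rating: <score>".

Question: {question}
Answer: {answer}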

To align the model with your rating expectations, you can provide a few example answers along with how you would rate them. The model then has more context and reference for how to score documents, narrowing the ambiguity in the task. This brings us to few-shot prompting.

Few-shot prompting:

Few-shot prompting enriches the task description with a small number of example inputs and their corresponding outputs [3]. This technique enhances the model's understanding by including several example pairs that illustrate the task.

An overview of Few-Shot Prompting. (Image by the author)

For example, to instruct an LLM to classify the sentiment of movie reviews, you would provide a few reviews along with their sentiment labels. The key benefit over zero-shot prompting is the ability to demonstrate with examples how the task should be performed, rather than expecting the LLM to carry it out from the description alone.

Chain-of-Thought:

Chain-of-Thought (CoT) prompting [6] is a technique that enables LLMs to solve complex problems by breaking them down into simpler, intermediate steps. It encourages the model to "think out loud", making its reasoning process transparent and allowing the LLM to tackle reasoning problems more effectively. As the authors of this work [6] note, CoT mimics how humans solve reasoning problems: by decomposing a problem into simpler steps and solving them one at a time, rather than jumping straight to the answer.

 

 

An overview of Chain-of-Thought Prompting. 

CoT prompting is usually implemented as few-shot prompting, where the model receives a task description and example input-output pairs. These examples include the reasoning steps that systematically lead to the correct answer, demonstrating how to process the information. Effective CoT prompting therefore requires high-quality demonstration examples from the user, which can be challenging for tasks requiring specialized domain expertise. For example, using an LLM for medical diagnosis based on a patient's history would require the help of a domain expert, such as a doctor or physician, to articulate the correct reasoning steps. Furthermore, CoT is particularly effective in models of sufficiently large parameter scale. According to the paper [6], CoT works best for the 137B-parameter LaMDA [7], 175B-parameter GPT-3 [3], and 540B-parameter PaLM [8] models. This can limit its applicability to smaller-scale models.

 

Figure taken from [6] (Source) shows that the performance improvement provided by CoT prompting improves substantially with the scale of the model.

Another way CoT prompting differs from standard prompting is that the model must generate more tokens before arriving at the final answer. This isn't necessarily a drawback, but it is a factor to consider if you are compute-constrained at inference time.
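
To make the structure concrete, here is the canonical arithmetic demonstration from [6]; the worked reasoning appears in the example answer, and the model is expected to produce similar step-by-step reasoning for the new question:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A: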

If you'd like a more in-depth overview, I recommend OpenAI's prompting resources, available at https://platform.openai.com/docs/guides/prompt-engineering/strategy-write-clear-instructions.

How Do We Implement These Techniques?

All code and resources related to this article are available in this GitHub repository, under the introduction_to_prompting folder. Feel free to pull the repository and run the notebooks directly to reproduce these experiments. Please let me know if you have any feedback or observations, or if you notice any mistakes!

To make things easier to follow, we'll explore these techniques on a sample dataset. We'll use the MedQA dataset [9], which contains questions testing medical and clinical knowledge. We will specifically work with the USMLE questions from this dataset. The task is well-suited for analyzing the various prompting techniques, as answering the questions requires both knowledge and reasoning. We'll test the capabilities of Llama 2 7B [10] and GPT-3.5 [11] on this dataset.

Let's download the dataset first. The MedQA dataset can be downloaded from this link. Once downloaded, we can parse it and start processing the questions. The test set contains 1,273 questions in total. We randomly subsample 300 questions from the test set to evaluate the models, and randomly select 3 examples from the training set as our few-shot demonstrations for the models.

import json
import random
random.seed(42)

def read_jsonl_file(file_path):
    """
    Parses a JSONL (JSON Lines) file and returns a list of dictionaries.

    Args:
        file_path (str): The path to the JSONL file to be read.

    Returns:
        list of dict: A list where each element is a dictionary representing
            a JSON object from the file.
    """
    jsonl_lines = []
    with open(file_path, 'r', encoding="utf-8") as file:
        for line in file:
            json_object = json.loads(line)
            jsonl_lines.append(json_object)
            
    return jsonl_lines

def write_jsonl_file(dict_list, file_path):
    """
    Write a list of dictionaries to a JSON Lines file.

    Args:
    - dict_list (list): A list of dictionaries to write to the file.
    - file_path (str): The path to the file where the data will be written.
    """
    with open(file_path, 'w') as file:
        for dictionary in dict_list:
            # Convert the dictionary to a JSON string and write it to the file.
            json_line = json.dumps(dictionary)
            file.write(json_line + '\n')


# read the contents of the train and test set
train_set = read_jsonl_file("data_clean/questions/US/4_options/phrases_no_exclude_train.jsonl")
test_set = read_jsonl_file("data_clean/questions/US/4_options/phrases_no_exclude_test.jsonl")

# subsample test set samples and few-shot samples
test_set_subsampled = random.sample(test_set, 300)
few_shot_examples = random.sample(train_set, 3)

# dump the sampled questions and few-shot samples as jsonl files
write_jsonl_file(test_set_subsampled, "USMLE_test_samples_300.jsonl")
write_jsonl_file(few_shot_examples, "USMLE_few_shot_samples.jsonl")

Zero-Shot Prompting Llama 2 7B-chat

The Llama family of models was released by Meta. They are a decoder-only family of LLMs spanning parameter counts from 7B to 70B. The Llama 2 family comes in two variants: the base version and the chat/instruction-tuned version. For this exercise, we'll use the chat version of the Llama 2-7B model.

Let's see how far we can get by prompting the Llama model to answer these medical questions. We load the model into memory:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm

questions = read_jsonl_file("USMLE_test_samples_300.jsonl")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16).cuda()
model.eval()

If you're using an Nvidia Ampere GPU, you can load the model with torch.bfloat16. It speeds up inference and uses less GPU memory than full-precision FP32.
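
If you're not sure whether your GPU supports bfloat16, a small guard like the following sketch falls back to float16:

import torch

# Prefer bfloat16 on GPUs that support it (e.g., Ampere and newer),
# and fall back to float16 otherwise. Pass the resulting dtype as
# torch_dtype when calling from_pretrained.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16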

First, let's craft a basic prompt for our task:

PROMPT = """You will be provided with a medical or clinical question, along with multiple possible answer choices. Pick the right answer from the choices.
Your response should be in the format "The answer is <correct_choice>". Do not add any other unnecessary content in your response"""

Our prompt is simple: it includes information about the nature of the task along with instructions for the output format. We'll soon find out how effective this prompt is in practice.

Llama 2 chat models use a specific chat template for prompting:

<s>[INST] <<SYS>>
You will be provided with a medical or clinical question, along with multiple possible answer choices. Pick the right answer from the choices.
Your response should be in the format "The answer is <correct_choice>". Do not add any other unnecessary content in your response
<</SYS>>

A 21-year-old male presents to his primary care provider for fatigue. He reports that he graduated from college last month and returned 3 days ago from a 2 week vacation to Vietnam and Cambodia. For the past 2 days, he has developed a worsening headache, malaise, and pain in his hands and wrists. The patient has a past medical history of asthma managed with albuterol as needed. He is sexually active with both men and women, and he uses condoms “most of the time.” On physical exam, the patient’s temperature is 102.5°F (39.2°C), blood pressure is 112/66 mmHg, pulse is 105/min, respirations are 12/min, and oxygen saturation is 98% on room air. He has tenderness to palpation over his bilateral metacarpophalangeal joints and a maculopapular rash on his trunk and upper thighs. Tourniquet test is negative. Laboratory results are as follows:

Hemoglobin: 14 g/dL
Hematocrit: 44%
Leukocyte count: 3,200/mm^3
Platelet count: 112,000/mm^3

Serum:
Na+: 142 mEq/L
Cl-: 104 mEq/L
K+: 4.6 mEq/L
HCO3-: 24 mEq/L
BUN: 18 mg/dL
Glucose: 87 mg/dL
Creatinine: 0.9 mg/dL
AST: 106 U/L
ALT: 112 U/L
Bilirubin (total): 0.8 mg/dL
Bilirubin (conjugated): 0.3 mg/dL

Which of the following is the most likely diagnosis in this patient?
Options:
A. Chikungunya
B. Dengue fever
C. Epstein-Barr virus
D. Hepatitis A [/INST]

The task description is provided between the <<SYS>> tags, followed by the actual question the model needs to answer. The prompt ends with the [/INST] token, which indicates the end of the input text.

Roles can be one of "user", "system", or "assistant". The "system" role provides the model with the task description, while the "user" role contains the input the model should respond to. This is the same convention we'll use later when interacting with GPT-3.5. When we append few-shot examples later, this amounts to creating a fictitious multi-turn conversation history for Llama 2, in which each turn corresponds to an example demonstration and the model's ideal output.

Sounds complicated? Thankfully, the Hugging Face Transformers library supports converting prompts into chat templates, and we'll take advantage of this to make our lives easier. Let's start with helper functions for processing the dataset and creating prompts.
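
As a quick illustration of what this conversion does (assuming the tokenizer loaded earlier), a list of role/content messages is rendered into the [INST]/<<SYS>> template shown above:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What causes dengue fever?"},
]
# Renders the messages into Llama 2's chat template, producing a string like:
# <s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nWhat causes dengue fever? [/INST]
print(tokenizer.apply_chat_template(messages, tokenize=False))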

def create_query(item):
    """
    Creates the input for the model using the question and the multiple choice options.

    Args:
        item (dict): A dictionary containing the question and options.
            Expected keys are "question" and "options", where "options" is another
            dictionary with keys "A", "B", "C", and "D".

    Returns:
        str: A formatted query combining the question and options, ready for use.
    """
    query = item["question"] + "\nOptions:\n" + \
            "A. " + item["options"]["A"] + "\n" + \
            "B. " + item["options"]["B"] + "\n" + \
            "C. " + item["options"]["C"] + "\n" + \
            "D. " + item["options"]["D"]
    return query

def build_zero_shot_prompt(system_prompt, question):
    """
    Builds the zero-shot prompt.

    Args:
        system_prompt (str): Task Instruction
        question (dict): The question for which to create a query, formatted as
            required by `create_query`.

    Returns:
        list of dict: A list of messages, including a system message defining
            the task and a user message with the input question.
    """
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": create_query(question)}]
    return messages

This function constructs the query to supply to the LLM. The MedQA dataset stores each question as a JSON element, with the question and the options provided as keys. We parse the JSON and construct the question together with its answer choices.

Let's start getting outputs from the model. The task at hand involves answering the given medical question by picking the correct answer from the provided options. Unlike creative tasks such as content writing or summarization, which may call for imagination and creativity in the output, this is a knowledge-driven task intended to test the model's ability to answer questions based on the knowledge encoded in its parameters. We'll therefore use greedy decoding when generating answers. Let's define helper functions for parsing the model's responses and computing accuracy.

 

import re

pattern = re.compile(r"([A-Z])\.\s*(.*)")

def parse_answer(response):
    """
    Extracts the answer option from the predicted string.

    Args:
    - response (str): The string to search for the pattern.

    Returns:
    - str: The matched answer option if found or an empty string otherwise.
    """
    match = re.search(pattern, response)
    if match:
        letter = match.group(1)
    else:
        letter = ""
    
    return letter

def calculate_accuracy(ground_truth, predictions):
    """
    Calculates the accuracy of predictions compared to ground truth labels.

    Args:
    - ground_truth (list): A list of true labels.
    - predictions (list): A list of predicted labels.

    Returns:
    - float: The accuracy of predictions as a fraction of correct predictions over total predictions.
    """
    return sum([1 if x==y else 0 for x,y in zip(ground_truth, predictions)]) / len(ground_truth)
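
As a quick sanity check, parse_answer extracts the option letter only when the response follows the requested "The answer is <letter>." style:

print(parse_answer("The answer is B."))  # -> "B"
print(parse_answer("I think it might be Dengue fever"))  # -> "" (no "<letter>." pattern)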

 

# map each question's correct answer text back to its option letter
ground_truth = []

for item in questions:
    ans_options = item["options"]
    correct_ans_option = ""
    for key,value in ans_options.items():
        if value == item["answer"]:
            correct_ans_option = key
            break
            
    ground_truth.append(correct_ans_option)

 

zero_shot_llama_answers = []
for item in tqdm(questions):
    zero_shot_prompt_messages = build_zero_shot_prompt(PROMPT, item)
    prompt = tokenizer.apply_chat_template(zero_shot_prompt_messages, tokenize=False)
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=False)
 
    # https://github.com/huggingface/transformers/issues/17117#issuecomment-1124497554
    gen_text = tokenizer.batch_decode(outputs.detach().cpu().numpy()[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    zero_shot_llama_answers.append(gen_text.strip())

zero_shot_llama_predictions = [parse_answer(x) for x in zero_shot_llama_answers]
print(calculate_accuracy(ground_truth, zero_shot_llama_predictions))

In the zero-shot setting, we reach a performance of 36%. That's a decent start, but let's see whether we can push it further.

Few-Shot Prompting Llama 2 7B-chat

Now let's provide the model with task demonstrations. We take the three randomly sampled questions from the training set and append them to the prompt as task demonstrations. Conveniently, the chat-template support provided by the Transformers library and the tokenizer lets us append our few-shot examples with minimal code changes.

def build_few_shot_prompt(system_prompt, content, few_shot_examples):
    """
    Builds the few-shot prompt using provided examples.

    Args:
        system_prompt (str): Task description for the LLM
        content (dict): The content for which to create a query, similar to the
            structure required by `create_query`.
        few_shot_examples (list of dict): Examples to simulate a hypothetical
            conversation. Each dict must have "options" and an "answer".

    Returns:
        list of dict: A list of messages, simulating a conversation with
            few-shot examples, followed by the current user query.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for item in few_shot_examples:
        ans_options = item["options"]
        correct_ans_option = ""
        for key, value in ans_options.items():
            if value == item["answer"]:
                correct_ans_option = key
                break
        messages.append({"role": "user", "content": create_query(item)})
        messages.append({"role": "assistant", "content": "The answer is " + correct_ans_option + "."})
    messages.append({"role": "user", "content": create_query(content)})
    return messages

few_shot_prompts = read_jsonl_file("USMLE_few_shot_samples.jsonl")

Let's visualize what our few-shot prompt looks like:

<s>[INST] <<SYS>>
You will be provided with a medical or clinical question, along with multiple possible answer choices. Pick the right answer from the choices.
Your response should be in the format "The answer is <correct_choice>". Do not add any other unnecessary content in your response
<</SYS>>

A 30-year-old woman presents to the clinic because of fever, joint pain, and a rash on her lower extremities. She admits to intravenous drug use. Physical examination reveals palpable petechiae and purpura on her lower extremities. Laboratory results reveal a negative antinuclear antibody, positive rheumatoid factor, and positive serum cryoglobulins. Which of the following underlying conditions in this patient is responsible for these findings?
Options:
A. Hepatitis B infection
B. Hepatitis C infection
C. HIV infection
D. Systemic lupus erythematosus (SLE) [/INST] The answer is B. </s><s>[INST] A 10-year-old child presents to your office with a chronic cough. His mother states that he has had a cough for the past two weeks that is non-productive along with low fevers of 100.5 F as measured by an oral thermometer. The mother denies any other medical history and states that he has been around one other friend who also has had this cough for many weeks. The patient's vitals are within normal limits with the exception of his temperature of 100.7 F. His chest radiograph demonstrated diffuse interstitial infiltrates. Which organism is most likely causing his pneumonia?
Options:
A. Mycoplasma pneumoniae
B. Staphylococcus aureus
C. Streptococcus pneumoniae
D. Streptococcus agalactiae [/INST] The answer is A. </s><s>[INST] A 44-year-old with a past medical history significant for human immunodeficiency virus infection presents to the emergency department after he was found to be experiencing worsening confusion. The patient was noted to be disoriented by residents and staff at the homeless shelter where he resides. On presentation he reports headache and muscle aches but is unable to provide more information. His temperature is 102.2°F (39°C), blood pressure is 112/71 mmHg, pulse is 115/min, and respirations are 24/min. Knee extension with hips flexed produces significant resistance and pain. A lumbar puncture is performed with the following results:

Opening pressure: Normal
Fluid color: Clear
Cell count: Increased lymphocytes
Protein: Slightly elevated

Which of the following is the most likely cause of this patient's symptoms?
Options:
A. Cryptococcus
B. Group B streptococcus
C. Herpes simplex virus
D. Neisseria meningitidis [/INST] The answer is C. </s><s>[INST] A 21-year-old male presents to his primary care provider for fatigue. He reports that he graduated from college last month and returned 3 days ago from a 2 week vacation to Vietnam and Cambodia. For the past 2 days, he has developed a worsening headache, malaise, and pain in his hands and wrists. The patient has a past medical history of asthma managed with albuterol as needed. He is sexually active with both men and women, and he uses condoms “most of the time.” On physical exam, the patient’s temperature is 102.5°F (39.2°C), blood pressure is 112/66 mmHg, pulse is 105/min, respirations are 12/min, and oxygen saturation is 98% on room air. He has tenderness to palpation over his bilateral metacarpophalangeal joints and a maculopapular rash on his trunk and upper thighs. Tourniquet test is negative. Laboratory results are as follows:

Hemoglobin: 14 g/dL
Hematocrit: 44%
Leukocyte count: 3,200/mm^3
Platelet count: 112,000/mm^3

Serum:
Na+: 142 mEq/L
Cl-: 104 mEq/L
K+: 4.6 mEq/L
HCO3-: 24 mEq/L
BUN: 18 mg/dL
Glucose: 87 mg/dL
Creatinine: 0.9 mg/dL
AST: 106 U/L
ALT: 112 U/L
Bilirubin (total): 0.8 mg/dL
Bilirubin (conjugated): 0.3 mg/dL

Which of the following is the most likely diagnosis in this patient?
Options:
A. Chikungunya
B. Dengue fever
C. Epstein-Barr virus
D. Hepatitis A [/INST]

The prompt is quite long, since we append three demonstrations. Now let's run Llama 2 with this prompt and collect the results:

few_shot_llama_answers = []
for item in tqdm(questions):
    few_shot_prompt_messages = build_few_shot_prompt(PROMPT, item, few_shot_prompts)
    prompt = tokenizer.apply_chat_template(few_shot_prompt_messages, tokenize=False)
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=False)
    gen_text = tokenizer.batch_decode(outputs.detach().cpu().numpy()[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    few_shot_llama_answers.append(gen_text.strip())

few_shot_llama_predictions = [parse_answer(x) for x in few_shot_llama_answers]
print(calculate_accuracy(ground_truth, few_shot_llama_predictions))

We now reach an overall accuracy of 41.67%. Not bad; that's nearly a 6% improvement over zero-shot prompting with the same model!

What Happens If We Don't Adhere to the Chat Template?

Earlier, I noted that it is advisable to structure prompts according to the template the LLM was originally fine-tuned with. Let's verify whether failing to adhere to the chat template affects our performance. We create a function that builds a few-shot prompt using the same examples but without following the chat format:

def build_few_shot_prompt_wo_chat_template(system_prompt, content, few_shot_examples):
    """
    Builds the few-shot prompt using provided examples, bypassing the chat-template
    for Llama-2.

    Args:
        system_prompt (str): Task description for the LLM
        content (dict): The content for which to create a query, similar to the
            structure required by `create_query`.
        few_shot_examples (list of dict): Examples to simulate a hypothetical
            conversation. Each dict must have "options" and an "answer".

    Returns:
        str: few-shot prompt in non-chat format
    """
    few_shot_prompt = ""
    few_shot_prompt += "Task: " + system_prompt + "\n"
    for item in few_shot_examples:
        ans_options = item["options"]
        correct_ans_option = ""
        for key, value in ans_options.items():
            if value == item["answer"]:
                correct_ans_option = key
                break
        few_shot_prompt += create_query(item) + "\n" + "The answer is " + correct_ans_option + "." + "\n"
    
    few_shot_prompt += create_query(content) + "\n"
    return few_shot_prompt

Our prompt now looks like this:

Task: You will be provided with a medical or clinical question, along with multiple possible answer choices. Pick the right answer from the choices.
Your response should be in the format "The answer is <correct_choice>". Do not add any other unnecessary content in your response
A 30-year-old woman presents to the clinic because of fever, joint pain, and a rash on her lower extremities. She admits to intravenous drug use. Physical examination reveals palpable petechiae and purpura on her lower extremities. Laboratory results reveal a negative antinuclear antibody, positive rheumatoid factor, and positive serum cryoglobulins. Which of the following underlying conditions in this patient is responsible for these findings?
Options:
A. Hepatitis B infection
B. Hepatitis C infection
C. HIV infection
D. Systemic lupus erythematosus (SLE)
The answer is B.
A 10-year-old child presents to your office with a chronic cough. His mother states that he has had a cough for the past two weeks that is non-productive along with low fevers of 100.5 F as measured by an oral thermometer. The mother denies any other medical history and states that he has been around one other friend who also has had this cough for many weeks. The patient's vitals are within normal limits with the exception of his temperature of 100.7 F. His chest radiograph demonstrated diffuse interstitial infiltrates. Which organism is most likely causing his pneumonia?
Options:
A. Mycoplasma pneumoniae
B. Staphylococcus aureus
C. Streptococcus pneumoniae
D. Streptococcus agalactiae
The answer is A.
A 44-year-old with a past medical history significant for human immunodeficiency virus infection presents to the emergency department after he was found to be experiencing worsening confusion. The patient was noted to be disoriented by residents and staff at the homeless shelter where he resides. On presentation he reports headache and muscle aches but is unable to provide more information. His temperature is 102.2°F (39°C), blood pressure is 112/71 mmHg, pulse is 115/min, and respirations are 24/min. Knee extension with hips flexed produces significant resistance and pain. A lumbar puncture is performed with the following results:

Opening pressure: Normal
Fluid color: Clear
Cell count: Increased lymphocytes
Protein: Slightly elevated

Which of the following is the most likely cause of this patient's symptoms?
Options:
A. Cryptococcus
B. Group B streptococcus
C. Herpes simplex virus
D. Neisseria meningitidis
The answer is C.
A 21-year-old male presents to his primary care provider for fatigue. He reports that he graduated from college last month and returned 3 days ago from a 2 week vacation to Vietnam and Cambodia. For the past 2 days, he has developed a worsening headache, malaise, and pain in his hands and wrists. The patient has a past medical history of asthma managed with albuterol as needed. He is sexually active with both men and women, and he uses condoms “most of the time.” On physical exam, the patient’s temperature is 102.5°F (39.2°C), blood pressure is 112/66 mmHg, pulse is 105/min, respirations are 12/min, and oxygen saturation is 98% on room air. He has tenderness to palpation over his bilateral metacarpophalangeal joints and a maculopapular rash on his trunk and upper thighs. Tourniquet test is negative. Laboratory results are as follows:

Hemoglobin: 14 g/dL
Hematocrit: 44%
Leukocyte count: 3,200/mm^3
Platelet count: 112,000/mm^3

Serum:
Na+: 142 mEq/L
Cl-: 104 mEq/L
K+: 4.6 mEq/L
HCO3-: 24 mEq/L
BUN: 18 mg/dL
Glucose: 87 mg/dL
Creatinine: 0.9 mg/dL
AST: 106 U/L
ALT: 112 U/L
Bilirubin (total): 0.8 mg/dL
Bilirubin (conjugated): 0.3 mg/dL

Which of the following is the most likely diagnosis in this patient?
Options:
A. Chikungunya
B. Dengue fever
C. Epstein-Barr virus
D. Hepatitis A

Now let's evaluate Llama 2 with this prompt and observe its performance:

few_shot_llama_answers_wo_chat_template = []
for item in tqdm(questions):
    prompt = build_few_shot_prompt_wo_chat_template(PROMPT, item, few_shot_prompts)
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=False)
    gen_text = tokenizer.batch_decode(outputs.detach().cpu().numpy()[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    few_shot_llama_answers_wo_chat_template.append(gen_text.strip())

few_shot_llama_predictions_wo_chat_template = [parse_answer(x) for x in few_shot_llama_answers_wo_chat_template]
print(calculate_accuracy(ground_truth, few_shot_llama_predictions_wo_chat_template))

We achieve 36% accuracy, nearly 6% lower than our earlier few-shot result. This reinforces the earlier point: it is crucial to structure your prompt according to the template used to fine-tune the LLM you intend to use. Prompt templates matter!

Prompting Llama 2 7B-chat with CoT Prompting

Let's wrap up by evaluating CoT prompting. Remember that our dataset consists of questions designed to test medical knowledge on the USMLE exam. These questions typically require both factual recall and conceptual reasoning to answer, which makes this a perfect task for testing how well CoT works.

First, we must provide the model with an example CoT prompt that demonstrates how to reason about the questions. For this, we'll use one of the prompts from Google's MedPaLM paper [12].

Five-shot CoT prompt used for evaluating the MedPALM model on the MedQA dataset. Prompt borrowed from Table A.18, Page 41 of [12] (Source).

We use this five-shot CoT prompt to evaluate the models. Since this prompting style differs slightly from our previous prompts, let's again create some helper functions to process them and obtain the outputs. When prompting with CoT, we generate outputs with a larger output-token budget so the model can "think" and "reason" before answering the question.

def create_query_cot(item):
    """
    Creates the input for the model using the question and the multiple choice options in the CoT format.

    Args:
        item (dict): A dictionary containing the question and options.
            Expected keys are "question" and "options", where "options" is another
            dictionary with keys "A", "B", "C", and "D".

    Returns:
        str: A formatted query combining the question and options, ready for use.
    """
    query = "Question: " + item["question"] + "\n" + \
            "(A) " + item["options"]["A"] + " " +  \
            "(B) " + item["options"]["B"] + " " +  \
            "(C) " + item["options"]["C"] + " " +  \
            "(D) " + item["options"]["D"]
    return query

def build_cot_prompt(instruction, input_question, cot_examples):
    """
    Builds the few-shot CoT prompt using the provided examples.

    Args:
        instruction (str): Task description for the LLM
        input_question (dict): The question for which to create a query, similar
            to the structure required by `create_query_cot`.
        cot_examples (list of dict): CoT examples to simulate a hypothetical
            conversation. Each dict must have a "question" and an "explanation".

    Returns:
        list of dict: A list of messages, simulating a conversation with
            few-shot examples, followed by the current user query.
    """
    
    messages = [{"role": "system", "content": instruction}]
    for item in cot_examples:
        messages.append({"role": "user", "content": item["question"]})
        messages.append({"role": "assistant", "content": item["explanation"]})

    
    messages.append({"role": "user", "content": create_query_cot(input_question)})
    
    return messages

def parse_answer_cot(text):
    """
    Extracts the choice from a string that follows the pattern "Answer: (Choice) Text".

    Args:
    - text (str): The input string from which to extract the choice.

    Returns:
    - str: The extracted choice or a message indicating no match was found.
    """
    # Regex pattern to match the answer part
    pattern = r"Answer: (.*)"

    # Search for the pattern in the text and extract the matching group
    match = re.search(pattern, text)
    
    if match:
        if len(match.group(1)) > 1:
            return match.group(1)[1]
        else:
            return ""
    else:
        return ""

# COT_INSTRUCTION and COT_EXAMPLES hold the MedPaLM CoT instruction and
# five-shot examples; they are provided in the accompanying repository.
cot_llama_answers = []
for item in tqdm(questions):
    cot_prompt = build_cot_prompt(COT_INSTRUCTION, item, COT_EXAMPLES)
    prompt = tokenizer.apply_chat_template(cot_prompt, tokenize=False)
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=False)
    gen_text = tokenizer.batch_decode(outputs.detach().cpu().numpy()[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    cot_llama_answers.append(gen_text.strip())

cot_llama_predictions = [parse_answer_cot(x) for x in cot_llama_answers]
print(calculate_accuracy(ground_truth, cot_llama_predictions))

With CoT prompting, Llama 2-7B's performance drops to 20%. This is broadly consistent with the findings of the CoT paper [6], where the authors note that CoT is an emergent property of LLMs that improves with model scale. That said, let's analyze why performance dropped so sharply.

CoT Failure Modes for Llama 2

We sampled some of the responses Llama 2 produced for the test-set questions to analyze the failure cases:

Sample Prediction 1 — The model arrives at an answer but does not adhere to the format, making parsing the result hard. (Image by the author)

Sample Prediction 2 — The model fails to adhere to the prompt format and arrive at a conclusive answer. (Image by the author)

While CoT prompting lets the model "think" before arriving at a final answer, in most failure cases the model either fails to arrive at a conclusive answer or states the answer in a format inconsistent with our example demonstrations. One failure mode I haven't analyzed here, but which might be worth exploring, is examining test-set cases where the model "reasons" incorrectly and therefore arrives at the wrong answer. That is beyond the scope of the current article, and of my medical knowledge, but I'll certainly revisit it later.
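
One possible mitigation, sketched below but not used for the numbers reported in this article, is a more lenient fallback parser that takes the last standalone option letter mentioned in the generation:

def parse_answer_cot_lenient(text):
    """
    Fallback parser: returns the last standalone option letter (A-D)
    in the text, or "" if none is found. Note that this heuristic can
    be fooled, e.g., by a standalone "A" used as an article.
    """
    matches = re.findall(r"\b([A-D])\b", text)
    return matches[-1] if matches else ""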

Zero-Shot Prompting GPT-3.5

Let's start by defining some helper functions to process the inputs for the GPT API. You'll need to generate an API key to use the GPT-3.5 API. On Windows, you can set the API key with:

setx OPENAI_API_KEY "your-api-key-here"

or on Linux with:

export OPENAI_API_KEY="your-api-key-here"

which sets the key for the current session you're working in.
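
Before making any API calls, you can verify that the key is visible to your Python session with a quick check:

import os

# The OpenAI client reads OPENAI_API_KEY from the environment by default.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"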

 

from openai import OpenAI
import re
from tqdm import tqdm

# assuming you have already set the secret key using env variable
# if not, you can also instantiate the OpenAI client by providing the 
# secret key directly like so:
# I highly recommend not doing this, as it is a best practice to not store
# the api key in your code directly or in any plain-text file for security 
# reasons.
# client = OpenAI(api_key = "")

client = OpenAI() 
def get_response(messages, model_name, temperature = 0.0, max_tokens = 10):
    """
    Obtains the responses/answers of the model through the chat-completions API.

    Args:
        messages (list of dict): The built messages provided to the API.
        model_name (str): Name of the model to access through the API
        temperature (float): A value between 0 and 1 that controls the randomness of the output.
        A temperature value of 0 ideally makes the model pick the most likely token, making the outputs (mostly) deterministic.
        max_tokens (int): Maximum number of tokens that the model should generate

    Returns:
        str: The response message content from the model.
    """
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

 

This function constructs prompts in the format required by the GPT-3.5 API. We interact with the GPT-3.5 model through the chat completions API provided by the library. The API requires messages to be structured as a list of dictionaries before being sent. Each message must specify a role and content. The conventions followed for the "system", "user", and "assistant" roles are the same as those described earlier for the Llama 2-7B chat model.

Now let's process the test set using the GPT-3.5 API and obtain the responses. Once we've received all the responses, we extract the chosen options from the model's responses and compute the accuracy.

zero_shot_gpt_answers = []
for item in tqdm(questions):
    zero_shot_prompt_messages = build_zero_shot_prompt(PROMPT, item)
    answer = get_response(zero_shot_prompt_messages, model_name = "gpt-3.5-turbo", temperature = 0.0, max_tokens = 10)
    zero_shot_gpt_answers.append(answer)

zero_shot_gpt_predictions = [parse_answer(x) for x in zero_shot_gpt_answers]
print(calculate_accuracy(ground_truth, zero_shot_gpt_predictions))

Our performance now stands at 63%, a substantial improvement over Llama 2-7B. That's not surprising: GPT-3.5 is likely much larger than Llama 2-7B and trained on far more data, on top of whatever proprietary optimizations OpenAI may have built into the model. Let's see how well few-shot prompting works now.

Few-Shot Prompting GPT-3.5

To provide few-shot examples to the LLM, we reuse the three examples sampled from the training set and append them to the prompt. For GPT-3.5, we create a list of messages containing the examples, just as we did for Llama 2 earlier. The inputs are appended under the "user" role, and the corresponding answer choices appear under the "assistant" role. We reuse the earlier function to build the few-shot prompt.

Again, this is equivalent to creating a fictitious multi-turn conversation history for GPT-3.5, in which each turn corresponds to one example demonstration.

Now let's obtain the outputs with GPT-3.5:

few_shot_gpt_answers = []
for item in tqdm(questions):
    few_shot_prompt_messages = build_few_shot_prompt(PROMPT, item, few_shot_prompts)
    answer = get_response(few_shot_prompt_messages, model_name= "gpt-3.5-turbo", temperature = 0.0, max_tokens = 10)
    few_shot_gpt_answers.append(answer)

few_shot_gpt_predictions = [parse_answer(x) for x in few_shot_gpt_answers]
print(calculate_accuracy(ground_truth, few_shot_gpt_predictions))

We've managed to improve performance from 63% to 67% with few-shot prompting! That's a notable improvement, and it highlights the value of providing the model with task demonstrations.

CoT Prompting GPT-3.5

Now let's evaluate GPT-3.5 with CoT prompting. We reuse the same CoT prompt and obtain the outputs:

 

cot_gpt_answers = []
for item in tqdm(questions):
    cot_prompt = build_cot_prompt(COT_INSTRUCTION, item, COT_EXAMPLES)
    answer = get_response(cot_prompt, model_name= "gpt-3.5-turbo", temperature = 0.0, max_tokens = 100)
    cot_gpt_answers.append(answer)

cot_gpt_predictions = [parse_answer_cot(x) for x in cot_gpt_answers]
print(calculate_accuracy(ground_truth, cot_gpt_predictions))

CoT prompting with GPT-3.5 reaches 71% accuracy! That's a further 4% improvement over few-shot prompting. Letting the model "think" out loud before answering appears to benefit this task. This is also consistent with the finding of [6] that CoT unlocks performance improvements at larger parameter scales.
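
Summarizing the accuracies we observed across all settings on the 300-question subsample:

Model              Zero-shot   Few-shot   Few-shot (no chat template)   CoT
Llama 2-7B-chat    36%         41.67%     36%                           20%
GPT-3.5            63%         67%        -                             71%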

Conclusion and Takeaways:

Prompting is a crucial skill for working with large language models (LLMs), and it helps to understand that the prompting toolkit offers a variety of tools for extracting better task performance from LLMs depending on the context. I hope this article serves as a broad and (hopefully!) accessible introduction to the topic. It is not, however, intended as a comprehensive overview of all prompting strategies. Prompting remains a very active area of research, with many methods being introduced, such as ReAct [13], Tree of Thoughts prompting [14], and more. I recommend exploring these techniques to understand them better and to expand your prompting toolkit.

Reproducibility

In this article, I've aimed to make all experiments as deterministic and reproducible as possible. We use greedy decoding to obtain outputs for zero-shot, few-shot, and CoT prompting with Llama 2. While these scores should technically be reproducible, in rare cases CUDA/GPU-related or library issues may lead to slightly different results.

Similarly, when obtaining responses from the GPT-3.5 API, we use a temperature of 0 to pick only the most likely next token without sampling, across all prompt settings. This makes the results "mostly deterministic", so sending the same prompt to GPT-3.5 again may still produce slightly different results.

I have provided the model outputs under all prompt settings, along with the subsampled test set, the few-shot examples, and the CoT prompt (from the MedPaLM paper), for reproducing the scores reported in this article.

References:

All papers referred to in this blog post are listed here. Please let me know if I have missed any references, and I will add them!

[1] Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., … & Hu, X. (2023). Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712.

[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

[4] Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., … & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

[5] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.

[6] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.

[7] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H. T., … & Le, Q. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

[8] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … & Fiedel, N. (2023). PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1–113.

[9] Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421.

[10] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[11] https://platform.openai.com/docs/models/gpt-3-5-turbo

[12] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., … & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180.

[13] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., & Cao, Y. (2022, September). ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations.

[14] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2024). Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
