MemGPT: Towards LLMs as Operating Systems-大模型长记忆解决方案


这是一篇来自加州大学伯克利分校的论文,针对大模型受限的上下文窗口提出了一种解决方案。这里进行了翻译,方便从事应用开发、Agent开发等场景的研发人员参考其原理。

Abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems which provide the illusion of an extended virtual memory via paging between physical memory and disk. Using this technique, we introduce MemGPT (MemoryGPT), a system that intelligently manages different storage tiers in order to effectively provide extended context within the LLM’s limited context window. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM’s context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://research.memgpt.ai.

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis.
大型语言模型(LLMs)已经彻底改变了人工智能,但受限于有限的上下文窗口,这阻碍了它们在如扩展对话和文档分析等任务中的实用性。

To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems which provide the illusion of an extended virtual memory via paging between physical memory and disk.
为了能够使用超出有限上下文窗口的上下文,我们提出了虚拟上下文管理,这是一种从传统操作系统中的分层内存系统汲取灵感的技术,这些系统通过物理内存和磁盘之间的分页提供扩展虚拟内存的幻觉。

Using this technique, we introduce MemGPT (MemoryGPT), a system that intelligently manages different storage tiers in order to effectively provide extended context within the LLM’s limited context window.
利用这种技术,我们引入了MemGPT(MemoryGPT),这是一个能够智能管理不同存储层次的系统,以便在LLM有限的上下文窗口内有效地提供扩展上下文。

We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM’s context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users.
我们在两个领域评估了我们受操作系统启发的设计,其中现代LLMs的有限上下文窗口严重限制了它们的表现:文档分析,其中MemGPT能够分析远远超过底层LLM上下文窗口的大型文档;多会话聊天,其中MemGPT可以创建能够记住、反思并通过与用户的长期交互动态发展的对话代理。

We release MemGPT code and data for our experiments at https://research.memgpt.ai.
我们在 https://research.memgpt.ai 上发布了MemGPT代码和我们实验的数据。

1. Introduction

In recent years, large language models (LLMs) and their underlying transformer architecture (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022) have become the cornerstone of conversational AI and have led to a wide array of consumer and enterprise applications. Despite these advances, the limited fixed-length context windows used by LLMs significantly hinder their applicability to long conversations or reasoning about long documents. For example, the most widely used open-source LLMs can only support a few dozen back-and-forth messages or reason about a short document before exceeding their maximum input length (Touvron et al., 2023).

In recent years, large language models (LLMs) and their underlying transformer architecture (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022) have become the cornerstone of conversational AI and have led to a wide array of consumer and enterprise applications.
近年来,大型语言模型(LLMs)及其底层的变换器架构(Vaswani等人,2017年;Devlin等人,2018年;Brown等人,2020年;Ouyang等人,2022年)已成为对话式人工智能的基石,并导致了大量消费者和企业应用程序的出现。

Despite these advances, the limited fixed-length context windows used by LLMs significantly hinder their applicability to long conversations or reasoning about long documents.
尽管取得了这些进步,LLMs使用的有限固定长度上下文窗口显著限制了它们在长对话或对长文档进行推理的应用能力。

For example, the most widely used open-source LLMs can only support a few dozen back-and-forth messages or reason about a short document before exceeding their maximum input length (Touvron et al., 2023).
例如,最广泛使用的开源LLMs在超过其最大输入长度之前,只能支持几十轮来回消息,或对一篇短文档进行推理(Touvron等人,2023)。

Directly extending the context length of transformers incurs a quadratic increase in computational time and memory cost due to the transformer architecture’s self-attention mechanism, making the design of new long-context architectures a pressing research challenge (Dai et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020). While developing longer models is an active area of research (Dong et al., 2023), even if we could overcome the computational challenges of context scaling, recent research shows that long-context models struggle to utilize additional context effectively (Liu et al., 2023a). As a consequence, given the considerable resources needed to train state-of-the-art LLMs and diminishing returns of context scaling, there is a critical need for alternative techniques to support long context.

Directly extending the context length of transformers incurs a quadratic increase in computational time and memory cost due to the transformer architecture’s self-attention mechanism, making the design of new long-context architectures a pressing research challenge (Dai et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020).
直接扩展变换器的上下文长度会导致计算时间和内存成本呈二次方增加,这是由于变换器架构的自注意力机制造成的,使得设计新的长上下文架构成为一个紧迫的研究挑战。

While developing longer models is an active area of research (Dong et al., 2023), even if we could overcome the computational challenges of context scaling, recent research shows that long-context models struggle to utilize additional context effectively (Liu et al., 2023a).
尽管开发更长的模型是一个活跃的研究领域,即使我们能够克服上下文扩展的计算挑战,最近的研究表明,长上下文模型在有效利用额外上下文方面存在困难。

As a consequence, given the considerable resources needed to train state-of-the-art LLMs and diminishing returns of context scaling, there is a critical need for alternative techniques to support long context.
因此,考虑到训练最先进LLMs所需的大量资源和上下文扩展的递减收益,迫切需要替代技术来支持长上下文。

In this paper, we study how to provide the illusion of an infinite context while continuing to use fixed-context models. Our approach borrows from the idea of virtual memory paging that was developed to enable applications to work on datasets that far exceed the available memory by paging data between main memory and disk. We leverage the recent progress in function calling abilities of LLM agents (Schick et al., 2023; Liu et al., 2023b) to design MemGPT, an OS-inspired LLM system for virtual context management. Using function calls, LLM agents can read and write to external data sources, modify their own context, and choose when to return responses to the user.

In this paper, we study how to provide the illusion of an infinite context while continuing to use fixed-context models.
在本文中,我们研究了如何在继续使用固定上下文模型的同时提供无限上下文的错觉。

Our approach borrows from the idea of virtual memory paging that was developed to enable applications to work on datasets that far exceed the available memory by paging data between main memory and disk.
我们的方法借鉴了虚拟内存分页的思想:该技术通过在主存和磁盘之间对数据进行分页,使应用程序能够处理远超可用内存的数据集。

We leverage the recent progress in function calling abilities of LLM agents (Schick et al., 2023; Liu et al., 2023b) to design MemGPT, an OS-inspired LLM system for virtual context management.
我们利用LLM代理的功能调用能力的最新进展来设计MemGPT,这是一个受操作系统启发的LLM系统,用于虚拟上下文管理。

Using function calls, LLM agents can read and write to external data sources, modify their own context, and choose when to return responses to the user.
使用函数调用,LLM代理可以读写外部数据源,修改自己的上下文,并选择何时向用户返回响应。

These capabilities allow LLMs to effectively “page” in and out information between context windows (analogous to “main memory” in operating systems) and external storage, similar to hierarchical memory in traditional OSes. In addition, function calls can be leveraged to manage control flow between context management, response generation, and user interactions. This allows for an agent to choose to iteratively modify what is in its context for a single task, thereby more effectively utilizing its limited context.

These capabilities allow LLMs to effectively "page" in and out information between context windows (analogous to "main memory" in operating systems) and external storage, similar to hierarchical memory in traditional OSes.
这些能力允许LLMs在上下文窗口(类似于操作系统中的“主存”)和外部存储之间有效地“分页”进出信息,类似于传统操作系统中的分层内存。

In addition, function calls can be leveraged to manage control flow between context management, response generation, and user interactions.
此外,函数调用可以用来管理上下文管理、响应生成和用户交互之间的控制流程。

This allows for an agent to choose to iteratively modify what is in its context for a single task, thereby more effectively utilizing its limited context.
这允许一个代理选择为其单一任务迭代修改其上下文中的内容,从而更有效地利用其有限的上下文。

In MemGPT, we treat context windows as a constrained memory resource, and design a memory hierarchy for LLMs analogous to memory tiers used in traditional OSes (Patterson et al., 1988). Applications in traditional OSes interact with virtual memory, which provides an illusion of there being more memory resources than are actually available in physical (i.e., main) memory by the OS paging overflow data to disk and retrieving data (via a page fault) back into memory when accessed by applications. To provide a similar illusion of longer context length (analogous to virtual memory), we allow the LLM to manage what is placed in its own context (analogous to physical memory) via an ‘LLM OS’, which we call MemGPT. MemGPT enables the LLM to retrieve relevant historical data missing from what is placed in-context, and also evict less relevant data from context and into external storage systems. Figure 3 illustrates the components of MemGPT. The combined use of a memory hierarchy, OS functions, and event-based control flow allows MemGPT to handle unbounded context using LLMs that have finite context windows. To demonstrate the utility of our new OS-inspired LLM system, we evaluate MemGPT on two domains where the performance of existing LLMs is severely limited by finite context: document analysis, where the length of standard text files can quickly exceed the input capacity of modern LLMs, and conversational agents, where LLMs bound by limited conversation windows lack context awareness, persona consistency, and long-term memory during extended conversations. In both settings, MemGPT is able to overcome the limitations of finite context to outperform existing LLM-based approaches.

In MemGPT, we treat context windows as a constrained memory resource, and design a memory hierarchy for LLMs analogous to memory tiers used in traditional OSes.
在MemGPT中,我们将上下文窗口视为受限的内存资源,并为LLMs设计了类似于传统操作系统中使用的内存层次结构。

Applications in traditional OSes interact with virtual memory, which provides an illusion of there being more memory resources than are actually available in physical (i.e., main) memory by the OS paging overflow data to disk and retrieving data (via a page fault) back into memory when accessed by applications.
应用程序在传统操作系统中与虚拟内存交互,虚拟内存通过将溢出数据分页到磁盘,并在应用程序访问时通过页面错误将数据重新检索回内存,从而提供比物理(即主)内存中实际可用的更多的内存资源的错觉。

To provide a similar illusion of longer context length (analogous to virtual memory), we allow the LLM to manage what is placed in its own context (analogous to physical memory) via an ‘LLM OS’, which we call MemGPT.
为了提供类似的更长上下文长度的错觉(类似于虚拟内存),我们允许LLM通过我们称之为MemGPT的“LLM OS”来管理其自己的上下文中放置的内容(类似于物理内存)。

MemGPT enables the LLM to retrieve relevant historical data missing from what is placed in-context, and also evict less relevant data from context and into external storage systems.
MemGPT使LLM能够检索缺失于上下文中的相关历史数据,并且将不太相关的数据从上下文逐出到外部存储系统。

The combined use of a memory hierarchy, OS functions, and event-based control flow allows MemGPT to handle unbounded context using LLMs that have finite context windows.
内存层次结构、操作系统功能和基于事件的控制流程的结合使用,使MemGPT能够使用具有有限上下文窗口的LLMs处理无界上下文。

To demonstrate the utility of our new OS-inspired LLM system, we evaluate MemGPT on two domains where the performance of existing LLMs is severely limited by finite context.
为了展示我们新的受操作系统启发的LLM系统的实用性,我们在现有LLMs性能受到有限上下文严重限制的两个领域对MemGPT进行了评估。

In both settings, MemGPT is able to overcome the limitations of finite context to outperform existing LLM-based approaches.
在这两种情况下,MemGPT都能够克服有限上下文的限制,超越现有的基于LLMs的方法。

2. MemGPT (MemoryGPT)

MemGPT’s OS-inspired multi-level memory architecture delineates between two primary memory types: main context (analogous to main memory/physical memory/RAM) and external context (analogous to disk memory/disk storage). Main context consists of the LLM prompt tokens— anything in main context is considered in-context and can be accessed by the LLM processor during inference. External context refers to any information that is held outside of the LLM’s fixed context window. This out-of-context data must always be explicitly moved into main context in order for it to be passed to the LLM processor during inference. MemGPT provides function calls that allow the LLM processor to manage its own memory without any user intervention.

MemGPT’s OS-inspired multi-level memory architecture delineates between two primary memory types: main context (analogous to main memory/physical memory/RAM) and external context (analogous to disk memory/disk storage).
MemGPT受操作系统启发的多级内存架构区分了两种主要的内存类型:主上下文(类似于主存/物理内存/RAM)和外部上下文(类似于磁盘内存/磁盘存储)。

Main context consists of the LLM prompt tokens— anything in main context is considered in-context and can be accessed by the LLM processor during inference.
主上下文由LLM提示令牌组成——任何在主上下文中的内容都被视为上下文内,并可以在推理期间被LLM处理器访问。

External context refers to any information that is held outside of the LLM’s fixed context window.
外部上下文指的是存储在LLMs固定上下文窗口之外的任何信息。

This out-of-context data must always be explicitly moved into main context in order for it to be passed to the LLM processor during inference.
这些上下文外的数据必须明确移动到主上下文,以便在推理期间传递给LLM处理器。

MemGPT provides function calls that allow the LLM processor to manage its own memory without any user intervention.
MemGPT提供了LLM处理器用来管理其自身内存的函数调用,无需任何用户干预。
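
为帮助理解“主上下文/外部上下文”这种双层结构,下面给出一个极简的Python示意(译者补充的草图,并非论文或MemGPT官方实现;类名与方法名均为假设):

```python
from dataclasses import dataclass, field

@dataclass
class VirtualContext:
    """Two memory tiers: main context is in-context; external context is not."""
    main_context: list = field(default_factory=list)      # prompt tokens (in-context)
    external_context: list = field(default_factory=list)  # out-of-context storage

    def page_in(self, index: int) -> None:
        """Explicitly move an external item into main context before inference."""
        self.main_context.append(self.external_context.pop(index))

    def page_out(self, index: int) -> None:
        """Evict a main-context item to external storage to free prompt space."""
        self.external_context.append(self.main_context.pop(index))
```

要点在于:外部上下文中的数据对LLM不可见,必须先显式地“换入”(page in)主上下文,才能参与推理。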

2.1. Main context (prompt tokens)

The prompt tokens in MemGPT are split into three contiguous sections: the system instructions, working context, and FIFO queue. The system instructions are read-only (static) and contain information on the MemGPT control flow, the intended usage of the different memory levels, and instructions on how to use the MemGPT functions (e.g. how to retrieve out-of-context data). Working context is a fixed-size read/write block of unstructured text, writeable only via MemGPT function calls. In conversational settings, working context is intended to be used to store key facts, preferences, and other important information about the user and the persona the agent is adopting, allowing the agent to converse fluently with the user. The FIFO queue stores a rolling history of messages, including messages between the agent and user, as well as system messages (e.g. memory warnings) and function call inputs and outputs. The first index in the FIFO queue stores a system message containing a recursive summary of messages that have been evicted from the queue.

The prompt tokens in MemGPT are split into three contiguous sections: the system instructions, working context, and FIFO Queue.
MemGPT中的提示令牌被分成三个连续的部分:系统指令、工作上下文和FIFO队列。

The system instructions are read-only (static) and contain information on the MemGPT control flow, the intended usage of the different memory levels, and instructions on how to use the MemGPT functions (e.g. how to retrieve out-of-context data).
系统指令是只读的(静态的),包含有关MemGPT控制流程、不同内存层次的预期用途,以及如何使用MemGPT函数的说明(例如如何检索上下文外的数据)。

Working context is a fixed-size read/write block of unstructured text, writeable only via MemGPT function calls.
工作上下文是一个固定大小的、可读写的非结构化文本块,只能通过MemGPT函数调用来写入。

In conversational settings, working context is intended to be used to store key facts, preferences, and other important information about the user and the persona the agent is adopting, allowing the agent to converse fluently with the user.
在对话设置中,工作上下文旨在用于存储有关用户和代理采纳的角色的关键事实、偏好和其他重要信息,允许代理与用户流利地对话。

The FIFO queue stores a rolling history of messages, including messages between the agent and user, as well as system messages (e.g. memory warnings) and function call inputs and outputs.
FIFO队列存储消息的滚动历史记录,包括代理和用户之间的消息,以及系统消息(例如内存警告)和函数调用的输入输出。

The first index in the FIFO queue stores a system message containing a recursive summary of messages that have been evicted from the queue.
FIFO队列的第一个索引存储包含已从队列中逐出的消息的递归摘要的系统消息。
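
把上述三个连续部分拼接成提示词的过程,可以用如下Python草图示意(译者补充,并非官方实现;分隔符与标签格式均为假设):

```python
def compile_prompt(system_instructions: str, working_context: str,
                   fifo_queue: list, recursive_summary: str) -> str:
    """Concatenate the three contiguous prompt sections. A summary of
    evicted messages always occupies the first slot of the FIFO queue."""
    queue = ["[recursive summary] " + recursive_summary] + list(fifo_queue)
    return "\n\n".join([system_instructions, working_context] + queue)
```

调用示例:`compile_prompt("SYS", "WORK", ["user: hi"], "old messages")` 会生成一个系统指令在前、工作上下文居中、递归摘要位于队列首位的提示字符串。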

2.2. Queue Manager

The queue manager manages messages in recall storage and the FIFO queue. When a new message is received by the system, the queue manager appends the incoming messages to the FIFO queue, concatenates the prompt tokens and triggers the LLM inference to generate LLM output (the completion tokens). The queue manager writes both the incoming message and the generated LLM output to recall storage (the MemGPT message database). When messages in recall storage are retrieved via a MemGPT function call, the queue manager appends them to the back of the queue to reinsert them into the LLM’s context window.

The queue manager manages messages in recall storage and the FIFO queue.
队列管理器负责管理回忆存储中的消息和FIFO队列。

When a new message is received by the system, the queue manager appends the incoming messages to the FIFO queue, concatenates the prompt tokens and triggers the LLM inference to generate LLM output (the completion tokens).
当系统收到新消息时,队列管理器会将传入的消息追加到FIFO队列,连接提示令牌,并触发LLM推理以生成LLM输出(完成令牌)。

The queue manager writes both the incoming message and the generated LLM output to recall storage (the MemGPT message database).
队列管理器将传入的消息和生成的LLM输出都写入回忆存储(MemGPT消息数据库)。

When messages in recall storage are retrieved via a MemGPT function call, the queue manager appends them to the back of the queue to reinsert them into the LLM’s context window.
当通过MemGPT函数调用检索回忆存储中的消息时,队列管理器会将它们追加到队列的末尾,以重新将它们插入到LLM的上下文窗口中。
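
队列管理器“追加消息 → 拼接提示 → 触发推理 → 写入回忆存储”的流程,可以用如下Python草图示意(译者补充,并非官方实现;这里用列表代替真实的消息数据库,检索用简单的子串匹配代替):

```python
class QueueManager:
    """Sketch of the queue manager: routes messages between the FIFO queue
    (in-context) and recall storage (the out-of-context message database)."""

    def __init__(self, llm):
        self.llm = llm            # callable: prompt string -> completion string
        self.fifo_queue = []      # rolling in-context history
        self.recall_storage = []  # full message log

    def handle_message(self, message: str) -> str:
        self.fifo_queue.append(message)           # append incoming message
        prompt = "\n".join(self.fifo_queue)       # concatenate prompt tokens
        output = self.llm(prompt)                 # trigger LLM inference
        self.fifo_queue.append(output)
        self.recall_storage += [message, output]  # persist both to recall storage
        return output

    def retrieve(self, query: str) -> None:
        """Reinsert matching recall-storage messages at the back of the queue."""
        self.fifo_queue += [m for m in self.recall_storage if query in m]
```

注意:无论消息是否还留在FIFO队列中,传入消息和LLM输出都会被完整地写入回忆存储,因此后续随时可以被检索回上下文。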

The queue manager is also responsible for controlling context overflow via a queue eviction policy. When the prompt tokens exceed the ‘warning token count’ of the underlying LLM’s context window (e.g. 70% of the context window), the queue manager inserts a system message into the queue warning the LLM of an impending queue eviction (a ‘memory pressure’ warning) to allow the LLM to use MemGPT functions to store important information contained in the FIFO queue to working context or archival storage (a read/write database storing arbitrary length text objects). When the prompt tokens exceed the ‘flush token count’ (e.g. 100% of the context window), the queue manager flushes the queue to free up space in the context window: the queue manager evicts a specific count of messages (e.g. 50% of the context window) and generates a new recursive summary using the existing recursive summary and the evicted messages. Once the queue is flushed, the evicted messages are no longer in-context or immediately viewable to the LLM; however, they are stored indefinitely in recall storage and readable via MemGPT function calls.

The queue manager is also responsible for controlling context overflow via a queue eviction policy.
队列管理器还负责通过队列逐出策略控制上下文溢出。

When the prompt tokens exceed the ‘warning token count’ of the underlying LLM’s context window (e.g. 70% of the context window), the queue manager inserts a system message into the queue warning the LLM of an impending queue eviction (a ‘memory pressure’ warning) to allow the LLM to use MemGPT functions to store important information contained in the FIFO queue to working context or archival storage (a read/write database storing arbitrary length text objects).
当提示令牌超过底层LLM上下文窗口的“警告令牌计数”(例如,上下文窗口的70%)时,队列管理器会向队列中插入一条系统消息,警告LLM即将进行队列逐出(即“内存压力”警告),以便LLM可以使用MemGPT函数将FIFO队列中包含的重要信息存储到工作上下文或归档存储(一个用于存储任意长度文本对象的可读写数据库)中。

When the prompt tokens exceed the ‘flush token count’ (e.g. 100% of the context window), the queue manager flushes the queue to free up space in the context window: the queue manager evicts a specific count of messages (e.g. 50% of the context window) and generates a new recursive summary using the existing recursive summary and the evicted messages.
当提示令牌超过“刷新令牌计数”(例如,上下文窗口的100%)时,队列管理器会刷新队列以释放上下文窗口中的空间:队列管理器逐出特定数量的消息(例如,上下文窗口的50%),并使用现有的递归摘要和被逐出的消息生成一个新的递归摘要。

Once the queue is flushed, the evicted messages are no longer in-context or immediately viewable to the LLM; however, they are stored indefinitely in recall storage and readable via MemGPT function calls.
一旦队列被刷新,被逐出的消息就不再处于上下文中,也无法立即被LLM看到;但它们会被无限期地存储在回忆存储中,并且可以通过MemGPT函数调用读取。
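
这一逐出策略(70%时警告、100%时刷新并逐出约50%的消息)可以用如下Python草图示意(译者补充,并非官方实现;令牌计数函数与摘要函数由调用方传入,均为假设):

```python
def enforce_eviction_policy(fifo_queue, recursive_summary, count_tokens,
                            context_window, summarize,
                            warn_frac=0.7, evict_frac=0.5):
    """Sketch of the queue eviction policy: insert a memory-pressure warning
    past warn_frac of the window; on overflow, evict roughly evict_frac worth
    of messages from the front and fold them into a new recursive summary."""
    used = sum(count_tokens(m) for m in fifo_queue)
    if used >= context_window:                      # 'flush token count' reached
        evicted, freed = [], 0
        while fifo_queue and freed < evict_frac * context_window:
            msg = fifo_queue.pop(0)                 # oldest messages go first
            freed += count_tokens(msg)
            evicted.append(msg)
        # evicted messages stay readable in recall storage (not modeled here)
        return fifo_queue, summarize(recursive_summary, evicted)
    if used >= warn_frac * context_window:          # 'warning token count'
        fifo_queue.append("[system] memory pressure: save important facts now")
    return fifo_queue, recursive_summary
```

警告消息给了LLM一个时间窗口,让它在刷新发生之前把重要信息转存到工作上下文或归档存储中。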

2.3. Function executor (handling of completion tokens)

MemGPT orchestrates data movement between main context and external context via function calls that are generated by the LLM processor. Memory edits and retrieval are entirely self-directed: MemGPT autonomously updates and searches through its own memory based on the current context. For instance, it can decide when to move items between contexts (e.g. when the conversation history is becoming too long, as shown in Figure 1) and modify its main context to better reflect its evolving understanding of its current objectives and responsibilities (as shown in Figure 3). We implement self-directed editing and retrieval by providing explicit instructions within the system instructions that guide the LLM on how to interact with the MemGPT memory systems. These instructions comprise two main components: (1) a detailed description of the memory hierarchy and their respective utilities, and (2) a function schema (complete with their natural language descriptions) that the system can call to access or modify its memory.

MemGPT orchestrates data movement between main context and external context via function calls that are generated by the LLM processor.
MemGPT通过由LLM处理器生成的函数调用来协调主上下文和外部上下文之间的数据移动。

Memory edits and retrieval are entirely self-directed: MemGPT autonomously updates and searches through its own memory based on the current context.
内存编辑和检索完全自主导向:MemGPT根据当前上下文自主更新和搜索自己的内存。

For instance, it can decide when to move items between contexts (e.g. when the conversation history is becoming too long, as shown in Figure 1) and modify its main context to better reflect its evolving understanding of its current objectives and responsibilities (as shown in Figure 3).
例如,它可以决定何时在上下文之间移动项目(如图1所示,当对话历史变得太长时),并修改其主上下文以更好地反映其对当前目标和责任的不断演变的理解(如图3所示)。

We implement self-directed editing and retrieval by providing explicit instructions within the system instructions that guide the LLM on how to interact with the MemGPT memory systems.
我们通过在系统指令中提供明确的指令来实现自主编辑和检索,这些指令指导LLM如何与MemGPT内存系统交互。

These instructions comprise two main components: (1) a detailed description of the memory hierarchy and their respective utilities, and (2) a function schema (complete with their natural language descriptions) that the system can call to access or modify its memory.
这些指令包括两个主要部分:(1)内存层次结构及其各自用途的详细描述,以及(2)系统可以调用的函数模式(包括它们的自然语言描述),以访问或修改其内存。

During each inference cycle, the LLM processor takes main context (concatenated into a single string) as input, and generates an output string. This output string is parsed by MemGPT to ensure correctness, and if the parser validates the function arguments the function is executed. The results, including any runtime errors that occur (e.g. trying to add to main context when it is already at maximum capacity), are then fed back to the processor by MemGPT. This feedback loop enables the system to learn from its actions and adjust its behavior accordingly. Awareness of context limits is a key aspect in making the self-editing mechanism work effectively; to this end, MemGPT prompts the processor with warnings regarding token limitations to guide its memory management decisions. Additionally, our memory retrieval mechanisms are designed to be cognizant of these token constraints and implement pagination to prevent retrieval calls from overflowing the context window.

During each inference cycle, the LLM processor takes main context (concatenated into a single string) as input, and generates an output string.
在每个推理周期中,LLM处理器将主上下文(串联成一个单一字符串)作为输入,并生成一个输出字符串。

This output string is parsed by MemGPT to ensure correctness, and if the parser validates the function arguments the function is executed.
这个输出字符串由MemGPT解析以确保正确性,如果解析器验证了函数参数,则执行该函数。

The results, including any runtime errors that occur (e.g. trying to add to main context when it is already at maximum capacity), are then fed back to the processor by MemGPT.
然后,结果(包括发生的任何运行时错误,例如尝试在主上下文已达到最大容量时添加内容)由MemGPT反馈给处理器。

This feedback loop enables the system to learn from its actions and adjust its behavior accordingly.
这个反馈循环使系统能够从其行动中学习并相应调整其行为。

Awareness of context limits is a key aspect in making the self-editing mechanism work effectively; to this end, MemGPT prompts the processor with warnings regarding token limitations to guide its memory management decisions.
对上下文限制的认识是使自编辑机制有效工作的关键方面,为此,MemGPT通过有关令牌限制的警告提示处理器,以指导其内存管理决策。

Additionally, our memory retrieval mechanisms are designed to be cognizant of these token constraints and implement pagination to prevent retrieval calls from overflowing the context window.
此外,我们的内存检索机制被设计为意识到这些令牌限制,并实现分页以防止检索调用溢出上下文窗口。
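
“解析 → 校验 → 执行 → 将结果(含运行时错误)反馈给处理器”这一循环,可以用如下Python草图示意(译者补充,并非官方实现;这里假设LLM输出为JSON格式的函数调用,函数名仅作演示):

```python
import json

def execute_completion(output: str, functions: dict) -> str:
    """Parse the LLM's completion tokens as a function call, validate and run
    it, and return a result string that MemGPT feeds back into the prompt.
    Runtime errors are captured and returned to the LLM the same way."""
    try:
        call = json.loads(output)         # e.g. {"function": ..., "args": {...}}
        fn = functions[call["function"]]  # unknown names raise KeyError
        result = fn(**call["args"])       # bad arguments raise TypeError
        return f"[function result] {result}"
    except Exception as err:
        return f"[function error] {type(err).__name__}: {err}"
```

关键设计在于:错误不会让系统崩溃,而是作为普通文本回传给LLM,使其能在下一个推理周期中自行纠正。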

2.4. Control flow and function chaining

In MemGPT, events trigger LLM inference: events are generalized inputs to MemGPT and can consist of user messages (in chat applications), system messages (e.g. main context capacity warnings), user interactions (e.g. an alert that a user just logged in, or an alert that they finished uploading a document), and timed events that are run on a regular schedule (allowing MemGPT to run ‘unprompted’ without user intervention). MemGPT processes events with a parser to convert them into plain text messages that can be appended to main context and eventually be fed as input into the LLM processor.

In MemGPT, events trigger LLM inference: events are generalized inputs to MemGPT and can consist of user messages (in chat applications), system messages (e.g. main context capacity warnings), user interactions (e.g. an alert that a user just logged in, or an alert that they finished uploading a document), and timed events that are run on a regular schedule (allowing MemGPT to run ‘unprompted’ without user intervention).
在MemGPT中,事件触发LLM推理:事件是MemGPT的通用输入,可以包括用户消息(在聊天应用中)、系统消息(例如,主上下文容量警告)、用户交互(例如,提醒用户刚刚登录的通知,或者他们完成了文件上传的通知),以及定期运行的定时事件(允许MemGPT在没有用户干预的情况下“自发”运行)。

MemGPT processes events with a parser to convert them into plain text messages that can be appended to main context and eventually be fed as input into the LLM processor.
MemGPT使用解析器处理事件,将它们转换成纯文本消息,这些消息可以被追加到主上下文,并最终作为输入提供给LLM处理器。

Many practical tasks require calling multiple functions in sequence, for example, navigating through multiple pages of results from a single query or collating data from different documents in main context from separate queries. Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user. In MemGPT, functions can be called with a special flag that requests control be immediately returned to the processor after the requested function completes execution. If this flag is present, MemGPT will add the function output to main context and immediately continue execution of the LLM processor (as opposed to pausing processor execution). If this flag is not present (a yield), MemGPT will not run the LLM processor until the next external event trigger (e.g. a user message or scheduled interrupt).

Many practical tasks require calling multiple functions in sequence, for example, navigating through multiple pages of results from a single query or collating data from different documents in main context from separate queries.
许多实际任务需要顺序调用多个函数,例如,通过单一查询浏览多个结果页面,或者从主上下文中的不同文档中整理来自不同查询的数据。

Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user.
函数链式调用允许MemGPT在将控制权交还给用户之前,顺序执行多个函数调用。

In MemGPT, functions can be called with a special flag that requests control be immediately returned to the processor after the requested function completes execution.
在MemGPT中,可以用一个特殊标志调用函数,该标志请求在请求的函数完成执行后立即将控制权交回给处理器。

If this flag is present, MemGPT will add the function output to main context and immediately continue execution of the LLM processor (as opposed to pausing processor execution).
如果存在这个标志,MemGPT会将函数输出添加到主上下文,并立即继续运行LLM处理器(而不是暂停处理器的执行)。

If this flag is not present (a yield), MemGPT will not run the LLM processor until the next external event trigger (e.g. a user message or scheduled interrupt).
如果这个标志不存在(即yield),在下一个外部事件触发(例如用户消息或预定中断)之前,MemGPT将不会运行LLM处理器。
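
事件触发推理与函数链式调用的控制流,可以用如下Python草图示意(译者补充,并非官方实现;为简化起见,这里假设`llm`直接返回已解析好的“(函数名, 参数, 链式标志)”三元组,真实系统中需要先解析文本输出):

```python
def run_event(event: str, llm, functions: dict, main_context: list,
              max_chain: int = 5) -> None:
    """Sketch of event-driven control flow with function chaining: an event
    triggers inference; a call that sets the chaining flag returns control to
    the processor immediately, while a yield waits for the next event."""
    main_context.append(event)              # events arrive as plain text
    for _ in range(max_chain):              # bound runaway chains
        name, args, chain = llm("\n".join(main_context))
        result = functions[name](**args)
        main_context.append(f"[function result] {result}")
        if not chain:                       # yield: wait for the next event
            break
```

例如,一个需要翻两页检索结果的任务,会先以“链式标志置位”的方式调用检索函数拿到第一页,再在同一事件内继续调用拿到第二页,最后以yield结束,等待下一个外部事件。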

3. Experiments

We assess MemGPT in two long-context domains: conversational agents and document analysis. For conversational agents, we expand the existing Multi-Session Chat dataset (Xu et al., 2021) and introduce two new dialogue tasks that evaluate an agent’s ability to retain knowledge across long conversations. For document analysis, we benchmark MemGPT on existing tasks from (Liu et al., 2023a) for question answering and key-value retrieval over lengthy documents. We also propose a new nested key-value retrieval task, which tests the ability of an agent to collate information from multiple data sources (multi-hop retrieval). We publicly release our augmented MSC dataset, nested KV retrieval dataset, and a dataset of embeddings for 20M Wikipedia articles to facilitate future research. Our code for the benchmarks is available at https://research.memgpt.ai.

We assess MemGPT in two long-context domains: conversational agents and document analysis.
我们在两个长上下文领域评估了MemGPT:对话代理和文档分析。

For conversational agents, we expand the existing Multi-Session Chat dataset (Xu et al., 2021) and introduce two new dialogue tasks that evaluate an agent’s ability to retain knowledge across long conversations.
对于对话代理,我们扩展了现有的多会话聊天数据集,并引入了两个新的对话任务,以评估代理在长对话中保持知识的能力。

For document analysis, we benchmark MemGPT on existing tasks from (Liu et al., 2023a) for question answering and key-value retrieval over lengthy documents.
对于文档分析,我们在(Liu等人,2023a)的现有任务中对MemGPT进行了基准测试,这些任务涉及对长文档的问题回答和键值检索。

We also propose a new nested key-value retrieval task, which tests the ability of an agent to collate information from multiple data sources (multi-hop retrieval).
我们还提出了一个新的嵌套键值检索任务,用以测试代理从多个数据源整理信息的能力(多跳检索)。

We publicly release our augmented MSC dataset, nested KV retrieval dataset, and a dataset of embeddings for 20M Wikipedia articles to facilitate future research.
我们公开发布了我们的增强型多会话聊天数据集、嵌套键值检索数据集以及2000万维基百科文章的嵌入数据集,以促进未来的研究。

Our code for the benchmarks is available at https://research.memgpt.ai.
我们的基准测试代码可在 https://research.memgpt.ai 上获取。

Implementation details. When discussing OpenAI models, unless otherwise specified ‘GPT-4 Turbo’ refers to the specific gpt-4-1106-preview model endpoint (context window of 128,000), ‘GPT-4’ refers to gpt-4-0613 (context window of 8,192), and ‘GPT-3.5 Turbo’ refers to gpt-3.5-turbo-1106 (context window of 16,385). In experiments, we run MemGPT with all baseline models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo) to show how the underlying model performance affects MemGPT’s performance.

Implementation details.
实现细节。

When discussing OpenAI models, unless otherwise specified ‘GPT-4 Turbo’ refers to the specific gpt-4-1106-preview model endpoint (context window of 128,000), ‘GPT-4’ refers to gpt-4-0613 (context window of 8,192), and ‘GPT-3.5 Turbo’ refers to gpt-3.5-turbo-1106 (context window of 16,385).
在讨论OpenAI模型时,除非另有说明,“GPT-4 Turbo”指的是特定的gpt-4-1106-preview模型端点(上下文窗口为128,000),“GPT-4”指的是gpt-4-0613(上下文窗口为8,192),“GPT-3.5 Turbo”指的是gpt-3.5-turbo-1106(上下文窗口为16,385)。

In experiments, we run MemGPT with all baseline models (GPT-4, GPT-4 Turbo, and GPT-3.5) to show how the underlying model performance affects MemGPT’s performance.
在实验中,我们使用所有基线模型(GPT-4、GPT-4 Turbo和GPT-3.5)运行MemGPT,以展示底层模型性能如何影响MemGPT的性能。
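为便于对照,下面用一小段Python整理正文中提到的模型端点与上下文窗口大小(数值均来自上文,仅作速查示意):

```python
# 正文所述各 OpenAI 模型端点对应的上下文窗口大小(单位:令牌)
MODEL_CONTEXT_WINDOWS = {
    "gpt-4-1106-preview": 128_000,  # GPT-4 Turbo
    "gpt-4-0613": 8_192,            # GPT-4
    "gpt-3.5-turbo-1106": 16_385,   # GPT-3.5 Turbo
}

def context_window(model: str) -> int:
    """返回给定模型端点的上下文窗口大小(令牌数)。"""
    return MODEL_CONTEXT_WINDOWS[model]
```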

3.1. MemGPT for conversational agents

Conversational agents like virtual companions and personalized assistants aim to engage users in natural, long-term interactions, potentially spanning weeks, months, or even years. This creates challenges for models with fixed-length contexts, which can only reference a limited history of the conversation. An ‘infinite context’ agent should seamlessly handle continuous exchanges without boundary or reset. When conversing with a user, such an agent must satisfy two key criteria: (1) Consistency – The agent should maintain conversational coherence. New facts, preferences, and events mentioned should align with prior statements from both the user and agent. (2) Engagement – The agent should draw on long-term knowledge about the user to personalize responses. Referencing prior conversations makes dialogue more natural and engaging.

Conversational agents like virtual companions and personalized assistants aim to engage users in natural, long-term interactions, potentially spanning weeks, months, or even years.
对话代理,如虚拟伴侣和个性化助手,旨在与用户进行自然、长期的互动,可能跨越数周、数月甚至数年。

This creates challenges for models with fixed-length contexts, which can only reference a limited history of the conversation.
这对固定长度上下文的模型构成了挑战,因为它们只能引用有限的对话历史。

An ‘infinite context’ agent should seamlessly handle continuous exchanges without boundary or reset.
一个“无限上下文”代理应该能够无缝处理连续的交流,没有界限或重置。

When conversing with a user, such an agent must satisfy two key criteria: 
与用户对话时,这样的代理必须满足两个关键标准:

(1) Consistency - The agent should maintain conversational coherence. New facts, preferences, and events mentioned should align with prior statements from both the user and agent.
(1) 一致性 - 代理应保持对话的连贯性。新提到的事实、偏好和事件应与用户和代理之前的陈述一致。

(2) Engagement - The agent should draw on long-term knowledge about the user to personalize responses. Referencing prior conversations makes dialogue more natural and engaging.
(2) 参与度 - 代理应利用对用户的长期了解来个性化回应。引用之前的对话可以使对话更加自然和吸引人。

We therefore assess our proposed system, MemGPT, on these two criteria: (1) Does MemGPT leverage its memory to improve conversation consistency? Can it remember relevant facts, preferences, and events from past interactions to maintain coherence? (2) Does MemGPT produce more engaging dialogue by taking advantage of memory? Does it spontaneously incorporate long-range user information to personalize messages? By evaluating on consistency and engagement, we can determine how well MemGPT handles the challenges of long-term conversational interaction compared to fixed-context baselines. Its ability to satisfy these criteria will demonstrate whether unbounded context provides meaningful benefits for conversational agents.

We therefore assess our proposed system, MemGPT, on these two criteria: 
因此,我们根据这两个标准评估我们提出的系统MemGPT:

(1) Does MemGPT leverage its memory to improve conversation consistency? Can it remember relevant facts, preferences, and events from past interactions to maintain coherence?
(1) MemGPT是否利用其记忆来提高对话一致性?它能否记住过去互动中相关的事实、偏好和事件以保持连贯性?

(2) Does MemGPT produce more engaging dialogue by taking advantage of memory? Does it spontaneously incorporate long-range user information to personalize messages?
(2) MemGPT是否通过利用记忆产生更具吸引力的对话?它是否能够自发地整合长期用户信息来个性化消息?

By evaluating on consistency and engagement, we can determine how well MemGPT handles the challenges of long-term conversational interaction compared to fixed-context baselines.
通过评估一致性和参与度,我们可以确定MemGPT与固定上下文基线相比,处理长期对话互动的挑战能力如何。

Its ability to satisfy these criteria will demonstrate whether unbounded context provides meaningful benefits for conversational agents.
它满足这些标准的能力将展示无界上下文是否为对话代理提供了有意义的好处。

Dataset. We evaluate MemGPT and our fixed-context baselines on the Multi-Session Chat (MSC) dataset introduced by Xu et al. (2021), which contains multi-session chat logs generated by human labelers, each of whom was asked to play a consistent persona for the duration of all sessions. Each multi-session chat in MSC has five total sessions, and each session consists of roughly a dozen messages. As part of our consistency experiments, we created a new session (session 6) that contains a single question-answer response pair between the same two personas.

Dataset. We evaluate MemGPT and our fixed-context baselines on the Multi-Session Chat (MSC) dataset introduced by Xu et al. (2021), which contains multi-session chat logs generated by human labelers, each of whom was asked to play a consistent persona for the duration of all sessions.
数据集。我们在Xu等人(2021年)引入的多会话聊天(MSC)数据集上评估MemGPT和我们的固定上下文基线,该数据集包含由人类标注者生成的多会话聊天记录,每位标注者都被要求在所有会话中扮演一个一致的角色。

Each multi-session chat in MSC has five total sessions, and each session consists of roughly a dozen messages.
MSC中的每一组多会话聊天都有五个会话,每个会话大约包含十几个消息。

As part of our consistency experiments, we created a new session (session 6) that contains a single question-answer response pair between the same two personas.
作为我们一致性实验的一部分,我们创建了一个新的会话(会话6),其中包含同两个角色之间的单个问题-回答对。

3.1.1. DEEP MEMORY RETRIEVAL TASK (CONSISTENCY).

深度记忆检索任务(一致性)
We introduce a new ‘deep memory retrieval’ (DMR) task based on the MSC dataset designed to test the consistency of a conversational agent. In DMR, the conversational agent is asked a question by the user that explicitly refers back to a prior conversation and has a very narrow expected answer range. We generated the DMR question-answer (QA) pairs using a separate LLM that was instructed to write a question from one user to another that could only be answered correctly using knowledge gained from the past sessions (see Appendix for further details).

We introduce a new ‘deep memory retrieval’ (DMR) task based on the MSC dataset designed to test the consistency of a conversational agent.
我们引入了一个新的“深度记忆检索”(DMR)任务,基于MSC数据集设计,旨在测试对话代理的一致性。

In DMR, the conversational agent is asked a question by the user that explicitly refers back to a prior conversation and has a very narrow expected answer range.
在DMR中,用户向对话代理提出一个问题,该问题明确回顾了之前的对话,并且期望的答案范围非常狭窄。

We generated the DMR question-answer (QA) pairs using a separate LLM that was instructed to write a question from one user to another that could only be answered correctly using knowledge gained from the past sessions.
我们使用一个单独的大型语言模型(LLM)生成了DMR问题-回答(QA)对,该模型被指示编写一个问题,从一位用户到另一位用户,只有利用从过去会话中获得的知识才能正确回答。

We evaluate the quality of the generated response against the ‘gold response’ using ROUGE-L scores (Lin, 2004) and an ‘LLM judge’, which is instructed to evaluate whether or not the generated response is consistent with the gold response (GPT-4 has been shown to have high agreement with human evaluators (Zheng et al., 2023)). In practice, we notice that the generated responses (from both MemGPT and the baselines) were generally more verbose than the gold responses. We use the ROUGE-L recall (R) metric to account for the verbosity of the generated agent replies compared to the relatively short gold answer labels.

We evaluate the quality of the generated response against the ‘gold response’ using ROUGE-L scores (Lin, 2004) and an ‘LLM judge’,
我们使用ROUGE-L分数(Lin,2004)和一位“LLM裁判”,将生成的回应与“标准回应”进行对比以评估其质量,

which is instructed to evaluate whether or not the generated response is consistent with the gold response (GPT-4 has been shown to have high agreement with human evaluators (Zheng et al., 2023)).
该裁判被指导评估生成的回应是否与标准回应一致(GPT-4已被证明与人类评估者具有高度一致性)。

In practice, we notice that the generated responses (from both MemGPT and the baselines) were generally more verbose than the gold responses.
在实践中,我们注意到生成的回应(来自MemGPT和基线模型)通常比标准回应更加冗长。

We use the ROUGE-L recall (R) metric to account for the verbosity of the generated agent replies compared to the relatively short gold answer labels.
我们使用ROUGE-L召回率(R)指标,以考虑生成的代理回应相对于较短的标准答案标签更为冗长这一情况。
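上文提到的ROUGE-L召回率可以用最长公共子序列(LCS)直接计算。下面是一个极简的Python示意实现(按空格分词,仅用于说明指标定义,并非论文实际使用的评测脚本):

```python
def lcs_length(a, b):
    # 动态规划计算两个词序列的最长公共子序列(LCS)长度
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference: str, candidate: str) -> float:
    """ROUGE-L 召回率 = LCS 长度 / 参考答案词数。
    召回率视角下,生成回应冗长不会直接受罚,只看参考答案被覆盖的比例。"""
    ref, cand = reference.split(), candidate.split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)
```

例如,参考答案 "i love hiking" 若完整(按顺序)出现在一条更冗长的生成回应中,召回率仍为 1.0,这正是正文选择召回率而非 F 值的原因。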

MemGPT utilizes memory to maintain coherence: Table 2 shows the performance of MemGPT vs the fixed-memory baselines. We compare MemGPT using different underlying LLMs, and compare against using the base LLM without MemGPT as a baseline. The baselines are able to see a lossy summarization of the past five conversations to mimic an extended recursive summarization procedure, while MemGPT instead has access to the full conversation history but must access it via paginated search queries to recall memory (in order to bring them into main context). In this task, we see that MemGPT clearly improves the performance of the underlying base LLM: there is a clear drop in both accuracy and ROUGE scores when going from MemGPT to the corresponding LLM baselines.

MemGPT utilizes memory to maintain coherence: Table 2 shows the performance of MemGPT vs the fixed-memory baselines.
MemGPT利用记忆来维持连贯性:表2显示了MemGPT与固定内存基线的性能对比。

We compare MemGPT using different underlying LLMs, and compare against using the base LLM without MemGPT as a baseline.
我们比较了使用不同底层LLMs的MemGPT,并将其与没有MemGPT的基础LLM作为基线进行比较。

The baselines are able to see a lossy summarization of the past five conversations to mimic an extended recursive summarization procedure, while MemGPT instead has access to the full conversation history but must access it via paginated search queries to recall memory (in order to bring them into main context).
基线能够看到一个过去五次对话的有损摘要,以模仿扩展的递归摘要过程,而MemGPT则可以访问完整的对话历史,但必须通过分页搜索查询来访问记忆(以便将它们带入主上下文)。

In this task, we see that MemGPT clearly improves the performance of the underlying base LLM: there is a clear drop in both accuracy and ROUGE scores when going from MemGPT to the corresponding LLM baselines.
在这项任务中,我们发现MemGPT明显提高了底层基础LLM的性能:当从MemGPT转向相应的LLM基线时,准确性和ROUGE分数都有明显的下降。

3.1.2. CONVERSATION OPENER TASK (ENGAGEMENT).

对话开场任务(参与度)
In the ‘conversation opener’ task we evaluate an agent’s ability to craft engaging messages to the user that draw from knowledge accumulated in prior conversations. To evaluate the ‘engagingness’ of a conversation opener using the MSC dataset, we compare the generated opener to the gold personas: an engaging conversation opener should draw from one (or several) of the data points contained in the persona, which in MSC effectively summarize the knowledge accumulated throughout all prior sessions. We also compare to the human-generated gold opener, i.e., the first response in the following session. We report the CSIM scores of MemGPT’s openers in Table 3. We test several variations of MemGPT using different base LLMs.

In the ‘conversation opener’ task we evaluate an agent’s ability to craft engaging messages to the user that draw from knowledge accumulated in prior conversations.
在“对话开场”任务中,我们评估代理根据之前对话中积累的知识,制作吸引用户的个性化消息的能力。

To evaluate the ‘engagingness’ of a conversation opener using the MSC dataset, we compare the generated opener to the gold personas: an engaging conversation opener should draw from one (or several) of the data points contained in the persona, which in MSC effectively summarize the knowledge accumulated throughout all prior sessions.
为了使用MSC数据集评估对话开场的“吸引力”,我们将生成的开场与标准角色进行比较:一个有吸引力的对话开场应该利用角色中包含的一个(或几个)数据点,在MSC中,这有效地总结了所有之前会话中积累的知识。

We also compare to the human-generated gold opener, i.e., the first response in the following session.
我们还将其与人类生成的标准开场进行比较,即下一场会话中的第一条回应。

We report the CSIM scores of MemGPT’s openers in Table 3.
我们在表3中报告了MemGPT开场的CSIM分数。

We test several variations of MemGPT using different base LLMs.
我们测试了使用不同基础LLMs的几个MemGPT变体。

MemGPT utilizes memory to increase engagement: As seen in Table 3, MemGPT is able to craft engaging openers that perform similarly to and occasionally exceed the hand-written human openers. We observe that MemGPT tends to craft openers that are both more verbose and cover more aspects of the persona information than the human baseline. Additionally, we can see that storing information in working context is key to generating engaging openers.

MemGPT utilizes memory to increase engagement: As seen in Table 3, MemGPT is able to craft engaging openers that perform similarly to and occasionally exceed the hand-written human openers.
MemGPT利用记忆来提高参与度:如表3所示,MemGPT能够制作同样吸引人甚至有时超越人工编写的开场白。

We observe that MemGPT tends to craft openers that are both more verbose and cover more aspects of the persona information than the human baseline.
我们观察到MemGPT倾向于制作比人类基线更加冗长并且涵盖角色信息更多方面的开场白。

Additionally, we can see that storing information in working context is key to generating engaging openers.
此外,我们可以看到在工作上下文中存储信息对于生成吸引人的开场白至关重要。

3.2. MemGPT for document analysis

Document analysis also faces challenges due to the limited context windows of today’s transformer models. As shown in Table 1, both open and closed source models suffer from constrained context length (up to 128k tokens for OpenAI’s models). However, many documents easily surpass these lengths; for example, legal or financial documents such as Annual Reports (SEC Form 10-K) can easily pass the million token mark. Moreover, many real document analysis tasks require drawing connections across multiple such lengthy documents. Anticipating these scenarios, it becomes difficult to envision blindly scaling up context as a solution to the fixed-context problem. Recent research (Liu et al., 2023a) also raises doubts about the utility of simply scaling contexts, since they find uneven attention distributions in large context models (the model is more capable of recalling information at the beginning or end of its context window, vs tokens in the middle). To enable reasoning across documents, more flexible memory architectures like MemGPT are needed.

Document analysis also faces challenges due to the limited context windows of today’s transformer models.
文档分析也面临着当今变换器模型有限上下文窗口的挑战。

As shown in Table 1, both open and closed source models suffer from constrained context length (up to 128k tokens for OpenAI’s models).
如表1所示,无论是开源还是闭源模型,都受到上下文长度限制(OpenAI模型最多可达128k个令牌)。

However, many documents easily surpass these lengths; for example, legal or financial documents such as Annual Reports (SEC Form 10-K) can easily pass the million token mark.
然而,许多文档很容易超过这些长度;例如,法律或财务文件,如年度报告(SEC表格10-K),很容易突破百万令牌大关。

Moreover, many real document analysis tasks require drawing connections across multiple such lengthy documents.
此外,许多真实的文档分析任务需要在多个这样冗长的文档之间建立联系。

Anticipating these scenarios, it becomes difficult to envision blindly scaling up context as a solution to the fixed-context problem.
考虑到这些场景,很难设想仅靠盲目扩大上下文来解决固定上下文的问题。

Recent research (Liu et al., 2023a) also raises doubts about the utility of simply scaling contexts, since they find uneven attention distributions in large context models (the model is more capable of recalling information at the beginning or end of its context window, vs tokens in the middle).
最近的研究(Liu等人,2023a)也对简单地扩展上下文的效用提出了疑问,因为他们发现大型上下文模型中的注意力分布不均(模型更能够回忆起在其上下文窗口开始或结束时的信息,而不是中间的令牌)。

To enable reasoning across documents, more flexible memory architectures like MemGPT are needed.
为了实现跨文档的推理,需要像MemGPT这样更灵活的内存架构。

3.2.1. MULTI-DOCUMENT QUESTION-ANSWERING.

To evaluate MemGPT’s ability to analyze documents, we benchmark MemGPT against fixed-context baselines on the retriever-reader document QA task from Liu et al. (2023a). In this task, a question is selected from the NaturalQuestions-Open dataset, and a retriever selects relevant Wikipedia documents for the question. A reader model (the LLM) is then fed these documents as input, and is asked to use the provided documents to answer the question. Similar to Liu et al. (2023a), we evaluate reader accuracy as the number of retrieved documents K increases.

To evaluate MemGPT’s ability to analyze documents, we benchmark MemGPT against fixed-context baselines on the retriever-reader document QA task from Liu et al. (2023a).
为了评估MemGPT分析文档的能力,我们将MemGPT与固定上下文基线在Liu等人(2023a)提出的检索器-阅读器文档问答任务上进行基准测试。

In this task, a question is selected from the NaturalQuestions-Open dataset, and a retriever selects relevant Wikipedia documents for the question.
在这个任务中,从NaturalQuestions-Open数据集中选取一个问题,然后检索器为这个问题选择相关的维基百科文档。

A reader model (the LLM) is then fed these documents as input, and is asked to use the provided documents to answer the question.
然后,阅读器模型(LLM)被输入这些文档,并被要求使用提供的文档回答问题。

Similar to Liu et al. (2023a), we evaluate reader accuracy as the number of retrieved documents K increases.
与Liu等人(2023a)类似,我们评估随着检索到的文档数量K的增加,阅读器的准确性。

In our evaluation setup, both the fixed-context baselines and MemGPT use the same retriever, which selects the top K documents using similarity search (cosine distance) on OpenAI’s text-embedding-ada-002 embeddings. We use MemGPT’s default storage settings which uses PostgreSQL for archival memory storage with vector search enabled via the pgvector extension. We precompute embeddings and load them into the database, which uses an HNSW index to enable approximate, subsecond query times. In MemGPT, the entire embedding document set is loaded into archival storage, and the retriever naturally emerges via the archival storage search functionality (which performs vector search based on cosine similarity). In the fixed-context baselines, the top-K documents are fetched using the retriever independently from the LLM inference, similar to the original retriever-reader setup in Liu et al. (2023a).

In our evaluation setup, both the fixed-context baselines and MemGPT use the same retriever, which selects the top K documents using similarity search (cosine distance) on OpenAI’s text-embedding-ada-002 embeddings.
在我们的评估设置中,固定上下文基线和MemGPT使用相同的检索器,该检索器基于OpenAI的text-embedding-ada-002嵌入,使用相似性搜索(余弦距离)选择前K个文档。

We use MemGPT’s default storage settings which uses PostgreSQL for archival memory storage with vector search enabled via the pgvector extension.
我们使用MemGPT的默认存储设置,该设置使用PostgreSQL进行归档存储,并通过pgvector扩展启用了向量搜索。

We precompute embeddings and load them into the database, which uses an HNSW index to enable approximate, subsecond query times.
我们预计算嵌入并将它们加载到数据库中,该数据库使用HNSW索引以实现近似的、亚秒级的查询时间。

In MemGPT, the entire embedding document set is loaded into archival storage, and the retriever naturally emerges via the archival storage search functionality (which performs vector search based on cosine similarity).
在MemGPT中,整个嵌入文档集被加载到归档存储中,检索器自然地通过归档存储搜索功能出现(该功能基于余弦相似性执行向量搜索)。

In the fixed-context baselines, the top-K documents are fetched using the retriever independently from the LLM inference, similar to the original retriever-reader setup in Liu et al. (2023a).
在固定上下文基线中,使用检索器独立于LLM推理获取前K个文档,类似于Liu等人(2023a)中的原始检索器-阅读器设置。
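为说明上述检索器的打分方式,下面给出一个纯Python的余弦相似度top-K排序示意(实际系统是通过pgvector在数据库内完成向量搜索的,这里仅演示排序原理;`top_k`等函数名为示意用的假设命名):

```python
from math import sqrt

def cosine_similarity(u, v):
    # 余弦相似度 = 点积 / (两向量范数之积)
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def top_k(query, doc_embeddings, k):
    """按与查询向量的余弦相似度降序排序,返回前 K 个文档的编号。"""
    scored = [(cosine_similarity(query, emb), idx)
              for idx, emb in enumerate(doc_embeddings)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]
```

真实系统中文档向量是1536维的text-embedding-ada-002嵌入,且排序由HNSW索引近似完成,而非像这里一样全量遍历。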

We use a dump of Wikipedia from late 2018, following past work on NaturalQuestions-Open (Izacard & Grave, 2020; Izacard et al., 2021), and sampled a subset of 50 questions for evaluation. Both the sampled questions and embedded Wikipedia passages are publicly released. We evaluate the performance of both MemGPT and baselines with an LLM-judge, to ensure that the answer is properly derived from the retrieved documents and to avoid non-exact string matches being considered incorrect.

We use a dump of Wikipedia from late 2018, following past work on NaturalQuestions-Open (Izacard & Grave, 2020; Izacard et al., 2021), and sampled a subset of 50 questions for evaluation.
我们使用了2018年末的维基百科数据转储,遵循了之前在NaturalQuestions-Open上的工作,并抽取了50个问题用于评估。

Both the sampled questions and embedded Wikipedia passages are publicly released.
抽取的问题和嵌入的维基百科段落都已公开发布。

We evaluate the performance of both MemGPT and baselines with an LLM-judge, to ensure that the answer is properly derived from the retrieved documents and to avoid non-exact string matches being considered incorrect.
我们使用LLM裁判来评估MemGPT和基线的性能,以确保答案正确地从检索到的文档中得出,并避免将非完全字符串匹配视为错误。

We show the results for the document QA task in Figure 5. The fixed-context baselines’ performance is capped roughly at the performance of the retriever, as they use the information that is presented in their context window (e.g. if the embedding search retriever fails to surface the gold article using the provided question, the fixed-context baselines are guaranteed to never see the gold article). By contrast, MemGPT is effectively able to make multiple calls to the retriever by querying archival storage, allowing it to scale to larger effective context lengths. MemGPT actively retrieves documents from its archival storage (and can iteratively page through results), so the total number of documents available to MemGPT is no longer limited by the number of documents that fit within the LLM processor’s context window.

We show the results for the document QA task in Figure 5.
我们在图5中展示了文档问答任务的结果。

The fixed-context baselines’ performance is capped roughly at the performance of the retriever, as they use the information that is presented in their context window (e.g. if the embedding search retriever fails to surface the gold article using the provided question, the fixed-context baselines are guaranteed to never see the gold article).
固定上下文基线的性能大致受限于检索器的性能,因为它们只能使用其上下文窗口中呈现的信息(例如,如果嵌入搜索检索器未能根据所给问题找出标准文章,固定上下文基线就必然永远看不到标准文章)。

By contrast, MemGPT is effectively able to make multiple calls to the retriever by querying archival storage, allowing it to scale to larger effective context lengths.
相比之下,MemGPT能够有效地通过查询归档存储对检索器进行多次调用,允许它扩展到更大的有效上下文长度。

MemGPT actively retrieves documents from its archival storage (and can iteratively page through results), so the total number of documents available to MemGPT is no longer limited by the number of documents that fit within the LLM processor’s context window.
MemGPT主动从其归档存储中检索文档(并且可以迭代地浏览结果),因此可供MemGPT使用的文档总数不再受LLM处理器上下文窗口内可容纳的文档数量的限制。
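MemGPT通过分页查询逐步翻阅检索结果的过程,可以用如下Python草图示意(`archival_search`、`find_gold`均为说明用的假设函数,并非MemGPT的实际接口):

```python
def archival_search(ranked_doc_ids, page: int, page_size: int = 5):
    """返回排好序的检索结果中的第 page 页(模拟分页式归档搜索)。"""
    start = page * page_size
    return ranked_doc_ids[start:start + page_size]

def find_gold(ranked_doc_ids, gold_id, page_size: int = 5, max_pages: int = 10):
    # 逐页翻阅检索结果,直到找到目标文档、结果耗尽或达到翻页上限
    for page in range(max_pages):
        results = archival_search(ranked_doc_ids, page, page_size)
        if not results:
            return None  # 结果已耗尽
        if gold_id in results:
            return page  # 在第 page 页找到目标文档
    return None
```

正文随后指出的现象,对应于此处代理在`max_pages`之前就主动停止翻页的情况:只要目标文档在排序中足够靠后,过早停止翻页就会导致漏检。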

The document QA task is challenging for all methods due to the limitations of embedding-based similarity search. We observe that the gold document for the chosen question (as annotated by NaturalQuestions-Open) often appears outside of the first dozen retrieved results, if not even further. The retriever performance translates directly to the fixed-context baseline results: GPT-4’s accuracy is relatively low with few retrieved documents, and continues to improve as additional documents are added to the context window, as it correctly limits itself to answering questions based on information in retrieved documents. While MemGPT is theoretically not limited by sub-optimal retriever performance (even if the embedding-based ranking is noisy, as long as the full retriever ranking contains the gold document, it can still be found with enough retriever calls via pagination), we observe that MemGPT will often stop paging through retriever results before exhausting the retriever database.

The document QA task is challenging for all methods due to the limitations of embedding-based similarity search.
文档问答任务对于所有方法来说都很具挑战性,因为基于嵌入的相似性搜索存在局限性。

We observe that the gold document for the chosen question (as annotated by NaturalQuestions-Open) often appears outside of the first dozen retrieved results, if not even further.
我们观察到,所选问题的标准文档(由NaturalQuestions-Open标注)通常出现在检索结果的前十几个之外,有时甚至更靠后。

The retriever performance translates directly to the fixed-context baseline results: GPT-4’s accuracy is relatively low with few retrieved documents, and continues to improve as additional documents are added to the context window, as it correctly limits itself to answering questions based on information in retrieved documents.
检索器的性能直接决定了固定上下文基线的结果:GPT-4的准确性在检索到的文档很少时相对较低,并随着更多文档被添加到上下文窗口而持续提高,因为它正确地将自己限制在仅根据检索到的文档中的信息回答问题。

While MemGPT is theoretically not limited by sub-optimal retriever performance (even if the embedding-based ranking is noisy, as long as the full retriever ranking contains the gold document, it can still be found with enough retriever calls via pagination), we observe that MemGPT will often stop paging through retriever results before exhausting the retriever database.
虽然MemGPT理论上不受次优检索器性能的限制(即使基于嵌入的排序有噪声,只要检索器的完整排序中包含标准文档,通过分页进行足够多次的检索调用仍能找到它),但我们观察到MemGPT经常在耗尽检索器数据库之前就停止翻阅检索结果。

To evaluate the fixed-context baselines against MemGPT past their default context lengths, we truncate the document segments returned by the retriever to fit the same number of documents into the available context. As expected, document truncation reduces accuracy: as documents shrink, the chance of the relevant snippet (in the gold document) being omitted grows, as shown in Figure 5. MemGPT has significantly degraded performance using GPT-3.5, due to its limited function calling capabilities, and performs best using GPT-4.

To evaluate the fixed-context baselines against MemGPT past their default context lengths, we truncate the document segments returned by the retriever to fit the same number of documents into the available context.
为了在超出默认上下文长度的情况下对比固定上下文基线与MemGPT,我们截断检索器返回的文档片段,以便在可用上下文中容纳相同数量的文档。

As expected, document truncation reduces accuracy: as documents shrink, the chance of the relevant snippet (in the gold document) being omitted grows, as shown in Figure 5.
正如预期,文档截断会降低准确性:随着文档缩短,相关片段(位于标准文档中)被省略的可能性增加,如图5所示。

MemGPT has significantly degraded performance using GPT-3.5, due to its limited function calling capabilities, and performs best using GPT-4.
使用GPT-3.5时,MemGPT的性能显著下降,这是由于其有限的函数调用能力,而使用GPT-4时表现最佳。

3.2.2. NESTED KEY-VALUE RETRIEVAL (KV).

We introduce a new task based on the synthetic Key-Value retrieval proposed in prior work (Liu et al., 2023a). The goal of this task is to demonstrate how MemGPT can collate information from multiple data sources. In the original KV task, the authors generated a synthetic dataset of key-value pairs, where each key and value is a 128-bit UUID (universally unique identifier). The agent is then given a key, and asked to return the associated value for the key. We create a version of the KV task, nested KV retrieval, where values themselves may be keys, thus requiring the agent to perform a multi-hop lookup. In our setup, we fix the total number of UUID pairs to 140, corresponding to roughly 8k tokens (the context length of our GPT-4 baseline). We vary the total number of nesting levels from 0 (the initial key-value pair’s value is not a key) to 4 (i.e., 4 total KV lookups are required to find the final value), and sample 30 different ordering configurations including both the initial key position and nesting key positions.

We introduce a new task based on the synthetic Key-Value retrieval proposed in prior work (Liu et al., 2023a).
我们介绍了一项新任务,它基于先前工作(Liu等人,2023a)提出的合成键值检索。

The goal of this task is to demonstrate how MemGPT can collate information from multiple data sources.
这项任务的目标是展示MemGPT如何能够从多个数据源整合信息。

In the original KV task, the authors generated a synthetic dataset of key-value pairs, where each key and value is a 128-bit UUID (universally unique identifier).
在原始的KV任务中,作者们生成了一个键值对的合成数据集,其中每个键和值都是一个128位的UUID。

The agent is then given a key, and asked to return the associated value for the key.
代理随后被给予一个键,并被要求返回该键关联的值。

We create a version of the KV task, nested KV retrieval, where values themselves may be keys, thus requiring the agent to perform a multi-hop lookup.
我们创建了KV任务的一个版本,即嵌套KV检索,其中值本身可能是键,因此要求代理执行多跳查找。

In our setup, we fix the total number of UUID pairs to 140, corresponding to roughly 8k tokens (the context length of our GPT-4 baseline).
在我们的设置中,我们将UUID对的总数固定为140,对应大约8k个令牌(我们GPT-4基线的上下文长度)。

We vary the total number of nesting levels from 0 (the initial key-value pair’s value is not a key) to 4 (i.e., 4 total KV lookups are required to find the final value), and sample 30 different ordering configurations including both the initial key position and nesting key positions.
我们将嵌套层数从0(初始键值对的值不是键)变化到4(即总共需要4次KV查找才能找到最终值),并抽取了30种不同的排序配置,包括初始键位置和嵌套键位置。
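嵌套KV任务要求的多跳查找逻辑可以用几行Python说明:只要取出的值本身仍是某个键,就继续向下查找(`resolve`为示意用的假设函数;真实任务中键和值均为128位UUID,且键链保证无环):

```python
def resolve(store: dict, key: str, max_hops: int = 10) -> str:
    """嵌套 KV 查找:若取出的值本身也是键,则继续查找,直到得到终值。
    max_hops 仅为防御性上限;真实任务的键链最多嵌套 4 层且无环。"""
    value = store[key]
    for _ in range(max_hops):
        if value not in store:
            return value  # 值不再是键,即为最终答案
        value = store[value]
    raise RuntimeError("超过最大跳数,可能存在环")
```

例如键链 k0 → k1 → k2 → final 需要3次查找;正文中GPT-3.5的典型失败模式正是在第一跳就停下,直接返回中间值。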

While GPT-3.5 and GPT-4 have good performance on the original KV tasks, both struggle in the nested KV task. GPT-3.5 is unable to complete the nested variant of the task and has an immediate dropoff in performance, hitting 0 percent accuracy at 1 nesting level (we observe that its primary failure mode is to simply return the original value). GPT-4 and GPT-4 Turbo are better than GPT-3.5, but also suffer from a similar dropoff, and hit 0 percent accuracy by 3 nesting levels. MemGPT with GPT-4 on the other hand is unaffected by the number of nesting levels and is able to perform the nested lookup by accessing the key-value pairs stored in main context repeatedly via function queries. MemGPT with GPT-4 Turbo and GPT-3.5 also have better performance than the corresponding baseline models, but still begin to drop off in performance at 2 nesting levels as a result of failing to perform enough lookups. MemGPT performance on the nested KV task demonstrates its ability to combine multiple queries to perform multi-hop lookups.

While GPT-3.5 and GPT-4 have good performance on the original KV tasks, both struggle in the nested KV task.
GPT-3.5和GPT-4在原始的KV任务上表现良好,但在嵌套KV任务中都遇到了困难。

GPT-3.5 is unable to complete the nested variant of the task and has an immediate dropoff in performance, hitting 0 percent accuracy at 1 nesting level.
GPT-3.5无法完成嵌套任务的变体,并且性能立即下降,在嵌套1层时准确率就降至0%。

(We observe that its primary failure mode is to simply return the original value).
(我们观察到其主要的失败模式是简单地返回原始值)。

GPT-4 and GPT-4 Turbo are better than GPT-3.5, but also suffer from a similar dropoff, and hit 0 percent accuracy by 3 nesting levels.
GPT-4和GPT-4 Turbo比GPT-3.5表现更好,但也遭受类似的性能下降,并在嵌套3层时准确率降至0%。

MemGPT with GPT-4, on the other hand, is unaffected by the number of nesting levels and is able to perform the nested lookup by accessing the key-value pairs stored in main context repeatedly via function queries.
另一方面,使用GPT-4的MemGPT不受嵌套层数的影响,能够通过反复通过函数查询访问主上下文中存储的键值对来执行嵌套查找。

MemGPT with GPT-4 Turbo and GPT-3.5 also have better performance than the corresponding baseline models, but still begin to drop off in performance at 2 nesting levels as a result of failing to perform enough lookups.
使用GPT-4 Turbo和GPT-3.5的MemGPT也比相应的基线模型表现更好,但由于未能执行足够的查找,在嵌套2层时开始性能下降。

MemGPT performance on the nested KV task demonstrates its ability to combine multiple queries to perform multi-hop lookups.
MemGPT在嵌套KV任务上的表现证明了其能够结合多个查询来执行多跳查找的能力。

4. Related Work

Long-context LLMs. Several lines of work have improved the context length of LLMs. For instance, more efficient transformer architectures via sparsifying the attention (Child et al., 2019; Beltagy et al., 2020), low-rank approximations (Wang et al., 2020), and neural memory (Lee et al., 2019). Another line of work aims to extend context windows beyond the length they were originally trained for, such as Press et al. (2021); Chen et al. (2023). MemGPT builds upon these improvements in context length as they improve the size of the main memory in MemGPT. Our main contribution is a hierarchical tiered memory that uses a long-context LLM as the implementation of main memory.

Long-context LLMs. Several lines of work have improved the context length of LLMs.
长上下文的LLMs(大型语言模型)。有几项研究工作已经提升了LLMs的上下文长度。

For instance, more efficient transformer architectures via sparsifying the attention (Child et al., 2019; Beltagy et al., 2020), low-rank approximations (Wang et al., 2020), and neural memory (Lee et al., 2019).
例如,通过稀疏化注意力机制(Child等人,2019年;Beltagy等人,2020年)、低秩近似(Wang等人,2020年)和神经记忆(Lee等人,2019年)来实现更高效的变换器架构。

Another line of work aims to extend context windows beyond the length they were originally trained for, such as Press et al. (2021); Chen et al. (2023).
另一项研究工作旨在将上下文窗口扩展到模型原始训练长度之外,如Press等人(2021年);Chen等人(2023年)。

MemGPT builds upon these improvements in context length as they improve the size of the main memory in MemGPT.
MemGPT在这些提升上下文长度的改进基础上进行构建,因为它们提高了MemGPT中主存储器的大小。

Our main contribution is a hierarchical tiered memory that uses a long-context LLM as the implementation of main memory.
我们的主要贡献是提出了一个分层存储记忆系统,它使用长上下文LLM作为主存储器的实现。

Retrieval-Augmented Models. The design of the external memory of MemGPT builds upon much prior work augmenting LLMs with relevant inputs from external retrievers (Ram et al., 2023; Borgeaud et al., 2022; Karpukhin et al., 2020; Lewis et al., 2020; Guu et al., 2020; Lin et al., 2023). In particular, Jiang et al. (2023) propose FLARE, a method that allows the LLM to actively decide when and what to retrieve during the course of generation. Trivedi et al. (2022) interleave retrieval with Chain-of-Thoughts reasoning to improve multi-step question answering.

Retrieval-Augmented Models. The design of the external memory of MemGPT builds upon much prior work augmenting LLMs with relevant inputs from external retrievers.
检索增强型模型。MemGPT的外部存储器设计基于许多先前的工作,这些工作通过外部检索器提供的相关输入来增强LLMs。

In particular, Jiang et al. (2023) propose FLARE, a method that allows the LLM to actively decide when and what to retrieve during the course of generation.
特别是,Jiang等人(2023年)提出了FLARE,一种方法,允许LLM在生成过程中主动决定何时以及检索什么。

Trivedi et al. (2022) interleave retrieval with Chain-of-Thoughts reasoning to improve multi-step question answering.
Trivedi等人(2022年)将检索与思维链推理交错,以改进多步骤问题回答。

LLMs as agents. Recent work has explored augmenting LLMs with additional capabilities to act as agents in interactive environments. Park et al. (2023) propose adding memory to LLMs and using the LLM as a planner, and observe emergent social behaviors in a multiagent sandbox environment (inspired by The Sims video game) where agents can perform basic activities such as doing chores/hobbies, going to work, and conversing with other agents. Nakano et al. (2021) train models to search the web before answering questions, and use similar pagination concepts to MemGPT to control the underlying context size in their web-browsing environment. Yao et al. (2022) showed that interleaving chain-of-thought reasoning (Wei et al., 2022) can further improve the planning ability of interactive LLM-based agents; similarly in MemGPT, the LLM is able to ‘plan out loud’ when executing functions. Liu et al. (2023b) introduced a suite of LLM-as-an-agent benchmarks to evaluate LLMs in interactive environments, including video games, thinking puzzles, and web shopping. In contrast, our work focuses on tackling the problem of equipping agents with long-term memory of user inputs.

LLMs as agents. Recent work has explored augmenting LLMs with additional capabilities to act as agents in interactive environments.
LLM作为代理。最近的研究探索了通过增加额外能力使LLMs在交互环境中充当代理。

Park et al. (2023) propose adding memory to LLMs and using the LLM as a planner, and observe emergent social behaviors in a multiagent sandbox environment (inspired by The Sims video game) where agents can perform basic activities such as doing chores/hobbies, going to work, and conversing with other agents.
Park等人(2023年)提议给LLMs添加记忆,并使用LLM作为规划器,并在受《模拟人生》视频游戏启发的多代理沙盒环境中观察到涌现的社会行为,环境中的代理可以进行做家务/培养爱好、上班以及与其他代理交谈等基本活动。

Nakano et al. (2021) train models to search the web before answering questions, and use similar pagination concepts to MemGPT to control the underlying context size in their web-browsing environment.
Nakano等人(2021年)训练模型在回答问题前搜索网络,并使用与MemGPT类似的分页概念来控制其网络浏览环境中的底层上下文大小。

Yao et al. (2022) showed that interleaving chain-of-thought reasoning (Wei et al., 2022) can further improve the planning ability of interactive LLM-based agents; similarly in MemGPT, the LLM is able to ‘plan out loud’ when executing functions.
Yao等人(2022年)表明,交错思维链推理(Wei等人,2022年)可以进一步提高基于LLM的交互式代理的规划能力;同样在MemGPT中,LLM在执行函数时能够“大声规划”。

Liu et al. (2023b) introduced a suite of LLM-as-an-agent benchmarks to evaluate LLMs in interactive environments, including video games, thinking puzzles, and web shopping.
Liu等人(2023b)引入了一系列LLM作为代理的基准测试,以评估LLM在交互环境中的表现,包括视频游戏、思维谜题和网上购物。

In contrast, our work focuses on tackling the problem of equipping agents with long-term memory of user inputs.
相比之下,我们的工作专注于解决为代理配备用户输入长期记忆的问题。

5. Conclusion

In this paper, we introduced MemGPT, a novel LLM system inspired by operating systems to manage the limited context windows of large language models. By designing a memory hierarchy and control flow analogous to traditional OSes, MemGPT provides the illusion of larger context resources for LLMs. This OS-inspired approach was evaluated in two domains where existing LLM performance is constrained by finite context lengths: document analysis and conversational agents. For document analysis, MemGPT could process lengthy texts well beyond the context limits of current LLMs by effectively paging relevant context in and out of memory. For conversational agents, MemGPT enabled maintaining long-term memory, consistency, and evolvability over extended dialogues. Overall, MemGPT demonstrates that operating system techniques like hierarchical memory management and interrupts can unlock the potential of LLMs even when constrained by fixed context lengths. This work opens numerous avenues for future exploration, including applying MemGPT to other domains with massive or unbounded contexts, integrating different memory tier technologies like databases or caches, and further improving control flow and memory management policies. By bridging concepts from OS architecture into AI systems, MemGPT represents a promising new direction for maximizing the capabilities of LLMs within their fundamental limits.

In this paper, we introduced MemGPT, a novel LLM system inspired by operating systems to manage the limited context windows of large language models.
本文中,我们介绍了MemGPT,一种受操作系统启发的新型大型语言模型(LLM)系统,用于管理大型语言模型的有限上下文窗口。

By designing a memory hierarchy and control flow analogous to traditional OSes, MemGPT provides the illusion of larger context resources for LLMs.
通过设计类似于传统操作系统的内存层次结构和控制流程,MemGPT为LLM提供了更大上下文资源的假象。

This OS-inspired approach was evaluated in two domains where existing LLM performance is constrained by finite context lengths: document analysis and conversational agents.
这种受操作系统启发的方法在两个领域进行了评估,这些领域中现有LLM的性能受到有限上下文长度的限制:文档分析和对话代理。

For document analysis, MemGPT could process lengthy texts well beyond the context limits of current LLMs by effectively paging relevant context in and out of memory.
对于文档分析,MemGPT能够通过有效地将相关上下文分页进出内存,处理超出当前LLM上下文限制的长篇文本。
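The paging behavior described here can be sketched as processing a long document one fixed-size page at a time, so that only a single page ever occupies the "context window" while results accumulate outside it. The page size and the simple counting task (standing in for a per-page LLM call) are illustrative assumptions, not MemGPT's actual document-analysis pipeline.

```python
def paginate(document, page_chars):
    """Split a long document into fixed-size pages."""
    return [document[i:i + page_chars]
            for i in range(0, len(document), page_chars)]

def count_mentions(document, term, page_chars=70):
    """Schematic paged document analysis: process one page at a time
    and aggregate results outside the 'window'. In practice the
    per-page step would be an LLM call and pagination would need to
    respect token boundaries; `page.count` is a toy stand-in."""
    total = 0
    for page in paginate(document, page_chars):
        total += page.count(term)
    return total

doc = "memory " * 50  # 350 chars, far larger than one 70-char "page"
print(len(paginate(doc, 70)))       # 5 pages
print(count_mentions(doc, "memory"))  # 50
```

Note that naive fixed-size splitting can cut a match across a page boundary (here the page size happens to align with the repeated phrase); a production pipeline would split on token or sentence boundaries instead.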

For conversational agents, MemGPT enabled maintaining long-term memory, consistency, and evolvability over extended dialogues.
对于对话代理,MemGPT能够在长时间对话中保持长期记忆、一致性和可演化性。

Overall, MemGPT demonstrates that operating system techniques like hierarchical memory management and interrupts can unlock the potential of LLMs even when constrained by fixed context lengths.
总体而言,MemGPT展示了操作系统技术,如分层内存管理和中断,即使在固定上下文长度的限制下也能释放LLM的潜力。

This work opens numerous avenues for future exploration, including applying MemGPT to other domains with massive or unbounded contexts, integrating different memory tier technologies like databases or caches, and further improving control flow and memory management policies.
这项工作为未来的探索开辟了众多途径,包括将MemGPT应用于具有庞大或无界上下文的其他领域,集成不同的内存层技术如数据库或缓存,并进一步改进控制流程和内存管理策略。

By bridging concepts from OS architecture into AI systems, MemGPT represents a promising new direction for maximizing the capabilities of LLMs within their fundamental limits.
通过将操作系统架构的概念引入到AI系统中,MemGPT代表了在它们的基本限制内最大化LLM能力的一个有希望的新方向。

作者:Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez
