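Introduction

Reposted from: "Auto-GPT源码解读（万字干货+原理速读）" (Auto-GPT Source Code Walkthrough: A Long-Form Deep Dive Plus a Quick Tour of the Principles).

Over the past few months, ChatGPT-based technology has moved at breakneck speed. Before we had fully digested GPT-3.5, GPT-4 was already out, and applications built on open-source models (GLM, LLaMA) keep appearing one after another. In the past couple of days, a blockbuster project showed up on GitHub: Auto-GPT. Between the start of April and now (April 15), its star count has grown from a few hundred to 65k!

Before writing this article, I tried fine-tuning, inference, and local deployment of models such as GLM, LLaMA, and Stable Diffusion on my own machine, and I read the classic GPT-series papers closely (collected at the end of the article, help yourself). I have also discussed the topic with people in academia and industry. This article summarizes what I took away from that practice and those discussions.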

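Background

Three stages of turning language models into products

I divide the productization of language models into three stages: the base model stage, the chat model stage, and the application integration stage.

Taking OpenAI's language models as an example, here is a quick recap. In May 2020, OpenAI released the 175-billion-parameter GPT-3 model, which achieved leading results on few-shot learning benchmarks across many domains. This is the base model stage. In November 2022, after training on top of GPT-3 with methods such as RLHF (reinforcement learning from human feedback), OpenAI released its landmark product ChatGPT, bringing large language models into practical use for the first time. This is the chat model stage. In early 2023, OpenAI released the more capable GPT-4 along with the Plugin feature, which raised the usability of ChatGPT to a new level. This is the application integration stage. There may well be a next stage, but my imagination only goes so far, so I will leave it blank for now.

Correspondingly, open-source models such as GLM and LLaMA, as well as the large models from Baidu and Alibaba, have products at each of these stages. The table above summarizes them according to my own understanding.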

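What problem does Auto-GPT solve?

Compare using a model to driving a car. The stage-one base model is like a road atlas: it contains a huge amount of information, but when you need something you still have to flip through the index by hand. The stage-two chat model is like navigation software: you enter a destination and it tells you how to get there, but you still have to drive. Stage-three application integration is like a self-driving car: you enter the destination and it takes you there. Auto-GPT is that self-driving car.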

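Overall framework (main.py)

Module view

Data view

1. The user enters ai_name (the AI's name), ai_role (the role the AI plays), and ai_goal (the goals).
2. ChatGPT generates a JSON-formatted sequence of operations from the given role and goals (see 3.0 for how the prompt is constructed).
3. Based on that sequence, different command components are invoked (shell scripts, the web crawler, text-to-speech, Google search, and so on).
4. The output of each component is stored in the cache or in long-term memory and serves as input to the next component.
5. The task ends when the last component has finished executing.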
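
The data flow above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the actual code in main.py: fake_llm stands in for the ChatGPT call, and the echo and task_complete command names, the COMMANDS table, and run_agent are all hypothetical names chosen for this example.

```python
import json
from typing import Callable, Dict, List


def fake_llm(goal: str, memory: List[str]) -> str:
    """Hypothetical stand-in for the ChatGPT call: returns one JSON command."""
    if not memory:
        # First step: ask for the (hypothetical) "echo" command.
        return json.dumps({"command": {"name": "echo", "args": {"text": goal}}})
    # Once there is a result in memory, signal that the task is done.
    return json.dumps({"command": {"name": "task_complete", "args": {}}})


# Hypothetical command table: command name -> handler returning a text result.
COMMANDS: Dict[str, Callable[..., str]] = {
    "echo": lambda text: f"echoed: {text}",
}


def run_agent(goal: str, max_steps: int = 5) -> List[str]:
    """Ask the model for a command, run it, store the output, and repeat."""
    memory: List[str] = []
    for _ in range(max_steps):
        reply = json.loads(fake_llm(goal, memory))
        name = reply["command"]["name"]
        args = reply["command"]["args"]
        if name == "task_complete":
            break
        handler = COMMANDS.get(name)
        result = handler(**args) if handler else f"Unknown command '{name}'"
        # The output becomes context for the next model call.
        memory.append(result)
    return memory
```

The max_steps cap is a design choice of this sketch: it keeps a misbehaving model from looping forever.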

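Key data-processing modules

Prompt construction (./data/prompt.txt)

The prompt is designed with great care. It makes the author look like a full-time ChatGPT cheerleader; apparently AI needs encouragement too. See for yourself.

The purpose of this prompt is to have ChatGPT generate a JSON-formatted sequence of operations from the input goals. The prompt shown here has been trimmed: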

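Constraints:
1. ~4000 word limit for short-term memory. Your short-term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something, or want to recall past events, thinking about similar events will help you remember.
3. No user assistance.

Commands:
1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Execute Python File: "execute_python_file", args: "file": "<file>"
4. Execute Shell Command, non-interactive commands only: "execute_shell", args: "command_line": "<command_line>"

Resources:
1. Internet access for searches and information gathering.
2. Long-term memory management.
3. GPT-3.5 powered agents for delegation of simple tasks.
4. File output.

Performance evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should respond only in JSON format, as described below: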
{
    "thoughts":
    {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args":{
            "arg name": "value"
        }
    }
}
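
A reply in this format has to be parsed before a command can be dispatched, and a model will not always return valid JSON. The sketch below shows one way to do that defensively. The function name parse_response and the convention of returning an "error" pseudo-command are assumptions made for this example, not something taken from the Auto-GPT source.

```python
import json
from typing import Any, Dict, Tuple


def parse_response(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Extract (command name, args) from a reply in the JSON format above.

    Returns ("error", {"message": ...}) when the reply cannot be used.
    """
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        return "error", {"message": f"Invalid JSON: {exc.msg}"}

    if not isinstance(reply, dict):
        return "error", {"message": "Top-level JSON value must be an object"}

    command = reply.get("command")
    if not isinstance(command, dict) or "name" not in command:
        return "error", {"message": "Missing 'command' object with a 'name'"}

    # "args" is optional here; default to no arguments.
    return command["name"], command.get("args", {})
```

For example, a reply whose command object is {"name": "google", "args": {"input": "Auto-GPT"}} yields the pair ("google", {"input": "Auto-GPT"}), which can then be looked up in a command table.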

Command execution (execute_code.py)

#1. Python file execution: run the .py file inside a Docker container (excerpt only, for brevity)
def execute_python_file(file):
    """Execute a Python file in a Docker container and return the output"""

    print(f"Executing file '{file}' in workspace '{WORKSPACE_FOLDER}'")

    if not file.endswith(".py"):
        return "Error: Invalid file type. Only .py files are allowed."

    file_path = os.path.join(WORKSPACE_FOLDER, file)

    if not os.path.isfile(file_path):
        return f"Error: File '{file}' does not exist."

    try:
        client = docker.from_env()

        image_name = 'python:3.10'
        try:
            client.images.get(image_name)
            print(f"Image '{image_name}' found locally")
        except docker.errors.ImageNotFound:
            print(f"Image '{image_name}' not found locally, pulling from Docker Hub")
            # Use the low-level API to stream the pull response
            low_level_client = docker.APIClient()
(remaining code omitted)

#2. Shell command execution
def execute_shell(command_line):

    current_dir = os.getcwd()

    if WORKSPACE_FOLDER not in current_dir:  # Change dir into workspace if necessary
        work_dir = os.path.join(os.getcwd(), WORKSPACE_FOLDER)
        os.chdir(work_dir)

    print(f"Executing command '{command_line}' in working directory '{os.getcwd()}'")

    result = subprocess.run(command_line, capture_output=True, shell=True)
    output = f"STDOUT:\n{result.stdout}\nSTDERR:\n{result.stderr}"

    # Change back to whatever the prior working dir was

    os.chdir(current_dir)

    return output
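
The function above changes the process-wide working directory and then changes it back. As a possible alternative, subprocess.run accepts a cwd argument, so the command can run in the workspace without touching the caller's working directory. The sketch below is an illustration, not the project's code: the name run_in_workspace and the timeout parameter are assumptions added for this example.

```python
import subprocess


def run_in_workspace(command_line: str, workspace: str, timeout: float = 30.0) -> str:
    """Run a non-interactive shell command inside `workspace`.

    cwd= applies only to the child process, so the parent's working
    directory is never changed and nothing needs to be restored.
    """
    result = subprocess.run(
        command_line,
        shell=True,           # the command arrives as a single string
        cwd=workspace,
        capture_output=True,
        text=True,            # decode stdout/stderr to str instead of bytes
        timeout=timeout,      # raises subprocess.TimeoutExpired if exceeded
    )
    return f"STDOUT:\n{result.stdout}\nSTDERR:\n{result.stderr}"
```

With text=True the returned string contains the decoded output rather than the b'...' representation of a bytes object. Note that shell=True still executes whatever the model asks for, so the sandboxing concerns of the original apply here as well.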

Web browsing (browse.py)

#1. The code before this function roughly fetches the page with urllib and parses it with BeautifulSoup
def summarize_text(text, question):
    """Summarize text using the LLM model"""
    if not text:
        return "Error: No text to summarize"

    text_length = len(text)
    print(f"Text length: {text_length} characters")

    summaries = []
    #2. Text splitting: the split_text method in the source splits the text by paragraph
    chunks = list(split_text(text))
    #3. Chunk summaries: for each chunk, call the ChatGPT API to generate a summary
    for i, chunk in enumerate(chunks):
        print(f"Summarizing chunk {i + 1} / {len(chunks)}")
        messages = [create_message(chunk, question)]

        summary = create_chat_completion(
            model=cfg.fast_llm_model,
            messages=messages,
            max_tokens=300,
        )
        summaries.append(summary)

    print(f"Summarized {len(chunks)} chunks.")

    combined_summary = "\n".join(summaries)
    messages = [create_message(combined_summary, question)]
    #4. Final summary: summarize the combined chunk summaries from step 3
    final_summary = create_chat_completion(
        model=cfg.fast_llm_model,
        messages=messages,
        max_tokens=300,
    )

    return final_summary
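
The split-then-summarize pattern above can be tried without any API call. The sketch below is written for illustration and is not copied from browse.py: this split_text is my own paragraph-based splitter with an assumed max_length parameter, and summarize_chunk is a hypothetical callable standing in for create_chat_completion.

```python
from typing import Callable, Iterator, List


def split_text(text: str, max_length: int = 200) -> Iterator[str]:
    """Yield chunks built from whole paragraphs, each at most max_length
    characters (a single longer paragraph is yielded on its own)."""
    current: List[str] = []
    current_len = 0
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank lines between paragraphs
        # +1 accounts for the newline that joins paragraphs in a chunk.
        if current and current_len + len(paragraph) + 1 > max_length:
            yield "\n".join(current)
            current, current_len = [], 0
        current.append(paragraph)
        current_len += len(paragraph) + 1
    if current:
        yield "\n".join(current)


def summarize(text: str, summarize_chunk: Callable[[str], str],
              max_length: int = 200) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    partial = [summarize_chunk(chunk) for chunk in split_text(text, max_length)]
    return summarize_chunk("\n".join(partial))
```

Splitting on paragraph boundaries keeps each request under the model's context limit while avoiding cuts in the middle of a sentence; the second pass then condenses the per-chunk summaries into one answer.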

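New Bing appears to work on much the same principle.

Vector database maintenance (memory folder)

The memory folder provides three storage backends: local, the Pinecone vector database, and Redis. The principle is the same in all three: use a vector index to fuzzy-search the knowledge base for relevant content, then add that content to the prompt together with the question and send it to the large model.

Query step 1, embedding: call the embedding method of the OpenAI API to convert the user input into a vector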

def get_ada_embedding(text):
    text = text.replace("\n", " ")
    if cfg.use_azure:
        return openai.Embedding.create(input=[text], engine=cfg.get_azure_deployment_id_for_model("text-embedding-ada-002"))["data"][0]["embedding"]
    else:
        return openai.Embedding.create(input=[text], model="text-embedding-ada-002")["data"][0]["embedding"]

Query step 2: match the top-k entries in the vector database

#1. Pinecone: call the vector database's query method (./memory/pinecone.py)
def get_relevant(self, data, num_relevant=5):
        """
        Returns all the data in the memory that is relevant to the given data.
        :param data: The data to compare to.
        :param num_relevant: The number of relevant data to return. Defaults to 5
        """
        query_embedding = get_ada_embedding(data)
        results = self.index.query(query_embedding, top_k=num_relevant, include_metadata=True)
        sorted_results = sorted(results.matches, key=lambda x: x.score)
        return [str(item['metadata']["raw_text"]) for item in sorted_results]

#2. Local: use numpy to compute vector inner products and pick the k largest (./memory/local.py)
def get_relevant(self, text: str, k: int) -> List[Any]:
        """
        matrix-vector mult to find score-for-each-row-of-matrix
         get indices for top-k winning scores
         return texts for those indices
        Args:
            text: str
            k: int

        Returns: List[str]
        """
        embedding = get_ada_embedding(text)
        scores = np.dot(self.data.embeddings, embedding)
        top_k_indices = np.argsort(scores)[-k:][::-1]
        return [self.data.texts[i] for i in top_k_indices]

#3. Redis: search with Redis's own vector search support (VectorField) (./memory/redis.py)
def get_relevant(
        self,
        data: str,
        num_relevant: int = 5
    ) -> Optional[List[Any]]:
        """
        Returns all the data in the memory that is relevant to the given data.
        Args:
            data: The data to compare to.
            num_relevant: The number of relevant data to return.

        Returns: A list of the most relevant data.
        """
        query_embedding = get_ada_embedding(data)
        base_query = f"*=>[KNN {num_relevant} @embedding $vector AS vector_score]"
        query = Query(base_query).return_fields(
            "data",
            "vector_score"
        ).sort_by("vector_score").dialect(2)
        query_vector = np.array(query_embedding).astype(np.float32).tobytes()

        try:
            results = self.redis.ft(f"{self.cfg.memory_index}").search(
                query, query_params={"vector": query_vector}
            )
        except Exception as e:
            print("Error calling Redis search: ", e)
            return None
        return [result.data for result in results.docs]

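Alibaba Cloud's AnalyticDB offers functionality similar to the first approach. The second and third approaches do not require a dedicated vector database, which makes them a cost-effective way to build a domain knowledge base.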
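
To make the local approach concrete, here is a dependency-free sketch of the same idea: score every stored vector against the query and return the texts with the highest scores. It is an illustration under assumptions, not the code in local.py. The names normalize and top_k are hypothetical, and the vectors are normalized explicitly so that the inner product equals cosine similarity (the original relies on np.dot over the stored embeddings instead).

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k(query: Sequence[float], embeddings: Sequence[Sequence[float]],
          texts: Sequence[str], k: int) -> List[str]:
    """Return the k texts whose embeddings are most similar to the query."""
    q = normalize(query)
    # Inner product of unit vectors = cosine similarity; higher is closer.
    scores = [sum(a * b for a, b in zip(normalize(row), q)) for row in embeddings]
    # Sort indices by score, highest first, and keep the first k.
    ranked = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in ranked[:k]]
```

This is a brute-force scan, linear in the number of stored items, which is fine for a small knowledge base; an approximate index only starts to matter once the collection grows large.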

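Speech module: speak.py

This module supports voice-style interaction. As the code below shows, when an ElevenLabs API key is configured, it requests a text-to-speech audio file from the ElevenLabs API and plays it, falling back to gTTS if the request fails. Without a key, it uses the speech feature built into macOS when use_mac_os_tts is enabled, and gTTS otherwise. Making use of the built-in macOS speech command is a nice touch by the developers.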

#1. On macOS, call the system's built-in text-to-speech
def macos_tts_speech(text, voice_index=0):
    if voice_index == 0:
        os.system(f'say "{text}"')
    else:
        if voice_index == 1:
            os.system(f'say -v "Ava (Premium)" "{text}"')
        else:
            os.system(f'say -v Samantha "{text}"')
#2. When an API key is configured, call the ElevenLabs API
def eleven_labs_speech(text, voice_index=0):
    """Speak text using elevenlabs.io's API"""
    tts_url = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}".format(
        voice_id=voices[voice_index])
    formatted_message = {"text": text}
    response = requests.post(
        tts_url, headers=tts_headers, json=formatted_message)

    if response.status_code == 200:
        with mutex_lock:
            with open("speech.mpeg", "wb") as f:
                f.write(response.content)
            playsound("speech.mpeg", True)
            os.remove("speech.mpeg")
        return True
    else:
        print("Request failed with status code:", response.status_code)
        print("Response content:", response.content)
        return False
#3. Entry point
def say_text(text, voice_index=0):

    def speak():
        if not cfg.elevenlabs_api_key:
            if cfg.use_mac_os_tts == 'True':
                macos_tts_speech(text, voice_index)
            else:
                gtts_speech(text)
        else:
            success = eleven_labs_speech(text, voice_index)
            if not success:
                gtts_speech(text)
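
One detail in macos_tts_speech is worth a note: the text is interpolated into a shell string, so a double quote inside the text breaks the command. A possible alternative is to pass the arguments as a list so that no shell parsing happens. The sketch below only assumes the same say and -v usage that appears in the code above; build_say_command and speak are hypothetical names for this example.

```python
import subprocess
from typing import List, Optional


def build_say_command(text: str, voice: Optional[str] = None) -> List[str]:
    """Build the argument list for the macOS `say` command.

    Each list element reaches `say` as one argument, so quotes or other
    shell metacharacters in `text` are passed through literally.
    """
    command = ["say"]
    if voice:
        command += ["-v", voice]
    command.append(text)
    return command


def speak(text: str, voice: Optional[str] = None) -> None:
    """Speak the text on macOS; the call only works where `say` exists."""
    subprocess.run(build_say_command(text, voice), check=False)
```

For example, build_say_command("hi", "Samantha") produces the same invocation as the Samantha branch above, without going through a shell.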

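Summary

Things worth learning from:

How the data is organized and processed: the prompt construction, the way incoming data is split into chunks, and the create, read, update, and delete operations on the vector index for the local and Redis backends.
How a rich set of system interfaces is used to automate operations. The possibilities go well beyond what is shown here, for example upgrading OA systems, business systems, and data service products. Similar things can be built on Tongyi Qianwen or GLM, and Auto-GPT offers a lot of inspiration.

Reflections: there is still plenty of room for optimization (the crawler could use proxies, and the local vector index could adopt an efficient open-source algorithm such as Annoy). Even so, the line-by-line comments show how much care the developers put in. I could feel the warmth in the code, and late at night it felt like a conversation with the developers across time and space.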