diff --git a/README.md b/README.md index ceeea7c0109037d9cd756f4f67ca9724eaa94e50..3167cf9aa7d08c829ad6c853a89b2c9add374fe4 100644 --- a/README.md +++ b/README.md @@ -40,7 +40,7 @@ To translate this project to arbitrary language with GPT, read and run [`multi_l 功能(⭐= 近期新增功能) | 描述 --- | --- -⭐[接入新模型](https://github.com/binary-husky/gpt_academic/wiki/%E5%A6%82%E4%BD%95%E5%88%87%E6%8D%A2%E6%A8%A1%E5%9E%8B)! | 百度[千帆](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/Nlks5zkzu)与文心一言, [通义千问](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary),上海AI-Lab[书生](https://github.com/InternLM/InternLM),讯飞[星火](https://xinghuo.xfyun.cn/),[LLaMa2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf),智谱API,DALLE3 +⭐[接入新模型](https://github.com/binary-husky/gpt_academic/wiki/%E5%A6%82%E4%BD%95%E5%88%87%E6%8D%A2%E6%A8%A1%E5%9E%8B)! | 百度[千帆](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/Nlks5zkzu)与文心一言, 通义千问[Qwen](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary),上海AI-Lab[书生](https://github.com/InternLM/InternLM),讯飞[星火](https://xinghuo.xfyun.cn/),[LLaMa2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf),[智谱API](https://open.bigmodel.cn/),DALLE3, [DeepseekCoder](https://coder.deepseek.com/) 润色、翻译、代码解释 | 一键润色、翻译、查找论文语法错误、解释代码 [自定义快捷键](https://www.bilibili.com/video/BV14s4y1E7jN) | 支持自定义快捷键 模块化设计 | 支持自定义强大的[插件](https://github.com/binary-husky/gpt_academic/tree/master/crazy_functions),插件支持[热更新](https://github.com/binary-husky/gpt_academic/wiki/%E5%87%BD%E6%95%B0%E6%8F%92%E4%BB%B6%E6%8C%87%E5%8D%97) @@ -104,36 +104,38 @@ Latex论文一键校对 | [插件] 仿Grammarly对Latex文章进行语法、拼 ### 安装方法I:直接运行 (Windows, Linux or MacOS) 1. 下载项目 -```sh -git clone --depth=1 https://github.com/binary-husky/gpt_academic.git -cd gpt_academic -``` + + ```sh + git clone --depth=1 https://github.com/binary-husky/gpt_academic.git + cd gpt_academic + ``` 2. 配置API_KEY -在`config.py`中,配置API KEY等设置,[点击查看特殊网络环境设置方法](https://github.com/binary-husky/gpt_academic/issues/1) 。[Wiki页面](https://github.com/binary-husky/gpt_academic/wiki/项目配置说明)。 + 在`config.py`中,配置API KEY等设置,[点击查看特殊网络环境设置方法](https://github.com/binary-husky/gpt_academic/issues/1) 。[Wiki页面](https://github.com/binary-husky/gpt_academic/wiki/项目配置说明)。 -「 程序会优先检查是否存在名为`config_private.py`的私密配置文件,并用其中的配置覆盖`config.py`的同名配置。如您能理解该读取逻辑,我们强烈建议您在`config.py`旁边创建一个名为`config_private.py`的新配置文件,并把`config.py`中的配置转移(复制)到`config_private.py`中(仅复制您修改过的配置条目即可)。 」 + 「 程序会优先检查是否存在名为`config_private.py`的私密配置文件,并用其中的配置覆盖`config.py`的同名配置。如您能理解该读取逻辑,我们强烈建议您在`config.py`旁边创建一个名为`config_private.py`的新配置文件,并把`config.py`中的配置转移(复制)到`config_private.py`中(仅复制您修改过的配置条目即可)。 」 -「 支持通过`环境变量`配置项目,环境变量的书写格式参考`docker-compose.yml`文件或者我们的[Wiki页面](https://github.com/binary-husky/gpt_academic/wiki/项目配置说明)。配置读取优先级: `环境变量` > `config_private.py` > `config.py`。 」 + 「 支持通过`环境变量`配置项目,环境变量的书写格式参考`docker-compose.yml`文件或者我们的[Wiki页面](https://github.com/binary-husky/gpt_academic/wiki/项目配置说明)。配置读取优先级: `环境变量` > `config_private.py` > `config.py`。 」 3. 安装依赖 -```sh -# (选择I: 如熟悉python, python推荐版本 3.9 ~ 3.11)备注:使用官方pip源或者阿里pip源, 临时换源方法:python -m pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ -python -m pip install -r requirements.txt + ```sh + # (选择I: 如熟悉python, python推荐版本 3.9 ~ 3.11)备注:使用官方pip源或者阿里pip源, 临时换源方法:python -m pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ + python -m pip install -r requirements.txt -# (选择II: 使用Anaconda)步骤也是类似的 (https://www.bilibili.com/video/BV1rc411W7Dr): -conda create -n gptac_venv python=3.11 # 创建anaconda环境 -conda activate gptac_venv # 激活anaconda环境 -python -m pip install -r requirements.txt # 这个步骤和pip安装一样的步骤 -``` + # (选择II: 使用Anaconda)步骤也是类似的 (https://www.bilibili.com/video/BV1rc411W7Dr): + conda create -n gptac_venv python=3.11 # 创建anaconda环境 + conda activate gptac_venv # 激活anaconda环境 + python -m pip install -r requirements.txt # 这个步骤和pip安装一样的步骤 + ```
如果需要支持清华ChatGLM2/复旦MOSS/RWKV作为后端,请点击展开此处

【可选步骤】如果需要支持清华ChatGLM2/复旦MOSS作为后端,需要额外安装更多依赖(前提条件:熟悉Python + 用过Pytorch + 电脑配置够强): + ```sh # 【可选步骤I】支持清华ChatGLM2。清华ChatGLM备注:如果遇到"Call ChatGLM fail 不能正常加载ChatGLM的参数" 错误,参考如下: 1:以上默认安装的为torch+cpu版,使用cuda需要卸载torch重新安装torch+cuda; 2:如因本机配置不够无法加载模型,可以修改request_llm/bridge_chatglm.py中的模型精度, 将 AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True) 都修改为 AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True) python -m pip install -r request_llms/requirements_chatglm.txt @@ -155,39 +157,39 @@ AVAIL_LLM_MODELS = ["gpt-3.5-turbo", "api2d-gpt-3.5-turbo", "gpt-4", "api2d-gpt- 4. 运行 -```sh -python main.py -``` + ```sh + python main.py + ``` ### 安装方法II:使用Docker 0. 部署项目的全部能力(这个是包含cuda和latex的大型镜像。但如果您网速慢、硬盘小,则不推荐使用这个) [![fullcapacity](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-all-capacity.yml/badge.svg?branch=master)](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-all-capacity.yml) -``` sh -# 修改docker-compose.yml,保留方案0并删除其他方案。然后运行: -docker-compose up -``` + ``` sh + # 修改docker-compose.yml,保留方案0并删除其他方案。然后运行: + docker-compose up + ``` 1. 仅ChatGPT+文心一言+spark等在线模型(推荐大多数人选择) [![basic](https://github.com/binary-husky/gpt_academic/actions/workflows/build-without-local-llms.yml/badge.svg?branch=master)](https://github.com/binary-husky/gpt_academic/actions/workflows/build-without-local-llms.yml) [![basiclatex](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-latex.yml/badge.svg?branch=master)](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-latex.yml) [![basicaudio](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-audio-assistant.yml/badge.svg?branch=master)](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-audio-assistant.yml) -``` sh -# 修改docker-compose.yml,保留方案1并删除其他方案。然后运行: -docker-compose up -``` + ``` sh + # 修改docker-compose.yml,保留方案1并删除其他方案。然后运行: + docker-compose up + ``` P.S. 如果需要依赖Latex的插件功能,请见Wiki。另外,您也可以直接使用方案4或者方案0获取Latex功能。 2. ChatGPT + ChatGLM2 + MOSS + LLAMA2 + 通义千问(需要熟悉[Nvidia Docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-ubuntu-and-debian)运行时) [![chatglm](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-chatglm.yml/badge.svg?branch=master)](https://github.com/binary-husky/gpt_academic/actions/workflows/build-with-chatglm.yml) -``` sh -# 修改docker-compose.yml,保留方案2并删除其他方案。然后运行: -docker-compose up -``` + ``` sh + # 修改docker-compose.yml,保留方案2并删除其他方案。然后运行: + docker-compose up + ``` ### 安装方法III:其他部署姿势 @@ -208,9 +210,11 @@ docker-compose up # Advanced Usage ### I:自定义新的便捷按钮(学术快捷键) + 任意文本编辑器打开`core_functional.py`,添加条目如下,然后重启程序。(如按钮已存在,那么前缀、后缀都支持热修改,无需重启程序即可生效。) 例如 -``` + +```python "超级英译中": { # 前缀,会被加在你的输入之前。例如,用来描述你的要求,例如翻译、解释代码、润色等等 "Prefix": "请翻译把下面一段内容成中文,然后用一个markdown表格逐一解释文中出现的专有名词:\n\n", @@ -219,6 +223,7 @@ docker-compose up "Suffix": "", }, ``` +

@@ -295,6 +300,7 @@ Tip:不指定文件直接点击 `载入对话历史存档` 可以查看历史h ### II:版本: + - version 3.70(todo): 优化AutoGen插件主题并设计一系列衍生插件 - version 3.60: 引入AutoGen作为新一代插件的基石 - version 3.57: 支持GLM3,星火v3,文心一言v4,修复本地模型的并发BUG @@ -315,7 +321,7 @@ Tip:不指定文件直接点击 `载入对话历史存档` 可以查看历史h - version 3.0: 对chatglm和其他小型llm的支持 - version 2.6: 重构了插件结构,提高了交互性,加入更多插件 - version 2.5: 自更新,解决总结大工程源代码时文本过长、token溢出的问题 -- version 2.4: (1)新增PDF全文翻译功能; (2)新增输入区切换位置的功能; (3)新增垂直布局选项; (4)多线程函数插件优化。 +- version 2.4: 新增PDF全文翻译功能; 新增输入区切换位置的功能 - version 2.3: 增强多线程交互性 - version 2.2: 函数插件支持热重载 - version 2.1: 可折叠式布局 @@ -337,6 +343,7 @@ GPT Academic开发者QQ群:`610599535` 1. `master` 分支: 主分支,稳定版 2. `frontier` 分支: 开发分支,测试版 +3. 如何接入其他大模型:[接入其他大模型](request_llms/README.md) ### V:参考与学习 diff --git a/check_proxy.py b/check_proxy.py index 977ac276a9c3f1fe3c01a219aae6bc6d6ca8b2aa..e2ba3f1aaad3d69beb8d65e3681447f84bd1c2c4 100644 --- a/check_proxy.py +++ b/check_proxy.py @@ -5,7 +5,6 @@ def check_proxy(proxies): try: response = requests.get("https://ipapi.co/json/", proxies=proxies, timeout=4) data = response.json() - # print(f'查询代理的地理位置,返回的结果是{data}') if 'country_name' in data: country = data['country_name'] result = f"代理配置 {proxies_https}, 代理所在地:{country}" @@ -47,8 +46,8 @@ def backup_and_download(current_version, remote_version): os.makedirs(new_version_dir) shutil.copytree('./', backup_dir, ignore=lambda x, y: ['history']) proxies = get_conf('proxies') - r = requests.get( - 'https://github.com/binary-husky/chatgpt_academic/archive/refs/heads/master.zip', proxies=proxies, stream=True) + try: r = requests.get('https://github.com/binary-husky/chatgpt_academic/archive/refs/heads/master.zip', proxies=proxies, stream=True) + except: r = requests.get('https://public.gpt-academic.top/publish/master.zip', proxies=proxies, stream=True) zip_file_path = backup_dir+'/master.zip' with open(zip_file_path, 'wb+') as f: f.write(r.content) @@ -111,11 +110,10 @@ def auto_update(raise_error=False): try: from toolbox import get_conf import requests - import time import json proxies = get_conf('proxies') - response = requests.get( - "https://raw.githubusercontent.com/binary-husky/chatgpt_academic/master/version", proxies=proxies, timeout=5) + try: response = requests.get("https://raw.githubusercontent.com/binary-husky/chatgpt_academic/master/version", proxies=proxies, timeout=5) + except: response = requests.get("https://public.gpt-academic.top/publish/version", proxies=proxies, timeout=5) remote_json_data = json.loads(response.text) remote_version = remote_json_data['version'] if remote_json_data["show_feature"]: @@ -127,8 +125,7 @@ def auto_update(raise_error=False): current_version = json.loads(current_version)['version'] if (remote_version - current_version) >= 0.01-1e-5: from colorful import print亮黄 - print亮黄( - f'\n新版本可用。新版本:{remote_version},当前版本:{current_version}。{new_feature}') + print亮黄(f'\n新版本可用。新版本:{remote_version},当前版本:{current_version}。{new_feature}') print('(1)Github更新地址:\nhttps://github.com/binary-husky/chatgpt_academic\n') user_instruction = input('(2)是否一键更新代码(Y+回车=确认,输入其他/无输入+回车=不更新)?') if user_instruction in ['Y', 'y']: @@ -154,7 +151,7 @@ def auto_update(raise_error=False): print(msg) def warm_up_modules(): - print('正在执行一些模块的预热...') + print('正在执行一些模块的预热 ...') from toolbox import ProxyNetworkActivate from request_llms.bridge_all import model_info with ProxyNetworkActivate("Warmup_Modules"): diff --git a/config.py b/config.py index 59d7be15a00147449a262be55f7cd87a451f5792..46d1d90d8e056f7608750b248c70136300de4ebd 100644 --- a/config.py +++ b/config.py @@ -94,12 +94,12 @@ DEFAULT_FN_GROUPS = ['对话', '编程', '学术', '智能体'] # 模型选择是 (注意: LLM_MODEL是默认选中的模型, 它*必须*被包含在AVAIL_LLM_MODELS列表中 ) LLM_MODEL = "gpt-3.5-turbo" # 可选 ↓↓↓ -AVAIL_LLM_MODELS = ["gpt-3.5-turbo-1106","gpt-4-1106-preview", +AVAIL_LLM_MODELS = ["gpt-3.5-turbo-1106","gpt-4-1106-preview","gpt-4-vision-preview", "gpt-3.5-turbo-16k", "gpt-3.5-turbo", "azure-gpt-3.5", "api2d-gpt-3.5-turbo", 'api2d-gpt-3.5-turbo-16k', "gpt-4", "gpt-4-32k", "azure-gpt-4", "api2d-gpt-4", - "chatglm3", "moss", "newbing", "claude-2"] -# P.S. 其他可用的模型还包括 ["zhipuai", "qianfan", "llama2", "qwen", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613", "gpt-3.5-random" + "chatglm3", "moss", "claude-2"] +# P.S. 其他可用的模型还包括 ["zhipuai", "qianfan", "deepseekcoder", "llama2", "qwen", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613", "gpt-3.5-random" # "spark", "sparkv2", "sparkv3", "chatglm_onnx", "claude-1-100k", "claude-2", "internlm", "jittorllms_pangualpha", "jittorllms_llama"] @@ -278,11 +278,31 @@ NUM_CUSTOM_BASIC_BTN = 4 │ ├── BAIDU_CLOUD_API_KEY │ └── BAIDU_CLOUD_SECRET_KEY │ -├── "newbing" Newbing接口不再稳定,不推荐使用 +├── "zhipuai" 智谱AI大模型chatglm_turbo +│ ├── ZHIPUAI_API_KEY +│ └── ZHIPUAI_MODEL +│ +└── "newbing" Newbing接口不再稳定,不推荐使用 ├── NEWBING_STYLE └── NEWBING_COOKIES +本地大模型示意图 +│ +├── "chatglm3" +├── "chatglm" +├── "chatglm_onnx" +├── "chatglmft" +├── "internlm" +├── "moss" +├── "jittorllms_pangualpha" +├── "jittorllms_llama" +├── "deepseekcoder" +├── "qwen" +├── RWKV的支持见Wiki +└── "llama2" + + 用户图形界面布局依赖关系示意图 │ ├── CHATBOT_HEIGHT 对话窗的高度 @@ -293,7 +313,7 @@ NUM_CUSTOM_BASIC_BTN = 4 ├── THEME 色彩主题 ├── AUTO_CLEAR_TXT 是否在提交时自动清空输入框 ├── ADD_WAIFU 加一个live2d装饰 -├── ALLOW_RESET_CONFIG 是否允许通过自然语言描述修改本页的配置,该功能具有一定的危险性 +└── ALLOW_RESET_CONFIG 是否允许通过自然语言描述修改本页的配置,该功能具有一定的危险性 插件在线服务配置依赖关系示意图 @@ -305,7 +325,7 @@ NUM_CUSTOM_BASIC_BTN = 4 │ ├── ALIYUN_ACCESSKEY │ └── ALIYUN_SECRET │ -├── PDF文档精准解析 -│ └── GROBID_URLS +└── PDF文档精准解析 + └── GROBID_URLS """ diff --git "a/crazy_functions/Latex\350\276\223\345\207\272PDF\347\273\223\346\236\234.py" "b/crazy_functions/Latex\350\276\223\345\207\272PDF\347\273\223\346\236\234.py" index a2545ddde802245c765708ce16e48028c45b0385..f3919edc38c9e5f7e35f980a321b267df8b801f5 100644 --- "a/crazy_functions/Latex\350\276\223\345\207\272PDF\347\273\223\346\236\234.py" +++ "b/crazy_functions/Latex\350\276\223\345\207\272PDF\347\273\223\346\236\234.py" @@ -73,6 +73,7 @@ def move_project(project_folder, arxiv_id=None): # align subfolder if there is a folder wrapper items = glob.glob(pj(project_folder,'*')) + items = [item for item in items if os.path.basename(item)!='__MACOSX'] if len(glob.glob(pj(project_folder,'*.tex'))) == 0 and len(items) == 1: if os.path.isdir(items[0]): project_folder = items[0] @@ -214,7 +215,6 @@ def Latex英文纠错加PDF对比(txt, llm_kwargs, plugin_kwargs, chatbot, histo # <-------------- we are done -------------> return success - # ========================================= 插件主程序2 ===================================================== @CatchException diff --git a/crazy_functions/crazy_utils.py b/crazy_functions/crazy_utils.py index 5788194c0145ea57eda4df13f6f41e14f968f134..afe079f419a70cc95a07ab9689197f0c43144e57 100644 --- a/crazy_functions/crazy_utils.py +++ b/crazy_functions/crazy_utils.py @@ -1,4 +1,4 @@ -from toolbox import update_ui, get_conf, trimmed_format_exc, get_log_folder +from toolbox import update_ui, get_conf, trimmed_format_exc, get_max_token import threading import os import logging @@ -92,7 +92,7 @@ def request_gpt_model_in_new_thread_with_ui_alive( # 【选择处理】 尝试计算比例,尽可能多地保留文本 from toolbox import get_reduce_token_percent p_ratio, n_exceed = get_reduce_token_percent(str(token_exceeded_error)) - MAX_TOKEN = 4096 + MAX_TOKEN = get_max_token(llm_kwargs) EXCEED_ALLO = 512 + 512 * exceeded_cnt inputs, history = input_clipping(inputs, history, max_token_limit=MAX_TOKEN-EXCEED_ALLO) mutable[0] += f'[Local Message] 警告,文本过长将进行截断,Token溢出数:{n_exceed}。\n\n' @@ -224,7 +224,7 @@ def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency( # 【选择处理】 尝试计算比例,尽可能多地保留文本 from toolbox import get_reduce_token_percent p_ratio, n_exceed = get_reduce_token_percent(str(token_exceeded_error)) - MAX_TOKEN = 4096 + MAX_TOKEN = get_max_token(llm_kwargs) EXCEED_ALLO = 512 + 512 * exceeded_cnt inputs, history = input_clipping(inputs, history, max_token_limit=MAX_TOKEN-EXCEED_ALLO) gpt_say += f'[Local Message] 警告,文本过长将进行截断,Token溢出数:{n_exceed}。\n\n' diff --git a/crazy_functions/latex_fns/latex_toolbox.py b/crazy_functions/latex_fns/latex_toolbox.py index afa65bf94d1f433410d26faa62b475449c1a758f..4555ff185c1fe00382cfd3a3851760394f51305a 100644 --- a/crazy_functions/latex_fns/latex_toolbox.py +++ b/crazy_functions/latex_fns/latex_toolbox.py @@ -283,10 +283,10 @@ def find_tex_file_ignore_case(fp): dir_name = os.path.dirname(fp) base_name = os.path.basename(fp) # 如果输入的文件路径是正确的 - if os.path.exists(pj(dir_name, base_name)): return pj(dir_name, base_name) + if os.path.isfile(pj(dir_name, base_name)): return pj(dir_name, base_name) # 如果不正确,试着加上.tex后缀试试 if not base_name.endswith('.tex'): base_name+='.tex' - if os.path.exists(pj(dir_name, base_name)): return pj(dir_name, base_name) + if os.path.isfile(pj(dir_name, base_name)): return pj(dir_name, base_name) # 如果还找不到,解除大小写限制,再试一次 import glob for f in glob.glob(dir_name+'/*.tex'): diff --git a/docs/waifu_plugin/waifu-tips.js b/docs/waifu_plugin/waifu-tips.js index 8f9533a19e7d4914bde888ee2a107e4430242968..d1b214d7e26bd2ea429f35af2ff87f6f3d74f526 100644 --- a/docs/waifu_plugin/waifu-tips.js +++ b/docs/waifu_plugin/waifu-tips.js @@ -258,39 +258,7 @@ function loadTipsMessage(result) { }); window.showWelcomeMessage = function(result) { - var text; - if (window.location.href == live2d_settings.homePageUrl) { - var now = (new Date()).getHours(); - if (now > 23 || now <= 5) text = getRandText(result.waifu.hour_tips['t23-5']); - else if (now > 5 && now <= 7) text = getRandText(result.waifu.hour_tips['t5-7']); - else if (now > 7 && now <= 11) text = getRandText(result.waifu.hour_tips['t7-11']); - else if (now > 11 && now <= 14) text = getRandText(result.waifu.hour_tips['t11-14']); - else if (now > 14 && now <= 17) text = getRandText(result.waifu.hour_tips['t14-17']); - else if (now > 17 && now <= 19) text = getRandText(result.waifu.hour_tips['t17-19']); - else if (now > 19 && now <= 21) text = getRandText(result.waifu.hour_tips['t19-21']); - else if (now > 21 && now <= 23) text = getRandText(result.waifu.hour_tips['t21-23']); - else text = getRandText(result.waifu.hour_tips.default); - } else { - var referrer_message = result.waifu.referrer_message; - if (document.referrer !== '') { - var referrer = document.createElement('a'); - referrer.href = document.referrer; - var domain = referrer.hostname.split('.')[1]; - if (window.location.hostname == referrer.hostname) - text = referrer_message.localhost[0] + document.title.split(referrer_message.localhost[2])[0] + referrer_message.localhost[1]; - else if (domain == 'baidu') - text = referrer_message.baidu[0] + referrer.search.split('&wd=')[1].split('&')[0] + referrer_message.baidu[1]; - else if (domain == 'so') - text = referrer_message.so[0] + referrer.search.split('&q=')[1].split('&')[0] + referrer_message.so[1]; - else if (domain == 'google') - text = referrer_message.google[0] + document.title.split(referrer_message.google[2])[0] + referrer_message.google[1]; - else { - $.each(result.waifu.referrer_hostname, function(i,val) {if (i==referrer.hostname) referrer.hostname = getRandText(val)}); - text = referrer_message.default[0] + referrer.hostname + referrer_message.default[1]; - } - } else text = referrer_message.none[0] + document.title.split(referrer_message.none[2])[0] + referrer_message.none[1]; - } - showMessage(text, 6000); + showMessage('欢迎使用GPT-Academic', 6000); }; if (live2d_settings.showWelcomeMessage) showWelcomeMessage(result); var waifu_tips = result.waifu; diff --git a/docs/waifu_plugin/waifu-tips.json b/docs/waifu_plugin/waifu-tips.json index 229d5a14cae722e7f7dbaf23e23d11e6ce233199..30f263153f3c20e10a1a28b35490fd4307f9595d 100644 --- a/docs/waifu_plugin/waifu-tips.json +++ b/docs/waifu_plugin/waifu-tips.json @@ -83,8 +83,8 @@ "很多强大的函数插件隐藏在下拉菜单中呢。", "红色的插件,使用之前需要把文件上传进去哦。", "想添加功能按钮吗?读读readme很容易就学会啦。", - "敏感或机密的信息,不可以问chatGPT的哦!", - "chatGPT究竟是划时代的创新,还是扼杀创造力的毒药呢?" + "敏感或机密的信息,不可以问AI的哦!", + "LLM究竟是划时代的创新,还是扼杀创造力的毒药呢?" ] } ], "click": [ @@ -92,8 +92,6 @@ "selector": ".waifu #live2d", "text": [ "是…是不小心碰到了吧", - "萝莉控是什么呀", - "你看到我的小熊了吗", "再摸的话我可要报警了!⌇●﹏●⌇", "110吗,这里有个变态一直在摸我(ó﹏ò。)" ] diff --git a/flagged/modeling_moss.py b/flagged/modeling_moss.py new file mode 100644 index 0000000000000000000000000000000000000000..f191bf532c0e1b769e773f932a35f099f782de44 --- /dev/null +++ b/flagged/modeling_moss.py @@ -0,0 +1,2952 @@ +""" PyTorch Moss model.""" + +from typing import Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import CrossEntropyLoss +import transformers +from transformers.activations import ACT2FN +from transformers.modeling_utils import PreTrainedModel +from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast +from transformers.utils import ( + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging +) + +from .configuration_moss import MossConfig + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "fnlp/moss-moon-003-base" +_CONFIG_FOR_DOC = "MossConfig" + + +MOSS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "fnlp/moss-moon-003-base", + "fnlp/moss-moon-003-sft", + "fnlp/moss-moon-003-sft-plugin", + "fnlp/moss-moon-003-sft-int4", + "fnlp/moss-moon-003-sft-plugin-int4", + "fnlp/moss-moon-003-sft-int8", + "fnlp/moss-moon-003-sft-plugin-int8", +] + + +# Copied from transformers.models.gptj.modeling_gptj.create_sinusoidal_positions +def create_sinusoidal_positions(num_pos: int, dim: int) -> torch.Tensor: + inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim)) + sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(num_pos, dtype=torch.float), inv_freq).float() + return torch.cat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1) + + +# Copied from transformers.models.gptj.modeling_gptj.rotate_every_two +def rotate_every_two(x: torch.Tensor) -> torch.Tensor: + x1 = x[:, :, :, ::2] + x2 = x[:, :, :, 1::2] + x = torch.stack((-x2, x1), dim=-1) + return x.flatten(-2) # in einsum notation: rearrange(x, '... d j -> ... (d j)') + + +# Copied from transformers.models.gptj.modeling_gptj.apply_rotary_pos_emb +def apply_rotary_pos_emb(tensor: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor) -> torch.Tensor: + sin = torch.repeat_interleave(sin[:, :, None, :], 2, 3) + cos = torch.repeat_interleave(cos[:, :, None, :], 2, 3) + return (tensor * cos) + (rotate_every_two(tensor) * sin) + + +class MossAttention(nn.Module): + def __init__(self, config): + super().__init__() + + max_positions = config.max_position_embeddings + self.register_buffer( + "causal_mask", + torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view( + 1, 1, max_positions, max_positions + ), + ) + + self.attn_dropout = nn.Dropout(config.attn_pdrop) + self.resid_dropout = nn.Dropout(config.resid_pdrop) + + self.embed_dim = config.hidden_size + self.num_attention_heads = config.num_attention_heads + self.head_dim = self.embed_dim // self.num_attention_heads + if self.head_dim * self.num_attention_heads != self.embed_dim: + raise ValueError( + f"embed_dim must be divisible by num_attention_heads (got `embed_dim`: {self.embed_dim} and" + f" `num_attention_heads`: {self.num_attention_heads})." + ) + self.scale_attn = torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()) + self.qkv_proj = nn.Linear(self.embed_dim, self.embed_dim * 3, bias=False) + + self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False) + self.rotary_dim = config.rotary_dim + pos_embd_dim = self.rotary_dim or self.embed_dim + self.embed_positions = create_sinusoidal_positions(max_positions, pos_embd_dim) + + def _split_heads(self, x, n_head, dim_head, mp_num): + reshaped = x.reshape(x.shape[:-1] + (n_head // mp_num, dim_head)) + reshaped = reshaped.reshape(x.shape[:-2] + (-1,) + reshaped.shape[-1:]) + return reshaped + + def _merge_heads(self, tensor, num_attention_heads, attn_head_size): + """ + Merges attn_head_size dim and num_attn_heads dim into n_ctx + """ + if len(tensor.shape) == 5: + tensor = tensor.permute(0, 1, 3, 2, 4).contiguous() + elif len(tensor.shape) == 4: + tensor = tensor.permute(0, 2, 1, 3).contiguous() + else: + raise ValueError(f"Input tensor rank should be one of [4, 5], but is: {len(tensor.shape)}") + new_shape = tensor.size()[:-2] + (num_attention_heads * attn_head_size,) + return tensor.view(new_shape) + + def _attn( + self, + query, + key, + value, + attention_mask=None, + head_mask=None, + ): + # compute causal mask from causal mask buffer + query_length, key_length = query.size(-2), key.size(-2) + causal_mask = self.causal_mask[:, :, key_length - query_length : key_length, :key_length] + + # Keep the attention weights computation in fp32 to avoid overflow issues + query = query.to(torch.float32) + key = key.to(torch.float32) + + attn_weights = torch.matmul(query, key.transpose(-1, -2)) + + attn_weights = attn_weights / self.scale_attn + mask_value = torch.finfo(attn_weights.dtype).min + # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`. + # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device` + mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device) + attn_weights = torch.where(causal_mask, attn_weights, mask_value) + + if attention_mask is not None: + # Apply the attention mask + attn_weights = attn_weights + attention_mask + + attn_weights = nn.Softmax(dim=-1)(attn_weights) + attn_weights = attn_weights.to(value.dtype) + attn_weights = self.attn_dropout(attn_weights) + + # Mask heads if we want to + if head_mask is not None: + attn_weights = attn_weights * head_mask + + attn_output = torch.matmul(attn_weights, value) + + return attn_output, attn_weights + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[ + Tuple[torch.Tensor, Tuple[torch.Tensor]], + Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]], + ]: + qkv = self.qkv_proj(hidden_states) + # TODO(enijkamp): factor out number of logical TPU-v4 cores or make forward pass agnostic + mp_num = 4 + qkv_split = qkv.reshape(qkv.shape[:-1] + (mp_num, -1)) + + local_dim = self.head_dim * self.num_attention_heads // mp_num + query, value, key = torch.split(qkv_split, local_dim, dim=-1) + query = self._split_heads(query, self.num_attention_heads, self.head_dim, mp_num=mp_num) + key = self._split_heads(key, self.num_attention_heads, self.head_dim, mp_num=mp_num) + + value = self._split_heads(value, self.num_attention_heads, self.head_dim, mp_num=mp_num) + value = value.permute(0, 2, 1, 3) + + embed_positions = self.embed_positions + if embed_positions.device != position_ids.device: + embed_positions = embed_positions.to(position_ids.device) + self.embed_positions = embed_positions + + sincos = embed_positions[position_ids] + sin, cos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1) + + if self.rotary_dim is not None: + k_rot = key[:, :, :, : self.rotary_dim] + k_pass = key[:, :, :, self.rotary_dim :] + + q_rot = query[:, :, :, : self.rotary_dim] + q_pass = query[:, :, :, self.rotary_dim :] + + k_rot = apply_rotary_pos_emb(k_rot, sin, cos) + q_rot = apply_rotary_pos_emb(q_rot, sin, cos) + + key = torch.cat([k_rot, k_pass], dim=-1) + query = torch.cat([q_rot, q_pass], dim=-1) + else: + key = apply_rotary_pos_emb(key, sin, cos) + query = apply_rotary_pos_emb(query, sin, cos) + + key = key.permute(0, 2, 1, 3) + query = query.permute(0, 2, 1, 3) + + if layer_past is not None: + past_key = layer_past[0] + past_value = layer_past[1] + key = torch.cat((past_key, key), dim=-2) + value = torch.cat((past_value, value), dim=-2) + + if use_cache is True: + present = (key, value) + else: + present = None + + # compute self-attention: V x Softmax(QK^T) + attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask) + + attn_output = self._merge_heads(attn_output, self.num_attention_heads, self.head_dim) + attn_output = self.out_proj(attn_output) + attn_output = self.resid_dropout(attn_output) + + outputs = (attn_output, present) + if output_attentions: + outputs += (attn_weights,) + + return outputs # a, present, (attentions) + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJMLP with GPTJ->Moss +class MossMLP(nn.Module): + def __init__(self, intermediate_size, config): # in MLP: intermediate_size= 4 * embed_dim + super().__init__() + embed_dim = config.n_embd + + self.fc_in = nn.Linear(embed_dim, intermediate_size) + self.fc_out = nn.Linear(intermediate_size, embed_dim) + + self.act = ACT2FN[config.activation_function] + self.dropout = nn.Dropout(config.resid_pdrop) + + def forward(self, hidden_states: Optional[torch.FloatTensor]) -> torch.FloatTensor: + hidden_states = self.fc_in(hidden_states) + hidden_states = self.act(hidden_states) + hidden_states = self.fc_out(hidden_states) + hidden_states = self.dropout(hidden_states) + return hidden_states + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJBlock with GPTJ->Moss +class MossBlock(nn.Module): + def __init__(self, config): + super().__init__() + inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd + self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon) + self.attn = MossAttention(config) + self.mlp = MossMLP(inner_dim, config) + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]: + residual = hidden_states + hidden_states = self.ln_1(hidden_states) + attn_outputs = self.attn( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + ) + attn_output = attn_outputs[0] # output_attn: a, present, (attentions) + outputs = attn_outputs[1:] + + feed_forward_hidden_states = self.mlp(hidden_states) + hidden_states = attn_output + feed_forward_hidden_states + residual + + if use_cache: + outputs = (hidden_states,) + outputs + else: + outputs = (hidden_states,) + outputs[1:] + + return outputs # hidden_states, present, (attentions) + + +class MossPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = MossConfig + base_model_prefix = "transformer" + supports_gradient_checkpointing = True + _no_split_modules = ["MossBlock"] + + def __init__(self, *inputs, **kwargs): + super().__init__(*inputs, **kwargs) + + def _init_weights(self, module): + """Initialize the weights.""" + if isinstance(module, (nn.Linear,)): + # Slightly different from Mesh Transformer JAX which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, MossModel): + module.gradient_checkpointing = value + + +MOSS_START_DOCSTRING = r""" + This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use + it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and + behavior. + + Parameters: + config ([`MossConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +MOSS_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoProcenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + position_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.n_positions - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.FloatTensor` of shape `(num_attention_heads,)` or `(n_layer, num_attention_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_dim)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert *input_ids* indices into associated vectors than the + model's internal embedding lookup matrix. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Moss Model transformer outputting raw hidden-states without any specific head on top.", + MOSS_START_DOCSTRING, +) +class MossModel(MossPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.embed_dim = config.n_embd + self.vocab_size = config.vocab_size + self.wte = nn.Embedding(config.vocab_size, self.embed_dim) + self.drop = nn.Dropout(config.embd_pdrop) + self.h = nn.ModuleList([MossBlock(config) for _ in range(config.n_layer)]) + self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon) + self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads) + + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.wte + + def set_input_embeddings(self, new_embeddings): + self.wte = new_embeddings + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=BaseModelOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + batch_size = input_ids.shape[0] + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + batch_size = inputs_embeds.shape[0] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if token_type_ids is not None: + token_type_ids = token_type_ids.view(-1, input_shape[-1]) + + if position_ids is not None: + position_ids = position_ids.view(-1, input_shape[-1]).long() + + if past_key_values is None: + past_length = 0 + past_key_values = tuple([None] * len(self.h)) + else: + past_length = past_key_values[0][0].size(-2) + + if position_ids is None: + position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1]) + + # Attention mask. + if attention_mask is not None: + if batch_size <= 0: + raise ValueError("batch_size has to be defined and > 0") + attention_mask = attention_mask.view(batch_size, -1) + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + attention_mask = attention_mask[:, None, None, :] + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and the dtype's smallest value for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility + attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x num_attention_heads x N x N + # head_mask has shape n_layer x batch x num_attention_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.n_layer) + + if inputs_embeds is None: + inputs_embeds = self.wte(input_ids) + + hidden_states = inputs_embeds + + if token_type_ids is not None: + token_type_embeds = self.wte(token_type_ids) + hidden_states = hidden_states + token_type_embeds + + hidden_states = self.drop(hidden_states) + + output_shape = input_shape + (hidden_states.size(-1),) + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting " + "`use_cache=False`..." + ) + use_cache = False + + presents = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, use_cache, output_attentions) + + return custom_forward + + outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(block), + hidden_states, + None, + attention_mask, + position_ids, + head_mask[i], + ) + else: + outputs = block( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + if use_cache is True: + presents = presents + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + + hidden_states = self.ln_f(hidden_states) + + hidden_states = hidden_states.view(output_shape) + # Add last hidden state + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=presents, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + +@add_start_docstrings( + """ + The Moss Model transformer with a language modeling head on top. + """, + MOSS_START_DOCSTRING, +) +class MossForCausalLM(MossPreTrainedModel): + _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.causal_mask"] + + def __init__(self, config): + super().__init__(config) + if not hasattr(config, 'wbits'): + config.wbits = 32 + config.groupsize = 128 + + if config.wbits not in [4, 8, 32]: + logger.warning(f'Specify `wbits` with 4, 8 or 32 to load the model. ') + if config.wbits in [4, 8]: + def noop(*args, **kwargs): + pass + torch.nn.init.kaiming_uniform_ = noop + torch.nn.init.uniform_ = noop + torch.nn.init.normal_ = noop + + torch.set_default_dtype(torch.half) + transformers.modeling_utils._init_weights = False + torch.set_default_dtype(torch.half) + self.transformer = MossModel(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size) + if config.wbits in [4, 8]: + torch.set_default_dtype(torch.float) + transformers.modeling_utils._init_weights = True + self.quantize(config.wbits, config.groupsize) + # Initialize weights and apply final processing + self.post_init() + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs): + token_type_ids = kwargs.get("token_type_ids", None) + # only last token for inputs_ids if past is defined in kwargs + if past_key_values: + input_ids = input_ids[:, -1].unsqueeze(-1) + if token_type_ids is not None: + token_type_ids = token_type_ids[:, -1].unsqueeze(-1) + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + if attention_mask is not None and position_ids is None: + # create position_ids on the fly for batch generation + position_ids = attention_mask.long().cumsum(-1) - 1 + position_ids.masked_fill_(attention_mask == 0, 1) + if past_key_values: + position_ids = position_ids[:, -1].unsqueeze(-1) + + return { + "input_ids": input_ids, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "position_ids": position_ids, + "attention_mask": attention_mask, + "token_type_ids": token_type_ids, + } + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=CausalLMOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set + `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100` + are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + transformer_outputs = self.transformer( + input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + hidden_states = transformer_outputs[0] + + # make sure sampling in fp16 works correctly and + # compute loss in fp32 to match with mesh-tf version + # https://github.com/EleutherAI/gpt-neo/blob/89ce74164da2fb16179106f54e2269b5da8db333/models/gpt2/gpt2.py#L179 + lm_logits = self.lm_head(hidden_states).to(torch.float32) + + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + + loss = loss.to(hidden_states.dtype) + + if not return_dict: + output = (lm_logits,) + transformer_outputs[1:] + return ((loss,) + output) if loss is not None else output + + return CausalLMOutputWithPast( + loss=loss, + logits=lm_logits, + past_key_values=transformer_outputs.past_key_values, + hidden_states=transformer_outputs.hidden_states, + attentions=transformer_outputs.attentions, + ) + + @staticmethod + def _reorder_cache( + past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor + ) -> Tuple[Tuple[torch.Tensor]]: + """ + This function is used to re-order the `past_key_values` cache if [`~PretrainedModel.beam_search`] or + [`~PretrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct + beam_idx at every generation step. + """ + return tuple( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past) + for layer_past in past_key_values + ) + + def quantize(self, wbits, groupsize): + from .quantization import quantize_with_gptq + return quantize_with_gptq(self, wbits, groupsize) + +""" PyTorch Moss model.""" + +from typing import Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import CrossEntropyLoss +import transformers +from transformers.activations import ACT2FN +from transformers.modeling_utils import PreTrainedModel +from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast +from transformers.utils import ( + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging +) + +from .configuration_moss import MossConfig + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "fnlp/moss-moon-003-base" +_CONFIG_FOR_DOC = "MossConfig" + + +MOSS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "fnlp/moss-moon-003-base", + "fnlp/moss-moon-003-sft", + "fnlp/moss-moon-003-sft-plugin", + "fnlp/moss-moon-003-sft-int4", + "fnlp/moss-moon-003-sft-plugin-int4", + "fnlp/moss-moon-003-sft-int8", + "fnlp/moss-moon-003-sft-plugin-int8", +] + + +# Copied from transformers.models.gptj.modeling_gptj.create_sinusoidal_positions +def create_sinusoidal_positions(num_pos: int, dim: int) -> torch.Tensor: + inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim)) + sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(num_pos, dtype=torch.float), inv_freq).float() + return torch.cat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1) + + +# Copied from transformers.models.gptj.modeling_gptj.rotate_every_two +def rotate_every_two(x: torch.Tensor) -> torch.Tensor: + x1 = x[:, :, :, ::2] + x2 = x[:, :, :, 1::2] + x = torch.stack((-x2, x1), dim=-1) + return x.flatten(-2) # in einsum notation: rearrange(x, '... d j -> ... (d j)') + + +# Copied from transformers.models.gptj.modeling_gptj.apply_rotary_pos_emb +def apply_rotary_pos_emb(tensor: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor) -> torch.Tensor: + sin = torch.repeat_interleave(sin[:, :, None, :], 2, 3) + cos = torch.repeat_interleave(cos[:, :, None, :], 2, 3) + return (tensor * cos) + (rotate_every_two(tensor) * sin) + + +class MossAttention(nn.Module): + def __init__(self, config): + super().__init__() + + max_positions = config.max_position_embeddings + self.register_buffer( + "causal_mask", + torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view( + 1, 1, max_positions, max_positions + ), + ) + + self.attn_dropout = nn.Dropout(config.attn_pdrop) + self.resid_dropout = nn.Dropout(config.resid_pdrop) + + self.embed_dim = config.hidden_size + self.num_attention_heads = config.num_attention_heads + self.head_dim = self.embed_dim // self.num_attention_heads + if self.head_dim * self.num_attention_heads != self.embed_dim: + raise ValueError( + f"embed_dim must be divisible by num_attention_heads (got `embed_dim`: {self.embed_dim} and" + f" `num_attention_heads`: {self.num_attention_heads})." + ) + self.scale_attn = torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()) + self.qkv_proj = nn.Linear(self.embed_dim, self.embed_dim * 3, bias=False) + + self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False) + self.rotary_dim = config.rotary_dim + pos_embd_dim = self.rotary_dim or self.embed_dim + self.embed_positions = create_sinusoidal_positions(max_positions, pos_embd_dim) + + def _split_heads(self, x, n_head, dim_head, mp_num): + reshaped = x.reshape(x.shape[:-1] + (n_head // mp_num, dim_head)) + reshaped = reshaped.reshape(x.shape[:-2] + (-1,) + reshaped.shape[-1:]) + return reshaped + + def _merge_heads(self, tensor, num_attention_heads, attn_head_size): + """ + Merges attn_head_size dim and num_attn_heads dim into n_ctx + """ + if len(tensor.shape) == 5: + tensor = tensor.permute(0, 1, 3, 2, 4).contiguous() + elif len(tensor.shape) == 4: + tensor = tensor.permute(0, 2, 1, 3).contiguous() + else: + raise ValueError(f"Input tensor rank should be one of [4, 5], but is: {len(tensor.shape)}") + new_shape = tensor.size()[:-2] + (num_attention_heads * attn_head_size,) + return tensor.view(new_shape) + + def _attn( + self, + query, + key, + value, + attention_mask=None, + head_mask=None, + ): + # compute causal mask from causal mask buffer + query_length, key_length = query.size(-2), key.size(-2) + causal_mask = self.causal_mask[:, :, key_length - query_length : key_length, :key_length] + + # Keep the attention weights computation in fp32 to avoid overflow issues + query = query.to(torch.float32) + key = key.to(torch.float32) + + attn_weights = torch.matmul(query, key.transpose(-1, -2)) + + attn_weights = attn_weights / self.scale_attn + mask_value = torch.finfo(attn_weights.dtype).min + # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`. + # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device` + mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device) + attn_weights = torch.where(causal_mask, attn_weights, mask_value) + + if attention_mask is not None: + # Apply the attention mask + attn_weights = attn_weights + attention_mask + + attn_weights = nn.Softmax(dim=-1)(attn_weights) + attn_weights = attn_weights.to(value.dtype) + attn_weights = self.attn_dropout(attn_weights) + + # Mask heads if we want to + if head_mask is not None: + attn_weights = attn_weights * head_mask + + attn_output = torch.matmul(attn_weights, value) + + return attn_output, attn_weights + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[ + Tuple[torch.Tensor, Tuple[torch.Tensor]], + Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]], + ]: + qkv = self.qkv_proj(hidden_states) + # TODO(enijkamp): factor out number of logical TPU-v4 cores or make forward pass agnostic + mp_num = 4 + qkv_split = qkv.reshape(qkv.shape[:-1] + (mp_num, -1)) + + local_dim = self.head_dim * self.num_attention_heads // mp_num + query, value, key = torch.split(qkv_split, local_dim, dim=-1) + query = self._split_heads(query, self.num_attention_heads, self.head_dim, mp_num=mp_num) + key = self._split_heads(key, self.num_attention_heads, self.head_dim, mp_num=mp_num) + + value = self._split_heads(value, self.num_attention_heads, self.head_dim, mp_num=mp_num) + value = value.permute(0, 2, 1, 3) + + embed_positions = self.embed_positions + if embed_positions.device != position_ids.device: + embed_positions = embed_positions.to(position_ids.device) + self.embed_positions = embed_positions + + sincos = embed_positions[position_ids] + sin, cos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1) + + if self.rotary_dim is not None: + k_rot = key[:, :, :, : self.rotary_dim] + k_pass = key[:, :, :, self.rotary_dim :] + + q_rot = query[:, :, :, : self.rotary_dim] + q_pass = query[:, :, :, self.rotary_dim :] + + k_rot = apply_rotary_pos_emb(k_rot, sin, cos) + q_rot = apply_rotary_pos_emb(q_rot, sin, cos) + + key = torch.cat([k_rot, k_pass], dim=-1) + query = torch.cat([q_rot, q_pass], dim=-1) + else: + key = apply_rotary_pos_emb(key, sin, cos) + query = apply_rotary_pos_emb(query, sin, cos) + + key = key.permute(0, 2, 1, 3) + query = query.permute(0, 2, 1, 3) + + if layer_past is not None: + past_key = layer_past[0] + past_value = layer_past[1] + key = torch.cat((past_key, key), dim=-2) + value = torch.cat((past_value, value), dim=-2) + + if use_cache is True: + present = (key, value) + else: + present = None + + # compute self-attention: V x Softmax(QK^T) + attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask) + + attn_output = self._merge_heads(attn_output, self.num_attention_heads, self.head_dim) + attn_output = self.out_proj(attn_output) + attn_output = self.resid_dropout(attn_output) + + outputs = (attn_output, present) + if output_attentions: + outputs += (attn_weights,) + + return outputs # a, present, (attentions) + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJMLP with GPTJ->Moss +class MossMLP(nn.Module): + def __init__(self, intermediate_size, config): # in MLP: intermediate_size= 4 * embed_dim + super().__init__() + embed_dim = config.n_embd + + self.fc_in = nn.Linear(embed_dim, intermediate_size) + self.fc_out = nn.Linear(intermediate_size, embed_dim) + + self.act = ACT2FN[config.activation_function] + self.dropout = nn.Dropout(config.resid_pdrop) + + def forward(self, hidden_states: Optional[torch.FloatTensor]) -> torch.FloatTensor: + hidden_states = self.fc_in(hidden_states) + hidden_states = self.act(hidden_states) + hidden_states = self.fc_out(hidden_states) + hidden_states = self.dropout(hidden_states) + return hidden_states + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJBlock with GPTJ->Moss +class MossBlock(nn.Module): + def __init__(self, config): + super().__init__() + inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd + self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon) + self.attn = MossAttention(config) + self.mlp = MossMLP(inner_dim, config) + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]: + residual = hidden_states + hidden_states = self.ln_1(hidden_states) + attn_outputs = self.attn( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + ) + attn_output = attn_outputs[0] # output_attn: a, present, (attentions) + outputs = attn_outputs[1:] + + feed_forward_hidden_states = self.mlp(hidden_states) + hidden_states = attn_output + feed_forward_hidden_states + residual + + if use_cache: + outputs = (hidden_states,) + outputs + else: + outputs = (hidden_states,) + outputs[1:] + + return outputs # hidden_states, present, (attentions) + + +class MossPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = MossConfig + base_model_prefix = "transformer" + supports_gradient_checkpointing = True + _no_split_modules = ["MossBlock"] + + def __init__(self, *inputs, **kwargs): + super().__init__(*inputs, **kwargs) + + def _init_weights(self, module): + """Initialize the weights.""" + if isinstance(module, (nn.Linear,)): + # Slightly different from Mesh Transformer JAX which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, MossModel): + module.gradient_checkpointing = value + + +MOSS_START_DOCSTRING = r""" + This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use + it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and + behavior. + + Parameters: + config ([`MossConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +MOSS_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoProcenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + position_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.n_positions - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.FloatTensor` of shape `(num_attention_heads,)` or `(n_layer, num_attention_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_dim)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert *input_ids* indices into associated vectors than the + model's internal embedding lookup matrix. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Moss Model transformer outputting raw hidden-states without any specific head on top.", + MOSS_START_DOCSTRING, +) +class MossModel(MossPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.embed_dim = config.n_embd + self.vocab_size = config.vocab_size + self.wte = nn.Embedding(config.vocab_size, self.embed_dim) + self.drop = nn.Dropout(config.embd_pdrop) + self.h = nn.ModuleList([MossBlock(config) for _ in range(config.n_layer)]) + self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon) + self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads) + + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.wte + + def set_input_embeddings(self, new_embeddings): + self.wte = new_embeddings + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=BaseModelOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + batch_size = input_ids.shape[0] + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + batch_size = inputs_embeds.shape[0] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if token_type_ids is not None: + token_type_ids = token_type_ids.view(-1, input_shape[-1]) + + if position_ids is not None: + position_ids = position_ids.view(-1, input_shape[-1]).long() + + if past_key_values is None: + past_length = 0 + past_key_values = tuple([None] * len(self.h)) + else: + past_length = past_key_values[0][0].size(-2) + + if position_ids is None: + position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1]) + + # Attention mask. + if attention_mask is not None: + if batch_size <= 0: + raise ValueError("batch_size has to be defined and > 0") + attention_mask = attention_mask.view(batch_size, -1) + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + attention_mask = attention_mask[:, None, None, :] + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and the dtype's smallest value for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility + attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x num_attention_heads x N x N + # head_mask has shape n_layer x batch x num_attention_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.n_layer) + + if inputs_embeds is None: + inputs_embeds = self.wte(input_ids) + + hidden_states = inputs_embeds + + if token_type_ids is not None: + token_type_embeds = self.wte(token_type_ids) + hidden_states = hidden_states + token_type_embeds + + hidden_states = self.drop(hidden_states) + + output_shape = input_shape + (hidden_states.size(-1),) + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting " + "`use_cache=False`..." + ) + use_cache = False + + presents = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, use_cache, output_attentions) + + return custom_forward + + outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(block), + hidden_states, + None, + attention_mask, + position_ids, + head_mask[i], + ) + else: + outputs = block( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + if use_cache is True: + presents = presents + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + + hidden_states = self.ln_f(hidden_states) + + hidden_states = hidden_states.view(output_shape) + # Add last hidden state + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=presents, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + +@add_start_docstrings( + """ + The Moss Model transformer with a language modeling head on top. + """, + MOSS_START_DOCSTRING, +) +class MossForCausalLM(MossPreTrainedModel): + _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.causal_mask"] + + def __init__(self, config): + super().__init__(config) + if not hasattr(config, 'wbits'): + config.wbits = 32 + config.groupsize = 128 + + if config.wbits not in [4, 8, 32]: + logger.warning(f'Specify `wbits` with 4, 8 or 32 to load the model. ') + if config.wbits in [4, 8]: + def noop(*args, **kwargs): + pass + torch.nn.init.kaiming_uniform_ = noop + torch.nn.init.uniform_ = noop + torch.nn.init.normal_ = noop + + torch.set_default_dtype(torch.half) + transformers.modeling_utils._init_weights = False + torch.set_default_dtype(torch.half) + self.transformer = MossModel(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size) + if config.wbits in [4, 8]: + torch.set_default_dtype(torch.float) + transformers.modeling_utils._init_weights = True + self.quantize(config.wbits, config.groupsize) + # Initialize weights and apply final processing + self.post_init() + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs): + token_type_ids = kwargs.get("token_type_ids", None) + # only last token for inputs_ids if past is defined in kwargs + if past_key_values: + input_ids = input_ids[:, -1].unsqueeze(-1) + if token_type_ids is not None: + token_type_ids = token_type_ids[:, -1].unsqueeze(-1) + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + if attention_mask is not None and position_ids is None: + # create position_ids on the fly for batch generation + position_ids = attention_mask.long().cumsum(-1) - 1 + position_ids.masked_fill_(attention_mask == 0, 1) + if past_key_values: + position_ids = position_ids[:, -1].unsqueeze(-1) + + return { + "input_ids": input_ids, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "position_ids": position_ids, + "attention_mask": attention_mask, + "token_type_ids": token_type_ids, + } + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=CausalLMOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set + `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100` + are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + transformer_outputs = self.transformer( + input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + hidden_states = transformer_outputs[0] + + # make sure sampling in fp16 works correctly and + # compute loss in fp32 to match with mesh-tf version + # https://github.com/EleutherAI/gpt-neo/blob/89ce74164da2fb16179106f54e2269b5da8db333/models/gpt2/gpt2.py#L179 + lm_logits = self.lm_head(hidden_states).to(torch.float32) + + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + + loss = loss.to(hidden_states.dtype) + + if not return_dict: + output = (lm_logits,) + transformer_outputs[1:] + return ((loss,) + output) if loss is not None else output + + return CausalLMOutputWithPast( + loss=loss, + logits=lm_logits, + past_key_values=transformer_outputs.past_key_values, + hidden_states=transformer_outputs.hidden_states, + attentions=transformer_outputs.attentions, + ) + + @staticmethod + def _reorder_cache( + past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor + ) -> Tuple[Tuple[torch.Tensor]]: + """ + This function is used to re-order the `past_key_values` cache if [`~PretrainedModel.beam_search`] or + [`~PretrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct + beam_idx at every generation step. + """ + return tuple( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past) + for layer_past in past_key_values + ) + + def quantize(self, wbits, groupsize): + from .quantization import quantize_with_gptq + return quantize_with_gptq(self, wbits, groupsize) + +""" PyTorch Moss model.""" + +from typing import Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import CrossEntropyLoss +import transformers +from transformers.activations import ACT2FN +from transformers.modeling_utils import PreTrainedModel +from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast +from transformers.utils import ( + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging +) + +from .configuration_moss import MossConfig + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "fnlp/moss-moon-003-base" +_CONFIG_FOR_DOC = "MossConfig" + + +MOSS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "fnlp/moss-moon-003-base", + "fnlp/moss-moon-003-sft", + "fnlp/moss-moon-003-sft-plugin", + "fnlp/moss-moon-003-sft-int4", + "fnlp/moss-moon-003-sft-plugin-int4", + "fnlp/moss-moon-003-sft-int8", + "fnlp/moss-moon-003-sft-plugin-int8", +] + + +# Copied from transformers.models.gptj.modeling_gptj.create_sinusoidal_positions +def create_sinusoidal_positions(num_pos: int, dim: int) -> torch.Tensor: + inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim)) + sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(num_pos, dtype=torch.float), inv_freq).float() + return torch.cat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1) + + +# Copied from transformers.models.gptj.modeling_gptj.rotate_every_two +def rotate_every_two(x: torch.Tensor) -> torch.Tensor: + x1 = x[:, :, :, ::2] + x2 = x[:, :, :, 1::2] + x = torch.stack((-x2, x1), dim=-1) + return x.flatten(-2) # in einsum notation: rearrange(x, '... d j -> ... (d j)') + + +# Copied from transformers.models.gptj.modeling_gptj.apply_rotary_pos_emb +def apply_rotary_pos_emb(tensor: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor) -> torch.Tensor: + sin = torch.repeat_interleave(sin[:, :, None, :], 2, 3) + cos = torch.repeat_interleave(cos[:, :, None, :], 2, 3) + return (tensor * cos) + (rotate_every_two(tensor) * sin) + + +class MossAttention(nn.Module): + def __init__(self, config): + super().__init__() + + max_positions = config.max_position_embeddings + self.register_buffer( + "causal_mask", + torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view( + 1, 1, max_positions, max_positions + ), + ) + + self.attn_dropout = nn.Dropout(config.attn_pdrop) + self.resid_dropout = nn.Dropout(config.resid_pdrop) + + self.embed_dim = config.hidden_size + self.num_attention_heads = config.num_attention_heads + self.head_dim = self.embed_dim // self.num_attention_heads + if self.head_dim * self.num_attention_heads != self.embed_dim: + raise ValueError( + f"embed_dim must be divisible by num_attention_heads (got `embed_dim`: {self.embed_dim} and" + f" `num_attention_heads`: {self.num_attention_heads})." + ) + self.scale_attn = torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()) + self.qkv_proj = nn.Linear(self.embed_dim, self.embed_dim * 3, bias=False) + + self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False) + self.rotary_dim = config.rotary_dim + pos_embd_dim = self.rotary_dim or self.embed_dim + self.embed_positions = create_sinusoidal_positions(max_positions, pos_embd_dim) + + def _split_heads(self, x, n_head, dim_head, mp_num): + reshaped = x.reshape(x.shape[:-1] + (n_head // mp_num, dim_head)) + reshaped = reshaped.reshape(x.shape[:-2] + (-1,) + reshaped.shape[-1:]) + return reshaped + + def _merge_heads(self, tensor, num_attention_heads, attn_head_size): + """ + Merges attn_head_size dim and num_attn_heads dim into n_ctx + """ + if len(tensor.shape) == 5: + tensor = tensor.permute(0, 1, 3, 2, 4).contiguous() + elif len(tensor.shape) == 4: + tensor = tensor.permute(0, 2, 1, 3).contiguous() + else: + raise ValueError(f"Input tensor rank should be one of [4, 5], but is: {len(tensor.shape)}") + new_shape = tensor.size()[:-2] + (num_attention_heads * attn_head_size,) + return tensor.view(new_shape) + + def _attn( + self, + query, + key, + value, + attention_mask=None, + head_mask=None, + ): + # compute causal mask from causal mask buffer + query_length, key_length = query.size(-2), key.size(-2) + causal_mask = self.causal_mask[:, :, key_length - query_length : key_length, :key_length] + + # Keep the attention weights computation in fp32 to avoid overflow issues + query = query.to(torch.float32) + key = key.to(torch.float32) + + attn_weights = torch.matmul(query, key.transpose(-1, -2)) + + attn_weights = attn_weights / self.scale_attn + mask_value = torch.finfo(attn_weights.dtype).min + # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`. + # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device` + mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device) + attn_weights = torch.where(causal_mask, attn_weights, mask_value) + + if attention_mask is not None: + # Apply the attention mask + attn_weights = attn_weights + attention_mask + + attn_weights = nn.Softmax(dim=-1)(attn_weights) + attn_weights = attn_weights.to(value.dtype) + attn_weights = self.attn_dropout(attn_weights) + + # Mask heads if we want to + if head_mask is not None: + attn_weights = attn_weights * head_mask + + attn_output = torch.matmul(attn_weights, value) + + return attn_output, attn_weights + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[ + Tuple[torch.Tensor, Tuple[torch.Tensor]], + Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]], + ]: + qkv = self.qkv_proj(hidden_states) + # TODO(enijkamp): factor out number of logical TPU-v4 cores or make forward pass agnostic + mp_num = 4 + qkv_split = qkv.reshape(qkv.shape[:-1] + (mp_num, -1)) + + local_dim = self.head_dim * self.num_attention_heads // mp_num + query, value, key = torch.split(qkv_split, local_dim, dim=-1) + query = self._split_heads(query, self.num_attention_heads, self.head_dim, mp_num=mp_num) + key = self._split_heads(key, self.num_attention_heads, self.head_dim, mp_num=mp_num) + + value = self._split_heads(value, self.num_attention_heads, self.head_dim, mp_num=mp_num) + value = value.permute(0, 2, 1, 3) + + embed_positions = self.embed_positions + if embed_positions.device != position_ids.device: + embed_positions = embed_positions.to(position_ids.device) + self.embed_positions = embed_positions + + sincos = embed_positions[position_ids] + sin, cos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1) + + if self.rotary_dim is not None: + k_rot = key[:, :, :, : self.rotary_dim] + k_pass = key[:, :, :, self.rotary_dim :] + + q_rot = query[:, :, :, : self.rotary_dim] + q_pass = query[:, :, :, self.rotary_dim :] + + k_rot = apply_rotary_pos_emb(k_rot, sin, cos) + q_rot = apply_rotary_pos_emb(q_rot, sin, cos) + + key = torch.cat([k_rot, k_pass], dim=-1) + query = torch.cat([q_rot, q_pass], dim=-1) + else: + key = apply_rotary_pos_emb(key, sin, cos) + query = apply_rotary_pos_emb(query, sin, cos) + + key = key.permute(0, 2, 1, 3) + query = query.permute(0, 2, 1, 3) + + if layer_past is not None: + past_key = layer_past[0] + past_value = layer_past[1] + key = torch.cat((past_key, key), dim=-2) + value = torch.cat((past_value, value), dim=-2) + + if use_cache is True: + present = (key, value) + else: + present = None + + # compute self-attention: V x Softmax(QK^T) + attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask) + + attn_output = self._merge_heads(attn_output, self.num_attention_heads, self.head_dim) + attn_output = self.out_proj(attn_output) + attn_output = self.resid_dropout(attn_output) + + outputs = (attn_output, present) + if output_attentions: + outputs += (attn_weights,) + + return outputs # a, present, (attentions) + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJMLP with GPTJ->Moss +class MossMLP(nn.Module): + def __init__(self, intermediate_size, config): # in MLP: intermediate_size= 4 * embed_dim + super().__init__() + embed_dim = config.n_embd + + self.fc_in = nn.Linear(embed_dim, intermediate_size) + self.fc_out = nn.Linear(intermediate_size, embed_dim) + + self.act = ACT2FN[config.activation_function] + self.dropout = nn.Dropout(config.resid_pdrop) + + def forward(self, hidden_states: Optional[torch.FloatTensor]) -> torch.FloatTensor: + hidden_states = self.fc_in(hidden_states) + hidden_states = self.act(hidden_states) + hidden_states = self.fc_out(hidden_states) + hidden_states = self.dropout(hidden_states) + return hidden_states + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJBlock with GPTJ->Moss +class MossBlock(nn.Module): + def __init__(self, config): + super().__init__() + inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd + self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon) + self.attn = MossAttention(config) + self.mlp = MossMLP(inner_dim, config) + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]: + residual = hidden_states + hidden_states = self.ln_1(hidden_states) + attn_outputs = self.attn( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + ) + attn_output = attn_outputs[0] # output_attn: a, present, (attentions) + outputs = attn_outputs[1:] + + feed_forward_hidden_states = self.mlp(hidden_states) + hidden_states = attn_output + feed_forward_hidden_states + residual + + if use_cache: + outputs = (hidden_states,) + outputs + else: + outputs = (hidden_states,) + outputs[1:] + + return outputs # hidden_states, present, (attentions) + + +class MossPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = MossConfig + base_model_prefix = "transformer" + supports_gradient_checkpointing = True + _no_split_modules = ["MossBlock"] + + def __init__(self, *inputs, **kwargs): + super().__init__(*inputs, **kwargs) + + def _init_weights(self, module): + """Initialize the weights.""" + if isinstance(module, (nn.Linear,)): + # Slightly different from Mesh Transformer JAX which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, MossModel): + module.gradient_checkpointing = value + + +MOSS_START_DOCSTRING = r""" + This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use + it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and + behavior. + + Parameters: + config ([`MossConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +MOSS_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoProcenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + position_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.n_positions - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.FloatTensor` of shape `(num_attention_heads,)` or `(n_layer, num_attention_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_dim)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert *input_ids* indices into associated vectors than the + model's internal embedding lookup matrix. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Moss Model transformer outputting raw hidden-states without any specific head on top.", + MOSS_START_DOCSTRING, +) +class MossModel(MossPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.embed_dim = config.n_embd + self.vocab_size = config.vocab_size + self.wte = nn.Embedding(config.vocab_size, self.embed_dim) + self.drop = nn.Dropout(config.embd_pdrop) + self.h = nn.ModuleList([MossBlock(config) for _ in range(config.n_layer)]) + self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon) + self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads) + + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.wte + + def set_input_embeddings(self, new_embeddings): + self.wte = new_embeddings + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=BaseModelOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + batch_size = input_ids.shape[0] + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + batch_size = inputs_embeds.shape[0] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if token_type_ids is not None: + token_type_ids = token_type_ids.view(-1, input_shape[-1]) + + if position_ids is not None: + position_ids = position_ids.view(-1, input_shape[-1]).long() + + if past_key_values is None: + past_length = 0 + past_key_values = tuple([None] * len(self.h)) + else: + past_length = past_key_values[0][0].size(-2) + + if position_ids is None: + position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1]) + + # Attention mask. + if attention_mask is not None: + if batch_size <= 0: + raise ValueError("batch_size has to be defined and > 0") + attention_mask = attention_mask.view(batch_size, -1) + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + attention_mask = attention_mask[:, None, None, :] + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and the dtype's smallest value for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility + attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x num_attention_heads x N x N + # head_mask has shape n_layer x batch x num_attention_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.n_layer) + + if inputs_embeds is None: + inputs_embeds = self.wte(input_ids) + + hidden_states = inputs_embeds + + if token_type_ids is not None: + token_type_embeds = self.wte(token_type_ids) + hidden_states = hidden_states + token_type_embeds + + hidden_states = self.drop(hidden_states) + + output_shape = input_shape + (hidden_states.size(-1),) + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting " + "`use_cache=False`..." + ) + use_cache = False + + presents = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, use_cache, output_attentions) + + return custom_forward + + outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(block), + hidden_states, + None, + attention_mask, + position_ids, + head_mask[i], + ) + else: + outputs = block( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + if use_cache is True: + presents = presents + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + + hidden_states = self.ln_f(hidden_states) + + hidden_states = hidden_states.view(output_shape) + # Add last hidden state + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=presents, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + +@add_start_docstrings( + """ + The Moss Model transformer with a language modeling head on top. + """, + MOSS_START_DOCSTRING, +) +class MossForCausalLM(MossPreTrainedModel): + _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.causal_mask"] + + def __init__(self, config): + super().__init__(config) + if not hasattr(config, 'wbits'): + config.wbits = 32 + config.groupsize = 128 + + if config.wbits not in [4, 8, 32]: + logger.warning(f'Specify `wbits` with 4, 8 or 32 to load the model. ') + if config.wbits in [4, 8]: + def noop(*args, **kwargs): + pass + torch.nn.init.kaiming_uniform_ = noop + torch.nn.init.uniform_ = noop + torch.nn.init.normal_ = noop + + torch.set_default_dtype(torch.half) + transformers.modeling_utils._init_weights = False + torch.set_default_dtype(torch.half) + self.transformer = MossModel(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size) + if config.wbits in [4, 8]: + torch.set_default_dtype(torch.float) + transformers.modeling_utils._init_weights = True + self.quantize(config.wbits, config.groupsize) + # Initialize weights and apply final processing + self.post_init() + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs): + token_type_ids = kwargs.get("token_type_ids", None) + # only last token for inputs_ids if past is defined in kwargs + if past_key_values: + input_ids = input_ids[:, -1].unsqueeze(-1) + if token_type_ids is not None: + token_type_ids = token_type_ids[:, -1].unsqueeze(-1) + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + if attention_mask is not None and position_ids is None: + # create position_ids on the fly for batch generation + position_ids = attention_mask.long().cumsum(-1) - 1 + position_ids.masked_fill_(attention_mask == 0, 1) + if past_key_values: + position_ids = position_ids[:, -1].unsqueeze(-1) + + return { + "input_ids": input_ids, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "position_ids": position_ids, + "attention_mask": attention_mask, + "token_type_ids": token_type_ids, + } + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=CausalLMOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set + `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100` + are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + transformer_outputs = self.transformer( + input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + hidden_states = transformer_outputs[0] + + # make sure sampling in fp16 works correctly and + # compute loss in fp32 to match with mesh-tf version + # https://github.com/EleutherAI/gpt-neo/blob/89ce74164da2fb16179106f54e2269b5da8db333/models/gpt2/gpt2.py#L179 + lm_logits = self.lm_head(hidden_states).to(torch.float32) + + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + + loss = loss.to(hidden_states.dtype) + + if not return_dict: + output = (lm_logits,) + transformer_outputs[1:] + return ((loss,) + output) if loss is not None else output + + return CausalLMOutputWithPast( + loss=loss, + logits=lm_logits, + past_key_values=transformer_outputs.past_key_values, + hidden_states=transformer_outputs.hidden_states, + attentions=transformer_outputs.attentions, + ) + + @staticmethod + def _reorder_cache( + past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor + ) -> Tuple[Tuple[torch.Tensor]]: + """ + This function is used to re-order the `past_key_values` cache if [`~PretrainedModel.beam_search`] or + [`~PretrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct + beam_idx at every generation step. + """ + return tuple( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past) + for layer_past in past_key_values + ) + + def quantize(self, wbits, groupsize): + from .quantization import quantize_with_gptq + return quantize_with_gptq(self, wbits, groupsize) + +""" PyTorch Moss model.""" + +from typing import Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import CrossEntropyLoss +import transformers +from transformers.activations import ACT2FN +from transformers.modeling_utils import PreTrainedModel +from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast +from transformers.utils import ( + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging +) + +from .configuration_moss import MossConfig + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "fnlp/moss-moon-003-base" +_CONFIG_FOR_DOC = "MossConfig" + + +MOSS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "fnlp/moss-moon-003-base", + "fnlp/moss-moon-003-sft", + "fnlp/moss-moon-003-sft-plugin", + "fnlp/moss-moon-003-sft-int4", + "fnlp/moss-moon-003-sft-plugin-int4", + "fnlp/moss-moon-003-sft-int8", + "fnlp/moss-moon-003-sft-plugin-int8", +] + + +# Copied from transformers.models.gptj.modeling_gptj.create_sinusoidal_positions +def create_sinusoidal_positions(num_pos: int, dim: int) -> torch.Tensor: + inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim)) + sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(num_pos, dtype=torch.float), inv_freq).float() + return torch.cat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1) + + +# Copied from transformers.models.gptj.modeling_gptj.rotate_every_two +def rotate_every_two(x: torch.Tensor) -> torch.Tensor: + x1 = x[:, :, :, ::2] + x2 = x[:, :, :, 1::2] + x = torch.stack((-x2, x1), dim=-1) + return x.flatten(-2) # in einsum notation: rearrange(x, '... d j -> ... (d j)') + + +# Copied from transformers.models.gptj.modeling_gptj.apply_rotary_pos_emb +def apply_rotary_pos_emb(tensor: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor) -> torch.Tensor: + sin = torch.repeat_interleave(sin[:, :, None, :], 2, 3) + cos = torch.repeat_interleave(cos[:, :, None, :], 2, 3) + return (tensor * cos) + (rotate_every_two(tensor) * sin) + + +class MossAttention(nn.Module): + def __init__(self, config): + super().__init__() + + max_positions = config.max_position_embeddings + self.register_buffer( + "causal_mask", + torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view( + 1, 1, max_positions, max_positions + ), + ) + + self.attn_dropout = nn.Dropout(config.attn_pdrop) + self.resid_dropout = nn.Dropout(config.resid_pdrop) + + self.embed_dim = config.hidden_size + self.num_attention_heads = config.num_attention_heads + self.head_dim = self.embed_dim // self.num_attention_heads + if self.head_dim * self.num_attention_heads != self.embed_dim: + raise ValueError( + f"embed_dim must be divisible by num_attention_heads (got `embed_dim`: {self.embed_dim} and" + f" `num_attention_heads`: {self.num_attention_heads})." + ) + self.scale_attn = torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()) + self.qkv_proj = nn.Linear(self.embed_dim, self.embed_dim * 3, bias=False) + + self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False) + self.rotary_dim = config.rotary_dim + pos_embd_dim = self.rotary_dim or self.embed_dim + self.embed_positions = create_sinusoidal_positions(max_positions, pos_embd_dim) + + def _split_heads(self, x, n_head, dim_head, mp_num): + reshaped = x.reshape(x.shape[:-1] + (n_head // mp_num, dim_head)) + reshaped = reshaped.reshape(x.shape[:-2] + (-1,) + reshaped.shape[-1:]) + return reshaped + + def _merge_heads(self, tensor, num_attention_heads, attn_head_size): + """ + Merges attn_head_size dim and num_attn_heads dim into n_ctx + """ + if len(tensor.shape) == 5: + tensor = tensor.permute(0, 1, 3, 2, 4).contiguous() + elif len(tensor.shape) == 4: + tensor = tensor.permute(0, 2, 1, 3).contiguous() + else: + raise ValueError(f"Input tensor rank should be one of [4, 5], but is: {len(tensor.shape)}") + new_shape = tensor.size()[:-2] + (num_attention_heads * attn_head_size,) + return tensor.view(new_shape) + + def _attn( + self, + query, + key, + value, + attention_mask=None, + head_mask=None, + ): + # compute causal mask from causal mask buffer + query_length, key_length = query.size(-2), key.size(-2) + causal_mask = self.causal_mask[:, :, key_length - query_length : key_length, :key_length] + + # Keep the attention weights computation in fp32 to avoid overflow issues + query = query.to(torch.float32) + key = key.to(torch.float32) + + attn_weights = torch.matmul(query, key.transpose(-1, -2)) + + attn_weights = attn_weights / self.scale_attn + mask_value = torch.finfo(attn_weights.dtype).min + # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`. + # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device` + mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device) + attn_weights = torch.where(causal_mask, attn_weights, mask_value) + + if attention_mask is not None: + # Apply the attention mask + attn_weights = attn_weights + attention_mask + + attn_weights = nn.Softmax(dim=-1)(attn_weights) + attn_weights = attn_weights.to(value.dtype) + attn_weights = self.attn_dropout(attn_weights) + + # Mask heads if we want to + if head_mask is not None: + attn_weights = attn_weights * head_mask + + attn_output = torch.matmul(attn_weights, value) + + return attn_output, attn_weights + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[ + Tuple[torch.Tensor, Tuple[torch.Tensor]], + Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]], + ]: + qkv = self.qkv_proj(hidden_states) + # TODO(enijkamp): factor out number of logical TPU-v4 cores or make forward pass agnostic + mp_num = 4 + qkv_split = qkv.reshape(qkv.shape[:-1] + (mp_num, -1)) + + local_dim = self.head_dim * self.num_attention_heads // mp_num + query, value, key = torch.split(qkv_split, local_dim, dim=-1) + query = self._split_heads(query, self.num_attention_heads, self.head_dim, mp_num=mp_num) + key = self._split_heads(key, self.num_attention_heads, self.head_dim, mp_num=mp_num) + + value = self._split_heads(value, self.num_attention_heads, self.head_dim, mp_num=mp_num) + value = value.permute(0, 2, 1, 3) + + embed_positions = self.embed_positions + if embed_positions.device != position_ids.device: + embed_positions = embed_positions.to(position_ids.device) + self.embed_positions = embed_positions + + sincos = embed_positions[position_ids] + sin, cos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1) + + if self.rotary_dim is not None: + k_rot = key[:, :, :, : self.rotary_dim] + k_pass = key[:, :, :, self.rotary_dim :] + + q_rot = query[:, :, :, : self.rotary_dim] + q_pass = query[:, :, :, self.rotary_dim :] + + k_rot = apply_rotary_pos_emb(k_rot, sin, cos) + q_rot = apply_rotary_pos_emb(q_rot, sin, cos) + + key = torch.cat([k_rot, k_pass], dim=-1) + query = torch.cat([q_rot, q_pass], dim=-1) + else: + key = apply_rotary_pos_emb(key, sin, cos) + query = apply_rotary_pos_emb(query, sin, cos) + + key = key.permute(0, 2, 1, 3) + query = query.permute(0, 2, 1, 3) + + if layer_past is not None: + past_key = layer_past[0] + past_value = layer_past[1] + key = torch.cat((past_key, key), dim=-2) + value = torch.cat((past_value, value), dim=-2) + + if use_cache is True: + present = (key, value) + else: + present = None + + # compute self-attention: V x Softmax(QK^T) + attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask) + + attn_output = self._merge_heads(attn_output, self.num_attention_heads, self.head_dim) + attn_output = self.out_proj(attn_output) + attn_output = self.resid_dropout(attn_output) + + outputs = (attn_output, present) + if output_attentions: + outputs += (attn_weights,) + + return outputs # a, present, (attentions) + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJMLP with GPTJ->Moss +class MossMLP(nn.Module): + def __init__(self, intermediate_size, config): # in MLP: intermediate_size= 4 * embed_dim + super().__init__() + embed_dim = config.n_embd + + self.fc_in = nn.Linear(embed_dim, intermediate_size) + self.fc_out = nn.Linear(intermediate_size, embed_dim) + + self.act = ACT2FN[config.activation_function] + self.dropout = nn.Dropout(config.resid_pdrop) + + def forward(self, hidden_states: Optional[torch.FloatTensor]) -> torch.FloatTensor: + hidden_states = self.fc_in(hidden_states) + hidden_states = self.act(hidden_states) + hidden_states = self.fc_out(hidden_states) + hidden_states = self.dropout(hidden_states) + return hidden_states + + +# Copied from transformers.models.gptj.modeling_gptj.GPTJBlock with GPTJ->Moss +class MossBlock(nn.Module): + def __init__(self, config): + super().__init__() + inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd + self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon) + self.attn = MossAttention(config) + self.mlp = MossMLP(inner_dim, config) + + def forward( + self, + hidden_states: Optional[torch.FloatTensor], + layer_past: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]: + residual = hidden_states + hidden_states = self.ln_1(hidden_states) + attn_outputs = self.attn( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + ) + attn_output = attn_outputs[0] # output_attn: a, present, (attentions) + outputs = attn_outputs[1:] + + feed_forward_hidden_states = self.mlp(hidden_states) + hidden_states = attn_output + feed_forward_hidden_states + residual + + if use_cache: + outputs = (hidden_states,) + outputs + else: + outputs = (hidden_states,) + outputs[1:] + + return outputs # hidden_states, present, (attentions) + + +class MossPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = MossConfig + base_model_prefix = "transformer" + supports_gradient_checkpointing = True + _no_split_modules = ["MossBlock"] + + def __init__(self, *inputs, **kwargs): + super().__init__(*inputs, **kwargs) + + def _init_weights(self, module): + """Initialize the weights.""" + if isinstance(module, (nn.Linear,)): + # Slightly different from Mesh Transformer JAX which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, MossModel): + module.gradient_checkpointing = value + + +MOSS_START_DOCSTRING = r""" + This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use + it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and + behavior. + + Parameters: + config ([`MossConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +MOSS_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoProcenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + position_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.n_positions - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.FloatTensor` of shape `(num_attention_heads,)` or `(n_layer, num_attention_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_dim)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert *input_ids* indices into associated vectors than the + model's internal embedding lookup matrix. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Moss Model transformer outputting raw hidden-states without any specific head on top.", + MOSS_START_DOCSTRING, +) +class MossModel(MossPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.embed_dim = config.n_embd + self.vocab_size = config.vocab_size + self.wte = nn.Embedding(config.vocab_size, self.embed_dim) + self.drop = nn.Dropout(config.embd_pdrop) + self.h = nn.ModuleList([MossBlock(config) for _ in range(config.n_layer)]) + self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon) + self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads) + + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.wte + + def set_input_embeddings(self, new_embeddings): + self.wte = new_embeddings + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=BaseModelOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + batch_size = input_ids.shape[0] + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + batch_size = inputs_embeds.shape[0] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if token_type_ids is not None: + token_type_ids = token_type_ids.view(-1, input_shape[-1]) + + if position_ids is not None: + position_ids = position_ids.view(-1, input_shape[-1]).long() + + if past_key_values is None: + past_length = 0 + past_key_values = tuple([None] * len(self.h)) + else: + past_length = past_key_values[0][0].size(-2) + + if position_ids is None: + position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1]) + + # Attention mask. + if attention_mask is not None: + if batch_size <= 0: + raise ValueError("batch_size has to be defined and > 0") + attention_mask = attention_mask.view(batch_size, -1) + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + attention_mask = attention_mask[:, None, None, :] + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and the dtype's smallest value for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility + attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x num_attention_heads x N x N + # head_mask has shape n_layer x batch x num_attention_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.n_layer) + + if inputs_embeds is None: + inputs_embeds = self.wte(input_ids) + + hidden_states = inputs_embeds + + if token_type_ids is not None: + token_type_embeds = self.wte(token_type_ids) + hidden_states = hidden_states + token_type_embeds + + hidden_states = self.drop(hidden_states) + + output_shape = input_shape + (hidden_states.size(-1),) + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting " + "`use_cache=False`..." + ) + use_cache = False + + presents = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, use_cache, output_attentions) + + return custom_forward + + outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(block), + hidden_states, + None, + attention_mask, + position_ids, + head_mask[i], + ) + else: + outputs = block( + hidden_states=hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + if use_cache is True: + presents = presents + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + + hidden_states = self.ln_f(hidden_states) + + hidden_states = hidden_states.view(output_shape) + # Add last hidden state + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=presents, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + +@add_start_docstrings( + """ + The Moss Model transformer with a language modeling head on top. + """, + MOSS_START_DOCSTRING, +) +class MossForCausalLM(MossPreTrainedModel): + _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.causal_mask"] + + def __init__(self, config): + super().__init__(config) + if not hasattr(config, 'wbits'): + config.wbits = 32 + config.groupsize = 128 + + if config.wbits not in [4, 8, 32]: + logger.warning(f'Specify `wbits` with 4, 8 or 32 to load the model. ') + if config.wbits in [4, 8]: + def noop(*args, **kwargs): + pass + torch.nn.init.kaiming_uniform_ = noop + torch.nn.init.uniform_ = noop + torch.nn.init.normal_ = noop + + torch.set_default_dtype(torch.half) + transformers.modeling_utils._init_weights = False + torch.set_default_dtype(torch.half) + self.transformer = MossModel(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size) + if config.wbits in [4, 8]: + torch.set_default_dtype(torch.float) + transformers.modeling_utils._init_weights = True + self.quantize(config.wbits, config.groupsize) + # Initialize weights and apply final processing + self.post_init() + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs): + token_type_ids = kwargs.get("token_type_ids", None) + # only last token for inputs_ids if past is defined in kwargs + if past_key_values: + input_ids = input_ids[:, -1].unsqueeze(-1) + if token_type_ids is not None: + token_type_ids = token_type_ids[:, -1].unsqueeze(-1) + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + if attention_mask is not None and position_ids is None: + # create position_ids on the fly for batch generation + position_ids = attention_mask.long().cumsum(-1) - 1 + position_ids.masked_fill_(attention_mask == 0, 1) + if past_key_values: + position_ids = position_ids[:, -1].unsqueeze(-1) + + return { + "input_ids": input_ids, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "position_ids": position_ids, + "attention_mask": attention_mask, + "token_type_ids": token_type_ids, + } + + @add_start_docstrings_to_model_forward(MOSS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=CausalLMOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set + `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100` + are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + transformer_outputs = self.transformer( + input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + hidden_states = transformer_outputs[0] + + # make sure sampling in fp16 works correctly and + # compute loss in fp32 to match with mesh-tf version + # https://github.com/EleutherAI/gpt-neo/blob/89ce74164da2fb16179106f54e2269b5da8db333/models/gpt2/gpt2.py#L179 + lm_logits = self.lm_head(hidden_states).to(torch.float32) + + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + + loss = loss.to(hidden_states.dtype) + + if not return_dict: + output = (lm_logits,) + transformer_outputs[1:] + return ((loss,) + output) if loss is not None else output + + return CausalLMOutputWithPast( + loss=loss, + logits=lm_logits, + past_key_values=transformer_outputs.past_key_values, + hidden_states=transformer_outputs.hidden_states, + attentions=transformer_outputs.attentions, + ) + + @staticmethod + def _reorder_cache( + past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor + ) -> Tuple[Tuple[torch.Tensor]]: + """ + This function is used to re-order the `past_key_values` cache if [`~PretrainedModel.beam_search`] or + [`~PretrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct + beam_idx at every generation step. + """ + return tuple( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past) + for layer_past in past_key_values + ) + + def quantize(self, wbits, groupsize): + from .quantization import quantize_with_gptq + return quantize_with_gptq(self, wbits, groupsize) + diff --git a/request_llms/README.md b/request_llms/README.md index 92b856e30b12c0cd92daea97b7eebdae91c1637c..288bc1352e0e85ce56f569f3a5e6685d279e1565 100644 --- a/request_llms/README.md +++ b/request_llms/README.md @@ -1,79 +1,35 @@ -# 如何使用其他大语言模型 +P.S. 如果您按照以下步骤成功接入了新的大模型,欢迎发Pull Requests(如果您在自己接入新模型的过程中遇到困难,欢迎加README底部QQ群联系群主) -## ChatGLM -- 安装依赖 `pip install -r request_llms/requirements_chatglm.txt` -- 修改配置,在config.py中将LLM_MODEL的值改为"chatglm" +# 如何接入其他本地大语言模型 -``` sh -LLM_MODEL = "chatglm" -``` -- 运行! -``` sh -`python main.py` -``` +1. 复制`request_llms/bridge_llama2.py`,重命名为你喜欢的名字 -## Claude-Stack +2. 修改`load_model_and_tokenizer`方法,加载你的模型和分词器(去该模型官网找demo,复制粘贴即可) -- 请参考此教程获取 https://zhuanlan.zhihu.com/p/627485689 - - 1、SLACK_CLAUDE_BOT_ID - - 2、SLACK_CLAUDE_USER_TOKEN +3. 修改`llm_stream_generator`方法,定义推理模型(去该模型官网找demo,复制粘贴即可) -- 把token加入config.py +4. 命令行测试 + - 修改`tests/test_llms.py`(聪慧如您,只需要看一眼该文件就明白怎么修改了) + - 运行`python tests/test_llms.py` -## Newbing +5. 测试通过后,在`request_llms/bridge_all.py`中做最后的修改,把你的模型完全接入到框架中(聪慧如您,只需要看一眼该文件就明白怎么修改了) -- 使用cookie editor获取cookie(json) -- 把cookie(json)加入config.py (NEWBING_COOKIES) +6. 修改`LLM_MODEL`配置,然后运行`python main.py`,测试最后的效果 -## Moss -- 使用docker-compose -## RWKV -- 使用docker-compose +# 如何接入其他在线大语言模型 -## LLAMA -- 使用docker-compose +1. 复制`request_llms/bridge_zhipu.py`,重命名为你喜欢的名字 -## 盘古 -- 使用docker-compose +2. 修改`predict_no_ui_long_connection` +3. 修改`predict` ---- -## Text-Generation-UI (TGUI,调试中,暂不可用) +4. 命令行测试 + - 修改`tests/test_llms.py`(聪慧如您,只需要看一眼该文件就明白怎么修改了) + - 运行`python tests/test_llms.py` -### 1. 部署TGUI -``` sh -# 1 下载模型 -git clone https://github.com/oobabooga/text-generation-webui.git -# 2 这个仓库的最新代码有问题,回滚到几周之前 -git reset --hard fcda3f87767e642d1c0411776e549e1d3894843d -# 3 切换路径 -cd text-generation-webui -# 4 安装text-generation的额外依赖 -pip install accelerate bitsandbytes flexgen gradio llamacpp markdown numpy peft requests rwkv safetensors sentencepiece tqdm datasets git+https://github.com/huggingface/transformers -# 5 下载模型 -python download-model.py facebook/galactica-1.3b -# 其他可选如 facebook/opt-1.3b -# facebook/galactica-1.3b -# facebook/galactica-6.7b -# facebook/galactica-120b -# facebook/pygmalion-1.3b 等 -# 详情见 https://github.com/oobabooga/text-generation-webui +5. 测试通过后,在`request_llms/bridge_all.py`中做最后的修改,把你的模型完全接入到框架中(聪慧如您,只需要看一眼该文件就明白怎么修改了) -# 6 启动text-generation -python server.py --cpu --listen --listen-port 7865 --model facebook_galactica-1.3b -``` - -### 2. 修改config.py - -``` sh -# LLM_MODEL格式: tgui:[模型]@[ws地址]:[ws端口] , 端口要和上面给定的端口一致 -LLM_MODEL = "tgui:galactica-1.3b@localhost:7860" -``` - -### 3. 运行! -``` sh -cd chatgpt-academic -python main.py -``` +6. 修改`LLM_MODEL`配置,然后运行`python main.py`,测试最后的效果 \ No newline at end of file diff --git a/request_llms/bridge_all.py b/request_llms/bridge_all.py index 88848a984a8312fe929e3992805f3ba97e24a689..8dece5487e822e260adba707a772b862cd1238af 100644 --- a/request_llms/bridge_all.py +++ b/request_llms/bridge_all.py @@ -543,6 +543,22 @@ if "zhipuai" in AVAIL_LLM_MODELS: # zhipuai }) except: print(trimmed_format_exc()) +if "deepseekcoder" in AVAIL_LLM_MODELS: # deepseekcoder + try: + from .bridge_deepseekcoder import predict_no_ui_long_connection as deepseekcoder_noui + from .bridge_deepseekcoder import predict as deepseekcoder_ui + model_info.update({ + "deepseekcoder": { + "fn_with_ui": deepseekcoder_ui, + "fn_without_ui": deepseekcoder_noui, + "endpoint": None, + "max_token": 4096, + "tokenizer": tokenizer_gpt35, + "token_cnt": get_token_num_gpt35, + } + }) + except: + print(trimmed_format_exc()) # <-- 用于定义和切换多个azure模型 --> AZURE_CFG_ARRAY = get_conf("AZURE_CFG_ARRAY") diff --git a/request_llms/bridge_deepseekcoder.py b/request_llms/bridge_deepseekcoder.py new file mode 100644 index 0000000000000000000000000000000000000000..2242eec75cad9cdce0804dcae1305999bd71b218 --- /dev/null +++ b/request_llms/bridge_deepseekcoder.py @@ -0,0 +1,88 @@ +model_name = "deepseek-coder-6.7b-instruct" +cmd_to_install = "未知" # "`pip install -r request_llms/requirements_qwen.txt`" + +import os +from toolbox import ProxyNetworkActivate +from toolbox import get_conf +from .local_llm_class import LocalLLMHandle, get_local_llm_predict_fns +from threading import Thread + +def download_huggingface_model(model_name, max_retry, local_dir): + from huggingface_hub import snapshot_download + for i in range(1, max_retry): + try: + snapshot_download(repo_id=model_name, local_dir=local_dir, resume_download=True) + break + except Exception as e: + print(f'\n\n下载失败,重试第{i}次中...\n\n') + return local_dir +# ------------------------------------------------------------------------------------------------------------------------ +# 🔌💻 Local Model +# ------------------------------------------------------------------------------------------------------------------------ +class GetCoderLMHandle(LocalLLMHandle): + + def load_model_info(self): + # 🏃‍♂️🏃‍♂️🏃‍♂️ 子进程执行 + self.model_name = model_name + self.cmd_to_install = cmd_to_install + + def load_model_and_tokenizer(self): + # 🏃‍♂️🏃‍♂️🏃‍♂️ 子进程执行 + with ProxyNetworkActivate('Download_LLM'): + from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer + model_name = "deepseek-ai/deepseek-coder-6.7b-instruct" + # local_dir = f"~/.cache/{model_name}" + # if not os.path.exists(local_dir): + # tokenizer = download_huggingface_model(model_name, max_retry=128, local_dir=local_dir) + tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) + self._streamer = TextIteratorStreamer(tokenizer) + model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True) + if get_conf('LOCAL_MODEL_DEVICE') != 'cpu': + model = model.cuda() + return model, tokenizer + + def llm_stream_generator(self, **kwargs): + # 🏃‍♂️🏃‍♂️🏃‍♂️ 子进程执行 + def adaptor(kwargs): + query = kwargs['query'] + max_length = kwargs['max_length'] + top_p = kwargs['top_p'] + temperature = kwargs['temperature'] + history = kwargs['history'] + return query, max_length, top_p, temperature, history + + query, max_length, top_p, temperature, history = adaptor(kwargs) + history.append({ 'role': 'user', 'content': query}) + messages = history + inputs = self._tokenizer.apply_chat_template(messages, return_tensors="pt").to(self._model.device) + generation_kwargs = dict( + inputs=inputs, + max_new_tokens=max_length, + do_sample=False, + top_p=top_p, + streamer = self._streamer, + top_k=50, + temperature=temperature, + num_return_sequences=1, + eos_token_id=32021, + ) + thread = Thread(target=self._model.generate, kwargs=generation_kwargs, daemon=True) + thread.start() + generated_text = "" + for new_text in self._streamer: + generated_text += new_text + # print(generated_text) + yield generated_text + + + def try_to_import_special_deps(self, **kwargs): pass + # import something that will raise error if the user does not install requirement_*.txt + # 🏃‍♂️🏃‍♂️🏃‍♂️ 主进程执行 + # import importlib + # importlib.import_module('modelscope') + + +# ------------------------------------------------------------------------------------------------------------------------ +# 🔌💻 GPT-Academic Interface +# ------------------------------------------------------------------------------------------------------------------------ +predict_no_ui_long_connection, predict = get_local_llm_predict_fns(GetCoderLMHandle, model_name, history_format='chatglm3') \ No newline at end of file diff --git a/request_llms/bridge_llama2.py b/request_llms/bridge_llama2.py index e6da4b755a1c9aaa648afd5d4f17043c9ed16e96..bfa3c14ae63cb1eafe85b6d81810b0f70303a900 100644 --- a/request_llms/bridge_llama2.py +++ b/request_llms/bridge_llama2.py @@ -12,7 +12,7 @@ from threading import Thread # ------------------------------------------------------------------------------------------------------------------------ # 🔌💻 Local Model # ------------------------------------------------------------------------------------------------------------------------ -class GetONNXGLMHandle(LocalLLMHandle): +class GetLlamaHandle(LocalLLMHandle): def load_model_info(self): # 🏃‍♂️🏃‍♂️🏃‍♂️ 子进程执行 @@ -87,4 +87,4 @@ class GetONNXGLMHandle(LocalLLMHandle): # ------------------------------------------------------------------------------------------------------------------------ # 🔌💻 GPT-Academic Interface # ------------------------------------------------------------------------------------------------------------------------ -predict_no_ui_long_connection, predict = get_local_llm_predict_fns(GetONNXGLMHandle, model_name) \ No newline at end of file +predict_no_ui_long_connection, predict = get_local_llm_predict_fns(GetLlamaHandle, model_name) \ No newline at end of file diff --git a/request_llms/bridge_qwen.py b/request_llms/bridge_qwen.py index afd886bf9df72b106d53a845d3bed9079bffe696..85a4d80cbbcd6824020cfd9cdab21168bfc79bb7 100644 --- a/request_llms/bridge_qwen.py +++ b/request_llms/bridge_qwen.py @@ -15,7 +15,7 @@ from .local_llm_class import LocalLLMHandle, get_local_llm_predict_fns # ------------------------------------------------------------------------------------------------------------------------ # 🔌💻 Local Model # ------------------------------------------------------------------------------------------------------------------------ -class GetONNXGLMHandle(LocalLLMHandle): +class GetQwenLMHandle(LocalLLMHandle): def load_model_info(self): # 🏃‍♂️🏃‍♂️🏃‍♂️ 子进程执行 @@ -64,4 +64,4 @@ class GetONNXGLMHandle(LocalLLMHandle): # ------------------------------------------------------------------------------------------------------------------------ # 🔌💻 GPT-Academic Interface # ------------------------------------------------------------------------------------------------------------------------ -predict_no_ui_long_connection, predict = get_local_llm_predict_fns(GetONNXGLMHandle, model_name) \ No newline at end of file +predict_no_ui_long_connection, predict = get_local_llm_predict_fns(GetQwenLMHandle, model_name) \ No newline at end of file diff --git a/request_llms/bridge_zhipu.py b/request_llms/bridge_zhipu.py index a1e0de5918a00498a127ab6fbc5eb987fed66d93..915a13efbaf6411ea7c2a66500f09f57237b87d0 100644 --- a/request_llms/bridge_zhipu.py +++ b/request_llms/bridge_zhipu.py @@ -1,6 +1,7 @@ import time from toolbox import update_ui, get_conf, update_ui_lastest_msg +from toolbox import check_packages, report_exception model_name = '智谱AI大模型' @@ -37,6 +38,14 @@ def predict(inputs, llm_kwargs, plugin_kwargs, chatbot, history=[], system_promp chatbot.append((inputs, "")) yield from update_ui(chatbot=chatbot, history=history) + # 尝试导入依赖,如果缺少依赖,则给出安装建议 + try: + check_packages(["zhipuai"]) + except: + yield from update_ui_lastest_msg(f"导入软件依赖失败。使用该模型需要额外依赖,安装方法```pip install --upgrade zhipuai```。", + chatbot=chatbot, history=history, delay=0) + return + if validate_key() is False: yield from update_ui_lastest_msg(lastmsg="[Local Message] 请配置ZHIPUAI_API_KEY", chatbot=chatbot, history=history, delay=0) return diff --git a/request_llms/local_llm_class.py b/request_llms/local_llm_class.py index 091707a768f6355cc98854b07307c65582a7cf3b..413df03f222f63f21a72b59cc345d0cc1b32ca4c 100644 --- a/request_llms/local_llm_class.py +++ b/request_llms/local_llm_class.py @@ -198,7 +198,7 @@ class LocalLLMHandle(Process): if res.startswith(self.std_tag): new_output = res[len(self.std_tag):] std_out = std_out[:std_out_clip_len] - # print(new_output, end='') + print(new_output, end='') std_out = new_output + std_out yield self.std_tag + '\n```\n' + std_out + '\n```\n' elif res == '[Finish]': diff --git a/tests/test_llms.py b/tests/test_llms.py index 6285f0309f2bb1a8ae69687154c29b4bd7e1309e..8b685972e8d10a8f55a1f311e74ed50474f94ab5 100644 --- a/tests/test_llms.py +++ b/tests/test_llms.py @@ -15,7 +15,8 @@ if __name__ == "__main__": # from request_llms.bridge_jittorllms_pangualpha import predict_no_ui_long_connection # from request_llms.bridge_jittorllms_llama import predict_no_ui_long_connection # from request_llms.bridge_claude import predict_no_ui_long_connection - from request_llms.bridge_internlm import predict_no_ui_long_connection + # from request_llms.bridge_internlm import predict_no_ui_long_connection + from request_llms.bridge_deepseekcoder import predict_no_ui_long_connection # from request_llms.bridge_qwen import predict_no_ui_long_connection # from request_llms.bridge_spark import predict_no_ui_long_connection # from request_llms.bridge_zhipu import predict_no_ui_long_connection diff --git a/themes/contrast.py b/themes/contrast.py index d407d92ab5e32835b30fa601995ec7116af19718..bf48808d931fc28ef7cf788b17dbca3b5e9fb67d 100644 --- a/themes/contrast.py +++ b/themes/contrast.py @@ -1,6 +1,8 @@ +import os import gradio as gr from toolbox import get_conf CODE_HIGHLIGHT, ADD_WAIFU, LAYOUT = get_conf('CODE_HIGHLIGHT', 'ADD_WAIFU', 'LAYOUT') +theme_dir = os.path.dirname(__file__) def adjust_theme(): @@ -57,7 +59,7 @@ def adjust_theme(): button_cancel_text_color_dark="white", ) - with open('themes/common.js', 'r', encoding='utf8') as f: + with open(os.path.join(theme_dir, 'common.js'), 'r', encoding='utf8') as f: js = f"" # 添加一个萌萌的看板娘 @@ -67,7 +69,9 @@ def adjust_theme(): """ - gradio_original_template_fn = gr.routes.templates.TemplateResponse + if not hasattr(gr, 'RawTemplateResponse'): + gr.RawTemplateResponse = gr.routes.templates.TemplateResponse + gradio_original_template_fn = gr.RawTemplateResponse def gradio_new_template_fn(*args, **kwargs): res = gradio_original_template_fn(*args, **kwargs) res.body = res.body.replace(b'', f'{js}'.encode("utf8")) @@ -79,7 +83,7 @@ def adjust_theme(): print('gradio版本较旧, 不能自定义字体和颜色') return set_theme -with open("themes/contrast.css", "r", encoding="utf-8") as f: +with open(os.path.join(theme_dir, 'contrast.css'), "r", encoding="utf-8") as f: advanced_css = f.read() -with open("themes/common.css", "r", encoding="utf-8") as f: +with open(os.path.join(theme_dir, 'common.css'), "r", encoding="utf-8") as f: advanced_css += f.read() diff --git a/themes/default.py b/themes/default.py index 6fa2ba5c53231abd149b1ac9c42633e06c7b486b..64b6720943522a4a41192cfa4fa506ce39bb14fa 100644 --- a/themes/default.py +++ b/themes/default.py @@ -60,7 +60,7 @@ def adjust_theme(): with open(os.path.join(theme_dir, 'common.js'), 'r', encoding='utf8') as f: js = f"" - + # 添加一个萌萌的看板娘 if ADD_WAIFU: js += """ @@ -68,7 +68,9 @@ def adjust_theme(): """ - gradio_original_template_fn = gr.routes.templates.TemplateResponse + if not hasattr(gr, 'RawTemplateResponse'): + gr.RawTemplateResponse = gr.routes.templates.TemplateResponse + gradio_original_template_fn = gr.RawTemplateResponse def gradio_new_template_fn(*args, **kwargs): res = gradio_original_template_fn(*args, **kwargs) res.body = res.body.replace(b'', f'{js}'.encode("utf8")) diff --git a/themes/gradios.py b/themes/gradios.py index 96a9c54e60754f3c33260d5d3b4f371e10409335..9db134159c3a596bf9057b83e27cacefdd053426 100644 --- a/themes/gradios.py +++ b/themes/gradios.py @@ -1,7 +1,9 @@ -import gradio as gr import logging +import os +import gradio as gr from toolbox import get_conf, ProxyNetworkActivate CODE_HIGHLIGHT, ADD_WAIFU, LAYOUT = get_conf('CODE_HIGHLIGHT', 'ADD_WAIFU', 'LAYOUT') +theme_dir = os.path.dirname(__file__) def dynamic_set_theme(THEME): set_theme = gr.themes.ThemeClass() @@ -13,7 +15,6 @@ def dynamic_set_theme(THEME): return set_theme def adjust_theme(): - try: set_theme = gr.themes.ThemeClass() with ProxyNetworkActivate('Download_Gradio_Theme'): @@ -23,7 +24,7 @@ def adjust_theme(): if THEME.startswith('huggingface-'): THEME = THEME.lstrip('huggingface-') set_theme = set_theme.from_hub(THEME.lower()) - with open('themes/common.js', 'r', encoding='utf8') as f: + with open(os.path.join(theme_dir, 'common.js'), 'r', encoding='utf8') as f: js = f"" # 添加一个萌萌的看板娘 @@ -33,7 +34,9 @@ def adjust_theme(): """ - gradio_original_template_fn = gr.routes.templates.TemplateResponse + if not hasattr(gr, 'RawTemplateResponse'): + gr.RawTemplateResponse = gr.routes.templates.TemplateResponse + gradio_original_template_fn = gr.RawTemplateResponse def gradio_new_template_fn(*args, **kwargs): res = gradio_original_template_fn(*args, **kwargs) res.body = res.body.replace(b'', f'{js}'.encode("utf8")) @@ -46,7 +49,5 @@ def adjust_theme(): logging.error('gradio版本较旧, 不能自定义字体和颜色:', trimmed_format_exc()) return set_theme -# with open("themes/default.css", "r", encoding="utf-8") as f: -# advanced_css = f.read() -with open("themes/common.css", "r", encoding="utf-8") as f: +with open(os.path.join(theme_dir, 'common.css'), "r", encoding="utf-8") as f: advanced_css = f.read() diff --git a/themes/green.py b/themes/green.py index a29a0fa1006c560643349e483e53694f19670659..326981e3d835fd52b07ac4b587ecf3725dd3c06f 100644 --- a/themes/green.py +++ b/themes/green.py @@ -1,6 +1,8 @@ +import os import gradio as gr from toolbox import get_conf CODE_HIGHLIGHT, ADD_WAIFU, LAYOUT = get_conf('CODE_HIGHLIGHT', 'ADD_WAIFU', 'LAYOUT') +theme_dir = os.path.dirname(__file__) def adjust_theme(): try: @@ -73,7 +75,7 @@ def adjust_theme(): chatbot_code_background_color_dark="*neutral_950", ) - with open('themes/common.js', 'r', encoding='utf8') as f: + with open(os.path.join(theme_dir, 'common.js'), 'r', encoding='utf8') as f: js = f"" # 添加一个萌萌的看板娘 @@ -83,11 +85,13 @@ def adjust_theme(): """ - - with open('themes/green.js', 'r', encoding='utf8') as f: + + with open(os.path.join(theme_dir, 'green.js'), 'r', encoding='utf8') as f: js += f"" - - gradio_original_template_fn = gr.routes.templates.TemplateResponse + + if not hasattr(gr, 'RawTemplateResponse'): + gr.RawTemplateResponse = gr.routes.templates.TemplateResponse + gradio_original_template_fn = gr.RawTemplateResponse def gradio_new_template_fn(*args, **kwargs): res = gradio_original_template_fn(*args, **kwargs) res.body = res.body.replace(b'', f'{js}'.encode("utf8")) @@ -99,8 +103,7 @@ def adjust_theme(): print('gradio版本较旧, 不能自定义字体和颜色') return set_theme - -with open("themes/green.css", "r", encoding="utf-8") as f: +with open(os.path.join(theme_dir, 'green.css'), "r", encoding="utf-8") as f: advanced_css = f.read() -with open("themes/common.css", "r", encoding="utf-8") as f: +with open(os.path.join(theme_dir, 'common.css'), "r", encoding="utf-8") as f: advanced_css += f.read() diff --git a/version b/version index cf14e82bf2c807aeb79d2a7d0eb1e2939c712485..5f6de09c8a3ad983116e5fe0cb657686d13d7292 100644 --- a/version +++ b/version @@ -1,5 +1,5 @@ { - "version": 3.60, + "version": 3.61, "show_feature": true, - "new_feature": "修复多个BUG <-> AutoGen多智能体插件测试版 <-> 修复本地模型在Windows下的加载BUG <-> 支持文心一言v4和星火v3 <-> 支持GLM3和智谱的API <-> 解决本地模型并发BUG <-> 支持动态追加基础功能按钮 <-> 新汇报PDF汇总页面 <-> 重新编译Gradio优化使用体验" + "new_feature": "修复潜在的多用户冲突问题 <-> 接入Deepseek Coder <-> AutoGen多智能体插件测试版 <-> 修复本地模型在Windows下的加载BUG <-> 支持文心一言v4和星火v3 <-> 支持GLM3和智谱的API <-> 解决本地模型并发BUG <-> 支持动态追加基础功能按钮" }