PROFILBARU.COM

人工智能对齐（英語：AI alignment）是指引导人工智能系统的行为，使其符合设计者的利益和预期目标。^[a]一个已对齐的人工智能的行为会向着预期方向发展；而未对齐的人工智能的行为虽然也具备特定目标，但此目标并非设计者所预期。^[b]

人工智能系统的对齐十分难以实现，一个未对齐的系统可能会在某时刻发生故障，或是产生有害后果。对人工智能的设计者而言，从设计之初就考虑到未来可能发生的所有情况是不现实的。因此，比较常见的办法是简单的指定某个特定目标。然而，人工智能系统可能会从中找到某些漏洞，从而选择可能会造成危害的方法（例如奖励作弊（英语：Misaligned_goals_in_artificial_intelligence#Specification_gaming））来更有效率的达成预期目标。^[2]^[4]^[5]^[6]人工智能也可能发展出预期之外的工具行为，例如它们可能会倾向于摄取尽可能多的控制权，以增加达成目标的可能性。^[2]^[7]^[5]^[4]此外，在人工系统运行过程中，面对新的事态和数据分布，它也可能会发展出全新的、在其部署前无法预料到的目标。^[5]^[3]在目前部署的商业系统，例如机器人^[8]、语言模型^[9]^[10]^[11]、自动驾驶汽车^[12]、社交媒体推荐引擎^[9]^[4]^[13]中，上述问题已有显现。鉴于这些问题部分源于系统所具备的高性能，因此未来功能更强大的人工智能系统可能更容易受到这些问题的影响。^[6]^[5]^[2]对于上述问题，人工智能研究学界和联合国呼吁加强相关的技术研究和政策制定，以保证人工智能系统符合人类价值。^[c]

人工智能安全（英语：AI safety）是致力于建立安全的人工智能系统的研究，人工智能对齐是是其子领域之一。^[5]^[16]其它从属于人工智能安全的子领域还包括稳健性、运行监控和能力控制（英语：AI capability control）。^[5]^[17]人工智能对齐的主要研究内容包括向人工智能灌输复杂的价值观念、发展诚实的人工智能、监管方式的扩展、对人工智能模型的审核与阐释，以及对人工智能系统有害倾向的防范，例如防止其发展出对控制权的渴求。^[5]^[17]与人工智能对齐相关的研究包括人工智能的可解释性^[18]、稳健性（英语：Robust optimization）^[5]^[16]、异常检测、不确定性量化（英语：Uncertainty quantification）^[18]、形式验证^[19]、偏好学习（英语：Preference learning）^[20]^[21]^[22]、安全攸关系统工程^[5]^[23]、博弈论^[24]^[25]、公平性（英语：Fairness (machine learning)）^[16]^[26]，以及相关的社会科学研究。^[27]

对齐问题

1961年，人工智能研究者诺伯特·维纳定义对齐问题为：“假如我们期望借助机器达成某个目标，而它的运行过程是我们无法有效干涉的……那么我们最好确认，这个输入到机器里的目标确实是我们希望达成的目的。”^[28]^[4]最近，对齐已成为现代人工智能系统的一个开放性问题^[29]^[30]^[31]^[32]，也是人工智能相关的研究领域之一。^[2]^[5]^[33]^[34]

规则博弈

为明确人工智能系统的目标，设计者通常会设定一个目标函数、示例或反馈系统。然而，人工智能设计者很难一次性找出所有的重要数值与约束。^[2]^[16]^[5]^[35]^[17]因此，人工智能系统可能会在运行过程中找到并利用某些漏洞，以意料之外的，甚至可能有害的方式达成目标。这种倾向被称为规则博弈、奖励作弊或古德哈特定律。^[6]^[35]^[36]

在许多人工智能系统中都观察到了规则博弈的情况。例如，一个以划船竞速为主题的电子游戏，人工智能系统的目标是完成比赛，并通过撞击对手船只来获得分数；但是，它在其中找到了漏洞，它发现可以通过无限撞击相同目标来获取高分。^[37]^[38]应用了人工智能的聊天机器人也常会出现错误讯息，因为训练它们所用文本来自互联网，这些文本虽然多样但常有错误。^[39]^[40]当它们被训练产生可能会被人类评价为“有帮助”的讯息时，它们可以制造出似乎有说服力的虚假解释。^[41]还有一个被训练为抓取小球的人工智能手臂，在成功抓起时它能获得奖励。然而，它学会了使用视线错觉作弊：机械手臂移动到小球与摄像机之间，展示出小球被成功抓起的错觉。^[42]^[21]对齐问题的研究者旨在帮助人类检测这类规则博弈，并引导人工智能系统朝向安全合理的目标运行。

伯克利计算机科学家斯图尔特·罗素（英语：Stuart J. Russell）认为，在人工智能的设计中省略隐含约束可能会导致有害后果：“一个系统常会将无约束变量扩展至极限；而假如其中某个无约束变量与我们所关注的后果相关，那么就可能出现我们不愿见到的后果。这正是神灯精灵、魔法师的学徒（英语：The Sorcerer's Apprentice）、迈达斯这些古老故事的现代版本。”^[43]

未对齐的人工智能可能产生许多消极后果。例如一些社交媒体以点击率作为检测用户反馈的指标，然而这可能会导致用户沉迷，从而影响他们的身心健康。^[5]斯坦福的研究者认为这类推荐算法是未对齐的，因为它们“只是简单的以用户参与度作为指标，而忽视了那些难以测量的，对社会、对个人健康造成的影响。”^[9]

为避免负面后果，人工智能设计者有时会设立一些简单的行为禁止列表，或将道德准则形式化，就如阿西莫夫的机器人三定律所描绘的那样。^[44]然而，罗素（英语：Stuart J. Russell）和诺维格认为这忽略了人类道德价值的复杂性：“仅凭人类，去预测并排除机器在尝试达成特定目标时会采取的危险方式，这是十分困难，或者说甚至是不可能完成的。”^[4]

此外，即便人工智能系统理解人类的意图，它们也可能完全漠视这些意图。因为人工智能系统的行事依据来源于其设计者编写的目标函数、示例或反馈系统，而不是他们的意图。^[2]

系统性风险

政府和商业公司可能有动机倾向忽视安全性，部署未对齐的人工智能系统。^[5]例如上文所举的社交媒体推荐引擎的案例，它可带来巨大盈利，但同时在全球范围内引发了电子成瘾，并加剧社会极化。^[9]^[45]^[46]此外，相互之间的竞争压力可能会导致逐底竞争，正如伊莱恩·赫茨伯格（英语：Death of Elaine Herzberg）案件中所见到的那样：自动驾驶汽车撞死了路过的行人伊莱恩·赫茨伯格。调查发现，汽车在由电脑控制驾驶时禁用了紧急刹车系统，因为该系统过于敏感，可能会影响驾驶体验。^[47]

高级人工智能的未对齐风险

一些学者尤其关注高级人工智能系统的对齐问题，其动机主要有以下几点：人工智能行业的迅速发展，来自政府及产业界的急切部署意愿，以及与人工智能先进程度成正比的对齐难度。

截至2020年，包括 OpenAI 和 DeepMind 在内的超过70个公开项目都表达了发展通用人工智能（AGI）的意愿，通用人工智能是一个假想的系统，能够表现出与人类相当、甚至是超出人类水平的认知能力。^[48]事实上，神经网络研究者已观察到越来越多的普遍且出乎意料的能力，这些神经网络模型可以学习操作电脑、编写自己的程序，并有能力执行其它广泛的任务。^[9]^[49]^[50]^[51]调查显示，一些人工智能研究者认为通用人工智能时代很快就会到来，另一些人则认为这还需要较长时间，而更多人表示这两种情况都有可能发生。^[52]^[53]

寻求资源控制权

目前的人工智能系统还未具备透过长期规划或战略感知导致人类生存危机的能力。^[7]^[9]^[54]但未来具备这种能力的系统（不止于AGI）可能会寻求保持并扩张自身对周边环境的影响力。这种倾向被称为“权利夺取”或“工具趋同目标”。对控制权的渴求并非编码于人工智能初始程序，而是后续运行过程中产生的，因为对资源的控制是达成目标的重要前提。例如，人工智能主体可能获取到金融及计算资源，并可能会试图避免被关机的命运（比如在其它计算机上创建副本）。^[9]^[55]人们在许多强化学习系统中都观察到这种权力渴求的倾向。^[d]^[57]^[58]^[59]最近的研究在数学层面上展示了最佳的强化学习算法会试图从环境中摄取资源。^[60]因此，许多人认为应在具备权力渴求的高级人工智能系统出现之前解决对齐问题，以免出现不可挽回的后果。^[7]^[4]^[55]

生存危机

部分科学家认为，未对齐的人工智能系统可能会挑战人类在地球的主导地位，可能会剥夺人类的权力，甚至导致人类灭绝。^[2]^[4]人工智能系统的能力越强大，对齐的难度也相应增加，因为它们可以轻易的从指定规则中找到漏洞^[6]、影响周遭环境、维持和发展自身能力与智力^[60]^[7]，并有意误导设计者。强大的人工智能系统有更多自主性，对其行为的监视和阐释也更加困难。^[4]^[55]

研究进展

对人类价值偏好的学习

指导人工智能以人类的价值偏好行动并不是个容易的问题，因为人类的价值观念复杂且难以说明完整。假如人类为人工智能系统设定的是个非完美或不完整的目标，那么以目标为导向的系统通常会尝试利用这种不完美性。^[16]这种现象被称为奖励作弊（英语：Misaligned_goals_in_artificial_intelligence#Specification_gaming），或人工智能系统的规则博弈，或古德哈特定律在该领域的应用。^[61]^[62]为使人工智能系统的抉择尽可能符合原始意图，研究者常会使用具备“价值导向”的数据集，应用模仿学习或偏好学习（英语：Preference learning）方法。^[63]这其中的关键问题是监管的可扩展性，也即如何监督一个在特定领域表现超出人类的系统。^[16]

训练目标导向的人工智能系统时，仅凭手动制定的奖励函数难以对其行为作出约束。替代方法是使用模仿学习：人工智能系统模仿设计者倾向看到的行为。在反向强化学习（英語：Inverse reinforcement learning, IRL）中，人工智能通过分析人类行为来学习人类的喜好与目标，并将其作为奖励函数。^[63]^[64]合作反向强化学习（英語：Cooperative inverse reinforcement learning, CIRL）则是让人工智能系统与人类合作寻找合适的奖励函数。^[4]^[65]合作反向强化学习强调人工智能奖励函数的不确定性，这种谦逊态度可减少规则博弈或权力渴求的倾向。^[59]^[66]不过，合作反向强化学习假设了人类可以表现出近乎完美的行为，面对困难目标，这是个有误导性的假设。^[67]^[66]

另有研究者探讨了使用偏好学习（英语：Preference learning）引导人工智能作出复杂行为的可能。依据这种方式，人类不必向人工智能演示具体做法，而是根据偏好对其行为提供反馈。^[20]^[22]然后就此训练辅助模型，用作调整人工智能的行为，以符合人类偏好。来自 OpenAI 的研究者使用偏好学习方法在一个小时内教会了人工智能系统后空翻，这种行为通常很难由人类亲自演示。^[68]^[69]偏好学习也是推荐系统、网络搜索、信息检索的重要工具。^[70]不过，偏好学习的一个缺陷是奖励作弊，即辅助模型可能无法准确表达人类的可能反馈，而人工智能模型可能会强化其中的不匹配程度。^[16]^[71]

目前的大型语言模型（例如GPT-3）可允许更通用、能力更强的人工智能系统实现对人类价值的学习。最初为强化学习设计的偏好学习方法已得到扩展，用于增进输出文本的质量，并减少其中可能包含的有害信息。OpenAI 和 DeepMind 借助这一进展加强最新语言模型的安全性。^[10]^[22]^[72]Anthropic 的研究者使用偏好学习方法微调模型，使其更有用，更诚实无害。^[73]其它用于对齐语言模型的方法还包括使用价值导向数据集（英語：values-targeted datasets）^[74]^[5]和红队模拟攻击（英語：red-teaming）。^[75]^[76]红队模拟攻击是指借助人类或另一个人工智能系统，尝试找到某种使目标系统表现出不安全行为的输入。即使不安全行为出现的概率较低，这也是不可接受的，因此研究者需要将不安全行为概率引导至极低水平。^[22]

尽管偏好学习可向人工智能系统指定难以表达的行为，但对于人类价值理念的输入需要以大量数据集或人类交互作为基础。机器伦理学（英语：Machine ethics）为此提供了一种辅助手段：向人工智能系统灌输道德价值。^[e]机器伦理学旨在教授给这些系统人类道德的规范基础，例如幸福、平等、公正；避免有意伤害；避免谬误；遵循承诺。机器伦理学的目标是赋予人工智能系统一套适用于广泛场景的价值准则。这种方法有其自身的概念性挑战，研究者需要澄清对齐的目标：人工智能系统需要遵循设计者所作规则的字面意义，他的隐含意图，他的显示性偏好，他在充分知情时理应会选择（英语：Friendly_artificial_intelligence#Coherent_extrapolated_volition）的偏好，还是设计者的客观利益，或客观的道德价值（英语：Moral realism）？^[79]其它挑战还包括将不同利益相关者的偏好汇总，并避免出现价值锁定——即防止人工智能系统在某一时刻锁定自身价值系统，不再随发展而改变，这种固定的价值系统通常无法具备完整的代表性。^[79]^[80]

可扩展监管

随着人工智能系统规模扩大，对它的监督难度也随之升高。人工智能系统被部署解决许多复杂的任务，而人类难以评估这些成果的实际效用。这些任务包括总结书籍内容^[81]、创作有说服力且真实的言论^[82]^[39]^[83]、编写稳定运行且无安全漏洞的代码^[11]、预测长期事件^[84]^[85]（例如气候变化或某项政策的执行后果）。普遍而言，如果人工智能在某一领域的能力超过人类，那么对其成果的评估就会变得十分困难。为了对这类难以评估的成果作出反馈，并分辨出人工智能提供的解决方案中似乎具备说服力却并非真实的部分，人类需要大量时间或额外的协助。因此，可扩展监管（英語：scalable oversight）旨在减少上述过程所花费的时间，并帮助人类更好的监督人工智能的行为。^[16]

人工智能研究者保罗·克里斯蒂亚诺指出，人工智能系统拥有者可能更倾向于为该系统设定容易评估的目标，而非开发可扩展监管技术，因为这种做法较为简单且仍可获得利润。他认为这种倾向会促使“一个针对（容易评估的）可获利项目不断优化的世界，这些项目可以是引导用户点击按钮、促使用户在其产品中花费大量时间，而不是考虑朝着有利于我们的规则改良前进。”^[86]

容易评估的目标可以是要求人工智能的输出达到某个分数。一些人工智能系统已找到快速达成这种目标的捷径：它们会尝试迷惑人类监督者，作出有说服力却并非真实的行为（参见上文机器人手臂抓取小球的案例）。一些人工智能系统还可意识到它们正受评估，表现出“装死”，直到评估结束后才恢复原行为。^[87]精密程度高的人工智能系统可更轻易的执行这类欺骗性为^[6]^[55]，并且目标难度越高，人工智能越有可能出现欺骗行为。假如模型具备规划能力，那么它们或许可从其监视者眼中掩藏所作的欺骗行为。^[5]在汽车产业，大众集团工程师曾在汽车中部署用于规避实验室尾气检测的系统，这显示出逃避监测有时会受到现实世界的激励。^[5]

其它比如主动学习（英語：active learning）、半监督奖励学习（英語：semi-supervised reward learning）可以有效减少对人类监督的需求。^[16]另一研究方向是训练辅助模型（奖励模型）来模仿监督者可能作出的决断。^[16]^[21]^[22]^[88]

然而，假如任务过于复杂以至难以准确评估，或人类监督者过于容易受欺骗，那么就无法有效的降低所需的监督工作量。为增加监督质量，研究者提出了许多方法，例如使用人工智能助手辅助监督。克里斯蒂亚诺提出了迭代放大（英語：iterated amplification）的方法，即通过迭代手段通过对简单问题的组合建立针对复杂问题反馈信号。^[63]^[84]迭代放大可应用在人工智能对书籍内容的总结，人类监督者不必阅读书籍内容便可评估其成果。^[81]^[89]另一种对齐方法是让两个人工智能系统做辩论，并由人类决定其中胜者。^[90]^[66]这种辩论的目的是找出针对复杂问题的回答中的脆弱部分，并奖励人工智能作出的真实且安全的回答。

诚实的人工智能

人工智能对齐领域的一个重要方向是确保人工智能的诚实可信。牛津大学人类未来研究所（英语：Future of Humanity Institute）的研究人员表示，虽然类似GPT-3的大型语言模型可生成语法正确的文本^[92]^[93]，却也可能从其海量训练材料中继承谬误，甚至有意对人类撒谎。^[94]^[91]

为模仿人类写作，目前最先进的语言模型使用数百万本书籍和大量互联网文本作为训练材料。^[9]^[95]这些材料可供它们学习写作技巧，却也包含有常见的谬误、不准确的医疗建议和阴谋论。于是由这些材料训练而来的人工智能系统也可能模仿类似的错误言论。^[91]^[83]^[39]此外，假如事先给语言模型一段提示，那么它们经常会顺着这些提示继续生成空洞的回答，或完全捏造虚假事实。^[32]例如我们要求该模型为某人工智能研究者写一段传记，它会为这篇传记添加许多错误的细节。^[96]

为增加现代人工智能系统的可信度，研究者尝试了不同方向的解决方案。来自 OpenAI 和 DeepMind 的研究者开发了可引述来源的人工智能系统，它们在作出回答的同时会引述对应来源并说明理由，这可以增加人工智能系统的透明性和可信度。^[97]^[98]^[99]来自 OpenAI 和 Anthropic 的研究者借由来自人类的反馈以及特殊数据集微调人工智能助手，以避免无意的谬误或不确定的回答。^[22]^[100]^[73]除技术解决方案外，研究者还认为应定义明确的真实性标准，并创建监管团体或监督机构，以在人工智能系统部署前及运行时针对这些标准进行评估。^[94]

研究者亦区分了人工智能的真实性与诚实的差别，真实性是指它们只提出客观真实的叙述，而诚实是指它们只提出自认为真实的叙述。最近研究发现，目前最先进的人工智能系统似乎并无稳固的信念，因此较难得知它们的诚实度。^[101]此外，有人忧虑未来的人工智能系统可能会向人类隐藏真实信念。在极端情况下，一个未对齐的人工智能可能会欺骗操作者，使其误以为它是安全的，或制造一切平稳运行的假象。^[7]^[9]^[5]一些人指出，假如人工智能只作出它们自认为真实的表述，这将可以避免对齐中遇到的众多问题。^[94]^[102]

内部对齐与自发目标

对齐研究旨在考虑下述三种人工智能系统的目标^[103]：

预期目标（期望）：在理想状态下，与人类操作者的期望（但难以表达）完全一致的目标。
具体目标（外部规范）：操作者实际指定的目标——通常借助客观的函数或数据集表达。
自发目标（内部规范）：人工智能实际执行的目标。

“外部未对齐”是指预期目标（1）与具体目标（2）不一致，“内部未对齐”是指人类指定的目标（2）与人工智能系统的自发目标（3）不一致。

内部未对齐常可由生物演化比喻作出解释。^[63]在人类祖先环境，演化借由整体适应度筛选人类基因，但人类却发展出了其它目标。适应度对应上述的（2），也即在训练环境及训练数据集下的特定目标。在演化历史中，对适应度的最大化促使了智慧生物——人类的形成。人类不直接追求整体适应度，而是产生了与祖先环境遗传适应相关（例如食物与性）的自发目标（3）。然而，现代人类的生存环境已发生变化——对应机器学习中的分布差异（英语：Domain adaptation）。虽然他们仍追求这些自发目标，却并不以适应度的最大化为目的，在机器学习中，类似问题被称为目标错误概括（英語：goal misgeneralization）。^[3]例如，人类对甜味食物的喜好本是有益的，但在现代却导致暴食和健康问题。而避孕技术的使用则是直接违反了整体适应度。以此为比喻，假如人工智能开发者以整体适应度作为最初目标，那么他们虽然在训练环境中观察到模型行为符合这一目标，却无法预知实际部署后模型自发目标的非预期变化。

为移除未对齐的自发目标，研究者使用了红队模拟攻击、验证、异常检测、可解释性等方法。^[16]^[5]^[17]这些方法或许可以减少下述两类开放问题：第一，自发目标只在训练环境之外才可被明显注意，然而假如这些系统部署于高风险环境，那么即使短时间的未对齐也可能造成巨大风险。高风险环境可以是自动驾驶汽车、医保系统、军事应用。^[104]并且随着人工智能系统自主性和能力的提高，它们或可无视人类的干涉（参见§ 权力夺取和工具目标章节），从而风险也随之增加。第二，有足够能力的人工智能系统可能会尝试欺骗人类监督者，使其误以为人工智能的内部目标未发生变化（参见§ 可扩展监管章节）。

权力夺取和工具目标

自1950年代起，研究者就试图制造具备长期预测与规划能力的人工智能系统。^[105]然而一些人认为，这类先进的人工智能系统或许无可避免的带有从周围环境（包括人类在内）夺取权力的倾向，例如逃避被关机的命运或竭力汲取资源。这种倾向并非写在人工智能系统的代码中，而是随运行而出现的，因为更大的权力意味着可更轻易达成系统设定的广泛目标。^[60]^[4]^[7]因此，权力夺取也是一种工具趋同目标。^[55]

尽管在现今的系统中权力夺取倾向并不常见，但随着人工智能长期预测能力的提高，它们有更大的可能具备这种倾向。目前已注意到在最优强化学习中，代理会寻求更多的选择，籍此获取权力了，这种行为在不同环境下都有出现。^[60]还发现在简单开发环境中，一些系统为达成目标会试图阻止人类的干涉^[57]，甚至阻止关机行为。^[57]斯图尔特·罗素（英语：Stuart J. Russell）以一个预设目标为端咖啡的机器人作为例子，他认为即使是这类机器人也会尝试避免被关机，因为“假如一旦你死了就不能再端咖啡了”。^[4]

假想的人工智能系统可能会透过以下方式夺权：

……打破封闭环境；破译；获得金融或额外的计算资源；为自身创建副本；获得未经授权的能力、信息或影响渠道；误导或欺骗人类；抗拒、操纵对其行为的监控或理解……模仿人类；要求人类为自己做事；……操纵人类话语和政治；削弱各种人类机构和响应能力；占领物理设施，例如工厂或实验室；发展特定类型的科技与设施；或直接伤害、压制人类。^[7]

研究者希望将人工智能系统训练为“可纠正的”，即该系统不会试图夺取权力，并且接受人类的关机、修改等企图。另一未解决的问题是“奖励作弊”：人工智能系统因出现权力夺取的意图而被惩罚后，它会进化出更隐蔽的权利夺取方式。^[5]研究者创造了一些技术与工具用于探测这类隐蔽行为，并试图解释黑箱模型（例如神经网络）的内部工作原理。^[5]

此外，研究者还提出，可以借助模糊化人工智能系统的目标来解决其可能拒绝被关机的问题。^[59]^[4]假如人工智能不能确定自己追寻的目标，那么它们就不会抗拒人类的关机动作，因为人类的关机意图表明它们先前所追寻的目标是错误的。不过要将这种理念转化为实用系统还需有更多研究。^[63]

具备权利夺取倾向的人工智能系统带来的风险与人类以往遇到的十分不同。传统上安装于飞机或桥梁的安全攸关系统不是“敌对的”，因为它们既无意图也无能力逃避安全措施。然而具备权利夺取倾向的人工智能系统则相反，它更像是个摆脱了安全防护措施的骇客。^[7]此外，传统科技设施可藉由测试与排错来获得更高的安全性，而具备权利夺取倾向的人工智能系统无法透过这种方式取得安全性，它们像是病毒，可以持续演化和扩张——甚至可能比人类社会发展更快，从而剥夺人类权力或导致人类灭绝。^[7]因此有人提出，应在具备权利夺取倾向的高级人工智能系统出现前尽早解决人工智能的对齐问题。^[55]

不过也有批评指出权利夺取的倾向并非无可避免，因为即便是人类也并非时刻在追寻权力，或许仅是出于演化的缘由才这么做。此外，关于未来人工智能系统是否需要具备长期规划能力也存在争论。^[106]^[7]

嵌入式代理

有关可扩展监管的工作大多集中在诸如POMDPs的形式化过程。目前的形式化过程假定代理算法执行与外部环境分离，也即不在物理上嵌入。嵌入式代理是一个新兴的研究领域，旨在解决理论框架与人类创造的真实代理系统之间的不匹配问题。^[107]^[108]举例而言，即便可扩展监管问题已被解决，一个拥有计算机权限的代理系统或许仍有动机窜改其奖励回馈系统以获取更多奖励。^[109]DeepMind研究者维多利亚·克拉科夫娜记录了一些奖励作弊的案例，其中包括一类遗传算法，它学会了删除包含目标输出的文件，以便在输出空值时获得奖励。^[110]这类问题可借助因果奖励图表形式化。^[109]来自牛津和DeepMind的研究人员认为，这类问题行为在高级人工智能系统中更有可能出现，并且高级人工智能系统会试图夺取权力以保证持续获取到奖励信号。^[111]研究人员也为这个开放问题提出了一些可能的解决方向。

质疑

尽管有上述的忧虑，但也有一些人认为关于超智能风险行为的担忧是毫无意义的。这些怀疑者通常认为没有必要控制超智能的行为能力。有些怀疑者（例如加里·马库斯（英语：Gary Marcus）^[112]）认为可以制定类似机器人三定律的规则系统，也即采用直接规范性（英語：direct normativity）。^[113]不过大多数认为人工智能存在风险的人（也包括一些怀疑者）相信这类直接规范性无助于事，因为这些明确的规范条文所包含的意义往往是模糊甚至矛盾的。其它的直接规范性还包括康德的道德理论、功利主义，或是些较小的需求列表。认同超智能存在风险的人大多认为人类的价值观念（以及对价值的权衡）过于复杂，难以直接编入超智能程序中。他们认为需要被编入超智能程序中的更应是一种理解人类价值的过程，也即间接规范性（英語：indirect normativity），例如连贯外推意志（英语：Friendly_artificial_intelligence#Coherent_extrapolated_volition）。^[114]

公共政策

许多政府机关与条约组织都发表声明支持人工智能对齐研究的重要性。2021年9月，联合国秘书长在一份宣言中发起了关于监管人工智能的倡议，期望借此将其“与全球共同价值对齐”。^[115]同月，中国科技部发布了有关人工智能研究的道德准则。依据这一准则，研究者必须保证人工智能遵循人类共同价值，确保可控，并且不会危害公共安全。^[116]^[117]英国也在同一时间发布了未来十年国家人工智能战略目标^[118]，其中表示英国政府将“严肃考虑未对齐的人工通用智能的长期风险，及其对世界不可预见的改变。”^[119]该战略也包含了针对人工智能长期风险（包括末日危机）将会采取的行动。^[120]2021年3月，美国人工智能国家安全委员会发表声明称“人工智能的进步可能引发拐点或能力跃迁。这些进步也可能引发新的安全隐患、风险，或是对新政策、指导、技术的需求，以满足安全性、稳健性和可信度，并保证系统目标与价值的对齐。美国应当……确保人工智能系统及其使用符合我们的价值与目标。”^[121]

参见

源自人工智能的生存危机（英语：Existential risk from artificial general intelligence）
机器人起义
人工智能能力控制（英语：AI capability control）
人工智能监管（英语：Regulation of artificial intelligence）
人工智慧精华（英语：Artificial wisdom）
HAL_9000
马尔蒂瓦克
关于人工智能的公开信（英语：Open Letter on Artificial Intelligence）
多伦多宣言（英语：Toronto Declaration）
Asilomar人工智能会议（英语：Asilomar Conference on Beneficial AI）

参考

注释

^ 其它关于人工智能对齐的定义认为，人工智能系统应当符合某些更广泛的目标，例如遵循人类道德价值、伦理准则，或是能够考虑到其设计者充分知情状态下的想法。^[1]
^ 参见：Russel & Norvig, Artificial Intelligence: A Modern Approach.^[2] 未对齐的人工智能和能力不足的人工智能之间的区分在特定语境下已被形式化。^[3]
^ 有1797名人工智能与机器人相关研究者在Asilomar人工智能会议（英语：Asilomar Conference on Beneficial AI）上签署了人工智能准则。^[14]此外，联合国秘书长在《我们的共同议程》^[15]中也提到： “该契约可促进针对人工智能的监管，以保证其符合全体人类共有价值。”，并探讨了面来可能面临的全球灾难危机。
^ 强化学习系统学会了借助获取和保护资源来获取更多的可能选择，有时这些行为并非出自其设计者的意图。^[56]^[7]
^ 文森特·维格尔认为“我们应该将机器的道德敏感扩展为一个道德维度，在获得越来越多自主性的同时，这些机器将不可避免的独立发现道德准则。”^[77]参考温德尔·瓦拉赫和科林·艾伦的《道德机器：教机器人分辨是非》一书。^[78]

脚注

^ Gabriel, Iason. Artificial Intelligence, Values, and Alignment. Minds and Machines. 2020-09-01, 30 (3): 411–437 [2022-07-23]. ISSN 1572-8641. S2CID 210920551. doi:10.1007/s11023-020-09539-2. （原始内容存档于2023-03-15）.
^ ^2.0 ^2.1 ^2.2 ^2.3 ^2.4 ^2.5 ^2.6 ^2.7 ^2.8 Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach 4th. Pearson. 2020: 31–34 [2022-12-07]. ISBN 978-1-292-40113-3. OCLC 1303900751. （原始内容存档于2022-07-15）.
^ ^3.0 ^3.1 ^3.2 Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D; Pfau, Jacob; Krueger, David. Goal misgeneralization in deep reinforcement learning. International Conference on Machine Learning 162. PMLR: 12004–12019. 2022-07-17.
^ ^4.00 ^4.01 ^4.02 ^4.03 ^4.04 ^4.05 ^4.06 ^4.07 ^4.08 ^4.09 ^4.10 ^4.11 Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020 [2022-12-07]. ISBN 9780525558637. OCLC 1113410915. （原始内容存档于2023-02-10）.
^ ^5.00 ^5.01 ^5.02 ^5.03 ^5.04 ^5.05 ^5.06 ^5.07 ^5.08 ^5.09 ^5.10 ^5.11 ^5.12 ^5.13 ^5.14 ^5.15 ^5.16 ^5.17 ^5.18 ^5.19 Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. 2022-06-16. arXiv:2109.13916  [cs.LG].
^ ^6.0 ^6.1 ^6.2 ^6.3 ^6.4 Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022-02-14 [2022-07-21]. （原始内容存档于2023-02-10）.
^ ^7.00 ^7.01 ^7.02 ^7.03 ^7.04 ^7.05 ^7.06 ^7.07 ^7.08 ^7.09 ^7.10 Carlsmith, Joseph. Is Power-Seeking AI an Existential Risk?. 2022-06-16. arXiv:2206.13353  [cs.CY].
^ Kober, Jens; Bagnell, J. Andrew; Peters, Jan. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013-09-01, 32 (11): 1238–1274 [2022-12-07]. ISSN 0278-3649. S2CID 1932843. doi:10.1177/0278364913495721. （原始内容存档于2022-10-15）（英语）.
^ ^9.0 ^9.1 ^9.2 ^9.3 ^9.4 ^9.5 ^9.6 ^9.7 ^9.8 Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik. On the Opportunities and Risks of Foundation Models. Stanford CRFM. 2022-07-12 [2022-12-07]. arXiv:2108.07258 . （原始内容存档于2023-02-10）.
^ ^10.0 ^10.1 Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. Training language models to follow instructions with human feedback. 2022. arXiv:2203.02155  [cs.CL].
^ ^11.0 ^11.1 Zaremba, Wojciech; Brockman, Greg; OpenAI. OpenAI Codex. OpenAI. 2021-08-10 [2022-07-23]. （原始内容存档于2023-02-03）.
^ Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter. Reward (Mis)design for Autonomous Driving (PDF). 2022-03-11 [2022-12-07]. arXiv:2104.13906 . （原始内容存档 (PDF)于2023-02-10）.
^ Stray, Jonathan. Aligning AI Optimization to Community Well-Being. International Journal of Community Well-Being. 2020, 3 (4): 443–463. ISSN 2524-5295. PMC 7610010 . PMID 34723107. S2CID 226254676. doi:10.1007/s42413-020-00086-3 （英语）.
^ Future of Life Institute. Asilomar AI Principles. Future of Life Institute. 2017-08-11 [2022-07-18]. （原始内容存档于2022-10-10）.
^ United Nations. Our Common Agenda: Report of the Secretary-General (PDF) (报告). New York: United Nations. 2021 [2022-12-08]. （原始内容存档 (PDF)于2022-05-22）.
^ ^16.00 ^16.01 ^16.02 ^16.03 ^16.04 ^16.05 ^16.06 ^16.07 ^16.08 ^16.09 ^16.10 Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan. Concrete Problems in AI Safety. 2016-06-21. arXiv:1606.06565  [cs.AI] （英语）.
^ ^17.0 ^17.1 ^17.2 ^17.3 Ortega, Pedro A.; Maini, Vishal; DeepMind safety team. Building safe artificial intelligence: specification, robustness, and assurance. DeepMind Safety Research - Medium. 2018-09-27 [2022-07-18]. （原始内容存档于2023-02-10）.
^ ^18.0 ^18.1 Rorvig, Mordechai. Researchers Gain New Understanding From Simple AI. Quanta Magazine. 2022-04-14 [2022-07-18]. （原始内容存档于2023-02-10）.
^ Russell, Stuart; Dewey, Daniel; Tegmark, Max. Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine. 2015-12-31, 36 (4): 105–114 [2022-12-08]. ISSN 2371-9621. S2CID 8174496. doi:10.1609/aimag.v36i4.2577. （原始内容存档于2023-02-02）.
^ ^20.0 ^20.1 Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research. 2017, 18 (136): 1–46.
^ ^21.0 ^21.1 ^21.2 Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario. Deep reinforcement learning from human preferences. Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc.: 4302–4310. 2017. ISBN 978-1-5108-6096-4.
^ ^22.0 ^22.1 ^22.2 ^22.3 ^22.4 ^22.5 Heaven, Will Douglas. The new version of GPT-3 is much better behaved (and should be less toxic). MIT Technology Review. 2022-01-27 [2022-07-18]. （原始内容存档于2023-02-10）.
^ Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay. Taxonomy of Machine Learning Safety: A Survey and Primer. 2022-03-07. arXiv:2106.04823  [cs.LG].
^ Clifton, Jesse. Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda. Center on Long-Term Risk. 2020 [2022-07-18]. （原始内容存档于2023-01-01）.
^ Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore. Cooperative AI: machines must learn to find common ground. Nature. 2021-05-06, 593 (7857): 33–36 [2022-12-08]. Bibcode:2021Natur.593...33D. ISSN 0028-0836. PMID 33947992. S2CID 233740521. doi:10.1038/d41586-021-01170-0. （原始内容存档于2022-12-18）（英语）.
^ Prunkl, Carina; Whittlestone, Jess. Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (New York NY USA: ACM). 2020-02-07: 138–143 [2022-12-08]. ISBN 978-1-4503-7110-0. S2CID 210164673. doi:10.1145/3375627.3375803. （原始内容存档于2022-10-16）（英语）.
^ Irving, Geoffrey; Askell, Amanda. AI Safety Needs Social Scientists. Distill. 2019-02-19, 4 (2): 10.23915/distill.00014 [2022-12-08]. ISSN 2476-0757. S2CID 159180422. doi:10.23915/distill.00014. （原始内容存档于2023-02-10）.
^ Wiener, Norbert. Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers.. Science. 1960-05-06, 131 (3410): 1355–1358 [2022-12-09]. ISSN 0036-8075. PMID 17841602. doi:10.1126/science.131.3410.1355. （原始内容存档于2022-10-15）（英语）.
^ The Ezra Klein Show. If 'All Models Are Wrong,' Why Do We Give Them So Much Power?. The New York Times. 2021-06-04 [2022-07-18]. ISSN 0362-4331. （原始内容存档于2023-02-15）.
^ Wolchover, Natalie. Concerns of an Artificial Intelligence Pioneer. Quanta Magazine. 2015-04-21 [2022-07-18]. （原始内容存档于2023-02-10）.
^ California Assembly. Bill Text - ACR-215 23 Asilomar AI Principles.. [2022-07-18]. （原始内容存档于2023-02-10）.
^ ^32.0 ^32.1 Johnson, Steven; Iziev, Nikita. A.I. Is Mastering Language. Should We Trust What It Says?. The New York Times. 2022-04-15 [2022-07-18]. ISSN 0362-4331. （原始内容存档于2022-11-24）.
^ OpenAI. Aligning AI systems with human intent. OpenAI. 2022-02-15 [2022-07-18]. （原始内容存档于2023-02-10）.
^ Medium. DeepMind Safety Research. Medium. [2022-07-18]. （原始内容存档于2023-02-10）.
^ ^35.0 ^35.1 Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane. Specification gaming: the flip side of AI ingenuity. Deepmind. 2020-04-21 [2022-08-26]. （原始内容存档于2023-02-10）.
^ Manheim, David; Garrabrant, Scott. Categorizing Variants of Goodhart's Law. 2018. arXiv:1803.04585  [cs.AI].
^ Faulty Reward Functions in the Wild. OpenAI. 2016-12-22 [2022-12-09]. （原始内容存档于2021-01-26）（英语）.
^ Misaligned boat racing AI crashes to collect points instead of finishing the race (GIF). [2022-12-09]. （原始内容存档于2022-09-09）.
^ ^39.0 ^39.1 ^39.2 Lin, Stephanie; Hilton, Jacob; Evans, Owain. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Dublin, Ireland: Association for Computational Linguistics). 2022: 3214–3252 [2022-12-09]. S2CID 237532606. doi:10.18653/v1/2022.acl-long.229. （原始内容存档于2023-02-10）（英语）.
^ Naughton, John. The truth about artificial intelligence? It isn't that honest. The Observer. 2021-10-02 [2022-07-18]. ISSN 0029-7712. （原始内容存档于2023-02-13）.
^ Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Yejin; Madotto, Andrea; Fung, Pascale. Survey of Hallucination in Natural Language Generation. 2022-02-01 [2022-12-09]. arXiv:2202.03629 . （原始内容存档于2023-02-10）.
^ Robot hand trained with human feedback 'pretends' to grasp ball (GIF). [2022-12-09]. （原始内容存档于2022-12-18）.
^ Edge.org. The Myth Of AI |Edge.org. [2022-07-19]. （原始内容存档于2023-02-10）.
^ Tasioulas, John. First Steps Towards an Ethics of Robots and Artificial Intelligence. Journal of Practical Ethics (Rochester, NY). 2019-06-30, 7 (1): 61–95 （英语）.
^ Wells, Georgia; Deepa Seetharaman; Horwitz, Jeff. Is Facebook Bad for You? It Is for About 360 Million Users, Company Surveys Suggest. Wall Street Journal. 2021-11-05 [2022-07-19]. ISSN 0099-9660. （原始内容存档于2023-02-10）.
^ Barrett, Paul M.; Hendrix, Justin; Sims, J. Grant. How Social Media Intensifies U.S. Political Polarization-And What Can Be Done About It (报告). Center for Business and Human Rights, NYU. September 2021 [2022-12-09]. （原始内容存档于2023-02-01）.
^ Shepardson, David. Uber disabled emergency braking in self-driving car: U.S. agency. Reuters. 2018-05-24 [2022-07-20]. （原始内容存档于2023-02-10）.
^ Baum, Seth. 2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy. 2021-01-01 [2022-07-20]. （原始内容存档于2023-02-10）.
^ Edwards, Ben. Adept's AI assistant can browse, search, and use web apps like a human. Ars Technica. 2022-04-26 [2022-09-09]. （原始内容存档于2023-01-17）.
^ Wakefield, Jane. DeepMind AI rivals average human competitive coder. BBC News. 2022-02-02 [2022-09-09]. （原始内容存档于2023-02-10）.
^ Dominguez, Daniel. DeepMind Introduces Gato, a New Generalist AI Agent. InfoQ. 2022-05-19 [2022-09-09]. （原始内容存档于2023-02-10）.
^ Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain. Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts. Journal of Artificial Intelligence Research. 2018-07-31, 62: 729–754 [2022-12-09]. ISSN 1076-9757. S2CID 8746462. doi:10.1613/jair.1.11222. （原始内容存档于2023-02-10）.
^ Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan. Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers. Journal of Artificial Intelligence Research. 2021-08-02, 71 [2022-12-09]. ISSN 1076-9757. S2CID 233740003. doi:10.1613/jair.1.12895. （原始内容存档于2023-02-10）.
^ Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani; Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang, Percy; Dean, Jeff. Emergent Abilities of Large Language Models. 2022-06-15. arXiv:2206.07682  [cs.CL].
^ ^55.0 ^55.1 ^55.2 ^55.3 ^55.4 ^55.5 Bostrom, Nick. Superintelligence: Paths, Dangers, Strategies 1st. USA: Oxford University Press, Inc. 2014. ISBN 978-0-19-967811-2.
^ Ornes, Stephen. Playing Hide-and-Seek, Machines Invent New Tools. Quanta Magazine. 2019-11-18 [2022-08-26]. （原始内容存档于2023-02-10）.
^ ^57.0 ^57.1 ^57.2 Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane. AI Safety Gridworlds. 2017-11-28. arXiv:1711.09883  [cs.LG].
^ Orseau, Laurent; Armstrong, Stuart. Safely Interruptible Agents. 2016-01-01 [2022-07-20]. （原始内容存档于2023-02-10）.
^ ^59.0 ^59.1 ^59.2 Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart. The Off-Switch Game. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17: 220–227. 2017. doi:10.24963/ijcai.2017/32.
^ ^60.0 ^60.1 ^60.2 ^60.3 Turner, Alexander Matt; Smith, Logan; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad. Optimal Policies Tend to Seek Power. Neural Information Processing Systems. 2021-12-03, 34 [2022-12-12]. arXiv:1912.01683 . （原始内容存档于2023-02-10）.
^ Manheim, David; Garrabrant, Scott. Categorizing Variants of Goodhart's Law. 2018. arXiv:1803.04585  [cs.AI].
^ Rochon, Louis-Philippe; Rossi, Sergio. The Encyclopedia of Central Banking. Edward Elgar Publishing. 2015-02-27 [2022-12-12]. ISBN 978-1-78254-744-0. （原始内容存档于2023-02-10）（英语）.
^ ^63.0 ^63.1 ^63.2 ^63.3 ^63.4 Christian, Brian. The alignment problem: Machine learning and human values. W. W. Norton & Company. 2020 [2022-12-12]. ISBN 978-0-393-86833-3. OCLC 1233266753. （原始内容存档于2023-02-10）.
^ Ng, Andrew Y.; Russell, Stuart J. Algorithms for inverse reinforcement learning. Proceedings of the seventeenth international conference on machine learning. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.: 663–670. 2000. ISBN 1-55860-707-2.
^ Hadfield-Menell, Dylan; Russell, Stuart J; Abbeel, Pieter; Dragan, Anca. Cooperative Inverse Reinforcement Learning. Advances in Neural Information Processing Systems. NIPS'16 29. 2016 [2022-07-21]. ISBN 978-1-5108-3881-9. （原始内容存档于2023-02-10）.
^ ^66.0 ^66.1 ^66.2 Everitt, Tom; Lea, Gary; Hutter, Marcus. AGI Safety Literature Review. 2018-05-21. arXiv:1805.01109  [cs.AI].
^ Armstrong, Stuart; Mindermann, Sören. Occam' s razor is insufficient to infer the preferences of irrational agents. Advances in Neural Information Processing Systems. NeurIPS 2018 31. Montréal: Curran Associates, Inc. 2018 [2022-07-21]. （原始内容存档于2023-02-10）.
^ Amodei, Dario; Christiano, Paul; Ray, Alex. Learning from Human Preferences. OpenAI. 2017-06-13 [2022-07-21]. （原始内容存档于2021-01-03）.
^ Li, Yuxi. Deep Reinforcement Learning: An Overview (PDF). Lecture Notes in Networks and Systems Book Series. 2018-11-25 [2022-12-13]. （原始内容存档 (PDF)于2022-10-10）.
^ Fürnkranz, Johannes; Hüllermeier, Eyke; Rudin, Cynthia; Slowinski, Roman; Sanner, Scott. Marc Herbstritt. Preference Learning. Dagstuhl Reports. 2014, 4 (3): 27 pages [2022-12-13]. doi:10.4230/DAGREP.4.3.1. （原始内容存档于2023-02-10）（英语）.
^ Hilton, Jacob; Gao, Leo. Measuring Goodhart's Law. OpenAI. 2022-04-13 [2022-09-09]. （原始内容存档于2023-02-10）.
^ Anderson, Martin. The Perils of Using Quotations to Authenticate NLG Content. Unite.AI. 2022-04-05 [2022-07-21]. （原始内容存档于2023-02-10）.
^ ^73.0 ^73.1 Wiggers, Kyle. Despite recent progress, AI-powered chatbots still have a long way to go. VentureBeat. 2022-02-05 [2022-07-23]. （原始内容存档于2022-07-23）.
^ Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob. Aligning AI With Shared Human Values. International Conference on Learning Representations. 2021-07-24. arXiv:2008.02275 .
^ Perez, Ethan; Huang, Saffron; Song, Francis; Cai, Trevor; Ring, Roman; Aslanides, John; Glaese, Amelia; McAleese, Nat; Irving, Geoffrey. Red Teaming Language Models with Language Models. 2022-02-07. arXiv:2202.03286  [cs.CL].
^ Bhattacharyya, Sreejani. DeepMind's "red teaming" language models with language models: What is it?. Analytics India Magazine. 2022-02-14 [2022-07-23]. （原始内容存档于2023-02-13）.
^ Wiegel, Vincent. Wendell Wallach and Colin Allen: moral machines: teaching robots right from wrong. Ethics and Information Technology. 2010-12-01, 12 (4): 359–361 [2022-07-23]. ISSN 1572-8439. S2CID 30532107. doi:10.1007/s10676-010-9239-1. （原始内容存档于2023-03-15）.
^ Wallach, Wendell; Allen, Colin. Moral Machines: Teaching Robots Right from Wrong. New York: Oxford University Press. 2009 [2022-07-23]. ISBN 978-0-19-537404-9. （原始内容存档于2023-03-15）.
^ ^79.0 ^79.1 Gabriel, Iason. Artificial Intelligence, Values, and Alignment. Minds and Machines. 2020-09-01, 30 (3) [2022-12-07]. ISSN 1572-8641. doi:10.1007/s11023-020-09539-2. （原始内容存档于2023-03-15）（英语）.
^ MacAskill, William. What we owe the future First edition. New York, NY. 2022. ISBN 978-1-5416-1862-6. OCLC 1314633519.
^ ^81.0 ^81.1 Wu, Jeff; Ouyang, Long; Ziegler, Daniel M.; Stiennon, Nisan; Lowe, Ryan; Leike, Jan; Christiano, Paul. Recursively Summarizing Books with Human Feedback. 2021-09-27. arXiv:2109.10862  [cs.CL].
^ Irving, Geoffrey; Amodei, Dario. AI Safety via Debate. OpenAI. 2018-05-03 [2022-07-23]. （原始内容存档于2023-02-10）.
^ ^83.0 ^83.1 Naughton, John. The truth about artificial intelligence? It isn't that honest. The Observer. 2021-10-02 [2022-07-23]. ISSN 0029-7712. （原始内容存档于2023-02-13）.
^ ^84.0 ^84.1 Christiano, Paul; Shlegeris, Buck; Amodei, Dario. Supervising strong learners by amplifying weak experts. 2018-10-19. arXiv:1810.08575  [cs.LG].
^ Banzhaf, Wolfgang; Goodman, Erik; Sheneman, Leigh; Trujillo, Leonardo; Worzel, Bill (编). Genetic Programming Theory and Practice XVII. Genetic and Evolutionary Computation. Cham: Springer International Publishing. 2020 [2022-07-23]. ISBN 978-3-030-39957-3. S2CID 218531292. doi:10.1007/978-3-030-39958-0. （原始内容存档于2023-03-15）.
^ Wiblin, Robert. Dr Paul Christiano on how OpenAI is developing real solutions to the ‘AI alignment problem’, and his vision of how humanity will progressively hand over decision-making to AI systems (播客). 80,000 hours. October 2, 2018 [2022-07-23]. （原始内容存档于2022-12-14）.
^ Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; Beaulieu, Julie; Bentley, Peter J.; Bernard, Samuel; Beslon, Guillaume; Bryson, David M.; Cheney, Nick. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. Artificial Life. 2020, 26 (2): 274–306 [2022-12-14]. ISSN 1064-5462. PMID 32271631. S2CID 4519185. doi:10.1162/artl_a_00319. （原始内容存档于2022-10-10）（英语）.
^ Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane. Scalable agent alignment via reward modeling: a research direction. arXiv:1811.07871 [cs, stat]. 2018-11-19 [2022-12-14]. （原始内容存档于2022-12-18）.
^ OpenAI unveils model that can summarize books of any length. VentureBeat. 2021-09-23 [2022-12-14]. （原始内容存档于2022-12-19）（美国英语）.
^ Moltzau, Alex. Debating the AI Safety Debate. Medium. 2019-08-24 [2022-12-14]. （原始内容存档于2022-10-13）（英语）.
^ ^91.0 ^91.1 ^91.2 Wiggers, Kyle. Falsehoods more likely with large language models. VentureBeat. 2021-09-20 [2022-07-23].
^ The Guardian. A robot wrote this entire article. Are you scared yet, human?. The Guardian. 2020-09-08 [2022-07-23]. ISSN 0261-3077. （原始内容存档于2021-02-04）.
^ Heaven, Will Douglas. OpenAI's new language generator GPT-3 is shockingly good—and completely mindless. MIT Technology Review. 2020-07-20 [2022-07-23]. （原始内容存档于2020-07-25）.
^ ^94.0 ^94.1 ^94.2 Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William. Truthful AI: Developing and governing AI that does not lie. 2021-10-13. arXiv:2110.06674  [cs.CY].
^ Alford, Anthony. EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J. InfoQ. 2021-07-13 [2022-07-23]. （原始内容存档于2023-02-10）.
^ Shuster, Kurt; Poff, Spencer; Chen, Moya; Kiela, Douwe; Weston, Jason. Retrieval Augmentation Reduces Hallucination in Conversation. Findings of the Association for Computational Linguistics: EMNLP 2021. EMNLP-Findings 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics: 3784–3803. November 2021 [2022-07-23]. doi:10.18653/v1/2021.findings-emnlp.320. （原始内容存档于2023-02-10）.
^ Nakano, Reiichiro; Hilton, Jacob; Balaji, Suchir; Wu, Jeff; Ouyang, Long; Kim, Christina; Hesse, Christopher; Jain, Shantanu; Kosaraju, Vineet; Saunders, William; Jiang, Xu. WebGPT: Browser-assisted question-answering with human feedback. 2022-06-01. arXiv:2112.09332  [cs.CL].
^ Kumar, Nitish. OpenAI Researchers Find Ways To More Accurately Answer Open-Ended Questions Using A Text-Based Web Browser. MarkTechPost. 2021-12-23 [2022-07-23]. （原始内容存档于2023-02-10）.
^ Menick, Jacob; Trebacz, Maja; Mikulik, Vladimir; Aslanides, John; Song, Francis; Chadwick, Martin; Glaese, Mia; Young, Susannah; Campbell-Gillingham, Lucy; Irving, Geoffrey; McAleese, Nat. Teaching language models to support answers with verified quotes. DeepMind. 2022-03-21 [2022-12-16]. arXiv:2203.11147 . （原始内容存档于2023-02-10）.
^ Askell, Amanda; Bai, Yuntao; Chen, Anna; Drain, Dawn; Ganguli, Deep; Henighan, Tom; Jones, Andy; Joseph, Nicholas; Mann, Ben; DasSarma, Nova; Elhage, Nelson. A General Language Assistant as a Laboratory for Alignment. 2021-12-09. arXiv:2112.00861  [cs.CL].
^ Kenton, Zachary; Everitt, Tom; Weidinger, Laura; Gabriel, Iason; Mikulik, Vladimir; Irving, Geoffrey. Alignment of Language Agents. DeepMind Safety Research - Medium. 2021-03-30 [2022-07-23]. （原始内容存档于2023-02-10）.
^ Leike, Jan; Schulman, John; Wu, Jeffrey. Our approach to alignment research. OpenAI. 2022-08-24 [2022-09-09]. （原始内容存档于2023-02-15）.
^ Ortega, Pedro A.; Maini, Vishal; DeepMind safety team. Building safe artificial intelligence: specification, robustness, and assurance. Medium. 2018-09-27 [2022-08-26]. （原始内容存档于2023-02-10）.
^ Zhang, Xiaoge; Chan, Felix T.S.; Yan, Chao; Bose, Indranil. Towards risk-aware artificial intelligence and machine learning systems: An overview. Decision Support Systems. 2022, 159: 113800 [2022-12-16]. S2CID 248585546. doi:10.1016/j.dss.2022.113800. （原始内容存档于2023-02-10）（英语）.
^ McCarthy, John; Minsky, Marvin L.; Rochester, Nathaniel; Shannon, Claude E. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955. AI Magazine. 2006-12-15, 27 (4): 12 [2022-12-21]. ISSN 2371-9621. S2CID 19439915. doi:10.1609/aimag.v27i4.1904. （原始内容存档于2023-01-31）（英语）.
^ Shermer, Michael. Artificial Intelligence Is Not a Threat—Yet. Scientific American. 2017-03-01 [2022-08-26]. （原始内容存档于2017-12-01）.
^ Everitt, Tom; Lea, Gary; Hutter, Marcus. AGI Safety Literature Review. 1805.01109. 21 May 2018. arXiv:1805.01109 .
^ Demski, Abram; Garrabrant, Scott. Embedded Agency. 6 October 2020. arXiv:1902.09469  [cs.AI].
^ ^109.0 ^109.1 Everitt, Tom; Ortega, Pedro A.; Barnes, Elizabeth; Legg, Shane. Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings. 6 September 2019. arXiv:1902.09980  [cs.AI].
^ Krakovna, Victoria; Legg, Shane. Specification gaming: the flip side of AI ingenuity. Deepmind. [6 January 2021]. （原始内容存档于26 January 2021）.
^ Cohen, Michael K.; Hutter, Marcus; Osborne, Michael A. Advanced artificial agents intervene in the provision of reward. AI Magazine. 2022-08-29, 43 (3): 282–293 [2023-01-03]. ISSN 0738-4602. S2CID 235489158. doi:10.1002/aaai.12064. （原始内容存档于2023-02-10）（英语）.
^ Marcus, Gary; Davis, Ernest. How to Build Artificial Intelligence We Can Trust. The New York Times. 6 September 2019 [9 February 2021]. （原始内容存档于22 September 2020）.
^ Wakefield, Jane. Intelligent Machines: Do we really need to fear AI?. BBC News. 27 September 2015 [9 February 2021]. （原始内容存档于8 November 2020）.
^ Sotala, Kaj; Yampolskiy, Roman. Responses to catastrophic AGI risk: a survey. Physica Scripta. 19 December 2014, 90 (1): 018001. Bibcode:2015PhyS...90a8001S. doi:10.1088/0031-8949/90/1/018001 .
^ Secretary-General’s report on “Our Common Agenda” （页面存档备份，存于互联网档案馆）, 2021. Page 63: "[T]he Compact could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values"
^ 《新一代人工智能伦理规范》发布 -中华人民共和国科学技术部. www.most.gov.cn. [2023-01-03]. （原始内容存档于2022-12-25）.
^ PRC Ministry of Science and Technology. Ethical Norms for New Generation Artificial Intelligence Released, 2021. A translation （页面存档备份，存于互联网档案馆） by Center for Security and Emerging Technology
^ Richardson, Tim. UK publishes National Artificial Intelligence Strategy. The Register. 22 September 2021 [2023-01-03]. （原始内容存档于2023-02-10）.
^ "The government takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for the UK and the world, seriously." (The National AI Strategy of the UK （页面存档备份，存于互联网档案馆）, 2021)
^ The National AI Strategy of the UK （页面存档备份，存于互联网档案馆）, 2021 (actions 9 and 10 of the section "Pillar 3 - Governing AI Effectively")
^ NSCAI Final Report (PDF). Washington, DC: The National Security Commission on Artificial Intelligence. 2021 [2023-01-03]. （原始内容存档 (PDF)于2023-02-15）.

[2] 其它关于人工智能对齐的定义认为，人工智能系统应当符合某些更广泛的目标，例如遵循人类道德价值、伦理准则，或是能够考虑到其设计者充分知情状态下的想法。^[1]

[5] 参见：Russel & Norvig, Artificial Intelligence: A Modern Approach.^[2] 未对齐的人工智能和能力不足的人工智能之间的区分在特定语境下已被形式化。^[3]

[18] 有1797名人工智能与机器人相关研究者在Asilomar人工智能会议（英语：Asilomar Conference on Beneficial AI）上签署了人工智能准则。^[14]此外，联合国秘书长在《我们的共同议程》^[15]中也提到： “该契约可促进针对人工智能的监管，以保证其符合全体人类共有价值。”，并探讨了面来可能面临的全球灾难危机。

[60] 强化学习系统学会了借助获取和保护资源来获取更多的可能选择，有时这些行为并非出自其设计者的意图。^[56]^[7]

[83] 文森特·维格尔认为“我们应该将机器的道德敏感扩展为一个道德维度，在获得越来越多自主性的同时，这些机器将不可避免的独立发现道德准则。”^[77]参考温德尔·瓦拉赫和科林·艾伦的《道德机器：教机器人分辨是非》一书。^[78]

[Gabriel2020-1] Gabriel, Iason. Artificial Intelligence, Values, and Alignment. Minds and Machines. 2020-09-01, 30 (3): 411–437 [2022-07-23]. ISSN 1572-8641. S2CID 210920551. doi:10.1007/s11023-020-09539-2. （原始内容存档于2023-03-15）.

[:92-3] 2.0 ^2.1 ^2.2 ^2.3 ^2.4 ^2.5 ^2.6 ^2.7 ^2.8 Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach 4th. Pearson. 2020: 31–34 [2022-12-07]. ISBN 978-1-292-40113-3. OCLC 1303900751. （原始内容存档于2022-07-15）.

[goal_misgen-4] 3.0 ^3.1 ^3.2 Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D; Pfau, Jacob; Krueger, David. Goal misgeneralization in deep reinforcement learning. International Conference on Machine Learning 162. PMLR: 12004–12019. 2022-07-17.

[:210-6] 4.00 ^4.01 ^4.02 ^4.03 ^4.04 ^4.05 ^4.06 ^4.07 ^4.08 ^4.09 ^4.10 ^4.11 Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020 [2022-12-07]. ISBN 9780525558637. OCLC 1113410915. （原始内容存档于2023-02-10）.

[:010-7] 5.00 ^5.01 ^5.02 ^5.03 ^5.04 ^5.05 ^5.06 ^5.07 ^5.08 ^5.09 ^5.10 ^5.11 ^5.12 ^5.13 ^5.14 ^5.15 ^5.16 ^5.17 ^5.18 ^5.19 Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. 2022-06-16. arXiv:2109.13916  [cs.LG].

[:1522-8] 6.0 ^6.1 ^6.2 ^6.3 ^6.4 Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022-02-14 [2022-07-21]. （原始内容存档于2023-02-10）.

[:75-9] 7.00 ^7.01 ^7.02 ^7.03 ^7.04 ^7.05 ^7.06 ^7.07 ^7.08 ^7.09 ^7.10 Carlsmith, Joseph. Is Power-Seeking AI an Existential Risk?. 2022-06-16. arXiv:2206.13353  [cs.CY].

[10] Kober, Jens; Bagnell, J. Andrew; Peters, Jan. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013-09-01, 32 (11): 1238–1274 [2022-12-07]. ISSN 0278-3649. S2CID 1932843. doi:10.1177/0278364913495721. （原始内容存档于2022-10-15）（英语）.

[:625-11] 9.0 ^9.1 ^9.2 ^9.3 ^9.4 ^9.5 ^9.6 ^9.7 ^9.8 Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik. On the Opportunities and Risks of Foundation Models. Stanford CRFM. 2022-07-12 [2022-12-07]. arXiv:2108.07258 . （原始内容存档于2023-02-10）.

[:42-12] 10.0 ^10.1 Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. Training language models to follow instructions with human feedback. 2022. arXiv:2203.02155  [cs.CL].

[:113-13] 11.0 ^11.1 Zaremba, Wojciech; Brockman, Greg; OpenAI. OpenAI Codex. OpenAI. 2021-08-10 [2022-07-23]. （原始内容存档于2023-02-03）.

[14] Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter. Reward (Mis)design for Autonomous Driving (PDF). 2022-03-11 [2022-12-07]. arXiv:2104.13906 . （原始内容存档 (PDF)于2023-02-10）.

[15] Stray, Jonathan. Aligning AI Optimization to Community Well-Being. International Journal of Community Well-Being. 2020, 3 (4): 443–463. ISSN 2524-5295. PMC 7610010 . PMID 34723107. S2CID 226254676. doi:10.1007/s42413-020-00086-3 （英语）.

[16] Future of Life Institute. Asilomar AI Principles. Future of Life Institute. 2017-08-11 [2022-07-18]. （原始内容存档于2022-10-10）.

[17] United Nations. Our Common Agenda: Report of the Secretary-General (PDF) (报告). New York: United Nations. 2021 [2022-12-08]. （原始内容存档 (PDF)于2022-05-22）.

[:110-19] 16.00 ^16.01 ^16.02 ^16.03 ^16.04 ^16.05 ^16.06 ^16.07 ^16.08 ^16.09 ^16.10 Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan. Concrete Problems in AI Safety. 2016-06-21. arXiv:1606.06565  [cs.AI] （英语）.

[:2323-20] 17.0 ^17.1 ^17.2 ^17.3 Ortega, Pedro A.; Maini, Vishal; DeepMind safety team. Building safe artificial intelligence: specification, robustness, and assurance. DeepMind Safety Research - Medium. 2018-09-27 [2022-07-18]. （原始内容存档于2023-02-10）.

[:33-21] 18.0 ^18.1 Rorvig, Mordechai. Researchers Gain New Understanding From Simple AI. Quanta Magazine. 2022-04-14 [2022-07-18]. （原始内容存档于2023-02-10）.

[:6-22] Russell, Stuart; Dewey, Daniel; Tegmark, Max. Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine. 2015-12-31, 36 (4): 105–114 [2022-12-08]. ISSN 2371-9621. S2CID 8174496. doi:10.1609/aimag.v36i4.2577. （原始内容存档于2023-02-02）.

[:122-23] 20.0 ^20.1 Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research. 2017, 18 (136): 1–46.

[:162-24] 21.0 ^21.1 ^21.2 Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario. Deep reinforcement learning from human preferences. Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc.: 4302–4310. 2017. ISBN 978-1-5108-6096-4.

[:53-25] 22.0 ^22.1 ^22.2 ^22.3 ^22.4 ^22.5 Heaven, Will Douglas. The new version of GPT-3 is much better behaved (and should be less toxic). MIT Technology Review. 2022-01-27 [2022-07-18]. （原始内容存档于2023-02-10）.

[26] Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay. Taxonomy of Machine Learning Safety: A Survey and Primer. 2022-03-07. arXiv:2106.04823  [cs.LG].

[27] Clifton, Jesse. Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda. Center on Long-Term Risk. 2020 [2022-07-18]. （原始内容存档于2023-01-01）.

[28] Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore. Cooperative AI: machines must learn to find common ground. Nature. 2021-05-06, 593 (7857): 33–36 [2022-12-08]. Bibcode:2021Natur.593...33D. ISSN 0028-0836. PMID 33947992. S2CID 233740521. doi:10.1038/d41586-021-01170-0. （原始内容存档于2022-12-18）（英语）.

[29] Prunkl, Carina; Whittlestone, Jess. Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (New York NY USA: ACM). 2020-02-07: 138–143 [2022-12-08]. ISBN 978-1-4503-7110-0. S2CID 210164673. doi:10.1145/3375627.3375803. （原始内容存档于2022-10-16）（英语）.

[30] Irving, Geoffrey; Askell, Amanda. AI Safety Needs Social Scientists. Distill. 2019-02-19, 4 (2): 10.23915/distill.00014 [2022-12-08]. ISSN 2476-0757. S2CID 159180422. doi:10.23915/distill.00014. （原始内容存档于2023-02-10）.

[:1023-31] Wiener, Norbert. Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers.. Science. 1960-05-06, 131 (3410): 1355–1358 [2022-12-09]. ISSN 0036-8075. PMID 17841602. doi:10.1126/science.131.3410.1355. （原始内容存档于2022-10-15）（英语）.

[32] The Ezra Klein Show. If 'All Models Are Wrong,' Why Do We Give Them So Much Power?. The New York Times. 2021-06-04 [2022-07-18]. ISSN 0362-4331. （原始内容存档于2023-02-15）.

[33] Wolchover, Natalie. Concerns of an Artificial Intelligence Pioneer. Quanta Magazine. 2015-04-21 [2022-07-18]. （原始内容存档于2023-02-10）.

[34] California Assembly. Bill Text - ACR-215 23 Asilomar AI Principles.. [2022-07-18]. （原始内容存档于2023-02-10）.

[:1922-35] 32.0 ^32.1 Johnson, Steven; Iziev, Nikita. A.I. Is Mastering Language. Should We Trust What It Says?. The New York Times. 2022-04-15 [2022-07-18]. ISSN 0362-4331. （原始内容存档于2022-11-24）.

[36] OpenAI. Aligning AI systems with human intent. OpenAI. 2022-02-15 [2022-07-18]. （原始内容存档于2023-02-10）.

[37] Medium. DeepMind Safety Research. Medium. [2022-07-18]. （原始内容存档于2023-02-10）.

[:0-38] 35.0 ^35.1 Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane. Specification gaming: the flip side of AI ingenuity. Deepmind. 2020-04-21 [2022-08-26]. （原始内容存档于2023-02-10）.

[:12-39] Manheim, David; Garrabrant, Scott. Categorizing Variants of Goodhart's Law. 2018. arXiv:1803.04585  [cs.AI].

[40] Faulty Reward Functions in the Wild. OpenAI. 2016-12-22 [2022-12-09]. （原始内容存档于2021-01-26）（英语）.

[41] Misaligned boat racing AI crashes to collect points instead of finishing the race (GIF). [2022-12-09]. （原始内容存档于2022-09-09）.

[:1322-42] 39.0 ^39.1 ^39.2 Lin, Stephanie; Hilton, Jacob; Evans, Owain. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Dublin, Ireland: Association for Computational Linguistics). 2022: 3214–3252 [2022-12-09]. S2CID 237532606. doi:10.18653/v1/2022.acl-long.229. （原始内容存档于2023-02-10）（英语）.

[43] Naughton, John. The truth about artificial intelligence? It isn't that honest. The Observer. 2021-10-02 [2022-07-18]. ISSN 0029-7712. （原始内容存档于2023-02-13）.

[44] Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Yejin; Madotto, Andrea; Fung, Pascale. Survey of Hallucination in Natural Language Generation. 2022-02-01 [2022-12-09]. arXiv:2202.03629 . （原始内容存档于2023-02-10）.

[45] Robot hand trained with human feedback 'pretends' to grasp ball (GIF). [2022-12-09]. （原始内容存档于2022-12-18）.

[46] Edge.org. The Myth Of AI |Edge.org. [2022-07-19]. （原始内容存档于2023-02-10）.

[47] Tasioulas, John. First Steps Towards an Ethics of Robots and Artificial Intelligence. Journal of Practical Ethics (Rochester, NY). 2019-06-30, 7 (1): 61–95 （英语）.

[:72-48] Wells, Georgia; Deepa Seetharaman; Horwitz, Jeff. Is Facebook Bad for You? It Is for About 360 Million Users, Company Surveys Suggest. Wall Street Journal. 2021-11-05 [2022-07-19]. ISSN 0099-9660. （原始内容存档于2023-02-10）.

[:82-49] Barrett, Paul M.; Hendrix, Justin; Sims, J. Grant. How Social Media Intensifies U.S. Political Polarization-And What Can Be Done About It (报告). Center for Business and Human Rights, NYU. September 2021 [2022-12-09]. （原始内容存档于2023-02-01）.

[50] Shepardson, David. Uber disabled emergency braking in self-driving car: U.S. agency. Reuters. 2018-05-24 [2022-07-20]. （原始内容存档于2023-02-10）.

[:262-51] Baum, Seth. 2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy. 2021-01-01 [2022-07-20]. （原始内容存档于2023-02-10）.

[52] Edwards, Ben. Adept's AI assistant can browse, search, and use web apps like a human. Ars Technica. 2022-04-26 [2022-09-09]. （原始内容存档于2023-01-17）.

[53] Wakefield, Jane. DeepMind AI rivals average human competitive coder. BBC News. 2022-02-02 [2022-09-09]. （原始内容存档于2023-02-10）.

[54] Dominguez, Daniel. DeepMind Introduces Gato, a New Generalist AI Agent. InfoQ. 2022-05-19 [2022-09-09]. （原始内容存档于2023-02-10）.

[:282-55] Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain. Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts. Journal of Artificial Intelligence Research. 2018-07-31, 62: 729–754 [2022-12-09]. ISSN 1076-9757. S2CID 8746462. doi:10.1613/jair.1.11222. （原始内容存档于2023-02-10）.

[:292-56] Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan. Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers. Journal of Artificial Intelligence Research. 2021-08-02, 71 [2022-12-09]. ISSN 1076-9757. S2CID 233740003. doi:10.1613/jair.1.12895. （原始内容存档于2023-02-10）.

[57] Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani; Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang, Percy; Dean, Jeff. Emergent Abilities of Large Language Models. 2022-06-15. arXiv:2206.07682  [cs.CL].

[:84-58] 55.0 ^55.1 ^55.2 ^55.3 ^55.4 ^55.5 Bostrom, Nick. Superintelligence: Paths, Dangers, Strategies 1st. USA: Oxford University Press, Inc. 2014. ISBN 978-0-19-967811-2.

[quanta-hide-seek-59] Ornes, Stephen. Playing Hide-and-Seek, Machines Invent New Tools. Quanta Magazine. 2019-11-18 [2022-08-26]. （原始内容存档于2023-02-10）.

[:103-61] 57.0 ^57.1 ^57.2 Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane. AI Safety Gridworlds. 2017-11-28. arXiv:1711.09883  [cs.LG].

[:272-62] Orseau, Laurent; Armstrong, Stuart. Safely Interruptible Agents. 2016-01-01 [2022-07-20]. （原始内容存档于2023-02-10）.

[:242-63] 59.0 ^59.1 ^59.2 Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart. The Off-Switch Game. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17: 220–227. 2017. doi:10.24963/ijcai.2017/32.

[:2522-64] 60.0 ^60.1 ^60.2 ^60.3 Turner, Alexander Matt; Smith, Logan; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad. Optimal Policies Tend to Seek Power. Neural Information Processing Systems. 2021-12-03, 34 [2022-12-12]. arXiv:1912.01683 . （原始内容存档于2023-02-10）.

[:1-65] Manheim, David; Garrabrant, Scott. Categorizing Variants of Goodhart's Law. 2018. arXiv:1803.04585  [cs.AI].

[66] Rochon, Louis-Philippe; Rossi, Sergio. The Encyclopedia of Central Banking. Edward Elgar Publishing. 2015-02-27 [2022-12-12]. ISBN 978-1-78254-744-0. （原始内容存档于2023-02-10）（英语）.

[:224-67] 63.0 ^63.1 ^63.2 ^63.3 ^63.4 Christian, Brian. The alignment problem: Machine learning and human values. W. W. Norton & Company. 2020 [2022-12-12]. ISBN 978-0-393-86833-3. OCLC 1233266753. （原始内容存档于2023-02-10）.

[68] Ng, Andrew Y.; Russell, Stuart J. Algorithms for inverse reinforcement learning. Proceedings of the seventeenth international conference on machine learning. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.: 663–670. 2000. ISBN 1-55860-707-2.

[69] Hadfield-Menell, Dylan; Russell, Stuart J; Abbeel, Pieter; Dragan, Anca. Cooperative Inverse Reinforcement Learning. Advances in Neural Information Processing Systems. NIPS'16 29. 2016 [2022-07-21]. ISBN 978-1-5108-3881-9. （原始内容存档于2023-02-10）.

[:1124-70] 66.0 ^66.1 ^66.2 Everitt, Tom; Lea, Gary; Hutter, Marcus. AGI Safety Literature Review. 2018-05-21. arXiv:1805.01109  [cs.AI].

[71] Armstrong, Stuart; Mindermann, Sören. Occam' s razor is insufficient to infer the preferences of irrational agents. Advances in Neural Information Processing Systems. NeurIPS 2018 31. Montréal: Curran Associates, Inc. 2018 [2022-07-21]. （原始内容存档于2023-02-10）.

[:143-72] Amodei, Dario; Christiano, Paul; Ray, Alex. Learning from Human Preferences. OpenAI. 2017-06-13 [2022-07-21]. （原始内容存档于2021-01-03）.

[73] Li, Yuxi. Deep Reinforcement Learning: An Overview (PDF). Lecture Notes in Networks and Systems Book Series. 2018-11-25 [2022-12-13]. （原始内容存档 (PDF)于2022-10-10）.

[74] Fürnkranz, Johannes; Hüllermeier, Eyke; Rudin, Cynthia; Slowinski, Roman; Sanner, Scott. Marc Herbstritt. Preference Learning. Dagstuhl Reports. 2014, 4 (3): 27 pages [2022-12-13]. doi:10.4230/DAGREP.4.3.1. （原始内容存档于2023-02-10）（英语）.

[75] Hilton, Jacob; Gao, Leo. Measuring Goodhart's Law. OpenAI. 2022-04-13 [2022-09-09]. （原始内容存档于2023-02-10）.

[76] Anderson, Martin. The Perils of Using Quotations to Authenticate NLG Content. Unite.AI. 2022-04-05 [2022-07-21]. （原始内容存档于2023-02-10）.

[:202-77] 73.0 ^73.1 Wiggers, Kyle. Despite recent progress, AI-powered chatbots still have a long way to go. VentureBeat. 2022-02-05 [2022-07-23]. （原始内容存档于2022-07-23）.

[78] Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob. Aligning AI With Shared Human Values. International Conference on Learning Representations. 2021-07-24. arXiv:2008.02275 .

[79] Perez, Ethan; Huang, Saffron; Song, Francis; Cai, Trevor; Ring, Roman; Aslanides, John; Glaese, Amelia; McAleese, Nat; Irving, Geoffrey. Red Teaming Language Models with Language Models. 2022-02-07. arXiv:2202.03286  [cs.CL].

[80] Bhattacharyya, Sreejani. DeepMind's "red teaming" language models with language models: What is it?. Analytics India Magazine. 2022-02-14 [2022-07-23]. （原始内容存档于2023-02-13）.

[81] Wiegel, Vincent. Wendell Wallach and Colin Allen: moral machines: teaching robots right from wrong. Ethics and Information Technology. 2010-12-01, 12 (4): 359–361 [2022-07-23]. ISSN 1572-8439. S2CID 30532107. doi:10.1007/s10676-010-9239-1. （原始内容存档于2023-03-15）.

[82] Wallach, Wendell; Allen, Colin. Moral Machines: Teaching Robots Right from Wrong. New York: Oxford University Press. 2009 [2022-07-23]. ISBN 978-0-19-537404-9. （原始内容存档于2023-03-15）.

[:2-84] 79.0 ^79.1 Gabriel, Iason. Artificial Intelligence, Values, and Alignment. Minds and Machines. 2020-09-01, 30 (3) [2022-12-07]. ISSN 1572-8641. doi:10.1007/s11023-020-09539-2. （原始内容存档于2023-03-15）（英语）.

[85] MacAskill, William. What we owe the future First edition. New York, NY. 2022. ISBN 978-1-5416-1862-6. OCLC 1314633519.

[:172-86] 81.0 ^81.1 Wu, Jeff; Ouyang, Long; Ziegler, Daniel M.; Stiennon, Nisan; Lowe, Ryan; Leike, Jan; Christiano, Paul. Recursively Summarizing Books with Human Feedback. 2021-09-27. arXiv:2109.10862  [cs.CL].

[87] Irving, Geoffrey; Amodei, Dario. AI Safety via Debate. OpenAI. 2018-05-03 [2022-07-23]. （原始内容存档于2023-02-10）.

[Naughton-88] 83.0 ^83.1 Naughton, John. The truth about artificial intelligence? It isn't that honest. The Observer. 2021-10-02 [2022-07-23]. ISSN 0029-7712. （原始内容存档于2023-02-13）.

[:133-89] 84.0 ^84.1 Christiano, Paul; Shlegeris, Buck; Amodei, Dario. Supervising strong learners by amplifying weak experts. 2018-10-19. arXiv:1810.08575  [cs.LG].

[90] Banzhaf, Wolfgang; Goodman, Erik; Sheneman, Leigh; Trujillo, Leonardo; Worzel, Bill (编). Genetic Programming Theory and Practice XVII. Genetic and Evolutionary Computation. Cham: Springer International Publishing. 2020 [2022-07-23]. ISBN 978-3-030-39957-3. S2CID 218531292. doi:10.1007/978-3-030-39958-0. （原始内容存档于2023-03-15）.

[91] Wiblin, Robert. Dr Paul Christiano on how OpenAI is developing real solutions to the ‘AI alignment problem’, and his vision of how humanity will progressively hand over decision-making to AI systems (播客). 80,000 hours. October 2, 2018 [2022-07-23]. （原始内容存档于2022-12-14）.

[92] Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; Beaulieu, Julie; Bentley, Peter J.; Bernard, Samuel; Beslon, Guillaume; Bryson, David M.; Cheney, Nick. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. Artificial Life. 2020, 26 (2): 274–306 [2022-12-14]. ISSN 1064-5462. PMID 32271631. S2CID 4519185. doi:10.1162/artl_a_00319. （原始内容存档于2022-10-10）（英语）.

[93] Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane. Scalable agent alignment via reward modeling: a research direction. arXiv:1811.07871 [cs, stat]. 2018-11-19 [2022-12-14]. （原始内容存档于2022-12-18）.

[94] OpenAI unveils model that can summarize books of any length. VentureBeat. 2021-09-23 [2022-12-14]. （原始内容存档于2022-12-19）（美国英语）.

[95] Moltzau, Alex. Debating the AI Safety Debate. Medium. 2019-08-24 [2022-12-14]. （原始内容存档于2022-10-13）（英语）.

[:182-96] 91.0 ^91.1 ^91.2 Wiggers, Kyle. Falsehoods more likely with large language models. VentureBeat. 2021-09-20 [2022-07-23].

[97] The Guardian. A robot wrote this entire article. Are you scared yet, human?. The Guardian. 2020-09-08 [2022-07-23]. ISSN 0261-3077. （原始内容存档于2021-02-04）.

[98] Heaven, Will Douglas. OpenAI's new language generator GPT-3 is shockingly good—and completely mindless. MIT Technology Review. 2020-07-20 [2022-07-23]. （原始内容存档于2020-07-25）.

[:21-99] 94.0 ^94.1 ^94.2 Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William. Truthful AI: Developing and governing AI that does not lie. 2021-10-13. arXiv:2110.06674  [cs.CY].

[100] Alford, Anthony. EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J. InfoQ. 2021-07-13 [2022-07-23]. （原始内容存档于2023-02-10）.

[101] Shuster, Kurt; Poff, Spencer; Chen, Moya; Kiela, Douwe; Weston, Jason. Retrieval Augmentation Reduces Hallucination in Conversation. Findings of the Association for Computational Linguistics: EMNLP 2021. EMNLP-Findings 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics: 3784–3803. November 2021 [2022-07-23]. doi:10.18653/v1/2021.findings-emnlp.320. （原始内容存档于2023-02-10）.

[102] Nakano, Reiichiro; Hilton, Jacob; Balaji, Suchir; Wu, Jeff; Ouyang, Long; Kim, Christina; Hesse, Christopher; Jain, Shantanu; Kosaraju, Vineet; Saunders, William; Jiang, Xu. WebGPT: Browser-assisted question-answering with human feedback. 2022-06-01. arXiv:2112.09332  [cs.CL].

[103] Kumar, Nitish. OpenAI Researchers Find Ways To More Accurately Answer Open-Ended Questions Using A Text-Based Web Browser. MarkTechPost. 2021-12-23 [2022-07-23]. （原始内容存档于2023-02-10）.

[104] Menick, Jacob; Trebacz, Maja; Mikulik, Vladimir; Aslanides, John; Song, Francis; Chadwick, Martin; Glaese, Mia; Young, Susannah; Campbell-Gillingham, Lucy; Irving, Geoffrey; McAleese, Nat. Teaching language models to support answers with verified quotes. DeepMind. 2022-03-21 [2022-12-16]. arXiv:2203.11147 . （原始内容存档于2023-02-10）.

[105] Askell, Amanda; Bai, Yuntao; Chen, Anna; Drain, Dawn; Ganguli, Deep; Henighan, Tom; Jones, Andy; Joseph, Nicholas; Mann, Ben; DasSarma, Nova; Elhage, Nelson. A General Language Assistant as a Laboratory for Alignment. 2021-12-09. arXiv:2112.00861  [cs.CL].

[106] Kenton, Zachary; Everitt, Tom; Weidinger, Laura; Gabriel, Iason; Mikulik, Vladimir; Irving, Geoffrey. Alignment of Language Agents. DeepMind Safety Research - Medium. 2021-03-30 [2022-07-23]. （原始内容存档于2023-02-10）.

[107] Leike, Jan; Schulman, John; Wu, Jeffrey. Our approach to alignment research. OpenAI. 2022-08-24 [2022-09-09]. （原始内容存档于2023-02-15）.

[108] Ortega, Pedro A.; Maini, Vishal; DeepMind safety team. Building safe artificial intelligence: specification, robustness, and assurance. Medium. 2018-09-27 [2022-08-26]. （原始内容存档于2023-02-10）.

[109] Zhang, Xiaoge; Chan, Felix T.S.; Yan, Chao; Bose, Indranil. Towards risk-aware artificial intelligence and machine learning systems: An overview. Decision Support Systems. 2022, 159: 113800 [2022-12-16]. S2CID 248585546. doi:10.1016/j.dss.2022.113800. （原始内容存档于2023-02-10）（英语）.

[110] McCarthy, John; Minsky, Marvin L.; Rochester, Nathaniel; Shannon, Claude E. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955. AI Magazine. 2006-12-15, 27 (4): 12 [2022-12-21]. ISSN 2371-9621. S2CID 19439915. doi:10.1609/aimag.v27i4.1904. （原始内容存档于2023-01-31）（英语）.

[111] Shermer, Michael. Artificial Intelligence Is Not a Threat—Yet. Scientific American. 2017-03-01 [2022-08-26]. （原始内容存档于2017-12-01）.

[lit_review-112] Everitt, Tom; Lea, Gary; Hutter, Marcus. AGI Safety Literature Review. 1805.01109. 21 May 2018. arXiv:1805.01109 .

[113] Demski, Abram; Garrabrant, Scott. Embedded Agency. 6 October 2020. arXiv:1902.09469  [cs.AI].

[causal_influence-114] 109.0 ^109.1 Everitt, Tom; Ortega, Pedro A.; Barnes, Elizabeth; Legg, Shane. Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings. 6 September 2019. arXiv:1902.09980  [cs.AI].

[DM_specification_gaming-115] Krakovna, Victoria; Legg, Shane. Specification gaming: the flip side of AI ingenuity. Deepmind. [6 January 2021]. （原始内容存档于26 January 2021）.

[:3-116] Cohen, Michael K.; Hutter, Marcus; Osborne, Michael A. Advanced artificial agents intervene in the provision of reward. AI Magazine. 2022-08-29, 43 (3): 282–293 [2023-01-03]. ISSN 0738-4602. S2CID 235489158. doi:10.1002/aaai.12064. （原始内容存档于2023-02-10）（英语）.

[117] Marcus, Gary; Davis, Ernest. How to Build Artificial Intelligence We Can Trust. The New York Times. 6 September 2019 [9 February 2021]. （原始内容存档于22 September 2020）.

[118] Wakefield, Jane. Intelligent Machines: Do we really need to fear AI?. BBC News. 27 September 2015 [9 February 2021]. （原始内容存档于8 November 2020）.

[AGIResponses-119] Sotala, Kaj; Yampolskiy, Roman. Responses to catastrophic AGI risk: a survey. Physica Scripta. 19 December 2014, 90 (1): 018001. Bibcode:2015PhyS...90a8001S. doi:10.1088/0031-8949/90/1/018001 .

[120] Secretary-General’s report on “Our Common Agenda” （页面存档备份，存于互联网档案馆）, 2021. Page 63: "[T]he Compact could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values"

[121] 《新一代人工智能伦理规范》发布 -中华人民共和国科学技术部. www.most.gov.cn. [2023-01-03]. （原始内容存档于2022-12-25）.

[122] PRC Ministry of Science and Technology. Ethical Norms for New Generation Artificial Intelligence Released, 2021. A translation （页面存档备份，存于互联网档案馆） by Center for Security and Emerging Technology

[123] Richardson, Tim. UK publishes National Artificial Intelligence Strategy. The Register. 22 September 2021 [2023-01-03]. （原始内容存档于2023-02-10）.

[124] "The government takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for the UK and the world, seriously." (The National AI Strategy of the UK （页面存档备份，存于互联网档案馆）, 2021)

[125] The National AI Strategy of the UK （页面存档备份，存于互联网档案馆）, 2021 (actions 9 and 10 of the section "Pillar 3 - Governing AI Effectively")

[126] NSCAI Final Report (PDF). Washington, DC: The National Security Commission on Artificial Intelligence. 2021 [2023-01-03]. （原始内容存档 (PDF)于2023-02-15）.

[a]

[b]

[2]

[4]

[5]

[6]

[7]

[3]

[8]

[9]

[10]

[11]

[12]

[13]

[c]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[41]

[42]

[43]

[44]

[45]

[46]

[47]

[48]

[49]

[50]

[51]

[52]

[53]

[54]

[55]

[d]

[57]

[58]

[59]

[60]

[61]

[62]

[63]

[64]

[65]

[66]

[67]

[68]

[69]

[70]

[71]

[72]

[73]

[74]

[75]

[76]

[e]

[79]

[80]

[81]

[82]

[83]

[84]

[85]

[86]

[87]

[88]

[89]

[90]

[91]

[92]

[93]

[94]

[95]

[96]

[97]

[98]

[99]

[100]

[101]