| Downloads | Citations | Reads |
| 246 | 0 | 108 |
In close-range air combat, a UAV must simultaneously execute high-maneuver actions and select missile launch timing, and its decision process is characterized by strict real-time requirements, high uncertainty, and strong adversarial dynamics. Traditional game-theoretic and optimization methods struggle to balance real-time performance with global optimality, while a single reinforcement learning method tends to fall into local optima or converge unstably. A UAV autonomous air combat decision-making method based on MCTS-PPO is therefore proposed. First, a six-degree-of-freedom UAV flight dynamics model, a close-range air combat engagement model, and a missile guidance model are built; a high-dimensional state space and a hierarchical action space are defined; and a reward function is designed to balance flight safety, relative situation, missile evasion, and combat events. Algorithmically, Monte Carlo Tree Search (MCTS) provides global exploration of the action space, self-play generates high-quality experience samples, and Proximal Policy Optimization (PPO) policy-gradient updates jointly optimize the policy and value networks. Finally, simulation results show that, compared with standalone PPO, the method offers significant advantages in cumulative reward, maneuver stability, and missile-launch-timing control, better meeting the needs of UAV autonomous air combat in complex dynamic environments and demonstrating strong feasibility and novelty.
Abstract: In close-range air combat, unmanned aerial vehicles (UAVs) need to simultaneously execute high-maneuver actions and select missile launch timing, and their decision-making process features strict real-time requirements, high uncertainty, and intense antagonism. Traditional game-theoretic and optimization methods struggle to balance real-time performance and global optimality, while a single reinforcement learning method tends to fall into local optima or suffer from unstable convergence. An intelligent decision-making method based on the integration of Monte Carlo Tree Search (MCTS) and Proximal Policy Optimization (PPO) is proposed for autonomous UAV operations in close-range one-on-one air combat scenarios. First, a six-degree-of-freedom UAV dynamics model, a close-range air combat model, and a missile guidance model are established to construct a high-dimensional state space and a hierarchical action space. Then, a reward function is designed to comprehensively consider flight safety, relative combat situation, missile evasion, and combat events. In the proposed approach, MCTS efficiently explores the action space, while self-play is employed to generate experience samples; these samples are then used to optimize the policy and value networks through PPO-based policy-gradient updates. Finally, simulation results demonstrate that the proposed method achieves higher cumulative rewards, improved maneuver stability, and enhanced situational control accuracy under identical training conditions. Compared with the standalone PPO algorithm, the MCTS-PPO approach exhibits superior adaptability for air combat maneuver decisions and missile-launch-timing decisions in complex and dynamic environments, indicating its feasibility and practical potential.
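The abstract names two algorithmic ingredients: MCTS's visit-count-driven exploration (used when self-play turns tree statistics into training samples) and PPO's clipped policy-gradient objective. As a minimal illustrative sketch — not the authors' implementation; all function names and constants here are assumptions — the core formulas can be written as:

```python
import math

def ucb_score(parent_visits, child_visits, child_value_sum, c_explore=1.4):
    """UCB1 score used during MCTS selection: mean action value plus an
    exploration bonus that shrinks as a child node is visited more often."""
    if child_visits == 0:
        return float("inf")  # unvisited actions are expanded first
    mean_value = child_value_sum / child_visits
    bonus = c_explore * math.sqrt(math.log(parent_visits) / child_visits)
    return mean_value + bonus

def mcts_visit_policy(visit_counts, temperature=1.0):
    """Convert root visit counts into a sampling distribution, as done when
    self-play converts tree statistics into policy training targets."""
    weights = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO for one sample:
    min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

For instance, root visit counts [10, 30, 60] yield the self-play policy [0.1, 0.3, 0.6], and a probability ratio of 1.5 with positive advantage is clipped at 1.2 — which is how PPO limits destructive policy updates on the MCTS-generated samples.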
[1] LI B, YANG Z, CHEN D, et al. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning[J]. Defence Technology, 2021, 17(2):457-466.
[2] FU X W, SUN Y M. A combined intrusion strategy based on Apollonius circle for multiple mobile robots in attack-defense scenario[J]. IEEE Robotics and Automation Letters, 2025, 10(1):676-683.
[3] FU X W, PAN J, WANG H. A formation maintenance and reconstruction method of UAV swarm based on distributed control[J]. Aerospace Science and Technology, 2020, 104:105981.
[4] AUSTIN F, CARBONE G, FALCO M, et al. Game theory for automated maneuvering during air-to-air combat[J]. Journal of Guidance, Control, and Dynamics, 1990, 13(6):1143-1149.
[5] MCGREW J S. Real-time maneuvering decisions for autonomous air combat[D]. Cambridge: Massachusetts Institute of Technology, 2008.
[6] PARK H, LEE B Y, TAHK M J, et al. Differential game based air combat maneuver generation using scoring function matrix[J]. International Journal of Aeronautical and Space Sciences, 2016, 17(2):204-213.
[7] PARK H, LEE B Y, TAHK M J, et al. Differential game based air combat maneuver generation using scoring function matrix[J]. International Journal of Aeronautical and Space Sciences, 2016, 17(2):204-213.
[8] XU G D, LV C, WANG G H, et al. Research on autonomous maneuver decision-making of UCAV air combat based on double-matrix game[J]. Ship Electronic Engineering, 2017, 37(11):24-28. (in Chinese)
[9] HOWARD R A, MATHESON J E. Influence diagrams[J]. Decision Analysis, 2005, 2(3):127-143.
[10] ZHOU S Y, WU W H, KONG F E, et al. Improved multi-level influence diagram maneuver decision-making method based on stochastic decision criterion[J]. Journal of Beijing Institute of Technology, 2013, 33(3):296-301. (in Chinese)
[11] MCGREW J S. Real-time maneuvering decisions for autonomous air combat[D]. Cambridge: Massachusetts Institute of Technology, 2008.
[12] ZHANG T, YU L, ZHOU Z L, et al. Air combat maneuver decision-making based on hybrid algorithm[J]. Systems Engineering and Electronics, 2013, 35(7):1445-1450. (in Chinese)
[13] ZHU X Y, AI J L. Intelligent decision-making study for many-to-many UAV combat[J]. Fudan Journal (Natural Science Edition), 2021, 60(4):410-419. (in Chinese)
[14] LI B, BAI S X, MENG B B, et al. UAV autonomous air combat decision-making algorithm based on SAC algorithm[J]. Command Control & Simulation, 2022, 44(5):25-30. (in Chinese)
[15] FENG Q. Low-high frequency network for spatial-temporal traffic flow forecasting[J]. Engineering Applications of Artificial Intelligence, 2025, 15(11):13-17.
[16] FU X W, ZHU J, WEI Z, et al. A UAV pursuit-evasion strategy based on DDPG and imitation learning[J].International Journal of Aerospace Engineering, 2022,17(3):125-130.
[17] FU X W, ZHANG Y, ZHU J, et al. Bioinspired cooperative control method of a pursuer group vs a faster evader in a limited area[J]. Applied Intelligence, 2023,53:6736-6752.
[18] FU X W, PAN J, WANG H X, et al. A formation maintenance and reconstruction method of UAV swarm based on distributed control[J]. Aerospace Science and Technology, 2020, 104:105981.
[19] WAN K F, GAO X G, LI B. Using approximate dynamic programming for multi-ESM scheduling to track ground moving targets[J]. Journal of Systems Engineering and Electronics, 2018, 29(1):74-85.
[20] WANG Q L, GAO X G, WAN K F, et al. A novel restricted Boltzmann machine training algorithm with fast Gibbs sampling policy[J]. Mathematical Problems in Engineering, 2020, 1:1-19.
[21] SUN C, ZHAO H, WANG Y, et al. UCAV autonomous maneuver decision-making method based on reinforcement learning[J]. Fire Control & Command Control, 2019, 44(4):142-149. (in Chinese)
[22] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[EB/OL]. 2017-07-20[2025-08-31]. https://doi.org/10.48550/arXiv.1707.06347.
[23] LI B, LIANG S Y, TIAN L Y. An adaptive task scheduling method for networked UAV combat cloud system based on virtual machine and task migration[J]. Mathematical Problems in Engineering, 2020, 1:1-12.
[24] WAN K F, LI B, GAO X G. A learning-based flexible autonomous motion control method for UAV in dynamic unknown environments[J]. Journal of Systems Engineering and Electronics, 2021, 32(6):1490-1508.
[25] HUI Y L, NAN Y, CHEN S D, et al. High-precision and fast algorithm for dynamic attack zone of air-to-air missiles[J]. Journal of Ballistics, 2015, 27(2):39-45. (in Chinese)
[26] YANG X H, JIANG Y X. Attack zone and launch conditions of long-range air-to-air missiles[J]. Tactical Missile Control Technology, 2004, 46(3):123-130. (in Chinese)
[27] HE X, JING X N, FENG C. Air combat maneuver decision-making based on Monte Carlo tree search method[J]. Fire Control & Command Control, 2018, 43(3):34-39. (in Chinese)
[28] SHI W, FENG Y H, CHENG G Q, et al. Research on multiaircraft cooperative air combat method based on deep reinforcement learning[J]. Acta Automatica Sinica, 2021, 47(7):1610-1623.
[29] LIU Z, FU X, GAO X. Co-optimization of communication and sensing for multiple unmanned aerial vehicles in cooperative target tracking[J]. Applied Sciences, 2018, 8(6):899.
[30] CloseAirCombat: an environment based on JSBSim aimed at one-to-one close air combat[EB/OL]. 2025[2025-08-31]. https://github.com/liuqh16/CloseAirCombat/tree/master.
Basic information:
DOI:10.19942/j.issn.2096-5915.2025.05.44
CLC number: V279; E91
Citation:
[1] XU M Y, GONG J H, FU X W. UAV autonomous air combat decision-making method based on MCTS-PPO[J]. Unmanned Systems Technology, 2025, 8(5):45-57. DOI:10.19942/j.issn.2096-5915.2025.05.44. (in Chinese)