Artificial Intelligence

Picture of the silhouette of a human overlaid with computer parts

Photo Details / Download Hi-Res

Automating OIE with Large Language Models

Published Dec. 21, 2023
By Cpt Alexander Sferrella, Cpt Joseph Conger, and Maj Kara Masick

Wild Blue Yonder, MAXWELL AIR FORCE BASE, Ala. --

Whether you call it non-kinetics, information warfare, information operations, or operations in the information environment (OIE), planning and executing these activities is extraordinarily difficult. A standard operation takes months, and the bureaucracy of OIE makes <24hr cradle-to-grave executions virtually impossible, especially during conflict. Generative artificial intelligence can speed up this process. Artificial intelligence can generate content, such as text and images, within seconds. With a human quality controller in the loop using the proper prompts, AI can develop large language models and text-to-image models capable of producing quality products efficiently, thereby dramatically reducing the number of man-hours. This capability can have myriad applications within the military.

Before we discuss how to automate elements of OIE, we will define important terms.

Artificial Intelligence (AI): The ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. AI simulates human intelligence for problem-solving.^{^[1]}
Generative AI (GAI): Enables users to quickly generate new content based on a variety of inputs. Inputs and outputs to these models can include text, images, sounds, animation, 3D models, or other types of data.^{^[2]}
Large Language Model (LLM): A form of GAI that uses deep learning algorithms that can recognize, summarize, translate, predict, and generate content using enormous datasets. The most well-known LLM is OpenAI’s ChatGPT.^{^[3]}
Text-to-image Model: A form of GAI that creates images from text-based prompts.
(Artificial) Hallucination: A phenomenon where an LLM generates realistic-seeming outputs that do not correspond to any real-world input^{^[4]} (e.g., references to imaginary lawsuits).^{^[5]}

The JPP is a valuable model for planners and staff officers but is rarely used outside of exercises. LLMs’ efficiency could encourage its adoption. LLM-generated mission analyses and COAs could significantly decrease the time planners spend brainstorming and generating products. As LLMs are statistical algorithms, they could better inform, weigh, and compare the likelihood of desired effects and risks of different COAs. With automated systems feeding concise operational updates and estimates to Commanders and planners in real-time, the staff can operate at the speed of the information environment. In an exercise setting, an LLM could serve as the white cell on behalf of exercise directors, managing thousands of inputs and producing realistic outcomes based on red and blue force activities.

Automation allows for greater development of a vast range of OIE COAs. LLMs capable of writing code, translating, debugging, identifying security vulnerabilities, analyzing software for compliance, etc., could increase the scale, speed, complexity, and variety of cyberspace operations available to planners. Influence operations have the potential to increase in speed and scale via narrative reiteration, elaboration, manipulation, seeding, wedging, and persuasion.^{^[6]} Theoretically, a well-trained LLM could appear as a trusted insider/majority group member speaking to other ingroup target audience members.

LLMs could help commanders and planners operate at the speed of the information environment. By continually ingesting classified and open-source data, analyzing intelligence at scale, and wargaming, LLMs could automate many aspects of intelligence support to OIE. AI could also greatly facilitate Staff Estimate and JIPOE development. LLMs can quickly draft Commander guidance products and executive actions, which the Commander could refine to ensure quality and intent are not sacrificed. Furthermore, there may be scenarios wherein a human Commander authorizes automated approvals for particular tactical tasks with clear bounds and redlines (e.g., general officers need not approve every offensive cyberspace operation). In modern warfare, there are specific contexts and timeframes wherein Commander approvals are required at speeds or scales that humans cannot operate. Delegation of approval to trained, skilled humans-in-the-loop is still preferable over complete automation. Of course, a human approver should remain responsible, especially for OIE with the potential for violent effects, because AI will not value human life the way another human should and cannot meaningfully be held legally culpable. Statecraft and relationships require human experience and the wisdom of a Commander.^{^[7]}

LLMs do have some significant limitations and vulnerabilities. While these impressive statistical algorithms can appear to understand the meaning of the words it is writing, they do not, as they are math equations of word associations. LLMs do not understand the world the way humans do, which currently causes them to frequently hallucinate. There are significant security considerations to this – one can imagine scenarios where an LLM provides false intelligence data that inspires an erroneous decision from a human commander, or an AI concludes that nuclear first-strike yields a more viable long-term outcome than steady-state operations. Therefore, they cannot be unsupervised artificial strategists, planners, or commanders.

Creating a military-useful LLMs is a multi-step process. As ‘generative’ AI, they must be trained on pre-existing data. This training data has to be collected and properly conditioned before being fed to the LLM. While training data for commercial and academic uses may be easily accessible, consistently formatted, and pre-curated for respectability and quality, quality training data for military purposes may be more challenging to procure. Security classifications will impede access, while available data will likely not be commonly formatted or consistently quality-assured. Once this training data is entered, evaluators (human or other AI) will grade the LLM’s output in an iterative refinement loop. To build military-useful LLMs, the problems of datasets, contracting, scope, and refinement loops would need to be addressed.

Data accessibility, normalization, and veracity present the greatest challenges to creating usable data sets for the military. Varying classifications of desired training materials are the most immediate obstacle to accessibility, though moving all materials to the highest constituent classification is a proven workaround. The distributed nature of military document storage adds additional complexity as materials are often stored on isolated or restricted networks, permission-regulated file servers, SharePoint sites, emails, personal folders, and printed paper. Accessing these materials would require extensive coordination. Military materials are of varying lengths, can be written in various fonts and encodings, and are often stored in file formats with flexible and complex metadata structures. Useful information can also be stored in formats less readable to machines, such as an image of written text. Normalizing these products would require transcribing the critical part of documents into text. As judgment would be required on omitting products with overly bespoke formats, humans are best capable of normalization. Still, due to the volume of training data required, other automated processes would be more practical. Relying on automated normalization, however, requires a certain amount of trust, as mangled training data would improperly train an LLM to produce mangled outputs. Therefore, data veracity is important to ensure quality LLM output. However, data accuracy can be compromised by simple errors, such as duplicate files or typos, or malicious actions, such as an adversarial poisoning of a training data set with large volumes of incorrect information. Furthermore, veracity can change over time as data becomes outdated or later disproven.

Training an LLM requires highly sought-after skillsets, large volumes of training data, and expensive computing infrastructure, and this will likely require going through contracting, which creates some additional hurdles. With cutting-edge AI developers in high demand, their market-rate salaries are far higher than government pay scales can compete with.^{^[8]} Even if the DoD could develop in-house training to bring its military personnel to industry standards, it would have difficulty keeping them in uniform. Therefore, contractors are the most viable option, creating additional hurdles as they need to obtain the proper clearances. Ensuring that contracts do not contain loopholes allowing data retention will be essential as AI companies seek to retain copies of training data from prior clients because of their value.^{^[9]} Not only will reviewing contracts require heavy technical and legal scrutiny, but monitoring contractors will also require proactive forensic work to ensure that such retention does not occur, accidentally or otherwise. Large computing resources must also be purchased or rented at correspondingly large costs.^{^[10]} Given that there are no options for renting computer equipment for work on the higher classification levels, hardware for those LLMs would have to be purchased.

Clarifying the scope of an LLM is crucial for managing expectations, especially as not every process can meaningfully benefit from the use of AI. Given that outputs can look completely different based on the purpose, even if they are called the same, it will be essential to tailor LLMs to particular missions. For example, a “planning document” for a cyber mission will differ in content, structure, terminology, and phrasing from a “planning document” for a bombing mission. An LLM trained on various “planning documents” might select the incorrect format or even meld the different formats into something unsuitable for any unit. However, while narrowing the scope allows the AI to be better tailored to a particular purpose, it may also limit the amount of available training material. Given the esoteric nature of military planning, a large amount of training data is needed for an LLM to think right rather than merely sound right. Such AI automation also may not be helpful to the top and bottom of the levels of war. A warfighter would not want to ask a chat app for ideas on deceiving an adversary during a firefight. At the same time, the slow pace of deliberating policy and strategy limits the utility of an LLM’s primary advantage of the speed of generation. Instead, it would be best suited for staff work, particularly at the operational level.

Training and evaluating LLMs and other kinds of AI is best accomplished through iterative refinement loops by which AI-produced output is repeatedly assessed. If grammar and writing style are your only parameters, these refinement loops can be completed rapidly. However, while LLMs can easily produce short text, such as poems or short emails, well enough, they are more likely to suffer hallucinations when tasked with writing longer documents. As military use relies on the substance and accuracy of a document more critical than its grammar or style, these refinement loops will likely be more laborious. The refinement loops will require a certain amount of expertise, research, or experimentation to prevent the creation of realistic but erroneous outputs. Due to the effort involved in such testing, refinement loops will likely be the most limiting factor in training a military LLM.

Using LLMs in the military is not without risks and difficulties. Issues such as flawed or insufficient training data, AI alignment, improperly calibrated trust, human complacency, model theft, and model abuse can…. Deficiencies in volume, variety, velocity, veracity, and value increase risks differently. Low-volume datasets may result in overfitting as the LLM would have two few points of reference when faced with the chaos of the real world, while the lack of variety could create the equivalent of groupthink. If datasets are not built fast enough, costs can increase, and project timelines can be slowed. Meanwhile, data of low veracity and value is prone to teaching the LLM the wrong lessons. It is also difficult to prove that an LLM is aligned with human values. The Alignment Problem from a Deep Learning Perspective offers a useful 3-part framework for discussing alignment risks:^{^[11]}

“Situationally Aware Reward Hacking” is when an AI behaves differently in training (vs. operations) to appear in greater alignment with evaluation criteria than it is. BMW’s emissions scandal is a human analog to this alignment risk.^{^[12]}
“Misaligned Internally Represented Goals” is when an AI misunderstands its purpose from human intent. For example, “pick tomatoes” could be misinterpreted as “collect bright red objects;” this could still work well enough in practice but could also result in a “tomato harvest” of bottle caps and ladybugs.
“Power-Seeking During Deployment” is a risk derived from the fact that many goals imply power-seeking, which may not be caught during training. Nick Bostrom’s famous “paperclip maximizer” thought experiment envisions an AI tasked to create “a lot of paperclips.”^{^[13]} It does this by gaining control over ever more materials and manufacturing capacity while resisting human efforts to minimize paperclip production, eventually converting all material in the universe, including humans, into paperclips.

Given these three potential risks, having a human in the loop is the best failsafe, but it is not foolproof. Conversely, humans must also have a carefully calibrated trust and distrust in the LLM. If the human doesn’t trust the LLM enough, the LLM will likely be inefficient. Yet, even more significant risks arise if the human trusts the LLM too much or trusts it to do things it isn’t equipped to handle. LLMs could become a crutch for those who lack proper training or skill, which would create dangerous scenarios when LLM hallucinations generate dangerous, false information (e.g., recommending OIE against targets not on a target list) or unduly strengthen an idea appealing to a stressed human (e.g., cutting power to a city district containing both an insurgent HQ and a civilian hospital). LLMs could present a security risk as they may contain vast amounts of information. Although reverse-engineering AI models is still an immature art, the field has immense interest.^{^[14]} Finally, there is the potential for human users to abuse the LLM, the most potent form being the automation of action loops or the removal of humans from certain loops. Rational actors may reasonably be expected to understand this and not automate processes we all would regret (C2 and fire control a la Skynet). Still, rogue actors or those desperate for asymmetric advantages may assess the risk differently.

We conclude that LLMs merit immediate research and adoption for the majority of routine military tasks, both classified and unclassified. With a well-trained, online, continually updated LLM – and decent user inputs – performance reports, daily operational update PowerPoints, intelligence reports, and myriad other tasks could be completed in minutes or hours rather than days or weeks, saving millions of man-hours across the DoD weekly. As to whether an AI could function independently and catastrophically, we say yes, within our lifetime, but not within our military.

Captain Alexander Sferrella
Capt. Sferrella is a (Defensive) Cyberspace Operations officer (17SB) with hands-on operational and MAJCOM staffs experience. He would have 'intel stink' if the smell existed. He currently leads software development and infrastructure buildout at the AFINSOC, and runs the AI-focused Military City Lesswrong group for fun.

Captain Joseph Conger
Capt. Conger is an Information Operations officer (14F) who is working on a PhD in industrial-organizational psychology at Virginia Tech. Operations in the Information Environment (OIE), intel, and some cyber are among his proficiency claims, though of all his experiences, he is most fond of skullduggery and arguing with sinecurists during staff meetings.

Major Kara Masick
Maj Masick serves as an Information Operations officer (14F) who has worked various OIE, including at NAF Staff and CCMD HQ levels. She has a passion for Military Information Support Operations (MISO) and is currently studying persuasion as a PhD student in the Measurement Research Methodology Evaluation Statistics (MRES) Lab at GMU and utilizing Large Language Models (LLMs) to analyze text.

This article is based on work conducted in Squadron Officer School's AU Advanced Research (AUAR) elective.

NOTES

[1.] B.J. Copeland, “Artificial Intelligence (AI),” Britannica, June 21, 2023. https://www.britannica.com/technology/artificial-intelligence.

[2.] “What Is Generative AI?,” NVIDIA, accessed June 23, 2023, https://www.nvidia.com/en-us/glossary/data-science/generative-ai/.

[3.] “What Are Large Language Models?,” NVIDIA, accessed June 23, 2023, https://www.nvidia.com/en-us/glossary/data-science/large-language-models/.

[4.] Hussam Alkaissi, and Samy I McFarlane, “Artificial Hallucinations in ChatGPT: Implications in Scientific Writing,” Cureus 15, no. 2: e35179, https://doi.org/10.7759/cureus.35179.

[5.] Rachel Shin, “Humiliated Lawyers Fined $5,000 for Submitting ChatGPT Hallucinations in Court: ‘I Heard about This New Site, Which I Falsely Assumed Was, like, a Super Search Engine,’” Fortune, June 23, 2023, https://fortune.com/2023/06/23/lawyers-fined-filing-chatgpt-hallucinations-in-court/; Sara Merken, “New York Lawyers sanctioned for using fake ChatGPT cases in legal brief,” Reuters, June 26, 2023, https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22.

[6.] Ben Buchanan, Andrew Lohn, Micah Musser, and Katerina Sedova, “Truth, Lies, and Automation: How Language Models Could Change Disinformation,” Center for Security and Emerging Technology, May 2021, https://doi.org/10.51593/2021CA003.

[7.] Carl von Clausewitz, On War, trans. Colonel J.J. Graham (London, Kegan Paul & Co., 1908), https://www.gutenberg.org/files/1946/1946-h/1946-h.htm.

[8.] “How Much Do AI Engineers Make? 2023 Salary Guide,” Coursera, June 16, 2023, https://www.coursera.org/articles/ai-engineer-salary.

[9.] Blake Brittain, “Google Hit with Class-Action Lawsuit over AI Data Scraping,” Reuters, July 11, 2023, https://www.reuters.com/legal/litigation/google-hit-with-class-action-lawsuit-over-ai-data-scraping-2023-07-11/.

[10.] Will Knight, “OpenAI’s CEO Says the Age of Giant AI Models Is Already Over,” Wired, April 23, 2023, https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/.

[11.] Richard Ngo, Lawrence Chan, and Sören Mindermann, “The Alignment Problem from a Deep Learning Perspective,” arXiv, February 22, 2023, https://doi.org/10.48550/arXiv.2209.00626.

[12.] Jack Ewing, “Volkswagen and BMW Are Fined Nearly $1 Billion for Colluding on Emissions Technology,” The New York Times, July 8, 2021, https://www.nytimes.com/2021/07/08/business/volkswagen-bmw-daimler-emissions-scandal.html.

[13.] Nick Bostrom, “Ethical Issues In Advanced Artificial Intelligence,” accessed June 23, 2023, https://nickbostrom.com/ethics/ai.

[14.] Seong Joon Oh, Max Augustin, Bernt Schiele, and Mario Fritz, “Towards Reverse-Engineering Black-Box Neural Networks,” arXiv, February 14, 2018, https://doi.org/10.48550/arXiv.1711.01768.; David Aronchick and Yannis Zarkada, “Owned By Statistics: Using MLOps to Make Machine Learning More Secure,” KubeCon 2020, Accessed June 23, 2023, https://docs.google.com/presentation/d/1Etn50JQdOL9Lsa065sngUmIZ0yk63NKwlXDNiHR0Ta4.; “Have I Been Trained?,” accessed June 23, 2023. https://haveibeentrained.com/.

Wild Blue Yonder Home