Many small and medium-sized datasets spread across the USAF need to be parsed into standardized output, and Retrieval-Augmented Generation (RAG) and fine-tuning techniques offer powerful ways to quickly summarize lengthy documents and complex topics. However, LLMs are prone to hallucinations, context overrun, and fabricated sources. What are the most robust ways to incorporate new USAF datasets into an LLM without truncating the breadth of available data, while still allowing for complex answers and minimizing hallucinations? Furthermore, how can we verify LLM compliance with the analytic tradecraft standards of Intelligence Community Directive (ICD) 203 (specifically objectivity, independence of political consideration, and traceability to underlying sources) when these models are deployed for intelligence purposes? Ultimately, how can the Air Force engineer reliable LLM pipelines that satisfy rigorous military and intelligence compliance standards?
- Keith, Andrew J., "Alignment: National Security Objectives in Cold War Computer Simulations," SAASS thesis, 2025, 117 pgs.
- Keith addresses this question by empirically testing how reliably modern large language models (LLMs), specifically ChatGPT and DeepSeek, assess and categorize historical political objectives in warfare. He finds that the LLMs aligned with a human research team's assessment in only 60 percent and 53 percent of cases, respectively, demonstrating that contemporary AI tools still struggle with the complex, nuanced reasoning required for national security applications. Keith concludes that because LLMs suffer from structural bias, hallucination, and significant variation in how they represent strategic objectives (for example, failing to accurately prioritize territorial defense), decision-makers cannot blindly trust them for high-level strategic analysis and must actively ensure these systems are deliberately aligned with human policy intent.
- Nicholson, Capt. Jonathan, "LLM Use Case," SOS AUAR, 2025.
- Nicholson explains that standard LLMs are prone to hallucinations and have "constancy" issues, meaning their knowledge is static and cannot easily adapt to dynamic combat environments without costly, time-consuming retraining. To mitigate this, he advocates adopting RAG architectures, which allow military systems to securely update their knowledge bases with uploaded documents in real time. He points to the Air Force's NIPRGPT effort as a successful real-world baseline for RAG, but stresses that wide-scale deployment of any LLM must be paused until rigorous research addresses their severe cost, data, and security shortcomings.
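
As a rough illustration of the RAG pattern described above, the following Python sketch shows how a knowledge base could be updated with new documents at run time (no retraining) and how each retrieved passage could carry a source identifier to support traceability. It is a minimal sketch under stated assumptions, not a description of NIPRGPT or of any system discussed in the cited works: the `KnowledgeBase` class, the `build_prompt` helper, the `doc-A`-style identifiers, and the sample passages are all hypothetical, and a simple bag-of-words cosine similarity stands in for the learned embeddings a real pipeline would likely use. The call to an actual LLM is deliberately left out.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Passage:
    source_id: str  # hypothetical document identifier, kept for traceability
    text: str


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class KnowledgeBase:
    """Hypothetical in-memory store; documents can be added at any time."""

    def __init__(self) -> None:
        self._entries: list[tuple[Passage, Counter]] = []

    def add(self, source_id: str, text: str) -> None:
        # Adding a document only updates the index; the model is untouched.
        self._entries.append((Passage(source_id, text), Counter(tokenize(text))))

    def retrieve(self, query: str, k: int = 2) -> list[Passage]:
        """Return up to k passages with nonzero similarity, best first."""
        q = Counter(tokenize(query))
        scored = [(cosine(q, vec), passage) for passage, vec in self._entries]
        scored = [item for item in scored if item[0] > 0.0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [passage for _, passage in scored[:k]]


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble a prompt that restricts the model to cited passages."""
    context = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    return (
        "Answer using only the passages below and cite the bracketed source IDs. "
        "If the passages do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Usage with invented placeholder passages (not real data).
kb = KnowledgeBase()
kb.add("doc-A", "Placeholder passage about engine inspection intervals.")
kb.add("doc-B", "Placeholder passage about refueling safety procedures.")
question = "What does the guidance say about engine inspection?"
print(build_prompt(question, kb.retrieve(question, k=1)))
```

Keeping the source identifier attached to every retrieved passage is one simple way to let a reviewer trace a generated claim back to the underlying document. It does not by itself guarantee that the model's answer is faithful to that passage, so a separate check on the output would still be needed.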