Measuring LLM Compliance with Analytic Tradecraft Standards

  • Published
  • By 55WG/A2

When large language models (LLMs) are used for intelligence purposes, how can their compliance with Intelligence Community Directive (ICD) 203's analytic tradecraft standards of objectivity, independence of political consideration, and traceability to underlying sources be verified? Can the trustworthiness and reliability of LLM-generated intelligence summaries be ensured?

An LLM using Retrieval-Augmented Generation (RAG) is exceptionally good at quickly providing overviews of lengthy documents or complex topics. Because of this capability, LLMs may become an essential tool for all-source analysts. The ability to quickly summarize all IC knowledge on a specified topic is too powerful to ignore. Unfortunately, LLMs are prone to hallucinations, and it is difficult to understand how an LLM generates its results. Identical inputs can produce different outputs, and sources may be wholly fabricated by the LLM. A human analyst who behaved this way would not be trusted. Can LLMs be used in a way that complies with analytic standards?
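As a starting point, two of the failure modes described above (fabricated sources and varying outputs for identical inputs) can be checked mechanically. The sketch below is illustrative only and is not an ICD 203 compliance test. It assumes, hypothetically, that summaries cite sources with bracketed identifiers such as [DOC-1] and that the identifiers of the documents actually retrieved by the RAG step are available. All function names are invented for this example. It confirms only that a cited identifier exists in the retrieved set, not that the source supports the claim.

```python
import re

# Assumption: citations appear as bracketed IDs, e.g. "[DOC-1]".
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def cited_ids(summary: str) -> set:
    """Return every source ID cited in the summary."""
    return set(CITATION.findall(summary))


def traceability_report(summary: str, retrieved_ids) -> dict:
    """Flag citations that do not match any retrieved document,
    and sentences that carry no citation at all."""
    retrieved = set(retrieved_ids)
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    return {
        "fabricated": sorted(cited_ids(summary) - retrieved),
        "uncited_sentences": [s for s in sentences if not CITATION.search(s)],
    }


def distinct_outputs(outputs) -> int:
    """Count distinct outputs from repeated runs of the same prompt,
    ignoring case and whitespace differences."""
    return len({" ".join(o.split()).lower() for o in outputs})


if __name__ == "__main__":
    summary = (
        "Unit moved north [DOC-1]. Supplies are low [DOC-9]. Morale is high."
    )
    print(traceability_report(summary, ["DOC-1", "DOC-2"]))
    print(distinct_outputs(["Same text.", "same  text.", "Other text."]))
```

A check like this addresses only the mechanical part of traceability. Whether a cited source actually supports the statement attached to it, and whether the summary is objective, still requires human review or a separate evaluation method.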


  • Keith, Andrew J., "Alignment: National Security Objectives in Cold War Computer Simulations," SAASS thesis, 2025, 117 pgs.
    • Keith empirically tests how reliably modern large language models (LLMs), specifically ChatGPT and DeepSeek, assess and categorize historical political objectives in warfare. He finds that the LLMs aligned with a human research team's assessment in only 60 percent and 53 percent of cases, respectively, demonstrating that contemporary AI tools still struggle with the complex, nuanced reasoning required for national security applications. Keith concludes that LLMs suffer from structural bias, hallucination, and significant variations in their representations of strategic objectives, such as failing to accurately prioritize territorial defense. Decision-makers therefore cannot blindly trust them for high-level strategic analysis and must actively ensure these systems are deliberately aligned with human policy intent.