The Evil Genie: Organizational Success in the Age of Complexity and Artificial Intelligence

Published June 23, 2022
By Joseph Hoecherl

On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?”. . . . I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
- Charles Babbage, inventor of the first mechanical computer

The classic evil genie story goes something like this: A man walks along a beach and stumbles across a magic lamp. After he rubs the lamp, a genie emerges and offers to grant him three wishes. The man declares, “I wish to be rich!” The genie immediately replies, “Ok, Rich, what would you like for your other wishes?” The punchlines to these jokes revolve around various less-specific-than-necessary wishes, with the wisher invariably receiving something less desirable than hoped.

Like the proverbial genie, the ongoing artificial intelligence (AI) revolution is also here to grant our wishes. Unlike the evil genie of the story, AI will not purposefully grant our wishes in a way that we dislike. But it will attempt to grant what we ask, regardless of whether that is what we want. Like Mr. Babbage’s questioner, if we ask the wrong questions, we will get the wrong answers.

The stakes of how the world uses this set of technological advances are incredibly high, but not necessarily for the reasons that science fiction has led us to worry about most. Especially when algorithms become very complex, we risk confusing algorithm outputs for some form of transcendent truth or inherently valuable insight.
Recently, a Google engineer provided such an example when interacting with the LaMDA chatbot, mistaking a high-quality statistical estimate of the most probable next words in a sentence for actual AI sentience.1 Algorithms specialize in identifying patterns between inputs and outputs, but people must decide the inputs and determine how to use the outputs.

The AI Alignment Problem for Military Leaders

Artificial intelligence is an umbrella term for a host of different techniques, and even its definition remains controversial. This article primarily discusses the portion of AI that deals with measurement and optimization of policies, although it is not about the techniques themselves. An underappreciated and far more important topic is which specific problems decision makers will solve with those techniques. Thus, decision makers must understand their problems first.

In private industry, profitability determines which organizations survive and what problems their decision makers focus on solving. But profitability does not constrain Department of Defense organizations, since these organizations instead spend resources to secure a public good. In this construct, the question of how to measure effectiveness and efficiency is an enduring challenge, one that trickles down in some form to every operational unit and support function.

The process of transforming an organization’s mission into metrics becomes ever more important as military organizations automate processes. Law and policy constrain decisions, but decision makers still often have some flexibility on how to measure success, whether the decision maker is a human or an algorithm. Humans experience a strong bias toward the default, but they do have the ability to reject the default and elevate problems when the situation requires. To facilitate automation, we must define and quantify in advance exactly what metric algorithms should optimize.
This question of how to define and quantify what algorithms should optimize is the AI alignment problem.2 One research focus in this area is assessing why algorithms come up with their solutions and whether these mechanisms are appropriate. For an AI to solve a problem, humans must be able to “communicate” with the AI to tell it what is good about an outcome. The classic thought experiment of how this can go wrong is the hypothetical paperclip machine, tasked to make as many paperclips as possible.3 In this fanciful example, the machine takes over the world as an intermediate step (called an instrumental goal) because that is the best strategy to make the most paperclips.

A significant portion of the alignment research focuses on the intentionality of algorithms, but the immediate risks are adverse outcomes based on the AI solving the problem we gave it, not necessarily choosing an unacceptable intermediate goal of some kind. This will continue to be true as long as AI applications remain narrow, and it is far from certain that AI will progress beyond narrow applications at all. One example of narrow AI creating dramatic societal change is how social media algorithms show people incendiary posts because those posts drive the most traffic.4 This situation does not require an AI to recognize that a more partisan, angry world will drive lots of web traffic. The immediate reward is enough; the metric itself is the failure.

This leads us to our central discussion: metrics. People often employ qualitative descriptions, with many underlying assumptions built in, to describe quality of outcomes, while most machine approaches either require explicit quantitative descriptions or an internal quantitative measure to evaluate the relationship between qualitative descriptions in a given dataset. This generates an opportunity for translation error.
As we gain an expanded ability to solve problems, the thinkers at the intersection of the technical AI world and the operational world are critical to getting this translation right. This translation ensures the immense power of AI (efficiency) points in the right direction (effectiveness). The idea that “what gets measured gets managed” has always been true, but now there will be fewer human eyes to check whether what gets measured is adequate. While there can be cost savings as we automate processes that previously required people to execute, the knowledge, ability, and efficacy of the remaining workforce become more and more important. This process, guided by the metrics we select, can enable military organizations to become more effective and efficient and to maintain an edge over near-peer adversaries at lower cost (good genie) or help them rapidly and confidently waste money and lose wars (evil genie).

US Military Use of Metrics

The study of measuring military operational success with metrics is known as operations assessment.5 Operations assessment dates to World War II and the rise of operations research, with notable successes and failures along the way. Joint Publication 5-0, Joint Planning, devotes an entire chapter to the subject, beginning as follows: “Operation assessments are an integral part of planning and execution of any operation, fulfilling the requirement to identify and analyze changes in the [operational environment] and to determine the progress of the operation . . . .
[Assessments] provide perspective, insight, and the opportunity to correct, adapt, and refine planning and execution to make military operations more effective.”6

Joint Publication 5-0 clarifies that operations assessment should be complementary to the commander’s “personal sense of the progress of the operation or campaign, shaped by conversations with senior and subordinate commanders, key leader engagements (KLEs), and battlefield circulation.”7 This is important; leaders must remain aware both that existing metrics may not be adequate and that no metrics may exist to fully measure important system characteristics. As a quip often attributed to Einstein puts it: “not all that counts can be counted, and not all that can be counted counts.” Like Einstein, we should not take this as an invitation to stop quantifying what matters but simply to recognize the limits of this process and refuse to accept existing metrics without further scrutiny if we know there are weaknesses.

Support functions similarly rely on quantitative measurements of success but lack the same universal top-down structure for assessments when executing operations. Metrics still drive behavior, but the processes to reexamine whether the metrics at operational and tactical levels are achieving strategic goals are more ad hoc. Despite this, metrics can serve as an embodiment of the principle of mission command, enabling decentralized execution with a clear success criterion. Whether the low-level support metrics in place align with the military’s strategic goals is a separate question. A common joke among those using military communications networks illustrates this: “The safest network is one that no one can access.”

Consequences of Poor Metrics

The danger of poor metrics in bureaucracy is twofold: first, poor metrics drive poor decisions.
For example, in the US Air Force, a primary driver of low readiness measures is the P-Rating, defined as the ratio of available personnel to authorized personnel with some set of characteristics. Notably, this metric does not measure how ready a force is (how many trained personnel it has); it measures the efficiency of the personnel system in filling authorizations (spaces). We can improve the value of the metric by increasing personnel or decreasing funded requirements. But increasing both personnel and funded authorizations by proportionately equal amounts results in no change. For this reason, this specific metric is fundamentally incapable of informing resourcing decisions regarding the correct number of authorizations. In fact, the fastest way to increase readiness metrics is to reduce the number of authorizations at a unit, since it takes some time to move individuals to other units via permanent change of station. In case this seems too obvious, in a group tasked with improving readiness at specific units, the author has witnessed highly educated people provide a recommendation to prevent any future authorization growth at these units to help these metrics recover. This recommendation is an example of reducing the very thing decision makers care about (actual unit readiness) to improve a metric intended to measure this value (P-Ratings).

Second, poor metrics can create a vicious cycle with respect to talent management. It is difficult to change metrics that define success; organizations use these metrics for operational purposes and to identify who is leading most effectively. As organizations select leaders who can generate the best results as measured by the existing metrics, they might undervalue leaders who are mentally flexible enough to identify shortfalls in what the metric captures and who make choices to maximize performance as measured in some other manner.
This hypothetical leader has the exact skillset needed to lead organizations at the strategic level. The organization will also systematically undervalue this same leader at the tactical and operational level if it does not make a concerted effort to identify and nurture this attribute. Conversely, the individual who is most likely to demonstrate the highest performance under a system with flawed metrics will also be the person least likely to challenge poor metrics as a senior leader. In combination with a human and bureaucratic tendency toward status quo bias, this dynamic can stymie solutions, even when they seem obvious to personnel at lower levels.

Better Applications of Artificial Intelligence

Transparency and Data Availability

One of the toughest things to do in an organization is to admit to problems, colloquially that “the baby is ugly.” Subject to appropriate security and classification restrictions, making data and metrics widely available to the people in the organization accomplishes two things that make this difficulty worthwhile. First, sharing of this kind creates buy-in instead of cynicism. It lets people at all levels in the organization see what they are working toward and sacrificing for. Combined with clear communication of the mission and role within the larger military structure, this creates enthusiasm.

Second, this common sight picture spurs genuine innovation to fix problems. This is the same reason that open-source solutions often outperform costly and highly developed solutions in the private sector: more people and more ideas create more potential ways to solve problems.8 Additionally, those directly affected by the system’s poor functioning often have insights into failures that may not be visible from senior levels and are highly motivated to fix the problems that affect them every day.
Enabling those individuals to build solutions improves the organization’s performance and creates a more affirming and responsive environment for the individuals.

Thinking about Metrics

Know what you are trying to measure. Leaders need to be specific about what they are measuring and how it connects to the organization’s objectives. Is it an effectiveness metric or an efficiency metric that assumes a task is effective? Organizations can absolutely build analytics-based systems that will do the wrong thing incredibly efficiently. No set of metrics can overcome unclear or vague strategic objectives.

Be your own evil genie. Individual metrics are often incomplete, but a combination of metrics should capture an organization’s performance. If you can imagine a real-world good outcome with poor metric values or a real-world bad outcome with good metric values, then your assessment metrics have failed the evil genie test and require further development.

Measure what matters. Leaders need to find metrics that capture what they actually care about, not what is loosely correlated with what they care about. Finding a metric that aligns with real-world good outcomes 80 percent of the time will result in disaster 20 percent of the time. An example of this is the long-standing use of waist circumference for US Air Force physical fitness tests. While the science that shows a correlation between waist circumference and health outcomes is sound, waist circumference is not perfectly correlated with Airman fitness, since a similarly proportioned (and capable) tall or muscular Airman might have a larger waist size than a shorter Airman. This metric results in a systemic bias against tall or muscular Airmen.

Do not become a slave to bad or limited metrics. Leaders need to be ready to throw out or ignore bad or misleading metrics. Data should not be cavalierly thrown out, but if the metric does not fit the “why,” then it should not be used to steer actions.
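As a rough illustration of the evil genie test described above, the following Python sketch applies it to the P-Rating ratio discussed earlier in the article. Everything in it is an assumption made for illustration: the Scenario class, the function names, the unit sizes, and the use of available headcount as a stand-in for real readiness are hypothetical and do not come from any Air Force system or dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    """A hypothetical unit state; all numbers are illustrative, not real data."""
    name: str
    available: int   # personnel on hand
    authorized: int  # funded authorizations (spaces)


def p_rating_ratio(s: Scenario) -> float:
    """Ratio of available to authorized personnel, as the article describes it."""
    return s.available / s.authorized


def evil_genie_failures(baseline: Scenario, candidates: List[Scenario]) -> List[str]:
    """Return the names of candidates where the metric and the outcome disagree.

    Assumption: available headcount stands in for the real-world outcome
    (actual readiness). A candidate fails the test when the ratio improves
    but headcount does not, or headcount improves but the ratio does not.
    """
    failures = []
    for c in candidates:
        metric_better = p_rating_ratio(c) - p_rating_ratio(baseline) > 1e-9
        outcome_better = c.available > baseline.available
        if metric_better != outcome_better:
            failures.append(c.name)
    return failures


# Hypothetical unit: 80 people against 100 authorizations.
BASELINE = Scenario("baseline", available=80, authorized=100)
CANDIDATES = [
    Scenario("cut authorizations", available=80, authorized=90),
    Scenario("grow both proportionally", available=88, authorized=110),
    Scenario("add personnel", available=90, authorized=100),
]

if __name__ == "__main__":
    for name in evil_genie_failures(BASELINE, CANDIDATES):
        print("metric and outcome disagree:", name)
```

In this sketch the test flags two of the three options: cutting authorizations raises the ratio without adding anyone, and proportional growth adds people without moving the ratio. Only adding personnel against fixed authorizations moves the metric and the assumed outcome in the same direction.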
Most importantly, if an action will improve a metric’s value but not actual mission effectiveness, this action should not be executed, nor should an organization reward leaders who execute such actions. When incorporating automation, human decision makers must monitor and exercise judgment on whether the automated process’s performance meets strategic intent in context with the rest of the systems, structures, and business processes that interact with the automated process. Military organizations must build a culture that uses metrics but is not used by metrics.

Monitor for metrics that no longer reflect value. Goodhart’s Law states that once a participant in a system begins trying to affect a metric, the metric’s original ability to measure the underlying issue will be broken.9 This is not something that leaders can always anticipate; they must continually monitor whether their metrics continue to reflect the underlying reality they wish to measure and then iteratively modify metrics or policies as appropriate.

Do not confuse quantity of metrics that support an objective with the importance of the objective. Some objectives will lend themselves to easy quantification, while others will remain difficult to translate into appropriate metrics. It is a standard human cognitive bias to attempt to equate the importance of such metrics, but the strategic success of a mission or organization does not depend equally on all possible metrics, or even on all possible objectives. Leaders must establish priorities and think carefully about tradeoffs.

If all else fails, acknowledge and document values that existing metrics do not cover. While the ideal is to quantify the things organizations care about to enable better tracking, this is sometimes impossible or prohibitively expensive. Leaders need to be ready to identify areas where they do not have good metrics and make sure that the organization still monitors these areas.
In many cases, subjective assessments can be a partial answer, with the caveat that these may not always remain consistent over time as the individuals and the perspective informing the assessments change. When organizations cannot quantify all the relevant value functions simply, this will sometimes mean relying on recommender systems to provide diverse sets of good options (called a slate) for decision makers instead of a single best recommended solution.10

Analytic Thinking and Talent Management

As organizations develop the skilled workforce to build these algorithms, those individuals who can push the cutting edge on neural networks or other analytic techniques likely will not have deep domain expertise in most areas. Important details can get lost in translation when domain experts work together with technical experts. Movements toward explainable AI can help facilitate this interaction, but the biggest issue described here is the foundational question of what metric AI is supposed to be optimizing in the first place. For this reason, it is critically important to build a data-literate workforce within other specialties to help bridge the gap and think analytically about their domain.

The US Air Force manpower and personnel community provides an excellent practical example of how to approach this. This community has enrolled its third annual cohort of personnelists (38Fs) developing data analytics expertise through the Force Support Cohort Analytics Program. The goal of this program is not to make these individuals into top-tier data scientists. Instead, this program is training data-fluent officers who can work in the personnel domain, spreading quantitative knowledge and best practices in the community.
Their stated mission is “to train workforce analysts based on data analytics, critical thinking, and programming—creating 38Fs ready to tackle the A1’s challenges today and in the future.” Additionally, these individuals are the ideal collaborators to team with dedicated data scientists or operations research analysts to solve complex or tricky problems, as they have practical experience in their domain using quantitative techniques. With the technical knowledge to understand how metrics and systems work, the second step of creating a technologically competent leadership for military organizations is to encourage a substantially less risk-averse culture when it comes to challenging metrics. As described above, the Air Force rewards those who feed the current metrics. Sometimes, this means they are getting the best results. But military organizations that wish to develop an agile, high-performing culture must find and develop those individuals who challenge the status quo and think deeply about organizational performance. A key role for leaders in the digital age is to think deeply about their organization’s performance and metrics, understanding where their quantitative blind spots are and working to minimize these gaps. This is true for every mission area, from artillery to space operations to force support. While developing more senior leaders with this skillset should be a priority, organizations need this skillset at all levels of officers and enlisted personnel to create an environment conducive to innovation and rapid, positive change. As we attempt to nourish this risk tolerance, we know in advance that any truly innovative process experiences some frequency of failure. In response, we must do a better job of rewarding success more than penalizing failure, or the incentives will push more leaders to remain risk averse while we promote those most adept at innovation theater. Finally, while AI applications are quite limited now, the field is rapidly evolving. 
Investing in the next generation of leaders with strategic vision, critical thinking skills, and the quantitative skills to understand how these algorithms can go wrong will be a critical move if broader automation or a more generalized artificial intelligence is developed. Additionally, these same leaders will be the ones able to decipher Russian and Chinese intent, strengths, and weaknesses in any future AI-driven conflicts. These new leaders might be the most critical resource of all for future US military effectiveness.

Conclusion

Arthur C. Clarke noted that “any sufficiently advanced technology is indistinguishable from magic.”11 This remains true until the user becomes familiar with the limitations and uses of the technology in question, rendering it no longer “sufficiently advanced.” Because the technology of AI will shape military success in the near future, military leaders do not have the luxury of remaining unfamiliar with AI’s limits and uses. In addition to historical applications to combat operations, strategic competition is to some extent a question of resources, capabilities, and influence. In this environment, every dollar wasted on ineffective practices or misaligned bureaucratic structures is an opportunity for our adversaries to gain ground. Worse, each of these dollars represents an instance in which junior personnel are shaped to remain mired in archaic processes and structures. It must be a priority for leaders at all levels to understand how to think quantitatively about success, seize the advantage of new approaches when they can, and reject poor decisions and policies when the metrics driving them are misaligned.

Major Joseph Hoecherl, USAF, is a PhD student at the Air Force Institute of Technology working on research in computational stochastic optimization and artificial intelligence.

1 Ian Bogost, “Google’s ‘Sentient’ Chatbot Is Our Self-Deceiving Future,” Atlantic, June 14, 2022; and Mehdi M. Afsar, Trafford Crump, and Behrouz Far, “Reinforcement Learning Based Recommender Systems: A Survey,” vers. 1, arXiv, January 15, 2021, https://arxiv.org/.
2 Jan Leike et al., “Scalable Agent Alignment via Reward Modeling: A Research Direction,” arXiv, November 19, 2018, https://arxiv.org/.
3 Nick Bostrom, “Ethical Issues in Advanced Artificial Intelligence,” Science Fiction and Philosophy: From Time Travel to Superintelligence (2003): 277–84.
4 Luke Munn, “Angry by Design: Toxic Communication and Technical Architectures,” Humanities and Social Sciences Communications 7, no. 1 (2020): 1–11.
5 Emily Mushen and Jonathan Schroden, Are We Winning? A Brief History of Military Operations Assessment (Arlington, VA: Center for Naval Analyses, September 2014), https://www.cna.org/.
6 Chairman of the Joint Chiefs of Staff (CJCS), Joint Planning, Joint Publication 5-0 (Washington, DC: CJCS, August 2011), VI-1.
7 CJCS, Joint Planning, VI-1.
8 Steven Weber, The Success of Open Source (Cambridge, MA: Harvard University Press, 2004).
9 David Manheim and Scott Garrabrant, “Categorizing Variants of Goodhart’s Law,” vers. 4, arXiv, February 24, 2019, https://arxiv.org/.
10 David Goldberg et al., “Using Collaborative Filtering to Weave an Information Tapestry,” Communications of the ACM 35, no. 12 (1992); Dietmar Jannach et al., Recommender Systems: An Introduction (Cambridge, UK: Cambridge University Press, 2010); and Afsar, Crump, and Far, “Recommender Systems.”
11 Arthur C. Clarke, Profiles of the Future, rev. ed. (New York: Harper and Row, 1973).