Bridging the AI reliability gap: why you need to look beyond tech solutions

Artificial Intelligence (AI) is a broad term, currently used to describe many different tools and computing approaches to solving problems.

The general public has become very familiar with “AI” features in their everyday lives, through capabilities built into commonly used phone apps. Organisations now routinely promote their use of AI, and AI-enabled tools, to their customers.

At a fundamental level, the operating model of an organisation determines how it delivers services and outcomes, and creates value. This is achieved through a combination of strategy, leadership, governance, structure, people and capability, policy and process, performance reporting, and data and technology.

AI is not an operating model. It is a set of tools which can potentially be incorporated into elements of the operating model.

The components of an operating model are inter-dependent. This means that the introduction of AI tools may require significant adjustments to other elements of the operating model, such as governance, people and capability, and process.

It is these interdependencies that require careful consideration and decision making by the executive leadership team. The impacts of AI technologies are already being raised regularly in community and stakeholder engagement processes, and are often discussed in the media.

These concerns generally relate to operating model impacts, such as:

  • Impacts on workforce
  • Automated decision making
  • Data privacy
  • Errors

Types of AI Tools

Machine learning techniques are the foundation for various types of AI tools. AI tools vary greatly in terms of their capabilities, complexity, and maturity. The following table outlines common types of AI tools.

Type of AI Tool               | Common Use Cases
Large Language Models         | Classification of content; summarising content; generating new content; search
Natural Language Processing   | Analysis of language; translation; user interface technologies
Computer Vision               | Analysis of images / photos / videos; identification of objects / features; identification of specific people

Large Language Models

How do they work?

Large Language Models are programs that function by making predictions. The system uses a transformer (a type of neural network) to make predictions. This basic technique can be applied to any stream of data, allowing transformers to be used to process text, audio, pictures, and other data formats.

The transformer neural network is trained using a set of training data. Knowledge within the transformer’s neural network is encoded into “weights”. These weights are determined through the training process.

Once the transformer neural network is trained to an acceptable level of accuracy, it can be used to make predictions.

If the transformer neural network is required to deal with a new situation that is not covered by the original training data, then further iterations of training will be required, with additional training data.

Large Language Model Training

Training a language model is conceptually simple:

  • show it some content
  • have it predict the next item
  • measure how wrong the prediction was
  • adjust the transformer weights so that the next time it makes that same prediction, it will be less wrong
  • repeat this process thousands to millions of times (a simplified sketch of this loop follows the list).
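
At an illustrative level only, the sketch below mirrors this loop at toy scale: a tiny next-character model whose weights are adjusted by gradient descent so that each pass through the data makes its predictions slightly less wrong. This is not how a production LLM is built (real systems use transformer neural networks, vastly larger data sets, and far more compute), and all names and figures in the code are hypothetical.

```python
# Toy illustration of the loop above: show content, predict the next item,
# measure how wrong the prediction was, adjust the weights, repeat.
# This trains a tiny next-character model (not a real transformer).
import numpy as np

text = "the cat sat on the mat. the cat sat on the hat."
vocab = sorted(set(text))                       # the "items" the model can predict
stoi = {ch: i for i, ch in enumerate(vocab)}    # character -> index
V = len(vocab)

# Model "weights": weights[a][b] is the score for character b following character a.
weights = np.zeros((V, V))
pairs = [(stoi[a], stoi[b]) for a, b in zip(text, text[1:])]
learning_rate = 0.5

for step in range(200):                         # repeat many times
    total_loss = 0.0
    for current, target in pairs:
        scores = weights[current]
        probs = np.exp(scores) / np.exp(scores).sum()   # predicted probabilities
        total_loss += -np.log(probs[target])            # how wrong was the prediction?
        grad = probs.copy()
        grad[target] -= 1.0                             # gradient of the prediction error
        weights[current] -= learning_rate * grad        # adjust weights to be less wrong
    if step % 50 == 0:
        print(f"step {step}: average error {total_loss / len(pairs):.3f}")
```

Even at this toy scale, the essential ingredients are visible: training data, a measure of how wrong each prediction was, and repeated weight adjustments.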

The quality of the training can be measured, and is influenced by the amount of training data, the quality of the data, and the number of training iterations.

It is possible for an organisation to create its own LLM, and train it with its own data. This can be achieved using an “open-weight” model (off-the-shelf) or a custom-created model. Open-weight models come with a base level of training that can be extended further.

Commercially available large language models are typically accessed via cloud-based services, and have already been trained, with no mechanism for further customised training. Self-hosted LLMs are also available, and can use “open-weight” models or proprietary models.

Top-tier cloud-based commercial LLMs have transformer models that encode more than 1 trillion parameters, and require computing resources that greatly exceed the capabilities available to most organisations.

Why is the quantity of parameters important?

The number of parameters used by a transformer provides an indication of how much information can be encoded into the transformer neural network (via training). It also determines the amount of computing resources (memory, processing, storage) required to run the transformer neural network.

Models with a large number of parameters can potentially make accurate predictions for a wider range of situations, because they are capable of encoding more knowledge within their transformer neural network.

A general LLM might require a transformer with hundreds of billions of parameters, a very large training data set, and an enormous amount of computing resources and computer memory.

A highly specialised, custom-built LLM may be able to use a transformer with a small number of parameters (tens of millions). This scale of transformer could potentially run in a situation with limited computing resources, such as running entirely on a smartphone.
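
To make the scale concrete, the rough calculation below multiplies the parameter count by an assumed two bytes of memory per parameter (a common 16-bit numeric format). It ignores the additional memory needed for the context window and other overheads, so the figures are indicative only; the example model sizes are assumptions, not product specifications.

```python
# Rough, indicative memory estimate: parameter count x bytes per parameter.
# Assumes ~2 bytes per parameter (16-bit precision) and ignores other overheads.
BYTES_PER_PARAMETER = 2

def approx_memory_gb(parameter_count: int) -> float:
    return parameter_count * BYTES_PER_PARAMETER / 1e9   # gigabytes

examples = [
    ("Highly specialised small model (~20 million parameters)", 20_000_000),
    ("Large open-weight model (~70 billion parameters)", 70_000_000_000),
    ("Top-tier commercial model (~1 trillion parameters)", 1_000_000_000_000),
]

for label, params in examples:
    print(f"{label}: roughly {approx_memory_gb(params):,.2f} GB just to hold the weights")
```

This is why a model with tens of millions of parameters can plausibly run on a phone, while a trillion-parameter model requires data-centre scale infrastructure.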

In addition to the number of parameters, each LLM is implemented with a context window of a specific size. The context window specifies how much information can be incorporated into any user request.

The context window is used to provide specific additional information relating to a question, and enables short-term information to be “remembered”. This allows you to have a conversation style interaction with an LLM tool, where things such as previous inputs and outputs are “remembered” over time.

Items which might need to fit within the context window include (see the sketch after this list):

  • your current prompt (question)
  • any specific reference documents that need to be considered
  • any data that needs to be analysed
  • previous prompts (questions) and responses
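
The sketch below illustrates the budgeting problem described above: everything has to fit within a fixed window, so once the limit is reached the oldest conversation turns are dropped and effectively “forgotten”. The word count used here is a crude stand-in for real tokenisation, and the window size is an arbitrary example.

```python
# Illustrative only: fitting a prompt, reference material and prior turns into a fixed window.
# Real LLMs count "tokens" rather than words; the 4,000 figure is an arbitrary example.
CONTEXT_WINDOW = 4_000

def rough_size(text: str) -> int:
    return len(text.split())              # crude stand-in for a tokeniser

def build_request(prompt: str, reference_docs: list[str], history: list[str]) -> list[str]:
    request = [prompt] + reference_docs    # the current question and any reference documents
    used = sum(rough_size(part) for part in request)
    # Add previous prompts and responses, newest first, until the window is full.
    for turn in reversed(history):
        if used + rough_size(turn) > CONTEXT_WINDOW:
            break                          # older turns no longer fit and are "forgotten"
        request.append(turn)
        used += rough_size(turn)
    return request

# Example: a long conversation where only the most recent turns still fit.
history = [f"turn {i}: " + "words " * 300 for i in range(20)]
request = build_request("Summarise the attached policy.", ["policy text " * 200], history)
print(f"{len(request)} items fit within the context window")
```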

Quality and hallucinations

When a user interacts with the LLM, the user input is sent to the transformer model, which predicts the likelihood (probability) of a series of potential outputs. The LLM then responds to the user input by selecting from the most probable outputs.

Most LLM tools are implemented to apply a level of “creativity” to the response, by allowing a degree of randomness in the selection of the predicted answer. For example, it might make a random selection from the outputs with the top 3 probabilities, rather than simply always choosing the answer with the highest probability. This can lead to a level of inconsistency in the responses to user inputs.
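
The sketch below shows this selection step in simplified form: rather than always returning the single most probable prediction, the tool samples from the few most probable ones, so the same input can produce different outputs on different runs. The candidate words and probabilities are invented for illustration.

```python
# Simplified illustration of "creative" output selection.
# Instead of always returning the most probable item, sample from the top 3.
import random

# Invented example: predicted probabilities for the next word.
predictions = {"policy": 0.40, "strategy": 0.35, "framework": 0.15, "budget": 0.10}

def pick_next_word(predictions: dict[str, float], top_k: int = 3) -> str:
    top = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    words, probs = zip(*top)
    return random.choices(words, weights=probs, k=1)[0]

# Running the same input several times can give different answers.
print([pick_next_word(predictions) for _ in range(5)])
```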

Hallucinations are also a feature of LLMs that cannot be eliminated. The quality of the predictions made by the transformer model is directly related to the training data. If the user request involves a scenario that is not covered by the training data, then the prediction could be completely wrong.

Increasing the amount of encoded information within the transformer neural network can help reduce the potential for hallucinations, but will not by itself prevent all hallucinations. Any increase in encoded information will also require an increase in the amount of training data and training iterations to achieve an acceptable level of accuracy.

AI Powered Workflows

Workflow automation tools are a very mature set of technologies that have been widely deployed across Government and businesses. These tools do not require the use of AI-related technologies.

Traditional workflow automation tools use digital logic and deterministic steps. Digital logic and deterministic steps always produce the same output for the same input.

It is now possible, however, to incorporate the use of AI tools into process steps, and potentially integrate the use of AI tools directly into automated workflows. AI tools are suited to steps that incorporate unstructured data, interpretation, and content generation.
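
A minimal sketch of this hybrid pattern is shown below: deterministic steps handle validation and routing, while a single AI step handles the unstructured text. The call_llm function is a placeholder for whichever LLM service an organisation might use, not a real API, and the field names are invented.

```python
# Illustrative hybrid workflow: deterministic steps plus one AI step.
# call_llm() is a placeholder for an LLM service, not a real API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a call to an LLM service")

def validate_record(record: dict) -> bool:
    # Deterministic step: the same input always gives the same result.
    return bool(record.get("customer_id")) and bool(record.get("free_text_complaint"))

def route_record(record: dict) -> str:
    # Deterministic routing based on a structured field.
    return "priority_queue" if record.get("priority") == "high" else "standard_queue"

def process_complaint(record: dict) -> dict:
    if not validate_record(record):
        return {"status": "rejected", "reason": "missing required fields"}
    # AI step: suited to unstructured text, but its output still needs checking downstream.
    summary = call_llm(
        f"Summarise this complaint in two sentences: {record['free_text_complaint']}"
    )
    return {"status": "accepted", "queue": route_record(record), "summary": summary}
```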

Deterministic Digital Logic

Strengths:
  • Same input = same output
  • Consistent decisions and outputs
  • Mathematical calculations
  • Data validation
  • Work routing
  • Templated content
  • Highly efficient and cost effective

Weaknesses:
  • Making changes can be complex
  • Handling unstructured data
  • Handling ambiguity

AI (LLM)

Strengths:
  • Understanding of language
  • Interpreting complex information
  • Content generation
  • Handling ambiguity
  • Handling unstructured data
  • Flexibility

Weaknesses:
  • Consistent decisions and outputs
  • Error rates and hallucinations
  • Mathematical calculations
  • Data validation
  • Computing resource requirements
  • Cost

Governance and Controls

Put simply, the outputs of LLM and machine learning based tools need to undergo subsequent checks, and ongoing monitoring.

In complex workflows, it will be critical to ensure governance and risk controls are built into key steps, and are able to rapidly detect quality problems and errors.

It is important to detect any errors at the point where they occur, before the outputs are sent to later workflow stages. Early detection minimises wasted resources.

There are multiple quality review techniques that can be considered, including:

  • human review of all outputs
  • quality review of all outputs by a separate, non-AI digital tool (such as a rules-based engine), as sketched after this list
  • periodic human review of a sample of outputs
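
As an illustration of the second technique, the sketch below applies a separate, rules-based (non-AI) check to an LLM-generated output before it is passed to the next workflow stage. The specific rules, section names, and reference-ID convention are invented examples; real checks would reflect an organisation's own quality criteria.

```python
# Illustrative rules-based quality check on an LLM output (no AI involved).
# The rules, section names, and "REF-" convention below are invented examples.
REQUIRED_SECTIONS = ("Summary", "Recommendation")
MAX_WORDS = 300

def check_output(llm_output: str, known_reference_ids: set[str]) -> list[str]:
    problems = []
    if len(llm_output.split()) > MAX_WORDS:
        problems.append("output exceeds the agreed length limit")
    for section in REQUIRED_SECTIONS:
        if section not in llm_output:
            problems.append(f"missing required section: {section}")
    # Flag any reference IDs that do not exist in the source data
    # (a simple way to catch some hallucinated citations).
    for word in llm_output.split():
        if word.startswith("REF-") and word not in known_reference_ids:
            problems.append(f"unrecognised reference: {word}")
    return problems

issues = check_output("Summary: see REF-123", known_reference_ids={"REF-001", "REF-002"})
print("route to human review" if issues else "pass to next step", issues)
```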

There could potentially be a substantial amount of re-work of outputs to correct errors. Every step of the process involving LLM handling may need checks, monitoring, and resources to carry out that re-work.

These measures are likely to require a more complex set of workflow steps, with additional human review, human decision making, quality measurement, and resources to handle the re-work of faulty outputs.

To implement enhanced governance and controls, the process may require:

  • additional quality control steps
  • data cleaning and data validation before passing information to AI systems
  • data cleaning and validation of outputs from AI systems
  • human oversight of outputs from AI systems
  • measurement and reporting of additional quality-related information, such as error rates and re-work volumes (a simple metrics sketch follows this list)
  • additional workforce resources to handle oversight and re-work.
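
As a small illustration of the measurement point above, the sketch below derives an error rate and re-work volume from per-item workflow records. The field names and figures are hypothetical.

```python
# Illustrative quality reporting: error rate and re-work volume from workflow records.
# Field names and figures are hypothetical.
records = [
    {"id": 1, "error_found": False, "reworked": False},
    {"id": 2, "error_found": True,  "reworked": True},
    {"id": 3, "error_found": True,  "reworked": True},
    {"id": 4, "error_found": False, "reworked": False},
]

total = len(records)
errors = sum(r["error_found"] for r in records)
reworked = sum(r["reworked"] for r in records)

print(f"error rate: {errors / total:.0%}")            # e.g. 50%
print(f"re-work volume: {reworked} of {total} items")
```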

Agentic AI

Agentic AI is a type of tool that is designed to act autonomously. It is meant to act as an “agent”, essentially replacing the role of a person.

Agentic AI can determine by itself how to carry out a request, and act with minimal human intervention. These tools can also interact directly with other technology systems, and take actions to complete steps in a workflow.

The autonomous nature of agentic AI clearly complicates governance, and reduces the opportunity to incorporate human oversight.
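
The sketch below gives a deliberately simplified picture of this pattern: the tool decides which action to take next and calls other systems directly, with a person only reviewing the result after the fact. All function and tool names are hypothetical placeholders, and real agentic frameworks are considerably more sophisticated.

```python
# Highly simplified illustration of an agentic loop: the tool chooses and performs
# actions against other systems with minimal human intervention.
# All tool and function names are hypothetical placeholders.

def decide_next_action(goal: str, completed: list[str]) -> str | None:
    # Stand-in for the LLM-driven planning step that decides what to do next.
    plan = ["look_up_customer_record", "draft_response", "update_case_system"]
    remaining = [step for step in plan if step not in completed]
    return remaining[0] if remaining else None

TOOLS = {
    "look_up_customer_record": lambda: "record retrieved",
    "draft_response": lambda: "draft written",
    "update_case_system": lambda: "case updated",
}

def run_agent(goal: str) -> list[str]:
    completed, log = [], []
    while (action := decide_next_action(goal, completed)) is not None:
        log.append(f"{action}: {TOOLS[action]()}")   # acts directly on other systems
        completed.append(action)
    return log   # human review of these outputs happens after the fact

print(run_agent("Resolve customer complaint #42"))
```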

Standard LLM

Workforce role:
  • Workforce uses the LLM as a support / reference only
  • Workforce oversees LLM outputs (quality)
  • Workforce manually requests the LLM to carry out tasks

Tool role:
  • Information classification
  • Information summarisation
  • Generating new content
  • Research / search

Interactions with workforce:
  • Requires prompts to act
  • Requires step-by-step instructions
  • Prompts may be complex, and need standardisation

Governance / risk controls:
  • Human review of all outputs
  • Quality review of all outputs by a separate digital tool (rules-based engine, rather than AI)
  • Periodic human review of a sample of outputs

Agentic AI

Workforce role:
  • Minimal
  • Workforce oversees Agentic AI outputs (quality)
  • Workforce manually requests Agentic AI to carry out tasks

Tool role:
  • Can independently determine how to carry out a task
  • Capable of complex multi-step tasks
  • Can carry out LLM tasks
  • Can perform data entry / interact with other computer systems

Interactions with workforce:
  • Can act without workforce intervention

Governance / risk controls:
  • Human review of all outputs
  • Quality review of all outputs by a separate digital tool (rules-based engine, rather than AI)
  • Periodic human review of a sample of outputs

Workflow Automation

Workforce role:
  • Provides inputs
  • Makes decisions

Tool role:
  • Workflow standardisation
  • Follows defined workflow / process steps
  • Document / data routing
  • Implementation of governance / controls

Interactions with workforce:
  • User inputs
  • Decision making

Governance / risk controls:
  • Human review of all outputs
  • Quality review of all outputs by a separate digital tool (rules-based engine, rather than AI)
  • Periodic human review of a sample of outputs

How the Operating Model Can Bridge the AI Reliability Gap

While AI-based tools can be very useful for certain tasks, their weaknesses can potentially lead to a substantial error rate and poor-quality outputs. The reliability gap is most clearly evident in inconsistent responses, and in errors caused by hallucinations.

To bridge the reliability gap, it is necessary to consider the entire operating model. This includes governance requirements, workforce skills and capacity, as well as actual process design and tool selection.

It is important to ground any transformation effort in the higher-level business requirements. Business requirements are not “detailed process design”, technical specifications, or technology decisions. They are the requirements the operating model must satisfy, and hence need the input of a wider range of stakeholders within the organisation.

The benefit of this style of approach is simple. The high-level business requirements that are produced will be stable over a longer time frame, and agnostic to the particular technologies that may be part of any given stage of an implementation. They will serve as a blueprint that can guide further detailed design and decision making.

When you have a well-documented set of high-level business requirements, the detailed design and implementation can be grounded in the broader context, and prioritise the elements that really matter. Focusing on the business requirements will also enable benefits realisation to be more clearly defined and measured.

Key steps in bridging the gap:

  • Discover and document the personas
  • Discover and document the high-level user journeys
  • Document the current state service blueprints
  • Discover high level business requirements


