Bridging the AI reliability gap: why you need to look beyond tech solutions

Artificial Intelligence (AI) is a broad term, currently used to describe many different tools and computing approaches to solving problems.

The general public has become very familiar with “AI” features in their everyday lives, through capabilities built into commonly used phone apps. Organisations now routinely promote their use of AI, and AI-enabled tools, to their customers.

At a fundamental level, the operating model of an organisation determines how it delivers services and outcomes, and creates value. This is achieved through a combination of strategy, leadership, governance, structure, people and capability, policy and process, performance reporting, and data and technology.

AI is not an operating model. It is a set of tools which can potentially be incorporated into elements of the operating model.

The components of an operating model are inter-dependent. This means that the introduction of AI tools may require significant adjustments to other elements of the operating model, such as governance, people and capability, and process.

It is these interdependencies that require careful consideration and decision making by the executive leadership team. The impacts of AI technologies are already being raised regularly in community and stakeholder engagement processes, and are often discussed in the media.

These concerns generally relate to operating model impacts, such as:

  • Impacts on workforce
  • Automated decision making
  • Data privacy
  • Errors

Types of AI Tools

Machine learning techniques are the foundation for various types of AI tools. AI tools vary greatly in terms of their capabilities, complexity, and maturity. The following table outlines common types of AI tools.

Type of AI Tool               | Common Use Cases
Large Language Models         | Classification of content; summarising content; generating new content; search
Natural Language Processing   | Analysis of language; translation; user interface technologies
Computer Vision               | Analysis of images / photos / videos; identification of objects / features; identification of specific people

Large Language Models

How do they work?

Large Language Models are programs that function by making predictions. The system uses a transformer (a type of neural network) to make predictions. This basic technique can be applied to any stream of data, allowing transformers to be used to process text, audio, pictures, and other data formats.

The transformer neural network is trained using a set of training data. Knowledge within the transformer’s neural network is encoded into “weights”. These weights are determined through the training process.

Once the transformer neural network is trained to an acceptable level of accuracy, it can be used to make predictions.

If the transformer neural network is required to deal with a new situation that is not covered by the original training data, then further iterations of training will be required, with additional training data.

Large Language Model Training

Training a language model is conceptually simple:

  • show it some content
  • have it predict the next item
  • measure how wrong the prediction was
  • adjust the transformer weights so that the next time it makes that same prediction, it will be less wrong
  • repeat this process thousands to millions of times (a simplified sketch of this loop follows the list).
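
At an illustrative level only, the sketch below mirrors this loop at toy scale: a tiny next-character model whose weights are adjusted by gradient descent so that each pass through the data makes its predictions slightly less wrong. This is not how a production LLM is built (real systems use transformer neural networks, vastly larger data sets, and far more compute), and all names and figures in the code are hypothetical.

```python
# Toy illustration of the loop above: show content, predict the next item,
# measure how wrong the prediction was, adjust the weights, repeat.
# This trains a tiny next-character model (not a real transformer).
import numpy as np

text = "the cat sat on the mat. the cat sat on the hat."
vocab = sorted(set(text))                       # the "items" the model can predict
stoi = {ch: i for i, ch in enumerate(vocab)}    # character -> index
V = len(vocab)

# Model "weights": weights[a][b] is the score for character b following character a.
weights = np.zeros((V, V))
pairs = [(stoi[a], stoi[b]) for a, b in zip(text, text[1:])]
learning_rate = 0.5

for step in range(200):                         # repeat many times
    total_loss = 0.0
    for current, target in pairs:
        scores = weights[current]
        probs = np.exp(scores) / np.exp(scores).sum()   # predicted probabilities
        total_loss += -np.log(probs[target])            # how wrong was the prediction?
        grad = probs.copy()
        grad[target] -= 1.0                             # gradient of the prediction error
        weights[current] -= learning_rate * grad        # adjust weights to be less wrong
    if step % 50 == 0:
        print(f"step {step}: average error {total_loss / len(pairs):.3f}")
```

Even at this toy scale, the essential ingredients are visible: training data, a measure of how wrong each prediction was, and repeated weight adjustments.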

The quality of the training can be measured, and is influenced by the amount of training data, the quality of the data, and the number of training iterations.

It is possible for an organisation to create its own LLM, and train it with its own data. This can be achieved using an “open-weight” model (off-the-shelf) or a custom-created model. Open-weight models come with a base level of training that can be extended further.

Commercially available large language models are typically accessed via cloud-based services, and have already been trained, with no mechanism for further customised training. Self-hosted LLMs are also available, and can use “open-weight” models or proprietary models.

Top-tier cloud-based commercial LLMs have transformer models that encode more than 1 trillion parameters, and require computing resources that greatly exceed the capabilities available to most organisations.

Why is the quantity of parameters important?

The number of parameters used by a transformer provides an indication of how much information can be encoded into the transformer neural network (via training). It also determines the amount of computing resources (memory, processing, storage) required to run the transformer neural network.

Models with a large number of parameters can potentially make accurate predictions for a wider range of situations, because they are capable of encoding more knowledge within their transformer neural network.

A general LLM might require a transformer with hundreds of billions of parameters, a very large training data set, and an enormous amount of computing resources and computer memory.

A highly specialised, custom-built LLM may be able to use a transformer with a small number of parameters (tens of millions). This scale of transformer could potentially run in a situation with limited computing resources, such as running entirely on a smartphone.
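
To make the scale concrete, the rough calculation below multiplies the parameter count by an assumed two bytes of memory per parameter (a common 16-bit numeric format). It ignores the additional memory needed for the context window and other overheads, so the figures are indicative only; the example model sizes are assumptions, not product specifications.

```python
# Rough, indicative memory estimate: parameter count x bytes per parameter.
# Assumes ~2 bytes per parameter (16-bit precision) and ignores other overheads.
BYTES_PER_PARAMETER = 2

def approx_memory_gb(parameter_count: int) -> float:
    return parameter_count * BYTES_PER_PARAMETER / 1e9   # gigabytes

examples = [
    ("Highly specialised small model (~20 million parameters)", 20_000_000),
    ("Large open-weight model (~70 billion parameters)", 70_000_000_000),
    ("Top-tier commercial model (~1 trillion parameters)", 1_000_000_000_000),
]

for label, params in examples:
    print(f"{label}: roughly {approx_memory_gb(params):,.2f} GB just to hold the weights")
```

This is why a model with tens of millions of parameters can plausibly run on a phone, while a trillion-parameter model requires data-centre scale infrastructure.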

In addition to the number of parameters, each LLM is implemented with a context window of a specific size. The context window specifies how much information can be incorporated into any user request.

The context window is used to provide specific additional information relating to a question, and enables short-term information to be “remembered”. This allows you to have a conversation style interaction with an LLM tool, where things such as previous inputs and outputs are “remembered” over time.

Items which might need to fit within the context window include (see the sketch after this list):

  • your current prompt (question)
  • any specific reference documents that need to be considered
  • any data that needs to be analysed
  • previous prompts (questions) and responses
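
The sketch below illustrates the budgeting problem described above: everything has to fit within a fixed window, so once the limit is reached the oldest conversation turns are dropped and effectively “forgotten”. The word count used here is a crude stand-in for real tokenisation, and the window size is an arbitrary example.

```python
# Illustrative only: fitting a prompt, reference material and prior turns into a fixed window.
# Real LLMs count "tokens" rather than words; the 4,000 figure is an arbitrary example.
CONTEXT_WINDOW = 4_000

def rough_size(text: str) -> int:
    return len(text.split())              # crude stand-in for a tokeniser

def build_request(prompt: str, reference_docs: list[str], history: list[str]) -> list[str]:
    request = [prompt] + reference_docs    # the current question and any reference documents
    used = sum(rough_size(part) for part in request)
    # Add previous prompts and responses, newest first, until the window is full.
    for turn in reversed(history):
        if used + rough_size(turn) > CONTEXT_WINDOW:
            break                          # older turns no longer fit and are "forgotten"
        request.append(turn)
        used += rough_size(turn)
    return request

# Example: a long conversation where only the most recent turns still fit.
history = [f"turn {i}: " + "words " * 300 for i in range(20)]
request = build_request("Summarise the attached policy.", ["policy text " * 200], history)
print(f"{len(request)} items fit within the context window")
```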

Quality and hallucinations

When a user interacts with the LLM, the user input is sent to the transformer model, which predicts the likelihood (probability) of a series of potential outputs. The LLM then responds to the user input by selecting from the most probable outputs.

Most LLM tools are implemented to apply a level of “creativity” to the response, by allowing a degree of randomness in the selection of the predicted answer. For example, it might make a random selection from the outputs with the top 3 probabilities, rather than simply always choosing the answer with the highest probability. This can lead to a level of inconsistency in the responses to user inputs.
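
The sketch below shows this selection step in simplified form: rather than always returning the single most probable prediction, the tool samples from the few most probable ones, so the same input can produce different outputs on different runs. The candidate words and probabilities are invented for illustration.

```python
# Simplified illustration of "creative" output selection.
# Instead of always returning the most probable item, sample from the top 3.
import random

# Invented example: predicted probabilities for the next word.
predictions = {"policy": 0.40, "strategy": 0.35, "framework": 0.15, "budget": 0.10}

def pick_next_word(predictions: dict[str, float], top_k: int = 3) -> str:
    top = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    words, probs = zip(*top)
    return random.choices(words, weights=probs, k=1)[0]

# Running the same input several times can give different answers.
print([pick_next_word(predictions) for _ in range(5)])
```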

Hallucinations are also a feature of LLMs that cannot be eliminated. The quality of the predictions made by the transformer model is directly related to the training data. If the user request involves a scenario that is not covered by the training data, then the prediction could be completely wrong.

Increasing the amount of encoded information within the transformer neural network can help reduce the potential for hallucinations, but will not by itself prevent all hallucinations. Any increase in encoded information will also require an increase in the amount of training data and training iterations to achieve an acceptable level of accuracy.

AI Powered Workflows

Workflow automation tools are a very mature set of technologies that have been widely deployed across Government and businesses. These tools do not require the use of AI-related technologies.

Traditional workflow automation tools use digital logic and deterministic steps. Digital logic and deterministic steps always produce the same output for the same input.

It is now possible, however, to incorporate the use of AI tools into process steps, and potentially integrate the use of AI tools directly into automated workflows. AI tools are suited to steps that incorporate unstructured data, interpretation, and content generation.
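
A minimal sketch of this hybrid pattern is shown below: deterministic steps handle validation and routing, while a single AI step handles the unstructured text. The call_llm function is a placeholder for whichever LLM service an organisation might use, not a real API, and the field names are invented.

```python
# Illustrative hybrid workflow: deterministic steps plus one AI step.
# call_llm() is a placeholder for an LLM service, not a real API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a call to an LLM service")

def validate_record(record: dict) -> bool:
    # Deterministic step: the same input always gives the same result.
    return bool(record.get("customer_id")) and bool(record.get("free_text_complaint"))

def route_record(record: dict) -> str:
    # Deterministic routing based on a structured field.
    return "priority_queue" if record.get("priority") == "high" else "standard_queue"

def process_complaint(record: dict) -> dict:
    if not validate_record(record):
        return {"status": "rejected", "reason": "missing required fields"}
    # AI step: suited to unstructured text, but its output still needs checking downstream.
    summary = call_llm(
        f"Summarise this complaint in two sentences: {record['free_text_complaint']}"
    )
    return {"status": "accepted", "queue": route_record(record), "summary": summary}
```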

Deterministic Digital Logic

Strengths:
  • Same input = same output
  • Consistent decisions and outputs
  • Mathematical calculations
  • Data validation
  • Work routing
  • Templated content
  • Highly efficient and cost effective

Weaknesses:
  • Making changes can be complex
  • Handling unstructured data
  • Handling ambiguity

AI (LLM)

Strengths:
  • Understanding of language
  • Interpreting complex information
  • Content generation
  • Handling ambiguity
  • Handling unstructured data
  • Flexibility

Weaknesses:
  • Consistent decisions and outputs
  • Error rates and hallucinations
  • Mathematical calculations
  • Data validation
  • Computing resource requirements
  • Cost

Governance and Controls

Put simply, the outputs of LLM and machine learning based tools need to undergo subsequent checks, and ongoing monitoring.

In complex workflows, it will be critical to ensure governance and risk controls are built into key steps, and are able to rapidly detect quality problems and errors.

It is important to detect any errors at the point where they occur, before the outputs are sent to later workflow stages. Early detection minimises wasted resources.

There are multiple quality review techniques that can be considered, including:

  • human review of all outputs
  • quality review of all outputs by a separate, non-AI digital tool (such as a rules-based engine), as sketched after this list
  • periodic human review of a sample of outputs
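
As an illustration of the second technique, the sketch below applies a separate, rules-based (non-AI) check to an LLM-generated output before it is passed to the next workflow stage. The specific rules, section names, and reference-ID convention are invented examples; real checks would reflect an organisation's own quality criteria.

```python
# Illustrative rules-based quality check on an LLM output (no AI involved).
# The rules, section names, and "REF-" convention below are invented examples.
REQUIRED_SECTIONS = ("Summary", "Recommendation")
MAX_WORDS = 300

def check_output(llm_output: str, known_reference_ids: set[str]) -> list[str]:
    problems = []
    if len(llm_output.split()) > MAX_WORDS:
        problems.append("output exceeds the agreed length limit")
    for section in REQUIRED_SECTIONS:
        if section not in llm_output:
            problems.append(f"missing required section: {section}")
    # Flag any reference IDs that do not exist in the source data
    # (a simple way to catch some hallucinated citations).
    for word in llm_output.split():
        if word.startswith("REF-") and word not in known_reference_ids:
            problems.append(f"unrecognised reference: {word}")
    return problems

issues = check_output("Summary: see REF-123", known_reference_ids={"REF-001", "REF-002"})
print("route to human review" if issues else "pass to next step", issues)
```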

There could potentially be a substantial amount of re-work of outputs to correct errors. Every step of the process involving LLM handling may need checks, monitoring, and resources to carry out that re-work.

These measures are likely to require a more complex set of workflow steps, with additional human review, human decision making, quality measurement, and resources to handle the re-work of faulty outputs.

To implement enhanced governance and controls, the process may require:

  • additional quality control steps
  • data cleaning and data validation before passing information to AI systems
  • data cleaning and validation of outputs from AI systems
  • human oversight of outputs from AI systems
  • measurement and reporting of additional quality-related information, such as error rates and re-work volumes (a simple metrics sketch follows this list)
  • additional workforce resources to handle oversight and re-work.
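
As a small illustration of the measurement point above, the sketch below derives an error rate and re-work volume from per-item workflow records. The field names and figures are hypothetical.

```python
# Illustrative quality reporting: error rate and re-work volume from workflow records.
# Field names and figures are hypothetical.
records = [
    {"id": 1, "error_found": False, "reworked": False},
    {"id": 2, "error_found": True,  "reworked": True},
    {"id": 3, "error_found": True,  "reworked": True},
    {"id": 4, "error_found": False, "reworked": False},
]

total = len(records)
errors = sum(r["error_found"] for r in records)
reworked = sum(r["reworked"] for r in records)

print(f"error rate: {errors / total:.0%}")            # e.g. 50%
print(f"re-work volume: {reworked} of {total} items")
```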

Agentic AI

Agentic AI is a type of tool that is designed to act autonomously. It is meant to act as an “agent”, essentially replacing the role of a person.

Agentic AI can determine by itself how to carry out a request, and act with minimal human intervention. These tools can also interact directly with other technology systems, and take actions to complete steps in a workflow.

The autonomous nature of agentic AI clearly complicates governance, and reduces the opportunity to incorporate human oversight.
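
The sketch below gives a deliberately simplified picture of this pattern: the tool decides which action to take next and calls other systems directly, with a person only reviewing the result after the fact. All function and tool names are hypothetical placeholders, and real agentic frameworks are considerably more sophisticated.

```python
# Highly simplified illustration of an agentic loop: the tool chooses and performs
# actions against other systems with minimal human intervention.
# All tool and function names are hypothetical placeholders.

def decide_next_action(goal: str, completed: list[str]) -> str | None:
    # Stand-in for the LLM-driven planning step that decides what to do next.
    plan = ["look_up_customer_record", "draft_response", "update_case_system"]
    remaining = [step for step in plan if step not in completed]
    return remaining[0] if remaining else None

TOOLS = {
    "look_up_customer_record": lambda: "record retrieved",
    "draft_response": lambda: "draft written",
    "update_case_system": lambda: "case updated",
}

def run_agent(goal: str) -> list[str]:
    completed, log = [], []
    while (action := decide_next_action(goal, completed)) is not None:
        log.append(f"{action}: {TOOLS[action]()}")   # acts directly on other systems
        completed.append(action)
    return log   # human review of these outputs happens after the fact

print(run_agent("Resolve customer complaint #42"))
```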

Standard LLM

Workforce role:
  • Workforce uses the LLM as a support / reference only
  • Workforce oversees LLM outputs (quality)
  • Workforce manually requests the LLM to carry out tasks

Tool role:
  • Information classification
  • Information summarisation
  • Generating new content
  • Research / search

Interactions with workforce:
  • Requires prompts to act
  • Requires step-by-step instructions
  • Prompts may be complex, and need standardisation

Governance / risk controls:
  • Human review of all outputs
  • Quality review of all outputs by a separate digital tool (rules-based engine, rather than AI)
  • Periodic human review of a sample of outputs

Agentic AI

Workforce role:
  • Minimal
  • Workforce oversees Agentic AI outputs (quality)
  • Workforce manually requests Agentic AI to carry out tasks

Tool role:
  • Can independently determine how to carry out a task
  • Capable of complex multi-step tasks
  • Can carry out LLM tasks
  • Can perform data entry / interact with other computer systems

Interactions with workforce:
  • Can act without workforce intervention

Governance / risk controls:
  • Human review of all outputs
  • Quality review of all outputs by a separate digital tool (rules-based engine, rather than AI)
  • Periodic human review of a sample of outputs

Workflow Automation

Workforce role:
  • Provides inputs
  • Makes decisions

Tool role:
  • Workflow standardisation
  • Follows defined workflow / process steps
  • Document / data routing
  • Implementation of governance / controls

Interactions with workforce:
  • User inputs
  • Decision making

Governance / risk controls:
  • Human review of all outputs
  • Quality review of all outputs by a separate digital tool (rules-based engine, rather than AI)
  • Periodic human review of a sample of outputs

How the Operating Model Can Bridge the AI Reliability Gap

While AI-based tools can be very useful for certain tasks, their weaknesses can potentially lead to a substantial error rate and poor-quality outputs. The reliability gap is most clearly evident in inconsistent responses, and in errors caused by hallucinations.

To bridge the reliability gap, it is necessary to consider the entire operating model. This includes governance requirements, workforce skills and capacity, as well as actual process design and tool selection.

It is important to ground any transformation effort in the higher-level business requirements. Business requirements are not “detailed process design”, technical specifications, or technology decisions. They are the requirements the operating model must satisfy, and hence need the input of a wider range of stakeholders within the organisation.

The benefit of this style of approach is simple. The high-level business requirements that are produced will be stable over a longer time frame, and agnostic to the particular technologies that may be part of any given stage of an implementation. They will serve as a blueprint that can guide further detailed design and decision making.

When you have a well-documented set of high-level business requirements, the detailed design and implementation can be grounded in the broader context, and prioritise the elements that really matter. Focusing on the business requirements will also enable benefits realisation to be more clearly defined and measured.

Key steps in bridging the gap:

  • Discover and document the personas
  • Discover and document the high-level user journeys
  • Document the current state service blueprints
  • Discover high level business requirements


