If you are getting to grips with the new terminology and disciplines of the AI-driven world, such as ‘semantic annotation’, ‘natural language processing’ (NLP), and ‘retrieval-augmented generation’ (RAG), you might consider placing the simpler, more prosaic ‘paradata’ above them all in the hierarchy of significance to generative AI and large language models (LLMs).
In its simplest definition, paradata is ‘data on the making and processing of data’: it records the decision-making and analysis behind how data is collected and used. Significantly, it can be applied retrospectively, contemporaneously, and predictively, unlike the static data it describes. A fuller discussion can be found on Michael Andrews’ Storyneedle blog here.
With activist investors demanding greater AI transparency from big tech, along with the disclosure of ethical guidelines, paradata documentation looks set to become a mainstream requirement for policing the amplification and elaboration of misinformation and disinformation. Read the story in Raconteur here.
“Paradata ‘documents the organisational and technical context for AI decision-making where accountability is necessary… and allows for the reconstruction of the AI process, documenting due diligence… to mitigate risk…’”
- Patricia Franks, Editor of the Encyclopedia of Archival Science
(You can read more here on the AIIM blog)
There are giant leaps afoot in generative AI with the emergence of RAG, pioneered by a team at Meta. RAG overlays semantic search on curated datasets so that generated content is more accurate, context-specific, and up-to-date, grounded in the retrieved information. Rather than fine-tuning, it supplements the model’s internal (parametric) knowledge with retrieved external knowledge, producing adaptive and more reliable textual outputs.
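To make the pattern concrete, here is a minimal sketch of the RAG loop in Python. Everything in it is illustrative: the toy bag-of-words ‘embeddings’ stand in for a real dense encoder, the sample documents are invented, and the assembled prompt would be sent to an actual LLM endpoint rather than printed.

```python
from collections import Counter
import math

# Toy document store: in practice, chunks from a curated corpus
# (ideally carrying paradata that describes their origin).
DOCUMENTS = [
    "Paradata documents the decisions behind how data was collected.",
    "RAG retrieves relevant passages and feeds them to the model as context.",
    "The EU AI Act defines four risk tiers for AI applications.",
]

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding' -- a stand-in for a real dense encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank stored documents by similarity to the query; keep the top k."""
    q = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Augment the user query with retrieved context before generation."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does paradata document?"))
# The assembled prompt would then go to an LLM, which grounds its answer
# in the retrieved passages instead of relying solely on parametric knowledge.
```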
Nevertheless, if hallucinations are to be eradicated from LLMs, the quality of the metadata itself, and of the parametric knowledge base, needs questioning and qualifying. Read more about semantic annotation by Ontotext here and more on RAG by Meta here.
Paradata is becoming essential to AI Risk Management
AI ‘black boxes’ house ever-more sophisticated algorithms and neural networks. As a result, they have become harder to diagnose and explain, and therefore less predictable and less suitable for high-risk applications.
An increased emphasis on accountability brings paradata into play. Paradata overlays enable better processual insight, transparency, and accountability, posing questions about how metadata is procured, such as the following (a sketch of how the answers might be captured follows the list):
- Are AI research teams keeping sufficient records about their processes?
- Is the data collection up to date with compliance requirements?
- Do the metadata records meet quality standards?
- Was the data collection methodology sound?
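Here is one way those answers might be captured as a structured paradata record. This is a minimal sketch assuming a simple dataclass schema; the field names are illustrative, not drawn from any standard.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ParadataRecord:
    """Illustrative paradata for one dataset -- answers to the questions above."""
    dataset_id: str
    collected_by: str             # who gathered the data
    collection_method: str        # survey, scrape, sensor feed, ...
    collection_date: date
    compliance_checks: list[str] = field(default_factory=list)  # e.g. GDPR review
    quality_standard: str = "unspecified"  # metadata quality standard applied
    methodology_notes: str = ""            # rationale behind the design choices

# Hypothetical example entry.
record = ParadataRecord(
    dataset_id="survey-2024-q2",
    collected_by="UX research team",
    collection_method="online survey, stratified sample",
    collection_date=date(2024, 5, 14),
    compliance_checks=["GDPR consent captured", "PII anonymised"],
    quality_standard="internal metadata style guide v3",
    methodology_notes="Sample weighted by region; non-responses logged.",
)
```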
The UK Royal Society defines ‘explainable AI’ as characterised by interpretability, explainability, transparency, justifiability, and contestability.
Explainability goes further than simply citing the data behind AI-driven output; the quality, source, and integrity of that data are now being analysed.
The European Union’s new Regulatory Framework on Artificial Intelligence (see our May 2024 newsletter) envisages a layered risk-based approach to AI applications for AI developers, distributors, and users. It categorises AI use into four levels of risk: unacceptable, high, limited, and minimal.
Higher-risk applications will now require documentation in the form of paradata to justify the deployment of these AI systems and their intended use as part of the due diligence process. Paradata serves to document the full scope of the application and its context of use, not just the algorithm itself.
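As a rough illustration of how those tiers might drive documentation requirements, consider the sketch below. The mapping of tiers to paradata artefacts is an assumption for demonstration only; the actual obligations are set out in the regulation itself.

```python
from enum import Enum

class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"  # prohibited outright
    HIGH = "high"                  # heavy documentation duties
    LIMITED = "limited"            # transparency obligations
    MINIMAL = "minimal"            # largely unregulated

# Illustrative mapping only -- not taken from the regulation text.
REQUIRED_PARADATA = {
    RiskTier.HIGH: ["intended-use statement", "data provenance log",
                    "decision-process records", "audit trail"],
    RiskTier.LIMITED: ["transparency notice"],
    RiskTier.MINIMAL: [],
}

def due_diligence_checklist(tier: RiskTier) -> list[str]:
    """Return the paradata documentation assumed for a given risk tier."""
    if tier is RiskTier.UNACCEPTABLE:
        raise ValueError("Application may not be deployed at all.")
    return REQUIRED_PARADATA[tier]

print(due_diligence_checklist(RiskTier.HIGH))
```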
If we consider a machine learning life cycle as comprising:
1. data gathering
2. data preparation
3. data wrangling
4. data analysis
5. model training
6. model testing
7. deployment
then paradata would infiltrate steps 2 to 4, and perhaps insert a whole new step. Read more here by Scott Cameron, InterPARES Research Assistant at the University of British Columbia.
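A sketch of what ‘infiltrating steps 2 to 4’ could look like in practice: each processing function logs a paradata entry (decision, rationale, timestamp) as a side effect. The pipeline functions and the log format are illustrative assumptions.

```python
from datetime import datetime, timezone

PARADATA_LOG: list[dict] = []

def log_paradata(step: str, decision: str, rationale: str) -> None:
    """Record what was decided, why, and when, alongside each step."""
    PARADATA_LOG.append({
        "step": step,
        "decision": decision,
        "rationale": rationale,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# Steps 2-4 of the life cycle, instrumented with paradata capture.
def prepare(raw: list[dict]) -> list[dict]:
    log_paradata("data preparation", "dropped rows with null labels",
                 "labels are required for supervised training")
    return [r for r in raw if r.get("label") is not None]

def wrangle(rows: list[dict]) -> list[dict]:
    log_paradata("data wrangling", "lower-cased all text fields",
                 "normalisation reduces vocabulary sparsity")
    return [{**r, "text": r["text"].lower()} for r in rows]

def analyse(rows: list[dict]) -> list[dict]:
    log_paradata("data analysis", "checked class balance",
                 "imbalance would bias the trained model")
    return rows

data = analyse(wrangle(prepare(
    [{"text": "Hello", "label": 1}, {"text": "Hi", "label": None}]
)))
# PARADATA_LOG now doubles as an audit trail for the whole pipeline.
```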
Processual paradata can embed transparency and accountability in AI, both in its context of use and for audit-trail purposes. As a technical communicator, you may not be directly involved in gathering, processing, or analysing paradata documentation, or in formulating the key questions it should ask. However, understanding the origins and purpose of the data you work with encourages collaboration across teams, particularly with members unfamiliar with how and why the data was gathered, creating a common understanding of the goal.
Introducing paradata into the methodology and infrastructure behind content creation will establish new opportunities for content developers whose roles will inevitably broaden in line with governance requirements. Paradata documentation also supports reuse and fosters an open-resource-type framework for others to understand the context in which information was developed.
As one study comments: ‘understanding and documenting of the context of creation, curation and use of research data… make it useful and usable for researchers and others in the future.’ See the open-access study in the Journal of Open Information Science, published by De Gruyter, here.
Paradata will be especially relevant to communicators whose output is linked to surveys and research data (back to Michael Andrews’ blog post here), and to data segmentation for technical writers engaged in UX research and UX instructions. If you are involved in semantically annotating content or knowledge graphs with ‘machine-processable marginalia’, paradata can help you screen the quality of your data sources.
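For instance, a screening pass over data sources might look something like this; the quality flags and the minimum bar are hypothetical.

```python
# Keep only sources whose paradata meets a minimum bar before using them
# to annotate content or knowledge graphs. Flags are illustrative.
sources = [
    {"name": "survey-2024-q2",
     "paradata": {"method_documented": True, "compliance_checked": True}},
    {"name": "legacy-scrape",
     "paradata": {"method_documented": False, "compliance_checked": False}},
]

def passes_screen(paradata: dict) -> bool:
    """A hypothetical minimum bar: documented method plus a compliance check."""
    return (paradata.get("method_documented", False)
            and paradata.get("compliance_checked", False))

trusted = [s["name"] for s in sources if passes_screen(s["paradata"])]
print(trusted)  # ['survey-2024-q2']
```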
Technical communicators: heed the call for ‘responsible AI’ and invest in paradata knowledge now!
It might land you a promotion!
It won’t be long before technical communicators and their support teams are relying on the paradata behind AI-driven output to reveal how content modifications can or will impact a business or its customers. Much of this will be compliance- and governance-led, but digital communicators will also need to gain a greater understanding of the qualitative aspects of the metadata they work with in order to validate and uphold the veracity of their output.
At Firehead we encourage technical communicators to stay up to date with every new facet of this increasingly AI-driven industry and to use the widest vocabulary to broaden their career opportunities.
And don’t be surprised if, in future newsletters, we uncover new compound terms like ‘metaparadata’ as further overlays filter and screen digital information vaults by whether they’re socially ethical and relevant, sector-specifically reliable, or cohort desirable in the context of use. Data mining will inevitably dig deeper and deeper as AI looks to exploit every grain of manipulable binary code or, pretty soon, every qubit.