Automating PRISMA Flows with LLMs

From Query Building to Deduplication

Systematic reviews remain a cornerstone of scientific synthesis, but the traditional workflow — from query formulation to record screening — is slow, manual, and error-prone. With the maturation of local and compliant large language models (LLMs), researchers can now automate major components of the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) process while maintaining transparency, reproducibility, and methodological rigor.

The next generation of review pipelines integrates AI-assisted query building, automated metadata cleaning, and deduplication routines directly within open research ecosystems such as Zotero, Rayyan, or OpenAlex — all governed by error bounds and audit trails.

This article outlines how LLMs can augment, but not replace, the methodological integrity of systematic evidence synthesis.

Rethinking the PRISMA Workflow in the Age of Local AI

The PRISMA 2020 framework (Page et al., 2021) requires that researchers document each phase of the review process — identification, screening, eligibility, and inclusion — in a transparent, reproducible flow diagram.

Historically, this required manual logging of counts from multiple databases (PubMed, Scopus, Web of Science) and spreadsheet-based tracking of duplicates and exclusions.

Today, an LLM-based system can automatically:

  • Parse search logs from databases and APIs.
  • Detect overlaps in metadata.
  • Generate real-time flow diagrams in line with PRISMA standards.
  • Maintain verifiable logs for reproducibility.

When implemented carefully, these AI-augmented pipelines preserve the auditability required for publication while reducing the mechanical burden on researchers.

Step One: AI-Assisted Query Building and Expansion

LLMs can assist in constructing structured Boolean queries across multiple bibliographic databases.

For instance, a researcher studying digital mental health interventions for adolescents can ask an LLM:

“Generate equivalent Boolean queries for PubMed, Scopus, and PsycINFO covering digital CBT, telehealth, and mobile interventions for adolescent depression.”

The LLM produces syntax-specific query strings, e.g.:

(PubMed)
("cognitive behavioral therapy"[MeSH Terms] OR "CBT" OR "telepsychology" OR "mobile mental health")
AND ("adolescent"[MeSH Terms] OR teen* OR youth*)
AND (depression OR "mood disorder")

(Scopus)
TITLE-ABS-KEY(("digital cognitive behavioral therapy" OR "telepsychology" OR "mHealth") AND (adolescent OR teen* OR youth*) AND (depression))

This accelerates the identification stage and helps ensure cross-platform equivalence — a critical yet often error-prone step in systematic searching.

To maintain reproducibility, each generated query is saved, together with its timestamp, to a query log (e.g., query_log_2025-11-10.json) that records:

  • Model name and version.
  • Query prompt and output.
  • Database target and timestamp.

This establishes a verifiable provenance record suitable for supplementary materials or pre-registration, as sketched below.
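
A minimal sketch of such a logger in Python (the helper name log_query and the model tag are illustrative assumptions; only the log fields and filename come from the text above):

(Python)
import json
from datetime import datetime, timezone
from pathlib import Path

def log_query(log_path, model, prompt, output, database):
    """Append one query-generation event to the JSON query log."""
    entry = {
        "model": model,          # model name and version
        "prompt": prompt,        # prompt sent to the LLM
        "output": output,        # generated Boolean query string
        "database": database,    # target database, e.g. "PubMed"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    path = Path(log_path)
    log = json.loads(path.read_text()) if path.exists() else []
    log.append(entry)
    path.write_text(json.dumps(log, indent=2))

log_query("query_log_2025-11-10.json",
          model="llama-3-8b-instruct",   # hypothetical local model tag
          prompt="Generate equivalent Boolean queries for PubMed ...",
          output='("cognitive behavioral therapy"[MeSH Terms] OR ...) AND ...',
          database="PubMed")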

Step Two: Automated Record Aggregation and Metadata Normalization

After export, bibliographic records from multiple databases (often in RIS, BibTeX, or CSV formats) can be aggregated into a unified structure.

An LLM integrated within a Python or R pipeline can:

  • Parse heterogeneous metadata formats.
  • Normalize inconsistent fields (author order, DOI formatting, journal titles).
  • Identify incomplete metadata and flag missing DOIs for manual review.

For example, the model may flag a record where the title field contains abstract fragments or where an ISSN is misclassified as a DOI.

Using open libraries such as Zotero translators, the Crossref REST API, or OpenAlex, these inconsistencies can be programmatically corrected with LLM-driven heuristics — while retaining an audit trail of every transformation (e.g., metadata_log.json).
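
A sketch of one such heuristic, assuming records have already been parsed into Python dictionaries with id and doi fields (the field names and regular expression are illustrative, not a fixed schema):

(Python)
import json
import re

DOI_RE = re.compile(r"10\.\d{4,9}/\S+")  # rough shape of a DOI (illustrative)

def normalize_record(rec, log):
    """Normalize one record's DOI field and log every transformation or flag."""
    out = dict(rec)
    doi = (rec.get("doi") or "").strip().lower()
    doi = doi.removeprefix("https://doi.org/").removeprefix("doi:")
    if doi and not DOI_RE.fullmatch(doi):
        log.append({"id": rec.get("id"), "event": "flag", "field": "doi",
                    "reason": "malformed DOI (possibly an ISSN)", "value": doi})
        doi = ""
    elif not doi:
        log.append({"id": rec.get("id"), "event": "flag", "field": "doi",
                    "reason": "missing DOI, manual review"})
    elif doi != rec.get("doi"):
        log.append({"id": rec.get("id"), "event": "normalize", "field": "doi",
                    "from": rec.get("doi"), "to": doi})
    out["doi"] = doi
    return out

audit = []
cleaned = [normalize_record(r, audit) for r in records]  # records: parsed RIS/BibTeX
with open("metadata_log.json", "w") as fh:
    json.dump(audit, fh, indent=2)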

Step Three: Deduplication via Semantic Similarity

Traditional deduplication tools rely on strict string matching (e.g., DOI equality or Levenshtein distance between titles). However, variations in capitalization, subtitle punctuation, or indexing conventions often lead to false positives and false negatives.

Here, text-embedding models (such as OpenAI’s text-embedding-3-large or local sentence-transformer equivalents) can compute semantic similarity scores between titles, abstracts, or authorship fields.

Pairs with cosine similarity >0.96 can be flagged as probable duplicates.

Example:

Field | Record A | Record B | Similarity
Title | “Digital CBT for adolescent depression: A systematic review” | “A systematic review of digital cognitive-behavioral therapy for adolescent depression” | 0.982

The deduplication model operates under user-defined error bounds (e.g., a 0.96 threshold with a ±0.02 tolerance band routed to manual review) to control false merges.

Every decision — merge, ignore, or flag — is written to a traceable deduplication log for reproducibility.
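
A minimal sketch using the open sentence-transformers library (the model name, the brute-force pairwise loop, and the log layout are illustrative choices; the 0.96 cutoff is the one discussed above):

(Python)
import json
from sentence_transformers import SentenceTransformer

THRESHOLD = 0.96  # cosine-similarity cutoff from the text above

model = SentenceTransformer("all-MiniLM-L6-v2")        # any local embedding model
titles = [r["title"] for r in cleaned]                 # records from Step Two
emb = model.encode(titles, normalize_embeddings=True)  # unit vectors: dot = cosine
sims = emb @ emb.T

with open("dedup_log.jsonl", "w") as fh:
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if sims[i, j] > THRESHOLD:
                fh.write(json.dumps({"record_a": i, "record_b": j,
                                     "similarity": round(float(sims[i, j]), 3),
                                     "decision": "flag"}) + "\n")  # human confirms merge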

Step Four: PRISMA Flow Automation and Visualization

Once record counts are established (initial hits, after deduplication, after screening, etc.), the pipeline automatically generates a PRISMA flow diagram using libraries such as matplotlib, DiagrammeR, or the PRISMA2020 R package.

An LLM or scripting agent can:

  • Read the logged counts from the deduplication and screening stages.
  • Populate the PRISMA XML/JSON schema.
  • Render the figure and export it in SVG or PDF format.

A versioned example:

{
  "identified_records": 2432,
  "after_deduplication": 1920,
  "screened": 1800,
  "full_text_reviewed": 312,
  "included": 42,
  "model_version": "phi-4:offline",
  "timestamp": "2025-11-10T12:45:00Z"
}

This structured record ensures computational reproducibility — each visual element of the PRISMA diagram corresponds to machine-logged evidence, not manual reporting.
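
As an illustration, a simplified Python rendering step could read that record and draw the boxed stages; this is a sketch only (the prisma_counts.json filename is assumed), and the PRISMA2020 R package remains the route to a fully compliant template:

(Python)
import json
import matplotlib.pyplot as plt

counts = json.load(open("prisma_counts.json"))  # the versioned record shown above

stages = [("Records identified", counts["identified_records"]),
          ("After deduplication", counts["after_deduplication"]),
          ("Records screened", counts["screened"]),
          ("Full text reviewed", counts["full_text_reviewed"]),
          ("Studies included", counts["included"])]

fig, ax = plt.subplots(figsize=(4, 6))
ax.axis("off")
for k, (label, n) in enumerate(stages):
    y = 0.95 - k * 0.22  # stack the boxes top to bottom
    ax.text(0.5, y, f"{label}\n(n = {n})", ha="center", va="center",
            bbox=dict(boxstyle="round", facecolor="white"))
    if k < len(stages) - 1:
        ax.annotate("", xy=(0.5, y - 0.14), xytext=(0.5, y - 0.07),
                    arrowprops=dict(arrowstyle="->"))
fig.savefig("prisma_flow.svg")  # each count is machine-logged, not manual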

Error Bounds and Auditability

The inclusion of AI does not absolve researchers from methodological transparency.

Each LLM component must:

  • Log model version, prompt, and confidence score.
  • Record every modification or classification as a JSONL event (see the sketch after this list).
  • Be reproducible using identical seeds and model checkpoints.
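
A JSONL event writer in that spirit might look like the following (the field set is an assumption modeled on the requirements above):

(Python)
import json
from datetime import datetime, timezone

def audit_event(path, stage, action, model, confidence, payload, seed=42):
    """Append one pipeline decision as a single JSONL line."""
    event = {"timestamp": datetime.now(timezone.utc).isoformat(),
             "stage": stage,            # "query", "normalize", "dedup", "screen"
             "action": action,          # "merge", "flag", "exclude", ...
             "model": model,            # model name and version
             "confidence": confidence,  # model-reported confidence score
             "seed": seed,              # fixed seed for reproducibility
             "payload": payload}        # prompt, output, affected record IDs
    with open(path, "a") as fh:
        fh.write(json.dumps(event) + "\n")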

An effective audit trail might contain:

  • Original search outputs.
  • LLM-modified metadata entries.
  • Similarity matrices for deduplication.
  • PRISMA schema files and rendering scripts.

These elements form the computational equivalent of a PRISMA appendix, enabling replication and peer verification.

Toolchain Example: A Reproducible Open Pipeline

A working, privacy-preserving PRISMA automation pipeline can be built from open tools:

Stage | Tools | Output
Query generation | LM Studio (offline Llama 3) or GPT-4-turbo | Database-specific Boolean queries
Record aggregation | Zotero + Better BibTeX / RIS parser | Unified reference library
Metadata normalization | Python + LLM-powered heuristics | Cleaned metadata log
Deduplication | Sentence-transformer embeddings | Deduplication report
PRISMA flow generation | R PRISMA2020 or Python plotly | PRISMA-compliant diagram
Audit archive | Git + JSON logs | Reproducible repository
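
One possible layout for the resulting audit repository (directory and file names are illustrative, reusing the artifacts from the steps above):

project/
├── queries/query_log_2025-11-10.json
├── library/records.ris
├── logs/metadata_log.json
├── logs/dedup_log.jsonl
├── prisma/prisma_counts.json
├── prisma/prisma_flow.svg
└── README.md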

This workflow aligns with FAIR (Findable, Accessible, Interoperable, Reusable) principles while keeping the process transparent and locally verifiable.

Example: Meta-Analysis on Adolescent Digital Interventions

In a real-world application, a psychology research team conducting a meta-analysis on adolescent digital interventions used a semi-automated PRISMA pipeline.

Within hours, they:

  • Generated equivalent Boolean queries for six databases.
  • Aggregated 2,430 records into Zotero.
  • Used LLM-based normalization to standardize DOIs and titles.
  • Deduplicated down to 1,910 unique records.
  • Produced an auto-generated PRISMA flowchart complete with metadata provenance.

The process reduced screening time by 40% and improved reproducibility metrics, as every decision was logged and exportable.

Ethical and Methodological Implications

While LLM automation enhances efficiency, it also raises methodological and ethical questions:

  • Bias propagation: Poorly tuned models may over-include or exclude studies.
  • Overconfidence: Automated screening must not replace human domain judgment.
  • Transparency: All LLM interactions must be logged and made available for peer review.

Therefore, AI systems should act as assistive engines, not as decision-makers — maintaining the researcher as the accountable epistemic agent.

Conclusion: Toward Auditable, AI-Augmented Synthesis

The automation of PRISMA flows through LLMs heralds a shift toward computationally reproducible evidence synthesis. When coupled with strict audit trails, versioning, and human verification, these systems can dramatically increase efficiency without eroding methodological rigor.

In this model, LLMs are not substitutes for reviewers but instruments of traceable augmentation — streamlining laborious steps while preserving scientific integrity.

The result is a transparent, high-throughput, and auditable review pipeline that strengthens the reliability of systematic evidence for the AI era.

Sources:

Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., et al. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71.
