LLM Integration

This document explains how Large Language Models (LLMs) and AI extraction systems can map unstructured CV content into the Barba-CV JSON format.

Barba-CV is designed to provide the deterministic structural layer that sits between:

unstructured CV documents
probabilistic AI extraction
structured HR / ATS / MCP workflows

Why LLM integration matters

Many AI agents, developers, and orchestration systems explicitly look for:

a schema
a structured format
a JSON mapping target
a reliable method for AI extraction

Barba-CV addresses this need by giving LLMs a stable JSON target into which messy CV content can be mapped.

This is especially useful when CV data comes from:

PDF extraction
DOCX extraction
OCR pipelines
copy-pasted CV text
ATS export text

In all of these cases, the source text may be noisy, partial, or inconsistently formatted.

Barba-CV helps the AI system transform that text into a consistent and reusable structure.

Core principle

The recommended Barba-CV LLM workflow is:

CV file
   ↓
Raw text extraction
   ↓
Barba-CV example JSON + schema guidance
   ↓
LLM mapping
   ↓
Barba-CV JSON output

In practice, the Barba-CV engine uses the following method:

extract the raw text from the source CV
provide the LLM with the Barba-CV sample JSON template
instruct the LLM to fill the JSON from the extracted text
require the LLM to preserve the original meaning and wording as much as possible
forbid hallucination, normalization drift, invented dates, invented employers, or invented facts

This approach has proven effective because the LLM receives a clear structural target rather than being asked to invent its own JSON layout.
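
The method above can be sketched in a few lines of Python. This is a minimal illustration, not the Barba-CV engine itself: the function name map_cv_to_barba_json, the call_llm callable, and the two-field template are hypothetical stand-ins, and the real field names are those in examples/barba-cv.example.json.

```python
import json
from typing import Callable


def map_cv_to_barba_json(
    raw_text: str,
    template: dict,
    instructions: str,
    call_llm: Callable[[str], str],
) -> dict:
    """Send instructions, template, and CV text to an LLM and parse its reply."""
    prompt = "\n\n".join([
        instructions,
        "JSON TEMPLATE:\n" + json.dumps(template, indent=2, ensure_ascii=False),
        "CV TEXT:\n" + raw_text,
    ])
    reply = call_llm(prompt)
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(reply)


# Stand-in for a real model call, so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return '{"name": "Jane Doe", "experiences": []}'


result = map_cv_to_barba_json(
    raw_text="Jane Doe\nSoftware Engineer",
    template={"name": "", "experiences": []},  # hypothetical fields
    instructions="Fill the template using only facts stated in the CV text.",
    call_llm=fake_llm,
)
```

Passing the model call in as a plain callable keeps the sketch independent of any particular LLM provider.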

Recommended input to the LLM

The recommended prompt package contains:

1. The raw extracted CV text

This is the source material.

It may come from:

PDF-to-text extraction
DOCX-to-text extraction
OCR
ATS text export

2. The Barba-CV example JSON

Use the canonical template file:

examples/barba-cv.example.json

This shows the model exactly:

which fields exist
how arrays are structured
how nested objects are organized
what empty values look like

3. The Barba-CV schema and/or schema reference

Depending on the workflow, the model may also receive:

schema/barba-cv.schema.json
docs/barba-cv-schema-reference.md

This improves field interpretation and reduces ambiguity.

Recommended instructions to the LLM

A good Barba-CV instruction set should tell the model to:

map the CV content into the provided Barba-CV JSON structure
preserve the source text as faithfully as possible
avoid rewriting facts
avoid changing dates, figures, names, employers, schools, or locations
avoid inventing missing information
leave fields empty when information is not available
keep dates in the human-readable form used in the source when converting them would require guessing
classify skills into the appropriate skill buckets when possible

The main principle is:

Structure the data without altering the facts.

Anti-hallucination rule

This is one of the most important principles of Barba-CV integration.

When converting CV text into JSON, the LLM must:

not invent dates
not invent job titles
not invent companies
not invent schools
not invent certifications
not rewrite factual content into more polished but less accurate wording

If information is missing or uncertain, the correct behavior is:

leave the field empty
keep the wording close to the source
optionally report ambiguity through parsing metadata if the pipeline supports it

This is essential because CVs are often used in:

HR workflows
ATS platforms
legal or contractual contexts
commercial recruitment decisions

Accuracy matters more than stylistic rewriting.
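
One way to back this rule with an automated check is to verify that every string in the generated JSON actually occurs in the source text. The sketch below is a heuristic under stated assumptions, not part of Barba-CV: the function names are hypothetical, the field names in the usage example are invented for illustration, and values that the pipeline legitimately adds (for example meta fields) would need to be excluded before running it.

```python
import re
from typing import Any, Iterator, List, Tuple


def iter_strings(node: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, value) for every non-empty string leaf in a JSON-like value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{index}]")
    elif isinstance(node, str) and node.strip():
        yield path, node


def _normalize(text: str) -> str:
    """Lower-case and collapse whitespace so line wraps do not cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_values(cv_json: dict, source_text: str) -> List[Tuple[str, str]]:
    """Return string leaves whose text does not occur in the source text."""
    haystack = _normalize(source_text)
    return [
        (path, value)
        for path, value in iter_strings(cv_json)
        if _normalize(value) not in haystack
    ]


source = "Jane Doe\nSoftware Engineer at Acme Corp\n2019 - 2021"
candidate = {
    "name": "Jane Doe",
    "experiences": [
        {"employer": "Acme Corp", "title": "Senior Software Engineer", "start": "2019"}
    ],
}
flagged = ungrounded_values(candidate, source)
# flagged == [("experiences[0].title", "Senior Software Engineer")]
```

A flagged value is not necessarily a hallucination, since a long description may be split across fields or lines, so the result is best treated as a list of items to review or to record as parsing ambiguity.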

Recommended prompting pattern

A practical prompt usually includes three parts:

Part 1 — Role and task

Explain to the LLM that it must convert extracted CV text into Barba-CV JSON.

Part 2 — Strict rules

Tell the LLM explicitly:

do not hallucinate
do not change names
do not change dates
do not change numbers
do not change company names
do not change school names
do not normalize beyond what is clearly supported by the text

Part 3 — Structural target

Provide the Barba-CV example JSON and ask the model to populate it.

This is generally more reliable than asking the model to generate a fresh schema-compliant JSON from scratch.

Example high-level prompt logic

A typical Barba-CV prompt can be summarized like this:

You are given raw text extracted from a CV.
Use the provided Barba-CV JSON template as the target structure.
Populate the JSON only with information explicitly present in the text.
Do not invent, rewrite, embellish, or normalize facts beyond what the source clearly states.
Preserve dates, names, organizations, schools, titles, and figures.
If a field is unknown, leave it empty.
Return valid JSON only.

This pattern works well because it constrains the LLM both semantically and structurally.
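
As an illustration, the three prompt parts can be assembled into a single string as follows. The rule text is taken from the summary above; the function name build_prompt, the section delimiters, and the one-field template are assumptions made for this sketch.

```python
import json

# Rules copied from the prompt summary above.
RULES = (
    "You are given raw text extracted from a CV.",
    "Use the provided Barba-CV JSON template as the target structure.",
    "Populate the JSON only with information explicitly present in the text.",
    "Do not invent, rewrite, embellish, or normalize facts beyond what the "
    "source clearly states.",
    "Preserve dates, names, organizations, schools, titles, and figures.",
    "If a field is unknown, leave it empty.",
    "Return valid JSON only.",
)


def build_prompt(raw_text: str, template: dict) -> str:
    """Combine the rules, the JSON template, and the CV text into one prompt."""
    template_json = json.dumps(template, indent=2, ensure_ascii=False)
    return (
        "\n".join(RULES)
        + "\n\n--- BARBA-CV JSON TEMPLATE ---\n"
        + template_json
        + "\n\n--- CV TEXT ---\n"
        + raw_text
        + "\n"
    )


prompt = build_prompt("Jane Doe\nSoftware Engineer", {"name": ""})
```

Placing the CV text last, after clearly marked delimiters, is a design choice of this sketch that keeps the instructions and the source material visibly separate.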

Why the example JSON matters

The example JSON is not just documentation.

For LLMs, it acts as:

a structural target
a formatting guide
a section inventory
a hallucination control aid

Without a concrete sample JSON, an LLM may:

invent field names
omit important sections
flatten nested structures
output inconsistent array/object shapes

Providing the Barba-CV sample template significantly improves consistency.
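
The same template can also be used after generation to detect the problems listed above. The following sketch compares the shape of the LLM output with the shape of the template; shape_diff is a hypothetical helper, the field names are invented for illustration, and the check assumes that the first element of a template array describes every element of that array.

```python
from typing import Any, List


def shape_diff(template: Any, output: Any, path: str = "$") -> List[str]:
    """List structural differences between the template and the LLM output."""
    problems: List[str] = []
    if isinstance(template, dict):
        if not isinstance(output, dict):
            return [f"{path}: expected object, got {type(output).__name__}"]
        for key in sorted(template.keys() - output.keys()):
            problems.append(f"{path}.{key}: missing")
        for key in sorted(output.keys() - template.keys()):
            problems.append(f"{path}.{key}: unexpected field")
        for key in sorted(template.keys() & output.keys()):
            problems.extend(shape_diff(template[key], output[key], f"{path}.{key}"))
    elif isinstance(template, list):
        if not isinstance(output, list):
            return [f"{path}: expected array, got {type(output).__name__}"]
        if template:
            # Use the first template item as the expected shape of every item.
            for index, item in enumerate(output):
                problems.extend(shape_diff(template[0], item, f"{path}[{index}]"))
    # Scalar values are not compared: only the structure is checked here.
    return problems


template = {
    "name": "",
    "skills": {"technical": [], "languages": []},
    "experiences": [{"employer": "", "title": ""}],
}
output = {
    "name": "Jane",
    "skills": ["Python"],  # nested object flattened into an array
    "experiences": [{"employer": "Acme", "job": "Dev"}],  # invented field name
}
problems = shape_diff(template, output)
```

This check is stricter than a schema that allows optional fields, so it complements schema validation rather than replacing it.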

Recommended operational workflow

A robust implementation often follows these steps:

Step 1 — Extract text

Extract the raw text from the source document.

Step 2 — Clean obvious extraction noise

Optionally remove obvious extraction artefacts if they do not change meaning.
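
A conservative cleaning pass might look like the sketch below. The function name clean_extracted_text and the exact set of characters removed are assumptions. The sketch deliberately avoids anything that could alter meaning, such as joining hyphenated words or correcting OCR mistakes.

```python
import re

# Invisible characters that text extraction tools commonly leave behind.
_INVISIBLE = {
    ord("\ufeff"): None,  # byte order mark
    ord("\u200b"): None,  # zero-width space
    ord("\u00ad"): None,  # soft hyphen
}


def clean_extracted_text(text: str) -> str:
    """Remove layout noise without changing any visible wording."""
    # Normalize line endings and treat page breaks (form feeds) as line breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\f", "\n")
    text = text.translate(_INVISIBLE)
    # Remove remaining ASCII control characters except newline and tab.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Keep at most one blank line between blocks of text.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip()


raw = "\ufeffJane  Doe\r\n\r\n\r\n\r\nSoft\u00adware Engineer \x00\fPage 2   text  "
cleaned = clean_extracted_text(raw)
# cleaned == "Jane Doe\n\nSoftware Engineer\nPage 2 text"
```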

Step 3 — Send prompt + template to the LLM

Provide:

the extracted text
the Barba-CV sample JSON
the anti-hallucination instructions

Step 4 — Validate output

Validate the returned JSON against:

schema/barba-cv.schema.json

Step 5 — Post-process if needed

Populate meta fields such as:

processor_engine
ats_processed
parsing_errors
cv_uuid
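
A possible post-processing step is sketched below. It assumes that these fields live in a top-level meta object and that their types are a string, a boolean, a list of strings, and a UUID string respectively; both assumptions should be checked against schema/barba-cv.schema.json. The function name add_meta and the engine name in the usage example are hypothetical.

```python
import uuid
from copy import deepcopy
from typing import Iterable


def add_meta(
    cv_json: dict,
    engine: str,
    parsing_errors: Iterable[str] = (),
    ats_processed: bool = False,
) -> dict:
    """Return a copy of cv_json with pipeline-generated meta fields filled in."""
    result = deepcopy(cv_json)
    meta = result.get("meta")
    if not isinstance(meta, dict):
        meta = {}
        result["meta"] = meta
    meta["processor_engine"] = engine
    meta["ats_processed"] = ats_processed
    meta["parsing_errors"] = list(parsing_errors)
    # Keep an existing identifier; generate one only if it is missing or empty.
    if not meta.get("cv_uuid"):
        meta["cv_uuid"] = str(uuid.uuid4())
    return result


original = {"name": "Jane Doe", "meta": {"cv_uuid": ""}}
enriched = add_meta(original, engine="demo-engine", parsing_errors=["date unclear"])
```

Filling these fields in code rather than asking the LLM to produce them keeps generated identifiers and processing flags out of the model's hands.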

Role of the schema in LLM workflows

The schema is useful in two ways.

For the LLM

It clarifies the intended structure and helps reduce ambiguity.

For the system

It enables formal validation after generation.

This is important because an LLM may still produce:

malformed JSON
wrong nesting
wrong data types
unexpected fields

The schema acts as the final structural guardrail.
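
To make the four failure types concrete, the sketch below implements a deliberately small subset of JSON Schema (type, required, properties, additionalProperties set to false, and items) using only the standard library. It is a teaching aid, not a replacement for a full JSON Schema validator library, which a production pipeline would normally use with schema/barba-cv.schema.json. The toy schema and the function names are hypothetical.

```python
import json
from typing import Any, List

_JSON_TYPES = {
    "object": dict, "array": list, "string": str,
    "boolean": bool, "null": type(None),
}


def _type_ok(value: Any, name: str) -> bool:
    # bool is a subclass of int in Python, so exclude it from numeric types.
    if name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if name == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, _JSON_TYPES[name])


def check(instance: Any, schema: dict, path: str = "$") -> List[str]:
    """Check instance against a small subset of JSON Schema keywords."""
    expected = schema.get("type")
    if expected is not None:
        names = expected if isinstance(expected, list) else [expected]
        if not any(_type_ok(instance, n) for n in names):
            found = type(instance).__name__
            return [f"{path}: expected {' or '.join(names)}, got {found}"]
    errors: List[str] = []
    if isinstance(instance, dict):
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}.{key}: required field missing")
        for key, value in instance.items():
            if key in props:
                errors.extend(check(value, props[key], f"{path}.{key}"))
            elif schema.get("additionalProperties") is False:
                errors.append(f"{path}.{key}: unexpected field")
    elif isinstance(instance, list) and isinstance(schema.get("items"), dict):
        for index, item in enumerate(instance):
            errors.extend(check(item, schema["items"], f"{path}[{index}]"))
    return errors


def validate_llm_output(raw_reply: str, schema: dict) -> List[str]:
    """Parse the LLM reply and return a list of structural errors (empty if valid)."""
    try:
        instance = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg} at position {exc.pos}"]
    return check(instance, schema)


toy_schema = {
    "type": "object",
    "required": ["name", "experiences"],
    "additionalProperties": False,
    "properties": {
        "name": {"type": "string"},
        "experiences": {
            "type": "array",
            "items": {"type": "object", "properties": {"employer": {"type": "string"}}},
        },
    },
}
errors = validate_llm_output('{"name": 42, "experiences": {}, "hobby": "x"}', toy_schema)
```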

Recommended validation strategy

The best practice is:

LLM generates Barba-CV JSON
system validates against the schema
if invalid, do one of the following:
- run a repair pass
- reject the output
- flag parsing errors in metadata

This makes the pipeline more robust for MCP and API use.
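
The strategy above can be expressed as a small control loop. In this sketch, call_llm and validate are hypothetical callables supplied by the caller (validate returns a list of error messages, empty when the output is valid), and the wording of the repair request is only an example.

```python
import json
from typing import Callable, List, Optional, Tuple


def generate_with_repair(
    prompt: str,
    call_llm: Callable[[str], str],
    validate: Callable[[dict], List[str]],
    max_repairs: int = 1,
) -> Tuple[Optional[dict], List[str]]:
    """Return (cv_json, errors); cv_json is None when the output is rejected."""
    reply = call_llm(prompt)
    errors: List[str] = []
    for attempt in range(max_repairs + 1):
        try:
            candidate = json.loads(reply)
        except json.JSONDecodeError as exc:
            errors = [f"malformed JSON: {exc.msg}"]
        else:
            if isinstance(candidate, dict):
                errors = validate(candidate)
            else:
                errors = ["top-level value is not a JSON object"]
            if not errors:
                return candidate, []
        if attempt < max_repairs:
            # Repair pass: show the model its errors and ask for corrected JSON.
            reply = call_llm(
                prompt
                + "\n\nYour previous reply was invalid:\n"
                + "\n".join(errors)
                + "\nReturn corrected JSON only, without adding new facts."
            )
    # Rejected: the caller can log the errors or store them as parsing errors.
    return None, errors


def check_name(cv: dict) -> List[str]:
    return [] if isinstance(cv.get("name"), str) else ["name must be a string"]


replies = iter(['{"name": 1}', '{"name": "Jane"}'])
cv, errs = generate_with_repair("prompt text", lambda p: next(replies), check_name)
# The first reply fails validation, the repaired reply passes.
```

Capping the number of repair passes keeps cost and latency bounded, and returning the final error list lets the caller choose between rejecting the output and flagging it in metadata.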

Mapping philosophy

Barba-CV does not ask the LLM to rewrite the candidate’s career story.

It asks the LLM to:

extract
map
structure
preserve meaning

That distinction is fundamental.

Barba-CV is therefore better understood as a structured mapping target than as a rewriting format.

Typical failure modes to avoid

When integrating LLMs, common errors include:

merging several experiences into one
inventing normalized dates
replacing the exact employer name with a guessed official name
rewriting achievements in more polished marketing language
changing the meaning of technical responsibilities
classifying all skills into one generic array
dropping certifications or languages because they appear at the bottom of the CV

These issues should be explicitly guarded against in prompts and post-validation.

Practical recommendation

For real-world reliability, the simplest proven approach is often the best:

use the extracted CV text as source
provide the Barba-CV example JSON as target
instruct the LLM to fill it faithfully
forbid hallucination
validate against the schema

This method is simple, reproducible, and easy to integrate into AI agents, APIs, and MCP workflows.

Summary

Barba-CV LLM integration is based on a simple idea:

Give the LLM a deterministic JSON target, and require it to map the source CV faithfully without inventing data.

This is the core mechanism by which Barba-CV turns probabilistic AI extraction into reusable structured career data.