AI Parsing Guidelines
This document explains how AI systems (LLMs, extraction pipelines, or parsing engines) should transform unstructured CV data into the Barba-CV JSON format.
Barba-CV is designed to act as the deterministic structural layer between probabilistic AI extraction and structured HR datasets.
This page focuses on extraction and field-level mapping behavior, while LLM Integration focuses on prompt packaging, orchestration, and schema-guided workflow design.
1. Goal of AI parsing
The goal of an AI parser is to convert:
Unstructured CV (PDF / DOCX / HTML / text)
↓
AI extraction
↓
Barba-CV JSON structure
The parser should map information into the Barba-CV structure without inventing data.
Missing information should remain empty.
2. Never invent information
If a field cannot be extracted with confidence, it should be left empty.
Examples:
"date_of_birth": ""
"phone": ""
"organization": ""
AI systems must not hallucinate information.
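A minimal sketch of this rule, assuming the extractor returns each candidate value together with a confidence score. The `blank_low_confidence` helper, the `(value, confidence)` pair shape and the 0.8 threshold are illustrative assumptions, not part of Barba-CV.

```python
from typing import Dict, Tuple

# Hypothetical threshold; tune it for your own extractor.
CONFIDENCE_THRESHOLD = 0.8


def blank_low_confidence(
    candidates: Dict[str, Tuple[str, float]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Dict[str, str]:
    """Keep a value only when its confidence reaches the threshold.

    Fields below the threshold are emitted as empty strings, never as guesses.
    """
    return {
        field: (value if confidence >= threshold else "")
        for field, (value, confidence) in candidates.items()
    }


extracted = {
    "phone": ("+33 1 23 45 67 89", 0.95),
    "date_of_birth": ("1990", 0.40),  # weak guess, so it is dropped
}
print(blank_low_confidence(extracted))
```

The empty string keeps the field present in the output, so downstream consumers can tell "not extracted" apart from a missing key.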
3. Prefer structured extraction over free text
Whenever possible, the AI should structure information into the correct fields.
Example:
Instead of:
"profile_summary": "Worked at ACME as a software engineer from 2020 to 2023"
Use:
"experiences": [{
"organization": "ACME",
"role_title": "Software Engineer",
"start_date": "2020",
"end_date": "2023"
}]
4. Handling incomplete dates
Dates in CVs are highly variable.
Allowed examples:
"2020"
"Jan 2021"
"2021-2023"
"September 2019"
"Present"
Do not attempt to normalize aggressively.
Dates should remain human readable.
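One way to read "do not normalize aggressively" is to limit cleanup to whitespace and leave the wording untouched. The sketch below assumes that interpretation; `clean_date_text` is a hypothetical helper name.

```python
def clean_date_text(raw: str) -> str:
    """Collapse runs of whitespace but keep the date wording from the CV."""
    return " ".join(raw.split())


for raw in ["  Jan   2021 ", "September 2019", "Present"]:
    print(repr(clean_date_text(raw)))
```

The helper deliberately does not convert "Jan 2021" into an ISO date or split "2021-2023" into two values; that kind of conversion, if needed, can happen in a later step outside the parser.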
5. Experience extraction
Each professional role should become one element in experiences.
Example:
"experiences": [
{
"organization": "Example Company",
"role_title": "Senior Engineer",
"start_date": "2020",
"end_date": "2023",
"tasks": [],
"achievements": []
}
]
Descriptions should be split when possible:
- responsibilities → tasks
- measurable outcomes → achievements
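The split between tasks and achievements can be sketched with a simple heuristic. The rule used here (a bullet containing a digit counts as a measurable outcome) is a crude assumption for illustration, and `split_description` is a hypothetical helper; an LLM-based classifier would usually do better.

```python
import re
from typing import Dict, List

# Crude assumption: a bullet that contains a digit states a measurable outcome.
_HAS_NUMBER = re.compile(r"\d")


def split_description(bullets: List[str]) -> Dict[str, List[str]]:
    """Route each bullet to "tasks" or "achievements" without rewriting it."""
    tasks: List[str] = []
    achievements: List[str] = []
    for bullet in bullets:
        text = bullet.strip()
        if not text:
            continue  # skip empty lines left over from extraction
        if _HAS_NUMBER.search(text):
            achievements.append(text)
        else:
            tasks.append(text)
    return {"tasks": tasks, "achievements": achievements}


print(split_description([
    "Maintained the billing API",
    "Reduced page load time by 30%",
]))
```

The bullet text is copied as-is into one of the two lists, so the split never introduces information that was not in the CV.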
6. Education extraction
Each education entry becomes one element in education.
Fields to extract when available:
- school
- degree
- field
- dates
- location
7. Skills classification
Skills should be categorized when possible:
"skills": {
"it_skills": [],
"hard_skills": [],
"soft_skills": []
}
Guidelines:
| Category | Meaning |
|---|---|
| it_skills | programming, tools, software |
| hard_skills | professional capabilities |
| soft_skills | interpersonal or behavioral skills |
If classification is unclear, place the skill in hard_skills.
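The fallback rule can be sketched as follows. The two lookup sets are hypothetical placeholders (a real pipeline might use a taxonomy or a model), and `classify_skills` is an illustrative name.

```python
from typing import Dict, Iterable, List

# Hypothetical lookup tables, kept tiny for the example.
IT_SKILLS = {"python", "docker", "sql"}
SOFT_SKILLS = {"communication", "leadership", "teamwork"}


def classify_skills(skills: Iterable[str]) -> Dict[str, List[str]]:
    """Sort skills into the three categories, defaulting to hard_skills."""
    result: Dict[str, List[str]] = {
        "it_skills": [],
        "hard_skills": [],
        "soft_skills": [],
    }
    for skill in skills:
        key = skill.strip().lower()
        if key in IT_SKILLS:
            result["it_skills"].append(skill)
        elif key in SOFT_SKILLS:
            result["soft_skills"].append(skill)
        else:
            # Unclear classification: fall back to hard_skills.
            result["hard_skills"].append(skill)
    return result


print(classify_skills(["Python", "Leadership", "Contract negotiation"]))
```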
8. Position sought
The field position_sought describes the candidate’s professional target.
Examples:
"position_sought": [
"Full Stack Developer",
"Python Expert"
]
If the CV includes a headline or title, it should be mapped here.
9. Languages
Each language entry should include the language name and a proficiency level:
{
"language": "English",
"level": "Fluent"
}
Levels remain free text to allow different conventions (for example "Fluent", "Native", or a CEFR level such as "B2").
10. Certifications
Certifications should include issuer and dates when available.
Example:
{
"name": "AWS Certified Solutions Architect",
"issuer": "Amazon",
"date_obtained": "2022"
}
11. Project achievements
Major projects or consulting missions can be extracted into project_achievements.
Example:
{
"title": "ERP Implementation",
"client": "Manufacturing Group",
"role": "Project Manager",
"period": "2021-2022"
}
12. Metadata population
The meta block contains operational information about parsing.
Typical fields include:
"meta": {
"cv_uuid": "",
"processor_engine": "",
"ats_processed": false
}
AI systems should populate metadata only when known.
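A minimal sketch of "populate only when known", assuming the three fields shown above and their empty defaults. `build_meta` and the engine name in the usage line are hypothetical.

```python
from typing import Any, Dict, Optional


def build_meta(
    cv_uuid: Optional[str] = None,
    processor_engine: Optional[str] = None,
    ats_processed: Optional[bool] = None,
) -> Dict[str, Any]:
    """Fill meta fields only from values the pipeline actually knows."""
    return {
        "cv_uuid": cv_uuid or "",
        "processor_engine": processor_engine or "",
        # Assumption: False is the neutral default, as in the example above.
        "ats_processed": bool(ats_processed),
    }


print(build_meta(processor_engine="my-parser-1.0"))
```

Values come from the calling pipeline, not from the CV text, so the model has nothing to guess here.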
13. Extensions
Custom system information must be placed inside extensions.
Example:
"extensions": {
"ats_id": "12345",
"client_reference": "ABC"
}
This prevents breaking the core schema.
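As an illustration, custom fields can be attached through a small helper that never writes them at the root. `with_extensions` is a hypothetical name, and the sketch assumes `extensions` is a flat JSON object.

```python
from typing import Any, Dict


def with_extensions(cv: Dict[str, Any], custom: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cv with custom fields merged under "extensions"."""
    merged = dict(cv)  # shallow copy, so the caller's dict is not modified
    merged["extensions"] = {**cv.get("extensions", {}), **custom}
    return merged


cv = {"experiences": []}
print(with_extensions(cv, {"ats_id": "12345", "client_reference": "ABC"}))
```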
14. Output validation
After extraction, the generated JSON should:
- validate against barba-cv.schema.json
- contain only supported fields
- respect the root CV structure
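The "only supported fields" check can be approximated with the standard library alone. The sketch assumes that barba-cv.schema.json declares its root fields under the JSON Schema `properties` keyword; `unsupported_root_keys` and the toy schema are hypothetical. It is a partial check only: full validation needs a JSON Schema validator such as the `jsonschema` package.

```python
from typing import Any, Dict, List


def unsupported_root_keys(cv: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """List root keys of cv that the schema's "properties" does not declare."""
    declared = set(schema.get("properties", {}))
    return sorted(key for key in cv if key not in declared)


# Toy stand-in for the real schema, which would be loaded with json.load().
toy_schema = {"properties": {"experiences": {}, "skills": {}, "extensions": {}}}

print(unsupported_root_keys({"skills": {}, "ats_id": "12345"}, toy_schema))
```

An empty list means every root key is declared; anything else points at fields that probably belong under `extensions`.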
Summary
AI parsing with Barba-CV follows a simple rule:
Extract faithfully. Structure deterministically. Never invent data.
This ensures reliable CV parsing while keeping the schema compatible with real-world documents.