No description

Find a file

gdelaunay 55a45dcaa7 fix build		2026-05-17 22:17:32 +02:00
config	init	2026-05-07 15:38:10 +02:00
cv_tool	fix build	2026-05-17 22:17:32 +02:00
schemas	init	2026-05-07 15:38:10 +02:00
tests	init	2026-05-07 15:38:10 +02:00
.env.example	fix build	2026-05-17 22:17:32 +02:00
.gitignore	fix build	2026-05-17 22:17:32 +02:00
pyproject.toml	fix build	2026-05-17 22:17:32 +02:00
README.md	init	2026-05-07 15:38:10 +02:00
uv.lock	fix build	2026-05-17 22:17:32 +02:00

README.md

Projet CV — PDF to Structured JSON Converter

Pitch

Un outil CLI Python qui prend un CV en PDF, en extrait le texte (OCR si besoin), le structure via un LLM local (LM Studio) en JSON selon un schéma fourni par l'utilisateur, et exporte le résultat.

Objectif

Transformer n'importe quel CV PDF (scanné ou natif) en données JSON structurées, selon un schéma personnalisable.

Cas d'usage

Convertir un CV PDF en JSON exploitable
Alimenter une base de données de profils depuis des CVs
Générer des versions adaptées (LinkedIn, ATS, etc.)
Export JSON pour Power Platform / Power Automate

Stack technique

Python 3.11+
PDF extraction : pdfplumber (texte natif) + fallback pdf2image + pytesseract (scans)
LLM local : LM Studio (API compatible OpenAI) — aucune clé cloud nécessaire
Export : JSON (principal), Word/docx (bonus optionnel)
Schéma JSON : fourni par l'utilisateur (fichier YAML ou inline), pas de modèle hardcodé

Architecture

CV.pdf
  │
  ├─ PDF natif (texte) ──► pdfplumber ──► text brut
  │
  └─ PDF scanné (image) ──► pdf2image ──► images ──► pytesseract ──► text brut
                                │
                                ▼
                         text brut nettoyé
                                │
                                ▼
              ┌───────────────────────────────┐
              │  Schéma JSON utilisateur       │
              │  + Prompt LLM local            │
              │  (LM Studio / compatible OAI)  │
              └───────────────────────────────┘
                                │
                                ▼
                         JSON structuré
                                │
                    ┌───────────┼───────────┐
                    ▼           ▼           ▼
              export JSON   export JSON  (bonus)
              stdout        fichier     python-docx

Schéma JSON — Feature principale

Le schéma JSON n'est JAMAIS hardcodé. L'utilisateur le fournit de 3 manières :

Fichier YAML (recommandé) : cv-tool cv.pdf --schema my-schema.yaml
Fichier JSON : cv-tool cv.pdf --schema my-schema.json
Inline : cv-tool cv.pdf --schema '{"personal": {...}}'

Format YAML supporté :

personal:
  full_name: string
  email: string
  phone: string
  location: string
  title: string
experience:
  - company: string
    title: string
    dates: string
    description: string
education:
  - institution: string
    degree: string
    field: string
skills:
  - name: string
    level: string
custom_section:
  label: "Section perso"
  items:
    - field: string
      value: string

Le LLM utilise ce schéma comme contrainte de sortie. Le validateur vérifie la conformité.

Étapes de développement

Phase 1 — Extraction PDF (MVP)

Détecter si PDF est texte natif ou scan
Extraire le texte (pdfplumber pour natif)
Fallback OCR (pdf2image + pytesseract pour scans)
Nettoyage du texte (normalisation, suppression espaces multiples)
CLI basique : cv-tool cv.pdf --extract → stdout

Phase 2 — Structuration LLM local

Interface avec LM Studio (URL + clé via .env)
Chargement du schéma JSON (YAML/JSON/inline)
Construction du prompt système avec schéma
Validation du JSON produit (pydantic)
CLI : cv-tool cv.pdf --schema schema.yaml → JSON

Phase 3 — Export & Polish

Export JSON stdout / fichier
Option --format docx (bonus, python-docx)
Fichier .env (LM_STUDIO_URL, LM_STUDIO_API_KEY)
Logging, erreurs, verbosity
Tests unitaires
Packaging

Fichiers du projet

cv-tool/
├── README.md              ← ce fichier
├── pyproject.toml         ← config Python
├── .env.example           ← variables d'environnement LM Studio
├── cv_tool/
│   ├── __init__.py
│   ├── cli.py             ← interface CLI (click)
│   ├── extractor.py       ← extraction PDF + OCR
│   ├── llm.py             ← structuration via LLM local
│   ├── schema.py          ← chargement/validation schéma JSON
│   ├── validator.py       ← validation du JSON produit
│   └── template.py        ← export docx (bonus)
├── config/
│   └── default.yaml       ← config par défaut
└── tests/
    ├── test_extractor.py
    ├── test_llm.py
    └── test_schema.py

Configuration

.env (LM Studio)

LM_STUDIO_URL=http://localhost:1234/v1
LM_STUDIO_API_KEY=your-key-here

default.yaml

llm:
  url: ${LM_STUDIO_URL}        # depuis .env
  api_key: ${LM_STUDIO_API_KEY} # depuis .env
  model: ""                    # modèle actif dans LM Studio
  temperature: 0.1

ocr:
  engine: "tesseract"           # tesseract ou easyocr
  dpi: 300
  lang: "fra+eng"

output:
  format: "json"               # json ou docx
  indent: 2

Notes techniques

OCR : Tesseract en 2026

Tesseract 5.3+ reste pertinent pour l'OCR open-source.

Fallback : Tesseract (léger, intégré)
Option premium : EasyOCR / PaddleOCR (plus précis, plus lourd)

PDF text extraction

pdfplumber : meilleur pour préserver la structure (tables, colonnes)
Stratégie : pdfplumber d'abord, si < 10 chars/page → fallback OCR

LLM local : LM Studio

LM Studio expose une API compatible OpenAI sur http://localhost:1234/v1
Tout modèle chargé dans LM Studio est utilisable (Llama, Mistral, Phi, etc.)
Aucune dépendance Python lourde (pas de transformers, pas de GGUF)
Compatible avec le client openai Python (juste changer base_url)

Pourquoi pas Word ?

python-docx est fragile (mises en forme perdues, templates cassés)
Power Platform / Power Automate est fait pour le templating
L'export JSON est la vraie valeur : le consommateur gère le formatage