---
name: langextract-usage
description: How to use LangExtract to extract structured information from text. Use when writing code that calls lx.extract(), building extraction pipelines, defining examples, or troubleshooting alignment issues.
---
# LangExtract Usage
LangExtract extracts structured information from unstructured text using LLMs.
The main entry point is `lx.extract()`. Every extraction needs at least one
example and input text. `prompt_description` is optional but recommended.
## Install

```bash
pip install langextract

# For OpenAI model support:
pip install "langextract[openai]"
```
## API keys

```bash
# Gemini (checks GEMINI_API_KEY, then LANGEXTRACT_API_KEY)
export GEMINI_API_KEY="your_key"

# OpenAI (checks OPENAI_API_KEY, then LANGEXTRACT_API_KEY)
export OPENAI_API_KEY="your_key"

# Ollama: no API key needed, just a running Ollama server
```
Auto-routing and env-default resolution are tuned for GPT-style model IDs
(`gpt-4*`, `gpt-5*`). For OpenAI-compatible endpoints or non-GPT model IDs,
pass an explicit `ModelConfig`; see `references/providers.md`.
## Basic extraction

```python
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="Patient takes lisinopril 10mg daily for hypertension.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="lisinopril",
                attributes={"dose": "10mg", "frequency": "daily"},
            ),
            lx.data.Extraction(
                extraction_class="condition",
                extraction_text="hypertension",
                attributes={"status": "active"},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Patient is prescribed metformin 500mg twice daily.",
    prompt_description="Extract medications and conditions with attributes.",
    examples=examples,
    model_id="gemini-2.5-flash",
)

for e in result.extractions:
    print(e.extraction_class, e.extraction_text)
    print(f"  char_interval: {e.char_interval}")
    print(f"  attributes: {e.attributes}")
```
See `examples/basic_extraction.py` for a runnable version and
`examples/relationship_extraction.py` for using `extraction_class` +
`attributes` to encode relationships between entities.
## Writing good examples

Examples drive model behavior. Follow these rules:

1. `extraction_text` must be verbatim from the example text, not paraphrased
2. List extractions in order of appearance in the text
3. Each `extraction_class` should be consistent across examples
4. Include attributes that match what you want extracted at runtime
LangExtract can raise prompt-alignment warnings when an example's
`extraction_text` values don't align cleanly to the example's text
(alignment fails, or is only fuzzy/partial rather than exact). The
validator does not enforce rules 2–4 above; those are guidance to steer
the model's output, not checked invariants.
To fail fast on alignment issues during development, pass these kwargs
directly to `lx.extract()` (both default to permissive):

```python
from langextract.prompt_validation import PromptValidationLevel

result = lx.extract(
    ...,
    prompt_validation_level=PromptValidationLevel.ERROR,  # default: WARNING
    prompt_validation_strict=True,  # default: False
)
```
See `references/prompt-validation.md` for level semantics and strict-mode
behavior.
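Rules 1 and 2 (verbatim spans, order of appearance) can also be checked locally before any API call. Below is a minimal plain-Python sketch of such a pre-flight check; `check_example` is a hypothetical helper for illustration, not part of the LangExtract API:

```python
def check_example(text: str, extraction_texts: list[str]) -> list[str]:
    """Return problems found in an example: spans that are not verbatim
    in `text`, or that appear out of order. Hypothetical pre-flight
    helper; not a LangExtract API."""
    problems = []
    cursor = 0  # position of the furthest span seen so far (order check)
    for span in extraction_texts:
        pos = text.find(span)  # first occurrence; approximate if a span repeats
        if pos == -1:
            problems.append(f"not verbatim in text: {span!r}")
            continue
        if pos < cursor:
            problems.append(f"out of order: {span!r}")
        cursor = max(cursor, pos)
    return problems

# "Lisinopril" is capitalized, so it is not verbatim and gets flagged:
print(check_example(
    "Patient takes lisinopril 10mg daily.",
    ["Lisinopril", "10mg"],
))  # → ["not verbatim in text: 'Lisinopril'"]
```

Running a check like this over your examples before calling `lx.extract()` catches the most common alignment problem (paraphrased spans) without spending tokens.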
## Key parameters

```python
result = lx.extract(
    text_or_documents=text,       # str, URL, or list of Documents
    prompt_description=prompt,    # what to extract (optional)
    examples=examples,            # few-shot examples (required)
    model_id="gemini-2.5-flash",  # model to use
    extraction_passes=1,          # >1 for higher recall (costs more)
    max_char_buffer=1000,         # chunk size
    batch_length=10,              # chunks per batch
    max_workers=10,               # parallel workers (provider-dependent)
    context_window_chars=None,    # cross-chunk context for coreference
)
```
- `text_or_documents` accepts a URL string (`fetch_urls=True` by default).
- For full parallelism, keep `batch_length >= max_workers`. Parallel processing via `max_workers` is provider-dependent; some providers parallelize batched prompts, while others (such as the current Ollama provider) process them sequentially.
- `context_window_chars` includes characters from the previous chunk as context for the current one, which helps with coreference and entity continuity across chunk boundaries.
- For multiple documents, pass a list of `lx.data.Document`; see `examples/multiple_documents.py`.
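To make the chunking parameters concrete, here is a plain-Python sketch of how a context window can ride along with fixed-size chunks. This is illustrative only and an assumption about the mechanism; LangExtract's actual chunker is more sophisticated than character-exact slicing:

```python
def chunk_with_context(text: str, max_char_buffer: int,
                       context_window_chars: int = 0):
    """Split `text` into fixed-size chunks, pairing each chunk with the
    tail of the preceding text as read-only context. Illustrative sketch
    only; not LangExtract's real chunker."""
    pairs = []
    for start in range(0, len(text), max_char_buffer):
        context = text[max(0, start - context_window_chars):start]
        chunk = text[start:start + max_char_buffer]
        pairs.append((context, chunk))
    return pairs

# Each pair is (context_from_previous_text, chunk_to_extract_from):
print(chunk_with_context("abcdefghij", max_char_buffer=4,
                         context_window_chars=2))
# → [('', 'abcd'), ('cd', 'efgh'), ('gh', 'ij')]
```

The point of the prefix is that a pronoun or abbreviation near a chunk boundary still has its antecedent in view, at the cost of a few extra prompt characters per chunk.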
## Working with results

```python
# Filter to grounded extractions only (those with source positions)
grounded = [e for e in result.extractions if e.char_interval]

# Access source position
for e in grounded:
    start = e.char_interval.start_pos
    end = e.char_interval.end_pos
    matched_text = result.text[start:end]

# Save to JSONL
lx.io.save_annotated_documents(
    [result], output_name="results.jsonl", output_dir="."
)

# Visualize from JSONL (shows the first document)
html = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html, "data"):
        f.write(html.data)  # Jupyter/Colab
    else:
        f.write(html)

# Or visualize an AnnotatedDocument directly
html = lx.visualize(result)
```
## Provider selection

The default is Gemini. For other providers, see `references/providers.md`,
which covers:

- OpenAI (JSON mode; fence behavior auto-configured)
- Ollama (local models, `model_url`)
- `ModelConfig` for advanced `provider_kwargs` (custom `base_url`, etc.)
- Custom provider plugins via `router.register()`
## Common issues

**Extractions with `char_interval=None`**: the extraction could not be
located in the source text. Common causes include paraphrased output,
alignment misses, or hallucinated entities. Filter with
`[e for e in result.extractions if e.char_interval]`. To tune alignment,
see `references/resolver-params.md`.

**Prompt alignment warnings**: your examples have `extraction_text` that
doesn't match the example text verbatim. Fix the examples, or see
`references/prompt-validation.md` to fail fast during development.

**Slow on long documents**: `extraction_passes > 1` multiplies processing
time. Start with 1 and increase only if recall is insufficient.

**Model selection**: `gemini-2.5-flash` is recommended for most tasks;
`gemini-2.5-pro` for complex reasoning.
## Further reading

- `references/providers.md` — OpenAI, Ollama, `ModelConfig`, custom plugins
- `references/resolver-params.md` — fuzzy alignment tuning
- `references/prompt-validation.md` — catching example issues early
- `examples/` — runnable scripts for each scenario above