Wizard Docx¶
WizardDocx is a Python library for extracting text from Microsoft Word documents. It parses Word documents natively and can run local OCR with Tesseract on embedded images or scanned pages inside DOCX files. Legacy .doc files are supported in read-only mode, without OCR.
Installation¶
Requires Python 3.9+.
pip install wizarddocx
Note
For OCR, install Tesseract.
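Before turning on OCR, it can help to confirm that a Tesseract executable is reachable. The check below is a minimal sketch that assumes the binary is named `tesseract` and is discoverable through PATH; how WizardDocx itself locates Tesseract is not stated here, and `tesseract_available` is a hypothetical helper name.

```python
import shutil

def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which(binary) is not None

print("Tesseract found" if tesseract_available() else "Tesseract not found")
```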
Quick start¶
import wizarddocx as wd
text = wd.extract_text("example.docx")
print(text)
Supported formats¶
| Format | OCR option |
|---|---|
| DOC | Not available |
| DOCX | Optional |
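The table implies that an OCR request only makes sense for DOCX input. A small guard like the one below could drop the OCR flag for legacy .doc files before calling the library. `OCR_SUPPORT` and `effective_ocr` are hypothetical names used for illustration, and how WizardDocx reacts to `ocr=True` on a .doc file is not documented here.

```python
from pathlib import Path

# Mirrors the table above: OCR is optional for DOCX, unavailable for DOC.
OCR_SUPPORT = {"docx": True, "doc": False}

def effective_ocr(path, want_ocr: bool) -> bool:
    """Return the OCR flag to use for this file, based on its extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in OCR_SUPPORT:
        raise ValueError(f"unsupported format: {ext!r}")
    return want_ocr and OCR_SUPPORT[ext]

print(effective_ocr("scan.docx", True))   # True
print(effective_ocr("legacy.doc", True))  # False
```

The result can then be passed as the `ocr` argument of `extract_text`.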
Parameters¶
| Parameter | Description |
|---|---|
| input_data | (str, bytes, or Path) Source to extract from: path string, bytes, or pathlib.Path. |
| extension | (str, optional) File extension when input_data is bytes. |
| pages | (int, str, list of int or str, or None) Page selection (DOCX only). For paged formats, use numbers and ranges (for example 1, "1-3", or "1,3,5-7"). |
| ocr | (bool, optional) Enable Tesseract OCR for images and scanned DOCX. Defaults to … |
| language_ocr | (str, optional) Tesseract language code. Defaults to 'eng'. |
Detailed parameters and examples¶
input_data¶
Accepts a filesystem path, a pathlib.Path, or raw bytes.
Path string
import wizarddocx as wd
text = wd.extract_text("docs/report.docx")
pathlib.Path
from pathlib import Path
import wizarddocx as wd
text = wd.extract_text(Path("docs/report.docx"))
Bytes (must set extension)
from pathlib import Path
import wizarddocx as wd
raw = Path("img.doc").read_bytes()
text = wd.extract_text(raw, extension="doc")
BytesIO (streams)
import io
import wizarddocx as wd
with open("img.docx", "rb") as f:
    buf = io.BytesIO(f.read())
# Pass the buffer's contents as bytes, together with the extension
text = wd.extract_text(buf.getvalue(), extension="docx")
extension¶
Required only when passing bytes. Indicates the file type.
Example
import wizarddocx as wd
with open("img.doc", "rb") as f:
    doc_bytes = f.read()
text = wd.extract_text(doc_bytes, extension="doc")
Warning
Passing bytes without extension raises a validation error.
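When the bytes come from a file on disk, the extension can be derived from the file name instead of being typed by hand. The helper below is a sketch; `read_with_extension` is a hypothetical name, and it assumes `extension` is given without a leading dot, as in the example above.

```python
from pathlib import Path

def read_with_extension(path):
    """Read a file and return (raw_bytes, extension_without_dot)."""
    p = Path(path)
    # ".DOCX" -> "docx"; an empty suffix means the type cannot be inferred.
    ext = p.suffix.lower().lstrip(".")
    if not ext:
        raise ValueError(f"cannot infer extension from {p.name!r}")
    return p.read_bytes(), ext

# Usage (requires wizarddocx and an existing file):
# raw, ext = read_with_extension("img.doc")
# text = wd.extract_text(raw, extension=ext)
```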
pages¶
Select pages.
Accepted forms by format:
Paged (1-based):
- single int: 1
- range string: "1-3"
- CSV string: "1,3,5-7"
- mixed list: [1, 3, "5-7"]
Invalid tokens and out-of-range pages are silently skipped.
Examples — paged
import wizarddocx as wd
page1 = wd.extract_text("docs/big.docx", pages=1)
subset = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
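To make the accepted forms concrete, the sketch below normalizes a `pages` value into a sorted list of 1-based page numbers, skipping invalid tokens and out-of-range pages as described above. It illustrates the documented rules and is not WizardDocx's actual parser; `normalize_pages` and its `total_pages` argument are hypothetical, and treating `None` as "all pages" is an assumption.

```python
def normalize_pages(pages, total_pages):
    """Expand an int, "1-3", "1,3,5-7", or a mixed list into sorted 1-based pages."""
    if pages is None:
        # Assumption: no selection means every page.
        return list(range(1, total_pages + 1))
    items = pages if isinstance(pages, (list, tuple)) else [pages]
    selected = set()
    for item in items:
        # A string may hold several comma-separated tokens.
        for token in str(item).split(","):
            start, sep, end = token.strip().partition("-")
            try:
                lo = int(start)
                hi = int(end) if sep else lo
            except ValueError:
                continue  # invalid token: skip silently
            # Clamp to the pages that exist; out-of-range pages are dropped.
            selected.update(range(max(lo, 1), min(hi, total_pages) + 1))
    return sorted(selected)

print(normalize_pages([1, 3, "5-7"], total_pages=6))  # [1, 3, 5, 6]
print(normalize_pages("2,x,9", total_pages=4))        # [2]
```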
ocr and language_ocr¶
Set ocr=True to run Tesseract OCR on embedded images and scanned pages. language_ocr sets the Tesseract language code used for recognition.
Images
import wizarddocx as wd
img_txt = wd.extract_text("scan.docx", ocr=True)  # language_ocr defaults to 'eng'
Scanned DOCX
import wizarddocx as wd
docx_txt = wd.extract_text("contract_scanned.docx", ocr=True, language_ocr="ita")
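A language code such as "ita" only works if the matching Tesseract language data is installed. One way to check is to ask the Tesseract command-line tool for its language list. The sketch below assumes the `tesseract` binary is on PATH; `parse_langs` and `installed_languages` are hypothetical helpers and not part of WizardDocx.

```python
import subprocess

def parse_langs(output: str) -> set:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # The header and any warning lines contain spaces; language codes do not.
    return {ln for ln in lines if ln and " " not in ln}

def installed_languages(binary: str = "tesseract") -> set:
    """Return installed Tesseract language codes, or an empty set if unavailable."""
    try:
        result = subprocess.run([binary, "--list-langs"],
                                capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return set()
    # Some Tesseract builds print the list to stderr, so read both streams.
    return parse_langs(result.stdout + "\n" + result.stderr)

print("ita" in installed_languages())
```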
Returns¶
str — concatenated Unicode text from the selected pages.
Errors¶
- Bytes without extension → validation error.
- Unsupported or invalid input → domain-specific error.
- Missing or unreadable file → I/O error.
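The exception classes behind these errors are not named here, so callers may prefer to catch the same conditions before calling the library. The pre-check below is a sketch that uses built-in exceptions as stand-ins; `check_input` and `SUPPORTED` are hypothetical names, not part of WizardDocx.

```python
from pathlib import Path

SUPPORTED = {"doc", "docx"}

def check_input(input_data, extension=None):
    """Raise early for the documented error cases; return the resolved extension."""
    if isinstance(input_data, (bytes, bytearray)):
        # Bytes carry no file name, so the extension must be given explicitly.
        if not extension:
            raise ValueError("extension is required when passing bytes")
        ext = extension.lower().lstrip(".")
    else:
        path = Path(input_data)
        if not path.is_file():
            raise FileNotFoundError(f"missing or unreadable file: {path}")
        ext = path.suffix.lower().lstrip(".")
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported format: {ext!r}")
    return ext

print(check_input(b"raw bytes", extension="docx"))  # docx
```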