Wizard Docx¶

WizardDocx is a Python library for extracting text from Microsoft Word documents. It parses Word documents natively and can apply local OCR with Tesseract to embedded images or scanned pages inside DOCX files. Legacy DOC files are supported in read-only mode, without OCR.

Installation¶

Requires Python 3.9+.

pip install wizarddocx

Note

For OCR, install Tesseract.
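Before turning OCR on, it can help to confirm that a Tesseract executable is reachable. The sketch below assumes the library locates Tesseract through the system PATH, which this page does not state; `tesseract_available` is a hypothetical helper, not part of WizardDocx.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable named `binary` is found on PATH."""
    # shutil.which returns the full path of the executable, or None.
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH.")
    else:
        print("Tesseract not found; install it before passing ocr=True.")
```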

Quick start¶

import wizarddocx as wd

text = wd.extract_text("example.docx")
print(text)

Supported formats¶

Format	OCR option
DOC	Not available
DOCX	Optional
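The table above can be mirrored in a small lookup so that OCR arguments are only passed for formats that accept them. This is an illustrative sketch: `OCR_SUPPORT` and `ocr_kwargs` are hypothetical names, and only the `ocr` and `language_ocr` keyword names come from this page.

```python
from pathlib import Path
from typing import Dict

# Mirrors the "Supported formats" table: extension -> OCR available.
OCR_SUPPORT = {"doc": False, "docx": True}


def ocr_kwargs(filename: str, want_ocr: bool, language: str = "eng") -> Dict[str, object]:
    """Build OCR keyword arguments, dropping them for formats without OCR."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in OCR_SUPPORT:
        raise ValueError(f"unsupported format: {ext!r}")
    if want_ocr and OCR_SUPPORT[ext]:
        return {"ocr": True, "language_ocr": language}
    # DOC files, or OCR not requested: pass no OCR arguments at all.
    return {}
```

The result could then be unpacked into a call such as `wd.extract_text(path, **ocr_kwargs(path, want_ocr=True))`.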

Parameters¶

Parameter	Description
`input_data`	(str \| bytes \| Path) Source to extract from: path string, bytes, or `pathlib.Path`.
`extension`	(str, optional) File extension when `input_data` is bytes (e.g., `"doc"`, `"docx"`).
`pages`	(int \| str \| list[int\|str] \| None) Page selection (DOCX only). Use 1-based page numbers and ranges (`1`, `"2-5"`, `[1, "5-7"]`).
`ocr`	(bool, optional) Enable Tesseract OCR for embedded images and scanned pages in DOCX files. Defaults to `False`.
`language_ocr`	(str, optional) Tesseract language code. Defaults to `"eng"`.

Detailed parameters and examples¶

`input_data`¶

Accepts a filesystem path, a pathlib.Path, or raw bytes.

Path string

import wizarddocx as wd
text = wd.extract_text("docs/report.docx")

pathlib.Path

from pathlib import Path
import wizarddocx as wd
text = wd.extract_text(Path("docs/report.docx"))

Bytes (must set `extension`)

from pathlib import Path
import wizarddocx as wd
raw = Path("img.doc").read_bytes()
text = wd.extract_text(raw, extension="doc")

BytesIO (streams)

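import io
import wizarddocx as wd

# Read the file inside a context manager so the handle is closed.
with open("img.docx", "rb") as f:
    buf = io.BytesIO(f.read())
# Pass the stream's contents as bytes, together with the extension.
text = wd.extract_text(buf.getvalue(), extension="docx")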

`extension`¶

Required only when passing bytes. Indicates the file type.

Example

from pathlib import Path
import wizarddocx as wd
# read_bytes() opens and closes the file for us.
doc_bytes = Path("img.doc").read_bytes()
text = wd.extract_text(doc_bytes, extension="doc")

Warning

Passing bytes without extension raises a validation error.
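When only raw bytes are available and the original file name is lost, the value for `extension` can often be inferred from the leading bytes. The sketch below relies on two well-known container signatures: DOCX files are ZIP archives, and legacy DOC files are OLE2 compound files. `guess_extension` is a hypothetical helper, not part of WizardDocx, and a ZIP signature alone does not prove that the payload is a Word document.

```python
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header (DOCX container)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file (legacy DOC)


def guess_extension(data: bytes) -> Optional[str]:
    """Guess "docx" or "doc" from the leading bytes, or return None."""
    if data.startswith(ZIP_MAGIC):
        # Any ZIP archive matches here, not only DOCX.
        return "docx"
    if data.startswith(OLE2_MAGIC):
        # Other OLE2 files (for example legacy XLS) match here too.
        return "doc"
    return None
```

A caller might use `guess_extension(raw)` and fail early when it returns `None`, instead of letting the extraction call raise.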

`pages`¶

Select pages.

Accepted forms by format:

Paged formats use 1-based page numbers:

- single int: 1
- range string: "1-3"
- CSV string: "1,3,5-7"
- mixed list: [1, 3, "5-7"]

Invalid tokens and out-of-range pages are silently skipped.

Examples — paged

import wizarddocx as wd
page1  = wd.extract_text("docs/big.docx", pages=1)
subset = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
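To make the selection rules concrete, here is a minimal sketch of how such a page specification could be expanded into page numbers. It is not the library's implementation: `expand_pages` and `page_count` are hypothetical, and treating `None` as "all pages", dropping duplicates, and keeping first-seen order are assumptions.

```python
from typing import List


def expand_pages(spec, page_count: int) -> List[int]:
    """Expand an int, range/CSV string, or mixed list into 1-based pages."""
    if spec is None:
        # Assumption: no selection means every page.
        return list(range(1, page_count + 1))

    # Normalise a single int or string into a one-element list.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)

    # Split CSV strings into individual tokens such as "3" or "5-7".
    tokens = []
    for item in items:
        tokens.extend(part.strip() for part in str(item).split(","))

    selected: List[int] = []
    for token in tokens:
        start, sep, end = token.partition("-")
        try:
            first = int(start)
            last = int(end) if sep else first
        except ValueError:
            continue  # invalid token: silently skipped
        # Clamp to the document so out-of-range pages are skipped.
        for page in range(max(first, 1), min(last, page_count) + 1):
            if page not in selected:
                selected.append(page)
    return selected
```

For a 10-page document, `expand_pages([1, 3, "5-7"], 10)` yields pages 1, 3, 5, 6 and 7, while `expand_pages("2,x,99", 5)` keeps only page 2.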

`ocr` and `language_ocr`¶

Set ocr=True to run OCR on raster content (embedded images) and scanned documents. language_ocr sets the Tesseract recognition language.

Images

import wizarddocx as wd
img_txt = wd.extract_text("scan.docx", ocr=True)  # language_ocr defaults to "eng"

Scanned DOCX

import wizarddocx as wd
docx_txt = wd.extract_text("contract_scanned.docx", ocr=True, language_ocr="ita")

Returns¶

str — concatenated Unicode text from the selected pages.

Errors¶

Bytes without extension → validation error.
Unsupported or invalid input → domain-specific error.
Missing or unreadable file → I/O error.
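The exact exception classes are not named on this page, so the sketch below handles the three cases generically. It assumes the I/O error is an `OSError` subclass (as `FileNotFoundError` is) and catches everything else broadly; `try_extract` is a hypothetical wrapper that takes the extraction function as an argument.

```python
from typing import Any, Callable, Optional, Tuple


def try_extract(
    extract: Callable[..., str], source: Any, **kwargs: Any
) -> Tuple[Optional[str], Optional[str]]:
    """Call `extract` and return (text, None) or (None, error message)."""
    try:
        return extract(source, **kwargs), None
    except OSError as exc:
        # Assumption: missing or unreadable files surface as OSError.
        return None, f"I/O error: {exc}"
    except Exception as exc:
        # Validation and domain-specific errors; class names are not documented here.
        return None, f"{type(exc).__name__}: {exc}"
```

With the library installed, a call might look like `text, err = try_extract(wd.extract_text, "docs/report.docx")`.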

License¶

AGPL-3.0-or-later.

Resources¶

Contact & Author¶

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
