Wizard Docx


WizardDocx is a Python library for extracting text from Microsoft Word documents. It parses DOCX files natively and can optionally run local OCR with Tesseract on embedded images or scanned pages inside a DOCX. Legacy .doc files are supported in read-only mode, without OCR.

Installation

Requires Python 3.9+.

pip install wizarddocx

Note

For OCR, install Tesseract.
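
Before enabling OCR it can help to confirm that Tesseract is actually installed. The sketch below looks for the standard `tesseract` executable on PATH and reads its version banner. It assumes WizardDocx finds Tesseract through PATH, which this README does not state; `tesseract_version` is a hypothetical helper, not part of the library.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable."""
    exe = shutil.which("tesseract")  # assumption: Tesseract is discovered via PATH
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Older Tesseract builds print the banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found on PATH")
```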

Quick start

import wizarddocx as wd

text = wd.extract_text("example.docx")
print(text)

Supported formats

Format    OCR option
DOC       Not available
DOCX      Optional
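
Since OCR is only offered for DOCX, a caller handling mixed inputs may want to request it only where it applies. The following is a minimal sketch: `OCR_AVAILABLE` and `ocr_kwargs` are hypothetical names, and how the library reacts to `ocr=True` on a .doc file is not documented here, so the helper simply omits the OCR arguments in that case.

```python
from pathlib import Path

# OCR availability per format, as listed under "Supported formats".
OCR_AVAILABLE = {"doc": False, "docx": True}


def ocr_kwargs(path, want_ocr: bool, language: str = "eng") -> dict:
    """Build keyword arguments for extract_text, dropping OCR where unavailable."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in OCR_AVAILABLE:
        raise ValueError(f"unsupported format: {ext!r}")
    if want_ocr and OCR_AVAILABLE[ext]:
        return {"ocr": True, "language_ocr": language}
    return {}
```

With the library installed, this could be used as `wd.extract_text(path, **ocr_kwargs(path, want_ocr=True))`.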

Parameters

Parameter       Description
input_data      (str | bytes | Path) Source to extract from: path string, bytes, or pathlib.Path.
extension       (str, optional) File extension when input_data is bytes (e.g., "doc", "docx").
pages           (int | str | list[int|str] | None) Page selection, DOCX only. Use 1-based numbers and ranges (1, "2-5", [1, "5-7"]).
ocr             (bool, optional) Enable Tesseract OCR for embedded images and scanned DOCX. Defaults to False.
language_ocr    (str, optional) Tesseract language code. Defaults to "eng".

Detailed parameters and examples

input_data

Accepts a filesystem path, a pathlib.Path, or raw bytes.

Path string

import wizarddocx as wd
text = wd.extract_text("docs/report.docx")

pathlib.Path

from pathlib import Path
import wizarddocx as wd
text = wd.extract_text(Path("docs/report.docx"))

Bytes (must set extension)

from pathlib import Path
import wizarddocx as wd
raw = Path("img.doc").read_bytes()
text = wd.extract_text(raw, extension="doc")

BytesIO (streams)

import io
import wizarddocx as wd
# Read the file into an in-memory stream, closing the handle afterwards
with open("img.docx", "rb") as f:
    buf = io.BytesIO(f.read())
text = wd.extract_text(buf.getvalue(), extension="docx")

extension

Required only when passing bytes. Indicates the file type.

Example

import wizarddocx as wd
# Use a context manager so the file handle is closed after reading
with open("img.doc", "rb") as f:
    doc_bytes = f.read()
text = wd.extract_text(doc_bytes, extension="doc")

Warning

Passing bytes without extension raises a validation error.
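
When bytes arrive without a filename (an upload or a database blob, for instance), the extension can often be inferred from the leading signature. The sketch below relies on two well-known facts: DOCX is a ZIP container and legacy DOC is an OLE compound file. `guess_extension` is a hypothetical helper, not part of WizardDocx, and the ZIP check is only a heuristic because other ZIP-based formats share the same signature.

```python
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"                      # DOCX is a ZIP container
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # legacy DOC is an OLE compound file


def guess_extension(data: bytes) -> Optional[str]:
    """Guess "docx" or "doc" from the first bytes, or return None if unknown."""
    if data.startswith(ZIP_MAGIC):
        return "docx"  # heuristic: any ZIP archive matches this signature
    if data.startswith(OLE_MAGIC):
        return "doc"
    return None
```

A caller could then pass `extension=guess_extension(raw)` to `wd.extract_text`, after handling the `None` case explicitly.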

pages

Select which pages to extract (DOCX only).

Accepted forms (page numbers are 1-based):

  • single int: 1

  • range string: "1-3"

  • CSV string: "1,3,5-7"

  • mixed list: [1, 3, "5-7"]

Invalid tokens and out-of-range pages are silently skipped.

Examples (paged)

import wizarddocx as wd
page1  = wd.extract_text("docs/big.docx", pages=1)
subset = wd.extract_text("docs/big.docx", pages=[1, 3, "5-7"])
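
To make the selection rules concrete, here is a minimal sketch of how a page spec could be expanded into page numbers. It is an illustration of the documented behaviour, not the library's actual parser: `resolve_pages` and `page_count` are hypothetical, and treating `None` as "all pages" is an assumption.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]], None]


def resolve_pages(spec: PageSpec, page_count: int) -> List[int]:
    """Expand a page spec into sorted, unique 1-based page numbers."""
    if spec is None:
        return list(range(1, page_count + 1))  # assumption: None selects every page
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    selected = set()
    for item in items:
        # A string may hold several comma-separated tokens, e.g. "1,3,5-7".
        tokens = [item] if isinstance(item, int) else str(item).split(",")
        for token in tokens:
            if isinstance(token, int):
                first = last = token
            else:
                start, sep, end = token.strip().partition("-")
                try:
                    first = int(start)
                    last = int(end) if sep else first
                except ValueError:
                    continue  # invalid token: skipped silently
            for page in range(first, last + 1):
                if 1 <= page <= page_count:  # out-of-range pages are dropped
                    selected.add(page)
    return sorted(selected)


if __name__ == "__main__":
    print(resolve_pages([1, 3, "5-7"], page_count=6))
```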

ocr and language_ocr

Enable OCR for raster content and scanned documents. language_ocr selects the recognition language and defaults to "eng".

Images

import wizarddocx as wd
img_txt = wd.extract_text("scan.docx", ocr=True)  # language_ocr defaults to "eng"

Scanned DOCX

import wizarddocx as wd
docx_txt = wd.extract_text("contract_scanned.docx", ocr=True, language_ocr="ita")
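
OCR is slower than native parsing, so one possible pattern is to try a plain extraction first and retry with OCR only when nothing comes back. The sketch below assumes that a scanned DOCX yields empty or whitespace-only text without OCR, which this README does not confirm. `extract_with_ocr_fallback` is a hypothetical helper; the extractor is passed in as an argument so the function can be tried without the library installed.

```python
from typing import Callable


def extract_with_ocr_fallback(
    extract: Callable[..., str], source, language: str = "eng"
) -> str:
    """Try native parsing first; retry with OCR only if no text came back."""
    text = extract(source)
    if text.strip():
        return text
    # Nothing extracted natively: assume scanned content and enable OCR.
    return extract(source, ocr=True, language_ocr=language)
```

With the library installed, a call might look like `extract_with_ocr_fallback(wd.extract_text, "contract_scanned.docx", language="ita")`.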

Returns

str — concatenated Unicode text from the selected pages.

Errors

  • Bytes without extension → validation error.

  • Unsupported or invalid input → domain-specific error.

  • Missing or unreadable file → I/O error.
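
The exception classes behind these categories are not named here, so the following sketch stays generic. It assumes I/O failures surface as `OSError` subclasses (as Python's own `FileNotFoundError` does) and reports any other exception by its class name. `try_extract` is a hypothetical wrapper, and the extractor is injected so the sketch runs without the library.

```python
from typing import Callable, Optional, Tuple


def try_extract(
    extract: Callable[..., str], source, **kwargs
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, reason) on failure."""
    try:
        return extract(source, **kwargs), None
    except OSError as exc:
        # Assumption: missing or unreadable files raise an OSError subclass.
        return None, f"io: {exc}"
    except Exception as exc:
        # Validation and domain-specific errors: exact classes are not documented.
        return None, f"{type(exc).__name__}: {exc}"
```

For example, `text, err = try_extract(wd.extract_text, raw, extension="docx")` lets a batch job log `err` and continue with the next document.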

License

AGPL-3.0-or-later.

Resources

Contact & Author

Author: Mattia Rubino

Email: textwizard.dev@gmail.com