
MCP Website Downloader

by angrysky56

MCP Website Downloader is a simple MCP server for downloading documentation websites and preparing them for Retrieval-Augmented Generation (RAG) indexing. It downloads complete documentation sites and organizes their assets for use with RAG systems.

What is MCP Website Downloader?

MCP Website Downloader is an MCP server that downloads documentation websites and prepares them for RAG indexing. It downloads website content, organizes assets, and creates an index for RAG systems.

How to use MCP Website Downloader?

  1. Fork and clone (or download) the repository.
  2. Install dependencies: create a virtual environment with uv venv, activate it (./venv/Scripts/activate on Windows), and run pip install -e .
  3. Configure the server in your claude_desktop_config.json file with the appropriate paths (a sample entry is sketched after this list).
  4. (Optional) Start the server manually with python -m mcp_windows_website_downloader.server --library docs_library.
  5. Use it through Claude Desktop or another MCP client by calling the 'download' tool with a URL.
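
For step 3, a minimal claude_desktop_config.json entry might look like the sketch below. The server name ("website-downloader") and the docs_library path are illustrative assumptions; the command and module path mirror step 4. Adjust both to your local setup.

  {
    "mcpServers": {
      "website-downloader": {
        "command": "python",
        "args": [
          "-m",
          "mcp_windows_website_downloader.server",
          "--library",
          "/path/to/docs_library"
        ]
      }
    }
  }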

Key features of MCP Website Downloader

  • Downloads complete documentation sites

  • Maintains link structure and navigation (partially)

  • Downloads and organizes assets (CSS, JS, images)

  • Creates an index for RAG systems

  • Simple single-purpose MCP interface

Use cases of MCP Website Downloader

  • Preparing documentation websites for RAG-based question answering systems

  • Creating local copies of documentation for offline access

  • Building custom knowledge bases from online documentation

  • Automating the process of extracting information from websites for AI applications

FAQ from MCP Website Downloader

What is the purpose of the rag_index.json file?

The rag_index.json file contains metadata about the downloaded website, including the URL, domain, number of pages, and path to the downloaded site. This information can be used by RAG systems to index and retrieve relevant content.
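As a rough sketch of how this metadata might be consumed, the snippet below loads rag_index.json before indexing. The file location and the field names (url, domain, pages, path) are assumptions based on the description above, not the server's documented schema.

  import json
  from pathlib import Path

  # Load the index metadata written alongside the downloaded site.
  # The path and field names are illustrative assumptions, not a documented schema.
  index_path = Path("docs_library") / "rag_index.json"
  metadata = json.loads(index_path.read_text(encoding="utf-8"))

  print("Source URL:", metadata.get("url"))
  print("Domain:", metadata.get("domain"))
  print("Pages downloaded:", metadata.get("pages"))
  print("Local path:", metadata.get("path"))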

What kind of error handling does the server have?

The server handles common issues such as invalid URLs, network errors, asset download failures, malformed HTML, deep recursion, and file system errors. It returns error responses in JSON format with a detailed error message.
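As a sketch of how a client might surface these errors, the snippet below uses the MCP Python SDK to start the server over stdio, call the 'download' tool, and check the result. The 'url' argument name and the exact shape of the error content are assumptions; the launch command mirrors step 4 above.

  import asyncio

  from mcp import ClientSession, StdioServerParameters
  from mcp.client.stdio import stdio_client

  async def main() -> None:
      # Launch the server over stdio, mirroring the optional command from the usage steps.
      server = StdioServerParameters(
          command="python",
          args=["-m", "mcp_windows_website_downloader.server", "--library", "docs_library"],
      )
      async with stdio_client(server) as (read, write):
          async with ClientSession(read, write) as session:
              await session.initialize()
              # "url" as the argument name is an assumption based on the usage notes above.
              result = await session.call_tool("download", {"url": "https://example.com/docs"})
              if result.isError:
                  # Failures (invalid URLs, network errors, etc.) come back as error results.
                  print("Download failed:", result.content)
              else:
                  print("Download finished:", result.content)

  asyncio.run(main())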

How does the server handle asset downloads?

The server downloads and organizes assets such as CSS, JS, and images. It attempts to maintain the original site structure and organizes assets by type.

What is the MCP architecture of the server?

The server follows a standard MCP architecture with separate modules for the server implementation (server.py), core downloader functionality (core.py), and helper utilities (utils.py).
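Based on that description and the module path used in the usage steps, the package layout presumably looks roughly like this (exact file locations are an assumption):

  mcp_windows_website_downloader/
    server.py   # MCP server implementation
    core.py     # core downloader functionality
    utils.py    # helper utilities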

How can I contribute to the project?

You can contribute by forking the repository, creating a feature branch, making your changes, and submitting a pull request.