MCP Server
by donphi
The MCP Server enables AI assistants to query and retrieve information from custom document collections using the Model Context Protocol. This extends the AI's knowledge base with specific content, overcoming knowledge cutoffs.
MCP Server for Document Processing
About Model Context Protocol (MCP)
The Model Context Protocol (MCP) is a new standard created by Anthropic to enable AI assistants to access external tools and data sources. This protocol allows AI models to extend their capabilities beyond their training data by connecting to specialized services like this MCP server.
By implementing the MCP standard, this server enables AI assistants to query and retrieve information from your custom document collection, effectively extending their knowledge base with your specific content.
Extend LLM Knowledge with Up-to-Date Information
This Model Context Protocol (MCP) server lets you overcome one of the biggest limitations of large language models: knowledge cutoffs. By creating your own MCP server, you can feed AI assistants up-to-date information about:
- Latest Framework Documentation: Use content not in LLM training data (React 19, Angular 17, Vue 3.4+, etc.)
- Private Codebases: Help AI assistants understand your proprietary code patterns and structures
- Technical Specifications: Import documentation on new APIs, protocols, or tools
Recommended Data Sources:
- FireCrawl.dev: A powerful tool for scraping documentation websites
- Official GitHub repositories: Download READMEs and documentation
- Technical blogs and tutorials: Save key articles as Markdown files
Architecture
The system consists of two main components:
- Processing Pipeline: Reads Markdown and text files, chunks them, generates embeddings, and stores them in a vector database.
- MCP Server: Exposes the processed content through MCP tools, allowing AI assistants to search and retrieve relevant information.
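To make the pipeline half of this concrete, the sketch below walks through the read, chunk, embed, and store flow. It assumes the sentence-transformers and chromadb packages and reuses illustrative values from the configuration section; it is a simplified illustration, not the repository's actual src/pipeline.py.

```python
# Simplified illustration of the pipeline flow; assumes sentence-transformers
# and chromadb, and is not the repository's actual src/pipeline.py.
from pathlib import Path
from sentence_transformers import SentenceTransformer
import chromadb

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="db").get_or_create_collection("docs")

for path in Path("data").glob("*.md"):
    text = path.read_text(encoding="utf-8")
    step = 800 - 120  # CHUNK_SIZE minus CHUNK_OVERLAP, counted in characters here
    chunks = [text[i:i + 800] for i in range(0, len(text), step)]
    if not chunks:
        continue
    collection.add(
        ids=[f"{path.name}:{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=model.encode(chunks).tolist(),
    )
```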
Example Use Cases
Upgrading AI Knowledge with Latest Framework Documentation
# Scrape latest React 19 docs using FireCrawl.dev
# Place the saved markdown files in the data/ directory
# Run the pipeline to process the documentation
# Now ask your AI assistant about React 19 features!
Using Private Codebase Documentation
# Export your API documentation as markdown
# Place the markdown files in the data/ directory
# Run the pipeline to process
# Now your AI assistant can help debug issues with your specific APIs!
Prerequisites
- Docker: Docker Desktop for Windows or Mac, or Docker Engine for Linux
- OpenAI API key (Optional): Can use free local embeddings instead
- AI assistant that supports MCP: Such as Roo or other compatible assistants
Setup
1. Clone this repository:
   git clone https://github.com/donphi/mcp-server.git
   cd mcp-server
2. Create a .env file with your configuration:
   # Copy the example file
   cp .env.example .env
   # Edit the file with your settings
   nano .env
   On Windows, you can use Notepad to edit the .env file.
3. Place your Markdown (.md) and text (.txt) files in the data/ directory.
Configuration
You can configure the MCP server using environment variables in the .env file:
# API Keys
OPENAI_API_KEY=your_openai_api_key_here # Optional - can use free local embeddings instead
ANTHROPIC_API_KEY=your_anthropic_api_key_here # Optional
# Pipeline Configuration
CHUNK_SIZE=800 # Size of text chunks
CHUNK_OVERLAP=120 # Overlap between chunks (in tokens)
BATCH_SIZE=10 # Batch size for embedding generation
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 # Model to use (see options below)
SUPPORTED_EXTENSIONS=.md,.txt,.pdf,.docx,.doc # Comma-separated list of supported file extensions
# Server Configuration
CLAUDE_MODEL=claude-3-7-sonnet-20240307 # Claude model to use
MAX_RESULTS=10 # Maximum number of results to return
USE_ANTHROPIC=true # Whether to use Anthropic API for responses
# Paths
DATA_DIR=/data # Directory containing input files
OUTPUT_DIR=/output # Directory for output files
DB_PATH=/db # Directory for vector database
CONFIG_PATH=/config/server_config.json # Path to server configuration file
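Inside the containers these values are ordinary environment variables. Below is a minimal sketch of reading them in Python, assuming the python-dotenv package; the actual pipeline may load its configuration differently.

```python
# Minimal sketch of reading the .env settings; assumes python-dotenv and is
# not necessarily how the pipeline itself loads configuration.
import os
from dotenv import load_dotenv

load_dotenv()  # copies key=value pairs from .env into the process environment

CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "800"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "120"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
SUPPORTED_EXTENSIONS = os.getenv("SUPPORTED_EXTENSIONS", ".md,.txt").split(",")
USE_ANTHROPIC = os.getenv("USE_ANTHROPIC", "false").lower() == "true"
```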
Embedding Models
The system supports multiple embedding models for converting text to vector representations:
Free Models (no API key required)
These models run locally within the Docker container and don't require any API keys:
- sentence-transformers/all-MiniLM-L6-v2: A compact model designed for sentence and short paragraph encoding, providing efficient embeddings suitable for rapid retrieval tasks.
- BAAI/bge-m3: A versatile model supporting multiple retrieval functionalities, over 100 languages, and inputs up to 8192 tokens, making it ideal for comprehensive retrieval tasks.
- Snowflake/snowflake-arctic-embed-m: Optimized for high-quality retrieval performance, this model balances accuracy and inference speed effectively.
Paid Models (require OpenAI API key)
- text-embedding-3-small: Optimized for speed and cost-effectiveness with good quality
- text-embedding-3-large: Highest quality embeddings (more expensive)
When you run the pipeline, you'll be prompted to choose which model to use. If you don't have an OpenAI API key, the system will automatically use one of the free local models.
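For reference, the two families of models are called quite differently in code. The sketch below shows both paths, assuming the sentence-transformers and openai packages; the model names match the options above, but the code itself is only illustrative.

```python
# Illustrative comparison of the two embedding options; assumes the
# sentence-transformers and openai packages (OPENAI_API_KEY must be set for
# the second path). The code is a sketch, not the pipeline's implementation.
from sentence_transformers import SentenceTransformer
from openai import OpenAI

texts = ["The MCP server extends an assistant's knowledge base."]

# Free local model -- runs inside the container, no API key required
local_vectors = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(texts)  # 384-dim

# Paid OpenAI model -- requires OPENAI_API_KEY in the environment
response = OpenAI().embeddings.create(model="text-embedding-3-small", input=texts)
openai_vectors = [item.embedding for item in response.data]  # 1536-dim
```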
Usage
Processing the Files
To process your files and generate embeddings:
docker-compose build pipeline
docker-compose run pipeline
On Windows, you can run these commands in Command Prompt or PowerShell after installing Docker Desktop.
This will:
- Prompt you to choose an embedding model
- Install necessary packages if needed
- Read all supported files in the data/ directory
- Process and chunk the content
- Generate embeddings
- Store the embeddings in the vector database (creates a chroma.sqlite3 file in the db/ directory)
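Once the run finishes, you can sanity-check the generated database with a few lines of Python. This assumes the chromadb package, and the collection name "docs" is a guess; substitute whatever collection name the pipeline actually uses.

```python
# Quick sanity check of the generated vector database; assumes chromadb, and
# the collection name "docs" is a guess rather than the pipeline's real name.
import chromadb

client = chromadb.PersistentClient(path="db")          # the directory containing chroma.sqlite3
collection = client.get_or_create_collection("docs")
print(collection.count(), "chunks stored")
print(collection.peek(limit=2)["documents"])           # show a couple of stored chunks
```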
IMPORTANT NEXT STEP: After processing your files, you MUST build the server before running it. See the next section.
Building the MCP Server
REQUIRED STEP: After processing your documents, you need to build the server component before running it:
docker-compose build server
Note for Windows users: This step is critical before running the MCP server. Without building the server image, you'll encounter an "invalid reference format" error when trying to run the server.
The updated run scripts for Linux/macOS will automatically build the server image if it's missing, but it's still recommended to build it manually for better performance and to avoid unexpected delays when first running the server.
Connecting to an MCP-Compatible AI Assistant
REMINDER: Before configuring your MCP server connection, make sure you've completed these steps:
1. Built the pipeline (docker-compose build pipeline)
2. Ran the pipeline (docker-compose run pipeline)
3. Built the server (docker-compose build server) - This step is critical and often missed!
The MCP server needs to be configured with your AI assistant. We provide scripts to generate the configuration:
For macOS/Linux:
1. Make the setup script executable and run it:
   chmod +x setup-mcpServer-json.sh
   ./setup-mcpServer-json.sh
2. This will create a mcp-config.json file with the correct configuration.
3. Add the configuration to your AI assistant.
For Windows:
1. Double-click on the setup-mcpServer-json.bat file or run it from Command Prompt:
   setup-mcpServer-json.bat
2. This will create a mcp-config.json file with the correct configuration.
3. Add the configuration to your AI assistant.
IMPORTANT FOR WINDOWS USERS: The run-mcp-server.bat file has been updated to use Docker Compose consistently, which resolves the "invalid reference format" error that some Windows users were experiencing. If you're still encountering this issue, make sure you're using the latest version of the batch file from this repository.
Example: Configuring with Roo
If you're using Roo as your AI assistant:
- Run the appropriate setup script for your platform to generate the configuration file
- In Roo, click the "MCP Server" button/tab in the sidebar
- Enable the "Enable MCP Servers" toggle
- Click "Edit MCP Settings"
- Copy and paste the entire contents of the mcp-config.json file
- Save the settings
Using the MCP Server
Once configured, you can use the MCP server with an AI assistant that supports MCP. With compatible assistants like Roo, you can use it in two ways:
1. Automatic mode (with autoQuery: true): Ask questions normally, and the AI will automatically check your vector database for relevant information.
   Example: "What are the key features of React 19?"
2. Explicit tool usage: Directly ask the AI to use a specific tool.
   Example: "Use the search_content tool to find information about React 19 Compiler."
MCP Tools
The MCP server exposes the following tools:
- read_md_files: Process and retrieve files. Parameters: file_path (optional path to a specific file or directory)
- search_content: Search across processed content. Parameters: query (required search query)
- get_context: Retrieve contextual information. Parameters: query (required context query), window_size (optional number of context items to retrieve)
- project_structure: Provide project structure information. No parameters.
- suggest_implementation: Generate implementation suggestions. Parameters: description (required description of what to implement)
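For orientation, this is roughly what exposing such a tool looks like with the official MCP Python SDK. It is a hedged sketch using FastMCP and chromadb, not the repository's actual src/server.py, and the collection name "docs" is an assumption.

```python
# Sketch of exposing a search tool over MCP; assumes the official `mcp` Python
# SDK (FastMCP) and chromadb, and is not the repository's actual server code.
import chromadb
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mcp-server")
collection = chromadb.PersistentClient(path="db").get_or_create_collection("docs")

@mcp.tool()
def search_content(query: str) -> str:
    """Search across processed content."""
    hits = collection.query(query_texts=[query], n_results=10)
    return "\n\n".join(hits["documents"][0])

if __name__ == "__main__":
    mcp.run()
```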
Supported File Types
By default, the following file types are supported:
- Markdown files (.md)
- Text files (.txt)
- PDF files (.pdf)
- Word documents (.docx, .doc)
You can configure additional file extensions by setting the SUPPORTED_EXTENSIONS environment variable in your .env file.
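Internally, each extension needs its own reader. Below is a minimal sketch of that dispatch, assuming the pypdf and python-docx packages; the pipeline's actual loaders may differ.

```python
# Sketch of per-extension document loading; assumes pypdf and python-docx,
# which may differ from the pipeline's actual loaders.
from pathlib import Path

def read_document(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix in (".md", ".txt"):
        return path.read_text(encoding="utf-8", errors="ignore")
    if suffix == ".pdf":
        from pypdf import PdfReader
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix in (".docx", ".doc"):
        from docx import Document  # python-docx; legacy .doc files are not fully supported
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    raise ValueError(f"Unsupported extension: {suffix}")
```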
Operational Modes
The MCP server can operate in two modes:
- Full Processing Mode: When the Anthropic API key is provided and USE_ANTHROPIC is set to true, the server will use Claude to generate responses based on the retrieved context.
- Context Retrieval Mode: When the Anthropic API key is not provided or USE_ANTHROPIC is set to false, the server will only retrieve and return the relevant context, allowing the client (e.g., the AI assistant) to process it using its own LLM.
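In code, the difference between the two modes comes down to a single branch. The sketch below assumes the anthropic package and the CLAUDE_MODEL value from the configuration example; the server's real logic may differ.

```python
# Sketch of the two operational modes; assumes the anthropic package and is
# only an illustration of the branch, not the server's actual implementation.
import os
import anthropic

def answer(question: str, context: str) -> str:
    if os.getenv("USE_ANTHROPIC", "false").lower() == "true" and os.getenv("ANTHROPIC_API_KEY"):
        # Full Processing Mode: Claude drafts the answer from the retrieved context
        message = anthropic.Anthropic().messages.create(
            model=os.getenv("CLAUDE_MODEL", "claude-3-7-sonnet-20240307"),
            max_tokens=1024,
            messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
        )
        return message.content[0].text
    # Context Retrieval Mode: hand the raw context back to the calling assistant
    return context
```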
Project Structure
mcp-server/
├── Dockerfile.pipeline
├── Dockerfile.server
├── docker-compose.yml
├── requirements.pipeline.txt
├── requirements.server.txt
├── README.md
├── .env.example
├── run-mcp-server.sh            # For macOS/Linux
├── run-mcp-server.bat           # For Windows
├── setup-mcpServer-json.sh      # Setup script for macOS/Linux
├── setup-mcpServer-json.bat     # Setup script for Windows
├── enhanced_chunking.py
├── inspect_chunks.py
├── run_chunk_analysis.sh
├── setup_enhanced_chunking.sh
├── visualize_chunks.py
├── restart_server.sh
├── chunk_analysis/              # Tools for analyzing chunking methods
│   ├── docker_entrypoint.sh
│   ├── docker-compose.yml
│   ├── Dockerfile
│   ├── inspect_chunks.py
│   ├── README.md
│   ├── run_tests.sh
│   ├── semi_interactive_chunking.py
│   └── test_chunking.py
├── src/
│   ├── pipeline.py
│   ├── server.py
│   └── utils/
│       ├── __init__.py
│       ├── chunking.py
│       ├── embedding.py
│       └── vector_db.py
├── config/
│   ├── pipeline_config.json
│   └── server_config.json
├── data/
│   └── README.md
├── output/
│   └── .gitkeep
└── db/
    └── .gitkeep
Troubleshooting
- Docker not found: Ensure Docker is installed and running. Check with docker --version.
- "Invalid reference format" error: This common error can occur for two reasons:
  - Missing build step: You tried to run the MCP server without building the server image first. Always run docker-compose build server before attempting to run the server.
  - Mixing Docker and Docker Compose: The Windows batch file has been updated to use Docker Compose consistently. If you're still encountering this error, make sure you're using the latest version of the batch files from this repository.
- API key issues: Not to worry! You can use the free local embedding models without any API keys.
- Missing sentence-transformers package: If you select a free model, the system will automatically install the required package.
- Chroma database not found: Make sure you've run the pipeline to process your documents first.
- Connection issues: Verify the path in your MCP configuration points to the correct location of the run script.
- Windows path issues: If you encounter path problems on Windows, ensure paths use double backslashes (\\) in the JSON configuration.
- Embedding model mismatch: The server automatically detects which model was used to create the database and uses the same model for retrieval.
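One way such detection can be implemented (an assumption, not necessarily how this repository does it) is to record the model name in the collection metadata when the database is created and read it back before serving queries:

```python
# Hypothetical model-mismatch check; assumes chromadb and a collection named
# "docs", and is not necessarily how this repository implements detection.
import chromadb

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection(
    "docs", metadata={"embedding_model": "sentence-transformers/all-MiniLM-L6-v2"}
)
model_used = (collection.metadata or {}).get("embedding_model")
print("Database was built with:", model_used)
```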
Document Chunking Issues
Inconsistent Chunking
If you notice inconsistent chunking between files, it may be due to:
- The document type detection system remembering previous decisions
- Missing spaCy dependencies
- Config file vs environment variable conflicts
Solutions:
- The pipeline automatically resets document type memory between runs
- Ensure spaCy is installed: pip install spacy && python -m spacy download en_core_web_md
- Verify .env and config files are consistent
PDF Processing
PDFs may not chunk properly if:
- The PDF contains scanned images rather than text
- The PDF has complex formatting
- Required dependencies are missing
Solutions:
- The pipeline has improved PDF handling with better diagnostics
- For scanned PDFs, consider pre-processing with OCR
- Install PyPDF: pip install pypdf
Advanced Configuration
For advanced use cases, the pipeline and server can be customized:
- Custom Embedding Functions: Create custom embedding logic (see the sketch after this list)
- Document Type Classification: Modify document type detection
- Chunking Behavior: Adjust chunking parameters for specific needs
- Chunk Analysis: Compare standard and enhanced chunking methods using the testing tools in /chunk_analysis:
  # First build the Docker container
  cd chunk_analysis
  docker-compose build
  # Then run the tests
  ./run_tests.sh
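As an example of the Custom Embedding Functions item above, Chroma accepts user-defined embedding functions. The sketch below assumes the chromadb and sentence-transformers packages; swap in whatever model or API you prefer.

```python
# Sketch of a custom Chroma embedding function; assumes chromadb and
# sentence-transformers, and is only a starting point for custom logic.
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class LocalEmbeddingFunction(EmbeddingFunction):
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        # Chroma passes a list of texts and expects a list of vectors back
        return self.model.encode(list(input)).tolist()
```

Such a function can then be passed to get_or_create_collection via its embedding_function argument when the collection is created.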
Chunking Strategies
The pipeline uses these document-specific chunking strategies:
- Scientific Papers: Split by sections, preserve references
- Financial Documents: Preserve tables and numerical sections
- Technical Documentation: Preserve code blocks and examples
- Narrative Text: Use semantic boundaries via spaCy NLP
- General: Balanced approach using section headers and semantic breaks
SpaCy is used as the preferred chunking method for all document types when available.
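As a concrete example of the spaCy-based approach, a sentence-boundary chunker can look like the sketch below. It assumes the en_core_web_md model mentioned in the troubleshooting section and counts characters rather than tokens for simplicity.

```python
# Sketch of sentence-boundary chunking with spaCy; assumes en_core_web_md and
# uses a character budget instead of the pipeline's token-based CHUNK_SIZE.
import spacy

nlp = spacy.load("en_core_web_md")

def semantic_chunks(text: str, max_chars: int = 800) -> list[str]:
    chunks, current = [], ""
    for sent in nlp(text).sents:  # spaCy's sentence segmentation
        if current and len(current) + len(sent.text) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sent.text + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```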
License
MIT
Created with ❤️ by donphi