sLLM
by Indigo-Coder-github
This repository contains experiments and configurations for fine-tuning and quantizing language models, with a focus on the KorMedMCQA benchmark for Korean healthcare professional licensing examinations. It explores several approaches, including full fine-tuning, 4-bit quantization, and LoRA/QLoRA, to optimize model performance.
What is sLLM?
This repository provides experimental setups and configurations for fine-tuning and quantizing language models. It focuses on improving performance on the KorMedMCQA benchmark, which consists of multi-choice questions drawn from Korean healthcare professional licensing examinations, and compares techniques such as full fine-tuning, quantization, and LoRA/QLoRA to find the best-performing setup.
How to use sLLM?
The repository includes configurations for the different experimental setups. To use them, set up your environment with the required dependencies (e.g., DeepSpeed, Transformers), then fine-tune or quantize your chosen language model with the provided scripts and configurations (e.g., deepspeed_config.json and the training arguments). The README lists the specific hyperparameters used for each setup.
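A minimal, hypothetical sketch of how such a run might be wired together with Hugging Face Transformers and DeepSpeed is shown below. The checkpoint, dataset preparation, and hyperparameter values are placeholders, not the repository's actual settings; the README is the reference for those.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Hypothetical checkpoint; substitute the model actually used in this repository.
model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tiny dummy dataset standing in for a tokenized KorMedMCQA training split.
texts = ["placeholder multi-choice question and answer"] * 8
encodings = tokenizer(texts, truncation=True, padding="max_length", max_length=64)
encodings["labels"] = [ids.copy() for ids in encodings["input_ids"]]
train_dataset = Dataset.from_dict(dict(encodings))

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,      # assumed value
    gradient_accumulation_steps=4,      # assumed value
    learning_rate=2e-5,                 # assumed value
    num_train_epochs=3,                 # assumed value
    bf16=True,
    deepspeed="deepspeed_config.json",  # the DeepSpeed config shipped with the repo
)

# DeepSpeed runs are typically started with the `deepspeed` or `accelerate` launcher.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```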
Key features of sLLM
Full fine-tuning configurations
Quantization configurations (4-bit; see the loading sketch after this list)
LoRA/QLoRA configurations
DeepSpeed integration
Flash Attention 2 support
Hyperparameter settings for Gemma3 1B
Configurations for the KorMedMCQA benchmark
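For the quantized setups, a model can be loaded in 4-bit with Flash Attention 2 along these lines. This is a sketch only: the checkpoint and BitsAndBytesConfig values are assumptions rather than the repository's settings, and the flash-attn package must be installed separately.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed 4-bit NF4 quantization settings; the repository's actual config may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",                   # hypothetical checkpoint
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```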
Use cases of sLLM
Fine-tuning language models for Korean medical question answering
Quantizing language models for efficient inference
Experimenting with different fine-tuning techniques (full fine-tuning, LoRA, QLoRA)
Optimizing model performance on the KorMedMCQA benchmark
Reproducing the experimental results from the KorMedMCQA paper
FAQ from sLLM
What is KorMedMCQA?
KorMedMCQA is a multi-choice question answering benchmark for Korean healthcare professional licensing examinations.
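The benchmark is distributed on the Hugging Face Hub; loading it might look like the following. The dataset id, subset name, and split name are assumptions to verify against the official KorMedMCQA release.

```python
from datasets import load_dataset

# Assumed dataset id and subset; check the official release for exact names and splits.
kormedmcqa = load_dataset("sean0042/KorMedMCQA", "doctor")
print(kormedmcqa)              # shows the available splits
print(kormedmcqa["train"][0])  # one multi-choice question with its answer key
```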
What is DeepSpeed used for?
DeepSpeed is a distributed training library that reduces memory use and speeds up training of large language models, for example through ZeRO partitioning of optimizer states and gradients.
What is Flash Attention 2?
Flash Attention 2 is an optimized attention implementation that computes exact attention with lower memory use and higher throughput, especially for long sequences.
What is the purpose of quantization?
Quantization reduces the memory footprint and computational cost of language models by using lower precision data types.
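As a rough back-of-the-envelope illustration of the savings: the 1B parameter count below is an assumption chosen to match a Gemma3 1B-class model, and activations, optimizer state, and the KV cache are ignored.

```python
# Memory needed to store the model weights alone at different precisions.
PARAMS = 1_000_000_000  # assumed ~1B-parameter model

def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Weight storage in GB at the given precision."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(PARAMS, bits):.2f} GB")
```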
What are LoRA and QLoRA?
LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are parameter-efficient fine-tuning techniques that adapt pre-trained language models by training a small number of additional parameters.
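A minimal sketch of attaching a LoRA adapter with the PEFT library follows; the base checkpoint, rank, and target modules are assumptions, not this repository's settings. For QLoRA, the same adapter would be applied on top of a 4-bit-quantized base model like the one in the quantization sketch above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base model; the repository's checkpoints and LoRA hyperparameters may differ.
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")

lora_config = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```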