
Difference between Pre-Training and SFT

by info.odysseyx@gmail.com


Pre-training and supervised fine-tuning (SFT) differ in their goals, the datasets they use, and the number of GPUs they require. The most fundamental difference, however, lies in how the training itself proceeds:

Pre-training builds a model with randomly initialized parameters and then trains it on a large amount of unlabeled data to learn general features of the corpus. Fine-tuning, by contrast, loads the parameters of a pre-trained model and trains it on a small amount of high-quality labeled data. It retains the general features learned during pre-training while improving the model’s capability and performance on a specific task.

The parameters mentioned above include weights, biases, word embeddings, positional encodings, attention mechanism parameters, etc.
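To make the parameter categories concrete, here is a rough count for the default GPT-2 configuration, which the pre-training demo later in this article uses. This is a sketch that assumes the default hyperparameters of transformers.GPT2Config (vocabulary 50,257, context 1,024, hidden size 768, 12 layers) and an output head that shares its weights with the word embeddings.

```python
# Parameter count for the default GPT-2 configuration (the "small" model).
# Assumed hyperparameters: the defaults of transformers.GPT2Config.
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12


def linear(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * n_embd  # one scale vector and one shift vector

counts = {
    "word embeddings": vocab_size * n_embd,
    "positional embeddings": n_positions * n_embd,
    # Per layer: fused query/key/value projection plus output projection.
    "attention (all layers)": n_layer * (linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)),
    # Per layer: expand to 4x the hidden size, then project back.
    "feed-forward (all layers)": n_layer * (linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)),
    # Two layer norms per layer plus one final layer norm.
    "layer norms": (2 * n_layer + 1) * layer_norm,
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:28s}{n:>13,d}")
print(f"{'total':28s}{total:>13,d}")
```

Under these assumptions the total comes to roughly 124 million parameters, with the word embeddings alone accounting for almost a third.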

More detailed explanation

Pre-training aims to learn the basic structure and semantic features of language from large unsupervised datasets (e.g., text corpora). It typically involves the following steps:

  1. Random initialization of weights: The model’s parameters (weights, biases, etc.) are randomly initialized at the start of pre-training.
  2. Large datasets: Training uses massive amounts of unlabeled data.
  3. Learning general features: The model learns general features of a language by optimizing a loss function (e.g., cross-entropy loss for language models).
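The cross-entropy loss mentioned in the last step can be sketched in a few lines. The toy vocabulary and probabilities below are made up for illustration. The uniform baseline assumes GPT-2’s vocabulary of 50,257 tokens.

```python
import math


def next_token_cross_entropy(probs_per_position, target_ids):
    """Mean negative log-probability assigned to the true next token."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_position, target_ids)]
    return sum(losses) / len(losses)


# Toy vocabulary of 4 tokens; each row is a predicted distribution
# over the next token at one position.
predicted = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
]
targets = [0, 3, 2]  # the tokens that actually came next
loss = next_token_cross_entropy(predicted, targets)

# A model that spreads probability evenly over GPT-2's 50,257-token
# vocabulary would score ln(50257) per token.
uniform_loss = math.log(50257)
print(f"toy loss: {loss:.4f}, uniform baseline: {uniform_loss:.2f}")
```

The uniform baseline of about 10.8 gives a reference point for the training log later in this article, where the first logged loss (6.5057 at step 500) is already well below it.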

Key points of pre-training

  • Random initialization: All model parameters (weights, biases, etc.) are randomly set at the start of pretraining.
  • Big data: Training uses large unlabeled datasets.
  • General Features: The model learns the basic structure and semantic features of language, providing a good starting point for subsequent tasks.

Fine-tuning aims to optimize the performance of a model on a specific task using task-specific datasets. It typically involves the following steps:

  1. Loading pre-trained weights: The model’s weights and biases are loaded from a pre-trained model.
  2. Task-specific data: Training uses a dataset appropriate for the task.
  3. Optimizing task performance: The model adjusts its parameters by optimizing a loss function to improve performance on a specific task.

Key points of fine-tuning

  • Loading pre-trained weights: The model’s parameters are loaded from the pre-trained model, and the general features learned during pre-training are preserved.
  • Task-specific data: Training uses a dataset that fits the task at hand.
  • Optimizing task performance: The model’s parameters are further tuned to optimize its performance on a specific task.
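One way to see the difference between unlabeled and labeled data is in how the training targets are built. The sketch below assumes a common convention rather than anything confirmed by this article: in pre-training every token is a target, while in SFT the prompt positions are often masked with -100 (the label value that PyTorch’s cross-entropy loss ignores by default) so that only the response contributes to the loss. The token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pretraining_labels(token_ids):
    """Every token is a target: the model learns to predict all of the text."""
    return list(token_ids)


def sft_labels(prompt_ids, response_ids):
    """Only response tokens are targets; prompt positions are masked out."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


# Hypothetical token ids, for illustration only.
prompt = [11, 12, 13]
response = [21, 22]

print("pre-training labels:", pretraining_labels(prompt + response))
print("SFT labels:         ", sft_labels(prompt, response))
```

Not every SFT setup masks the prompt; some train on the full sequence, so treat this as one possible design choice.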

Summary

  1. Training Efficiency: Pre-training typically requires significant computational resources and time, as it requires learning all model parameters from a large dataset. Fine-tuning is relatively efficient, as it builds on top of a pre-trained model and only requires additional optimization for task-specific data.
  2. Model Performance: Pre-trained models have already learned general language features, so they can converge faster and perform better on specific tasks through fine-tuning. Training task-specific models from random initialization usually requires more data and time, and their performance may not match that of pre-training + fine-tuning approaches.
  3. Application Scenarios: Pre-trained models can be used as general-purpose base models suitable for various downstream tasks. Fine-tuning allows the model to quickly adapt to different task requirements without having to train it from scratch.

Pre-training code demo

Take GPT-2 as an example.

https://huggingface.co/docs/transformers/v4.44.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel

To pre-train GPT-2, you need the following classes: GPT2LMHeadModel and GPT2Config.

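import torch
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

# Build GPT-2 from a default config, so all weights start randomly initialized.
config = GPT2Config()
model = GPT2LMHeadModel(config)
# Only the tokenizer (vocabulary) is reused; no pre-trained weights are loaded.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")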

def tokenize_function(examples): 
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512, return_special_tokens_mask=True) 

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"]) 
print("Train dataset size:", len(tokenized_datasets["train"])) 
print("Validation dataset size:", len(tokenized_datasets["validation"])) 
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) 
training_args = TrainingArguments( 
    output_dir="./results", 
    overwrite_output_dir=True, 
    num_train_epochs=5, 
    per_device_train_batch_size=64, 
    save_steps=10_000, 
    save_total_limit=2, 
    remove_unused_columns=False, 
    report_to=[], 
    learning_rate=5e-4 
) 
 
trainer = Trainer( 
    model=model, 
    args=training_args, 
    data_collator=data_collator, 
    train_dataset=tokenized_datasets["train"], 
    eval_dataset=tokenized_datasets["validation"] 
) 
 
if torch.cuda.is_available(): 
    model.cuda() 

trainer.train() 

Because the model is small, pretraining can be done with a single H100 GPU.
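As a sanity check on the length of this run, the number of optimizer steps can be estimated from the training arguments. The sketch assumes 36,718 rows in the wikitext-2-raw-v1 train split, a single GPU, and no gradient accumulation. None of these figures are stated in the article itself.

```python
import math

# Assumptions: 36,718 rows in the wikitext-2-raw-v1 train split, one GPU,
# and no gradient accumulation.
train_rows = 36_718
per_device_train_batch_size = 64
num_train_epochs = 5

steps_per_epoch = math.ceil(train_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the run lasts 2,870 steps, which is consistent with the checkpoint-2870 directory loaded in the inference snippet later in this article.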

The training results are as follows:

Step    Training loss
500     6.505700
1000    5.657100
1500    5.269900
2000    4.972000
2500    4.725000
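The loss values are easier to interpret as perplexity, which is the exponential of the mean cross-entropy. The sketch below assumes the logged loss is measured in nats (natural logarithm), which is the PyTorch default.

```python
import math

# Training losses copied from the table above (mean cross-entropy in nats).
logged = {500: 6.5057, 1000: 5.6571, 1500: 5.2699, 2000: 4.9720, 2500: 4.7250}

perplexity = {step: math.exp(loss) for step, loss in logged.items()}
for step, ppl in perplexity.items():
    print(f"step {step:>4}: loss {logged[step]:.4f} -> perplexity {ppl:7.1f}")
```

A perplexity of about 113 at step 2500 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 113 tokens at each position. Note that this is training loss, not a held-out evaluation.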

The trained model can be used for inference validation.

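import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the final checkpoint written by the training run above.
model = GPT2LMHeadModel.from_pretrained("./results/checkpoint-2870")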
tokenizer = GPT2Tokenizer.from_pretrained("gpt2") 
tokenizer.pad_token = tokenizer.eos_token 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
model.to(device) 
model.eval() 
input_text = "Once upon a time" 
inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(device) 
with torch.no_grad(): 
    outputs = model.generate( 
        inputs.input_ids, 
        attention_mask=inputs.attention_mask, 
        max_length=100, 
        num_return_sequences=1, 
        no_repeat_ngram_size=2, 
        early_stopping=True, 
        temperature=0.7, 
        top_p=0.9, 
        do_sample=True, 
        pad_token_id=tokenizer.eos_token_id 
    ) 
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) 
print(generated_text) 

The inference results are as follows.

Once upon a time, a community was struck by an earthquake and a new government emerged, a military government that began as the “most prominent” one.
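The generate call above samples with temperature=0.7 and top_p=0.9. The following is a minimal pure-Python sketch of what those two settings do to a next-token distribution. It is an illustration of the idea, not the Transformers implementation, and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0, -3.0]  # hypothetical scores for a 4-token vocabulary
plain = softmax_with_temperature(logits, 1.0)
sharpened = softmax_with_temperature(logits, 0.7)
nucleus = top_p_filter(sharpened, 0.9)

print("T=1.0:", [round(p, 3) for p in plain])
print("T=0.7:", [round(p, 3) for p in sharpened])
print("top-p candidates:", {i: round(p, 3) for i, p in nucleus.items()})
```

A temperature below 1 concentrates probability on the likeliest tokens, and top-p then discards the unlikely tail before sampling.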

Fine-tuning code demo

When we talk about fine-tuning a model, we usually mean supervised fine-tuning (SFT). SFT can be divided into parameter-efficient fine-tuning (PEFT) and full fine-tuning. Among PEFT implementations, methods such as LoRA, QLoRA, and GA-LoRA are quite popular.

First, let’s look at how to load the model for full fine-tuning. We use AutoModelForCausalLM.from_pretrained, a class method that loads the parameters of a pre-trained model.

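from transformers import AutoModelForCausalLM

# model_name and attn_implementation are defined elsewhere in the full script.
model = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation=attn_implementation, device_map={"": 0}
)
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})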

See the repository for the full fine-tuning code.

https://github.com/davidsajare/david-share/tree/master/Deep-Learning/SmolLM-Full-Fine-Tuning

Next, let’s look at how the code differs between full fine-tuning, LoRA, and QLoRA. The three approaches differ in how the model is loaded and in the training parameters:

Differences in loading models

  • Full fine-tuning
    • Load the entire model directly and train it.
    • Use AutoModelForCausalLM.from_pretrained to load the model.
  • LoRA
    • After loading the model, perform parameter-efficient fine-tuning using a LoRA configuration.
    • Use LoraConfig from the peft library to configure the LoRA parameters.
    • The target modules are typically specific projection layers, such as k_proj, q_proj, etc.
  • QLoRA
    • Builds on LoRA and adds quantization (e.g., 4-bit quantization) to reduce memory usage.
    • Use BitsAndBytesConfig for the quantization configuration.
    • Call prepare_model_for_kbit_training to prepare the model.
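To give a feel for what 4-bit quantization means, here is a toy round trip using simple absmax rounding with a single shared scale. This is a deliberately simplified stand-in: it is not the quantization scheme that bitsandbytes applies in QLoRA, and the weight values are made up.

```python
def quantize_absmax_4bit(weights):
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -0.17, 0.03, -0.70, 0.28, 0.61]  # hypothetical weights
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("4-bit codes:", codes)
print("restored:   ", [round(r, 2) for r in restored])
print("max round-trip error:", round(max_error, 4))
```

Each weight is stored in 4 bits instead of 16, at the cost of a small rounding error bounded by half the scale. In QLoRA, the quantized base weights stay frozen and only the LoRA matrices are trained.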

Differences in training parameters

  • Full fine-tuning
    • Trains all model parameters.
    • Typically requires more memory and computing resources.
    • Uses a standard optimizer such as adamw_torch.
  • LoRA
    • Trains only the low-rank matrices inserted by LoRA, leaving the other parameters unchanged.
    • Trains faster and uses less memory.
    • Uses an optimizer such as paged_adamw_8bit.
  • QLoRA
    • Combines LoRA with quantization to reduce memory usage further.
    • Suitable for fine-tuning large-scale models in resource-constrained environments.
    • Also uses the paged_adamw_8bit optimizer.
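The statement that LoRA trains only the inserted low-rank matrices can be sketched in plain Python. The frozen weight W is left untouched and the adapter adds (alpha / r) * B(Ax) on top of it. The matrix sizes and values below are hypothetical, and the code is an illustration of the idea rather than the peft implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


# Tiny example: a frozen 3x4 weight with a rank-1 adapter.
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
A = [[0.5, 0.5, 0.0, 0.0]]   # r x d_in
B = [[0.0], [0.0], [0.0]]    # d_out x r, zero at the start of training
x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(W, A, B, x, alpha=2, r=1)  # equals W x while B is zero


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter."""
    return r * (d_in + d_out)


# Hypothetical 4096 x 4096 projection with rank 16.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, 16)
print(y, full, adapter, f"{adapter / full:.2%}")
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one. In this hypothetical case the adapter holds under 1% of the parameters of the layer it modifies, which is where the memory savings come from.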

When performing LoRA or QLoRA fine-tuning, you can specify which modules to train as follows:

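from unsloth import FastLanguageModel

# `model` is assumed to have been loaded already,
# for example with FastLanguageModel.from_pretrained.
model = FastLanguageModel.get_peft_model(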
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,
)
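The call above sets use_rslora = True together with r = 128 and lora_alpha = 32. Assuming Unsloth follows the PEFT convention for rank-stabilized LoRA, the adapter output is scaled by alpha / sqrt(r) instead of the classic alpha / r. The quick calculation below shows how much that changes the effective scale at this rank.

```python
import math

r, lora_alpha = 128, 32  # values from the get_peft_model call above

standard_scale = lora_alpha / r            # classic LoRA scaling
rslora_scale = lora_alpha / math.sqrt(r)   # rank-stabilized LoRA scaling

print(f"alpha / r       = {standard_scale:.4f}")
print(f"alpha / sqrt(r) = {rslora_scale:.4f}")
```

At a high rank such as 128, the classic scaling shrinks the adapter update considerably, while the rank-stabilized variant keeps it noticeably larger.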

For more information, see:

https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Continue-Pre-training

Distributed implementation of training

Pre-training large language models requires multi-node, multi-GPU setups, and therefore distributed training. At the lowest level, distributed pre-training can be implemented by calling NCCL directly. Higher-level tools such as Megatron, DeepSpeed, and Hugging Face’s Accelerate library (which currently supports FSDP) are also available. Together, these tools implement data parallelism, pipeline parallelism, and tensor parallelism (DP/PP/TP).
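The core idea of data parallelism fits in a few lines: each worker computes gradients on its own shard of the batch, and the gradients are averaged across workers (the all-reduce step that NCCL performs on real GPUs). The sketch below simulates two hypothetical workers in plain Python on a one-parameter model, with made-up data.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Two hypothetical workers, each holding an equal-sized shard of the batch.
shards = [data[:2], data[2:]]
local_grads = [gradient(w, shard) for shard in shards]

# Averaging the local gradients plays the role of the all-reduce.
averaged = sum(local_grads) / len(local_grads)
full_batch = gradient(w, data)

print("local gradients:", local_grads)
print("averaged:", averaged, "full batch:", full_batch)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so every worker applies the same update and the model replicas stay in sync.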

Megatron-DeepSpeed

For more information on pre-training with Megatron combined with DeepSpeed, see:

https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Megatron%2BDeepspeed-Pretrain-G…

DeepSpeed

For an example implementation of SFT using DeepSpeed, see:

https://github.com/davidsajare/david-share/tree/master/Multimodal-Models/DeepSpeed-FT-Stable-Diffusi…

Axolotl

Some open source fine-tuning tools, such as Axolotl, can now interface directly with DeepSpeed. For example, see:

https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Fine-tuning-with-Axolotl

Accelerate

When using FSDP with Accelerate, combining different parallel strategies can make training more efficient.

  1. Data Parallelism (DP)
    • FSDP itself is a data-parallel strategy that works by sharding model parameters (along with gradients and optimizer states) across devices.
  2. Pipeline Parallelism (PP)
    • The model can be divided into several stages, each of which runs on a different device. This requires manual partitioning of the model and management of the data flow.
  3. Tensor Parallelism (TP)
    • The computation of a single layer is distributed across multiple devices. This requires modifying the model’s computational graph.
  • Combining these strategies typically requires significant customization and tuning of the model and training script. Accelerate provides some tools to simplify this process, but specific implementations may require combining other PyTorch libraries (e.g., torch.distributed) and custom code.
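As a toy illustration of the tensor-parallel idea in the list above, the sketch below splits one layer’s weight matrix column-wise across two hypothetical devices and checks that the concatenated partial outputs equal the unsplit result. Real frameworks do this on GPU tensors with collective communication. Plain Python lists are used here only to show the arithmetic.

```python
def matmul(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Column-wise split across two hypothetical devices.
W_dev0 = [row[:2] for row in W]
W_dev1 = [row[2:] for row in W]

# Each device computes its slice of the output; concatenating the
# slices (an all-gather on real hardware) recovers the full result.
y_parallel = matmul(x, W_dev0) + matmul(x, W_dev1)
y_reference = matmul(x, W)

print(y_parallel, y_reference)
```

Each device stores and multiplies only half of the weight matrix, which is why tensor parallelism helps when a single layer is too large for one device.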

For an example of FSDP with Accelerate, see:

https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Llama-3.1-70B-FSDP-Fine-Tuning
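To see why sharding matters at this scale, here is a back-of-the-envelope estimate of per-GPU memory for model state under full sharding. It assumes full fine-tuning with Adam in mixed precision at roughly 16 bytes per parameter, and it ignores activations. The 70B figure is a round number, and the linked repository may use a different setup (for example, quantization or LoRA).

```python
def training_state_gib(n_params, n_gpus, bytes_per_param=16):
    """Per-GPU memory for weights, gradients, and Adam states when fully sharded.

    16 bytes per parameter assumes mixed precision: 2 (bf16 weights)
    + 2 (bf16 gradients) + 12 (fp32 master weights, momentum, variance).
    Activations and temporary buffers are not included.
    """
    return n_params * bytes_per_param / n_gpus / 2**30


n_params = 70e9  # hypothetical round figure for a 70B-parameter model
for n_gpus in (1, 8, 16):
    print(f"{n_gpus:>2} GPUs: {training_state_gib(n_params, n_gpus):8.1f} GiB per GPU")
```

Under these assumptions the model state alone exceeds 1,000 GiB, far more than any single GPU holds, and it only drops below 80 GiB per GPU once it is sharded across 16 devices.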

My GitHub: https://github.com/davidsajare/david-share.git




