Difference between Pre-Training and SFT

Pre-training and supervised fine-tuning (SFT) differ in their goals, the datasets they use, and the amount of GPU compute they require. To explain how they differ in terms of the nature of deep learning training, let's look at both.

Pre-training involves randomly initializing model parameters, building a model, and then training it on a large amount of unlabeled data so that it learns general features of the corpus. Fine-tuning, on the other hand, loads the parameters of a pre-trained model and trains it on a small amount of high-quality labeled data while retaining the general features learned during pre-training, thereby improving the model's capability and performance on a specific task.

The parameters mentioned above include weights, biases, word embeddings, positional encodings, attention parameters, and so on.

More detailed explanation

Pre-training

Pre-training aims to learn the basic structure and semantic features of language from large unlabeled datasets (e.g., text corpora). It typically involves the following steps:

- Random initialization of weights: the model's parameters (weights, biases, etc.) are randomly initialized at the start of pre-training.
- Large datasets: training uses massive amounts of unlabeled data.
- Learning general features: the model learns general features of the language by optimizing a loss function (e.g., cross-entropy loss for language models).

Key points of pre-training

- Random initialization: all model parameters (weights, biases, etc.) are randomly set at the start of pre-training.
- Big data: training uses large unlabeled datasets.
- General features: the model learns the basic structure and semantic features of language, providing a good starting point for subsequent tasks.

Fine-tuning

Fine-tuning aims to optimize the performance of a model on a specific task using a task-specific dataset. It typically involves the following steps:

- Loading pre-trained weights: the model's weights and biases are loaded from a pre-trained model.
- Task-specific data: training uses a dataset appropriate for the target task.
- Task optimization: the model adjusts its parameters by optimizing a loss function to improve performance on the specific task.

Key points of fine-tuning

- Loading pre-trained weights: the model's parameters are loaded from the pre-trained model, so the general features learned during pre-training are preserved.
- Task-specific data: training uses a dataset that fits the task at hand.
- Task optimization: the model's parameters are further tuned to optimize performance on the specific task.

Summary

- Training efficiency: pre-training typically requires significant computational resources and time because all model parameters must be learned from a large dataset. Fine-tuning is relatively efficient because it builds on a pre-trained model and only requires additional optimization on task-specific data.
- Model performance: pre-trained models have already learned general language features, so they converge faster and perform better on specific tasks after fine-tuning. Training a task-specific model from random initialization usually requires more data and time, and its performance may not match the pre-training + fine-tuning approach.
- Application scenarios: pre-trained models can be used as general-purpose base models suitable for a variety of downstream tasks. Fine-tuning allows the model to adapt quickly to different task requirements without having to train it from scratch.
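To make the contrast concrete before the code demos, here is a minimal sketch (using the same GPT-2 classes as the demo below) of the two starting points: random initialization for pre-training versus loading pre-trained weights for fine-tuning. The printed parameter names cover the word embeddings, positional encodings, and attention parameters mentioned above.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Pre-training: parameters start from random initialization
pretrain_model = GPT2LMHeadModel(GPT2Config())

# Fine-tuning: parameters are loaded from an already pre-trained checkpoint
finetune_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Both models have exactly the same parameter set; only the starting values differ.
# The first few entries include transformer.wte.weight (word embeddings),
# transformer.wpe.weight (positional encodings), and transformer.h.0.attn.* (attention parameters).
for name, param in list(finetune_model.named_parameters())[:6]:
    print(name, tuple(param.shape))
```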
Pre-training code demo

Take GPT-2 as an example: https://huggingface.co/docs/transformers/v4.44.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel

To pre-train GPT-2 from scratch, you need the GPT2LMHeadModel and GPT2Config classes.

```python
import torch
from datasets import load_dataset
from transformers import (
    GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Build a randomly initialized GPT-2 model from a default configuration
config = GPT2Config()
model = GPT2LMHeadModel(config)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Unlabeled corpus used for pre-training
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
        return_special_tokens_mask=True,
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print("Train dataset size:", len(tokenized_datasets["train"]))
print("Validation dataset size:", len(tokenized_datasets["validation"]))

# Causal language modeling objective (mlm=False)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    remove_unused_columns=False,
    report_to=[],
    learning_rate=5e-4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

if torch.cuda.is_available():
    model.cuda()

trainer.train()
```

Because the model is small, pre-training can be done on a single H100 GPU. The training results are as follows:

| Step | Training loss |
|------|---------------|
| 500  | 6.505700 |
| 1000 | 5.657100 |
| 1500 | 5.269900 |
| 2000 | 4.972000 |
| 2500 | 4.725000 |

The trained model can then be used for inference validation:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the checkpoint produced by pre-training
model = GPT2LMHeadModel.from_pretrained("./results/checkpoint-2870")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=100,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        early_stopping=True,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

The inference result looks like this:

"Once upon a time, a community was struck by an earthquake and a new government emerged, a military government that began as the 'most prominent' one."

Fine-tuning code demo

When we fine-tune a model, we usually mean supervised fine-tuning (SFT). SFT can be divided into parameter-efficient fine-tuning (PEFT) and full fine-tuning. Among PEFT implementations, methods such as LoRA, QLoRA, and GA-LoRA are quite popular.

First, let's look at how to load the model for full fine-tuning. We use AutoModelForCausalLM.from_pretrained, which loads the parameters of a pre-trained model:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation=attn_implementation,
    device_map={"": 0}
)
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': True})
```

See the repository for the full fine-tuning code: https://github.com/davidsajare/david-share/tree/master/Deep-Learning/SmolLM-Full-Fine-Tuning
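As a rough outline of what such a full fine-tuning run can look like (a minimal sketch, not the repository's exact code: the SmolLM-135M checkpoint, the Alpaca dataset, and the hyperparameters are illustrative assumptions), every parameter of the pre-trained model is updated on a small labeled dataset:

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_name = "HuggingFaceTB/SmolLM-135M"  # assumed small base model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Full fine-tuning: load the pre-trained weights and update all of them
model = AutoModelForCausalLM.from_pretrained(model_name)

# Small, high-quality labeled (instruction) dataset -- illustrative choice
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1%]")

def format_and_tokenize(examples):
    texts = [
        f"### Instruction:\n{ins}\n\n### Response:\n{out}{tokenizer.eos_token}"
        for ins, out in zip(examples["instruction"], examples["output"])
    ]
    return tokenizer(texts, truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(format_and_tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./sft-results",
        num_train_epochs=1,
        per_device_train_batch_size=8,
        learning_rate=2e-5,   # much smaller than the pre-training learning rate
        report_to=[],
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

Note the much smaller learning rate than in the pre-training demo: fine-tuning only nudges parameters that already encode general language features.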
Next, let's look at how the code differs between full fine-tuning, LoRA, and QLoRA. In terms of model loading and training parameters, they differ as follows.

Differences in model loading

Full fine-tuning:
- The entire model is loaded and trained directly.
- Uses AutoModelForCausalLM.from_pretrained to load the model.

LoRA:
- After loading the model, parameter-efficient fine-tuning is performed with a LoRA configuration.
- Uses LoraConfig from the peft library to configure the LoRA parameters.
- The target modules are typically specific projection layers, such as k_proj, q_proj, etc.

QLoRA:
- Builds on LoRA and adds quantization (e.g., 4-bit quantization) to reduce memory usage.
- Uses BitsAndBytesConfig for the quantization configuration.
- Calls prepare_model_for_kbit_training to prepare the model.

Differences in training parameters

Full fine-tuning:
- Trains all model parameters.
- Typically requires more memory and compute.
- Uses standard optimizers such as torch's AdamW.

LoRA:
- Trains only the low-rank matrices inserted by LoRA, leaving all other parameters unchanged.
- Trains faster and uses less memory.
- Often uses an optimizer such as paged_adamw_8bit.

QLoRA:
- Combines LoRA with quantization to reduce memory usage even further.
- Suitable for fine-tuning large models in resource-constrained environments.
- Also uses the paged_adamw_8bit optimizer.

When performing LoRA or QLoRA fine-tuning, you can specify which modules to train, for example with Unsloth's FastLanguageModel:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],  # add embed_tokens/lm_head for continual pre-training
    lora_alpha=32,
    lora_dropout=0,       # supports any value, but 0 is optimized
    bias="none",          # supports any value, but "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=True,
)
```

For more information, see: https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Continue-Pre-training
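For comparison with the Unsloth example above, the QLoRA loading path described earlier (a 4-bit quantized base model via BitsAndBytesConfig, prepared with prepare_model_for_kbit_training, plus LoRA adapters from a LoraConfig) can be sketched with the peft and bitsandbytes APIs roughly as follows. This is an illustrative sketch, not code from the repository; the base model name and LoRA hyperparameters are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "HuggingFaceTB/SmolLM-135M"  # assumed base model, for illustration only

# 4-bit quantization of the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)  # casts norms, enables gradient checkpointing, etc.

# LoRA adapters on the usual projection layers
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter matrices are trainable
```

The final print makes the PEFT point visible: only a small fraction of the weights (the adapter matrices) is updated, while the quantized base model stays frozen.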
Distributed implementation of training

There is no doubt that pre-training large language models requires multi-node, multi-GPU setups, which means distributed training. Basic distributed pre-training can be implemented directly on top of NCCL, or with higher-level tools such as Megatron, DeepSpeed, and Hugging Face's Accelerate library (which currently supports FSDP). These tools effectively implement DP/PP/TP.

Megatron-DeepSpeed

For more information on pre-training with Megatron combined with DeepSpeed, see: https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Megatron%2BDeepspeed-Pretrain-G…

DeepSpeed

For an example implementation of SFT using DeepSpeed, see: https://github.com/davidsajare/david-share/tree/master/Multimodal-Models/DeepSpeed-FT-Stable-Diffusi…

Axolotl

Some open-source fine-tuning tools, such as Axolotl, can now interface directly with DeepSpeed. For example, see: https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Fine-tuning-with-Axolotl

Accelerate

When using FSDP with Accelerate, combining different parallel strategies can achieve more efficient training:

- Data parallelism (DP): FSDP itself is a data-parallel strategy, implemented by sharding model parameters across devices.
- Pipeline parallelism (PP): the model is divided into several stages, each running on a different device. This requires manually partitioning the model and managing the data flow.
- Tensor parallelism (TP): the computation of a single layer is distributed across multiple devices. This requires modifying the model's computational graph.

Combining these strategies typically requires significant customization and tuning of the model and training script. Accelerate provides tools that simplify part of this process, but specific implementations may need to combine other PyTorch facilities (e.g., torch.distributed) with custom code; a minimal sketch of that lowest layer appears at the end of this post.

For an example of FSDP with Accelerate, see: https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Llama-3.1-70B-FSDP-Fine-Tuning

My GitHub: https://github.com/davidsajare/david-share.git
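To close, here is the minimal sketch referenced above: torch.distributed with the NCCL backend plus PyTorch's FSDP wrapper, launched with torchrun. It is an illustrative skeleton under those assumptions, not code from any of the repositories linked above.

```python
# Launch with: torchrun --nproc_per_node=8 train_fsdp.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# NCCL handles GPU-to-GPU communication; torchrun sets RANK/LOCAL_RANK/WORLD_SIZE
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("gpt2").to(local_rank)

# FSDP is a data-parallel strategy: it shards parameters, gradients,
# and optimizer state across all ranks
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# ... a standard training loop would go here, feeding each rank its own
# shard of the data via a DistributedSampler-backed DataLoader ...

dist.destroy_process_group()
```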