Article
TL;DR: When working with complex, multi-step tasks in language models, Prompt Chaining offers precise control by breaking the process into discrete steps, while Stepwise Prompting combines all steps into a single prompt for efficiency. Start with stepwise prompting for simpler tasks, and switch to prompt chaining if quality or consistency declines.

Prompting for multi-step processes

Prompting often involves instructing models to produce responses for complex, multi-step processes. Try writing a prompt for the following scenario, which processes a document in three main steps:

1. Summarizing each paragraph of the document in one line.
2. Extracting key points from the entire document.
3. Organizing the generated summary lines under their respective key points in the order they appear in the original document.

This task requires a combination of summarization, key point extraction, and organizational skills. It's a complex process that involves understanding the document's structure, content, and main themes. The two popular ways to handle such tasks are Prompt Chaining and Stepwise Prompting.

Prompt chaining

Prompt chaining is an approach for refining outputs from large language models (LLMs) that involves using a series of discrete prompts to guide the model through different phases of a task. This method allows for a more structured and controlled approach to refinement, with each step having a specific focus. By breaking down complex tasks into smaller, more manageable prompts, prompt chaining provides greater flexibility and precision in directing the LLM's output. For example, in a text summarization task, one prompt might focus on extracting key points, another on organizing them coherently, and a final prompt on polishing the language.

Stepwise prompting

Stepwise prompting, on the other hand, integrates all phases of the task within a single prompt.
This approach attempts to guide the LLM through the entire process in one go, challenging the model to generate a longer and more complex output based on a single set of instructions. While simpler to implement, stepwise prompting may be less effective for complex tasks that benefit from a more granular approach. For instance, when summarizing a lengthy academic paper, a single stepwise prompt might struggle to capture all the nuances of content selection, organization, and style that separate prompts in a chain could address individually.

Further, you can improve clarity by giving each step a name followed by its explanation. For example:

1) Analysis: Analyze the given text for key themes.
2) Summary: Summarize the themes identified in `Analysis`.
3) Conclusion: Draw conclusions based on the `Summary`.

The use of backticks to reference previous outputs leaves no ambiguity.

When to Use Step Names in Stepwise Prompts?

Use step names when the output of a step will be referenced later in the prompt or in subsequent prompts. Step names help with organization and add clarity, especially for tasks with multiple interdependent steps. Additionally, they can help the model more effectively associate output variables with the specific steps in which they are extracted.

When Step Names Might Be Optional?

1. Simple, Linear Tasks: For straightforward tasks with clear progression, step names might be unnecessary.
2. Short Prompts: In brief prompts with only 2-3 steps, numbering alone might suffice.

Best Practices

- Be consistent: If you use step names, use them for all steps in the prompt.
- Keep names short and descriptive: Use clear, concise labels that indicate the purpose of each step.

Illustration of Stepwise prompting and Prompt Chaining: Generate document outline

Let's create prompts for the scenario explained earlier using both techniques.

Prompt Chaining: For prompt chaining, we'll break down the task into three separate prompts, each focusing on a specific subtask.
Prompt 1 (Paragraph Summarization):

You are tasked with summarizing a document. For each paragraph in the given document, create a one-line summary that captures its main idea. Please provide these summary lines in a numbered list, with each number corresponding to the paragraph number in the original document.

Prompt 2 (Key Point Extraction):

Based on the entire document, identify and list the main key points. These should be the overarching themes or crucial ideas that span multiple paragraphs. Present these key points in a bulleted list.

Prompt 3 (Organization):

You will be provided with two lists: one containing one-line summaries of each paragraph, and another containing key points extracted from the document. Go through the `Document`. Your task is to organize the summary lines from `Summary list` under their most relevant key points from `Key point list`. Maintain the original order of the summary lines within each key point. Present the result as a structured list with key points as main headings and relevant summary lines as sub-points.

Output can be in the following format:

**<key point 1>
<summary line 1>
<summary line 2>
**<key point 2>
<summary line 3>
<summary line 4>
<summary line 5>

Ensure that all summary lines are included and that they maintain their original numbering order within each key point category.

Document: It is the document that needs to be converted into topic-wise `summary list`
Summary list: It is a list of one-line summaries of each paragraph in the `document`
Key point list: It is a list of key points covered in the `Document`

Stepwise prompting: The prompt below demonstrates how we can replicate the three distinct steps previously used in prompt chaining within a single prompt by using stepwise prompting.

You are tasked with analyzing, summarizing, and organizing the content of a given document. Please follow these steps in order:

1. Read document: Carefully read through the entire document.
2. Summary list: Create a one-line summary for each paragraph, capturing its main idea. Number these summaries according to the paragraph they represent.
3. Key point list: Identify the main key points of the entire document. These should be overarching themes or crucial ideas that span multiple paragraphs.
4. Key point wise Summary list: Organize the `Summary list` under their most relevant key points from `Key point list`. Maintain the original order of the summary lines within each key point. Present the result as a structured list with key points as main headings and relevant summary lines as sub-points.

The output should be in the following format:

**<key point 1>
<summary line 1>
<summary line 2>
**<key point 2>
<summary line 3>
<summary line 4>
<summary line 5>

Ensure that all summary lines are included and that they maintain their original numbering order within each key point category.

Missing paragraphs or lines

Notice the last line: "Ensure that all summary lines are included and that they maintain their original numbering order within each key point category." Even with repeated instructions like this to analyze every paragraph and line, the model continues to overlook some. Models may treat multiple paragraphs as one paragraph, or may skip paragraphs completely. This problem is much more common when processing long documents. Similarly, if you have to process each line in a paragraph, the model may combine lines or skip lines.

The solution is to break the document into smaller chunks and process each chunk separately. Further, I've seen better results when you provide the paragraphs as a numbered list and ask the model to generate its output as a numbered list corresponding to each paragraph. This makes it easier for the AI model to keep track of paragraphs. The same goes for line processing within paragraphs.
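The chunking and numbering advice above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the helper names `number_paragraphs` and `chunk` are hypothetical, and it assumes paragraphs are separated by blank lines.

```python
def number_paragraphs(document: str) -> list[str]:
    """Split on blank lines and prefix each paragraph with its 1-based index."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [f"{i}. {p}" for i, p in enumerate(paragraphs, start=1)]


def chunk(items: list[str], chunk_size: int) -> list[list[str]]:
    """Group numbered paragraphs into consecutive chunks of at most chunk_size."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


document = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
numbered = number_paragraphs(document)
chunks = chunk(numbered, chunk_size=2)

# Each chunk would be sent to the model in its own prompt.
for c in chunks:
    print("\n".join(c))
    print("---")
```

Because the numbering is assigned before chunking, paragraph numbers stay global across chunks, so the per-chunk outputs can be concatenated and checked for gaps afterwards.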
Stepwise prompting vs Prompt Chaining

Here's a comparison of prompt chaining and stepwise prompting in table format:

| Aspect | Prompt Chaining | Stepwise Prompting |
| --- | --- | --- |
| Execution | Runs the LLM multiple times, with each step focusing on a specific subtask | Completes all phases within a single generation, requiring only one run of the LLM |
| Complexity and Control | Allows for more precise control over each phase of the task, but requires more comprehensive prompts from humans | Uses a simpler prompt containing sequential steps, but challenges the LLM to generate a longer and more complex output |
| Effectiveness | Generally yields better results, especially in text summarization tasks | Might produce a simulated refinement process rather than a genuine one, potentially limiting its effectiveness |
| Task Breakdown | Excels at breaking down complex tasks into smaller, more manageable prompts | Attempts to handle the entire task in a single, more complex prompt |
| Iterative Improvement | Allows for easier iteration and improvement of individual steps in the process | Less flexible for targeted improvements without modifying the entire prompt |
| Resource Usage | May require more computational resources due to multiple LLM runs | More efficient in terms of API calls or processing time |
| Learning Curve | Higher initial complexity for prompt designers, but potentially more intuitive for complex tasks | Simpler to implement initially, but may be challenging to optimize for complex tasks |

Recommendation for choosing between the two: I recommend starting with stepwise prompting, as it is a more cost-effective solution and requires less engineering effort compared to prompt chaining. However, if you notice a decline in quality or inconsistent results, switching to prompt chaining will be necessary.

Which approach do you prefer when working with LLMs? Let’s discuss!
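As a rough sketch of the execution difference, the snippet below runs the document-outline task both ways. Everything here is an assumption for illustration: `llm` is a hypothetical callable that takes a prompt and returns text, the prompt wording is abbreviated, and `fake_llm` only records calls so the example runs without a real model.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, model text out


def run_chain(document: str, llm: LLM) -> str:
    """Prompt chaining: three separate calls, earlier outputs feed the last one."""
    summaries = llm(f"Summarize each paragraph in one line.\n\nDocument:\n{document}")
    key_points = llm(f"List the main key points.\n\nDocument:\n{document}")
    return llm(
        "Organize the summary lines under their most relevant key points.\n\n"
        f"Document:\n{document}\n\n"
        f"Summary list:\n{summaries}\n\n"
        f"Key point list:\n{key_points}"
    )


def run_stepwise(document: str, llm: LLM) -> str:
    """Stepwise prompting: one call that contains every named step."""
    return llm(
        "Follow these steps in order:\n"
        "1. Summary list: one-line summary per paragraph.\n"
        "2. Key point list: main key points of the document.\n"
        "3. Key point wise Summary list: organize `Summary list` under `Key point list`.\n\n"
        f"Document:\n{document}"
    )


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in model that records each prompt and returns a numbered placeholder."""
    calls.append(prompt)
    return f"<output {len(calls)}>"


chain_result = run_chain("Para one.\n\nPara two.", fake_llm)
chain_calls = len(calls)
stepwise_result = run_stepwise("Para one.\n\nPara two.", fake_llm)
stepwise_calls = len(calls) - chain_calls
print(chain_calls, stepwise_calls)
```

The chain costs three model calls but lets you inspect or retry each intermediate output; the stepwise version costs one call and leaves all intermediate work inside a single generation.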
6 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
Introduction

In our previous articles, we explored the limitations of in-context learning and the motivations behind fine-tuning large language models (LLMs). We highlighted the growing need for solutions that can provide task-specific optimizations without the computational overhead traditionally associated with full model fine-tuning. This led us to our exploration of Low-Rank Adaptation (LoRA) in "LoRA Demystified: Optimizing Language Models with Low-Rank Adaptation," where we delved into the intricacies of this technique, which has changed how we fine-tune LLMs.

Now, we take the next logical step in our journey: scaling LoRA for production environments. As organizations increasingly rely on specialized language models for a variety of tasks, a new challenge emerges: how can we serve multiple LoRA-adapted models simultaneously without sacrificing performance or breaking the bank? This article answers that question by introducing techniques for multi-tenant LoRA serving, enabling the deployment of thousands of fine-tuned LLMs on a single GPU.

We'll explore the evolution from basic LoRA implementation to advanced serving strategies, focusing on:

- The challenges of batching different task types in a multi-tenant environment
- Innovative solutions like Segmented Gather Matrix-Vector Multiplication (SGMV)
- The concept and implementation of heterogeneous continuous batching
- Practical examples using state-of-the-art tools like LoRAX

By the end of this article, you'll have a comprehensive understanding of how to leverage LoRA at scale, opening new possibilities for efficient, cost-effective deployment of multiple specialized language models in production environments.

A Brief Recap of LoRA

Before we dive into the complexities of multi-tenant serving, let's quickly recap the key idea behind LoRA:

- Keep the pretrained model's weights intact.
- Add small, trainable matrices to each layer of the Transformer architecture.
- Use rank decomposition to keep these additional matrices low-rank.

This approach offers several significant advantages:

- Drastically Reduced Parameter Count: By focusing on low-rank updates, LoRA significantly reduces the number of parameters that need to be trained. This makes fine-tuning more efficient and less resource-intensive.
- Preserved Base Model: Since the original model weights remain unchanged, you can easily switch between different LoRA adaptations or revert to the base model without any loss of information.
- Cost-Effective Customization: The reduced computational requirements make it feasible to create multiple customized LoRA models tailored to specific needs, even with limited resources.
- Competitive Performance: Despite its simplicity, LoRA often achieves performance comparable to full fine-tuning across a wide range of tasks.

LoRA's efficiency and effectiveness have made it a cornerstone of modern LLM deployment strategies. However, to truly leverage its power in production environments, we need to address the challenges of serving multiple LoRA-adapted models simultaneously. This is where batching strategies come into play.

Challenges in Batching Different Task Types

As we move towards deploying multiple LoRA-adapted models in production, we encounter a new set of challenges, particularly when it comes to batching requests efficiently. Let's explore these challenges and why traditional batching approaches fall short.

The GPU Utilization Imperative

Graphics Processing Units (GPUs) are expensive and limited resources. Efficient GPU utilization is crucial for cost-effective deployment of LLMs. As highlighted by Yu et al. in their 2022 study, batching is one of the most effective methods for consolidating workloads to enhance performance and GPU utilization.

The Naive Approach: Separate Queues

A straightforward approach to handling multiple LoRA-adapted models would be to batch workloads separately for each task type or adapter.
This method involves:

- Segregating tasks into queues based on their type or associated adapter.
- Waiting for each queue to reach a specific size (batch size) before processing.

However, this approach leads to several significant drawbacks:

- Resource Underutilization: The system might have idle resources even when there are enough tasks of different types for a batch, simply because it's waiting for individual queues to fill. This significantly reduces overall throughput.
- Unpredictable Performance: Performance becomes highly dependent on the arrival rate of each task type. Less frequent tasks can cause long delays in their respective queues, potentially holding up dependent tasks waiting for completion.
- Scalability Issues: Adding new task types or adapters requires creating new queues, increasing management complexity and potentially leading to more idle periods with less frequent queues.
- Latency Spikes: Tasks might experience high latency if they arrive when their queue is nearly empty, as they'll have to wait for the queue to fill before being processed.

Here's a simplified Python example illustrating the challenges of this naive approach:

```python
import queue
import time


class NaiveBatchingSystem:
    def __init__(self, batch_size=32):
        self.queues = {}  # one queue per task type / adapter
        self.batch_size = batch_size

    def add_task(self, task_type, task):
        if task_type not in self.queues:
            self.queues[task_type] = queue.Queue()
        self.queues[task_type].put(task)

    def process_batches(self, max_rounds=None):
        # max_rounds=None polls forever, like a long-running server loop
        rounds = 0
        while max_rounds is None or rounds < max_rounds:
            for task_type, task_queue in self.queues.items():
                if task_queue.qsize() >= self.batch_size:
                    batch = [task_queue.get() for _ in range(self.batch_size)]
                    print(f"Processing batch of {task_type} tasks")
                    # Process the batch...
                else:
                    print(f"Waiting for more {task_type} tasks...")
            time.sleep(1)  # Avoid busy-waiting
            rounds += 1


# Usage: two tasks in total would fill a batch of 2,
# but each sits alone in its own queue, so nothing is processed.
batcher = NaiveBatchingSystem(batch_size=2)
batcher.add_task("math", "2 + 2")
batcher.add_task("translation", "Hello in French")
batcher.process_batches(max_rounds=3)
```

This example demonstrates how tasks of different types can be stuck waiting for their respective queues to fill, even though there are enough total tasks to form a batch.

These challenges highlight the need for a more sophisticated approach to batching, one that can efficiently consolidate multi-tenant LoRA serving workloads onto a small number of GPUs while maximizing overall utilization. To address these challenges, researchers have developed innovative techniques like Segmented Gather Matrix-Vector Multiplication (SGMV).

Segmented Gather Matrix-Vector Multiplication (SGMV)

Chen et al. introduced SGMV in 2023 as a novel CUDA kernel designed specifically for multi-tenant LoRA serving. SGMV enables the batching of GPU operations, allowing multiple distinct LoRA models to be executed concurrently.

How SGMV Works

At its core, SGMV optimizes the matrix multiplication operations that are central to LoRA adapters. Here's a simplified explanation of how it works:

- Segmentation: Instead of treating each LoRA adapter as a separate entity, SGMV segments the operations across multiple adapters.
- Gather: It efficiently gathers the relevant weights from different adapters based on the incoming requests.
- Batched Multiplication: The gathered weights are then used in a batched matrix-vector multiplication operation, leveraging the GPU's parallel processing capabilities.

Benefits of SGMV

By leveraging SGMV, we can:

- Process Multiple Adapters Concurrently: Different LoRA models can be executed in parallel, improving overall system performance and resource utilization.
- Eliminate Queue-Based Bottlenecks: SGMV allows for grouping requests for different adapters together, avoiding the need for separate queues for each adapter or task type.
- Maintain Continuous Processing: The system can process tasks constantly, regardless of type, keeping the processing flow continuous and avoiding delays from waiting for specific task types to accumulate.
- Improve Throughput and Consistency: Heterogeneous continuous batching significantly improves overall throughput and maintains consistent performance even with a growing number of different tasks or adapters.

While the actual implementation of SGMV is complex and involves low-level GPU programming, its effects can be observed at the system level.

Heterogeneous Continuous Batching in LoRAX

LoRAX, an open-source Multi-LoRA inference server, represents a significant leap forward in the efficient deployment of multiple fine-tuned language models. At its core, LoRAX leverages the power of SGMV to achieve heterogeneous continuous batching, optimizing overall system throughput while maintaining low latency.

Key Components of LoRAX

LoRAX's architecture is built around three fundamental components that enable its heterogeneous batching capabilities:

- Dynamic Adapter Loading: LoRAX doesn't require all adapters to be pre-loaded into GPU memory. Instead, it dynamically downloads and loads adapters onto the GPU as requests arrive. This on-demand loading ensures efficient use of GPU memory and allows the system to handle a large number of different adapters without blocking other requests.
- Continuous Batching: Unlike traditional batching systems that wait for a fixed batch size, LoRAX employs a token-based approach to manage batching. It dynamically groups requests into batches based on available GPU memory and desired latency, ensuring a continuous flow of processing.
- Asynchronous Adapter Scheduling: A background thread in LoRAX efficiently manages adapter offloading and loading, minimizing the performance impact of swapping adapters in and out of GPU memory.
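To contrast with per-adapter queues, here is a toy sketch of token-budget batching in which requests for different adapters share one queue. It is only an illustration of the idea described above, not LoRAX's actual scheduler; the `Request` class, `next_batch` function, and the budget of 128 tokens are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    adapter_id: str   # which LoRA adapter the request targets
    num_tokens: int   # rough token cost of the request


def next_batch(pending: deque, token_budget: int) -> list:
    """Pop requests in arrival order until the token budget would be exceeded.

    Adapter IDs are ignored when forming the batch, so mixed-adapter batches
    are allowed. In this toy version a single request larger than the budget
    would never be scheduled.
    """
    batch, used = [], 0
    while pending and used + pending[0].num_tokens <= token_budget:
        req = pending.popleft()
        batch.append(req)
        used += req.num_tokens
    return batch


pending = deque([
    Request("math", 40),
    Request("sql", 50),
    Request("translation", 30),
    Request("math", 60),
])

first = next_batch(pending, token_budget=128)
second = next_batch(pending, token_budget=128)
print([r.adapter_id for r in first])   # three different adapters in one batch
print([r.adapter_id for r in second])  # the remaining request
```

Unlike the separate-queue approach, no request waits for others of the same type to arrive; the only limit on a batch is the shared token budget.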
Implementation Example

Let's look at a simplified example of how a batch of tasks can be sent to LoRAX using the lorax-client with Flask:

```python
from flask import Flask, jsonify, request
from lorax import Client
import requests

app = Flask(__name__)

# Configuration
LORAX_ENDPOINT = "http://127.0.0.1:8080"  # Replace with your LoRAX server endpoint
CALLBACK_URL = "http://localhost:5001/uploadresponse/"  # Replace with your callback endpoint

# Initialize the LoRAX client
lorax_client = Client(LORAX_ENDPOINT)


@app.route("/lorax/upload", methods=["POST"])
def upload_batch():
    """
    Handles batch upload requests.
    """
    try:
        # Parse the request body
        data = request.get_json()
        batch_id = data.get("batchId")
        prompts = data.get("data")

        if not batch_id or not prompts:
            return jsonify({"message": "Missing batchId or data"}), 400

        # Send the batch to LoRAX (one prompt at a time in this simplified version)
        responses = []
        for prompt_data in prompts:
            response = lorax_client.generate(
                prompt_data["prompt"],
                adapter_id=prompt_data.get("adapter_id"),
                max_new_tokens=prompt_data.get("max_new_tokens"),
                # ... other parameters
            )
            responses.append(response.dict())

        # Trigger the callback
        callback_data = {"batchId": batch_id, "response": responses}
        requests.post(CALLBACK_URL, json=callback_data)

        return jsonify({"message": "Batch processed successfully"}), 200

    except Exception as e:
        print(f"Error processing batch: {e}")
        return jsonify({"message": "Error processing batch"}), 500


if __name__ == "__main__":
    app.run(debug=True, port=5001)
```

This implementation showcases several key aspects of working with LoRAX's heterogeneous continuous batching:

- Batch of Tasks: The Flask server receives a batch of tasks as a JSON payload. Each task includes a prompt, an optional adapter ID, and the maximum number of tokens to generate.
- LoRAX Client: The server uses the lorax-client library to communicate with the LoRAX server, abstracting away the complexities of heterogeneous batching.
- Heterogeneous Batching: Notice that the server doesn't need to filter or sort prompts by adapter ID. LoRAX handles this internally, dynamically grouping tasks based on available resources and efficiently managing adapter loading.
- Dynamic Adapter Loading: If an adapter specified in a request isn't already loaded, LoRAX will download and load it on-demand, allowing for efficient use of GPU memory.
- Request Submission: In this simplified server, the loop sends one prompt at a time and waits for each response before sending the next. For LoRAX to batch prompts for different adapters together, the requests need to reach it concurrently, for example from several clients or worker threads.

Testing with curl

To test this implementation, you can use a curl command like this:

```bash
curl -X POST -H "Content-Type: application/json" \
  -d '{"batchId": "10001", "data": [
    {
      "prompt": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
      "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
      "max_new_tokens": 64
    },
    {
      "prompt": "[INST] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/INST]",
      "adapter_id": "ai2sql/ai2sql_mistral_7b",
      "max_new_tokens": 128
    },
    {
      "prompt": "[INST] What is the capital of France? Provide a brief history. [/INST]",
      "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
      "max_new_tokens": 128
    }
  ]}' \
  http://localhost:5001/lorax/upload
```

This curl command sends a POST request to the Flask server's /lorax/upload endpoint with a batch of three prompts. The prompts cover a math problem, a SQL task, and a general-knowledge question, and they use two different LoRA adapters (the first and third prompts share the same adapter).

LoRAX's heterogeneous continuous batching shines in this kind of scenario. It handles the diverse set of tasks, loads different adapters as needed, and can process concurrent requests together. This approach improves throughput and maintains low latency, even when dealing with a mix of task types and adapters.
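For intuition about what happens to a mixed batch under the hood, the sketch below computes the LoRA update `x @ A @ B` for each request with its own adapter, grouping requests into per-adapter segments so that each adapter's weights are gathered once. It is a plain-Python statement of the semantics only: the real SGMV kernel is CUDA code, and the function names, adapter names, and tiny rank-1 weights here are made up for illustration.

```python
from itertools import groupby


def matvec(x, M):
    """Row vector x (length n) times matrix M (n x m) -> list of length m."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def lora_deltas(inputs, adapter_ids, adapters):
    """Compute x @ A @ B for every request, one adapter segment at a time."""
    # Sort request indices by adapter so requests sharing an adapter are contiguous.
    order = sorted(range(len(inputs)), key=lambda i: adapter_ids[i])
    deltas = [None] * len(inputs)
    for adapter_id, segment in groupby(order, key=lambda i: adapter_ids[i]):
        A, B = adapters[adapter_id]  # gather this segment's weights once
        for i in segment:
            deltas[i] = matvec(matvec(inputs[i], A), B)
    return deltas


# Hypothetical rank-1 adapters: A is 2x1, B is 1x2.
adapters = {
    "math": ([[1.0], [0.0]], [[2.0, 0.0]]),
    "sql": ([[0.0], [1.0]], [[0.0, 3.0]]),
}
inputs = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
adapter_ids = ["math", "sql", "math"]

deltas = lora_deltas(inputs, adapter_ids, adapters)
print(deltas)  # results are returned in the original request order
```

The output stays aligned with the original request order even though the work is done segment by segment, which is why the caller never has to sort or filter prompts by adapter.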
By leveraging LoRAX and its implementation of heterogeneous continuous batching, we can efficiently serve multiple fine-tuned LLMs in production, overcoming the challenges of traditional batching methods and maximizing GPU utilization.

References

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Yu, C., Han, S., Shen, H., Gao, Y., & Li, J. (2022). PaLM-Coder: Improving Large Language Model Based Program Synthesis Through Batching and Speculative Execution. arXiv preprint arXiv:2212.08272.
Chen, Z., Jiang, Y., Luo, Y., Liu, X., Ji, S., & Gong, Z. (2023). LoRAX: A High-Performance Multi-Tenant LoRA Inference Server. arXiv preprint arXiv:2311.03285.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., ... & Pasunuru, R. (2022). OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., ... & Sifre, L. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., & Krishnamurthy, A. (2023). Punica: Multi-Tenant LoRA Serving.
Zhao, J., Wang, T., Abid, W., Angus, G., Garg, A., Kinnison, J., Sherstinsky, A., Molino, P., Addair, T., & Rishi, D. (2024). LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report.
7 min read
authors:
Akshat Patil
Rohit Aggarwal
Harpreet Singh

Article
1. Introduction to LoRA

Hey there, language model enthusiasts! Today, we're diving into the fascinating world of LoRA - Low-Rank Adaptation. If you've been keeping up with the latest trends in fine-tuning large language models, you've probably heard this term buzzing around. But what exactly is LoRA, and why should you care? Let's break it down!

Have you ever wished you could fine-tune a massive language model without breaking the bank or waiting for days? Enter LoRA - the game-changing technique that's revolutionizing how we adapt large language models. If you've been keeping up with the AI world, you've likely heard whispers about LoRA, but maybe you're not quite sure what all the fuss is about. Well, buckle up, because we're about to embark on a journey that will demystify LoRA and show you how it's reshaping the landscape of language model optimization.

Imagine being able to tailor a behemoth language model to your specific needs without the hefty computational costs typically associated with fine-tuning. That's the magic of LoRA, or Low-Rank Adaptation. In a world where AI models are growing exponentially in size and complexity, LoRA emerges as a beacon of efficiency, allowing us to adapt these digital giants with surgical precision.

In this article, we're going to pull back the curtain on LoRA. We'll start by unraveling what LoRA is and why it's causing such a stir in the AI community. Then, we'll roll up our sleeves and dive into the nitty-gritty of how LoRA works, from its clever use of low-rank matrix decomposition to its seamless integration with pre-trained models.

But we won't stop at theory. We'll guide you through implementing LoRA in PyTorch, breaking down the process into manageable chunks. You'll learn how to create LoRA layers, wrap them around your favorite pre-trained model, and orchestrate a forward pass that leverages the power of LoRA.
We'll also explore best practices for using LoRA, from choosing the right rank parameter to optimizing the scaling factor. And for those ready to push the boundaries, we'll delve into advanced techniques that can take your LoRA implementations to the next level.

Whether you're an AI researcher looking to streamline your model adaptation process, a developer aiming to make the most of limited computational resources, or simply an enthusiast curious about the cutting edge of language model optimization, this article has something for you. So, are you ready to unlock the potential of LoRA and revolutionize how you work with large language models? Let's dive in and demystify LoRA together!

What is LoRA?

LoRA, short for Low-Rank Adaptation, is a clever technique that's revolutionizing how we fine-tune large language models. Introduced by Hu et al. in 2022, LoRA allows us to adapt pre-trained models to specific tasks without the hefty computational cost typically associated with full fine-tuning.

At its core, LoRA works by adding small, trainable matrices to each layer of the Transformer architecture. These matrices are decomposed into low-rank representations, hence the name. The beauty of this approach is that it keeps the original pre-trained model weights untouched while introducing a minimal number of new parameters to learn.

Benefits of LoRA for Language Model Fine-Tuning

Now, you might be wondering, "Why should I use LoRA instead of traditional fine-tuning?" Great question! Here are some compelling reasons:

- Efficiency: LoRA dramatically reduces the number of trainable parameters, making fine-tuning faster and less resource-intensive.
- Cost-effectiveness: With fewer parameters to train, you can save on computational costs and energy consumption.
- Flexibility: LoRA allows you to create multiple task-specific adaptations of a single base model without the need for full fine-tuning each time.
- Performance: Despite its simplicity, LoRA often achieves comparable or even better performance than full fine-tuning for many tasks.

2. Understanding the LoRA Architecture

Before we dive into the implementation, let's take a moment to understand how LoRA works under the hood. This knowledge will help you appreciate the elegance of the technique and make informed decisions when using it.

Low-Rank Matrix Decomposition

The key idea behind LoRA is low-rank matrix decomposition. In linear algebra, a matrix of rank r can be written as the product of two smaller matrices, and many large matrices are well approximated by such a low-rank product. LoRA leverages this concept to create efficient adaptations.

Instead of learning a full matrix of weights for each layer, LoRA introduces two smaller matrices, A and B. The adaptation is then computed as the product of these matrices, scaled by a small factor. Mathematically, it looks like this:

LoRA adaptation = α * (A * B)

Where:

- A is a matrix of size (input_dim, r)
- B is a matrix of size (r, output_dim)
- r is the rank, typically much smaller than input_dim and output_dim
- α is a scaling factor

This decomposition allows us to capture the most important directions of change in the weight space using far fewer parameters: r * (input_dim + output_dim) instead of input_dim * output_dim.

Integration with Pre-Trained Models

LoRA integrates seamlessly with pre-trained models. Here's how it works:

- The original weights of the pre-trained model are frozen (not updated during training).
- LoRA layers are added in parallel to the existing linear layers in the model.
- During the forward pass, the output of the original layer and the LoRA layer are summed.
- Only the LoRA layers are updated during training, leaving the base model untouched.

This approach allows us to adapt the model's behavior without modifying its original knowledge, resulting in efficient and effective fine-tuning.

3. Implementing LoRA in PyTorch

Now that we understand the theory, let's roll up our sleeves and implement LoRA in PyTorch!
We'll break this down into three main components: the LoRA Layer, the LoRA Model, and the forward pass.

3.1 LoRA Layer Implementation

First, let's create our LoRA Layer. This is where the magic happens!

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        # A starts random, B starts at zero, so the initial adaptation is zero
        self.lora_A = nn.Parameter(torch.randn(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        self.scaling = 0.01

    def forward(self, x):
        return self.scaling * (x @ self.lora_A @ self.lora_B)
```

Let's break this down:

- We define a new `LoRALayer` class that inherits from `nn.Module`.
- In the constructor, we create two parameter matrices: `lora_A` and `lora_B`. Notice that `lora_A` is initialized randomly, while `lora_B` starts as all zeros, so the adaptation contributes nothing until training updates it.
- The `scaling` factor is set to 0.01. This small value helps to keep the LoRA adaptation subtle at the beginning of training.
- In the forward pass, we compute the LoRA adaptation by multiplying the input `x` with `lora_A` and `lora_B`, then scaling the result.

3.2 LoRA Model Implementation

Now that we have our LoRA Layer, let's create a LoRA Model that wraps around our base pre-trained model:

```python
class LoRAModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.lora_layers = nn.ModuleDict()

        # Add LoRA layers for the linear layers of the base model.
        # named_children() is used because this simplified wrapper assumes a flat
        # stack of layers (e.g. nn.Sequential); dotted names from named_modules()
        # are not valid nn.ModuleDict keys.
        for name, module in self.base_model.named_children():
            if isinstance(module, nn.Linear):
                self.lora_layers[name] = LoRALayer(module.in_features, module.out_features)
```

Here's what's happening:

- We create a `LoRAModel` class that takes a `base_model` as input.
- We iterate through the direct child modules of the base model, looking for linear layers.
- For each linear layer, we create a corresponding LoRA layer and add it to our `lora_layers` dictionary.
This approach allows us to selectively apply LoRA to specific layers of the model, typically focusing on the attention and feed-forward layers in a Transformer architecture. 3.3 LoRA Model Forward Pass Finally, let's implement the forward pass for our LoRA Model: ```python def forward(self, x): # Forward pass through base model, adding LoRA outputs where applicable for name, module in self.base_model.named_modules(): if name in self.lora_layers: x = module(x) + self.lora_layers[name](x) else: x = module(x) return x ``` In this forward pass: We iterate through the modules of the base model. If a module has a corresponding LoRA layer, we add the LoRA output to the base module's output. For modules without LoRA, we simply pass the input through as usual. This implementation ensures that the LoRA adaptations are applied exactly where we want them, while leaving the rest of the model unchanged. 4. Using the LoRA Model Great job! Now that we have our LoRA model implemented, let's talk about how to use it effectively. 
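Before training, a quick sanity check can confirm that the wrapper behaves as described in Section 3. The sketch below is self-contained, so it restates the layer and wrapper in a form that assumes a flat `nn.Sequential` base model; the toy layer sizes are arbitrary choices for illustration. It checks two things: that the wrapped model matches the base model at initialization (because `lora_B` starts at zero), and how many parameters remain trainable once the base model is frozen.

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.lora_A = nn.Parameter(torch.randn(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        self.scaling = 0.01

    def forward(self, x):
        return self.scaling * (x @ self.lora_A @ self.lora_B)


class LoRAModel(nn.Module):
    """Wrapper for a flat stack of layers (assumption: base_model is an nn.Sequential)."""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.lora_layers = nn.ModuleDict()
        for name, module in base_model.named_children():
            if isinstance(module, nn.Linear):
                self.lora_layers[name] = LoRALayer(module.in_features, module.out_features)

    def forward(self, x):
        for name, module in self.base_model.named_children():
            if name in self.lora_layers:
                x = module(x) + self.lora_layers[name](x)
            else:
                x = module(x)
        return x


torch.manual_seed(0)
base = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # toy base model
model = LoRAModel(base)

# Freeze the base model so that only the LoRA matrices can receive gradients
for param in model.base_model.parameters():
    param.requires_grad = False

x = torch.randn(8, 16)
with torch.no_grad():
    same_at_init = torch.allclose(model(x), base(x))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"matches base at init: {same_at_init}")
print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
```

If the trainable count comes out as zero, or the outputs differ at initialization, the LoRA layers are probably not being registered or applied where expected.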
Training Process
Training a LoRA model is similar to training any other PyTorch model, with a few key differences:

Freeze the base model parameters:
```python
for param in model.base_model.parameters():
    param.requires_grad = False
```
Only optimize the LoRA parameters:
```python
optimizer = torch.optim.AdamW(model.lora_layers.parameters(), lr=1e-3)
```
Train as usual, but remember that you're only updating the LoRA layers:
```python
# num_epochs, dataloader, and criterion are defined by your task;
# each batch is assumed to be an (inputs, targets) pair
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
```

Inference with LoRA-Adapted Models
When it's time to use your LoRA-adapted model for inference, you can simply use it like any other PyTorch model:
```python
model.eval()
with torch.no_grad():
    output = model(input_data)
```
The beauty of LoRA is that you can easily switch between different adaptations by changing the LoRA layers, all while keeping the same base model.

5. Best Practices
As you start experimenting with LoRA, keep these best practices in mind:

Choosing the Rank Parameter
The rank parameter (r) in LoRA determines the complexity of the adaptation. A higher rank allows for more expressive adaptations but increases the number of parameters. Start with a small rank (e.g., 4 or 8) and increase if needed.

Scaling Factor Optimization
The scaling factor (α) in the LoRA layer can significantly impact performance. While we set it to 0.01 in our example, you might want to treat it as a hyperparameter and tune it for your specific task.

Performance Comparisons
Always compare your LoRA-adapted model's performance with a fully fine-tuned model. In many cases, LoRA can achieve comparable or better results with far fewer parameters, but it's essential to verify this for your specific use case.

6. Advanced LoRA Techniques
Ready to take your LoRA skills to the next level?
Here are some advanced techniques to explore:

Hyperparameter Tuning for the Scaling Factor
Instead of using a fixed scaling factor, you can make it learnable:
```python
# Inside LoRALayer.__init__: a learnable scaling factor, starting from the earlier fixed value
self.scaling = nn.Parameter(torch.tensor(0.01))
```
This allows the model to adjust the impact of the LoRA adaptation during training. Because the factor is now a parameter of the LoRA layer, the optimizer built from `model.lora_layers.parameters()` updates it along with `lora_A` and `lora_B`.

Selective Application of LoRA
You might not need to apply LoRA to every layer. Experiment with applying it only to specific layers (e.g., only to attention layers) to find the best trade-off between adaptation and efficiency.

Freezing Base Model Parameters
We touched on this earlier, but it's crucial to ensure your base model parameters are frozen:
```python
for param in model.base_model.parameters():
    param.requires_grad = False
```
This ensures that only the LoRA parameters are updated during training.

And there you have it! You're now equipped with the knowledge to implement and use LoRA for optimizing language models. Remember, the key to mastering LoRA is experimentation. Don't be afraid to try different configurations and see what works best for your specific use case. Happy adapting, and may your language models be ever more efficient and effective!

Summary
In this article, we've demystified LoRA (Low-Rank Adaptation), a powerful technique for optimizing large language models. We explored how LoRA enables efficient fine-tuning by introducing small, trainable matrices to pre-trained models, dramatically reducing computational costs while maintaining performance. We delved into the LoRA architecture, explaining its use of low-rank matrix decomposition and seamless integration with pre-trained models. We then provided a step-by-step guide to implementing LoRA in PyTorch, covering the creation of LoRA layers, wrapping them around base models, and executing forward passes.

Key takeaways include:
LoRA offers a cost-effective and flexible approach to adapting large language models.
Implementing LoRA involves creating specialized layers and integrating them with existing model architectures.
Best practices such as choosing appropriate rank parameters and optimizing scaling factors are crucial for success.
Advanced techniques like learnable scaling factors and selective application can further enhance LoRA's effectiveness.

As AI models continue to grow in size and complexity, techniques like LoRA become increasingly valuable. Whether you're an AI researcher, developer, or enthusiast, LoRA opens up new possibilities for working with large language models.
7 min read
authors:
Akshat Patil
Rohit Aggarwal
Harpreet Singh

Article
Introduction Artificial Intelligence (AI) is transforming industries across the globe, and its influence on social media management is particularly noteworthy. AI agents can analyze vast amounts of data, uncovering patterns and trends that are otherwise difficult to detect. As part of my Master of Science degree program in Information Systems at the University of Utah, I faced a pivotal decision: complete a certification course or take on a capstone project. Drawn to the challenge and potential for growth, I chose the latter. Under the guidance of Prof. Rohit Aggarwal from the Information Systems Department at the David Eccles School of Business, I embarked on an exciting journey to build an AI agent capable of revolutionizing social media posts. Little did I know that this project would push me far beyond the boundaries of my classroom knowledge and into the realm of practical, cutting-edge AI application. Seizing the Opportunity: The Beginning of My AI Journey The project's scope was ambitious: develop an AI system that could analyze historical social media content and generate future posts to drive engagement. I chose Instagram as the primary platform, focusing on strategies employed by industry leaders like Ogilvy, BBDO, AKQA, and MCCANN. These companies, known for their expertise in brand promotion, provided valuable insights into audience engagement that I could leverage in my AI agent. My task involved collecting data from company websites and social media platforms through web scraping, processing it, and utilizing AI models to extract meaningful themes. The ultimate goal was to generate future posts that would drive engagement and align with each company's brand. This project would test not only my technical skills but also my ability to understand and apply marketing strategies in a digital context. The Journey: Building the AI Agent Data Collection and Preprocessing My first major hurdle was data collection through web scraping. 
While I had some exposure to Python in my classes, this project demanded a level of expertise I hadn't yet achieved. I spent countless hours poring over YouTube tutorials, documentation, and online forums to master the intricacies of web scraping. I learned to use Python libraries like Instaloader to obtain data from Instagram pages, and Selenium and BeautifulSoup to scrape company websites. This process yielded valuable information, including captions, shares, likes, and comments from each company's Instagram account. After collecting the data, I moved on to preprocessing. I cleaned the data by removing duplicates and null values, converted dates to a datetime format, and prepared it for analysis. Ensuring that the data was accurate and well-organized was crucial for setting a solid foundation for theme extraction. This step taught me the importance of data quality in AI projects, a concept that was only briefly touched upon in my coursework. Theme Extraction and Grouping With the data ready, I conducted theme extraction using the Gemma 2b model, run through Ollama. This was a significant leap from the basic machine learning concepts I had learned in class. I employed zero-shot prompting, a method where the model is asked to perform a task without being shown worked examples of it. By providing suggested themes, I guided Gemma 2b to extract relevant themes from the Instagram posts, such as 'Product Announcement' and 'Customer Story'. Once I extracted the themes, I grouped and normalized them. I used Gemma 2b to categorize the themes into more concise groups, ensuring that similar themes like 'Customer Story' and 'Customer Stories' were treated as one. This normalization was essential for scaling the data effectively, teaching me about the nuances of natural language processing and the importance of context in AI-driven text analysis.
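The preprocessing step described above (removing duplicates and null values, then converting dates to a datetime format) can be sketched with the Python standard library alone. This is an illustrative sketch, not the project's actual code: the function name `preprocess`, the field names, and the sample rows are all made up for the example, and dates are assumed to arrive as ISO-formatted strings.

```python
from datetime import datetime


def preprocess(posts):
    """Drop rows with missing values, drop exact duplicates, and parse dates."""
    seen = set()
    cleaned = []
    for post in posts:
        # Remove null values: skip any row with a missing or empty field
        if any(value is None or value == "" for value in post.values()):
            continue
        # Remove duplicates: identical (company, date, caption) rows are kept once
        key = (post["company"], post["date"], post["caption"])
        if key in seen:
            continue
        seen.add(key)
        # Convert the ISO date string to a datetime object
        cleaned.append({**post, "date": datetime.fromisoformat(post["date"])})
    return cleaned


# Made-up sample rows with hypothetical field names
raw_posts = [
    {"company": "CompanyA", "date": "2024-03-01", "caption": "New campaign", "likes": 120},
    {"company": "CompanyA", "date": "2024-03-01", "caption": "New campaign", "likes": 120},
    {"company": "CompanyB", "date": "2024-03-02", "caption": None, "likes": 80},
    {"company": "CompanyC", "date": "2024-03-03", "caption": "Behind the scenes", "likes": 95},
]
clean_posts = preprocess(raw_posts)
print(f"kept {len(clean_posts)} of {len(raw_posts)} rows")
```

In this sample, the repeated CompanyA row and the CompanyB row with a missing caption are dropped, leaving two rows with parsed dates.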
Engagement Analysis and Generating Future Posts Next, I conducted an engagement analysis by calculating scores for each theme based on likes, shares, and comments. Summing up these metrics helped me identify the top 10 themes across all companies. This analysis revealed which themes were driving engagement and how companies like Ogilvy and AKQA were leveraging these strategies. This step required me to blend my understanding of social media metrics with data analysis techniques, bridging the gap between marketing concepts and technical implementation. Armed with this analysis, I used Gemma2b to generate future social media posts. I crafted these posts based on the successful strategies I identified, with suggestions for images, videos, captions, and hashtags. I also included a predicted engagement score for each post, aiding social media managers in planning their content effectively. This phase of the project was particularly exciting as it allowed me to see the practical application of AI in content creation, a concept far beyond what I had learned in my classes. To make my AI agent accessible, I developed an interactive interface using Streamlit. This user-friendly platform allowed social media managers to interact with the model, generate posts, and visualize engagement predictions. Creating this interface pushed me to learn about web application development and user experience design, areas that were entirely new to me but crucial for making my AI agent practical and usable. Challenges I Faced Throughout this project, I encountered numerous challenges that pushed me far beyond what I had learned in my classes: Web Scraping Implementation: Despite my theoretical knowledge of web scraping in Python, this project demanded practical application at a much higher level. I had to enhance my skills through intensive study of YouTube tutorials and comprehensive reading on the subject, including its legal implications to ensure compliance. 
Model Selection and Deployment: I initially explored quantized models for local execution, gaining extensive knowledge about their capabilities and limitations. After considering various options, including GPU-dependent models, I settled on Gemma 2b with Ollama due to its compatibility with my local machine's resources. This decision came after attempting to use Google Colab's enhanced GPU environment, which proved financially unfeasible for my project's scope. Development Environment Setup: Setting up the working environment posed its own challenges. I opted for Visual Studio Code, which provided a robust platform for code structuring and debugging the large language model. This choice significantly improved my workflow efficiency, but required me to learn a new development environment. Data Processing and Analysis: Data cleaning and merging CSV files presented initial hurdles. I overcame these by developing Python scripts to streamline these processes. The most significant challenge was extracting themes from the large dataset using Gemma 2b, which required substantial computational time. To address this, I utilized a high-RAM system and implemented checkpoints in my code to manage the process more effectively. Model Fine-tuning and Result Validation: To ensure the extracted themes aligned with the desired format, I implemented a training method using sample themes. This was followed by a meticulous manual review process to verify the accuracy and relevance of the extracted themes. Post-processing and Application Development: Once I extracted themes, I leveraged the model to categorically group them and align them with engagement metrics. Additionally, I used Gemma to generate weekly posts designed to resonate with the target audience. The final step involved developing a Streamlit application to generate prompt responses, providing a user-friendly interface for accessing the project's insights. 
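The engagement analysis described earlier, where likes, shares, and comments are summed per theme and the top themes are selected, can be sketched in a few lines of Python. The function name `top_themes`, the field names, and the sample numbers below are made up for illustration; they are not the project's data or code.

```python
from collections import defaultdict


def top_themes(posts, n=10):
    """Sum likes, shares, and comments per theme and return the n highest-scoring themes."""
    scores = defaultdict(int)
    for post in posts:
        scores[post["theme"]] += post["likes"] + post["shares"] + post["comments"]
    # Sort by score, highest first; ties are broken alphabetically for stable output
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up sample posts with hypothetical field names
sample_posts = [
    {"theme": "Product Announcement", "likes": 100, "shares": 10, "comments": 5},
    {"theme": "Customer Story", "likes": 80, "shares": 20, "comments": 10},
    {"theme": "Product Announcement", "likes": 50, "shares": 5, "comments": 5},
    {"theme": "Behind the Scenes", "likes": 30, "shares": 2, "comments": 3},
]
for theme, score in top_themes(sample_posts, n=2):
    print(f"{theme}: {score}")
```

A plain sum treats a like, a share, and a comment as equally valuable; weighting them differently would be a design choice to revisit for a real analysis.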
Lessons Learned and Conclusion Despite the difficulties, this project provided me with invaluable lessons. I honed my coding skills, mastered the intricacies of web scraping, and gained hands-on experience with machine learning models. Additionally, the project emphasized the importance of adaptability, communication, and project management—skills that are crucial for success in any professional setting. Building this AI agent was a transformative experience for me. It not only equipped me with technical skills but also prepared me for future roles in AI and data analytics. My project demonstrated the potential of AI in enhancing social media management and underscored the importance of understanding data to make informed decisions. Looking ahead, I'm excited about the possibilities AI offers and the role I can play in shaping this technology. This experience has not only provided me with technical skills but also ignited a passion for creating AI solutions that can make a real difference in how businesses understand and interact with their digital audience. My journey of building my first AI agent has laid a solid foundation for future projects, and I have a strong desire to continue learning and growing in this dynamic field.
5 min read
authors:
Ololade Olaitan
Rohit Aggarwal
Harpreet Singh

Article
Opinion: Teaching AI & Critical Thinking The opinions expressed herein are derived from our research and our own experiences in developing a few AI agents, observing student engagement across different variations of our AI classes, and engaging in discussions within AI committees and with attendees of BizAI 2024. The Growing Importance of Critical Thinking in the AI Era In the new era of artificial intelligence (AI) and large language models (LLMs), critical thinking skills have become more important than ever before. A 2022 survey by the World Economic Forum revealed that 78% of executives believed critical thinking would be a top-three skill for employees in the next five years, up from 56% in 2020 [1]. As AI systems become more advanced and capable of performing a wide range of tasks, it is crucial for humans to develop and maintain strong critical thinking abilities to effectively leverage these tools and make informed decisions. The Necessity of Human Insight for Effective AI Utilization One of the key reasons critical thinking is so valuable in this context is that LLMs excel at providing information and executing tasks based on their training data, but they often struggle with higher-level reasoning, problem decomposition, and decision-making. While an LLM can generate code, write articles, or answer questions, it may not always understand the broader context or implications of the task at hand. This is where human critical thinking comes into play. For example, let's consider a scenario where a company wants to develop a new product. An LLM can assist by generating ideas, conducting market research, and even creating a project plan. However, it is up to the human decision-makers to critically evaluate the generated ideas, assess their feasibility and potential impact, and redirect the AI model where it has made a mistake or missed something significant, such as company values, long-term goals, and potential risks.
Cultivating Critical Thinking Skills for AI Collaboration Moreover, as the value of knowing "how" to perform a task decreases due to the capabilities of LLMs, the value of knowing "what" to do and "why" increases. Because AI can handle much of the "how" of a task, it frees professionals to focus on the "what" and the "why". By developing strong critical thinking abilities, professionals can effectively collaborate with AI systems, leveraging their strengths while compensating for their limitations. This synergy between human reasoning and AI capabilities has the potential to make professionals more productive, bring costs down, and help companies grow manifold. However, it is important to note that critical thinking skills must be actively cultivated and practiced. As professors, we need to think of ways to equip students with the tools and training necessary to thrive in an AI-driven world. Let us consider an example of how AI and critical thinking can be taught in tandem. Example: Teaching AI and Critical Thinking in Tandem In one of our courses, we teach students how to effectively use AI models to augment their thought process and plan AI agents for revamping business processes. Students explore how to plan an AI agent that learns the tacit knowledge that experts develop over years of experience. They also explore how another AI agent can use this tacit knowledge, in conjunction with Retrieval Augmented Generation (RAG), as part of its context to generate decisions or content that mimic the complex decision making of an expert. Through this process, students not only learn technical skills related to AI and LLMs but also develop essential critical thinking abilities such as problem decomposition, strategic planning, and effective communication. They learn to view AI as a tool to augment and enhance their own thinking, rather than a replacement for human judgment and decision-making. They also gain a better understanding of the limitations of AI models.
These AI models solve a lot of "how" type problems that professionals earlier had to spend significant time learning, planning and working on. However, these models also come with their own set of challenges, such as context window limits, limited reasoning abilities, and variability in responses. Hence, there is a strong need for students to prepare for AI integration in workplaces while accounting for AI models' limitations. Educating students to remain in control Teaching students to view LLMs as highly knowledgeable assistants that sometimes get confused and need direction is a valuable approach. It encourages students to take an active role in guiding and correcting the AI, rather than simply accepting its outputs at face value. They recognize that while AI can provide valuable insights and generate ideas, it is ultimately up to humans to critically evaluate and act upon that information. This understanding helps students develop a healthy and productive relationship with AI, one in which they are in control and can effectively leverage these tools to support their own learning and growth. Intellectual laziness & associated risks While the collaboration between humans and AI presents numerous opportunities, it is essential to be aware of potential drawbacks and risks. As AI models become more advanced and capable, there is a genuine concern that some early learners may become overly dependent on these tools. This over-reliance could diminish their critical thinking and problem-solving abilities, possibly fostering "intellectual laziness." Individuals might become less inclined to learn and explore new concepts on their own, relying instead on AI for answers. Further, they may lose faith in their own judgment and stop questioning the AI model's output. In one of our research studies, we observed this behavior among early software developers who start relying on AI models too much.
This situation could widen the divide between those who use AI to boost their productivity and those who lean on it too much. To counter these risks, it's important that, along with fostering critical-thinking abilities, we stress the need for critical engagement with AI. We should encourage students to actively scrutinize and question the outputs of AI. We need to help students see that excessive reliance on AI can lead to a lack of depth in understanding and personal growth. By advocating for a strategy that equally values AI resources and independent thinking skills, we can guide learners through this new landscape successfully. As we look towards the future, the increasing importance of critical thinking skills in the AI era will have significant implications for job markets and educational curricula. Professionals who can effectively collaborate with AI systems and leverage their capabilities will be in high demand. Hence, faculty will need to adapt their programs to ensure that students understand the importance of using AI as a tool to augment their thinking and not as a replacement. Further, we must rethink our courses and integrate more emphasis on the "what", challenging students to apply their critical thinking skills to real-world problems and decision-making scenarios. Inviting our colleagues to collaborate This is not a trivial task, and it will require collaboration and idea-sharing among faculty members. We have been actively exploring these issues and would greatly value the perspectives and insights of our colleagues on this topic. We welcome further discussions and encourage you to reach out to us to share your thoughts and experiences. Disclaimer It's important to note that these insights are primarily anecdotal and have not undergone scientific scrutiny. Additionally, the research involving developers where we noted instances of intellectual laziness has not yet been validated through peer review. References [1] World Economic Forum. (2022). The Future of Jobs Report 2022. Geneva, Switzerland.
5 min read
authors:
Rohit Aggarwal
Harpreet Singh
