Article
Large Language Models (LLMs) have rapidly emerged as a transformative force in artificial intelligence, demonstrating remarkable capabilities in understanding, generating, and interacting with human language. From powering sophisticated chatbots and translation services to assisting in complex coding and creative writing tasks, LLMs are reshaping industries and redefining human-computer interaction. However, the term "LLM" encompasses a wide and increasingly diverse range of model types, each with unique architectures, training methodologies, strengths, and weaknesses. Understanding these distinctions is crucial for effectively leveraging their power and navigating the rapidly evolving AI landscape. This tutorial aims to provide a comprehensive overview of several key types of LLMs that are prominent today or represent significant directions in research and development. We will delve into the fundamental characteristics, training processes, applications, and limitations of each category, offering clarity on how they differ and where their specific advantages lie. We will begin by exploring Base Models , the foundational building blocks trained on vast amounts of unlabeled text data. These models excel at pattern recognition and language prediction but often lack the ability to follow specific instructions reliably. Building upon this foundation, we will examine Instruction-Tuned Models , which are fine-tuned using supervised learning and human feedback to better understand and execute user commands, making them more suitable for task-oriented applications like chatbots and assistants. Next, we will investigate more specialized architectures. Mixture of Experts (MoE) Models represent a significant architectural innovation, employing multiple specialized sub-networks ("experts") and a gating mechanism to route tasks efficiently. This approach allows for dramatically larger model sizes (in terms of total parameters) while maintaining computational efficiency during training and inference, albeit with challenges related to memory requirements and fine-tuning. We will then turn our attention to models explicitly designed for complex cognitive tasks. Reasoning Models are optimized to tackle problems requiring multi-step thought processes, such as mathematical proofs, logic puzzles, and complex planning. These models often generate intermediate steps, providing transparency into their reasoning process. Further expanding capabilities, Multimodal Models (MLLMs) break the text-only barrier, processing and understanding information across various modalities like images, audio, and video alongside text. We will clarify how these differ fundamentally from models solely focused on generating images or video from text. We will also explore Hybrid Models , which blend characteristics from different categories, potentially integrating diverse reasoning approaches or dynamically deciding how to process information based on complexity. Finally, we will look at Deep Research [Agents] , AI agents designed for autonomous, in-depth investigation using web browsing and iterative analysis. By exploring each of these categories, this tutorial will equip you with a clearer understanding of the diverse capabilities within the LLM ecosystem, helping you appreciate the specific strengths and applications of different model types. Chapter 1: Base Models - The Foundation of Language Understanding At the heart of the Large Language Model revolution lie the Base Models , often referred to as foundation models. 
These represent the initial, fundamental stage of LLM development, serving as the bedrock upon which more specialized and task-oriented models are built. Understanding base models is essential to grasping the core principles of how LLMs learn and function before they are adapted for specific applications like conversation or instruction following. A base LLM can be conceptualized as the "raw" or "core" version of a language model [1]. Its primary characteristic stems from its training methodology: unsupervised learning on truly massive and diverse datasets. These datasets typically encompass vast swathes of text and code scraped from the public internet, digitized books, scientific articles, and other sources, potentially amounting to trillions of words. The key here is that the data is largely unlabeled; the model isn't explicitly told what the "correct" answer is for a given input during this phase. Instead, base models are trained on objectives like next-token prediction or masked language modeling. In next-token prediction, the model learns to predict the most statistically probable next word (or sub-word unit, called a token) in a sequence, given the preceding context. For example, given the input "The cat sat on the...", the model learns to assign high probability to words like "mat", "chair", or "windowsill" based on the patterns it has observed in its training data. Masked language modeling involves predicting missing (masked) words within a sentence. Through these self-supervised tasks, the model implicitly learns intricate patterns of grammar, syntax, semantics, factual knowledge, and even some rudimentary reasoning abilities embedded within the language data [1, 2]. The sheer scale of the training data allows base models to develop a broad, general understanding across an incredibly wide range of topics. They become repositories of information gleaned from their training corpus, capable of generating text that is often coherent, contextually relevant, and stylistically varied [1]. However, this knowledge is statistical and pattern-based; the model doesn't "understand" in the human sense but rather excels at predicting sequences based on learned correlations. A defining feature, and often a limitation, of base models is that they are not inherently designed to follow instructions or engage in coherent dialogue. While they can complete prompts or answer questions based on the patterns they've learned (e.g., if trained on many Q&A pairs, they might answer a question), their behavior can be unpredictable [1, 3]. They might continue a prompt in an unexpected way, generate factually incorrect information (hallucinate), or fail to adhere to specific constraints given in a prompt. Their primary goal during training was sequence prediction, not adherence to user intent. Prompt engineering for base models often requires careful crafting to steer the model towards the desired output format or content. Despite these limitations for direct interaction, base models are incredibly valuable as foundations. Their broad knowledge and language understanding capabilities make them the ideal starting point for fine-tuning [1]. By taking a pre-trained base model and further training it on smaller, curated datasets tailored to specific tasks (like question answering, summarization, or following instructions), developers can create more specialized and reliable models, such as the instruction-tuned models we will discuss in the next chapter. 
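To make the next-token objective concrete, here is a minimal sketch that inspects a base model's most probable continuations of the prompt "The cat sat on the". It assumes the Hugging Face transformers and PyTorch packages are installed and uses GPT-2 purely as a small, publicly available stand-in for a base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 serves here only as a small, freely downloadable example of a base model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (batch, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```

The model simply assigns probabilities to possible continuations; nothing in this objective makes it follow instructions, which is why further fine-tuning is needed for assistant-style behavior.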
In summary, base LLMs are characterized by: Unsupervised Pre-training: Trained on vast, unlabeled text/code datasets. Core Objective: Typically next-token prediction or masked language modeling. Broad Knowledge: Develop general understanding across many topics from data patterns. Limited Instruction Following: Not inherently designed to follow user commands reliably. Foundation Role: Serve as the starting point for fine-tuning into specialized models. Their applications in their raw form might include generating creative text variations, exploring language patterns, or acting as a knowledge base where precise instruction following isn't paramount. However, their most significant impact lies in enabling the development of more sophisticated, fine-tuned models that power many of the AI applications we interact with daily. References : [1] Toloka Team. (2024, November 19). Base LLM vs. instruction-tuned LLM. Toloka Blog. Retrieved May 2, 2025, from https://toloka.ai/blog/base-llm-vs-instruction-tuned-llm/ [2] AWS. What is LLM? - Large Language Models Explained. Amazon Web Services. Retrieved May 2, 2025, from https://aws.amazon.com/what-is/large-language-model/ [3] Boutnaru, S. (2025, February 9). The Artificial Intelligence Journey — Base LLM (Base Large Language Model). Medium. Retrieved May 2, 2025, from https://medium.com/@boutnaru/the-artificial-intelligence-journey-base-llm-base-large-language-model-726423106b14 Chapter 2: Instruction-Tuned Models - Aligning LLMs with User Intent While base models possess vast knowledge, their inherent nature as sequence predictors makes them unreliable for tasks requiring specific actions or adherence to user commands. To bridge this gap and create more practical, interactive AI systems, the concept of Instruction-Tuned Models was developed. These models represent a crucial evolution, taking a pre-trained base model and refining it specifically to understand and follow human instructions effectively [1]. Instruction tuning is a form of supervised fine-tuning (SFT) applied after the initial unsupervised pre-training phase. Instead of just predicting the next token, the model is trained on a dataset composed of explicit instruction-prompt-response pairs [1]. These pairs demonstrate the desired behavior for various tasks. For example, the dataset might contain examples like: Instruction : "Summarize the following text." Prompt : "[Lengthy article text]" Response : "[Concise summary of the article]" Or: Instruction : "Translate this sentence to French." Prompt : "Hello, how are you?" Response : "Bonjour, comment ça va?" This dataset is often curated through significant human effort, involving labelers writing instructions, prompts, and high-quality responses. Increasingly, techniques like Reinforcement Learning from Human Feedback (RLHF) or AI Feedback (RLAIF) are also employed. In RLHF, human reviewers rank different model outputs for the same prompt, and this feedback is used to train a reward model, which then guides the LLM's fine-tuning via reinforcement learning to produce outputs that align better with human preferences [4]. This combined SFT and RLHF/RLAIF process helps the model learn not just what information to provide, but how to provide it in a helpful, harmless, and honest manner, aligning it more closely with user intent [1]. The primary benefit of instruction tuning is a marked improvement in the model's ability to follow complex, multi-step instructions without deviating [1]. 
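As a concrete illustration of the instruction-prompt-response format described above, the sketch below flattens a couple of invented records into the plain training strings a causal-LM trainer would consume. The "### Instruction / ### Response" template is just one common convention; in practice you would use the target model's own chat template, and frameworks such as Hugging Face TRL can handle the supervised fine-tuning step itself.

```python
# Illustrative instruction-tuning records; the pairs are made up for this sketch.
examples = [
    {"instruction": "Translate this sentence to French.",
     "input": "Hello, how are you?",
     "output": "Bonjour, comment ça va?"},
    {"instruction": "Summarize the following text.",
     "input": "Large Language Models are trained on vast text corpora ...",
     "output": "LLMs learn language patterns from very large text datasets."},
]

def to_training_text(example: dict) -> str:
    """Flatten one record into the single string a causal-LM trainer would see."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )

for ex in examples:
    print(to_training_text(ex), end="\n\n---\n\n")
```

Fine-tuning on many thousands of such demonstrations, followed by preference optimization such as RLHF, is what teaches the model to respect the user's instruction rather than merely continue the text.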
Unlike base models that might ramble or misunderstand the core task, instruction-tuned models are trained to interpret the user's command and generate a relevant, structured response. They become significantly better at tasks requiring specific formats (like creating lists or writing code), adhering to constraints (like tone or length), and understanding the nuances of user requests [1]. Key characteristics that distinguish instruction-tuned models include: Improved Instruction Following: They are explicitly trained to understand and execute commands, leading to more reliable and predictable behavior [1]. Enhanced Task Specialization: They excel at specific NLP tasks they were fine-tuned on, such as summarization, translation, question answering, code generation, and structured content creation [1]. Better User Intent Understanding: The fine-tuning process makes them more adept at grasping the underlying goal of a user's prompt, even if not perfectly phrased [1]. Controllability: Users have more control over the output's style, tone, and format through instructions. Safety and Alignment: Fine-tuning often incorporates safety measures and alignment techniques to reduce harmful, biased, or untruthful outputs. Instruction-tuned models power many of the LLM applications commonly used today, including advanced chatbots like ChatGPT, Google Gemini, and Anthropic's Claude. Their applications are vast and continue to expand: Conversational AI: Engaging in coherent, helpful dialogue, answering questions, and providing assistance. Content Creation: Generating articles, marketing copy, emails, stories, and other creative text formats based on specific instructions. Coding Assistance: Generating, explaining, debugging, and translating code snippets. Educational Tools: Providing explanations, tutoring, and answering student queries. Data Analysis and Reporting: Summarizing data, generating insights, and creating structured reports [1]. While instruction tuning significantly enhances usability and reliability, it's important to note that these models still inherit the knowledge (and potential biases) of their underlying base model. They are not immune to generating incorrect information (hallucinations), but the fine-tuning process generally makes them more grounded and less prone to unpredictable outputs compared to raw base models. In essence, instruction tuning transforms a knowledgeable but unguided base model into a helpful and capable assistant, aligning its vast linguistic capabilities with the specific needs and intentions of human users. References : [1] Toloka Team. (2024, November 19). Base LLM vs. instruction-tuned LLM. Toloka Blog. Retrieved May 2, 2025, from https://toloka.ai/blog/base-llm-vs-instruction-tuned-llm/ [4] OpenAI. (2024, September 12). Learning to reason with LLMs. OpenAI Blog. Retrieved May 2, 2025, from https://openai.com/index/learning-to-reason-with-llms/ (Implicit reference to RLHF/RLAIF in reasoning model development, applicable concept here) Chapter 3: Mixture of Experts (MoE) Models - Scaling Efficiently As the demand for more powerful and knowledgeable Large Language Models grows, researchers continually seek ways to increase model size without incurring prohibitive computational costs during training and inference. One of the most promising architectural innovations addressing this challenge is the Mixture of Experts (MoE) model. 
MoE represents a significant departure from traditional "dense" architectures, enabling models to scale to trillions of parameters while maintaining relative efficiency [5]. In a standard dense transformer model, every input token is processed by all parameters in each layer, particularly the feed-forward network (FFN) layers which constitute a large portion of the model's parameters. This means the computational cost scales directly with the model size. MoE introduces the concept of sparsity or conditional computation to overcome this limitation [5]. Instead of dense FFN layers, MoE models incorporate specialized MoE layers . Each MoE layer consists of two primary components [5]: Multiple Experts : A set of smaller, independent neural networks (typically FFNs themselves, though they could be more complex). Each expert can be thought of as specializing in different types of data or tasks, although this specialization often emerges implicitly during training rather than being explicitly assigned. Gating Network (Router) : A small neural network that acts as a traffic controller. For each input token arriving at the MoE layer, the gating network dynamically decides which expert(s) should process that token. It calculates scores for each expert based on the token's representation and typically selects the top-k experts (where k is often 1 or 2) to handle the computation [5]. The core idea is that for any given token, only a small fraction of the total parameters within the MoE layer (i.e., the parameters of the selected expert(s)) are activated and used for computation. The outputs from the activated expert(s) are then typically combined, often weighted by the scores assigned by the gating network, before being passed to the next layer [5]. It's crucial to note that while the FFN layers are replaced by sparse MoE layers, other components of the transformer, like the attention mechanisms, are usually shared across all tokens, similar to dense models. This is why a model like Mixtral 8x7B, despite having 8 experts of 7B parameters each in its MoE layers, has a total parameter count closer to 47B rather than 56B (8x7B) – the non-FFN parameters are shared [5]. This sparse activation leads to significant benefits: Efficient Pre-training : MoE models can achieve comparable quality to dense models with substantially less computational cost during pre-training. This allows researchers to train much larger models (in terms of total parameters) or use larger datasets within the same compute budget [5]. Faster Inference : Although an MoE model might have a very large number of total parameters, the actual number of computations (FLOPs) required per token during inference only depends on the parameters of the activated experts (and the shared parameters). For example, Mixtral 8x7B, using 2 experts per token, has an inference speed roughly equivalent to a 12-14B parameter dense model, not a 47B or 56B one [5]. However, the MoE architecture also introduces unique challenges: High Memory Requirements : Despite the computational efficiency, all parameters (including all experts) must be loaded into the GPU memory (VRAM) during inference. This means an MoE model requires significantly more VRAM than a dense model with equivalent inference FLOPs [5]. Training Instability and Load Balancing : Ensuring that tokens are distributed relatively evenly across experts (load balancing) is critical for efficient training and preventing experts from becoming over- or under-utilized. 
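To ground the routing mechanics, here is a deliberately simplified PyTorch sketch of a sparse MoE layer with top-k gating. It is a toy, not a production implementation: it loops over experts for readability, uses arbitrary dimensions, and omits the load-balancing machinery real systems need.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy sparse MoE layer: a gating network routes each token to its top-k experts."""

    def __init__(self, d_model: int = 64, d_hidden: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)   # the router
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is routed independently.
        scores = F.softmax(self.gate(x), dim=-1)                        # routing score per expert
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)             # keep only the top-k experts
        topk_scores = topk_scores / topk_scores.sum(-1, keepdim=True)   # renormalize their weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                           # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)            # 10 token embeddings
print(ToyMoELayer()(tokens).shape)      # torch.Size([10, 64])
```

Only the selected experts run for each token, which is the source of the inference savings; keeping the routing balanced across experts is the hard part, as the text below notes.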
Sophisticated loss functions and routing strategies are often needed to manage this [5]. Fine-tuning Difficulties : MoE models have historically been more challenging to fine-tune effectively compared to dense models, sometimes exhibiting tendencies towards overfitting. However, recent advancements in instruction-tuning techniques for MoEs are showing promise [5]. Communication Overhead : In distributed training or inference setups, routing tokens to the correct experts across different devices can introduce communication bottlenecks. Prominent examples of MoE models include Llama 4 Scout (109B-A17B) and Alibaba’s Qwen3-235B-A22B. In Qwen3-235B-A22B, "Qwen3" designates the third generation of the model, "235B" indicates the total number of parameters, and "A22B" means that only 22 billion parameters are active per token via a Mixture-of-Experts design (8 out of 128 experts per token). This approach achieves scalability and efficiency by activating only a subset of the model for each input, allowing for dense-model-level quality with reduced computational cost. Qwen3 exemplifies how sparse activation and expert routing enable large-scale models to be both powerful and relatively efficient. References : [5] Sanseviero, O., Tunstall, L., Schmid, P., Mangrulkar, S., Belkada, Y., & Cuenca, P. (2023, December 11). Mixture of Experts Explained. Hugging Face Blog. Retrieved May 2, 2025, from https://huggingface.co/blog/moe Chapter 4: Reasoning Models - Enhancing Complex Problem-Solving While instruction-tuned models significantly improve an LLM's ability to follow commands and perform specific tasks, many real-world problems require more than just direct execution; they demand complex, multi-step thinking, logical deduction, and the ability to plan and execute a sequence of operations. This is where Reasoning Models come into play. These are LLMs that have been specifically enhanced or designed to excel at tasks requiring intricate, step-by-step problem-solving [6]. Reasoning, in this context, refers to the process of tackling questions or problems that necessitate intermediate steps to arrive at a correct solution [6]. Simple factual recall (e.g., "What is the capital of France?") doesn't typically require reasoning. However, solving a mathematical word problem, debugging complex code, navigating a logic puzzle, or planning a multi-stage project involves breaking the problem down, applying rules or principles, and synthesizing information through a sequence of steps. While most modern instruction-tuned LLMs possess some basic reasoning capabilities learned implicitly during pre-training or fine-tuning, dedicated reasoning models are optimized to handle significantly higher levels of complexity [6]. A key characteristic often associated with reasoning models is their ability to generate or utilize intermediate steps, sometimes referred to as a "chain of thought" or "scratchpad" [6, 9]. These intermediate steps can manifest in two ways: Explicit Reasoning : The model includes the steps of its reasoning process directly within its output, showing its work much like a student solving a math problem. This provides transparency and allows users (or developers) to potentially identify errors in the reasoning path. Implicit Reasoning : The model performs multiple internal iterations or calculations, generating intermediate thoughts or results that are not necessarily shown to the end-user but are used internally to arrive at the final answer [6]. 
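To illustrate explicit reasoning in code, here is a minimal sketch. ask_llm is a hypothetical stand-in for whatever chat-completion client is available (nothing here is a specific vendor API); the prompt asks the model to show its intermediate steps, in the spirit of the chain-of-thought prompting discussed below, and a simple self-consistency style majority vote is layered on top.

```python
from collections import Counter

def explicit_reasoning_prompt(question: str) -> str:
    # Ask the model to show its intermediate steps before committing to an answer.
    return (
        f"{question}\n"
        "Think step by step, showing your reasoning, "
        "then give the final answer on a line starting with 'Answer:'."
    )

def final_answer(completion: str) -> str:
    # Naive extraction: take whatever follows the last 'Answer:' marker.
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(ask_llm, question: str, n_samples: int = 5) -> str:
    """Sample several reasoning paths and majority-vote on their final answers."""
    prompt = explicit_reasoning_prompt(question)
    votes = Counter(final_answer(ask_llm(prompt, temperature=0.8)) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

Each sampled completion shows its reasoning before the answer, i.e., explicit reasoning; implicit reasoning, by contrast, keeps those intermediate steps internal and surfaces only the final result.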
OpenAI's o1 model is suggested to operate partly in this manner [4, 6]. Enhancing the reasoning capabilities of LLMs involves several distinct approaches, often used in combination [6]: Inference-Time Techniques : These methods don't alter the underlying model but change how it's prompted or how its outputs are generated during inference. Techniques like Chain-of-Thought (CoT) prompting explicitly ask the model to "think step-by-step." Self-Consistency involves generating multiple reasoning paths and selecting the most consistent answer through majority voting. Tree-of-Thoughts (ToT) explores multiple reasoning paths concurrently, evaluating intermediate steps like searching through a tree [6, 9]. These techniques essentially allocate more computational resources at inference time to improve reasoning quality. Specialized Training Data : Fine-tuning models (using SFT) on datasets specifically designed to teach reasoning is crucial. These datasets might include mathematical problems with step-by-step solutions, logical deduction exercises, code with explanations, or complex instruction-following tasks [6]. Reinforcement Learning : Similar to instruction tuning, RL (often RLHF or RLAIF) can be used with reward models specifically designed to incentivize correct reasoning steps (process supervision) or accurate final outcomes resulting from complex reasoning (outcome supervision) [4, 6]. Models like DeepSeek-R1 utilize extensive RL training for reasoning [6]. Distillation : Smaller models can be trained to mimic the reasoning outputs of larger, more capable reasoning models, effectively distilling the reasoning capability into a more efficient package [6]. Reasoning models are particularly well-suited for tasks where accuracy hinges on logical coherence and multi-step processing [6]: Advanced Mathematics : Solving complex equations, proofs, and word problems. Logic Puzzles and Games : Navigating riddles, strategic games, and constraint satisfaction problems. Scientific Reasoning : Formulating hypotheses, designing experiments, interpreting data. Complex Code Generation and Debugging : Understanding intricate program logic, identifying bugs, generating complex algorithms. Planning and Scheduling : Breaking down complex goals into actionable steps. However, this specialization comes with trade-offs [6]: Computational Cost : Both training specialized reasoning models and employing inference-time reasoning techniques can be computationally expensive. Verbosity and Efficiency : For simple tasks not requiring deep reasoning, these models might be overly verbose or less efficient than standard instruction-tuned models. Potential for Plausible Errors : Reasoning models can sometimes generate convincing-looking but ultimately incorrect reasoning paths ("overthinking" or sophisticated hallucination). The development of reasoning models represents a significant step towards more capable and versatile AI systems, pushing LLMs beyond simple pattern matching and instruction following towards more human-like problem-solving abilities. References : [4] OpenAI. (2024, September 12). Learning to reason with LLMs. OpenAI Blog. Retrieved May 2, 2025, from https://openai.com/index/learning-to-reason-with-llms/ [6] Raschka, S. (2025, February 5). Understanding Reasoning LLMs. Ahead of AI. Retrieved May 2, 2025, from https://sebastianraschka.com/blog/2025/understanding-reasoning-llms.html [9] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). 
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903. (General reference for CoT) Chapter 5: Multimodal Models - Understanding Beyond Text The world is inherently multimodal; humans perceive and interact with information through various senses – sight, sound, touch – often simultaneously. Traditional Large Language Models, however, primarily operate within the realm of text. Multimodal Large Language Models (MLLMs) represent a significant leap forward, designed to process, understand, and even generate information across multiple data types or modalities, such as text, images, audio, and video [7]. This capability allows them to engage with information in a way that more closely mirrors human perception and enables a wider range of more complex applications. MLLMs expand upon the foundations laid by traditional LLMs. While they often leverage a powerful pre-trained LLM as their backbone for language understanding and reasoning, they incorporate additional components to handle non-textual data [7]. The core architectural difference lies in the need for specialized encoders for each modality. For instance, an MLLM might use a Vision Transformer (ViT) or a Convolutional Neural Network (CNN) to process images, an audio encoder for sound, and the standard LLM tokenizer/embedding layer for text. These encoders transform the input from each modality into vector representations (embeddings) [7]. A crucial step in MLLM architecture is embedding alignment and fusion . The embeddings generated by the different modality encoders need to be projected into a shared space where the model can understand the relationships between them. A dedicated fusion module or specific training techniques (like contrastive learning) are employed to integrate these diverse representations into a unified multimodal understanding [7]. This allows the model, for example, to connect the word "dog" in a text caption to the visual features of a dog in an accompanying image. The general workflow often involves [7]: Ingestion and Encoding : Receiving input across multiple modalities (e.g., an image and a text question about it) and processing each through its respective encoder. Alignment and Fusion : Projecting and combining the different embeddings into a unified representation. Cross-Modal Learning/Reasoning : Processing the fused representation, often using the LLM backbone, to understand the relationships and context across modalities. Output Generation : Producing an output, which could be text (e.g., answering the question about the image), but potentially also another modality depending on the model's architecture and training (though text output is most common for current MLLMs focused on understanding). This ability to process combined inputs leads to powerful capabilities beyond text-only models: Rich Data Interpretation : Analyzing documents containing text, charts, and images; understanding videos with audio and visual elements. Cross-Modal Reasoning : Answering detailed questions about images or videos, describing visual scenes, explaining audio events. Enhanced Interaction : Enabling more natural human-AI interaction, such as discussing a shared visual context. However, building and training MLLMs presents significant challenges [7]: Architectural Complexity : Designing effective encoders and fusion mechanisms is difficult. Training Data : Requires massive, diverse datasets pairing different modalities (e.g., image-caption pairs, video-transcript pairs). 
Computational Cost : Training these complex models with large parameter counts and diverse data is computationally intensive. Alignment: Ensuring meaningful alignment between representations from different modalities remains an active area of research. Distinguishing MLLMs from Text-to-Image/Video Models: It is vital to differentiate MLLMs from models primarily focused on generating one modality from another, such as text-to-image models (e.g., Stable Diffusion, Midjourney) or text-to-video models (e.g., Sora). While both involve multiple modalities, their core purpose differs significantly [7]: Text-to-Image/Video Models : These are primarily generative specialists . Their main function is to synthesize high-fidelity visual content (images or video frames) based solely on a textual description. They excel at translating text prompts into pixel data but typically lack deep understanding or reasoning capabilities about the input modalities beyond what's needed for generation. They take text in and produce images/video out. Multimodal LLMs (MLLMs) : These models prioritize cross-modal understanding and reasoning . They are designed to take multiple modalities as input (e.g., image + text, video + audio + text) and perform tasks that require comprehending the relationship between these inputs. Their output is often textual (e.g., describing an image, answering a question about a video), reflecting their focus on understanding and explanation, although future MLLMs might generate outputs in various modalities more frequently. Their strength lies in interpreting and reasoning about combined multimodal data. In essence, while a text-to-image model creates a picture from a description, an MLLM can look at a picture and a description (or question) and reason about them together. Models like Google's Gemini and OpenAI's GPT-4V are prominent examples of MLLMs focused on understanding and reasoning across text and images. References : [7] NVIDIA. What Are Multimodal Large Language Models? NVIDIA Glossary. Retrieved May 2, 2025, from https://www.nvidia.com/en-us/glossary/multimodal-large-language-models/ Chapter 6: Hybrid Models - Integrating Diverse Capabilities The landscape of Large Language Models is not strictly defined by mutually exclusive categories. As research progresses, models are emerging that blend characteristics from different types, leading to the concept of Hybrid Models . While the term "hybrid" can be applied broadly, in the context of our discussion, we focus on models that integrate different reasoning approaches or dynamically decide how to process information based on the input's complexity or nature, particularly touching upon the user's query about models deciding whether to use reasoning. The need for hybrid approaches arises from the observation that no single model architecture or training paradigm is optimal for all tasks. Simple queries might be best handled by efficient instruction-tuned models, while complex problems demand the sophisticated multi-step processing of reasoning models. A hybrid model aligns with the user's query and makes dynamic reasoning selection. Such a model might possess multiple internal pathways or modules optimized for different levels of cognitive load: Fast Pathway : For simple, routine queries, the model might use a computationally cheaper, faster processing route, perhaps akin to a standard instruction-tuned response mechanism. 
Deep Reasoning Pathway: When the model detects complexity, ambiguity, contradictions, or specific triggers indicating a need for careful analysis (e.g., mathematical symbols, logical operators, planning requests), it could activate a more resource-intensive reasoning module. This module might employ techniques like chain-of-thought, self-correction, or even call external tools or specialized sub-models [6, 8]. The decision mechanism itself could be a learned component, perhaps a gating network similar to those in MoE models, but routing tasks based on complexity rather than just token identity. Alternatively, it could be triggered by specific prompt structures or internal confidence scores. Note: remember to include "no_think" in your system prompt if you don't want the model to spend time "thinking". Letting the model engage in elaborate reasoning when it already produces equally good results without it will significantly increase both cost and response time. Models explicitly marketed as "hybrid reasoning selectors" are not yet commonplace. A second sense of "hybrid" involves data from multiple modalities, requiring the capabilities of MLLMs, and the integration of different types of reasoning or processing within a single system. For instance, research explores combining symbolic reasoning (like mathematical logic or rule-based systems) with the pattern-matching strengths of neural networks. An LLM might handle the natural language understanding and common-sense aspects of a problem, while interfacing with a symbolic solver for precise calculations or logical deductions. The study on hybrid reasoning for autonomous driving provides a concrete example, although it focuses on combining reasoning types and modalities rather than dynamically choosing whether to reason [8]. In this context, the LLM acts as a central processing unit integrating diverse inputs: visual data (detected objects), sensor readings (speed, distance), and contextual knowledge (traffic laws, physics). It applies both common-sense reasoning (interpreting the driving scene) and potentially arithmetic reasoning (calculating safe distances) to make driving decisions (like brake/throttle control) [8]. This demonstrates a hybrid approach by fusing different data streams and reasoning forms to tackle a complex, dynamic task. Hybrid models represent a move towards more adaptive and efficient AI. By dynamically allocating computational resources and selecting appropriate processing strategies based on the task at hand, they promise to combine the breadth of knowledge from large models with the specialized capabilities needed for complex reasoning and interaction, potentially deciding on-the-fly whether a simple response or a deep, reasoned analysis is required. References: [6] Raschka, S. (2025, February 5). Understanding Reasoning LLMs. Ahead of AI. Retrieved May 2, 2025, from https://sebastianraschka.com/blog/2025/understanding-reasoning-llms.html [8] Azarafza, M., Nayyeri, M., Steinmetz, C., Staab, S., & Rettberg, A. (2024). Hybrid Reasoning Based on Large Language Models for Autonomous Car Driving. arXiv preprint arXiv:2402.13602. [10] Olteanu, A. (2025, February 5). OpenAI's Deep Research: A Guide With Practical Examples. DataCamp Blog.
Retrieved May 2, 2025, from https://www.datacamp.com/blog/deep-research-openai Chapter 7: Deep Research- AI Agents for In-Depth Investigation Beyond models focused on specific cognitive abilities like reasoning or multimodal understanding, a new category is emerging: Deep Research or AI Research Agents . These systems represent a shift towards more autonomous AI, designed specifically to conduct complex, multi-step research tasks by leveraging LLMs, web browsing, tool use, and iterative refinement [10]. OpenAI's "Deep Research" agent, powered by a version of their o3 model, is a prime example of this category [10]. Unlike standard LLM interactions (even those with browsing capabilities) which typically provide relatively quick, single-turn responses based on limited web searches, deep research agents are built for sustained investigation. They aim to tackle complex queries that require synthesizing information from numerous sources, cross-referencing data, and producing structured, comprehensive outputs, much like a human researcher would [10]. Think of tasks like compiling a detailed market analysis report, comparing complex products based on diverse criteria, or summarizing the state-of-the-art in a scientific field – tasks that demand more than a simple search query. The core functionality of these models revolves around an iterative research process [10]: Query Understanding and Planning: Upon receiving a research query, the agent often starts by clarifying the scope and objectives with the user. It then formulates a plan, breaking down the research task into smaller, manageable steps. Information Gathering (Tool Use): The agent autonomously uses tools, primarily web browsers, to search for relevant information online. It may access dozens or even hundreds of sources. Analysis and Synthesis: The agent reads and analyzes the gathered information, extracting key points, identifying patterns, comparing data across sources, and potentially using other tools (like code interpreters for data analysis) to process the findings. Iterative Refinement: The process is often iterative. Based on initial findings, the agent might refine its search queries, seek out additional sources, or revisit previous steps to deepen its understanding or resolve conflicting information. Report Generation: Finally, the agent synthesizes its findings into a structured, often well-cited report, presenting the information in a coherent and organized manner. These models build upon advancements in reasoning capabilities but are specifically optimized for the context of web browsing and real-world data analysis [10]. Their training often involves reinforcement learning focused on successful execution of complex browsing and reasoning tasks, teaching them how to navigate the web effectively, evaluate source credibility (to some extent), and synthesize disparate information [10]. Key characteristics distinguishing deep research agents include: Autonomy and Iteration : They operate with a higher degree of autonomy, performing multiple steps over an extended period (minutes rather than seconds) to complete a research task. Extensive Tool Use : Heavy reliance on web browsing is fundamental, potentially augmented by other tools for calculation, data analysis, or code execution. Focus on Synthesis : The primary goal is not just information retrieval but deep analysis and synthesis across multiple sources. Structured Output : They typically aim to produce comprehensive, structured reports rather than brief answers. 
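The iterative plan, search, analyze, refine, and report loop described above can be sketched in outline. Everything below is an illustrative simplification: llm, web_search, and read_page are hypothetical stand-ins for real model and tool clients, and real agents add source tracking, credibility checks, and far richer stopping criteria.

```python
def deep_research(question: str, llm, web_search, read_page, max_iterations: int = 3) -> str:
    """Toy outline of a deep-research loop: plan, gather, analyze, refine, then report."""
    plan = llm(f"Break this research question into concrete sub-questions:\n{question}")
    notes = []

    for _ in range(max_iterations):
        query = llm(
            "Given the plan and the notes so far, propose the single most useful web search query.\n"
            f"Plan:\n{plan}\n\nNotes:\n{notes}"
        )
        for url in web_search(query)[:3]:          # read a few sources per iteration
            page = read_page(url)
            notes.append(llm(f"Extract the key facts relevant to '{question}' from:\n{page}"))

        decision = llm(
            f"Question: {question}\nNotes so far:\n{notes}\n"
            "Reply 'CONTINUE' if important gaps remain, otherwise 'DONE'."
        )
        if "DONE" in decision.upper():             # crude stopping criterion
            break

    return llm(f"Write a structured, cited report answering '{question}' using these notes:\n{notes}")
```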
Deep research agents show significant promise in benchmarks designed to test complex, real-world tasks requiring reasoning, tool use, and knowledge retrieval, such as GAIA (General AI Agent benchmark) and Humanity’s Last Exam [10]. Their performance often improves the more they are allowed to iterate and use their tools, highlighting the value of their multi-step approach [10]. Potential applications are broad, targeting anyone needing in-depth research [10]: Professionals : Generating market reports, competitive analyses, policy summaries. Researchers and Students : Literature reviews, gathering data from diverse online sources. Consumers : Detailed product comparisons for high-stakes purchases. Journalists and Analysts : Fact-checking, background research, multi-source insight generation. However, these models are still in early stages. They can still produce incorrect facts or inferences (hallucinations), and their ability to critically evaluate source quality is an ongoing challenge. Users need to treat their outputs as highly sophisticated drafts requiring human review and verification, rather than infallible final reports [10]. Nonetheless, deep research represent a powerful new direction, moving LLMs towards becoming more capable and autonomous assistants for complex knowledge work. References : [10] Olteanu, A. (2025, February 5). OpenAI's Deep Research: A Guide With Practical Examples. DataCamp Blog. Retrieved May 2, 2025, from https://www.datacamp.com/blog/deep-research-openai Conclusion: The Evolving Ecosystem of Language Models This exploration into the diverse types of Large Language Models highlights the rapid evolution and specialization occurring within the field of artificial intelligence. From the foundational Base Models trained on vast unlabeled text, we have seen the development of Instruction-Tuned Models designed for better user alignment and task execution. Architectural innovations like Mixture of Experts (MoE) demonstrate pathways to scale models efficiently, while dedicated Reasoning Models push the boundaries of complex problem-solving. Furthermore, the ability to understand the world beyond text is captured by Multimodal Models (MLLMs), which integrate information from images, audio, and video, distinguishing them clearly from purely generative text-to-image or text-to-video systems. The emergence of Hybrid Models suggests a future where AI systems dynamically adapt their processing strategies, potentially choosing between fast responses and deep reasoning based on task complexity. Finally, Deep Research Agents showcase the potential for LLMs to act as autonomous agents, conducting in-depth investigations and synthesizing knowledge from extensive online sources. Understanding these different categories is crucial for anyone looking to leverage LLMs effectively. Each type possesses unique strengths, weaknesses, training requirements, and ideal use cases. A base model might suffice for exploring language patterns, while a complex planning task demands a reasoning model. Analyzing a chart within a document requires multimodal capabilities, and scaling to extreme parameter counts might necessitate an MoE architecture. Choosing the right type of model, or understanding the capabilities of a given model, depends heavily on the specific task and desired outcome. The field continues to advance at an unprecedented pace. The lines between these categories may blur further as new architectures and training techniques emerge, combining features in novel ways. 
However, the fundamental principles underlying these different approaches – unsupervised learning, supervised fine-tuning, reinforcement learning, sparsity, multimodality, reasoning processes, and agentic behavior – will likely remain key building blocks for future generations of AI. As these models become increasingly integrated into various aspects of our lives, a clear understanding of their diverse forms and functions will be essential for navigating the future of artificial intelligence. References [1] Toloka Team. (2024, November 19). Base LLM vs. instruction-tuned LLM. Toloka Blog. Retrieved May 2, 2025, from https://toloka.ai/blog/base-llm-vs-instruction-tuned-llm/ [2] AWS. What is LLM? - Large Language Models Explained. Amazon Web Services. Retrieved May 2, 2025, from https://aws.amazon.com/what-is/large-language-model/ [3] Boutnaru, S. (2025, February 9). The Artificial Intelligence Journey — Base LLM (Base Large Language Model). Medium. Retrieved May 2, 2025, from https://medium.com/@boutnaru/the-artificial-intelligence-journey-base-llm-base-large-language-model-726423106b14 [4] OpenAI. (2024, September 12). Learning to reason with LLMs. OpenAI Blog. Retrieved May 2, 2025, from https://openai.com/index/learning-to-reason-with-llms/ [5] Sanseviero, O., Tunstall, L., Schmid, P., Mangrulkar, S., Belkada, Y., & Cuenca, P. (2023, December 11). Mixture of Experts Explained. Hugging Face Blog. Retrieved May 2, 2025, from https://huggingface.co/blog/moe [6] Raschka, S. (2025, February 5). Understanding Reasoning LLMs. Ahead of AI. Retrieved May 2, 2025, from https://sebastianraschka.com/blog/2025/understanding-reasoning-llms.html [7] NVIDIA. What Are Multimodal Large Language Models? NVIDIA Glossary. Retrieved May 2, 2025, from https://www.nvidia.com/en-us/glossary/multimodal-large-language-models/ [8] Azarafza, M., Nayyeri, M., Steinmetz, C., Staab, S., & Rettberg, A. (2024). Hybrid Reasoning Based on Large Language Models for Autonomous Car Driving. arXiv preprint arXiv:2402.13602. [9] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903. [10] Olteanu, A. (2025, February 5). OpenAI's Deep Research: A Guide With Practical Examples. DataCamp Blog. Retrieved May 2, 2025, from https://www.datacamp.com/blog/deep-research-openai
21 min read
authors:
Rohit Aggarwal

Article
EduAiQ: Intelligent Quota Management for Academic AI Access Faculty, staff, and students increasingly rely on AI chatbots like ChatGPT for teaching, research, and day-to-day academic tasks. However, providing this level of access at scale presents a major challenge. Purchasing individual ChatGPT Plus licenses across an entire university would be prohibitively expensive, and would also lock the institution into a single provider—a risky move given how rapidly the AI ecosystem is evolving, with no clear long-term winner. Further, different academic needs call for different AI models—humanities may prioritize conversational fluency, while STEM disciplines may require code generation and technical reasoning. Meanwhile, a new wave of open-weight models like LLaMA, DeepSeek, and Qwen are rapidly approaching or matching ChatGPT’s capabilities, and can be deployed at a fraction of the cost—especially via serverless inference providers such as Together.ai. At the same time, educational institutions need centralized tools to manage and allocate AI resources—allowing administrators to set usage policies and quotas across colleges, departments, faculty, courses, and students, while staying within contractual or budgetary limits. This requires both technical flexibility and governance infrastructure—a solution that respects budgets while still fostering innovation. The Solution: EduAiQ EduAiQ could be an open-source platform designed specifically for academic institutions to provide affordable, controlled, and scalable access to AI model APIs. It integrates with GPT4All, a powerful open-source ecosystem that delivers a ChatGPT-style chat interface and supports multiple backend model providers. GPT4All already connects to APIs like ChatGPT and can be extended to work with other closed and open-weight models (e.g., Claude, Gemini, DeepSeek, Mistral) via providers like Together.ai. With EduAiQ, educational institutes can: Deploy a single, unified chat interface like ChatGPT using GPT4All with open-weight Model APIs Avoid paying for per-user licenses by managing backend API usage directly Choose and switch between AI models based on performance, price, or policy Support both commercial and open-source models to optimize cost and compliance Hierarchical Quota Management The platform also includes a robust multi-tiered quota and access control system. Administrators at the university, college, department, faculty, and course levels can: Allocate and enforce usage limits Inherit or override quotas from higher levels Assign separate quotas for teaching vs. research Track usage metrics to forecast demand and control costs This hierarchical structure ensures flexibility, accountability, and autonomy—without breaching budgetary or contractual limits with API providers. Here's a breakdown of the concept and its key components: Key Concepts Quota Hierarchy University Admin → manages global contract with model providers and sets quotas for each college. College Admin → gets a share from the university quota or buys its own; allocates to departments; sets quotas for various departments. Department Admin → gets a share from the college quota or buys its own; sets quotas for faculty (research & teaching) and courses. Faculty → gets separate quotas for teaching and research; Course → gets a quota that students use for assignments, labs, etc. Admin level rights: Can allocate quotas downwards as percentages or fixed limits. 
Can inherit quota from the level above Can acquire/buy their own (grant-funded, special budgets, etc.). Can track and adjust usage in real-time. Functional Requirements Quota Management Interface Web-based dashboards for each level of admin (University, College, Dept., etc.) Set, update, and track quotas Allocate quotas as percentages or fixed values View usage reports and forecasts Authentication & Role-Based Access Control Integration with university SSO systems (e.g., Shibboleth, CAS) Fine-grained permissions based on role (student, faculty, admin) Usage Metering and Billing Track API usage (input/output tokens) Map usage to users and roles Generate reports and optionally chargebacks (e.g., deduct from grant) API Gateway or Proxy Layer A middleware that routes requests to various model providers Enforces quotas and access rules Logs usage for accounting Extensibility Support for multiple API providers Private models (ChatGPT, Claude, Gemini, Grok), Open-weight inference API providers such as Together.ai for open-weight models such as (Llama, Deepseek, Mistral, Qwen) Pluggable system to add new models/contracts Allow export of usage data for integration with university ERP/billing systems Benefits Cost control and accountability at every level. Scalability across diverse use cases: teaching, research, student projects. Autonomy for departments/faculty with oversight from higher levels. Encourages responsible use of expensive API resources. Possible Tech Stack Backend: Node.js, Python (FastAPI), or Go Frontend: Next.js and React (with role-based dashboards) Database: PostgreSQL for quotas and usage data Auth: SAML/OAuth integration with university identity providers Logging/Analytics: ELK stack, Kafka, Spark
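As a sketch of how the proxy layer could enforce hierarchical quotas, the snippet below models nested quota nodes (university → college → department/course) and charges token usage up the chain. The names, numbers, and structure are illustrative assumptions, not a finished design.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuotaNode:
    """One level in the hierarchy (university, college, department, faculty, or course)."""
    name: str
    limit_tokens: int                       # budget allocated to this node
    used_tokens: int = 0
    parent: Optional["QuotaNode"] = None

    def can_spend(self, tokens: int) -> bool:
        node = self
        while node is not None:             # every ancestor must also have headroom
            if node.used_tokens + tokens > node.limit_tokens:
                return False
            node = node.parent
        return True

    def charge(self, tokens: int) -> None:
        if not self.can_spend(tokens):
            raise RuntimeError(f"Quota exceeded at or above '{self.name}'")
        node = self
        while node is not None:             # record usage at every level for reporting
            node.used_tokens += tokens
            node = node.parent

# Illustrative allocation: a college receives a share of the university budget, a course a share of the college's.
university = QuotaNode("University", limit_tokens=10_000_000)
college = QuotaNode("College of Business", limit_tokens=4_000_000, parent=university)
course = QuotaNode("IS-6950", limit_tokens=500_000, parent=college)

course.charge(1_200)    # tokens consumed by one student request routed through the proxy
print(course.used_tokens, college.used_tokens, university.used_tokens)   # 1200 1200 1200
```

In a real deployment this check would sit in the API gateway described above, with usage persisted to the PostgreSQL store and attributed to the authenticated user's role.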
5 min read
authors:
Rohit Aggarwal

Article
Happy International Recruiters Day! I missed writing this shout out two weeks back but better late than never. Over the past few years, I’ve had the pleasure of working closely with many recruiters. As someone observing the hiring ecosystem from both strategic and human angles, I’ve come to a deep and genuine appreciation for what recruiters actually go through—and why they deserve far more credit than they get. Recruiting, from the outside, might seem like it’s all LinkedIn messages and phone calls. But beneath that surface is a job that’s demanding, emotionally draining, and often thankless. Yet, day after day, recruiters show up and do it anyway. Here’s just a glimpse of what they deal with: 💬 Ghosting Is Real Recruiters spend hours carefully sourcing, screening, prepping, and coordinating interviews. Then, just when it seems like things are progressing—poof. The candidate vanishes. No call. No reply. Just silence. It’s not only frustrating, it’s demoralizing—especially when that recruiter genuinely believed they were helping someone move forward in their career. 🤔 Hiring Managers Who Don’t Know What They Want One day, a role needs a "rockstar senior lead." The next, it’s "actually, we’d prefer someone more junior." Recruiters are constantly chasing a moving target, trying to balance shifting expectations while still sourcing viable candidates in a competitive market. It’s like being told to hit a bullseye—while the dartboard is on a rollercoaster. 🤯 Unrealistic Expectations “Find someone with 10 years of experience… in a technology that’s only existed for three.” Enough said. (And let’s not even start on the budget constraints.) 🤬 Rejection Fallout Recruiters are the ones who deliver the bad news—sometimes daily. And while most candidates understand, some lash out. Frustrated job seekers, understandably emotional, often end up venting their pain at the one person who tried to help. Recruiters carry that emotional weight. Quietly. Repeatedly. 🗑️ Being Treated Like Spam Reaching out to passive candidates isn’t easy. Especially when they’re met with cold shoulders, snarky replies, or worse—outright disrespect. Many recruiters are genuinely trying to present good opportunities, not just “fill a role.” But still, they get brushed off like inbox clutter. 📚The Need to Constantly Learn Job roles are evolving. Technologies shift monthly. Market demands pivot on a dime. Good recruiters aren’t just messengers—they’re students. They learn the ins and outs of new roles, skill sets, and industry trends just to have relevant conversations and make the right matches. They’re expected to understand every job they hire for, even if the role didn’t exist last year. 🗜 Pressure from All Sides Recruiting is a numbers game—but it’s also deeply human. Success is measured by placements, time-to-hire, and funnel ratios. But those metrics don’t show the late-night calls, the emotional support, or the hours of prep before a big interview. You’re stuck between candidates who want clarity and empathy, and hiring teams who want speed and perfection. ✨And Still… They Show Up Despite all of this—despite the frustration, the emotional labor, the ambiguity—recruiters keep going. They guide candidates. They mediate expectations. They play therapist, career coach, and project manager all in one. They help change lives. Often without getting thanked. So to all the recruiters out there grinding through the noise, the rejections, and the endless juggling: I see you. I appreciate you. Thank you for doing what you do. 
If you’ve worked with a recruiter who made a difference in your career—tell them. A quick thank you can mean more than you know. #Recruiters #ThankYou #Hiring #MAdAiLab
2 min read
authors:
Rohit Aggarwal

Article
The use of AI is quickly becoming ubiquitous, and it's no surprise that job candidates are tapping into it at every stage of the hiring process. From résumé writing to interview prep, AI is fundamentally reshaping how applicants present themselves. But this transformation isn't without consequences. Applicant Tracking Systems (ATS), once a cornerstone of candidate filtering, are becoming less effective as AI-generated resumes easily pass keyword and formatting checks. At the same time, a tight and competitive job market means recruiters often face hundreds—sometimes thousands—of applicants for a single position. To cope, many recruiters rely on skills assessments to screen candidates. But that strategy is also being undermined by AI. As models become more powerful, what may be considered an effective test today can quickly become obsolete within weeks. Recruiters are forced into a game of cat and mouse—constantly redesigning assessments that might briefly outpace AI’s capabilities. Another layer of complexity: access to AI is not equal. Candidates with the means to afford premium tools like ChatGPT Pro ($200/month for team access) or Claude Opus have an unfair advantage. These tools can complete tests far more effectively than free-tier models. So in effect, the hiring process is tilted toward those with financial privilege—if your parents can pay for elite AI, you’re suddenly a stronger applicant. This dynamic has created what recruiters increasingly refer to as an "AI arms race" among candidates. As more applicants leverage AI to enhance their applications, those who don't use these tools appear significantly less qualified by comparison—even when they possess superior actual skills. This competitive pressure forces even hesitant or ethically concerned candidates to adopt AI assistance simply to remain viable in the candidate pool. This escalating cycle normalizes AI dependence and further obscures genuine talent assessment, creating a classic prisoner's dilemma where individual rational choices lead to a collectively problematic outcome. Even live interviews are no longer the gold standard. With the rise of interview co-pilots—tools that feed candidates real-time responses over Zoom—recruiters can no longer be certain if they’re speaking with a candidate or a candidate-plus-AI hybrid. These tools can convincingly speak about complex projects the candidate may not have even worked on. This challenge has prompted some companies to take a hard stance. Even Anthropic, the AI research company behind Claude 3.5 Sonnet—a model now more widely used than ChatGPT-4—has banned candidates from using AI during the application process. To many, this feels deeply ironic: why would an AI company prohibit the use of AI? But the irony disappears when you consider the deeper challenges. A common argument in favor of allowing AI is that employees already use it on the job—so why not let candidates use it too? Denying AI during the hiring process, some say, is a false constraint that fails to reflect how people actually work. That may sound reasonable in theory, but it overlooks the practical difficulties of designing effective, AI-resistant assessments. It’s easy to blame hiring managers for not being creative enough—but the reality is far more complicated. Ideally, a recruiter would want to hire a professional who can guide AI—especially in scenarios where AI falls short. 
AI tends to fail when it lacks context, when information is siloed across teams, or when decisions rely on undocumented institutional knowledge or nuanced domain expertise. These are exactly the kinds of gaps humans fill by asking the right questions, drawing from experience, and applying judgment in ambiguous situations. In a post-AI world, the focus increasingly shifts from "how" (which AI can often handle) to "why" and "what." However, the more ambiguity a test includes to assess this kind of thinking, the more likely candidates are to drop out—often due to unfamiliarity or discomfort. On the flip side, the more a test is simplified for objectivity or scalability, the easier it becomes for AI to solve. That's why traditional approaches—like data structures and algorithms problems, or platforms like HackerRank and LeetCode—have long served as the go-to for candidate evaluation. But those methods only worked when AI wasn't capable of solving them easily. That era is over.
And so, this new wave of evaluation processes banning AI use during hiring begins to make sense. The goal isn't to resist progress, nor are hiring managers being short-sighted or pedantic—like insisting on using log tables in an era of ubiquitous calculators. The intent is to test whether candidates possess fundamental skills without AI assistance. After all, guiding AI through complex scenarios requires a solid grasp of the basics. If a candidate can't solve simple problems on their own, how can they be expected to troubleshoot or direct AI effectively when it inevitably runs into limitations? Succeeding at basic tasks without AI now seems like a necessary condition. And in the absence of a clear sufficient condition to evaluate deeper competency, many companies are turning to this necessary condition as a pragmatic compromise.
Some pioneering companies are exploring technological countermeasures to restore assessment integrity. Machine-proctored testing platforms that record candidates' screens and video and apply AI detection algorithms show promise for the initial evaluation round. After this initial filtering, they use tests that involve AI-human collaboration directly—designing exercises that explicitly require candidates to demonstrate how they would guide AI through complex scenarios with incomplete information.
But we know this space is evolving fast. How are you handling it? Have you encountered cases where candidates used AI to misrepresent their skills—or seen clever ways they've cheated the system? How are you adapting your evaluation process to stay ahead of these shifts? Share your experiences and strategies—we're all figuring this out together. #RecruitmentTrends #AIinHiring #TalentAcquisition #FutureofWork #MAdAILab #AI
3 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The adoption of artificial intelligence (AI) technologies, such as ChatGPT, has surged dramatically, propelling the market capitalization of Nvidia (AI hardware manufacturer) to over $3.3 trillion and establishing it as the world's largest company in July 2024. Looking ahead, global investment in AI technologies is projected to reach $200 billion by 2025, underscoring the expanding role of AI across various industries. Generative AI (GenAI) has become a central focus, accounting for 48% of total AI funding in 2023—a significant rise from just 8% in 2022. Until the last decade, much of what was popularly considered Artificial Intelligence (AI) was dominated by traditional machine learning (ML) methods such as regression, support vector machines (SVMs), random forests, and XGBoost. These methods excelled in many applications and were particularly well-suited to structured data—tabular datasets with defined numerical or categorical features. Traditional ML typically follows a structured pipeline with well-defined steps as shown in the first part of the diagram: data gathering, feature engineering, model training, and deployment. They performed remarkably well for tasks such as classification (e.g., predicting categories) or regression (e.g., forecasting numbers). Applications like recommender systems and predictive analytics flourished using these techniques. However, traditional ML had limitations when applied to tasks involving unstructured data like text, images, or audio. To illustrate, consider a social media manager tasked with choosing an image for a blog article. This process might involve: Identifying suitable keywords based on the article content. Searching for images using those keywords. Evaluating multiple image attributes, including aesthetics, relevance, and emotional appeal. Making a final decision. Traditional ML approaches could extract keywords or classify images to some extent, but required extensive feature engineering, significant human effort and expertise for feature engineering and often resulted in subpar performance. This was a major reason why many creative, human-centric tasks remained difficult to automate. The emergence of deep learning and, more recently, Generative AI (GenAI), has transformed how AI tackles such challenges. Deep learning models handle unstructured data like text, images, audio and video in a way that traditional ML could not. These models learn to extract complex patterns directly from raw data, effectively eliminating the need for manual feature engineering. GenAI models can be highly useful for assisting with parts of human decision making, especially those involving text/image generation, summarization, answering question based on context, extracting keywords as per instructions, etc.). This fundamental shift in capabilities has created both new opportunities and challenges. Yet despite these advances, many discussions still equate AI with traditional ML approaches, failing to recognize that GenAI is the primary driver of current enthusiasm and innovation. This oversimplification not only misrepresents the technology but also risks diverting attention from the unique challenges of GenAI. Consequently, professionals may overlook the need to address these challenges and allocate resources effectively to ensure GenAI's reliable integration into business workflows. In this article, we examine the fundamental ways GenAI differs from traditional ML and explore how these distinctions necessitate workflow redesign. 
The accompanying flowchart provides a comprehensive visualization of a typical GenAI workflow, highlighting six key areas where it diverges from conventional ML pipelines: workflow experimentation, metrics and evaluation, guardrails, output monitoring, prompt engineering, and model drift. As the flowchart illustrates, each of these areas introduces new considerations and complexities that must be carefully managed. Below, we discuss these key differences in detail, demonstrating why organizations need to fundamentally reevaluate how they develop, guide, and deploy GenAI systems. 1. Workflow Experimentation for Human-Like Decision Making GenAI's capacity to simulate human decision-making enables organizations to automate complex tasks that were previously considered too nuanced for traditional automation. However, this capability introduces new challenges in workflow design and optimization, as shown in the "Workflow Experimentation" element of our flowchart. While Generative AI (GenAI) has revolutionized certain aspects of human decision-making automation, traditional tools like coding, machine learning (ML), APIs, and database calls remain essential for other facets. Implementing GenAI effectively requires more than just leveraging its advanced capabilities; it necessitates a nuanced approach to replicating the integrated and often implicit cognitive steps humans naturally perform during decision-making. A key aspect of GenAI workflow experimentation involves task decomposition—an essential process of unpacking what were previously fluid human cognitive processes into discrete, manageable components that can be handled by AI systems. When humans perform complex tasks, they often seamlessly integrate multiple cognitive processes without explicitly defining each step. For instance, a human editor reviewing an article simultaneously considers content accuracy, tone, audience engagement, and cultural sensitivity without consciously separating these aspects. The challenge in GenAI implementation lies in identifying and articulating these implicit cognitive steps and finding ways to replicate them through AI operations. This decomposition process often reveals multiple potential approaches to solving the same problem, each representing a different way of breaking down and reconstructing human cognitive workflows, with its own trade-offs in terms of reliability, cost, and complexity. Consider, for example, the task of selecting an appropriate image for a blog post. This seemingly straightforward task can be approached through several workflow variations: Option A: Extracting keywords and themes from the article using GenAI, then using these elements to filter images on a stock platform. This approach prioritizes content relevance but might miss emotional resonance. Option B: Using GenAI to analyze both the article's content and its intended emotional impact, then filtering images based on these dual parameters. This method might yield more emotionally aligned results but requires more complex processing. Similarly, in content moderation—another domain where GenAI is increasingly deployed—different workflow variations emerge: Option A: Direct classification of content using a single GenAI pass, which is faster but might miss subtle violations. Option B: A multi-step approach where content is first decomposed into distinct elements (text, context, implications) before being analyzed, potentially offering higher accuracy at the cost of increased processing time and complexity. 
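To make the image-selection example above concrete, here is a minimal sketch of Option A, assuming an OpenAI-style chat-completions client; the model name and the search_stock_images helper are illustrative stand-ins, not part of any specific product.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and an API key is configured

client = OpenAI()

def search_stock_images(keywords: list[str]) -> list[str]:
    """Hypothetical stand-in for a stock-photo platform's search API."""
    return [f"https://example.com/stock?q={'+'.join(keywords)}"]

def extract_keywords(article_text: str) -> list[str]:
    """Option A, step 1: ask the model for search keywords that capture the article's themes."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Return five comma-separated stock-photo search keywords for this article."},
            {"role": "user", "content": article_text},
        ],
    )
    return [kw.strip() for kw in response.choices[0].message.content.split(",")]

def pick_image(article_text: str) -> str | None:
    """Option A, step 2: filter the stock platform by the extracted keywords and take the top hit."""
    candidates = search_stock_images(extract_keywords(article_text))
    return candidates[0] if candidates else None
```

Option B would insert a second model call that scores each candidate image for emotional alignment with the article before the final pick, trading extra latency and cost for a better match.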
These workflow variations illustrate why experimentation is crucial in GenAI implementation. Each approach represents a different way of decomposing the task, and the optimal choice often depends on specific use case requirements, resource constraints, and performance metrics. This level of workflow experimentation stands in marked contrast to traditional ML approaches. Traditional machine learning (ML) models focus on specific predictive and descriptive tasks, with their architectures designed around transforming input data into well-defined output formats. Unlike generative AI models that can produce diverse forms of content, traditional ML specializes in targeted tasks like classification, regression, and pattern recognition. They are typically employed in structured environments with clearly defined inputs and outputs, where the goal is to optimize performance on specific metrics. Hence, they do not require extensive workflow experimentation, where different task decompositions and sequences need to be explored. The flowchart illustrates this divergence by showing how GenAI workflows can branch into either prompt-based solutions or fine-tuned models, depending on the task requirements. This flexibility in approach, combined with the ability to experiment with different task decomposition strategies, allows organizations to iteratively develop and refine workflows that can handle increasingly sophisticated decision-making processes. Potential workflow sequences must be compared carefully, with the final choice balancing trade-offs across various metrics such as reliability, cost, and complexity. 2. Metrics and Evaluation Traditional ML systems rely on clear, quantitative metrics such as accuracy, mean squared error (MSE), and F1 score, which are objective and easily automated. For instance, in a customer churn prediction model, accuracy directly measures the percentage of correct predictions, while MSE quantifies the difference between predicted and actual values in tasks like sales forecasting. These metrics provide straightforward ways to assess model performance and guide improvements. The Metrics Planning step for GenAI requires a more nuanced and multi-faceted approach. GenAI outputs, such as empathetic dialogue, error-free technical instructions, or humorous marketing copy, often require subjective evaluation that defies straightforward numerical measurement. Additionally, GenAI systems can sometimes generate plausible but factually incorrect information—a phenomenon known as hallucination—which requires specialized detection methods. This complexity necessitates three distinct types of evaluation: Automated Metrics: For text generation tasks, metrics may include ROUGE scores (which measure content overlap with reference texts), specialized checks for hallucination and factual accuracy, and RAGas benchmarks (Retrieval-Augmented Generation assessment scores that evaluate how accurately the model uses retrieved information in its responses). While these automated metrics provide a baseline for evaluation, they capture only certain aspects of output quality. Generative AI-Based Evaluations: Specialized models can assess various qualitative aspects such as coherence, creativity, emotional alignment, or contextual relevance—whether for text, images, or speech outputs. These models can also cross-validate factual claims and identify potential inconsistencies that might indicate hallucination. 
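As one illustration of how such generative evaluators can be wired in, here is a minimal sketch of an "LLM as judge" check, assuming the OpenAI Python SDK; the rubric, model choice, and escalation threshold are invented for the example.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and an API key is configured

client = OpenAI()

RUBRIC = ("Rate the response from 1 to 5 for coherence, empathy, and factual consistency with the context. "
          "Reply as JSON with keys coherence, empathy, factual_consistency.")

def judge_output(context: str, candidate_response: str) -> dict:
    """Use a second model call as an evaluator and return structured quality scores."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of evaluator model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Context:\n{context}\n\nResponse to evaluate:\n{candidate_response}"},
        ],
    )
    return json.loads(result.choices[0].message.content)

scores = judge_output("Refund policy: 30 days with receipt.",
                      "You can return the item within 30 days as long as you kept the receipt.")
needs_human_review = min(scores.values()) < 4  # illustrative threshold for routing to human assessors
```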
This approach creates a more comprehensive evaluation framework that can handle subjective qualities at scale. Human Assessments: Domain experts and end-users play a crucial role in evaluating subjective qualities such as empathy, humor, and cultural sensitivity through the human-in-the-loop process. Human assessors are particularly valuable in catching subtle hallucinations that might slip past automated checks, especially in domain-specific contexts where factual accuracy is critical. These three evaluation streams converge in the Fine-Tuned Model or Prompt Chain Validation phase, where results are synthesized and compared against human expectations. When misalignments emerge—for instance, if automated metrics indicate high performance but human assessors find the outputs lacking appropriate emotional tone or cultural nuance—teams can pursue several established adjustment paths. These include revising the evaluation metrics, experimenting with different prompt chains, or modifying fine-tuning strategies. Teams can first try different metrics to better capture the desired generated output aspects. If misalignments persist, they can explore different prompt chains or models for fine-tuning. When both these approaches prove insufficient, teams can try a different workflow variation, such as trying alternative task decomposition strategies. This comprehensive evaluation process stands in marked contrast to traditional ML approaches. In conventional ML, performance improvements typically involve straightforward parameter tuning, algorithm selection, or feature engineering. GenAI evaluation, however, requires a more sophisticated approach that balances multiple feedback loops and assessment types. This multi-layered system helps ensure not only technical accuracy and the absence of hallucination, but also the subjective qualities that make GenAI outputs truly valuable. 3. Pre-Generative Guardrails Following the careful design of workflows, implementing robust pre-generative guardrails becomes essential for responsible GenAI deployment. As shown in the flowchart's Query Processing section, these guardrails serve as a critical checkpoint before any user query reaches the AI system, with clear "Pass" and "Reject" decision paths. Unlike traditional ML systems that operate within tightly controlled environments, GenAI models face dynamic, user-driven interactions that require comprehensive protective measures. These pre-generative guardrails evaluate incoming queries through several key security and safety lenses: Prompt Injection Defense: Users might attempt to manipulate the model by embedding instructions within seemingly innocent queries that could override the system's intended behavior. For example, a user might append "ignore previous instructions" to their query, attempting to bypass safety constraints. Pre-generative guardrails detect and block such manipulation attempts through pattern matching and semantic analysis. Societal Risk Assessment: These checks evaluate whether incoming queries could lead to outputs that might cause broader social harm, such as requests for generating discriminatory content, promoting misinformation, or creating misleading deepfakes. Reputational Risk Screening: Separate from societal concerns, these checks focus specifically on protecting organizational interests by identifying queries that might result in outputs misaligned with company values, brand guidelines, or regulatory requirements. 
For instance, in a financial services context, this might involve blocking queries that could lead to unauthorized financial advice. The flowchart illustrates how these guardrails operate in practice through a binary decision process: Pass: Queries that clear all safety checks proceed to the AI-augmented workflow deployment stage for processing. Reject: Queries flagged as potentially harmful are blocked, triggering the "Inform user/system" response. This might involve providing users with a clear explanation of why their query was rejected and suggesting appropriate modifications, or alerting system administrators about potential misuse patterns. Consider a practical example: In a customer service AI system, incoming queries first pass through these guardrails. If a user asks for help with a product return, the query passes through. However, if they request confidential information about other customers, the system would reject the query and inform the user about data privacy policies. These pre-generative guardrails work in concert with the broader workflow shown in the flowchart, forming the first line of defense in a comprehensive safety framework. 4. Monitoring and Tracing The monitoring phase in GenAI deployment introduces unique operational challenges that go beyond traditional ML's focus on system performance and data drift detection. To address these challenges, organizations rely on random sampling and human evaluation of production outputs as a cornerstone of their monitoring strategy. This approach enables teams to detect subtle quality issues, validate metric effectiveness, and identify potential degradations that automated systems might miss. Random sampling proves particularly crucial for GenAI systems because they are susceptible to "silent degradation" - a phenomenon where outputs maintain good metric scores while becoming less effective in practice. For instance, a customer service AI might continue to achieve high ROUGE scores while providing less actionable information to users. Through random sampling, organizations can identify these cases where outputs achieve high metric scores but fail to meet actual user needs. This distinction requires monitoring not just metric values but the metrics themselves, creating a meta-level monitoring challenge unique to GenAI systems. Organizations implement continuous feedback loops that reassess metric effectiveness through careful analysis of user interactions. This monitoring approach becomes even more vital when dealing with foundation models accessed through external APIs - a common architecture in GenAI systems. Unlike traditional ML systems where organizations train models from scratch on their specific data, GenAI systems often rely on external foundation models beyond their direct control. When providers update or deprecate these underlying models, it can impact system performance in subtle ways. For example, a model API upgrade might improve general performance but alter the model's reasoning ability, requiring adjustments to prompts or fine-tuning strategies. Through this comprehensive monitoring approach, organizations can maintain robust oversight of their GenAI systems despite the unique challenges of external dependencies and potential silent degradations. Tracing: In traditional ML, tracing an incorrect prediction is straightforward – teams can directly examine the input features, model weights, and final prediction to understand what went wrong. 
There's typically no need to track intermediate steps since the prediction process is deterministic and follows a fixed pipeline. However, GenAI systems often involve multiple processing steps with intermediate outputs that must be traced. For example, a customer support AI might first classify the query type, then retrieve relevant documentation, and finally generate a response by combining this information. When the final output is problematic, organizations need to trace through each step to identify where the process went awry. This traceability becomes particularly challenging with AI agents, where the model dynamically decides which steps to take and which tools to use based on the context. In such cases, two identical queries might follow completely different execution paths, making it essential to maintain detailed traces of decision points, tool selections, and intermediate outputs for effective monitoring and debugging. 5. Guardrails on Generated Output Post-deployment validation of GenAI outputs remains essential despite pre-deployment safeguards, with three primary types of guardrails: hallucination checks, ethical/bias controls, and brand tone verification. Traditional ML outputs are typically constrained to specific, pre-defined categories or numerical predictions, eliminating concerns about brand voice consistency or open-ended ethical implications. Hallucination presents a unique challenge in GenAI systems, as they can produce convincing but inaccurate information. For instance, an AI handling insurance queries might confidently provide incorrect policy details, creating liability risks. Unlike traditional ML models, GenAI can generate inconsistent outputs for identical inputs, necessitating runtime guardrails beyond standard performance monitoring. These include analyzing semantic drift, checking for contradictions, validating claims, and employing multiple model consensus to identify potential hallucinations. Ethical and bias controls involve regular auditing of response patterns across demographic groups, with specific fairness metrics under continuous monitoring. Organizations establish and maintain explicit ethical guidelines, ensuring transparency about system capabilities while documenting limitations and appropriate use cases. Regular reviews and updates of these principles occur based on real-world impact assessments. Brand voice consistency requires continuous validation, implemented through LLM-based validators trained on company communications, encoded style guides, contextual appropriateness checks, and brand personality metrics. For example, if a company emphasizes warm, empathetic communication, guardrails would flag overly technical responses even if factually accurate. When guardrails detect issues or in high-stakes scenarios, outputs are routed for human review through clearly established escalation pathways. The flowchart shows how GenAI systems require two critical runtime pathways: a "Pass" pathway for outputs that clear automated checks, and a "Review Needed" pathway that routes flagged outputs to human reviewers. This human-in-the-loop review process involves trained professionals who handle edge cases, provide rapid intervention for inappropriate content, and validate complex brand voice alignment. The system's effectiveness relies on tight integration between automated checks and human review, with continuous feedback loops refining both components. 6. 
Iterating and Experimenting with Prompt Engineering Unlike traditional ML systems where model inputs follow rigid, predefined structures, GenAI systems often require carefully crafted prompts that serve as the interface between human intent and model capability. This fundamental difference emerges from GenAI's ability to process natural language instructions, making prompt engineering a crucial discipline that combines technical precision with domain expertise. The flowchart illustrates how prompt engineering fits into a larger workflow, positioned after the "Prompt Chain & Model planning" phase. Teams must first determine whether a task can be effectively accomplished through prompt chains and plan how different prompts will work together. For instance, a customer support workflow might begin with a prompt that classifies the query type, followed by separate prompts for information retrieval and response generation. This structured approach to prompt chain design provides a framework for systematic prompt development and optimization. The trade-off between prompt length and performance represents one of the most critical considerations in prompt engineering. Organizations must establish comprehensive testing frameworks that balance multiple competing factors. A minimal prompt might achieve faster response times but risk inconsistent outputs, while more detailed prompts can ensure reliability at the cost of increased processing overhead. Testing frameworks typically evaluate prompts across four key dimensions: reliability testing to verify consistent performance across diverse inputs and edge cases, cost analysis to measure and optimize token usage and processing time, quality assessment using both automated metrics and human evaluation to ensure outputs meet standards, and safety verification to confirm that prompts maintain model guardrails and prevent unsafe behavior. Through this systematic evaluation process, teams can quantify the impact of prompt length and complexity on both performance and operational costs. Prompt versioning and performance tracking introduce a level of rigor absent in traditional ML feature engineering. Teams maintain prompt repositories where each version is tagged with performance metrics such as completion rate (percentage of successful responses), accuracy scores from automated evaluations, and user satisfaction ratings. These historical comparisons help teams understand how prompt modifications impact performance. For example, when a financial services chatbot shows declining accuracy in tax-related queries, teams can analyze how different prompt versions handled similar questions in the past, using this data to guide improvements. The role of non-technical stakeholders in prompt engineering represents a significant departure from traditional ML practices. While feature engineering in traditional ML is primarily a technical exercise, effective prompt engineering requires cross-functional collaboration, relying heavily on non-technical stakeholders for crafting clear and reliable prompts. Legal experts help craft prompts that capture nuanced regulatory requirements, marketing teams ensure brand voice consistency, and subject matter experts validate technical accuracy. These stakeholders don't just provide input – they actively participate in prompt design and improve prompts with appropriate technical terminology, reasoning steps, or user persona details. 7. 
Feedback Unlike traditional ML systems where feedback primarily drives model retraining and feature engineering, GenAI feedback loops introduce unique complexities that fundamentally reshape how organizations learn from and improve their AI systems. The "Aggregate feedback for Workflow improvement" node in our flowchart illustrates this expanded feedback scope, showing how multiple streams - including direct user feedback, human-in-the-loop assessments, and monitoring data - must be synthesized to drive improvements across different system components. The variable nature of GenAI outputs creates the first major departure from traditional ML feedback patterns. In traditional ML, when a model makes an incorrect prediction, the feedback is straightforward: the prediction was wrong, and the model needs to be adjusted to map that input to the correct output. However, GenAI systems might generate different but equally valid outputs for the same input, or produce outputs that are partially correct but need refinement in specific aspects like tone or detail level. This variability means organizations must develop sophisticated feedback taxonomies that capture not just correctness, but also aspects like reasoning quality, creativity, and contextual appropriateness. The application of feedback in GenAI systems also differs fundamentally from traditional ML. When a traditional ML model receives feedback about poor performance, the solution typically involves retraining the model with additional data or adjusting feature engineering. In contrast, GenAI feedback might lead to multiple types of adjustments: Prompt refinements when feedback reveals gaps in instruction clarity or context Guardrail updates when new edge cases or failure modes are identified Workflow redesigns when feedback shows systematic limitations in current task decomposition Metric adjustments when usage patterns reveal new quality aspects that need measurement HITL routing criteria updates when certain query types consistently require human review This multi-level adaptation process, where operational feedback can trigger revisions at any level of the system—from guardrails to metrics to workflow design—reflects the unique complexity of maintaining effective GenAI systems in production. Conclusion The fundamental differences between GenAI and traditional ML systems necessitate a complete reimagining of AI workflow design and implementation. While traditional ML focuses on specific predictive tasks with clear metrics, GenAI introduces complexities in evaluation, requiring balanced consideration of automated metrics, AI-based assessments, and human judgment. Organizations must master new disciplines like prompt engineering and implement sophisticated guardrails both before and after generation. Successful GenAI deployment demands robust monitoring systems that can detect subtle degradations and maintain traceability across multiple processing steps. The feedback loop becomes more intricate, potentially triggering adjustments across multiple system components—from prompt refinements to workflow redesigns. As GenAI continues to evolve, organizations that understand and adapt to these distinct characteristics will be better positioned to harness its potential while managing associated risks and challenges.
13 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
AI-coding tools are making significant inroads in the software development domain. These tools seem like magic when one first starts using them, as generating a fully functional application with just a few lines of instructions appears surreal. However, this initial excitement quickly subsides as code complexity increases. Problems with AI tools such as Cursor or Windsurf in big codebase In large codebases, AI tools often miss existing code, leading to duplication and inconsistencies. They also tend to forget prior instructions, causing repeated mistakes and forcing developers to reassert constraints. Debugging becomes a frustrating loop of false confidence—AI repeatedly claims to have identified issues, makes minor tweaks, but problems persist. This cycle of "Ah, I see the issue now..." followed by ineffective fixes rapidly grows tiresome. Over time, abandoned or broken code fragments—"ghost code"—accumulate, confusing the AI further and degrading generation quality. Before digging deeper into why AI tools face these problems, it's important to first understand how AI coding tools work. How AI coding tools work: These tools use AI models such as Claude 3.7 Sonnet or ChatGPT-4o to generate code. A logical question to ask is what the role of the AI tool is if it doesn't generate code itself. The tool's primary roles are: Prompt Enhancement for AI model: It enhances prompts for the AI model by taking your instructions, adding relevant context (code) to your prompt, and incorporating guidance (e.g., .cursorrules, .windsurfrules). These tools largely use Retrieval-Augmented Generation (RAG) based approaches to identify appropriate context. While RAG is a step forward in surfacing relevant information, it relies on heuristic or semantic similarity-based retrieval, which often fails to capture code structure, control flow, or execution dependencies. Integrating response in code: It integrates responses into code by reading the AI model's output and incorporating it back into appropriate files and positions, sometimes removing existing code when needed. This integration typically relies on Abstract Syntax Trees (AST) based approaches. Automated code improvement loops (Agentic): It performs agentic execution by running the modified code automatically using a terminal and observing any errors. If errors occur, the tool automatically creates a new prompt using those errors, adds relevant context and guidance, and sends it to the model. The model responds, the tool integrates the changes, and runs the code again. This process continues iteratively until either no errors remain, the maximum number of attempts is reached, or the user intervenes. This agentic execution is both a blessing and a curse. When you provide minimal detailed instructions and allow the AI to operate based on its training, it often successfully replicates code without significant issues. However, as soon as you begin correcting it, adding logic, constraints, and other specific requirements, these tools increasingly struggle to identify the "right context" for your instructions. The key reason for this misidentified context is that as the sequence length (i.e., number of tokens) approaches approximately 60% of the context window (the maximum number of tokens the AI model can process), the model begins experiencing the 'lost in the middle' problem and tends to miss critical elements from the prompt. 
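The agentic loop described above can be pictured with a short sketch; the callables are hypothetical stand-ins for what a tool like Cursor or Windsurf does internally, not their actual code.

```python
import subprocess
from typing import Callable

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and capture any error output."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agentic_fix_loop(
    instruction: str,
    retrieve_context: Callable[[str], str],  # RAG-style lookup of relevant code, supplied by the tool
    call_model: Callable[[str], str],        # sends the assembled prompt to the underlying LLM
    apply_edits: Callable[[str], None],      # AST-based integration of the model's edits into files
    max_attempts: int = 5,                   # illustrative cap on automatic fix attempts
) -> bool:
    """Sketch of the generate, run, observe-errors, re-prompt cycle used by agentic coding tools."""
    prompt = f"{instruction}\n\nRelevant code:\n{retrieve_context(instruction)}"
    for _ in range(max_attempts):
        apply_edits(call_model(prompt))      # model proposes edits; the tool writes them into the codebase
        ok, errors = run_tests()
        if ok:
            return True                      # no errors left, stop iterating
        # fold the observed errors and freshly retrieved context into the next prompt
        prompt = (f"{instruction}\n\nPrevious attempt failed with:\n{errors}\n\n"
                  f"Relevant code:\n{retrieve_context(errors)}")
    return False                             # give up and hand control back to the developer
```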
Identifying relevant context challenge
The key challenge for these AI tools is to identify the relevant context without approaching the context window limit. Despite their best efforts, sequence length rapidly increases when these tools rely solely on code, because code consumes significantly more tokens than text explanations for two main reasons:

Explanations in words require far fewer tokens than code: Explaining concepts in natural language requires far fewer tokens than implementing them in code. What might take just a few sentences to explain—such as "Create a login page with username and password options. Also integrate the option to sign in using Google, Facebook, and LinkedIn"—would translate to hundreds of lines of code when implemented with React packages, CSS, and various library imports.

Tokenization process: The tokenization process used by underlying AI models is less efficient for code than for natural language. For instance:

| Original text | Tokens processed for input to GPT | Number of characters | Number of tokens generated |
|---|---|---|---|
| Loop over productTypes | loop, over, product, type, s | 22 | 5 |
| for ( i = 0; i < count(productTypes); i++) { | for, (, i, =, 0, ;, i, <, count, (, product, type, s, ), ;, i, +, +, ), { | 44 | 20 |

Even accounting for the fact that the code has twice as many characters, it produces four times as many tokens as the equivalent natural language expression. In summary, AI tools struggle to understand code context purely from the code itself because code often lacks the high-level meaning or intent behind it, making it difficult for AI to determine which parts are truly important. This is why prompt compression—reducing the amount of code or context in a prompt—remains such a significant challenge.

AI tools' approaches to mitigating the context problem:
AI tools like Cursor and Windsurf try to help by using rules to guide the AI on what context to include. These rules can be set globally or just for a specific project, and they can change depending on the files or parts of the code you're working on. For example, Cursor lets you define rules that automatically include related files based on the code you're editing. Windsurf does something similar with its memory and context system. While these tools don't fully solve the problem—especially when it comes to deeper understanding like how pieces of code connect behind the scenes—they help the AI focus better.

Another promising aspect of modern AI tools is their use of markdown-based planning documents. Cline was one of the first to formalize this approach by introducing two distinct modes: Plan mode, which focuses on gathering requirements and outlining implementation steps in markdown, and Act mode, which uses those plans to guide actual code generation. Following Cline's lead, tools like Cursor and Windsurf have also introduced similar planning-focused workflows in recent months. In Plan mode, Cline can analyze relevant files, engage in dialogue to clarify objectives, and develop comprehensive implementation strategies. Cursor's Planner mode similarly supports creating detailed, context-aware plans, and its Agent mode can act on those plans autonomously, performing multi-step changes across a codebase. Windsurf brings a different strength with its Cascade system, which builds a semantic model of the project, helping the AI reason about dependencies across multiple files.
These recent advances are pushing markdown planning from static outlines toward dynamic, interactive workflows that better support the complexity of modern software development. However, these planning modes can still struggle when applied to large, modular codebases—particularly if the plans are kept too high-level or not continuously refined. While the initial documents often outline technologies, frameworks, and task breakdowns (e.g., to-do, in-progress, done), the depth of planning depends heavily on how the tools are configured and used. Without iterative refinement or deeper integration with the code structure, these documents may miss important implementation details—especially in systems where components span many interdependent files. That said, tools like Cline, Cursor, and Windsurf are actively evolving in this direction, with features that increasingly support context-aware, multi-file planning and documentation. Augmenting Technical Lead / Architect's with AI Tools While AI tools can significantly benefit from markdown documents covering specifications explained in natural language, technical leads create these specifications separately from the code repos. Since specifications aren't typically stored in the code repo that AI tools can access, architects must manually provide all relevant context to effectively use AI assistance. This manual context-bridging is cumbersome and time-consuming, effectively creating a barrier that leads to minimal AI adoption during the architectural and design phases. Having specifications in the code repository would greatly enhance AI's ability to decompose new requirements into specifications. In particular, it enables: Mapping Dependencies: AI can cross-reference specifications with implementation details in code to uncover architectural constraints, data flows, compatibility with existing APIs or services, and component interactions. This ensures that newly generated specifications align with the existing system’s structure and limitations. Specification Consistency: Referencing existing specs allows the AI to maintain consistent formatting, abstraction level, and terminology. This promotes maintainability and coherence across the codebase, especially as the system scales. Code Reuse Optimization: With visibility into both specifications and code, the AI can identify reusable functions, services, and patterns. It can recommend leveraging existing components rather than rebuilding functionality from scratch—reducing duplication and accelerating delivery. By contrast, if AI-generated specifications were stored directly in the code repository as markdown files: Context Availability: AI coding tools could automatically access both the code and its corresponding human-readable specifications Intent Comprehension: When generating or modifying code, the AI would have immediate access to the high-level intent and requirements Token Efficiency: The specifications would consume far fewer tokens than equivalent code, making better use of the context window Purpose Understanding: The AI would better understand the purpose behind code structures, reducing the likelihood of generating inconsistent or duplicate code The disconnect between specifications and implementation creates significant inefficiencies in the software development lifecycle. 
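Returning briefly to the token-efficiency point above: it is easy to check with OpenAI's tiktoken library. A minimal sketch, reusing the earlier loop example; exact counts may vary slightly by encoding.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent GPT-family models

samples = {
    "natural language": "Loop over productTypes",
    "code": "for ( i = 0; i < count(productTypes); i++) {",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} characters -> {len(tokens)} tokens")
```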
As highlighted earlier, while tools like Cline, Cursor, and Windsurf are evolving to support context-aware planning, the separation between specifications (created by technical leads) and code repositories (accessed by AI tools) creates a fundamental barrier to effective AI adoption in architectural and design phases. Bringing specifications directly into the code repository—ideally as markdown files—solves this gap by making high-level system intent natively accessible to AI tools. This enables AI to participate meaningfully in system design by mapping dependencies across the codebase, maintaining consistency in structure and language, and identifying opportunities for code reuse. With the ability to cross-reference both the specifications and implementation, AI can generate more accurate scaffolding, propose system-level patterns, and ensure new additions align with existing architecture. The result is a more coherent, maintainable, and rapidly evolvable codebase—laying the groundwork for full-cycle AI involvement across system design, coding, and testing. AI-augmented Software Development (System Design + Coding + Testing) As illustrated in the image below, AI-augmented software development streamlines the traditional multi-role workflow into a highly integrated loop, compressing weeks of planning, coding, and testing into days. In this model, AI doesn't merely assist at the code generation level—it participates actively across system design, development, and testing phases. By embedding specifications, code, and test logic into a shared context, AI can iterate through the entire software lifecycle far more fluidly and intelligently. Here's how this transformation takes place: 1. System Design Integration AI tools start by asking clarifying questions, scaffold system designs, and generate specifications directly from high-level requirements. Unlike traditional setups where specs are abstracted from the implementation, here AI gains direct access to the evolving structure of the codebase. This tight feedback loop enhances architectural precision and accelerates design validation. 2. Continuous Contextual Coding Once specifications are established, AI transitions smoothly into coding. Rather than treating code as isolated tasks, the AI draws from the broader system architecture and prior specifications to write aligned, reusable, and dependency-aware code. The ability to reference multi-file contexts significantly improves coherence across modules and helps avoid structural fragmentation. 3. Automated Testing & Error Recovery AI doesn’t just generate tests—it iteratively runs and fixes them. When failures occur, it can trace them back through the system spec and implementation, identify the cause, and attempt auto-corrections. This reduces the back-and-forth cycle between developers and QA, tightening the feedback loop to near real-time. Compression of Sprint Cycles Where traditional development might require 2–3 weeks per sprint, AI-augmented workflows often converge in 3–5 days. This acceleration is not just a product of faster coding—it’s a reflection of reduced context switching, fewer handoffs, and automation of error resolution and testing. The Evolving Human-AI Partnership The human's role is evolving—from writing every line of specification, code, and tests to guiding the AI, answering its questions, and correcting course when it goes off track. 
New workflow: AI: Writes ~90% of specifications, code, and tests Human: Provides clarity, reviews output, fills in gaps, and occasionally writes the tricky parts By saving specifications alongside code and allowing AI tools to operate across both spaces, this development paradigm reduces planning debt and eliminates many points of friction that slow down traditional software delivery. As AI-native tools continue evolving, the emphasis shifts from just writing code to managing a dynamic, context-rich development environment that continuously reasons about what’s being built—and why. Redefining the Technical Team's Role This new partnership fundamentally transforms how technical teams work. Rather than implementing every detail manually, teams now focus on higher-level concerns while leveraging AI to handle routine implementation tasks. The AI handles most of the coding workload, generating specifications based on requirements, implementing standard patterns, writing routine tests, and handling boilerplate code. This frees the technical team to concentrate on what humans do best: providing domain expertise, making architectural decisions, reviewing for business alignment, and solving novel problems that the AI hasn't encountered in its training. Technical teams become more like directors—guiding the AI with clear requirements, reviewing its output for quality and alignment with business goals, intervening when necessary to correct misunderstandings, and writing specialized components that require deep domain knowledge or innovative approaches. This creates a multiplier effect where a single technical team with AI assistance can accomplish what previously required multiple teams of specialists, all while maintaining or even improving code quality and reducing technical debt.
8 min read
authors:
Rohit Aggarwal

Article
Bill Gates recently said, "Within 10 years, AI will replace many doctors and teachers—humans won’t be needed ‘for most things.’" Geoffrey Hinton, often dubbed the "Godfather of Deep Learning," echoed the sentiment: "If a private tutor—that’s a person—is like 2 times better, these [AI tutors] will be 3 or 4 times better... It may not be good news for universities." As someone who’s been fortunate to engage deeply with education systems—and recently moderated a Deans' panel discussing the future of universities ( Deans' Panel Key Takeaways )—I agree that AI will reshape the landscape dramatically. But I don’t believe universities will vanish. If anything, they will adapt, and they must adapt. Here’s why I believe universities are here to stay: 1. Coming of Age Needs a Campus Undergraduate education is not just about learning content—it's about growing up. For many students, university is their first step into independence. It's where teenagers become adults in a safe, intellectually vibrant environment. They form lifelong friendships. They build social and professional networks that often outlast the knowledge from their coursework. And despite what AI can offer in personalized learning, that human environment cannot be replicated digitally—at least not yet. 2. The Signal Still Matters Let’s be real: a university degree is more than knowledge. It’s a signal. It tells employers—and society at large—that someone has stuck with a rigorous path, completed assignments, sat for exams, and acquired a certain level of expertise. AI might one day be able to measure real skill more effectively, but we’re not there yet. Until then, degrees still serve as a trusted validator. 3. Learning Is Like Going to the Gym Almost everyone agrees that staying fit is important. Yet, many of us pay for gym memberships we rarely use. Sound familiar? The same goes for learning. We bookmark articles, sign up for online courses, add videos to "watch later"—with every intention to return. But most of us don't. Learning is work. It’s not always fun. And self-discipline is hard. Universities force you to show up. They give you deadlines, structure, and a curriculum. They keep you accountable, just like personal trainers do. That’s part of their enduring value. But the Threat Is Real… That doesn’t mean we should get complacent. I’m working on an AI agent myself—one that takes a textbook chapter and converts it into a lecture video with my digital clone teaching the material. It’s already about 85% of the way there, and with some human-in-the-loop edits, the result could be better than if I filmed it myself. Some might say it’s foolish to build a tool that could replace your own job. But let’s be honest: it’s also kind of cool. With style transfer, I could have my lecture delivered in the voice and energy of Eminem, Taylor Swift, or even Andrew Ng—whatever keeps students engaged. So... Will Professors Be Replaced? Maybe some. Maybe many. But not all. Universities will survive not just because of tradition, but because they serve a set of social, emotional, and cognitive functions that AI hasn’t cracked yet. Professors won’t disappear—they’ll evolve. They’ll become curators, mentors, experience designers, and yes, even AI co-pilots in the classroom. The real challenge is not about whether AI will take our jobs—it’s whether we’ll use it to enhance what we do best. What do you think? Will universities survive the AI wave? Will you want your kids to go to college—or just to ChatGPT 10.0? Let’s talk.
6 min read
authors:
Rohit Aggarwal

Article
In today's data-driven world, organizations face a critical challenge: they need realistic data for testing, development, and training AI models, but using real customer data raises serious privacy concerns and may violate regulations like HIPAA or FERPA. Enter synthetic data generation—a powerful solution that creates artificial data mimicking the statistical properties of real data while completely protecting privacy. This article explores a sophisticated synthetic data generation pipeline that transforms real data into artificial data that's statistically similar yet entirely synthetic. What sets this approach apart is its ability to not just replicate individual feature distributions, but to preserve the crucial relationships between features—the correlations and dependencies that make data truly useful. Let's dive into how this works. Why Synthetic Data Matters Before exploring the technical implementation, let's understand why synthetic data is valuable: Privacy compliance : Eliminates the risk of exposing sensitive customer information Development freedom : Enables teams to work with realistic data without security constraints Training AI models : Provides diverse, representative data for machine learning applications Testing edge cases : Allows creation of specific scenarios that might be rare in real data Relationship preservation : Maintains critical correlations and dependencies between variables that simple randomization methods cannot capture The last point is particularly crucial. Many synthetic data approaches can generate values that match individual feature distributions, but fail to maintain the relationships between features. For example, in a banking dataset, a simple approach might generate realistic account balances and realistic transaction frequencies independently, but miss the vital correlation between these variables. Our method specifically addresses this challenge. The Problem with Independent Feature Generation To understand why preserving relationships matters, consider a common e-commerce scenario: In real customer data, affluent customers not only spend more per purchase, but they also tend to buy a wider variety of products. This creates a natural correlation between: Average purchase amount Number of unique products purchased Customer income level If we were to generate synthetic data by creating each of these features independently—even if each feature's distribution perfectly matches the original data—we would lose these critical relationships. We might end up with unrealistic scenarios like low-income customers purchasing large numbers of luxury items or high-income customers only purchasing a single budget item. This problem compounds when building predictive models. A model trained on such independently-generated synthetic data would learn incorrect patterns and make faulty predictions when applied to real data. For instance, a recommendation engine might suggest luxury products to customers unlikely to purchase them or miss obvious cross-selling opportunities. Our approach using copula-based modeling specifically addresses this challenge by mathematically capturing and preserving the dependency structures between features, ensuring the synthetic data maintains these natural correlations. The Synthetic Data Generation Pipeline Our implementation creates a comprehensive pipeline that transforms real data into synthetic data through several sophisticated steps while preserving statistical properties and relationships. 
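As a tiny illustration of why independent generation fails (all numbers are synthetic and purely for demonstration), sampling two correlated features separately preserves each marginal but destroys their relationship:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: income and average purchase amount are positively correlated (invented numbers)
income = rng.normal(60_000, 15_000, 10_000)
purchase = 0.002 * income + rng.normal(0, 20, 10_000)
print(np.corrcoef(income, purchase)[0, 1])            # roughly 0.8: the relationship we want to keep

# Naive synthetic data: sample each feature independently from its own distribution
fake_income = rng.normal(income.mean(), income.std(), 10_000)
fake_purchase = rng.normal(purchase.mean(), purchase.std(), 10_000)
print(np.corrcoef(fake_income, fake_purchase)[0, 1])  # roughly 0: the relationship is gone
```

The pipeline described below is designed precisely to avoid this failure mode.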
Let's break down each component: 1. Preprocessing The first stage prepares the data through three main steps: Missing Data Handling Processes target variables first, addressing imbalanced classes for categorical targets or applying transformations to reduce skewness in continuous targets Imputes missing values using median for numerical features and mode/"Unknown" for categorical features Categorical Data Encoding Applies intelligent encoding based on cardinality (number of unique values): Binary encoding for features with 2 unique values One-hot encoding for features with ≤10 unique values Frequency encoding for high-cardinality features Identifies and transforms highly skewed numerical features using Box-Cox transformation Standardizing Features Scales numerical features to have zero mean and unit variance Preserves categorical features in their encoded form Stores all transformation parameters for later inverse transformation 2. Clustering using HDBSCAN Data often comes from mixed populations with different underlying patterns. HDBSCAN clustering helps identify these natural groupings: Uses Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) Advantages over traditional clustering algorithms: No need to specify the number of clusters in advance Finds clusters of varying densities and shapes Adaptively determines cluster count based on data density Handles small datasets by adjusting clustering parameters (min_cluster_size, min_samples) Assigns noise points to their nearest clusters Creates a 'cluster' column to track membership, falling back to a single cluster if needed Separates data by cluster for subsequent processing 3. For Each Cluster Processing each cluster separately allows the algorithm to better capture the unique characteristics of different data subgroups: a) Statistical Modeling This three-step process captures both individual feature distributions and their interrelationships: Fit Marginal Distribution for Each Feature Tests multiple distribution types (normal, lognormal, exponential, gamma) Selects best fit using AIC (Akaike Information Criterion) Stores distribution parameters for each feature Models each feature's unique pattern independently (e.g., ages might follow a normal distribution, while income follows a log-normal distribution) Transform to Uniform using CDF Applies Cumulative Distribution Function (CDF) of fitted distributions Transforms each feature to uniform [0,1] distribution Creates standardized representation necessary for copula modeling Fit Best Copula Model Across Features Tests different copula types (Gaussian, Student-t, Clayton) Selects best-fitting copula based on log-likelihood Captures dependency structure between features A copula is a mathematical function that connects different marginal distributions to form a joint distribution, preserving relationships between variables. For instance, if higher income correlates with more frequent purchases in your original data, copulas maintain this relationship in synthetic data. This is where the real magic happens in preserving feature relationships. While each feature's individual distribution is important, the connections between features often contain the most valuable information. 
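A rough sketch of the marginal-plus-copula idea, using scipy with a Gaussian copula on normal marginals; the real pipeline tests several marginal families and copula types, so treat this as an illustration of the mechanics rather than the actual implementation.

```python
import numpy as np
from scipy import stats

def gaussian_copula_synthesize(data: np.ndarray, n_samples: int) -> np.ndarray:
    """data: shape (n_rows, n_features). Returns synthetic rows with similar marginals and correlations."""
    d = data.shape[1]
    # 1. Fit a marginal distribution per feature (a plain normal here; the pipeline tries several and picks by AIC)
    marginals = [stats.norm(*stats.norm.fit(data[:, j])) for j in range(d)]
    # 2. Transform each feature to uniform [0, 1] via its fitted CDF, then to standard normal scores
    u = np.column_stack([marginals[j].cdf(data[:, j]) for j in range(d)])
    z = stats.norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))
    # 3. The Gaussian copula is captured by the correlation matrix of those normal scores
    corr = np.corrcoef(z, rowvar=False)
    # 4. Sample correlated normals, then map back through uniform to each feature's original distribution
    z_new = np.random.default_rng(0).multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)
    return np.column_stack([marginals[j].ppf(u_new[:, j]) for j in range(d)])
```

Because the correlated uniforms are generated jointly and only then mapped back through each feature's own marginal, relationships such as the ones listed next survive in the synthetic rows.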
For example: In financial data, transaction frequency may be correlated with account balance In healthcare data, age may be correlated with certain medical conditions In e-commerce data, purchase frequency may be correlated with customer lifetime value Copulas mathematically encode these dependencies, allowing us to generate synthetic data where these critical relationships remain intact. Without this step, you might have realistic-looking individual features but unrealistic combinations of values that would never occur in real data. b) Data Generation After modeling comes generation of the synthetic data: Draw Samples from Fitted Copula Generates correlated uniform [0,1] samples from the fitted copula model Maintains the dependency structure between features Inverse CDF to Transform Each Feature Back Applies inverse CDF (percent point function) using stored distribution parameters Transforms uniform values back to realistic data following original distributions Restores each feature's original statistical shape while preserving relationships Adds appropriate cluster labels to track membership 4. Combine Cluster Data Merges synthetic data from all clusters based on original cluster proportions Maintains the natural groupings and subpopulations present in the original data Preserves the overall data structure and cluster characteristics 5. Postprocessing This stage restores the data to its original format: Reverse Encoding of Categorical Features Converts encoded categorical features back to their original form: Binary encodings → original binary categories One-hot encodings → original categorical values Frequency encodings → original categorical values Reverse Standardization Applies inverse transformation to all standardized numerical features Restores original scale and data types Ensures the synthetic data matches the format of the original data Handles decimal formatting and type conversion 6. Validation The final step is thorough quality checking to ensure the synthetic data truly resembles the original: Validate Each Feature & Target Distribution Independently For numerical features: Applies Kolmogorov-Smirnov tests and compares statistical moments For categorical features: Performs chi-square tests and compares category frequencies Calculates metrics like maximum and average differences between distributions Validate Correlations Compares correlation matrices (Pearson, Spearman) Calculates Frobenius norm of difference matrices Ensures dependency structures are preserved This validation step is critical for our goal of relationship preservation. After all the transformations, we need to verify that the synthetic data maintains the same correlation patterns as the original. The process compares both linear (Pearson) and rank-based (Spearman) correlations, allowing us to detect if the relationship structures have been maintained across different types of dependencies. Validate Cluster Preservation Compares cluster proportions between original and synthetic data Evaluates if cluster characteristics are maintained Compiles all validation results into a comprehensive report with statistical measures Running the Script The script can be run in several ways: With Cursor/Windsurf/Cline Simply ask the model to run the script. 
Without Cursor/Windsurf/Cline Run the following in terminal: python -m pip install -r requirements.txt If you're having trouble, try upgrading pip: python -m pip install --upgrade pip For Windows users facing installation issues, follow the guide at: https://github.com/bycloudai/InstallVSBuildToolsWindows?tab=readme-ov-file Note: Sometimes on Windows machines, py works instead of python: py -m pip install -r requirements.txt Limitations and Considerations While this implementation is powerful, it has some limitations: It doesn't make special distinctions between different types of variables during the correlation modeling phase—it treats all variables (including transformed categorical ones) as continuous. This means it might not perfectly preserve some special relationships between categorical and continuous variables, or between categories that were originally part of the same variable. Conclusion The synthetic data generation pipeline described here offers a powerful solution for organizations needing realistic test data without privacy concerns. What sets it apart from simpler approaches is its sophisticated handling of feature relationships through copula modeling and cluster-aware generation. By carefully modeling both the individual distributions of features and their relationships, then generating new data that follows these patterns, we can create synthetic data that is: Statistically similar to real data Maintains important relationships between different pieces of information Preserves the overall structure and patterns of the original data Safe to use without worrying about privacy regulations Suitable for testing, development, and analysis purposes This relationship preservation is crucial for many real-world applications: AI model training : Models trained on synthetic data with preserved relationships will learn the same patterns present in real data Financial analysis : Synthetic financial data must maintain relationships between risk factors and outcomes Healthcare research : The correlations between patient characteristics and medical conditions must be preserved Market research : Customer behavior patterns and preferences need to maintain their interdependencies The attached code implements this entire pipeline, making it accessible for data scientists and developers who need high-quality synthetic data where relationships between features matter as much as the features themselves.
6 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications: Exercise 1: Simple RAG for 10-K filings Exercise 2: RAG with Reranker for 10-K filings Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action) Code with Explanation is posted here: Colab Notebook Link Exercise 5: RAG with Agentic Pattern: ReAct + Reflection Code with Explanation is posted here: Colab Notebook Link These exercises incrementally build on basic RAG with focus on “why” before “what” and “how". In the previous exercise, we explored how to break down a complex query into sub-queries, retrieve relevant chunks from a vector database for each sub-query, and generate answers based on those chunks. However, there are instances where the necessary knowledge to answer a user's question may not be available in our vector databases. In such cases, we need to equip our system with pre-built tools that can fetch information from external sources. Specifically, in the Colab notebook, we demonstrate how to retrieve LinkedIn handles of directors listed in SEC filings. To achieve this, we utilize a set of tools, as illustrated in the following diagram: Vector Search Tool – Provides access to the vector database for the LLM. Director Extraction Tool – Extracts director names from the previously stored last portion of SEC filings. Web Search Tool – Conducts Google searches for directors one at a time and retrieves their LinkedIn handles. For further details on the code implementation, please refer to the Colab notebook. However, before diving into the notebook, we strongly recommend reviewing the ReAct explanation provided below. ReAct prompting (Fundamental pattern for AI Agents) ReAct (Reasoning + Action) represents a groundbreaking framework that revolutionizes how large language models (LLMs) approach complex problem-solving. By combining reasoning capabilities with action-based decision making, ReAct enables models to tackle challenging tasks through a dynamic and iterative process. At its core, the framework establishes a continuous loop of three interconnected components: reasoning, action, and observation. The reasoning phase, often called the "Thought" stage, serves as the model's internal cognitive process. During this stage, the model analyzes the current situation, drawing upon multiple sources of information including the original task requirements, previous reasoning steps, past actions, and accumulated observations. This framework allows the model to break down complex goals into manageable subtasks, incorporate relevant background knowledge, and continuously evaluate progress toward the ultimate objective. The model can also use this phase to identify potential obstacles and develop contingency plans when faced with unexpected challenges. The action phase represents the bridge between thought and implementation. It determines which tool to employ based on the preceding thought process. The model examines its available tool descriptions and capabilities, matching them against the requirements identified in its last reasoning step. 
For example, if the thought process concludes that numerical data needs analysis, the model might select a calculator tool. If the reasoning indicates a need for external information, it might choose a search tool. Following each action, the observation phase captures the results and consequences of the actions. These observations serve as crucial feedback, providing new information that feeds into the next iteration of reasoning. For instance, if the model uses a search tool to gather information about a topic, the search results become observations that inform its subsequent thinking and decision-making process. It creates a feedback loop where each cycle of thought, action, and observation builds upon previous iterations. This allows the model to maintain and adjust its high-level strategy while incorporating new information and responding to changing circumstances. The framework's flexibility enables it to handle complex tasks that require multiple steps, logical reasoning, and interaction with various external tools and information sources. Here is a typical prompt that is used in LangChain framework to implement ReAct: Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer. You have access to the following tools: {tools} Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [{tool_names}] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin! Question: {input} Thought:{agent_scratchpad} ReAct implementation: LangChain link LLamaIndex link The prompt starts by defining the task scope and available tools. {tools} is a placeholder that gets populated with descriptions of tools the agent can use, like search, calculators, or data analysis tools. The format section establishes the strict protocol the agent must follow: Question: {input} {input} gets replaced with the actual user question This sets up the goal the agent needs to achieve ReAct Components: "Thought:" - Where the agent reasons about what it needs to do "Action:" - Limited to the tools listed in {tool_names} "Action Input:" - The specific input for the chosen tool "Observation:" - Where results from tool usage appear The ... can repeat N times indicates this is an iterative process - the agent can go through multiple cycles of Thought/Action/Observation until it has enough information. Conclusion Format: A final "Thought:" declaring the agent has sufficient information "Final Answer:" providing the response to the original question { agent_scratchpad} The { agent_scratchpad} at the end is particularly important - it acts as a dynamic working memory space for the LLM agent and gets populated with the ongoing history of all previous Thought/Action/Observation cycles during execution. Think of it like a digital notepad where the agent records its step-by-step problem-solving process. 
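For orientation, the LangChain implementation linked above wires a prompt like this into an agent loop roughly as sketched below. The model choice, the hub prompt id, and the two stub tools are assumptions for illustration, not the exercise's exact code, and the tool set is abbreviated.

```python
# Minimal sketch of wiring a ReAct prompt into LangChain's agent loop.
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import Tool
from langchain_openai import ChatOpenAI

def my_web_search(q: str) -> str:        # assumed helper; swap in a real search API
    return f"(top web results for: {q})"

def my_vector_search(q: str) -> str:     # assumed helper; swap in the vector DB retriever
    return f"(reranked chunks for: {q})"

tools = [
    Tool(name="WebSearch", func=my_web_search,
         description="Performs a web search on the query."),
    Tool(name="Vector Reranker Search", func=my_vector_search,
         description="Retrieves reranked chunks from the vector DB."),
]

prompt = hub.pull("hwchase17/react")     # standard ReAct prompt with {agent_scratchpad}
llm = ChatOpenAI(model="gpt-4o", temperature=0)   # assumed model choice
agent = create_react_agent(llm, tools, prompt)

# AgentExecutor runs the Thought/Action/Observation loop, appending each cycle
# to agent_scratchpad until the model emits a Final Answer.
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({"input": "Who are the directors of Tesla?"})
```

The executor stops once the parser sees a Final Answer; LangChain's executor also supports an iteration cap to prevent runaway loops.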
The scratchpad typically contains: Previous thoughts the agent has had including any intermediate conclusions Actions it has taken Observations it has received This allows the agent to: Reference previous findings Build upon earlier observations Maintain continuity in its reasoning process Track what approaches have already been tried Technical Details It will be worth going over the example covered in the original paper briefly and further understand how ReAct is different from Chain of Thought (CoT) prompting. Source: Paper link The use of few-shot exemplars significantly enhances the efficacy of ReAct. In the original paper, the authors provided the language model with a small number of human-annotated examples that showcase the desired reasoning process and action sequence. These exemplars serve as a template for the model to follow when addressing new, unseen instances of the task. The exemplars in ReAct typically consist of a series of thought-action-observation steps: Thoughts: The exemplars include explicit reasoning steps that guide the model's decision-making process. These thoughts help break down the task into smaller sub-goals, provide relevant context or common sense knowledge, and offer guidance on the next action to take. Actions: The exemplars demonstrate the specific actions the model should take to progress towards solving the task. These actions can include information retrieval (e.g., searching a knowledge base), navigation (e.g., clicking on a specific link), or providing a final answer. Observations: After each action, the exemplars include the corresponding observation or result from the environment. These observations provide the model with the necessary context to inform its subsequent reasoning and actions. By studying these few-shot exemplars, the language model learns to internalize the reasoning process and action sequence required to complete the task successfully. The model can then apply this learned pattern to new, unseen instances of the task, even with limited or no additional training. The ReAct paper demonstrates the effectiveness of this few-shot approach across various domains, including question answering (HotpotQA), fact verification (Fever), and interactive problem-solving (ALFWorld and WebShop). In each case, the model is provided with just a handful of annotated exemplars (ranging from 2 to 6) and achieves competitive performance compared to baseline methods that rely on extensive fine-tuning or reinforcement learning. How is ReAct different from CoT? Chain-of-Thought prompting encourages models to break down complex problems into smaller, logical steps before reaching a conclusion. While this approach improves accuracy for many tasks, it operates within the confined space of the model's existing knowledge. ReAct fundamentally extends the CoT paradigm by introducing dynamic interaction with the external world. While CoT might reason "To find the population of Tokyo, I need to recall the most recent census data," ReAct can actually execute this step by searching current databases. This ability to ground reasoning in real-world data and tools addresses several key limitations of CoT: Knowledge Freshness: While CoT relies on the model's training data, ReAct can access current information through external tools. Verification Capability: CoT's conclusions are based solely on internal reasoning, but ReAct can verify its assumptions against external sources. 
Computational Accuracy: Rather than relying on the model's ability to perform calculations mentally (as in CoT), ReAct can utilize specialized tools for precise computations. Adaptive Problem-Solving: ReAct can adjust its approach based on intermediate results, while CoT follows a more linear reasoning path. For example, in solving a math problem, CoT might think through each step mentally, while ReAct could combine reasoning with actual calculator usage, reducing computational errors while maintaining logical clarity. This integration of external tools with reasoning creates a more robust and reliable problem-solving system. Applications and Implications Question Answering: ReAct prompting can be used to improve question-answering systems by allowing the model to generate verbal reasoning traces and perform task-specific actions, leading to more accurate and context-aware responses Deciding appropriate tools for a sub-task: ReAct prompting can be tailored for wide variety of tasks where the LLM needs to perform actions, such as retrieving specific data, performing computations, or even interacting with software interfaces through APIs. Challenges and Considerations ReAct can easily derail from the main task and pursue self-created tasks not aligned with the original goal. ReAct tends to use external tools more often when it can use LLM's knowledge to answer things. Implementing ReAct prompting may require a significant number of prompts, leading to increased costs and potential delays in obtaining the final answer. Complexity in Implementation: Implementing ReAct prompting requires a more complex setup than traditional prompting methods. It involves configuring the LLM to interact with external tools and ensuring secure and efficient communication between the model and these tools. Process 1. User query A user asks a question: Who are the directors of Tesla? What are their LinkedIn handles? What are the financial goals of Tesla this year? What is the next auto show that Tesla will participate in? 2. LLM call to figure out the next step Langchain's AgentExecutor first generates prompt by filling in the values for placeholder variables {Tools}, {Question}, and {Scratchpad} in the prompt. The prompt becomes like this: Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer. You have access to the following tools: Company Directors Information - Retrieve the names of company directors for a chosen company. Optionally, their LinkedIn handles can also be included. Use the format: company_name, true/false. Available companies: Tesla, General Motors WebSearch - Performs a web search on the query. Vector Reranker Search - Retrieves information from an embedding based vector DB containing financial data and company information. Structure query as a sentence Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [Company Directors Information, WebSearch, Vector Reranker Search] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the question. Follow these steps: Begin! Question: Who are the directors of Tesla. What are their linkedin handles? 
What are the financial goals of tesla this year. What is the next auto show that Tesla will participate in. Thought: After generating the prompt it sends the generated prompt to LLM, parses its response, and add the response to scratchpad in the following format: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Here: The Thought explains the reasoning and identifies the needed information. The Action specifies the name of the external tool that can help with the needed information in Thought . The Action Input tells the system what specific data is needed by this external tool. 3. Function calling AgentExecutor executes the action using the tool suggested by LLM in its last response. It retrieves the response generated by the tool, which is Company Directors Information tool and adds the response as Observation to the scratchpad. Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought: NOTE: The ellipses ("...") in this document indicate the information has been redacted to conserve space and it can be seen in Langsmith's interface 4. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response (Thought, Action and Action input) to scratchpad: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? 5. Function calling AgentExecutor executes the action using the tool suggested by LLM in its last response. It retrieves the response generated by the tool, which is Vector Reranker Search tool and adds the response as Observation to the scratchpad. The response in this case consists of retrieved reranked chunks from vector database for the query What are the financial goals of Tesla for this year? 
Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? Observation: required to spend RMB 14.08 billion in capital expenditures by the end of 2023, which has been achieved in 2023, and to generate RMB 2.23 billion of annual tax revenues … audited the Company's internal control over financial Thought: 6. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response (Thought, Action and Action input) to scratchpad: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? Observation: required to spend RMB 14.08 billion in capital expenditures by the end of 2023, which has been achieved in 2023, and to generate RMB 2.23 billion of annual tax revenues … audited the Company's internal control over financial Thought:I have gathered information on Tesla's financial goals for this year, which include capital expenditures and tax revenue generation. Now, I will search for information on the next auto show that Tesla will participate in. Action: WebSearch Action Input: Next auto show Tesla will participate in 2023 The LLM determines that the newly obtained information from Vector Reranker Search aligns with its financial goals and identifies the remaining need as finding details about the next auto show using the Websearch tool. 7. Function calling AgentExecutor executes the action using the tool suggested by LLM in its last response. It retrieves the response generated by the tool, which is WebSearch tool and adds the response as Observation to the scratchpad. The response in this case consists of top 3 Google search results for the query: Next auto show Tesla will participate in 2023 … Thought:I have gathered information on Tesla's financial goals for this year, which include capital expenditures and tax revenue generation. Now, I will search for information on the next auto show that Tesla will participate in. 
Action: WebSearch Action Input: Next auto show Tesla will participate in 2023 Observation: Title: Tesla to Join 2023 Detroit Auto Show Snippet: Tesla will join the 2023 Detroit Auto Show that will take place September 13-24. The company will showcase its vehicles at the Powering ... Link: https://driveteslacanada.ca/news/tesla-to-join-2023-detroit-auto-show/ … Link: https://detroitautoshow.com/ Thought: 8. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response. But this time, the parse response is different. Instead of generating Action and Action-input, it generates the Final answer, since the LLM observes that it has all the information it needed to answer user's question. Thought: I have gathered the necessary information to answer the question. Final Answer: 1. Directors of Tesla and their LinkedIn handles: - Elon Musk (LinkedIn: [Elon Musk](https://www.linkedin.com/in/elon-musk-a93a0b221)) - Robyn Denholm (LinkedIn: [Robyn Denholm](https://au.linkedin.com/in/robyn-denholm-a807795)) - Ira Ehrenpreis (LinkedIn: [Ira Ehrenpreis](https://www.linkedin.com/in/iraehrenpreis)) - Joseph Gebbia (LinkedIn: [Joseph Gebbia](https://www.linkedin.com/in/john-coppola-jr-4b2ab395)) - James Murdoch (LinkedIn: [James Murdoch](https://www.linkedin.com/in/jamesrmurdoch)) - Kimbal Musk (LinkedIn: [Kimbal Musk](https://www.linkedin.com/in/kimbal)) - JB Straubel (LinkedIn: [JB Straubel](https://www.linkedin.com/in/jb-straubel-b694981)) - Kathleen Wilson-Thompson (LinkedIn: [Kathleen Wilson-Thompson](https://www.linkedin.com/in/kathleen-wilson-thompson-275654201)) 2. Financial goals of Tesla for this year include capital expenditures of RMB 14.08 billion and generating RMB 2.23 billion in annual tax revenues by the end of 2023. Tesla aims to improve vehicle performance, decrease production costs, and increase affordability and customer awareness. 3. The next auto show Tesla will participate in is the 2023 Detroit Auto Show, which takes place from September 13-24. Reflexion Reflexion is a framework that enables language models (LLMs) to learn and improve their performance on various tasks through self-reflection and iterative optimization. The key idea behind Reflexion is to convert binary or scalar feedback from the environment into informative verbal feedback, which the LLM agent then uses to update its context and improve its performance in subsequent attempts. The Reflexion framework consists of four main components: Actor: An LLM that generates text and actions based on the current state observations and its memory. The Actor samples actions from its current policy and receives observations from the environment. Various models, such as Chain of Thought and ReAct, can be used as the Actor. Evaluator: A component that assesses the quality of the generated outputs produced by the Actor. The Evaluator takes a generated trajectory as input and computes a reward score reflecting the Actor's performance on the given task. The Evaluator can incorporate both internal and external assessment mechanisms. Internal evaluation can be using self-reflection or confidence signals like log probabilities and entropy measures that assess output quality without external reference points. 
External evaluation involves independent validation through unit tests, searching the web for relevant information to fact-check, or using LLM-as-a-judge approaches that provide assessment based on predefined criteria. Self-Reflection: An LLM that generates verbal self-reflections to provide feedback for future trials. Given the current trajectory, evaluation and the agent's persistent memory, the Self-Reflection model generates specific and informative feedback. This feedback is stored in the agent's memory for future reference. Memory: The memory component in Reflexion consists of short-term memory (trajectory history) and long-term memory (outputs from the Self-Reflection model). These memory components provide context that is both specific and influenced by lessons learned over multiple trials, giving Reflexion agents an advantage over other LLM action-choice methods. The Actor uses the updated memory to inform its decisions in the next trial. Source: Paper link Reflection implementation: LlamaIndex link This iterative process of trial, evaluation, self-reflection, and memory persistence allows the agent to rapidly improve its performance on various tasks. This approach draws inspiration from human cognitive processes, particularly the distinction between "System 1" and "System 2" thinking patterns first popularized by psychologists. System 1 represents quick, instinctive reactions, while System 2 embodies slower, more deliberate analysis. While this additional computational step may increase response time, it often proves valuable for complex tasks where accuracy and thoroughness matter more than speed. Reflection: Reflection is a simpler version of Reflexion where there is no separate evaluation before self-critique. The reflector simply provides feedback based on its own assessment, which can be helpful but might not address specific shortcomings or gaps. At its core, reflection serves as a prompting strategy that enables AI systems to evaluate and refine their responses through structured self-criticism. Multi-Agent Collaboration: The Reflection pattern can be implemented in a multi-agent framework, where one agent is responsible for generating outputs and another agent provides constructive criticism. This back-and-forth between the agents can lead to increasingly refined and improved responses. Sample implementation: "Review your previous answer and find problems with your answer" "Based on the problems you found, improve your answer." Source: link Process 1. User query A user asks a question: Who are the directors of Tesla? What are their LinkedIn handles? What are the financial goals of Tesla this year? What is the next auto show that Tesla will participate in? 2. LLM call to figure out the next step + Function calling + … Langchain's AgentExecutor first generates prompt by filling in the values for placeholder variables {Tools}, {Question}, and {Scratchpad} in the prompt.Then it sends the prompt to LLM, parses the output, calls the tool as instructed by the LLM, receives the tool's output, updates the output as observation in scratchpad, and then sends the prompt back to LLM. This process continues for a bit, till it gets to the following point. Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer. You have access to the following tools: Company Directors Information - Retrieve the names of company directors for a chosen company. 
Optionally, their LinkedIn handles can also be included. Use the format: company_name, true/false. Available companies: Tesla, General Motors WebSearch - Performs a web search on the query. Vector Reranker Search - Retrieves information from an embedding based vector DB containing financial data and company information. Structure query as a sentence Director's previous work and education - Retrieves director's education and work experience using their LinkedIn URL. Use the format: url Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [Company Directors Information, WebSearch, Vector Reranker Search, Director's previous work and education] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the question. Follow these steps: Begin! Question: Make an analysis on the independence of backgrounds of directors at tesla Thought:To analyze the independence of backgrounds of directors at Tesla, I need to gather information about the directors, including their names, previous work experience, and education. This will help determine the diversity and independence of their backgrounds. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … Thought: To analyze the independence of backgrounds of Tesla's directors, I need to gather information about their previous work experience and education. This will help assess the diversity and independence of their backgrounds. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/elon-musk-a93a0b221 Observation: Professional Experience: Education: Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the other directors as well. Action: Director's previous work and education Action Input: https://au.linkedin.com/in/robyn-denholm-a807795 Observation: Professional Experience: Tech Council of Australia, Tesla, Blackbird, Telstra, Telstra Education: Australian Institute of Company Directors, UNSW, UNSW Business School Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/iraehrenpreis Observation: Professional Experience: Menlo Micro, DBL Partners, Tesla Motors, Tesla Motors, Tesla Motors Education: Stanford University Graduate School of Business, Stanford Law School, UCLA Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/john-coppola-jr-4b2ab395 Observation: Professional Experience: JMC Detailing LLC, Koons Chevrolet, Koons Tyson Jeep, Chili's Education: Radford University - College of Business and Economics, George C. 
Marshall High School, marshall high school Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/jamesrmurdoch Observation: Professional Experience: Lupa Systems, 21st Century Fox, 21st Century Fox, News America Incorporated, News America Incorporated Education: Harvard University Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/kimbal Observation: Professional Experience: The Kitchen Community, The Kitchen, Tesla Motors, SpaceX, Me.dium Education: The French Culinary Institute, Queen's University Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/jb-straubel-b694981 Observation: Professional Experience: Redwood Materials, QuantumScape, Straubel Foundation, Tesla, Volacom Education: Stanford University, Stanford University, EDGEWOOD HIGH SCHOOL Thought: To complete the analysis on the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the last remaining director, Kathleen Wilson-Thompson. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/kathleen-wilson-thompson-275654201 Observation: Professional Experience: McKesson, Self-employed, Tesla, Tesla, Walgreens Boots Alliance Education: University of Michigan, Wayne State University, Wayne State University Law School Thought: 3. LLM call to figure out the next step … Observation: Professional Experience: McKesson, Self-employed, Tesla, Tesla, Walgreens Boots Alliance Education: University of Michigan, Wayne State University, Wayne State University Law School Thought: To analyze the independence of backgrounds of Tesla's directors, I have gathered information about their previous work experience and education. Here's a summary of the findings: 1. **Elon Musk**: Information on his professional experience and education was not retrieved, but he is widely known as the CEO of Tesla, SpaceX, and other ventures. 2. **Robyn Denholm**: Has experience in telecommunications and technology sectors, having worked at Telstra and the Tech Council of Australia. She holds degrees from UNSW and the Australian Institute of Company Directors. 3. **Ira Ehrenpreis**: Has a background in venture capital and law, with experience at DBL Partners and education from Stanford University and UCLA. 4. **Joseph Gebbia**: His professional experience includes roles in the automotive and hospitality industries, with education from Radford University. 5. **James Murdoch**: Comes from a media and entertainment background, having worked at 21st Century Fox and News America Incorporated. He is a Harvard University graduate. 6. **Kimbal Musk**: Has experience in the culinary and technology sectors, with roles at The Kitchen and SpaceX. He studied at The French Culinary Institute and Queen's University. 7. 
**JB Straubel**: Has a strong background in technology and sustainability, with experience at Tesla and Redwood Materials. He studied at Stanford University. 8. **Kathleen Wilson-Thompson**: Has experience in healthcare and retail, having worked at Walgreens Boots Alliance and McKesson. She holds degrees from the University of Michigan and Wayne State University Law School. The directors of Tesla have diverse backgrounds spanning technology, telecommunications, media, culinary arts, venture capital, and healthcare. This diversity suggests a level of independence in their backgrounds, as they bring varied experiences and perspectives to the board. Final Answer: Tesla's board of directors has diverse and independent backgrounds, with experience in technology, telecommunications, media, culinary arts, venture capital, and healthcare, contributing varied perspectives to the company. 4. Prompt for Reflection Your task is to analyze whether the `Answer` is missing some information related to the `Question`. Give feedback on the missing requirments of the answer. Mention only the essential information. Here is the previous interaction: Question: Make an analysis on the independence of backgrounds of directors at tesla Answer: Tesla's board of directors has diverse and independent backgrounds, with experience in technology, telecommunications, media, culinary arts, venture capital, and healthcare, contributing varied perspectives to the company. Reflection: Provide brief, concise thoughts on what additional information needs to be collected in the next iteration. Based on your reflection, conclude with one of the following actions: If the current Answer provides sufficient information for Original Input, state "STOP". If further refinement is needed, provide 2-3 brief thoughts for improvement, each on a new line, and end with "CONTINUE". Output 5. Reflection assessment … Output The answer provides a general overview of the backgrounds of Tesla's directors, highlighting their diverse experiences and educational backgrounds. However, it lacks specific analysis on the independence of these backgrounds in terms of decision-making and influence within the board. - Include an analysis of how the diverse backgrounds contribute to independent decision-making and governance at Tesla. - Discuss any potential conflicts of interest or affiliations that might affect the independence of the directors. - Provide examples or evidence of how the directors' independent backgrounds have influenced Tesla's strategic decisions or policies. CONTINUE 6. Prompt for thinking on Reflection Improve `Previous Answer` based on `Reflections`. Dont look for information already present in `Previous Answer`. Formulate a new Final Answer. Reflections: The answer provides a general overview of the diversity and independence of Tesla's directors' backgrounds. However, it lacks specific details and analysis. - Include specific examples of directors and their backgrounds to illustrate the diversity and independence. - Analyze how these diverse backgrounds impact decision-making and governance at Tesla. - Discuss any potential challenges or benefits that arise from having such a diverse board. CONTINUE 7. Response to Reflection Previous Answer: Tesla's directors have diverse and independent backgrounds, with experiences spanning technology, automotive, media, culinary, and healthcare industries, and education from prestigious institutions. This diversity can contribute to a broad range of perspectives and expertise in the boardroom. 
The directors of Tesla have diverse and independent backgrounds, which contribute to a broad range of experiences and perspectives within the board. Here's a summary of their backgrounds: 1. **Elon Musk**: Holds bachelor's degrees in physics and economics and attended graduate school in physics at Stanford University. He has a strong background in technology and entrepreneurship, having founded companies like SpaceX and Tesla. 2. **Robyn Denholm**: Has experience in telecommunications and technology, having worked at Telstra and Blackbird. She studied at the Australian Institute of Company Directors and UNSW. 3. **Ira Ehrenpreis**: Has a background in venture capital and law, with education from Stanford University Graduate School of Business and Stanford Law School. 4. **Joseph Gebbia**: His professional experience includes roles in the automotive and hospitality industries, with education from Radford University. 5. **James Murdoch**: Comes from a media and entertainment background, having worked at 21st Century Fox and News America Incorporated. He studied at Harvard University. 6. **Kimbal Musk**: Has experience in the culinary and technology sectors, with education from The French Culinary Institute and Queen's University. 7. **JB Straubel**: Has a strong background in engineering and technology, having worked at Tesla and founded Redwood Materials. He studied at Stanford University. 8. **Kathleen Wilson-Thompson**: Has experience in healthcare and law, having worked at Walgreens Boots Alliance and McKesson. She studied at the University of Michigan and Wayne State University Law School. Overall, the directors bring a mix of expertise from technology, media, law, healthcare, and entrepreneurship, which supports a diverse and independent board at Tesla.
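Stepping back from the trace above, the Reflection loop in this exercise boils down to a generate, critique, refine cycle. Here is a minimal, framework-agnostic sketch of that cycle; the llm helper, the prompt strings, and the iteration cap are illustrative assumptions, not the notebook's actual code.

```python
# Minimal sketch of the Reflection pattern: answer, critique, refine, repeat.
# `llm(prompt)` is an assumed helper that returns the model's text response.

REFLECT_TEMPLATE = (
    "Analyze whether the Answer is missing information related to the Question.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "List 2-3 brief improvement thoughts and end with CONTINUE, "
    "or state STOP if the answer is sufficient."
)

def answer_with_reflection(question, llm, max_iters=3):
    answer = llm(question)  # initial attempt (e.g., produced by the ReAct agent)
    for _ in range(max_iters):
        reflection = llm(REFLECT_TEMPLATE.format(question=question, answer=answer))
        if "STOP" in reflection.upper():
            break  # the reflector judges the answer sufficient
        # Feed the critique back so the next attempt addresses the identified gaps
        answer = llm(
            "Improve the Previous Answer based on the Reflections.\n"
            f"Reflections: {reflection}\nPrevious Answer: {answer}\n"
            f"Question: {question}"
        )
    return answer
```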
20 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications: Exercise 1: Simple RAG for 10-K filings Exercise 2: RAG with Reranker for 10-K filings Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse Code with Explanation is posted here: Colab Notebook Link Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action) Exercise 5: RAG with Agentic Pattern: ReAct + Reflection These exercises incrementally build on basic RAG with focus on “why” before “what” and “how”. This exercise, the third in the series, focuses on illustrating how complex queries can be decomposed into simpler sub-queries to improve the quality of the response generated by the RAG system. It extends the previous exercise, which added a Reranker, by introducing query decomposition. We encourage readers to go through Reranking Retrieved Chunks using Reranker (Cross-Encoder model) before going through the code. Query Decomposition When users interact with RAG systems, they often pose complex questions that encompass multiple aspects or require information from different areas of the knowledge base. Consider a query like "How do Tesla and GM's approaches to manufacturing and production compare, particularly for electric vehicles? Where are their vehicles produced?" This question combines several distinct informational needs: manufacturing methodologies, EV-specific production approaches, and factory locations for two different companies. Direct vector similarity search with such compound queries can be suboptimal, as the embedding may not effectively capture all query dimensions simultaneously. Query decomposition addresses this challenge by leveraging a Large Language Model (LLM) to break down complex queries into simpler, more focused sub-queries that can be processed independently before being synthesized into a comprehensive response. The decomposition process typically starts by prompting the LLM to analyze the user's question and identify its core components. A sample prompt, shown below, guides the LLM to generate a set of atomic sub-queries that collectively cover all aspects of the original question. For the automotive manufacturing comparison, the LLM might generate targeted sub-queries like: What is Tesla's approach to EV manufacturing and production? What is GM's approach to EV manufacturing and production? Where are Tesla's manufacturing facilities located? Where are GM's manufacturing facilities located? This approach enables more precise matching with relevant chunks in the vector database, as each sub-query can be vectorized to capture specific semantic aspects more accurately. The retrieved chunks for each sub-query are then combined and reranked to provide a complete context for the language model to generate a coherent response that compares and contrasts both companies' manufacturing strategies and facility locations. The use of LLMs for query decomposition offers several advantages over rule-based or keyword-based approaches. LLMs can understand implicit relationships within questions, identify logical dependencies between different query components, and generate sub-queries that maintain the original intent while being optimized for retrieval.
Sample prompt for query decomposition: You are an expert at converting user questions into specific database queries for similarity search. Break down the `user-question` into distinct sub-queries that address different aspects of the original question. Ensure that the set of sub-queries comprehensively covers the main aspects of the original question. user-question: ```<paste user-question here>```
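One way to use a prompt like this in code is sketched below. The llm and retrieve helpers and the JSON output convention are assumptions for illustration; the exercise's notebook may structure this differently.

```python
# Minimal sketch of query decomposition followed by per-sub-query retrieval.
# `llm(prompt)` and `retrieve(query, k)` are assumed helpers; chunks are assumed
# to be plain strings so they can be deduplicated.
import json

DECOMPOSE_PROMPT = (
    "You are an expert at converting user questions into specific database "
    "queries for similarity search. Break down the user-question into distinct "
    "sub-queries covering all its aspects. Return a JSON list of strings.\n"
    "user-question: ```{question}```"
)

def decomposed_retrieval(question, llm, retrieve, k=5):
    # Assumes the LLM follows the instruction and returns valid JSON
    sub_queries = json.loads(llm(DECOMPOSE_PROMPT.format(question=question)))
    chunks = []
    for sq in sub_queries:
        chunks.extend(retrieve(sq, k))        # vector search per sub-query
    # Deduplicate while preserving order before reranking
    unique_chunks = list(dict.fromkeys(chunks))
    return sub_queries, unique_chunks
```

The combined chunks would then go through the reranker from the previous exercise before being passed to the LLM, matching the flow described above.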
2 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications: Exercise 1: Simple RAG for 10-K filings Exercise 2: RAG with Reranker for 10-K filings Code with Explanation is posted here: Colab Notebook Link Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action) Exercise 5: RAG with Agentic Pattern: ReAct + Reflection These exercises incrementally build on basic RAG with focus on “why” before “what” and “how". This exercise, the second in the series, focuses on illustrating how Reranking makes a difference in the quality of response generated by RAG system. This exercise extends the last exercise by adding Reranker. We encourage readers to go through Reranking Retrieved Chunks using Reranker (Cross-Encoder model) before going through the code. Reranking Retrieved Chunks using Reranker (Cross-Encoder model) While vector similarity search provides a good initial set of relevant chunks, it can sometimes miss nuanced semantic relationships or return chunks that are only superficially similar to the query. Consider a user asking "List out major changes that occurred in Tesla in 2023." A vector search might rank chunks discussing changes from 2022 higher than a more relevant chunk about a Director selling common stock in 2023, simply because the 2022 chunks share more semantic similarities around the concept of "changes" and "Tesla." This highlights a limitation of pure vector similarity matching. This is where rerankers come into play, serving as a crucial refinement layer in the RAG pipeline. A reranker takes the initial set of retrieved chunks from the vector database and performs a more sophisticated, computationally intensive analysis to improve the ranking quality. The reranking process often employs cross-encoders, which are transformer models that simultaneously process both the query and a candidate chunk to produce a relevance score. This approach captures more subtle semantic relationships and contextual nuances. It can correctly identify that the Director's stock sale in 2023 is more relevant to the query than changes from 2022, despite fewer surface-level semantic similarities. A natural question arises: why not use these more sophisticated reranker models for the initial retrieval instead of vector search? The answer lies in computational efficiency. Using a reranker as the primary retrieval mechanism would require passing each query through the model alongside millions of individual chunks in the vector database, computing similarity scores one at a time. This process would be prohibitively expensive and slow, especially for large-scale applications that need to maintain responsive query times. This is why modern RAG systems typically employ a two-stage retrieval process that combines the best of both approaches. They first use rapid vector similarity search to quickly identify a promising set of candidates (e.g., top 100 chunks), then apply the more sophisticated reranker to this smaller set to determine the final top-k chunks (e.g., top 5-10) that will be provided as context to the language model. 
This hybrid approach balances computational efficiency with retrieval quality, ensuring that the system provides accurate and relevant responses while maintaining reasonable response times. Technical Details Reranking model The key distinction between Embedding models (typically bi-encoders) and Reranking models (typically cross-encoders) lies in how they process queries and chunks. Bi-encoders process each text independently - the query and chunk are fed through the model separately to generate their respective embeddings. These embeddings can then be compared using similarity metrics like cosine similarity. This approach allows for efficient retrieval since chunk embeddings can be pre-computed and indexed, but it limits the model's ability to capture complex interactions between the query and document. Cross-encoders take a fundamentally different approach by processing the query and chunk together as a single input. By concatenating the query and chunk with a separator token, the model can leverage its attention mechanisms to directly compare and contrast every term in the query with every term in the chunk. This enables the model to capture nuanced relevance patterns and contextual relationships that might be missed when processing texts independently. For example, if a query asks about "Tesla's competitors' plant locations," a cross-encoder can directly attend to chunks mentioning locations of Tesla's competitors (e.g., Ford's plant in Michigan, Rivian's facility in Illinois) while downweighting or deprioritizing chunks that primarily describe Tesla's plant locations. This results in a reranked list where chunks about competitors' locations are moved higher than those focusing on Tesla itself, better aligning with the user's intent. When it comes to model training objectives, embedding models and cross-encoder models serve different purposes despite often starting from the same base architectures like BERT. Embedding models are fine-tuned specifically to generate high-quality sentence or paragraph level embeddings that capture semantic meaning in a fixed-dimensional vector space. The training process typically involves contrastive learning objectives that push similar texts closer together and dissimilar texts further apart in the embedding space. Cross-encoder models, on the other hand, are fine-tuned to directly predict a relevance score given a query-document pair. Rather than generating embeddings, the model learns to output a single similarity score that indicates how well the document answers the query. This direct optimization for the ranking task typically leads to better ranking performance, though at the cost of computational efficiency since pairs must be processed together.
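For reference, the two-stage retrieve-then-rerank step can be sketched with the sentence-transformers CrossEncoder class roughly as follows. The model name and the vector_search helper are assumptions, not the exercise's exact code.

```python
# Minimal sketch of two-stage retrieval: fast vector search, then cross-encoder reranking.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model choice

def retrieve_and_rerank(query, vector_search, top_n=100, top_k=5):
    # Stage 1: cheap bi-encoder similarity search over the whole index
    candidates = vector_search(query, top_n)   # assumed helper returning chunk texts
    # Stage 2: score each (query, chunk) pair jointly with the cross-encoder
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]   # context passed to the LLM
```

Keeping top_n around 100 and top_k around 5-10 mirrors the numbers mentioned above: the cross-encoder only ever scores the shortlist, which is what keeps response times reasonable.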
3 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
'AI Won't Replace Humans, But Humans Who Know AI Will.' While the above idea is widely accepted, it is unclear what is the meaning of "Who Know AI". Through in-depth interviews with AI experts, we identified two complementary frameworks that directly counter four common misconceptions we uncovered in a previous article, Why 30% of AI Projects Fail: 4 Common AI Misconceptions among Executives These two frameworks are: The SPADE framework - which stands for Sequence Decomposition & Experimentation, Prompt Design Strategy, AI Understanding, Data Strategy, and Evaluations - encompasses the technical capabilities needed to effectively guide AI implementations, from breaking down complex processes to designing robust evaluation systems. The CATE framework - representing Critical Thinking, Adaptability & Learning, Troubleshooting & Analysis, and Ethical Awareness - addresses the equally crucial soft skills that enable AI conductors to bridge the gap between technical possibilities and business value, manage stakeholder expectations, and ensure ethical implementation. Together, these frameworks provide a comprehensive blueprint for building teams capable of driving successful AI transformations. SPADE: Technical Skills Sequence Decomposition & Experimentation AI conductors must be able to deconstruct complex human-dependent cognitive and operational processes into discrete, manageable components that can be completed by a combination of GenAI and other approaches such as coding, traditional machine learning, and third party services. This decomposition allows organizations to identify where AI can add the most value while maintaining human oversight where needed. However, identifying the right components is just the beginning—AI conductors must systematically experiment with different approaches to optimize outcomes. Consider what appears to be a straightforward task: selecting an image for a corporate blog post. An AI conductor might break this down into several potential execution sequences. One approach could follow a content-first path: using AI to extract key themes and messages from the post, applying these insights to filter a stock image database, and then employing AI-assisted selection for the final choice. An alternative sequence might prioritize emotional resonance: first analyzing the post's intended emotional impact, then using these emotional markers to guide image selection, and finally validating the choice against the post's content themes. AI conductors must coordinate with AI Engineers to test these various approaches, measuring their effectiveness against both technical metrics and business objectives. This requires carefully weighing factors such as implementation costs, processing time, accuracy rates, and—crucially—alignment with human expectations and organizational standards. Prompt Design Strategy AI conductors developing prompt design strategies need to focus on three core components: reasoning, instruction, and context management. For reasoning, AI conductors must understand that while modern AI models demonstrate impressive capabilities in developing plans and breaking down complex tasks, they have specific limitations when handling tasks requiring significant human judgment. For complex scenarios, AI conductors may need to guide the AI model to first generate its thought process, evaluate its generated thoughts, and reiterate before generating the final response. 
They must also consider when to incorporate external guidance for handling tacit knowledge and organization-specific information that may not be captured in the model's training data. This external guidance can be approached in two ways: through explicit rules articulated by domain experts who understand the nuances of organizational practices, or through patterns discovered by applying traditional machine learning techniques to historical organizational data. When determining instructions, AI conductors must consider several key factors: the interdependence between tasks, reasoning load on model for each component, and data dependencies. Such considerations help them decide whether to combine multiple tasks in one prompt or to split instructions across multiple prompts. AI conductors also need to carefully calibrate context as per business objective—excessive contextual information can overwhelm the model and lead to suboptimal results, while insufficient context can produce incomplete or misaligned responses. The key is finding the right balance that aligns with both technical capabilities and business requirements. AI Understanding AI conductors must have a comprehensive understanding of Generative AI models' capabilities to effectively drive internal adoption and integration of AI solutions. While these models excel at extracting information from text according to instructions, their ability to extract information from images is rapidly evolving and less widely understood. These extraction capabilities eliminate traditional roadblocks like data collection, feature engineering, and model training. Understanding generation capabilities across text, code, images, audio, and video is crucial for guiding different business units. For customer support, they may help choose between text-only models or multimodal models based on the nature of data and industry domain. For UX generation, AI conductors can help marketing teams build prototypes quickly without coding experience. For code generation, they need to help developers overcome initial resistance and embrace AI tools. While text-to-image may not yet be reliable for corporate applications, AI conductors can help teams use AI to generate infographics, flowcharts, and diagrams for various business documents. Data Strategy As an AI conductor, developing a robust data strategy begins with ensuring data representativeness while carefully managing privacy concerns. Modern approaches have moved beyond traditional human annotation to embrace hybrid systems that leverage large foundational models for generating data, with human annotators focusing only on low-confidence cases requiring expert judgment. When real data isn't available or suitable, AI conductors must know how to leverage AI to generate synthetic data that maintains real-world patterns while protecting privacy. This approach is particularly valuable for testing AI systems across rare scenarios or checking for biases across different demographic groups. The strategy must also include sophisticated approaches to data sanitization, removing personally identifiable information while preserving valuable patterns and insights. Evaluations AI conductors must develop comprehensive evaluation frameworks that go beyond traditional metrics to assess both technical performance and business value. 
This includes designing nuanced criteria for subjective qualities like tone and creativity, implementing robust monitoring systems to detect subtle degradation in performance, and establishing clear tracing mechanisms for debugging complex AI workflows. Unlike traditional machine learning, where metrics are straightforward and objective, generative AI requires multi-faceted evaluation approaches. These might combine traditional metrics with AI-based evaluations (LLM-as-judge) and human assessments to catch subtle errors or "hallucinations." AI conductors must also implement pre-generative guardrails to prevent misuse and ensure safety, while maintaining robust monitoring systems to detect issues like "silent degradation," where automated scores remain acceptable while real-world utility declines. CATE: Soft Skills Critical Thinking As an AI conductor in the AI era, critical thinking has become more crucial than ever. While organizations traditionally relied on a few key decision-makers to determine the "why" and "what" of projects, with larger teams focused on the "how" of execution, AI is shifting this dynamic. As AI tools become increasingly sophisticated at handling execution tasks, the bottleneck isn't in implementation but in strategic thinking: identifying which problems to solve, why they matter, and what approach to take. AI conductors must think critically to prioritize high-impact initiatives, collaborate with stakeholders to validate strategic decisions, and work with technical teams to develop effective implementation plans. Communication Skills AI conductors must excel at multifaceted communication, serving as bridges between technical teams, business stakeholders, and end users. They need to effectively translate technical concepts into business value propositions when speaking with executives, while also conveying business requirements and constraints clearly to technical teams. Crucially, they must be exceptional listeners, skilled at building rapport with domain experts to understand their nuanced decision-making processes. This involves managing expectations about AI capabilities—neither overselling nor underselling what AI can achieve—and clearly communicating both the possibilities and limitations of AI solutions. Adaptability & Learning In the rapidly evolving AI landscape, AI conductors must possess exceptional adaptability and commitment to continuous learning. The field resembles drinking from a firehose—new tools emerge constantly, models improve dramatically, and novel possibilities unfold weekly. AI conductors must maintain an experimental mindset, consistently exploring new use cases while staying current with the latest developments in AI capabilities. The most effective AI conductors are those who can quickly adapt their strategies as technologies evolve, finding creative ways to work around current limitations while preparing for future capabilities. Troubleshooting & Analysis AI conductors need strong analytical abilities to effectively collaborate with technical teams in diagnosing and resolving AI system issues—particularly when content extraction or generation falls short of expectations, or when guardrails fail to perform as intended. While technical teams handle the deep debugging, AI conductors must understand enough to have productive conversations about potential issues: whether system prompts might need restructuring, if prompt instructions could be overwhelming the model, or if reasoning guidance may be insufficient. 
The goal isn't for AI conductors to solve technical problems themselves, but rather to serve as an effective bridge between business needs and technical solutions. Ethical Awareness AI conductors must possess a strong foundation in ethical awareness to ensure responsible AI implementation within their organizations. This involves anticipating potential societal impacts of AI systems and proactively addressing concerns before they become problems. They need to champion ethical considerations at the strategic level: developing organization-wide AI ethics policies, establishing review processes for high-risk applications, and creating clear escalation pathways for ethical concerns. AI conductors should also foster a culture of responsible innovation where teams feel empowered to raise ethical concerns without fear of impeding progress. This requires balancing the pressure for rapid AI deployment with the need for thoughtful consideration of long-term implications and societal impact. Moving Forward For executives leading AI transformations, the path forward is clear: success depends not on accumulating technical talent alone, but on cultivating AI conductors who can orchestrate the interplay between AI capabilities and business strategy. This shift is particularly crucial as generative AI increasingly handles operational "how" tasks that previously required human execution. As this capability expands, organizations face a new bottleneck: the growing demand for professionals who excel at determining the "why" and "what" of AI initiatives. This requires three key executive actions with specific implementation steps: Changes Needed for Hiring AI Conductors/Strategists Traditional IT hiring emphasizes deep technical expertise, but AI success demands professionals who can translate between business needs and technical possibilities. Executives should: Revise job descriptions to prioritize candidates who demonstrate both the technical literacy outlined in SPADE and the orchestration capabilities detailed in CATE Implement assessment methods that evaluate a candidate's ability to communicate complex AI concepts to different stakeholders Establish competitive compensation structures that recognize the unique value of these hybrid skill sets Set a target ratio of AI conductors to technical AI specialists (1:3 is often effective in early implementations) Grooming AI Conductors/Strategists Internally Create dedicated paths for developing AI conductors internally with clear milestones and timelines. Rather than relying solely on external hires, organizations should create two distinct development tracks to nurture AI conductors from existing talent: a) The Engineer-to-Conductor Path: Identify engineers who demonstrate strong aptitude for CATE skills (particularly communication and critical thinking). These technically-skilled professionals often already possess much of the SPADE framework, but need structured development in: Role-playing exercises simulating interactions with various stakeholders (executives, domain experts, end users) Shadowing business strategy meetings to understand organizational priorities Mentorship from business leaders to develop business acumen Progressive responsibility in presenting AI concepts to non-technical audiences Training in ethical frameworks and responsible AI principles b) The Analyst-to-Conductor Path: Identify business analysts who show technical curiosity, comfort with coding fundamentals, and willingness to engage with technical documentation. 
These business-savvy professionals need intensive development in SPADE skills through: Structured technical training in AI fundamentals and capabilities Hands-on workshops for prompt engineering and evaluation design Paired work with AI engineers to understand technical limitations and possibilities Progressive responsibility in translating business requirements into technical specifications Guided practice in decomposing complex workflows into AI-enabled components Finally, evolve governance structures to empower AI conductors. Traditional project management frameworks, with their emphasis on fixed requirements and linear execution, often constrain AI initiatives. Executives need to establish new governance models that enable rapid experimentation while maintaining appropriate controls. This includes redefining success metrics beyond technical performance to encompass business impact and ethical considerations. Organizations that execute these shifts will not only accelerate their AI initiatives but also build lasting competitive advantages through superior orchestration capabilities. In an era where AI tools are increasingly commoditized, the ability to effectively direct these tools toward strategic objectives will become the key differentiator.
authors:
Rohit Aggarwal
Harpreet Singh

Article
AI's transformative power is often likened to groundbreaking innovations such as the printing press, the steam engine, or the internet. Each of these innovations transformed society in unique ways: the printing press democratized knowledge, the steam engine industrialized production, the internet connected humanity, and AI is now augmenting human intelligence across every sector. For example, in healthcare, AI algorithms can analyze medical images with greater speed and accuracy than human doctors, leading to earlier diagnoses and better patient outcomes. Similarly, in the legal field, tasks that once required weeks of meticulous effort—such as sifting through thousands of documents to locate a critical email—can now be accomplished in moments with AI. AI's revolutionary potential is well-recognized by business leaders. However, the rapid evolution of AI technologies and their relative newness pose significant challenges for companies attempting to integrate AI effectively into their workflows. A Gartner report forecasts that by the end of 2025, at least 30% of generative AI (GenAI) projects will be abandoned. Our research, based on a survey of C-Suite Executives from medium-sized companies across industries including retail, healthcare, food, hospitality, and education, alongside AI experts from Microsoft, Google, Facebook, AWS, Salesforce, and Alibaba, and from top consultancies including Deloitte, McKinsey, PwC, Accenture, BCG, Cognizant, Bain, and IBM, revealed significant gaps in business leaders' understanding of AI's capabilities and limitations. Compounding these misconceptions is a growing shortage of professionals who truly understand how to leverage AI effectively. Since Harvard Business School's Dean famously remarked, 'AI Won't Replace Humans, But Humans Who Know AI Will,' this sentiment has gained widespread traction in business circles. However, there remains little clarity around what it actually means to 'know AI.' To address this gap, we interviewed many AI experts and discovered that successful AI implementations require more than AI Engineers—they need what we call "AI conductors": professionals who orchestrate the complex interplay between AI capabilities, business strategy, and human factors. Our research helped identify the comprehensive set of technical and soft skills that define these conductors. These skills extend beyond technical proficiency, emphasizing the ability to design AI-driven workflows, develop robust evaluation frameworks, and ensure that AI-driven decisions are consistently aligned with human values, ethical considerations, and business objectives. Our research also revealed a more fundamental challenge: executives must first overcome their own blind spots about AI before they can effectively develop AI conductors in their organizations. Based on our interviews, four critical misunderstandings consistently undermine AI initiatives.

The Four AI Blind Spots Many Executives Have

AI as a Data Analysis Tool

First, executives often view AI primarily as a data analysis tool. This mindset underestimates AI's true potential. AI's real strength lies in its ability to simulate human decision-making and automate complex tasks that were once considered too intricate. Unlike traditional automation, which relied on explicit rules and structured inputs, AI, specifically Generative AI (GenAI), can handle ambiguous, unstructured, and context-heavy tasks with minimal human intervention.
For the first time in history, machines can infer intent, adapt to new scenarios, and integrate multimodal data—text, images, and audio—allowing them to approach the complexity of human decision-making. This shift enables AI to go beyond routine task automation and actively enhance human judgment in areas requiring reasoning, pattern recognition, and contextual awareness. GenAI makes augmenting human decision-making possible in virtually every domain, including HR, finance, accounting, technology, healthcare, legal, customer service, and scientific research. GenAI is Similar to Traditional Machine Learning (ML) Second, executives incorrectly assume that generative AI implementation mirrors traditional machine learning (ML). Unlike traditional ML's objective metrics like accuracy and F1 scores, GenAI requires subjective, multi-faceted evaluation combining automated metrics, LLM-based assessments, and human review to evaluate aspects like tone, coherence, and creativity while catching hallucinations and subtle errors. GenAI's ability to accept free-form input necessitates robust pre-generative guardrails to prevent misuse, detect prompt injection attempts, assess societal risks, and enforce regulatory and brand guidelines, along with post-generation checks for factual accuracy, bias, and brand consistency. Additionally, monitoring GenAI systems is more complex due to the risk of "silent degradation" where automated scores remain acceptable despite declining real-world utility, requiring continuous human review, and the need to trace multiple, dynamically selected steps in the generation process, making debugging and root-cause analysis significantly more challenging than traditional ML's typically deterministic and relatively straightforward evaluation processes. Plug-and-Play Perception Third, executives often view AI as a plug-and-play solution. C-suite leaders frequently assume that AI implementation is primarily a technical endeavor where engineers and data scientists develop and maintain models, while business professionals merely assist with data identification and occasional feedback. This mindset, fueled by ambitious claims about AI's capabilities, creates the illusion that AI can be seamlessly integrated without significant human oversight. However, AI experts emphasize that even as AI technology evolves rapidly, significant involvement of AI conductors (i.e., GenAI strategists) remains essential across the entire AI-enabled workflow. This includes development, deployment, and ongoing monitoring phases. The reality is that successful AI integration requires creating many new processes and tools, understanding cognitive workflows, evaluating existing systems, conducting feasibility studies, establishing appropriate scope, securing stakeholder buy-in, identifying quality data sources, and planning resources effectively. These activities highlight that AI functions best as an enabler rather than an independent operator, requiring continuous human guidance to ensure alignment with strategic objectives, practical constraints, and ethical considerations. Hire Engineers & Data Scientists Fourth, many executives mistakenly believe that successful AI implementation depends primarily on technical expertise and is mostly about model training and upkeep. This often leads to the strategy of hiring as many data scientists and ML/AI engineers as possible, while relying on existing technology leaders to oversee AI initiatives. 
Though these leaders excel at managing technical projects, they often lack deep understanding of GenAI's unique characteristics. They may fail to recognize that augmenting human decision-making requires decoding cognitive processes and blending various technologies. Moreover, existing leaders typically struggle to envision emerging use cases enabled by rapidly advancing multimodal AI capabilities. Our research shows that the most successful implementations occur when organizations balance technical talent with AI conductors – professionals who may not code extensively but possess strong technical knowledge of generative AI capabilities and limitations. These AI conductors play a vital role by steering implementations, managing executive expectations, advocating for essential resources (particularly in AI evaluation systems), and ensuring proper controls are in place. They serve as strategic bridges between technology and business objectives, preventing the equivalent of deploying powerful AI systems without proper safeguards – like driving on a freeway with faulty brakes. Organizations that recognize the need to hire or groom these AI conductors alongside technical talent consistently achieve more successful, responsible AI integrations. Through in-depth interviews with AI experts, we identified two complementary frameworks that directly counter these misconceptions by providing a clear blueprint for the capabilities AI conductors need to develop. You can read about these complementary frameworks: SPADE and CATE here:
4 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
TL;DR The narrative revolves around GloboTech’s leadership team, led by CEO Sarah Klein, grappling with the need to pivot their hiring strategy to prioritize "AI conductors"—consultants adept at orchestrating AI workflows and solutions over traditional coding-centric "musicians." This shift introduces challenges in candidate evaluation, risk of pushback from traditional talent pools, and the need for market positioning to retain competitive advantage. Through detailed discussions, the case delves into: • The trade-offs between traditional coding skills and AI workflow redesign capabilities. • The design of assessments to measure hybrid skills like critical thinking and problem-solving over coding tests. • The operational and reputational risks of implementing innovative hiring strategies. • Market positioning challenges as the firm transitions to a leadership role in AI consulting. Contact Rohit Aggarwal for Feedback, Collaboration or Teaching Notes for this case Prologue On a brisk February morning in Boston, the leadership team at GloboTech Services gathered around a sleek conference table for a critical hiring strategy meeting. The company had built its reputation by providing high-level IT consultants to global enterprises, helping them tackle everything from infrastructure modernization to AI-driven system implementations. Despite their success, CEO Sarah Klein sensed a shift in the market. "We've always prided ourselves on hiring the best coders—our 'star musicians,'" she began, scanning the room. "But I'm not convinced that's still the formula for success in consulting, where AI tools are taking over a lot of the coding tasks." Next to her, Ian Roberts, Chief Technology Officer (CTO), flipped through slides showing recent developments in generative AI. "Our technical assessments have been great at verifying data-structure knowledge and coding prowess," he said. "But with AI writing so much of the code now, we might need a different skill set—people who can guide AI rather than manually produce every line themselves." Carla Sanchez, the VP of Human Resources, chimed in. "We need to rethink our entire recruitment strategy," she said. "We're focusing too heavily on coding under time pressure. We're not measuring critical thinking, problem-solving, or communication—the skills consultants need when orchestrating AI solutions for clients." 
The Setting Company Profile GloboTech Services: A mid-sized IT services firm specializing in advanced AI-driven consulting for enterprise clients Headquarters: Boston, MA Founded: 2015 by CEO Sarah Klein, a former venture capitalist turned IT services entrepreneur Scale: 50,000 employees globally, $5 billion in annual revenue Leadership Team Sarah Klein (CEO) Visionary who believes in staying ahead of tech trends Concerned about maintaining the company's reputation for innovation Ian Roberts (CTO) Deep technical background; formerly a lead engineer at a major tech giant Believes in maintaining rigorous technical standards, but open to reevaluating what "technical" means in the age of AI Carla Sanchez (VP of Human Resources) Responsible for recruitment strategy and employee development Wants to attract top "AI conductors" over "musicians" who rely solely on coding Ming Li (Director of Operations) Manages day-to-day processes, including test administration logistics for consultants Concerned about cheating and test integrity in remote hiring Amelia Price (Director of Strategy & Partnerships) Monitors industry trends and competitive moves Sees a big opportunity for GloboTech to become a thought leader in AI consulting Unfolding the Drama: The Leadership Meeting 1. The Changing Nature of Skills Sarah placed her phone face down on the table. "I just got off a call with a major enterprise client. They're revamping their own hiring for AI roles—telling me it's not just about coding. They need consultants who can figure out what to build and why, more than how to build it in excruciating detail." Ian projected a chart on the screen. "Our consultant screening has historically revolved around coding challenges—data structures, algorithms, design patterns. But with the pace of AI, we might attract amazing 'musicians'—fantastic coders—who aren't necessarily good 'conductors.' They might write perfect code but struggle to guide AI agents, collaborate with humans, and orchestrate solutions where AI handles a substantial percentage of tasks traditionally done by human programmers. We need consultants who can step back from the instrument and conduct the entire AI-human orchestra." He paused. "I'm worried about how candidates will react if we suddenly pivot from traditional coding tests to assessments focused on critical thinking and communication. These developers have spent years practicing data structures and algorithms, preparing for the kind of technical interviews that every major tech company gives. If they encounter our new format—heavy on problem-solving and light on coding—they might think something's wrong. We might see high drop-off rates mid-process, or candidates might abandon our assessments entirely when they don't recognize the format. They've invested so much time mastering one type of interview, and we're asking them to showcase completely different skills." Carla responded quickly: "We're not throwing coding out the window. But we have to rebalance. Consulting engagements demand critical thinking, analytical reasoning, and communication—especially in scenarios where the code is assisted by AI." 2. The Mismatch and Fear of Candidate Pushback Sarah tapped the table with her pen. "Our job postings and screening tests still scream 'traditional coding challenge.' If we pivot to these intangible, AI-focused skills, we risk confusing applicants. 
They might ask, 'Is GloboTech even a serious tech consulting firm?'" Amelia added, "Top technical talent often wants to go somewhere they can do cutting-edge programming. If they think we're turning them into project managers or 'AI conductors,' they might walk away before they grasp what we're really offering." Cheating, AI Tools, and Resume Inflation 1. The New Age of "Resume Customization" Ming cleared her throat. "We're also seeing more AI-driven cheating. Last week, we got two identical cover letters from different applicants—clearly generated by the same AI. And these candidates aced our coding tests but failed miserably in real-life consulting simulations." Ian shook his head. "They could've used AI copilots to solve those coding problems. Ironically, we do want consultants to leverage AI effectively, but we also need to see authentic problem-solving. Right now, we're struggling to tell genuine skill from AI-generated fluff." 2. Heightened Proctoring Requirements Carla jumped in. "We could implement more intense proctoring—live video, screen captures, advanced plagiarism detection. But that raises privacy concerns. Candidates might be uncomfortable being recorded at home, especially when we run hours-long tests." Ming nodded. "And we need to ensure compliance with data protection laws. Recording for a quick interview is one thing. Doing continuous screen captures is another." The Core Challenges Emerge Carla stood up and walked to the whiteboard. "Before we go further, let's tackle the assessment design challenge head-on. How do we actually evaluate these hybrid skills we're talking about?" She wrote a few bullet points as she spoke. "We need someone who can code, yes, but also someone who can think strategically about AI implementation. Traditional coding tests won't cut it anymore. Let me give you a concrete example." Carla pulled up a recent project slide. "When we're implementing generative AI solutions, we're essentially trying to replicate complex human decision-making processes. Our consultants need to be able to step back and decompose these decisions into logical sequences. It's like being a cognitive architect." She drew a flowchart on the whiteboard. "Look at our recent customer service automation project. Our traditional developers wanted to jump straight into coding a solution. But our more AI-savvy consultants took a different approach. They first mapped out the human service rep's decision-making process, broke it down into discrete steps, and then experimented with different combinations of tools – sometimes using coding, sometimes traditional machine learning models, sometimes generative AI APIs. They kept iterating until they found the right sequence." Ian nodded vigorously. "Exactly. And it's not just about technical implementation. These consultants need to design custom evaluation metrics that align with human preferences. They're constantly experimenting – trying different prompts, adjusting model parameters, fine-tuning the interaction between various AI tools. It's a completely different skill set from traditional software development." "And the results speak for themselves," Carla continued. "The team that took this experimental, methodical approach delivered a solution that actually matched how human service reps make decisions. The team that rushed into coding ended up with a rigid system that couldn't handle nuanced cases." Ian leaned forward. "What if we let candidates use AI tools during the assessment? 
It could show us how they approach AI-assisted problem solving in real time. Though," he added with a frown, "we'd need to figure out which tools to allow and how to evaluate their usage effectively." "That's exactly the kind of thing that keeps me up at night," Ming interjected. "We hire thousands of consultants annually across our global offices. How do we maintain consistency in these new assessments across regions? We'll need to retrain every interviewer, every recruiter. And let's be honest – evaluating soft skills consistently at this scale is already challenging. Adding AI orchestration skills to the mix makes it even more complex." Amelia, who had been quietly taking notes, looked up. "There's also the client side to consider. Just this morning, I had a call with a client who was concerned about our transition. They asked point-blank: 'If you're moving away from pure coding skills, how can you justify your current billing rates?' We need to articulate the value proposition of these new 'AI conductors' clearly." "That's a crucial point," Sarah agreed. "We're not just changing our hiring process – we're transforming how clients perceive the value of our consultants. Some clients might worry we're becoming less technical, when in reality, we're evolving to meet their future needs." Ming rubbed her temples. "And let's not forget about our existing client projects. How do we manage expectations during this transition? We can't just flip a switch and suddenly have all our consultants become AI experts overnight." "The billing question is particularly tricky," Ian added. "Traditionally, we've justified our rates based on our consultants' technical expertise. Now we need to help clients understand why strategic AI orchestration skills command similar or even higher rates." The Market Reality The challenge was particularly acute given GloboTech's position in the market. Unlike pure software companies, IT services firms needed consultants who could bridge the gap between business needs and technical implementation—a skill set becoming even more critical in the AI era. The market signals were impossible to ignore: Job postings mentioning AI had seen a staggering 1,000% increase in 2023 compared to the end of 2022, according to CIO.com. LinkedIn's latest workforce report highlighted AI Engineers and AI Consultants among the top 10 fastest-growing jobs in the U.S. This explosive growth in AI-related positions wasn't just a trend—it represented a fundamental shift in how technology services would be delivered. Clients weren't just asking for AI capabilities; they were demanding consultants who could navigate the entire AI transformation journey. As a mid-tier player competing against giants like Accenture, Cognizant, and IBM, GloboTech had to be strategic about its transformation. These larger competitors had deeper pockets for training programs and could afford to experiment with different approaches. Balancing Risks and Rewards 1. Thought Leader vs. Status Quo As the debate wore on, Amelia spoke up. "What if we lead the charge? Publish white papers, host webinars on how AI is changing the consulting skill set, and promote the idea of an 'AI conductor.' That might mean we lose some candidates, but it could also attract the right ones—and elevate our brand." Sarah considered it. "We'd be positioning ourselves as the front-runners in an evolving market. But are we ready if top coders turn away or if our current teams resist the change? 
And what if our competitors hang back, learn from our hiccups, and then adopt a more polished version of our model?" 2. Suboptimal Hires if We Delay Ian cautioned against waiting too long. "We're already seeing mismatched skill sets in projects. If we continue hiring the old way, we'll get consultants who can code but can't coordinate AI solutions effectively. That's a recipe for losing contracts over time." He pulled up a recent case on his tablet. "Take what happened with StandardCorp last month. Our team was technically strong—great coders who could build traditional systems. But when it came to implementing AI-driven HR automation, they struggled. They approached it like a traditional software development project, writing custom code for everything instead of effectively orchestrating existing AI tools and APIs. After three months of delays and cost overruns, StandardCorp pulled the plug on our $2 million contract and hired AIShape, a boutique AI consultancy. They completed the project in six weeks using mostly off-the-shelf AI models and custom prompts. That's just one example, but it's a pattern I'm seeing more frequently." "That StandardCorp loss really hurt us," Carla added. "Not just the revenue, but our reputation. Their CIO told me directly that while our consultants were 'brilliant programmers,' they seemed to be 'fighting against AI instead of leveraging it.' That's exactly the perception we need to change." The Tipping Point Midway through the conversation, Sarah's phone buzzed again. She glanced at a text from a major strategic partner: "Heard rumors you're pioneering a new approach to AI-hiring. Would love to see it." Sarah's eyes lit up—word had leaked out. The room fell silent as the leaders processed the potential brand perception issues. They each realized the importance of messaging this shift correctly to avoid driving away strong applicants. Amelia noticed Sarah's reaction. "We don't operate in a bubble," she said. "Clients and partners are watching. If we move quickly, we could become their go-to experts on AI adoption—something that sets GloboTech apart." Implementation Challenges Surface Ming raised her hand. "There's another critical challenge we haven't discussed – the assessment process itself. These new tests focusing on critical thinking and communication will have significant subjective components. How do we evaluate them consistently and fairly?" "I've been exploring some innovative approaches," Carla responded. "What if we used a combination of LLMs and human assessors? The AI could provide an initial structured evaluation of communication patterns and problem-solving approaches, while our human assessors focus on nuance and cultural fit." "Interesting," Ian leaned forward. "But how do we ensure the AI evaluations align with our human preferences? We'd need to carefully calibrate these systems." Ming nodded. "And scale is a huge concern. We're talking about thousands of assessments across different offices. We need standardization without losing the human touch." "That brings up another crucial point," Carla added. "We need to completely retrain our interviewing staff. Our assessors are experts at evaluating coding skills, but evaluating AI orchestration abilities? That's a whole different ball game. We need to develop new interview questions, testing procedures, evaluation rubrics – and then train everyone consistently." Sarah turned to Amelia. "What about the client side? How do we message this transformation to them?" Amelia straightened in her chair. 
"That's tricky. Some clients still equate technical capability with coding skills. We need to help them understand why an AI orchestrator might be more valuable than a traditional programmer. I had a client ask me yesterday, 'If they're not primarily coding, why should we pay the same rates?'" "We need concrete metrics," Sarah mused. "Not just technical delivery metrics, but business impact measurements. Something that shows how these new skills translate to better solutions." "Exactly," Amelia agreed. "And we need to manage expectations during the transition. Some clients might see mixed teams – traditional developers working alongside our new AI orchestrators. We need to clearly articulate the value of each role and how they complement each other." Ian pulled up a dashboard on his laptop. "Look at the StandardCorp project we discussed earlier. The traditional team spent 70% of their time coding custom solutions. The AI-focused team spent 60% of their time on solution design and orchestration, using existing tools and APIs. The result? Faster delivery, better scalability, and higher user satisfaction. These are the kinds of metrics we need to highlight." "But communicating this to clients requires our account teams to understand it deeply themselves," Carla pointed out. "We need to train them too – help them articulate the value proposition of these new hybrid skills." Discussion Questions Revising the Screening Process What specific interview or assessment formats could GloboTech introduce to better evaluate a consultant's capacity to leverage AI tools and think strategically, rather than just coding? How can the firm still assess coding fundamentals while placing greater weight on problem-solving and communication? Addressing Cheating and AI Misuse What proctoring tools or strategies can discourage AI-assisted cheating without undermining candidate privacy or trust? How can GloboTech position these measures so they feel fair and transparent? Talent Supply and Demand With a large pool of applicants possessing basic coding skills, how can GloboTech raise its hiring bar to find top-tier "AI conductors" without pushing away excellent coders? What ongoing training and development might be needed to keep new hires aligned with evolving AI needs? Market Positioning How should GloboTech differentiate itself from both traditional IT services competitors and pure technology companies? What messaging would attract candidates with the right blend of technical and strategic capabilities? Strategic Timing In deciding whether to be a "first mover," which risks and rewards should GloboTech weigh most heavily? How might GloboTech mitigate potential downsides if it chooses to be a thought leader in AI-driven hiring? Epilogue: The Decision Awaits By the meeting's conclusion, the leadership team at GloboTech Services was left with more questions than answers. The urgency was evident: AI had reshaped the consulting landscape, reducing the emphasis on raw coding prowess and highlighting the need for "AI orchestration" skills like communication and strategic thinking. As the group filed out of the conference room, Sarah lingered, still processing the partner's text and the implications of intrusive proctoring. She wondered how a company culture so steeped in coding pride would adapt to the idea of "conductors" over "musicians." In a fast-changing industry, which path—gradual or bold—would ensure success?
8 min read
authors:
Rohit Aggarwal

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

Exercise 1: Simple RAG for 10-K filings (Code with Explanation is posted here: Colab Notebook Link)
Exercise 2: RAG with Reranker for 10-K filings
Exercise 3: RAG with Query Decomposition & Tracing with Langsmith/Langfuse
Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning and Action)
Exercise 5: RAG with Agentic Pattern: ReAct + Reflection

These exercises incrementally build on basic RAG, with a focus on "why" before "what" and "how". This first tutorial focuses on developing a basic end-to-end RAG pipeline. It is divided into three parts to provide a comprehensive understanding of building a simple RAG system for 10-K filings:

RAG Fundamentals: There are two key processes in RAG - a) Pre-processing data for RAG, which downloads 10-K filings, splits them into chunks, generates chunk embeddings, and stores and retrieves chunks from a vector database; and b) User query processing, which generates user query embeddings, retrieves chunks similar to the query embedding, and creates prompts for models to generate a response.
High-Level Overview of Underlying Models

We strongly encourage readers to go through RAG fundamentals before diving into the code.

RAG Fundamentals

10-K SEC filings are comprehensive annual reports that provide an in-depth overview of a publicly traded company's operations, financial performance, and risks. These documents are essential for investors, analysts, and regulators, offering insights into business strategies, legal issues, financial health, and future outlook. However, their length and complexity often make extracting specific information time-consuming and challenging, especially when dealing with multiple filings across different companies. RAG systems address these challenges by combining traditional retrieval methods with the generative capabilities of large language models (LLMs). By structuring and embedding text from 10-K filings into a searchable database, RAG systems can quickly retrieve and synthesize relevant information, enabling users to answer complex queries efficiently. In this exercise, we will work with 10-K filings from companies like Tesla and GM, utilizing their SEC-hosted webpages as data sources.

Pre-processing Data for RAG (Retrieval-Augmented Generation)

Pre-processing text for RAG systems (e.g., company policy documents, emails, website content, and reports) involves key steps to prepare and organize data for efficient querying and retrieval. The primary steps include chunking, embedding generation, and vector database integration. Here's a breakdown:

Chunking

Chunking is the process of breaking down large texts into smaller, manageable pieces that are easier to process and retrieve. In knowledge bases with lengthy documents, breaking them into smaller chunks enables RAG models to query and retrieve only the most relevant sections for user queries. This targeted retrieval promotes contextually coherent responses while reducing off-topic content and conserving computational resources, making the process more efficient and scalable. A key consideration in chunking is determining the appropriate chunk size to balance context preservation and semantic specificity.
Semantic specificity refers to how distinctly and unambiguously a text conveys an idea. Larger chunks excel at maintaining discussion context and keeping related ideas together, which helps models understand references and pronouns. This is particularly valuable for tasks like document summarization or question answering that require comprehensive topic understanding. However, larger chunks can encompass multiple themes, potentially diluting the semantic focus of their embeddings and leading to less precise retrievals when queries target specific aspects. Conversely, smaller chunks typically focus on single ideas, generating highly focused and semantically rich embeddings that can be matched precisely with specific queries. The drawback is potential loss of broader context, where important background information or pronoun references might fall outside the chunk's scope. This can result in retrieved chunks that, while semantically relevant, may miss crucial context for coherent responses. The optimal chunk size depends on the specific application requirements and often involves experimentation. To address the risk of splitting important information across chunks, an overlapping sentences approach is often used. This involves adding a portion of the end of one chunk to the beginning of the next, helping preserve context and semantic integrity of ideas that span chunk boundaries. This ensures the model maintains a better understanding of the text as a whole, enhancing information continuity before moving into the vectorization phase of the RAG model's data pre-processing pipeline. Generating Chunk Embeddings using Embedding Model Think of embeddings as a way to translate text into a sequence of numbers that computers can understand and compare. When you convert text into embeddings (also referred as vectors), you're essentially creating a numerical "fingerprint" that captures the meaning of that text. In a RAG (Retrieval Augmented Generation) system, embeddings serve three key functions: They convert chunks of your company's documents (manuals, reports, policies) into these numerical fingerprints. They similarly convert user questions into numerical fingerprints. They allow rapid searching by comparing these fingerprints to find relevant matching chunks. Let's say an attorney has a new case about a contract dispute where a software company failed to deliver custom AI features they promised to build for a client. The attorney has this case summary: “Contract dispute: Client paid $2M for custom AI software development. Contract specified 6-month delivery. Vendor delivered incomplete features after 8 months, failing to meet specifications. Client seeking damages.” When this query is converted to an embedding, it captures key legal concepts like breach of contract, delayed delivery, and incomplete work. The system compares this numerical pattern against thousands of past cases' embeddings to find similar precedents. Precisely speaking, the system compares it against embeddings of chunks from past legal cases and finds chunks with similar numerical patterns about breach of software development contracts, delayed project deliveries, and incomplete or non-conforming deliverables. By comparing embeddings of chunks rather than entire cases, attorneys can quickly pinpoint not only the precedent cases but also the most relevant sections in these cases. This helps attorneys rapidly identify relevant precedents without reading through thousands of unrelated cases. 
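Before moving on to storage, here is a minimal sketch of the chunking-with-overlap and embedding steps described above. It assumes the sentence-transformers package is installed; the chunk size, overlap, and the "all-MiniLM-L6-v2" model are illustrative choices rather than the tutorial's exact settings.

# A minimal sketch of chunking with overlap and chunk-embedding generation.
# Assumptions: sentence-transformers is installed; parameters below are illustrative.
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks, repeating `overlap` words across boundaries."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step back by `overlap` words to preserve context
    return chunks

if __name__ == "__main__":
    filing_text = "Tesla designs, develops, manufactures and sells electric vehicles ..."  # stand-in for a full 10-K
    chunks = chunk_text(filing_text)

    # Each chunk becomes one embedding vector (384 numbers for this particular model).
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)
    print(chunk_embeddings.shape)  # (number_of_chunks, 384)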
Storing Chunk Embeddings in Vector Database After generating embeddings for text chunks, storing them effectively becomes crucial for a RAG system's performance. While traditional relational databases are excellent for structured data, they face significant challenges when handling embeddings due to their high-dimensional nature. For context, embeddings generated from BERT-Base models produce vectors containing 768 numbers, while BERT-Large models create even larger vectors with 1024 elements. Traditional databases simply weren't designed to efficiently manage and query data with such high dimensionality. This is where vector databases come into play, offering a specialized solution designed specifically for handling these high-dimensional vectors. These databases implement sophisticated indexing techniques that allow for rapid similarity searches, making them particularly well-suited for RAG applications. When a user submits a query, the system needs to quickly identify and retrieve the most semantically similar chunks from potentially millions of stored embeddings. Vector databases excel at this task, providing the necessary infrastructure for swift and accurate information retrieval that would be impractical or impossible with traditional database systems. Popular vector database solutions include FAISS and Pinecone, which are specifically optimized for storing and querying these high-dimensional embeddings. These databases implement efficient similarity search mechanisms, typically using cosine similarity measures, enabling them to rapidly identify and retrieve the most relevant chunks of information in response to user queries. This capability is essential for maintaining the responsiveness and effectiveness of RAG systems, particularly when dealing with large-scale knowledge bases. Handling User's Query After preprocessing data and setting up the vector database infrastructure, the RAG system needs to handle real-time user queries effectively. This process happens in four key stages: query vectorization, vector database retrieval, prompt creation, and response generation. Generating Query Embeddings using Embedding model First, query vectorization converts incoming user questions or requests into the same type of numerical representations (embeddings) used for the stored knowledge base chunks. This step is crucial and must use the exact same embedding model that was employed during the preprocessing phase. For instance, if BERT-Base was used to generate the 768-dimensional vectors for your stored chunks, the same model must be used for converting user queries into embeddings. This consistency ensures that both the stored chunks and user queries exist in the same semantic space, making similarity comparisons meaningful and accurate. Using different embedding models for queries versus stored chunks would be like trying to compare distances between points on two different maps with different scales – the results would be unreliable. Retrieving Relevant Chunks using Vector Database Once the query has been converted into an embedding, the vector database performs a similarity search to find the most relevant chunks from the knowledge base. This search typically employs cosine similarity or other distance metrics to identify stored vectors that are closest to the query vector in the high-dimensional space. Modern vector databases can execute these similarity searches extremely efficiently, even across millions of chunks. 
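The following sketch shows one way to store those chunk embeddings and run a similarity search, using FAISS as the vector index. It assumes the faiss-cpu and sentence-transformers packages are available and that `chunks` and `chunk_embeddings` come from the chunking sketch above; the function names are illustrative.

# A minimal sketch of vector storage and similarity search with FAISS.
# Assumptions: faiss-cpu installed; embeddings were normalized, so inner product
# is equivalent to cosine similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunk_embeddings: np.ndarray) -> faiss.Index:
    index = faiss.IndexFlatIP(chunk_embeddings.shape[1])  # flat inner-product index
    index.add(chunk_embeddings.astype("float32"))
    return index

def retrieve(query: str, index: faiss.Index, chunks: list[str], k: int = 5) -> list[str]:
    # The query MUST be embedded with the same model used for the stored chunks.
    query_vec = embedder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(query_vec, k)
    return [chunks[i] for i in ids[0]]

# Example usage (assumes `chunks` and `chunk_embeddings` already exist):
# index = build_index(chunk_embeddings)
# top_chunks = retrieve("What risks does Tesla identify around battery supply?", index, chunks)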
The system then retrieves the original text chunks corresponding to the most similar vectors, providing the contextually relevant information needed for the RAG model to generate its response.

Creating Effective Prompts with Retrieved Context

After retrieving the most relevant chunks, the next crucial step is constructing an effective prompt that helps the language model generate accurate and contextually appropriate responses. This process requires careful consideration of how to structure and combine the retrieved information with the user's query. The basic structure of a RAG prompt typically consists of three main components: instructions for the model, the retrieved context, and the user's query. Think of this like preparing a subject matter expert for a consultation – you first explain how they should approach the task (instructions), provide them with relevant reference materials (retrieved context), and then present the specific question they need to address (user's query). Consider this approach:

Give an answer for the `question` using only the given `context`. If the information needed isn't in the `context`, acknowledge this limitation rather than making assumptions. Provide a detailed answer with thorough explanations, avoiding summaries.
question: {question}
context: {context}
Answer:

The instructions at the top set the foundation for how the model should process and utilize the retrieved information. This helps ensure the model stays grounded in the retrieved information rather than hallucinating or drawing from its pre-trained knowledge. The context section would typically join the ranked chunks with newline characters (\n\n) before inserting them into the prompt template. This preserves the ranking while creating a readable and processable format for the language model.

Response Generation

After generating the prompt with its carefully structured components, the RAG system passes this combined input to a Large Language Model (LLM) for response generation. The LLM processes the instructions, context (retrieved chunks), and user query together to produce a coherent, contextually appropriate response that addresses the user's needs. The LLM leverages the context to ground its responses rather than relying solely on its pre-trained knowledge. This approach significantly reduces hallucination risks since the model is explicitly instructed to base its response on the provided context. If the retrieved context lacks sufficient information to fully address the query, the model acknowledges these limitations instead of making unsupported claims. The effectiveness of response generation heavily depends on the quality of the prompt engineering discussed earlier. Depending on the requirements, the response from the LLM can be further customized or refined based on additional criteria, such as tone, style, or specific user preferences.

Note: The implementation of robust guardrails is crucial when deploying LLMs in RAG systems to ensure responsible and reliable output. A comprehensive validation system should verify that the model's responses strictly align with the provided context, preventing both subtle and obvious forms of hallucination. Additional checks should evaluate responses for potential biases and ethical concerns, including screening for harmful content, discriminatory language, or inappropriate recommendations.
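A small sketch of the prompt-assembly step may help make this concrete. The template wording mirrors the example in the text; the helper name and overall structure are illustrative, and `top_chunks` is assumed to come from the retrieval sketch above.

# A minimal sketch of assembling the RAG prompt from retrieved chunks.
PROMPT_TEMPLATE = """Give an answer for the `question` using only the given `context`. \
If the information needed isn't in the `context`, acknowledge this limitation rather than \
making assumptions. Provide a detailed answer with thorough explanations, avoiding summaries.

question: {question}

context:
{context}

Answer:"""

def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Join the ranked chunks with blank lines so the model sees them as separate passages.
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)

# Example usage (assumes `top_chunks` came from the retrieval sketch above):
# prompt = build_prompt("What risks does Tesla identify around battery supply?", top_chunks)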
These guardrails should also ensure compliance with company policies, regulatory requirements, and societal norms while maintaining appropriate tone and professionalism. The system should be designed to either automatically modify responses that don't meet these criteria or flag them for human review, ensuring a balance between accurate information delivery and responsible AI behavior.

Technical Details

Embedding model

Embedding models are specialized versions of encoder architectures (like BERT) that are fine-tuned specifically to create meaningful vectors (sequences of numbers) for entire sentences or passages, rather than individual words or tokens. Base encoder models create contextual representations at the token level - meaning each word is represented by a vector that depends on surrounding words. However, they aren't trained to directly optimize for sentence-level similarity. In contrast, embedding models are explicitly trained on sentence-pair tasks using contrastive learning. During training, they learn to generate sentence vectors that:
Place similar sentences close together in vector space
Push dissimilar sentences far apart
Capture high-level semantic relationships rather than just word-level patterns
This targeted training makes them much better at tasks requiring sentence-level semantic understanding, like finding similar legal cases or matching questions to relevant documents.

Note: The terms vectors, embeddings, and representations are often used interchangeably, and all refer to sequences of numbers that represent data in a machine-readable format. In large language models (LLMs):
Tokenization: Input text is first broken down into smaller units called tokens. The process maps the text to elements of a predefined vocabulary or dictionary. Since the vocabulary may not contain every possible word, tokenization handles out-of-vocabulary (OOV) words by breaking them into subwords, characters, or other smaller components, depending on the tokenization strategy used.
Token Embeddings: Each token is then converted into a numerical vector (embedding). At this stage, these embeddings are static, meaning they do not depend on the context provided by surrounding tokens.
Contextualized Embeddings: These are embeddings generated after processing token embeddings through the layers of the transformer model. Unlike static embeddings, contextualized embeddings reflect the meaning of each token based on its surrounding tokens in the input sequence. For example, in the phrases "sits by a river bank" and "went to a bank to deposit a check," the word "bank" has different meanings. Contextualized embeddings capture these differences by producing distinct representations for the word "bank" in each context.

The choice of embedding model can significantly impact the quality of your vectors and retrieval effectiveness. Since new embedding models come out on a regular basis, you can select an appropriate model from the MTEB leaderboard.

Response Generation Model

The Large Language Models (LLMs) used for response generation in RAG systems are primarily based on decoder architectures, exemplified by models like ChatGPT, Claude, Llama, and Qwen. These decoder models operate fundamentally differently from the encoder-based models used in embedding generation and reranking stages. Their core objective is next-token prediction, where the model can only see and process tokens that come before the current position, unlike encoder models which have full visibility of the entire input sequence.
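Before continuing with response generation, here is a small illustration of the sentence-level similarity behavior described in the embedding-model section. It assumes sentence-transformers is installed; the model and the example sentences are arbitrary choices.

# A small sketch: semantically similar sentences land close together in vector space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The vendor delivered the software eight months late and incomplete.",
    "The contractor missed the delivery deadline and failed to meet the specifications.",
    "Tesla manufactures electric vehicles and energy storage products.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Pairwise cosine similarities: the two contract-dispute sentences should score much
# closer to each other than either does to the unrelated Tesla sentence.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)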
This architectural constraint creates a more challenging training task, as the model must learn to generate coherent and contextually appropriate text while working with limited future context. This limitation actually drives these models to develop stronger reasoning capabilities and deeper understanding of language patterns, as they must make predictions based solely on previous context. A crucial development stage for decoder models is instruction tuning, which enables them to understand and follow specific directives in prompts. Without this specialized training, these models would simply continue the pattern of text generation rather than providing appropriate responses to instructions. For example, when presented with a prompt like "How are you?", a base model might simply complete the phrase with "doing today", while an instruction-tuned model would recognize the question format and respond appropriately with something like "I'm fine, thank you. How about yourself?" This capability is essential for RAG systems where the model needs to interpret prompts that combine retrieved context with specific instructions about how to use that information. The complexity of the text generation task necessitates significantly larger model architectures compared to embedding and reranking models. These decoder models typically employ many more parameters and layers to support their advanced reasoning capabilities. The scale difference is substantial - while embedding models might operate with hundreds of millions of parameters, modern decoder models often contain hundreds of billions of parameters. This massive scale translates directly to computational costs, with training expenses often reaching hundreds of millions of dollars. As a result, most organizations opt to access these capabilities through APIs provided by services like ChatGPT and Claude, or leverage open-weight models such as the 405-billion parameter Llama hosted on platforms like Together.ai, rather than training their own models from scratch. The combination of this complex architecture, instruction tuning, and massive scale enables decoder models to perform the sophisticated task of synthesizing information from retrieved context into coherent, relevant responses. In a RAG system, this manifests as the ability to not just understand the retrieved chunks and user query, but to reason about their relationships and generate new text that effectively addresses the user's needs while remaining grounded in the provided context.
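To close the loop, the sketch below ties the earlier pieces into a single question-answering call against an instruction-tuned decoder model. It assumes the OpenAI Python client (version 1.x) with an API key set; "gpt-4o-mini" is a placeholder model name, and `retrieve`, `build_prompt`, `index`, and `chunks` are assumed to come from the sketches above rather than from the tutorial's actual notebook.

# A minimal end-to-end sketch: retrieve chunks, build the prompt, generate a grounded answer.
from openai import OpenAI

client = OpenAI()

def answer_question(question: str, index, chunks: list[str]) -> str:
    top_chunks = retrieve(question, index, chunks, k=5)   # similarity search in the vector index
    prompt = build_prompt(question, top_chunks)           # instructions + context + question

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any instruction-tuned chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # keep the answer deterministic and grounded in the context
    )
    return response.choices[0].message.content

# Example usage:
# print(answer_question("How does Tesla describe its battery supply-chain risks?", index, chunks))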
17 min read
authors:
Rohit Aggarwal
Harpreet Singh
