HiringNet - Find your teammates

At the cost of 1 local Intern, get 2 remote Experienced Professionals


If you are a client, who wants to work remotely from home for US company click here

If you are a startup, then click here to get more information


Structured Synthetic Data Generation: Preserving Statistical Relationships Between Features

Article

In today's data-driven world, organizations face a critical challenge: they need realistic data for testing, development, and training AI models, but using real customer data raises serious privacy concerns and may violate regulations like HIPAA or FERPA. Enter synthetic data generation, a powerful solution that creates artificial data mimicking the statistical properties of real data while completely protecting privacy. This article explores a sophisticated synthetic data generation pipeline that transforms real data into artificial data that's statistically similar yet entirely synthetic. What sets this approach apart is its ability to not just replicate individual feature distributions, but to preserve the crucial relationships between features: the correlations and dependencies that make data truly useful. Let's dive into how this works.

Why Synthetic Data Matters

Before exploring the technical implementation, let's understand why synthetic data is valuable:

- Privacy compliance: Eliminates the risk of exposing sensitive customer information
- Development freedom: Enables teams to work with realistic data without security constraints
- Training AI models: Provides diverse, representative data for machine learning applications
- Testing edge cases: Allows creation of specific scenarios that might be rare in real data
- Relationship preservation: Maintains critical correlations and dependencies between variables that simple randomization methods cannot capture

The last point is particularly crucial. Many synthetic data approaches can generate values that match individual feature distributions, but fail to maintain the relationships between features. For example, in a banking dataset, a simple approach might generate realistic account balances and realistic transaction frequencies independently, but miss the vital correlation between these variables. Our method specifically addresses this challenge.
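To make the banking example concrete, here is a small sketch of what "independent" generation does to a correlated pair of features. The toy data, the column meanings, and the helper names (`make_correlated`, `independent_resample`, `rank_corr`) are assumptions made for illustration and are not part of the article's code. Each column is resampled on its own, so every marginal distribution is reproduced exactly while the link between the two columns disappears.

```python
import numpy as np

def make_correlated(n, rho, seed=0):
    """Toy 'real' data: two features driven by correlated latent normals."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    balance = np.exp(8.0 + 0.5 * z[:, 0])  # skewed, log-normal "account balance"
    tx_freq = 20.0 + 5.0 * z[:, 1]         # roughly normal "transactions per month"
    return np.column_stack([balance, tx_freq])

def independent_resample(data, seed=1):
    """Shuffle each column separately: marginals are kept, pairing is destroyed."""
    rng = np.random.default_rng(seed)
    cols = [rng.permutation(data[:, j]) for j in range(data.shape[1])]
    return np.column_stack(cols)

def rank_corr(data):
    """Spearman-style rank correlation between the first two columns."""
    ranks = data.argsort(axis=0).argsort(axis=0)
    return float(np.corrcoef(ranks[:, 0], ranks[:, 1])[0, 1])

real = make_correlated(5000, rho=0.8)
naive = independent_resample(real)

print("rank correlation, real data:        ", round(rank_corr(real), 3))
print("rank correlation, independent draws:", round(rank_corr(naive), 3))
```

The first number stays high and the second falls close to zero, even though a histogram of either column looks identical before and after the shuffle.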
The Problem with Independent Feature Generation

To understand why preserving relationships matters, consider a common e-commerce scenario: In real customer data, affluent customers not only spend more per purchase, but they also tend to buy a wider variety of products. This creates a natural correlation between:

- Average purchase amount
- Number of unique products purchased
- Customer income level

If we were to generate synthetic data by creating each of these features independently (even if each feature's distribution perfectly matches the original data), we would lose these critical relationships. We might end up with unrealistic scenarios like low-income customers purchasing large numbers of luxury items or high-income customers only purchasing a single budget item.

This problem compounds when building predictive models. A model trained on such independently-generated synthetic data would learn incorrect patterns and make faulty predictions when applied to real data. For instance, a recommendation engine might suggest luxury products to customers unlikely to purchase them or miss obvious cross-selling opportunities.

Our approach using copula-based modeling specifically addresses this challenge by mathematically capturing and preserving the dependency structures between features, ensuring the synthetic data maintains these natural correlations.

The Synthetic Data Generation Pipeline

Our implementation creates a comprehensive pipeline that transforms real data into synthetic data through several sophisticated steps while preserving statistical properties and relationships. Let's break down each component:

1. Preprocessing

The first stage prepares the data through three main steps:

Missing Data Handling
- Processes target variables first, addressing imbalanced classes for categorical targets or applying transformations to reduce skewness in continuous targets
- Imputes missing values using median for numerical features and mode/"Unknown" for categorical features

Categorical Data Encoding
- Applies intelligent encoding based on cardinality (number of unique values):
  - Binary encoding for features with 2 unique values
  - One-hot encoding for features with ≤10 unique values
  - Frequency encoding for high-cardinality features
- Identifies and transforms highly skewed numerical features using Box-Cox transformation

Standardizing Features
- Scales numerical features to have zero mean and unit variance
- Preserves categorical features in their encoded form
- Stores all transformation parameters for later inverse transformation

2. Clustering using HDBSCAN

Data often comes from mixed populations with different underlying patterns. HDBSCAN clustering helps identify these natural groupings:

- Uses Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
- Advantages over traditional clustering algorithms:
  - No need to specify the number of clusters in advance
  - Finds clusters of varying densities and shapes
  - Adaptively determines cluster count based on data density
- Handles small datasets by adjusting clustering parameters (min_cluster_size, min_samples)
- Assigns noise points to their nearest clusters
- Creates a 'cluster' column to track membership, falling back to a single cluster if needed
- Separates data by cluster for subsequent processing

3. For Each Cluster

Processing each cluster separately allows the algorithm to better capture the unique characteristics of different data subgroups:

a) Statistical Modeling

This three-step process captures both individual feature distributions and their interrelationships:

Fit Marginal Distribution for Each Feature
- Tests multiple distribution types (normal, lognormal, exponential, gamma)
- Selects best fit using AIC (Akaike Information Criterion)
- Stores distribution parameters for each feature
- Models each feature's unique pattern independently (e.g., ages might follow a normal distribution, while income follows a log-normal distribution)

Transform to Uniform using CDF
- Applies Cumulative Distribution Function (CDF) of fitted distributions
- Transforms each feature to uniform [0,1] distribution
- Creates standardized representation necessary for copula modeling

Fit Best Copula Model Across Features
- Tests different copula types (Gaussian, Student-t, Clayton)
- Selects best-fitting copula based on log-likelihood
- Captures dependency structure between features

A copula is a mathematical function that connects different marginal distributions to form a joint distribution, preserving relationships between variables. For instance, if higher income correlates with more frequent purchases in your original data, copulas maintain this relationship in synthetic data.

This is where the real magic happens in preserving feature relationships. While each feature's individual distribution is important, the connections between features often contain the most valuable information. For example:

- In financial data, transaction frequency may be correlated with account balance
- In healthcare data, age may be correlated with certain medical conditions
- In e-commerce data, purchase frequency may be correlated with customer lifetime value

Copulas mathematically encode these dependencies, allowing us to generate synthetic data where these critical relationships remain intact. Without this step, you might have realistic-looking individual features but unrealistic combinations of values that would never occur in real data.

b) Data Generation

After modeling comes generation of the synthetic data:

Draw Samples from Fitted Copula
- Generates correlated uniform [0,1] samples from the fitted copula model
- Maintains the dependency structure between features

Inverse CDF to Transform Each Feature Back
- Applies inverse CDF (percent point function) using stored distribution parameters
- Transforms uniform values back to realistic data following original distributions
- Restores each feature's original statistical shape while preserving relationships
- Adds appropriate cluster labels to track membership

4. Combine Cluster Data

- Merges synthetic data from all clusters based on original cluster proportions
- Maintains the natural groupings and subpopulations present in the original data
- Preserves the overall data structure and cluster characteristics

5. Postprocessing

This stage restores the data to its original format:

Reverse Encoding of Categorical Features
- Converts encoded categorical features back to their original form:
  - Binary encodings → original binary categories
  - One-hot encodings → original categorical values
  - Frequency encodings → original categorical values

Reverse Standardization
- Applies inverse transformation to all standardized numerical features
- Restores original scale and data types
- Ensures the synthetic data matches the format of the original data
- Handles decimal formatting and type conversion

6. Validation

The final step is thorough quality checking to ensure the synthetic data truly resembles the original:

Validate Each Feature & Target Distribution Independently
- For numerical features: Applies Kolmogorov-Smirnov tests and compares statistical moments
- For categorical features: Performs chi-square tests and compares category frequencies
- Calculates metrics like maximum and average differences between distributions

Validate Correlations
- Compares correlation matrices (Pearson, Spearman)
- Calculates Frobenius norm of difference matrices
- Ensures dependency structures are preserved

This validation step is critical for our goal of relationship preservation. After all the transformations, we need to verify that the synthetic data maintains the same correlation patterns as the original. The process compares both linear (Pearson) and rank-based (Spearman) correlations, allowing us to detect if the relationship structures have been maintained across different types of dependencies.

Validate Cluster Preservation
- Compares cluster proportions between original and synthetic data
- Evaluates if cluster characteristics are maintained
- Compiles all validation results into a comprehensive report with statistical measures

Running the Script

The script can be run in several ways:

With Cursor/Windsurf/Cline

Simply ask the model to run the script.
Without Cursor/Windsurf/Cline

Run the following in terminal:

    python -m pip install -r requirements.txt

If you're having trouble, try upgrading pip:

    python -m pip install --upgrade pip

For Windows users facing installation issues, follow the guide at: https://github.com/bycloudai/InstallVSBuildToolsWindows?tab=readme-ov-file

Note: Sometimes on Windows machines, py works instead of python:

    py -m pip install -r requirements.txt

Limitations and Considerations

While this implementation is powerful, it has some limitations:

- It doesn't make special distinctions between different types of variables during the correlation modeling phase: it treats all variables (including transformed categorical ones) as continuous.
- This means it might not perfectly preserve some special relationships between categorical and continuous variables, or between categories that were originally part of the same variable.

Conclusion

The synthetic data generation pipeline described here offers a powerful solution for organizations needing realistic test data without privacy concerns. What sets it apart from simpler approaches is its sophisticated handling of feature relationships through copula modeling and cluster-aware generation.
By carefully modeling both the individual distributions of features and their relationships, then generating new data that follows these patterns, we can create synthetic data that:

- Is statistically similar to real data
- Maintains important relationships between different pieces of information
- Preserves the overall structure and patterns of the original data
- Is safe to use without worrying about privacy regulations
- Is suitable for testing, development, and analysis purposes

This relationship preservation is crucial for many real-world applications:

- AI model training: Models trained on synthetic data with preserved relationships will learn the same patterns present in real data
- Financial analysis: Synthetic financial data must maintain relationships between risk factors and outcomes
- Healthcare research: The correlations between patient characteristics and medical conditions must be preserved
- Market research: Customer behavior patterns and preferences need to maintain their interdependencies

The attached code implements this entire pipeline, making it accessible for data scientists and developers who need high-quality synthetic data where relationships between features matter as much as the features themselves.
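The core of the pipeline (fit marginals by AIC, map to uniform through the CDF, fit a copula, sample, map back through the inverse CDF, then compare correlation matrices) can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the attached script: it fits only a Gaussian copula (the article also tests Student-t and Clayton), skips preprocessing and clustering, pins `loc` to 0 for the positive-only families, and uses made-up toy data. All function names are hypothetical.

```python
import numpy as np
from scipy import stats

# Candidate marginals named in the article. Pinning loc=0 for the
# positive-only families is an assumption of this sketch.
CANDIDATES = {
    "norm": (stats.norm, {}),
    "lognorm": (stats.lognorm, {"floc": 0}),
    "gamma": (stats.gamma, {"floc": 0}),
    "expon": (stats.expon, {"floc": 0}),
}

def fit_marginal(x):
    """Return (name, dist, params) for the candidate with the lowest AIC."""
    best = None
    for name, (dist, fixed) in CANDIDATES.items():
        if fixed and x.min() <= 0:
            continue  # positive-only families cannot model non-positive data
        params = dist.fit(x, **fixed)
        k = len(params) - len(fixed)  # number of free parameters
        aic = 2 * k - 2 * dist.logpdf(x, *params).sum()
        if best is None or aic < best[0]:
            best = (aic, name, dist, params)
    return best[1:]

def fit_gaussian_copula(data):
    """Fit one marginal per column, then the copula correlation matrix."""
    marginals = [fit_marginal(data[:, j]) for j in range(data.shape[1])]
    z = np.empty(data.shape, dtype=float)
    for j, (_, dist, params) in enumerate(marginals):
        u = dist.cdf(data[:, j], *params)   # feature -> uniform [0, 1]
        u = np.clip(u, 1e-6, 1 - 1e-6)      # keep the normal quantile finite
        z[:, j] = stats.norm.ppf(u)         # uniform -> standard normal score
    return marginals, np.corrcoef(z, rowvar=False)

def sample(marginals, corr, n, seed=0):
    """Draw correlated uniforms from the copula, then invert each marginal CDF."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(marginals)), corr, size=n)
    u = stats.norm.cdf(z)
    cols = [dist.ppf(u[:, j], *params)
            for j, (_, dist, params) in enumerate(marginals)]
    return np.column_stack(cols)

def spearman_matrix(data):
    """Rank-based correlation matrix, used here for validation."""
    ranks = np.apply_along_axis(stats.rankdata, 0, data)
    return np.corrcoef(ranks, rowvar=False)

# Toy "real" data: a roughly normal age and a log-normal income, correlated.
rng = np.random.default_rng(42)
latent = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=4000)
real = np.column_stack([40 + 10 * latent[:, 0], np.exp(10.5 + 0.4 * latent[:, 1])])

marginals, corr = fit_gaussian_copula(real)
synthetic = sample(marginals, corr, n=4000)

# Validation: Frobenius norm of the difference between correlation matrices.
frob = np.linalg.norm(spearman_matrix(real) - spearman_matrix(synthetic), "fro")
print("chosen marginals:", [name for name, _, _ in marginals])
print("Frobenius norm of Spearman difference:", round(float(frob), 4))
```

A small Frobenius norm indicates that the rank correlations survived the round trip. In a fuller version, this fit-and-sample step would run once per cluster, and the per-feature Kolmogorov-Smirnov checks described in the Validation step would run alongside the correlation check.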



Exercise 4 & 5: RAG with ReAct (+ Reflection) for 10-K filings

Article

The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

- Exercise 1: Simple RAG for 10-K filings
- Exercise 2: RAG with Reranker for 10-K filings
- Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse
- Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action)
  Code with Explanation is posted here: Colab Notebook Link
- Exercise 5: RAG with Agentic Pattern: ReAct + Reflection
  Code with Explanation is posted here: Colab Notebook Link

These exercises incrementally build on basic RAG with focus on "why" before "what" and "how".

In the previous exercise, we explored how to break down a complex query into sub-queries, retrieve relevant chunks from a vector database for each sub-query, and generate answers based on those chunks. However, there are instances where the necessary knowledge to answer a user's question may not be available in our vector databases. In such cases, we need to equip our system with pre-built tools that can fetch information from external sources.

Specifically, in the Colab notebook, we demonstrate how to retrieve LinkedIn handles of directors listed in SEC filings. To achieve this, we utilize a set of tools, as illustrated in the following diagram:

- Vector Search Tool – Provides access to the vector database for the LLM.
- Director Extraction Tool – Extracts director names from the previously stored last portion of SEC filings.
- Web Search Tool – Conducts Google searches for directors one at a time and retrieves their LinkedIn handles.

For further details on the code implementation, please refer to the Colab notebook. However, before diving into the notebook, we strongly recommend reviewing the ReAct explanation provided below.
ReAct prompting (Fundamental pattern for AI Agents) ReAct (Reasoning + Action) represents a groundbreaking framework that revolutionizes how large language models (LLMs) approach complex problem-solving. By combining reasoning capabilities with action-based decision making, ReAct enables models to tackle challenging tasks through a dynamic and iterative process. At its core, the framework establishes a continuous loop of three interconnected components: reasoning, action, and observation. The reasoning phase, often called the "Thought" stage, serves as the model's internal cognitive process. During this stage, the model analyzes the current situation, drawing upon multiple sources of information including the original task requirements, previous reasoning steps, past actions, and accumulated observations. This framework allows the model to break down complex goals into manageable subtasks, incorporate relevant background knowledge, and continuously evaluate progress toward the ultimate objective. The model can also use this phase to identify potential obstacles and develop contingency plans when faced with unexpected challenges. The action phase represents the bridge between thought and implementation. It determines which tool to employ based on the preceding thought process. The model examines its available tool descriptions and capabilities, matching them against the requirements identified in its last reasoning step. For example, if the thought process concludes that numerical data needs analysis, the model might select a calculator tool. If the reasoning indicates a need for external information, it might choose a search tool. Following each action, the observation phase captures the results and consequences of the actions. These observations serve as crucial feedback, providing new information that feeds into the next iteration of reasoning. 
For instance, if the model uses a search tool to gather information about a topic, the search results become observations that inform its subsequent thinking and decision-making process.

This creates a feedback loop where each cycle of thought, action, and observation builds upon previous iterations. It allows the model to maintain and adjust its high-level strategy while incorporating new information and responding to changing circumstances. The framework's flexibility enables it to handle complex tasks that require multiple steps, logical reasoning, and interaction with various external tools and information sources.

Here is a typical prompt that is used in the LangChain framework to implement ReAct:

Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer.

You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}

ReAct implementation: LangChain link, LlamaIndex link

The prompt starts by defining the task scope and available tools. {tools} is a placeholder that gets populated with descriptions of tools the agent can use, like search, calculators, or data analysis tools.
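To show how the placeholders end up as a concrete prompt, the sketch below fills the template with plain Python string formatting. It illustrates the substitution only and is not LangChain's internal code. The `render_prompt` helper and the dictionary layout are assumptions, and the two tool descriptions are borrowed from the worked example later in this article.

```python
REACT_TEMPLATE = """Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer.

You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

def render_prompt(tools, question, scratchpad=""):
    """Fill the four placeholders; `tools` maps tool name -> description."""
    tool_lines = "\n".join(f"{name} - {desc}" for name, desc in tools.items())
    return REACT_TEMPLATE.format(
        tools=tool_lines,
        tool_names=", ".join(tools),
        input=question,
        agent_scratchpad=scratchpad,
    )

tools = {
    "WebSearch": "Performs a web search on the query.",
    "Vector Reranker Search": "Retrieves information from an embedding based "
                              "vector DB containing financial data and company information.",
}

prompt = render_prompt(tools, "What are the financial goals of Tesla this year?")
print(prompt)
```

On the first call the scratchpad is empty, so the prompt ends with a bare "Thought:" and the model is expected to continue from there. On later calls the accumulated Thought/Action/Observation history is passed in as `scratchpad`.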
The format section establishes the strict protocol the agent must follow:

Question: {input}
- {input} gets replaced with the actual user question
- This sets up the goal the agent needs to achieve

ReAct Components:
- "Thought:" - Where the agent reasons about what it needs to do
- "Action:" - Limited to the tools listed in {tool_names}
- "Action Input:" - The specific input for the chosen tool
- "Observation:" - Where results from tool usage appear

The line "... (this Thought/Action/Action Input/Observation can repeat N times)" indicates this is an iterative process: the agent can go through multiple cycles of Thought/Action/Observation until it has enough information.

Conclusion Format:
- A final "Thought:" declaring the agent has sufficient information
- "Final Answer:" providing the response to the original question

{agent_scratchpad}

The {agent_scratchpad} at the end is particularly important: it acts as a dynamic working memory space for the LLM agent and gets populated with the ongoing history of all previous Thought/Action/Observation cycles during execution. Think of it like a digital notepad where the agent records its step-by-step problem-solving process.

The scratchpad typically contains:
- Previous thoughts the agent has had, including any intermediate conclusions
- Actions it has taken
- Observations it has received

This allows the agent to:
- Reference previous findings
- Build upon earlier observations
- Maintain continuity in its reasoning process
- Track what approaches have already been tried

Technical Details

It is worth briefly going over the example covered in the original paper to further understand how ReAct differs from Chain of Thought (CoT) prompting.

Source: Paper link

The use of few-shot exemplars significantly enhances the efficacy of ReAct. In the original paper, the authors provided the language model with a small number of human-annotated examples that showcase the desired reasoning process and action sequence.
These exemplars serve as a template for the model to follow when addressing new, unseen instances of the task. The exemplars in ReAct typically consist of a series of thought-action-observation steps: Thoughts: The exemplars include explicit reasoning steps that guide the model's decision-making process. These thoughts help break down the task into smaller sub-goals, provide relevant context or common sense knowledge, and offer guidance on the next action to take. Actions: The exemplars demonstrate the specific actions the model should take to progress towards solving the task. These actions can include information retrieval (e.g., searching a knowledge base), navigation (e.g., clicking on a specific link), or providing a final answer. Observations: After each action, the exemplars include the corresponding observation or result from the environment. These observations provide the model with the necessary context to inform its subsequent reasoning and actions. By studying these few-shot exemplars, the language model learns to internalize the reasoning process and action sequence required to complete the task successfully. The model can then apply this learned pattern to new, unseen instances of the task, even with limited or no additional training. The ReAct paper demonstrates the effectiveness of this few-shot approach across various domains, including question answering (HotpotQA), fact verification (Fever), and interactive problem-solving (ALFWorld and WebShop). In each case, the model is provided with just a handful of annotated exemplars (ranging from 2 to 6) and achieves competitive performance compared to baseline methods that rely on extensive fine-tuning or reinforcement learning. How is ReAct different from CoT? Chain-of-Thought prompting encourages models to break down complex problems into smaller, logical steps before reaching a conclusion. 
While this approach improves accuracy for many tasks, it operates within the confined space of the model's existing knowledge. ReAct fundamentally extends the CoT paradigm by introducing dynamic interaction with the external world. While CoT might reason "To find the population of Tokyo, I need to recall the most recent census data," ReAct can actually execute this step by searching current databases. This ability to ground reasoning in real-world data and tools addresses several key limitations of CoT: Knowledge Freshness: While CoT relies on the model's training data, ReAct can access current information through external tools. Verification Capability: CoT's conclusions are based solely on internal reasoning, but ReAct can verify its assumptions against external sources. Computational Accuracy: Rather than relying on the model's ability to perform calculations mentally (as in CoT), ReAct can utilize specialized tools for precise computations. Adaptive Problem-Solving: ReAct can adjust its approach based on intermediate results, while CoT follows a more linear reasoning path. For example, in solving a math problem, CoT might think through each step mentally, while ReAct could combine reasoning with actual calculator usage, reducing computational errors while maintaining logical clarity. This integration of external tools with reasoning creates a more robust and reliable problem-solving system. Applications and Implications Question Answering: ReAct prompting can be used to improve question-answering systems by allowing the model to generate verbal reasoning traces and perform task-specific actions, leading to more accurate and context-aware responses Deciding appropriate tools for a sub-task: ReAct prompting can be tailored for wide variety of tasks where the LLM needs to perform actions, such as retrieving specific data, performing computations, or even interacting with software interfaces through APIs. 
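The Thought/Action/Observation cycle behind these applications can be sketched without any framework. The code below is a minimal illustration, not LangChain's AgentExecutor: the LLM is replaced by a scripted stand-in (`scripted_llm`), the only tool is a toy `Calculator`, and the parsing regexes are assumptions about the output format described above. It shows the three moving parts: parse the model's reply, run the named tool, and append the result to the scratchpad as an Observation.

```python
import re

# Patterns for the two kinds of reply the ReAct format allows (assumed format).
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)

def react_loop(llm, tools, question, max_steps=5):
    """Run Thought -> Action -> Observation cycles until a Final Answer appears."""
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(question, scratchpad)
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip(), scratchpad
        match = ACTION_RE.search(reply)
        if match is None:
            raise ValueError(f"Could not parse LLM reply: {reply!r}")
        name, arg = match.group(1).strip(), match.group(2).strip()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown tool: {name}"
        # The observation is fed back so the next reasoning step can use it.
        scratchpad += f"{reply.strip()}\nObservation: {observation}\nThought: "
    raise RuntimeError("No final answer within max_steps")

def calculator(expression):
    """Toy tool: multiplies two integers written as 'a * b'."""
    left, right = expression.split("*")
    return str(int(left) * int(right))

def scripted_llm(question, scratchpad):
    """Stand-in for a real model: acts first, then answers from the observation."""
    if "Observation:" not in scratchpad:
        return ("I should use the calculator instead of multiplying mentally.\n"
                "Action: Calculator\nAction Input: 17 * 23")
    result = scratchpad.rsplit("Observation:", 1)[1].split("\n")[0].strip()
    return f"I now know the final answer\nFinal Answer: {result}"

answer, trace = react_loop(scripted_llm, {"Calculator": calculator}, "What is 17 * 23?")
print(trace)
print("Final Answer:", answer)
```

The `max_steps` cap is one simple guard against the loop wandering away from the original task or calling tools indefinitely.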
Challenges and Considerations

- ReAct can easily derail from the main task and pursue self-created tasks not aligned with the original goal.
- ReAct tends to call external tools more often than necessary, even when the LLM's own knowledge would be enough to answer.
- Implementing ReAct prompting may require a significant number of LLM calls, leading to increased costs and potential delays in obtaining the final answer.
- Complexity in Implementation: Implementing ReAct prompting requires a more complex setup than traditional prompting methods. It involves configuring the LLM to interact with external tools and ensuring secure and efficient communication between the model and these tools.

Process

1. User query

A user asks a question:

Who are the directors of Tesla? What are their LinkedIn handles? What are the financial goals of Tesla this year? What is the next auto show that Tesla will participate in?

2. LLM call to figure out the next step

LangChain's AgentExecutor first generates the prompt by filling in the values for the placeholder variables {tools}, {input}, and {agent_scratchpad}. The prompt then looks like this:

Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer.

You have access to the following tools:

Company Directors Information - Retrieve the names of company directors for a chosen company. Optionally, their LinkedIn handles can also be included. Use the format: company_name, true/false. Available companies: Tesla, General Motors
WebSearch - Performs a web search on the query.
Vector Reranker Search - Retrieves information from an embedding based vector DB containing financial data and company information.
Structure query as a sentence Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [Company Directors Information, WebSearch, Vector Reranker Search] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the question. Follow these steps: Begin! Question: Who are the directors of Tesla. What are their linkedin handles? What are the financial goals of tesla this year. What is the next auto show that Tesla will participate in. Thought: After generating the prompt it sends the generated prompt to LLM, parses its response, and add the response to scratchpad in the following format: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Here: The Thought explains the reasoning and identifies the needed information. The Action specifies the name of the external tool that can help with the needed information in Thought . The Action Input tells the system what specific data is needed by this external tool. 3. Function calling AgentExecutor executes the action using the tool suggested by LLM in its last response. It retrieves the response generated by the tool, which is Company Directors Information tool and adds the response as Observation to the scratchpad. Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. 
First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought: NOTE: The ellipses ("...") in this document indicate the information has been redacted to conserve space and it can be seen in Langsmith's interface 4. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response (Thought, Action and Action input) to scratchpad: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? 5. Function calling AgentExecutor executes the action using the tool suggested by LLM in its last response. It retrieves the response generated by the tool, which is Vector Reranker Search tool and adds the response as Observation to the scratchpad. 
The response in this case consists of retrieved reranked chunks from vector database for the query What are the financial goals of Tesla for this year? Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? Observation: required to spend RMB 14.08 billion in capital expenditures by the end of 2023, which has been achieved in 2023, and to generate RMB 2.23 billion of annual tax revenues … audited the Company's internal control over financial Thought: 6. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response (Thought, Action and Action input) to scratchpad: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. 
Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? Observation: required to spend RMB 14.08 billion in capital expenditures by the end of 2023, which has been achieved in 2023, and to generate RMB 2.23 billion of annual tax revenues … audited the Company's internal control over financial Thought:I have gathered information on Tesla's financial goals for this year, which include capital expenditures and tax revenue generation. Now, I will search for information on the next auto show that Tesla will participate in. Action: WebSearch Action Input: Next auto show Tesla will participate in 2023 The LLM determines that the newly obtained information from Vector Reranker Search aligns with its financial goals and identifies the remaining need as finding details about the next auto show using the Websearch tool. 7. Function calling AgentExecutor executes the action using the tool suggested by LLM in its last response. It retrieves the response generated by the tool, which is WebSearch tool and adds the response as Observation to the scratchpad. The response in this case consists of top 3 Google search results for the query: Next auto show Tesla will participate in 2023 … Thought:I have gathered information on Tesla's financial goals for this year, which include capital expenditures and tax revenue generation. Now, I will search for information on the next auto show that Tesla will participate in. 
Action: WebSearch Action Input: Next auto show Tesla will participate in 2023 Observation: Title: Tesla to Join 2023 Detroit Auto Show Snippet: Tesla will join the 2023 Detroit Auto Show that will take place September 13-24. The company will showcase its vehicles at the Powering ... Link: https://driveteslacanada.ca/news/tesla-to-join-2023-detroit-auto-show/ … Link: https://detroitautoshow.com/ Thought: 8. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response. But this time, the parse response is different. Instead of generating Action and Action-input, it generates the Final answer, since the LLM observes that it has all the information it needed to answer user's question. Thought: I have gathered the necessary information to answer the question. Final Answer: 1. Directors of Tesla and their LinkedIn handles: - Elon Musk (LinkedIn: [Elon Musk](https://www.linkedin.com/in/elon-musk-a93a0b221)) - Robyn Denholm (LinkedIn: [Robyn Denholm](https://au.linkedin.com/in/robyn-denholm-a807795)) - Ira Ehrenpreis (LinkedIn: [Ira Ehrenpreis](https://www.linkedin.com/in/iraehrenpreis)) - Joseph Gebbia (LinkedIn: [Joseph Gebbia](https://www.linkedin.com/in/john-coppola-jr-4b2ab395)) - James Murdoch (LinkedIn: [James Murdoch](https://www.linkedin.com/in/jamesrmurdoch)) - Kimbal Musk (LinkedIn: [Kimbal Musk](https://www.linkedin.com/in/kimbal)) - JB Straubel (LinkedIn: [JB Straubel](https://www.linkedin.com/in/jb-straubel-b694981)) - Kathleen Wilson-Thompson (LinkedIn: [Kathleen Wilson-Thompson](https://www.linkedin.com/in/kathleen-wilson-thompson-275654201)) 2. Financial goals of Tesla for this year include capital expenditures of RMB 14.08 billion and generating RMB 2.23 billion in annual tax revenues by the end of 2023. 
Tesla aims to improve vehicle performance, decrease production costs, and increase affordability and customer awareness. 3. The next auto show Tesla will participate in is the 2023 Detroit Auto Show, which takes place from September 13-24. Reflexion Reflexion is a framework that enables language models (LLMs) to learn and improve their performance on various tasks through self-reflection and iterative optimization. The key idea behind Reflexion is to convert binary or scalar feedback from the environment into informative verbal feedback, which the LLM agent then uses to update its context and improve its performance in subsequent attempts. The Reflexion framework consists of four main components: Actor: An LLM that generates text and actions based on the current state observations and its memory. The Actor samples actions from its current policy and receives observations from the environment. Various models, such as Chain of Thought and ReAct, can be used as the Actor. Evaluator: A component that assesses the quality of the generated outputs produced by the Actor. The Evaluator takes a generated trajectory as input and computes a reward score reflecting the Actor's performance on the given task. The Evaluator can incorporate both internal and external assessment mechanisms. Internal evaluation can be using self-reflection or confidence signals like log probabilities and entropy measures that assess output quality without external reference points. External evaluation involves independent validation through unit tests, searching the web for relevant information to fact-check, or using LLM-as-a-judge approaches that provide assessment based on predefined criteria. Self-Reflection: An LLM that generates verbal self-reflections to provide feedback for future trials. Given the current trajectory, evaluation and the agent's persistent memory, the Self-Reflection model generates specific and informative feedback. 
This feedback is stored in the agent's memory for future reference. Memory: The memory component in Reflexion consists of short-term memory (trajectory history) and long-term memory (outputs from the Self-Reflection model). These memory components provide context that is both specific and influenced by lessons learned over multiple trials, giving Reflexion agents an advantage over other LLM action-choice methods. The Actor uses the updated memory to inform its decisions in the next trial. Source: Paper link Reflection implementation: LlamaIndex link This iterative process of trial, evaluation, self-reflection, and memory persistence allows the agent to rapidly improve its performance on various tasks. This approach draws inspiration from human cognitive processes, particularly the distinction between "System 1" and "System 2" thinking patterns first popularized by psychologists. System 1 represents quick, instinctive reactions, while System 2 embodies slower, more deliberate analysis. While this additional computational step may increase response time, it often proves valuable for complex tasks where accuracy and thoroughness matter more than speed. Reflection: Reflection is a simpler version of Reflexion where there is no separate evaluation before self-critique. The reflector simply provides feedback based on its own assessment, which can be helpful but might not address specific shortcomings or gaps. At its core, reflection serves as a prompting strategy that enables AI systems to evaluate and refine their responses through structured self-criticism. Multi-Agent Collaboration: The Reflection pattern can be implemented in a multi-agent framework, where one agent is responsible for generating outputs and another agent provides constructive criticism. This back-and-forth between the agents can lead to increasingly refined and improved responses. 
Sample implementation: "Review your previous answer and find problems with your answer" "Based on the problems you found, improve your answer." Source: link Process 1. User query A user asks a question: Who are the directors of Tesla? What are their LinkedIn handles? What are the financial goals of Tesla this year? What is the next auto show that Tesla will participate in? 2. LLM call to figure out the next step + Function calling + … LangChain's AgentExecutor first generates the prompt by filling in the values for the placeholder variables {Tools}, {Question}, and {Scratchpad}. It then sends the prompt to the LLM, parses the output, calls the tool as instructed by the LLM, receives the tool's output, adds that output to the scratchpad as an Observation, and sends the prompt back to the LLM. This cycle repeats until it reaches the following point. Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer. You have access to the following tools: Company Directors Information - Retrieve the names of company directors for a chosen company. Optionally, their LinkedIn handles can also be included. Use the format: company_name, true/false. Available companies: Tesla, General Motors WebSearch - Performs a web search on the query. Vector Reranker Search - Retrieves information from an embedding based vector DB containing financial data and company information. Structure query as a sentence Director's previous work and education - Retrieves director's education and work experience using their LinkedIn URL. 
Use the format: url Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [Company Directors Information, WebSearch, Vector Reranker Search, Director's previous work and education] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the question. Follow these steps: Begin! Question: Make an analysis on the independence of backgrounds of directors at tesla Thought:To analyze the independence of backgrounds of directors at Tesla, I need to gather information about the directors, including their names, previous work experience, and education. This will help determine the diversity and independence of their backgrounds. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … Thought: To analyze the independence of backgrounds of Tesla's directors, I need to gather information about their previous work experience and education. This will help assess the diversity and independence of their backgrounds. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/elon-musk-a93a0b221 Observation: Professional Experience: Education: Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the other directors as well. 
Action: Director's previous work and education Action Input: https://au.linkedin.com/in/robyn-denholm-a807795 Observation: Professional Experience: Tech Council of Australia, Tesla, Blackbird, Telstra, Telstra Education: Australian Institute of Company Directors, UNSW, UNSW Business School Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/iraehrenpreis Observation: Professional Experience: Menlo Micro, DBL Partners, Tesla Motors, Tesla Motors, Tesla Motors Education: Stanford University Graduate School of Business, Stanford Law School, UCLA Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/john-coppola-jr-4b2ab395 Observation: Professional Experience: JMC Detailing LLC, Koons Chevrolet, Koons Tyson Jeep, Chili's Education: Radford University - College of Business and Economics, George C. Marshall High School, marshall high school Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/jamesrmurdoch Observation: Professional Experience: Lupa Systems, 21st Century Fox, 21st Century Fox, News America Incorporated, News America Incorporated Education: Harvard University Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. 
Action: Director's previous work and education Action Input: https://www.linkedin.com/in/kimbal Observation: Professional Experience: The Kitchen Community, The Kitchen, Tesla Motors, SpaceX, Me.dium Education: The French Culinary Institute, Queen's University Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/jb-straubel-b694981 Observation: Professional Experience: Redwood Materials, QuantumScape, Straubel Foundation, Tesla, Volacom Education: Stanford University, Stanford University, EDGEWOOD HIGH SCHOOL Thought: To complete the analysis on the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the last remaining director, Kathleen Wilson-Thompson. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/kathleen-wilson-thompson-275654201 Observation: Professional Experience: McKesson, Self-employed, Tesla, Tesla, Walgreens Boots Alliance Education: University of Michigan, Wayne State University, Wayne State University Law School Thought: 3. LLM call to figure out the next step … Observation: Professional Experience: McKesson, Self-employed, Tesla, Tesla, Walgreens Boots Alliance Education: University of Michigan, Wayne State University, Wayne State University Law School Thought: To analyze the independence of backgrounds of Tesla's directors, I have gathered information about their previous work experience and education. Here's a summary of the findings: 1. **Elon Musk**: Information on his professional experience and education was not retrieved, but he is widely known as the CEO of Tesla, SpaceX, and other ventures. 2. 
**Robyn Denholm**: Has experience in telecommunications and technology sectors, having worked at Telstra and the Tech Council of Australia. She holds degrees from UNSW and the Australian Institute of Company Directors. 3. **Ira Ehrenpreis**: Has a background in venture capital and law, with experience at DBL Partners and education from Stanford University and UCLA. 4. **Joseph Gebbia**: His professional experience includes roles in the automotive and hospitality industries, with education from Radford University. 5. **James Murdoch**: Comes from a media and entertainment background, having worked at 21st Century Fox and News America Incorporated. He is a Harvard University graduate. 6. **Kimbal Musk**: Has experience in the culinary and technology sectors, with roles at The Kitchen and SpaceX. He studied at The French Culinary Institute and Queen's University. 7. **JB Straubel**: Has a strong background in technology and sustainability, with experience at Tesla and Redwood Materials. He studied at Stanford University. 8. **Kathleen Wilson-Thompson**: Has experience in healthcare and retail, having worked at Walgreens Boots Alliance and McKesson. She holds degrees from the University of Michigan and Wayne State University Law School. The directors of Tesla have diverse backgrounds spanning technology, telecommunications, media, culinary arts, venture capital, and healthcare. This diversity suggests a level of independence in their backgrounds, as they bring varied experiences and perspectives to the board. Final Answer: Tesla's board of directors has diverse and independent backgrounds, with experience in technology, telecommunications, media, culinary arts, venture capital, and healthcare, contributing varied perspectives to the company. 4. Prompt for Reflection Your task is to analyze whether the `Answer` is missing some information related to the `Question`. Give feedback on the missing requirments of the answer. Mention only the essential information. 
Here is the previous interaction: Question: Make an analysis on the independence of backgrounds of directors at tesla Answer: Tesla's board of directors has diverse and independent backgrounds, with experience in technology, telecommunications, media, culinary arts, venture capital, and healthcare, contributing varied perspectives to the company. Reflection: Provide brief, concise thoughts on what additional information needs to be collected in the next iteration. Based on your reflection, conclude with one of the following actions: If the current Answer provides sufficient information for Original Input, state "STOP". If further refinement is needed, provide 2-3 brief thoughts for improvement, each on a new line, and end with "CONTINUE". Output 5. Reflection assessment … Output The answer provides a general overview of the backgrounds of Tesla's directors, highlighting their diverse experiences and educational backgrounds. However, it lacks specific analysis on the independence of these backgrounds in terms of decision-making and influence within the board. - Include an analysis of how the diverse backgrounds contribute to independent decision-making and governance at Tesla. - Discuss any potential conflicts of interest or affiliations that might affect the independence of the directors. - Provide examples or evidence of how the directors' independent backgrounds have influenced Tesla's strategic decisions or policies. CONTINUE 6. Prompt for thinking on Reflection Improve `Previous Answer` based on `Reflections`. Dont look for information already present in `Previous Answer`. Formulate a new Final Answer. Reflections: The answer provides a general overview of the diversity and independence of Tesla's directors' backgrounds. However, it lacks specific details and analysis. - Include specific examples of directors and their backgrounds to illustrate the diversity and independence. 
- Analyze how these diverse backgrounds impact decision-making and governance at Tesla. - Discuss any potential challenges or benefits that arise from having such a diverse board. CONTINUE 7. Response to Reflection Previous Answer: Tesla's directors have diverse and independent backgrounds, with experiences spanning technology, automotive, media, culinary, and healthcare industries, and education from prestigious institutions. This diversity can contribute to a broad range of perspectives and expertise in the boardroom. The directors of Tesla have diverse and independent backgrounds, which contribute to a broad range of experiences and perspectives within the board. Here's a summary of their backgrounds: 1. **Elon Musk**: Holds bachelor's degrees in physics and economics and attended graduate school in physics at Stanford University. He has a strong background in technology and entrepreneurship, having founded companies like SpaceX and Tesla. 2. **Robyn Denholm**: Has experience in telecommunications and technology, having worked at Telstra and Blackbird. She studied at the Australian Institute of Company Directors and UNSW. 3. **Ira Ehrenpreis**: Has a background in venture capital and law, with education from Stanford University Graduate School of Business and Stanford Law School. 4. **Joseph Gebbia**: His professional experience includes roles in the automotive and hospitality industries, with education from Radford University. 5. **James Murdoch**: Comes from a media and entertainment background, having worked at 21st Century Fox and News America Incorporated. He studied at Harvard University. 6. **Kimbal Musk**: Has experience in the culinary and technology sectors, with education from The French Culinary Institute and Queen's University. 7. **JB Straubel**: Has a strong background in engineering and technology, having worked at Tesla and founded Redwood Materials. He studied at Stanford University. 8. 
**Kathleen Wilson-Thompson**: Has experience in healthcare and law, having worked at Walgreens Boots Alliance and McKesson. She studied at the University of Michigan and Wayne State University Law School. Overall, the directors bring a mix of expertise from technology, media, law, healthcare, and entrepreneurship, which supports a diverse and independent board at Tesla.
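The reflection steps walked through above (answer, reflect, then stop or refine) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the notebook's implementation: `parse_reflection`, `reflect_and_refine`, `answer_fn`, `reflect_fn`, and `max_rounds` are hypothetical names, and the two callables stand in for the LLM calls made with the answer and reflection prompts. The only behaviour taken from the text is that a reflection ends with "STOP" or "CONTINUE".

```python
from typing import Callable, List, Tuple


def parse_reflection(text: str) -> Tuple[bool, List[str]]:
    """Split a reflection into (should_continue, feedback_lines).

    The last non-empty line is treated as the verdict. Anything that does
    not end with CONTINUE is treated as a signal to stop.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return False, []
    verdict = lines[-1].upper()
    should_continue = verdict.endswith("CONTINUE")
    has_verdict = verdict.endswith(("CONTINUE", "STOP"))
    feedback = lines[:-1] if has_verdict else lines
    return should_continue, feedback


def reflect_and_refine(
    question: str,
    answer_fn: Callable[[str, str, List[str]], str],  # (question, previous answer, feedback)
    reflect_fn: Callable[[str, str], str],            # (question, answer) -> reflection text
    max_rounds: int = 3,                              # hypothetical safety cap
) -> Tuple[str, List[str]]:
    """Run answer -> reflection -> improved answer until STOP or the cap."""
    answer = answer_fn(question, "", [])
    history = [answer]
    for _ in range(max_rounds):
        should_continue, feedback = parse_reflection(reflect_fn(question, answer))
        if not should_continue:
            break
        # Feed the reflections back in, as in the "thinking on Reflection" prompt.
        answer = answer_fn(question, answer, feedback)
        history.append(answer)
    return answer, history
```

The cap on rounds is a design choice added here so that a reflector that never says "STOP" cannot loop forever; the source does not state how many iterations it allows.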

20 min read


Exercise 3: RAG with Query Decomposition & Tracing with LangSmith

Article

The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications: Exercise 1: Simple RAG for 10-K filings Exercise 2: RAG with Reranker for 10-K filings Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse Code with Explanation is posted here: Colab Notebook Link Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action) Exercise 5: RAG with Agentic Pattern: ReAct + Reflection These exercises incrementally build on basic RAG with a focus on "why" before "what" and "how". This exercise, the third in the series, focuses on illustrating how decomposing complex queries into simpler sub-queries makes a difference in the quality of the response generated by a RAG system. This exercise extends the last exercise (RAG with Reranker) by adding query decomposition. We encourage readers to go through Reranking Retrieved Chunks using Reranker (Cross-Encoder model) before going through the code. Query Decomposition When users interact with RAG systems, they often pose complex questions that encompass multiple aspects or require information from different areas of the knowledge base. Consider a query like "How do Tesla and GM's approaches to manufacturing and production compare, particularly for electric vehicles? Where are their vehicles produced?" This question combines several distinct informational needs: manufacturing methodologies, EV-specific production approaches, and factory locations for two different companies. Direct vector similarity search with such compound queries can be suboptimal, as a single embedding may not effectively capture all query dimensions simultaneously. 
Query decomposition addresses this challenge by leveraging a Large Language Model (LLM) to break down complex queries into simpler, more focused sub-queries that can be processed independently before being synthesized into a comprehensive response. The decomposition process typically starts by prompting the LLM to analyze the user's question and identify its core components. A sample prompt, shown below, guides the LLM to generate a set of atomic sub-queries that collectively cover all aspects of the original question. For the automotive manufacturing comparison, the LLM might generate targeted sub-queries like: What is Tesla's approach to EV manufacturing and production? What is GM's approach to EV manufacturing and production? Where are Tesla's manufacturing facilities located? Where are GM's manufacturing facilities located? This approach enables more precise matching with relevant chunks in the vector database, as each sub-query can be vectorized to capture specific semantic aspects more accurately. The retrieved chunks for each sub-query are then combined and reranked to provide a complete context for the language model to generate a coherent response that compares and contrasts both companies' manufacturing strategies and facility locations. The use of LLMs for query decomposition offers several advantages over rule-based or keyword-based approaches. LLMs can understand implicit relationships within questions, identify logical dependencies between different query components, and generate sub-queries that maintain the original intent while being optimized for retrieval. Sample prompt for query decomposition: You are an expert at converting user questions into specific database queries for similarity search. Break down the `user-question` into distinct sub-queries that address different aspects of the original question. Ensure that the set of sub-queries comprehensively covers the main aspects of the original question. 
user-question: ```<paste user-question here>```
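The decomposition flow described above can be sketched as follows. This is a minimal illustration, not the notebook's code: `decompose`, `retrieve_decomposed`, `llm`, `retrieve`, and `top_k` are hypothetical names, the LLM and the vector search are passed in as plain callables, and the sketch assumes the LLM returns one sub-query per line. The reranking step that follows retrieval is left out here.

```python
import re
from typing import Callable, List

FENCE = "`" * 3  # triple backtick, built programmatically to keep this block readable

DECOMPOSE_PROMPT = (
    "You are an expert at converting user questions into specific database "
    "queries for similarity search. Break down the `user-question` into "
    "distinct sub-queries that address different aspects of the original "
    "question. Ensure that the set of sub-queries comprehensively covers the "
    "main aspects of the original question.\n"
    "user-question: " + FENCE + "{question}" + FENCE
)

# Leading list markers such as "1. ", "2) ", "- " or "* ".
_LIST_MARKER = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")


def decompose(question: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM for sub-queries and parse one sub-query per output line."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    sub_queries: List[str] = []
    for line in raw.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        if cleaned and cleaned not in sub_queries:
            sub_queries.append(cleaned)
    return sub_queries


def retrieve_decomposed(
    question: str,
    llm: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],  # (sub-query, top_k) -> chunks
    top_k: int = 3,
) -> List[str]:
    """Retrieve chunks per sub-query and merge them, dropping duplicates."""
    seen = set()
    merged: List[str] = []
    for sub_query in decompose(question, llm):
        for chunk in retrieve(sub_query, top_k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

The merged list would then be passed to a reranker scored against the original question, so that the final context reflects the user's full intent and not only the individual sub-queries.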

2 min read


Exercise 2: RAG with Reranker for 10-K filings

Article

The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications: Exercise 1: Simple RAG for 10-K filings Exercise 2: RAG with Reranker for 10-K filings Code with Explanation is posted here: Colab Notebook Link Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action) Exercise 5: RAG with Agentic Pattern: ReAct + Reflection These exercises incrementally build on basic RAG with a focus on "why" before "what" and "how". This exercise, the second in the series, focuses on illustrating how reranking makes a difference in the quality of the response generated by a RAG system. This exercise extends the last exercise by adding a reranker. We encourage readers to go through Reranking Retrieved Chunks using Reranker (Cross-Encoder model) before going through the code. Reranking Retrieved Chunks using Reranker (Cross-Encoder model) While vector similarity search provides a good initial set of relevant chunks, it can sometimes miss nuanced semantic relationships or return chunks that are only superficially similar to the query. Consider a user asking "List out major changes that occurred in Tesla in 2023." A vector search might rank chunks discussing changes from 2022 higher than a more relevant chunk about a Director selling common stock in 2023, simply because the 2022 chunks share more semantic similarities around the concept of "changes" and "Tesla." This highlights a limitation of pure vector similarity matching. This is where rerankers come into play, serving as a crucial refinement layer in the RAG pipeline. 
A reranker takes the initial set of retrieved chunks from the vector database and performs a more sophisticated, computationally intensive analysis to improve the ranking quality. The reranking process often employs cross-encoders, which are transformer models that simultaneously process both the query and a candidate chunk to produce a relevance score. This approach captures more subtle semantic relationships and contextual nuances. It can correctly identify that the Director's stock sale in 2023 is more relevant to the query than changes from 2022, despite fewer surface-level semantic similarities. A natural question arises: why not use these more sophisticated reranker models for the initial retrieval instead of vector search? The answer lies in computational efficiency. Using a reranker as the primary retrieval mechanism would require passing each query through the model alongside millions of individual chunks in the vector database, computing similarity scores one at a time. This process would be prohibitively expensive and slow, especially for large-scale applications that need to maintain responsive query times. This is why modern RAG systems typically employ a two-stage retrieval process that combines the best of both approaches. They first use rapid vector similarity search to quickly identify a promising set of candidates (e.g., top 100 chunks), then apply the more sophisticated reranker to this smaller set to determine the final top-k chunks (e.g., top 5-10) that will be provided as context to the language model. This hybrid approach balances computational efficiency with retrieval quality, ensuring that the system provides accurate and relevant responses while maintaining reasonable response times. Technical Details Reranking model The key distinction between Embedding models (typically bi-encoders) and Reranking models (typically cross-encoders) lies in how they process queries and chunks. 
Bi-encoders process each text independently - the query and chunk are fed through the model separately to generate their respective embeddings. These embeddings can then be compared using similarity metrics like cosine similarity. This approach allows for efficient retrieval since chunk embeddings can be pre-computed and indexed, but it limits the model's ability to capture complex interactions between the query and document. Cross-encoders take a fundamentally different approach by processing the query and chunk together as a single input. By concatenating the query and chunk with a separator token, the model can leverage its attention mechanisms to directly compare and contrast every term in the query with every term in the chunk. This enables the model to capture nuanced relevance patterns and contextual relationships that might be missed when processing texts independently. For example, if a query asks about "Tesla's competitors' plant locations," a cross-encoder can directly attend to chunks mentioning locations of Tesla's competitors (e.g., Ford's plant in Michigan, Rivian's facility in Illinois) while downweighting or deprioritizing chunks that primarily describe Tesla's plant locations. This results in a reranked list where chunks about competitors' locations are moved higher than those focusing on Tesla itself, better aligning with the user's intent. When it comes to model training objectives, embedding models and cross-encoder models serve different purposes despite often starting from the same base architectures like BERT. Embedding models are fine-tuned specifically to generate high-quality sentence or paragraph level embeddings that capture semantic meaning in a fixed-dimensional vector space. The training process typically involves contrastive learning objectives that push similar texts closer together and dissimilar texts further apart in the embedding space. 
Cross-encoder models, on the other hand, are fine-tuned to directly predict a relevance score given a query-document pair. Rather than generating embeddings, the model learns to output a single similarity score that indicates how well the document answers the query. This direct optimization for the ranking task typically leads to better ranking performance, though at the cost of computational efficiency since pairs must be processed together.
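The two-stage process described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the notebook's code: `two_stage_retrieve`, `pair_scorer`, `n_candidates`, and `top_k` are hypothetical names. Stage 1 compares pre-computed embeddings independently (the bi-encoder pattern), and stage 2 scores each (query, chunk) pair jointly through `pair_scorer`, which in a real system would be a cross-encoder model.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query: str,
    query_vec: Sequence[float],
    chunks: List[str],
    chunk_vecs: List[Sequence[float]],         # pre-computed chunk embeddings
    pair_scorer: Callable[[str, str], float],  # stand-in for a cross-encoder
    n_candidates: int = 100,
    top_k: int = 5,
) -> List[str]:
    # Stage 1: cheap vector similarity over every chunk.
    order = sorted(
        range(len(chunks)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    candidates = [chunks[i] for i in order[:n_candidates]]

    # Stage 2: expensive joint scoring, but only on the shortlisted candidates.
    reranked = sorted(candidates, key=lambda c: pair_scorer(query, c), reverse=True)
    return reranked[:top_k]
```

The cost argument from the text shows up directly in the structure: `cosine` runs once per chunk in the index, while `pair_scorer` runs at most `n_candidates` times per query.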

3 min read


Essential Skills for AI Conductors (i.e., AI Managers, AI Consultants): SPADE & CATE frameworks

Article

'AI Won't Replace Humans, But Humans Who Know AI Will.' While this idea is widely accepted, it is unclear what "Who Know AI" actually means. Through in-depth interviews with AI experts, we identified two complementary frameworks that directly counter the four common misconceptions we uncovered in a previous article, "Why 30% of AI Projects Fail: 4 Common AI Misconceptions among Executives". These two frameworks are: The SPADE framework (Sequence Decomposition & Experimentation, Prompt Design Strategy, AI Understanding, Data Strategy, and Evaluations) encompasses the technical capabilities needed to effectively guide AI implementations, from breaking down complex processes to designing robust evaluation systems. The CATE framework (Critical Thinking, Adaptability & Learning, Troubleshooting & Analysis, and Ethical Awareness) addresses the equally crucial soft skills that enable AI conductors to bridge the gap between technical possibilities and business value, manage stakeholder expectations, and ensure ethical implementation. Together, these frameworks provide a comprehensive blueprint for building teams capable of driving successful AI transformations. SPADE: Technical Skills Sequence Decomposition & Experimentation AI conductors must be able to deconstruct complex human-dependent cognitive and operational processes into discrete, manageable components that can be completed by a combination of GenAI and other approaches such as coding, traditional machine learning, and third-party services. This decomposition allows organizations to identify where AI can add the most value while maintaining human oversight where needed. However, identifying the right components is just the beginning: AI conductors must systematically experiment with different approaches to optimize outcomes. Consider what appears to be a straightforward task: selecting an image for a corporate blog post. 
An AI conductor might break this down into several potential execution sequences. One approach could follow a content-first path: using AI to extract key themes and messages from the post, applying these insights to filter a stock image database, and then employing AI-assisted selection for the final choice. An alternative sequence might prioritize emotional resonance: first analyzing the post's intended emotional impact, then using these emotional markers to guide image selection, and finally validating the choice against the post's content themes. AI conductors must coordinate with AI Engineers to test these various approaches, measuring their effectiveness against both technical metrics and business objectives. This requires carefully weighing factors such as implementation costs, processing time, accuracy rates, and—crucially—alignment with human expectations and organizational standards. Prompt Design Strategy AI conductors developing prompt design strategies need to focus on three core components: reasoning, instruction, and context management. For reasoning, AI conductors must understand that while modern AI models demonstrate impressive capabilities in developing plans and breaking down complex tasks, they have specific limitations when handling tasks requiring significant human judgment. For complex scenarios, AI conductors may need to guide the AI model to first generate its thought process, evaluate its generated thoughts, and reiterate before generating the final response. They must also consider when to incorporate external guidance for handling tacit knowledge and organization-specific information that may not be captured in the model's training data. This external guidance can be approached in two ways: through explicit rules articulated by domain experts who understand the nuances of organizational practices, or through patterns discovered by applying traditional machine learning techniques to historical organizational data. 
When determining instructions, AI conductors must consider several key factors: the interdependence between tasks, the reasoning load each component places on the model, and data dependencies. Such considerations help them decide whether to combine multiple tasks in one prompt or to split instructions across multiple prompts.

AI conductors also need to carefully calibrate context to the business objective: excessive contextual information can overwhelm the model and lead to suboptimal results, while insufficient context can produce incomplete or misaligned responses. The key is finding the right balance that aligns with both technical capabilities and business requirements.

AI Understanding

AI conductors must have a comprehensive understanding of Generative AI models' capabilities to effectively drive internal adoption and integration of AI solutions. While these models excel at extracting information from text according to instructions, their ability to extract information from images is rapidly evolving and less widely understood. These extraction capabilities eliminate traditional roadblocks like data collection, feature engineering, and model training.

Understanding generation capabilities across text, code, images, audio, and video is crucial for guiding different business units. For customer support, AI conductors may help choose between text-only models and multimodal models based on the nature of the data and the industry domain. For UX generation, AI conductors can help marketing teams build prototypes quickly without coding experience. For code generation, they need to help developers overcome initial resistance and embrace AI tools. While text-to-image may not yet be reliable for corporate applications, AI conductors can help teams use AI to generate infographics, flowcharts, and diagrams for various business documents.

Data Strategy

For an AI conductor, developing a robust data strategy begins with ensuring data representativeness while carefully managing privacy concerns.
Modern approaches have moved beyond traditional human annotation to embrace hybrid systems that leverage large foundation models for generating data, with human annotators focusing only on low-confidence cases requiring expert judgment. When real data isn't available or suitable, AI conductors must know how to leverage AI to generate synthetic data that maintains real-world patterns while protecting privacy. This approach is particularly valuable for testing AI systems across rare scenarios or checking for biases across different demographic groups. The strategy must also include sophisticated approaches to data sanitization, removing personally identifiable information while preserving valuable patterns and insights.

Evaluations

AI conductors must develop comprehensive evaluation frameworks that go beyond traditional metrics to assess both technical performance and business value. This includes designing nuanced criteria for subjective qualities like tone and creativity, implementing robust monitoring systems to detect subtle degradation in performance, and establishing clear tracing mechanisms for debugging complex AI workflows.

Unlike traditional machine learning, where metrics are straightforward and objective, generative AI requires multi-faceted evaluation approaches. These might combine traditional metrics with AI-based evaluations (LLM-as-judge) and human assessments to catch subtle errors or "hallucinations." AI conductors must also implement pre-generative guardrails to prevent misuse and ensure safety, while maintaining robust monitoring systems to detect issues like "silent degradation," where automated scores remain acceptable while real-world utility declines.

CATE: Soft Skills

Critical Thinking

For AI conductors in the AI era, critical thinking has become more crucial than ever.
While organizations traditionally relied on a few key decision-makers to determine the "why" and "what" of projects, with larger teams focused on the "how" of execution, AI is shifting this dynamic. As AI tools become increasingly sophisticated at handling execution tasks, the bottleneck isn't in implementation but in strategic thinking: identifying which problems to solve, why they matter, and what approach to take. AI conductors must think critically to prioritize high-impact initiatives, collaborate with stakeholders to validate strategic decisions, and work with technical teams to develop effective implementation plans.

Communication Skills

AI conductors must excel at multifaceted communication, serving as bridges between technical teams, business stakeholders, and end users. They need to effectively translate technical concepts into business value propositions when speaking with executives, while also conveying business requirements and constraints clearly to technical teams. Crucially, they must be exceptional listeners, skilled at building rapport with domain experts to understand their nuanced decision-making processes. This involves managing expectations about AI capabilities (neither overselling nor underselling what AI can achieve) and clearly communicating both the possibilities and limitations of AI solutions.

Adaptability & Learning

In the rapidly evolving AI landscape, AI conductors must possess exceptional adaptability and commitment to continuous learning. Keeping up with the field resembles drinking from a firehose: new tools emerge constantly, models improve dramatically, and novel possibilities unfold weekly. AI conductors must maintain an experimental mindset, consistently exploring new use cases while staying current with the latest developments in AI capabilities.
The most effective AI conductors are those who can quickly adapt their strategies as technologies evolve, finding creative ways to work around current limitations while preparing for future capabilities.

Troubleshooting & Analysis

AI conductors need strong analytical abilities to effectively collaborate with technical teams in diagnosing and resolving AI system issues, particularly when content extraction or generation falls short of expectations, or when guardrails fail to perform as intended. While technical teams handle the deep debugging, AI conductors must understand enough to have productive conversations about potential issues: whether system prompts might need restructuring, if prompt instructions could be overwhelming the model, or if reasoning guidance may be insufficient. The goal isn't for AI conductors to solve technical problems themselves, but rather to serve as an effective bridge between business needs and technical solutions.

Ethical Awareness

AI conductors must possess a strong foundation in ethical awareness to ensure responsible AI implementation within their organizations. This involves anticipating potential societal impacts of AI systems and proactively addressing concerns before they become problems. They need to champion ethical considerations at the strategic level: developing organization-wide AI ethics policies, establishing review processes for high-risk applications, and creating clear escalation pathways for ethical concerns. AI conductors should also foster a culture of responsible innovation where teams feel empowered to raise ethical concerns without fear of impeding progress. This requires balancing the pressure for rapid AI deployment with the need for thoughtful consideration of long-term implications and societal impact.
Moving Forward

For executives leading AI transformations, the path forward is clear: success depends not on accumulating technical talent alone, but on cultivating AI conductors who can orchestrate the interplay between AI capabilities and business strategy. This shift is particularly crucial as generative AI increasingly handles operational "how" tasks that previously required human execution. As this capability expands, organizations face a new bottleneck: the growing demand for professionals who excel at determining the "why" and "what" of AI initiatives. This requires three key executive actions with specific implementation steps:

Changes Needed for Hiring AI Conductors/Strategists

Traditional IT hiring emphasizes deep technical expertise, but AI success demands professionals who can translate between business needs and technical possibilities. Executives should:

• Revise job descriptions to prioritize candidates who demonstrate both the technical literacy outlined in SPADE and the orchestration capabilities detailed in CATE
• Implement assessment methods that evaluate a candidate's ability to communicate complex AI concepts to different stakeholders
• Establish competitive compensation structures that recognize the unique value of these hybrid skill sets
• Set a target ratio of AI conductors to technical AI specialists (1:3 is often effective in early implementations)

Grooming AI Conductors/Strategists Internally

Create dedicated paths for developing AI conductors internally with clear milestones and timelines. Rather than relying solely on external hires, organizations should create two distinct development tracks to nurture AI conductors from existing talent:

a) The Engineer-to-Conductor Path: Identify engineers who demonstrate strong aptitude for CATE skills (particularly communication and critical thinking).
These technically skilled professionals often already possess much of the SPADE framework, but need structured development in:

• Role-playing exercises simulating interactions with various stakeholders (executives, domain experts, end users)
• Shadowing business strategy meetings to understand organizational priorities
• Mentorship from business leaders to develop business acumen
• Progressive responsibility in presenting AI concepts to non-technical audiences
• Training in ethical frameworks and responsible AI principles

b) The Analyst-to-Conductor Path: Identify business analysts who show technical curiosity, comfort with coding fundamentals, and willingness to engage with technical documentation. These business-savvy professionals need intensive development in SPADE skills through:

• Structured technical training in AI fundamentals and capabilities
• Hands-on workshops for prompt engineering and evaluation design
• Paired work with AI engineers to understand technical limitations and possibilities
• Progressive responsibility in translating business requirements into technical specifications
• Guided practice in decomposing complex workflows into AI-enabled components

Finally, evolve governance structures to empower AI conductors. Traditional project management frameworks, with their emphasis on fixed requirements and linear execution, often constrain AI initiatives. Executives need to establish new governance models that enable rapid experimentation while maintaining appropriate controls. This includes redefining success metrics beyond technical performance to encompass business impact and ethical considerations.

Organizations that execute these shifts will not only accelerate their AI initiatives but also build lasting competitive advantages through superior orchestration capabilities. In an era where AI tools are increasingly commoditized, the ability to effectively direct these tools toward strategic objectives will become the key differentiator.
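For readers who want a concrete picture of the "silent degradation" check described under Evaluations, the following is a minimal Python sketch, not a prescribed implementation. It assumes a team records two weekly aggregate scores on a 0 to 1 scale: one from automated evaluation and one from sampled human review. The `WeeklyScores` type, the `silent_degradation` function, the window size, and both thresholds are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WeeklyScores:
    """Hypothetical weekly aggregates, both assumed to lie between 0 and 1."""
    automated: float  # e.g. a metric-based or LLM-as-judge score
    human: float      # e.g. average rating from sampled human review


def silent_degradation(history, window=3, auto_tolerance=0.02, human_drop=0.05):
    """Return True when automated scores hold steady while human scores fall.

    Compares the mean of the first `window` weeks (the baseline) with the
    mean of the last `window` weeks. Thresholds are illustrative assumptions.
    """
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    baseline, recent = history[:window], history[-window:]
    auto_change = mean(w.automated for w in recent) - mean(w.automated for w in baseline)
    human_change = mean(w.human for w in recent) - mean(w.human for w in baseline)
    # Flag only when the automated score did not drop but the human score did.
    return auto_change >= -auto_tolerance and human_change <= -human_drop


# Illustrative data: automated scores stay flat while human ratings decline.
drifting = [
    WeeklyScores(0.90, 0.85), WeeklyScores(0.90, 0.84), WeeklyScores(0.90, 0.86),
    WeeklyScores(0.90, 0.72), WeeklyScores(0.90, 0.70), WeeklyScores(0.90, 0.68),
]
print(silent_degradation(drifting))
```

The design point is that neither score is trusted alone: a flag is raised only when the two signals diverge, which is the situation where continued human review earns its cost.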


Why 30% of AI Projects Fail: 4 Common AI Misconceptions among Executives

Article

AI's transformative power is often likened to groundbreaking innovations such as the printing press, the steam engine, or the internet. Each of these innovations transformed society in unique ways: the printing press democratized knowledge, the steam engine industrialized production, the internet connected humanity, and AI is now augmenting human intelligence across every sector. For example, in healthcare, AI algorithms can analyze medical images with greater speed and accuracy than human doctors, leading to earlier diagnoses and better patient outcomes. Similarly, in the legal field, tasks that once required weeks of meticulous effort, such as sifting through thousands of documents to locate a critical email, can now be accomplished in moments with AI.

AI's revolutionary potential is well-recognized by business leaders. However, the rapid evolution of AI technologies and their relative newness pose significant challenges for companies attempting to integrate AI effectively into their workflows. A Gartner report forecasts that by the end of 2025, at least 30% of generative AI (GenAI) projects will be abandoned.

Our research, based on a survey of C-Suite Executives from medium-sized companies across industries including retail, healthcare, food, hospitality, and education, alongside AI experts from Microsoft, Google, Facebook, AWS, Salesforce, Alibaba and top consultancy companies including Deloitte, McKinsey, PwC, Accenture, BCG, Cognizant, Bain and IBM, revealed significant gaps in business leaders' understanding of AI's capabilities and limitations. Compounding these misconceptions is a growing shortage of professionals who truly understand how to leverage AI effectively.

Since Harvard Business School's Dean famously remarked, 'AI Won't Replace Humans, But Humans Who Know AI Will,' this sentiment has gained widespread traction in business circles.
However, there remains little clarity around what it actually means to 'know AI.' To address this gap, we interviewed many AI experts and discovered that successful AI implementations require more than AI Engineers: they need what we call "AI conductors," professionals who orchestrate the complex interplay between AI capabilities, business strategy, and human factors. Our research helped identify the comprehensive set of technical and soft skills that define these conductors. These skills extend beyond technical proficiency, emphasizing the ability to design AI-driven workflows, develop robust evaluation frameworks, and ensure that AI-driven decisions are consistently aligned with human values, ethical considerations, and business objectives.

Our research revealed a more fundamental challenge: executives must first overcome their own blind spots about AI before they can effectively develop AI conductors in their organizations. Based on our interviews, four critical misunderstandings consistently undermine AI initiatives.

The Four AI Blind Spots Many Executives Have

AI as a Data Analysis Tool

First, executives often view AI primarily as a data analysis tool. This mindset underestimates AI's true potential. AI's real strength lies in its ability to simulate human decision-making and automate complex tasks that were once considered too intricate. Unlike traditional automation, which relied on explicit rules and structured inputs, AI, specifically Generative AI (GenAI), can handle ambiguous, unstructured, and context-heavy tasks with minimal human intervention. For the first time in history, machines can infer intent, adapt to new scenarios, and integrate multimodal data (text, images, and audio), allowing them to approach the complexity of human decision-making. This shift enables AI to go beyond routine task automation and actively enhance human judgment in areas requiring reasoning, pattern recognition, and contextual awareness.
GenAI makes augmenting human decision-making possible in virtually every domain, including HR, finance, accounting, technology, healthcare, legal, customer service, and scientific research.

GenAI is Similar to Traditional Machine Learning (ML)

Second, executives incorrectly assume that generative AI implementation mirrors traditional machine learning (ML). Unlike traditional ML's objective metrics like accuracy and F1 scores, GenAI requires subjective, multi-faceted evaluation combining automated metrics, LLM-based assessments, and human review to evaluate aspects like tone, coherence, and creativity while catching hallucinations and subtle errors. GenAI's ability to accept free-form input necessitates robust pre-generative guardrails to prevent misuse, detect prompt injection attempts, assess societal risks, and enforce regulatory and brand guidelines, along with post-generation checks for factual accuracy, bias, and brand consistency.

Additionally, monitoring GenAI systems is more complex. One reason is the risk of "silent degradation," where automated scores remain acceptable despite declining real-world utility, which requires continuous human review. Another is the need to trace multiple, dynamically selected steps in the generation process, which makes debugging and root-cause analysis significantly more challenging than traditional ML's typically deterministic and relatively straightforward evaluation processes.

Plug-and-Play Perception

Third, executives often view AI as a plug-and-play solution. C-suite leaders frequently assume that AI implementation is primarily a technical endeavor where engineers and data scientists develop and maintain models, while business professionals merely assist with data identification and occasional feedback. This mindset, fueled by ambitious claims about AI's capabilities, creates the illusion that AI can be seamlessly integrated without significant human oversight.
However, AI experts emphasize that even as AI technology evolves rapidly, significant involvement of AI conductors (i.e., GenAI strategists) remains essential across the entire AI-enabled workflow. This includes development, deployment, and ongoing monitoring phases. The reality is that successful AI integration requires creating many new processes and tools, understanding cognitive workflows, evaluating existing systems, conducting feasibility studies, establishing appropriate scope, securing stakeholder buy-in, identifying quality data sources, and planning resources effectively. These activities highlight that AI functions best as an enabler rather than an independent operator, requiring continuous human guidance to ensure alignment with strategic objectives, practical constraints, and ethical considerations.

Hire Engineers & Data Scientists

Fourth, many executives mistakenly believe that successful AI implementation depends primarily on technical expertise and is mostly about model training and upkeep. This often leads to the strategy of hiring as many data scientists and ML/AI engineers as possible, while relying on existing technology leaders to oversee AI initiatives. Though these leaders excel at managing technical projects, they often lack deep understanding of GenAI's unique characteristics. They may fail to recognize that augmenting human decision-making requires decoding cognitive processes and blending various technologies. Moreover, existing leaders typically struggle to envision emerging use cases enabled by rapidly advancing multimodal AI capabilities.

Our research shows that the most successful implementations occur when organizations balance technical talent with AI conductors – professionals who may not code extensively but possess strong technical knowledge of generative AI capabilities and limitations.
These AI conductors play a vital role by steering implementations, managing executive expectations, advocating for essential resources (particularly in AI evaluation systems), and ensuring proper controls are in place. They serve as strategic bridges between technology and business objectives, preventing the equivalent of deploying powerful AI systems without proper safeguards – like driving on a freeway with faulty brakes. Organizations that recognize the need to hire or groom these AI conductors alongside technical talent consistently achieve more successful, responsible AI integrations.

Through in-depth interviews with AI experts, we identified two complementary frameworks that directly counter these misconceptions by providing a clear blueprint for the capabilities AI conductors need to develop. You can read about these complementary frameworks, SPADE and CATE, here:



Case Study: "Conductors, Not Coders": Redefining Talent Acquisition in the Age of AI

Article

TL;DR

The narrative revolves around GloboTech’s leadership team, led by CEO Sarah Klein, grappling with the need to pivot their hiring strategy to prioritize "AI conductors," consultants adept at orchestrating AI workflows and solutions, over traditional coding-centric "musicians." This shift introduces challenges in candidate evaluation, risk of pushback from traditional talent pools, and the need for market positioning to retain competitive advantage. Through detailed discussions, the case delves into:

• The trade-offs between traditional coding skills and AI workflow redesign capabilities.
• The design of assessments to measure hybrid skills like critical thinking and problem-solving over coding tests.
• The operational and reputational risks of implementing innovative hiring strategies.
• Market positioning challenges as the firm transitions to a leadership role in AI consulting.

Contact Rohit Aggarwal for Feedback, Collaboration or Teaching Notes for this case

Prologue

On a brisk February morning in Boston, the leadership team at GloboTech Services gathered around a sleek conference table for a critical hiring strategy meeting. The company had built its reputation by providing high-level IT consultants to global enterprises, helping them tackle everything from infrastructure modernization to AI-driven system implementations. Despite their success, CEO Sarah Klein sensed a shift in the market.

"We've always prided ourselves on hiring the best coders, our 'star musicians,'" she began, scanning the room. "But I'm not convinced that's still the formula for success in consulting, where AI tools are taking over a lot of the coding tasks."

Next to her, Ian Roberts, Chief Technology Officer (CTO), flipped through slides showing recent developments in generative AI. "Our technical assessments have been great at verifying data-structure knowledge and coding prowess," he said. "But with AI writing so much of the code now, we might need a different skill set: people who can guide AI rather than manually produce every line themselves."

Carla Sanchez, the VP of Human Resources, chimed in. "We need to rethink our entire recruitment strategy," she said. "We're focusing too heavily on coding under time pressure. We're not measuring critical thinking, problem-solving, or communication, the skills consultants need when orchestrating AI solutions for clients."

The Setting

Company Profile

• GloboTech Services: A mid-sized IT services firm specializing in advanced AI-driven consulting for enterprise clients
• Headquarters: Boston, MA
• Founded: 2015 by CEO Sarah Klein, a former venture capitalist turned IT services entrepreneur
• Scale: 50,000 employees globally, $5 billion in annual revenue

Leadership Team

Sarah Klein (CEO)
• Visionary who believes in staying ahead of tech trends
• Concerned about maintaining the company's reputation for innovation

Ian Roberts (CTO)
• Deep technical background; formerly a lead engineer at a major tech giant
• Believes in maintaining rigorous technical standards, but open to reevaluating what "technical" means in the age of AI

Carla Sanchez (VP of Human Resources)
• Responsible for recruitment strategy and employee development
• Wants to attract top "AI conductors" over "musicians" who rely solely on coding

Ming Li (Director of Operations)
• Manages day-to-day processes, including test administration logistics for consultants
• Concerned about cheating and test integrity in remote hiring

Amelia Price (Director of Strategy & Partnerships)
• Monitors industry trends and competitive moves
• Sees a big opportunity for GloboTech to become a thought leader in AI consulting

Unfolding the Drama: The Leadership Meeting

1. The Changing Nature of Skills

Sarah placed her phone face down on the table. "I just got off a call with a major enterprise client. They're revamping their own hiring for AI roles, telling me it's not just about coding.
They need consultants who can figure out what to build and why, more than how to build it in excruciating detail."

Ian projected a chart on the screen. "Our consultant screening has historically revolved around coding challenges: data structures, algorithms, design patterns. But with the pace of AI, we might attract amazing 'musicians' (fantastic coders) who aren't necessarily good 'conductors.' They might write perfect code but struggle to guide AI agents, collaborate with humans, and orchestrate solutions where AI handles a substantial percentage of tasks traditionally done by human programmers. We need consultants who can step back from the instrument and conduct the entire AI-human orchestra."

He paused. "I'm worried about how candidates will react if we suddenly pivot from traditional coding tests to assessments focused on critical thinking and communication. These developers have spent years practicing data structures and algorithms, preparing for the kind of technical interviews that every major tech company gives. If they encounter our new format, heavy on problem-solving and light on coding, they might think something's wrong. We might see high drop-off rates mid-process, or candidates might abandon our assessments entirely when they don't recognize the format. They've invested so much time mastering one type of interview, and we're asking them to showcase completely different skills."

Carla responded quickly: "We're not throwing coding out the window. But we have to rebalance. Consulting engagements demand critical thinking, analytical reasoning, and communication, especially in scenarios where the code is assisted by AI."

2. The Mismatch and Fear of Candidate Pushback

Sarah tapped the table with her pen. "Our job postings and screening tests still scream 'traditional coding challenge.' If we pivot to these intangible, AI-focused skills, we risk confusing applicants.
They might ask, 'Is GloboTech even a serious tech consulting firm?'"

Amelia added, "Top technical talent often wants to go somewhere they can do cutting-edge programming. If they think we're turning them into project managers or 'AI conductors,' they might walk away before they grasp what we're really offering."

Cheating, AI Tools, and Resume Inflation

1. The New Age of "Resume Customization"

Ming cleared her throat. "We're also seeing more AI-driven cheating. Last week, we got two identical cover letters from different applicants, clearly generated by the same AI. And these candidates aced our coding tests but failed miserably in real-life consulting simulations."

Ian shook his head. "They could've used AI copilots to solve those coding problems. Ironically, we do want consultants to leverage AI effectively, but we also need to see authentic problem-solving. Right now, we're struggling to tell genuine skill from AI-generated fluff."

2. Heightened Proctoring Requirements

Carla jumped in. "We could implement more intense proctoring: live video, screen captures, advanced plagiarism detection. But that raises privacy concerns. Candidates might be uncomfortable being recorded at home, especially when we run hours-long tests."

Ming nodded. "And we need to ensure compliance with data protection laws. Recording for a quick interview is one thing. Doing continuous screen captures is another."

The Core Challenges Emerge

Carla stood up and walked to the whiteboard. "Before we go further, let's tackle the assessment design challenge head-on. How do we actually evaluate these hybrid skills we're talking about?"

She wrote a few bullet points as she spoke. "We need someone who can code, yes, but also someone who can think strategically about AI implementation. Traditional coding tests won't cut it anymore. Let me give you a concrete example."

Carla pulled up a recent project slide. "When we're implementing generative AI solutions, we're essentially trying to replicate complex human decision-making processes. Our consultants need to be able to step back and decompose these decisions into logical sequences. It's like being a cognitive architect."

She drew a flowchart on the whiteboard. "Look at our recent customer service automation project. Our traditional developers wanted to jump straight into coding a solution. But our more AI-savvy consultants took a different approach. They first mapped out the human service rep's decision-making process, broke it down into discrete steps, and then experimented with different combinations of tools – sometimes using coding, sometimes traditional machine learning models, sometimes generative AI APIs. They kept iterating until they found the right sequence."

Ian nodded vigorously. "Exactly. And it's not just about technical implementation. These consultants need to design custom evaluation metrics that align with human preferences. They're constantly experimenting – trying different prompts, adjusting model parameters, fine-tuning the interaction between various AI tools. It's a completely different skill set from traditional software development."

"And the results speak for themselves," Carla continued. "The team that took this experimental, methodical approach delivered a solution that actually matched how human service reps make decisions. The team that rushed into coding ended up with a rigid system that couldn't handle nuanced cases."

Ian leaned forward. "What if we let candidates use AI tools during the assessment? It could show us how they approach AI-assisted problem solving in real time. Though," he added with a frown, "we'd need to figure out which tools to allow and how to evaluate their usage effectively."

"That's exactly the kind of thing that keeps me up at night," Ming interjected. "We hire thousands of consultants annually across our global offices.
How do we maintain consistency in these new assessments across regions? We'll need to retrain every interviewer, every recruiter. And let's be honest – evaluating soft skills consistently at this scale is already challenging. Adding AI orchestration skills to the mix makes it even more complex."

Amelia, who had been quietly taking notes, looked up. "There's also the client side to consider. Just this morning, I had a call with a client who was concerned about our transition. They asked point-blank: 'If you're moving away from pure coding skills, how can you justify your current billing rates?' We need to articulate the value proposition of these new 'AI conductors' clearly."

"That's a crucial point," Sarah agreed. "We're not just changing our hiring process – we're transforming how clients perceive the value of our consultants. Some clients might worry we're becoming less technical, when in reality, we're evolving to meet their future needs."

Ming rubbed her temples. "And let's not forget about our existing client projects. How do we manage expectations during this transition? We can't just flip a switch and suddenly have all our consultants become AI experts overnight."

"The billing question is particularly tricky," Ian added. "Traditionally, we've justified our rates based on our consultants' technical expertise. Now we need to help clients understand why strategic AI orchestration skills command similar or even higher rates."

The Market Reality

The challenge was particularly acute given GloboTech's position in the market. Unlike pure software companies, IT services firms needed consultants who could bridge the gap between business needs and technical implementation, a skill set becoming even more critical in the AI era.

The market signals were impossible to ignore: Job postings mentioning AI had seen a staggering 1,000% increase in 2023 compared to the end of 2022, according to CIO.com.
LinkedIn's latest workforce report highlighted AI Engineers and AI Consultants among the top 10 fastest-growing jobs in the U.S.

This explosive growth in AI-related positions wasn't just a trend; it represented a fundamental shift in how technology services would be delivered. Clients weren't just asking for AI capabilities; they were demanding consultants who could navigate the entire AI transformation journey.

As a mid-tier player competing against giants like Accenture, Cognizant, and IBM, GloboTech had to be strategic about its transformation. These larger competitors had deeper pockets for training programs and could afford to experiment with different approaches.

Balancing Risks and Rewards

1. Thought Leader vs. Status Quo

As the debate wore on, Amelia spoke up. "What if we lead the charge? Publish white papers, host webinars on how AI is changing the consulting skill set, and promote the idea of an 'AI conductor.' That might mean we lose some candidates, but it could also attract the right ones and elevate our brand."

Sarah considered it. "We'd be positioning ourselves as the front-runners in an evolving market. But are we ready if top coders turn away or if our current teams resist the change? And what if our competitors hang back, learn from our hiccups, and then adopt a more polished version of our model?"

2. Suboptimal Hires if We Delay

Ian cautioned against waiting too long. "We're already seeing mismatched skill sets in projects. If we continue hiring the old way, we'll get consultants who can code but can't coordinate AI solutions effectively. That's a recipe for losing contracts over time."

He pulled up a recent case on his tablet. "Take what happened with StandardCorp last month. Our team was technically strong: great coders who could build traditional systems. But when it came to implementing AI-driven HR automation, they struggled.
They approached it like a traditional software development project, writing custom code for everything instead of effectively orchestrating existing AI tools and APIs. After three months of delays and cost overruns, StandardCorp pulled the plug on our $2 million contract and hired AIShape, a boutique AI consultancy. They completed the project in six weeks using mostly off-the-shelf AI models and custom prompts. That's just one example, but it's a pattern I'm seeing more frequently." "That StandardCorp loss really hurt us," Carla added. "Not just the revenue, but our reputation. Their CIO told me directly that while our consultants were 'brilliant programmers,' they seemed to be 'fighting against AI instead of leveraging it.' That's exactly the perception we need to change." The Tipping Point Midway through the conversation, Sarah's phone buzzed again. She glanced at a text from a major strategic partner: "Heard rumors you're pioneering a new approach to AI-hiring. Would love to see it." Sarah's eyes lit up—word had leaked out. The room fell silent as the leaders processed the potential brand perception issues. They each realized the importance of messaging this shift correctly to avoid driving away strong applicants. Amelia noticed Sarah's reaction. "We don't operate in a bubble," she said. "Clients and partners are watching. If we move quickly, we could become their go-to experts on AI adoption—something that sets GloboTech apart." Implementation Challenges Surface Ming raised her hand. "There's another critical challenge we haven't discussed – the assessment process itself. These new tests focusing on critical thinking and communication will have significant subjective components. How do we evaluate them consistently and fairly?" "I've been exploring some innovative approaches," Carla responded. "What if we used a combination of LLMs and human assessors? 
The AI could provide an initial structured evaluation of communication patterns and problem-solving approaches, while our human assessors focus on nuance and cultural fit." "Interesting," Ian leaned forward. "But how do we ensure the AI evaluations align with our human preferences? We'd need to carefully calibrate these systems." Ming nodded. "And scale is a huge concern. We're talking about thousands of assessments across different offices. We need standardization without losing the human touch." "That brings up another crucial point," Carla added. "We need to completely retrain our interviewing staff. Our assessors are experts at evaluating coding skills, but evaluating AI orchestration abilities? That's a whole different ball game. We need to develop new interview questions, testing procedures, evaluation rubrics – and then train everyone consistently." Sarah turned to Amelia. "What about the client side? How do we message this transformation to them?" Amelia straightened in her chair. "That's tricky. Some clients still equate technical capability with coding skills. We need to help them understand why an AI orchestrator might be more valuable than a traditional programmer. I had a client ask me yesterday, 'If they're not primarily coding, why should we pay the same rates?'" "We need concrete metrics," Sarah mused. "Not just technical delivery metrics, but business impact measurements. Something that shows how these new skills translate to better solutions." "Exactly," Amelia agreed. "And we need to manage expectations during the transition. Some clients might see mixed teams – traditional developers working alongside our new AI orchestrators. We need to clearly articulate the value of each role and how they complement each other." Ian pulled up a dashboard on his laptop. "Look at the StandardCorp project we discussed earlier. The traditional team spent 70% of their time coding custom solutions. 
The AI-focused team spent 60% of their time on solution design and orchestration, using existing tools and APIs. The result? Faster delivery, better scalability, and higher user satisfaction. These are the kinds of metrics we need to highlight." "But communicating this to clients requires our account teams to understand it deeply themselves," Carla pointed out. "We need to train them too – help them articulate the value proposition of these new hybrid skills." Discussion Questions Revising the Screening Process What specific interview or assessment formats could GloboTech introduce to better evaluate a consultant's capacity to leverage AI tools and think strategically, rather than just coding? How can the firm still assess coding fundamentals while placing greater weight on problem-solving and communication? Addressing Cheating and AI Misuse What proctoring tools or strategies can discourage AI-assisted cheating without undermining candidate privacy or trust? How can GloboTech position these measures so they feel fair and transparent? Talent Supply and Demand With a large pool of applicants possessing basic coding skills, how can GloboTech raise its hiring bar to find top-tier "AI conductors" without pushing away excellent coders? What ongoing training and development might be needed to keep new hires aligned with evolving AI needs? Market Positioning How should GloboTech differentiate itself from both traditional IT services competitors and pure technology companies? What messaging would attract candidates with the right blend of technical and strategic capabilities? Strategic Timing In deciding whether to be a "first mover," which risks and rewards should GloboTech weigh most heavily? How might GloboTech mitigate potential downsides if it chooses to be a thought leader in AI-driven hiring? Epilogue: The Decision Awaits By the meeting's conclusion, the leadership team at GloboTech Services was left with more questions than answers. 
The urgency was evident: AI had reshaped the consulting landscape, reducing the emphasis on raw coding prowess and highlighting the need for "AI orchestration" skills like communication and strategic thinking. As the group filed out of the conference room, Sarah lingered, still processing the partner's text and the implications of intrusive proctoring. She wondered how a company culture so steeped in coding pride would adapt to the idea of "conductors" over "musicians." In a fast-changing industry, which path—gradual or bold—would ensure success?

8 min read

Exercise 1: Simple RAG for 10-K filings

Article

The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

Exercise 1: Simple RAG for 10-K filings (Code with Explanation is posted here: Colab Notebook Link)
Exercise 2: RAG with Reranker for 10-K filings
Exercise 3: RAG with Query Decomposition & Tracing with Langsmith/Langfuse
Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning and Action)
Exercise 5: RAG with Agentic Pattern: ReAct + Reflection

These exercises incrementally build on basic RAG with a focus on “why” before “what” and “how”. This first tutorial focuses on developing a basic end-to-end RAG pipeline. It is divided into three parts to provide a comprehensive understanding of building a simple RAG system for 10-K filings:

RAG Fundamentals: There are two key processes for RAG.
a) Pre-processing data for RAG: downloads 10-K filings, splits filings into chunks, generates chunk embeddings, and stores chunks in a vector database for retrieval.
b) User query processing: generates user query embeddings, retrieves chunks similar to the query embedding, and creates prompts for models to generate a response.
High-Level Overview of Underlying Models

We strongly encourage readers to go through RAG fundamentals before diving into the code.

RAG Fundamentals

10-K SEC filings are comprehensive annual reports that provide an in-depth overview of a publicly traded company's operations, financial performance, and risks. These documents are essential for investors, analysts, and regulators, offering insights into business strategies, legal issues, financial health, and future outlook.
However, their length and complexity often make extracting specific information time-consuming and challenging, especially when dealing with multiple filings across different companies. RAG systems address these challenges by combining traditional retrieval methods with the generative capabilities of large language models (LLMs). By structuring and embedding text from 10-K filings into a searchable database, RAG systems can quickly retrieve and synthesize relevant information, enabling users to answer complex queries efficiently. In this exercise, we will work with 10-K filings from companies like Tesla and GM, utilizing their SEC-hosted webpages as data sources. Pre-processing Data for RAG (Retrieval-Augmented Generation) Pre-processing text for RAG systems (e.g., company policy documents, emails, website content, and reports) involves key steps to prepare and organize data for efficient querying and retrieval. The primary steps include chunking, embedding generation, and vector database integration. Here’s a breakdown: Chunking Chunking is the process of breaking down large texts into smaller, manageable pieces that are easier to process and retrieve. In knowledge bases with lengthy documents, breaking them into smaller chunks enables RAG models to query and retrieve only the most relevant sections for user queries. This targeted retrieval promotes contextually coherent responses while reducing off-topic content and conserving computational resources, making the process more efficient and scalable. A key consideration in chunking is determining the appropriate chunk size to balance context preservation and semantic specificity. Semantic specificity refers to how distinctly and unambiguously a text conveys an idea. Larger chunks excel at maintaining discussion context and keeping related ideas together, which helps models understand references and pronouns. 
This is particularly valuable for tasks like document summarization or question answering that require comprehensive topic understanding. However, larger chunks can encompass multiple themes, potentially diluting the semantic focus of their embeddings and leading to less precise retrievals when queries target specific aspects. Conversely, smaller chunks typically focus on single ideas, generating highly focused and semantically rich embeddings that can be matched precisely with specific queries. The drawback is potential loss of broader context, where important background information or pronoun references might fall outside the chunk's scope. This can result in retrieved chunks that, while semantically relevant, may miss crucial context for coherent responses. The optimal chunk size depends on the specific application requirements and often involves experimentation. To address the risk of splitting important information across chunks, an overlapping-sentences approach is often used. This involves adding a portion of the end of one chunk to the beginning of the next, helping preserve context and semantic integrity of ideas that span chunk boundaries. This helps the model maintain a better understanding of the text as a whole, enhancing information continuity before moving into the vectorization phase of the RAG model's data pre-processing pipeline. Generating Chunk Embeddings using Embedding Model Think of embeddings as a way to translate text into a sequence of numbers that computers can understand and compare. When you convert text into embeddings (also referred to as vectors), you're essentially creating a numerical "fingerprint" that captures the meaning of that text. In a RAG (Retrieval Augmented Generation) system, embeddings serve three key functions: They convert chunks of your company's documents (manuals, reports, policies) into these numerical fingerprints. They similarly convert user questions into numerical fingerprints.
They allow rapid searching by comparing these fingerprints to find relevant matching chunks. Let's say an attorney has a new case about a contract dispute where a software company failed to deliver custom AI features they promised to build for a client. The attorney has this case summary: “Contract dispute: Client paid $2M for custom AI software development. Contract specified 6-month delivery. Vendor delivered incomplete features after 8 months, failing to meet specifications. Client seeking damages.” When this query is converted to an embedding, it captures key legal concepts like breach of contract, delayed delivery, and incomplete work. The system compares this numerical pattern against thousands of past cases' embeddings to find similar precedents. Precisely speaking, the system compares it against embeddings of chunks from past legal cases and finds chunks with similar numerical patterns about breach of software development contracts, delayed project deliveries, and incomplete or non-conforming deliverables. By comparing embeddings of chunks rather than entire cases, attorneys can quickly pinpoint not only the precedent cases but also the most relevant sections in these cases. This helps attorneys rapidly identify relevant precedents without reading through thousands of unrelated cases. Storing Chunk Embeddings in Vector Database After generating embeddings for text chunks, storing them effectively becomes crucial for a RAG system's performance. While traditional relational databases are excellent for structured data, they face significant challenges when handling embeddings due to their high-dimensional nature. For context, embeddings generated from BERT-Base models produce vectors containing 768 numbers, while BERT-Large models create even larger vectors with 1024 elements. Traditional databases simply weren't designed to efficiently manage and query data with such high dimensionality. 
This is where vector databases come into play, offering a specialized solution designed specifically for handling these high-dimensional vectors. These databases implement sophisticated indexing techniques that allow for rapid similarity searches, making them particularly well-suited for RAG applications. When a user submits a query, the system needs to quickly identify and retrieve the most semantically similar chunks from potentially millions of stored embeddings. Vector databases excel at this task, providing the necessary infrastructure for swift and accurate information retrieval that would be impractical or impossible with traditional database systems. Popular vector database solutions include FAISS and Pinecone, which are specifically optimized for storing and querying these high-dimensional embeddings. These databases implement efficient similarity search mechanisms, typically using cosine similarity measures, enabling them to rapidly identify and retrieve the most relevant chunks of information in response to user queries. This capability is essential for maintaining the responsiveness and effectiveness of RAG systems, particularly when dealing with large-scale knowledge bases. Handling User's Query After preprocessing data and setting up the vector database infrastructure, the RAG system needs to handle real-time user queries effectively. This process happens in four key stages: query vectorization, vector database retrieval, prompt creation, and response generation. Generating Query Embeddings using Embedding model First, query vectorization converts incoming user questions or requests into the same type of numerical representations (embeddings) used for the stored knowledge base chunks. This step is crucial and must use the exact same embedding model that was employed during the preprocessing phase. 
For instance, if BERT-Base was used to generate the 768-dimensional vectors for your stored chunks, the same model must be used for converting user queries into embeddings. This consistency ensures that both the stored chunks and user queries exist in the same semantic space, making similarity comparisons meaningful and accurate. Using different embedding models for queries versus stored chunks would be like trying to compare distances between points on two different maps with different scales – the results would be unreliable. Retrieving Relevant Chunks using Vector Database Once the query has been converted into an embedding, the vector database performs a similarity search to find the most relevant chunks from the knowledge base. This search typically employs cosine similarity or other distance metrics to identify stored vectors that are closest to the query vector in the high-dimensional space. Modern vector databases can execute these similarity searches extremely efficiently, even across millions of chunks. The system then retrieves the original text chunks corresponding to the most similar vectors, providing the contextually relevant information needed for the RAG model to generate its response. Creating Effective Prompts with Retrieved Context After retrieving the most relevant chunks, the next crucial step is constructing an effective prompt that helps the language model generate accurate and contextually appropriate responses. This process requires careful consideration of how to structure and combine the retrieved information with the user's query. The basic structure of a RAG prompt typically consists of three main components: instructions for the model, the retrieved context, and the user's query.
Think of this like preparing a subject matter expert for a consultation – you first explain how they should approach the task (instructions), provide them with relevant reference materials (retrieved context), and then present the specific question they need to address (user's query). Consider this approach: Give an answer for the `question` using only the given `context`. Use only the provided `context` to answer the `question`. If the information needed isn't in the `context`, acknowledge this limitation rather than making assumptions. Provide a detailed answer with thorough explanations, avoiding summaries. question: {question} context: {context} Answer: The instructions at the top set the foundation for how the model should process and utilize the retrieved information. This helps ensure the model stays grounded in the retrieved information rather than hallucinating or drawing from its pre-trained knowledge. The context section would typically join the ranked chunks with a blank line between them (two newline characters, \n\n) before inserting them into the prompt template. This preserves the ranking while creating a readable and processable format for the language model. Response Generation After generating the prompt with its carefully structured components, the RAG system passes this combined input to a Large Language Model (LLM) for response generation. The LLM processes the instructions, context (retrieved chunks), and user query together to produce a coherent, contextually appropriate response that addresses the user's needs. The LLM leverages the context to ground its responses rather than relying solely on its pre-trained knowledge. This approach significantly reduces hallucination risks since the model is explicitly instructed to base its response on the provided context. If the retrieved context lacks sufficient information to fully address the query, the model is instructed to acknowledge these limitations instead of making unsupported claims.
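The prompt structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the notebook's actual code: the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and the template wording is copied from the example prompt in this section.

```python
# Hypothetical names for illustration; the wording follows the example prompt above.
PROMPT_TEMPLATE = (
    "Give an answer for the `question` using only the given `context`.\n"
    "Use only the provided `context` to answer the `question`. "
    "If the information needed isn't in the `context`, acknowledge this "
    "limitation rather than making assumptions.\n"
    "Provide a detailed answer with thorough explanations, avoiding summaries.\n\n"
    "question: {question}\n\n"
    "context: {context}\n\n"
    "Answer:"
)


def build_prompt(question, ranked_chunks):
    """Combine instructions, retrieved context, and the user's query.

    `ranked_chunks` is assumed to be ordered from most to least relevant;
    joining with a blank line keeps that order visible to the model.
    """
    context = "\n\n".join(ranked_chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)


if __name__ == "__main__":
    # Made-up placeholder chunks, not real 10-K text.
    chunks = ["First retrieved chunk.", "Second retrieved chunk."]
    print(build_prompt("What risks does the company report?", chunks))
```

The resulting string is what would be sent to the response generation model; whichever LLM client is used is outside the scope of this sketch.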
The effectiveness of response generation heavily depends on the quality of the prompt engineering discussed earlier. Depending on the requirements, the response from the LLM can be further customized or refined based on additional criteria, such as tone, style, or specific user preferences. Note: The implementation of robust guardrails is crucial when deploying LLMs in RAG systems to ensure responsible and reliable output. A comprehensive validation system should verify that the model's responses strictly align with the provided context, preventing both subtle and obvious forms of hallucination. Additional checks should evaluate responses for potential biases and ethical concerns, including screening for harmful content, discriminatory language, or inappropriate recommendations. These guardrails should also ensure compliance with company policies, regulatory requirements, and societal norms while maintaining appropriate tone and professionalism. The system should be designed to either automatically modify responses that don't meet these criteria or flag them for human review, ensuring a balance between accurate information delivery and responsible AI behavior. Technical Details Embedding model Embedding models are specialized versions of encoder architectures (like BERT) that are fine-tuned specifically to create meaningful vectors (sequence of numbers) of entire sentences or passages, rather than individual words or tokens. Base encoder models create contextual representations at the token level - meaning each word is represented by a vector that depends on surrounding words. However, they aren't trained to directly optimize for sentence-level similarity. In contrast, embedding models are explicitly trained on sentence-pair tasks using contrastive learning. 
During training, they learn to generate sentence vectors that: Place similar sentences close together in vector space Push dissimilar sentences far apart Capture high-level semantic relationships rather than just word-level patterns This targeted training makes them much better at tasks requiring sentence-level semantic understanding, like finding similar legal cases or matching questions to relevant documents. Note: The terms vectors, embeddings, and representations are often used interchangeably, and all refer to sequences of numbers that represent data in a machine-readable format. In large language models (LLMs): Tokenization: Input text is first broken down into smaller units called tokens. The process maps the text to elements of a predefined vocabulary or dictionary. Since the vocabulary may not contain every possible word, tokenization handles out-of-vocabulary (OOV) words by breaking them into subwords, characters, or other smaller components, depending on the tokenization strategy used. Token Embeddings: Each token is then converted into a numerical vector (embedding). At this stage, these embeddings are static, meaning they do not depend on the context provided by surrounding tokens. Contextualized Embeddings: These are embeddings generated after processing token embeddings through the layers of the transformer model. Unlike static embeddings, contextualized embeddings reflect the meaning of each token based on its surrounding tokens in the input sequence. For example, in the phrases "sits by a river bank" and "went to a bank to deposit a check," the word "bank" has different meanings. Contextualized embeddings capture these differences by producing distinct representations for the word "bank" in each context. The choice of embedding model can significantly impact the quality of your vectors and retrieval effectiveness. Since new embedding models are released regularly, you can select an appropriate model from the MTEB leaderboard.
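To make "close together in vector space" concrete, here is a minimal sketch of cosine-similarity retrieval in plain Python. The three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and the function names `cosine_similarity` and `top_k` are hypothetical. A vector database performs the same comparison using specialized indexes rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    # Toy vectors, NOT output of a real embedding model.
    chunk_vecs = [
        [1.0, 0.0, 0.0],  # chunk 0: points almost exactly where the query points
        [0.6, 0.8, 0.0],  # chunk 1: partially aligned with the query
        [0.0, 0.0, 1.0],  # chunk 2: orthogonal to the query
    ]
    query_vec = [1.0, 0.05, 0.0]
    print(top_k(query_vec, chunk_vecs, k=2))
```

Because cosine similarity depends only on direction, the query and the chunks must be embedded with the same model for the scores to be meaningful.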
Response Generation Model The Large Language Models (LLMs) used for response generation in RAG systems are primarily based on decoder architectures, exemplified by models like ChatGPT, Claude, Llama, and Qwen. These decoder models operate fundamentally differently from the encoder-based models used in embedding generation and reranking stages. Their core objective is next-token prediction, where the model can only see and process tokens that come before the current position, unlike encoder models which have full visibility of the entire input sequence. This architectural constraint creates a more challenging training task, as the model must learn to generate coherent and contextually appropriate text while working with limited future context. This limitation actually drives these models to develop stronger reasoning capabilities and deeper understanding of language patterns, as they must make predictions based solely on previous context. A crucial development stage for decoder models is instruction tuning, which enables them to understand and follow specific directives in prompts. Without this specialized training, these models would simply continue the pattern of text generation rather than providing appropriate responses to instructions. For example, when presented with a prompt like "How are you?", a base model might simply complete the phrase with "doing today", while an instruction-tuned model would recognize the question format and respond appropriately with something like "I'm fine, thank you. How about yourself?" This capability is essential for RAG systems where the model needs to interpret prompts that combine retrieved context with specific instructions about how to use that information. The complexity of the text generation task necessitates significantly larger model architectures compared to embedding and reranking models. These decoder models typically employ many more parameters and layers to support their advanced reasoning capabilities. 
The scale difference is substantial - while embedding models might operate with hundreds of millions of parameters, modern decoder models often contain hundreds of billions of parameters. This massive scale translates directly to computational costs, with training expenses often reaching hundreds of millions of dollars. As a result, most organizations opt to access these capabilities through APIs provided by services like ChatGPT and Claude, or leverage open-weight models such as the 405-billion parameter Llama hosted on platforms like Together.ai, rather than training their own models from scratch. The combination of this complex architecture, instruction tuning, and massive scale enables decoder models to perform the sophisticated task of synthesizing information from retrieved context into coherent, relevant responses. In a RAG system, this manifests as the ability to not just understand the retrieved chunks and user query, but to reason about their relationships and generate new text that effectively addresses the user's needs while remaining grounded in the provided context.
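As a recap of the pre-processing stage described earlier in this exercise, the overlapping-sentences chunking idea can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the `chunk_size` and `overlap` parameters, and the naive regex sentence splitter are all hypothetical choices, not the notebook's implementation.

```python
import re


def split_sentences(text):
    """Naive splitter: breaks after ., ! or ? followed by whitespace.

    It will mis-split abbreviations such as "U.S."; a real pipeline
    would likely use a proper sentence tokenizer.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences, chunk_size=3, overlap=1):
    """Group sentences into chunks; each chunk repeats the last
    `overlap` sentences of the previous chunk to preserve context."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # the final sentences are already covered
    return chunks


if __name__ == "__main__":
    # Made-up text, not taken from a real filing.
    text = "Revenue grew. Costs rose. Margins fell. Demand held. Outlook is stable."
    for chunk in chunk_sentences(split_sentences(text), chunk_size=3, overlap=1):
        print(repr(chunk))
```

Larger `chunk_size` values keep more context per chunk, while a larger `overlap` reduces the chance that an idea is cut at a boundary, at the cost of storing more duplicated text.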

17 min read

AI Zombocalypse: The Real Threat is Not AI Itself, But How We Use It

Article

The rise of Artificial Intelligence (AI) is often met with polarized views, with one side heralding its potential to revolutionize the world, and the other warning against its perceived threat to professionals and humanity as a whole. However, there’s a different and far more nuanced concern—one that isn't about AI replacing humans but about AI turning people, particularly new learners, into “mindless zombies.” This phenomenon—let's call it the "AI Zombocalypse"—is characterized by professionals becoming overly reliant on AI tools, ultimately losing their critical thinking and problem-solving abilities. While it may sound hyperbolic, this trend is not just an abstract possibility but a present danger, particularly for those just starting their careers. They are at risk of developing shallow, unstructured thinking patterns that lack the depth, creativity, and analytical rigor necessary to solve complex problems. This article explores how AI-induced mindlessness is a greater threat than AI itself, and how the current generation of learners is uniquely vulnerable to this issue. The Alluring Power of AI and the Danger Beneath AI tools are incredibly effective in getting things done quickly, which creates a sense of exhilaration, especially for those who are new to a field. It provides them with results that look polished on the surface and offer an illusion of completeness. But there's often a catch: when you start to dig deeper into these AI-generated results, you frequently find repetition of the same ideas in different forms, a lack of originality, or a vacuousness that becomes apparent upon closer inspection. Essentially, AI can deliver quantity at the expense of quality, leading to content that may look good on paper but fails to hold water upon critical evaluation. 
This allure of quick, seemingly accurate solutions is akin to a drug—an instant gratification that is hard to resist, especially for new learners who are keen to make an impression or solve a problem quickly. However, just as a drug masks the underlying issues rather than solving them, AI can obscure the learner's understanding, often bypassing essential skills in critical thinking, debugging, and problem decomposition. Evidence of the Problem: Uplevel's Findings The issue of blind AI reliance is supported by real-world data. A study conducted by Uplevel examined about 800 developers over three months using GitHub Copilot, an AI-powered coding assistant by Microsoft. The results were stark: there were "no significant improvements for developers" using Copilot compared to the previous three months without it, and in fact, 41% more bugs were introduced when using AI assistance. This indicates that the AI-generated code was not only less effective but potentially harmful to code quality. New learners are particularly prone to these pitfalls, as they may lack the ability to properly vet AI-generated solutions and instead blindly accept them. This reinforces the point that, far from enhancing developer productivity, AI can actually hinder the development of critical coding and debugging skills, which are essential for quality work. The Disappearance of Debugging Skills and the “Streetlight Effect” Debugging is a skill that separates a good programmer or problem-solver from a mediocre one. It requires systematically breaking down a problem, placing breakpoints, adding logging, and continuously analyzing the state of the system to understand what's going wrong. However, the rise of AI-assisted development tools is eroding this foundational skill. Instead of trying to understand the issue and experiment with possible solutions, learners are too quick to turn to AI for an answer. 
In this context, we often see the "streetlight effect," where learners act like the proverbial drunkard who searches for their keys only where there is light, not necessarily where they dropped them. They focus on where the AI's solution shines—regardless of whether it’s the right area to focus on. The AI provides a suggestion, and instead of critically evaluating it, they blindly implement it, often without truly understanding the underlying problem or even the solution. This type of behavior discourages deep, analytical thinking and stunts their problem-solving growth. A Symptom of a Broader Problem The erosion of debugging skills is not just about software; it reflects a broader loss of critical thinking that will affect every field as AI tools become ubiquitous. The human role in a world dominated by AI will shift from doing the work to guiding AI when it makes mistakes. This guiding role requires strong analytical skills to track state, validate solutions, and detect errors—skills that are being dulled by over-reliance on AI for immediate answers. The issue is not limited to debugging but represents a deeper problem: losing the ability to critically analyze, question, and break down complex issues. The Double Whammy for New Learners New learners are facing a perfect storm: on one hand, they are struggling to find jobs in a post-COVID world where companies are adjusting their expectations, downsizing, and assuming AI will bring productivity gains. Tools like Cursor, Replit Agent, Devin, and All Hands are reducing the need for large, entry-level engineering teams by automating many programming and administrative tasks. On the other hand, the very skills that new learners need to stand out—critical thinking, complex problem-solving, and the ability to debug effectively—are being eroded by their dependence on AI. Rather than developing mental models to decompose complex problems into manageable subtasks, they lean on AI to do the heavy lifting. 
AI's involvement can be particularly insidious because, unlike traditional learning, it does not encourage a systematic approach to problem-solving. It hands over pre-packaged solutions that make sense on a superficial level but fail to build the cognitive pathways necessary for long-term understanding. In a sense, AI is like Gollum's "my precious" from The Lord of the Rings, offering a shortcut that feels empowering but ultimately leads to an addiction that diminishes the user's abilities and critical thinking. The "Idiocracy" Parallel The 2006 satirical film Idiocracy foresaw a world where society’s intellectual rigor had been dulled to an extreme degree, leaving humans incapable of critical thought and complex problem-solving. Eerily, this future seems to be materializing faster than we anticipated, particularly as AI tools make it easier for people to bypass thinking for themselves. Just as Idiocracy predicted the rise of the popular Crocs footwear (which did indeed happen), it also anticipated a world where intellectual complacency would become the norm—thanks to technology, and now, AI. What Needs to Change: A Call for Cognitive Resilience It's clear that AI is here to stay, and its benefits are undeniable. But we must address how AI is affecting new learners and professionals before it becomes too late. To avoid an AI Zombocalypse, learners need to be taught not just how to use AI, but how to use it responsibly and critically. This includes: 1. Encouraging Debugging as a Core Skill Developers must learn to debug effectively, which involves breaking down problems, questioning assumptions, and methodically testing hypotheses. Simply pasting in AI solutions without understanding their implications is counterproductive. 2. Promoting Deep Problem-Solving Over Superficial Solutions AI can often offer quick fixes, but educators and mentors need to stress the importance of deeply understanding the problems at hand. 
Learners should be encouraged to decompose problems into smaller, manageable tasks and to critically analyze AI suggestions before implementing them.

3. Fostering a Healthy Skepticism Toward AI Solutions
Learners should be trained to view AI as a tool, not an infallible oracle. It's crucial to cross-check AI-generated suggestions against one's understanding of the problem, and to not simply accept AI's word as gospel.

4. Building Resilience and Self-Reliance in Learning
New learners should be encouraged to struggle and learn from their struggles. Over-reliance on AI shortcuts hampers the development of the problem-solving tenacity that is crucial in the long run.

Conclusion

The threat posed by AI is not its power to replace humans but its ability to make humans complacent, uncritical, and reliant on easy solutions. The real danger of AI is the rise of "AI zombies": professionals and learners who have lost their cognitive edge, unable to think critically or solve problems without AI's hand-holding. As technology continues to advance, our educational systems and professional development practices must adapt to emphasize critical thinking, deep problem-solving, and debugging skills that resist the allure of AI's quick fixes. The future will belong to those who use AI thoughtfully, critically, and responsibly, not to those who let AI think for them.

5 min read


Opik vs LangSmith: Which Platform Wins for LLM Tracing & Evaluation?

Article

As large language models (LLMs) become central to more applications, the need for robust tools to monitor, evaluate, and optimize them keeps growing. Two standout platforms in this landscape are Opik and LangSmith. Both offer powerful features for developing and managing LLM applications, yet they cater to distinct needs and workflows. In this blog, we'll compare Opik and LangSmith in depth, examining their key features, strengths, and weaknesses. My recent experiments with both tools, focused on classifying emotions in Twitter data, provided valuable insights, particularly in terms of usability. I conducted two primary experiments: one centered on prompt refinement and the other on model comparison. Through these experiments, I aim to highlight ease of use as a critical factor in choosing the right platform for your LLM projects.

Overview of Opik

Opik is an open-source platform for logging, viewing, and evaluating LLM traces throughout both development and production. Its primary objective is to give developers detailed insights to debug, evaluate, and optimize LLM applications. Opik also provides an SDK: once you set up an account, you can start using it directly.

Key Features of Opik:

Self-Hosting Options: Opik offers flexible deployment options for both local and production environments. It supports local deployment via Docker Compose and scalable deployments using Kubernetes, making it adaptable to different scales of use.

Comprehensive Tracing: Opik enables comprehensive logging and trace viewing, allowing developers to annotate traces and track LLM behavior in both local and distributed environments. This gives greater visibility into model performance and helps identify issues quickly during both development and production.
Integrated Evaluation Tools: Opik provides a set of built-in evaluation metrics, including heuristic performance measures and relevance assessments. It also supports metrics for detecting hallucinations and moderating content, and users can define custom metrics based on specific application needs. Testing Frameworks: Opik integrates with Pytest, providing developers with a framework to thoroughly test their LLM applications. This ensures that models are rigorously evaluated before deployment. Integration: Opik simplifies logging, viewing, and evaluating LLM traces with a robust set of integrations. Key features include: OpenAI : Log all OpenAI LLM calls for easy tracking. LangChain : Capture logs from LangChain interactions. LlamaIndex : Monitor LlamaIndex LLM performance. Ollama : Integrate logging for Ollama LLMs. Predibase : Fine-tune and serve open-source LLMs while logging their usage. Ragas : Evaluate Retrieval Augmented Generation (RAG) pipelines effectively. Overall, Opik’s rich set of tools and integrations make it a powerful asset for developers working with LLMs, offering end-to-end support for debugging, optimizing, and scaling LLM applications. * You can access a comprehensive exploration of Opik from this link. Overview of LangSmith LangSmith is a comprehensive platform designed to streamline the development, debugging, testing, and monitoring of production-grade LLM (Large Language Model) applications. It bridges the gap between traditional software development processes and the unique challenges posed by LLMs, particularly around handling non-deterministic, complex workflows. Key Features of LangSmith: Advanced Tracing Capabilities: LangSmith excels in tracing the performance of LLM applications by providing detailed insights into the sequence of calls and inputs/outputs at each step. It supports code annotations for automatic trace generation, with options to toggle traces on or off depending on needs. 
Developers can also control trace sampling rates, ensuring that they log only what’s necessary, particularly useful in high-volume applications. The platform can trace multimodal interactions (e.g., text and image inputs) and distributed systems, ensuring a holistic view of an application’s performance. Dataset Management: LangSmith offers powerful dataset management, allowing developers to create and curate datasets for testing and evaluation. This feature supports few-shot learning experiments, which is essential for optimizing LLM performance. Developers can also organize experiments and results by dataset for better analysis and insights. Evaluation Metrics: Built-in evaluators enable both automated and manual testing of LLM outputs, supporting various metrics like relevance, accuracy, harmfulness, hallucination, and more. LangSmith’s evaluation tools can assess how changes in prompts or model configurations impact overall performance. Playground and Prompts: LangSmith includes an interactive playground that allows developers to tweak and experiment with prompts in real-time. This environment is user-friendly and removes friction from the iteration process, helping teams rapidly optimize their application’s behavior. Scalability: Designed for scalability, LangSmith is built on a cloud architecture capable of handling LLM applications at large scales. It supports robust data retention policies, and its monitoring tools ensure that applications run efficiently and cost-effectively, even under heavy use. Usability: Comparative Experiments I conducted an experiment with Opik and LangSmith and explored their usability while classifying emotions in Twitter data. I conducted two main experiments: one focused on prompt refinement and the other on model comparison. Here’s a breakdown of my findings, emphasizing ease of use rather than performance. 
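The prompt-refinement experiment has a simple shape regardless of platform: run each candidate prompt over labeled tweets and compare accuracy. The sketch below shows that shape only. The sample tweets, the prompt texts, and `fake_model` are hypothetical stand-ins (a real run would call an LLM and log each call to Opik or LangSmith), and the scores it prints say nothing about the actual experiments.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]            # (tweet text, expected label)
Model = Callable[[str, str], str]    # (prompt, tweet) -> predicted label

# Tiny hand-written sample standing in for the Twitter emotion data (hypothetical).
DATASET: List[Example] = [
    ("I love this sunny day", "happiness"),
    ("I miss my old friends so much", "sadness"),
    ("The train leaves at 9 am", "neutral"),
    ("Best concert ever, so happy", "happiness"),
]


def fake_model(prompt: str, tweet: str) -> str:
    """Keyword stub used in place of an LLM call (assumption, not a real model)."""
    text = tweet.lower()
    if any(word in text for word in ("love", "happy", "best")):
        return "happiness"
    if any(word in text for word in ("miss", "sad", "cry")):
        return "sadness"
    # The stub can only answer "neutral" when the prompt offers that class.
    return "neutral" if "neutral" in prompt else "happiness"


def accuracy(model: Model, prompt: str, dataset: List[Example]) -> float:
    """Fraction of tweets whose predicted label matches the expected label."""
    hits = sum(model(prompt, tweet) == label for tweet, label in dataset)
    return hits / len(dataset)


def compare_prompts(model: Model, prompts: Dict[str, str],
                    dataset: List[Example]) -> Dict[str, float]:
    """Score every prompt variant on the same dataset."""
    return {name: accuracy(model, text, dataset) for name, text in prompts.items()}


PROMPTS = {
    "v1": "Classify the tweet as happiness or sadness.",
    "v2": "Classify the tweet as happiness, sadness, or neutral.",
}

scores = compare_prompts(fake_model, PROMPTS, DATASET)
best = max(scores, key=scores.get)
print(scores, best)
```

The model-comparison experiment reuses the same loop with the prompt held fixed and the `model` argument varied instead.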
For the prompt refinement experiment, I used the Emotion dataset from Twitter to classify tweets into happiness, sadness, or neutral categories. Both platforms required only an API key and client initialization for setup, which was straightforward. For the model comparison experiment, I applied the best-performing prompt from the first experiment to compare two models: gpt-4o-mini and claude-3-sonnet. Open-Source Flexibility vs. Closed-Source Stability Opik : Open-Source : Opik is an open-source platform, giving developers the freedom to access, modify, and customize the platform’s source code. This flexibility fosters a collaborative environment where developers can contribute to the platform, improve it, and tailor it to their specific project needs. Customization : The open-source nature allows Opik users to implement unique, project-specific features or adjustments, which is valuable for teams with highly specialized requirements. This community-driven development model also allows the platform to evolve continuously based on user contributions. Ideal for Developers Seeking Flexibility : For teams or individuals who prefer to have control over their tools and the ability to customize according to their workflow, Opik is well-suited. It enables full transparency and adaptability, empowering developers to iterate on the platform as they wish. LangSmith : Closed-Source : LangSmith, on the other hand, is a proprietary, closed-source platform. While this restricts customization compared to Opik, it offers the advantage of being a more stable and streamlined platform. LangSmith’s closed-source nature ensures that updates are consistent and cohesive, with dedicated support to maintain the platform’s performance and reliability. Stability and Support : Being closed-source allows LangSmith to provide a more stable user experience, particularly important for enterprise users. 
It ensures regular updates, dedicated customer support, and a fully integrated suite of tools that work seamlessly together. Ideal for Enterprises Seeking Stability : Enterprises or teams that prioritize stability and dedicated support may prefer LangSmith. The closed-source model can provide peace of mind, knowing that the platform will continue to function reliably with cohesive updates and minimal disruption. Self-hosting Opik : Local Installation : Opik offers a local installation option, which is quick to set up and allows developers to get started immediately. However, this local setup is not intended for production environments, as it lacks the robustness required for large-scale operations. The local installation is suitable for quick testing and experimentation. It operates through a local URL and requires basic configuration of the SDK to interact with the self-hosted instance. This setup makes it very user-friendly for small-scale or short-term tasks. Kubernetes Installation : For production-ready deployment, Opik supports installation via Kubernetes. This option allows for scalability and ensures that all of Opik’s core functionalities—such as tracing and evaluation—are accessible in a more stable environment. Despite the production readiness of the Kubernetes setup, Opik lacks certain user management features in its self-hosted mode, which might be a drawback for larger teams needing detailed access control. There is no mention of built-in storage options in Opik’s self-hosted mode, implying that developers may need to set up external storage solutions for data management. Managed Options : For organizations seeking reduced maintenance, Opik provides managed deployment options through Comet. This allows teams to focus more on development and analysis without worrying about infrastructure maintenance. 
LangSmith : Docker and Kubernetes Support : LangSmith can be self-hosted via Docker or Kubernetes, making it suitable for both controlled cloud environments and large-scale production deployments. This flexibility allows LangSmith to cater to different organizational needs, from small startups to large enterprises. Componentized Architecture : LangSmith’s architecture is more complex than Opik’s, as it comprises multiple components including the Frontend, Backend, Platform Backend, Playground, and Queue. This setup ensures that LangSmith is highly modular and scalable but also requires more infrastructure management. The need to expose the Frontend for UI and API access adds to the operational complexity. Storage Bundling : Unlike Opik, LangSmith includes bundled storage services by default, making it easier for teams to get started without needing to configure external storage systems. However, users still have the option to configure external storage systems if their project demands it. Enterprise Focus : LangSmith is designed with large, security-conscious enterprises in mind. Its multi-component infrastructure is intended to support complex, secure environments. However, this also means that LangSmith may have a higher maintenance overhead compared to simpler platforms like Opik. The increased complexity requires careful configuration and management to ensure all components operate smoothly. Tracing Opik : Opik offers versatile tracing options, allowing you to log traces to the Comet LLM Evaluation platform via either the REST API or the Opik Python SDK. It supports integrations with a variety of tools, including LangChain, LlamaIndex, Ragas, Ollama, and Predibase, making it a flexible choice for developers looking to track their LLM performance across multiple frameworks. LangSmith : LangSmith provides tracing support primarily with LangChain, Vercel AI, and LangGraph. 
While it may have fewer integrations compared to Opik, LangSmith compensates with more advanced and low-level features for tracing. This can be beneficial for users who require in-depth analysis and customization in their LLM evaluations.

Figures: Opik Tracing, LangSmith Tracing

As the screenshots show, LangSmith lets you view more detailed information, including input, total tokens used, latency, feedback (i.e., evaluation score), and metadata. In contrast, Opik shows a smaller set of fields: input, output, scores, and metadata. Here's a detailed comparison of Opik's tracing and LangSmith's tracing based on their dashboard visuals:

Similarities:

Tracing and Logging of Inputs/Outputs: Both Opik and LangSmith provide a clear breakdown of the input and output logs for evaluation tasks. Each platform displays detailed information about the input prompts and the model-generated outputs, which is essential for understanding the context and accuracy of the LLM response. The platforms also show additional details like feedback scores (Opik) or evaluation metrics (LangSmith), so users can assess performance in an organized format.

Structured Presentation: Both dashboards break evaluation tasks down into sections like "Input/Output," "Feedback Scores," and "Metadata," so users can navigate easily through the components of the model evaluation.

Status Indicators: Both platforms highlight the success/failure status of each evaluation task. This is useful for quickly identifying which tasks succeeded and which may need further investigation.

Differences:

Visualization of Trace Details: Opik provides a simpler view of the trace spans, focusing on essential data such as input and output in a straightforward format. The left panel of the Opik dashboard groups spans hierarchically but is relatively simple. LangSmith, however, offers a more detailed tracing breakdown.
It displays additional technical details like token usage, latency, and trace spans with granular timing (e.g., 0.2s). The dashboard offers richer metadata and breakdowns on a more technical level, making it more suitable for in-depth performance analysis. Feedback and Evaluation: Opik allows for quick feedback scores and custom metrics within the same pane, which are summarized easily in the CLI or notebook interface. The evaluation task is shown with simple input/output YAML formatting. LangSmith focuses more on detailed feedback evaluations . It provides more elaborate evaluation results, including a link to the platform dashboard for viewing advanced statistics and data visualizations. Visual Complexity: LangSmith has a more sophisticated interface with more detailed trace spans and multiple evaluation layers. This visual complexity can provide more powerful insights but may require more effort to navigate. Opik is more minimalist, prioritizing simplicity in its presentation. This could be more user-friendly for developers who prefer a lightweight and efficient interface. Evaluation Opik : Opik simplifies the process of defining metrics, allowing users to easily initialize and pass them as parameters during evaluation. It supports both heuristic and LLM-based judge metrics, with the added flexibility to create custom metrics tailored to specific needs. This user-friendly approach makes it accessible for developers looking to assess their LLM applications efficiently. Opik also summarizes results directly in the CLI or notebook, allowing for easy access to insights on-the-fly. LangSmith : LangSmith requires a more hands-on approach to metric definition. In LangSmith, evaluators are functions that score application performance based on specific examples from your dataset and the outputs generated during execution. Each evaluator returns an EvaluationResult, which includes: key, score and comment. 
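To make the evaluator idea concrete, here is a minimal sketch of a function that scores one run against one dataset example and returns the three fields named above. It is plain Python and does not use the LangSmith SDK: the `EvalResult` dataclass and the `correct_label` function signature are hypothetical stand-ins, not LangSmith's actual `EvaluationResult` class or evaluator interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    """Hypothetical stand-in carrying the fields described above: key, score, comment."""
    key: str
    score: float
    comment: Optional[str] = None


def correct_label(outputs: dict, reference: dict) -> EvalResult:
    """Score 1.0 when the predicted label matches the expected one, else 0.0."""
    predicted = outputs.get("label", "").strip().lower()
    expected = reference["label"].strip().lower()
    match = predicted == expected
    return EvalResult(
        key="correct_label",
        score=1.0 if match else 0.0,
        # Only explain the score when the prediction was wrong.
        comment=None if match else f"expected {expected!r}, got {predicted!r}",
    )


hit = correct_label({"label": "Sadness "}, {"label": "sadness"})
miss = correct_label({"label": "neutral"}, {"label": "happiness"})
print(hit)
print(miss)
```

Normalizing case and whitespace before comparing is a design choice for this sketch; LLM outputs often differ from reference labels only in formatting, and an exact string comparison would undercount correct answers.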
LangSmith provides a link to its dashboard for viewing results, which, while informative, requires navigating away from the immediate workflow.

Figures: Opik Evaluation, LangSmith Evaluation

Both LangSmith and Opik provide overall metric scores as well as scores for each individual dataset item. In summary, both platforms present evaluation results in a similar way; the main difference lies in the setup of the metrics. In Opik, the setup is straightforward, while in LangSmith it takes more effort to configure. Here's a detailed comparison of Opik's evaluation dashboard and LangSmith's evaluation dashboard based on their visuals:

Similarities:

Experiment Tracking: Both Opik and LangSmith provide a clear overview of experiments conducted on datasets. Each experiment is tracked with a unique identifier or name, and the results are logged in a structured manner. Both display the correctness of the evaluation (precision, recall, or label correctness) so users can immediately grasp the model's performance for each dataset item.

Metric Display: Both systems display evaluation metrics for each experiment, such as precision, recall, and other relevant scores. This lets developers gauge how well a specific model or experiment performed against specific performance indicators.

Dataset Connection: In both systems, experiments are linked to datasets, which allows for context-driven evaluation. Users can quickly refer back to the dataset and see how the model performed against each data point.

Differences:

Visualization of Metrics:

Opik: In the Opik evaluation dashboard, metrics such as context precision and recall are displayed prominently at the top of the interface. Each dataset entry is evaluated on these metrics, and results are presented for each item. The emphasis is on immediate metric visibility for each input/output pair within the dataset.
LangSmith: LangSmith provides an aggregate view of the experiment performance. Instead of breaking down individual metrics per dataset entry, LangSmith focuses on displaying experiment-level metrics such as Correct Label scores across multiple runs. This is useful for a more general performance comparison between different models or experiment configurations over time. Apart from that, you can also view metrics for each dataset entry by clicking on any specific experiment. Detailed Experiment Comparison: LangSmith: The LangSmith evaluation dashboard provides an overview of multiple experiments at once, listing them with splits, repetitions, and correctness scores. This allows users to quickly compare how different versions of models or setups have performed relative to one another, ideal for tracking improvements or regressions over time. Opik: The Opik evaluation dashboard focuses on individual metrics for each input . It presents a more fine-grained evaluation, especially when comparing precision and recall for specific inputs. However, it lacks a broad overview of multiple experiments in one glance. Dataset Opik : Opik presents a more straightforward view of dataset information, displaying inputs and expected outputs clearly. However, it lacks the advanced visualization capabilities found in LangSmith, which may limit users’ ability to quickly identify trends and insights. LangSmith : LangSmith excels in offering advanced visualization features that clearly showcase trends and evaluation metrics within the dataset tab. It provides rich support for datasets, allowing users to view experiments conducted on the dataset, perform pairwise experiments, and explore various formats, including key-value pairs, LLM, and chat data. This comprehensive approach makes it easier to analyze and understand the dataset’s performance and evaluation. 
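The difference between an experiment-level aggregate and a per-item or pairwise view is easy to see with a small sketch. The code below is a hedged illustration in plain Python, not either platform's API: the item IDs, the two experiments, and their per-item scores are made-up values chosen only to show how two runs with the same aggregate can still disagree item by item.

```python
from statistics import mean
from typing import Dict

# Hypothetical per-item correctness scores for two experiments on the same dataset.
experiment_a: Dict[str, float] = {"t1": 1.0, "t2": 0.0, "t3": 1.0, "t4": 1.0}
experiment_b: Dict[str, float] = {"t1": 1.0, "t2": 1.0, "t3": 0.0, "t4": 1.0}


def aggregate(scores: Dict[str, float]) -> float:
    """Experiment-level view: one number summarizing the whole run."""
    return mean(scores.values())


def pairwise(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, int]:
    """Per-item view: count which experiment scored higher on each shared item."""
    tally = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for item_id in sorted(a.keys() & b.keys()):
        if a[item_id] > b[item_id]:
            tally["a_wins"] += 1
        elif a[item_id] < b[item_id]:
            tally["b_wins"] += 1
        else:
            tally["ties"] += 1
    return tally


agg_a, agg_b = aggregate(experiment_a), aggregate(experiment_b)
tally = pairwise(experiment_a, experiment_b)
print(agg_a, agg_b, tally)
```

Here both runs have the same aggregate score, yet the pairwise tally shows each one winning on a different item, which is the kind of detail an aggregate-only view hides.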
Figures: Opik Dataset, LangSmith Dataset

As the screenshots show, LangSmith lets you see how many experiments were run on a dataset, along with their metric scores and other details. In contrast, Opik only provides information about the dataset and its items. Here's a detailed comparison of Opik's dataset and LangSmith's dataset based on their dashboard visuals:

Similarities:

Sentiment Dataset: Both dashboards display a dataset with inputs and expected outputs. Each dataset item includes both the original input and the expected label.

Dataset Structure: Both platforms show the dataset in a structured table format, where inputs and expected outputs are clearly listed. This keeps dataset management transparent and consistent on both platforms.

Support for Experimentation: Both platforms support running experiments on the datasets. They allow users to test different models or versions of a model and compare performance on these input/output pairs.

Differences:

Visualization:

Opik Dataset: The Opik dataset interface is minimalistic, showing only the input/output pairs. It lacks advanced visualization capabilities, focusing instead on providing clear data entries for developers to reference.

LangSmith Dataset: In contrast, the LangSmith dataset interface provides rich visualizations. For example, it shows a chart of experiments, so users can see the results of evaluations over time or across multiple experiments. This gives better analytical tools to users who want to track model performance trends.

Experiment Features:

Opik Dataset: The Opik interface offers simplicity, focusing on basic dataset information and expected outcomes. While it supports dataset-based evaluations, it lacks advanced tools for conducting complex experiments directly from the interface.

LangSmith Dataset: LangSmith offers more advanced options for conducting experiments, such as pairwise experiments and the ability to add evaluators and generate new examples.
It also supports few-shot learning, giving users more flexibility to perform sophisticated analyses on their datasets.

Customization and Flexibility: LangSmith offers more features for interacting with datasets, such as tagging dataset versions, adding new examples, and generating examples. These features make it easier to experiment with datasets and modify them on the go, offering more flexibility and control over data. Opik, on the other hand, is streamlined for straightforward dataset management and lacks these interactive features, focusing on simplicity and clarity for the user.

* You can access the code and other details of this comparison from this link.

The table below highlights the functionality supported in Opik vs. LangSmith:

Feature/Functionality | Opik | LangSmith
Open-Source | ✅ | ❌
Self-hosting Support | ✅ | ✅
Dataset | ✅ | ✅
Tracing | ✅ | ✅
Evaluation | ✅ | ✅
Pytest Integration | ✅ | ❌
OpenAI Support | ✅ | ✅
LangChain Support | ✅ | ✅
LlamaIndex Support | ✅ | ❌
Ollama Support | ✅ | ❌
Predibase Support | ✅ | ❌
Ragas Support | ✅ | ❌
LangGraph Cloud Support | ❌ | ✅
Own Prompt Management | ❌ | ❌
Capture Human Feedback | ❌ | ✅
Advanced Monitoring & Automations | ❌ | ✅

Conclusion

Both Opik and LangSmith offer valuable tools for large language model (LLM) application development, but they cater to different user needs and contexts. Opik is well-suited for developers who appreciate open-source flexibility and a user-friendly setup. Its straightforward metric definition, extensive integrations, and ease of use make it ideal for quick implementations and individual projects. However, it falls short in several areas critical for enterprise use, such as advanced dataset management, sophisticated monitoring, and built-in support for human feedback mechanisms. Opik's limited tracing capabilities and basic logging features may hinder comprehensive performance analysis and compliance with privacy regulations, which are vital in larger team environments.
LangSmith , in contrast, excels in enterprise settings where stability, scalability, and comprehensive monitoring are essential. Its advanced tracing capabilities, rich dataset management, and detailed visualization features facilitate deeper analysis and collaboration among stakeholders. LangSmith excels with its sophisticated tracing options, including the ability to log images and manage sensitive data effectively. Its built-in automation tools allow teams to respond proactively to issues, a necessity in high-stakes production settings. The closed-source model of LangSmith streamlines updates and support, allowing teams to focus on development rather than maintenance. These features are crucial for organizations aiming to deploy production-grade applications effectively. For AI researchers and engineers working on personal projects, Opik offers a flexible and accessible environment for experimentation and learning. Its open-source nature allows for customization without the constraints of a closed-source system. Conversely, AI engineers in enterprise environments will benefit from LangSmith’s comprehensive features tailored for production, including stability, extensive support, and advanced monitoring capabilities. In conclusion, the choice between Opik and LangSmith depends on the specific context of the user. Opik is a great fit for individuals and small teams focused on exploration, while LangSmith is the preferred option for organizations aiming to build scalable, production-ready applications. Aligning your toolset with your project requirements and long-term goals is essential for success in the evolving landscape of AI development.

6 min read


WebLLM, WebGPU, and MLC: A Comprehensive Explanation

Article

Apple's recent advancements in Edge AI, known as "Apple Intelligence," are setting new standards for AI on edge devices (such as mobile phones, tablets, and laptops) and shaping user expectations across the technology landscape. By embedding AI capabilities directly within iPhones, iPads, and Macs, Apple emphasizes privacy, low latency, and efficiency. This strategy allows tasks like image generation, text rewriting, and voice commands to be processed locally on the device, offering faster, more reliable, and more secure interactions without constant cloud support.

Apple is not alone in this focus. The trend is evident across other major players such as Microsoft, Google, Facebook, and Samsung, all working on running AI on edge devices. While Edge AI offers many benefits, it also presents challenges, including the need for more powerful hardware and potential limitations on model size.

To address these challenges and enable efficient on-device AI, technologies like WebLLM (for running large language models in web browsers), MLC (Machine Learning Compilation for optimizing AI models), and WebGPU (a low-level graphics and compute API for web browsers) are being actively developed. These technologies are receiving contributions from a wide range of companies, including top tech giants. The WebGPU API, which serves as the backbone for running WebLLM models efficiently in the browser, is available in Chromium-based browsers such as Chrome, with support in Firefox and Safari at varying stages of rollout.

Given the rapid development of these technologies, which will power a significant portion of future mobile and web applications, it's crucial to understand how they work. In the following sections, we will explain WebLLM, MLC, and WebGPU in detail, and illustrate their deployment using a practical WebLLM chat example that works directly on your device.

WebLLM

WebLLM is a high-performance, in-browser inference engine for Large Language Models (LLMs).
It is designed to allow developers to deploy and run large language models directly in the browser with WebGPU for hardware acceleration, without requiring any server support. It is open-source and can be accessed on GitHub here. WebLLM manages the overall inference process, which includes: Tokenization: Converting natural language input into a format suitable for model processing. Model Management: Downloading and loading model weights into browser memory, where they are stored efficiently, often in a quantized format. Inference and Detokenization: Interfacing with MLC for computational tasks and converting results back to a human-readable form. WebLLM is designed to be compatible with the OpenAI API, allowing developers to use the same interface they would with OpenAI, supporting features such as streaming outputs, JSON-mode generation, and function calling (currently in progress). Key Features Include: In-Browser Inference Using WebGPU: Achieves hardware-accelerated LLM inference directly within the browser. Compatibility with OpenAI API: Facilitates integration using existing OpenAI-compatible functionalities. Structured JSON Generation: Provides JSON-mode structured generation for applications that require schema-based outputs. Extensive Model Support: Works natively with a variety of models, including Llama, Phi, Mistral, Qwen , etc., with the ability to integrate custom models using the MLC format. Real-Time Interactions & Streaming: Supports interactive applications like chat completions, allowing real-time text generation. Performance Optimization with Web Workers & Service Workers: Enables efficient computations and model lifecycle management by offloading tasks to separate browser threads. MLC LLM (Machine Learning Compilation for Large Language Models) MLC LLM is a specialized component of the MLC ecosystem, designed to optimize the inference of Large Language Models (LLMs) across various platforms, including browsers, desktops, and mobile devices. 
It is a machine learning compiler and high-performance deployment engine for large language models. It compiles and prepares LLMs for efficient execution based on the underlying hardware capabilities. Throughout this explanation, we will refer to MLC LLM as "MLC."

MLC works closely with WebLLM by receiving tokenized inputs and preparing computational tasks that are optimized for the available hardware. These tasks are compiled into efficient GPU kernels, CPU instructions, or WebGPU shaders to ensure that LLM inference runs effectively across platforms. The goal of MLC is to bring high-performance LLM deployment natively to web browsers, desktops, and mobile devices. MLC is open-source and can be found on GitHub here, providing tools for efficient execution of LLMs across different environments, including browsers and native platforms.

Platform-Specific Optimization

MLC is designed to adapt to various hardware and platform needs, enabling efficient LLM inference. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone’s platforms.

Key features:

- GPU Support for Major Manufacturers (AMD, NVIDIA, Apple, Intel): MLC optimizes the execution for different GPU types using APIs such as Vulkan, ROCm, CUDA, and Metal, based on the platform and hardware availability.
- Browser Support with WebGPU & WASM: MLC runs natively within web browsers by leveraging WebGPU and WebAssembly, providing hardware-accelerated inference directly in the browser.
- Mobile Platform Support: On iOS devices, MLC uses Metal for efficient execution on Apple GPUs, while on Android devices, it leverages OpenCL to support Adreno and Mali GPUs.

MLCEngine: A Unified Inference Engine

At the heart of MLC is MLCEngine, a high-performance inference engine that runs across various platforms, providing the necessary backbone for executing LLMs efficiently.
MLCEngine offers OpenAI-compatible APIs for easy integration into various environments, including REST servers, Python applications, JavaScript, and mobile platforms. By using MLC, developers can deploy LLMs across different platforms and benefit from optimized hardware acceleration, whether in browsers, on desktops, or on mobile devices.

WebGPU

WebGPU is the hardware acceleration layer that enables efficient LLM inference within the browser. It interfaces directly with MLC, executing the optimized kernels or instructions prepared by MLC based on the available hardware resources. WebGPU is responsible for:

- Parallel Computation & Memory Transfers: Performing the necessary computations and managing memory efficiently to support the rapid inference of large models.
- Fallback to CPU when GPU is Unavailable: When no GPU is available, computations may still proceed on the CPU where the browser provides a software fallback, though performance will be reduced.

By providing a direct bridge between model operations and hardware execution, WebGPU is critical for achieving the performance necessary for real-time LLM inference in web applications.

Illustration with WebLLM Chat Using Llama 3.2

This section walks through how WebLLM Chat uses Llama 3.2 for real-time AI conversations within the browser. It highlights each step from user interaction to model response, leveraging WebGPU and MLC LLM's capabilities to optimize performance. The following diagram extends the earlier diagram to show how Llama 3.2 can be used for chat interactions using WebLLM.

Step-by-Step Flow of WebLLM Chat with Llama 3.2

Initialization & Model Loading

Interface & Model Selection: Open WebLLM Chat in the browser. The user selects Llama 3.2 from the available models.
Upon selection, the model weights are downloaded (if not cached) and loaded into memory.

Progress Feedback: WebLLM Chat provides real-time progress updates on the model loading process, ensuring the user knows when Llama 3.2 is ready for conversation.

Tokenization & User Input

Input & Tokenization: The user types a query into WebLLM Chat. The interface tokenizes this input to prepare it for Llama 3.2 inference, converting the natural language into a sequence that the model understands.

Responsive UI Through Web Workers: To keep the UI smooth and responsive, WebLLM uses Web Workers to offload computations away from the main thread. This enables real-time input processing without performance lags.

Inference & WebGPU Acceleration

Model Execution & Hardware Utilization: WebLLM uses MLC LLM to manage computations, leveraging WebGPU to perform inference on available GPUs for faster response generation.

Real-Time Response Generation: The model streams its response as it is generated, token by token, and WebLLM Chat displays these results incrementally. This streaming capability allows users to interact with the model in real-time.

Inference Output & Structure

Standard Chat Output: By default, Llama 3.2 provides plain text responses suitable for typical chat-based interactions. The responses are detokenized and presented back to the user in a natural language format.

Structured Outputs (JSON Mode): If specific structured data is required (e.g., formatted as JSON), WebLLM Chat can be configured to return such responses. This is particularly useful if you want to use WebLLM to respond to complex queries where the data needs to be formatted (e.g., a structured list, a dictionary of items, etc.). Generating structured output can be part of the model’s behavior if it has been fine-tuned for it. Depending on your model's performance, you may need to validate the structured outputs in the interface.
Lifecycle Management

Lifecycle Management & Caching: Model weights and configurations are cached locally after the initial load, improving efficiency for subsequent interactions. Web Workers manage computations to ensure smooth inference without interrupting the chat's responsiveness.

Mermaid code for the diagram 1

    graph TD
        A[Web Application] <-->|Real-Time User Input/Output| B[WebLLM]
        B <-->|Model Management, Tokenization & Inference Requests| D[MLC]
        D <-->|Compiled & Optimized Computation Tasks for GPU/CPU| C[WebGPU]
        C -->|Delegate to Hardware| E[Discrete GPU]
        C -->|Or Fallback to CPU| F[Fallback to CPU]
        E -->|Execution Results| C
        F -->|Execution Results| C
        C -->|Computation Results| D
        D -->|Inference Results| B
        B -->|Detokenization & User Output| A
        style A fill:#FFA07A,stroke:#333,stroke-width:2px,color:#000000
        style B fill:#A0D8EF,stroke:#333,stroke-width:2px,color:#000000
        style C fill:#FFD700,stroke:#333,stroke-width:2px,color:#000000
        style D fill:#98FB98,stroke:#333,stroke-width:2px,color:#000000
        style E fill:#DDA0DD,stroke:#333,stroke-width:2px,color:#000000
        style F fill:#DDA0DD,stroke:#333,stroke-width:2px,color:#000000
        classDef default stroke:#333,stroke-width:2px,color:#000000

Mermaid code for the diagram 2

    graph TD
        A[Web Application] <-->|User Input & Output| B[WebLLM Chat Interface]
        B <-->|Tokenization & Inference Requests| D[MLC LLM Engine]
        D <-->|Optimized Computations for GPU/CPU| C[WebGPU Interface]
        C -->|Delegate Computations| E[Discrete GPU]
        C -->|Fallback to CPU| F[CPU Processing]
        E -->|Execution Results| C
        F -->|Execution Results| C
        C -->|Compute Results| D
        D -->|Inference Results| B
        B -->|Streamed Responses| A
        %% Note on Validation for Structured Outputs
        B -->|If Required: Validate & Reprocess| G[Validate Structured Output]
        %% Styling the Nodes for Clarity
        style A fill:#FFA07A,stroke:#333,stroke-width:2px,color:#000000
        style B fill:#A0D8EF,stroke:#333,stroke-width:2px,color:#000000
        style C fill:#FFD700,stroke:#333,stroke-width:2px,color:#000000
        style D fill:#98FB98,stroke:#333,stroke-width:2px,color:#000000
        style E fill:#DDA0DD,stroke:#333,stroke-width:2px,color:#000000
        style F fill:#DDA0DD,stroke:#333,stroke-width:2px,color:#000000
        style G fill:#FF6347,stroke:#333,stroke-width:2px,color:#000000
        classDef default stroke:#333,stroke-width:2px,color:#000000
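The "Validate Structured Output" step in the second diagram can be sketched in a few lines. This is a minimal Python illustration of the idea, not part of WebLLM or MLC: the function name `validate_structured_output` and the `REQUIRED_FIELDS` schema are hypothetical and would be replaced by whatever your application expects from JSON mode.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "items": list}


def validate_structured_output(raw_text, required_fields=REQUIRED_FIELDS):
    """Return (parsed, errors); parsed is None if the text is not a JSON object."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return None, ["top-level value is not a JSON object"]
    errors = []
    for name, expected_type in required_fields.items():
        if name not in parsed:
            errors.append(f"missing field: {name}")
        elif not isinstance(parsed[name], expected_type):
            errors.append(f"field {name} should be {expected_type.__name__}")
    return parsed, errors
```

If the error list is non-empty, the interface could re-ask the model with the errors appended to the prompt, which corresponds to the "Validate & Reprocess" edge in the diagram.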

6 min read


Bridging AI Intelligence Gaps: Leveraging Human Critical Thinking to Optimize AI Agent Performance 🧠🤖

Article

1. Identifying and Analyzing Distraction Points

AI agents tend to go astray, especially in the design phase, for a variety of reasons, namely:

- Spending excessive time on non-critical or low-priority sub-tasks that don't directly contribute to the agent’s primary objectives.
- Engaging in excessive reflection on secondary components that do not directly impact task performance.
- Struggling to resolve conflicting or ambiguous information, which may lead to confusion or suboptimal decisions.

Human oversight, complemented by automated monitoring tools, plays a critical role in recognizing when an AI agent becomes distracted or sidetracked, particularly during over-reflection or pursuit of tangential paths. In a Human-in-the-Loop (HITL) framework, humans can intervene at various stages of the agent’s reasoning process to detect these moments of distraction and correct the course.

When identifying and analyzing distraction points in an AI agent's reasoning process, humans should employ critical thinking strategies such as chunking, hierarchical organization, pattern recognition, and abstraction. Chunking allows humans to break the agent's complex reasoning flow into manageable segments, enabling easier identification of specific areas where distractions may occur. Using a hierarchical structure to organize these segments further reduces cognitive load, allowing humans to focus on fewer elements at a time. This top-down decomposition helps in isolating distraction points within specific parts of the flow, ensuring that humans aren't overwhelmed by the entire process at once. Within this structured hierarchy, pattern recognition becomes more efficient, as humans can more easily spot recurring behaviors or common points where the agent gets sidetracked. Once these patterns are identified, they can be further simplified through abstraction, which allows common flows to be combined, reducing the number of individual flows that need to be addressed.
This abstraction not only streamlines the analysis but also makes it easier to apply corrective measures across similar distraction points. The process is iterative, meaning that as human operators gain more insight into where distractions occur, they can refine the hierarchy and segmentation, improving their conceptualization of the agent's flow over time. Domain knowledge and task frequency further guide this process, helping humans prioritize flows that occur often or are critical to the agent's objectives. This combination of techniques allows humans to efficiently detect, analyze, and address distractions in the agent’s reasoning flow, such as repeatedly choosing low-priority actions or misinterpreting key goals.

2. Setting Rule-Based Guidance to Avoid Stuck States

Once distraction or misprioritization points are identified, humans can design specific rule-based guidance, such as thresholds or priority rules, to help the AI agent stay focused on its core objectives. Rule-based systems act as guardrails, providing structure that prevents the agent from becoming distracted by irrelevant or low-priority tasks.

Defining Action Thresholds: Humans can set reflection thresholds based on task complexity or importance to ensure the agent moves forward after a reasonable amount of time spent on reflection.

Task Prioritization Rules: Humans can encode priority rules, such as task-scoring systems or hierarchical structures, to help the agent distinguish between critical and tertiary tasks. For example:

- “Always prioritize goal completion over error correction unless the error is critical to task success or safety.”
- “Allocate a percentage of resources to core tasks, adjusting dynamically based on task importance, with minimal resources allocated to secondary tasks.”

Timeout Mechanisms: Rule-based timeouts can be implemented to ensure that agents do not spend too long on low-priority tasks.
If an agent is stuck reflecting on a minor issue for too long, the system can trigger a timeout, prompting the agent to either stop and reassess its priorities or initiate predefined fallback actions.

Flow-Based Rules: For specific workflows, humans can create step-by-step rules to keep the agent focused on the main flow of the task. These rules can guide the agent through key stages, ensuring it progresses toward the ultimate objective even if it encounters distractions. If the agent starts deviating from the intended flow, these rules can nudge it back on track.

Humans can establish hierarchical guiding principles that break down constraints from general to specific, helping the agent focus on fewer constraints at a time, thereby reducing the complexity of its reasoning process. By analyzing recurring patterns where the agent tends to get stuck, humans can design preemptive rules that directly address these distraction points based on past observations. By using abstraction, humans can generalize rules across multiple workflows, enabling the agent to apply the same principles in various contexts with minimal adjustments. For instance, rather than addressing each case where the agent becomes sidetracked, abstracted rules can encompass a range of similar scenarios, allowing the agent to handle recurring distractions with minimal human intervention.

3. Managing Excessive or Inappropriate Tool-Calling

Excessive or inappropriate tool-calling is a significant challenge for AI agents, particularly in workflows requiring interaction with external systems, where internal reasoning might be more efficient or appropriate. Overuse of tools can lead to inefficiencies such as wasted computational resources, increased latency, or distraction from the agent's primary objectives, much like over-reflection.
Tool-Calling Limits: Humans can set dynamic limits on how often an agent can call a tool within a given period, adjusting these limits based on task context or performance feedback to ensure optimal efficiency. This prevents agents from wasting computational resources and time by repeatedly calling tools when internal reasoning could provide a quicker or more efficient solution, based on predefined criteria.

Contextual Tool Use: Humans can establish rules to define appropriate contexts for tool usage, such as setting task-specific thresholds or constraints based on complexity or resource requirements. This teaches the agent when a tool is necessary and when it should rely on its own reasoning, either through rule-based systems or by training the agent with reinforcement learning techniques.

Fallback Mechanisms: If an agent calls a tool and fails to progress, rule-based fallback mechanisms can interrupt the cycle, prompting the agent to escalate the issue by requesting human feedback, switching tools, or reverting to internal reasoning based on predefined criteria.

Effective management of excessive or inappropriate tool-calling by AI agents requires a series of critical thinking strategies. Humans must evaluate when tool use is necessary, guiding the agent in recognizing when to rely on internal reasoning, while also ensuring that both tools and reasoning mechanisms are optimized for task efficiency. Analytical thinking is crucial in evaluating the LLM output against the task's goal, focusing on aspects such as accuracy, relevance, and completeness to ensure alignment with task objectives. By breaking down the output into its core components and comparing it with the goal criteria, critical thinkers can assess its accuracy, completeness, and relevance. This ensures the output is both factually correct and aligned with the task's objectives.
Further, reasoning allows humans to pinpoint where the output falls short, whether through missing information or failure to meet specific goals. Recognizing these gaps is essential for determining the use of appropriate tools or identifying changes needed in the tool’s design or the agent’s reasoning framework. This involves assessing the strengths and limitations of available tools, weighing factors such as efficiency and relevance, and determining which tool is best suited to fill the identified gaps. Once a tool is selected, problem-solving also applies when the tool fails, such as when it delivers incomplete data, slow processing, or incorrect results. Critical thinkers must diagnose whether the failure is tool-related or task-specific and determine appropriate fallback actions, such as switching to a different tool or reverting to internal reasoning. Finally, reasoning aids in selecting the best fallback option by comparing the effectiveness and potential of various strategies, often requiring real-time analysis to minimize workflow disruption. By considering multiple approaches (retrying, switching tools, or returning to internal logic), critical thinkers ensure that the agent remains on track toward its goal, despite obstacles in the workflow.

4. Generating Synthetic Data for Training Agentic Reasoning and Tool-Calling

In addition to rule-based guidance, synthetic data (such as simulated task scenarios or artificially generated datasets) can be used to train AI agents on how to reason through tasks effectively. By simulating complex, domain-specific scenarios and potential distractions, synthetic data helps the agent learn to prioritize key tasks and balance internal reasoning with external tool use.

Scenario Generation: Synthetic data can simulate edge cases (rare, anomalous, or highly complex situations) where the agent might become distracted or misuse tools.
This allows the agent to learn how to identify important objectives, avoid over-focusing on irrelevant details, and generalize these lessons across various types of tasks and distractions.

Tool-Calling Optimization: Training agents on synthetic data can improve their understanding of when to call external tools (e.g., APIs or databases) by simulating conditions and thresholds that define when tool use is necessary or redundant. The agent can learn:

- To call tools only when required for task completion.
- To avoid excessive or redundant tool calls that waste resources or introduce delays.

Balancing Reflection and Tool Use: Training the agent on scenarios that combine reflection and tool use helps it develop a better understanding of when reflection should lead to action or tool invocation, optimizing decision-making through reinforcement learning or iterative feedback.

Synthetic data can be used to fine-tune models and expand the agent's reasoning abilities, improving its management of excessive or inappropriate tool-calling. Critical thinking skills such as deductive reasoning, combined with domain-specific expertise, play a critical role for humans in guiding AI tools for scenario generation, allowing the simulation of edge cases where the agent might get distracted or misuse tools. These scenarios help the agent learn to prioritize important objectives by incorporating factors like time constraints and task hierarchies, guiding the agent away from irrelevant details. Synthetic data has been playing a significant role in the recent advancements of foundational models. Analytical thinking is necessary to assess how well the agent responds to these scenarios, often through a combination of automated performance metrics and human analysis, to identify areas for improvement.
Developing scenarios that teach efficient decision-making without over-relying on tools or reflection, especially through iterative design and feedback, helps the agent optimize its reasoning process.
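The reflection thresholds, tool-calling limits, and fallback mechanisms described in sections 2 and 3 can be sketched as a small rule-based guard. This is a minimal Python illustration under stated assumptions: the class name `AgentGuard`, its method names, and the default limits are all hypothetical, and a real agent framework would wire these checks into its own control loop.

```python
from dataclasses import dataclass


@dataclass
class AgentGuard:
    """Rule-based guardrail; every threshold here is an illustrative assumption."""

    max_reflection_steps: int = 3    # reflection threshold per task
    max_tool_calls: int = 5          # tool-calling limit per task
    max_failed_tool_calls: int = 2   # consecutive failures before fallback
    reflection_steps: int = 0
    tool_calls: int = 0
    failed_tool_calls: int = 0

    def allow_reflection(self) -> bool:
        """Return True while the agent is under its reflection threshold."""
        if self.reflection_steps >= self.max_reflection_steps:
            return False
        self.reflection_steps += 1
        return True

    def allow_tool_call(self) -> bool:
        """Return True while the agent is under its tool-call budget."""
        if self.tool_calls >= self.max_tool_calls:
            return False
        self.tool_calls += 1
        return True

    def record_tool_result(self, succeeded: bool) -> str:
        """Return the next action: 'continue' or 'fallback'."""
        if succeeded:
            self.failed_tool_calls = 0
            return "continue"
        self.failed_tool_calls += 1
        if self.failed_tool_calls >= self.max_failed_tool_calls:
            return "fallback"
        return "continue"
```

When `allow_reflection` or `allow_tool_call` returns False, or `record_tool_result` returns "fallback", the surrounding loop would apply one of the fallback actions named above: requesting human feedback, switching tools, or reverting to internal reasoning.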

6 min read


Why is a substantial percentage of GenAI projects failing? 📉

Article

Before understanding why many GenAI projects fail, it helps to have a simple overview of what these models can do. GenAI models, like ChatGPT, have been trained on vast datasets that include diverse publicly available text sources, such as Wikipedia, websites, books, and other digital texts. These models can answer a wide range of questions based on this knowledge, as you may have experienced when using ChatGPT. Beyond leveraging GenAI's general knowledge, you can also provide your own context (i.e., your specific knowledge, data, or documents) with your instructions, asking GenAI to extract information or generate output based on that context. With the right instructions and context, GenAI can produce outputs that closely resemble human decision-making. While GenAI’s potential is significant, its success depends on how it's deployed, and many projects fail because of misunderstandings about its capabilities and proper usage.

1. Approaching GenAI as Process Automation Instead of Process Redesign

A common pitfall in deploying GenAI is that organizations often treat it like traditional automation, where predefined processes are automated step-by-step. However, GenAI isn’t just about automating existing processes; it has the potential to transform workflows, especially in areas requiring human decision-making and creativity. When businesses focus solely on automation and fail to redesign processes, they miss the opportunity to unlock GenAI’s real potential: augmenting human intelligence. Traditional systems analysis and design approaches focused on capturing predefined, deterministic workflows meant to be used by humans. Now, with GenAI augmenting many human decision-making steps, new workflows must be designed to accommodate this shift. For example, when generating customer support FAQs, a customer support executive previously had to manually review past support tickets, reference existing FAQs, and create new ones.
In the GenAI paradigm, such interfaces for manual searching are no longer necessary. Instead, the focus should be on determining what information to retrieve from past customer support tickets via APIs and integrating GenAI models to automatically analyze and generate FAQs. Rather than creating interfaces for humans to sift through data, the emphasis should be on designing systems that allow GenAI to access and process information directly from databases or APIs. For instance, you can develop an application where GenAI models are connected to your customer support database, allowing them to automatically extract common issues, analyze sentiment, and generate draft FAQs or support documents. The human role then shifts to reviewing and refining the outputs generated by GenAI, ensuring they meet quality standards and align with the company’s messaging. This paradigm shift requires businesses to redesign their workflows to be more GenAI-centric. Processes should be built around the capabilities of GenAI, leveraging its strengths in data processing and content generation. If you're unsure how to plan for GenAI systems, you may want to go through the GenAI planning framework here.

2. Over-Reliance on a Single Inadequate Prompt to Handle Complex Human Decision-Making

There is often an over-expectation that simply feeding AI a prompt will produce sophisticated decision-making outputs, underestimating the complexity of human judgment. In many decision-making processes, humans navigate through multiple steps, often using their intuition and expertise to assess situations in parts. To replicate this with AI, the decision-making process often needs to be broken down, with humans providing clear step-by-step instructions for the AI to follow. Stepwise prompting involves tackling the entire multi-step process within a single prompt. For moderately complex tasks, this can be efficient as it reduces the need for multiple interactions with the model.
However, for more complex tasks that involve numerous steps and intricate explanations, it may be less effective. In such cases, you may need to break the task into a sequence of sub-tasks, using different prompts for each sub-task. This process is called Prompt chaining. You can read more about it here. Additionally, besides providing clear instructions, offering a few examples to the model can help it better understand your expectations. This technique is known as Few-shot prompting or In-context learning. You can read more about it here.

Integrating all these ideas in a prompt can get complex and overwhelming. We have found Metadata prompting to be highly effective for handling this complexity by separating concerns: first, focusing on explaining the instructions while assuming that all variables or constructs are predefined, and then later explaining those variables/constructs in detail. You can read more about it here. Even after following these best practices, issues may still arise, such as using semantically incorrect words, introducing modifier ambiguity, or mixing up the order of instructions. You can learn how to systematically iterate and improve prompts to resolve these issues here.

Another reason to use Prompt chaining is the generation capacity of AI models. While these models can process large amounts of input, their ability to generate content is more limited. If you attempt to generate content covering too many topics at once, the quality may degrade, turning into overly simplistic or "listicle" style outputs. To maintain high-quality results, it’s often necessary to break tasks into smaller sub-tasks and use a more focused prompt for each one, which reduces the cognitive load on the AI and improves the result for each segment of the task.

3. Assuming Models Need to Be Fine-Tuned Rather Than Using Proper Prompting Techniques

A significant misunderstanding is the assumption that GenAI models need to be fine-tuned for every task.
This often leads organizations to unnecessarily commit to fine-tuning, adding substantial cost and complexity to their AI projects. In reality, many tasks can be handled effectively using advanced prompting techniques such as Metadata prompting, Few-shot prompting, Stepwise prompting, and Prompt chaining, all without the need for fine-tuning. Fine-tuning models not only slows down the process by requiring the creation and management of training data but also complicates the deployment and inference stages. Teams often work independently on fine-tuning models, resulting in fragmented efforts, when in fact, using a common base model with heterogeneous adapters could allow for better resource utilization and system flexibility. Heterogeneous Parameter-Efficient Fine-Tuning (PEFT) adapters can be applied in batches to various models, optimizing resource usage. Read more about using heterogeneous PEFT adapters here.

4. Over-Reliance on AI Autonomy

A growing trend in AI development is designing autonomous AI agents using approaches like ReAct with Reflection, Language Agent Tree Search (LATS), and others. While these designs can be effective for certain use cases, such as simple Q&A using tools like Google Search when retrieval-augmented generation (RAG) systems are insufficient, they pose challenges when applied to more complex tasks that require nuanced reasoning and decision-making. When GenAI systems are used to augment human decision-making, over-reliance on these agentic designs can lead to several issues: the models may over-reflect, lose focus, struggle to differentiate between critical and irrelevant details, or overuse external tools rather than leveraging their own knowledge. As a result, costs can rapidly escalate, and the quality of output on complex tasks often deteriorates. To mitigate these issues, well-thought-out guardrails and interventions are necessary.
These guardrails help define task scope, keep AI models on track, and improve governance, reducing risks associated with unmonitored AI autonomy. Without these measures, autonomous AI systems may underperform on complex tasks and fail to deliver the expected value.

5. Misaligned Goal Setting and Success Metrics for New GenAI Processes

GenAI enables entirely new processes that were previously impossible or too resource-intensive for traditional AI or human-driven systems. These innovations include automated creative content generation, context-aware conversational agents, and intelligent document synthesis (e.g., creating detailed reports, legal contracts, or tailored marketing content based on minimal inputs). GenAI can also facilitate dynamic decision-making by generating and iterating on multiple solutions in real-time, which traditional AI systems cannot handle effectively without substantial human input. However, this new potential presents a significant challenge: defining appropriate goals and success metrics. Organizations often struggle to set realistic objectives that take full advantage of GenAI’s strengths because they are anchored in conventional process thinking. Since GenAI can fundamentally change how work is performed, companies must redefine what success looks like and select goals that offer the highest ROI. Misunderstanding or underestimating these possibilities often leads to poorly chosen objectives and a failure to fully realize the impact GenAI could offer.

6. Lack of Context Provided in Prompts

One of the major reasons GenAI projects fail is the lack of adequate context in prompts. While humans draw on years of experience, domain knowledge, and exposure to various settings to interpret ambiguous information, GenAI models rely solely on the data they’ve been trained on and the explicit details provided in prompts. When important context is missing, AI models may generate responses that are too vague, irrelevant, or even incorrect.
For instance, planning a weekly social media calendar requires tacit knowledge of what types of posts work best and which ones underperform within a specific industry domain. Without this background and context, GenAI systems may struggle to generate high-quality content, leading to generic, low-engagement posts. By incorporating tacit industry knowledge into the prompt, you can guide the AI to create more relevant and impactful content. You can learn how to include tacit knowledge in prompts for more effective results here.
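Prompt chaining and few-shot prompting, as described in section 2, can be sketched with two small helpers. This is a minimal Python illustration, not a specific vendor's API: `build_few_shot_prompt`, `run_prompt_chain`, and the `call_model` callable are hypothetical names, and `call_model` stands in for whatever client you use to send a prompt and receive text. The `{previous}` placeholder in each step template is likewise an assumption of this sketch.

```python
from typing import Callable, List, Tuple


def build_few_shot_prompt(
    instruction: str, examples: List[Tuple[str, str]], query: str
) -> str:
    """Combine an instruction, a few input/output examples, and the new input."""
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def run_prompt_chain(
    steps: List[str], initial_input: str, call_model: Callable[[str], str]
) -> List[str]:
    """Run one focused prompt per sub-task, feeding each output into the next step."""
    outputs = []
    current = initial_input
    for template in steps:
        # Each template contains a {previous} placeholder for the prior output.
        current = call_model(template.format(previous=current))
        outputs.append(current)
    return outputs
```

For the FAQ example from section 1, the chain might use steps such as "Extract the common issues from these tickets: {previous}" followed by "Draft FAQs for these issues: {previous}", with a human reviewing the final output.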

5 min read


How to improve your prompts? ✍

Article

1. Asking the model to explain its reasoning

Asking an AI model to generate explanations for its labels or recommendations can significantly enhance output quality by promoting deeper reasoning and analysis. This approach, closely related to chain-of-thought prompting, encourages the model to articulate its decision-making process, which can reveal and potentially correct flaws in its reasoning. By requiring explanations, the model is pushed to engage in more thorough contextual understanding and align its thinking more closely with human-like reasoning patterns. This process can help mitigate biases, improve transparency, and ultimately lead to more thoughtful, well-justified outputs. Additionally, the act of explaining can reinforce the model's grasp of concepts and relationships within the response. This technique not only enhances the model's ability to handle complex tasks but also provides valuable insights into its decision-making process, fostering greater trust and understanding between AI systems and their users.

2. Iterate and Refine

The "Iterate and Refine" guideline in prompt engineering highlights the necessity for continuous testing, evaluation, and enhancement of prompts used with AI models, like ChatGPT, to optimize response efficiency and accuracy. This iterative process involves experimenting with various prompts, analyzing AI responses, and refining prompts based on performance to gradually improve response quality and relevance. Acknowledging the trial and error involved is essential, as crafting the perfect prompt often requires multiple attempts due to the complexities of human language and AI interpretation. Initial attempts may not fully convey the needed context or specificity, necessitating prompt adjustments.

Ask for LLM's understanding of the prompt

Start by ensuring the AI comprehends your prompt correctly. This step involves not just asking for understanding, but also an iterative refinement process.
Here's the expanded process:

a. Initial Query: Use this specific prompt to get the AI's initial understanding:

Provide your understanding of the following prompt for an AI tool:

b. Analyze the Response: Carefully review the AI's explanation of your prompt. Look for any misinterpretations, gaps in understanding, or areas where the AI's interpretation doesn't align with your intent.

c. Iterative Refinement: Use the edit option to change your original prompt. Update the prompt to incorporate better wordings or explanations you see in the AI's output. When you save the updated prompt, the AI will give you another explanation. Review this new explanation carefully.

d. Decision Point: If you see the need for further minor changes, repeat the process from step c. If you're satisfied with the AI's understanding and feel no further changes are necessary, proceed to the next step in the prompt engineering process.

It may take you 2-3 iterations to fix your prompt. This iterative refinement within the first step is crucial because it allows you to:
- Gain insights into how the AI interprets your language
- Incrementally improve your prompt based on the AI's feedback
- Ensure a solid foundation of mutual understanding before moving on to more complex refinements

An alternative to manual intervention is to let the LLM handle the rewrite. Once you've confirmed that the AI's understanding is good and it hasn't misconstrued or deviated much from your intent, ask it to improve the prompt:

Rewrite the prompt to make it better

Or

Evaluate the structure of the following content, focusing on improving its organization and presentation. Avoid adding or suggesting new information--your task is to reframe the existing content for better clarity and flow.

This collaborative approach can lead to unexpected insights and refinements.
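The loop in steps a to d can be sketched in code. This is a minimal illustration under stated assumptions, not a definitive implementation: `ask_llm` is a hypothetical callable standing in for whatever model client you use, and `revise` is a hypothetical reviewer callback (a person or a script) that returns an updated prompt, or the same prompt when no further change is needed. Only the query wording comes from the article.

```python
from typing import Callable, List

# Wording of the understanding check, taken from step a above.
UNDERSTANDING_PREFIX = (
    "Provide your understanding of the following prompt for an AI tool:"
)


def understanding_query(prompt: str) -> str:
    """Wrap a draft prompt in the understanding-check query (step a)."""
    return f"{UNDERSTANDING_PREFIX}\n\n{prompt}"


def refine_prompt(
    draft: str,
    ask_llm: Callable[[str], str],       # hypothetical: sends text, returns the reply
    revise: Callable[[str, str], str],   # hypothetical: (prompt, explanation) -> prompt
    max_rounds: int = 3,                 # the article suggests 2-3 iterations
) -> List[str]:
    """Run steps a-d and return every version of the prompt, oldest first."""
    history = [draft]
    current = draft
    for _ in range(max_rounds):
        explanation = ask_llm(understanding_query(current))  # steps a and b
        updated = revise(current, explanation)               # step c
        if updated == current:                               # step d: satisfied
            break
        current = updated
        history.append(current)
    return history
```

Keeping the full history makes it easy to compare versions and to roll back if a later revision drifts from your intent.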
Addressing potential uncertainties

Next, address any subjectivity or unclear elements in your prompt that could lead to unreliable results, especially when the context might differ from your test cases.

Identify potential uncertainties

You can begin by using the LLM to help you identify uncertainties in your instructions with the prompt given below. This question helps you pinpoint areas where your prompt might be open to interpretation or lacking specificity.

Is there any subjectivity in the prompt or something unclear for an AI tool

Asking the LLM to guess answers for uncertainties

Instead of manually working out how to make instructions more specific for the identified uncertainties, you can ask the LLM to make educated guesses about potential answers or solutions. This approach capitalizes on the model's advanced capabilities, potentially saving you time and effort. By using the prompt given below, you're essentially outsourcing part of the problem-solving process to the AI. This not only helps in generating potential solutions but also provides insights into how the model might interpret and respond to ambiguities in your prompt, further informing your refinement process.

Make your best guess and try to answer subjectivities you identified in the last response

The LLM's feedback on uncertainties is mostly useful, but it can overdo it and list frivolous points at times. You will want to filter the good points from the rest.

Ask the LLM to rewrite the prompt to address uncertainties

Based on the insights gained, you can ask the LLM to rewrite your prompt to address the identified issues. Because the feedback may include frivolous points, name the specific points you want incorporated and leave out the rest. By selectively incorporating points, you prevent the prompt from becoming overly complex or veering off-track due to the AI's tendency to sometimes over-elaborate.
Rewrite the prompt to address the following points:
- point 1
- point 2

While these guidelines provide a solid foundation, don't hesitate to experiment with different phrasings, structures, and approaches. Each use case may require unique tweaks to achieve optimal results. By combining thoughtful design with systematic testing and refinement, you can create highly effective prompt templates that maximize the capabilities of LLMs in your workflow.
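The selective rewrite request described above can be assembled from the points you decide to keep. The sketch below is illustrative only: the function name `rewrite_request` and the sample points are assumptions made for this example, while the opening sentence of the request is the one given in the article.

```python
from typing import Iterable

# Opening sentence of the rewrite request, taken from the article.
REWRITE_HEADER = "Rewrite the prompt to address the following points:"


def rewrite_request(points: Iterable[str]) -> str:
    """Build the rewrite request from the points the reviewer chose to keep.

    Blank entries are dropped so that filtering the model's feedback down
    to a few good points never produces empty bullets.
    """
    kept = [p.strip() for p in points if p.strip()]
    if not kept:
        raise ValueError("select at least one point to address")
    bullets = "\n".join(f"- {p}" for p in kept)
    return f"{REWRITE_HEADER}\n{bullets}"


if __name__ == "__main__":
    # Hypothetical points a reviewer might keep after filtering the feedback.
    print(rewrite_request([
        "Define 'recent' as the last 30 days",
        "Limit the answer to three bullet points",
    ]))
```

Passing only the chosen points keeps the rewritten prompt focused and leaves out the feedback you judged frivolous.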


Few-shot, Many-shot, Metadata and CoT Prompting 💡

Article

Zero Shot Prompting

Zero-shot prompting is a technique used with large language models (LLMs) such as GPT (Generative Pre-trained Transformer) that enables the model to undertake tasks it hasn't been explicitly trained on. It involves presenting a task to a language model without any task-specific examples or training. The model is expected to understand and execute the task based solely on its pre-existing knowledge and the general instructions provided in the prompt. We communicate with the model using a prompt that explains what we want to achieve. The model uses its pre-trained knowledge, acquired from a vast amount of text data, to infer the best way to complete the task.

This capability is pivotal for several reasons:

Versatility and Adaptability: It allows models to handle a wide range of tasks without the need for fine-tuning or retraining, making them highly versatile and adaptable to new challenges. Whether it’s sentiment analysis, summarization, or question-answering, the model adapts to the prompts provided.

Cost Efficiency: Reducing the necessity for large, annotated datasets for every new task saves significant resources in data collection and annotation.

Generalization: It demonstrates the model's ability to generalize from its training data to new, unseen tasks, highlighting its understanding of language and concepts.

Example of Zero-Shot Prompting

Let's consider the task of sentiment classification. Here's how you would set up your prompt:

Task: Sentiment classification
Classes: Positive, neutral, negative
Text: "That shot selection was awesome."
Prompt: “Classify the given text into one of the following sentiment categories: positive, neutral, negative.”

The model's response would likely be "positive" because it has learned from its training data that the word "awesome" is associated with positive sentiment.
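The sentiment example above can be turned into a small prompt builder. This is a sketch under stated assumptions: the function name `zero_shot_prompt` and the trailing "Sentiment:" cue are additions made for illustration, and the instruction wording follows the article's example. The resulting string contains no worked examples, which is what makes it zero-shot.

```python
from typing import Sequence


def zero_shot_prompt(text: str, classes: Sequence[str]) -> str:
    """Build a zero-shot classification prompt: instructions only, no examples."""
    if not classes:
        raise ValueError("at least one class label is required")
    labels = ", ".join(classes)
    return (
        "Classify the given text into one of the following sentiment "
        f"categories: {labels}.\n"
        f'Text: "{text}"\n'
        "Sentiment:"  # assumed cue inviting the model to answer with a label
    )


if __name__ == "__main__":
    print(zero_shot_prompt(
        "That shot selection was awesome.",
        ["positive", "neutral", "negative"],
    ))
```

The string would then be sent to whichever model you use; no particular client library is assumed here.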
Few Shot Prompting

Few-shot prompting is a technique used to guide large language models (LLMs), such as ChatGPT and Llama, to perform specific tasks or understand particular contexts using only a small number of examples. In few-shot prompting, you provide the model with a few carefully selected examples (typically between 2 and 10) that demonstrate both the input and the desired output of the task. These examples help the model infer the pattern or context of the task, which it then attempts to generalize to new, unseen inputs.

It's important to note that the model does not update its internal weights during few-shot prompting. The model temporarily "learns" or infers patterns from the provided examples but discards this information once the interaction is over.

Example 1:
Input: "Do you have the latest model of the XYZ smartphone in stock?"
Response: "Thank you for your inquiry. Yes, we have the latest XYZ smartphone model available. Would you like to place an order?"

Example 2:
Input: "Is the ABC laptop available in your store?"
Response: "Thank you for reaching out. The ABC laptop is currently out of stock, but we expect new shipments to arrive next month. Can we notify you when it's available?"

Your task:
Input: "Can you tell me if you have the DEF headphones in stock?"
Response:

In this scenario, the model is provided with two examples of customer inquiries regarding product availability, along with the corresponding email responses. In the first example, the product is in stock, and the response includes an offer to place an order. In the second example, the product is out of stock, and the response offers to notify the customer when it becomes available. When the model is tasked with generating a response to a new inquiry about DEF headphones, it applies the pattern observed in the previous examples to craft an appropriate reply.
This might involve confirming the product's availability and suggesting next steps if it's in stock, or explaining that the product is out of stock and offering alternatives or a notification service. This approach enables the model to understand the context of customer service in a business setting and to generate responses that are both relevant and considerate of the customer's needs.

Exemplars (Examples)

Exemplars are specific instances or examples that demonstrate how a task should be performed, helping to train or guide machine learning models, especially in few-shot learning scenarios. Here's how few-shot prompting can be approached using exemplars for a business-related task, such as drafting email responses to customer inquiries about product availability, while avoiding common pitfalls:

Ensure Exemplar Consistency: All exemplars should follow a consistent format and structure. This consistency helps the model to understand the task and apply its learning to new inputs effectively.

Select Relevant Exemplars: Choose exemplars directly related to the task at hand. Irrelevant exemplars can confuse the model, leading to inaccurate outputs.

Diversify Your Exemplars: To give the model a broad understanding of the task, include a range of exemplars that cover various scenarios and outcomes related to the task. This diversity helps the model handle different inputs more effectively.

Keep Exemplars Simple and Clear: While it's important to capture the complexity of the task, overly complicated exemplars can confuse the model. Aim for clarity and simplicity to ensure the model can easily learn from the examples provided.

Optimize the Number of Exemplars: Balance is key. Too few exemplars may not provide enough information for the model to understand the task, while too many can overwhelm it. Adjust the number of exemplars based on the task's complexity and the model's performance.
Incorporate Contextual Clues in Exemplars: Providing clear instructions and relevant context within your exemplars is crucial. These clues help the model to understand the task better and generate more accurate outputs.

Many-shot Prompting

Many-shot prompting is a variant of few-shot learning where, instead of using a handful of examples (e.g., around 10), you use several hundred examples (e.g., 500-800). Models with large context windows, such as Gemma, can accommodate many examples in a single prompt. However, a significant downside of utilizing such large context windows is the increased computational cost and slower inference times. With this many examples, it may be more efficient to fine-tune the model directly, avoiding the repeated cost of processing large context lengths during every inference.

In-Context Learning

In-Context Learning refers to a large language model's ability to perform tasks by interpreting examples provided in the input prompt, without updating its internal parameters. Few-shot prompting and many-shot prompting are both forms of in-context learning. Despite the term "learning," the model doesn't actually update its weights or retain information beyond the current interaction. Instead, it temporarily infers patterns or rules from the examples in the prompt but discards this inferred knowledge once the interaction concludes.

Metadata Prompting

Metadata prompting is an approach designed to simplify and streamline the process of instructing large language models (LLMs). It applies principles of modularity and separation of concerns to prompt engineering, enhancing the effectiveness of communication with LLMs. Traditionally, prompts often combine task descriptions with explanations of various entities involved, resulting in complex and cluttered instructions. The core principle of metadata prompting is to separate the task description from entity explanations.
It encourages users to start by clearly defining the main task, using all necessary entities without worrying about explaining them. To distinguish entities within the task description, they are enclosed in backticks (`). This allows for a focused and concise task description while clearly marking which terms will be explained later. After the task is clearly defined, each entity that requires explanation is described separately in JSON format. The entity names serve as keys, with their explanations as corresponding values.

This structured approach offers several benefits:
- It creates a clear separation between the task description and entity explanations.
- It makes prompts easier to understand, modify, and maintain.
- It helps visualize connections between different parts of the task more effectively.
- It reduces clutter in the main task description.
- It introduces modularity, allowing for easier updates and reuse of entity explanations across different prompts.

By structuring prompts in this way, metadata prompting aims to create more efficient, readable, and adaptable instructions for AI models, ultimately improving the quality of AI-generated outputs and making the process of working with LLMs more user-friendly.

As an example, let's consider a situation where a user wants to assign custom tags to each paragraph in an extensive document. Given the limitations on the token size that an LLM can handle, the document would need partitioning into segments. Yet, for every segment, crucial context like the document's title, headings, and preceding paragraphs must be provided. Traditional prompting methods might fall short here, as LLMs could have difficulty discerning metadata from the main content. In contrast, metadata prompting offers a more straightforward communication method.

Tag each of `target-paragraphs` with one of the `tags` considering `article-title`, `headings` and `preceding-paragraphs`.
tags: """
tagA: definition of tag A
tagB: definition of tag B
""",
article-title: """Article title""",
headings: """
h1: heading with type Heading 1
h2: heading with type Heading 2
""",
preceding-paragraphs: """Provide 2 paragraphs that come before the target paragraphs to give more context""",
target-paragraphs: """Provide the paragraphs you want the task to tag"""

Using the impressive NLU and in-context learning abilities of LLMs, AI agents typically use text as an interface between components to plan, use external tools, evaluate, reflect, and improve without additional training.

Chain of Thought (CoT) prompting

Chain of thought prompting is a technique used to encourage language models to break down complex problems into a series of smaller, interconnected steps or thoughts, mimicking the way humans reason through problems. A language model is prompted to generate a series of short sentences that mimic the reasoning process a person might employ in solving a task.

The process involves three main steps:

Step-by-step reasoning: Instead of directly providing the final answer, the model generates a series of intermediate reasoning steps that guide it towards the solution by breaking down the problem into smaller, more manageable parts.

Intermediate outputs: At each step, the model generates an intermediate output that serves as a building block for the next step in the chain of thought. These outputs can be partial solutions, relevant information, or logical connections.

Final output: After generating the intermediate steps, the model combines the information to produce the final answer or solution to the original prompt.

There are several approaches to prompting a model to generate intermediate reasoning steps in a chain of thought. The most common, and the one used in the original paper by Wei et al. (2022), is few-shot learning.
In this approach, the model is provided with a few examples of problems along with their corresponding chains of thought and final answers. The model learns from these examples and applies the same reasoning pattern to new, unseen problems, relying on its ability to generalize from a small number of examples. In their experiments, Wei et al. (2022) provided the model with examples of problems, each demonstrating the step-by-step reasoning process. For instance:

[Figure: example problems with their chains of thought. Source: Paper link]

Note: A good read on automating the selection of exemplars is Auto-CoT: Paper link, Good summary article

When presented with a new question, the model uses these examples as a reference to generate its own chain of thought and final answer. The authors found that this few-shot learning approach led to significant improvements in the model's performance on various reasoning tasks, including arithmetic, commonsense reasoning, and symbolic manipulation. The generated chains of thought also provided valuable insights into the model's reasoning process, making its outputs more interpretable and trustworthy.

Typical implementation: Questions 1 to n are the few-shot exemplars with their respective reasonings and answers. The final question is the new one to be solved.

Question: {question 1}
Reasoning: Let's think step-by-step. {reasoning 1}
Answer: {answer 1}
...
Question: {question n}
Reasoning: Let's think step-by-step. {reasoning n}
Answer: {answer n}
Question: {new question}
Reasoning: Let's think step-by-step.

Other approaches to prompting a model to generate intermediate reasoning steps include:

1. Zero-shot Chain of Thought: Appending a phrase such as "Let's think step by step", "Break down your reasoning into clear steps", or "Take a deep breath and work on this problem step-by-step" to the original prompt encourages the model to break down its reasoning process into a series of logical, intermediate steps rather than attempting to reach the final answer in one leap.

2. Structured prompts: Prompts that include placeholders for intermediate reasoning steps, which the model is expected to fill in along with the final answer. For instance, a prompt might be structured as follows:

Question: [Original question]
Step 1: [Placeholder for first reasoning step]
Step 2: [Placeholder for second reasoning step]
...
Step N: [Placeholder for final reasoning step]
Answer: [Placeholder for final answer]

The model fills in the placeholders with relevant intermediate steps and the final answer.

How is it Different from Standard Prompting?

Standard prompting might involve asking a model a direct question and receiving a direct answer, without any explanation of the steps taken to reach that answer. CoT prompting, on the other hand, explicitly asks the model to show its work, providing a step-by-step breakdown of its reasoning. This not only leads to more accurate answers in many cases but also provides an explanation that can help users understand the model's thought process.

Business Example: Enhancing Customer Support with RAG and CoT

Consider an online retailer implementing a chatbot equipped with retrieval-augmented generation (RAG) and chain of thought prompting to handle customer inquiries. A customer asks a complicated question about a product's features, compatibility with other devices, and return policy.

Logical Processing: Through chain of thought prompting, the chatbot first breaks down the query into sub-questions: What are the product's key features? Which devices are compatible? What is the return policy?

Retrieval: For each sub-question, the chatbot retrieves relevant documents and processes the information sequentially, starting with product features, moving to compatibility, and finally addressing the return policy. At each step, it synthesizes information from the retrieved documents and previous reasoning steps.
Final Response: The chatbot compiles its findings into a comprehensive response that clearly explains the product's features, compatibility with specific devices, and return policy, offering a detailed and helpful answer to the customer's inquiry.

This example illustrates how chain of thought prompting in RAG transforms the way LLMs handle complex queries, enabling them to provide more accurate, detailed, and contextually relevant responses. By mimicking human-like reasoning and adaptability, this approach significantly enhances the capabilities of AI in business applications, particularly in areas requiring deep understanding and nuanced responses.

Other Prompting types

Stepwise prompting: Please read here
Prompt chaining: Please read here
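The few-shot chain of thought template given under "Typical implementation" earlier in this article can be assembled programmatically. The sketch below is one possible way to do it, not the method used by Wei et al.: the `Exemplar` class, the function name, and the sample arithmetic problem are all assumptions made for illustration. Passing an empty exemplar list reduces the prompt to the zero-shot chain of thought variant.

```python
from dataclasses import dataclass
from typing import Sequence

# Reasoning cue used in the article's template.
COT_CUE = "Let's think step-by-step."


@dataclass(frozen=True)
class Exemplar:
    """One worked example: a question, its reasoning chain, and the answer."""
    question: str
    reasoning: str
    answer: str


def few_shot_cot_prompt(exemplars: Sequence[Exemplar], new_question: str) -> str:
    """Join the worked exemplars, then end with the new question and the cue."""
    blocks = [
        f"Question: {ex.question}\n"
        f"Reasoning: {COT_CUE} {ex.reasoning}\n"
        f"Answer: {ex.answer}"
        for ex in exemplars
    ]
    # The final block is left open so the model continues with its own reasoning.
    blocks.append(f"Question: {new_question}\nReasoning: {COT_CUE}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up exemplar, for illustration only.
    demo = [
        Exemplar(
            question="A shop has 3 boxes with 4 pens each. How many pens are there?",
            reasoning="There are 3 boxes and each holds 4 pens, so 3 * 4 = 12.",
            answer="12",
        )
    ]
    print(few_shot_cot_prompt(demo, "A shelf has 5 rows of 6 books. How many books?"))
```

Keeping every exemplar in the same Question, Reasoning, Answer layout follows the consistency guideline for exemplars discussed earlier in the article.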

