At the cost of 1 local Intern, get 2 remote Experienced Professionals
Hero

If you are a client who wants to work remotely from home for a US company, click here

If you are a startup, then click here to get more information

Article
Happy International Recruiters Day! I missed writing this shout out two weeks back but better late than never. Over the past few years, I’ve had the pleasure of working closely with many recruiters. As someone observing the hiring ecosystem from both strategic and human angles, I’ve come to a deep and genuine appreciation for what recruiters actually go through—and why they deserve far more credit than they get. Recruiting, from the outside, might seem like it’s all LinkedIn messages and phone calls. But beneath that surface is a job that’s demanding, emotionally draining, and often thankless. Yet, day after day, recruiters show up and do it anyway. Here’s just a glimpse of what they deal with: 💬 Ghosting Is Real Recruiters spend hours carefully sourcing, screening, prepping, and coordinating interviews. Then, just when it seems like things are progressing—poof. The candidate vanishes. No call. No reply. Just silence. It’s not only frustrating, it’s demoralizing—especially when that recruiter genuinely believed they were helping someone move forward in their career. 🤔 Hiring Managers Who Don’t Know What They Want One day, a role needs a "rockstar senior lead." The next, it’s "actually, we’d prefer someone more junior." Recruiters are constantly chasing a moving target, trying to balance shifting expectations while still sourcing viable candidates in a competitive market. It’s like being told to hit a bullseye—while the dartboard is on a rollercoaster. 🤯 Unrealistic Expectations “Find someone with 10 years of experience… in a technology that’s only existed for three.” Enough said. (And let’s not even start on the budget constraints.) 🤬 Rejection Fallout Recruiters are the ones who deliver the bad news—sometimes daily. And while most candidates understand, some lash out. Frustrated job seekers, understandably emotional, often end up venting their pain at the one person who tried to help. Recruiters carry that emotional weight. Quietly. Repeatedly. 
🗑️ Being Treated Like Spam Reaching out to passive candidates isn’t easy. Especially when they’re met with cold shoulders, snarky replies, or worse—outright disrespect. Many recruiters are genuinely trying to present good opportunities, not just “fill a role.” But still, they get brushed off like inbox clutter. 📚The Need to Constantly Learn Job roles are evolving. Technologies shift monthly. Market demands pivot on a dime. Good recruiters aren’t just messengers—they’re students. They learn the ins and outs of new roles, skill sets, and industry trends just to have relevant conversations and make the right matches. They’re expected to understand every job they hire for, even if the role didn’t exist last year. 🗜 Pressure from All Sides Recruiting is a numbers game—but it’s also deeply human. Success is measured by placements, time-to-hire, and funnel ratios. But those metrics don’t show the late-night calls, the emotional support, or the hours of prep before a big interview. You’re stuck between candidates who want clarity and empathy, and hiring teams who want speed and perfection. ✨And Still… They Show Up Despite all of this—despite the frustration, the emotional labor, the ambiguity—recruiters keep going. They guide candidates. They mediate expectations. They play therapist, career coach, and project manager all in one. They help change lives. Often without getting thanked. So to all the recruiters out there grinding through the noise, the rejections, and the endless juggling: I see you. I appreciate you. Thank you for doing what you do. If you’ve worked with a recruiter who made a difference in your career—tell them. A quick thank you can mean more than you know. #Recruiters #ThankYou #Hiring #MAdAiLab
2 min read
authors:
Rohit Aggarwal

Article
The use of AI is quickly becoming ubiquitous, and it's no surprise that job candidates are tapping into it at every stage of the hiring process. From résumé writing to interview prep, AI is fundamentally reshaping how applicants present themselves. But this transformation isn't without consequences. Applicant Tracking Systems (ATS), once a cornerstone of candidate filtering, are becoming less effective as AI-generated resumes easily pass keyword and formatting checks. At the same time, a tight and competitive job market means recruiters often face hundreds—sometimes thousands—of applicants for a single position. To cope, many recruiters rely on skills assessments to screen candidates. But that strategy is also being undermined by AI. As models become more powerful, what may be considered an effective test today can quickly become obsolete within weeks. Recruiters are forced into a game of cat and mouse—constantly redesigning assessments that might briefly outpace AI’s capabilities. Another layer of complexity: access to AI is not equal. Candidates with the means to afford premium tools like ChatGPT Pro ($200/month for team access) or Claude Opus have an unfair advantage. These tools can complete tests far more effectively than free-tier models. So in effect, the hiring process is tilted toward those with financial privilege—if your parents can pay for elite AI, you’re suddenly a stronger applicant. This dynamic has created what recruiters increasingly refer to as an "AI arms race" among candidates. As more applicants leverage AI to enhance their applications, those who don't use these tools appear significantly less qualified by comparison—even when they possess superior actual skills. This competitive pressure forces even hesitant or ethically concerned candidates to adopt AI assistance simply to remain viable in the candidate pool. 
This escalating cycle normalizes AI dependence and further obscures genuine talent assessment, creating a classic prisoner's dilemma where individual rational choices lead to a collectively problematic outcome. Even live interviews are no longer the gold standard. With the rise of interview co-pilots—tools that feed candidates real-time responses over Zoom—recruiters can no longer be certain if they’re speaking with a candidate or a candidate-plus-AI hybrid. These tools can convincingly speak about complex projects the candidate may not have even worked on. This challenge has prompted some companies to take a hard stance. Even Anthropic, the AI research company behind Claude 3.5 Sonnet—a model now more widely used than ChatGPT-4—has banned candidates from using AI during the application process. To many, this feels deeply ironic: why would an AI company prohibit the use of AI? But the irony disappears when you consider the deeper challenges. A common argument in favor of allowing AI is that employees already use it on the job—so why not let candidates use it too? Denying AI during the hiring process, some say, is a false constraint that fails to reflect how people actually work. That may sound reasonable in theory, but it overlooks the practical difficulties of designing effective, AI-resistant assessments. It’s easy to blame hiring managers for not being creative enough—but the reality is far more complicated. Ideally, a recruiter would want to hire a professional who can guide AI—especially in scenarios where AI falls short. AI tends to fail when it lacks context, when information is siloed across teams, or when decisions rely on undocumented institutional knowledge or nuanced domain expertise. These are exactly the kinds of gaps humans fill by asking the right questions, drawing from experience, and applying judgment in ambiguous situations. 
In a post-AI world, the focus increasingly shifts from “how” (which AI can often handle) to “why” and “what.” However, the more ambiguity a test includes to assess this kind of thinking, the more likely candidates are to drop out, often due to unfamiliarity or discomfort. On the flip side, the more a test is simplified for objectivity or scalability, the easier it becomes for AI to solve. That’s why traditional approaches (data structures and algorithms problems, or platforms like HackerRank and LeetCode) have long served as the go-to for candidate evaluation. But those methods only worked when AI wasn’t capable of solving them easily. That era is over. And so, this new wave of evaluation processes banning AI use during hiring begins to make sense. The goal isn’t to resist progress, nor are hiring managers being short-sighted or pedantic, like someone insisting on log tables in an era of ubiquitous calculators. The intent is to test whether candidates possess fundamental skills without AI assistance. After all, guiding AI through complex scenarios requires a solid grasp of the basics. If a candidate can’t solve simple problems on their own, how can they be expected to troubleshoot or direct AI effectively when it inevitably runs into limitations? Succeeding at basic tasks without AI now seems like a necessary condition. And in the absence of a clear sufficient condition to evaluate deeper competency, many companies are turning to this necessary condition as a pragmatic compromise. Some pioneering companies are exploring technological countermeasures to restore assessment integrity. Machine-proctored testing platforms that record candidates' screens and video and apply AI-detection algorithms show promise for the initial evaluation round. After this initial filtering, these companies test AI-human collaboration directly, designing exercises that explicitly require candidates to demonstrate how they would guide AI through complex scenarios with incomplete information.
But we know this space is evolving fast. How are you handling it? Have you encountered cases where candidates used AI to misrepresent their skills—or seen clever ways they’ve cheated the system? How are you adapting your evaluation process to stay ahead of these shifts? Share your experiences and strategies—we’re all figuring this out together. #RecruitmentTrends #AIinHiring #TalentAcquisition #FutureofWork #MAdAILab #AI
3 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The adoption of artificial intelligence (AI) technologies, such as ChatGPT, has surged dramatically, propelling the market capitalization of Nvidia (AI hardware manufacturer) to over $3.3 trillion and establishing it as the world's largest company in July 2024. Looking ahead, global investment in AI technologies is projected to reach $200 billion by 2025, underscoring the expanding role of AI across various industries. Generative AI (GenAI) has become a central focus, accounting for 48% of total AI funding in 2023—a significant rise from just 8% in 2022. Until the last decade, much of what was popularly considered Artificial Intelligence (AI) was dominated by traditional machine learning (ML) methods such as regression, support vector machines (SVMs), random forests, and XGBoost. These methods excelled in many applications and were particularly well-suited to structured data—tabular datasets with defined numerical or categorical features. Traditional ML typically follows a structured pipeline with well-defined steps as shown in the first part of the diagram: data gathering, feature engineering, model training, and deployment. They performed remarkably well for tasks such as classification (e.g., predicting categories) or regression (e.g., forecasting numbers). Applications like recommender systems and predictive analytics flourished using these techniques. However, traditional ML had limitations when applied to tasks involving unstructured data like text, images, or audio. To illustrate, consider a social media manager tasked with choosing an image for a blog article. This process might involve: Identifying suitable keywords based on the article content. Searching for images using those keywords. Evaluating multiple image attributes, including aesthetics, relevance, and emotional appeal. Making a final decision. 
Traditional ML approaches could extract keywords or classify images to some extent, but they required extensive feature engineering, which demanded significant human effort and expertise, and often resulted in subpar performance. This was a major reason why many creative, human-centric tasks remained difficult to automate. The emergence of deep learning and, more recently, Generative AI (GenAI), has transformed how AI tackles such challenges. Deep learning models handle unstructured data like text, images, audio and video in a way that traditional ML could not. These models learn to extract complex patterns directly from raw data, effectively eliminating the need for manual feature engineering. GenAI models can be highly useful for assisting with parts of human decision making, especially those involving text or image generation, summarization, answering questions based on context, and extracting keywords according to instructions. This fundamental shift in capabilities has created both new opportunities and challenges. Yet despite these advances, many discussions still equate AI with traditional ML approaches, failing to recognize that GenAI is the primary driver of current enthusiasm and innovation. This oversimplification not only misrepresents the technology but also risks diverting attention from the unique challenges of GenAI. Consequently, professionals may overlook the need to address these challenges and allocate resources effectively to ensure GenAI's reliable integration into business workflows. In this article, we examine the fundamental ways GenAI differs from traditional ML and explore how these distinctions necessitate workflow redesign. The accompanying flowchart provides a comprehensive visualization of a typical GenAI workflow, highlighting six key areas where it diverges from conventional ML pipelines: workflow experimentation, metrics and evaluation, guardrails, output monitoring, prompt engineering, and model drift.
As the flowchart illustrates, each of these areas introduces new considerations and complexities that must be carefully managed. Below, we discuss these key differences in detail, demonstrating why organizations need to fundamentally reevaluate how they develop, guide, and deploy GenAI systems. 1. Workflow Experimentation for Human-Like Decision Making GenAI's capacity to simulate human decision-making enables organizations to automate complex tasks that were previously considered too nuanced for traditional automation. However, this capability introduces new challenges in workflow design and optimization, as shown in the "Workflow Experimentation" element of our flowchart. While Generative AI (GenAI) has revolutionized certain aspects of human decision-making automation, traditional tools like coding, machine learning (ML), APIs, and database calls remain essential for other facets. Implementing GenAI effectively requires more than just leveraging its advanced capabilities; it necessitates a nuanced approach to replicating the integrated and often implicit cognitive steps humans naturally perform during decision-making. A key aspect of GenAI workflow experimentation involves task decomposition—an essential process of unpacking what were previously fluid human cognitive processes into discrete, manageable components that can be handled by AI systems. When humans perform complex tasks, they often seamlessly integrate multiple cognitive processes without explicitly defining each step. For instance, a human editor reviewing an article simultaneously considers content accuracy, tone, audience engagement, and cultural sensitivity without consciously separating these aspects. The challenge in GenAI implementation lies in identifying and articulating these implicit cognitive steps and finding ways to replicate them through AI operations. 
This decomposition process often reveals multiple potential approaches to solving the same problem, each representing a different way of breaking down and reconstructing human cognitive workflows, with its own trade-offs in terms of reliability, cost, and complexity. Consider, for example, the task of selecting an appropriate image for a blog post. This seemingly straightforward task can be approached through several workflow variations: Option A: Extracting keywords and themes from the article using GenAI, then using these elements to filter images on a stock platform. This approach prioritizes content relevance but might miss emotional resonance. Option B: Using GenAI to analyze both the article's content and its intended emotional impact, then filtering images based on these dual parameters. This method might yield more emotionally aligned results but requires more complex processing. Similarly, in content moderation—another domain where GenAI is increasingly deployed—different workflow variations emerge: Option A: Direct classification of content using a single GenAI pass, which is faster but might miss subtle violations. Option B: A multi-step approach where content is first decomposed into distinct elements (text, context, implications) before being analyzed, potentially offering higher accuracy at the cost of increased processing time and complexity. These workflow variations illustrate why experimentation is crucial in GenAI implementation. Each approach represents a different way of decomposing the task, and the optimal choice often depends on specific use case requirements, resource constraints, and performance metrics. This level of workflow experimentation stands in marked contrast to traditional ML approaches. Traditional machine learning (ML) models focus on specific predictive and descriptive tasks, with their architectures designed around transforming input data into well-defined output formats. 
Unlike generative AI models that can produce diverse forms of content, traditional ML specializes in targeted tasks like classification, regression, and pattern recognition. They are typically employed in structured environments with clearly defined inputs and outputs, where the goal is to optimize performance on specific metrics. Hence, they do not require extensive workflow experimentation, where different task decompositions and sequences need to be explored. The flowchart illustrates this divergence by showing how GenAI workflows can branch into either prompt-based solutions or fine-tuned models, depending on the task requirements. This flexibility in approach, combined with the ability to experiment with different task decomposition strategies, allows organizations to iteratively develop and refine workflows that can handle increasingly sophisticated decision-making processes. Potential workflow sequences must be compared carefully, with the final choice balancing trade-offs across various metrics such as reliability, cost, and complexity. 2. Metrics and Evaluation Traditional ML systems rely on clear, quantitative metrics such as accuracy, mean squared error (MSE), and F1 score, which are objective and easily automated. For instance, in a customer churn prediction model, accuracy directly measures the percentage of correct predictions, while MSE quantifies the difference between predicted and actual values in tasks like sales forecasting. These metrics provide straightforward ways to assess model performance and guide improvements. The Metrics Planning step for GenAI requires a more nuanced and multi-faceted approach. GenAI outputs, such as empathetic dialogue, error-free technical instructions, or humorous marketing copy, often require subjective evaluation that defies straightforward numerical measurement. 
Additionally, GenAI systems can sometimes generate plausible but factually incorrect information—a phenomenon known as hallucination—which requires specialized detection methods. This complexity necessitates three distinct types of evaluation: Automated Metrics: For text generation tasks, metrics may include ROUGE scores (which measure content overlap with reference texts), specialized checks for hallucination and factual accuracy, and RAGas benchmarks (Retrieval-Augmented Generation assessment scores that evaluate how accurately the model uses retrieved information in its responses). While these automated metrics provide a baseline for evaluation, they capture only certain aspects of output quality. Generative AI-Based Evaluations: Specialized models can assess various qualitative aspects such as coherence, creativity, emotional alignment, or contextual relevance—whether for text, images, or speech outputs. These models can also cross-validate factual claims and identify potential inconsistencies that might indicate hallucination. This approach creates a more comprehensive evaluation framework that can handle subjective qualities at scale. Human Assessments: Domain experts and end-users play a crucial role in evaluating subjective qualities such as empathy, humor, and cultural sensitivity through the human-in-the-loop process. Human assessors are particularly valuable in catching subtle hallucinations that might slip past automated checks, especially in domain-specific contexts where factual accuracy is critical. These three evaluation streams converge in the Fine-Tuned Model or Prompt Chain Validation phase, where results are synthesized and compared against human expectations. When misalignments emerge—for instance, if automated metrics indicate high performance but human assessors find the outputs lacking appropriate emotional tone or cultural nuance—teams can pursue several established adjustment paths. 
These include revising the evaluation metrics, experimenting with different prompt chains, or modifying fine-tuning strategies. Teams can first try different metrics to better capture the desired generated output aspects. If misalignments persist, they can explore different prompt chains or models for fine-tuning. When both these approaches prove insufficient, teams can try a different workflow variation, such as trying alternative task decomposition strategies. This comprehensive evaluation process stands in marked contrast to traditional ML approaches. In conventional ML, performance improvements typically involve straightforward parameter tuning, algorithm selection, or feature engineering. GenAI evaluation, however, requires a more sophisticated approach that balances multiple feedback loops and assessment types. This multi-layered system helps ensure not only technical accuracy and the absence of hallucination, but also the subjective qualities that make GenAI outputs truly valuable. 3. Pre-Generative Guardrails Following the careful design of workflows, implementing robust pre-generative guardrails becomes essential for responsible GenAI deployment. As shown in the flowchart's Query Processing section, these guardrails serve as a critical checkpoint before any user query reaches the AI system, with clear "Pass" and "Reject" decision paths. Unlike traditional ML systems that operate within tightly controlled environments, GenAI models face dynamic, user-driven interactions that require comprehensive protective measures. These pre-generative guardrails evaluate incoming queries through several key security and safety lenses: Prompt Injection Defense: Users might attempt to manipulate the model by embedding instructions within seemingly innocent queries that could override the system's intended behavior. For example, a user might append "ignore previous instructions" to their query, attempting to bypass safety constraints. 
Pre-generative guardrails detect and block such manipulation attempts through pattern matching and semantic analysis.

Societal Risk Assessment: These checks evaluate whether incoming queries could lead to outputs that might cause broader social harm, such as requests for generating discriminatory content, promoting misinformation, or creating misleading deepfakes.

Reputational Risk Screening: Separate from societal concerns, these checks focus specifically on protecting organizational interests by identifying queries that might result in outputs misaligned with company values, brand guidelines, or regulatory requirements. For instance, in a financial services context, this might involve blocking queries that could lead to unauthorized financial advice.

The flowchart illustrates how these guardrails operate in practice through a binary decision process:

Pass: Queries that clear all safety checks proceed to the AI-augmented workflow deployment stage for processing.

Reject: Queries flagged as potentially harmful are blocked, triggering the "Inform user/system" response. This might involve providing users with a clear explanation of why their query was rejected and suggesting appropriate modifications, or alerting system administrators about potential misuse patterns.

Consider a practical example: In a customer service AI system, incoming queries first pass through these guardrails. If a user asks for help with a product return, the query passes through. However, if they request confidential information about other customers, the system would reject the query and inform the user about data privacy policies. These pre-generative guardrails work in concert with the broader workflow shown in the flowchart, forming the first line of defense in a comprehensive safety framework.

4. Monitoring and Tracing

The monitoring phase in GenAI deployment introduces unique operational challenges that go beyond traditional ML's focus on system performance and data drift detection.
To address these challenges, organizations rely on random sampling and human evaluation of production outputs as a cornerstone of their monitoring strategy. This approach enables teams to detect subtle quality issues, validate metric effectiveness, and identify potential degradations that automated systems might miss. Random sampling proves particularly crucial for GenAI systems because they are susceptible to "silent degradation" - a phenomenon where outputs maintain good metric scores while becoming less effective in practice. For instance, a customer service AI might continue to achieve high ROUGE scores while providing less actionable information to users. Through random sampling, organizations can identify these cases where outputs achieve high metric scores but fail to meet actual user needs. This distinction requires monitoring not just metric values but the metrics themselves, creating a meta-level monitoring challenge unique to GenAI systems. Organizations implement continuous feedback loops that reassess metric effectiveness through careful analysis of user interactions. This monitoring approach becomes even more vital when dealing with foundation models accessed through external APIs - a common architecture in GenAI systems. Unlike traditional ML systems where organizations train models from scratch on their specific data, GenAI systems often rely on external foundation models beyond their direct control. When providers update or deprecate these underlying models, it can impact system performance in subtle ways. For example, a model API upgrade might improve general performance but alter the model's reasoning ability, requiring adjustments to prompts or fine-tuning strategies. Through this comprehensive monitoring approach, organizations can maintain robust oversight of their GenAI systems despite the unique challenges of external dependencies and potential silent degradations. 
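As a rough illustration of the random-sampling idea above, the sketch below draws a random slice of production outputs for human review and flags the "silent degradation" pattern, where the automated metric still looks healthy but human ratings do not. The record fields (`metric_score`, `human_rating`), the sampling rate, and both thresholds are hypothetical choices made for this example, not values taken from the article.

```python
import random
from statistics import mean


def sample_for_review(records, rate, seed=None):
    """Pick a random subset of production records to send to human reviewers."""
    if not records:
        return []
    rng = random.Random(seed)
    # Always review at least one record, even at very low sampling rates.
    k = max(1, round(len(records) * rate))
    return rng.sample(records, k)


def silent_degradation(reviewed, metric_floor=0.7, human_floor=3.5):
    """Return True when automated scores look fine but human ratings are low.

    Assumes each reviewed record carries a hypothetical automated score in
    [0, 1] ("metric_score") and a hypothetical 1-5 reviewer rating
    ("human_rating").
    """
    if not reviewed:
        return False
    avg_metric = mean(r["metric_score"] for r in reviewed)
    avg_human = mean(r["human_rating"] for r in reviewed)
    return avg_metric >= metric_floor and avg_human < human_floor


if __name__ == "__main__":
    # Toy data: high automated scores paired with poor reviewer ratings.
    production = [
        {"id": i, "metric_score": 0.9, "human_rating": 2.0} for i in range(50)
    ]
    batch = sample_for_review(production, rate=0.1, seed=0)
    print(len(batch), silent_degradation(batch))
```

A disagreement like this is a prompt to re-examine the metric itself rather than only the model, which is the meta-level monitoring the section describes.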
Tracing: In traditional ML, tracing an incorrect prediction is straightforward – teams can directly examine the input features, model weights, and final prediction to understand what went wrong. There's typically no need to track intermediate steps since the prediction process is deterministic and follows a fixed pipeline. However, GenAI systems often involve multiple processing steps with intermediate outputs that must be traced. For example, a customer support AI might first classify the query type, then retrieve relevant documentation, and finally generate a response by combining this information. When the final output is problematic, organizations need to trace through each step to identify where the process went awry. This traceability becomes particularly challenging with AI agents, where the model dynamically decides which steps to take and which tools to use based on the context. In such cases, two identical queries might follow completely different execution paths, making it essential to maintain detailed traces of decision points, tool selections, and intermediate outputs for effective monitoring and debugging. 5. Guardrails on Generated Output Post-deployment validation of GenAI outputs remains essential despite pre-deployment safeguards, with three primary types of guardrails: hallucination checks, ethical/bias controls, and brand tone verification. Traditional ML outputs are typically constrained to specific, pre-defined categories or numerical predictions, eliminating concerns about brand voice consistency or open-ended ethical implications. Hallucination presents a unique challenge in GenAI systems, as they can produce convincing but inaccurate information. For instance, an AI handling insurance queries might confidently provide incorrect policy details, creating liability risks. Unlike traditional ML models, GenAI can generate inconsistent outputs for identical inputs, necessitating runtime guardrails beyond standard performance monitoring. 
These include analyzing semantic drift, checking for contradictions, validating claims, and employing multiple model consensus to identify potential hallucinations. Ethical and bias controls involve regular auditing of response patterns across demographic groups, with specific fairness metrics under continuous monitoring. Organizations establish and maintain explicit ethical guidelines, ensuring transparency about system capabilities while documenting limitations and appropriate use cases. Regular reviews and updates of these principles occur based on real-world impact assessments. Brand voice consistency requires continuous validation, implemented through LLM-based validators trained on company communications, encoded style guides, contextual appropriateness checks, and brand personality metrics. For example, if a company emphasizes warm, empathetic communication, guardrails would flag overly technical responses even if factually accurate. When guardrails detect issues or in high-stakes scenarios, outputs are routed for human review through clearly established escalation pathways. The flowchart shows how GenAI systems require two critical runtime pathways: a "Pass" pathway for outputs that clear automated checks, and a "Review Needed" pathway that routes flagged outputs to human reviewers. This human-in-the-loop review process involves trained professionals who handle edge cases, provide rapid intervention for inappropriate content, and validate complex brand voice alignment. The system's effectiveness relies on tight integration between automated checks and human review, with continuous feedback loops refining both components. 6. Iterating and Experimenting with Prompt Engineering Unlike traditional ML systems where model inputs follow rigid, predefined structures, GenAI systems often require carefully crafted prompts that serve as the interface between human intent and model capability. 
This fundamental difference emerges from GenAI's ability to process natural language instructions, making prompt engineering a crucial discipline that combines technical precision with domain expertise. The flowchart illustrates how prompt engineering fits into a larger workflow, positioned after the "Prompt Chain & Model planning" phase. Teams must first determine whether a task can be effectively accomplished through prompt chains and plan how different prompts will work together. For instance, a customer support workflow might begin with a prompt that classifies the query type, followed by separate prompts for information retrieval and response generation. This structured approach to prompt chain design provides a framework for systematic prompt development and optimization. The trade-off between prompt length and performance represents one of the most critical considerations in prompt engineering. Organizations must establish comprehensive testing frameworks that balance multiple competing factors. A minimal prompt might achieve faster response times but risk inconsistent outputs, while more detailed prompts can ensure reliability at the cost of increased processing overhead. Testing frameworks typically evaluate prompts across four key dimensions: reliability testing to verify consistent performance across diverse inputs and edge cases, cost analysis to measure and optimize token usage and processing time, quality assessment using both automated metrics and human evaluation to ensure outputs meet standards, and safety verification to confirm that prompts maintain model guardrails and prevent unsafe behavior. Through this systematic evaluation process, teams can quantify the impact of prompt length and complexity on both performance and operational costs. Prompt versioning and performance tracking introduce a level of rigor absent in traditional ML feature engineering. 
Teams maintain prompt repositories where each version is tagged with performance metrics such as completion rate (percentage of successful responses), accuracy scores from automated evaluations, and user satisfaction ratings. These historical comparisons help teams understand how prompt modifications impact performance. For example, when a financial services chatbot shows declining accuracy in tax-related queries, teams can analyze how different prompt versions handled similar questions in the past, using this data to guide improvements. The role of non-technical stakeholders in prompt engineering represents a significant departure from traditional ML practices. While feature engineering in traditional ML is primarily a technical exercise, effective prompt engineering requires cross-functional collaboration, relying heavily on non-technical stakeholders for crafting clear and reliable prompts. Legal experts help craft prompts that capture nuanced regulatory requirements, marketing teams ensure brand voice consistency, and subject matter experts validate technical accuracy. These stakeholders don't just provide input – they actively participate in prompt design and improve prompts with appropriate technical terminology, reasoning steps, or user persona details. 7. Feedback Unlike traditional ML systems where feedback primarily drives model retraining and feature engineering, GenAI feedback loops introduce unique complexities that fundamentally reshape how organizations learn from and improve their AI systems. The "Aggregate feedback for Workflow improvement" node in our flowchart illustrates this expanded feedback scope, showing how multiple streams - including direct user feedback, human-in-the-loop assessments, and monitoring data - must be synthesized to drive improvements across different system components. The variable nature of GenAI outputs creates the first major departure from traditional ML feedback patterns. 
In traditional ML, when a model makes an incorrect prediction, the feedback is straightforward: the prediction was wrong, and the model needs to be adjusted to map that input to the correct output. However, GenAI systems might generate different but equally valid outputs for the same input, or produce outputs that are partially correct but need refinement in specific aspects like tone or detail level. This variability means organizations must develop sophisticated feedback taxonomies that capture not just correctness, but also aspects like reasoning quality, creativity, and contextual appropriateness.

The application of feedback in GenAI systems also differs fundamentally from traditional ML. When a traditional ML model receives feedback about poor performance, the solution typically involves retraining the model with additional data or adjusting feature engineering. In contrast, GenAI feedback might lead to multiple types of adjustments:

Prompt refinements when feedback reveals gaps in instruction clarity or context
Guardrail updates when new edge cases or failure modes are identified
Workflow redesigns when feedback shows systematic limitations in current task decomposition
Metric adjustments when usage patterns reveal new quality aspects that need measurement
HITL routing criteria updates when certain query types consistently require human review

This multi-level adaptation process, where operational feedback can trigger revisions at any level of the system (from guardrails to metrics to workflow design), reflects the unique complexity of maintaining effective GenAI systems in production.

Conclusion

The fundamental differences between GenAI and traditional ML systems necessitate a complete reimagining of AI workflow design and implementation. While traditional ML focuses on specific predictive tasks with clear metrics, GenAI introduces complexities in evaluation, requiring balanced consideration of automated metrics, AI-based assessments, and human judgment.
Organizations must master new disciplines like prompt engineering and implement sophisticated guardrails both before and after generation. Successful GenAI deployment demands robust monitoring systems that can detect subtle degradations and maintain traceability across multiple processing steps. The feedback loop becomes more intricate, potentially triggering adjustments across multiple system components—from prompt refinements to workflow redesigns. As GenAI continues to evolve, organizations that understand and adapt to these distinct characteristics will be better positioned to harness its potential while managing associated risks and challenges.
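The multi-level adaptation described in the Feedback section can be sketched as a small routing table. This is only an illustration: the category names, the `Feedback` record, and the `aggregate` function are hypothetical, and a real system would need a much richer feedback taxonomy.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical feedback categories mapped to the five adjustment types
# listed in the Feedback section. The category names are assumptions.
ADJUSTMENTS = {
    "unclear_instruction": "prompt_refinement",
    "new_failure_mode": "guardrail_update",
    "task_decomposition_limit": "workflow_redesign",
    "unmeasured_quality_aspect": "metric_adjustment",
    "needs_human_review": "hitl_routing_update",
}


@dataclass(frozen=True)
class Feedback:
    source: str    # e.g. "user", "hitl", "monitoring"
    category: str  # one of the keys above, or something unknown
    note: str = ""


def aggregate(items):
    """Count how many feedback items point at each kind of adjustment."""
    counts = Counter()
    for item in items:
        # Unknown categories are not guessed at; they go to manual triage.
        counts[ADJUSTMENTS.get(item.category, "triage_manually")] += 1
    return dict(counts)


summary = aggregate([
    Feedback("user", "unclear_instruction", "answer ignored the tax year"),
    Feedback("hitl", "needs_human_review"),
    Feedback("monitoring", "unclear_instruction"),
    Feedback("user", "something_else"),
])
print(summary)
```

Counting by adjustment type rather than by raw feedback item gives a rough signal of which system component (prompts, guardrails, workflow, metrics, or HITL routing) most needs attention.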
13 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
AI coding tools are making significant inroads in the software development domain. These tools seem like magic when one first starts using them, as generating a fully functional application with just a few lines of instructions appears surreal. However, this initial excitement quickly subsides as code complexity increases.

Problems with AI tools such as Cursor or Windsurf in big codebases

In large codebases, AI tools often miss existing code, leading to duplication and inconsistencies. They also tend to forget prior instructions, causing repeated mistakes and forcing developers to reassert constraints. Debugging becomes a frustrating loop of false confidence: the AI repeatedly claims to have identified the issue and makes minor tweaks, but the problems persist. This cycle of "Ah, I see the issue now..." followed by ineffective fixes rapidly grows tiresome. Over time, abandoned or broken code fragments ("ghost code") accumulate, confusing the AI further and degrading generation quality. Before digging deeper into why AI tools face these problems, it's important to first understand how AI coding tools work.

How AI coding tools work

These tools use AI models such as Claude 3.7 Sonnet or GPT-4o to generate code. A logical question is what role the tool itself plays if it does not generate the code. The tool's primary roles are:

Prompt Enhancement for AI model: It enhances prompts for the AI model by taking your instructions, adding relevant context (code) to your prompt, and incorporating guidance (e.g., .cursorrules, .windsurfrules). These tools largely use Retrieval-Augmented Generation (RAG) based approaches to identify appropriate context. While RAG is a step forward in surfacing relevant information, it relies on heuristic or semantic similarity-based retrieval, which often fails to capture code structure, control flow, or execution dependencies.
Integrating response in code: It integrates responses into code by reading the AI model's output and incorporating it back into appropriate files and positions, sometimes removing existing code when needed. This integration typically relies on Abstract Syntax Tree (AST) based approaches.

Automated code improvement loops (Agentic): It performs agentic execution by running the modified code automatically using a terminal and observing any errors. If errors occur, the tool automatically creates a new prompt using those errors, adds relevant context and guidance, and sends it to the model. The model responds, the tool integrates the changes, and runs the code again. This process continues iteratively until either no errors remain, the maximum number of attempts is reached, or the user intervenes.

This agentic execution is both a blessing and a curse. When you provide only minimal instructions and allow the AI to operate based on its training, it often successfully replicates code without significant issues. However, as soon as you begin correcting it, adding logic, constraints, and other specific requirements, these tools increasingly struggle to identify the "right context" for your instructions. The key reason for this misidentified context is that as the sequence length (i.e., number of tokens) approaches approximately 60% of the context window (the maximum number of tokens the AI model can process), the model begins experiencing the 'lost in the middle' problem and tends to miss critical elements from the prompt.

The challenge of identifying relevant context

The key challenge for these AI tools is to identify the relevant context without approaching the context window limit.
Despite their best efforts, sequence length rapidly increases when these tools rely solely on code, because code consumes significantly more tokens than text explanations for two main reasons:

Explanations in words require far fewer tokens than code: Explaining concepts in natural language requires far fewer tokens than implementing them in code. What might take just a few sentences to explain, such as "Create a login page with username and password options. Also integrate the option to sign in using Google, Facebook, and LinkedIn", would translate to hundreds of lines of code when implemented with React packages, CSS, and various library imports.

Tokenization process: The tokenization process used by underlying AI models is less efficient for code than for natural language. For instance:

Original text | Tokens processed for input to GPT | Number of characters | Number of tokens generated
Loop over productTypes | loop, over, product, type, s | 22 | 5
for ( i = 0; i < count(productTypes); i++) { | for, (, i, =, 0, ;, i, <, count, (, product, type, s, ), ;, i, +, +, ), { | 44 | 20

Although the code has only twice as many characters, it produces four times as many tokens as the equivalent natural language expression, so each token covers about half as many characters (2.2 versus 4.4).

In summary, AI tools struggle to understand code context purely from the code itself because code often lacks the high-level meaning or intent behind it, making it difficult for AI to determine which parts are truly important. This is why prompt compression (reducing the amount of code or context in a prompt) remains such a significant challenge.

AI tools' approaches to mitigating the context problem

AI tools like Cursor and Windsurf try to help by using rules to guide the AI on what context to include. These rules can be set globally or just for a specific project, and they can change depending on the files or parts of the code you're working on. For example, Cursor lets you define rules that automatically include related files based on the code you're editing.
Windsurf does something similar with its memory and context system. While these tools don’t fully solve the problem—especially when it comes to deeper understanding like how pieces of code connect behind the scenes—they help the AI focus better. Another promising aspect of modern AI tools is their use of markdown-based planning documents. Cline was one of the first to formalize this approach by introducing two distinct modes: Plan mode, which focuses on gathering requirements and outlining implementation steps in markdown, and Act mode, which uses those plans to guide actual code generation. Following Cline’s lead, tools like Cursor and Windsurf have also introduced similar planning-focused workflows in recent months. In Plan mode, Cline can analyze relevant files, engage in dialogue to clarify objectives, and develop comprehensive implementation strategies. Cursor's Planner mode similarly supports creating detailed, context-aware plans, and its Agent mode can act on those plans autonomously, performing multi-step changes across a codebase. Windsurf brings a different strength with its Cascade system, which builds a semantic model of the project, helping the AI reason about dependencies across multiple files. These recent advances are pushing markdown planning from static outlines toward dynamic, interactive workflows that better support the complexity of modern software development. However, these planning modes can still struggle when applied to large, modular codebases—particularly if the plans are kept too high-level or not continuously refined. While the initial documents often outline technologies, frameworks, and task breakdowns (e.g., to-do, in-progress, done), the depth of planning depends heavily on how the tools are configured and used. Without iterative refinement or deeper integration with the code structure, these documents may miss important implementation details—especially in systems where components span many interdependent files. 
That said, tools like Cline, Cursor, and Windsurf are actively evolving in this direction, with features that increasingly support context-aware, multi-file planning and documentation.

Augmenting Technical Leads / Architects with AI Tools

While AI tools can significantly benefit from markdown documents covering specifications explained in natural language, technical leads typically create these specifications separately from the code repos. Since specifications aren't typically stored in the code repo that AI tools can access, architects must manually provide all relevant context to effectively use AI assistance. This manual context-bridging is cumbersome and time-consuming, effectively creating a barrier that leads to minimal AI adoption during the architectural and design phases. Having specifications in the code repository would greatly enhance AI's ability to decompose new requirements into specifications. In particular, it enables:

Mapping Dependencies: AI can cross-reference specifications with implementation details in code to uncover architectural constraints, data flows, compatibility with existing APIs or services, and component interactions. This ensures that newly generated specifications align with the existing system’s structure and limitations.

Specification Consistency: Referencing existing specs allows the AI to maintain consistent formatting, abstraction level, and terminology. This promotes maintainability and coherence across the codebase, especially as the system scales.

Code Reuse Optimization: With visibility into both specifications and code, the AI can identify reusable functions, services, and patterns. It can recommend leveraging existing components rather than rebuilding functionality from scratch, reducing duplication and accelerating delivery.
Moreover, if AI-generated specifications were stored directly in the code repository as markdown files:

Context Availability: AI coding tools could automatically access both the code and its corresponding human-readable specifications
Intent Comprehension: When generating or modifying code, the AI would have immediate access to the high-level intent and requirements
Token Efficiency: The specifications would consume far fewer tokens than equivalent code, making better use of the context window
Purpose Understanding: The AI would better understand the purpose behind code structures, reducing the likelihood of generating inconsistent or duplicate code

The disconnect between specifications and implementation creates significant inefficiencies in the software development lifecycle. As highlighted earlier, while tools like Cline, Cursor, and Windsurf are evolving to support context-aware planning, the separation between specifications (created by technical leads) and code repositories (accessed by AI tools) creates a fundamental barrier to effective AI adoption in architectural and design phases.

Bringing specifications directly into the code repository, ideally as markdown files, closes this gap by making high-level system intent natively accessible to AI tools. This enables AI to participate meaningfully in system design by mapping dependencies across the codebase, maintaining consistency in structure and language, and identifying opportunities for code reuse. With the ability to cross-reference both the specifications and implementation, AI can generate more accurate scaffolding, propose system-level patterns, and ensure new additions align with existing architecture. The result is a more coherent, maintainable, and rapidly evolvable codebase, laying the groundwork for full-cycle AI involvement across system design, coding, and testing.
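The token-efficiency argument can be illustrated with a toy context packer that fills a prompt budget with specification files before code files. This is a sketch under stated assumptions, not how Cursor, Windsurf, or Cline actually select context: the `Doc` record, the file paths, the token estimates, and `pack_context` are all hypothetical, and the 0.6 budget fraction simply reuses the roughly 60% figure mentioned earlier in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    path: str
    tokens: int    # estimated token count, assumed to be precomputed
    is_spec: bool  # True for markdown specifications, False for code


def pack_context(docs, context_window, budget_fraction=0.6):
    """Greedily pick documents, specs first, without exceeding the budget."""
    budget = int(context_window * budget_fraction)
    # Specs sort ahead of code; within each group, smaller files come first.
    ordered = sorted(docs, key=lambda d: (not d.is_spec, d.tokens))
    chosen, used = [], 0
    for doc in ordered:
        if used + doc.tokens <= budget:
            chosen.append(doc.path)
            used += doc.tokens
    return chosen, used


# Made-up repository contents for illustration only.
docs = [
    Doc("src/Login.tsx", 2500, False),
    Doc("specs/login.md", 300, True),
    Doc("src/auth.ts", 1800, False),
]
chosen, used = pack_context(docs, context_window=5000)
print(chosen, used)
```

In this toy run the 300-token specification fits alongside one code file, whereas the two code files together would already exceed the budget on their own.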
AI-augmented Software Development (System Design + Coding + Testing)

As illustrated in the image below, AI-augmented software development streamlines the traditional multi-role workflow into a highly integrated loop, compressing weeks of planning, coding, and testing into days. In this model, AI doesn't merely assist at the code generation level; it participates actively across system design, development, and testing phases. By embedding specifications, code, and test logic into a shared context, AI can iterate through the entire software lifecycle far more fluidly and intelligently. Here's how this transformation takes place:

1. System Design Integration

AI tools start by asking clarifying questions, scaffolding system designs, and generating specifications directly from high-level requirements. Unlike traditional setups where specs are abstracted from the implementation, here AI gains direct access to the evolving structure of the codebase. This tight feedback loop enhances architectural precision and accelerates design validation.

2. Continuous Contextual Coding

Once specifications are established, AI transitions smoothly into coding. Rather than treating code as isolated tasks, the AI draws from the broader system architecture and prior specifications to write aligned, reusable, and dependency-aware code. The ability to reference multi-file contexts significantly improves coherence across modules and helps avoid structural fragmentation.

3. Automated Testing & Error Recovery

AI doesn’t just generate tests; it iteratively runs and fixes them. When failures occur, it can trace them back through the system spec and implementation, identify the cause, and attempt auto-corrections. This reduces the back-and-forth cycle between developers and QA, tightening the feedback loop to near real-time.

Compression of Sprint Cycles

Where traditional development might require 2–3 weeks per sprint, AI-augmented workflows often converge in 3–5 days.
This acceleration is not just a product of faster coding; it reflects reduced context switching, fewer handoffs, and automation of error resolution and testing.

The Evolving Human-AI Partnership

The human's role is evolving: from writing every line of specification, code, and tests to guiding the AI, answering its questions, and correcting course when it goes off track. New workflow:

AI: Writes ~90% of specifications, code, and tests
Human: Provides clarity, reviews output, fills in gaps, and occasionally writes the tricky parts

By saving specifications alongside code and allowing AI tools to operate across both spaces, this development paradigm reduces planning debt and eliminates many points of friction that slow down traditional software delivery. As AI-native tools continue evolving, the emphasis shifts from just writing code to managing a dynamic, context-rich development environment that continuously reasons about what’s being built, and why.

Redefining the Technical Team's Role

This new partnership fundamentally transforms how technical teams work. Rather than implementing every detail manually, teams now focus on higher-level concerns while leveraging AI to handle routine implementation tasks. The AI handles most of the coding workload, generating specifications based on requirements, implementing standard patterns, writing routine tests, and handling boilerplate code. This frees the technical team to concentrate on what humans do best: providing domain expertise, making architectural decisions, reviewing for business alignment, and solving novel problems that the AI hasn't encountered in its training. Technical teams become more like directors: guiding the AI with clear requirements, reviewing its output for quality and alignment with business goals, intervening when necessary to correct misunderstandings, and writing specialized components that require deep domain knowledge or innovative approaches.
This creates a multiplier effect where a single technical team with AI assistance can accomplish what previously required multiple teams of specialists, all while maintaining or even improving code quality and reducing technical debt.
8 min read
authors:
Rohit Aggarwal

Article
Bill Gates recently said, "Within 10 years, AI will replace many doctors and teachers—humans won’t be needed ‘for most things.’" Geoffrey Hinton, often dubbed the "Godfather of Deep Learning," echoed the sentiment: "If a private tutor—that’s a person—is like 2 times better, these [AI tutors] will be 3 or 4 times better... It may not be good news for universities." As someone who’s been fortunate to engage deeply with education systems—and recently moderated a Deans' panel discussing the future of universities ( Deans' Panel Key Takeaways )—I agree that AI will reshape the landscape dramatically. But I don’t believe universities will vanish. If anything, they will adapt, and they must adapt. Here’s why I believe universities are here to stay: 1. Coming of Age Needs a Campus Undergraduate education is not just about learning content—it's about growing up. For many students, university is their first step into independence. It's where teenagers become adults in a safe, intellectually vibrant environment. They form lifelong friendships. They build social and professional networks that often outlast the knowledge from their coursework. And despite what AI can offer in personalized learning, that human environment cannot be replicated digitally—at least not yet. 2. The Signal Still Matters Let’s be real: a university degree is more than knowledge. It’s a signal. It tells employers—and society at large—that someone has stuck with a rigorous path, completed assignments, sat for exams, and acquired a certain level of expertise. AI might one day be able to measure real skill more effectively, but we’re not there yet. Until then, degrees still serve as a trusted validator. 3. Learning Is Like Going to the Gym Almost everyone agrees that staying fit is important. Yet, many of us pay for gym memberships we rarely use. Sound familiar? The same goes for learning. We bookmark articles, sign up for online courses, add videos to "watch later"—with every intention to return. 
But most of us don't. Learning is work. It’s not always fun. And self-discipline is hard. Universities force you to show up. They give you deadlines, structure, and a curriculum. They keep you accountable, just like personal trainers do. That’s part of their enduring value. But the Threat Is Real… That doesn’t mean we should get complacent. I’m working on an AI agent myself—one that takes a textbook chapter and converts it into a lecture video with my digital clone teaching the material. It’s already about 85% of the way there, and with some human-in-the-loop edits, the result could be better than if I filmed it myself. Some might say it’s foolish to build a tool that could replace your own job. But let’s be honest: it’s also kind of cool. With style transfer, I could have my lecture delivered in the voice and energy of Eminem, Taylor Swift, or even Andrew Ng—whatever keeps students engaged. So... Will Professors Be Replaced? Maybe some. Maybe many. But not all. Universities will survive not just because of tradition, but because they serve a set of social, emotional, and cognitive functions that AI hasn’t cracked yet. Professors won’t disappear—they’ll evolve. They’ll become curators, mentors, experience designers, and yes, even AI co-pilots in the classroom. The real challenge is not about whether AI will take our jobs—it’s whether we’ll use it to enhance what we do best. What do you think? Will universities survive the AI wave? Will you want your kids to go to college—or just to ChatGPT 10.0? Let’s talk.
6 min read
authors:
Rohit Aggarwal

Article
In today's data-driven world, organizations face a critical challenge: they need realistic data for testing, development, and training AI models, but using real customer data raises serious privacy concerns and may violate regulations like HIPAA or FERPA. Enter synthetic data generation, a powerful solution that creates artificial data mimicking the statistical properties of real data while completely protecting privacy.

This article explores a sophisticated synthetic data generation pipeline that transforms real data into artificial data that's statistically similar yet entirely synthetic. What sets this approach apart is its ability to not just replicate individual feature distributions, but to preserve the crucial relationships between features: the correlations and dependencies that make data truly useful. Let's dive into how this works.

Why Synthetic Data Matters

Before exploring the technical implementation, let's understand why synthetic data is valuable:

Privacy compliance: Eliminates the risk of exposing sensitive customer information
Development freedom: Enables teams to work with realistic data without security constraints
Training AI models: Provides diverse, representative data for machine learning applications
Testing edge cases: Allows creation of specific scenarios that might be rare in real data
Relationship preservation: Maintains critical correlations and dependencies between variables that simple randomization methods cannot capture

The last point is particularly crucial. Many synthetic data approaches can generate values that match individual feature distributions, but fail to maintain the relationships between features. For example, in a banking dataset, a simple approach might generate realistic account balances and realistic transaction frequencies independently, but miss the vital correlation between these variables. Our method specifically addresses this challenge.
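The banking example above can be made concrete with a small experiment. The sketch below uses made-up numbers, not data from the article: it builds a toy dataset in which transaction frequency rises with account balance, then "generates" each column independently by reshuffling it. Each column keeps exactly the same distribution, but the correlation between them collapses.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)

# Toy "real" data (assumed for illustration): frequency depends on balance.
balances = [rng.lognormvariate(8, 0.5) for _ in range(2000)]
frequencies = [0.002 * b + rng.gauss(0, 2) for b in balances]

# Independent generation: permute each column on its own. The marginal
# distributions are untouched, but the row-wise pairing is destroyed.
synth_balances = rng.sample(balances, len(balances))
synth_frequencies = rng.sample(frequencies, len(frequencies))

real_r = pearson(balances, frequencies)
independent_r = pearson(synth_balances, synth_frequencies)
print(f"real correlation:        {real_r:.2f}")
print(f"independent correlation: {independent_r:.2f}")
```

A model trained on the reshuffled columns would see no link between balance and transaction frequency, which is the failure mode the rest of this article is designed to avoid.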
The Problem with Independent Feature Generation

To understand why preserving relationships matters, consider a common e-commerce scenario: In real customer data, affluent customers not only spend more per purchase, but they also tend to buy a wider variety of products. This creates a natural correlation between:

Average purchase amount
Number of unique products purchased
Customer income level

If we were to generate synthetic data by creating each of these features independently, even if each feature's distribution perfectly matches the original data, we would lose these critical relationships. We might end up with unrealistic scenarios like low-income customers purchasing large numbers of luxury items or high-income customers only purchasing a single budget item.

This problem compounds when building predictive models. A model trained on such independently-generated synthetic data would learn incorrect patterns and make faulty predictions when applied to real data. For instance, a recommendation engine might suggest luxury products to customers unlikely to purchase them or miss obvious cross-selling opportunities.

Our approach using copula-based modeling specifically addresses this challenge by mathematically capturing and preserving the dependency structures between features, ensuring the synthetic data maintains these natural correlations.

The Synthetic Data Generation Pipeline

Our implementation creates a comprehensive pipeline that transforms real data into synthetic data through several sophisticated steps while preserving statistical properties and relationships. Let's break down each component:

1.
Preprocessing

The first stage prepares the data through three main steps:

Missing Data Handling
  Processes target variables first, addressing imbalanced classes for categorical targets or applying transformations to reduce skewness in continuous targets
  Imputes missing values using median for numerical features and mode/"Unknown" for categorical features

Categorical Data Encoding
  Applies intelligent encoding based on cardinality (number of unique values):
    Binary encoding for features with 2 unique values
    One-hot encoding for features with ≤10 unique values
    Frequency encoding for high-cardinality features
  Identifies and transforms highly skewed numerical features using Box-Cox transformation

Standardizing Features
  Scales numerical features to have zero mean and unit variance
  Preserves categorical features in their encoded form
  Stores all transformation parameters for later inverse transformation

2. Clustering using HDBSCAN

Data often comes from mixed populations with different underlying patterns. HDBSCAN clustering helps identify these natural groupings:

Uses Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
Advantages over traditional clustering algorithms:
  No need to specify the number of clusters in advance
  Finds clusters of varying densities and shapes
  Adaptively determines cluster count based on data density
Handles small datasets by adjusting clustering parameters (min_cluster_size, min_samples)
Assigns noise points to their nearest clusters
Creates a 'cluster' column to track membership, falling back to a single cluster if needed
Separates data by cluster for subsequent processing

3.
For Each Cluster

Processing each cluster separately allows the algorithm to better capture the unique characteristics of different data subgroups:

a) Statistical Modeling

This three-step process captures both individual feature distributions and their interrelationships:

Fit Marginal Distribution for Each Feature
  Tests multiple distribution types (normal, lognormal, exponential, gamma)
  Selects best fit using AIC (Akaike Information Criterion)
  Stores distribution parameters for each feature
  Models each feature's unique pattern independently (e.g., ages might follow a normal distribution, while income follows a log-normal distribution)

Transform to Uniform using CDF
  Applies Cumulative Distribution Function (CDF) of fitted distributions
  Transforms each feature to uniform [0,1] distribution
  Creates standardized representation necessary for copula modeling

Fit Best Copula Model Across Features
  Tests different copula types (Gaussian, Student-t, Clayton)
  Selects best-fitting copula based on log-likelihood
  Captures dependency structure between features

A copula is a mathematical function that connects different marginal distributions to form a joint distribution, preserving relationships between variables. For instance, if higher income correlates with more frequent purchases in your original data, copulas maintain this relationship in synthetic data.

This is where the real magic happens in preserving feature relationships. While each feature's individual distribution is important, the connections between features often contain the most valuable information. For example:

In financial data, transaction frequency may be correlated with account balance
In healthcare data, age may be correlated with certain medical conditions
In e-commerce data, purchase frequency may be correlated with customer lifetime value

Copulas mathematically encode these dependencies, allowing us to generate synthetic data where these critical relationships remain intact.
Without this step, you might have realistic-looking individual features but unrealistic combinations of values that would never occur in real data.

b) Data Generation

After modeling comes generation of the synthetic data:

Draw Samples from Fitted Copula
  Generates correlated uniform [0,1] samples from the fitted copula model
  Maintains the dependency structure between features

Inverse CDF to Transform Each Feature Back
  Applies inverse CDF (percent point function) using stored distribution parameters
  Transforms uniform values back to realistic data following original distributions
  Restores each feature's original statistical shape while preserving relationships
  Adds appropriate cluster labels to track membership

4. Combine Cluster Data

Merges synthetic data from all clusters based on original cluster proportions
Maintains the natural groupings and subpopulations present in the original data
Preserves the overall data structure and cluster characteristics

5. Postprocessing

This stage restores the data to its original format:

Reverse Encoding of Categorical Features
  Converts encoded categorical features back to their original form:
    Binary encodings → original binary categories
    One-hot encodings → original categorical values
    Frequency encodings → original categorical values

Reverse Standardization
  Applies inverse transformation to all standardized numerical features
  Restores original scale and data types
  Ensures the synthetic data matches the format of the original data
  Handles decimal formatting and type conversion

6.
Validation The final step is thorough quality checking to ensure the synthetic data truly resembles the original: Validate Each Feature & Target Distribution Independently For numerical features: Applies Kolmogorov-Smirnov tests and compares statistical moments For categorical features: Performs chi-square tests and compares category frequencies Calculates metrics like maximum and average differences between distributions Validate Correlations Compares correlation matrices (Pearson, Spearman) Calculates Frobenius norm of difference matrices Ensures dependency structures are preserved This validation step is critical for our goal of relationship preservation. After all the transformations, we need to verify that the synthetic data maintains the same correlation patterns as the original. The process compares both linear (Pearson) and rank-based (Spearman) correlations, allowing us to detect if the relationship structures have been maintained across different types of dependencies. Validate Cluster Preservation Compares cluster proportions between original and synthetic data Evaluates if cluster characteristics are maintained Compiles all validation results into a comprehensive report with statistical measures Running the Script The script can be run in several ways: With Cursor/Windsurf/Cline Simply ask the model to run the script. 
Without Cursor/Windsurf/Cline
Run the following in terminal:

python -m pip install -r requirements.txt

If you're having trouble, try upgrading pip:

python -m pip install --upgrade pip

For Windows users facing installation issues, follow the guide at: https://github.com/bycloudai/InstallVSBuildToolsWindows?tab=readme-ov-file

Note: Sometimes on Windows machines, py works instead of python:

py -m pip install -r requirements.txt

Limitations and Considerations
While this implementation is powerful, it has some limitations:

It doesn't make special distinctions between different types of variables during the correlation modeling phase: it treats all variables (including transformed categorical ones) as continuous.
This means it might not perfectly preserve some special relationships between categorical and continuous variables, or between categories that were originally part of the same variable.

Conclusion
The synthetic data generation pipeline described here offers a powerful solution for organizations needing realistic test data without privacy concerns. What sets it apart from simpler approaches is its sophisticated handling of feature relationships through copula modeling and cluster-aware generation.
By carefully modeling both the individual distributions of features and their relationships, then generating new data that follows these patterns, we can create synthetic data that is: Statistically similar to real data Maintains important relationships between different pieces of information Preserves the overall structure and patterns of the original data Safe to use without worrying about privacy regulations Suitable for testing, development, and analysis purposes This relationship preservation is crucial for many real-world applications: AI model training : Models trained on synthetic data with preserved relationships will learn the same patterns present in real data Financial analysis : Synthetic financial data must maintain relationships between risk factors and outcomes Healthcare research : The correlations between patient characteristics and medical conditions must be preserved Market research : Customer behavior patterns and preferences need to maintain their interdependencies The attached code implements this entire pipeline, making it accessible for data scientists and developers who need high-quality synthetic data where relationships between features matter as much as the features themselves.
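As a rough illustration of the generation step described above (draw correlated uniforms from a copula, then apply each feature's inverse CDF), here is a minimal sketch using only the Python standard library. It is not the attached script: it assumes a two-feature Gaussian copula whose parameters have already been fitted, and the function name, the correlation value, and the age/income parameters are all hypothetical.

```python
import math
import random
from statistics import NormalDist


def synthesize(n, rho, seed=0, age_mean=40.0, age_std=10.0,
               log_income_mean=10.5, log_income_std=0.5):
    """Generate n (age, income) rows from a 2-D Gaussian copula.

    All parameter values are illustrative assumptions, not fitted results.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    age_dist = NormalDist(age_mean, age_std)
    log_income_dist = NormalDist(log_income_mean, log_income_std)
    eps = 1e-12  # keep uniforms strictly inside (0, 1) for inv_cdf

    rows = []
    for _ in range(n):
        # Step 1: correlated standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)

        # Step 2: the normal CDF turns them into correlated uniforms;
        # these are the samples "drawn from the copula".
        u_age = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        u_income = min(max(std_normal.cdf(z2), eps), 1.0 - eps)

        # Step 3: inverse CDF (percent point function) of each marginal
        # restores the feature's own shape: normal age, lognormal income.
        age = age_dist.inv_cdf(u_age)
        income = math.exp(log_income_dist.inv_cdf(u_income))
        rows.append((age, income))
    return rows


rows = synthesize(2000, rho=0.8, seed=1)
```

Because the dependency lives entirely in the uniform samples, each marginal can be swapped for another distribution without touching the correlation step, which is the property the pipeline relies on.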
authors:
Rohit Aggarwal
Harpreet Singh

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

Exercise 1: Simple RAG for 10-K filings
Exercise 2: RAG with Reranker for 10-K filings
Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse
Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action). Code with Explanation is posted here: Colab Notebook Link
Exercise 5: RAG with Agentic Pattern: ReAct + Reflection. Code with Explanation is posted here: Colab Notebook Link

These exercises incrementally build on basic RAG, with a focus on "why" before "what" and "how".

In the previous exercise, we explored how to break down a complex query into sub-queries, retrieve relevant chunks from a vector database for each sub-query, and generate answers based on those chunks. However, there are instances where the knowledge needed to answer a user's question is not available in our vector databases. In such cases, we need to equip our system with pre-built tools that can fetch information from external sources. Specifically, in the Colab notebook, we demonstrate how to retrieve LinkedIn handles of directors listed in SEC filings. To achieve this, we utilize a set of tools, as illustrated in the following diagram:

Vector Search Tool – Provides access to the vector database for the LLM.
Director Extraction Tool – Extracts director names from the previously stored last portion of SEC filings.
Web Search Tool – Conducts Google searches for directors one at a time and retrieves their LinkedIn handles.

For further details on the code implementation, please refer to the Colab notebook. However, before diving into the notebook, we strongly recommend reviewing the ReAct explanation provided below.
ReAct prompting (Fundamental pattern for AI Agents) ReAct (Reasoning + Action) represents a groundbreaking framework that revolutionizes how large language models (LLMs) approach complex problem-solving. By combining reasoning capabilities with action-based decision making, ReAct enables models to tackle challenging tasks through a dynamic and iterative process. At its core, the framework establishes a continuous loop of three interconnected components: reasoning, action, and observation. The reasoning phase, often called the "Thought" stage, serves as the model's internal cognitive process. During this stage, the model analyzes the current situation, drawing upon multiple sources of information including the original task requirements, previous reasoning steps, past actions, and accumulated observations. This framework allows the model to break down complex goals into manageable subtasks, incorporate relevant background knowledge, and continuously evaluate progress toward the ultimate objective. The model can also use this phase to identify potential obstacles and develop contingency plans when faced with unexpected challenges. The action phase represents the bridge between thought and implementation. It determines which tool to employ based on the preceding thought process. The model examines its available tool descriptions and capabilities, matching them against the requirements identified in its last reasoning step. For example, if the thought process concludes that numerical data needs analysis, the model might select a calculator tool. If the reasoning indicates a need for external information, it might choose a search tool. Following each action, the observation phase captures the results and consequences of the actions. These observations serve as crucial feedback, providing new information that feeds into the next iteration of reasoning. 
For instance, if the model uses a search tool to gather information about a topic, the search results become observations that inform its subsequent thinking and decision-making process. It creates a feedback loop where each cycle of thought, action, and observation builds upon previous iterations. This allows the model to maintain and adjust its high-level strategy while incorporating new information and responding to changing circumstances. The framework's flexibility enables it to handle complex tasks that require multiple steps, logical reasoning, and interaction with various external tools and information sources. Here is a typical prompt that is used in LangChain framework to implement ReAct: Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer. You have access to the following tools: {tools} Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [{tool_names}] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin! Question: {input} Thought:{agent_scratchpad} ReAct implementation: LangChain link LLamaIndex link The prompt starts by defining the task scope and available tools. {tools} is a placeholder that gets populated with descriptions of tools the agent can use, like search, calculators, or data analysis tools. 
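To make the placeholder mechanics concrete, here is a minimal sketch that fills a shortened version of the template with plain `str.format`. It is not LangChain's own code: the trimmed template text, the `render_react_prompt` helper, and the Calculator tool are assumptions made for illustration.

```python
# Shortened stand-in for the ReAct template; only the placeholders matter here.
REACT_TEMPLATE = (
    "You have access to the following tools:\n"
    "{tools}\n\n"
    "Use the following format:\n"
    "Action: the action to take, should be one of [{tool_names}]\n\n"
    "Begin!\n"
    "Question: {input}\n"
    "Thought:{agent_scratchpad}"
)


def render_react_prompt(tools, question, scratchpad=""):
    """Fill the template from a {tool name: description} mapping."""
    # {tools} becomes one "name - description" line per tool.
    tool_block = "\n".join(
        f"{name} - {description}" for name, description in tools.items()
    )
    return REACT_TEMPLATE.format(
        tools=tool_block,
        tool_names=", ".join(tools),  # {tool_names} is just the names
        input=question,
        agent_scratchpad=scratchpad,  # empty on the first call
    )


tools = {
    "WebSearch": "Performs a web search on the query.",
    "Calculator": "Evaluates an arithmetic expression.",  # hypothetical tool
}
prompt = render_react_prompt(tools, "What is 12 * 7?")
print(prompt)
```

On the first call the scratchpad is empty, so the rendered prompt ends with a bare "Thought:" that the model is expected to continue.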
The format section establishes the strict protocol the agent must follow:

Question: {input}
{input} gets replaced with the actual user question
This sets up the goal the agent needs to achieve

ReAct Components:
"Thought:" - Where the agent reasons about what it needs to do
"Action:" - Limited to the tools listed in {tool_names}
"Action Input:" - The specific input for the chosen tool
"Observation:" - Where results from tool usage appear

The line "... (this Thought/Action/Action Input/Observation can repeat N times)" indicates that this is an iterative process: the agent can go through multiple cycles of Thought/Action/Observation until it has enough information.

Conclusion Format:
A final "Thought:" declaring the agent has sufficient information
"Final Answer:" providing the response to the original question

{agent_scratchpad}
The {agent_scratchpad} at the end is particularly important. It acts as a dynamic working memory for the LLM agent and gets populated with the ongoing history of all previous Thought/Action/Observation cycles during execution. Think of it as a digital notepad where the agent records its step-by-step problem-solving process.

The scratchpad typically contains:
Previous thoughts the agent has had, including any intermediate conclusions
Actions it has taken
Observations it has received

This allows the agent to:
Reference previous findings
Build upon earlier observations
Maintain continuity in its reasoning process
Track what approaches have already been tried

Technical Details
It is worth briefly going over the example covered in the original paper to further understand how ReAct differs from Chain of Thought (CoT) prompting.

Source: Paper link

The use of few-shot exemplars significantly enhances the efficacy of ReAct. In the original paper, the authors provided the language model with a small number of human-annotated examples that showcase the desired reasoning process and action sequence.
These exemplars serve as a template for the model to follow when addressing new, unseen instances of the task. The exemplars in ReAct typically consist of a series of thought-action-observation steps: Thoughts: The exemplars include explicit reasoning steps that guide the model's decision-making process. These thoughts help break down the task into smaller sub-goals, provide relevant context or common sense knowledge, and offer guidance on the next action to take. Actions: The exemplars demonstrate the specific actions the model should take to progress towards solving the task. These actions can include information retrieval (e.g., searching a knowledge base), navigation (e.g., clicking on a specific link), or providing a final answer. Observations: After each action, the exemplars include the corresponding observation or result from the environment. These observations provide the model with the necessary context to inform its subsequent reasoning and actions. By studying these few-shot exemplars, the language model learns to internalize the reasoning process and action sequence required to complete the task successfully. The model can then apply this learned pattern to new, unseen instances of the task, even with limited or no additional training. The ReAct paper demonstrates the effectiveness of this few-shot approach across various domains, including question answering (HotpotQA), fact verification (Fever), and interactive problem-solving (ALFWorld and WebShop). In each case, the model is provided with just a handful of annotated exemplars (ranging from 2 to 6) and achieves competitive performance compared to baseline methods that rely on extensive fine-tuning or reinforcement learning. How is ReAct different from CoT? Chain-of-Thought prompting encourages models to break down complex problems into smaller, logical steps before reaching a conclusion. 
While this approach improves accuracy for many tasks, it operates within the confined space of the model's existing knowledge. ReAct fundamentally extends the CoT paradigm by introducing dynamic interaction with the external world. While CoT might reason "To find the population of Tokyo, I need to recall the most recent census data," ReAct can actually execute this step by searching current databases. This ability to ground reasoning in real-world data and tools addresses several key limitations of CoT: Knowledge Freshness: While CoT relies on the model's training data, ReAct can access current information through external tools. Verification Capability: CoT's conclusions are based solely on internal reasoning, but ReAct can verify its assumptions against external sources. Computational Accuracy: Rather than relying on the model's ability to perform calculations mentally (as in CoT), ReAct can utilize specialized tools for precise computations. Adaptive Problem-Solving: ReAct can adjust its approach based on intermediate results, while CoT follows a more linear reasoning path. For example, in solving a math problem, CoT might think through each step mentally, while ReAct could combine reasoning with actual calculator usage, reducing computational errors while maintaining logical clarity. This integration of external tools with reasoning creates a more robust and reliable problem-solving system. Applications and Implications Question Answering: ReAct prompting can be used to improve question-answering systems by allowing the model to generate verbal reasoning traces and perform task-specific actions, leading to more accurate and context-aware responses Deciding appropriate tools for a sub-task: ReAct prompting can be tailored for wide variety of tasks where the LLM needs to perform actions, such as retrieving specific data, performing computations, or even interacting with software interfaces through APIs. 
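The Thought/Action/Observation cycle described in this section can be sketched as a small loop. The version below is a simplified illustration, not LangChain's AgentExecutor: the `react_loop` function, the regular expressions, the scripted stand-in for the LLM, and the toy Calculator tool are all assumptions chosen so the example runs without any API key.

```python
import re

# Patterns for the two kinds of reply the format allows.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)


def react_loop(llm, tools, question, max_steps=5):
    """Alternate LLM calls and tool calls until a Final Answer appears."""
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(question, scratchpad)  # Thought (+ Action or Final Answer)
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip(), scratchpad
        match = ACTION_RE.search(reply)
        if match is None:
            raise ValueError("reply contained neither an Action nor a Final Answer")
        tool_name = match.group(1).strip()
        tool_input = match.group(2).strip()
        observation = tools[tool_name](tool_input)  # run the chosen tool
        # Record the whole cycle so the next LLM call can see it.
        scratchpad += f"{reply.strip()}\nObservation: {observation}\nThought:"
    raise RuntimeError("no final answer within max_steps")


def scripted_llm(question, scratchpad):
    """Stand-in for a real model: asks for the calculator once, then answers."""
    if "Observation:" not in scratchpad:
        return " I need to multiply the numbers.\nAction: Calculator\nAction Input: 12 * 7"
    return " I now know the final answer\nFinal Answer: 84"


def calculator(expression):
    """Toy tool that handles 'a op b' with integer operands."""
    left, op, right = expression.split()
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    return str(ops[op](int(left), int(right)))


answer, trace = react_loop(scripted_llm, {"Calculator": calculator}, "What is 12 * 7?")
print(answer)
```

The `max_steps` cap is one simple guard against the loop wandering indefinitely; a real agent would also need to handle unknown tool names and malformed replies more gracefully.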
Challenges and Considerations
ReAct can easily derail from the main task and pursue self-created tasks not aligned with the original goal.
ReAct tends to call external tools even when the LLM's own knowledge would be enough to answer.
Implementing ReAct prompting may require a significant number of prompts, leading to increased costs and potential delays in obtaining the final answer.
Complexity in Implementation: Implementing ReAct prompting requires a more complex setup than traditional prompting methods. It involves configuring the LLM to interact with external tools and ensuring secure and efficient communication between the model and these tools.

Process

1. User query
A user asks a question:
Who are the directors of Tesla? What are their LinkedIn handles? What are the financial goals of Tesla this year? What is the next auto show that Tesla will participate in?

2. LLM call to figure out the next step
LangChain's AgentExecutor first generates the prompt by filling in the values for the placeholder variables {tools}, {input}, and {agent_scratchpad}. The prompt then looks like this:

Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer.
You have access to the following tools:
Company Directors Information - Retrieve the names of company directors for a chosen company. Optionally, their LinkedIn handles can also be included. Use the format: company_name, true/false. Available companies: Tesla, General Motors
WebSearch - Performs a web search on the query.
Vector Reranker Search - Retrieves information from an embedding based vector DB containing financial data and company information. Structure query as a sentence
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Company Directors Information, WebSearch, Vector Reranker Search]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the question.
Follow these steps:
Begin!
Question: Who are the directors of Tesla. What are their linkedin handles? What are the financial goals of tesla this year. What is the next auto show that Tesla will participate in.
Thought:

After generating the prompt, AgentExecutor sends it to the LLM, parses the response, and adds the response to the scratchpad in the following format:

Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles.
Action: Company Directors Information
Action Input: Tesla, true

Here:
The Thought explains the reasoning and identifies the needed information.
The Action specifies the name of the external tool that can help with the information identified in the Thought.
The Action Input tells the system what specific data this external tool needs.

3. Function calling
AgentExecutor executes the action using the tool suggested by the LLM in its last response. It retrieves the response generated by the tool, in this case the Company Directors Information tool, and adds the response as an Observation to the scratchpad.

Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in.
First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought: NOTE: The ellipses ("...") in this document indicate the information has been redacted to conserve space and it can be seen in Langsmith's interface 4. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response (Thought, Action and Action input) to scratchpad: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? 5. Function calling AgentExecutor executes the action using the tool suggested by LLM in its last response. It retrieves the response generated by the tool, which is Vector Reranker Search tool and adds the response as Observation to the scratchpad. 
The response in this case consists of retrieved reranked chunks from vector database for the query What are the financial goals of Tesla for this year? Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201) Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year. Action: Vector Reranker Search Action Input: What are the financial goals of Tesla for this year? Observation: required to spend RMB 14.08 billion in capital expenditures by the end of 2023, which has been achieved in 2023, and to generate RMB 2.23 billion of annual tax revenues … audited the Company's internal control over financial Thought: 6. LLM call to figure out the next step AgentExecutor updates the value of scratchpad in the prompt, and sends it again to LLM for it to decide the next step. It again parses the response received from LLM, and adds the parsed response (Thought, Action and Action input) to scratchpad: Thought: To answer the question, I need to gather information on the directors of Tesla, their LinkedIn handles, Tesla's financial goals for this year, and the next auto show Tesla will participate in. First, I will retrieve the names of the company directors for Tesla and their LinkedIn handles. 
Action: Company Directors Information
Action Input: Tesla, true
Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … https://www.linkedin.com/in/kathleen-wilson-thompson-275654201)
Thought:I have obtained the names and LinkedIn handles of the directors of Tesla. Next, I will gather information on Tesla's financial goals for this year.
Action: Vector Reranker Search
Action Input: What are the financial goals of Tesla for this year?
Observation: required to spend RMB 14.08 billion in capital expenditures by the end of 2023, which has been achieved in 2023, and to generate RMB 2.23 billion of annual tax revenues … audited the Company's internal control over financial
Thought:I have gathered information on Tesla's financial goals for this year, which include capital expenditures and tax revenue generation. Now, I will search for information on the next auto show that Tesla will participate in.
Action: WebSearch
Action Input: Next auto show Tesla will participate in 2023

The LLM determines that the information returned by Vector Reranker Search covers the financial-goals part of the question, and identifies the remaining need as finding details about the next auto show using the WebSearch tool.

7. Function calling
AgentExecutor executes the action using the tool suggested by the LLM in its last response. It retrieves the response generated by the tool, in this case the WebSearch tool, and adds the response as an Observation to the scratchpad. The response in this case consists of the top 3 Google search results for the query: Next auto show Tesla will participate in 2023

…
Thought:I have gathered information on Tesla's financial goals for this year, which include capital expenditures and tax revenue generation. Now, I will search for information on the next auto show that Tesla will participate in.
Action: WebSearch
Action Input: Next auto show Tesla will participate in 2023
Observation: Title: Tesla to Join 2023 Detroit Auto Show Snippet: Tesla will join the 2023 Detroit Auto Show that will take place September 13-24. The company will showcase its vehicles at the Powering ... Link: https://driveteslacanada.ca/news/tesla-to-join-2023-detroit-auto-show/ … Link: https://detroitautoshow.com/
Thought:

8. LLM call to figure out the next step
AgentExecutor updates the value of the scratchpad in the prompt and sends it again to the LLM to decide the next step. It again parses the response received from the LLM and adds the parsed response to the scratchpad. This time, however, the parsed response is different: instead of generating an Action and Action Input, the LLM generates the Final Answer, since it observes that it has all the information it needs to answer the user's question.

Thought: I have gathered the necessary information to answer the question.
Final Answer:
1. Directors of Tesla and their LinkedIn handles:
- Elon Musk (LinkedIn: [Elon Musk](https://www.linkedin.com/in/elon-musk-a93a0b221))
- Robyn Denholm (LinkedIn: [Robyn Denholm](https://au.linkedin.com/in/robyn-denholm-a807795))
- Ira Ehrenpreis (LinkedIn: [Ira Ehrenpreis](https://www.linkedin.com/in/iraehrenpreis))
- Joseph Gebbia (LinkedIn: [Joseph Gebbia](https://www.linkedin.com/in/john-coppola-jr-4b2ab395))
- James Murdoch (LinkedIn: [James Murdoch](https://www.linkedin.com/in/jamesrmurdoch))
- Kimbal Musk (LinkedIn: [Kimbal Musk](https://www.linkedin.com/in/kimbal))
- JB Straubel (LinkedIn: [JB Straubel](https://www.linkedin.com/in/jb-straubel-b694981))
- Kathleen Wilson-Thompson (LinkedIn: [Kathleen Wilson-Thompson](https://www.linkedin.com/in/kathleen-wilson-thompson-275654201))
2. Financial goals of Tesla for this year include capital expenditures of RMB 14.08 billion and generating RMB 2.23 billion in annual tax revenues by the end of 2023.
Tesla aims to improve vehicle performance, decrease production costs, and increase affordability and customer awareness. 3. The next auto show Tesla will participate in is the 2023 Detroit Auto Show, which takes place from September 13-24. Reflexion Reflexion is a framework that enables language models (LLMs) to learn and improve their performance on various tasks through self-reflection and iterative optimization. The key idea behind Reflexion is to convert binary or scalar feedback from the environment into informative verbal feedback, which the LLM agent then uses to update its context and improve its performance in subsequent attempts. The Reflexion framework consists of four main components: Actor: An LLM that generates text and actions based on the current state observations and its memory. The Actor samples actions from its current policy and receives observations from the environment. Various models, such as Chain of Thought and ReAct, can be used as the Actor. Evaluator: A component that assesses the quality of the generated outputs produced by the Actor. The Evaluator takes a generated trajectory as input and computes a reward score reflecting the Actor's performance on the given task. The Evaluator can incorporate both internal and external assessment mechanisms. Internal evaluation can be using self-reflection or confidence signals like log probabilities and entropy measures that assess output quality without external reference points. External evaluation involves independent validation through unit tests, searching the web for relevant information to fact-check, or using LLM-as-a-judge approaches that provide assessment based on predefined criteria. Self-Reflection: An LLM that generates verbal self-reflections to provide feedback for future trials. Given the current trajectory, evaluation and the agent's persistent memory, the Self-Reflection model generates specific and informative feedback. 
This feedback is stored in the agent's memory for future reference. Memory: The memory component in Reflexion consists of short-term memory (trajectory history) and long-term memory (outputs from the Self-Reflection model). These memory components provide context that is both specific and influenced by lessons learned over multiple trials, giving Reflexion agents an advantage over other LLM action-choice methods. The Actor uses the updated memory to inform its decisions in the next trial. Source: Paper link Reflection implementation: LlamaIndex link This iterative process of trial, evaluation, self-reflection, and memory persistence allows the agent to rapidly improve its performance on various tasks. This approach draws inspiration from human cognitive processes, particularly the distinction between "System 1" and "System 2" thinking patterns first popularized by psychologists. System 1 represents quick, instinctive reactions, while System 2 embodies slower, more deliberate analysis. While this additional computational step may increase response time, it often proves valuable for complex tasks where accuracy and thoroughness matter more than speed. Reflection: Reflection is a simpler version of Reflexion where there is no separate evaluation before self-critique. The reflector simply provides feedback based on its own assessment, which can be helpful but might not address specific shortcomings or gaps. At its core, reflection serves as a prompting strategy that enables AI systems to evaluate and refine their responses through structured self-criticism. Multi-Agent Collaboration: The Reflection pattern can be implemented in a multi-agent framework, where one agent is responsible for generating outputs and another agent provides constructive criticism. This back-and-forth between the agents can lead to increasingly refined and improved responses. 
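The generate-then-critique pattern described above can be sketched as a short loop. This is only an illustration under stated assumptions: `reflect_and_refine`, the stub generator, and the stub critic are hypothetical stand-ins for LLM calls, and the stopping rule (the critic returning None) is a design choice made for this example rather than part of any particular library.

```python
def reflect_and_refine(generate, critique, task, max_rounds=3):
    """Alternate between a generator and a critic until the critic is satisfied."""
    answer = generate(task, feedback=None)  # first attempt, no feedback yet
    history = [answer]
    for _ in range(max_rounds):
        feedback = critique(task, answer)
        if feedback is None:  # critic found nothing to fix
            break
        answer = generate(task, feedback=feedback)  # revise using the critique
        history.append(answer)
    return answer, history


def stub_generate(task, feedback=None):
    """Stand-in for the generating agent: gives a flawed draft, then a fix."""
    if feedback is None:
        return "1, 2, 3"
    return "2, 3, 5"


def stub_critique(task, answer):
    """Stand-in for the critic agent: flags one specific known mistake."""
    if answer.startswith("1"):
        return "1 is not a prime number."
    return None


final_answer, drafts = reflect_and_refine(
    stub_generate, stub_critique, "List the first three prime numbers."
)
print(final_answer)
```

In a multi-agent setup the two callables would wrap separate LLM prompts; keeping every draft in a list makes it easy to inspect how the answer changed from round to round.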
Sample implementation:
"Review your previous answer and find problems with your answer"
"Based on the problems you found, improve your answer."
Source: link

Process

1. User query
A user asks a question:
Who are the directors of Tesla? What are their LinkedIn handles? What are the financial goals of Tesla this year? What is the next auto show that Tesla will participate in?

2. LLM call to figure out the next step + Function calling + …
LangChain's AgentExecutor first generates the prompt by filling in the values for the placeholder variables {tools}, {input}, and {agent_scratchpad}. Then it sends the prompt to the LLM, parses the output, calls the tool as instructed by the LLM, receives the tool's output, adds the output as an Observation to the scratchpad, and sends the prompt back to the LLM. This process repeats until it reaches the following point.

Your task is to gather relevant information to build context for the question. Focus on collecting details related to the question. Gather as much context as possible before formulating your answer.
You have access to the following tools:
Company Directors Information - Retrieve the names of company directors for a chosen company. Optionally, their LinkedIn handles can also be included. Use the format: company_name, true/false. Available companies: Tesla, General Motors
WebSearch - Performs a web search on the query.
Vector Reranker Search - Retrieves information from an embedding based vector DB containing financial data and company information. Structure query as a sentence
Director's previous work and education - Retrieves director's education and work experience using their LinkedIn URL.
Use the format: url Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [Company Directors Information, WebSearch, Vector Reranker Search, Director's previous work and education] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the question. Follow these steps: Begin! Question: Make an analysis on the independence of backgrounds of directors at tesla Thought:To analyze the independence of backgrounds of directors at Tesla, I need to gather information about the directors, including their names, previous work experience, and education. This will help determine the diversity and independence of their backgrounds. Action: Company Directors Information Action Input: Tesla, true Observation: Directors of Tesla: Elon Musk (LinkedIn: https://www.linkedin.com/in/elon-musk-a93a0b221); Robyn Denholm (LinkedIn: … Thought: To analyze the independence of backgrounds of Tesla's directors, I need to gather information about their previous work experience and education. This will help assess the diversity and independence of their backgrounds. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/elon-musk-a93a0b221 Observation: Professional Experience: Education: Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the other directors as well. 
Action: Director's previous work and education Action Input: https://au.linkedin.com/in/robyn-denholm-a807795 Observation: Professional Experience: Tech Council of Australia, Tesla, Blackbird, Telstra, Telstra Education: Australian Institute of Company Directors, UNSW, UNSW Business School Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/iraehrenpreis Observation: Professional Experience: Menlo Micro, DBL Partners, Tesla Motors, Tesla Motors, Tesla Motors Education: Stanford University Graduate School of Business, Stanford Law School, UCLA Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/john-coppola-jr-4b2ab395 Observation: Professional Experience: JMC Detailing LLC, Koons Chevrolet, Koons Tyson Jeep, Chili's Education: Radford University - College of Business and Economics, George C. Marshall High School, marshall high school Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/jamesrmurdoch Observation: Professional Experience: Lupa Systems, 21st Century Fox, 21st Century Fox, News America Incorporated, News America Incorporated Education: Harvard University Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. 
Action: Director's previous work and education Action Input: https://www.linkedin.com/in/kimbal Observation: Professional Experience: The Kitchen Community, The Kitchen, Tesla Motors, SpaceX, Me.dium Education: The French Culinary Institute, Queen's University Thought: To continue analyzing the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the remaining directors. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/jb-straubel-b694981 Observation: Professional Experience: Redwood Materials, QuantumScape, Straubel Foundation, Tesla, Volacom Education: Stanford University, Stanford University, EDGEWOOD HIGH SCHOOL Thought: To complete the analysis on the independence of backgrounds of Tesla's directors, I need to gather information about the previous work experience and education of the last remaining director, Kathleen Wilson-Thompson. Action: Director's previous work and education Action Input: https://www.linkedin.com/in/kathleen-wilson-thompson-275654201 Observation: Professional Experience: McKesson, Self-employed, Tesla, Tesla, Walgreens Boots Alliance Education: University of Michigan, Wayne State University, Wayne State University Law School Thought: 3. LLM call to figure out the next step … Observation: Professional Experience: McKesson, Self-employed, Tesla, Tesla, Walgreens Boots Alliance Education: University of Michigan, Wayne State University, Wayne State University Law School Thought: To analyze the independence of backgrounds of Tesla's directors, I have gathered information about their previous work experience and education. Here's a summary of the findings: 1. **Elon Musk**: Information on his professional experience and education was not retrieved, but he is widely known as the CEO of Tesla, SpaceX, and other ventures. 2. 
**Robyn Denholm**: Has experience in telecommunications and technology sectors, having worked at Telstra and the Tech Council of Australia. She holds degrees from UNSW and the Australian Institute of Company Directors. 3. **Ira Ehrenpreis**: Has a background in venture capital and law, with experience at DBL Partners and education from Stanford University and UCLA. 4. **Joseph Gebbia**: His professional experience includes roles in the automotive and hospitality industries, with education from Radford University. 5. **James Murdoch**: Comes from a media and entertainment background, having worked at 21st Century Fox and News America Incorporated. He is a Harvard University graduate. 6. **Kimbal Musk**: Has experience in the culinary and technology sectors, with roles at The Kitchen and SpaceX. He studied at The French Culinary Institute and Queen's University. 7. **JB Straubel**: Has a strong background in technology and sustainability, with experience at Tesla and Redwood Materials. He studied at Stanford University. 8. **Kathleen Wilson-Thompson**: Has experience in healthcare and retail, having worked at Walgreens Boots Alliance and McKesson. She holds degrees from the University of Michigan and Wayne State University Law School. The directors of Tesla have diverse backgrounds spanning technology, telecommunications, media, culinary arts, venture capital, and healthcare. This diversity suggests a level of independence in their backgrounds, as they bring varied experiences and perspectives to the board. Final Answer: Tesla's board of directors has diverse and independent backgrounds, with experience in technology, telecommunications, media, culinary arts, venture capital, and healthcare, contributing varied perspectives to the company. 4. Prompt for Reflection Your task is to analyze whether the `Answer` is missing some information related to the `Question`. Give feedback on the missing requirements of the answer. Mention only the essential information. 
Here is the previous interaction: Question: Make an analysis on the independence of backgrounds of directors at tesla Answer: Tesla's board of directors has diverse and independent backgrounds, with experience in technology, telecommunications, media, culinary arts, venture capital, and healthcare, contributing varied perspectives to the company. Reflection: Provide brief, concise thoughts on what additional information needs to be collected in the next iteration. Based on your reflection, conclude with one of the following actions: If the current Answer provides sufficient information for Original Input, state "STOP". If further refinement is needed, provide 2-3 brief thoughts for improvement, each on a new line, and end with "CONTINUE". Output 5. Reflection assessment … Output The answer provides a general overview of the backgrounds of Tesla's directors, highlighting their diverse experiences and educational backgrounds. However, it lacks specific analysis on the independence of these backgrounds in terms of decision-making and influence within the board. - Include an analysis of how the diverse backgrounds contribute to independent decision-making and governance at Tesla. - Discuss any potential conflicts of interest or affiliations that might affect the independence of the directors. - Provide examples or evidence of how the directors' independent backgrounds have influenced Tesla's strategic decisions or policies. CONTINUE 6. Prompt for thinking on Reflection Improve `Previous Answer` based on `Reflections`. Don't look for information already present in `Previous Answer`. Formulate a new Final Answer. Reflections: The answer provides a general overview of the diversity and independence of Tesla's directors' backgrounds. However, it lacks specific details and analysis. - Include specific examples of directors and their backgrounds to illustrate the diversity and independence. 
- Analyze how these diverse backgrounds impact decision-making and governance at Tesla. - Discuss any potential challenges or benefits that arise from having such a diverse board. CONTINUE 7. Response to Reflection Previous Answer: Tesla's directors have diverse and independent backgrounds, with experiences spanning technology, automotive, media, culinary, and healthcare industries, and education from prestigious institutions. This diversity can contribute to a broad range of perspectives and expertise in the boardroom. The directors of Tesla have diverse and independent backgrounds, which contribute to a broad range of experiences and perspectives within the board. Here's a summary of their backgrounds: 1. **Elon Musk**: Holds bachelor's degrees in physics and economics and attended graduate school in physics at Stanford University. He has a strong background in technology and entrepreneurship, having founded companies like SpaceX and Tesla. 2. **Robyn Denholm**: Has experience in telecommunications and technology, having worked at Telstra and Blackbird. She studied at the Australian Institute of Company Directors and UNSW. 3. **Ira Ehrenpreis**: Has a background in venture capital and law, with education from Stanford University Graduate School of Business and Stanford Law School. 4. **Joseph Gebbia**: His professional experience includes roles in the automotive and hospitality industries, with education from Radford University. 5. **James Murdoch**: Comes from a media and entertainment background, having worked at 21st Century Fox and News America Incorporated. He studied at Harvard University. 6. **Kimbal Musk**: Has experience in the culinary and technology sectors, with education from The French Culinary Institute and Queen's University. 7. **JB Straubel**: Has a strong background in engineering and technology, having worked at Tesla and founded Redwood Materials. He studied at Stanford University. 8. 
**Kathleen Wilson-Thompson**: Has experience in healthcare and law, having worked at Walgreens Boots Alliance and McKesson. She studied at the University of Michigan and Wayne State University Law School. Overall, the directors bring a mix of expertise from technology, media, law, healthcare, and entrepreneurship, which supports a diverse and independent board at Tesla.
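The reflection steps above (steps 4 to 7) can be sketched as a small control loop. This is a minimal illustration, not the notebook's code: `parse_reflection`, `reflect_and_refine`, and the `reflect` / `improve` callables are hypothetical names, and the sketch assumes the reflection output ends with a line containing STOP or CONTINUE and lists suggestions as "- " bullets, as in the sample output shown earlier.

```python
from typing import Callable, List, Tuple


def parse_reflection(text: str) -> Tuple[bool, List[str]]:
    """Split a reflection output into (should_continue, suggestions).

    Assumption: the verdict ("STOP" or "CONTINUE") is the last non-empty line,
    and suggestions are written as "- ..." bullet lines.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        # An empty reflection is treated as "nothing to improve".
        return False, []
    verdict = lines[-1].upper().rstrip(".\"'")
    suggestions = [
        ln.lstrip("-• ").strip()
        for ln in lines[:-1]
        if ln.startswith(("-", "•"))
    ]
    return verdict.endswith("CONTINUE"), suggestions


def reflect_and_refine(
    question: str,
    answer: str,
    reflect: Callable[[str, str], str],
    improve: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Alternate reflection and improvement until STOP or max_rounds is hit.

    `reflect` stands in for the LLM call with the reflection prompt;
    `improve` stands in for the LLM (or agent) call that rewrites the answer.
    """
    for _ in range(max_rounds):
        should_continue, suggestions = parse_reflection(reflect(question, answer))
        if not should_continue:
            break
        answer = improve(question, answer, suggestions)
    return answer


if __name__ == "__main__":
    # Stub callables so the sketch runs without any LLM access.
    outputs = iter(["Too general.\n- Discuss conflicts of interest\nCONTINUE", "STOP"])
    final = reflect_and_refine(
        "Make an analysis on the independence of backgrounds of directors at tesla",
        "Tesla's board has diverse backgrounds.",
        reflect=lambda q, a: next(outputs),
        improve=lambda q, a, s: a + " Addressed: " + "; ".join(s) + ".",
    )
    print(final)
```

The `max_rounds` cap is a design choice of this sketch: it bounds cost if the reflection step keeps answering CONTINUE.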
20 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

Exercise 1: Simple RAG for 10-K filings
Exercise 2: RAG with Reranker for 10-K filings
Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse
Code with Explanation is posted here: Colab Notebook Link
Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action)
Exercise 5: RAG with Agentic Pattern: ReAct + Reflection

These exercises incrementally build on basic RAG with a focus on “why” before “what” and “how”.

This exercise, the third in the series, focuses on illustrating how decomposing complex queries into simpler sub-queries makes a difference in the quality of the response generated by the RAG system. It extends the previous exercise, which added a Reranker, by adding query decomposition. We encourage readers to go through the Query Decomposition section before going through the code.

Query Decomposition

When users interact with RAG systems, they often pose complex questions that encompass multiple aspects or require information from different areas of the knowledge base. Consider a query like "How do Tesla and GM's approaches to manufacturing and production compare, particularly for electric vehicles? Where are their vehicles produced?" This question combines several distinct informational needs: manufacturing methodologies, EV-specific production approaches, and factory locations for two different companies. Direct vector similarity search with such compound queries can be suboptimal, as a single embedding may not effectively capture all query dimensions simultaneously. 
Query decomposition addresses this challenge by leveraging a Large Language Model (LLM) to break down complex queries into simpler, more focused sub-queries that can be processed independently before being synthesized into a comprehensive response. The decomposition process typically starts by prompting the LLM to analyze the user's question and identify its core components. A sample prompt, shown below, guides the LLM to generate a set of atomic sub-queries that collectively cover all aspects of the original question. For the automotive manufacturing comparison, the LLM might generate targeted sub-queries like:

- What is Tesla's approach to EV manufacturing and production?
- What is GM's approach to EV manufacturing and production?
- Where are Tesla's manufacturing facilities located?
- Where are GM's manufacturing facilities located?

This approach enables more precise matching with relevant chunks in the vector database, as each sub-query can be vectorized to capture specific semantic aspects more accurately. The retrieved chunks for each sub-query are then combined and reranked to provide a complete context for the language model to generate a coherent response that compares and contrasts both companies' manufacturing strategies and facility locations.

The use of LLMs for query decomposition offers several advantages over rule-based or keyword-based approaches. LLMs can understand implicit relationships within questions, identify logical dependencies between different query components, and generate sub-queries that maintain the original intent while being optimized for retrieval.

Sample prompt for query decomposition:

You are an expert at converting user questions into specific database queries for similarity search. Break down the `user-question` into distinct sub-queries that address different aspects of the original question. Ensure that the set of sub-queries comprehensively covers the main aspects of the original question. 
user-question: ```<paste user-question here>```
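The decomposition flow described above (parse the LLM's sub-queries, retrieve chunks for each, then combine them before reranking) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the notebook's implementation: `parse_sub_queries` and `retrieve_merged` are hypothetical helpers, the LLM is assumed to return one sub-query per line (optionally numbered or bulleted), and `search` is a placeholder for whatever vector-store query function is in use.

```python
import re
from typing import Callable, List


def parse_sub_queries(llm_output: str) -> List[str]:
    """Turn the LLM's decomposition output into a clean list of sub-queries.

    Assumption: one sub-query per line, possibly prefixed by "1.", "2)", "-" or "*".
    Duplicate sub-queries are dropped while preserving order.
    """
    sub_queries: List[str] = []
    for line in llm_output.splitlines():
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned and cleaned not in sub_queries:
            sub_queries.append(cleaned)
    return sub_queries


def retrieve_merged(
    sub_queries: List[str],
    search: Callable[[str, int], List[str]],
    top_k: int = 3,
) -> List[str]:
    """Run one retrieval per sub-query and merge the results without duplicates.

    `search(query, top_k)` is a stand-in for a vector similarity search that
    returns chunk texts. The merged list is what would then be passed to the
    reranker together with the original user question.
    """
    seen = set()
    merged: List[str] = []
    for sub_query in sub_queries:
        for chunk in search(sub_query, top_k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


if __name__ == "__main__":
    llm_output = """
    1. What is Tesla's approach to EV manufacturing and production?
    2. What is GM's approach to EV manufacturing and production?
    3. Where are Tesla's manufacturing facilities located?
    4. Where are GM's manufacturing facilities located?
    """
    # Toy search: returns a labelled placeholder chunk per sub-query.
    toy_search = lambda q, k: [f"chunk about: {q}"][:k]
    for chunk in retrieve_merged(parse_sub_queries(llm_output), toy_search):
        print(chunk)
```

Deduplicating before reranking matters because different sub-queries often retrieve the same chunk, and scoring it once avoids wasted reranker calls.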
2 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

Exercise 1: Simple RAG for 10-K filings
Exercise 2: RAG with Reranker for 10-K filings
Code with Explanation is posted here: Colab Notebook Link
Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse
Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action)
Exercise 5: RAG with Agentic Pattern: ReAct + Reflection

These exercises incrementally build on basic RAG with a focus on “why” before “what” and “how”.

This exercise, the second in the series, focuses on illustrating how reranking makes a difference in the quality of the response generated by the RAG system. It extends the previous exercise by adding a Reranker. We encourage readers to go through Reranking Retrieved Chunks using Reranker (Cross-Encoder model) before going through the code.

Reranking Retrieved Chunks using Reranker (Cross-Encoder model)

While vector similarity search provides a good initial set of relevant chunks, it can sometimes miss nuanced semantic relationships or return chunks that are only superficially similar to the query. Consider a user asking "List out major changes that occurred in Tesla in 2023." A vector search might rank chunks discussing changes from 2022 higher than a more relevant chunk about a Director selling common stock in 2023, simply because the 2022 chunks share more semantic similarities around the concept of "changes" and "Tesla." This highlights a limitation of pure vector similarity matching. This is where rerankers come into play, serving as a crucial refinement layer in the RAG pipeline. 
A reranker takes the initial set of retrieved chunks from the vector database and performs a more sophisticated, computationally intensive analysis to improve the ranking quality. The reranking process often employs cross-encoders, which are transformer models that simultaneously process both the query and a candidate chunk to produce a relevance score. This approach captures more subtle semantic relationships and contextual nuances. It can correctly identify that the Director's stock sale in 2023 is more relevant to the query than changes from 2022, despite fewer surface-level semantic similarities. A natural question arises: why not use these more sophisticated reranker models for the initial retrieval instead of vector search? The answer lies in computational efficiency. Using a reranker as the primary retrieval mechanism would require passing each query through the model alongside millions of individual chunks in the vector database, computing similarity scores one at a time. This process would be prohibitively expensive and slow, especially for large-scale applications that need to maintain responsive query times. This is why modern RAG systems typically employ a two-stage retrieval process that combines the best of both approaches. They first use rapid vector similarity search to quickly identify a promising set of candidates (e.g., top 100 chunks), then apply the more sophisticated reranker to this smaller set to determine the final top-k chunks (e.g., top 5-10) that will be provided as context to the language model. This hybrid approach balances computational efficiency with retrieval quality, ensuring that the system provides accurate and relevant responses while maintaining reasonable response times. Technical Details Reranking model The key distinction between Embedding models (typically bi-encoders) and Reranking models (typically cross-encoders) lies in how they process queries and chunks. 
Bi-encoders process each text independently - the query and chunk are fed through the model separately to generate their respective embeddings. These embeddings can then be compared using similarity metrics like cosine similarity. This approach allows for efficient retrieval since chunk embeddings can be pre-computed and indexed, but it limits the model's ability to capture complex interactions between the query and document. Cross-encoders take a fundamentally different approach by processing the query and chunk together as a single input. By concatenating the query and chunk with a separator token, the model can leverage its attention mechanisms to directly compare and contrast every term in the query with every term in the chunk. This enables the model to capture nuanced relevance patterns and contextual relationships that might be missed when processing texts independently. For example, if a query asks about "Tesla's competitors' plant locations," a cross-encoder can directly attend to chunks mentioning locations of Tesla's competitors (e.g., Ford's plant in Michigan, Rivian's facility in Illinois) while downweighting or deprioritizing chunks that primarily describe Tesla's plant locations. This results in a reranked list where chunks about competitors' locations are moved higher than those focusing on Tesla itself, better aligning with the user's intent. When it comes to model training objectives, embedding models and cross-encoder models serve different purposes despite often starting from the same base architectures like BERT. Embedding models are fine-tuned specifically to generate high-quality sentence or paragraph level embeddings that capture semantic meaning in a fixed-dimensional vector space. The training process typically involves contrastive learning objectives that push similar texts closer together and dissimilar texts further apart in the embedding space. 
Cross-encoder models, on the other hand, are fine-tuned to directly predict a relevance score given a query-document pair. Rather than generating embeddings, the model learns to output a single similarity score that indicates how well the document answers the query. This direct optimization for the ranking task typically leads to better ranking performance, though at the cost of computational efficiency since pairs must be processed together.
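The two-stage retrieval process described above can be sketched in plain Python. This is only an illustration of the control flow: `embed` (a bag-of-words vector) and `toy_cross_score` are deliberately simple stand-ins for a trained bi-encoder and a trained cross-encoder, and all function names here are hypothetical. The point it shows is that chunk vectors are computed independently of the query, while the more expensive pairwise scorer only ever sees the short candidate list.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: each text is encoded on its own."""
    return Counter(tokens(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_cross_score(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: scores the (query, chunk) pair jointly.

    Toy rule: token overlap, minus a penalty when the query names a year
    that the chunk does not mention.
    """
    q, c = set(tokens(query)), set(tokens(chunk))
    years = {t for t in q if t.isdigit() and len(t) == 4}
    penalty = 2 if years and not (years & c) else 0
    return len(q & c) - penalty


def two_stage_retrieve(
    query: str,
    chunks: List[str],
    rerank_fn: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 5,
) -> List[str]:
    # Stage 1: fast vector search. In a real system the chunk vectors are
    # pre-computed and indexed; they are built inline here for brevity.
    chunk_vecs = [embed(c) for c in chunks]
    q_vec = embed(query)
    order = sorted(range(len(chunks)), key=lambda i: cosine(q_vec, chunk_vecs[i]), reverse=True)
    candidates = [chunks[i] for i in order[:n_candidates]]
    # Stage 2: the pairwise scorer runs only on the candidates.
    return sorted(candidates, key=lambda c: rerank_fn(query, c), reverse=True)[:top_k]


if __name__ == "__main__":
    corpus = [
        "Tesla changes in 2022 included new pricing",
        "A director sold Tesla common stock in 2023",
        "GM opened a plant in Michigan",
    ]
    print(two_stage_retrieve("major changes Tesla 2023", corpus, toy_cross_score, n_candidates=2, top_k=1))
```

In this toy corpus the 2022 chunk wins stage 1 on surface similarity, and the pairwise scorer promotes the 2023 chunk in stage 2, mirroring the Tesla example discussed earlier.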
3 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
'AI Won't Replace Humans, But Humans Who Know AI Will.' While this idea is widely accepted, it is unclear what "Who Know AI" actually means. Through in-depth interviews with AI experts, we identified two complementary frameworks that directly counter the four common misconceptions we uncovered in a previous article, Why 30% of AI Projects Fail: 4 Common AI Misconceptions among Executives. These two frameworks are:

The SPADE framework - which stands for Sequence Decomposition & Experimentation, Prompt Design Strategy, AI Understanding, Data Strategy, and Evaluations - encompasses the technical capabilities needed to effectively guide AI implementations, from breaking down complex processes to designing robust evaluation systems.

The CATE framework - representing Critical Thinking, Adaptability & Learning, Troubleshooting & Analysis, and Ethical Awareness - addresses the equally crucial soft skills that enable AI conductors to bridge the gap between technical possibilities and business value, manage stakeholder expectations, and ensure ethical implementation.

Together, these frameworks provide a comprehensive blueprint for building teams capable of driving successful AI transformations.

SPADE: Technical Skills

Sequence Decomposition & Experimentation

AI conductors must be able to deconstruct complex human-dependent cognitive and operational processes into discrete, manageable components that can be completed by a combination of GenAI and other approaches such as coding, traditional machine learning, and third-party services. This decomposition allows organizations to identify where AI can add the most value while maintaining human oversight where needed. However, identifying the right components is just the beginning: AI conductors must systematically experiment with different approaches to optimize outcomes. Consider what appears to be a straightforward task: selecting an image for a corporate blog post. 
An AI conductor might break this down into several potential execution sequences. One approach could follow a content-first path: using AI to extract key themes and messages from the post, applying these insights to filter a stock image database, and then employing AI-assisted selection for the final choice. An alternative sequence might prioritize emotional resonance: first analyzing the post's intended emotional impact, then using these emotional markers to guide image selection, and finally validating the choice against the post's content themes. AI conductors must coordinate with AI Engineers to test these various approaches, measuring their effectiveness against both technical metrics and business objectives. This requires carefully weighing factors such as implementation costs, processing time, accuracy rates, and—crucially—alignment with human expectations and organizational standards. Prompt Design Strategy AI conductors developing prompt design strategies need to focus on three core components: reasoning, instruction, and context management. For reasoning, AI conductors must understand that while modern AI models demonstrate impressive capabilities in developing plans and breaking down complex tasks, they have specific limitations when handling tasks requiring significant human judgment. For complex scenarios, AI conductors may need to guide the AI model to first generate its thought process, evaluate its generated thoughts, and reiterate before generating the final response. They must also consider when to incorporate external guidance for handling tacit knowledge and organization-specific information that may not be captured in the model's training data. This external guidance can be approached in two ways: through explicit rules articulated by domain experts who understand the nuances of organizational practices, or through patterns discovered by applying traditional machine learning techniques to historical organizational data. 
When determining instructions, AI conductors must consider several key factors: the interdependence between tasks, reasoning load on model for each component, and data dependencies. Such considerations help them decide whether to combine multiple tasks in one prompt or to split instructions across multiple prompts. AI conductors also need to carefully calibrate context as per business objective—excessive contextual information can overwhelm the model and lead to suboptimal results, while insufficient context can produce incomplete or misaligned responses. The key is finding the right balance that aligns with both technical capabilities and business requirements. AI Understanding AI conductors must have a comprehensive understanding of Generative AI models' capabilities to effectively drive internal adoption and integration of AI solutions. While these models excel at extracting information from text according to instructions, their ability to extract information from images is rapidly evolving and less widely understood. These extraction capabilities eliminate traditional roadblocks like data collection, feature engineering, and model training. Understanding generation capabilities across text, code, images, audio, and video is crucial for guiding different business units. For customer support, they may help choose between text-only models or multimodal models based on the nature of data and industry domain. For UX generation, AI conductors can help marketing teams build prototypes quickly without coding experience. For code generation, they need to help developers overcome initial resistance and embrace AI tools. While text-to-image may not yet be reliable for corporate applications, AI conductors can help teams use AI to generate infographics, flowcharts, and diagrams for various business documents. Data Strategy As an AI conductor, developing a robust data strategy begins with ensuring data representativeness while carefully managing privacy concerns. 
Modern approaches have moved beyond traditional human annotation to embrace hybrid systems that leverage large foundational models for generating data, with human annotators focusing only on low-confidence cases requiring expert judgment. When real data isn't available or suitable, AI conductors must know how to leverage AI to generate synthetic data that maintains real-world patterns while protecting privacy. This approach is particularly valuable for testing AI systems across rare scenarios or checking for biases across different demographic groups. The strategy must also include sophisticated approaches to data sanitization, removing personally identifiable information while preserving valuable patterns and insights. Evaluations AI conductors must develop comprehensive evaluation frameworks that go beyond traditional metrics to assess both technical performance and business value. This includes designing nuanced criteria for subjective qualities like tone and creativity, implementing robust monitoring systems to detect subtle degradation in performance, and establishing clear tracing mechanisms for debugging complex AI workflows. Unlike traditional machine learning, where metrics are straightforward and objective, generative AI requires multi-faceted evaluation approaches. These might combine traditional metrics with AI-based evaluations (LLM-as-judge) and human assessments to catch subtle errors or "hallucinations." AI conductors must also implement pre-generative guardrails to prevent misuse and ensure safety, while maintaining robust monitoring systems to detect issues like "silent degradation," where automated scores remain acceptable while real-world utility declines. CATE: Soft Skills Critical Thinking As an AI conductor in the AI era, critical thinking has become more crucial than ever. 
While organizations traditionally relied on a few key decision-makers to determine the "why" and "what" of projects, with larger teams focused on the "how" of execution, AI is shifting this dynamic. As AI tools become increasingly sophisticated at handling execution tasks, the bottleneck isn't in implementation but in strategic thinking: identifying which problems to solve, why they matter, and what approach to take. AI conductors must think critically to prioritize high-impact initiatives, collaborate with stakeholders to validate strategic decisions, and work with technical teams to develop effective implementation plans. Communication Skills AI conductors must excel at multifaceted communication, serving as bridges between technical teams, business stakeholders, and end users. They need to effectively translate technical concepts into business value propositions when speaking with executives, while also conveying business requirements and constraints clearly to technical teams. Crucially, they must be exceptional listeners, skilled at building rapport with domain experts to understand their nuanced decision-making processes. This involves managing expectations about AI capabilities—neither overselling nor underselling what AI can achieve—and clearly communicating both the possibilities and limitations of AI solutions. Adaptability & Learning In the rapidly evolving AI landscape, AI conductors must possess exceptional adaptability and commitment to continuous learning. The field resembles drinking from a firehose—new tools emerge constantly, models improve dramatically, and novel possibilities unfold weekly. AI conductors must maintain an experimental mindset, consistently exploring new use cases while staying current with the latest developments in AI capabilities. 
The most effective AI conductors are those who can quickly adapt their strategies as technologies evolve, finding creative ways to work around current limitations while preparing for future capabilities. Troubleshooting & Analysis AI conductors need strong analytical abilities to effectively collaborate with technical teams in diagnosing and resolving AI system issues—particularly when content extraction or generation falls short of expectations, or when guardrails fail to perform as intended. While technical teams handle the deep debugging, AI conductors must understand enough to have productive conversations about potential issues: whether system prompts might need restructuring, if prompt instructions could be overwhelming the model, or if reasoning guidance may be insufficient. The goal isn't for AI conductors to solve technical problems themselves, but rather to serve as an effective bridge between business needs and technical solutions. Ethical Awareness AI conductors must possess a strong foundation in ethical awareness to ensure responsible AI implementation within their organizations. This involves anticipating potential societal impacts of AI systems and proactively addressing concerns before they become problems. They need to champion ethical considerations at the strategic level: developing organization-wide AI ethics policies, establishing review processes for high-risk applications, and creating clear escalation pathways for ethical concerns. AI conductors should also foster a culture of responsible innovation where teams feel empowered to raise ethical concerns without fear of impeding progress. This requires balancing the pressure for rapid AI deployment with the need for thoughtful consideration of long-term implications and societal impact. 
Moving Forward

For executives leading AI transformations, the path forward is clear: success depends not on accumulating technical talent alone, but on cultivating AI conductors who can orchestrate the interplay between AI capabilities and business strategy. This shift is particularly crucial as generative AI increasingly handles operational "how" tasks that previously required human execution. As this capability expands, organizations face a new bottleneck: the growing demand for professionals who excel at determining the "why" and "what" of AI initiatives. This requires three key executive actions with specific implementation steps:

Changes Needed for Hiring AI Conductors/Strategists

Traditional IT hiring emphasizes deep technical expertise, but AI success demands professionals who can translate between business needs and technical possibilities. Executives should:

- Revise job descriptions to prioritize candidates who demonstrate both the technical literacy outlined in SPADE and the orchestration capabilities detailed in CATE
- Implement assessment methods that evaluate a candidate's ability to communicate complex AI concepts to different stakeholders
- Establish competitive compensation structures that recognize the unique value of these hybrid skill sets
- Set a target ratio of AI conductors to technical AI specialists (1:3 is often effective in early implementations)

Grooming AI Conductors/Strategists Internally

Create dedicated paths for developing AI conductors internally with clear milestones and timelines. Rather than relying solely on external hires, organizations should create two distinct development tracks to nurture AI conductors from existing talent:

a) The Engineer-to-Conductor Path: Identify engineers who demonstrate strong aptitude for CATE skills (particularly communication and critical thinking). These technically skilled professionals often already possess much of the SPADE framework, but need structured development in:

- Role-playing exercises simulating interactions with various stakeholders (executives, domain experts, end users)
- Shadowing business strategy meetings to understand organizational priorities
- Mentorship from business leaders to develop business acumen
- Progressive responsibility in presenting AI concepts to non-technical audiences
- Training in ethical frameworks and responsible AI principles

b) The Analyst-to-Conductor Path: Identify business analysts who show technical curiosity, comfort with coding fundamentals, and willingness to engage with technical documentation. These business-savvy professionals need intensive development in SPADE skills through:

- Structured technical training in AI fundamentals and capabilities
- Hands-on workshops for prompt engineering and evaluation design
- Paired work with AI engineers to understand technical limitations and possibilities
- Progressive responsibility in translating business requirements into technical specifications
- Guided practice in decomposing complex workflows into AI-enabled components

Finally, evolve governance structures to empower AI conductors. Traditional project management frameworks, with their emphasis on fixed requirements and linear execution, often constrain AI initiatives. Executives need to establish new governance models that enable rapid experimentation while maintaining appropriate controls. This includes redefining success metrics beyond technical performance to encompass business impact and ethical considerations.

Organizations that execute these shifts will not only accelerate their AI initiatives but also build lasting competitive advantages through superior orchestration capabilities. In an era where AI tools are increasingly commoditized, the ability to effectively direct these tools toward strategic objectives will become the key differentiator.
authors:
Rohit Aggarwal
Harpreet Singh

Article
AI's transformative power is often likened to groundbreaking innovations such as the printing press, the steam engine, or the internet. Each of these innovations transformed society in unique ways: the printing press democratized knowledge, the steam engine industrialized production, the internet connected humanity, and AI is now augmenting human intelligence across every sector. For example, in healthcare, AI algorithms can analyze medical images with greater speed and accuracy than human doctors, leading to earlier diagnoses and better patient outcomes. Similarly, in the legal field, tasks that once required weeks of meticulous effort, such as sifting through thousands of documents to locate a critical email, can now be accomplished in moments with AI. AI's revolutionary potential is well-recognized by business leaders. However, the rapid evolution of AI technologies and their relative newness pose significant challenges for companies attempting to integrate AI effectively into their workflows. A Gartner report forecasts that by the end of 2025, at least 30% of generative AI (GenAI) projects will be abandoned. Our research, based on a survey of C-Suite Executives from medium-sized companies across industries including retail, healthcare, food, hospitality, and education, alongside AI experts from Microsoft, Google, Facebook, AWS, Salesforce, Alibaba and top consultancy companies including Deloitte, McKinsey, PwC, Accenture, BCG, Cognizant, Bain and IBM, revealed significant gaps in business leaders' understanding of AI's capabilities and limitations. Compounding these misconceptions is a growing shortage of professionals who truly understand how to leverage AI effectively. Since Harvard Business School's Dean famously remarked, 'AI Won't Replace Humans, But Humans Who Know AI Will,' this sentiment has gained widespread traction in business circles. 
However, there remains little clarity around what it actually means to ‘know AI.’ To address this gap, we interviewed many AI experts and discovered that successful AI implementations require more than AI Engineers—they need what we call "AI conductors": professionals who orchestrate the complex interplay between AI capabilities, business strategy, and human factors. Our research helped identify the comprehensive set of technical and soft skills that define these conductors. These skills extend beyond technical proficiency, emphasizing the ability to design AI-driven workflows, develop robust evaluation frameworks, and ensure that AI-driven decisions are consistently aligned with human values, ethical considerations, and business objectives. Our research revealed a more fundamental challenge: executives must first overcome their own blind spots about AI before they can effectively develop AI conductors in their organizations. Based on our interviews, four critical misunderstandings consistently undermine AI initiatives. The Four AI Blind Spots Many Executives Have AI as a Data Analysis Tool First, executives often view AI primarily as a data analysis tool. This mindset underestimates AI's true potential. AI's real strength lies in its ability to simulate human decision-making and automate complex tasks that were once considered too intricate. Unlike traditional automation, which relied on explicit rules and structured inputs, AI, specifically Generative AI (GenAI), can handle ambiguous, unstructured, and context-heavy tasks with minimal human intervention. For the first time in history, machines can infer intent, adapt to new scenarios, and integrate multimodal data—text, images, and audio—allowing them to approach the complexity of human decision-making. This shift enables AI to go beyond routine task automation and actively enhance human judgment in areas requiring reasoning, pattern recognition, and contextual awareness. 
GenAI makes augmenting human decision-making possible in virtually every domain, including HR, finance, accounting, technology, healthcare, legal, customer service, and scientific research. GenAI is Similar to Traditional Machine Learning (ML) Second, executives incorrectly assume that generative AI implementation mirrors traditional machine learning (ML). Traditional ML can be judged with objective metrics such as accuracy and F1 scores. GenAI instead requires subjective, multi-faceted evaluation that combines automated metrics, LLM-based assessments, and human review to judge aspects like tone, coherence, and creativity while catching hallucinations and subtle errors. GenAI's ability to accept free-form input necessitates robust pre-generation guardrails to prevent misuse, detect prompt injection attempts, assess societal risks, and enforce regulatory and brand guidelines, along with post-generation checks for factual accuracy, bias, and brand consistency. Monitoring GenAI systems is also more complex. One risk is "silent degradation," where automated scores remain acceptable despite declining real-world utility, so continuous human review is required. In addition, each response may pass through multiple, dynamically selected steps that must be traced, which makes debugging and root-cause analysis significantly harder than in traditional ML, where evaluation is typically deterministic and relatively straightforward. Plug-and-Play Perception Third, executives often view AI as a plug-and-play solution. C-suite leaders frequently assume that AI implementation is primarily a technical endeavor where engineers and data scientists develop and maintain models, while business professionals merely assist with data identification and occasional feedback. This mindset, fueled by ambitious claims about AI's capabilities, creates the illusion that AI can be seamlessly integrated without significant human oversight. 
However, AI experts emphasize that even as AI technology evolves rapidly, significant involvement of AI conductors (i.e., GenAI strategists) remains essential across the entire AI-enabled workflow. This includes development, deployment, and ongoing monitoring phases. The reality is that successful AI integration requires creating many new processes and tools, understanding cognitive workflows, evaluating existing systems, conducting feasibility studies, establishing appropriate scope, securing stakeholder buy-in, identifying quality data sources, and planning resources effectively. These activities highlight that AI functions best as an enabler rather than an independent operator, requiring continuous human guidance to ensure alignment with strategic objectives, practical constraints, and ethical considerations. Hire Engineers & Data Scientists Fourth, many executives mistakenly believe that successful AI implementation depends primarily on technical expertise and is mostly about model training and upkeep. This often leads to the strategy of hiring as many data scientists and ML/AI engineers as possible, while relying on existing technology leaders to oversee AI initiatives. Though these leaders excel at managing technical projects, they often lack deep understanding of GenAI's unique characteristics. They may fail to recognize that augmenting human decision-making requires decoding cognitive processes and blending various technologies. Moreover, existing leaders typically struggle to envision emerging use cases enabled by rapidly advancing multimodal AI capabilities. Our research shows that the most successful implementations occur when organizations balance technical talent with AI conductors – professionals who may not code extensively but possess strong technical knowledge of generative AI capabilities and limitations. 
These AI conductors play a vital role by steering implementations, managing executive expectations, advocating for essential resources (particularly in AI evaluation systems), and ensuring proper controls are in place. They serve as strategic bridges between technology and business objectives, preventing the equivalent of deploying powerful AI systems without proper safeguards – like driving on a freeway with faulty brakes. Organizations that recognize the need to hire or groom these AI conductors alongside technical talent consistently achieve more successful, responsible AI integrations. Through in-depth interviews with AI experts, we identified two complementary frameworks that directly counter these misconceptions by providing a clear blueprint for the capabilities AI conductors need to develop. You can read about these complementary frameworks: SPADE and CATE here:
authors:
Rohit Aggarwal
Harpreet Singh

Article
TL;DR The narrative revolves around GloboTech’s leadership team, led by CEO Sarah Klein, grappling with the need to pivot their hiring strategy to prioritize "AI conductors"—consultants adept at orchestrating AI workflows and solutions over traditional coding-centric "musicians." This shift introduces challenges in candidate evaluation, risk of pushback from traditional talent pools, and the need for market positioning to retain competitive advantage. Through detailed discussions, the case delves into: • The trade-offs between traditional coding skills and AI workflow redesign capabilities. • The design of assessments to measure hybrid skills like critical thinking and problem-solving over coding tests. • The operational and reputational risks of implementing innovative hiring strategies. • Market positioning challenges as the firm transitions to a leadership role in AI consulting. Contact Rohit Aggarwal for Feedback, Collaboration or Teaching Notes for this case Prologue On a brisk February morning in Boston, the leadership team at GloboTech Services gathered around a sleek conference table for a critical hiring strategy meeting. The company had built its reputation by providing high-level IT consultants to global enterprises, helping them tackle everything from infrastructure modernization to AI-driven system implementations. Despite their success, CEO Sarah Klein sensed a shift in the market. "We've always prided ourselves on hiring the best coders—our 'star musicians,'" she began, scanning the room. "But I'm not convinced that's still the formula for success in consulting, where AI tools are taking over a lot of the coding tasks." Next to her, Ian Roberts, Chief Technology Officer (CTO), flipped through slides showing recent developments in generative AI. "Our technical assessments have been great at verifying data-structure knowledge and coding prowess," he said. 
"But with AI writing so much of the code now, we might need a different skill set—people who can guide AI rather than manually produce every line themselves." Carla Sanchez, the VP of Human Resources, chimed in. "We need to rethink our entire recruitment strategy," she said. "We're focusing too heavily on coding under time pressure. We're not measuring critical thinking, problem-solving, or communication—the skills consultants need when orchestrating AI solutions for clients." The Setting Company Profile GloboTech Services: A mid-sized IT services firm specializing in advanced AI-driven consulting for enterprise clients Headquarters: Boston, MA Founded: 2015 by CEO Sarah Klein, a former venture capitalist turned IT services entrepreneur Scale: 50,000 employees globally, $5 billion in annual revenue Leadership Team Sarah Klein (CEO) Visionary who believes in staying ahead of tech trends Concerned about maintaining the company's reputation for innovation Ian Roberts (CTO) Deep technical background; formerly a lead engineer at a major tech giant Believes in maintaining rigorous technical standards, but open to reevaluating what "technical" means in the age of AI Carla Sanchez (VP of Human Resources) Responsible for recruitment strategy and employee development Wants to attract top "AI conductors" over "musicians" who rely solely on coding Ming Li (Director of Operations) Manages day-to-day processes, including test administration logistics for consultants Concerned about cheating and test integrity in remote hiring Amelia Price (Director of Strategy & Partnerships) Monitors industry trends and competitive moves Sees a big opportunity for GloboTech to become a thought leader in AI consulting Unfolding the Drama: The Leadership Meeting 1. The Changing Nature of Skills Sarah placed her phone face down on the table. "I just got off a call with a major enterprise client. They're revamping their own hiring for AI roles—telling me it's not just about coding. 
They need consultants who can figure out what to build and why, more than how to build it in excruciating detail." Ian projected a chart on the screen. "Our consultant screening has historically revolved around coding challenges—data structures, algorithms, design patterns. But with the pace of AI, we might attract amazing 'musicians'—fantastic coders—who aren't necessarily good 'conductors.' They might write perfect code but struggle to guide AI agents, collaborate with humans, and orchestrate solutions where AI handles a substantial percentage of tasks traditionally done by human programmers. We need consultants who can step back from the instrument and conduct the entire AI-human orchestra." He paused. "I'm worried about how candidates will react if we suddenly pivot from traditional coding tests to assessments focused on critical thinking and communication. These developers have spent years practicing data structures and algorithms, preparing for the kind of technical interviews that every major tech company gives. If they encounter our new format—heavy on problem-solving and light on coding—they might think something's wrong. We might see high drop-off rates mid-process, or candidates might abandon our assessments entirely when they don't recognize the format. They've invested so much time mastering one type of interview, and we're asking them to showcase completely different skills." Carla responded quickly: "We're not throwing coding out the window. But we have to rebalance. Consulting engagements demand critical thinking, analytical reasoning, and communication—especially in scenarios where the code is assisted by AI." 2. The Mismatch and Fear of Candidate Pushback Sarah tapped the table with her pen. "Our job postings and screening tests still scream 'traditional coding challenge.' If we pivot to these intangible, AI-focused skills, we risk confusing applicants. 
They might ask, 'Is GloboTech even a serious tech consulting firm?'" Amelia added, "Top technical talent often wants to go somewhere they can do cutting-edge programming. If they think we're turning them into project managers or 'AI conductors,' they might walk away before they grasp what we're really offering." Cheating, AI Tools, and Resume Inflation 1. The New Age of "Resume Customization" Ming cleared her throat. "We're also seeing more AI-driven cheating. Last week, we got two identical cover letters from different applicants—clearly generated by the same AI. And these candidates aced our coding tests but failed miserably in real-life consulting simulations." Ian shook his head. "They could've used AI copilots to solve those coding problems. Ironically, we do want consultants to leverage AI effectively, but we also need to see authentic problem-solving. Right now, we're struggling to tell genuine skill from AI-generated fluff." 2. Heightened Proctoring Requirements Carla jumped in. "We could implement more intense proctoring—live video, screen captures, advanced plagiarism detection. But that raises privacy concerns. Candidates might be uncomfortable being recorded at home, especially when we run hours-long tests." Ming nodded. "And we need to ensure compliance with data protection laws. Recording for a quick interview is one thing. Doing continuous screen captures is another." The Core Challenges Emerge Carla stood up and walked to the whiteboard. "Before we go further, let's tackle the assessment design challenge head-on. How do we actually evaluate these hybrid skills we're talking about?" She wrote a few bullet points as she spoke. "We need someone who can code, yes, but also someone who can think strategically about AI implementation. Traditional coding tests won't cut it anymore. Let me give you a concrete example." Carla pulled up a recent project slide. 
"When we're implementing generative AI solutions, we're essentially trying to replicate complex human decision-making processes. Our consultants need to be able to step back and decompose these decisions into logical sequences. It's like being a cognitive architect." She drew a flowchart on the whiteboard. "Look at our recent customer service automation project. Our traditional developers wanted to jump straight into coding a solution. But our more AI-savvy consultants took a different approach. They first mapped out the human service rep's decision-making process, broke it down into discrete steps, and then experimented with different combinations of tools – sometimes using coding, sometimes traditional machine learning models, sometimes generative AI APIs. They kept iterating until they found the right sequence." Ian nodded vigorously. "Exactly. And it's not just about technical implementation. These consultants need to design custom evaluation metrics that align with human preferences. They're constantly experimenting – trying different prompts, adjusting model parameters, fine-tuning the interaction between various AI tools. It's a completely different skill set from traditional software development." "And the results speak for themselves," Carla continued. "The team that took this experimental, methodical approach delivered a solution that actually matched how human service reps make decisions. The team that rushed into coding ended up with a rigid system that couldn't handle nuanced cases." Ian leaned forward. "What if we let candidates use AI tools during the assessment? It could show us how they approach AI-assisted problem solving in real time. Though," he added with a frown, "we'd need to figure out which tools to allow and how to evaluate their usage effectively." "That's exactly the kind of thing that keeps me up at night," Ming interjected. "We hire thousands of consultants annually across our global offices. 
How do we maintain consistency in these new assessments across regions? We'll need to retrain every interviewer, every recruiter. And let's be honest – evaluating soft skills consistently at this scale is already challenging. Adding AI orchestration skills to the mix makes it even more complex." Amelia, who had been quietly taking notes, looked up. "There's also the client side to consider. Just this morning, I had a call with a client who was concerned about our transition. They asked point-blank: 'If you're moving away from pure coding skills, how can you justify your current billing rates?' We need to articulate the value proposition of these new 'AI conductors' clearly." "That's a crucial point," Sarah agreed. "We're not just changing our hiring process – we're transforming how clients perceive the value of our consultants. Some clients might worry we're becoming less technical, when in reality, we're evolving to meet their future needs." Ming rubbed her temples. "And let's not forget about our existing client projects. How do we manage expectations during this transition? We can't just flip a switch and suddenly have all our consultants become AI experts overnight." "The billing question is particularly tricky," Ian added. "Traditionally, we've justified our rates based on our consultants' technical expertise. Now we need to help clients understand why strategic AI orchestration skills command similar or even higher rates." The Market Reality The challenge was particularly acute given GloboTech's position in the market. Unlike pure software companies, IT services firms needed consultants who could bridge the gap between business needs and technical implementation—a skill set becoming even more critical in the AI era. The market signals were impossible to ignore: Job postings mentioning AI had seen a staggering 1,000% increase in 2023 compared to the end of 2022, according to CIO.com. 
LinkedIn's latest workforce report highlighted AI Engineers and AI Consultants among the top 10 fastest-growing jobs in the U.S. This explosive growth in AI-related positions wasn't just a trend—it represented a fundamental shift in how technology services would be delivered. Clients weren't just asking for AI capabilities; they were demanding consultants who could navigate the entire AI transformation journey. As a mid-tier player competing against giants like Accenture, Cognizant, and IBM, GloboTech had to be strategic about its transformation. These larger competitors had deeper pockets for training programs and could afford to experiment with different approaches. Balancing Risks and Rewards 1. Thought Leader vs. Status Quo As the debate wore on, Amelia spoke up. "What if we lead the charge? Publish white papers, host webinars on how AI is changing the consulting skill set, and promote the idea of an 'AI conductor.' That might mean we lose some candidates, but it could also attract the right ones—and elevate our brand." Sarah considered it. "We'd be positioning ourselves as the front-runners in an evolving market. But are we ready if top coders turn away or if our current teams resist the change? And what if our competitors hang back, learn from our hiccups, and then adopt a more polished version of our model?" 2. Suboptimal Hires if We Delay Ian cautioned against waiting too long. "We're already seeing mismatched skill sets in projects. If we continue hiring the old way, we'll get consultants who can code but can't coordinate AI solutions effectively. That's a recipe for losing contracts over time." He pulled up a recent case on his tablet. "Take what happened with StandardCorp last month. Our team was technically strong—great coders who could build traditional systems. But when it came to implementing AI-driven HR automation, they struggled. 
They approached it like a traditional software development project, writing custom code for everything instead of effectively orchestrating existing AI tools and APIs. After three months of delays and cost overruns, StandardCorp pulled the plug on our $2 million contract and hired AIShape, a boutique AI consultancy. They completed the project in six weeks using mostly off-the-shelf AI models and custom prompts. That's just one example, but it's a pattern I'm seeing more frequently." "That StandardCorp loss really hurt us," Carla added. "Not just the revenue, but our reputation. Their CIO told me directly that while our consultants were 'brilliant programmers,' they seemed to be 'fighting against AI instead of leveraging it.' That's exactly the perception we need to change." The Tipping Point Midway through the conversation, Sarah's phone buzzed again. She glanced at a text from a major strategic partner: "Heard rumors you're pioneering a new approach to AI-hiring. Would love to see it." Sarah's eyes lit up—word had leaked out. The room fell silent as the leaders processed the potential brand perception issues. They each realized the importance of messaging this shift correctly to avoid driving away strong applicants. Amelia noticed Sarah's reaction. "We don't operate in a bubble," she said. "Clients and partners are watching. If we move quickly, we could become their go-to experts on AI adoption—something that sets GloboTech apart." Implementation Challenges Surface Ming raised her hand. "There's another critical challenge we haven't discussed – the assessment process itself. These new tests focusing on critical thinking and communication will have significant subjective components. How do we evaluate them consistently and fairly?" "I've been exploring some innovative approaches," Carla responded. "What if we used a combination of LLMs and human assessors? 
The AI could provide an initial structured evaluation of communication patterns and problem-solving approaches, while our human assessors focus on nuance and cultural fit." "Interesting," Ian leaned forward. "But how do we ensure the AI evaluations align with our human preferences? We'd need to carefully calibrate these systems." Ming nodded. "And scale is a huge concern. We're talking about thousands of assessments across different offices. We need standardization without losing the human touch." "That brings up another crucial point," Carla added. "We need to completely retrain our interviewing staff. Our assessors are experts at evaluating coding skills, but evaluating AI orchestration abilities? That's a whole different ball game. We need to develop new interview questions, testing procedures, evaluation rubrics – and then train everyone consistently." Sarah turned to Amelia. "What about the client side? How do we message this transformation to them?" Amelia straightened in her chair. "That's tricky. Some clients still equate technical capability with coding skills. We need to help them understand why an AI orchestrator might be more valuable than a traditional programmer. I had a client ask me yesterday, 'If they're not primarily coding, why should we pay the same rates?'" "We need concrete metrics," Sarah mused. "Not just technical delivery metrics, but business impact measurements. Something that shows how these new skills translate to better solutions." "Exactly," Amelia agreed. "And we need to manage expectations during the transition. Some clients might see mixed teams – traditional developers working alongside our new AI orchestrators. We need to clearly articulate the value of each role and how they complement each other." Ian pulled up a dashboard on his laptop. "Look at the StandardCorp project we discussed earlier. The traditional team spent 70% of their time coding custom solutions. 
The AI-focused team spent 60% of their time on solution design and orchestration, using existing tools and APIs. The result? Faster delivery, better scalability, and higher user satisfaction. These are the kinds of metrics we need to highlight." "But communicating this to clients requires our account teams to understand it deeply themselves," Carla pointed out. "We need to train them too – help them articulate the value proposition of these new hybrid skills." Discussion Questions Revising the Screening Process What specific interview or assessment formats could GloboTech introduce to better evaluate a consultant's capacity to leverage AI tools and think strategically, rather than just coding? How can the firm still assess coding fundamentals while placing greater weight on problem-solving and communication? Addressing Cheating and AI Misuse What proctoring tools or strategies can discourage AI-assisted cheating without undermining candidate privacy or trust? How can GloboTech position these measures so they feel fair and transparent? Talent Supply and Demand With a large pool of applicants possessing basic coding skills, how can GloboTech raise its hiring bar to find top-tier "AI conductors" without pushing away excellent coders? What ongoing training and development might be needed to keep new hires aligned with evolving AI needs? Market Positioning How should GloboTech differentiate itself from both traditional IT services competitors and pure technology companies? What messaging would attract candidates with the right blend of technical and strategic capabilities? Strategic Timing In deciding whether to be a "first mover," which risks and rewards should GloboTech weigh most heavily? How might GloboTech mitigate potential downsides if it chooses to be a thought leader in AI-driven hiring? Epilogue: The Decision Awaits By the meeting's conclusion, the leadership team at GloboTech Services was left with more questions than answers. 
The urgency was evident: AI had reshaped the consulting landscape, reducing the emphasis on raw coding prowess and highlighting the need for "AI orchestration" skills like communication and strategic thinking. As the group filed out of the conference room, Sarah lingered, still processing the partner's text and the implications of intrusive proctoring. She wondered how a company culture so steeped in coding pride would adapt to the idea of "conductors" over "musicians." In a fast-changing industry, which path—gradual or bold—would ensure success?
authors:
Rohit Aggarwal

Article
The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications: Exercise 1: Simple RAG for 10-K filings Code with Explanation is posted here: Colab Notebook Link Exercise 2: RAG with Reranker for 10-K filings Exercise 3: RAG with Query Decomposition & Tracing with Langsmith/Langfuse Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning and Action) Exercise 5: RAG with Agentic Pattern: ReAct + Reflection These exercises incrementally build on basic RAG with a focus on “why” before “what” and “how”. This first tutorial focuses on developing a basic end-to-end RAG pipeline. It is divided into three parts to provide a comprehensive understanding of building a simple RAG system for 10-K filings: RAG Fundamentals There are two key processes for RAG: a) Pre-processing Data for RAG Downloads 10-K filings, splits them into chunks, generates chunk embeddings, and stores the chunks in a vector database for later retrieval b) User query processing Generates user query embeddings, retrieves chunks similar to the query embedding, and creates prompts for models to generate a response. High-Level Overview of Underlying Models We strongly encourage readers to go through RAG fundamentals before diving into the code. RAG Fundamentals 10-K SEC filings are comprehensive annual reports that provide an in-depth overview of a publicly traded company's operations, financial performance, and risks. These documents are essential for investors, analysts, and regulators, offering insights into business strategies, legal issues, financial health, and future outlook. 
However, their length and complexity often make extracting specific information time-consuming and challenging, especially when dealing with multiple filings across different companies. RAG systems address these challenges by combining traditional retrieval methods with the generative capabilities of large language models (LLMs). By structuring and embedding text from 10-K filings into a searchable database, RAG systems can quickly retrieve and synthesize relevant information, enabling users to answer complex queries efficiently. In this exercise, we will work with 10-K filings from companies like Tesla and GM, utilizing their SEC-hosted webpages as data sources. Pre-processing Data for RAG (Retrieval-Augmented Generation) Pre-processing text for RAG systems (e.g., company policy documents, emails, website content, and reports) involves key steps to prepare and organize data for efficient querying and retrieval. The primary steps include chunking, embedding generation, and vector database integration. Here’s a breakdown: Chunking Chunking is the process of breaking down large texts into smaller, manageable pieces that are easier to process and retrieve. In knowledge bases with lengthy documents, breaking them into smaller chunks enables RAG models to query and retrieve only the most relevant sections for user queries. This targeted retrieval promotes contextually coherent responses while reducing off-topic content and conserving computational resources, making the process more efficient and scalable. A key consideration in chunking is determining the appropriate chunk size to balance context preservation and semantic specificity. Semantic specificity refers to how distinctly and unambiguously a text conveys an idea. Larger chunks excel at maintaining discussion context and keeping related ideas together, which helps models understand references and pronouns. 
This is particularly valuable for tasks like document summarization or question answering that require comprehensive topic understanding. However, larger chunks can encompass multiple themes, potentially diluting the semantic focus of their embeddings and leading to less precise retrievals when queries target specific aspects. Conversely, smaller chunks typically focus on single ideas, generating highly focused and semantically rich embeddings that can be matched precisely with specific queries. The drawback is a potential loss of broader context, where important background information or pronoun references might fall outside the chunk's scope. This can result in retrieved chunks that, while semantically relevant, may miss crucial context for coherent responses. The optimal chunk size depends on the specific application requirements and often involves experimentation. To address the risk of splitting important information across chunks, an overlapping-sentences approach is often used. This involves adding a portion of the end of one chunk to the beginning of the next, helping preserve the context and semantic integrity of ideas that span chunk boundaries. This helps the model maintain a better understanding of the text as a whole, improving information continuity before moving into the vectorization phase of the RAG data pre-processing pipeline.
Generating Chunk Embeddings using Embedding Model
Think of embeddings as a way to translate text into a sequence of numbers that computers can understand and compare. When you convert text into embeddings (also referred to as vectors), you're essentially creating a numerical "fingerprint" that captures the meaning of that text. In a RAG (Retrieval-Augmented Generation) system, embeddings serve three key functions: They convert chunks of your company's documents (manuals, reports, policies) into these numerical fingerprints. They similarly convert user questions into numerical fingerprints.
They allow rapid searching by comparing these fingerprints to find relevant matching chunks. Let's say an attorney has a new case about a contract dispute where a software company failed to deliver custom AI features they promised to build for a client. The attorney has this case summary: “Contract dispute: Client paid $2M for custom AI software development. Contract specified 6-month delivery. Vendor delivered incomplete features after 8 months, failing to meet specifications. Client seeking damages.” When this query is converted to an embedding, it captures key legal concepts like breach of contract, delayed delivery, and incomplete work. The system compares this numerical pattern against thousands of past cases' embeddings to find similar precedents. Precisely speaking, the system compares it against embeddings of chunks from past legal cases and finds chunks with similar numerical patterns about breach of software development contracts, delayed project deliveries, and incomplete or non-conforming deliverables. By comparing embeddings of chunks rather than entire cases, attorneys can quickly pinpoint not only the precedent cases but also the most relevant sections in these cases. This helps attorneys rapidly identify relevant precedents without reading through thousands of unrelated cases. Storing Chunk Embeddings in Vector Database After generating embeddings for text chunks, storing them effectively becomes crucial for a RAG system's performance. While traditional relational databases are excellent for structured data, they face significant challenges when handling embeddings due to their high-dimensional nature. For context, embeddings generated from BERT-Base models produce vectors containing 768 numbers, while BERT-Large models create even larger vectors with 1024 elements. Traditional databases simply weren't designed to efficiently manage and query data with such high dimensionality. 
This is where vector databases come into play, offering a specialized solution designed specifically for handling these high-dimensional vectors. These databases implement sophisticated indexing techniques that allow for rapid similarity searches, making them particularly well-suited for RAG applications. When a user submits a query, the system needs to quickly identify and retrieve the most semantically similar chunks from potentially millions of stored embeddings. Vector databases excel at this task, providing the necessary infrastructure for swift and accurate information retrieval that would be impractical or impossible with traditional database systems. Popular vector database solutions include FAISS and Pinecone, which are specifically optimized for storing and querying these high-dimensional embeddings. These databases implement efficient similarity search mechanisms, typically using cosine similarity measures, enabling them to rapidly identify and retrieve the most relevant chunks of information in response to user queries. This capability is essential for maintaining the responsiveness and effectiveness of RAG systems, particularly when dealing with large-scale knowledge bases. Handling User's Query After preprocessing data and setting up the vector database infrastructure, the RAG system needs to handle real-time user queries effectively. This process happens in four key stages: query vectorization, vector database retrieval, prompt creation, and response generation. Generating Query Embeddings using Embedding model First, query vectorization converts incoming user questions or requests into the same type of numerical representations (embeddings) used for the stored knowledge base chunks. This step is crucial and must use the exact same embedding model that was employed during the preprocessing phase. 
For instance, if BERT-Base was used to generate the 768-dimensional vectors for your stored chunks, the same model must be used for converting user queries into embeddings. This consistency ensures that both the stored chunks and user queries exist in the same semantic space, making similarity comparisons meaningful and accurate. Using different embedding models for queries versus stored chunks would be like trying to compare distances between points on two different maps with different scales: the results would be unreliable.
Retrieving Relevant Chunks using Vector Database
Once the query has been converted into an embedding, the vector database performs a similarity search to find the most relevant chunks from the knowledge base. This search typically employs cosine similarity or other distance metrics to identify stored vectors that are closest to the query vector in the high-dimensional space. Modern vector databases can execute these similarity searches extremely efficiently, even across millions of chunks. The system then retrieves the original text chunks corresponding to the most similar vectors, providing the contextually relevant information needed for the RAG model to generate its response.
Creating Effective Prompts with Retrieved Context
After retrieving the most relevant chunks, the next crucial step is constructing an effective prompt that helps the language model generate accurate and contextually appropriate responses. This process requires careful consideration of how to structure and combine the retrieved information with the user's query. The basic structure of a RAG prompt typically consists of three main components: instructions for the model, the retrieved context, and the user's query.
Think of this like preparing a subject matter expert for a consultation: you first explain how they should approach the task (instructions), provide them with relevant reference materials (retrieved context), and then present the specific question they need to address (user's query). Consider this approach:
Give an answer for the `question` using only the given `context`. Use only the provided `context` to answer the `question`. If the information needed isn't in the `context`, acknowledge this limitation rather than making assumptions. Provide a detailed answer with thorough explanations, avoiding summaries.
question: {question}
context: {context}
Answer:
The instructions at the top set the foundation for how the model should process and utilize the retrieved information. This helps keep the model grounded in the retrieved information rather than hallucinating or drawing from its pre-trained knowledge. The context section would typically join the ranked chunks with blank lines (\n\n) before inserting them into the prompt template. This preserves the ranking while creating a readable and processable format for the language model.
Response Generation
After generating the prompt with its carefully structured components, the RAG system passes this combined input to a Large Language Model (LLM) for response generation. The LLM processes the instructions, context (retrieved chunks), and user query together to produce a coherent, contextually appropriate response that addresses the user's needs. The LLM leverages the context to ground its responses rather than relying solely on its pre-trained knowledge. This approach significantly reduces hallucination risks since the model is explicitly instructed to base its response on the provided context. If the retrieved context lacks sufficient information to fully address the query, the model is expected to acknowledge these limitations instead of making unsupported claims.
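The prompt-construction step described above (instructions, then the user's question, then the ranked chunks joined by blank lines) can be sketched in a few lines of Python. This is a minimal illustration rather than the notebook's actual code: the function name `build_rag_prompt` and the sample chunks are hypothetical, while the template text is the one quoted in this article.

```python
from typing import Sequence

# Template text taken from the article; the variable name is illustrative.
PROMPT_TEMPLATE = (
    "Give an answer for the `question` using only the given `context`. "
    "Use only the provided `context` to answer the `question`. "
    "If the information needed isn't in the `context`, acknowledge this "
    "limitation rather than making assumptions. Provide a detailed answer "
    "with thorough explanations, avoiding summaries.\n"
    "question: {question}\n"
    "context: {context}\n"
    "Answer:"
)


def build_rag_prompt(question: str, ranked_chunks: Sequence[str]) -> str:
    """Fill the template, keeping chunks in their retrieved (ranked) order."""
    # Joining with a blank line preserves the ranking and keeps chunks separable.
    context = "\n\n".join(ranked_chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)


# Hypothetical chunks standing in for text retrieved from a 10-K filing.
example_prompt = build_rag_prompt(
    "What supply chain risks does the company describe?",
    ["Chunk ranked first ...", "Chunk ranked second ..."],
)
print(example_prompt)
```

The resulting string is what would be sent to the response-generation model; swapping the template wording is then a one-line change that does not touch the retrieval code.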
The effectiveness of response generation heavily depends on the quality of the prompt engineering discussed earlier. Depending on the requirements, the response from the LLM can be further customized or refined based on additional criteria, such as tone, style, or specific user preferences. Note: The implementation of robust guardrails is crucial when deploying LLMs in RAG systems to ensure responsible and reliable output. A comprehensive validation system should verify that the model's responses strictly align with the provided context, preventing both subtle and obvious forms of hallucination. Additional checks should evaluate responses for potential biases and ethical concerns, including screening for harmful content, discriminatory language, or inappropriate recommendations. These guardrails should also ensure compliance with company policies, regulatory requirements, and societal norms while maintaining appropriate tone and professionalism. The system should be designed to either automatically modify responses that don't meet these criteria or flag them for human review, ensuring a balance between accurate information delivery and responsible AI behavior. Technical Details Embedding model Embedding models are specialized versions of encoder architectures (like BERT) that are fine-tuned specifically to create meaningful vectors (sequence of numbers) of entire sentences or passages, rather than individual words or tokens. Base encoder models create contextual representations at the token level - meaning each word is represented by a vector that depends on surrounding words. However, they aren't trained to directly optimize for sentence-level similarity. In contrast, embedding models are explicitly trained on sentence-pair tasks using contrastive learning. 
During training, they learn to generate sentence vectors that:
Place similar sentences close together in vector space
Push dissimilar sentences far apart
Capture high-level semantic relationships rather than just word-level patterns
This targeted training makes them much better at tasks requiring sentence-level semantic understanding, like finding similar legal cases or matching questions to relevant documents.
Note: The terms vectors, embeddings, and representations are often used interchangeably, and all refer to sequences of numbers that represent data in a machine-readable format. In large language models (LLMs):
Tokenization: Input text is first broken down into smaller units called tokens. The process maps the text to elements of a predefined vocabulary or dictionary. Since the vocabulary may not contain every possible word, tokenization handles out-of-vocabulary (OOV) words by breaking them into subwords, characters, or other smaller components, depending on the tokenization strategy used.
Token Embeddings: Each token is then converted into a numerical vector (embedding). At this stage, these embeddings are static, meaning they do not depend on the context provided by surrounding tokens.
Contextualized Embeddings: These are embeddings generated after processing token embeddings through the layers of the transformer model. Unlike static embeddings, contextualized embeddings reflect the meaning of each token based on its surrounding tokens in the input sequence. For example, in the phrases "sits by a river bank" and "went to a bank to deposit a check," the word "bank" has different meanings. Contextualized embeddings capture these differences by producing distinct representations for the word "bank" in each context.
The choice of embedding model can significantly impact the quality of your vectors and retrieval effectiveness. Since new embedding models come out on a regular basis, you can select an appropriate model from the MTEB leaderboard.
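To make the role of the embedding model concrete, the toy pipeline below walks through the steps described in this tutorial: overlapping chunking, embedding chunks and query with the same function, and ranking chunks by cosine similarity. It is only a sketch under loud assumptions: `toy_embed` is a word-count stand-in, not a trained embedding model, and every function name and the sample text are hypothetical. A real system would replace `toy_embed` with a model chosen from the leaderboard and the list scan with a vector database.

```python
import math
import re
from collections import Counter


def chunk_sentences(text, chunk_size=3, overlap=1):
    """Split text into chunks of `chunk_size` sentences; each chunk repeats
    the last `overlap` sentences of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # the last sentence is already covered
    return chunks


def toy_embed(text):
    """Stand-in embedding: a sparse word-count vector (NOT a trained model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Embed the query with the same function used for the chunks and
    return the `top_k` (score, chunk) pairs, best first."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up text standing in for a filing excerpt.
sample = (
    "Tesla designs electric vehicles. Revenue grew on higher vehicle deliveries. "
    "Supply chain disruptions remain a key risk. Battery costs affect margins. "
    "The company also sells energy storage products."
)
for score, chunk in retrieve("supply chain risk", chunk_sentences(sample, 2, 1)):
    print(round(score, 3), chunk)
```

Because word counts only match exact words, this stand-in misses paraphrases that a trained embedding model would catch, which is the gap the contrastive training described above is meant to close.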
Response Generation Model
The Large Language Models (LLMs) used for response generation in RAG systems are primarily based on decoder architectures, exemplified by models like ChatGPT, Claude, Llama, and Qwen. These decoder models operate fundamentally differently from the encoder-based models used in embedding generation and reranking stages. Their core objective is next-token prediction, where the model can only see and process tokens that come before the current position, unlike encoder models which have full visibility of the entire input sequence. This architectural constraint creates a more challenging training task, as the model must learn to generate coherent and contextually appropriate text without access to any future context. This limitation actually drives these models to develop stronger reasoning capabilities and deeper understanding of language patterns, as they must make predictions based solely on previous context. A crucial development stage for decoder models is instruction tuning, which enables them to understand and follow specific directives in prompts. Without this specialized training, these models would simply continue the pattern of text generation rather than providing appropriate responses to instructions. For example, when presented with a prompt like "How are you?", a base model might simply complete the phrase with "doing today", while an instruction-tuned model would recognize the question format and respond appropriately with something like "I'm fine, thank you. How about yourself?" This capability is essential for RAG systems where the model needs to interpret prompts that combine retrieved context with specific instructions about how to use that information. The complexity of the text generation task necessitates significantly larger model architectures compared to embedding and reranking models. These decoder models typically employ many more parameters and layers to support their advanced reasoning capabilities.
The scale difference is substantial - while embedding models might operate with hundreds of millions of parameters, modern decoder models often contain hundreds of billions of parameters. This massive scale translates directly to computational costs, with training expenses often reaching hundreds of millions of dollars. As a result, most organizations opt to access these capabilities through APIs provided by services like ChatGPT and Claude, or leverage open-weight models such as the 405-billion parameter Llama hosted on platforms like Together.ai, rather than training their own models from scratch. The combination of this complex architecture, instruction tuning, and massive scale enables decoder models to perform the sophisticated task of synthesizing information from retrieved context into coherent, relevant responses. In a RAG system, this manifests as the ability to not just understand the retrieved chunks and user query, but to reason about their relationships and generate new text that effectively addresses the user's needs while remaining grounded in the provided context.
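The visibility difference between decoder and encoder models described in this section can be illustrated with a causal mask. The sketch below is a simplified illustration in plain Python, not how any particular model is implemented; the function names are hypothetical.

```python
def causal_mask(seq_len):
    """Row i marks which positions token i may attend to (1) or not (0).
    A decoder uses this lower-triangular pattern: no access to later tokens."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def visible_context(tokens, position, causal=True):
    """Tokens available when processing `position`.
    causal=True mimics a decoder; causal=False mimics an encoder."""
    if causal:
        return list(tokens[: position + 1])
    return list(tokens)


tokens = ["How", "are", "you", "?"]
for row in causal_mask(len(tokens)):
    print(row)

# At position 1 a decoder sees only the tokens so far; an encoder sees all of them.
print(visible_context(tokens, 1, causal=True))
print(visible_context(tokens, 1, causal=False))
```

During generation the decoder predicts the token at position i + 1 from row i of this mask, which is why each prediction rests solely on the preceding context.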
17 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
The rise of Artificial Intelligence (AI) is often met with polarized views, with one side heralding its potential to revolutionize the world, and the other warning against its perceived threat to professionals and humanity as a whole. However, there’s a different and far more nuanced concern—one that isn't about AI replacing humans but about AI turning people, particularly new learners, into “mindless zombies.” This phenomenon—let's call it the "AI Zombocalypse"—is characterized by professionals becoming overly reliant on AI tools, ultimately losing their critical thinking and problem-solving abilities. While it may sound hyperbolic, this trend is not just an abstract possibility but a present danger, particularly for those just starting their careers. They are at risk of developing shallow, unstructured thinking patterns that lack the depth, creativity, and analytical rigor necessary to solve complex problems. This article explores how AI-induced mindlessness is a greater threat than AI itself, and how the current generation of learners is uniquely vulnerable to this issue. The Alluring Power of AI and the Danger Beneath AI tools are incredibly effective in getting things done quickly, which creates a sense of exhilaration, especially for those who are new to a field. It provides them with results that look polished on the surface and offer an illusion of completeness. But there's often a catch: when you start to dig deeper into these AI-generated results, you frequently find repetition of the same ideas in different forms, a lack of originality, or a vacuousness that becomes apparent upon closer inspection. Essentially, AI can deliver quantity at the expense of quality, leading to content that may look good on paper but fails to hold water upon critical evaluation. 
This allure of quick, seemingly accurate solutions is akin to a drug—an instant gratification that is hard to resist, especially for new learners who are keen to make an impression or solve a problem quickly. However, just as a drug masks the underlying issues rather than solving them, AI can obscure the learner's understanding, often bypassing essential skills in critical thinking, debugging, and problem decomposition. Evidence of the Problem: Uplevel's Findings The issue of blind AI reliance is supported by real-world data. A study conducted by Uplevel examined about 800 developers over three months using GitHub Copilot, an AI-powered coding assistant by Microsoft. The results were stark: there were "no significant improvements for developers" using Copilot compared to the previous three months without it, and in fact, 41% more bugs were introduced when using AI assistance. This indicates that the AI-generated code was not only less effective but potentially harmful to code quality. New learners are particularly prone to these pitfalls, as they may lack the ability to properly vet AI-generated solutions and instead blindly accept them. This reinforces the point that, far from enhancing developer productivity, AI can actually hinder the development of critical coding and debugging skills, which are essential for quality work. The Disappearance of Debugging Skills and the “Streetlight Effect” Debugging is a skill that separates a good programmer or problem-solver from a mediocre one. It requires systematically breaking down a problem, placing breakpoints, adding logging, and continuously analyzing the state of the system to understand what's going wrong. However, the rise of AI-assisted development tools is eroding this foundational skill. Instead of trying to understand the issue and experiment with possible solutions, learners are too quick to turn to AI for an answer. 
In this context, we often see the "streetlight effect," where learners act like the proverbial drunkard who searches for their keys only where there is light, not necessarily where they dropped them. They focus on where the AI's solution shines—regardless of whether it’s the right area to focus on. The AI provides a suggestion, and instead of critically evaluating it, they blindly implement it, often without truly understanding the underlying problem or even the solution. This type of behavior discourages deep, analytical thinking and stunts their problem-solving growth. A Symptom of a Broader Problem The erosion of debugging skills is not just about software; it reflects a broader loss of critical thinking that will affect every field as AI tools become ubiquitous. The human role in a world dominated by AI will shift from doing the work to guiding AI when it makes mistakes. This guiding role requires strong analytical skills to track state, validate solutions, and detect errors—skills that are being dulled by over-reliance on AI for immediate answers. The issue is not limited to debugging but represents a deeper problem: losing the ability to critically analyze, question, and break down complex issues. The Double Whammy for New Learners New learners are facing a perfect storm: on one hand, they are struggling to find jobs in a post-COVID world where companies are adjusting their expectations, downsizing, and assuming AI will bring productivity gains. Tools like Cursor, Replit Agent, Devin, and All Hands are reducing the need for large, entry-level engineering teams by automating many programming and administrative tasks. On the other hand, the very skills that new learners need to stand out—critical thinking, complex problem-solving, and the ability to debug effectively—are being eroded by their dependence on AI. Rather than developing mental models to decompose complex problems into manageable subtasks, they lean on AI to do the heavy lifting. 
AI's involvement can be particularly insidious because, unlike traditional learning, it does not encourage a systematic approach to problem-solving. It hands over pre-packaged solutions that make sense on a superficial level but fail to build the cognitive pathways necessary for long-term understanding. In a sense, AI is like Gollum's "my precious" from The Lord of the Rings, offering a shortcut that feels empowering but ultimately leads to an addiction that diminishes the user's abilities and critical thinking. The "Idiocracy" Parallel The 2006 satirical film Idiocracy foresaw a world where society’s intellectual rigor had been dulled to an extreme degree, leaving humans incapable of critical thought and complex problem-solving. Eerily, this future seems to be materializing faster than we anticipated, particularly as AI tools make it easier for people to bypass thinking for themselves. Just as Idiocracy predicted the rise of the popular Crocs footwear (which did indeed happen), it also anticipated a world where intellectual complacency would become the norm—thanks to technology, and now, AI. What Needs to Change: A Call for Cognitive Resilience It's clear that AI is here to stay, and its benefits are undeniable. But we must address how AI is affecting new learners and professionals before it becomes too late. To avoid an AI Zombocalypse, learners need to be taught not just how to use AI, but how to use it responsibly and critically. This includes: 1. Encouraging Debugging as a Core Skill Developers must learn to debug effectively, which involves breaking down problems, questioning assumptions, and methodically testing hypotheses. Simply pasting in AI solutions without understanding their implications is counterproductive. 2. Promoting Deep Problem-Solving Over Superficial Solutions AI can often offer quick fixes, but educators and mentors need to stress the importance of deeply understanding the problems at hand. 
Learners should be encouraged to decompose problems into smaller, manageable tasks and to critically analyze AI suggestions before implementing them. 3. Fostering a Healthy Skepticism Toward AI Solutions Learners should be trained to view AI as a tool—not an infallible oracle. It’s crucial to cross-check AI-generated suggestions against one’s understanding of the problem, and to not simply accept AI's word as gospel. 4. Building Resilience and Self-Reliance in Learning New learners should be encouraged to struggle and learn from their struggles. Over-reliance on AI shortcuts hampers the development of the problem-solving tenacity that is crucial in the long run. Conclusion The threat posed by AI is not its power to replace humans but its ability to make humans complacent, uncritical, and reliant on easy solutions. The real danger of AI is the rise of "AI zombies"—professionals and learners who have lost their cognitive edge, unable to think critically or solve problems without AI’s hand-holding. As technology continues to advance, our educational systems and professional development practices must adapt to emphasize critical thinking, deep problem-solving, and debugging skills that resist the allure of AI’s quick fixes. The future will belong to those who use AI thoughtfully, critically, and responsibly—not to those who let AI think for them.
5 min read
authors:
Rohit Aggarwal
Harpreet Singh

Article
As large language models (LLMs) increasingly become central to various applications, the need for robust tools to monitor, evaluate, and optimize these models is more important than ever. Two standout platforms that have emerged in this landscape are Opik and LangSmith. Both platforms offer powerful features for developing and managing LLM applications, yet they cater to distinct needs and workflows. In this blog, we’ll dive into a comprehensive comparison of Opik and LangSmith, examining their key features, strengths, and weaknesses. My recent experiments with both tools, focused on classifying emotions in Twitter data, provided valuable insights, particularly in terms of usability. I conducted two primary experiments: one centered on prompt refinement and the other on model comparison. Through these experiences, I aimed to highlight ease of use as a critical factor in choosing the right platform for your LLM projects.
Overview of Opik
Opik is an advanced, open-source platform designed for logging, viewing, and evaluating large language model (LLM) traces throughout both development and production stages. Its primary objective is to empower developers with detailed insights to debug, evaluate, and optimize LLM applications effectively. Opik also provides an SDK for direct use: you can simply set up an account and start using it.
Key Features of Opik:
Self-Hosting Options: Opik offers flexible deployment options for both local and production environments. It supports local deployment via Docker Compose and scalable deployments using Kubernetes, making it adaptable for different scales of use.
Comprehensive Tracing: Opik enables comprehensive logging and trace viewing, allowing developers to annotate traces and track LLM behavior in both local and distributed environments. This gives greater visibility into model performance and helps identify issues quickly during both development and production phases.
Integrated Evaluation Tools: Opik provides a set of built-in evaluation metrics, including heuristic performance measures and relevance assessments. It also supports metrics for detecting hallucinations and moderating content, and users can define custom metrics based on specific application needs.
Testing Frameworks: Opik integrates with Pytest, providing developers with a framework to thoroughly test their LLM applications. This helps ensure that models are rigorously evaluated before deployment.
Integration: Opik simplifies logging, viewing, and evaluating LLM traces with a robust set of integrations. Key integrations include:
OpenAI: Log all OpenAI LLM calls for easy tracking.
LangChain: Capture logs from LangChain interactions.
LlamaIndex: Monitor LlamaIndex LLM performance.
Ollama: Integrate logging for Ollama LLMs.
Predibase: Fine-tune and serve open-source LLMs while logging their usage.
Ragas: Evaluate Retrieval-Augmented Generation (RAG) pipelines effectively.
Overall, Opik’s rich set of tools and integrations makes it a powerful asset for developers working with LLMs, offering end-to-end support for debugging, optimizing, and scaling LLM applications. You can access a comprehensive exploration of Opik from this link.
Overview of LangSmith
LangSmith is a comprehensive platform designed to streamline the development, debugging, testing, and monitoring of production-grade LLM (Large Language Model) applications. It bridges the gap between traditional software development processes and the unique challenges posed by LLMs, particularly around handling non-deterministic, complex workflows.
Key Features of LangSmith:
Advanced Tracing Capabilities: LangSmith excels in tracing the performance of LLM applications by providing detailed insights into the sequence of calls and inputs/outputs at each step. It supports code annotations for automatic trace generation, with options to toggle traces on or off depending on needs.
Developers can also control trace sampling rates so that they log only what’s necessary, which is particularly useful in high-volume applications. The platform can trace multimodal interactions (e.g., text and image inputs) and distributed systems, giving a holistic view of an application’s performance.
Dataset Management: LangSmith offers powerful dataset management, allowing developers to create and curate datasets for testing and evaluation. This feature supports few-shot learning experiments, which is essential for optimizing LLM performance. Developers can also organize experiments and results by dataset for better analysis and insights.
Evaluation Metrics: Built-in evaluators enable both automated and manual testing of LLM outputs, supporting various metrics like relevance, accuracy, harmfulness, hallucination, and more. LangSmith’s evaluation tools can assess how changes in prompts or model configurations impact overall performance.
Playground and Prompts: LangSmith includes an interactive playground that allows developers to tweak and experiment with prompts in real time. This environment is user-friendly and removes friction from the iteration process, helping teams rapidly optimize their application’s behavior.
Scalability: Designed for scalability, LangSmith is built on a cloud architecture capable of handling LLM applications at large scales. It supports robust data retention policies, and its monitoring tools help applications run efficiently and cost-effectively, even under heavy use.
Usability: Comparative Experiments
I explored the usability of Opik and LangSmith while classifying emotions in Twitter data, running two main experiments: one focused on prompt refinement and the other on model comparison. Here’s a breakdown of my findings, emphasizing ease of use rather than performance.
For the prompt refinement experiment, I used the Emotion dataset from Twitter to classify tweets into happiness, sadness, or neutral categories. Both platforms required only an API key and client initialization for setup, which was straightforward. For the model comparison experiment, I applied the best-performing prompt from the first experiment to compare two models: gpt-4o-mini and claude-3-sonnet.

Open-Source Flexibility vs. Closed-Source Stability

Opik:

Open-Source: Opik is an open-source platform, giving developers the freedom to access, modify, and customize the platform’s source code. This flexibility fosters a collaborative environment where developers can contribute to the platform, improve it, and tailor it to their specific project needs.

Customization: The open-source nature allows Opik users to implement unique, project-specific features or adjustments, which is valuable for teams with highly specialized requirements. This community-driven development model also allows the platform to evolve continuously based on user contributions.

Ideal for Developers Seeking Flexibility: For teams or individuals who prefer to have control over their tools and the ability to customize them to their workflow, Opik is well-suited. It offers full transparency and adaptability, letting developers iterate on the platform as they wish.

LangSmith:

Closed-Source: LangSmith, on the other hand, is a proprietary, closed-source platform. While this restricts customization compared to Opik, it offers the advantage of being a more stable and streamlined platform. LangSmith’s closed-source nature means that updates are consistent and cohesive, with dedicated support to maintain the platform’s performance and reliability.

Stability and Support: Being closed-source allows LangSmith to provide a more stable user experience, which is particularly important for enterprise users.
It ensures regular updates, dedicated customer support, and a fully integrated suite of tools that work together.

Ideal for Enterprises Seeking Stability: Enterprises or teams that prioritize stability and dedicated support may prefer LangSmith. The closed-source model can provide peace of mind, knowing that the platform will continue to function reliably with cohesive updates and minimal disruption.

Self-hosting

Opik:

Local Installation: Opik offers a local installation option, which is quick to set up and allows developers to get started immediately. However, this local setup is not intended for production environments, as it lacks the robustness required for large-scale operations. The local installation is suitable for quick testing and experimentation. It operates through a local URL and requires basic configuration of the SDK to interact with the self-hosted instance. This makes it convenient for small-scale or short-term tasks.

Kubernetes Installation: For production-ready deployment, Opik supports installation via Kubernetes. This option allows for scalability and makes all of Opik’s core functionalities, such as tracing and evaluation, accessible in a more stable environment. Despite the production readiness of the Kubernetes setup, Opik lacks certain user management features in its self-hosted mode, which might be a drawback for larger teams needing detailed access control. There is no mention of built-in storage options in Opik’s self-hosted mode, implying that developers may need to set up external storage solutions for data management.

Managed Options: For organizations seeking reduced maintenance, Opik provides managed deployment options through Comet. This allows teams to focus more on development and analysis without worrying about infrastructure maintenance.
LangSmith:

Docker and Kubernetes Support: LangSmith can be self-hosted via Docker or Kubernetes, making it suitable for both controlled cloud environments and large-scale production deployments. This flexibility allows LangSmith to cater to different organizational needs, from small startups to large enterprises.

Componentized Architecture: LangSmith’s architecture is more complex than Opik’s, as it comprises multiple components including the Frontend, Backend, Platform Backend, Playground, and Queue. This setup makes LangSmith highly modular and scalable but also requires more infrastructure management. The need to expose the Frontend for UI and API access adds to the operational complexity.

Storage Bundling: Unlike Opik, LangSmith includes bundled storage services by default, making it easier for teams to get started without needing to configure external storage systems. However, users still have the option to configure external storage systems if their project demands it.

Enterprise Focus: LangSmith is designed with large, security-conscious enterprises in mind. Its multi-component infrastructure is intended to support complex, secure environments. However, this also means that LangSmith may have a higher maintenance overhead compared to simpler platforms like Opik. The increased complexity requires careful configuration and management to keep all components operating smoothly.

Tracing

Opik: Opik offers versatile tracing options, allowing you to log traces to the Comet LLM Evaluation platform via either the REST API or the Opik Python SDK. It supports integrations with a variety of tools, including LangChain, LlamaIndex, Ragas, Ollama, and Predibase, making it a flexible choice for developers looking to track their LLM performance across multiple frameworks.

LangSmith: LangSmith provides tracing support primarily with LangChain, Vercel AI, and LangGraph.
While it may have fewer integrations compared to Opik, LangSmith compensates with more advanced and low-level features for tracing. This can be beneficial for users who require in-depth analysis and customization in their LLM evaluations.

[Figure: Opik Tracing] [Figure: LangSmith Tracing]

As shown, LangSmith allows you to view more detailed information, including input, total tokens used, latency, feedback (i.e., evaluation score), metadata, and more. In contrast, Opik provides limited information, showing only input, output, scores, metadata, and so on.

Here's a detailed comparison of Opik’s tracing and LangSmith’s tracing based on their dashboard visuals:

Similarities:

Tracing and Logging of Inputs/Outputs: Both Opik and LangSmith provide a clear breakdown of the input and output logs for evaluation tasks. Each platform displays detailed information regarding the input prompts and the model-generated outputs, which is essential for understanding the context and accuracy of the LLM response. The platforms also show additional details like feedback scores (Opik) or evaluation metrics (LangSmith), enabling users to assess performance in an organized format.

Structured Presentation: Both dashboards offer a structured format where evaluation tasks are broken down into sections like "Input/Output," "Feedback Scores," and "Metadata." This lets users navigate easily through the various components of the model evaluation.

Status Indicators: Both platforms highlight the success/failure status of each evaluation task. This feature is useful for quickly identifying which tasks were successful and which may need further investigation.

Differences:

Visualization of Trace Details: Opik provides a simpler view of the trace spans, with a focus on essential data such as input and output in a straightforward format. The left panel of the Opik dashboard groups spans hierarchically but is relatively simple. LangSmith, however, offers a more detailed tracing breakdown.
It displays additional technical details like token usage, latency, and trace spans with granular timing (e.g., 0.2s). The dashboard offers richer metadata and breakdowns on a more technical level, making it more suitable for in-depth performance analysis.

Feedback and Evaluation: Opik allows for quick feedback scores and custom metrics within the same pane, which are summarized in the CLI or notebook interface. The evaluation task is shown with simple input/output YAML formatting. LangSmith focuses more on detailed feedback evaluations. It provides more elaborate evaluation results, including a link to the platform dashboard for viewing advanced statistics and data visualizations.

Visual Complexity: LangSmith has a more sophisticated interface with more detailed trace spans and multiple evaluation layers. This visual complexity can provide more powerful insights but may require more effort to navigate. Opik is more minimalist, prioritizing simplicity in its presentation. This could be more user-friendly for developers who prefer a lightweight and efficient interface.

Evaluation

Opik: Opik simplifies the process of defining metrics, allowing users to initialize them and pass them as parameters during evaluation. It supports both heuristic and LLM-based judge metrics, with the added flexibility to create custom metrics tailored to specific needs. This approach makes it accessible for developers looking to assess their LLM applications efficiently. Opik also summarizes results directly in the CLI or notebook, giving access to insights on the fly.

LangSmith: LangSmith requires a more hands-on approach to metric definition. In LangSmith, evaluators are functions that score application performance based on specific examples from your dataset and the outputs generated during execution. Each evaluator returns an EvaluationResult, which includes a key, a score, and a comment.
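As a rough illustration of the evaluator shape described above, the sketch below scores one prediction against its reference label and returns the three fields mentioned (key, score, comment). It is plain Python and deliberately does not import the LangSmith SDK. The function name correct_label, the label field, and the dictionary layout of outputs and reference_outputs are assumptions made for this example, not LangSmith's documented API.

```python
from typing import TypedDict


class EvalResult(TypedDict):
    """Stand-in for the key/score/comment triple an evaluator returns."""
    key: str
    score: float
    comment: str


def correct_label(outputs: dict, reference_outputs: dict) -> EvalResult:
    """Score 1.0 when the predicted emotion matches the expected one."""
    # Normalise case and whitespace so "Happiness" and "happiness" match.
    predicted = str(outputs.get("label", "")).strip().lower()
    expected = str(reference_outputs.get("label", "")).strip().lower()
    matched = predicted == expected
    return {
        "key": "correct_label",
        "score": 1.0 if matched else 0.0,
        "comment": f"predicted={predicted!r}, expected={expected!r}",
    }


# Hypothetical single example: one model output and its reference label.
result = correct_label({"label": "Happiness"}, {"label": "happiness"})
print(result["key"], result["score"])
```

A function like this would still need to be registered with the platform's evaluation runner; that wiring is platform-specific and is not shown here.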
LangSmith provides a link to its dashboard for viewing results, which, while informative, requires navigating away from the immediate workflow.

[Figure: Opik Evaluation] [Figure: LangSmith Evaluation]

Both LangSmith and Opik provide overall metric scores as well as scores for each individual dataset item. In summary, both platforms give evaluation results in a similar way; the main difference lies in the setup of the metrics. In Opik, the setup is straightforward, while in LangSmith, it requires more effort to configure.

Here's a detailed comparison of Opik’s evaluation view and LangSmith’s evaluation view based on their dashboard visuals:

Similarities:

Experiment Tracking: Both Opik and LangSmith provide a clear overview of experiments conducted on datasets. Each experiment is tracked with a unique identifier or name, and the results are logged in a structured manner. They both display the correctness of the evaluation (precision, recall, or label correctness) in a way that allows users to immediately grasp the performance of the model for each dataset item.

Metric Display: Both systems display evaluation metrics for each experiment, such as precision, recall, and other relevant scores. This enables developers to gauge how well a specific model or experiment performed against specific performance indicators.

Dataset Connection: In both systems, experiments are linked to datasets, which allows for context-driven evaluation. This connection between the experiment and dataset lets users quickly refer back to the dataset and see how the model performed against each data point.

Differences:

Visualization of Metrics:

Opik: In the Opik evaluation dashboard, you can see metrics such as context precision and recall displayed prominently at the top of the interface. Each dataset entry is evaluated based on these metrics, and results are presented for each item. The emphasis is on immediate metric visibility for each input/output pair within the dataset.
LangSmith: LangSmith provides an aggregate view of the experiment performance. Instead of breaking down individual metrics per dataset entry, LangSmith focuses on displaying experiment-level metrics such as Correct Label scores across multiple runs. This is useful for a more general performance comparison between different models or experiment configurations over time. Apart from that, you can also view metrics for each dataset entry by clicking on any specific experiment.

Detailed Experiment Comparison:

LangSmith: The LangSmith evaluation dashboard provides an overview of multiple experiments at once, listing them with splits, repetitions, and correctness scores. This allows users to quickly compare how different versions of models or setups have performed relative to one another, which is ideal for tracking improvements or regressions over time.

Opik: The Opik evaluation dashboard focuses on individual metrics for each input. It presents a more fine-grained evaluation, especially when comparing precision and recall for specific inputs. However, it lacks a broad overview of multiple experiments at a glance.

Dataset

Opik: Opik presents a more straightforward view of dataset information, displaying inputs and expected outputs clearly. However, it lacks the advanced visualization capabilities found in LangSmith, which may limit users’ ability to quickly identify trends and insights.

LangSmith: LangSmith offers advanced visualization features that showcase trends and evaluation metrics within the dataset tab. It provides rich support for datasets, allowing users to view experiments conducted on the dataset, perform pairwise experiments, and explore various formats, including key-value pairs, LLM, and chat data. This makes it easier to analyze and understand the dataset’s performance and evaluation.
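The relationship between a dataset, the experiments run on it, and the two levels of scoring discussed above (per item and per experiment) can be sketched in plain Python. The tweets, the experiment names prompt_v1 and prompt_v2, and the predictions below are all made up for illustration, and neither platform's SDK is used.

```python
# Hypothetical dataset: each item pairs an input tweet with its expected label.
dataset = [
    {"input": "Best day ever!", "expected_output": "happiness"},
    {"input": "I miss my old friends.", "expected_output": "sadness"},
    {"input": "The meeting is at 3 pm.", "expected_output": "neutral"},
    {"input": "Finally got the job!", "expected_output": "happiness"},
]

# Hypothetical predictions from two experiments run on the same dataset.
experiments = {
    "prompt_v1": ["happiness", "neutral", "neutral", "happiness"],
    "prompt_v2": ["happiness", "sadness", "neutral", "happiness"],
}


def per_item_scores(predictions: list[str]) -> list[float]:
    """Item-level view: 1.0 where the prediction equals the expected label."""
    return [
        1.0 if pred == item["expected_output"] else 0.0
        for pred, item in zip(predictions, dataset)
    ]


def experiment_score(predictions: list[str]) -> float:
    """Experiment-level view: mean of the item-level scores."""
    scores = per_item_scores(predictions)
    return sum(scores) / len(scores)


# One aggregate number per experiment, keyed by experiment name.
summary = {name: experiment_score(preds) for name, preds in experiments.items()}
for name, score in summary.items():
    print(f"{name}: {score:.2f}  items={per_item_scores(experiments[name])}")
```

The per-item list corresponds to the fine-grained view, while the summary dictionary corresponds to the side-by-side experiment overview.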
[Figure: Opik Dataset] [Figure: LangSmith Dataset]

As shown, LangSmith allows you to see how many experiments were run on a dataset, along with their metric scores and other details. In contrast, Opik only provides information about the dataset and its items.

Here's a detailed comparison of Opik’s dataset and LangSmith’s dataset based on their dashboard visuals:

Similarities:

Sentiment Dataset: Both dashboards display a dataset with inputs and expected outputs. Each dataset item includes both the original input and the expected label.

Dataset Structure: Both platforms show the dataset in a structured table format, where inputs and expected outputs are clearly listed. This provides transparency and consistency in dataset management on both platforms.

Support for Experimentation: Both platforms support running experiments on the datasets. They allow users to test different models or versions of a model and compare performance on these input/output pairs.

Differences:

Visualization:

Opik Dataset: The Opik dataset interface is minimalistic, showing only the input/output pairs. It lacks advanced visualization capabilities, focusing instead on providing clear data entries for developers to reference.

LangSmith Dataset: In contrast, the LangSmith dataset interface provides rich visualizations. For example, it shows a chart of experiments, enabling users to see the results of evaluations over time or across multiple experiments. This provides better analytical tools for users who want to track model performance trends.

Experiment Features:

Opik Dataset: The Opik interface offers simplicity, focusing on basic dataset information and expected outcomes. While it supports dataset-based evaluations, it lacks advanced tools for conducting complex experiments directly from the interface.

LangSmith Dataset: LangSmith offers more advanced options for conducting experiments, such as pairwise experiments and the ability to add evaluators and generate new examples.
It also supports few-shot learning, giving users more flexibility to perform sophisticated analyses on their datasets.

Customization and Flexibility: LangSmith offers more features for interacting with datasets, such as tagging dataset versions, adding new examples, and generating examples. These features make it easier for users to experiment with their datasets and modify them on the go, offering more flexibility and control over data. Opik, on the other hand, is streamlined for straightforward dataset management and lacks these interactive features, focusing on simplicity and clarity for the user.

You can access the code and other exploration details of this comparison from this link.

The table below highlights the functionality supported in Opik vs. LangSmith:

Feature/Functionality | Opik | LangSmith
Open-Source | ✅ | ❌
Self-hosting Support | ✅ | ✅
Dataset | ✅ | ✅
Tracing | ✅ | ✅
Evaluation | ✅ | ✅
Pytest Integration | ✅ | ❌
OpenAI Support | ✅ | ✅
LangChain Support | ✅ | ✅
LlamaIndex Support | ✅ | ❌
Ollama Support | ✅ | ❌
Predibase Support | ✅ | ❌
Ragas Support | ✅ | ❌
LangGraph Cloud Support | ❌ | ✅
Own Prompt Management | ❌ | ❌
Capture Human Feedback | ❌ | ✅
Advanced Monitoring & Automations | ❌ | ✅

Conclusion

Both Opik and LangSmith offer valuable tools for large language model (LLM) application development, but they cater to different user needs and contexts.

Opik is well-suited for developers who appreciate open-source flexibility and a user-friendly setup. Its straightforward metric definition, extensive integrations, and ease of use make it ideal for quick implementations and individual projects. However, it falls short in several areas critical for enterprise use, such as advanced dataset management, sophisticated monitoring, and built-in support for human feedback mechanisms. Opik’s limited tracing capabilities and basic logging features may hinder comprehensive performance analysis and compliance with privacy regulations, which are vital in larger team environments.
LangSmith, in contrast, excels in enterprise settings where stability, scalability, and comprehensive monitoring are essential. Its advanced tracing capabilities, rich dataset management, and detailed visualization features facilitate deeper analysis and collaboration among stakeholders. Its tracing options include the ability to log images and manage sensitive data. Its built-in automation tools allow teams to respond proactively to issues, a necessity in high-stakes production settings. The closed-source model of LangSmith streamlines updates and support, allowing teams to focus on development rather than maintenance. These features are crucial for organizations aiming to deploy production-grade applications effectively.

For AI researchers and engineers working on personal projects, Opik offers a flexible and accessible environment for experimentation and learning. Its open-source nature allows for customization without the constraints of a closed-source system. Conversely, AI engineers in enterprise environments will benefit from LangSmith’s features tailored for production, including stability, extensive support, and advanced monitoring capabilities.

In conclusion, the choice between Opik and LangSmith depends on the specific context of the user. Opik is a great fit for individuals and small teams focused on exploration, while LangSmith is the preferred option for organizations aiming to build scalable, production-ready applications. Aligning your toolset with your project requirements and long-term goals is essential for success in the evolving landscape of AI development.
6 min read
authors:
Sumit Mishra
Rohit Aggarwal
Harpreet Singh
