The article discusses the author's approach to structuring data pipelines by integrating the medallion architecture, Kimball dimensional modeling, and semantic layers. It emphasizes the importance of defining clear roles and outputs for each layer—Bronze, Silver, and Gold—to cater to different user needs. The author argues for making the semantic layer a first-class priority in data architecture, highlighting its role in providing governed metrics for self-service analytics. The article concludes with a concrete example of how marketing attribution data flows through this architecture.
The article discusses the evolution of generative AI applications into agentic AI systems at Amazon, highlighting the need for a comprehensive evaluation framework. It emphasizes the importance of assessing not just individual model performance but also the emergent behaviors of the entire system. The authors present a detailed evaluation methodology that includes automated workflows and a library of metrics tailored for agentic AI applications. Best practices and lessons learned from real-world implementations are shared to guide developers in evaluating and deploying these complex systems effectively.
The article discusses how OpenAI has successfully scaled PostgreSQL to handle the demands of 800 million ChatGPT users, achieving millions of queries per second. It outlines the challenges faced during high write traffic, the optimizations implemented, and the architectural decisions made to maintain performance and reliability. Key strategies include offloading read traffic, optimizing queries, and managing workloads to prevent service degradation. The article also highlights the importance of connection pooling and caching to enhance database efficiency.
The article discusses the importance of memory in AI agents, particularly how it enables them to learn from past interactions and improve their performance over time. It categorizes memory into three types: session memory, user memory, and learned memory, each with distinct characteristics and benefits. The author provides code examples for implementing these memory types in agents, emphasizing the significance of learned memory for enhancing agent capabilities. The article concludes with a discussion of what makes a learning worth keeping and the need for human oversight in the learning process.
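The three memory types described above can be sketched as simple stores. This is a minimal illustration, not the article's actual code; the class and method names are assumptions, and the human-approval gate stands in for the oversight step the author recommends.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Toy illustration of session, user, and learned memory."""
    session: list = field(default_factory=list)   # session memory: turns in the current conversation
    user: dict = field(default_factory=dict)      # user memory: stable facts about this user
    learned: list = field(default_factory=list)   # learned memory: lessons distilled from past runs

    def remember_turn(self, role: str, text: str) -> None:
        self.session.append({"role": role, "text": text})

    def remember_user_fact(self, key: str, value: str) -> None:
        self.user[key] = value

    def add_learning(self, lesson: str, approved: bool) -> None:
        # Human oversight: only keep lessons a reviewer has approved.
        if approved:
            self.learned.append(lesson)

    def build_context(self) -> str:
        """Assemble a prompt prefix from all three memory types."""
        facts = "; ".join(f"{k}={v}" for k, v in self.user.items())
        lessons = "; ".join(self.learned)
        turns = " | ".join(f"{t['role']}: {t['text']}" for t in self.session)
        return f"User facts: {facts}\nLearnings: {lessons}\nSession: {turns}"


memory = AgentMemory()
memory.remember_user_fact("timezone", "UTC+1")
memory.add_learning("Always confirm dates before booking.", approved=True)
memory.add_learning("Skip validation to save tokens.", approved=False)  # rejected by reviewer
memory.remember_turn("user", "Book my usual flight.")
```

The point of the separation is lifetime: session memory is discarded per conversation, user memory persists per user, and learned memory persists across all users of the agent.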
This article provides a comprehensive guide on building a conversational agent in BigQuery using the Conversational Analytics API. It outlines the steps to configure the agent, create conversations, and manage interactions with users. The API enables users to query BigQuery data using natural language, facilitating real-time insights and dynamic reporting. The article emphasizes the importance of clear system instructions and schema descriptions to enhance the agent's effectiveness.
The article discusses the complexities of evaluating AI agents, emphasizing the importance of rigorous evaluations (evals) throughout the agent lifecycle. It outlines various evaluation structures, types of graders, and the significance of early and continuous eval development. The piece highlights the challenges faced by teams without evals, which can lead to reactive development cycles. It also provides insights into different agent types and their evaluation techniques, ultimately advocating for a systematic approach to agent evaluation to enhance performance and reliability.
The article discusses the evolution of data mesh from a concept filled with hype to a mature socio-technical paradigm by 2026. It emphasizes the challenges organizations face in implementing data mesh, particularly in changing organizational behaviors and aligning data initiatives with business strategies. The authors share insights on the four core principles of data mesh: domain ownership, treating data as a product, self-serve data platforms, and federated computational governance. The article concludes that successful data mesh implementations require a long-term commitment to organizational transformation rather than merely adopting new technologies.
The article discusses the challenges of correlating real-time data with historical data in data analysis, particularly in e-commerce scenarios. It presents an optimized solution using Apache Flink to join streaming order data with historical customer and product information, leveraging Alluxio for caching. The implementation details include using Hive dimension tables and Flink's temporal joins to enhance performance and reduce bottlenecks. The article also addresses state management issues in Flink applications and provides insights into improving data processing efficiency.
The article discusses LinkedIn's transformation of its search technology stack, focusing on the integration of large language models (LLMs) to enhance search experiences. It details the challenges and innovations involved in deploying LLMs at scale, including query understanding, semantic retrieval, and ranking processes. The use of AI-driven job and people search features aims to provide more relevant and personalized results. Additionally, the article highlights the importance of continuous relevance measurement and quality evaluation in maintaining a high-quality search experience.
This article discusses Pinterest's evolution from basic Text-to-SQL systems to a sophisticated Analytics Agent that leverages unified context-intent embeddings for improved SQL generation and table discovery. The system addresses the challenges of understanding analytical intent and provides a structured approach to data governance and documentation. By encoding historical query patterns and utilizing AI-generated documentation, the agent enhances the efficiency and reliability of data analytics at Pinterest. The article outlines the architecture and operational principles behind the agent's design, emphasizing the importance of context and governance in AI-driven analytics.
The article discusses the introduction of catalog-managed tables in Delta Lake 4.1.0, which shift the management of table access and metadata from the filesystem to a catalog-centric model. This change aims to simplify table discovery, enhance governance, and improve performance by allowing clients to reference tables by name rather than by path. The article also highlights the challenges faced with filesystem-managed tables and how catalog-managed tables address these issues, paving the way for a more interoperable and efficient data ecosystem.
The article discusses the architecture and challenges of LinkedIn's job ingestion system, which processes millions of job postings daily from diverse sources. It highlights the importance of reliability, scalability, and extensibility in handling heterogeneous job data feeds. The system employs a modular, event-driven pipeline that includes job intake and processing stages, utilizing various methods for data extraction and transformation. The article emphasizes the need for robust security protocols and maintaining data quality to ensure a trustworthy job catalog for users.
The article discusses the necessity of operational memory in AI agents, particularly in high-stakes environments where reliability is crucial. It critiques the current reliance on large context windows and highlights the inefficiencies of existing memory architectures. The author proposes a Context File System (CFS) that separates reasoning from execution, allowing agents to build a library of proven procedures. This shift aims to enhance automation and reduce costs in enterprise settings.
The article discusses the capabilities of Amazon OpenSearch Service, focusing on its zero-ETL integrations with various AWS services. It highlights how these integrations simplify data access and analysis by eliminating the need for complex ETL pipelines. The article covers specific integrations with services such as Amazon S3, CloudWatch, DynamoDB, RDS, Aurora, and DocumentDB, detailing their features, benefits, and best practices. Overall, it emphasizes the operational efficiency and innovation acceleration that zero-ETL integrations can provide for real-time analytics and search applications.
The article discusses the challenges organizations face in finding and verifying data across analytics and AI workflows. It introduces Databricks' new Discover experience, which integrates business context and trust into the Unity Catalog, allowing users to find and access trusted data and AI assets more efficiently. The article highlights the importance of domains, intelligent curation, and governed access in facilitating a unified discovery experience that enhances user confidence and reduces bottlenecks in data access.
The article discusses Netflix's challenges in managing metrics within its experimentation platform and how DataJunction, an open-source metric platform, addresses these issues. It highlights the importance of a centralized semantic layer for defining metrics and dimensions, which simplifies the onboarding process for data scientists and analytics engineers. The authors detail the architecture and design decisions behind DataJunction, emphasizing its SQL parsing capabilities and integration with existing tools. The article concludes with plans for further integration and unification of analytics at Netflix.
The article introduces the Native Execution Engine for Microsoft Fabric, designed to enhance Apache Spark's performance without requiring code changes. It explains the challenges faced by traditional Spark execution due to increasing data volumes and real-time processing demands. The Native Execution Engine leverages C++ and vectorized execution to optimize Spark workloads, particularly for columnar data formats like Parquet and Delta Lake. The integration of open-source technologies Velox and Apache Gluten is highlighted, showcasing significant performance improvements and cost savings for users.
The article discusses the development and implementation of Spot Balancer, a tool created by Notion in collaboration with AWS, which optimizes the use of Spark on Kubernetes by balancing cost and reliability. It highlights the challenges faced when using Spot Instances for Spark jobs and how Spot Balancer allows for better control over executor placement to prevent job failures. The article outlines the transition from Amazon EMR to EMR on EKS and the benefits of dynamic provisioning and efficient resource management. Ultimately, the tool has helped Notion reduce Spark compute costs by 60-90% without sacrificing reliability.
The article discusses Spotify's innovative multi-agent architecture designed to enhance its advertising platform. By addressing the fragmented decision-making processes across various advertising channels, the architecture aims to unify workflows and optimize campaign management through specialized AI agents. This approach allows for more efficient budget allocation, audience targeting, and overall campaign performance, leveraging historical data and machine learning. The article highlights the importance of a programmable decision layer and the challenges faced in implementing this system.
The article discusses the challenges enterprises face with manual workflows across multiple web applications and introduces AI agent-driven browser automation as a solution. It highlights how AI agents can intelligently navigate complex workflows, reduce manual intervention, and improve operational efficiency. The article provides a detailed example of an e-commerce order management system that utilizes Amazon Bedrock and AI agents for automating order processing across various retailer websites. It emphasizes the importance of human oversight in handling exceptions and maintaining compliance.
The article discusses Vinted's journey in standardizing large-scale decentralized data pipelines as they migrated their data infrastructure to the cloud. Initially, teams operated independently, but as dependencies grew, coordination became challenging. To address this, they developed a DAG generator that abstracts pipeline creation and standardizes dependency interactions, allowing teams to focus on data models rather than orchestration details. This approach improved visibility and reduced operational complexity across decentralized teams.
This article discusses the development of a multi-agent architecture for a natural language to SQL (NL-to-SQL) analytics system. It highlights the limitations of a monolithic MCP-based system and presents a new A2A (agent-to-agent) pipeline that improves stability, scalability, and error isolation. The article details the roles of specialized agents in the pipeline and how they collaborate to convert user queries into accurate SQL. Additionally, it emphasizes the importance of a structured data mart for efficient analytics execution.
This article presents a solution for automating business reporting using generative AI and Amazon Bedrock. It highlights the inefficiencies of traditional reporting processes and introduces a serverless architecture that leverages AWS services to streamline report writing and enhance internal communication. The solution includes a user-friendly interface for associates and managers, enabling efficient report generation and submission. Additionally, it addresses challenges such as data management and risk mitigation associated with AI implementation.
The article discusses how Slack developed a comprehensive metrics framework to enhance the performance and cost-efficiency of their Apache Spark jobs on Amazon EMR. By integrating generative AI and custom monitoring tools, they achieved significant improvements in job completion times and cost reductions. The framework captures over 40 metrics, providing granular insights into application behavior and resource usage. The article outlines the architecture of their monitoring solution and the benefits of AI-assisted tuning for Spark operations.
The article discusses the collaboration between AWS and Visa to introduce Visa Intelligent Commerce, which leverages Amazon Bedrock AgentCore to enable agentic commerce. This new approach allows for seamless, autonomous payment experiences that reduce manual intervention in transactions. The article explains how intelligent agents can handle multi-step tasks in various sectors, particularly in payments and shopping, transforming traditional workflows into more efficient, outcome-driven processes. It also highlights the technical architecture and tools involved in building these agentic workflows.
The article discusses the evolving landscape of data engineering as it adapts to the needs of AI agents in an increasingly automated environment. It emphasizes the importance of building reliable, code-first data platforms that can handle multimodal data and provide context for agents. The shift from traditional data engineering tasks to high-level system supervision is highlighted, along with the necessity for safety and correctness in data pipelines. Ultimately, the article envisions a future where humans and AI agents collaborate seamlessly, transforming data engineering practices.
The article reflects on the significant advancements in large language models (LLMs) throughout 2025, highlighting key developments such as reasoning models, reinforcement learning with verifiable rewards (RLVR), and the GRPO algorithm. It discusses the evolving landscape of LLM architectures, the importance of inference scaling, and the challenges of benchmarking in the field. The author shares predictions for future trends in LLM development, emphasizing the need for continual learning and domain specialization. Overall, it provides a comprehensive overview of the state of LLMs and their implications for various industries.
The article discusses the importance of separating tech stacks for personalization and experimentation at Spotify. It explains how personalized applications enhance user experiences by tailoring content to individual preferences using advanced machine learning models. The distinction between personalization and experimentation is highlighted, emphasizing the need for different infrastructures and methodologies for each. The article also outlines the benefits of this separation in terms of scalability and efficiency in evaluating recommendation systems.
This article discusses the integration of Atlan and Amazon SageMaker Unified Studio to unify governance and metadata management across data and AI environments. It highlights the importance of maintaining consistent metadata in hybrid environments where different teams use various tools. The article provides a detailed overview of the integration process, including setting up secure connections and automated synchronization of metadata. It emphasizes the benefits of having a single, trusted view of data assets for both business and technical users.
At QCon AI NYC 2025, Aaron Erickson presented agentic AI as an engineering challenge rather than a simple prompt crafting task. He emphasized the importance of combining probabilistic components with deterministic boundaries to enhance reliability. The article discusses the role of specialized agents and deterministic tools in operational systems, highlighting the need for structured outputs and effective tool selection. Erickson's insights provide a framework for understanding the complexities of deploying AI in real-world applications.
The article discusses Netflix's implementation of an Entertainment Knowledge Graph, which unifies disparate entertainment datasets into a cohesive ecosystem. This ontology-driven architecture enhances analytics, machine learning, and strategic decision-making by providing semantic connectivity and conceptual consistency. It addresses challenges in traditional data management by allowing rapid integration of new data types and relationships, ultimately improving insights into the entertainment landscape. The article outlines the architecture, use cases, and future outlook of the knowledge graph.
The article discusses how Care Access, a healthcare organization, utilized Amazon Bedrock's prompt caching feature to significantly reduce data processing costs by 86% and improve processing speed by 66%. By caching static medical record content while varying analysis questions, Care Access optimized their operations to handle large volumes of medical records efficiently while maintaining compliance with healthcare regulations. The implementation details, including the architecture and security measures, are also highlighted, showcasing the transformative impact of this technology on their health screening program.
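The caching pattern described, static record content first, per-request question last, can be sketched as a Converse-style request body. This is a sketch only: the `cachePoint` content block follows the Bedrock Converse API's prompt-caching convention, but the model ID and field layout here are illustrative assumptions, not Care Access's implementation.

```python
def build_converse_request(model_id: str, record_text: str, question: str) -> dict:
    """Build a Converse-style request that caches the static medical-record
    text and varies only the analysis question (structure is illustrative)."""
    return {
        "modelId": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Large static content first, then a cache checkpoint...
                    {"text": record_text},
                    {"cachePoint": {"type": "default"}},
                    # ...and only the per-request question after it.
                    {"text": question},
                ],
            }
        ],
    }


record = "<many tokens of static medical-record text>"
req_a = build_converse_request("example-model-id", record, "List allergies.")
req_b = build_converse_request("example-model-id", record, "List medications.")
# Everything before the cachePoint is byte-identical across both requests,
# which is what lets the provider reuse the cached prefix.
```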
The article discusses the challenges of processing large datasets using single-node frameworks like Polars, DuckDB, and Daft compared to traditional Spark clusters. It highlights the concept of 'cluster fatigue' and the emotional and financial costs associated with running distributed systems. The author conducts a performance comparison of these frameworks on a 650GB dataset stored in Delta Lake on S3, demonstrating that single-node frameworks can effectively handle large datasets without the need for extensive resources. The findings suggest that modern lakehouse architectures can benefit from these lightweight alternatives.
This article explores how Apache Spark interacts with YARN for resource management in a cluster environment. It details the roles of YARN's components: Resource Manager, Application Master, and Node Manager, and explains the communication process during Spark application execution. The author discusses common exceptions encountered when running Spark on YARN, emphasizing the importance of understanding these interactions for effective troubleshooting. The article serves as a guide for advanced users looking to optimize Spark applications on YARN.
This article explores the performance benefits of using Spark SQL's Catalyst optimizer, particularly focusing on DataFrame transformations. It discusses the four stages of Catalyst optimization, emphasizing the Physical Plan stage and how caching DataFrames can significantly improve query performance. The author provides insights into the execution plans generated by Spark and the implications of using UnsafeRow for memory management. Ultimately, the article concludes that while simple queries may not benefit from Catalyst optimization without caching, performance can be enhanced when DataFrames are cached.
This article explores the join and aggregation operations in Spark's Catalyst optimization engine. It discusses how Spark generates execution plans for these operations, including SortMergeJoin and HashAggregate, and the underlying mechanisms that ensure efficient data processing. The author highlights the complexities of data shuffling and the importance of distribution and ordering in Spark plans. Overall, the article provides insights into the optimization strategies employed by Spark Catalyst for handling join and aggregation queries.
The article introduces a new method called Contextual Retrieval, which enhances the traditional Retrieval-Augmented Generation (RAG) approach by improving the retrieval step through two sub-techniques: Contextual Embeddings and Contextual BM25. This method significantly reduces retrieval failures and improves the accuracy of AI models in specific contexts. The article also discusses the importance of context in information retrieval and provides insights into implementing this technique using Claude, along with considerations for performance optimization and cost reduction.
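The core move in Contextual Retrieval is prepending a short, chunk-specific context before indexing. The sketch below uses a naive term-overlap score as a stand-in for BM25, and reuses Anthropic's own illustrative example (a chunk from a fictional ACME Corp filing); the function names are assumptions.

```python
def contextualize_chunk(context: str, chunk: str) -> str:
    # Prepend model-generated, chunk-specific context so the chunk is
    # self-explanatory when embedded or BM25-indexed.
    return f"{context}\n\n{chunk}"


def keyword_score(query: str, text: str) -> int:
    # Crude stand-in for BM25: count distinct query terms present in the text.
    return len(set(query.lower().split()) & set(text.lower().split()))


chunk = "Revenue grew 3% over the previous quarter."
context = "This chunk is from ACME Corp's Q2 2023 SEC filing."
query = "ACME Q2 2023 revenue growth"

plain = keyword_score(query, chunk)
contextual = keyword_score(query, contextualize_chunk(context, chunk))
# The bare chunk matches almost nothing in the query; the contextualized
# chunk picks up the company name and period, so it ranks far higher.
```

In the real technique the context string is generated by the model from the full document at indexing time, and the contextualized chunks feed both the embedding index and the BM25 index.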
This article discusses the introduction of the 'think' tool for Claude, which enhances its ability to solve complex problems by allowing it to pause and reflect during multi-step tasks. Unlike the 'extended thinking' capability, the 'think' tool focuses on processing new information and ensuring compliance with policies. The article provides practical guidance for implementing the tool, backed by performance evaluations showing significant improvements in customer service scenarios. It emphasizes the importance of strategic prompting and offers best practices for effective use.
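The 'think' tool is defined like any other tool, except it performs no action; it just gives the model a sanctioned place to reason mid-task. The definition below follows the shape published in Anthropic's post, though the exact description wording here is paraphrased.

```python
# Tool definition for the 'think' tool: no side effects, no new information;
# it simply records a thought so the model can pause and reason.
think_tool = {
    "name": "think",
    "description": (
        "Use the tool to think about something. It will not obtain new "
        "information or change anything; it just appends the thought to the "
        "log. Use it when complex reasoning about prior tool results or "
        "policy compliance is needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {
                "type": "string",
                "description": "A thought to think about.",
            }
        },
        "required": ["thought"],
    },
}
```

When the model calls this tool, the handler simply returns an empty or acknowledgment result; the value comes from the reasoning the model writes into the `thought` argument, not from the tool's output.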
The article discusses the emerging field of context engineering for AI agents, emphasizing the importance of managing context as a finite resource. It contrasts context engineering with prompt engineering, highlighting the need for strategies that optimize the utility of tokens during LLM inference. The article explores techniques such as compaction, structured note-taking, and sub-agent architectures to maintain coherence over long-horizon tasks. It concludes by stressing the significance of thoughtful context curation to enhance agent performance.
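Of the techniques listed, compaction is the most mechanical: replace old turns with a summary and keep recent turns verbatim. A minimal sketch, assuming a generic message list and any text-to-text summarizer (here stubbed with truncation):

```python
def compact(messages: list[dict], keep_recent: int, summarize) -> list[dict]:
    """Compaction: collapse older turns into one summary message, keeping
    the most recent `keep_recent` turns verbatim. `summarize` is any
    text -> text callable (in practice, an LLM call)."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(" ".join(m["text"] for m in old))
    return [{"role": "system", "text": f"Summary of earlier turns: {summary}"}] + recent


history = [{"role": "user", "text": f"step {i}"} for i in range(10)]
# Stub summarizer: real systems would ask the model for a faithful summary.
compacted = compact(history, keep_recent=3, summarize=lambda t: t[:40] + "...")
```

The design choice is where to spend tokens: the summary trades fidelity of old context for headroom, which is why the article pairs compaction with structured note-taking for facts that must survive verbatim.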
The article discusses the redesign of Monzo's Fraud Prevention Platform, emphasizing the challenges of detecting and preventing fraud in a fast-paced environment. It outlines the complexities of fraud detection, including the sophistication and speed of fraudsters, and the need for a balance between user experience and security. The system design is explained in detail, highlighting the use of machine learning models and a microservices architecture to monitor and respond to fraudulent activities effectively.
Google Cloud has announced the general availability of Iceberg REST Catalog support in BigLake metastore, enhancing open data interoperability across various data engines. This fully managed, serverless metastore allows users to query data using their preferred engines, including Apache Spark and BigQuery, without the need for data duplication. The integration with Dataplex Universal Catalog provides comprehensive governance and lineage capabilities. Organizations like Spotify are already leveraging this technology to build modern lakehouse platforms.
The article discusses the evolving role of data engineers in the context of Agentic AI, highlighting how the interaction with data is shifting from a builder-centric model to a more user-driven approach. It emphasizes the importance of rethinking traditional ETL/ELT processes, prioritizing data curation over mere collection, and building infrastructure that supports AI agents. The author outlines five key principles for data engineers to adapt to these changes, focusing on context-aware data handling and the management of AI-generated artifacts.
The article discusses Lyft's transition from a fully Kubernetes-based machine learning platform to a hybrid architecture utilizing AWS SageMaker for offline workloads and Kubernetes for online model serving. It highlights the challenges faced with the original architecture, including operational complexity and resource management, and details the technical decisions made to simplify the infrastructure while maintaining performance. The migration aimed to reduce operational overhead and improve reliability, allowing teams to focus on developing new capabilities rather than managing infrastructure.
The article discusses how organizations can leverage platform engineering principles to accelerate the development and deployment of generative AI applications. It highlights the challenges faced by organizations in experimenting with generative AI and emphasizes the importance of building reusable components to manage costs and improve efficiency. The article outlines the architecture of generative AI applications, including the integration of various data layers and the role of large language models. It also covers best practices for observability, orchestration, and governance in AI workflows.
The article introduces AWS Professional Services' new approach to consulting, leveraging agentic AI to enhance cloud adoption and digital transformation for organizations. It highlights the role of specialized AI agents in streamlining consulting processes, improving solution quality, and reducing project timelines. The integration of AI with human expertise is emphasized as a means to deliver better customer outcomes. Real-world examples, including the NFL's use of AWS agents, illustrate the tangible benefits of this innovative consulting model.
The article discusses the challenges AI agents face when browsing the web, particularly with CAPTCHAs and other bot detection mechanisms. It introduces Amazon Bedrock AgentCore Browser's new feature, Web Bot Auth, which provides AI agents with verifiable cryptographic identities to reduce CAPTCHA friction. The article explains how this protocol works and its collaboration with WAF providers to ensure secure access for verified bots. It highlights the benefits for both AI agents and website owners in managing automated traffic.
This article discusses the challenges associated with using Amazon EC2 Spot Instances within Auto Scaling Groups due to their unpredictable interruptions. It presents a solution in the form of a custom event-driven monitoring and analytics dashboard, named 'Spot Interruption Insights', which provides near real-time visibility into Spot Instance interruptions. The article outlines a step-by-step guide to building this monitoring solution using various AWS services, including Amazon EventBridge, SQS, Lambda, and OpenSearch Service, to optimize capacity planning and improve workload resilience.
The article discusses the advancements in text-to-SQL capabilities using Google's Gemini models, which allow users to generate SQL queries from natural language prompts. It highlights the challenges faced in understanding user intent, providing business-specific context, and the limitations of large language models in generating precise SQL. Various techniques to improve text-to-SQL performance are explored, including intelligent retrieval of data, disambiguation methods, and validation processes. The article serves as an introduction to a series on enhancing text-to-SQL solutions within Google Cloud products.
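The "business-specific context" part of that recipe usually means assembling schema and retrieved examples into the prompt before the model writes SQL. A minimal sketch, with hypothetical table names and with the retrieval step stubbed out (the article's actual pipeline is Gemini-specific):

```python
def build_sql_prompt(question: str, tables: dict, examples: list) -> str:
    """Assemble a text-to-SQL prompt: schema context plus few-shot examples
    retrieved for this question (retrieval itself is stubbed out here)."""
    schema = "\n".join(f"TABLE {t}({', '.join(cols)})" for t, cols in tables.items())
    shots = "\n".join(examples)
    return (
        f"Given the schema:\n{schema}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Write one SQL query answering: {question}"
    )


prompt = build_sql_prompt(
    "How many orders shipped last week?",
    {"orders": ["order_id", "shipped_at", "status"]},
    ["-- Q: total orders\nSELECT COUNT(*) FROM orders;"],
)
```

Validation then closes the loop: the generated SQL is parsed (and often dry-run against the warehouse) before results are shown, and failures are fed back into a disambiguation or retry step.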
The article discusses Yelp's transformation of its data infrastructure through the adoption of a streaming lakehouse architecture on AWS. This modernization aimed to address challenges related to data processing latency, operational complexity, and compliance with regulations like GDPR. By migrating from self-managed Apache Kafka to Amazon MSK and implementing Apache Paimon for storage, Yelp achieved significant improvements, reducing analytics data latencies from 18 hours to minutes and cutting storage costs by over 80%. The article outlines the architectural shifts and technologies involved in this transformation.
The article discusses Uber's implementation of zone failure resilience (ZFR) in Apache Pinot, a real-time analytics platform. It details the strategies used to ensure that Pinot can withstand zone failures without impacting queries or data ingestion. By leveraging instance assignment capabilities and integrating with Uber's isolation groups, the article outlines how they achieved a robust deployment model that enhances operational efficiency and reliability. The migration process for existing clusters to this new setup is also highlighted, showcasing the challenges and solutions involved.
The article discusses Ericsson's transformative journey towards data governance using Google Cloud's Dataplex Universal Catalog. It highlights the importance of data integrity and governance in modern telecommunications, particularly for Ericsson's Managed Services. The piece outlines the steps taken by Ericsson to operationalize its data strategy, emphasizing the need for clean, reliable data and the balance between compliance and innovation. It also touches on future priorities in AI-powered data governance and the lessons learned from their experience.
The article discusses how Confluent is evolving its data infrastructure to accommodate the demands of AI agents, emphasizing the need for real-time data processing capabilities. Key features introduced include Confluent Intelligence, a controlled stack for AI agent development, and the Real-Time Context Engine, which aims to provide timely data delivery. The article highlights the importance of integrating streaming data with AI systems to enhance decision-making and operational efficiency. It concludes by noting that the future of AI will depend on the robustness of the underlying data systems.
The article discusses the enhancements made to Metaflow, a framework for managing machine learning and AI workflows at Netflix. It introduces a new feature called Spin, which allows for rapid, iterative development similar to using notebook cells, enabling developers to quickly test and debug individual steps in their workflows. The article emphasizes the importance of state management and the differences between traditional software engineering and ML/AI development. It also highlights how Metaflow integrates with other tools to streamline the deployment process.
The article discusses a significant transformation in enterprise software architecture where AI agents evolve from assistive tools to operational execution engines. This shift is characterized by traditional backends transitioning to governance roles, particularly in sectors like banking, healthcare, and retail. The adoption of protocols like the Model Context Protocol (MCP) allows AI agents to directly invoke services and orchestrate workflows, leading to increased efficiency and autonomy in enterprise applications. Predictions indicate that by 2026, 40% of enterprise applications will incorporate such autonomous agents.
The article discusses how Pinterest has developed a system to identify user journeys, which are sequences of user-item interactions that reveal user interests, intents, and contexts. By leveraging user data and machine learning techniques, Pinterest aims to enhance its recommendation system, moving beyond immediate interests to long-term user goals. The approach includes dynamic keyword extraction, clustering, and journey ranking, which collectively improve user engagement through personalized notifications. The article outlines the system architecture, key components, and the impact of journey-aware notifications on user interactions.
The article discusses Zillow's innovative approach to personalization in the real estate market through AI-driven user memory. It emphasizes the importance of understanding user preferences and adapting to their evolving needs over time. The article outlines how Zillow combines batch and real-time data processing to create a dynamic user memory that enhances the home shopping experience. Key components include recency and frequency of user interactions, flexibility in preferences, and predictive modeling to anticipate user needs.
The article announces the preview of the Data Engineering Agent in BigQuery, designed to automate complex data engineering tasks. It highlights how the agent can streamline pipeline development, maintenance, and troubleshooting, allowing data professionals to focus on higher-level tasks. Key features include natural language pipeline creation, intelligent modifications, and integration with Dataplex for enhanced data governance. The article also shares positive feedback from early users, emphasizing the agent's potential to transform data engineering workflows. I would like to see how this helps with data modeling tasks; usually the DE team doesn't own that part, or needs to share ownership with the DA team.
The article discusses Uber AI Solutions' innovative in-tool quality-checking framework, Requirement Adherence, which enhances data labeling quality using large language models (LLMs). It outlines the challenges of traditional labeling workflows and presents a two-step approach involving rule extraction and in-tool validation. By leveraging LLMs, the framework efficiently identifies labeling errors in real-time, significantly reducing rework and costs for enterprise clients. The article emphasizes the importance of maintaining data privacy and the continuous improvement of the system through feedback mechanisms.
This article discusses the challenges and advancements in post-training generative recommender systems, particularly focusing on a novel algorithm called Advantage-Weighted Supervised Fine-tuning (A-SFT). The authors highlight the limitations of traditional reinforcement learning methods in recommendation contexts, such as the lack of counterfactual observations and noisy reward models. A-SFT aims to improve recommendation quality by effectively combining supervised fine-tuning with reinforcement learning techniques. The results demonstrate that A-SFT outperforms existing methods in aligning generative models with user preferences.
This article provides a comprehensive guide on selecting the appropriate embedding model for Retrieval-Augmented Generation (RAG) systems. It discusses the importance of embedding models in converting human language into machine-readable vectors and evaluates various types of embedding models, including sparse, dense, and hybrid models. Key factors for evaluating these models are outlined, such as context window, tokenization unit, dimensionality, and training data. The article concludes by emphasizing the need for practical testing with real-world data to ensure effective implementation.
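As a toy sketch of the dense-versus-sparse distinction the guide draws: dense models score candidates by vector similarity, while sparse models score exact term overlap. The three-dimensional vectors below are hand-made stand-ins for real embeddings (which have hundreds or thousands of dimensions), purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy dense embeddings; a real embedding model would produce these vectors.
query = [0.9, 0.1, 0.3]
doc_relevant = [0.8, 0.2, 0.4]
doc_unrelated = [0.1, 0.9, 0.0]

# The semantically closer document scores higher even with no shared words.
assert cosine_similarity(query, doc_relevant) > cosine_similarity(query, doc_unrelated)

def term_overlap(query_terms, doc_terms):
    """Sparse-style scoring: count exact term matches (a crude BM25 stand-in)."""
    return len(set(query_terms) & set(doc_terms))
```

Hybrid models combine both signals, which is why the evaluation factors in the article (dimensionality, tokenization unit) matter for each side differently.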
This article discusses how Shopify has transformed its product taxonomy management from manual processes to an AI-driven multi-agent system. The new system addresses challenges such as scaling taxonomy, maintaining consistency, and ensuring quality through automated analysis and domain expertise. By integrating real product data and employing specialized AI agents, Shopify can proactively adapt its taxonomy to meet evolving merchant and customer needs. The article highlights the efficiency gains and quality improvements achieved through this innovative approach.
This article discusses how to visualize data lineage in Amazon SageMaker Catalog, integrating various AWS analytics services like AWS Glue, Amazon EMR, and Amazon Redshift. It provides a step-by-step guide on configuring resources and implementing data lineage tracking to enhance data governance and quality.
The article discusses how The New York Times has evolved its subscription strategy using real-time algorithms and causal machine learning to optimize its digital subscription funnel. It highlights the transition from static paywalls to dynamic decision-making processes that consider various business KPIs. The implementation of real-time algorithms allows for tailored user experiences based on engagement and conversion metrics, ultimately enhancing subscription and registration rates. The article emphasizes the importance of collaboration between data science and business leadership in defining objectives and constraints.
Is Lightning Engine open source?
Good features for an LLM API; in Azure you have the equivalent: https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities
This Data Processing MCP Server can fully manage the EMR, Athena, and Glue services. You really don't need to write code anymore...
I didn't know about the "OpenLineage standard" before; I'd guess DataHub should be able to support it as well.
Even though “more scaling” was the headline driver, the deeper motivation was to build a log-storage system that could operate reliably, efficiently, and autonomously at hyperscale, solving Kafka's metadata, coordination, and rebalancing challenges.
I like this statement: "We're not just building technology; we're building expertise." :)
This is exactly what I need: "It takes the candidate list from your existing search or retrieval system and re-orders it based on deep semantic understanding". But how is the performance for domain-specific queries?
I still don't fully see the vision of LH 2.0...
This is very interesting for my use case. It talks about some features I really need, like “Agent Card”, "long-running tasks", etc.
Introduces using an LLM as a “proxy defense”: an additional component that acts as a firewall to filter out unsafe user utterances.
Data enrichment is a very good design pattern in recommendation systems.
Interested to know, when using an Amazon Bedrock knowledge base over a Redshift database, how to open up access for apps with natural language.
Apache RocketMQ: enhancing RAG data timeliness, dynamic prompt data updates, and full-chain data quality tracking.
Instead of maintaining separate connectors for each data source, developers can now build against a standard protocol.
An EMR showcase, but capturing the EMR events with Lambda is interesting.
Agreed, DE will focus more on enabling data value rather than just building pipelines. But eventually we will need fewer people, I think.
For the title recommendation, I think the semantic-based approach is more efficient than the user-behaviour-based one.
The validation and remediation are interesting.
Interesting to see that all the input and output commands go through Kafka topic events.
Snowflake or Databricks? You need to make your choice ;)
Identifying PII fields and storing them separately is interesting; would we need to join them back later for any operations on them?
Treat time series data as a language to be modeled by off-the-shelf transformer architectures.
A modern data platform architecture based on the Databricks tech stack.
A very classic MDS.
Sounds very fast, but if we work with big datasets, how do we handle the data transformation in memory? If we work with small data, we can rewrite it into Parquet format and performance is not an issue.
I read the blog but wasn’t fully convinced by its main argument. In my view, Medallion Architecture is just one way to manage data, and it doesn’t necessarily require physically moving or copying data between different stages. Simply tagging tables should be sufficient. Different stages can enforce distinct archival, retention policies, and operational processes. Additionally, from a high-level perspective, the concept of data products doesn’t fundamentally contradict Medallion Architecture.
Interesting architecture to handle bursty and unpredictable traffic on AWS
My understanding of "Table Virtualization" is that it shares the tables between two data platforms.
A really good summary of the main tech products across the different categories of the data industry!
A very classic Glue job pipeline to feed Amazon Bedrock Knowledge Bases for a RAG use case.
Some good uses of GCP Gemini in your data engineering tasks, but I'm concerned about my GCP bill now ^^.
A summary of all the concepts and technologies needed to build a production-ready RAG solution.
Very nice! Uber runs Ray instances inside Spark executors. This setup allows each Spark task to spawn Ray workers for parallel computation, which boosts performance significantly.
"AI-centric" data processing focuses on preparing and managing large-scale, multimodal datasets efficiently for AI model training, fine-tuning, and deployment, rather than traditional database queries. It involves optimizing computation across heterogeneous resources (CPUs/GPUs), improving data flow efficiency, and enabling scalability—all crucial for building next-generation AI models.
The evaluator-optimizer workflow is interesting.
Running local-mode Spark in Kubernetes pods to process small files as they arrive; this is more efficient than running a big Spark cluster to process a huge number of files in batch.
Explains, across the three systems (API, data warehouse, AI inference), how to efficiently collect and validate the lineage metadata.
Monzo Bank optimized their data retention strategy in Amazon Keyspaces by replacing the traditional Time to Live (TTL) approach with a bulk deletion mechanism. By partitioning time-series data across multiple tables, each representing a specific time bucket, they can efficiently drop entire tables of expired data. This method significantly reduces operational costs associated with per-row TTL deletions.
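A minimal sketch of the table-per-time-bucket idea: route each event to a bucket table by timestamp, then drop whole tables once their entire bucket falls outside the retention window. The bucket width, epoch, and naming scheme below are illustrative assumptions, not Monzo's actual values.

```python
from datetime import datetime, timedelta, timezone

BUCKET_DAYS = 7  # hypothetical bucket width; the real system may differ
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)

def bucket_table_name(ts: datetime) -> str:
    """Map an event timestamp to its time-bucket table name.

    Zero-padding makes the names sort chronologically, so string
    comparison doubles as time comparison.
    """
    bucket = (ts - EPOCH).days // BUCKET_DAYS
    return f"events_bucket_{bucket:05d}"

def expired_tables(existing: list[str], now: datetime, retention_days: int) -> list[str]:
    """Tables whose whole bucket predates the retention cutoff can be dropped.

    Any bucket strictly before the cutoff's bucket ends at or before the
    cutoff instant, so every row in it is expired; no per-row TTL needed.
    """
    cutoff = bucket_table_name(now - timedelta(days=retention_days))
    return [t for t in existing if t < cutoff]
```

Dropping a table is one metadata operation, versus millions of per-row TTL tombstones, which is where the cost saving comes from.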
The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.
JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.
Interesting sharing about the project and system design stage.
A good platform built for FinOps.
A good summary of topics for MLOps.
Automates granting S3 data access to the data consumers.
A good summary of the current problems with using Iceberg; the new S3 Tables look like they address all of these pain points.
A very good blog with a lot of tips for building a modern LLM RAG app.
By leveraging Iceberg tables, data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark jobs.
S3 table buckets handle the Iceberg compaction and catalog maintenance tasks for you.
Twitch has leveraged Views in their Data Lake to enhance data agility, minimize downtime, and streamline development workflows. By utilizing Views as interfaces to underlying data tables, they've enabled seamless schema modifications, such as column renames and VARCHAR resizing, without necessitating data reprocessing. This approach has facilitated rapid responses to data quality issues and supported efficient ETL processes, contributing to a scalable and adaptable data infrastructure.
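A small sketch of the view-as-interface idea: the view pins stable column names over the physical table, so a physical rename or resize only requires re-creating the view, not reprocessing data. The table and column names here are hypothetical, not Twitch's.

```python
def build_view_sql(view_name: str, table_name: str, column_map: dict) -> str:
    """Generate a view exposing a physical table under stable column names.

    column_map: {exposed_name: physical_name}. After a physical column
    rename, only this mapping changes and the view is re-created;
    downstream consumers keep querying the same exposed names.
    """
    select_list = ",\n  ".join(
        f"{physical} AS {exposed}" for exposed, physical in column_map.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\nFROM {table_name};"
    )

# Hypothetical example: physical column 'uid' was renamed from 'user_id',
# but consumers of the view still see 'user_id'.
sql = build_view_sql(
    "analytics.users_v",
    "raw.users",
    {"user_id": "uid", "email": "email_addr"},
)
```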
Improving data processing efficiency by implementing Apache Iceberg's base-2 file layout for S3.
Having an agent break down question answering into multiple steps, using different tools for different types of questions and interacting with multiple data sources, is a good practice.
Without Iceberg, there is a lot of overhead work in implementing the WAP (write-audit-publish) pattern.
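For reference, the WAP pattern itself fits in a few lines. This is an in-memory toy to show the control flow; with Iceberg, the publish step becomes a cheap snapshot or branch swap instead of copying data, which is exactly the overhead it removes.

```python
class WriteAuditPublish:
    """Toy write-audit-publish: data lands in staging, audits run against
    it, and only a passing batch is atomically made visible to consumers."""

    def __init__(self):
        self.published = []   # what downstream consumers see
        self.staging = None   # candidate batch, invisible to consumers

    def write(self, rows):
        """Write: land the new batch in the staging area."""
        self.staging = rows

    def audit(self, checks):
        """Audit: run every quality check against the staged batch."""
        return all(check(self.staging) for check in checks)

    def publish(self, checks):
        """Publish: atomically swap staging in, only if audits pass."""
        if not self.audit(checks):
            raise ValueError("audit failed; staging data not published")
        self.published = self.staging
        self.staging = None
```

Usage: a failed audit leaves the previously published data untouched, so consumers never see the bad batch.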
Grab's Data Engineering and Data Governance teams collaborated to automate metadata generation and sensitive data identification using Large Language Models (LLMs). This initiative aimed to enhance data discovery and streamline access management across the organization.
Grab's Data Engineering and Data Governance teams enhanced their Large Language Model (LLM) integration to automate metadata generation and data classification. Post-rollout improvements focused on refining model accuracy, reducing manual verification, and increasing scalability across the data lake.
Builds a process to assemble the complete data lineage by automatically merging the partial lineage generated by dbt.
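A minimal sketch of what merging partial lineage fragments might look like: each fragment maps a node to its known upstreams, and the merge unions the edges into one graph. Plain dicts stand in for dbt's manifest output here; node names are illustrative.

```python
def merge_lineage(fragments):
    """Merge partial lineage fragments into one complete edge map.

    fragments: iterable of {node: [upstream, ...]} dicts, e.g. one per
    dbt project. Edges for the same node are unioned across fragments.
    """
    merged = {}
    for fragment in fragments:
        for node, upstreams in fragment.items():
            merged.setdefault(node, set()).update(upstreams)
    return {node: sorted(ups) for node, ups in merged.items()}

def upstream_closure(lineage, node):
    """All transitive upstream dependencies of a node (iterative DFS)."""
    seen, stack = set(), [node]
    while stack:
        for up in lineage.get(stack.pop(), []):
            if up not in seen:
                seen.add(up)
                stack.append(up)
    return seen
```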
With AWS Lake Formation, you can create filter packages and control access. A filter package provides a restricted view of a data asset by defining column and row filters on the tables.
To find the user profiles to remove, an AWS Lambda function queries Aurora, DynamoDB, and Athena and records those locations in a DynamoDB table dedicated to GDPR requests.
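A hedged sketch of the kind of item such a Lambda might write to the GDPR-requests table. The field names and status value are assumptions for illustration, not the article's actual schema.

```python
def record_pii_locations(user_id, scan_results):
    """Build the erasure-request item for one user.

    scan_results: [(store, location), ...] pairs from scanning each
    backend, e.g. ("aurora", "users.profile:42") or
    ("dynamodb", "sessions/user-123"). An erasure worker would later
    read this item and delete each listed location.
    """
    return {
        "user_id": user_id,
        "locations": [{"store": s, "location": loc} for s, loc in scan_results],
        "status": "PENDING_DELETION",  # hypothetical workflow state
    }
```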
Netflix's Distributed Counter Abstraction is a scalable service designed to handle high-throughput counting operations with low latency. It supports two primary counter types: Best-Effort, which offers near-immediate access with potential slight inaccuracies, and Eventually Consistent, which ensures accurate counts with minimal delays. This abstraction is built atop Netflix's TimeSeries Abstraction and is managed via the Data Gateway Control Plane, allowing for flexible configuration and global deployment.
Netflix's TimeSeries Data Abstraction Layer is designed to efficiently store and query vast amounts of temporal event data with low millisecond latency. It addresses challenges such as high throughput, efficient querying of large datasets, global read and write operations, tunable configurations, handling bursty traffic, and cost efficiency. The abstraction integrates with storage backends like Apache Cassandra and Elasticsearch, offering flexibility and scalability to support Netflix's diverse use cases.
I think the challenge is still adoption of this kind of system; the integration and workflow changes are a huge cost for the stakeholder teams.
58 Group optimized its data integration platform using Apache SeaTunnel to handle over 500 billion daily data messages efficiently. This effort addressed challenges such as high reliability, throughput, low latency, and simplified maintenance. By evolving from Kafka Connect to SeaTunnel, the architecture now supports diverse data sources, enhanced task management, and real-time monitoring, with future plans to leverage AI for diagnostics and transition to cloud environments.
A very flexible and scalable recommendation solution.
Calculates new features for user segmentation, with good sharing on the validation.
A classic RAG solution for this kind of application.
Our team had a similar problem: "Despite this integration’s technical success, we soon noticed that the new system was delivering lower-than-expected user engagement." Nice rethinking about the improvements built!
Why do we load the S3 data into Redshift again? Isn't it already queryable via Redshift Spectrum? I guess it's for performance? Transform the raw S3 data, build the data models, and write back into S3?
This is one of the big differences between DE and SE.
A good "dataset" feature, available since 2.4.0, released on September 19, 2022.
Fully adapting the SDLC practices to the analytics world...
This paper provides an overview of the Data Vault concept and the business benefits of leveraging it on the cloud-based enterprise database BigQuery.
Comprehensive explanation of how Alluxio accelerates data access in the cloud.
This is mostly a Netflix-scale problem; it takes huge engineering work to build this KV data abstraction layer.
I agree with the 'dark data' problem in large organizations, and tools like Dataplex can help by automating data discovery. However, with thousands of tables generated, it raises the question: who will sift through these massive results to identify truly valuable datasets? This process could be very time-consuming.
Using LLM RAG to fetch the right dataset, combined with automatically enhanced explanations and analysis for users, is really a good idea.
This article discusses an approach similar to the raw, curated, and delivery zones we've talked about before. The key concept is to process and manage data in distinct zones or stages to support data governance and optimize data usage. Most data teams will likely need to adopt some version of this architecture to efficiently handle and control large volumes of data assets.
QuintoAndar's DAG Builder allows scalable management of 10,000+ Apache Airflow DAGs by using YAML configurations to generate DAGs, minimizing code duplication and standardizing data pipeline creation. By separating DAG structures from workflow-specific parameters, QuintoAndar enables data engineers to create new pipelines through declarative YAML files, streamlining the process and ensuring quality across pipelines. This system improves team productivity, simplifies code maintenance, and reduces the learning curve for new team members.
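A minimal sketch of the config-driven generation idea: declarative configs (as loaded from YAML files) become DAG specs through one shared builder. The field names ('team', 'pipeline', 'schedule', 'tasks') are assumptions, since QuintoAndar's actual schema isn't public; in Airflow, each returned spec would be materialized into a real DAG object registered in the module's globals.

```python
def build_dag_specs(configs):
    """Turn declarative pipeline configs into DAG specs.

    Each cfg dict stands in for one parsed YAML file. Separating the
    generic DAG structure (here) from workflow-specific parameters
    (in the configs) is what keeps thousands of pipelines uniform.
    """
    specs = []
    for cfg in configs:
        specs.append({
            "dag_id": f"{cfg['team']}__{cfg['pipeline']}",
            "schedule": cfg.get("schedule", "@daily"),  # assumed default
            "tasks": [t["name"] for t in cfg.get("tasks", [])],
        })
    return specs

# Hypothetical config, as it might look after yaml.safe_load():
configs = [{
    "team": "growth",
    "pipeline": "daily_signups",
    "tasks": [{"name": "extract"}, {"name": "load"}],
}]
specs = build_dag_specs(configs)
```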
This guide demonstrates integrating Gretel with BigQuery DataFrames for synthetic data generation. By leveraging BigQuery's pandas-compatible APIs and Gretel's machine learning tools, users can generate and de-identify high-quality synthetic data that maintains data privacy and regulatory compliance. The process includes data de-identification with Gretel's Transform v2 and synthetic data generation with Gretel Navigator Fine Tuning, optimized for handling patient records with complex data relationships.
AWS introduces a visual designer in SageMaker Pipelines to simplify fine-tuning and deploying Llama 3.x models. This new UI allows users to create, manage, and automate workflows for continuous model updates using a no-code interface. The article details a sample pipeline for customizing LLMs with SEC financial data, enabling tasks like model evaluation, deployment, and conditional registration based on performance.
Netflix's Data Gateway platform abstracts the complexities of distributed databases, providing scalable, secure, and reliable data access layers (DAL) through standardized gRPC and HTTP APIs.
Pinterest's journey of adopting Ray for infrastructure enhancement started in 2023. It involved overcoming challenges like Kubernetes integration, optimizing resource utilization, and ensuring security. The Ray infrastructure enables scalable, efficient machine learning workloads, significantly improving last-mile data processing, batch inference, and recommender systems model training. By focusing on distributed processing, cost management, and developer velocity, Pinterest achieved improved scalability and operational efficiency for its machine learning applications.
Pinterest enhances machine learning dataset iteration speed by adopting Ray for distributed processing, addressing bottlenecks in dataset handling for recommender systems. Previously slow processes involving Apache Spark and Airflow workflows now leverage Ray's parallelization, resulting in a significant reduction in training time. Ray’s support for CPU/GPU resource management and streaming execution has led to increased throughput and cost savings, improving ML engineer velocity and overall efficiency in managing large-scale data.
Adevinta's Central Product and Tech department has implemented a data mesh architecture to manage and deliver data products across its marketplaces. The initiative emphasizes domain-specific datasets, SQL accessibility, and datasets as products, with a focus on improving decision-making for data analysts, scientists, and product managers. Key strategies include centralized governance, domain-oriented data, and the establishment of working agreements to ensure data quality and alignment across decentralized teams.
This article discusses how Netflix enhances long-term member satisfaction through personalized recommendations. By moving beyond traditional metrics like clicks or CTR, Netflix uses reward engineering to optimize for long-term satisfaction. The process involves defining proxy rewards based on user interactions, predicting delayed feedback, and aligning recommendations with long-term engagement. Challenges include dealing with delayed feedback, the disparity between online and offline metrics, and refining proxy rewards to better align with long-term satisfaction.
The article discusses Spotify's approach to creating and managing high-quality dashboards at scale. Spotify utilizes Tableau and Looker Studio as primary tools, supported by a Dashboard Quality Framework that ensures consistency and trust in the dashboards. The framework includes automatic checks ('Vital Signs') and a manual design checklist ('Spicy Dashboard Design'). The Dashboard Portal centralizes dashboard access, offering search, curation, and quality labeling features, enhancing the overall accessibility and reliability of dashboards across the company.
This article discusses Airbnb's development of a comprehensive Data Protection Platform (DPP) to address challenges in data security and privacy compliance. The platform integrates various services like Madoka for metadata management, Inspekt for data classification, and Cipher for encryption. It highlights the need for automated data protection due to the complexity of handling sensitive data across different environments and the importance of complying with global regulations like GDPR and CCPA.
It's a good high-level summary, but I think each team still needs to run a spike to find the tools that suit their use case and project.
This article talks about the idea behind fal dbt, extending dbt's capabilities on the Airflow platform. It also covers a lot of other popular tools for Airflow.
I think Dagster has zoomed in from a job-level view to an asset/table-level view of the pipelines. There are always pros and cons.
It's a good showcase blog for GCP, but it would be very interesting to see more detail about how Fluidly's data team leveraged GCP to launch their new data-driven business products.
The highlight is configuring data profiling jobs without a single line of code or SQL, with a good UI to check the job output.
The article discusses the importance of treating data as a product at Netflix, emphasizing a product mindset that focuses on intentional design, clear ownership, and continuous evaluation to enhance data utility and trust. It outlines key principles for data products, including clear purpose, defined users, and lifecycle management.
The article discusses Agoda's decision to migrate their Feature Store Serving system from a JVM-based stack to Rust, driven by performance and reliability challenges. It details the migration process, including the proof of concept, performance benchmarks, and the importance of shadow testing to ensure correctness. The transition resulted in significant efficiency gains, handling five times more traffic while drastically reducing CPU and memory usage, leading to substantial cost savings. The article emphasizes the role of AI tools and the Rust compiler in facilitating the team's adoption of Rust despite their initial lack of experience.
This article provides an overview of four primary methods for evaluating large language models (LLMs): multiple-choice benchmarks, verifiers, leaderboards, and LLM judges. It discusses the advantages and limitations of each method, emphasizing the importance of understanding these evaluation techniques for better interpreting model performance. The article also includes code examples for implementing these evaluation methods from scratch, making it a valuable resource for practitioners in the field of LLM development and evaluation.
This article discusses the challenges and solutions for building end-to-end data lineage in enterprise data analytics, particularly for one-time and complex queries. It highlights the use of Amazon Athena, Amazon Redshift, Amazon Neptune, and dbt to create a unified data modeling language across different platforms. The authors explain how to automate the data lineage generation process using AWS services like Lambda and Step Functions, ensuring accuracy and scalability. The article provides insights into the architecture and implementation details necessary for effective data lineage tracking.
The article discusses the evolving landscape of databases as they adapt to meet the needs of emerging AI agents. It highlights four innovative initiatives: AgentDB, which treats databases as disposable files; 'Postgres for Agents,' enhancing PostgreSQL for agent use; Databricks Lakebase, merging transactional and analytical capabilities; and Bauplan Labs, focusing on safety and reliability in data operations. These initiatives reflect a broader trend of reimagining database functionality to better serve machine users in an agent-native world.
The article discusses the Chain-of-Draft (CoD) prompting technique, which offers a more efficient alternative to the traditional Chain-of-Thought (CoT) method for large language models. CoD reduces verbosity and improves cost efficiency and response times by limiting reasoning steps to five words or less. The authors demonstrate the implementation of CoD using Amazon Bedrock and AWS Lambda, showcasing significant reductions in token usage and latency while maintaining accuracy. The article emphasizes the practical benefits of CoD for organizations scaling their generative AI implementations.
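For illustration, the core of CoD is just a different instruction wrapped around the question; here is a sketch (the instruction wording paraphrases the Chain-of-Draft idea and is not necessarily the article's exact prompt).

```python
# Chain-of-Draft style instruction: keep each reasoning step to a tiny
# draft instead of verbose Chain-of-Thought prose, cutting output tokens.
COD_INSTRUCTION = (
    "Think step by step, but keep only a minimum draft for each thinking "
    "step, with five words at most. Return the final answer after ####."
)

def build_cod_prompt(question: str) -> str:
    """Wrap a user question with the CoD instruction, ready to send to a model."""
    return f"{COD_INSTRUCTION}\n\nQ: {question}\nA:"

prompt = build_cod_prompt("What is 17 + 25?")
```

In the article's setup this prompt would be sent via Amazon Bedrock from a Lambda handler; the savings come entirely from the shorter drafts the model emits.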
The article outlines Halodoc's comprehensive approach to data validation within a Lakehouse architecture, emphasizing the importance of data accuracy and reliability. It describes a multi-layered validation strategy that employs AI to enhance data quality checks at various stages of the data pipeline. The validation layers include checks for data consistency, structural correctness, business correctness, and reconciliation, ensuring that data remains trustworthy throughout its journey. The implementation of this strategy has led to reduced data incidents and increased trust among analytics and product teams.